Processing GC-MS made easy

Are you interested in learning a simple and free tool for turning untargeted GC-MS data into peak tables, and doing so in less time, with less user dependence and with more analytes recovered? Then you may want to learn how to use PARADISe, a stand-alone Windows program for just that. We are running a two-day course on how to use the software:

PARADISe: user-friendly software for untargeted analysis of Gas Chromatography-Mass Spectrometry (GC-MS) data

Traditionally, GC-MS data analysis follows a targeted approach that involves several time-consuming steps (integration and quantification), usually carried out sample by sample and subject to inter-user variability. Furthermore, interesting compounds are often left undiscovered, either for practical reasons or because of analytical limitations (limits of detection and quantification). PARADISe is a user-friendly tool for GC-MS deconvolution and identification. It enables untargeted analysis, meaning that all compounds present in the samples are considered, while overcoming the problems mentioned above.

Audience
The course is intended for GC-MS users at any level of expertise in any scientific field, working in either academic or industrial environments. Basic knowledge of chemometrics or statistics is advisable, but not mandatory.

This course will give participants a complete overview of the software, from theory to practice. Participants are encouraged to bring and work with their own data; otherwise we will provide a dataset.

Teachers: Professor Rasmus Bro and Postdoc Beatriz Quintanilla Casas
Place: Online (Microsoft Teams)
Participation cost is 100 Euro
Registration: here!

Monday, November 13th 9-12 Theoretical background. The data science behind the tool
Monday, November 13th 12.30-15 Getting started with PARADISe
Tuesday, November 21st 9-15 Discussion of your experience so far, troubleshooting, good practices and challenges

Honey fluorescence

Have a look at https://ucphchemometrics.com/honey/ for a dataset of approximately 100 EEMs of honey from different varieties.

Function for scatter interpolation in fluorescence data

We keep finding that we left out some data sets and functions when we moved to our new home page here. So please keep letting us know if you find something missing.

We have now put the function for scatter interpolation for EEM data online again at https://ucphchemometrics.com/eemscat.

An interface for control charts

We have made a small interactive program for learning about control charts (Shewhart, CUSUM). The program is an educational tool that has been made freely available.
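For readers who want to see the two chart types in code form, here is a minimal Python sketch. It is independent of the MATLAB program and does not reproduce its implementation. The reference data, the new observations, and the CUSUM settings (allowance k of 0.5 standard deviations, decision limit h of 4 standard deviations) are illustrative assumptions.

```python
from statistics import mean, stdev


def shewhart_limits(reference, n_sigma=3.0):
    """Return (lower limit, centre line, upper limit) from in-control reference data."""
    centre = mean(reference)
    spread = stdev(reference)
    return centre - n_sigma * spread, centre, centre + n_sigma * spread


def tabular_cusum(values, target, k, h):
    """Two-sided tabular CUSUM; returns the two paths and the alarm indices."""
    high, low = 0.0, 0.0
    upper_path, lower_path, alarms = [], [], []
    for index, x in enumerate(values):
        # Accumulate deviations beyond the allowance k, never going below zero.
        high = max(0.0, high + (x - target - k))
        low = max(0.0, low + (target - x - k))
        upper_path.append(high)
        lower_path.append(low)
        if high > h or low > h:
            alarms.append(index)
    return upper_path, lower_path, alarms


# Hypothetical in-control reference measurements.
reference = [10.0, 10.2, 9.8, 10.1, 9.9]
lcl, centre, ucl = shewhart_limits(reference)

# Hypothetical new measurements with a small, persistent upward shift.
new_values = [10.1, 10.3, 10.3, 10.4, 10.3]
outside_shewhart = [x for x in new_values if x < lcl or x > ucl]

sigma = stdev(reference)
_, _, alarms = tabular_cusum(new_values, target=centre, k=0.5 * sigma, h=4.0 * sigma)

print(outside_shewhart)  # [] : no single point exceeds the 3-sigma limits
print(alarms)            # [3, 4] : the accumulated shift triggers the CUSUM
```

In this toy example no single observation leaves the Shewhart limits, while the CUSUM accumulates the small shift and signals from the fourth observation onwards.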

The app runs in MATLAB, either locally or through MATLAB Online. A small user guide is available. Note that the program can also be made available as a standalone application by using the MATLAB Compiler to produce an *.exe file.

See more here.

Wine samples analyzed by GC-MS and FT-IR instruments

Wine Samples

Forty-four red wines, all produced from the same grape (100% Cabernet Sauvignon) but harvested in different geographical areas, were collected from local supermarkets in the Copenhagen area, Denmark. Details on the geographical origins and number of wine samples analysed are given in Table 1.

Table 1. Geographical origin of the analysed red wines

Origin	Wine samples
Argentina	6
Chile	15
Australia	12
South Africa	11
Total	44

The wine samples were analyzed using headspace GC-MS and FT-IR instruments. The FT-IR was a commercial WineScan instrument provided by FOSS Analytical A/S.

GC-MS data

For each sample, a mass spectrum (m/z 5-204) was recorded at 2700 elution time points, giving a data cube of size 44×2700×200. Figure 1 shows an example of a chromatogram for one red wine sample.

In the figure, the abundance at each scan is found by summing the intensities of all mass channels investigated (m/z 5-204).
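The summation over the mass dimension can be sketched in a few lines of Python. This is only an illustration on a toy sample, with hypothetical intensities and only three mass channels. The real data have 2700 scans and 200 mass channels per sample.

```python
def summed_mass_profile(sample):
    """Sum the intensities over all mass channels for each scan.

    `sample` is a list of scans; each scan is a list of intensities,
    one per mass channel.
    """
    return [sum(scan) for scan in sample]


# m/z 5 to 204 inclusive gives the 200 mass channels of the data cube.
mass_channels = list(range(5, 205))
print(len(mass_channels))  # 200

# Toy sample: 4 scans x 3 mass channels (hypothetical values).
toy_sample = [
    [0.0, 1.0, 2.0],
    [5.0, 0.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 0.0, 0.5],
]
chromatogram = summed_mass_profile(toy_sample)
print(chromatogram)  # [3.0, 6.0, 6.0, 0.5]
```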

FT-IR data

For all wine samples, 14 quality parameters were predicted from the IR spectra (Figure 2) using the FOSS WineScan built-in calibration models (Table 2).

Table 2. Quality parameters measured on the WineScan instrument and used in Multiblock Variance Partitioning (MVP); units shown in brackets

#	Quality parameter
1	Ethanol (vol. %)
2	Total acid (g/L)
3	Volatile acid (g/L)
4	Malic acid (g/L)
5	pH
6	Lactic acid (g/L)
7	Rest Sugar (Glucose + Fructose) (g/L)
8	Citric acid (mg/L)
9	CO₂ (g/L)
10	Density (g/mL)
11	Total polyphenol index
12	Glycerol (g/L)
13	Methanol (vol. %)
14	Tartaric acid (g/L)

Get the data

The data are available in zipped MATLAB 6.x format. Download the data and type load Wine_v6 in MATLAB.

DOWNLOAD DATA

If you use the data, we would appreciate it if you reported the results to us, as a courtesy in view of the work involved in producing and preparing the data. You may also want to refer to the data by citing:

T. Skov, D. Ballabio, R. Bro (2008). Multiblock Variance Partitioning. A new approach for comparing variation in multiple data blocks. Analytica Chimica Acta, 615 (1): 18-29

Zip-file information

Variable	Description	Dimensions
Aroma_compounds	Peak areas of aroma compounds	44×57
Class	Classes of wines (see Table 1)	44×1
Data_GC	Three-way data	44×2700×200
Elution_profiles	Summed mass dimension – see Figure 1	44×2700
IR_spectra	IR spectra without waterband	44×842
IR_spectra_with_waterband	IR spectra with waterband – see Figure 2	44×1056
Label_Aroma_comp	Label aroma compounds	1×57
Label_Elution_time	Label elution time in minutes	1×2700
Label_Mass_channels	Label m/z	1×200
Label_Pred_values_IR	Label quality parameters	1×14
Label_Wine_samples	Label wine samples ARG: Argentina AUS: Australia CHI: Chile SOU: South Africa	44×1
Mass_profiles	Summed elution time dimension	44×200
Pred_values_IR	Quality parameters (see Table 2)	44×14
axis_spectra_wavenumber	Axis for spectra in cm^-1	1×842
axis_spectra_with_waterband_wavenumber	Axis for spectra with waterband in cm^-1	1×1056

Real time monitoring of a fermentation process

Introduction

A method for at-line quality assessment of a cultivation process is sought to

1) enable improved process control,
2) enable faster detection of batch end point, and
3) enable immediate quality assessment of final product.

Fluorescence excitation-emission measurements are used because they are known to reflect important properties of the fermentation process. Several samples from many batches are obtained and are measured on an at-line multi channel fluorescence detection system (BioView®).

Example of an excitation/emission fluorescence landscape.

The data are available in zipped MATLAB 7 format and stored as dataset objects (see freeware dataset object at www.eigenvector.com). If you use the data, we would appreciate it if you reported the results to us, as a courtesy in view of the work involved in producing and preparing the data. You may also want to refer to the data by citing

P. P. Mortensen and R. Bro. Real-time monitoring and chemical profiling of a cultivation process. Chemom.Intell.Lab.Syst. 84:106-113, 2006.

Get the Data (Matlab)

Each fluorescence landscape from the sensor is obtained using 15 excitation filters in the range from 270 to 550 nm with a spectral resolution of 20 nm, and 15 emission filters ranging from 310 to 590 nm, also with a spectral resolution of 20 nm. All filters have a maximum half-width of 20 nm.

The reference analysis for the produced protease is based on the protease degrading N,N-dimethyl-casein. Amino acids liberated by the action of the enzyme react with 2,4,6-trinitrobenzene sulfonic acid to form a colored complex. The amount of complex is correlated to the enzyme activity and is determined spectroscopically. The reference values given here are not the actual activity measurements, but a related quality parameter that is highly correlated to activity.

The data are split into a calibration set and a test set consisting of data from three normal batches. The test set is less variable than the calibration data.

Amino acid fluorescence data

Problem

There really isn’t any problem here! But these simple fluorescence data are nice for illustrating different aspects of the trilinear PARAFAC model. They can be used for second order calibration, for working with systematically missing data, for imposing constraints, etc. The samples were generated and measured by Claus A. Andersson (KVL, DK).

Get the data

The data are available in zipped MATLAB 4.2 format. Download the data and type load data in MATLAB. If you use the data, we would appreciate it if you reported the results to us, as a courtesy in view of the work involved in producing and preparing the data. You may also want to refer to the data by citing

Bro, R, PARAFAC: Tutorial and applications, Chemometrics and Intelligent Laboratory Systems, 1997, 38, 149-171

Data (Matlab format)

The data have also been described in

Bro, R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. 1998. Ph.D. Thesis, University of Amsterdam (NL) & Royal Veterinary and Agricultural University (DK).
Kiers, H.A.L. (1998) A three-step algorithm for Candecomp/Parafac analysis of large data sets with multicollinearity, Journal of Chemometrics, 12, 155-171.

Data

This data set consists of five simple laboratory-made samples. Each sample contains different amounts of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water. The samples were measured by fluorescence (excitation 250-300 nm, emission 250-450 nm, 1 nm intervals) on a PE LS50B spectrofluorometer with excitation slit-width of 2.5 nm, an emission slit-width of 10 nm and a scan-speed of 1500 nm/s. The array to be decomposed is hence 5 × 51 × 201. In Figure 1 measurements of one of the samples are shown. Ideally these data should be describable with three PARAFAC components. This is so because each individual amino acid gives a rank-one contribution to the data.

Figure 1. Fluorescence landscape of a sample containing only phenylalanine.
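The statement that each amino acid gives a rank-one contribution can be made concrete with a small Python sketch of the trilinear PARAFAC structure, in which every element is a sum over components of a concentration, an excitation loading and an emission loading. The loading values below are hypothetical small integers chosen for readability, and the array is much smaller than the real 5 × 51 × 201 array.

```python
def parafac_array(A, B, C):
    """Build x[i][j][k] = sum over f of A[i][f] * B[j][f] * C[k][f].

    A: samples x components (e.g. concentrations)
    B: excitation wavelengths x components
    C: emission wavelengths x components
    """
    n_comp = len(A[0])
    return [
        [
            [sum(A[i][f] * B[j][f] * C[k][f] for f in range(n_comp))
             for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


# Dimensions implied by the measurement ranges at 1 nm intervals.
n_excitation = len(range(250, 301))  # 51
n_emission = len(range(250, 451))    # 201

# Hypothetical toy loadings: 2 samples, 2 excitation and 3 emission
# wavelengths, 3 components (one per amino acid).
A = [[1, 0, 2], [0, 3, 1]]
B = [[1, 2, 0], [0, 1, 1]]
C = [[1, 0, 1], [2, 1, 0], [0, 1, 1]]

X = parafac_array(A, B, C)
print(X[0][0][0])  # 1
print(X[1][1][2])  # 4
```

Each component f contributes the outer product of column f of A, B and C, which is what makes its contribution rank one.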

In Figure 2 and Figure 3 the normalized loadings of a three- and a four-component model are shown. It is readily seen that the three loadings of the three-component model are also found in the four-component model. These three loadings resemble the pure spectra of tryptophan, tyrosine and phenylalanine. The fourth component does not resemble any of the analytes and in fact does not seem to reflect chemical information. The reason for the presence of this fourth and quite distinct component must be that non-linearities or scatter effects cause some additional systematic variation.

Figure 2. Loading vectors resulting from fitting a three-component PARAFAC model to amino acid data.

Figure 3. Loading vectors resulting from fitting a four-component model to amino acid data. The fourth suspicious component shown with a thicker line.

In fact, these data have been investigated on several occasions and always using three components. Even when used for second-order calibration, the use of three components has given satisfactory results. This is so because the fourth component has a very low variance. The variance of this fourth component is only 0.03%, as compared to 50.7, 25.5, and 16.2% for the three ‘chemical’ components. Therefore the bulk variation is not affected significantly by the fourth component, and this is also the reason why traditional tools based on residuals have difficulties in detecting this fourth component.

As an explanation for this finding, it is important to notice in Figure 1 the Rayleigh scatter in the left part, which is not multilinear in nature. It is situated around a diagonal of corresponding emission and excitation wavelengths. In addition to the Rayleigh scatter, the emission below the excitation wavelength does not vary according to the multilinear model, since the emission intensity is zero (up to the noise) regardless of excitation. In fact, the emission mode loading of the fourth ‘spurious’ component resembles the Rayleigh scatter. To avoid such spurious results, the lower part of the data (emission below excitation wavelength) as well as the part corresponding to Rayleigh scatter should not be fitted by the model. Rather, these elements must be set to missing values in the three-way array in order not to bias the model.
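Setting these elements to missing can be sketched as follows in Python, for a single landscape stored as excitation × emission. The width of the band treated as Rayleigh scatter (here ±10 nm around the diagonal) and the toy wavelengths are assumptions made for illustration. They are not the settings used for the results reported here.

```python
import math


def mask_scatter(eem, ex_wavelengths, em_wavelengths, rayleigh_halfwidth=10.0):
    """Return a copy of `eem` with non-trilinear regions set to NaN.

    Elements are set to missing when the emission wavelength is below the
    excitation wavelength, or within `rayleigh_halfwidth` nm of it.
    """
    masked = [row[:] for row in eem]
    for j, ex in enumerate(ex_wavelengths):
        for k, em in enumerate(em_wavelengths):
            if em < ex or abs(em - ex) <= rayleigh_halfwidth:
                masked[j][k] = math.nan
    return masked


# Toy landscape: 2 excitation x 5 emission wavelengths, all intensities 1.0.
ex_wavelengths = [250, 260]
em_wavelengths = [240, 250, 260, 270, 280]
eem = [[1.0] * len(em_wavelengths) for _ in ex_wavelengths]

masked = mask_scatter(eem, ex_wavelengths, em_wavelengths)
n_missing = sum(math.isnan(value) for row in masked for value in row)
print(n_missing)  # 7
```

A PARAFAC algorithm that handles missing values then fits the model to the remaining elements only.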

Figure 4. Core consistency plot of a three (left) and a four (right) component PARAFAC model of the amino acid data with missing entries.

When all appropriate elements of the array have been set to missing, the values in Table 1 are obtained. Clearly, CORCONDIA now correctly identifies that there are three trilinear components in the data. In Figure 4 the core consistency plots are shown. It is easily seen that for the three-component model the Tucker3 core elements do have values close to the target whereas for the four-component model the values of the Tucker3 core vary very much. One element that ideally should be one is close to zero and some elements that ideally should be zero are actually close to one.
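The core consistency compares the Tucker3 core, estimated from the PARAFAC loadings, with the ideal superdiagonal target of ones, expressed as a percentage. The Python sketch below only shows this final comparison on hypothetical core arrays. Estimating the core from the data and loadings is a separate least squares step that is not shown.

```python
def superdiagonal(n_comp):
    """Ideal PARAFAC core: ones on the superdiagonal, zeros elsewhere."""
    return [
        [[1.0 if d == e == f else 0.0 for f in range(n_comp)]
         for e in range(n_comp)]
        for d in range(n_comp)
    ]


def core_consistency(core):
    """100 * (1 - sum((g - t)^2) / sum(t^2)), with t the superdiagonal target."""
    n_comp = len(core)
    squared_deviation = 0.0
    for d in range(n_comp):
        for e in range(n_comp):
            for f in range(n_comp):
                target = 1.0 if d == e == f else 0.0
                squared_deviation += (core[d][e][f] - target) ** 2
    # The target holds n_comp ones, so its sum of squares equals n_comp.
    return 100.0 * (1.0 - squared_deviation / n_comp)


ideal_core = superdiagonal(3)
print(core_consistency(ideal_core))  # 100.0

# Hypothetical poor core: one element that should be one is zero, and one
# element that should be zero is one.
poor_core = superdiagonal(2)
poor_core[1][1][1] = 0.0
poor_core[0][1][1] = 1.0
print(core_consistency(poor_core))  # 0.0
```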

Table 1. Amino acid fluorescence data. Results from fitting one- to six-component PARAFAC models to amino acid data with missing values.

# Components	LOSS	RELFIT	CORCO
1	1.000	68.57	100.0
2	0.293	90.81	99.9
3	0.001	99.96	99.4
4	0.001	99.98	28.2
5	0.001	99.99	13.7
6	0.000	99.99	62.8

In Table 1 it is seen that a six-component solution has a quite high core consistency, but this is preceded by two very low values. Hence, the choice here clearly is to take three components.
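The reasoning above can be written as a simple rule: increase the number of components until the core consistency drops for the first time, and keep the last model before the drop. The Python sketch below applies this to the values of Table 1. The threshold of 90 is an assumption chosen for illustration, not a value prescribed here.

```python
# (components, LOSS, RELFIT, CORCO) as listed in Table 1.
table_1 = [
    (1, 1.000, 68.57, 100.0),
    (2, 0.293, 90.81, 99.9),
    (3, 0.001, 99.96, 99.4),
    (4, 0.001, 99.98, 28.2),
    (5, 0.001, 99.99, 13.7),
    (6, 0.000, 99.99, 62.8),
]


def choose_components(rows, threshold=90.0):
    """Largest model before the core consistency first falls below `threshold`."""
    chosen = None
    for n_comp, _loss, _relfit, corco in rows:
        if corco < threshold:
            break  # later models are ignored, even if the value rises again
        chosen = n_comp
    return chosen


n_components = choose_components(table_1)
print(n_components)  # 3
```

Stopping at the first drop is what rules out the six-component model, whose value of 62.8 is preceded by two very low values.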

Fluorescence on blood plasma for cancer diagnosis

The datasets are from fluorescence excitation-emission matrix (EEM) measurements on human blood plasma samples (citrate plasma). The samples are part of a larger sample set from a multi-centre cross-sectional study conducted at six Danish hospitals of patients undergoing large bowel endoscopy due to symptoms associated with colorectal cancer (CRC)¹. The present sample set is designed as a case-control study with one case group (verified CRC) and three different control groups. The three control groups are (1) healthy subjects with no findings at endoscopy, (2) subjects with other, non-malignant findings and (3) subjects with pathologically verified adenomas². Each of the groups, case and controls, consisted of samples from 77 individuals. The samples are matched in case-control groups based on age, gender and location of cancer and adenoma. Additional information is available on age, gender, smoking habits and ICD-10 cancer codes.

The three datasets are the same samples measured in different dilutions or different spectral areas:

X_UD (299 variables): The undiluted samples measured in the spectral area with excitation wavelengths from 250 to 450 nm in 5 nm increments, and emission wavelengths from 300 to 600 nm in 1 nm increments.

X_D (289 variables): The diluted samples (100 times in PBS) measured in the same spectral area as above.

X_HW (299 variables): The undiluted samples measured in the spectral area with excitation wavelengths from 385 to 425 nm in 5 nm increments, and emission wavelengths from 585 to 680 nm in 1 nm increments.

All class data (cancer, case control, age, gender, smoking and cancer codes) are found in the data (for example, cancer status is found in X_UD.class{1,1}, and class labels in X_UD.classlookup{1,1})

Rayleigh scatter and second-order fluorescence are removed from the data and replaced with missing data and zeros using in-house software³. For the diluted samples, a background spectrum of the solvent (PBS), measured the same day as the sample, is subtracted from each sample in order to remove possible Raman scatter⁴. All samples are intensity calibrated by normalizing to the integrated area of the water Raman peak of a sealed water sample measured each day prior to the measurements. This converts the intensity scale into Raman units and allows comparison of intensity of samples measured on other fluorescence spectrometers⁵.
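The intensity calibration amounts to dividing every intensity by the integrated area of the water Raman peak measured on the same day. The Python sketch below illustrates this with trapezoidal integration. The wavelengths, intensities and integration range are hypothetical, and any baseline correction of the Raman peak is left out.

```python
def trapezoid_area(x, y):
    """Area under the curve y(x) by the trapezoidal rule."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
        for i in range(len(x) - 1)
    )


def to_raman_units(intensities, raman_wavelengths, raman_intensities):
    """Divide sample intensities by the integrated water Raman peak area."""
    area = trapezoid_area(raman_wavelengths, raman_intensities)
    return [value / area for value in intensities]


# Hypothetical water Raman peak measured on the day of the experiment.
raman_wavelengths = [390.0, 395.0, 400.0, 405.0, 410.0]
raman_intensities = [0.0, 2.0, 4.0, 2.0, 0.0]
raman_area = trapezoid_area(raman_wavelengths, raman_intensities)
print(raman_area)  # 40.0

# Hypothetical sample intensities in arbitrary instrument units.
sample = [80.0, 20.0, 4.0]
calibrated = to_raman_units(sample, raman_wavelengths, raman_intensities)
print(calibrated)  # [2.0, 0.5, 0.1]
```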

The files contain unequal numbers of samples. For some samples there was not enough material to measure in all three setups; furthermore, some spectra were discarded due to obviously erroneous measurements.

Get the data here

The data are available in MATLAB 7 format and stored as dataset objects (the freeware dataset object is available at http://eigenvector.com/software/dataset.htm).

If you use the data please refer to:

Lawaetz, A.; Bro, R.; Kamstrup-Nielsen, M.; Christensen, I.; Jørgensen, L.; Nielsen, H. Fluorescence spectroscopy as a potential metabonomic tool for early detection of colorectal cancer. Metabolomics 8 (supplement 1): 111-121 (2012).

NOTE: There is a mismatch between the number of samples described in the paper and the dataset for download. This is documented in the erratum:

A. Lawaetz, R. Bro, M. Kamstrup-Nielsen, I. Christensen, L. Jørgensen, and H. Nielsen, “Erratum to: Fluorescence spectroscopy as a potential metabonomic tool for early detection of colorectal cancer,” Metabolomics 8 (supplement 1): 122 (2012)

Reference List

1. Nielsen, H. J.; Brunner, N.; Frederiksen, C.; Lomholt, A. F.; King, D.; Jorgensen, L. N.; Olsen, J.; Rahr, H. B.; Thygesen, K.; Hoyer, U. Plasma tissue inhibitor of metalloproteinases-1 (TIMP-1): a novel biological marker in the detection of primary colorectal cancer. Protocol outlines of the Danish-Australian endoscopy study group on colorectal cancer detection. Scand. J. Gastroenterol. 2008, 43 (2), 242-248.

2. Lomholt, A. F.; Hoyer-Hansen, G.; Nielsen, H. J.; Christensen, I. J. Intact and cleaved forms of the urokinase receptor enhance discrimination of cancer from non-malignant conditions in patients presenting with symptoms related to colorectal cancer. British Journal of Cancer 2009, 101 (6), 992-997.

3. Andersen, C. M.; Bro, R. Practical aspects of PARAFAC modeling of fluorescence excitation-emission data. J. Chemometrics 2003, 17 (4), 200-215.

4. McKnight, D. M.; Boyer, E. W.; Westerhoff, P. K.; Doran, P. T.; Kulbe, T.; Andersen, D. T. Spectrofluorometric characterization of dissolved organic matter for indication of precursor organic material and aromaticity. Limnology and Oceanography 2001, 46 (1), 38-48.

5. Lawaetz, A. J.; Stedmon, C. A. Fluorescence intensity calibration using the Raman scatter peak of water. Appl. Spectrosc. 2009, 63 (8), 936-940.

Award best PhD thesis

Every year, the Catalan Nutrition Center (CCNIEC) of the Institute for Catalan Studies (IEC) in Catalonia (Spain) recognizes the best doctoral theses in the field of food science and nutrition. In the current edition, our very own Beatriz Quintanilla Casas has been awarded the CCNIEC-Eroski Foundation 2022 Prize for her doctoral thesis, titled “Development of innovative analytical techniques for olive oil authentication and quality assessment”.

In all honesty, this is not really something that happened on our watch. The work was carried out within the Lipids and bioactive compounds research group at the Department of Nutrition, Food Science and Gastronomy of the University of Barcelona (Spain). But we are happy and proud regardless!

Link to the CCNIEC website: https://www.ccniec.cat/premis-ccniec/premi-tesi-doctoral/#toggle-id-1

Link to the thesis or at least the abstract: http://diposit.ub.edu/dspace/handle/2445/189681

PARADISe for untargeted GC-MS

We are extremely happy and proud to announce our first official version of the PARADISe software. It transforms untargeted GC-MS data from large sample sets into peak tables in a very simple, robust and reproducible manner. And it is very easy to use. You can find it at

PARADISe

where you can also find information on how to use the software in the form of a tutorial and a little instruction video.

Please try it and let us know what you think. We are very grateful for all the support, especially from Arla and the Danish Dairy Research Foundation, who have made this work possible.

History. PARADISe has been in the making for more than a decade and has been developed with generous funding from numerous companies and agencies. Here is a list of some of the most important contributors to the software:

Beatriz Quintanilla-Casas

Maja Kamstrup-Nielsen