||Data used: claus.mat
fluorescence excitation emission data from five samples containing tryptophan,
phenylalanine, and tyrosine.
Purpose: Learning to fit and interpret a PARAFAC model
Information: R. Bro, PARAFAC: Tutorial & applications. Chemom. Intell. Lab. Syst., 1997, 38, 149-171.
Prerequisites: Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1), and have a vague idea about what PARAFAC is.
Task: Fit a three-component PARAFAC model and investigate the model. To fit the PARAFAC model simply type
> [Factors] = parafac(X,DimX,3);
and to convert the output parameters to scores and loadings type
> [A,B,C] = fac2let(Factors,DimX,3);
The scores/loadings in A are the sample mode loadings, the loadings
in B are the emission mode loadings and the loadings in C are the excitation
mode loadings. Use plot(B) or plot(EmAx,B)
to investigate the emission mode loadings. Use plot(ExAx,C)
to plot the excitation mode loadings.
As can be seen from the plots the PARAFAC model is unique. So what does unique mean? It means that the scores and loadings can not be rotated. In PCA you can rotate the scores and loadings and still the model will fit the data just as well as before. Hence, there is no way that you can estimate for example pure spectra from PCA because there is an infinity of different scores and loadings that all give the same fit. In PARAFAC this is not so. The only thing that can be varied is the order and scale of the components. There is no way to say, from the decomposition whether component one is rightfully the first, second or fifth component. So the order of the components is unidentified. There is also no way to define the scale. This is intrinsic to bi- and multilinear models. If X = tp', then we can equally well let t be twice the original t if we just let p be half the original p.
What is the implication of these two unidentifiabilities? That there is no intrinsic order means that the components must be identified after fitting. There is no way to guarantee that, say tyrosine, will come out as component one (in second-order calibration tricks can be used to automatically identify which factor is the analyte of interest if a pure sample is included. If the scores of that sample are forced to be zero for all but one factor, then naturally this factor must be the one pertaining to the analyte of interest). That the scale is unidentified means that we can only determine spectra, concentrations etc. up to a scaling. In the above example we can agree that the second and third mode loadings are pure emission and excitation spectra. There the scores (amounts of the latent variables) must be concentrations, since the latent variables in this case are real chemical variables. So we can find the concentrations in the samples from PARAFAC. But what is the scale? mM, M?? Because of the scale indeterminacy there is no way to tell. This is natural. We can not measure the absorption of a fluid and tell the concentration from that. We need a reference providing the scale. So if the concentration is known in one sample we can scale the corresponding column so that this score equals the known concentration. Then the remaining scores of that factor will be estimates of the concentrations in the other samples.
Task: Evaluate if the scores provide reasonable estimates of the concentrations. Assume that you know the concentrations in, say, the fifth sample. If the model is unique this should provide sufficient information to estimate the concentrations in the first four samples. Use the reference concentrations (given in the matrix y and described in the file readme) to scale the scores such that estimates of the concentrations of tryptophan is obtained using the known concentration of tryptophan in the fifth sample (y(5,1)). Do the same for tyrosine and phenylalanine. This is the basis for the so-called second-order calibration.
How to do it
Task: Investigate the fits of a one-, two- and three-component PARAFAC model of the data. Try also to estimate a four-component model. Note that an increased number of iterations are mostly necessary when fitting the four-component model. This is a typical sign of either too many components being extracted or an ill-defined problem (loadings are very correlated).
An important difference between two-way PCA and the multi-way PARAFAC model is that the PARAFAC model is not nested. Hence the parameters of a three-component model do not equal the parameters of a two-component model plus one additional component. The reason for this is that the components are not required to be orthogonal; hence independent. Therefore, every model has to be calculated specifically with all its components.
In estimating the model note some of the options available in the PARAFAC m-file (type help parafac): The input options makes it possible to, e.g., have the interim loadings being plotted during the iterations (options(3) = 2), or only allow for, say, 100 iterations (options(6) = 100).
Fitting the PARAFAC the model
Converting model parameters to loadings
Calculating the model of the data
Calculating the model fit
Additionally to the above-mentioned tools, one may use the split-half analysis for determining the proper number of components. If two sample sets are available that should describe the same common variation, then it holds that for the proper number of components they should give the same loading vectors up to scaling and permutation (more on this). Thus, if this is not the case, it is a clear indication that too many components are being extracted.
A new approach called the core consistency diagnostic (CORCONDIA) is very helpful in determining the right number of components. The details will not be given here (but here in a pdf-file) but the principle is as follows. For a given number of components fit the PARAFAC model to the data. Use the thereby found solution to calculate the so-called core consistency. If the PARAFAC model is valid then the core consistency is close to 100%. If the data can not approximately be described by a trilinear model or too many components are used the core consistency will be close to zero (or even negative). If the consistency is in-between (around 50%) the model is unstable in which case imposing valid constraints may help stabilize the model. In practice the core consistency levels off slowly for an increasing number of components and then sharply when the correct number of components is exceeded. The number of components corresponding to the last high consistency value should be chosen.
Task: Use different tools for assessing the proper number of components for the fluorescence data. For models of dimension one to five
The multi-way toolbox also provides an m-file called pftest for checking the number of components (type help pftest). The approach adopted there is more or less similar to the one mentioned here, but additionally each model corresponding to a given number of components is fitted several times. This provides means for assessing if the model converges to the same minimum every time. This is useful also in determining the proper number of components, since if too many components are used the number of local minima typically increases dramatically.
How to do it
> [Factors] = parafac(X,DimX,Fac);
where Fac is the number of components. The parameters of the model are held in the vector Factors and may be converted into ordinary scores and loadings by means of fac2let. In case of a three-way model this would read
> [A,B,C] = fac2let(Factors,DimX,Fac);
The scores and loadings of a PARAFAC model may be assessed and used just as in ordinary two-way analysis (e.g., use scores for building regression models, plot loadings and make bi-plots for interpretation etc.)
It has also been discussed that using the file pftest one may very quickly get a good indication of the appropriate number of components to use.
Some important aspects that differ from ordinary two-way analysis have been discussed:
The N-way tutorial
Copyright © 1998