Data used: claus.mat contains fluorescence excitation-emission data from five samples containing tryptophan, phenylalanine, and tyrosine.
Purpose: Learning about proper preprocessing.
Information: R. Bro, PARAFAC: Tutorial & applications. Chemom. Intell. Lab. Syst., 1997, 38, 149-171. Also see the paper "Centering and Scaling in Component Analysis".
Preprocessing of higher-order arrays is more complicated than in the two-way case, though understandable in light of the multilinear variation presumed to be an acceptable model of the data. Centering serves the same purpose as in two-way analysis, namely to remove constant terms in the data that may otherwise at best require an extra component and at worst make modeling impossible.
All models described here implicitly assume that the data are ratio-scale (interval-scale with a natural origin), i.e., that there is a natural zero which really does correspond to zero (no presence means no signal) and that the measurements are otherwise proportional, such that doubling the amount of a phenomenon doubles its contribution to the data. If the data are not approximately ratio-scale, centering is mandatory. In two-way analysis, centering is almost always performed, but it is far from always needed.
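The effect of a constant term can be illustrated with a small two-way sketch. The vectors a and b and the offset of 10 are made up for illustration and are not taken from claus.mat: a rank-one matrix with an added offset needs two components, while centering across the first mode brings it back to one.

```matlab
% Hypothetical rank-one data: five "samples" times four "variables"
a = [1 2 3 4 5]';
b = [2 1 3 5]';
X0 = a*b';                           % bilinear part, rank one

% A constant offset breaks the ratio-scale assumption
Xoff = X0 + 10;
rankOffset = rank(Xoff);             % the offset costs an extra component

% Centering across the first mode (subtracting column means) removes it
Xcen = Xoff - ones(5,1)*mean(Xoff);
rankCentered = rank(Xcen);

disp([rankOffset rankCentered])
```

Here the matrix rank is used as a stand-in for the number of bilinear components needed to describe noise-free data.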
Centering is performed to make the data compatible with the structural model. Scaling, on the other hand, is a way of making the data compatible with the least squares loss function normally used. Scaling does not change the structural model of the data, only the weight paid to errors of specific elements in the estimation. Scaling is dramatically simpler than using a weighted loss function and is therefore to be preferred if approximately homoscedastic data can be obtained by scaling. Centering and scaling are described below using three-way arrays.
For instance, if it is known that the true model consists of one PARAFAC term (a trilinear component) and an overall level, it may seem feasible to estimate a PARAFAC model on the original data after subtracting the grand level. However, even though the mathematical structure might theoretically be true, subtracting the grand level introduces artifacts in the data that are not easily described by the PARAFAC model. In this case, even though the grand level has been subtracted, two components are still necessary to describe the data. This shows that the preprocessing has not achieved its goal of simplifying the subsequent model. If, on the other hand, the data are centered across one mode, the data can be modeled by a one-component model. Another possibility is to estimate a two-component model while constraining one component to have constant loadings in each mode, thus reflecting the grand level. This provides a model with a unique estimate of the grand level (see box below).
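The grand-level argument can be checked numerically on noise-free synthetic data. The sketch below uses made-up loadings a, b, c and a level g, and uses the rank of the unfolded array as a simple stand-in for the number of components needed; it does not call parafac.

```matlab
% Hypothetical one-component trilinear data plus an overall level g
a = [1 2 3 4 5]';  b = [1 2 4]';  c = [1 3]';  g = 10;
X = reshape(a*kron(c,b)', [5 3 2]) + g;     % X(i,j,k) = a(i)*b(j)*c(k) + g

% Subtracting the grand mean leaves a constant term behind
Xgrand = X - mean(X(:));
rankGrand = rank(reshape(Xgrand, 5, 3*2));

% Centering across the first mode removes the level completely
x1 = reshape(X, 5, 3*2);
Xcen = x1 - ones(5,1)*mean(x1);
rankCentered = rank(Xcen);

disp([rankGrand rankCentered])
```

The grand mean of the array is the level g plus the product of the mean loadings, so subtracting it replaces one constant by another instead of removing it.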
Fit an appropriate three-component PARAFAC model to the amino acid data and look at the loadings. How much variance does the model describe?
load claus
model = parafac(X,3);
plotfac(model)
Center the data across samples, i.e., as in ordinary two-way analysis:
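x1 = reshape(X,5,201*61);                     % unfold to samples x (emission*excitation)
meanx1 = mean(x1);                            % column means, i.e., averages over the five samples
centeredx1 = x1-ones(5,1)*meanx1;             % subtract the mean from every column
centeredx1 = reshape(centeredx1,[5 201 61]);  % fold back into a three-way array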
You can also use the m-file nprocess for this (centeredx1 = nprocess(X,[1 0 0],[0 0 0]);), but we avoid that here to get some insight into how preprocessing works. For real data analysis, though, it is always a good idea to use nprocess to avoid preprocessing mistakes.
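As a cross-check on the unfolding approach, the same centering can be written directly on the three-way array with base MATLAB functions. The sketch below uses a small made-up array in place of claus.mat and compares the two routes; it does not rely on nprocess.

```matlab
% Small stand-in array (samples x emission x excitation); values are arbitrary
X = reshape((1:60).^2, [5 4 3]);
[I,J,K] = size(X);

% Route 1: unfold, subtract column means, fold back (as in the text)
x1 = reshape(X, I, J*K);
viaUnfold = reshape(x1 - ones(I,1)*mean(x1), [I J K]);

% Route 2: subtract the mean over the first mode directly
direct = X - repmat(mean(X,1), [I 1 1]);

% Largest disagreement between the routes, and largest remaining mean
maxDiff = max(abs(viaUnfold(:) - direct(:)));
m = mean(direct,1);
maxMean = max(abs(m(:)));
disp([maxDiff maxMean])
```

Both numbers should be zero up to rounding error, which confirms that centering across the first mode is the same as centering the columns of the unfolded matrix.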
Look at the centered data compared to the uncentered data:
figure
subplot(1,2,1),
mesh(ExAx,EmAx,squeeze(X(4,:,:))),axis tight
title('Uncentered')
subplot(1,2,2),
mesh(ExAx,EmAx,squeeze(centeredx1(4,:,:))),axis tight
title('Centered')
Upon centering, any offsets that are constant across samples should be removed, and hence a three-component model should still be valid. Fit an appropriate three-component PARAFAC model to the centered amino acid data and look at the loadings. How much variance does the model describe? Do the loadings look like the ones obtained earlier? Which ones differ a little, and which a lot?
model1 = parafac(centeredx1,3);
figure
plotfac(model1)
Now try an incorrect centering across two modes, where one mean per sample is subtracted. For example:
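meanx1b = mean(x1');                          % one mean per sample, over all emission/excitation points
centeredx1b = x1'-ones(201*61,1)*meanx1b;     % subtract each sample's mean from all its elements
centeredx1b = centeredx1b';                   % back to samples x (emission*excitation)
centeredx1b = reshape(centeredx1b,size(X));   % fold back into a three-way array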
model2 = parafac(centeredx1b,3);
figure
plotfac(model2)
How much variance does the model describe? Why? Do the loadings look like the ones obtained earlier? Why (not)?
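One way to see what this centering does is to apply it to noise-free synthetic data. In the sketch below (made-up loadings, no parafac call) a one-component trilinear array is centered with one mean per sample taken over both remaining modes, and the rank of a single sample's slice is used as an indicator of lost trilinearity.

```matlab
% Hypothetical one-component trilinear array
a = [1 2 3 4 5]';  b = [1 2 4]';  c = [1 3]';
X = reshape(a*kron(c,b)', [5 3 2]);

% Incorrect centering: one mean per sample over modes two and three jointly
x1 = reshape(X, 5, 3*2);
wrong = (x1' - ones(3*2,1)*mean(x1'))';
wrong = reshape(wrong, size(X));

% Each sample slice was a(i)*b*c' (rank one); now a constant has been subtracted
rankBefore = rank(squeeze(X(1,:,:)));
rankAfter  = rank(squeeze(wrong(1,:,:)));
disp([rankBefore rankAfter])
```

A constant subtracted from a rank-one slice generally acts as an additional term, so the centered array is no longer described by a single trilinear component.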
Scaling in multi-way analysis has to be done with the trilinear model taken into account. Unlike centering, it is not appropriate to scale the unfolded array column-wise; rather, whole slabs or submatrices of the array should be scaled. If variable j of the second mode is to be scaled (compared to the rest of the variables in the second mode), it is necessary to scale all columns where variable j occurs by the same scalar. This means that whole matrices instead of columns have to be scaled. For a four-way array, three-way arrays would have to be scaled. Mathematically, scaling within the first mode can be described as dividing every element x(i,j,k) of slab i by the same scalar s(i), i.e., x(i,j,k)/s(i).
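Scaling within a mode can be sketched in a few lines of base MATLAB. The example below scales within the second mode of a small made-up array so that every slab X(:,j,:) gets unit sum of squares; the choice of norm and the variable names are illustrative assumptions, and nprocess is not used.

```matlab
% Small stand-in array; values are arbitrary
X = reshape(1:60, [5 4 3]);
[I,J,K] = size(X);

Xs = zeros(I,J,K);
s  = zeros(1,J);
for j = 1:J
    slab = X(:,j,:);                 % all elements involving variable j
    s(j) = sqrt(sum(slab(:).^2));    % one scalar per slab
    Xs(:,j,:) = slab / s(j);         % the whole slab is divided by it
end

% Sum of squares of each scaled slab
ss = zeros(1,J);
for j = 1:J
    slab = Xs(:,j,:);
    ss(j) = sum(slab(:).^2);
end
disp(ss)
```

Because one scalar is applied to an entire slab, the loadings of the second mode are simply rescaled and the trilinear structure is left intact.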
Another complicating issue is the interdependence of centering and scaling. Scaling within one mode disturbs prior centering across the same mode, but not across other modes. Centering across one mode disturbs scaling within all modes. Hence only centering across arbitrary modes or scaling within one mode is straightforward, and furthermore not all combinations of iterative scaling and centering will converge. In practice, though, it need not influence the outcome much if an iterative approach is not used. Scaling to a sum-of-squares of one is arbitrary anyway, and it may be just as reasonable to scale once, e.g., by variances, within the modes of interest, thereby at least mostly equalizing any huge differences in scale. Centering can then be performed after scaling, which ensures that the modes to be centered are indeed centered.
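The interdependence can be verified on a tiny example. The sketch below (arbitrary made-up numbers, base MATLAB only) centers across the first or the second mode, then scales within the first mode, and reports how far the means are from zero afterwards.

```matlab
% Tiny stand-in array, 3 x 2 x 2; the numbers are arbitrary
X = reshape([1 2 3 4; 2 0 1 1; 0 1 2 7], [3 2 2]);
[I,J,K] = size(X);

% Center across mode one, then scale within mode one (rows of the unfolding)
c1 = reshape(X, I, J*K);
c1 = c1 - ones(I,1)*mean(c1);
c1 = c1 ./ (sqrt(sum(c1.^2,2))*ones(1,J*K));
offMode1 = max(abs(mean(c1,1)));       % largest mean across mode one

% Center across mode two, then scale within mode one
c2 = X - repmat(mean(X,2), [1 J 1]);
c2 = reshape(c2, I, J*K);
c2 = c2 ./ (sqrt(sum(c2.^2,2))*ones(1,J*K));
m2 = mean(reshape(c2, [I J K]), 2);
offMode2 = max(abs(m2(:)));            % largest mean across mode two

disp([offMode1 offMode2])
```

The first number is clearly nonzero because the scaling was applied within the mode that had just been centered, while the second stays at zero up to rounding.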
The appropriate centering and scaling procedures can most easily be summarized in a figure where the array is shown unfolded to a matrix (Figure 1). Centering must be done across the columns of this matrix, while scaling should be done within the rows of this matrix. Note that the common approach of scaling the columns of a data-matrix would not be appropriate for the above unfolded data. The consequence of such a scaling could be that more components are necessary than if proper scaling is used, and that the resulting model will be more difficult to interpret (see box).
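The difference between proper and improper scaling can be illustrated on noise-free synthetic data. The sketch below builds a made-up two-component array and compares scaling whole slabs within the second mode with scaling each column of the unfolded matrix; the rank of one sample's slice serves as a rough indicator of whether the low-rank structure is kept.

```matlab
% Hypothetical two-component array, 2 x 3 x 3:
% X(1,:,:) = b1*c1' and X(2,:,:) = b2*c2'
b1 = [1 1 1]';  c1 = [1 1 1]';
b2 = [1 2 3]';  c2 = [1 2 4]';
X = reshape([kron(c1,b1) kron(c2,b2)]', [2 3 3]);

% Proper: scale within mode two, one scalar per slab X(:,j,:)
proper = X;
for j = 1:3
    slab = X(:,j,:);
    proper(:,j,:) = slab / sqrt(sum(slab(:).^2));
end

% Improper: scale every column of the unfolded 2 x 9 matrix separately
xu = reshape(X, 2, 9);
improper = reshape(xu ./ (ones(2,1)*sqrt(sum(xu.^2,1))), [2 3 3]);

% The first sample's slice is rank one before scaling
rankProper   = rank(squeeze(proper(1,:,:)));
rankImproper = rank(squeeze(improper(1,:,:)));
disp([rankProper rankImproper])
```

Slab scaling keeps the slice at rank one, whereas column-wise scaling of the unfolded matrix raises its rank, which is consistent with the remark that more components may then be needed.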
The N-way tutorial
Copyright © 1998
Changed Jan-2001
R. Bro