Revised version published in J. Chemom., 2003, 17, 16-33
Centering and scaling in component analysis
Rasmus Bro
Chemometrics Group, Food Technology, Dept. Dairy and Food Science
The Royal Veterinary and Agricultural University
Denmark
rb@kvl.dk
&
Age K. Smilde
Process Analysis and Chemometrics
Department of Chemical Engineering
University of Amsterdam
The Netherlands
Abstract
1. Introduction
 1.1 Definitions
 1.2 Leading principles
 1.3 Outline of paper
 1.4 Notation
2. Two-way centering
 2.1 Reasons for centering
 2.2 How centering works
  2.2.1 Centering can remove offsets because it is a projection step
  2.2.2 Centering across several modes
  2.2.3 Centering is a two-stage procedure for a least squares fitting problem
  2.2.4 Rank reduction and centering
 2.3 When centering does not work
  2.3.1 Handling missing data
  2.3.2 Subtracting the grand mean
 2.4 Alternatives to centering
 2.5 Summary
3. Two-way scaling
 3.1 Reasons for scaling
 3.2 How scaling works
  3.2.1 Different types of scaling
  3.2.2 Scaling and weighted least squares fitting
 3.3 When scaling does not work
 3.4 Alternatives to scaling
 3.5 Summary
4. Simultaneous two-way centering and scaling
 4.1 Scaling within a mode disturbs centering across the same mode, but not across other modes
 4.2 Centering across one mode disturbs scaling within all modes
 4.3 Centering across and scaling within the same mode is problematic
 4.4 Summary
5. Three-way preprocessing
 5.1 Centering
  5.1.1 Possible three-way offsets and their proper removal
 5.2 Scaling
  5.2.1 Incorrect scaling of three-way arrays
 5.3 Simultaneous centering and scaling
 5.4 Summary
6. Conclusion
Acknowledgments
Appendix 1: Projections
Appendix 2: Fitting a bilinear model plus offsets across one mode equals fitting a bilinear model to centered data
Abstract

In this paper, the purpose and use of centering and scaling are discussed in depth. The main focus is on two-way bilinear data analysis, but the results are easily generalized to multi-way data analysis. In fact, one of the aims of this paper is to show that once two-way centering and scaling are understood, multi-way centering and scaling are quite straightforward. In the literature it is often stated that preprocessing of multi-way arrays is difficult, but here it is shown that most of the difficulties do not pertain to three- and higher-way modeling in particular.
It is shown that centering is most conveniently seen as a projection step, in which the data are projected onto certain well-defined spaces within a given mode. This view of centering helps explain why, for example, centering data with missing elements is likely to be suboptimal if there are many missing elements.
Building a model for data consists of two parts: postulating a structural model and using a method to estimate its parameters. Centering has to do with the first part: when centering, a model including offsets is postulated. Scaling has to do with the second part: when scaling, another way of fitting the model is employed. It is shown that centering is simply a convenient technique for estimating the parameters of models with certain offsets, but that this does not work for all types of offsets. It is also shown that scaling is a way to fit models with a weighted least squares loss function, and that sometimes this change of objective function cannot be achieved by a simple scaling step.
Further practical aspects of, and alternatives to, centering and scaling are discussed, and examples are used throughout to show that the conclusions of the paper are not only of theoretical interest, but can have an impact on practical data analysis.
Keywords: Offset, weighted least squares, preprocessing, two-way, three-way, multi-way, missing data, PCA, PARAFAC
1. Introduction

1.1 Definitions

It is important to have a concise terminology for scaling and centering. The following convention is based on suggestions from the literature [Harshman & Lundy 1984, Kiers 2000, Kruskal 1989, ten Berge 1989]. The term 'an offset' (also sometimes called an intercept) is used for a part of the model that is constant across one or several modes. An R-component bilinear model of a data matrix X (I × J) with elements x_{ij} may be written in terms of scalars as x_{ij} = Σ_{r=1}^{R} φ_{ir}θ_{jr} + e_{ij}, or in matrix notation as
X = ΦΘ^{T} + E 
(1) 
where Φ (I × R) and Θ (J × R) hold the parameters φ_{ir} and θ_{jr}; Greek symbols are used to indicate population parameters. The matrix E holds the unknown errors. Offsets may be constant across the first mode (rows). The model associated with such offsets is:
X = ΦΘ^{T} + 1μ^{T} + E
(2)
where μ (J × 1) holds the constant terms μ_{j} (j = 1,..,J) and where 1 is a one-vector of suitable size (I × 1 in this case). Again, the Greek symbol μ indicates a population value. Offsets may also be constant across the second mode (columns). The underlying bilinear model with offsets across the second mode reads
X = ΦΘ^{T} + μ1^{T} + E
(3) 
where the vector μ (I × 1) now holds the offsets μ_{i} (i = 1,..,I). Offsets may also be constant across both columns and rows, yielding
X = ΦΘ^{T} + 11^{T}μ + E
(4)
in which the single constant μ is the same for all elements of X. Such a situation may arise, for example, in chromatography or capillary electrophoresis, where a constant offset in the detector may appear due to the way the detector zero-level is determined.
Thus, for bilinear models there are two basic types of offsets: constants across one mode (columns or rows) and constants across both modes. Combinations of such offsets may also appear, as seen, for example, in analysis-of-variance settings.
As will be shown below, offsets are often handled by first centering the data and subsequently fitting the bilinear model. If the data are centered by subtracting the column-average from every element in the column, this is referred to as centering across the first mode. Mathematically it can be expressed as
y_{ij} = x_{ij} − m_{j}
(5) 
where y_{ij} is an element of the centered data matrix and m_{j} is the jth column-average. If m (J × 1) is a vector holding the jth column-average in its jth element, then centering across the first mode can also be expressed as
Y = X − 1m^{T}
(6) 
where 1 is an I-vector of ones and Y is the matrix holding the centered data. Subtracting the row-average from each element in a row is referred to as centering across the second mode and can be expressed as
y_{ij} = x_{ij} − m_{i}
(7) 
or, using m (I × 1) as a vector holding the ith row-average in its ith element,
Y = X − m1^{T}.
(8) 
In general, centering across one mode is also called single-centering, and performing, for example, a centering across the first mode followed by a centering of the outcome across the second mode is called double-centering. The term slab-centering, which is sometimes seen in the literature, refers to centering by subtracting, from each slab in a three-way array, the overall average of that slab. For two-way data, this simply corresponds to subtracting the average of all elements.
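To make these conventions concrete, the following minimal NumPy sketch (not part of the original paper; the data matrix is arbitrary) performs centering across each of the two modes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # arbitrary I x J data matrix

# Centering across the first mode: subtract each column-average (Eq. (6)).
Y1 = X - X.mean(axis=0)                  # Y = X - 1 m^T
print(np.allclose(Y1.mean(axis=0), 0))   # column-averages are now zero

# Centering across the second mode: subtract each row-average (Eq. (8)).
Y2 = X - X.mean(axis=1, keepdims=True)   # Y = X - m 1^T
print(np.allclose(Y2.mean(axis=1), 0))   # row-averages are now zero
```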
For scaling, another terminology is used. When a matrix is scaled such that each row is multiplied by a specific scalar, it is called scaling within the first mode (y_{ij} = x_{ij}w_{i}). If each column is multiplied by a certain scalar as in traditional autoscaling, it is referred to as scaling within the second mode (y_{ij} = x_{ij}w_{j}). In matrix notation scaling within the first mode can be written
Y = WX 
(9) 
where W is an I × I diagonal matrix holding the scalar w_{i} in its ith diagonal element. Scaling within the second mode can be written
Y = XW 
(10) 
where W is now a J × J diagonal matrix holding the scalar w_{j} in its jth diagonal element.
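Analogously, a minimal sketch of the two scaling operations of Eqs. (9) and (10); the particular weights used here (inverse row norms and inverse column standard deviations) are merely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

# Scaling within the first mode: every row is multiplied by its own scalar (Eq. (9)).
W_rows = np.diag(1.0 / np.linalg.norm(X, axis=1))
Y_row = W_rows @ X                       # Y = W X

# Scaling within the second mode: every column is multiplied by its own scalar (Eq. (10)).
W_cols = np.diag(1.0 / X.std(axis=0, ddof=1))
Y_col = X @ W_cols                       # Y = X W
```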
Figure 1. Example of centering and scaling using sugar process data (legend: see text).
An example of centering and scaling is shown in Figure 1. The data are from a sugar factory (see Nørgaard [1995] for more information). The variables shown are Ash (1), Color (2), Color type (3), Turbidity (4), Grain size 1 (5), Grain size 2 (6), SO_{2} (7), Invert (8), Floc (9), Insoluble residue (10) and Amino-N (11), which are all measured in different units of different magnitude. Each line in a plot is the status ('spectrum') of these 11 variables at a certain time. Ninety-seven times are shown. The raw data are shown on the left side (A). The different ranges of these variables will manifest themselves in a subsequent modeling of the data, where the variables with little variation will not be modeled to any significant degree. Centering (across the first mode) will not remove these scale differences, but will move the variation of each variable to the zero-level. As the differences in scale between variables are arbitrary, it is useful to scale the data so that each variable has the same initial standard deviation (and to remove the effect of different measurement units). This can be achieved by scaling the centered data within the second (variable) mode. The corresponding autoscaled data are shown on the right side of the figure. It is seen that the variation of each preprocessed variable is comparable for the autoscaled data. The outlying sample in the lower part of the plot, however, leads to an overly dramatic down-weighting of, for example, variable one and should hence be excluded before preprocessing is carried out.
It may seem strange that different words are used for scaling (within) and centering (across). The explanation is as follows. Centering is performed across a mode in the sense that one offset is subtracted from every element in a certain vector, i.e., the data are centered across the elements of one mode. The same holds for three-way data: the average value is subtracted from each element of a vector. Scaling is performed by multiplying all elements in the array containing a certain variable (or object) by the same scalar. For two-way data, scaling therefore also pertains to vectors, but for, e.g., three-way data this means that a whole slab (corresponding, for example, to the I × K matrix of the jth wavelength in a spectral three-way array) has to be multiplied by the same scalar. Thus scaling is performed within the elements of one mode.
In the following, much use will be made of the notion of 'fit' or 'model fit'. In general terms this means what portion of the data is fitted by the model. This can be expressed by

R^{2} = 1 − ‖E‖^{2}/‖X‖^{2}

(11)

where X contains the data, X̂ the fitted values of the data using the model, and E = X − X̂ the residuals. The symbol ‖A‖ denotes a norm of A, here taken to be the Frobenius or Euclidian norm (the square root of the sum of squared elements of A). The R^{2} statistic can take on values between zero and one, where one means perfect fit and zero means no fit. The fit may also be expressed as a percentage between 0 and 100.
1.2 Leading principles

There are two leading principles in this paper. The first principle is parsimony: a model should be as simple as possible. This means that if two models give the same fit, the model using fewer parameters is preferred. This idea goes back to William of Ockham, who lived in the Middle Ages and stated that a minimum number of assumptions should be adopted to explain a phenomenon [Weinberg 1964]. This principle is known as 'Ockham's Razor'. In statistics, the notion of parsimony is formulated in Statistical Decision Theory [Judge et al. 1985], and in chemometrics it was introduced as the 'parsimony principle' [Seasholtz & Kowalski 1993].

The second principle is that centering in this paper is considered not as a means of estimating offsets, but of removing them. Estimating offsets is a different issue from removing them, with its own properties and problems.
1.3 Outline of paper

In the main part of the paper only two-way (bilinear) models are considered, because most results generalize straightforwardly to multi-way models. Section 2 is concerned with two-way centering, Section 3 with two-way scaling, and Section 4 with the combined use of two-way centering and scaling. Section 5 casts the results in terms of multi-way models, and finally in Section 6 some conclusions are drawn.
1.4 Notation

The following notation is used in this paper. Two-way data are held in an I × J matrix X with typical elements x_{ij}. Three-way data are held in an I × J × K array X with typical elements x_{ijk}. Such a three-way array is often rearranged to an I × JK matrix X^{(I×JK)} by concatenating the K third-mode frontal slabs after each other; i.e., X^{(I×JK)} = [X_{1} X_{2} .. X_{K}], where X_{k} is the I × J matrix obtained by fixing the third mode of X at k. This operation has been referred to in a number of ways (e.g. unfolding), but it has been suggested to use the term matricizing to avoid confusion with other techniques [Kiers 2000]. The matricized array is often denoted simply X instead of X^{(I×JK)} if no confusion is possible. The letter Y is used for preprocessed data and X̂ is used for a model of X (be it two- or three-way). The number of components in a model is called R.
The letter m is used for calculated offsets (e.g. averages) and μ for true offsets (population values). The character w is used for a scaling parameter. The character P is used for a projection matrix related to centering. Usually its dimension will not be specified, because it follows from the context. Similar rules apply for the diagonal matrix W holding the weights associated with scaling the array.
The Kronecker product is denoted ⊗, the Hadamard (element-wise) product *, and the Khatri-Rao product [Styan 1973], which is the column-wise Kronecker product of two matrices, is denoted ⊙. The use of these special products makes it possible to express most three-way models in terms of two-way (matricized) arrays [Bro 1998]. For example, a PARAFAC model of an I × J × K array X can be expressed as
X^{(I×JK)} = A(C⊙B)^{T} + E^{(I×JK)},
(12)
where A (I×R), B (J×R), and C (K×R) are component matrices and E holds the residual unexplained variation. This notation is equivalent to the model
x_{ijk} = Σ_{r=1}^{R} a_{ir}b_{jr}c_{kr} + e_{ijk},   i = 1,..,I; j = 1,..,J; k = 1,..,K
(13) 
2. Two-way centering

In order to understand when and how centering works, it is important first to consider the goals of centering and to realize how these goals are achieved in practice. These aspects are described in this section.
2.1 Reasons for centering

As stated by Harshman and Lundy [1984], quite subjective and qualitative reasons are often given for performing centering. It is, however, possible to formulate rational reasons for centering on scientific grounds. Basically, centering should be performed only if there are common offsets in the data or if modeling such offsets provides an approximately reasonable model. Thus, centering is performed to make interval-scale data behave as ratio-scale data, which is the type of data assumed in most multivariate models. Said more simply, centering should make a difference. This difference can manifest itself as
i) Reduced rank of the model
ii) Increased fit to the data
iii) Specific removal of offsets
iv) Avoidance of numerical problems
Re i). If a model of the raw data requires, say, R + 1 components to describe the data well, whereas a model of the centered data requires only R components, then centering is sensible because the model of the centered data has only R(I+J) + J parameters. The J parameters pertain to the calculated averages, assuming that centering is performed across the first mode. The alternative of fitting the (R+1)-component model to the raw data would lead to a model with (R+1)(I+J) parameters and would thus violate the parsimony principle.

Re ii). In some situations the rank of the appropriate model is not reduced upon centering, but if the fit of the model of the centered data is significantly improved, then introducing the extra parameters is naturally useful. Heuristically, the offsets introduced by centering may be considered one extra 'half' component, of which either the scores (centering across the first mode) or the loadings (centering across the second mode) are known a priori to be equal to one. This holds in the sense that the fit of a model with R components is poorer than the fit of a model with R components and offsets, which again is poorer than that of a model with R + 1 components.

Re iii). Centering can remove certain offsets. In some situations the offsets as such are of interest, in which case it is interesting to estimate them. This can usually not be achieved by centering.

Re iv). In certain algorithms it may be useful to center the data in order to minimize algorithmic problems. For fitting a bilinear model using principal component analysis, it is known that the ratio of the two largest eigenvalues is related to the convergence rate of the power method (and related techniques such as NIPALS). For PARAFAC it is likewise known that if some components are strongly correlated, as evidenced through Tucker's congruence coefficient [Tucker 1951], then the fitting procedure may be hampered by so-called swamps. In both situations centering across certain modes can help minimize the cause of the problem, because the resulting optimization problem is related to a different model with different (and hopefully better) properties with respect to numerical problems.
2.2 How centering works

2.2.1 Centering can remove offsets because it is a projection step

The following discussion pertains to centering across the first mode (ordinary column-centering) but applies equally to centering across the second mode. Understanding that centering is a special projection step within one specific mode explains why it eliminates constant terms in the data (see Appendix 1 for details on projections).
If the vector m holds in its jth element the average of the jth column of X, then m can be expressed as
m = (1/I)X^{T}1, i.e., m^{T} = (1^{T}/I)X
(14) 
where 1 is an I-vector of ones and X (the data) has size I × J. Then centering X across the first mode (column-centering) amounts to
Y = X − 1m^{T}
(15)
where Y (I × J) contains the centered data. As
1m^{T} = (11^{T}/I)X, 
(16) 
the centered data can also be expressed
Y = X − 1m^{T} = X − (11^{T}/I)X = (I − (11^{T}/I))X = (I − P)X = P^{⊥}X
(17) 
where P^{⊥}X = P^{⊥}[x_{1}, ..., x_{J}] = [P^{⊥}x_{1}, ..., P^{⊥}x_{J}] and x_{j} is the jth column of X. The matrix (11^{T}/I) = P is a symmetric and idempotent (I × I) matrix and is thus an (orthogonal) projection matrix (see Appendix 1). This shows that the column-averages are the orthogonal projections of the columns of X onto the direction of ones, i.e., the direction given by the vector 1. The centering matrix (I − (11^{T}/I)) = P^{⊥} is the projection matrix onto the null space of 1^{T}, which equals range(1)^{⊥} (where range(·) is the range of a matrix). Stated otherwise, centering may also be interpreted as providing the residuals after regressing the columns of X onto 1. It follows that centering may be viewed as the projection of the data onto a space with the common offset (given by the I-vector 1) removed.
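The equivalence between subtracting column-averages and projecting onto the null space of 1^{T} is easily verified numerically; a minimal sketch with an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 7, 4
X = rng.normal(size=(I, J))

one = np.ones((I, 1))
P = one @ one.T / I          # projection onto range(1); symmetric and idempotent
P_perp = np.eye(I) - P       # projection onto the null space of 1^T

# Subtracting the column-averages equals applying the centering projection.
print(np.allclose(X - X.mean(axis=0), P_perp @ X))   # True
```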
Mathematically, centering is a projection onto the null space of 1^{T}, and it is worthwhile to keep this in mind. Suppose that the true model of the data contains offsets across the first mode. Then the model can be written as
X = ΦΘ^{T} + 1μ^{T} + E
(18) 
Projecting these data onto the null space of 1^{T} leads to
P^{⊥}X = P^{⊥}ΦΘ^{T} + P^{⊥}1μ^{T} + P^{⊥}E
(19) 
P^{⊥}X = Y = P^{⊥}ΦΘ^{T} + P^{⊥}E
(20) 
where P^{⊥}X = Y is the matrix holding the centered data and where the matrix P^{⊥}1μ^{T} vanishes, as 1 has no residuals when projected onto itself. The part P^{⊥}ΦΘ^{T} is a bilinear model with scores P^{⊥}Φ and loadings Θ. Thus, instead of fitting the bilinear model and the offsets to the original data, it is only necessary to fit the bilinear model to the centered data Y, with true structure P^{⊥}ΦΘ^{T} + P^{⊥}E. Centering also leads to models with residuals with zero column-averages (centered across the first mode), because these are also projected onto the null space of 1^{T}, as is clear from Eq. (20).
If Φ and Θ are nonnegative, then nonnegativity constraints can be imposed on the model for X. When X is centered, e.g., across the first mode, such a constraint is not meaningful for Φ anymore, because centering destroys nonnegativity of Φ (but not of Θ).
2.2.2 Centering across several modes

As mentioned earlier, centering across a given mode is called single-centering. Single-centering an array across one mode that has previously been single-centered across another mode is called double-centering. Performing several single-centerings (for multi-way arrays, as many can be performed as there are ways) is unproblematic in the sense that centering across one mode leaves the 'centeredness' intact in other modes [Harshman & Lundy 1984]. Further, the order of the centerings is immaterial. This means that if the data are first centered across the first mode, and the centered data are subsequently centered across the second mode, then the average of every column and every row will be zero. This follows because centering across the first mode can be written as P^{⊥}_{I}X, where P^{⊥}_{I} is the centering operator for the first mode. Centering across the second mode can be written as XP^{⊥}_{J}. Thus double-centering can be written as P^{⊥}_{I}XP^{⊥}_{J}, and hence: i) the order in which the centerings are performed is immaterial, and ii) the double-centered array will have both column- and row-averages of zero, because P^{⊥}_{I}XP^{⊥}_{J} can be viewed as centering the matrix XP^{⊥}_{J} across the first mode or centering the matrix P^{⊥}_{I}X across the second mode.
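A quick numerical check of these two properties (arbitrary matrix; the operators are those defined above):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J = 5, 3
X = rng.normal(size=(I, J))

PI = np.eye(I) - np.ones((I, I)) / I    # centering operator for the first mode
PJ = np.eye(J) - np.ones((J, J)) / J    # centering operator for the second mode

Y1 = (PI @ X) @ PJ   # center across the first mode, then across the second
Y2 = PI @ (X @ PJ)   # center across the second mode, then across the first
print(np.allclose(Y1, Y2))                                               # order is immaterial
print(np.allclose(Y1.mean(axis=0), 0), np.allclose(Y1.mean(axis=1), 0))  # both averages zero
```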
2.2.3 Centering is a two-stage procedure for a least squares fitting problem

Consider a two-way data set, which is generated as
X = ΦΘ^{T} + 1μ^{T} + E
(21)
where Φ is (I × R), Θ is (J × R) and μ is (J × 1). It follows that the data can be modeled by a bilinear model plus a common offset for each variable/column plus additional unmodeled variation held in the residual matrix E. This is the model assumed to be a valid approximation in most bilinear methods in chemometrics. The least squares loss function for the above model is
‖X − (AB^{T} + 1n^{T})‖^{2}
(22)
and this function is to be minimized directly over A, B and n, with A, B and n being of the same dimensions as Φ, Θ and μ, respectively. The matrices A and B and the vector n contain estimates of Φ, Θ and μ, respectively, but it is intrinsic to the problem that these estimates do not uniquely recover the underlying parameters. For example, Φ and Θ can, at most, be found up to a rotation. As shown above, however, the parameters can be estimated in two steps. Centering the data across the first mode will remove the offsets μ, and the bilinear model is subsequently fitted to the centered data Y, thus minimizing the loss function
‖Y − CD^{T}‖^{2}
(23)
only over C and D, which are of the same dimensions as A and B above. It holds that
min ‖X − (AB^{T} + 1n^{T})‖^{2} = min ‖Y − CD^{T}‖^{2}
(24) 
i.e., the fit of the model fitted directly and that of the bilinear model fitted to the centered data will be exactly the same (see the proof in Appendix 2), even though the actual parameters will usually differ. This is an important result, because it guarantees the optimality of the model even if the offsets are calculated separately from the bilinear parameters.
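A sketch of the two-stage procedure in NumPy (simulated data; dimensions, rank and noise level are arbitrary, and the truncated SVD serves as the least squares bilinear fit):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, R = 50, 10, 2
Phi, Theta = rng.normal(size=(I, R)), rng.normal(size=(J, R))
mu = rng.normal(size=J)
X = Phi @ Theta.T + np.outer(np.ones(I), mu) + 0.1 * rng.normal(size=(I, J))

# Stage 1: remove the offsets by centering across the first mode.
m = X.mean(axis=0)
Y = X - m

# Stage 2: fit the R-component bilinear model to Y (truncated SVD = least squares fit).
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
C = U[:, :R] * s[:R]
D = Vt[:R].T

# Per Eq. (24), this two-stage model attains the minimum of Eq. (22).
Xhat = C @ D.T + np.outer(np.ones(I), m)
print(np.linalg.norm(X - Xhat) ** 2)
```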
The solution for minimizing Eq. (22) directly is not unique for n. That is, the two-stage solution of centering first and then fitting the bilinear part gives one solution of many to the problem in Eq. (22). Therefore, centering removes μ, but m is not necessarily an estimate of μ.
This non-uniqueness is briefly explained as follows. Centering involves subtracting from each column its column-average. The matrix
PX = PΦΘ^{T} + P1μ^{T}+ PE 
(25) 
holds in each row the vector m^{T} containing the average value of each column of X (P is 11^{T}/I, as above). If the part PΦΘ^{T} is a matrix of zeros, then m will be an estimate of the true offsets μ (with error PE), because
P1μ^{T} = 1μ^{T} 
(26) 
by definition. However, PΦΘ^{T} will only be a zero matrix if PΦ = 0 and/or Θ = 0 (assuming that Φ and Θ have full column rank). Since P = 11^{T}/I, the condition PΦ = 0 holds exactly when 1^{T}Φ = 0^{T}, i.e., when the columns of Φ sum to zero. Thus, the calculated offsets will only equal the true offsets if the column space of Φ is orthogonal to 1 (meaning that Φ is centered already) or if Θ = 0.
2.2.4 Rank reduction and centering

In some cases centering reduces rank and in some cases it does not. Column-centering of X (I × J) reduces the rank of X if and only if 1 ∈ range(X), where range(X) is the range of X (see Amrhein [1998] p. 156, and Amrhein [1996]). Intuitively this is understandable. Centering is a projection. If the axis on which the projection takes place is part of the range of X, then the residuals of this projection no longer have this direction available. Hence, the rank of the matrix of residuals is lowered by one. This simple fact has several repercussions for centering across the first mode (column-centering).
The following can be said about the noiseless case. Suppose that X (I × J) is noiseless and can be modeled as

X = ΦΘ^{T} + 1μ^{T}

(27)

where Φ is (I × R) of full rank and Θ is (J × R). Assume that [Φ 1] has full rank R + 1, which will be fulfilled for real data. For Y, the column-centered X, two cases can be distinguished:
1. 1 ∈ range(Φ) ⇒ rank(X) = R ⇒ rank(Y) = R − 1
2. 1 ∉ range(Φ) ⇒ rank(X) = R + 1 ⇒ rank(Y) = R
Hence, in both cases the rank of X is reduced by one. The reverse also holds: if in the noiseless case no rank reduction is obtained upon column-centering, then model (27) is not valid. To summarize, in the noiseless case there is a simple relation between the validity of model (27) and rank reduction upon column-centering.
In the case with noise added, things are less simple. Suppose that X (I × J) also contains noise and the model for X is

X = ΦΘ^{T} + 1μ^{T} + E

(28)

If I > J, then in general 1 ∉ range(X). Although 1 ∈ range('noiseless' X), this property is destroyed upon adding noise to X. In the case I ≤ J, if X has full rank I, then 1 is by definition in the range of X, whether or not there exists an offset term 1μ^{T}. Hence, in the case with noise there is no longer a simple relationship between the validity of (28) and mathematical rank reduction upon column-centering.
2.3 When centering does not work

Viewing centering as a projection step rather than as a simple subtraction of averages has more than theoretical importance. In practice, situations often occur where subtraction of averages does not work and may, in fact, lead to models that fit the original data more poorly than if the data had not been preprocessed. This will be shown in the following for two different problems: handling missing data and modeling a single common offset.
2.3.1 Handling missing data

When data are missing, centering by subtracting averages from columns or rows does not lead to elimination of offsets. Rather, the offsets have to be eliminated simultaneously with the fitting of the bilinear part [Grung & Manne 1998]. This is so because the equivalence between subtracting average values and projecting onto the null space of vectors of ones no longer holds, as the projection cannot be calculated with missing elements. As an example, consider a 5 × 2 rank-one matrix X^{(1)} whose second column is proportional to its first and whose column-averages are 4.1 and 8.2, respectively. This matrix remains rank one after centering across the first mode: the centered matrix Y^{(1)} is also a rank-one matrix.

Consider now a situation in which the first two elements in the first column are missing, giving the data set X^{(2)}. This data set is naturally still perfectly modeled by a rank-one bilinear model, as no new information has been added. The averages of the observed parts of the columns are now 1 and 8.2, and subtracting these values leads to the centered matrix Y^{(2)}. This matrix cannot be described by a rank-one model. This is easily realized by looking only at the last three rows: they form a rank-two submatrix, and the addition of the first two rows cannot change this. Thus, by subtracting averages from data with missing elements, the structure of the data has been destroyed and meaningful results cannot be expected. Prior centering no longer leads to elimination of the true offsets, as centering ordinarily does. A numerical illustration is given in the sketch below.
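The following sketch reproduces the effect numerically. The matrix used is a hypothetical rank-one example constructed to match the averages quoted above (4.1 and 8.2, and 1 for the observed part of the first column); it is not necessarily the matrix of the original paper.

```python
import numpy as np

# Hypothetical rank-one matrix with column-averages 4.1 and 8.2.
x1 = np.array([9.0, 8.5, 0.5, 1.0, 1.5])
X1 = np.column_stack([x1, 2 * x1])
Y1 = X1 - X1.mean(axis=0)
print(np.linalg.matrix_rank(X1), np.linalg.matrix_rank(Y1))  # 1 1: centering preserves rank one

# Let the first two elements of column one be missing and subtract the
# averages of the observed elements (1.0 and 8.2):
X2 = X1.copy()
X2[:2, 0] = np.nan
Y2 = X2 - np.nanmean(X2, axis=0)
print(np.linalg.matrix_rank(Y2[2:]))   # 2: the last three rows alone form a rank-two submatrix
```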
Centering is really an extension of the bi- (or multi-) linear model, where offsets are assumed to be present in the model of the data. Data with missing elements constitute one situation in which such a model cannot be fitted in a least squares sense using centering. An alternative to eliminating offsets by preprocessing is given in Section 2.4.
2.3.2 Subtracting the grand mean

The traditional centering across the first mode easily leads to the belief that subtracting averages with the same structure as the offsets will generally eliminate these offsets. This holds for offsets constant across one mode, but it does not hold in general.

Consider a data set structured as a bilinear part plus a constant identical for all elements; that is, all elements have the same common offset, as shown in Eq. (4). It might seem natural to remove this offset by initially subtracting the grand mean from the data. However, this will not simplify the subsequent modeling of the data; in fact, it obscures interpretation of the model, because such a centering leads to artificial mathematical components that also need to be modeled.
To explain this, assume that X is perfectly described by a bilinear part plus a common offset
X = ΦΘ^{T} + 1_{I}1_{J}^{T}μ.
(30)
Centering by removing the overall mean of all elements of X can be written as

Y = X − P_{I}XP_{J}

(31)
where P_{J} is the projection matrix of 1_{J} (=1_{J}1_{J}^{T}/J) and P_{I} of 1_{I} (=1_{I}1_{I}^{T}/I). Then P_{I}XP_{J} is a matrix of the same size as X holding the overall average of X in all its elements. Inserting the true model of Eq. (30) in Eq. (31) leads to
Y = ΦΘ^{T} + 1_{I}1_{J}^{T}μ − P_{I}(ΦΘ^{T} + 1_{I}1_{J}^{T}μ)P_{J} =
ΦΘ^{T} + 1_{I}1_{J}^{T}μ − P_{I}ΦΘ^{T}P_{J} − P_{I}1_{I}1_{J}^{T}μP_{J} =
ΦΘ^{T} + 1_{I}1_{J}^{T}μ − P_{I}ΦΘ^{T}P_{J} − 1_{I}1_{J}^{T}μ =
ΦΘ^{T} − P_{I}ΦΘ^{T}P_{J} =
ΦΘ^{T} − 1_{I}1_{J}^{T}s.
(32) 
The scalar s is the overall average of the true bilinear part ΦΘ^{T}. Even though the overall mean has been removed, a new common offset, s, has been introduced into the preprocessed data, and hence the same number of components is still necessary for modeling the data. Depending on the true parameters of the underlying model (ΦΘ^{T}), the model fitted to the preprocessed data may therefore explain less or more of the original data than if the data had not been preprocessed! Clearly, preprocessing the data by subtracting the overall mean is generally not useful.
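A numerical illustration (simulated, with arbitrary dimensions and rank): subtracting the grand mean leaves the rank at R + 1, whereas centering across the first mode reduces it to R.

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, R = 30, 8, 3
X = rng.normal(size=(I, R)) @ rng.normal(size=(R, J)) + 1.0   # bilinear part plus common offset

Y_grand = X - X.mean()                   # subtracting the grand mean ...
print(np.linalg.matrix_rank(Y_grand))    # ... leaves rank R + 1: a new offset s appears

Y_col = X - X.mean(axis=0)               # centering across the first mode ...
print(np.linalg.matrix_rank(Y_col))      # ... reduces the rank to R
```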
As subtracting the overall level does not remove the offset, another approach must be adopted for handling situations with one common offset. There are basically two ways of treating the problem. The best way is to optimize the loss function of the problem directly in a one-step procedure, rather than to use a two-step procedure where the offsets are first removed (see Section 2.4).
A simpler way of dealing with a constant offset follows from the observation that the model

X = ΦΘ^{T} + 1_{I}1_{J}^{T}μ

(33)

may equivalently be written as

X = ΦΘ^{T} + 1_{I}ν^{T}

(34)

where

ν = μ1_{J}.

(35)
Posed in this way, it is evident that a model with one global offset is a special case of the situation treated earlier, in which each variable (or sample) has its own offset. Therefore, the constant offset may be eliminated by ordinary centering across the first mode. As the offset is constant across rows, this centering removes the constant. Alternatively, one may center across the second mode instead, because the offset is also constant across that mode.
An example is given for illustration. Consider a spectral data set synthesized according to a bilinear three-component part plus a scalar offset of one. Thus, the model is
X = ΦΘ^{T} + 11^{T}. 
(36) 
No noise is added. Using principal component analysis to model the raw data, four components are needed to describe the variation, as expected (Table 1, left). Modeling the data centered by subtracting the overall mean leads to a situation where four components are still needed to describe all systematic variation (Table 1, right). In fact, the three-component model explains less of the original data after this preprocessing.
Table 1. Percentage of variation explained for models with different numbers of components (#), fitted to the raw data (left) and to the data with the overall average subtracted (right).

#    Raw data    Overall average subtracted
1    99.19%      98.05%
2    99.58%      99.08%
3    99.95%      99.67%
4    100.00%     100.00%
5    100.00%     100.00%
Even though only three systematic components should be present, the loading plot (Figure 2, left) clearly shows that the first four components are 'spectral'-like. With proper preprocessing, only three loading vectors are systematic, as shown in Figure 2 (right), where centering across the first mode is used.
Figure 2. Left: loading plot from a principal component model of the data matrix where the overall average is subtracted. Right: the first five loadings when the data have been correctly centered across the first mode.
Single-centering involves fitting several parameters (J or I, respectively, for centering across the first or second mode). When there is only one constant parameter in the true model, as is the case here, this approach introduces a risk of overfitting. It is therefore advisable to center across the mode of largest dimension, so that as few offsets as possible need to be estimated.
To recapitulate, the following rule establishes the 'correct' procedure for removing offsets before fitting the model: centering across one mode removes offsets constant across that mode as well as offsets constant across both modes [Harshman & Lundy 1984]. This important rule also extends to multi-way data of arbitrary order. Thus, centering across one mode removes offsets constant across that mode as well as offsets constant across several modes involving that mode. This generalization follows from realizing that multi-way models can always be considered constrained versions of a bilinear model. Hence, offsets constant across an arbitrary number of modes can always be considered a constrained version of a model with offsets constant across one mode. Centering across one of these modes will therefore eliminate the offsets, because of the projection properties of centering.
2.4 Alternatives to centering

Instead of modeling the data in two steps (removing offsets by centering and then fitting a model to the residuals), it is possible to fit the model in one step, removing the need for projecting the offsets away. Two examples are given.
The missing data problem (Section 2.3.1) can be handled directly in the following way. Assume that the offsets are constant across the first mode and that a principal component analysis model including such offsets is sought. Such a PCA model of the data held in the matrix X, including offsets, reads
X = ΦΘ^{T} + 1μ^{T} + E 
(37) 
where μ is a J-vector. A very simple way to fit this model to a data matrix with missing elements in a least squares sense is to use an alternating least squares approach in which the missing elements are continuously exchanged with their model estimates. Such an algorithm may proceed as follows.
1. Initialize missing values with reasonable values. Then the data set is complete and can be modeled by ordinary techniques.
2. Fit the model including offsets to the (now complete) data set. For PCA, this amounts to centering the data and fitting the PCA model.
3. Exchange the missing values in the data matrix for their model estimates. These estimates improve on the previous values and thus provide a data set in which the estimated missing elements are closer to the correct ones according to the (as yet unknown) true least squares model.
4. Proceed from step 2 until convergence.
This approach can be shown to converge, because it is an alternating least squares algorithm and hence has a non-increasing loss function. Upon convergence to the global minimum, the imputed missing data will have no residuals and hence no influence on the model. The model parameters computed from the completed data are then exactly those that would have been obtained had the model been fitted to the non-missing data directly. This approach can be viewed as a simple special case of expectation maximization [Little & Rubin 1987]. For specific models or specific offsets other approaches can also be feasible, but the above approach is general and easily implemented.
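A minimal NumPy sketch of this scheme for a PCA model with offsets across the first mode. Missing elements are marked by NaN; for simplicity a fixed number of iterations replaces a proper convergence check, and the example data are simulated:

```python
import numpy as np

def pca_with_offsets_missing(X, R, n_iter=500):
    """Fit an R-component PCA model with offsets across the first mode to
    data with missing elements (NaN), by alternating imputation and fitting."""
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)    # step 1: initialize missing values
    for _ in range(n_iter):
        m = Xc.mean(axis=0)                          # step 2: center ...
        U, s, Vt = np.linalg.svd(Xc - m, full_matrices=False)
        Xhat = (U[:, :R] * s[:R]) @ Vt[:R] + m       # ... and fit the PCA model
        Xc = np.where(miss, Xhat, X)                 # step 3: impute from the model
    return Xhat, m                                   # step 4: fixed iteration count here

# Simulated usage: low-rank data plus offsets, with roughly 10% of the elements missing.
rng = np.random.default_rng(6)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6)) + rng.normal(size=6)
X[rng.random(X.shape) < 0.1] = np.nan
Xhat, m = pca_with_offsets_missing(X, R=2)
```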
The problem with one common offset (Section 2.3.2) can be dealt with in the following way. The loss function for a bilinear model with constant overall offset is expressed as
‖X − ΦΘ^{T} − 1_{I}1_{J}^{T}μ‖^{2}.
(38) 
Instead of fitting the overparameterized model
X = AB^{T} + 1m^{T} + E 
(39) 
in a two-step procedure (see Eq. (34)), it is possible to fit the 'correct' structural model
X = AB^{T} + m11^{T} + E
(40)
directly. Instead of the J parameters in the vector m, only one parameter, the scalar m, has to be estimated. The PCA model is used here as an example, but it may be exchanged with any other structural model, including a three-way model. Also, other types of offsets may be used. The loss function may be optimized in several ways leading to a least squares model [Cornelius et al. 1992, van Eeuwijk 1996]. A simple algorithm is based on alternating least squares, in which the offset is first set to an initial value. Then the structural model is fitted to the corrected data X − m11^{T}, using an ordinary PCA algorithm in this case. This provides an update of the bilinear parameters. Subtracting the new interim PCA model from the data leads to
X − AB^{T} = m11^{T} + E.
(41) 
Therefore, m may be determined conditionally on A and B as the overall average of X − AB^{T}. By alternating between updating the bilinear parameters and the offset, the model parameters are obtained upon convergence.
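A sketch of this alternating scheme, again using a truncated SVD as the least squares bilinear fit conditional on the current offset (the example data are simulated):

```python
import numpy as np

def bilinear_plus_constant(X, R, n_iter=200):
    """ALS sketch for Eq. (40): X = A B^T + m 1 1^T + E with one scalar offset m."""
    m = 0.0                                          # initial value of the offset
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X - m, full_matrices=False)
        A = U[:, :R] * s[:R]                         # update the bilinear part ...
        B = Vt[:R].T                                 # ... conditional on m
        m = (X - A @ B.T).mean()                     # update m conditional on A and B (Eq. (41))
    return A, B, m

# Simulated usage: rank-two bilinear part plus a constant offset of 3.
rng = np.random.default_rng(7)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6)) + 3.0
A, B, m = bilinear_plus_constant(X, R=2)
print(m)   # should approach 3.0 upon convergence
```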
In a three-way context using a PARAFAC model, an auxiliary benefit is that the offsets may often be uniquely determined, owing to the uniqueness of the PARAFAC model.
2.5 Summary

Offsets are part of the model hypothesized for the data. Some offsets can be removed by centering before fitting the remaining part of the model. This removal of offsets involves fitting additional parameters. Proper centering is defined as a centering operation that removes the postulated offsets and does not change the structural model of the data. Proper centering is always performed across a single mode. Sequentially centering across several modes is allowed. Hence, proper centering can always be written as P^{⊥}_{I}X or XP^{⊥}_{J} or P^{⊥}_{I}XP^{⊥}_{J}, depending on whether centering is performed across the first, second or both modes. Here P^{⊥}_{I} is an I × I matrix defined as (I − (11^{T}/I)), where I is an I × I identity matrix and 1 an I-vector of ones. The matrix P^{⊥}_{J} is defined similarly. If elements are missing or other offsets are to be modeled, this has to be done using a one-step modeling approach in which offsets and other parameters are considered simultaneously (usually using iterative algorithms).
3. Two-way scaling

Unlike centering, scaling does not change the structure of the model. Scaling is used to change the weights given to different parts of the data in fitting the model. Though scaling is important, it usually has a much less dramatic influence on the fitted model than centering, as long as the model and the scaling are reasonable [Westerhuis et al. 1999]. Some issues related to scaling are discussed by Paatero & Tapper [1994].
3.1 Reasons for scaling

Scaling is used for several reasons. Some important ones are
i) To adjust scale differences
ii) To accommodate heteroscedasticity
iii) To allow for different sizes of subsets of data (block-scaling)
Re i). It is quite common to use, for instance, autoscaling (centering across the first mode and scaling to unit standard deviation within the second mode) to make the initial variance of each variable identical. Thereby all variables have the same variance, and as the subsequent fitting of a model is performed so as to describe as much systematic variation as possible, every variable has the same initial opportunity of entering the model. This type of scaling is especially useful when the variables are measured in different units (e.g. Pascal, Centigrade, ...).

Re ii). The ordinary least squares fitting of a model is statistically optimal in a maximum likelihood sense if the errors are homoscedastic, independent and Gaussian. If the error variances are not all the same, but are the same within, e.g., a specific variable, it is possible to accommodate this in the fitting procedure by initially scaling the data within the variable mode. By scaling each variable with the inverse of the standard deviation of its residual variation, the fitted model will be optimal in a maximum likelihood sense.

Re iii). When the data are made up of several subsets of very different sizes, it may be advantageous to scale each block separately in order to ensure that all the different blocks are allowed to influence the model. Consider, for example, a situation in which 5000 wavelengths are measured in an infrared spectrum (absorbance between 0 and 1) and one variable is given for the temperature. Due to the huge difference in the number of variables (5000 and one, respectively), the total variance of the infrared spectra will be tremendous compared to that of the temperature. If no scaling is applied to adjust for this difference, then the model is implicitly forced to focus on the infrared data. Explaining the temperature variable will then contribute little to the fit, unless the model is so complex that it can fit both subsets simultaneously or unless the temperature data happen to be in accordance with the infrared data. If the infrared data and the temperature data are initially believed to be equally important, then scaling both subsets to the same total variance will provide a model that reflects this assumption. Thus, in this case scaling is used from an information point of view, to ensure that all important information can enter the model irrespective of the variance of the different sources of information.
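A sketch of block-scaling for the example just given (simulated blocks; each block is scaled to total variance one before the blocks are joined):

```python
import numpy as np

rng = np.random.default_rng(8)
spectra = rng.normal(size=(40, 5000))   # hypothetical infrared block
temp = rng.normal(size=(40, 1))         # single temperature variable

def block_scale(B):
    """Scale a block of variables to a total variance of one."""
    return B / np.sqrt(B.var(axis=0, ddof=1).sum())

# Both blocks now carry the same total variance and can influence a model equally.
X = np.hstack([block_scale(spectra), block_scale(temp)])
```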
It is important to note that even if no scaling is applied, the data are still scaled by the weight one. Thus, scaling (or the loss function in general, as will be shown in the next section) always has to be considered before fitting the model.
All the above-mentioned reasons can be put under the same heading, weighted least squares fitting, which is a more general approach to fitting models than the mere use of scaling. This will be elaborated in the next section.
3.2 How scaling works

Scaling is a subject often treated in conjunction with centering. However, the purpose of scaling is very different from the purpose of centering. Scaling is a way of introducing a loss function other than the least squares loss function normally used. Therefore, scaling does not change the interpretation of the model and its parameters. As for centering, scaling has to be performed in a specific way in order not to introduce artificial structure that needs to be modeled. This becomes even more apparent when going to three-way models.
3.2.1 Different types of scaling

Scaling is usually performed by multiplying each column or each row of the data matrix by a scalar. There are two types of scaling that are relevant for two-way matrices. One is scaling within the first mode, where every row is multiplied by a specific number:
Y = WX
(42) 
where W is an I × I diagonal matrix with the scaling parameter for the ith row as its ith diagonal element. This is the type of scaling used, for example, in standard normal variate correction [Barnes et al. 1989, Guo et al. 1999, Helland et al. 1995], where the norm or area of each row of X is scaled to the same scalar value by using an appropriate W. It extends easily to multi-way arrays, as discussed in Section 5.2.
The other main type of scaling is within the second mode, where every column is multiplied by a specific number:

Y = XW

(43)
where W is here a J × J diagonal matrix with the scaling parameter for the jth column as its jth diagonal element. This is the scaling ordinarily used in, e.g., PCA, where the weight of a specific column (variable) is often chosen as the inverse of the standard deviation of the variable. In combination with centering across the first mode, such scaling within the second mode is often referred to as autoscaling in the two-way case.
An example of scaling according to Eq. (43) is shown in Figure 3. One-component bilinear data with huge random residual variation in the last half of the variables (upper left) are generated. The resulting X has the spectra (as plotted in Figure 3, upper left) in its rows. The first principal component is seen to correlate well with the reference score generating the data (lower left). When the data are initially weighted as in Eq. (43) by the inverse of the residual standard deviation (upper right), the right part of the data is down-weighted substantially. The correlation between the true known score ('concentration') and the estimated score is even higher for the scaled data (lower right) than for the raw data (left). Even in this extreme case, the influence of the noise is not at all as drastic as might be expected. This illustrates that scaling is not as critical as centering, as long as the scaling is reasonable and the variables relevant. Note that the scaling in this example is not the usual autoscaling using the inverse standard deviation of the data. Rather, the inverse standard deviation of the residual variation, assessed for instance from replicates, is used.

Figure 3. Influence of scaling on the fitted model.
3.2.2 Scaling and weighted least squares fitting

Given a data matrix X (I × J) and a model X̂ (I × J), which may, for example, be a bilinear model (X̂ = AB^{T}), a standard approach for determining the model and its parameters is to fit the model in a least squares sense by minimizing the loss function
‖X − X̂‖^{2}
(44) 
which can also be expressed
Σ_{i=1}^{I} Σ_{j=1}^{J} (x_{ij} − x̂_{ij})^{2}.
(45) 
When the different elements of the data have different uncertainties or relevances, it is possible to fit the model using a weighted least squares loss function. In one of its simplest forms this can be expressed as
‖(X − X̂)*Ω‖^{2}
(46) 
where Ω (I × J) holds in its ijth element the weight of the ijth element of X, and '*' is the Hadamard (element-wise) product. Often the weight of an element is set equal to the inverse of the standard deviation of its residual variation. It is also possible to use more elaborate weights if there are certain correlations between the residuals [Bro et al. 2001, Wentzell & Lohnes 1999]. The above weighted loss function can also be expressed in scalar notation as
Σ_{i=1}^{I} Σ_{j=1}^{J} (ω_{ij}(x_{ij} − x̂_{ij}))^{2}.
(47)
In a maximum likelihood sense, this loss function is optimal if the weights reflect the uncertainties of the individual elements, if there is no correlation between the residuals of different elements, and if the residuals are normally distributed. If the uncertainty of a given variable is the same for all objects, the above loss function turns into
‖(X − X̂)*Ω‖^{2} = ‖(X − X̂)W‖^{2}
(48)
where the weight matrix W is now a diagonal matrix holding the column-specific weights on its diagonal. The loss function may be transformed as follows:
‖(X − X̂)W‖^{2} = ‖XW − X̂W‖^{2}.
(49) 
In the case of a bilinear model the above reads

‖(X − AB^{T})W‖^{2} = ‖XW − AB^{T}W‖^{2} = ‖XW − AH^{T}‖^{2}
(50) 
Thus, by fitting the bilinear model AH^{T} to the data scaled within the second mode, XW, in an ordinary least squares sense, the weighted loss function of Eq. (48) is automatically optimized. This is the basic mathematical rationale behind scaling. If the sought model X̂ has the structure AB^{T}, then fitting a bilinear model to the scaled data provides the model in the form AH^{T}. The score matrix (or a rotated version of it) is thus directly provided in A, whereas the loadings of the original problem are found by premultiplying the found loadings H by W^{−1}, as
B^{T}W = H^{T} ⇒ B^{T} = H^{T}W^{−1} ⇒ B = W^{−1}H.
(51) 
Sometimes, only the scaled data are considered and model parameters are not transformed back to the original domain. The appropriateness of this approach is still, however, governed by the fact that scaling as outlined above maintains the bilinear structure assumed reasonable for the raw data.
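A numerical sketch of Eqs. (50) and (51) with simulated data and illustrative weights: the bilinear model is fitted to the scaled data XW by a truncated SVD, and the loadings are transformed back with W^{−1}:

```python
import numpy as np

rng = np.random.default_rng(9)
I, J, R = 30, 6, 2
X = rng.normal(size=(I, R)) @ rng.normal(size=(R, J)) + 0.05 * rng.normal(size=(I, J))

w = 1.0 / X.std(axis=0, ddof=1)          # illustrative column weights
W = np.diag(w)

# Ordinary least squares fit of the bilinear model to the scaled data XW.
U, s, Vt = np.linalg.svd(X @ W, full_matrices=False)
A = U[:, :R] * s[:R]
H = Vt[:R].T

B = np.diag(1.0 / w) @ H                 # back-transformation B = W^{-1} H (Eq. (51))

# The weighted loss of Eq. (50) equals the ordinary loss on the scaled data.
print(np.linalg.norm((X - A @ B.T) @ W) ** 2,
      np.linalg.norm(X @ W - A @ H.T) ** 2)
```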
There is a direct connection between ‖(X − AB^{T})W‖^{2} and ‖XW − AH^{T}‖^{2}. However, there is no direct connection between the solution to ‖(X − AB^{T})W‖^{2} and the solution to ‖X − TP^{T}‖^{2} unless the model has perfect fit. That is, fitting the bilinear model to scaled and to unscaled data are two different problems with no direct relation.
3.3 When scaling does not work

When scaling within several modes is desired, the situation is complicated, because scaling one mode affects the scale of the other mode. For example, scaling to a standard deviation of one within both the first and the second mode will generally not be possible, not even using iterative scaling [Harshman & Lundy 1984]. If scaling to a mean square of one is desired within both modes, this has to be done iteratively, until convergence [Harshman & Lundy 1984, ten Berge 1989]. Using mean squares rather than, for instance, standard deviations for scaling has the attractive property that iterative scaling is guaranteed to converge, provided no centering is included in the iterative scheme [Harshman & Lundy 1984].
Iterative preprocessing may seem unsatisfactory, because it tends to complicate the subsequent evaluation and validation of the preprocessing, since more than one set of scaling parameters per mode has to be used. The several matrices holding the scaling parameters from each iteration may, though, be combined into one single matrix per mode [Harshman & Lundy 1984]. An ad hoc alternative is to skip the iterative preprocessing and perform only one scaling of each mode. The purpose of scaling is mainly to bring the levels of variation of different variables to some sort of equivalence. Therefore, one round of scaling can suffice, so that no part of the data has an unreasonably large influence on the subsequent fitting.
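A sketch of iterative scaling to unit mean square within both modes (an arbitrary positive matrix; a fixed number of iterations is used for simplicity):

```python
import numpy as np

rng = np.random.default_rng(10)
Y = np.abs(rng.normal(size=(8, 5))) + 0.1

# Alternate row and column scaling; with mean squares (and no centering in the
# loop) this iteration is guaranteed to converge [Harshman & Lundy 1984].
for _ in range(100):
    Y = Y / np.sqrt((Y ** 2).mean(axis=1, keepdims=True))   # rows to unit mean square
    Y = Y / np.sqrt((Y ** 2).mean(axis=0, keepdims=True))   # columns to unit mean square

print((Y ** 2).mean(axis=1), (Y ** 2).mean(axis=0))         # all close to one
```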
3.4 Alternatives to scaling

As shown in the preceding paragraphs, scaling can be considered a special case of using a weighted least squares loss function. When more complicated weights are needed, it is not always possible to fit the model indirectly by fitting the least squares model to scaled data. In such situations an alternative to scaling is to use algorithms that directly handle a weighted least squares optimization criterion [Bro 1998, Kiers 1997, Paatero 1997]. This can be relevant, for example, when the residual variation is correlated both across rows and across columns [Andrews & Wentzell 1997, Bro et al. 2001, Wentzell et al. 1997].
Scaling does not affect the structural model of the data, but the loss function used to estimate the model parameters. Proper scaling is defined as the scaling that can be expressed in terms of pre- or post-multiplied weights in the loss function of the model. Hence, proper scaling of X can always be expressed as W_{I}X or XW_{J} or W_{I}XW_{J}, where W_{I} and W_{J} are diagonal matrices holding, for example, the inverse standard deviation of the corresponding row or column. If a certain scale is needed for both modes, then the corresponding weights have to be found iteratively. Scaling, for example, to unit standard deviation in two modes is not possible in general, whereas scaling to unit mean square variation is possible.
A complicating issue in preprocessing is the interdependence of centering and scaling [ten Berge & Kiers 1989]. Because preprocessing is mostly performed in one or a few standardized ways in two-way analysis, the problems are seldom appreciated. It is important, however, to be aware that not all combinations of centering and scaling will work as anticipated (see e.g. Harshman and Lundy [1984]). Generally, only two combinations are straightforward: centering across both modes, or scaling within one mode combined with centering across the other mode [Harshman & Lundy 1984, Kruskal 1983]; the latter is exactly what, e.g., autoscaling amounts to.
Scaling within one mode disturbs prior centering across the same mode, but not across other modes [ten Berge 1989]. This holds for two-way arrays as well as higher-order arrays. The reason for this is illustrated in Figure 4. The solid line shows a typical column vector and the dashed line a typical row vector. When scaling within the first mode, the elements of any column are multiplied by different numbers and hence prior centering across the first mode is destroyed.
Consider a two-way array X (I×J). If the array is scaled within the first mode, this can be expressed as
Y = WX 
(52) 
where Y is the scaled array, W is an I × I diagonal matrix holding the scaling parameters and X is the original array. As can be seen, scaling within the first mode amounts to a multiplication of every row by a scalar. This does not affect any centering of the vectors across the second mode, because every element in a row vector is multiplied by the same number. The average of any row will be the original average of the row scaled down accordingly, and therefore, if the average is zero, it will stay zero. In the first mode, however, each element in a column is multiplied by a different scalar. If centering is performed across the first mode, these column vectors will not necessarily preserve their zero average after subsequent scaling within the first mode. Mathematically, the centered matrix P^{⊥}X becomes WP^{⊥}X upon scaling. As 1^{T}WP^{⊥}X ≠ 0 in general, the preprocessed matrix is no longer guaranteed to be centered. Offsets constant across the first mode, however, will still be removed, because
P^{⊥}(ΦΘ^{T} + 1_{I}μ^{T}) = P^{⊥}ΦΘ^{T} ⇒ WP^{⊥}(ΦΘ^{T} + 1_{I}μ^{T}) = WP^{⊥}ΦΘ^{T}. 
(53)
Note also the interesting fact that if scaling is performed before centering, the result will be different. In that case, the original offsets will not be removed, but the data will be centered (yielding centered scores and residuals), because
P^{⊥}W(ΦΘ^{T} + 1_{I}μ^{T}) = P^{⊥}WΦΘ^{T} + P^{⊥}W1_{I}μ^{T} ≠ P^{⊥}WΦΘ^{T}. 
(54)
This holds because, unlike P^{⊥}1, P^{⊥}W1 is not a zero vector in general.
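A small NumPy sketch of the two orders of operations (variable names are ours):

import numpy as np

rng = np.random.default_rng(2)
I, J = 6, 4
signal = rng.normal(size=(I, 2)) @ rng.normal(size=(2, J))
X = signal + np.ones((I, 1)) @ rng.normal(size=(1, J))  # bilinear part plus offsets

Pc = np.eye(I) - np.ones((I, I)) / I   # centering projector across the first mode
W = np.diag(rng.uniform(0.5, 2.0, I))  # scaling within the first mode

# centering first, scaling afterwards: offsets removed, columns not centered
Y1 = W @ Pc @ X
print(np.allclose(Y1, W @ Pc @ signal))  # True: offsets eliminated, cf. Eq. (53)
print(np.abs(Y1.mean(axis=0)).max())     # generally nonzero

# scaling first, centering afterwards: columns centered, offsets not removed
Y2 = Pc @ W @ X
print(np.abs(Y2.mean(axis=0)).max())     # practically zero, cf. Eq. (54)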
Figure 4. Two-way array showing the dependency between centering and scaling
Centering across one mode disturbs scaling within all modes. This holds for two-way as well as multiway arrays, but there are certain cases for which it does not hold. One of these special cases is the situation in which two-way data are scaled within the second mode to a standard deviation of one and subsequently centered across the first mode (autoscaling)*. This subsequent centering will not disturb the scaling within the second mode (though it would disturb scaling within the first mode, had that been performed). The reason is that the scaling is specifically performed relative to the center of the data (standard deviations are based on centered data). Hence, any change in offset is immaterial for the standard deviation. Scaling by other means than standard deviations will not have this property. In multiway analysis it is common to use mean squares for scaling instead of standard deviations, because such scaling more often converges when implemented in an iterative scheme and because scaling by standard deviations implicitly assumes an offset, which may or may not be present depending on the structural part of the model.
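The shift-invariance of the standard deviation is easily verified numerically (a NumPy sketch with illustrative data):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5)) * rng.uniform(1, 10, size=5) + 3.0

# scale within the second mode to unit standard deviation ...
Xs = X / X.std(axis=0)
# ... and subsequently center across the first mode (together: autoscaling)
Y = Xs - Xs.mean(axis=0)

print(Y.std(axis=0))                 # still all ones: centering did not disturb the scaling
print(np.abs(Y.mean(axis=0)).max())  # practically zero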
Centering across a mode within which scaling is also applied, or vice versa, is generally not going to retain all the properties of the two individual operations, as discussed earlier. For example, if centering across the first mode and scaling within the first mode is desired, then setting
Y = WP^{⊥}X 
(55)
with W being a diagonal scaling matrix and P^{⊥} a centering operator will not lead to a preprocessed matrix in which the first-mode vectors are centered, although possible offsets will be eliminated. Conversely, setting
Y = P^{⊥}WX 
(56)
will not eliminate the original offsets, even though the preprocessed array will have centered first-mode vectors. Hence, for a specific application, a choice has to be made between these two approaches, depending on why the centering is applied.
Proper centering is a centering operation that correctly removes the presupposed offsets and which does not introduce other offsets into the data. Likewise, proper scaling introduces the correct weights into the loss function. Stated otherwise, proper centering and scaling do what they are supposed to do. Proper centering and proper scaling can sometimes be combined. Unproblematic combinations can always be expressed as
(P^{⊥}X)W or W(XP^{⊥}) 
(57)
where P^{⊥} is the projection matrix that centers the data and W is a weighting matrix, both of appropriate sizes. The brackets in Eq. (57) indicate the proper order in which the different preprocessing steps have to be performed. If performed oppositely, offsets will not be removed, although the data will be centered. Problematic combinations are
WP^{⊥}X or XP^{⊥}W. 
(58)
These combinations do not retain all the properties desired of the preprocessing steps.
The preprocessing of multiway arrays will now be discussed using three-way arrays as an example. The basic properties discussed thus far are unchanged. Centering has to be performed across a specific mode, and scaling has to be performed by a transformation within a specific mode. Most difficulties in preprocessing three-way arrays arise because of the problems outlined so far, which all generalize to multiway arrays. The problems are sometimes enhanced, because three-way data are often rearranged (matricized) to two-way arrays before preprocessing. This is unfortunate, because it introduces a column mode that is a combination of two of the original modes. Transformation within or across this combined mode should be avoided if multiway models are to be fitted, because the mode is not a 'real' mode, but merely a computational construct.
The observations on centering of two-way data are helpful in discussing centering of three- and higher-way arrays. If the basis of two-way centering is understood, then three-way centering is quite simple. In the following, it will be assumed that the true model of the data is a PARAFAC model plus possible offsets, but the conclusions hold for any multilinear model.
Figure 5. Structure of offsets in three-way (trilinear) data when all elements have the same offset held in the scalar λ. Two alternative but equivalent ways of writing the PARAFAC model are shown: the slab notation using submatrices X_{k} and the notation described in Eq. (12)
Consider a three-way array. Conceptually, offsets may occur in three different ways: constant across all modes (Figure 5), constant across two modes (Figure 6) or constant across one mode only, as shown in Figure 7. In the figures, the first-mode loadings are held in the I×R matrix A, the second-mode loadings are held in the J×R matrix B and the third-mode loadings are held in the K×R matrix C. The matrix D_{k} is an R×R diagonal matrix holding the kth row of C in its diagonal.
Figure 6. Structure of offsets in three-way (trilinear) data when all elements in each vertical slab have the same offset. The offsets are held in the J-vector λ
Figure 7. Structure of offsets in three-way (trilinear) data when all elements in each vector have the same offset (case three). The offsets are held in a matrix Λ (J×K) whose jk'th element holds the offset of the vertical column with second and third mode index j and k. The vector λ_{k} is the k'th row of the matrix Λ
Regardless of the structure of the offsets, the basic principle of centering is that the data must be preprocessed so that they are projected onto the nullspace of vectors of ones in a particular mode. For the first mode, projecting a data array X onto the nullspace of 1^{T}, i.e., centering across the first mode, amounts to
Y = P^{⊥}X^{(I×JK)}, 
(59)
where P^{⊥} = I − (11^{T}/I). For the second and third modes, the centering can be performed similarly. As mentioned earlier, such centering is referred to as single-centering. Centering, for example, across the first mode of an array can thus be done by matricizing the array to an I × JK matrix, and then centering this matrix across the first mode as in ordinary two-way analysis:
y_{ijk} = x_{ijk} − (1/I) Σ_{i'=1}^{I} x_{i'jk}. 
(60)
The column mean is subtracted from every element, as depicted graphically in Figure 8. As can be seen, single-centering is similar in structure to the type of offsets shown in Figure 7.
Figure 8. Centering across the first mode. For each column of the raw data, a mean value is calculated and subtracted from each element in the column. Thus, a two-way matrix of mean values is obtained. Note that this centering is identical to matricizing the data to a two-way structure and centering these two-way data across the first mode
Such single-centerings performed successively across two modes are referred to as double-centering. That is, double-centering is performed by first centering across one mode, and then centering the outcome across another mode. The order of centerings is immaterial, but it is essential that they are performed sequentially. For all three situations depicted in Figure 5 to Figure 7, the above centering across the first mode will remove the shown offsets, because in all three situations the offsets are constant across the first mode.
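In NumPy terms, single- and double-centering of a three-way array can be sketched as follows (the reshape-based matricization is an illustrative implementation choice):

import numpy as np

rng = np.random.default_rng(4)
I, J, K = 5, 4, 3
X = rng.normal(size=(I, J, K)) + 2.0   # three-way array with an offset

# single-centering across the first mode: matricize to I x JK,
# subtract the column means and rearrange back
X1 = X.reshape(I, J * K)
Y1 = (X1 - X1.mean(axis=0)).reshape(I, J, K)

# double-centering: subsequently center the outcome across the second mode
Y2mat = Y1.transpose(1, 0, 2).reshape(J, I * K)
Y2 = (Y2mat - Y2mat.mean(axis=0)).reshape(J, I, K).transpose(1, 0, 2)

# both centerings are preserved, as centering across one mode
# does not disturb centering across another
print(np.abs(Y2.mean(axis=0)).max(), np.abs(Y2.mean(axis=1)).max())  # both ~0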
As for the two-way case, only single-centering leads to the properties sought in centering (removal of offsets). Other types of centering, such as subtracting the overall mean, will introduce artifacts that have to be modeled in addition to the inherent systematic variation. This was shown for the two-way case in Section 2.3.2. Such types of incorrect centering are often used in three-way analysis, for example, when matricized data are centered across a combined mode. If an array of structure (variables × time × samples) is centered by subtracting the average calculated across samples and time, then, in line with the above example, artificial offsets are introduced and the subsequent model will have to fit this additional variation as well. This can obscure validation and exploration of the model and will lead to models that do not provide overall least squares solutions.
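The two-way version of this artifact is easily reproduced: subtracting the grand mean from offset-containing bilinear data generally leaves one extra component to model, whereas proper centering across the first mode does not (an illustrative NumPy sketch):

import numpy as np

rng = np.random.default_rng(5)
# rank-two bilinear part plus column offsets
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 10)) + rng.normal(size=(1, 10))

def numerical_rank(A, tol=1e-10):
    return int((np.linalg.svd(A, compute_uv=False) > tol).sum())

print(numerical_rank(X - X.mean(axis=0)))  # proper centering: 2 components
print(numerical_rank(X - X.mean()))        # grand-mean centering: generally 3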
As explained for the two-way case, scaling is a transformation of a particular variable (or object) space. Instead of fitting the model to the original data, the model is fitted to the data transformed by a (usually) diagonal scaling matrix in the mode whose variables are to be scaled. This means that in three-way analysis whole matrices, instead of columns, have to be scaled by the same value. For a four-way array, three-way slabs would have to be scaled by the same scalar. Mathematically, scaling within the first mode can be described as
y_{ijk} = w_{i}x_{ijk}, i=1,...,I; j=1,...,J; k=1,...,K 
(61)
where Y with elements y_{ijk} is the scaled array and, for instance, setting
w_{i} = (Σ_{j=1}^{J} Σ_{k=1}^{K} x_{ijk}^{2}/(JK))^{-1/2} 
(62)
will scale to a unit mean square within the sample mode. Using matricized arrays, scaling may be expressed as
Y^{(I×JK)} = WX^{(I×JK)} 
(63)
where W is an I × I diagonal matrix holding the scaling values in its diagonal. The assumed structural model of the data is a multilinear model, e.g. a PARAFAC model X̂^{(I×JK)} = A(C⊙B)^{T}. With the above scaling, a similar structural model, Ŷ^{(I×JK)} = P(R⊙Q)^{T}, will hold for the transformed data, because
||Y − P(R⊙Q)^{T}||^{2} = ||WX − P(R⊙Q)^{T}||^{2} = ||W(X − W^{-1}P(R⊙Q)^{T})||^{2}. 
(64)
The loss function minimized when fitting a model with a weighted criterion is
||W(X − A(C⊙B)^{T})||^{2} 
(65)
and hence it holds that the parameters of this model can be found by fitting the scaled data and setting
A = W^{-1}P, B = Q, and C = R. 
(66)
Thus, fitting the scaled data provides a solution not only to the problem posed as fitting a model to Y, but also to the problem of fitting a model to X, where the first-mode loadings obtained are transformed by the scaling matrix W. As for two-way scaling, it is emphasized that the found model parameters have no direct relation to the parameters found when fitting the model to the raw data in a least squares sense.
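Scaling within the first mode of a three-way array to unit mean square, cf. Eq. (62), can be sketched as follows (illustrative NumPy code):

import numpy as np

rng = np.random.default_rng(6)
I, J, K = 6, 5, 4
X = rng.normal(size=(I, J, K))

# one weight per horizontal slab, chosen to give unit mean square per slab
w = 1.0 / np.sqrt((X**2).mean(axis=(1, 2)))
Y = w[:, None, None] * X

print((Y**2).mean(axis=(1, 2)))  # all ones

# first-mode loadings fitted to Y relate to those of X via W^{-1}:
# A = diag(1/w) @ P, with B = Q and C = R unchanged, cf. Eq. (66)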
Scaling has to be applied by transforming the data within a given mode. It is not appropriate to scale an array within two combined modes, which can happen, for example, when autoscaling a matricized array. Such inappropriate scaling will lead to the inclusion of artificial components in the data.
Figure 9. Incorrect scaling of a three-way array. Spectra from one sample measured at two conditions (slabs one and two). The top plot shows the data before scaling and the lower one after a hypothetical scaling.
Consider an I × 30 × 2 three-way array with I samples and thirty variables (say spectral), measured at two conditions (e.g., two different pH values) as shown in Figure 9. In the two top plots, the profiles of the first hypothetical sample are shown (30 variables). To the left the measured spectrum is shown at the first condition and to the right at the second condition. As can be seen, the shape is identical in the two plots; only a multiplicative factor distinguishes the profiles. In this case, only one component is necessary for describing this variation. If the array is scaled such that different occasions of the same variable are scaled differently, then the phenomena can no longer be described by the variation of one basic profile. This is shown in the lower plots, where the first three wavelength variables have been scaled differently in slab one and two. In slab one they have been scaled by one and in slab two by 0.01. It is clear that to preserve the multilinearity of the data, any occurrence of a given variable must be scaled by the same amount.
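This loss of one-component structure is easily reproduced (a NumPy sketch with an arbitrary profile):

import numpy as np

rng = np.random.default_rng(7)
b = rng.normal(size=30)            # one underlying spectral profile
X = np.stack([1.0 * b, 0.7 * b])   # the same profile at two conditions

def numerical_rank(A, tol=1e-10):
    return int((np.linalg.svd(A, compute_uv=False) > tol).sum())

print(numerical_rank(X))           # 1: one basic profile suffices

Xs = X.copy()
Xs[1, :3] *= 0.01                  # first three variables scaled differently in slab two
print(numerical_rank(Xs))          # 2: the one-profile description is lost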
To show the influence of incorrect scaling, consider a synthetic data set with PARAFAC structure
X^{(I×JK)} = A(C⊙B)^{T} 
(67)
where A is a 4 × 2 matrix of random numbers and B and C are defined likewise. It is not important how these matrices are generated, as long as they have full column rank. Consider the following alternative two-component PARAFAC models:
1. Using X
2. Using X centered across the first mode and scaled within the combined second and third mode (autoscaled as a two-way matrix, X^{(I×JK)})
3. Using X scaled within, e.g., mode two.
The fit values of these three models, always using two components, are given below in percentages of the sum of squares of the preprocessed data.
1. 100.00%
2. 98.80%
3. 100.00%
As can readily be seen, a two-component model is appropriate and should remain so, even after scaling as in case (3). However, using ordinary two-way scaling methods as in case (2) destroys the multilinear structure of the data and deteriorates the model.
The exact same rules for interdependence of preprocessing steps apply for multiway data as for two-way data (Section 4), also with respect to treating missing data etc. Any preprocessed array may be written in matrix notation using the matricized I × JK data array X and the preprocessed array Y as
Y = M_{I}X(M_{K} ⊗ M_{J})^{T} 
(68)
where, e.g., M_{I} is an I × I array holding either the centering or scaling transformation matrix for the first mode, or even a combination of such. The exact content of these transformation matrices depends on the type of preprocessing chosen, and in the case of iterative preprocessing M may be a product of several matrices [Harshman & Lundy 1984].
Combined centering and scaling in one operator M is generally not going to retain all the properties of the two individual operations (see Section 4.3). If centering across the first mode and scaling within the first mode is desired, centering first and scaling afterwards will not lead to an array in which the first-mode vectors are centered, although possible offsets will be eliminated. Scaling first and centering afterwards will not eliminate the original offsets, even though the preprocessed array will have centered first-mode vectors.
Another example based on the guidelines in Section 4 is a situation in which, e.g., scaling within the second and third modes is desired together with centering across the first mode. If the preprocessing is performed so that the data are centered after the weights are determined (iteratively) and applied,
Y = P^{⊥}(X(W_{K} ⊗ W_{J})^{T}), 
(69)
the final centering will disturb the scaling. With
Y = (P^{⊥}X)(W_{K} ⊗ W_{J})^{T} 
(70)
this is not the case. In this case, the data are first centered, and then the weights are determined from the centered array rather than from the raw data. Hence, the weights in Eq. (70) are preferred.
Proper single-centering of a multiway array can always be expressed as
P^{⊥}X 
(71)
where X (I × JKL...) is a multiway array rearranged to a two-way matrix such that the mode to be centered across is the row mode. Hence, P^{⊥} works on the non-matricized mode (I in this case). All combinations of centering of this form are proper and will maintain a preprocessed array with offsets removed.
Proper scaling of a multiway array can be expressed as
WX 
(72)
where X (I × JKL...) is a multiway array rearranged to a two-way matrix such that the mode to be scaled within is the row mode. Scaling, e.g., to unit mean square variation within several modes sequentially will not yield modes within which the variables or samples have unit mean square variation unless iterative determination of the weights is used.
Combinations of centering and scaling are unproblematic only when centering across several modes is desired, or when scaling within one mode is combined with centering across other modes. Centering must be performed before scaling. All other combinations will only partly fulfill the requirements. E.g., centering across a mode followed by scaling within the same mode will not lead to zero-average vectors in that mode, but will remove any offsets across the mode.
A number of important features of the common preprocessing steps centering and scaling have been discussed.
· Centering deals with the structural model; scaling with the way this model is fitted.
· Centering is a part of a two-stage procedure in which offsets are removed first and the multilinear terms are estimated in the second stage. This is only equivalent to the one-stage procedure of estimating all parameters simultaneously if proper centering, as defined in this paper, is used. Proper centering is shown in Figure 10 (always across one mode at a time).
· For offsets that cannot be removed by using proper centering, a one-stage procedure has to be used. This holds generally for data with missing elements.
· Scaling provides a way to change the objective function by assuming certain weights. Some weight arrangements can be dealt with by scaling followed by ordinary least squares fitting. Only proper scaling is allowed. Proper scaling is shown in Figure 10. For weighting schemes that cannot be dealt with by scaling, weighted least squares algorithms have to be used.
· Incorrect centering or scaling introduces artificial variation. The amount of artificial variation introduced depends on the data, and leads to models that are suboptimal compared to their 'correct' (least squares) counterparts. This is so because the artificial variation has to be modeled additionally.
Two-way results
· Proper centering can always be written as P^{⊥}X
· Several centerings can be performed sequentially
· Proper scaling can always be expressed as WX
· Several scalings can be performed sequentially, but will generally need iterations to establish the scaling constants, and this may not converge
· Unproblematic combinations of centering and scaling can be expressed as (P^{⊥}X)W
Similar results hold for transposed matrices.
Three-way results
· Proper centering can always be written P^{⊥}X, where X is the three-way array matricized so that the mode to be centered across is the first mode
· Several such single-centerings may be performed sequentially across several modes
· Proper scaling can always be expressed WX for a matricized array as above
· Several scalings can be performed sequentially, but will generally need iterations and may not converge
· Proper combinations of centering and scaling can be expressed similarly to the two-way case. That is, scaling does not affect centering across other modes, but centering affects scaling within all modes.
The appropriate centering and scaling procedures can most easily be summarized as in Figure 10. Centering must be done by subtracting scalars from individual vectors of the array, while scaling must be performed by multiplying individual slabs.
Figure 10. Three-way array. Proper centering must be done across a mode, exemplified here by proper centering across the second mode. From all elements of a specific row (fixed i and k) the same scalar m_{ik} is subtracted. Proper scaling, e.g., within the first mode is performed such that all elements of a specific horizontal slab are multiplied by the same scalar w_{i}.
Most algorithms used in this paper are available from the home page of Food Technology at http://www.models.life.ku.dk. The first author is grateful for financial support through LMC (Center for Advanced Food Studies), EU project NwayQual GRD1-1999-10377 and AQM (Advanced Quality Monitoring) supported by the Danish Ministries of Research and Industry. Anonymous referees are thanked for helpful comments.
In this appendix, the projection of vectors on other vectors is explained. In Figure 11 the orthogonal projection of an I-vector b on an I-vector a is considered. The resulting projection, b̂, can be expressed in the original coordinate system or in terms of the new basis vector a.
Figure 11. Projection of b on a.
Expression of b̂ in terms of a
The score of b on the new basis vector a is k. The following equations hold
k||a|| = ||b|| cos θ and (a,b) = ||a|| ||b|| cos θ 
(73)
where the second equation is called the cosine rule for vectors. Hence,
k = (a,b)/(a^{T}a) = a^{T}b/(a^{T}a). 
(74)
This expression becomes particularly simple when a is normalized to length one (||a|| = 1):
k = (a,b) = a^{T}b 
(75)
Note that expressions like Eq. (75) are used in principal component analysis (PCA), where usually the loading vectors p are chosen to be of length one, and then
t_{i} = x_{i}^{T}p 
(76)
where x_{i}^{T} is a row of X (I×J). Hence, the scores of PCA are just orthogonal projections of the rows of X on p, in coordinates of this p.
Expression of b̂ in terms of the original coordinate system
It is also possible to express b̂ in the original coordinate system. Referring to Figure 11, it holds that b̂ = ka. Hence,
b̂ = ka = a(a^{T}b/(a^{T}a)) = (aa^{T}/(a^{T}a))b = Pb 
(77)
and the matrix P = aa^{T}/(a^{T}a) (I×I) is special because
P^{T} = P and PP = P 
(78)
which means that P is symmetric and idempotent. Such a matrix is an orthogonal projection matrix [Schott 1997]. It projects orthogonally on the vector a. In the case of centering (see Eq. (15)), the vector 1 takes the role of a.
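These properties are easily checked numerically (a NumPy sketch):

import numpy as np

rng = np.random.default_rng(8)
I = 5
a = rng.normal(size=(I, 1))
b = rng.normal(size=(I, 1))

P = a @ a.T / (a.T @ a)   # orthogonal projection onto a, cf. Eq. (77)
print(np.allclose(P, P.T), np.allclose(P @ P, P))  # symmetric and idempotent

# with a = 1 (a vector of ones), I - P becomes the centering operator
ones = np.ones((I, 1))
Pc = np.eye(I) - ones @ ones.T / I
print(np.allclose(Pc @ b, b - b.mean()))  # centering subtracts the mean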
Orthogonal projections in general, including residuals
A matrix P = AA^{+} (where '+' indicates the Moore-Penrose inverse) projects orthogonally on the range (column space) of A [Schott 1997]. It can be checked that a^{+} = a^{T}/(a^{T}a); hence, Eq. (78) is a special case of the general orthogonal projection theorem.
It is also interesting to consider the residuals from the orthogonal projection of b on A, that is, the vector b*. As b* = b − b̂, it holds that
b* = (I − AA^{+})b = P^{⊥}b 
(79)
and P^{⊥} = I − AA^{+} is again a symmetric and idempotent matrix, i.e., an orthogonal projection operator. This matrix projects onto the orthogonal complement of A, that is, onto range(P^{⊥}), where range(.) is used to indicate the range (column space) of a matrix or vector. It holds that range(P^{⊥}) = nullspace(A^{T}), where nullspace(.) is used to indicate the nullspace of a matrix or vector. For every x ∈ range(P^{⊥}) it holds that A^{T}x = 0; hence, x ∈ nullspace(A^{T}) and vice versa.
Theorem
Given X of size I × J and the column dimension R of a sought bilinear model. Then
min ||X − (AB^{T} + 1m^{T})||^{2} = min ||Y − CD^{T}||^{2} 
(80)
where Y is the original data, X, with the column averages subtracted.
Proof
The proof has been given by Kruskal [1977] and Gabriel [1978]. Understanding that centering is a projection, it is simple to prove the above theorem. Let the loss function be
||X − AB^{T} − 1m^{T}||^{2} 
(81)
and partition it into two orthogonal parts
||P^{⊥}(X − AB^{T} − 1m^{T})||^{2} + ||P(X − AB^{T} − 1m^{T})||^{2} 
(82)
using the Pythagorean fact that the squares of two orthogonal parts equal the square of the total. This equation can be further developed to
min ||P^{⊥}X − P^{⊥}AB^{T}||^{2} + min ||PX − P(AB^{T} + 1m^{T})||^{2} = min ||P^{⊥}X − P^{⊥}AB^{T}||^{2} 
(83)
because ||PX − P(AB^{T} + 1m^{T})||^{2} will be zero by setting
m^{T} = (1^{T}/I)(X − AB^{T}), since 1(1^{T}/I) = P. 
(84)
Setting C = P^{⊥}A and D = B will therefore provide a solution with exactly the same fit as would be obtained by minimizing the original loss function. The solution may be computed using any bilinear algorithm for fitting a principal component analysis model of Y. The scores will automatically be centered, because they are linear combinations of the columns of Y. If the columns are centered, so are their linear combinations.
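A small numerical check of the theorem (a NumPy sketch using the SVD as the bilinear fitting routine):

import numpy as np

rng = np.random.default_rng(9)
I, J, R = 30, 8, 2
X = (rng.normal(size=(I, R)) @ rng.normal(size=(R, J))
     + np.ones((I, 1)) @ rng.normal(size=(1, J))
     + 0.05 * rng.normal(size=(I, J)))

# two-stage procedure: center, then fit a rank-R bilinear model
m = X.mean(axis=0)
Y = X - m
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
C = U[:, :R] * s[:R]   # automatically centered scores
D = Vt[:R].T

loss_two_stage = np.linalg.norm(Y - C @ D.T)**2
loss_full = np.linalg.norm(X - (C @ D.T + np.outer(np.ones(I), m)))**2
print(np.allclose(loss_two_stage, loss_full))  # True: same fit, cf. Eq. (80)
print(np.abs(C.mean(axis=0)).max())            # the scores are centered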
1. Amrhein M, Reaction and flow variants/invariants for the analysis of chemical reaction data. Ph.D. thesis, École Polytechnique Fédérale de Lausanne, 1998.
2. Amrhein M, Srinivasan B, Bonvin D, Schumacher MM, On the rank deficiency and rank augmentation of the spectral measurement matrix, Chemom Intell Lab Syst, 1996, 33, 17-33.
3. Andrews DT, Wentzell PD, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Analytica Chimica Acta, 1997, 350, 341-352.
4. Barnes RJ, Dhanoa MS, Lister SJ, Standard normal variate transformation and detrending of near-infrared diffuse reflectance spectra, Appl Spectrosc, 1989, 43, 772-777.
5. Bro R, Sidiropoulos ND, Smilde AK, Maximum likelihood fitting using simple least squares algorithms, Journal of Chemometrics, 2001, submitted.
6. Bro R, Multiway Analysis in the Food Industry. Models, Algorithms, and Applications. Ph.D. thesis, University of Amsterdam (NL), http://www.mli.kvl.dk/staff/foodtech/brothesis.pdf, 1998.
7. Cornelius PL, Seyedsadr M, Crossa J, Using the shifted multiplicative model to search for 'separability' in crop cultivar trials, Theoretical and Applied Genetics, 1992, 84, 161-172.
8. Gabriel KR, Least squares approximation of matrices by additive and multiplicative models, Journal of the Royal Statistical Society, Series B, 1978, 40, 186-196.
9. Grung B, Manne R, Missing values in principal component analysis, Chemom Intell Lab Syst, 1998, 42, 125-139.
10. Guo Q, Wu W, Massart DL, The robust normal variate transform for pattern recognition with near-infrared data, Analytica Chimica Acta, 1999, 382, 87-103.
11. Harshman RA, Lundy ME, Data preprocessing and the extended PARAFAC model, in Research Methods for Multimode Data Analysis (Eds. Law HG, Snyder CW Jr., Hattie J, McDonald RP), Praeger, New York, 1984, 216-284.
12. Helland IS, Næs T, Isaksson T, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemom Intell Lab Syst, 1995, 29, 233-241.
13. Judge GG, Griffiths WE, Carter Hill R, Lütkepohl H, Lee TC, The theory and practice of econometrics, John Wiley & Sons.
14. Kiers HAL, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 1997, 62, 251-266.
15. Kiers HAL, Towards a standardized notation and terminology in multiway analysis, Journal of Chemometrics, 2000, 14, 105-122.
16. Kruskal JB, Some least squares theorems for matrices and N-way arrays, 1977, manuscript, Bell Laboratories.
17. Kruskal JB, Multilinear methods, Proceedings of Symposia in Applied Mathematics, 1983, 28, 75-104.
18. Kruskal JB, Rank, decomposition, and uniqueness for 3-way and N-way arrays, in Multiway Data Analysis (Eds. Coppi R, Bolasco S), Elsevier, Amsterdam, 1989.
19. Little RJA, Rubin DB, Statistical analysis with missing data, John Wiley & Sons.
20. Nørgaard L, Classification and prediction of quality and process parameters of beet sugar and thick juice by fluorescence spectroscopy and chemometrics, Zuckerindustrie, 1995, 120, 970-981.
21. Paatero P, A weighted non-negative least squares algorithm for three-way 'PARAFAC' factor analysis, Chemom Intell Lab Syst, 1997, 38, 223-242.
22. Paatero P, Tapper U, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 1994, 5, 111-126.
23. Schott JR, Matrix analysis for statistics, Wiley and Sons, 1997.
24. Seasholtz MB, Kowalski BR, The parsimony principle applied to multivariate calibration, Analytica Chimica Acta, 1993, 277, 165-177.
25. Styan PH, Hadamard products and multivariate statistical analysis, Linear Algebra and its Applications, 1973, 6, 217-240.
26. ten Berge JMF, Convergence of PARAFAC preprocessing procedures and the Deming-Stephan method of iterative proportional fitting, in Multiway Data Analysis (Eds. Coppi R, Bolasco S), Elsevier, Amsterdam, 1989, 53-63.
27. ten Berge JMF, Kiers HAL, Convergence properties of an iterative procedure of ipsatizing and standardizing a data matrix, with application to PARAFAC/CANDECOMP preprocessing, Psychometrika, 1989, 54, 231-235.
28. Tucker LR, A method for synthesis of factor analysis studies, Personnel Research Section, Report 984, Dept. of the Army, 1951.
29. van Eeuwijk FA, Between and beyond additivity and non-additivity; the statistical modelling of genotype by environment interaction in plant breeding. Ph.D. thesis.
30. Weinberg JR, A Short History of Medieval Philosophy.
31. Wentzell PD, Andrews DT, Hamilton DC, Faber NM, Kowalski BR, Maximum likelihood principal component analysis, Journal of Chemometrics, 1997, 11, 339-366.
32. Wentzell PD,
33. Westerhuis JA, Kourti T, MacGregor JF, Comparing alternative approaches for multivariate statistical analysis of batch process data, Journal of Chemometrics, 1999, 13, 397-413.
* If the algorithm for fitting the bilinear part of the model is iterative, it is useful not to iterate until convergence in each step two. Instead, only a few iterations are performed before one proceeds to step three, where the complete model (AB^{T} + 1m^{T} in the case of PCA) is calculated and used for obtaining better estimates of the missing values.
* Normally, centering would be performed before scaling for computational reasons, as the averages are needed for scaling by the inverse standard deviation.