Revised version published in J.Chemom., 2003, 17, 16-33

 

Centering and scaling in component analysis

 

 

Rasmus Bro

Chemometrics Group, Food Technology, Dept. Dairy and Food Science

The Royal Veterinary and Agricultural University

Denmark

rb@kvl.dk

 

&

 

Age K. Smilde

Process Analysis and Chemometrics

Department of Chemical Engineering

University of Amsterdam

The Netherlands

 

 


Abstract

1. Introduction
   1.1 Definitions
   1.2 Leading principles
   1.3 Outline of paper
   1.4 Notation

2. Two-way centering
   2.1 Reasons for centering
   2.2 How centering works
       2.2.1 Centering can remove offsets because it is a projection step
       2.2.2 Centering across several modes
       2.2.3 Centering is a two-stage procedure for a least squares fitting problem
       2.2.4 Rank reduction and centering
   2.3 When centering does not work
       2.3.1 Handling missing data
       2.3.2 Subtracting the grand mean
   2.4 Alternatives to centering
   2.5 Summary

3. Two-way scaling
   3.1 Reasons for scaling
   3.2 How scaling works
       3.2.1 Different types of scaling
       3.2.2 Scaling and weighted least squares fitting
   3.3 When scaling does not work
   3.4 Alternatives to scaling
   3.5 Summary

4. Simultaneous two-way centering and scaling
   4.1 Scaling within a mode disturbs centering across the same mode, but not across other modes
   4.2 Centering across one mode disturbs scaling within all modes
   4.3 Centering across and scaling within the same mode is problematic
   4.4 Summary

5. Three-way preprocessing
   5.1 Centering
       5.1.1 Possible three-way offsets and their proper removal
   5.2 Scaling
       5.2.1 Incorrect scaling of three-way arrays
   5.3 Simultaneous centering and scaling
   5.4 Summary

6. Conclusion

Acknowledgments

Appendix 1: Projections

Appendix 2: Fitting bilinear model plus offsets across one mode equals fitting a bilinear model to centered data

 

 

 

 


Abstract

In this paper, the purpose and use of centering and scaling are discussed in depth. The main focus is on two-way bilinear data analysis, but the results can easily be generalized to multiway data analysis. In fact, one of the aims of this paper is to show that if two-way centering and scaling are understood, then multiway centering and scaling is quite straightforward. In the literature, it is often stated that preprocessing of multiway arrays is difficult, but here it is shown that most of the difficulties do not pertain to three- and higher-way modeling in particular.

It is shown that centering is most conveniently seen as a projection step, where the data are projected onto certain well-defined spaces within a given mode. This view of centering helps explain why, for example, centering data with missing elements is likely to be suboptimal if there are many missing elements.

Building a model for data consists of two parts: postulating a structural model and using a method to estimate the parameters. Centering has to do with the first part: when centering, a model including offsets is postulated. Scaling has to do with the second part: when scaling, another way of fitting the model is employed. It is shown that centering is simply a convenient technique to estimate model parameters for models with certain offsets, but this does not work for all types of offsets. It is also shown that scaling is a way to fit models with a weighted least squares loss function and that sometimes this change in objective function cannot be accomplished by a simple scaling step.

                Further practical aspects of and alternatives to centering and scaling are discussed, and examples are used throughout to show that the conclusions in the paper are not only of theoretical interest, but can have impact on practical data analysis.

 

Keywords: Offset, weighted least squares, preprocessing, two-way, three-way, multiway, missing data, PCA, PARAFAC


1.          Introduction

1.1            Definitions

It is important to have a concise terminology for scaling and centering. The following convention is based on suggestions from the literature [Harshman & Lundy 1984, Kiers 2000, Kruskal 1989, ten Berge 1989]. The term 'an offset' - also sometimes called an intercept - is used for a part of the model that is constant across one or several modes. An R-component bilinear model of a data matrix, X (I × J) with elements xij, may be written in terms of scalars or in matrix notation as

 

X = ΦΘT + E   

 

(1)

 

where Φ (I × R) and Θ (J × R) hold the parameters φir and θjr, where Greek symbols are used to indicate population parameters. The matrix E holds the unknown errors. Offsets may be constant across the first mode (rows). The model associated with such offsets is:

 

X = ΦΘT + 1μT + E

(2)

 

where μ (J × 1) holds the constant terms μj (j=1,..,J) and where 1 is a one-vector of suitable size (I × 1 in this case). Again, the Greek symbol μ indicates a population value. Offsets may also be constant across the second mode (columns). The underlying bilinear model with offsets across the second mode reads

 

X = ΦΘT + μ1T + E

 

(3)

 

 

where the vector μ (I × 1) now holds the offsets μi (i=1,..,I). Offsets may also be constant across columns and across rows, yielding

 

X = ΦΘT + μ11T + E

 

(4)

 

in which the single constant μ is the same for all elements of X. Such a situation may arise, for example, in chromatography or capillary electrophoresis, where a constant offset in the detector may appear due to the way the detector zero-level is determined.

Thus, for bilinear models there are two basic types of offsets: constants across one mode (columns or rows) or constants across both modes. Combinations of such offsets may also appear as is for example seen in analysis of variance settings.

As will be shown below, offsets are often handled by first centering the data and subsequently fitting the bilinear model. If the data are centered by subtracting the column-average from every element in the column, this is referred to as centering across the first mode. Mathematically it can be expressed

 

yij = xij - mj

(5)

 

where yij is an element of the centered data matrix and mj is the average of the jth column. If m (J×1) is a vector holding the jth column-average in its jth element, then centering across the first mode can also be expressed

 

Y = X - 1mT

(6)

 

where 1 is an I-vector of ones and Y the matrix holding the centered data. Subtracting the row-average from each element in a row is referred to as centering across the second mode and can be expressed

 

yij = xij - mi

(7)

 

or using m (I×1) as a vector holding the ith row-average in its ith element

 

Y = X - m1T.

(8)

 

In general, centering across one mode is also called single-centering and performing for example a centering across the first mode and then a subsequent centering of the outcome across the second mode is called double-centering. The term slab-centering, which is sometimes seen in the literature, refers to centering by subtracting, from each slab in a three-way array, the overall average of that slab. For two-way data, this simply corresponds to subtracting the average of all elements. 

                For scaling, another terminology is used. When a matrix is scaled such that each row is multiplied by a specific scalar, it is called scaling within the first mode (yij = xijwi). If each column is multiplied by a certain scalar as in traditional autoscaling, it is referred to as scaling within the second mode (yij = xijwj). In matrix notation scaling within the first mode can be written

 

Y = WX

(9)

 

where W is an I × I diagonal matrix holding the scalar wi in its ith diagonal element. Scaling within the second mode can be written

 

Y = XW

(10)

 

where W is now a J × J diagonal matrix holding the scalar wj in its jth diagonal element.
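To make Eqs. (5)-(10) concrete, the four operations can be written in a few lines of Python/numpy. This is a minimal sketch; the matrix contents and the particular choice of weights are illustrative assumptions, not taken from the paper:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)     # I = 4 samples, J = 3 variables

# Centering across the first mode (Eqs. 5-6): subtract column averages
Y1 = X - X.mean(axis=0)                          # Y = X - 1m'

# Centering across the second mode (Eqs. 7-8): subtract row averages
Y2 = X - X.mean(axis=1, keepdims=True)           # Y = X - m1'

# Scaling within the first mode (Eq. 9): one scalar per row, Y = WX
W_rows = np.diag(1.0 / np.linalg.norm(X, axis=1))
Y3 = W_rows @ X

# Scaling within the second mode (Eq. 10): one scalar per column, Y = XW
W_cols = np.diag(1.0 / X.std(axis=0, ddof=1))
Y4 = X @ W_cols
```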

 

Figure 1. Example of centering and scaling using sugar process data (legend: see text).

An example of centering and scaling is shown in Figure 1. The data are from a sugar factory (see Nørgaard [1995] for more information). The variables shown are Ash (1), Color (2), Color type (3), Turbidity (4), Grain size1 (5), Grain size2 (6), SO2 (7), Invert (8), Floc (9), Insoluble residue (10) and Amino-N (11), which are all measured in different units of different magnitude. Each line in a plot is the status ('spectrum') of these 11 variables at a certain time. Ninety-seven times are shown. The raw data are shown on the left side (A). The different ranges of these variables will manifest themselves in a subsequent modeling of the data, where the variables with little variation will not be modeled to any significant degree. Centering (across the first mode) will not remove these scale differences, but will move the variation of each variable to the zero-level. As the differences in scales between variables are arbitrary, it is useful to scale the data so that each variable has the same initial standard deviation (and to remove the different measurement units). This can be achieved by scaling the centered data within the second (variable) mode. The corresponding autoscaled data are shown on the right side of the figure. It is seen that the variation of each processed variable is comparable for the autoscaled data. The outlying sample in the lower part of the plot, however, leads to an overly dramatic down-weighting of, for example, variable one and should hence be excluded before preprocessing is carried out.

It may seem strange that different words are used for scaling (within) and centering (across). The explanation for this is as follows. Centering is performed across a mode in the sense that one offset is subtracted from every element in a certain vector, i.e., the data are centered across the elements of one mode. The same holds for three-way data; the average value is subtracted from each element of a vector. Scaling is performed by multiplying all elements in the array containing a certain variable (or object) by the same scalar. For two-way data, scaling therefore also pertains to vectors, but e.g. for three-way data, this means that a whole slab (corresponding, for example, to the I × K matrix of the jth wavelength in a spectral three-way array) has to be multiplied by the same scalar. Thus scaling is performed within the elements of one mode.

In the following, much use will be made of the notion of 'fit' or 'model fit'. In general terms this means what portion of the data is fitted by the model. This can be expressed by

R2 = 1 - ||X - X̂||2/||X||2 = 1 - ||E||2/||X||2

(11)

where X contains the data, X̂ the fitted values of the data using the model and E the residuals. The symbol ||A|| denotes a norm of A, here taken to be the Frobenius or Euclidean norm (square root of the sum of squared elements of A). The R2 statistic can take on values between zero and one, where one means perfect fit and zero means no fit. The fit may also be expressed in percentages between 0 and 100.

 

1.2           Leading principles

There are two leading principles in this paper. The first principle is parsimony. It is preferred that a model is as simple as possible. This means that if two models give the same fit, the model using fewer parameters is preferred. This idea goes back to William of Ockham, who lived in the Middle Ages and stated that a minimum number of assumptions should be adopted to explain a phenomenon [Weinberg 1964]. This principle is known as 'Ockham's Razor'. In statistics, the notion of parsimony is formulated in Statistical Decision Theory [Judge et al. 1985] and in chemometrics it was introduced as the 'parsimony principle' [Seasholtz & Kowalski 1993].

The second principle is that centering is considered in this paper not as a way to estimate offsets, but as a way to remove them. Estimating offsets is a different issue from removing them, and it has its own properties and problems.

 

1.3           Outline of paper

In the main part of the paper only two-way (bilinear) models are considered because most results generalize straightforwardly to multiway models. Section 2 is concerned with two-way centering, Section 3 with two-way scaling, and Section 4 with the combined use of two-way centering and scaling. Section 5 explains the discussed results in terms of multiway models and finally in Section 6 some conclusions are drawn.

 

1.4           Notation

The following notation is used in this paper. Two-way data are held in an I × J matrix X with typical elements xij. Three-way data are held in an I × J × K array X with typical elements xijk. Such a three-way array is often rearranged to an I × JK matrix X(I×JK) by concatenating the K third-mode frontal slabs after each other; i.e., X(I×JK) = [X1 X2 .. XK] where Xk is the I × J matrix obtained by fixing the third mode of X at k. This operation has been referred to in a number of ways (e.g. unfolding), but it has been suggested to use the term matricizing to avoid confusion with other techniques [Kiers 2000]. The matricized array is often just denoted X instead of X(I×JK) if no confusion is possible. The letter Y is used for preprocessed data and X̂ is used for a model of X (be it two- or three-way). The number of components in a model is called R.
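As an illustration, matricizing amounts to concatenating the K frontal slabs side by side. The following Python/numpy sketch assumes an arbitrary array purely for demonstration:

```python
import numpy as np

I, J, K = 2, 3, 4
Xijk = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# X(I×JK) = [X1 X2 .. XK], where Xk is the kth frontal I × J slab
X_IJK = np.hstack([Xijk[:, :, k] for k in range(K)])
print(X_IJK.shape)    # (2, 12)
```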

                The letter m is used for calculated offsets (e.g. averages) and μ for true offsets (population values). The character w is used for a scaling parameter. The character P is used for a projection matrix related to centering. Usually its dimension will not be specified, because it follows from the context. Similar rules apply for the diagonal matrix W holding the weights associated with scaling the array.

The Kronecker product is denoted ⊗, the Hadamard (element-wise) product *, and the Khatri-Rao product [Styan 1973], which is the column-wise Kronecker product of two matrices, is denoted ⊙. The use of these special products makes it possible to express most three-way models with two-way (matricized) arrays [Bro 1998]. For example, a PARAFAC model of an I×J×K array X can be expressed as

 

X(I×JK) = A(C⊙B)T + E(I×JK),

(12)

 

where A (I×R), B (J×R), and C (K×R) are component matrices and E holds the residual unexplained variation. This notation is equivalent to the model

 

xijk = Σr=1..R airbjrckr + eijk,   i=1,..I; j=1,..J; k=1,..K;

(13)
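The equivalence of Eqs. (12) and (13) can be checked numerically. Below is a minimal sketch; the khatri_rao helper and all matrices are illustrations introduced here, not taken from the paper:

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column r is C[:, r] kron B[:, r]."""
    K, R = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 5, 2
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))

# Noise-free PARAFAC model in matricized form, Eq. (12)
X = A @ khatri_rao(C, B).T           # I × JK

# The kth frontal slab equals A diag(ck) B', consistent with Eq. (13)
k = 0
assert np.allclose(X[:, k * J:(k + 1) * J], A @ np.diag(C[k]) @ B.T)
```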

 

2.          Two-way centering

In order to understand when and how centering works, it is important first to consider the goals of centering and to realize how these goals are achieved in practice. These aspects are described in this section.

 

2.1           Reasons for centering

As stated by Harshman and Lundy [1984], quite subjective and qualitative reasons are often given for performing centering. It is possible to formulate rational reasons for centering on scientific grounds. Basically, centering should be performed only if there are common offsets in the data or if modeling such offsets provides an approximately reasonable model. Thus, centering is performed to make interval-scale data behave as ratio-scale data, which is the type of data assumed in most multivariate models. Said more simply, centering should make a difference. This difference can manifest itself as

 

i)                     Reduced rank of the model

ii)                   Increased fit to the data

iii)                  Specific removal of offsets

iv)                 Avoidance of numerical problems

 

Re i). If a model of the raw data requires, say, R + 1 components to describe the data well, whereas a model of the centered data requires only R components, then centering is sensible because the model of the centered data only has R(I+J) + J parameters. The J parameters pertain to the calculated averages, assuming that centering is performed across the first mode. The alternative of fitting the (R+1)-component model to the raw data would lead to a model with (R+1)(I+J) parameters and would thus violate the parsimony principle.

Re ii). In some situations, the rank of the appropriate model is not reduced upon centering, but if the fit of the model of the centered data is significantly improved, then introducing extra parameters is naturally useful. It is possible to heuristically consider the offsets introduced by centering as one extra 'half' component of which either the scores (centering across the first mode) or the loadings (centering across the second mode) are known a priori to be equal to one. This holds in the sense that the fit of a model with R components is poorer than the fit of a model with R components and offsets, which again is poorer than that of a model with R+1 components.

Re iii). Centering can remove certain offsets. In some situations the offsets as such are of interest, in which case it is interesting to estimate them. This can usually not be achieved by centering.

Re iv). In certain algorithms, it may be useful to center the data in order to minimize algorithmic problems. For fitting a bilinear model using principal component analysis, it is known that the ratio of the two largest eigenvalues is related to the convergence rate of the power method (and related techniques such as NIPALS). For PARAFAC it is also known that if some components are strongly correlated, as evidenced through Tucker's congruence coefficient [Tucker 1951], then the fitting procedure may be complicated by so-called swamps. For both situations, it holds that centering across certain modes can be helpful in minimizing the cause of the problem, because the resulting optimization problem is related to a different model with different (and hopefully better) properties with respect to numerical problems.

 

2.2           How centering works

2.2.1           Centering can remove offsets because it is a projection step

The following discussion pertains to centering across the first mode (ordinary column-centering) but is readily applicable to centering across the second mode as well. Understanding that centering is a special projection step within one specific mode explains why it eliminates constant terms in the data (see Appendix 1 for details on projections).

                If the vector m holds in its jth element the average of the jth column of X, then m can be expressed as

 

m = (1/I)XT1  ⇔  mT = (1T/I)X

(14)

 

where 1 is an I-vector of ones and X (the data) has size I × J. Then centering X across the first mode (column-centering) amounts to

 

Y = X - 1mT

(15)

 

where Y (I×J) contains the centered data. As

 

1mT = (11T/I)X,

(16)

 

the centered data can also be expressed

 

Y = X - 1mT = X - (11T/I)X = (I - (11T/I))X = (I - P)X = P⊥X

(17)

 

where P⊥X = P⊥[x1, ... , xJ] = [P⊥x1, ..., P⊥xJ] and xj is the jth column of X. The matrix (11T/I) = P is a symmetric and idempotent (I × I) matrix and is thus an (orthogonal) projection matrix (see Appendix 1). This shows that the column-averages are the orthogonal projections of the columns of X onto the direction of ones, i.e., the direction given by the vector 1. The centering matrix (I - (11T/I)) = I - P = P⊥ is the projection matrix onto the nullspace of 1T, which equals the orthogonal complement of range(1) (where range(·) is the range of a matrix). Stated otherwise, centering may also be interpreted as providing the residuals after regressing the columns of X onto 1. It follows that centering may be viewed as the projection of the data onto a space with the common offset (given by the I-vector 1) removed.

Mathematically, centering is a projection onto the nullspace of 1T and it is worthwhile to keep this in mind. Suppose that the true model of the data contains offsets across the first mode. Then the model can be written as

 

X = ΦΘT + 1μT + E

(18)

 

Projecting these data onto the nullspace of 1T leads to

 

P⊥X = P⊥ΦΘT + P⊥1μT + P⊥E

(19)

 

P⊥X = Y = P⊥ΦΘT + P⊥E

(20)

 

where P⊥X = Y is the matrix holding the centered data and where the term P⊥1μT vanishes, as 1 has no residuals when projected onto itself. The part P⊥ΦΘT is a bilinear model with scores P⊥Φ and loadings Θ. Thus, instead of fitting the bilinear model and the offsets to the original data, it is only necessary to fit the bilinear model to the centered data Y, with true structure P⊥ΦΘT + P⊥E. Centering also leads to models with residuals with zero column averages (centered across the first mode), because the residuals are also projected onto the nullspace of 1T, as is clear from Eq. (20).
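The vanishing of the offset term is easy to verify numerically. The following sketch (arbitrary simulated matrices, assumed purely for illustration) applies the centering projection P⊥ = I - 11T/I:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, R = 6, 4, 2
Phi = rng.normal(size=(I, R))
Theta = rng.normal(size=(J, R))
mu = rng.normal(size=J)

X = Phi @ Theta.T + np.outer(np.ones(I), mu)   # offsets across the first mode

Pperp = np.eye(I) - np.ones((I, I)) / I        # centering projection matrix
Y = Pperp @ X

assert np.allclose(Y, X - X.mean(axis=0))      # same as subtracting column means
assert np.linalg.matrix_rank(Y) == R           # the offset term has vanished
```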

If Φ and Θ are nonnegative, then nonnegativity constraints can be imposed on the model for X. When X is centered, e.g., across the first mode, such a constraint is no longer meaningful for the first-mode parameters, because centering destroys the nonnegativity of Φ (but not of Θ).

 

2.2.2           Centering across several modes

As mentioned earlier, centering across a given mode is called single-centering. Single-centering an array across one mode that has previously been single-centered across another mode is called double-centering. Performing several single-centerings (for multiway arrays as many can be performed as the number of ways) is unproblematic in the sense that centering across one mode leaves the 'centeredness' intact in other modes [Harshman & Lundy 1984]. Further, the order of centering is immaterial. This means that if the data are first centered across the first mode, and subsequently the centered data are centered across the second mode, then the average of every column and every row will be zero. This follows because centering across the first mode can be written as P⊥IX, where P⊥I = I - 1I1IT/I is the centering operator for the first mode. Centering across the second mode can be written XP⊥J, with P⊥J defined similarly. Thus double-centering can be written P⊥IXP⊥J, and hence: i) the order in which the centerings are performed is immaterial and ii) the double-centered array will have both column- and row-averages of zero, because P⊥IXP⊥J can be viewed as centering the matrix XP⊥J across the first mode or centering the matrix P⊥IX across the second mode.
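These properties are quickly confirmed numerically (the matrix below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 4))

PI = np.eye(5) - np.ones((5, 5)) / 5           # centers across the first mode
PJ = np.eye(4) - np.ones((4, 4)) / 4           # centers across the second mode

D = PI @ X @ PJ                                # double-centering
assert np.allclose(D, PI @ (X @ PJ))           # order of centering is immaterial
assert np.allclose(D, (PI @ X) @ PJ)
assert np.allclose(D.mean(axis=0), 0)          # all column averages are zero
assert np.allclose(D.mean(axis=1), 0)          # all row averages are zero
```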

 

2.2.3           Centering is a two-stage procedure for a least squares fitting problem

Consider a two-way data set, which is generated as

 

X = ΦΘT + 1μT + E

(21)

 

 

where Φ is (I × R), Θ (J × R) and μ (J × 1). It follows, that the data can be modeled by a bilinear model plus a common offset for each variable/column plus additional unmodeled variation held in the residual matrix E. This is the model assumed to be a valid approximation in most bilinear methods in chemometrics. The least squares loss function for the above model is

 

||X - (ABT + 1nT)||2

(22)

 

and this function is to be minimized directly over A, B, and n, with A, B and n being of the same dimensions as Φ, Θ and μ, respectively. The matrices A and B and the vector n contain estimates of Φ, Θ, and μ, respectively, but it is intrinsic to the problem that these estimates do not uniquely recover the underlying parameters. For example, Φ and Θ can, at most, be found up to a rotation. As shown above, however, the parameters can be estimated in two steps. Centering the data across the first mode will remove the offsets μ, and the bilinear model is subsequently fitted to the centered data Y, thus minimizing the loss function

 

|| Y - CDT||2

(23)

 

only over C and D that are of the same dimensions as A and B above. It holds that

 

min||X - (ABT + 1nT)||2 = min||Y - CDT||2

(24)

 

i.e. the fit of the model fitted directly and the bilinear model fitted to the centered data will be exactly the same (see proof in Appendix 2) even though the actual parameters will usually differ. This is an important result because it guarantees the optimality of the model even if the offsets are calculated separately from the bilinear parameters.

The solution for minimizing Eq. (22) directly is not unique for n. That is, the two-stage solution of centering first and then fitting the bilinear part gives one solution of many to the problem in Eq. (22). Therefore, centering removes μ, but m is not necessarily an estimate of μ.

This non-uniqueness can be explained briefly. Centering involves subtracting from each column its column-average. The matrix

 

PX = PΦΘT + P1μT + PE

(25)

 

holds in each row the vector mT containing the average value of each column of X (P is 11T/I, as above). If the part PΦΘT is a matrix of zeros, then m will be an estimate of the true offsets μ (with error PE), because

 

P1μT = 1μT

(26)

 

by definition. However, PΦΘT will only be a zero matrix if PΦ = 0 or Θ = 0 (assuming that Φ and Θ have full column rank). As P = 11T/I and P is symmetric, the condition PΦ = 0 is equivalent to 1TΦ = 0T. Thus, the offsets will only equal the true offsets if the column-space of Φ is orthogonal to 1 (meaning that Φ is already centered) or if Θ = 0.

 

2.2.4           Rank reduction and centering

In some cases centering reduces rank and in some cases it does not. Column-centering of X (I×J) reduces the rank of X if and only if 1 ∈ range(X), where range(X) is the range of X (see Amrhein [1998], p. 156, and Amrhein [1996]). Intuitively this is understandable. Centering is a projection. If the axis on which the projection takes place is a part of the range of X, then the residuals of this projection do not have this direction available anymore. Hence, the rank of the matrix of residuals is lowered by one. This simple fact has several repercussions for centering across the first mode (column-centering).

The following can be said about the noiseless case. Suppose that X (I×J) is noiseless and can be modeled as

X = ΦΘT + 1μT

(27)

where Φ is (I×R) of full rank and Θ is (J×R). Assume that [Θ μ] has full rank R+1, which will be fulfilled for real data. For Y, the column-centered X, two cases can be distinguished:

 

1.  1 ∈ range(Φ)  ⇒  rank(X) = R  ⇒  rank(Y) = R - 1

2.  1 ∉ range(Φ)  ⇒  rank(X) = R + 1  ⇒  rank(Y) = R

 

Hence, in both cases the rank of X is reduced by one. The reverse also holds: if for the noiseless case no rank reduction of X is obtained upon column-centering, then model (27) is not valid. To summarize, in the noiseless case there is a simple relation between the validity of model (27) and rank reduction upon column-centering.
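The two cases are easy to reproduce numerically. The following sketch assumes arbitrary simulated parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, R = 8, 5, 2
Theta = rng.normal(size=(J, R))
mu = rng.normal(size=J)

# Case 2: 1 not in range(Phi) => rank(X) = R + 1 and rank(Y) = R
Phi = rng.normal(size=(I, R))
X = Phi @ Theta.T + np.outer(np.ones(I), mu)
Y = X - X.mean(axis=0)
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Y))   # 3 2

# Case 1: 1 in range(Phi) => rank(X) = R and rank(Y) = R - 1
Phi[:, 0] = 1.0
X = Phi @ Theta.T + np.outer(np.ones(I), mu)
Y = X - X.mean(axis=0)
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Y))   # 2 1
```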

In the case with added noise, things are less simple. Suppose that X (I×J) also contains noise and the model for X is

X = ΦΘT + 1μT + E

(28)

 

If I > J, then in general 1 ∉ range(X). Although 1 ∈ range('noiseless' X), this property is destroyed upon adding noise to X. In the case I ≤ J, if X has full rank I, then 1 is by definition in the range of X, whether or not there exists an offset term 1μT. Hence, in the case with noise there is no simple relationship anymore between the validity of (28) and mathematical rank reduction upon column-centering.

 

2.3           When centering does not work

Viewing centering as a projection step rather than as a simple subtraction of averages has more than theoretical importance. In practice, situations often occur where subtraction of averages does not work and may, in fact, lead to models that fit the original data more poorly than if the data had not been preprocessed. This will be shown in the following for two different problems: handling missing data and modeling a single common offset.

 

2.3.1           Handling missing data

When data are missing, centering by subtracting averages from columns or rows does not lead to elimination of offsets. Rather, the offsets have to be eliminated simultaneously with the fitting of the bilinear part [Grung & Manne 1998]. This is so because the equivalence between subtracting average values and projecting onto the nullspace of vectors of ones no longer holds as the projection cannot be calculated with missing elements. As an example, consider the matrix X(1) shown below.

 

 

X(1)   Y(1)   X(2)   Y(2)

 

 

(29)

 

 

This is a rank one matrix and will remain so, even after centering across the first mode. The averages of the two columns are 4.1 and 8.2 respectively and the centered matrix reads as Y(1) in Eq. (29), which is also a rank one matrix.

Consider now a situation in which the first two elements in the first column are missing. The data then read as X(2). This data set is naturally still perfectly modeled by a rank one bilinear model, as no new information has been added. The averages of the two columns are now 1 and 8.2, and subtracting these values leads to the centered matrix Y(2). This matrix cannot be described by a rank one model. This is easily realized by looking only at the last three rows, which form a rank two submatrix; the addition of the first two rows cannot change this. Thus, by subtracting averages from the data with missing elements, the structure of the data has been destroyed and meaningful results cannot be expected. Prior centering no longer leads to elimination of the true offsets as centering ordinarily does.
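The effect is easy to reproduce. In the sketch below the matrix entries are hypothetical values chosen only to match the column averages quoted above (4.1 and 8.2, and 1 after removal); the point does not depend on the particular numbers:

```python
import numpy as np

x1 = np.array([10.0, 7.5, 0.0, 1.0, 2.0])     # illustrative values
X = np.column_stack([x1, 2 * x1])             # rank one; column means 4.1, 8.2

Y = X - X.mean(axis=0)                        # centered: still rank one
print(np.linalg.matrix_rank(Y))               # 1

Xm = X.copy()
Xm[:2, 0] = np.nan                            # first two elements missing
m = np.nanmean(Xm, axis=0)                    # column means are now 1 and 8.2
Ym = Xm - m                                   # naive centering
print(np.linalg.matrix_rank(Ym[2:, :]))       # 2: the rank one structure is lost
```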

                Centering is really an extension of the bi- (or multi-) linear model where offsets are assumed to be present in the model of the data. Data with missing elements constitute one situation in which such a model cannot be fitted in a least squares sense using centering. An alternative to eliminating offsets by preprocessing is given in Section 2.4.

 

2.3.2           Subtracting the grand mean

The traditional centering across the first mode easily leads to the belief that subtracting averages with the same structure as the offsets will generally eliminate these offsets. This holds for offsets constant across one mode but it does not hold in general.

Consider a data set structured as a bilinear part plus a constant identical for all elements; that is, all elements have the same common offset, as also shown in Eq. (4). It might seem natural to remove this offset by initially subtracting the grand mean m from the data. However, this will not simplify the subsequent modeling of the data, and in fact, it obscures interpretation of the model, because such a centering leads to artificial mathematical components that also need to be modeled.

To explain this, assume that X is perfectly described by a bilinear part plus a common offset

 

X = ΦΘT + λ1I1JT.

(30)

 

Centering by removing the overall mean of all elements of X can be written

 

Y = X - PIXPJ

(31)

 

where PJ is the projection matrix of 1J (=1J1JT/J) and PI of 1I (=1I1IT/I). Then PIXPJ is a matrix of the same size as X holding the overall average of X in all its elements. Inserting the true model of Eq. (30) in Eq. (31) leads to

 

Y = ΦΘT + λ1I1JT - PI(ΦΘT + λ1I1JT)PJ =

ΦΘT + λ1I1JT - PIΦΘTPJ - λPI1I1JTPJ =

ΦΘT + λ1I1JT - PIΦΘTPJ - λ1I1JT =

ΦΘT - PIΦΘTPJ =

ΦΘT - s1I1JT.

(32)

 

The scalar s is the overall average of the true bilinear part ΦΘT. Even though the common offset λ has been removed, a new common offset, s, has been introduced into the preprocessed data, and hence the same number of components is still necessary for modeling the data. Depending on the true parameters in the underlying model (ΦΘT), the model fitted to the preprocessed data may therefore explain less or more of the original data than if the data had not been preprocessed! Clearly, preprocessing the data by subtracting the overall mean is generally not useful.
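The argument can be verified numerically; the sketch below assumes arbitrary simulated parameters and a common offset of one:

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, R = 30, 10, 3
Phi = rng.normal(size=(I, R))
Theta = rng.normal(size=(J, R))
X = Phi @ Theta.T + 1.0                        # bilinear part plus offset of one

Y_grand = X - X.mean()                         # subtract the grand mean
Y_col = X - X.mean(axis=0)                     # center across the first mode

print(np.linalg.matrix_rank(Y_grand))          # still R + 1 = 4 components needed
print(np.linalg.matrix_rank(Y_col))            # R = 3: offset properly removed
```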

                As subtracting the overall level does not remove the offset, another approach must apparently be adopted for handling situations with one common offset. There are basically two different ways of treating the problem. The best way is to optimize the loss function of the problem directly in a one-step procedure, rather than trying to use a two-step procedure where the offsets are first removed (See Section 2.4).

                Another simpler way of dealing with a constant offset follows from the observation that the model

 

X = ΦΘT + 1I1JTλ

(33)

 

may equivalently be written

 

X = ΦΘT + 1IμT

(34)

 

where

 

μ = 1Jλ.

(35)

 

Posed in this way it is evident that a model with one global offset is a special case of the situation treated earlier where each variable (or sample) has a specific offset. Therefore, the constant offset may be eliminated by using ordinary centering across the first mode. As the offset is constant across rows, this centering removes the constant. An alternative procedure is to center across columns instead of rows, because the offset is also constant across columns.

An example is given for illustration. Consider a spectral data set. The data have been synthesized according to a bilinear three-component part plus a scalar offset of one. Thus, the model is

 

X = ΦΘT + 11T.

(36)

 

No noise is added. Using principal component analysis to model the raw data, four components are needed for describing the variation as expected (Table 1 left). Modeling the data centered by subtracting the overall mean leads to a situation where four components still have to be used for describing all systematic variation (Table 1 right). In fact, the three-component model explains less of the original data after preprocessing in this case.

 

Table 1. Percentage of variation explained for different models of the raw (left) and corrected data (right).

#LV    Raw data    Overall average subtracted
1      99.19%      98.05%
2      99.58%      99.08%
3      99.95%      99.67%
4      100.00%     100.00%
5      100.00%     100.00%

 

 

Even though only three systematic components should be present, the loading plot (Figure 2 left) clearly shows that the first four components are 'spectral'-like. With proper preprocessing only three loading vectors will be systematic, as shown in Figure 2 (right), using centering across the first mode.

 

 

Figure 2. Left: Loading plot from principal component model of the data matrix where the overall average is subtracted. Right: The first five loadings are shown when the data have been correctly centered across the first mode.

Single-centering involves fitting several parameters (J or I offsets, for centering across the first or second mode, respectively). When there is only one constant parameter in the true model, as is the case here, a risk of overfitting is introduced with this approach. It is advisable, therefore, to center across the mode of largest dimension so that as few offsets as possible need to be estimated.

To recapitulate, the following rule establishes the 'correct' procedure for removing offsets before fitting the model: centering across one mode removes offsets constant across that mode as well as offsets constant across both modes [Harshman & Lundy 1984]. This important rule also extends to multiway data of arbitrary order. Thus, centering across one mode removes offsets constant across that mode as well as offsets constant across several modes involving that mode. This generalization follows from realizing that multiway models can always be considered as a constrained version of a bilinear model. Hence, offsets constant across an arbitrary number of modes can always be considered a constrained version of a model with offsets constant across one mode. Centering across one of these modes will therefore eliminate the offsets because of the projection properties of centering.

 

2.4           Alternatives to centering

Instead of modeling the data in two steps - removing offsets by centering and fitting a model to the residuals - it is possible to fit the model in one step, alleviating the need for projecting the offsets away. Two examples are given.

The missing data problem (Section 2.3.1) can be handled directly in the following way. Assume that the offsets are, for instance, constant across the first mode and that a principal component analysis model is sought including offsets across the first mode. Such a PCA model of the data held in the matrix X including offsets reads

 

X = ΦΘT + 1μT + E

(37)

 

where μ is a J-vector. A very simple way to fit this model to a data matrix with missing elements in a least squares sense, is by the use of an alternating least squares approach where the missing elements are continuously exchanged with their model estimates. Such an algorithm may proceed as follows.

 

1.        Initialize missing values with reasonable values. Then the data set is complete and can be modeled by ordinary techniques.

2.        Fit the model including offsets to the (now complete) data set. For PCA, this amounts to centering the data and fitting the PCA model.

3.        Exchange missing values in the data matrix for model estimates. These estimates will improve the current estimates and thus provide a data set where the estimated missing elements are closer to the correct ones according to the (yet unknown) true least squares model.

4.        Proceed from step 2 until convergence.

 

This approach can be shown to converge, because it is an alternating least squares algorithm and hence has a non-increasing loss function. Upon convergence to the global minimum, the imputed missing data will have no residuals and hence no influence on the model. The model parameters computed from the complete data are exactly those that would have been obtained, had the model only been fitted to the nonmissing data directly*. This approach can be viewed as a simple special case of expectation maximization [Little & Rubin 1987]. For specific models or specific offsets, other approaches can also be feasible, but the above approach is general and easily implemented.
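A minimal Python/numpy sketch of steps 1-4 follows. The function name and defaults are illustrative assumptions, and the PCA step is done here by a truncated SVD of the centered data:

```python
import numpy as np

def pca_missing(X, R, n_iter=500, tol=1e-10):
    """Fit an R-component PCA model with offsets across the first mode to a
    matrix containing NaNs by iteratively imputing the missing elements."""
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)      # step 1: initialize
    prev = np.inf
    for _ in range(n_iter):
        m = Xc.mean(axis=0)                            # step 2: center ...
        U, s, Vt = np.linalg.svd(Xc - m, full_matrices=False)
        Xhat = U[:, :R] * s[:R] @ Vt[:R] + m           # ... and fit R components
        Xc[miss] = Xhat[miss]                          # step 3: impute
        loss = np.sum((Xc - Xhat)[~miss] ** 2)
        if prev - loss < tol * max(loss, 1.0):         # step 4: until convergence
            break
        prev = loss
    return Xhat, m
```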

The problem with one common offset (Section 2.3.2) can be dealt with in the following way. The loss function for a bilinear model with constant overall offset is expressed as

 

||X - ΦΘT - 1I1JTλ||2.

(38)

 

Instead of fitting the over-parameterized model

 

X = ABT + 1mT + E

(39)

 

in a two-step procedure (see Eq.(34)), it is possible to fit a 'correct' structural model

 

X = ABT + m11T + E

(40)

 

directly. Instead of J parameters in m, only one parameter, m, has to be estimated. The PCA model is used here as an example, but it may be exchanged with any other structural model, including a three-way model. Also, other types of offsets may be used. The loss function may be optimized in several ways leading to a least squares model [Cornelius et al. 1992, van Eeuwijk 1996]. A simple algorithm is based on alternating least squares, where first the offset is set to an initial value. Then the structural model is fitted to the corrected data X - m11T using, in this case, an ordinary PCA algorithm. This provides an update of the bilinear parameters. Subtracting the new interim PCA model from the data leads to

 

X - ABT = m11T + E.

(41)

 

Therefore, m may be determined conditional on A and B as the overall average of X - ABT. By alternating between updating the loading parameters and the offset, the model parameters will be estimated upon convergence.
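A sketch of this alternating scheme is given below; the function is an illustration, with a truncated SVD playing the role of the PCA step:

```python
import numpy as np

def bilinear_plus_offset(X, R, n_iter=100):
    """Fit X = AB' + m11' + E (Eq. 40) by alternating least squares."""
    m = 0.0                                     # initial value of the offset
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X - m, full_matrices=False)
        A = U[:, :R] * s[:R]                    # interim scores
        B = Vt[:R].T                            # interim loadings
        m = np.mean(X - A @ B.T)                # offset conditional on A and B
    return A, B, m
```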

In a three-way context using a PARAFAC model, an auxiliary benefit is that the offsets may often be uniquely determined due to the uniqueness of the PARAFAC model.

 

2.5 Summary

Offsets are part of the model hypothesized for the data. Some offsets can be removed by centering before fitting the remaining part of the model. This removal of offsets involves fitting additional parameters. Proper centering is defined as a centering operation that removes the offsets postulated and does not change the structural model of the data. Proper centering is always performed across a single mode. Sequentially centering across several modes is allowed. Hence, proper centering can always be written as P⊥IX or XP⊥J or P⊥IXP⊥J, depending on whether centering is performed across the first, second or both modes. Here P⊥I is an I × I matrix defined as (I - (11T/I)), where I is an I × I identity matrix and 1 an I-vector of ones. The matrix P⊥J is defined similarly. If elements are missing or other offsets are to be modeled, this has to be done using a one-step modeling approach where offsets and other parameters are considered simultaneously (usually using iterative algorithms).

 

3.          Two-way scaling

Unlike centering, scaling does not change the structure of the model. Scaling is used to change the weights given to different parts of the data in fitting the model. Though scaling is important, it usually has a much less dramatic influence on the fitted model than centering, as long as the model and scaling are reasonable [Westerhuis et al. 1999]. Some issues related to scaling are discussed by Paatero & Tapper  [1994].

 

3.1           Reasons for scaling

Scaling is used for several reasons. Some important ones are

 

i)                     To adjust scale differences

ii)                   To accommodate heteroscedasticity

iii)                  To allow for different sizes of subsets of data (block-scaling)

 

Re i). It is quite common to use, for instance, autoscaling (centering across the first mode and scaling to unit standard deviation within the second mode) to make the initial variance of each variable identical. Thereby, all variables have the same variance, and as the subsequent fitting of a model is performed so as to describe as much systematic variation as possible, every variable has the same initial opportunity of entering the model. This type of scaling is especially useful when the variables are measured in different measurement units (e.g. Pascal, Centigrade, ...).

Re ii). The ordinary least squares fitting of a model is statistically optimal in a maximum likelihood sense if the errors are homoscedastic, independent and Gaussian. If the variances of the error distributions are not the same, though identical within, e.g., a specific variable, it is possible to accommodate the fitting procedure accordingly by initially scaling the data within the variable mode. By scaling each variable with the inverse of the standard deviation of the residual variation, the fitted model will be optimal in a maximum likelihood sense.

Re iii). When the data are made up of several subsets of very different sizes, it may sometimes be advantageous to scale each block separately in order to ensure that all the different blocks are allowed to influence the model. Consider, for example, a situation in which 5000 wavelengths are measured in an infrared spectrum (absorbance between 0 and 1) and one variable is given for the temperature. Due to the huge difference in number of variables (5000 and one, respectively), the total variance of the infrared spectra will be tremendous compared to that of the temperature. If no scaling is applied to adjust for this difference, then the model is implicitly forced to focus on the infrared data. The temperature variable will then not be well described, unless the model is so complex that it can fit both subsets simultaneously or unless the temperature data happen to be in accordance with the infrared data. If the infrared data and the temperature data are initially believed to be equally important, then scaling both subsets to the same total variance will provide a model that reflects this assumption. Thus, in this case scaling is used from an information point of view, to ensure that all important information can enter the model irrespective of the variance of the different sources of information. A sketch of auto- and block-scaling is given below.
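The following Python/numpy sketch illustrates auto- and block-scaling; the function names are hypothetical helpers introduced here:

```python
import numpy as np

def autoscale(X):
    """Center across the first mode and scale within the second mode,
    giving every column zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def block_scale(blocks):
    """Scale each block (matrices sharing the first mode) to the same total
    centered variance, so that, e.g., 5000 spectral variables and one
    temperature variable can influence the model equally, then concatenate."""
    scaled = [B / np.linalg.norm(B - B.mean(axis=0)) for B in blocks]
    return np.hstack(scaled)
```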

It is important to note that even if no scaling is applied, the data are still scaled by the weight one. Thus, scaling (or the loss function in general, as will be shown in the next section) always has to be considered before fitting the model. 

All the above-mentioned reasons can be put under the same heading by the term weighted least squares fitting, which is a general and broader approach to fitting models than merely the use of scaling. This will be elaborated on in the next section.

 

3.2           How scaling works

Scaling is a subject often treated in conjunction with centering. However, the purpose of scaling is very different from the purpose of centering. Scaling is a way of introducing another loss function than the least squares loss function normally used. Therefore, scaling does not change the interpretation of the model and its parameters. As for centering, scaling has to be performed in a specific way in order not to introduce artificial structure that needs to be modeled. This becomes even more apparent when going to three-way models.

 

3.2.1           Different types of scaling

Scaling is usually performed by multiplying each column or each row in the data matrix by a scalar. There are two types of scaling that are relevant for two-way matrices. One is scaling within the first mode where every row is multiplied by a specific number

 

Y = WX

(42)

 

where W is an I × I diagonal matrix with the scaling parameter for the ith row on its ith diagonal element. This is the type of scaling used, for example, in standard normal variate correction [Barnes et al. 1989, Guo et al. 1999, Helland et al. 1995] where the norm or area of each row of X is scaled to the same scalar value by using an appropriate W. It extends easily to multiway arrays, as discussed in Section 5.2.

                The other main type of scaling is within the second mode where every column is multiplied by a specific number

 

Y = XW

(43)

 

where W is here a J × J diagonal matrix with the scaling parameter for the jth column on its jth diagonal element. This is the scaling ordinarily used in, e.g., PCA where the weight of a specific column (variable) is often chosen to be the inverse of the standard deviation of the variable. In combination with centering across the first mode, such scaling within the second mode is often referred to as autoscaling in the two-way case.

An example of scaling according to Eq. (43) is shown in Figure 3. One-component bilinear data with huge random residual variation in the last half of the variables (upper left) are generated. The resulting X has the spectra (as plotted in Figure 3, upper left) in its rows. The first principal component is seen to correlate well with the reference score generating the data (lower left). However, when the data are initially weighted as in Eq. (43) by the inverse of the residual standard deviation (upper right), the right part of the data is downweighted substantially. The correlation between the true known score ('concentration') and the estimated score of the scaled data is even higher (lower right) than for the raw data (left). Even in this extreme case, the influence of the noise is not at all as drastic as might be expected. This illustrates that scaling is not as critical as centering, as long as the scaling is reasonable and the variables relevant. Note that the scaling in this example is not the usual autoscaling using the inverse standard deviation of the data. Rather, the inverse standard deviation of the residual variation, for instance assessed by replicates, is used.

 

 

 

 

Figure 3. Influence of scaling on fitted model

 

3.2.2           Scaling and weighted least squares fitting

Given a data matrix X (I × J) and a model X̂ (I × J), which may, for example, be a bilinear model (X̂ = ABT), a standard approach for determining the model and its parameters is to fit the model in a least squares sense by minimizing the loss function

 

||X - X̂||2

(44)

 

which can also be expressed

 

Σi=1..I Σj=1..J (xij - x̂ij)2.

 

(45)

 

When the different elements of the data have different uncertainties or relevances, it is possible to fit the model using a weighted least squares loss function. In one of its simplest forms this can be expressed as

 

||(X - X̂)*Ω||2

(46)

 

where Ω (I × J) holds in its ijth element the weight ωij of the ijth element of X, and '*' is the Hadamard (elementwise) product. Often the weight of an element is set equal to the inverse of the standard deviation of the residual variation. It is also possible to use more elaborate weights if there are certain correlations between the residuals [Bro et al. 2001, Wentzell & Lohnes 1999]. The above weighted loss function can also be expressed in scalar notation as

 

Σi=1..I Σj=1..J ωij2(xij - x̂ij)2.

 

(47)

 

 

In a maximum-likelihood sense, this loss function is optimal if the weights reflect the uncertainty of the individual elements, if there is no correlation between the residuals of different elements, and if the residuals are normally distributed. If the uncertainty of a given variable is the same over all objects, the above loss function turns into

 

||(X - X̂)*Ω||2 = ||(X - X̂)W||2

(48)

 

where the weight matrix, W, is now a diagonal matrix holding the column-specific weights on its diagonal. The loss function may be transformed as follows

 

||(X - X̂)W||2 = ||XW - X̂W||2.

(49)

 

In case of a bilinear model the above reads

 

||(X - ABT)W||2 = ||XW - ABTW||2 = ||XW - AHT||2.

(50)

 

Thus, by fitting the bilinear model AHT to the data scaled within the second mode, XW, in an ordinary least squares sense, the weighted loss function of Eq. (48) is automatically optimized. This is the basic mathematical rationale behind scaling. If the sought model, X̂, has the structure ABT, then fitting a bilinear model to the scaled data provides the model in the form AHT. Thus, the score matrix (or a rotated version of it) is directly provided in A, whereas the loadings of the problem are found by premultiplying the found loadings, H, with W-1, as

 

BTW = HT  ⇔  BT = HTW-1  ⇔  B = W-1H.

(51)

 

Sometimes, only the scaled data are considered and model parameters are not transformed back to the original domain. The appropriateness of this approach is still, however, governed by the fact that scaling as outlined above maintains the bilinear structure assumed reasonable for the raw data.
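Numerically, Eqs. (48)-(51) amount to the following sketch (arbitrary simulated data; taking the weights as inverse column standard deviations is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 8))
w = 1.0 / X.std(axis=0, ddof=1)       # column-specific weights
W = np.diag(w)

R = 2
U, s, Vt = np.linalg.svd(X @ W, full_matrices=False)   # fit to scaled data XW
A = U[:, :R] * s[:R]                  # scores of the weighted problem
H = Vt[:R].T                          # loadings in the scaled domain

B = np.linalg.solve(W, H)             # back-transform: B = W^-1 H, Eq. (51)
Xhat = A @ B.T                        # model of the raw data, optimal in the
                                      # weighted sense of Eq. (48)
```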

There is a direct connection between ||(X - ABT)W||2 and ||XW - AHT||2. However, there is no direct connection between the solution to ||(X - ABT)W||2 and the solution to ||X - TPT||2, unless the model has perfect fit. That is, fitting the bilinear model to scaled and to unscaled data are two different problems with no direct relation.

 

3.3           When scaling does not work

When scaling within several modes is desired, the situation is complicated, because scaling one mode affects the scale of the other mode. For example, scaling to a standard deviation of one within both the first and second mode will generally not be possible, not even using iterative scaling [Harshman & Lundy 1984]. If scaling to a mean square of one is desired within both modes, this has to be done iteratively, until convergence [Harshman & Lundy 1984, ten Berge 1989]. Using mean-squares rather than, for instance, standard deviations for scaling has the attractive property that iterative scaling is guaranteed to converge, in case no centering is included in the iterative scheme [Harshman & Lundy 1984].

      Iterative preprocessing may seem unsatisfactory, because it tends to complicate the subsequent evaluation and validation of the preprocessing, since more than one set of scaling parameters for each mode has to be used. These several matrices holding the scaling parameters from each iteration may be combined, though, into one single matrix [Harshman & Lundy 1984]. An ad hoc alternative is to skip the iterative preprocessing and perform only one scaling of each mode. The purpose of scaling is mainly to bring the level of variation of different variables to some sort of equivalent levels. Therefore, one iteration of scaling can suffice to scale the data, so that no part of the data will have unreasonably large influence on the subsequent fitting.
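A sketch of such an iterative scheme, scaling to unit mean square within both modes, is given below (the function and its defaults are illustrative; convergence is as discussed by Harshman & Lundy [1984]):

```python
import numpy as np

def scale_both_modes(X, n_iter=100, tol=1e-10):
    """Iteratively scale within both modes until every row and every
    column of Y has a mean square of one."""
    Y = X.copy()
    for _ in range(n_iter):
        Y = Y / np.sqrt(np.mean(Y ** 2, axis=1, keepdims=True))  # rows
        Y = Y / np.sqrt(np.mean(Y ** 2, axis=0, keepdims=True))  # columns
        if (np.allclose(np.mean(Y ** 2, axis=1), 1, atol=tol)
                and np.allclose(np.mean(Y ** 2, axis=0), 1, atol=tol)):
            break
    return Y
```

As noted above, the combined effect of all iterations can be collected into one diagonal scaling matrix per mode.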

 

3.4           Alternatives to scaling

As shown in the preceding paragraphs, scaling can be considered a special case of using a weighted least squares loss function. When more complicated weights are needed, it is not always possible to fit the model indirectly by fitting the least squares model to the scaled data. In such situations an alternative to scaling is to use algorithms that directly handle a weighted least squares optimization criterion [Bro 1998, Kiers 1997, Paatero 1997]. This can be relevant, for example, when the residual variation is correlated both across rows and columns [Andrews & Wentzell 1997, Bro et al. 2001, Wentzell et al. 1997].

 

3.5           Summary

Scaling does not affect the structural model of the data, but only the loss function used to estimate the model parameters. Proper scaling is defined as scaling that can be expressed in terms of pre- or post-multiplied weights in the loss function of the model. Hence, proper scaling of X can always be expressed as WIX, XWJ or WIXWJ, where WI and WJ are diagonal matrices holding, for example, the inverse standard deviation of the corresponding row or column. If a certain scale is needed in both modes, the corresponding weights have to be found iteratively. Scaling, for example, to unit standard deviation in two modes is not possible in general, whereas scaling to unit mean square variation is possible.

 

4.          Simultaneous two-way centering and scaling

A complicating issue in preprocessing is the interdependence of centering and scaling [ten Berge & Kiers 1989]. Because preprocessing is mostly performed in one of a few standardized ways in two-way analysis, the problems are seldom appreciated. It is important, however, to be aware that not all combinations of centering and scaling will work as anticipated (see e.g. Harshman and Lundy [1984]). Generally, only two combinations are straightforward: centering across both modes, or scaling within one mode combined with centering across the other mode [Harshman & Lundy 1984, Kruskal 1983]. The latter is exactly what, e.g., autoscaling amounts to.

 

4.1           Scaling within a mode disturbs centering across the same mode, but not across other modes

Scaling within one mode disturbs prior centering across the same mode, but not across other modes [ten Berge 1989]. This holds for two-way arrays as well as higher-order arrays. The reason for this is illustrated in Figure 4. The solid line shows a typical column vector and the dashed line a typical row-vector. When scaling within the first mode, the elements of any column are multiplied by different numbers and hence prior centering across the first mode is destroyed.

Consider a two-way array X (I×J). If the array is scaled within the first mode, this can be expressed as

 

Y = WX

(52)

 

where Y is the scaled array, W is an I × I diagonal matrix holding the scaling parameters and X is the original array. As can be seen, scaling within the first mode amounts to a multiplication of every row by a scalar. This does not affect any centering of the vectors across the second mode, because every element in a row-vector is multiplied by the same number. The average of any row will be the original average of the row scaled down accordingly, and therefore, if the average is zero, it will stay zero. In the first mode, however, each element in a column is multiplied by a different scalar. If centering is performed across the first mode, these column-vectors will not necessarily preserve their zero average after subsequent scaling within the first mode. Mathematically, the centered matrix PX becomes WPX upon scaling. As 1TWPX ≠ 0 in general, the preprocessed matrix is no longer guaranteed to be centered. Offsets constant across the first mode, however, will still be removed, because

 

P(ΦΘT + 1IμT) = PΦΘT  ⇒  WP(ΦΘT + 1IμT) = WPΦΘT.

(53)

 

Note also the interesting fact that if scaling is performed before centering, the result will be different. In that case, the original offsets will not be removed, but the data will be centered (yielding centered scores and residuals), because

 

PW(ΦΘT + 1IμT) = PWΦΘT + PW1IμT ≠ PWΦΘT.

(54)

 

 

This holds because, unlike P1I, which is a zero vector, PW1I is generally not a zero vector.
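The two orderings can be checked numerically. The following MATLAB sketch (a constructed example with known offsets μ; the dimensions and random parameters are arbitrary) verifies Eqs. (53) and (54):

% Sketch: centering then scaling removes offsets but de-centers the data;
% scaling then centering keeps the data centered but retains the offsets
I = 6; J = 4;
Phi = randn(I,2); Theta = randn(J,2); mu = randn(J,1);
X = Phi*Theta' + ones(I,1)*mu';          % bilinear part plus offsets
P = eye(I) - ones(I)/I;                  % centering operator, first mode
W = diag(rand(I,1) + 0.5);               % scaling within the first mode

Y1 = W*(P*X);                            % Eq. (53): center, then scale
disp(norm(Y1 - W*P*(Phi*Theta')))        % ~0: offsets are removed
disp(mean(Y1,1))                         % generally nonzero: no longer centered

Y2 = P*(W*X);                            % Eq. (54): scale, then center
disp(mean(Y2,1))                         % ~0: columns are centered
disp(norm(Y2 - P*W*(Phi*Theta')))        % generally nonzero: offsets remain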

 

Figure 4. Two-way array showing the dependency between centering and scaling

 

4.2           Centering across one mode disturbs scaling within all modes

Centering across one mode disturbs scaling within all modes. This holds for two- as well as multiway arrays, but there are certain cases for which it does not hold. One of these special cases is the situation in which two-way data are scaled within the second mode to a standard deviation of one and subsequently centered across the first mode (autoscaling)*. This subsequent centering will not disturb the scaling within the second mode (though it would disturb scaling within the first mode, had that been performed). The reason is that the scaling is specifically performed relative to the center of the data (standard deviations are based on centered data). Hence, any change in offset is immaterial for the standard deviation. Scaling by other means than standard deviations will not have this property. In multiway analysis it is common to use mean squares for scaling instead of standard deviations because such scaling more often converges when implemented in an iterative scheme and because scaling by standard deviations implicitly assumes an offset, which may or may not be present depending on the structural part of the model.
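This special status of autoscaling is easy to verify. In the following sketch (example data only; implicit expansion is assumed), centering after scaling by standard deviations leaves the scaling intact, whereas centering after mean-square scaling does not:

% Sketch: centering across mode one after scaling within mode two
X = randn(20,5) + 3;                  % example data with offsets
Xs = X ./ std(X);                     % scale columns to unit standard deviation
Z  = Xs - mean(Xs);                   % then center across the first mode
disp(std(Z))                          % all ones: the scaling survives

Y  = X ./ sqrt(mean(X.^2));           % scale to unit mean square instead
Yc = Y - mean(Y);                     % then center across the first mode
disp(sqrt(mean(Yc.^2)))               % no longer ones: scaling is disturbed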

 

4.3           Centering across and scaling within the same mode is problematic

Centering across a mode within which scaling is also applied, or vice versa, will generally not retain all the properties of the two individual operations, as discussed earlier. For example, if centering across the first mode and scaling within the first mode is desired, then setting

 

Y = WPX

(55)

 

with W being a diagonal scaling matrix and P a centering operator, will not lead to a preprocessed matrix in which the first-mode vectors are centered, although possible offsets will be eliminated. Conversely, setting

 

Y = PWX

(56)

 

will not eliminate the original offsets, even though the preprocessed array will have centered first mode vectors. Hence, for a specific application, a choice has to be made between these two approaches, depending on why the centering is applied.

 

4.4           Summary

Proper centering is a centering operation that correctly removes the presupposed offsets and which does not introduce other offsets into the data. Likewise, proper scaling introduces the correct weights into the loss function. Stated otherwise, proper centering and scaling do what they are supposed to do. Proper centering and proper scaling can sometimes be combined. Unproblematic combinations can always be expressed as

 

(PX)W or W(XP)

(57)

 

where P is the projection matrix that centers the data and W is a weighting matrix, both of appropriate sizes. The brackets in Eq. (57) indicate the proper order in which the preprocessing steps have to be performed. If they are performed in the opposite order, offsets will not be removed, although the data will be centered. Problematic combinations are

 

WPX or XPW.

(58)

 

These combinations do not retain all the properties desired of the preprocessing steps.

 

5.          Three-way preprocessing

The preprocessing of multiway arrays will now be discussed using three-way arrays as an example. The basic properties discussed thus far are unchanged: centering has to be performed across a specific mode, and scaling has to be performed by a transformation within a specific mode. Most difficulties in preprocessing three-way arrays arise from the problems outlined so far, which all generalize to multiway arrays. The problems are sometimes aggravated, because three-way data are often rearranged (matricized) to two-way arrays before preprocessing. This is unfortunate, because it introduces a column-mode that is a combination of two of the original modes. Transformations within or across this combined mode should be avoided if multiway models are to be fitted, because the mode is not a 'real' mode, but merely a computational construct.

 

5.1           Centering

5.1.1           Possible three-way offsets and their proper removal

The observations on centering of two-way data are helpful in discussing centering of three- and higher-way arrays. If the basis of two-way centering is understood, then three-way centering is quite simple. In the following, it will be assumed that the true model of the data is a PARAFAC model plus possible offsets, but the conclusions hold for any multilinear model.

 

 

 

Figure 5. Structure of offsets in three-way (trilinear) data when all elements have the same offset held in the scalar λ. Two alternative but equivalent ways of writing the PARAFAC model are shown: the slab notation using submatrices Xk and the notation described in Eq. (12)

 

Consider a three-way array. Conceptually, offsets may occur in three different ways: constant across all modes (Figure 5), constant across two modes (Figure 6) or constant across one mode only, as shown in Figure 7. In the figures, the first-mode loadings are held in the I×R matrix A, the second-mode loadings are held in the J×R matrix B and the third-mode loadings are held in the K×R matrix C. The matrix Dk is an R×R diagonal matrix holding the kth row of C in its diagonal.

 

 

Figure 6. Structure of offsets in three-way (trilinear) data when all elements in each vertical slab have the same offset. The offsets are held in the J-vector λ

 

 

Figure 7. Structure of offsets in three-way (trilinear) data when all elements in each vector have the same offset (case three). The offsets are held in a matrix Λ (J×K) whose jk'th element holds the offset of the vertical column with second and third mode index j and k. The vector λk  is the k'th row of the matrix Λ

 

Regardless of the structure of the offsets, the basic principle of centering is that the data must be preprocessed so that they are projected onto the nullspace of a vector of ones in a particular mode. For the first mode, projecting a data array X onto the nullspace of 1T, i.e., centering across the first mode, amounts to

 

Y = PX(I×JK),

(59)

 

where P = I - (11T/I). For the second and third modes, the centering can be performed similarly. As mentioned earlier, such centering is referred to as single-centering. Centering, for example, across the first mode of an array can thus be done by matricizing the array to an I × JK matrix, and then centering this matrix across the first mode as in ordinary two-way analysis:

 

yijk = xijk - mjk,   mjk = (1/I) Σi=1..I xijk.

(60)

 

The column-mean is subtracted from every element, as depicted graphically in Figure 8. As can be seen, single-centering is similar in structure to the type of offsets shown in Figure 7.

 

Figure 8. Centering across the first mode. For each column of the raw data, a mean value is calculated and subtracted from each element in the column. Thus, a two-way matrix of mean values is obtained. Note that this centering is identical to matricizing the data to a two-way structure and centering these two-way data across the first mode

Such single-centerings, performed successively across two modes, are referred to as double-centering. That is, double-centering is performed by first centering across one mode, and then centering the outcome across another mode. The order of the centerings is immaterial, but it is essential that they are performed sequentially. For all three situations depicted in Figures 5-7, the above centering across the first mode will remove the shown offsets, because in all three situations the offsets are constant across the first mode.

As for the two-way case, only single-centering leads to the properties sought in centering (removal of offsets). Other types of centering, such as subtracting the overall mean, will introduce artifacts that have to be modeled in addition to the inherent systematic variation. This was shown for the two-way case in Section 2.3.2. Such types of incorrect centering are often used in three-way analysis, for example when matricized data are centered across a combined mode. If an array of structure (variables × time × samples) is centered by subtracting the average calculated across samples and time, then, in line with the above example, artificial offsets are introduced and the subsequent model will have to fit this additional variation as well. This can obscure validation and exploration of the model and will lead to models that do not provide overall least squares solutions.
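The difference between proper single-centering and centering over a combined mode can be sketched as follows (a minimal example where the row mode plays the role of the variable mode; the mean value matrix of Figure 8 corresponds to mean(X,1)):

% Sketch: proper single-centering versus centering over a combined mode
I = 4; J = 5; K = 3;
X = randn(I, J*K);             % three-way array matricized to I x JK
Yproper = X - mean(X, 1);      % Eqs. (59)-(60): subtract each column mean
Ywrong  = X - mean(X, 2);      % one mean per row, averaged over the combined
                               % JK mode: introduces artificial offsets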

 

5.2           Scaling

As explained for the two-way case, scaling is a transformation of a particular variable (or object) space. Instead of fitting the model to the original data, the model is fitted to the data transformed by a (usually) diagonal scaling matrix in the mode whose variables are to be scaled. This means that whole matrices instead of columns have to be scaled by the same value in three-way analysis. For a four-way array, three-way slabs would have to be scaled by the same scalar. Mathematically, scaling within the first mode can be described as

 

yijk = wixijk,     i = 1,…,I; j = 1,…,J; k = 1,…,K

(61)

 

where Y with elements yijk is the scaled array and, for instance, setting

wi = ( (1/(JK)) Σj=1..J Σk=1..K xijk2 )-1/2

(62)

 

will scale to a unit mean square within the sample mode. Using matricized arrays, scaling may be expressed as

 

Y(I×JK) = WX(I×JK)

(63)

 

where W is an I × I diagonal matrix holding the scaling values in its diagonal. The assumed structural model of the data is a multilinear model, e.g. a PARAFAC model A(C⊙B)T, where ⊙ denotes the column-wise Kronecker (Khatri-Rao) product. With the above scaling, a similar structural model, P(R⊙Q)T, will hold for the transformed data, because

 

||Y - P(R⊙Q)T||2 = ||WX - P(R⊙Q)T||2 = ||W(X - W-1P(R⊙Q)T)||2.

(64)

 

The loss function minimized when fitting a model with a weighted criterion is

 

||W(X - A(C⊙B)T)||2

(65)

 

and hence it holds that the parameters of this model can be found by fitting the scaled data and setting

 

A = W-1P, B = Q, and C = R.

(66)

 

Thus, fitting the scaled data provides a solution not only to the problem of fitting a model to Y, but also to the weighted problem of fitting a model to X, where the first-mode loadings obtained are transformed by the scaling matrix W. As for two-way scaling, it is emphasized that the model parameters found in this way have no direct relation to the parameters found when fitting the model to the raw data in a least squares sense.
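A sketch of this procedure for a matricized array (the multilinear fitting step is deliberately left abstract, since any PARAFAC routine can be used; the dimensions are arbitrary):

% Sketch: scaling within the first mode and back-transforming, Eqs. (61)-(66)
I = 4; J = 5; K = 3;
X = randn(I, J*K);                  % matricized I x JK array
w = 1 ./ sqrt(mean(X.^2, 2));       % unit mean square weights, Eq. (62)
W = diag(w);
Y = W*X;                            % scaled array, Eq. (63)
% ...fit a PARAFAC model to Y, giving loadings P, Q and R; then, Eq. (66):
% A = W\P;  B = Q;  C = R;          % A = inv(W)*P gives raw-domain loadings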

 

5.2.1           Incorrect scaling of three-way arrays

Scaling has to be applied by transforming the data within a given mode. It is not appropriate to scale an array within two combined modes, which can happen, for example, when autoscaling a matricized array. Such inappropriate scaling will lead to the inclusion of artificial components in the data.

 

 

Figure 9. Incorrect scaling of three-way array. Spectra from one sample measured at two conditions (slabs one and two). The top plot shows the data before scaling and the lower one after a hypothetical scaling.

 

Consider an I × 30 × 2 three-way array with I samples and thirty variables (say, spectral), measured at two conditions (e.g., two different pH values), as shown in Figure 9. In the two top plots, the profile of the first hypothetical sample is shown (30 variables): to the left the spectrum measured at the first condition, and to the right at the second condition. As can be seen, the shape is identical in the two plots; only a multiplicative factor distinguishes the profiles. In this case, only one component is necessary for describing this variation. If the array is scaled such that different occurrences of the same variable are scaled differently, then the phenomenon can no longer be described by the variation of one basic profile. This is shown in the lower plots, where the first three wavelength variables have been scaled differently in slabs one and two: in slab one they have been scaled by one and in slab two by 0.01. It is clear that, to preserve the multilinearity of the data, every occurrence of a given variable must be scaled by the same amount.

      To show the influence of incorrect scaling, consider a synthetic data set with PARAFAC structure

 

X(I×JK) = A(C⊙B)T

(67)

 

where A is a 4 × 2 matrix of random numbers and B and C are defined likewise. It is not important how these matrices are generated, as long as they have full column rank. Consider the following alternative two-component PARAFAC models:

 

1.        Using X

2.        Using X centered across the first mode and scaled within the combined second and third mode (autoscaled as a two-way matrix, X(I×JK))

3.        Using X scaled within, e.g., mode two.

 

The fit values of these three models, each using two components, are given below as percentages of the sum of squares of the preprocessed data.

 

1.        100.00%

2.          98.80%

3.        100.00%

 

As can be readily seen, a two-component model is appropriate, and remains so even after scaling as in case (3). However, using ordinary two-way scaling methods as in case (2) destroys the multilinear structure of the data and deteriorates the model.
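The construction of the synthetic data and the two kinds of scaling can be sketched as follows (a recent MATLAB with implicit expansion and vector-dimension mean is assumed; the PARAFAC fits themselves are left to any multilinear routine, e.g. the N-way toolbox):

% Sketch: synthetic trilinear data (Eq. (67)) and proper/improper scaling
I = 4; J = 4; K = 4; R = 2;
A = rand(I,R); B = rand(J,R); C = rand(K,R);     % full column rank (almost surely)
CB = zeros(J*K, R);                              % Khatri-Rao product of C and B
for r = 1:R
    CB(:,r) = kron(C(:,r), B(:,r));
end
X = A*CB';                                       % matricized I x JK array

% case 2: autoscaling X as an ordinary two-way matrix (improper)
X2 = (X - mean(X)) ./ std(X);                    % destroys the trilinear structure

% case 3: scaling within mode two (proper): one weight per variable j,
% applied to every occurrence of that variable across i and k
w = 1 ./ sqrt(mean(reshape(X,I,J,K).^2, [1 3])); % 1 x J weights
X3 = reshape(reshape(X,I,J,K) .* w, I, J*K);     % preserves the trilinear structure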

 

5.3           Simultaneous centering and scaling

The exact same rules for the interdependence of preprocessing steps apply to multiway data as to two-way data (Section 4), also with respect to the treatment of missing data etc. Any preprocessed array may be written in matrix notation using the matricized I × JK data array X and the preprocessed array Y

 

Y = MIX(MK ⊗ MJ)T

(68)

 

where ⊗ denotes the Kronecker product and, e.g., MI is an I × I matrix holding either the centering or the scaling transformation for the first mode, or even a combination of such. The exact content of these transformation matrices depends on the type of preprocessing chosen, and in the case of iterative preprocessing M may be a product of several matrices [Harshman & Lundy 1984].

Combined centering and scaling in one operator M is generally not going to retain all the properties of the two individual operations (see Section 4.3). If centering across the first mode and scaling within the first mode is desired, centering first and scaling afterwards will not lead to an array in which the first-mode vectors are centered, although possible offsets will be eliminated. Scaling first and centering afterwards will not eliminate the original offsets, even though the preprocessed array will have centered first-mode vectors.

Another example based on the guidelines in Section 4 is a situation in which, e.g., scaling within the second and third modes is desired together with centering across the first mode. If the preprocessing is performed so that the data are centered after the weights are determined (iteratively) and applied,

 

Y = P(X(WK ⊗ WJ)T)

(69)

 

then the centering operation will destroy the property of, e.g., suitable mean square variation in the second and third modes (the brackets indicate the proper order of the preprocessing steps). If, on the other hand, the preprocessing is performed as

 

Y = (PX)(WK ⊗ WJ)T

(70)

 

this is not the case: the data are first centered, and the weights are then determined from the centered array rather than from the raw data. Hence, the ordering in Eq. (70) is preferred.
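The preferred ordering can be sketched as follows (one pass of weight determination shown; in practice the weights for modes two and three are determined iteratively, and a recent MATLAB with implicit expansion and vector-dimension mean is assumed):

% Sketch: the ordering of Eq. (70) - center first, then determine weights
I = 4; J = 5; K = 3;
X = randn(I, J*K) + 1;                   % matricized I x JK array with offsets
Xc = X - mean(X, 1);                     % centering across the first mode, PX
T = reshape(Xc, I, J, K);
wJ = 1 ./ sqrt(mean(T.^2, [1 3]));       % weights within the second mode
wK = 1 ./ sqrt(mean(T.^2, [1 2]));       % weights within the third mode
Y = reshape(T .* wJ .* wK, I, J*K);      % elementwise form of (PX)(WK kron WJ)'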

 

5.4           Summary

Proper single-centering of a multiway array can always be expressed as

 

PX

(71)

 

where X (I × JKL...) is a multiway array rearranged to a two-way matrix such that the mode to be centered across is the row-mode. Hence, P works on the non-matricized mode (I in this case). All combinations of centering of this form are proper and will yield a preprocessed array with the offsets removed.

Proper scaling of a multiway array can be expressed as

 

WX

(72)

 

where X (I × JKL...) is a multiway array rearranged to a two-way matrix such that the mode to be scaled within is the row-mode. Scaling, e.g., to unit mean square variation within several modes sequentially will not yield modes within which the variables or samples have unit mean square variation unless iterative determination of the weights is used.

Combinations of centering and scaling are unproblematic only when centering across several modes is desired, or when scaling within one mode is combined with centering across other modes. Centering must be performed before scaling. All other combinations will only partly fulfill the requirements. E.g., centering across a mode followed by scaling within the same mode will not lead to zero-average vectors in that mode, but it will remove any offsets across the mode.

 

6.          Conclusion

A number of important features of the common preprocessing steps centering and scaling have been discussed.

 

·         Centering deals with the structural model; scaling with the way this model is fitted.

·         Centering is a part of a two-stage procedure in which offsets are removed first and the multilinear terms are estimated in the second stage. This is only equivalent to the one-stage procedure of estimating all parameters simultaneously if proper centering, as defined in this paper, is used. Proper centering is shown in Figure 10 (always across one mode at a time).

·         For offsets that cannot be removed by using proper centering, a one-stage procedure has to be used. This holds generally for data with missing elements.

·         Scaling provides a way to change the objective function by assuming certain weights. Some weight arrangements can be dealt with by scaling followed by ordinary least squares fitting. Only proper scaling is allowed. Proper scaling is shown in Figure 10. For weighting schemes that cannot be dealt with by scaling, weighted least squares algorithms have to be used.

·         Incorrect centering or scaling introduces artificial variation. The amount of artificial variation introduced depends on the data and leads to models that are suboptimal compared with their 'correct' (least squares) counterparts, because the artificial variation has to be modeled additionally.

 

Two-way results

·         Proper centering can always be written as PX

·         Several centerings can be performed sequentially

·         Proper scaling can always be expressed as WX

·         Several scalings can be performed sequentially, but will generally need iterations to establish the scaling constants and this may not converge

·         Unproblematic combinations of centering and scaling can be expressed as (PX)W.

 

Similar results hold for transposed matrices.

 

Three-way results

·         Proper centering can always be written as PX, where X is the three-way array matricized so that the mode to be centered across is the first mode.

·         Several such single-centerings may be performed sequentially across several modes

·         Proper scaling can always be expressed as WX for a matricized array as above

·         Several scalings can be performed sequentially, but will generally need iterations and may not converge

·         Proper combinations of centering and scaling can be expressed similarly to the two-way case. That is, scaling does not affect centering across other modes, but centering affects scaling within all modes.

 

The appropriate centering and scaling procedures can most easily be summarized as in Figure 10. Centering must be done by subtracting scalars from individual vectors of the array, while scaling must be performed by multiplying individual slabs by scalars.

 

 

Figure 10. Three-way array. Proper centering must be done across a mode, exemplified here by proper centering across the second mode. From all elements of a specific row (fixed i and k) the same scalar mik is subtracted. Proper scaling, e.g., within the first mode is performed such that all elements of a specific horizontal slab are multiplied by the same scalar wi.

 

Acknowledgments

Most algorithms used in this paper are available from the home page of Food Technology at http://www.models.life.ku.dk. The first author is grateful for financial support through LMC (Center for Advanced Food Studies), EU project NwayQual GRD1-1999-10377 and AQM (Advanced Quality Monitoring) supported by the Danish Ministries of Research and Industry. Anonymous referees are thanked for helpful comments.

 

Appendix 1: Projections

In this appendix, the projection of vectors on other vectors is explained. In Figure 11 the orthogonal projection of an I-vector b on an I-vector a is considered. The resulting projection b̂ can be expressed in the original coordinate system or in terms of the new basis vector a.

 

 

Figure 11. Projection of b on a.

 

Expression of b̂ in terms of a

The score of b̂ on the new basis vector a is k. The following equations hold

b̂ = ka,   cos(a,b) = aTb/(||a|| ||b||),   ||b̂|| = ||b|| cos(a,b)

(73)

 

where the second equation is called the cosine rule for vectors. Hence,

 

k = ||b̂||/||a|| = aTb/(aTa).

(74)

 

This expression becomes particularly simple when a is normalized to length one (||a|| = 1):

 

k = (a,b) = aTb

(75)

 

Note that expressions like Eq. (75) are used in principal component analysis (PCA), where usually the loading vectors p are chosen to be of length one, and then

ti = xiTp,   i = 1,…,I,   or   t = Xp

(76)

 

where xiT is a row of X (I×J). Hence, the scores of PCA are just the orthogonal projections of the rows of X on p, expressed in the coordinates of this p.
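For instance (a small sketch using the truncated SVD of X as the PCA fit; the data are arbitrary):

% Sketch: Eqs. (75)-(76) - PCA scores as projections of the rows of X on p
X = randn(10,4);
[U,S,V] = svds(X,1);             % rank-one bilinear model of X
p = V(:,1);                      % loading vector with ||p|| = 1
t = X*p;                         % scores obtained by projection
disp(norm(t - U*S))              % ~0: identical to the scores from the SVD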

 

Expression of b̂ in terms of the original coordinate system

It is also possible to express b̂ in the original coordinate system. Referring to Figure 11, it holds that b̂ = ka. Hence,

b̂ = ka = a(aTb)/(aTa) = (aaT/(aTa))b = Pb

(77)

 

and the matrix P = aaT/(aTa) (I×I) is special because

PT = P   and   PP = P

(78)

 

which means that P is symmetric and idempotent. Such a matrix is an orthogonal projection matrix [Schott 1997]; it projects orthogonally on the vector a. In the case of centering (see Eq. (15)), the vector 1 takes the role of a.
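These properties are easily checked numerically (a minimal sketch):

% Sketch: Eqs. (77)-(78) and the centering operator as a projection
I = 5;
a = ones(I,1);                   % in centering, 1 takes the role of a
P = a*a'/(a'*a);                 % projects orthogonally onto a
disp(norm(P - P'))               % 0: symmetric
disp(norm(P - P*P))              % ~0: idempotent
b = randn(I,1);
disp(norm(P*b - mean(b)*a))      % ~0: the projection of b on 1 is its mean
disp(mean((eye(I) - P)*b))       % ~0: I - P centers b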

               

Orthogonal projections in general including residuals

A matrix P = AA+ (where '+' indicates the Moore-Penrose inverse) projects orthogonally on the range (column-space) of A [Schott 1997]. It can be checked that a+ = aT/(aTa); hence, Eq. (78) is a special case of the general orthogonal projection theorem.

It is also interesting to consider the residuals from the orthogonal projection of b on A, that is, the vector b*. As b* = b - b̂, it holds that

b* = b - Pb = (I - P)b = P⊥b

(79)

 

and P⊥ = I - P is again a symmetric and idempotent matrix, i.e., an orthogonal projection operator. This matrix projects onto the orthogonal complement of A, that is, onto range(P⊥), where range(.) is used to indicate the range (column-space) of a matrix or vector. It holds that range(P⊥) = nullspace(AT), where nullspace(.) is used to indicate the nullspace of a matrix or vector. For every x ∈ range(P⊥) it holds that ATx = 0; hence, x ∈ nullspace(AT) and vice versa.

 

Appendix 2: Fitting a bilinear model plus offsets across one mode equals fitting a bilinear model to centered data

 

Theorem

Given X of size I × J and the column-dimension R of a sought bilinear model. Then

 

min ||X - (ABT + 1mT)||2 = min ||Y - CDT||2

(80)

 

where Y is the original data, X, with the column-averages subtracted.

 

Proof

The proof has been given by Kruskal [1977] and Gabriel [1978]. Once it is understood that centering is a projection, the theorem is simple to prove. Let the loss function be

 

||X - ABT - 1mT||2

(81)

 

and partition it into two orthogonal parts

 

||P(X - ABT - 1mT)||2 + ||P⊥(X - ABT - 1mT)||2

(82)

 

using the Pythagorean fact that the squared norms of two orthogonal parts sum to the squared norm of the total, with P⊥ = I - P = 11T/I being the projection onto the range of 1. This equation can be further developed to

 

min ||PX - PABT||2 + min ||P⊥X - P⊥(ABT + 1mT)||2 = min ||PX - PABT||2

(83)

 

because ||P⊥X - P⊥(ABT + 1mT)||2 will be zero by setting

 

mT = (1T/I)(X - ABT), since 1(1T/I) = 11T/I = P⊥.

(84)

 

Setting C = PA and D = B will therefore provide a solution with exactly the same fit as would be obtained by minimizing the original loss function. The solution may be computed using any bilinear algorithm, e.g. by fitting a principal component analysis model to Y. The scores will automatically be centered, because they are linear combinations of the columns of Y, and if the columns are centered, so are their linear combinations.
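A numerical check of the theorem (a minimal sketch using the truncated SVD as the bilinear fitting routine; the data and rank are arbitrary):

% Sketch: Eq. (80) - offsets plus bilinear part versus centered-data fit
X = randn(20,8) + ones(20,1)*linspace(1,5,8);   % data with column offsets
R = 2;
Y = X - mean(X,1);                              % centered data, Y = PX
[U,S,V] = svds(Y, R);
C = U*S; D = V;                                 % bilinear model of Y
m = mean(X - C*D', 1)';                         % offsets from Eq. (84)
E1 = X - C*D' - ones(20,1)*m';                  % residuals of the full model
E2 = Y - C*D';                                  % residuals of the centered fit
disp(norm(E1 - E2, 'fro'))                      % ~0: identical fits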

 

Reference List

 

     1.    Amrhein M, Reaction and flow variants/invariants for the analysis of chemical reaction data. Ph.D. thesis, École Polytechnique Fédérale de Lausanne, 1998.

     2.    Amrhein M, Srinivasan B, Bonvin D, Schumacher MM, On the rank deficiency and rank augmentation of the spectral measurement matrix, Chemom Intell Lab Syst, 1996, 33, 17-33.

     3.    Andrews DT, Wentzell PD, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Analytica Chimica Acta, 1997, 350, 341-352.

     4.    Barnes RJ, Dhanoa MS, Lister SJ, Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra, Appl Spectrosc, 1989, 43, 772-777.

     5.    Bro R, Sidiropoulos ND, Smilde AK, Maximum likelihood fitting using simple least squares algorithms, Journal of Chemometrics, 2001, submitted.

     6.    Bro R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. Ph.D. thesis, University of Amsterdam (NL), http://www.mli.kvl.dk/staff/foodtech/brothesis.pdf, 1998.

     7.    Cornelius PL, Seyedsadr M, Crossa J, Using the shifted multiplicative model to search for 'separability' in crop cultivar trials, Theoretical and Applied Genetics, 1992, 84, 161-172.

     8.    Gabriel KR, Least squares approximation of matrices by additive and multiplicative models, Journal Of The Royal Statistical Society Series B Statistical Methodology, 1978, 40, 186-196.

     9.    Grung B, Manne R, Missing values in principal component analysis, Chemom Intell Lab Syst, 1998, 42, 125-139.

   10.    Guo Q, Wu W, Massart DL, The robust normal variate transform for pattern recognition with near-infrared data, Analytica Chimica Acta, 1999, 382, 87-103.

   11.    Harshman RA, Lundy ME, Data preprocessing and the extended PARAFAC model, Research Methods for Multimode Data Analysis, (Eds. Law,HG, Snyder,CW, Jr., Hattie,J, and McDonald,RP), Praeger, New York, 1984, 216-284.

   12.    Helland IS, Næs T, Isaksson T, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemom Intell Lab Syst, 1995, 29, 233-241.

   13.    Judge GG, Griffiths WE, Carter Hill R, Lütkepohl H, Lee TC, The theory and practice of econometrics. John Wiley & Sons, New York, 1985.

   14.    Kiers HAL, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 1997, 62, 251-266.

   15.    Kiers HAL, Towards a standardized notation and terminology in multiway analysis, Journal of Chemometrics, 2000, 14, 105-122.

   16.    Kruskal JB, Some least squares theorems for matrices and N-way arrays, 1977, Manuscript, Bell Laboratories, Murray Hill, New Jersey.

   17.    Kruskal JB, Multilinear methods, Proceedings of Symposia in Applied Mathematics, 1983, 28, 75-104.

   18.    Kruskal JB, Rank, decomposition, and uniqueness for 3-way and N-way arrays, Multiway Data Analysis, (Eds. Coppi,R and Bolasco,S), Elsevier, Amsterdam, 1989, 8-18.

   19.    Little RJA, Rubin DB, Statistical analysis with missing data. John Wiley & Sons, New York, 1987.

   20.    Nørgaard L, Classification and prediction of quality and process parameters of beet sugar and thick juice by fluorescence spectroscopy and chemometrics, Zuckerindustrie, 1995, 120, 970-981.

   21.    Paatero P, A weighted non-negative least squares algorithm for three-way 'PARAFAC' factor analysis, Chemom Intell Lab Syst, 1997, 38, 223-242.

   22.    Paatero P, Tapper U, Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 1994, 5, 111-126.

   23.    Schott JR, Matrix analysis for statistics. Wiley and Sons, New York, 1997.

   24.    Seasholtz MB, Kowalski BR, The parsimony principle applied to multivariate calibration, Analytica Chimica Acta, 1993, 277, 165-177.

   25.    Styan GPH, Hadamard products and multivariate statistical analysis, Linear Algebra and its Applications, 1973, 6, 217-240.

   26.    ten Berge JMF, Convergence of PARAFAC preprocessing procedures and the Deming-Stephan method of iterative proportional fitting, Multiway Data Analysis, (Eds. Coppi,R and Bolasco,S), Elsevier, Amsterdam, 1989, 53-63.

   27.    ten Berge JMF, Kiers HAL, Convergence properties of an iterative procedure of ipsatizing and standardizing a data matrix, with application to PARAFAC/CANDECOMP preprocessing, Psychometrika, 1989, 54, 231-235.

   28.    Tucker LR, A method for synthesis of factor analysis studies, Personnel Research Section, Report 984, Dept. of the Army, 1951.

   29.    van Eeuwijk FA, Between and beyond additivity and non-additivity; the statistical modelling of genotype by environment interaction in plant breeding. Ph.D. thesis, University of Wageningen, 1996,

   30.    Weinberg JR, A Short History of Medieval Philosophy. Princeton University Press, New Jersey, 1964, 235-266.

   31.    Wentzell PD, Andrews DT, Hamilton DC, Faber NM, Kowalski BR, Maximum likelihood principal component analysis, Journal of Chemometrics, 1997, 11, 339-366.

   32.    Wentzell PD, Lohnes MT, Maximum likelihood principal component analysis with correlated measurement errors: theoretical and practical considerations, Chemom Intell Lab Syst, 1999, 45, 65-85.

   33.    Westerhuis JA, Kourti T, MacGregor JF, Comparing alternative approaches for multivariate statistical analysis of batch process data, Journal of Chemometrics, 1999, 13, 397-413.

 



* If the algorithm for fitting the bilinear part of the model is iterative, it is useful not to iterate until convergence within each pass of step two. Instead, only a few iterations are performed before one proceeds to step three, where the complete model (ABT + 1mT in the case of PCA) is calculated and used for obtaining better estimates of the missing values.

* Normally, centering would be performed before scaling for computational reasons, as the averages are needed for scaling by the inverse standard deviation.

