# Maximum likelihood fitting using ordinary least squares algorithms

Rasmus Bro1, Nicholas D. Sidiropoulos2 & Age K. Smilde3

1Chemometrics Group, Food Technology, Dept. Dairy and Food Science,

The Royal Veterinary and Agricultural University, Denmark, rb@kvl.dk

2Department of Electrical and Computer Engineering, University of Minnesota,

Minneapolis, MN 55455, USA, nikos@ece.umn.edu

3Process Analysis and Chemometrics, Department of Chemical Engineering

University of Amsterdam, The Netherlands, asmilde@its.chem.uva.nl

## Abstract

In this paper, a general algorithm is provided for maximum likelihood fitting of deterministic models subject to Gaussian distributed residual variation (including any type of nonsingular covariance). By deterministic models is meant models in which no distributional assumptions are valid (or applied) on the parameters. The algorithm may also more generally be used for weighted least squares (WLS) fitting in situations where distributional assumptions are either not available or where other than statistical assumptions guide the choice of loss function. The algorithm to solve the associated problem is called MILES (Maximum likelihood via Iterative Least squares EStimation). It is shown that the sought parameters can be estimated using simple least squares (LS) algorithms in an iterative fashion. The algorithm is based on iterative majorization and extends earlier work for WLS-fitting of models with heteroscedastic uncorrelated residual variation. The algorithm is shown to include several current algorithms as special cases. E.g. maximum likelihood principal component analysis models with and without offsets can be easily fitted with MILES. The MILES algorithm is simple and can be implemented as an outer loop in any least squares algorithm, e.g. for analysis of variance, regression, response surface modeling etc. Several examples are provided on the use of MILES.

Keywords: Iterative majorization, PARAFAC, PCA, weighted least squares

## 1.    Introduction

In order to set the stage for the maximum likelihood method developed here, it is necessary to first discuss the least squares and weighted least squares approaches for modeling. Often, least squares fitting is used for estimating parameters. Least squares fitting is especially useful for fitting parameters when the residual variation is homoscedastic, independent, and Gaussian. In this case, least squares fitting will usually yield maximum likelihood estimates. When the residual variation is no longer homoscedastic, the different magnitudes of different errors can be handled by using weighted least squares fitting rather than least squares fitting. The weight attached to a certain residual reflects the magnitude of the variance of that particular residual. However, in some situations the variance of the errors are not independent, and simple weighted least squares is no longer optimal. In order to handle correlated errors, more complicated fitting procedures are generally required, and often the algorithms for such fitting procedures are either complicated to implement or cumbersome to use. In this paper, a general fitting procedure will be devised that can handle all the above situations for fitting any model for which a least squares fitting procedure exists.

The general models dealt with in this paper are first described along with the optimization problem to be solved. Let x be an I×1 vector holding a data set for which a model is sought. Having x organized in a vector is not a restriction of the type of data that can be modeled. This vector may represent a matrix, a three-way array, or a multi-way array or any other structure of data elements rearranged appropriately into a vector. For a P×J matrix Z this rearrangement can be achieved by vectorizing the matrix such that x = vecZ = [z1T z2T .. zJT]T where zj is a P-vector holding the jth column of Z. For a three-way P×J×K array Z, vectorization can be done by setting x = vecZ = [vecZ1T vecZ2T vecZKT]T where Zk is a P×J matrix with typical elements zpjk, p=1,..,P; j=1,..,J.

A model m (I×1), with a certain structure and certain constraints, is sought for the data. Hence

 x = m + e

where e (I×1) holds the residual variation not explained by the model. The model, m, is defined as belonging to a certain set, m , where the set  defines the structure and constraints. In fitting the model in a least squares sense, m is found as the argument minimizing

where  denotes the squared Euclidian/Frobenius norm of g, i.e., the sum of the squared elements of g. The expression  indicates that the loss function  is a function of both m and x and that x is considered fixed; hence only the model m is optimized. This follows the notation used by Kiers in a related paper [Kiers 1997]. For the model, m, any structure and/or constraints can apply.

For example, in principal component analysis, m = vecM =vec(ABT) with both A and B having orthogonal columns. Centering can also be included because this corresponds to including a set of offsets as part of the least squares model [Bro & Smilde 2001]. In such case, M = ABT + 1nT where n is a J-vector holding the offsets (the average of the jth column in case these are estimated by centering). Another model could be a three-way PARAFAC model [Harshman 1970] in which case M = [AD1BT AD2BT ADKBT] with Dk (k=1,..,K) being diagonal. As yet another alternative, the model could be a multivariate linear regression model which can be formulated as M = QB where Q is a given fixed I×P matrix holding the independent variables, M (I×R) holds the model of the R dependent variables and B (P×R) holds the regression coefficients. In short, any problem, which has a least squares formulation, can be expressed according to Eq. (2). For many special types of models, there are well-established algorithms for fitting the above least squares model. This paper deals with incorporating knowledge of the error covariance structure or a priori problem-specific information in finding the solution to Eq. (1), taking advantage of the availability of an algorithm for fitting the least squares model.

A way to pose the problem considered is the following. The problem is to be able to fit deterministic models (i.e. where parameters have no distributional assumptions attached) but subject to any type of nonsingular covariance of the residual Gaussian noise. A different problem that leads to the same loss function and hence solution, is the problem of fitting a model in a weighted least squares (WLS) sense, i.e., to minimize the loss eTWTWe where e is the residual vector and where W is a nonsingular weight matrix. In the following, the derivation of the algorithm will be presented as the problem of maximum likelihood fitting, but, as stated, weighted least squares fitting can also be sensible on the basis of non-distributional a priori information, even if it does not admit a maximum likelihood interpretation.

After deriving the MILES algorithm, several applications are discussed. The applications will focus on well-known chemometric models such as principal component analysis and PARAFAC.

## 2.    Theory

### 2.1.  Problem formulation

It is assumed that the residual vector e (I×1) in the model x = m + e is zero-mean Gaussian, with known covariance matrix

 cov(e) = Δ = E[eeT] (3)

where Δ is a matrix of size I×I assumed to be of full rank and where eT is the transpose of e. If e does not have zero-mean expectation, but the mean is known or can be estimated, this mean can simply be subtracted from the observations. The vector m is a deterministic unknown subject to structural constrains. Specifically, the only thing known about m is that it belongs to the set , but no distribution of the outcomes of m over  is known or can be assumed. We will develop a deterministic likelihood algorithm for fitting this model.

Estimating parameters in a model using least squares algorithms is appropriate for situations in which the errors, e in Eq. (1), are iid Gaussian. By appropriate is meant that the maximum likelihood solution is found under the given premises. However, for other types of distributions, this is not the case. If the residuals are independently Gaussian distributed but with different variances, the maximum likelihood principle leads to fitting the model in a weighted least squares sense according to the criterion

where W is an I×I diagonal matrix holding the weights and where  is the squared Frobenius norm of G. The ith diagonal element signifies the uncertainty of the ith element of x and are equal to the inverse of the standard deviation of the corresponding residual elements. In e.g. bilinear decomposition such as principal component analysis (PCA), it is common to fit models using this criterion in the special situation where the elements in W corresponding to one specific column of the data matrix have the same value. The parameters of the model can then be found by fitting an ordinary least squares model to the data appropriately preprocessed by scaling each column [Bro & Smilde 2001, Martens et al. 2001]. For fitting PCA models in the case of general but diagonal W, more advanced algorithms have to be used [Paatero & Tapper 1993, Paatero & Tapper 1994]. An algorithm for fitting any structural model according to this criterion has been devised by Kiers [1997] and in fact, the work presented here can be seen as a natural extension of his work. A different algorithm, has been proposed specifically for PCA by Wentzell and coworkers [Wentzell et al. 1997] who also extended their model to the more complicated problem of having correlated errors in PCA.

If there is covariance between the different elements of e, the assumption of independence of the distributions of individual residuals is not valid. Then the loss function in Eq. (4) with only diagonal W is no longer the maximum likelihood solution. When e is zero-mean multivariate Gaussian, the covariance of its distribution is

 cov(e) = Δ

which has non-zero off-diagonal elements in general and is assumed to be known. This covariance structure incorporates the assumed error distribution for both  and  as extreme cases.

The relevant part of the log-likelihood function of estimating the model parameters m is [Judge et al. 1985, p. 181]

 (x-m)T-1(x-m)

and by taking W = Δ-, Eq. (6)  can be rewritten as

 (x-m)TWTW(x-m) or [W(x-m)]T[W(x-m)]

Maximizing the likelihood, requires minimizing Eq. (7), hence the model can be found by minimizing the loss function (e.g. [Kay 1993])

 .

The matrix W is of size I×I and holds the weights. In case of the error distribution of Eq. (5) it holds that W = Δ- assuming full rank of Δ, i.e. W is the symmetric square root of the inverse of Δ. Thus, if the eigendecomposition of Δ-1 is Δ-1 = UDUT with D diagonal and U an orthogonal matrix holding the eigenvectors, then W = UDUT.

As stated in the introduction, the algorithm developed here can also be applied for weighted least squares fitting in situations where the weights do not arise (solely) from statistical considerations. The basis for this broader viewpoint is the loss function in Eq. (8). Rather than deriving the weights in W from statistical considerations, it is possible to introduce any other suitable set of weights based e.g. on a priori problem-specific information. Minimizing the loss function in this case, will lead to weighted least squares estimates that do not necessarily have any maximum likelihood properties. An example of this is provided in the last Section.

The loss function can be rewritten as

 .

It is the objective in this paper to derive an algorithm for finding m with its associated structure and constraints that will minimize . This is pursued in the next section.

For some models, it is very simple to devise a maximum likelihood algorithm from a least squares algorithm. For example, consider the multiple linear regression problem

 . (10)

It is easy to see that the solution can be found as the solution to the ordinary least squares problem

 , (11)

thus, by regressing Wx onto WZ. This is called Generalized Least Squares and is standard in regression theory [Golub & van Loan 1989, p. 266]. It is also called whitening in filtering theory [De Lathauwer et al. 2000]. Also, consider a bilinear PCA model of an I×J matrix X. If the errors are independent across rows, and identically distributed within each row, the weight matrix W is a block-diagonal matrix with blocks of J×J matrices, V, that are all identical. It can be shown [Martens et al. 2001], that in such a case, the problem is equivalent to minimizing the loss function

 . (12)

The solution can be obtained by fitting a PCA model to the transformed data according to

 (13)

The parameters are thus obtained as T and PTV when fitted to XV, and P can be found as (PTVV-1)T. Because (PTVV-1)T is generally not orthogonal, the parameters have to be appropriately orthogonalized in order to get the standard PCA solution. This can most easily be obtained from a singular value decomposition of TPTVV-1. Such preprocessing can only be used when the errors are independent across all but one mode. For general W of size I×I, even PCA models cannot be fitted using least squares algorithms on preprocessed data. For e.g. multi-way models such as PARAFAC [Harshman 1970] and Tucker3 [Kroonenberg 1983] the problems become even more pronounced.

### 2.2.  Deriving an algorithm

An algorithm for minimizing  can be developed using iterative majorization. The theory of majorization has been described in the literature [Groenen et al. 1995, Heiser 1995, Kiers 1990, Kiers & ten Berge 1992], but the principle as it pertains to minimizing a loss function is briefly reviewed next.

The loss function can be minimized by iteratively improving any current estimate of the model. Let such a current estimate be mc, where c is the iteration number. An update is sought such that  < . Improving the estimate continuously will lead to convergence, because the loss is bounded below by zero. For a complicated problem, however, improving the estimate can be difficult. If no closed-form solution exists, gradient-based methods may be used. Majorization, though, provides another alternative route for improving a current estimate. Instead of optimizing the loss function directly, a new loss function is derived which is called the majorizing function. This is a function of x, W and the current estimate mc. The majorizing function is denoted by  indicating that the aim is to find an estimate m but now also as a function of the current estimate mc. The majorizing function is constructed in such a way that i) it is easy (or easier) to minimize and ii) an improvement of  will also lead to an improvement of . How this is achieved is described below and a pictorial description of majorization can also be seen in Figure 1.

 Figure 1. The principle behind majorization illustrated with a one-parameter model. The loss function  is to be minimized as a function of m. The current estimate of m is mc (abscissa) with a corresponding loss (ordinate). This loss is improved by minimizing a majorizing function .

The majorization function should satisfy certain properties to lend itself to iterative minimization of the original loss function. It must hold that the function value of the majorizing function is never smaller than that of the ML loss function to ensure monotone convergence. I.e. . This requirement is the reason for the name majorizing function. In order to obtain a reasonable convergence rate, it is often appropriate, though, that  is close to  at least in the neighborhood of mc. Further it must hold that the two loss functions are identical at the supporting point, which is the current estimate mc. I.e. . Note, that this supporting point will not be the point corresponding to the minimum of neither  nor  unless at convergence.

A majorizing function for  will now be developed fulfilling the above criteria. This majorizing function is closely related to the one provided by Kiers [1997] and is using the function suggested by Heiser [1987] in another context. The derivation here follows closely that of these two papers.

As shown in Eq. (9), the loss function  can be written

 .

The sought model m can be written as m = mc + (m - mc) where mc is the current estimate of m. Thus, Eq. (14) can be formulated as

 = (15)

Note, that the first term, , is a constant because x and mc are known and fixed, i.e.

The third term is linear in m, while the second term is the difficult part because it is quadratic in m and because of the presence of the matrix WTW. As shown by Heiser [1987], modifying this second term can provide a majorizing function, well suited for the purpose here. Let  be the largest eigenvalue of WTW. It then holds that

 (17)

for vectors s of appropriate size. Thus, for any s, it holds that

 (18)

Therefore, it holds that

 (19)

and from this relation, a majorizing function can be defined from Eq. (16) as

It follows that  for all m. Setting m = mc leads to  = . Thus, the requirements of a majorizing function are satisfied. In order to improve the current estimate of m with respect to the loss function in Eq. (16) it suffices to improve m with respect to Eq. (20). This provides a tremendous simplification, because, as will be shown, finding the minimum of Eq. (20) corresponds to solving a certain ordinary least squares problem.

Defining the constant vector , the majorizing function can be written

 (21)

The proof of this is provided in the appendix. In finding the minimum of this loss function, we can ignore the constant  and  and the argument m that minimizes this function is thus also the solution to

 (22)

It therefore holds that in order to improve an estimated model mc with respect to  it suffices to find the update mc+1 that minimizes  for m  and with . Thus, a least squares model fitted to the transformed data q will provide the necessary update. The monotone convergence for an algorithm based on iterative majorization as the one derived here, follows, as shown by [RB1]Kiers [1997] from the fact that

.

A convergent algorithm for finding the deterministic maximum likelihood model follows immediately as

###### Algorithm MILES

1.      Set counter c := 0; Initialize model, mc, e.g. using the least squares model,

2.      , where β = max(eigenvalue(WTW))

3.      mc+1 =

4.      c := c+1; go to step 2 until  where  is a pre-specified small number (e.g. 10-6).

In step 3, it is often appropriate to not actually find the update providing the least squares fit in case the involved algorithm is iterative. Instead, it is sufficient to find an update that improves the fit, rather than maximizing it. E.g. for a slowly converging least squares algorithm, it may suffice to do ten or perhaps only one iteration and then subsequently update q. In order [RB2]to ensure reasonable convergence it is useful to add a simple line search (on q) and for minimizing the risk of encountering local minima, it is useful to redo the analysis a few times from different starting points. In step 4, the convergence is determined in terms of the relative change in the vectorized model. The algorithm is guaranteed to converge in (WLS/ML) fit, but convergence in fit does not in general imply convergence of model parameters or the vectorized model per se. In practical applications of MILES, convergence will likely be determined via relative change tests on the vectorized model, for the sake of computational simplicity. It is for this reason that this choice of convergence criterion is incorporated in MILES.

## 3.    Application of MILES

In the following two small and a more detailed example will be provided showing the usefulness of MILES and comparing MILES with other similar algorithms. The first example is devoted to comparing MILES with maximum likelihood PCA. The second example shows the usefulness of MILES in maximum likelihood PARAFAC modeling in signal processing while the last detailed example shows that MILES can also be used as a weighted least squares approach for handling measurement artifacts in a situation where maximum likelihood fitting does not apply.

### 3.1.  Spectroscopy MPSetChAttrs('ch0025','ch3',[[7,1,-3,0,0],[9,1,-4,0,0],[11,2,-4,0,0],[],[],[],[28,4,-11,0,0]]) MPInlineChar(0) MPNNCalcTopLeft(document.mpch0025ph,'') MPDeleteCode('ch0025')  PCA

The first data set was generated and treated by Wentzell and coworkers [1999] and arise from a designed experiment with three-component mixtures of nitrates of Co(II), Cr(III), Ni(II). A three-level, three-factor calibration design was used in which 1, 3, or 5 mL aliquots of stock solutions of the three nitrates were combined and diluted to 25 mL with 4% nitric acid. For technical reasons one sample was not measured, so the data set to consist of 26 rather than 27 samples. The concentration ranges were 6.88 to 34.40 mM for Co, 3.06 to 15.29 mM for Cr, and 15.70 to 78.8 mM for Ni. The samples were measured in a range of 300 to 650 nm in 2 nm intervals on a HP 8452 DAD (Hewlett-Packard, Palo Alto, CA) using a 1 cm quartz cuvette. For each sample, five replicate measurements were made and from these, the uncertainty of each measurement is calculated and used in a maximum likelihood PCA model assuming independent but different errors for each element. The loss function is illustrated graphically in Figure 2.

 Figure 2. The loss function in least squares fitting and weighted least squares fitting. The residuals are given in the vector e which is equal to vec(X-M) where M is the PCA model (including offsets). Thus, in weighted least squares fitting, one specific weight is attached to each residual and hence to each element in X. Note, that ML-PCA and MILES-PCA as given below also handles non-diagonal W.

A PCA model is sought in which the scores are subsequently used for building a least squares regression model (hence a principal component regression  PCR  model). Several alternatives are tested here, mainly to illustrate the appropriateness of the algorithm by comparing to the earlier suggested ML-PCA algorithm given by Wentzell and coworkers [Wentzell et al. 1997, Wentzell & Lohnes 1999] but also to show how simple it is to include centering in the maximum likelihood estimation with MILES which is not the case for ML-PCA [Wentzell et al. 1997].

For illustration, the MILES algorithm for PCA with centering is given below. Note, that this algorithm handles correlated errors, but as W is diagonal in this application, the weighted least algorithm of Kiers [1997] could also be used.

###### Algorithm MILES-PCA
1. Initialize model, m0, using centered LS-PCA model of the data, and set c := 0;
2.
3. T, P are found as the PCA parameters when fitted to centered Q, i.e. to Q - 1nT where n (J×1) holds the averages of the J columns of Q and where Q is the vector q arranged to the same size as the original data matrix.
4. mc+1 = vec(TPT + 1nT)
5. c := c+1; go to step 2 until

In order to evaluate different PCR models, a leave one out cross-validation scheme was used where each sample was left out in turn. A PCA model was fitted and the scores were used in an ordinary multiple linear regression model for predicting Co(II), Cr(III) and Ni(II). Afterwards, the reference value of the left-out sample was predicted.

The following different PCA models were investigated, all with three principal components:

·        A maximum likelihood PCA model was used where scores and centering parameters were estimated with MILES. Centering was included in the maximum likelihood fitting

·        A corresponding least squares PCA model was also used with ordinary centering

·        Finally a maximum likelihood PCA model was estimated but with the data centered in an ordinary least squares sense. This model was fitted using both the MILES algorithm suggested here and the ML-PCA algorithm by [Wentzell et al. 1997].

The corresponding prediction results for three components are shown in Table 1 as the root mean square error in cross-validation (RMSECV) which is defined as

 (23)

where  is the residual of the predicted valued of sample i predicted from a regression model built with the ith sample excluded. All models provide excellent results (correlations higher than 0.99), but the maximum likelihood results are never worse than the least squares results, suggesting that the use of maximum likelihood is feasible here. The results also show that MILES and the earlier algorithm suggested specifically for PCA give identical results as they should. Finally, it is noted that the inclusion of the centering within the maximum likelihood estimation is straightforward although the significance of this with respect to predictions is limited here.

Table 1. RMSECV values for PCR model using three components. First column shows result for a maximum likelihood model including maximum likelihood centering. Second column is an ordinary least squares PCR and the third and fourth columns show the results using maximum likelihood estimation but ordinary least squares centering with the Wentzell and the MILES algorithm respectively. As expected these two are identical.

 MILES LS Wentzell (no ML offset) MILES (no ML offset) Cr 0.0897 0.0904 0.0894 0.0894 Ni 0.304 0.3464 0.3031 0.3031 Co 0.2931 0.3045 0.2934 0.2934

### 3.2.  Signal processing - PARAFAC

An example is provided here on the usefulness of MILES in a different field; namely signal processing. The problem pertains to so-called "blind" beamforming using receive antenna arrays [Sidiropoulos & Liu 2001].  The objective of blind beamforming is to reconstruct the emitted signal(s) and propagation parameters of radio waves propagating along multiple paths, each with its own attenuation and delay, without explicit knowledge of the propagation environment. In such problems, the data arise out of band-limited radio signals, and the first step on the receiver side is to down-convert and low-pass filter the received data. This is done to filter off out-of-band interference. At the same time, the filtered base-band signals are over-sampled beyond the (minimum possible) Nyquist sampling rate, in order to help resolve path delays and ensure that a PARAFAC model [Bro 1997, Harshman 1970, Sidiropoulos & Liu 2001] is appropriate. This over-sampling of the filtered signal creates correlation of the noise samples along the over-sampling mode, because the spectrum of the noise has been shaped by the frequency response of the low-pass filter.

The data is a 2×4×20 array with additive white Gaussian noise added. A two-component (two-ray) PARAFAC model is suitable for the data, which is generated according to

 ,    i=1,2; j=1,..,4; k=1,..,40. (24)

The first mode corresponds to receive antenna element (two antennas are used) and the parameter aif holds the gain for the ith antenna with respect to the fth signal. The second mode corresponds to symbol snapshots collected (four) given by the parameters bjf, and the third mode to the number of samples taken per symbol interval given by ckf. The noise is held in eijk. The parameters are randomly drawn from a Gaussian (0,1) distribution and the residuals from a Gaussian (0,0.l) distribution. This noisy data is filtered by a 5-sample moving-average filter along the long over-sampling mode, to simulate the effect of the receive low-pass filter. This filtering step colors the noise spectrum according to a sin(x)/x pulse in the frequency domain  inducing noise correlation along the long mode.

 Figure 3. The loss function for the signal processing data. The residual variation is independent from antenna to antenna and from symbol to symbol, but is correlated within the third sample mode. Thus, for each combination of receive antenna (first mode) and symbol snapshot (second mode), a characteristic error covariance matrix is obtained.

For maximum likelihood estimation, the error covariance matrix is estimated from 30 realizations of the noise. In Figure 4, the results from 100 runs are shown in terms of the signal-to-noise ratio (SNR) of the estimated symbols (sorted in size of maximum likelihood results). SNR is the common fidelity measure used for assessing the quality of the model in these applications and it is defined by the true matrix B as well as the estimated  as

 . (25)

It is hence a measure of how well the loading matrix is recovered.

As can be seen, the maximum likelihood estimates are significantly better except for a few distinct cases. The generally higher SNR of the maximum likelihood estimates translates directly to better source/path localization and also reduced error rates in source signal recovery in case of digital communication signals

 Figure 4. Signal-to-Noise Ratios (SNR) for 100 simulated models. The data are fitted by maximum likelihood PARAFAC (unbroken line) and by ordinary least squares PARAFAC (broken line). The ML results in the figure are sorted in order of increasing SNR to allow easier visual comparison to the LS solution. As can be seen the use of MILES provides better results in general.

### 3.3.  Fluorescence spectroscopy MPSetChAttrs('ch0032','ch3',[[7,1,-3,0,0],[9,1,-4,0,0],[11,2,-4,0,0],[],[],[],[28,4,-11,0,0]]) MPInlineChar(0) MPNNCalcTopLeft(document.mpch0032ph,'') MPDeleteCode('ch0032')  handling scatter

#### Description of the problem

In fluorescence spectroscopy, scattering phenomena such as Rayleigh and Raman scatter are typically considered as noise and often the areas where the scatter occurs are simply removed from the data beforehand or equivalently treated as missing data [Bro 1997]. However, the interesting chemical information is sometimes situated in the areas where the scatter occurs and it is then not possible to treat these areas as missing data. It then becomes crucial to be able to handle and possibly separate the physical noise arising from scatter from the chemical information.

Table 2. Concentrations of four fluorophores in 22 samples (10-6 M).

 Sample number Hydroquinone Tryptophan Phenylalanine Dopa 1 17,00 2,00 4700,00 28,00 2 20,00 1,00 3200,00 8,00 3 10,00 4,00 3200,00 16,00 4 6,00 2,00 2800,00 28,00 5 0,00 0,00 5600,00 0,00 6 0,00 8,00 0,00 0,00 7 56,00 0,00 0,00 0,00 8 28,00 0,00 0,00 0,00 9 0,00 0,00 0,00 5,00 10 0,00 0,00 700,00 0,00 11 0,00 16,00 0,00 0,00 12 3,50 1,00 350,00 20,00 13 3,50 0,50 175,00 20,00 14 3,50 0,25 700,00 10,00 15 1,75 4,00 1400,00 5,00 16 0,88 2,00 700,00 2,50 17 28,00 8,00 700,00 40,00 18 28,00 8,00 350,00 20,00 19 14,00 8,00 175,00 20,00 20 0,88 8,00 1400,00 2,50 21 1,75 8,00 700,00 5,00 22 3,50 2,00 700,00 80,00

Baunsgaard [1999] analyzed aqueous mixtures of four different fluorophores: L-phenylalanine, L- 3,4-dihydroxyphenylalanine (DOPA), 1,4-dihydroxybenzene and L-tryptophan. Samples were prepared from stock solutions according to the design in Table 2. Fluorescence excitation-emission landscapes were obtained of the 22 samples using a Perkin-Elmer LS50 B fluorescence spectrometer using excitation wavelengths between 200-350 nm and emission wavelength range between 200-750 nm. Excitation and emission monochromator slit widths were set to 5 nm and the scan speed was 1500 nm/min. In order to keep that data manageable and exclude irrelevant areas, a subset of the emission and excitation wavelengths were chosen. Thus, the actual data used in the models contained excitations from 245 to 305 nm with 5 nm intervals and emissions from 246 to 436 nm with 5 nm intervals. A typical sample is shown in Figure 5 left.

 Figure 5. Left: a typical excitation-emission landscape. Note the diagonal Rayleigh scatter peak to the right. In the right plot, emission below excitation has been removed. Some scatter signal remains, but the bulk part of the data that does not follow the PARAFAC model has been removed.

#### A model of fluorescence data

For dilute samples, fluorescence excitation-emission measurements ideally follow a trilinear PARAFAC model [Bro 1998, Bro 1999, Leurgans & Ross 1992, Ross & Leurgans 1995]. It is well established, though, that two types of problematic areas exist in such data [Bro 1998, Bro 1999, Bro & Sidiropoulos 1998]. First of all, emission below excitation does not exhibit any fluorescence and the intensity is simply zero. This part of the data does not follow the PARAFAC model and therefore, emission data below the excitation wavelength has to be set to missing or downweighted such that the model does not incorrectly try to fit these zero values. Otherwise, this part of the data will bias the estimated model towards zero. In Figure 5, a typical landscape is shown before and after removal of emission below zero. The other problematic type of variation in such data is the so-called scatter ridges. Most significant in Figure 5 is the first order Rayleigh scatter ridge that is seen in the rightmost part of the landscape [Ewing 1985]. It occurs approximately on the diagonal where the excitation equals the emission wavelength. This part of the data is problematic, because it does not provide any chemical information but only physical information that is not interesting with respect to the PARAFAC modeling. Further, such a ridge lying on a diagonal is not possible to model efficiently e.g. by using an added PARAFAC component. The presence of scatter is cumbersome, especially when the chemical information appears close to the ridge. If this is not the case, it is sometimes possible to eliminate the scatter by setting the areas where the scatter occurs to missing or downweighting the elements [Bro 1997, Bro & Heimdal 1996, Heimdal et al. 1997, Jiji & Booksh 2000]. In addition to first order Rayleigh scatter, other sources of scatter also occur. E.g. for these data, a minor banded ridge of Raman scatter is observed for some of the samples. This type of scattering is less problematic in the sense that it usually is of a minor magnitude and because it can often be almost completely removed by subtracting measurements of the solvent without sample.

Two alternative PARAFAC models are tested for fitting these data. Both models are fitted with four components; one for each chemical analyte, with the aim of resolving the information for each analyte in each component. The alternative PARAFAC models are

·        least squares fitting to the raw data but eliminating emission below excitation and

·        weighted least squares fitting using MILES where emission below excitation is downweighted and weights are used to minimize the influence of scatter.

In order to define the weights in the MILES model, a simplified model of the scatter is used. This model is shown in Figure 6 (left). It consists of Gaussian curves generated as the sum of

 (26)

and

 . (27)

The parameters of these distributions are μj (emission wavelength) and μk (excitation wavelength) and σRayleigh, σRaman, hRayleigh, and hRaman which are set to 8, 2, 25 and 0.5 respectively. These values were chosen rather arbitrarily to provide a reasonable visual resemblance to the observed sizes of the scatter peaks. In actual practice, the width and height will change somewhat across the ridges and the ridges will not be perfectly Gaussian shaped. However, it is anticipated that the approximation of the scatter is sufficiently good for the purpose of minimizing the negative effects of scatter.

 Figure 6. Left: an idealized representation of first order Rayleigh and Raman scatter. The large Rayleigh trace appears around the diagonal where the excitation wavelength equals the emission wavelength, while the Raman scatter has a lower intensity and moves along a different diagonal. To the right, the total error “standard deviation” for one sample is given. It is the sum of the scatter error from the left plot, the iid error (standard deviation one for every element) and a very high value (1000) for the non-chemical part of the data.

The combined effect of scatter, measurement noise and downweighting of emission below excitation is shown in Figure 6 (right). This matrix is the sum of an iid term of magnitude standard deviation one, a missing data part of magnitude 1000 and the scatter model of Figure 6 (left). The absolute size of these three contributions is immaterial, whereas the relative sizes define the influence of the different residuals on the loss function. The weights used in MILES fitting are simply taken to be the inverse of the values in Figure 6 (right). Some comments on this fairly crude approach to defining the weights are appropriate.

·        The same weights are applied to each sample. Although some differences in actual uncertainty may appear between samples, this is ignored here.

·        The model fitted with the above defined weights is not a maximum likelihood estimate. First of all, the scatter has a bias part which must be eliminated in order to be able to fit the model in a maximum likelihood sense and secondly, the error estimates are not based on actual estimates of uncertainty.

·        Eliminating bias by centering, though, is not feasible in this case. In practice, the bias can be removed by centering the data across samples [Bro & Smilde 2001] but this will lead to overfitting because a huge number of averages needs to be calculated which are not strictly necessary.

·        Improving the error estimates based e.g. on actual replicate measurements is avoided here in favor of more simple estimates. Estimating actual error variances and covariances from empirical data, can lead to quite noisy estimates, which will possibly even have less resemblance to the true error covariances than simply using the implicit iid assumptions of a least squares approach.

·        No use is made of off-diagonals in the weighting matrix (W is diagonal as in the weighted least squares loss function in Figure 2). This is justified by the adequacy of the results obtained without off-diagonals and the computational simplicity of the associated matrix inversions when off-diagonals are left out.

#### Qualitative results

 Figure 7. Four components estimated with least squares approach (top) and weighted least squares (bottom). Component number three (solid line) has a peculiar peak in the low emission part in the least squares approach. Component two (dotted line) has an incorrect excitation maximum above the corresponding emission maximum in the least squares model.

In Figure 7, the resulting estimated loading parameters are shown for two competing four-component PARAFAC models. It is evident that there is a discrepancy in one of the estimated emission spectra and also in one of the excitation spectra.

The shape of the 260 nm artifact peak in least squares emission component three (solid line) is naturally related to the Rayleigh scatter but as important, the relatively high magnitude of the peak occurs because of the pattern of missing data. Looking at the model approximation from only component 3, i.e., the outer product of the emission and excitation loading, it is seen that although the 260 nm peak appears large in the emission loading, its contribution to the model is quite moderate (Figure 8) when the area without fluorescence information is taken into account. Thus, the apparent large size and hence importance of the scatter peak is an artifact of the special data structure. In actual practice, the magnitude of the 260 nm peak can be changed with little associated change in fit. Such artifacts are often observed when modeling fluorescence data even if emission slightly above the excitation wavelength has also been removed to minimize the effect of scatter [Bro & Sidiropoulos 1998].

The high values of the least squares excitation component two (dotted line) is also related to the Rayleigh scatter. Due to low signal from this component and a relatively high amount of Rayleigh scatter, the component is primarily reflecting the Rayleigh trace rather than the chemical variation.

 Figure 8. The landscape of component three (outer product of emission and excitation loading) for the least squares fitted model. The emission wavelengths below excitation have been set to missing and it is seen that the apparent high scatter peak in the emission loading is almost absent from this landscape because of the missing data.

When fitting the model with the MILES procedure, the artifact is not seen and the overall estimated spectra seem more reasonable. Hence, the use of the WLS fitting seems warranted in this case.

As an aside, fitting a least squares model to data where non-fluorescent zeros from emission below excitation and scatter is retained (i.e. not treated as missing), the results are sometimes similar to the ones obtained here with MILES fitting. This indicates that the two approaches are equally good. However, the reason why the MILES approach works is because it handles the special characteristics and deviations from the ideal "iid least squares" conditions in a direct and reasonable way. On the other hand, when fitting the data with large amounts of ‘incorrect’ zeros, these zeros act, in the model, similar to ridge parameters in a regression model. Because the model is forced to describe the zeros, the parameters are forced towards zero. Further, because the scatter peaks in the estimated spectra only describe minor variation in the data, the gain in fit by describing the (incorrect zeros) supercedes the loss in fit by not describing the scatter peak. This is an ad hoc approach and can be expected to bias the solution in many cases. And indeed for these data, this approach is not feasible. This is further corroborated by means of the quantitative results that follow.

#### Quantitative results

In order to substantiate the adequacy of the MILES approach, the following resampling approach was adopted.

1)      Ten of the 22 samples are randomly chosen

2)      A four-component PARAFAC model is fitted to these ten samples

3)      The ten concentrations of the four analytes are predicted by the PARAFAC scores using multiple linear regression (no offset).

4)      For each analyte R2 is computed as , where yi is the concentration in the ith sample and  the corresponding prediction.

5)      This procedure is repeated 100 times, yielding 400 R2 values.

The R2 statistic provides a measure of the fraction of variance explained [Draper & Smith 1981] and is expressed in percentages in the results that follow. The closer this percentage is to 100 percent, the better the model fit the data. Note, that the predictions of y are based on a no-intercept model.

The following alternative loss functions and models were evaluated.

A.     Least squares fit to raw data (no missing data, hence emission below excitation and scatter is retained (see Figure 5 left))

B.     Least squares fit to data (setting excitation below emission to missing (see Figure 5 right))

C.     Weigted least squares fitting (as specified above)

Least squares fitting without handling the areas where fluorescence does not occur is performed (A) in order to test if this is a feasible approach as discussed above. In B, ordinary least squares fitting is used but setting emission below excitation to missing. The missing data are handled by expectation maximization [Bro 1998]. This idea of eliminating or downweighting the problematic areas, is the approach most often used for multi-way modeling of fluorescence data. In this case, it would likely be possible to push the limit for setting elements to missing a bit. Setting emission slightly above the excitation wavelength to missing can help remove even more of the problematic scatter. However, as this will also eliminate chemical information, which is potentially important for example for low-wavelength emitting analytes, this is not pursued here.

 Figure 9. Histograms of 400 values of R2100 obtained from the three different models.

In Figure 9, the results obtained from 100 runs for each of the three alternative models are shown. Handling the missing data (B) is slightly better than disregarding the non-fluorescent parts (A) and using the MILES approach provides the best results by far (C). This result, of course, is simply a manifestation of the results outlined in the discussion of the qualitative results. The results can be further understood e.g. by looking at the actual emission loadings estimated by the three models in all the one hundred cases. From Figure 10, it is easily seen that only the MILES models are able to accurately determine the spectra in all cases. For the two alternative models, the correct estimates are only obtained occasionally, whereas in many cases, incorrect models are obtained reflecting the scattering phenomena.

 Figure 10. Emission loadings obtained from a least squares model of raw data (top), a least squares model handling missing data (middle) and a weighted least squares model (bottom).

The model of the scatter phenomena used for defining the weights has been chosen in an ad hoc fashion based on visual assessment of the model. In order to verify that the results obtained with MILES are robust towards slightly different weights, two alternative models were investigated; one where the scatter ridges are incorrectly moved as much as ten nanometers away in the emission direction from the correct diagonal where emission equals excitation; and one where the Rayleigh trace is only half the width of the former. As can be seen in Figure 11, the resulting PARAFAC loadings are almost identical. Although minor deviations appear in the estimated spectra, these are insignificant compared to the differences observed in the least squares model (Figure 7). The results do indicate that some optimization of the scatter model may be achieved but more importantly, it is verified that the model is robust against minor changes in the model. Hence, any reasonable model of the scatter will help correcting the problems encountered in the least squares approach.

 Figure 11. Three alternative scatter models shown in upper plots. Leftmost, the model used so far; then a model where the scatter ridges are (incorrectly) moved 10 nm away from the correct diagonal in the emission direction and finally a model where the Rayleigh trace is half the width. The resulting emission and excitation loadings are shown below the scatter plots.

## 6.    Conclusion

A new and general algorithm has been developed for maximum likelihood or weighted least squares estimation. It is applicable to any situation where a least squares algorithm exists, and can be implemented as a simple iterative procedure. This algorithm makes it simple e.g. to make maximum likelihood fitting algorithms in situations where it would otherwise be difficult to come up with a suitable algorithm. One such example is to fit a bilinear PCA model including centering. The centering part is not easily handled with other maximum likelihood PCA algorithms, but for MILES, the structure of the model is immaterial as long as a least squares algorithm exists. The algorithm has been applied to different data sets to show its applicability. These examples include PCA and PARAFAC. It has been shown that very accurate knowledge of the error covariance structure is not mandatory for the beneficial use of the weighted least squares principle. Sometimes though, maximum likelihood estimation is disregarded in favor of, for example, least squares approaches. Indeed, if the residuals have quite similar variances and are imprecisely determined from few replicates, it is conceivable that the use of such uncertain variances can lead to poor results. The difference between maximum likelihood and least squares fitting, though, gets less pronounced if the errors are small compared to the signal (which is often the case in e.g. near infrared spectroscopy).

The main results of the presented work are thus

- MILES is a general algorithm that can be used to substitute a number of specialized algorithms such as the algorithms developed by Wentzell, Kiers etc.

- The first spectroscopic example basically shows that the MILES algorithm provides the same results as current ML-PCA algorithms and that offsets are easily included in the maximum likelihood fitting.

- The second signal processing example shows that MILES handles residual covariance e.g. for fitting PARAFAC models. No other PARAFAC algorithm currently does this.

- The third fluorescence example provides an example on how the weighted least squares approach of MILES can be used for solving a well-known problem in modeling fluorescence data.

## Acknowledgments

R. Bro gratefully acknowledges support provided by LMC (Center for Advanced Food Studies), Frame program AQM (Advanced Quality Monitoring in the Food Production Chain), as well as the EU-project Project GRD1-1999-10337, NWAYQUAL. N. D. Sidiropoulos gratefully acknowledge support provided by NSF grants 0096164 and 0096165. The data for and m-file performing ML-PCA was obtained from http://www.dal.ca/~pdwentze/index.html.

## Appendix

For the second term in Eq. (20) it holds that

where mcTmc is a constant. The third term in Eq. (20) can be rearranged as

 (29)

Therefore, Eq. (20) can be written

 (30)

Defininig the constant term  leads to

 (31)

Reference List

1.   Baunsgaard D,  Factors affecting 3-way modelling (PARAFAC) of fluorescence landscapes, The Royal Veterinary & Agricultural University, 1999,

2.   Bro R, PARAFAC. Tutorial and applications,  Chemom Intell Lab Syst, 1997, 38, 149-171.

3.   Bro R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. Ph.D. thesis, University of Amsterdam (NL), 1998,

4.   Bro R, Exploratory study of sugar production using fluorescence spectroscopy and multi-way analysis, Chemom Intell Lab Syst, 1999, 46, 133-147.

5.   Bro R, Heimdal H, Enzymatic browning of vegetables. Calibration and analysis of variance by multiway methods, Chemom Intell Lab Syst, 1996, 34, 85-102.

6.   Bro R, Sidiropoulos ND, Least squares algorithms under unimodality and non-negativity constraints, Journal of Chemometrics, 1998, 12, 223-247.

7.   Bro R, Smilde AK, Centering and scaling in component analysis, Journal of Chemometrics, 2001,  Submitted.

8.   De Lathauwer L, de Moor B, Vandewalle J, An introduction to independent component analysis, Journal of Chemometrics, 2000, 14, 123-149.

9.   Draper NR, Smith H, Applied regression analysis. John Wiley and Sons, New York, 1981,

10.   Ewing GW, Instrumental methods of chemical analysis. McGraw-Hill Book Company, New York, 1985,

11.   Golub GH, van Loan CF, Matrix computations. The John Hopkins University Press, Baltimore, 1989,

12.   Groenen PJF, Mathar R, Heiser WJ, The majorization approach to multidimensional scaling for Minkowski distances, Journal of Classification, 1995, 12, 3-19.

13.   Harshman RA, Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multi-modal factor analysis, UCLA working papers in phonetics, 1970, 16, 1-84.

14.   Heimdal H, Bro R, Larsen LM, Poll L, Prediction of polyphenol oxidase activity in model solutions containing various combinations of chlorogenic acid, (-)-epicatechin, O2, CO2, temperature and pH by multiway analysis, J Agric Food Chem, 1997, 45, 2399-2406.

15.   Heiser WJ, Correspondence analysis with least absolute residuals, Comp Stat Data Anal, 1987, 5, 337-356.

16.   Heiser WJ, Convergent computation by iterative majorization: Theory and applications in multidimensional data analysis, Recent advances in descriptive multivariate analysis, (Ed. Krzanowski,WJ),  1995, 157-189.

17.   Jiji RD, Booksh KS, Mitigation of Rayleigh and Raman spectral interferences in multiway calibration of excitation-emission matrix fluorescence spectra, Anal Chem, 2000, 72, 718-725.

18.   Judge GG, Griffifths WE, Carter Hill R, Lütkepohl H, Lee TC, The theory and practice of econometrics. John Wiley & Sons, New York, 1985,

19.   Kay S, Statistical Signal Processing, Part I: Estimation Theory. Prentice-Hall, 1993,

20.   Kiers HAL, Majorization as a tool for optimizing a class of matrix functions, Psychometrika, 1990, 55, 417-428.

21.   Kiers HAL, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 1997, 62, 251-266.

22.   Kiers HAL, ten Berge JMF, Minimization of a class of matrix trace functions by means of refined majorization, Psychometrika, 1992, 57, 371-382.

23.   Kroonenberg PM, Three-mode Principal Component Analysis. Theory and Applications. DSWO Press, Leiden, 1983,

24.   Leurgans SE, Ross RT, Multilinear models: applications in spectroscopy, Statistical Science, 1992, 7, 289-319.

25.   Martens H, Høy M, Wise BM, Bro R, Brockhoff PMB, GLS Pre-processing of Multivariate Data, Journal of Chemometrics, 2001,  Submitted.

26.   Paatero P, Tapper U, Analysis of different modes of factor analysis as least squares problems, Chemom Intell Lab Syst, 1993, 18, 183-194.

27.   Paatero P, Tapper U, Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 1994, 5, 111-126.

28.   Ross RT, Leurgans SE, Component resolution using multilinear models, Methods Enzymol, 1995, 246, 679-700.

29.   Sidiropoulos ND, Liu XQ, Identifiability results for blind beamforming in incoherent multipath with small delay spread, IEEE Transactions on Signal Processing, 2001, 49, 228-236.

30.   Wentzell PD, Andrews DT, Hamilton DC, Faber NM, Kowalski BR, Maximum likelihood principal component analysis, Journal of Chemometrics, 1997, 11, 339-366.

31.   Wentzell PD, Lohnes MT, Maximum likelihood principal component analysis with correlated measurement errors: theoretical and practical considerations, Chemom Intell Lab Syst, 1999, 45, 65-85.

xxFra henk

No, this definitely cannot happen in the way you portrayed, the

reason simply being that the majorizing function you have drawn

does not really majorize the object function: it crosses it, whereas

it should only touch it in the supporting point and be always above

it.

There's quite a bit more theory on majorization and convergence

and I gather the bottom line is that, with reasonably well behaved

fucntions (double differnetiability, continiuous, etc), it converges to

stationary points (in practice to only one stationary point, but in

theory this could be a continuum of stationary points, but then

these must all lead to the same function value, and I do not expect

you to encounter such a problematic/bizarre situation).

I guess that if you do ML via WLS you actually do IRLS (iteratively

reweighted least squares, which, under some conditions, is

equivalent to ML). Please note, however, that, if you iteratively

reweight, you get a different function all the time, and this simple

fact may indeed destroy monotony and hence convergence.

### 1.1.   [RB2]Computational aspects

xxThis part sucks. Forget it for now

In many cases for example in spectroscopy, the data are themselves of considerable size. If the number of elements in a data set is R then the size of the residual covariance matrix will be R × R which can be prohibitive for actual computations. If there are independence between certain elements (e.g. when error are iid or if samples are independent) then this leads to a large number of zeros in the residual covariance matrix. Handling these with sparse techniques is one simple way of diminishing the computational demands. In some cases, however, the size of the residual covariance matrix is still too big for efficient analysis. In the following a simple compression technique will be described that helps handling large problems for situations in which the model-parameters are linear in the mode to compress. The method described is developed based on the assumption that the residual covariance matrix is determined based on replicate measurements but it can be modified to handle covariance matrices stemming from other sources.

Let the data be denoted X (I×J). In case that X is a multi-way array e.g. of size I×P×K then assume that it has been properly matricized so that the column-mode of X is made up of the second and third mode of the three-way array (thus J = PK). Further, assume that X has been arranged such that the first mode is the one of largest dimension. Assume that the residual covariance matrix Δ (IJ×IJ) is calculated from a matrix E (I×Q) where Q is a number that depends on the number of replicates used for determining the residual variation.

The idea in the suggested compression is to compress the largest mode, such that both X and E are expressed in terms of a truncated basis. In order to achieve this, a small basis is needed for the first mode. Such a basis can be found from a singular value decomposition of the concatenated matrix

 (17)

where X and E are both normalized in order to ensure comparable variation in the two matrices. The left singular vectors from a singular value decomposition of this combined matrix will provide a basis for the variation in the first mode both with respect to the data and with respect to the residual variation. By discarding singular vectors with very small associated singular values, a truncated basis, UC (I×IC). is obtained of column-dimension IC (<I) Both X and E are subsequently regressed onto this truncated basis yielding compressed data XC (IC×J) and compressed residuals EC (IC×Q). The data and residual relate to the compressed data and residuals by the relation

#### XUC XC

EUC  EC

So this is just a re-expression of the data with a new sensible basis onto which X and E projects. Thus everything can be calculated from there as well, right?