Department of Food Science

Faculty of Science

University of Copenhagen

# Tensor data sets exemplifying problems in tensor modeling

The following problems are illustrated below:

- Simple PARAFAC data (amino acids)
- Local minima (FIA)
- Two-factor degeneracy (Kojima)

Many more data sets are available at the University of Copenhagen (here) and at Leiden University (here).

Tools available for working with multi-way data:

- Tensor toolbox (handling tensors and models in MATLAB)
- The N-way toolbox (fitting tensor models in MATLAB)
- PLS_Toolbox (fitting tensor and other models in MATLAB; commercial)

**What is degeneracy**

A few clarifying words on degeneracy are in order. The following description is in part (very much so) stolen from a note by Richard Harshman. If you are interested in the note, you can send me an email. The term 'degeneracy' was introduced by Harshman and Lundy in 1984 (Harshman, R. A., & Lundy, M. E. (1984b). Data preprocessing and the extended PARAFAC model. In H. G. Law, C. W. Snyder Jr, J. A. Hattie, & R. P. McDonald (Eds.), Research methods for multimode data analysis (pp. 216-284). New York: Praeger).

Harshman and DeSarbo (Harshman, R. A., & DeSarbo, W. S. (1984). An application of PARAFAC to a small sample problem, demonstrating preprocessing, orthogonality constraints, and split-half diagnostic techniques. In H. G. Law, C. W. Snyder Jr, J. A. Hattie, & R. P. McDonald (Eds.), Research methods for multimode data analysis (pp. 602-642). New York: Praeger) were the first to show a practical example of a data set with degeneracy, and also how to overcome the problem using alternative constraints (orthogonality) and proper preprocessing.

Further practical examples are given in Lundy, M. E., Harshman, R. A., & Kruskal, J. B. (1989). A two-stage procedure incorporating good features of both trilinear and quadrilinear models. In R. Coppi & S. Bolasco (Eds.), Multiway data analysis (pp. 123-130). Amsterdam: Elsevier.

The more mathematical aspects of degeneracy were first described by Kruskal (Kruskal, J. B. (1989). Rank, decomposition, and uniqueness for 3-way and N-way arrays. In R. Coppi & S. Bolasco (Eds.), Multiway data analysis (pp. 7-18). Amsterdam: Elsevier; and Kruskal, J. B., Harshman, R. A., & Lundy, M. E. (1989). How 3-MFA data can cause degenerate PARAFAC solutions, among other relationships. In R. Coppi & S. Bolasco (Eds.), Multiway data analysis (pp. 115-122). Amsterdam: Elsevier).

The first of these two papers is more general in that it explains essential ideas about rank and decomposition of arrays, while the latter uses these ideas to explain degeneracy. It begins with a description of seven observed properties of actual degenerate solutions, stated as they apply in the simplest case, which is called a “two-factor degeneracy”. (When more factors are involved, most properties are the same but correlations and cancellations are more complicated.) For this simple case, (i) two factors are involved, (ii) their loadings are highly correlated in all three modes, (iii) one or all three correlations are negative, (iv) the factors have large loadings but (v) a “normal sized” net contribution because their combined contributions almost cancel out (due to the negative correlations), (vi) their loadings keep growing in size and degree of correlation as further iterations are performed, and thus (vii) they eventually become larger than any other factors.
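The cancellation in properties (iv) and (v) can be made concrete with a standard textbook construction. It is not taken from the papers cited above, and the vectors $\mathbf{a},\mathbf{b},\mathbf{c},\mathbf{d},\mathbf{e},\mathbf{f}$ and the index $n$ are arbitrary illustrative symbols. Consider two rank-one terms whose loadings approach each other as $n$ grows:

$$
n\left(\mathbf{a}+\tfrac{1}{n}\mathbf{d}\right)\circ\left(\mathbf{b}+\tfrac{1}{n}\mathbf{e}\right)\circ\left(\mathbf{c}+\tfrac{1}{n}\mathbf{f}\right)
\;-\; n\,\mathbf{a}\circ\mathbf{b}\circ\mathbf{c}
\;=\;
\mathbf{d}\circ\mathbf{b}\circ\mathbf{c}
+\mathbf{a}\circ\mathbf{e}\circ\mathbf{c}
+\mathbf{a}\circ\mathbf{b}\circ\mathbf{f}
+O\!\left(\tfrac{1}{n}\right)
$$

Each of the two terms on the left grows in size with $n$, yet their sum stays bounded. Their loadings become nearly collinear in all three modes, with the minus sign producing a negative correlation in one mode. The right-hand side generally has rank three, so a two-component fit to such an array can keep improving as $n$ increases without ever reaching a minimum, which is consistent with properties (vi) and (vii).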

Mitchell, B.C., & Burdick, D.S. (1994). Slowly converging Parafac sequences: Swamps and two-factor degeneracies. Journal of Chemometrics, 8, 155-168.

Rayens, W. S., & Mitchell, B. C. (1997). Two-factor degeneracies and a stabilization of Parafac. Chemometrics and Intelligent Laboratory Systems, 38, 173-181.

The two papers above discuss what the authors call swamps: situations where the iterative algorithm slows down and the interim solution shows degeneracy-like behavior. Note, though, that swamps have nothing to do with degeneracies: degeneracy is built into the data, whereas swamps are just temporary numerical problems.

Other relevant papers on degeneracy are:

Ten Berge, J.M.F., Kiers, H.A.L., & De Leeuw, J. (1988). Explicit CANDECOMP/PARAFAC solutions for a contrived 2×2×2 array of rank three. Psychometrika, 53, 579-584.

Paatero, P. (2000). Construction and analysis of degenerate PARAFAC models. Journal of Chemometrics, 14, 285-299.

Zijlstra, B. J. H., & Kiers, H. A. L. (2002). Degenerate solutions obtained from several variants of factor analysis. Journal of Chemometrics, 16, 596-605.

Ten Berge, J.M.F. (1991). Kruskal's polynomial for 2×2×2 arrays and a generalization to 2×n×n arrays. Psychometrika, 56, 631-636.

**Amino Acids**

*Problem*: No real problem here. Just a nice data set for fitting PARAFAC.

*Fixes*: None needed. Strictly speaking, the data do contain a bit of non-trilinear Rayleigh scattering variation, but that is not important for a basic illustration of the use of PARAFAC. Refer to the references below if scattering is of interest.

*Download*: Download the data here.

*Characteristics*: 5x201x61. Dense reals. Nonnegative components. Some small negative values in the data.

**Suitable model:** Three-component PARAFAC.

**How should it look:** The result of the analysis can be validated by checking that the scores (the component matrix in the sample mode, mode 1) are correlated with the known concentrations of the three chemicals in the samples (held in the matrix Y). Each sample contains three amino acids: tryptophan (Trp), tyrosine (Tyr), and phenylalanine (Phe).

- Column 1 of Y is the tryptophan concentration in mol/L
- Column 2 of Y is the tyrosine concentration in mol/L
- Column 3 of Y is the phenylalanine concentration in mol/L

The data set consists of five simple laboratory-made samples. Each sample contains different amounts of tyrosine, tryptophan, and phenylalanine dissolved in phosphate-buffered water. The samples were measured by fluorescence (excitation 240-300 nm, emission 250-450 nm, 1 nm intervals) on a spectrofluorometer. The array to be decomposed is hence 5 × 201 × 61. Below, measurements of one of the samples are shown. Ideally, these data should be describable with three PARAFAC components, because each individual amino acid gives a rank-one contribution to the data.
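Written out with the usual PARAFAC notation (the symbols are conventional and are assumptions here, as the text above does not define them), the three-component model for the intensity $x_{ijk}$ of sample $i$ at wavelengths $j$ and $k$ is:

$$
x_{ijk} \;=\; \sum_{f=1}^{3} a_{if}\, b_{jf}\, c_{kf} \;+\; e_{ijk}
$$

Here $a_{if}$ is the score of amino acid $f$ in sample $i$ (ideally proportional to its concentration), $b_{jf}$ and $c_{kf}$ are its spectral profiles in the two wavelength modes (emission and excitation), and $e_{ijk}$ is the residual. Each term $\mathbf{a}_f\circ\mathbf{b}_f\circ\mathbf{c}_f$ is the rank-one contribution of one amino acid.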

Hence, if a PARAFAC model is appropriate, it should have three components and therefore a 5 × 3 so-called score matrix (first-mode loading matrix). Each column in this score matrix should approximately match the concentration of one of the three amino acids, which are held in the 5 × 3 matrix Y. Matching, in this case, means that the corresponding columns should be correlated.

As there is no specific order of the components, it is not possible a priori to say which column is correlated with which column, but you can, e.g., do `plotmatrix([A Y])` in MATLAB, where A is the score matrix. This will produce scatter plots of all combinations of columns. An example of such a plot is shown below.

*Result of using the matlab plotmatrix function for making a scatter plot of the first mode PARAFAC components and the known concentrations.*
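The same matching can also be done numerically. The following is a minimal sketch in plain Python (chosen only for illustration; it is not part of the toolboxes mentioned on this page). The helper names `pearson`, `column`, and `match_components` are hypothetical, and the numbers are made up rather than taken from the amino acid data. Absolute correlations are used so that a sign flip in a component does not hide a match.

```python
from itertools import permutations
from math import sqrt


def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    suu = sum((x - mu) ** 2 for x in u)
    svv = sum((y - mv) ** 2 for y in v)
    return suv / sqrt(suu * svv)


def match_components(A, Y):
    """Pair each score column of A with one column of Y.

    Returns (perm, corrs): perm[f] is the Y column paired with score
    column f, chosen so that the sum of absolute correlations is maximal;
    corrs[f] is the (signed) correlation of that pair.
    """
    F = len(A[0])
    R = [[pearson(column(A, f), column(Y, g)) for g in range(F)]
         for f in range(F)]
    best = max(permutations(range(F)),
               key=lambda p: sum(abs(R[f][p[f]]) for f in range(F)))
    return list(best), [R[f][best[f]] for f in range(F)]


# Made-up concentrations (5 samples x 3 analytes), NOT the real Y matrix.
Y = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0],
     [2.0, 1.0, 0.0],
     [4.0, 2.0, 3.0],
     [3.0, 5.0, 1.0]]

# Made-up scores: the columns of Y reordered and rescaled, mimicking the
# arbitrary order and scale of PARAFAC components.
A = [[2.0 * y[2], 0.5 * y[0], 3.0 * y[1]] for y in Y]

perm, corrs = match_components(A, Y)
print(perm, corrs)
```

Trying all permutations is only practical for a handful of components, which is the case here.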

*Fluorescence landscape of a sample containing only phenylalanine.*

Below the PARAFAC loadings of a three-component model are shown.

*Loading vectors resulting from fitting a three-component PARAFAC model to amino acid data.*

**References:**

*Original publication*: The data were originally measured by Claus A. Andersson and have been published in, e.g., Bro, R. (1998). Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. Ph.D. thesis, University of Amsterdam (NL) and Royal Veterinary and Agricultural University (DK).

*Additional publications*: Kiers, H.A.L. (1998). A three-step algorithm for Candecomp/Parafac analysis of large data sets with multicollinearity. Journal of Chemometrics, 12, 155-171.

**FIA (Flow injection analysis)**

*Problem*: Has severe problems with local minima.

*Fixes*: A partial remedy using regularization is described on page 215 of Bro (1998).

*Download*: Download the data here. It is a so-called dataset object that contains axis scales, names, etc. It requires the free dataset object from www.eigenvector.com; alternatively, just load the data and the dataset will 'explode' into the different fields of information.

*Characteristics*: 12x50x45. Dense reals. Nonnegative components. Some small negative values in the data. The real components are linearly dependent in two modes, but this does not really affect the problem illustrated here.

**Suitable model:** Six-component PARAFAC (actually, PARALIND is appropriate, but that is beyond the scope here).

**How should it look:** The result of the analysis can be validated by checking that the scores (the component matrix in the sample mode, mode 1) are correlated with the known concentrations of the three chemicals in the samples (held in the matrix Y). Each chemical analyte has two associated score vectors, and these should be correlated with the concentrations given in Y. This is similar to what was shown for the amino acid data above, except that for these data there are two column vectors in the score matrix for each column vector in Y.

As to the local minima, you will find that when you fit the model to the data several times (from different starting values), you get different loss function values (usually the sum of squared residuals).
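One simple way to summarize such repeated fits is to collect the final loss of each run and see how many runs reach the lowest value found. Below is a minimal sketch in plain Python; the function name `summarize_restarts`, the tolerance, and the loss values are all hypothetical and are not results from the FIA data. Fitting the model itself is left to whatever PARAFAC routine you use.

```python
def summarize_restarts(losses, rel_tol=1e-6):
    """Summarize the final loss values of several fits of the same model.

    losses  : final loss (e.g. sum of squared residuals) of each run
    rel_tol : runs within this relative distance of the lowest loss are
              counted as having reached the same solution (an assumption;
              pick a value that suits your convergence criterion)
    """
    best = min(losses)
    tol = rel_tol * abs(best)
    best_runs = [i for i, v in enumerate(losses) if v - best <= tol]
    return {
        "best_loss": best,
        "best_runs": best_runs,
        "fraction_at_best": len(best_runs) / len(losses),
    }


# Made-up loss values from six hypothetical restarts.
losses = [12.40, 15.73, 12.40, 19.02, 12.40000001, 15.73]
summary = summarize_restarts(losses)
print(summary)
```

If only a small fraction of the runs ends at the lowest loss, local minima are a practical concern and more restarts are advisable. Note that the lowest loss found is not guaranteed to be the global minimum.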

**References:**

*Original publication*: Nørgaard L, Ridder C, Rank annihilation factor analysis applied to flow injection analysis with photodiode-array detection. Chemometrics and Intelligent Laboratory Systems 23:107, 1994.

*Additional publications*: Bro, R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. 1998. Ph.D. Thesis, University of Amsterdam (NL) & Royal Veterinary and Agricultural University (DK).

Bro R, Sidiropoulos ND, Least squares algorithms under unimodality and non-negativity constraints. Journal of Chemometrics 12:223, 1998.

Kiers HAL, Smilde AK, Constrained three-mode factor analysis as a tool for parameter estimation with second-order instrumental data. Journal of Chemometrics, 1998, 12, 125-147.

R. Bro, R. A. Harshman, N. D. Sidiropoulos, and M. E. Lundy. Modeling multi-way data with linearly dependent loadings. J.Chemom. 23:324-340, 2009.

**Kojima Girls**

*Problem*: The preprocessed data suffer from a two-factor degeneracy.

*Fixes*: Different preprocessing can solve the problem of the two-factor degeneracy in this case. See the references below for more details.

*Download*: This data set has very kindly been made available by Kojima and is available from Pieter M. Kroonenberg's website. The data are also available in preprocessed form here, ready for fitting. It is a so-called dataset object that contains axis scales, names, etc. It requires the free dataset object from www.eigenvector.com; alternatively, just load the data and the dataset will 'explode' into the different fields of information.

*Characteristics*: 153x4x20. Dense reals.

**Suitable model:** A two-component PARAFAC will show degeneracy.

**How should it look:** As the data suffer from a degeneracy, there is no least squares solution. PARAFAC will show components that are highly correlated.
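A common way to quantify how correlated two components are is the product of their uncentered correlations (cosines) over the three modes; values close to -1 point to a two-factor degeneracy. The following is a minimal sketch in plain Python. The function names are hypothetical and the small loading matrices are made up, not fitted to the Kojima data.

```python
from math import sqrt


def column(M, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]


def cosine(u, v):
    """Uncentered correlation (cosine of the angle) between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def triple_cosine(A, B, C, f, g):
    """Product over the three modes of the cosine between components f and g."""
    return (cosine(column(A, f), column(A, g))
            * cosine(column(B, f), column(B, g))
            * cosine(column(C, f), column(C, g)))


# Made-up two-component loadings: nearly opposite columns in the first
# mode and nearly identical columns in the other two modes.
A = [[1.0, -1.0], [2.0, -2.1], [3.0, -2.9]]
B = [[1.0, 1.0], [0.5, 0.52]]
C = [[2.0, 2.1], [1.0, 0.9], [1.0, 1.0]]

tc = triple_cosine(A, B, C, 0, 1)
print(tc)
```

Where to draw the line between "highly correlated" and "acceptable" is a judgment call; it also helps to check whether the value keeps drifting toward -1 as the algorithm iterates.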

**References:**

*Original publication*: Kojima, H. (1975). Inter-battery factor analysis of parents’ and children’s reports of parental behavior. Japanese Psychological Bulletin, 17, 33-48.

*Additional publications*: Kroonenberg PM, Harshman RA, Murakami T, Analysing three-way profile data using the Parafac and Tucker3 models illustrated with views on parenting. Applied Multivariate Research, Volume 13, No. 1, 2009, 5-41.