Frans van den Berg1), Vibeke Povlsen1),
Anette Thybo2) and Rasmus Bro1)
1) The Royal Veterinary and Agricultural University,
Dept. of Dairy and Food Science, Food Technology Core, Denmark
2) Danish Institute of Agricultural Sciences,
Dept. of Horticulture, Denmark
Modern researchers have no problem collecting huge amounts of data! With the help of
computers, electronics and hyphenated instrumentation the number of measured
variables in almost every field of science grows at enormous speed.
Sophisticated mathematical and statistical methods are developed to handle such
vast amounts of data. Unfortunately these methods are not used as frequently as
one would expect given the abundance of large dataset problems.
In the field of chemistry these new developments are
primarily organized in the discipline called chemometrics. Well-known
techniques such as Principal Component Analysis (PCA) and Partial Least Squares
Regression (PLSR) are designed to handle the correlation between series of
measured variables, revealing the important, underlying (so-called latent)
phenomena in multivariate data tables [1].
Probably the most familiar application of these methods is in multivariate
calibration for Near-Infrared (NIR) spectrometers. The highly correlated
absorbance values at different wavelengths can be used to predict e.g. the protein
content of barley samples [2]. For hyphenated (2D) techniques
like multi-wavelength-fluorescence emission-excitation spectroscopy and gas
chromatography-mass spectrometry (GC-MS) new methods like Parallel Factor
analysis (PARAFAC), Tucker-models and Multilinear-PLSR have been developed [3]. These techniques are designed to decompose higher
order data tables (e.g. cubes), again to reveal the underlying, latent
phenomena for the purpose of data analysis and predictions.
All these data-analysis methods have in common that
they are highly graphical. Next to important figures of merit like model
accuracy and precision, they are developed to represent the model diagnostics
in the form of plots. Drawings are made to e.g. determine the position of one
sample compared to all others (e.g. outlier detection). Other plots are used to
evaluate the relative importance of a variable in a multivariate data table.
In this paper we discuss a group of tools that handle
multiple blocks of data collected on the same set of samples, so-called multi-block
models. They can be considered extensions of ‘single-block’ PCA and PLSR (see footnote a). Their use can be beneficial when
analyzing large datasets where measurements are organized in conceptually
meaningful blocks. An example of such a ‘natural blocking’ could be data of
different instrumental techniques (NIR, GC, physical/rheological parameters,
etc.) used on the same set of samples. The first approach for handling this
many variables would be to put everything in one big data table and analyze
the entire block. This can however significantly blur the final results.
Multi-block models strive to maintain the natural ordering in the data. They
try to explain the relation between the different blocks and the relative
contribution of each block to the model.
Multi-block models are considered ‘data mining’ tools
in that they can give a (graphical) overview of large amounts of data, with the
aim of improving the knowledge on the subject under study by notably reducing
the complexity. Multi-block models are considered ‘exploratory’ in that they
are suitable for initial investigations of the data-mountain, to e.g.
intelligently reduce this amount of data and find a more dedicated mathematical
model from the reduced dataset.
In this section we will give an introduction to two specific methods:
multi-block PCA (developed under the name Consensus-PCA) and multi-block PLSR
(officially known as MB-PLSR with deflation on the super scores) [4]. The reader should be aware that many more
multi-block methods, dedicated to special data-analytical problems, can be
found in literature.
To assist in the explanation of the multi-block
algorithms, a brief description of the PCA and PLSR algorithms will be given
first. We start with a descriptor data table X. In this table every row
is formed by one sample (object), while the columns are formed by the
measurements (variables). If we measure e.g. NIR spectra on a set of samples,
we get a data matrix X of
size samples × wavelengths (= objects × variables).
The first PCA principal component finds a rank one bilinear model that explains
the maximum amount of variance in the original data matrix X (see Box 1a). The variance not captured by the
first factor can be subjected to a second analysis step that tries to model the
most variance in these residuals. Three sources of information become available
from the PCA model: i) how much of the total variance in X is modeled by
successive factors (‘how important is the last factor extracted’); ii) the
object score values t_i for every factor (‘what is the role of
individual samples compared to others for this factor’); iii) the variable
loadings p_i (‘what part do different variables play in this
factor’). We also get information on what has not been captured by the factor.
To come back to our spectroscopy example: if the first factor explains 80% of
the total variance in X, and the object score values in t_1
show a clear separation into two clusters, we know that there are likely to be
two types of samples in our dataset. We can use the loadings vector p_1
to identify NIR absorbance peaks that cause the two clusters to differ. PCA is
used to study the full data table in a model of reduced complexity, formed by a
small number of scores and loadings. Any experienced NIR user can tell that
deriving conclusions from raw NIR data tables can be very hard!
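The rank-one PCA step described above can be illustrated with a minimal sketch in plain Python. It assumes the NIPALS iteration as the way to find the first factor (the text does not prescribe a particular algorithm), and the function name and the toy data are hypothetical. It returns the three sources of information mentioned: the fraction of variance explained, the scores t_1 and the loadings p_1, plus the residuals that a second factor would model.

```python
import math

def first_principal_component(X, max_iter=500, tol=1e-12):
    """Rank-one PCA of X (list of sample rows) by NIPALS iteration."""
    n, m = len(X), len(X[0])
    # Column mean centering: PCA models the variation around the mean.
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Start the scores from the column with the largest sum of squares.
    j0 = max(range(m), key=lambda j: sum(r[j] ** 2 for r in Xc))
    t = [r[j0] for r in Xc]
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: regress every column of X on the current scores.
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project every sample onto the unit-length loadings.
        t_new = [sum(Xc[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change <= tol * sum(v * v for v in t):
            break
    # Residuals E = X - t p^T, the input for a possible second factor.
    E = [[Xc[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
    total = sum(v * v for r in Xc for v in r)
    explained = 1.0 - sum(v * v for r in E for v in r) / total
    return t, p, explained, E

# Hypothetical toy table: six samples, three variables, two sample types.
X = [[1.0, 2.0, 3.0], [1.1, 2.0, 2.9], [0.9, 1.9, 3.1],
     [3.0, 4.0, 1.0], [3.1, 4.1, 0.8], [2.9, 3.9, 1.1]]
t1, p1, explained, E = first_principal_component(X)
print([round(v, 2) for v in t1], round(explained, 3))
```

In this toy table the scores of the first three samples have the opposite sign of the last three, which is the kind of two-cluster separation discussed in the spectroscopy example.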
In PLSR we are looking for a regression relation
between a descriptor block X and a response block Y (Box 1b). In our NIR example Y could e.g.
contain the concentrations of some components of interest, determined by laboratory
reference methods. The first PLSR factor (‘Latent Variable’) builds two
bilinear models (one for X and one for Y) that are optimized to
simultaneously explain the maximum amount of variance in X and predict
as well as possible the response variables in Y. The same diagnostics as
in PCA – percentage explained in X and Y, score- and
loading-values for both models – are available in the PLSR.
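For the special case of a single response variable (often called PLS1), the first PLSR factor can be computed without iteration. The sketch below is an illustration under that assumption, not the algorithm of Box 1b; the function name, the returned dictionary keys and the toy data are hypothetical.

```python
import math

def pls1_first_factor(X, y):
    """First PLSR factor for one response variable y (PLS1)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    # Weights: the direction in X with maximum covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: position of every sample along that direction.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    tt = sum(v * v for v in t)
    # Loadings for the X model and regression coefficient for the y model.
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
    q = sum(yc[i] * t[i] for i in range(n)) / tt
    y_fit = [y_mean + q * ti for ti in t]
    return {"w": w, "t": t, "p": p, "q": q, "y_fit": y_fit}

# Hypothetical toy data: two correlated variables and one reference value.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
y = [1.0, 2.0, 3.0, 4.0]
model = pls1_first_factor(X, y)
print([round(v, 2) for v in model["y_fit"]])
```

With several response variables in Y the scores of X and Y have to be found iteratively, which this sketch does not cover.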
As stated in the introduction, multi-block methods
have the ‘restriction’ that different blocks have to have one mode in common.
This is usually the sample mode: a series of experiments, divided in meaningful
blocks, are run on the same set of objects. Next, there are two conceptual
viewpoints to handle these data blocks by multi-block methods. The first one is
to consider each data block as a separate source of information, where the task
of the multi-block model is to express the common structure for the objects.
This object ‘consensus’ is formulated in a so-called super level, an additional
top layer, combining information from all X-blocks on the lower data
level. The alternative view on multi-block modeling is as follows: we have a
large number of measurements, and we want to use all of them in an analysis or
regression problem. Thus, we form one large data table to do the computations.
However, we know that there are distinct groups of variables and we want to
keep track of these separate blocks. At the super level we have the augmented
data block. One level lower we have the individual blocks. We can actually go
one level deeper, by looking at the individual variables in these blocks (just
like regular PCA or PLSR).
In MB-PCA (Box 1c)
we get the following information: i) percentages of explained variance for the
augmented block (‘how important is the common factor over all blocks’); ii)
super object score values t_s (‘what is the influence of an
object, seen over all blocks’); iii) super block-weights w_s
(‘how important is a block for this factor’). Besides these three tools we have
diagnostics on the block level similar to regular PCA.
In MB-PLSR a regression model is found between
response block Y and a super descriptor block T, which itself is
a function of the original descriptor X-blocks (Box 1d). The same diagnostics as in
MB-PCA and regular PLSR are available, again with the additional
restriction that score values and loadings are optimized for object consensus
at the super level.
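The two-level structure can be made concrete with a small sketch of one MB-PLSR factor for a single response variable. It follows our reading of the super-score formulation in [4] and rests on several assumptions: the blocks are already scaled, no extra correction for block size is applied, and with one response column no iteration is needed. All names and the toy data are hypothetical.

```python
import math

def _center(M):
    """Column mean center a table given as a list of rows."""
    n, m = len(M), len(M[0])
    means = [sum(r[j] for r in M) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in M]

def _unit(v):
    """Scale a vector to length one."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def mbpls_first_factor(blocks, y):
    """One multi-block PLS factor for a single response y (sketch)."""
    n = len(y)
    y_mean = sum(y) / n
    u = [v - y_mean for v in y]          # with one response, u is just y
    uu = sum(v * v for v in u)
    block_weights, block_scores = [], []
    for X in blocks:                     # lower (block) level
        Xc = _center(X)
        m = len(Xc[0])
        w_b = _unit([sum(Xc[i][j] * u[i] for i in range(n)) / uu
                     for j in range(m)])
        t_b = [sum(Xc[i][j] * w_b[j] for j in range(m)) for i in range(n)]
        block_weights.append(w_b)
        block_scores.append(t_b)
    # Super level: the block scores form the columns of T.
    w_s = _unit([sum(t_b[i] * u[i] for i in range(n)) / uu
                 for t_b in block_scores])
    t_s = [sum(w_s[b] * block_scores[b][i] for b in range(len(blocks)))
           for i in range(n)]
    q = sum(u[i] * t_s[i] for i in range(n)) / sum(v * v for v in t_s)
    return w_s, t_s, q, block_weights

# Hypothetical toy data: block 1 follows y closely, block 2 hardly at all.
X1 = [[-1.5, -1.4], [-0.5, -0.6], [0.5, 0.4], [1.5, 1.6]]
X2 = [[1.0, -1.0], [-1.0, 0.5], [-1.0, 1.0], [1.2, -1.0]]
y = [1.0, 2.0, 3.0, 4.0]
w_s, t_s, q, block_weights = mbpls_first_factor([X1, X2], y)
print([round(v, 3) for v in w_s])
```

The super weights have length one, and in this toy case the weight of the first block is much larger than that of the second, reflecting how little the second block contributes to the factor. For the next factor, the blocks and the response would be deflated with the super scores t_s.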
It is important to emphasize that different blocks
should be weighted before they can be used in one model (similar to the
weighting of individual variables in regular PCA and PLSR). If the variance in
one block is much larger than all others, this block will dominate the
solution, and the conclusions can be misleading.
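One common way to put blocks on an equal footing is sketched below: every block is column mean centered and then divided by the square root of its total variance, so that each block enters the model with a total variance of one. This is one possible reading of block scaling, chosen here as an assumption; the function name and the toy blocks are hypothetical.

```python
import math

def scale_block(X):
    """Center the columns of a block and scale it to unit total variance."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Total block variance: sum of all column variances (n - 1 denominator).
    total_var = sum(v * v for r in Xc for v in r) / (n - 1)
    s = math.sqrt(total_var)
    return [[v / s for v in r] for r in Xc]

# Hypothetical blocks measured on very different numerical scales.
A = [[1000.0, 2000.0], [1500.0, 1000.0], [500.0, 3000.0]]
B = [[0.01, 0.02, 0.03], [0.03, 0.01, 0.02], [0.02, 0.03, 0.01]]
scaled = [scale_block(A), scale_block(B)]
for S in scaled:
    print(round(sum(v * v for r in S for v in r) / (len(S) - 1), 6))
```

Without this step block A, with values in the thousands, would dominate any combined model purely because of its units.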
The theory explained in the previous section will now be applied to an example from
the agricultural industry. Five different potato varieties were harvested in
September 1999 and analyzed in November 1999 and May 2000. The yields were sorted
by salt weighting in two or three dry matter intervals, resulting in thirteen
and ten different so-called bins for the two storage times. From these bins
tubers were selected for laboratory analysis and sensory evaluation. The
lab measurements consist of uniaxial compression curves on raw and cooked
potato material (Figure 1). In this technique a small potato sample is
compressed at constant velocity under
well-controlled conditions. The force resistance of the potato material – a
function of chemical and physical composition of the tuber, expected to be
related to consumer experience of the product – is recorded. These experiments
are repeated for ten tubers from each variety. The uniaxial compression curves,
averaged over the ten replicates to reduce the natural variation in the bins, form
the predictor blocks (raw X_1 and cooked X_2)
in our multi-block models.
The response block is formed by a sensory
evaluation of the same potato bins. A trained sensory panel of ten assessors
evaluated the cooked tubers on a number of attributes. In this paper we will use
two of them: Cohesiveness and Mealiness. The panelists scored these attributes
on a scale from zero to fifteen, and the panel average of these scores is used as the Y-block.
In total we have 23 objects, two X-blocks of
compression ‘spectra’, and Y-blocks with Cohesiveness or Mealiness
sensory scores. The two X-blocks were scaled to have block unit
variance, and all three data tables were column mean centered before modeling.
The correct model complexity was determined from a so-called
leave-one-sample-out cross validation. In this procedure one potato sample is
taken from the calibration set, an MB-PLSR model is built from the remaining 22
samples, and the response value for the removed sample is predicted. This
procedure is repeated for all 23 samples in the set. The overall prediction
error is determined as the Root Mean Squared Error of Prediction (RMSEP), which
usually shows a minimum for the optimal model complexity.
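The cross validation loop itself is independent of the model being validated. The sketch below shows the leave-one-sample-out procedure and the RMSEP calculation; to keep it short and self-contained it uses a regular single-block PLS1 model as a stand-in for the MB-PLSR model, which is an assumption made for illustration only. Function names and the toy data are hypothetical.

```python
import math

def pls1_predict(X_train, y_train, x_new, n_factors):
    """Fit a PLS1 model with n_factors and predict y for one new sample."""
    n, m = len(X_train), len(X_train[0])
    x_mean = [sum(r[j] for r in X_train) / n for j in range(m)]
    y_mean = sum(y_train) / n
    # Centering uses the training samples only, never the left-out one.
    X = [[r[j] - x_mean[j] for j in range(m)] for r in X_train]
    y = [v - y_mean for v in y_train]
    x = [x_new[j] - x_mean[j] for j in range(m)]
    y_hat = y_mean
    for _ in range(n_factors):
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:                 # nothing left to model
            break
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(y[i] * t[i] for i in range(n)) / tt
        t_new = sum(x[j] * w[j] for j in range(m))
        y_hat += q * t_new
        # Deflate training data and the new sample before the next factor.
        X = [[X[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        y = [y[i] - q * t[i] for i in range(n)]
        x = [x[j] - t_new * p[j] for j in range(m)]
    return y_hat

def loo_rmsep(X, y, n_factors):
    """Leave-one-sample-out RMSEP for a given model complexity."""
    squared_errors = []
    for k in range(len(y)):
        X_train = X[:k] + X[k + 1:]
        y_train = y[:k] + y[k + 1:]
        prediction = pls1_predict(X_train, y_train, X[k], n_factors)
        squared_errors.append((prediction - y[k]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data: six samples, three variables, one response.
X = [[1.0, 0.0, 0.5], [2.0, 1.0, 0.4], [3.0, 1.0, 0.6],
     [4.0, 3.0, 0.5], [5.0, 2.0, 0.4], [6.0, 4.0, 0.6]]
y = [1.0, 4.1, 4.9, 10.2, 8.9, 14.1]
for a in (1, 2):
    print(a, round(loo_rmsep(X, y, a), 3))
```

Plotting the RMSEP against the number of factors gives the kind of curve from which the model complexity is read off.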
The potato data were subjected to MB-PLSR
modeling. Figure 2 shows the primary
information we get for the two models. Cohesiveness: from the cross validation
RMSEP curve (Fig. 2c) we see that a
two-factor model is optimal. The super weights w_s are
normalized to length one in the MB-PLSR algorithm. This means that a high value
(close to one) indicates that a block is important in this factor, while a low
value means the block has little involvement. The super weights w_s
for Cohesiveness (Fig. 2b) show that both
blocks are of approximately equal importance, with a slight preference for the cooked
sample compression curves (green bar). From the percentages of variance explained
we learn that only a small part of the X_1-block (raw
samples) information is used in the first factor (Fig. 2a).
The Mealiness RMSEP values also indicate a two-factor
model. From w_s we learn that the first factor mostly depends
on the X_1-block (raw samples) while the second factor is dominated by the X_2-block
(cooked samples). The explained variances start out high for both blocks, hence
the relatively low model complexities.
Figure 3 shows
the predicted versus reference values from cross validation for the two sensory
attribute models. The predictive performance for Cohesiveness (squared
correlation coefficient R² = 0.77) and Mealiness (R² = 0.81) is
considered acceptable (Fig. 3a), in line
with other sensory attribute modeling experiments [5].
From the X-block loadings p_Xi we see that for the raw
potato sample compression curves the information
extracted for the two attributes is approximately the same (the sign of a loading vector is undetermined),
with a switch in the factor order (Fig. 3b).
For the cooked potato sample curves Mealiness shows somewhat more skewed
loading vectors (Fig. 3c).
In Figure 4 we
plot the super-level scores on factor 2 versus factor 1. The figure
shows the effect of variety and dry matter on Mealiness and Cohesiveness, and the samples
cluster according to these design variables. The two models show approximately
the same clustering, with the observation that factors
one and two have switched places between the Cohesiveness and Mealiness results.
In this paper we tried to familiarize the reader with the concept of multi-block
models. The general idea of these methods is to get a comprehensible (graphical) overview of a large amount of
information, while maintaining the natural order in the data (block structure).
The alternative to multi-block methods would be to analyze the individual
blocks and try to reach an overall conclusion from the separate observations.
In the theory and experimental section of this paper
we showed an example of the simplest multi-block situation: two laboratory
predictor blocks for the prediction of sensory attributes Cohesiveness and
Mealiness in potato samples. This data is part of a larger study where five X-blocks
– both physical and chemical – with eight different sensory attributes are
available.
The methods presented in this paper are extensions of
the well-known bilinear modeling methods PCA and PLSR. There are however many
more multi-block methods available from literature, sometimes highly dedicated
to specific analysis problems. E.g. in the realm of chemical engineering,
process measurements and settings at different points in time are used as
separate blocks in the algorithms [6], while in
sensory texture studies the panel participants can be seen as ‘blocks’ with the
product under evaluation as common denominator [7].
The work presented in this paper is part of the Advanced Quality Monitoring (AQM)
project, a joint framework of KVL (LMC), DJF and DFU.
[1] H. Martens and M. Martens, ‘Multivariate Analysis of Quality: An Introduction’, Wiley (2001)
[2] J.S. Shenk and M.O. Westerhaus, ‘Population structuring of near-infrared spectra and modified partial least-squares regression’, Crop Science 31, no. 6 (1991) 1548-1555
[3] C.A. Andersson and R. Bro, ‘The N-way Toolbox for Matlab’, Chemometrics and Intelligent Laboratory Systems 52 (2000) 1-4
[4] J.A. Westerhuis, Th. Kourti and J.F. MacGregor, ‘Analysis of Multiblock and hierarchical PCA and PLS Models’, Journal of Chemometrics 12 (1998) 301-321
[5] A.K. Thybo, I.E. Bechmann, M. Martens and S.B. Engelsen, ‘Prediction of Sensory Texture of Cooked Potatoes using Uniaxial Compression, Near Infrared Spectroscopy and Low Field ¹H NMR Spectroscopy’, Lebensmittel-Wissenschaft und -Technologie 33 (2000) 103-111
[6] A.K. Smilde, J.A. Westerhuis and R. Boqué, ‘Multiway multiblock component and covariates regression models’, Journal of Chemometrics 14 (2000) 301-331
[7] G.M. Arnold and A.A. Williams, ‘The use of Generalised Procrustes Techniques in Sensory Analysis’, in J.R. Piggott, ‘Statistical Procedures in Food Research’ (1986)
a) Historically this is
not entirely correct. PLSR was designed as a general (‘multi-block’) method and
later evolved into the X-Y block regression method popular today.