Multi-block
Toolbox for Matlab
Frans van den Berg,
The Royal
Veterinary and Agricultural University,
Dept. of Dairy
and Food Science, Food Technology, Denmark

On
these pages you can find some information on a project we are working on here
at KVL. The goal of this project is to study, implement (in the Matlab
numerical science computer language) and apply so-called multi-block models.
Multi-block methods are mathematical/statistical techniques for analysis and
regression on several data-blocks simultaneously [1].
The basic requirement is that these data-blocks have one ‘mode’ in common. The
objects usually form this common dimension: e.g. a series of measurements and
experiments is performed on a sample set. The scientist/researcher can point
out a natural/logical blocking, organizing the measurements in conceptually
meaningful blocks (e.g. spectroscopy measurements on a sample versus a number
of rheological parameters). Multi-block methods are developed to analyze such
data, with the objective of keeping track of the different blocks during the
analysis. This is done much in the same way as one keeps track of individual
variables in regular data analysis [2].
A
short introduction to the application of multi-block
models and the terminology can be found here. An extended reference
list accompanies this work. In this list you can find the original work
where we got the ideas and/or algorithms for this toolbox.
Here
you can DOWNLOAD (version
The front entrance of the toolbox consist of three functions: MBmodel, MBlabel and MBplot. A small flow sheet
shows what has been implemented so far, and the route a normal analyses or
regression experiment would take.
Next
to the three main functions the toolbox contains a number of multi-block
algorithms (MBcpca, MBpca, MBhpca, MBspls, MBbpls, MBhpls) and miscellaneous functions (MBT_PCA, MBT_PLS, MEANC, AUTOSC, MBnorm, MBdata).
In the toolbox we distinguish between Analysis: algorithms for
doing data analysis in multiple X-blocks; and Regression:
algorithms for doing regression between one or more X-blocks and one or
more Y-blocks (figure). We keep the
convention that all data tables have objects/samples in the rows, and that this
is the mode that all blocks have in common (same number of rows). The columns
in X- and Y-blocks (the variables) may differ. The blocks at the
super levels (T for X-blocks and U for Y-blocks)
are used to describe what different blocks have in common (the object or block
‘consensus’). These super-blocks are what makes the algorithms differ from e.g.
PCA on one big data table X.
Basically,
the methods try to describe maximum variance in the individual blocks while
fulfilling the requirements/restrictions imposed by the T-super block,
in the hope that a common structure in the object is found. The T-block
at the super level will reveal this common structure and show the role of
individual blocks in the analysis or regression.
Note
that scaling is an important issue in multi-block applications. Just like it is
important to equal out big differences between individual variables in regular
bilinear modeling (e.g. auto scale process variables that are of completely
different magnitudes in PCA/PLS), it is important to think about the variance
(‘sum-of-squares’) for different blocks. If one block has a much higher
variance than others it will dominate the results. A good starting point is to
give each block equal weight (e.g. unit block variance). You can always rerun
the analysis giving more weight to individual blocks if you have good arguments
to do so (see references for some hints). The main
routine MBmodel can call a scaling function MBnorm to change block variances.
The Analysis part of
the toolbox consists of a number of implementations to do multi-block
‘PCA’-like data analysis (this figure shows a
graphical definition of PCA on this web-page, used for comparison with
multi-block PCA algorithms). Most functions have the same form and output (figure), but all with there own little twist.
The Regression part
is formed by a number of multi-block ‘PLS’-like modeling functions (figure). Functions have the same form and output, but
again with specific characteristics (figure).
This is the main routine from the multi-block
toolbox.
Format
[MB] = MBmodel(X,Xin,Xpp,Y,Yin,Ypp,model)
Input:
X (objects x all X-block
variables) data matrix X
Xin (number of X-blocks
x 2) index for X-blocks
(optional:)
Xpp (number of X-blocks
x 3) processing per X-block
Y (objects x all Y-block
variables) data matrix Y
Yin (number of Y-blocks
x 2) index for Y-blocks
Ypp (number of Y-blocks
x 1) processing per Y-block
Model (1 x 2)
first entry selects yes (1) or no cross-validation (0, default)
second or overrides
default modeling methods
Output:
MB (record) record/structure
with all the analysis results
The minimum input to the MBplot function is one
augmented X-block and the index for the different block segments: MB =
mbmodel(X,Xin)
E.g. assuming 10 objects and two X-blocks with
20 and 30 variables the input in Matlab notation would read X (10x20+30)
= [X1 (10x20) X2 (10x30)] and Xin = [1
20;21 50]. This routine input ask the user how to process the augmented X-block,
and compute a MBCPCA model (the default). The results would be stored in the
MB-record. To skip the interactive questions on data processing one can specify
them in the input Xpp: MB = mbmodel(X,Xin,Xpp).
E.g. Xpp = [1 1 4;1 2 4] would perform the following
tasks: for block one (first row) scale block to sum-of-squares one (1), mean
center data (1) and compute four factors (4); for the second block:
sum-of-squares one (1), auto-scale (2) and four factors (4). The general format
for Xpp is: [sum-of-squares
pre-processing factors], where
pre-processing can have the values 0 (none), 1 (mean center) and 2 (auto
scale).
Regression models are entered as MB =
mbmodel(X,Xin,Xpp,Y,Yin). If Yin is omitted one Y-block is assumed,
otherwise Yin must have the same indexing format as Xin (one row per block,
first and last index per block in first and second column, respectively). Y-block
processing input Ypp can be added as a column vector (length equal to the
number of blocks) with values 0 (none), 1 (mean center) and 2 (auto scale).
The last input (‘model’) can be used to deduce
cross-validation to determine correct system complexity (first entry; see figure) and to override the default algorithms (MBcpca for analysis and MBspls for regression):
0=MBcpca, 1=MBpca, 2=MBhpca, 10=MBspls, 11=MBbpls, 12=MBhpls.
Here are a few example inputs using the test data set MBdata:
>>
MB = mbmodel(X,[1 50;51 100],[1 1 5;1 1 5]);
MBcpca on two X-blocks, both scaled to norm one and
mean centred, computing 5 factors, no cross-validation.
>>
MB = mbmodel(X,[1 50;51 100],[1 1 5;1 1 5],[],[],[],1);
Same as before, but with cross-validation.
>>
MB = mbmodel(X,Xin4,[1 1 2;1 1 3;1 1 2;1 1 3],[],[],[],[0 2]);
MBhpca on four X-blocks, no cross-validation.
>> MB = mbmodel(Xmv,Xin2,[],Y,Yin2);
MBspls for two X-blocks (with missing values)
and two Y-blocks, with user interactive processing questions for x and
y, no cross-validation.
>>
MB = mbmodel(X,[1 100]);
MBcpca on one data block (should be the same as just
PCA on the whole block)! Some entries on the super and block level will contain
the same values.
The easiest way to save a result from MBmodel in
Matlab is as follows:
>>
save filename MB
The record MB saved in ‘filename’ can later on be
reloaded and used in e.g. the plotting routine (‘MB’ and ‘filename’ are of
coarse free choices for the user!).
To access the contents of the MB-record use e.g. the
following command line:
>> plot(MB.variable)
Where ‘variable’ is the variable you want to plot.
Not all algorithms have the same record contents, and the specific contents for
different algorithms will not be explained. However, we have tried to be
consistent in the definition of the records, and knowledge of Matlab and
bilinear modeling should make it fairly easy to use the record contents
directly. All algorithms (see below) are separate functions that can be
addressed directly, without interaction or using the general MBmodel-function.
Add labels to MB-record with analysis/regression
results from MBmodel function.
Format:
[MB] = MBlabel(MB,objlab,Xvarlab,Yvarlab)
Input:
MB (record) an
analysis/regression record form MBmodel
objlab (objects x string length)
is a string array with one row for every object/sample in the data set,
all of the same length
Xvarlab (1 x all X-block
variables) one row vector with numerical values to put on the x-axis of
plots
Yvarlab (1 x all Y-block
variables) one row vector with numerical values to put on the x-axis of
plots
Output:
MB (record) an
analysis/regression record with the labels filled in
If the labeling function is skipped, leaving the
labels empty, the plot routine MBplot will fill in numbers.
Some examples of MBlabel using the MBdata
test set:
>>
MB = mblabel(MB,objlab,Xvarlab,Yvarlab);
Puts labels on all data blocks in the record MB.
>>
Yep = mblabel(MB,[],[],Yvarlab);
Creates a new record ‘Yep’ (same contents as ‘MB’)
with labels for the Y-block variables.
A function to make the essential plots from
multi-block analysis or regression results.
Format:
MBplot(MB)
Input:
MB (record) an
analysis/regression record form MBmodel
This (somewhat primitive) function guides the user
through a series of menus, where different plots can be selected by entering
numbers at the command line. If the symbol “:” shows in the query line you can
enter more than one number in a Matlab vector notation; e.g. 1:3 or [2 4] to
plot vectors 1 to 3 or 2 and 4 respectively.
The available plots depend on the type of analysis
performed, and the number of X and Y-blocks. The selection of
plots is based on our experience, but more combinations are definitely
possible. If you require more advanced plots you can always retrieve the model
parameters from the MB-record.
The routine will start by closing all the present
Matlab figures. For every plot it will open up a new window. If a figure is not
interesting, just close it. If you want to save a figure to a file, you can do
so through the menus of the figure window. Nowadays you can even edit a figure,
add text, etc. through the figure menus.
As stated not all possible plot are in the
MBplot-routine (you wouldn’t believe how many possibilities there are for just
a simple PCA!). The user can however create here/his own plots by directly
accessing the results. Here is an example from cpca-analysis of a U-block
scores 2 versus 1, with the loading values 2 and 1 from the tops of the four
Gaussian peaks in the MBdata-set imposed:
>> load mbdata
>> MB = mbmodel(X,Xin2,[1 1 2;1 1 2]);
>> MB = mblabel(MB,objlab,Xvarlab);
>> figure; tops=[25 35 70 90];
>> plot(MB.Tt(:,1),MB.Tt(:,2),'o',MB.Pb(tops,1),MB.Pb(tops,2),'+')
>> text(MB.Tt(:,1),MB.Tt(:,2),MB.objlab)
>> text(MB.Pb(tops,1),MB.Pb(tops,2),num2str(MB.Xvarlab(tops)'))
>> grid; xlabel('Tu_1 Pb_1');
ylabel('Tu_2 Pb_2')
Which should result in something like this.
This function computes ‘consensus PCA’ on multiple X-blocks.
Format:
[Tb,Pb,Wt,Tt,ssq,Rbo,Rbv,Lbo,Lbv,Lto] = MBcpca(X,Xin,nPC,tol)
Input:
X (objects x all
variables) single, augmented data-blocks
Xin (number of blocks x 2)
begin to end variable index for this block; index for X-block
nPC (number of blocks x 1)
number of principal components extracted per block
tol (1 x 1) tolerance for
convergence (default 1e-12)
Output:
Tb (objects x number of
blocks.max(nPC)) X-block scores, [t1-block-1 t1-block-2
...t2-block-1...]
Pb (all variables x
max(nPC)) X-block loadings
Wt (number of blocks x
max(nPC)) super T-weights
Tt (objects x max(nPC))
super scores
ssq (max(nPC) x 1 + number of
blocks) cumulative sum of squares for whole and individual blocks
Rbo (objects x max(nPC)) X-block
object residuals
Rbv (all variables x
max(nPC)) X-block variable residuals
Lbo (objects x max(nPC)) X-block
object leverages
Lbv (all variables x
max(nPC)) X-block variable leverages
Lto (objects x max(nPC)) T-block
score object leverages
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. The algorithm can handle (a
limited amount!) of missing values (NaN). The T-block scores Wt are
normalized, which makes it easy to interpret block contributions. The algorithm
follows a NIPALS-like routine for bilinear modeling of individual X- and
super T-blocks.
Function to compute regular PCA on individual X-blocks,
where the results are organized in the same way as the true multi-block models,
to make comparison easier.
Format,
Input and Output: see MBcpca
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. Since there is no super
level for separate PCA-models on individual blocks, T-block scores and
loadings contain only zeros. MBpca uses the regular MBT_PCA-routine.
Function to compute ‘hierarchical PCA’ on multiple X-blocks [3].
Format,
Input and Output: see MBcpca
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. A slight change is made to
the original algorithm since this turned out to give non-unique solutions [1][4].
The algorithm is based on a NIPLAS-like
decomposition, and can handle a small amount of missing values.
Multi-block PLS function with X-block
deflation on the T-super scores [7][1].
Format:
[Tb,Pb,Wb,Wt,Tt,Ub,Qb,Wu,Tu,B,ssq,Rbo,Rbv,Ry,Lbo,Lbv,Lto] =
MBspls(X,Xin,nLV,Y,Yin,tol)
Input:
X (objects x all
X-variables) single, augmented X-data-block
Xin (number of X-blocks x 2)
= begin to end variable index for this X-block; index for X-block
nLV (number of blocks x 1)
number of latent variables extracted per X-block
Y (objects x all
Y-variables) single, augmented Y-data-block
Yin (number of Y-blocks x 2) = begin to end variable index for this Y-block; index for Y-block
tol (1 x 1) tolerance for
convergence (default 1e-12)
Output:
Tb (objects x number of
X-blocks.max(nLV)) block scores, [t1-block-1 t1-block-2 ...t2-block-1...]
Pb (X-variables x max(nLV)) X-block
loadings
Wb (X-varaibles x max(nLV)) X-block
weights
Wt (number of X-blocks x
max(nLV)) X-block super weights
Tt (objects x max(nLV)) X-block
super scores
Ub (objects x number of
Y-blocks.max(nLV)) Y-block scores
Qb (Y-variables x max(nLV)) Y-block
weights
Wu (number of Y-blocks x
max(nLV)) Y-block super weights
Tu (objects x max(nLV)) Y-block
super scores
B (X-variables x Y-variables.max(nLV))
regression vectors
ssq (max(nLV) x 1+number of
X-blocks+1+number of Y-blocks) cumulative sum of squares, [Xt X-blocks Yt
Y-blocks]
Rbo (objects x max(nLV)) X-block
object residuals
Rbv (all variables x
max(nLV)) X-block variable residuals
Ry (objects x
Y-variables.max(nLV)) Y-block residuals
Lbo (objects x max(nLV)) X-block
object leverages
Lbv (all variables x
max(nLV)) X-block variable leverages
Lto (objects x max(nLV)) X-block
super score object leverages
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. The algorithm can handle (a
limited amount!) of missing values (NaN) in both x and y. The T-and U-block
scores are normalized, which makes it easy to interpret block contributions.
The algorithm follows a NIPALS-like routine for bilinear modelling of
individual blocks.
Multi-block PLS function with X-block
deflation on the X-block scores [8]-[10] [1].
Format,
Input and Output: see MBspls
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. The algorithm can handle (a
limited amount!) of missing values (NaN) in both x and y. The T-and U-block
scores are normalized, which makes it easy to interpret block contributions.
The algorithm follows a NIPALS-like routine for bilinear modelling of
individual blocks.
Function to compute ‘hierarchical PLS’ on multiple X-blocks [3]. The algorithm can not handle multiple Y-blocks
because it is not clear yet how to implement this in the algorithm from literature.
Format,
Input and Output: see MBspls
When blocks do not participate for scores or loading,
the corresponding matrices will contain all zeros. A slight change is made to
the original algorithm since this turned out to give non-unique solutions [1][4].
The algorithm is based on a NIPLAS-like
decomposition, and can handle a small amount of missing values.
Function to compute
Principal Component Analysis bilinear model of single X-matrix via
NIPALS algorithm [5]. Since many people might have
principal component algorithms of there own, we have decided to name the
function ‘MBT_PCA’ instead of simply ‘PCA’.
Format:
[T,P,ssq,Ro,Rv,Lo,Lv] = mbt_pca(X,nPC,tol)
Input :
X (objects x variables)
data block
nPC (1 x 1) number of
principal components
tol (1 x 1) tolerance for
convergence (default 1e-12)
Output:
T (objects x nPC) scores
P (variables x nPC)
loadings
ssq (nPC x 1) cumulative sum
of squares
Ro (objects x nPC) object
residuals
Rv (variables x nPC)
variable residuals
Lo (objects x nPC) object
leverages
Lv (variables x nPC)
variable leverages
The function can handle missing values (NaN).
Function to compute Partial
Least Squares regression by bilinear modelling of single X- and Y-matrix
via NIPALS algorithm [6]. We have named the function
MBT_PLS as not to override other partial least squares algorithms.
Format:
[T,P,W,U,Q,B,ssq,Ro,Rv,Lo,Lv] = mbt_pls(X,Y,nLV,tol)
Input:
X (objects x m-variables)
data-block
Y (objects x p-variables)
data-block
nLV (1 x 1) number of latent
variables
tol (1 x 1) tolerance for
convergence (default 1e-12)
Output:
T (objects x nLV) X-scores
P (m-var
x nLV) X-loadings
W (m-var x nLV) X-weights
U (objects x nLV) Y-scores
Q (p-var x nLV) Y-weights
B (m-var x p-var.nLV)
regression vectors
ssq (nLV x 2) cumulative sum
of squares X and Y
Ro (objects x nLV) object
residuals
Rv (variables x nLV)
variable residuals
Lo (objects x nLV) object
leverages
Lv (variables x nLV)
variable leverages
The function can handle a limited amount of missing
values (NaN) in both X and Y-block.
Pre-processing
routines for mean centring or auto scaling data tables.
Format:
[Zpp,mz] =
meanc(Z)
[Zpp] =
meanc(Z,mz)
[Z] =
meanc(Zpp,mz,rev)
[Zpp,mz,stdz] =
autosc(Z)
[Zpp] =
autosc(Z,mz,stdz)
[Z] =
autosc(Zpp,mz,stdz,rev)
Input/Output:
Z (objects x variables) data Z-block
Zpp (objects
x variables) column pre-processed data matrix
(mean centred = column mean zero,
auto scale = column mean zero, unit standard deviation)
mz (1 x variables) column means
stdz (1 x variables) column standard deviations
rev (1 x 1) trigger variable to reverse
preprocessing operation
Both
routines can handle missing values. Enter them as ’NaN’ (Not-a-Number’ Matlab
variable) in the Z-matrix. Inside the algorithm these variables will be
neglected when computing the column means and standard deviations. Thus, a
column is mean centered by adding all the numbers, dividing this sum by the
count of numerical entries and subtracting this average from all the numbers in
the column. To reverse the centering of auto scaling set the 3th/4th
dummy input to 1. This will resulted in a data table in the original scales
(post-preprocess @
process?).
This
routine rescales a data table Z to a desired norm (sum-of-squares)
Format:
[Zs,f] =
MBnorm(Z,targetnorm)
Input:
Z (objects x variables) Z-block
data
targetnorm (1 x 1) desired norm (sum-of-squares)
for the Z-block
Output:
Zs (objects x variables) the
scaled data matrix
f (1 x 1) the factor that
achieves the desired norm
This routine can handle missing values (NaN).
This
is a Matlab data file that contains some simulated matrices for first
experiments.
Contents:
X (10x100)
double array
Xin2 (2x2) double array
Xin4 (4x2) double array
- Predictor data
block with 10 simulated Gaussian-curves plus normally distributed noise, which
can be used as 2- or 4-segmented X-block (Indexing Xin2 or Xin4, reps.).
Xmv (10x100) double array
- Same as X, but
with some ‘tactically placed’ missing values.
Y (10x3) double array
Yin2 (2x2) double array
- Response data
block with 3 different variables (linear combinations of X = mT.[1 0
0.5; 1 1 0;1 0 0; 0 3 0]) plus normally distributed noise, which can be used as
one or 2-segmented Y-block.
objlab (10x2) char array
Xvarlab (1x100) double array
Yvarlab (1x3) double array
- String and
numerical labels for objects/samples, x- and y-variables, respectively.
mP (50x4) double
array; mT (10x4) double array
- The true
loadings and scores for generating the X- and Y-blocks.
See MBmodel and MBlabel for
examples on how to use this data.
[1]
J.A.Westerhuis, Th.Kourti and J.F.MacGregor ‘Analysis of Multiblock and
hierarchical PCA and PLS Models’ Journal of Chemometrics 12(1998)301-321
[2] H.Martens and
M.Martens ‘Multivariate Analysis of Quality – and introduction’ Wiley(2001)
[3] S.Wold,
N.Kettaneh and K.Tjessem ‘Hierarchical Multiblock PLS and PC Modles for Easier
Model Interpretation and as an Alternative to Variable Selection’ Journal of
Chemometrics 10(1996)463-482
[4] S.Rännar,
J.F.MacGregor and S.Wold ‘Adaptive Batch Monitoring using Hierarchical PCA’
Chemometrics and Intelligent Laboratory Systems 41(1998)73-81
[5] S.Wold,
K.Esbensen and P.Geladi ‘Principal Component Analysis’ Chemometrics and
Intelligent Laboratory Systems 2(1987)37-52
[6] P.Geladi and
B.R.Kowalski ‘Partial Least-Squares Regression: A Tutorial’ Analytica Chimica
Acta 185(1986)1-17
[7]
J.A.Westerhuis and PJ. Coenengracht ‘Multivariate Modelling of the
Pharmacuetical Two-Step Process of Wet Granulation and Tableting with
Multiblock Partial Least Squares’ Journal of Chemometrics 11(1997)379-392
[8] L.E.Wangen
and B.R.Kowalski ‘A Multiblock Partial Least Squares Algorithm for Investigating
Complex Chemical Systems’ Journal of Chemometrics 3(1988)3-20
[9] I.E.Frank and
B.R.Kowalski ‘Prediction of Wine Quality and Geographic Origin from Chemical
Measurements by Partial Least-Squares Regression Modeling’ Analytica Chimica
Acta 162(1984)241-251
[10] I.E.Frank
and B.R.Kowalski ‘A Multivariate Method for Relating Groups of Measurements
connected by a Causal Pathway’ Analytica Chimica Acta 167(1985)51-63
The
following references have no direct anchor in the text (yet!), but they nevertheless
contain material related to multi-block algorithms (both theory and
application).
[X] C.A.Andersson
and R.Bro ‘The N-way Toolbox for Matlab’ Chemometrics and Intelligent
Laboratory Systems 52(2000)1-4
[X] G.M.Arnold
and A.A.Williams ‘The use of Generalised Procrustes Techniques in Sensory
Analysis’ in J.R.Piggott ‘Statistical Procedures in Food Research’ 1986
[X] A.K. Smilde,
J.A.Westerhuis and R.Boqué ‘Multiway Multiblock component and covariates
Regression Models’ Journal of Chemometrics 14(2000)301-331
[X] J.F.
MacGregor, Ch.Jaeckle, C.Kiparissides and M.Koutoudi ‘Process Monitoring and
Diagnosis by Multiblock PLS Methods’ AIChE Journal 5(1994)826-838
[X] Th.Kourti,
P.Nomikos and J.F.MacGregor ‘Analysis, Monitoring and Fualt Diagnosis of Batch
Processes using Multiblock and Multiway PLS’ Journal of Process Control
5(1995)277-284
[X] C.Duchesne
and J.F. MacGregor ‘Multivariate Analysis and Optimization of Process Variable
Trajectories for Batch Processes’ Chemometrics and Intelligent Laboratory Systems
51(2000)125-137
[X] A.Berglund
and S.Wold ‘A Serial Extension of Multiblock PLS’ Journal of Chemometrics
13(1999)461-471
[X] G.Chen and
Th.J.MacAvoy ‘Predictive on-line Monitoring of Continuous Processes’ Journal of
Process Control 5-6(1998)409-420
The
authors of this toolbox can obviously not be held responsible for the contents
and consequences of using the material offered in the toolbox. The ideas and
algorithms are taken from open literature, but if there are some – for the
author’s unknown – legal issues that prohibit distribution, please contact us so we can correct
things.
We do not claim to give a historically correct picture of what might be considered ‘Multi-Block methods’ within the definition we use here. The same arguments hold for completeness. The material presented here is what we came across and considered interesting. If you have a different view, if you have ideas or new methods that will complete the picture, if you found blunt stupidities/a subtle incorrectness.