PCA, PCR, PLS: Calibration Methods in NIR Spectroscopy

The aim of chemometric calibration is to determine from the spectrum which substance is in front of the spectrometer, or in what proportion the substance of interest is present in the sample.

In the previous article, we designed a clean measurement setup, minimized sources of scattering error as far as possible, and processed the recorded spectra with appropriate preprocessing methods.

Thus, the data are now ready for the next step.

Spectroscopy is an indirect measurement method: the measurand of interest, such as the amount of a substance in the sample, is not measured directly. Instead, radiation is sent to the sample, and parts of the radiation, i.e. individual wavelengths, interact with the sample. The spectrometer then detects the radiation and reports, for each wavelength, how the intensity differs before and after interaction with the sample.

However, this really is all we have at this point: knowledge of which wavelengths have changed in intensity after absorption by the sample. We still need to correlate certain absorption patterns with known substance concentrations, so that in the future we can infer concentrations from the measured radiation alone. This step is called calibration in NIR spectroscopy.

Different algorithmic approaches to calibration, especially in the NIR range, have been developed over the decades. The best-known variants are Principal Component Analysis (PCA), Principal Component Regression (PCR), and Partial Least Squares Regression (PLS or PLSR).

Principal Component Analysis (PCA)

The essential achievement of Principal Component Analysis is the reduction of the many variables of a spectrum (each wavelength is a variable) to a few new variables (the Principal Components), which can nevertheless explain almost the entire variance in the data. The main disadvantage of this approach is that our known chemical concentrations are not included in this reduction process. Whether the principal components found correspond to our substances of interest depends heavily on whether these substances clearly lead to absorption in the NIR range. However, it is also possible that the algorithm zeroes in on properties of the sample that are of no interest for the application (e.g. moisture, which is often encountered in the NIR band).

How does Principal Component Analysis work?

First, the variables ("dimensions", in this case the intensities per detected wavelength) are mean-centered: for each variable, its mean value across all samples is subtracted, so that in the end all values are centered around 0. If this step did not take place, the first principal component would mostly describe the average spectrum rather than the differences between the samples.
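As a minimal sketch of this centering step, assuming the spectra are stored as a NumPy array with one row per sample and one column per wavelength (the toy values and variable names are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 4 spectra (rows) x 3 wavelengths (columns).
spectra = np.array([
    [0.50, 0.80, 0.30],
    [0.55, 0.90, 0.28],
    [0.45, 0.70, 0.33],
    [0.50, 0.80, 0.29],
])

# Mean intensity per wavelength, taken over all samples.
mean_spectrum = spectra.mean(axis=0)

# Subtracting it centers every wavelength (column) around 0.
centered = spectra - mean_spectrum

print(centered.mean(axis=0))
```

The mean spectrum has to be kept, because spectra measured later must be centered with the same values before the calibration is applied to them.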

Now a covariance matrix is created. This is a square, symmetric matrix: its length and width are equal and correspond to the number of wavelengths detectable by the spectrometer, and the entry for wavelengths i and j equals the entry for j and i. Now we look at which dimension correlates with which other, in other words, how large each off-diagonal value in the matrix is. If two wavelengths correlate strongly, they carry redundant information that can be reduced with virtually no loss.
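The covariance matrix can be sketched in a few lines. The data below are synthetic and purely illustrative: wavelength 1 is constructed as a near copy of wavelength 0, so the two should show up as strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20

# Synthetic spectra with 5 wavelengths (assumption for illustration only).
shared_signal = rng.normal(size=n_samples)
spectra = np.column_stack([
    shared_signal,                                          # wavelength 0
    2.0 * shared_signal + 0.01 * rng.normal(size=n_samples),  # wavelength 1
    rng.normal(size=(n_samples, 3)),                        # wavelengths 2-4
])

centered = spectra - spectra.mean(axis=0)

# Covariance matrix: one row and one column per wavelength.
covariance = centered.T @ centered / (n_samples - 1)

# Correlation is the covariance rescaled to the range -1..1.
correlation = np.corrcoef(spectra, rowvar=False)

print(covariance.shape)
print(round(correlation[0, 1], 3))
```

A correlation close to 1 between two wavelengths means that one of them adds hardly any information beyond the other.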

The first principal component is now determined as the direction through the data (a line in the space spanned by the wavelengths) along which the samples show the largest possible variance. The second component should not contain any of the information already captured by the first component; it is therefore orthogonal to the first component and, under that constraint, again captures as much of the remaining variance as possible.
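One common way to obtain these directions is an eigendecomposition of the covariance matrix; the eigenvectors are the principal components and the eigenvalues are the variances along them. The following is a sketch on synthetic data with one dominant direction, not a production implementation (libraries usually use a singular value decomposition instead).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 50 samples, 4 wavelengths, one dominant pattern plus noise.
latent = rng.normal(size=(50, 1))
pattern = np.array([[1.0, 0.8, 0.6, 0.1]])
spectra = latent @ pattern + 0.05 * rng.normal(size=(50, 4))

centered = spectra - spectra.mean(axis=0)
covariance = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pc1 = eigenvectors[:, 0]   # direction of largest variance
pc2 = eigenvectors[:, 1]   # orthogonal to pc1

print(np.round(pc1, 3))
print(round(float(pc1 @ pc2), 12))
```

The sign of an eigenvector is arbitrary, so `pc1` may point in either direction along the same line.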

In principle, as many principal components can be created as the original data set contains dimensions, i.e. wavelengths (with fewer samples than wavelengths, at most as many components as there are samples minus one carry any variance). As a rule, however, one limits oneself in the evaluation to the first 1-5 components. The reason for this is the independence of the components just described: if the first component captures e.g. 80% of the variance of the measurements, then all following components together can explain only the remaining 20%. The variance described by the individual components does not overlap, otherwise they would not be independent of each other. So, in this example, the second principal component might describe 12% of the total variance, leaving only 8% for the sum of all other principal components.

Each calculated principal component represents a mixture of the originally recorded individual wavelengths. The components are less directly interpretable than the original wavelengths, because chemistry plays no part in how they are assembled. The only criterion is that each component captures the maximum variance not yet captured by its predecessors; no other information is processed. In other words, PCA knows nothing about the actual substances we want to analyze.
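The share of variance per component can be computed from the singular values of the centered data. The sketch below uses synthetic spectra with two underlying factors; the 95% threshold for choosing the number of components is an arbitrary assumption for illustration, not a rule from this article.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_wavelengths = 30, 8

# Synthetic spectra: a strong factor, a weaker factor, and a little noise.
factor_1 = rng.normal(scale=3.0, size=(n_samples, 1)) @ rng.normal(size=(1, n_wavelengths))
factor_2 = rng.normal(scale=1.0, size=(n_samples, 1)) @ rng.normal(size=(1, n_wavelengths))
noise = rng.normal(scale=0.1, size=(n_samples, n_wavelengths))
spectra = factor_1 + factor_2 + noise

centered = spectra - spectra.mean(axis=0)

# Squared singular values are proportional to the variance per component.
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(explained, 3))
print(n_keep)
```

The shares add up to 1 and never increase from one component to the next, which is why the later components contribute so little.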

Accordingly, the analyst has to figure out for herself whether the principal components found correspond to the properties of interest in the sample. Often the principal components are used as the axes of a diagram (a score plot), in which the individual samples are then located. Distinct sample clusters often show up in this way, which can then become the basis for a qualitative evaluation of samples.
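The coordinates of the samples in such a diagram are called scores: each centered spectrum is projected onto the selected principal components. A minimal sketch with two made-up sample groups, one of which has an additional absorption band (all values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_group, n_wavelengths = 10, 8

base = np.linspace(0.2, 0.6, n_wavelengths)
extra_band = np.zeros(n_wavelengths)
extra_band[2:5] = 0.5                      # band present only in group B

group_a = base + rng.normal(scale=0.02, size=(n_per_group, n_wavelengths))
group_b = base + extra_band + rng.normal(scale=0.02, size=(n_per_group, n_wavelengths))
spectra = np.vstack([group_a, group_b])

centered = spectra - spectra.mean(axis=0)

# Rows of vt are the principal components, ordered by explained variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Scores: coordinates of each sample along PC1 and PC2.
scores = centered @ vt[:2].T
scores_a, scores_b = scores[:n_per_group], scores[n_per_group:]

print(np.round(scores_a[:, 0], 2))
print(np.round(scores_b[:, 0], 2))
```

Plotting the first column of `scores` against the second gives the score plot; in this toy case the two groups separate along the first axis.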

However, it is not always sufficient to recognize only the presence of a substance. Often the interest is in its quantity. This requires an evaluation method that also uses our known sample compositions for calibration.

Principal Component Regression (PCR)

One such method is Principal Component Regression. PCR combines a Principal Component Analysis with a linear regression.

First, a PCA is performed and the principal components are calculated. The computationally essential part is thus identical to Principal Component Analysis. The relevant principal components are retained (as in PCA, relevant usually means those that explain most of the variance); the rest are discarded.

In contrast to PCA, however, a link to the known compositions of the samples is now established. This information was obtained in advance, e.g. by chemical analysis. For this purpose, a linear regression of the sample compositions on the scores of the selected principal components is performed.

The coefficient vector obtained in this way has only as many dimensions as there are selected principal components, which is fewer than the number of wavelengths arriving from the spectrometer. Therefore, the vector is finally transformed back into the original wavelength space, so that it contains one coefficient per wavelength and can be applied directly to a measured spectrum.
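The PCR steps described above can be sketched as follows. The data are synthetic (two constituents with random "pure spectra"), and the function name `fit_pcr` is a hypothetical helper for this illustration, not part of any library.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Sketch of PCR: PCA on X, regression on the scores, back-transformation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step 1: PCA via SVD; keep the first n_components loadings.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = vt[:n_components].T                  # (wavelengths, components)
    scores = Xc @ loadings                          # (samples, components)
    # Step 2: linear regression of the reference values on the scores.
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    # Step 3: transform back to one coefficient per wavelength.
    beta = loadings @ gamma
    return x_mean, y_mean, beta, scores, gamma

rng = np.random.default_rng(4)
n_samples, n_wavelengths = 30, 50
concentrations = rng.uniform(0, 1, size=(n_samples, 2))
pure_spectra = np.abs(rng.normal(size=(2, n_wavelengths)))
X = concentrations @ pure_spectra + 0.01 * rng.normal(size=(n_samples, n_wavelengths))
y = concentrations[:, 0]                            # analyte of interest

x_mean, y_mean, beta, scores, gamma = fit_pcr(X, y, n_components=2)

# Prediction works directly on a spectrum thanks to the back-transformation.
y_pred = (X - x_mean) @ beta + y_mean
rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))
print(beta.shape, round(rmse, 4))
```

Predicting via the scores and `gamma` or via the spectra and `beta` gives the same result; the back-transformed form is simply more convenient to apply to new measurements.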

Because the Principal Component Regression gains access to the previously known sample compositions via linear regression, a calibration obtained by PCR can also make direct quantitative statements about future samples. With a pure PCA, this would require manually checking whether any of the principal components corresponds to any of the analytes in the sample.

A potential disadvantage of PCR is that the known compositions enter the calculation only at a late stage: the principal components have already been created and are merely weighted differently in the subsequent steps. It may well be that the principal components do not correspond directly to individual analytes, so that the resulting calibration does not achieve the precision that would in principle be possible with the available data.

Not least for this reason, there are many other calibration approaches. The best known among them is called Partial Least Squares Regression.

Partial Least Squares Regression (PLS)

In PLS, the dependent variables, i.e. the constituents of the sample previously determined by other means, are used from the beginning. Partial Least Squares and PCA/PCR are structurally similar. However, while PCA and PCR reduce the dimensions of the independent variables (i.e., wavelengths) by finding the directions of maximum variance (i.e., combining strongly correlated wavelengths into a few components), PLS maximizes the covariance between independent and dependent variables. In other words, Partial Least Squares attempts to directly find relationships between wavelengths and sample constituents.

Components are also calculated in Partial Least Squares regression in order to reduce the dimensionality of the data sets. They are referred to as PLS components or as latent variables.
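One classical way to compute such latent variables for a single dependent variable is a NIPALS-style PLS1 loop. The sketch below is one possible formulation on synthetic data; the function name `fit_pls1` is a hypothetical helper, and real libraries differ in details such as scaling.

```python
import numpy as np

def fit_pls1(X, y, n_components):
    """Sketch of PLS1: extract latent variables one by one, deflating X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_res, yc = X - x_mean, y - y_mean
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = X_res.T @ yc                 # weights: covariance of each wavelength with y
        w /= np.linalg.norm(w)
        t = X_res @ w                    # scores of the new latent variable
        tt = t @ t
        p = X_res.T @ t / tt             # loadings of X on the scores
        q.append(yc @ t / tt)            # regression of y on the scores
        X_res = X_res - np.outer(t, p)   # remove what this component explains
        W.append(w); P.append(p); T.append(t)
    W, P, T, q = np.column_stack(W), np.column_stack(P), np.column_stack(T), np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # one coefficient per wavelength
    return x_mean, y_mean, beta, T, q

rng = np.random.default_rng(5)
n_samples, n_wavelengths = 30, 50
concentrations = rng.uniform(0, 1, size=(n_samples, 2))
pure_spectra = np.abs(rng.normal(size=(2, n_wavelengths)))
X = concentrations @ pure_spectra + 0.01 * rng.normal(size=(n_samples, n_wavelengths))
y = concentrations[:, 0]

x_mean, y_mean, beta, T, q = fit_pls1(X, y, n_components=2)
y_pred = (X - x_mean) @ beta + y_mean
rmse = float(np.sqrt(np.mean((y - y_pred) ** 2)))
print(T.shape, round(rmse, 4))
```

The columns of `T` are the latent variables; like principal component scores, they are mutually orthogonal.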

PCA determines the weights of individual wavelengths in the principal components based solely on the variance they can explain. Whether a different wavelength would better predict the composition of the sample is irrelevant to it.

In PLS, on the other hand, individual wavelengths receive stronger weights than others based on their association ("covariance") with the known analytes in the samples.
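The difference in weighting can be illustrated with a toy case in which an interferent (here labeled moisture) causes much larger spectral changes than the analyte. Only the first weight vector of each method is compared; the bands, gains, and variable names are assumptions made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_wavelengths = 40, 10

analyte = rng.uniform(0, 1, n_samples)       # known reference values
moisture = rng.uniform(0, 1, n_samples)      # interferent, not of interest

analyte_band = np.zeros(n_wavelengths)
analyte_band[1:3] = 1.0                      # weak absorption band
moisture_band = np.zeros(n_wavelengths)
moisture_band[6:9] = 5.0                     # strong absorption band

spectra = (np.outer(analyte, analyte_band)
           + np.outer(moisture, moisture_band)
           + 0.01 * rng.normal(size=(n_samples, n_wavelengths)))

Xc = spectra - spectra.mean(axis=0)
yc = analyte - analyte.mean()

# PCA: first weight vector = direction of maximum variance in X alone.
pca_weights = np.linalg.svd(Xc, full_matrices=False)[2][0]

# PLS: first weight vector = covariance of each wavelength with y.
pls_weights = Xc.T @ yc
pls_weights /= np.linalg.norm(pls_weights)

corr_pca = np.corrcoef(Xc @ pca_weights, yc)[0, 1]
corr_pls = np.corrcoef(Xc @ pls_weights, yc)[0, 1]

print(np.round(np.abs(pca_weights), 2))
print(np.round(np.abs(pls_weights), 2))
print(round(abs(corr_pca), 3), round(abs(corr_pls), 3))
```

In a setup like this, the PCA weights concentrate on the strong moisture band, while the PLS weights also pick up the analyte band. Because strong bands still influence the covariance, a single PLS component is rarely enough in practice.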

Partial Least Squares Regression is considered particularly suitable for data sets in which the number of dimensions exceeds the number of samples. A data set with 100 spectra, each covering 256 wavelengths, would therefore be a good candidate for PLS regression.
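One way to see why such data sets are awkward for an ordinary least-squares regression on all wavelengths (a side note, not spelled out in the text above): with fewer samples than wavelengths, the data cannot pin down one coefficient per wavelength. The sketch uses random numbers in place of real spectra.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_wavelengths = 100, 256

# Random stand-in for 100 spectra with 256 wavelengths each.
spectra = rng.normal(size=(n_samples, n_wavelengths))
centered = spectra - spectra.mean(axis=0)

# The rank counts the independent directions actually present in the data.
# It cannot exceed the number of samples (after centering, n_samples - 1).
rank = int(np.linalg.matrix_rank(centered))

print(rank, "independent directions for", n_wavelengths, "coefficients")
```

Reducing the spectra to a handful of latent variables first, as PCR and PLS do, sidesteps this problem because the regression then involves only a few variables.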

Numerous variants of Partial Least Squares Regression now exist that attempt to optimize various aspects of the calibration process, mainly either the computational cost or the precision of the result. PLS-DA (for Discriminant Analysis) is a special case: it provides qualitative rather than quantitative information about the samples. With this calibration method, the presence or absence of a substance can be determined, but not the amount of the substance present. Similar to PCA, PLS-DA is therefore particularly suitable for classification tasks.
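The basic idea of PLS-DA can be sketched by coding the class membership as 0 (substance absent) and 1 (substance present), running a PLS regression on these labels, and thresholding the prediction. The example below uses a single latent variable, a threshold of 0.5, and synthetic spectra; all of these are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n_per_class, n_wavelengths = 10, 8

base = np.linspace(0.2, 0.6, n_wavelengths)
substance_band = np.zeros(n_wavelengths)
substance_band[3:6] = 0.4

absent = base + rng.normal(scale=0.02, size=(n_per_class, n_wavelengths))
present = base + substance_band + rng.normal(scale=0.02, size=(n_per_class, n_wavelengths))

X = np.vstack([absent, present])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # class labels

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# One PLS latent variable: weights from the covariance with the class labels.
w = Xc.T @ yc
w /= np.linalg.norm(w)
t = Xc @ w
q = (t @ yc) / (t @ t)

# Continuous prediction, then a decision by threshold.
y_hat = t * q + y_mean
predicted_class = (y_hat > 0.5).astype(int)

print(np.round(y_hat, 2))
print(predicted_class)
```

The continuous value `y_hat` is only used to decide between the classes; it is not an estimate of the amount of substance.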

Is there a best calibration method?

The nature of the problem at hand determines which methods might fit in principle: if it is a classification problem, then PCA and PLS-DA are candidates. If quantities are to be determined, then PCR and PLS are options.

Which of the calibration methods is best then has to be determined in practice, based on the prediction error for samples that were not used to build the calibration.
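A common way to estimate the prediction error is cross-validation: part of the samples is held back, the calibration is built on the rest, and the root mean squared error (RMSE) is computed on the held-back samples. The sketch below compares PCR models with one and two components on synthetic data; the helper names and the choice of five folds are assumptions for illustration.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """PCR sketch: PCA via SVD, regression on scores, back-transformation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = vt[:n_components].T
    gamma, *_ = np.linalg.lstsq(Xc @ loadings, yc, rcond=None)
    return x_mean, y_mean, loadings @ gamma

def predict(model, X):
    x_mean, y_mean, beta = model
    return (X - x_mean) @ beta + y_mean

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cross_validated_rmse(X, y, n_components, n_folds=5):
    """Each sample is predicted by a model that never saw it."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    predictions = np.empty_like(y)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = fit_pcr(X[train_mask], y[train_mask], n_components)
        predictions[test_idx] = predict(model, X[test_idx])
    return rmse(y, predictions)

rng = np.random.default_rng(9)
n_samples, n_wavelengths = 30, 50
concentrations = rng.uniform(0, 1, size=(n_samples, 2))
pure_spectra = np.abs(rng.normal(size=(2, n_wavelengths)))
X = concentrations @ pure_spectra + 0.01 * rng.normal(size=(n_samples, n_wavelengths))
y = concentrations[:, 0]

rmse_by_k = {k: cross_validated_rmse(X, y, k) for k in (1, 2)}
print({k: round(v, 4) for k, v in rmse_by_k.items()})
```

The same loop can be run with any other calibration method in place of `fit_pcr`, which makes the cross-validated error a common yardstick for comparing methods and numbers of components.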