14 Unsupervised models: principal components
Learning Objectives
- Describe principal components analysis, including its strengths and weaknesses
- Describe accounting applications
- Use R packages for principal components analysis
Chapter content
Feature reduction is the process of reducing the number of features in the data. This is often used when there are a large number of correlated features, potentially redundant features, or too many features for computational resources or interpretation. Reducing the number of correlated features has several potential benefits. First, it can reduce the computational burden of fitting models. Second, it can reduce the risk of overfitting. Third, it can make the model easier to interpret.
There are many ways to reduce features. Some models include feature selection as part of their algorithm: for example, linear models might use a stepwise variable selection algorithm, or a LASSO regression might shrink the coefficients of uninformative features to zero, effectively removing them. Unsupervised methods instead rely on the correlations among features to combine them into a smaller set of informative features. This chapter focuses on principal components analysis (PCA) as a tool for reducing a large set of features to a smaller set of informative ones.
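As a brief illustration of the supervised route, the following sketch fits a LASSO with the glmnet package. The simulated data, dimensions, and tuning choice are illustrative assumptions, not part of this chapter's data.

library(glmnet)
set.seed(1)
# Simulated data: 10 candidate features, of which only the first 3 matter
x <- matrix(rnorm(100 * 10), nrow = 100)
y <- x[, 1] + x[, 2] - x[, 3] + rnorm(100)

# Cross-validated LASSO (alpha = 1); the penalty pushes the coefficients
# of uninformative features to exactly zero, removing them from the model
fit <- cv.glmnet(x, y, alpha = 1)
coef(fit, s = "lambda.min")

The printed coefficients should show zeros for most of the uninformative features, which is how LASSO performs feature selection as a by-product of fitting.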
Feature reduction with accounting data
Feature reduction is often used in business settings because business data contain many correlated features. In accounting data, for example, many features relate to sales, costs, and profits, and these features are often highly correlated. Reducing their number can help identify the most important dimensions of the data. Some modeling applications in finance and accounting have also generated large numbers of candidate features, for example for predicting changes in earnings (https://www.sciencedirect.com/science/article/pii/0165410189900177) or stock returns (https://link.springer.com/article/10.1007/s11142-013-9231-1). Principal components analysis can also be used to understand the primary sources of variation in the data.
Principal Components Analysis
PCA is a method that creates new features as linear combinations of the original features. The new features are constructed to capture as much of the variance in the data as possible, and they are uncorrelated with one another. The first principal component is the linear combination of the original features that explains the most variance in the data. The second principal component is the linear combination that explains the most of the variance not explained by the first, and so on.
The non-technical algorithm for PCA is as follows:
1. Find a linear combination of the original features that explains the largest portion of the total variation in the features. Suppose there are three features A, B, and C. The linear combination would be something like the following:

PC1 = w1 × A + w2 × B + w3 × C

where w1, w2, and w3 are weights scaled so that w1^2 + w2^2 + w3^2 = 1. This step tries different weights until it finds the combination that produces the component explaining the most variation in the data.
2. Find another linear combination of the original features that explains the most of the variation not already explained by the first principal component. By construction, the second principal component is uncorrelated with the first.
3. Continue the process, adding principal components up to the number of original features in the data. Each additional component explains less of the total variation than the one before it.
4. Choose a cutoff for the number of principal components to keep (fewer than the number of original features). The cutoff might be based on the amount of variation explained by the components, as illustrated in the sketch below.
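To make these steps concrete, here is a minimal sketch using base R's prcomp() on simulated data. The three features and their names are made up for illustration; the chapter's main example below uses h2o instead.

set.seed(1)
# Simulate three correlated features: B and C are noisy copies of A
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.3)
C <- A + rnorm(100, sd = 0.5)
sim <- data.frame(A, B, C)

# prcomp() finds the weights (loadings) for each component;
# scale. = TRUE standardizes the features first
pc <- prcomp(sim, scale. = TRUE)
pc$rotation   # the weights (w1, w2, w3) for each principal component
summary(pc)   # proportion of total variation explained by each component

Because B and C are built from A, the first component should capture most of the total variation, suggesting that one component could stand in for all three features.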
Example with R
This section demonstrates PCA using a data set with a large number of features used for predicting stock returns. Most of these features are accounting ratios available from annual financial statements. Some variables are identifiers for the company (permno and gvkey) and year (fyear), with the remaining columns being the features for applying PCA. The data set is available as a CSV file here: https://www.dropbox.com/scl/fi/6sze5bcsfv53epla1k54l/AnnualSignals.csv?rlkey=fhyu6cv08z8tm9pfm06m4b7gr&st=27q0jcd3&dl=0.
Assuming that the data has been imported as "df", the following code initializes h2o, moves the data frame to the h2o environment, and runs PCA. The argument k gives the number of components to estimate. The maximum value for k is the number of feature columns; for feature reduction to be useful, the number of components kept should be smaller than the number of original columns.
library(h2o)
library(tidyverse)

# Keep only the feature columns (mve_f through ps), dropping the
# identifier columns (permno, gvkey, fyear)
tmp <- df %>%
  select(mve_f:ps)

# Start h2o and move the data frame into the h2o environment
h2o.init(nthreads = 8)
tmp.h2o <- as.h2o(tmp)

# Estimate the first 15 principal components; if the data contain
# missing values, consider adding impute_missing = TRUE
pca <- h2o.prcomp(training_frame = tmp.h2o,
                  k = 15)
pca

# Compute the component scores for each observation and combine them
# with the original features
pca_scores <- as.data.frame(predict(pca, tmp.h2o))
tmp2 <- cbind(tmp, pca_scores)

# Correlations between the original features and the components;
# pairwise deletion handles any missing values
cm <- tmp2 %>%
  cor(use = "pairwise.complete.obs")
View(cm)

h2o.shutdown()
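One common way to choose the cutoff k is to examine the proportion of variance explained by each component. The following sketch does this outside h2o with base R's prcomp(); na.omit() is used because prcomp() does not accept missing values, so the numbers may differ slightly from the h2o output above.

# Cross-check: proportion of variance explained, using base R
pr <- prcomp(na.omit(tmp), scale. = TRUE)
summary(pr)                     # cumulative proportion of variance by component
screeplot(pr, type = "lines")   # scree plot: look for the "elbow"

A typical rule of thumb is to keep enough components to explain some target share of the total variation (say, 80 to 90 percent) or to stop at the elbow of the scree plot.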
Combining and ordering unsupervised modeling steps
This chapter and the previous chapter covered three unsupervised learning methods: clustering, anomaly detection, and feature reduction. Sometimes all three methods are applied in the order presented; other times a different order or a different combination is appropriate. For example, with a very large number of features, you might first reduce the number of features with PCA before clustering, as in the sketch below. You might start with anomaly detection before trying PCA or clustering. Or you might apply all three and, after learning from one method, return and revise how you use another.
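A minimal sketch of the PCA-then-clustering order might look like the following; the choice of five components and four clusters is an illustrative assumption, not a recommendation.

# Reduce the features with PCA, then cluster on the leading components
features <- na.omit(tmp)               # the feature columns selected above
pr <- prcomp(features, scale. = TRUE)
pc_scores <- pr$x[, 1:5]               # keep the first five components (illustrative)
set.seed(1)
km <- kmeans(pc_scores, centers = 4)   # four clusters (illustrative)
table(km$cluster)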
The two primary purposes of unsupervised learning are to understand the data and to prepare it for supervised learning. Working through unsupervised methods can build a better understanding of the patterns in the data and of the processes that might generate them. This exploration can also lead to improved preprocessing, better choices about which features to include in a supervised model, or ideas for new features.
Tutorial video
Conclusion
Review
Mini-case video
References
https://www.geeksforgeeks.org/principal-component-analysis-pca/
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html