14 Unsupervised models: principal components
Learning Objectives
- Describe principal components analysis, including its strengths and weaknesses
- Describe accounting applications
- Use R packages for principal components analysis
Chapter content
Feature reduction is the process of reducing the number of features in the data. This is often used when there are a large number of correlated features, potentially redundant features, or too many features for computational resources or interpretation. Reducing the number of correlated features has several potential benefits. First, it can reduce the computational burden of fitting models. Second, it can reduce the risk of overfitting. Third, it can make the model easier to interpret.
There are many ways to reduce features. Some models include feature selection as part of their algorithm: for example, linear models might use a stepwise variable selection algorithm, or a LASSO regression might shrink the coefficients of uninformative features to zero, effectively removing them. Unsupervised methods instead rely on the correlations among features to combine them into a smaller set of informative features. This chapter focuses on principal components analysis (PCA) as a tool for reducing a large set of features to a smaller set of informative ones.
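As a brief illustration of the supervised route, the following sketch fits a LASSO with the glmnet package. The simulated data, dimensions, and tuning choice are illustrative assumptions, not part of this chapter's data.

library(glmnet)
set.seed(1)
# Simulated data: 10 candidate features, of which only the first 3 matter
x <- matrix(rnorm(100 * 10), nrow = 100)
y <- x[, 1] + x[, 2] - x[, 3] + rnorm(100)

# Cross-validated LASSO (alpha = 1); the penalty pushes the coefficients
# of uninformative features to exactly zero, removing them from the model
fit <- cv.glmnet(x, y, alpha = 1)
coef(fit, s = "lambda.min")

The printed coefficients should show zeros for most of the uninformative features, which is how LASSO performs feature selection as a by-product of fitting.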
Feature reduction with accounting data
Feature reduction is often used in business settings because business data contain many correlated features. In accounting data, for example, many features relate to sales, costs, and profits, and these features are often highly correlated. Reducing their number can help identify the most important dimensions of the data. Some modeling applications in finance and accounting have also generated large numbers of candidate features, for example for predicting changes in earnings (https://www.sciencedirect.com/science/article/pii/0165410189900177) or stock returns (https://link.springer.com/article/10.1007/s11142-013-9231-1). Principal components analysis can also be used to understand the primary sources of variation in the data.
Principal Components Analysis
PCA is a method that creates new features as linear combinations of the original features. The new features are constructed to capture as much of the variance in the data as possible, and they are uncorrelated with one another. The first principal component is the linear combination of the original features that explains the most variance in the data. The second principal component is the linear combination that explains the most of the variance not explained by the first, and so on.
The non-technical algorithm for PCA is as follows:
1. Find a linear combination of the original features that explains the largest portion of the total variation in the features. Suppose there are three features A, B, and C. The linear combination would be something like the following:

PC1 = w1 × A + w2 × B + w3 × C

where w1, w2, and w3 are weights scaled so that w1^2 + w2^2 + w3^2 = 1. This step tries different weights until it finds the combination that produces the component explaining the most variation in the data.
2. Find another linear combination of the original features that explains the most of the variation not already explained by the first principal component. By construction, the second principal component is uncorrelated with the first.
3. Continue the process, adding principal components up to the number of original features in the data. Each additional component explains less of the total variation than the one before it.
4. Choose a cutoff for the number of principal components to keep (fewer than the number of original features). The cutoff might be based on the amount of variation explained by the components, as illustrated in the sketch below.
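To make these steps concrete, here is a minimal sketch using base R's prcomp() on simulated data. The three features and their names are made up for illustration; the chapter's main example below uses h2o instead.

set.seed(1)
# Simulate three correlated features: B and C are noisy copies of A
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.3)
C <- A + rnorm(100, sd = 0.5)
sim <- data.frame(A, B, C)

# prcomp() finds the weights (loadings) for each component;
# scale. = TRUE standardizes the features first
pc <- prcomp(sim, scale. = TRUE)
pc$rotation   # the weights (w1, w2, w3) for each principal component
summary(pc)   # proportion of total variation explained by each component

Because B and C are built from A, the first component should capture most of the total variation, suggesting that one component could stand in for all three features.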
Example with R
This section demonstrates PCA using a data set with a large number of features used for predicting stock returns. Most of these features are accounting ratios available from annual financial statements. Some variables are identifiers for the company (permno and gvkey) and year (fyear), with the remaining columns being the features for applying PCA. The data set is available as a CSV file here: https://www.dropbox.com/scl/fi/6sze5bcsfv53epla1k54l/AnnualSignals.csv?rlkey=fhyu6cv08z8tm9pfm06m4b7gr&st=27q0jcd3&dl=0.
Assuming that the data has been imported as "df", the following code initializes h2o, moves the data frame to the h2o environment, and runs PCA. The argument k gives the number of components to estimate. The maximum value for k is the number of feature columns; for feature reduction to be useful, the number of components kept should be smaller than the number of original columns.
library(h2o)
library(tidyverse)

# Keep only the feature columns (mve_f through ps), dropping the
# identifier columns (permno, gvkey, fyear)
tmp <- df %>%
  select(mve_f:ps)

# Start h2o and move the data frame into the h2o environment
h2o.init(nthreads = 8)
tmp.h2o <- as.h2o(tmp)

# Estimate the first 15 principal components; if the data contain
# missing values, consider adding impute_missing = TRUE
pca <- h2o.prcomp(training_frame = tmp.h2o,
                  k = 15)
pca

# Compute the component scores for each observation and combine them
# with the original features
pca_scores <- as.data.frame(predict(pca, tmp.h2o))
tmp2 <- cbind(tmp, pca_scores)

# Correlations between the original features and the components;
# pairwise deletion handles any missing values
cm <- tmp2 %>%
  cor(use = "pairwise.complete.obs")
View(cm)

h2o.shutdown()
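One common way to choose the cutoff k is to examine the proportion of variance explained by each component. The following sketch does this outside h2o with base R's prcomp(); na.omit() is used because prcomp() does not accept missing values, so the numbers may differ slightly from the h2o output above.

# Cross-check: proportion of variance explained, using base R
pr <- prcomp(na.omit(tmp), scale. = TRUE)
summary(pr)                     # cumulative proportion of variance by component
screeplot(pr, type = "lines")   # scree plot: look for the "elbow"

A typical rule of thumb is to keep enough components to explain some target share of the total variation (say, 80 to 90 percent) or to stop at the elbow of the scree plot.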
Combining and ordering unsupervised modeling steps
This chapter and the previous chapter covered three unsupervised learning methods: clustering, anomaly detection, and feature reduction. Sometimes all three methods are applied in the order presented; other times a different order or a different combination is appropriate. For example, with a very large number of features, you might first reduce the number of features with PCA before clustering, as in the sketch below. You might start with anomaly detection before trying PCA or clustering. Or you might apply all three and, after learning from one method, return and revise how you use another.
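A minimal sketch of the PCA-then-clustering order might look like the following; the choice of five components and four clusters is an illustrative assumption, not a recommendation.

# Reduce the features with PCA, then cluster on the leading components
features <- na.omit(tmp)               # the feature columns selected above
pr <- prcomp(features, scale. = TRUE)
pc_scores <- pr$x[, 1:5]               # keep the first five components (illustrative)
set.seed(1)
km <- kmeans(pc_scores, centers = 4)   # four clusters (illustrative)
table(km$cluster)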
The two primary purposes of unsupervised learning are to understand the data and to prepare it for supervised learning. Working through unsupervised methods can build a better understanding of the patterns in the data and of the processes that might generate them. This exploration can also lead to improved preprocessing, better choices about which features to include in a supervised model, or ideas for new features.
Tutorial video
Conclusion
Review
Mini-case video
References
https://www.geeksforgeeks.org/principal-component-analysis-pca/
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html