16 Supervised Models: Tree-based methods
Learning Objectives
- Explain classification and regression tree methods
- Explain adaptations of single tree methods including forests, bagging, and boosting
- Describe the purpose of hyperparameter tuning and typical tuning parameters for tree-based models
- Use R packages for tree-based models
Chapter content
The linear and logistic regression models from the last chapter are among the most interpretable models, but they are limited in their ability to capture complex relationships between x and y variables. Machine learning models are designed to capture more complex relationships; however, they are more difficult to interpret and, because of this limited interpretability, are often referred to as black box models. Machine learning models are also computationally intensive, and because they can capture more complex patterns in the data, they are more prone to overfitting.
There are two major families of machine learning models: tree-based models and neural networks. This chapter discusses tree-based models. The next chapter discusses neural network models.
Classification and regression trees
Classification and regression trees are models based on a series of decisions. The model is a tree structure in which each node is a decision based on a feature. The tree is built by choosing, at each node, the feature and split value that best separate the dependent variable into two groups, and then recursively splitting each group until the groups are as homogeneous as possible. The tree is then pruned to avoid overfitting. The pruned tree predicts the dependent variable by following the decisions through the tree structure; the predicted value is the average of the dependent variable in the terminal node (or the most common class for a classification tree).
A very simple tree model is easy to interpret. However, as a tree becomes more complex or as extensions to a simple tree are added, the model becomes difficult to interpret. Tree models are also prone to overfitting, and several methods are commonly used to reduce the chances of overfitting.
A single classification and regression tree algorithm is as follows:
1. Randomly choose a feature to start.
2. Search for a value of the feature that best splits the observations based on the dependent variable, so that observations in the same group are as similar as possible. For a binary dependent variable, this can be the percent of observations with the same outcome. For a continuous dependent variable, this can be how different the observations are from the mean of the group (just like with clustering). The best split value is the one that makes both groups as homogeneous as possible. With a continuous dependent variable, the fit of a split could be defined as (a small numeric illustration follows the algorithm below):
$$SS_g = \sum_{i \in g} (y_i - \bar{y}_g)^2$$
where $y_i$ is the dependent variable and $\bar{y}_g$ is the mean of the dependent variable in group $g$. The total sum of squares for a split is the sum of the squared differences for the two groups.
3. Repeat the process for each feature and choose the feature that best splits the data.
4. Split the data into two groups based on this feature. Observations with values of the feature above the cutoff point go into one group and observations with values below the cutoff point go into the other group.
5. Now for each group, repeat the process of finding the feature that best splits the data. Continue this process until the data is split into groups that are as homogeneous as possible or until a stopping point is reached, such as when the number of observations in a group falls to one or to a chosen cutoff (e.g., 10 observations in a group).
6. The model is then used to predict the dependent variable by following the choices through the tree until the endpoint is reached. The predicted value is the average of the dependent variable in the end group.
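To make the fit measure concrete, here is a small numeric illustration with made-up values (the numbers are not from any dataset in this chapter):
g1 <- c(1.0, 1.2, 0.8)               # dependent variable values in group 1
g2 <- c(3.1, 2.9, 3.0)               # dependent variable values in group 2
ss1 <- sum((g1 - mean(g1))^2)        # squared differences from the group 1 mean
ss2 <- sum((g2 - mean(g2))^2)        # squared differences from the group 2 mean
ss_total <- ss1 + ss2                # total sum of squares for this split (smaller is better)
ss_total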
The entire structure is referred to as the tree. The branches are the decisions and the leaves are the end groups. Tree depth is the number of levels of decisions in the tree, i.e. from the top level to the last level. Textbooks and tutorials are widely available that help make tree methods intuitive (e.g. https://www.youtube.com/watch?v=JcI5E2Ng6r4).
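For readers who want to try a single tree directly, the following is a minimal sketch using the rpart and rpart.plot packages; the data frame dat, dependent variable y, and the hyperparameter values are illustrative assumptions, not the chapter's data.
library(rpart)
library(rpart.plot)
tree_fit <- rpart(y ~ ., data = dat,
                  method = "anova",                       # regression tree; use "class" for classification
                  control = rpart.control(minbucket = 10, # at least 10 observations in each end group
                                          maxdepth = 4))  # limit the tree depth
rpart.plot(tree_fit)              # plot the branches and leaves
predict(tree_fit, newdata = dat)  # prediction is the mean of the leaf an observation falls in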
Problems
There are a number of challenges with tree models. The first is that the model is prone to overfitting, because a tree could be built so that it perfectly fits the training data; this would be the case if the tree were built to have one observation in each ending group. One approach to limiting overfitting is to prune the tree: splits are only kept if the end groups have a certain number of observations and/or if the split improves the fit by some specified amount. Another challenge is that the tree model is sensitive to the path it chooses. For example, splitting first on a less informative variable and then performing subsequent splits could end up fitting the data better than the path actually chosen, so the predictions are sensitive to the chosen path.
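Continuing the hypothetical tree_fit from the sketch above, pruning with rpart might look like the following; the cp value is an illustrative choice.
printcp(tree_fit)                        # cross-validated error at each complexity parameter (cp) value
pruned_fit <- prune(tree_fit, cp = 0.02) # keep only splits that improve fit by at least cp; larger cp prunes more
rpart.plot(pruned_fit)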
Other extensions have been proposed that help reduce overfitting and path sensitivity. The two most common extensions are random forests and gradient boosting.
Extensions
Random forests are a collection of trees where each tree is built on a random sample of the data and a random sample of the features. The random forest model averages the predictions of the trees. Gradient boosting is a collection of trees where each tree is built to correct the errors of the previous trees. Like the random forest model, the gradient boosting model combines the predictions of the trees. There are also extensions of these extensions (extreme gradient boosting, extremely randomized trees, etc.). These extensions of extensions will not be discussed, although they will be included to the extent they are included in AutoML.
Random forests (bagging)
Bagging refers to using random samples from the training data to estimate trees. A random forest is a method that combines multiple trees to make a prediction. Combining the predictions across multiple trees helps to reduce the overfitting of a single tree. Combining trees also helps reduce the effects of path dependency from estimating a single tree. A generic random forest algorithm is as follows (an R sketch follows the steps):
1. Randomly choose a feature to start and randomly choose a subset of the data.
2. Search for a value of the feature that best splits the observations based on the dependent variable, so that observations in the same group are as similar as possible.
3. Split the data into two groups based on this feature. Observations with values of the feature above the cutoff point go into one group and observations with values below the cutoff point go into the other group.
4. Now for each group, repeat the process of randomly choosing a feature and finding the best split for that feature. Continue this process until the data is split into groups that are as homogeneous as possible or until a stopping point is reached, such as when the number of observations in a group falls to one or to a chosen cutoff (e.g., 10 observations in a group).
5. Start a new tree by repeating steps 1-4. Repeat until the chosen number of trees is reached.
6. For each observation, average the predictions for that observation across all trees.
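A minimal sketch of a random forest using the randomForest package is below; as before, the data frame dat, dependent variable y, and the hyperparameter values are illustrative assumptions.
library(randomForest)
rf_fit <- randomForest(y ~ ., data = dat,
                       ntree = 500,      # number of trees, each built on a bootstrap sample of the data
                       mtry = 3,         # number of features randomly sampled at each split
                       nodesize = 10)    # minimum number of observations in an end group
predict(rf_fit, newdata = dat)           # prediction is the average across the 500 trees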
Gradient boosting (boosting)
Boosting combines multiple trees, but the trees are not estimated independently based on random samples and features.
Boosting starts with a single tree, finds observations that are poorly predicted by that tree, and then builds a new tree that weights the poorly predicted observations more heavily. A simple way to understand this is the following adjustment to the fit measure shown previously:
$$SS_g = \sum_{i \in g} w_i (y_i - \bar{y}_g)^2$$
Here a weight $w_i$ is added that places more emphasis on poorly predicted observations. The weight is a function of the error from the previous tree. How well the tree fits the data is also kept as a measure of the fit of the tree. The process is then repeated with a new tree until a stopping point is reached. The predictions are averaged across all trees, with higher weights given to the trees that best fit the data. The gradient part of gradient boosting refers to the method for adjusting the weights and fitting a new tree.
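Continuing the small numeric illustration from earlier, a weighted version of the fit measure might look like this (the values and weights are made up):
y_g <- c(1.0, 1.2, 0.8)                  # dependent variable values in a group
w   <- c(1, 3, 1)                        # more weight on the observation the prior tree predicted poorly
ss_w <- sum(w * (y_g - mean(y_g))^2)     # weighted sum of squared differences
ss_w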
A generic algorithm for boosting is as follows:
1. Estimate a tree to create the base tree.
2. Calculate a weight for each observation based on its prediction error and record the fit of the tree.
3. Estimate a new tree with the weights from the prior tree.
4. Repeat until a stopping criterion is reached.
5. Average the predictions from all trees with higher weights being placed on the trees that best fit the data.
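A minimal sketch of gradient boosting using the gbm package is below; the data frame dat, dependent variable y, and the hyperparameter values are illustrative assumptions.
library(gbm)
boost_fit <- gbm(y ~ ., data = dat,
                 distribution = "gaussian",   # squared-error loss for a continuous outcome
                 n.trees = 500,               # number of boosting iterations (trees)
                 interaction.depth = 3,       # depth of each tree
                 shrinkage = 0.05,            # learning rate
                 n.minobsinnode = 10)         # minimum observations in an end group
predict(boost_fit, newdata = dat, n.trees = 500)  # combined prediction across all 500 trees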
Hyperparameter tuning
Hyperparameters are configuration settings that are external to a machine learning model and define details of how the model is estimated. For example, a hyperparameter in a tree-based method is the maximum depth of the tree, and in a forest-based method the number of trees to build.
Hyperparameter tuning involves iteratively running models with different hyperparameter values to determine which configuration yields the best model fit. Hyperparameter tuning can be done manually through trial and error or can use algorithms to try a wide variety of hyperparameters. Algorithms for hyperparameter tuning include grid search (trying every combination of a pre-specified set of values), random search, and other optimization algorithms.
Hyperparameter tuning can be important because it can affect model fit and predictive performance.
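As a concrete illustration, a manual grid search over two rpart hyperparameters might look like the following sketch; the training and validation data frames (train_dat, valid_dat), dependent variable y, and the grid values are illustrative assumptions.
library(rpart)
grid <- expand.grid(maxdepth = c(2, 4, 6), minbucket = c(5, 10, 20))  # all combinations to try
grid$rmse <- NA
for (i in seq_len(nrow(grid))) {
  fit <- rpart(y ~ ., data = train_dat, method = "anova",
               control = rpart.control(maxdepth = grid$maxdepth[i],
                                       minbucket = grid$minbucket[i]))
  pred <- predict(fit, newdata = valid_dat)
  grid$rmse[i] <- sqrt(mean((valid_dat$y - pred)^2))   # validation error for this setting
}
grid[which.min(grid$rmse), ]   # hyperparameter combination with the best validation fit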
Hyperparameters for tree methods
Some of the most common hyperparameters for tree-based methods are the number of trees to estimate, the maximum depth of each tree, the minimum number of observations in an end group, the learning rate (for boosting), and the number of features sampled at each split (for random forests).
Ensemble methods
The final “model” in this chapter is a combination of models called an ensemble model. Extensions to tree models (e.g., random forests) are examples of specific ensemble models. Ensemble models combine the predictions from multiple models into a single prediction. This can be done by averaging the predictions or by weighting models with better fit more heavily.
Most often, the best model is unknown, or there are concerns about overfitting the training data with a single model. Combining predictions from multiple models can help reduce the risk of overfitting or the risk of choosing the wrong model.
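A minimal sketch of combining predictions from two models is below; the predictions and weights are made-up placeholders standing in for, say, a random forest and a gradient boosting model evaluated on validation data.
pred_rf  <- c(0.10, 0.40, 0.25)   # predictions from one model
pred_gbm <- c(0.12, 0.35, 0.30)   # predictions from a second model
w_rf  <- 0.6                      # weight reflecting the first model's better validation fit
w_gbm <- 0.4
ensemble_pred <- w_rf * pred_rf + w_gbm * pred_gbm   # weighted average of the two sets of predictions
ensemble_pred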
Explainable AI
Example in R
AutoML estimates various tree based methods. AutoML includes algorithms for distributed random forests (DRF), gradient boosting machines (GBM), and extreme gradient boosting (XGBoost). (Note that XGBoost in h2o doesn’t work with Windows. We could use a different package, but here we will stick with h2o and omit XGBoost.)
Each tree-based model has hyperparameters to tune: for example, the number of trees to estimate, the depth of the trees, the minimum number of observations in a group, and the extent to which a tree should be pruned. AutoML handles hyperparameter tuning by treating models with different hyperparameter values as separate candidate models and by doing some of the tuning with cross-validation.
The sample code below uses the data and dependent variables from the prior chapter.
The csv file for this chapter is available here: https://www.dropbox.com/scl/fi/jrrnvo9xeyud863q63cpr/dEarningsPred.csv?rlkey=nqv9in8ukx4xxhf9cr78h0wxp&dl=0. The data is from Compustat (annual financial statement information). The two variables we will be trying to model, i.e. predict, are `DIncr_ld1` and `Incr_ld1`. The first is a binary variable indicating whether year t+1 earnings will be higher than year t earnings. The second is the percentage change in earnings from year t to year t+1, scaled by the company’s market value of equity at the end of fiscal year t. The other features are financial statement analysis variables from the papers linked above. The features have been winsorized and standardized each fiscal year.
Assuming that the data is imported into R as “df”, the following code prepares the h2o environment, creates a training data set, and moves it to the h2o environment.
library(h2o)           # provides h2o.init() and the AutoML interface used below
library(DALEX)         # explainable AI tools used in the prior chapter
library(DALEXtra)
library(tidyverse)
library(parallel)
h2o.init(nthreads = 8)                    # start a local h2o cluster with 8 threads
# Split the data into training (fiscal years before 2010) and test (fiscal years after 2010) sets
train <- df %>%
  filter(fyear < 2010)
test <- df %>%
  filter(fyear > 2010)
rm(df)
# Keep the dependent variable and the feature columns, then move the training data to h2o
tmp <- train %>%
  select(Incr_ld1, CurRat:NiCf)
trn.h2o <- as.h2o(tmp)
# Run AutoML restricted to the tree-based algorithms (DRF and GBM)
mdlres <- h2o.automl(y = "Incr_ld1",
                     training_frame = trn.h2o,
                     include_algos = c("DRF", "GBM"),
                     max_models = 25,
                     seed = 1)
# View the leaderboard of all models and extract the best model
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))
bmdl <- h2o.get_best_model(mdlres)
bmdl
# Variable importance for the best model
vi <- h2o.varimp(bmdl)
vi
h2o.shutdown()                            # shut down the h2o cluster when finished
The code could be extended for binary dependent variables, prediction outputs, and explainable AI as in the last chapter.
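As one hedged illustration of such an extension (it assumes the code above has already run, that h2o.shutdown() has not yet been called, and that the DALEX/DALEXtra workflow from the prior chapter is the intended approach), prediction output on the test data and an explainer could be created as follows:
# Move the test data to h2o and generate predictions and out-of-sample fit statistics
tst.h2o <- as.h2o(test %>% select(Incr_ld1, CurRat:NiCf))
preds <- h2o.predict(bmdl, tst.h2o)             # predictions from the best model on test data
h2o.performance(bmdl, newdata = tst.h2o)        # out-of-sample fit statistics
# Create a DALEX explainer for the h2o model (explain_h2o is from DALEXtra)
expl <- explain_h2o(bmdl,
                    data = test %>% select(CurRat:NiCf),
                    y = test$Incr_ld1,
                    label = "AutoML best model")
model_parts(expl)                               # permutation-based variable importance (DALEX)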
Tutorial video
Conclusion
Review
Mini-case video
References
Biecek, P. (2018). DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84), 1-5.