"

Chapter 17: Supervised Models I

Learning outcomes

At the end of this chapter, you should be able to

  • Explain linear regression methods and interpretation
  • Explain non-linear adjustments to linear regression
  • Explain linear probability models and probit models
  • Describe accounting applications of regression models
  • Use R packages to estimate and interpret regression models

Chapter content

Note: include practice and knowledge checks throughout chapter

# Supervised models

The primary purpose of this chapter is to provide more details on supervised models so that you can understand and interpret them and apply them in different ways. This chapter will emphasize the conceptual understanding of the models with some limited practical applications. We will then return to AutoML as the primary method we will use for modeling. A complete understanding of every method, even the ones we will discuss in this chapter, is not possible in a single chapter. Therefore, the goal is to provide enough information to get you started and to provide a starting point for further study.

## Additional discussion of pitfalls in modeling

In the chapter introducing modeling, we briefly introduced some pitfalls that can limit the success of modeling. Here we discuss some specific pitfalls in more detail.

### Survivorship/sample selection

Analyzing a sample of data that is not representative of the population for which decisions are made creates a class of problems. These problems occur when a sample has characteristics that differ from the population.

Some sample biases are easy to spot. For instance, analyzing data from service oriented companies with no inventory to make better inventory decisions at retail companies with large product inventories is unlikely to be useful because the sample is biased towards companies that do not represent the population of interest. Other biases are less apparent and require careful consideration and critical thinking to detect.

Survivorship bias occurs when sample data is available only for observations that survive some condition. For example, a sample of large public companies like the S&P 500 that have survived competitive threats and have a history of successful products and services is not likely to be representative of all business ventures from inception. Data analysis that is informative for start-up companies might therefore require a sample that includes business ventures that are much smaller and those that have failed as well as succeeded. Survivorship bias is a type of sample selection bias in which data for survivors are included exclusively or more heavily than data for non-survivors.

Sample selection bias might also apply to circumstances that do not directly correspond to surviving some fixed event. For example, analyzing which managers commit fraud using a sample of managers that have been caught committing fraud omits managers that have committed fraud but have not been caught doing so.

### Look ahead bias

Look ahead bias is created when using historical data to estimate a model that will be used to predict future data. For example, an auditor issuing a going concern opinion could estimate a model to predict bankruptcy. The variable to be predicted is the probability of bankruptcy, and the auditor uses observable data and the model to estimate that probability. If the model includes information not yet available to the auditor at the time of the prediction, then the model has look ahead bias. The model can include such information because it is estimated using data from prior bankruptcies and non-bankruptcies. Perhaps the auditor believes that bankruptcies cluster within industries and therefore includes a variable that is the percent of companies in the industry that go bankrupt at the same time as the bankruptcy being forecasted. The reasoning seems sensible. The problem is that, when predicting the probability of bankruptcy, the auditor does not know the percent of companies that will subsequently go bankrupt. This variable is only available in historical data where the outcome has already been observed.

For this bias, the proof is in the application. Attempting to apply the model to data for which the predicted variable cannot yet be observed will reveal whether the other data necessary for the prediction are actually available.

### Confirmation bias/ p-hacking

Confirmation bias is a cognitive bias that is related to a set of practices referred to as p-hacking, data-mining, or similar terminology. Confirmation bias refers to the tendency to seek evidence that supports a predetermined conclusion while ignoring or failing to seek evidence that contradicts it. This bias leads to decisions that are not based on an accurate or complete understanding of the data.

For example, if a manager has already decided to pursue a project and then collects analyses that support the decision while ignoring analyses that do not, the manager is exhibiting confirmation bias. Importantly, confirmation bias is a normal human cognitive behavior that those doing data analysis have to protect themselves and others from.

In research this bias can take the form of repeating data steps and analyses in search of a desired result (p-hacking). Because data contains random variation, there is some probability that with a few specific choices a desired result may occur. This p-hacking practice has raised questions about the reliability of research findings.

### Correlation is not causation

Correlation is a statistical property of how two (or more) variables move together. Two variables can move together because one variable causes the other. For example, dropping a ball from a greater height causes the ball to bounce higher. Two variables could also move together because a third variable causes them to move together. The third variable may be part of the causal chain relating the two variables; for example, the force the ground exerts on the ball when it hits causes the ball to bounce. In this case dropping the ball might still be said to cause the ball to bounce even though it is not the proximate cause. Alternatively, the third variable may not be related to the causal chain, in which case the correlation between the first two variables is driven entirely by the spurious correlated variable. Finally, two variables may be correlated purely by chance. Such correlations are spurious and imply no form of causation.

### Overfitting

Overfitting refers to applying correlations or other patterns from one data set to another when the correlations within the first data set are unique to that data set. Overfitting often occurs when a complex combination of variables, non-linear correlations, or a complex model finds patterns in a data set that represent idiosyncrasies, unusual occurrences, and random variation that are unlikely to be repeated in other data sets. Highly complex models such as occur with machine learning models are prone to overfitting. Careful best practices help to reduce concerns with overfitting.

### Important principles to avoid these pitfalls

Some practices help limit the impact of these problems in modeling. These include:

* Be careful and think critically before analyzing data.
* Only work with training data that is representative of the population of interest.
* Keep testing data separate and use it only once, after all model development is complete.

### References

* Some interesting and intuitive discussions of data biases can be found in Ellenberg, J. (2015). *How not to be wrong: The power of mathematical thinking*. Penguin.
* Important concerns about how p-hacking influences publication. Ioannidis, J. P. (2005). Why most published research findings are false. *PLoS medicine*, *2*(8), e124.
* Correlation is not causation. <http://www.tylervigen.com/spurious-correlations>.

## Model types

We will discuss the following model types in this chapter:

* Linear regression
* Logistic regression
* Classification and regression trees (and extensions – random forests, gradient boosting)
* Neural networks
* Ensemble methods

After discussing these models and introducing some basic practice with them, we will return to AutoML as the primary method we will use for modeling. Note that for most of the methods we will use, there are tuning parameters that can be adjusted to search for the best fitting models. Other than a cursory introduction, we will not spend time on these tuning parameters. The purpose of this chapter is to develop a conceptual understanding of the models.

## Chapter data

The data for this chapter come from research that takes a shareholder perspective and tries to predict differences across companies in the probability of future earnings increases. We will also expand this to predict the magnitude of future earnings changes.

The most related papers from which the data is created are linked here: [Ou and Penman 1989](https://www.sciencedirect.com/science/article/pii/0165410189900177),
[Chen et al 2022](https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12429).

```{r, echo=FALSE, eval=FALSE}

source("C:/Users/jgreen/Dropbox/Research/Tools/functions.R")

# WRDS credentials are read from an environment variable rather than hard-coded
wrds <- dbConnect(Postgres(), host = 'wrds-pgdata.wharton.upenn.edu', port = 9737,
                  user = "jrg", password = Sys.getenv("WRDS_PASSWORD"),
                  dbname = 'wrds', sslmode = 'require')
sql <- paste("SELECT *",
             "FROM comp.funda WHERE indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'")
res <- dbSendQuery(wrds, sql)
temp <- dbFetch(res, n = -1)

infzero <- function(x){
  x[!is.finite(x)] <- 0
  x[is.nan(x)] <- 0
  x
}

temp2 <- temp %>%
  arrange(gvkey, datadate) %>%
  distinct(gvkey, datadate, .keep_all = TRUE) %>%
  mutate(across(where(is.numeric), miss)) %>%
  mutate(
    DIncr_ld1 = ifelse(lead(ni,1) > ni, 1, 0),
    Incr_ld1 = (lead(ib,1) - ib)/abs(prcc_f*csho),
    CurRat = act/lct,
    dCurRat = (CurRat - lag(CurRat,1))/lag(CurRat,1),
    QuiRat = (che + rect)/lct,
    dQuiRat = (QuiRat - lag(QuiRat,1))/lag(QuiRat,1),
    DaysRec = ((rect + lag(rect,1))/2)/sale * 365,
    dDaysRect = (DaysRec - lag(DaysRec,1))/lag(DaysRec,1),
    InvTurn = cogs/((invt + lag(invt,1))/2),
    dInvTurn = (InvTurn - lag(InvTurn,1))/lag(InvTurn,1),
    InvRat = invt/at,
    dInvRat = (InvRat - lag(InvRat,1))/lag(InvRat,1),
    dInv = (invt - lag(invt,1))/lag(invt,1),
    dSale = (sale - lag(sale,1))/lag(sale,1),
    dDep = (dp - lag(dp,1))/lag(dp,1),
    dDivPS = (dvpsp_f - lag(dvpsp_f,1))/lag(dvpsp_f,1),
    DepRat = dp/ppent,
    dDepRat = (DepRat - lag(DepRat,1))/lag(DepRat,1),
    ROE = ni/lag(seq,1),
    dROE = ROE - lag(ROE,1),
    dCAPXRat = (capx/at - lag(capx/at,1))/lag(capx/at,1),
    dCAPXRat_l1 = lag(dCAPXRat,1),
    DebtEquity = (dltt + dlc)/seq,
    dDebtEquity = (DebtEquity - lag(DebtEquity,1))/lag(DebtEquity,1),
    LTDebtEquity = dltt/seq,
    dLTDebtEquity = (LTDebtEquity - lag(LTDebtEquity,1))/lag(LTDebtEquity,1),
    EquityPPE = seq/ppent,
    dEquityPPE = (EquityPPE - lag(EquityPPE,1))/lag(EquityPPE,1),
    TimesInterest = ebit/xint,
    dTimesInterest = (TimesInterest - lag(TimesInterest,1))/lag(abs(TimesInterest),1),
    SalesRat = sale/at,
    dSalesRat = (SalesRat - lag(SalesRat,1))/lag(SalesRat,1),
    ROA = ni/((at + lag(at,1))/2),
    ROE_EOY = ni/seq,
    GMRat = gp/revt,
    dGMRat = (GMRat - lag(GMRat,1))/lag(abs(GMRat),1),
    OPRat = oibdp/sale,
    dOPRat = (OPRat - lag(OPRat,1))/lag(abs(OPRat),1),
    PTIRat = pi/sale,
    dPTIRat = (PTIRat - lag(PTIRat,1))/lag(abs(PTIRat),1),
    PMRat = ni/revt,
    dPMRat = (PMRat - lag(PMRat,1))/lag(abs(PMRat),1),
    SaleCash = sale/ch,
    SaleRect = sale/rect,
    SaleInvt = sale/invt,
    dSaleInvt = (SaleInvt - lag(SaleInvt,1))/lag(abs(SaleInvt),1),
    SaleWCap = sale/wcap,
    dSaleWCap = (SaleWCap - lag(SaleWCap,1))/lag(abs(SaleWCap),1),
    SalePPent = sale/ppent,
    dProd = ((invt - lag(invt,1) + cogs) - (lag(invt,1) - lag(invt,2) + lag(cogs,1)))/(lag(invt,1) - lag(invt,2) + lag(cogs,1)),
    dRD = (xrd - lag(xrd,1))/lag(xrd,1),
    dRDRat = (xrd/sale - lag(xrd/sale,1))/lag(xrd/sale,1),
    dAdv = (xad - lag(xad,1))/lag(xad,1),
    dAdvRat = (xad/sale - lag(xad/sale,1))/lag(xad/sale,1),
    dTotAsset = (at - lag(at,1))/lag(at,1),
    CashDebt = (ni - (at - lag(at,1)) + (lt - lag(lt,1)) + (seq - lag(seq,1)))/(dltt + dlc),
    WCapRat = wcap/at,
    dWCapRat = (WCapRat - lag(WCapRat,1))/lag(abs(WCapRat),1),
    OIRat = oibdp/at,
    dOIRat = (OIRat - lag(OIRat,1))/lag(abs(OIRat),1),
    dUseFunds = (fuset - lag(fuset,1))/lag(fuset,1),
    dSourceFunds = (fsrct - lag(fsrct,1))/lag(abs(fsrct),1),
    DtRepay = (lag(dltt,1) + dltis - dltt)/dltt,
    DtIssue = dltis/dltt,
    TreasStck = (tstkc + tstkp - lag(tstkc + tstkp,1))/(ceq + pstk),
    dUnFunds = (utfdoc - lag(utfdoc,1))/lag(utfdoc,1),
    dLTDt = (dltt - lag(dltt,1))/lag(dltt,1),
    DivCash = dv/(ni - (at - lag(at,1)) + (lt - lag(lt,1)) + (seq - lag(seq,1))),
    dWCap = (wcap - lag(wcap,1))/lag(wcap,1),
    NiCf = ni/(ni - (at - lag(at,1)) + (lt - lag(lt,1)) + (seq - lag(seq,1)))
  ) %>%
  mutate(across(c(Incr_ld1:NiCf), infzero)) %>%
  group_by(fyear) %>%
  mutate(across(c(Incr_ld1:NiCf), winsIQR)) %>%
  mutate(across(c(Incr_ld1:NiCf), norm)) %>%
  select(gvkey, fyear, DIncr_ld1:NiCf) %>%
  ungroup() %>%
  mutate(across(c(Incr_ld1:NiCf), infzero)) %>%
  drop_na(gvkey, fyear, DIncr_ld1:NiCf) %>%
  filter(fyear >= 1980) %>%
  select(-dDivPS, -dRD, -dRDRat, -dAdv, -dAdvRat, -dUseFunds, -dSourceFunds, -dUnFunds, -TreasStck)
# some of these (RD, Adv, ...) are only zeros because of the winsIQR... could try to adjust rather than use the same treatment for all features
rm(temp)
write.csv(temp2, "C:/Users/jgreen/Dropbox/ACC648DocumentSharing/2024/dEarningsPred.csv", row.names = FALSE)
```

The csv file for this chapter is available [here](https://www.dropbox.com/scl/fi/e3y8ulurjfl0hoqpp5iu0/dEarningsPred.csv?rlkey=4ply7jm2y20wzx7ms1bzmxk8e&dl=0). The data is from Compustat – annual financial statement information. The two variables we will be trying to model, i.e. predict, are `DIncr_ld1` and `Incr_ld1`. The first is a binary variable indicating whether year t+1 earnings will be higher than year t earnings. The second is the change in earnings from year t to year t+1 scaled by the company's market value of equity at the end of fiscal year t. The other features are financial statement analysis variables from the papers linked above. The features have been winsorized and standardized each fiscal year.

Some adjustments to models, and the tests related to those models, differ between the dummy/binary/indicator/classification variable and the continuous variable. We will discuss these differences as we go through the models.

We can read in the data here.

```{r, eval=FALSE, echo=FALSE}
df <- read.csv("C:/Users/jgreen/Dropbox/ACC648DocumentSharing/2024/dEarningsPred.csv")
```

```r
df <- read.csv(filepath)  # replace filepath with the location of dEarningsPred.csv on your computer
```

We will be learning about the models using the training data. Because earnings and stock return prediction models are typically used with the intention of predicting future data, the standard approach is called back-testing. Back-testing trains a model on data available up to some point in time and then evaluates it on data that comes after that point. The back-testing is typically repeated on a rolling basis for each point in time. We will use only one slice of this back-testing: a single training sample that will be used to train the model, and then we will treat all subsequent data as if it is future data.

The key point is that the training data here is defined by a time-based filter. We have data from 1980 through 2023. We will use 1980 through 2009 as training data and years after 2010 as testing data, leaving 2010 as a gap year between the training and testing data to avoid any data leakage, because the variables we are trying to predict are based on t+1 data.

```r
library(tidyverse)
train <- df %>%
  filter(fyear < 2010)
test <- df %>%
  filter(fyear > 2010)
rm(df)
```

Before walking through the different model types, we will prepare the data for h2o.

```r
train <- train %>% mutate(fDIncr_ld1 = as.factor(DIncr_ld1))
y1 <- "Incr_ld1"
y2 <- "DIncr_ld1"
y2b <- "fDIncr_ld1"
x <- setdiff(names(train), c("gvkey", "fyear", y1, y2, y2b))

library(h2o)
h2o.init(nthreads = 10)
trn.h2o <- as.h2o(train)
```

## Linear regression

A linear regression model is the most similar to the discussion in the introduction to modeling chapter. The model is a linear combination of the features. The model is estimated by minimizing the sum of the squared differences between the predicted values and the actual values. The algorithm for the linear regression model is:

1. Propose a linear model by choosing the $x$ variables. The proposed model is then given by the following:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon $$
The $\beta$s are unknown and are to be estimated by the model. The $\epsilon$ is the error term: the amount by which the model's predicted value differs from the actual value.

2. Estimate the $\beta$s by minimizing the sum of the squared differences between the predicted values and the actual values. The sum of the squared differences is given by:

$$ \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. The predicted value is given by the linear model.

The $\beta$s can be estimated with some mathematical formulas or with numerical optimization (i.e. try different $\beta$s until the sum of the squared differences is minimized).

There are two primary benefits of linear regression models. First, the model is understandable via the estimated coefficients. The $\beta$s can be interpreted as partial derivatives meaning that the $\beta$ gives the change in the dependent variable for a one unit change in the associated $x$ variable holding all other $x$ variables constant. Second, because the model has a predefined form, it can be estimated quickly. The primary drawback of linear regression models is that they are limited to the proposed linear relationships.
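
To connect this to a familiar tool, here is a minimal sketch that estimates this kind of model with base R's `lm()` function, assuming the `train`, `x`, and `y1` objects created earlier in the chapter (the rest of the chapter estimates the model with h2o instead):

```r
# Illustrative only: ordinary least squares with base R
ols <- lm(reformulate(x, response = y1), data = train)

summary(ols)$r.squared  # share of the variation in y explained on the training data
coef(ols)["CurRat"]     # predicted change in y for a one unit change in CurRat,
                        # holding the other features constant
```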

We will estimate a linear regression model with the h2o package. The h2o package has the GLM – generalized linear model. The generalized linear model is a generalization of the linear regression model that allows for different assumptions with the estimation.

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

bmdl <- h2o.get_best_model(mdlres)
bmdl

#perf <- h2o.performance(bmdl)
#plot(perf, type = "roc") # ROC is for classification models and not regression models
# a single regression model also doesn't have variation in the model

vi <- h2o.varimp(bmdl)
vi
```

```r
Time difference of 52.38173 secs

model_id rmse mse mae rmsle mean_residual_deviance
1 GLM_1_AutoML_1_20240724_83344 0.2836039 0.08043114 0.2165695 0.1991977 0.08043114
training_time_ms predict_time_per_row_ms algo
1 491 0.000111 GLM

H2ORegressionMetrics: glm
** Reported on training data. **

R^2 : 0.09320464

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept 0.618432 0.489705
2 CurRat -0.063123 -0.018631
3 dCurRat -0.086448 -0.024368
4 QuiRat 0.005659 0.001767
5 dQuiRat 0.045762 0.012639


names coefficients standardized_coefficients
55 DtRepay -0.000024 -0.000006
56 DtIssue -0.004347 -0.001486
57 dLTDt -0.005325 -0.001628
58 DivCash 0.010141 0.003169
59 dWCap 0.015578 0.004459
60 NiCf 0.023109 0.007103

Variable Importances:
variable relative_importance scaled_importance percentage
1 PTIRat 0.048125 1.000000 0.086043
2 ROA 0.047138 0.979499 0.084279
3 dOIRat 0.031742 0.659565 0.056751
4 dOPRat 0.030638 0.636631 0.054778
5 PMRat 0.028740 0.597188 0.051384

```

Walking through the output shows that the linear model estimates much faster than other models or combinations of models. AutoML also only estimates a single model because once the linear model is specified, there are no parameters to tune.

The R^2 on the training data is 9.3%. This means that the model explains 9.3% of the variation in the dependent variable. Unfortunately, for many accounting and finance variables, the R^2s to date are low. This may be because the variables are influenced by other factors that we have not been able to identify, or because the variables include a large amount of random variation that we cannot explain or predict. The R^2 is strongly influenced by the dependent variable chosen. We have future changes in earnings here. The future level of earnings tends to have a higher R^2, stock returns a lower R^2, fraud-related variables a lower R^2, etc.

A snapshot of the coefficients is output. We can interpret some to understand how they work.

```r
2 CurRat -0.063123
4 QuiRat  0.005659
```

`CurRat` is the current ratio and `QuiRat` is the quick ratio. Both are related to the liquidity of the company's current assets. The coefficient on `CurRat` is -0.06. This means that, keeping all other variables constant, a one unit increase in `CurRat` (these are normalized variables that range from 0 to 1, so a one unit increase means going from the lowest to the highest level of the variable) predicts that the following year's change in earnings divided by the market value of equity will decrease by 0.06. The dependent variable is a form of change in ROE, so this is a reduction in this measure of ROE of 6%.

The coefficient on `QuiRat` is 0.006. This means that keeping all other variables constant, a one unit increase in the quick ratio predicts that the following year’s change in earnings divided by the market value of equity will increase by 0.006. This is an increase in the measure of ROE by 0.6%.

We can also get the most important variables and here those variables are `PTIRat`, `ROA`, etc.

## Logistic regression

Linear regression has some limitations when using variables that are not continuous such as binary outcome variables. These limitations led to the development of other models. A logistic regression is a model designed for binary outcome variables. The logistic regression model is a linear model that is transformed to a probability. The logistic regression model has the same benefits as the linear regression model. The model is understandable via the estimated coefficients.

The logistic regression model is estimated by maximizing the likelihood of the observed data. The algorithm for the logistic regression model is:

1. Propose a linear model by choosing the $x$ variables. The proposed model is then given by the following:

$$ p = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}$$

The $\beta$s are unknown and are to be estimated by the model. The $p$ is the probability that the binary outcome variable is 1. The $e$ is the base of the natural logarithm.

2. Estimate the $\beta$s by maximizing the likelihood of the observed data on the training data. The likelihood is the probability of observing the data given the model, and it is maximized by adjusting the $\beta$s. The intuition is similar to that of the linear regression model: the model is trying to find the $\beta$s that best fit the data. However, the estimation approach is different. If you are interested in how maximum likelihood estimation works, here is [a video](https://www.youtube.com/watch?v=XepXtl9YKwc) and here is [a paper that describes it in more detail](https://gribblelab.org/9040_FW22/files/Myung2003.pdf). Note that the linear regression coefficients can also be estimated in the same manner.

3. The predicted value for training and testing data is given by the probability function above and the estimated $\beta$s.

The estimated $\beta$s are similar to those in the linear regression model. However, the partial derivative interpretation is different. If you want to understand more about how to interpret the coefficients, [a video here](https://www.youtube.com/watch?v=vN5cNN2-HWE) explains.
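
As a point of comparison to the h2o estimation below, here is a minimal sketch of a logistic regression with base R's `glm()`, assuming the `train`, `x`, and `y2` objects created earlier. Exponentiating a coefficient gives an odds ratio, one common way to interpret the estimates:

```r
# Illustrative only: logistic regression with base R
logit <- glm(reformulate(x, response = y2), data = train, family = binomial)

# exp(coefficient) is the multiplicative change in the odds that y = 1
# for a one unit change in the feature, holding the other features constant
exp(coef(logit))["ROA"]

# predicted probabilities come from the logistic transformation of the linear combination
head(predict(logit, type = "response"))
```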

We can estimate a logistic regression in the same way as we did the linear model. If we make the outcome variable a factor, automl will estimate a logistic regression.

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y2b,
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

#lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
#print(lb, n = nrow(lb))

bmdl <- h2o.get_best_model(mdlres)
bmdl

perf <- h2o.performance(bmdl)
plot(perf, type = "roc")

vi <- h2o.varimp(bmdl)
vi
```

```r
Time difference of 1.236157 mins

H2OBinomialModel: glm
Model ID: GLM_1_AutoML_2_20240724_90052
GLM Model: summary
family link regularization
1 binomial logit Ridge ( lambda = 1.621E-4 )
lambda_search
1 nlambda = 30, lambda.max = 9.2961, lambda.min = 1.621E-4, lambda.1se = 0.004549
number_of_predictors_total number_of_active_predictors number_of_iterations
1 59 59 41
training_frame
1 AutoML_2_20240724_90052_training_train_sid_beec_67

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -1.038538 0.014688
2 CurRat 0.072137 0.021292
3 dCurRat -0.608851 -0.171625
4 QuiRat 0.307146 0.095924
5 dQuiRat 0.191125 0.052787

AUC: 0.6856095

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 43490 110061 0.716772 =110061/153551
1 8558 148373 0.054534 =8558/156931
Totals 52048 258434 0.382048 =118619/310482

Variable Importances:
variable relative_importance scaled_importance percentage
1 ROA 0.528579 1.000000 0.105236
2 PTIRat 0.481434 0.910807 0.095849
3 PMRat 0.469706 0.888620 0.093514
4 OIRat 0.218943 0.414211 0.043590
5 dPMRat 0.211249 0.399655 0.042058
```
![ROC curve](rocA.png)

The output is similar to the output we saw in the chapter on automl. First, we see that GLM estimated a logistic regression. The AUC is 0.69, which is above the 0.5 expected from random guessing. The variable importances show that `ROA` and `PTIRat` are the most important variables.
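
Because these models are ultimately judged on data the model has not seen, a natural next step is to score the fitted model on the held-out test years. A minimal sketch, assuming the `test` data frame created earlier:

```r
# score the fitted logistic model on the test (future) years
tst <- test %>% mutate(fDIncr_ld1 = as.factor(DIncr_ld1))  # same factor outcome as in training
tst.h2o <- as.h2o(tst)

h2o.performance(bmdl, newdata = tst.h2o)  # AUC and error rates on data the model never saw
head(h2o.predict(bmdl, tst.h2o))          # predicted class and probabilities for each test observation
```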

## Machine learning models

The linear and logistic regression models are the most interpretable models. However, they are limited in their ability to capture complex relationships. Machine learning models are designed to capture more complex relationships. However, the machine learning models are more difficult to interpret. The machine learning models are also more computationally intensive. Because machine learning models can capture more complex patterns in the data, they are also more prone to overfitting.

There are two major classes of machine learning models: tree-based models and neural networks.

### Classification and regression trees (and extensions – random forests, gradient boosting)

Classification and regression trees are models that are based on a series of decisions. The model is a tree structure where each node is a decision based on a feature. The tree structure is built by choosing the feature that best splits the data into two groups. The best split is determined by the feature that best separates the dependent variable. The tree is built by recursively splitting the data into two groups until the data is separated into groups that are as homogeneous as possible. The tree is then pruned to avoid overfitting. The tree is then used to predict the dependent variable by following the tree structure. The predicted value is the average of the dependent variable in the terminal node.

A very simple tree model is easy to interpret. However, as a tree becomes more complex or as extensions to a simple tree are added, the model becomes difficult to interpret. The tree model is also prone to overfitting, and several methods are commonly used to reduce the chances of overfitting.

A single regression and classification tree algorithm is as follows:

1. Randomly choose a feature to start.
2. Search for a value of the feature that best splits the observations based on the dependent variable, so that observations in the same group are as similar as possible. For a binary dependent variable, similarity can be measured by the percent of observations with the same outcome. For a continuous dependent variable, it can be measured by how different the observations are from the mean of the group (just like with clustering). Try different split values until both groups are as homogeneous as possible. For a continuous dependent variable, the criterion to minimize could be defined as:

$$ \sum_{i=1}^{m} (y_i - \bar{y}_M)^2 + \sum_{i=1}^{n} (y_i - \bar{y}_N)^2$$

where $y_i$ is the dependent variable and $\bar{y}_M$ and $\bar{y}_N$ are the means of the dependent variable in the two groups, which contain $m$ and $n$ observations. The total sum of squares is the sum of the squared differences across the two groups.
3. Repeat the process for each feature and choose the feature that best splits the data.
4. Split the data into two groups based on this feature. Observations with values of the feature above the cutoff point go into one group and observations with values of the feature below the cutoff point go into the other group.
5. Now for each group, repeat the process of finding the feature that best splits the data. Continue this process until the data is split into groups that are as homogeneous as possible or a stopping point is reached such as the number of observations in a group is one or a chosen cutoff number (e.g. 10 observations in a group).
6. The model is then used to predict the dependent variable by following the choices through the tree until the endpoint is reached. The predicted value is the average of the dependent variable in the end group.

The entire process is referred to as the tree. The branches are the decisions and the leaves are the end groups. Tree depth is the number of levels of decisions in the tree – i.e. the top level to the last level.
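
To make the algorithm concrete, here is a minimal sketch that fits a single, shallow regression tree with the `rpart` package (an assumption; that package is not used elsewhere in this chapter), using the `train`, `x`, and `y1` objects created earlier:

```r
# Illustrative only: a single regression tree
library(rpart)

tree <- rpart(reformulate(x, response = y1), data = train,
              method = "anova",                        # regression tree: squared-error splits
              control = rpart.control(maxdepth = 3,    # at most 3 levels of decisions
                                      minbucket = 100, # at least 100 observations in each leaf
                                      cp = 0.001))     # minimum improvement required to keep a split

print(tree)  # each row shows a split (branch) or a terminal node (leaf) with its mean prediction
```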

#### Problems and extensions

There are a number of challenges with tree models. The first is that the model is prone to overfitting. This is because a tree could be built so that it perfectly fits the data. This would be the case if the tree was built to have one observation in each ending group. One approach to limiting overfitting is to prune the tree. This means that splits that are allowed on the data are limited to those that have a certain number of observations in the end group and/or the splits are limited to those that achieve some specified criteria such as the fit with that split decision. Another challenge is that the tree model is sensitive to the path it chooses. For example, splitting on a less informative first variable and then performing subsequent splits could better fit the data than the chosen path. The predictions are very sensitive to the chosen path.

Other extensions have been proposed that help to reduce the overfitting and path sensitivity. The two most common extensions are random forests and gradient boosting. Random forests are a collection of trees where each tree is built on a random sample of the data and a random sample of the features. The random forest model averages the predictions of the trees. Gradient boosting is a collection of trees where each tree is built to correct the errors of the previous tree. The gradient boosting model averages the predictions of the trees. We will briefly discuss these extensions. There are also extensions on these extensions (extreme gradient boosting, extremely randomized trees, etc.). We will not discuss these further extensions, although we will include them when using automl.

#### Random forests (bagging)

Bagging refers to using random samples from the training data to estimate trees. Random forests is a method that combines multiple trees to make a prediction. Combining the predictions across multiple trees helps to reduce the overfitting of a single tree. Combining trees also helps reduce the effects of path dependency from estimating a single tree. A generic random forest algorithm is as follows:

1. Randomly choose a feature to start and randomly choose a subset of the data.
2. Search for a value of the feature that best splits the observations based on the dependent variable, so that observations in the same group are as similar as possible.
3. Split the data into two groups based on this feature. Observations with values of the feature above the cutoff point go into one group and observations with values of the feature below the cutoff point go into the other group.
4. Now for each group, repeat the process of randomly choosing a feature and finding the best split for that feature. Continue this process until the data is split into groups that are as homogeneous as possible or a stopping point is reached such as the number of observations in a group is one or a chosen cutoff number (e.g. 10 observations in a group).
5. Start a new tree by repeating 1-4. Repeat until the chosen number of trees is reached.
6. For each observation, average the predictions for that observation across all trees.
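
A random forest can also be estimated directly in h2o, outside of AutoML. A minimal sketch, assuming the `x`, `y1`, and `trn.h2o` objects created earlier (the hyperparameter values are illustrative only):

```r
# Illustrative only: a random forest estimated directly with h2o
rf <- h2o.randomForest(x = x, y = y1,
                       training_frame = trn.h2o,
                       ntrees = 100,    # number of trees whose predictions are averaged
                       max_depth = 10,  # limit the depth of each tree
                       seed = 1)

h2o.varimp(rf)  # variable importance aggregated across the trees
```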

#### Gradient boosting (boosting)

Boosting combines multiple trees, but the trees are not estimated independently based on random samples and features. Boosting starts with a single tree, finds observations that are poorly predicted by that tree, and then builds a new tree that weights the poorly predicted observations more heavily. A simple way to understand this is to add a weight to the fit measure shown previously.

$$\sum_{i=1}^{m} w_i (y_i - \bar{y}_M)^2 + \sum_{i=1}^{n} w_i (y_i - \bar{y}_N)^2$$

Here a weight $w_i$ is added that places more emphasis on the poorly predicted observations. This weight is a function of the error from the previous tree. A measure of how well each tree fits the data is also kept. The process is repeated with a new tree until a stopping point is reached. The predictions are then averaged across all trees, with higher weights given to the trees that best fit the data. The gradient part of gradient boosting refers to the method for adjusting the weights and fitting a new tree.

A generic algorithm for boosting is as follows:

1. Estimate a tree to create the base tree.
2. Create the weight for each observation and the fit of the tree.
3. Estimate a new tree with the weights from the prior tree.
4. Repeat until a stopping criterion is reached.
5. Average the predictions from all trees with higher weights being placed on the trees that best fit the data.
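
Similarly, a gradient boosting model can be estimated directly with h2o's `h2o.gbm()` function. A minimal sketch, again assuming the `x`, `y1`, and `trn.h2o` objects created earlier, with illustrative hyperparameter values:

```r
# Illustrative only: a gradient boosting machine estimated directly with h2o
gbm <- h2o.gbm(x = x, y = y1,
               training_frame = trn.h2o,
               ntrees = 100,       # number of sequentially estimated trees
               max_depth = 5,      # depth of each tree
               learn_rate = 0.05,  # how much of each new tree's correction is applied
               seed = 1)

h2o.performance(gbm)  # fit statistics on the training data
```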

#### Tree based methods in h2o

Each tree based model has hyperparameters to tune. For example, the number of trees to estimate, the depth of the trees, the minimum number of observations in a group, the extent to which a tree should be pruned, etc. automl does some of the tuning for us with cross-validation. We can estimate various tree based methods with automl. automl includes algorithms for distributed random forests (DRF), gradient boosting machines (GBM), and extreme gradient boosting (XGBoost). (Note that XGBoost in h2o doesn’t work with Windows. We could use a different package, but here we will stick with h2o and omit XGBoost.)

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     include_algos = c("DRF", "GBM"),
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

bmdl <- h2o.get_best_model(mdlres)
bmdl

#perf <- h2o.performance(bmdl)
#plot(perf, type = "roc")

vi <- h2o.varimp(bmdl)
vi
```

```r
Time difference of 48.49129 mins

model_id rmse mse mae rmsle
1 GBM_grid_1_AutoML_1_20240724_130131_model_16 0.2746579 0.07543696 0.2003546 0.1934716
2 GBM_2_AutoML_1_20240724_130131 0.2746969 0.07545841 0.2005104 0.1935189
3 GBM_3_AutoML_1_20240724_130131 0.2748562 0.07554591 0.2004910 0.1935980
4 GBM_5_AutoML_1_20240724_130131 0.2748681 0.07555249 0.2010021 0.1936542
5 GBM_1_AutoML_1_20240724_130131 0.2749528 0.07559902 0.2001961 0.1935871
6 GBM_grid_1_AutoML_1_20240724_130131_model_3 0.2754047 0.07584776 0.2016921 0.1940060
7 GBM_4_AutoML_1_20240724_130131 0.2754724 0.07588507 0.2007943 0.1939629
8 GBM_grid_1_AutoML_1_20240724_130131_model_14 0.2757521 0.07603921 0.2023507 0.1942619
9 GBM_grid_1_AutoML_1_20240724_130131_model_2 0.2760375 0.07619669 0.2027294 0.1944408
10 GBM_grid_1_AutoML_1_20240724_130131_model_12 0.2763491 0.07636881 0.2032937 0.1946742
11 XRT_1_AutoML_1_20240724_130131 0.2764082 0.07640147 0.2019100 0.1945235
12 DRF_1_AutoML_1_20240724_130131 0.2766396 0.07652945 0.2021444 0.1946616
13 GBM_grid_1_AutoML_1_20240724_130131_model_6 0.2768266 0.07663296 0.2039455 0.1949944
14 GBM_grid_1_AutoML_1_20240724_130131_model_4 0.2772119 0.07684645 0.2027568 0.1950357
15 GBM_grid_1_AutoML_1_20240724_130131_model_9 0.2774313 0.07696813 0.2050022 0.1954172
16 GBM_grid_1_AutoML_1_20240724_130131_model_10 0.2778972 0.07722684 0.2029337 0.1954665
17 GBM_grid_1_AutoML_1_20240724_130131_model_7 0.2779019 0.07722948 0.2055389 0.1956640
18 GBM_grid_1_AutoML_1_20240724_130131_model_17 0.2779548 0.07725885 0.2058923 0.1957420
19 GBM_grid_1_AutoML_1_20240724_130131_model_11 0.2781791 0.07738361 0.2060555 0.1958886
20 GBM_grid_1_AutoML_1_20240724_130131_model_5 0.2785413 0.07758528 0.2030808 0.1957543
21 GBM_grid_1_AutoML_1_20240724_130131_model_15 0.2795151 0.07812869 0.2043051 0.1964334
22 GBM_grid_1_AutoML_1_20240724_130131_model_1 0.2799180 0.07835411 0.2047814 0.1967039
23 GBM_grid_1_AutoML_1_20240724_130131_model_8 0.2799291 0.07836030 0.2047598 0.1966145
24 GBM_grid_1_AutoML_1_20240724_130131_model_18 0.2807808 0.07883785 0.2058048 0.1972457
25 GBM_grid_1_AutoML_1_20240724_130131_model_13 0.2838149 0.08055089 0.2078455 0.1990124
mean_residual_deviance training_time_ms predict_time_per_row_ms algo
1 0.07543696 7768 0.002448 GBM
2 0.07545841 8737 0.002298 GBM
3 0.07554591 10061 0.001856 GBM
4 0.07555249 9184 0.002167 GBM
5 0.07559902 13810 0.001933 GBM
6 0.07584776 5617 0.002199 GBM
7 0.07588507 8842 0.001497 GBM
8 0.07603921 4271 0.002511 GBM
9 0.07619669 9256 0.002358 GBM
10 0.07636881 4772 0.001074 GBM
11 0.07640147 133423 0.002905 DRF
12 0.07652945 117691 0.002792 DRF
13 0.07663296 2800 0.001165 GBM
14 0.07684645 11034 0.001721 GBM
15 0.07696813 2690 0.001320 GBM
16 0.07722684 9118 0.001651 GBM
17 0.07722948 4339 0.002336 GBM
18 0.07725885 3206 0.001753 GBM
19 0.07738361 2063 0.001674 GBM
20 0.07758528 25950 0.001460 GBM
21 0.07812869 19023 0.001830 GBM
22 0.07835411 10664 0.001688 GBM
23 0.07836030 19982 0.001445 GBM
24 0.07883785 5947 0.001320 GBM
25 0.08055089 25720 0.001228 GBM

Model Details:
==============

H2ORegressionModel: gbm
Model ID: GBM_grid_1_AutoML_1_20240724_130131_model_16
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth
1 93 93 311404 9 9 9.00000
min_leaves max_leaves mean_leaves
1 145 371 262.21506

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE: 0.06614461
RMSE: 0.2571859
MAE: 0.1878626
RMSLE: 0.1821754
Mean Residual Deviance : 0.06614461

 

H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE: 0.07543696
RMSE: 0.2746579
MAE: 0.2003546
RMSLE: 0.1934716
Mean Residual Deviance : 0.07543696

Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
aic NA 0.000000 NA NA NA NA NA
loglikelihood NA 0.000000 NA NA NA NA NA
mae 0.200355 0.001137 0.199898 0.198986 0.201921 0.199936 0.201032
mean_residual_deviance 0.075437 0.000587 0.075309 0.074827 0.076302 0.075032 0.075715
mse 0.075437 0.000587 0.075309 0.074827 0.076302 0.075032 0.075715
r2 0.149064 0.003652 0.146955 0.150007 0.145319 0.154832 0.148209
residual_deviance 0.075437 0.000587 0.075309 0.074827 0.076302 0.075032 0.075715
rmse 0.274656 0.001068 0.274424 0.273545 0.276228 0.273920 0.275164
rmsle 0.193471 0.000695 0.193462 0.192602 0.194505 0.193172 0.193612

Variable Importances:
variable relative_importance scaled_importance percentage
1 PTIRat 2695.256348 1.000000 0.082669
2 ROA 2430.473145 0.901760 0.074548
3 PMRat 2121.519043 0.787131 0.065071
4 dTotAsset 1497.423950 0.555578 0.045929
5 dPTIRat 1213.899658 0.450384 0.037233
```

Here we see that the running time is much longer. This is because each model type has hyperparameters to tune and each model requires multiple trees. The different models on the leaderboard represent models with different hyperparameters.

The best model is a GBM model with 93 trees. The R^2 is 14.9%. Note that these results are better, at least in terms of the training data, than for the linear model. (However, performance on the testing data may not be better if the model is overfitting.)

It is also possible to estimate a binary (classification) model by setting the outcome variable to be a factor.

The most important variables are PTIRat, ROA, etc.

### Neural networks

Neural networks are another type of machine learning model; they get their name from the networks of neurons in the human brain. The basic idea of neural networks is to take `x` variables and predict a `y` variable, but the `x` variables can be combined in many different ways to predict `y`. A neural network is defined by an input layer (the `x` variables), an output layer (the `y` variable), and hidden layers (the pieces that define how the `x` variables are combined to predict `y`). Each piece within the network is called a node. A typical visual description of a network is shown below.

![Neural network example figure. Source: https://towardsdatascience.com/designing-your-neural-networks-a5e4617027ed](NNFig.png)

To build an understanding of how a neural network functions, let’s begin with a single node in the hidden layer. A hidden layer node has two components – the weights applied to the input variables and the activation function. Applying weights to the input variables is a linear function.

$$ z = w_1x_1 + w_2x_2 + \ldots + w_nx_n$$
Weights are unknown and estimated by the model. This looks a lot like a linear regression. The difference is that the weights are estimated as part of the full network. The combination of the weights and the input variables is then passed through an activation function. The activation function transforms the weighted inputs into a new variable. There are different activation functions, but they are all some form of non-linear transformation. A sigmoid function is a common activation function. The sigmoid function is:

$$ \sigma(z) = \frac{1}{1+e^{-z}}$$

The sigmoid function is a function that takes any value and transforms it to a value between 0 and 1.

The same process is repeated for each node in the first hidden layer. The output from the first layer, the `z`s now become the input to the next hidden layer, and so forth until the last hidden layer. The number of layers is the depth of the neural network and is a choice that we have to determine. A deep neural network or deep learning has many hidden layers. The more hidden layers and the more nodes in each layer, the more complex the model. As with tree based models, the more complex the model, the more closely it can fit the data, but the higher the likelihood that it will overfit the data.

The output from the last hidden layer is then passed to the output layer. The output layer is a linear combination of the last hidden layer. The output layer is the prediction of the dependent variable. Fitting the model is done in a similar way to the other models and uses a measure of model fit, for example, the mean squared error. A computer then searches through weights to find the best fit of the model. Because there are a large number of weights that need to be estimated, the model is computationally intensive.
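
To illustrate the computation inside a single hidden-layer node, here is a minimal sketch in plain R with made-up inputs and weights (in an actual network, the weights, and typically a bias term, are estimated during training):

```r
# one hidden-layer node: weighted inputs passed through a sigmoid activation
sigmoid <- function(z) 1 / (1 + exp(-z))

x_input <- c(0.2, 0.7, 0.5)   # three hypothetical (normalized) feature values
w       <- c(0.4, -1.1, 0.8)  # hypothetical weights for this node
b       <- 0.1                # bias term, usually also estimated

z <- sum(w * x_input) + b     # linear combination of the inputs
a <- sigmoid(z)               # transformed to a value between 0 and 1
a
```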

There are many resources for learning more about neural networks. For example, [a description of neural networks is available here.](https://www.freecodecamp.org/news/deep-learning-neural-networks-explained-in-plain-english/) and [a short video introduction is available here](https://www.youtube.com/watch?v=jmmW0F0biz0).

The algorithm for estimating a neural network is as follows:

1. Supply features and propose a neural network structure. This includes the number of hidden layers, the number of nodes in each hidden layer, and the activation function.
2. Estimate the weights in the network by minimizing the error in the prediction of the dependent variable. The error is typically the mean squared error.
3. The model is then used to predict the dependent variable.

The simple description of the algorithm above belies the complexity of estimating a neural network. The weights are estimated by a process called backpropagation. Backpropagation is a method that adjusts the weights in the network to minimize the error in the prediction. [A description of backpropagation is available here](https://www.youtube.com/watch?v=Ilg3gGewQ5U). There are also different types of neural networks. The different types have been developed for different tasks and to address different problems. The feedforward neural network was described above. A recurrent neural network (RNN) does not require that the input to a node come from the previous layer; a node can receive input from other nodes in the network. RNNs have been highly successful with audio and text. A convolutional neural network (CNN) is designed for image data and affects how the input is processed; it captures spatial relationships. Other types of networks have other properties. We will not go into more details about the different types of networks.

#### Neural networks in h2o

h2o has a feedforward neural network algorithm that allows for many hidden layers (a deep neural network). The algorithm is applied similarly to the other algorithms in h2o. Note that there is only one algorithm, but because it has hyperparameters, there are many models that can be estimated.

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     include_algos = c("DeepLearning"),
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

bmdl <- h2o.get_best_model(mdlres)
bmdl

#perf <- h2o.performance(bmdl)
#plot(perf, type = "roc")

vi <- h2o.varimp(bmdl)
vi
```

```r
Time difference of 1.733151 hours

model_id rmse mse mae rmsle
1 DeepLearning_grid_3_AutoML_2_20240725_92814_model_1 0.2782959 0.07744858 0.2049889 0.1960928
2 DeepLearning_grid_2_AutoML_2_20240725_92814_model_1 0.2783562 0.07748215 0.2043291 0.1959433
3 DeepLearning_grid_1_AutoML_2_20240725_92814_model_6 0.2787980 0.07772835 0.2059411 0.1963094
4 DeepLearning_1_AutoML_2_20240725_92814 0.2788812 0.07777475 0.2075788 0.1962472
5 DeepLearning_grid_2_AutoML_2_20240725_92814_model_4 0.2790975 0.07789542 0.2046404 0.1965332
6 DeepLearning_grid_1_AutoML_2_20240725_92814_model_5 0.2791581 0.07792923 0.2061686 0.1965498
7 DeepLearning_grid_3_AutoML_2_20240725_92814_model_4 0.2793012 0.07800917 0.2053480 0.1967847
8 DeepLearning_grid_3_AutoML_2_20240725_92814_model_3 0.2794595 0.07809762 0.2075103 0.1969767
9 DeepLearning_grid_1_AutoML_2_20240725_92814_model_4 0.2794716 0.07810439 0.2067435 0.1966284
10 DeepLearning_grid_1_AutoML_2_20240725_92814_model_1 0.2795010 0.07812081 0.2056895 0.1969621
11 DeepLearning_grid_2_AutoML_2_20240725_92814_model_3 0.2797725 0.07827267 0.2075569 0.1967518
12 DeepLearning_grid_3_AutoML_2_20240725_92814_model_7 0.2799658 0.07838086 0.2066907 0.1970536
13 DeepLearning_grid_2_AutoML_2_20240725_92814_model_7 0.2802205 0.07852353 0.2054360 0.1974649
14 DeepLearning_grid_1_AutoML_2_20240725_92814_model_2 0.2805136 0.07868785 0.2082385 0.1972320
15 DeepLearning_grid_1_AutoML_2_20240725_92814_model_3 0.2805330 0.07869876 0.2087615 0.1968041
16 DeepLearning_grid_1_AutoML_2_20240725_92814_model_7 0.2807860 0.07884077 0.2072728 0.1982565
17 DeepLearning_grid_3_AutoML_2_20240725_92814_model_8 0.2810437 0.07898557 0.2066100 0.1979191
18 DeepLearning_grid_3_AutoML_2_20240725_92814_model_5 0.2814549 0.07921686 0.2116013 0.1981621
19 DeepLearning_grid_1_AutoML_2_20240725_92814_model_8 0.2815492 0.07926996 0.2073582 0.1983087
20 DeepLearning_grid_2_AutoML_2_20240725_92814_model_8 0.2819794 0.07951236 0.2064933 0.1988825
21 DeepLearning_grid_2_AutoML_2_20240725_92814_model_5 0.2820063 0.07952754 0.2121231 0.1978610
22 DeepLearning_grid_2_AutoML_2_20240725_92814_model_6 0.2822883 0.07968668 0.2119425 0.1980462
23 DeepLearning_grid_2_AutoML_2_20240725_92814_model_2 0.2825089 0.07981128 0.2117534 0.1984350
24 DeepLearning_grid_3_AutoML_2_20240725_92814_model_6 0.2837184 0.08049615 0.2144233 0.1994662
25 DeepLearning_grid_3_AutoML_2_20240725_92814_model_2 0.2851709 0.08132247 0.2159964 0.2004867
mean_residual_deviance training_time_ms predict_time_per_row_ms algo
1 0.07744858 68650 0.002784 DeepLearning
2 0.07748215 49599 0.001792 DeepLearning
3 0.07772835 57835 0.000938 DeepLearning
4 0.07777475 6065 0.000801 DeepLearning
5 0.07789542 61109 0.001801 DeepLearning
6 0.07792923 42846 0.000964 DeepLearning
7 0.07800917 56323 0.002620 DeepLearning
8 0.07809762 42195 0.000848 DeepLearning
9 0.07810439 46160 0.000901 DeepLearning
10 0.07812081 50825 0.000932 DeepLearning
11 0.07827267 35918 0.000727 DeepLearning
12 0.07838086 59867 0.002662 DeepLearning
13 0.07852353 70922 0.001768 DeepLearning
14 0.07868785 42617 0.000659 DeepLearning
15 0.07869876 33318 0.000567 DeepLearning
16 0.07884077 47686 0.000896 DeepLearning
17 0.07898557 57779 0.002641 DeepLearning
18 0.07921686 32970 0.002983 DeepLearning
19 0.07926996 48213 0.001068 DeepLearning
20 0.07951236 58303 0.002077 DeepLearning
21 0.07952754 36490 0.002079 DeepLearning
22 0.07968668 32234 0.002071 DeepLearning
23 0.07981128 45650 0.001051 DeepLearning
24 0.08049615 31288 0.003060 DeepLearning
25 0.08132247 36370 0.001444 DeepLearning

Model Details:
==============

H2ORegressionModel: deeplearning
Model ID: DeepLearning_grid_3_AutoML_2_20240725_92814_model_1
Status of Neuron Layers: predicting Incr_ld1, regression, gaussian distribution, Quadratic loss, 26,301 weights/biases, 322.6 KB, 5,303,214 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight
1 1 59 Input 15.00 % NA NA NA NA NA NA
2 2 100 RectifierDropout 10.00 % 0.000000 0.000000 0.006113 0.003477 0.000000 0.021984
3 3 100 RectifierDropout 10.00 % 0.000000 0.000000 0.004776 0.005183 0.000000 -0.045682
4 4 100 RectifierDropout 10.00 % 0.000000 0.000000 0.011180 0.017306 0.000000 -0.052729
5 5 1 Linear NA 0.000000 0.000000 0.000222 0.000107 0.000000 0.004774
weight_rms mean_bias bias_rms
1 NA NA NA
2 0.263607 -0.400754 0.478355
3 0.174167 0.734447 0.318895
4 0.110785 0.774907 0.187154
5 0.047707 0.131337 0.000000

H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9883 samples **

MSE: 0.07849261
RMSE: 0.2801653
MAE: 0.2052053
RMSLE: 0.1982642
Mean Residual Deviance : 0.07849261

H2ORegressionMetrics: deeplearning
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE: 0.07744858
RMSE: 0.2782959
MAE: 0.2049889
RMSLE: 0.1960928
Mean Residual Deviance : 0.07744858

Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
aic NA 0.000000 NA NA NA NA NA
loglikelihood NA 0.000000 NA NA NA NA NA
mae 0.204989 0.001238 0.203504 0.204473 0.206783 0.204633 0.205551
mean_residual_deviance 0.077449 0.000591 0.077381 0.076883 0.078139 0.076875 0.077964
mse 0.077449 0.000591 0.077381 0.076883 0.078139 0.076875 0.077964
r2 0.126369 0.004540 0.123481 0.126646 0.124739 0.134074 0.122906
residual_deviance 0.077449 0.000591 0.077381 0.076883 0.078139 0.076875 0.077964
rmse 0.278294 0.001061 0.278174 0.277278 0.279534 0.277263 0.279221
rmsle 0.196091 0.000821 0.196761 0.195110 0.196293 0.195360 0.196934

Variable Importances:
variable relative_importance scaled_importance percentage
1 PTIRat 1.000000 1.000000 0.037031
2 ROA 0.835835 0.835835 0.030952
3 dTotAsset 0.811024 0.811024 0.030033
4 DivCash 0.792220 0.792220 0.029337
5 dPTIRat 0.771353 0.771353 0.028564
```

All models are deep learning models. The best model has 5 layers. The first (input) layer has 59 units, layers 2 through 4 (the hidden layers) have 100 units each, and the output layer has 1 unit. In the cross-validation tests the R^2 is 12.6%. The most important variables are PTIRat, ROA, etc.

It is also possible to estimate a binary (classification) model by setting the outcome variable to be a factor.

### Hyperparameter tuning

In the previous sections, we allowed automl to estimate up to 25 models and selected the best one. For models with hyperparameters to be tuned, there are many possible models that could be estimated with different hyperparameters. However, 25 is an arbitrary number that does not capture all possible models that could be estimated. The best model is only the best of the 25 models that were estimated. We could allow automl to estimate 1000 models and perhaps we would get the very best possible model. However, this is computationally intensive and may not be necessary. So how do we deal with the tradeoff between trying all possible models with different hyperparameters and the computational cost of estimating all of them? Here are two possible ways to address the issue.

Approach (1)

1. Start with a broad set of algorithms with a limited number of models estimated (limit the hyperparameter search).
2. Choose the best algorithm from the initial trial.
3. Select only the best algorithm and estimate more models with this algorithm.

Approach (2)

1. Start with a broad set of algorithms with a limited number of models estimated.
2. Use the variation in how well the estimated models fit to gauge how much estimating additional models is likely to improve fit. If there is large variation in the fit of the models, then more models should be estimated. If there is little variation in the fit of the models, then estimating more models is unlikely to provide a meaningfully better model.
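
A minimal sketch of approach (1), assuming the objects created earlier: if GBM produced the best models in the broad first pass, restrict a second AutoML run to GBM and allow it to try more hyperparameter combinations (the `max_models` value here is illustrative):

```r
# approach (1): focus the search on the algorithm that did best in the first pass
mdlres_gbm <- h2o.automl(x = x, y = y1,
                         training_frame = trn.h2o,
                         include_algos = c("GBM"),  # only the best-performing algorithm
                         max_models = 100,          # search more hyperparameter combinations
                         seed = 1)

h2o.get_leaderboard(mdlres_gbm)
```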

## Ensemble methods

The final “model” we will discuss is a combination of models called an ensemble model. Extensions to tree models (e.g. random forests) are examples of specific ensemble models. Ensemble models combine the predictions from multiple models into a single prediction. This can be done by averaging the predictions or by weighting models with better fit more heavily.

Often we do not know the best model (or hyperparameters), or we are worried about overfitting for a single model. Combining predictions from multiple models can help reduce the risk of overfitting or the risk of choosing the wrong model.

h2o also has an ensemble model algorithm that can combine the predictions from multiple models.

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     include_algos = c("GLM", "GBM", "StackedEnsemble"),
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

bmdl <- h2o.get_best_model(mdlres)
bmdl

#perf <- h2o.performance(bmdl)
#plot(perf, type = "roc")

#vi <- h2o.varimp(bmdl)
#vi # can't generate variable importance for an ensemble; use with individual models
```
```r
Time difference of 26.46948 mins

model_id rmse mse mae
1 StackedEnsemble_AllModels_1_AutoML_2_20240725_164020 0.2730671 0.07456566 0.1985139
2 GBM_grid_1_AutoML_2_20240725_164020_model_16 0.2746579 0.07543696 0.2003546
3 StackedEnsemble_BestOfFamily_1_AutoML_2_20240725_164020 0.2746774 0.07544769 0.2009733
4 GBM_2_AutoML_2_20240725_164020 0.2746969 0.07545841 0.2005104
5 GBM_5_AutoML_2_20240725_164020 0.2747959 0.07551280 0.2009196
6 GBM_3_AutoML_2_20240725_164020 0.2748562 0.07554591 0.2004910
7 GBM_1_AutoML_2_20240725_164020 0.2749182 0.07558001 0.2002622
8 GBM_grid_1_AutoML_2_20240725_164020_model_19 0.2749714 0.07560929 0.2005038
9 GBM_grid_1_AutoML_2_20240725_164020_model_3 0.2754047 0.07584776 0.2016921
10 GBM_4_AutoML_2_20240725_164020 0.2754724 0.07588507 0.2007943
11 GBM_grid_1_AutoML_2_20240725_164020_model_14 0.2757521 0.07603921 0.2023507
12 GBM_grid_1_AutoML_2_20240725_164020_model_2 0.2760375 0.07619669 0.2027294
13 GBM_grid_1_AutoML_2_20240725_164020_model_12 0.2763491 0.07636881 0.2032937
14 GBM_grid_1_AutoML_2_20240725_164020_model_6 0.2768266 0.07663296 0.2039455
15 GBM_grid_1_AutoML_2_20240725_164020_model_4 0.2772119 0.07684645 0.2027568
16 GBM_grid_1_AutoML_2_20240725_164020_model_9 0.2774313 0.07696813 0.2050022
17 GBM_grid_1_AutoML_2_20240725_164020_model_10 0.2778972 0.07722684 0.2029337
18 GBM_grid_1_AutoML_2_20240725_164020_model_7 0.2779019 0.07722948 0.2055389
19 GBM_grid_1_AutoML_2_20240725_164020_model_17 0.2779548 0.07725885 0.2058923
20 GBM_grid_1_AutoML_2_20240725_164020_model_11 0.2781791 0.07738361 0.2060555
21 GBM_grid_1_AutoML_2_20240725_164020_model_5 0.2785413 0.07758528 0.2030808
22 GBM_grid_1_AutoML_2_20240725_164020_model_15 0.2795151 0.07812869 0.2043051
23 GBM_grid_1_AutoML_2_20240725_164020_model_1 0.2799180 0.07835411 0.2047814
24 GBM_grid_1_AutoML_2_20240725_164020_model_8 0.2799291 0.07836030 0.2047598
25 GBM_grid_1_AutoML_2_20240725_164020_model_18 0.2807808 0.07883785 0.2058048
26 GLM_1_AutoML_2_20240725_164020 0.2836039 0.08043114 0.2165695
27 GBM_grid_1_AutoML_2_20240725_164020_model_13 0.2838149 0.08055089 0.2078455
rmsle mean_residual_deviance training_time_ms predict_time_per_row_ms algo
1 0.1925072 0.07456566 4623 0.021040 StackedEnsemble
2 0.1934716 0.07543696 7141 0.002292 GBM
3 0.1934935 0.07544769 1722 0.002321 StackedEnsemble
4 0.1935189 0.07545841 8765 0.002137 GBM
5 0.1935857 0.07551280 8468 0.002290 GBM
6 0.1935980 0.07554591 8455 0.002101 GBM
7 0.1935652 0.07558001 12381 0.001934 GBM
8 0.1936500 0.07560929 6908 0.002525 GBM
9 0.1940060 0.07584776 5733 0.002379 GBM
10 0.1939629 0.07588507 8938 0.001687 GBM
11 0.1942619 0.07603921 4267 0.002730 GBM
12 0.1944408 0.07619669 8216 0.002406 GBM
13 0.1946742 0.07636881 4183 0.001089 GBM
14 0.1949944 0.07663296 2757 0.001247 GBM
15 0.1950357 0.07684645 10625 0.001687 GBM
16 0.1954172 0.07696813 2565 0.001280 GBM
17 0.1954665 0.07722684 8949 0.001555 GBM
18 0.1956640 0.07722948 5371 0.002241 GBM
19 0.1957420 0.07725885 3859 0.001344 GBM
20 0.1958886 0.07738361 1989 0.001542 GBM
21 0.1957543 0.07758528 25266 0.001331 GBM
22 0.1964334 0.07812869 19016 0.001727 GBM
23 0.1967039 0.07835411 10751 0.001516 GBM
24 0.1966145 0.07836030 20188 0.001407 GBM
25 0.1972457 0.07883785 6016 0.001260 GBM
26 0.1991977 0.08043114 491 0.000101 GLM
27 0.1990124 0.08055089 25828 0.001400 GBM

Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_1_AutoML_2_20240725_164020
Model Summary for Stacked Ensemble:
key value
1 Stacking strategy cross_validation
2 Number of base models (used / total) 14/25
3 # GBM base models (used / total) 14/24
4 # GLM base models (used / total) 0/1
5 Metalearner algorithm GLM
6 Metalearner fold assignment scheme Random
7 Metalearner nfolds 5
8 Metalearner fold_column NA
9 Custom metalearner hyperparameters None

H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE: 0.04725273
RMSE: 0.2173769
MAE: 0.1579406
RMSLE: 0.1558698
Mean Residual Deviance : 0.04725273

 

H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE: 0.07456566
RMSE: 0.2730671
MAE: 0.1985139
RMSLE: 0.1925072
Mean Residual Deviance : 0.07456566

Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
aic 15042.765000 412.100430 14620.218000 15012.034000 15232.792000 15637.562500
loglikelihood 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
mae 0.198479 0.000899 0.197382 0.198711 0.198984 0.199569
mean_residual_deviance 0.074563 0.000505 0.074013 0.074533 0.074767 0.075300
mse 0.074563 0.000505 0.074013 0.074533 0.074767 0.075300
null_deviance 5505.042000 24.612093 5481.345000 5520.609400 5528.384000 5519.328600
r2 0.158902 0.003097 0.159470 0.161961 0.159128 0.153718
residual_deviance 4630.086000 29.654886 4607.133000 4626.420000 4648.288600 4670.422400
rmse 0.273061 0.000924 0.272052 0.273008 0.273436 0.274409
rmsle 0.192504 0.000741 0.192035 0.192403 0.192418 0.193766
cv_5_valid
aic 14711.219000
loglikelihood 0.000000
mae 0.197748
mean_residual_deviance 0.074202
mse 0.074202
null_deviance 5475.543000
r2 0.160232
residual_deviance 4598.164600
rmse 0.272401
rmsle 0.191899
```
We see here that the best model is the stacked ensemble model, which uses 14 of the 25 models estimated. Only 14 models were used because AutoML builds ensemble models at several points during the estimation process (notice that model 3 on the leaderboard is a different ensemble model). We could also estimate separate models ourselves and then create our own ensemble from base models that we select.

The R^2 for the ensemble model (metalearner) is 15.9%. If this ensemble reduces overfitting, we should see better performance on the test set than for the other models.
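
If we wanted to build our own ensemble from models we choose, h2o provides `h2o.stackedEnsemble()`. The sketch below shows one way this might look; the base models must keep their cross-validation predictions and share the same fold assignment so the metalearner can be fit, and the models here are illustrative rather than models estimated in the chapter.

```r
# A hedged sketch of a hand-picked stacked ensemble in h2o.
# Assumes x, y1, and trn.h2o are defined as earlier in the chapter.
gbm1 <- h2o.gbm(x = x, y = y1, training_frame = trn.h2o,
                nfolds = 5, fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE, seed = 1)

glm1 <- h2o.glm(x = x, y = y1, training_frame = trn.h2o,
                nfolds = 5, fold_assignment = "Modulo",
                keep_cross_validation_predictions = TRUE, seed = 1)

# Combine the chosen base models; the metalearner learns how to
# weight each base model's predictions.
ens <- h2o.stackedEnsemble(x = x, y = y1, training_frame = trn.h2o,
                           base_models = list(gbm1, glm1))
ens
```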

## Returning to AutoML

What could we do to improve a model?

For each of the models we have discussed, we could perhaps improve performance by further tuning hyperparameters, or we could run more models to search for a better fit.

Perhaps more important is to consider the data. We could collect more data, engineer new features, or improve the features we do have. For models using accounting data, this is usually the most important option. When there is already an abundance of data, improving the model itself may be the better option. For example, some of the leading large language models have been trained on nearly the entire universe of known text; in that case collecting more data is not much of an option. The most completely trained models have hundreds of billions or more parameters, i.e., weights in the network model ([see here](https://www.cnet.com/tech/services-and-software/gpt-4o-and-gemini-1-5-pro-how-the-new-ai-models-compare/)). Training these types of models requires tremendous computational resources and a long time, and is beyond all but the most well funded organizations.
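
As a hedged illustration of feature engineering, the sketch below adds two new features to the training data before rebuilding the h2o frame. The column names `net_income` and `total_assets` are hypothetical stand-ins for whatever accounting items are available, not variables defined in the chapter.

```r
# Illustrative feature engineering before modeling (column names are hypothetical).
# A scaled profitability measure may be more informative than either raw amount.
train$roa   <- train$net_income / train$total_assets
train$logTA <- log(1 + train$total_assets)  # dampen the effect of very large firms

# Add the new features to the predictor list and rebuild the h2o frame
x <- union(x, c("roa", "logTA"))
trn.h2o <- as.h2o(train)
```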

Let's return to our problem. We have looked at a continuous model and a binary model. We will focus on the continuous variable model. The best model we have seen so far on the training data set has an R^2 of around 15%. Let's give training a little more time to see if we can improve on the model. Importantly, if there is a model that runs quickly and achieves a similar fit, then a model that takes a long time to run does not offer a meaningful improvement.

Below is an attempt to see if giving more time helps. The longest our models have run is slightly less than 2 hours. Here I will let it run four hours and see what happens.

```r
# here estimating some base models
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     max_runtime_secs = (60*60*4), # 4 hours
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb) # printing the leaderboard without n = nrow(lb) prints just the top of the leaderboard

bmdl <- h2o.get_best_model(mdlres)
bmdl

#perf <- h2o.performance(bmdl)
#plot(perf, type = "roc")

#vi <- h2o.varimp(bmdl)
#vi # can't generate for an ensemble; use with individual models
```
```r
Time difference of 4.165661 hours

model_id rmse mse mae rmsle
1 StackedEnsemble_AllModels_5_AutoML_1_20240726_72627 0.2725643 0.07429127 0.1976803 0.1920841
2 StackedEnsemble_AllModels_6_AutoML_1_20240726_72627 0.2727979 0.07441867 0.1980682 0.1923286
3 StackedEnsemble_AllModels_3_AutoML_1_20240726_72627 0.2729367 0.07449445 0.1978402 0.1924359
4 StackedEnsemble_AllModels_4_AutoML_1_20240726_72627 0.2729421 0.07449740 0.1978431 0.1924394
5 StackedEnsemble_Best1000_1_AutoML_1_20240726_72627 0.2729463 0.07449971 0.1978462 0.1924415
6 StackedEnsemble_AllModels_2_AutoML_1_20240726_72627 0.2734387 0.07476873 0.1984623 0.1927523
mean_residual_deviance training_time_ms predict_time_per_row_ms algo
1 0.07429127 82710 0.238670 StackedEnsemble
2 0.07441867 30380 0.083307 StackedEnsemble
3 0.07449445 7588 0.055163 StackedEnsemble
4 0.07449740 7460 0.054903 StackedEnsemble
5 0.07449971 8473 0.057772 StackedEnsemble
6 0.07476873 907 0.017133 StackedEnsemble

[176 rows x 9 columns]

Model Details:
==============

H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_5_AutoML_1_20240726_72627
Model Summary for Stacked Ensemble:
key value
1 Stacking strategy cross_validation
2 Number of base models (used / total) 155/157
3 # GBM base models (used / total) 129/129
4 # DRF base models (used / total) 2/2
5 # DeepLearning base models (used / total) 23/25
6 # GLM base models (used / total) 1/1
7 Metalearner algorithm GBM
8 Metalearner fold assignment scheme Random
9 Metalearner nfolds 5
10 Metalearner fold_column NA
11 Custom metalearner hyperparameters None

H2ORegressionMetrics: stackedensemble
** Reported on training data. **

MSE: 0.03670929
RMSE: 0.1915967
MAE: 0.1417174
RMSLE: 0.1365716
Mean Residual Deviance : 0.03670929

 

H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE: 0.07429127
RMSE: 0.2725643
MAE: 0.1976803
RMSLE: 0.1920841
Mean Residual Deviance : 0.07429127

Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
aic NA 0.000000 NA NA NA NA NA
loglikelihood NA 0.000000 NA NA NA NA NA
mae 0.197680 0.000495 0.197612 0.197295 0.198393 0.197172 0.197929
mean_residual_deviance 0.074291 0.000365 0.074409 0.073979 0.074772 0.073875 0.074420
mse 0.074291 0.000365 0.074409 0.073979 0.074772 0.073875 0.074420
r2 0.161982 0.003333 0.156856 0.164121 0.161947 0.165618 0.161371
residual_deviance 0.074291 0.000365 0.074409 0.073979 0.074772 0.073875 0.074420
rmse 0.272563 0.000669 0.272780 0.271991 0.273445 0.271800 0.272800
rmsle 0.192083 0.000510 0.192176 0.191776 0.192876 0.191527 0.192061
```

The 176 rows in the leaderboard mean that AutoML estimated 176 models in slightly more than 4 hours. The best models that we see are the stacked ensemble models.

The best stacked ensemble model is fit on 155 base models. The R^2 for the best model is 16.2%, the best training data fit that we have seen. However, 4 hours of training improved the fit by only approximately 1 percentage point. At some point running more models should stop improving the fit. The question we cannot answer without trying is whether 8 hours would improve the model even more. What we do know is that doubling the run time did not double the fit, so we may be facing diminishing returns to running more models.
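
One practical way to manage this tradeoff is to cap the search explicitly. The sketch below, under the same setup as earlier in the chapter, stops AutoML when either a model count or a time budget is reached, whichever comes first; the specific limits are arbitrary choices for illustration.

```r
# Cap the AutoML search by both model count and wall-clock time (illustrative limits).
mdlres <- h2o.automl(x = x, y = y1,
                     training_frame = trn.h2o,
                     max_models = 50,             # stop after 50 models...
                     max_runtime_secs = 60*60*2,  # ...or after 2 hours, whichever comes first
                     seed = 1)
```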

### Test data

Let’s presume now that we are satisfied with our training data efforts. We can now turn to the testing data.

```r
library(DALEXtra)
tst.h2o <- as.h2o(test)
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, x],
  y = tst.h2o[, y1])
model_performance(expln)

varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)

pdp <- model_profile(expln)
plot(pdp)
```

```r
Measures for: regression
mse : 0.08337036
rmse : 0.2887393
r2 : 0.124274
mad : 0.1399124
```

Notice that the fit for the testing data, 12.4%, is not as high as for the training data, 16.2%. This is common because models often fit random variation in the training data that is not present in the testing data. In the extreme case, a model will perform well on the training data and be useless or even misleading on the testing data. This problem is referred to as overfitting, as we have discussed. It is possible that if we continued training the model we might improve the fit on the testing data; on the other hand, the chance increases that we fit random variation in the training data.
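
To see where the 12.4% comes from, R^2 on the test data can be computed directly from the predictions as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, assuming `bmdl`, `tst.h2o`, and `test` are defined as above and that `y1` holds the name of the outcome column:

```r
# Compute test-set R^2 by hand from the model's predictions (a sketch).
pred   <- as.data.frame(h2o.predict(bmdl, tst.h2o))$predict
actual <- test[[y1]]

sse <- sum((actual - pred)^2)          # residual sum of squares
sst <- sum((actual - mean(actual))^2)  # total sum of squares
r2  <- 1 - sse / sst
r2
```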

The feature importance plot is shown below.

![Feature Importance](varimpSM.png)

The feature importance plot shows that PTIRat and dTotAsset are among the most important variables.

![Partial dependence plot](pdpSM.png)

The partial dependence plot is a little hard to take in because there are so many variables. However, the panel for PTIRat shows that as PTIRat increases, the prediction for `dIncr` decreases. This suggests that a higher PTIRat (pre-tax income to total assets) is associated with smaller increases, or even decreases, in future earnings. Other variables can be interpreted similarly.
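
If the full plot is too crowded, we can profile one variable at a time. A minimal sketch, assuming the explainer `expln` from above and that `PTIRat` is the column name used in the data:

```r
# Partial dependence for a single variable (assumes expln is defined above).
pdp_pti <- model_profile(expln, variables = "PTIRat")
plot(pdp_pti)
```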

Finally, if we wanted to create or use individual predictions, we could output the predictions.

```r
prediction <- h2o.predict(bmdl, tst.h2o)
test$pred_misstate <- as.data.frame(prediction)$predict
```
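
These predictions can then be used like any other column, for example to rank observations or to examine the largest prediction errors. A short hedged illustration, assuming the actual outcome column in `test` is named `dIncr`:

```r
# Using the saved predictions (illustrative; assumes the outcome column is dIncr).
# Observations with the largest predicted values
head(test[order(-test$pred_misstate), c("dIncr", "pred_misstate")])

# Observations with the largest absolute prediction errors
test$abs_err <- abs(test$dIncr - test$pred_misstate)
head(test[order(-test$abs_err), c("dIncr", "pred_misstate", "abs_err")])
```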

And as always shutdown h2o.

```r
h2o.shutdown()
```

## Review

This chapter discussed common model types and how to estimate them using h2o. The modeling algorithms included linear regression, logistic regression, tree models, deep learning models, and ensemble models. The chapter discussed hyperparameter tuning and the tradeoff between computational cost and model fit. The chapter also discussed the importance of the data in model fit.

### Conceptual questions
1. Describe how a linear regression model works.
2. Describe how a logistic regression model works.
3. Describe how a tree model works.
4. Describe how a deep learning model works.
5. How and why do hyperparameters need to be tuned?
6. How does h2o fit hyperparameters?
7. How does h2o differentiate between continuous and binary outcome variables and models?

### Practice questions

1. What are some concerns when modeling data?
2. What does correlation does not mean causation mean?
3. What is the difference between a linear regression model and a logistic regression model?
4. What is the difference between a tree model and a deep learning model?
5. What is an ensemble model?
6. What are hyperparameters?
7. From the data in the chapter, estimate a large number of models for `dIncr` as a binary outcome variable.
8. Evaluate the best model from 7 on the test data.

### Solutions to practice questions

1. Survivorship, sample selection, look ahead bias, confirmation bias, and overfitting are some concerns when modeling data.
2. Correlation does not mean causation means that just because two variables are correlated does not mean that one causes the other.
3. A linear regression model estimates a continuous variable while a logistic regression model estimates a binary variable.
4. A tree model is a series of if-then statements that split the data into groups. A deep learning model is a series of layers of nodes that are connected to each other.
5. An ensemble model is a model that combines the predictions of multiple models.
6. Hyperparameters are parameters that are set before the model is estimated.
7. The code to estimate a large number of models for `dIncr` as a binary outcome variable is below.

```r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y2b,
                     training_frame = trn.h2o,
                     max_runtime_secs = (60*60*4),
                     seed = 1)
end.time <- Sys.time()
end.time - start.time

lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb)

bmdl <- h2o.get_best_model(mdlres)
bmdl

perf <- h2o.performance(bmdl)
plot(perf, type = "roc")
```

8. The code to evaluate the best model from 7 on the test data is below.

```r
tst.h2o <- as.h2o(test)
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, x],
  y = tst.h2o[, y2b])
model_performance(expln)
```

 

Tutorial video

Note: include practice and knowledge checks

Mini-case video

Note: include practice and knowledge checks
