15 Supervised models: parametric models
Learning Objectives
- Explain linear regression methods and interpretation
- Explain logistic regression
- Describe accounting applications of regression models
- Use R packages to estimate and interpret regression models
Chapter content
The description “parametric models” traditionally refers to models with a finite number of parameters that can be solved for mathematically. These models are useful for several reasons. First, they are relatively fast, requiring little computational time and power. Second, because of their model derivations, they are interpretable. Machine learning models, on the other hand, refer to models whose number of parameters is limited only by data and computational constraints.
The primary purpose of this chapter and the following two chapters is to provide more details on supervised models so that you can understand and interpret them and apply them in different ways. A complete understanding of every method, even the ones discussed in this chapter, is not possible in a few pages of text. Therefore, the goal is to provide enough information to get you started and a starting point for further study.
This chapter will discuss linear regression and logistic regression. The next chapter will discuss classification and regression trees, and the following chapter will discuss neural networks. Note that for the machine learning methods, tuning parameters can be adjusted to search for the best fitting models. Other than a cursory introduction, these chapters will not spend time on these tuning parameters. The purpose of these chapters is to develop a conceptual understanding of the models.
Linear regression
A linear regression model was introduced in the introduction to modeling chapter. The model is a linear combination of the features. The model is called ordinary least squares regression because it is estimated by minimizing the sum of the squared differences between the predicted values and the actual values. The algorithm for the linear regression model is:
1. Propose a linear model by choosing the variables. The proposed model is then given by the following:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

The $\beta$s are unknown and are to be estimated by the model. The $\varepsilon$ is the error term that gives how much the model predicted value differs from the actual value.

2. Estimate the $\beta$s by minimizing the sum of the squared differences between the predicted values and the actual values. The sum of the squared differences is given by:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. The predicted value is given by the linear model. The $\beta$s are traditionally estimated with mathematical formulas but can be estimated just as well with numerical optimization (i.e., trying different $\beta$s until the sum of the squared differences is minimized), as in the sketch below.
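To make the estimation concrete, here is a minimal sketch in base R, using the built-in mtcars data rather than the chapter's dataset, that estimates the $\beta$s two ways: with the closed-form lm() function and by numerically searching for the $\beta$s that minimize the sum of the squared differences.

# OLS two ways: closed-form via lm() and numerical minimization via optim()
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sum of squared differences between actual and model-predicted values
sse <- function(b) {
  pred <- b[1] + b[2] * mtcars$wt + b[3] * mtcars$hp
  sum((mtcars$mpg - pred)^2)
}

# Try different betas until the sum of squared differences is minimized
opt <- optim(c(0, 0, 0), sse, method = "BFGS")

coef(fit)  # closed-form estimates
opt$par    # numerical estimates (essentially identical)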
There are two primary benefits of linear regression models. First, the model is understandable via the estimated coefficients. The $\beta$s can be interpreted as partial derivatives, meaning that $\beta_j$ gives the change in the dependent variable for a one unit change in the associated $x_j$ variable, holding all other $x$ variables constant. Second, because the model has a predefined form, it can be estimated quickly. The primary drawback of linear regression models is that they are limited to the proposed linear relationships.
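Continuing the illustrative mtcars fit from the sketch above, each coefficient is read while holding the other variables constant:

coef(fit)
# (Intercept)          wt          hp
#    37.22727    -3.87783    -0.03177
# Holding hp constant, a one unit increase in wt (1,000 lbs) is associated
# with a decrease of about 3.88 in mpg; holding wt constant, one additional
# horsepower is associated with a decrease of about 0.03 in mpg.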
There are many helpful books and videos on regression (one helpful example: https://www.youtube.com/watch?v=i3IadpjctWg).
Logistic regression
Linear regression has limitations when the outcome variable is not continuous, such as a binary outcome variable. These limitations led to the development of other models. A logistic regression is a model designed for binary outcome variables. The logistic regression model is a linear model that is transformed into a probability. The logistic regression model has the same benefits as the linear regression model: the model is understandable via the estimated coefficients.
The logistic regression model is estimated by maximizing the likelihood of the observed data. The algorithm for the logistic regression model is:
1. Propose a linear model by choosing the variables. The proposed model is then given by the following:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}$$

The $\beta$s are unknown and are to be estimated by the model. The $p$ is the probability that the binary outcome variable is 1. The $e$ is the base of the natural logarithm.

2. Estimate the $\beta$s by maximizing the likelihood of the observed data on the training data. The likelihood is the probability of observing the data given the model. The likelihood is maximized by adjusting the $\beta$s. The intuition is similar to that of the linear regression model: the model is trying to find the $\beta$s that best fit the data. However, the estimation approach is different. Maximum likelihood estimation is covered in other resources (see the references at the end of the chapter). Note that the linear regression coefficients can also be estimated in the same manner.

3. The predicted value for training and testing data is given by the probability function above and the estimated $\beta$s.
The estimated $\beta$s are similar to those in the linear regression model. However, the partial derivative interpretation is different. Additional resources for interpreting the coefficients are provided at the end of the chapter; the sketch below also shows one common interpretation via odds ratios.
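As a small illustration, again with the built-in mtcars data rather than the chapter's dataset, a logistic regression can be estimated with base R's glm(), and exponentiating a coefficient gives an odds ratio:

# Model the probability that a car has a manual transmission (am = 1)
lfit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(lfit)

# Predicted probabilities from the fitted model
head(predict(lfit, type = "response"))

# exp(coefficient) gives the multiplicative change in the odds of am = 1
# for a one unit change in wt
exp(coef(lfit))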
There are many helpful books and videos on logistic regression (one helpful example: https://www.youtube.com/watch?v=Ax5kqLHls-I).
Example in R
The data for this chapter will be from research that takes a shareholder perspective to try to predict differences across companies in the change in future earnings and in the probability of future earnings increases.
The most related papers from which the data is created are linked here: Ou and Penman 1989 (https://www.sciencedirect.com/science/article/pii/0165410189900177),
Chen et al 2022 (https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12429).
The csv file for this chapter is available here: https://www.dropbox.com/scl/fi/jrrnvo9xeyud863q63cpr/dEarningsPred.csv?rlkey=nqv9in8ukx4xxhf9cr78h0wxp&dl=0. The data is from Compustat – annual financial statement information. The two variables we will be trying to model, i.e. predict, are `DIncr_ld1` and `Incr_ld1`. The first is a binary variable indicating whether year t+1 earnings will be higher than year t earnings. The second is the percentage change in earnings from year t to year t+1, scaled by the company’s market value of equity at the end of fiscal year t. The other features are financial statement analysis variables from the papers linked above. The features have been winsorized and standardized each fiscal year, as sketched below.
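The posted file already has the winsorizing and standardizing done. For the curious, here is a sketch of how such a step might look with dplyr; the 1st/99th percentile cutoffs and the raw data frame name raw are assumptions for illustration, not necessarily what the original papers used:

library(tidyverse)

# Clip extreme values at the given percentiles (winsorize)
winsorize <- function(x, probs = c(0.01, 0.99)) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

# Winsorize and then standardize each feature within each fiscal year
raw %>%
  group_by(fyear) %>%
  mutate(across(CurRat:NiCf, ~ as.numeric(scale(winsorize(.x))))) %>%
  ungroup()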
Assuming that the data is imported into R as “df”, the following code loads the needed packages (note that h2o must be loaded before h2o.init() is called), starts the h2o environment, and splits the data into training and testing sets.
library(h2o)
library(DALEX)
library(DALEXtra)
library(tidyverse)
library(parallel)

# Start a local h2o cluster with 8 threads
h2o.init(nthreads = 8)

# Training set: fiscal years before 2010; testing set: fiscal years after 2010
train <- df %>%
  filter(fyear < 2010)
test <- df %>%
  filter(fyear > 2010)
rm(df)
Linear regression
Estimate a linear model with the continuous dependent variable, keeping only the y and x variables and sending them to the h2o environment.
tmp <- train %>%
  select(Incr_ld1, CurRat:NiCf)
trn.h2o <- as.h2o(tmp)

# Restrict AutoML to GLM so only the linear model is estimated
mdlres <- h2o.automl(y = "Incr_ld1",
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     seed = 1)
GLM stands for generalized linear model. The output shows that the linear model estimates much faster than other models or combinations of models. AutoML also estimates only a single model because, once the linear model is specified, there are no tuning parameters to search over.
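As an aside, the same linear model can be fit directly with h2o's GLM interface instead of going through AutoML. A minimal sketch, assuming trn.h2o is still loaded (lambda = 0 turns off h2o's default regularization so the fit matches ordinary least squares):

glm_mdl <- h2o.glm(y = "Incr_ld1",
                   training_frame = trn.h2o,
                   family = "gaussian",
                   lambda = 0)
h2o.coef(glm_mdl)  # estimated coefficients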
Explore the estimated model and its performance on the training data.
# Leaderboard of the models AutoML estimated
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

# Best model, its training performance, and variable importance
bmdl <- h2o.get_best_model(mdlres)
bmdl
perf <- h2o.performance(bmdl)
perf
vi <- h2o.varimp(bmdl)
vi
Retrieve predictions and explore testing data performance.
tmp <- test %>%
  select(Incr_ld1, CurRat:NiCf)
tst.h2o <- as.h2o(tmp)

# Predictions on the testing data
pred.h2o <- h2o.predict(bmdl, tst.h2o)
pred_Incr <- as.data.frame(pred.h2o)$predict
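Before turning to DALEX, a quick check of out-of-sample fit can be computed directly from the predictions (this assumes tmp still holds the test rows selected above):

# Root mean squared error on the testing data
sqrt(mean((tmp$Incr_ld1 - pred_Incr)^2, na.rm = TRUE))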
# Build a DALEX explainer: features only in data, actual outcomes in y
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, -1],
  y = tst.h2o[, 1])
model_performance(expln)

# Permutation-based variable importance and partial dependence profiles
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
pdp <- model_profile(expln)
plot(pdp)
Logistic regression
A logistic regression can be estimated in the same way as the linear model. If the outcome variable is a factor, AutoML will estimate a logistic regression.
# Convert the binary outcome to a factor so AutoML treats this as classification
tmp <- train %>%
  mutate(fDIncr_ld1 = as.factor(DIncr_ld1)) %>%
  select(fDIncr_ld1, CurRat:NiCf)
trn.h2o <- as.h2o(tmp)

mdlres <- h2o.automl(y = "fDIncr_ld1",
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     seed = 1)
# Leaderboard, best model, training performance, and variable importance
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))
bmdl <- h2o.get_best_model(mdlres)
bmdl
perf <- h2o.performance(bmdl)
perf
vi <- h2o.varimp(bmdl)
vi
tmp <- test %>%
  select(DIncr_ld1, CurRat:NiCf)
tst.h2o <- as.h2o(tmp)

# Predicted classes (and class probabilities) on the testing data
pred.h2o <- h2o.predict(bmdl, tst.h2o)
pred_DIncr <- as.data.frame(pred.h2o)$predict
# Build a DALEX explainer for the logistic model
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, -1],
  y = tst.h2o[, 1])
model_performance(expln)

# Permutation-based variable importance and partial dependence profiles
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
pdp <- model_profile(expln)
plot(pdp)
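As with the linear model, a quick base R check of the classification predictions is possible (again assuming tmp still holds the test rows selected above):

# Confusion matrix and simple accuracy on the testing data
table(actual = tmp$DIncr_ld1, predicted = pred_DIncr)
mean(as.character(pred_DIncr) == as.character(tmp$DIncr_ld1), na.rm = TRUE)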
h2o.shutdown()
Tutorial video
Conclusion
Review
Mini-case video
References
https://www.youtube.com/watch?v=XepXtl9YKwc
Myung, I. J. 2003. “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology 47: 90–100. https://gribblelab.org/9040_FW22/files/Myung2003.pdf
https://www.youtube.com/watch?v=vN5cNN2-HWE