15 Supervised models: parametric models
Learning Objectives
- Explain linear regression methods and interpretation
- Explain logistic regression
- Describe accounting applications of regression models
- Use R packages to estimate and interpret regression models
Chapter content
The description “parametric models” traditionally refers to models with a finite number of parameters that can be solved for mathematically. These models are useful for several reasons. First, they are relatively fast, requiring little computational time and power. Second, because of their model derivations, they are interpretable. Machine learning models, on the other hand, refer to models whose number of parameters is limited only by data and computational constraints.
The primary purpose of this chapter and the following two chapters is to provide more details on supervised models so that you can understand and interpret them and apply them in different ways. A complete understanding of every method, even the ones discussed in this chapter, is not possible in a few pages of text. Therefore, the goal is to provide enough information to get you started and a starting point for further study.
This chapter will discuss linear regression and logistic regression. The next chapter will discuss classification and regression trees, and the following chapter will discuss neural networks. Note that for the machine learning methods, tuning parameters can be adjusted to search for the best fitting models. Other than a cursory introduction, these chapters will not spend time on these tuning parameters. The purpose of these chapters is to develop a conceptual understanding of the models.
Linear regression
A linear regression model was introduced in the introduction to modeling chapter. The model is a linear combination of the features. The model is called ordinary least squares regression because it is estimated by minimizing the sum of the squared differences between the predicted values and the actual values. The algorithm for the linear regression model is:
1. Propose a linear model by choosing the variables. The proposed model is then given by the following:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

The $\beta$s are unknown and are to be estimated by the model. The $\varepsilon$ is the error term that gives how much the model predicted value differs from the actual value.

2. Estimate the $\beta$s by minimizing the sum of the squared differences between the predicted values and the actual values. The sum of the squared differences is given by:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. The predicted value is given by the linear model. The $\beta$s are traditionally estimated with mathematical formulas but can be estimated just as well with numerical optimization (i.e., trying different $\beta$s until the sum of the squared differences is minimized), as in the sketch below.
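To make the estimation concrete, here is a minimal sketch in base R, using the built-in mtcars data rather than the chapter's dataset, that estimates the $\beta$s two ways: with the closed-form lm() function and by numerically searching for the $\beta$s that minimize the sum of the squared differences.

# OLS two ways: closed-form via lm() and numerical minimization via optim()
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sum of squared differences between actual and model-predicted values
sse <- function(b) {
  pred <- b[1] + b[2] * mtcars$wt + b[3] * mtcars$hp
  sum((mtcars$mpg - pred)^2)
}

# Try different betas until the sum of squared differences is minimized
opt <- optim(c(0, 0, 0), sse, method = "BFGS")

coef(fit)  # closed-form estimates
opt$par    # numerical estimates (essentially identical)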
There are two primary benefits of linear regression models. First, the model is understandable via the estimated coefficients. The $\beta$s can be interpreted as partial derivatives, meaning that $\beta_j$ gives the change in the dependent variable for a one unit change in the associated $x_j$ variable, holding all other $x$ variables constant. Second, because the model has a predefined form, it can be estimated quickly. The primary drawback of linear regression models is that they are limited to the proposed linear relationships.
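Continuing the illustrative mtcars fit from the sketch above, each coefficient is read while holding the other variables constant:

coef(fit)
# (Intercept)          wt          hp
#    37.22727    -3.87783    -0.03177
# Holding hp constant, a one unit increase in wt (1,000 lbs) is associated
# with a decrease of about 3.88 in mpg; holding wt constant, one additional
# horsepower is associated with a decrease of about 0.03 in mpg.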
There are many helpful books and videos on regression (one helpful example: https://www.youtube.com/watch?v=i3IadpjctWg).
Logistic regression
Linear regression has limitations when the outcome variable is not continuous, such as a binary outcome variable. These limitations led to the development of other models. A logistic regression is a model designed for binary outcome variables. The logistic regression model is a linear model that is transformed into a probability. The logistic regression model has the same benefits as the linear regression model: the model is understandable via the estimated coefficients.
The logistic regression model is estimated by maximizing the likelihood of the observed data. The algorithm for the logistic regression model is:
1. Propose a linear model by choosing the variables. The proposed model is then given by the following:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}$$

The $\beta$s are unknown and are to be estimated by the model. The $p$ is the probability that the binary outcome variable is 1. The $e$ is the base of the natural logarithm.

2. Estimate the $\beta$s by maximizing the likelihood of the observed data on the training data. The likelihood is the probability of observing the data given the model. The likelihood is maximized by adjusting the $\beta$s. The intuition is similar to that of the linear regression model: the model is trying to find the $\beta$s that best fit the data. However, the estimation approach is different. Maximum likelihood estimation is covered in other resources (see the references at the end of the chapter). Note that the linear regression coefficients can also be estimated in the same manner.

3. The predicted value for training and testing data is given by the probability function above and the estimated $\beta$s.
The estimated $\beta$s are similar to those in the linear regression model. However, the partial derivative interpretation is different. Additional resources for interpreting the coefficients are provided at the end of the chapter; the sketch below also shows one common interpretation via odds ratios.
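As a small illustration, again with the built-in mtcars data rather than the chapter's dataset, a logistic regression can be estimated with base R's glm(), and exponentiating a coefficient gives an odds ratio:

# Model the probability that a car has a manual transmission (am = 1)
lfit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(lfit)

# Predicted probabilities from the fitted model
head(predict(lfit, type = "response"))

# exp(coefficient) gives the multiplicative change in the odds of am = 1
# for a one unit change in wt
exp(coef(lfit))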
There are many helpful books and videos on logistic regression (one helpful example: https://www.youtube.com/watch?v=Ax5kqLHls-I).
Example in R
The data for this chapter will be from research that takes a shareholder perspective to try to predict differences across companies in the change in future earnings and in the probability of future earnings increases.
The most related papers from which the data is created are linked here: Ou and Penman 1989 (https://www.sciencedirect.com/science/article/pii/0165410189900177),
Chen et al 2022 (https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12429).
The csv file for this chapter is available here: https://www.dropbox.com/scl/fi/jrrnvo9xeyud863q63cpr/dEarningsPred.csv?rlkey=nqv9in8ukx4xxhf9cr78h0wxp&dl=0. The data is from Compustat – annual financial statement information. The two variables we will be trying to model, i.e. predict, are `DIncr_ld1` and `Incr_ld1`. The first is a binary variable indicating whether year t+1 earnings will be higher than year t earnings. The second is the percentage change in earnings from year t to year t+1, scaled by the company’s market value of equity at the end of fiscal year t. The other features are financial statement analysis variables from the papers linked above. The features have been winsorized and standardized each fiscal year, as sketched below.
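The posted file already has the winsorizing and standardizing done. For the curious, here is a sketch of how such a step might look with dplyr; the 1st/99th percentile cutoffs and the raw data frame name raw are assumptions for illustration, not necessarily what the original papers used:

library(tidyverse)

# Clip extreme values at the given percentiles (winsorize)
winsorize <- function(x, probs = c(0.01, 0.99)) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

# Winsorize and then standardize each feature within each fiscal year
raw %>%
  group_by(fyear) %>%
  mutate(across(CurRat:NiCf, ~ as.numeric(scale(winsorize(.x))))) %>%
  ungroup()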
Assuming that the data is imported into R as “df”, the following code loads the needed packages (note that h2o must be loaded before h2o.init() is called), starts the h2o environment, and splits the data into training and testing sets.
library(h2o)
library(DALEX)
library(DALEXtra)
library(tidyverse)
library(parallel)

# Start a local h2o cluster with 8 threads
h2o.init(nthreads = 8)

# Training set: fiscal years before 2010; testing set: fiscal years after 2010
train <- df %>%
  filter(fyear < 2010)
test <- df %>%
  filter(fyear > 2010)
rm(df)
Linear regression
Estimate a linear model with the continuous dependent variable, keeping only the y and x variables and sending them to the h2o environment.
tmp <- train %>%
  select(Incr_ld1, CurRat:NiCf)
trn.h2o <- as.h2o(tmp)

# Restrict AutoML to GLM so only the linear model is estimated
mdlres <- h2o.automl(y = "Incr_ld1",
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     seed = 1)
GLM stands for generalized linear model. The output shows that the linear model estimates much faster than other models or combinations of models. AutoML also estimates only a single model because, once the linear model is specified, there are no tuning parameters to search over.
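As an aside, the same linear model can be fit directly with h2o's GLM interface instead of going through AutoML. A minimal sketch, assuming trn.h2o is still loaded (lambda = 0 turns off h2o's default regularization so the fit matches ordinary least squares):

glm_mdl <- h2o.glm(y = "Incr_ld1",
                   training_frame = trn.h2o,
                   family = "gaussian",
                   lambda = 0)
h2o.coef(glm_mdl)  # estimated coefficients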
Explore the estimated model and its performance on the training data.
# Leaderboard of the models AutoML estimated
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

# Best model, its training performance, and variable importance
bmdl <- h2o.get_best_model(mdlres)
bmdl
perf <- h2o.performance(bmdl)
perf
vi <- h2o.varimp(bmdl)
vi
Retrieve predictions and explore testing data performance.
tmp <- test %>%
  select(Incr_ld1, CurRat:NiCf)
tst.h2o <- as.h2o(tmp)

# Predictions on the testing data
pred.h2o <- h2o.predict(bmdl, tst.h2o)
pred_Incr <- as.data.frame(pred.h2o)$predict
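Before turning to DALEX, a quick check of out-of-sample fit can be computed directly from the predictions (this assumes tmp still holds the test rows selected above):

# Root mean squared error on the testing data
sqrt(mean((tmp$Incr_ld1 - pred_Incr)^2, na.rm = TRUE))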
# Build a DALEX explainer: features only in data, actual outcomes in y
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, -1],
  y = tst.h2o[, 1])
model_performance(expln)

# Permutation-based variable importance and partial dependence profiles
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
pdp <- model_profile(expln)
plot(pdp)
Logistic regression
A logistic regression can be estimated in the same way as the linear model. If the outcome variable is a factor, AutoML will estimate a logistic regression.
# Convert the binary outcome to a factor so AutoML treats this as classification
tmp <- train %>%
  mutate(fDIncr_ld1 = as.factor(DIncr_ld1)) %>%
  select(fDIncr_ld1, CurRat:NiCf)
trn.h2o <- as.h2o(tmp)

mdlres <- h2o.automl(y = "fDIncr_ld1",
                     training_frame = trn.h2o,
                     include_algos = c("GLM"),
                     seed = 1)
# Leaderboard, best model, training performance, and variable importance
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))
bmdl <- h2o.get_best_model(mdlres)
bmdl
perf <- h2o.performance(bmdl)
perf
vi <- h2o.varimp(bmdl)
vi
tmp <- test %>%
  select(DIncr_ld1, CurRat:NiCf)
tst.h2o <- as.h2o(tmp)

# Predicted classes (and class probabilities) on the testing data
pred.h2o <- h2o.predict(bmdl, tst.h2o)
pred_DIncr <- as.data.frame(pred.h2o)$predict
# Build a DALEX explainer for the logistic model
expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[, -1],
  y = tst.h2o[, 1])
model_performance(expln)

# Permutation-based variable importance and partial dependence profiles
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
pdp <- model_profile(expln)
plot(pdp)
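As with the linear model, a quick base R check of the classification predictions is possible (again assuming tmp still holds the test rows selected above):

# Confusion matrix and simple accuracy on the testing data
table(actual = tmp$DIncr_ld1, predicted = pred_DIncr)
mean(as.character(pred_DIncr) == as.character(tmp$DIncr_ld1), na.rm = TRUE)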
h2o.shutdown()
Tutorial video
Conclusion
Review
Mini-case video
References
https://www.youtube.com/watch?v=XepXtl9YKwc
Myung, I. J. 2003. “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology 47: 90–100. https://gribblelab.org/9040_FW22/files/Myung2003.pdf
https://www.youtube.com/watch?v=vN5cNN2-HWE