"

12 AutoML with Continuous Dependent Variables

Learning Objectives

  • Describe the purpose of and steps for predictive modeling with continuous dependent variables
  • Describe examples of accounting-relevant continuous dependent variables
  • Use AutoML with continuous dependent variables

Chapter content

This chapter explains an automated approach to using many models for continuous dependent variables. It closely follows the previous chapter on binary dependent variables. The focus is on applying the modeling steps to continuous dependent variables without selecting or getting into the details of specific models. This approach puts the focus on the modeling process and concepts rather than on technical modeling details. The chapter first discusses concepts related to continuous dependent variable models and such models in accounting, and then applies H2O’s AutoML for automatic model selection and estimation.

Modeling continuous dependent variables

Continuous variables are variables that can take on any value (i.e. they are not counted or put into discrete classes). For example, the distance between one machine and another could be 10.23 feet, 11.5 feet, or 0.1 feet. Models for continuous outcome variables predict the magnitude of the outcome. For example, a model for next year’s earnings per share (EPS) for a public company might predict $3.25. The goal of the model is to predict EPS in new data as accurately as possible. The model could include independent variables that might predict EPS. These “x” variables could be things like prior EPS, investment, or industry.

Walking through the steps

The prior chapter walks through the modeling algorithm and steps for binary dependent variables. This chapter repeats the discussion for continuous dependent variables. Note that, for consistency, some of the text is identical to the prior chapter.

A key part of the modeling process is identifying patterns in a training data set and then using those patterns to make predictions on a new data set. This means that before beginning any modeling, it is necessary to identify data and separate them.

Training

Training data is used to find patterns that can predict outcomes in a new data set. In the ideal scenario, training data should contain features and patterns that are likely to apply to other data, specifically the data that will be used for prediction. Training data should be carefully separated from testing data because commingling the two can lead to statistical and inference problems.

There are various approaches to partitioning out training data. If we are creating a model to be used for real-time decisions, such as trading decisions, we might use all available data as training data and evaluate the model as we use it. Most of the time, we use a portion of the data as training data and the rest as testing data. In this way we can evaluate the model’s performance on data it has not seen before we use it to make real decisions. Approaches to partitioning the data include randomly selecting a portion of the data as training data and the rest as testing data, or using a time-based approach where data up to a certain point in time serve as training data and data after that point serve as testing data. Both approaches are sketched below.
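As a minimal sketch, the two partitioning approaches might look like the following in R. The data frame df and its year column are hypothetical placeholders for this illustration.

library(tidyverse)

# Random split: two-thirds training, the rest testing
set.seed(1111)
train <- df %>% sample_frac(0.67)
test  <- df %>% anti_join(train)

# Time-based split: train on earlier years, test on later years
# (assumes df has a year column)
train <- df %>% filter(year <= 2015)
test  <- df %>% filter(year > 2015)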

Once we separate data, the first of the modeling steps is finding patterns in training data. A model algorithm is a large part of this first step.

Model algorithm

1. Specify a pattern
2. Define fit
3. Find the parameters that maximize fit

Specify a pattern

By focusing on continuous dependent variables, we have already selected supervised modeling and we have specified the type of the dependent variable. Patterns include the model type. For continuous dependent variables, model types might include ordinary least squares, regression trees, neural networks, or other models. An ensemble model could include many different types of models. In automated methods of machine learning like the one used in this chapter, the modeling tool may try multiple models. Later chapters will spend more time on specific model types.

Historically, the most common model, because of its mathematical tractability and computational speed, is the linear model, in particular ordinary least squares regression. The model is a linear combination of the features. The model is estimated by minimizing the sum of the squared differences between the predicted values and the actual values (described in the fit section below). The algorithm for the linear regression model is:

1. Propose a linear model by choosing the x variables. The proposed model is then given by the following:

 y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon

The \beta coefficients are unknown and are estimated by the model. The \epsilon is the error term that measures how much the model’s predicted value differs from the actual value.

2. Estimate the \beta coefficients by maximizing model fit. The coefficients can be estimated with mathematical formulas or with numerical optimization (i.e. trying different values until the sum of the squared differences is minimized).

There are two primary benefits of linear regression models. First, the model is understandable via the estimated coefficients. The \beta coefficients can be interpreted as partial derivatives, meaning that each \beta gives the change in the dependent variable for a one-unit change in the associated x variable, holding all other x variables constant. Second, because the model has a predefined form, it can be estimated quickly. The primary drawback of linear regression models is that they are limited to the proposed linear relationships.
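As a brief illustration, the following sketch estimates an ordinary least squares model in R with lm() on simulated data. The variable names and true coefficient values are made up for the example.

# Simulate data where y depends linearly on x1 and x2
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(100)   # true betas are 2, 3, and -1

# lm() estimates the betas by minimizing the sum of squared errors
mdl <- lm(y ~ x1 + x2)

# The estimated coefficients approximate the true betas
coef(mdl)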

Define fit

Alternative model types can use different ways to measure model fit. This section focuses on general measures and concepts related to model fit with continuous dependent variables. Model fit compares observed outputs to predicted outputs derived from a model. These comparisons can be intuitive. For example, if actual EPS is -$5.00, a model that predicts EPS to be $3.50 is less accurate than one that predicts $3.00. Generalizing the comparison for a single case to an entire data set requires combining the individual cases into a measure of how well a model fits the observed outcomes overall. There are different measures of fit, but perhaps the most widely known and used measure is r-squared.

R-squared is a measure of how well the model explains the variation in the outcome variable. R-squared ranges from 0 to 1, with 0 indicating that the model explains none of the variation in the outcome variable and 1 indicating that the model explains all of the variation in the outcome variable. The following steps build r-squared from the intuitive starting point of comparing predicted to observed outcomes.

  • Compare predicted with observed outcomes. Here the error is the difference between the two:

error = \hat{y} - y

  • Square the error to make positive and negative errors comparable:

SE = error^2

  • Combine the squared errors across all observations by summing:

SSE = \sum SE

  • Calculate the total sum of squares for comparison with SSE:

SST = \sum (y - \bar{y})^2

  • Measure the percent of the total sum of squares explained by the model:

R^2 = 1 - \frac{SSE}{SST}

R-squared is the proportion of the variance in the outcome variable that is explained by the model.
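These steps can be computed directly. The sketch below builds r-squared by hand, reusing the objects (mdl and y) from the simulated lm() example above.

pred <- predict(mdl)            # predicted values from the model

SSE <- sum((pred - y)^2)        # sum of squared errors
SST <- sum((y - mean(y))^2)     # total sum of squares
1 - SSE / SST                   # r-squared

# Matches the r-squared reported by summary()
summary(mdl)$r.squared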

Find the parameters that maximize fit

Once model fit is defined, any model can be fit by estimating the parameters that best fit the data. Estimating the parameters that maximize fit is part of the modeling process (i.e. find patterns in training data). The parameters that make the model best fit the data could be mathematically derived or could involve a search over possible parameters. There are many models that might be used to fit the data. Even after the training data is prepared, the pattern selected, and the model fit defined, estimation could lead to parameters that are unreliable outside of the training data. The primary concern is that the next step in the modeling process, assuming the same pattern for new data, will fail because the assumption is bad. The assumption that training data patterns apply to new data could be a bad assumption if the training data is not representative of the new data or if the model estimated on the training data, i.e. trained, fits spurious patterns in the training data that will not be repeated in new data. This is referred to as overfitting.
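A small sketch can make overfitting concrete. In the simulated data below the true pattern is linear; an overly flexible model fits the training data better but predicts new data from the same process worse. The setup is hypothetical.

set.seed(2)
xs <- runif(30)
ys <- xs + rnorm(30, sd = 0.2)            # true pattern is linear
xs_new <- runif(30)
ys_new <- xs_new + rnorm(30, sd = 0.2)    # new data from the same process

fit_line <- lm(ys ~ xs)                   # simple model
fit_poly <- lm(ys ~ poly(xs, 10))         # overly flexible model

# Training fit favors the flexible model (a nested model cannot fit worse in-sample)
sum(resid(fit_line)^2)
sum(resid(fit_poly)^2)

# Out-of-sample, the flexible model typically has the larger squared error
sum((ys_new - predict(fit_line, data.frame(xs = xs_new)))^2)
sum((ys_new - predict(fit_poly, data.frame(xs = xs_new)))^2)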

The most common approach to mitigate overfitting in the training process is cross-validation. Cross-validation is a technique used to iteratively train and evaluate the performance of a model by partitioning the training data into multiple “training” and “validation” sets. This allows us to evaluate the model’s performance on multiple subsets of the data as part of the model estimation and tuning process to get a more robust model. Cross-validation is therefore a method for using the training data. The key decision in cross-validation is how to partition and use the training data. Most commonly the process is referred to as k-fold cross-validation. The training data is divided into k subsets. The model is trained on k-1 subsets and evaluated on the remaining subset. This process is repeated k times, with each subset used as the validation set once. The models are combined and the results are averaged to get a more robust model and a better estimate of the model’s performance.
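H2O performs cross-validation automatically, but a minimal base-R sketch of k-fold cross-validation makes the mechanics concrete. The simulated data below is hypothetical and mirrors the earlier lm() example.

# Simulate a small data set with a linear pattern
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 3 * dat$x1 - dat$x2 + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # assign each row to a fold

cv_sse <- numeric(k)
for (i in 1:k) {
  trn <- dat[folds != i, ]    # train on k-1 folds
  val <- dat[folds == i, ]    # validate on the held-out fold
  mdl_i <- lm(y ~ x1 + x2, data = trn)
  cv_sse[i] <- sum((val$y - predict(mdl_i, newdata = val))^2)
}
cv_sse                        # held-out error for each fold
mean(cv_sse)                  # average held-out error across folds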

Assume the same pattern for new data and make predictions

Once a model is estimated and/or comparisons between competing models are complete, the trained model is assumed to apply to new data. The trained model is applied to new data to make predictions.

Continuous dependent variables in accounting

Many continuous dependent variables are important in accounting practice and research. In lieu of providing an overview of all accounting-related data analysis, this section only highlights some continuous variables that are or have been important for data analysis and modeling.

Modeling with R and AutoML

Models for continuous dependent variables include linear regression, regression trees, neural networks, and other models. Each model is based on different assumptions about patterns in the data. Because they are based on different assumptions, the usage and interpretation differ across models. Despite the differences, the objectives of the different types of models are the same, and the ultimate objective is to create the most accurate model possible. Modern computing power and open-source platforms make fitting many models with many parameters feasible even for non-expert users. This section introduces the H2O AutoML package as an easy-to-use tool for fitting many models without getting into the details of any specific model.

H2O is a machine learning library that can be used in R. It provides a set of functions and algorithms that can be used to build and train machine learning models. Most relevant to our course is what is called AutoML, which is a function that can be used to automatically build and train machine learning models. AutoML can be used to build models for regression, classification, and clustering tasks. It can also be used to build models for time series forecasting and anomaly detection tasks. AutoML is a class of tools currently being developed and refined on different platforms. For example, see an article about AutoGluon (https://towardsdatascience.com/automl-with-autogluon-transform-your-ml-workflow-with-just-four-lines-of-code-1d4b593be129) or about H2O AutoML (https://towardsdatascience.com/automated-machine-learning-with-h2o-258a2f3a203f). The primary benefit of AutoML is that it can allow users to build and train AI models without needing to have a deep understanding of machine learning algorithms or programming. It also speeds up the process of building and training models by automating many of the steps involved in the process. This is useful even for experienced data scientists because it can save time and effort when building and training models. After trying AutoML, a model might be further refined by using specific tools along with preprocessing features and training the hyperparameters of the model.

Details of H2O and AutoML are available online (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Tutorials for getting started with H2O in RStudio are also easily available (https://www.youtube.com/watch?v=zzV1kTCnmR0). The key requirement is that H2O is built on Java, so you need to have Java installed on your computer to use H2O. You can download Java from the Java website (https://www.java.com/en/download/). H2O installation information is available on the H2O website (https://www.h2o.ai/download/).

H2O AutoML Example

This section uses an example to walk through the steps for using AutoML for continuous dependent variables. The example uses annual financial information to predict the one-year-forward change in ROA. The data is available here: https://www.dropbox.com/scl/fi/dp5vuvaricvj7rqdpgb2o/finstmtlg.csv?rlkey=w96bhuhs59f4p6y06tbv5xvro&st=3g2x1rha&dl=0. In the data, the column CHROA_ld is the one-year-forward change in ROA. The remaining columns, revt to mve_f, are annual financial statement information from the most recent year prior to CHROA_ld.

Running H2O AutoML from R requires the following steps:

AutoML with H2O steps (continuous dependent variable)

1. Load libraries and initialize the H2O cluster.
2. Load the data into the H2O cluster.
3. Run the AutoML function to build the model.
4. Get the best model from the AutoML results.
5. Create predictions for the testing data.
6. Evaluate the model on the testing data.
7. Understand and communicate the results.
8. Stop the H2O cluster.

The steps with the financial statement data example are shown and explained below. The video at the end of this section demonstrates the steps. The example in this chapter assumes that any necessary data preparation steps are complete so that modeling can proceed.

Before beginning, h2o, DALEX, and DALEXtra must be installed on your computer. We will also be using the parallel package to help figure out how to initialize h2o.

install.packages("h2o")
install.packages("DALEX")
install.packages("DALEXtra")

install.packages("parallel")

When you install these packages, you may see that there are some conflicts with other packages. This means that when using some functions, it is necessary to specify which package those functions come from. This will apply below to the DALEX function explain. Rather than directly using the explain function, you will have to specify DALEX::explain. This will be demonstrated at that point.

H2O runs on Java. The h2o package opens and runs Java in the background, which means that h2o computations are not actually running in R. Java therefore needs to be installed. You can download Java from the website: https://www.java.com/en/download/.

Step 1: Load packages and initialize the H2O cluster.

library(h2o)
library(DALEX)
library(DALEXtra)

library(tidyverse)
library(parallel)

We run h2o by starting the h2o environment. h2o will use all available cores on your computer by default. You can specify the number of cores to use by setting the nthreads argument in the h2o.init function. Note that training machine learning models can be computationally intensive, so using a computer with multiple cores and a good amount of memory can be helpful or may be necessary. Because running multiple models may also require extensive time, estimating a machine learning model requires setting aside time for running the model(s). If you need additional computing resources, you may explore cloud-based options such as Google Colab.

Check cores on computer.

detectCores()

Initialize h2o with fewer cores than your computer has (unless you want h2o to be the only thing running). I ran this code on a computer with 16 cores, so I will initialize h2o with 8 cores.

h2o.init(nthreads=8)

Step 2: Load the data into the H2O cluster.

We first have to separate training and testing data. Assuming that the imported data frame is titled “df”, the training and testing data sets could be created as follows. The code below uses a random sample to create the training data. set.seed() makes the random sample repeatable. Keep only the dependent variable and the variables used in the model.

set.seed(1111)

train <- df %>%
  sample_frac(0.67) %>%
  select(CHROA_ld, revt:mve_f)

test <- df %>%
  anti_join(train) %>%
  select(CHROA_ld, revt:mve_f)

Once the data is ready, transfer the data frame to the h2o instance (running in Java). Here the h2o data frame is named something that indicates that it is training data and that it is in h2o (trn.h2o).
trn.h2o <- as.h2o(train)

Step 3: Run the AutoML function to build the model.

After the data is in the h2o instance, the automl function can run models on the training data. The function requires specifying the y variable and the training data. max_models limits the number of models that automl will estimate. An alternative is to limit the amount of time that automl runs using max_runtime_secs; automl will then run as many models as it can during that time. seed makes the random components of running automl replicate across different runs.
mdlres <- h2o.automl(y = "CHROA_ld",
                     training_frame = trn.h2o,
                     max_models = 25,   
                     seed = 1)

The size of the data (number of columns and number of rows), the types of models, the number of models, the number of cores, and computer speed determine how long estimating these models takes. On my machine with 8 cores and the data above, the code took a little over an hour.
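If run time is the binding constraint, the time-budget alternative mentioned above can replace the model count. A sketch of that call (the one-hour budget is an arbitrary choice for illustration):

mdlres <- h2o.automl(y = "CHROA_ld",
                     training_frame = trn.h2o,
                     max_runtime_secs = 3600,   # run as many models as fit in one hour
                     seed = 1)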

Step 4: Get the best model from the AutoML results.

You can see the models AutoML ran by getting and printing the “leaderboard”.

lb<-h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))

You can get the best model and view aspects of the model.

bmdl <- h2o.get_best_model(mdlres)
bmdl

The output displays the best model that AutoML created on the training data. Later chapters will discuss details of specific models.

The output then displays model fit statistics. These include r-squared (r2) as discussed previously.

You can also see which predictor variables were most important for the model. Variable importance (also called feature importance) in machine learning refers to how much each input variable (feature) contributes to making accurate predictions in a model. In simple terms, it tells you which variables matter most for the model’s decisions and which ones have little or no effect. (Note: this is not available for ensemble models.)
vi <- h2o.varimp(bmdl)
vi
In the output the features are sorted by importance. The relative importance is a measure of how much the feature contributes to the model. The scaled importance is the relative importance scaled to sum to 1. The percentage is the percentage of the total importance that the feature contributes. Information on feature importance in h2o can be found here: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html.
Sometimes you may want to save the model to be used for predictions at a later time.
h2o.saveModel(bmdl, path = "file path")
bmdl <- h2o.loadModel("file path")

Step 5: Create predictions for the testing data.

Next is creating predictions for the testing data. This step only begins after you are satisfied with the model training. For example, you may want to go back and work on collecting more data, improving the current features, or expanding the list of features.
Once you are satisfied, you can import the testing data into h2o and predict the outcome variable with the best model. Note that any transformations applied to the training data must also be applied to the testing data.
tst.h2o <- as.h2o(test)
With the data imported to h2o, the predictions can be created with the testing data and the model.
pred.h2o <- h2o.predict(bmdl, tst.h2o)
Return the prediction to R to use in other applications.
pred <- as.data.frame(pred.h2o)$predict
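With the predictions back in R, a quick manual check of out-of-sample fit can mirror the r-squared construction from earlier in the chapter, before the fuller DALEX evaluation in the next step.

# Out-of-sample r-squared on the testing data
SSE <- sum((test$CHROA_ld - pred)^2)
SST <- sum((test$CHROA_ld - mean(test$CHROA_ld))^2)
1 - SSE / SST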

Step 6: Evaluate the model on the testing data.

The DALEX package is a tool for understanding and explaining machine learning models. It provides a set of tools for visualizing and interpreting the model’s predictions and performance. Describing machine learning models is referred to as explainable AI or XAI.

Machine learning models are often referred to as black box models. This means that the model makes predictions, but it is not always clear how the model is making those predictions. Explainable AI is an approach and set of tools that are designed to try to understand how the model is making predictions. To understand the models, various approaches include “what if” scenarios. Some of these methods are computationally intensive because they require making predictions for many different scenarios. The DALEX package can be used for the most common methods.

The DALEX package first creates an “explain” object that it uses to create different outputs. The “explain” object includes testing data predictions along with other information to create the output.

Create the DALEX explain object.

expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst.h2o[,-1],
  y = tst.h2o[,1])

The explain object now makes it simple (but not necessarily fast) to create different explainable AI outputs.

Test data fit with the following code.

model_performance(expln)

Test variable importance with the following code.

varimp <- variable_importance(expln)
plot(varimp,show_boxplots = FALSE)

Other methods are used to evaluate the direction and size of effects of each feature. These include partial dependence plots and Shapley values. Partial dependence plots show the relationship between a feature and the model’s predictions while holding all other features constant. This demonstrates how the model’s predictions change as the feature changes.

pdp <- model_profile(expln)
plot(pdp)
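Shapley values, mentioned above, attribute an individual prediction to the contributions of each feature. A sketch using DALEX’s predict_parts() on the first row of the testing data (the row choice is arbitrary for illustration):

# Shapley values for a single prediction
shap <- predict_parts(expln,
                      new_observation = test[1, -1],
                      type = "shap")
plot(shap)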

Step 7: Understand and communicate the results.

Once you have evaluated the model, understood the results, and created predictions, you can use the results to make decisions. You can use the model to predict an outcome for a specific case or for a set of cases. You may then use the model, predictions, and decisions to communicate with users, stakeholders, or decision makers. Some important considerations and pieces to communicate include the following:

  • What the objective of the analysis is
  • The source of the data
  • How the training data was defined and created
  • How the predictions are created
  • How reliable the model is
  • What can be learned from the model

Step 8: Stop the H2O cluster.

h2o.shutdown()

Tutorial video

Conclusion

This chapter has provided an overview of modeling continuous dependent variables, emphasizing both the conceptual foundations and practical steps involved in predictive modeling. We explored how models, ranging from traditional linear regression to advanced machine learning algorithms, can be used to predict magnitudes. The modeling process was broken down into key stages: partitioning data into training and testing sets, specifying model patterns, defining and maximizing model fit, and addressing challenges such as overfitting with cross-validation. We also used a practical example to implement these steps in R, from data preparation and model training to evaluation and interpretation using explainable AI tools.

Review

Mini-case video

