"

Chapter 13: AutoML with Categorical Dependent Variables

Learning outcomes

At the end of this chapter, you should be able to

  • Describe the purpose of predictive modeling with categorical dependent variables
  • Describe examples of accounting-relevant categorical dependent variables
  • Describe the interpretation and evaluation of categorical dependent variable models
  • Use AutoML with categorical dependent variables

Chapter content

Note: include practice and knowledge checks throughout chapter

# Modeling concepts and practices using H2O auto-machine learning

## Packages and data

In this chapter, we will introduce the basic process of machine learning without going into the details of specific machine learning models. We will use the H2O AutoML package to demonstrate the process of building a machine learning model.

H2O is an open source machine learning platform that provides a user-friendly interface for building machine learning models. The AutoML feature in H2O automates the process of building and tuning machine learning models, making it easier for users to experiment with different models and find the best one for their data.

H2O has a package in R. The key requirement is that H2O is built on Java, so you need to have Java installed on your computer to use H2O. You can download Java from the [Java website](https://www.java.com/en/download/). H2O installation information is available on the [H2O website](https://www.h2o.ai/download/).

Installing the H2O package requires running some installation code in R. If there is no prior installation and the required dependencies are already installed, you can simply run install.packages("h2o"). The code below ensures that prior installations are removed and the required packages are installed.

``` r

# Remove any previously loaded or installed h2o before installing the new version
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Install the dependencies h2o needs
pkgs <- c("RCurl", "jsonlite")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Install h2o from the H2O release repository
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")
```

This code removes any previously installed H2O package, installs the dependencies H2O needs, and then installs H2O from the H2O release repository.

Once you have installed H2O, you can load the package and start building machine learning models using the AutoML feature. We will work through an example of using the AutoML feature to build a machine learning model.

We will also use the DALEX package to explain the model. The DALEX package is a tool for understanding and explaining machine learning models. It provides a set of tools for visualizing and interpreting the model’s predictions and performance. Describing machine learning models is referred to as explainable AI or XAI.

First, we install the DALEX package (and the companion DALEXtra package, which we use later to explain H2O models), and then we load the packages we will be using.

``` r
# install.packages(c("DALEX", "DALEXtra"))  # run once if not already installed
library(h2o)
library(DALEX)
library(tidyverse)
```

We will be using the same data set we used in the previous chapter. Read this in.

``` r
df <- read.csv(filepath)
```

```{r, cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
df<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
```

## Feature engineering and data preparation

We will create the features and prepare the data following how we did this in the previous chapter.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

# Winsorize x at the p and 1-p quantiles
wins <- function(x, p){
  lim <- quantile(x, c(p, 1 - p), na.rm = T)
  x[x < lim[1]] <- lim[1]
  x[x > lim[2]] <- lim[2]
  x
}

# Standardize x to mean zero and standard deviation one
norm1 <- function(x){
  (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
}

tmp <- df %>%
  arrange(gvkey, fyear) %>%
  group_by(gvkey) %>%
  mutate(
    sales_growth = (sale - lag(sale, 1)) / lag(sale, 1),
    excess_invent_growth = ((invt - lag(invt, 1)) - (ap - lag(ap, 1))) / lag(invt, 1),
    consec = ifelse(fyear - 1 == lag(fyear, 1), 1, 0),
    across(c(sales_growth, excess_invent_growth), ~ifelse(consec == 1, .x, NA)),
    loss = ifelse(ni < 0, 1, 0)
  ) %>%
  group_by(fyear) %>%
  mutate(
    market_share = sale / sum(sale, na.rm = TRUE)) %>%
  mutate(
    sales_growth = wins(sales_growth, 0.02),
    market_share = wins(market_share, 0.02),
    excess_invent_growth = wins(excess_invent_growth, 0.25)) %>%
  mutate(
    across(c(sales_growth, excess_invent_growth, market_share), ~norm1(.x))
  ) %>%
  ungroup() %>%
  select(gvkey, fyear, misstate, sales_growth, excess_invent_growth, loss, market_share) %>%
  drop_na(sales_growth, excess_invent_growth, market_share, loss)

head(tmp)

```

## Separate training data and test data

Creating any model that can be used to make predictions requires data to train and tune a model. The data that is used to train and tune a model is called the training data. Training data cannot be used to later test the model or make predictions. The data that is used to evaluate the model’s performance is called the test data. Similarly, testing data cannot be used to train or tune the model.

There are various approaches to partitioning out training data. If we are creating a model to be used in real-time decisions, such as making trading decisions, we might use all available data as training data and evaluate the model as we use it to make trading decisions. Most of the time, we use a portion of the data as training data and the rest as testing data. In this way we can evaluate the model’s performance on data that it has not seen before we use it to make real decisions. Some approaches to partitioning the data include:

- Randomly selecting a portion of the data as training data and the rest as testing data.
- Using a time-based approach where we use data up to a certain point in time as training data and data after that point as testing data.

### Cross-validation

Related to separating training and testing data is the concept of cross-validation. Cross-validation is a technique used to evaluate the performance of a model by partitioning the training data into multiple “training” and “testing” sets. This allows us to evaluate the model’s performance on multiple subsets of the data as part of the model estimation and tuning process, giving a more robust estimate of the model’s performance. Cross-validation is therefore a method for using the training data. The key decision in cross-validation is how to partition and use the training data. Most commonly the process is referred to as k-fold cross-validation. The training data is divided into k subsets. The model is trained on k-1 subsets and tested on the remaining subset. This process is repeated k times, with each subset used as the testing set once. The results are then averaged to get an overall estimate of the model’s performance.
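
As a small illustration of the partitioning idea, the sketch below randomly assigns each observation in tmp to one of k = 5 folds; a model would then be trained five times, each time holding out one fold. (When we use h2o.automl later, H2O can handle this for us through its nfolds argument.)

``` r
# A minimal sketch of k-fold assignment: randomly assign each row to one of k folds
set.seed(123)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(tmp)))
table(fold_id)   # roughly equal fold sizes
```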

### Creating the training and testing data sets

Here we will use a time-based approach to partition the data into training and testing data. There is no one-size-fits-all approach to partitioning the data. The approach you use will depend on the specific characteristics of your data and the problem you are trying to solve. Some general guidelines for partitioning the data include:

- Properly creating and tuning a model is best done with as much data as possible. Using something like 2/3 of the data for training and 1/3 for testing is common.

- If you are training to predict an outcome variable, the distribution of the outcome variable should be similar in the training and testing data. For example, if the outcome variable is binary, the proportion of 1s and 0s should be similar in the training and testing data. If all of the 1s are in the training data, we won’t be able to evaluate how well the model predicts the ones with the testing data.

- There shouldn’t be any bleeding (also called leakage) between the training and testing data. At a minimum, the testing and training data should not include the same observations. Subtle forms of data bleeding can occur if there is overlap in other ways. For example, if the training data is used to predict the probability of bankruptcy in year t+2, and some year t+1 data for the same firms is included in the testing data, data bleeding could make the model appear to perform better than it would in real-time decisions where such overlap is not possible.

Suppose we want to split the data into training and testing data based on the year. We could use a summary of the data to decide where to split the data.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

tmp %>%
  group_by(fyear) %>%
  summarise(n = n(),
            p_misstate = mean(misstate, na.rm = TRUE))
```

We can see that the number of observations is fairly similar across years from 1991 to 2014; however, the percent of observations with misstatements is largest from 1999-2004. If we are concerned about making predictions in recent years, perhaps we could use data from 1991-2005, approximately 2/3 of the data, as the training data. If we are concerned about equal probabilities in the training and testing data, we might use a random sample from all years as training data. There is not a right or a wrong answer. Importantly, however, once we make the decision, we should not go back and retrain the model on different data just because we didn’t like the results. Here, we will use a random sample of 2/3 of the data as training data and the rest as testing data.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}
set.seed(123)

train <- tmp %>%
  sample_frac(0.67)

test <- tmp %>%
  anti_join(train)
```

Here we use set.seed to make the random sample reproducible. If we don’t use the same seed, each random sample that is created whenever the code is run can be different. The number in set.seed can be any number. It is used to set the seed for the random number generator.

sample_frac is a dplyr function that samples a fraction of the data. Here we sample 67% of the data as training data and the rest as testing data.

We can check whether the samples have similar proportions of misstatements.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

train %>%
  summarise(p_misstate = mean(misstate, na.rm = TRUE))

test %>%
  summarise(p_misstate = mean(misstate, na.rm = TRUE))

```

Notice in both samples the proportion of misstatements is very low, but the proportion is similar in the training and testing data.
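
We can also confirm there is no overlap (data bleeding) between the two sets. A minimal check, assuming gvkey and fyear together identify an observation:

``` r
# Count firm-years that appear in both the training and the testing data; should be 0
train %>%
  inner_join(test, by = c("gvkey", "fyear")) %>%
  nrow()
```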

Once we have the training and the testing data separated, we can start building the machine learning model.

## Model types and evaluation

There are many machine learning algorithms; for a sense of the variety, see the [list of algorithms available via H2O](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html). Understanding all of the details about every algorithm is more than we can cover. In this chapter, we will use the H2O AutoML package to build a machine learning model. The AutoML package automates the process of building and tuning machine learning models, making it easier for users to experiment with different models and find the best one for their data. AutoML doesn’t require understanding the details of every model. It does require understanding modeling basics and can require significant computing resources. We will work through using the AutoML package in this chapter. In subsequent chapters we will develop a deeper understanding of specific models and specific applications of modeling.

## Generic outline of AutoML with H2O

Running H2O AutoML from R requires the following steps:

1. Load the H2O library and initialize the H2O cluster.
2. Load the data into the H2O cluster.
3. Specify the target variable and the predictor variables.
4. Run the AutoML function to build the model.
5. Get the best model from the AutoML results.
6. Create predictions for the testing data.
7. Evaluate the model on the testing data.
8. Understand and communicate the results.
9. Stop the H2O cluster.

We will work through each of these steps in the following sections.

### Load the H2O library and initialize the H2O cluster

H2O runs on Java, so you need to have Java installed on your computer to use H2O. You can download Java from the [Java website](https://www.java.com/en/download/). Once you have installed Java, you can load the h2o library and initialize the h2o cluster.

h2o starts and runs Java in the background. This means that when we call h2o functions, the computation runs in a Java process rather than in R itself. We therefore work with h2o by starting the h2o environment and shutting it down when we are done. h2o will use all available cores on your computer by default. You can specify the number of cores to use by setting the nthreads argument in the h2o.init function. Note that training machine learning models can be computationally intensive, so it is recommended to use a computer with multiple cores and a good amount of memory. Because running multiple models may also require extensive time, estimating a machine learning model requires setting aside time for running the model(s). Below I am running h2o on my desktop. However, we can consider other options for running h2o, such as a cloud server or a cluster. For example, a simple way to run h2o is with Google Colab.

How many cores does your computer have?

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}
library(parallel)
detectCores()
```

You can leave it up to h2o to decide how many cores to use. Note that if you do so, you should plan to let h2o run without doing other things on your computer. If you want to run h2o in the background while you do other things, you can limit the number of cores h2o uses. Here we will initiate h2o with fewer cores than are available on the computer. If you want to let h2o determine the number of cores to use, you can omit the nthreads argument.

``` r
library(h2o)
h2o.init(nthreads = 10)
```

Note you should see something like the following output (I don’t know why, but I always have to run h2o.init twice to get it to work.)

```
Connection successful!

R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 1 milliseconds
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: …
```

At the end of the analysis, we need to stop the h2o cluster (mentioned again later).

``` r
h2o.shutdown()
```

### Import data to h2o

Next we need to import the data into the h2o cluster. We can use the as.h2o function to convert a data frame to an h2o object.

Note that if we are going to try to fit a binary 0-1 model, we need to tell h2o that the target variable is a categorical variable. We can do this by using the as.factor function.

``` r
train <- train %>%
  mutate(misstate = as.factor(misstate))
```

``` r
trn.h2o <- as.h2o(train)
```

### Specify the target variable and the predictor variables

We are first going to specify the target variable and the predictor variables for the model. The target variable is the variable we are trying to predict, and the predictor variables are the variables we are using to make the prediction. We will save these variables as x and y lists that we can use to build the model.

``` r
y <- "misstate"
x <- c("sales_growth", "excess_invent_growth", "market_share", "loss")
```

### Run AutoML

Now we have the h2o data frame and the columns specified, so we can run the AutoML function with these pieces. We can also control how long the training runs: we can specify the maximum number of models to try, or we can specify a maximum run time, and AutoML will stop with whatever models it has trained in that amount of time. The more models and the longer the allowed run time, the longer the training will take. Not every type of model is available through AutoML on every platform (for example, XGBoost is not supported on all operating systems), but a large number of models are available.

To train the model, we specify an object to save the model training results to. Here we will call that object mdlres. We can specify the seed to make the results reproducible. Here we will set 25 models as the max number of models to try. We will also check how long it takes to train the models.

``` r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y,
                     training_frame = trn.h2o,
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time
```

We could replace max_models with max_runtime_secs by using the following to limit the time to 30 minutes.

``` r
max_runtime_secs = (60*30)
```

Here we get the following output:

``` r
Time difference of 43.4792 mins
```

### Describe the training data model

Once AutoML has finished running (25 models in this case), we may want to know which models it tried, how they performed, and which model is best. There are a couple of functions that can help us with this.

First, we can use the h2o.get_leaderboard function to get a summary of the models that were trained.

``` r
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))
```

This gives us a summary of the models that were trained and some measures of fit. The output is shown below.

``` r
model_id auc logloss aucpr
1 GBM_grid_1_AutoML_2_20240715_112354_model_2 0.6633467 0.04214462 0.019957117
2 StackedEnsemble_AllModels_1_AutoML_2_20240715_112354 0.6591723 0.04181289 0.024853087
3 StackedEnsemble_BestOfFamily_1_AutoML_2_20240715_112354 0.6576796 0.04206463 0.019814622
4 GBM_grid_1_AutoML_2_20240715_112354_model_7 0.6520508 0.04199742 0.020808136
5 GBM_grid_1_AutoML_2_20240715_112354_model_3 0.6457805 0.04236963 0.021330519
6 GBM_grid_1_AutoML_2_20240715_112354_model_6 0.6419566 0.04217833 0.019267584
7 GBM_2_AutoML_2_20240715_112354 0.6392135 0.04296885 0.015146165
8 GBM_5_AutoML_2_20240715_112354 0.6369565 0.04737734 0.014242429
9 GBM_3_AutoML_2_20240715_112354 0.6306030 0.04315451 0.017148832
10 GLM_1_AutoML_2_20240715_112354 0.6184928 0.04273163 0.013397685
11 GBM_4_AutoML_2_20240715_112354 0.6128366 0.04404961 0.017256576
12 GBM_1_AutoML_2_20240715_112354 0.6102838 0.04344093 0.017230821
13 GBM_grid_1_AutoML_2_20240715_112354_model_5 0.6025600 0.04610814 0.015627881
14 DRF_1_AutoML_2_20240715_112354 0.5838888 0.12546788 0.011843079
15 XRT_1_AutoML_2_20240715_112354 0.5822970 0.10035710 0.013878478
16 GBM_grid_1_AutoML_2_20240715_112354_model_1 0.5782248 0.04407528 0.010552519
17 DeepLearning_1_AutoML_2_20240715_112354 0.5675395 0.04453148 0.009118853
18 GBM_grid_1_AutoML_2_20240715_112354_model_4 0.5382861 0.04465290 0.010626531
19 DeepLearning_grid_3_AutoML_2_20240715_112354_model_1 0.5336919 0.07584625 0.009881718
20 DeepLearning_grid_2_AutoML_2_20240715_112354_model_2 0.5260784 0.04352110 0.007827509
21 DeepLearning_grid_3_AutoML_2_20240715_112354_model_2 0.5078914 0.04432943 0.008001891
22 DeepLearning_grid_2_AutoML_2_20240715_112354_model_1 0.5016731 0.08034842 0.007814751
23 DeepLearning_grid_2_AutoML_2_20240715_112354_model_3 0.4963519 0.05204610 0.007964430
24 DeepLearning_grid_3_AutoML_2_20240715_112354_model_3 0.4912107 0.04486644 0.007313437
25 DeepLearning_grid_1_AutoML_2_20240715_112354_model_3 0.4863326 0.06203427 0.006886737
26 DeepLearning_grid_1_AutoML_2_20240715_112354_model_2 0.4802334 0.06233440 0.006977481
27 DeepLearning_grid_1_AutoML_2_20240715_112354_model_1 0.4576477 0.09281279 0.006531578
mean_per_class_error rmse mse training_time_ms predict_time_per_row_ms
1 0.4649377 0.08560759 0.007328660 240 0.002241
2 0.4674436 0.08516377 0.007252868 10528 0.015256
3 0.4671301 0.08530746 0.007277363 6270 0.010015
4 0.4669777 0.08522069 0.007262566 286 0.002469
5 0.4773328 0.08560238 0.007327767 219 0.002430
6 0.4650554 0.08526429 0.007269999 296 0.002275
7 0.4816753 0.08593366 0.007384594 247 0.002736
8 0.4856394 0.08984690 0.008072465 197 0.002155
9 0.4829930 0.08599712 0.007395505 235 0.002611
10 0.4576672 0.08537578 0.007289024 150 0.000448
11 0.4841615 0.08614064 0.007420211 281 0.003052
12 0.4803209 0.08545493 0.007302544 307 0.003674
13 0.4798450 0.08662116 0.007503225 540 0.004356
14 0.4801335 0.08891701 0.007906235 1031 0.010021
15 0.4868217 0.08791825 0.007729618 1323 0.010310
16 0.4803910 0.08586281 0.007372422 272 0.001851
17 0.4499864 0.08550935 0.007311848 1215 0.002942
18 0.4908994 0.08563700 0.007333696 281 0.003109
19 0.4769706 0.08577441 0.007357249 54077 0.013306
20 0.4791015 0.08546600 0.007304437 63833 0.004097
21 0.4956534 0.08549949 0.007310164 43816 0.005734
22 0.4908381 0.08577565 0.007357462 49244 0.007895
23 0.4920060 0.08606469 0.007407130 69350 0.003305
24 0.4983835 0.08550563 0.007311213 48252 0.003314
25 0.4903757 0.08570565 0.007345459 38580 0.001912
26 0.4922397 0.08569421 0.007343498 44025 0.002610
27 0.5000000 0.08577906 0.007358047 30740 0.003114
algo
1 GBM
2 StackedEnsemble
3 StackedEnsemble
4 GBM
5 GBM
6 GBM
7 GBM
8 GBM
9 GBM
10 GLM
11 GBM
12 GBM
13 GBM
14 DRF
15 DRF
16 GBM
17 DeepLearning
18 GBM
19 DeepLearning
20 DeepLearning
21 DeepLearning
22 DeepLearning
23 DeepLearning
24 DeepLearning
25 DeepLearning
26 DeepLearning
27 DeepLearning

[27 rows x 10 columns]
```

We can also get the best model from the leaderboard.

``` r
bmdl <- h2o.get_best_model(mdlres)
bmdl
```

There is a lot of output here. Here we will go through some of the output.

``` r
Model Details:
==============

H2OBinomialModel: gbm
Model ID: GBM_grid_1_AutoML_2_20240715_112354_model_2
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth
1 26 26 6588 4 4 4.00000
min_leaves max_leaves mean_leaves
1 11 16 15.57692
```

The best among the 25 models is a GBM model. This is a Gradient Boosted Machine model. Gradient Boosted Machine is a type of tree-based model that builds multiple trees in sequence, with each tree learning from the errors of the previous tree. We will discuss the details of different models in later chapters. Next come details of the model that we will discuss when we learn more about different model types: The model has 26 trees with a minimum depth of 4 and a maximum depth of 4.

#### Model fit

Next in the output are the model fit statistics, which measure how well the model fits the data. There are pros and cons to each of these measures. Here we will discuss two common measures of fit: one for binary classification models and one for regression models. We will then examine the output of the model fit statistics.

- AUC: The Area Under the Receiver Operating Characteristic Curve (AUC) is a measure of how well the model can distinguish between the two possible outcomes in a binary model. The AUC ranges from 0 to 1: a value of 0.5 indicates a model that is no better than random, values closer to 1 indicate a model that separates the two outcomes well, and values below 0.5 indicate a model that does worse than random. The AUC summarizes the trade-off between the true positive rate and the false positive rate. The true positive rate is the proportion of actual positives that are correctly predicted as positive. The false positive rate is the proportion of actual negatives that are incorrectly predicted as positive.

We can use the “confusion matrix” to understand the AUC. The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. In this type of language, positive means binary values equal to one and negative means binary values equal to zero. In other words then, the confusion matrix shows the observations in which the model predicted a one and the actual value was a one (true positive), the observations in which the model predicted a zero and the actual value was a zero (true negative), the observations in which the model predicted a one and the actual value was a zero (false positive), and the observations in which the model predicted a zero and the actual value was a one (false negative).

H2O reports the confusion matrix with the actual values in the rows and the predictions in the columns. An example confusion matrix is shown below.

| | Predicted 0 | Predicted 1 |
|----------|---------------------|---------------------|
| Actual 0 | True negative (TN) | False positive (FP) |
| Actual 1 | False negative (FN) | True positive (TP) |

We then define the true positive rate and the false positive rate with the following formulas:

True positive rate = TP / (TP + FN)

False positive rate = FP / (FP + TN)

We can see these pieces in the results in the confusion matrix.

``` r
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0   1    Error        Rate
0      75628 458 0.006020  =458/76086
1        496  68 0.879433    =496/564
Totals 76124 526 0.012446  =954/76650
```

In our model, the number of observations that were actually zero and predicted as zero (true negatives) is 75628. The number of observations that were actually one and predicted as one (true positives) is 68.

The number of observations that were actually zero but predicted as one (false positives) is 458. The number of observations that were actually one but predicted as zero (false negatives) is 496.

The true positive rate comes from the row where the actual value is one (the second row): 68 / (68 + 496) = 0.12. This means that of the observations that actually contained a misstatement, the model correctly flagged about 12%.

The false positive rate comes from the row where the actual value is zero (the first row): 458 / (458 + 75628) = 0.006. This means that of the observations that did not contain a misstatement, the model incorrectly flagged about 0.6% as misstatements.
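
As a quick arithmetic check, we can recompute both rates from the counts in the confusion matrix above.

``` r
tn <- 75628; fp <- 458   # actual zeros
fn <- 496;   tp <- 68    # actual ones

tp / (tp + fn)   # true positive rate, about 0.12
fp / (fp + tn)   # false positive rate, about 0.006
```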

By themselves, these give us some indication of the model’s performance. The AUC is a summary of these two rates and the trade-off between them.

The next point to understand is that in a binary classification model the model predicts a probability. The probability is then used to determine the classification: the model classifies an observation as a one if the probability is greater than a certain threshold. The confusion matrix above is reported at the F1-optimal threshold, the threshold that gives the best balance between precision (how often predicted ones are actually ones) and recall (how many of the actual ones are caught). We can choose a different threshold to get different results. For example, we could decide that any probability above 0.75 should be classified as a one and anything below 0.75 as a zero. Setting a lower threshold means we will miss fewer ones (false negatives will be lower), but we will also have more false positives. The opposite is true if we set a higher threshold.
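
To see the effect of choosing our own cutoff, we can classify the predicted probabilities directly. A minimal sketch using the training data; the 0.10 cutoff is purely illustrative.

``` r
# Score the training data and apply our own probability cutoff
trn_pred <- as.data.frame(h2o.predict(bmdl, trn.h2o))

cutoff <- 0.10                                   # illustrative cutoff
my_class <- ifelse(trn_pred$p1 > cutoff, 1, 0)   # classify as 1 when p1 exceeds the cutoff

table(predicted = my_class, actual = train$misstate)
```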

The AUC summarizes the model’s performance across all possible thresholds. It is the area under the curve created by plotting the true positive rate against the false positive rate for all possible thresholds. The more area under the curve, i.e., the higher the true positive rate for any given false positive rate, the better the model separates the ones from the zeros.

``` r
perf <- h2o.performance(bmdl)
plot(perf, type = "roc")
```

The plot, called the ROC curve, is shown below.

![ROC curve](roc.png)

The measure of the area under the curve is the AUC. In our model output, the AUC is 0.745. This means that the model is better than random and reasonably good. However, the individual rates caution that we cannot be too confident in the model’s predictions.

``` r
H2OBinomialMetrics: gbm
** Reported on training data. **

MSE: 0.007102116
RMSE: 0.08427405
LogLoss: 0.03976389
Mean Per-Class Error: 0.4427261
AUC: 0.7450582
AUCPR: 0.0660392
Gini: 0.4901164
R^2: 0.02763741
AIC: NaN
```

- R\^2: The other commonly used model fit measure is R-squared. R-squared is a measure of how well the model explains the variation in the outcome variable, and it makes the most sense with continuous outcome variables. R-squared ranges from 0 to 1, with 0 indicating that the model explains none of the variation in the outcome variable and 1 indicating that the model explains all of the variation in the outcome variable.

R-squared is the proportion of the variance in the outcome variable that is explained by the model. The formula for R-squared is shown below.

R-squared = 1 - (SSE/SST)

In this formula, SSE is the sum of squared errors and SST is the total sum of squares. The sum of squared errors is the sum of the squared portion of each outcome that is not explained by the model, i.e., the sum of the squared differences between the predicted values and the actual values. The total sum of squares is the sum of the squared differences between the actual values and the mean of the actual values.
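
As a small worked example of the formula (the values of y and yhat below are made up for illustration):

``` r
y    <- c(2, 4, 6, 8)        # hypothetical actual values
yhat <- c(2.5, 3.5, 6.5, 7)  # hypothetical predicted values

sse <- sum((y - yhat)^2)     # sum of squared errors
sst <- sum((y - mean(y))^2)  # total sum of squares
1 - sse / sst                # R-squared, about 0.91
```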

We will return to r-squared when we discuss specific regression models.

In addition to getting a summary of the overall fit of the model, we can evaluate which features matter most to the model. How feature (variable) importance is measured depends on the model. Importance could be measured by how much model fit changes when a variable is removed, by how often a feature appears in a model that iterates through variables as in some tree-based models, or by some other measure.

We can do variable importance on the training data as follows.

``` r
vi <- h2o.varimp(bmdl)
vi
```

This gives us the following output.

``` r
Variable Importances:
variable relative_importance scaled_importance percentage
1 market_share 17.851400 1.000000 0.434445
2 sales_growth 15.475105 0.866885 0.376614
3 excess_invent_growth 6.734966 0.377279 0.163907
4 loss 1.028610 0.057621 0.025033
```

In the output the features are sorted by importance. The most important feature is market share. The relative importance is a measure of how much the feature contributes to the model. The scaled importance is the relative importance divided by the largest relative importance, so the most important feature has a scaled importance of 1. The percentage is the share of the total importance that the feature contributes, and the percentages sum to 1. Information on feature importance in h2o can be found [here](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html).

Given the variable importance output, we conclude that the most important feature for predicting fraud is market share. There may be other important variables that are not in our model, but given what we have put into the model, this is the most important. The loss indicator variable explains the least.
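
H2O can also draw the training-data importance as a bar chart; h2o.varimp_plot is part of the h2o package.

``` r
# Bar chart of variable importance for the best model (training data)
h2o.varimp_plot(bmdl)
```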

## Create predictions for the testing data

Next, we create predictions for the testing data. Note that we only move to this step once we are satisfied with the training results. For example, if we are not satisfied with the AUC, we may want to go back and collect more data, improve the current features, or expand the list of features. Once we are satisfied, we can import the testing data into h2o and predict the outcome variable with the best model. Note that anything we did to the training data must also be done to the testing data.

``` r
test <- test %>%
  mutate(misstate = as.factor(misstate))

tst.h2o <- as.h2o(test)
```

With the data imported to h2o, we can create the predictions using the testing data and the model we have created.

``` r
prediction <- h2o.predict(bmdl, tst.h2o)
```

We can return the prediction to R to use it in whatever application we have with the following code.

``` r
test$pred_misstate <- as.data.frame(prediction)$predict
```
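
The prediction frame also contains the predicted probabilities (the p0 and p1 columns shown later in the chapter). If we want the probability of a misstatement as well as the predicted class, we can keep it in a new column; prob_misstate is just an illustrative name.

``` r
# Keep the predicted probability of a misstatement alongside the predicted class
test$prob_misstate <- as.data.frame(prediction)$p1
```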

### Evaluate the model on the testing data

We can then evaluate the model on the testing data. To do so, we will use the DALEX package introduced earlier, which provides tools for visualizing and interpreting the model’s predictions and performance.

Machine learning models are often referred to as black box models. This means that the model makes predictions, but it is not always clear how the model is making those predictions. Explainable AI is an approach and set of tools designed to help us understand how the model is making predictions. Many of these tools rely on “what if” scenarios. Some of these methods are computationally intensive because they require making predictions for many different scenarios. Here we will work through some of the most common methods.

The DALEX package first creates an “explain” object that it uses to create different outputs. The “explain” object includes the testing data and predictions along with other information needed to create the output.

Note that, unfortunately, some DALEX functions require a numeric binary outcome, so we need to convert the factor back to a number before creating the explain object.

``` r
test2 <- test %>%
  mutate(misstate = as.numeric(as.character(misstate)))
tst2.h2o <- as.h2o(test2)

expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst2.h2o[, x],
  y = tst2.h2o[, y])
```

With the explain object we can create different explainable AI outputs.

### Testing data fit

We can get the testing data fit with the following code.

``` r
model_performance(expln)
```

The AUC for the training data was 0.745. The AUC for the testing data is 0.658. The fit for the testing data is worse than for the training data. This is common because models often fit random variation in the training data that is not present in the testing data. In the extreme case a model will perform well on the training data and be useless or even wrong on the testing data. This problem is referred to as overfitting. Various approaches are used to reduce overfitting, one of which is cross-validation. We will discuss other approaches to reducing overfitting when we discuss specific models.

In this case, the fit is better than a random guess (AUC = 0.5), but not so high that we feel confident in the predictions.
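
As a cross-check, we can also compute the test-set performance directly in h2o by passing the testing frame to h2o.performance; a minimal sketch:

``` r
# Evaluate the best model on the testing frame directly in h2o
perf_tst <- h2o.performance(bmdl, newdata = tst.h2o)
h2o.auc(perf_tst)   # AUC on the testing data
```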

### Variable importance

Variable importance, as stated before, measures how important each feature is to the model. On the testing data, we can use the following code to get the variable importance. Note that this can take some time because it has to work through each feature and re-make predictions; the more features there are, the longer it will take.

``` r
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
```

The feature importance plot is shown below.

![Feature Importance](varimp.png)

As shown in the plot, on the testing data, the most important feature is market share. The least important feature is excess_invent_growth.

### Partial dependence plots

Other methods are used to evaluate the direction and size of the effect of each feature. These include partial dependence plots and Shapley values. Here we will use partial dependence plots, which play a role similar to coefficients in linear regression models in that they give the feature effects an interpretable meaning.

Partial dependence plots show the relationship between a feature and the model’s predictions while holding all other features constant. This allows us to see how the model’s predictions change as the feature changes.

``` r
pdp <- model_profile(expln)
plot(pdp)
```

The partial dependence plot is shown below.

![Partial Dependence Plot](pdp.png)

The partial dependence plot shows the relationship between each feature and the model’s predictions. The market share plot shows that at the lowest levels of market share there is a small increase in the probability of fraud, although this probability is very small. The biggest effect of market share is at the highest levels, i.e., more than 5 standard deviations above the mean: the highest levels of market share have the largest effect on the predicted probability of fraud. excess_invent_growth and loss seem to have no effect on the predicted probability of fraud. Sales growth has a small positive effect on predicted fraud.
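
If we want to focus on a single feature, model_profile accepts a variables argument, so we can profile and plot just market share, for example:

``` r
# Partial dependence for a single feature
pdp_ms <- model_profile(expln, variables = "market_share")
plot(pdp_ms)
```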

## Using and communicating results

### Using the predictions

Once we have evaluated the model, we can use the results to make decisions. We can use the model to predict fraud, and we can use the results to understand what features are important for predicting fraud. Suppose we are on a fraud team at the SEC and would like to use a predicted fraud value for two companies we are looking at (let’s say Walmart, gvkey = 011259, and Target, gvkey = 003813) to determine whether we should investigate further. Let’s get the last year we have available for those companies in the data.

``` r
checkdta <- tmp %>%
  filter(gvkey %in% c(011259, 003813)) %>%
  group_by(gvkey) %>%
  filter(fyear == max(fyear))

checkdta

  gvkey fyear misstate sales_growth excess_invent_growth  loss market_share
  <int> <int>    <int>        <dbl>                <dbl> <dbl>        <dbl>
1  3813  2014        0       -0.180                0.486     1         5.31
2 11259  2014        0       -0.177                0.477     0         5.31
```

Notice that we do not currently have a prediction for these observations, and misstate is currently zero for both. This is for fiscal year 2014. We can use the model to predict the probability of fraud for these companies.

``` r
chk.h2o <- as.h2o(checkdta)
prediction <- h2o.predict(bmdl, chk.h2o)
prediction <- as.data.frame(prediction)
prediction

  predict        p0         p1
1       0 0.9884465 0.01155353
2       0 0.9898673 0.01013269
```

The first column, “predict”, is the predicted outcome; in this case, neither company is predicted to have fraud. The second column is the probability that the outcome is zero and the third column is the probability that the outcome is one. Let’s say that as part of the SEC team, we want to select the company with the higher probability of fraud to investigate. The first row is Target with a 1.2% probability of fraud and the second row is Walmart with a 1.0% probability of fraud. We would select Target for further investigation (even though the probability is quite low).

An alternative approach could be to select only the highest-risk companies. We could do this either by setting a threshold for the probability (a sketch of this appears after the output below) or by selecting those predicted to have a fraud. We have already added the prediction to the test data set, so we can do the latter as follows.

``` r
test %>%
  select(gvkey, fyear, pred_misstate) %>%
  filter(pred_misstate == 1 & fyear == 2014)
```

In the same year as the last Walmart and Target observations, we have two observations that are predicted to have fraud.

``` r
   gvkey fyear pred_misstate
   <int> <int> <fct>
1 143748  2014 1
2 171045  2014 1
```
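
As noted above, an alternative is to rank firms by predicted probability rather than relying only on the predicted class. A minimal sketch, assuming we also saved the p1 probability column to the test data (for example as prob_misstate, as suggested earlier):

``` r
# Ten highest-risk firm-years in 2014 by predicted probability of misstatement
test %>%
  filter(fyear == 2014) %>%
  arrange(desc(prob_misstate)) %>%
  select(gvkey, fyear, prob_misstate) %>%
  head(10)
```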

### Communicating results

There are many ways that we might choose to communicate our results. We could write a report, create a presentation, or create a dashboard. We could also create a report that is automatically updated with the latest data. In this case, let’s lay out the critical pieces that might allow someone to understand what we have done. We will not create a full report or presentation, rather we will create the pieces that could be used in a coherent presentation or discussion.

We will presume that we are on the SEC team deciding which companies to investigate for fiscal year 2014.

1. Introduce the problem: We are trying to identify the highest fraud risk companies to best allocate our investigation resources.

2. Introduce the suggested solution: We are creating a model to predict fraud risk. The output will be predicted fraud (or it could be the probability of fraud). We can then choose only the highest predicted fraud companies to investigate.

3. Explain the data: We have used a random sample of companies from Compustat annual and labels for whether the companies have had an accounting misstatement. The data comes from a research paper: Yang Bao, Bin Ke, Bin Li, Julia Yu, and Jie Zhang (2020). Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research, 58 (1): 199-235. We created the following features:

- Sales growth: The percentage growth in sales from the previous year.
- Excess inventory growth: (change in inventory - change in accounts payable) / lagged inventory.
- Market share: The company’s sales as a percent of total sales of all companies in that year.
- Loss: A binary variable indicating whether the company reported a loss (negative net income) in that year.

4. Explain the model: We trained 25 different models on the random sample. The best model was a Gradient Boosted Machine model. The model had an AUC of 0.745 on the training data and 0.658 on the testing data. The most important feature for predicting fraud was market share. The following graphs describe the model’s predictions on testing data.

![Feature Importance](varimp.png)

![Partial Dependence Plot](pdp.png)

5.  Explain the predictions: For 2014, we identified two companies with a predicted fraud.
``` r
   gvkey fyear pred_misstate
   <int> <int> <fct>
1 143748  2014 1
2 171045  2014 1
```

### Critical thinking

If you were to make the presentation based on the outline above, you could expect questions and challenges. Part of critical thinking is being able to respond to these questions and challenges. Perhaps more importantly, critical thinking is being able to create such questions and challenges for yourself. In response to your own questions and challenges you might return to the data, the features, the model, the predictions, and/or the presentation.
Here are examples of some questions and challenges that might apply to the presentation above.
1.  How did you deal with outliers?
2.  How did you choose the features?
3.  Do these features seem reasonable or powerful enough to predict fraud sufficiently that we should rely on the model output? If not, what are other features that might be useful?
4.  Are you sufficiently confident in the model’s predictions to rely on them? If not, what could you do to improve the model?
5.  Does the training data seem reasonably representative of the data we are trying to predict for? Why or why not?
6.  What are the alternatives to relying on the model? What are the pros and cons of these alternatives?
7.  Is the model time and cost effective?
8.  Does the presentation have a logical flow and is it logically consistent and clear?
9.  Could someone not familiar with the data, the model, or machine learning understand the presentation?

## Shut down h2o

If we are happy with the model, we can save the model and load it in later.
``` r
h2o.saveModel(bmdl, path = "full file path")
```
Later when we want to read the model back into h2o, we can use the following code.
``` r
bmdl <- h2o.loadModel("full file path")
```
Once we are ready, we can shut down h2o.
``` r
h2o.shutdown()
```

## Alternative computing environments

Using the data and code above, we can run and evaluate the models in a reasonable amount of time. The larger the data, the larger the number of features, and the more complex the models, the longer it takes to run. If we cannot dedicate computing time to the task, or if the task is too large for our computer, we can use alternative computing environments.
There are a number of options. We can use a cloud service such as AWS, Azure, or Google Cloud. We can use a cluster or a GPU. We can use parallel processing. We can use a different computer with more memory and processing power.
I will run and evaluate the same model using one common option that is free for limited use and relatively affordable for moderate use: Google Colab. Most importantly, it is fairly easy to use. I will use the free version of Google Colab and provide a summary below. There are tutorials available online; for example, [here](https://www.geeksforgeeks.org/how-to-use-r-with-google-colaboratory/).
To use R with Google Colab, you have to change the runtime to R. The free account can run everything we have above. I used the default settings (the one change I made was to delete nthreads=10 so that h2o uses whatever Google allocates). Prior to running the code, you can also upload the csv file to your files in Google Colab.
Training the 25 models with h2o took 40 minutes to run.

## Review

In this chapter, we used h2o to create a model to predict fraud. We used the h2o automl function to create 25 models and then selected the best model. We evaluated the model on the testing data and used explainable AI tools to understand the model’s predictions. We then used the model to predict fraud for two companies. We outlined a presentation to communicate the results. We also discussed alternative computing environments.

### Conceptual questions

1.  What is the difference between the training data and the testing data?
2.  Describe the alternative approaches to splitting the data into training and testing data.
3.  List the possible algorithms available with h2o autoML.
4.  Describe how cross-validation is used to train a model.
5.  Define true positive rate and false positive rate.
6.  Describe the calculation and meaning of AUC.
7.  Describe the calculation and meaning of R-squared.

### Practice questions

1.  Create a training data set and a testing data set from the data above by splitting by fyear where years before 2009 are training data and years after 2009 are testing data.
2.  What is the percent of observations with misstate equal to one in the training and testing data sets created in number 1?
3.  Why might the answer to number 2 raise concerns about the training and testing data sets?
4.  List the steps for using h2o autoML to create and use a model.
5.  Describe the changes that would need to be made to the code in the chapter to create and add a new feature to the model.
6.  Show the change in the code necessary to limit autoML to a time limit of 30 minutes rather than limiting the number of models to try.
7.  What are the possible reasons that the model fit is worse on the testing data than on the training data?
8.  Suppose you are choosing between two models for fraud. The first model has an AUC of 0.9 and the second model has an AUC of 0.4. What do the AUCs tell you about the models?
9.  What are the explainable AI tools that we used in the chapter to understand how features affect the model’s predictions?
10. What are the steps to create predictions for the testing data?

## Solutions to practice questions

1.  This could be done with the code below.
``` r
train <- tmp %>%
  filter(fyear < 2009)

test <- tmp %>%
  filter(fyear >= 2009)
```
2.  This could be done with the code below.
``` r
train %>%
  summarize(percent = mean(misstate))

test %>%
  summarize(percent = mean(misstate))
```
This gives the following output:
``` r
> train%>%
+ summarize(percent = mean(misstate))
# A tibble: 1 × 1
  percent
    <dbl>
1 0.00813
>
> test%>%
+ summarize(percent = mean(misstate))
# A tibble: 1 × 1
  percent
    <dbl>
1 0.00402
>
```
3.  The percent of observations with misstate equal to one is 0.8% in the training data and 0.4% in the testing data. The misstatement rate in the training data is nearly double that in the testing data. This raises the concern that there is something different about the training data that may make the model less effective on the testing data. The lower percent of misstatements in the testing data also makes it more difficult to evaluate the model on the testing data.
4.  The steps for using h2o autoML to create and use a model are as follows:
    - Load the H2O library and initialize the H2O cluster.
    - Load the data into the H2O cluster.
    - Specify the target variable and the predictor variables.
    - Run the AutoML function to build the model.
    - Get the best model from the AutoML results.
    - Create predictions for the testing data.
    - Evaluate the model on the testing data.
    - Understand and communicate the results.
    - Stop the H2O cluster.
5.  To create and add a new feature to the model, we would need to create the feature in the data before loading it into h2o. This would include creating the feature, treating outliers, and normalizing the feature. The feature would then need to be added to the list of x variables used to model the outcome variable.
6.  To limit autoML to a time limit of 30 minutes, we would replace the max_models argument with the max_runtime_secs argument in the h2o.automl function. The code would look like this.
``` r
mdlres <- h2o.automl(x = x, y = y,
                     training_frame = trn.h2o,
                     max_runtime_secs = (60*30),
                     seed = 1)
```
7.  The possible reasons that the model fit is worse on the testing data than on the training data could be caused by overfitting or by the training data not being representative of the testing data.
8.  The AUC is a measure of how well the model can distinguish between the two possible outcomes in a binary model. The AUC ranges from 0 to 1, with 0 indicating that the model explains nothing in the output, 0.5 indicating a model that is no better than random, and 1 indicating that the model explains all outcomes. The AUC of 0.9 indicates that the first model is very good at distinguishing between the two outcomes. The AUC of 0.4 indicates that the second model is worse than random at distinguishing between the two outcomes.
9.  We used variable importance and partial dependence plots to understand how features affect the model’s predictions. Variable importance measures how important each feature is to the model. Partial dependence plots show the relationship between a feature and the model’s predictions while holding all other features constant. This allows us to see how the model’s predictions change as the feature changes.
10. The steps to create predictions for the testing data are as follows:
    - Apply the same data steps to the testing data as to the training data. Note that in the code in the chapter, we did most of the data preparation prior to separating the data into training and testing data. We only needed to change the outcome variable to a factor because we did this after separating the data into training and testing data.
    - Use the testing data with the model created on the training data to create predictions for the testing data.
    - Return the predictions to R to use them in whatever application we have.

Tutorial video

Note: include practice and knowledge checks

Mini-case video

Note: include practice and knowledge checks

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.