"

Chapter 13: AutoML with Categorical Dependent Variables

Learning outcomes

At the end of this chapter, you should be able to

  • Describe the purpose of predictive modeling with categorical dependent variables
  • Describe examples of accounting-relevant categorical dependent variables
  • Describe the interpretation and evaluation of categorical dependent variable models
  • Use AutoML with categorical dependent variables

Chapter content

Note: include practice and knowledge checks throughout chapter

# Modeling concepts and practices using H2O auto-machine learning

## Packages and data

In this chapter, we will introduce the basic process of machine learning without going into the details of specific machine learning models. We will use the H2O AutoML package to demonstrate the process of building a machine learning model.

H2O is an open source machine learning platform that provides a user-friendly interface for building machine learning models. The AutoML feature in H2O automates the process of building and tuning machine learning models, making it easier for users to experiment with different models and find the best one for their data.

H2O has a package in R. The key requirement is that H2O is built on Java, so you need to have Java installed on your computer to use H2O. You can download Java from the [Java website](https://www.java.com/en/download/). H2O installation information is available on the [H2O website](https://www.h2o.ai/download/).

Installing the H2O package requires running some installation code in R. If there is no prior installation and the required dependencies are already installed, you can simply run install.packages("h2o"). The code below ensures that prior installations are removed and the required packages are installed.

``` r

# Remove any previously loaded or installed h2o before installing the new version
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Install the dependencies h2o needs
pkgs <- c("RCurl", "jsonlite")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Install h2o from the H2O release repository
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")
```

This code removes any previously installed H2O package, installs the dependencies H2O needs, and then installs H2O from the H2O release repository.

Once you have installed H2O, you can load the package and start building machine learning models using the AutoML feature. We will work through an example of using the AutoML feature to build a machine learning model.

We will also use the DALEX package to explain the model. The DALEX package is a tool for understanding and explaining machine learning models. It provides a set of tools for visualizing and interpreting the model’s predictions and performance. Describing machine learning models is referred to as explainable AI or XAI.

First, we install the DALEX package (and the companion DALEXtra package, which we use later to explain H2O models), and then we load the packages we will be using.

``` r
# install.packages(c("DALEX", "DALEXtra"))  # run once if not already installed
library(h2o)
library(DALEX)
library(tidyverse)
```

We will be using the same data set we used in the previous chapter. Read this in.

``` r
df <- read.csv(filepath)
```

```{r, cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
df<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
```

## Feature engineering and data preparation

We will create the features and prepare the data following how we did this in the previous chapter.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

# Winsorize x at the p and 1-p quantiles
wins <- function(x, p){
  lim <- quantile(x, c(p, 1 - p), na.rm = T)
  x[x < lim[1]] <- lim[1]
  x[x > lim[2]] <- lim[2]
  x
}

# Standardize x to mean zero and standard deviation one
norm1 <- function(x){
  (x - mean(x, na.rm = T)) / sd(x, na.rm = T)
}

tmp <- df %>%
  arrange(gvkey, fyear) %>%
  group_by(gvkey) %>%
  mutate(
    sales_growth = (sale - lag(sale, 1)) / lag(sale, 1),
    excess_invent_growth = ((invt - lag(invt, 1)) - (ap - lag(ap, 1))) / lag(invt, 1),
    consec = ifelse(fyear - 1 == lag(fyear, 1), 1, 0),
    across(c(sales_growth, excess_invent_growth), ~ifelse(consec == 1, .x, NA)),
    loss = ifelse(ni < 0, 1, 0)
  ) %>%
  group_by(fyear) %>%
  mutate(
    market_share = sale / sum(sale, na.rm = TRUE)) %>%
  mutate(
    sales_growth = wins(sales_growth, 0.02),
    market_share = wins(market_share, 0.02),
    excess_invent_growth = wins(excess_invent_growth, 0.25)) %>%
  mutate(
    across(c(sales_growth, excess_invent_growth, market_share), ~norm1(.x))
  ) %>%
  ungroup() %>%
  select(gvkey, fyear, misstate, sales_growth, excess_invent_growth, loss, market_share) %>%
  drop_na(sales_growth, excess_invent_growth, market_share, loss)

head(tmp)

```

## Separate training data and test data

Creating any model that can be used to make predictions requires data to train and tune a model. The data that is used to train and tune a model is called the training data. Training data cannot be used to later test the model or make predictions. The data that is used to evaluate the model’s performance is called the test data. Similarly, testing data cannot be used to train or tune the model.

There are various approaches to partitioning out training data. If we are creating a model to be used in real-time decisions, such as making trading decisions, we might use all available data as training data and evaluate the model as we use it to make trading decisions. Most of the time, we use a portion of the data as training data and the rest as testing data. In this way we can evaluate the model’s performance on data that it has not seen before we use it to make real decisions. Some approaches to partitioning the data include:

- Randomly selecting a portion of the data as training data and the rest as testing data.
- Using a time-based approach where we use data up to a certain point in time as training data and data after that point as testing data.

### Cross-validation

Related to separating training and testing data is the concept of cross-validation. Cross-validation is a technique used to evaluate the performance of a model by partitioning the training data into multiple “training” and “testing” sets. This allows us to evaluate the model’s performance on multiple subsets of the data as part of the model estimation and tuning process, giving a more robust estimate of the model’s performance. Cross-validation is therefore a method for using the training data. The key decision in cross-validation is how to partition and use the training data. Most commonly the process is referred to as k-fold cross-validation. The training data is divided into k subsets. The model is trained on k-1 subsets and tested on the remaining subset. This process is repeated k times, with each subset used as the testing set once. The results are then averaged to get an overall estimate of the model’s performance.
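
As a small illustration of the partitioning idea, the sketch below randomly assigns each observation in tmp to one of k = 5 folds; a model would then be trained five times, each time holding out one fold. (When we use h2o.automl later, H2O can handle this for us through its nfolds argument.)

``` r
# A minimal sketch of k-fold assignment: randomly assign each row to one of k folds
set.seed(123)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(tmp)))
table(fold_id)   # roughly equal fold sizes
```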

### Creating the training and testing data sets

Here we will use a time-based approach to partition the data into training and testing data. There is no one-size-fits-all approach to partitioning the data. The approach you use will depend on the specific characteristics of your data and the problem you are trying to solve. Some general guidelines for partitioning the data include:

- Properly creating and tuning a model is best done with as much data as possible. Using something like 2/3 of the data for training and 1/3 for testing is common.

- If you are training to predict an outcome variable, the distribution of the outcome variable should be similar in the training and testing data. For example, if the outcome variable is binary, the proportion of 1s and 0s should be similar in the training and testing data. If all of the 1s are in the training data, we won’t be able to evaluate how well the model predicts the ones with the testing data.

- There shouldn’t be any bleeding (also called leakage) between the training and testing data. At a minimum, the testing and training data should not include the same observations. Subtle forms of data bleeding can occur if there is overlap in other ways. For example, if the training data is used to predict the probability of bankruptcy in year t+2, and some year t+1 data for the same firms is included in the testing data, data bleeding could make the model appear to perform better than it would in real-time decisions where such overlap is not possible.

Suppose we want to split the data into training and testing data based on the year. We could use a summary of the data to decide where to split the data.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

tmp %>%
  group_by(fyear) %>%
  summarise(n = n(),
            p_misstate = mean(misstate, na.rm = TRUE))
```

We can see that the number of observations is fairly similar across years from 1991 to 2014; however, the percent of observations with misstatements is largest from 1999-2004. If we are concerned about making predictions in recent years, perhaps we could use data from 1991-2005, approximately 2/3 of the data, as the training data. If we are concerned about equal probabilities in the training and testing data, we might use a random sample from all years as training data. There is not a right or a wrong answer. Importantly, however, once we make the decision, we should not go back and retrain the model on different data just because we didn’t like the results. Here, we will use a random sample of 2/3 of the data as training data and the rest as testing data.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}
set.seed(123)

train <- tmp %>%
  sample_frac(0.67)

test <- tmp %>%
  anti_join(train)
```

Here we use set.seed to make the random sample reproducible. If we don’t use the same seed, each random sample that is created whenever the code is run can be different. The number in set.seed can be any number. It is used to set the seed for the random number generator.

sample_frac is a dplyr function that samples a fraction of the data. Here we sample 67% of the data as training data and the rest as testing data.

We can check whether the samples have similar proportions of misstatements.

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}

train %>%
  summarise(p_misstate = mean(misstate, na.rm = TRUE))

test %>%
  summarise(p_misstate = mean(misstate, na.rm = TRUE))

```

Notice in both samples the proportion of misstatements is very low, but the proportion is similar in the training and testing data.
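
We can also confirm there is no overlap (data bleeding) between the two sets. A minimal check, assuming gvkey and fyear together identify an observation:

``` r
# Count firm-years that appear in both the training and the testing data; should be 0
train %>%
  inner_join(test, by = c("gvkey", "fyear")) %>%
  nrow()
```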

Once we have the training and the testing data separated, we can start building the machine learning model.

## Model types and evaluation

There are many machine learning algorithms; for a sense of the variety, see the [list of algorithms available via H2O](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html). Understanding all of the details about every algorithm is more than we can cover. In this chapter, we will use the H2O AutoML package to build a machine learning model. The AutoML package automates the process of building and tuning machine learning models, making it easier for users to experiment with different models and find the best one for their data. AutoML doesn’t require understanding the details of every model. It does require understanding modeling basics and can require significant computing resources. We will work through using the AutoML package in this chapter. In subsequent chapters we will develop a deeper understanding of specific models and specific applications of modeling.

## Generic outline of AutoML with H2O

Running H2O AutoML from R requires the following steps:

1. Load the H2O library and initialize the H2O cluster.
2. Load the data into the H2O cluster.
3. Specify the target variable and the predictor variables.
4. Run the AutoML function to build the model.
5. Get the best model from the AutoML results.
6. Create predictions for the testing data.
7. Evaluate the model on the testing data.
8. Understand and communicate the results.
9. Stop the H2O cluster.

We will work through each of these steps in the following sections.

### Load the H2O library and initialize the H2O cluster

H2O runs on Java, so you need to have Java installed on your computer to use H2O. You can download Java from the [Java website](https://www.java.com/en/download/). Once you have installed Java, you can load the h2o library and initialize the h2o cluster.

h2o starts and runs Java in the background. This means that when we call h2o functions, the computation runs in a Java process rather than in R itself. We therefore work with h2o by starting the h2o environment and shutting it down when we are done. h2o will use all available cores on your computer by default. You can specify the number of cores to use by setting the nthreads argument in the h2o.init function. Note that training machine learning models can be computationally intensive, so it is recommended to use a computer with multiple cores and a good amount of memory. Because running multiple models may also require extensive time, estimating a machine learning model requires setting aside time for running the model(s). Below I am running h2o on my desktop. However, we can consider other options for running h2o, such as a cloud server or a cluster. For example, a simple way to run h2o is with Google Colab.

How many cores does your computer have?

```{r, cache=TRUE,echo=TRUE,warning=FALSE,message=FALSE}
library(parallel)
detectCores()
```

You can leave it up to h2o to decide how many cores to use. Note that if you do so, you should plan to let h2o run without doing other things on your computer. If you want to run h2o in the background while you do other things, you can limit the number of cores h2o uses. Here we will initiate h2o with fewer cores than are available on the computer. If you want to let h2o determine the number of cores to use, you can omit the nthreads argument.

``` r
library(h2o)
h2o.init(nthreads = 10)
```

Note you should see something like the following output (I don’t know why, but I always have to run h2o.init twice to get it to work.)

```
Connection successful!

R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 1 milliseconds
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: …
```

At the end of the analysis, we need to stop the h2o cluster (mentioned again later).

``` r
h2o.shutdown()
```

### Import data to h2o

Next we need to import the data into the h2o cluster. We can use the as.h2o function to convert a data frame to an h2o object.

Note that if we are going to try to fit a binary 0-1 model, we need to tell h2o that the target variable is a categorical variable. We can do this by using the as.factor function.

``` r
train <- train %>%
  mutate(misstate = as.factor(misstate))
```

``` r
trn.h2o <- as.h2o(train)
```

### Specify the target variable and the predictor variables

We are first going to specify the target variable and the predictor variables for the model. The target variable is the variable we are trying to predict, and the predictor variables are the variables we are using to make the prediction. We will save these variables as x and y lists that we can use to build the model.

``` r
y <- "misstate"
x <- c("sales_growth", "excess_invent_growth", "market_share", "loss")
```

### Run AutoML

Now we have the h2o data frame and the columns specified, so we can run the AutoML function with these pieces. We can also control how long the training runs: we can specify the maximum number of models to try, or we can specify a maximum run time, and AutoML will stop with whatever models it has trained in that amount of time. The more models and the longer the allowed run time, the longer the training will take. Not every type of model is available through AutoML on every platform (for example, XGBoost is not supported on all operating systems), but a large number of models are available.

To train the model, we specify an object to save the model training results to. Here we will call that object mdlres. We can specify the seed to make the results reproducible. Here we will set 25 models as the max number of models to try. We will also check how long it takes to train the models.

``` r
start.time <- Sys.time()
mdlres <- h2o.automl(x = x, y = y,
                     training_frame = trn.h2o,
                     max_models = 25,
                     seed = 1)
end.time <- Sys.time()
end.time - start.time
```

We could replace max_models with max_runtime_secs by using the following to limit the time to 30 minutes.

``` r
max_runtime_secs = (60*30)
```

Here we get the following output:

``` r
Time difference of 43.4792 mins
```

### Describe the training data model

Once AutoML has finished running (25 models in this case), we may want to know which models it tried, how they performed, and which model is best. There are a couple of functions that can help us with this.

First, we can use the h2o.get_leaderboard function to get a summary of the models that were trained.

``` r
lb <- h2o.get_leaderboard(mdlres, extra_columns = "ALL")
print(lb, n = nrow(lb))
```

This gives us a summary of the models that were trained and some measures of fit. The output is shown below.

``` r
model_id auc logloss aucpr
1 GBM_grid_1_AutoML_2_20240715_112354_model_2 0.6633467 0.04214462 0.019957117
2 StackedEnsemble_AllModels_1_AutoML_2_20240715_112354 0.6591723 0.04181289 0.024853087
3 StackedEnsemble_BestOfFamily_1_AutoML_2_20240715_112354 0.6576796 0.04206463 0.019814622
4 GBM_grid_1_AutoML_2_20240715_112354_model_7 0.6520508 0.04199742 0.020808136
5 GBM_grid_1_AutoML_2_20240715_112354_model_3 0.6457805 0.04236963 0.021330519
6 GBM_grid_1_AutoML_2_20240715_112354_model_6 0.6419566 0.04217833 0.019267584
7 GBM_2_AutoML_2_20240715_112354 0.6392135 0.04296885 0.015146165
8 GBM_5_AutoML_2_20240715_112354 0.6369565 0.04737734 0.014242429
9 GBM_3_AutoML_2_20240715_112354 0.6306030 0.04315451 0.017148832
10 GLM_1_AutoML_2_20240715_112354 0.6184928 0.04273163 0.013397685
11 GBM_4_AutoML_2_20240715_112354 0.6128366 0.04404961 0.017256576
12 GBM_1_AutoML_2_20240715_112354 0.6102838 0.04344093 0.017230821
13 GBM_grid_1_AutoML_2_20240715_112354_model_5 0.6025600 0.04610814 0.015627881
14 DRF_1_AutoML_2_20240715_112354 0.5838888 0.12546788 0.011843079
15 XRT_1_AutoML_2_20240715_112354 0.5822970 0.10035710 0.013878478
16 GBM_grid_1_AutoML_2_20240715_112354_model_1 0.5782248 0.04407528 0.010552519
17 DeepLearning_1_AutoML_2_20240715_112354 0.5675395 0.04453148 0.009118853
18 GBM_grid_1_AutoML_2_20240715_112354_model_4 0.5382861 0.04465290 0.010626531
19 DeepLearning_grid_3_AutoML_2_20240715_112354_model_1 0.5336919 0.07584625 0.009881718
20 DeepLearning_grid_2_AutoML_2_20240715_112354_model_2 0.5260784 0.04352110 0.007827509
21 DeepLearning_grid_3_AutoML_2_20240715_112354_model_2 0.5078914 0.04432943 0.008001891
22 DeepLearning_grid_2_AutoML_2_20240715_112354_model_1 0.5016731 0.08034842 0.007814751
23 DeepLearning_grid_2_AutoML_2_20240715_112354_model_3 0.4963519 0.05204610 0.007964430
24 DeepLearning_grid_3_AutoML_2_20240715_112354_model_3 0.4912107 0.04486644 0.007313437
25 DeepLearning_grid_1_AutoML_2_20240715_112354_model_3 0.4863326 0.06203427 0.006886737
26 DeepLearning_grid_1_AutoML_2_20240715_112354_model_2 0.4802334 0.06233440 0.006977481
27 DeepLearning_grid_1_AutoML_2_20240715_112354_model_1 0.4576477 0.09281279 0.006531578
mean_per_class_error rmse mse training_time_ms predict_time_per_row_ms
1 0.4649377 0.08560759 0.007328660 240 0.002241
2 0.4674436 0.08516377 0.007252868 10528 0.015256
3 0.4671301 0.08530746 0.007277363 6270 0.010015
4 0.4669777 0.08522069 0.007262566 286 0.002469
5 0.4773328 0.08560238 0.007327767 219 0.002430
6 0.4650554 0.08526429 0.007269999 296 0.002275
7 0.4816753 0.08593366 0.007384594 247 0.002736
8 0.4856394 0.08984690 0.008072465 197 0.002155
9 0.4829930 0.08599712 0.007395505 235 0.002611
10 0.4576672 0.08537578 0.007289024 150 0.000448
11 0.4841615 0.08614064 0.007420211 281 0.003052
12 0.4803209 0.08545493 0.007302544 307 0.003674
13 0.4798450 0.08662116 0.007503225 540 0.004356
14 0.4801335 0.08891701 0.007906235 1031 0.010021
15 0.4868217 0.08791825 0.007729618 1323 0.010310
16 0.4803910 0.08586281 0.007372422 272 0.001851
17 0.4499864 0.08550935 0.007311848 1215 0.002942
18 0.4908994 0.08563700 0.007333696 281 0.003109
19 0.4769706 0.08577441 0.007357249 54077 0.013306
20 0.4791015 0.08546600 0.007304437 63833 0.004097
21 0.4956534 0.08549949 0.007310164 43816 0.005734
22 0.4908381 0.08577565 0.007357462 49244 0.007895
23 0.4920060 0.08606469 0.007407130 69350 0.003305
24 0.4983835 0.08550563 0.007311213 48252 0.003314
25 0.4903757 0.08570565 0.007345459 38580 0.001912
26 0.4922397 0.08569421 0.007343498 44025 0.002610
27 0.5000000 0.08577906 0.007358047 30740 0.003114
algo
1 GBM
2 StackedEnsemble
3 StackedEnsemble
4 GBM
5 GBM
6 GBM
7 GBM
8 GBM
9 GBM
10 GLM
11 GBM
12 GBM
13 GBM
14 DRF
15 DRF
16 GBM
17 DeepLearning
18 GBM
19 DeepLearning
20 DeepLearning
21 DeepLearning
22 DeepLearning
23 DeepLearning
24 DeepLearning
25 DeepLearning
26 DeepLearning
27 DeepLearning

[27 rows x 10 columns]
```

We can also get the best model from the leaderboard.

``` r
bmdl <- h2o.get_best_model(mdlres)
bmdl
```

There is a lot of output here. Here we will go through some of the output.

``` r
Model Details:
==============

H2OBinomialModel: gbm
Model ID: GBM_grid_1_AutoML_2_20240715_112354_model_2
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth
1 26 26 6588 4 4 4.00000
min_leaves max_leaves mean_leaves
1 11 16 15.57692
```

The best among the 25 models is a GBM model. This is a Gradient Boosted Machine model. Gradient Boosted Machine is a type of tree-based model that builds multiple trees in sequence, with each tree learning from the errors of the previous tree. We will discuss the details of different models in later chapters. Next come details of the model that we will discuss when we learn more about different model types: The model has 26 trees with a minimum depth of 4 and a maximum depth of 4.

#### Model fit

Next in the output are the model fit statistics, which measure how well the model fits the data. There are pros and cons to each of these measures. Here we will discuss two common measures of fit: one for binary classification models and one for regression models. We will then examine the output of the model fit statistics.

- AUC: The Area Under the Receiver Operating Characteristic Curve (AUC) is a measure of how well the model can distinguish between the two possible outcomes in a binary model. The AUC ranges from 0 to 1: a value of 0.5 indicates a model that is no better than random, values closer to 1 indicate a model that separates the two outcomes well, and values below 0.5 indicate a model that does worse than random. The AUC summarizes the trade-off between the true positive rate and the false positive rate. The true positive rate is the proportion of actual positives that are correctly predicted as positive. The false positive rate is the proportion of actual negatives that are incorrectly predicted as positive.

We can use the “confusion matrix” to understand the AUC. The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. In this type of language, positive means binary values equal to one and negative means binary values equal to zero. In other words then, the confusion matrix shows the observations in which the model predicted a one and the actual value was a one (true positive), the observations in which the model predicted a zero and the actual value was a zero (true negative), the observations in which the model predicted a one and the actual value was a zero (false positive), and the observations in which the model predicted a zero and the actual value was a one (false negative).

H2O reports the confusion matrix with the actual values in the rows and the predictions in the columns. An example confusion matrix is shown below.

| | Predicted 0 | Predicted 1 |
|----------|---------------------|---------------------|
| Actual 0 | True negative (TN) | False positive (FP) |
| Actual 1 | False negative (FN) | True positive (TP) |

We then define the true positive rate and the false positive rate with the following formulas:

True positive rate = TP / (TP + FN)

False positive rate = FP / (FP + TN)

We can see these pieces in the results in the confusion matrix.

``` r
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0   1    Error        Rate
0      75628 458 0.006020  =458/76086
1        496  68 0.879433    =496/564
Totals 76124 526 0.012446  =954/76650
```

In our model, the number of observations that were actually zero and predicted as zero (true negatives) is 75628. The number of observations that were actually one and predicted as one (true positives) is 68.

The number of observations that were actually zero but predicted as one (false positives) is 458. The number of observations that were actually one but predicted as zero (false negatives) is 496.

The true positive rate comes from the row where the actual value is one (the second row): 68 / (68 + 496) = 0.12. This means that of the observations that actually contained a misstatement, the model correctly flagged about 12%.

The false positive rate comes from the row where the actual value is zero (the first row): 458 / (458 + 75628) = 0.006. This means that of the observations that did not contain a misstatement, the model incorrectly flagged about 0.6% as misstatements.
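
As a quick arithmetic check, we can recompute both rates from the counts in the confusion matrix above.

``` r
tn <- 75628; fp <- 458   # actual zeros
fn <- 496;   tp <- 68    # actual ones

tp / (tp + fn)   # true positive rate, about 0.12
fp / (fp + tn)   # false positive rate, about 0.006
```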

By themselves, these give us some indication of the model’s performance. The AUC is a summary of these two rates and the trade-off between them.

The next point to understand is that in a binary classification model the model predicts a probability. The probability is then used to determine the classification: the model classifies an observation as a one if the probability is greater than a certain threshold. The confusion matrix above is reported at the F1-optimal threshold, the threshold that gives the best balance between precision (how often predicted ones are actually ones) and recall (how many of the actual ones are caught). We can choose a different threshold to get different results. For example, we could decide that any probability above 0.75 should be classified as a one and anything below 0.75 as a zero. Setting a lower threshold means we will miss fewer ones (false negatives will be lower), but we will also have more false positives. The opposite is true if we set a higher threshold.
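
To see the effect of choosing our own cutoff, we can classify the predicted probabilities directly. A minimal sketch using the training data; the 0.10 cutoff is purely illustrative.

``` r
# Score the training data and apply our own probability cutoff
trn_pred <- as.data.frame(h2o.predict(bmdl, trn.h2o))

cutoff <- 0.10                                   # illustrative cutoff
my_class <- ifelse(trn_pred$p1 > cutoff, 1, 0)   # classify as 1 when p1 exceeds the cutoff

table(predicted = my_class, actual = train$misstate)
```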

The AUC summarizes the model’s performance across all possible thresholds. It is the area under the curve created by plotting the true positive rate against the false positive rate for all possible thresholds. The more area under the curve, i.e., the higher the true positive rate for any given false positive rate, the better the model separates the ones from the zeros.

``` r
perf <- h2o.performance(bmdl)
plot(perf, type = "roc")
```

The plot, called the ROC curve, is shown below.

![ROC curve](roc.png)

The measure of the area under the curve is the AUC. In our model output, the AUC is 0.745. This means that the model is better than random and reasonably good. However, the individual rates caution that we cannot be too confident in the model’s predictions.

``` r
H2OBinomialMetrics: gbm
** Reported on training data. **

MSE: 0.007102116
RMSE: 0.08427405
LogLoss: 0.03976389
Mean Per-Class Error: 0.4427261
AUC: 0.7450582
AUCPR: 0.0660392
Gini: 0.4901164
R^2: 0.02763741
AIC: NaN
```

- R\^2: The other commonly used model fit measure is R-squared. R-squared is a measure of how well the model explains the variation in the outcome variable, and it makes the most sense with continuous outcome variables. R-squared ranges from 0 to 1, with 0 indicating that the model explains none of the variation in the outcome variable and 1 indicating that the model explains all of the variation in the outcome variable.

R-squared is the proportion of the variance in the outcome variable that is explained by the model. The formula for R-squared is shown below.

R-squared = 1 - (SSE/SST)

In this formula, SSE is the sum of squared errors and SST is the total sum of squares. The sum of squared errors is the sum of the squared portion of each outcome that is not explained by the model, i.e., the sum of the squared differences between the predicted values and the actual values. The total sum of squares is the sum of the squared differences between the actual values and the mean of the actual values.
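
As a small worked example of the formula (the values of y and yhat below are made up for illustration):

``` r
y    <- c(2, 4, 6, 8)        # hypothetical actual values
yhat <- c(2.5, 3.5, 6.5, 7)  # hypothetical predicted values

sse <- sum((y - yhat)^2)     # sum of squared errors
sst <- sum((y - mean(y))^2)  # total sum of squares
1 - sse / sst                # R-squared, about 0.91
```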

We will return to r-squared when we discuss specific regression models.

In addition to getting a summary of the overall fit of the model, we can evaluate which features matter most to the model. How feature (variable) importance is measured depends on the model. Importance could be measured by how much model fit changes when a variable is removed, by how often a feature appears in a model that iterates through variables as in some tree-based models, or by some other measure.

We can do variable importance on the training data as follows.

``` r
vi <- h2o.varimp(bmdl)
vi
```

This gives us the following output.

``` r
Variable Importances:
variable relative_importance scaled_importance percentage
1 market_share 17.851400 1.000000 0.434445
2 sales_growth 15.475105 0.866885 0.376614
3 excess_invent_growth 6.734966 0.377279 0.163907
4 loss 1.028610 0.057621 0.025033
```

In the output the features are sorted by importance. The most important feature is market share. The relative importance is a measure of how much the feature contributes to the model. The scaled importance is the relative importance divided by the largest relative importance, so the most important feature has a scaled importance of 1. The percentage is the share of the total importance that the feature contributes, and the percentages sum to 1. Information on feature importance in h2o can be found [here](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html).

Given the variable importance output, we conclude that the most important feature for predicting fraud is market share. There may be other important variables that are not in our model, but given what we have put into the model, this is the most important. The loss indicator variable explains the least.
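
H2O can also draw the training-data importance as a bar chart; h2o.varimp_plot is part of the h2o package.

``` r
# Bar chart of variable importance for the best model (training data)
h2o.varimp_plot(bmdl)
```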

## Create predictions for the testing data

Next, we create predictions for the testing data. Note that we only move to this step once we are satisfied with the training results. For example, if we are not satisfied with the AUC, we may want to go back and collect more data, improve the current features, or expand the list of features. Once we are satisfied, we can import the testing data into h2o and predict the outcome variable with the best model. Note that anything we did to the training data must also be done to the testing data.

``` r
test <- test %>%
  mutate(misstate = as.factor(misstate))

tst.h2o <- as.h2o(test)
```

With the data imported to h2o, we can create the predictions using the testing data and the model we have created.

``` r
prediction <- h2o.predict(bmdl, tst.h2o)
```

We can return the prediction to R to use it in whatever application we have with the following code.

``` r
test$pred_misstate <- as.data.frame(prediction)$predict
```
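
The prediction frame also contains the predicted probabilities (the p0 and p1 columns shown later in the chapter). If we want the probability of a misstatement as well as the predicted class, we can keep it in a new column; prob_misstate is just an illustrative name.

``` r
# Keep the predicted probability of a misstatement alongside the predicted class
test$prob_misstate <- as.data.frame(prediction)$p1
```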

### Evaluate the model on the testing data

We can then evaluate the model on the testing data. To do so, we will use the DALEX package introduced earlier, which provides tools for visualizing and interpreting the model’s predictions and performance.

Machine learning models are often referred to as black box models. This means that the model makes predictions, but it is not always clear how the model is making those predictions. Explainable AI is an approach and set of tools designed to help us understand how the model is making predictions. Many of these tools rely on “what if” scenarios. Some of these methods are computationally intensive because they require making predictions for many different scenarios. Here we will work through some of the most common methods.

The DALEX package first creates an “explain” object that it uses to create different outputs. The “explain” object includes the testing data and predictions along with other information needed to create the output.

Note that, unfortunately, some DALEX functions require a numeric binary outcome, so we need to convert the factor back to a number before creating the explain object.

``` r
test2 <- test %>%
  mutate(misstate = as.numeric(as.character(misstate)))
tst2.h2o <- as.h2o(test2)

expln <- DALEXtra::explain_h2o(
  model = bmdl,
  data = tst2.h2o[, x],
  y = tst2.h2o[, y])
```

With the explain object we can create different explainable AI outputs.

### Testing data fit

We can get the testing data fit with the following code.

``` r
model_performance(expln)
```

The AUC for the training data was 0.745. The AUC for the testing data is 0.658. The fit for the testing data is worse than for the training data. This is common because models often fit random variation in the training data that is not present in the testing data. In the extreme case a model will perform well on the training data and be useless or even wrong on the testing data. This problem is referred to as overfitting. Various approaches are used to reduce overfitting, one of which is cross-validation. We will discuss other approaches to reducing overfitting when we discuss specific models.

In this case, the fit is better than a random guess (AUC = 0.5), but not so high that we feel confident in the predictions.
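
As a cross-check, we can also compute the test-set performance directly in h2o by passing the testing frame to h2o.performance; a minimal sketch:

``` r
# Evaluate the best model on the testing frame directly in h2o
perf_tst <- h2o.performance(bmdl, newdata = tst.h2o)
h2o.auc(perf_tst)   # AUC on the testing data
```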

### Variable importance

Variable importance, as stated before, measures how important each feature is to the model. On the testing data, we can use the following code to get the variable importance. Note that this can take some time because it has to work through each feature and re-make predictions; the more features there are, the longer it will take.

``` r
varimp <- variable_importance(expln)
plot(varimp, show_boxplots = FALSE)
```

The feature importance plot is shown below.

![Feature Importance](varimp.png)

As shown in the plot, on the testing data, the most important feature is market share. The least important feature is excess_invent_growth.

### Partial dependence plots

Other methods are used to evaluate the direction and size of the effect of each feature. These include partial dependence plots and Shapley values. Here we will use partial dependence plots, which play a role similar to coefficients in linear regression models in that they give the feature effects an interpretable meaning.

Partial dependence plots show the relationship between a feature and the model’s predictions while holding all other features constant. This allows us to see how the model’s predictions change as the feature changes.

``` r
pdp <- model_profile(expln)
plot(pdp)
```

The partial dependence plot is shown below.

![Partial Dependence Plot](pdp.png)

The partial dependence plot shows the relationship between each feature and the model’s predictions. The market share plot shows that at the lowest levels of market share there is a small increase in the probability of fraud, although this probability is very small. The biggest effect of market share is at the highest levels, i.e., more than 5 standard deviations above the mean: the highest levels of market share have the largest effect on the predicted probability of fraud. excess_invent_growth and loss seem to have no effect on the predicted probability of fraud. Sales growth has a small positive effect on predicted fraud.
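
If we want to focus on a single feature, model_profile accepts a variables argument, so we can profile and plot just market share, for example:

``` r
# Partial dependence for a single feature
pdp_ms <- model_profile(expln, variables = "market_share")
plot(pdp_ms)
```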

## Using and communicating results

### Using the predictions

Once we have evaluated the model, we can use the results to make decisions. We can use the model to predict fraud, and we can use the results to understand what features are important for predicting fraud. Suppose we are on a fraud team at the SEC and would like to use a predicted fraud value for two companies we are looking at (let’s say Walmart, gvkey = 011259, and Target, gvkey = 003813) to determine whether we should investigate further. Let’s get the last year we have available for those companies in the data.

``` r
checkdta <- tmp %>%
  filter(gvkey %in% c(011259, 003813)) %>%
  group_by(gvkey) %>%
  filter(fyear == max(fyear))

checkdta

  gvkey fyear misstate sales_growth excess_invent_growth  loss market_share
  <int> <int>    <int>        <dbl>                <dbl> <dbl>        <dbl>
1  3813  2014        0       -0.180                0.486     1         5.31
2 11259  2014        0       -0.177                0.477     0         5.31
```

Notice that we do not currently have a prediction for these observations, and misstate is currently zero for both. This is for fiscal year 2014. We can use the model to predict the probability of fraud for these companies.

``` r
chk.h2o <- as.h2o(checkdta)
prediction <- h2o.predict(bmdl, chk.h2o)
prediction <- as.data.frame(prediction)
prediction

  predict        p0         p1
1       0 0.9884465 0.01155353
2       0 0.9898673 0.01013269
```

The first column, “predict”, is the predicted outcome; in this case, neither company is predicted to have fraud. The second column is the probability that the outcome is zero and the third column is the probability that the outcome is one. Let’s say that as part of the SEC team, we want to select the company with the higher probability of fraud to investigate. The first row is Target with a 1.2% probability of fraud and the second row is Walmart with a 1.0% probability of fraud. We would select Target for further investigation (even though the probability is quite low).

An alternative approach could be to select only the highest-risk companies. We could do this either by setting a threshold for the probability (a sketch of this appears after the output below) or by selecting those predicted to have a fraud. We have already added the prediction to the test data set, so we can do the latter as follows.

``` r
test %>%
  select(gvkey, fyear, pred_misstate) %>%
  filter(pred_misstate == 1 & fyear == 2014)
```

In the same year as the last Walmart and Target observations, we have two observations that are predicted to have fraud.

``` r
   gvkey fyear pred_misstate
   <int> <int> <fct>
1 143748  2014 1
2 171045  2014 1
```
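
As noted above, an alternative is to rank firms by predicted probability rather than relying only on the predicted class. A minimal sketch, assuming we also saved the p1 probability column to the test data (for example as prob_misstate, as suggested earlier):

``` r
# Ten highest-risk firm-years in 2014 by predicted probability of misstatement
test %>%
  filter(fyear == 2014) %>%
  arrange(desc(prob_misstate)) %>%
  select(gvkey, fyear, prob_misstate) %>%
  head(10)
```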

### Communicating results

There are many ways that we might choose to communicate our results. We could write a report, create a presentation, or create a dashboard. We could also create a report that is automatically updated with the latest data. In this case, let’s lay out the critical pieces that might allow someone to understand what we have done. We will not create a full report or presentation, rather we will create the pieces that could be used in a coherent presentation or discussion.

We will presume that we are on the SEC team deciding which companies to investigate for fiscal year 2014.

1. Introduce the problem: We are trying to identify the highest fraud risk companies to best allocate our investigation resources.

2. Introduce the suggested solution: We are creating a model to predict fraud risk. The output will be predicted fraud (or it could be the probability of fraud). We can then choose only the highest predicted fraud companies to investigate.

3. Explain the data: We have used a random sample of companies from Compustat annual and labels for whether the companies have had an accounting misstatement. The data comes from a research paper: Yang Bao, Bin Ke, Bin Li, Julia Yu, and Jie Zhang (2020). Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research, 58 (1): 199-235. We created the following features:

- Sales growth: The percentage growth in sales from the previous year.
- Excess inventory growth: (change in inventory - change in accounts payable) / lagged inventory.
- Market share: The company’s sales as a percent of total sales of all companies in that year.
- Loss: A binary variable indicating whether the company reported a loss (negative net income) in that year.

4. Explain the model: We trained 25 different models on the random sample. The best model was a Gradient Boosted Machine model. The model had an AUC of 0.745 on the training data and 0.658 on the testing data. The most important feature for predicting fraud was market share. The following graphs describe the model’s predictions on testing data.

![Feature Importance](varimp.png)

![Partial Dependence Plot](pdp.png)

5.  Explain the predictions: For 2014, we identified two companies with a predicted fraud.
``` r
   gvkey fyear pred_misstate
   <int> <int> <fct>
1 143748  2014 1
2 171045  2014 1
```

### Critical thinking

If you were to make the presentation based on the outline above, you could expect questions and challenges. Part of critical thinking is being able to respond to these questions and challenges. Perhaps more importantly, critical thinking is being able to create such questions and challenges for yourself. In response to your own questions and challenges you might return to the data, the features, the model, the predictions, and/or the presentation.
Here are examples of some questions and challenges that might apply to the presentation above.
1.  How did you deal with outliers?
2.  How did you choose the features?
3.  Do these features seem reasonable or powerful enough to predict fraud sufficiently that we should rely on the model output? If not, what are other features that might be useful?
4.  Are you sufficiently confident in the model’s predictions to rely on them? If not, what could you do to improve the model?
5.  Does the training data seem reasonably representative of the data we are trying to predict for? Why or why not?
6.  What are the alternatives to relying on the model? What are the pros and cons of these alternatives?
7.  Is the model time and cost effective?
8.  Does the presentation have a logical flow and is it logically consistent and clear?
9.  Could someone not familiar with the data, the model, or machine learning understand the presentation?

## Shut down h2o

If we are happy with the model, we can save the model and load it in later.
``` r
h2o.saveModel(bmdl, path = "full file path")
```
Later when we want to read the model back into h2o, we can use the following code.
``` r
bmdl <- h2o.loadModel("full file path")
```
Once we are ready, we can shut down h2o.
``` r
h2o.shutdown()
```

## Alternative computing environments

Using the data and code above, we can run and evaluate the models in a reasonable amount of time. The larger the data, the larger the number of features, and the more complex the models, the longer it takes to run. If we cannot dedicate computing time to the task, or if the task is too large for our computer, we can use alternative computing environments.
There are a number of options. We can use a cloud service such as AWS, Azure, or Google Cloud. We can use a cluster or a GPU. We can use parallel processing. We can use a different computer with more memory and processing power.
I will run and evaluate the same model using one common option that is free for limited use and relatively affordable for moderate use: Google Colab. Most importantly, it is fairly easy to use. I will use the free version of Google Colab and provide a summary below. There are tutorials available online; for example, [here](https://www.geeksforgeeks.org/how-to-use-r-with-google-colaboratory/).
To use R with Google Colab, you have to change the runtime to R. The free account can run everything we have above. I used the default settings (the one change I made was to delete nthreads=10 so that h2o uses whatever Google allocates). Prior to running the code, you can also upload the csv file to your files in Google Colab.
Training the 25 models with h2o took 40 minutes to run.

## Review

In this chapter, we used h2o to create a model to predict fraud. We used the h2o automl function to create 25 models and then selected the best model. We evaluated the model on the testing data and used explainable AI tools to understand the model’s predictions. We then used the model to predict fraud for two companies. We outlined a presentation to communicate the results. We also discussed alternative computing environments.

### Conceptual questions

1.  What is the difference between the training data and the testing data?
2.  Describe the alternative approaches to splitting the data into training and testing data.
3.  List the possible algorithms available with h2o autoML.
4.  Describe how cross-validation is used to train a model.
5.  Define true positive rate and false positive rate.
6.  Describe the calculation and meaning of AUC.
7.  Describe the calculation and meaning of R-squared.

### Practice questions

1.  Create a training data set and a testing data set from the data above by splitting by fyear where years before 2009 are training data and years after 2009 are testing data.
2.  What is the percent of observations with misstate equal to one in the training and testing data sets created in number 1?
3.  Why might the answer to number 2 raise concerns about the training and testing data sets?
4.  List the steps for using h2o autoML to create and use a model.
5.  Describe the changes that would need to be made to the code in the chapter to create and add a new feature to the model.
6.  Show the change in the code necessary to limit autoML to a time limit of 30 minutes rather than limiting the number of models to try.
7.  What are the possible reasons that the model fit is worse on the testing data than on the training data?
8.  Suppose you are choosing between two models for fraud. The first model has an AUC of 0.9 and the second model has an AUC of 0.4. What do the AUCs tell you about the models?
9.  What are the explainable AI tools that we used in the chapter to understand how features affect the model’s predictions?
10. What are the steps to create predictions for the testing data?

## Solutions to practice questions

1.  This could be done with the code below.
``` r
train <- tmp %>%
  filter(fyear < 2009)

test <- tmp %>%
  filter(fyear >= 2009)
```
2.  This could be done with the code below.
``` r
train %>%
  summarize(percent = mean(misstate))

test %>%
  summarize(percent = mean(misstate))
```
This gives the following output:
``` r
> train%>%
+ summarize(percent = mean(misstate))
# A tibble: 1 × 1
  percent
    <dbl>
1 0.00813
>
> test%>%
+ summarize(percent = mean(misstate))
# A tibble: 1 × 1
  percent
    <dbl>
1 0.00402
>
```
3.  The percent of observations with misstate equal to one is 0.8% in the training data and 0.4% in the testing data. The misstatement rate in the training data is nearly double that in the testing data. This raises the concern that there is something different about the training data that may make the model less effective on the testing data. The lower percent of misstatements in the testing data also makes it more difficult to evaluate the model on the testing data.
4.  The steps for using h2o autoML to create and use a model are as follows:
    - Load the H2O library and initialize the H2O cluster.
    - Load the data into the H2O cluster.
    - Specify the target variable and the predictor variables.
    - Run the AutoML function to build the model.
    - Get the best model from the AutoML results.
    - Create predictions for the testing data.
    - Evaluate the model on the testing data.
    - Understand and communicate the results.
    - Stop the H2O cluster.
5.  To create and add a new feature to the model, we would need to create the feature in the data before loading it into h2o. This would include creating the feature, treating outliers, and normalizing the feature. The feature would then need to be added to the list of x variables used to model the outcome variable.
6.  To limit autoML to a time limit of 30 minutes, we would replace the max_models argument with the max_runtime_secs argument in the h2o.automl function. The code would look like this.
``` r
mdlres <- h2o.automl(x = x, y = y,
                     training_frame = trn.h2o,
                     max_runtime_secs = (60*30),
                     seed = 1)
```
7.  The possible reasons that the model fit is worse on the testing data than on the training data could be caused by overfitting or by the training data not being representative of the testing data.
8.  The AUC is a measure of how well the model can distinguish between the two possible outcomes in a binary model. The AUC ranges from 0 to 1, with 0 indicating that the model explains nothing in the output, 0.5 indicating a model that is no better than random, and 1 indicating that the model explains all outcomes. The AUC of 0.9 indicates that the first model is very good at distinguishing between the two outcomes. The AUC of 0.4 indicates that the second model is worse than random at distinguishing between the two outcomes.
9.  We used variable importance and partial dependence plots to understand how features affect the model’s predictions. Variable importance measures how important each feature is to the model. Partial dependence plots show the relationship between a feature and the model’s predictions while holding all other features constant. This allows us to see how the model’s predictions change as the feature changes.
10. The steps to create predictions for the testing data are as follows:
    - Apply the same data steps to the testing data as to the training data. Note that in the code in the chapter, we did most of the data preparation prior to separating the data into training and testing data. We only needed to change the outcome variable to a factor because we did this after separating the data into training and testing data.
    - Use the testing data with the model created on the training data to create predictions for the testing data.
    - Return the predictions to R to use them in whatever application we have.

Tutorial video

Note: include practice and knowledge checks

Mini-case video

Note: include practice and knowledge checks

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.