"

# Chapter 11: Introduction to Modeling

Learning outcomes

At the end of this chapter, you should be able to

  • Explain modeling principles
  • List modeling techniques and tools
  • Describe data modeling pitfalls
  • Describe strengths and weaknesses of modeling techniques for decision making

Chapter content

Note: include practice and knowledge checks throughout chapter

From earlier material:

This chapter will work to create intuition for how all models work. We will define some terms, lay out some principles, and provide some examples. In later chapters, we will use the intuition from this chapter to provide more specific details. This chapter will provide some model details but will not require an in-depth understanding of those details. For now, when details are shown here, work to develop the modeling intuition rather than the specific details by which the modeling occurs. We will later return to work through the details.
## Modeling steps
```{mermaid}
%%| fig-cap: |
%%|  Figure 1. This figure shows the steps in the modeling process.
flowchart LR
  A[Find patterns in training data] --> B(Assume the same patterns for new data)
  B --> C[Make predictions]
```
All models use training data to find patterns in data. Different modeling methods have different objectives and use the training data in different ways, but the goal of finding patterns that can be used to make inferences in new data is common to all modeling methods. The biggest difference between models is the type of patterns that they seek to identify in the training data. After a model is trained, the patterns from the training data are assumed to also apply to new data. The patterns from the training data are then combined with the new data to make predictions.
An important note here is that statistical, econometric, and machine learning models are most commonly applied when patterns are not deterministic. That means that the predictions made by combining the training data patterns with the new data are uncertain estimates and probabilities rather than outcomes determined by deterministic patterns. Therefore, accurately described predictions might use language like this: “given the training data for the model, the model we used, and the observed data, this is our best prediction. We estimate that the range of possible outcomes is…”
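A minimal sketch of these three steps in R is below; the variable names and numbers are made up purely for illustration, not taken from any real data.
```{r, eval=FALSE}
# Step 1: find patterns in training data (here, a simple linear fit).
training <- data.frame(x = c(1, 2, 3, 4, 5),
                       y = c(2.1, 3.9, 6.2, 8.1, 9.8))  # made-up outcome, roughly 2*x
fit <- lm(y ~ x, data = training)

# Step 2: assume the same pattern holds for new data.
new_data <- data.frame(x = c(6, 7))

# Step 3: make predictions (uncertain estimates, not determined outcomes).
predict(fit, new_data)
```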
Let’s begin with a very simple example that we will use to provide some concreteness to the modeling steps. We will then expand this example to more complex models.
## Simple example
Suppose we want to predict which companies commit accounting fraud. A regulator might do this when deciding whether to investigate a company, or an investor might do it as part of an investment strategy. We can observe companies that have committed fraud, when they committed the fraud, and their annual financial statements. To model accounting fraud, we need observations with known accounting fraud and we need observations that can be used as a benchmark, i.e. the non-fraud observations. Two possibilities might include choosing observations for the same companies in periods when they were not committing fraud or choosing observations for different companies that did not commit fraud.
For this example, I will use data from a paper that wanted to create a model to predict accounting fraud. The paper citation is below with a link to the paper.
Yang Bao, Bin Ke, Bin Li, Julia Yu, and Jie Zhang (2020). Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research, 58 (1): 199-235. [Link here](https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12292). Note that there is also an erratum to the published paper [here](https://onlinelibrary.wiley.com/doi/10.1111/1475-679X.12454).
The data from the paper and descriptions of the data and variables are available publicly on github. The link to the data and the descriptions is available [here](https://github.com/JarFraud/FraudDetection). I will use a simplified sample of the data for this example.
I’m only keeping companies that at some point during the sample have been caught committing fraud. Therefore I will compare periods when companies commit fraud with periods when companies do not commit fraud (or are not caught committing fraud) to try to find patterns in the data that might tell us something about the characteristics that tend to be associated with committing fraud.
### Describe data
To begin, I will describe the data that I am using and provide some summary statistics.
From the data explanation (the README.md doc on github), the variable descriptions are as follows:
-   misstate – fraud label equal to one for a fraud observation and zero for a non-fraud observation
-   at – total assets at fiscal year end (there are multiple years per company, with some fraud years and some non-fraud years)
-   bm – the book-to-market ratio, i.e. the accounting value of equity divided by the market value of equity
-   EBIT – earnings before interest and taxes divided by total assets
-   ch_roa – year-to-year change in return on assets (net income divided by total assets in year t minus the same ratio in year t-1)
-   dch_wc – year-to-year change in working capital accruals (non-cash working capital)
-   ch_fcf – year-to-year change in free cash flows
A small clip of the data is shown below.
```{r,cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(kableExtra)
df<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
# Keep only companies that have at least one fraud year in the sample
df<-df %>%
  group_by(gvkey) %>%
  mutate(hasfraud = max(misstate)) %>%
  ungroup() %>%
  filter(hasfraud==1)
df<-df%>%select(gvkey,fyear,misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)
# Show the first few rows, rounded for readability
head(df)%>%
 mutate(across(misstate:ch_fcf,~round(.,digits=2)))%>%
 kable("html",caption = "Snapshot of the training data")%>%kable_styling()
```
gvkey is a company identifier from Compustat/CapitalIQ. fyear is the fiscal year that the annual report information applies to. For example, the first row contains information from the annual financial statements of company “1009” for the fiscal year 1990. There are multiple companies per year (and multiple years per company). Fraud observations include rows 4 and 5, where misstate equals 1; in 1990, companies “1286” and “1513” were found to have committed fraud. Each row holds the financial information from one company’s annual financial statements for one fiscal year, and each column indicates a different piece of information from those financial statements. This means that any information we have about company “1009” in the fiscal year 1990 is contained in the same row. In 1990, “1009” had total assets of \$32 million and EBIT of 16%.
To help understand the data better, the table below provides summary statistics.
```{r, cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(kableExtra)
df<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
df<-df %>%
  group_by(gvkey) %>%
  mutate(hasfraud = max(misstate)) %>%
  ungroup() %>%
  filter(hasfraud==1)
df<-df%>%select(gvkey,fyear,misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)
a<-df%>%select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  summarise_all(mean,na.rm=TRUE)%>%
  mutate_all(~round(.,digits=2))%>%
  mutate(Stat="Mean",.before=misstate)
b<-df%>%select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  summarise_all(min,na.rm=TRUE)%>%
  mutate_all(~round(.,digits=2))%>%
  mutate(Stat="Min",.before=misstate)
c<-df%>%select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  summarise_all(max,na.rm=TRUE)%>%
  mutate_all(~round(.,digits=2))%>%
  mutate(Stat="Max",.before=misstate)
rbind(a,b,c)%>%
  kable("html",caption="Training sample statistics")%>%kable_styling()
```
The statistics in the table above for the selected variables are the mean, the minimum, and the maximum. The mean is a standard statistic used in many techniques for describing the central tendency of the data and to test differences across groups. I included the minimum and maximum values to help understand the ranges of the variables.
The misstate variable is a label for fraud. The mean of a variable that contains only 1s and 0s is the percent of observations where the label is a one. A mean of 0.2 therefore indicates that 20% of the observations are fraud years. This means that 80% of the years in the sample, for companies that at some point do commit fraud, are non-fraud years. If we were to include observations for companies that never committed fraud, the percent of fraud observations would be much lower.
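As a quick illustration of that arithmetic:
```{r, eval=FALSE}
# The mean of a 0/1 indicator is the share of ones.
misstate_example <- c(1, 0, 0, 0, 0)  # one fraud year out of five
mean(misstate_example)                # 0.2, i.e. 20% fraud years
```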
Total assets (at) is in \$millions. The mean total assets is therefore \$5 billion. However, the range in asset sizes is from less than \$1 million to more than \$300 billion. Mean EBIT (really a return-on-assets measure here, so the variable name is a little misleading) is -7%. Firms that at some point commit fraud seem to have average losses. However, the average may be driven by the extreme -702%. Note that these variables have already had extreme values adjusted to be less extreme, so the -702% is not even as extreme as in the raw data. ch_roa and ch_fcf give a similar picture. dch_wc is the change in non-cash working capital. This measure is often referred to as accruals because it is roughly the portion of net income (scaled by assets) that is not operating cash flows. In other words, these are the net adjustments to operating cash flows to reach net income. A rough interpretation is that return on assets (ROA) for these companies is on average 1% higher (e.g. -8% to -7% for EBIT) because of non-cash adjustments to operating cash flows. Research generally presumes that companies that try to fraudulently manipulate earnings do so to increase performance measures relative to cash flows, so perhaps this is expected.
### Compare fraud and non-fraud observations (trying to find patterns in the training data)
We can think of this sample as training data. Training data is data that we use to find patterns that we can then use on new data to make predictions. In this example, we want to find patterns associated with companies committing fraud so that, when we move to new data in which we do not know whether companies committed fraud, we can predict whether they are likely to have done so.
Modeling is an ever-expanding set of techniques for detecting and summarizing patterns in training data. We can start with the simplest version of modeling, which might be labeled exploratory analysis. Let’s describe what is different between fraud and non-fraud years. Note that this is different from asking what might be able to predict fraud, but we will come back to that.
When modeling patterns in data, we have to make some choices about how to measure and identify the patterns. One of the most common choices is to use the mean of a column for different sub samples to identify how those sub samples are different. We might ask how years that companies commit fraud are different from years that companies do not commit fraud. We could use the mean of a column for each group (commit fraud and not commit fraud) to test what is different between fraud and non-fraud observations. The table below presents the mean for each column for the fraud and not fraud observations. Note that we will not try to test whether the means of the two groups are different.
```{r,cache=TRUE,echo=FALSE}
df%>%select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  group_by(misstate)%>%
  summarise_all(mean,na.rm=TRUE)%>%
  mutate_all(~round(.,digits=2))%>%
  kable("html",caption="Means for fraud and non-fraud observations")%>%kable_styling()
```
What does describing the differences between the fraud and non-fraud observations tell us?
Companies have slightly larger asset values during fraud years (\$5.6 billion versus \$5.4 billion). The book-to-market ratio (accounting equity value to market value of equity) is higher during fraud years (0.5 versus 0.4). This may mean that during fraud years investors view the company less favorably than during non-fraud years. The mean return on assets (EBIT here) is negative for both groups, but less negative for fraud years (-2% versus -9%). Change in return on assets is no different. Change in non-cash working capital (or accruals) as a percent of assets is more positive for fraud years (0.02 versus 0.0). This may be consistent with companies overstating inventory, receivables, or other short-term assets during fraud years on average. Changes in free cash flows are more negative for fraud years (-0.1 versus -0.01).
We don’t really know yet at this point what might predict fraud, but a possible interpretation of these differences is that the market is pessimistic about the company, that the company needs cash, and that it overstates assets to try to improve investors’ perception of the company to get access to capital. We could identify one or two of the variables that might help predict whether a company is committing fraud. Within this same training data, we could sort company years by different variables to test whether years with higher or lower values of these variables are more or less likely to be fraud years.
I’ll sort by the last two columns: dch_wc and ch_fcf. Together, these two columns might identify when accruals are high (high dch_wc) and cash flow needs are high (low ch_fcf). I’ll rank observations into quintiles for each column and then group by these rankings to see what percent of observations are fraud years in each group.
```{r,cache=TRUE,echo=FALSE,message=FALSE,warning=FALSE}
# start with df and rank dch_wc and ch_fcf into quintiles
df %>%
  select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  mutate(dch_wc_rank = ntile(dch_wc, 5),
         ch_fcf_rank = ntile(ch_fcf, 5))%>%
  group_by(dch_wc_rank,ch_fcf_rank)%>%
  summarize(
    pctfraud = round(100*mean(misstate,na.rm=TRUE),digits=0)
  )%>%
  na.omit()%>%
  spread(key = ch_fcf_rank, value = pctfraud)%>%
  kable("html",caption="Percent of observations with fraud")%>%kable_styling()%>%
  add_header_above(c(" " = 1, "ch_fcf_rank" = 5))
```
The table shows observations grouped by their rankings on the two columns. For example, the top left cell shows the percent of observations that are fraud observations for the group in the lowest quintiles of both dch_wc and ch_fcf (27%). In general, moving from right to left, from the largest increases in free cash flows to the largest decreases, results in higher percentages of fraud observations. Similarly, moving from top to bottom in the columns for ch_fcf quintiles 1 and 3 may result in higher percentages of fraud, but the increase in percentages is not linear. For example, in the third ch_fcf quintile, the high and the low dch_wc quintiles have a larger percentage of fraud than the middle quintile.
Somewhat consistent with the earlier interpretation, the observations with the highest accruals and lowest free cash flows (the bottom left corner) have a higher percent of fraud observations (28%) than the observations with the lowest accruals and highest free cash flows (the top right corner, 12%). There are a few important points to consider. First, this sample is only for companies that at some point were known to have committed fraud. In a sample of all companies, these percentages would be much lower. Second, the variables we have used cannot cleanly separate observations into fraud and non-fraud. For example, we do not get 100% in the bottom left cell and 0% in the top right cell. This means that our ability to predict fraud, at least with these two variables, may be limited; as we will see later, this remains a problem even if we use all available variables. Third, some groups have more fraud observations than our expected bottom left cell. This may occur from random chance or because our variables do not completely capture what drives fraud.
### Next steps in the modeling process
Despite these challenges, suppose this is the best model we can come up with. We will not be very confident in our model, but when we move to new data we might then predict that company-years with low ch_fcf and high dch_wc have a higher probability of fraud than company-years with high ch_fcf and low dch_wc. Let’s say we worked with the SEC on initiating fraud investigations. We could then label the higher probability observations with red flags that might lead us to spend more time looking for fraud than we would for low probability observations. Again, note that even the low probability observations may involve fraud, so we couldn’t completely ignore them. We won’t know until later whether our model was successful on the new data. In practice, we might reserve a separate set of data, called testing data, to see if our model works on new data. We will later practice working with different training and testing data sets.
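As an illustrative sketch of how those red flags might be produced in code (the quintile cutoffs and the `red_flag` name are my own choices for illustration, not an official screening rule), assuming the `df` data used above:
```{r, eval=FALSE}
# Illustrative sketch: flag company-years in the highest-accrual (dch_wc)
# and lowest-change-in-free-cash-flow (ch_fcf) groups for closer review.
library(tidyverse)

flagged <- df %>%
  mutate(dch_wc_rank = ntile(dch_wc, 5),
         ch_fcf_rank = ntile(ch_fcf, 5),
         red_flag    = dch_wc_rank == 5 & ch_fcf_rank == 1) %>%
  filter(red_flag) %>%
  select(gvkey, fyear, dch_wc, ch_fcf)
```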
Let’s walk through the details of what we are doing here to understand the modeling steps.
First, we searched for patterns in training data to try to figure out what can help us predict fraud. We determined that low ch_fcf and high dch_wc indicate a higher probability of fraud than high ch_fcf and low dch_wc.
Second, we assume that this same pattern will apply in new data. We might consider reasons why this assumption is reasonable as well as reasons this assumption may not be reasonable. We will expand on this part of the process in the next section.
Third, based on the assumption that the same pattern will apply in new data, we use the same pattern to predict fraud in new data. We predict that observations with low ch_fcf and high dch_wc have a higher probability of fraud than observations with high ch_fcf and low dch_wc. Notice here that we have a different way of thinking. Our predictions are probabilistic. We don’t know what percent of observations have fraud until we look more closely. In practice, we may save some data to use as “new data” so that we can test whether our assumption that the same pattern applies to new data is good. The more we can test our model on new data and it works, the more confidence we have in the model. The less it works, the less confident we are. We will also expand on this part of the process below.
We will get into the details and practice of building models later. For now, we will introduce general principles that can help us understand the concepts behind building models and how they work.
## Basic modeling principles
In this section, I refer to the example we have used and provide definitions to organize the principles we need for modeling.
-   Training data includes outcomes we want to predict and observable data (variables or columns) that can be used to make the predictions
For most modeling applications, we need individual observations (rows) that have data for the outcome we want to predict. If we are trying to model fraud, we need a column in the training data set that tells us whether an observation represents a fraud or a non-fraud observation. In standard linear regression models, this is the “y” variable that we want to model on the training data and then predict for new data for which we do not know the outcome.
We also need other information about individual observations that can be used to explain and then predict the outcome variable. We collect and/or create these variables based on our understanding of the drivers of fraud. For example, if we think companies with new CEOs are more likely to commit fraud than other companies, we might have a column that indicates whether an observation relates to a new CEO or not. The better our columns are for predicting the outcome, the better our model will perform.
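A minimal sketch of what such training data might look like is below; every value, including the hypothetical `new_ceo` indicator mentioned above, is made up for illustration.
```{r, eval=FALSE}
# Hypothetical training data: one row per company-year, with the outcome
# we want to predict (misstate) and predictor columns observable in advance.
library(tidyverse)

training <- tibble(
  gvkey    = c("1001", "1001", "1002"),
  fyear    = c(1990, 1991, 1990),
  misstate = c(0, 1, 0),        # outcome: fraud label
  new_ceo  = c(0, 1, 0),        # hypothetical predictor: new CEO indicator
  at       = c(120, 135, 5400)  # predictor: total assets ($ millions)
)
```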
-   Testing data or new data requires that the observable data used to make predictions be available before the outcomes are observed
The predictor columns must be available even when the outcome variable is not available, and they must apply to each observation. For example, when predicting which companies are likely to commit fraud next year based on information currently available, we cannot include in our model variables that are not known until next year.
-   Identifying patterns in training data requires structure
Patterns in the data tell us the expected associations between predictor and outcome variables. Because there may be random causes of the outcome variable, the expected associations tell us what is typical based on the variables and patterns we observe. Note that we expect to be wrong because we will not be able to perfectly model the outcome variables.
The patterns that determine what is typical depend on mathematical representations of how well the model fits the data. However, to find patterns in the data, we need to propose a structure that we hope might capture these patterns. There are many algorithms for measuring how well a model fits the data and many modeling structures for identifying patterns in the data.
Here I will lay out a simple standard structure (ordinary least squares, i.e. linear regression). We can use the same intuition when we apply different structures for modeling patterns in the data (e.g. non-linear models, machine learning models). To make the concept tractable, let’s assume that we are trying to predict the amount of the fine that a company has to pay after committing fraud. Suppose we have a few hundred observations. Below I show you a few rows of the data. Note that I am just making up numbers for the purpose of the example.
| Row ID | Fine Amount | Fraud Damage Amount | Prior Fraud Committed | Company Size |
|--------|-------------|---------------------|-----------------------|--------------|
| 1      | \$300       | \$500               | 0                     | \$25,000     |
| 2      | \$1,550     | \$750               | 1                     | \$100,000    |
| 3      | \$725       | \$1,000             | 0                     | \$15,000     |
: Fraud fine data
Here, the outcome that we want to model and later predict is “Fine Amount”. “Fraud Damage Amount” is how much company owners lost because of the fraud. “Prior Fraud Committed” is equal to 1 if the company previously committed fraud and 0 otherwise. “Company Size” is the company’s cash account balance. We can sort rows by “Fraud Damage Amount” or another column to see if higher fraud damages are associated with higher fines and we could do the same thing for each column. We could do something similar by calculating the correlation between one of the predictor columns and fine amount. However, most often, it will be a combination of predictors that best explain the outcome variable. We therefore need a way to combine the information from multiple columns. To do so, we have to propose a structure and then use the data to find the patterns.
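Before proposing that structure, here is a small sketch of the sorting and correlation ideas just described, using the made-up numbers from the table above:
```{r, eval=FALSE}
# Made-up rows from the fraud fine table above.
library(tidyverse)

fines <- tibble(
  fine_amount  = c(300, 1550, 725),
  fraud_damage = c(500, 750, 1000),
  prior_fraud  = c(0, 1, 0),
  company_size = c(25000, 100000, 15000)
)

# Sort by one candidate predictor to eyeball the association with the fine...
fines %>% arrange(fraud_damage)

# ...or compute a simple correlation between one predictor and the outcome.
cor(fines$fraud_damage, fines$fine_amount)
```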
Let’s start with the general idea.
$$
Y_{i,t} = fn(\beta_k,X_k)
$$
Here we have a Y variable we want to predict that is observable for company i at time t (for example this could be ABC company in 2017) and we have k unknown parameters $\beta$ that apply to k predictor variables X. We don’t know the $\beta$s because these are what we are going to figure out by using the training data.
We also don’t know how the $\beta$s should be combined with the Xs to explain/predict the Ys. This is where we propose a structure for how these patterns might occur in the data. The longest-used and most thoroughly explored structure is a linear model. This approach is like a system of equations in which we are trying to solve for the $\beta$s. Let’s walk step by step through how this might work.
First, let’s impose the linear structure. Using the example rows from above, we are going to impose a structure where we assume that each $\beta$ is constant across all rows.
| Fine Amount |     |           | Fraud Damage Amount |     |           | Prior Fraud Committed |     |           | Company Size |     | Error term       |
|-------------|-----|-----------|---------------------|-----|-----------|-----------------------|-----|-----------|--------------|-----|------------------|
| \$300       | =   | $\beta_1$ | \$500               | \+  | $\beta_2$ | 0                     | \+  | $\beta_3$ | \$25,000     | \+  | $\epsilon_{i,t}$ |
| \$1,550     | =   | $\beta_1$ | \$750               | \+  | $\beta_2$ | 1                     | \+  | $\beta_3$ | \$100,000    | \+  | $\epsilon_{i,t}$ |
| \$725       | =   | $\beta_1$ | \$1,000             | \+  | $\beta_2$ | 0                     | \+  | $\beta_3$ | \$15,000     | \+  | $\epsilon_{i,t}$ |
Notice that each row represents an equation. We are proposing that $\beta_1$ is constant across all rows and that this parameter tells us how the “Fraud Damage Amount” is associated with “Fine Amount”. This could be thought of as the partial correlation between “Fraud Damage Amount” and “Fine Amount”. We apply the same logic to $\beta_2$ and $\beta_3$. We don’t know what these values are. These values are also not deterministic, so we do not arrive at a unique solution where the math lines up perfectly. Because we are proposing the structure and trying to find the parameters that seem to reflect what is in the data, we need the final piece of the equation, the error term at the end. The error term is the value that makes the equation balance. Another way to think of this is that we are looking for the parameters that are typical for the data, and anything that differs from what is typical is captured by the error term.
How do we figure out what the parameters should be? This is the algorithm or machine learning piece. We try different values until we find parameters that fit the data as well as possible. For this simple structure there are mathematical ways to do this directly. For other structures, the only available way is to try different values until we get something that seems to fit the data best.
How do we determine whether the model fits the data well? There are different ways to evaluate model fit, but here we can introduce a simple version – mean squared error. Mean squared error is defined as follows.
$$
MSE = \frac{\sum_1^N \epsilon^2_{i,t}}{N}
$$

The mean squared error takes the error term from each equation, squares it so that positive and negative errors of the same magnitude contribute equally, and then calculates the average squared error term across all N rows in the sample. The larger the MSE, the larger the adjustments that have to be made for the equations to balance. This means that the proposed model and the proposed parameter values fit the data worse when the MSE is large.
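
As a small sketch, the MSE is straightforward to compute; the error values used below are the ones from the fitted equations shown next.

```{r, eval=FALSE}
# Mean squared error: average of the squared error terms across all rows.
mse <- function(errors) mean(errors^2)

# Error terms from the fitted equations in the next table.
mse(c(-325, -512.5, -175))   # a larger MSE means the proposed parameters fit worse
```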

With a proposed structure and a proposed measure for how well the model fits the data, we can run through possible values for the parameters. We choose the parameters that best fit the training data. Perhaps in the end, we come up with the best parameter estimates and we get the following equations:

| Fine Amount | | | Fraud Damage Amount | | | Prior Fraud Committed | | | Company Size | | Error term |
|-------------|-----|------|---------------------|-----|-------|-----------------------|-----|------|--------------|-----|------------|
| \$300 | = | 0.75 | \$500 | \+ | \$500 | 0 | \+ | 0.01 | \$25,000 | \+ | -\$325 |
| \$1,550 | = | 0.75 | \$750 | \+ | \$500 | 1 | \+ | 0.01 | \$100,000 | \+ | -\$512.5 |
| \$725 | = | 0.75 | \$1,000 | \+ | \$500 | 0 | \+ | 0.01 | \$15,000 | \+ | -\$175 |

We can interpret the parameters as partial correlations. The “Fine Amount” is typically 75% of the “Fraud Damage Amount”, having committed a prior fraud adds \$500 to the fine, and every dollar of “Company Size” results in a \$0.01 larger fine. Notice that for these companies the error term is negative, meaning that the model overshot the actual fine amount. For other observations we would see a positive error term.
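
The following sketch reproduces the fitted equations above using the made-up parameter estimates (0.75, \$500, and 0.01); the negative errors confirm that the model overshoots these three fines.

```{r, eval=FALSE}
# Verify the fitted values and error terms implied by the estimated parameters.
library(tidyverse)

fines <- tibble(
  fine_amount  = c(300, 1550, 725),
  fraud_damage = c(500, 750, 1000),
  prior_fraud  = c(0, 1, 0),
  company_size = c(25000, 100000, 15000)
)

fines %>%
  mutate(fitted = 0.75 * fraud_damage + 500 * prior_fraud + 0.01 * company_size,
         error  = fine_amount - fitted)   # -325, -512.5, -175 as in the table
```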

Given the model that we have created, we would apply this same approach to new data to predict the fine. For example, if in our new data, we see the following:

| Predicted Fine Amount | Fraud Damage Amount | Prior Fraud Committed | Company Size |
|-----------------------|---------------------|-----------------------|--------------|
| ? | \$600 | 1 | \$55,000 |

: New data

We have the structure and the parameters from the training data. We can calculate the predicted fine amount based on the patterns we observed in the training data:

$$
PredictedFineAmount = 0.75 \times 600 + 1 \times 500 + 0.01 \times 55{,}000 = \$1{,}500
$$

We only know if our prediction was good after we observe the actual fine amount. Over time as we get more new data we could evaluate how well our model worked.

- Model structure

All modeling approaches follow the same general process. We propose a way to measure model fit, we propose a model structure that can capture patterns in the data, and we estimate parameters for the model to find the best fit based on the data. There are many modeling possibilities. Not all modeling approaches have a nice mathematical form; these might be called algorithms or methods for finding patterns in the data. However, there are still choices that must be made about how the algorithm identifies these patterns, and the data informs these choices (i.e. parameters and hyperparameters). Algorithms without a particular mathematical form are often referred to as machine learning models. We will return to other models and algorithms later and discuss the structure they impose to find patterns.

- The risk from applying patterns to new data is significant and we use methods to try to reduce that risk

After using the training data to develop our models, we move to predicting outcomes. We take the same structure to new data where we do not observe the outcome to predict what we think it would be based on the model we estimate on the training data.

How well the training data model predicts outcomes on new data depends on many factors. Here I outline some of the biggest concerns.

- Underfitting

Some models may provide very poor measures of fit. In many cases model fit depends on the outcome being predicted. Because of differences across outcomes, what constitutes a good or a bad measure of fit depends on the outcome variable. Bad measures of fit can occur because the model parameters are not sufficiently tuned to best fit the data. Bad measures of fit can also occur because important predictor variables are missing from the model or the model structure cannot capture important variation in the outcome variables.

- Overtraining

Some models may provide good measures of fit on training data but then perform poorly on new data. This problem is typically referred to as overtraining, although it may happen for various reasons. Overtraining occurs when a model's structure or parameters are fit so closely to the training data that they capture patterns that are unlikely to occur in other data. Stated differently, overtrained models capture random sample information that is unlikely to recur in other samples.
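
The simulated sketch below illustrates the idea: a very flexible model (a high-degree polynomial, an arbitrary choice here) fits random noise in the training sample and then typically predicts new data worse than a simpler model. All data are randomly generated for illustration.

```{r, eval=FALSE}
# Simulated illustration of overtraining with made-up data.
set.seed(1)
train <- data.frame(x = runif(20))
train$y <- 2 * train$x + rnorm(20, sd = 0.5)
test <- data.frame(x = runif(20))
test$y <- 2 * test$x + rnorm(20, sd = 0.5)

simple   <- lm(y ~ x, data = train)            # captures the underlying pattern
flexible <- lm(y ~ poly(x, 15), data = train)  # chases noise in the training data

mse <- function(actual, predicted) mean((actual - predicted)^2)
mse(train$y, predict(flexible, train))  # very small: looks like an excellent fit
mse(test$y,  predict(flexible, test))   # typically much larger on new data
mse(test$y,  predict(simple,   test))   # the simpler model usually predicts better
```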

- Data shifting

Training data may differ from new data in important ways that limit how well a model fits the new data. For example, a model fit on training data for large public companies may not apply to new data for small private companies. This problem may occur because the new data are not drawn from the same population as the training data.

- Data leakage

Models from training data may also fail when moving to new data for different reasons. For example, training data may use data that is not available for new data or the model may have included variables that are only available after observing the outcome variable.

## Return to Fraud Prediction Model

To complete the fraud prediction model example, here I will present a simple linear model based on the data we explored in the fraud example. This model tries to predict the probability that a company commits fraud in a given year. The details of how this model works are not important right now but will later become easier to understand as we model other outcomes. For now, I want you to see this piece of the process and to try to understand the intuition behind training and applying a model.

Here I will estimate a linear probability model. This model imposes a linear structure and uses a binary outcome variable.

```{r,cache=TRUE,echo=FALSE,message=FALSE,warning=FALSE}
# Keep only the model variables and drop rows with missing values
df2<-df %>%
  select(misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)%>%
  na.omit()
# Linear probability model: regress the fraud indicator on the predictors
estlm<-lm(misstate~at+bm+EBIT+ch_roa+dch_wc+ch_fcf,data=df2)
#library(modelsummary)
#modelsummary(estlm)
# would like to use this or something similar at some point…need updates
summary(estlm)
```

There is a lot to potentially unpack here. Let's focus on only a few things. First, the coefficient column gives the parallel to the $\beta$s mentioned earlier. These are the parameters that have been fit on the data to try to capture the patterns that occur in the training data. I'll describe two pieces that are similar to what we saw before. The parameter for bm is a positive 0.0089. This means that when bm increases by 1, the predicted probability that the company-year is a fraud year increases by roughly 1 percentage point (rounding 0.0089 to 0.01). The parameter (coefficient estimate, in regression terms) on ch_fcf is -0.088. This means that when ch_fcf increases by 1, the predicted probability that a company-year is a fraud year is approximately 9 percentage points lower.
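
If you want to pull those estimates out of the fitted model directly, a short sketch (using the `estlm` object estimated in the chunk above) is:

```{r, eval=FALSE}
# Extract individual coefficient estimates from the linear probability model.
coef(estlm)["bm"]      # change in fraud probability for a one-unit change in bm
coef(estlm)["ch_fcf"]  # change in fraud probability for a one-unit change in ch_fcf
```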

We don’t need to go into the details at this point, but this model is not a particularly strong model for predicting fraud. In part, this is because fraud is an infrequent occurrence that is hard to predict, in part because we don’t have the strongest predictors of fraud, and perhaps in part because we have chosen the wrong structure for the model. We can take this model to new data to make a prediction for the probability that the observation has fraud.

The following observation comes from a company in the data that has never been identified as having committed fraud.

```{r, cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
df2<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
df2<-df2 %>%
  group_by(gvkey) %>%
  mutate(hasfraud = max(misstate)) %>%
  ungroup() %>%
  filter(hasfraud==0) %>%
  select(fyear,misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)
df2<-df2[1,]

df2%>% kable("html",caption="New data")%>%kable_styling()
```

```{r, cache = TRUE, echo=FALSE, warning=FALSE, message=FALSE}
pred<-round(100*predict(estlm,df2),digits=0)
```

We can use the same model (the results from the regression above) to predict the probability that this firm-year is a fraud year. In this case, the predicted probability would be `r pred`%.

There may be reasons to doubt this prediction, but based on the model we estimated, this is the prediction for the new data.

Note: one reason this prediction is so high is that our model was trained using only companies that at some point during the sample period had been found to commit fraud (approximately 19% of the training sample was a fraud year). This new observation has never been identified as having committed fraud. Should the predicted probability of fraud be that high for this observation? Some research suggests that 10% of companies each year commit fraud [see here](https://link.springer.com/article/10.1007/s11142-022-09738-5?campaign_id=4&emc=edit_dk_20230114&instance_id=82723&nl=dealbook&regi_id=86591575&segment_id=122544&te=1&user_id=3a31c185ca43c0f889045eace6a8812b). As a practical reality, sometimes we can take estimates at face value and other times we have to use a different interpretation. For example, below is another new data point.

```{r, cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}

df2b<-read.csv("C:\\Users\\jgreen\\Documents\\Teaching\\NotesText\\Data Analytics with Accounting Data\\data_FraudDetection_JAR2020.csv")
df2b<-df2b %>%
  group_by(gvkey) %>%
  mutate(hasfraud = max(misstate)) %>%
  ungroup() %>%
  filter(hasfraud==0) %>%
  select(fyear,misstate,at,bm,EBIT,ch_roa,dch_wc,ch_fcf)
df2b<-df2b[25,]
df2b%>% kable("html",caption="New data")%>%kable_styling()
```

```{r, cache = TRUE, echo=FALSE, warning=FALSE, message=FALSE}
predb<-round(100*predict(estlm,df2b),digits=0)
```

The prediction for this observation is `r predb`%. Notice that the difference between the two predictions is only 2 percentage points even though the predictor variables are quite different. Perhaps the only thing we can say is that the second observation might have a somewhat higher probability of fraud.

<!--# Worthwhile to do a different model e.g. random forest like the paper for the data? -->

## Different application

### Generative AI text model

In this section we will discuss the modeling process for a different task: creating and using a generative AI model. Generative AI refers to a model that generates new data; most commonly this now means models that generate text. The same modeling process that we have discussed applies to generative AI text models like ChatGPT and CoPilot.

The first step for generative AI models is to train the model. This is done by providing the model with a large amount of text data. Some of these models presumably have been trained on terabytes of text data from every imaginable source. There are several challenges when working with text data that we will not go into here. However, to understand how the models work, we will use the simplest possible example. Let’s imagine that we start with the following sentences that we will use to create our training data. I will deliberately keep the example as simple as possible.

| Sentence |
|-----------------------------------------------------------------------------|
| Company ABC overstated its inventory and revenue from sales by \$3 million. |
| Company XYZ increased revenue from sales by 10%. |
| Sales revenue decreased compared with last year. |
| Every company in the industry paid the CEO a bonus. |

To model patterns in the training data, we would need to structure the data in a way that can be used by an algorithm. Rather than work through setting up the data, let's walk through the logic that an algorithm might use when trying to predict the next word in a sentence.

Suppose you give the model a word and ask it to predict the next word. If you provide “revenue” as the word, you can see in the training data that three of the four sentences contain the word revenue. In two of the three sentences the word after “revenue” is “from”. This means that, based on the training data, the highest probability word to follow “revenue” is “from”. The model would then predict “from” as the next word. Now the model has two words, “revenue from”. The model would then look at the training data to see what word follows “revenue from”. In this case, the word “sales” follows “revenue from” in both instances where these two words appear together. The same process applies to the next word, “by”. Now we have “revenue from sales by”, and we have a problem because there is no single highest probability next word: the model can choose either \$3 million or 10%. Now the model creator has to decide how to predict the next word. One option is to randomly choose between the two words. Another option is to force the model to consider additional words. This process continues until the model reaches a predetermined stopping criterion.
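
The toy sketch below builds that word-pair (bigram) table from the four sentences (lower-cased and with final punctuation removed to keep it simple) and shows the candidate words that follow "revenue":

```{r, eval=FALSE}
# Toy next-word predictor: count which word follows which in the four sentences.
library(tidyverse)

sentences <- c(
  "company abc overstated its inventory and revenue from sales by $3 million",
  "company xyz increased revenue from sales by 10%",
  "sales revenue decreased compared with last year",
  "every company in the industry paid the ceo a bonus"
)

bigrams <- tibble(sentence = sentences) %>%
  mutate(word = str_split(sentence, " ")) %>%
  unnest(word) %>%
  group_by(sentence) %>%
  mutate(next_word = lead(word)) %>%
  ungroup() %>%
  filter(!is.na(next_word)) %>%
  count(word, next_word)

# Most frequent follower of "revenue" is "from" (2 of its 3 occurrences).
bigrams %>% filter(word == "revenue") %>% arrange(desc(n))
```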

Let’s walk through a few things that we can learn from this example. First, a generative AI model depends crucially on the training data text it has used to create the model. Second, the model can only successfully predict on words or combinations of words for which it has sufficient training data to make a prediction. Third, the prediction is path dependent meaning that as it predicts the next word, the last prediction becomes part of the input for the next prediction. Fourth, the model creator has to make choices about how to predict the next word when the model has multiple options. Fifth, the model creator has to decide when to stop the prediction process. Note: we could also talk about the problem of predicting words when we think the words are from new data but they come from the training data ([see here](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4754678)).

### Fraud prediction model

Let’s use the example above to examine another application. Suppose we have a different type of model, one used to predict whether text contains hints that a company is committing fraud. We need training data again. This time the training data has a label for each sentence.

| Sentence | Label |
|-----------------------------------------------------------------------------|-----------|
| Company ABC overstated its inventory and revenue from sales by \$3 million. | Fraud |
| Company XYZ increased revenue from sales by 10%. | Not Fraud |
| Sales revenue decreased compared with last year. | Not Fraud |
| Every company in the industry paid the CEO a bonus. | Fraud |

A model like the one we might use here searches for individual words, combinations of words, grammatical structures, or other patterns that are most strongly associated with the labeled observations. We can look at the observations labeled as fraud and compare them with the observations labeled as not fraud to get an idea of how an algorithm might work. We might see that “overstated” or “bonus” shows up only in the “Fraud” observations. On the other hand, “revenue” is in both “Fraud” and “Not Fraud” observations. Perhaps then the model will predict that fraud is more likely when “overstated” or “bonus” appears in the text. However, “bonus” could appear in many other observations that are not fraud. The important challenge here is that the model crucially depends on the training data.
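
A toy sketch of that comparison, using simple word-presence indicators on the four labeled sentences (the choice of candidate words is mine, for illustration only):

```{r, eval=FALSE}
# Toy word-presence features compared across the fraud and non-fraud labels.
library(tidyverse)

labeled <- tibble(
  sentence = c(
    "Company ABC overstated its inventory and revenue from sales by $3 million.",
    "Company XYZ increased revenue from sales by 10%.",
    "Sales revenue decreased compared with last year.",
    "Every company in the industry paid the CEO a bonus."
  ),
  label = c("Fraud", "Not Fraud", "Not Fraud", "Fraud")
)

labeled %>%
  mutate(has_overstated = str_detect(str_to_lower(sentence), "overstated"),
         has_bonus      = str_detect(str_to_lower(sentence), "bonus"),
         has_revenue    = str_detect(str_to_lower(sentence), "revenue")) %>%
  group_by(label) %>%
  summarise(across(starts_with("has_"), mean))  # share of sentences containing each word
```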

## Summary

In this chapter, we have introduced the modeling process. Creating models based on training data can help us identify patterns that can improve our predictions and decision making. Modeling data can have a wide range of applications. We have discussed fraud prediction and generative AI text models. The following points highlight the key concepts from this chapter.

– Modeling requires training data. The nature of the training data is important for the success of the model.
– Identifying patterns in the training data requires a model structure. The model structure is used to estimate parameters that best fit the data.
– Fitting a model requires a measure of fit. The measure of fit is used to determine how well the model captures patterns in the data.
– The model structure is used to predict outcomes on new data. The success of the model on new data depends on the similarity between the training data and the new data.

In the following chapters, we will discuss how to apply these modeling concepts to accounting data, discuss some of the challenges of working with accounting data, and create models for various accounting-related tasks.

## Review

### Conceptual questions

1. Explain the modeling process.
2. Write a summary in your own words for how modeling is used to find patterns in training data and to predict on new data.
3. What is the purpose of defining a measure of fit for modeling?
4. Why is a model structure necessary for finding patterns in training data?
5. What is the intuition for how a model is mathematically estimated?
6. Consider an observable attribute of a company that might be used to predict whether it has committed fraud. Why do you expect this attribute to be associated with fraud? How do you expect this attribute to be associated with fraud?
7. Why might a model for predicting fraud fail when making predictions on new data?
8. Describe how a generative AI model might be said to “understand” text.
9. Describe how a generative AI model trained on an accounting textbook might differ from a generative AI model trained on news articles if asked to determine using a company’s annual report whether the company is likely to have committed fraud.

### Practice questions

1. Statistical, econometric, and machine learning models are methods for \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.
2. What is the primary assumption in the modeling process that is required for using a model to predict using new data?
3. In the fraud example, in order to create a model with the training data, we need two types of observations. What are these two types of observations?
4. In the fraud example, the mean for the column “misstate” is 0.2. What does this mean?
5. What is training data?
6. In the fraud example, fraud years have a ch_fcf of -0.1 and non-fraud years have a ch_fcf of -0.01. What can we infer from this information?
7. Why must predictor variables be available before the outcome variable or available when the outcome variable is not available?
8. Why is model fit necessary for training a model?
9. What is overtraining?
10. Name three reasons the fraud model in our example may perform poorly.
11. In the fraud example, we predicted the probability of being a fraud year for a new observation to be 19%. If that probability is too high, why might the model have predicted it to be so high?
12. Describe how a generative AI model predicts the next word.

## Solutions to practice questions

1. finding patterns in training data and using those patterns to make predictions and generate insights on new data.
2. That the patterns in the training data will apply to new data.
3. Observations with fraud and non-fraud labels.
4. That in the sample, 20% of the company-years were fraud years.
5. Training data is data that we use to find patterns that we can then use on new data to make predictions.
6. We might infer that companies are more likely to commit fraud when their cash flows are lower.
7. Because we make predictions on new data for which the outcome has not yet been observed; a predictor that is only available after the outcome is known cannot be used to make the prediction.
8. Model fit is used to find parameter values that best fit the training data.
9. Overtraining occurs when a model fits the training data so closely that it captures random sample information that is unlikely to recur in other samples.
10. (1) Fraud is an infrequent occurrence that is hard to predict, (2) we don't have the strongest predictors of fraud in the model, and (3) we may have chosen the wrong (linear) structure for the model.
11. The model was trained on a sample in which the frequency of fraud was high.
12. A generative AI model predicts the next word by finding the highest probability word given the prompt word(s) based on the probabilities in the training data.

Tutorial video

Note: include practice and knowledge checks

Mini-case video

Note: include practice and knowledge checks
