"

9 Data preparation

Learning Objectives

  • Describe the required structure for data frames for aggregate summaries, group summaries, time series summaries, and modeling
  • Identify common data frame preparation steps
  • Use dplyr tools to prepare data for subsequent analyses

Chapter content

Most analysis tools and models require data to be organized in the format they are designed to work with, and some models require data to be organized in particular ways. This chapter discusses the most common way of organizing data for analysis and modeling, particularly for accounting-based analyses. This chapter also highlights the most common data preparation steps. You will then use the dplyr package in R to perform the common data preparation steps.

Data structure

There are many sources of data that might be relevant for accounting-related analyses. Here I will introduce two examples. The first example is transaction data for a single company at different points in time for various products. The second example is financial statement data for multiple companies at different points in time. I will keep the examples simple to avoid complications that might arise from more complex data.

Both of these data sources might be referred to as panel data. Panel data is data that has a time element, for example, data collected on multiple days or for multiple years, and a cross-sectional element, for example, data collected for multiple companies or products. Data that has only one of these elements is referred to as time series data or cross-sectional data.

Example 1: Transaction Data

| Transaction ID | Date       | Product Name      | N Units | Per Unit Sales Price | Per Unit Cost |
|----------------|------------|-------------------|---------|----------------------|---------------|
| 12001          | 2 JUL 2024 | Cattleman leather | 1       | 225                  | 95            |
| 12002          | 3 JUL 2024 | Cattleman leather | 1       | 225                  | 95            |
| 12003          | 2 JUL 2024 | Cutter straw      | 3       | 75                   | 25            |

The transaction data in Example 1 is organized in a table format. Each row represents a transaction, and each column represents a different attribute of the transaction. The data is organized in a way that is easy to understand and analyze. There is only one row for each transaction. This means that all information about the July 2nd transaction for the product “Cattleman leather” is contained in the row with Transaction ID 12001; there is no other row with different information for that transaction. The same product can, however, be purchased on different days: “Cattleman leather” was purchased on both July 2nd and July 3rd, and those two rows are distinct because they represent different transactions. The data also contains multiple products, with “Cutter straw” appearing alongside “Cattleman leather”.

Example 2: Company Financial Statement Data

| Company | Fiscal Year End | Total Revenue ($M) | Net Income ($M) | Prior Year Net Income ($M) | Total Assets ($M) |
|---------|-----------------|--------------------|-----------------|----------------------------|-------------------|
| ABC     | 31 DEC 2023     | 2,512              | 755             | 700                        | 5,000             |
| ABC     | 31 DEC 2022     | 2,200              | 700             | 650                        | 4,500             |
| XYZ     | 31 DEC 2023     | 1,000              | 200             | 150                        | 2,000             |

The company financial data in Example 2 is organized in a similar table format. Each row represents a company’s financial data for a fiscal year, each column represents a different attribute of that data, and there are multiple companies in the data. This example differs from the first example in an important way beyond the subject of the table. The column “Prior Year Net Income ($M)” reports the prior year’s net income for a company. For company ABC, net income for the 2023 fiscal year is 755 million dollars, and net income for the prior fiscal year is 700 million dollars. In the first example, all information about an observation was contained within the same row. Here, the 700 million dollars is the net income reported in the next row of the table. However, relative to ABC’s 2023 fiscal year end observation (period t), the prior year’s net income is the net income from the 2022 fiscal year end (period t-1). In this way, any information that describes the observation, including information from the prior year, is contained within the same row.

The “Prior Year Net Income ($M)” column is an example of a lag variable. A lag variable contains information from the prior period; in this case, the net income from the prior year. Lag variables are common in accounting data because they are used to compare the current period’s performance to the prior period’s performance. Consider how the lag variable is created. This matters in this example because, if we were to recreate the “Prior Year Net Income ($M)” column, we would not have the 2021 fiscal year information that would be needed for the 2022 observation. Creating a lag variable requires finding the prior year’s observation for the same company. This is generally done by sorting the data frame and taking the value from the row above or below each observation. However, for row 2 in this example, moving to the next row would give us the 2023 fiscal year information for company XYZ rather than the 2021 information for company ABC, so the data must also be grouped by company before shifting values between rows.
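A minimal sketch of these steps with dplyr, assuming the wide-format financial data is in a data frame named fs with columns Company, Fiscal_Year, and Net_Income (hypothetical names), might look like this:

library(dplyr)

fs <- fs %>%
  group_by(Company) %>%                                 # keep each company's history separate
  arrange(Fiscal_Year, .by_group = TRUE) %>%            # sort so the prior year is the row above
  mutate(Prior_Year_Net_Income = lag(Net_Income)) %>%   # take the value from the row above
  ungroup()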

Organizing the data

Here we will continue the examples from above and discuss different ways the data might be organized and how we might reorganize the data to match our desired format.

Consider an alternative format for the transaction data. The format below is sometimes called a “long” format, in contrast to the format above, which is sometimes called a “wide” format.

Transaction Data in Long Format

| Transaction ID | Variable Label       | Value             |
|----------------|----------------------|-------------------|
| 12001          | Date                 | 2 JUL 2024        |
| 12001          | Product Name         | Cattleman leather |
| 12001          | N Units              | 1                 |
| 12001          | Per Unit Sales Price | 225               |
| 12001          | Per Unit Cost        | 95                |
| 12002          | Date                 | 3 JUL 2024        |
| 12002          | Product Name         | Cattleman leather |
| 12002          | N Units              | 1                 |
| 12002          | Per Unit Sales Price | 225               |
| 12002          | Per Unit Cost        | 95                |
| 12003          | Date                 | 2 JUL 2024        |
| 12003          | Product Name         | Cutter straw      |
| 12003          | N Units              | 3                 |
| 12003          | Per Unit Sales Price | 75                |
| 12003          | Per Unit Cost        | 25                |

This format is called a long format because new information is added as a new row rather than as a separate column. When there are many missing values this format can be more efficient because it does not require a separate column for each variable, and missing values can simply be omitted from the data set. However, notice that there are multiple rows for each transaction. There is also no simple way to associate information from a lagged period with the current period, because the date and the other values are not connected in a format that can be sorted by date to find the value associated with a previous date.

Most often, before using some of the tools that make data analysis faster and easier, or before fitting a model to training data, the data must be changed from the long format to the wide format. This is because most tools that we will use are designed to work with data in the wide format. There are tools to make the transformation, but it may also involve making choices about how to format the table.

Let’s consider the features of the long format that might help us transform the data to the wide format. First, there is a unique identifier for each transaction. This is the “Transaction ID” column. Second, there is a variable label for each value. This is the “Variable Label” column. Third, there is a value for each variable label. This is the “Value” column. We would like to have a single row for the unique transaction identifier.

Data format for most analyses

Most analysis tools require wide format data tables.

 

The variable label column tells us something about the transaction. We want anything about an observation to be in the same row. This means that the variable label column will need to be the name of the column in the wide format. The value column will be the value in the wide format. Some software tools can make this transformation simple or we may need to do the transformation manually. The transformation is often called a pivot or a reshape.

Tools for wide data

Many tools in software packages are designed to work with wide data. Here we will discuss the logic for summary tools that make the wide format useful or necessary.

Most tools create summaries of a column across rows. For example, a mean function calculates the mean of a column. In the first example, we might calculate the mean of the “N Units” column, which would give us the average number of units sold. In the second example, we might calculate the mean of the “Net Income ($M)” column, which would give us the average net income for the companies in the data set. Other functions work in the same way; for example, the standard deviation function calculates the standard deviation of a column.
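For instance, a minimal sketch of a column summary, assuming the transaction data has already been pivoted into a wide data frame named df_wide with a numeric N_Units column (hypothetical names):

library(dplyr)

# mean and standard deviation of units sold across all transactions
df_wide %>%
  summarise(mean_units = mean(N_Units, na.rm = TRUE),
            sd_units = sd(N_Units, na.rm = TRUE))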

Summary functions designed to work with a column also simplify grouped summary functions. For example, we might want to calculate the average number of units sold for each product. Summary functions that summarize the rows of a column can then be applied to groups of rows such as the rows for each product.
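Continuing the same hypothetical df_wide, a grouped version of the summary might look like this:

# average units sold for each product
df_wide %>%
  group_by(Product_Name) %>%
  summarise(mean_units = mean(N_Units, na.rm = TRUE)) %>%
  ungroup()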

As shown previously, working within a column can be useful for creating lag variables. For example, we might want to calculate the difference between the current period’s net income and the prior period’s net income. This would require subtracting the prior period’s net income from the current period’s net income which can be done when working within a column that contains unique-by-row data. Structuring the data in this way and using the tools that work with this structure can make the analysis and modeling process fast and simple.

Transforming from long to wide data

There are many tools that can be used to transform data from the long format to the wide format or to reshape the data in different ways. These tools are often called pivot tools or reshape tools.

Let’s use the example 1 long format to define long format variables that can be used to transform the long table into a wide format table. The first column – “Transaction ID” – is the unique identifier for each transaction (id). This is the column for which we would like a single row in the wide format. We may need to sort by this column or group by this column so that any transposing we do is done separately for each unique id.

The second column – “Variable Label” – identifies unique information about each transaction that we need to have in each row (label). We will use this label to identify which rows to use when transposing the data to have a column for each unique label. We may have only a single type of information in which case we do not need to transpose the data for different values of labels.

The third column – “Value” – is the value that is identified by the combination of “Transaction ID” and “Variable Label” (value). We will transpose this value so that there is one unique value for a row and a column for each unique label.

The code below transforms the table above from long to wide format using dplyr and tidyr (pivot_wider() is a tidyr function), if the table is a data frame “df”:

library(dplyr)
library(tidyr)   # pivot_wider() comes from tidyr

df_wide <- df %>%
  group_by(Transaction_ID) %>%
  pivot_wider(names_from = Variable_Label, values_from = Value) %>%
  ungroup()

Let’s consider the long format for example 2 to practice identifying the id, label, and value columns.

Company Financial Statement Data in Long Format

| Company ID | Variable Label             | Value       |
|------------|----------------------------|-------------|
| ABC        | Fiscal Year End            | 31 DEC 2023 |
| ABC        | Total Revenue ($M)         | 2,512       |
| ABC        | Net Income ($M)            | 755         |
| ABC        | Prior Year Net Income ($M) | 700         |
| ABC        | Total Assets ($M)          | 5,000       |
| ABC        | Fiscal Year End            | 31 DEC 2022 |
| ABC        | Total Revenue ($M)         | 2,200       |
| ABC        | Net Income ($M)            | 700         |
| ABC        | Prior Year Net Income ($M) | 650         |
| ABC        | Total Assets ($M)          | 4,500       |
| XYZ        | Fiscal Year End            | 31 DEC 2023 |
| XYZ        | Total Revenue ($M)         | 1,000       |
| XYZ        | Net Income ($M)            | 200         |
| XYZ        | Prior Year Net Income ($M) | 150         |
| XYZ        | Total Assets ($M)          | 2,000       |

Note that because the table is structured in the same way, the transposition will be similar. The primary difference is that “Company ID” alone does not uniquely identify the rows we want unique information about; we would like a single row for each company-year. Because there are multiple identifiers, we need an interim step that creates a unique row identifier from “Company ID” and “Fiscal Year End”. Once the table has this unique identifier in each row, we can transpose the data so that there is one unique value per row and a column for each unique label.

The steps for transforming the long format table above, if it is a data frame “df”, are below.

Interim step

# fill() comes from the tidyr package
df2 <- df %>%
  group_by(Company_ID) %>%
  # flag the fiscal year end rows, then carry that date down to the other rows for that year
  mutate(FYE_ID = if_else(Variable_Label == "Fiscal Year End", Value, NA_character_)) %>%
  fill(FYE_ID) %>%
  ungroup()

Pivot step

df_wide <- df2 %>%
  group_by(Company_ID, FYE_ID) %>%
  # one row per company-year, one column per unique variable label
  pivot_wider(names_from = Variable_Label, values_from = Value) %>%
  ungroup()

Data pre-processing

Getting data into a structure that can be used for modeling is the first step in preparing data for model training. After the data is correctly structured, most modeling techniques require or perform better when the data is prepared with the modeling task and process in mind. This process is called pre-processing. This section will discuss common pre-processing steps and the reasoning for these steps. R example code in the section below will use the csv file from previous chapters available here https://www.dropbox.com/scl/fi/g7gmo0jgj797lwk5b0ltr/ROARDA.csv?rlkey=1rk4cjb0n6ezby4e3y1gicure&st=rf35i3y4&dl=0 and saved as a data frame titled “df”. The variables that will be used in the examples include RDA (research and development expense divided by average total assets), ROA (net income divided by average total assets), and AVGTA (average total assets).
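A minimal sketch of loading the file, assuming it has been downloaded to the working directory as ROARDA.csv (or that dl=0 at the end of the Dropbox link is changed to dl=1 so the raw file is served):

library(readr)

# read the downloaded csv into a data frame named df
df <- read_csv("ROARDA.csv")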

Sampling from different populations (scale)

A recurring issue in accounting data is that the scale of observations can differ dramatically. For example, predicting which companies are most likely to commit fraud may involve comparing companies with a few thousand dollars in sales to companies with billions of dollars in sales. Another example is predicting which customers might default on their accounts when some customers have a few dollars in their accounts and others have thousands or millions of dollars.

Unfortunately, there is no agreed upon method that will solve these scale issues in every situation. Thinking carefully about the purpose of the model is necessary for most applications. Given the caveat that there is no one-size-fits-all solution, below are two methods for dealing with scale issues.

The first option is to ignore the scale differences across observations. This may be appropriate when the scale differences are small or are not important for the model. A related possibility is to include predictors in the model that are related to the scale of the observation. For example, if we are predicting fraud, we might include a variable that is the log of sales. This would allow the model to consider the scale of sales in the prediction.

The second option is demonstrated with ratio analysis in accounting. Trying to create predictors and potentially outcome variables that are scale independent may result in better performing models. For example, if we are predicting which companies are likely to commit fraud, we might use as predictors variables that are ratios that may be comparable across companies of different sizes. We might include a ratio of cash balances to total assets, net income to sales, or other similar ratios. We will return to scaling and accounting questions in the next chapter.

Missing values

In many data sets, some features have missing values that limit the ability to use those features for modeling. There are various options for dealing with missing values, and no single approach applies in all circumstances.

The first option is to delete observations with missing values for the feature. This is the simplest approach and may be appropriate if the number of observations with missing values is small.

Example

Suppose the concern is that there are missing RDA observations:

# drop_na() comes from the tidyr package
tmp <- df %>%
  drop_na(RDA)   # keep only rows where RDA is not missing

The second option is to use background knowledge about the data to fill in missing values. For example, in a database of customer transactions, the location of the transaction may be missing for online sales. Or in a database of financial statements, gross margin might be missing but could be calculated from other financial statement items.

Example

Suppose RDA is mostly missing because the data set doesn’t capture research and development expenses that are zero:

tmp <- df %>%
  mutate(RDA = ifelse(!is.na(RDA), RDA, 0))   # treat missing R&D expense as zero

The third option is to fill in missing values with assumed or typical values. For example, in a database of customer demographics, missing values for age might be filled in with the average age of the customers. In a database of financial statements, missing values for depreciation expense might be filled in using the year-to-year trend in depreciation expense for the company. A related possibility is to fill in missing values with the value expected based on the values of non-missing features. This approach uses a model to predict the missing value based on the values of other features.

Example

Suppose we don’t know why RDA values are missing but assume the missing values are typical for a company’s industry and year:

tmp <- df %>%
  group_by(sic, fyear) %>%
  mutate(mdRDA = median(RDA, na.rm = TRUE)) %>%    # industry-year median RDA
  ungroup() %>%
  mutate(RDA = ifelse(!is.na(RDA), RDA, mdRDA))    # fill missing RDA with that median

The final option is to use models that can use missing values. Some models are designed to work with missing values and can be used without filling in missing values.

Outliers

Many models are sensitive to outliers. Outliers are observations that are very different from other observations. In some cases outliers are errors in the data that should be corrected or removed. In other cases, outliers are valid observations that should be kept in the data set but may need to be considered when creating models. As with other pre-processing steps, there is no one-size-fits-all solution for dealing with outliers.

First, the feature may be transformed to reduce the effects of outliers on the models. For example, the log transformation is often used to reduce the effects of outliers for features with only positive values. Another transformation is ranking observations.

Example

tmp <- df %>%
  mutate(lnAVGTA = log(AVGTA))   # log transform reduces the influence of very large values
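The ranking transformation mentioned above can be applied in a similar way; a minimal sketch using dplyr’s percent_rank():

tmp <- df %>%
  mutate(AVGTA_rank = percent_rank(AVGTA))   # ranks scaled to lie between 0 and 1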

Second, outliers may be deleted. This may be appropriate if the number of outliers is small. If the number of features with outliers is large, deleting outliers may result in a significant loss of data.

Example

tmp <- df %>%
  mutate(ROA_high = quantile(ROA, 0.99, na.rm = TRUE),
         ROA_low = quantile(ROA, 0.01, na.rm = TRUE)) %>%
  filter(between(ROA, ROA_low, ROA_high))   # keep only observations between the 1st and 99th percentiles

Third, outliers may be winsorized. Winsorizing is a method that replaces the value of an outlier with the value of the next closest non-outlier. For example, if the 99th percentile of a feature is 100 and the 100th percentile is 1000, the value of 1000 would be replaced with 100. This method reduces the effects of outliers on the model without deleting the data. However, winsorizing replaces the value of the outlier with a value that is not the true value of the observation.

Example

tmp <- df %>%
  mutate(ROA_high = quantile(ROA, 0.99, na.rm = TRUE),
         ROA_low = quantile(ROA, 0.01, na.rm = TRUE)) %>%
  mutate(ROA = ifelse(ROA > ROA_high, ROA_high, ROA),   # cap values above the 99th percentile
         ROA = ifelse(ROA < ROA_low, ROA_low, ROA))     # floor values below the 1st percentile

Detecting which observations should be classified as outliers is a challenge. There are common practices, but no fully accepted practice. Outliers may be determined by the distribution of the individual feature. For example, observations that are more than three standard deviations from the mean of a feature may be classified as outliers. Outliers may also be determined by their influence on a model for some types of models. Separate models may also be used to detect outliers based on combinations of features. These models are referred to as “anomaly detection” models.

Low feature variance

Modeling relies on variation in features to learn differences between observations and to make predictions. Features with no variation cannot be used to differentiate between observations. Some features may have some variation but not enough to be useful. Typically, features with low variation are removed from the data set before modeling. The challenge with choosing which features to remove is defining what low variation means.
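Example

A minimal sketch of one way to screen for low-variance features, using an arbitrary standard deviation cutoff of 0.01 purely for illustration:

# standard deviation of each numeric feature
feature_sd <- df %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))

# drop numeric features whose standard deviation falls below the (arbitrary) cutoff
low_var <- names(feature_sd)[unlist(feature_sd) < 0.01]
tmp <- df %>%
  select(-all_of(low_var))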

Differences in feature variance

Differences in feature variance can be a challenge for some models. For example, models that use distance measures, such as k-means clustering, are sensitive to differences in the scales and variances of features. Some models are not sensitive to these across-feature differences; however, standard practice is to apply the same pre-processing steps to all features for consistency.

The typical approach to addressing scale and variance differences across features is to standardize the features to have the same scale and variance. For example, the features may be standardized to have a mean of zero and a standard deviation of one. This is done by subtracting the mean of the feature from each observation and dividing by the standard deviation of the feature. Another standardizing approach is to scale the features to have a minimum of zero and a maximum of one. This is done by subtracting the minimum value of the feature from each observation and dividing by the range of the feature.

Example

Here, assume that one of the outlier steps has been performed from above.

tmp2 <- tmp %>%
  mutate(ROA_std = (ROA - mean(ROA, na.rm = TRUE)) / sd(ROA, na.rm = TRUE))   # mean 0, standard deviation 1
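A minimal sketch of the second approach, scaling ROA to a zero-to-one range, again assuming one of the outlier steps has been performed:

tmp2 <- tmp %>%
  mutate(ROA_minmax = (ROA - min(ROA, na.rm = TRUE)) /
           (max(ROA, na.rm = TRUE) - min(ROA, na.rm = TRUE)))   # scaled to lie between 0 and 1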

Feature selection

Highly correlated features can be a challenge for some models. For example, models that use linear regression are sensitive to highly correlated features. In most cases, perfectly correlated features, for example, including sales in millions of dollars and sales in thousands of dollars as separate features, make the estimation impossible or highly problematic. Similarly, combinations of features that are algebraically equivalent create the same problem. For example, including the components of gross margin (sales and cost of goods sold) together with gross margin itself as features in the same model creates this problem.

A related issue is that some data sets may have hundreds or thousands of features which may introduce high correlations among features or may increase the storage and computational requirements of the model.

The issue of having a large number of features that may be correlated is typically addressed in one of two ways.

First, highly correlated features may be removed from the data set before modeling.
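Example

A minimal sketch of a manual screen, using an arbitrary correlation cutoff of 0.9; one feature from each flagged pair could then be dropped with select():

# pairwise correlations among the numeric features
cor_mat <- df %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

# list feature pairs with correlation above the (arbitrary) cutoff for review
high_cor <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(feature_1 = rownames(cor_mat)[high_cor[, 1]],
           feature_2 = colnames(cor_mat)[high_cor[, 2]])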

Second, feature selection or dimension reduction techniques may be used to select a subset of features or collapse the features into a smaller set of features. Selecting a subset of features is referred to as feature selection. Collapsing the features using techniques such as principal component analysis is referred to as dimension reduction.
Some models include feature selection or dimension reduction as part of the model fitting process.
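Example

A minimal sketch of dimension reduction with principal component analysis, using base R’s prcomp() on the standardized numeric features (complete cases only):

# principal component analysis on the numeric features
pca_input <- df %>%
  select(where(is.numeric)) %>%
  na.omit()

pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)
summary(pca_fit)                                 # proportion of variance explained by each component
pca_scores <- as.data.frame(pca_fit$x[, 1:2])    # keep the first two components as new features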

Tutorial video

The video below demonstrates the R code data preparation examples from above.

Conclusion

In this chapter, you learned how to structure data for most data analyses. You also learned about common data preparation tasks.

Review

Mini-case video

Adams, J., Hayunga, D., Mansi, S., Reeb, D., & Verardi, V. (2019). Identifying and treating outliers in finance. Financial Management, 48(2), 345-384.

References

Adams, J., Hayunga, D., Mansi, S., Reeb, D., & Verardi, V. (2019). Identifying and treating outliers in finance. Financial Management, 48(2), 345-384.

Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270-301.

Sullivan, J. H., Warkentin, M., & Wallace, L. (2021). So many ways for assessing outliers: What really works and does it matter? Journal of Business Research, 132, 530-543.

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.