Chapter 10: Organizing Data for Specific Purposes
Learning outcomes
At the end of this chapter, you should be able to
- Describe the required structure for data frames for aggregate summaries, group summaries, time series summaries, and modeling
- Match structure types to accounting data analysis problems
- Outline steps to transform input data into required structures
- Map data manipulation tools from dplyr to each step
Chapter content
Note: include practice and knowledge checks throughout chapter
# Data structure
Most analysis tools and models require data to be organized in a format they are designed to work with, and different models can require data to be organized in different ways. In this chapter, we will discuss the most common way of organizing data for analysis and modeling, particularly for accounting-based analyses. We will also discuss how this structure works with tools designed for table-formatted data.
## Two examples
There are many sources of data that might be relevant for accounting-related analyses. Here I will introduce two examples. The first example is transaction data for a single company at different points in time for various products. The second example is financial statement data for multiple companies at different points in time. I will keep the examples simple to avoid complications that might arise from more complex data.
Both of these data sources might be referred to as panel data. Panel data has a time element, for example, data collected on multiple days or for multiple years, and a cross-sectional element, for example, data collected for multiple companies or products. Data with only the time element is referred to as time series data; data with only the cross-sectional element is referred to as cross-sectional data.
| Transaction ID | Date | Product Name | N Units | Per Unit Sales Price | Per Unit Cost |
|------|------|------|------|------|------|
| 12001 | 2 JUL 2024 | Cattleman leather | 1 | 225 | 95 |
| 12002 | 3 JUL 2024 | Cattleman leather | 1 | 225 | 95 |
| 12003 | 2 JUL 2024 | Cutter straw | 3 | 75 | 25 |
: Example 1: Transaction data
The transaction data in Example 1 is organized in a table format: each row represents a transaction, and each column represents an attribute of the transaction. This organization makes the data easy to understand and analyze. There is only one row for each transaction, so all information about the July 2nd transaction for the product “Cattleman leather” is contained in the row with Transaction ID 12001, and no other row carries different information about that transaction. The same product can appear on different days: “Cattleman leather” was purchased on both July 2nd and July 3rd, and the two rows are distinct because they represent different transactions. The data also contains multiple products, with “Cutter straw” appearing alongside “Cattleman leather”.
| Company | Fiscal Year End | Total Revenue (\$M) | Net Income (\$M) | Prior Year Net Income (\$M) | Total Assets (\$M) |
|------|------|------|------|------|------|
| ABC | 31 DEC 2023 | 2,512 | 755 | 700 | 5,000 |
| ABC | 31 DEC 2022 | 2,200 | 700 | 650 | 4,500 |
| XYZ | 31 DEC 2023 | 1,000 | 200 | 150 | 2,000 |
: Example 2: Company financial statement data
The company financial data in Example 2 is organized in a similar table format. Each row represents a company’s financial data for a fiscal year, each column represents an attribute of that data, and there are multiple companies in the data. This example differs from the first in an important way beyond the subject of the table. The column “Prior Year Net Income (\$M)” reports the prior year’s net income for the company. For company ABC’s 2023 fiscal year, net income is 755 million dollars and net income for the prior fiscal year is 700 million dollars. In the first example, all information about an observation was contained within the same row. Here, \$700 million also appears as the net income in the next row of the table. But when we think about ABC’s 2023 fiscal year-end observation (period t), the prior year’s net income relative to that observation is the net income from the 2022 fiscal year end (period t-1). In this way, any information that describes the observation, including information from the prior year, is contained within the same row.
The “Prior Year Net Income (\$M)” column is an example of a lag variable: a variable that contains information from the prior period, in this case the net income from the prior year. Lag variables are common in accounting data because they are used to compare the current period’s performance to the prior period’s performance. Consider how the lag variable is created. If we were to recreate the “Prior Year Net Income (\$M)” column from this table, we would not have the 2021 fiscal year information that would be needed for the 2022 observation. Creating a lag variable requires finding the prior year’s observation for the same company. This is generally done by sorting the data frame and taking the value from the row above or below each observation. However, for row 2 in this example, moving to the next row would give us the 2023 fiscal year information for company XYZ rather than 2021 information for company ABC, so the lookup must be done separately within each company.
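To make this concrete, here is a minimal dplyr sketch of the sort-then-group logic, assuming a data frame with hypothetical columns `company_id`, `fiscal_year`, and `net_income` that mirror Example 2:

```r
library(dplyr)

# Hypothetical data frame mirroring Example 2
financials <- tibble(
  company_id  = c("ABC", "ABC", "XYZ"),
  fiscal_year = c(2023, 2022, 2023),
  net_income  = c(755, 700, 200)
)

financials <- financials |>
  arrange(company_id, fiscal_year) |>  # sort so the prior year sits in the row above
  group_by(company_id) |>              # lag within each company, never across companies
  mutate(prior_net_income = lag(net_income)) |>
  ungroup()

# ABC's 2022 row gets NA for prior_net_income because no 2021 row exists,
# and XYZ's value is never pulled into ABC's lag.
```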
## Organizing the data
Here we will continue the examples from above and discuss different ways the data might be organized and how we might reorganize the data to match our desired format.
Consider an alternative format for the transaction data. The format below is sometimes called a “long” format versus what we saw above which is sometimes called a “wide” format.
| Transaction ID | Variable Label | Value |
|------|------|------|
| 12001 | Date | 2 JUL 2024 |
| 12001 | Product Name | Cattleman leather |
| 12001 | N Units | 1 |
| 12001 | Per Unit Sales Price | 225 |
| 12001 | Per Unit Cost | 95 |
| 12002 | Date | 3 JUL 2024 |
| 12002 | Product Name | Cattleman leather |
| 12002 | N Units | 1 |
| 12002 | Per Unit Sales Price | 225 |
| 12002 | Per Unit Cost | 95 |
| 12003 | Date | 2 JUL 2024 |
| 12003 | Product Name | Cutter straw |
| 12003 | N Units | 3 |
| 12003 | Per Unit Sales Price | 75 |
| 12003 | Per Unit Cost | 25 |
: Example 1: Transaction data in long format
This format is called a long format because new information is added as a new row rather than as a separate column. When there are many missing values, this format can be more efficient because it does not require a separate column for each variable, and missing values can simply be omitted from the data set. However, notice that there are multiple rows for each transaction. There is also no simple way to associate information from a lagged period with the current period, because the date and the other values for a transaction sit in separate rows, so the data cannot simply be sorted by date to find a value associated with a previous date.
Most often, before using the tools that make data analysis faster and easier, or before fitting a model to training data, the data must be changed from the long format to the wide format. This is because most tools that we will use are designed to work with data in the wide format. There are tools to make the transformation, but the transformation may also involve making choices about how to format the table.
Let’s consider the features of the long format that might help us transform the data to the wide format. First, there is a unique identifier for each transaction. This is the “Transaction ID” column. Second, there is a variable label for each value. This is the “Variable Label” column. Third, there is a value for each variable label. This is the “Value” column. We would like to have a single row for the unique transaction identifier. The variable label column tells us something about the transaction. We want anything about an observation to be in the same row. This means that the variable label column will need to be the name of the column in the wide format. The value column will be the value in the wide format. Some software tools can make this transformation simple or we may need to do the transformation manually. The transformation is often called a pivot or a reshape.
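In R, the tidyr package provides `pivot_wider()` for this kind of reshape. A minimal sketch, assuming the Example 1 long table is stored in a data frame `transactions_long` with hypothetical column names `transaction_id`, `variable_label`, and `value`:

```r
library(tidyr)

# transactions_long has columns transaction_id, variable_label, value
transactions_wide <- transactions_long |>
  pivot_wider(
    id_cols     = transaction_id,  # one row per unique transaction
    names_from  = variable_label,  # each unique label becomes a column name
    values_from = value            # values fill the new columns
  )
```

Because the long format’s value column mixes dates, text, and numbers, it is typically stored as character, so the new wide columns may need type conversion (for example, with `as.numeric()`) after the pivot.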
## Tools for wide data
Many tools in software packages are designed to work with wide data. Here we will discuss some of the logic for some of the summary tools that make the wide format useful or necessary.
Most summary tools create summaries of a column across rows. For example, a mean function calculates the mean of a column. In the first example, we might calculate the mean of the “N Units” column, which would give us the average number of units sold. In the second example, we might calculate the mean of the “Net Income (\$M)” column, which would give us the average net income for the companies in the data set. Other functions work in the same way; for example, the standard deviation function calculates the standard deviation of a column.
Summary functions designed to work with a column also simplify grouped summary functions. For example, we might want to calculate the average number of units sold for each product. Summary functions that summarize the rows of a column can then be applied to groups of rows such as the rows for each product.
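A brief dplyr sketch of both kinds of summaries, assuming a wide data frame `transactions` with hypothetical column names `n_units` and `product_name`:

```r
library(dplyr)

# Overall summary: the mean of a column across all rows
transactions |>
  summarise(mean_units = mean(n_units))

# Grouped summary: the same function applied within each product's rows
transactions |>
  group_by(product_name) |>
  summarise(mean_units = mean(n_units))
```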
As shown previously, working within a column is also useful for creating lag variables. For example, we might want to calculate the difference between the current period’s net income and the prior period’s net income. This requires subtracting the prior period’s net income from the current period’s net income, which can be done when working within a column in which each row describes a unique observation. Structuring the data in this way and using the tools that work with this structure can make the analysis and modeling process fast and simple.
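Continuing the hypothetical `financials` data frame from the earlier lag sketch, the change could be computed in a single `mutate()` call:

```r
library(dplyr)

# Change in net income, reusing the sorted and grouped structure
financials |>
  arrange(company_id, fiscal_year) |>
  group_by(company_id) |>
  mutate(ni_change = net_income - lag(net_income)) |>  # current minus prior period
  ungroup()
```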
## Tools for transforming to wide data
There are many tools that can be used to transform data from the long format to the wide format or to reshape the data in different ways. These tools are often called pivot tools or reshape tools.
Let’s use the example 1 long format to define long format variables that can be used to transform the long table into a wide format table. The first column – “Transaction ID” – is the unique identifier for each transaction (id). This is the column for which we would like a single row in the wide format. We may need to sort by this column or group by this column so that any transposing we do is done separately for each unique id.
The second column – “Variable Label” – identifies unique information about each transaction that we need to have in each row (label). We will use this label to identify which rows to use when transposing the data to have a column for each unique label. We may have only a single type of information in which case we do not need to transpose the data for different values of labels.
The third column – “Value” – is the value that is identified by the combination of “Transaction ID” and “Variable Label” (value). We will transpose this value so that there is one unique value for a row and a column for each unique label.
Let’s consider the long format for example 2 to practice identifying the id, label, and value columns.
| Company ID | Variable Label | Value |
|------|------|------|
| ABC | Fiscal Year End | 31 DEC 2023 |
| ABC | Total Revenue (\$M) | 2,512 |
| ABC | Net Income (\$M) | 755 |
| ABC | Prior Year Net Income (\$M) | 700 |
| ABC | Total Assets (\$M) | 5,000 |
| ABC | Fiscal Year End | 31 DEC 2022 |
| ABC | Total Revenue (\$M) | 2,200 |
| ABC | Net Income (\$M) | 700 |
| ABC | Prior Year Net Income (\$M) | 650 |
| ABC | Total Assets (\$M) | 4,500 |
| XYZ | Fiscal Year End | 31 DEC 2023 |
| XYZ | Total Revenue (\$M) | 1,000 |
| XYZ | Net Income (\$M) | 200 |
| XYZ | Prior Year Net Income (\$M) | 150 |
| XYZ | Total Assets (\$M) | 2,000 |
: Example 2: Company financial statement data in long format
Note that because the table is structured in the same way, the transposition will be similar. The primary difference is that “Company ID” alone does not uniquely identify the observation we want a single row for: we would like one row for each company-year. In this case, we need an interim step because we have multiple identifiers. The interim step gives us a unique row identifier made up of “Company ID” and “Fiscal Year End”. We can then transpose the data so that there is one unique value for a row and a column for each unique label.
| Company ID | Fiscal Year End | Variable Label | Value |
|------|------|------|------|
| ABC | 31 DEC 2023 | Total Revenue (\$M) | 2,512 |
| ABC | 31 DEC 2023 | Net Income (\$M) | 755 |
| ABC | 31 DEC 2023 | Prior Year Net Income (\$M) | 700 |
| ABC | 31 DEC 2023 | Total Assets (\$M) | 5,000 |
| ABC | 31 DEC 2022 | Total Revenue (\$M) | 2,200 |
| ABC | 31 DEC 2022 | Net Income (\$M) | 700 |
| ABC | 31 DEC 2022 | Prior Year Net Income (\$M) | 650 |
| ABC | 31 DEC 2022 | Total Assets (\$M) | 4,500 |
| XYZ | 31 DEC 2023 | Total Revenue (\$M) | 1,000 |
| XYZ | 31 DEC 2023 | Net Income (\$M) | 200 |
| XYZ | 31 DEC 2023 | Prior Year Net Income (\$M) | 150 |
| XYZ | 31 DEC 2023 | Total Assets (\$M) | 2,000 |
: Example 2: Company financial statement data – interim step
Once we have the table structured with a unique identifier in each row, in this case the “Company ID” and “Fiscal Year End” columns, we can transpose the data so that there is one unique value for a row and a column for each unique label.
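The pivot is then the same as before, except that `id_cols` names both identifier columns. A sketch, assuming the interim table is stored as `financials_long` with hypothetical column names `company_id`, `fiscal_year_end`, `variable_label`, and `value`:

```r
library(tidyr)

# financials_long has columns company_id, fiscal_year_end, variable_label, value
financials_wide <- financials_long |>
  pivot_wider(
    id_cols     = c(company_id, fiscal_year_end),  # one row per company-year
    names_from  = variable_label,
    values_from = value
  )
```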
## Pre-processing
Getting data into a structure that can be used for modeling is the first step in preparing data for model training. After the data is correctly structured, most modeling techniques require or perform better when the data is prepared with the modeling task and process in mind. This process is called pre-processing. This section will discuss common pre-processing steps and the reasoning for these steps. Subsequent chapters will introduce the tools and provide practice for pre-processing.
### Sampling from different populations (scale)
A recurring issue in accounting data is that the scale of observations can differ dramatically. For example, predicting which companies are most likely to commit fraud may involve comparing companies with a few thousand dollars in sales to companies with billions of dollars in sales. Another example is predicting which customers might default on their accounts when some customers have a few dollars in their accounts and others have thousands or millions of dollars.
Unfortunately, there is no agreed-upon method that will solve these scale issues in every situation. Thinking carefully about the purpose of the model is necessary for most applications. Given the caveat that there is no one-size-fits-all solution, below are two methods for dealing with scale issues.
The first option is to ignore the scale differences across observations. This may be appropriate when the scale differences are small or are not important for the model. A related possibility is to include predictors in the model that are related to the scale of the observation. For example, if we are predicting fraud, we might include a variable that is the log of sales. This would allow the model to consider the scale of sales in the prediction.
The second option is demonstrated by ratio analysis in accounting: creating predictors, and potentially outcome variables, that are scale independent may result in better performing models. For example, if we are predicting which companies are likely to commit fraud, we might use as predictors ratios that are comparable across companies of different sizes, such as the ratio of cash balances to total assets, net income to sales, or other similar ratios. We will return to scaling and accounting questions in the next chapter.
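A sketch of both options with dplyr, assuming a wide data frame `financials` with hypothetical columns `total_revenue`, `net_income`, and `total_assets`:

```r
library(dplyr)

financials |>
  mutate(
    log_sales    = log(total_revenue),          # scale-related predictor
    ni_to_sales  = net_income / total_revenue,  # scale-independent ratios
    ni_to_assets = net_income / total_assets
  )
```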
### Missing values
In many data sets, some features have missing values that limit the ability to use those features for modeling. There are various options for dealing with missing values, with no single approach that applies in all circumstances.
The first option is to delete observations with missing values for the feature. This is the simplest approach and may be appropriate if the number of observations with missing values is small.
The second option is to use background knowledge about the data to fill in missing values. For example, in a database of customer transactions, the location of the transaction may be missing for online sales. Or in a database of financial statements, gross margin might be missing but could be calculated from other financial statement items.
The third option is to fill in missing values with assumed or typical values. For example, in a database of customer demographics, missing values for age might be filled in with the average age of the customers. In a database of financial statements, missing values for depreciation expense might be filled in using the year-to-year trend in depreciation expense for the company. A related possibility is to fill in missing values with the value expected based on the values of non-missing features. This approach uses a model to predict the missing value based on the values of other features.
The final option is to use models that can use missing values. Some models are designed to work with missing values and can be used without filling in missing values.
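As an illustration of the first and third options, here is a dplyr sketch assuming a hypothetical `customers` data frame with a numeric `age` column:

```r
library(dplyr)

# Option 1: delete observations with missing values for the feature
customers_complete <- customers |>
  filter(!is.na(age))

# Option 3: fill missing values with a typical value, here the mean age
customers_filled <- customers |>
  mutate(age = if_else(is.na(age), mean(age, na.rm = TRUE), age))
```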
### Outliers
Many models are sensitive to outliers. Outliers are observations that are very different from other observations. In some cases outliers are errors in the data that should be corrected or removed. In other cases, outliers are valid observations that should be kept in the data set but may need to be considered when creating models. As with other pre-processing steps, there is no one-size-fits-all solution for dealing with outliers.
First, the feature may be transformed to reduce the effects of outliers on the models. For example, the log transformation is often used to reduce the effects of outliers for features with only positive values. Another transformation is ranking observations.
Second, outliers may be deleted. This may be appropriate if the number of outliers is small. If the number of features with outliers is large, deleting outliers may result in a significant loss of data.
Third, outliers may be winsorized. Winsorizing is a method that replaces the value of an outlier with the value of the next closest non-outlier. For example, if the 99th percentile of a feature is 100 and the 100th percentile is 1000, the value of 1000 would be replaced with 100. This method reduces the effects of outliers on the model without deleting the data. However, winsorizing replaces the value of the outlier with a value that is not the true value of the observation.
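A sketch of winsorizing at the 1st and 99th percentiles, a common choice of cutoffs, assuming a hypothetical data frame `df` with a numeric `sales` column:

```r
library(dplyr)

# Cutoffs at the 1st and 99th percentiles of the feature
p01 <- quantile(df$sales, 0.01, na.rm = TRUE)
p99 <- quantile(df$sales, 0.99, na.rm = TRUE)

# Clamp values outside the cutoffs to the cutoff values
df <- df |>
  mutate(sales_w = pmin(pmax(sales, p01), p99))
```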
Detecting which observations should be classified as outliers is a challenge. There are common practices, but no fully accepted practice. Outliers may be determined by the distribution of the individual feature. For example, observations that are more than three standard deviations from the mean of a feature may be classified as outliers. Outliers may also be determined by their influence on a model for some types of models. Separate models may also be used to detect outliers based on combinations of features. These models are referred to as “anomaly detection” models.
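A sketch of the three-standard-deviation rule, continuing the hypothetical `df` and `sales` names:

```r
library(dplyr)

# Flag observations more than three standard deviations from the mean
df <- df |>
  mutate(
    z_sales    = (sales - mean(sales, na.rm = TRUE)) / sd(sales, na.rm = TRUE),
    is_outlier = abs(z_sales) > 3
  )
```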
### Low feature variance
Modeling relies on variation in features to learn differences between observations and to make predictions. Features with no variation cannot be used to differentiate between observations. Some features may have some variation but not enough to be useful. Typically, features with low variation are removed from the data set before modeling. The challenge with choosing which features to remove is defining what low variation means.
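A sketch of dropping low-variance numeric features with dplyr, continuing the hypothetical `df`; the threshold of 0.01 is purely illustrative:

```r
library(dplyr)

# Keep non-numeric columns plus numeric columns whose standard deviation
# exceeds an illustrative threshold
df_reduced <- df |>
  select(where(~ !is.numeric(.x) || sd(.x, na.rm = TRUE) > 0.01))
```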
### Differences in scale and/or feature variance
Differences in scale and feature variance can be a challenge for some models. For example, models that use distance measures, such as k-means clustering, are sensitive to differences in the scales and variances of features. Some models are not sensitive to differences across features; however, standard practice is to apply the same pre-processing steps to all features for consistency.
The typical approach to addressing scale and variance differences across features is to standardize the features to have the same scale and variance. For example, the features may be standardized to have a mean of zero and a standard deviation of one. This is done by subtracting the mean of the feature from each observation and dividing by the standard deviation of the feature. Another standardizing approach is to scale the features to have a minimum of zero and a maximum of one. This is done by subtracting the minimum value of the feature from each observation and dividing by the range of the feature.
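A sketch of both standardizations applied to every numeric column of the hypothetical `df`:

```r
library(dplyr)

# Z-score standardization: mean zero, standard deviation one
df_z <- df |>
  mutate(across(where(is.numeric),
                ~ (.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE)))

# Min-max scaling: minimum zero, maximum one
df_minmax <- df |>
  mutate(across(where(is.numeric),
                ~ (.x - min(.x, na.rm = TRUE)) /
                  (max(.x, na.rm = TRUE) - min(.x, na.rm = TRUE))))
```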
### Feature selection
Highly correlated features can be a challenge for some models. For example, models that use linear regression are sensitive to highly correlated features. In most cases, perfectly correlated features (for example, including sales in millions of dollars and sales in thousands of dollars as separate features) make the estimation impossible or highly problematic. Similarly, combinations of features that are algebraically related create the same problem. For example, including the components of gross margin (sales and cost of goods sold) together with gross margin itself as features in the same model would create the same problem.
A related issue is that some data sets may have hundreds or thousands of features which may introduce high correlations among features or may increase the storage and computational requirements of the model.
The issue of having a large number of features that may be correlated is typically addressed in one of two ways.
First, highly correlated features may be removed from the data set before modeling.
Second, feature selection or dimension reduction techniques may be used to select a subset of features or collapse the features into a smaller set of features. Selecting a subset of features is referred to as feature selection. Collapsing the features using techniques such as principal component analysis is referred to as dimension reduction.
Some models include feature selection or dimension reduction as part of the model fitting process.
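A sketch of both approaches using base R functions, continuing the hypothetical `df` and assuming missing values have already been handled:

```r
library(dplyr)

numeric_features <- select(df, where(is.numeric))

# Pairwise correlations help identify candidates for removal
cor_matrix <- cor(numeric_features)

# Principal component analysis as a dimension reduction step;
# scale. = TRUE standardizes the features first
pca <- prcomp(numeric_features, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component
```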
## Other data issues
### Different data sources
Data relevant to accounting analyses can come from many sources. For example, a company may collect tabular data from its own and its competitors’ financial statements, textual data from customer comments, and transaction data from sales. These data sources might then be stored in a txt file, a csv file, or some other format.
To prepare for most modeling tasks, the data must be organized into a single data table in the format we have discussed. This means that any data we want to include in a model must have a single observation per row, and any information about the observation must be included as separate columns in the same row. For example, if there is textual data from customer comments about a single transaction, this text must be included as a column, or multiple columns, that applies only to that observation.
To combine data from different data sources requires merging data from one source with the data from a different source. To merge the data, we need a unique identifier for each row that is the same across data sources to be merged.
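A sketch of a merge with dplyr, assuming hypothetical data frames `transactions` and `comments` that share a `transaction_id` key:

```r
library(dplyr)

# Keep every transaction; attach comment data where a match exists
combined <- transactions |>
  left_join(comments, by = "transaction_id")
```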
### Column types
Most models operate on numerical data. This means that features that enter the model should be numerical. Some exceptions are possible; for example, some models can use categorical data. Categorical data is non-numerical data that takes one of a limited set of values. For example, a feature recording the color of a product is categorical. Non-numerical data is typically addressed in the following ways.
First, categorical data may be converted to numerical data. This approach is often referred to as one-hot encoding. Using this approach, each unique value of the categorical data is converted to a separate column. For example, if the color of a product is a feature, the color of the product might be converted to a column for each unique color. The column would have a value of one if the product is that color and zero otherwise. An important technical aside is that in some models one of the categories must be omitted from the model to be used as a reference category. For example, if the colors of a product were red, blue, and green, the red column could be omitted from the model and the blue and green columns would be included in the model. This is because the red column can be inferred from the blue and green columns, i.e. when blue is zero and green is zero, the product is red.
Second, some models accept categorical data directly. When these models are used, the feature may need to be included as a specified type such as a factor or a character.
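Returning to the first approach, here is a sketch of one-hot encoding with base R’s `model.matrix()`, assuming a hypothetical `products` data frame with a `color` column; `model.matrix()` omits one reference category by default, matching the technical aside above:

```r
# Dummy columns with one category omitted as the reference category
dummies <- model.matrix(~ color, data = products)

# Dummy columns for every category: remove the intercept with - 1
dummies_all <- model.matrix(~ color - 1, data = products)
```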
Other feature types may have special numerical meanings or may require different functions or treatments. The most common of these are dates and character features.
## Review
### Conceptual questions
1. What is the difference between panel data, time series data, and cross-sectional data?
2. What is the difference between the wide format and the long format for data?
3. Describe how you would transform long data to wide data.
4. How do summary functions such as a mean function work with wide data? How is this different from using a mean function with long data?
5. What is a lag variable?
6. Why is pre-processing necessary for modeling?
7. What are some common pre-processing steps?
8. What is feature selection?
9. What is dimension reduction?
### Practice questions
1. If an observation in row 5 of the data represents a company’s financial data for a fiscal year and the data is sorted by company and fiscal year, in what row would lagged sales be located?
2. If an observation in row 5 of the data represents a company’s financial data for a fiscal year, if lagged sales is important for modeling purposes, in what row of the data should sales from the prior year be located?
3. To create a lag variable for the observation in row 5, what steps are necessary to include the prior period’s value in row 5?
4. If the data is in a long format, what are the unique identifiers for each observation in the data?
5. What are options for dealing with observations that represent different scales of data?
6. What are options for dealing with missing values in the data?
7. What are options for dealing with outliers in the data?
8. How can outliers be detected?
9. What are options for dealing with features that have low variance?
10. What are options for dealing with differences in scale and/or feature variance?
11. What are options for dealing with highly correlated features in the data?
## Solutions to practice questions
1. If an observation in row 5 of the data represents a company’s financial data for a fiscal year and the data is sorted by company and fiscal year, row 4 would be the location of lagged sales.
2. Any feature that describes the observation should be in the same row (e.g., also in row 5).
3. Create a new column that contains lagged sales, that is, the sales from the prior year. This requires finding the sales from the prior year for the same company, which is generally done by sorting the data frame and taking the value from the row above (or below, depending on the sort order) each observation.
4. From the examples in this chapter, the unique identifier would be a transaction ID or a company ID combined with a variable label that describes the observation.
5. Ignore the scale problem, include predictors that are related to the scale of the observation, use ratios that are comparable across observations of different sizes.
6. Delete observations with missing values, use background knowledge to fill in missing values, fill in missing values with assumed or typical values, fill in missing values with the value expected based on the values of non-missing features, use models that can use missing values.
7. Transform the feature to reduce the effects of outliers on the models, delete outliers, winsorize outliers.
8. Outliers may be determined by the distribution of the individual feature, by their influence on a model, or by separate models that detect outliers based on combinations of features.
9. Remove features with low variance from the data set before modeling.
10. Standardize the features to have the same scale and variance.
11. Remove highly correlated features from the data set before modeling, use feature selection or dimension reduction techniques to select a subset of features or collapse the features into a smaller set of features.
Tutorial video
Note: include practice and knowledge checks
Mini-case video
Note: include practice and knowledge checks