8 Understanding Data with Statistics
Learning Objectives
- Explain how data summaries provide insights about data
- Describe key insights from statistical summaries of accounting data
- Create simple statistical summaries of data sets
Chapter content
The last chapter discussed using visual summaries for exploratory analysis. The purpose of exploratory analysis is to understand the story behind the data. Statistical summaries of a data set serve the same purpose. Before diving into complex analyses, the first step is to understand the data. Data summaries, also known as descriptive statistics, are the foundation for this understanding. They help you quickly grasp the main features of a dataset, identify patterns, and spot potential problems.
Descriptive Statistics
Descriptive statistics are numerical summaries that describe and present the main characteristics of a dataset. Descriptive statistics help analysts detect patterns, identify outliers and errors, compare groups, and further explore the story behind the data. Common descriptive statistics include:
- Measures of central tendency: Mean, median, and mode (Where is the “center” of the data?)
- Measures of variability: Range, variance, and standard deviation (How spread out is the data?)
- Measures of shape: Skewness and kurtosis (Is the data symmetric or skewed?)
- Frequency counts and percentages: How often does each value occur?
As an example, suppose you have a dataset of monthly expenses for several departments in a company. Descriptive statistics can reveal which department spends the most on average, how consistent each department’s spending is, whether any department had an unusually high or low expense in a given month, or the typical expense for all departments.
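The department example above can be sketched in R as follows. The data frame `expenses`, its column names, and all of its values are made up for illustration, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical monthly expenses for three departments (illustrative values only)
expenses <- data.frame(
  Department = rep(c("Sales", "IT", "HR"), each = 4),
  Month      = rep(1:4, times = 3),
  Expense    = c(120, 130, 125, 300,   # Sales: month 4 is unusually high
                 200, 210, 205, 195,   # IT
                  80,  85,  90,  85)   # HR
)

# One row of descriptive statistics per department
dept_summary <- expenses %>%
  group_by(Department) %>%
  summarize(
    meanExpense = mean(Expense),  # which department spends the most on average?
    sdExpense   = sd(Expense),    # how consistent is each department's spending?
    minExpense  = min(Expense),   # unusually low month?
    maxExpense  = max(Expense)    # unusually high month?
  )

dept_summary

# Typical expense across all departments
overall_median <- median(expenses$Expense)
overall_median
```

In this made-up data, the large standard deviation and maximum for Sales point to the unusual month, which is the kind of pattern descriptive statistics are meant to surface.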
Descriptive Statistics in R
Summary functions
dplyr
The data manipulation chapter introduced the dplyr package, which includes the summarize() function. This function creates summaries of a data frame. For example, the mean of the ROA column in a data frame ‘df’ can be computed with the following code:
df %>%
  summarize(meanROA = mean(ROA))
The code below would save the summary as its own object “res”:
res <- df %>%
  summarize(meanROA = mean(ROA))
The summary could be created for each group as well:
res <- df %>%
  group_by(Year) %>%
  summarize(meanROA = mean(ROA))
summary
The summarize() function as shown above requires specifying each statistic and column. A handy way to summarize the entire data frame at once is the summary() function:
summary(df)
This function outputs statistics about each column in the data frame.
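As a minimal sketch of what summary() reports, the example below uses a small made-up data frame (the values are illustrative only) and needs only base R.

```r
# Hypothetical data frame (illustrative values only)
df <- data.frame(
  Year = c(2020, 2020, 2021, 2021),
  ROA  = c(0.05, 0.07, 0.02, NA)
)

# For each numeric column: minimum, quartiles, median, mean, and maximum,
# plus a count of NA values when the column has any
res <- summary(df)
res

# summary() also works on a single column
roa_summary <- summary(df$ROA)
roa_summary
```

Unlike summarize(), summary() skips missing values on its own and reports how many it found.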
Aggregation functions
Different descriptive statistics require different aggregation functions. Any of the common aggregation functions could be used in place of the mean() function shown in the previous section.
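A few widely used base R aggregation functions are sketched below on a made-up vector of ROA values. The selection is an assumption about which functions are most relevant here, not a reproduction of the chapter's own list.

```r
# Hypothetical ROA values (illustrative only)
roa <- c(0.02, 0.04, 0.04, 0.06, 0.09)

mean(roa)            # arithmetic average
median(roa)          # middle value
min(roa)             # smallest value
max(roa)             # largest value
range(roa)           # smallest and largest value together
var(roa)             # sample variance
sd(roa)              # sample standard deviation
quantile(roa, 0.25)  # 25th percentile
length(roa)          # number of values
table(roa)           # frequency count of each distinct value
```

Most of these return a single value and can be dropped into summarize() in place of mean(). Inside summarize(), dplyr's n() plays the role of length() for counting rows.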
Missing values in aggregation functions
Many aggregation functions compute over an entire column of a data frame. For example, mean() sums the column and divides by the number of values in it. If the column contains missing values, the aggregation function returns a missing value. For this reason, many aggregation functions accept an argument that tells them to ignore missing values. In R, one type of missing value is “NA”, and the argument is na.rm, where “rm” is shorthand for remove. Setting na.rm = TRUE removes the missing values before the calculation. Other aggregation functions that allow this option use it in the same way as the example below with mean().
df %>%
  summarize(meanROA = mean(ROA, na.rm = TRUE))
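To see the effect of na.rm, the sketch below compares the two calls on a small made-up data frame with one missing value. The column names in the result are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data with one missing ROA value (illustrative only)
df <- data.frame(ROA = c(0.05, NA, 0.07))

res <- df %>%
  summarize(
    meanWithNA    = mean(ROA),                # NA, because one value is missing
    meanWithoutNA = mean(ROA, na.rm = TRUE),  # mean of the two non-missing values
    nMissing      = sum(is.na(ROA))           # how many values were ignored
  )

res
```

Reporting the number of missing values alongside the statistic is a design choice worth considering, since a mean computed with na.rm = TRUE silently rests on fewer observations.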
Examples
The video below provides examples of descriptive statistics with the data set from the previous chapter. This data set is available here: https://www.dropbox.com/scl/fi/g7gmo0jgj797lwk5b0ltr/ROARDA.csv?rlkey=1rk4cjb0n6ezby4e3y1gicure&st=vg1e4qk2&dl=0.
Conclusion
This chapter explained the importance and usage of descriptive statistics and how to use summary(), summarize(), and aggregation functions to create descriptive statistics.
Review
The flashcards below review the R aggregation functions.