"

6 Data manipulation

Learning Objectives

  • Describe dplyr commands
  • Combine commands for common manipulation tasks
  • Examine tables to verify data manipulation

Chapter content

In this chapter, we will introduce the Tidyverse, a set of packages that can be used for data manipulation and analysis. The Tidyverse includes packages such as dplyr, lubridate, stringr, and ggplot2, which are designed to work together to perform most data manipulation and analysis tasks. This chapter primarily works with dplyr and provides examples of how it can be used to manipulate and analyze data. lubridate will also be introduced. Many resources such as websites and tutorial videos explain the tidyverse packages in detail. Links to some of these resources are provided here.

Tidyverse website

dplyr

dplyr is an R package designed to make data manipulation and analysis easier and more efficient. dplyr provides a set of functions that work with data frames. The most common dplyr functions are listed below.
filter() filter rows in a data frame based on a condition or conditions
select() select columns in a data frame
arrange() sort a data frame by the values in a column or columns
mutate() alter or create columns
summarise() aggregate a data frame to a group level based on aggregation functions (e.g. mean, sd)
group_by() group data by row values within a column for subsequent steps such as summarize
ungroup() removing grouping for subsequent steps
distinct() remove duplicate rows based on a column or columns
rename() rename columns
drop_na() remove rows with missing values of a column or columns
across() apply a function to multiple columns
sample_frac() sample a fraction of data frame rows
sample_n() sample number of rows of data frame
join() left_join, right_join, inner_join, or outer_join. Join a data frame with another data frame by columns with the same names (or specify ids)
pivot_longer() convert a data frame from wide to long format
pivot_wider() convert a data frome from long to wide format
We will cover the functions by example below. Many of the functions in dplyr are similar to functions that you might find in EXCEL or SQL.

Pipe operator

Most often, manipulating or analyzing a data frame involves multiple steps. dplyr allows multiple steps to be combined via a connection operator referred to as a piper operator. In R this is %>% (typed directly or CTRL+SHIFT+M). This operator takes the input coming from the left of the operator and performs the action on the right of the operator. In this way, a sequence of steps can be combined together. For example, the following code combines multiple steps for the same data frame.

altereddata <- df %>%

arrange(id) %>%

filter(col1>0)

This example code can be read as follows. Create a new data frame called ‘altereddata’ from (<-) a data frame named ‘df’. Using ‘df’ then (%>%) sort by the column named ‘id’. After sorting the data frame then (%>%) filter the data frame so that all rows only have value of ‘col1’ greater than zero.

Column types and lubridate

Column types

Many functions in a data frame can only be used on columns of specific types. If column types do not import as desired or if a new column type needs to be specified, you may need to change a column type format. Click on the following to learn about column format functions. These functions might be used within a mutate function, for example as mutate(newcolumn = as.character(old.column)).

lubridate package

Sometimes working with dates involves making calculations, extracting parts of dates, or changing types in a way that as.Date cannot handle. The lubridate package helps make working with dates easier. There are various lubridate functions that may be useful when working with dates.

ymd(), mdy(), dmy() Operates as as.Date function when dates are in year-month-day (ymd), month-day-year (mdy), or day-month-year (dmy) formats. Example: mdy("05/01/2025")="2025-05-01"
year(),month(),day() Extracts a piece of date. Example: month("2025-05-01")=5
days(),months(),years() Allows arithmetic with dates. Example: "2025-05-01"+ days(10) = "2025-05-11"
The dplyr and pipe operator functions will be explained and combined by example below.

dplyr and pipe operator examples and explanations

Step-by-Step Example Using Pipes and Multiple Functions

Suppose you want to find the average ROA for each company, but only for companies with an ROA above 0.05. These steps can be chained together.

result<- df %>%
filter(ROA > 0.05) %>%
group_by(YEAR) %>%
summarize(mean_ROA = mean(ROA, na.rm = TRUE))

This code creates a new data frame labeled result from the data frame labeled df and filters for ROA above 0.05, groups by company, and calculates the mean ROA per company.

The video below walks through the code above and examines the resulting data frame step-by-step.

Review

Mini-case video

Jones, J. J. (1991). Earnings management during import relief investigations. Journal of Accounting Research29(2), 193-228.

References

https://www.tidyverse.org/

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

https://rstudio.github.io/cheatsheets/html/lubridate.html

 

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.