6 Data manipulation
Learning Objectives
- Describe dplyr commands
- Combine commands for common manipulation tasks
- Examine tables to verify data manipulation
Chapter content
dplyr
filter() |
filter rows in a data frame based on a condition or conditions |
select() |
select columns in a data frame |
arrange() |
sort a data frame by the values in a column or columns |
mutate() |
alter or create columns |
summarise() |
aggregate a data frame to a group level based on aggregation functions (e.g. mean, sd) |
group_by() |
group data by row values within a column for subsequent steps such as summarize |
ungroup() |
removing grouping for subsequent steps |
distinct() |
remove duplicate rows based on a column or columns |
rename() |
rename columns |
drop_na() |
remove rows with missing values of a column or columns |
across() |
apply a function to multiple columns |
sample_frac() |
sample a fraction of data frame rows |
sample_n() |
sample number of rows of data frame |
join() |
left_join, right_join, inner_join, or outer_join. Join a data frame with another data frame by columns with the same names (or specify ids) |
pivot_longer() |
convert a data frame from wide to long format |
pivot_wider() |
convert a data frome from long to wide format |
Pipe operator
%>%
(typed directly or CTRL+SHIFT+M). This operator takes the input coming from the left of the operator and performs the action on the right of the operator. In this way, a sequence of steps can be combined together. For example, the following code combines multiple steps for the same data frame.altereddata <- df %>%
arrange(id) %>%
filter(col1>0)
This example code can be read as follows. Create a new data frame called ‘altereddata’ from (<-
) a data frame named ‘df’. Using ‘df’ then (%>%
) sort by the column named ‘id’. After sorting the data frame then (%>%
) filter the data frame so that all rows only have value of ‘col1’ greater than zero.
Column types and lubridate
Column types
Many functions in a data frame can only be used on columns of specific types. If column types do not import as desired or if a new column type needs to be specified, you may need to change a column type format. Click on the following to learn about column format functions. These functions might be used within a mutate function, for example as mutate(newcolumn = as.character(old.column)).
lubridate package
Sometimes working with dates involves making calculations, extracting parts of dates, or changing types in a way that as.Date cannot handle. The lubridate package helps make working with dates easier. There are various lubridate functions that may be useful when working with dates.
ymd(), mdy(), dmy() | Operates as as.Date function when dates are in year-month-day (ymd), month-day-year (mdy), or day-month-year (dmy) formats. Example: mdy("05/01/2025")="2025-05-01" |
year(),month(),day() | Extracts a piece of date. Example: month("2025-05-01")=5 |
days(),months(),years() | Allows arithmetic with dates. Example: "2025-05-01" + days(10) = "2025-05-11" |
dplyr and pipe operator examples and explanations
Step-by-Step Example Using Pipes and Multiple Functions
Suppose you want to find the average ROA for each company, but only for companies with an ROA above 0.05. These steps can be chained together.
result<- df %>%
filter(ROA > 0.05) %>%
group_by(YEAR) %>%
summarize(mean_ROA = mean(ROA, na.rm = TRUE))
This code creates a new data frame labeled result from the data frame labeled df and filters for ROA above 0.05, groups by company, and calculates the mean ROA per company.
The video below walks through the code above and examines the resulting data frame step-by-step.
Review
Mini-case video
Jones, J. J. (1991). Earnings management during import relief investigations. Journal of Accounting Research, 29(2), 193-228.
References
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
https://rstudio.github.io/cheatsheets/html/lubridate.html