Data manipulation

Jeremiah Green

6 Data manipulation

Learning Objectives

Describe dplyr commands
Combine commands for common manipulation tasks
Examine tables to verify data manipulation

Chapter content

In this chapter, we will introduce the Tidyverse, a set of packages that can be used for data manipulation and analysis. The Tidyverse includes packages such as dplyr, lubridate, stringr, and ggplot2, which are designed to work together to perform most data manipulation and analysis tasks. This chapter primarily works with dplyr and provides examples of how it can be used to manipulate and analyze data. lubridate will also be introduced. Many resources such as websites and tutorial videos explain the tidyverse packages in detail. Links to some of these resources are provided here.

lubridate cheatsheet

https://rstudio.github.io/cheatsheets/html/lubridate.html

dplyr cheatsheet

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Tidyverse website

https://www.tidyverse.org/

dplyr

dplyr is an R package designed to make data manipulation and analysis easier and more efficient. dplyr provides a set of functions that work with data frames. The most common dplyr functions are listed below.

`filter()`	filter rows in a data frame based on a condition or conditions
`select()`	select columns in a data frame
`arrange()`	sort a data frame by the values in a column or columns
`mutate()`	alter or create columns
`summarise()`	aggregate a data frame to a group level based on aggregation functions (e.g. mean, sd)
`group_by()`	group data by row values within a column for subsequent steps such as summarize
`ungroup()`	removing grouping for subsequent steps
`distinct()`	remove duplicate rows based on a column or columns
`rename()`	rename columns
`drop_na()`	remove rows with missing values of a column or columns
`across()`	apply a function to multiple columns
`sample_frac()`	sample a fraction of data frame rows
`sample_n()`	sample number of rows of data frame
`join()`	left_join, right_join, inner_join, or outer_join. Join a data frame with another data frame by columns with the same names (or specify ids)
`pivot_longer()`	convert a data frame from wide to long format
`pivot_wider()`	convert a data frome from long to wide format

We will cover the functions by example below. Many of the functions in dplyr are similar to functions that you might find in EXCEL or SQL.

Pipe operator

Most often, manipulating or analyzing a data frame involves multiple steps. dplyr allows multiple steps to be combined via a connection operator referred to as a piper operator. In R this is %>% (typed directly or CTRL+SHIFT+M). This operator takes the input coming from the left of the operator and performs the action on the right of the operator. In this way, a sequence of steps can be combined together. For example, the following code combines multiple steps for the same data frame.

altereddata <- df %>%

arrange(id) %>%

filter(col1>0)

This example code can be read as follows. Create a new data frame called ‘altereddata’ from (<-) a data frame named ‘df’. Using ‘df’ then (%>%) sort by the column named ‘id’. After sorting the data frame then (%>%) filter the data frame so that all rows only have value of ‘col1’ greater than zero.

Column types and lubridate

Column types

Many functions in a data frame can only be used on columns of specific types. If column types do not import as desired or if a new column type needs to be specified, you may need to change a column type format. Click on the following to learn about column format functions. These functions might be used within a mutate function, for example as mutate(newcolumn = as.character(old.column)).

lubridate package

Sometimes working with dates involves making calculations, extracting parts of dates, or changing types in a way that as.Date cannot handle. The lubridate package helps make working with dates easier. There are various lubridate functions that may be useful when working with dates.

ymd(), mdy(), dmy()	Operates as as.Date function when dates are in year-month-day (ymd), month-day-year (mdy), or day-month-year (dmy) formats. Example: `mdy("05/01/2025")="2025-05-01"`
year(),month(),day()	Extracts a piece of date. Example: `month("2025-05-01")=5`
days(),months(),years()	Allows arithmetic with dates. Example: `"2025-05-01"+ days(10) = "2025-05-11"`

The dplyr and pipe operator functions will be explained and combined by example below.

dplyr and pipe operator examples and explanations

Step-by-Step Example Using Pipes and Multiple Functions

Suppose you want to find the average ROA for each company, but only for companies with an ROA above 0.05. These steps can be chained together.

result<- df %>%

filter(ROA > 0.05) %>%

group_by(YEAR) %>%

summarize(mean_ROA = mean(ROA, na.rm = TRUE))

This code creates a new data frame labeled result from the data frame labeled df and filters for ROA above 0.05, groups by company, and calculates the mean ROA per company.

The video below walks through the code above and examines the resulting data frame step-by-step.