"

Chapter 4: General Purpose Software – R Basics

Learning outcomes

At the end of this chapter, you should be able to

  • Explain the value of general purpose statistical software
  • Navigate RStudio, R packages, and R commands
  • Use code prompting and LLM interactions for troubleshooting

Chapter content

Note: include practice and knowledge checks throughout chapter

# R Basics

R is a high-level programming language that is widely used for data analysis, statistical computing, and visualization in academia, research, and industry. R is open-source software that is free to use and has a large community of users and developers. In this course we will use R because it has highly accessible tools for data analysis and modeling that can be applied in a wide range of applications.\footnote{R is not the only software for statistical computing. Others include Python, SAS, MATLAB, and SPSS. Python is perhaps the most commonly used in industry, particularly for machine learning tasks. We will use R in this course because it has some tools that are particularly useful for data analysis and visualization and, for some tasks, is more user-friendly.}

Below is a list of useful packages and tools that we will use in this course:

1 – RStudio – an IDE for accessing R.

2 – GitHub Copilot – an AI tool that can help generate code in R and Python.

3 – H2O – a machine learning library that can be used in R.

4 – Tidyverse – a set of packages that can be used for data manipulation and analysis.

There are many tutorials and resources available online for learning R. Additionally, searching online for specific questions or problems can be very helpful.

A useful book, available both online and in print, is [R for Data Science](https://r4ds.had.co.nz/).

## R/RStudio

R is the programming language, and RStudio is an integrated development environment (IDE) that provides a user-friendly interface for working with R. In this course, we will use both. Download links and installation instructions for R and RStudio are available [here](https://posit.co/download/rstudio-desktop/).

### Accessing and navigating RStudio

When you open RStudio, you will see up to four panes: the console, the script editor, the environment/history pane, and the files/plots/packages/help pane. The script editor appears once you open or create a script. The console is where you can type and run R code. The script editor is where you can write and save R scripts. The environment/history pane shows the objects in your workspace and the history of commands you have run. The files/plots/packages/help pane allows you to navigate your files, view plots, install and load packages, and access help documentation.

You can use [this video](https://www.youtube.com/watch?v=FIrsOBy5k58) to walk through the RStudio interface.

### R objects and structure

The R language has different ways to work with data that include base structures and functions as well as packages that augment these structures and functions. There are many introductions to basic R structures and functions. A short introduction is available [here](https://www.youtube.com/watch?v=FY8BISK5DpM).

The primary structure we will work with in this class is called a data frame. A data frame might be thought of as a spreadsheet with numbered or named rows and columns. Each column can have a different type of data, such as numeric, character, or logical. Data frames are the primary structure used in the Tidyverse, a set of packages that we will use in this course.

Below, we will provide an introduction to working with data frames in R.

*Create data frame*

To create a data frame, you can use the `data.frame()` function. For example, you can create a data frame with the following code:

```r
data <- data.frame(
  companyTIC = c("WMT", "WMT", "AMZN"),
  year = c(2005, 2004, 2005),
  sales = c(14.5, 14.0, 13.7)
)
```
This code creates a data frame with three columns: `companyTIC`, `year`, and `sales`. The `companyTIC` column contains the company’s ticker symbol, the `year` column contains the year of the sales data, and the `sales` column contains the sales data.
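
To check what was created, it can help to inspect the type of each column. The sketch below uses the same values under a hypothetical name, `example_df`, so that it does not overwrite anything in your workspace. It assumes R 4.0 or later, where text columns are stored as character rather than factor.

```r
# Same values as the example above, stored under a hypothetical name
example_df <- data.frame(
  companyTIC = c("WMT", "WMT", "AMZN"),
  year       = c(2005, 2004, 2005),
  sales      = c(14.5, 14.0, 13.7)
)

# class() reports the data type of a single column
class(example_df$companyTIC)
class(example_df$sales)

# sapply() applies class() to every column and returns one type per column
column_types <- sapply(example_df, class)
column_types

# str() prints a compact overview: dimensions, column names, types, and values
str(example_df)
```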

*Import data frame*

You can also import data frames from external sources, such as CSV files. For example, you can import a CSV file with the following code:

```r
data <- read.csv("data.csv")
```

This code reads a CSV file called `data.csv` and stores it in a data frame called `data`. The CSV file must be in the working directory or you must provide the full path to the file. If the file is not in the working directory, you can use the `setwd()` function to set the working directory. For example:

```r
setwd("C:/Users/username/Documents")
```

Alternatively, you can provide the full path to the file in the `read.csv()` function. For example:

```r
data <- read.csv("C:/Users/username/Documents/data.csv")
```
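
If you do not have a CSV file at hand, one way to practice is to write a small data frame to a file and read it back. The sketch below is only an illustration: the object names and the file name `example_sales.csv` are hypothetical, and the file is written to R's temporary folder so that nothing in your own folders is changed.

```r
# Hypothetical data frame to write out and read back
sales_df <- data.frame(
  companyTIC = c("WMT", "WMT", "AMZN"),
  year       = c(2005, 2004, 2005),
  sales      = c(14.5, 14.0, 13.7)
)

# Show the current working directory (where relative file names are looked up)
getwd()

# Build a full path inside R's temporary folder; file.path() inserts the
# path separator between the folder and the file name
csv_path <- file.path(tempdir(), "example_sales.csv")

# row.names = FALSE keeps the row numbers from being written as an extra column
write.csv(sales_df, csv_path, row.names = FALSE)

# Confirm the file exists, then read it back using the full path
file.exists(csv_path)
imported_df <- read.csv(csv_path)
imported_df
```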

Different functions are used to import data from other file types, such as `read_excel()` from the readxl package for Excel files or `read_sas()` from the haven package for SAS files.

As an alternative to reading the data from a file by typing code into the console, you can use the RStudio interface to import data. You can do this by clicking on the “Import Dataset” button in the Environment pane and selecting the file type you want to import.

*View data frame*

To view the data in a data frame, you can use the `head()` function. For example:

```r
head(data)
```

This code displays the first few rows of the data frame `data`. You can also use the `tail()` function to display the last few rows of the data frame. To view the entire data frame in a spreadsheet-like format, you can use the `View()` function. For example:

```r
View(data)
```

This code opens a new window that displays the data frame `data` in a spreadsheet-like format. Alternatively, you can click on the data frame in the Environment pane and view it in the data viewer.
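
Both `head()` and `tail()` accept an `n` argument that controls how many rows are shown, and `dim()`, `nrow()`, and `ncol()` report the size of a data frame. The sketch below uses `mtcars`, a data frame that ships with R, purely as a convenient example; the object names are hypothetical.

```r
# mtcars is a built-in data frame, used here only for illustration
first_three <- head(mtcars, n = 3)   # first 3 rows instead of the default 6
last_two    <- tail(mtcars, n = 2)   # last 2 rows

first_three
last_two

# dim() returns the number of rows and the number of columns together;
# nrow() and ncol() return each one separately
dim(mtcars)
nrow(mtcars)
ncol(mtcars)
```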

*Column and row names*

You can access the column names of a data frame using the `colnames()` function. For example:

```r
colnames(data)
```

This code returns the column names of the data frame `data`. You can also access the row names of a data frame using the `rownames()` function. For example:

```r
rownames(data)
```

This code returns the row names of the data frame `data`. You can set the names of the columns and rows using the `colnames()` and `rownames()` functions. For example:

```r
colnames(data) <- c("Company", "Year", "Sales")
rownames(data) <- c("1", "2", "3")
```

This code sets the column names of the data frame `data` to `Company`, `Year`, and `Sales`, and the row names to `1`, `2`, and `3`.

Rows and columns can be accessed using the `[]` operator. For example:

```r
data[1, 2]
```

This code returns the value in the first row and second column of the data frame `data`. Alternatively, the column and row names can be used to access rows and columns. For example:

```r
data[1, "Year"]
```

This code returns the value in the first row and the `Year` column of the data frame `data`.
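
The same `[]` operator can return whole rows, whole columns, or several of each, and the `$` operator is a common shortcut for a single column. The following is a minimal sketch using a hypothetical data frame, `example_df`, with the renamed columns from above.

```r
# Hypothetical data frame with the renamed columns
example_df <- data.frame(
  Company = c("WMT", "WMT", "AMZN"),
  Year    = c(2005, 2004, 2005),
  Sales   = c(14.5, 14.0, 13.7)
)

# Leave the column position blank to return an entire row (a one-row data frame)
first_row <- example_df[1, ]

# Leave the row position blank to return an entire column (a vector)
year_values <- example_df[, "Year"]

# The $ operator selects one column by name
sales_values <- example_df$Sales

# Vectors of positions or names select several rows and columns at once
subset_df <- example_df[c(1, 3), c("Company", "Sales")]
subset_df
```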

*Working with data frame*

You can perform various operations on data frames, such as filtering, sorting, and summarizing the data. For example, you can filter the data frame to include only rows where the `Sales` column is greater than 14:

```r
data[data$Sales > 14, ]
```

This code returns the rows of the data frame `data` where the `Sales` column is greater than 14. You can sort the data frame by the `Sales` column in descending order. For example:

```r
data[order(data$Sales, decreasing = TRUE), ]
```

This code sorts the data frame `data` by the `Sales` column in descending order. You can also summarize the `Sales` column. The `summary()` function reports the minimum, first quartile, median, mean, third quartile, and maximum, and the `sd()` function calculates the standard deviation. For example:

```r
summary(data$Sales)
sd(data$Sales)
```

This code displays the summary statistics, including the mean and median, followed by the standard deviation of the `Sales` column of the data frame `data`.

These are just a few examples of the operations you can perform on data frames in R. There are many other functions and packages available for working with data frames in R.
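
Two further operations that come up often are summarizing by group and adding a column computed from existing ones. The sketch below shows one possible way to do each in base R; the data frame `example_df` and the column name `AboveAverage` are hypothetical.

```r
# Hypothetical data frame with the renamed columns
example_df <- data.frame(
  Company = c("WMT", "WMT", "AMZN"),
  Year    = c(2005, 2004, 2005),
  Sales   = c(14.5, 14.0, 13.7)
)

# Mean sales for each company: the formula reads "Sales grouped by Company"
mean_by_company <- aggregate(Sales ~ Company, data = example_df, FUN = mean)
mean_by_company

# Add a new logical column that flags rows with above-average sales
example_df$AboveAverage <- example_df$Sales > mean(example_df$Sales)
example_df
```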

*Setting/changing column types*

You can change the data type of a column in a data frame using the `as.*()` family of functions, such as `as.character()`, `as.numeric()`, and `as.Date()`. For example, you can change the `Sales` column from numeric to character using the `as.character()` function:

```r
data$Sales <- as.character(data$Sales)
```

This code changes the `Sales` column of the data frame `data` from numeric to character. You can also change the data type of a column when you create the data frame. For example:

```r
data <- data.frame(
  companyTIC = c("WMT", "WMT", "AMZN"),
  year = c(2005, 2004, 2005),
  sales = as.character(c(14.5, 14.0, 13.7))
)
```

This code creates a data frame with the `sales` column stored as the character data type.
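
Conversions in the other direction deserve a check, because text that does not look like a number becomes a missing value (`NA`). The sketch below uses made-up values to illustrate this; `suppressWarnings()` is used only to hide the warning R would otherwise print about the failed conversion.

```r
# Hypothetical values stored as text; "n/a" is not a valid number
sales_text <- c("14.5", "14.0", "n/a")
class(sales_text)

# as.numeric() converts text to numbers; text that cannot be converted becomes NA
sales_numeric <- suppressWarnings(as.numeric(sales_text))
sales_numeric

# is.na() shows which values failed to convert
is.na(sales_numeric)
```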

Working with dates requires special attention. Dates can be imported as character or numeric data types and then converted to date data types using the `as.Date()` function. For example:

```r
data$date <- c("01/01/2005", "01/02/2004", "01/03/2005")
data$date <- as.Date(data$date, format = "%m/%d/%Y")
```

This code creates a column `date` and converts the `date` column of the data frame `data` from character to date data type. The `format` argument specifies the format of the date in the `date` column.
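
Once a column has the date data type, you can extract parts of a date or do arithmetic with it. The following is a small sketch using the same three dates as a standalone vector; the object names are hypothetical.

```r
# The same three dates, first as text and then converted to the Date type
date_text <- c("01/01/2005", "01/02/2004", "01/03/2005")
dates <- as.Date(date_text, format = "%m/%d/%Y")
class(dates)

# format() returns parts of a date as text; "%Y" is the four-digit year
years <- as.numeric(format(dates, "%Y"))
years

# Subtracting two dates gives the number of days between them
days_between <- as.numeric(dates[3] - dates[1])
days_between

# If the format argument does not match the text, the result is NA
mismatch <- as.Date(date_text[1], format = "%Y-%m-%d")
is.na(mismatch)
```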

### R functions and packages

R has a large number of functions and packages that can be used for data analysis and visualization. Some of the most commonly used packages are part of the Tidyverse, a set of packages that work together to make data manipulation and analysis easier and more efficient. The Tidyverse includes packages such as dplyr, lubridate, stringr, and ggplot2, which are designed to work together to perform most data manipulation and analysis tasks.

The dplyr package is a powerful tool for data manipulation and analysis. It provides a set of functions that can be used to filter, sort, summarize, and join data frames. The lubridate package is a tool for working with dates and times. It provides functions that can be used to extract, manipulate, and format dates and times. The stringr package is a tool for working with strings. It provides functions that can be used to manipulate and format strings. The ggplot2 package is a tool for data visualization. It provides functions that can be used to create a wide range of plots and charts.

To use functions from a package in R, you must first install the package using the `install.packages()` function and then load the package using the `library()` function. For example, to install and load the dplyr package, you can use the following code:

```r
install.packages("dplyr")
library(dplyr)
```

This code installs the dplyr package and loads it into the R session. You can then use functions from the dplyr package in your code. For example, you can use the `filter()` function from the package to filter rows of a data frame that meet certain conditions. Installing a package only needs to be done once on a device, but the package must be loaded into the R session each time you start a new R session.
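
As an illustration of what this looks like in practice, the sketch below repeats the earlier filter, sort, and summarize steps with dplyr functions. The data frame `example_df` and the result names are hypothetical, and the `requireNamespace()` check is only there so that the code does nothing, rather than stopping with an error, on a device where dplyr has not yet been installed.

```r
# Hypothetical data frame with the renamed columns
example_df <- data.frame(
  Company = c("WMT", "WMT", "AMZN"),
  Year    = c(2005, 2004, 2005),
  Sales   = c(14.5, 14.0, 13.7)
)

# Run the dplyr steps only if the package is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Keep only rows where Sales is greater than 14
  high_sales <- filter(example_df, Sales > 14)

  # Sort by Sales in descending order
  sorted_df <- arrange(example_df, desc(Sales))

  # Calculate the mean and standard deviation of Sales
  sales_summary <- summarise(
    example_df,
    mean_sales = mean(Sales),
    sd_sales   = sd(Sales)
  )

  print(high_sales)
  print(sorted_df)
  print(sales_summary)
}
```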

There are many tutorials and resources available online for learning how to use R packages. Additionally, searching online for specific questions or problems can be very helpful.

### GitHub Copilot (or other AI tools)

GitHub Copilot is an AI tool that can help generate code in R and Python. It is a powerful tool that can save time and effort when writing code. GitHub Copilot can be used in RStudio to generate code snippets, function definitions, and other code elements. It can also be used to provide suggestions and corrections when writing code.

To use GitHub Copilot in RStudio, you must first enable it in the RStudio settings. You can do this by going to the Tools menu, selecting Global Options, opening the Copilot tab, and checking the option to enable GitHub Copilot. You will then be asked to sign in with your GitHub account. Once Copilot is enabled, you can use it to generate code snippets and other code elements in RStudio. A GitHub Copilot account is free for students. Instructions for setting up GitHub Copilot with RStudio are available [here](https://docs.posit.co/ide/user/ide/guide/tools/copilot.html). Instructions for setting up a student account for GitHub and GitHub Copilot are available [here](https://techcommunity.microsoft.com/t5/educator-developer-blog/step-by-step-setting-up-github-student-and-github-copilot-as-an/ba-p/3736279). Further information about GitHub Copilot is available [here](https://docs.github.com/en/copilot/about-github-copilot).

Once you have GitHub Copilot working in RStudio, you can generate text and code by starting to type. Copilot will make suggestions in a way similar to the autofill tools you might be familiar with.

Alternatively, you may use other AI tools to help generate code in R. For example, you can use Microsoft Copilot in a chat environment to generate code via prompts.

Importantly, while these tools can be very helpful, they are not perfect and may generate incomplete or incorrect code. It is important to review the code generated by these tools by testing and evaluating the output.
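
One simple way to test generated code is to run it on a small input where you can work out the answer by hand. The sketch below is only an illustration: `sales_growth()` is a hypothetical function standing in for something an AI tool might suggest, and `stopifnot()` stops with an error if a check fails.

```r
# Hypothetical function, standing in for code suggested by an AI tool
sales_growth <- function(current, previous) {
  (current - previous) / previous
}

# Check against a value worked out by hand:
# sales rising from 14.0 to 14.5 is growth of 0.5 / 14.0
hand_calculated <- 0.5 / 14.0
stopifnot(isTRUE(all.equal(sales_growth(14.5, 14.0), hand_calculated)))

# Also try an unusual input: previous sales of zero means dividing by zero,
# which R reports as Inf rather than as an error
zero_case <- sales_growth(14.5, 0)
zero_case
```

Checks like these do not prove the code is correct, but they catch many common mistakes and show how the code behaves on inputs the tool may not have considered.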

In my experience, generative text tools work best when I already understand the tools involved and simply want to write code faster. At times they can also help me get started with code I do not yet understand, or explain code that I have written or that the tool has generated.

### H2O

H2O is a machine learning library that can be used in R. It provides a set of functions and algorithms that can be used to build and train machine learning models. Most relevant to our course is AutoML, a function that automatically trains and tunes a set of candidate models and ranks them on a leaderboard. H2O AutoML builds supervised models, that is, models for regression and classification tasks. Other tasks, such as clustering and anomaly detection, are handled by separate H2O algorithms rather than by AutoML.

AutoML is a class of tools currently being developed and refined on different platforms. For example, see an article about AutoGluon [here](https://towardsdatascience.com/automl-with-autogluon-transform-your-ml-workflow-with-just-four-lines-of-code-1d4b593be129) or about H2O AutoML [here](https://towardsdatascience.com/automated-machine-learning-with-h2o-258a2f3a203f).

The primary benefit of AutoML is that it allows users to build and train machine learning models without a deep understanding of machine learning algorithms or programming. It also speeds up the process by automating many of the steps involved, which is useful even for experienced data scientists. After trying AutoML, a model might be further refined by using specific algorithms along with preprocessing the features and tuning the hyperparameters of the model.

Details of H2O and AutoML are available [here](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Getting started with H2O in RStudio is explained [here](https://www.youtube.com/watch?v=zzV1kTCnmR0).
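
To give a sense of what an AutoML run looks like, the following is a rough sketch rather than a recommended workflow. It assumes the h2o package and Java are already installed, uses the built-in `mtcars` data frame only as a stand-in for real data, and sets a small, arbitrary `max_models` limit to keep the run short. The steps are wrapped in a hypothetical function, `run_automl_sketch()`, so that nothing starts until you call it.

```r
# A sketch only: requires the h2o package (install.packages("h2o")) and Java
run_automl_sketch <- function() {
  library(h2o)
  h2o.init()                          # start a local H2O session

  # Copy a built-in data frame into H2O and split it into training and test sets
  cars_h2o <- as.h2o(mtcars)
  splits <- h2o.splitFrame(cars_h2o, ratios = 0.8, seed = 1)
  train <- splits[[1]]
  test <- splits[[2]]

  # Predict miles per gallon (mpg) from all other columns
  aml <- h2o.automl(
    y = "mpg",
    training_frame = train,
    max_models = 5,                   # arbitrary small limit for illustration
    seed = 1
  )

  # The leaderboard ranks the trained models; the leader is the top model
  print(aml@leaderboard)
  predictions <- h2o.predict(aml@leader, test)
  print(head(predictions))

  invisible(aml)
}

# To try it, run: run_automl_sketch()
```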

## Review
### Conceptual questions

1. What is R and why is it used for data analysis and statistical computing?

2. What is RStudio and how is it used with R?

3. What is a data frame in R and how is it used?

4. What are some common functions and packages used in R for data analysis and visualization?

5. What is GitHub Copilot and how can it be used with RStudio?

6. What is H2O and how can it be used in R?

### Practice questions

1. Copy the data frame `USArrests`, which is included in base R, into a new object and view the first few rows. For example, type `practicedta <- USArrests` into the console and press Enter. What are the column names of the data frame?

2. Add the values in all crime columns to create a new column that holds the total for each state. (The help page opened by `?USArrests` describes what each column records.) What is the total for each of Alabama, Alaska, and Arizona?

3. Sum the new column to calculate the total number of crimes in the dataset.

4. Create a new data frame with the following data:

| Name | Age |
| --- | --- |
| John | 25 |
| Jane | 30 |
| Jack | 35 |

5. What are the data types for these columns?

6. Change the column type to character for the `Age` column.

7. Create a csv file saved in the working directory with the following data.

| Company Name | Income statement date | Sales |
| --- | --- | --- |
| ABC | '2020-01-15' | 19.3 |
| ABC | '2021-01-15' | 8.5 |
| XYZ | '2022-01-01' | 27.1 |

8. Import the csv file from the working directory and view the first few rows of the data frame.

9. What are the column names of the data frame?

10. Filter the data frame to include only rows where the `Sales` column is greater than 14.

11. Sort the data frame by the `Sales` column in descending order.

12. Summarize the data frame to calculate the mean, median, and standard deviation of the `Sales` column.

13. Set up RStudio to work with GitHub Copilot.

14. Install and load the dplyr package.

15. Set up H2O in RStudio.

## Solutions to practice questions

1. The column names of the `USArrests` data frame are `Murder`, `Assault`, `UrbanPop`, and `Rape`.

2. This could be done with the following code: `practicedta$TotalCrimes <- practicedta$Murder + practicedta$Assault + practicedta$Rape`. The `UrbanPop` column is left out because it records the percentage of the population living in urban areas, not a crime. The values are arrests per 100,000 residents, so the totals are rates rather than counts: 270.4 for Alabama, 317.5 for Alaska, and 333.1 for Arizona.

3. This could be done with the following code: `sum(practicedta$TotalCrimes)`. The total across all states in the dataset is 9,989.

4. This could be done with the following code: `newdata <- data.frame(Name = c("John", "Jane", "Jack"), Age = c(25, 30, 35))`.

5. The data types for the columns are character for `Name` and numeric for `Age`.

6. This could be done with the following code: `newdata$Age <- as.character(newdata$Age)`.

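7. This could be done with the following code: `dta <- data.frame(Company.Name = c("ABC", "ABC", "XYZ"), Income.statement.date = c("2020-01-15", "2021-01-15", "2022-01-01"), Sales = c(19.3, 8.5, 27.1))` and then `write.csv(dta, "dta.csv", row.names = FALSE)`. The `row.names = FALSE` argument keeps the row numbers from being written to the file as an extra column.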

8. This could be done with the following code: `dta <- read.csv("dta.csv")` and then `head(dta)`.

9. The column names of the data frame are `Company.Name`, `Income.statement.date`, and `Sales`.

10. This could be done with the following code: `dta[dta$Sales > 14, ]`.

11. This could be done with the following code: `dta[order(dta$Sales, decreasing = TRUE), ]`.

12. This could be done with the following code: `summary(dta$Sales)` for the mean and median, and `sd(dta$Sales)` for the standard deviation.

13. Instructions for setting up RStudio to work with GitHub Copilot are available [here](https://docs.posit.co/ide/user/ide/guide/tools/copilot.html).

14. This could be done with the following code: `install.packages("dplyr")` and then `library(dplyr)`.

15. Instructions for setting up H2O in RStudio are available [here](https://www.youtube.com/watch?v=zzV1kTCnmR0).

Tutorial video

Note: include practice and knowledge checks

Mini-case video

Note: include practice and knowledge checks

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.