General Purpose Software – R Basics

Jeremiah Green

5 General Purpose Software – R Basics

Learning Objectives

Explain the value of general purpose statistical software
Navigate RStudio, R packages, and R commands
Use code prompting and LLM interactions for troubleshooting

Chapter content

This chapter introduces R as the software tool for this course. By the end of this chapter, you will be able to navigate R and perform basic data-related tasks. Subsequent chapters will introduce specific packages and tools.

Statistical programming languages make data analysis flexible, fast, and replicable. Flexible means that the same tools can be reused when the analysis task or data set changes, without excessive manual intervention such as copying and pasting. Fast means that analysis steps run quickly relative to manual or spreadsheet-based analysis. Replicable means that the script keeps a record of the analysis steps, so anyone with the script and the data can repeat them.

R Basics

R is a high-level, open-source programming language used for data analysis and statistical computing. R is free to download and use and has a large community of users and developers. It is widely used in academia, research, and industry. In this course we will use R because it has highly accessible tools for data analysis and modeling that apply to a wide range of problems.^[1]

Required software and installation

This chapter walks through using R in RStudio, an integrated development environment (IDE) that is commonly used to work in R. R can also be used in other environments, such as R’s own interface, Google Colab with an R runtime, or Jupyter notebooks. Most features of R and RStudio work the same way across operating systems, but this textbook refers only to the Windows versions.

This textbook assumes that you have installed R and RStudio. Download links for both are available on the Posit website: https://posit.co/download/rstudio-desktop/.

Navigating RStudio

The following video walks through the RStudio environment.

R functions and packages

R has a large number of functions and packages for data analysis and visualization. Some of the most commonly used packages are part of the Tidyverse, a set of packages designed to work together to make data manipulation and analysis easier and more efficient. The Tidyverse includes packages such as dplyr, lubridate, stringr, and ggplot2, which together cover most data manipulation and analysis tasks.

The dplyr package is a powerful tool for data manipulation and analysis. It provides a set of functions that can be used to filter, sort, summarize, and join data frames. The lubridate package is a tool for working with dates and times. It provides functions that can be used to extract, manipulate, and format dates and times. The stringr package is a tool for working with strings. It provides functions that can be used to manipulate and format strings. The ggplot2 package is a tool for data visualization. It provides functions that can be used to create a wide range of plots and charts.

To use functions from a package in R, you must first install the package using the `install.packages()` function and then load the package using the `library()` function.

Install and load tidyverse packages.

Run once on an R installation:

install.packages("tidyverse")

Run every time you start a new R session:

library(tidyverse)

The following video shows how to install and load packages.
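Once a package is loaded, its functions can be called directly. The sketch below illustrates a few of the dplyr verbs described above (filtering, sorting, and summarizing). The data frame `firm_sales` and its values are made up for illustration and are not part of the course data.

```r
library(dplyr)

# Hypothetical data, invented for this example
firm_sales <- data.frame(
  ticker = c("WMT", "WMT", "AMZN", "AMZN"),
  year   = c(2004, 2005, 2004, 2005),
  sales  = c(14.0, 14.5, 12.9, 13.7)
)

# Keep only the 2005 rows, then sort from highest to lowest sales
sales_2005 <- firm_sales %>%
  filter(year == 2005) %>%
  arrange(desc(sales))

# Average sales per company across all years
avg_sales <- firm_sales %>%
  group_by(ticker) %>%
  summarize(mean_sales = mean(sales))

print(sales_2005)
print(avg_sales)
```

The `%>%` pipe passes the result of one step to the next, so the code reads top to bottom in the order the steps are performed.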

R Data Frames

The R language has different ways to work with data that include base structures and functions as well as packages that augment these structures and functions. The primary structure we will work with in this class is called a data frame. A data frame might be thought of as a spreadsheet with numbered or named rows and columns. Each column can have a different type of data, such as numeric, character, or logical. Data frames are the primary structure used in the Tidyverse, a set of packages that we will use in this course.
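As a minimal sketch of the idea that each column can hold a different type of data, the example below builds a small data frame and reports the type of every column. The data frame `firms` and its values are hypothetical.

```r
# A small made-up data frame with one column of each basic type
firms <- data.frame(
  ticker   = c("WMT", "AMZN", "TGT"),  # character
  sales    = c(14.5, 13.7, 9.8),       # numeric
  retailer = c(TRUE, TRUE, TRUE),      # logical
  stringsAsFactors = FALSE             # keep text as character in older R versions
)

# Report the type of every column
col_types <- sapply(firms, class)
print(col_types)
```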

Getting started with Data Frames

This section describes getting started working with data frames in R.

Importing data

A data frame can be created in R by importing a file or by entering the data manually. A file can be imported using the File menu or by using scripted commands. The following video demonstrates importing a CSV file.

Import files. Note that the file path uses forward slashes (/).

# Import a CSV file with base R
df <- read.csv("path/file.csv")

# Import an Excel file with the readxl package
library(readxl)

df <- read_excel("path/file.xlsx")
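To try the import step without a file of your own, the sketch below first writes a tiny made-up data set to a temporary CSV file and then reads it back. The column names and the Windows paths in the comments are hypothetical.

```r
# Write a small made-up data set to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c(1, 2, 3), sales = c(14.5, 14.0, 13.7)),
  csv_path,
  row.names = FALSE
)

# Read the file back into a data frame
df <- read.csv(csv_path)
print(df)

# A path copied from Windows uses backslashes. In R, switch to forward
# slashes or double each backslash, because a single backslash inside
# a string starts an escape sequence:
#   "C:/data/file.csv"   or   "C:\\data\\file.csv"
```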

Examining the data frame

There are various visual and script methods for examining a data frame. These methods include viewing the data frame, the number of columns, the number of rows, and the column types. The following video demonstrates examining a data frame.

Rows

rownames(df)

Columns

colnames(df)

str(df)

df$columnname

Data frame

head(df)

View(df)

summary(df)
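The commands above do not report the row and column counts directly. A minimal sketch of how to get them is shown below, using `mtcars`, a small example data set that ships with R, as a stand-in for your own data frame.

```r
# Use a built-in example data set as a stand-in for imported data
df <- mtcars

n_rows <- nrow(df)   # number of rows
n_cols <- ncol(df)   # number of columns
print(dim(df))       # rows and columns together

# Type of each column
print(sapply(df, class))
```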

Creating a data frame

At times, it is necessary to create a data frame using manual inputs. This may be done with the data.frame() function. The script below presents an example.

data <- data.frame(
  companyTIC = c("WMT", "WMT", "AMZN"),
  year = c(2005, 2004, 2005),
  sales = c(14.5, 14.0, 13.7)
)

This script creates a data frame with three columns: `companyTIC`, `year`, and `sales`. The `companyTIC` column contains the company’s ticker symbol, the `year` column contains the year of the sales data, and the `sales` column contains the sales data.

AI Assistance

Recent rapid advancements in language prediction models have forever altered how humans learn and use coding and scripting languages. Large language models have been trained on massive amounts of data including online coding examples, blogs, and community groups. This has made these models highly effective at suggesting, debugging, and explaining code. Later chapters look specifically at using large language models for different data analysis purposes. This chapter introduces using AI assistance for two important tasks: creating code and explaining code.

Creating code

Large language models such as Copilot or ChatGPT can be effective for getting a working version of code for some tasks. Some models can even integrate with IDE software (e.g., GitHub Copilot for RStudio^[2] or Gemini with Google Colab). For example, prompting a model to write R code that sorts a data frame called df by a column called id will usually return code close to what you need. This can be especially useful if you are learning a coding language and do not know how to get started. However, there are important caveats to relying on a model to write code:

Using suggested code effectively requires understanding what the code does and what you expect to have happen when using the code.
With experience, asking for code suggestions for tasks you perform regularly can be slower than directly creating the code.

The first point is essential. Before running code you do not fully understand, you must know what you expect it to do to the data, and afterward you must check that this is what actually happened. Blaming faulty generated code for errors in a presentation or a decision is not an acceptable excuse in a classroom or professional setting. If you do the analysis, you need to know what you are doing.
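As a hedged illustration of checking suggested code, the sketch below sorts a data frame called `df` by a column called `id`, which is the kind of code a model might return for the prompt above, and then verifies the result. The data values are made up, and the two checks are only one possible way to confirm the outcome.

```r
# Made-up data with ids out of order
df <- data.frame(
  id    = c(3, 1, 2),
  value = c("c", "a", "b"),
  stringsAsFactors = FALSE
)

# Sort the rows by id (base R)
sorted_df <- df[order(df$id), ]

# Check that what you expected actually happened:
# no rows were lost or added, and id is now in ascending order
stopifnot(nrow(sorted_df) == nrow(df))
stopifnot(!is.unsorted(sorted_df$id))

print(sorted_df)
```

If either check fails, `stopifnot()` stops the script with an error instead of letting a faulty result pass silently into later steps.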

Explaining code

Perhaps more useful than generating code is the ability of language models to explain code and code errors. For example, prompting a language model to explain code word by word can be useful when you are first learning a coding language. Asking for an explanation of an error message can also be useful.
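When asking a model to explain an error, it helps to give it the exact error text along with the code that produced it. The sketch below is one possible way to capture an error message as text in R. The failing call is deliberately contrived for illustration.

```r
# Deliberately trigger an error: log() cannot take text as input
error_text <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)

# error_text now holds the message as a character string, which can be
# pasted into a prompt together with the line of code that caused it
print(error_text)
```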

The video below provides an example of prompting a language model for R code and asking for code explanations.

Review

Mini-case video

https://www.supermarketnews.com/grocery-marketing/using-data-to-help-retailers-get-closer-to-customers

References

Wickham, H., Çetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (2e). O’Reilly. https://r4ds.hadley.nz/.

[1] R is not the only software for statistical computing. Others include Python, SAS, MATLAB, and SPSS. Python is perhaps the most commonly used in industry, particularly for machine learning tasks. We will use R in this course because it has statistics tools that are particularly useful for data analysis and visualization and, for some tasks, is more user-friendly.
[2] To use GitHub Copilot in RStudio, you must first enable it in the RStudio options (Tools > Global Options > Copilot) and sign in with a GitHub account. Once it is enabled, GitHub Copilot can suggest code as you type in RStudio. A GitHub Copilot account is free for students. Instructions for setting up GitHub Copilot with RStudio are available at https://docs.posit.co/ide/user/ide/guide/tools/copilot.html. Instructions for setting up a student account for GitHub and GitHub Copilot are available at https://techcommunity.microsoft.com/t5/educator-developer-blog/step-by-step-setting-up-github-student-and-github-copilot-as-an/ba-p/3736279. Further information about GitHub Copilot is available at https://docs.github.com/en/copilot/about-github-copilot.