2  Getting Ready to Use R via RStudio

This book can be read and understood without using R at all. However, many of the calculations discussed would be quite difficult without the statistical functions of a language like R.

2.1 Get Ready for a Challenge

Figure 2.1: “Blooming, buzzing confusion?” I see what you did there. —WJ

For R to pay off, you have to buy in. Learning challenging material does not have to be “fun” at every moment to be worthwhile. The beginning data analyst, assailed by statistical concepts, coding conventions, and baffling error messages, feels it all as one great blooming, buzzing confusion.

The challenge of learning R is indeed going to be challenging at times, but just there will also be many little a-ha moments, several larger check-out-what-I-can-do celebrations, and a few peak experiences of eudaemonic reverie.

2.2 What is R?

A headshot of statistician John Chambers
Figure 2.2: John Chambers
A headshot of statistician Robert Gentleman
Figure 2.3: Robert Gentleman
A headshot of statistician Ross Ihaka
Figure 2.4: Ross Ihaka

R is the eighteenth letter of the alphabet, of course, but for data analysts, it has several specialized meanings.

  1. R is a programming language built for statistics. In 1976, John Chambers and colleagues at Bell Labs developed S, a programming language specifically designed to facilitate statistical analyses and data visualization. In 1995, Robert Gentleman and Ross Ihaka released an open-source variant of S and named it “R” after their shared first initial. Most S code looks identical to R code, but R has a number of subtle enhancements such as better memory management. You can read more about the origins of R in Ihaka (1998).
  2. R is a computer program that runs code in the programming language R. The primary way that the R language is used is to perform analyses in the R software environment for statistical computing and graphics. This program is free and open source, meaning that not only can anyone download it for free, but its source code can be reused by anyone for any purpose.
  3. R is a data science ecosystem in which the R community flourishes. Strictly speaking, R is just one program. However, people use R in startlingly creative ways, often in combination with other programs and in coordination with other people. I like to think of R as the entire ecosystem of software, services, standards, and cultural practices shared by the entire community of R users.
Ihaka, R. (1998). R: Past and future history. Interface 98. https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf

The technical merits of the R language and software do matter, but the success of R has more to do with the inclusive, welcoming, diverse, and expanding culture of the R community. When an individual R user has a great idea, the R community has as integrated set of technical standards, web services, and cultural practices such that innovations spread quickly to everyone in the community—often with multiple tutorials aimed at users of all levels of expertise. With the support of the R community, ordinary people can leverage R to accomplish far more than anyone would have thought possible not so very long ago.

2.3 What can you do with R?

What can’t be done with R is an ever-contracting list. If you let it, R will help you in ways you did not know you needed it. You can (and should) use R for tasks that extend well beyond analyzing data. Because R plays well with many other programs, you can leverage its capacities to do all sorts of things:

  • Manage, analyze, and visualize data
  • Control and run other programs
  • Write and publish books, papers, and posters
  • Automate reports and customized letters
  • Create slide presentations and interactive tutorials
  • Make websites and blogs
  • Design web applications and dashboards
  • Make music and digital aRt

For example, this book was “created in R.” That is, I wrote the text in a simple format and then let R (and RStudio) handle all the complex and tedious coordination of many other programs to make this webpage. It does most of its work under the hood without my awareness.

2.4 What is RStudio?

Most users never actually open the R program directly. Why not? R’s text editor works well for what it does, but it has limited functionality compared to other ways of interacting with R. Instead, we will run R code in an integrated development environment (IDE) called RStudio. An IDE is a program that makes programming easier. RStudio has many features that automates tasks what would otherwise be confusing, difficult, or tedious.

2.5 Code > Clicks

Artwork by @allison_horst
Figure 2.5: Frustrations with R are normal. Working through them pays off.

Though R will be frustrating at times, stick with it—the dividends on your invested efforts are huge. Nothing beats the flexibility, freedom, and power of using code. Not only can I do things much faster than I could with point-and-click programs, I can do things I never would have considered doing. If there are no functions that do exactly what you want, you can adapt or make one that fits your needs perfectly. If you need to to re-run your analysis with a point-and-click menu interface, you need a perfect memory of what you did before. With code, every step is perfectly preserved.

That said, if you need to complete a simple data-analysis task quickly, and you have no time to learn R, I recommend using Jamovi. Jamovi has a beautiful point-and-click interface that you can learn to use in just a few minutes. Jamovi uses R under the hood, and like R, it is free.

Jamovi: A point-and-click alternative to R and RStudio

Jamovi: A point-and-click alternative to R and RStudio

The ease of use of a point-and-click interface has a trade-off. If you never learn to code, you will be stuck with a limited set of skills, and many tasks will be prohibitively complicated or dauntingly tedious. In Figure 2.6 I recreate a plot made by many others before me.

Figure 2.6: Different Learning Curves for Different Tasks
tibble(
  Complexity = seq(0, 100, 0.01),
  Code = 10.5 * Complexity ^ (1 / 3.1),
  `Point-and-Click` = 0.0002 * Complexity ^ 3.1) %>%
  gather(Approach, Difficulty,-Complexity) %>%
  mutate(
    rotation =  180 * atan((Difficulty - lag(Difficulty)) / 0.01) / pi) %>%
  filter(Difficulty <= 100) %>%
  ggplot(aes(Complexity, Difficulty, color = Approach)) +
  geom_arrow(arrow_head = my_arrowhead) +
  ggtext::geom_richtext(
    data = . %>% filter((Complexity == 62.5 & Approach == "Point-and-Click") | (Complexity == 75 & Approach == "Code")),
    aes(label = Approach, angle = rotation),
    vjust = -0.2,
    hjust = 0.5,
    size = 6,
    fill = "white",
    label.color = NA,
    label.padding = grid::unit(rep(0, 4), "pt")
  ) +
  annotate(geom = "richtext", label = "Point-and-click<br>programs are great<br>for simple tasks but<br>can be tedious for<br>complex projects", x = 83, y = 87.5, hjust = 0.5, vjust = .5, color = myfills[2], label.color = NA) +
  annotate(geom = "richtext", label = "...but it makes complex<br>tasks much easier.", x = 75, y = (37.5 + 25) / 2, hjust = 0.5, vjust = .5, color = myfills[1], label.color = NA) +
  annotate(geom = "richtext", label = "Coding has a steep<br>learning curve...", x = 12.5, y = 37.5, hjust = 0.5, vjust = .5, color = myfills[1], label.color = NA) +
  coord_fixed() +
  scale_x_continuous("Task Complexity", limits = c(0, 100), expand = expansion(.02)) +
  scale_y_continuous("Task Difficulty", limits = c(0, 100), expand = expansion(.02)) +
  scale_color_manual(values = myfills) +
  theme(legend.position = "none")

2.6 Installing R

Installation depends on your operating system. If you have already installed R, make sure you have the latest version.

Go to the CRAN (Comprehensive R Archive Network) site. Click the link that is appropriate for your operating system (Windows, Mac, or Linux) and follow the instructions.

2.6.1 Windows

Click here and download the latest version of R for Windows. Open the file and follow installation instructions.

2.6.2 Mac

Click here and download the latest version of R for Mac. Double-click the file and follow installation instructions.]

2.6.3 Linux

Install R on Ubuntu with this Bash script:]{.instruction}

More detailed instructions here:

2.7 Installing RStudio

2.7.1 RStudio Desktop

RStudio is free. Please do not pay for anything. The paid versions of RStudio have nothing you will need for this course. They are not better than the free version. They simply have features tailored to businesses and developers.

Install the free version of RStudio Desktop here.

2.7.2 RStudio Cloud

One alternative to installing R and RStudio on your machine is to use RStudio in a web browser in Posit Cloud. After signing up for a free account, you can use RStudio online. As the internet speeds up, I imagine that this option will become increasingly attractive.

2.9 Using RStudio

Open RStudio the same way you would any other program in your operating system.

2.10 A Quick Tour of RStudio’s 4 Panes

RStudio may appear overwhelming at first. That feeling will fade quickly with use. By default, RStudio has 4 panes where you can interact with R in different ways.

Quick Tour of RStudio

2.10.1 The Console

The console (lower left) is where you submit quick temporary calculations and run code you have no intention of saving. Hit Enter to submit code to R. R is interactive in the sense that it will display the result of the code in the console where you typed.

2.10.2 Scripts

The script pane (upper left) is for writing code you want to preserve in a script file (.R) or RMarkdown file (.Rmd). More on RMarkdown later. Save your script files frequently.

You can submit code to the console by hitting Ctrl+Enter (or ⌘+Enter on Macs). You can also submit code by hitting the Run button. If you have code selected, only the selected part will be submitted. If you have no code selected, the current line (wherever the cursor is) will be submitted.

2.10.3 Environment Variables

In the Environment tab in the upper right pane, you can see which variables have been created and a preview of what they contain.

2.10.4 Viewer

The lower right pane has several tabs:

  • Files: Interact with project files
  • Plots: Preview plots
  • Packages: Install and update packages
  • Help: Search for help
  • Viewer: View documents created by RStudio

2.11 Installing and Updating Packages

What comes installed with R is called “Base R,” which by itself is quite powerful. However, what makes R especially great is the fact that tens of thousands of free “packages” are available in a central repository called CRAN. These packages extend what R can do far beyond what any one person could have imagined.

In the bottom right pane, click the Packages tab.

R Packages

Any new package you need but do not have, you can click the Install button and type the name of a new package. For example, click Install button, type tidyverse in the Packages textbox, and click Install.

You might have to wait a while, but now many new packages have been installed on your machine.

You do not need to install packages again. However, I update all my packages each time I use RStudio to make sure I have the best and most up-to-date versions of the packages available.

2.12 Create a New Project

When you take on a new data-analysis task, you will want to keep your files organized in a particular folder. You want to make sure that this folder is the first place that R will look when you specify a file name. Otherwise we will have to write out the entire file name, which will be likely different if we need to run the same analysis on a different computer.

RStudio’s “projects” are useful for making sure that all your analyses are in the same folder and that your analyses will work no matter which machine runs it.

Without projects, we would need to write out the entire file name like so:

read_csv("C:/Users/my_user_name/Research/My_Project/my_data.csv")

With projects, we can import data with just the bare file name, like so:

read_csv("my_data.csv")

Much better!

Okay, let’s get to it.

Steps to create a project
  1. In your operating system, create a folder on your computer with a name that describes what the project is about.
  2. Open RStudio.
  3. Click File->New Project
  4. Select Existing Directory
  5. Click Browse
  6. Navigate to the new folder.
  7. Click Open
  8. Click Create Project

2.13 Be Kind to Future-You

Hadley Wickham reminds us that the most important person to write code for is future-you.

At all times, work with the assumption that 10 minutes from now, you will suffer catastrophic datamnesia—memory failure for everything related to your analysis.

Present-you thinks future-you has god-like memory capacity. Future-you has very human and very fallible memory. Predicting is hard, especially about the future and what future-you will remember.

Future-you spends an inordinate amount of time recreating what present-you thought future-you would understand and remember—so much time that future-you neglects to take care of future–future-you. Future-you won’t take care of anything unless present-you learns to take care of future-you.

2.14 Folder Organization

  • Bad: All files on your desktop
  • Also Bad: All files in the default folder (e.g., My Documents)
  • Worse: Important files scattered in various folders with no organization.
  • Even worse: Important files are in your Downloads folder where they are likely to be overwritten or deleted.
  • Good: Use a file synchronization service like Dropbox, Box, OneDrive, GoogleDrive, etc.
  • Also Good: Within your file synchronization service, have a well-organized hierarchical structure to your folders.
  • Even Better: Use version control software like Git, and save the code in an online repository like GitHub.

2.15 File Name Guidance

Giving convenient, easily understood names to files, whether data files, image files, or script files can have far-reaching consequences for both present-you and future-you. I draw heavily from the advice on naming things from Jenny Bryan.

2.15.1 Give your files long, descriptive names.

Make it clear to future-you exactly what is in the file.

  • Bad: data.xlsx
  • Better: dissertation_data.csv
  • Even Better: student_questionnaire_time_1.csv

2.15.2 Separate words with underscores, not spaces.

Replace spaces with underscores (_). Spaces often work fine, but sometimes they do not, which can result in hours and hours of debugging. Play it safe and don’t use spaces in file names.

student scores.xlsxstudent_scores.xlsx

2.15.3 No special characters in file names:

Avoid including in file names characters that have special meanings in many programming languages such as *@^$! and many others. Otherwise unexpected results can make your life complicated.

parent@emotion*survey.csvparent_emotion_survey.csv

2.15.4 Separate dates with hyphens using ISO format

from XKCD
Figure 2.7: Date formats are fun!

ISO format for dates is YYYY-MM-DD, meaning that a four-digit year comes first, followed by a two-digit month, followed by a two-digit day. This format makes sorting order much easier than the formats used in the U.S.

Here is a data file name that begins with a date, followed by the school district from from the data were collected:

2020-01-12_district_A.csv

One of the benefits of using dates in the ISO format is that they sort chronologically. Which file names would you prefer to deal with?

Sorted Traditional Dates Sorted ISO Dates
April 11, 2007_District_F.csv 2007-04-11_District_F.csv
April 18, 2011_District_N.csv 2008-01-21_District_T.csv
August 15, 2009_District_K.csv 2008-03-08_District_G.csv
January 21, 2008_District_T.csv 2009-08-15_District_K.csv
March 08, 2008_District_G.csv 2010-09-18_District_G.csv
March 28, 2013_District_R.csv 2011-04-18_District_N.csv
November 24, 2014_District_Q.csv 2013-03-28_District_R.csv
September 18, 2010_District_G.csv 2014-11-24_District_Q.csv