Foundational Skills

Getting Started

First, you will need to download the latest versions of R and R Studio. R is a free environment for statistical computing and graphics using the programming language R. R Studio is a set of integrated tools that allows for a more user-friendly experience for using R.

Although you will likely use R Studio as your main console and editor, you must first install R, because R Studio uses R behind the scenes. Both are freely available, cross-platform, and open-source.

Downloading R and R Studio

To download R:

Visit this page to download R: https://cran.r-project.org/
Find your operating system (Mac, Windows, or Linux)
Download the ‘latest release’ on the page for your operating system and install the application

Don’t worry; you will not mess anything up if you download (or even install!) the wrong file. Once you’ve installed both, you can get started.

To download R Studio:

Visit this page to download R Studio: https://www.rstudio.com/products/rstudio/download/
Find your operating system (Mac, Windows, or Linux)
Download the ‘latest release’ on the page for your operating system and install the application

If you do have issues, consider this page, and then reach out for help. The R Studio Community is a great place to start.

For more information on installing R and R Studio, check out DataCamp’s Installing R.

Check that it worked

Open R Studio. Find the console window and type in 2 + 2. If the answer you expect is returned (hint: it’s 4!), then R Studio and R both work.

Help, I’m completely new to using R / R Studio!

If you’re completely new, Swirl is a great place to start, as it helps you to learn R from within R Studio. Visit this page to see some directions: http://swirlstats.com.

If you have a bit more confidence but still feel like you need some time to get started, Data Camp is another good place to start.

And if you’re ready to go, please proceed to the next sections on processing and preparing, plotting, loading, and modeling data and sharing results.

Creating Projects

Before proceeding, we’re going to take a few steps to set ourselves up to make the analysis easier, namely through the use of Projects, an R Studio-specific organizational tool.

To create a project, in R Studio, navigate to “File” and then “New Project”, and choose “New Directory”.

Then, click “New Project”. Choose a directory name for the project that helps you to remember that this is a project that involves data science in education; it can be convenient if the name is typed in lower-case-letters-separated-by-dashes, like that. You can also choose the directory in which the project folder is created. If you are just using this to learn and to test out creating a project, you may consider placing it in your downloads or another temporary directory so that you remember to remove it later.

Even if you do not create a Project, you can always check where your working directory is (i.e., where R is pointing) by running getwd(). To change it manually, run setwd("desired/file/path/here"), with the path in quotation marks.
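As a minimal sketch of checking and changing the working directory, the code below moves into a temporary folder and back again. The temporary folder from tempdir() is only a stand-in for a real path on your computer, and original_dir is a name chosen for this illustration.

```r
# Remember where we started so we can return afterwards.
original_dir <- getwd()
print(original_dir)

# Move to a temporary folder (a stand-in for "desired/file/path/here").
setwd(tempdir())
print(getwd())

# Go back to the directory we started in.
setwd(original_dir)
```

If you work inside an R Studio Project, you should rarely need setwd(), because opening the Project sets the working directory for you.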

Packages

“Packages” are shareable collections of R code that provide functions (i.e., commands to perform specific tasks), data, and documentation. Packages increase the functionality of R by improving and expanding on base R (basic R functions).

Installing and Loading Packages

To download a package, you must call install.packages():

install.packages("dplyr", repos = "http://cran.us.r-project.org")

You can also navigate to the Packages pane and click “Install”, which works the same as the line of code above. In other words, you can install a package using either code or the R Studio interface. Usually, writing code is a bit quicker, but using the interface can be very useful and complementary to the use of code.

After the package is installed, it must be loaded into your R Studio session using library():

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We only have to install a package once, but to use it, we have to load it each time we start a new R session.
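One way to express "install once, load every session" in code is sketched below. It assumes you are connected to the internet the first time it runs; requireNamespace() is a base R function that checks whether a package is already installed.

```r
# Install dplyr only if it is not already installed on this computer.
if (!requireNamespace("dplyr", quietly = TRUE)) {
    install.packages("dplyr", repos = "http://cran.us.r-project.org")
}

# Load dplyr; this part is needed in every new R session.
library(dplyr)
```

Putting lines like these at the top of a script means it can be re-run on a new computer without editing.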

a package is like a book, a library is like a library; you use library() to check a package out of the library - Hadley Wickham, Chief Scientist, R Studio

Running Functions from Packages

Once you have loaded the package in your session, you can run the functions that are contained within that package. To find a list of all those functions, you can run this in the R Studio console:

help(package = dplyr)

The documentation should tell you what each function does, what arguments (i.e., details) it needs to run successfully, some examples, and what the output should look like.

If you know the specific function that you want to look up, you can run this in the R Studio console:

?dplyr::filter

Once you know what you want to do with the function, you can run it in your code:

dat <- # example data frame
    data.frame(stringsAsFactors=FALSE,
               letter = c("A", "A", "A", "B", "B"),
               number = c(1L, 2L, 3L, 4L, 5L))

dat

##   letter number
## 1      A      1
## 2      A      2
## 3      A      3
## 4      B      4
## 5      B      5

filter(dat, letter == "A") # using dplyr::filter

##   letter number
## 1      A      1
## 2      A      2
## 3      A      3

For more information on R packages, check out DataCamp’s Beginner’s Guide to R Packages article.

Welcome to the Tidyverse

The Tidyverse is a set of packages for data manipulation, exploration, and visualization using the design philosophy of ‘tidy’ data. Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

The packages contained in the Tidyverse provide useful functions that augment base R functionality.

You can install and load the complete Tidyverse with:

install.packages("tidyverse")

library(tidyverse)
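To make the idea of tidy data more concrete, here is a small sketch with made-up scores (the data and the names scores_wide and scores_tidy are invented for illustration). It uses pivot_longer() from the tidyr package, which is part of the Tidyverse, and assumes a reasonably recent version of tidyr.

```r
library(tidyr)

# Not tidy: the variable "term" is spread across two column names.
scores_wide <-
    data.frame(student = c("s1", "s2"),
               fall = c(70, 80),
               spring = c(75, 85))

# Tidy: each variable (student, term, score) is a column,
# and each observation (one student in one term) is a row.
scores_tidy <-
    pivot_longer(scores_wide,
                 cols = c(fall, spring),
                 names_to = "term",
                 values_to = "score")

scores_tidy
```

The tidy version has more rows and fewer columns, which is typical when column names were storing values of a variable.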

For more information on tidy data, check out Hadley Wickham’s Tidy Data paper. For more information on the Tidyverse, check out DataCamp’s Getting Started with the Tidyverse tutorial.

Loading Data from Various Sources

In this section, we’ll load data.

You might be thinking that an Excel file is the first that we would load, but there is a format that you can open and edit in Excel and that is even easier to move between Excel, R, SPSS, other statistical software like Mplus, and even other programming languages like Python. That format is CSV, a comma-separated values file.

The CSV file is useful because you can open it with Excel and save Excel files as CSV files. Additionally, and as its name indicates, a CSV file is rows of a spreadsheet with the columns separated by commas, so you can view it in a text editor, like TextEdit for Macintosh, as well. Not surprisingly, Google Sheets easily converts CSV files into a Sheet, and also easily saves Sheets as CSV files.
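To see that a CSV file really is just text with commas, the sketch below writes a tiny made-up table to a temporary file and prints the raw lines. It uses base R's write.csv() and readLines() rather than the Tidyverse functions introduced later, and small_table is invented for illustration.

```r
# A tiny made-up table.
small_table <-
    data.frame(student_id = c(1, 2, 3),
               score = c(88, 92, 79))

# Write it to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(small_table, csv_path, row.names = FALSE)

# Read the file back as plain text, one string per line.
csv_lines <- readLines(csv_path)
print(csv_lines)
```

The first line holds the column names and each later line holds one row, with commas between the columns.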

For these reasons, we start with reading CSV files.

Saving a File from the Web

You’ll need to copy this URL:

https://goo.gl/bUeMhV

Here’s what it resolves to (it’s a CSV file):

https://raw.githubusercontent.com/data-edu/data-science-in-education/master/data/pisaUSA15/stu-quest.csv

This next chunk of code downloads the file to your working directory. Run it so that in the next step you can read the file into R. As a note: there are ways to read the file directly (from the web) into R. Also, of course, you could do manually what the next lines of code do: feel free to open the file in your browser and save it to your computer (you should be able to ‘right’ or ‘control’ click the page to save it as a text file with a CSV extension).

student_responses_url <-
    "https://goo.gl/bUeMhV"

student_responses_file_name <-
    paste0(getwd(), "/data/student-responses-data.csv")

download.file(
    url = student_responses_url,
    destfile = student_responses_file_name)

It may take a few seconds to download as it’s around 20 MB.

The process above involves many core data science ideas and ideas from programming/coding. We will walk through them step-by-step.

The character string "https://goo.gl/bUeMhV" is being saved to an object called student_responses_url.

student_responses_url <-
    "https://goo.gl/bUeMhV"

We concatenate your working directory file path to the desired file name for the CSV using a function called paste0(). The result is stored in another object called student_responses_file_name. This creates a file path that points to a file in the data folder inside your working directory, which is where the file will be saved.

student_responses_file_name <-
    paste0(getwd(), "/data/student-responses-data.csv")

The student_responses_url object is passed to the url argument of the function called download.file() along with student_responses_file_name, which is passed to the destfile argument.

In short, the download.file() function needs to know:

where the file is coming from (which you tell it through the url argument)
where the file will be saved (which you tell it through the destfile argument)

download.file(
    url = student_responses_url,
    destfile = student_responses_file_name)

Understanding how R is working in these terms can be helpful for troubleshooting and reaching out for help. It also helps you to use functions that you have never used before because you are familiar with how some functions work.

Now, in R Studio, you should see the downloaded file in the Files tab, inside the data folder of your project if you created a project with R Studio; if not, look in the data folder of whatever your working directory is set to. If the file is there, great. If things are not working, consider downloading the file manually and then moving it into the data folder of the R Project you created.
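One possible reason for the download not working is that the data folder does not exist yet, since download.file() does not create folders for you. The sketch below is one way to handle that; ensure_data_dir() is a hypothetical helper written for this illustration, not a function from R or any package.

```r
# Hypothetical helper: make sure a "data" folder exists inside base_dir
# and return its path.
ensure_data_dir <- function(base_dir = getwd()) {
    data_dir <- file.path(base_dir, "data")
    if (!dir.exists(data_dir)) {
        dir.create(data_dir, recursive = TRUE)
    }
    data_dir
}

# Create the folder (if needed) in the working directory.
data_dir <- ensure_data_dir()

# Build the destination path with file.path() instead of paste0().
student_responses_file_name <-
    file.path(data_dir, "student-responses-data.csv")

# TRUE once the download has finished, FALSE if it has not been run yet.
file.exists(student_responses_file_name)
```

After running this, the download.file() call above can be re-run with the same destfile.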

Loading a CSV File

Okay, we’re ready to go. The easiest way to read a CSV file is with the function read_csv() from the package readr, which is contained within the Tidyverse.

Let’s load the tidyverse library:

library(tidyverse) # so tidyverse packages can be used for analysis

You may have noticed the hash symbol after the code that says library(tidyverse). It reads # so tidyverse packages can be used for analysis. That is a comment: the text after the hash symbol is not run (the code before it runs just like normal). Comments are useful for explaining why a line of code does what it does.

After loading the tidyverse packages, we can now load a file. We are going to call the data student_responses:

student_responses <-
    read_csv("./data/student-responses-data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   CNTRYID = col_integer(),
##   CNT = col_character(),
##   CNTSCHID = col_integer(),
##   CYC = col_character(),
##   NatCen = col_character(),
##   Region = col_integer(),
##   STRATUM = col_character(),
##   SUBNATIO = col_integer(),
##   OECD = col_integer(),
##   ADMINMODE = col_integer(),
##   Option_CPS = col_integer(),
##   Option_FL = col_integer(),
##   Option_ICTQ = col_integer(),
##   Option_ECQ = col_integer(),
##   Option_PQ = col_integer(),
##   Option_TQ = col_integer(),
##   Option_UH = col_integer(),
##   Option_Read = col_character(),
##   Option_Math = col_character(),
##   LANGTEST_QQQ = col_integer()
##   # ... with 460 more columns
## )

## See spec(...) for full column specifications.

Now that we have loaded the data, we want to look at it. We can pass its name to the function glimpse() to print some information on the dataset (this code is not run here).

glimpse(student_responses)

Whoa, that’s a big data frame (with a lot of variables with confusing names, to boot)!

Great job loading a file and printing it! We are now well on our way to carrying out analysis of our data.

Loading Excel files

We will now do the same with an Excel file. You might be thinking that you can open the file in Excel and then save it as a CSV. This is generally a good idea. At the same time, sometimes you may need to directly read a file from Excel.

The package for loading Excel files, readxl, is not one of the core tidyverse packages that library(tidyverse) loads, so we have to load it separately using library(readxl). If it is not already installed, we also have to install it first (remember, we only need to do this once). Note that the command to install readxl is commented out below: the # symbol before install.packages("readxl") indicates that this line should be treated as a comment and not actually run, unlike the lines of code that are not commented out. It is here just as a reminder that the package needs to be installed if it is not already.

Once we have installed readxl, we have to load it (just like tidyverse):

# install.packages("readxl")

library(readxl)

We can then use the function read_excel() in the same way as read_csv(), where “path/to/file.xlsx” is where an Excel file you want to load is located (note that this code is not run here):

my_data <-
    read_excel("path/to/file.xlsx")

Of course, you can replace my_data with a name you like. Generally, it’s best to use short and easy-to-type names for data, as you will be typing and using them a lot.

Note that one easy way to find the path to a file is to use the “Import Dataset” menu. It is in the Environment window of R Studio. Click on that menu, select the option corresponding to the type of file you are trying to load (e.g., “From Excel”), and then click the “Browse” button beside the File/URL field. Once you select the file, R Studio will automatically generate the file path, and the code to read the file, too, for you. You can copy this code or click Import to load the data.

Loading SAV files

The same factors that apply to reading Excel files apply to reading SAV files (from SPSS). First, install the package haven, load it, and then use the function read_sav():

install.packages("haven")

library(haven)
my_data <-
    read_sav("path/to/file.sav")

Google Sheets

Finally, it can sometimes be useful to load a file directly from Google Sheets, and this can be done using the googlesheets package.

install.packages("googlesheets")

library(googlesheets)

When you run the command below, a link to authenticate with your Google account will open in your browser.

my_sheets <- gs_ls()

You can then simply use the gs_title() function in conjunction with the gs_read() function:

my_sheet <- gs_title('title') # find the sheet by its title
my_data <- gs_read(my_sheet) # read the sheet's contents into a data frame

Saving Files

Using our data frame student_responses, we can save it as a CSV (for example) with the following function. The first argument, student_responses, is the name of the object that you want to save. The second argument, "student-responses.csv", is what you want to call the saved dataset.

write_csv(student_responses, "student-responses.csv")

That will save a CSV file named student-responses.csv in the working directory. If you want to save it to another directory, simply add the file path to the file name, i.e. path/to/student-responses.csv. To save a file for SPSS, load the haven package and use write_sav(). The readxl package does not include a function to save an Excel file, but you can save a CSV and open it directly in Excel.
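As a small sketch of saving to another directory and checking the result, the code below writes a made-up data frame to a folder and reads it back. The data, the "processed" folder name, and the use of tempdir() are all assumptions for illustration; with your own data you would use student_responses and a folder in your project.

```r
library(readr)

# A tiny stand-in for student_responses.
example_responses <-
    data.frame(student_id = c(1, 2, 3),
               SCIEEFF = c(0.5, -0.2, 1.1))

# Build a path to a file inside another directory.
output_dir <- file.path(tempdir(), "processed")
dir.create(output_dir, showWarnings = FALSE)
output_path <- file.path(output_dir, "student-responses.csv")

# Save the data, then read it back to confirm it was written.
write_csv(example_responses, output_path)
reloaded <- read_csv(output_path)
```

Reading the file back in is an easy way to confirm that it was saved where you expected.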

Conclusion

We will detail the functions used to read every file in a folder (or, to write files to a folder).
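Until then, here is a minimal sketch of one way to read every CSV file in a folder using only base R. The folder, the two files, and their contents are created here purely for illustration, and this is one of several possible approaches rather than the one this section will necessarily use.

```r
# Create a folder with two small CSV files to read.
folder <- file.path(tempdir(), "csv-folder")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(10, 20)),
          file.path(folder, "class-a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(30, 40)),
          file.path(folder, "class-b.csv"), row.names = FALSE)

# List the full path of every file ending in .csv.
csv_paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read each file into its own data frame, stored together in a list.
all_tables <- lapply(csv_paths, read.csv)
names(all_tables) <- basename(csv_paths)

# Stack the data frames into one (they must share column names).
combined <- do.call(rbind, all_tables)
```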

For more information on loading data into R, check out DataCamp’s R Data Import Tutorial.

Processing Data

Now that we have loaded student_responses into an object, we can process it. This section highlights some common data processing functions.

We’re also going to introduce a powerful, unusual operator in R, the pipe. The pipe is this symbol: %>%. It lets you compose functions by passing the output of one function to the next as its first argument. A handy R Studio shortcut for typing %>% is Command + Shift + M on a Mac (Ctrl + Shift + M on Windows and Linux).
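To see what the pipe does, it can help to compare it with the same calculation written without it. This small sketch uses plain numbers instead of our data; the names nested and piped are chosen for illustration.

```r
library(dplyr) # loading dplyr makes the pipe available

# Without the pipe: read from the inside out.
nested <- sum(sqrt(c(1, 4, 9)))

# With the pipe: read from left to right, one step at a time.
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Both versions give the same answer.
identical(nested, piped)
```

The piped version puts the steps in the order they happen, which becomes more helpful as the number of steps grows.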

Here’s an example. Let’s say that we want to select a few variables from the student_responses dataset and save those variables into a new object, student_mot_vars. Here’s how we would do that using dplyr::select().

student_mot_vars <- # save object student_mot_vars by...
    student_responses %>% # using dataframe student_responses
    select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables

Note that we saved the output from the select() function to student_mot_vars but we could also save it back to student_responses, which would simply overwrite the original data frame (the following code is not run here):

student_responses <- # save object student_responses by...
    student_responses %>% # using dataframe student_responses
    select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables

We can also rename the variables at the same time we select them. I put these on separate lines so I could add the comment, but you could do this all in the same line, too. It does not make a difference in terms of how select() will work.

student_mot_vars <- # save object student_mot_vars by...
    student_responses %>% # using dataframe student_responses
    select(student_efficacy = SCIEEFF, # selecting variable SCIEEFF and renaming to student_efficacy
           student_joy = JOYSCIE, # selecting variable JOYSCIE and renaming to student_joy
           student_broad_interest = INTBRSCI, # selecting variable INTBRSCI and renaming to student_broad_interest
           student_epistemic_beliefs = EPIST, # selecting variable EPIST and renaming to student_epistemic_beliefs
           student_instrumental_motivation = INSTSCIE # selecting variable INSTSCIE and renaming to student_instrumental_motivation
    )
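If you have not downloaded the full dataset, the same select-and-rename pattern can be tried on a tiny made-up data frame. The values and the OTHER column below are invented for illustration; only the column names SCIEEFF and JOYSCIE come from the real data.

```r
library(dplyr)

# A tiny stand-in for student_responses with one extra column.
toy_responses <-
    data.frame(SCIEEFF = c(0.1, 0.4),
               JOYSCIE = c(1.2, -0.3),
               OTHER = c(5, 6))

# Keep two variables and rename them at the same time.
toy_mot_vars <-
    toy_responses %>%
    select(student_efficacy = SCIEEFF,
           student_joy = JOYSCIE)

names(toy_mot_vars)
```

Columns that are not named in select(), such as OTHER here, are dropped from the result.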

[will add more on creating new variables, filtering grouping and summarizing, and joining data sets]

Visualizing data

[not yet added - will add scatter plots, bar plots, and time series plots]

Modeling data

[not yet added - will add about regression/ANOVA]

Other foundational notes

Configuring R Studio

There are a number of changes you can (but do not need to) make to configure R Studio. If you navigate to the Preferences menu in R Studio, you’ll see a number of options you can change, from the appearance of the application to which windows appear where.

One important consideration is whether to save your workspace when you close R Studio. By default, R Studio saves all of the objects in your environment. This means that any data that you have loaded, or new data or objects that you have created, such as by merging two data sets together or creating a plot, will, by default, still exist when you open R Studio next. In general, this is not ideal, because it means that you may have taken steps interactively that are not documented in your code. This means that when you share your code, or re-run it from the start, it may not work. An easy way to change this is to tell R Studio to start from scratch (in terms of your workspace) each time you open it. You can do that by changing the “Save workspace to .RData on exit” dropdown menu in R Studio’s General preferences to “Never”.

While this may seem like a dramatic step - never saving your workspace - it is the foundation for doing reproducible work and research using R Studio (and R). It also represents one of the biggest shifts from using software like Excel or SPSS, where most steps are not documented in code. This involves a shift from thinking that your most permanent and important part of an analysis is your data to thinking of the most important part as being the code: with the code, you can keep your data in its original form, process it, and then save a processed file, through running code. This also means that when you have to make a change to this code, you can re-run the entire analysis easily.
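As a sketch of what this code-first workflow can look like, the short script below reads a raw file, processes it, and saves the result as a separate file, leaving the raw data untouched. The file names, the made-up scores, and the use of tempdir() are assumptions for illustration.

```r
# Set up a small raw data file to work from.
raw_path <- file.path(tempdir(), "raw-scores.csv")
write.csv(data.frame(student = c("a", "b", "c"),
                     score = c(55, NA, 90)),
          raw_path, row.names = FALSE)

# Step 1: read the raw data (the raw file itself is never edited).
raw_scores <- read.csv(raw_path)

# Step 2: process it in code, here by dropping rows with missing scores.
processed_scores <- raw_scores[!is.na(raw_scores$score), ]

# Step 3: save the processed data to a new file.
processed_path <- file.path(tempdir(), "processed-scores.csv")
write.csv(processed_scores, processed_path, row.names = FALSE)
```

Because every step is written down, re-running the script from a fresh session reproduces the processed file exactly.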

R Studio Projects

R Studio projects are a good way to organize related work. In R Studio, navigate to “New Project” from the file menu [add].

Getting data in and out

clipr is a package to easily copy data into and out of R using the clipboard. [add]

datapasta is another option. [add]