CS&SS/SOC/STAT 221, University of Washington, Winter 2015

This lab introduces you to R and RStudio, which you will be using throughout this course to apply and learn the statistical concepts discussed in class. Additionally, you will learn R commands to implement some of the concepts covered in Chapters 4 and 5 of The Basic Practice of Statistics: scatterplots, correlation and linear regression.

RStudio

R is the name of the programming language, and RStudio is a convenient and widely used interface to that language. You should familiarize yourself with the RStudio GUI.

RStudio GUI

The RStudio GUI consists of four windows:

  1. Bottom left: The console window. You type commands at the > prompt and R executes them.
  2. Top left: The editor window. Here you can edit and save R scripts which contain multiple R commands.
    • You can open a new R script using File -> New File -> R Script.
    • If you highlight an area, you can run those commands in the console with the “Run” button.
    • You can run all the commands in the editor window using the “Source” button.
  3. Top right
    • workspace (labeled Environment in newer versions of RStudio) lists all R objects (variables) that are defined.
    • history lists all the commands that have been typed into the console.
  4. Bottom right

    • files allows you to browse directories and open files.
    • plots displays any plots created. In this window you can toggle back through previously created plots.
    • packages shows which packages are installed and loaded.
    • help displays R help.

RStudio documentation can be found at http://www.rstudio.com/ide/docs/.

Installing and Loading Packages

One of the best features of R is that, at the time of writing, there are over 6,000 packages which add functions and data to R. The current list of packages can be found on CRAN, the Comprehensive R Archive Network.

You will use several packages in this course, so you will need to know how to install them. Using a package requires two steps.

  1. installing
  2. loading

These two steps are analogous to downloading/installing a program and opening/executing a program, respectively. Installing the package downloads the code onto your computer so that R can make use of it. Loading a package makes the functionality contained in the package available to your current R session. You only need to install a package once, although if new versions of the package become available, you may need to update or reinstall it. You need to load a package in every new R session in which you want to use it.

This course will make use of several packages. The following code will download and install these packages.

If you are working on a CSSCR lab computer these should have already been installed.

install.packages(c("devtools", "ggplot2"))
devtools::install_github("jrnold/bps5data")

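The previous code installs these packages: devtools (used here to install a package from GitHub), ggplot2 (for plotting), and bps5data (the BPS5e datasets used in this lab).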

Before you use any of the code in these packages, you need to load them using the function library:

library("ggplot2")
library("bps5data")

You can see which packages are installed and loaded using the Packages tab in the lower right panel. All packages that are installed are listed. Packages that are loaded are checked. By checking (un-checking) the box next to a package you can load (unload) that package.
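
If you are not sure whether a package is already installed, you can check from the console before installing. The following is a minimal sketch and not part of the original lab code: the helper name missing_packages is hypothetical, while requireNamespace and install.packages are standard R functions.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE if not
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "ggplot2"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to download and install them
} else {
  message("All packages are already installed.")
}
```

The install step is left commented out so that running the sketch never downloads anything by accident.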

Using R as a calculator

Although it is so much more, you can use R as a calculator. For example, to add, subtract, multiply or divide:

2 + 3
## [1] 5
2 - 3
## [1] -1
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667

The power of a number is calculated with ^, e.g. \(4^2\) is,

4 ^ 2
## [1] 16

R includes functions for many standard mathematical operations. For example, the square root function is sqrt, e.g. \(\sqrt{2}\),

sqrt(2)
## [1] 1.414214

And you can combine many of them together:

(2 * 4 + 3 ) / 10
## [1] 1.1
sqrt(2 * 2)
## [1] 2
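
Parentheses matter when you combine operations, because R follows the usual order of operations: powers first, then multiplication and division, then addition and subtraction. A short illustration with arbitrary numbers:

```r
# Without parentheses, only 3 is divided by 10
2 * 4 + 3 / 10
## [1] 8.3
# With parentheses, the whole sum is divided by 10
(2 * 4 + 3) / 10
## [1] 1.1
# Powers are evaluated before multiplication
2 * 3 ^ 2
## [1] 18
```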

Variables and Assignment

In R, you can save the results of calculations into objects that you can use later. This is done using the special symbol, <-. For example, this saves the results of 2 + 2 to an object named foo.

foo <- 2 + 2

You can see that foo is equal to 4 by typing its name:

foo
## [1] 4

And you can reuse foo in other calculations,

foo + 3
## [1] 7
foo / 2 * 8 + foo
## [1] 20
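
One point that often surprises new users: an object stores the value that was computed at the time of assignment, not the formula. A small sketch (the names a and b are just for illustration):

```r
a <- 5
b <- a * 2   # b stores the value 10, not the formula "a * 2"
a <- 100     # changing a afterwards ...
b            # ... does not change b
## [1] 10
```

If you want b to reflect the new value of a, you need to run the assignment to b again.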

Data and Data Frames

Data frames in R correspond to what you usually think of as a dataset or a spreadsheet: rows are observations and columns are variables.

R can import data from a variety of sources, such as csv files, and in most real applications you will be loading data from an external source. However, to keep things simple for this lab we will use datasets from some of the packages installed at the start of this lab.

Today we will use a dataset derived from the Behavioral Risk Factor Surveillance System (BRFSS), an annual telephone survey of 350,000 people in the United States conducted by the Centers for Disease Control and Prevention (CDC). As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. We will download this data and load it as a data frame named cdc:

source("http://www.openintro.org/stat/data/cdc.R")

You should see an object cdc in the Environment panel (upper right).

You can view the data and get basic information like the number of rows, number of columns, and names of the variables in a couple ways.

Use the View function to open a spreadsheet view of the data in the upper left panel,

View(cdc)

RStudio data frame viewer (upper left panel)

Use the dim function to get the number of rows (individuals) and columns (variables):

dim(cdc)
## [1] 20000     9

The first number is the number of rows, and the second number is the number of columns.

Use the names function to get the names of the variables:

names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

To extract a single column from the data frame use $. For example, to extract the height column of cdc, use cdc$height.

cdc$height

This will print a long list of numbers.
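
To see how these functions fit together, here is a small sketch on a made-up data frame. The name people and its values are hypothetical and are not part of the cdc data. The function head is a standard way to show only the first few values of a long column.

```r
# A small made-up data frame: each row is a person, each column a variable
people <- data.frame(
  height = c(70, 64, 60, 66),
  weight = c(175, 125, 105, 132),
  gender = c("m", "f", "f", "f")
)

dim(people)     # number of rows, then number of columns
## [1] 4 3
nrow(people)    # rows only
## [1] 4
names(people)   # variable names
## [1] "height" "weight" "gender"
people$height   # one column, as a vector
## [1] 70 64 60 66
head(people$height, 2)  # only the first 2 values; handy for long columns
## [1] 70 64
```

The same call pattern, for example head(cdc$height), avoids printing all 20,000 values.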

Descriptive Statistics

R has many functions. In this section you will learn a few which cover the descriptive statistics introduced in Chapter 2 of BPS.

To get a general summary of the dataset, use the summary function

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

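Notice that, for numeric variables, summary returns the five-number summary and the mean. For categorical variables such as genhlth and gender, it returns the count in each category.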

To calculate the mean of a column (weight),

mean(cdc$weight)
## [1] 169.683

the median,

median(cdc$weight)
## [1] 165

standard deviation,

sd(cdc$weight)
## [1] 40.08097

variance,

var(cdc$weight)
## [1] 1606.484

1st and 3rd quartiles

quantile(cdc$weight, c(0.25, 0.75))
## 25% 75% 
## 140 190

and inter-quartile range

IQR(cdc$weight)
## [1] 50
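
Several of these statistics are related. The standard deviation is the square root of the variance, and the inter-quartile range is the 3rd quartile minus the 1st quartile. In the output above, 190 - 140 = 50, and the square root of 1606.484 is about 40.08. A minimal sketch that checks both relationships on a made-up vector (the name w and its values are hypothetical):

```r
w <- c(120, 140, 165, 190, 210)   # made-up weights, not the cdc data

# The standard deviation is the square root of the variance
all.equal(sd(w), sqrt(var(w)))
## [1] TRUE

# The IQR is the 3rd quartile minus the 1st quartile
q <- quantile(w, c(0.25, 0.75))
q
## 25% 75% 
## 140 190 
IQR(w)
## [1] 50
```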

Scatterplots

There are many different methods for creating plots and graphs in R. This lab and course will use the package ggplot2. ggplot2 is one of the most popular plotting packages in R (and in data visualization in general). It simplifies the construction of plots, and makes nice looking plots by default.

Let’s create the plot in Example 4.5 of BPS 5th edition. First, load the dataset that we will be using.

data("ta04.01", package = "bps5data")

This loads a data frame named ta04.01 into the workspace. This data frame has 30 observations and 3 variables: Year, Boats, Kills.

We want to create a scatterplot of the number of boats registered versus the number of Florida manatees killed by boats. With ggplot2, we can create plots using the function qplot:

qplot(x = Boats, y = Kills, data = ta04.01, geom = "point")

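So what does that function call mean?

  • x = Boats puts the variable Boats on the horizontal axis.
  • y = Kills puts the variable Kills on the vertical axis.
  • data = ta04.01 tells qplot to look for those variables in the data frame ta04.01.
  • geom = "point" draws each observation as a point, which makes the plot a scatterplot.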

Correlation

The function cor is used to calculate the correlation between two variables. Using the manatee dataset in Table 4.1 of BPS5e, let’s calculate the correlation between the number of boat registrations and the number of manatees killed per year.

cor(ta04.01$Kills, ta04.01$Boats)
## [1] 0.9529787

Since the variables Kills and Boats are columns in the data frame ta04.01, we need to use the $ to refer to them.

One of the properties of correlation is \(cor(x, y) = cor(y, x)\). Sure enough, that is the case:

cor(ta04.01$Boats, ta04.01$Kills)
## [1] 0.9529787
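
The value returned by cor can be reproduced from the definition of correlation: standardize both variables, multiply the standardized values pair by pair, sum the products, and divide by \(n - 1\). The sketch below does this on made-up data. The vectors x and y are hypothetical and are not the manatee data.

```r
x <- c(1, 2, 3, 4, 5)               # made-up explanatory values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)    # made-up response values

n  <- length(x)
zx <- (x - mean(x)) / sd(x)   # standardized x
zy <- (y - mean(y)) / sd(y)   # standardized y
r_by_hand <- sum(zx * zy) / (n - 1)

all.equal(r_by_hand, cor(x, y))
## [1] TRUE
```

Because the products zx * zy are the same in either order, this also shows why swapping the two arguments of cor does not change the result.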

Regression

To estimate a linear regression in R, we use the function lm; the name lm comes from “linear model”. To run a regression of Kills on Boats:

lm(Kills ~ Boats, data = ta04.01)
## 
## Call:
## lm(formula = Kills ~ Boats, data = ta04.01)
## 
## Coefficients:
## (Intercept)        Boats  
##    -43.8120       0.1302

The first argument to lm is a formula that has the form y ~ x. The response/outcome/dependent variable is to the left of the ~ and the explanatory/independent variable(s) are to the right of the ~. This can be read that we want to run a model of Kills as a function of Boats; or that we want to regress Kills on Boats. The argument data specifies that R should look for the variables Kills and Boats inside the ta04.01 data.

The output of this model can be read as: \[ y = -43.81 + 0.13 x \] The value under “(Intercept)” is the value of the intercept of the regression line. The value under “Boats” is the value of the slope of the regression line.
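
The intercept and slope reported by lm follow the least-squares formulas: the slope is \(b = r s_y / s_x\) and the intercept is \(a = \bar{y} - b \bar{x}\). The sketch below checks this on made-up data and then uses the fitted line to predict the response at a new value of x. The functions coef and predict are standard R functions; the data frame d and its values are hypothetical.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1))   # made-up data
fit <- lm(y ~ x, data = d)

# Least-squares formulas for the slope (b) and intercept (a)
b <- cor(d$x, d$y) * sd(d$y) / sd(d$x)
a <- mean(d$y) - b * mean(d$x)
all.equal(unname(coef(fit)), c(a, b))
## [1] TRUE

# Prediction at x = 6: plug into the line, or let predict() do it
a + b * 6
predict(fit, newdata = data.frame(x = 6))
```

The two prediction lines give the same number. predict is more convenient when you need predictions for several new values at once.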

We can also produce more detailed linear regression output using summary:

summary(lm(Kills ~ Boats, data = ta04.01))
## 
## Call:
## lm(formula = Kills ~ Boats, data = ta04.01)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1391  -4.5343   0.1022   4.8298  17.7760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.812038   5.717379  -7.663 2.39e-08 ***
## Boats         0.130164   0.007822  16.640 4.74e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.445 on 28 degrees of freedom
## Multiple R-squared:  0.9082, Adjusted R-squared:  0.9049 
## F-statistic: 276.9 on 1 and 28 DF,  p-value: 4.741e-16

Much of this output involves statistical inference for regression, which we can ignore for now. The values in the “Estimate” column are the same intercept and slope as before. However, this output gives us another useful piece of information: the value of \(r^2\), which can be found next to “Multiple R-squared:”. The \(r^2\) of this regression is 0.91. You can ignore the value of “Adjusted R-squared”; it is a more advanced topic.
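
In a regression with one explanatory variable, \(r^2\) is the square of the correlation between the two variables. For the manatee data, \(0.9529787^2 \approx 0.9082\), which matches the “Multiple R-squared” value. A minimal sketch on made-up data (the data frame d is hypothetical) that extracts \(r^2\) from the summary and compares it with the squared correlation:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.3, 3.1, 6.4, 7.2, 10.5))   # made-up data
fit <- lm(y ~ x, data = d)

r2 <- summary(fit)$r.squared   # the "Multiple R-squared" value
all.equal(r2, cor(d$x, d$y)^2)
## [1] TRUE
```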

We can plot the linear regression line over the scatterplot:

qplot(x = Boats, y = Kills, data = ta04.01, geom = c("point", "smooth"),
      method = "lm", se = FALSE)

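You can run and adjust this code without understanding the details; adjust the values of x, y and data to the variables and data you wish to use. However, here is a brief explanation of what the various other arguments are doing:

  • geom = c("point", "smooth") draws the points and adds a smoothed line on top of them.
  • method = "lm" makes that line the least-squares regression line.
  • se = FALSE leaves out the shaded confidence band around the line.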
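
In more recent versions of ggplot2, qplot is deprecated, and arguments such as method and se may not be passed on to the smoother. A roughly equivalent plot can be built with ggplot, geom_point and geom_smooth. This is a sketch on a made-up data frame: the name boats_example and its values are hypothetical. With the lab data you would use ta04.01 in its place.

```r
library(ggplot2)

# Made-up data standing in for ta04.01 (hypothetical values)
boats_example <- data.frame(
  Boats = c(450, 500, 550, 600, 650, 700),
  Kills = c(15, 22, 28, 35, 40, 49)
)

p <- ggplot(boats_example, aes(x = Boats, y = Kills)) +
  geom_point() +                                 # the scatterplot
  geom_smooth(method = "lm", formula = y ~ x,    # least-squares line
              se = FALSE)                        # no confidence band
p
```

Each + adds one layer to the plot, so removing the geom_smooth line gives back the plain scatterplot.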

Help Me!

The most important thing in learning any programming language is how to find help. Even if you have been using a language for years, you will never remember every function name or its arguments, so you will always be looking things up.

The easiest way to find help in RStudio is to search in the Help panel (lower right panel).

If you want help for a specific function, use the help function or ? in the console. For example, to read the help for the function sum, you can use either of these:

help("sum")
?sum
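
A few other standard R functions are useful when you do not remember a function's exact name or arguments. The sketch below uses args, apropos and example. You can also search all help pages from the console by typing ?? followed by a search term.

```r
args(sd)             # show the argument names and defaults of sd
apropos("median")    # list objects whose names contain "median"
example(mean)        # run the examples from the help page for mean
```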

R Scripts

You can save R commands in a file called an R script. To create a new R script use File -> New File -> R Script. This will create a new tab in the upper left panel which will have a name like “Untitled1”. Save this to a file with the extension “.R” (RStudio will warn you if you do not).

To see how this works, write a few commands in the editor. For example,

2 + 2
## [1] 4
3 + 8
## [1] 11
mean(c(1, 2, 3))
## [1] 2

You can run the current line or highlighted section with Ctrl-Enter or the Run button. You can run the entire script with Ctrl-Shift-S or the Source button.
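
You can also run a whole script from the console with the function source, which does much the same as the Source button. In this sketch the script is written to a temporary file so that the example is self-contained. The file and the objects x and y are hypothetical stand-ins for a script you saved yourself.

```r
# Write a tiny script to a temporary file (stand-in for a script you saved)
script_file <- tempfile(fileext = ".R")
writeLines(c("x <- 2 + 2", "y <- mean(c(1, 2, 3))"), script_file)

# Run every command in the file, in order
source(script_file)
x
## [1] 4
y
## [1] 2
```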

Comments

Any R code following a hash (#) is not executed. These are called comments, and can be used to annotate and explain your code. For example, this doesn’t do anything.

# hello, world!

And in this, nothing after the # is executed,

2 + 2 # hello, world!
## [1] 4

Although you can put comments on the same line after code, it is good practice to put comments on separate lines.

Practice

Before running any code, load the libraries that will be needed for this:

library("ggplot2")
library("bps5data")

The practice is derived from BPS5e, Exercise 4.46. Read the exercise in BPS5e for a description of the data.

First, load the data needed for this problem:

data("ta04.03")

This loads a dataset named ta04.03 into your workspace.

  1. What are the variables in the data? How many observations are there?
  2. Create a scatterplot of fish supply vs. biomass change using qplot.
  3. Calculate the correlation between fish supply and biomass change.
  4. Answer the question in Ex. 4.46: How do the data support the idea that more animals are killed for bushmeat when the fish supply is low?
  5. Regress Biomass change on Fish supply using the function lm.
  6. Given the results of your regression, what would you predict the biomass change to be if there was a year with a fish supply of 31?

Some guidelines for your submission:

References

Parts of this lab were adapted from the OpenIntro Lab 1, which was released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

Data and examples come from Moore, The Basic Practice of Statistics, 5th ed.