CS&SS/SOC/STAT 221, University of Washington, Winter 2015
This lab introduces you to R and RStudio, which you will be using throughout this course to apply and learn the statistical concepts discussed in class. Additionally, you will learn R commands to implement some of the concepts covered in Chapters 4 and 5 of The Basic Practice of Statistics: scatterplots, correlation and linear regression.
R is the name of the programming language, and RStudio is a convenient and widely used interface to that language. You should familiarize yourself with the RStudio GUI.
RStudio GUI
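It consists of four windows. One of them is the console, where you type commands at the > prompt and R executes them. The upper left panel holds the script editor and data viewer, the upper right panel holds the Environment, and the lower right panel holds the Packages and Help tabs.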
RStudio documentation can be found at http://www.rstudio.com/ide/docs/. Of those, the most likely to be useful to you are:
One of the best features of R is that, at the time of writing, there are over 6,000 packages which add functions and data to R. The current list of packages can be found on CRAN, the Comprehensive R Archive Network.
You will use several packages in this course, so you will need to know how to install them. Using a package requires two steps: installing it, then loading it.
These two steps are analogous to downloading/installing a program and opening/executing a program, respectively. Installing the package downloads the code onto your computer so that R can make use of it. Loading a package makes the functionality contained in the package available to your current R session. You only need to install the package once, although if new versions of the package become available, you may need to update or reinstall the package.
This course will make use of several packages. The following code will download and install these packages.
If you are working on a CSSCR lab computer these should have already been installed.
install.packages(c("devtools", "ggplot2"))
devtools::install_github("jrnold/bps5data")
The previous code installs these packages: devtools, ggplot2, and bps5data.
Before you use any of the code in these packages, you need to load them using the function library:
library("ggplot2")
library("bps5data")
You can see which packages are installed and loaded using the Packages tab in the lower right panel. All packages that are installed are listed. Packages that are loaded are checked. By checking (un-checking) the box next to a package you can load (unload) that package.
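The same information is available from the console. The following is a minimal sketch: the helper names is_installed and is_loaded are hypothetical (they are not part of R), and the built-in stats package is used only so that the example runs on any machine.

```r
# Hypothetical helpers (not part of R) for checking a package from the console.

# TRUE if the package is installed, i.e. its namespace can be found and loaded.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# TRUE if the package has been attached with library() in this session.
is_loaded <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_installed("stats")  # stats ships with R
is_loaded("stats")     # stats is attached by default
```

Replacing "stats" with "ggplot2" or "bps5data" checks the packages used in this lab.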
Although it is so much more, you can use R as a calculator. For example, to add, subtract, multiply or divide:
2 + 3
## [1] 5
2 - 3
## [1] -1
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667
The power of a number is calculated with ^, e.g. \(4^2\) is,
4 ^ 2
## [1] 16
R includes many functions for standard math operations. For example, the square root function is sqrt, e.g. \(\sqrt{2}\),
sqrt(2)
## [1] 1.414214
And you can combine many of them together:
(2 * 4 + 3 ) / 10
## [1] 1.1
sqrt(2 * 2)
## [1] 2
In R, you can save the results of calculations into objects that you can use later. This is done using the special symbol <-. For example, this saves the result of 2 + 2 to an object named foo.
foo <- 2 + 2
You can see that foo is equal to 4:
foo
## [1] 4
And you can reuse foo in other calculations,
foo + 3
## [1] 7
foo / 2 * 8 + foo
## [1] 20
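As a further illustration, the short sketch below combines arithmetic, a function call, and assignment. The object names a, b and hyp and the values 3 and 4 are made up for this example.

```r
# Side lengths of a right triangle (illustrative values).
a <- 3
b <- 4

# Combine arithmetic, a function call, and assignment in one step.
hyp <- sqrt(a^2 + b^2)
hyp
## [1] 5
```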
Data frames in R correspond to what you usually think of as a dataset or a spreadsheet: rows are observations and columns are variables.
R can import data from a variety of sources, such as csv files. In most real applications, you will be loading data from an external source. However, to keep things simple for this lab we will use datasets from some of the packages installed at the start of this lab.
Today we will use a dataset derived from the Behavioral Risk Factor Surveillance System (BRFSS), an annual telephone survey of 350,000 people in the United States conducted by the Centers for Disease Control and Prevention (CDC). As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. We will download this data and load it as a data frame named cdc:
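To make the idea concrete, here is a small sketch that builds a data frame by hand with the base R function data.frame. The name toy and all of the values are invented for illustration and do not come from any dataset used in this course.

```r
# A made-up data frame: each row is one person, each column is one variable.
toy <- data.frame(
  height = c(70, 64, 67),
  weight = c(175, 125, 160)
)
toy
##   height weight
## 1     70    175
## 2     64    125
## 3     67    160
```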
source("http://www.openintro.org/stat/data/cdc.R")
You should see an object cdc in the Environment panel (upper right).
You can view the data and get basic information like the number of rows, number of columns, and names of the variables in a couple of ways.
Use the View function to open a spreadsheet view of the data in the upper left panel,
View(cdc)
RStudio data frame viewer (upper left panel)
Use the dim function to get the number of rows (individuals) and columns (variables):
dim(cdc)
## [1] 20000 9
The first number is the number of rows, and the second number is the number of columns.
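If you only need one of the two numbers, the base R functions nrow and ncol return them separately. The sketch below uses women, a small data frame that ships with R, as a stand-in so that it runs without downloading cdc.

```r
# women is a small example data set that ships with R.
nrow(women)
## [1] 15
ncol(women)
## [1] 2
```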
Use the names function to get the names of the variables:
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
To extract a single column from the data frame use $. For example, to extract the height column of the cdc data frame, use cdc$height.
cdc$height
This will print a long list of numbers.
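To look at only the first few values of a column, one option is the base R function head. This sketch again uses the built-in women data frame as a stand-in for cdc; the object name first_heights is made up for the example.

```r
# Keep only the first 5 values of the height column.
first_heights <- head(women$height, 5)
first_heights
## [1] 58 59 60 61 62
```

The same call with cdc$height in place of women$height would show the first five heights in the BRFSS sample.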
R has many functions. In this section you will learn a few which cover the descriptive statistics introduced in Chapter 2 of BPS.
To get a general summary of the dataset, use the summary function:
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00
Notice that, for numeric variables, summary returns the 5-number summary and the mean.
To calculate the mean of a column (weight),
mean(cdc$weight)
## [1] 169.683
the median,
median(cdc$weight)
## [1] 165
standard deviation,
sd(cdc$weight)
## [1] 40.08097
variance,
var(cdc$weight)
## [1] 1606.484
1st and 3rd quartiles,
quantile(cdc$weight, c(0.25, 0.75))
## 25% 75%
## 140 190
and inter-quartile range
IQR(cdc$weight)
## [1] 50
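These functions are related to each other: the standard deviation is the square root of the variance, and the inter-quartile range is the 3rd quartile minus the 1st. The sketch below checks both relationships on a small made-up vector x (the values are illustrative only).

```r
# A small made-up sample.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The standard deviation is the square root of the variance.
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE

# The IQR is the 3rd quartile minus the 1st quartile.
q <- quantile(x, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
all.equal(IQR(x), iqr_manual)
## [1] TRUE
```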
There are many different methods for creating plots and graphs in R. This lab and course will use one called ggplot2. ggplot2 is one of the most popular plotting packages in R (and in data visualization in general). It simplifies the construction of plots and makes nice-looking plots by default.
Let’s create the plot in Example 4.5 of BPS 5th edition. First, load the dataset that we will be using.
data("ta04.01", package = "bps5data")
This loads a data frame named ta04.01 into the workspace. This data frame has 30 observations and 3 variables: Year, Boats, Kills.
We want to create a scatterplot of the number of boats registered versus the number of Florida manatees killed by boats. Using ggplot2, we create plots using the function qplot:
qplot(x = Boats, y = Kills, data = ta04.01, geom = "point")
So what does that function call mean?
x and y specify the x and y variables in the plot. In this case, Boats and Kills, respectively.
data specifies the name of the data frame from which those variables come. In this case, we are using the data frame ta04.01.
geom specifies the type of plot to create. “point” will create a scatterplot. If you are interested, this argument is called geom because it is derived from the “grammar of graphics”, a formal way of representing data as graphics. See http://vita.had.co.nz/papers/layered-grammar.html for more.
The function cor is used to calculate the correlation between two variables. Using the manatee dataset in Table 4.1 of BPS5e, let’s calculate the correlation between the number of boat registrations and the number of manatees killed per year.
cor(ta04.01$Kills, ta04.01$Boats)
## [1] 0.9529787
Since the variables Kills and Boats are columns in the data frame ta04.01, we need to use $ to refer to them.
One of the properties of correlation is \(cor(x, y) = cor(y, x)\). Sure enough, that is the case:
cor(ta04.01$Boats, ta04.01$Kills)
## [1] 0.9529787
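To see what cor is computing, the sketch below works out the correlation by hand as the sum of products of standardized values divided by n - 1, and compares it with cor. It uses the built-in women data frame as a stand-in so that it runs without bps5data; the object names (z_x, z_y, r_manual) are made up for the example.

```r
# Stand-in data: heights and weights from the built-in women data set.
x <- women$height
y <- women$weight
n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Correlation is the sum of the products of standardized values over n - 1.
r_manual <- sum(z_x * z_y) / (n - 1)
all.equal(r_manual, cor(x, y))
## [1] TRUE
```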
To estimate a linear regression in R, we use the function lm; the name lm comes from “linear model”. To run a regression of Kills on Boats:
lm(Kills ~ Boats, data = ta04.01)
##
## Call:
## lm(formula = Kills ~ Boats, data = ta04.01)
##
## Coefficients:
## (Intercept) Boats
## -43.8120 0.1302
The first argument to lm is a formula that has the form y ~ x. The response/outcome/dependent variable is to the left of the ~ and the explanatory/independent variable(s) are to the right of the ~. This can be read that we want to run a model of Kills as a function of Boats; or that we want to regress Kills on Boats. The argument data specifies that R should look for the variables Kills and Boats inside the ta04.01 data frame.
The output of this model can be read as: \[ y = -43.81 + 0.13 x \] The value under “(Intercept)” is the value of the intercept of the regression line. The value under “Boats” is the value of the slope of the regression line.
We can also produce more detailed linear regression output using summary:
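The intercept and slope reported by lm agree with the least-squares formulas: the slope is the correlation times the ratio of standard deviations, and the intercept is the mean of y minus the slope times the mean of x. The sketch below checks this on the built-in women data frame, used here as a stand-in for the manatee data; the object names a, b and fit are made up for the example.

```r
# Stand-in data: heights and weights from the built-in women data set.
x <- women$height
y <- women$weight

# Least-squares slope and intercept computed by hand.
b <- cor(x, y) * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# The same regression estimated with lm.
fit <- lm(weight ~ height, data = women)

# coef() returns the intercept first, then the slope.
all.equal(c(a, b), unname(coef(fit)))
## [1] TRUE
```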
summary(lm(Kills ~ Boats, data = ta04.01))
##
## Call:
## lm(formula = Kills ~ Boats, data = ta04.01)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1391 -4.5343 0.1022 4.8298 17.7760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.812038 5.717379 -7.663 2.39e-08 ***
## Boats 0.130164 0.007822 16.640 4.74e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.445 on 28 degrees of freedom
## Multiple R-squared: 0.9082, Adjusted R-squared: 0.9049
## F-statistic: 276.9 on 1 and 28 DF, p-value: 4.741e-16
Much of this output involves the statistical inference of regression, which we can ignore for now. The values in the “Estimate” column are the intercept and slope of the regression line. This output also gives us another useful piece of information. The value of \(r^2\) can be found next to “Multiple R-squared:”. The \(r^2\) of this regression is 0.91. You can ignore the value of “Adjusted R-squared”; it is a more advanced topic.
We can plot the linear regression line over the scatterplot:
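In a regression with a single explanatory variable, \(r^2\) is the square of the correlation. The sketch below first squares the correlation reported earlier for the manatee data, then confirms the relationship on the built-in women data frame (a stand-in, so the code runs without bps5data). The object names r and fit are made up for the example.

```r
# Correlation between Kills and Boats, as reported earlier in this lab.
r <- 0.9529787
round(r^2, 4)
## [1] 0.9082

# The same relationship checked on a stand-in data set that ships with R.
fit <- lm(weight ~ height, data = women)
all.equal(summary(fit)$r.squared, cor(women$height, women$weight)^2)
## [1] TRUE
```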
qplot(x = Boats, y = Kills, data = ta04.01, geom = c("point", "smooth"),
method = "lm", se = FALSE)
You can run and adjust this code without understanding the details; adjust the values of x, y and data to the variables and data you wish to use. However, here is a brief explanation of what the various other arguments are doing:
“smooth” is added to the geom argument in order to draw a line representing the relationship between x and y.
method = "lm" tells R to draw the linear regression line to represent the relationship between x and y. There are other, more advanced methods to calculate non-linear relationships between the two variables.
se = FALSE tells R not to draw standard errors around the regression line. This is a topic in statistical inference that we have not covered yet.
The most important thing in learning any programming language is how to find help. Even if you have been using a language for years, you will never remember every function name or its arguments, so you will always be looking things up.
The easiest way to find help in RStudio is to search in the Help panel (lower right panel).
If you want to get help for a specific function, use the help function, or ? in the console. For example, to read help for a function sum, you can do either of these:
help("sum")
?sum
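Two other base R functions can help when you do not remember a function's exact name or arguments. This is only a sketch of one possible workflow; the object name matches is made up for the example.

```r
# List objects whose names contain "median".
matches <- apropos("median")
matches

# Show the arguments a function accepts without opening its help page.
args(sd)
```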
You can save R commands in a file called an R script. To create a new R script use File -> New File -> R Script. This will create a new tab in the upper left panel which will have a name like “Untitled1”. Save this to a file with the extension “.R” (RStudio will warn you if you do not).
To see how this works, write a few commands in the editor. For example,
2 + 2
## [1] 4
3 + 8
## [1] 11
mean(c(1, 2, 3))
## [1] 2
You can run the current line or highlighted section with Ctrl-Enter or the Run button. You can run the entire script with Ctrl-Shift-S or the Source button.
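Running a whole script can also be done from the console with the function source, the same function used earlier to load the cdc data. The sketch below writes a two-line script to a temporary file and runs it; the path script_path and the script contents are made up for illustration, and in practice you would pass the name of your own .R file.

```r
# Write a tiny script to a temporary file (a stand-in for your own .R file).
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 2 + 2", "y <- mean(c(1, 2, 3))"), script_path)

# Run every line in the script.
source(script_path)
x
## [1] 4
y
## [1] 2
```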
Before running any code, load the libraries that will be needed for this:
library("ggplot2")
library("bps5data")
The practice is derived from BPS5e, Exercise 4.46. Read the exercise in BPS5e for a description of the data.
First, load the data needed for this problem:
data("ta04.03")
This loads a dataset named ta04.03 into your workspace.
qplot.
lm.
Some guidelines for your submission:
Parts of this lab were adapted from the OpenIntro Lab 1, which was released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.
Data and examples come from Moore, The Basic Practice of Statistics, 5th ed.
Comments
Any R code following a hash (#) is not executed. These are called comments, and can be used to annotate and explain your code. For example, a line that begins with # does nothing when it is run. When a # appears partway through a line, nothing after the # is executed.
Although you can put comments on the same line after code, it is good practice to put comments on separate lines.
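A short sketch of both kinds of comment follows; the object name total is made up for the example.

```r
# This whole line is a comment, so running it does nothing.

# Add two numbers and store the result.
total <- 2 + 2  # everything after the hash on this line is ignored
total
## [1] 4
```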