Introduction to R

Karim Naguib (Boston University)
9/11/2013

What is R?

  • A statistical analysis environment
  • A calculator!
1 + 2
[1] 3
4 * (1 - 2)
[1] -4
2^10
[1] 1024

What is a variable?

A named storage location for a value or the result of a calculation

x <- 1
x
[1] 1
x <- x + 1
x
[1] 2
y <- 3 
z <- x + y
z
[1] 5

Vectors

A variable can also store a vector of values.

x <- c(4, 4, 1, 5, 100)
x
[1]   4   4   1   5 100
y <- c("abc", "hello")
y
[1] "abc"   "hello"
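
As a brief sketch of what can be done with a vector once it is stored, individual elements are picked out with square brackets, and arithmetic applies to every element at once. The name v is made up for illustration.

```r
v <- c(4, 4, 1, 5, 100)   # same values as x above, under an illustrative name

length(v)    # number of elements: 5
v[1]         # first element (R counts from 1): 4
v[c(3, 5)]   # third and fifth elements: 1 100
v * 2        # element-wise arithmetic: 8 8 2 10 200
sum(v)       # 114
```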

Functions

  • R functions are just like mathematical functions \( y = f(x) \): they take some input \( x \) and return a calculated value.
  • We'll use quite a few that are provided with R
  • We can write our own
f <- function(x, y) { x^2 + y }
f(2, 3)
[1] 7
x <- f(1, 10)
x
[1] 11
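
A small variation on the example above, as a sketch: arguments can be given default values and can be passed by name. The function name g and its default value are illustrative, not part of the course material.

```r
# Like f above, but y defaults to 0 when it is not supplied
g <- function(x, y = 0) {
  x^2 + y
}

g(3)            # y takes its default: 9
g(3, y = 1)     # arguments can be named: 10
g(c(1, 2, 3))   # works element-wise on a vector: 1 4 9
```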

Packages

  • In some cases we will need to use a “package”, which is a collection of functions and data bundled together.
  • The functions in these packages will not be available for us to use by default, so we need to load them.
  • For example, to load the foreign package we use the command
library(foreign)
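
library() stops with an error if the package is not installed. That should not come up on the course server, but on a personal machine a sketch like the following checks first. The variable name pkg is illustrative; the package is the one from the example above.

```r
pkg <- "foreign"

# requireNamespace() returns TRUE or FALSE instead of raising an error
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)   # character.only: pkg holds the name as a string
} else {
  message("Package '", pkg, "' is not installed; install.packages(\"", pkg, "\") would fetch it.")
}
```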

The summary() command

summary() is a useful function for getting a quick description of data

x <- c(1, 2, 100, 33, -34)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -34.0     1.0     2.0    20.4    33.0   100.0 
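
Each number reported by summary() can also be obtained from its own function. The following sketch reproduces them one by one for the same vector:

```r
x <- c(1, 2, 100, 33, -34)

min(x); max(x)               # -34 and 100
median(x)                    # 2
mean(x)                      # 20.4
quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles: 1 and 33
```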

RStudio Server

  • To simplify our work with R we will use a web-accessible version of RStudio (you do not need to install anything)
  • Refer to the relevant post on Piazza for login information
  • Each user will have their own home directory on the server where they will save all their work (/home/<username>/)

Projects

  • We will use projects to help us organize work on assignments
  • For each new assignment select New Project from the File Menu
  • Create the new project in a new directory and give it a meaningful name such as “assignment1”
  • The actual R code we write will be saved in script files, stored in their project directories

Scripts

  • In many cases it will not be convenient to work on the command line all the time; we might have many commands we want to execute
  • A script is a simple text file in which you list the commands you want executed, one after another
  • Once you are done with an assignment's script and want to turn it in, click on the Compile HTML Notebook from R Script button on the upper right side of the script's pane. You can then print the output HTML file that will be displayed.

Data

  • All data used in this class will be stored at /usr/data/ec414/

Loading Data

For example, suppose we want to load the CPS data used in chapter 3 of the textbook

load("/usr/data/ec414/cps_ch3.RData")

If we execute the ls function to display the variables in the workspace we will see

ls()
[1] "cps.ch3" "f"       "x"       "y"       "z"      
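
load() restores objects under the names they were saved with, which is why cps.ch3 appears in the workspace without being assigned. The sketch below shows the same round trip with a made-up data frame and a temporary file, since the course data path exists only on the server. The names toy, tmp and loaded are illustrative.

```r
toy <- data.frame(id = 1:3, wage = c(17.2, 15.3, 22.9))   # made-up data
tmp <- tempfile(fileext = ".RData")

save(toy, file = tmp)   # write the object, together with its name, to disk
rm(toy)                 # remove it from the workspace

loaded <- load(tmp)     # load() invisibly returns the names it restored
loaded                  # "toy"
str(toy)                # the data frame is back under its original name
```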

Learning About The Data

There are several functions we can use to learn more about the data we just loaded

str(cps.ch3)
'data.frame':   15393 obs. of  3 variables:
 $ a.sex: int  1 1 1 2 1 2 2 1 1 1 ...
 $ year : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
 $ ahe08: num  17.2 15.3 22.9 13.3 22.1 ...
summary(cps.ch3)
     a.sex           year          ahe08     
 Min.   :1.00   Min.   :1992   Min.   : 2.0  
 1st Qu.:1.00   1st Qu.:1996   1st Qu.:15.3  
 Median :1.00   Median :2000   Median :20.5  
 Mean   :1.48   Mean   :2001   Mean   :22.4  
 3rd Qu.:2.00   3rd Qu.:2004   3rd Qu.:27.4  
 Max.   :2.00   Max.   :2008   Max.   :82.4  

Categorical Variables

  • Notice that a couple of columns (or variables) in cps.ch3 are stored as integers (a.sex and year)
  • We would like to have them stored in the dataset (or data.frame) as categorical variables
  • We do that using the function factor
cps.ch3$a.sex <- factor(cps.ch3$a.sex, levels=c(1,2), labels=c("male", "female"))
cps.ch3$year <- factor(cps.ch3$year)
str(cps.ch3)
'data.frame':   15393 obs. of  3 variables:
 $ a.sex: Factor w/ 2 levels "male","female": 1 1 1 2 1 2 2 1 1 1 ...
 $ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ahe08: num  17.2 15.3 22.9 13.3 22.1 ...
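
To see what the levels and labels arguments do, here is a minimal sketch on a made-up vector of codes. The names codes and sex and the values are illustrative, not taken from the CPS data.

```r
codes <- c(1, 1, 2, 1, 2)   # made-up integer codes

sex <- factor(codes, levels = c(1, 2), labels = c("male", "female"))
sex           # male male female male female, with Levels: male female
levels(sex)   # "male" "female"
table(sex)    # counts per category: male 3, female 2

# summary() now reports counts per category instead of a mean
summary(sex)
```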

Calculating Sample Statistics

Using the data we just loaded, we want to calculate the sample mean and variance of average hourly earnings (ahe08)

mean(cps.ch3$ahe08)
[1] 22.4
var(cps.ch3$ahe08)
[1] 108.4
sd(cps.ch3$ahe08) # standard deviation
[1] 10.41
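
var() uses the sample variance formula with an \( n - 1 \) denominator, \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \), and sd() is its square root. A short check by hand, using only the five rounded ahe08 values printed by str() earlier (w, n and s2 are illustrative names):

```r
w <- c(17.2, 15.3, 22.9, 13.3, 22.1)   # five rounded values, for illustration only
n <- length(w)

s2 <- sum((w - mean(w))^2) / (n - 1)   # sample variance computed by hand

all.equal(s2, var(w))        # TRUE
all.equal(sqrt(s2), sd(w))   # TRUE
```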

Subsetting the Data

Since we want to compare average wages between men and women, we need to calculate sample statistics for each group separately

male.cps <- subset(cps.ch3, a.sex == "male")
str(male.cps)
'data.frame':   8008 obs. of  3 variables:
 $ a.sex: Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
 $ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ahe08: num  17.2 15.3 22.9 22.1 36.1 ...
female.cps <- subset(cps.ch3, a.sex == "female")
str(female.cps)
'data.frame':   7385 obs. of  3 variables:
 $ a.sex: Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
 $ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ahe08: num  13.28 12.17 21.07 7.82 18.61 ...

Subsetting the Data (cont.)

To get a more specific subset, for example, women surveyed in 1992

female.92.cps <- subset(cps.ch3, a.sex == "female" & year == "1992")
str(female.92.cps)
'data.frame':   1368 obs. of  3 variables:
 $ a.sex: Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
 $ year : Factor w/ 5 levels "1992","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ahe08: num  13.28 12.17 21.07 7.82 18.61 ...
summary(female.92.cps)
    a.sex        year          ahe08      
 male  :   0   1992:1368   Min.   : 2.61  
 female:1368   1996:   0   1st Qu.:14.75  
               2000:   0   Median :18.65  
               2004:   0   Mean   :20.05  
               2008:   0   3rd Qu.:24.40  
                           Max.   :66.37  
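
Two related points, sketched on a made-up data frame (toy and its columns are illustrative). First, subset() is shorthand for logical indexing with square brackets. Second, a subset keeps all the original factor levels, which is why summary() above still lists male and the other years with a count of 0; droplevels() removes the unused ones.

```r
toy <- data.frame(
  sex  = factor(c("male", "female", "female", "male", "female")),
  year = factor(c("1992", "1992", "1996", "1996", "1992")),
  wage = c(17.2, 13.3, 21.1, 22.9, 12.2)
)

a <- subset(toy, sex == "female" & year == "1992")
b <- toy[toy$sex == "female" & toy$year == "1992", ]   # same rows via brackets
all.equal(a, b)      # TRUE

levels(a$sex)        # still "female" "male": unused levels are kept
a <- droplevels(a)
levels(a$sex)        # "female"
```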

Plots

  • To plot data in the course we are going to mainly rely on the ggplot2 package.
library(ggplot2)
  • The main function we are going to use from this package is qplot

Boxplots (1)

A boxplot is useful for comparing the distributions of average hourly earnings across groups.

qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot")

Boxplots (2)

If we want to see the same boxplots separated by year, we use the facets argument.

qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot", facets= ~ year)

Boxplots (3)

We might want to add a few things to the plot to make it easier to read, such as axis labels

qplot(x=a.sex, y=ahe08, data=cps.ch3, geom="boxplot", facets= ~ year, xlab="Sex", ylab="Average Hourly Earnings")
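
qplot() is a shortcut around the fuller ggplot() interface, and recent ggplot2 releases mark qplot() as deprecated. As a hedged sketch, the same kind of plot can be built with ggplot() as follows. The data frame toy is made up so that the example runs on its own; with the course data, cps.ch3 would take its place.

```r
library(ggplot2)

# Made-up stand-in for cps.ch3 with the same column names
toy <- data.frame(
  a.sex = factor(rep(c("male", "female"), times = 4)),
  year  = factor(rep(c("1992", "2008"), each = 4)),
  ahe08 = c(17.2, 13.3, 22.9, 12.2, 25.0, 21.5, 30.1, 19.8)
)

p <- ggplot(toy, aes(x = a.sex, y = ahe08)) +
  geom_boxplot() +                                  # corresponds to geom="boxplot"
  facet_wrap(~ year) +                              # corresponds to facets= ~ year
  labs(x = "Sex", y = "Average Hourly Earnings")    # corresponds to xlab and ylab

print(p)   # draws the plot
```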

Density Plot

Another way to view the same data is to use density plots

qplot(x=ahe08, color=a.sex, data=cps.ch3, geom="density", facets= ~ year, xlab="Average Hourly Earnings")

Scatterplot (1)

For this part let's load another dataset, the March CPS survey for 2008 with more information on full-time workers

load("/usr/data/ec414/cps08_full.RData")
ls()
[1] "cps.08.full"   "cps.ch3"       "f"             "female.92.cps"
[5] "female.cps"    "male.cps"      "x"             "y"            
[9] "z"            
str(cps.08.full)
'data.frame':   7711 obs. of  5 variables:
 $ ahe     : num  38.46 12.5 9.86 8.24 17.79 ...
 $ year    : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ bachelor: int  1 1 0 0 0 0 1 1 1 1 ...
 $ female  : int  0 0 0 0 0 1 0 0 0 1 ...
 $ age     : int  33 31 30 30 31 29 26 28 30 25 ...

Scatterplot (2)

cps.08.full$year <- factor(cps.08.full$year)
cps.08.full$bachelor <- cps.08.full$bachelor == 1
cps.08.full$female <- cps.08.full$female == 1

cps.08.bac <- subset(cps.08.full, bachelor == TRUE)
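
The comparisons above turn the 0/1 indicator columns into TRUE/FALSE values. A small sketch on a made-up vector (ind and flag are illustrative names):

```r
ind <- c(1, 1, 0, 0, 1)   # made-up 0/1 indicator

flag <- ind == 1          # TRUE TRUE FALSE FALSE TRUE
class(flag)               # "logical"
sum(flag)                 # TRUE counts as 1, so this is the number of ones: 3
mean(flag)                # share of ones: 0.6
```

Because bachelor is already logical after the conversion, subset(cps.08.full, bachelor) would select the same rows as bachelor == TRUE.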

Scatterplot (3)

qplot(x=age, y=ahe, data=cps.08.bac, color=female, geom="point")

qplot(x=factor(age), y=ahe, data=cps.08.bac, color=female, geom="boxplot")

Documentation and Help

  • You can find documentation for R functions in the Help menu of the RStudio environment
  • A quick way to get help on a particular function is the ? command (make sure that the function's package is loaded). For example
? qplot
  • Give it a try: find the documentation for the cov function
  • Make use of Piazza to ask and answer questions
  • Refer to the R documentation resource provided on Piazza