See TOC
Important: Before you begin the examples, you need to download some data and to set the correct working directory so that Rstudio knows where the data is located on your computer.
read.table()
commandsurvey <- read.table("~/Course_survey.csv", header=TRUE, sep=",")
Let’s begin by doing some basic data manipulations
Convert factor variable \(\rightarrow\) numerical variable
You may want to convert the integers to more meaningful values, e.g. integers, for the purpose of your analysis.
Recall the Program' variable in the
survey` data
head(survey$Program)
## [1] SOE NonEcon NonEcon OtherEcon SOE NonEcon
## Levels: NonEcon OtherEcon SOE
survey <- transform(survey, Program=as.numeric(Program))
head(survey$Program)
## [1] 3 1 1 2 3 1
We can see now that the original variable codes are as follows: 1 = Non Econ, 2 = Other Econ, 3 = SOE
Convert numerical variable \(\rightarrow\) factor variable
Undo what we just do, and change program values back to factor names
we can do this using the transform()
, as.factor()
and mapvalues()
functions
Note, that mapvalues()
function works as follows.
library(plyr)
# Mapvalues works on factors
y <- factor(c("a", "b", "c", "a"))
mapvalues(y, c("a", "c"), c("A", "C")) # mapvalues function takes three parameters: (i) vector object 'y'; (ii) original values or elements withing 'y', in this case lowercase 'a' and 'c'; and outcome parameter, uppercase 'A' and 'C'
## [1] A b C A
## Levels: A b C
survey <- transform(survey,
Program = as.factor(mapvalues(Program,
c(1, 2, 3),
c("NonEcon", "OtherEcon", "SOE")))
)
head(survey$Program)
## [1] SOE NonEcon NonEcon OtherEcon SOE NonEcon
## Levels: NonEcon OtherEcon SOE
# Check order of factor levels
levels(survey$Program)
## [1] "NonEcon" "OtherEcon" "SOE"
Chagne order of factor levels using factor()
# Change Order of factor levels
survey$Program<- factor(survey$Program, levels= c("SOE","OtherEcon","NonEcon"))
levels(survey$Program)
## [1] "SOE" "OtherEcon" "NonEcon"
Sometimes, we need to change the levels of a factor. Recall, from lecture 1 the following example:
When data is entered manually, misspellings and case changes are very common
E.g., a column showing life support mechanism may look like,
Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")
Health Example
Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")
summary(Health)
## Length Class Mode
## 6 character character
For now, I’ll give you another example on how to change levels of a factor using the Cars93
data in the MASS
library
library(MASS)
names(Cars93) #list names of all variables
## [1] "Manufacturer" "Model" "Type"
## [4] "Min.Price" "Price" "Max.Price"
## [7] "MPG.city" "MPG.highway" "AirBags"
## [10] "DriveTrain" "Cylinders" "EngineSize"
## [13] "Horsepower" "RPM" "Rev.per.mile"
## [16] "Man.trans.avail" "Fuel.tank.capacity" "Passengers"
## [19] "Length" "Wheelbase" "Width"
## [22] "Turn.circle" "Rear.seat.room" "Luggage.room"
## [25] "Weight" "Origin" "Make"
manufacturer <- Cars93$Manufacturer
head(manufacturer, 10)
## [1] Acura Acura Audi Audi BMW Buick Buick
## [8] Buick Buick Cadillac
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo
We’ll use the mapvalues(x, from, to)
function from the plyr
library.
# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- mapvalues(manufacturer,
from = c("Chevrolet", "Pontiac", "Buick","Cadillac"),
to = rep("GM", 4))
head(manufacturer.combined, 10)
## [1] Acura Acura Audi Audi BMW GM GM GM GM GM
## 29 Levels: Acura Audi BMW GM Chrylser Chrysler Dodge Eagle Ford ... Volvo
It is often times necessary to create a new variable based on some type of transformation of existing variables
transform()
function to add a new variable to existing dataframe
transform()
returns a new data frame with columns modified or added as specified by the function callnames(Cars93) #list names of all variables
## [1] "Manufacturer" "Model" "Type"
## [4] "Min.Price" "Price" "Max.Price"
## [7] "MPG.city" "MPG.highway" "AirBags"
## [10] "DriveTrain" "Cylinders" "EngineSize"
## [13] "Horsepower" "RPM" "Rev.per.mile"
## [16] "Man.trans.avail" "Fuel.tank.capacity" "Passengers"
## [19] "Length" "Wheelbase" "Width"
## [22] "Turn.circle" "Rear.seat.room" "Luggage.room"
## [25] "Weight" "Origin" "Make"
KMPL.highway
and KMPL.city
based on transforming two existing variables in the Cars93
dataframe.Cars93.metric <- transform(Cars93,
KMPL.highway = 0.425 * MPG.highway, ## converts miles per gallion into km/l
KMPL.city = 0.425 * MPG.city
)
tail(names(Cars93.metric))
## [1] "Luggage.room" "Weight" "Origin" "Make"
## [5] "KMPL.highway" "KMPL.city"
A simpler approach
# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
## [1] "Weight" "Origin" "Make" "KMPL.highway"
## [5] "KMPL.city" "KMPL.city.2"
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
## [1] TRUE
Some more data frame summaries: table()
function
table()
function builds contingency tables showing counts at each combination of factor levelstable(Cars93$AirBags)
##
## Driver & Passenger Driver only None
## 16 43 34
table(Cars93$Origin)
##
## USA non-USA
## 48 45
table(Cars93$AirBags, Cars93$Origin)
##
## USA non-USA
## Driver & Passenger 9 7
## Driver only 23 20
## None 16 18
Looks like US and non-US cars had about the same distribution of AirBag types
Later in the class we’ll learn how to do a hypothesis tests on this kind of data
Thus far we’ve repeatedly typed out the data frame name when referencing its columns
This is because the data variables don’t exist in our working environment
Using with(data, expr)
lets us specify that the code in expr
should be evaluated in an environment that contains the elements of data
as variables
with(Cars93, table(Origin, Type))
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 7 11 10 7 8 5
## non-USA 9 0 12 14 6 4
with()
and indexingwith(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!
## [1] FALSE
We have used a lot of built-in functions: mean()
, subset()
, transform()
, read.table()
…
An important part of programming and data analysis is to write custom functions
Functions help make code modular
Functions make debugging easier
Remember: this entire class is about applying functions to data
What is a function?
addOne <- function(x) {
x + 1
}
x
is the argument or input
The function output is the input x
incremented by 1
addOne(12)
## [1] 13
calculatePercentage <- function(x, y, d) {
decimal <- x / y # Calculate decimal value
round(100 * decimal, d) # Convert to % and round to d digits
}
calculatePercentage(27, 80, 1)
## [1] 33.8
threeNumberSummary <- function(x) {
c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
## mean median sd
## 4.861348 5.166916 1.918449
Oftentimes we want our code to have different effects depending on the features of the input
In all other cases, assign F
To code this up, we use if-else statements
calculateLetterGrade <- function(x) {
if(x >= 90) {
grade <- "A"
} else if(x >= 80) {
grade <- "B"
} else if(x >= 70) {
grade <- "C"
} else {
grade <- "F"
}
grade
}
course.grades <- c(92, 78, 87, 91, 62)
sapply(course.grades, FUN=calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"
course.grades <- c(92, 78, 87, 91, 62)
Letter.Grades<-ifelse(course.grades>= 90, "A",
ifelse(course.grades< 90 & course.grades>=80, "B",
ifelse(course.grades < 80 & course.grades>=70, "C", "F")))
Letter.Grades
## [1] "A" "C" "B" "A" "F"
Syntax
A for loop executes a chunk of code for every value of an index variable in an index set
The basic syntax takes the form
for(index.variable in index.set) {
code to be repeated at every value of index.variable
}
for(i in 1:4) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
phrase <- "Good Night, "
for(word in c("and", "Good", "Luck")) {
phrase <- paste(phrase, word)
print(phrase)
}
## [1] "Good Night, and"
## [1] "Good Night, and Good"
## [1] "Good Night, and Good Luck"
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
typeof(index.set$name)
## [1] "character"
typeof(index.set$weight)
## [1] "double"
typeof(index.set$is.male)
## [1] "logical"
for(i in index.set) {
print(c(i, typeof(i)))
}
## [1] "Michael" "character"
## [1] "185" "double"
## [1] "TRUE" "logical"
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.9837844 0.5365768 -0.16519567 -0.6494568 0.8370631
## [2,] -0.1799905 -2.5936594 -0.00521573 0.8744326 -1.5922350
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1] 2.679758 16.135101 -13.646715 18.589498 -7.228209
colSums(fake.data)
## [1] 2.679758 16.135101 -13.646715 18.589498 -7.228209
Command | Description |
---|---|
apply(X, MARGIN, FUN) |
Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X |
lapply(X, FUN) |
Obtain a list by applying FUN to the elements of a list X |
sapply(X, FUN) |
Simplified version of lapply . Returns a vector/array instead of list. |
tapply(X, INDEX, FUN) |
Obtain a table by applying FUN to each combination of the factors given in INDEX |
The different variants of the Apply() function family can be used under certain settings
These functions are (good!) alternatives to loops
They are typically more efficient than loops (often run considerably faster on large data sets)
Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
You can always type example(function)
to get code examples (E.g., example(apply)
)
One of R’s many advantages
tapply()
vs aggregate()
Get the mean of each column in the fake data we created earlier.
dim(fake.data) # 100 rows, 5 columns
## [1] 100 5
colMeans(fake.data)# Built-in function in R
## [1] 0.02679758 0.16135101 -0.13646715 0.18589498 -0.07228209
#Returns a mean for each column
Use apply()
function in plyr
library
apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1] 0.02679758 0.16135101 -0.13646715 0.18589498 -0.07228209
Create a function that calculates proportion of vector indexes that are > 0, and get the new means for each column
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
## [1] 0.50 0.57 0.45 0.60 0.43
apply(cars, 2, FUN=mean) # Data frames are arrays
## speed dist
## 15.40 42.98
lapply(cars, FUN=mean) # Data frames are also lists
## $speed
## [1] 15.4
##
## $dist
## [1] 42.98
sapply(cars, FUN=mean) # sapply() is just simplified lapply()
## speed dist
## 15.40 42.98
Think of tapply()
as a generalized form of the table()
function
Recall Cars93 Data
library(MASS)
names(Cars93)
## [1] "Manufacturer" "Model" "Type"
## [4] "Min.Price" "Price" "Max.Price"
## [7] "MPG.city" "MPG.highway" "AirBags"
## [10] "DriveTrain" "Cylinders" "EngineSize"
## [13] "Horsepower" "RPM" "Rev.per.mile"
## [16] "Man.trans.avail" "Fuel.tank.capacity" "Passengers"
## [19] "Length" "Wheelbase" "Width"
## [22] "Turn.circle" "Rear.seat.room" "Luggage.room"
## [25] "Weight" "Origin" "Make"
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)
##
## 4WD Front Rear
## USA 5 34 9
## non-USA 5 33 7
tapply()
# Lets get the average 'MPG.City' by car 'Origin' and 'Drivetrain'
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean) #Same as above result
## DriveTrain
## Origin 4WD Front Rear
## USA 17.6 22.14706 18.33333
## non-USA 23.4 24.93939 19.14286
Use tapply()
and indexing
#Now, let's get the average 'Horsepower' by car `Origin` and `Type`
tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 117.4286 179.4545 153.5000 89.42857 166.5000 158.40
## non-USA 141.5556 NA 189.4167 91.78571 151.6667 138.25
NA
doing there?any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
## [1] FALSE
Let’s first recall what tapply()
does
tapply(X, INDEX, FUN)
FUN
to X
grouped by factors in INDEX
aggregate()
performs a similar operation, but presents the results in a form that is at times more convenient
There are many ways to call the aggregate()
function
tapply
call: aggregate(X, by, FUN)
by
is exactly like INDEX
tapply vs aggregate
with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN = mean)) # tapply
## Compact Large Midsize Small Sporty Van
## USA 117.4286 179.4545 153.5000 89.42857 166.5000 158.40
## non-USA 141.5556 NA 189.4167 91.78571 151.6667 138.25
with(Cars93, aggregate(Horsepower, by = list(Origin, Type), FUN = mean)) # aggregate
## Group.1 Group.2 x
## 1 USA Compact 117.42857
## 2 non-USA Compact 141.55556
## 3 USA Large 179.45455
## 4 USA Midsize 153.50000
## 5 non-USA Midsize 189.41667
## 6 USA Small 89.42857
## 7 non-USA Small 91.78571
## 8 USA Sporty 166.50000
## 9 non-USA Sporty 151.66667
## 10 USA Van 158.40000
## 11 non-USA Van 138.25000
aggregate(Horsepower ~ Origin + Type, FUN=mean, data=Cars93) # aggregate with different syntax
## Origin Type Horsepower
## 1 USA Compact 117.42857
## 2 non-USA Compact 141.55556
## 3 USA Large 179.45455
## 4 USA Midsize 153.50000
## 5 non-USA Midsize 189.41667
## 6 USA Small 89.42857
## 7 non-USA Small 91.78571
## 8 USA Sporty 166.50000
## 9 non-USA Sporty 151.66667
## 10 USA Van 158.40000
## 11 non-USA Van 138.25000
We’re going to go over some of the main things we’ve learned so far by operating on the birthwt
dataset from the MASS library
Let’s get it loaded and see what we’re working
table()
Let’s get the structure of the data
library(MASS)
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Notice that all of the variables are integers (numerical). This is likely incorrect.. For instance, the variable race
should be a factor (categorical) variable. We’ll correct this in example 3.
Now, let’s get summary information for each variable
library(MASS)
summary(birthwt)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
The dataset doesn’t come with very descriptive variable names
Let’s get better column names (use help(birthwt)
to understand the variables and come up with better names)
colnames(birthwt)
## [1] "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui"
## [9] "ftv" "bwt"
# The default names are not very descriptive
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight",
"race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr",
"physician.visits", "birthwt.grams")
# Better names!
All the factors are currently represented as integers
Let’s use the transform()
and mapvalues()
functions to convert variables to factors and give the factors more meaningful levels
library(plyr)
birthwt <- transform(birthwt,
race = as.factor(mapvalues(race, c(1, 2, 3),
c("white","black", "other"))),
mother.smokes = as.factor(mapvalues(mother.smokes,
c(0,1), c("no", "yes"))),
hypertension = as.factor(mapvalues(hypertension,
c(0,1), c("no", "yes"))),
uterine.irr = as.factor(mapvalues(uterine.irr,
c(0,1), c("no", "yes"))),
birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
c(0,1), c("no", "yes")))
)
summary(birthwt)
## birthwt.below.2500 mother.age mother.weight race
## no :130 Min. :14.00 Min. : 80.0 black:26
## yes: 59 1st Qu.:19.00 1st Qu.:110.0 other:67
## Median :23.00 Median :121.0 white:96
## Mean :23.24 Mean :129.8
## 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :45.00 Max. :250.0
## mother.smokes previous.prem.labor hypertension uterine.irr
## no :115 Min. :0.0000 no :177 no :161
## yes: 74 1st Qu.:0.0000 yes: 12 yes: 28
## Median :0.0000
## Mean :0.1958
## 3rd Qu.:0.0000
## Max. :3.0000
## physician.visits birthwt.grams
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
tapply()
function to see what the average birthweight looks like when broken down by race and smoking statuswith(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
## no yes
## black 2854.500 2504.000
## other 2815.782 2757.167
## white 3428.750 2826.846
What if we wanted nicer looking output? - Let’s use the header {r, results='asis'}
, along with the kable()
function from the knitr
library
library(knitr)
bwt.tbl <- with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
kable(bwt.tbl, format = "markdown")
no | yes | |
---|---|---|
black | 2854.500 | 2504.000 |
other | 2815.782 | 2757.167 |
white | 3428.750 | 2826.846 |
kable()
outputs the table in a way that Markdown can read and nicely display
Note: changing the CSS changes the table appearance
What are the odds of having low birthweight among non-smoking mothers?
First, let’s look at the contingency table
weight.smoke.tbl <- with(birthwt, table(birthwt.below.2500, mother.smokes))
weight.smoke.tbl
## mother.smokes
## birthwt.below.2500 no yes
## no 86 44
## yes 29 30
or.smoke.bwt <- (weight.smoke.tbl[2,2] / weight.smoke.tbl[1,2]) / (weight.smoke.tbl[2,1] / weight.smoke.tbl[1,1])
or.smoke.bwt
## [1] 2.021944
The two most common questions in social science research and natural science research pertains to understanding how two variables are related to each other (correlation) and whether that relationship depends on another (mediating) variable.
We will discuss this more when we get to linear regression models. For now, consider the following two questions related to the birthweight data:
cor()
functionby()
functionwith(birthwt, cor(birthwt.grams, mother.age)) # Calculate correlation
## [1] 0.09031781
Does the correlation above change when we account for smoking status (Mediating variable)?
with(birthwt, cor(birthwt.grams[mother.smokes == "yes"], mother.age[mother.smokes == "yes"]))
## [1] -0.1441649
with(birthwt, cor(birthwt.grams[mother.smokes == "no"], mother.age[mother.smokes == "no"]))
## [1] 0.2014558
A Faster way to study mediating effects in Example 2 is to use the by()
function
Think of the by(data, INDICES, FUN)
function as a tapply()
function that operates on data frames instead of just vectors
When using tapply(X, INDEX, FUN)
, X
is generally a numeric vector
To calculate correlations, we need to allow X
to be a data frame or matrix
by(data = birthwt[c("birthwt.grams", "mother.age")],
INDICES = birthwt["mother.smokes"],
FUN = function(x) {cor(x[,1], x[,2])})
## mother.smokes: no
## [1] 0.2014558
## --------------------------------------------------------
## mother.smokes: yes
## [1] -0.1441649