Agenda

See TOC

Importing Data (From Last Week)

Important: Before you begin the examples, you need to download some data and to set the correct working directory so that Rstudio knows where the data is located on your computer.

  • Download Data:
    • go to the course website and download the file called ‘Course_Survey.csv’
    • Create a new folder on your computer, e.g. ‘RdataAnalysis’
  • Set working directory in Rstudio:
    • ‘Session’ (top of screen),
    • ‘Set Working Directory’,
    • ‘Choose Directory’
    • Find and select your new folder, e.g. ‘RdataAnalysis’.
  • Import Excel data file (.csv) into R using the read.table() command
survey <- read.table("~/Course_survey.csv", header=TRUE, sep=",")

Manipulating Dataframes - Basics

Let’s begin by doing some basic data manipulations

Examples

  • Ex. 1: Change factor \(\rightarrow\) numerical variable
    • transform() function
  • Ex. 2: Change numerical \(\rightarrow\) factor variable
    • mapvalues() function
  • Ex. 3: Changing levels of a factor
    • mapvalues() function
  • Ex. 4: Creating a new variable within a dataframe
    • transform() function
  • Ex. 5: Build contingency tables
    • table() function
  • Ex. 6: Referencing variables within dataframe
    • with() function

Ex. 1

Convert factor variable \(\rightarrow\) numerical variable

  • A lot of data comes with integer encodings of levels
  • You may want to convert the integers to more meaningful values, e.g. integers, for the purpose of your analysis.

  • Recall the Program' variable in thesurvey` data

head(survey$Program)
## [1] SOE       NonEcon   NonEcon   OtherEcon SOE       NonEcon  
## Levels: NonEcon OtherEcon SOE
  • Let’s convert these names to an integer as follows
survey <- transform(survey, Program=as.numeric(Program))
head(survey$Program)
## [1] 3 1 1 2 3 1

We can see now that the original variable codes are as follows: 1 = Non Econ, 2 = Other Econ, 3 = SOE

Ex. 2

Convert numerical variable \(\rightarrow\) factor variable

  • Undo what we just do, and change program values back to factor names

  • we can do this using the transform(), as.factor() and mapvalues() functions

  • Note, that mapvalues() function works as follows.

library(plyr)

# Mapvalues works on factors
y <- factor(c("a", "b", "c", "a"))
mapvalues(y, c("a", "c"), c("A", "C")) # mapvalues function takes three parameters: (i) vector object 'y'; (ii) original values or elements withing 'y', in this case lowercase 'a' and 'c';  and outcome parameter, uppercase 'A' and 'C'
## [1] A b C A
## Levels: A b C
survey <- transform(survey, 
                    Program = as.factor(mapvalues(Program, 
                                                  c(1, 2, 3), 
                                                  c("NonEcon", "OtherEcon", "SOE")))
                    )
head(survey$Program)
## [1] SOE       NonEcon   NonEcon   OtherEcon SOE       NonEcon  
## Levels: NonEcon OtherEcon SOE
# Check order of factor levels
levels(survey$Program)
## [1] "NonEcon"   "OtherEcon" "SOE"

Chagne order of factor levels using factor()

# Change Order of factor levels
survey$Program<- factor(survey$Program,  levels= c("SOE","OtherEcon","NonEcon"))
levels(survey$Program)
## [1] "SOE"       "OtherEcon" "NonEcon"

Ex. 3

Sometimes, we need to change the levels of a factor. Recall, from lecture 1 the following example:

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing life support mechanism may look like,

Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")

Health Example

Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")
summary(Health)
##    Length     Class      Mode 
##         6 character character
  • This character has 6 levels even though it should have 2 (dialysis, none)
    • This will be your job as part of a lab/hw assignment.

For now, I’ll give you another example on how to change levels of a factor using the Cars93 data in the MASS library

library(MASS)
names(Cars93) #list names of all variables
##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"
manufacturer <- Cars93$Manufacturer
head(manufacturer, 10)
##  [1] Acura    Acura    Audi     Audi     BMW      Buick    Buick   
##  [8] Buick    Buick    Cadillac
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo

We’ll use the mapvalues(x, from, to) function from the plyr library.

# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- mapvalues(manufacturer, 
                                   from = c("Chevrolet", "Pontiac", "Buick","Cadillac"), 
                                   to = rep("GM", 4))

head(manufacturer.combined, 10)
##  [1] Acura Acura Audi  Audi  BMW   GM    GM    GM    GM    GM   
## 29 Levels: Acura Audi BMW GM Chrylser Chrysler Dodge Eagle Ford ... Volvo

Ex. 4

  • It is often times necessary to create a new variable based on some type of transformation of existing variables

  • We will use the transform() function to add a new variable to existing dataframe
    • transform() returns a new data frame with columns modified or added as specified by the function call
names(Cars93) #list names of all variables
##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"
  • Lets create two new variables, KMPL.highway and KMPL.city based on transforming two existing variables in the Cars93 dataframe.
Cars93.metric <- transform(Cars93, 
                           KMPL.highway = 0.425 * MPG.highway, ## converts miles per gallion into km/l
                           KMPL.city = 0.425 * MPG.city
                           )
tail(names(Cars93.metric))
## [1] "Luggage.room" "Weight"       "Origin"       "Make"        
## [5] "KMPL.highway" "KMPL.city"
  • Our data frame has two new columns

A simpler approach

# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))
## [1] "Weight"       "Origin"       "Make"         "KMPL.highway"
## [5] "KMPL.city"    "KMPL.city.2"
  • Let’s check that both approaches did the same thing
identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)
## [1] TRUE

Ex. 5

Some more data frame summaries: table() function

  • The table() function builds contingency tables showing counts at each combination of factor levels
table(Cars93$AirBags)
## 
## Driver & Passenger        Driver only               None 
##                 16                 43                 34
table(Cars93$Origin)
## 
##     USA non-USA 
##      48      45
table(Cars93$AirBags, Cars93$Origin)
##                     
##                      USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18
  • Looks like US and non-US cars had about the same distribution of AirBag types

  • Later in the class we’ll learn how to do a hypothesis tests on this kind of data

Ex. 6

  • Thus far we’ve repeatedly typed out the data frame name when referencing its columns

  • This is because the data variables don’t exist in our working environment

  • Using with(data, expr) lets us specify that the code in expr should be evaluated in an environment that contains the elements of data as variables

with(Cars93, table(Origin, Type))
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           7    11      10     7      8   5
##   non-USA       9     0      12    14      6   4
  • Using with() and indexing
with(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!
## [1] FALSE
  • Using with() makes code simpler, easier to read, and easier to debug

Functions - Basics

  • We have used a lot of built-in functions: mean(), subset(), transform(), read.table()

  • An important part of programming and data analysis is to write custom functions

  • Functions help make code modular

  • Functions make debugging easier

  • Remember: this entire class is about applying functions to data

What is a function?

  • A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

Ex. 1

  • Let’s look at a really simple function
addOne <- function(x) {
  x + 1
}
  • x is the argument or input

  • The function output is the input x incremented by 1

addOne(12)
## [1] 13

Ex. 2

  • Here’s a function that returns a % given a numerator, denominator, and desired number of decimal values
calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)
## [1] 33.8
  • If you’re calculating several %’s for your report, you should use this kind of function instead of repeatedly copying and pasting code

Ex. 3

  • Calculate a 3 number summary: mean, median and standard deviation
threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)
##     mean   median       sd 
## 4.861348 5.166916 1.918449

If-else statements

  • Oftentimes we want our code to have different effects depending on the features of the input

  • Example: Calculating a student’s letter grade
  • If grade >= 90, assign A
  • Otherwise, if grade >= 80, assign B
  • Otherwise, if grade >= 70, assign C
  • In all other cases, assign F

  • To code this up, we use if-else statements

Ex. 1

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
sapply(course.grades, FUN=calculateLetterGrade)
## [1] "A" "C" "B" "A" "F"

Ex. 2

course.grades <- c(92, 78, 87, 91, 62)

Letter.Grades<-ifelse(course.grades>= 90, "A",
                      ifelse(course.grades< 90 & course.grades>=80, "B",
                             ifelse(course.grades < 80 & course.grades>=70, "C", "F")))
Letter.Grades
## [1] "A" "C" "B" "A" "F"

For loops

  • loops are ways of iterating over data

Syntax

  • A for loop executes a chunk of code for every value of an index variable in an index set

  • The basic syntax takes the form

for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}
  • The index set is often a vector of integers, but can be more general

Ex. 1

for(i in 1:4) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Ex. 2

phrase <- "Good Night, "
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
## [1] "Good Night,  and"
## [1] "Good Night,  and Good"
## [1] "Good Night,  and Good Luck"

Ex. 3

  • what if we want to know what kind of elements are contained in a list
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
typeof(index.set$name)
## [1] "character"
typeof(index.set$weight)
## [1] "double"
typeof(index.set$is.male)
## [1] "logical"
  • create a loop to get information for each element in the list
for(i in index.set) {
  print(c(i, typeof(i)))
}
## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

Ex. 4

  • Calculate sum of each column in a matrix
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] -0.9837844  0.5365768 -0.16519567 -0.6494568  0.8370631
## [2,] -0.1799905 -2.5936594 -0.00521573  0.8744326 -1.5922350
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
## [1]   2.679758  16.135101 -13.646715  18.589498  -7.228209
  • of course, R already has a built-in function to calculate column sums
colSums(fake.data)
## [1]   2.679758  16.135101 -13.646715  18.589498  -7.228209

Apply() Functions in plyr package

Command Description
apply(X, MARGIN, FUN) Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
lapply(X, FUN) Obtain a list by applying FUN to the elements of a list X
sapply(X, FUN) Simplified version of lapply. Returns a vector/array instead of list.
tapply(X, INDEX, FUN) Obtain a table by applying FUN to each combination of the factors given in INDEX
  • The different variants of the Apply() function family can be used under certain settings

  • These functions are (good!) alternatives to loops

  • They are typically more efficient than loops (often run considerably faster on large data sets)

  • Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • You can always type example(function) to get code examples (E.g., example(apply))

  • One of R’s many advantages

Examples

  • Ex. 1: apply()
  • Ex. 2: apply(), lapply(), sapply()
  • Ex. 3: tapply()
  • Ex. 4: tapply() vs aggregate()

Ex. 1

Get the mean of each column in the fake data we created earlier.

dim(fake.data) # 100 rows, 5 columns
## [1] 100   5
colMeans(fake.data)# Built-in function in R
## [1]  0.02679758  0.16135101 -0.13646715  0.18589498 -0.07228209
#Returns a mean for each column

Use apply() function in plyr library

apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns
## [1]  0.02679758  0.16135101 -0.13646715  0.18589498 -0.07228209

Create a function that calculates proportion of vector indexes that are > 0, and get the new means for each column

propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
## [1] 0.50 0.57 0.45 0.60 0.43

Ex. 2

apply(cars, 2, FUN=mean) # Data frames are arrays
## speed  dist 
## 15.40 42.98
lapply(cars, FUN=mean) # Data frames are also lists
## $speed
## [1] 15.4
## 
## $dist
## [1] 42.98
sapply(cars, FUN=mean) # sapply() is just simplified lapply()
## speed  dist 
## 15.40 42.98

Ex. 3

  • Think of tapply() as a generalized form of the table() function

  • Recall Cars93 Data

library(MASS)
names(Cars93)
##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"
# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)
##          
##           4WD Front Rear
##   USA       5    34    9
##   non-USA   5    33    7
  • Use tapply()
# Lets get the average 'MPG.City' by car 'Origin' and 'Drivetrain'
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean) #Same as above result
##          DriveTrain
## Origin     4WD    Front     Rear
##   USA     17.6 22.14706 18.33333
##   non-USA 23.4 24.93939 19.14286

Use tapply() and indexing

#Now, let's get the average 'Horsepower' by car `Origin` and `Type`

tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)
##          Type
## Origin     Compact    Large  Midsize    Small   Sporty    Van
##   USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
##   non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25
  • What’s that NA doing there?
any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")
## [1] FALSE
  • None of the non-USA manufacturers produced Large cars!

Ex. 4

  • Let’s first recall what tapply() does

  • Command: tapply(X, INDEX, FUN)
    • Applies FUN to X grouped by factors in INDEX
  • aggregate() performs a similar operation, but presents the results in a form that is at times more convenient

  • There are many ways to call the aggregate() function

  • Analog of tapply call: aggregate(X, by, FUN)
    • Here, by is exactly like INDEX

tapply vs aggregate

with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN = mean)) # tapply
##          Compact    Large  Midsize    Small   Sporty    Van
## USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
## non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25
with(Cars93, aggregate(Horsepower, by = list(Origin, Type), FUN = mean)) # aggregate
##    Group.1 Group.2         x
## 1      USA Compact 117.42857
## 2  non-USA Compact 141.55556
## 3      USA   Large 179.45455
## 4      USA Midsize 153.50000
## 5  non-USA Midsize 189.41667
## 6      USA   Small  89.42857
## 7  non-USA   Small  91.78571
## 8      USA  Sporty 166.50000
## 9  non-USA  Sporty 151.66667
## 10     USA     Van 158.40000
## 11 non-USA     Van 138.25000
aggregate(Horsepower ~ Origin + Type, FUN=mean, data=Cars93) # aggregate with different syntax
##     Origin    Type Horsepower
## 1      USA Compact  117.42857
## 2  non-USA Compact  141.55556
## 3      USA   Large  179.45455
## 4      USA Midsize  153.50000
## 5  non-USA Midsize  189.41667
## 6      USA   Small   89.42857
## 7  non-USA   Small   91.78571
## 8      USA  Sporty  166.50000
## 9  non-USA  Sporty  151.66667
## 10     USA     Van  158.40000
## 11 non-USA     Van  138.25000

Let’s Review

  • We’re going to go over some of the main things we’ve learned so far by operating on the birthwt dataset from the MASS library

  • Let’s get it loaded and see what we’re working

Examples

  • Ex. 1: Data structure and summary
  • Ex. 2: Renaming the variables
  • Ex. 3: Renaming the factors
  • Ex. 4: A simple table
  • Ex. 5: Calculate odds from a contingency table using table()

Ex. 1

Let’s get the structure of the data

library(MASS)
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
  • Notice that all of the variables are integers (numerical). This is likely incorrect.. For instance, the variable race should be a factor (categorical) variable. We’ll correct this in example 3.

  • Now, let’s get summary information for each variable

library(MASS)
summary(birthwt)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Ex. 2

  • The dataset doesn’t come with very descriptive variable names

  • Let’s get better column names (use help(birthwt) to understand the variables and come up with better names)

colnames(birthwt) 
##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
##  [9] "ftv"   "bwt"
# The default names are not very descriptive

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", 
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", 
    "physician.visits", "birthwt.grams")

# Better names!

Ex. 3

  • All the factors are currently represented as integers

  • Let’s use the transform() and mapvalues() functions to convert variables to factors and give the factors more meaningful levels

library(plyr)
birthwt <- transform(birthwt, 
            race = as.factor(mapvalues(race, c(1, 2, 3), 
                              c("white","black", "other"))),
            mother.smokes = as.factor(mapvalues(mother.smokes, 
                              c(0,1), c("no", "yes"))),
            hypertension = as.factor(mapvalues(hypertension, 
                              c(0,1), c("no", "yes"))),
            uterine.irr = as.factor(mapvalues(uterine.irr, 
                              c(0,1), c("no", "yes"))),
            birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
                              c(0,1), c("no", "yes")))
            )
  • Now that things are coded correctly, we can look at an overall summary
summary(birthwt)
##  birthwt.below.2500   mother.age    mother.weight      race   
##  no :130            Min.   :14.00   Min.   : 80.0   black:26  
##  yes: 59            1st Qu.:19.00   1st Qu.:110.0   other:67  
##                     Median :23.00   Median :121.0   white:96  
##                     Mean   :23.24   Mean   :129.8             
##                     3rd Qu.:26.00   3rd Qu.:140.0             
##                     Max.   :45.00   Max.   :250.0             
##  mother.smokes previous.prem.labor hypertension uterine.irr
##  no :115       Min.   :0.0000      no :177      no :161    
##  yes: 74       1st Qu.:0.0000      yes: 12      yes: 28    
##                Median :0.0000                              
##                Mean   :0.1958                              
##                3rd Qu.:0.0000                              
##                Max.   :3.0000                              
##  physician.visits birthwt.grams 
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Ex. 4

  • Let’s use the tapply() function to see what the average birthweight looks like when broken down by race and smoking status
with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean)) 
##             no      yes
## black 2854.500 2504.000
## other 2815.782 2757.167
## white 3428.750 2826.846

What if we wanted nicer looking output? - Let’s use the header {r, results='asis'}, along with the kable() function from the knitr library

library(knitr)
bwt.tbl <- with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
kable(bwt.tbl, format = "markdown")
no yes
black 2854.500 2504.000
other 2815.782 2757.167
white 3428.750 2826.846
  • kable() outputs the table in a way that Markdown can read and nicely display

  • Note: changing the CSS changes the table appearance

Ex.5

  • What are the odds of having low birthweight among non-smoking mothers?

  • First, let’s look at the contingency table

weight.smoke.tbl <- with(birthwt, table(birthwt.below.2500, mother.smokes))
weight.smoke.tbl
##                   mother.smokes
## birthwt.below.2500 no yes
##                no  86  44
##                yes 29  30
  • The odds of having low birthweight among non-smoking mothers is
or.smoke.bwt <- (weight.smoke.tbl[2,2] / weight.smoke.tbl[1,2]) / (weight.smoke.tbl[2,1] / weight.smoke.tbl[1,1])
or.smoke.bwt
## [1] 2.021944
  • So the odds of low birth weight are 2 times higher when the mother smokes

Correlations and Mediating Variable

The two most common questions in social science research and natural science research pertains to understanding how two variables are related to each other (correlation) and whether that relationship depends on another (mediating) variable.

We will discuss this more when we get to linear regression models. For now, consider the following two questions related to the birthweight data:

  1. Is the mother’s age correlated with birth weight?
  2. Does this change when we account for smoking status (Mediating variable)?

Examples

  • Ex. 1: Correlation
    • cor() function
  • Ex. 2: Mediating variable
  • Ex. 3: Mediating variable
    • by() function

Ex. 1

  • Is the mother’s age correlated with birth weight?
with(birthwt, cor(birthwt.grams, mother.age))  # Calculate correlation
## [1] 0.09031781

Ex. 2

Does the correlation above change when we account for smoking status (Mediating variable)?

with(birthwt, cor(birthwt.grams[mother.smokes == "yes"], mother.age[mother.smokes == "yes"]))
## [1] -0.1441649
with(birthwt, cor(birthwt.grams[mother.smokes == "no"], mother.age[mother.smokes == "no"]))
## [1] 0.2014558
  • Notice that the correlation between the two variables, 0.0903178, is clearly mediated by whether or not the mother smokes.

Ex. 3

A Faster way to study mediating effects in Example 2 is to use the by() function

  • Think of the by(data, INDICES, FUN) function as a tapply() function that operates on data frames instead of just vectors

  • When using tapply(X, INDEX, FUN), X is generally a numeric vector

  • To calculate correlations, we need to allow X to be a data frame or matrix

by(data = birthwt[c("birthwt.grams", "mother.age")], 
   INDICES = birthwt["mother.smokes"], 
   FUN = function(x) {cor(x[,1], x[,2])})
## mother.smokes: no
## [1] 0.2014558
## -------------------------------------------------------- 
## mother.smokes: yes
## [1] -0.1441649

Reminder

  • Lab 1 Today