Week 3 - More data manipulation and programming basics

Agenda

See TOC

Importing Data (From Last Week)

Important: Before you begin the examples, you need to download some data and to set the correct working directory so that Rstudio knows where the data is located on your computer.

Download Data:
- go to the course website and download the file called ‘Course_Survey.csv’
- Create a new folder on your computer, e.g. ‘RdataAnalysis’
Set working directory in Rstudio:
- ‘Session’ (top of screen),
- ‘Set Working Directory’,
- ‘Choose Directory’
- Find and select your new folder, e.g. ‘RdataAnalysis’.
Import Excel data file (.csv) into R using the read.table() command

survey <- read.table("~/Course_survey.csv", header=TRUE, sep=",")

Manipulating Dataframes - Basics

Let’s begin by doing some basic data manipulations

Examples

Ex. 1: Change factor \(\rightarrow\) numerical variable
- transform() function
Ex. 2: Change numerical \(\rightarrow\) factor variable
- mapvalues() function
Ex. 3: Changing levels of a factor
- mapvalues() function
Ex. 4: Creating a new variable within a dataframe
- transform() function
Ex. 5: Build contingency tables
- table() function
Ex. 6: Referencing variables within dataframe
- with() function

Ex. 1

Convert factor variable \(\rightarrow\) numerical variable

A lot of data comes with integer encodings of levels
You may want to convert the integers to more meaningful values, e.g. integers, for the purpose of your analysis.
Recall the Program' variable in thesurvey` data

head(survey$Program)

## [1] SOE       NonEcon   NonEcon   OtherEcon SOE       NonEcon  
## Levels: NonEcon OtherEcon SOE

Let’s convert these names to an integer as follows

survey <- transform(survey, Program=as.numeric(Program))
head(survey$Program)

## [1] 3 1 1 2 3 1

We can see now that the original variable codes are as follows: 1 = Non Econ, 2 = Other Econ, 3 = SOE

Ex. 2

Convert numerical variable \(\rightarrow\) factor variable

Undo what we just do, and change program values back to factor names
we can do this using the transform(), as.factor() and mapvalues() functions
Note, that mapvalues() function works as follows.

library(plyr)

# Mapvalues works on factors
y <- factor(c("a", "b", "c", "a"))
mapvalues(y, c("a", "c"), c("A", "C")) # mapvalues function takes three parameters: (i) vector object 'y'; (ii) original values or elements withing 'y', in this case lowercase 'a' and 'c';  and outcome parameter, uppercase 'A' and 'C'

## [1] A b C A
## Levels: A b C

survey <- transform(survey, 
                    Program = as.factor(mapvalues(Program, 
                                                  c(1, 2, 3), 
                                                  c("NonEcon", "OtherEcon", "SOE")))
                    )
head(survey$Program)

## [1] SOE       NonEcon   NonEcon   OtherEcon SOE       NonEcon  
## Levels: NonEcon OtherEcon SOE

# Check order of factor levels
levels(survey$Program)

## [1] "NonEcon"   "OtherEcon" "SOE"

Chagne order of factor levels using factor()

# Change Order of factor levels
survey$Program<- factor(survey$Program,  levels= c("SOE","OtherEcon","NonEcon"))
levels(survey$Program)

## [1] "SOE"       "OtherEcon" "NonEcon"

Ex. 3

Sometimes, we need to change the levels of a factor. Recall, from lecture 1 the following example:

When data is entered manually, misspellings and case changes are very common
E.g., a column showing life support mechanism may look like,

Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")

Health Example

Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")
summary(Health)

##    Length     Class      Mode 
##         6 character character

This character has 6 levels even though it should have 2 (dialysis, none)
- This will be your job as part of a lab/hw assignment.

For now, I’ll give you another example on how to change levels of a factor using the Cars93 data in the MASS library

library(MASS)
names(Cars93) #list names of all variables

##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"

manufacturer <- Cars93$Manufacturer
head(manufacturer, 10)

##  [1] Acura    Acura    Audi     Audi     BMW      Buick    Buick   
##  [8] Buick    Buick    Cadillac
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo

We’ll use the mapvalues(x, from, to) function from the plyr library.

# Map Chevrolet, Pontiac and Buick to GM
manufacturer.combined <- mapvalues(manufacturer, 
                                   from = c("Chevrolet", "Pontiac", "Buick","Cadillac"), 
                                   to = rep("GM", 4))

head(manufacturer.combined, 10)

##  [1] Acura Acura Audi  Audi  BMW   GM    GM    GM    GM    GM   
## 29 Levels: Acura Audi BMW GM Chrylser Chrysler Dodge Eagle Ford ... Volvo

Ex. 4

It is often times necessary to create a new variable based on some type of transformation of existing variables
We will use the transform() function to add a new variable to existing dataframe
- transform() returns a new data frame with columns modified or added as specified by the function call

names(Cars93) #list names of all variables

##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"

Lets create two new variables, KMPL.highway and KMPL.city based on transforming two existing variables in the Cars93 dataframe.

Cars93.metric <- transform(Cars93, 
                           KMPL.highway = 0.425 * MPG.highway, ## converts miles per gallion into km/l
                           KMPL.city = 0.425 * MPG.city
                           )
tail(names(Cars93.metric))

## [1] "Luggage.room" "Weight"       "Origin"       "Make"        
## [5] "KMPL.highway" "KMPL.city"

Our data frame has two new columns

A simpler approach

# Add a new column called KMPL.city.2
Cars93.metric$KMPL.city.2 <- 0.425 * Cars93$MPG.city
tail(names(Cars93.metric))

## [1] "Weight"       "Origin"       "Make"         "KMPL.highway"
## [5] "KMPL.city"    "KMPL.city.2"

Let’s check that both approaches did the same thing

identical(Cars93.metric$KMPL.city, Cars93.metric$KMPL.city.2)

## [1] TRUE

Ex. 5

Some more data frame summaries: table() function

The table() function builds contingency tables showing counts at each combination of factor levels

table(Cars93$AirBags)

## 
## Driver & Passenger        Driver only               None 
##                 16                 43                 34

table(Cars93$Origin)

## 
##     USA non-USA 
##      48      45

table(Cars93$AirBags, Cars93$Origin)

##                     
##                      USA non-USA
##   Driver & Passenger   9       7
##   Driver only         23      20
##   None                16      18

Looks like US and non-US cars had about the same distribution of AirBag types
Later in the class we’ll learn how to do a hypothesis tests on this kind of data

Ex. 6

Thus far we’ve repeatedly typed out the data frame name when referencing its columns
This is because the data variables don’t exist in our working environment
Using with(data, expr) lets us specify that the code in expr should be evaluated in an environment that contains the elements of data as variables

with(Cars93, table(Origin, Type))

##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           7    11      10     7      8   5
##   non-USA       9     0      12    14      6   4

Using with() and indexing

with(Cars93, any(Origin == "non-USA" & Type == "Large")) # Same effect!

## [1] FALSE

Using with() makes code simpler, easier to read, and easier to debug

Functions - Basics

We have used a lot of built-in functions: mean(), subset(), transform(), read.table()…
An important part of programming and data analysis is to write custom functions
Functions help make code modular
Functions make debugging easier
Remember: this entire class is about applying functions to data

What is a function?

A function is a machine that turns input objects (arguments) into an output object (return value) according to a definite rule.

Ex. 1

Let’s look at a really simple function

addOne <- function(x) {
  x + 1
}

x is the argument or input
The function output is the input x incremented by 1

addOne(12)

## [1] 13

Ex. 2

Here’s a function that returns a % given a numerator, denominator, and desired number of decimal values

calculatePercentage <- function(x, y, d) {
  decimal <- x / y  # Calculate decimal value
  round(100 * decimal, d)  # Convert to % and round to d digits
}

calculatePercentage(27, 80, 1)

## [1] 33.8

If you’re calculating several %’s for your report, you should use this kind of function instead of repeatedly copying and pasting code

Ex. 3

Calculate a 3 number summary: mean, median and standard deviation

threeNumberSummary <- function(x) {
  c(mean=mean(x), median=median(x), sd=sd(x))
}
x <- rnorm(100, mean=5, sd=2) # Vector of 100 normals with mean 5 and sd 2
threeNumberSummary(x)

##     mean   median       sd 
## 4.861348 5.166916 1.918449

If-else statements

Oftentimes we want our code to have different effects depending on the features of the input
Example: Calculating a student’s letter grade
If grade >= 90, assign A
Otherwise, if grade >= 80, assign B
Otherwise, if grade >= 70, assign C
In all other cases, assign F
To code this up, we use if-else statements

Ex. 1

calculateLetterGrade <- function(x) {
  if(x >= 90) {
    grade <- "A"
  } else if(x >= 80) {
    grade <- "B"
  } else if(x >= 70) {
    grade <- "C"
  } else {
    grade <- "F"
  }
  grade
}

course.grades <- c(92, 78, 87, 91, 62)
sapply(course.grades, FUN=calculateLetterGrade)

## [1] "A" "C" "B" "A" "F"

Ex. 2

course.grades <- c(92, 78, 87, 91, 62)

Letter.Grades<-ifelse(course.grades>= 90, "A",
                      ifelse(course.grades< 90 & course.grades>=80, "B",
                             ifelse(course.grades < 80 & course.grades>=70, "C", "F")))
Letter.Grades

## [1] "A" "C" "B" "A" "F"

For loops

loops are ways of iterating over data

Syntax

A for loop executes a chunk of code for every value of an index variable in an index set
The basic syntax takes the form

for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}

The index set is often a vector of integers, but can be more general

Ex. 1

for(i in 1:4) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4

Ex. 2

phrase <- "Good Night, "
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}

## [1] "Good Night,  and"
## [1] "Good Night,  and Good"
## [1] "Good Night,  and Good Luck"

Ex. 3

what if we want to know what kind of elements are contained in a list

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
typeof(index.set$name)

## [1] "character"

typeof(index.set$weight)

## [1] "double"

typeof(index.set$is.male)

## [1] "logical"

create a loop to get information for each element in the list

for(i in index.set) {
  print(c(i, typeof(i)))
}

## [1] "Michael"   "character"
## [1] "185"    "double"
## [1] "TRUE"    "logical"

Ex. 4

Calculate sum of each column in a matrix

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows

##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] -0.9837844  0.5365768 -0.16519567 -0.6494568  0.8370631
## [2,] -0.1799905 -2.5936594 -0.00521573  0.8744326 -1.5922350

col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums

## [1]   2.679758  16.135101 -13.646715  18.589498  -7.228209

of course, R already has a built-in function to calculate column sums

colSums(fake.data)

## [1]   2.679758  16.135101 -13.646715  18.589498  -7.228209

Apply() Functions in plyr package

Command	Description
`apply(X, MARGIN, FUN)`	Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`lapply(X, FUN)`	Obtain a list by applying `FUN` to the elements of a list `X`
`sapply(X, FUN)`	Simplified version of `lapply`. Returns a vector/array instead of list.
`tapply(X, INDEX, FUN)`	Obtain a table by applying `FUN` to each combination of the factors given in `INDEX`

The different variants of the Apply() function family can be used under certain settings
These functions are (good!) alternatives to loops
They are typically more efficient than loops (often run considerably faster on large data sets)
Take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
You can always type example(function) to get code examples (E.g., example(apply))
One of R’s many advantages

Examples

Ex. 1: apply()
Ex. 2: apply(), lapply(), sapply()
Ex. 3: tapply()
Ex. 4: tapply() vs aggregate()

Ex. 1

Get the mean of each column in the fake data we created earlier.

dim(fake.data) # 100 rows, 5 columns

## [1] 100   5

colMeans(fake.data)# Built-in function in R

## [1]  0.02679758  0.16135101 -0.13646715  0.18589498 -0.07228209

#Returns a mean for each column

Use apply() function in plyr library

apply(fake.data, MARGIN=2, FUN=mean) # MARGIN = 1 for rows, 2 for columns

## [1]  0.02679758  0.16135101 -0.13646715  0.18589498 -0.07228209

Create a function that calculates proportion of vector indexes that are > 0, and get the new means for each column

propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)

## [1] 0.50 0.57 0.45 0.60 0.43

Ex. 2

apply(cars, 2, FUN=mean) # Data frames are arrays

## speed  dist 
## 15.40 42.98

lapply(cars, FUN=mean) # Data frames are also lists

## $speed
## [1] 15.4
## 
## $dist
## [1] 42.98

sapply(cars, FUN=mean) # sapply() is just simplified lapply()

## speed  dist 
## 15.40 42.98

Ex. 3

Think of tapply() as a generalized form of the table() function
Recall Cars93 Data

library(MASS)
names(Cars93)

##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"

# Get a count table, data broken down by Origin and DriveTrain
table(Cars93$Origin, Cars93$DriveTrain)

##          
##           4WD Front Rear
##   USA       5    34    9
##   non-USA   5    33    7

Use tapply()

# Lets get the average 'MPG.City' by car 'Origin' and 'Drivetrain'
tapply(Cars93$MPG.city, INDEX = Cars93[c("Origin", "DriveTrain")], FUN=mean) #Same as above result

##          DriveTrain
## Origin     4WD    Front     Rear
##   USA     17.6 22.14706 18.33333
##   non-USA 23.4 24.93939 19.14286

Use tapply() and indexing

#Now, let's get the average 'Horsepower' by car `Origin` and `Type`

tapply(Cars93[["Horsepower"]], INDEX = Cars93[c("Origin", "Type")], FUN=mean)

##          Type
## Origin     Compact    Large  Midsize    Small   Sporty    Van
##   USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
##   non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25

What’s that NA doing there?

any(Cars93$Origin == "non-USA" & Cars93$Type == "Large")

## [1] FALSE

None of the non-USA manufacturers produced Large cars!

Ex. 4

Let’s first recall what tapply() does
Command: tapply(X, INDEX, FUN)
- Applies FUN to X grouped by factors in INDEX
aggregate() performs a similar operation, but presents the results in a form that is at times more convenient
There are many ways to call the aggregate() function
Analog of tapply call: aggregate(X, by, FUN)
- Here, by is exactly like INDEX

tapply vs aggregate

with(Cars93, tapply(Horsepower, INDEX = list(Origin, Type), FUN = mean)) # tapply

##          Compact    Large  Midsize    Small   Sporty    Van
## USA     117.4286 179.4545 153.5000 89.42857 166.5000 158.40
## non-USA 141.5556       NA 189.4167 91.78571 151.6667 138.25

with(Cars93, aggregate(Horsepower, by = list(Origin, Type), FUN = mean)) # aggregate

##    Group.1 Group.2         x
## 1      USA Compact 117.42857
## 2  non-USA Compact 141.55556
## 3      USA   Large 179.45455
## 4      USA Midsize 153.50000
## 5  non-USA Midsize 189.41667
## 6      USA   Small  89.42857
## 7  non-USA   Small  91.78571
## 8      USA  Sporty 166.50000
## 9  non-USA  Sporty 151.66667
## 10     USA     Van 158.40000
## 11 non-USA     Van 138.25000

aggregate(Horsepower ~ Origin + Type, FUN=mean, data=Cars93) # aggregate with different syntax

##     Origin    Type Horsepower
## 1      USA Compact  117.42857
## 2  non-USA Compact  141.55556
## 3      USA   Large  179.45455
## 4      USA Midsize  153.50000
## 5  non-USA Midsize  189.41667
## 6      USA   Small   89.42857
## 7  non-USA   Small   91.78571
## 8      USA  Sporty  166.50000
## 9  non-USA  Sporty  151.66667
## 10     USA     Van  158.40000
## 11 non-USA     Van  138.25000

Let’s Review

We’re going to go over some of the main things we’ve learned so far by operating on the birthwt dataset from the MASS library
Let’s get it loaded and see what we’re working

Examples

Ex. 1: Data structure and summary
Ex. 2: Renaming the variables
Ex. 3: Renaming the factors
Ex. 4: A simple table
Ex. 5: Calculate odds from a contingency table using table()

Ex. 1

Let’s get the structure of the data

library(MASS)
str(birthwt)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Notice that all of the variables are integers (numerical). This is likely incorrect.. For instance, the variable race should be a factor (categorical) variable. We’ll correct this in example 3.
Now, let’s get summary information for each variable

library(MASS)
summary(birthwt)

##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Ex. 2

The dataset doesn’t come with very descriptive variable names
Let’s get better column names (use help(birthwt) to understand the variables and come up with better names)

colnames(birthwt)

##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
##  [9] "ftv"   "bwt"

# The default names are not very descriptive

colnames(birthwt) <- c("birthwt.below.2500", "mother.age", "mother.weight", 
    "race", "mother.smokes", "previous.prem.labor", "hypertension", "uterine.irr", 
    "physician.visits", "birthwt.grams")

# Better names!

Ex. 3

All the factors are currently represented as integers
Let’s use the transform() and mapvalues() functions to convert variables to factors and give the factors more meaningful levels

library(plyr)
birthwt <- transform(birthwt, 
            race = as.factor(mapvalues(race, c(1, 2, 3), 
                              c("white","black", "other"))),
            mother.smokes = as.factor(mapvalues(mother.smokes, 
                              c(0,1), c("no", "yes"))),
            hypertension = as.factor(mapvalues(hypertension, 
                              c(0,1), c("no", "yes"))),
            uterine.irr = as.factor(mapvalues(uterine.irr, 
                              c(0,1), c("no", "yes"))),
            birthwt.below.2500 = as.factor(mapvalues(birthwt.below.2500,
                              c(0,1), c("no", "yes")))
            )

Now that things are coded correctly, we can look at an overall summary

summary(birthwt)

##  birthwt.below.2500   mother.age    mother.weight      race   
##  no :130            Min.   :14.00   Min.   : 80.0   black:26  
##  yes: 59            1st Qu.:19.00   1st Qu.:110.0   other:67  
##                     Median :23.00   Median :121.0   white:96  
##                     Mean   :23.24   Mean   :129.8             
##                     3rd Qu.:26.00   3rd Qu.:140.0             
##                     Max.   :45.00   Max.   :250.0             
##  mother.smokes previous.prem.labor hypertension uterine.irr
##  no :115       Min.   :0.0000      no :177      no :161    
##  yes: 74       1st Qu.:0.0000      yes: 12      yes: 28    
##                Median :0.0000                              
##                Mean   :0.1958                              
##                3rd Qu.:0.0000                              
##                Max.   :3.0000                              
##  physician.visits birthwt.grams 
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Ex. 4

Let’s use the tapply() function to see what the average birthweight looks like when broken down by race and smoking status

with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))

##             no      yes
## black 2854.500 2504.000
## other 2815.782 2757.167
## white 3428.750 2826.846

What if we wanted nicer looking output? - Let’s use the header {r, results='asis'}, along with the kable() function from the knitr library

library(knitr)
bwt.tbl <- with(birthwt, tapply(birthwt.grams, INDEX = list(race, mother.smokes), FUN = mean))
kable(bwt.tbl, format = "markdown")

	no	yes
black	2854.500	2504.000
other	2815.782	2757.167
white	3428.750	2826.846

kable() outputs the table in a way that Markdown can read and nicely display
Note: changing the CSS changes the table appearance

Ex.5

What are the odds of having low birthweight among non-smoking mothers?
First, let’s look at the contingency table

weight.smoke.tbl <- with(birthwt, table(birthwt.below.2500, mother.smokes))
weight.smoke.tbl

##                   mother.smokes
## birthwt.below.2500 no yes
##                no  86  44
##                yes 29  30

The odds of having low birthweight among non-smoking mothers is

or.smoke.bwt <- (weight.smoke.tbl[2,2] / weight.smoke.tbl[1,2]) / (weight.smoke.tbl[2,1] / weight.smoke.tbl[1,1])
or.smoke.bwt

## [1] 2.021944

So the odds of low birth weight are 2 times higher when the mother smokes

Correlations and Mediating Variable

The two most common questions in social science research and natural science research pertains to understanding how two variables are related to each other (correlation) and whether that relationship depends on another (mediating) variable.

We will discuss this more when we get to linear regression models. For now, consider the following two questions related to the birthweight data:

Is the mother’s age correlated with birth weight?
Does this change when we account for smoking status (Mediating variable)?

Examples

Ex. 1: Correlation
- cor() function
Ex. 2: Mediating variable
Ex. 3: Mediating variable
- by() function

Ex. 1

Is the mother’s age correlated with birth weight?

with(birthwt, cor(birthwt.grams, mother.age))  # Calculate correlation

## [1] 0.09031781

Ex. 2

Does the correlation above change when we account for smoking status (Mediating variable)?

with(birthwt, cor(birthwt.grams[mother.smokes == "yes"], mother.age[mother.smokes == "yes"]))

## [1] -0.1441649

with(birthwt, cor(birthwt.grams[mother.smokes == "no"], mother.age[mother.smokes == "no"]))

## [1] 0.2014558

Notice that the correlation between the two variables, 0.0903178, is clearly mediated by whether or not the mother smokes.

Ex. 3

A Faster way to study mediating effects in Example 2 is to use the by() function

Think of the by(data, INDICES, FUN) function as a tapply() function that operates on data frames instead of just vectors
When using tapply(X, INDEX, FUN), X is generally a numeric vector
To calculate correlations, we need to allow X to be a data frame or matrix

by(data = birthwt[c("birthwt.grams", "mother.age")], 
   INDICES = birthwt["mother.smokes"], 
   FUN = function(x) {cor(x[,1], x[,2])})

## mother.smokes: no
## [1] 0.2014558
## -------------------------------------------------------- 
## mother.smokes: yes
## [1] -0.1441649

Reminder

Lab 1 Today

Week 3 - More data manipulation and programming basics

Prof. Anthony Howell

09/25/2017

Agenda

Importing Data (From Last Week)

Manipulating Dataframes - Basics

Examples

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Ex. 5

Ex. 6

Functions - Basics

Ex. 1

Ex. 2

Ex. 3

If-else statements

Ex. 1

Ex. 2

For loops

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Apply() Functions in plyr package

Examples

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Let’s Review

Examples

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Ex.5

Correlations and Mediating Variable

Examples

Ex. 1

Ex. 2

Ex. 3

Reminder