This notes file is a place for you to practice writing and executing the code found in the lecture notes. This is also a spot where you can (and should) write your own notes and thoughts. Explain what you are doing in each code chunk in your own words.

Preparation

Before class, run the following code chunk to make sure it works. If it does not, try to understand what the error message is telling you. Refer to the FAQ page for assistance.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
ncbirths <- openintro::ncbirths

Missing Data

head(ncbirths)
## # A tibble: 6 × 13
##    fage  mage mature    weeks premie visits marital gained weight lowbirthweight
##   <int> <int> <fct>     <int> <fct>   <int> <fct>    <int>  <dbl> <fct>         
## 1    NA    13 younger …    39 full …     10 not ma…     38   7.63 not low       
## 2    NA    14 younger …    42 full …     15 not ma…     20   7.88 not low       
## 3    19    15 younger …    37 full …     11 not ma…     38   6.63 not low       
## 4    21    15 younger …    41 full …      6 not ma…     34   8    not low       
## 5    NA    15 younger …    39 full …      9 not ma…     27   6.38 not low       
## 6    NA    15 younger …    38 full …     19 not ma…     22   5.38 low           
## # ℹ 3 more variables: gender <fct>, habit <fct>, whitemom <fct>

Problems

mean(ncbirths$fage)
## [1] NA
ggplot(ncbirths, aes(premie)) + geom_bar()
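
mean() returns NA above because fage contains missing values, and by default arithmetic that involves an NA gives NA. A minimal sketch of the usual workaround, using a small made-up vector (toy_age is a hypothetical name, not a column of ncbirths):

```r
# Hypothetical toy vector with one missing value
toy_age <- c(19, 21, NA, 30)

mean(toy_age)                # NA: the missing value propagates
mean(toy_age, na.rm = TRUE)  # drops the NA first, then averages 19, 21 and 30
```

The same na.rm = TRUE argument should work on ncbirths$fage, but check how many values are being dropped before relying on the result.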

Identifying missing values

table(ncbirths$habit, useNA="always")
## 
## nonsmoker    smoker      <NA> 
##       873       126         1
summary(ncbirths$fage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   25.00   30.00   30.26   35.00   55.00     171
x <- c("green", NA, 3)
is.na(x)
## [1] FALSE  TRUE FALSE
sum(is.na(ncbirths$fage))
## [1] 171
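
Since is.na() returns a logical vector and TRUE counts as 1 in arithmetic, sum() gives the number of missing values and mean() gives the proportion missing. A small sketch on a made-up vector (toy_x is hypothetical):

```r
# Hypothetical toy vector with two missing values out of five
toy_x <- c(4, NA, 7, NA, 10)

is.na(toy_x)        # FALSE TRUE FALSE TRUE FALSE
sum(is.na(toy_x))   # 2 missing values
mean(is.na(toy_x))  # 0.4, so 40% of the entries are missing
```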

Summarizing data

Two common ways to do this are with table() and summary().

Frequency Tables for categorical data

Create a frequency table for whether or not the baby was born underweight.

table(ncbirths$lowbirthweight, useNA = "ifany")
## 
##     low not low 
##     111     889

Do it again, but always display the count of missing values, even when it is zero.

table(ncbirths$lowbirthweight, useNA = "always")
## 
##     low not low    <NA> 
##     111     889       0

Summary statistics for numerical data

Summary statistics for the number of visits.

summary(ncbirths$visits)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    10.0    12.0    12.1    15.0    30.0       9

Data editing (recoding)

data$variable[data$variable == value] <- new_value # general template: select the entries of variable equal to value and overwrite them with new_value
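
To see the recoding pattern run end to end, here is a sketch on a made-up data frame. The names toy and score, and the use of -99 as a stand-in "missing" code, are assumptions for illustration only:

```r
# Hypothetical toy data frame; -99 stands in for a "missing" code
toy <- data.frame(id = 1:5, score = c(12, -99, 15, -99, 9))

# Pick out the entries of score that equal -99 and overwrite them with NA
toy$score[toy$score == -99] <- NA

summary(toy$score)  # the NA's column should now report 2
```

The logical test inside the brackets decides which entries are replaced; everything else is left untouched.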

Example 1: Too low birth weight

Set all records where weight=1 to missing.

summary(ncbirths$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.380   7.310   7.101   8.060  11.750
ncbirths$weight[ncbirths$weight==1] <- NA

Confirm it worked by creating a box plot of weight.

boxplot(ncbirths$weight)

# Also set every weight below 4 to missing, then redraw the box plot
ncbirths$weight[ncbirths$weight < 4] <- NA
boxplot(ncbirths$weight)


Creating new variables

ncbirths$new_variable <- ncbirths$gained  # how to add more variables

Example: basic arithmetic on existing variables

Create a new variable wtgain_mom, the weight gained by the mother that is not due to the baby, by subtracting weight from gained.

ncbirths$wtgain_mom <- ncbirths$gained - ncbirths$weight

Confirm this variable was created correctly

head(ncbirths[,c('gained', 'weight', 'wtgain_mom')])
## # A tibble: 6 × 3
##   gained weight wtgain_mom
##    <int>  <dbl>      <dbl>
## 1     38   7.63       30.4
## 2     20   7.88       12.1
## 3     38   6.63       31.4
## 4     34   8          26  
## 5     27   6.38       20.6
## 6     22   5.38       16.6

Dichotomizing data

Make a new variable underage on the ncbirths data set. If mage is under 18, the value of this new variable is "underage"; otherwise it is "adult".

ncbirths$underage <- ifelse(ncbirths$mage < 18, "underage", "adult")

Confirm it worked.

table(ncbirths$underage, useNA="always")
## 
##    adult underage     <NA> 
##      963       37        0
ncbirths[ncbirths$mage %in% c(17,18),c('mage', 'underage')]
## # A tibble: 57 × 2
##     mage underage
##    <int> <chr>   
##  1    17 underage
##  2    17 underage
##  3    17 underage
##  4    17 underage
##  5    17 underage
##  6    17 underage
##  7    17 underage
##  8    17 underage
##  9    17 underage
## 10    17 underage
## # ℹ 47 more rows
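
ifelse() checks the condition one element at a time. Where the condition is NA the result is NA too, which is one reason to confirm with useNA = "always". A sketch on a made-up vector (toy_mage and toy_group are hypothetical names):

```r
# Hypothetical ages, including the boundary value 18 and one missing value
toy_mage <- c(15, 17, 18, 30, NA)

# Under 18 becomes "underage", everything else "adult"; NA stays NA
toy_group <- ifelse(toy_mage < 18, "underage", "adult")

toy_group
table(toy_group, useNA = "always")
```

Note that 18 itself lands in "adult" because the test is strictly less than 18.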

Chaining using the pipe %>%

table(ncbirths$mature)
## 
##  mature mom younger mom 
##         133         867
ncbirths$mature %>% table()
## .
##  mature mom younger mom 
##         133         867
ncbirths$mage %>% mean()
## [1] 27
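
The pipe hands whatever is on its left to the function on its right as the first argument, so x %>% f() is the same as f(x), and steps can be chained left to right. A small sketch comparing a nested call with a piped chain (toy_ages is a made-up vector):

```r
library(dplyr)  # makes the %>% pipe available

# Hypothetical toy vector with one missing value
toy_ages <- c(22, 27, 31, NA)

# The nested call reads inside-out; the piped chain reads left to right
nested <- round(mean(toy_ages, na.rm = TRUE), 1)
piped  <- toy_ages %>% mean(na.rm = TRUE) %>% round(1)

identical(nested, piped)  # TRUE: both compute the same value
```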