Introducing the American Community Survey

In this first chapter you will analyze data from the 2013 American Community Survey (ACS) to find out whether it makes sense to pursue a PhD. The end result will be a Kaggle script that you can share via your own Kaggle account.

Let’s get started by loading in the variable AC_Survey_Subset, a subset of the ACS data containing the columns SCHL (School Level), PINCP (Income) and ESR (Work Status).

Note: A basic understanding of R syntax is required for this course. In addition, you will need some basic functions from the dplyr and ggplot2 packages.

# Load in your data
acs_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_810/AC_Survey_Subset.RData"
load(url(acs_url))

# Investigate first 20 observations
head(AC_Survey_Subset, 20)
##    SCHL ESR  PINCP
## 1    19   6      0
## 2    20   1  52000
## 3    16   1  99000
## 4    19   6      0
## 5    19   6      0
## 6    21   1  39930
## 7    14   6  10300
## 8    16   3   1100
## 9     9  NA     NA
## 10    1   6   3900
## 11   10   6   5400
## 12   16   1  90000
## 13   18   1  46000
## 14   17   6  39500
## 15   17   6  13100
## 16   19   6 103000
## 17   21   1  53600
## 18   19   1  28000
## 19   16   1  18700
## 20   16   6   6000

Preparing your data set for further analysis

Your data still looks a bit messy, so it’s time to clean it up with data manipulation techniques. You will do this using dplyr, an R package that gives you access to the most important data manipulation tools and makes them easy to use.

The dplyr package makes use of the pipe operator %>% from the magrittr package. A pipe takes the output of one function and passes it as the first argument of the next function:

# Load in the dplyr package and convert AC_Survey_Subset to tbl_df
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Note: tbl_df() is deprecated as of dplyr 1.0.0; as_tibble() is its replacement
AC_Survey_Subset <- tbl_df(AC_Survey_Subset)

# Use the pipe operator and chaining: drop rows with missing values,
# keep SCHL codes 21, 22 and 24, then group by SCHL
AC_Survey_Subset_Cleaned <- AC_Survey_Subset %>%
  na.omit() %>%
  filter(SCHL %in% c(21, 22, 24)) %>%
  group_by(SCHL)
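
To make the pipe concrete, here is a minimal sketch on a small made-up data frame (the data frame toy and its values are hypothetical, not taken from the ACS file). It compares a nested call with the equivalent piped call:

```r
library(dplyr)

# Hypothetical toy data, not taken from the ACS file
toy <- data.frame(SCHL  = c(21, 22, 24, 16, NA),
                  PINCP = c(40000, 55000, NA, 12000, 30000))

# Nested call: read from the inside out
nested <- filter(na.omit(toy), SCHL %in% c(21, 22, 24))

# Piped call: read from left to right; each result becomes the
# first argument of the next function
piped <- toy %>% na.omit() %>% filter(SCHL %in% c(21, 22, 24))

# Check that both forms yield the same rows
identical(nested, piped)
```

The piped form tends to be easier to read once more than two steps are chained, which is why the cleaning step above is written that way.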

How many are there? - Part one

Let’s have a look at the number of Bachelor, Master and PhD holders among the US survey respondents.

To do this you need to calculate the number of observations for the SCHL codes 21, 22 and 24.

The dplyr function summarize() is used together with aggregate functions, such as mean() or n(), that return a single number based on a vector of values. For example, you could use it to take the mean of PINCP.

Because AC_Survey_Subset_Cleaned was already grouped by SCHL, you will get the average income for each SCHL code (try it yourself in the console!).
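
A grouped summary of this kind can be sketched on a small made-up data frame. The data frame toy, its values and the column name MeanIncome are assumptions for illustration only; the real data set is far larger:

```r
library(dplyr)

# Hypothetical toy data standing in for AC_Survey_Subset_Cleaned
toy <- data.frame(SCHL  = c(21, 21, 22, 22, 24),
                  PINCP = c(30000, 50000, 60000, 80000, 90000))

# Group by SCHL, then compute one mean and one count per group
toy_summary <- toy %>%
  group_by(SCHL) %>%
  summarize(MeanIncome = mean(PINCP), count = n())

toy_summary
```

Each aggregate function is evaluated once per group, so the result has one row per SCHL code.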

# Count the number of Bachelor, Master and PhD holders
degree_holders <- summarize(AC_Survey_Subset_Cleaned, count=n())
degree_holders
## Source: local data frame [3 x 2]
## 
##    SCHL  count
##   (int)  (int)
## 1    21 423943
## 2    22 183182
## 3    24  30877
# Map each SCHL code to the name of the degree
degree_codes <- data.frame(SCHL = c(21, 22, 24), Degree = c("Bachelor", "Masters", "Doctorate"))

# Join degree_codes with degree_holders, assign to degree_holders_2
degree_holders_2 <- inner_join(degree_holders, degree_codes)
## Joining by: "SCHL"
head(degree_holders_2)
## Source: local data frame [3 x 3]
## 
##    SCHL  count    Degree
##   (dbl)  (int)    (fctr)
## 1    21 423943  Bachelor
## 2    22 183182   Masters
## 3    24  30877 Doctorate
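
Note that the join above silently turned SCHL from integer into double and stored Degree as a factor. As a hedged alternative, the sketch below builds the lookup table with integer codes and character labels and names the join key explicitly. The object names ending in _toy are hypothetical, and the counts are copied from the output above:

```r
library(dplyr)

# Counts copied from the output above
degree_holders_toy <- data.frame(SCHL  = c(21L, 22L, 24L),
                                 count = c(423943L, 183182L, 30877L))

# Lookup table with integer codes and character labels
# (a design choice assumed here, not the course's original code)
degree_codes_toy <- data.frame(SCHL   = c(21L, 22L, 24L),
                               Degree = c("Bachelor", "Masters", "Doctorate"),
                               stringsAsFactors = FALSE)

# Naming the key explicitly avoids the "Joining by" message
joined <- inner_join(degree_holders_toy, degree_codes_toy, by = "SCHL")
joined
```

Keeping the key column the same type on both sides means the joined table keeps integer SCHL codes.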

How many are there? - Part two

# Load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
# Visualize the number of Bachelor, Master and PhD holders
ggplot(degree_holders_2, aes(x = Degree, y = count, fill = Degree)) +
  geom_bar(stat = "identity") +
  xlab("Degree") +
  ylab("Number of People") +
  ggtitle("Comparing Degree Holders in the US")
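
By default ggplot2 orders a discrete x axis by factor level, which is alphabetical here (Bachelor, Doctorate, Masters). If you would rather see the bars in order of degree level, one option is to set the factor levels yourself before plotting. The sketch below uses a hypothetical copy of the joined table, with counts taken from the output above:

```r
# Hypothetical copy of the joined table from the previous step
degree_holders_toy <- data.frame(
  Degree = c("Bachelor", "Masters", "Doctorate"),
  count  = c(423943, 183182, 30877)
)

# Set the factor levels in degree order instead of alphabetical order
degree_holders_toy$Degree <- factor(degree_holders_toy$Degree,
                                    levels = c("Bachelor", "Masters", "Doctorate"))

levels(degree_holders_toy$Degree)
```

Passing a data frame prepared this way to the same ggplot() call should draw the bars from Bachelor to Doctorate.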

Do PhDs earn more?

# Exclude those who earn $1000 or less, then group by degree level
over_thousand <- AC_Survey_Subset_Cleaned %>%
  filter(PINCP > 1000) %>%
  group_by(SCHL)

freq <- 5000 # Number of samples to draw

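# Collect the statistics of every sample in a preallocated list
result <- vector("list", freq)
for (i in seq_len(freq)) {
  # Select 1000 observations from each SCHL group
  income_sample <- sample_n(over_thousand, 1000)
  # Calculate stats per SCHL group (IncomeRange is the interquartile range)
  result[[i]] <- summarise(income_sample, MinIncome = min(PINCP), MaxIncome = max(PINCP),
                           MedianIncome = median(PINCP), IncomeRange = IQR(PINCP))
}
result <- bind_rows(result) # Combine all samples into one data frame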

# Sort by SCHL and add the degree names (by.x belongs to merge(), not left_join())
income <- result %>% arrange(SCHL)
income <- left_join(income, degree_codes)
## Joining by: "SCHL"
head(income)    
## Source: local data frame [6 x 6]
## 
##    SCHL MinIncome MaxIncome MedianIncome IncomeRange   Degree
##   (dbl)     (int)     (int)        (dbl)       (dbl)   (fctr)
## 1    21      1150    655000        42550     54075.0 Bachelor
## 2    21      1200    655000        43000     50137.5 Bachelor
## 3    21      1200   1030000        43425     53200.0 Bachelor
## 4    21      1100    867000        45000     48000.0 Bachelor
## 5    21      1100    823000        45000     50700.0 Bachelor
## 6    21      1100    770000        46745     51000.0 Bachelor
# Create the boxplots
library(ggplot2)
ggplot(income, aes(x = Degree, y = MedianIncome, fill = Degree)) +
  geom_boxplot() +
  ggtitle("Comparing Income of Degree Holders")
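
Each box in this plot summarizes the medians of many repeated samples, not individual incomes. To see what that means, the sketch below repeats the idea in base R on simulated data. The log-normal incomes, the seed and the 500 repetitions are all assumptions chosen for illustration, not values from the ACS:

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Simulated right-skewed incomes (log-normal is an assumption, not ACS data)
incomes <- rlnorm(10000, meanlog = 10.8, sdlog = 0.8)

# Draw 500 samples of 1000 observations each and record every sample median
sample_medians <- replicate(500, median(sample(incomes, 1000)))

# Compare the spread of the sample medians with the overall median
summary(sample_medians)
median(incomes)
```

The sample medians vary far less than the incomes themselves, which is why the boxes for the three degrees can be compared even though individual incomes range from about a thousand to over a million dollars.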