In this first chapter you will analyze data from the 2013 American Community Survey (ACS) to find out whether it makes sense to pursue a PhD. The end result will be a Kaggle script that you can share via your own Kaggle account.
Let’s get started by loading in the variable AC_Survey_Subset, a subset of the ACS data containing the columns SCHL (School Level), PINCP (Income) and ESR (Work Status).
Note: A basic understanding of R syntax is required for this course. In addition, you will need some basic functions from the dplyr and ggplot2 packages.
# Load in your data
acs_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_810/AC_Survey_Subset.RData"
load(url(acs_url))
# Investigate first 20 observations
head(AC_Survey_Subset, 20)
## SCHL ESR PINCP
## 1 19 6 0
## 2 20 1 52000
## 3 16 1 99000
## 4 19 6 0
## 5 19 6 0
## 6 21 1 39930
## 7 14 6 10300
## 8 16 3 1100
## 9 9 NA NA
## 10 1 6 3900
## 11 10 6 5400
## 12 16 1 90000
## 13 18 1 46000
## 14 17 6 39500
## 15 17 6 13100
## 16 19 6 103000
## 17 21 1 53600
## 18 19 1 28000
## 19 16 1 18700
## 20 16 6 6000
Your data still looks a bit messy, so it’s time to clean it up with data manipulation techniques. You will do this using dplyr, an R package that gives you access to the most important data manipulation tools and makes them easy to use.
dplyr makes use of the pipe operator %>% from the magrittr package. A pipe takes the output of one function and feeds it into the first argument of the next function:
# Load in the dplyr package and convert AC_Survey_Subset to tbl_df
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
AC_Survey_Subset <- as_tibble(AC_Survey_Subset) # as_tibble() supersedes the deprecated tbl_df()
# Use the pipe operator to drop rows with missing values
AC_Survey_Subset_Cleaned <- AC_Survey_Subset %>% na.omit()
# Chain further steps: keep Bachelor (21), Master (22) and Doctorate (24) holders, then group by SCHL
AC_Survey_Subset_Cleaned <- AC_Survey_Subset %>%
  na.omit() %>%
  filter(SCHL %in% c(21, 22, 24)) %>%
  group_by(SCHL)
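To make the pipe behaviour concrete, here is a minimal sketch that compares a nested call with the equivalent piped chain. The data frame `toy` and its values are hypothetical and only stand in for the ACS subset; the point is that both forms run the same functions in the same order.

```r
library(dplyr)

# Hypothetical toy data, not taken from the ACS subset
toy <- data.frame(SCHL  = c(21, 22, 24, 16, NA),
                  PINCP = c(52000, 61000, NA, 18000, 30000))

# Nested form: read from the inside out
nested <- filter(na.omit(toy), SCHL %in% c(21, 22, 24))

# Piped form: read from left to right; each result becomes
# the first argument of the next function
piped <- toy %>%
  na.omit() %>%
  filter(SCHL %in% c(21, 22, 24))

# Both forms should give the same data frame
identical(nested, piped)
```

The piped form tends to be easier to read once more than two steps are involved, which is why the cleaning step above is written as a chain.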
Let’s have a look at the number of BSc, MSc & PhD holders in the US.
To do this you need to calculate the number of observations for the SCHL codes 21, 22 and 24.
The dplyr function summarize() is used together with aggregate functions that take a vector of values and return a single number. For example, you could use mean() on PINCP inside summarize().
Because AC_Survey_Subset_Cleaned was already grouped by SCHL, you will get the average income for each SCHL code (try it yourself in the console!).
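As a minimal sketch of a grouped summary, the following uses hypothetical toy data in place of AC_Survey_Subset_Cleaned. The names `toy_grouped`, `avg_income` and `AvgIncome` are illustrative assumptions, not part of the course material.

```r
library(dplyr)

# Hypothetical toy data standing in for AC_Survey_Subset_Cleaned
toy_grouped <- data.frame(SCHL  = c(21, 21, 22, 22, 24),
                          PINCP = c(40000, 50000, 60000, 70000, 90000)) %>%
  group_by(SCHL)

# mean() collapses each group's PINCP values into a single number,
# so summarize() returns one row per SCHL code
avg_income <- summarize(toy_grouped, AvgIncome = mean(PINCP))
avg_income
```

Swapping mean() for another aggregate function such as median() or n() follows the same pattern.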
# Count the number of Bachelor, Master and PhD holders
degree_holders <- summarize(AC_Survey_Subset_Cleaned, count=n())
degree_holders
## Source: local data frame [3 x 2]
##
## SCHL count
## (int) (int)
## 1 21 423943
## 2 22 183182
## 3 24 30877
# Create a lookup table that maps SCHL codes to degree names
degree_codes <- data.frame(SCHL = c(21, 22, 24), Degree = c("Bachelor", "Masters", "Doctorate"))
# Join degree_codes with degree_holders, assign to degree_holders_2
degree_holders_2 <- inner_join(degree_holders,degree_codes)
## Joining by: "SCHL"
head(degree_holders_2)
## Source: local data frame [3 x 3]
##
## SCHL count Degree
## (dbl) (int) (fctr)
## 1 21 423943 Bachelor
## 2 22 183182 Masters
## 3 24 30877 Doctorate
# Load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
# Visualize the number of Bachelor, Master and PhD holders
ggplot(degree_holders_2, aes(x = Degree, y = count, fill = Degree)) +
  geom_bar(stat = "identity") +
  xlab("Degree") +
  ylab("Number of People") +
  ggtitle("Comparing Degree Holders in the US")
# Build the income data frame used for the boxplots below
# Exclude those who earn $1,000 or less
over_thousand <- AC_Survey_Subset_Cleaned %>%
  filter(PINCP > 1000) %>%
  group_by(SCHL)
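freq <- 5000 # Number of repeated samples
results <- vector("list", freq) # Preallocate instead of growing a data frame with rbind()
for (i in seq_len(freq)) {
  # Draw 1000 observations from each SCHL group
  income_sample <- sample_n(over_thousand, 1000)
  # Calculate summary statistics per SCHL group for this draw
  results[[i]] <- summarise(income_sample,
                            MinIncome = min(PINCP), MaxIncome = max(PINCP),
                            MedianIncome = median(PINCP), IncomeRange = IQR(PINCP))
}
result <- bind_rows(results)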
income <- result %>% arrange(SCHL)
income <- left_join(income, degree_codes, by = "SCHL")
head(income)
## Source: local data frame [6 x 6]
##
## SCHL MinIncome MaxIncome MedianIncome IncomeRange Degree
## (dbl) (int) (int) (dbl) (dbl) (fctr)
## 1 21 1150 655000 42550 54075.0 Bachelor
## 2 21 1200 655000 43000 50137.5 Bachelor
## 3 21 1200 1030000 43425 53200.0 Bachelor
## 4 21 1100 867000 45000 48000.0 Bachelor
## 5 21 1100 823000 45000 50700.0 Bachelor
## 6 21 1100 770000 46745 51000.0 Bachelor
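The reason each pass of the loop yields one row per degree is that sample_n() on grouped data draws the requested number of rows from every group. The sketch below illustrates this on hypothetical toy data; `toy_grouped`, `one_draw` and `draw_stats` are illustrative names, and the group and sample sizes are much smaller than in the real analysis.

```r
library(dplyr)

# Hypothetical toy data: 10 incomes for each of the three SCHL codes
toy_grouped <- data.frame(SCHL  = rep(c(21, 22, 24), each = 10),
                          PINCP = seq(2000, 60000, by = 2000)) %>%
  group_by(SCHL)

set.seed(1) # Fix the random draw so the sketch is repeatable

# On grouped data, sample_n() draws 4 rows from each group (12 rows in total)
one_draw <- sample_n(toy_grouped, 4)
table(one_draw$SCHL)

# The draw stays grouped, so summarise() returns one row per SCHL code
draw_stats <- summarise(one_draw, MedianIncome = median(PINCP))
draw_stats
```

In recent dplyr releases sample_n() is superseded by slice_sample(), which could be used in the same way with its n argument.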
# Create the boxplots
library(ggplot2)
ggplot(income, aes(x = Degree, y = MedianIncome, fill = Degree)) +
  geom_boxplot() +
  ggtitle("Comparing Income of Degree Holders")