Introduction

Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of the active competitions is focused on the USA Census.

The data for this competition comes from the US Census Bureau, which runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.

The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.

Objective

Our objective for this analysis is to explore individuals’ earnings and how various attributes and characteristics affect earning potential. Ideally, we will find trends in this data that provide insight into high and low earners alike.

Dataset description

Variables of Interest:

The following list is a subset of variables that describe an individual filling out the census.

##  [1] "ST"    "AGEP"  "COW"   "INTP"  "JWMNP" "JWRIP" "JWTR"  "SEX"  
##  [9] "WAGP"  "WKW"   "FOD1P" "JWDP"  "PERNP" "PINCP" "RAC1P"

These column names don’t provide much detail, so we will define each below.

Before we do, it is important to note that the following variables need factoring.

##  [1] "ST"    "COW"   "JWRIP" "JWTR"  "SEX"   "WKL"   "WKW"   "FOD1P"
##  [9] "JWDP"  "RAC1P"

To keep things simple, I will be defining variables in the order they appear in the data frame.

ST(.f)
- Description: State Code.
  This is an integer that will be factored into a new column that maps each State code to a State name. The following output lists the levels and labels of the new column.

##  [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"

##  [1] "Alabama/AL"              "Alaska/AK"              
##  [3] "Arizona/AZ"              "Arkansas/AR"            
##  [5] "California/CA"           "Colorado/CO"            
##  [7] "Connecticut/CT"          "Delaware/DE"            
##  [9] "District of Columbia/DC" "Florida/FL"             
## [11] "Georgia/GA"              "Hawaii/HI"              
## [13] "Idaho/ID"                "Illinois/IL"            
## [15] "Indiana/IN"              "Iowa/IA"                
## [17] "Kansas/KS"               "Kentucky/KY"            
## [19] "Louisiana/LA"            "Maine/ME"               
## [21] "Maryland/MD"             "Massachusetts/MA"       
## [23] "Michigan/MI"             "Minnesota/MN"           
## [25] "Mississippi/MS"          "Missouri/MO"            
## [27] "Montana/MT"              "Nebraska/NE"            
## [29] "Nevada/NV"               "New Hampshire/NH"       
## [31] "New Jersey/NJ"           "New Mexico/NM"          
## [33] "New York/NY"             "North Carolina/NC"      
## [35] "North Dakota/ND"         "Ohio/OH"                
## [37] "Oklahoma/OK"             "Oregon/OR"              
## [39] "Pennsylvania/PA"         "Rhode Island/RI"        
## [41] "South Carolina/SC"       "South Dakota/SD"        
## [43] "Tennessee/TN"            "Texas/TX"               
## [45] "Utah/UT"                 "Vermont/VT"             
## [47] "Virginia/VA"             "Washington/WA"          
## [49] "West Virginia/WV"        "Wisconsin/WI"           
## [51] "Wyoming/WY"              "Puerto Rico/PR"
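
As a minimal sketch of how levels and labels like these can be turned into a factored column, the snippet below uses base R's `factor()`. The input vector `st_codes` and the names `st_levels`, `st_labels`, and `st_f` are hypothetical, and only four of the 52 level/label pairs are included; the team's actual script may differ.

```r
# Hypothetical raw State codes as they might appear in the ST column.
st_codes <- c(6L, 25L, 11L, 72L, 6L)

# A small subset of the levels and labels listed above.
st_levels <- c("06", "11", "25", "72")
st_labels <- c("California/CA", "District of Columbia/DC",
               "Massachusetts/MA", "Puerto Rico/PR")

# Zero-pad the integer codes so they match the two-character levels.
st_padded <- sprintf("%02d", st_codes)

# Build the factored column; any code missing from st_levels would become NA.
st_f <- factor(st_padded, levels = st_levels, labels = st_labels)
print(table(st_f))
```

Because the levels are two-character strings, the padding step matters: an unpadded code such as `6` would not match `"06"` and would be turned into `NA`.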

AGEP
- Description: Age. This is an integer giving the individual’s age.
COW(.f)
- Description: Class of Worker. This is an integer that will be factored into a new column that maps the Class of Worker code to its description. The values include employee of a private company, government employee, self-employed, etc.

The following output lists the levels and labels of the new column. Note: This will be the last time I include the levels and labels, given the size of these descriptions. Please refer to our R script for additional details if required.

##  [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"

##  [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"                    
##  [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
##  [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"                                
##  [4] "Local government employee (city- county- etc.)"                                                              
##  [5] "State government employee"                                                                                   
##  [6] "Federal government employee"                                                                                 
##  [7] "Self-employed in own not incorporated business- professional practice- or farm"                              
##  [8] "Self-employed in own incorporated business- professional practice or farm"                                   
##  [9] "Working without pay in family business or farm"                                                              
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"

INTP
- Description: Interest, dividends, and net rental income past 12 months. This is an integer giving the individual’s income from interest, dividends, and rental property. This value can be negative.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6300       0       0    2075       0  300000

JWMNP(.f)
- Description: Travel time to work. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time it took the individual to travel to work.
JWRIP(.f)
- Description: Vehicle occupancy. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes whether the individual carpooled to work and the number of people in the vehicle.
JWTR(.f)
- Description: Means of transportation to work. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the vehicle type used to commute to work, e.g. car, truck, bicycle, bus, worked at home, etc.
SEX(.f)
- Description: Sex. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s sex.
WAGP
- Description: Wages or salary income past 12 months. This is an integer that describes the income an individual received via wages and salary over the past 12 months.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    6600   26060   37000  660000

WKW(.f)
- Description: Weeks worked during past 12 months. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the number of weeks worked in the past 52 weeks.
DRIVESP(.f)
- Description: Number of vehicles calculated from JWRI. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the fraction of a vehicle used while commuting, e.g. if the individual shared a ride with 1 other person (2 total), this value could be 0.50.
FOD1P(.f)
- Description: Recoded field of degree - first entry. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s first-listed field of degree, e.g. Engineering, legal studies, forestry, etc. This variable has 174 levels.
INDP(.f)
- Description: Industry recode for 2013 and later based on 2012 IND codes. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the industry the individual is employed in. Note: The IND codes are based on 2012 IND codes.
JWDP(.f)
- Description: Time of departure for work - hour and minute. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time the individual left for work.
PERNP
- Description: Total person’s earnings (includes wages, ss, pensions, dividends, etc.). This is an integer that describes the total earnings across all sources.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9000       0   10000   27860   40000 1019000

PINCP
- Description: Total person’s income (signed). This is an integer that describes the individual’s total income. This value can be negative.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -13600    7300   21400   36310   46000 1281000

RAC1P(.f)
- Description: Recoded detailed race code. This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the race of the individual.
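
To make the DRIVESP example above concrete (two people sharing one vehicle gives 0.50), the short sketch below computes the vehicle share as the reciprocal of the number of people in the vehicle. This is an illustration of the arithmetic only; the names `occupancy` and `vehicle_share` are hypothetical, and the exact labels used in our factored column may be rounded differently.

```r
# Hypothetical vehicle occupancies: drove alone, then 2-, 3- and 4-person carpools.
occupancy <- c(1, 2, 3, 4)

# Each commuter accounts for one vehicle divided by the number of occupants.
vehicle_share <- round(1 / occupancy, 3)

print(data.frame(occupancy, vehicle_share))
```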

Load DataSet and Libraries

Dataset Preparation

For access to the script used to clean this data, use the following link: DropBox Link

Below is an overview of the steps taken in order to prepare the data.

Load libraries:
- data.table
- magrittr
- ggplot2
- dplyr
Select variables of interest (38). See Variables of Interest for additional details.
Load the two raw data sets provided by Kaggle/US Census. While selecting, limit the ingest to only the variables of interest, to save time.
Bind data sets into one.
List the subset of variables that need factoring and define variables for the levels and the labels, respectively.
Run a for loop to build the new factored columns for all factor variables.
Save the resulting data frame to file. This is the data frame used for all of the analysis and research seen here.
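
The steps above can be sketched roughly as follows. This is not the team's cleaning script: it uses base R instead of data.table so that it runs without extra packages, two tiny made-up data frames (`part_a`, `part_b`) stand in for the two raw files, and the names `vars_of_interest`, `factor_spec`, and `pus_sketch` are hypothetical.

```r
# Two tiny stand-ins for the two raw files (values are made up).
part_a <- data.frame(ST = c(6L, 25L), SEX = c(1L, 2L),
                     WAGP = c(52000L, 0L), EXTRA = c(1, 2))
part_b <- data.frame(ST = c(11L, 6L), SEX = c(2L, 1L),
                     WAGP = c(88000L, 31000L), EXTRA = c(3, 4))

# Keep only the variables of interest, then bind the two parts into one.
vars_of_interest <- c("ST", "SEX", "WAGP")
pus_sketch <- rbind(part_a[, vars_of_interest], part_b[, vars_of_interest])

# Levels and labels for each variable that needs factoring (small subsets).
factor_spec <- list(
  ST  = list(levels = c(6, 11, 25),
             labels = c("California/CA", "District of Columbia/DC",
                        "Massachusetts/MA")),
  SEX = list(levels = c(1, 2), labels = c("Male", "Female"))
)

# Loop over the factor variables and add a new "<name>.f" column for each.
for (v in names(factor_spec)) {
  pus_sketch[[paste0(v, ".f")]] <- factor(pus_sketch[[v]],
                                          levels = factor_spec[[v]]$levels,
                                          labels = factor_spec[[v]]$labels)
}

str(pus_sketch)
```

Keeping the levels and labels in one list means a new factor variable only needs a new list entry; the loop itself does not change.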

The following code loads the libraries we need to analyze our data. We also load the cleaned dataset generated in the Dataset Preparation step.

library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
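options(scipen = 20)  # prefer fixed notation over scientific notation

# Set the working directory for your own machine; run only one setwd() call.
#Jon's wd
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")

#Xiao's wd
# setwd("/Users/XiaoLi/Dropbox/MA799 Class - Data Storm")

# Load the cleaned data set saved by the preparation step
load("dataStormCleanedCencusData.v2.RData")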

Analysis

Earnings by State

One of the first things we looked at was the median earnings of individuals grouped by State.

#Median Earnings by State(bar chart)
pus.df %>%
  group_by(ST.f) %>%
  summarize(count = n(),
            PERNP.median = median(PERNP, na.rm = TRUE)) %>%
  # reorder() sorts the bars by median; arrange() alone does not reorder a factor axis
  ggplot(aes(x = reorder(ST.f, PERNP.median), y = PERNP.median)) +
  geom_col() +
  xlab("State") +
  ylab("Median Earnings") +
  coord_flip()

Description and Findings

Some States’ median earnings aren’t very surprising, while other States have numbers that, for our team, weren’t intuitive. For example, it makes sense that Washington DC has a high median, given the geographical area and concentration of power. On the other hand, Wyoming is #4, which came as a surprise.
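
A ranking such as "Wyoming is #4" can be read off a table as well as a chart. The sketch below shows one way to compute and rank group medians in base R; the data frame `earn` is made up (three placeholder states "A", "B", "C"), so the numbers are illustrative only.

```r
# Made-up earnings for three placeholder states, including one missing value.
earn <- data.frame(
  ST.f  = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  PERNP = c(10000, 20000, NA, 40000, 50000, 60000, 5000, 30000, 31000)
)

# Median earnings per state; rows with a missing PERNP are dropped by aggregate().
state_median <- aggregate(PERNP ~ ST.f, data = earn, FUN = median)

# Sort from highest to lowest median and attach the rank.
state_median <- state_median[order(-state_median$PERNP), ]
state_median$rank <- seq_len(nrow(state_median))

print(state_median)
```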

Looking at the data another way, we can see the distribution within each State.

#Earning by State(boxplot)
ggplot(pus.df, aes(x=ST.f, y=PERNP)) +
  geom_boxplot(outlier.size = 0) +
  # Limits go inside scale_y_log10(); a separate ylim() call would replace the log scale
  scale_y_log10(breaks = c(10000, 50000, 100000), limits = c(6000, 150000)) +
  coord_flip() +
  xlab("State")+
  ylab("Earnings")

Annual Wages by Age

We’ve heard that ages 25 through 35 are the years in which you can increase your salary. Analyzing age vs annual wages will help us confirm or deny this “Rule of Thumb”.

#Annual Wages by Age(boxplot)
wageData <- select(pus.df, WAGP, AGEP)
wageData <- na.omit(wageData)
wageData <- wageData[wageData$WAGP > 0, ]
ticks <- seq(0, 95, by = 5)  # label every fifth age on the x axis
a <- ggplot(wageData, aes(x = factor(AGEP), y = WAGP))
a <- a + geom_boxplot(outlier.size = 0) + xlab("Age") + ylab("Annual Wages")
a <- a + ylim(0, 130000)
# x is a factor, so the x axis is controlled by scale_x_discrete() rather than xlim()
a + scale_x_discrete(breaks = ticks)

Description and Findings

This data backs up the common folklore… sort of. There are a handful of interesting insights:

The steep curve for median wages seems to start around 20 and continue until 30. After age 30, wages continue to increase, though not as quickly, until about age 40. At this point, wages level out until the late 50s, presumably retirement age, and then begin to decrease.
The data used for this graph excludes those with zero wages. Therefore, this decline in wages could indicate that those who leave the workforce earliest earn more than those who remain in the workforce after retirement age.
Another explanation could be that those who work past this age work fewer hours or are paid less.
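
The "steep, then flatter, then declining" pattern described above can also be checked numerically. The sketch below, on made-up data, computes the median wage at each age and the average change in the median per year of age between consecutive ages; the names `wages`, `med`, and `growth_per_year` are hypothetical.

```r
# Made-up wages at a handful of ages (two people per age).
wages <- data.frame(
  AGEP = c(20, 20, 25, 25, 30, 30, 40, 40, 60, 60),
  WAGP = c(10000, 14000, 24000, 28000, 38000, 42000,
           48000, 52000, 44000, 48000)
)

# Median wage at each age.
med_by_age <- tapply(wages$WAGP, wages$AGEP, median)
ages <- as.numeric(names(med_by_age))
med  <- as.vector(med_by_age)

# Change in the median wage per year of age between consecutive ages.
growth_per_year <- diff(med) / diff(ages)

print(data.frame(from = head(ages, -1), to = ages[-1], growth_per_year))
```

A large positive value marks the steep part of the curve, values near zero mark the plateau, and negative values mark the decline.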

Annual Wages and Gender

The following plot includes annual wages greater than 0 and less than $200,000.

#Wages and Gender(violin plot)
wagesex = select(pus.df, SEX.f, WAGP)
wagesex = na.omit(wagesex)
wagesex = subset(wagesex, WAGP<200000)
wagesex = subset(wagesex, WAGP>1)
g = ggplot(wagesex, aes(x=SEX.f, y=WAGP))
g + geom_violin() + xlab("Gender") +ylab("Annual Wages")

Description and Findings

This graph illustrates the distribution of wages by gender.
It is clear that the female plot is wider near the bottom, whereas the male plot tapers more gradually. There is also a much larger bulge at $100,000 for males.
This graph suggests that males have higher overall earning potential including a greater likelihood of earning more than $50,000 annually.
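
The "likelihood of earning more than $50,000" read from the violin plot can be stated as a proportion. The sketch below shows the calculation on made-up data; `wagesex_sketch` and `share_over_50k` are hypothetical names and the resulting shares are not results from the census data.

```r
# Made-up wages for four men and four women.
wagesex_sketch <- data.frame(
  SEX.f = c("Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"),
  WAGP  = c(30000, 60000, 90000, 45000, 20000, 55000, 40000, 35000)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of each group earning more than $50,000.
share_over_50k <- tapply(wagesex_sketch$WAGP > 50000,
                         wagesex_sketch$SEX.f, mean)

print(share_over_50k)
```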

Degree and Earnings Distribution

It is common knowledge that higher education is a key piece in advancing your potential, but what does the data say?

While this first graph is hard to read, there are some interesting anomalies worth looking at.

There are interesting clusters of observations at very high wages. We are not sure why, but to get more out of the data, we will now focus on wages below $300,000.

DegreeData <- select(pus.df, FOD1P.f, WAGP)
DegreeData <- na.omit(DegreeData)
DegreeData2 <- DegreeData[1:40000,]  # first 40,000 rows only
# geom_point() has no outlier argument, so only alpha is set
ggplot(DegreeData2, aes(y=FOD1P.f, x=WAGP)) +
  geom_point(alpha = 0.5)+
  xlab("Annual Wages")+
  ylab("Degree")

We are going to “paginate” the degrees so we can look through all of them with appropriate real estate.

a <- ggplot(DegreeData01, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree") 
a

a <- ggplot(DegreeData02, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData03, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData04, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData05, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData06, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData07, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData08, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a

a <- ggplot(DegreeData09, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
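
The objects `DegreeData01` through `DegreeData09` are created in our R script and are not shown above. As a rough sketch of one way such pages could be built, the snippet below splits the degree levels into fixed-size groups and subsets the data for each group. The data frame `degrees`, the page size `per_page`, and the name `degree_pages` are hypothetical, and the actual script may paginate differently.

```r
# Made-up degree data with five distinct degrees.
degrees <- data.frame(
  FOD1P.f = factor(c("Forestry", "Nursing", "Physics", "History",
                     "Finance", "Nursing", "Physics")),
  WAGP    = c(42000, 61000, 75000, 38000, 82000, 58000, 69000)
)

per_page <- 2  # number of degrees shown on each page

# Assign each degree level to a page: levels 1-2 to page 1, 3-4 to page 2, ...
deg_levels  <- levels(degrees$FOD1P.f)
page_id     <- ceiling(seq_along(deg_levels) / per_page)
level_pages <- split(deg_levels, page_id)

# One data frame per page, keeping only that page's degrees as factor levels.
degree_pages <- lapply(level_pages, function(lv) {
  droplevels(degrees[degrees$FOD1P.f %in% lv, ])
})

print(sapply(degree_pages, nrow))
```

Each element of `degree_pages` could then be passed to the same plotting call used for the pages above. Dropping the unused levels keeps empty degrees off each page's axis.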

Description and Findings

Some majors are wildly more popular than others. There seem to be two reasons for the less common ones. First, the respondent may have chosen a very specific degree that could be part of a more general category. Second, the degree may be a niche or highly specialized concentration, where the demand is relatively low.
There are a handful of degrees that have surprisingly low wage distributions (Military technologies, Nuclear Engineering, and others), while other degrees are less surprising (Court Reporting, Meteorology).

Science/Engineering Degree vs Everyone Else

sci = select(pus.df, SCIENGP.f, WAGP)
sci <- na.omit(sci)
sci <- subset(sci, WAGP > 1)
sci <- subset(sci, WAGP < 200000)
ticks <- c(10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000, 200000)
g <- ggplot(sci, aes(x=SCIENGP.f, y=WAGP))
g + geom_violin()+ ylab("Annual Wages")+ xlab("Science Degree(Y/N)")+ scale_y_continuous(limits=c(0,200000), breaks=ticks)

Description and Findings

This violin plot illustrates the distribution of wages for those with a science/engineering degree (Yes) and those without a science/engineering degree (No).
Both plots are widest at around $50,000, but the “No” plot is comparatively much wider at this point, whereas the “Yes” plot is much more evenly distributed until about $100,000.
This graph suggests that individuals with science/engineering degrees are more likely to have a higher earning potential than those without that type of degree.
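
The shape comparison above can be backed up with quartiles for each group. The sketch below computes them on made-up data; `sci_sketch` and `wage_quartiles` are hypothetical names and the values are illustrative, not results from the census data.

```r
# Made-up wages for five people with and five without a science/engineering degree.
sci_sketch <- data.frame(
  SCIENGP.f = rep(c("Yes", "No"), each = 5),
  WAGP      = c(40000, 60000, 80000, 100000, 120000,
                20000, 35000, 50000, 55000, 70000)
)

# 25th, 50th and 75th percentile of wages within each group.
# The result is a matrix with one row per quantile and one column per group.
wage_quartiles <- sapply(split(sci_sketch$WAGP, sci_sketch$SCIENGP.f),
                         quantile, probs = c(0.25, 0.5, 0.75))

print(wage_quartiles)
```

If the "Yes" column is higher at every quartile, that supports the reading of the violin plot without depending on how the plot is smoothed.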

Conclusion

We were able to find trends across wages and earnings vs many other characteristics. Many of the insights are arguably intuitive, but we’ve also uncovered trends that go against common knowledge. For example, while it is fairly common knowledge that ages 25-35 are the time to increase your wages, our data shows that this thinking may need to be adjusted slightly. As of 2013, when the data was collected, it appears as though this window has shifted to 20-30.

Overall, the insights provided here are interesting and could potentially guide some decisions aimed at higher wages. I would urge anyone looking to use this data to drive up his/her wages to take each insight with careful consideration. For example, if you decided to move to Washington D.C. for higher wages, you might find that your degree isn’t a good fit for the region; and if you do find higher wages, you could be surprised by higher costs of living.

Having explored this data, I hope we have challenged some existing views you have on wages in the United States.

MA799 Assignment 1: American Community Survey

Alex Clark, Jon Leaman, Xiao Li

22 Oct 2015
