Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of its active competitions focuses on the US Census.
The data for this competition is collected by the US Census Bureau, which runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.
The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.
Our objective for this analysis is to explore individuals’ earnings and how various attributes and characteristics affect earning potential. Ideally, we will find trends in this data that provide insight into high and low earners alike.
The following list is a subset of variables that describe an individual filling out the census.
## [1] "ST" "AGEP" "COW" "INTP" "JWMNP" "JWRIP" "JWTR" "SEX"
## [9] "WAGP" "WKW" "FOD1P" "JWDP" "PERNP" "PINCP" "RAC1P"
This subset of column names doesn’t provide much detail, so we will define each below.
Before we do, it is important to note that the following variables need factoring.
## [1] "ST" "COW" "JWRIP" "JWTR" "SEX" "WKL" "WKW" "FOD1P"
## [9] "JWDP" "RAC1P"
To keep things simple, I will be defining variables in the order they appear in the data frame.
## [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"
## [1] "Alabama/AL" "Alaska/AK"
## [3] "Arizona/AZ" "Arkansas/AR"
## [5] "California/CA" "Colorado/CO"
## [7] "Connecticut/CT" "Delaware/DE"
## [9] "District of Columbia/DC" "Florida/FL"
## [11] "Georgia/GA" "Hawaii/HI"
## [13] "Idaho/ID" "Illinois/IL"
## [15] "Indiana/IN" "Iowa/IA"
## [17] "Kansas/KS" "Kentucky/KY"
## [19] "Louisiana/LA" "Maine/ME"
## [21] "Maryland/MD" "Massachusetts/MA"
## [23] "Michigan/MI" "Minnesota/MN"
## [25] "Mississippi/MS" "Missouri/MO"
## [27] "Montana/MT" "Nebraska/NE"
## [29] "Nevada/NV" "New Hampshire/NH"
## [31] "New Jersey/NJ" "New Mexico/NM"
## [33] "New York/NY" "North Carolina/NC"
## [35] "North Dakota/ND" "Ohio/OH"
## [37] "Oklahoma/OK" "Oregon/OR"
## [39] "Pennsylvania/PA" "Rhode Island/RI"
## [41] "South Carolina/SC" "South Dakota/SD"
## [43] "Tennessee/TN" "Texas/TX"
## [45] "Utah/UT" "Vermont/VT"
## [47] "Virginia/VA" "Washington/WA"
## [49] "West Virginia/WV" "Wisconsin/WI"
## [51] "Wyoming/WY" "Puerto Rico/PR"
The following output shows the values that correspond to the levels and labels of the new column. Note: This will be the last time I include the levels and labels, given the size of these descriptions. Please refer to our R script for additional details if required.
## [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"
## [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"
## [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
## [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"
## [4] "Local government employee (city- county- etc.)"
## [5] "State government employee"
## [6] "Federal government employee"
## [7] "Self-employed in own not incorporated business- professional practice- or farm"
## [8] "Self-employed in own incorporated business- professional practice or farm"
## [9] "Working without pay in family business or farm"
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"
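The R script that applies these levels and labels is not reproduced here. As a minimal sketch of how a coded column might be turned into a labeled factor, the example below uses a small invented data frame and shortened, hypothetical labels for four of the COW codes; the `.f` suffix mirrors names such as `ST.f` used later in this document.

```r
# Toy stand-in for the COW (class of worker) column; values are invented.
toy <- data.frame(COW = c("1", "b", "5", "1", "9"), stringsAsFactors = FALSE)

# Four of the codes listed above, with shortened (hypothetical) labels.
cow_levels <- c("b", "1", "5", "9")
cow_labels <- c("N/A",
                "Private for-profit employee",
                "Federal government employee",
                "Unemployed")

# factor() pairs each code in `levels` with the label at the same position.
toy$COW.f <- factor(toy$COW, levels = cow_levels, labels = cow_labels)

# Count how many rows fall under each label.
print(table(toy$COW.f))
```

Keeping the raw code column alongside the new factor makes it easy to check the mapping against the codebook.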
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6300 0 0 2075 0 300000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 6600 26060 37000 660000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9000 0 10000 27860 40000 1019000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13600 7300 21400 36310 46000 1281000
For access to the script used to clean this data, use the following link: DropBox Link
Below is an overview of the steps taken in order to prepare the data.
The following code loads the libraries we need to analyze our data. We also load the cleaned dataset generated in the Dataset Preparation step.
library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
options(scipen = 20)
#Jon's wd
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.v2.RData")
#Xiao's wd
setwd("/Users/XiaoLi/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.v2.RData")
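Because each author has a different working directory, only one of the `setwd()` calls above will succeed on a given machine. One possible alternative, sketched here under the assumption that the data file sits in whichever candidate folder exists, is to pick the first existing folder and build the file path from it. The folder list below is a placeholder (a temporary directory stands in so the sketch runs anywhere).

```r
# Candidate project folders; in practice these would be the authors' Dropbox paths.
# A temporary directory stands in here so the sketch runs on any machine.
candidate_dirs <- c("/path/that/does/not/exist", tempdir())

# Keep only the folders that exist on this machine and take the first one.
existing <- candidate_dirs[dir.exists(candidate_dirs)]
stopifnot(length(existing) > 0)
project_dir <- existing[1]

# Build the full file path instead of changing the working directory.
data_file <- file.path(project_dir, "dataStormCleanedCencusData.v2.RData")

# Load the data only if the file is actually present.
if (file.exists(data_file)) load(data_file)
```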
One of the first things we looked at was median earnings for individuals grouped by state.
#Median Earnings by State (bar chart)
pus.df %>%
group_by(ST.f) %>%
summarize(count=n(),
PERNP.median = median(PERNP,na.rm=TRUE)) %>%
#reorder() sorts the bars by median earnings
ggplot(aes(x=reorder(ST.f, PERNP.median), y=PERNP.median)) +
#the medians are already computed, so draw them directly as bars
geom_bar(stat="identity") +
xlab("State")+
ylab("Median Earnings")+
coord_flip()
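The grouping and median step can be tried on a small table. The following is a minimal sketch in base R, using invented values and `aggregate()` in place of the dplyr pipeline; it also shows how re-leveling the factor fixes the order in which a plot would draw the groups.

```r
# Toy data standing in for pus.df; values are invented for illustration only.
toy <- data.frame(
  ST.f  = factor(c("A", "A", "A", "B", "B", "C", "C", "C")),
  PERNP = c(10000, 20000, 60000, 5000, 7000, 30000, NA, 50000)
)

# Median earnings per state; the formula interface drops rows with NA.
med <- aggregate(PERNP ~ ST.f, data = toy, FUN = median)

# Sort from highest to lowest median.
med <- med[order(med$PERNP, decreasing = TRUE), ]

# Re-level the factor so that a plot would follow the sorted order.
med$ST.f <- factor(med$ST.f, levels = as.character(med$ST.f))

print(med)
```

Sorting the rows alone does not change the order of a factor axis, which is why the levels are reset explicitly.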
Description and Findings
Some states’ median earnings aren’t very surprising, while other states have numbers that, for our team, weren’t intuitive. For example, it makes sense that Washington DC has a high median, given the geographical area and concentration of power. On the other hand, Wyoming is #4, which came as a surprise.
Looking at the data another way, we can see the distribution within each state.
#Earnings by State (boxplot)
ggplot(pus.df, aes(x=ST.f, y=PERNP)) +
geom_boxplot(outlier.shape = NA) +
#the limits go inside scale_y_log10(); a separate ylim() call would replace the log scale
scale_y_log10(breaks = c(10000, 50000, 100000, 400000), limits = c(6000, 150000)) +
coord_flip() +
xlab("State")+
ylab("Earnings")
We’ve heard that ages 25 through 35 are the years in which you can most increase your salary. Analyzing age vs. annual wages will help us confirm or deny this “rule of thumb”.
#Annual Wages by Age (boxplot)
wageData <- select(pus.df, WAGP, AGEP)
wageData <- na.omit(wageData)
wageData <- wageData[wageData$WAGP>0,]
#label every fifth age on the x axis
ticks <- seq(0, 95, by = 5)
a <- ggplot(wageData, aes(x=factor(AGEP), y=WAGP))
a <- a + geom_boxplot(outlier.size = 0)+xlab("Age")+ylab("Annual Wages")
a <- a + ylim(0,130000)
#age is a factor here, so the x axis is discrete
a + scale_x_discrete(breaks=as.character(ticks))
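One detail worth keeping in mind when reading this plot: in ggplot2, `ylim()` removes values outside the limits before the boxplot statistics are computed, whereas `coord_cartesian()` only zooms the view. The sketch below uses invented wages and base R's `boxplot.stats()` to illustrate how dropping high values can shift the median of a box.

```r
# Invented wages for one age group, including one value above 130000.
wages <- c(20000, 30000, 40000, 50000, 60000, 120000, 250000)

# Five-number summary on all values (what a zoomed view would be based on).
full_stats <- boxplot.stats(wages)$stats

# Five-number summary after dropping values above 130000, mimicking
# the effect of a y limit applied before the boxes are computed.
trimmed_stats <- boxplot.stats(wages[wages <= 130000])$stats

# The third element of each summary is the median.
print(full_stats[3])
print(trimmed_stats[3])
```

If the boxes should reflect every wage while only the visible range is restricted, `coord_cartesian(ylim = c(0, 130000))` would be the usual choice.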
Description and Findings
This data backs up the common folklore… sort of. There are a handful of interesting insights:
The following plot includes annual wages greater than 0 and less than $200,000.
#Wages and Gender(violin plot)
wagesex = select(pus.df, SEX.f, WAGP)
wagesex = na.omit(wagesex)
wagesex = subset(wagesex, WAGP<200000)
wagesex = subset(wagesex, WAGP>1)
g = ggplot(wagesex, aes(x=SEX.f, y=WAGP))
g + geom_violin() + xlab("Gender") +ylab("Annual Wages")
Description and Findings
It is common knowledge that higher education is a key piece in advancing your potential, but what does the data say?
While this first graph is hard to read, there are some interesting anomalies worth looking at.
DegreeData <- select(pus.df, FOD1P.f, WAGP)
DegreeData <- na.omit(DegreeData)
#keep only the first 40,000 rows
DegreeData2 <- head(DegreeData, 40000)
ggplot(DegreeData2, aes(y=FOD1P.f, x=WAGP)) +
geom_point(alpha = 0.5)+
xlab("Annual Wages")+
ylab("Degree")
We are going to “paginate” the degrees so we can look through all of them with appropriate real estate.
a <- ggplot(DegreeData01, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData02, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData03, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData04, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData05, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData06, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData07, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData08, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
a <- ggplot(DegreeData09, aes(WAGP, FOD1P.f))
a <- a + geom_point(alpha = 0.2)+ xlab("Annual Wages")+ ylab("Degree")
a
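The data frames `DegreeData01` through `DegreeData09` are not defined in this write-up. As a minimal sketch of how such pages might be built, the example below splits a small invented data frame into groups of degree levels; the page size and the degree names are assumptions made for illustration.

```r
# Toy stand-in for DegreeData with six invented degree names.
toy <- data.frame(
  FOD1P.f = factor(c("Art", "Biology", "Chemistry", "Drama",
                     "Economics", "French", "Art", "Drama")),
  WAGP    = c(30000, 52000, 61000, 28000, 70000, 41000, 35000, 33000)
)

page_size <- 2  # degrees per page; an arbitrary choice for this sketch

# Assign each degree level to a page number: 1, 1, 2, 2, 3, 3.
degree_levels <- levels(toy$FOD1P.f)
page_of_level <- ceiling(seq_along(degree_levels) / page_size)

# Split the rows into one data frame per page, then drop unused levels
# so each plot would show only the degrees on its own page.
pages <- split(toy, page_of_level[as.integer(toy$FOD1P.f)])
pages <- lapply(pages, droplevels)

print(names(pages))
print(levels(pages[["1"]]$FOD1P.f))
```

Each element of `pages` could then be passed to the same `ggplot()` call in a loop instead of repeating the plotting code for every page.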
Description and Findings
Science/Engineering Degree vs Everyone Else
sci = select(pus.df, SCIENGP.f, WAGP)
sci <- na.omit(sci)
sci <- subset(sci, WAGP > 1)
sci <- subset(sci, WAGP < 200000)
ticks <- c(10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000, 200000)
g <- ggplot(sci, aes(x=SCIENGP.f, y=WAGP))
g + geom_violin()+ ylab("Annual Wages")+ xlab("Science Degree(Y/N)")+ scale_y_continuous(limits=c(0,200000), breaks=ticks)
Description and Findings
We were able to find trends in wages and earnings against many other characteristics. Many of the insights are arguably intuitive, but we’ve also uncovered trends that go against common knowledge. For example, while it is commonly held that ages 25-35 are the time to increase your wages, our data suggests that this thinking may need to be adjusted slightly. As of 2013, when the data was collected, this window appears to have shifted to ages 20-30.
Overall, the insights provided here are interesting and could potentially guide some decisions to drive larger wages. I would urge anyone looking to use this data to drive up his/her wages to take each insight with careful consideration. For example, if you decided to move to Washington D.C. for higher wages, you might find that your degree isn’t a good fit for the region; and if you do find higher wages, you could be surprised by higher costs of living.
Having explored this data, I hope we have challenged some existing views you have on wages in the United States.