Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of the active competitions is focused on the USA Census.
The data for this competition is collected by the US Census Bureau through the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.
The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.
Although our objective may shift as we begin to investigate the data, we want to focus on unemployment and job stability.
The following list is a subset of the variables that describe an individual filling out the census.
## [1] "ST" "PWGTP" "AGEP" "COW" "INTP"
## [6] "JWMNP" "JWRIP" "JWTR" "MAR" "OIP"
## [11] "PAP" "RETP" "SCH" "SCHG" "SCHL"
## [16] "SEMP" "SEX" "WAGP" "WKHP" "WKL"
## [21] "WKW" "WRK" "DRIVESP" "ESR" "FOD1P"
## [26] "FOD2P" "INDP" "JWAP" "JWDP" "MSP"
## [31] "NAICSP" "OCCP" "PERNP" "PINCP" "POWSP"
## [36] "RAC1P" "SCIENGP" "SCIENGRLP"
These column names don’t provide much detail on their own, so we define each one below.
Before we do, it is important to note that the following variables need to be converted to factors.
## [1] "ST" "COW" "JWRIP" "JWTR" "MAR"
## [6] "SCH" "SCHG" "SCHL" "SEX" "WKL"
## [11] "WKW" "WRK" "DRIVESP" "ESR" "FOD1P"
## [16] "FOD2P" "INDP" "JWAP" "JWDP" "MSP"
## [21] "NAICSP" "OCCP" "POWSP" "RAC1P" "SCIENGP"
## [26] "SCIENGRLP"
To keep things simple, I will be defining variables in the order they appear in the data frame.
ST is an integer State code that will be factored into a new column pairing each code with a State name. The following output shows the values that correspond to the levels and labels of the new column.
## [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"
## [1] "Alabama/AL" "Alaska/AK"
## [3] "Arizona/AZ" "Arkansas/AR"
## [5] "California/CA" "Colorado/CO"
## [7] "Connecticut/CT" "Delaware/DE"
## [9] "District of Columbia/DC" "Florida/FL"
## [11] "Georgia/GA" "Hawaii/HI"
## [13] "Idaho/ID" "Illinois/IL"
## [15] "Indiana/IN" "Iowa/IA"
## [17] "Kansas/KS" "Kentucky/KY"
## [19] "Louisiana/LA" "Maine/ME"
## [21] "Maryland/MD" "Massachusetts/MA"
## [23] "Michigan/MI" "Minnesota/MN"
## [25] "Mississippi/MS" "Missouri/MO"
## [27] "Montana/MT" "Nebraska/NE"
## [29] "Nevada/NV" "New Hampshire/NH"
## [31] "New Jersey/NJ" "New Mexico/NM"
## [33] "New York/NY" "North Carolina/NC"
## [35] "North Dakota/ND" "Ohio/OH"
## [37] "Oklahoma/OK" "Oregon/OR"
## [39] "Pennsylvania/PA" "Rhode Island/RI"
## [41] "South Carolina/SC" "South Dakota/SD"
## [43] "Tennessee/TN" "Texas/TX"
## [45] "Utah/UT" "Vermont/VT"
## [47] "Virginia/VA" "Washington/WA"
## [49] "West Virginia/WV" "Wisconsin/WI"
## [51] "Wyoming/WY" "Puerto Rico/PR"
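The factoring step described above can be sketched as follows. This is a minimal illustration, not the project's cleaning script: the data frame `toy` and the three-state subset of levels and labels are assumptions made for the example, while the `.f` suffix follows the factored column names used in the plots in this document.

```r
# Hypothetical three-state subset of the levels and labels listed above.
st_levels <- c("01", "02", "04")
st_labels <- c("Alabama/AL", "Alaska/AK", "Arizona/AZ")

# Toy stand-in for the census data frame; ST is stored as an integer code.
toy <- data.frame(ST = c(1L, 4L, 2L, 1L))

# Zero-pad the integer code to two characters so it matches the levels,
# then attach the State names as labels in a new column.
toy$ST.f <- factor(sprintf("%02d", toy$ST),
                   levels = st_levels,
                   labels = st_labels)

# Count of rows per State name.
print(table(toy$ST.f))
```

Keeping the original integer column and writing the factor to a separate `.f` column leaves the raw codes available for checking the mapping.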
PWGTP is an integer giving the individual's person weight, that is, the survey weight assigned to the respondent rather than a body weight.
AGEP is an integer giving the individual's age.
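Because PWGTP is a survey weight, summaries meant to describe the population are usually computed with it instead of treating every row equally. A minimal sketch of the difference, using a hypothetical three-row data frame (the values are invented for illustration only):

```r
# Toy stand-in for the census data frame: ages and person weights.
toy <- data.frame(AGEP  = c(25L, 40L, 60L),
                  PWGTP = c(10L, 30L, 60L))

# Every respondent counts once.
unweighted_mean_age <- mean(toy$AGEP)

# Every respondent counts in proportion to their person weight.
weighted_mean_age <- weighted.mean(toy$AGEP, w = toy$PWGTP)

print(c(unweighted = unweighted_mean_age, weighted = weighted_mean_age))
```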
COW is an integer that will be factored into a new column that aligns the Class of Worker code with its description. The categories include employee of a private company, government employee, self-employed, and so on.
The following output shows the values that correspond to the levels and labels of the new column. Note: this is the last time the levels and labels are included, given the size of these descriptions. Please refer to our R script for additional details if required.
## [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"
## [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"
## [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
## [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"
## [4] "Local government employee (city- county- etc.)"
## [5] "State government employee"
## [6] "Federal government employee"
## [7] "Self-employed in own not incorporated business- professional practice- or farm"
## [8] "Self-employed in own incorporated business- professional practice or farm"
## [9] "Working without pay in family business or farm"
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"
INTP is an integer giving the individual's income from interest, dividends, and net rental property. This value can be negative.
JWMNP is an integer giving the time, in minutes, it took the individual to travel to work. It does not appear in the list of variables that need factoring, so it is kept as a number.
JWRIP is an integer code that will be factored into a new column. Its labels describe whether the individual carpooled to work and how many people were in the vehicle.
JWTR is an integer code that will be factored into a new column. Its labels describe the means of transportation used to commute to work, e.g. car, truck, bicycle, bus, worked at home, etc.
MAR is an integer code that will be factored into a new column. Its labels describe the marital status of the individual.
OIP is an integer giving the individual's income that does not fit into the other income categories.
PAP is an integer giving the income the individual received from public assistance.
RETP is an integer giving the income the individual received from retirement sources.
SCH is an integer code that will be factored into a new column. Its labels describe the individual's current school enrollment status, e.g. currently enrolled in public school, private school, etc.
SCHG is an integer code that will be factored into a new column. Its labels describe the grade level the individual is currently attending, e.g. 9th grade, 11th grade, undergraduate, etc.
SCHL is an integer code that will be factored into a new column. Its labels describe the highest level of education the individual has achieved.
SEMP is an integer giving the income the individual received from self-employment. This value can be negative.
SEX is an integer code that will be factored into a new column. Its labels describe the individual's sex.
WAGP is an integer giving the income the individual received from wages and salary over the past 12 months.
WKHP is an integer giving the usual number of hours the individual worked per week over the past 12 months.
WKL is an integer code that will be factored into a new column. Its labels describe how recently the individual last worked, e.g. within 12 months, 1-5 years ago, over 5 years ago.
WKW is an integer code that will be factored into a new column. Its labels describe the number of weeks the individual worked in the past 52 weeks.
WRK is an integer code that will be factored into a new column. Its labels describe whether or not the individual worked last week.
DRIVESP is an integer code that will be factored into a new column. Its labels describe the fraction of a vehicle the individual accounted for while commuting, e.g. an individual who shared a ride with 1 other person (2 total) would have a value of 0.50.
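The ride-sharing example above is simple arithmetic: one vehicle divided among its occupants. A minimal sketch, assuming the fraction is exactly one divided by the number of people in the vehicle (the `occupants` vector is hypothetical, and the survey's own coding may group large carpools differently):

```r
# Hypothetical vehicle occupancies: 1 = drove alone, 2 = shared with one other person, ...
occupants <- c(1, 2, 3, 4)

# Each occupant accounts for an equal share of the vehicle.
vehicle_fraction <- round(1 / occupants, 3)

print(vehicle_fraction)
```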
ESR is an integer code that will be factored into a new column. Its labels describe whether the individual was employed as a civilian, in the armed forces, or not in the labor force.
FOD1P is an integer code that will be factored into a new column. Its labels describe the individual's first reported field of degree, e.g. engineering, legal studies, forestry, etc. This variable has 174 factors.
FOD2P is an integer code that will be factored into a new column. Its labels describe the individual's second reported field of degree, e.g. engineering, legal studies, forestry, etc. This variable has 174 factors.
INDP is an integer code that will be factored into a new column. Its labels describe the industry the individual is employed in. Note: the IND codes are based on the 2012 IND codes.
JWAP is an integer code that will be factored into a new column. Its labels describe the time the individual arrived at work.
JWDP is an integer code that will be factored into a new column. Its labels describe the time the individual left work.
MSP is an integer code that will be factored into a new column. Its labels describe the individual's marital status, including whether a married individual's spouse is present.
NAICSP is an integer code that will be factored into a new column. Its labels describe the individual's NAICS industry code. Note: the NAICS codes used are based on the 2012 NAICS codes.
OCCP is an integer code that will be factored into a new column. Its labels describe the individual's OCC occupation code. Note: this code is based on the 2010 OCC codes.
PERNP is an integer giving the individual's total earnings.
PINCP is an integer giving the individual's total income.
POWSP is an integer code that will be factored into a new column. Its labels describe the State the individual was employed in.
RAC1P is an integer code that will be factored into a new column. Its labels describe the race of the individual.
SCIENGP is an integer code that will be factored into a new column. Its labels flag whether the individual has a degree in Science and Engineering.
For access to the script used to clean this data, use the following link: DropBox Link
Below is an overview of the steps taken to prepare the data.
The following code loads the libraries we need to analyze our data. We also load the cleaned dataset generated in the Dataset Preparation step.
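# Data handling and piping
library(data.table)
library(magrittr)
# Plotting
library(ggplot2)
library(dplyr)
# Prefer fixed notation over scientific notation when printing numbers
options(scipen = 20)
# Load the cleaned dataset from the project directory
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.RData")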
Using the variables listed above, we will be able to look for correlations between worker class, pay, and education. It would be interesting to see which worker classes tend to combine the highest level of education with the lowest pay. We will also be able to tell which worker classes have the highest and lowest pay. Additionally, it will be interesting to see how pay relates to hour of arrival and hour of departure. Perhaps those within the middle 50% of income will have the typical 9 to 5 work schedule.
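A first pass at the pay-by-worker-class comparison could look like the sketch below. It uses base R and a hypothetical five-row data frame in place of `pus.df`; the shortened class labels and earnings values are invented for illustration and are not results from the census data.

```r
# Toy stand-in for the census data frame: worker class and total earnings.
toy <- data.frame(
  COW.f = factor(c("Private", "Private", "Government", "Government", "Self-employed")),
  PERNP = c(30000, 50000, 45000, 55000, 20000)
)

# Median earnings within each class of worker.
median_earnings <- tapply(toy$PERNP, toy$COW.f, median)

# Order the classes from highest to lowest median earnings.
median_earnings <- sort(median_earnings, decreasing = TRUE)

print(median_earnings)
```

The median is used here instead of the mean because earnings are typically skewed by a small number of very high values.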
Additionally, the following types of analyses could provide some interesting insight:
We could look for trends and relationships between employment and basic personal information.
For example, we could compare the past-12-months interest, dividend, and net rental income of local government employees with that of state government employees.
For instance, workers who arrive at work in an earlier time range and leave late might be more likely to carpool, and to carpool with more people.
Within each employment status, we could measure how much work intensity differs between people with different degrees and different fields of degree, and explore why.
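The comparison of interest, dividend, and rental income between local and state government employees could be sketched as below. The data frame `toy` and its values are hypothetical; the two class labels are copied from the Class of Worker labels listed earlier, and the shortened federal label is used only as filler.

```r
local_label <- "Local government employee (city- county- etc.)"
state_label <- "State government employee"

# Toy stand-in for the census data frame: worker class and INTP income.
toy <- data.frame(
  COW.f = factor(c(local_label, local_label, local_label,
                   state_label, state_label, state_label,
                   "Federal government employee")),
  INTP = c(0, 1200, -300, 500, 0, 2500, 100)
)

# Keep only the two groups being compared and drop the unused factor level.
gov <- droplevels(toy[toy$COW.f %in% c(local_label, state_label), ])

# Median INTP per group; INTP can be negative, so the median is a safer summary than the mean.
intp_summary <- aggregate(INTP ~ COW.f, data = gov, FUN = median)

print(intp_summary)
```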
# Count of individuals per vehicle fraction (log-scaled counts)
pus.df[!is.na(pus.df$DRIVESP.f), ] %>%
  ggplot(aes(x = DRIVESP.f)) +
  geom_bar(aes(fill = DRIVESP.f)) +
  scale_y_log10()
# Earnings per weeks worked over past 12 months
ggplot(pus.df[!is.na(pus.df$WKW.f), ], aes(x = PERNP, y = WKW.f)) +
  geom_point(shape = 1, alpha = .05) +
  xlab("Total Earnings") +
  ylab("") + # factors are intuitive
  scale_x_log10(breaks = c(3, 3000, 30000, 300000))
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 2062 rows containing missing values (geom_point).
# Earnings per state
ggplot(pus.df, aes(x = ST.f, y = PERNP)) +
  geom_boxplot() +
  scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
  coord_flip() +
  labs(x = "State", y = "Earnings", title = "Earnings per State")
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 1544664 rows containing non-finite values (stat_boxplot).
# Earnings by class of worker. Interesting: at first glance it looks like
# government employees get paid more than private employees...
ggplot(pus.df, aes(x = COW.f, y = PERNP)) +
  geom_boxplot() +
  scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
  coord_flip() +
  labs(x = "Class of Worker", y = "Earnings", title = "Earnings by Class of Worker")
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 1544664 rows containing non-finite values (stat_boxplot).
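The warnings above are what a log-scaled axis produces when the plotted variable contains zero, negative, or missing values. A minimal sketch of the cause and one possible workaround, using a hypothetical earnings vector instead of `pus.df`; whether dropping non-positive earnings is appropriate for this analysis is an assumption, not a recommendation from the original script.

```r
# Hypothetical earnings values, including a loss, a zero, and a missing value.
earnings <- c(-500, 0, 100, 30000, NA)

# log10 of a negative number is NaN and log10(0) is -Inf, so neither can be
# drawn on a log axis. The warning is suppressed here only to keep output short.
logged <- suppressWarnings(log10(earnings))
print(logged)

# Keeping only strictly positive, non-missing values avoids the warnings.
positive <- earnings[!is.na(earnings) & earnings > 0]
print(log10(positive))
```

Filtering before plotting also makes the number of dropped rows an explicit choice that can be reported alongside the chart.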
# Wages versus age
wages <- pus.df$WAGP
age <- pus.df$AGEP
plot(age, wages, xlab = "Age", ylab = "Wages")