Introduction

Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of the active competitions focuses on the US Census.

The data for this competition is collected by the US Census Bureau, which runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.

The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.

Objective

Our objective may sharpen as we investigate the data, but our starting focus is Unemployment and Job Stability.

Dataset description

Variables of Interest:

The following list is a subset of variables that describe an individual filling out the census.

##  [1] "ST"        "PWGTP"     "AGEP"      "COW"       "INTP"     
##  [6] "JWMNP"     "JWRIP"     "JWTR"      "MAR"       "OIP"      
## [11] "PAP"       "RETP"      "SCH"       "SCHG"      "SCHL"     
## [16] "SEMP"      "SEX"       "WAGP"      "WKHP"      "WKL"      
## [21] "WKW"       "WRK"       "DRIVESP"   "ESR"       "FOD1P"    
## [26] "FOD2P"     "INDP"      "JWAP"      "JWDP"      "MSP"      
## [31] "NAICSP"    "OCCP"      "PERNP"     "PINCP"     "POWSP"    
## [36] "RAC1P"     "SCIENGP"   "SCIENGRLP"

These column names don’t provide much detail, so we will define each below.

Before we do, it is important to note that the following variables need factoring.

##  [1] "ST"        "COW"       "JWRIP"     "JWTR"      "MAR"      
##  [6] "SCH"       "SCHG"      "SCHL"      "SEX"       "WKL"      
## [11] "WKW"       "WRK"       "DRIVESP"   "ESR"       "FOD1P"    
## [16] "FOD2P"     "INDP"      "JWAP"      "JWDP"      "MSP"      
## [21] "NAICSP"    "OCCP"      "POWSP"     "RAC1P"     "SCIENGP"  
## [26] "SCIENGRLP"

To keep things simple, we will define the variables in the order they appear in the data frame.

ST(.f)

Description: State Code.

This is an integer code that will be factored into a new column pairing each State code with its State name. The following output shows the levels and labels used for the new column.

##  [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"
##  [1] "Alabama/AL"              "Alaska/AK"              
##  [3] "Arizona/AZ"              "Arkansas/AR"            
##  [5] "California/CA"           "Colorado/CO"            
##  [7] "Connecticut/CT"          "Delaware/DE"            
##  [9] "District of Columbia/DC" "Florida/FL"             
## [11] "Georgia/GA"              "Hawaii/HI"              
## [13] "Idaho/ID"                "Illinois/IL"            
## [15] "Indiana/IN"              "Iowa/IA"                
## [17] "Kansas/KS"               "Kentucky/KY"            
## [19] "Louisiana/LA"            "Maine/ME"               
## [21] "Maryland/MD"             "Massachusetts/MA"       
## [23] "Michigan/MI"             "Minnesota/MN"           
## [25] "Mississippi/MS"          "Missouri/MO"            
## [27] "Montana/MT"              "Nebraska/NE"            
## [29] "Nevada/NV"               "New Hampshire/NH"       
## [31] "New Jersey/NJ"           "New Mexico/NM"          
## [33] "New York/NY"             "North Carolina/NC"      
## [35] "North Dakota/ND"         "Ohio/OH"                
## [37] "Oklahoma/OK"             "Oregon/OR"              
## [39] "Pennsylvania/PA"         "Rhode Island/RI"        
## [41] "South Carolina/SC"       "South Dakota/SD"        
## [43] "Tennessee/TN"            "Texas/TX"               
## [45] "Utah/UT"                 "Vermont/VT"             
## [47] "Virginia/VA"             "Washington/WA"          
## [49] "West Virginia/WV"        "Wisconsin/WI"           
## [51] "Wyoming/WY"              "Puerto Rico/PR"
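
The factoring described above can be sketched as follows. This is a minimal illustration, assuming a toy data frame `toy.df` (hypothetical) and only three of the state codes listed above. The `.f` suffix follows the naming used in this document; zero-padding the integer code with `sprintf()` is an assumption about how the codes are matched to the two-character levels.

```r
# Toy stand-in for the census data frame (hypothetical values)
toy.df <- data.frame(ST = c(1L, 6L, 72L, 6L))

# Subset of the levels and labels shown above
st.levels <- c("01", "06", "72")
st.labels <- c("Alabama/AL", "California/CA", "Puerto Rico/PR")

# Zero-pad the integer code so it matches the two-character levels,
# then build the new factored column alongside the original
toy.df$ST.f <- factor(sprintf("%02d", toy.df$ST),
                      levels = st.levels,
                      labels = st.labels)

# Count of rows per state label
print(table(toy.df$ST.f))
```

Keeping the original `ST` column next to `ST.f` means the raw code stays available for joins, while the factor is used for plotting and summaries.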

PWGTP

Description: Person’s weight

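This is an integer survey weight: the approximate number of people in the population that this respondent represents. It is not the individual’s body weight.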
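
PWGTP is the ACS person weight, a sampling weight used to scale respondents up to population estimates. The sketch below shows how such a weight is typically applied. The data frame `toy.df` and its values are hypothetical, and this is not taken from the cleaning script.

```r
# Toy sample: each row is one respondent (hypothetical values)
toy.df <- data.frame(
  AGEP  = c(30, 40, 50),
  PWGTP = c(10, 20, 70)   # number of people each respondent represents
)

# Estimated population size is the sum of the weights
pop.estimate <- sum(toy.df$PWGTP)

# Weighted mean age, compared with the unweighted sample mean
weighted.age   <- weighted.mean(toy.df$AGEP, w = toy.df$PWGTP)
unweighted.age <- mean(toy.df$AGEP)

print(c(population = pop.estimate,
        weighted   = weighted.age,
        unweighted = unweighted.age))
```

When the weights are uneven, the weighted and unweighted means differ, which is why summaries meant to describe the population should use PWGTP.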

AGEP

Description: Age

This is an integer of the individual’s age.

COW(.f)

Description: Class of Worker

This is an integer that will be factored into a new column that aligns each Class of Worker code with its description. The categories include employee of a private company, government employee, self-employed, etc…

The following output shows the levels and labels used for the new column. Note: This is the last time we include the levels and labels, given the size of these descriptions. Please refer to our R script for additional details if required.

##  [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"
##  [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"                    
##  [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
##  [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"                                
##  [4] "Local government employee (city- county- etc.)"                                                              
##  [5] "State government employee"                                                                                   
##  [6] "Federal government employee"                                                                                 
##  [7] "Self-employed in own not incorporated business- professional practice- or farm"                              
##  [8] "Self-employed in own incorporated business- professional practice or farm"                                   
##  [9] "Working without pay in family business or farm"                                                              
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"

INTP

Description: Interest, dividends, and net rental income past 12 months.

This is an integer of the individual’s income across interest, dividends, and rental property. This value can be negative.

JWMNP

Description: Travel time to work

This is an integer that gives the time, in minutes, it took the individual to travel to work. It does not appear in the list of variables that need factoring above, so it is kept as a numeric column.

JWRIP(.f)

Description: Vehicle occupancy

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes if the individual carpooled to work and the count of people in the vehicle.

JWTR(.f)

Description: Means of transportation to work

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the vehicle type used to commute to work, e.g. car, truck, bicycle, bus, worked at home, etc…

MAR(.f)

Description: Marital status

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the marital status of the individual.

OIP

Description: All other income past 12 months

This is an integer that describes income of the individual that doesn’t fit into other income categories.

PAP

Description: Public assistance income past 12 months

This is an integer that describes the income an individual received via public assistance.

RETP

Description: Retirement income past 12 months

This is an integer that describes the income an individual received via retirement streams.

SCH(.f)

Description: School enrollment

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the current status of the individual’s enrollment in school, e.g. currently enrolled in public school, private school, etc…

SCHG(.f)

Description: Grade level attending

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the current grade level of the individual. e.g. 9th grade, 11th grade, undergraduate, etc…

SCHL(.f)

Description: Educational attainment

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the highest level of education achieved.

SEMP

Description: Self-employment income past 12 months (signed)

This is an integer that describes the income an individual received via self-employment. This value can be negative.

SEX(.f)

Description: Sex

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s sex.

WAGP

Description: Wages or salary income past 12 months

This is an integer that describes the income an individual received via wages and salary over the past 12 months.

WKHP

Description: Usual hours worked per week past 12 months

This is an integer that describes the usual number of hours worked per week over the past 12 months.

WKL(.f)

Description: When last worked

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes how recently in the past the individual has worked. e.g. Within 12 months, 1-5 years ago, over 5 years ago.

WKW(.f)

Description: Weeks worked during past 12 months

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the number of weeks worked in the past 52 weeks.

WRK(.f)

Description: Worked last week

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes if the individual worked last week or not.

DRIVESP(.f)

Description: Number of vehicles calculated from JWRI

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the fraction of vehicle used while commuting. e.g. If the individual shared a ride with 1 other person (2 total), this value could be 0.50.
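
The arithmetic behind the 0.50 example can be sketched as below. The `occupants` vector is hypothetical, and treating the value as exactly one divided by the number of occupants is an assumption; the published codes may be rounded or grouped for larger carpools.

```r
# Hypothetical number of people in the vehicle, including the driver
occupants <- c(1, 2, 4)

# Each commuter accounts for an equal share of one vehicle
vehicle.share <- 1 / occupants

print(vehicle.share)
```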

ESR(.f)

Description: Employment status recode

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes whether the individual was employed as a civilian, unemployed, in the armed forces, or not in the labor force.

FOD1P(.f)

Description: Recoded field of degree - first entry

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the first field in which the individual holds a degree, e.g. engineering, legal studies, forestry, etc… This variable has 174 factors.

FOD2P(.f)

Description: Recoded field of degree - second entry

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the second field in which the individual holds a degree, e.g. engineering, legal studies, forestry, etc… This variable has 174 factors.

INDP(.f)

Description: Industry recode for 2013 and later based on 2012 IND codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the industry the individual is employed in. Note: The IND codes are based on the 2012 IND codes.

JWAP(.f)

Description: Time of arrival at work - hour and minute

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time the individual arrived at work.

JWDP(.f)

Description: Time of departure for work - hour and minute

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time the individual left for work.

MSP(.f)

Description: Married, spouse present/spouse absent

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s marital status, including whether a married individual’s spouse is present or absent.

NAICSP(.f)

Description: NAICS Industry recode for 2013 and later based on 2012 NAICS codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s NAICS industry code. Note: The NAICS codes used are based on the 2012 NAICS codes.

OCCP(.f)

Description: Occupation recode for 2012 and later based on 2010 OCC codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s occupation. Note: This code is based on the 2010 OCC codes.

PERNP

Description: Total person’s earnings (includes wages- ss- pensions- dividends- etc..)

This is an integer that describes the individual’s total earnings.

PINCP

Description: Total person’s income (signed)

This is an integer that describes the individual’s total income from all sources. This value can be negative.

POWSP(.f)

Description: Place of work - State or foreign country recode

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the State or foreign country the individual was employed in.

RAC1P(.f)

Description: Recoded detailed race code

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the race of the individual.

SCIENGP(.f)

Description: Field of Degree Science and Engineering Flag - NSF Definition

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that flags whether the individual has a degree in Science and Engineering.

SCIENGRLP(.f)

Description: Field of Degree Science and Engineering Related Flag - NSF Definition

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that flags whether the individual has a degree in a Science and Engineering related field.

Dataset Preparation

For access to the script used to clean this data, use the following link: DropBox Link

Below is an overview of the steps taken in order to prepare the data.

  • Load libraries: data.table, magrittr, ggplot2, dplyr.
  • Select variables of interest (38). See Variables of Interest for additional details.
  • Load the two raw data sets provided by Kaggle/US Census. While loading, limit the ingest to only the variables of interest, to save time.
  • Bind the data sets into one.
  • List the subset of variables that need factoring and define variables for the levels and the labels, respectively.
  • Run a for loop to build the new factored columns for all factor variables.
  • Save the resulting data frame to file. This is the data frame used for all of the analysis and research seen here.
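
The preparation steps above can be sketched roughly as follows. This is not the cleaning script itself: the two toy data frames stand in for the raw files, the names `raw.a`, `raw.b`, `vars.of.interest`, and `factor.specs` are hypothetical, and only two factor variables are shown. The sketch uses base R so it runs without extra packages. With data.table, the column restriction would typically happen at read time through the `select` argument of `fread()`; whether the original script does this is an assumption.

```r
# Two toy data frames standing in for the two raw files (hypothetical values)
raw.a <- data.frame(ST = c(1L, 6L),  SEX = c(1L, 2L), AGEP = c(34L, 51L), EXTRA = 0)
raw.b <- data.frame(ST = c(6L, 72L), SEX = c(2L, 2L), AGEP = c(19L, 62L), EXTRA = 0)

# Keep only the variables of interest, then bind the two sets into one
vars.of.interest <- c("ST", "SEX", "AGEP")
toy.df <- rbind(raw.a[, vars.of.interest], raw.b[, vars.of.interest])

# Levels and labels for each variable that needs factoring
factor.specs <- list(
  ST  = list(levels = c(1, 6, 72),
             labels = c("Alabama/AL", "California/CA", "Puerto Rico/PR")),
  SEX = list(levels = c(1, 2),
             labels = c("Male", "Female"))
)

# Build a new "<name>.f" column for every factor variable
for (v in names(factor.specs)) {
  toy.df[[paste0(v, ".f")]] <- factor(toy.df[[v]],
                                      levels = factor.specs[[v]]$levels,
                                      labels = factor.specs[[v]]$labels)
}

str(toy.df)
```

Driving the loop from a single named list keeps the levels and labels for each variable in one place, so adding another factor variable only requires a new list entry.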

Load DataSet and Libraries

The following code loads the relevant libraries for analyzing our data. We also load the cleaned dataset generated in the Dataset Preparation step.

library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
options(scipen = 20)  # favor fixed notation over scientific notation in output

setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.RData")

Ideas to Investigate

Using the variables listed above, we can examine correlations between worker class, pay, and education. It would be interesting to see which worker classes tend to have the highest level of education and the lowest pay. We will also be able to tell which worker classes have the highest and lowest pay rates. Additionally, it will be interesting to see how pay relates to hour of arrival and hour of departure. Perhaps those within the middle 50% of income will have the typical 9 to 5 work schedule.
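
As a starting point for these questions, the sketch below computes median earnings per worker class and picks out the middle 50% of earners. The data frame `toy.df`, its shortened class labels, and its values are hypothetical; only the column names `COW.f` and `PERNP` follow this document.

```r
# Toy data (hypothetical values and shortened class labels)
toy.df <- data.frame(
  COW.f = factor(c("Private", "Private", "Government", "Government", "Self-employed")),
  PERNP = c(30000, 50000, 45000, 55000, 20000)
)

# Median earnings per worker class
median.pay <- tapply(toy.df$PERNP, toy.df$COW.f, median)
print(median.pay)

# Middle 50% of earners: between the 25th and 75th percentiles
q <- quantile(toy.df$PERNP, probs = c(0.25, 0.75))
middle.half <- toy.df[toy.df$PERNP >= q[1] & toy.df$PERNP <= q[2], ]
print(middle.half)
```

On the full data set, the same subset could then be compared against arrival and departure times to check the 9 to 5 hypothesis.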

Additionally, the following types of analyses could provide some interesting insight:

We could try to find trends and relationships

Conclusion

[Provide an executive summary of your findings.]

[Do not list all your findings, but highlight your insights concerning the dataset.]

Exploratory Plots

This section is an area to cut/paste code and can be used as a sandbox for plots.

pus.df[!is.na(pus.df$DRIVESP.f), ] %>%
  ggplot(aes(x = DRIVESP.f)) +
  geom_bar(aes(fill = DRIVESP.f)) +
  scale_y_log10()

# Earnings per weeks worked over past 12 months
ggplot(pus.df[!is.na(pus.df$WKW.f), ], aes(x = PERNP, y = WKW.f)) +
  geom_point(shape = 1, alpha = .05) +
  xlab("Total Earnings") +
  ylab("") +  # factors are intuitive
  scale_x_log10(breaks = c(3, 3000, 30000, 300000))

## Warning in scale$trans$trans(x): NaNs produced

## Warning: Removed 2062 rows containing missing values (geom_point).
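
The two warnings above are likely a consequence of putting earnings on a log scale: some income variables can be negative or zero, and neither has a finite base-10 logarithm. A minimal sketch of the effect, using hypothetical `earnings` values rather than the real data:

```r
# Hypothetical earnings values, including a loss and a zero
earnings <- c(-500, 0, 3000, 30000)

# log10 of a negative number is NaN (with a warning); log10(0) is -Inf
logged <- suppressWarnings(log10(earnings))
print(logged)

# Keeping only strictly positive values avoids both cases
positive <- earnings[earnings > 0]
print(log10(positive))
```

Filtering to positive values before plotting would make the dropped rows an explicit choice rather than a side effect of the scale.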

# Earnings per state
ggplot(pus.df, aes(x = ST.f, y = PERNP)) +
  geom_boxplot() +
  scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
  coord_flip() +
  labs(x = "State", y = "Earnings", title = "Earnings per State")

## Warning in scale$trans$trans(x): NaNs produced

## Warning: Removed 1544664 rows containing non-finite values (stat_boxplot).

# Earnings by class of worker. Interesting: at first glance it looks like
# government employees get paid more than private employees...
ggplot(pus.df, aes(x = COW.f, y = PERNP)) +
  geom_boxplot() +
  scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
  coord_flip() +
  labs(x = "Class of Worker", y = "Earnings", title = "Earnings by Class of Worker")

## Warning in scale$trans$trans(x): NaNs produced

## Warning: Removed 1544664 rows containing non-finite values (stat_boxplot).

# Wages by age
wages <- pus.df$WAGP
age <- pus.df$AGEP
plot(age, wages, xlab = "Age", ylab = "Wages")