Introduction

The data included in this report comes from the American Community Survey (ACS) made available through Kaggle. The ACS is a survey conducted by the U.S. Census Bureau to gather information about residents living within the United States. The ACS gathers data pertaining to demographics, education, employment, housing, and many other areas.

Dataset description

Below is a list of the variables that our team is most interested in examining.

Group 1

  • Class of worker (COW) - Identifies the class of worker based on type of company/organization. For example, class 2 represents an employee of a private, not-for-profit, tax-exempt, or charitable organization, and class 5 represents a federal government employee.
  • Interest, dividends, and net rental income past 12 months, signed (INTP) - Identifies the amount of interest, dividends, and net rental income received in the past 12 months. The observations may be positive or negative. All amounts are rounded to the nearest $1,000. Five ranges differentiate the observations. For example, positive income has INTP ranging from $2 to $999,999.

Group 2

  • Time of arrival at work (JWAP) - Identifies the hour-and-minute time range in which an employee reaches the workplace. For example, an employee could arrive at work between 12:00am and 12:04am or between 2:40pm and 2:44pm.
  • Time of departure at work (JWDP) - Identifies the hour-and-minute time range in which an employee leaves the workplace. For example, an employee could leave the workplace between 12:00am and 12:04am or between 2:40pm and 2:44pm.
  • Vehicle occupancy (JWRIP) - Identifies the number of people with whom the individual carpooled. For example, an employee could drive alone to work or carpool with one or more people.

Group 3

  • School enrollment (SCH) - Identifies whether the individual is attending public school, private school or not attending.
  • Educational attainment (SCHL) - Identifies the highest level of academic achievement (e.g. Bachelor’s degree, Master’s degree).
  • Field of degree (FOD1P) - Identifies the field of the degree earned for the first entry (e.g. Food Science, Teacher Education).
  • Employment status (ESR) - Identifies the status of employment (e.g. unemployed, not in labor force).
  • Weeks worked during past 12 months (WKW) - Identifies the range of the number of weeks worked in the past 12 months. Ranges include, but are not limited to: 50 to 52, 48 to 49, 27 to 39.

Using the variables listed above, we will be able to see correlations between worker class, pay, and education. It would be interesting to see which worker classes tend to have the highest level of education and the lowest pay. We will also be able to tell which worker classes have the highest and lowest pay rates. Additionally, it will be interesting to see how pay relates to hour of arrival and hour of departure. Perhaps those within the middle 50% of income will have the typical 9-to-5 work schedule.
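As a rough illustration of the worker-class comparison described above, the sketch below ranks worker classes by median income on a small invented data frame. The column name COW.f (following the .f naming this report uses for factor columns), the class labels, and every value are assumptions made for illustration, and INTP stands in as the income measure.

```r
# Toy data for illustration only: these values are invented, not ACS observations.
# COW.f is a hypothetical labelled version of COW.
toy <- data.frame(
  COW.f = c("Private for-profit", "Private for-profit",
            "Federal government", "Federal government",
            "Local government",   "Local government"),
  INTP  = c(0, 1200, 300, 500, -100, 100)
)

# Median income per worker class, sorted from lowest to highest.
median_by_class <- sort(tapply(toy$INTP, toy$COW.f, median))
print(median_by_class)

# Names of the classes at either end of the ranking.
lowest_class  <- names(median_by_class)[1]
highest_class <- names(median_by_class)[length(median_by_class)]
```

The median is used here rather than the mean because signed income amounts can be heavily skewed; that choice is ours for the sketch and may not suit every comparison.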

Additionally, the following types of analyses could provide some interesting insight:

  • Compare income of government employees and non-government employees.
  • Determine which variables (if any) have the largest impact on income.
  • Examine the times of arrival and departure coupled with carpooling. For example, those who leave later than 5pm or arrive earlier than 8am may be less likely to carpool than those who arrive and leave within typical business hours.
  • Attempt to identify which education field and level pays off in terms of secure employment by comparing employment status and number of weeks worked to type of degree earned and level of education obtained.
  • Attempt to identify fields where employment status depends on education level. For example, a master’s in education may show a higher employment rate than a bachelor’s in education.
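Two of the analyses listed above can be sketched in a few lines of base R. The example below is only a sketch on invented data: the column names COW.f, field, level, and employed, along with all labels and values, are hypothetical stand-ins for the recoded ACS variables.

```r
# Toy data for illustration only: values and labels are invented, not ACS observations.
toy <- data.frame(
  COW.f = c("Private for-profit", "Private not-for-profit", "Local government",
            "State government", "Federal government", "Federal government",
            "Private for-profit", "Local government"),
  INTP  = c(100, 0, 250, 400, 800, 600, -50, 150),
  field = c("Education", "Education", "Education", "Education",
            "Food Science", "Food Science", "Food Science", "Food Science"),
  level = c("Bachelor's", "Bachelor's", "Master's", "Master's",
            "Bachelor's", "Bachelor's", "Master's", "Master's"),
  employed = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# 1. Government vs. non-government income.
#    Assumption: government classes can be recognized by the word "government"
#    in their label.
is_gov <- grepl("government", toy$COW.f, ignore.case = TRUE)
sector <- ifelse(is_gov, "Government", "Non-government")
income_by_sector <- tapply(toy$INTP, sector, median)
print(income_by_sector)

# 2. Employment rate by field of degree and education level.
#    The mean of a logical vector is the share of TRUE values.
emp_rate <- tapply(toy$employed, list(toy$field, toy$level), mean)
print(emp_rate)
```

Reading across a row of emp_rate shows whether a higher degree level goes with a higher employment rate within a field, which is the comparison the last bullet describes.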

Dataset preparation

Below is an overview of the steps taken in order to prepare the data.

The code below was used to create the factors.

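# For each column named in colFactor, add a labelled factor column "<name>.f"
# built from the lookup vectors <name>.values and <name>.labels.
for (i in seq_along(colFactor)) {
  tempcol        <- colFactor[i]
  tempcol.f      <- paste0(tempcol, ".f")
  tempcol.values <- as.numeric(get(paste0(tempcol, ".values")))
  tempcol.labels <- get(paste0(tempcol, ".labels"))
  # Extract the column as a plain vector before converting it.
  tempMatrix  <- as.matrix(pus.df[ , tempcol])
  integerTemp <- tempMatrix[, 1]
  pus.df[ , tempcol.f] <- factor(integerTemp, levels = tempcol.values, labels = tempcol.labels)
}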
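To make the inputs of the loop above concrete, here is a minimal self-contained sketch on invented data. The two-column pus.df, the contents of colFactor, and the simplified value and label vectors are all assumptions for illustration; the real lookup vectors come from the ACS data dictionary.

```r
# Hypothetical miniature of the inputs the loop expects (invented for illustration).
pus.df <- data.frame(SCH = c(1, 2, 3, 1), ESR = c(1, 3, 6, 6))

colFactor  <- c("SCH", "ESR")
SCH.values <- c(1, 2, 3)
SCH.labels <- c("Not attending", "Public school", "Private school")
ESR.values <- c(1, 3, 6)
ESR.labels <- c("Employed", "Unemployed", "Not in labor force")

# Same technique as the loop above: look up <name>.values and <name>.labels
# by name with get(), then store the labelled factor as <name>.f.
for (tempcol in colFactor) {
  values <- as.numeric(get(paste0(tempcol, ".values")))
  labels <- get(paste0(tempcol, ".labels"))
  pus.df[[paste0(tempcol, ".f")]] <-
    factor(pus.df[[tempcol]], levels = values, labels = labels)
}

str(pus.df)
```

Any code that does not appear in the values vector becomes NA in the factor, so it is worth checking for unexpected NA values after the conversion.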

Load dataset and libraries

library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:data.table':
## 
##     between, last
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.RData")

Variable summaries

All of the variables can be categorized into three large groups:

Relationships between variables

Conclusion

[Provide an executive summary of your findings.]

[Do not list all your findings, but highlight your insights concerning the dataset.]

Testing Zone

pus.df %>% ggplot(aes(x = DRIVESP.f)) + geom_bar(aes(fill = DRIVESP.f))
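The bar heights in a plot like the one above are counts per factor level. As a quick check without any plotting packages, the sketch below tabulates those counts and their shares on invented data; the level labels and values are hypothetical and are not the actual DRIVESP categories.

```r
# Invented values for illustration; the level labels are hypothetical.
DRIVESP.f <- factor(
  c("Drove alone", "Drove alone", "2-person carpool",
    "Drove alone", "3-person carpool"),
  levels = c("Drove alone", "2-person carpool", "3-person carpool")
)

# Count of observations per level: the numbers a bar chart of this factor displays.
counts <- table(DRIVESP.f)
print(counts)

# The same counts expressed as shares of all observations.
shares <- prop.table(counts)
print(round(shares, 2))
```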