Introduction

Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of the active competitions is focused on the USA Census.

The data for this competition comes from the US Census Bureau, which runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.

The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.

Objective

Our objective may sharpen as we investigate the data, but we want to look at Unemployment and Job Stability.

Dataset description

Variables of Interest:

The following list is a subset of variables that describe an individual filling out the census.

##  [1] "ST"        "PWGTP"     "AGEP"      "COW"       "INTP"     
##  [6] "JWMNP"     "JWRIP"     "JWTR"      "MAR"       "OIP"      
## [11] "PAP"       "RETP"      "SCH"       "SCHG"      "SCHL"     
## [16] "SEMP"      "SEX"       "WAGP"      "WKHP"      "WKL"      
## [21] "WKW"       "WRK"       "DRIVESP"   "ESR"       "FOD1P"    
## [26] "FOD2P"     "INDP"      "JWAP"      "JWDP"      "MSP"      
## [31] "NAICSP"    "OCCP"      "PERNP"     "PINCP"     "POWSP"    
## [36] "RAC1P"     "SCIENGP"   "SCIENGRLP"

These column names don’t provide much detail, so we will define each below.

Before we do, it is important to note that the following variables need factoring.

##  [1] "ST"        "COW"       "JWRIP"     "JWTR"      "MAR"      
##  [6] "SCH"       "SCHG"      "SCHL"      "SEX"       "WKL"      
## [11] "WKW"       "WRK"       "DRIVESP"   "ESR"       "FOD1P"    
## [16] "FOD2P"     "INDP"      "JWAP"      "JWDP"      "MSP"      
## [21] "NAICSP"    "OCCP"      "POWSP"     "RAC1P"     "SCIENGP"  
## [26] "SCIENGRLP"

To keep things simple, I will be defining variables in the order they appear in the data frame.

ST(.f)

Description: State Code.

This is an integer that will be factored into a new column that maps each State code to a State name. The following output shows the levels and labels used for the new column.

##  [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"

##  [1] "Alabama/AL"              "Alaska/AK"              
##  [3] "Arizona/AZ"              "Arkansas/AR"            
##  [5] "California/CA"           "Colorado/CO"            
##  [7] "Connecticut/CT"          "Delaware/DE"            
##  [9] "District of Columbia/DC" "Florida/FL"             
## [11] "Georgia/GA"              "Hawaii/HI"              
## [13] "Idaho/ID"                "Illinois/IL"            
## [15] "Indiana/IN"              "Iowa/IA"                
## [17] "Kansas/KS"               "Kentucky/KY"            
## [19] "Louisiana/LA"            "Maine/ME"               
## [21] "Maryland/MD"             "Massachusetts/MA"       
## [23] "Michigan/MI"             "Minnesota/MN"           
## [25] "Mississippi/MS"          "Missouri/MO"            
## [27] "Montana/MT"              "Nebraska/NE"            
## [29] "Nevada/NV"               "New Hampshire/NH"       
## [31] "New Jersey/NJ"           "New Mexico/NM"          
## [33] "New York/NY"             "North Carolina/NC"      
## [35] "North Dakota/ND"         "Ohio/OH"                
## [37] "Oklahoma/OK"             "Oregon/OR"              
## [39] "Pennsylvania/PA"         "Rhode Island/RI"        
## [41] "South Carolina/SC"       "South Dakota/SD"        
## [43] "Tennessee/TN"            "Texas/TX"               
## [45] "Utah/UT"                 "Vermont/VT"             
## [47] "Virginia/VA"             "Washington/WA"          
## [49] "West Virginia/WV"        "Wisconsin/WI"           
## [51] "Wyoming/WY"              "Puerto Rico/PR"
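A minimal sketch of this factoring step, assuming ST has been read as an integer (as described above) and showing only the first three codes and labels. The names `st_levels`, `st_labels`, and the toy `df` are illustrative and not taken from our R script. Because the levels are two-character strings, the integer is zero-padded before matching.

```r
# Toy data frame standing in for the census data (values are made up).
df <- data.frame(ST = c(1L, 4L, 2L, 1L))

# First three levels and labels from the lists above (illustrative subset).
st_levels <- c("01", "02", "04")
st_labels <- c("Alabama/AL", "Alaska/AK", "Arizona/AZ")

# Zero-pad the integer code so it matches the two-character levels,
# then store the labelled factor in a new ".f" column.
df$ST.f <- factor(sprintf("%02d", df$ST), levels = st_levels, labels = st_labels)

print(df)
```

Any code that is missing from the levels becomes NA in the new column, which is a quick way to spot codes that were left out.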

PWGTP

Description: Person’s weight

This is an integer sampling weight, not a body weight: it is the number of people in the population that this record represents.
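In the ACS public-use files, PWGTP is a person-level sampling weight, so population estimates should weight each record by it. The sketch below shows how the weight would enter an estimate, using base R's `weighted.mean()`. The `people` data frame and its values are made up for illustration.

```r
# Toy records (made-up values): age and person weight.
people <- data.frame(
  AGEP  = c(30, 45, 60),
  PWGTP = c(100, 50, 50)
)

# Unweighted mean treats every record equally.
unweighted_age <- mean(people$AGEP)

# Weighted mean counts each record PWGTP times.
weighted_age <- weighted.mean(people$AGEP, w = people$PWGTP)

# The weights sum to the number of people the sample represents.
estimated_population <- sum(people$PWGTP)

print(c(unweighted = unweighted_age, weighted = weighted_age, population = estimated_population))
```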

AGEP

Description: Age

This is an integer of the individual’s age.

COW(.f)

Description: Class of Worker

This is an integer that will be factored into a new column that maps each Class of Worker code to its description. The data will range from employee of a private company, government employee, self-employed, etc…

The following output shows the levels and labels used for the new column. Note: This will be the last time I include the levels and labels, given the size of these descriptions. Please refer to our R script for additional details if required.

##  [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"

##  [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"                    
##  [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
##  [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"                                
##  [4] "Local government employee (city- county- etc.)"                                                              
##  [5] "State government employee"                                                                                   
##  [6] "Federal government employee"                                                                                 
##  [7] "Self-employed in own not incorporated business- professional practice- or farm"                              
##  [8] "Self-employed in own incorporated business- professional practice or farm"                                   
##  [9] "Working without pay in family business or farm"                                                              
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"

INTP

Description: Interest, dividends, and net rental income past 12 months.

This is an integer of the individual’s income across interest, dividends, and rental property. This value can be negative.

JWMNP

Description: Travel time to work

This is an integer giving the time, in minutes, that the individual took to travel to work. It does not appear in the list of variables to factor above, so it is kept as a number.

JWRIP(.f)

Description: Vehicle occupancy

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes if the individual carpooled to work and the count of people in the vehicle.

JWTR(.f)

Description: Means of transportation to work

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the means of transportation used to commute to work, e.g. car, truck, bicycle, bus, worked at home, etc…

MAR(.f)

Description: Marital status

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the marital status of the individual.

OIP

Description: All other income past 12 months

This is an integer that describes income of the individual that doesn’t fit into other income categories.

PAP

Description: Public assistance income past 12 months

This is an integer that describes the income an individual received via public assistance.

RETP

Description: Retirement income past 12 months

This is an integer that describes the income an individual received via retirement streams.

SCH(.f)

Description: School enrollment

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the current status of the individual’s enrollment in school, e.g. currently enrolled in public school, private school, etc…

SCHG(.f)

Description: Grade level attending

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the current grade level of the individual. e.g. 9th grade, 11th grade, undergraduate, etc…

SCHL(.f)

Description: Educational attainment

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the highest level of education achieved.

SEMP

Description: Self-employment income past 12 months (signed)

This is an integer that describes the income an individual received via self-employment. This value can be negative.

SEX(.f)

Description: Sex

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s sex.

WAGP

Description: Wages or salary income past 12 months

This is an integer that describes the income an individual received via wages and salary over the past 12 months.

WKHP

Description: Usual hours worked per week past 12 months

This is an integer that describes the usual number of hours worked per week over the past 12 months.

WKL(.f)

Description: When last worked

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes how recently in the past the individual has worked. e.g. Within 12 months, 1-5 years ago, over 5 years ago.

WKW(.f)

Description: Weeks worked during past 12 months

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the number of weeks worked in the past 52 weeks.

WRK(.f)

Description: Worked last week

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes if the individual worked last week or not.

DRIVESP(.f)

Description: Number of vehicles calculated from JWRI

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the fraction of a vehicle used while commuting. e.g. If the individual shared a ride with 1 other person (2 total), this value would be 0.50.
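As a small worked example of that arithmetic, assuming the fraction is simply the reciprocal of the number of people in the vehicle (which is what the 0.50 example suggests); `occupants` and `vehicle_share` are illustrative names.

```r
# Number of people in the vehicle, driver included (illustrative values).
occupants <- c(1, 2, 3, 4)

# Each commuter accounts for one vehicle divided by the people sharing it.
vehicle_share <- 1 / occupants

print(round(vehicle_share, 3))
```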

ESR(.f)

Description: Employment status recode

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes whether the individual was employed as a civilian, in the armed forces, unemployed, or not in the labor force.

FOD1P(.f)

Description: Recoded field of degree - first entry

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the first field of degree the individual reported. e.g. Engineering, legal studies, forestry, etc… This variable has 174 factors.

FOD2P(.f)

Description: Recoded field of degree - second entry

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the second field of degree the individual reported. e.g. Engineering, legal studies, forestry, etc… This variable has 174 factors.

INDP(.f)

Description: Industry recode for 2013 and later based on 2012 IND codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the industry the individual is employed in. Note: The IND codes are based on 2012 IND codes.

JWAP(.f)

Description: Time of arrival at work - hour and minute

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time the individual arrived at work.

JWDP(.f)

Description: Time of departure for work - hour and minute

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the time the individual left for work.

MSP(.f)

Description: Married, spouse present/spouse absent

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s marital status, including whether a spouse is present or absent.

NAICSP(.f)

Description: NAICS Industry recode for 2013 and later based on 2012 NAICS codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s NAICS industry code. Note: The NAICS codes used are based on 2012 NAICS codes.

OCCP(.f)

Description: Occupation recode for 2012 and later based on 2010 OCC codes

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the individual’s occupation (OCC) code. Note: This code is based on the 2010 OCC codes.

PERNP

Description: Total person’s earnings (includes wages- ss- pensions- dividends- etc..)

This is an integer that describes the individual’s total earnings.

PINCP

Description: Total person’s income (signed)

This is an integer that describes the individual’s total income from all sources. This value can be negative.

POWSP(.f)

Description: Place of work - State or foreign country recode

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the State the individual was employed in.

RAC1P(.f)

Description: Recoded detailed race code

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that describes the race of the individual.

SCIENGP(.f)

Description: Field of Degree Science and Engineering Flag - NSF Definition

This is an integer that will be factored into a new column. The code associated with this variable will be translated into a string that flags whether the individual has a degree in Science and Engineering.

SCIENGRLP(.f)

Dataset Preparation

For access to the script used to clean this data, use the following link: DropBox Link

Below is an overview of the steps taken in order to prepare the data.

Load libraries:
data.table
magrittr
ggplot2
dplyr
Select variables of interest (38). See Variables of Interest for additional details.
Load the two raw data sets provided by Kaggle/US Census. While selecting, limit the ingest to only the variables of interest, to save time.
Bind data sets into one.
List the subset of variables that need factoring and define variables for the levels and the labels, respectively.
Run a for loop to build the new factored columns for all factor variables.
Save the resulting data frame to file, for use. This is the data frame that will be used for all of our analysis and research seen here.
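The steps above can be sketched roughly as follows. This is not the project script: the two temporary CSV files, the four-variable subset, the `factor_specs` list, and the output file are all stand-ins made up for illustration, and the codes are matched as plain integers for brevity.

```r
library(data.table)

# Variables of interest (illustrative subset of the 38).
vars <- c("ST", "AGEP", "SEX", "WAGP")

# Two tiny CSV files standing in for the two raw files (made-up values).
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
fwrite(data.table(ST = c(1, 2), AGEP = c(34, 51), SEX = c(1, 2),
                  WAGP = c(39000, 20000), EXTRA = c(0, 0)), file_a)
fwrite(data.table(ST = c(4, 1), AGEP = c(27, 62), SEX = c(2, 1),
                  WAGP = c(2000, 0), EXTRA = c(0, 0)), file_b)

# Read only the variables of interest from each file, then bind into one table.
pus <- rbindlist(list(fread(file_a, select = vars), fread(file_b, select = vars)))

# Levels and labels for each variable that needs factoring.
factor_specs <- list(
  ST  = list(levels = c(1, 2, 4),
             labels = c("Alabama/AL", "Alaska/AK", "Arizona/AZ")),
  SEX = list(levels = c(1, 2), labels = c("Male", "Female"))
)

# Build a new "<name>.f" column for every factor variable.
for (v in names(factor_specs)) {
  spec <- factor_specs[[v]]
  set(pus, j = paste0(v, ".f"),
      value = factor(pus[[v]], levels = spec$levels, labels = spec$labels))
}

# Save the prepared data for later analysis.
out_file <- tempfile(fileext = ".RData")
save(pus, file = out_file)
```

Passing `select` to `fread()` skips the unused columns at read time, which is where most of the time saving on files this size would come from.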

Load DataSet and Libraries

The following code loads relevant libraries for us to analyze our data. We also load the cleaned dataset generated from the Dataset Preparation step.

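# Libraries used throughout the analysis
library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
# Strongly prefer fixed over scientific notation when printing numbers
options(scipen = 20)

# Load the cleaned data frame saved by the Dataset Preparation step
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.v2.RData")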

Ideas to Investigate

Using the variables listed above, we will be able to see correlations between worker class, pay, and education. It would be interesting to see which worker classes tend to have the highest level of education and lowest pay. We will also be able to tell which worker classes have the highest and lowest pay rates. Additionally, it will be interesting to see how pay relates to hour of arrival and hour of departure. Perhaps those within the middle 50% of income will have the typical 9 to 5 work schedule.

Additionally, the following types of analyses could provide some interesting insight:

Compare income of government employees and non-government employees.
Determine which variables (if any) have the largest impact on income.
Examine the times of arrival and departure coupled with carpooling. For example, those who leave later than 5 or arrive earlier than 8 may be less likely to carpool than those who arrive and leave within typical business hours.
Attempt to identify which education field and level pays off in terms of secure employment by comparing employment status and number of weeks worked to type of degree earned and level of education obtained.
Attempt to identify fields where employment status depends on education level. For example, a master’s in education may demonstrate a higher employment rate than a bachelor’s in education.
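As a starting point for the first idea in the list above, here is a rough sketch of a government versus non-government wage comparison. It runs on made-up rows rather than pus.df, and the `sector` grouping (any COW.f label containing "government") is an assumption for illustration, as is weighting by PWGTP.

```r
library(dplyr)

# Toy rows (made-up values) using Class of Worker labels from the list above.
toy <- data.frame(
  COW.f = c("State government employee",
            "Federal government employee",
            "State government employee",
            "Employee of a private not-for-profit- tax-exempt- or charitable organization",
            "Self-employed in own incorporated business- professional practice or farm"),
  WAGP  = c(50000, 70000, 60000, 40000, 80000),
  PWGTP = c(10, 20, 10, 30, 10)
)

# Collapse worker classes into two sectors, then summarise wages per sector.
income_by_sector <- toy %>%
  mutate(sector = ifelse(grepl("government", COW.f), "Government", "Non-government")) %>%
  group_by(sector) %>%
  summarise(
    n = n(),
    mean_wage = mean(WAGP),
    weighted_mean_wage = weighted.mean(WAGP, PWGTP)
  )

print(income_by_sector)
```

On the real data the same pipeline would start from pus.df, and wages of 0 would need a decision (keep or drop) before the means are comparable.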

We could try to find trends and relationships:

Between education and employment
Between education and basic personal information
Between employment and basic personal information
We could describe the economic situation of each class of worker and compare the situations of two classes.
For example, we could compare the past-12-months interest, dividends, and net rental income of local government employees with that of state government employees.
We could check whether time of arrival at work, time of departure for work, and carpooling are related.
For instance, workers who arrive in an earlier time range and leave late might be more likely to carpool, and to carpool with more people.
Among the employed, we could measure how much work intensity differs between people with different degrees and fields of degree, and look for reasons behind the difference.

Conclusion

[Provide an executive summary of your findings.]

[Do not list all your findings, but highlight your insights concerning the dataset.]

Exploratory Plots

This section is an area to cut/paste code and can be used as a sandbox for plots.

Alex’s Graphs

######################
#                    #
#  Alex's stuff      #
#                    #
######################
wagesex = select(pus.df, SEX.f, WAGP)
wagesex = na.omit(wagesex)
g = ggplot(wagesex, aes(x=SEX.f, y=WAGP))
g + geom_violin()

sci = select(pus.df, SCIENGP.f, WAGP)
sci <- na.omit(sci)
sci

## Source: local data frame [683,328 x 2]
## 
##    SCIENGP.f   WAGP
##       (fctr)  (int)
## 1        No   39000
## 2       Yes   20000
## 3        No    2000
## 4       Yes   16000
## 5        No  100000
## 6       Yes       0
## 7        No       0
## 8        No  322000
## 9        No       0
## 10      Yes       0
## ..       ...    ...

ticks<-c(10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000)
g <- ggplot(sci, aes(x=SCIENGP.f, y=WAGP))
g + geom_violin() + scale_y_continuous(limits=c(0,200000), breaks=ticks)

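wageData <- select(pus.df, WAGP, AGEP)
wageData <- na.omit(wageData)
# Keep positive wages only; restrict ages here, because xlim() does not apply to a discrete x axis
wageData <- wageData[wageData$WAGP > 0 & wageData$AGEP <= 70, ]
ticks <- seq(0, 95, by = 5)
a <- ggplot(wageData, aes(x = factor(AGEP), y = WAGP))
# Hide outlier points
a <- a + geom_boxplot(outlier.shape = NA)
# Zoom the y axis without dropping rows, so the box statistics use all wages
a <- a + coord_cartesian(ylim = c(0, 130000))
a + scale_x_discrete(breaks = ticks)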

Jon’s graphs

DegreeData <- select(pus.df, FOD1P.f, WAGP)
DegreeData <- na.omit(DegreeData)

# Interesting groupings of earnings around very high wages. Not sure why. As a first look, plot the first 40,000 rows; the wage cut at $300,000 is applied further down.
DegreeData2 <- DegreeData[1:40000,]
DegreeData2 %>%
  ggplot(aes(y=FOD1P.f, x=WAGP)) + 
  geom_point(alpha = 0.5)

# I'm going to "paginate" the degrees so we can look through all of them with appropriate real estate
DegreeData <- subset(DegreeData, WAGP < 300000)
# %in% tests set membership; == would recycle the vector and silently drop most matching rows
DegreeData01 <- subset(DegreeData, FOD1P.f %in% c("GENERAL AGRICULTURE",
                                                 "AGRICULTURE PRODUCTION AND MANAGEMENT",
                                                 "AGRICULTURAL ECONOMICS",
                                                 "ANIMAL SCIENCES",
                                                 "FOOD SCIENCE",
                                                 "PLANT SCIENCE AND AGRONOMY",
                                                 "SOIL SCIENCE",
                                                 "MISCELLANEOUS AGRICULTURE",
                                                 "ENVIRONMENTAL SCIENCE",
                                                 "FORESTRY",
                                                 "NATURAL RESOURCES MANAGEMENT",
                                                 "ARCHITECTURE",
                                                 "AREA ETHNIC AND CIVILIZATION STUDIES",
                                                 "COMMUNICATIONS",
                                                 "JOURNALISM"))
DegreeData02 <- subset(DegreeData, FOD1P.f %in% c("MASS MEDIA",
                                                "ADVERTISING AND PUBLIC RELATIONS",
                                                "COMMUNICATION TECHNOLOGIES",
                                                "COMPUTER AND INFORMATION SYSTEMS",
                                                "COMPUTER PROGRAMMING AND DATA PROCESSING",
                                                "COMPUTER SCIENCE",
                                                "INFORMATION SCIENCES",
                                                "COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY",
                                                "COMPUTER NETWORKING AND TELECOMMUNICATIONS",
                                                "COSMETOLOGY SERVICES AND CULINARY ARTS",
                                                "GENERAL EDUCATION",
                                                "EDUCATIONAL ADMINISTRATION AND SUPERVISION",
                                                "SCHOOL STUDENT COUNSELING",
                                                "ELEMENTARY EDUCATION",
                                                "MATHEMATICS TEACHER EDUCATION",
                                                "PHYSICAL AND HEALTH EDUCATION TEACHING",
                                                "EARLY CHILDHOOD EDUCATION",
                                                "SCIENCE AND COMPUTER TEACHER EDUCATION",
                                                "SECONDARY TEACHER EDUCATION"))
DegreeData03 <- subset(DegreeData, FOD1P.f %in% c("SPECIAL NEEDS EDUCATION",
                                                "SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION",
                                                "TEACHER EDUCATION: MULTIPLE LEVELS",
                                                "LANGUAGE AND DRAMA EDUCATION",
                                                "ART AND MUSIC EDUCATION",
                                                "MISCELLANEOUS EDUCATION",
                                                "GENERAL ENGINEERING",
                                                "AEROSPACE ENGINEERING",
                                                "BIOLOGICAL ENGINEERING",
                                                "ARCHITECTURAL ENGINEERING",
                                                "BIOMEDICAL ENGINEERING",
                                                "CHEMICAL ENGINEERING",
                                                "CIVIL ENGINEERING",
                                                "COMPUTER ENGINEERING",
                                                "ELECTRICAL ENGINEERING",
                                                "ENGINEERING MECHANICS PHYSICS AND SCIENCE",
                                                "ENVIRONMENTAL ENGINEERING",
                                                "GEOLOGICAL AND GEOPHYSICAL ENGINEERING",
                                                "INDUSTRIAL AND MANUFACTURING ENGINEERING",
                                                "MATERIALS ENGINEERING AND MATERIALS SCIENCE"))
DegreeData04 <- subset(DegreeData, FOD1P.f %in% c("MECHANICAL ENGINEERING",
                                                "METALLURGICAL ENGINEERING",
                                                "MINING AND MINERAL ENGINEERING",
                                                "NAVAL ARCHITECTURE AND MARINE ENGINEERING",
                                                "NUCLEAR ENGINEERING",
                                                "PETROLEUM ENGINEERING",
                                                "MISCELLANEOUS ENGINEERING",
                                                "ENGINEERING TECHNOLOGIES",
                                                "ENGINEERING AND INDUSTRIAL MANAGEMENT",
                                                "ELECTRICAL ENGINEERING TECHNOLOGY",
                                                "INDUSTRIAL PRODUCTION TECHNOLOGIES",
                                                "MECHANICAL ENGINEERING RELATED TECHNOLOGIES",
                                                "MISCELLANEOUS ENGINEERING TECHNOLOGIES",
                                                "LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE",
                                                "FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES",
                                                "OTHER FOREIGN LANGUAGES",
                                                "FAMILY AND CONSUMER SCIENCES",
                                                "COURT REPORTING",
                                                "PRE-LAW AND LEGAL STUDIES",
                                                "ENGLISH LANGUAGE AND LITERATURE"))
DegreeData05 <- subset(DegreeData, FOD1P.f %in% c("COMPOSITION AND RHETORIC",
                                                "LIBERAL ARTS",
                                                "HUMANITIES",
                                                "LIBRARY SCIENCE",
                                                "BIOLOGY",
                                                "BIOCHEMICAL SCIENCES",
                                                "BOTANY",
                                                "MOLECULAR BIOLOGY",
                                                "ECOLOGY",
                                                "GENETICS",
                                                "MICROBIOLOGY",
                                                "PHARMACOLOGY",
                                                "PHYSIOLOGY",
                                                "ZOOLOGY",
                                                "NEUROSCIENCE",
                                                "MISCELLANEOUS BIOLOGY",
                                                "MATHEMATICS",
                                                "APPLIED MATHEMATICS",
                                                "STATISTICS AND DECISION SCIENCE",
                                                "MILITARY TECHNOLOGIES"))
DegreeData06 <- subset(DegreeData, FOD1P.f %in% c("MULTI/INTERDISCIPLINARY STUDIES",
                                                "INTERCULTURAL AND INTERNATIONAL STUDIES",
                                                "NUTRITION SCIENCES",
                                                "MATHEMATICS AND COMPUTER SCIENCE",
                                                "COGNITIVE SCIENCE AND BIOPSYCHOLOGY",
                                                "INTERDISCIPLINARY SOCIAL SCIENCES",
                                                "PHYSICAL FITNESS PARKS RECREATION AND LEISURE",
                                                "PHILOSOPHY AND RELIGIOUS STUDIES",
                                                "THEOLOGY AND RELIGIOUS VOCATIONS",
                                                "PHYSICAL SCIENCES",
                                                "ASTRONOMY AND ASTROPHYSICS",
                                                "ATMOSPHERIC SCIENCES AND METEOROLOGY",
                                                "CHEMISTRY",
                                                "GEOLOGY AND EARTH SCIENCE",
                                                "GEOSCIENCES",
                                                "OCEANOGRAPHY",
                                                "PHYSICS",
                                                "MATERIALS SCIENCE",
                                                "MULTI-DISCIPLINARY OR GENERAL SCIENCE"))
DegreeData07 <- subset(DegreeData, FOD1P.f %in% c("NUCLEAR- INDUSTRIAL RADIOLOGY- AND BIOLOGICAL TECHNOLOGIES",
                                                "PSYCHOLOGY",
                                                "EDUCATIONAL PSYCHOLOGY",
                                                "CLINICAL PSYCHOLOGY",
                                                "COUNSELING PSYCHOLOGY",
                                                "INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY",
                                                "SOCIAL PSYCHOLOGY",
                                                "MISCELLANEOUS PSYCHOLOGY",
                                                "CRIMINAL JUSTICE AND FIRE PROTECTION",
                                                "PUBLIC ADMINISTRATION",
                                                "PUBLIC POLICY",
                                                "HUMAN SERVICES AND COMMUNITY ORGANIZATION",
                                                "SOCIAL WORK",
                                                "GENERAL SOCIAL SCIENCES",
                                                "ECONOMICS",
                                                "ANTHROPOLOGY AND ARCHEOLOGY",
                                                "CRIMINOLOGY",
                                                "GEOGRAPHY",
                                                "INTERNATIONAL RELATIONS",
                                                "POLITICAL SCIENCE AND GOVERNMENT"))
DegreeData08 <- subset(DegreeData, FOD1P.f %in% c("SOCIOLOGY",
                                                "MISCELLANEOUS SOCIAL SCIENCES",
                                                "CONSTRUCTION SERVICES",
                                                "ELECTRICAL- MECHANICAL- AND PRECISION TECHNOLOGIES ANDPRODUCTION",
                                                "TRANSPORTATION SCIENCES AND TECHNOLOGIES",
                                                "FINE ARTS",
                                                "DRAMA AND THEATER ARTS",
                                                "MUSIC",
                                                "VISUAL AND PERFORMING ARTS",
                                                "COMMERCIAL ART AND GRAPHIC DESIGN",
                                                "FILM VIDEO AND PHOTOGRAPHIC ARTS",
                                                "ART HISTORY AND CRITICISM",
                                                "STUDIO ARTS",
                                                "MISCELLANEOUS FINE ARTS",
                                                "GENERAL MEDICAL AND HEALTH SERVICES",
                                                "COMMUNICATION DISORDERS SCIENCES AND SERVICES",
                                                "HEALTH AND MEDICAL ADMINISTRATIVE SERVICES",
                                                "MEDICAL ASSISTING SERVICES",
                                                "MEDICAL TECHNOLOGIES TECHNICIANS"))
DegreeData09 <- subset(DegreeData, FOD1P.f %in% c("HEALTH AND MEDICAL PREPARATORY PROGRAMS",
                                                "NURSING",
                                                "PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION",
                                                "TREATMENT THERAPY PROFESSIONS",
                                                "COMMUNITY AND PUBLIC HEALTH",
                                                "MISCELLANEOUS HEALTH MEDICAL PROFESSIONS",
                                                "GENERAL BUSINESS",
                                                "ACCOUNTING",
                                                "ACTUARIAL SCIENCE",
                                                "BUSINESS MANAGEMENT AND ADMINISTRATION",
                                                "OPERATIONS LOGISTICS AND E-COMMERCE",
                                                "BUSINESS ECONOMICS",
                                                "MARKETING AND MARKETING RESEARCH",
                                                "FINANCE",
                                                "HUMAN RESOURCES AND PERSONNEL MANAGEMENT",
                                                "INTERNATIONAL BUSINESS",
                                                "HOSPITALITY MANAGEMENT",
                                                "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
                                                "MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION",
                                                "HISTORY",
                                                "UNITED STATES HISTORY"))
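
#A minimal sketch on toy data (toy and wanted are made-up names, not part of
#the analysis) of how == and %in% differ when filtering on several labels.
#== recycles the right-hand vector and compares position by position, while
#%in% asks whether each value belongs to the set.
toy <- data.frame(major = c("MUSIC", "FINE ARTS", "MUSIC", "HISTORY"),
                  wage  = c(10, 20, 30, 40))
wanted <- c("FINE ARTS", "MUSIC")
rows.eq <- subset(toy, major == wanted)    #position-by-position comparison
rows.in <- subset(toy, major %in% wanted)  #set membership
nrow(rows.eq) #no row happens to line up with the recycled labels
nrow(rows.in) #every row whose major is one of the wanted labels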


#Wages by field of degree, one scatter plot per group of majors
degree.subsets <- list(DegreeData01, DegreeData02, DegreeData03,
                       DegreeData04, DegreeData05, DegreeData06,
                       DegreeData07, DegreeData08, DegreeData09)
for (degree.subset in degree.subsets) {
  a <- ggplot(degree.subset, aes(WAGP, FOD1P.f)) +
    geom_point(alpha = 0.2)
  print(a) #inside a loop, a plot must be printed explicitly
}

#Explore different types of earnings:
#Remove NAs
bb <- pus.df[!is.na(pus.df$PERNP),] #PERNP: total person's earnings (wages plus self-employment income)
bb <- bb[!is.na(bb$WAGP),]
bb <- bb[!is.na(bb$PINCP),]
bb <- bb[!is.na(bb$INTP),]
#Interestingly, wherever PERNP has a value, WAGP, PINCP, and INTP have one too: removing NAs from the remaining earnings columns did not shrink the data any further
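#A quick way to check that observation, sketched on toy data (toy.earn and
#count.na are hypothetical names, not part of the original analysis): count
#the NAs left in each earnings column once rows lacking PERNP are dropped.
count.na <- function(df, cols) colSums(is.na(df[, cols, drop = FALSE]))
toy.earn <- data.frame(PERNP = c(100, NA, 300),
                       WAGP  = c(90, NA, 250),
                       PINCP = c(120, NA, 400),
                       INTP  = c(0, NA, 50))
na.left <- count.na(toy.earn[!is.na(toy.earn$PERNP), ],
                    c("WAGP", "PINCP", "INTP"))
na.left
#All zeros would mean the later NA filters remove nothing; the same call
#could be pointed at pus.df.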
summary(bb$PERNP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9000       0   10000   27860   40000 1019000

summary(bb$WAGP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    6600   26060   37000  660000

summary(bb$PINCP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -13600    7300   21400   36310   46000 1281000

summary(bb$INTP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6300       0       0    2075       0  300000

#Colorful graph of count of people who carpool
pus.df[!is.na(pus.df$DRIVESP.f),] %>%
  ggplot(aes(x = DRIVESP.f)) +
  geom_bar(aes(fill = DRIVESP.f)) +
  scale_y_log10()

#For some graphs that take a long time, I've used a subset
subSet40k <- pus.df[1:40000,]
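
#The first 40,000 rows may not be representative if the file is ordered
#(for example by state). A hedged alternative, sketched with a hypothetical
#helper sample.rows: draw rows at random with a fixed seed.
sample.rows <- function(df, n, seed = 1) {
  set.seed(seed) #fixed seed keeps the draw repeatable
  df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}
#Usage (assumption: pus.df is loaded as above):
#subSetRandom <- sample.rows(pus.df, 40000)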

#Earnings per weeks worked over past 12 months.  Earnings axis at log scale
ggplot(subSet40k[!is.na(subSet40k$WKW.f),], aes(x=PERNP, y=WKW.f)) + 
  geom_point(shape=1, alpha = .05) +
  xlab("Total Earnings") + 
  ylab("") + #factors are intuitive
  scale_x_log10(breaks = c(3, 3000, 30000, 300000))

#Earnings per state.  Earnings at log scale
#I wish I could limit the range of the x axis
ggplot(pus.df, aes(x=ST.f, y=PERNP)) +
  geom_boxplot(outlier.size = 0) +
  scale_y_log10(breaks = c(10000, 50000, 100000, 400000)) +
  coord_flip() +
  labs(x="State", y="Earnings", title="Earnings per State")

#Earnings by class of worker.  Interesting: at first glance, it looks like government employees get paid more than private-sector ones...
ggplot(pus.df, aes(x=COW.f, y=PERNP)) +
  geom_boxplot() +
  scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
  coord_flip() +
  labs(x="Class of Worker", y="Earnings", title="Earnings by Class of Worker")

#earnings by time of departure for work.  Earnings axis at log scale
ggplot(subSet40k, aes(x=PERNP, y=JWDP.f)) + 
  geom_point(shape=1) +
  xlab("Total Earnings") + 
  ylab("Time of Departure for Work") +
  scale_x_log10() +
  geom_boxplot()

#earnings by weeks worked
ggplot(subSet40k, aes(x=PERNP, y=WKW.f)) + 
  geom_point(shape=1) +
  xlab("Total Earnings") + 
  ylab("Weeks Worked in Past Year")

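#median earnings by state
#reorder() sorts the bars by median; arrange() alone does not change the order in which ggplot draws factor levels
pus.df %>%
  group_by(ST.f) %>%
  summarize(count=n(), 
            PERNP.median = median(PERNP,na.rm=TRUE)) %>%
  ggplot(aes(x=reorder(ST.f, PERNP.median), y=PERNP.median)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x="State", y="Median Earnings")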

#earnings by age
pus.df %>%
  group_by(AGEP) %>%
  summarize(PERNP.median = median(PERNP,na.rm=TRUE)) %>%
  ggplot(aes(x=AGEP, y=PERNP.median)) +
  geom_bar(stat="identity")

MA799 Assignment 1: American Community Survey

Alex Clark, Jon Leaman, Xiao Li

01 Oct 2015