Kaggle, “The Home of Data Science”, is an organization that hosts data science competitions. One of its active competitions focuses on the US Census.
The data for this competition is collected by the US Census Bureau, which runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency.
The goal of the Kaggle competition is to find interesting insights in this data, which aligns well with the objectives of this document.
Our objective may shift as we begin to investigate the data, but we want to look at Unemployment and Job Stability.
The following list is a subset of variables that describe an individual filling out the census.
## [1] "ST" "PWGTP" "AGEP" "COW" "INTP"
## [6] "JWMNP" "JWRIP" "JWTR" "MAR" "OIP"
## [11] "PAP" "RETP" "SCH" "SCHG" "SCHL"
## [16] "SEMP" "SEX" "WAGP" "WKHP" "WKL"
## [21] "WKW" "WRK" "DRIVESP" "ESR" "FOD1P"
## [26] "FOD2P" "INDP" "JWAP" "JWDP" "MSP"
## [31] "NAICSP" "OCCP" "PERNP" "PINCP" "POWSP"
## [36] "RAC1P" "SCIENGP" "SCIENGRLP"
These column names don’t provide much detail on their own, so we will define each below.
Before we do, it is important to note that the following variables need factoring.
## [1] "ST" "COW" "JWRIP" "JWTR" "MAR"
## [6] "SCH" "SCHG" "SCHL" "SEX" "WKL"
## [11] "WKW" "WRK" "DRIVESP" "ESR" "FOD1P"
## [16] "FOD2P" "INDP" "JWAP" "JWDP" "MSP"
## [21] "NAICSP" "OCCP" "POWSP" "RAC1P" "SCIENGP"
## [26] "SCIENGRLP"
To keep things simple, I will be defining variables in the order they appear in the data frame.
ST is an integer that will be factored into a new column that pairs each State code with a State name. The following output shows the values that correlate with the levels and labels of the new column.
## [1] "01" "02" "04" "05" "06" "08" "09" "10" "11" "12" "13" "15" "16" "17"
## [15] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [29] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "44" "45" "46"
## [43] "47" "48" "49" "50" "51" "53" "54" "55" "56" "72"
## [1] "Alabama/AL" "Alaska/AK"
## [3] "Arizona/AZ" "Arkansas/AR"
## [5] "California/CA" "Colorado/CO"
## [7] "Connecticut/CT" "Delaware/DE"
## [9] "District of Columbia/DC" "Florida/FL"
## [11] "Georgia/GA" "Hawaii/HI"
## [13] "Idaho/ID" "Illinois/IL"
## [15] "Indiana/IN" "Iowa/IA"
## [17] "Kansas/KS" "Kentucky/KY"
## [19] "Louisiana/LA" "Maine/ME"
## [21] "Maryland/MD" "Massachusetts/MA"
## [23] "Michigan/MI" "Minnesota/MN"
## [25] "Mississippi/MS" "Missouri/MO"
## [27] "Montana/MT" "Nebraska/NE"
## [29] "Nevada/NV" "New Hampshire/NH"
## [31] "New Jersey/NJ" "New Mexico/NM"
## [33] "New York/NY" "North Carolina/NC"
## [35] "North Dakota/ND" "Ohio/OH"
## [37] "Oklahoma/OK" "Oregon/OR"
## [39] "Pennsylvania/PA" "Rhode Island/RI"
## [41] "South Carolina/SC" "South Dakota/SD"
## [43] "Tennessee/TN" "Texas/TX"
## [45] "Utah/UT" "Vermont/VT"
## [47] "Virginia/VA" "Washington/WA"
## [49] "West Virginia/WV" "Wisconsin/WI"
## [51] "Wyoming/WY" "Puerto Rico/PR"
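As a minimal sketch of the factoring step described above, the snippet below builds a labelled column from a coded one with base R's `factor()`. The toy data frame `toy.df` and the shortened `st.levels` and `st.labels` vectors are illustrative assumptions, not the contents of our cleaning script; the `.f` suffix mirrors the naming of the factored columns used later in this document.

```r
# Toy stand-in for the census data frame; ST holds state codes as strings
toy.df <- data.frame(ST = c("01", "06", "72", "06"), stringsAsFactors = FALSE)

# Abbreviated levels and labels (the full vectors are much longer)
st.levels <- c("01", "06", "72")
st.labels <- c("Alabama/AL", "California/CA", "Puerto Rico/PR")

# Create the factored column next to the raw code column
toy.df$ST.f <- factor(toy.df$ST, levels = st.levels, labels = st.labels)

# Count rows per state label
table(toy.df$ST.f)
```

Keeping the raw code column alongside the factored one makes it easy to check a label against its original code.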
PWGTP is an integer giving the individual's person weight, i.e. the survey sampling weight used to scale respondents up to population estimates (not body weight).
AGEP is an integer giving the individual's age.
COW is an integer that will be factored into a new column that takes the Class of Worker code and aligns it with its description. The categories include employee of a private company, government employee, self-employed, etc.
The following output shows the values that correlate with the levels and labels of the new column. Note: This will be the last time I include the levels and labels, given the size of these descriptions. Please refer to our R script for additional details if required.
## [1] "b" "1" "2" "3" "4" "5" "6" "7" "8" "9"
## [1] "N/A (less than 16 years old/NILF who last- worked more than 5 years ago or never worked)"
## [2] "Employee of a private for-profit company or business- or of an individual- for wages- salary- or commissions"
## [3] "Employee of a private not-for-profit- tax-exempt- or charitable organization"
## [4] "Local government employee (city- county- etc.)"
## [5] "State government employee"
## [6] "Federal government employee"
## [7] "Self-employed in own not incorporated business- professional practice- or farm"
## [8] "Self-employed in own incorporated business- professional practice or farm"
## [9] "Working without pay in family business or farm"
## [10] "Unemployed and last worked 5 years ago or earlier or never worked"
INTP is an integer giving the individual's income from interest, dividends, and net rental property over the past 12 months. This value can be negative.
JWMNP is an integer that describes the time it took the individual to travel to work, in minutes. Unlike the coded variables, it does not appear in the list of variables to factor above.
JWRIP is an integer that will be factored into a new column. Its codes are translated into strings that describe whether the individual carpooled to work and the count of people in the vehicle.
JWTR is an integer that will be factored into a new column. Its codes are translated into strings that describe the means of transportation used to commute to work, e.g. car, truck, bicycle, bus, worked at home, etc.
MAR is an integer that will be factored into a new column. Its codes are translated into strings that describe the marital status of the individual.
OIP is an integer that describes income of the individual that doesn’t fit into the other income categories.
PAP is an integer that describes the income an individual received via public assistance.
RETP is an integer that describes the income an individual received via retirement streams.
SCH is an integer that will be factored into a new column. Its codes are translated into strings that describe the current status of the individual's enrollment in school, e.g. currently enrolled in public school, private school, etc.
SCHG is an integer that will be factored into a new column. Its codes are translated into strings that describe the current grade level of the individual, e.g. 9th grade, 11th grade, undergraduate, etc.
SCHL is an integer that will be factored into a new column. Its codes are translated into strings that describe the highest level of education achieved.
SEMP is an integer that describes the income an individual received via self-employment. This value can be negative.
SEX is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's sex.
WAGP is an integer that describes the income an individual received via wages and salary over the past 12 months.
WKHP is an integer that describes the usual number of hours worked per week over the past 12 months.
WKL is an integer that will be factored into a new column. Its codes are translated into strings that describe how recently the individual last worked, e.g. within 12 months, 1-5 years ago, over 5 years ago.
WKW is an integer that will be factored into a new column. Its codes are translated into strings that describe the number of weeks worked in the past 52 weeks.
WRK is an integer that will be factored into a new column. Its codes are translated into strings that describe whether the individual worked last week.
DRIVESP is an integer that will be factored into a new column. Its codes are translated into strings that describe the fraction of the vehicle used while commuting, e.g. if the individual shared a ride with 1 other person (2 total), this value would be 0.50.
ESR is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's employment status, e.g. civilian employed, armed forces, unemployed, or not in the labor force.
FOD1P is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's first recorded field of degree, e.g. engineering, legal studies, forestry, etc. This variable has 174 levels.
FOD2P is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's second recorded field of degree, e.g. engineering, legal studies, forestry, etc. This variable has 174 levels.
INDP is an integer that will be factored into a new column. Its codes are translated into strings that describe the industry the individual is employed in. Note: The IND codes are based on the 2012 IND codes.
JWAP is an integer that will be factored into a new column. Its codes are translated into strings that describe the time the individual arrived at work.
JWDP is an integer that will be factored into a new column. Its codes are translated into strings that describe the time the individual departed for work.
MSP is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's marital status, including whether a married individual's spouse is present or absent.
NAICSP is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's NAICS industry code. Note: The NAICS codes used are based on the 2012 NAICS codes.
OCCP is an integer that will be factored into a new column. Its codes are translated into strings that describe the individual's occupation (OCC) code. Note: This code is based on the 2010 OCC codes.
PERNP is an integer that describes the individual's total earnings.
PINCP is an integer that describes the individual's total income from all sources.
POWSP is an integer that will be factored into a new column. Its codes are translated into strings that describe the State the individual was employed in.
RAC1P is an integer that will be factored into a new column. Its codes are translated into strings that describe the race of the individual.
SCIENGP is an integer that will be factored into a new column. Its codes are translated into strings that flag whether the individual has a degree in Science and Engineering.
SCIENGRLP is an integer that will be factored into a new column. Its codes are translated into strings that flag whether the individual has a degree in a Science and Engineering related field.
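PWGTP is the person weight in the American Community Survey, so each respondent stands in for roughly that many people in the population. The sketch below shows, under that reading, how the weight could enter a summary statistic. The three-row `toy.df` and its numbers are made up for illustration only.

```r
# Hypothetical mini sample: ages with their person weights (PWGTP)
toy.df <- data.frame(AGEP = c(25, 40, 60), PWGTP = c(10, 30, 60))

# Unweighted mean treats every respondent equally
mean.unweighted <- mean(toy.df$AGEP)

# Weighted mean lets each respondent count PWGTP times
mean.weighted <- weighted.mean(toy.df$AGEP, w = toy.df$PWGTP)

# Estimated number of people the sample represents
pop.estimate <- sum(toy.df$PWGTP)
```

In this toy case the weighted mean is higher than the unweighted one because the oldest respondent carries the largest weight.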
For access to the script used to clean this data, use the following link: DropBox Link
Below is an overview of the steps taken in order to prepare the data.
The following code loads the libraries we need to analyze our data. We also load the cleaned dataset generated in the Dataset Preparation step.
library(data.table)
library(magrittr)
library(ggplot2)
library(dplyr)
options(scipen = 20) # strongly prefer fixed over scientific notation when printing numbers
setwd("C:/Users/jwlea_000/Dropbox/MA799 Class - Data Storm")
load("dataStormCleanedCencusData.v2.RData")
Using the variables listed above, we will be able to see correlations between worker class, pay, and education. It would be interesting to see which worker classes tend to have the highest level of education and lowest pay. We will also be able to tell which worker classes have the highest and lowest pay rates. Additionally, it will be interesting to see how pay relates to hour of arrival and hour of departure. Perhaps those within the middle 50% of income will have the typical 9 to 5 work schedule.
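One way to isolate the "middle 50% of income" mentioned above is to keep the values between the 25th and 75th percentiles. The following is a minimal sketch on made-up incomes; in the real data the vector would presumably come from a column such as PINCP.

```r
# Hypothetical incomes; not taken from the census data
income <- c(5000, 12000, 30000, 42000, 58000, 90000, 250000)

# The 25th and 75th percentiles bound the middle 50%
bounds <- quantile(income, probs = c(0.25, 0.75))

# Keep only the incomes that fall inside those bounds
middle <- income[income >= bounds[1] & income <= bounds[2]]
```

The same logical filter could be applied to the rows of the data frame, after which the arrival and departure factors of that group can be tabulated.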
Additionally, the following types of analyses could provide some interesting insight:
We could look for trends and relationships between employment and basic personal information.
For example, we could compare the past-12-months interest, dividend, and net rental income of local government employees with that of state government employees.
We could also examine whether workers who arrive at work early and leave late are more likely to carpool, and with how many people.
Within each employment status, we could measure how much work intensity differs between people with different degrees and fields of degree, and explore why.
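The comparison of interest, dividend, and rental income between local and state government employees could be sketched as a grouped summary. The data below is invented and the class labels are shortened; the real COW labels are longer.

```r
# Hypothetical rows: worker class and interest/dividend/rental income (INTP)
toy.df <- data.frame(
  COW.f = c("Local government employee", "Local government employee",
            "State government employee", "State government employee",
            "State government employee"),
  INTP  = c(0, 400, 100, 0, 1100)
)

# Median and mean INTP within each worker class
intp.median <- tapply(toy.df$INTP, toy.df$COW.f, median)
intp.mean   <- tapply(toy.df$INTP, toy.df$COW.f, mean)
```

Reporting both statistics is useful here because INTP is zero for many people, so a few large values can pull the mean well above the median.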
######################
# #
# Alex's stuff #
# #
######################
wagesex = select(pus.df, SEX.f, WAGP)
wagesex = na.omit(wagesex)
g = ggplot(wagesex, aes(x=SEX.f, y=WAGP))
g + geom_violin()
sci = select(pus.df, SCIENGP.f, WAGP)
sci <- na.omit(sci)
sci
## Source: local data frame [683,328 x 2]
##
## SCIENGP.f WAGP
## (fctr) (int)
## 1 No 39000
## 2 Yes 20000
## 3 No 2000
## 4 Yes 16000
## 5 No 100000
## 6 Yes 0
## 7 No 0
## 8 No 322000
## 9 No 0
## 10 Yes 0
## .. ... ...
ticks<-c(10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000)
g <- ggplot(sci, aes(x=SCIENGP.f, y=WAGP))
g + geom_violin() + scale_y_continuous(limits=c(0,200000), breaks=ticks)
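wageData <- select(pus.df, WAGP, AGEP)
wageData <- na.omit(wageData)
# Keep wage earners aged 70 or younger (xlim() cannot be used on the discrete age axis)
wageData <- wageData[wageData$WAGP > 0 & wageData$AGEP <= 70, ]
ticks <- seq(0, 95, by = 5)
a <- ggplot(wageData, aes(x = factor(AGEP), y = WAGP))
a <- a + geom_boxplot(outlier.shape = NA) # hide outlier points
# Zoom the wage axis without dropping rows, so the box statistics use all wages
a <- a + coord_cartesian(ylim = c(0, 130000))
a + scale_x_discrete(breaks = as.character(ticks))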
DegreeData <- select(pus.df, FOD1P.f, WAGP)
DegreeData <- na.omit(DegreeData)
#Interesting groupings of earnings clustered around very high wages. Not sure why, but to get more out of the data, I will now focus on <$300,000 wages
DegreeData2 <- DegreeData[1:40000,]
# geom_point() has no outlier.size argument, so only alpha is set
DegreeData2 %>%
  ggplot(aes(y=FOD1P.f, x=WAGP)) +
  geom_point(alpha = 0.5)
# I'm going to "paginate" the degrees so we can look through all of them with appropriate real estate
DegreeData <- subset(DegreeData, WAGP < 300000)
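Selecting rows whose field matches any of several labels is a set-membership test. The small sketch below, on made-up labels, shows how `==` and `%in%` differ in that situation: `==` compares position by position and recycles the shorter vector, while `%in%` checks each element against the whole set.

```r
# Hypothetical field labels, one per person
field  <- c("MUSIC", "FINE ARTS", "MUSIC", "HISTORY", "FINE ARTS", "MUSIC")
wanted <- c("MUSIC", "FINE ARTS")

# `==` recycles `wanted` as MUSIC, FINE ARTS, MUSIC, FINE ARTS, ...
by.equal <- field == wanted

# `%in%` asks, for each element, whether it appears anywhere in `wanted`
by.in <- field %in% wanted

sum(by.equal) # rows kept by the positional comparison
sum(by.in)    # rows kept by the membership test
```

With `==`, a row is kept only when its label happens to line up with the recycled position, so matching rows can be dropped without any error.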
DegreeData01 <- subset(DegreeData, FOD1P.f %in% c("GENERAL AGRICULTURE",
"AGRICULTURE PRODUCTION AND MANAGEMENT",
"AGRICULTURAL ECONOMICS",
"ANIMAL SCIENCES",
"FOOD SCIENCE",
"PLANT SCIENCE AND AGRONOMY",
"SOIL SCIENCE",
"MISCELLANEOUS AGRICULTURE",
"ENVIRONMENTAL SCIENCE",
"FORESTRY",
"NATURAL RESOURCES MANAGEMENT",
"ARCHITECTURE",
"AREA ETHNIC AND CIVILIZATION STUDIES",
"COMMUNICATIONS",
"JOURNALISM"))
DegreeData02 <- subset(DegreeData, FOD1P.f %in% c("MASS MEDIA",
"ADVERTISING AND PUBLIC RELATIONS",
"COMMUNICATION TECHNOLOGIES",
"COMPUTER AND INFORMATION SYSTEMS",
"COMPUTER PROGRAMMING AND DATA PROCESSING",
"COMPUTER SCIENCE",
"INFORMATION SCIENCES",
"COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY",
"COMPUTER NETWORKING AND TELECOMMUNICATIONS",
"COSMETOLOGY SERVICES AND CULINARY ARTS",
"GENERAL EDUCATION",
"EDUCATIONAL ADMINISTRATION AND SUPERVISION",
"SCHOOL STUDENT COUNSELING",
"ELEMENTARY EDUCATION",
"MATHEMATICS TEACHER EDUCATION",
"PHYSICAL AND HEALTH EDUCATION TEACHING",
"EARLY CHILDHOOD EDUCATION",
"SCIENCE AND COMPUTER TEACHER EDUCATION",
"SECONDARY TEACHER EDUCATION"))
DegreeData03 <- subset(DegreeData, FOD1P.f %in% c("SPECIAL NEEDS EDUCATION",
"SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION",
"TEACHER EDUCATION: MULTIPLE LEVELS",
"LANGUAGE AND DRAMA EDUCATION",
"ART AND MUSIC EDUCATION",
"MISCELLANEOUS EDUCATION",
"GENERAL ENGINEERING",
"AEROSPACE ENGINEERING",
"BIOLOGICAL ENGINEERING",
"ARCHITECTURAL ENGINEERING",
"BIOMEDICAL ENGINEERING",
"CHEMICAL ENGINEERING",
"CIVIL ENGINEERING",
"COMPUTER ENGINEERING",
"ELECTRICAL ENGINEERING",
"ENGINEERING MECHANICS PHYSICS AND SCIENCE",
"ENVIRONMENTAL ENGINEERING",
"GEOLOGICAL AND GEOPHYSICAL ENGINEERING",
"INDUSTRIAL AND MANUFACTURING ENGINEERING",
"MATERIALS ENGINEERING AND MATERIALS SCIENCE"))
DegreeData04 <- subset(DegreeData, FOD1P.f %in% c("MECHANICAL ENGINEERING",
"METALLURGICAL ENGINEERING",
"MINING AND MINERAL ENGINEERING",
"NAVAL ARCHITECTURE AND MARINE ENGINEERING",
"NUCLEAR ENGINEERING",
"PETROLEUM ENGINEERING",
"MISCELLANEOUS ENGINEERING",
"ENGINEERING TECHNOLOGIES",
"ENGINEERING AND INDUSTRIAL MANAGEMENT",
"ELECTRICAL ENGINEERING TECHNOLOGY",
"INDUSTRIAL PRODUCTION TECHNOLOGIES",
"MECHANICAL ENGINEERING RELATED TECHNOLOGIES",
"MISCELLANEOUS ENGINEERING TECHNOLOGIES",
"LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE",
"FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES",
"OTHER FOREIGN LANGUAGES",
"FAMILY AND CONSUMER SCIENCES",
"COURT REPORTING",
"PRE-LAW AND LEGAL STUDIES",
"ENGLISH LANGUAGE AND LITERATURE"))
DegreeData05 <- subset(DegreeData, FOD1P.f %in% c("COMPOSITION AND RHETORIC",
"LIBERAL ARTS",
"HUMANITIES",
"LIBRARY SCIENCE",
"BIOLOGY",
"BIOCHEMICAL SCIENCES",
"BOTANY",
"MOLECULAR BIOLOGY",
"ECOLOGY",
"GENETICS",
"MICROBIOLOGY",
"PHARMACOLOGY",
"PHYSIOLOGY",
"ZOOLOGY",
"NEUROSCIENCE",
"MISCELLANEOUS BIOLOGY",
"MATHEMATICS",
"APPLIED MATHEMATICS",
"STATISTICS AND DECISION SCIENCE",
"MILITARY TECHNOLOGIES"))
DegreeData06 <- subset(DegreeData, FOD1P.f %in% c("MULTI/INTERDISCIPLINARY STUDIES",
"INTERCULTURAL AND INTERNATIONAL STUDIES",
"NUTRITION SCIENCES",
"MATHEMATICS AND COMPUTER SCIENCE",
"COGNITIVE SCIENCE AND BIOPSYCHOLOGY",
"INTERDISCIPLINARY SOCIAL SCIENCES",
"PHYSICAL FITNESS PARKS RECREATION AND LEISURE",
"PHILOSOPHY AND RELIGIOUS STUDIES",
"THEOLOGY AND RELIGIOUS VOCATIONS",
"PHYSICAL SCIENCES",
"ASTRONOMY AND ASTROPHYSICS",
"ATMOSPHERIC SCIENCES AND METEOROLOGY",
"CHEMISTRY",
"GEOLOGY AND EARTH SCIENCE",
"GEOSCIENCES",
"OCEANOGRAPHY",
"PHYSICS",
"MATERIALS SCIENCE",
"MULTI-DISCIPLINARY OR GENERAL SCIENCE"))
DegreeData07 <- subset(DegreeData, FOD1P.f %in% c("NUCLEAR- INDUSTRIAL RADIOLOGY- AND BIOLOGICAL TECHNOLOGIES",
"PSYCHOLOGY",
"EDUCATIONAL PSYCHOLOGY",
"CLINICAL PSYCHOLOGY",
"COUNSELING PSYCHOLOGY",
"INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY",
"SOCIAL PSYCHOLOGY",
"MISCELLANEOUS PSYCHOLOGY",
"CRIMINAL JUSTICE AND FIRE PROTECTION",
"PUBLIC ADMINISTRATION",
"PUBLIC POLICY",
"HUMAN SERVICES AND COMMUNITY ORGANIZATION",
"SOCIAL WORK",
"GENERAL SOCIAL SCIENCES",
"ECONOMICS",
"ANTHROPOLOGY AND ARCHEOLOGY",
"CRIMINOLOGY",
"GEOGRAPHY",
"INTERNATIONAL RELATIONS",
"POLITICAL SCIENCE AND GOVERNMENT"))
DegreeData08 <- subset(DegreeData, FOD1P.f %in% c("SOCIOLOGY",
"MISCELLANEOUS SOCIAL SCIENCES",
"CONSTRUCTION SERVICES",
"ELECTRICAL- MECHANICAL- AND PRECISION TECHNOLOGIES ANDPRODUCTION",
"TRANSPORTATION SCIENCES AND TECHNOLOGIES",
"FINE ARTS",
"DRAMA AND THEATER ARTS",
"MUSIC",
"VISUAL AND PERFORMING ARTS",
"COMMERCIAL ART AND GRAPHIC DESIGN",
"FILM VIDEO AND PHOTOGRAPHIC ARTS",
"ART HISTORY AND CRITICISM",
"STUDIO ARTS",
"MISCELLANEOUS FINE ARTS",
"GENERAL MEDICAL AND HEALTH SERVICES",
"COMMUNICATION DISORDERS SCIENCES AND SERVICES",
"HEALTH AND MEDICAL ADMINISTRATIVE SERVICES",
"MEDICAL ASSISTING SERVICES",
"MEDICAL TECHNOLOGIES TECHNICIANS"))
DegreeData09 <- subset(DegreeData, FOD1P.f %in% c("HEALTH AND MEDICAL PREPARATORY PROGRAMS",
"NURSING",
"PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION",
"TREATMENT THERAPY PROFESSIONS",
"COMMUNITY AND PUBLIC HEALTH",
"MISCELLANEOUS HEALTH MEDICAL PROFESSIONS",
"GENERAL BUSINESS",
"ACCOUNTING",
"ACTUARIAL SCIENCE",
"BUSINESS MANAGEMENT AND ADMINISTRATION",
"OPERATIONS LOGISTICS AND E-COMMERCE",
"BUSINESS ECONOMICS",
"MARKETING AND MARKETING RESEARCH",
"FINANCE",
"HUMAN RESOURCES AND PERSONNEL MANAGEMENT",
"INTERNATIONAL BUSINESS",
"HOSPITALITY MANAGEMENT",
"MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
"MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION",
"HISTORY",
"UNITED STATES HISTORY"))
# Plot each "page" of degree fields with the same layout
degreePages <- list(DegreeData01, DegreeData02, DegreeData03,
                    DegreeData04, DegreeData05, DegreeData06,
                    DegreeData07, DegreeData08, DegreeData09)
for (page in degreePages) {
  # print() is needed for ggplot objects created inside a loop
  print(ggplot(page, aes(WAGP, FOD1P.f)) + geom_point(alpha = 0.2))
}
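The "pagination" of degree fields can also be generated rather than typed out. The sketch below splits a vector of labels into pages of a fixed size and filters one page by membership. The labels, the page size of 20, and `toy.df` are assumptions for illustration; with the real data the labels would presumably come from `levels(DegreeData$FOD1P.f)`.

```r
# Hypothetical field labels standing in for the real factor levels
fields <- sprintf("FIELD %02d", 1:45)

# Assign each label to a page of at most `page.size` entries
page.size <- 20
page.id   <- ceiling(seq_along(fields) / page.size)
pages     <- split(fields, page.id)

# Toy data frame with one row per field
toy.df <- data.frame(FOD1P.f = factor(fields, levels = fields),
                     WAGP    = seq(1000, by = 1000, length.out = 45))

# Rows belonging to the first page
page1 <- subset(toy.df, FOD1P.f %in% pages[[1]])
```

Each element of `pages` can then be plotted in turn, which keeps the page boundaries consistent if the set of fields changes.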
#Explore different types of earnings:
#Remove NAs
bb <- pus.df[!is.na(pus.df$PERNP),] #total earnings: wages + SS + pension + dividends, etc. (?)
bb <- bb[!is.na(bb$WAGP),]
bb <- bb[!is.na(bb$PINCP),]
bb <- bb[!is.na(bb$INTP),]
#Interestingly, wherever PERNP has a value, WAGP, PINCP, and INTP do too. In other words, the data frame didn't get smaller when removing NAs from the remaining earnings categories
summary(bb$PERNP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9000 0 10000 27860 40000 1019000
summary(bb$WAGP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 6600 26060 37000 660000
summary(bb$PINCP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13600 7300 21400 36310 46000 1281000
summary(bb$INTP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6300 0 0 2075 0 300000
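The observation that removing NAs from the other earnings columns did not shrink the data can be checked directly by counting rows. This is a small sketch on an invented data frame; only the column names are borrowed from the census data.

```r
# Hypothetical earnings columns with a few missing values
toy.df <- data.frame(
  PERNP = c(1000, NA, 5000, 0),
  WAGP  = c(1000, NA, 4000, 0),
  PINCP = c(1200, 300, 5500, 0),
  INTP  = c(0, 0, 500, 0)
)

# Missing values per column
na.counts <- colSums(is.na(toy.df))

# Rows that survive when only PERNP must be present
n.pernp <- sum(!is.na(toy.df$PERNP))

# Rows that survive when all four columns must be present
n.all <- sum(complete.cases(toy.df[, c("PERNP", "WAGP", "PINCP", "INTP")]))
```

If `n.pernp` equals `n.all`, filtering on PERNP alone already removes every row that has a missing value in the other three columns.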
#Colorful graph of count of people who carpool
pus.df[!is.na(pus.df$DRIVESP.f),] %>%
ggplot (aes(x = DRIVESP.f )) +
geom_bar(aes(fill=DRIVESP.f)) +
scale_y_log10()
#For some graphs that take a long time, I've used a subset
subSet40k <- pus.df[1:40000,]
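Taking the first rows of a file is fast, but if the rows happen to be ordered (for example by state), the subset may not represent the whole data. A random sample is one alternative. The sketch below uses a made-up data frame, and both the seed and the sample size are arbitrary choices.

```r
set.seed(42) # arbitrary seed so the sample is reproducible

# Hypothetical data frame standing in for the full data
toy.df <- data.frame(id = 1:1000, value = rnorm(1000))

# Draw row indices without replacement, then subset
n.keep  <- 100
sampled <- toy.df[sample(nrow(toy.df), n.keep), ]
```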
#Earnings per weeks worked over past 12 months. Earnings axis at log scale
ggplot(subSet40k[!is.na(subSet40k$WKW.f),], aes(x=PERNP, y=WKW.f)) +
geom_point(shape=1, alpha = .05) +
xlab("Total Earnings") +
ylab("") + #factors are intuitive
scale_x_log10(breaks = c(3, 3000, 30000, 300000))
#Earnings per state. Earnings at log scale
#I wish I could limit the range of the x axis
ggplot(pus.df, aes(x=ST.f, y=PERNP)) +
geom_boxplot(outlier.size = 0) +
scale_y_log10(breaks = c(10000, 50000, 100000, 400000)) +
coord_flip() +
labs(x="State", y="Earnings", title="Earning per State")
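One possible way to limit the visible range of the earnings axis in a flipped boxplot is the `ylim` argument of `coord_flip()`, which zooms without removing data from the box statistics. This is a sketch on invented data with a plain linear axis; how the limits interact with `scale_y_log10()` is not shown here and should be checked against the ggplot2 documentation.

```r
library(ggplot2)

# Hypothetical earnings for two states, each with one extreme value
toy.df <- data.frame(
  state    = rep(c("A", "B"), each = 5),
  earnings = c(10000, 20000, 30000, 40000, 900000,
               15000, 25000, 35000, 45000, 800000)
)

# ylim refers to the y aesthetic (earnings), drawn horizontally after the flip
p <- ggplot(toy.df, aes(x = state, y = earnings)) +
  geom_boxplot() +
  coord_flip(ylim = c(0, 100000))
```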
#earnings by class of worker. Interesting: at first glance it looks like government employees get paid more than private ones...
ggplot(pus.df, aes(x=COW.f, y=PERNP)) +
geom_boxplot() +
scale_y_log10(breaks = c(10, 1000, 100000, 400000, 800000)) +
coord_flip() +
labs(x="Class of Worker", y="Earnings", title="Earnings by Class of Worker")
ggplot(subSet40k, aes(x=PERNP, y=JWDP.f)) +
geom_point(shape=1) +
xlab("Total Earnings") +
ylab("Departure for Work") +
scale_x_log10() +
geom_boxplot()
#earnings by weeks worked
ggplot(subSet40k, aes(x=PERNP, y=WKW.f)) +
geom_point(shape=1) +
xlab("Total Earnings") +
ylab("Weeks Worked in Past Year")
#median earnings by state
pus.df %>%
group_by(ST.f) %>%
summarize(count=n(),
PERNP.median = median(PERNP,na.rm=TRUE)) %>%
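# reorder() sorts the bars by median; arrange() alone does not change the factor order on the axis
ggplot(aes(x=reorder(ST.f, PERNP.median), y=PERNP.median)) +
geom_bar(stat="identity") +
xlab("State") +
coord_flip()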
#earnings by age
pus.df %>%
group_by(AGEP) %>%
summarize(PERNP.median = median(PERNP,na.rm=TRUE)) %>%
ggplot(.,aes(x=AGEP, y=PERNP.median)) +
geom_bar(stat="identity") # values are precomputed medians, so draw bars rather than a histogram