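# Packages used throughout: dplyr (pipe, as_tibble, arrange), DT (datatable, JS), ggplot2 (plots)
library(dplyr)
library(DT)
library(ggplot2)
allStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/all-ages.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)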
datatable(allStudents, class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
gradStudents <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/grad-students.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)
datatable(gradStudents, class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
recentGraduates <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20606/Projects/recent-grads.csv', header = TRUE, stringsAsFactors = FALSE) %>% as_tibble() %>% arrange(Major_category)
datatable(recentGraduates, class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
What are the cases, and how many are there?
dim(allStudents)
## [1] 173 11
dim(recentGraduates)
## [1] 173 21
dim(gradStudents)
## [1] 173 22
Each case is one academic major. The all-ages data set contains 173 cases and 11 variables; it holds basic earnings and labor force information for all ages. The grad-students (ages 25+) data set contains 173 cases and 22 variables, while the recent-grads (ages <28) data set contains 173 cases and 21 variables. The recent-grads data set contains a more detailed breakdown, including by sex and by the type of job they got.
Describe the method of data collection.
What type of study is this (observational/experiment)?
If you collected the data, state self-collected. If not, provide a citation/link.
The data was pulled from the ACS website by FiveThirtyEight and stored in their GitHub repository after some categorising of the cases.
What is the response variable, and what type is it (numerical/categorical)?
What is the explanatory variable(s), and what type is it (numerical/categorical)?
The explanatory variables include:
These variables are numerical
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
str(allStudents)
## Classes 'tbl_df', 'tbl' and 'data.frame': 173 obs. of 11 variables:
## $ Major_code : int 1100 1101 1102 1103 1104 1105 1106 1199 1302 1303 ...
## $ Major : chr "GENERAL AGRICULTURE" "AGRICULTURE PRODUCTION AND MANAGEMENT" "AGRICULTURAL ECONOMICS" "ANIMAL SCIENCES" ...
## $ Major_category : chr "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
## $ Total : int 128148 95326 33955 103549 24280 79409 6586 8549 69447 83188 ...
## $ Employed : int 90245 76865 26321 81177 17281 63043 4926 6392 48228 65937 ...
## $ Employed_full_time_year_round: int 74078 64240 22810 64937 12722 51077 4042 5074 39613 50595 ...
## $ Unemployed : int 2423 2266 821 3619 894 2070 264 261 2144 3789 ...
## $ Unemployment_rate : num 0.0261 0.0286 0.0302 0.0427 0.0492 ...
## $ Median : int 50000 54000 63000 46000 62000 50000 63000 52000 58000 52000 ...
## $ P25th : int 34000 36000 40000 30000 38500 35000 39400 35000 40500 37100 ...
## $ P75th : num 80000 80000 98000 72000 90000 75000 88000 75000 80000 75000 ...
summary(allStudents)
## Major_code Major Major_category Total
## Min. :1100 Length:173 Length:173 Min. : 2396
## 1st Qu.:2403 Class :character Class :character 1st Qu.: 24280
## Median :3608 Mode :character Mode :character Median : 75791
## Mean :3880 Mean : 230257
## 3rd Qu.:5503 3rd Qu.: 205763
## Max. :6403 Max. :3123510
## Employed Employed_full_time_year_round Unemployed
## Min. : 1492 Min. : 1093 Min. : 0
## 1st Qu.: 17281 1st Qu.: 12722 1st Qu.: 1101
## Median : 56564 Median : 39613 Median : 3619
## Mean : 166162 Mean : 126308 Mean : 9725
## 3rd Qu.: 142879 3rd Qu.: 111025 3rd Qu.: 8862
## Max. :2354398 Max. :1939384 Max. :147261
## Unemployment_rate Median P25th P75th
## Min. :0.00000 Min. : 35000 Min. :24900 Min. : 45800
## 1st Qu.:0.04626 1st Qu.: 46000 1st Qu.:32000 1st Qu.: 70000
## Median :0.05472 Median : 53000 Median :36000 Median : 80000
## Mean :0.05736 Mean : 56816 Mean :38697 Mean : 82506
## 3rd Qu.:0.06904 3rd Qu.: 65000 3rd Qu.:42000 3rd Qu.: 95000
## Max. :0.15615 Max. :125000 Max. :78000 Max. :210000
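The `Unemployment_rate` column looks like it is derived from the counts rather than recorded separately. A minimal sketch of that check, assuming the rate is `Unemployed / (Employed + Unemployed)`; this definition is an assumption, verified here only against the first five rows printed by `str(allStudents)`. The `firstRows` data frame is a hypothetical stand-in built from those printed values, not part of the original analysis.

```r
# Hypothetical stand-in for the first five rows of allStudents,
# copied from the str(allStudents) output.
firstRows <- data.frame(
  Employed          = c(90245, 76865, 26321, 81177, 17281),
  Unemployed        = c(2423, 2266, 821, 3619, 894),
  Unemployment_rate = c(0.0261, 0.0286, 0.0302, 0.0427, 0.0492)
)

# Assumed definition: unemployed share of the labor force.
derivedRate <- with(firstRows, Unemployed / (Employed + Unemployed))

# Compare with the printed values at the three significant digits str() shows.
rateCheck <- data.frame(
  printed = firstRows$Unemployment_rate,
  derived = signif(derivedRate, 3)
)
print(rateCheck)
```

If the two columns agree, the rate can be treated as a function of `Employed` and `Unemployed`, which matters when choosing explanatory variables: using all three together would be redundant.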
str(gradStudents)
## Classes 'tbl_df', 'tbl' and 'data.frame': 173 obs. of 22 variables:
## $ Major_code : int 1101 1100 1302 1303 1105 1102 1106 1103 1199 1104 ...
## $ Major : chr "AGRICULTURE PRODUCTION AND MANAGEMENT" "GENERAL AGRICULTURE" "FORESTRY" "NATURAL RESOURCES MANAGEMENT" ...
## $ Major_category : chr "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" "Agriculture & Natural Resources" ...
## $ Grad_total : int 17488 44306 24713 29357 30983 14800 3335 56807 5032 14521 ...
## $ Grad_sample_size : int 386 764 487 659 624 305 61 1335 98 266 ...
## $ Grad_employed : int 13104 28930 16831 23394 22782 10592 2284 47755 2758 10857 ...
## $ Grad_full_time_year_round : int 11207 23024 14102 19087 18312 8768 1641 39047 2276 8074 ...
## $ Grad_unemployed : int 473 874 725 711 735 216 34 596 261 370 ...
## $ Grad_unemployment_rate : num 0.0348 0.0293 0.0413 0.0295 0.0313 ...
## $ Grad_median : num 67000 68000 78000 70000 67000 80000 65000 70300 54000 72000 ...
## $ Grad_P25 : int 41600 45000 52000 50000 45000 53000 50000 48000 45000 50000 ...
## $ Grad_P75 : num 100000 104000 110000 100000 100000 120000 91000 104000 81000 110000 ...
## $ Nongrad_total : int 89169 123984 67649 77101 76190 33049 6242 94910 8092 22853 ...
## $ Nongrad_employed : int 71781 86631 46815 60690 60241 25557 4654 74896 5978 16298 ...
## $ Nongrad_full_time_year_round: int 61335 72409 39048 48256 49506 22496 3917 61629 4707 12431 ...
## $ Nongrad_unemployed : int 1869 2352 1885 3413 1899 734 264 3101 239 681 ...
## $ Nongrad_unemployment_rate : num 0.0254 0.0264 0.0387 0.0532 0.0306 ...
## $ Nongrad_median : num 55000 50000 59000 53000 50000 63000 65000 48000 55000 63000 ...
## $ Nongrad_P25 : int 38000 34000 42000 38000 35000 40000 41000 32000 39000 40000 ...
## $ Nongrad_P75 : num 80000 80000 80000 75000 75000 99000 89000 75000 78000 92000 ...
## $ Grad_share : num 0.164 0.263 0.268 0.276 0.289 ...
## $ Grad_premium : num 0.218 0.36 0.322 0.321 0.34 ...
summary(gradStudents)
## Major_code Major Major_category Grad_total
## Min. :1100 Length:173 Length:173 Min. : 1542
## 1st Qu.:2403 Class :character Class :character 1st Qu.: 15284
## Median :3608 Mode :character Mode :character Median : 37872
## Mean :3880 Mean : 127672
## 3rd Qu.:5503 3rd Qu.: 148255
## Max. :6403 Max. :1184158
## Grad_sample_size Grad_employed Grad_full_time_year_round
## Min. : 22 Min. : 1008 Min. : 770
## 1st Qu.: 314 1st Qu.: 12659 1st Qu.: 9894
## Median : 688 Median : 28930 Median : 22523
## Mean : 2251 Mean : 94037 Mean : 72861
## 3rd Qu.: 2528 3rd Qu.:109944 3rd Qu.: 80794
## Max. :21994 Max. :915341 Max. :703347
## Grad_unemployed Grad_unemployment_rate Grad_median Grad_P25
## Min. : 0 Min. :0.00000 Min. : 47000 Min. :24500
## 1st Qu.: 453 1st Qu.:0.02607 1st Qu.: 65000 1st Qu.:45000
## Median : 1179 Median :0.03665 Median : 75000 Median :50000
## Mean : 3506 Mean :0.03934 Mean : 76756 Mean :52597
## 3rd Qu.: 3329 3rd Qu.:0.04805 3rd Qu.: 90000 3rd Qu.:60000
## Max. :35718 Max. :0.13851 Max. :135000 Max. :85000
## Grad_P75 Nongrad_total Nongrad_employed
## Min. : 65000 Min. : 2232 Min. : 1328
## 1st Qu.: 93000 1st Qu.: 20564 1st Qu.: 15914
## Median :108000 Median : 68993 Median : 50092
## Mean :112087 Mean : 214720 Mean : 154554
## 3rd Qu.:130000 3rd Qu.: 184971 3rd Qu.: 129179
## Max. :294000 Max. :2996892 Max. :2253649
## Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate
## Min. : 980 Min. : 0 Min. :0.00000
## 1st Qu.: 11755 1st Qu.: 880 1st Qu.:0.04198
## Median : 38384 Median : 3157 Median :0.05103
## Mean : 120737 Mean : 8486 Mean :0.05395
## 3rd Qu.: 103629 3rd Qu.: 7409 3rd Qu.:0.06439
## Max. :1882507 Max. :136978 Max. :0.16091
## Nongrad_median Nongrad_P25 Nongrad_P75 Grad_share
## Min. : 37000 Min. :25000 Min. : 48000 Min. :0.09632
## 1st Qu.: 48700 1st Qu.:34000 1st Qu.: 72000 1st Qu.:0.26757
## Median : 55000 Median :38000 Median : 80000 Median :0.39875
## Mean : 58584 Mean :40078 Mean : 84333 Mean :0.40059
## 3rd Qu.: 65000 3rd Qu.:44000 3rd Qu.: 97000 3rd Qu.:0.49912
## Max. :126000 Max. :80000 Max. :215000 Max. :0.93117
## Grad_premium
## Min. :-0.0250
## 1st Qu.: 0.2308
## Median : 0.3208
## Mean : 0.3285
## 3rd Qu.: 0.4000
## Max. : 1.6471
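`Grad_share` and `Grad_premium` also appear to be computed from other columns. A minimal sketch, assuming `Grad_share` is the graduate-degree holders' share of all degree holders in a major and `Grad_premium` is the relative difference between the graduate and non-graduate median earnings. Both definitions are assumptions, checked only against the first five rows printed by `str(gradStudents)`; `gradRows` is a hypothetical stand-in built from those values.

```r
# Hypothetical stand-in for the first five rows of gradStudents,
# copied from the str(gradStudents) output.
gradRows <- data.frame(
  Grad_total     = c(17488, 44306, 24713, 29357, 30983),
  Nongrad_total  = c(89169, 123984, 67649, 77101, 76190),
  Grad_median    = c(67000, 68000, 78000, 70000, 67000),
  Nongrad_median = c(55000, 50000, 59000, 53000, 50000)
)

# Assumed definitions (not confirmed by the data source):
# share of degree holders in the major who hold a graduate degree
shareDerived <- with(gradRows, Grad_total / (Grad_total + Nongrad_total))
# relative earnings gain of graduate over non-graduate degree holders
premiumDerived <- with(gradRows, (Grad_median - Nongrad_median) / Nongrad_median)

# Round to the three decimals shown by str() for comparison.
print(round(shareDerived, 3))
print(round(premiumDerived, 3))
```

Under these assumptions, a negative `Grad_premium` (the minimum in the summary is -0.0250) would mean the graduate median is below the non-graduate median for that major.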
all-ages data set:
categories <- as.data.frame(table(allStudents$Major_category))
names(categories) <- c('category', 'frequency')
ggplot(categories, aes(x= reorder(category, frequency), y=frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
xlab("Academic Major") + ylab("Frequency") +
ggtitle("Academic Major Categories vs Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.title.x=element_blank(),
plot.title = element_text(color="black", size=14,hjust = 0.5)) + coord_flip() + theme(legend.position="none")
ggplot(allStudents, aes(x=Median)) +
ggtitle("All ages' Median Salary") +
geom_histogram(fill = "steelblue", color='white', binwidth = 10000) +
theme(axis.title.x=element_blank(),
plot.title = element_text(color="black", size=14,hjust = 0.5))
all <- allStudents$Unemployment_rate
recent <- recentGraduates$Unemployment_rate
graduates <- gradStudents$Grad_unemployment_rate
allUnemployedRate <- cbind(all, recent, graduates)
# One bar per data set; each bar's height is the mean unemployment rate across majors
barplot(colMeans(allUnemployedRate), names.arg = c('All Students', 'Recent Graduates', 'Graduate Students'), ylab = "Mean Unemployment Rate", col = heat.colors(ncol(allUnemployedRate)))
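Each bar's total height in this comparison is the mean unemployment rate of its column: summing a column after dividing every entry by the row count is the same as taking the column mean. A minimal sketch of that identity, using a hypothetical two-column matrix `toyRates` filled with the first five rates printed by `str(allStudents)` and `str(gradStudents)` (the recent-grads rates are not printed in this document, so that column is left out):

```r
# Hypothetical stand-in: first five unemployment rates from the
# str() output of allStudents and gradStudents.
toyRates <- cbind(
  all       = c(0.0261, 0.0286, 0.0302, 0.0427, 0.0492),
  graduates = c(0.0348, 0.0293, 0.0413, 0.0295, 0.0313)
)

# Total height of a stacked bar built from x / n ...
stackedHeights <- colSums(toyRates / nrow(toyRates))
# ... equals the plain column mean.
columnMeans <- colMeans(toyRates)

print(all.equal(stackedHeights, columnMeans))
```

Because only the column means matter, the different row orders of the three data sets do not affect the comparison.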