Hint: You can choose any data you like but can’t take one that is already taken by other groups.
library(tidyverse)
recent_grads <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
Hint: Source and description of data, and definition of variables.
The data comes from fivethirtyeight's GitHub repository and was used by David Robinson for TidyTuesday. There are 173 rows, each representing a different college major, and 21 variables: rank, major code, major, total, men, women, major category, share women, sample size, employed, full time, part time, full time year round, unemployed, unemployment rate, median, p25th, p75th, college jobs, non college jobs, and low wage jobs. With this many variables, the dataset can be broken down in many ways. Our analysis compares college majors with income data to show which major categories have the highest median yearly earnings for full-time workers.
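As a minimal sketch of the question stated above (which major categories have the highest median earnings), the snippet below groups majors by category and ranks the categories by their median. The `toy_grads` data frame and all of its values are hypothetical stand-ins for `recent_grads`, used only so the example runs on its own, and base R's `aggregate()` is used here instead of the tidyverse.

```r
# Hypothetical toy data shaped like recent_grads (values are made up)
toy_grads <- data.frame(
  Major = c("Major A", "Major B", "Major C", "Major D", "Major E"),
  Major_category = c("Engineering", "Engineering", "Business", "Business", "Arts"),
  Median = c(60000, 50000, 40000, 44000, 30000)
)

# Median of the per-major median earnings within each category
category_medians <- aggregate(Median ~ Major_category,
                              data = toy_grads,
                              FUN = median)

# Sort so the highest-earning category comes first
category_medians <- category_medians[order(-category_medians$Median), ]
print(category_medians)
```

The same two steps (summarise by `Major_category`, then sort) should carry over to the real data once `recent_grads` is loaded.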
Hint: Create at least two plots.
Histogram
ggplot(recent_grads, aes(Median)) +
  geom_histogram(fill = "cornflowerblue",
                 color = "white") +
  labs(title = "Combined College Majors vs Median Income")
Scatter Plot
options(scipen = 999)
# Major_category is categorical, so a linear smoother
# (geom_smooth(method = "lm")) does not apply; only the points are drawn
ggplot(recent_grads,
       aes(x = Median,
           y = Major_category)) +
  geom_point()
Box Plot
recent_grads %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  ggplot(aes(Major_category, Median, fill = Major_category)) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Median Salary of Recent College Grads by Major Category",
       x = NULL,
       y = NULL) +
  scale_y_continuous(labels = scales::dollar)
Correlation:
# recent_grads was already read from the CSV above; it is not part of
# the mosaicData package, so no data() call is needed here
# select numeric variables
df <- dplyr::select_if(recent_grads, is.numeric)
# calculate the correlations
r <- cor(df, use="complete.obs")
round(r,2)
## Rank Major_code Total Men Women ShareWomen
## Rank 1.00 0.10 0.07 -0.09 0.17 0.64
## Major_code 0.10 1.00 0.20 0.18 0.18 0.26
## Total 0.07 0.20 1.00 0.88 0.94 0.14
## Men -0.09 0.18 0.88 1.00 0.67 -0.11
## Women 0.17 0.18 0.94 0.67 1.00 0.30
## ShareWomen 0.64 0.26 0.14 -0.11 0.30 1.00
## Sample_size 0.00 0.20 0.95 0.88 0.86 0.10
## Employed 0.07 0.20 1.00 0.87 0.94 0.15
## Full_time 0.03 0.20 0.99 0.89 0.92 0.12
## Part_time 0.19 0.19 0.95 0.75 0.95 0.21
## Full_time_year_round 0.02 0.20 0.98 0.89 0.91 0.11
## Unemployed 0.09 0.22 0.97 0.87 0.91 0.12
## Unemployment_rate 0.08 0.14 0.08 0.10 0.06 0.07
## Median -0.87 -0.17 -0.11 0.03 -0.18 -0.62
## P25th -0.74 -0.17 -0.07 0.04 -0.14 -0.50
## P75th -0.80 -0.08 -0.08 0.05 -0.16 -0.59
## College_jobs 0.05 0.04 0.80 0.56 0.85 0.20
## Non_college_jobs 0.14 0.23 0.94 0.85 0.87 0.14
## Low_wage_jobs 0.20 0.22 0.94 0.79 0.90 0.19
## Sample_size Employed Full_time Part_time
## Rank 0.00 0.07 0.03 0.19
## Major_code 0.20 0.20 0.20 0.19
## Total 0.95 1.00 0.99 0.95
## Men 0.88 0.87 0.89 0.75
## Women 0.86 0.94 0.92 0.95
## ShareWomen 0.10 0.15 0.12 0.21
## Sample_size 1.00 0.96 0.98 0.82
## Employed 0.96 1.00 1.00 0.93
## Full_time 0.98 1.00 1.00 0.90
## Part_time 0.82 0.93 0.90 1.00
## Full_time_year_round 0.99 0.99 1.00 0.88
## Unemployed 0.92 0.97 0.96 0.95
## Unemployment_rate 0.06 0.07 0.07 0.11
## Median -0.06 -0.10 -0.08 -0.19
## P25th -0.02 -0.07 -0.04 -0.15
## P75th -0.05 -0.08 -0.06 -0.16
## College_jobs 0.70 0.80 0.77 0.80
## Non_college_jobs 0.92 0.94 0.93 0.91
## Low_wage_jobs 0.86 0.93 0.90 0.95
## Full_time_year_round Unemployed Unemployment_rate
## Rank 0.02 0.09 0.08
## Major_code 0.20 0.22 0.14
## Total 0.98 0.97 0.08
## Men 0.89 0.87 0.10
## Women 0.91 0.91 0.06
## ShareWomen 0.11 0.12 0.07
## Sample_size 0.99 0.92 0.06
## Employed 0.99 0.97 0.07
## Full_time 1.00 0.96 0.07
## Part_time 0.88 0.95 0.11
## Full_time_year_round 1.00 0.95 0.06
## Unemployed 0.95 1.00 0.17
## Unemployment_rate 0.06 0.17 1.00
## Median -0.07 -0.12 -0.12
## P25th -0.03 -0.09 -0.10
## P75th -0.05 -0.09 -0.04
## College_jobs 0.75 0.71 -0.01
## Non_college_jobs 0.93 0.96 0.12
## Low_wage_jobs 0.89 0.96 0.13
## Median P25th P75th College_jobs Non_college_jobs
## Rank -0.87 -0.74 -0.80 0.05 0.14
## Major_code -0.17 -0.17 -0.08 0.04 0.23
## Total -0.11 -0.07 -0.08 0.80 0.94
## Men 0.03 0.04 0.05 0.56 0.85
## Women -0.18 -0.14 -0.16 0.85 0.87
## ShareWomen -0.62 -0.50 -0.59 0.20 0.14
## Sample_size -0.06 -0.02 -0.05 0.70 0.92
## Employed -0.10 -0.07 -0.08 0.80 0.94
## Full_time -0.08 -0.04 -0.06 0.77 0.93
## Part_time -0.19 -0.15 -0.16 0.80 0.91
## Full_time_year_round -0.07 -0.03 -0.05 0.75 0.93
## Unemployed -0.12 -0.09 -0.09 0.71 0.96
## Unemployment_rate -0.12 -0.10 -0.04 -0.01 0.12
## Median 1.00 0.89 0.90 -0.05 -0.17
## P25th 0.89 1.00 0.74 -0.01 -0.14
## P75th 0.90 0.74 1.00 -0.05 -0.14
## College_jobs -0.05 -0.01 -0.05 1.00 0.61
## Non_college_jobs -0.17 -0.14 -0.14 0.61 1.00
## Low_wage_jobs -0.21 -0.17 -0.17 0.65 0.98
## Low_wage_jobs
## Rank 0.20
## Major_code 0.22
## Total 0.94
## Men 0.79
## Women 0.90
## ShareWomen 0.19
## Sample_size 0.86
## Employed 0.93
## Full_time 0.90
## Part_time 0.95
## Full_time_year_round 0.89
## Unemployed 0.96
## Unemployment_rate 0.13
## Median -0.21
## P25th -0.17
## P75th -0.17
## College_jobs 0.65
## Non_college_jobs 0.98
## Low_wage_jobs 1.00
library(ggplot2)
library(ggcorrplot)
ggcorrplot(r,
hc.order = TRUE,
type = "lower",
lab = TRUE)
The correlation plot shows both positive and negative correlations among the variables. For example, non-college jobs are strongly positively correlated with low-wage jobs (0.98), part-time jobs (0.91), and women (0.87). So majors with many graduates in jobs that do not require a college degree also tend to have many graduates in low-wage or part-time jobs. Because these variables are all head counts, part of the relationship simply reflects the size of the major. The only negative correlations with non-college jobs are weak: median (-0.17), p25th (-0.14), and p75th (-0.14) earnings.
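To read one variable's relationships out of the matrix more easily, its column can be sorted. The sketch below copies the Non_college_jobs column from the printed output above into a named vector (the name `non_college_r` is just for illustration) and ranks the other variables by their correlation with it. With the full matrix in memory, `r[, "Non_college_jobs"]` would presumably give the same vector.

```r
# Correlations with Non_college_jobs, copied from the printed matrix above
non_college_r <- c(
  Rank = 0.14, Major_code = 0.23, Total = 0.94, Men = 0.85, Women = 0.87,
  ShareWomen = 0.14, Sample_size = 0.92, Employed = 0.94, Full_time = 0.93,
  Part_time = 0.91, Full_time_year_round = 0.93, Unemployed = 0.96,
  Unemployment_rate = 0.12, Median = -0.17, P25th = -0.14, P75th = -0.14,
  College_jobs = 0.61, Non_college_jobs = 1.00, Low_wage_jobs = 0.98
)

# Drop the variable's correlation with itself, then sort high to low
others <- non_college_r[names(non_college_r) != "Non_college_jobs"]
sorted_r <- sort(others, decreasing = TRUE)

# Strongest positive relationships
print(head(sorted_r, 3))

# Any negative relationships
negative_r <- sorted_r[sorted_r < 0]
print(negative_r)
```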
Regression:
options(scipen=999)
# recent_grads is already loaded from the CSV above (it is not in mosaicData)
grads_lm <- lm(Median ~ Major_category,
               data = recent_grads)
# View summary of the model
summary(grads_lm)
##
## Call:
## lm(formula = Median ~ Major_category, data = recent_grads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17383 -4344 -350 2617 52617
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 36900.0 2483.2
## Major_categoryArts -3837.5 3724.8
## Major_categoryBiology & Life Science -478.6 3251.3
## Major_categoryBusiness 6638.5 3303.0
## Major_categoryCommunications & Journalism -2400.0 4645.6
## Major_categoryComputers & Mathematics 5845.4 3431.0
## Major_categoryEducation -4550.0 3165.5
## Major_categoryEngineering 20482.8 2879.7
## Major_categoryHealth -75.0 3362.3
## Major_categoryHumanities & Liberal Arts -4986.7 3205.8
## Major_categoryIndustrial Arts & Consumer Services -557.1 3869.8
## Major_categoryInterdisciplinary -1900.0 8235.9
## Major_categoryLaw & Public Policy 5300.0 4301.0
## Major_categoryPhysical Sciences 4990.0 3511.8
## Major_categoryPsychology & Social Work -6800.0 3608.0
## Major_categorySocial Science 444.4 3608.0
## t value
## (Intercept) 14.860
## Major_categoryArts -1.030
## Major_categoryBiology & Life Science -0.147
## Major_categoryBusiness 2.010
## Major_categoryCommunications & Journalism -0.517
## Major_categoryComputers & Mathematics 1.704
## Major_categoryEducation -1.437
## Major_categoryEngineering 7.113
## Major_categoryHealth -0.022
## Major_categoryHumanities & Liberal Arts -1.556
## Major_categoryIndustrial Arts & Consumer Services -0.144
## Major_categoryInterdisciplinary -0.231
## Major_categoryLaw & Public Policy 1.232
## Major_categoryPhysical Sciences 1.421
## Major_categoryPsychology & Social Work -1.885
## Major_categorySocial Science 0.123
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Major_categoryArts 0.3045
## Major_categoryBiology & Life Science 0.8832
## Major_categoryBusiness 0.0462 *
## Major_categoryCommunications & Journalism 0.6062
## Major_categoryComputers & Mathematics 0.0904 .
## Major_categoryEducation 0.1526
## Major_categoryEngineering 0.0000000000379 ***
## Major_categoryHealth 0.9822
## Major_categoryHumanities & Liberal Arts 0.1218
## Major_categoryIndustrial Arts & Consumer Services 0.8857
## Major_categoryInterdisciplinary 0.8178
## Major_categoryLaw & Public Policy 0.2197
## Major_categoryPhysical Sciences 0.1573
## Major_categoryPsychology & Social Work 0.0613 .
## Major_categorySocial Science 0.9021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7853 on 157 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5313
## F-statistic: 14 on 15 and 157 DF, p-value: < 0.00000000000000022
The regression model describes how median income differs across major categories. Because Major_category is categorical, the first category, Agriculture & Natural Resources, serves as the reference level, so the intercept ($36,900) is the average median income of the majors in that category. Each of the other coefficients is the amount added to or subtracted from the intercept for that category. For example, both of our majors are in Business, whose coefficient is $6,638.5, so the predicted median income is $36,900 + $6,638.5 = $43,538.5. Only the Engineering (p < 0.001) and Business (p = 0.046) coefficients differ significantly from the reference category at the 0.05 level, and the model explains about 57% of the variation in median income (multiple R-squared = 0.5722).
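The arithmetic above can be repeated for any category by adding its coefficient to the intercept. The sketch below does this for a few coefficients copied from the summary output. The shortened names (for example "Business" rather than "Major_categoryBusiness") and the `coefs` vector itself are illustrative choices, not part of the fitted model object.

```r
# A few coefficients copied from the summary(grads_lm) output above
coefs <- c(
  "(Intercept)" = 36900.0,
  "Business" = 6638.5,
  "Engineering" = 20482.8,
  "Psychology & Social Work" = -6800.0
)

# The intercept is the estimate for the reference category
intercept <- coefs[["(Intercept)"]]

# Predicted median income = intercept + category coefficient
predicted <- intercept + coefs[names(coefs) != "(Intercept)"]
predicted <- c("Agriculture & Natural Resources" = intercept, predicted)
print(predicted)
```

With the fitted model available, `predict(grads_lm, newdata = data.frame(Major_category = "Business"))` should return the same Business value.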