Import data

Hint: You can choose any data you like but can’t take one that is already taken by other groups.

library(tidyverse)
recent_grads <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")

Explain data

Hint: Source and description of data, and definition of variables.

The data come from FiveThirtyEight's GitHub repository; David Robinson used the same dataset for TidyTuesday. There are 173 rows, each representing a different college major, and 21 variables: rank, major code, major, total, men, women, major category, share women, sample size, employed, full time, part time, full time year round, unemployed, unemployment rate, median, p25th, p75th, college jobs, non college jobs, and low wage jobs. With this many variables, the dataset can be broken down in many ways. Here we focus on how college majors relate to income, and show which major categories have the highest median yearly earnings for full-time workers.
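A quick way to check this description against the loaded data is sketched below. The vector `expected_cols` is our own transcription of the column names as they are spelled in the code and output of this report, and `missing_columns()` is a hypothetical helper, not a function from any package.

```r
# Column names as spelled in the data (transcribed by hand from this report;
# treat the exact spelling as an assumption until checked against the file)
expected_cols <- c(
  "Rank", "Major_code", "Major", "Total", "Men", "Women",
  "Major_category", "ShareWomen", "Sample_size", "Employed",
  "Full_time", "Part_time", "Full_time_year_round", "Unemployed",
  "Unemployment_rate", "Median", "P25th", "P75th",
  "College_jobs", "Non_college_jobs", "Low_wage_jobs"
)
length(expected_cols)

# Hypothetical helper: which expected columns are absent from a data frame?
missing_columns <- function(df, expected = expected_cols) {
  setdiff(expected, names(df))
}

# With the data frame imported above, one would run:
# dim(recent_grads)              # number of rows and columns
# missing_columns(recent_grads)  # empty if every expected column is present
```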

Visualize data

Hint: Create at least two plots.

Histogram


ggplot(recent_grads, aes(Median)) +
  geom_histogram(bins = 30,
                 fill = "cornflowerblue",
                 color = "white") +
  labs(title = "Distribution of Median Income Across College Majors")

Scatter Plot

options(scipen=999)
# Major_category is categorical, so a linear smoother
# (geom_smooth(method = "lm")) cannot be fit here; plot the points only.
ggplot(recent_grads,
       aes(x = Median,
           y = Major_category)) +
  geom_point()

Box Plot

recent_grads %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  ggplot(aes(Major_category, Median, fill = Major_category)) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Median Salary of Recent College Grads by Major Category",
       x = NULL,
       y = NULL) +
  scale_y_continuous(labels = scales::dollar)

Correlation and Regression Analysis

Correlation:


# reuse the recent_grads data frame imported above with read.csv()

# select numeric variables
df <- dplyr::select_if(recent_grads, is.numeric)

# calculate the correlations
r <- cor(df, use="complete.obs")
round(r,2)
##                       Rank Major_code Total   Men Women ShareWomen
## Rank                  1.00       0.10  0.07 -0.09  0.17       0.64
## Major_code            0.10       1.00  0.20  0.18  0.18       0.26
## Total                 0.07       0.20  1.00  0.88  0.94       0.14
## Men                  -0.09       0.18  0.88  1.00  0.67      -0.11
## Women                 0.17       0.18  0.94  0.67  1.00       0.30
## ShareWomen            0.64       0.26  0.14 -0.11  0.30       1.00
## Sample_size           0.00       0.20  0.95  0.88  0.86       0.10
## Employed              0.07       0.20  1.00  0.87  0.94       0.15
## Full_time             0.03       0.20  0.99  0.89  0.92       0.12
## Part_time             0.19       0.19  0.95  0.75  0.95       0.21
## Full_time_year_round  0.02       0.20  0.98  0.89  0.91       0.11
## Unemployed            0.09       0.22  0.97  0.87  0.91       0.12
## Unemployment_rate     0.08       0.14  0.08  0.10  0.06       0.07
## Median               -0.87      -0.17 -0.11  0.03 -0.18      -0.62
## P25th                -0.74      -0.17 -0.07  0.04 -0.14      -0.50
## P75th                -0.80      -0.08 -0.08  0.05 -0.16      -0.59
## College_jobs          0.05       0.04  0.80  0.56  0.85       0.20
## Non_college_jobs      0.14       0.23  0.94  0.85  0.87       0.14
## Low_wage_jobs         0.20       0.22  0.94  0.79  0.90       0.19
##                      Sample_size Employed Full_time Part_time
## Rank                        0.00     0.07      0.03      0.19
## Major_code                  0.20     0.20      0.20      0.19
## Total                       0.95     1.00      0.99      0.95
## Men                         0.88     0.87      0.89      0.75
## Women                       0.86     0.94      0.92      0.95
## ShareWomen                  0.10     0.15      0.12      0.21
## Sample_size                 1.00     0.96      0.98      0.82
## Employed                    0.96     1.00      1.00      0.93
## Full_time                   0.98     1.00      1.00      0.90
## Part_time                   0.82     0.93      0.90      1.00
## Full_time_year_round        0.99     0.99      1.00      0.88
## Unemployed                  0.92     0.97      0.96      0.95
## Unemployment_rate           0.06     0.07      0.07      0.11
## Median                     -0.06    -0.10     -0.08     -0.19
## P25th                      -0.02    -0.07     -0.04     -0.15
## P75th                      -0.05    -0.08     -0.06     -0.16
## College_jobs                0.70     0.80      0.77      0.80
## Non_college_jobs            0.92     0.94      0.93      0.91
## Low_wage_jobs               0.86     0.93      0.90      0.95
##                      Full_time_year_round Unemployed Unemployment_rate
## Rank                                 0.02       0.09              0.08
## Major_code                           0.20       0.22              0.14
## Total                                0.98       0.97              0.08
## Men                                  0.89       0.87              0.10
## Women                                0.91       0.91              0.06
## ShareWomen                           0.11       0.12              0.07
## Sample_size                          0.99       0.92              0.06
## Employed                             0.99       0.97              0.07
## Full_time                            1.00       0.96              0.07
## Part_time                            0.88       0.95              0.11
## Full_time_year_round                 1.00       0.95              0.06
## Unemployed                           0.95       1.00              0.17
## Unemployment_rate                    0.06       0.17              1.00
## Median                              -0.07      -0.12             -0.12
## P25th                               -0.03      -0.09             -0.10
## P75th                               -0.05      -0.09             -0.04
## College_jobs                         0.75       0.71             -0.01
## Non_college_jobs                     0.93       0.96              0.12
## Low_wage_jobs                        0.89       0.96              0.13
##                      Median P25th P75th College_jobs Non_college_jobs
## Rank                  -0.87 -0.74 -0.80         0.05             0.14
## Major_code            -0.17 -0.17 -0.08         0.04             0.23
## Total                 -0.11 -0.07 -0.08         0.80             0.94
## Men                    0.03  0.04  0.05         0.56             0.85
## Women                 -0.18 -0.14 -0.16         0.85             0.87
## ShareWomen            -0.62 -0.50 -0.59         0.20             0.14
## Sample_size           -0.06 -0.02 -0.05         0.70             0.92
## Employed              -0.10 -0.07 -0.08         0.80             0.94
## Full_time             -0.08 -0.04 -0.06         0.77             0.93
## Part_time             -0.19 -0.15 -0.16         0.80             0.91
## Full_time_year_round  -0.07 -0.03 -0.05         0.75             0.93
## Unemployed            -0.12 -0.09 -0.09         0.71             0.96
## Unemployment_rate     -0.12 -0.10 -0.04        -0.01             0.12
## Median                 1.00  0.89  0.90        -0.05            -0.17
## P25th                  0.89  1.00  0.74        -0.01            -0.14
## P75th                  0.90  0.74  1.00        -0.05            -0.14
## College_jobs          -0.05 -0.01 -0.05         1.00             0.61
## Non_college_jobs      -0.17 -0.14 -0.14         0.61             1.00
## Low_wage_jobs         -0.21 -0.17 -0.17         0.65             0.98
##                      Low_wage_jobs
## Rank                          0.20
## Major_code                    0.22
## Total                         0.94
## Men                           0.79
## Women                         0.90
## ShareWomen                    0.19
## Sample_size                   0.86
## Employed                      0.93
## Full_time                     0.90
## Part_time                     0.95
## Full_time_year_round          0.89
## Unemployed                    0.96
## Unemployment_rate             0.13
## Median                       -0.21
## P25th                        -0.17
## P75th                        -0.17
## College_jobs                  0.65
## Non_college_jobs              0.98
## Low_wage_jobs                 1.00

library(ggplot2)
library(ggcorrplot)


ggcorrplot(r, 
           hc.order = TRUE, 
           type = "lower",
           lab = TRUE)

In the correlation plot, some pairs of variables are positively correlated and others negatively. For example, non-college jobs correlate strongly and positively with low-wage jobs (0.98), part-time jobs (0.91), and women (0.87). These variables are all head counts, so part of this pattern simply reflects the size of a major: larger majors have more graduates in every group (non-college jobs and total have a correlation of 0.94). Still, it suggests that majors with many graduates in jobs that do not require a college degree also tend to have many graduates in low-wage or part-time jobs. Non-college jobs has only weak negative correlations, with the median (-0.17), P25th (-0.14), and P75th (-0.14) earnings.
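These relationships can also be read directly from the matrix instead of the plot by sorting one of its columns. The sketch below is self-contained: the values are copied by hand from the rounded matrix printed above, for a subset of variables only.

```r
# Correlations with Non_college_jobs, copied from the rounded matrix above
# (a subset of variables, for illustration)
non_college <- c(
  Low_wage_jobs = 0.98, Unemployed = 0.96, Total = 0.94,
  Part_time = 0.91, Women = 0.87, Men = 0.85, College_jobs = 0.61,
  Median = -0.17, P25th = -0.14, P75th = -0.14
)

# Strongest positive correlations first
sort(non_college, decreasing = TRUE)

# Variables that are negatively correlated with Non_college_jobs
negative <- names(non_college[non_college < 0])
negative

# With the full matrix r computed above, the equivalent would be:
# sort(r[, "Non_college_jobs"], decreasing = TRUE)
```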

Regression:

options(scipen=999)
# reuse the recent_grads data frame imported above with read.csv()
grads_lm <- lm(Median ~ Major_category,
                data = recent_grads)

# View summary of model 1
summary(grads_lm)
## 
## Call:
## lm(formula = Median ~ Major_category, data = recent_grads)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17383  -4344   -350   2617  52617 
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                        36900.0     2483.2
## Major_categoryArts                                 -3837.5     3724.8
## Major_categoryBiology & Life Science                -478.6     3251.3
## Major_categoryBusiness                              6638.5     3303.0
## Major_categoryCommunications & Journalism          -2400.0     4645.6
## Major_categoryComputers & Mathematics               5845.4     3431.0
## Major_categoryEducation                            -4550.0     3165.5
## Major_categoryEngineering                          20482.8     2879.7
## Major_categoryHealth                                 -75.0     3362.3
## Major_categoryHumanities & Liberal Arts            -4986.7     3205.8
## Major_categoryIndustrial Arts & Consumer Services   -557.1     3869.8
## Major_categoryInterdisciplinary                    -1900.0     8235.9
## Major_categoryLaw & Public Policy                   5300.0     4301.0
## Major_categoryPhysical Sciences                     4990.0     3511.8
## Major_categoryPsychology & Social Work             -6800.0     3608.0
## Major_categorySocial Science                         444.4     3608.0
##                                                   t value
## (Intercept)                                        14.860
## Major_categoryArts                                 -1.030
## Major_categoryBiology & Life Science               -0.147
## Major_categoryBusiness                              2.010
## Major_categoryCommunications & Journalism          -0.517
## Major_categoryComputers & Mathematics               1.704
## Major_categoryEducation                            -1.437
## Major_categoryEngineering                           7.113
## Major_categoryHealth                               -0.022
## Major_categoryHumanities & Liberal Arts            -1.556
## Major_categoryIndustrial Arts & Consumer Services  -0.144
## Major_categoryInterdisciplinary                    -0.231
## Major_categoryLaw & Public Policy                   1.232
## Major_categoryPhysical Sciences                     1.421
## Major_categoryPsychology & Social Work             -1.885
## Major_categorySocial Science                        0.123
##                                                               Pr(>|t|)    
## (Intercept)                                       < 0.0000000000000002 ***
## Major_categoryArts                                              0.3045    
## Major_categoryBiology & Life Science                            0.8832    
## Major_categoryBusiness                                          0.0462 *  
## Major_categoryCommunications & Journalism                       0.6062    
## Major_categoryComputers & Mathematics                           0.0904 .  
## Major_categoryEducation                                         0.1526    
## Major_categoryEngineering                              0.0000000000379 ***
## Major_categoryHealth                                            0.9822    
## Major_categoryHumanities & Liberal Arts                         0.1218    
## Major_categoryIndustrial Arts & Consumer Services               0.8857    
## Major_categoryInterdisciplinary                                 0.8178    
## Major_categoryLaw & Public Policy                               0.2197    
## Major_categoryPhysical Sciences                                 0.1573    
## Major_categoryPsychology & Social Work                          0.0613 .  
## Major_categorySocial Science                                    0.9021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7853 on 157 degrees of freedom
## Multiple R-squared:  0.5722, Adjusted R-squared:  0.5313 
## F-statistic:    14 on 15 and 157 DF,  p-value: < 0.00000000000000022

The regression model describes how median income differs across major categories. The intercept, $36,900, is the average median income for majors in the reference category, Agriculture & Natural Resources. Each coefficient is the difference between a category's average median income and that baseline, so it either adds to or subtracts from $36,900. For example, both of our majors are in business, so our estimated median income is $36,900 + $6,638.50 = $43,538.50. Only the Engineering and Business coefficients are statistically significant at the 5% level, and the model explains about 57% of the variation in median income (R-squared = 0.5722).
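The same arithmetic works for any category: add its coefficient to the intercept. The sketch below hard-codes a few estimates from the summary output above; the object names are ours and are used for illustration only.

```r
# Estimates copied from the summary(grads_lm) output above
intercept <- 36900.0  # reference category: Agriculture & Natural Resources
coefs <- c(
  Business                   = 6638.5,
  Engineering                = 20482.8,
  "Psychology & Social Work" = -6800.0
)

# Estimated income for a category = intercept + that category's coefficient
estimated <- intercept + coefs
estimated

# With the fitted model, the same number should come from:
# predict(grads_lm, newdata = data.frame(Major_category = "Business"))
```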

Share interesting stories you found from the data

We were intrigued by this dataset because, as college students, it is very familiar to us and relates to decisions we face every day. We were surprised by the median incomes because we had not realized how low they are for many majors. We were also interested in the plots showing which major categories have the best incomes; engineering came out on top.

Hide the messages, but display the code and its results on the webpage.

List names of all group members (both first and last name) at the top of the webpage.

Use the correct slug.