For information about the data, see https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-10-16

library(tidyverse)
library(scales)
# Import data
recent_grads <- read.csv("~/R/BusStats/Data/recent_grads.csv") %>% as_tibble()

Q1 Describe the first observation

Q2 How many majors are there in the Business Major_category?

Hint: Use count.

recent_grads %>% count(Major_category)
## # A tibble: 16 x 2
##    Major_category                          n
##    <fct>                               <int>
##  1 Agriculture & Natural Resources        10
##  2 Arts                                    8
##  3 Biology & Life Science                 14
##  4 Business                               13
##  5 Communications & Journalism             4
##  6 Computers & Mathematics                11
##  7 Education                              16
##  8 Engineering                            29
##  9 Health                                 12
## 10 Humanities & Liberal Arts              15
## 11 Industrial Arts & Consumer Services     7
## 12 Interdisciplinary                       1
## 13 Law & Public Policy                     5
## 14 Physical Sciences                      10
## 15 Psychology & Social Work                9
## 16 Social Science                          9

There are 13 majors in the Business major category.
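As a rough sketch, the Business row can also be pulled out directly instead of being read off the full table. The small tibble below is rebuilt by hand from a few rows of the count() output above, and the names category_counts and business_count are made up for this illustration. With the real data, the same filter() step would presumably be applied to recent_grads %>% count(Major_category).

```r
library(dplyr)

# A few rows copied by hand from the count() output above (not the full table).
category_counts <- tibble(
  Major_category = c("Arts", "Business", "Engineering", "Interdisciplinary"),
  n = c(8L, 13L, 29L, 1L)
)

# Keep only the row for the Business category.
business_count <- category_counts %>%
  filter(Major_category == "Business")

business_count
```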

Q3 What major has the highest median earnings?

Hint: Take recent_grads, pipe it to dplyr::arrange, and pipe it to dplyr::select.

recent_grads%>%
  arrange(desc(Median))%>%
  select(Major, Median)
## # A tibble: 173 x 2
##    Major                                     Median
##    <fct>                                      <int>
##  1 PETROLEUM ENGINEERING                     110000
##  2 MINING AND MINERAL ENGINEERING             75000
##  3 METALLURGICAL ENGINEERING                  73000
##  4 NAVAL ARCHITECTURE AND MARINE ENGINEERING  70000
##  5 CHEMICAL ENGINEERING                       65000
##  6 NUCLEAR ENGINEERING                        65000
##  7 ACTUARIAL SCIENCE                          62000
##  8 ASTRONOMY AND ASTROPHYSICS                 62000
##  9 MECHANICAL ENGINEERING                     60000
## 10 ELECTRICAL ENGINEERING                     60000
## # ... with 163 more rows

Petroleum Engineering has the highest median earnings, at $110,000 a year.
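An alternative to sorting the whole table is dplyr::slice_max(), which returns only the top row(s). The sketch below assumes dplyr 1.0.0 or later and uses a small tibble copied from the first three rows of the output above; the names top_majors and highest are made up for this illustration.

```r
library(dplyr)

# First three rows copied from the arrange() output above.
top_majors <- tibble(
  Major = c("PETROLEUM ENGINEERING",
            "MINING AND MINERAL ENGINEERING",
            "METALLURGICAL ENGINEERING"),
  Median = c(110000L, 75000L, 73000L)
)

# slice_max() keeps the row(s) with the largest value of Median.
highest <- top_majors %>%
  slice_max(Median, n = 1)

highest
```

On the full data, the equivalent call would be recent_grads %>% slice_max(Median, n = 1) %>% select(Major, Median).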

Q4 Is there a gender gap in wages? Describe the relationship between ShareWomen and Median by creating a scatter plot.

Hint: Take recent_grads and pipe it to ggplot(). Map ShareWomen to the x-axis and Median to the y-axis. Use geom_point() for the scatter plot.

recent_grads%>%
  ggplot(aes(x= ShareWomen, y= Median))+ geom_point()

Majors with a higher share of women tend to have lower median earnings, so the scatter plot shows a negative relationship between ShareWomen and Median.
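One way to put a number on the relationship seen in the scatter plot is the Pearson correlation. The sketch below only illustrates the call: the data frame toy and its values are hypothetical and are not taken from recent_grads. The argument use = "complete.obs" is included on the assumption that some rows may have a missing ShareWomen.

```r
# Hypothetical toy values for illustration only, NOT taken from recent_grads.
toy <- data.frame(
  ShareWomen = c(0.10, 0.25, 0.40, 0.60, 0.80, NA),
  Median     = c(70000, 60000, 50000, 42000, 35000, 40000)
)

# Pearson correlation; use = "complete.obs" drops rows with a missing value.
r <- cor(toy$ShareWomen, toy$Median, use = "complete.obs")

r
```

On the real data, the same call would be cor(recent_grads$ShareWomen, recent_grads$Median, use = "complete.obs"); a negative value would agree with the downward trend in the plot.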

Q5 Does Major_category have anything to do with median earnings?

Hint: Add the third variable to the aes function by mapping Major_category to color.

recent_grads%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+ geom_point()

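Yes. Coloring the points by Major_category shows that majors in the same category tend to sit close together; for example, most of the highest-earning majors listed in Q3 are engineering fields. With 16 categories, however, the colors are hard to tell apart.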

Q6 Lump together least common factor levels into “Other”. There are too many levels in Major_category.

Hint: Take recent_grads, pipe it to mutate(Major_category = fct_lump(Major_category, 4)), and pipe it to ggplot().

recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+ 
  geom_point()

Q7 Add the regression line.

Hint: Add geom_smooth(aes(group = 1), method = "lm") to the ggplot() code.

recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm")

Q8 Convert the numbers on the x-axis into the percent format.

Hint: Add scale_x_continuous(labels = percent_format()) to the ggplot() code.

recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format())

Q9 Convert the numbers on the y-axis into the dollar format.

Hint: Add scale_y_continuous(labels = scales::dollar_format()) to the ggplot() code.

recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format())

Q10 Expand the y-axis to zero.

Hint: Add expand_limits() to the ggplot() code.

recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y=0)

Q11 What majors appear to be outliers (far away from the regression line)?

Hint: Add the third variable to the aes function by mapping Major to label. Assign the result to g and, in the next two lines, type library(plotly) and then ggplotly(g).

g <-
  recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category, label= Major))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y=0) 
library(plotly)
ggplotly(g)

The outliers are Nursing and Petroleum Engineering.

Q12 Are the outliers valid in terms of the sample size?

Hint: Add the third variable to the aes function by mapping Sample_size to size.

g <-
  recent_grads%>%
  mutate(Major_category= fct_lump(Major_category, 4))%>%
  ggplot(aes(x= ShareWomen, y= Median, color= Major_category, label= Major, size= Sample_size ))+
  geom_point()+
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y=0) 
library(plotly)
ggplotly(g)

Nursing is a valid outlier, with a sample size of 2,554. Petroleum Engineering is less reliable, with a sample size of only 36.
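The same check can be done in code by filtering on Sample_size. In the sketch below, the two sample sizes come from the answer above, while the cutoff of 100 respondents (min_sample) is a hypothetical threshold chosen only for illustration; the names outliers and reliable are likewise made up.

```r
library(dplyr)

# Sample sizes quoted in the answer above.
outliers <- tibble(
  Major = c("NURSING", "PETROLEUM ENGINEERING"),
  Sample_size = c(2554L, 36L)
)

# Hypothetical cutoff: treat fewer than 100 respondents as too few to trust.
min_sample <- 100

# Keep only the majors whose sample size meets the cutoff.
reliable <- outliers %>%
  filter(Sample_size >= min_sample)

reliable
```

With this cutoff, only Nursing remains. A different threshold could change which outliers count as reliable, so the cutoff should be stated whenever the result is reported.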