For information about the data, see https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-10-16
library(tidyverse)
library(scales)
# Import data
recent_grads <- read.csv("~/R/BusStats/Data/recent_grads.csv") %>% as_tibble()
Hint: Use count.
recent_grads %>% count(Major_category)
## # A tibble: 16 x 2
## Major_category n
## <fct> <int>
## 1 Agriculture & Natural Resources 10
## 2 Arts 8
## 3 Biology & Life Science 14
## 4 Business 13
## 5 Communications & Journalism 4
## 6 Computers & Mathematics 11
## 7 Education 16
## 8 Engineering 29
## 9 Health 12
## 10 Humanities & Liberal Arts 15
## 11 Industrial Arts & Consumer Services 7
## 12 Interdisciplinary 1
## 13 Law & Public Policy 5
## 14 Physical Sciences 10
## 15 Psychology & Social Work 9
## 16 Social Science 9
There are 13 majors in the Business major category.
Hint: Take recent_grads, pipe it to dplyr::arrange, and pipe it to dplyr::select.
recent_grads %>%
  arrange(desc(Median)) %>%
  select(Major, Median)
## # A tibble: 173 x 2
## Major Median
## <fct> <int>
## 1 PETROLEUM ENGINEERING 110000
## 2 MINING AND MINERAL ENGINEERING 75000
## 3 METALLURGICAL ENGINEERING 73000
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING 70000
## 5 CHEMICAL ENGINEERING 65000
## 6 NUCLEAR ENGINEERING 65000
## 7 ACTUARIAL SCIENCE 62000
## 8 ASTRONOMY AND ASTROPHYSICS 62000
## 9 MECHANICAL ENGINEERING 60000
## 10 ELECTRICAL ENGINEERING 60000
## # ... with 163 more rows
Petroleum Engineering has the highest median earnings, at $110,000 a year.
Hint: Add the third variable to the aes function by mapping Major_category to color.
recent_grads %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point()
Yes, because it shows all 16 categories in Major_category, which is too many colors to tell apart.
Hint: Take recent_grads, pipe it to mutate(Major_category = fct_lump(Major_category, 4)), and pipe it to ggplot().
recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point()
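To make the effect of fct_lump easier to see, here is a minimal sketch on a toy factor. The vector `category` and its values are hypothetical and are not taken from recent_grads; the call assumes the forcats package, which tidyverse loads.

```r
library(forcats)

# Toy factor (hypothetical values, not from recent_grads)
category <- factor(c("A", "A", "A", "B", "B", "C", "D"))

# Keep the 2 most frequent levels and collapse the rest into "Other"
lumped <- fct_lump(category, n = 2)

levels(lumped)  # "A" "B" "Other"
table(lumped)   # A: 3, B: 2, Other: 2
```

In the plot code above, the same idea keeps the 4 largest major categories and groups the remaining ones under "Other", so the legend needs only 5 colors.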
Hint: Add geom_smooth(aes(group = 1), method = "lm") to the ggplot() code.
recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm")
Hint: Add scale_x_continuous(labels = percent_format()) to the ggplot() code.
recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format())
Hint: Add scale_y_continuous(labels = scales::dollar_format()) to the ggplot() code.
recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format())
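The two label helpers used in the axis scales return functions that turn numbers into formatted strings. A small sketch of what they do when called directly follows; the input values are made up for illustration, apart from 110000, which is the Petroleum Engineering median reported earlier.

```r
library(scales)

# percent_format() builds a function that formats proportions as percentages
pct_labels <- percent_format()(c(0.25, 0.5, 1))
pct_labels     # "25%" "50%" "100%"

# dollar_format() builds a function that formats numbers as dollar amounts
dollar_labels <- dollar_format()(c(36000, 110000))
dollar_labels  # "$36,000" "$110,000"
```

Passing these functions to the labels argument changes only how the axis ticks are printed; the underlying ShareWomen and Median values are untouched.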
Hint: Add expand_limits() to the ggplot() code.
recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y = 0)
Hint: Add the third variable to the aes function by mapping Major to label. Assign the result to g and, in the next two lines, type library(plotly) and then ggplotly(g).
g <- recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category, label = Major)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y = 0)
library(plotly)
ggplotly(g)
The outliers are Nursing and Petroleum Engineering.
Hint: Add the third variable to the aes function by mapping Sample_size to size.
g <- recent_grads %>%
  mutate(Major_category = fct_lump(Major_category, 4)) %>%
  ggplot(aes(x = ShareWomen, y = Median, color = Major_category, label = Major, size = Sample_size)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  expand_limits(y = 0)
library(plotly)
ggplotly(g)
The Nursing outlier is credible because it is based on a sample size of 2,554. The Petroleum Engineering outlier is less reliable because its sample size is only 36.
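The same judgment can be sketched in code. The tibble below holds only the two sample sizes quoted above, and the cutoff `min_sample` of 100 is a hypothetical threshold chosen for illustration, not a value from the data or the assignment.

```r
library(dplyr)

# The two outliers and their sample sizes, as quoted in the answer above
outliers <- tibble(
  Major = c("NURSING", "PETROLEUM ENGINEERING"),
  Sample_size = c(2554, 36)
)

# Hypothetical cutoff: treat medians from fewer than 100 respondents as unreliable
min_sample <- 100

flagged <- outliers %>%
  mutate(small_sample = Sample_size < min_sample)

flagged
```

With this cutoff, only Petroleum Engineering is flagged as a small sample. A different threshold could change which majors are flagged, so the cutoff should be treated as a judgment call.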