# Introduction
This project analyzes the relationship between college majors and earnings. The data come from The Economic Guide To Picking A College Major, published on the FiveThirtyEight website. Many considerations go into picking a major, such as research interests and the influence of family or peers. Using this data set, the project examines the correlation between earnings potential and the majors chosen by women students.
# Assumptions and questions
The working assumption is that women students consider the earning potential of a field when selecting a major.
Assumptions:
1. Earnings potential will impact major selection.
The following questions guide the analysis:
1. Which major has the lowest unemployment rate after graduation?
2. Which major has the highest percentage of women?
3. How do the distributions of median income compare across major categories?
4. Do women tend to choose majors with lower or higher earnings?
# Procedure and steps taken to answer the proposed questions
The first step is to sort the majors by unemployment rate, and then to arrange them in descending order of the share of women in each major.
Majors and median income
How do the distributions of median income compare across major categories? Three types of income are reported in this data frame: p25th, median, and p75th. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major. Lastly, the median incomes are averaged within each major category and arranged in descending order.
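As a minimal sketch of how three such percentiles can be obtained, the snippet below applies base R's `quantile()` to a small made-up vector. The values in `incomes` are hypothetical and are not taken from the data set; they only illustrate what the p25th, median, and p75th columns represent.

```r
# Hypothetical incomes for one major (illustrative values only, not from the data set)
incomes <- c(22000, 28000, 33000, 36000, 40000, 45000, 60000)

# 25th, 50th and 75th percentiles, mirroring the p25th, median and p75th columns
pcts <- quantile(incomes, probs = c(0.25, 0.50, 0.75))
print(pcts)
```

The middle value returned is the median of the vector, and the other two bound the middle half of the incomes.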
# The sample correlation coefficient (r) and the regression line
The sample correlation coefficient (r) measures how closely the points in a scatter plot cluster around a straight line; it indicates the strength and direction of the linear relationship between two variables, x and y. Here x is the median income of a major and y is its share of women, and we obtain r = -0.6186898.
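To make the definition concrete, the sketch below computes Pearson's r directly from its formula (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations) and compares it with base R's `cor()`. The vectors `share` and `income` are hypothetical illustrative values, not rows from the data set; the `NA` stands in for a missing value of the kind that `use = "complete.obs"` drops.

```r
# Hypothetical illustrative values (not from the data set)
share  <- c(0.10, 0.30, 0.50, 0.70, 0.90, NA)
income <- c(60000, 52000, 41000, 38000, 30000, 35000)

# Keep only pairs where both values are present, as use = "complete.obs" does
ok <- complete.cases(share, income)
x <- share[ok]
y <- income[ok]

# Pearson's r from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same quantity from the built-in function
r_builtin <- cor(share, income, use = "complete.obs")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values agree, and the sign is negative because `income` falls as `share` rises in this made-up example.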
# Conclusion
The result above is consistent with the assumption that earnings potential is related to the majors women students pick. The correlation obtained (r = -0.6186898) shows a moderate negative linear relationship between a major's median income and its share of women: majors with a higher share of women tend to have lower median earnings. There is therefore evidence of a linear association between earnings potential and major, although a correlation alone does not show that earnings drive the choice.
Answers to the questions above:
1. Which major has the lowest unemployment rate after graduation? - Five majors are tied at 0%: Mathematics And Computer Science, Military Technologies, Botany, Soil Science, and Educational Administration And Supervision
2. Which major has the highest percentage of women? - Early Childhood Education
3. How do the distributions of median income compare across major categories? - Please see the graph below (Distribution of median earnings for college majors)
4. Do women tend to choose majors with lower or higher earnings? - Majors with a higher share of women tend to have lower median earnings; see results and conclusion
#install.packages("fivethirtyeight")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 4.1.3
college<-college_recent_grads
#glimpse(college)
#We can use the select function to choose which variables to display, and use the percent() function to clean up the display a bit:
college_recent_grads %>%
arrange(unemployment_rate) %>%
select(rank, major, unemployment_rate) %>%
mutate(unemployment_rate = percent(unemployment_rate))
## # A tibble: 173 x 3
## rank major unemployment_rate
## <int> <chr> <chr>
## 1 53 Mathematics And Computer Science 0.00000%
## 2 74 Military Technologies 0.00000%
## 3 84 Botany 0.00000%
## 4 113 Soil Science 0.00000%
## 5 121 Educational Administration And Supervision 0.00000%
## 6 15 Engineering Mechanics Physics And Science 0.63343%
## 7 20 Court Reporting 1.16897%
## 8 120 Mathematics Teacher Education 1.62028%
## 9 1 Petroleum Engineering 1.83805%
## 10 65 General Agriculture 1.96425%
## # ... with 163 more rows
#Which majors have the highest unemployment rate?
college_recent_grads %>%
arrange(desc(unemployment_rate)) %>%
select(rank, major, unemployment_rate)%>%
top_n(20)
## Selecting by unemployment_rate
## # A tibble: 20 x 3
## rank major unemployment_rate
## <int> <chr> <dbl>
## 1 6 Nuclear Engineering 0.177
## 2 90 Public Administration 0.159
## 3 85 Computer Networking And Telecommunications 0.152
## 4 171 Clinical Psychology 0.149
## 5 30 Public Policy 0.128
## 6 106 Communication Technologies 0.120
## 7 2 Mining And Mineral Engineering 0.117
## 8 54 Computer Programming And Data Processing 0.114
## 9 80 Geography 0.113
## 10 59 Architecture 0.113
## 11 119 Community And Public Health 0.112
## 12 71 Industrial And Organizational Psychology 0.109
## 13 56 School Student Counseling 0.108
## 14 166 Other Foreign Languages 0.107
## 15 142 Film Video And Photographic Arts 0.106
## 16 173 Library Science 0.105
## 17 130 Linguistics And Comparative Language And Literature 0.104
## 18 143 General Social Sciences 0.103
## 19 163 Anthropology And Archeology 0.103
## 20 154 Visual And Performing Arts 0.102
#Which major has the highest percentage of women?
college_recent_grads %>%
arrange(desc(sharewomen)) %>%
select(major, total, sharewomen) %>%
top_n(20)
## Selecting by sharewomen
## # A tibble: 20 x 3
## major total sharewomen
## <chr> <int> <dbl>
## 1 Early Childhood Education 37589 0.969
## 2 Communication Disorders Sciences And Services 38279 0.968
## 3 Medical Assisting Services 11123 0.928
## 4 Elementary Education 170862 0.924
## 5 Family And Consumer Sciences 58001 0.911
## 6 Special Needs Education 28739 0.907
## 7 Human Services And Community Organization 9374 0.906
## 8 Social Work 53552 0.904
## 9 Nursing 209394 0.896
## 10 Miscellaneous Health Medical Professions 13386 0.881
## 11 Library Science 1098 0.878
## 12 Language And Drama Education 30471 0.877
## 13 Nutrition Sciences 18909 0.864
## 14 School Student Counseling 818 0.855
## 15 Art History And Criticism 21030 0.846
## 16 Educational Psychology 2854 0.817
## 17 General Education 143718 0.813
## 18 Teacher Education: Multiple Levels 14443 0.811
## 19 Clinical Psychology 2838 0.800
## 20 Miscellaneous Psychology 9628 0.799
#Arranging median incomes for major categories
college_recent_grads %>%
group_by(major_category) %>%
summarise(avg_mean_income = mean(median)) %>%
arrange(desc(avg_mean_income))%>%
top_n(100)
## Selecting by avg_mean_income
## # A tibble: 16 x 2
## major_category avg_mean_income
## <chr> <dbl>
## 1 Engineering 57383.
## 2 Business 43538.
## 3 Computers & Mathematics 42745.
## 4 Law & Public Policy 42200
## 5 Physical Sciences 41890
## 6 Social Science 37344.
## 7 Agriculture & Natural Resources 36900
## 8 Health 36825
## 9 Biology & Life Science 36421.
## 10 Industrial Arts & Consumer Services 36343.
## 11 Interdisciplinary 35000
## 12 Communications & Journalism 34500
## 13 Arts 33062.
## 14 Education 32350
## 15 Humanities & Liberal Arts 31913.
## 16 Psychology & Social Work 30100
#Summary Statistics
#Majors and median income
#We can also calculate summary statistics for this distribution using the summarise function:
college_recent_grads %>%
summarise(min = min(median), max = max(median),
mean = mean(median), med = median(median),
sd = sd(median),
q1 = quantile(median, probs = 0.25),
q3 = quantile(median, probs = 0.75))
## # A tibble: 1 x 7
## min max mean med sd q1 q3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000
#Distribution of median earnings for college majors
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram() +
labs(
x = "Median earnings, in $",
y = "Frequency",
title = "Distribution of median earnings for college majors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Faceted histogram of median incomes
#Plot the distribution of median income using a histogram, faceted by major_category, with a binwidth of 5000.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram(binwidth = 5000) +
facet_wrap( ~ major_category) +
labs(
x = "Median earnings, in $",
y = "Frequency",
title = "Distribution of median earnings for college majors",
subtitle = "By major category"
)
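#What types of majors do women tend to major in?
#Note: "avg_mean_income" is the name of the summary column computed earlier, not a value of major_category, so this filter matches no rows and returns an empty tibble.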
college_recent_grads %>%
filter(
major_category == "avg_mean_income",
median < 36000)
## # A tibble: 0 x 21
## # ... with 21 variables: rank <int>, major_code <int>, major <chr>,
## # major_category <chr>, total <int>, sample_size <int>, men <int>,
## # women <int>, sharewomen <dbl>, employed <int>, employed_fulltime <int>,
## # employed_parttime <int>, employed_fulltime_yearround <int>,
## # unemployed <int>, unemployment_rate <dbl>, p25th <dbl>, median <dbl>,
## # p75th <dbl>, college_jobs <int>, non_college_jobs <int>,
## # low_wage_jobs <int>
ggplot(data = college_recent_grads,
mapping = aes(x = sharewomen, y = median, colour = major_category)) +
geom_boxplot() +
labs(
x = "Percentage of women",
y = "Median income",
title = "Distribution of median earnings by major_category",
colour = "major_category") +
scale_x_continuous(labels = label_percent()) +
scale_y_continuous(labels = label_number())
## Warning: Removed 1 rows containing missing values (stat_boxplot).
#scatter plot for Distribution of median earnings by major_category
ggplot(data = college_recent_grads,
mapping = aes(x = sharewomen, y = median, colour = major_category)) +
geom_point() +
labs(
x = "Percentage of women",
y = "Median income",
title = "Distribution of median earnings by major_category",
colour = "major_category"
) +
scale_x_continuous(labels = label_percent()) +
scale_y_continuous(labels = label_number())
## Warning: Removed 1 rows containing missing values (geom_point).
#Calculate the correlation coefficient between median income and share of women
x_num <- as.numeric(college$median)
y_num <- as.numeric(college$sharewomen)
#use = "complete.obs" drops the one major with a missing sharewomen value
cor(x_num, y_num, use = "complete.obs")
## [1] -0.6186898
# Add the regression line
ggplot(college, aes(x = sharewomen, y = median)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Percentage of women",
y = "Median income",
title = "Median income versus share of women, with regression line")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).