# Introduction
  

This project analyzes the relationship between college majors and earnings. The data come from The Economic Guide To Picking A College Major, published on the FiveThirtyEight website. Many considerations go into picking a major, such as research interests and the influence of family or peers. Using this data set, the project examines the correlation between earnings potential and the majors chosen by women students.

# Assumptions and questions
            

Within the data set described above, we assume that women students choose among majors partly on the basis of the earning potential of the field.

Assumptions:

1. Earnings potential affects major selection.

The following questions guide the analysis, based on the assumption stated above:

1. Which major has the lowest unemployment rate after graduation?
2. Which major has the highest percentage of women?
3. How do the distributions of median income compare across major categories?
4. Do women tend to choose majors with lower or higher earnings?

# Procedure and steps taken to answer the proposed questions

The first step is to sort the majors by unemployment rate, and then to arrange them in descending order of the share of women in each major (`sharewomen`).
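The sorting step can be sketched as follows. This is a minimal illustration that uses base R's `order()` rather than the dplyr `arrange()` calls in the code listing, and it runs on a small made-up data frame (`toy` and all of its values are hypothetical, not taken from the data set).

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  major = c("A", "B", "C", "D"),
  unemployment_rate = c(0.08, 0.02, 0.05, 0.11),
  sharewomen = c(0.40, 0.75, 0.90, 0.20),
  stringsAsFactors = FALSE
)

# Ascending by unemployment rate: lowest rate first.
by_unemployment <- toy[order(toy$unemployment_rate), ]

# Descending by share of women: highest share first.
by_sharewomen <- toy[order(toy$sharewomen, decreasing = TRUE), ]

print(by_unemployment$major)
print(by_sharewomen$major)
```

The first row of each sorted data frame then answers the "lowest unemployment" and "highest share of women" questions for the toy data.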

# Majors and median income

How do the distributions of median income compare across major categories? Three income measures are reported in this data frame: `p25th`, `median`, and `p75th`. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major. Lastly, the median incomes are averaged within each major category, and the categories are arranged from highest to lowest.
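To make the three percentiles concrete, here is a minimal sketch that computes them with base R's `quantile()` on a made-up vector of incomes (the vector `incomes` is hypothetical and is not part of the data set).

```r
# Hypothetical incomes for one major; invented for illustration only.
incomes <- c(22000, 28000, 30000, 36000, 40000, 45000, 60000)

# 25th, 50th, and 75th percentiles (R's default quantile type 7).
pcts <- quantile(incomes, probs = c(0.25, 0.50, 0.75))
print(pcts)

# The 50th percentile is the median of the same values.
print(median(incomes))
```

In the data set itself these values are already stored per major, so no such computation is needed there; the sketch only shows what the three columns mean.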

# The sample correlation coefficient (r) and the regression line

The sample correlation coefficient (r) measures how closely the points in a scatter plot cluster around a straight line, and so indicates the strength and direction of the linear relationship between two variables, x and y. For median income and the share of women, we obtain r = -0.6186898.
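As a hedged illustration of what `cor()` computes, the sketch below calculates Pearson's r from its definition and compares it with the built-in function. The vectors `x` and `y` are made-up values, not rows from the data set; one `NA` is included to show the effect of `use = "complete.obs"`.

```r
# Hypothetical paired values; invented for illustration only.
x <- c(0.10, 0.30, 0.50, 0.70, NA, 0.90)          # e.g. share of women
y <- c(60000, 52000, 41000, 38000, 35000, 30000)  # e.g. median income

# Keep only pairs where both values are present,
# which is what use = "complete.obs" does inside cor().
ok <- complete.cases(x, y)
xc <- x[ok]
yc <- y[ok]

# Pearson's r from its definition: covariation divided by the
# product of the two spreads.
r_manual <- sum((xc - mean(xc)) * (yc - mean(yc))) /
  sqrt(sum((xc - mean(xc))^2) * sum((yc - mean(yc))^2))

# The built-in function should give the same value.
r_builtin <- cor(x, y, use = "complete.obs")

print(c(manual = r_manual, builtin = r_builtin))
```

A negative value, as in this toy example, means that y tends to fall as x rises.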

# Conclusion
                                
The result shown above supports the assumption that earnings potential is one of the factors associated with the majors women students pick. The correlation coefficient obtained (r = -0.6186898) indicates a moderate negative linear relationship between the share of women in a major and its median earnings: majors with a higher share of women tend to have lower median earnings. Therefore, there is evidence of a linear correlation between earnings potential and major.

The answers to the questions listed above are:

1. Which major has the lowest unemployment rate after graduation? Mathematics And Computer Science, the first of five majors listed with a 0% unemployment rate.
2. Which major has the highest percentage of women? Early Childhood Education.
3. How do the distributions of median income compare across major categories? See the histograms below (Distribution of median earnings for college majors).
4. Do women tend to choose majors with lower or higher earnings? See the results and conclusion.

#install.packages("fivethirtyeight")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 4.1.3
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
college<-college_recent_grads
#glimpse(college)

#We can use the select function to choose which variables to display, and use the percent() function to clean up the display a bit:

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  mutate(unemployment_rate = percent(unemployment_rate))
## # A tibble: 173 x 3
##     rank major                                      unemployment_rate
##    <int> <chr>                                      <chr>            
##  1    53 Mathematics And Computer Science           0.00000%         
##  2    74 Military Technologies                      0.00000%         
##  3    84 Botany                                     0.00000%         
##  4   113 Soil Science                               0.00000%         
##  5   121 Educational Administration And Supervision 0.00000%         
##  6    15 Engineering Mechanics Physics And Science  0.63343%         
##  7    20 Court Reporting                            1.16897%         
##  8   120 Mathematics Teacher Education              1.62028%         
##  9     1 Petroleum Engineering                      1.83805%         
## 10    65 General Agriculture                        1.96425%         
## # ... with 163 more rows

#Which majors have the highest unemployment rate?

college_recent_grads %>%
  arrange(desc(unemployment_rate)) %>%
  select(rank, major, unemployment_rate) %>%
  top_n(20)
## Selecting by unemployment_rate
## # A tibble: 20 x 3
##     rank major                                               unemployment_rate
##    <int> <chr>                                                           <dbl>
##  1     6 Nuclear Engineering                                             0.177
##  2    90 Public Administration                                           0.159
##  3    85 Computer Networking And Telecommunications                      0.152
##  4   171 Clinical Psychology                                             0.149
##  5    30 Public Policy                                                   0.128
##  6   106 Communication Technologies                                      0.120
##  7     2 Mining And Mineral Engineering                                  0.117
##  8    54 Computer Programming And Data Processing                        0.114
##  9    80 Geography                                                       0.113
## 10    59 Architecture                                                    0.113
## 11   119 Community And Public Health                                     0.112
## 12    71 Industrial And Organizational Psychology                        0.109
## 13    56 School Student Counseling                                       0.108
## 14   166 Other Foreign Languages                                         0.107
## 15   142 Film Video And Photographic Arts                                0.106
## 16   173 Library Science                                                 0.105
## 17   130 Linguistics And Comparative Language And Literature             0.104
## 18   143 General Social Sciences                                         0.103
## 19   163 Anthropology And Archeology                                     0.103
## 20   154 Visual And Performing Arts                                      0.102

#Which major has the highest percentage of women?

college_recent_grads %>%
  arrange(desc(sharewomen)) %>%
  select(major, total, sharewomen) %>%
  top_n(20)
## Selecting by sharewomen
## # A tibble: 20 x 3
##    major                                          total sharewomen
##    <chr>                                          <int>      <dbl>
##  1 Early Childhood Education                      37589      0.969
##  2 Communication Disorders Sciences And Services  38279      0.968
##  3 Medical Assisting Services                     11123      0.928
##  4 Elementary Education                          170862      0.924
##  5 Family And Consumer Sciences                   58001      0.911
##  6 Special Needs Education                        28739      0.907
##  7 Human Services And Community Organization       9374      0.906
##  8 Social Work                                    53552      0.904
##  9 Nursing                                       209394      0.896
## 10 Miscellaneous Health Medical Professions       13386      0.881
## 11 Library Science                                 1098      0.878
## 12 Language And Drama Education                   30471      0.877
## 13 Nutrition Sciences                             18909      0.864
## 14 School Student Counseling                        818      0.855
## 15 Art History And Criticism                      21030      0.846
## 16 Educational Psychology                          2854      0.817
## 17 General Education                             143718      0.813
## 18 Teacher Education: Multiple Levels             14443      0.811
## 19 Clinical Psychology                             2838      0.800
## 20 Miscellaneous Psychology                        9628      0.799

#Arranging median incomes for major categories

college_recent_grads %>%
  group_by(major_category) %>%
  summarise(avg_mean_income = mean(median)) %>%
  arrange(desc(avg_mean_income))%>%
  top_n(100)
## Selecting by avg_mean_income
## # A tibble: 16 x 2
##    major_category                      avg_mean_income
##    <chr>                                         <dbl>
##  1 Engineering                                  57383.
##  2 Business                                     43538.
##  3 Computers & Mathematics                      42745.
##  4 Law & Public Policy                          42200 
##  5 Physical Sciences                            41890 
##  6 Social Science                               37344.
##  7 Agriculture & Natural Resources              36900 
##  8 Health                                       36825 
##  9 Biology & Life Science                       36421.
## 10 Industrial Arts & Consumer Services          36343.
## 11 Interdisciplinary                            35000 
## 12 Communications & Journalism                  34500 
## 13 Arts                                         33062.
## 14 Education                                    32350 
## 15 Humanities & Liberal Arts                    31913.
## 16 Psychology & Social Work                     30100

#Summary Statistics
#Majors and median income
#We can also calculate summary statistics for this distribution using the summarise function:

college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median),
            sd = sd(median), 
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))
## # A tibble: 1 x 7
##     min    max   mean   med     sd    q1    q3
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000

#Distribution of median earnings for college majors

ggplot(data = college_recent_grads, mapping = aes(x = median)) +
  geom_histogram() +
  labs(
    x = "Median earnings, in $",
    y = "Frequency",
    title = "Distribution of median earnings for college majors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Faceted histogram of median incomes
#Plot the distribution of median income using a histogram, faceted by major_category, with a bin width of 5000.

ggplot(data = college_recent_grads, mapping = aes(x = median)) +
  geom_histogram(binwidth = 5000, fill = "blue") + 
  facet_wrap( ~ major_category) + 
  labs(
    x = "Median earnings, in $",
    y = "Frequency",
    title = "Distribution of median earnings for college majors",
    subtitle = "By major category"
  )

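#What types of majors do women tend to major in?
#Note: "avg_mean_income" is the summary column created above, not a value of
#major_category, so this filter matches no rows and returns an empty tibble.

college_recent_grads %>%
  filter(
    major_category == "avg_mean_income",
    median < 36000)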
## # A tibble: 0 x 21
## # ... with 21 variables: rank <int>, major_code <int>, major <chr>,
## #   major_category <chr>, total <int>, sample_size <int>, men <int>,
## #   women <int>, sharewomen <dbl>, employed <int>, employed_fulltime <int>,
## #   employed_parttime <int>, employed_fulltime_yearround <int>,
## #   unemployed <int>, unemployment_rate <dbl>, p25th <dbl>, median <dbl>,
## #   p75th <dbl>, college_jobs <int>, non_college_jobs <int>,
## #   low_wage_jobs <int>
ggplot(data = college_recent_grads, 
       mapping = aes(x = sharewomen, y = median, colour = major_category)) +
  geom_boxplot() +
  labs(
    x = "Percentage of women",
    y = "Median income",
    title = "Distribution of median earnings by major_category",
    colour = "major_category") +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(labels = label_number())
## Warning: Removed 1 rows containing missing values (stat_boxplot).

#Scatter plot of median earnings against the share of women, coloured by major_category

ggplot(data = college_recent_grads, 
       mapping = aes(x = sharewomen, y = median, colour = major_category)) +
  geom_point() +
  labs(
    x = "Percentage of women",
    y = "Median income",
    title = "Distribution of median earnings by major_category",
    colour = "major_category"
  ) +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(labels = label_number())
## Warning: Removed 1 rows containing missing values (geom_point).

#Calculate the correlation coefficient

x_num <- as.numeric(college$median)
y_num <- as.numeric(college$sharewomen)
#use = "complete.obs" drops the one major with a missing sharewomen value
cor(x_num, y_num, use = "complete.obs")
## [1] -0.6186898
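# Add the regression line
# x_num holds median income and y_num holds the share of women, so the axis labels follow that order
ggplot(college, aes(x = x_num, y = y_num)) + 
  geom_point() +
  geom_smooth(method = lm) +
  labs(
    x = "Median income",
    y = "Share of women",
    title = "Share of women against median income, with regression line")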
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).