knitr::opts_chunk$set(echo = TRUE)
The aim of this assignment is to study how income varies across college major categories. Specifically answer: “Is there an association between college major category and income?”
To get started, start a new R/RStudio session with a clean workspace. To do this in R, you can use the q() function to quit, then reopen R. The easiest way to do this in RStudio is to quit RStudio entirely and reopen it. After you have started a new session, run the following commands. This will load a data.frame called college for you to work with:
college <- read.csv("https://query.data.world/s/uieteyrze67twkiujwxffsokaml44y", header=TRUE, stringsAsFactors=FALSE);
Please upload this college_major_analysis.rds file to a public GitHub repository. In question 4 of this quiz, you will share the link to this file.
The following code book describes the variables in the “college_major_analysis.rds” dataset:
head(college)
## Major_code Major
## 1 1100 GENERAL AGRICULTURE
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3 1102 AGRICULTURAL ECONOMICS
## 4 1103 ANIMAL SCIENCES
## 5 1104 FOOD SCIENCE
## 6 1105 PLANT SCIENCE AND AGRONOMY
## Major_category Total Employed Employed_full_time_year_round
## 1 Agriculture & Natural Resources 128148 90245 74078
## 2 Agriculture & Natural Resources 95326 76865 64240
## 3 Agriculture & Natural Resources 33955 26321 22810
## 4 Agriculture & Natural Resources 103549 81177 64937
## 5 Agriculture & Natural Resources 24280 17281 12722
## 6 Agriculture & Natural Resources 79409 63043 51077
## Unemployed Unemployment_rate Median P25th P75th
## 1 2423 0.02614711 50000 34000 80000
## 2 2266 0.02863606 54000 36000 80000
## 3 821 0.03024832 63000 40000 98000
## 4 3619 0.04267890 46000 30000 72000
## 5 894 0.04918845 62000 38500 90000
## 6 2070 0.03179089 50000 35000 75000
summary(college)
## Major_code Major Major_category Total
## Min. :1100 Length:173 Length:173 Min. : 2396
## 1st Qu.:2403 Class :character Class :character 1st Qu.: 24280
## Median :3608 Mode :character Mode :character Median : 75791
## Mean :3880 Mean : 230257
## 3rd Qu.:5503 3rd Qu.: 205763
## Max. :6403 Max. :3123510
## Employed Employed_full_time_year_round Unemployed
## Min. : 1492 Min. : 1093 Min. : 0
## 1st Qu.: 17281 1st Qu.: 12722 1st Qu.: 1101
## Median : 56564 Median : 39613 Median : 3619
## Mean : 166162 Mean : 126308 Mean : 9725
## 3rd Qu.: 142879 3rd Qu.: 111025 3rd Qu.: 8862
## Max. :2354398 Max. :1939384 Max. :147261
## Unemployment_rate Median P25th P75th
## Min. :0.00000 Min. : 35000 Min. :24900 Min. : 45800
## 1st Qu.:0.04626 1st Qu.: 46000 1st Qu.:32000 1st Qu.: 70000
## Median :0.05472 Median : 53000 Median :36000 Median : 80000
## Mean :0.05736 Mean : 56816 Mean :38697 Mean : 82506
## 3rd Qu.:0.06904 3rd Qu.: 65000 3rd Qu.:42000 3rd Qu.: 95000
## Max. :0.15615 Max. :125000 Max. :78000 Max. :210000
The first column pertaining to majors is major codes and I have ignored this the second colum is major (or name of the majors). There are 173 different university Majors in this category. I have not used this as the independent variable either. I have used major category in order to develop the univariate regression analyisis. I have enclosed a no simple histogram of median incomes of graduates in order to illustrate the distribution of graduate incomes.
as.numeric(college$Median)
## [1] 50000 54000 63000 46000 62000 50000 63000 52000 52000 58000
## [11] 52000 63000 46000 50000 50000 48000 50000 50000 65000 60000
## [21] 78000 68000 55000 55000 40000 43000 58000 41000 40000 43000
## [31] 48400 35300 46000 45000 42000 45000 40000 42000 42600 50000
## [41] 75000 80000 62000 78000 65000 86000 78000 80000 88000 65000
## [51] 70000 85000 75000 78000 80000 96000 92000 97000 95000 125000
## [61] 70000 63000 74000 67000 70000 60000 63000 48000 48000 45000
## [71] 40500 50000 48000 50000 40000 50000 46700 40000 51000 53000
## [81] 50000 45000 47500 48000 60000 60000 50000 55000 35000 52000
## [91] 66000 70000 70000 64000 43000 45000 49500 92000 53000 45000
## [101] 44000 45000 40000 60000 80000 60000 59000 65000 57000 55000
## [111] 70000 75000 56000 62000 45000 40000 45000 39000 62000 47000
## [121] 45000 50000 56000 60000 38000 40000 50000 69000 43000 49000
## [131] 54000 55000 58000 47000 52000 65000 48000 67000 45000 42000
## [141] 45000 40000 46600 47000 44500 37600 45000 50000 42000 50000
## [151] 55000 60000 50000 62000 106000 61000 47000 45000 60000 65000
## [161] 72000 58000 65000 65000 56000 65000 54000 54000 49000 72000
## [171] 53000 50000 50000
hist(college$Median, xlab="Median income ($)")
## make sure median is a numeric variable and major category is a factor
Median_income<- as.numeric(college$Median)
Major_cat<- as.factor(college$Major_category)
##perform univariate analysis
fit<- lm(Median_income~Major_cat)
summary(fit)
##
## Call:
## lm(formula = Median_income ~ Major_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17759 -5000 -831 3542 49542
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55000 3044 18.071
## Major_catArts -11475 4565 -2.514
## Major_catBiology & Life Science -4179 3985 -1.049
## Major_catBusiness 5615 4048 1.387
## Major_catCommunications & Journalism -5500 5694 -0.966
## Major_catComputers & Mathematics 11273 4205 2.681
## Major_catEducation -11169 3880 -2.879
## Major_catEngineering 22759 3529 6.448
## Major_catHealth 1458 4121 0.354
## Major_catHumanities & Liberal Arts -8920 3929 -2.270
## Major_catIndustrial Arts & Consumer Services -2357 4743 -0.497
## Major_catInterdisciplinary -12000 10094 -1.189
## Major_catLaw & Public Policy -2200 5272 -0.417
## Major_catPhysical Sciences 7400 4304 1.719
## Major_catPsychology & Social Work -10444 4422 -2.362
## Major_catSocial Science -1778 4422 -0.402
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Major_catArts 0.01296 *
## Major_catBiology & Life Science 0.29597
## Major_catBusiness 0.16737
## Major_catCommunications & Journalism 0.33555
## Major_catComputers & Mathematics 0.00813 **
## Major_catEducation 0.00455 **
## Major_catEngineering 1.33e-09 ***
## Major_catHealth 0.72390
## Major_catHumanities & Liberal Arts 0.02455 *
## Major_catIndustrial Arts & Consumer Services 0.61990
## Major_catInterdisciplinary 0.23631
## Major_catLaw & Public Policy 0.67700
## Major_catPhysical Sciences 0.08753 .
## Major_catPsychology & Social Work 0.01941 *
## Major_catSocial Science 0.68821
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9624 on 157 degrees of freedom
## Multiple R-squared: 0.6091, Adjusted R-squared: 0.5717
## F-statistic: 16.31 on 15 and 157 DF, p-value: < 2.2e-16
In order to visualise the difference. I have drawn a no frills box plot comparing these 14 categories of majors
This is a simplified assessment there is a statistically significant relationship between the category of college major and the graduate income (adjusted R squared=0.5717). There are some pertinent columns in this dataset which have not been used such as the percentage of different types of jobs(low income jobs, college jobs and non-college jobs) and percentage unemployed as well as the percentage of women taking that major category which may be effected by gender paygap. so multivariable analysis may result in a different result. But it cant be completed in 15 minutes.