Introduction

We’ll look at the college income data set. My assignment is to study how income varies across college major categories.

A codebook for the dataset is given below:

The specific question for this project is: “Is there an association between college major category and income?”

Based on your analysis, would you conclude that there is a significant association between college major category and income?

Load data

library(collegeIncome)
data(college)

Some exploratory analysis

head(college,5)
##   rank major_code                                     major major_category
## 1    1       2419                     Petroleum Engineering    Engineering
## 2    2       2416            Mining And Mineral Engineering    Engineering
## 3    3       2415                 Metallurgical Engineering    Engineering
## 4    4       2417 Naval Architecture And Marine Engineering    Engineering
## 5    5       2405                      Chemical Engineering    Engineering
##   total sample_size perc_women p25th median  p75th   perc_men perc_employed
## 1  2339          36  0.9109326 25000  40000  50000 0.08906743     0.9115044
## 2   756           7  0.5154064 26000  37000  40000 0.48459355     0.7980501
## 3   856           3  0.5942076 26700  45000  60000 0.40579235     0.7871943
## 4  1258          16  0.6521298 26000  35000  45000 0.34787018     0.8465608
## 5 32260         289  0.4179248 31500  62000 109000 0.58207520     0.8515625
##   perc_employed_fulltime perc_employed_parttime
## 1              0.9206524              0.1774785
## 2              0.7110092              0.3623853
## 3              0.8833498              0.3387257
## 4              0.9366337              0.1673267
## 5              0.8086363              0.4020061
##   perc_employed_fulltime_yearround perc_unemployed perc_college_jobs
## 1                        0.7704431      0.08849558         0.6702970
## 2                        0.7093101      0.20194986         0.3867764
## 3                        0.7738366      0.21280567         0.7289116
## 4                        0.6527853      0.15343915         0.2460902
## 5                        0.6852821      0.14843750         0.5867515
##   perc_non_college_jobs perc_low_wage_jobs
## 1             0.1821782         0.05544554
## 2             0.5158761         0.21560172
## 3             0.1759983         0.03014828
## 4             0.4107636         0.04323827
## 5             0.3860437         0.11801062
str(college)
## 'data.frame':    173 obs. of  19 variables:
##  $ rank                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ major_code                      : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ major                           : chr  "Petroleum Engineering" "Mining And Mineral Engineering" "Metallurgical Engineering" "Naval Architecture And Marine Engineering" ...
##  $ major_category                  : chr  "Engineering" "Engineering" "Engineering" "Engineering" ...
##  $ total                           : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ sample_size                     : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ perc_women                      : num  0.911 0.515 0.594 0.652 0.418 ...
##  $ p25th                           : num  25000 26000 26700 26000 31500 23000 32500 37900 29200 23000 ...
##  $ median                          : num  40000 37000 45000 35000 62000 44700 45000 57000 36000 32200 ...
##  $ p75th                           : num  50000 40000 60000 45000 109000 50000 58000 67000 46000 47100 ...
##  $ perc_men                        : num  0.0891 0.4846 0.4058 0.3479 0.5821 ...
##  $ perc_employed                   : num  0.912 0.798 0.787 0.847 0.852 ...
##  $ perc_employed_fulltime          : num  0.921 0.711 0.883 0.937 0.809 ...
##  $ perc_employed_parttime          : num  0.177 0.362 0.339 0.167 0.402 ...
##  $ perc_employed_fulltime_yearround: num  0.77 0.709 0.774 0.653 0.685 ...
##  $ perc_unemployed                 : num  0.0885 0.2019 0.2128 0.1534 0.1484 ...
##  $ perc_college_jobs               : num  0.67 0.387 0.729 0.246 0.587 ...
##  $ perc_non_college_jobs           : num  0.182 0.516 0.176 0.411 0.386 ...
##  $ perc_low_wage_jobs              : num  0.0554 0.2156 0.0301 0.0432 0.118 ...

We can see that the data has 173 observations of 19 variables, which corresponds to the codebook. The question asks about the relationship between major category and income, so I will only look at major_category and median. Other factors may obviously affect the analysis, for example gender (perc_men and perc_women), sample size (the number of respondents who reported income), perc_employed, and total, among others. For this analysis I omit all other variables.

Now let’s convert the categorical columns to factors and look at the relationship between the two variables of interest:

college$major <- as.factor(college$major)
college$major_code <- as.factor(college$major_code)
college$major_category <- as.factor(college$major_category)

boxplot(median/1000 ~ major_category, data = college, main = "Income vs. Major", ylab="Income (thousands of dollars)", las = 2)

We can see that the distribution of median income within each major category is not normal; it is skewed. However, since the purpose of this project is to practice linear models, I assume normality.
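One way to probe this assumption is to test the residuals of the fitted model rather than the raw medians. The sketch below is not part of the original analysis: it uses a simulated data frame `toy` (hypothetical names and values, not the `college` data) with right-skewed incomes and compares the Shapiro-Wilk test on the raw and log scales.

```r
# ASSUMPTION: simulated stand-in for the college data,
# three categories with right-skewed (log-normal) income.
set.seed(1)
toy <- data.frame(
    category = factor(rep(c("A", "B", "C"), each = 30)),
    income   = rlnorm(90, meanlog = log(40000), sdlog = 0.4)
)

# Fit the same kind of model as in this report, on the raw and log scales.
fit_raw <- lm(income ~ category, data = toy)
fit_log <- lm(log(income) ~ category, data = toy)

# Shapiro-Wilk test on the residuals; a small p-value suggests non-normality.
p_raw <- shapiro.test(residuals(fit_raw))$p.value
p_log <- shapiro.test(residuals(fit_log))$p.value
print(c(raw = p_raw, log = p_log))
```

If the raw-scale residuals look clearly non-normal while the log-scale ones do not, modelling log(median) would be one option to consider.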

Analyze

Let’s have a look at the major_category:

unique(college$major_category)
##  [1] Engineering                         Business                           
##  [3] Physical Sciences                   Law & Public Policy                
##  [5] Computers & Mathematics             Agriculture & Natural Resources    
##  [7] Industrial Arts & Consumer Services Arts                               
##  [9] Health                              Social Science                     
## [11] Biology & Life Science              Education                          
## [13] Humanities & Liberal Arts           Psychology & Social Work           
## [15] Communications & Journalism         Interdisciplinary                  
## 16 Levels: Agriculture & Natural Resources Arts ... Social Science

There are 16 of them. Let’s first sort the rows by category before fitting the regression model:

college <- college[order(college$major_category),]

When we fit a linear model of income on major category, the intercept is by default the mean of the reference category (the first level alphabetically, Agriculture & Natural Resources). The coefficient for each other category is the difference between that category's mean and the reference mean, and its p-value comes from a t-test of the null hypothesis that this difference is zero. For example, say we want to compare Arts with the others:

major_category_ref <- relevel(college$major_category, "Arts")
fit <- lm(median ~ major_category_ref, data = college)
summary(fit)$coef
##                                                         Estimate Std. Error
## (Intercept)                                            38050.000   4014.658
## major_category_refAgriculture & Natural Resources       5450.000   5386.228
## major_category_refBiology & Life Science                5814.286   5032.640
## major_category_refBusiness                             11103.846   5102.541
## major_category_refCommunications & Journalism           3950.000   6953.591
## major_category_refComputers & Mathematics              -3331.818   5276.294
## major_category_refEducation                             -112.500   4916.931
## major_category_refEngineering                           2343.103   4534.719
## major_category_refHealth                                2266.667   5182.901
## major_category_refHumanities & Liberal Arts            -2883.333   4971.264
## major_category_refIndustrial Arts & Consumer Services   2378.571   5876.857
## major_category_refInterdisciplinary                   -10550.000  12043.973
## major_category_refLaw & Public Policy                   -250.000   6473.441
## major_category_refPhysical Sciences                     2350.000   5386.228
## major_category_refPsychology & Social Work              1838.889   5517.619
## major_category_refSocial Science                        1016.667   5517.619
##                                                           t value     Pr(>|t|)
## (Intercept)                                            9.47776950 3.919976e-17
## major_category_refAgriculture & Natural Resources      1.01183974 3.131715e-01
## major_category_refBiology & Life Science               1.15531531 2.497166e-01
## major_category_refBusiness                             2.17614057 3.103954e-02
## major_category_refCommunications & Journalism          0.56805181 5.708113e-01
## major_category_refComputers & Mathematics             -0.63146941 5.286520e-01
## major_category_refEducation                           -0.02288012 9.817749e-01
## major_category_refEngineering                          0.51670312 6.060905e-01
## major_category_refHealth                               0.43733553 6.624690e-01
## major_category_refHumanities & Liberal Arts           -0.58000007 5.627460e-01
## major_category_refIndustrial Arts & Consumer Services  0.40473529 6.862230e-01
## major_category_refInterdisciplinary                   -0.87595680 3.823917e-01
## major_category_refLaw & Public Policy                 -0.03861934 9.692429e-01
## major_category_refPhysical Sciences                    0.43629787 6.632200e-01
## major_category_refPsychology & Social Work             0.33327579 7.393708e-01
## major_category_refSocial Science                       0.18425822 8.540487e-01

From this result we can read off some information:
- The mean of the median income for Arts majors is 38,050.
- The mean for Agriculture & Natural Resources is 5,450 higher than for Arts, and the p-value of this difference is 0.31, so the difference is not significant.
- The coefficients of the other categories can be interpreted in the same way.
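This reading of the coefficients can be checked on a small example. The sketch below uses a hypothetical data frame `toy` with made-up incomes (not the `college` data) and confirms that the intercept equals the reference mean and that a coefficient equals a difference in means. The final `anova()` call is an addition that the original analysis does not use: it gives a single overall F-test of whether any category mean differs.

```r
# ASSUMPTION: toy data with made-up incomes (in thousands), three categories.
toy <- data.frame(
    category = factor(rep(c("Arts", "Business", "Health"), each = 4)),
    income   = c(30, 34, 38, 42,  48, 50, 52, 54,  36, 40, 41, 43)
)

# Make "Arts" the reference level, as in the example above.
toy$category <- relevel(toy$category, "Arts")
fit_toy <- lm(income ~ category, data = toy)

group_means <- tapply(toy$income, toy$category, mean)
coefs <- coef(fit_toy)

# The intercept is the mean of the reference category.
print(c(intercept = unname(coefs["(Intercept)"]),
        arts_mean = unname(group_means["Arts"])))

# Each other coefficient is that category's mean minus the reference mean.
print(c(coefficient = unname(coefs["categoryBusiness"]),
        mean_diff   = unname(group_means["Business"] - group_means["Arts"])))

# Overall F-test: do the category means differ at all?
overall <- anova(fit_toy)
print(overall[["Pr(>F)"]][1])
```

On the real data, the analogous call would be `anova(lm(median ~ major_category, data = college))`, which addresses the project question with one test instead of many pairwise ones.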

For this project, we ideally fit linear regression models of income (median) on college major category (major_category) with each category in turn as the reference. Given a reference level, the model coefficients give the differences between the other categories' means and the reference mean, together with p-values for the null hypothesis that each difference is zero. I will fit one regression model per reference category and store the resulting p-values in a 2D matrix A.

# Column i of A holds the p-values from the model with category i as reference
lv <- levels(college$major_category)
A <- matrix(NA_real_, nrow = 16, ncol = 16)

for (i in 1:16){
    major_category_ref <- relevel(college$major_category, lv[i])
    fit <- lm(median ~ major_category_ref, data = college)
    tmp <- summary(fit)$coef[,4]
    # tmp[1] is the intercept p-value; place it on the diagonal
    A[i,i] <- tmp[1]
    # the remaining p-values follow level order with level i skipped
    A[-i,i] <- tmp[-1]
}
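As a cross-check on the loop above, base R's `pairwise.t.test()` with a pooled standard deviation and no p-value adjustment should give the same pairwise p-values as releveling and refitting, since both rely on the pooled residual variance and the residual degrees of freedom. The sketch below verifies this on simulated data; the data frame `toy` and its columns are hypothetical.

```r
# ASSUMPTION: simulated data with four unbalanced categories.
set.seed(42)
toy <- data.frame(
    category = factor(rep(c("A", "B", "C", "D"), times = c(8, 12, 5, 10))),
    income   = rnorm(35, mean = 40000, sd = 8000)
)

# p-value for B vs. A from a model with A as the reference level.
fit_toy <- lm(income ~ relevel(category, "A"), data = toy)
p_lm <- summary(fit_toy)$coef[2, 4]

# All pairwise comparisons in one call: pooled SD, no adjustment.
pw <- pairwise.t.test(toy$income, toy$category,
                      p.adjust.method = "none", pool.sd = TRUE)
p_pw <- pw$p.value["B", "A"]

print(c(lm = p_lm, pairwise = p_pw))
```

`pairwise.t.test()` returns only the lower triangle of the comparison matrix, which is enough because the p-value for category i against j equals the one for j against i.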

Edit the matrix and plot.

library(reshape)

library(ggplot2)

We should expect a square symmetric matrix whose diagonal values are very low, because the diagonal holds the intercept p-values rather than comparisons between categories.

B <- data.frame(A)
names(B) <- unique(college$major_category)
B$major <- unique(college$major_category)
Bmelt <- melt(B)
## Using major as id variables
g = ggplot(data=Bmelt, aes(x=variable, y=major, fill=value))
g = g + geom_tile()
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylab("Major") + xlab("Major")
g = g + ggtitle("Probability of difference in Income between Majors")
g = g + coord_fixed()
g

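Using a 95% confidence level, I mark all p-values smaller than 0.025 as Different and those greater than or equal to 0.025 as Same. Note that Pr(>|t|) is already a two-sided p-value, so this cutoff is stricter than the conventional 0.05.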

g = ggplot(data=Bmelt, aes(x=variable, y=major, fill=value < 0.025))
g = g + geom_tile()
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylab("Major") + xlab("Major")
g = g + ggtitle("Difference in Income between Majors")
g = g + coord_fixed()
g
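To see what the `Pr(>|t|)` column measures, the sketch below recomputes one p-value by hand from its t value. The t value and reported p-value are copied from the Business row of the Arts-reference table above. The residual degrees of freedom are an assumption (173 observations minus 16 categories, with no missing medians).

```r
# Copied from the Business row of the Arts-reference coefficient table.
t_value    <- 2.17614057
p_reported <- 3.103954e-02

# ASSUMPTION: residual df = 173 observations - 16 estimated category means.
df_resid <- 173 - 16

# Pr(>|t|) counts both tails of the t distribution.
p_two_sided <- 2 * pt(-abs(t_value), df = df_resid)
# A single tail is half of that.
p_one_sided <- pt(-abs(t_value), df = df_resid)

print(c(two_sided = p_two_sided, one_sided = p_one_sided, reported = p_reported))
```

Because the reported value already covers both tails, comparing it with 0.025 corresponds to a two-sided test at the 2.5% level rather than the 5% level.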

Apparently most major categories have similar income, except Business. Its income is significantly different from that of Computers & Mathematics, Education, Engineering, and Humanities & Liberal Arts. Let’s calculate how different they are from Business:

major_category_ref <- relevel(college$major_category, "Business")
fit <- lm(median ~ major_category_ref, data = college)
summary(fit)$coef
##                                                         Estimate Std. Error
## (Intercept)                                            49153.846   3149.357
## major_category_refAgriculture & Natural Resources      -5653.846   4776.236
## major_category_refArts                                -11103.846   5102.541
## major_category_refBiology & Life Science               -5289.560   4373.606
## major_category_refCommunications & Journalism          -7153.846   6492.565
## major_category_refComputers & Mathematics             -14435.664   4651.908
## major_category_refEducation                           -11216.346   4239.951
## major_category_refEngineering                          -8760.743   3790.072
## major_category_refHealth                               -8837.179   4545.705
## major_category_refHumanities & Liberal Arts           -13987.179   4302.840
## major_category_refIndustrial Arts & Consumer Services  -8725.275   5323.384
## major_category_refInterdisciplinary                   -21653.846  11783.813
## major_category_refLaw & Public Policy                 -11353.846   5975.484
## major_category_refPhysical Sciences                    -8753.846   4776.236
## major_category_refPsychology & Social Work             -9264.957   4923.931
## major_category_refSocial Science                      -10087.179   4923.931
##                                                         t value     Pr(>|t|)
## (Intercept)                                           15.607584 9.444322e-34
## major_category_refAgriculture & Natural Resources     -1.183745 2.383031e-01
## major_category_refArts                                -2.176141 3.103954e-02
## major_category_refBiology & Life Science              -1.209428 2.283166e-01
## major_category_refCommunications & Journalism         -1.101852 2.722123e-01
## major_category_refComputers & Mathematics             -3.103171 2.271210e-03
## major_category_refEducation                           -2.645395 8.989341e-03
## major_category_refEngineering                         -2.311498 2.210557e-02
## major_category_refHealth                              -1.944073 5.367450e-02
## major_category_refHumanities & Liberal Arts           -3.250685 1.408831e-03
## major_category_refIndustrial Arts & Consumer Services -1.639047 1.032059e-01
## major_category_refInterdisciplinary                   -1.837592 6.801278e-02
## major_category_refLaw & Public Policy                 -1.900071 5.925698e-02
## major_category_refPhysical Sciences                   -1.832792 6.872781e-02
## major_category_refPsychology & Social Work            -1.881618 6.173891e-02
## major_category_refSocial Science                      -2.048603 4.216615e-02

Now look at the five categories with the lowest p-values:

business_diff <- summary(fit)$coef[-1,]
business_diff[order(business_diff[,4])[1:5],]
##                                               Estimate Std. Error   t value
## major_category_refHumanities & Liberal Arts -13987.179   4302.840 -3.250685
## major_category_refComputers & Mathematics   -14435.664   4651.908 -3.103171
## major_category_refEducation                 -11216.346   4239.951 -2.645395
## major_category_refEngineering                -8760.743   3790.072 -2.311498
## major_category_refArts                      -11103.846   5102.541 -2.176141
##                                                Pr(>|t|)
## major_category_refHumanities & Liberal Arts 0.001408831
## major_category_refComputers & Mathematics   0.002271210
## major_category_refEducation                 0.008989341
## major_category_refEngineering               0.022105573
## major_category_refArts                      0.031039539

Clearly the four categories pointed out above have the lowest p-values, and the fifth one (Arts) has a p-value of 0.031, which is above the 0.025 cutoff.
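Because many comparisons are made at once, some small p-values are expected by chance alone. A possible refinement, not part of the original analysis, is to adjust the p-values with base R's `p.adjust()`. The sketch below applies the Holm method to the 15 Business comparisons, with values copied from the table above. This is an illustration only: a full treatment would adjust across all 120 pairwise comparisons, not just this one column.

```r
# p-values for each category vs. Business, copied from the table above.
p_raw <- c(
    "Agriculture & Natural Resources"     = 2.383031e-01,
    "Arts"                                = 3.103954e-02,
    "Biology & Life Science"              = 2.283166e-01,
    "Communications & Journalism"         = 2.722123e-01,
    "Computers & Mathematics"             = 2.271210e-03,
    "Education"                           = 8.989341e-03,
    "Engineering"                         = 2.210557e-02,
    "Health"                              = 5.367450e-02,
    "Humanities & Liberal Arts"           = 1.408831e-03,
    "Industrial Arts & Consumer Services" = 1.032059e-01,
    "Interdisciplinary"                   = 6.801278e-02,
    "Law & Public Policy"                 = 5.925698e-02,
    "Physical Sciences"                   = 6.872781e-02,
    "Psychology & Social Work"            = 6.173891e-02,
    "Social Science"                      = 4.216615e-02
)

# Holm adjustment controls the family-wise error rate across these 15 tests.
p_holm <- p.adjust(p_raw, method = "holm")
print(round(sort(p_holm), 4))

# Categories still flagged at the 0.05 level after adjustment.
flagged <- names(p_holm)[p_holm < 0.05]
print(flagged)
```

Under this adjustment only Humanities & Liberal Arts and Computers & Mathematics stay below 0.05, which suggests the evidence for the other differences from Business is weaker than the unadjusted p-values imply.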