Practice Quiz Regression Models
Based on an analysis of 173 observations and 19 variables, there is insufficient evidence to conclude that major category has a significant association with income.
The Optional Quiz Assignment asks for an analysis of the relationship
between income and major category. The study should be performed
using the college dataset from the
collegeIncome library.
This Practice Quiz aims to answer the following question:
Based on your analysis, would you conclude that there is a significant association between college major category and income?
It is necessary to use the following packages to perform this experiment.
# Loading packages
library(collegeIncome)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(explore)
library(kableExtra)
library(DT)
library(PerformanceAnalytics)
library(DiagrammeR)
library(GGally)
To reproduce it, please fork the experiment repository hosted on GitHub.
Following the practice quiz instructions, I have used the
college dataset from the collegeIncome
package.
# Loading college data to environment.
data("college")
# Creating a copy of college data.
df_college <- college
From the assignment instructions:
The college dataset has 173 observations and 19
variables.
# Checking the number of observations and variables.
dim(df_college)
## [1] 173 19
The first 10 rows of the dataset:
The last three rows of the dataset:
Let’s check the variables’ types.
## 'data.frame': 173 obs. of 19 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ major : chr "Petroleum Engineering" "Mining And Mineral Engineering" "Metallurgical Engineering" "Naval Architecture And Marine Engineering" ...
## $ major_category : chr "Engineering" "Engineering" "Engineering" "Engineering" ...
## $ total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ perc_women : num 0.911 0.515 0.594 0.652 0.418 ...
## $ p25th : num 25000 26000 26700 26000 31500 23000 32500 37900 29200 23000 ...
## $ median : num 40000 37000 45000 35000 62000 44700 45000 57000 36000 32200 ...
## $ p75th : num 50000 40000 60000 45000 109000 50000 58000 67000 46000 47100 ...
## $ perc_men : num 0.0891 0.4846 0.4058 0.3479 0.5821 ...
## $ perc_employed : num 0.912 0.798 0.787 0.847 0.852 ...
## $ perc_employed_fulltime : num 0.921 0.711 0.883 0.937 0.809 ...
## $ perc_employed_parttime : num 0.177 0.362 0.339 0.167 0.402 ...
## $ perc_employed_fulltime_yearround: num 0.77 0.709 0.774 0.653 0.685 ...
## $ perc_unemployed : num 0.0885 0.2019 0.2128 0.1534 0.1484 ...
## $ perc_college_jobs : num 0.67 0.387 0.729 0.246 0.587 ...
## $ perc_non_college_jobs : num 0.182 0.516 0.176 0.411 0.386 ...
## $ perc_low_wage_jobs : num 0.0554 0.2156 0.0301 0.0432 0.118 ...
The variable types have some problems:
- major: should be converted into a category;
- major_category: should be converted into a category.
Also, some variables are not helpful for the analysis:
- rank: there is no information about how this rank was calculated;
- major_code: the major code is a primary key, so there is no reason to use it in a deeper analysis.
To confirm the presence of NA observations, categorical variables stored as
characters, and other problems, let's print the
summary().
## rank major_code major major_category
## Min. : 1 Min. :1100 Length:173 Length:173
## 1st Qu.: 44 1st Qu.:2403 Class :character Class :character
## Median : 87 Median :3608 Mode :character Mode :character
## Mean : 87 Mean :3880
## 3rd Qu.:130 3rd Qu.:5503
## Max. :173 Max. :6403
##
## total sample_size perc_women p25th
## Min. : 124 Min. : 2.0 Min. :0.0000 Min. :18500
## 1st Qu.: 4361 1st Qu.: 39.0 1st Qu.:0.3397 1st Qu.:24000
## Median : 15058 Median : 130.0 Median :0.5357 Median :27000
## Mean : 39168 Mean : 356.1 Mean :0.5226 Mean :29501
## 3rd Qu.: 38844 3rd Qu.: 338.0 3rd Qu.:0.7020 3rd Qu.:33000
## Max. :393735 Max. :4212.0 Max. :0.9690 Max. :95000
##
## median p75th perc_men perc_employed
## Min. : 22000 Min. : 22000 Min. :0.03105 Min. :0.0000
## 1st Qu.: 33000 1st Qu.: 42000 1st Qu.:0.29798 1st Qu.:0.7477
## Median : 36000 Median : 47000 Median :0.46429 Median :0.8028
## Mean : 40151 Mean : 51494 Mean :0.47745 Mean :0.7886
## 3rd Qu.: 45000 3rd Qu.: 60000 3rd Qu.:0.66033 3rd Qu.:0.8410
## Max. :110000 Max. :125000 Max. :1.00000 Max. :0.9562
##
## perc_employed_fulltime perc_employed_parttime perc_employed_fulltime_yearround
## Min. :0.5743 Min. :0.0000 Min. :0.5857
## 1st Qu.:0.7741 1st Qu.:0.2090 1st Qu.:0.7009
## Median :0.8319 Median :0.2862 Median :0.7484
## Mean : Inf Mean :0.2874 Mean :0.7476
## 3rd Qu.:0.8974 3rd Qu.:0.3623 3rd Qu.:0.7896
## Max. : Inf Max. :0.5518 Max. :1.0000
## NA's :1
## perc_unemployed perc_college_jobs perc_non_college_jobs perc_low_wage_jobs
## Min. :0.04383 Min. :0.0633 Min. :0.08278 Min. :0.00000
## 1st Qu.:0.15899 1st Qu.:0.2974 1st Qu.:0.27995 1st Qu.:0.06957
## Median :0.19723 Median :0.4160 Median :0.42020 Median :0.10857
## Mean :0.21140 Mean :0.4478 Mean :0.41498 Mean :0.11481
## 3rd Qu.:0.25229 3rd Qu.:0.6170 3rd Qu.:0.52756 3rd Qu.:0.15353
## Max. :1.00000 Max. :0.8383 Max. :0.85364 Max. :0.36566
## NA's :1 NA's :1 NA's :1
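The type problem noted above (categories stored as characters) can be handled in one pass. The following is a minimal sketch in base R on a toy data frame; `toy` and its values are invented for illustration and are not rows of the college dataset.

```r
# Toy stand-in for the college data (values are illustrative only).
toy <- data.frame(
  major = c("Major A", "Major B", "Major C"),
  major_category = c("Engineering", "Business", "Health"),
  median = c(40000, 45000, 36000),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving numeric columns untouched.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

After the conversion, summary() reports level counts for the factor columns instead of only "Length" and "Class".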
The following graph shows a density plot of each numeric variable.
Unfortunately, some variables have NA values, which will
need to be cleaned. The following observations have one or
more NA, Inf, or otherwise invalid values.
Industrial And Manufacturing Engineering and
Computer And Information Systems majors contain invalid
values.
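One way to locate such rows is to test every numeric column with is.finite(), which flags NA, NaN, Inf, and -Inf alike. The sketch below uses a made-up data frame (`toy`, with hypothetical majors and values), so it only illustrates the idea, not the actual rows of the college dataset.

```r
# Toy data with one Inf and one NA (values are illustrative only).
toy <- data.frame(
  major = c("Major A", "Major B", "Major C", "Major D"),
  perc_employed_fulltime = c(0.92, Inf, 0.88, 0.81),
  perc_college_jobs = c(0.67, 0.39, NA, 0.25),
  stringsAsFactors = FALSE
)

num_cols <- vapply(toy, is.numeric, logical(1))
# TRUE for rows in which every numeric value is finite.
row_ok <- rowSums(!is.finite(as.matrix(toy[num_cols]))) == 0

bad_majors <- toy$major[!row_ok]
bad_majors

# is.na() alone misses the Inf value, so it counts only one problem cell here.
sum(is.na(toy))
```

Note that a check based only on is.na() does not detect Inf, which is why the sketch relies on is.finite().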
There is no practical way to plot a readable graph for major because
this variable has 173 categories. For this reason, I will not
plot it.
As expected, the major variable has 173 unique values,
meaning each row corresponds to a unique major. However, I
have not inspected each major name, so I cannot rule out
typos or the same major written in different notations.
The major_category variable has 16 categories. Table 1
summarizes the majors in each category by number of majors, total, and
sample_size.
| major_category | number_major | total | sample_size |
|---|---|---|---|
| Engineering | 29 | 537583 | 4926 |
| Education | 16 | 559129 | 4742 |
| Humanities & Liberal Arts | 15 | 713468 | 5340 |
| Biology & Life Science | 14 | 453862 | 2317 |
| Business | 13 | 1302376 | 15505 |
| Health | 12 | 463230 | 3914 |
| Computers & Mathematics | 11 | 299008 | 2860 |
| Agriculture & Natural Resources | 10 | 79981 | 1104 |
| Physical Sciences | 10 | 185479 | 1137 |
| Psychology & Social Work | 9 | 481007 | 3180 |
| Social Science | 9 | 529966 | 4581 |
| Arts | 8 | 357130 | 3260 |
| Industrial Arts & Consumer Services | 7 | 229792 | 2165 |
| Law & Public Policy | 5 | 179107 | 1935 |
| Communications & Journalism | 4 | 392601 | 4508 |
| Interdisciplinary | 1 | 12296 | 128 |
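A table of this shape can be produced by grouping on major_category. The sketch below uses base R aggregate() on a toy data frame; it is an assumption about one possible way to build Table 1, not the code used in the original report, and the numbers are invented.

```r
# Toy data: four majors in three categories (values are illustrative only).
toy <- data.frame(
  major_category = c("Engineering", "Engineering", "Business", "Interdisciplinary"),
  total = c(100, 200, 1000, 50),
  sample_size = c(10, 20, 90, 5),
  stringsAsFactors = FALSE
)

# Sum total and sample_size within each category.
tab <- aggregate(cbind(total, sample_size) ~ major_category, data = toy, FUN = sum)

# Count how many majors (rows) fall into each category.
counts <- table(toy$major_category)
tab$number_major <- as.integer(counts[as.character(tab$major_category)])

# Sort as in Table 1: categories with the most majors first.
tab <- tab[order(-tab$number_major),
           c("major_category", "number_major", "total", "sample_size")]
tab
```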
Highlights:
- Interdisciplinary corresponds to 0.6% of all majors. In absolute terms, this category has only one major (Multi/Interdisciplinary Studies);
- Given the small sample_size and total of the Interdisciplinary category, it is convenient to remove it;
- Engineering is the major_category with the largest number of majors, and;
- Business is the major_category with the largest number of students.
I will show the scatter plot with histogram and correlation between each pair of variables.
Highlights:
- total and sample_size are highly correlated because the more people hold a given major, the larger the sample_size. It is necessary to drop one of the two;
- perc_women and perc_men have a perfect (negative) correlation, which is expected because they are complementary;
- perc_employed and perc_unemployed also have a perfect (negative) correlation, which is expected because they are complementary;
- perc_employed_fulltime and perc_employed_parttime are highly correlated. Given that you are employed, you only have two options, full-time or part-time, so those variables are also complementary;
- perc_college_jobs and perc_non_college_jobs are highly correlated. Given that you are employed, there are only two job types, college or non-college, so those variables are also complementary, and;
- perc_low_wage_jobs is positively correlated with perc_non_college_jobs and negatively correlated with perc_college_jobs, because college jobs are expected to pay more than non-college jobs.
Based on section 4, Exploratory Data Analysis, there are several variables that I cannot use in the Model Selection because they are highly correlated with others or carry no useful information. For this reason, I will drop the following variables:
- rank;
- major_code;
- major;
- sample_size;
- perc_men;
- perc_unemployed;
- perc_employed_fulltime;
- perc_college_jobs;
- p25th, and;
- p75th.
The figure below shows the variable relationship.
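The complementarity argument can be checked numerically: two shares that sum to one have a correlation of exactly -1, so one of them is redundant. The sketch below simulates data for this purpose; the variable names mirror the dataset, but the values, the 0.9 cut-off, and the assumed link between total and sample_size are all hypothetical.

```r
set.seed(1)
n <- 50
perc_women <- runif(n)

toy <- data.frame(
  perc_women = perc_women,
  perc_men = 1 - perc_women,                 # exact complement
  total = round(runif(n, 1000, 50000))
)
# Assumption for illustration: the sample grows roughly in proportion to total.
toy$sample_size <- pmax(1, round(toy$total * 0.01 + rnorm(n, sd = 20)))

cors <- cor(toy)
round(cors, 2)

# Pairs whose absolute correlation exceeds the (hypothetical) 0.9 cut-off.
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]]
)
```

Each flagged pair is a candidate for dropping one of its two members before model selection.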
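To convert the college dataset into a tidy dataset, it
is mandatory to:
- Keep the variables major_category, total, perc_women, median, perc_employed, perc_employed_parttime, perc_employed_fulltime_yearround, perc_non_college_jobs, and perc_low_wage_jobs;
- Remove the NA observations, and;
- Remove the Interdisciplinary category.
The tidy dataset has 170 observations. The dim() output below reports 11 columns rather than the 9 variables listed above, which suggests that two further columns (possibly p25th and p75th, which are still compared with median in the next step) were kept at this stage.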
# Checking the tidy dataset dimensions.
dim(df_tidy)
## [1] 170 11
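The cleaning steps can be sketched as follows. This is a base R illustration on a toy data frame with invented values, not the original pipeline (which presumably used tidyverse verbs); `toy`, `keep`, and `tidy` are hypothetical names.

```r
# Toy data: five rows, one NA and one Interdisciplinary row (illustrative values).
toy <- data.frame(
  rank = 1:5,
  major_category = c("Engineering", "Business", "Interdisciplinary", "Health", "Business"),
  median = c(40000, 45000, 30000, NA, 38000),
  perc_low_wage_jobs = c(0.05, 0.10, 0.20, 0.12, 0.08),
  stringsAsFactors = TRUE
)

keep <- c("major_category", "median", "perc_low_wage_jobs")
tidy <- toy[, keep]                                          # 1. keep the selected variables
tidy <- tidy[complete.cases(tidy), ]                         # 2. drop rows containing NA
tidy <- tidy[tidy$major_category != "Interdisciplinary", ]   # 3. drop the category
tidy$major_category <- droplevels(tidy$major_category)       # remove the now-empty levels

dim(tidy)
```

Dropping the unused factor levels matters later, because lm() would otherwise still carry empty levels in the dummy coding.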
I will count the NA values using the is.na() function to
ensure I have eliminated them.
# Testing
sum(is.na(df_tidy))
## [1] 0
It is zero, which means there are no NA values.
Whichever of the three candidate dependent variables is chosen, the options are highly correlated with one another. The figure below shows the correlation matrix between the 25th, 50th, and 75th percentiles.
Highlight:
- p25th, median, and p75th are strongly positively correlated.
Using any of them as the dependent variable should lead to similar outcomes.
According to the Bureau of Labor Statistics, the income gap between men and women is around 62 USD per week. Let's divide the share of women in a given major category into 4 levels:
- perc_women below 25%;
- perc_women between 25% and 50%;
- perc_women between 50% and 75%, and;
- perc_women above 75%.
The above graph shows no difference in yearly income across the levels of women's share, which is a bit counter-intuitive given the well-known gap between men's and women's wages. Moreover, all density curves are nearly identical or differ only slightly.
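The four levels can be built with cut(). The sketch below is an illustration on made-up shares; how values that fall exactly on a boundary (25%, 50%, 75%) are assigned is my assumption, since the text does not say which side they belong to.

```r
# Illustrative shares of women (not taken from the dataset).
perc_women <- c(0.10, 0.25, 0.40, 0.60, 0.75, 0.90)

women_level <- cut(
  perc_women,
  breaks = c(0, 0.25, 0.50, 0.75, 1),
  labels = c("below 25%", "25% to 50%", "50% to 75%", "above 75%"),
  right = FALSE,          # intervals are [lower, upper): a boundary goes to the upper bin
  include.lowest = TRUE   # closes the last interval so that 1.00 is included
)

table(women_level)
```

The resulting factor can then be used as the grouping variable for the density curves of median.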
Since income is directly related to how much time you spend
working, perc_employed_fulltime_yearround should play a
key role. Also, people who do not work in college jobs
(perc_non_college_jobs) are expected to earn lower wages, which is
also related to perc_low_wage_jobs. Based on section 5.3,
Gender Income Gap, I will drop the gender variable
(perc_women).
\[median = \beta_0 + \beta_1 \cdot major\_category + \beta_2 \cdot perc\_employed\_fulltime\_yearround + \beta_3 \cdot perc\_non\_college\_jobs + \beta_4 \cdot perc\_low\_wage\_jobs\]
## Estimate Std. Error t value
## major_categoryAgriculture & Natural Resources 13230.9994 10328.471 1.2810
## major_categoryArts 5867.2472 11041.329 0.5314
## major_categoryBiology & Life Science 13137.8395 10212.755 1.2864
## major_categoryBusiness 19560.0748 9904.361 1.9749
## major_categoryCommunications & Journalism 10454.2052 11535.425 0.9063
## major_categoryComputers & Mathematics 2211.8554 10943.045 0.2021
## major_categoryEducation 7240.4506 10223.697 0.7082
## major_categoryEngineering 9661.4590 10176.620 0.9494
## major_categoryHealth 8751.5405 10539.399 0.8304
## major_categoryHumanities & Liberal Arts 3793.9222 10354.490 0.3664
## major_categoryIndustrial Arts & Consumer Services 8252.8659 11143.581 0.7406
## major_categoryLaw & Public Policy 7279.2157 10909.618 0.6672
## major_categoryPhysical Sciences 9430.2765 10511.430 0.8971
## major_categoryPsychology & Social Work 7813.1995 10779.438 0.7248
## major_categorySocial Science 8027.8700 10731.857 0.7480
## perc_employed_fulltime_yearround 37822.5392 13212.248 2.8627
## perc_non_college_jobs 6962.2952 7295.501 0.9543
## perc_low_wage_jobs -488.2877 17594.752 -0.0278
## Pr(>|t|)
## major_categoryAgriculture & Natural Resources 0.2021
## major_categoryArts 0.5959
## major_categoryBiology & Life Science 0.2003
## major_categoryBusiness 0.0501
## major_categoryCommunications & Journalism 0.3662
## major_categoryComputers & Mathematics 0.8401
## major_categoryEducation 0.4799
## major_categoryEngineering 0.3439
## major_categoryHealth 0.4076
## major_categoryHumanities & Liberal Arts 0.7146
## major_categoryIndustrial Arts & Consumer Services 0.4601
## major_categoryLaw & Public Policy 0.5056
## major_categoryPhysical Sciences 0.3711
## major_categoryPsychology & Social Work 0.4697
## major_categorySocial Science 0.4556
## perc_employed_fulltime_yearround 0.0048
## perc_non_college_jobs 0.3414
## perc_low_wage_jobs 0.9779
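From the above results of the lm() function,
perc_employed_fulltime_yearround is the only term that is significant
at the 5% level (p = 0.0048). There is no statistical evidence that
perc_non_college_jobs, perc_low_wage_jobs, or any level of
the dummy variable major_category affects income: all of their
p-values are above 0.05 (the smallest, for Business, is 0.0501), so they fail to reject the \(H_0\) hypothesis.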
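The per-coefficient t-tests above look at each category level separately. A complementary check, which is not part of the original analysis, is a partial F-test that compares the model with and without major_category as a whole. The sketch below runs it on simulated data in which income depends only on the full-time share; the data, the three categories, and the model names `reduced` and `full` are all hypothetical.

```r
set.seed(42)
n_per <- 20

# Simulated data: three categories, income driven only by the full-time share.
toy <- data.frame(
  major_category = factor(rep(c("A", "B", "C"), each = n_per)),
  perc_employed_fulltime_yearround = runif(3 * n_per, 0.6, 0.9)
)
toy$median <- 10000 + 40000 * toy$perc_employed_fulltime_yearround +
  rnorm(3 * n_per, sd = 5000)

# Nested models: without and with the categorical term.
reduced <- lm(median ~ perc_employed_fulltime_yearround, data = toy)
full <- lm(median ~ major_category + perc_employed_fulltime_yearround, data = toy)

# Partial F-test: does adding major_category improve the fit?
cmp <- anova(reduced, full)
cmp
```

A large p-value in the second row of the comparison would agree with the conclusion that the category as a whole adds no significant explanatory power.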
The answer to the posed question is: There is no significant association between college major category and income.