Brief Analysis about Manual and Automatic Transmissions

Synopsis

Based on an analysis of 173 observations and 19 variables, there is insufficient evidence to affirm that major category has a significant association with income.

1. Introduction

This optional quiz assignment asks for an analysis of the relationship between income and major category. The analysis uses the college dataset from the collegeIncome package.

2. Objectives

This Practice Quiz aims to answer the following question:

Based on your analysis, would you conclude that there is a significant association between college major category and income?

  • Yes
  • No

3. Requirements

The following packages are required to run this experiment.

# Loading packages
library(collegeIncome)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(explore)
library(kableExtra)
library(DT)
library(PerformanceAnalytics)
library(DiagrammeR)
library(GGally)

3.1. Reproducibility

To reproduce this analysis, fork the experiment repository hosted on GitHub.

3.2. Loading Data

Following the practice quiz instructions, I have used the college dataset from the collegeIncome package.

# Loading college data to environment.
data("college")

# Creating a copy of college data.
df_college <- college

3.3. Codebook

From the assignment instructions:

  • rank: Rank by median earnings
  • major_code: Major code
  • major: Major description
  • major_category: Category of major
  • total: Total number of people with major
  • sample_size: Sample size of full-time, year-round individuals used for income/earnings estimates: p25th, median, p75th
  • p25th: 25th percentile of earnings
  • median: Median earnings of full-time, year-round workers
  • p75th: 75th percentile of earnings
  • perc_men: % men with major (out of total)
  • perc_women: % women with major (out of total)
  • perc_employed: % employed (out of total)
  • perc_employed_fulltime: % employed 35 hours or more (out of employed)
  • perc_employed_parttime: % employed less than 35 hours (out of employed)
  • perc_employed_fulltime_yearround: % employed at least 50 weeks and at least 35 hours (out of employed and full-time)
  • perc_unemployed: % unemployed (out of employed)
  • perc_college_jobs: % with job requiring a college degree (out of employed)
  • perc_non_college_jobs: % with job not requiring a college degree (out of employed)
  • perc_low_wage_jobs: % in low-wage service jobs (out of total)

4. Exploratory Analysis

4.1. Dataset Dimensions

The college dataset has 173 observations and 19 variables.

# Checking the number of observations and variables.
dim(df_college)
## [1] 173  19

4.3. Tail

The last three rows of the dataset:

4.4. Structure

Let’s check the variables’ types.

## 'data.frame':    173 obs. of  19 variables:
##  $ rank                            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ major_code                      : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ major                           : chr  "Petroleum Engineering" "Mining And Mineral Engineering" "Metallurgical Engineering" "Naval Architecture And Marine Engineering" ...
##  $ major_category                  : chr  "Engineering" "Engineering" "Engineering" "Engineering" ...
##  $ total                           : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ sample_size                     : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ perc_women                      : num  0.911 0.515 0.594 0.652 0.418 ...
##  $ p25th                           : num  25000 26000 26700 26000 31500 23000 32500 37900 29200 23000 ...
##  $ median                          : num  40000 37000 45000 35000 62000 44700 45000 57000 36000 32200 ...
##  $ p75th                           : num  50000 40000 60000 45000 109000 50000 58000 67000 46000 47100 ...
##  $ perc_men                        : num  0.0891 0.4846 0.4058 0.3479 0.5821 ...
##  $ perc_employed                   : num  0.912 0.798 0.787 0.847 0.852 ...
##  $ perc_employed_fulltime          : num  0.921 0.711 0.883 0.937 0.809 ...
##  $ perc_employed_parttime          : num  0.177 0.362 0.339 0.167 0.402 ...
##  $ perc_employed_fulltime_yearround: num  0.77 0.709 0.774 0.653 0.685 ...
##  $ perc_unemployed                 : num  0.0885 0.2019 0.2128 0.1534 0.1484 ...
##  $ perc_college_jobs               : num  0.67 0.387 0.729 0.246 0.587 ...
##  $ perc_non_college_jobs           : num  0.182 0.516 0.176 0.411 0.386 ...
##  $ perc_low_wage_jobs              : num  0.0554 0.2156 0.0301 0.0432 0.118 ...

The structure reveals two problems with variable types:

  • major: should be converted into a factor;
  • major_category: should be converted into a factor.

Also, some variables are not helpful for the analysis:

  • rank: there is no information about how this rank was calculated;
  • major_code: it is an identifier (primary key), so there is no reason to use it in a deeper analysis.
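Both fixes can be sketched in base R. The data frame `toy` below is a made-up stand-in for the college data (its values are invented), so this only illustrates the mechanics:

```r
# Toy stand-in for the college data (values are invented for illustration).
toy <- data.frame(
  rank           = 1:4,
  major_code     = c(2419L, 2416L, 6202L, 5001L),
  major          = c("Major A", "Major B", "Major C", "Major D"),
  major_category = c("Engineering", "Engineering", "Business", "Physical Sciences"),
  median         = c(40000, 37000, 45000, 57000),
  stringsAsFactors = FALSE
)

# Convert the character columns into factors.
toy$major          <- factor(toy$major)
toy$major_category <- factor(toy$major_category)

# Drop the columns that are not useful for the analysis.
toy <- toy[, !(names(toy) %in% c("rank", "major_code"))]

str(toy)
```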

4.5. Summary

The summary() output helps confirm the presence of NA observations, categorical variables stored as characters, and other problems.

##       rank       major_code      major           major_category    
##  Min.   :  1   Min.   :1100   Length:173         Length:173        
##  1st Qu.: 44   1st Qu.:2403   Class :character   Class :character  
##  Median : 87   Median :3608   Mode  :character   Mode  :character  
##  Mean   : 87   Mean   :3880                                        
##  3rd Qu.:130   3rd Qu.:5503                                        
##  Max.   :173   Max.   :6403                                        
##                                                                    
##      total         sample_size       perc_women         p25th      
##  Min.   :   124   Min.   :   2.0   Min.   :0.0000   Min.   :18500  
##  1st Qu.:  4361   1st Qu.:  39.0   1st Qu.:0.3397   1st Qu.:24000  
##  Median : 15058   Median : 130.0   Median :0.5357   Median :27000  
##  Mean   : 39168   Mean   : 356.1   Mean   :0.5226   Mean   :29501  
##  3rd Qu.: 38844   3rd Qu.: 338.0   3rd Qu.:0.7020   3rd Qu.:33000  
##  Max.   :393735   Max.   :4212.0   Max.   :0.9690   Max.   :95000  
##                                                                    
##      median           p75th           perc_men       perc_employed   
##  Min.   : 22000   Min.   : 22000   Min.   :0.03105   Min.   :0.0000  
##  1st Qu.: 33000   1st Qu.: 42000   1st Qu.:0.29798   1st Qu.:0.7477  
##  Median : 36000   Median : 47000   Median :0.46429   Median :0.8028  
##  Mean   : 40151   Mean   : 51494   Mean   :0.47745   Mean   :0.7886  
##  3rd Qu.: 45000   3rd Qu.: 60000   3rd Qu.:0.66033   3rd Qu.:0.8410  
##  Max.   :110000   Max.   :125000   Max.   :1.00000   Max.   :0.9562  
##                                                                      
##  perc_employed_fulltime perc_employed_parttime perc_employed_fulltime_yearround
##  Min.   :0.5743         Min.   :0.0000         Min.   :0.5857                  
##  1st Qu.:0.7741         1st Qu.:0.2090         1st Qu.:0.7009                  
##  Median :0.8319         Median :0.2862         Median :0.7484                  
##  Mean   :   Inf         Mean   :0.2874         Mean   :0.7476                  
##  3rd Qu.:0.8974         3rd Qu.:0.3623         3rd Qu.:0.7896                  
##  Max.   :   Inf         Max.   :0.5518         Max.   :1.0000                  
##                         NA's   :1                                              
##  perc_unemployed   perc_college_jobs perc_non_college_jobs perc_low_wage_jobs
##  Min.   :0.04383   Min.   :0.0633    Min.   :0.08278       Min.   :0.00000   
##  1st Qu.:0.15899   1st Qu.:0.2974    1st Qu.:0.27995       1st Qu.:0.06957   
##  Median :0.19723   Median :0.4160    Median :0.42020       Median :0.10857   
##  Mean   :0.21140   Mean   :0.4478    Mean   :0.41498       Mean   :0.11481   
##  3rd Qu.:0.25229   3rd Qu.:0.6170    3rd Qu.:0.52756       3rd Qu.:0.15353   
##  Max.   :1.00000   Max.   :0.8383    Max.   :0.85364       Max.   :0.36566   
##                    NA's   :1         NA's   :1             NA's   :1

4.6. Data Visualization

4.6.1. Numeric Variables

The following graph shows a density plot of each numeric variable.

Unfortunately, some variables have NA or Inf values and need cleaning. The following observations have one or more NA, Inf, or otherwise invalid values.

The Industrial And Manufacturing Engineering and Computer And Information Systems majors contain invalid values.
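One way to locate such rows is to test every numeric column with is.finite(). The sketch below uses an invented data frame `toy` with hypothetical major names; it is not the code that produced the result above:

```r
# Toy data with one Inf and one NA (invented values).
toy <- data.frame(
  major                  = c("Major A", "Major B", "Major C", "Major D"),
  perc_employed_fulltime = c(0.92, Inf, 0.88, 0.81),
  perc_employed_parttime = c(0.18, 0.36, NA, 0.40),
  stringsAsFactors = FALSE
)

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so it flags every kind of
# invalid numeric value in a single pass.
is_num  <- vapply(toy, is.numeric, logical(1))
bad_row <- rowSums(!sapply(toy[is_num], is.finite)) > 0

# Majors with at least one invalid value.
toy$major[bad_row]
```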

4.6.2. Categorical Variables

With 173 distinct values, major cannot be plotted legibly, so I will not plot it.

As expected, the major variable has 173 unique values, meaning each row corresponds to a unique major. However, I have not inspected each major name, so I cannot rule out typos or the same major written in different ways.

The major_category variable has 16 categories. Table 1 summarizes the number of majors, total, and sample_size for each category.

major_category                       number_major    total  sample_size
Engineering                                    29   537583         4926
Education                                      16   559129         4742
Humanities & Liberal Arts                      15   713468         5340
Biology & Life Science                         14   453862         2317
Business                                       13  1302376        15505
Health                                         12   463230         3914
Computers & Mathematics                        11   299008         2860
Agriculture & Natural Resources                10    79981         1104
Physical Sciences                              10   185479         1137
Psychology & Social Work                        9   481007         3180
Social Science                                  9   529966         4581
Arts                                            8   357130         3260
Industrial Arts & Consumer Services             7   229792         2165
Law & Public Policy                             5   179107         1935
Communications & Journalism                     4   392601         4508
Interdisciplinary                               1    12296          128

Highlights:

  • Interdisciplinary accounts for 0.6% of all majors (1 of 173): it contains a single major (Multi/Interdisciplinary Studies);
  • Given the small sample_size and total of the Interdisciplinary category, it is convenient to remove it;
  • Engineering is the major_category with the largest number of majors, and;
  • Business is the major_category with the largest number of students.
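A per-category summary like Table 1 can be built with aggregate() and table(). This is a base R sketch on an invented data frame `toy`, not necessarily how the table above was produced:

```r
# Toy data: four majors in three categories (invented values).
toy <- data.frame(
  major_category = c("Engineering", "Engineering", "Business", "Interdisciplinary"),
  total          = c(2339, 756, 3777, 12296),
  sample_size    = c(36, 7, 51, 128),
  stringsAsFactors = FALSE
)

# Sum total and sample_size within each category.
tab <- aggregate(cbind(total, sample_size) ~ major_category, data = toy, FUN = sum)

# Count the majors (rows) per category and attach the counts.
counts           <- table(toy$major_category)
tab$number_major <- as.vector(counts[as.character(tab$major_category)])

# Sort by number of majors, largest first.
tab <- tab[order(-tab$number_major), ]
tab
```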

4.6.3. Variables Correlation

I will show a scatter plot matrix with the histogram of each variable and the correlation between each pair of variables.

Highlights:

  • total and sample_size are highly correlated because the more people hold a major, the larger the sample_size. One of the two must be dropped;
  • perc_women and perc_men are perfectly negatively correlated, which is expected because they are complementary;
  • perc_employed and perc_unemployed are also perfectly negatively correlated, for the same reason;
  • perc_employed_fulltime and perc_employed_parttime are highly correlated. An employed person works either full-time or part-time, so those variables are also complementary;
  • perc_college_jobs and perc_non_college_jobs are highly correlated. An employed person holds either a college or a non-college job, so those variables are also complementary, and;
  • perc_low_wage_jobs is positively correlated with perc_non_college_jobs and negatively correlated with perc_college_jobs, as expected if college jobs pay more than non-college jobs.
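The complementary pairs are redundant for a simple reason: if y = 1 - x, then cov(x, y) = -var(x), so the Pearson correlation is exactly -1. A quick check on invented shares:

```r
# Invented shares for five majors.
perc_women <- c(0.91, 0.52, 0.59, 0.65, 0.42)
perc_men   <- 1 - perc_women

# The correlation of x and 1 - x is -1 (up to floating-point rounding),
# so one of the two columns carries no extra information.
cor(perc_women, perc_men)
```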

5. Model Selection

Based on section 4, Exploratory Analysis, several variables cannot be used in the model, either because they are highly correlated with another variable or because they carry no useful information. For this reason, I will drop the following variables:

  • rank;
  • major_code;
  • major;
  • sample_size;
  • perc_men;
  • perc_unemployed;
  • perc_employed_fulltime;
  • perc_college_jobs;
  • p25th, and;
  • p75th.

The figure below shows the variable relationship.

5.1. Data Manipulation

To convert the college dataset into a tidy dataset, it is necessary to:

  • Subset the college dataset by selecting the variables: major_category, total, perc_women, median, perc_employed, perc_employed_parttime, perc_employed_fulltime_yearround, perc_non_college_jobs, and perc_low_wage_jobs;
  • Convert categorical variables stored as characters into factors;
  • Remove NA observations, and;
  • Remove non-representative categories.
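The steps above can be sketched in base R as follows. The data frame `toy`, the result `toy_tidy`, and the reduced column set are assumptions made for illustration; the actual code that builds the tidy dataset is not shown in this report:

```r
# Toy stand-in for the college data (invented values).
toy <- data.frame(
  major_category        = c("Engineering", "Business", "Business",
                            "Interdisciplinary", "Engineering"),
  median                = c(40000, 45000, 36000, 35000, 62000),
  perc_non_college_jobs = c(0.18, 0.52, NA, 0.41, 0.39),
  rank                  = 1:5,
  stringsAsFactors = FALSE
)

# 1. Keep only the modelling variables.
keep     <- c("major_category", "median", "perc_non_college_jobs")
toy_tidy <- toy[, keep]

# 2. Drop rows with missing values.
#    complete.cases() catches NA/NaN only; Inf needs a separate is.finite() check.
toy_tidy <- toy_tidy[complete.cases(toy_tidy), ]

# 3. Drop the non-representative category.
toy_tidy <- toy_tidy[toy_tidy$major_category != "Interdisciplinary", ]

# 4. Convert the character column into a factor (done last so that the
#    removed category does not linger as an empty level).
toy_tidy$major_category <- factor(toy_tidy$major_category)

dim(toy_tidy)
```

Converting to a factor after filtering is a design choice: doing it earlier would require droplevels() to get rid of the empty Interdisciplinary level.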

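The tidy dataset has 170 observations. dim() reports 11 columns rather than the 9 selected above; the two extra columns are presumably p25th and p75th, which section 5.2 still uses.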

# Checking the tidy dataset dimensions.
dim(df_tidy)
## [1] 170  11

I will count the NA values using the is.na() function to ensure I have eliminated them.

# Testing
sum(is.na(df_tidy))
## [1] 0

It is zero, which means there are no NA values.
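Since the summary in section 4.5 also shows Inf in perc_employed_fulltime, it is worth remembering that is.na() does not flag infinite values. A small illustration on an invented vector `x`:

```r
# Invented vector with one Inf and one NA.
x <- c(0.92, Inf, NA, 0.81)

sum(is.na(x))       # counts NA (and NaN) only
sum(!is.finite(x))  # counts NA, NaN, Inf and -Inf
```

Here the first count is 1 and the second is 2, so a check based on is.finite() is the stricter of the two for numeric columns.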

5.2. 25th, 50th or 75th Percentile

Whichever percentile is used as the dependent variable, the three options are highly correlated. The figure below shows the correlation matrix between the 25th, 50th, and 75th percentiles.

Highlight:

  • p25th, median, and p75th are strongly positively correlated.

Using any of them as the dependent variable should give similar results.

5.3. Gender Income Gap

According to the Bureau of Labor Statistics, the income gap between men and women is around 62 USD per week. Let’s divide the share of women in a given major into 4 levels:

  • Category A: perc_women below 25%;
  • Category B: perc_women between 25% and 50%;
  • Category C: perc_women between 50% and 75%, and;
  • Category D: perc_women above 75%.
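This binning can be done with cut(). The sketch below uses invented shares; how values that fall exactly on a boundary (25%, 50%, 75%) are assigned is an assumption, since the text does not specify it:

```r
# Invented shares of women for five majors.
perc_women <- c(0.10, 0.30, 0.50, 0.60, 0.90)

# right = FALSE gives the intervals [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1];
# include.lowest = TRUE closes the last interval so that 1.0 is kept.
women_level <- cut(
  perc_women,
  breaks = c(0, 0.25, 0.50, 0.75, 1),
  labels = c("A", "B", "C", "D"),
  right = FALSE,
  include.lowest = TRUE
)

table(women_level)
```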

The above graph shows no difference in yearly income across the levels of women’s share, which is somewhat counter-intuitive given the well-known gap between men’s and women’s wages. All density curves are nearly identical, with only minor differences.

5.4. Linear Regression

Since income is directly related to how much time one spends working, perc_employed_fulltime_yearround should play a key role. Also, people who do not work in college jobs (perc_non_college_jobs) should have lower wages, which is also related to perc_low_wage_jobs. Based on section 5.3, Gender Income Gap, I will drop the gender variable (perc_women).

\[median = \beta_0 + \beta_1 \cdot major\_category + \beta_2 \cdot perc\_employed\_fulltime\_yearround + \beta_3 \cdot perc\_non\_college\_jobs + \beta_4 \cdot perc\_low\_wage\_jobs\]

##                                                     Estimate Std. Error t value
## major_categoryAgriculture & Natural Resources     13230.9994  10328.471  1.2810
## major_categoryArts                                 5867.2472  11041.329  0.5314
## major_categoryBiology & Life Science              13137.8395  10212.755  1.2864
## major_categoryBusiness                            19560.0748   9904.361  1.9749
## major_categoryCommunications & Journalism         10454.2052  11535.425  0.9063
## major_categoryComputers & Mathematics              2211.8554  10943.045  0.2021
## major_categoryEducation                            7240.4506  10223.697  0.7082
## major_categoryEngineering                          9661.4590  10176.620  0.9494
## major_categoryHealth                               8751.5405  10539.399  0.8304
## major_categoryHumanities & Liberal Arts            3793.9222  10354.490  0.3664
## major_categoryIndustrial Arts & Consumer Services  8252.8659  11143.581  0.7406
## major_categoryLaw & Public Policy                  7279.2157  10909.618  0.6672
## major_categoryPhysical Sciences                    9430.2765  10511.430  0.8971
## major_categoryPsychology & Social Work             7813.1995  10779.438  0.7248
## major_categorySocial Science                       8027.8700  10731.857  0.7480
## perc_employed_fulltime_yearround                  37822.5392  13212.248  2.8627
## perc_non_college_jobs                              6962.2952   7295.501  0.9543
## perc_low_wage_jobs                                 -488.2877  17594.752 -0.0278
##                                                   Pr(>|t|)
## major_categoryAgriculture & Natural Resources       0.2021
## major_categoryArts                                  0.5959
## major_categoryBiology & Life Science                0.2003
## major_categoryBusiness                              0.0501
## major_categoryCommunications & Journalism           0.3662
## major_categoryComputers & Mathematics               0.8401
## major_categoryEducation                             0.4799
## major_categoryEngineering                           0.3439
## major_categoryHealth                                0.4076
## major_categoryHumanities & Liberal Arts             0.7146
## major_categoryIndustrial Arts & Consumer Services   0.4601
## major_categoryLaw & Public Policy                   0.5056
## major_categoryPhysical Sciences                     0.3711
## major_categoryPsychology & Social Work              0.4697
## major_categorySocial Science                        0.4556
## perc_employed_fulltime_yearround                    0.0048
## perc_non_college_jobs                               0.3414
## perc_low_wage_jobs                                  0.9779

From the above results of the lm() function, there is no statistical evidence that perc_non_college_jobs, perc_low_wage_jobs, or any level of the dummy variable major_category affects income: none of their p-values rejects the \(H_0\) hypothesis at the 5% level (Business comes closest, with p = 0.0501). The only significant predictor is perc_employed_fulltime_yearround (p = 0.0048).
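The table reports one t-test per dummy. Whether the categories matter jointly is commonly checked with a single partial F-test that compares the model with and without major_category. The sketch below shows the mechanics on simulated data; `sim`, `fulltime`, and the coefficients used to generate `median` are invented, so it says nothing about the result for the real college data:

```r
set.seed(1)

# Simulated stand-in data: three categories, income generated without any
# dependence on the category.
n   <- 60
sim <- data.frame(
  major_category = factor(rep(c("A", "B", "C"), each = n / 3)),
  fulltime       = runif(n, 0.6, 0.9)
)
sim$median <- 20000 + 30000 * sim$fulltime + rnorm(n, sd = 5000)

# Reduced model without the factor, full model with it.
fit_reduced <- lm(median ~ fulltime, data = sim)
fit_full    <- lm(median ~ major_category + fulltime, data = sim)

# Partial F-test: do the category dummies jointly improve the fit?
f_test <- anova(fit_reduced, fit_full)
f_test
```

The second row of the resulting table holds the F statistic and its p-value, with degrees of freedom equal to the number of categories minus one.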

6. Results

The answer to the posed question is: No, there is no significant association between college major category and income.