Problem Statement

The question is whether male and female employees earn equal incomes across occupations in the United States of America. We use an open dataset and examine descriptive statistics, boxplots and Q-Q plots to visualise the data and compare the weekly incomes of males and females across occupations. We then perform a hypothesis test, the two-sample t-test, to determine whether there is a statistically significant difference in weekly income between the two genders.

Data

We use the open Kaggle dataset U.S. Incomes by Occupation and Gender. The data were retrieved from the Bureau of Labor Statistics and contain 558 observations of 7 variables.

Occupation - The job title
All_workers - Number of male and female workers
All_weekly - Median weekly income of male and female workers (USD)
M_workers - Number of male workers
M_weekly - Median weekly income of male workers (USD)
F_workers - Number of female workers
F_weekly - Median weekly income of female workers (USD)

Read and Examine the data:

library(tidyverse)  # read_csv(), %>%, summarise(), gather()
library(car)        # qqPlot(), leveneTest()
statsdata <- read_csv("inc_occ_gender.csv")
str(statsdata, give.attr = FALSE)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 558 obs. of  7 variables:
##  $ Occupation : chr  "ALL OCCUPATIONS" "MANAGEMENT" "Chief executives" "General and operations managers" ...
##  $ All_workers: num  109080 12480 1046 823 8 ...
##  $ All_weekly : chr  "809" "1351" "2041" "1260" ...
##  $ M_workers  : num  60746 7332 763 621 5 ...
##  $ M_weekly   : chr  "895" "1486" "2251" "1347" ...
##  $ F_workers  : num  48334 5147 283 202 4 ...
##  $ F_weekly   : chr  "726" "1139" "1836" "1002" ...

On examining the structure of the data frame, we find that All_weekly, M_weekly and F_weekly are of character type, which is incorrect. These variables hold income values, so they need to be numeric (double). In the next slide we therefore convert them to numeric type.

Perform type conversions:

statsdata$All_weekly <- as.double(statsdata$All_weekly)
statsdata$M_weekly <- as.double(statsdata$M_weekly)
statsdata$F_weekly <- as.double(statsdata$F_weekly)

#Examine the data frame after type conversions.
str(statsdata, give.attr = FALSE)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 558 obs. of  7 variables:
##  $ Occupation : chr  "ALL OCCUPATIONS" "MANAGEMENT" "Chief executives" "General and operations managers" ...
##  $ All_workers: num  109080 12480 1046 823 8 ...
##  $ All_weekly : num  809 1351 2041 1260 NA ...
##  $ M_workers  : num  60746 7332 763 621 5 ...
##  $ M_weekly   : num  895 1486 2251 1347 NA ...
##  $ F_workers  : num  48334 5147 283 202 4 ...
##  $ F_weekly   : num  726 1139 1836 1002 NA ...

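The dataset contains missing values (which we will deal with later) that are stored as the string “Na”, which R does not recognise as missing. This is most likely why the income columns were read as character. The as.double() calls above have already coerced these strings to NA, as the NA entries in the str() output show; we still recode explicitly so that any remaining “Na” strings are treated as NA.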

#RECODING MISSING VALUES
statsdata[statsdata == "Na"] <- NA

#Checking the number of missing values
colSums(is.na(statsdata))

##  Occupation All_workers  All_weekly   M_workers    M_weekly   F_workers 
##           0           0         236           0         326           0 
##    F_weekly 
##         366
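The order of recoding and conversion can be illustrated with a minimal sketch on a hypothetical toy vector (the values below are made up to mimic one income column). It suggests that recoding “Na” before converting avoids the coercion warning, while both routes end with the same numeric values.

```r
# Hypothetical toy column mimicking M_weekly as read from the CSV
weekly_chr <- c("895", "Na", "2251", "Na", "1347")

# Recode the "Na" placeholder first, then convert; no coercion warning is raised
weekly_recoded <- weekly_chr
weekly_recoded[weekly_recoded == "Na"] <- NA
weekly_num <- as.double(weekly_recoded)

# Converting without recoding gives the same values, but R warns about coercion
weekly_direct <- suppressWarnings(as.double(weekly_chr))

sum(is.na(weekly_num))                # number of missing values
identical(weekly_num, weekly_direct)  # both routes agree
```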

Descriptive Statistics and Visualisation

Since our objective is to find out whether there is a statistically significant difference between the weekly incomes of male and female employees, we look at two main variables: ‘M_weekly’, the male weekly income, and ‘F_weekly’, the female weekly income. We first find the descriptive statistics of these variables.

Descriptive Statistics of `M_weekly` data:

statsdata %>% summarise( Min = min(M_weekly,na.rm = TRUE), Q1 = quantile(M_weekly,probs = .25,na.rm = TRUE), Median = median(M_weekly, na.rm = TRUE), 
                         Q3 = quantile(M_weekly,probs = .75,na.rm = TRUE), Max = max(M_weekly,na.rm = TRUE), Mean = mean(M_weekly, na.rm = TRUE), 
                         SD = sd(M_weekly, na.rm = TRUE), Interquartile = IQR(M_weekly, na.rm = TRUE), n = n(),  Missing = sum(is.na(M_weekly)))

Descriptive Statistics of `F_weekly` data:

statsdata %>% summarise(Min = min(F_weekly,na.rm = TRUE), Q1 = quantile(F_weekly,probs = .25,na.rm = TRUE), Median = median(F_weekly, na.rm = TRUE), 
                        Q3 = quantile(F_weekly,probs = .75,na.rm = TRUE), Max = max(F_weekly,na.rm = TRUE), Mean = mean(F_weekly, na.rm = TRUE), 
                        SD = sd(F_weekly, na.rm = TRUE), Interquartile = IQR(F_weekly, na.rm = TRUE), n = n(),  Missing = sum(is.na(F_weekly)))
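The two summarise() calls differ only in the column they describe. As a sketch, the same statistics could be wrapped in a small helper; `describe()` and the toy values below are hypothetical and not part of the original analysis, and base R is used so the example runs without extra packages.

```r
# Hypothetical helper: one-row summary of a numeric vector
describe <- function(x) {
  data.frame(
    Min = min(x, na.rm = TRUE),
    Q1 = unname(quantile(x, 0.25, na.rm = TRUE)),
    Median = median(x, na.rm = TRUE),
    Q3 = unname(quantile(x, 0.75, na.rm = TRUE)),
    Max = max(x, na.rm = TRUE),
    Mean = mean(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE),
    Interquartile = IQR(x, na.rm = TRUE),
    n = length(x),             # counts every row, including missing ones
    Missing = sum(is.na(x))
  )
}

# Toy values standing in for a weekly income column
toy_income <- c(726, 1139, 1836, 1002, NA)
toy_summary <- describe(toy_income)
toy_summary
```

As in the summarise() calls, n counts all rows, so the number of usable values is n minus Missing.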

From the descriptive statistics we can see that male weekly income has a higher mean, along with higher Q1 and Q3 values. Males also have a higher median weekly income of $915.5, compared with $736 for females.

These initial observations suggest that males receive a higher income than females. The previous slides revealed many missing values, however, so we deal with those first.

Dealing with NA values:

Since both the male and female weekly income variables contain missing values, we remove the affected rows to obtain an appropriate dataset for our investigation.

#REMOVING NA VALUES
statsdata <- na.omit(statsdata)
colSums(is.na(statsdata))

##  Occupation All_workers  All_weekly   M_workers    M_weekly   F_workers 
##           0           0           0           0           0           0 
##    F_weekly 
##           0
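Note that na.omit() performs listwise deletion: a row is dropped if any of its columns is missing. A minimal sketch on a hypothetical toy data frame (made-up values) contrasts this with requiring only the two income columns used in the test; the second variant is shown for comparison and is not what the analysis above does.

```r
# Hypothetical toy data frame with the same income columns as statsdata
toy <- data.frame(
  Occupation = c("A", "B", "C", "D"),
  All_weekly = c(800, NA, 950, 1200),
  M_weekly   = c(900, 1000, NA, 1300),
  F_weekly   = c(700, 850, 900, 1100)
)

# Listwise deletion: drops rows with an NA in any column
complete_all <- na.omit(toy)

# Alternative: require only the two columns compared in the t-test
complete_mf <- toy[complete.cases(toy[, c("M_weekly", "F_weekly")]), ]

nrow(complete_all)
nrow(complete_mf)
```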

Boxplot:

From the boxplots we can observe that male weekly income has a higher median. The two groups have a comparable Q1, but male employees have a noticeably higher Q3, so male weekly income also has a larger IQR.

par(mfrow = c(1,2))
male_outliers <- boxplot(statsdata$M_weekly, main="Box Plot for Male weekly income", ylab="Salary", ylim = c(0,2500))$out
female_outliers <- boxplot(statsdata$F_weekly, main="Box Plot for Female weekly income", ylab="Salary", ylim = c(0,2500))$out

Dealing with the outliers

We find 3 outliers in the female weekly income variable. Since these outliers are high female incomes, we cap them rather than delete them, because deleting them could affect our findings. Capping replaces values below Q1 - 1.5 × IQR with the 5th percentile and values above Q3 + 1.5 × IQR with the 95th percentile.

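# Cap outliers: values beyond the 1.5 * IQR fences are replaced by the 5th or 95th percentile
cap <- function(x){
    quantiles <- quantile(x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]   # below the lower fence
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]   # above the upper fence
    x
}
statsdata$F_weekly <- statsdata$F_weekly %>% cap()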
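To see what the capping rule does, here is a small worked sketch on a hypothetical toy vector (made-up values with one extreme point). The function is restated so the example runs on its own.

```r
# Same capping rule as above, restated so the example is self-contained
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < q[2] - 1.5 * IQR(x)] <- q[1]  # low outliers -> 5th percentile
  x[x > q[3] + 1.5 * IQR(x)] <- q[4]  # high outliers -> 95th percentile
  x
}

# Hypothetical toy incomes with one extreme value
toy <- c(10, 11, 11, 12, 12, 13, 14, 100)
capped <- cap(toy)

capped
```

Only the extreme value changes. In a sample this small the 95th percentile is itself pulled upward by the outlier, so the capped value remains well above the rest of the data.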

QQ-Plot:

We use qqPlot() to examine the normality of the sample distribution.

For both variables, most data points fall within the 95% CI for a normal distribution, except for a few data points that fall outside it in the left tail. Both variables appear to be left skewed. This tells us that neither male nor female weekly income can be assumed to be normally distributed. However, since each sample is much larger than 30 (n > 30), the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal, so we can proceed with the two-sample t-test even though the normality assumption is violated. In this case we can effectively ignore the non-normality of the data.

par(mfrow = c(1,2))
dat <- statsdata$M_weekly %>% qqPlot(dist="norm", main = "QQPlot for Male Weekly income", ylim = c(0,2500))
dat <- statsdata$F_weekly %>% qqPlot(dist="norm", main = "QQPlot for Female Weekly income", ylim = c(0,2500))
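The Central Limit Theorem argument can be checked with a short simulation. This is only a sketch under stated assumptions: the population is a skewed exponential distribution chosen for illustration (it is not the income data), the sample size of 142 matches the per-group size used later, and the seed, number of replications and the `skewness()` helper are arbitrary or hypothetical.

```r
set.seed(1324)  # arbitrary seed for reproducibility

n <- 142        # per-group sample size after removing missing values
reps <- 5000    # number of simulated samples (arbitrary choice)

# Hypothetical helper: sample skewness (third standardised moment)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Means of repeated samples from a skewed population (exponential, mean 1, sd 1)
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

skew_population <- skewness(rexp(reps, rate = 1))  # clearly skewed
skew_means <- skewness(sample_means)               # much closer to 0

c(population = skew_population, sample_means = skew_means)
c(sd_of_means = sd(sample_means), theoretical = 1 / sqrt(n))
```

The sample means are far less skewed than the population and their spread is close to σ/√n, which is the behaviour the t-test relies on.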

Hypothesis Testing

We use a two-sample t-test for this dataset, treating the male and female weekly incomes as independent of each other. The test has the following statistical hypotheses:

\[H_0 : \mu_1 - \mu_2 = 0\]

\[H_A : \mu_1 - \mu_2 \neq 0\]

where $\mu_1$ and $\mu_2$ refer to the population means of Male and Female weekly income respectively.

Levene’s test

We first need to examine the homogeneity of variance, i.e. to check the assumption that the variances of the two populations are equal. We use Levene’s test to test this assumption.

The statistical hypotheses for Levene’s test are:

\[H_0 : \sigma_1^2 = \sigma_2^2\]

\[H_A : \sigma_1^2 \neq \sigma_2^2\]

where $\sigma_1^2$ and $\sigma_2^2$ refer to the population variances of Male and Female weekly incomes respectively.
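As a sketch of what Levene’s test computes: with median centring (the default in car::leveneTest()), it is a one-way ANOVA on the absolute deviations of each observation from its group median. The toy data below are simulated and hypothetical, and base R is used so the example does not depend on the car package.

```r
set.seed(1)  # arbitrary seed

# Hypothetical toy data: two groups with different spreads
toy <- data.frame(
  Gender = factor(rep(c("Female", "Male"), each = 50)),
  WeeklyIncome = c(rnorm(50, mean = 800, sd = 150),
                   rnorm(50, mean = 1000, sd = 300))
)

# Absolute deviation of each value from its own group median
group_median <- ave(toy$WeeklyIncome, toy$Gender, FUN = median)
toy$abs_dev <- abs(toy$WeeklyIncome - group_median)

# One-way ANOVA on the absolute deviations gives the Levene F statistic and p-value
levene_table <- anova(lm(abs_dev ~ Gender, data = toy))
levene_F <- levene_table[["F value"]][1]
levene_p <- levene_table[["Pr(>F)"]][1]

levene_table
```

A small p-value indicates that the typical deviation from the median differs between the groups, i.e. the variances are unlikely to be equal.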

Hypothesis Testing Cont.

Our data is in the following format:

statsdata[1:4,]

However, to perform the t-test we need a factor variable that defines the groups. We reshape the data into a suitable form and keep only the variables we need for the statistical analysis: Occupation, Gender and WeeklyIncome, where Gender is the factor variable that defines the two groups, Male and Female.

Reshaping our data using gather():

y <- gather(statsdata,'M_weekly','F_weekly', key = 'Gender', value = "WeeklyIncome") %>% select(c(Occupation, Gender, WeeklyIncome))
y[y == "M_weekly"] <- "Male"
y[y == "F_weekly"] <- "Female"
y[1:4,]
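The wide-to-long reshaping can also be sketched in base R, which makes the resulting structure explicit. The toy rows below are hypothetical stand-ins for statsdata; this is an illustration of the layout, not a replacement for the gather() call above.

```r
# Hypothetical toy rows in the same wide layout as statsdata
wide <- data.frame(
  Occupation = c("A", "B", "C"),
  M_weekly = c(900, 1000, 1300),
  F_weekly = c(700, 850, 1100)
)

# Stack the two income columns and label each row with its group
long <- data.frame(
  Occupation = rep(wide$Occupation, times = 2),
  Gender = factor(rep(c("Male", "Female"), each = nrow(wide))),
  WeeklyIncome = c(wide$M_weekly, wide$F_weekly)
)

long
levels(long$Gender)
```

Factor levels are sorted alphabetically, so Female comes before Male. This is why a formula-based t.test() reports the difference as Female minus Male.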

Performing the Levene’s Test:

leveneTest(WeeklyIncome ~ Gender, data = y)

Here the p-value is 0.0003953 < 0.05, hence we reject the null hypothesis of equal variances. In other words, it is not safe to assume equal variance.

So we perform the two-sample t-test assuming unequal variance.

Two-sample t-test - Assuming Unequal Variance:

t.test( WeeklyIncome ~ Gender, data = y, var.equal = FALSE, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  WeeklyIncome by Gender
## t = -4.6592, df = 257.95, p-value = 5.091e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -283.6525 -115.1144
## sample estimates:
## mean in group Female   mean in group Male 
##             818.1518            1017.5352
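The Welch statistic and its degrees of freedom can be reproduced by hand. The sketch below uses simulated, hypothetical data (not the income data) and checks the manual calculation against t.test(); the formulas are the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom.

```r
set.seed(2)  # arbitrary seed

# Hypothetical toy samples with unequal variances
female <- rnorm(40, mean = 800, sd = 150)
male <- rnorm(40, mean = 1000, sd = 300)

# Squared standard error contributed by each group
se2_f <- var(female) / length(female)
se2_m <- var(male) / length(male)

# Welch t statistic (Female minus Male, matching the output above)
t_stat <- (mean(female) - mean(male)) / sqrt(se2_f + se2_m)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_f + se2_m)^2 /
  (se2_f^2 / (length(female) - 1) + se2_m^2 / (length(male) - 1))

fit <- t.test(female, male, var.equal = FALSE)
c(by_hand = t_stat, t_test = unname(fit$statistic))
c(by_hand = df_welch, t_test = unname(fit$parameter))
```

The non-integer degrees of freedom explain why the output above reports df = 257.95 rather than a whole number.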

Critical Value Approach:

Finding the $t^*$ (critical) value:

qt(p = 0.025, df = 142 + 142 - 2)

## [1] -1.968412

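From the two-sample t-test we get the test statistic $t = -4.6592$, which falls outside the interval between the lower critical value (-1.968412) and the upper critical value (1.968412). According to the critical value method, there is a statistically significant difference between the mean weekly incomes of males and females. Hence, we reject the Null Hypothesis H₀. Note that the critical value here uses 142 + 142 - 2 = 282 degrees of freedom, whereas the Welch test reports df = 257.95; the two critical values differ only slightly and the conclusion is the same.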
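For comparison, the critical value can be computed with the degrees of freedom reported by the Welch test instead of 142 + 142 - 2. This is a minimal sketch that plugs in the reported values of t and df; the variable names are illustrative.

```r
t_stat <- -4.6592    # test statistic reported by t.test()
df_welch <- 257.95   # Welch degrees of freedom reported by t.test()
alpha <- 0.05

# Upper critical value for a two-sided test; the lower one is its negative
t_crit <- qt(1 - alpha / 2, df = df_welch)

# Reject H0 when the statistic lies outside [-t_crit, t_crit]
reject_h0 <- abs(t_stat) > t_crit

t_crit
reject_h0
```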

p-value Approach:

The p-value of the two-sample t-test is the probability of observing a difference between the sample means at least as extreme as $-199.3834$, assuming the difference is 0 in the population. From the two-sample t-test, we get $p = 5.091e-06$.

According to the p-value approach, since $p$ = 5.091e-06 < $α$ = 0.05 we reject the Null Hypothesis H₀. Therefore there is a statistically significant difference between the means.
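The two-sided p-value follows directly from the test statistic and the degrees of freedom. The sketch below plugs in the values reported by t.test(); because those inputs are rounded, the result should be close to, but not necessarily identical with, the printed p-value.

```r
t_stat <- -4.6592    # test statistic reported by t.test()
df_welch <- 257.95   # Welch degrees of freedom reported by t.test()
alpha <- 0.05

# Two-sided p-value: probability mass in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

p_value
p_value < alpha
```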

Confidence Interval approach:

We calculate the difference between the mean weekly incomes of Females and Males (Female minus Male, the order used by t.test()):

818.1518 - 1017.5352

## [1] -199.3834

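The two-sample t-test gives us the 95% CI: $[ -283.6525, -115.1144]$. This interval captures the observed difference of $-199.3834$ but does not capture 0, the difference stated in H₀. Hence we reject the Null Hypothesis H₀: there is a statistically significant difference between the means, in agreement with the critical value and p-value approaches.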
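The confidence interval can be reconstructed from the reported quantities, which also makes the decision rule explicit: H₀ is rejected when the interval excludes 0. This is a sketch using the rounded values printed by t.test(), so the bounds agree with the output only up to rounding; the variable names are illustrative.

```r
mean_female <- 818.1518
mean_male <- 1017.5352
t_stat <- -4.6592    # reported test statistic
df_welch <- 257.95   # reported Welch degrees of freedom

diff_means <- mean_female - mean_male   # Female minus Male
std_error <- diff_means / t_stat        # since t = difference / standard error

# 95% CI: difference plus or minus the critical value times the standard error
margin <- qt(0.975, df = df_welch) * std_error
ci <- c(lower = diff_means - margin, upper = diff_means + margin)

# H0 (difference = 0) is rejected when 0 lies outside the interval
contains_zero <- ci[["lower"]] <= 0 && ci[["upper"]] >= 0

ci
contains_zero
```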

Discussion

After performing the hypothesis test we found a statistically significant difference between the mean weekly incomes of male and female employees, suggesting that there is an inequality in the income received by males and females. There are, however, some limitations to the data.

While the sample size is greater than 30 (statistically a large sample), we omitted many observations because of missing values, and a final sample size of 142 observations may not be enough to conclude our investigation.
Not all occupations were taken into consideration, especially after omitting data because of missing values.
Bonuses and length of stay at one company are not recorded in this dataset, and they could have had an impact on our findings.
The dataset was collected in 2015, so it does not reflect current conditions.

To improve the accuracy of the investigation we could use a larger sample and try to include all possible occupations. The missing values were a big factor, so data entry errors should be limited to increase accuracy.

Overall, the investigation found statistical evidence that male employees receive a higher income than female employees.

MATH1324 Assignment 3

Gender Gap in Income

