Hypothesis testing

Correlation and association between categorical variables)

Import dataset

dataset = read.csv("Salaries.csv")

head(dataset)

##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Description

Source: Salaries dataset offered by the library carData References: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.

It contains 397 observations on the following 6 variables (columns):

- rank: a factor with levels AssocProf AsstProf Prof. Categorical

- discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments). Categorical

- yrs.since.phd: years since PhD. Numerical

- yrs.service: years of service. Numerical

- sex: a factor with levels Female Male. Categorical

- salary: nine-month salary, in dollars. Numerical

nrow(dataset)

## [1] 397

names(dataset)

## [1] "rank"          "discipline"    "yrs.since.phd" "yrs.service"  
## [5] "sex"           "salary"

str(dataset)

## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...
##  $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

Missing values

There are no missing values (NA or NULL values).

colSums(is.na(dataset))

##          rank    discipline yrs.since.phd   yrs.service           sex 
##             0             0             0             0             0 
##        salary 
##             0

any(is.na(dataset))

## [1] FALSE

Analyze numerical variables

library(psych)

describe(dataset[, c("yrs.since.phd", "yrs.service", "salary")])

##               vars   n      mean       sd median   trimmed      mad   min
## yrs.since.phd    1 397     22.31    12.89     21     21.83    14.83     1
## yrs.service      2 397     17.61    13.01     16     16.51    14.83     0
## salary           3 397 113706.46 30289.04 107300 111401.61 29355.48 57800
##                  max  range skew kurtosis      se
## yrs.since.phd     56     55 0.30    -0.81    0.65
## yrs.service       60     60 0.65    -0.34    0.65
## salary        231545 173745 0.71     0.18 1520.16

Distributions

hist(dataset$yrs.since.phd, 
     main="Histogram of yrs.since.phd",
     xlab="yrs.since.phd",)

hist(dataset$yrs.service, 
     main="Histogram of yrs.service",
     xlab="yrs.service")

hist(dataset$salary, 
     main="Histogram of salary",
     xlab="salary")

Check for outliers

par(mfrow=c(1, 3))

boxplot(dataset$yrs.since.phd, main = "yrs.since.phd", horizontal = FALSE)
boxplot(dataset$yrs.service, main = "yrs.service", horizontal = FALSE)
boxplot(dataset$salary, main = "salary", horizontal = FALSE)

There are only a few outliers in the salary column and one in the yrs.service columns.

Looking at the dataset they do not seem errors, so I decided to keep them.

Analyze categorical variables

The categorical variables are already factors.

Distributions:

barplot(table(dataset$rank), main = "rank Distribution")

barplot(table(dataset$discipline), main = "discipline Distribution")

barplot(table(dataset$sex), main = "sex Distribution")

RQ1

RQ1: Is there a significant difference in salaries between male and female professors?

female_salaries = dataset$salary[dataset$sex == "Female"]
male_salaries = dataset$salary[dataset$sex == "Male"]

describeBy(x = dataset$salary, group = dataset$sex)

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n     mean       sd median  trimmed      mad   min    max range skew
## X1    1 39 101002.4 25952.13 103750 99531.06 35229.54 62884 161101 98217 0.42
##    kurtosis      se
## X1     -0.8 4155.67
## ------------------------------------------------------------ 
## group: Male
##    vars   n     mean       sd median  trimmed      mad   min    max  range skew
## X1    1 358 115090.4 30436.93 108043 112748.1 29586.02 57800 231545 173745 0.71
##    kurtosis      se
## X1     0.15 1608.64

Check assumptions for parametric test (t-test)

a) Normality Test

The Shapiro-Wilk test would check if salaries are normally distributed within each gender.

If p-value < 0.05, it means that the data significantly deviates from normal distribution.

Since the p-value (0.063) for Female salaries is greater than 0.05, we fail to reject the null hypothesis that female salaries are normally distributed. The salary data for females appears to follow a normal distribution (more or less).

The p-value for Male salaries is much smaller than 0.05, indicating rejection of the null hypothesis. The salary data for males does not follow a normal distribution.

This implies that parametric statistical tests that assume normality may not be appropriate for analyzing male salaries.

Another thing to note that will influence the analysis is that Female data collected is much less than Male data.

shapiro_female = shapiro.test(female_salaries)
shapiro_male = shapiro.test(male_salaries)

print(shapiro_female)

## 
##  Shapiro-Wilk normality test
## 
## data:  female_salaries
## W = 0.94665, p-value = 0.06339

print(shapiro_male)

## 
##  Shapiro-Wilk normality test
## 
## data:  male_salaries
## W = 0.95877, p-value = 1.735e-08

Density plot for visual inspection

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(dataset, aes(x = salary, fill = sex)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Salary Distribution by Gender",
       x = "Salary",
       y = "Density")

b) Homogeneity of variances Test

Performs an F-test to compare the variances of two groups (Female and Male salaries).

Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the variances of the two groups are equal.

Assumption of equal variances holds for this data.

var_test = var.test(female_salaries, male_salaries)

print(var_test)

## 
##  F test to compare two variances
## 
## data:  female_salaries and male_salaries
## F = 0.72702, num df = 38, denom df = 357, p-value = 0.2309
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4718089 1.2292716
## sample estimates:
## ratio of variances 
##          0.7270166

Parametric test (t-test)

Try to use a parametric test even if the normality assumption does not hold.

Welch Two Sample t-test is used to compare the means of two groups to determine if there is a significant difference between them.

The test statistic indicates the magnitude and direction of the difference in means. A negative value shows that the mean salary for females is lower than that for males.

Since the p-value (0.005667) is less than 0.05, we reject the null hypothesis. This indicates there is a statistically significant difference between the means of the two groups.

The 95% confidence interval for the difference in means does not include 0, further supporting the conclusion of a significant difference. The negative values indicate that female salaries are significantly lower than male salaries.

Mean of female salaries (mean of x): 101,002.4 Mean of male salaries (mean of y): 115,090.4

The average male salary is approximately $14,088 higher than the average female salary.

t_test_result = t.test(female_salaries, male_salaries, var.equal = TRUE)

print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  female_salaries and male_salaries
## t = -2.7817, df = 395, p-value = 0.005667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24044.910  -4131.107
## sample estimates:
## mean of x mean of y 
##  101002.4  115090.4

Non-parametric test (Wilcoxon rank sum test)

This test is more appropriate since the normality assumption does not hold.

The Wilcoxon rank sum test is a non-parametric test used to compare the distributions of two independent groups.

Since the p-value (0.0082) is less than 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in the distributions of salaries between females and males.

The test assumes the alternative hypothesis: “true location shift is not equal to 0,” meaning there is a significant difference between the two groups.

The Wilcoxon rank sum test confirms a statistically significant difference between female and male salary distributions.

wilcox_test_result = wilcox.test(female_salaries, male_salaries,
                                 paired = FALSE,
                                 correct = FALSE,
                                 exact = NULL,
                                 alternative = "two.sided")

print(wilcox_test_result)

## 
##  Wilcoxon rank sum test
## 
## data:  female_salaries and male_salaries
## W = 5182.5, p-value = 0.008219
## alternative hypothesis: true location shift is not equal to 0

Conclusion

In this analysis, both parametric (t-test) and non-parametric (Wilcoxon rank sum test) tests were applied to compare the salaries of males and females.

The Shapiro-Wilk test for normality revealed that the salary data for males deviates significantly from a normal distribution (p < 0.05), while the data for females does not.

So the best approach is to choose a non-parametric test (Wilcoxon Rank Sum Test) which does not require the data to be normally distributed providing a more robust alternative.

This test provides confidence in the conclusion of a statistically significant difference between male and female salaries.

Effect size for t-test

Cohen’s d is a common measure of effect size for a t-test. It quantifies the difference between two means in terms of standard deviations.

Cohen’s d value of -0.47 indicates a small effect size. This suggests that the difference in salaries between male and female employees is noticeable, but not very large.

A negative value for Cohen’s d suggests that females earn less on average than males, since the mean of female salaries is lower than that of male salaries in this case.

The 95% confidence interval does not include 0, confirming that the observed effect is statistically significant.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

cohen_d_result = effectsize::cohens_d(female_salaries, male_salaries)
print(cohen_d_result)

## Cohen's d |         95% CI
## --------------------------
## -0.47     | [-0.80, -0.14]
## 
## - Estimated using pooled SD.

interpret_cohens_d(0.47, rules="sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

Effect size for Wilcoxon test

This measures the strength and direction of the difference between the two groups (Female and Male salaries) based on their ranks. The effectsize() function returns a value for the rank-biserial correlation.

The result is 0.26, which indicates a small to medium effect size. This means that there is a moderate effect in the difference between the salaries of males and females based on the Wilcoxon rank-sum test.

effectsize(wilcox.test(female_salaries, male_salaries,
                       paired = FALSE,
                       correct = FALSE,
                       exact = NULL,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.26             | [-0.43, -0.07]

interpret_rank_biserial(0.26)

## [1] "medium"
## (Rules: funder2019)

ANSWER:

The analysis shows that there is a statistically significant difference in salaries between Male and Female employees. Both the t-test and the Wilcoxon rank sum test suggest that male employees earn more on average than female employees, with the effect size indicating a small to medium difference in salary distributions between the two groups.

RQ2

RQ2: Is there a significant correlation between years of service (yrs.service) and salary (salary) among faculty members?

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(dataset[, c("yrs.since.phd", "yrs.service", "salary")])

Pearson or Spearman’s Rank Correlation should be computed.

Pearson’s correlation coefficient is used to measure the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to 1 and assumes that both variables are normally distributed.

On the other hand, Spearman’s Rank Correlation does not assume a linear relationship. It can be also used when the relationship is monotonic but not necessarily linear.

The yrs.service and salary are continuous numerical variables and we want to test if there is a relationship (to confirm intuition).

In the next part, the Pearson or Spearman’s Rank Correlation will be computed based on the normality check of the variables. Then the p-value will be also calculated to test if the correlation is statistically significant.

If the p-value is less than 0.05, reject the null hypothesis and conclude that there is a significant correlation between the two variables.

Check the normality of variables:

The null hypothesis of the Shapiro-Wilk test is that the variable follows a normal distribution.

Since the p-values are way less than 0.05, this suggests that the variables do not follow a normal distribution.

This was also suggested by the above histograms and the skewness and kurtosis values computed before.

shapiro_test_yrs_service = shapiro.test(dataset$yrs.service)
shapiro_test_salary = shapiro.test(dataset$salary)

print(shapiro_test_yrs_service)

## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$yrs.service
## W = 0.94183, p-value = 2.337e-11

print(shapiro_test_salary)

## 
##  Shapiro-Wilk normality test
## 
## data:  dataset$salary
## W = 0.95988, p-value = 6.076e-09

Since the variables do not follow a normal distribution and the relationship between the variables seems monotonic increasing but not necessarily linear, the Spearman’s Rank Correlation is computed.

Spearman’s Rank Correlation

correlation_result = cor.test(dataset$yrs.service, dataset$salary, method = "spearman")

## Warning in cor.test.default(dataset$yrs.service, dataset$salary, method =
## "spearman"): Cannot compute exact p-value with ties

print(correlation_result)

## 
##  Spearman's rank correlation rho
## 
## data:  dataset$yrs.service and dataset$salary
## S = 5996863, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4249486

The value of rho is 0.4249486. This suggests a moderate positive correlation between the two variables yrs.service (years of service) and salary. A Spearman rho value closer to +1 indicates a strong positive relationship, 0 indicates no relationship, and closer to -1 indicates a strong negative relationship

The p-value is < 2.2e-16, which is extremely small, suggesting strong evidence against the null hypothesis. Since the p-value is much smaller than 0.05, we can reject the null hypothesis that there is no correlation between yrs.service and salary.

The plot below confirms this:

ggplot(dataset, aes(x=yrs.service, y=salary)) +
  geom_point()

ANSWER:

given the result of the Spearman’s Rank Correlation (rho = 0.42) and the very low p-value (< 2.2e-16), we can conclude that there is a moderate positive correlation between years of service and salary in the dataset. The small p-value suggests that this result is statistically significant, and it is highly unlikely that this correlation is due to random chance.

RQ3

RQ1: Is there a significant association between academic rank (rank) and gender (sex)

Null Hypothesis: there is no association between academic rank and gender (independence between rank and sex)

Alternative Hypothesis: there is an association between academic rank and gender

The observed frequencies are:

contingency_table = table(dataset$rank, dataset$sex)

print(contingency_table)

##            
##             Female Male
##   AssocProf     10   54
##   AsstProf      11   56
##   Prof          18  248

The data is structured as a contingency table with observed frequencies for combinations of rank and sex.

Proportions across each row:

prop.table(contingency_table, margin = 1)

##            
##                 Female       Male
##   AssocProf 0.15625000 0.84375000
##   AsstProf  0.16417910 0.83582090
##   Prof      0.06766917 0.93233083

Proportions across each column:

prop.table(contingency_table, margin = 2)

##            
##                Female      Male
##   AssocProf 0.2564103 0.1508380
##   AsstProf  0.2820513 0.1564246
##   Prof      0.4615385 0.6927374

Assumptions Check

- both rank and sex are categorical

- each professor appears only once (independence of observations)

- need to verify if at least 80% of cells have expected frequencies ≥ 5

“correct” parameter is set to FALSE since the contingency table is 3x2 (not 2x2) and the dataset size is relatively large (n=397). There is no need to apply Yates’ continuity correction.

chi2_test_result = chisq.test(contingency_table, correct = FALSE)

print(chi2_test_result)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 8.5259, df = 2, p-value = 0.01408

Expected frequencies:

all of them are greater than 5, meeting the assumption for the Chi-Square test.

expected_frequencies = round(chi2_test_result$expected, 2)

print(expected_frequencies)

##            
##             Female   Male
##   AssocProf   6.29  57.71
##   AsstProf    6.58  60.42
##   Prof       26.13 239.87

print("Percentage of cells with expected frequencies >= 5:")

## [1] "Percentage of cells with expected frequencies >= 5:"

print((sum(expected_frequencies >= 5)/length(expected_frequencies)) * 100)

## [1] 100

Since the assumptions are satisfied, the Chi-Square test results are valid.

The p-value is less than 0.05, leading to the rejection of the null hypothesis.

Conclusion: there is a statistically significant association between rank and sex.

Observed (Empirical) Frequencies: - Female professors are underrepresented in the Prof rank compared to their male counterparts (18 vs. 248) - Female professors are more evenly distributed in the AssocProf and AsstProf ranks, although they remain a minority in all ranks.

Expected (Theoretical) Frequencies: - they represent the distribution of sex across rank under the assumption of independence. - For AssocProf (Female), the expected count is 6.29, but the observed count is 10 - For Prof (Female), the expected count is 26.13, but the observed count is only 18.

The discrepancy between observed and expected frequencies suggests that gender distribution is not uniform across ranks.

Standardized residuals

They indicate how far each cell’s observed frequency deviates from the expected frequency, measured in standard deviation units.

Residuals greater than |1.96| suggest significant deviation at alpha=0.05:

AssocProf (Female): +1.70 (not significant)
AsstProf (Female): +1.99 (close to significant)
Prof (Female): −2.92 (significant underrepresentation)
Prof (Male): +2.92 (significant overrepresentation).

Female professors are significantly underrepresented in the Prof rank and very slightly overrepresented in the AsstProf rank compared to what would be expected under independence.

Male professors are significantly overrepresented in the Prof rank.

standardized_residuals = chi2_test_result$stdres

print(standardized_residuals)

##            
##                Female      Male
##   AssocProf  1.702577 -1.702577
##   AsstProf   1.989100 -1.989100
##   Prof      -2.915939  2.915939

Effect size

Cramér’s V is used to compute the effect size for the Chi-Square test. It measures the strength of association between the categorical variables.

library(effectsize)

effectsize::cramers_v(contingency_table)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.13              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.13)

## [1] "small"
## (Rules: funder2019)

Since V = 0.13, the effect size is small, indicating a weak relationship between gender and academic rank.”

Research Question: Is there an association between academic rank and gender in the dataset?

ANSWER:

- The Chi-Square test (p=0.014) shows a statistically significant association between gender and academic rank

- Female professors are significantly underrepresented in the Prof rank and slightly overrepresented in the AsstProf rank

The effect size (V=0.13) indicates a small strength of association

There is evidence of a modest association between gender and academic rank in this dataset. Gender distribution across ranks is not uniform, with females being underrepresented at higher academic ranks.

Johana Radojević - HOMEWORK ASSIGNMENT 1

Johana Radojević

2025-01-18

Hypothesis testing

Correlation and association between categorical variables)

Import dataset

Description

Missing values

Analyze numerical variables

Analyze categorical variables

RQ1

RQ2

RQ3