HW2

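library(carData)  # Salaries data set
library(dplyr)    # %>%, filter(), group_by(), summarise()
# Later chunks also need psych, ggplot2, ggpubr, rstatix, car and effectsize loaded.
mydata <- Salaries
mydata <- mydata %>% tidyr::drop_na()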


head(mydata)

##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

A data frame with 397 observations on the following 6 variables:

rank: a factor with levels AssocProf AsstProf Prof
discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments)
yrs.since.phd: years since PhD
yrs.service: years of service
sex: a factor with levels Female Male
salary: nine-month salary, in dollars

Units of observation are Assistant Professors, Associate Professors and Professors in a college in the U.S.

Main goal of the data analysis: to determine how large the salary differences between male and female faculty members are.

Source: the carData R package

HYPOTHESIS TESTING

1) Is there a difference in salaries between female and male faculty?

Hypothesis: On average, there is no difference in salaries between men and women.

\(H_0: \mu_M = \mu_F\)

\(H_1: \mu_M \neq \mu_F\)

\(\mu_M:\) average salary men earn

\(\mu_F:\) average salary women earn

describeBy(mydata$salary, mydata$sex)

## 
##  Descriptive statistics by group 
## group: Female
##    vars  n     mean       sd median  trimmed      mad   min    max range skew kurtosis      se
## X1    1 39 101002.4 25952.13 103750 99531.06 35229.54 62884 161101 98217 0.42     -0.8 4155.67
## ------------------------------------------------------------------------------------------ 
## group: Male
##    vars   n     mean       sd median  trimmed      mad   min    max  range skew kurtosis      se
## X1    1 358 115090.4 30436.93 108043 112748.1 29586.02 57800 231545 173745 0.71     0.15 1608.64

What must we check?

Variable is numeric: OK
Normality:

# Histogram of salaries for each sex; labs() takes x =, not xlab =
Male <- ggplot(mydata %>% filter(sex == "Male"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "blue", color = "black") +
  theme_classic() +
  labs(title = "Male salaries", x = "Salary")

Female <- ggplot(mydata %>% filter(sex == "Female"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "red", color = "black") +
  theme_classic() +
  labs(title = "Female salaries", x = "Salary")



ggarrange(Male, Female, 
          ncol = 2)

Male salaries do not look normally distributed, whereas female salaries might be. We can check the normality of both distributions with the Shapiro-Wilk test.

\(H_0:\) Salaries are normally distributed

\(H_1:\) Salaries are not normally distributed

mydata %>% group_by(sex) %>% 
           shapiro_test(salary)

## # A tibble: 2 × 4
##   sex    variable statistic            p
##   <fct>  <chr>        <dbl>        <dbl>
## 1 Female salary       0.947 0.0634      
## 2 Male   salary       0.959 0.0000000173

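Because the \(p\)-value for Male is below \(0.001\), we reject the null hypothesis of normality for male salaries at \(p<0.001\). For Female (\(p=0.063\)) we cannot reject normality at the 5% level.

Normality is violated for at least one group, so we use the non-parametric Wilcoxon Rank Sum test.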

Hypothesis:

\(H_0:\) distribution locations are the same

\(H_1:\) distribution locations are not the same

wilcox.test(mydata$salary ~ mydata$sex,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$salary by mydata$sex
## W = 5182.5, p-value = 0.008219
## alternative hypothesis: true location shift is not equal to 0

We reject \(H_0\) at \(p<0.01\). In this sample, women earn lower salaries than men.
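To see where the W statistic comes from, here is a minimal sketch on made-up toy data (not the Salaries data). It assumes the usual definition: rank all observations together, sum the ranks of the first group, and subtract \(n_1(n_1+1)/2\). The vectors and variable names are ours, for illustration only.

```r
# Toy data, invented for illustration only
x <- c(62, 71, 78, 85, 90)          # first group
y <- c(75, 88, 93, 97, 104, 110)    # second group

# Rank all observations together
r  <- rank(c(x, y))
n1 <- length(x)

# W = (sum of ranks in the first group) - n1 * (n1 + 1) / 2
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test()
W_builtin <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

c(manual = W_manual, builtin = W_builtin)
```

In the test above, Female is the first factor level, so W refers to the female group. A value of 5182.5 is below half of the maximum \(39 \cdot 358 = 13962\), which is consistent with female salaries tending to rank lower.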

We can also check the effect size:

effectsize(wilcox.test(mydata$salary ~ mydata$sex,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.26             | [-0.43, -0.07]

interpret_rank_biserial(0.26)

## [1] "medium"
## (Rules: funder2019)

Effect size is medium.
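The reported effect size can be cross-checked from the W statistic. This is a rough sketch that assumes one common formula for the rank-biserial correlation, \(r = 2W/(n_1 n_2) - 1\); the package may compute it differently internally, but the rounded result agrees with the output above.

```r
# Values taken from the outputs above
W  <- 5182.5   # Wilcoxon rank sum statistic (Female is the first group)
n1 <- 39       # number of women
n2 <- 358      # number of men

# Rank-biserial correlation from W (assumed formula, see text)
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
```

The negative sign only reflects that the first group (Female) tends to have lower ranks; the magnitude 0.26 is what is interpreted.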

Conclusion: Based on the sample data, we find that male and female salaries differ (\(p<0.01\)): men earn higher salaries, and the difference between the distributions is medium (\(|r|=0.26\)).

2) Test of population proportion

mydata1 <- mydata %>%  group_by(discipline) %>% dplyr::summarise(num = n())

The sample of 397 included 181 from A (“theoretical” departments) and 216 from B (“applied” departments). Can we conclude that more than half of the people who work at universities work in applied departments?

Assumptions are met: both \(n \pi_0 > 5\) and \(n (1-\pi_0) > 5\) (here \(397 \cdot 0.5 = 198.5\)).

Hypothesis:

\(H_0: \pi = 0.5\)

\(H_1: \pi > 0.5\)

prop.test(x=216,
          n=397,
          p=0.5,
          correct = FALSE,
          alternative = "greater")

## 
##  1-sample proportions test without continuity correction
## 
## data:  216 out of 397, null probability 0.5
## X-squared = 3.0856, df = 1, p-value = 0.03949
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.5028048 1.0000000
## sample estimates:
##         p 
## 0.5440806

\(p\)-value = 3.9%, so we can reject \(H_0\) at \(p<0.05\). This means that more than half of the faculty work in applied departments.

The estimated proportion working in applied departments is 54.4%.
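The numbers in the prop.test output can be reproduced by hand. A minimal sketch, assuming the usual normal approximation without continuity correction (variable names are ours):

```r
x  <- 216    # faculty in applied departments
n  <- 397    # sample size
p0 <- 0.5    # proportion under H0

# Assumption check: expected counts under H0 should exceed 5
c(n * p0, n * (1 - p0))

p_hat  <- x / n
z      <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # z statistic
chi_sq <- z^2                                      # X-squared reported by prop.test
p_val  <- pnorm(z, lower.tail = FALSE)             # one-sided, alternative "greater"

round(c(p_hat = p_hat, z = z, chi_sq = chi_sq, p_val = p_val), 4)
```

The squared z statistic matches the X-squared value because, with one proportion, the chi-squared test and the z test are the same test.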

3) Are years since PhD and rank related?

Hypothesis: On average, years since PhD do not differ between ranks.

\(H_0: \mu_{Prof} = \mu_{AsstProf} = \mu_{AssocProf}\)

\(H_1:\) At least one \(\mu_{i}\) is different

Average years since PhD is 22.31

psych::describe(mydata$yrs.since.phd)

##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 397 22.31 12.89     21   21.83 14.83   1  56    55  0.3    -0.81 0.65

describeBy(mydata$yrs.since.phd, mydata$rank)

## 
##  Descriptive statistics by group 
## group: AsstProf
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 67  5.1 2.54      4    4.98 2.97   1  11    10  0.5     -0.5 0.31
## ------------------------------------------------------------------------------------------ 
## group: AssocProf
##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 64 15.45 9.65     12   13.52 4.45   6  49    43 2.04     3.77 1.21
## ------------------------------------------------------------------------------------------ 
## group: Prof
##    vars   n mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 266 28.3 10.11     28   27.86 11.86  11  56    45 0.35    -0.65 0.62

Levene test: \(H_0:\) Variances of years since PhD are the same for all ranks

\(H_1:\) Variances of years since PhD are not the same for all ranks

leveneTest(mydata$yrs.since.phd, group = mydata$rank)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2   35.16 8.906e-15 ***
##       394                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject \(H_0\) (\(p<0.001\)), which is not “good” for us: the variances differ between ranks (heteroskedasticity).
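For intuition, the median-centred Levene test is a one-way ANOVA on the absolute deviations from each group's median. The sketch below uses base R only; `levene_median()` is a hypothetical helper of ours, not a function from any package, and the data are simulated.

```r
# Hypothetical helper: Levene's test with center = median, in base R
levene_median <- function(y, g) {
  g   <- as.factor(g)
  dev <- abs(y - ave(y, g, FUN = median))  # |y - group median|
  anova(lm(dev ~ g))                       # one-way ANOVA on the deviations
}

# Simulated toy data: group "c" has a much larger spread
set.seed(1)
toy <- data.frame(
  g = rep(c("a", "b", "c"), each = 30),
  y = c(rnorm(30, sd = 1), rnorm(30, sd = 1), rnorm(30, sd = 5))
)

res <- levene_median(toy$y, toy$g)
res
```

Assuming leveneTest follows this standard definition, calling levene_median(mydata$yrs.since.phd, mydata$rank) should give the same F value as the output above.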

Check normality:

\(H_0:\) Years since PhD are normally distributed

\(H_1:\) Years since PhD are not normally distributed

mydata %>%
  group_by(rank) %>%
  shapiro_test(yrs.since.phd)

## # A tibble: 3 × 4
##   rank      variable      statistic             p
##   <fct>     <chr>             <dbl>         <dbl>
## 1 AsstProf  yrs.since.phd     0.936 0.00192      
## 2 AssocProf yrs.since.phd     0.727 0.00000000135
## 3 Prof      yrs.since.phd     0.971 0.0000304

We reject the null hypothesis at \(p<0.01\) for all ranks (the largest \(p\)-value, for AsstProf, is \(0.0019\)).

Normality is violated, so we use the non-parametric Kruskal-Wallis Rank Sum Test.

Kruskal-Wallis Rank Sum Test:

\(H_0:\) Distribution locations of years since PhD are the same in all 3 groups

\(H_1:\) Distribution locations of years since PhD are not the same in all 3 groups

kruskal.test(yrs.since.phd ~ rank,
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  yrs.since.phd by rank
## Kruskal-Wallis chi-squared = 219.28, df = 2, p-value < 2.2e-16

At \(p<0.0001\) we reject the null hypothesis. This means that the distribution location of years since PhD differs for at least one rank.
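The Kruskal-Wallis statistic is built from rank sums per group. A minimal sketch on made-up data without ties, assuming the textbook formula \(H = \frac{12}{N(N+1)} \sum_i R_i^2 / n_i - 3(N+1)\) (with ties, kruskal.test applies an additional correction that is not shown here):

```r
# Toy data, invented for illustration only (no tied values)
y <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9, 9.4)
g <- factor(c("a", "a", "b", "a", "b", "c", "b", "c", "c"))

r   <- rank(y)               # ranks over all observations
N   <- length(y)
R   <- tapply(r, g, sum)     # rank sum per group
n_i <- tapply(r, g, length)  # group sizes

H_manual  <- 12 / (N * (N + 1)) * sum(R^2 / n_i) - 3 * (N + 1)
H_builtin <- unname(kruskal.test(y ~ g)$statistic)

c(manual = H_manual, builtin = H_builtin)
```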

kruskal_effsize(yrs.since.phd ~ rank,
                data = mydata)

## # A tibble: 1 × 5
##   .y.               n effsize method  magnitude
## * <chr>         <int>   <dbl> <chr>   <ord>    
## 1 yrs.since.phd   397   0.551 eta2[H] large

Magnitude is large.
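The effect size can be recomputed from the Kruskal-Wallis output, assuming the usual definition \(\eta^2_H = (H - k + 1)/(n - k)\), where \(k\) is the number of groups and \(n\) the total sample size:

```r
H <- 219.28   # Kruskal-Wallis chi-squared from the output above
k <- 3        # number of ranks (groups)
n <- 397      # total number of observations

eta2_H <- (H - k + 1) / (n - k)
round(eta2_H, 3)
```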

groups_nonpar <- wilcox_test(yrs.since.phd ~ rank,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)
groups_nonpar

## # A tibble: 3 × 9
##   .y.           group1    group2       n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>         <chr>     <chr>     <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 yrs.since.phd AsstProf  AssocProf    67    64     162.  5.72e-20 1.72e-19 ****        
## 2 yrs.since.phd AsstProf  Prof         67   266       1.5 1.06e-36 3.18e-36 ****        
## 3 yrs.since.phd AssocProf Prof         64   266    2400.  4.54e-19 1.36e-18 ****

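All three pairwise comparisons are significant after the Bonferroni correction, so rank and years since PhD are related: each rank differs from the other two in years since PhD.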
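The p.adj column is the Bonferroni adjustment of the raw p-values: each one is multiplied by the number of comparisons (3 here) and capped at 1. A short check with base R, using the raw p-values as printed in the table above:

```r
# Raw p-values from the pairwise Wilcoxon tests (as printed above)
p_raw <- c(5.72e-20, 1.06e-36, 4.54e-19)

# Bonferroni: multiply by the number of tests, cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj
```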

Zan Mikola

2023-01-17
