#data()
#data(package= .packages(all.available=TRUE))
#install.packages("carData")
library(carData)
Reference: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage
mydata <- data.frame(Salaries)
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
nrow(mydata)
## [1] 397
Explanation of a data set:
We have a data set of salaries for professors and we have 397 observations (units) with 6 variables.
Unit of observation: A professor in the U.S. Sample size = 397
Variables:
rank: A factor with levels AssocProf, AsstProf, Prof.It is a type of categorical variable with 3 values.
discipline: A factor with levels A (“theoretical” departments) or B (“applied” departments). It is a categorical (dummy) variable.
yrs.since.phd: Years since PhD. It is a numeric type of variable.
yrs.service: Years of service. It is a numeric type of variable.
sex: A factor with levels Female and Male. It is a categorical variable (dummy) with two variables.
salary: A nine-month salary, measured in dollars.
Note that fore the first RQ I will analyze only the variables salary and yrs.since.phd (the correlation) and for the second RQ I will use rank and sex (for the association).
colnames(mydata) <- c("Position", "Discipline", "YearsSincePhD", "YearsOfService", "Gender", "Salary")
#Change of names of the variables
mydata$DisciplineF <- factor(mydata$Discipline,
levels = c("A", "B"),
labels = c("A", "B"))
mydata$GenderF <- factor(mydata$Gender, #Convert categorical variables into factors
levels = c("Male", "Female"),
labels = c("Male", "Female"))
mydata <- mydata[, -7, drop = FALSE]
head(mydata)
## Position Discipline YearsSincePhD YearsOfService Gender Salary GenderF
## 1 Prof B 19 18 Male 139750 Male
## 2 Prof B 20 16 Male 173200 Male
## 3 AsstProf B 4 3 Male 79750 Male
## 4 Prof B 45 39 Male 115000 Male
## 5 Prof B 40 41 Male 141500 Male
## 6 AssocProf B 6 6 Male 97000 Male
any_missing <- any(is.na(mydata)) #There is no missing values as it shows "FALSE"
summary(mydata)
## Position Discipline YearsSincePhD YearsOfService Gender
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## Salary GenderF
## Min. : 57800 Male :358
## 1st Qu.: 91000 Female: 39
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
Descriptive statistics:
In our sample, there is 67 professors that hold the position of Assistant Professor, 64 that are Associate Professors, and 266 of them are holding a position of a Professor (regular one). In our sample, we have 39 females and 358 males.
Minimum years since professors finished their PhD is 1 year, and the maximum is 56 years. The average value of years since PhD is 22.31 years.
A half of professors in the sample have up to 16 years of service till now, and the other half of the sample has been working at this industry for a longer period of time.
The average salary for the last nine months is 113.706 USD. 75% of professors from our sample have had a salary up to 134.185 USD in the last nine months, however the rest (25%) have had the salary higher than that.
library(psych) #For the variables in RQ 1
psych::describe(mydata[ , c("YearsSincePhD", "Salary")])
## vars n mean sd median trimmed mad min
## YearsSincePhD 1 397 22.31 12.89 21 21.83 14.83 1
## Salary 2 397 113706.46 30289.04 107300 111401.61 29355.48 57800
## max range skew kurtosis se
## YearsSincePhD 56 55 0.30 -0.81 0.65
## Salary 231545 173745 0.71 0.18 1520.16
Range for Years since PhD is 55, and the range for the Salary is 173.745 USD.
The average of Years since PhD is 22.31 years, and the average Salary is 113.706,46 USD.
H0: There is no correlation between the nine-month salary of the professors and the number of years since PhD.
H1: There is correlation between the nine-month salary of the professors and the number of years since PhD.
This assumption is met.
We assume this assumption is not violated, because our sample size is sufficiently large (n = 397).
We have to check that with the following procedure (scatter plot).
Normality:
H0: The variable Salary is normally distributed.
H1: The variable Salary is not normally distributed.
shapiro.test(mydata$Salary)
##
## Shapiro-Wilk normality test
##
## data: mydata$Salary
## W = 0.95988, p-value = 6.076e-09
We can reject H0 with p < 0.001, and conclude the variable Salary is not normally distributed.
Let’s check for the other variable:
H0: Variable YearsSincePhD is normally distributed.
H1: Variable YearsSincePhD is not normally distributed.
shapiro.test(mydata$YearsSincePhD)
##
## Shapiro-Wilk normality test
##
## data: mydata$YearsSincePhD
## W = 0.96957, p-value = 2.328e-07
We can reject H0 at p < 0.001, and conclude that the variable YearsSincePhD is not normally distributed.
Since normality is violated, we will do Spearman’s correlation test (even though our sample is sufficiently large). But first, let’s check the linearity assumption:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(mydata$Salary, mydata$YearsSincePhD,
smooth = TRUE,
boxplot = FALSE,
main = "Relationship between the Salary and Years since PhD",
xlab = "Years since PhD",
ylab = "Salary")
From the scatter plot, we can see it does not look like a linear relationship of the two variables.
library(car) #Scatter plot matrix
scatterplotMatrix(mydata[ , c(-1,-2,-4,-5,-7,-8, -9)], smooth=FALSE)
The relationship between professor’s nine-month salaries and years from their PhD studies is POSITIVE.
We can see from the scatter plot matrix that both the Years since PhD and Salary are not normally distributed, and therefore skewed to the right.
fit <- lm(Salary ~ YearsSincePhD, data = mydata) #A histogram of standardized residuals
mydata$StdResid <-round(rstandard(fit), 3)
mydata$CooksD <- round(cooks.distance(fit), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
We can see the outliers, where units are not included in the [-3,3] interval. However, since our sample size is quite large, by removing them, it would not make the function linear.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
As we can see from Cooks distance histogram, there are no units with high impact.
library(Hmisc) #Pearson correlation coefficient
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1,-2,-4,-5,-7,-8,-9,-10)]),
type = "pearson")
## YearsSincePhD Salary
## YearsSincePhD 1.00 0.42
## Salary 0.42 1.00
##
## n= 397
##
##
## P
## YearsSincePhD Salary
## YearsSincePhD 0
## Salary 0
**However, this test was made only for educational purposes - we shall not use it with a data base like this one, so let’s continue the testing with Spearman.
rcorr(as.matrix(mydata[ , c(-1,-2,-4,-5,-7,-8,-9,-10)]),
type = "spearman") #Spearman correlation coefficient
## YearsSincePhD Salary
## YearsSincePhD 1.00 0.48
## Salary 0.48 1.00
##
## n= 397
##
##
## P
## YearsSincePhD Salary
## YearsSincePhD 0
## Salary 0
If years since PhD increase by 1 year, nine-month salaries, on average increase by 0.48 USD, assuming all other values remain unchanged.
The relationship between these two variables is positive and semi-strong.
cor.test(mydata$Salary, mydata$YearsSincePhD,
method = "spearman",
use = "complete.obs",
exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydata$Salary and mydata$YearsSincePhD
## S = 5468989, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4755676
H0: There is no correlation between the nine-month salary of the professors and the number of years since PhD.
H1: There is a correlation between the nine-month salary of the professors and the number of years since PhD.
rho =/= 0
Conclusion: We reject H0 at p < 0.001, and conclude that there is a correlation between a nine-month Salary and Years since PhD. It is a positive and semi-strong relationship.
H0: There is no association between the two categorical variables (between Gender and position of the professor).
H1: There is an association between these two categorical variables (between the Gender and the Position).
This assumption is met (we can see it from the variable Gender).
All expected frequencies must be greater than 1.
Up to 20% of the expected frequencies is allowed to be in between 1 and 5, however, any expected frequency that is lower than 5 will reduce the power of the test.
results <- chisq.test(mydata$GenderF, mydata$Position,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: mydata$GenderF and mydata$Position
## X-squared = 8.5259, df = 2, p-value = 0.01408
The p-value = 0.014, which tells us to reject the null hypothesis (p value is lower than 0.05). Based on the data we got, we showed that there is the association between the Gender and Position of professors.
addmargins(results$observed) #Observed or empirical frequencies
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof Sum
## Male 56 54 248 358
## Female 11 10 18 39
## Sum 67 64 266 397
addmargins(round(results$expected, 2)) #Expected of theoretical frequencies
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof Sum
## Male 60.42 57.71 239.87 358
## Female 6.58 6.29 26.13 39
## Sum 67.00 64.00 266.00 397
Explanation:
The second assumption is met, as ALL of the expected frequencies are greater than 1.
As well as the third assumption is met, as ALL of the expected frequencies are also greater than 5.
round(results$res, 2) #Standardized residuals
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof
## Male -0.57 -0.49 0.52
## Female 1.72 1.48 -1.59
Standardized residuals tell us whether the discrepancies between observed and expected frequencies are statistically significant.
For educational purposes, I will explain one number, if it was statistically significant.
addmargins(round(prop.table(results$observed), 3)) #Proportional table
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof Sum
## Male 0.141 0.136 0.625 0.902
## Female 0.028 0.025 0.045 0.098
## Sum 0.169 0.161 0.670 1.000
0.141 –> Out of 397 professors, there is 14.1% of professors, who are males and at the same time on an Assisting Professor position.
Out of 397, there are 62.5% of professors that are males and are on the Professor’s position.
addmargins(round(prop.table(results$observed, 1), 3),2)
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof Sum
## Male 0.156 0.151 0.693 1.000
## Female 0.282 0.256 0.462 1.000
addmargins(round(prop.table(results$observed, 2), 3),1)
## mydata$Position
## mydata$GenderF AsstProf AssocProf Prof
## Male 0.836 0.844 0.932
## Female 0.164 0.156 0.068
## Sum 1.000 1.000 1.000
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$GenderF, mydata$Position)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.13 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.13)
## [1] "small"
## (Rules: funder2019)
Conclusion: Based on the data from our sample, we reject our null hypothesis at p < 0.001, and can conclude that there is an association between the gender of professors and the position professors have. The relationship is positive and the effect is small.
There is no Odds ratio calculation, as my table is 2 x 3.