TRM Practical Data Analysis - Intermediate level

Lecture 7. Simple Linear Regression

Task 1. Determine whether time since PhD was correlated with time in service

1.1 Study design:

  • A cross-sectional investigation of 397 professors to determine whether time since PhD was correlated with time in service.
  • Null hypothesis: Time since PhD was not correlated with time in service.
  • Alternative hypothesis: Time since PhD was correlated with time in service.
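The two hypotheses can also be written in symbols. This is only a notational sketch: the symbol ρ (the population Pearson correlation between time since PhD and time in service) is introduced here for illustration and is not used elsewhere in this handout. The two-sided form matches the alternative reported by cor.test() in step 1.6.

```latex
H_0:\ \rho = 0 \qquad \text{versus} \qquad H_1:\ \rho \neq 0
```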

1.2 Import the “Professorial Salaries” dataset and name it “salary”

salary = read.csv("C:\\Thach\\UTS\\Teaching\\TRM\\Practical Data Analysis\\2024_Autumn semester\\Data\\Professorial Salaries.csv")

1.3 Describe characteristics of the study sample

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ Sex + Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary, data = salary)
                         Overall (N=397)
Sex
  Female                 39 (9.8%)
  Male                   358 (90.2%)
Rank
  AssocProf              64 (16.1%)
  AsstProf               67 (16.9%)
  Prof                   266 (67.0%)
Discipline
  A                      181 (45.6%)
  B                      216 (54.4%)
Yrs.since.phd
  Mean (SD)              22.3 (12.9)
  Median [Min, Max]      21.0 [1.00, 56.0]
Yrs.service
  Mean (SD)              17.6 (13.0)
  Median [Min, Max]      16.0 [0, 60.0]
NPubs
  Mean (SD)              18.2 (14.0)
  Median [Min, Max]      13.0 [1.00, 69.0]
Ncits
  Mean (SD)              40.2 (16.9)
  Median [Min, Max]      35.0 [1.00, 90.0]
Salary
  Mean (SD)              114000 (30300)
  Median [Min, Max]      107000 [57800, 232000]

1.4 Check the distribution of time since PhD and time in service

library(ggplot2)
library(grid)
library(gridExtra)
# Histograms on the density scale. after_stat(density) replaces the deprecated ..density.. notation,
# and bins = 30 states ggplot2's default explicitly (adjust bins or binwidth as needed).
p1 = ggplot(data = salary, aes(x = Yrs.since.phd)) + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue") + ggtitle("Time since PhD (years)") + theme_bw()
p2 = ggplot(data = salary, aes(x = Yrs.service)) + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue") + ggtitle("Time in service (years)") + theme_bw()

# Place the two histograms side by side under a shared title
grid.arrange(p1, p2, nrow = 1, top = textGrob("Distribution of numeric variables", gp = gpar(fontsize = 20, font = 1)))

1.5 Describe the correlation between time since PhD and time in service

ggplot(data = salary, aes(x = Yrs.since.phd, y = Yrs.service)) + geom_point() + geom_smooth() + labs(x = "Time since PhD (years)", y = "Time in service (years)") + ggtitle("Correlation between time since PhD and time in service") + theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

1.6 Correlation analysis to determine whether time since PhD was correlated with time in service

cor.test(salary$Yrs.since.phd, salary$Yrs.service, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  salary$Yrs.since.phd and salary$Yrs.service
## t = 43.524, df = 395, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8909977 0.9252353
## sample estimates:
##       cor 
## 0.9096491

Interpretation: There was strong evidence (P < 0.0001) that time since PhD was positively correlated with time in service (Pearson’s correlation coefficient = 0.91; 95% CI: 0.89 to 0.93).
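As a cross-check, the test statistic and confidence interval printed by cor.test() can be reproduced by hand from the reported correlation and sample size. The sketch below uses only the numbers shown in the output above. The variable names (r, n, t_stat, ci) are chosen here for illustration. It assumes the standard formulas: a t statistic with n - 2 degrees of freedom, and a confidence interval based on Fisher's z transformation.

```r
# Values copied from the cor.test() output above
r <- 0.9096491   # sample Pearson correlation
n <- 397         # number of professors

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z <- atanh(r)                 # transform r to the z scale
se_z <- 1 / sqrt(n - 3)       # approximate standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)  # back-transform the limits

print(round(t_stat, 3))
print(round(ci, 4))
```

The printed values should agree with the t statistic and the 95 percent confidence interval in the output above, up to rounding.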

Task 2. Determine whether the number of citations was associated with professors’ salaries

2.1 Study design:

  • A cross-sectional investigation of 397 professors to determine whether the number of citations was associated with professors’ salaries.
  • Null hypothesis: Number of citations was not associated with professors’ salaries.
  • Alternative hypothesis: Number of citations was associated with professors’ salaries.
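In simple linear regression these hypotheses are usually stated in terms of the slope. As a notational sketch, the symbols β0 (intercept), β1 (slope) and ε (random error) are introduced here for illustration and are not used elsewhere in this handout:

```latex
\text{Salary}_i = \beta_0 + \beta_1\,\text{Ncits}_i + \varepsilon_i,
\qquad
H_0:\ \beta_1 = 0 \quad \text{versus} \quad H_1:\ \beta_1 \neq 0
```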

2.2 Import the “Professorial Salaries” dataset and name it “salary”

salary = read.csv("C:\\Thach\\UTS\\Teaching\\TRM\\Practical Data Analysis\\2024_Autumn semester\\Data\\Professorial Salaries.csv")

2.3 Describe characteristics of the study sample

library(table1)
table1(~ Sex + Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary, data = salary)
                         Overall (N=397)
Sex
  Female                 39 (9.8%)
  Male                   358 (90.2%)
Rank
  AssocProf              64 (16.1%)
  AsstProf               67 (16.9%)
  Prof                   266 (67.0%)
Discipline
  A                      181 (45.6%)
  B                      216 (54.4%)
Yrs.since.phd
  Mean (SD)              22.3 (12.9)
  Median [Min, Max]      21.0 [1.00, 56.0]
Yrs.service
  Mean (SD)              17.6 (13.0)
  Median [Min, Max]      16.0 [0, 60.0]
NPubs
  Mean (SD)              18.2 (14.0)
  Median [Min, Max]      13.0 [1.00, 69.0]
Ncits
  Mean (SD)              40.2 (16.9)
  Median [Min, Max]      35.0 [1.00, 90.0]
Salary
  Mean (SD)              114000 (30300)
  Median [Min, Max]      107000 [57800, 232000]

2.4 Check the distribution of professors’ salaries

library(ggplot2)
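# Histogram on the density scale. after_stat(density) replaces the deprecated ..density.. notation,
# and bins = 30 states ggplot2's default explicitly (adjust bins or binwidth as needed).
ggplot(data = salary, aes(x = Salary)) + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue") + ggtitle("Professors' salaries (USD)") + theme_bw()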

2.5 Fit the model

m1 = lm(Salary ~ Ncits, data = salary)
summary(m1)
## 
## Call:
## lm(formula = Salary ~ Ncits, data = salary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61660 -23012  -5654  20638 120083 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105664.57    3899.87  27.094   <2e-16 ***
## Ncits          199.93      89.36   2.237   0.0258 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30140 on 395 degrees of freedom
## Multiple R-squared:  0.01251,    Adjusted R-squared:  0.01001 
## F-statistic: 5.005 on 1 and 395 DF,  p-value: 0.02583
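Reading the two coefficients off the summary above gives the fitted regression line. The worked prediction below uses 40 citations purely as an illustrative value (it is close to the sample mean of 40.2); it is a sketch of how the equation is applied, not an additional result from the analysis.

```latex
\begin{aligned}
\widehat{\text{Salary}} &= 105664.57 + 199.93 \times \text{Ncits} \\
\widehat{\text{Salary}}\,\big|_{\text{Ncits}=40} &= 105664.57 + 199.93 \times 40 = 113661.77
\end{aligned}
```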

2.6 Check the model’s assumptions

par(mfrow = c(2,2))
plot(m1)

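Interpretation: Based on the four diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage), the assumptions of the linear regression model are considered met.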
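The visual check can be complemented with simple numeric summaries of the residuals. The sketch below is self-contained and runs on simulated data, not on the salary dataset. The data frame sim, the model fit, and all simulated values are hypothetical and serve only to show the calls. To apply it to the analysis above, replace fit with m1.

```r
# Simulated example data (hypothetical, for illustration only)
set.seed(1)
sim <- data.frame(x = runif(200, min = 1, max = 90))
sim$y <- 100000 + 200 * sim$x + rnorm(200, sd = 30000)

# Fit a simple linear regression, as in step 2.5
fit <- lm(y ~ x, data = sim)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
sw <- shapiro.test(res)
print(sw)

# Constant variance: a rough check that regresses squared residuals on the predictor;
# a slope clearly different from zero would suggest non-constant variance
var_check <- lm(I(res^2) ~ x, data = sim)
print(summary(var_check)$coefficients)
```

Formal tests like these are sensitive to sample size, so they are best read alongside the diagnostic plots, not in place of them.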

2.7 Interpret the result

There was evidence (P = 0.0258) that each additional citation was associated with an average increase of $199.9 in professors’ salaries (95% CI: approximately $25 to $375).
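The range quoted for the slope can be recomputed from the estimate and standard error in the summary output of step 2.5. The sketch below uses only those reported numbers. The variable names (est, se, df_resid) are chosen here for illustration. With the fitted model available, confint(m1) gives the interval directly.

```r
# Values copied from the summary(m1) output in step 2.5
est <- 199.93      # estimated slope for Ncits
se <- 89.36        # standard error of the slope
df_resid <- 395    # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df_resid)

# 95% confidence interval: estimate +/- t quantile * standard error
ci <- est + c(-1, 1) * qt(0.975, df = df_resid) * se

print(round(t_val, 3))
print(round(p_val, 4))
print(round(ci, 1))
```

The interval computed this way should land close to the reported $25 to $375. Small differences come from rounding and from using the t quantile in place of 1.96.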

Task 3. Save your work and upload it to your RPubs account