# Introduction

This study examines what determines professors' salaries at universities, and why transparent and fair pay matters for a healthy learning environment. It is motivated by the question of how compensation affects a university's ability to recruit and retain good teachers. Using statistical methods, chiefly linear regression, it looks at years since the Ph.D., years of service, academic rank, discipline, and sex. The aim is to produce practical evidence that universities can use to refine their pay policies, stay competitive for a diverse and talented faculty, and contribute to the wider discussion of fairness in education. The sections that follow describe the data, the analysis, and the results.

### Description of the data

The data contain 397 individuals (professors) and 6 variables: rank, discipline, yrs.since.phd, yrs.service, sex, and salary.

library(readxl)
# Load the spreadsheet before inspecting it
data <- readxl::read_excel("E:/Khi tôi học/2023.1/Thống kê ứng dụng/xlsx/ProfessorSalaries.xlsx")
dim(data)
## [1] 397   6
names(data)
## [1] "rank"          "discipline"    "yrs.since.phd" "yrs.service"  
## [5] "sex"           "salary"
str(data)
## tibble [397 × 6] (S3: tbl_df/tbl/data.frame)
##  $ rank         : chr [1:397] "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr [1:397] "B" "B" "B" "B" ...
##  $ yrs.since.phd: num [1:397] 19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : num [1:397] 18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr [1:397] "Male" "Male" "Male" "Male" ...
##  $ salary       : num [1:397] 139750 173200 79750 115000 141500 ...
summary(data)
##      rank            discipline        yrs.since.phd    yrs.service   
##  Length:397         Length:397         Min.   : 1.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:12.00   1st Qu.: 7.00  
##  Mode  :character   Mode  :character   Median :21.00   Median :16.00  
##                                        Mean   :22.31   Mean   :17.61  
##                                        3rd Qu.:32.00   3rd Qu.:27.00  
##                                        Max.   :56.00   Max.   :60.00  
##      sex                salary      
##  Length:397         Min.   : 57800  
##  Class :character   1st Qu.: 91000  
##  Mode  :character   Median :107300  
##                     Mean   :113706  
##                     3rd Qu.:134185  
##                     Max.   :231545
hist(data$yrs.since.phd, main = "Years Since Ph.D.", xlab = "Years")

library(ggplot2)

# Fit a linear regression model
model <- lm(salary ~ yrs.since.phd + yrs.service, data = data)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = salary ~ yrs.since.phd + yrs.service, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79735 -19823  -2617  15149 106149 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    89912.2     2843.6  31.620  < 2e-16 ***
## yrs.since.phd   1562.9      256.8   6.086 2.75e-09 ***
## yrs.service     -629.1      254.5  -2.472   0.0138 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27360 on 394 degrees of freedom
## Multiple R-squared:  0.1883, Adjusted R-squared:  0.1842 
## F-statistic: 45.71 on 2 and 394 DF,  p-value: < 2.2e-16
# Visualize the regression results
ggplot(data, aes(x = yrs.since.phd, y = salary)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression: Salary ~ Years Since PhD")
## `geom_smooth()` using formula = 'y ~ x'

# Create hypothetical (simulated) data
deptrai <- data.frame(
  Salary = rnorm(100),
  yrs_since_phd = rnorm(100),
  yrs_service = rnorm(100),
  academic_rank = factor(sample(c("Assistant", "Associate", "Full"), 100, replace = TRUE)),
  discipline = factor(sample(c("A", "B"), 100, replace = TRUE)),
  gender = factor(sample(c("Male", "Female"), 100, replace = TRUE))
)

# Show summary statistics for the "deptrai" dataset
summary(deptrai)
##      Salary           yrs_since_phd        yrs_service          academic_rank
##  Min.   :-3.1012459   Min.   :-2.246945   Min.   :-3.264948   Assistant:35   
##  1st Qu.:-0.5169933   1st Qu.:-0.780496   1st Qu.:-0.553335   Associate:34   
##  Median :-0.0572677   Median : 0.052960   Median :-0.040682   Full     :31   
##  Mean   :-0.0003183   Mean   : 0.003702   Mean   :-0.004434                  
##  3rd Qu.: 0.6786466   3rd Qu.: 0.791358   3rd Qu.: 0.711722                  
##  Max.   : 2.9018228   Max.   : 1.851144   Max.   : 2.149423                  
##  discipline    gender  
##  A:59       Female:58  
##  B:41       Male  :42  
##                        
##                        
##                        
## 
# Build the regression formula as an unevaluated expression
formula_text <- bquote(Salary == beta[0] + beta[1] %*% yrs.since.phd + beta[2] %*% yrs.service + beta[3] %*% academic.rank + beta[4] %*% discipline + beta[5] %*% gender + epsilon)

# Display the formula in the console; deparse() keeps the expression in reading order
cat("Regression Formula:\n", paste(deparse(formula_text, width.cutoff = 500), collapse = " "), "\n")
## Regression Formula:
##  Salary == beta[0] + beta[1] %*% yrs.since.phd + beta[2] %*% yrs.service + beta[3] %*% academic.rank + beta[4] %*% discipline + beta[5] %*% gender + epsilon
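
The formula above treats academic rank as a single term, whereas `lm()` expands a factor with k levels into k - 1 indicator variables, so the fitted model has one more coefficient than the expression suggests. A sketch of the same model written out in LaTeX follows. The indicator names are taken from the coefficient names reported in the multiple regression summary; the reference levels (the remaining rank, discipline A, and Female) are inferred from those names rather than stated in the source.

```latex
\[
\begin{aligned}
\text{salary}_i ={}& \beta_0
  + \beta_1\,\text{yrs.since.phd}_i
  + \beta_2\,\text{yrs.service}_i \\
 &+ \beta_3\,\text{rankAsstProf}_i
  + \beta_4\,\text{rankProf}_i
  + \beta_5\,\text{disciplineB}_i
  + \beta_6\,\text{sexMale}_i
  + \varepsilon_i
\end{aligned}
\]
```

Each indicator equals 1 when professor i belongs to the named category and 0 otherwise, so its coefficient is a salary difference relative to the reference level.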
boxplot(data$salary, main="Boxplot of Salary")

par(mfrow=c(2, 2))  # Set up a 2x2 grid for subplots

hist(data$yrs.since.phd, main="Histogram of Years Since PhD", xlab="Years Since PhD")
hist(data$yrs.service, main="Histogram of Years of Service", xlab="Years of Service")
hist(data$salary, main="Histogram of Salary", xlab="Salary")

barplot(table(data$rank), main="Bar Plot of Academic Rank", xlab="Rank", ylab="Frequency", col="lightblue")

barplot(table(data$discipline), main="Bar Plot of Discipline", xlab="Discipline", ylab="Frequency", col="lightgreen")
barplot(table(data$sex), main="Bar Plot of Gender", xlab="Gender", ylab="Frequency", col="lightpink")

# Scatterplot matrix
pairs(data[, c("yrs.since.phd", "yrs.service", "salary")], main="Scatterplot Matrix")

# Correlation matrix
cor_matrix <- cor(data[, c("yrs.since.phd", "yrs.service", "salary")])
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(cor_matrix)
##               yrs.since.phd yrs.service    salary
## yrs.since.phd     1.0000000   0.9096491 0.4192311
## yrs.service       0.9096491   1.0000000 0.3347447
## salary            0.4192311   0.3347447 1.0000000
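
The correlation of 0.91 between yrs.since.phd and yrs.service indicates strong collinearity between the two experience measures. As a rough worked check (not part of the original analysis), the variance inflation factor for a model containing only these two predictors follows directly from their correlation r:

```latex
\[
\mathrm{VIF} = \frac{1}{1 - r^{2}}
             = \frac{1}{1 - 0.9096^{2}}
             \approx \frac{1}{0.1726}
             \approx 5.8
\]
```

A value near 5.8 means the sampling variance of each slope is roughly 5.8 times what it would be with uncorrelated predictors. This is consistent with the standard error of yrs.since.phd being 107.4 in the simple regression and 256.8 in the two-predictor model.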
# Simple linear regression: Salary ~ yrs.since.phd
lm_yrs_since_phd <- lm(salary ~ yrs.since.phd, data = data)
print("Simple Linear Regression: Salary ~ yrs.since.phd")
## [1] "Simple Linear Regression: Salary ~ yrs.since.phd"
print(summary(lm_yrs_since_phd))
## 
## Call:
## lm(formula = salary ~ yrs.since.phd, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84171 -19432  -2858  16086 102383 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    91718.7     2765.8  33.162   <2e-16 ***
## yrs.since.phd    985.3      107.4   9.177   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27530 on 395 degrees of freedom
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1737 
## F-statistic: 84.23 on 1 and 395 DF,  p-value: < 2.2e-16
# Simple linear regression: Salary ~ yrs.service
lm_yrs_service <- lm(salary ~ yrs.service, data = data)
print("Simple Linear Regression: Salary ~ yrs.service")
## [1] "Simple Linear Regression: Salary ~ yrs.service"
print(summary(lm_yrs_service))
## 
## Call:
## lm(formula = salary ~ yrs.service, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81933 -20511  -3776  16417 101947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  99974.7     2416.6   41.37  < 2e-16 ***
## yrs.service    779.6      110.4    7.06 7.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28580 on 395 degrees of freedom
## Multiple R-squared:  0.1121, Adjusted R-squared:  0.1098 
## F-statistic: 49.85 on 1 and 395 DF,  p-value: 7.529e-12
# Multiple linear regression
multiple_regression_model <- lm(salary ~ yrs.since.phd + yrs.service + rank + discipline + sex, data = data)

# Summary of the regression model
summary(multiple_regression_model)
## 
## Call:
## lm(formula = salary ~ yrs.since.phd + yrs.service + rank + discipline + 
##     sex, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65248 -13211  -1775  10384  99592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    78862.8     4990.3  15.803  < 2e-16 ***
## yrs.since.phd    535.1      241.0   2.220  0.02698 *  
## yrs.service     -489.5      211.9  -2.310  0.02143 *  
## rankAsstProf  -12907.6     4145.3  -3.114  0.00198 ** 
## rankProf       32158.4     3540.6   9.083  < 2e-16 ***
## disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
## sexMale         4783.5     3858.7   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4463 
## F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16
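
To make the coefficient table concrete, the fitted equation can be evaluated by hand for one profile. The following is a minimal sketch in base R, with the coefficients copied (rounded) from the summary above. The profile, a male full professor in discipline B with 20 years since the Ph.D. and 15 years of service, is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the multiple regression summary (rounded as printed)
coefs <- c(
  "(Intercept)"   = 78862.8,
  "yrs.since.phd" = 535.1,
  "yrs.service"   = -489.5,
  "rankAsstProf"  = -12907.6,
  "rankProf"      = 32158.4,
  "disciplineB"   = 14417.6,
  "sexMale"       = 4783.5
)

# Hypothetical profile (an assumption, not a row from the data):
# indicators are 1 when the professor belongs to the named category
profile <- c(
  "(Intercept)"   = 1,
  "yrs.since.phd" = 20,
  "yrs.service"   = 15,
  "rankAsstProf"  = 0,
  "rankProf"      = 1,
  "disciplineB"   = 1,
  "sexMale"       = 1
)

# Predicted salary is the sum of coefficient * predictor value
predicted_salary <- sum(coefs * profile[names(coefs)])
print(predicted_salary)
```

With the fitted object available, `predict(multiple_regression_model, newdata = ...)` returns the same quantity from the unrounded coefficients, so small differences from this hand calculation are expected.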

# Conclusion

This study examined factors associated with faculty salaries using a dataset of 397 professors that records academic rank, discipline, years since the Ph.D., years of service, and sex. Histograms, bar plots, and a boxplot summarized the distribution of each variable. The scatterplot and correlation matrices showed that years since the Ph.D. and years of service are strongly correlated with each other (r = 0.91), while each is only moderately correlated with salary (r = 0.42 and r = 0.33, respectively).

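A multiple regression combining all five predictors explained about 45% of the variation in salary (R² = 0.4547), compared with about 18% and 11% for the simple regressions on years since the Ph.D. and years of service. Academic rank and discipline were the strongest predictors: holding the other variables constant, full professors were paid about 32,200 more than the reference rank, and faculty in discipline B about 14,400 more than those in discipline A. Years of service was positively related to salary on its own (about 780 per year), but its coefficient turned negative (about -490 per year) once the other variables were included, a sign reversal consistent with the strong correlation between the two experience measures. The estimated difference of about 4,800 in favour of men was not statistically significant (p = 0.216).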

These results give universities an evidence base for reviewing compensation structures with fairness, equity, and diversity in mind. The study has limitations: the model leaves more than half of the salary variation unexplained, the two experience measures are highly collinear, and the data describe a single sample. Although the sex coefficient was not statistically significant here, the point estimate favours men, so continued monitoring of pay by gender remains warranted as part of fostering an inclusive and equitable environment for faculty.