Instructions

This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains information salary for Assistant Professors, Associate Professors and Professors in a college in the U.S in 2008-2009. For each of the variables, please check the code book here:

rank: a factor with levels: AssocProf, AsstProf, Prof.
discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments).
yrs.since.phd: years since PhD.
yrs.service: years of service.
sex: a factor with levels Female, Male.
salary: nine-month salary, in thousand dollars.

I’ve reviewed this dataset, and confirmed that there is no missing values.

Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10

Good luck!

1. Data Import and Exploration (2 points)

Import the Salary dataset provided on Canvas, named it as Salary, and display the first few rows. (1 points)

# Your code here
Salary <- read.csv("salaries.csv")
head(Salary, n = 5)

##       rank discipline yrs.since.phd yrs.service  sex salary
## 1     Prof          B            19          18 Male 139.75
## 2     Prof          B            20          16 Male 173.20
## 3 AsstProf          B             4           3 Male  79.75
## 4     Prof          B            45          39 Male 115.00
## 5     Prof          B            40          41 Male 141.50

Use appropriate R functions to display the structure of the dataset and report how many observations and variables are in the dataset? Among these variables, how many of them are numeric? (1 points)

# Your code here
str(Salary)

## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...
##  $ salary       : num  139.8 173.2 79.8 115 141.5 ...

There are 397 observations and there are 6 variables in the dataset. Among the dataset there are 3 numeric variables.

2. Data Preprocessing and Visualization (3 points)

Rename the variables to snake case (like “snake_case”) (1 points)

# Your code here
library(snakecase)
library(ggplot2)

Please create frequency tables for variable rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments). (1)

# Your code here
# ranks
ggplot(Salary, aes(x = rank, fill = rank)) +
geom_bar(alpha = 0.8) +
ggtitle("Frequency of Faculty Ranks") +
xlab("Rank") +
ylab("Count") +
theme_minimal()

# discipline

ggplot(Salary, aes(x = discipline, fill = discipline)) +
geom_bar(alpha = 0.8) +
ggtitle("Frequency of Disciplines") +
xlab("Discipline (A = Theoretical, B = Applied)") +
ylab("Count") +
theme_minimal()

Create a box plot of ‘salary’ vs ‘rank’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 points)

# Your code here
library(ggplot2)

ggplot(Salary, aes(x = rank, y = salary)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Academic Rank",
    x = "Rank",
    y = "Nine-month Salary (thousands of dollars)"
  )

3. Linear Regression Analysis (5 points)

Split the data into training data (80%) and testing data (20%), and name them as Salary_train and Salary_test. A part of the code was given, please finish it. (1 points)

training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here

dim(Salary)

## [1] 397   6

training_index <- sample(1:dim(Salary)[1], round(0.8 * dim(Salary)[1]))
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]

model_null <- lm(salary ~ 1, data = Salary_train)

model_full <- lm(salary ~ ., data = Salary_train)

Using ‘salary’ as the response variable, and based on the training data, fit a full (linear) model (using all the other variables as predictors), and name it as model_full; fit a null (linear) model (no predictor, only an intercept), and name it as model_null. Display the summary of both the models. (1 points)

# Your code here

# Full
model_full <- lm(salary ~., data = Salary_train)
summary(model_full)

## 
## Call:
## lm(formula = salary ~ ., data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.146 -13.160  -0.444  10.980  99.975 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    76.0956     5.5267  13.769  < 2e-16 ***
## rankAsstProf  -12.7625     4.6863  -2.723  0.00683 ** 
## rankProf       32.2112     3.9576   8.139 9.68e-15 ***
## disciplineB    14.1125     2.6824   5.261 2.67e-07 ***
## yrs.since.phd   0.7685     0.2746   2.798  0.00546 ** 
## yrs.service    -0.7161     0.2460  -2.912  0.00386 ** 
## sexMale         7.1618     4.3933   1.630  0.10408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.96 on 311 degrees of freedom
## Multiple R-squared:  0.4464, Adjusted R-squared:  0.4357 
## F-statistic:  41.8 on 6 and 311 DF,  p-value: < 2.2e-16

# Null
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_null)

## 
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.373 -23.041  -6.923  20.286 117.372 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  114.173      1.714   66.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.57 on 317 degrees of freedom

Interpret the coefficient of discipline, and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 points)

discipline is a factor (baseline is the first level, typically “A” after cleaning). The coefficient for disciplineB represents the average difference in salary (in thousands) between discipline B and discipline A, holding other variables constant. A positive estimate means B tends to earn more than A by that amount; a negative estimate means the opposite.

yrs_service is numeric. Its coefficient represents the expected change in salary (in thousands) for a one-year increase in service, holding other variables constant. Positive → salary increases with service; negative → salary decreases with additional service, all else equal.

Conduct a stepwise variable selection using BIC, and name the selected model as model_step_BIC. Which variables are selected in the final model? (1 points)

# Your code here

n_train <- nrow(Salary_train)
model_step_BIC <- step(
  model_null,
  scope = list(lower = model_null, upper = model_full),
  direction = "both",
  k = log(n_train),
  trace = 0
)

summary(model_step_BIC)

## 
## Call:
## lm(formula = salary ~ rank + discipline, data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.347 -14.198  -1.253  10.656  97.639 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    85.772      3.479  24.654  < 2e-16 ***
## rankAsstProf  -13.222      4.536  -2.915  0.00381 ** 
## rankProf       35.011      3.504   9.991  < 2e-16 ***
## disciplineB    13.123      2.650   4.952  1.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.26 on 314 degrees of freedom
## Multiple R-squared:  0.4262, Adjusted R-squared:  0.4207 
## F-statistic: 77.75 on 3 and 314 DF,  p-value: < 2.2e-16

[Your comments here]

Calculate the out-of-sample MSE with model_full and model_step_BIC. Based on this results, which model performs better in prediction? (1 points)

# Your code here

pred_full <- predict(model_full, newdata = Salary_test)
pred_step <- predict(model_step_BIC, newdata = Salary_test)

# Actuals
y_test <- Salary_test$salary

# MSEs
mse_full <- mean((y_test - pred_full)^2, na.rm = TRUE)
mse_step <- mean((y_test - pred_step)^2, na.rm = TRUE)

cat("Out-of-sample MSE (Full):", round(mse_full, 4), "\n")

## Out-of-sample MSE (Full): 444.3553

cat("Out-of-sample MSE (BIC Stepwise):", round(mse_step, 4), "\n")

## Out-of-sample MSE (BIC Stepwise): 403.3584

Thank you professor!