Instructions

This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains information salary for Assistant Professors, Associate Professors and Professors in a college in the U.S in 2008-2009. For each of the variables, please check the code book here:

I’ve reviewed this dataset, and confirmed that there is no missing values.

Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10

Good luck!

1. Data Import and Exploration (2 points)

  1. Import the Salary dataset provided on Canvas, named it as Salary, and display the first few rows. (1 points)
# Your code here
salary <- read.csv("salaries.csv")
head(salary, n=3)
##       rank discipline yrs.since.phd yrs.service  sex salary
## 1     Prof          B            19          18 Male 139.75
## 2     Prof          B            20          16 Male 173.20
## 3 AsstProf          B             4           3 Male  79.75
  1. Use appropriate R functions to display the structure of the dataset and report how many observations and variables are in the dataset? Among these variables, how many of them are numeric? (1 points)
# Your code here
str(salary)
## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...
##  $ salary       : num  139.8 173.2 79.8 115 141.5 ...

[Your answer here]

In total, there are 397 observations, and 6 variables. 4 of the variables are numerical, and 2 of them is character variable.

2. Data Preprocessing and Visualization (3 points)

  1. Rename the variables to snake case (like “snake_case”) (1 points)
# Your code here
  1. Please create frequency tables for variable rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments). (1)
# Your code here
table(salary$rank)
## 
## AssocProf  AsstProf      Prof 
##        64        67       266
table(salary$discipline)
## 
##   A   B 
## 181 216
table(salary$rank, salary$discipline)
##            
##               A   B
##   AssocProf  26  38
##   AsstProf   24  43
##   Prof      131 135
  1. Create a box plot of ‘salary’ vs ‘rank’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 points)
# Your code here
boxplot(salary$salary ~ salary$rank, main = "Boxplot of Salary for different ranks", 
        xlab = "Rank",ylab = "Salary")

# ggplot2
library(ggplot2)

ggplot (data = salary, aes(y = salary, x = factor(rank))) +
  geom_boxplot() +
  labs(title = "Salary for different ranks", x = "Rank", y = "Salary")

3. Linear Regression Analysis (5 points)

  1. Split the data into training data (80%) and testing data (20%), and name them as Salary_train and Salary_test. A part of the code was given, please finish it. (1 points)
training_index <- sample(1:nrow(salary), round(0.8 * nrow(salary)))
# Your code here
Salary_train <- salary[training_index, ]
Salary_test <- salary[-training_index, ]
  1. Using ‘salary’ as the response variable, and based on the training data, fit a full (linear) model (using all the other variables as predictors), and name it as model_full; fit a null (linear) model (no predictor, only an intercept), and name it as model_null. Display the summary of both the models. (1 points)
# Your code here
model_full <- lm(salary ~ ., data = Salary_train)
summary(model_full)
## 
## Call:
## lm(formula = salary ~ ., data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.319 -13.335  -1.770   9.734 100.247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    79.0747     5.6618  13.966  < 2e-16 ***
## rankAsstProf  -12.4588     4.5314  -2.749  0.00632 ** 
## rankProf       32.0883     4.0213   7.980 2.85e-14 ***
## disciplineB    14.4154     2.6085   5.526 6.93e-08 ***
## yrs.since.phd   0.3970     0.2752   1.442  0.15023    
## yrs.service    -0.3707     0.2412  -1.537  0.12532    
## sexMale         4.7233     4.3464   1.087  0.27801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.68 on 311 degrees of freedom
## Multiple R-squared:  0.449,  Adjusted R-squared:  0.4384 
## F-statistic: 42.24 on 6 and 311 DF,  p-value: < 2.2e-16
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_null)
## 
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.953 -22.166  -6.212  19.317 119.792 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  111.753      1.697   65.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.27 on 317 degrees of freedom
  1. Interpret the coefficient of discipline, and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 points)

[Your interpretation here]

The coefficient of discipline represents the average difference in the salaries of the two academic disciplines. The coefficient of years of service represents the expected change in salary for each additonal year os service.

  1. Conduct a stepwise variable selection using BIC, and name the selected model as model_step_BIC. Which variables are selected in the final model? (1 points)
# Your code here
model_step_bic <- step(model_null, 
                       scope = list(lower = model_null, upper = model_full), direction = "both", 
                       k = log(nrow(Salary_train)))
## Start:  AIC=2173.53
## salary ~ 1
## 
##                 Df Sum of Sq    RSS    AIC
## + rank           2    112809 177563 2028.7
## + yrs.since.phd  1     52222 238150 2116.2
## + yrs.service    1     34367 256005 2139.2
## + discipline     1      7326 283046 2171.2
## <none>                       290372 2173.5
## + sex            1      3938 286434 2174.9
## 
## Step:  AIC=2028.65
## salary ~ rank
## 
##                 Df Sum of Sq    RSS    AIC
## + discipline     1     15802 161761 2004.8
## <none>                       177563 2028.7
## + sex            1       762 176801 2033.0
## + yrs.service    1       666 176897 2033.2
## + yrs.since.phd  1       211 177352 2034.0
## - rank           2    112809 290372 2173.5
## 
## Step:  AIC=2004.77
## salary ~ rank + discipline
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                       161761 2004.8
## + sex            1       528 161233 2009.5
## + yrs.service    1       111 161650 2010.3
## + yrs.since.phd  1        49 161712 2010.4
## - discipline     1     15802 177563 2028.7
## - rank           2    121284 283046 2171.2
summary(model_step_bic)
## 
## Call:
## lm(formula = salary ~ rank + discipline, data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.141 -13.481  -1.922   8.565  98.845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    85.128      3.457  24.622  < 2e-16 ***
## rankAsstProf  -13.287      4.321  -3.075  0.00229 ** 
## rankProf       33.364      3.527   9.460  < 2e-16 ***
## disciplineB    14.208      2.565   5.538 6.46e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.7 on 314 degrees of freedom
## Multiple R-squared:  0.4429, Adjusted R-squared:  0.4376 
## F-statistic: 83.22 on 3 and 314 DF,  p-value: < 2.2e-16

[Your comments here]

Start from the null model and allow variables to be added or removed from the full model based on their contribution to improving model fit. The penalty term ensures selection follows BIC instead of AIC. The final selected model is stored as and summary displays the chosen variables.

  1. Calculate the out-of-sample MSE with model_full and model_step_BIC. Based on this results, which model performs better in prediction? (1 points)
# Your code here
pred_full <- predict(model_full, newdata = Salary_test)
pred_step <- predict(model_step_bic, newdata = Salary_test)

mse_full <- mean((Salary_test$salary - pred_full)^2)
mse_step <- mean((Salary_test$salary - pred_step)^2)

mse_full
## [1] 489.322
mse_step
## [1] 508.9849

[Your comments here]

The out of sample MSE measure how well each model predicts unseen data. Lower MSE mean better predictive performance.


End of Exam. Please submit this RMD file along with a knitted HTML report.