Instructions

This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains information salary for Assistant Professors, Associate Professors and Professors in a college in the U.S in 2008-2009. For each of the variables, please check the code book here:

I’ve reviewed this dataset, and confirmed that there is no missing values.

Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10

Good luck!

1. Data Import and Exploration (2 points)

  1. Import the Salary dataset provided on Canvas, named it as Salary, and display the first few rows. (1 points)
# Your code here
Salary <- read.csv("Salaries.csv")
head(Salary, n = 5)
##       rank discipline yrs.since.phd yrs.service  sex salary
## 1     Prof          B            19          18 Male 139.75
## 2     Prof          B            20          16 Male 173.20
## 3 AsstProf          B             4           3 Male  79.75
## 4     Prof          B            45          39 Male 115.00
## 5     Prof          B            40          41 Male 141.50
  1. Use appropriate R functions to display the structure of the dataset and report how many observations and variables are in the dataset? Among these variables, how many of them are numeric? (1 points)
# Your code here
str(Salary)
## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...
##  $ salary       : num  139.8 173.2 79.8 115 141.5 ...
Salary_numeric <- Salary[, -6]
Salary_numeric <- Salary[, 1:5]
str(Salary_numeric)
## 'data.frame':    397 obs. of  5 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...

There are 3 variables that are numeric. years since phd, years of service and salary

2. Data Preprocessing and Visualization (3 points)

  1. Rename the variables to snake case (like “snake_case”) (1 points)
# Your code here
snake_case <- Salary$rank
snake_case <- Salary$discipline
snake_case <- Salary$yrs.since.phd
snake_case <- Salary$yrs.service
snake_case <- Salary$sex
snake_case <- Salary$salary
  1. Please create frequency tables for variable rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments). (1)
# Your code here
freq_rank <- table(Salary$rank)
freq_discipline <- table(Salary$discipline)
freq_rank
## 
## AssocProf  AsstProf      Prof 
##        64        67       266
freq_discipline
## 
##   A   B 
## 181 216

There are 64 Associate Professors, 67 Assistant Professors, and 266 Professors There are 181 in theoretical departments and 216 in Applied departments

  1. Create a box plot of ‘salary’ vs ‘rank’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 points)
# Your code here
library(ggplot2)
ggplot(Salary, aes(x = rank, y = salary))+
         geom_boxplot()+
         labs(title = "Salary v. Rank",
              x = "Rank of Professor",
              y = "Salary of Professor")

3. Linear Regression Analysis (5 points)

  1. Split the data into training data (80%) and testing data (20%), and name them as Salary_train and Salary_test. A part of the code was given, please finish it. (1 points)
training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]
  1. Using ‘salary’ as the response variable, and based on the training data, fit a full (linear) model (using all the other variables as predictors), and name it as model_full; fit a null (linear) model (no predictor, only an intercept), and name it as model_null. Display the summary of both the models. (1 points)
# Your code here
model_full <- lm(salary ~ ., data = Salary_train)
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_full)
## 
## Call:
## lm(formula = salary ~ ., data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.896 -13.408  -0.342   9.547  98.488 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    79.3906     5.3959  14.713  < 2e-16 ***
## rankAsstProf  -12.8095     4.4459  -2.881  0.00424 ** 
## rankProf       32.7065     3.8395   8.518 7.06e-16 ***
## disciplineB    15.9945     2.5779   6.204 1.75e-09 ***
## yrs.since.phd   0.5750     0.2509   2.292  0.02258 *  
## yrs.service    -0.5262     0.2196  -2.396  0.01716 *  
## sexMale         3.1123     4.0794   0.763  0.44608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.97 on 311 degrees of freedom
## Multiple R-squared:  0.4818, Adjusted R-squared:  0.4718 
## F-statistic:  48.2 on 6 and 311 DF,  p-value: < 2.2e-16
summary(model_null)
## 
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.124 -22.924  -6.619  20.535 117.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  113.924      1.695   67.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.23 on 317 degrees of freedom
  1. Interpret the coefficient of discipline, and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 points)

Coefficient Discipline is positive meaning on average they earn 12,934 more than comparabke faculty Coefficient years of serviece tells us that there is a 590 decrease in salary

  1. Conduct a stepwise variable selection using BIC, and name the selected model as model_step_BIC. Which variables are selected in the final model? (1 points)
# Your code here
model_full <- lm(salary ~ ., data = Salary)
model_null <- lm(salary ~ 1, data = Salary)

summary(model_full)
## 
## Call:
## lm(formula = salary ~ ., data = Salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.248 -13.211  -1.775  10.384  99.592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    78.8628     4.9903  15.803  < 2e-16 ***
## rankAsstProf  -12.9076     4.1453  -3.114  0.00198 ** 
## rankProf       32.1584     3.5406   9.083  < 2e-16 ***
## disciplineB    14.4176     2.3429   6.154 1.88e-09 ***
## yrs.since.phd   0.5351     0.2410   2.220  0.02698 *  
## yrs.service    -0.4895     0.2119  -2.310  0.02143 *  
## sexMale         4.7835     3.8587   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.54 on 390 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4463 
## F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16
summary(model_null)
## 
## Call:
## lm(formula = salary ~ 1, data = Salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.906 -22.706  -6.406  20.479 117.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   113.71       1.52    74.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.29 on 396 degrees of freedom

All varaibles are brought in the final mode

  1. Calculate the out-of-sample MSE with model_full and model_step_BIC. Based on this results, which model performs better in prediction? (1 points)
# Your code here
pred_full <- predict(model_full, data = Salary)
# pred_step <- predict(model_step_BIC, data = Salary)

mse_full <- mean((Salary$salary - pred_full)^2)
# mse_step <- mean((Salary$salary - pred_step)^2)

mse_full
## [1] 499.0336
# mse_step

MSE step is higher than mse full which means the model does not perform better


End of Exam. Please submit this RMD file along with a knitted HTML report.