Instructions

This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains salary information for Assistant Professors, Associate Professors, and Professors at a college in the U.S. in 2008-2009. For a description of each variable, please check the code book here:

I’ve reviewed this dataset and confirmed that there are no missing values.

Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10

Good luck!

1. Data Import and Exploration (2 points)

  1. Import the Salary dataset provided on Canvas, name it Salary, and display the first few rows. (1 point)
# Your code here
Salary <- read.csv("salaries.csv")
head(Salary, n=5)
##       rank discipline yrs.since.phd yrs.service  sex salary
## 1     Prof          B            19          18 Male 139.75
## 2     Prof          B            20          16 Male 173.20
## 3 AsstProf          B             4           3 Male  79.75
## 4     Prof          B            45          39 Male 115.00
## 5     Prof          B            40          41 Male 141.50
  1. Use appropriate R functions to display the structure of the dataset, and report how many observations and variables it contains. Among these variables, how many are numeric? (1 point)
# Your code here
str(Salary)
## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : chr  "Prof" "Prof" "AsstProf" "Prof" ...
##  $ discipline   : chr  "B" "B" "B" "B" ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : chr  "Male" "Male" "Male" "Male" ...
##  $ salary       : num  139.8 173.2 79.8 115 141.5 ...

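There are 397 observations and 6 variables. Three variables are numeric: salary is stored as a double (num), while yrs.since.phd and yrs.service are stored as integers (int). The other three (rank, discipline, and sex) are character variables.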
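The count of numeric columns can also be checked programmatically rather than read off the `str()` output. A minimal sketch, using a hand-typed data frame that copies only the five rows printed by `head(Salary)` above (the name `toy` is illustrative and not part of the exam):

```r
# Hand-typed copy of the five rows printed by head(Salary) above
toy <- data.frame(
  rank          = c("Prof", "Prof", "AsstProf", "Prof", "Prof"),
  discipline    = c("B", "B", "B", "B", "B"),
  yrs.since.phd = c(19L, 20L, 4L, 45L, 40L),
  yrs.service   = c(18L, 16L, 3L, 39L, 41L),
  sex           = c("Male", "Male", "Male", "Male", "Male"),
  salary        = c(139.75, 173.20, 79.75, 115.00, 141.50)
)

dim(toy)                           # number of rows, then number of columns
is_num <- sapply(toy, is.numeric)  # TRUE for both integer and double columns
names(toy)[is_num]                 # which columns are numeric
sum(is_num)                        # how many columns are numeric
```

Because `is.numeric()` returns TRUE for integer as well as double columns, the same two lines applied to the full Salary data frame would count all numeric variables in one step.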

2. Data Preprocessing and Visualization (3 points)

  1. Rename the variables to snake case (like “snake_case”). (1 point)
# Your code here
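# Work on a copy so later chunks can keep using the original dotted names
snake_case <- Salary
names(snake_case) <- gsub(".", "_", names(snake_case), fixed = TRUE)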
  1. Please create frequency tables for the variables rank and discipline (how many AsstProf, AssocProf, and Prof there are, and how many faculty are in theoretical departments and how many in applied departments). (1 point)
# Your code here
freq_table <- table(Salary$rank, Salary$discipline)
freq_table
##            
##               A   B
##   AssocProf  26  38
##   AsstProf   24  43
##   Prof      131 135
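The question asks for counts per rank and per discipline, which are the row and column totals of the cross-tabulation above. A minimal sketch that rebuilds the table from the printed counts (the object name `freq` is illustrative) and extracts those totals:

```r
# Rebuild the printed cross-tabulation from its counts (filled column by column)
freq <- as.table(matrix(
  c(26, 24, 131,    # discipline A
    38, 43, 135),   # discipline B
  nrow = 3,
  dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                  discipline = c("A", "B"))
))

margin.table(freq, 1)  # one-way counts per rank
margin.table(freq, 2)  # one-way counts per discipline
addmargins(freq)       # cross-tabulation with row and column totals added
```

On the raw data, `table(Salary$rank)` and `table(Salary$discipline)` should give the same one-way counts directly.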
  1. Create a box plot of ‘salary’ vs ‘rank’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here, but you should know how. (1 point)
# Your code here
library(ggplot2)
ggplot(data=Salary, aes(x = rank, y = salary)) + 
  geom_boxplot() +
  labs(title = "Salary Based On Rank", x = "Rank", y = "Salary")

3. Linear Regression Analysis (5 points)

  1. Split the data into training data (80%) and testing data (20%), and name them Salary_train and Salary_test. Part of the code is given; please finish it. (1 point)
training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
Salary_train <- Salary[training_index, ]   # rows drawn for training (80%)
Salary_test <- Salary[-training_index, ]   # all remaining rows (20%)
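The sampled indices select rows, so the split is done by row subsetting rather than by arithmetic on the indices. A minimal sketch on a stand-in data frame with the same number of rows as Salary (the names `toy`, `toy_train`, and `toy_test`, and the seed value, are assumptions for illustration):

```r
set.seed(2024)                      # arbitrary seed, only so the split is reproducible
n <- 397                            # number of rows reported by str(Salary)
toy <- data.frame(id = seq_len(n))  # stand-in for the Salary data frame

training_index <- sample(1:nrow(toy), round(0.8 * nrow(toy)))
toy_train <- toy[training_index, , drop = FALSE]   # rows drawn for training
toy_test  <- toy[-training_index, , drop = FALSE]  # all remaining rows

nrow(toy_train)                               # 318
nrow(toy_test)                                # 79
length(intersect(toy_train$id, toy_test$id))  # 0, the two sets do not overlap
```

Without a call to `set.seed()`, `sample()` draws a different split each time the document is knitted, so results that depend on the split will change between runs.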
  1. Using ‘salary’ as the response variable, and based on the training data, fit a full (linear) model (using all the other variables as predictors) and name it model_full; then fit a null (linear) model (no predictors, only an intercept) and name it model_null. Display the summary of both models. (1 point)
# Your code here
model_full <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = Salary)
summary(model_full)
## 
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service + 
##     sex, data = Salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.248 -13.211  -1.775  10.384  99.592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    78.8628     4.9903  15.803  < 2e-16 ***
## rankAsstProf  -12.9076     4.1453  -3.114  0.00198 ** 
## rankProf       32.1584     3.5406   9.083  < 2e-16 ***
## disciplineB    14.4176     2.3429   6.154 1.88e-09 ***
## yrs.since.phd   0.5351     0.2410   2.220  0.02698 *  
## yrs.service    -0.4895     0.2119  -2.310  0.02143 *  
## sexMale         4.7835     3.8587   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.54 on 390 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4463 
## F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16
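The question also asks for an intercept-only model, with both models fitted on the training subset. A minimal sketch of that pattern on simulated data (the data frame `sim`, its generating formula, and the seed are assumptions made up for illustration, not the real Salary data, so the estimates will not match the output above):

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 200
sim <- data.frame(
  rank          = sample(c("AsstProf", "AssocProf", "Prof"), n, replace = TRUE),
  discipline    = sample(c("A", "B"), n, replace = TRUE),
  yrs.since.phd = sample(1:50, n, replace = TRUE),
  sex           = sample(c("Female", "Male"), n, replace = TRUE)
)
sim$yrs.service <- pmax(0, sim$yrs.since.phd - sample(0:5, n, replace = TRUE))
sim$salary <- 80 + 10 * (sim$discipline == "B") +
  0.5 * sim$yrs.since.phd + rnorm(n, sd = 15)

idx <- sample(1:nrow(sim), round(0.8 * nrow(sim)))
sim_train <- sim[idx, ]                         # training subset only

model_full <- lm(salary ~ ., data = sim_train)  # all other columns as predictors
model_null <- lm(salary ~ 1, data = sim_train)  # intercept only
summary(model_full)
summary(model_null)
```

The intercept of the null model equals the mean of the response in the training data, which makes it a natural baseline for comparison.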
  1. Interpret the coefficient of discipline and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 point)

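The coefficient of discipline (disciplineB) is 14.4176: holding the other predictors fixed, faculty in discipline B are predicted to have a salary about 14.42 units higher than faculty in discipline A, the reference level (p = 1.88e-09). The coefficient of yrs.service is -0.4895: holding the other predictors fixed, each additional year of service is associated with a decrease of about 0.49 units in predicted salary (p = 0.02143). Both coefficients are statistically significant at the 5% level, although the evidence for discipline is much stronger.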

  1. Conduct a stepwise variable selection using BIC, and name the selected model model_step_BIC. Which variables are selected in the final model? (1 point)
# Your code here

[Your comments here]
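One common way to run stepwise selection with BIC is `step()` with the penalty `k = log(n)`. A minimal sketch on simulated training data (the data frame `sim_train`, its generating formula, and the seed are assumptions for illustration only; the selected variables here say nothing about the real Salary data):

```r
set.seed(7)  # arbitrary seed for the simulated data
n <- 200
sim_train <- data.frame(
  discipline    = sample(c("A", "B"), n, replace = TRUE),
  yrs.since.phd = sample(1:50, n, replace = TRUE),
  sex           = sample(c("Female", "Male"), n, replace = TRUE)
)
sim_train$salary <- 80 + 10 * (sim_train$discipline == "B") +
  0.5 * sim_train$yrs.since.phd + rnorm(n, sd = 15)

model_null <- lm(salary ~ 1, data = sim_train)

# Start from the null model and allow terms to be added or dropped
model_step_BIC <- step(
  model_null,
  scope     = list(lower = ~ 1, upper = ~ discipline + yrs.since.phd + sex),
  direction = "both",
  k         = log(nrow(sim_train)),  # log(n) penalty gives BIC; the default k = 2 gives AIC
  trace     = 0                      # suppress the step-by-step printout
)
formula(model_step_BIC)              # terms kept in the selected model
```

Starting from the full model instead, with the same `k`, is an equally valid choice and may select a different model.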

  1. Calculate the out-of-sample MSE with model_full and model_step_BIC. Based on these results, which model performs better in prediction? (1 point)
# Your code here

[Your comments here]
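Out-of-sample MSE is the mean squared difference between the observed responses in the test set and the predictions from a model fitted on the training set. A minimal sketch on simulated data (the data, the seed, the helper `oos_mse()`, and the two models `model_big` and `model_small` are hypothetical stand-ins for model_full and model_step_BIC):

```r
set.seed(42)  # arbitrary seed for the simulated data and the split
n <- 200
sim <- data.frame(
  discipline    = sample(c("A", "B"), n, replace = TRUE),
  yrs.since.phd = sample(1:50, n, replace = TRUE),
  yrs.service   = sample(0:45, n, replace = TRUE)
)
sim$salary <- 80 + 10 * (sim$discipline == "B") +
  0.5 * sim$yrs.since.phd + rnorm(n, sd = 15)

idx <- sample(1:nrow(sim), round(0.8 * nrow(sim)))
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

model_big   <- lm(salary ~ discipline + yrs.since.phd + yrs.service, data = sim_train)
model_small <- lm(salary ~ discipline + yrs.since.phd, data = sim_train)

# Mean squared prediction error on data the model has not seen
oos_mse <- function(model, newdata) {
  mean((newdata$salary - predict(model, newdata = newdata))^2)
}

mse_big   <- oos_mse(model_big, sim_test)
mse_small <- oos_mse(model_small, sim_test)
c(big = mse_big, small = mse_small)  # the smaller value indicates better prediction
```

The model with the lower test MSE predicts better on this particular split; with a small test set the ranking can change from one random split to another.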


End of Exam. Please submit this RMD file along with a knitted HTML report.
