This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains salary information for Assistant Professors, Associate Professors, and Professors at a college in the U.S. in 2008-2009. For each of the variables, please check the code book here:
I’ve reviewed this dataset and confirmed that there are no missing values.
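A quick way to verify this yourself (a minimal sketch; it assumes the data have already been read into a data frame named salary, as in the first question below):
colSums(is.na(salary))   # number of missing values in each column; all zeros here
sum(is.na(salary))       # total number of missing cells in the whole data frame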
Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Total points: 10
Good luck!
Read in the Salary dataset (salaries.csv) and display the first few rows. (1 point)
# Your code here
salary <- read.csv("salaries.csv")
head(salary, n=3)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139.75
## 2 Prof B 20 16 Male 173.20
## 3 AsstProf B 4 3 Male 79.75
Display the structure of the dataset. How many observations and variables are there, and what are their types? (1 point)
# Your code here
str(salary)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
[Your answer here]
In total, there are 397 observations and 6 variables. Three of the variables are numeric (yrs.since.phd, yrs.service, and salary) and three are character variables (rank, discipline, and sex).
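A compact way to check the variable types directly (a minimal sketch):
table(sapply(salary, class))   # tabulates the class of each column (character, integer, numeric)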
Make frequency tables of rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments?) (1 point)
# Your code here
table(salary$rank)
##
## AssocProf AsstProf Prof
## 64 67 266
table(salary$discipline)
##
## A B
## 181 216
table(salary$rank, salary$discipline)
##
## A B
## AssocProf 26 38
## AsstProf 24 43
## Prof 131 135
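If proportions are easier to read than raw counts, prop.table() converts a table into shares (a minimal sketch based on the tables above):
round(prop.table(table(salary$rank)), 2)                                  # share of each rank
round(prop.table(table(salary$rank, salary$discipline), margin = 1), 2)   # discipline mix within each rank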
Make a boxplot of salary for the different ranks (using plot() or ggplot()). Add a title and proper axis labels. You don't need to interpret the result here, but you should know how. (1 point)
# Your code here
boxplot(salary ~ rank, data = salary,
        main = "Boxplot of Salary for different ranks",
        xlab = "Rank", ylab = "Salary")
# ggplot2 version
library(ggplot2)
ggplot(data = salary, aes(y = salary, x = factor(rank))) +
  geom_boxplot() +
  labs(title = "Salary for different ranks", x = "Rank", y = "Salary")
Split the data into a training set (80%) and a test set (20%), and name them Salary_train and Salary_test. A part of the code was given; please finish it. (1 point)
training_index <- sample(1:nrow(salary), round(0.8 * nrow(salary)))
# Your code here
Salary_train <- salary[training_index, ]
Salary_test <- salary[-training_index, ]
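Because sample() draws rows at random, the split (and every result that depends on it) changes from run to run. Setting a seed first makes the split reproducible (a minimal sketch; the seed value 123 is arbitrary and not part of the given code):
set.seed(123)   # fix the random number generator so the same rows are sampled each time
training_index <- sample(1:nrow(salary), round(0.8 * nrow(salary)))
Salary_train <- salary[training_index, ]
Salary_test  <- salary[-training_index, ]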
Using the training data, fit a full linear model with salary as the response and all other variables as predictors, and name it model_full; fit a null (linear) model (no predictor, only an intercept), and name it model_null. Display the summary of both models. (1 point)
# Your code here
model_full <- lm(salary ~ ., data = Salary_train)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.319 -13.335 -1.770 9.734 100.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0747 5.6618 13.966 < 2e-16 ***
## rankAsstProf -12.4588 4.5314 -2.749 0.00632 **
## rankProf 32.0883 4.0213 7.980 2.85e-14 ***
## disciplineB 14.4154 2.6085 5.526 6.93e-08 ***
## yrs.since.phd 0.3970 0.2752 1.442 0.15023
## yrs.service -0.3707 0.2412 -1.537 0.12532
## sexMale 4.7233 4.3464 1.087 0.27801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.68 on 311 degrees of freedom
## Multiple R-squared: 0.449, Adjusted R-squared: 0.4384
## F-statistic: 42.24 on 6 and 311 DF, p-value: < 2.2e-16
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.953 -22.166 -6.212 19.317 119.792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.753 1.697 65.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.27 on 317 degrees of freedom
In model_full, interpret the coefficient of discipline and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 point)
[Your interpretation here]
The coefficient of discipline (disciplineB = 14.42) is the estimated average difference in salary between the two disciplines, B (applied) versus A (theoretical), holding the other predictors constant: faculty in applied departments earn about 14.4 more on average. The coefficient of yrs.service (-0.37) is the expected change in salary for each additional year of service, holding the other predictors fixed; it is slightly negative and not statistically significant (p = 0.125), so there is no clear evidence that salary changes with years of service once rank and the other variables are accounted for.
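These numbers can be read straight off the fitted model, using the coefficient names shown in the summary above (a minimal sketch):
coef(model_full)["disciplineB"]      # estimated salary difference, discipline B vs. A
coef(model_full)["yrs.service"]      # estimated change in salary per additional year of service
confint(model_full, "yrs.service")   # the confidence interval shows how uncertain that slope is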
Perform stepwise model selection using BIC, and name the final model model_step_BIC. Which variables are selected in the final model? (1 point)
# Your code here
model_step_bic <- step(model_null,
scope = list(lower = model_null, upper = model_full), direction = "both",
k = log(nrow(Salary_train)))
## Start: AIC=2173.53
## salary ~ 1
##
## Df Sum of Sq RSS AIC
## + rank 2 112809 177563 2028.7
## + yrs.since.phd 1 52222 238150 2116.2
## + yrs.service 1 34367 256005 2139.2
## + discipline 1 7326 283046 2171.2
## <none> 290372 2173.5
## + sex 1 3938 286434 2174.9
##
## Step: AIC=2028.65
## salary ~ rank
##
## Df Sum of Sq RSS AIC
## + discipline 1 15802 161761 2004.8
## <none> 177563 2028.7
## + sex 1 762 176801 2033.0
## + yrs.service 1 666 176897 2033.2
## + yrs.since.phd 1 211 177352 2034.0
## - rank 2 112809 290372 2173.5
##
## Step: AIC=2004.77
## salary ~ rank + discipline
##
## Df Sum of Sq RSS AIC
## <none> 161761 2004.8
## + sex 1 528 161233 2009.5
## + yrs.service 1 111 161650 2010.3
## + yrs.since.phd 1 49 161712 2010.4
## - discipline 1 15802 177563 2028.7
## - rank 2 121284 283046 2171.2
summary(model_step_bic)
##
## Call:
## lm(formula = salary ~ rank + discipline, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.141 -13.481 -1.922 8.565 98.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.128 3.457 24.622 < 2e-16 ***
## rankAsstProf -13.287 4.321 -3.075 0.00229 **
## rankProf 33.364 3.527 9.460 < 2e-16 ***
## disciplineB 14.208 2.565 5.538 6.46e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.7 on 314 degrees of freedom
## Multiple R-squared: 0.4429, Adjusted R-squared: 0.4376
## F-statistic: 83.22 on 3 and 314 DF, p-value: < 2.2e-16
[Your comments here]
Stepwise selection starts from the null model and allows variables to be added or removed (within the scope bounded by the null and full models) based on whether they improve the selection criterion. Setting k = log(n) in step() makes that criterion BIC instead of the default AIC. The final model is stored as model_step_bic, and its summary shows that only rank and discipline are selected.
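The selected predictors can also be listed programmatically rather than read off the summary (a minimal sketch):
formula(model_step_bic)                       # salary ~ rank + discipline
attr(terms(model_step_bic), "term.labels")    # "rank" "discipline"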
Using the test set, compute the out-of-sample MSE for model_full and model_step_BIC. Based on these results, which model performs better in prediction? (1 point)
# Your code here
pred_full <- predict(model_full, newdata = Salary_test)
pred_step <- predict(model_step_bic, newdata = Salary_test)
mse_full <- mean((Salary_test$salary - pred_full)^2)
mse_step <- mean((Salary_test$salary - pred_step)^2)
mse_full
## [1] 489.322
mse_step
## [1] 508.9849
[Your comments here]
The out-of-sample MSE measures how well each model predicts unseen data; a lower MSE means better predictive performance. Here model_full has the lower test MSE (489.3 vs. 509.0 for model_step_bic), so on this train/test split the full model predicts slightly better, although the difference is modest.
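Because MSE is on a squared scale, the root mean squared error (RMSE) is often easier to interpret, since it is in the same units as salary (a minimal sketch using the objects computed above):
rmse_full <- sqrt(mse_full)   # about 22.1
rmse_step <- sqrt(mse_step)   # about 22.6
c(full = rmse_full, step = rmse_step)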
End of Exam. Please submit this RMD file along with a knitted HTML report.