This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains information salary for Assistant Professors, Associate Professors and Professors in a college in the U.S in 2008-2009. For each of the variables, please check the code book here:
I’ve reviewed this dataset, and confirmed that there is no missing values.
Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Total points: 10
Good luck!
Salary
, and display the first few rows. (1 points)# Your code here
Salary <- read.csv("salaries.csv")
head(Salary, n = 5)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139.75
## 2 Prof B 20 16 Male 173.20
## 3 AsstProf B 4 3 Male 79.75
## 4 Prof B 45 39 Male 115.00
## 5 Prof B 40 41 Male 141.50
# Your code here
str(Salary)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
# Your code here
library(snakecase)
library(ggplot2)
rank
and
discipline
. (How many AsstProf, AssocProf, and Prof; and
how many of them are in theoretical departments and how many in applied
departments). (1)# Your code here
# ranks
ggplot(Salary, aes(x = rank, fill = rank)) +
geom_bar(alpha = 0.8) +
ggtitle("Frequency of Faculty Ranks") +
xlab("Rank") +
ylab("Count") +
theme_minimal()
# discipline
ggplot(Salary, aes(x = discipline, fill = discipline)) +
geom_bar(alpha = 0.8) +
ggtitle("Frequency of Disciplines") +
xlab("Discipline (A = Theoretical, B = Applied)") +
ylab("Count") +
theme_minimal()
plot()
or ggplot()
). Add a title and proper
axis labels. You don’t need to interpret the result here but you should
know how. (1 points)# Your code here
library(ggplot2)
ggplot(Salary, aes(x = rank, y = salary)) +
geom_boxplot() +
labs(
title = "Salary Distribution by Academic Rank",
x = "Rank",
y = "Nine-month Salary (thousands of dollars)"
)
Salary_train
and Salary_test
. A
part of the code was given, please finish it. (1 points)training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
dim(Salary)
## [1] 397 6
training_index <- sample(1:dim(Salary)[1], round(0.8 * dim(Salary)[1]))
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]
model_null <- lm(salary ~ 1, data = Salary_train)
model_full <- lm(salary ~ ., data = Salary_train)
model_full
;
fit a null (linear) model (no predictor, only an intercept), and name it
as model_null
. Display the summary of both the models. (1
points)# Your code here
# Full
model_full <- lm(salary ~., data = Salary_train)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.146 -13.160 -0.444 10.980 99.975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.0956 5.5267 13.769 < 2e-16 ***
## rankAsstProf -12.7625 4.6863 -2.723 0.00683 **
## rankProf 32.2112 3.9576 8.139 9.68e-15 ***
## disciplineB 14.1125 2.6824 5.261 2.67e-07 ***
## yrs.since.phd 0.7685 0.2746 2.798 0.00546 **
## yrs.service -0.7161 0.2460 -2.912 0.00386 **
## sexMale 7.1618 4.3933 1.630 0.10408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.96 on 311 degrees of freedom
## Multiple R-squared: 0.4464, Adjusted R-squared: 0.4357
## F-statistic: 41.8 on 6 and 311 DF, p-value: < 2.2e-16
# Null
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.373 -23.041 -6.923 20.286 117.372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.173 1.714 66.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.57 on 317 degrees of freedom
discipline
, and the
coefficient of yrs.service
. What do they tell us about the
relationship between these predictors and salary
? (1
points)model_step_BIC
. Which variables are
selected in the final model? (1 points)# Your code here
n_train <- nrow(Salary_train)
model_step_BIC <- step(
model_null,
scope = list(lower = model_null, upper = model_full),
direction = "both",
k = log(n_train),
trace = 0
)
summary(model_step_BIC)
##
## Call:
## lm(formula = salary ~ rank + discipline, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.347 -14.198 -1.253 10.656 97.639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.772 3.479 24.654 < 2e-16 ***
## rankAsstProf -13.222 4.536 -2.915 0.00381 **
## rankProf 35.011 3.504 9.991 < 2e-16 ***
## disciplineB 13.123 2.650 4.952 1.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.26 on 314 degrees of freedom
## Multiple R-squared: 0.4262, Adjusted R-squared: 0.4207
## F-statistic: 77.75 on 3 and 314 DF, p-value: < 2.2e-16
[Your comments here]
model_full
and
model_step_BIC
. Based on this results, which model performs
better in prediction? (1 points)# Your code here
pred_full <- predict(model_full, newdata = Salary_test)
pred_step <- predict(model_step_BIC, newdata = Salary_test)
# Actuals
y_test <- Salary_test$salary
# MSEs
mse_full <- mean((y_test - pred_full)^2, na.rm = TRUE)
mse_step <- mean((y_test - pred_step)^2, na.rm = TRUE)
cat("Out-of-sample MSE (Full):", round(mse_full, 4), "\n")
## Out-of-sample MSE (Full): 444.3553
cat("Out-of-sample MSE (BIC Stepwise):", round(mse_step, 4), "\n")
## Out-of-sample MSE (BIC Stepwise): 403.3584
Thank you professor!
End of Exam. Please submit this RMD file along with a knitted HTML report.