This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains information salary for Assistant Professors, Associate Professors and Professors in a college in the U.S in 2008-2009. For each of the variables, please check the code book here:
I’ve reviewed this dataset, and confirmed that there is no missing values.
Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Total points: 10
Good luck!
Salary
, and display the first few rows. (1 points)# Your code here
Salary <- read.csv("Salaries.csv")
head(Salary, n = 5)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139.75
## 2 Prof B 20 16 Male 173.20
## 3 AsstProf B 4 3 Male 79.75
## 4 Prof B 45 39 Male 115.00
## 5 Prof B 40 41 Male 141.50
# Your code here
str(Salary)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
Salary_numeric <- Salary[, -6]
Salary_numeric <- Salary[, 1:5]
str(Salary_numeric)
## 'data.frame': 397 obs. of 5 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
There are 3 variables that are numeric. years since phd, years of service and salary
# Your code here
snake_case <- Salary$rank
snake_case <- Salary$discipline
snake_case <- Salary$yrs.since.phd
snake_case <- Salary$yrs.service
snake_case <- Salary$sex
snake_case <- Salary$salary
rank
and
discipline
. (How many AsstProf, AssocProf, and Prof; and
how many of them are in theoretical departments and how many in applied
departments). (1)# Your code here
freq_rank <- table(Salary$rank)
freq_discipline <- table(Salary$discipline)
freq_rank
##
## AssocProf AsstProf Prof
## 64 67 266
freq_discipline
##
## A B
## 181 216
There are 64 Associate Professors, 67 Assistant Professors, and 266 Professors There are 181 in theoretical departments and 216 in Applied departments
plot()
or ggplot()
). Add a title and proper
axis labels. You don’t need to interpret the result here but you should
know how. (1 points)# Your code here
library(ggplot2)
ggplot(Salary, aes(x = rank, y = salary))+
geom_boxplot()+
labs(title = "Salary v. Rank",
x = "Rank of Professor",
y = "Salary of Professor")
Salary_train
and Salary_test
. A
part of the code was given, please finish it. (1 points)training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]
model_full
;
fit a null (linear) model (no predictor, only an intercept), and name it
as model_null
. Display the summary of both the models. (1
points)# Your code here
model_full <- lm(salary ~ ., data = Salary_train)
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.896 -13.408 -0.342 9.547 98.488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.3906 5.3959 14.713 < 2e-16 ***
## rankAsstProf -12.8095 4.4459 -2.881 0.00424 **
## rankProf 32.7065 3.8395 8.518 7.06e-16 ***
## disciplineB 15.9945 2.5779 6.204 1.75e-09 ***
## yrs.since.phd 0.5750 0.2509 2.292 0.02258 *
## yrs.service -0.5262 0.2196 -2.396 0.01716 *
## sexMale 3.1123 4.0794 0.763 0.44608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.97 on 311 degrees of freedom
## Multiple R-squared: 0.4818, Adjusted R-squared: 0.4718
## F-statistic: 48.2 on 6 and 311 DF, p-value: < 2.2e-16
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.124 -22.924 -6.619 20.535 117.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.924 1.695 67.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.23 on 317 degrees of freedom
discipline
, and the
coefficient of yrs.service
. What do they tell us about the
relationship between these predictors and salary
? (1
points)Coefficient Discipline is positive meaning on average they earn 12,934 more than comparabke faculty Coefficient years of serviece tells us that there is a 590 decrease in salary
model_step_BIC
. Which variables are
selected in the final model? (1 points)# Your code here
model_full <- lm(salary ~ ., data = Salary)
model_null <- lm(salary ~ 1, data = Salary)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.248 -13.211 -1.775 10.384 99.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.8628 4.9903 15.803 < 2e-16 ***
## rankAsstProf -12.9076 4.1453 -3.114 0.00198 **
## rankProf 32.1584 3.5406 9.083 < 2e-16 ***
## disciplineB 14.4176 2.3429 6.154 1.88e-09 ***
## yrs.since.phd 0.5351 0.2410 2.220 0.02698 *
## yrs.service -0.4895 0.2119 -2.310 0.02143 *
## sexMale 4.7835 3.8587 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.54 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.906 -22.706 -6.406 20.479 117.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.71 1.52 74.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.29 on 396 degrees of freedom
All varaibles are brought in the final mode
model_full
and
model_step_BIC
. Based on this results, which model performs
better in prediction? (1 points)# Your code here
pred_full <- predict(model_full, data = Salary)
# pred_step <- predict(model_step_BIC, data = Salary)
mse_full <- mean((Salary$salary - pred_full)^2)
# mse_step <- mean((Salary$salary - pred_step)^2)
mse_full
## [1] 499.0336
# mse_step
MSE step is higher than mse full which means the model does not perform better
End of Exam. Please submit this RMD file along with a knitted HTML report.