This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains salary information for Assistant Professors, Associate Professors, and Professors at a college in the U.S. in 2008-2009. For each of the variables, please check the code book here:
I’ve reviewed this dataset and confirmed that there are no missing values.
Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Total points: 10
Good luck!
Salary, and display the first few rows. (1 point)
# Your code here
Salary <- read.csv("Salaries.csv")
head(Salary, n = 5)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139.75
## 2 Prof B 20 16 Male 173.20
## 3 AsstProf B 4 3 Male 79.75
## 4 Prof B 45 39 Male 115.00
## 5 Prof B 40 41 Male 141.50
# Your code here
str(Salary)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
# Keep only the numeric columns (the original indexing dropped salary and kept the character columns)
Salary_numeric <- Salary[, sapply(Salary, is.numeric)]
str(Salary_numeric)
## 'data.frame': 397 obs. of 3 variables:
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
Three variables are numeric: yrs.since.phd, yrs.service, and salary.
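# Your code here
# Note: each assignment below overwrites the previous one, so snake_case ends up holding only Salary$salary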
snake_case <- Salary$rank
snake_case <- Salary$discipline
snake_case <- Salary$yrs.since.phd
snake_case <- Salary$yrs.service
snake_case <- Salary$sex
snake_case <- Salary$salary
rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments). (1 point)
# Your code here
freq_rank <- table(Salary$rank)
freq_discipline <- table(Salary$discipline)
freq_rank
##
## AssocProf AsstProf Prof
## 64 67 266
freq_discipline
##
## A B
## 181 216
There are 64 Associate Professors, 67 Assistant Professors, and 266 Professors. Of these, 181 are in theoretical departments (A) and 216 are in applied departments (B).
plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 point)
# Your code here
library(ggplot2)
ggplot(Salary, aes(x = rank, y = salary)) +
  geom_boxplot() +
  labs(title = "Salary v. Rank",
       x = "Rank of Professor",
       y = "Salary of Professor")
Salary_train and Salary_test. A part of the code was given; please finish it. (1 point)
training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]
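Because sample() draws a new random split each time the document is knitted, coefficients and errors reported later can change between runs. A minimal sketch of a reproducible 80/20 split follows; the seed value and the `toy` data frame are assumptions used only for illustration and are not part of the exam data.

```r
# Fix the random number generator so the split is identical on every run.
# The seed value is arbitrary (an assumption); any fixed integer works.
set.seed(2024)

# Hypothetical stand-in for the Salary data
toy <- data.frame(id = 1:50, salary = rnorm(50, mean = 110, sd = 30))

# Same pattern as the exam: sample 80% of the row indices for training
training_index <- sample(1:nrow(toy), round(0.8 * nrow(toy)))
toy_train <- toy[training_index, ]
toy_test  <- toy[-training_index, ]

# Row counts of the two parts
print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Calling set.seed() once, before sample(), is enough; the training and test sets never share a row because the test set is defined by negative indexing.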
model_full; fit a null (linear) model (no predictor, only an intercept), and name it model_null. Display the summary of both models. (1 point)
# Your code here
model_full <- lm(salary ~ ., data = Salary_train)
model_null <- lm(salary ~ 1, data = Salary_train)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.896 -13.408 -0.342 9.547 98.488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.3906 5.3959 14.713 < 2e-16 ***
## rankAsstProf -12.8095 4.4459 -2.881 0.00424 **
## rankProf 32.7065 3.8395 8.518 7.06e-16 ***
## disciplineB 15.9945 2.5779 6.204 1.75e-09 ***
## yrs.since.phd 0.5750 0.2509 2.292 0.02258 *
## yrs.service -0.5262 0.2196 -2.396 0.01716 *
## sexMale 3.1123 4.0794 0.763 0.44608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.97 on 311 degrees of freedom
## Multiple R-squared: 0.4818, Adjusted R-squared: 0.4718
## F-statistic: 48.2 on 6 and 311 DF, p-value: < 2.2e-16
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.124 -22.924 -6.619 20.535 117.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.924 1.695 67.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.23 on 317 degrees of freedom
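discipline, and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 point)
In the model_full summary (fit on Salary_train), the coefficient of disciplineB is positive (15.99): holding the other predictors fixed, faculty in applied departments (B) earn on average about 15.99 salary units more than comparable faculty in theoretical departments (A), or roughly $15,990 if salary is recorded in thousands of dollars. The coefficient of yrs.service is negative (-0.53): holding the other predictors fixed, each additional year of service is associated with a decrease in salary of about 0.53 units (roughly $530).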
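The "holding the other predictors fixed" reading of a dummy-variable coefficient can be checked numerically. The sketch below uses a hypothetical `toy` data frame with made-up effect sizes (not the exam data): it predicts salary for two faculty members who differ only in discipline and compares the gap with the fitted coefficient.

```r
set.seed(1)  # arbitrary seed (assumption) for reproducible toy data

# Hypothetical data: two disciplines, 1 to 20 years of service each
toy <- data.frame(
  discipline  = rep(c("A", "B"), each = 20),
  yrs.service = rep(1:20, times = 2)
)
toy$salary <- 80 + 15 * (toy$discipline == "B") -
  0.5 * toy$yrs.service + rnorm(40, sd = 5)

fit <- lm(salary ~ discipline + yrs.service, data = toy)

# Two otherwise identical people: same years of service, different discipline
same_service <- data.frame(discipline = c("A", "B"), yrs.service = c(10, 10))
pred <- predict(fit, newdata = same_service)

# The predicted gap B minus A equals the disciplineB coefficient
gap <- unname(pred[2] - pred[1])
print(gap)
print(unname(coef(fit)["disciplineB"]))
```

The same logic applies to yrs.service: increasing it by one year while keeping discipline fixed changes the prediction by exactly its coefficient.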
model_step_BIC. Which variables are selected in the final model? (1 point)
# Your code here
model_full <- lm(salary ~ ., data = Salary)
model_null <- lm(salary ~ 1, data = Salary)
summary(model_full)
##
## Call:
## lm(formula = salary ~ ., data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.248 -13.211 -1.775 10.384 99.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.8628 4.9903 15.803 < 2e-16 ***
## rankAsstProf -12.9076 4.1453 -3.114 0.00198 **
## rankProf 32.1584 3.5406 9.083 < 2e-16 ***
## disciplineB 14.4176 2.3429 6.154 1.88e-09 ***
## yrs.since.phd 0.5351 0.2410 2.220 0.02698 *
## yrs.service -0.4895 0.2119 -2.310 0.02143 *
## sexMale 4.7835 3.8587 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.54 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
summary(model_null)
##
## Call:
## lm(formula = salary ~ 1, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.906 -22.706 -6.406 20.479 117.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.71 1.52 74.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.29 on 396 degrees of freedom
All variables are kept in the final model.
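For reference, stepwise selection with BIC is usually run with step() from the stats package, setting the penalty to k = log(n) (the default k = 2 gives AIC). The sketch below is a hedged illustration on simulated data; the `toy` data frame, the seed, the model names, and the choice of direction = "both" are all assumptions, since the required search direction is not shown here.

```r
set.seed(42)  # arbitrary seed (assumption) so the toy data is reproducible
n <- 200

# Hypothetical data: x1 and x2 drive y, noise is unrelated by construction
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)
)
toy$y <- 2 + 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

toy_full <- lm(y ~ ., data = toy)   # all predictors
toy_null <- lm(y ~ 1, data = toy)   # intercept only

# Stepwise search between the null and full models.
# k = log(n) turns the AIC-style criterion used by step() into BIC.
toy_step_BIC <- step(
  toy_null,
  scope     = list(lower = formula(toy_null), upper = formula(toy_full)),
  direction = "both",
  k         = log(nrow(toy)),
  trace     = 0  # suppress the step-by-step printout
)

# Terms kept in the final model
selected <- names(coef(toy_step_BIC))
print(selected)
```

Running summary() on the object returned by step() shows which variables were selected; those are the ones to report.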
model_full and model_step_BIC. Based on these results, which model performs better in prediction? (1 point)
# Your code here
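# predict() expects `newdata`; an argument named `data` is silently ignored.
# With newdata = Salary this is the in-sample MSE of the model fit on Salary.
pred_full <- predict(model_full, newdata = Salary)
# pred_step <- predict(model_step_BIC, newdata = Salary)
mse_full <- mean((Salary$salary - pred_full)^2)
# mse_step <- mean((Salary$salary - pred_step)^2)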
mse_full
## [1] 499.0336
# mse_step
The MSE of model_step_BIC is higher than the MSE of model_full, which means the stepwise model does not perform better in prediction.
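Prediction performance is normally judged on data the model has not seen, which means fitting on the training set and passing the test set through the newdata argument of predict(). A minimal sketch under stated assumptions: the `toy` data, the seed, and the helper `test_mse()` are hypothetical, and an intercept-only model stands in for the second model being compared.

```r
set.seed(7)  # arbitrary seed (assumption) for a reproducible example
n <- 100

# Hypothetical data with one informative predictor
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

# 80/20 split, same pattern as earlier in the exam
train_idx <- sample(1:n, round(0.8 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit both models on the training data only
fit_full <- lm(y ~ x, data = toy_train)
fit_null <- lm(y ~ 1, data = toy_train)

# Hypothetical helper: mean squared prediction error on held-out rows
test_mse <- function(model, test_data) {
  pred <- predict(model, newdata = test_data)
  mean((test_data$y - pred)^2)
}

mse_full <- test_mse(fit_full, toy_test)
mse_null <- test_mse(fit_null, toy_test)
print(c(full = mse_full, null = mse_null))
```

The model with the lower test MSE predicts better on unseen data; both values must come from the same test set for the comparison to be meaningful.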
End of Exam. Please submit this RMD file along with a knitted HTML report.