This is the R portion of your midterm exam. You will analyze the Salary dataset, which contains salary information for Assistant Professors, Associate Professors and Professors at a college in the U.S. in 2008-2009. For each of the variables, please check the code book here:
I’ve reviewed this dataset and confirmed that there are no missing values.
Please follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.
Total points: 10
Good luck!
Salary, and display the first few rows. (1 point)
# Your code here
Salary <- read.csv("salaries.csv")
head(Salary, n=5)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139.75
## 2 Prof B 20 16 Male 173.20
## 3 AsstProf B 4 3 Male 79.75
## 4 Prof B 45 39 Male 115.00
## 5 Prof B 40 41 Male 141.50
# Your code here
str(Salary)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : chr "Prof" "Prof" "AsstProf" "Prof" ...
## $ discipline : chr "B" "B" "B" "B" ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ salary : num 139.8 173.2 79.8 115 141.5 ...
There are 397 observations and 6 variables: 1 is numeric (salary), 2 are integer (yrs.since.phd, yrs.service), and 3 are character (rank, discipline, sex).
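The type counts can also be computed rather than read off the str() output. A minimal sketch, using a small made-up data frame (`toy`, a hypothetical stand-in for Salary with the same column types):

```r
# Hypothetical stand-in for Salary; values are illustrative only
toy <- data.frame(
  rank = c("Prof", "AsstProf", "Prof"),
  discipline = c("B", "B", "A"),
  yrs.since.phd = c(19L, 4L, 45L),
  yrs.service = c(18L, 3L, 39L),
  sex = c("Male", "Male", "Female"),
  salary = c(139.75, 79.75, 115.00),
  stringsAsFactors = FALSE
)

dim(toy)                                  # rows, then columns
type_counts <- table(sapply(toy, class))  # number of columns per class
type_counts
```

With the exam data, the same two calls would be applied to Salary instead of toy.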
# Your code here
snake_case <- Salary[,c(1,2,5)]
rank and discipline. (How many AsstProf, AssocProf, and Prof; and how many of them are in theoretical departments and how many in applied departments). (1 point) ?frequency tables
# Your code here
freq_table <- table(Salary$rank, Salary$discipline)
freq_table
##
## A B
## AssocProf 26 38
## AsstProf 24 43
## Prof 131 135
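The question asks for totals per rank and per discipline, which are the row and column margins of this table. A minimal sketch that rebuilds the table from the counts printed above and adds the margins with addmargins(); the object name `totals` is only illustrative, and the reading of A as theoretical and B as applied is an assumption to be confirmed against the code book:

```r
# Rebuild the 3 x 2 table from the counts shown in the output above
freq_table <- as.table(matrix(
  c(26, 24, 131,   # discipline A: AssocProf, AsstProf, Prof
    38, 43, 135),  # discipline B: AssocProf, AsstProf, Prof
  nrow = 3,
  dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                  discipline = c("A", "B"))
))

# Append a "Sum" row and a "Sum" column
totals <- addmargins(freq_table)
totals
# Row sums: AssocProf 64, AsstProf 67, Prof 266
# Column sums: A 181, B 216; grand total 397
```

With the exam data, addmargins(table(Salary$rank, Salary$discipline)) would give the same margins directly.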
plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 point)
# Your code here
library(ggplot2)
# Boxplot of salary by rank, with a title and explicit axis labels
ggplot(data = Salary, aes(x = rank, y = salary)) +
  geom_boxplot() +
  labs(title = "Salary Based On Rank", x = "Rank", y = "Salary")
Salary_train and Salary_test. A part of the code was given, please finish it. (1 point)
training_index <- sample(1:nrow(Salary), round(0.8 * nrow(Salary)))
# Your code here
# Rows drawn into training_index form the training set; all remaining rows form the test set
Salary_train <- Salary[training_index, ]
Salary_test <- Salary[-training_index, ]
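A split by row index can be sanity-checked by confirming that the two sets are disjoint and together cover every row. A minimal sketch on a stand-in data frame with the same number of rows as Salary; the data frame `toy`, its `id` column, and the seed value are assumptions added only for illustration and reproducibility:

```r
set.seed(123)  # assumed seed, not part of the exam code

# Stand-in with 397 rows, matching the number of observations in Salary
toy <- data.frame(id = 1:397)

# Same indexing pattern as the provided code: sample 80% of the row numbers
training_index <- sample(1:nrow(toy), round(0.8 * nrow(toy)))

toy_train <- toy[training_index, , drop = FALSE]   # sampled rows
toy_test  <- toy[-training_index, , drop = FALSE]  # everything else

nrow(toy_train)  # 318, i.e. round(0.8 * 397)
nrow(toy_test)   # 79
length(intersect(toy_train$id, toy_test$id))  # 0, no row is in both sets
```

The negative index is what makes the test set the complement of the training set; multiplying the index vector by a fraction would only rescale row numbers.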
model_full; fit a null (linear) model (no predictor, only an intercept), and name it as model_null. Display the summary of both the models. (1 point)
# Your code here
model_full <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = Salary)
summary(model_full)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service +
## sex, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.248 -13.211 -1.775 10.384 99.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.8628 4.9903 15.803 < 2e-16 ***
## rankAsstProf -12.9076 4.1453 -3.114 0.00198 **
## rankProf 32.1584 3.5406 9.083 < 2e-16 ***
## disciplineB 14.4176 2.3429 6.154 1.88e-09 ***
## yrs.since.phd 0.5351 0.2410 2.220 0.02698 *
## yrs.service -0.4895 0.2119 -2.310 0.02143 *
## sexMale 4.7835 3.8587 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.54 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
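The question also asks for a null model, which is not fitted in the code above. An intercept-only model is specified with the formula `response ~ 1`. A minimal sketch on simulated data (the data frame `toy` and the name `model_null_toy` are hypothetical); with the exam data the call would presumably be lm(salary ~ 1, data = Salary):

```r
set.seed(1)

# Simulated salaries, for illustration only
toy <- data.frame(salary = rnorm(50, mean = 110, sd = 25))

# Intercept-only (null) linear model: no predictors on the right-hand side
model_null_toy <- lm(salary ~ 1, data = toy)
summary(model_null_toy)

# The fitted intercept of a null model is the sample mean of the response
null_intercept <- coef(model_null_toy)[["(Intercept)"]]
all.equal(null_intercept, mean(toy$salary))
```

Because the null model predicts the mean for every observation, it serves as the baseline against which the full model's fit is judged.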
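discipline, and the coefficient of yrs.service. What do they tell us about the relationship between these predictors and salary? (1 point)
The coefficient of disciplineB is 14.4176: holding the other predictors fixed, faculty in discipline B earn on average about 14.42 salary units more than those in discipline A (the reference level), and the effect is highly significant (p = 1.88e-09). The coefficient of yrs.service is -0.4895: holding the other predictors fixed, each additional year of service is associated with a salary about 0.49 units lower, and this effect is also significant at the 0.05 level (p = 0.02143).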
model_step_BIC. Which variables are selected in the final model? (1 point)
# Your code here
[Your comments here]
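Stepwise selection under BIC can be run with step() by setting the penalty `k` to log(n) instead of the AIC default of 2. A minimal sketch on simulated data, assuming backward elimination from a full model; the data frame `toy`, the `noise` predictor, and the object names are hypothetical, and the direction the exam intends is not stated in the surviving text:

```r
set.seed(1)
n <- 200

# Simulated data: salary depends on rank and discipline, not on noise
toy <- data.frame(
  rank = factor(sample(c("AsstProf", "AssocProf", "Prof"), n, replace = TRUE)),
  discipline = factor(sample(c("A", "B"), n, replace = TRUE)),
  noise = rnorm(n)
)
toy$salary <- 80 + 30 * (toy$rank == "Prof") +
  14 * (toy$discipline == "B") + rnorm(n, sd = 10)

full <- lm(salary ~ rank + discipline + noise, data = toy)

# k = log(n) turns step()'s criterion from AIC into BIC
step_bic <- step(full, direction = "backward", k = log(nrow(toy)), trace = 0)

# Predictors retained in the selected model
selected <- attr(terms(step_bic), "term.labels")
selected
```

The n passed to log() should be the number of rows actually used to fit the model, so fitting on a training set means using its row count.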
model_full and model_step_BIC. Based on these results, which model performs better in prediction? (1 point)
# Your code here
[Your comments here]
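One common way to compare predictive performance is to compute the root mean squared error (RMSE) of each model on held-out data, with the lower value indicating better prediction. A minimal sketch on simulated data, assuming RMSE on a test set is the intended metric (the start of the question is not shown); the helper `rmse` and all object names are hypothetical:

```r
set.seed(1)
n <- 200

# Simulated data: y depends on x1 only; x2 is irrelevant
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 100 + 10 * toy$x1 + rnorm(n, sd = 5)

# 80/20 split by row index
idx <- sample(1:n, round(0.8 * n))
train <- toy[idx, ]
test <- toy[-idx, ]

# Two candidate models, both fitted on the training rows only
fit_full <- lm(y ~ x1 + x2, data = train)
fit_reduced <- lm(y ~ x1, data = train)

# Hypothetical helper: RMSE of a model's predictions on new data
rmse <- function(model, newdata, response) {
  sqrt(mean((newdata[[response]] - predict(model, newdata = newdata))^2))
}

test_rmse <- c(full = rmse(fit_full, test, "y"),
               reduced = rmse(fit_reduced, test, "y"))
test_rmse
```

For the comparison to be fair, both models need to be fitted on the training rows and evaluated on the same test rows.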
End of Exam. Please submit this RMD file along with a knitted HTML report.
getwd()