Please indicate who you collaborated with on this problem set:
I did not collaborate on this problem set.
You’ll be completing slightly modified versions of the Learning Checks from Chapter 6 of ModernDive. Recall we are using the evals dataset with
As a refresher, let’s look at a random sample of 5 of 463 professors using the sample_n() command from the dplyr package
evals %>%
sample_n(5)| ID | score | age | bty_avg | gender | ethnicity | language | rank | pic_outfit | pic_color | cls_did_eval | cls_students | cls_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 290 | 3.6 | 34 | 6.667 | male | not minority | english | tenure track | not formal | color | 20 | 35 | lower |
| 341 | 4.9 | 43 | 3.500 | male | not minority | english | tenure track | not formal | color | 18 | 30 | lower |
| 199 | 3.3 | 47 | 2.333 | male | minority | english | tenured | not formal | color | 9 | 14 | upper |
| 47 | 4.4 | 33 | 4.667 | female | not minority | english | tenure track | not formal | color | 25 | 29 | upper |
| 215 | 4.7 | 60 | 3.667 | male | not minority | english | tenured | not formal | color | 31 | 45 | upper |
Write the code to perform the following exploratory data analysis:
# 1. Code for visualization:
ggplot(evals, aes(x = age, y = score))+geom_point()+geom_jitter()# 2. Code for summary statistic:
evals %>%
get_correlation(score ~ age)| correlation |
|---|
| -0.107032 |
Based on these two outputs, comment on the relationship between teaching score and instructor age for these 463 instructors at the University of Texas Austin. Do this in three sentences or less below:
There is a weak negative relationship between age and evaluation score for this data set where as instructors get older their evaluation scores are lower.
Do the following in the code block below:
score_model_age# 1. Fit regression model:
score_model_age <- lm(score ~ age, evals)
score_model_age##
## Call:
## lm(formula = score ~ age, data = evals)
##
## Coefficients:
## (Intercept) age
## 4.461932 -0.005938
# 2. Output regression table:
get_regression_table(score_model_age)| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.462 | 0.127 | 35.195 | 0.000 | 4.213 | 4.711 |
| age | -0.006 | 0.003 | -2.311 | 0.021 | -0.011 | -0.001 |
Part a) What is the numerical value of the intercept? What, according to this model, is the teaching score for teachers with age 0? Is there any real practical interpretation of the intercept in this case? Note that you can verify these results by comparing them with the visualization above. Write your answer below:
The numerical value of the intercept is 4.462, so the model predicts that teachers with age 0 would have a teaching score of 4.462.
There is no real practical interpretation of the intercept in this case because no teacher can possibly be age 0. This is supported by the fact that the graph has no teachers with age less than 20.
Part b) What is the slope? On average, how much does teaching score go down each year for a professor? Note that you can verify these results by comparing them with the visualization above. Write your answer below:
The slope is -0.006. So on average teaching score goes down by 0.006 each year for a professor.
Part c) Say an a new instructor of age 45 joins the faculty at the UT Austin. Knowing nothing else about this instructor, what would you predict their teaching score to be? Show your work:
If we knew nothing else about the professor we could do a crude calculation using
intercept - slope * age
His teaching score would be
#crude teaching score
4.462-0.006*45## [1] 4.192
But because there are many factors that go into teaching score, this is a crude prediction for the professor’s teaching score, and should be taken with a grain of salt.
Recall the following data/visualization from Question 1 on the practice midterm:
DD_vs_SB <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/DD_vs_SB.csv")
ggplot(DD_vs_SB, aes(x = med_inc, y = shops_per_1000, col = Type)) +
geom_point() +
facet_wrap(~Type) +
geom_smooth(method = "lm", se = FALSE, col = "blue") +
labs(x = "Median Household Income", y = "# of shops per 1000 people",
title = "Coffee/Cafe Comparison in Eastern MA") +
scale_color_manual(values=c("orange", "forestgreen"))Write the code in the code block below that will allow you to answer the following two questions:
# For Dunkin Donuts
DD_count <- DD_vs_SB %>%
filter(Type == "Dunkin Donuts")
DD_model_income <- lm(shops_per_1000 ~ med_inc, DD_count)
DD_model_income##
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = DD_count)
##
## Coefficients:
## (Intercept) med_inc
## 1.885e-01 -3.749e-07
# For Starbucks
SB_count <- DD_vs_SB %>%
filter(Type == "Starbucks")
SB_model_income <- lm(shops_per_1000 ~ med_inc, SB_count)
SB_model_income##
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = SB_count)
##
## Coefficients:
## (Intercept) med_inc
## 1.186e-02 3.757e-07
What are these two values?