Collaboration

Please indicate who you collaborated with on this problem set:

I did not collaborate on this problem set.

Question 1: Ch 6 Learning Checks

You’ll be completing slightly modified versions of the Learning Checks from Chapter 6 of ModernDive. Recall we are using the evals dataset with

  • Outcome variable y = teaching score
  • Explanatory variable x = age

As a refresher, let’s look at a random sample of 5 of 463 professors using the sample_n() command from the dplyr package

evals %>% 
  sample_n(5)
ID score age bty_avg gender ethnicity language rank pic_outfit pic_color cls_did_eval cls_students cls_level
290 3.6 34 6.667 male not minority english tenure track not formal color 20 35 lower
341 4.9 43 3.500 male not minority english tenure track not formal color 18 30 lower
199 3.3 47 2.333 male minority english tenured not formal color 9 14 upper
47 4.4 33 4.667 female not minority english tenure track not formal color 25 29 upper
215 4.7 60 3.667 male not minority english tenured not formal color 31 45 upper

LC 6.1: EDA

Write the code to perform the following exploratory data analysis:

  1. Create a visualization that allows you conduct an “eyeball” test of the relationship between teaching score and age.
  2. Compute a summary statistic of the strength of linear association of these two variables.
# 1. Code for visualization:
ggplot(evals, aes(x = age, y = score))+geom_point()+geom_jitter()

# 2. Code for summary statistic:
evals %>% 
  get_correlation(score ~ age)
correlation
-0.107032

Based on these two outputs, comment on the relationship between teaching score and instructor age for these 463 instructors at the University of Texas Austin. Do this in three sentences or less below:

There is a weak negative relationship between age and evaluation score for this data set where as instructors get older their evaluation scores are lower.

LC 6.2: Regression table

Do the following in the code block below:

  1. Fit the corresponding regression model and save this in score_model_age
  2. Output the regression table
# 1. Fit regression model:
score_model_age <- lm(score ~ age, evals)
score_model_age
## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Coefficients:
## (Intercept)          age  
##    4.461932    -0.005938
# 2. Output regression table:
get_regression_table(score_model_age)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 4.462 0.127 35.195 0.000 4.213 4.711
age -0.006 0.003 -2.311 0.021 -0.011 -0.001

Part a) What is the numerical value of the intercept? What, according to this model, is the teaching score for teachers with age 0? Is there any real practical interpretation of the intercept in this case? Note that you can verify these results by comparing them with the visualization above. Write your answer below:

The numerical value of the intercept is 4.462, so the model predicts that teachers with age 0 would have a teaching score of 4.462.

There is no real practical interpretation of the intercept in this case because no teacher can possibly be age 0. This is supported by the fact that the graph has no teachers with age less than 20.

Part b) What is the slope? On average, how much does teaching score go down each year for a professor? Note that you can verify these results by comparing them with the visualization above. Write your answer below:

The slope is -0.006. So on average teaching score goes down by 0.006 each year for a professor.

Part c) Say an a new instructor of age 45 joins the faculty at the UT Austin. Knowing nothing else about this instructor, what would you predict their teaching score to be? Show your work:

If we knew nothing else about the professor we could do a crude calculation using

intercept - slope * age

His teaching score would be

#crude teaching score
4.462-0.006*45
## [1] 4.192

But because there are many factors that go into teaching score, this is a crude prediction for the professor’s teaching score, and should be taken with a grain of salt.

Question 2:

Recall the following data/visualization from Question 1 on the practice midterm:

DD_vs_SB <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/DD_vs_SB.csv")

ggplot(DD_vs_SB, aes(x = med_inc, y = shops_per_1000, col = Type)) +
  geom_point() + 
  facet_wrap(~Type) +
  geom_smooth(method = "lm", se = FALSE, col = "blue") + 
  labs(x = "Median Household Income", y = "# of shops per 1000 people", 
       title = "Coffee/Cafe Comparison in Eastern MA") +
  scale_color_manual(values=c("orange", "forestgreen"))

Write the code in the code block below that will allow you to answer the following two questions:

  1. For every increase in $10K in median income, there is an associated decrease of on average how many Dunkin Donuts shops per 1000 individuals?
  2. For every increase in $10K in median income, there is an associated increase of on average how many Starbucks per 1000 individuals?
# For Dunkin Donuts
DD_count <- DD_vs_SB %>% 
  filter(Type == "Dunkin Donuts")
DD_model_income <- lm(shops_per_1000 ~ med_inc, DD_count)
DD_model_income
## 
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = DD_count)
## 
## Coefficients:
## (Intercept)      med_inc  
##   1.885e-01   -3.749e-07
# For Starbucks
SB_count <- DD_vs_SB %>% 
  filter(Type == "Starbucks")
SB_model_income <- lm(shops_per_1000 ~ med_inc, SB_count)
SB_model_income
## 
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = SB_count)
## 
## Coefficients:
## (Intercept)      med_inc  
##   1.186e-02    3.757e-07

What are these two values?

  1. Value 1: -3.749e-07
  2. Value 2: 3.757e-07