Collaboration

Please indicate who you collaborated with on this problem set:

I did not collaborate on this problem set.

Question 1: Ch 6 Learning Checks

You’ll be completing slightly modified versions of the Learning Checks from Chapter 6 of ModernDive. Recall we are using the evals dataset with

Outcome variable y = teaching score
Explanatory variable x = age

As a refresher, let’s look at a random sample of 5 of 463 professors using the sample_n() command from the dplyr package

evals %>% 
  sample_n(5)

ID	score	age	bty_avg	gender	ethnicity	language	rank	pic_outfit	pic_color	cls_did_eval	cls_students	cls_level
290	3.6	34	6.667	male	not minority	english	tenure track	not formal	color	20	35	lower
341	4.9	43	3.500	male	not minority	english	tenure track	not formal	color	18	30	lower
199	3.3	47	2.333	male	minority	english	tenured	not formal	color	9	14	upper
47	4.4	33	4.667	female	not minority	english	tenure track	not formal	color	25	29	upper
215	4.7	60	3.667	male	not minority	english	tenured	not formal	color	31	45	upper

LC 6.1: EDA

Write the code to perform the following exploratory data analysis:

Create a visualization that allows you conduct an “eyeball” test of the relationship between teaching score and age.
Compute a summary statistic of the strength of linear association of these two variables.

# 1. Code for visualization:
ggplot(evals, aes(x = age, y = score))+geom_point()+geom_jitter()

# 2. Code for summary statistic:
evals %>% 
  get_correlation(score ~ age)

correlation
-0.107032

Based on these two outputs, comment on the relationship between teaching score and instructor age for these 463 instructors at the University of Texas Austin. Do this in three sentences or less below:

There is a weak negative relationship between age and evaluation score for this data set where as instructors get older their evaluation scores are lower.

LC 6.2: Regression table

Do the following in the code block below:

Fit the corresponding regression model and save this in score_model_age
Output the regression table

# 1. Fit regression model:
score_model_age <- lm(score ~ age, evals)
score_model_age

## 
## Call:
## lm(formula = score ~ age, data = evals)
## 
## Coefficients:
## (Intercept)          age  
##    4.461932    -0.005938

# 2. Output regression table:
get_regression_table(score_model_age)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	4.462	0.127	35.195	0.000	4.213	4.711
age	-0.006	0.003	-2.311	0.021	-0.011	-0.001

Part a) What is the numerical value of the intercept? What, according to this model, is the teaching score for teachers with age 0? Is there any real practical interpretation of the intercept in this case? Note that you can verify these results by comparing them with the visualization above. Write your answer below:

The numerical value of the intercept is 4.462, so the model predicts that teachers with age 0 would have a teaching score of 4.462.

There is no real practical interpretation of the intercept in this case because no teacher can possibly be age 0. This is supported by the fact that the graph has no teachers with age less than 20.

Part b) What is the slope? On average, how much does teaching score go down each year for a professor? Note that you can verify these results by comparing them with the visualization above. Write your answer below:

The slope is -0.006. So on average teaching score goes down by 0.006 each year for a professor.

Part c) Say an a new instructor of age 45 joins the faculty at the UT Austin. Knowing nothing else about this instructor, what would you predict their teaching score to be? Show your work:

If we knew nothing else about the professor we could do a crude calculation using

intercept - slope * age

His teaching score would be

#crude teaching score
4.462-0.006*45

## [1] 4.192

But because there are many factors that go into teaching score, this is a crude prediction for the professor’s teaching score, and should be taken with a grain of salt.

Question 2:

Recall the following data/visualization from Question 1 on the practice midterm:

DD_vs_SB <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/DD_vs_SB.csv")

ggplot(DD_vs_SB, aes(x = med_inc, y = shops_per_1000, col = Type)) +
  geom_point() + 
  facet_wrap(~Type) +
  geom_smooth(method = "lm", se = FALSE, col = "blue") + 
  labs(x = "Median Household Income", y = "# of shops per 1000 people", 
       title = "Coffee/Cafe Comparison in Eastern MA") +
  scale_color_manual(values=c("orange", "forestgreen"))

Write the code in the code block below that will allow you to answer the following two questions:

For every increase in $10K in median income, there is an associated decrease of on average how many Dunkin Donuts shops per 1000 individuals?
For every increase in $10K in median income, there is an associated increase of on average how many Starbucks per 1000 individuals?

# For Dunkin Donuts
DD_count <- DD_vs_SB %>% 
  filter(Type == "Dunkin Donuts")
DD_model_income <- lm(shops_per_1000 ~ med_inc, DD_count)
DD_model_income

## 
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = DD_count)
## 
## Coefficients:
## (Intercept)      med_inc  
##   1.885e-01   -3.749e-07

# For Starbucks
SB_count <- DD_vs_SB %>% 
  filter(Type == "Starbucks")
SB_model_income <- lm(shops_per_1000 ~ med_inc, SB_count)
SB_model_income

## 
## Call:
## lm(formula = shops_per_1000 ~ med_inc, data = SB_count)
## 
## Coefficients:
## (Intercept)      med_inc  
##   1.186e-02    3.757e-07

What are these two values?

Value 1: -3.749e-07
Value 2: 3.757e-07

Problem Set 05

CAITLIN ONG

2018-10-10

Collaboration

Question 1: Ch 6 Learning Checks

LC 6.1: EDA

LC 6.2: Regression table

Question 2: