title: “API 202-MZ Problem Set 1” output: pdf_document
To successfully complete this problem set, please follow these steps:
HKS HONOR CODE: We abide by the Harvard Kennedy School Academic code for all aspects of the course. In terms of problem sets, unless explicitly written otherwise, the norms are the following: You are free (and encouraged) to discuss problem sets with your classmates. However, you must hand in your own unique written work and code in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your classmate(s), sitting side-by-side and going through the problem set question-by-question, but you must each type your own answers and your own code. For more details, please see syllabus.
Syllabus review: We briefly discussed the syllabus on our first day of class. The syllabus contains a lot of information that is important for you to know so you can maximize your learning in this course. Please read the syllabus on Canvas.
Put an ‘X’ in the cells below when you are done with this task:
| Task | Yes |
|---|---|
| I have reviewed the syllabus |
The purpose of this exercise is to help you learn the mechanics of ordinary least squares (OLS) regression and understand what the different terms in a regression mean. You will calculate the regression “by hand” using the formulas developed in class, and then you will use R to confirm the calculation. This is the only time in the course you will be asked to calculate the regression coefficients manually; we think it is important to do it once so you have a sense of what is really happening.
The relationship between education and income has long been a topic of interest in economics. While education is often assumed to lead to higher income, it is challenging to definitively prove causation. In this analysis, we explore the relationship between years of schooling and annual income.
The table below shows the average years of schooling completed by individuals in 1970 and their corresponding average annual income (in USD) in 1990 for five regions. Income is observed for a later time period because it takes time for education to have an impact on earning potential. Our hypothesis is that the dependent variable, income (\(Y\)), is a function of the independent variable, education (\(X\)).
| Region | \(X\): Average Years of Schooling (1970) | \(Y\): Average Annual Income (1990, USD) |
|---|---|---|
| North Region | 10 | 28,000 |
| South Region | 7 | 21,000 |
| East Region | 12 | 35,000 |
| West Region | 9 | 24,000 |
| Central Region | 8 | 22,500 |
Using the appropriate formulas, show how to calculate each of the following:
Please (1) write the appropriate formula; (2) plug in the appropriate values; and (3) show the computed answer. You are not required to show each arithmetic step, but you must show enough work for us to follow your steps. You also do not need to show the intermediate calculations between Steps 2 and 3. See the Appendix at the end of this problem set or a refresher on key formulas.
#Prework
# 1. Input Variables
education <- c(10, 7, 12, 9, 8)
income <- c(28000, 21000, 35000, 24000, 22500)
data <- data.frame(education, income)
# 2. Run the Regression Model
model <- lm(income ~ education, data = data)
# 3. Display the Results
# a and b {$\hat{\beta_1}$, the estimated slope coefficient from the regression $Y = \beta_0 + \beta_1 X + u$, $\hat{\beta_0}$, the estimated intercept term from the same regression}
coef(model)
## (Intercept) education
## -256.7568 2864.8649
# c {$\hat{Y}$, the predicted values for the five regions}
predict(model)
## 1 2 3 4 5
## 28391.89 19797.30 34121.62 25527.03 22662.16
# d {$\hat{u}$, the OLS residual for each region}
residuals(model)
## 1 2 3 4 5
## -391.8919 1202.7027 878.3784 -1527.0270 -162.1622
# e {$\sum_i \hat{u}^2_i$, the sum of squared residuals}
sum(residuals(model)^2)
## [1] 4729730
In a concise paragraph drawing on the numbers you calculated above, describe the relationship between years of schooling and income as precisely as you can. Indicate the direction and magnitude of the relationship based on this sample.
| The calculations show a positive relationship between years of education and earnings. With a marginal increase in years of education, earnings will increase by $2,864.86 (the value of \(\hat{\beta}_1\)). Also, when a person has zero years of education, the earnings will be -$256.71 (the value of \(\hat{\beta}_0\)) |
These next few questions involve using R. First, run the cell below to load the data for this question into your R environment.
# Run me!
data <- data.frame(
region = c("North Region", "South Region", "East Region", "West Region", "Central Region"),
years_of_schooling = c(10, 7, 12, 9, 8), # X values
annual_income = c(28000, 21000, 35000, 24000, 22500) # Y values
)
# View the data frame
head(data)
## region years_of_schooling annual_income
## 1 North Region 10 28000
## 2 South Region 7 21000
## 3 East Region 12 35000
## 4 West Region 9 24000
## 5 Central Region 8 22500
In the next questions, we are transitioning to using R to run regressions and analyze the output. Familiarize yourself with running regressions in R with the following two screencasts tailored to this course:
Part 1: Running lm goes over the basic syntax of the lm functions mentioned in lecture.
Part 2: Summarizing output from
lm shows how to summarize and extract quantities of interest from a
lm object.
Write “Done” below when you have finished this step.
| Done |
Using R, run a linear regression of years of schooling on income using the dataset you loaded in 1.3. From the summary R output, find and report \(\hat{\beta_0}\) and \(\hat{\beta_1}\). Do the values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) match the quantities you calculated above?
p1.4 <- lm(annual_income ~ years_of_schooling, data = data)
# summaries
summary(p1.4)
##
## Call:
## lm(formula = annual_income ~ years_of_schooling, data = data)
##
## Residuals:
## 1 2 3 4 5
## -391.9 1202.7 878.4 -1527.0 -162.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -256.8 3054.8 -0.084 0.93831
## years_of_schooling 2864.9 326.4 8.778 0.00311 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 3 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.95
## F-statistic: 77.05 on 1 and 3 DF, p-value: 0.003115
typeof(p1.4)
## [1] "list"
p1.4$coefficients
## (Intercept) years_of_schooling
## -256.7568 2864.8649
#Yes, values for $\hat{\beta_0}$ and $\hat{\beta_1}$ match the quantities you calculated above
Consider the hypothesis \(H_0:\beta_1=0\) in the context of the above regression. Write your answers to the following questions in the space below:
Write your answers to the following questions in the space below. You are required to use R to support your answers.
# a
# Predicted income for 8 years of schooling using the p1.4 model
predicted_8 <- predict(p1.4, newdata = data.frame(years_of_schooling = 8))
predicted_8
## 1
## 22662.16
# Actual value for 8 years (Central Region) from the data frame
actual_8 <- data$annual_income[data$years_of_schooling == 8]
actual_8
## [1] 22500
# Calculating the difference (the residual)
residual_8 <- actual_8 - predicted_8
residual_8
## 1
## -162.1622
Since \(\hat{\beta}_1 \approx \$2,864.86\), therefore, increasing \(\hat{\beta}_1\) by two would give us an additional \(\mathbf{\$5,729.72}\) in the annual income. # c East region: 12 years North region: 10 years Difference: 2 years \[\text{Predicted Difference} = \$5,729.72\]
Graph the 5 data points and the best-fit regression line using
ggplot2. Be sure to label the axes. Optional: Label each
region as well as the slope and intercept of the regression line.
ggplot(data, aes(x = years_of_schooling, y = annual_income)) +
geom_point(color = "black", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_text(aes(label = region), vjust = -1, size = 3) +
labs(
title = "Education vs. Annual Income by Region",
subtitle = "Slope: 2,864.86 | Intercept: -256.71",
x = "Average Years of Schooling (1970)",
y = "Average Annual Income (1990, USD)"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The purpose of this next section is to help you further develop your R skills to analyze data, and build off the first R training session.
The dataset for this question is data from the National Health and Nutrition Examination Survey (NHANES), a program of studies conducted by the CDC that collects health, nutrition, and demographic data through interviews, physical examinations, and laboratory tests to assess the health and nutritional status of the U.S. population.
Please run the following cell to load the data for this problem.
# Run me!
a1c_data <- read.csv("https://raw.githubusercontent.com/madisoncoots/api-202mz/main/a1c_data.csv")
head(a1c_data)
## age bmi a1c
## 1 66 31.7 6.2
## 2 18 21.5 5.2
## 3 13 18.1 5.6
## 4 66 23.7 6.2
## 5 75 38.9 6.3
## 6 56 21.3 5.7
The variables in this dataset are:
age: the respondent’s age in yearsbmi: the respondent’s body mass index (BMI) in \(kg/m^2\)a1c: the percentage of glycohemoglobin in the
respondent’s blood, commonly known as Hemoglobin A1c. This measurement
is crucial for assessing long-term blood glucose levels and is commonly
used in diagnosing and monitoring diabetes.Often, there is a modal level of a continuous variable like BMI in
health datasets. You can see this visually in a histogram. Note: While
BMI is a continuous variable, in a1c_data.csv, it is binned
by tenths of a percentage point, making it into a pseudo-categorical
variable that we can visualize with a histogram.
Use ggplot to plot a histogram of bmi
and report its modal value (you can assess this visually).
Produce the same histogram, but using proportions instead of counts on the y-axis. What percentage of observations are associated with the modal BMI value? (Hint: the prop.table function might be helpful here.)
Using “base” R (i.e., without additional packages loaded), you
can use the summary command (e.g.,
summary(dataset$var)) to get summary statistics for a
variable called var in a dataset called
dataset. What is the median of the
bmi?
Using R with tidyverse loaded, you can use the
summarize command to generate whatever summary statistic
you want. In this case, you can type
summarize(data, med_var=median(var)) to show the median of
a variable var represented with the label
med_var. Confirm that you get the same answer for the
median of bmi with this method (though this seems
duplicative, we will be using the summarize command more
extensively throughout the semester).
A1C reflects average blood glucose levels over 2-3 months and is a common diagnostic marker for diabetes and pre-diabetes. Using R, create a simple diabetes risk model by running a linear regression to predict A1C as a function of age and BMI. Report the model summary.
Based on the output above, provide interpretations for the
coefficients on age (\(\hat{\beta_1}\)) and on bmi
(\(\hat{\beta_2}\)).
Next, train a regression model that predicts age as a
function of bmi and extract the residuals from this model.
In R you can extract the residuals from an lm
object called model by calling
residuals(model).
Next, train a regression model that predicts a1c as a
function of the residuals from the model in problem 2.5. Report the
model coefficient on the residuals, \(\hat{\beta_{res}}\). Comment on this
coefficient’s value as it relates to \(\hat{\beta_1}\) from problem 2.4 provide an
explanation for this relationship.