title: “API 202-MZ Problem Set 1” output: pdf_document

Due Friday, February 6, 2026 before 5:00 pm ET

INSTRUCTIONS

To successfully complete this problem set, please follow these steps:

Include all your answers (including written responses, code, code output, and plots) in this RMarkdown. Typing out mathematical answers using Latex syntax is encouraged. For a helpful reference on how to use this syntax to write your solutions, see the Latex Math Guide under Canvas > Files > Resources.
Once your homework is complete, please knit this R Markdown file and submit as a PDF. To do so, knit to this .Rmd file to a PDF. Only PDF submissions will be accepted. We strongly advise you to test this work flow well in advance of the deadline. If you need help, please come to office hours.
Please submit an electronic copy of the PDF to the Canvas course page.

HKS HONOR CODE: We abide by the Harvard Kennedy School Academic code for all aspects of the course. In terms of problem sets, unless explicitly written otherwise, the norms are the following: You are free (and encouraged) to discuss problem sets with your classmates. However, you must hand in your own unique written work and code in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your classmate(s), sitting side-by-side and going through the problem set question-by-question, but you must each type your own answers and your own code. For more details, please see syllabus.

IDENTIFICATION

Your information:
- Last name: Naqvi
- First name: Mustafa
Study partners (please list below the classmates you worked with on this problem set, if any):
- Name: Avanti
- Name:
- Name:
- Name:
Compliance with Harvard Kennedy School Academic Code: I certify that my work in this problem set complies with the Harvard Kennedy School Academic Code.
- Type “Yes” to agree:

PART 0 – SETTING YOURSELF UP FOR SUCCESS

Syllabus review: We briefly discussed the syllabus on our first day of class. The syllabus contains a lot of information that is important for you to know so you can maximize your learning in this course. Please read the syllabus on Canvas.

Put an ‘X’ in the cells below when you are done with this task:

Task	Yes
I have reviewed the syllabus

PART 1 - EDUCATION AND INCOME

The purpose of this exercise is to help you learn the mechanics of ordinary least squares (OLS) regression and understand what the different terms in a regression mean. You will calculate the regression “by hand” using the formulas developed in class, and then you will use R to confirm the calculation. This is the only time in the course you will be asked to calculate the regression coefficients manually; we think it is important to do it once so you have a sense of what is really happening.

The relationship between education and income has long been a topic of interest in economics. While education is often assumed to lead to higher income, it is challenging to definitively prove causation. In this analysis, we explore the relationship between years of schooling and annual income.

The table below shows the average years of schooling completed by individuals in 1970 and their corresponding average annual income (in USD) in 1990 for five regions. Income is observed for a later time period because it takes time for education to have an impact on earning potential. Our hypothesis is that the dependent variable, income ($Y$), is a function of the independent variable, education ($X$).

Region	$X$: Average Years of Schooling (1970)	$Y$: Average Annual Income (1990, USD)
North Region	10	28,000
South Region	7	21,000
East Region	12	35,000
West Region	9	24,000
Central Region	8	22,500

Problem 1.1

Using the appropriate formulas, show how to calculate each of the following:

$\hat{\beta_1}$, the estimated slope coefficient from the regression $Y = \beta_0 + \beta_1 X + u$
$\hat{\beta_0}$, the estimated intercept term from the same regression
$\hat{Y}$, the predicted values for the five regions
$\hat{u}$, the OLS residual for each region
$\sum_i \hat{u}^2_i$, the sum of squared residuals

Please (1) write the appropriate formula; (2) plug in the appropriate values; and (3) show the computed answer. You are not required to show each arithmetic step, but you must show enough work for us to follow your steps. You also do not need to show the intermediate calculations between Steps 2 and 3. See the Appendix at the end of this problem set or a refresher on key formulas.

SOLUTION

#Prework
# 1. Input Variables
education <- c(10, 7, 12, 9, 8)
income <- c(28000, 21000, 35000, 24000, 22500)
data <- data.frame(education, income)

# 2. Run the Regression Model
model <- lm(income ~ education, data = data)

# 3. Display the Results
# a and b {$\hat{\beta_1}$, the estimated slope coefficient from the regression $Y = \beta_0 + \beta_1 X + u$, $\hat{\beta_0}$, the estimated intercept term from the same regression}
coef(model)

## (Intercept)   education 
##   -256.7568   2864.8649

# c {$\hat{Y}$, the predicted values for the five regions}
predict(model)

##        1        2        3        4        5 
## 28391.89 19797.30 34121.62 25527.03 22662.16

# d {$\hat{u}$, the OLS residual for each region}
residuals(model)

##          1          2          3          4          5 
##  -391.8919  1202.7027   878.3784 -1527.0270  -162.1622

# e {$\sum_i \hat{u}^2_i$, the sum of squared residuals}
sum(residuals(model)^2)

## [1] 4729730

Problem 1.2

In a concise paragraph drawing on the numbers you calculated above, describe the relationship between years of schooling and income as precisely as you can. Indicate the direction and magnitude of the relationship based on this sample.

SOLUTION

The calculations show a positive relationship between years of education and earnings. With a marginal increase in years of education, earnings will increase by $2,864.86 (the value of $\hat{\beta}_1$). Also, when a person has zero years of education, the earnings will be -$256.71 (the value of $\hat{\beta}_0$)

Problem 1.3

These next few questions involve using R. First, run the cell below to load the data for this question into your R environment.

# Run me!

data <- data.frame(
  region = c("North Region", "South Region", "East Region", "West Region", "Central Region"),
  years_of_schooling = c(10, 7, 12, 9, 8),  # X values
  annual_income = c(28000, 21000, 35000, 24000, 22500)     # Y values
)

# View the data frame
head(data)

##           region years_of_schooling annual_income
## 1   North Region                 10         28000
## 2   South Region                  7         21000
## 3    East Region                 12         35000
## 4    West Region                  9         24000
## 5 Central Region                  8         22500

In the next questions, we are transitioning to using R to run regressions and analyze the output. Familiarize yourself with running regressions in R with the following two screencasts tailored to this course:

Part 1: Running lm goes over the basic syntax of the lm functions mentioned in lecture.

Part 2: Summarizing output from lm shows how to summarize and extract quantities of interest from a lm object.

Write “Done” below when you have finished this step.

SOLUTION

Done

Problem 1.4

Using R, run a linear regression of years of schooling on income using the dataset you loaded in 1.3. From the summary R output, find and report $\hat{\beta_0}$ and $\hat{\beta_1}$. Do the values for $\hat{\beta_0}$ and $\hat{\beta_1}$ match the quantities you calculated above?

SOLUTION

p1.4 <- lm(annual_income ~ years_of_schooling, data = data)
# summaries
summary(p1.4)

## 
## Call:
## lm(formula = annual_income ~ years_of_schooling, data = data)
## 
## Residuals:
##       1       2       3       4       5 
##  -391.9  1202.7   878.4 -1527.0  -162.2 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          -256.8     3054.8  -0.084  0.93831   
## years_of_schooling   2864.9      326.4   8.778  0.00311 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 3 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:   0.95 
## F-statistic: 77.05 on 1 and 3 DF,  p-value: 0.003115

typeof(p1.4)

## [1] "list"

p1.4$coefficients

##        (Intercept) years_of_schooling 
##          -256.7568          2864.8649

#Yes, values for $\hat{\beta_0}$ and $\hat{\beta_1}$ match the quantities you calculated above

Problem 1.5

Consider the hypothesis $H_0:\beta_1=0$ in the context of the above regression. Write your answers to the following questions in the space below:

Why would you be interested in testing this hypothesis?
Would you reject the hypothesis $H_0:\beta_1=0$ at the 5% significance level? Explain.

SOLUTION

To understand the relationship between schooling years and education, we will find the relationship. $H_0:\beta_1=0$ implies that there is no relationship, so we will try to verify if this is true or not true.
To determine whether we should reject the null hypothesis $H_0: \beta_1=0$, we examine the p-value associated with the years_of_schooling coefficient in the R summary output. If the p-value is less than 0.05, we will reject the null hypothesis at the 5percent significance level. This 5percent is a rule of thumb, however, it can differ from case to case depending on how comfortable the researcher is when p_value>5 percent. Given the calculated slope of $\hat{\beta}_1 \approx 2,864.86$ and the strong positive trend in the data, the p-value is expected to be below this threshold. If true, we would reject the null hypothesis, concluding that there is statistically significant evidence at the 5% level that a relationship exists between years of schooling and annual income.

Problem 1.6

Write your answers to the following questions in the space below. You are required to use R to support your answers.

What is the value of annual income (in USD) for an individual with 8 years of schooling predicted by the OLS regression? What is the real value in the data? What term is used to denote their difference? Write R code to compute these values.
Suppose one individual has 2 more years of schooling than another. What would your OLS regression predict is the difference in their annual income (in USD)?
What is the predicted difference in annual income (in USD) between the average individual in the East Region and the average individual in the North Region? How does it compare to your answer in part b?

SOLUTION

# a
# Predicted income for 8 years of schooling using the p1.4 model
predicted_8 <- predict(p1.4, newdata = data.frame(years_of_schooling = 8))
predicted_8

##        1 
## 22662.16

# Actual value for 8 years (Central Region) from the data frame
actual_8 <- data$annual_income[data$years_of_schooling == 8]
actual_8

## [1] 22500

# Calculating the difference (the residual)
residual_8 <- actual_8 - predicted_8
residual_8

##         1 
## -162.1622

b

Since $\hat{\beta}_1 \approx \$2,864.86$, therefore, increasing $\hat{\beta}_1$ by two would give us an additional $\mathbf{\$5,729.72}$ in the annual income. # c East region: 12 years North region: 10 years Difference: 2 years \[\text{Predicted Difference} = \$5,729.72\]

Since our OLS model is linear, this means that the effect of an additional year of schooling is constant, regardless of the region. Whether you are comparing an individual with 0 vs. 2 years of schooling or 10 vs. 12 years of schooling, the predicted difference will always be $2 \times \hat{\beta}_1$

Problem 1.7

Graph the 5 data points and the best-fit regression line using ggplot2. Be sure to label the axes. Optional: Label each region as well as the slope and intercept of the regression line.

SOLUTION

ggplot(data, aes(x = years_of_schooling, y = annual_income)) +
  geom_point(color = "black", size = 3) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  geom_text(aes(label = region), vjust = -1, size = 3) + 
  labs(
    title = "Education vs. Annual Income by Region",
    subtitle = "Slope: 2,864.86 | Intercept: -256.71",
    x = "Average Years of Schooling (1970)",
    y = "Average Annual Income (1990, USD)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

PART 2 - DIABETES RISK

The purpose of this next section is to help you further develop your R skills to analyze data, and build off the first R training session.

The dataset for this question is data from the National Health and Nutrition Examination Survey (NHANES), a program of studies conducted by the CDC that collects health, nutrition, and demographic data through interviews, physical examinations, and laboratory tests to assess the health and nutritional status of the U.S. population.

Please run the following cell to load the data for this problem.

# Run me!
a1c_data <- read.csv("https://raw.githubusercontent.com/madisoncoots/api-202mz/main/a1c_data.csv")

head(a1c_data)

##   age  bmi a1c
## 1  66 31.7 6.2
## 2  18 21.5 5.2
## 3  13 18.1 5.6
## 4  66 23.7 6.2
## 5  75 38.9 6.3
## 6  56 21.3 5.7

The variables in this dataset are:

age: the respondent’s age in years
bmi: the respondent’s body mass index (BMI) in $kg/m^2$
a1c: the percentage of glycohemoglobin in the respondent’s blood, commonly known as Hemoglobin A1c. This measurement is crucial for assessing long-term blood glucose levels and is commonly used in diagnosing and monitoring diabetes.

Problem 2.1

Often, there is a modal level of a continuous variable like BMI in health datasets. You can see this visually in a histogram. Note: While BMI is a continuous variable, in a1c_data.csv, it is binned by tenths of a percentage point, making it into a pseudo-categorical variable that we can visualize with a histogram.

Use ggplot to plot a histogram of bmi and report its modal value (you can assess this visually).
Produce the same histogram, but using proportions instead of counts on the y-axis. What percentage of observations are associated with the modal BMI value? (Hint: the prop.table function might be helpful here.)

SOLUTION

Problem 2.2

Using “base” R (i.e., without additional packages loaded), you can use the summary command (e.g., summary(dataset$var)) to get summary statistics for a variable called var in a dataset called dataset. What is the median of the bmi?
Using R with tidyverse loaded, you can use the summarize command to generate whatever summary statistic you want. In this case, you can type summarize(data, med_var=median(var)) to show the median of a variable var represented with the label med_var. Confirm that you get the same answer for the median of bmi with this method (though this seems duplicative, we will be using the summarize command more extensively throughout the semester).

SOLUTION

Problem 2.3

A1C reflects average blood glucose levels over 2-3 months and is a common diagnostic marker for diabetes and pre-diabetes. Using R, create a simple diabetes risk model by running a linear regression to predict A1C as a function of age and BMI. Report the model summary.

SOLUTION

Problem 2.4

Based on the output above, provide interpretations for the coefficients on age ($\hat{\beta_1}$) and on bmi ($\hat{\beta_2}$).

SOLUTION

Problem 2.5

Next, train a regression model that predicts age as a function of bmi and extract the residuals from this model. In R you can extract the residuals from an lm object called model by calling residuals(model).

SOLUTION

Problem 2.6

Next, train a regression model that predicts a1c as a function of the residuals from the model in problem 2.5. Report the model coefficient on the residuals, $\hat{\beta_{res}}$. Comment on this coefficient’s value as it relates to $\hat{\beta_1}$ from problem 2.4 provide an explanation for this relationship.