In this problem set, you will begin working with the Educational Attainment and Wage Equations (EAWE) dataset. This dataset is a subset of the National Longitudinal Survey of Youth 1997 (NLSY97). The EAWE dataset is described in Appendix B of your book (pages 565-569). As described in Appendix B, the EAWE is broken into 22 parallel datasets. For all problem sets in this class, you should use EAWE dataset #1. I have posted this dataset in .csv format on Canvas under the filename “EAWE01.csv” and a scanned copy of the description of the dataset from Appendix B under the filename “Appendix B EAWE Description.pdf.”


0. Setup

Load any necessary libraries and assign the dataset.

# Load the tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df = read_csv("EAWE01.csv")
## Rows: 500 Columns: 97
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): GENDER
## dbl (96): ID, FEMALE, MALE, BYEAR, AGE, AGEMBTH, HHINC97, POVRAT97, HHBMBF, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load broom package
library(broom)

1. Variable Descriptions & Summary Statistics

What do the variables EARNINGS and S in the dataset measure? What is the sample mean and sample standard deviation of each of these variables? Be sure to include the units for the variables in your answers.

  1. EARNINGS measures each respondent's current hourly earnings, in dollars, as reported at the 2011 interview.

  2. S measures years of schooling, defined as the highest grade completed as of 2011.

# Calculate the sample mean of EARNINGS and S
mean_EARNINGS = mean(df$EARNINGS)
mean_S = mean(df$S)

From the calculations, I found that the sample mean of EARNINGS is 18.34576 dollars per hour and the sample mean of S is 14.548 years.

# Calculate the sample standard deviation of EARNINGS and S
sd_EARNINGS = sd(df$EARNINGS)
sd_S = sd(df$S)

From the calculations, I found that the sample standard deviation of EARNINGS is 10.7234258 dollars per hour and the sample standard deviation of S is 2.7797752 years.
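
As a quick check on what `mean()` and `sd()` compute, here is a minimal sketch on a made-up vector `x` (hypothetical values, not the EAWE data). It assumes only base R and shows that `sd()` uses the sample formula with an n - 1 denominator.

```r
# Hypothetical toy vector of years of schooling (NOT the EAWE data)
x <- c(12, 14, 16, 12, 18)
n <- length(x)

# Sample mean: sum of the observations divided by n
mean_by_hand <- sum(x) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Both comparisons should agree with the built-in functions
all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```

Both statistics carry the units of the underlying variable, so here they would be in years.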


2. Minimums & Maximums

What are the minimum and maximum values for S in the sample?

# Calculate the min of `S`
min(df$S)
## [1] 6
# Calculate the max of `S`
max(df$S)
## [1] 20

3. Percentages

What percentage of individuals in the sample are married?

# Calculate the percentage of married individuals
mean(df$MARRIED) * 100
## [1] 40.4
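
The calculation above works because MARRIED is coded as a 0/1 indicator, so its mean is the share of ones. A minimal sketch of that idea, using a made-up indicator vector `married` (hypothetical, not the EAWE data):

```r
# Hypothetical 0/1 indicator: 1 = married, 0 = not married
married <- c(1, 0, 0, 1, 0)

# Percentage via the mean of the indicator
share_via_mean <- mean(married) * 100

# Same percentage via an explicit count of ones over the sample size
share_via_count <- sum(married == 1) / length(married) * 100

# The two approaches should give the same number
all.equal(share_via_mean, share_via_count)
```

This shortcut assumes the indicator has no missing values; with `NA`s present, `mean()` would need `na.rm = TRUE`.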

4. Averages

What is the average age of individuals in the sample?

# Calculate the average of AGE
mean(df$AGE)
## [1] 28.908

5. Histogram of EARNINGS

Produce a histogram of the EARNINGS variable.

# Produce a histogram of EARNINGS
hist(df$EARNINGS)


6. Regression: ASVABC & S

ASVABC is a variable meant to measure a person’s “intellectual ability,” as scored based on standardized tests. Estimate a simple linear regression model where S is the dependent variable and ASVABC is the independent variable. Report and interpret the estimated parameter on ASVABC. Report the value of the \(R^2\).

# Regress S on ASVABC
model_6 = lm(S ~ ASVABC, data = df)
tidy(model_6)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    14.2      0.109     130.  0       
## 2 ASVABC          1.62     0.118      13.7 1.49e-36
# Find $R^2$ 
glance(model_6)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.274         0.273  2.37      188. 1.49e-36     1 -1140. 2286. 2299.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The estimated parameter on ASVABC is 1.62, which indicates that a one-unit increase in ASVABC is associated with 1.62 additional years of schooling completed, on average. (I say "one unit" rather than "one point" because ASVABC is derived from standardized test scores and I am not sure of its exact scale.) The value of \(R^2\) is 0.274.
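
To make the slope and \(R^2\) less of a black box, here is a minimal sketch on simulated data (the variables `x` and `y` and their parameters are made up for illustration, not taken from EAWE). In a simple regression with one regressor, the OLS slope equals the sample covariance divided by the variance of the regressor, and \(R^2\) equals the squared sample correlation.

```r
# Simulated data for illustration only (NOT the EAWE data)
set.seed(1)
x <- rnorm(100)                          # stand-in for a test score
y <- 14 + 1.5 * x + rnorm(100, sd = 2)   # stand-in for years of schooling

fit <- lm(y ~ x)

# OLS slope in a simple regression: cov(x, y) / var(x)
slope_by_hand <- cov(x, y) / var(x)

# R^2 in a simple regression: squared correlation between x and y
r2_by_hand <- cor(x, y)^2

# Both should match what lm() reports
all.equal(unname(coef(fit)["x"]), slope_by_hand)
all.equal(summary(fit)$r.squared, r2_by_hand)
```

The \(R^2\) shortcut holds only with a single regressor and an intercept; it does not carry over to multiple regression.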


7. Scatterplot with Regression Line

Produce a scatterplot where S is on the y-axis and ASVABC is on the x-axis. Include in the scatterplot the fitted regression line from the regression in the previous question.

# Produce a scatterplot
plot(df$ASVABC, df$S)
# Add a regression line
abline(lm(S ~ ASVABC, data = df), col = "blue")


8. Regression: S & EARNINGS

Estimate a simple linear regression model where EARNINGS is the dependent variable and S is the independent variable. Report and interpret the estimated parameter on S. Report the value of the \(R^2\).

# Regress EARNINGS on S
model_8 = lm(EARNINGS ~ S, data = df)
tidy(model_8)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     3.09     2.46       1.25 2.10e- 1
## 2 S               1.05     0.166      6.30 6.41e-10
# Find $R^2$ 
glance(model_8)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0739        0.0720  10.3      39.7 6.41e-10     1 -1876. 3758. 3771.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The estimated parameter on S is 1.05, which indicates that one additional year of schooling completed is associated with hourly earnings that are 1.05 dollars higher, on average. The value of \(R^2\) is 0.0739.


9. Predicted EARNINGS

Using the regression that you just estimated, what is the predicted value of EARNINGS for someone with 16 years of schooling?

# Predict EARNINGS for S = 16
predict(model_8, newdata = data.frame(S = 16))
##        1 
## 19.86842

For someone with 16 years of schooling, the regression predicts that they will earn 19.87 dollars per hour, on average.
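
The prediction is just the fitted line evaluated at S = 16. Plugging in the rounded coefficients reported above gives 3.09 + 1.05 × 16 = 19.89, which differs slightly from 19.87 only because the displayed coefficients are rounded. The sketch below checks the same identity on simulated data (the data frame `toy` and its parameters are hypothetical, not the EAWE data).

```r
# Simulated data for illustration only (NOT the EAWE data)
set.seed(2)
s <- sample(8:20, 50, replace = TRUE)      # stand-in for years of schooling
earn <- 3 + 1 * s + rnorm(50, sd = 5)      # stand-in for hourly earnings
toy <- data.frame(S = s, EARNINGS = earn)

fit <- lm(EARNINGS ~ S, data = toy)
b <- coef(fit)

# Prediction by hand: intercept + slope * 16
by_hand <- b[["(Intercept)"]] + b[["S"]] * 16

# Prediction via predict(), as in the answer above
via_predict <- predict(fit, newdata = data.frame(S = 16))

# The two should agree up to floating-point error
all.equal(by_hand, unname(via_predict))
```

Using `predict()` or the unrounded `coef()` values avoids the small rounding discrepancy that comes from copying coefficients off the printed table.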


10. Regression: Height & EARNINGS

Is there a relationship between height and earnings? Estimate a simple linear regression model where EARNINGS is the dependent variable and HEIGHT is the independent variable. Report and interpret the estimated parameter on HEIGHT. Report the value of the \(R^2\).

# Regress EARNINGS on HEIGHT
model_10 = lm(EARNINGS ~ HEIGHT, data = df)
tidy(model_10)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -0.513     7.97    -0.0644  0.949 
## 2 HEIGHT         0.277     0.117    2.37    0.0182
# Find $R^2$ 
glance(model_10)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0112       0.00917  10.7      5.62  0.0182     1 -1892. 3791. 3803.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

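Yes, there appears to be a positive relationship, although a weak one. The estimated parameter on HEIGHT is 0.277, which indicates that a one-inch increase in height is associated with hourly earnings that are 0.277 dollars higher, on average. The estimate is statistically significant at the 5% level (p-value 0.0182), but the value of \(R^2\) is only 0.0112, so HEIGHT explains about 1% of the variation in EARNINGS.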
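
For reference, the `statistic` and `p.value` columns in the `tidy()` output can be reproduced from the estimate and its standard error; with the rounded values above, 0.277 / 0.117 is approximately 2.37. The following is a minimal sketch on simulated data (the variables `height` and `earn` are made up, not the EAWE data), assuming the usual two-sided t-test with the residual degrees of freedom.

```r
# Simulated data for illustration only (NOT the EAWE data)
set.seed(3)
height <- rnorm(200, mean = 68, sd = 4)        # stand-in for height in inches
earn <- 0.3 * height + rnorm(200, sd = 10)     # stand-in for hourly earnings

fit <- lm(earn ~ height)
coefs <- summary(fit)$coefficients

est <- coefs["height", "Estimate"]
se <- coefs["height", "Std. Error"]

# t-statistic: estimate divided by its standard error
t_by_hand <- est / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)

# Both should match the values reported by summary()
all.equal(t_by_hand, coefs["height", "t value"])
all.equal(p_by_hand, coefs["height", "Pr(>|t|)"])
```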