In this problem set, you will begin working with the Educational Attainment and Wage Equations (EAWE) dataset. This dataset is a subset of the National Longitudinal Survey of Youth 1997 (NLSY97). The EAWE dataset is described in Appendix B of your book (pages 565-569). As described in Appendix B, the EAWE is broken into 22 parallel datasets. For all problem sets in this class, you should use EAWE dataset #1. I have posted this dataset in .csv format on Canvas under the filename “EAWE01.csv” and a scanned copy of the description of the dataset from Appendix B under the filename “Appendix B EAWE Description.pdf.”
Load any necessary libraries and assign the dataset.
# Load the tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df = read_csv("EAWE01.csv")
## Rows: 500 Columns: 97
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): GENDER
## dbl (96): ID, FEMALE, MALE, BYEAR, AGE, AGEMBTH, HHINC97, POVRAT97, HHBMBF, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load broom package
library(broom)
What do the variables EARNINGS and S in the
dataset measure? What is the sample mean and sample standard deviation
of each of these variables? Be sure to include the units for the
variables in your answers.
EARNINGS measures current hourly earnings, in dollars, as
reported by respondents in the 2011 interview.
S measures years of schooling, defined as the highest
grade completed as of 2011, when the data were collected.
# Calculate the sample mean of EARNINGS and S
mean_EARNINGS = mean(df$EARNINGS)
mean_S = mean(df$S)
From the calculations, I found that the sample mean of
EARNINGS is 18.34576 dollars per hour and the sample mean of S is
14.548 years.
# Calculate the sample standard deviation of EARNINGS and S
sd_EARNINGS = sd(df$EARNINGS)
sd_S = sd(df$S)
From the calculations, I found that the sample standard deviation of
EARNINGS is 10.7234258 dollars per hour and the sample standard
deviation of S is 2.7797752 years.
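For reference, the formulas behind these two calculations can be written out as below. This is the standard definition of the sample mean and the sample standard deviation (R's sd() divides by \(n - 1\), not \(n\)); the symbols \(x_i\), \(\bar{x}\), and \(s_x\) are notation introduced here, with \(n = 500\) taken from the row count reported when the file was read.

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \]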
What are the minimum and maximum values for S in the
sample?
# Calculate the min of `S`
min(df$S)
## [1] 6
# Calculate the max of `S`
max(df$S)
## [1] 20
What percentage of individuals in the sample are married?
# Calculate the percentage of married individuals
mean(df$MARRIED) * 100
## [1] 40.4
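The calculation above works because the mean of a 0/1 indicator equals the share of ones. A minimal sketch of that idea, assuming MARRIED is coded 1 for married and 0 otherwise; the vector `married_toy` is made up for illustration and is not taken from EAWE01.csv:

```r
# Hypothetical 0/1 indicator for five people (illustrative, not EAWE data)
married_toy <- c(1, 0, 0, 1, 0)

# The mean of an indicator is the proportion of ones; times 100 gives a percentage
share_married <- mean(married_toy) * 100

# Cross-check by counting the ones directly
share_by_count <- sum(married_toy == 1) / length(married_toy) * 100

share_married
```

Both approaches give the same number, so mean() is simply the shorter way to write the count-based share.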
What is the average age of individuals in the sample?
# Calculate the average of AGE
mean(df$AGE)
## [1] 28.908
Produce a histogram of the EARNINGS variable.
# Produce a histogram of EARNINGS
hist(df$EARNINGS)
ASVABC is a variable meant to measure a person’s
“intellectual ability,” as scored based on standardized tests. Estimate
a simple linear regression model where S is the dependent
variable and ASVABC is the independent variable. Report and
interpret the estimated parameter on ASVABC. Report the
value of the \(R^2\).
# Regress S on ASVABC
model_6 = lm(S ~ ASVABC, data = df)
tidy(model_6)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.2 0.109 130. 0
## 2 ASVABC 1.62 0.118 13.7 1.49e-36
# Find $R^2$
glance(model_6)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.274 0.273 2.37 188. 1.49e-36 1 -1140. 2286. 2299.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The estimated parameter on ASVABC is 1.62, which
indicates that a one-unit increase in ASVABC is associated with an
increase of 1.62 years of schooling completed, on average. I am not
certain what one unit of ASVABC represents, since the variable is a
score built from standardized tests; Appendix B describes how it is
scaled. The value of \(R^2\) is 0.274.
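Written out with the rounded estimates from the output above, the fitted line is:

\[ \hat{S} = 14.2 + 1.62 \times \text{ASVABC} \]

In a regression with a single regressor, \(R^2\) equals the squared sample correlation between the two variables, so the correlation between S and ASVABC is roughly \(\sqrt{0.274} \approx 0.52\), positive because the slope is positive.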
Produce a scatterplot where S is on the y-axis and
ASVABC is on the x-axis. Include in the scatterplot the
fitted regression line from the regression in the previous
question.
# Produce a scatterplot
plot(df$ASVABC, df$S)
# Add a regression line
abline(lm(S ~ ASVABC, data = df), col = "blue")
Estimate a simple linear regression model where EARNINGS
is the dependent variable and S is the independent
variable. Report and interpret the estimated parameter on
S. Report the value of the \(R^2\).
# Regress EARNINGS on S
model_8 = lm(EARNINGS ~ S, data = df)
tidy(model_8)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.09 2.46 1.25 2.10e- 1
## 2 S 1.05 0.166 6.30 6.41e-10
# Find $R^2$
glance(model_8)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0739 0.0720 10.3 39.7 6.41e-10 1 -1876. 3758. 3771.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The estimated parameter on S is 1.05, which indicates
that a 1 year increase in years of schooling completed is associated with an
increase of 1.05 dollars in hourly earnings, on average. The value of \(R^2\) is 0.0739.
Using the regression that you just estimated, what is the predicted
value of EARNINGS for someone with 16 years of
schooling?
# Predict EARNINGS for S = 16
predict(model_8, newdata = data.frame(S = 16))
## 1
## 19.86842
For someone with 16 years of schooling, the regression predicts that they will earn 19.87 dollars per hour, on average.
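As a check on the predict() result, the same prediction can be worked by hand from the rounded coefficients in the regression output (intercept 3.09, slope 1.05):

\[ \widehat{\text{EARNINGS}} = 3.09 + 1.05 \times 16 = 19.89 \]

The small gap between 19.89 and the 19.87 reported by predict() comes from rounding: predict() uses the unrounded coefficients stored in the model, while the printed table shows three significant digits.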
Is there a relationship between height and earnings? Estimate a
simple linear regression model where EARNINGS is the
dependent variable and HEIGHT is the independent variable.
Report and interpret the estimated parameter on HEIGHT.
Report the value of the \(R^2\).
# Regress EARNINGS on HEIGHT
model_10 = lm(EARNINGS ~ HEIGHT, data = df)
tidy(model_10)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.513 7.97 -0.0644 0.949
## 2 HEIGHT 0.277 0.117 2.37 0.0182
# Find $R^2$
glance(model_10)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0112 0.00917 10.7 5.62 0.0182 1 -1892. 3791. 3803.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The estimated parameter on HEIGHT is 0.277, which
indicates that a 1 inch increase in height is associated with an increase of
0.277 dollars in hourly earnings, on average. The value of \(R^2\) is 0.0112.
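One way to address whether a relationship exists is the t statistic on HEIGHT, which is the estimate divided by its standard error. Using the rounded values from the output above:

\[ t = \frac{0.277}{0.117} \approx 2.37 \]

The reported p-value of 0.0182 is below 0.05 but above 0.01, so the slope is statistically different from zero at the 5% level but not at the 1% level. With an \(R^2\) of 0.0112, height accounts for only about 1% of the variation in hourly earnings in this sample.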