library(tidyverse)
library(modelsummary)
library(haven)
library(psych)
library(tibble)
library(broom)
library(knitr)

df_hsls <- read_spss("Datasets/HSLS_50_pct_sample_30_vars.sav")

1. Using unweighted observations, generate the N (or count of observations), min, max, mean, and standard deviation of the following variables (2):

  1. X1 Mathematics standardized theta score
  2. X2 Mathematics standardized theta score
  3. X3 Current job earnings per hour
  4. X3 Total credits earned
df_hsls %>% 
  select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>% 
  describe(fast = TRUE) %>% 
  kable()
vars n mean sd min max range se
X1TXMTSCOR 1 10684 51.00749 10.084101 24.0744 82.1876 58.1132 0.0975597
X2TXMTSCOR 2 10241 51.49905 10.138151 24.9578 84.9051 59.9473 0.1001815
X3EARNPERHR1 3 4058 9.04411 2.898820 6.0000 25.0000 19.0000 0.0455056
X3TCREDTOT 4 10929 24.08198 7.393138 0.0000 36.0000 36.0000 0.0707194

2. Compare the mean values for the Mathematics standardized theta scores in the two time periods? (2)

t.test(x = df_hsls$X1TXMTSCOR, 
       y = df_hsls$X2TXMTSCOR, 
       paired = TRUE, alternative = "two.sided") %>%
  tidy() %>% 
  kable()
estimate statistic p.value parameter conf.low conf.high method alternative
-0.1686957 -2.281326 0.0225517 9256 -0.3136467 -0.0237446 Paired t-test two.sided

3. What should a reader understand about the values in the N column of the table? (2)

The reader should be aware that this is a longitudinal study. Therefore the participants have been followed over time and naturally that number will decrease over time.

4. Generate a correlation matrix for this set of variables (2):

  1. X1 Mathematics standardized theta score
  2. X2 Mathematics standardized theta score
  3. X3 Current job earnings per hour
  4. X3 Total credits earned
df_hsls %>% 
  select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>% 
  cor(use = "pairwise.complete.obs") %>%  
  kable()
X1TXMTSCOR X2TXMTSCOR X3EARNPERHR1 X3TCREDTOT
X1TXMTSCOR 1.0000000 0.7490751 0.0171581 0.3010436
X2TXMTSCOR 0.7490751 1.0000000 0.0420769 0.2975178
X3EARNPERHR1 0.0171581 0.0420769 1.0000000 -0.0782904
X3TCREDTOT 0.3010436 0.2975178 -0.0782904 1.0000000

5. What would you expect to find based on your understanding of these variables? (2)

I would expect that test scores are highly positively correlated with one another. Additionally, I would expect test scores to be moderately, positively correlated with both earnings and credits earned. Lastly, I would expect credits earned to be moderately, positively correlated with hourly earnings.

6. If directional differences exist, what might explain them? (2)

I didn’t expect “credits earned” and “hourly wage” to have a negative correlation. I expected them to have a positive correlation (even if it’s not a statistically significant one). My theory is that this negative correlation is due to the presence of outliers.

cor.test(df_hsls$X3TCREDTOT, df_hsls$X3EARNPERHR1,)
## 
##  Pearson's product-moment correlation
## 
## data:  df_hsls$X3TCREDTOT and df_hsls$X3EARNPERHR1
## t = -4.8936, df = 3883, p-value = 1.03e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10946772 -0.04695915
## sample estimates:
##         cor 
## -0.07829039
ggplot(df_hsls, aes(X3TCREDTOT, df_hsls$X3EARNPERHR1)) +
  geom_point()

7. Including all observations from the original dataset, regress X3 Current job earnings per hour on X1 Mathematics standardized theta score. (2)

model_1 <- lm(data = df_hsls, X3EARNPERHR1 ~ X1TXMTSCOR)

tidy(model_1) %>% 
  kable()
term estimate std.error statistic p.value
(Intercept) 8.7461856 0.2584203 33.844805 0.000000
X1TXMTSCOR 0.0051254 0.0048618 1.054226 0.291847
glance(model_1)%>% 
  kable()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.0002944 2.95e-05 2.858591 1.111392 0.291847 1 -9322.954 18651.91 18670.62 30839.41 3774 3776

8. Write a narrative description of how we should interpret the coefficient with a value of 0.005. (2)

For every 1 point increase in Theta Math score, we can expect hourly wage to increase by 0.005 cents.

9. What would you say about the practical significance of this result and its meaning given what you understand about this data? (2)

This result potentially suggests to me that standardized test scores bear little weight on earnings. However, my understanding is that these participants are only a few years out of high school since the first time point. You could argue that this is too soon to see any impact.

10. Select a 5% random sample of cases from your dataset and then generate a scatter plot which includes the following variables and integrates the regression equation fit line and equation into the output (3):

  1. Y Axis: X3 Total credits earned
  2. X Axis: X1 Mathematics standardized theta score
df_hsls_sample <- df_hsls %>% slice_sample(prop = .05)

df_hsls_sample %>% 
  ggplot(aes(X1TXMTSCOR, X3TCREDTOT)) +
  geom_point() +
  geom_smooth(method = "lm")

model_sample <- lm(data = df_hsls_sample, X3TCREDTOT ~ X1TXMTSCOR)

tidy(model_sample) %>% 
  kable()
term estimate std.error statistic p.value
(Intercept) 16.2709250 1.5781735 10.309972 0e+00
X1TXMTSCOR 0.1624239 0.0302368 5.371733 1e-07

11. Generate a new variable called Credit10 that calculates a student’s X3Total credits earned value multiplied by the number 10 using the Transform | Compute menu. Regress this resulting variable on X1 Mathematics standardized theta score and include the output below. (2)

df_hsls_sample <- 
  df_hsls_sample %>% 
  mutate(credit10 = X3TCREDTOT * 10)

df_hsls_sample %>% 
  ggplot(aes(X1TXMTSCOR, credit10)) +
  geom_point() +
  geom_smooth(method = "lm")

model_credit10 <- lm(data = df_hsls_sample, credit10 ~ X1TXMTSCOR)

tidy(model_credit10) %>% 
  kable()
term estimate std.error statistic p.value
(Intercept) 162.709250 15.7817353 10.309972 0e+00
X1TXMTSCOR 1.624239 0.3023678 5.371733 1e-07

12. Compare the fit line from question 10 to the regression output from question 11. Is there a difference in the strength of the relationship between the X and Y variables across the two sets of output? Explain your reasoning in narrative form. (2)

There is no difference. That’s because even though you multiply by 10, the proportions/relationships are still the same.

modelsummary(list(
  model_sample,
  model_credit10
))
Model 1 Model 2
(Intercept) 16.271 162.709
(1.578) (15.782)
X1TXMTSCOR 0.162 1.624
(0.030) (0.302)
Num.Obs. 499 499
R2 0.055 0.055
R2 Adj. 0.053 0.053
AIC 3351.7 5649.7
BIC 3364.4 5662.3
Log.Lik. -1672.861 -2821.851
F 28.856 28.856