library(tidyverse)
library(modelsummary)
library(haven)
library(psych)
library(tibble)
library(broom)
library(knitr)
df_hsls <- read_spss("Datasets/HSLS_50_pct_sample_30_vars.sav")
1. Using unweighted observations, generate the N (or count of observations), min, max, mean, and standard deviation of the following variables (2):
df_hsls %>%
select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>%
describe(fast = TRUE) %>%
kable()
| vars | n | mean | sd | min | max | range | se | |
|---|---|---|---|---|---|---|---|---|
| X1TXMTSCOR | 1 | 10684 | 51.00749 | 10.084101 | 24.0744 | 82.1876 | 58.1132 | 0.0975597 |
| X2TXMTSCOR | 2 | 10241 | 51.49905 | 10.138151 | 24.9578 | 84.9051 | 59.9473 | 0.1001815 |
| X3EARNPERHR1 | 3 | 4058 | 9.04411 | 2.898820 | 6.0000 | 25.0000 | 19.0000 | 0.0455056 |
| X3TCREDTOT | 4 | 10929 | 24.08198 | 7.393138 | 0.0000 | 36.0000 | 36.0000 | 0.0707194 |
2. Compare the mean values for the Mathematics standardized theta scores in the two time periods? (2)
t.test(x = df_hsls$X1TXMTSCOR,
y = df_hsls$X2TXMTSCOR,
paired = TRUE, alternative = "two.sided") %>%
tidy() %>%
kable()
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.1686957 | -2.281326 | 0.0225517 | 9256 | -0.3136467 | -0.0237446 | Paired t-test | two.sided |
3. What should a reader understand about the values in the N column of the table? (2)
The reader should be aware that this is a longitudinal study. Therefore the participants have been followed over time and naturally that number will decrease over time.
4. Generate a correlation matrix for this set of variables (2):
df_hsls %>%
select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>%
cor(use = "pairwise.complete.obs") %>%
kable()
| X1TXMTSCOR | X2TXMTSCOR | X3EARNPERHR1 | X3TCREDTOT | |
|---|---|---|---|---|
| X1TXMTSCOR | 1.0000000 | 0.7490751 | 0.0171581 | 0.3010436 |
| X2TXMTSCOR | 0.7490751 | 1.0000000 | 0.0420769 | 0.2975178 |
| X3EARNPERHR1 | 0.0171581 | 0.0420769 | 1.0000000 | -0.0782904 |
| X3TCREDTOT | 0.3010436 | 0.2975178 | -0.0782904 | 1.0000000 |
5. What would you expect to find based on your understanding of these variables? (2)
I would expect that test scores are highly positively correlated with one another. Additionally, I would expect test scores to be moderately, positively correlated with both earnings and credits earned. Lastly, I would expect credits earned to be moderately, positively correlated with hourly earnings.
6. If directional differences exist, what might explain them? (2)
I didn’t expect “credits earned” and “hourly wage” to have a negative correlation. I expected them to have a positive correlation (even if it’s not a statistically significant one). My theory is that this negative correlation is due to the presence of outliers.
cor.test(df_hsls$X3TCREDTOT, df_hsls$X3EARNPERHR1,)
##
## Pearson's product-moment correlation
##
## data: df_hsls$X3TCREDTOT and df_hsls$X3EARNPERHR1
## t = -4.8936, df = 3883, p-value = 1.03e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10946772 -0.04695915
## sample estimates:
## cor
## -0.07829039
ggplot(df_hsls, aes(X3TCREDTOT, df_hsls$X3EARNPERHR1)) +
geom_point()
7. Including all observations from the original dataset, regress X3 Current job earnings per hour on X1 Mathematics standardized theta score. (2)
model_1 <- lm(data = df_hsls, X3EARNPERHR1 ~ X1TXMTSCOR)
tidy(model_1) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.7461856 | 0.2584203 | 33.844805 | 0.000000 |
| X1TXMTSCOR | 0.0051254 | 0.0048618 | 1.054226 | 0.291847 |
glance(model_1)%>%
kable()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0002944 | 2.95e-05 | 2.858591 | 1.111392 | 0.291847 | 1 | -9322.954 | 18651.91 | 18670.62 | 30839.41 | 3774 | 3776 |
8. Write a narrative description of how we should interpret the coefficient with a value of 0.005. (2)
For every 1 point increase in Theta Math score, we can expect hourly wage to increase by 0.005 cents.
9. What would you say about the practical significance of this result and its meaning given what you understand about this data? (2)
This result potentially suggests to me that standardized test scores bear little weight on earnings. However, my understanding is that these participants are only a few years out of high school since the first time point. You could argue that this is too soon to see any impact.
10. Select a 5% random sample of cases from your dataset and then generate a scatter plot which includes the following variables and integrates the regression equation fit line and equation into the output (3):
df_hsls_sample <- df_hsls %>% slice_sample(prop = .05)
df_hsls_sample %>%
ggplot(aes(X1TXMTSCOR, X3TCREDTOT)) +
geom_point() +
geom_smooth(method = "lm")
model_sample <- lm(data = df_hsls_sample, X3TCREDTOT ~ X1TXMTSCOR)
tidy(model_sample) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 16.2709250 | 1.5781735 | 10.309972 | 0e+00 |
| X1TXMTSCOR | 0.1624239 | 0.0302368 | 5.371733 | 1e-07 |
11. Generate a new variable called Credit10 that calculates a student’s X3Total credits earned value multiplied by the number 10 using the Transform | Compute menu. Regress this resulting variable on X1 Mathematics standardized theta score and include the output below. (2)
df_hsls_sample <-
df_hsls_sample %>%
mutate(credit10 = X3TCREDTOT * 10)
df_hsls_sample %>%
ggplot(aes(X1TXMTSCOR, credit10)) +
geom_point() +
geom_smooth(method = "lm")
model_credit10 <- lm(data = df_hsls_sample, credit10 ~ X1TXMTSCOR)
tidy(model_credit10) %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 162.709250 | 15.7817353 | 10.309972 | 0e+00 |
| X1TXMTSCOR | 1.624239 | 0.3023678 | 5.371733 | 1e-07 |
12. Compare the fit line from question 10 to the regression output from question 11. Is there a difference in the strength of the relationship between the X and Y variables across the two sets of output? Explain your reasoning in narrative form. (2)
There is no difference. That’s because even though you multiply by 10, the proportions/relationships are still the same.
modelsummary(list(
model_sample,
model_credit10
))
| Model 1 | Model 2 | |
|---|---|---|
| (Intercept) | 16.271 | 162.709 |
| (1.578) | (15.782) | |
| X1TXMTSCOR | 0.162 | 1.624 |
| (0.030) | (0.302) | |
| Num.Obs. | 499 | 499 |
| R2 | 0.055 | 0.055 |
| R2 Adj. | 0.053 | 0.053 |
| AIC | 3351.7 | 5649.7 |
| BIC | 3364.4 | 5662.3 |
| Log.Lik. | -1672.861 | -2821.851 |
| F | 28.856 | 28.856 |