Homework 1

library(tidyverse)
library(modelsummary)
library(haven)
library(psych)
library(tibble)
library(broom)
library(knitr)

df_hsls <- read_spss("Datasets/HSLS_50_pct_sample_30_vars.sav")

1. Using unweighted observations, generate the N (or count of observations), min, max, mean, and standard deviation of the following variables (2):

X1 Mathematics standardized theta score
X2 Mathematics standardized theta score
X3 Current job earnings per hour
X3 Total credits earned

df_hsls %>% 
  select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>% 
  describe(fast = TRUE) %>% 
  kable()

	vars	n	mean	sd	min	max	range	se
X1TXMTSCOR	1	10684	51.00749	10.084101	24.0744	82.1876	58.1132	0.0975597
X2TXMTSCOR	2	10241	51.49905	10.138151	24.9578	84.9051	59.9473	0.1001815
X3EARNPERHR1	3	4058	9.04411	2.898820	6.0000	25.0000	19.0000	0.0455056
X3TCREDTOT	4	10929	24.08198	7.393138	0.0000	36.0000	36.0000	0.0707194

2. Compare the mean values for the Mathematics standardized theta scores in the two time periods? (2)

t.test(x = df_hsls$X1TXMTSCOR, 
       y = df_hsls$X2TXMTSCOR, 
       paired = TRUE, alternative = "two.sided") %>%
  tidy() %>% 
  kable()

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.1686957	-2.281326	0.0225517	9256	-0.3136467	-0.0237446	Paired t-test	two.sided

3. What should a reader understand about the values in the N column of the table? (2)

The reader should be aware that this is a longitudinal study. Therefore the participants have been followed over time and naturally that number will decrease over time.

4. Generate a correlation matrix for this set of variables (2):

X1 Mathematics standardized theta score
X2 Mathematics standardized theta score
X3 Current job earnings per hour
X3 Total credits earned

df_hsls %>% 
  select(X1TXMTSCOR, X2TXMTSCOR, X3EARNPERHR1, X3TCREDTOT) %>% 
  cor(use = "pairwise.complete.obs") %>%  
  kable()

	X1TXMTSCOR	X2TXMTSCOR	X3EARNPERHR1	X3TCREDTOT
X1TXMTSCOR	1.0000000	0.7490751	0.0171581	0.3010436
X2TXMTSCOR	0.7490751	1.0000000	0.0420769	0.2975178
X3EARNPERHR1	0.0171581	0.0420769	1.0000000	-0.0782904
X3TCREDTOT	0.3010436	0.2975178	-0.0782904	1.0000000

5. What would you expect to find based on your understanding of these variables? (2)

I would expect that test scores are highly positively correlated with one another. Additionally, I would expect test scores to be moderately, positively correlated with both earnings and credits earned. Lastly, I would expect credits earned to be moderately, positively correlated with hourly earnings.

6. If directional differences exist, what might explain them? (2)

I didn’t expect “credits earned” and “hourly wage” to have a negative correlation. I expected them to have a positive correlation (even if it’s not a statistically significant one). My theory is that this negative correlation is due to the presence of outliers.

cor.test(df_hsls$X3TCREDTOT, df_hsls$X3EARNPERHR1,)

## 
##  Pearson's product-moment correlation
## 
## data:  df_hsls$X3TCREDTOT and df_hsls$X3EARNPERHR1
## t = -4.8936, df = 3883, p-value = 1.03e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10946772 -0.04695915
## sample estimates:
##         cor 
## -0.07829039

ggplot(df_hsls, aes(X3TCREDTOT, df_hsls$X3EARNPERHR1)) +
  geom_point()

7. Including all observations from the original dataset, regress X3 Current job earnings per hour on X1 Mathematics standardized theta score. (2)

model_1 <- lm(data = df_hsls, X3EARNPERHR1 ~ X1TXMTSCOR)

tidy(model_1) %>% 
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	8.7461856	0.2584203	33.844805	0.000000
X1TXMTSCOR	0.0051254	0.0048618	1.054226	0.291847

glance(model_1)%>% 
  kable()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.0002944	2.95e-05	2.858591	1.111392	0.291847	1	-9322.954	18651.91	18670.62	30839.41	3774	3776

8. Write a narrative description of how we should interpret the coefficient with a value of 0.005. (2)

For every 1 point increase in Theta Math score, we can expect hourly wage to increase by 0.005 cents.

9. What would you say about the practical significance of this result and its meaning given what you understand about this data? (2)

This result potentially suggests to me that standardized test scores bear little weight on earnings. However, my understanding is that these participants are only a few years out of high school since the first time point. You could argue that this is too soon to see any impact.

10. Select a 5% random sample of cases from your dataset and then generate a scatter plot which includes the following variables and integrates the regression equation fit line and equation into the output (3):

Y Axis: X3 Total credits earned
X Axis: X1 Mathematics standardized theta score

df_hsls_sample <- df_hsls %>% slice_sample(prop = .05)

df_hsls_sample %>% 
  ggplot(aes(X1TXMTSCOR, X3TCREDTOT)) +
  geom_point() +
  geom_smooth(method = "lm")

model_sample <- lm(data = df_hsls_sample, X3TCREDTOT ~ X1TXMTSCOR)

tidy(model_sample) %>% 
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	16.2709250	1.5781735	10.309972	0e+00
X1TXMTSCOR	0.1624239	0.0302368	5.371733	1e-07

11. Generate a new variable called Credit10 that calculates a student’s X3Total credits earned value multiplied by the number 10 using the Transform | Compute menu. Regress this resulting variable on X1 Mathematics standardized theta score and include the output below. (2)

df_hsls_sample <- 
  df_hsls_sample %>% 
  mutate(credit10 = X3TCREDTOT * 10)

df_hsls_sample %>% 
  ggplot(aes(X1TXMTSCOR, credit10)) +
  geom_point() +
  geom_smooth(method = "lm")

model_credit10 <- lm(data = df_hsls_sample, credit10 ~ X1TXMTSCOR)

tidy(model_credit10) %>% 
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	162.709250	15.7817353	10.309972	0e+00
X1TXMTSCOR	1.624239	0.3023678	5.371733	1e-07

12. Compare the fit line from question 10 to the regression output from question 11. Is there a difference in the strength of the relationship between the X and Y variables across the two sets of output? Explain your reasoning in narrative form. (2)

There is no difference. That’s because even though you multiply by 10, the proportions/relationships are still the same.

modelsummary(list(
  model_sample,
  model_credit10
))

	Model 1	Model 2
(Intercept)	16.271	162.709
	(1.578)	(15.782)
X1TXMTSCOR	0.162	1.624
	(0.030)	(0.302)
Num.Obs.	499	499
R2	0.055	0.055
R2 Adj.	0.053	0.053
AIC	3351.7	5649.7
BIC	3364.4	5662.3
Log.Lik.	-1672.861	-2821.851
F	28.856	28.856

Homework 1

EPRS 8540 Quantitative Methods II

James Malloy

January, 27, 2022