Week 6 Lab

Author

Shanteé Enitencio

The data used in this analysis were sourced from the catholic dataset in the {wooldridge} package. The read12 variable from this dataset was chosen as the predicted (dependent) variable.

Training set number of observations: 3583
Test dataset number of observations: 2387
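Below is a minimal sketch of how the data could be loaded and split. The 60/40 split proportion, the removal of incomplete rows, the seed, and the test object name are assumptions; the report only states the resulting sample sizes and (later) that the training data frame is called train.

library(wooldridge)

# Load the catholic dataset and drop rows with missing values (assumed)
data("catholic")
cath <- catholic[complete.cases(catholic), ]

# Assumed 60/40 train/test split; the actual method and seed are not reported
set.seed(123)
train_idx <- sample(seq_len(nrow(cath)), size = floor(0.6 * nrow(cath)))
train <- cath[train_idx, ]
test  <- cath[-train_idx, ]

nrow(train)  # reported as 3583
nrow(test)   # reported as 2387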

EDA

Descriptive Statistics

The following boxplots indicated that the largest number of outliers was in the family income variable.

Additionally, most of the numerical variables were approximately normally distributed; however, read12 and family income were right-skewed. It is also important to note that both the father and mother education variables are discrete numerical variables, so they do not show the kind of distribution that would otherwise be expected.
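A rough sketch of how these boxplots and histograms could be produced with ggplot2; the exact set of variables and the layout are assumptions.

library(ggplot2)
library(tidyr)

# Reshape the numeric variables to long format for faceted plots (assumed selection)
num_vars   <- c("read12", "math12", "lfaminc", "motheduc", "fatheduc")
train_long <- pivot_longer(train[, num_vars], cols = everything(),
                           names_to = "variable", values_to = "value")

# Boxplots to compare spread and outliers across variables
ggplot(train_long, aes(x = variable, y = value)) +
  geom_boxplot()

# Histograms to inspect the shape of each distribution
ggplot(train_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")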

Scatterplots (depvar ~ all x)

The scatterplots of the dependent variable against each continuous candidate independent variable indicated a positive correlation between the reading score a student achieved and the math score they received. The correlation matrix showed a correlation of 0.73 between the reading and math scores. It also showed a moderate correlation of 0.58 between the father and mother education variables.
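A sketch of how the scatterplots and correlation matrix could be generated; the variable selection here is an assumption.

# Scatterplots of read12 against each continuous candidate predictor
pairs(train[, c("read12", "math12", "lfaminc", "motheduc", "fatheduc")])

# Correlation matrix; should show about 0.73 for read12 vs math12
# and about 0.58 for motheduc vs fatheduc
round(cor(train[, c("read12", "math12", "lfaminc", "motheduc", "fatheduc")]), 2)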

First model

             Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)     7.086           1.453    4.875    0.0000  ***
math12          0.690           0.013   54.197    0.0000  ***
female1         2.732           0.212   12.884    0.0000  ***
asian1         -0.982           0.435   -2.258    0.0240  *
hispan1        -0.076           0.371   -0.206    0.8372
black1         -1.345           0.420   -3.201    0.0014  **
motheduc        0.043           0.066    0.647    0.5175
fatheduc        0.207           0.061    3.385    0.0007  ***
lfaminc         0.379           0.154    2.468    0.0136  *
hsgrad1         0.187           0.421    0.445    0.6567
cathhs1         0.332           0.432    0.768    0.4425
parcath1       -0.383           0.247   -1.550    0.1213

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 6.28 on 3571 degrees of freedom

Multiple R-squared: 0.5544, Adjusted R-squared: 0.553

F-statistic: 403.9 on 11 and 3571 DF, p-value: 0.0000

This first iteration of the model found that the variables that were statistically significant at the 95% level were the student's math score, their sex, whether they were Asian or Black, their father's education, and their family income. The fitted model is:

read12 = 7.086 + 0.690 math12 + 2.732 female - 0.982 asian - 0.076 hispan - 1.345 black + 0.043 motheduc + 0.207 fatheduc + 0.379 lfaminc + 0.187 hsgrad + 0.332 cathhs - 0.383 parcath

This model explained roughly 55% of the variability in read12 with the chosen independent variables.
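A minimal sketch of how this first model could be fit; the object name mod1 is an assumption, and the 1 suffix on the binary terms in the table suggests those variables were recoded as factors.

# First model: all candidate predictors, fit on the training set
mod1 <- lm(read12 ~ math12 + female + asian + hispan + black +
             motheduc + fatheduc + lfaminc + hsgrad + cathhs + parcath,
           data = train)
summary(mod1)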

Second Model

              Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)    -27.200           8.008   -3.397    0.0007  ***
math12           1.066           0.113    9.394    0.0000  ***
I(math12^2)     -0.004           0.001   -3.319    0.0009  ***
female1          2.713           0.211   12.845    0.0000  ***
black1          -1.026           0.408   -2.517    0.0119  *
fatheduc         0.265           0.055    4.863    0.0000  ***
lfaminc          5.463           1.532    3.566    0.0004  ***
I(lfaminc^2)    -0.259           0.078   -3.314    0.0009  ***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 6.264 on 3575 degrees of freedom

Multiple R-squared: 0.5562, Adjusted R-squared: 0.5553

F-statistic: 640 on 7 and 3575 DF, p-value: 0.0000

This second iteration of the model, motivated by the coefficients of the first, introduced the income and math variables as quadratics. The results at the 99% level of significance were very similar, but this model performed somewhat better than the previous one. The fitted model is:

read12 = -27.200 + 1.066 math12 - 0.004 math12^2 + 2.713 female - 1.026 black + 0.265 fatheduc + 5.463 lfaminc - 0.259 lfaminc^2
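A sketch of how the quadratic terms could be added; the object name mod2 is an assumption.

# Second model: quadratic terms for math12 and lfaminc, with the
# non-significant predictors from the first model dropped
mod2 <- lm(read12 ~ math12 + I(math12^2) + female + black +
             fatheduc + lfaminc + I(lfaminc^2),
           data = train)
summary(mod2)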

This coefficient plot reiterated that the main variables with a positive effect on a student's reading performance were their family income, their gender, and their math score, while the largest negative effect came from whether the student was Black.
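The report does not say how the coefficient plot was drawn; one way to recreate something similar, as a sketch, is with broom and ggplot2.

library(broom)
library(ggplot2)

# Coefficients of the second model with 95% confidence intervals (intercept omitted)
coefs <- tidy(mod2, conf.int = TRUE)
coefs <- coefs[coefs$term != "(Intercept)", ]

ggplot(coefs, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed")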

Check for predictor independence

Using Variance Inflation Factors (VIF)

      math12  I(math12^2)       female        black     fatheduc      lfaminc 
  108.183447   108.462323     1.015147     1.057531     1.392745   138.804952 
I(lfaminc^2) 
  140.738116 
     0%     25%     50%     75%    100% 
-26.870  -3.910   0.330   4.115  22.000 

These results indicated that there was no problematic multicollinearity among the predictor variables; the only high VIF values belonged to the variables that entered the model as squares alongside their linear terms, which was to be expected. The residuals vs. fitted plot did not show a purely random scatter of residuals across the fitted values. The Q-Q plot looked normal, with values falling along the line and deviating only at the extremes. The scale-location plot indicated that the observations were homoscedastic and that there were minimal influential observations.
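A sketch of the checks behind this output, assuming the VIFs came from the car package; the second block of numbers above appears to be the quantiles of the model residuals.

library(car)

# Variance inflation factors for the second model
vif(mod2)

# Five-number summary of the residuals
quantile(resid(mod2))

# Standard diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(mod2)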

Plot Residuals by Fitted Values

 lfaminc  read12   fit   lwr   upr
    10.3    61.4  48.8  36.5  61.0
    10.3    59.3  51.6  39.3  63.9
    10.7    57.6  54.6  42.3  66.8
    11.0    52.5  55.9  43.6  68.2
    10.3    42.1  52.2  39.9  64.5
    10.3    64.0  60.6  48.3  72.9
     9.1    38.5  58.3  46.0  70.6
    10.3    60.3  60.3  48.1  72.6
     9.8    63.6  54.6  42.3  66.9
    10.3    52.6  56.4  44.1  68.7

n: 10
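A sketch of how a table like this could be produced on the test set, showing the first 10 rows with 95% prediction intervals; the exact rows and columns displayed are assumptions.

# Predictions with 95% prediction intervals on the test set
pred <- predict(mod2, newdata = test, interval = "prediction")

# Combine with the observed values for inspection (first 10 rows)
pred_tbl <- cbind(test[, c("lfaminc", "read12")], round(pred, 1))
head(pred_tbl, 10)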

Plot Actual vs Fitted (test)

The previous chart plots the actual values against the fitted values on the test set and indicates a positive relationship between the two.
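A sketch of that plot, assuming the fitted values come from the second model applied to the test set.

library(ggplot2)

# Actual vs fitted values on the test set with a 45-degree reference line
test$fit <- predict(mod2, newdata = test)

ggplot(test, aes(x = fit, y = read12)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Fitted read12", y = "Actual read12")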

Performance Metrics

Metric   Value
MAE      5.2421
RMSE     6.6048
MAPE     0.1094

The metrics for this model are as follows: an MAE of 5.24, an RMSE of 6.60, and a MAPE of 0.11. Overall, these values indicate that the final model was decent but still required some work. The model's predictions were off by 5.24 points, or about 11%, on average, which might seem small, but because the read12 variable ranged from roughly 29 to 68, a 5-point difference is fairly large.
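A sketch of how these metrics could be computed by hand on the test set; a package such as Metrics or yardstick would give the same values.

# Error metrics for the second model on the test set
pred <- predict(mod2, newdata = test)
err  <- test$read12 - pred

mae  <- mean(abs(err))                # mean absolute error
rmse <- sqrt(mean(err^2))             # root mean squared error
mape <- mean(abs(err / test$read12))  # mean absolute percentage error

round(c(MAE = mae, RMSE = rmse, MAPE = mape), 4)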


Call:
lm(formula = read12 ~ math12, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.0835  -4.0715   0.4588   4.2093  21.7244 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 14.59659    0.59522   24.52 <0.0000000000000002 ***
math12       0.71086    0.01126   63.15 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.462 on 3581 degrees of freedom
Multiple R-squared:  0.5269,    Adjusted R-squared:  0.5268 
F-statistic:  3988 on 1 and 3581 DF,  p-value: < 0.00000000000000022

These plots used the math12 variable alone and found that it explained about 52% of the variability in read12. Additional plotting of read12 against math12 confirmed a positive relationship between the two: the higher the math score, the higher the expected reading score.
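A sketch of the simple model and the accompanying plot; the smoothing choices are assumptions.

library(ggplot2)

# Simple regression of reading scores on math scores
mod_simple <- lm(read12 ~ math12, data = train)
summary(mod_simple)

# Scatterplot of read12 against math12 with a linear smooth
ggplot(train, aes(x = math12, y = read12)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")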

Conclusion

The model developed in this analysis was decent, but not a good one by any means. More testing and the introduction of other variables would be necessary to predict students' reading scores with less error and a higher degree of confidence. The model indicated that the strongest predictors of a student's score were their math score, their gender, their race, their father's education, and their household income. Therefore, for students to do better academically, schools would have to find ways to counteract factors they cannot control, since most of the predictors of a student's success or failure in this model stem from family circumstances outside the school's sphere of influence.