Preface: This assignment was worked on together as a group. I guided
the discussion and analysis, but asked my group member questions so
we could think about the data more deeply. Afterwards, I took many of
the things we discussed as a group to the next level. I hope you think
this analysis is as cool as I think it is.
Enjoy:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## RACEGROUP GENDERIDENTITYtext
## 1 5 White
## 2 5 White
## 3 7 Two or more Races/Ethnicities
## 4 7 Two or more Races/Ethnicities
## 5 5 White
## 6 7 Two or more Races/Ethnicities
## FIRSTGEN FIRSTGEN_txt
## 1 2 Yes
## 2 1 No
## 3 1 No
## 4 1 No
## 5 1 No
## 6 1 No
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Compare: TRUE means the SATV statistic is smaller than the SATM statistic
mean(df$SATV, na.rm = TRUE) - mean(df$SATM, na.rm = TRUE) < 0
## [1] TRUE
sd(df$SATV, na.rm = TRUE) - sd(df$SATM, na.rm = TRUE) < 0
## [1] TRUE
Notice that the means are approximately equal. However, \(\bar{S}_v-\bar{S}_m <0 \implies \bar{S}_v < \bar{S}_m\), if only slightly.
The standard deviations differ more dramatically: \(\hat{\sigma}_{S_v} - \hat{\sigma}_{S_m} < 0 \implies \hat{\sigma}_{S_v} < \hat{\sigma}_{S_m}\)
Lastly, notice: \(\text{Let : }\Delta_{S_v} = \bar{S}_v-M_{ed}(S_v) \text{ and } \Delta_{S_m} = \bar{S}_m-M_{ed}(S_m)\)
\[ |\Delta_{S_v}| > |\Delta_{S_m}| \]
Meaning, our data for \(S_v\) is more skewed than \(S_m\).
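The mean-minus-median gap used above can be computed with a small helper. This is only a sketch: `mean_median_gap()` is a hypothetical helper name, and the toy vectors are made up for illustration rather than taken from the assignment data. The gap is a rough heuristic for the direction of skew, not a formal test.

```r
# Signed gap between mean and median.
# Negative suggests left skew, positive suggests right skew (heuristic only).
mean_median_gap <- function(x) {
  mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)
}

# Hypothetical toy scores, NOT the assignment data.
toy_v <- c(400, 600, 640, 650, 660, NA)  # long left tail, one missing value
toy_m <- c(600, 620, 640, 660, 680)      # symmetric

gap_v <- mean_median_gap(toy_v)
gap_m <- mean_median_gap(toy_m)

# TRUE when the first vector is the more skewed one by this measure
abs(gap_v) > abs(gap_m)
```

On the real data the same helper would be called as `mean_median_gap(df$SATV)` and `mean_median_gap(df$SATM)`.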
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).
Mathematical Analysis :
SATV scores are left skewed: Median > Mean \(\implies\) negative (left) skew
Graphical Analysis :
Values appear to pool in similar locations in both distributions, with the exception of the Mathematics scores.
The two distributions diverge over \(600≤x≤700\); specifically, consider the SATM distribution.
There are some big issues with this: we clearly cannot use a simple t-test for our hypothesis test, as it doesn't take the skew into account and its first assumption (normally distributed data) is violated.
So the first question becomes: what type of distribution is this?
And what can we do about skewed distributions and hypothesis testing \(\Delta\) in means?
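One possible answer to the last question, offered only as a sketch and not as what this analysis ended up doing: since each student has both a verbal and a math score, we could bootstrap the paired differences or use a rank-based test. The data below are simulated and hypothetical; the names `verbal`, `math`, and `d` are made up for illustration.

```r
# Simulated, hypothetical paired scores (not the assignment data).
set.seed(1)
n <- 500
math   <- pmin(800, pmax(200, round(rnorm(n, 640, 90))))
verbal <- pmin(800, pmax(200, round(225 + 0.64 * math + rnorm(n, 0, 65))))

d <- verbal - math  # paired differences, one per student

# Percentile bootstrap interval for the mean difference:
# resample the differences with replacement and recompute the mean each time.
boot_means <- replicate(2000, mean(sample(d, replace = TRUE)))
ci <- quantile(boot_means, c(0.025, 0.975))
ci

# Rank-based alternative (Wilcoxon signed-rank test); it does not assume
# normal differences, but it tests location of the differences, not the mean.
p_rank <- wilcox.test(verbal, math, paired = TRUE, exact = FALSE)$p.value
p_rank
```

If the bootstrap interval excludes 0, that would suggest a real difference in means without leaning on normality.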
Excerpt :
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 706 rows containing missing values or values outside the scale range
## (`geom_point()`).
Not so fast!
So, this model is cool, but we must first evaluate the assumptions of a linear model; in other words, diagnostics. Otherwise, we are just doing nonsense.
Here are the assumptions of Linear Regression : L. I. N. E.
Linearity of predictor & predicted
Independence of Error
Normality of Error
Equal Variance of Error
Consider that we have leverage points that are clearly highly influential (i.e. high Cook's distance) for X in 200-300. Simply by viewing the scatter plot, it is clear that we are likely not capturing the general pattern but rather the influence of high-leverage points and outliers in that range.
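\[ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}}, \qquad r_i = \frac{e_i}{s\sqrt{1-h_{ii}}} \\ \text{Where : } \\ r_i = \text{Standardized Residual} \\ h_{ii}=\text{Leverage} \\ p = \text{Number of Coefficients (here } p = 2 \text{)} \\ s = \text{Residual Standard Error, } s^2 = \frac{1}{n-p}\sum_i e_i^2 \\ n = \text{Sample Size} \]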
So a high Cook's distance comes from a large standardized residual (i.e. an outlier), from being a high-leverage point, or from both.
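To make the formula concrete, here is a minimal sketch that builds Cook's distance by hand from leverage and standardized residuals and checks it against base R's `cooks.distance()`. The data are simulated and hypothetical, with one point deliberately placed far from the rest to mimic the low-X region discussed above.

```r
# Simulated, hypothetical data: 50 ordinary points plus one
# high-leverage outlier at x = 220 (not the assignment data).
set.seed(42)
x <- c(runif(50, 500, 800), 220)
y <- c(225 + 0.64 * x[1:50] + rnorm(50, 0, 30), 700)
fit <- lm(y ~ x)

h <- hatvalues(fit)      # leverage h_ii
r <- rstandard(fit)      # standardized residuals r_i
p <- length(coef(fit))   # number of coefficients (2 for simple regression)

# Cook's distance from its two ingredients
d_manual <- (r^2 / p) * (h / (1 - h))

# Should agree with the built-in up to floating-point tolerance
all.equal(d_manual, cooks.distance(fit))

# Index of the most influential observation
which.max(d_manual)
```

The appended point combines a large residual with high leverage, which is exactly the combination the formula rewards.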
Consider the following heat map :
## `geom_smooth()` using formula = 'y ~ x'
To further emphasize this, let's consider the diagnostics:
## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = SATV ~ SATM, data = Diagnostics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -534.79 -37.95 1.59 39.85 447.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.246e+02 1.373e+00 163.5 <2e-16 ***
## SATM 6.378e-01 2.126e-03 300.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.64 on 72715 degrees of freedom
## Multiple R-squared: 0.5531, Adjusted R-squared: 0.5531
## F-statistic: 8.999e+04 on 1 and 72715 DF, p-value: < 2.2e-16
There are many issues with our model. It fails the following conditions of simple linear regression:
Normality
Equal Variance
As indicated by the QQ plot and the Residuals vs. Fitted plot, our model is not very good. Furthermore, given the high Cook's distances illustrated in the heat map around high-leverage points (i.e. points far from \(\bar{x}\)), reliable inference requires first developing a better model and then re-evaluating the diagnostics. However, roughly speaking, we can still use the predicted values to get a sense of the “average individual”. More time would be needed to create a better model for the association between SATV and SATM.
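For reference, the two diagnostic plots mentioned here can be produced in base R as sketched below. The data are simulated and hypothetical, built so that the error spread grows with x and the equal-variance condition fails by construction; `spread_trend` is a made-up name for a quick numeric companion to the plot.

```r
# Simulated, hypothetical data with non-constant error variance
# (not the assignment data).
set.seed(7)
x <- runif(2000, 200, 800)
y <- 225 + 0.64 * x + rnorm(2000, sd = 0.1 * x)
fit <- lm(y ~ x)

op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted: a funnel shape signals unequal variance
plot(fit, which = 2)  # Normal Q-Q: departures from the line signal non-normal errors
par(op)

# Rough numeric check: do absolute residuals grow with the fitted values?
spread_trend <- cor(abs(resid(fit)), fitted(fit))
spread_trend
```

A clearly positive `spread_trend` points the same way as a funnel in the first plot; it is a quick heuristic, not a formal test.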
According to the means and medians, people appear to do slightly worse in SATV.
However…
Some students are doing ridiculously well at Math (cracked), which shows up as “pooling”; this is likely due to an underlying factor (possible solution: outlier analysis using multiple linear regression).
Neither distribution is normally distributed :(
Dist \(S_v\) is NOT bi-modal
Dist \(S_m\) is bi-modal
The linear model developed needs more work before it is useful.
## is.na(df$SATV) n
## 1 FALSE 72717
## 2 TRUE 706
## is.na(df$SATM) n
## 1 FALSE 73423
SATM doesn’t have any NAs
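The same missing-value counts can be obtained for every column at once. A minimal sketch on a hypothetical toy data frame (only the column names mirror the assignment data):

```r
# Hypothetical toy data frame, NOT the assignment data.
toy <- data.frame(
  SATV = c(690, NA, 430, 650, NA),
  SATM = c(620, 560, 530, 450, 800)
)

# Number of NAs per column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
complete_rows
```

On the real data, `colSums(is.na(df[, c("SATV", "SATM")]))` would report both columns in one call.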
## SATV SATM SATCOMP
## 1 800 800 1600
## 2 400 400 800
## 3 690 620 1310
## 4 630 560 1190
## 5 430 530 960
## 6 650 450 1100
## [1] 1265.755
sd(df$SATCOMP, na.rm = TRUE)
## [1] 198.66
df %>% ggplot(aes(x = SATCOMP)) + geom_histogram(bins = 50)
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).
So what just happened is that our graph incorporated both skews and became more dramatic. And as we should expect, the combined mean sits to the right of where the two individual means were, at roughly their sum: ( \(\bar{S}_v, \bar{S}_m\), \(\bar{S}_c\) ) = (630, 635, 1265) respectively.
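The relationship between the three means follows from SATCOMP being the row-wise sum of the two sections, which the printed rows are consistent with (for example, 690 + 620 = 1310). A short derivation, assuming all three statistics are computed over the same \(n\) rows with no missing values:

\[ \bar{S}_c = \frac{1}{n}\sum_{i=1}^{n}\left(S_{v,i} + S_{m,i}\right) = \bar{S}_v + \bar{S}_m \approx 630 + 635 = 1265 \\ \hat{\sigma}^2_{S_c} = \hat{\sigma}^2_{S_v} + \hat{\sigma}^2_{S_m} + 2\,\widehat{\text{Cov}}(S_v, S_m) \]

Because SATV has 706 missing values and SATM has none, the first identity holds exactly only when \(\bar{S}_m\) is also restricted to the rows where SATV is present. The covariance term is positive here (the fitted slope of SATV on SATM is positive), so the combined spread is larger than it would be if the two sections were independent.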
This dataset provides a rich opportunity for analyzing the relationship between SAT scores (SATV, SATM) and various demographic and socioeconomic factors such as gender identity, racial identity, and income.
A key analysis I would like to conduct is examining whether SAT performance differs across gender, racial, and economic groups, and whether factors like high school GPA, income, or first-generation status mediate these differences.
To what extent do gender identity, racial identity, and socioeconomic status (measured by income and first-generation status) predict SAT verbal (SATV) and SAT math (SATM) scores, and how does high school GPA mediate these relationships?
Love this question. Syntax is, as always, a struggle for both me and my group member. Usually, such an issue is solved by reading the documentation.
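As a starting point for the research question, here is a rough sketch of a simple difference-in-coefficients check for mediation. It is not a full mediation analysis. The data are simulated and hypothetical, and `HSGPA` and `INCOME` are assumed column names; only `SATV` and `FIRSTGEN_txt` mirror names seen in the printed output.

```r
# Simulated, hypothetical data (not the assignment data).
# HSGPA and INCOME are assumed column names.
set.seed(3)
n <- 1000
sim <- data.frame(
  FIRSTGEN_txt = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  INCOME = runif(n, 20, 200)  # made-up units (thousands)
)
# In this simulation, first-generation status affects SATV only through HSGPA.
sim$HSGPA <- 3 + 0.003 * sim$INCOME -
  0.2 * (sim$FIRSTGEN_txt == "Yes") + rnorm(n, sd = 0.3)
sim$SATV <- 400 + 60 * sim$HSGPA + 0.3 * sim$INCOME + rnorm(n, sd = 60)

# Model without the candidate mediator, then with it
total  <- lm(SATV ~ FIRSTGEN_txt + INCOME, data = sim)
direct <- lm(SATV ~ FIRSTGEN_txt + INCOME + HSGPA, data = sim)

# Compare the first-generation coefficient before and after adding HSGPA
b_total  <- coef(total)[["FIRSTGEN_txtYes"]]
b_direct <- coef(direct)[["FIRSTGEN_txtYes"]]
c(total = b_total, direct = b_direct)
```

If the coefficient moves noticeably once `HSGPA` enters the model, that is consistent with (but does not prove) mediation; gender and racial identity could be added to both formulas in the same way.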