Q1. Recode GENDERIDENTITY into GENDERIDENTITYtext and use the table command to check your work. (1.5 Pts)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

##   RACEGROUP            GENDERIDENTITYtext
## 1         5                         White
## 2         5                         White
## 3         7 Two or more Races/Ethnicities
## 4         7 Two or more Races/Ethnicities
## 5         5                         White
## 6         7 Two or more Races/Ethnicities
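A minimal sketch of the recode-then-check pattern, in base R. The numeric codes and the labels ("Label A", etc.) are placeholders assumed for illustration, since the survey codebook is not shown here; the point is only that cross-tabulating the source column against the recoded column should give exactly one non-zero cell per row.

```r
# Toy data; codes and labels are ASSUMED placeholders, not the real codebook.
toy <- data.frame(GENDERIDENTITY = c(1, 2, 2, 1, 3, NA))

# Named lookup vector: names are the numeric codes, values are the text labels.
gender_labels <- c("1" = "Label A", "2" = "Label B", "3" = "Label C")

# Recode by lookup; NA codes stay NA in the text column.
toy$GENDERIDENTITYtext <- unname(gender_labels[as.character(toy$GENDERIDENTITY)])

# Check the work: every code should map to exactly one label.
check <- table(code  = toy$GENDERIDENTITY,
               label = toy$GENDERIDENTITYtext,
               useNA = "ifany")
print(check)
```

If a row of the cross-tab has counts in more than one column, or the text column sits next to a different source variable than intended, the recode was applied to the wrong input.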

Q2. Recode FIRSTGEN into a new variable FIRSTGEN_txt and use table command to check your work. (1.5 Pts)

##   FIRSTGEN FIRSTGEN_txt
## 1        2          Yes
## 2        1           No
## 3        1           No
## 4        1           No
## 5        1           No
## 6        1           No
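One way to produce a column like this is `factor()` with explicit levels and labels. The sketch below assumes the mapping visible in the printed rows (1 = "No", 2 = "Yes") and uses a toy data frame built from those six rows rather than the full dataset.

```r
# Toy data copied from the six rows printed above.
toy <- data.frame(FIRSTGEN = c(2, 1, 1, 1, 1, 1))

# Mapping inferred from the printed output: 1 -> "No", 2 -> "Yes".
toy$FIRSTGEN_txt <- factor(toy$FIRSTGEN, levels = c(1, 2), labels = c("No", "Yes"))

# Check the work with a cross-tab of old vs. new values.
firstgen_check <- table(toy$FIRSTGEN, toy$FIRSTGEN_txt)
print(firstgen_check)
```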

Q3. Develop code to compare the mean for SATV and SATM (1 Pt)

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Compare
mean(df$SATV, na.rm = TRUE) - mean(df$SATM, na.rm = TRUE) < 0

## [1] TRUE

sd(df$SATV, na.rm = TRUE) - sd(df$SATM, na.rm = TRUE) < 0

## [1] TRUE

Notice that the means are approximately equal. However, \(\bar{S}_v-\bar{S}_m < 0 \implies \bar{S}_v < \bar{S}_m\), if only slightly.
The standard deviations differ more markedly: \(\hat{\sigma}_{S_v} - \hat{\sigma}_{S_m} < 0 \implies \hat{\sigma}_{S_v} < \hat{\sigma}_{S_m}\)
Lastly, notice: let \(\Delta_{S_v} = \bar{S}_v-M_{ed}(S_v)\) and \(\Delta_{S_m} = \bar{S}_m-M_{ed}(S_m)\)
- \[ |\Delta_{S_v}| > |\Delta_{S_m}| \]
- Meaning, our data for \(S_v\) is more skewed than \(S_m\).
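The mean-minus-median gap used above can be computed with a small helper. This is a sketch on a toy vector (the six rows printed later in this report plus one NA), not on the full data, and `center_gap` is a hypothetical helper name. The gap is only a rough heuristic for the direction of skew, not a formal skewness statistic.

```r
# Hypothetical helper: mean, median, and their difference, ignoring NAs.
center_gap <- function(x) {
  m  <- mean(x, na.rm = TRUE)
  md <- median(x, na.rm = TRUE)
  c(mean = m, median = md, gap = m - md)
}

# Toy scores (illustrative only).
sat_v <- c(800, 400, 690, 630, 430, 650, NA)
sat_m <- c(800, 400, 620, 560, 530, 450, 700)

gap_v <- center_gap(sat_v)
gap_m <- center_gap(sat_m)

# Negative gap: mean below median (suggests left skew); compare magnitudes.
print(rbind(SATV = gap_v, SATM = gap_m))
```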

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_density()`).

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).

Mathematical Analysis :

SATV scores are left-skewed \(\implies\) Median > Mean
SATM scores are left-skewed \(\implies\) Median > Mean
- Graphically, these conclusions are abundantly clear so how can we come up with a more helpful conclusion?

Graphical Analysis :

Values appear to pool in similar locations in both distributions.
- With the exception of the Mathematics scores
  - Many students do spectacularly well in math.
The two distributions deviate from each other over \(600 \le x \le 700\); consider the SATM distribution specifically.

CI-95% & Hypothesis Testing \(\Delta\) Means :

There are some big issues with this: we cannot simply apply a t-test here, as it does not take the skew into account and its first assumption (normally distributed data) is violated.
So the first question becomes: what type of distribution is this?
- How can we test for types of distribution?
And what can we do about skewed distributions and hypothesis testing of the \(\Delta\) in means?
- Box-Cox method, i.e. artificially normalize the data and re-interpret the transformed data.
- Approximate the underlying distribution and use mathematical approximation methods; in particular, going from a probability mass function to a moment-generating function allows a condensed, computationally cheap analysis of the distribution.
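As a rough illustration of the Box-Cox idea, the sketch below estimates the transformation parameter \(\lambda\) with `MASS::boxcox()` on simulated data. Everything here is an assumption for demonstration: the data are simulated, right-skewed, and strictly positive (the SAT scores in this report are left-skewed, which would push \(\lambda\) above 1), and the grid of \(\lambda\) values is arbitrary.

```r
library(MASS)  # provides boxcox(); shipped with standard R installations

set.seed(1)
# Simulated, strictly positive, right-skewed data (NOT the SAT data).
y <- rlnorm(2000, meanlog = 0, sdlog = 0.5)

# Intercept-only model; y = TRUE keeps the response for boxcox().
fit <- lm(y ~ 1, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transform; lambda = 0 corresponds to the log transform.
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat

print(lambda_hat)
```

Any test or interval computed on `y_bc` describes the transformed scale, so conclusions have to be re-interpreted before being stated in the original units.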

Excerpt :

So, clearly this distribution is rather complex. We should then also consider the relationship between the corresponding SATV and SATM values of each observation \(x_i\), to evaluate the underlying patterns and better understand the outliers.

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 706 rows containing missing values or values outside the scale range
## (`geom_point()`).

Not so fast!

So, this model is cool, but we must first evaluate the assumptions of a linear model; in other words, diagnostics. Otherwise, we are just doing nonsense.

Here are the assumptions of Linear Regression : L. I. N. E.

Linearity of predictor & predicted
Independence of Error
Normality of Error
Equal Variance of Error

Consider that we have leverage points that are clearly highly influential (i.e. high Cook's distance values) for X in the range 200-300. Simply by viewing the scatter plot, it is clear that we are likely not capturing the general pattern but rather the influence of high-leverage points and outliers in that range.

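\[ \begin{aligned} D_i &= \frac{r_i^2}{2}\cdot\frac{h_{ii}}{1-h_{ii}} \\ \text{where: } r_i &= \frac{e_i}{s\sqrt{1-h_{ii}}} = \text{standardized residual} \\ h_{ii} &= \text{leverage} \\ s &= \text{residual standard error of the } e_i \text{ (on } n-2 \text{ degrees of freedom)} \\ n &= \text{sample size} \\ 2 &= \text{number of estimated coefficients (intercept and slope)} \end{aligned} \]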

So a high Cook's distance comes from a large standardized residual (i.e. an outlier), from a high-leverage point, or from both.
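The formula can be checked numerically against R's built-in `cooks.distance()`. The sketch below uses simulated data with one planted high-leverage outlier; the intercept, slope, and noise level only loosely echo the scale of SAT scores and are assumptions, not estimates from the dataset.

```r
set.seed(42)
# Simulated predictor and response (NOT the SAT data).
x <- runif(200, min = 200, max = 800)
y <- 225 + 0.64 * x + rnorm(200, sd = 65)

# Plant one point at the low edge of x with a badly fitted response.
x[1] <- 200
y[1] <- 780

fit <- lm(y ~ x)

r <- rstandard(fit)       # standardized residuals r_i
h <- hatvalues(fit)       # leverages h_ii
p <- length(coef(fit))    # number of coefficients: 2 for simple regression

# Cook's distance from the formula: r_i^2 / p * h_ii / (1 - h_ii)
D_manual  <- r^2 / p * h / (1 - h)
D_builtin <- cooks.distance(fit)

print(all.equal(D_manual, D_builtin))
print(which.max(D_manual))
```

The planted point combines a large residual with above-average leverage, which is exactly the combination the formula penalizes.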

Consider the following heat map :

## `geom_smooth()` using formula = 'y ~ x'

To further emphasize this, let's consider the diagnostics:

## `geom_smooth()` using formula = 'y ~ x'

## 
## Call:
## lm(formula = SATV ~ SATM, data = Diagnostics)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -534.79  -37.95    1.59   39.85  447.86 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.246e+02  1.373e+00   163.5   <2e-16 ***
## SATM        6.378e-01  2.126e-03   300.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.64 on 72715 degrees of freedom
## Multiple R-squared:  0.5531, Adjusted R-squared:  0.5531 
## F-statistic: 8.999e+04 on 1 and 72715 DF,  p-value: < 2.2e-16

There are many issues with our model. It fails the following conditions of simple linear regression:

Normality
Equal Variance

As indicated by the QQ plot and the Residuals vs. Fitted plot, our model is not very good. Furthermore, given the high Cook's distance values that the heat map shows around high-leverage points (i.e. points far from \(\bar{x}\)), it is clear that to have reliable inferences we must first develop a better model and then re-evaluate the diagnostics to assess the reliability of those inferential statistics. However, roughly speaking, we can use the predicted values to get a sense of the “average individuals”. Further time would be needed to create a better model for the association between SATV and SATM.

Final Analysis

According to the means and medians, people appear to do slightly worse in SATV
- However…
  - Some students are doing ridiculously well at Math (“pooling”), likely due to an underlying factor (possible solution: outlier analysis using multiple linear regression)
  - Neither distribution is normally distributed :(
    - Left skew for both
  - The distribution of \(S_v\) is NOT bi-modal
  - The distribution of \(S_m\) is bi-modal
The linear model developed requires more time to be useful.

Q4. Develop code to compare which of the two variables in Q3 has a larger number of missing data? (1 Pt)

##   is.na(df$SATV)     n
## 1          FALSE 72717
## 2           TRUE   706

##   is.na(df$SATM)     n
## 1          FALSE 73423

SATM doesn’t have any NAs, so SATV (706 missing values) has the larger number of missing data.
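The same comparison can be done in one step by counting NAs per column. This is a sketch on a toy data frame (the values are illustrative, not the real data), using only base R.

```r
# Toy data: SATV has two missing values, SATM has none.
toy <- data.frame(
  SATV = c(800, NA, 690, 630, NA, 650),
  SATM = c(800, 400, 620, 560, 530, 450)
)

# is.na() gives a logical matrix; colSums() counts TRUEs per column.
na_counts <- colSums(is.na(toy[c("SATV", "SATM")]))
print(na_counts)

# Name of the column with the most missing values.
more_missing <- names(na_counts)[which.max(na_counts)]
print(more_missing)
```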

Q5. Use the code below to create a new variable called SATCOMP, a combined score from SATV and SATM.

##   SATV SATM SATCOMP
## 1  800  800    1600
## 2  400  400     800
## 3  690  620    1310
## 4  630  560    1190
## 5  430  530     960
## 6  650  450    1100

Develop the code to calculate mean of the new variable SATCOMP. (1 Pt)

## [1] 1265.755
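A minimal sketch of this step, assuming SATCOMP is the plain sum SATV + SATM, which is consistent with the rows printed above (e.g. 690 + 620 = 1310). The toy data frame holds those six rows plus one extra row with a missing SATV (an assumption added to show how NA propagates), so its mean is not the dataset mean reported above.

```r
# Six printed rows plus one hypothetical row with a missing SATV.
toy <- data.frame(
  SATV = c(800, 400, 690, 630, 430, 650, NA),
  SATM = c(800, 400, 620, 560, 530, 450, 700)
)

# NA + number is NA, so SATCOMP is missing wherever SATV is missing.
toy$SATCOMP <- toy$SATV + toy$SATM

# na.rm = TRUE drops the missing composite before averaging.
mean_comp <- mean(toy$SATCOMP, na.rm = TRUE)
print(mean_comp)
```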


Q6. Develop the code to find the sd of the new variable SATCOMP. (1 Pt)

sd(df$SATCOMP, na.rm = TRUE)

## [1] 198.66

df %>% ggplot(aes(x = SATCOMP)) + geom_histogram(bins = 50)

## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).

So what just happened is that our graph incorporated both skews and became more dramatic. And as we should expect, the mean shifts to the right of where the two component means were: ( \(\bar{S}_v\), \(\bar{S}_m\), \(\bar{S}_c\) ) = (630, 635, 1265) respectively.
- That is, \(\bar{S}_c = \bar{S}_v + \bar{S}_m\), as we should expect.
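The additivity of the means, and the matching rule for the variance, can be verified on a toy example. The sketch below uses illustrative values with one missing SATV. It also shows a caveat: the identity is exact only when all three means are computed on the same rows, so adding the two marginal `na.rm = TRUE` means can differ slightly when only one column has NAs.

```r
# Toy data (illustrative only); SATV has one missing value.
toy <- data.frame(
  SATV = c(800, 400, 690, 630, 430, 650, NA),
  SATM = c(800, 400, 620, 560, 530, 450, 700)
)
cc <- toy[complete.cases(toy), ]   # rows where both scores are present
comp <- cc$SATV + cc$SATM

# Mean of the sum equals the sum of the means on the same rows.
mean_sum     <- mean(comp)
sum_of_means <- mean(cc$SATV) + mean(cc$SATM)

# Marginal means use different rows for SATM, so they need not match exactly.
sum_marginal <- mean(toy$SATV, na.rm = TRUE) + mean(toy$SATM, na.rm = TRUE)

# Variance of the sum: Var(X) + Var(Y) + 2 Cov(X, Y).
var_sum   <- var(comp)
var_parts <- var(cc$SATV) + var(cc$SATM) + 2 * cov(cc$SATV, cc$SATM)

print(c(mean_sum = mean_sum, sum_of_means = sum_of_means, sum_marginal = sum_marginal))
```

Because verbal and math scores are positively correlated, the covariance term makes the composite's spread larger than the two variances alone would suggest.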

Q7. Reflect on the dataset, what kind of analysis do you want to conduct using the existing variables such as SAT scores, gender identity, and racial identity? (1 Pts)

This dataset provides a rich opportunity for analyzing the relationship between SAT scores (SATV, SATM) and various demographic and socioeconomic factors such as gender identity, racial identity, and income.

A key analysis I would like to conduct is examining whether SAT performance differs across gender, racial, and economic groups, and whether factors like high school GPA, income, or first-generation status mediate these differences.

Q8. Propose a research question for the analysis that you mentioned in Q7. (1 Pts)

To what extent do gender identity, racial identity, and socioeconomic status (measured by income and first-generation status) predict SAT verbal (SATV) and SAT math (SATM) scores, and how does high school GPA mediate these relationships?

Q9. Write a short reflection: What questions do you still have about the data or what are your lingering questions about R and R Studio? (1 Pt)

Love this question. Syntax is always a struggle for both me and my group member. Usually, such an issue is solved by reading the documentation.

Brief-Analysis Of Standardized Edu Assessment – SAT

Isaiah & Teya

2025-02-18

Q10. If AI was used to solve any part of this assignment, please indicate where and for what purpose it was used. Please also share with us why AI was needed.