Preface : The following assignment was worked on together as a group, where I guided the discussion and analysis but also asked questions of my group member so we could think about the data more deeply. Afterwards, I took many of the things we discussed as a group to the next level. I hope you think this analysis is as cool as I think it is.

Enjoy :

Q1. Recode GENDERIDENTITY into GENDERIDENTITYtext and use the table command to check your work. (1.5 Pts)

##   RACEGROUP            GENDERIDENTITYtext
## 1         5                         White
## 2         5                         White
## 3         7 Two or more Races/Ethnicities
## 4         7 Two or more Races/Ethnicities
## 5         5                         White
## 6         7 Two or more Races/Ethnicities
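The recode itself is not echoed above. A sketch of the usual `case_when()` pattern follows; the numeric code-to-label pairs below are illustrative only, and the real mapping comes from the survey codebook.

df <- df %>%
  mutate(GENDERIDENTITYtext = case_when(
    GENDERIDENTITY == 1 ~ "Man",     # illustrative mapping; check the codebook
    GENDERIDENTITY == 2 ~ "Woman",   # illustrative mapping; check the codebook
    TRUE                ~ "Another identity / multiple"
  ))
table(df$GENDERIDENTITYtext)         # check the recode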

Q2. Recode FIRSTGEN into a new variable FIRSTGEN_txt and use table command to check your work. (1.5 Pts)

##   FIRSTGEN FIRSTGEN_txt
## 1        2          Yes
## 2        1           No
## 3        1           No
## 4        1           No
## 5        1           No
## 6        1           No
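A sketch consistent with the mapping shown above (1 = No, 2 = Yes), again assuming the same `df`:

df <- df %>%
  mutate(FIRSTGEN_txt = if_else(FIRSTGEN == 2, "Yes", "No"))
table(df$FIRSTGEN, df$FIRSTGEN_txt)   # cross-tab should put all counts on the diagonal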

Q3. Develop code to compare the mean for SATV and SATM (1 Pt)

(Figure: boxplots of SATV and SATM; 706 rows with missing values removed.)

# Compare: are the mean and SD of SATV smaller than those of SATM?
mean(df$SATV, na.rm = TRUE) - mean(df$SATM, na.rm = TRUE) < 0
## [1] TRUE
sd(df$SATV, na.rm = TRUE) - sd(df$SATM, na.rm = TRUE) < 0
## [1] TRUE
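For a more direct comparison, a sketch (assuming the same `df`) that tabulates the means and SDs side by side:

# One row of summary stats: columns named SATV_mean, SATV_sd, SATM_mean, SATM_sd.
df %>%
  summarise(across(c(SATV, SATM),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))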
(Figures: histograms and density curves of SATV and SATM; 706 rows with missing values removed.)


Mathematical Analysis :

Graphical Analysis :

95% CIs & Hypothesis Testing for a \(\Delta\) in Means :

  • There are some big issues with this – we clearly cannot use a simple t-test for the hypothesis test, as it does not take the skew into account, and our first assumption (a normal distribution) is violated.

  • So the question first becomes : what type of distribution is this?

    • How can we test for the type of distribution?
  • And what can we do about skewed distributions when hypothesis-testing a \(\Delta\) in means?

    • Box-Cox method – i.e., artificially normalize the data and re-interpret the transformed data (see the sketch after this list).
    • Approximate the underlying distribution and use mathematical approximation methods – in particular, going from a probability mass function to a moment generating function, allowing for a condensed, computationally cheap analysis of the distribution.
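A minimal sketch of the Box-Cox route in R, assuming the `df` from above (`MASS::boxcox()` profiles the transformation parameter \(\lambda\) over a grid):

satv <- na.omit(df$SATV)                          # positive scores only, NAs dropped
bc <- MASS::boxcox(lm(satv ~ 1), plotit = FALSE)  # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]                   # lambda that maximizes the likelihood

# Apply the transformation (lambda near 0 means a log transform).
satv_bc <- if (abs(lambda) < 1e-8) log(satv) else (satv^lambda - 1) / lambda
hist(satv_bc, breaks = 50)                        # re-check the shape on the new scale

Any test run on the transformed scale then has to be interpreted on that scale, which is exactly the re-interpretation cost mentioned above.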

Excerpt :

  • So, clearly this distribution is rather complex – we should then also consider the relationship between SATV and SATM for each observation \(x_i\), to evaluate underlying patterns and better understand the outliers.
(Figure: scatter plot of SATV against SATM with a linear fit; `geom_smooth()` using formula 'y ~ x'; 706 rows with missing values removed.)
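A minimal sketch of that scatter plot, assuming the same `df`:

df %>%
  ggplot(aes(x = SATM, y = SATV)) +
  geom_point(alpha = 0.1) +                    # heavy overplotting with ~73k points
  geom_smooth(method = "lm", formula = y ~ x)  # the linear fit shown above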

Not so fast!

So, this model is cool, but we must first evaluate the assumptions of a linear model. In other words : diagnostics. Otherwise, we are just doing nonsense.

Here are the assumptions of Linear Regression : L. I. N. E.

  • Linearity of predictor & predicted

  • Independence of Error

  • Normality of Error

  • Equal Variance of Error

Consider that we have leverage points which are clearly highly influential (i.e., high Cook's distance values) for X in the 200-300 range. Simply by viewing the scatter plot, it is clear that we are likely capturing not the general pattern but rather the influence of high-leverage points and outliers in that range.

\[ D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}} \\ \text{Where : } \\ r_i = \text{Standardized Residual} \\ h_{ii}=\text{Leverage} \\ p = \text{Number of Estimated Parameters (here } p = 2 \text{: intercept and slope)} \]

So consider that a high Cook's distance value comes from a high standardized residual (i.e., an outlier), high leverage, or both.
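A minimal sketch of how these quantities are computed in R, assuming the `Diagnostics` data frame fit below:

fit   <- lm(SATV ~ SATM, data = Diagnostics)
cooks <- cooks.distance(fit)   # D_i for each observation
lev   <- hatvalues(fit)        # leverage h_ii
rstd  <- rstandard(fit)        # standardized residuals r_i

# A common rule of thumb flags D_i > 4/n for closer inspection.
flagged <- which(cooks > 4 / nobs(fit))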

Consider the following heat map :

(Figure: heat map of Cook's distance over the SATV vs. SATM scatter, with a linear fit.)

To further emphasize this, let's consider the diagnostics.

(Figures: diagnostic plots, including Residuals vs. Fitted and Normal Q-Q.)

## 
## Call:
## lm(formula = SATV ~ SATM, data = Diagnostics)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -534.79  -37.95    1.59   39.85  447.86 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.246e+02  1.373e+00   163.5   <2e-16 ***
## SATM        6.378e-01  2.126e-03   300.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.64 on 72715 degrees of freedom
## Multiple R-squared:  0.5531, Adjusted R-squared:  0.5531 
## F-statistic: 8.999e+04 on 1 and 72715 DF,  p-value: < 2.2e-16
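The standard diagnostic panels come straight from base R's `plot()` method for `lm` objects; a sketch, reusing the same fit:

fit <- lm(SATV ~ SATM, data = Diagnostics)
par(mfrow = c(2, 2))   # 2x2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(fit)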

There are many issues with our model. It fails the following conditions of simple linear regression :

  • Normality

  • Equal Variance

As indicated by the Q-Q plot and the Residuals vs. Fitted plot, our model is not very good. Furthermore, given the high Cook's distance values around high-leverage points (i.e., points far from \(\bar{x}\)), illustrated in the heat map, it is clear that for reliable inferences we must first develop a better model and then re-evaluate the diagnostics to assess the reliability of those inferential statistics. However, roughly speaking, we can use the predicted values to get a sense of the “average individuals”. Further time would be needed to create a better model for the association between SATV and SATM.

Final Analysis

Q4. Develop code to compare which of the two variables in Q3 has a larger number of missing data? (1 Pt)

##   is.na(df$SATV)     n
## 1          FALSE 72717
## 2           TRUE   706
##   is.na(df$SATM)     n
## 1          FALSE 73423

SATV has 706 missing values, while SATM doesn’t have any NAs – so SATV has the larger amount of missing data.
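A compact way to get the same comparison, assuming the same `df`:

colSums(is.na(df[, c("SATV", "SATM")]))   # NA count per column: SATV = 706, SATM = 0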

Q5. Use the code below to create a new variable called SATCOMP, a combined score from SATV and SATM.

##   SATV SATM SATCOMP
## 1  800  800    1600
## 2  400  400     800
## 3  690  620    1310
## 4  630  560    1190
## 5  430  530     960
## 6  650  450    1100

Develop the code to calculate mean of the new variable SATCOMP. (1 Pt)

## [1] 1265.755
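The construction and its mean, sketched (assuming the same `df`; SATCOMP inherits the NAs from SATV):

df <- df %>% mutate(SATCOMP = SATV + SATM)   # combined score; NA where either part is NA
mean(df$SATCOMP, na.rm = TRUE)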


Q6. Develop the code to find the sd of the new variable SATCOMP. (1 Pt)

sd(df$SATCOMP, na.rm = TRUE)   # spread of the combined score
## [1] 198.66
df %>% ggplot(aes(x = SATCOMP)) + geom_histogram(bins = 50)   # shape of the combined score
## Warning: Removed 706 rows containing non-finite outside the scale range
## (`stat_bin()`).

  • So what just happened is that our graph incorporated both skews and became more dramatic. And, as we should expect, the mean is shifted to the right relative to where the component means were : \((\bar{S}_v, \bar{S}_m, \bar{S}_c) = (630, 635, 1265)\), respectively.

    • Which gives \(\bar{S}_c = \bar{S}_v + \bar{S}_m\), as we should expect, since the mean is additive (see the identity below).
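Spelled out, with all three means taken over the same complete cases:

\[ \bar{S}_c = \frac{1}{n}\sum_{i=1}^{n}\left(S_{v,i} + S_{m,i}\right) = \frac{1}{n}\sum_{i=1}^{n} S_{v,i} + \frac{1}{n}\sum_{i=1}^{n} S_{m,i} = \bar{S}_v + \bar{S}_m \]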

Q7. Reflect on the dataset: what kind of analysis do you want to conduct using the existing variables, such as SAT scores, gender identity, and racial identity? (1 Pt)

This dataset provides a rich opportunity for analyzing the relationship between SAT scores (SATV, SATM) and various demographic and socioeconomic factors such as gender identity, racial identity, and income.

A key analysis I would like to conduct is examining whether SAT performance differs across gender, racial, and economic groups, and whether factors like high school GPA, income, or first-generation status mediate these differences.

Q8. Propose a research question for the analysis that you mentioned in Q7. (1 Pt)

To what extent do gender identity, racial identity, and socioeconomic status (measured by income and first-generation status) predict SAT verbal (SATV) and SAT math (SATM) scores, and how does high school GPA mediate these relationships?
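If we were to pursue this, a first-pass model might look like the sketch below. The column names `INCOME` and `HSGPA` are assumptions standing in for whatever the codebook actually calls the income and high-school-GPA variables, and mediation proper would need a dedicated analysis beyond a single regression.

# Hypothetical first-pass model; INCOME and HSGPA are placeholder column names.
fit_rq <- lm(SATCOMP ~ GENDERIDENTITYtext + factor(RACEGROUP) + FIRSTGEN_txt + INCOME + HSGPA,
             data = df)
summary(fit_rq)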

Q9. Write a short reflection: What questions do you still have about the data or what are your lingering questions about R and R Studio? (1 Pt)

Love this question. Syntax is always a struggle for both me and my group member. Usually, such an issue is solved by reading documentation.

Q10. If AI was used to solve any part of this assignment, please indicate where and for what purpose it was used. Please also share with us why AI was needed.

We used a wide variety of resources : DataCamp, textbooks (A Modern Approach to Regression; Probability and Statistical Inference), custom GPTs (Probability Theory, Basic Stats, Advanced Regression), and ChatGPT.

Rationale :

  • DataCamp : I partnered with this company in the past – they provided me and my previous research group six months free, so now I use it a ton. Educators can request free access to distribute to students. I highly suggest acquiring it – it provides videos, practice problems, exams, and projects.

  • Textbooks : I am a good student, but a textbook makes me amazing; much smarter people than me wrote those books. I don't like how mathematical my other class is, so I'm using this assignment to investigate, in an applied setting, concepts I learned theoretically.

  • GPTs : Specialized knowledge specific to the textbook and lectures – they provide practice exams to test understanding, give responses relevant to classes I have taken in the past, and increase my coding capabilities. I require assistance with material covered in my textbooks. Additionally, I use them as a database to query specific equations, concepts, or examples so I can think about the problem more deeply.

  • ChatGPT : Used for stupid syntax errors. Unlike the GPTs, it often answers technical questions incompletely – in other words, it's smart but often doesn't consider the assumptions of statistical tests.

  • Friends : Isaiah reached out to some friends to talk about the data more deeply – particularly, to investigate methods for computing CIs for skewed distributions (one such method is sketched below).
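One method that came out of those conversations, sketched here for completeness: a percentile bootstrap CI for the difference in means, which avoids the normality assumption flagged earlier (assuming the same `df`).

set.seed(1)
cc <- df[complete.cases(df$SATV, df$SATM), ]   # keep paired, non-missing scores
boot_diffs <- replicate(2000, {
  i <- sample(nrow(cc), replace = TRUE)        # resample rows with replacement
  mean(cc$SATV[i]) - mean(cc$SATM[i])
})
quantile(boot_diffs, c(0.025, 0.975))          # percentile 95% CI for the mean difference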