library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
family_college <- read_csv("family_college.csv")
## Rows: 792 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): teen, parents
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction:

Question: Is there an association between a parent having a college degree and their teen attending college? The dataset, named “family_college”, comes from OpenIntro.org. It is a simulation based on results from National Center for Education Statistics studies. It records whether a parent has a college degree and whether their teen goes on to attend college. There are 792 observations and 2 variables, “teen” and “parents”. In “parents”, the value “degree” means the parent has a college degree. In “teen”, the value “college” means the teen attended college. In both variables, “not” means the opposite (no degree, or did not attend college). I will be using both variables and all rows. The dataset can be found here: https://www.openintro.org/data/index.php?data=family_college

Data Analysis:

First, I used “str” to look at the structure of the dataset. It shows that the variables teen and parents are both character columns, which is what we want, since values such as “college” and “degree” are categorical. Next, I used “head” to look at the first few rows and check whether any column names or values would be hard for readers of this document to understand. Then I used “colSums” together with “is.na” to count missing values (NAs) in each column. I had looked through the dataset beforehand and did not see any NAs, but I ran the check to be sure. It confirmed that there are no NAs in this dataset.

Next, I used the dplyr functions group_by, summarize, and mutate. Group_by splits the data into one group for every combination of parents and teen. Summarize counts the number of rows in each of those groups. After summarize, the result is still grouped by parents, so mutate adds a new column with each count divided by the total for that parents group. In other words, it gives the proportion of teens who did or did not attend college within each parent education group.

For the visualization, I used sample code that my professor emailed to me, along with help from AI. My question involves two categorical variables, so a bar plot is the best way to answer it. The bar plot displays the proportions calculated earlier.

### Check structure
str(family_college)
## spc_tbl_ [792 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ teen   : chr [1:792] "college" "college" "college" "college" ...
##  $ parents: chr [1:792] "degree" "degree" "degree" "degree" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   teen = col_character(),
##   ..   parents = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
### Check head of dataset
head(family_college)
## # A tibble: 6 × 2
##   teen    parents
##   <chr>   <chr>  
## 1 college degree 
## 2 college degree 
## 3 college degree 
## 4 college degree 
## 5 college degree 
## 6 college degree
### Check for N/A's
colSums(is.na(family_college))
##    teen parents 
##       0       0
#### Using dplyr functions
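# Count the rows for every parents/teen combination
family_college |>
  group_by(parents, teen) |>
  summarize(count = n()) |>
  # Output is still grouped by parents, so sum(count) is the total per parents group
  mutate(proportion = count / sum(count))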
## `summarise()` has grouped output by 'parents'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   parents [2]
##   parents teen    count proportion
##   <chr>   <chr>   <int>      <dbl>
## 1 degree  college   231      0.825
## 2 degree  not        49      0.175
## 3 not     college   214      0.418
## 4 not     not       298      0.582
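The grouped proportions above can be cross-checked in base R. The following is a minimal sketch that rebuilds the four cell counts from the printed summary, so it does not need the CSV file. The names `counts` and `row_props` are illustrative and are not part of the original analysis.

```r
# Cell counts copied from the summary table above
# (rows = parents, columns = teen).
counts <- matrix(
  c(231,  49,
    214, 298),
  nrow = 2, byrow = TRUE,
  dimnames = list(parents = c("degree", "not"),
                  teen    = c("college", "not"))
)

# margin = 1 divides each cell by its row total, so every row sums to 1.
# This mirrors mutate() running on data that is still grouped by parents.
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

Each row of `row_props` describes one parent group, so the two values in a row should add up to 1, just like the two proportions listed for each value of parents in the dplyr output.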
## Visualization
### I used AI for the visuals, with permission.

college_summary <- family_college |>
  group_by(parents, teen) |>
  summarize(count = n()) |>
  mutate(proportion = count / sum(count))
## `summarise()` has grouped output by 'parents'. You can override using the
## `.groups` argument.
ggplot(college_summary, aes(x = parents, y = proportion, fill = teen)) +
  geom_col(position = "dodge") +
  labs(title = "Teen College Attendance by Parent Education",
       x = "Parent Education", 
       y = "Proportion",
       fill = "Teen Status") +
  theme_minimal()

# Statistical Analysis

### Hypothesis

P₁ = Proportion of teens who go to college among those whose parents have a degree

P₂ = Proportion of teens who go to college among those whose parents do not have a degree

H₀: P₁ - P₂ = 0 (there is no association between parents having a college degree and teens attending college)

Hₐ: P₁ - P₂ ≠ 0 (there is an association between parents having a college degree and teens attending college)
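For reference, a standard statistic for testing these hypotheses is the pooled two-proportion z statistic, sketched below. The notation is introduced here for illustration and does not appear in the original analysis: $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of teens attending college in each parent group, $x_1$ and $x_2$ are the numbers of teens attending college, $n_1$ and $n_2$ are the group sizes, and $\hat{p}$ is the pooled proportion under H₀.

```latex
\[
z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
\]
```

For two groups, a chi-squared statistic computed without continuity correction equals $z^2$, so a two-sided test based on either statistic gives the same p-value.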

### Difference in proportions test, part 1: make a two-way table

new_table <- table(family_college$parents, family_college$teen)

new_table
##         
##          college not
##   degree     231  49
##   not        214 298
### Difference in proportions test, part 2: run the test

# With a two-column table, prop.test() treats the first column ("college") as successes
prop.test(new_table, correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  new_table
## X-squared = 121.82, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.3453386 0.4687239
## sample estimates:
##    prop 1    prop 2 
## 0.8250000 0.4179688
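The numbers in this output can be reproduced by hand from the four counts in the two-way table. The following is a minimal sketch, assuming the usual formulas: a pooled standard error for the test statistic and an unpooled (Wald) standard error for the confidence interval. The variable names are illustrative only.

```r
# Counts taken from the two-way table above.
x <- c(degree = 231, not = 214)   # teens who attend college, by parent group
n <- c(degree = 280, not = 512)   # group sizes: 231 + 49 and 214 + 298

p_hat <- x / n                    # sample proportions for each parent group
diff  <- unname(p_hat["degree"] - p_hat["not"])

# Pooled standard error: used for the test, because H0 assumes one common proportion.
p_pool  <- sum(x) / sum(n)
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- diff / se_pool

# Unpooled standard error: used for the 95% Wald confidence interval.
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci      <- diff + c(-1, 1) * qnorm(0.975) * se_diff

round(c(z_squared = z^2, lower = ci[1], upper = ci[2]), 4)
```

The value of `z^2` should agree with the X-squared statistic reported by prop.test, and `ci` should agree with the reported 95 percent confidence interval, up to rounding.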

# Interpret the results:

## Prop 1 and Prop 2:

P₁ = 0.825, meaning that 82.5% of teens whose parents have a degree go to college.

P₂ ≈ 0.418, meaning that about 41.8% of teens whose parents don’t have a degree go to college.

## Difference of proportions:

P₁ - P₂ = 0.407 (a difference of 40.7 percentage points)

## Remaining results:

P-value: Less than 2.2e-16, which is far smaller than our significance level of 0.05.

Confidence Interval: We are 95% confident that the true difference in proportions (P₁ - P₂) is between 0.345 and 0.469, that is, between 34.5 and 46.9 percentage points.

## Overall results:

The p-value is far below the significance level (alpha) of 0.05, so we reject the null hypothesis. This means that there is a statistically significant association between parents having a college degree and teens attending college. In this dataset, teens whose parents have a college degree are far more likely to attend college than teens whose parents do not. The 40.7 percentage point difference, with a confidence interval that does not include 0, backs this up as well.

# Conclusion and Future Directions:

My conclusion is that there is a statistically significant difference in college attendance between teens whose parents have a degree and teens whose parents do not. This suggests that kids are likely to follow in their parents’ footsteps and that how parents raise their kids matters. Kids will often grow up to use similar techniques and mindsets in their daily lives. For future research, it would be interesting to look at other characteristics of the parents who did and did not earn a degree, such as their family income, their gender, and where they are from. I am sure that would be very eye-opening to learn.