Does having a parent with a college degree affect whether a teen goes to college?

For my third project, I am investigating whether a parent’s education level is associated with whether their teen goes to college. This question matters because a college education is becoming more and more integral to the success of future generations. Family background can play a huge role in educational opportunities, support, and expectations, so an association between parents’ education and their children’s would be logical. Since the outcome is a binary qualitative response (whether or not an individual goes to college), a logistic regression model is a good fit: it matches the yes/no, or 1/0, setup of this project.

The dataset used is called family_college, from OpenIntro.org. It has 792 observations and 2 variables. Each observation represents a teen who either did or did not go to college, and whose parents either did or did not have a college degree. The variable “teen” shows whether the teen went to college, and the variable “parents” notes whether the parents had a degree. https://www.openintro.org/data/index.php?data=family_college.

Data Analysis

In the Data Analysis section, I began by loading the dataset from my Data101 directory. Next, using exploratory data analysis functions like head(), dim(), and str(), I ensured that the dataset was correctly loaded. Then, in the “Assigning Values” section, I used select() to keep the variables, filter() to remove missing values, and mutate() to recode the qualitative variables as numeric 0/1 indicators.

The results were: 82.5% of teens whose parent had a degree attended college, while 17.5% did not. Of teens whose parent did not have a degree, 41.8% attended college and 58.2% did not. I also made a bar plot to better visualize the data.

Loading Dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load("family_college.rda")

df <- family_college

Table || Initial Values

head(df)
## # A tibble: 6 × 2
##   teen    parents
##   <fct>   <fct>  
## 1 college degree 
## 2 college degree 
## 3 college degree 
## 4 college degree 
## 5 college degree 
## 6 college degree
dim(df)
## [1] 792   2
str(df)
## tibble [792 × 2] (S3: tbl_df/tbl/data.frame)
##  $ teen   : Factor w/ 2 levels "college","not": 1 1 1 1 1 1 1 1 1 1 ...
##  $ parents: Factor w/ 2 levels "degree","not": 1 1 1 1 1 1 1 1 1 1 ...
table(df$teen)
## 
## college     not 
##     445     347
table(df$parents)
## 
## degree    not 
##    280    512
table(df$parents, df$teen)
##         
##          college not
##   degree     231  49
##   not        214 298

Assigning Values

df_clean <- df %>%
  select(teen, parents) %>%
  filter(!is.na(teen), !is.na(parents)) %>%
  mutate(
    teen_college = ifelse(teen == "college", 1, 0),
    parent_degree = ifelse(parents == "degree", 1, 0)
  )

head(df_clean)
## # A tibble: 6 × 4
##   teen    parents teen_college parent_degree
##   <fct>   <fct>          <dbl>         <dbl>
## 1 college degree             1             1
## 2 college degree             1             1
## 3 college degree             1             1
## 4 college degree             1             1
## 5 college degree             1             1
## 6 college degree             1             1

Proportion Table

college_prop <- prop.table(table(df_clean$parents, df_clean$teen), margin = 1)

college_prop
##         
##            college       not
##   degree 0.8250000 0.1750000
##   not    0.4179688 0.5820312
barplot(
  college_prop[, "college"],
  ylim = c(0, 1),
  main = "Proportion of Teens Going to College by Parent Degree Status",
  xlab = "Parent College Degree Status",
  ylab = "Proportion of Teens Going to College"
)

Regression Analysis

The model I used predicts whether a teen goes to college based on whether their parent has a college degree. In this model, teen_college is 1 if the teen went to college and 0 if they did not. Similarly, parent_degree is 1 if the parent has a college degree and 0 if they do not.

The logistic regression showed that parent_degree received a coefficient of roughly 1.8817, with a p-value of less than 2e-16. Because the p-value is far below .05, there is statistically significant evidence of an association between parents having a college degree and their children attending college. The odds ratio for parent_degree was 6.56, meaning that teens with a parent with a college degree have about 6.56 times the odds of attending college compared with other teens. The 95% confidence interval for the odds ratio was 4.64 to 9.44, which lies entirely above 1.
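Because the model has a single binary predictor, the odds ratio and the fitted probabilities can be checked by hand. The sketch below is an illustrative check, not part of the original analysis: it rebuilds the odds ratio from the cell counts in the cross-tabulation and converts the reported coefficients back to probabilities with plogis(). The names counts, odds_degree, odds_not, b0, and b1 are introduced here only for this example.

```r
# Cell counts copied from table(df$parents, df$teen) in this report.
counts <- matrix(
  c(231,  49,
    214, 298),
  nrow = 2, byrow = TRUE,
  dimnames = list(parents = c("degree", "not"), teen = c("college", "not"))
)

# Odds of attending college within each parent group.
odds_degree <- counts["degree", "college"] / counts["degree", "not"]
odds_not    <- counts["not", "college"] / counts["not", "not"]

# Ratio of the two odds; it should agree with exp(coef(model))["parent_degree"].
odds_ratio <- odds_degree / odds_not

# Coefficients copied (rounded) from summary(model).
b0 <- -0.3311  # intercept: log-odds of college when parent_degree = 0
b1 <-  1.8817  # change in log-odds when parent_degree = 1

# plogis() is the inverse logit, turning log-odds into probabilities.
p_not    <- plogis(b0)
p_degree <- plogis(b0 + b1)

round(c(odds_ratio = odds_ratio, p_not = p_not, p_degree = p_degree), 3)
```

The odds ratio should come out near 6.56, and the two probabilities should land near the 41.8% and 82.5% shown in the proportion table, which is what we would expect when a logistic model has exactly one two-level predictor.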

Logistic Regression Model

model <- glm(teen_college ~ parent_degree, data = df_clean, family = binomial)

summary(model)
## 
## Call:
## glm(formula = teen_college ~ parent_degree, family = binomial, 
##     data = df_clean)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.3311     0.0896  -3.695  0.00022 ***
## parent_degree   1.8817     0.1810  10.395  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1085.79  on 791  degrees of freedom
## Residual deviance:  955.63  on 790  degrees of freedom
## AIC: 959.63
## 
## Number of Fisher Scoring iterations: 4

Confidence intervals, odds ratios, and AIC

confint(model)
## Waiting for profiling to be done...
##                    2.5 %     97.5 %
## (Intercept)   -0.5076948 -0.1562308
## parent_degree  1.5343862  2.2450769
exp(coef(model))
##   (Intercept) parent_degree 
##     0.7181208     6.5647530
exp(confint(model))
## Waiting for profiling to be done...
##                   2.5 %    97.5 %
## (Intercept)   0.6018815 0.8553618
## parent_degree 4.6384774 9.4411413
AIC(model)
## [1] 959.6258

Confusion matrix

df_clean$predicted_prob <- predict(model, type = "response")

df_clean$predicted_class <- ifelse(df_clean$predicted_prob >= 0.5, 1, 0)

conf_matrix <- table(
  Predicted = df_clean$predicted_class,
  Actual = df_clean$teen_college
)

conf_matrix
##          Actual
## Predicted   0   1
##         0 298 214
##         1  49 231

Accuracy, sensitivity, and specificity

TP <- conf_matrix["1", "1"]
TN <- conf_matrix["0", "0"]
FP <- conf_matrix["1", "0"]
FN <- conf_matrix["0", "1"]

accuracy <- (TP + TN) / (TP + TN + FP + FN)

sensitivity <- TP / (TP + FN)

specificity <- TN / (TN + FP)

accuracy
## [1] 0.6679293
sensitivity
## [1] 0.5191011
specificity
## [1] 0.8587896

ROC curve and AUC

library(pROC)
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_model <- roc(
  response = df_clean$teen_college,
  predictor = df_clean$predicted_prob,
  levels = c(0, 1),
  direction = "<"
)

auc_value <- auc(roc_model)
auc_value
## Area under the curve: 0.6889
plot.roc(
  roc_model,
  print.auc = TRUE,
  legacy.axes = TRUE,
  main = "ROC Curve for Logistic Regression Model",
  xlab = "False Positive Rate (1 - Specificity)",
  ylab = "True Positive Rate (Sensitivity)"
)

Model Assumptions and Diagnostics

For the logistic regression diagnostics, I used a confusion matrix, accuracy, sensitivity, specificity, the ROC curve, and the AUC. The model correctly predicted 298 teens who did not go to college and 231 teens who did go to college. It incorrectly classified 49 teens as going to college when they did not. It also incorrectly predicted that 214 teens would not go to college when they did.

The accuracy was about .668, meaning that the model correctly classified 66.8% of cases. The sensitivity was .519, meaning that the model correctly identified 51.9% of the teens who actually went to college. The specificity was .859, meaning that the model correctly identified 85.9% of the teens who did not go to college.

The AUC was .689 (68.9%), so the model does better than the 50% expected from random guessing, but it is not anywhere near a perfect 100%.
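The AUC reported by pROC (0.6889) can be tied back to the sensitivity and specificity above. With one binary predictor the model produces only two distinct predicted probabilities, so the ROC curve has a single interior point and the area under it reduces to the average of sensitivity and specificity. The sketch below is an illustrative check using the confusion matrix counts; auc_by_hand is a name made up for this example.

```r
# Counts copied from the confusion matrix in this report.
TP <- 231  # predicted college, went to college
TN <- 298  # predicted no college, did not go
FP <- 49   # predicted college, did not go
FN <- 214  # predicted no college, went to college

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

# The ROC curve runs (0, 0) -> (1 - specificity, sensitivity) -> (1, 1).
# The area under those two line segments simplifies to the mean of the two rates.
auc_by_hand <- (sensitivity + specificity) / 2

round(auc_by_hand, 4)
```

This also explains why the AUC is modest: every teen in the same parent group receives the same predicted probability, so the model cannot rank teens within a group.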

Conclusion

Ultimately, the results suggest that having a parent who has a college degree is strongly related to whether a teen goes to college. Analysis showed that 82.5% of teens with a parent who had a college degree also attended college, while only 41.8% of teens whose parents did not have degrees attended college. The logistic regression model supported this relationship because the parent_degree coefficient was positive and statistically significant.

In the fitted regression model, the odds ratio showed that teens with a parent with a college degree had 6.56 times the odds of going to college when compared to teens without a parent with a college degree.
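An odds ratio is not the same as a ratio of probabilities, so it can help to look at both. The sketch below is an added illustration, not part of the original analysis: it computes the ratio of the two attendance proportions (sometimes called a risk ratio) next to the odds ratio, using the counts from the cross-tabulation. The variable names are made up for this example.

```r
# Proportion attending college in each parent group, from the cross-tabulation.
p_degree <- 231 / (231 + 49)   # parent has a degree
p_not    <- 214 / (214 + 298)  # parent does not have a degree

# Ratio of probabilities: how many times as likely, in the everyday sense.
prob_ratio <- p_degree / p_not

# Ratio of odds: the quantity that exp(coef(model)) reports.
odds_ratio <- (p_degree / (1 - p_degree)) / (p_not / (1 - p_not))

round(c(prob_ratio = prob_ratio, odds_ratio = odds_ratio), 2)
```

In this sample, teens with a degree-holding parent were roughly twice as likely to attend college, even though their odds were about 6.56 times as high. The two numbers differ because college attendance is a common outcome here.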

The model had an accuracy of 66.8% and an AUC of .6889, meaning that it performed better than random guessing but was not anywhere near perfect. The specificity (.859) was much higher than the sensitivity (.519), meaning that the model was better at identifying teens who did not attend college and struggled more with those who did.

Overall, the implications of parents’ degrees being indicative of their children’s college attendance are not small or limited. With job markets becoming more and more competitive, postsecondary education is increasingly important. Finding a clear link between parents’ and children’s education helps identify the youth who are less likely to attend, which is important for finding solutions and ultimately giving all interested individuals a better shot at college. Without knowing which populations need support and extra attention, it is harder to bring about positive growth and change.

References

OpenIntro. (n.d.). family_college: Family college attendance dataset. OpenIntro. https://www.openintro.org/data/index.php?data=family_college