Main Question

I am interested to see whether economic factors (wage/income level, unemployment rate of a student’s county) have an impact on the colleges students selected (both in terms of distance and average tuition). I’d like to perform a \(\chi^2\) test for independence to see if socioeconomic factors like income are correlated with how far away a student attends college.

Reading in our data set

Interested in cross-sectional data from the 1980 High School and beyond survey. I pulled the original data set from here. The csv file has been uploaded to our R Bridge GitHub repo here. Column descriptions have also been provided.

github_url <- "https://raw.githubusercontent.com/andrewbowen19/rBridgeCUNY/main/data/college-distance.csv"
df <- read.csv(github_url, sep=",", header=TRUE)

head(df, 10)
##     X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2   2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5   5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
## 7   7 female     other 56.07       no       no  yes    no   7.2 8.85      0.4
## 8   8 female     other 54.85       no       no  yes    no   7.2 8.85      0.4
## 9   9   male     other 64.74      yes       no  yes   yes   5.9 8.09      3.0
## 10 10 female     other 56.06       no       no  yes   yes   5.9 8.09      3.0
##    tuition education income region
## 1  0.88915        12   high  other
## 2  0.88915        12    low  other
## 3  0.88915        12    low  other
## 4  0.88915        12    low  other
## 5  0.88915        13    low  other
## 6  0.88915        12    low  other
## 7  0.84988        13    low  other
## 8  0.84988        15    low  other
## 9  0.88915        13    low  other
## 10 0.88915        15    low  other

Data Exploration

The score column in our dataset refers to the base year composite test score. This is a test given to high school seniors within the sample. We see from our summary call that the median and mean scores are within a half point, and could (though not necessarily) be close to a normal distribution. In addition, the unemp column gives the unemployment rate (as a percentage) of the student’s county in 1980. The mean and media county unemployment rates are both above 7%, which is above the natural rate of unemployment in the US (about 4.5%). For instance, the current unemployment rate (as of 2022) sits at 3.6%

summary.data.frame(df)
##        X            gender           ethnicity             score      
##  Min.   :    1   Length:4739        Length:4739        Min.   :28.95  
##  1st Qu.: 1186   Class :character   Class :character   1st Qu.:43.92  
##  Median : 2370   Mode  :character   Mode  :character   Median :51.19  
##  Mean   : 3955                                         Mean   :50.89  
##  3rd Qu.: 3554                                         3rd Qu.:57.77  
##  Max.   :37810                                         Max.   :72.81  
##    fcollege           mcollege             home              urban          
##  Length:4739        Length:4739        Length:4739        Length:4739       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      unemp             wage           distance         tuition      
##  Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
##  1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
##  Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
##  Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
##  3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
##  Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
##    education        income             region         
##  Min.   :12.00   Length:4739        Length:4739       
##  1st Qu.:12.00   Class :character   Class :character  
##  Median :13.00   Mode  :character   Mode  :character  
##  Mean   :13.81                                        
##  3rd Qu.:16.00                                        
##  Max.   :18.00

Data Wrangling

# Renaming some columns for improved readability
colnames(df)[colnames(df) == 'mcollege'] <- 'mother_college'
colnames(df)[colnames(df) == 'fcollege'] <- 'father_college'
colnames(df)[colnames(df) == 'unemp'] <- 'county_unemp'
colnames(df)[colnames(df) == 'score'] <- 'composite_score'
colnames(df)[colnames(df) == 'tuition'] <- 'avg_tuition'


# Converting tuition and distance columns to units of 
df$distance <- df$distance * 10
df$avg_tuition <- df$avg_tuition * 1000

# Taking some subsets of data -- basing it off the gender column
male_df <- df[df$gender=="male",]
female_df <- df[df$gender=="female",]


head(male_df, 5)
##   X gender ethnicity composite_score father_college mother_college home urban
## 1 1   male     other           39.15            yes             no  yes   yes
## 3 3   male     other           48.74             no             no  yes   yes
## 4 4   male      afam           40.40             no             no  yes   yes
## 6 6   male     other           54.71             no             no  yes   yes
## 9 9   male     other           64.74            yes             no  yes   yes
##   county_unemp wage distance avg_tuition education income region
## 1          6.2 8.09        2      889.15        12   high  other
## 3          6.2 8.09        2      889.15        12    low  other
## 4          6.2 8.09        2      889.15        12    low  other
## 6          5.6 8.09        4      889.15        12    low  other
## 9          5.9 8.09       30      889.15        13    low  other

Graphics

Scatter plot

Creating a scatter plot of our distance vs county_unemp columns. Interested to see if students that come from counties with higher unemployment tend to stay closer to home. There may be many reasons for this: easier/cheaper commute, need to work to provide family income, etc.

library(ggplot2)

scat <- ggplot(df, aes(x=county_unemp, y=distance)) + geom_point()
scat + ggtitle("Distance to college (in miles) vs County Unemployment Rate")

#### Histogram Creating a histogram of composite_scores. To the eye it appears slightly skewed left.

hist <- ggplot(df, aes(x=composite_score)) + geom_histogram()
hist + ggtitle("Composite Test Scores -- High School Seniors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Box Plot

Creating a

# Mother college
m <- ggplot(df, aes(x=mother_college, y=composite_score)) + 
  geom_boxplot()
m + ggtitle("Composite Test Scores by Mother's College Attendance")

# Father college
f <- ggplot(df, aes(x=father_college, y=composite_score)) + 
  geom_boxplot()
f + ggtitle("Composite Test Scores by Father's College Attendance")

#### Other Plots Want to create a density plot of wages overlaid with tuition. Keep in mind the tuition column is the average tuition (in USD) for the state in which college was attended. Still, this should show if students from certain income ranges chose expensive/affordable colleges

dp <- ggplot(df, aes(x=avg_tuition)) + 
  geom_density()

dp + ggtitle("Avg Tuition (USD)")

I’m interested in seeing a violin plot of high vs low income for tuition and distance as well. This shows the probability density of our variables (avg_tuition and distance )

# Tuition vs income
t <- ggplot(df, aes(x=income, y=avg_tuition)) + 
  geom_violin()
t + ggtitle("Avg Tuition (USD) vs Family Income Status")

# Distance vs income
d <- ggplot(df, aes(x=income, y=distance)) + 
  geom_violin()
d + stat_summary(fun=median, geom="point", size=2, color="red")

d + ggtitle("Distance to college (miles) vs Family Income Status")

T-test

Running a two independent sample t-test

high_incomes <- df[df$income=="high",]
low_incomes <- df[df$income=="low",]

Checking out summary statistics of distances first for high and low incomes

mean(high_incomes$distance)
## [1] 15.12454
mean(low_incomes$distance)
## [1] 19.20362
sd(high_incomes$distance)
## [1] 19.21389
sd(low_incomes$distance)
## [1] 24.23097

Performing t-test on high vs low income distance samples. We’re using an \(\alpha = 0.05\) and a null hypothesis that the two samples have no significant difference in means.

t.test(high_incomes$distance, low_incomes$distance)
## 
##  Welch Two Sample t-test
## 
## data:  high_incomes$distance and low_incomes$distance
## t = -6.1184, df = 3155.7, p-value = 1.061e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.386264 -2.771884
## sample estimates:
## mean of x mean of y 
##  15.12454  19.20362

Since \(p < \alpha = 0.05\), we can reject the null hypothesis and assume there is a significant difference between the sample means for high vs low income. One unexpected result was that the average distance for low-income students was higher than that of higher income students.

Running another t-test on the avg_tuition field between our high and low-income sample sets. Again we’ll be using an \(\alpha = 0.05\) and a null hypothesis that the two samples have no significant difference in means.

t.test(high_incomes$avg_tuition, low_incomes$avg_tuition)
## 
##  Welch Two Sample t-test
## 
## data:  high_incomes$avg_tuition and low_incomes$avg_tuition
## t = 3.6649, df = 2449.7, p-value = 0.0002527
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.79608 62.05843
## sample estimates:
## mean of x mean of y 
##  843.3910  802.9638

I’d liek to see if there is a correlation between income and distance. Will run a Pearson’s Correlation test on distance and some other columns to see if there are strong correlations between any variables.

dist_tuition_corr <- cor.test(df$distance, df$avg_tuition, method="pearson")
dist_tuition_corr
## 
##  Pearson's product-moment correlation
## 
## data:  df$distance and df$avg_tuition
## t = -6.9858, df = 4737, p-value = 3.226e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1290819 -0.0727172
## sample estimates:
##        cor 
## -0.1009806
dist_score_corr <- cor.test(df$distance, df$composite_score, method="pearson")
dist_score_corr
## 
##  Pearson's product-moment correlation
## 
## data:  df$distance and df$composite_score
## t = -4.6896, df = 4737, p-value = 2.815e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09626539 -0.03958343
## sample estimates:
##         cor 
## -0.06797927

Conclusion

We see slight negative correlations between these sets of variables. While we were expecting a students’ socio-economic background to impact their college decision, we actually see a statistically significant higher mean distance to college from lower income students in our dataset. This could be due to a number of factors, but was not in line with what we had expected prior to testing. The reasoning being that higher income students would have a greater ability to afford travel to colleges that are further away from their hometown. Our violin plots also show some lower-income students going to further colleges at a higher rate. Since we looked at the means of our high and low-income samples, these statistics will be more affected by outliers than a median value of distances. Overall, the results of our statistical tests did not support pre-existing assumptions we had going into the exercise.