Main Question
I am interested to see whether economic factors (wage/income level, unemployment rate of a student’s county) have an impact on the colleges students selected (both in terms of distance and average tuition). I’d like to perform a \(\chi^2\) test for independence to see if socioeconomic factors like income are correlated with how far away a student attends college.
Reading in our data set
Interested in cross-sectional data from the 1980 High School and beyond survey. I pulled the original data set from here. The csv file has been uploaded to our R Bridge GitHub repo here. Column descriptions have also been provided.
github_url <- "https://raw.githubusercontent.com/andrewbowen19/rBridgeCUNY/main/data/college-distance.csv"
df <- read.csv(github_url, sep=",", header=TRUE)
head(df, 10)
## X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## 7 7 female other 56.07 no no yes no 7.2 8.85 0.4
## 8 8 female other 54.85 no no yes no 7.2 8.85 0.4
## 9 9 male other 64.74 yes no yes yes 5.9 8.09 3.0
## 10 10 female other 56.06 no no yes yes 5.9 8.09 3.0
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other
## 7 0.84988 13 low other
## 8 0.84988 15 low other
## 9 0.88915 13 low other
## 10 0.88915 15 low other
Data Exploration
The score column in our dataset refers to the base year
composite test score. This is a test given to high school seniors within
the sample. We see from our summary call that the median
and mean scores are within a half point, and could (though not
necessarily) be close to a normal distribution. In addition, the
unemp column gives the unemployment rate (as a percentage)
of the student’s county in 1980. The mean and media county unemployment
rates are both above 7%, which is above the natural rate of
unemployment in the US (about 4.5%). For instance, the current
unemployment rate (as of 2022) sits at 3.6%
summary.data.frame(df)
## X gender ethnicity score
## Min. : 1 Length:4739 Length:4739 Min. :28.95
## 1st Qu.: 1186 Class :character Class :character 1st Qu.:43.92
## Median : 2370 Mode :character Mode :character Median :51.19
## Mean : 3955 Mean :50.89
## 3rd Qu.: 3554 3rd Qu.:57.77
## Max. :37810 Max. :72.81
## fcollege mcollege home urban
## Length:4739 Length:4739 Length:4739 Length:4739
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## unemp wage distance tuition
## Min. : 1.400 Min. : 6.590 Min. : 0.000 Min. :0.2575
## 1st Qu.: 5.900 1st Qu.: 8.850 1st Qu.: 0.400 1st Qu.:0.4850
## Median : 7.100 Median : 9.680 Median : 1.000 Median :0.8245
## Mean : 7.597 Mean : 9.501 Mean : 1.803 Mean :0.8146
## 3rd Qu.: 8.900 3rd Qu.:10.150 3rd Qu.: 2.500 3rd Qu.:1.1270
## Max. :24.900 Max. :12.960 Max. :20.000 Max. :1.4042
## education income region
## Min. :12.00 Length:4739 Length:4739
## 1st Qu.:12.00 Class :character Class :character
## Median :13.00 Mode :character Mode :character
## Mean :13.81
## 3rd Qu.:16.00
## Max. :18.00
Data Wrangling
# Renaming some columns for improved readability
colnames(df)[colnames(df) == 'mcollege'] <- 'mother_college'
colnames(df)[colnames(df) == 'fcollege'] <- 'father_college'
colnames(df)[colnames(df) == 'unemp'] <- 'county_unemp'
colnames(df)[colnames(df) == 'score'] <- 'composite_score'
colnames(df)[colnames(df) == 'tuition'] <- 'avg_tuition'
# Converting tuition and distance columns to units of
df$distance <- df$distance * 10
df$avg_tuition <- df$avg_tuition * 1000
# Taking some subsets of data -- basing it off the gender column
male_df <- df[df$gender=="male",]
female_df <- df[df$gender=="female",]
head(male_df, 5)
## X gender ethnicity composite_score father_college mother_college home urban
## 1 1 male other 39.15 yes no yes yes
## 3 3 male other 48.74 no no yes yes
## 4 4 male afam 40.40 no no yes yes
## 6 6 male other 54.71 no no yes yes
## 9 9 male other 64.74 yes no yes yes
## county_unemp wage distance avg_tuition education income region
## 1 6.2 8.09 2 889.15 12 high other
## 3 6.2 8.09 2 889.15 12 low other
## 4 6.2 8.09 2 889.15 12 low other
## 6 5.6 8.09 4 889.15 12 low other
## 9 5.9 8.09 30 889.15 13 low other
Graphics
Scatter plot
Creating a scatter plot of our distance vs
county_unemp columns. Interested to see if students that
come from counties with higher unemployment tend to stay closer to home.
There may be many reasons for this: easier/cheaper commute, need to work
to provide family income, etc.
library(ggplot2)
scat <- ggplot(df, aes(x=county_unemp, y=distance)) + geom_point()
scat + ggtitle("Distance to college (in miles) vs County Unemployment Rate")
#### Histogram Creating a histogram of
composite_scores. To
the eye it appears slightly skewed left.
hist <- ggplot(df, aes(x=composite_score)) + geom_histogram()
hist + ggtitle("Composite Test Scores -- High School Seniors")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Box Plot
Creating a
# Mother college
m <- ggplot(df, aes(x=mother_college, y=composite_score)) +
geom_boxplot()
m + ggtitle("Composite Test Scores by Mother's College Attendance")
# Father college
f <- ggplot(df, aes(x=father_college, y=composite_score)) +
geom_boxplot()
f + ggtitle("Composite Test Scores by Father's College Attendance")
#### Other Plots Want to create a density plot of wages overlaid with
tuition. Keep in mind the
tuition column is the average
tuition (in USD) for the state in which college was attended. Still,
this should show if students from certain income ranges chose
expensive/affordable colleges
dp <- ggplot(df, aes(x=avg_tuition)) +
geom_density()
dp + ggtitle("Avg Tuition (USD)")
I’m interested in seeing a violin plot of high vs low income for
tuition and distance as well. This shows the probability density of our
variables (avg_tuition and distance )
# Tuition vs income
t <- ggplot(df, aes(x=income, y=avg_tuition)) +
geom_violin()
t + ggtitle("Avg Tuition (USD) vs Family Income Status")
# Distance vs income
d <- ggplot(df, aes(x=income, y=distance)) +
geom_violin()
d + stat_summary(fun=median, geom="point", size=2, color="red")
d + ggtitle("Distance to college (miles) vs Family Income Status")
T-test
Running a two independent sample t-test
high_incomes <- df[df$income=="high",]
low_incomes <- df[df$income=="low",]
Checking out summary statistics of distances first for high and low incomes
mean(high_incomes$distance)
## [1] 15.12454
mean(low_incomes$distance)
## [1] 19.20362
sd(high_incomes$distance)
## [1] 19.21389
sd(low_incomes$distance)
## [1] 24.23097
Performing t-test
on high vs low income distance samples. We’re using an
\(\alpha = 0.05\) and a null hypothesis
that the two samples have no significant difference in means.
t.test(high_incomes$distance, low_incomes$distance)
##
## Welch Two Sample t-test
##
## data: high_incomes$distance and low_incomes$distance
## t = -6.1184, df = 3155.7, p-value = 1.061e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.386264 -2.771884
## sample estimates:
## mean of x mean of y
## 15.12454 19.20362
Since \(p < \alpha = 0.05\), we can reject the null hypothesis and assume there is a significant difference between the sample means for high vs low income. One unexpected result was that the average distance for low-income students was higher than that of higher income students.
Running another t-test on the avg_tuition field between
our high and low-income sample sets. Again we’ll be using an \(\alpha = 0.05\) and a null hypothesis that
the two samples have no significant difference in means.
t.test(high_incomes$avg_tuition, low_incomes$avg_tuition)
##
## Welch Two Sample t-test
##
## data: high_incomes$avg_tuition and low_incomes$avg_tuition
## t = 3.6649, df = 2449.7, p-value = 0.0002527
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 18.79608 62.05843
## sample estimates:
## mean of x mean of y
## 843.3910 802.9638
I’d liek to see if there is a correlation between income
and distance. Will run a Pearson’s Correlation test on
distance and some other columns to see if there are strong correlations
between any variables.
dist_tuition_corr <- cor.test(df$distance, df$avg_tuition, method="pearson")
dist_tuition_corr
##
## Pearson's product-moment correlation
##
## data: df$distance and df$avg_tuition
## t = -6.9858, df = 4737, p-value = 3.226e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1290819 -0.0727172
## sample estimates:
## cor
## -0.1009806
dist_score_corr <- cor.test(df$distance, df$composite_score, method="pearson")
dist_score_corr
##
## Pearson's product-moment correlation
##
## data: df$distance and df$composite_score
## t = -4.6896, df = 4737, p-value = 2.815e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09626539 -0.03958343
## sample estimates:
## cor
## -0.06797927
Conclusion
We see slight negative correlations between these sets of variables. While we were expecting a students’ socio-economic background to impact their college decision, we actually see a statistically significant higher mean distance to college from lower income students in our dataset. This could be due to a number of factors, but was not in line with what we had expected prior to testing. The reasoning being that higher income students would have a greater ability to afford travel to colleges that are further away from their hometown. Our violin plots also show some lower-income students going to further colleges at a higher rate. Since we looked at the means of our high and low-income samples, these statistics will be more affected by outliers than a median value of distances. Overall, the results of our statistical tests did not support pre-existing assumptions we had going into the exercise.