The “College Distance Data” was collected in a high school survey administered by the Department of Education in 1980, with a follow-up survey conducted in 1986. The first thing I want to investigate is which factors might influence students to spend more time in education.
I’m interested in the circumstances of students who go on to pursue a college degree or graduate education. I believe factors such as accessibility (distance), affordability (tuition), and income play a strong role for students who have longer educational careers.
data <- 'https://raw.githubusercontent.com/curiostegui/R_bridge/main/CollegeDistance.csv'
cdis <- read.csv(file = data, header = TRUE, sep = ",")
head(cdis)
## X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 1 male other 39.15 yes no yes yes 6.2 8.09 0.2
## 2 2 female other 48.87 no no yes yes 6.2 8.09 0.2
## 3 3 male other 48.74 no no yes yes 6.2 8.09 0.2
## 4 4 male afam 40.40 no no yes yes 6.2 8.09 0.2
## 5 5 female other 40.48 no no no yes 5.6 8.09 0.4
## 6 6 male other 54.71 no no yes yes 5.6 8.09 0.4
## tuition education income region
## 1 0.88915 12 high other
## 2 0.88915 12 low other
## 3 0.88915 12 low other
## 4 0.88915 12 low other
## 5 0.88915 13 low other
## 6 0.88915 12 low other
library(plyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(ggplot2)
summary(cdis)
## X gender ethnicity score
## Min. : 1 Length:4739 Length:4739 Min. :28.95
## 1st Qu.: 1186 Class :character Class :character 1st Qu.:43.92
## Median : 2370 Mode :character Mode :character Median :51.19
## Mean : 3955 Mean :50.89
## 3rd Qu.: 3554 3rd Qu.:57.77
## Max. :37810 Max. :72.81
## fcollege mcollege home urban
## Length:4739 Length:4739 Length:4739 Length:4739
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## unemp wage distance tuition
## Min. : 1.400 Min. : 6.590 Min. : 0.000 Min. :0.2575
## 1st Qu.: 5.900 1st Qu.: 8.850 1st Qu.: 0.400 1st Qu.:0.4850
## Median : 7.100 Median : 9.680 Median : 1.000 Median :0.8245
## Mean : 7.597 Mean : 9.501 Mean : 1.803 Mean :0.8146
## 3rd Qu.: 8.900 3rd Qu.:10.150 3rd Qu.: 2.500 3rd Qu.:1.1270
## Max. :24.900 Max. :12.960 Max. :20.000 Max. :1.4042
## education income region
## Min. :12.00 Length:4739 Length:4739
## 1st Qu.:12.00 Class :character Class :character
## Median :13.00 Mode :character Mode :character
## Mean :13.81
## 3rd Qu.:16.00
## Max. :18.00
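The summary above reports the text columns only as `Length`/`Class`/`Mode`, which hides their category counts. One possible way to see counts instead is to convert those columns to factors before summarising. This is a minimal sketch that assumes every character column in `cdis` is categorical; the helper name `chars_to_factors` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: convert every character column of a data frame to a factor.
chars_to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], factor)
  }
  df
}

# Runs only if `cdis` from the read.csv() step is in the workspace.
if (exists("cdis")) {
  print(summary(chars_to_factors(cdis)))
}
```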
Looking at the information, I noticed that certain columns might not have enough detail to use in our study, such as ethnicity, which only identifies African American and Hispanic participants and labels everyone else as “other”.
describe(cdis)
## cdis
##
## 15 Variables 4739 Observations
## --------------------------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 4739 1 3955 4479 237.9 474.8
## .25 .50 .75 .90 .95
## 1185.5 2370.0 3554.5 7993.0 16220.0
##
## lowest : 1 2 3 4 5, highest: 37410 37510 37610 37710 37810
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 4739 0 2
##
## Value female male
## Frequency 2600 2139
## Proportion 0.549 0.451
## --------------------------------------------------------------------------------
## ethnicity
## n missing distinct
## 4739 0 3
##
## Value afam hispanic other
## Frequency 786 903 3050
## Proportion 0.166 0.191 0.644
## --------------------------------------------------------------------------------
## score
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 2464 1 50.89 10.01 36.85 38.93
## .25 .50 .75 .90 .95
## 43.92 51.19 57.77 62.43 64.81
##
## lowest : 28.95 30.78 30.91 30.98 31.05, highest: 70.10 70.56 70.70 71.36 72.81
## --------------------------------------------------------------------------------
## fcollege
## n missing distinct
## 4739 0 2
##
## Value no yes
## Frequency 3753 986
## Proportion 0.792 0.208
## --------------------------------------------------------------------------------
## mcollege
## n missing distinct
## 4739 0 2
##
## Value no yes
## Frequency 4088 651
## Proportion 0.863 0.137
## --------------------------------------------------------------------------------
## home
## n missing distinct
## 4739 0 2
##
## Value no yes
## Frequency 852 3887
## Proportion 0.18 0.82
## --------------------------------------------------------------------------------
## urban
## n missing distinct
## 4739 0 2
##
## Value no yes
## Frequency 3635 1104
## Proportion 0.767 0.233
## --------------------------------------------------------------------------------
## unemp
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 116 0.999 7.597 2.905 4.1 4.6
## .25 .50 .75 .90 .95
## 5.9 7.1 8.9 11.2 12.8
##
## lowest : 1.4 2.5 2.8 3.0 3.1, highest: 16.0 16.3 17.7 22.3 24.9
## --------------------------------------------------------------------------------
## wage
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 41 0.996 9.501 1.496 7.33 7.54
## .25 .50 .75 .90 .95
## 8.85 9.68 10.15 11.56 12.15
##
## lowest : 6.59 7.04 7.09 7.18 7.33, highest: 11.37 11.56 11.62 12.15 12.96
## --------------------------------------------------------------------------------
## distance
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 61 0.997 1.803 2.062 0.1 0.1
## .25 .50 .75 .90 .95
## 0.4 1.0 2.5 4.5 6.0
##
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 12.0 14.2 15.0 16.0 20.0
## --------------------------------------------------------------------------------
## tuition
## n missing distinct Info Mean Gmd .05 .10
## 4739 0 41 0.996 0.8146 0.3883 0.2575 0.2575
## .25 .50 .75 .90 .95
## 0.4850 0.8245 1.1270 1.2483 1.4042
##
## lowest : 0.25751 0.43418 0.45497 0.48499 0.49538
## highest: 1.16513 1.16628 1.24827 1.38568 1.40416
## --------------------------------------------------------------------------------
## education
## n missing distinct Info Mean Gmd
## 4739 0 7 0.93 13.81 1.972
##
## lowest : 12 13 14 15 16, highest: 14 15 16 17 18
##
## Value 12 13 14 15 16 17 18
## Frequency 1832 613 518 556 907 256 57
## Proportion 0.387 0.129 0.109 0.117 0.191 0.054 0.012
## --------------------------------------------------------------------------------
## income
## n missing distinct
## 4739 0 2
##
## Value high low
## Frequency 1365 3374
## Proportion 0.288 0.712
## --------------------------------------------------------------------------------
## region
## n missing distinct
## 4739 0 2
##
## Value other west
## Frequency 3796 943
## Proportion 0.801 0.199
## --------------------------------------------------------------------------------
Next, I gather the subset of columns I’m interested in.
cdis2 <- subset(cdis,select = c(tuition,education,score,income,distance,gender,home))
colnames(cdis2) <- c('Tuition','Edu','Score','Income','Dis','Gender','Home')
Looking at the chart, the more years of education a student has, the shorter the distance they have to travel. We also observe that a majority of participants come from families that own their home.
ggplot(data=cdis2, mapping=aes(x=Edu, y=Dis, col=Home)) + geom_point()
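To put numbers behind the scatter plot, a per-group summary can help. The sketch below assumes the `cdis2` column names set earlier (`Edu`, `Dis`, `Home`); the function name `distance_by_edu` is made up for illustration and is not part of the original analysis.

```r
# Hypothetical helper: mean and median distance for each number of years of education.
distance_by_edu <- function(df) {
  out <- aggregate(Dis ~ Edu, data = df,
                   FUN = function(d) c(mean = mean(d), median = median(d)))
  # aggregate() stores the two statistics as a matrix column; flatten it.
  data.frame(Edu = out$Edu,
             mean_dis = out$Dis[, "mean"],
             median_dis = out$Dis[, "median"])
}

# Runs only if `cdis2` from the subset step is in the workspace.
if (exists("cdis2")) {
  print(distance_by_edu(cdis2))
  # Share of participants by home ownership.
  print(prop.table(table(cdis2$Home)))
}
```

If distance really does shrink with more schooling, the mean and median columns should fall as `Edu` rises.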
In the Education column, we can observe that most participants have studied 15 years or fewer.
hist(cdis2$Edu)
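The histogram's message can also be read off as cumulative shares. From the `describe()` frequencies, 1832 + 613 + 518 + 556 = 3519 of the 4739 participants (about 74%) report 15 years or fewer. Here is a short sketch of the same calculation, with `cumulative_share` as a hypothetical helper name:

```r
# Hypothetical helper: cumulative proportion of observations at or below each value.
cumulative_share <- function(x) {
  cumsum(prop.table(table(x)))
}

# Runs only if `cdis2` from the subset step is in the workspace.
if (exists("cdis2")) {
  print(round(cumulative_share(cdis2$Edu), 3))
}
```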
I initially suspected that those who study more would pay less in tuition. In this chart, we see that the boxplots for the higher years of education are fairly similar to those for fewer years of education.
boxplot(Tuition~Edu, data = cdis2, xlab = "Education", ylab = "Tuition", main = "Tuition by Years of Education")
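The box and median line in a boxplot come from Tukey's five-number summary, so the same comparison can be tabulated with `fivenum()`. This is only a sketch, assuming the `Tuition` and `Edu` columns of `cdis2`; `tuition_fivenum` is a hypothetical name.

```r
# Hypothetical helper: Tukey five-number summary of tuition per education level.
tuition_fivenum <- function(df) {
  # One row per education level, one column per statistic.
  out <- t(sapply(split(df$Tuition, df$Edu), fivenum))
  colnames(out) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
  out
}

# Runs only if `cdis2` from the subset step is in the workspace.
if (exists("cdis2")) {
  print(tuition_fivenum(cdis2))
}
```

Rows that look alike across education levels would support the reading that tuition does not change much with years of study.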
In this study, there seems to be a connection between distance and a participant’s educational career: the closer participants are to the school they attend, the longer their career. There weren’t any strong relationships between education and tuition, or between education and whether the family owns a home. In a future study, other variables could be examined, such as income or achievement scores.
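One way a follow-up could quantify these impressions is a rank correlation between distance and years of education, plus a linear model that adds income and score. Neither technique is part of the original study. `follow_up_checks` is a hypothetical name, and treating years of education as a numeric outcome is a simplifying assumption.

```r
# Hypothetical helper for a follow-up analysis; not the method used above.
follow_up_checks <- function(df) {
  list(
    # Spearman rank correlation: a negative value means more education
    # goes with a shorter distance.
    dis_edu_rho = cor(df$Dis, df$Edu, method = "spearman"),
    # Linear model of years of education on the candidate factors.
    model = lm(Edu ~ Dis + Tuition + Score + Income, data = df)
  )
}

# Runs only if `cdis2` from the subset step is in the workspace.
if (exists("cdis2")) {
  checks <- follow_up_checks(cdis2)
  print(checks$dis_edu_rho)
  print(summary(checks$model))
}
```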