Introduction

The “College Distance Data” comes from a high school survey administered by the Department of Education in 1980. A follow-up survey was conducted in 1986. The first thing I want to investigate is which factors might influence students to spend more years in education.

My Question & Hypothesis

I’m interested in the circumstances of students who go on to pursue a college degree or graduate education. I believe that accessibility (distance), affordability (tuition), and income play a strong role for students who have longer education careers.

Load Data

data <- 'https://raw.githubusercontent.com/curiostegui/R_bridge/main/CollegeDistance.csv'
cdis <- read.csv(file = data, header = TRUE, sep = ",")
head(cdis)
##   X gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other

Step 0 Load Packages

library(plyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ggplot2)

Step 1 Data Exploration

summary(cdis)
##        X            gender           ethnicity             score      
##  Min.   :    1   Length:4739        Length:4739        Min.   :28.95  
##  1st Qu.: 1186   Class :character   Class :character   1st Qu.:43.92  
##  Median : 2370   Mode  :character   Mode  :character   Median :51.19  
##  Mean   : 3955                                         Mean   :50.89  
##  3rd Qu.: 3554                                         3rd Qu.:57.77  
##  Max.   :37810                                         Max.   :72.81  
##    fcollege           mcollege             home              urban          
##  Length:4739        Length:4739        Length:4739        Length:4739       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      unemp             wage           distance         tuition      
##  Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
##  1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
##  Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
##  Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
##  3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
##  Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
##    education        income             region         
##  Min.   :12.00   Length:4739        Length:4739       
##  1st Qu.:12.00   Class :character   Class :character  
##  Median :13.00   Mode  :character   Mode  :character  
##  Mean   :13.81                                        
##  3rd Qu.:16.00                                        
##  Max.   :18.00

Looking at the summary, I noticed that certain columns might not contain enough detail to use in our study, such as ethnicity, which only identifies African American and Hispanic participants and labels everyone else as “other”.
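The character columns in the summary() output above only report Length, Class and Mode, which hides how many participants fall into each category. A minimal sketch of one way around this is to convert character columns to factors before summarizing. The data frame `toy` and the helper `to_factors` are hypothetical names introduced for illustration; the same helper could be applied to `cdis`.

```r
# Toy stand-in for cdis: a few illustrative rows, not the real data.
toy <- data.frame(
  ethnicity = c("other", "afam", "hispanic", "other", "other"),
  income    = c("high", "low", "low", "low", "high"),
  score     = c(39.15, 48.87, 48.74, 40.40, 40.48),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert every character column to a factor
# and leave all other columns untouched.
to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  df
}

toy_f <- to_factors(toy)
summary(toy_f)           # factor columns now show counts per level
table(toy_f$ethnicity)   # counts for a single column
```

With the real data, `summary(to_factors(cdis))` should report category counts directly, similar to what describe() shows.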

describe(cdis)
## cdis 
## 
##  15  Variables      4739  Observations
## --------------------------------------------------------------------------------
## X 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0     4739        1     3955     4479    237.9    474.8 
##      .25      .50      .75      .90      .95 
##   1185.5   2370.0   3554.5   7993.0  16220.0 
## 
## lowest :     1     2     3     4     5, highest: 37410 37510 37610 37710 37810
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     4739        0        2 
##                         
## Value      female   male
## Frequency    2600   2139
## Proportion  0.549  0.451
## --------------------------------------------------------------------------------
## ethnicity 
##        n  missing distinct 
##     4739        0        3 
##                                      
## Value          afam hispanic    other
## Frequency       786      903     3050
## Proportion    0.166    0.191    0.644
## --------------------------------------------------------------------------------
## score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0     2464        1    50.89    10.01    36.85    38.93 
##      .25      .50      .75      .90      .95 
##    43.92    51.19    57.77    62.43    64.81 
## 
## lowest : 28.95 30.78 30.91 30.98 31.05, highest: 70.10 70.56 70.70 71.36 72.81
## --------------------------------------------------------------------------------
## fcollege 
##        n  missing distinct 
##     4739        0        2 
##                       
## Value         no   yes
## Frequency   3753   986
## Proportion 0.792 0.208
## --------------------------------------------------------------------------------
## mcollege 
##        n  missing distinct 
##     4739        0        2 
##                       
## Value         no   yes
## Frequency   4088   651
## Proportion 0.863 0.137
## --------------------------------------------------------------------------------
## home 
##        n  missing distinct 
##     4739        0        2 
##                     
## Value        no  yes
## Frequency   852 3887
## Proportion 0.18 0.82
## --------------------------------------------------------------------------------
## urban 
##        n  missing distinct 
##     4739        0        2 
##                       
## Value         no   yes
## Frequency   3635  1104
## Proportion 0.767 0.233
## --------------------------------------------------------------------------------
## unemp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0      116    0.999    7.597    2.905      4.1      4.6 
##      .25      .50      .75      .90      .95 
##      5.9      7.1      8.9     11.2     12.8 
## 
## lowest :  1.4  2.5  2.8  3.0  3.1, highest: 16.0 16.3 17.7 22.3 24.9
## --------------------------------------------------------------------------------
## wage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0       41    0.996    9.501    1.496     7.33     7.54 
##      .25      .50      .75      .90      .95 
##     8.85     9.68    10.15    11.56    12.15 
## 
## lowest :  6.59  7.04  7.09  7.18  7.33, highest: 11.37 11.56 11.62 12.15 12.96
## --------------------------------------------------------------------------------
## distance 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0       61    0.997    1.803    2.062      0.1      0.1 
##      .25      .50      .75      .90      .95 
##      0.4      1.0      2.5      4.5      6.0 
## 
## lowest :  0.0  0.1  0.2  0.3  0.4, highest: 12.0 14.2 15.0 16.0 20.0
## --------------------------------------------------------------------------------
## tuition 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4739        0       41    0.996   0.8146   0.3883   0.2575   0.2575 
##      .25      .50      .75      .90      .95 
##   0.4850   0.8245   1.1270   1.2483   1.4042 
## 
## lowest : 0.25751 0.43418 0.45497 0.48499 0.49538
## highest: 1.16513 1.16628 1.24827 1.38568 1.40416
## --------------------------------------------------------------------------------
## education 
##        n  missing distinct     Info     Mean      Gmd 
##     4739        0        7     0.93    13.81    1.972 
## 
## lowest : 12 13 14 15 16, highest: 14 15 16 17 18
##                                                     
## Value         12    13    14    15    16    17    18
## Frequency   1832   613   518   556   907   256    57
## Proportion 0.387 0.129 0.109 0.117 0.191 0.054 0.012
## --------------------------------------------------------------------------------
## income 
##        n  missing distinct 
##     4739        0        2 
##                       
## Value       high   low
## Frequency   1365  3374
## Proportion 0.288 0.712
## --------------------------------------------------------------------------------
## region 
##        n  missing distinct 
##     4739        0        2 
##                       
## Value      other  west
## Frequency   3796   943
## Proportion 0.801 0.199
## --------------------------------------------------------------------------------

Step 2 Data Wrangling

Select the subset of columns I’m interested in and give them shorter names.

cdis2 <- subset(cdis,select = c(tuition,education,score,income,distance,gender,home))

colnames(cdis2) <- c('Tuition','Edu','Score','Income','Dis','Gender','Home')
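The renaming above is positional, so it relies on the order of names in colnames() matching the order of columns in subset(). As a sketch of a slightly more defensive alternative, the old and new names can be paired in a single named vector. The helper `select_and_rename` and the data frame `toy` are hypothetical and only illustrate the idea.

```r
# Toy stand-in with a few of the original column names (illustrative values).
toy <- data.frame(
  tuition   = c(0.89, 0.49),
  education = c(12, 16),
  distance  = c(0.2, 1.5),
  gender    = c("male", "female")
)

# Named vector: new name = old name. The order also sets the column order.
new_names <- c(Tuition = "tuition", Edu = "education", Dis = "distance")

# Hypothetical helper: select the mapped columns and rename them by name,
# failing early if a requested column does not exist.
select_and_rename <- function(df, mapping) {
  missing_cols <- setdiff(mapping, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  out <- df[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out
}

toy2 <- select_and_rename(toy, new_names)
names(toy2)
```

Because each new name sits next to the column it replaces, reordering or dropping a column cannot silently mislabel another one.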

Step 3 Graphics

Looking at the chart, the more years of education a student has, the less distance they tend to have to travel. We also observe that the majority of participants come from families that own a home.

ggplot(data=cdis2, mapping=aes(x=Edu, y=Dis, col=Home)) + geom_point()
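Because Edu takes only a handful of distinct values, many points in the scatter plot sit on top of each other, which makes the trend hard to judge by eye. One way to check the impression numerically is to compare the average distance within each education level. The following is a minimal sketch on a toy data frame with made-up values (`toy` and `dist_by_edu` are hypothetical names); on the real data the same call would use `data = cdis2`.

```r
# Toy stand-in for cdis2 with the renamed columns (illustrative values only).
toy <- data.frame(
  Edu = c(12, 12, 12, 14, 14, 16, 16, 18),
  Dis = c(0.2, 3.0, 6.0, 0.4, 2.0, 0.1, 1.0, 0.5)
)

# Mean distance for each number of years of education.
dist_by_edu <- aggregate(Dis ~ Edu, data = toy, FUN = mean)
dist_by_edu
```

If the group means fall steadily as Edu rises, that supports the reading of the chart; if they stay flat, the pattern may come mostly from a few outlying points.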

In the Education column, we can observe that most participants have studied for 15 years or less.

hist(cdis2$Edu)
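The share of participants with 15 years of education or less can also be computed directly from the frequency table printed by describe(cdis) earlier. This sketch copies those frequencies by hand; the variable names are chosen here for illustration.

```r
# Frequencies of years of education, copied from the describe(cdis) output.
edu_counts <- c("12" = 1832, "13" = 613, "14" = 518, "15" = 556,
                "16" = 907, "17" = 256, "18" = 57)

years <- as.numeric(names(edu_counts))

# Proportion of participants with 15 years of education or less.
share_15_or_less <- sum(edu_counts[years <= 15]) / sum(edu_counts)
round(share_15_or_less, 3)  # 0.743
```

With the data frame loaded, `mean(cdis2$Edu <= 15)` should give the same proportion without retyping the counts.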

I initially suspected that those who study longer would pay less in tuition. In this chart, however, the boxplots for the higher years of education look fairly similar to those for the fewer years of education.

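# The formula Tuition ~ Edu puts Edu on the x-axis and Tuition on the y-axis
boxplot(Tuition ~ Edu, data = cdis2, xlab = "Education", ylab = "Tuition", main = "Tuition by years of education")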
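To put a number on how similar the boxes are, one option is to compare the median tuition per education level (the center line of each box) and to compute a rank correlation between the two columns. This is a sketch on a toy data frame with made-up values; `toy`, `med_by_edu` and `rho` are hypothetical names, and the choice of Spearman correlation is an assumption made here, not something the original analysis used.

```r
# Toy stand-in for cdis2 (illustrative values only).
toy <- data.frame(
  Edu     = c(12, 12, 13, 14, 15, 16, 16, 18),
  Tuition = c(0.89, 0.49, 1.13, 0.82, 0.26, 1.40, 0.49, 0.82)
)

# Median tuition within each education level.
med_by_edu <- tapply(toy$Tuition, toy$Edu, median)
med_by_edu

# Spearman rank correlation: values near 0 suggest little monotonic relationship.
rho <- cor(toy$Edu, toy$Tuition, method = "spearman")
rho
```

On the real data the same calls would use `cdis2$Tuition` and `cdis2$Edu`; medians that barely move and a correlation close to zero would be consistent with the boxplot.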

Conclusion

In this study, there seems to be a connection between distance and the length of a participant’s education career: the closer they are to the school they attend, the longer their career. There were no strong relationships between education and tuition, or between education and whether their family owns a home. A future study could examine other variables, such as income or achievement scores.

Bonus Upload Dataset to Github

data <- "https://raw.githubusercontent.com/curiostegui/R_bridge/main/CollegeDistance.csv"
cdis <- read.csv(file = data, header = TRUE, sep = ",")