library(tidyverse)
library(dplyr)
library(ggplot2)
Project 2
Introduction
Is there a significant difference between the number of students attending the Germantown campus and the number attending the Rockville campus? I am using the data set “Montgomery College Enrollment Data” from https://data.montgomerycountymd.gov. It describes students enrolled at Montgomery College and their status. The data is provided by Montgomery County, MD, and was last updated on May 18, 2016. It has 25.3K rows and 18 columns, with variables such as Fall Term, Student Type, Student Status, and more. The variables I will focus on are Attending Germantown and Attending Rockville. I chose this question because I have been to both campuses and wanted to know how their student numbers compare.
Load the libraries
setwd("~/Data 101")
enrollment <- read.csv("Montgomery_College_Enrollment_Data.csv")
Data Analysis
To see whether there is a difference between students attending Germantown and students attending Rockville, I will use a difference in proportions test. First I need to check whether the data needs cleaning or editing before I can use the values. I will then summarize the two variables, Attending Germantown and Attending Rockville, into counts and combine them for a graph and for the test. A bar graph will show the difference between the campuses, and the p-value from the test will help me answer my question.
EDA
str(enrollment)
'data.frame': 25320 obs. of 18 variables:
$ Fall.Term : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ Student.Type : chr "Continuing" "Continuing" "Continuing" "New" ...
$ Student.Status : chr "Full-Time" "Part-Time" "Part-Time" "Full-Time" ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ Ethnicity : chr "Not Hispanic" "Not Hispanic" "Not Hispanic" "Not Hispanic" ...
$ Race : chr "White" "White" "Black" "Asian" ...
$ Attending.Germantown : chr "Yes" "No" "No" "No" ...
$ Attending.Rockville : chr "Yes" "Yes" "Yes" "Yes" ...
$ Attending.Takoma.Park.SS: chr "No" "No" "No" "No" ...
$ Attend.Day.or.Evening : chr "Day Only" "Evening Only" "Day & Evening" "Day Only" ...
$ MC.Program.Description : chr "Health Sciences (Pre-Clinical Studies)" "Building Trades Technology (AA & AAS)" "Computer Gaming & Simulation (AA - All Tracks)" "Graphic Design (AA, AAS, & AFA - All Tracks)" ...
$ Age.Group : chr "25 - 29" "21 - 24" "20 or Younger" "20 or Younger" ...
$ HS.Category : chr "Foreign Country" "MCPS" "MCPS" "MCPS" ...
$ MCPS.High.School : chr "" "Sherwood High School" "Quince Orchard Sr High School" "Thomas Sprigg Wootton High Sch" ...
$ City.in.MD : chr "Bethesda" "Olney" "Gaithersburg" "North Potomac" ...
$ State : chr "MD" "MD" "MD" "MD" ...
$ ZIP : int 20816 20832 20877 20878 20906 20876 20876 20903 20901 20851 ...
$ County.in.MD : chr "Montgomery" "Montgomery" "Montgomery" "Montgomery" ...
This shows the structure of the data frame: 25,320 observations of 18 variables, most of them character columns.
unique(enrollment$Attending.Germantown)
[1] "Yes" "No"
This shows the values that appear in the Attending Germantown variable.
unique(enrollment$Attending.Rockville)
[1] "Yes" "No"
This is the same code as above, but for Attending Rockville. Using unique() lets me see every distinct value in both variables: each is either "Yes" or "No".
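unique() lists the distinct values but not how often each occurs or whether the two campuses overlap. A minimal sketch with table() can show both; it runs on a small hypothetical data frame (toy) that only mimics the two columns, not on the real enrollment file.

```r
# Hypothetical toy data that mimics the two columns; not the real file
toy <- data.frame(
  Attending.Germantown = c("Yes", "No", "No", "No", "Yes"),
  Attending.Rockville  = c("Yes", "Yes", "Yes", "Yes", "No")
)

# How many "Yes" and "No" values are in one column
germantown_counts <- table(toy$Attending.Germantown)
print(germantown_counts)

# Cross-tabulate both columns to see how many students attend both campuses
overlap <- table(Germantown = toy$Attending.Germantown,
                 Rockville  = toy$Attending.Rockville)
print(overlap)
```

The "Yes"/"Yes" cell of the cross-tabulation is the number of students counted at both campuses.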
names(enrollment) <- gsub("[\\$,\\.]","_", names(enrollment))
The gsub code replaces any dollar signs, commas, and periods in the column names with underscores, so the symbols do not interfere with later code and graphs.
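As a small illustration of what the renaming does, the sketch below applies the same pattern to a few column names. The vector old_names is made up for the example, and the fixed = TRUE variant is an alternative I am suggesting, not something the project uses.

```r
# Hypothetical column names in the dotted style that read.csv() produces
old_names <- c("Fall.Term", "Attending.Takoma.Park.SS", "ZIP")

# Same pattern as in the project: each $ , or . becomes an underscore
new_names <- gsub("[\\$,\\.]", "_", old_names)
print(new_names)

# If only periods need replacing, fixed = TRUE treats "." as a literal
# character, so no regex escaping is needed
dots_only <- gsub(".", "_", old_names, fixed = TRUE)
print(dots_only)
```

Both calls give the same result here because the example names contain only periods.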
Summary of the dataset
summary(enrollment)
Fall_Term Student_Type Student_Status Gender
Min. :2015 Length:25320 Length:25320 Length:25320
1st Qu.:2015 Class :character Class :character Class :character
Median :2015 Mode :character Mode :character Mode :character
Mean :2015
3rd Qu.:2015
Max. :2015
Ethnicity Race Attending_Germantown Attending_Rockville
Length:25320 Length:25320 Length:25320 Length:25320
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Attending_Takoma_Park_SS Attend_Day_or_Evening MC_Program_Description
Length:25320 Length:25320 Length:25320
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Age_Group HS_Category MCPS_High_School City_in_MD
Length:25320 Length:25320 Length:25320 Length:25320
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
State ZIP County_in_MD
Length:25320 Min. : 926 Length:25320
Class :character 1st Qu.:20852 Class :character
Mode :character Median :20877 Mode :character
Mean :20892
3rd Qu.:20902
Max. :95492
NA's :99
This gives me a summary of the dataset: the length and type of each variable, plus the range of the numeric ones. ZIP is the only variable with missing values (99 NA's).
Germantown Campus
enrollment3 <- enrollment |>
filter(Attending_Germantown != "No") |>
select(Attending_Germantown) |>
summarise(Count = n()) |>
mutate(Campus = "Germantown")
head(enrollment3)
Count Campus
1 7307 Germantown
This code filters out the "No" rows in Attending Germantown so that only "Yes" remains, counts those rows, and adds a Campus label so the result can be graphed. head() shows the result of these steps.
Rockville Campus
enrollment2 <- enrollment |>
filter(Attending_Rockville != "No") |>
select(Attending_Rockville) |>
summarise(Count = n()) |>
mutate(Campus = "Rockville")
head(enrollment2)
Count Campus
1 16286 Rockville
This is the same code, applied to Rockville instead.
Combine
campus_comparison <- bind_rows(enrollment2, enrollment3) # bind_rows() comes from the dplyr source listed in the References at the bottom of this project
I used bind_rows instead of inner_join because inner_join didn’t work here: it matches rows on shared key values, and these two one-row summaries have none in common. bind_rows simply stacks the rows of the two tables.
print(campus_comparison)
Count Campus
1 16286 Rockville
2 7307 Germantown
This shows that bind_rows worked: both campus counts are now in one table.
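The same two-row table can also be built in one step without filtering each campus separately. This is only a sketch of an alternative in base R, run on a hypothetical toy data frame rather than the real enrollment data; toy, yes_counts, and toy_comparison are names made up for the example.

```r
# Hypothetical toy data that mimics the two renamed columns; not the real file
toy <- data.frame(
  Attending_Germantown = c("Yes", "No", "No", "No", "Yes"),
  Attending_Rockville  = c("Yes", "Yes", "Yes", "Yes", "No")
)

# Comparing to "Yes" gives a TRUE/FALSE matrix; colSums() counts the TRUEs
campus_cols <- c("Attending_Rockville", "Attending_Germantown")
yes_counts <- colSums(toy[, campus_cols] == "Yes")

# Same layout as campus_comparison: one row per campus
toy_comparison <- data.frame(
  Count  = unname(yes_counts),
  Campus = c("Rockville", "Germantown")
)
print(toy_comparison)
```

The filter, summarise, and bind_rows version is easier to follow step by step; this version is shorter when more campuses are added.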
Graph
ggplot(campus_comparison, aes(x = Campus, y = Count, fill = Campus)) +
geom_col() +
labs(title = "Student Attendance: Germantown vs. Rockville",
x = "Campus Location",
y = "Number of Students") +
theme_minimal()
The graph above shows the difference between the number of students attending the Rockville campus and the number attending the Germantown campus.
Statistical Analysis
Null Hypothesis (H₀): p1 = p2 (The proportion of students attending the Rockville campus is the same as the proportion attending the Germantown campus)
Alternative Hypothesis (H₁): p1 ≠ p2 (The proportion of students attending the Rockville campus is not the same as the proportion attending the Germantown campus)
p1 = proportion of students attending the Rockville campus
p2 = proportion of students attending the Germantown campus
total_students <- dim(enrollment)
print(total_students)
[1] 25320 18
I need this to get the total number of students in the dataset (25,320 rows), which is used in the test below.
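dim() returns both the number of rows and the number of columns. If only the student total is needed, nrow() returns it as a single number. The sketch below shows the difference on a hypothetical toy data frame; toy and n_students are names made up for the example.

```r
# Hypothetical toy data; not the real enrollment file
toy <- data.frame(
  id     = 1:5,
  campus = c("Rockville", "Germantown", "Rockville", "Rockville", "Germantown")
)

# dim() gives rows first, then columns
toy_dim <- dim(toy)
print(toy_dim)

# nrow() gives just the number of rows, ready to pass to a test
n_students <- nrow(toy)
print(n_students)
```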
prop_results <- prop.test(c(16286,7307), c(25320, 25320))
print(prop_results)
2-sample test for equality of proportions with continuity correction
data: c(16286, 7307) out of c(25320, 25320)
X-squared = 6396.6, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.3464594 0.3627823
sample estimates:
prop 1 prop 2
0.6432070 0.2885861
Interpretation: The test output gives everything needed to answer my question: the X-squared statistic (6396.6), the p-value, and the 95% confidence interval for the difference in proportions (0.346 to 0.363). The sample proportions are 0.643 for Rockville and 0.289 for Germantown, and the p-value is very low (less than 2.2e-16).
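The individual numbers in the printed output can also be pulled out of the test object, which is handy when they need to be quoted or compared in code. This is a minimal sketch that reruns the test with the same counts and totals; the names p_value, estimates, and ci_diff are made up for the example.

```r
# Same counts and totals as in the test above
prop_results <- prop.test(c(16286, 7307), c(25320, 25320))

# Pieces of the result, stored by name in the returned object
p_value   <- prop_results$p.value                # p-value of the test
estimates <- unname(prop_results$estimate)       # the two sample proportions
ci_diff   <- as.numeric(prop_results$conf.int)   # 95% CI for p1 - p2

print(round(estimates, 3))
print(round(ci_diff, 3))

# TRUE means the result is significant at the .05 level
print(p_value < 0.05)
```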
Conclusion
The test above gave a p-value less than 2.2e-16, which is far smaller than .05. We reject the null hypothesis in favor of the alternative: the proportion of students attending Rockville is not the same as the proportion attending Germantown. The graph supports the test, since it shows a clear difference between the two campuses: Rockville has 16286 students while Germantown has 7307. I think my answer is accurate because the p-value is so small.
In the future I want to explore the dataset further, for example where these students came from, their program or major, or their age group, since there should be a wide range of ages due to early college and because people can attend college at whatever age they want. The campus question could also be extended to the Takoma Park/Silver Spring campus, but not by computing 25320 - 7307 - 16286, because students can attend more than one campus (the first row in the str() output attends both Germantown and Rockville); that count would have to come from Attending_Takoma_Park_SS directly. One thing that would make the data set better is if it had kept being updated instead of stopping in 2016, because enrollment changes over time and the answer to my question could be different.
References
https://data.montgomerycountymd.gov/Education/Montgomery-College-Enrollment-Data/wmr2-6hn6/about_data
https://dplyr.tidyverse.org/reference/bind_rows.html