Project 2

Author

Kenneth Nguyen

Introduction

Is there a significant difference in the number of students attending the Germantown campus compared to the Rockville campus? I am using the data set “Montgomery College Enrollment Data” from https://data.montgomerycountymd.gov. This data set describes students enrolled at Montgomery College and their status. The data is provided by Montgomery County, MD, and it was last updated on May 18, 2016. It has 25.3K rows and 18 columns, with variables such as Fall Term, Student Type, Student Status, and more. The variables I will focus on are Attending Germantown and Attending Rockville. I chose this question because I have been to both campuses, and I wanted to know how the number of students differs from one campus to the other.

Load the libraries

library(tidyverse)
library(dplyr)
library(ggplot2)

setwd("~/Data 101")

enrollment <- read.csv("Montgomery_College_Enrollment_Data.csv")

Data Analysis

To see whether there is a difference between students attending Germantown and students attending Rockville, I will use a difference in proportions test. First I need to check whether the data needs cleaning and whether any values need to be edited before I can use them. I will also need to combine some of the values in order to create a graph. I will use the two variables Attending Germantown and Attending Rockville for both the graph and the test. I will make a bar graph to show the difference between the campuses, and the p-value from the test will help me answer my question.

EDA

str(enrollment)

'data.frame':   25320 obs. of  18 variables:
 $ Fall.Term               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ Student.Type            : chr  "Continuing" "Continuing" "Continuing" "New" ...
 $ Student.Status          : chr  "Full-Time" "Part-Time" "Part-Time" "Full-Time" ...
 $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
 $ Ethnicity               : chr  "Not Hispanic" "Not Hispanic" "Not Hispanic" "Not Hispanic" ...
 $ Race                    : chr  "White" "White" "Black" "Asian" ...
 $ Attending.Germantown    : chr  "Yes" "No" "No" "No" ...
 $ Attending.Rockville     : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Attending.Takoma.Park.SS: chr  "No" "No" "No" "No" ...
 $ Attend.Day.or.Evening   : chr  "Day Only" "Evening Only" "Day & Evening" "Day Only" ...
 $ MC.Program.Description  : chr  "Health Sciences (Pre-Clinical Studies)" "Building Trades Technology (AA & AAS)" "Computer Gaming & Simulation (AA - All Tracks)" "Graphic Design (AA, AAS, & AFA - All Tracks)" ...
 $ Age.Group               : chr  "25 - 29" "21 - 24" "20 or Younger" "20 or Younger" ...
 $ HS.Category             : chr  "Foreign Country" "MCPS" "MCPS" "MCPS" ...
 $ MCPS.High.School        : chr  "" "Sherwood High School" "Quince Orchard Sr High School" "Thomas Sprigg Wootton High Sch" ...
 $ City.in.MD              : chr  "Bethesda" "Olney" "Gaithersburg" "North Potomac" ...
 $ State                   : chr  "MD" "MD" "MD" "MD" ...
 $ ZIP                     : int  20816 20832 20877 20878 20906 20876 20876 20903 20901 20851 ...
 $ County.in.MD            : chr  "Montgomery" "Montgomery" "Montgomery" "Montgomery" ...

This shows me the structure of the dataset: 25320 observations of 18 variables, along with each variable's type and first few values.

unique(enrollment$Attending.Germantown)

[1] "Yes" "No"

This shows me the values stored in the Attending.Germantown variable.

unique(enrollment$Attending.Rockville)

[1] "Yes" "No"

This is the same code as above, but for Attending.Rockville instead.

Using unique() lets me see every distinct value in Attending.Rockville and Attending.Germantown. Both variables contain only "Yes" and "No", so the values do not need to be cleaned.
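
unique() lists the distinct values but not how often each one occurs. As a small sketch, table() adds the counts. The vector below is made-up example data, not rows from the enrollment file.

```r
# Hypothetical example values (not taken from the real dataset)
attending <- c("Yes", "No", "No", "Yes", "No")

# unique() returns each distinct value once, in order of first appearance
distinct_values <- unique(attending)
print(distinct_values)

# table() counts how many times each value occurs
value_counts <- table(attending)
print(value_counts)
```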

names(enrollment) <- gsub("[\\$,\\.]","_", names(enrollment))

The gsub() call replaces any dollar signs, commas, or periods in the column names with underscores, so the names will not interfere with later code and graphs.
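
To illustrate what the pattern does, here is a minimal sketch that applies the same gsub() call to three of the column names shown in the str() output. The variable names example_names and cleaned_names are only for this illustration.

```r
# Three column names copied from the str() output above
example_names <- c("Fall.Term", "Attending.Takoma.Park.SS", "MC.Program.Description")

# Replace every "$", "," or "." with an underscore
cleaned_names <- gsub("[\\$,\\.]", "_", example_names)
print(cleaned_names)
```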

Summary of the dataset

summary(enrollment)

   Fall_Term    Student_Type       Student_Status        Gender         
 Min.   :2015   Length:25320       Length:25320       Length:25320      
 1st Qu.:2015   Class :character   Class :character   Class :character  
 Median :2015   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2015                                                           
 3rd Qu.:2015                                                           
 Max.   :2015                                                           
                                                                        
  Ethnicity             Race           Attending_Germantown Attending_Rockville
 Length:25320       Length:25320       Length:25320         Length:25320       
 Class :character   Class :character   Class :character     Class :character   
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   
                                                                               
                                                                               
                                                                               
                                                                               
 Attending_Takoma_Park_SS Attend_Day_or_Evening MC_Program_Description
 Length:25320             Length:25320          Length:25320          
 Class :character         Class :character      Class :character      
 Mode  :character         Mode  :character      Mode  :character      
                                                                      
                                                                      
                                                                      
                                                                      
  Age_Group         HS_Category        MCPS_High_School    City_in_MD       
 Length:25320       Length:25320       Length:25320       Length:25320      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    State                ZIP        County_in_MD      
 Length:25320       Min.   :  926   Length:25320      
 Class :character   1st Qu.:20852   Class :character  
 Mode  :character   Median :20877   Mode  :character  
                    Mean   :20892                     
                    3rd Qu.:20902                     
                    Max.   :95492                     
                    NA's   :99

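This gives me a summary of the dataset. For character variables it shows the length and type; for numeric variables such as ZIP it shows the minimum, quartiles, mean, maximum, and number of missing values (99 for ZIP).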

Germantown Campus

enrollment3 <- enrollment |>
  filter(Attending_Germantown != "No") |>
  select(Attending_Germantown) |>
  summarise(Count = n()) |>
  mutate(Campus = "Germantown")
head(enrollment3)

  Count     Campus
1  7307 Germantown

This code removes the "No" rows from Attending_Germantown so that only the "Yes" rows remain for the graph. I then count those rows with summarise() and use mutate() to add a Campus label so I can graph the result. head() shows the result of this code.
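
As a small sketch of how this filter-and-count pattern behaves, the example below runs it on a made-up five-student data frame (toy is hypothetical, not the real data) and compares it with a base R count of the "Yes" values.

```r
library(dplyr)

# Hypothetical five-student data frame (not from the real dataset)
toy <- data.frame(Attending_Germantown = c("Yes", "No", "No", "Yes", "No"))

# Same pattern as above: keep the "Yes" rows, count them, add a label
toy_count <- toy |>
  filter(Attending_Germantown != "No") |>
  summarise(Count = n()) |>
  mutate(Campus = "Germantown")
print(toy_count)

# Base R equivalent: count the rows where the value is "Yes"
base_count <- sum(toy$Attending_Germantown == "Yes")
print(base_count)
```

Both approaches should agree whenever the column holds only "Yes" and "No", which the unique() output suggests is the case here.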

Rockville Campus

enrollment2 <- enrollment |>
  filter(Attending_Rockville != "No") |>
  select(Attending_Rockville) |>
  summarise(Count = n()) |>
  mutate(Campus = "Rockville")
head(enrollment2)

  Count    Campus
1 16286 Rockville

This is the same code, but for Rockville instead.

Combine

campus_comparison <- bind_rows(enrollment2, enrollment3) # bind_rows() usage comes from the dplyr reference listed under References

I used bind_rows() instead of inner_join() because inner_join() didn’t work here; bind_rows() stacks the rows of the two tables into one table.
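
The sketch below shows one plausible reason inner_join() did not work. This is an assumption, since the original error or output is not shown. inner_join() keeps only rows whose key values match in both tables, and the two one-row tables share no values, while bind_rows() simply stacks them. The data frames are rebuilt by hand from the counts above.

```r
library(dplyr)

# One-row tables rebuilt from the counts shown above
rockville  <- data.frame(Count = 16286, Campus = "Rockville")
germantown <- data.frame(Count = 7307,  Campus = "Germantown")

# bind_rows() stacks the two tables on top of each other
stacked <- bind_rows(rockville, germantown)
print(stacked)

# inner_join() keeps only rows that match on the key columns;
# no row matches on both Count and Campus, so nothing is returned
joined <- inner_join(rockville, germantown, by = c("Count", "Campus"))
print(nrow(joined))
```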

print(campus_comparison)

  Count     Campus
1 16286  Rockville
2  7307 Germantown

This shows me that bind_rows() worked: both campuses are now in one table.

Graph

ggplot(campus_comparison, aes(x = Campus, y = Count, fill = Campus)) +
  geom_col() +
  labs(title = "Student Attendance: Germantown vs. Rockville",
       x = "Campus Location",
       y = "Number of Students") +
  theme_minimal()

The graph above shows the difference between the number of students attending the Rockville campus and the Germantown campus.

Statistical Analysis

Null Hypothesis (H₀): p₁ = p₂ (The proportion of students attending the Rockville campus is the same as the proportion attending the Germantown campus)

Alternative Hypothesis (H₁): p₁ ≠ p₂ (The proportion of students attending the Rockville campus is not the same as the proportion attending the Germantown campus)

p₁ = proportion of students attending the Rockville campus

p₂ = proportion of students attending the Germantown campus

total_students <- dim(enrollment)

print(total_students)

[1] 25320    18

dim() returns the number of rows and columns. The first value, 25320, is the total number of students in the dataset, which I need for the test below.

prop_results <- prop.test(c(16286, 7307), c(25320, 25320))

print(prop_results)


    2-sample test for equality of proportions with continuity correction

data:  c(16286, 7307) out of c(25320, 25320)
X-squared = 6396.6, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.3464594 0.3627823
sample estimates:
   prop 1    prop 2 
0.6432070 0.2885861
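
The sample estimates in this output can be reproduced by hand from the counts. This is a minimal sketch of that arithmetic; the variable names are only for this illustration.

```r
# Counts taken from the tables above
rockville_count  <- 16286
germantown_count <- 7307
total_students   <- 25320

# Each proportion is the campus count divided by the total number of students
p_rockville  <- rockville_count / total_students
p_germantown <- germantown_count / total_students

# The difference is what the confidence interval estimates
p_difference <- p_rockville - p_germantown

print(round(c(p_rockville, p_germantown, p_difference), 7))
```

The difference of about 0.355 falls inside the reported 95 percent confidence interval.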

Interpretation: The test output gives everything needed to answer my question: the X-squared statistic (6396.6), the p-value (less than 2.2e-16), and the 95% confidence interval for the difference in proportions (about 0.346 to 0.363). The sample proportions are about 64.3% for Rockville and 28.9% for Germantown. The very low p-value is what answers my question.
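
One caveat: prop.test() treats the two groups as independent samples, but here both counts come from the same 25320 students. As a hedged sketch of an alternative (not the method used in this project), McNemar's test compares two yes/no answers recorded on the same people. The 2x2 counts below are made up for illustration and are not computed from the enrollment data.

```r
# Hypothetical 2x2 table of paired answers for 100 students
# (made-up counts, not computed from the enrollment data)
paired_counts <- matrix(
  c(10, 15,   # Germantown = Yes: Rockville Yes, Rockville No
    55, 20),  # Germantown = No:  Rockville Yes, Rockville No
  nrow = 2, byrow = TRUE,
  dimnames = list(Germantown = c("Yes", "No"), Rockville = c("Yes", "No"))
)

# McNemar's test compares the two off-diagonal cells,
# i.e. students who attend only one of the two campuses
paired_result <- mcnemar.test(paired_counts)
print(paired_result)
```

With the real data, the table could be built with table(enrollment$Attending_Germantown, enrollment$Attending_Rockville).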

Conclusion

The test above gave a p-value less than 2.2e-16, which is far smaller than .05. I reject the null hypothesis in favor of the alternative: the proportion of students attending Rockville is not the same as the proportion attending Germantown. The graph supports the test, since it shows a clear gap between the two campuses: Rockville has 16286 students, while Germantown has 7307. I think my answer is accurate because the p-value is so small.

In the future I want to explore the dataset further, for example by finding out where these students came from, which programs or majors they are in, or how they are spread across age groups, since there should be a wide range of ages due to early college and because people can attend college at whatever age they want. I could also look at the Takoma Park/Silver Spring campus, but I would have to count Attending_Takoma_Park_SS directly instead of computing 25320 - 7307 - 16286, because a student can attend more than one campus (the first student in the data has "Yes" for both Germantown and Rockville). One thing that would make the dataset better is if it were kept up to date instead of stopping in 2016, because the number of students attending each campus changes over time and the answer to my question could be different.

References

https://data.montgomerycountymd.gov/Education/Montgomery-College-Enrollment-Data/wmr2-6hn6/about_data

https://dplyr.tidyverse.org/reference/bind_rows.html