Which variables have a significant association with an individual’s category of major at Montgomery College? In this project I will use the 2023 Montgomery College enrollment data, provided by Montgomery County, to answer this question. The data has 25,320 observations on 18 variables, including gender, program of study, race, ethnicity, age group, and high school information. I chose this topic because, as a student at Montgomery College, I am interested in examining trends or associations between program of study and other aspects of a student’s life. If there are significant associations, for example if Humanities & Arts has primarily female students, then the college should be aware of them so that these programs can cater more to the people who are likely to be interested in them.
Data Analysis
In order to clean the data, I will perform a few steps. First, I will change all the variable names to be lowercase and replace the spaces with underscores. Then I will categorize each of the 95 majors into sensible groups that will make the chi-squared test easier.
Loading the Libraries and Data
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 25320 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Student Type, Student Status, Gender, Ethnicity, Race, Attending G...
dbl (2): Fall Term, ZIP
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Variable Alterations
First, I need to make some basic alterations to the variable names to make coding easier.
colnames(college) <- tolower(colnames(college)) # makes names lowercase
colnames(college) <- gsub(" ", "_", colnames(college)) # replaces space with underscore
head(college)
# A tibble: 6 × 18
fall_term student_type student_status gender ethnicity race
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2015 Continuing Full-Time Female Not Hispanic White
2 2015 Continuing Part-Time Male Not Hispanic White
3 2015 Continuing Part-Time Male Not Hispanic Black
4 2015 New Full-Time Male Not Hispanic Asian
5 2015 New Full-Time Female Hispanic White
6 2015 Continuing Full-Time Female Hispanic Hispanic
# ℹ 12 more variables: attending_germantown <chr>, attending_rockville <chr>,
# `attending_takoma_park/ss` <chr>, attend_day_or_evening <chr>,
# mc_program_description <chr>, age_group <chr>, hs_category <chr>,
# mcps_high_school <chr>, city_in_md <chr>, state <chr>, zip <dbl>,
# county_in_md <chr>
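The renaming step can be tried on a small stand-in data frame. This is a minimal sketch in base R; the `toy` data frame and its three columns are made up for illustration and only mimic the style of the raw column names.

```r
# Toy data frame whose names contain capitals and spaces (illustrative only)
toy <- data.frame(
  "Fall Term" = 2015,
  "Student Type" = "New",
  "MC Program Description" = "Science (AS - All Tracks)",
  check.names = FALSE  # keep the spaces instead of converting them to dots
)

colnames(toy) <- tolower(colnames(toy))         # make names lowercase
colnames(toy) <- gsub(" ", "_", colnames(toy))  # replace spaces with underscores

colnames(toy)
# "fall_term" "student_type" "mc_program_description"
```

Note that `gsub()` only replaces spaces, which is why a name such as `attending_takoma_park/ss` still needs backticks in the output above.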
Now I am going to explore all the different values held in mc_program_description. If there are too many, I may need to create a new variable that categorizes them, or select only certain values.
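One way to do this kind of exploration is sketched below in base R. The `program` vector is a made-up stand-in for `college$mc_program_description`, so the counts it produces are illustrative and are not results from the enrollment data.

```r
# Stand-in for college$mc_program_description (values are illustrative only)
program <- c("Cybersecurity (AAS)", "Nursing (AA & AAS)",
             "Cybersecurity (AAS)", "Studio Art (AFA)")

# Number of distinct programs
n_programs <- length(unique(program))

# Number of students per program, largest first
program_counts <- sort(table(program), decreasing = TRUE)

n_programs
program_counts
```

Sorting the counts makes it easier to see which programs are large enough to analyze on their own and which need to be grouped.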
Okay, there are 95 different programs of study. I am going to categorize them to make a chi-squared test easier. AI was used to help generate and sort the 95 majors into manageable categories, but the code was written by me.
college1 <- college |>
  mutate(major_category = case_when(
    # All Health and Wellness majors
    mc_program_description %in% c(
      "Health Sciences (Pre-Clinical Studies)", "Nursing (AA & AAS)",
      "Diagnostic Medical Sonography (AA & AAS)", "Diagnostic Medical Sonography (CT)",
      "Polysomnography Technology (CT)", "Radiologic (X-Ray) Technology (AA & AAS)",
      "Physical Therapist Assistant (AAS)", "Surgical Technologist (AAS)",
      "Health Information Management (AA & AAS)", "Medical Coder/Abstractr/Biller (CT)",
      "Exercise Sci - Personal Trainer (LR)", "Exercise Sci - Personal Trainer (CT)"
    ) ~ "Health & Wellness",
    # All Computer Science & IT majors
    mc_program_description %in% c(
      "Computer Science & Technologies (AA - All Tracks)",
      "Computer Science - Computer Programming (CT)",
      "Cybersecurity (AAS)", "Cybersecurity (CT)", "Information Systems Secirity",
      "Computer Applications (AA & AAS)", "Computer Applications (CT)",
      "Microcomputer Technician (AA & AAS)", "Microcomputer Technician (CT)",
      "Network & Wireless Technologies (CT)", "Network Engineer/Administration (CT)",
      "Digital Media & Web Technology (AAS)", "Digital Media & Web Technology (CT)",
      "Computer Gaming & Simulation (AA - All Tracks)"
    ) ~ "Computer Science & IT",
    # All Engineering and Construction majors
    mc_program_description %in% c(
      "Engineering Science (AA & AS - All Tracks)",
      "Eng Technologies (AA & AAS - Discontinued)",
      "Electromechanical Sys Eng Tech (AA & AAS - Discnt)",
      "Architectural & Construction Tech (AA & AAS)",
      "Architect. & Construct. Tech - Sustainability (LR)",
      "Building Trades Technology (AA & AAS)", "Building Trades Technology (CT)",
      "Building Trades Technology (LR)", "Management of Construction (CT)",
      "Automotive Technology (AA & AAS)", "Automotive Technology (CT)",
      "Landscape Technology (AA & AAS)", "Landscape Technology (CT)"
    ) ~ "Engineering & Construction",
    # Arts, Design, and Communication
    mc_program_description %in% c(
      "School of Art & Design - Applicants", "Studio Art (AFA)",
      "Studio Art (AFA) - School of Art & Design", "Specialized Art Transfer (CT)",
      "Graphic Design (AA, AAS, & AFA - All Tracks)",
      "Graphic Design (AFA) - School of Art & Design",
      "Computer Graphics / Graphic Design (AAS)", "Computer Graphics / Graphic Design (CT)",
      "Photography (AA & AAS)", "Photography (CT)",
      "Interior Design - PreProfessional (AAS)", "Interior Design (CT)",
      "Communication Studies (AA)", "Commun & Broadcasting Tech (AA & AAS - All Tracks)",
      "Commun & Broadcasting Tech (CT)", "Music Transfer (CT)", "Technical Writing (CT)"
    ) ~ "Arts, Design & Communication",
    # Buisness and Hospitality
    mc_program_description %in% c(
      "Business / International Business (AA)", "Accounting (AA & AAS)", "Accounting (CT)",
      "Management (AA & AAS - All Tracks)", "Management (CT)", "Management (LR)",
      "Hospitality Management (AA & AAS)", "Hospitality Management (CT)",
      "Hospitality Management (LR)", "Administrative Support Tech (CT)",
      "Printing Management (AA & AAS)", "Printing Management (CT)"
    ) ~ "Buisness & Hospitality",
    # Education and Social Sciences
    mc_program_description %in% c(
      "Education / Teacher Education (AA & AAT)", "Early Childhood Education (AA & AAS)",
      "Early Childhood Education (CT)", "Early Childhood Education (LR)",
      "American Sign Language (AA & AAS)", "American Sign Language (CT)",
      "Applied Geography (AA & AAS)", "Cartography & Geographic Ed / Info Sys (CT)",
      "Women's Studies (CT)", "Ethnic Studies (CT)", "Ethnic Social Studies (LR)"
    ) ~ "Education & Social Sciences",
    # Public Safety, Law, and Service
    mc_program_description %in% c(
      "Criminal Justice (AA & AAS)", "Paralegal Studies (AA & AAS)",
      "Paralegal Studies (CT)", "Paralegal Studies - Legal Analysis (LR)",
      "Fire Sci./Preven., Emerg. Prepare. (AA, AS & AAS)",
      "Fire Sci./Preven., Emergency Prepare. (CT)", "Fire Science (LR)",
      "Mental Health Associate (AA & AAS)", "Recreation Leadership (AA)"
    ) ~ "Public Safety, Law, & Social Sciences",
    # Science
    mc_program_description %in% c(
      "Science (AS - All Tracks)", "Biotechnology (AA & AAS)", "Biotechnology (CT)"
    ) ~ "Science",
    # General Studies and Undecided
    mc_program_description %in% c(
      "General Studies (AA - All Tracks)", "Arts & Sciences Transfer (AA - All Tracks)",
      "Arts & Sciences Transfer (CT)", "Credit (Undeclared / Undecided)",
      "WIA (CE) Programs"
    ) ~ "General Studies and Undecided"
  ))
unique(college1$major_category)
First, I want to create a filled bar graph that shows the proportions of males and females in each category. Before that, for the purposes of this project, I am going to filter out students whose gender is “unknown” to make the statistical tests and graphs easier. I am also going to keep only the races Black, White, and Asian, because those are the three largest race categories and there are not enough observations of other races for the chi-squared approximation to be reliable. Finally, in the age_group variable I will remove the single unknown value.
[1] "25 - 29" "21 - 24" "20 or Younger" "30 or Older"
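The filtering described above could look roughly like the sketch below. It uses base R `subset()` on a made-up `toy` data frame, and the exact labels (`"Unknown"`, `"Native American"`) are assumptions for illustration; the real data may spell these values differently. The same conditions could be passed to `dplyr::filter()`.

```r
# Toy data with the three variables being filtered (all values illustrative)
toy <- data.frame(
  gender    = c("Female", "Male", "Unknown", "Female", "Male"),
  race      = c("White", "Black", "Asian", "Native American", "Asian"),
  age_group = c("21 - 24", "Unknown", "25 - 29", "30 or Older", "20 or Younger"),
  stringsAsFactors = FALSE
)

# Keep rows with a known gender, one of the three races, and a known age group
toy2 <- subset(
  toy,
  gender != "Unknown" &
    race %in% c("White", "Black", "Asian") &
    age_group != "Unknown"
)

nrow(toy2)                # 2 rows remain
unique(toy2$age_group)    # no "Unknown" left
```

Checking `unique()` on each filtered variable afterwards is a quick way to confirm that the unwanted values are gone.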
Now that all the majors have been sorted into categories and the data has been filtered, we can begin examining the data. First, I want to see a filled bar graph that shows the gender proportions in each major category.
gender_major_plot <- college2 |>
  ggplot(aes(y = major_category, fill = gender)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "GnBu") +
  labs(title = "Gender Proportions in Each Major Category", y = "Category of Major") +
  theme_bw(base_family = "serif")
gender_major_plot
Okay, it looks like there are four categories with a large gap in gender proportions. In both the Health & Wellness and the Education & Social Sciences categories, there are noticeably more female students. In Computer Science & IT, as well as Engineering & Construction, male students are more dominant.
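The proportions that a filled bar graph displays can also be computed as numbers. The sketch below uses `prop.table()` with `margin = 1` so that each major category (row) sums to 1; the `toy` data frame is invented for illustration and its proportions are not results from the enrollment data.

```r
# Toy data: seven students in two major categories (illustrative only)
toy <- data.frame(
  major_category = c(rep("Health & Wellness", 3), rep("Computer Science & IT", 4)),
  gender         = c("Female", "Female", "Male", "Male", "Male", "Male", "Female")
)

# Counts of each gender within each major category
counts <- table(toy$major_category, toy$gender)

# Row proportions: the share of each gender within a major category
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Reading the row proportions next to the plot makes it easier to say how large a gap actually is.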
Now I want to examine proportions of race in each major category.
college2 |>
  ggplot(aes(y = major_category, fill = race)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "PuBuGn") +
  labs(title = "Proportions of 3 Most Common Races in Each Major", y = "Category of Major") +
  theme_bw(base_family = "serif")
I do not see any clear patterns. The distributions seem fairly proportionate to the overall share of each race at the college. White students make up their largest share in Education & Social Sciences, with General Studies and Undecided not far behind. Black students make up their largest share in Health & Wellness, and Asian students have a higher proportion in Computer Science & IT than in any other category. The chi-squared test will reveal whether any of these differences reflect a real association.
Finally, I want to examine the age group and whether or not that has a correlation with major category.
college2 |>
  ggplot(aes(y = major_category, fill = age_group)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Age Group Proportions in Each Major", y = "Category of Major") +
  theme_bw(base_family = "serif")
Again, I do not see any clear patterns. There is clearly a larger population of students who are 24 or younger, but that is to be expected at a college. Age doesn’t seem to play a large role in the category of major.
Chi-Squared Test for Association: Gender
I will now conduct a chi-squared test for association between gender and major category to determine whether or not the two variables are associated.
Null and Alternative Hypothesis
\(H_0\) : Major category is not associated with gender
\(H_a\) : Major category is associated with gender
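For reference, the tests in this report rely on the standard Pearson chi-squared definitions. The symbols are notation introduced here, not taken from the data: \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(R_i\) and \(C_j\) are the row and column totals, \(n\) is the grand total, and \(r\) and \(c\) are the numbers of rows and columns.

\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
df = (r - 1)(c - 1)
\]

With 9 major categories and 2 genders, \(df = (9 - 1)(2 - 1) = 8\).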
Bar Graph for Variables
gender_major_plot <- college2 |>
  ggplot(aes(y = major_category, fill = gender)) +
  geom_bar(position = "stack") +
  scale_fill_brewer(palette = "GnBu") +
  labs(title = "Gender Proportions in Each Major Category", y = "Category of Major") +
  theme_bw(base_family = "serif")
gender_major_plot
Most of the students are in the General Studies and Undecided category, but there are some categories where there is a clear difference between male and female counts (as mentioned above).
Now I will display a table with the counts of each gender in each major category.
Expected counts under the null hypothesis:
                                         Female      Male
Arts, Design & Communication           510.8133  464.1867
Buisness & Hospitality                1297.7276 1179.2724
Computer Science & IT                  914.7487  831.2513
Education & Social Sciences            517.1002  469.8998
Engineering & Construction             941.9920  856.0080
General Studies and Undecided         4316.5030 3922.4970
Health & Wellness                     1669.1805 1516.8195
Public Safety, Law, & Social Sciences  395.5528  359.4472
Science                                742.3819  674.6181
Reject the null. Because our p-value is far below our threshold of 0.05, there is extremely strong evidence of an association between major category and gender. Though this does not imply that gender causes a student to choose a certain major, it does show an association between the two variables. All of our expected counts were above 5, so the conditions for the chi-squared test are met.
Chi-Squared Test for Association: Race
Now I will conduct a test for association between race and category of major.
Null and Alternative Hypothesis
\(H_0\) : Major category is not associated with race
\(H_a\) : Major category is associated with race
Bar Graph for Variables
college2 |>
  filter(race %in% c("White", "Black", "Asian")) |>
  ggplot(aes(y = major_category, fill = race)) +
  # Add bar layer of proportions
  geom_bar(position = "stack") +
  scale_fill_brewer(palette = "PuBuGn") +
  labs(title = "Proportions of 3 Most Common Races in Each Major", y = "Category of Major") +
  theme_bw(base_family = "serif")
# Making a table to show all the values
table_race <- table(college2$major_category, college2$race)
table_race
                                      Asian Black White
Arts, Design & Communication            114   397   464
Buisness & Hospitality                  486   968  1023
Computer Science & IT                   471   668   607
Education & Social Sciences             135   288   564
Engineering & Construction              321   687   790
General Studies and Undecided          1226  2666  4347
Health & Wellness                       439  1705  1042
Public Safety, Law, & Social Sciences    74   298   383
Science                                 271   538   608
Test & Results
# Performing the chi squared test
chi_race <- chisq.test(table_race)
chi_race
                                          Asian     Black     White
Arts, Design & Communication           159.8042  371.1596  444.0361
Buisness & Hospitality                 405.9847  942.9358 1128.0795
Computer Science & IT                  286.1725  664.6613  795.1663
Education & Social Sciences            161.7710  375.7278  449.5012
Engineering & Construction             294.6954  684.4564  818.8482
General Studies and Undecided         1350.3866 3136.3941 3752.2193
Health & Wellness                      522.1910 1212.8355 1450.9735
Public Safety, Law, & Social Sciences  123.7458  287.4108  343.8434
Science                                232.2488  539.4187  645.3325
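As a worked check of one expected count, take the Asian cell of Arts, Design & Communication. The totals below are computed from the observed race table: the row total is 114 + 397 + 464 = 975, the Asian column total is 3537, and the grand total is 21580.

\[
E = \frac{975 \times 3537}{21580} \approx 159.80, \qquad
\frac{(114 - 159.80)^2}{159.80} \approx 13.13
\]

The first value matches the expected count printed for that cell, and the second is that cell's contribution to the chi-squared statistic.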
# Chi-squared value
chi_race$statistic
X-squared
801.4298
Reject the null. There is extremely strong evidence of an association between major category and race. All of the expected counts were above 5, so the conditions for the chi-squared test are met.
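The race test can be rerun without the original data file by typing in the observed counts from the table printed earlier in this section. This is a sketch in base R; the matrix is rebuilt by hand, so any transcription error would change the result.

```r
# Observed counts copied from the printed race table
majors <- c("Arts, Design & Communication", "Buisness & Hospitality",
            "Computer Science & IT", "Education & Social Sciences",
            "Engineering & Construction", "General Studies and Undecided",
            "Health & Wellness", "Public Safety, Law, & Social Sciences",
            "Science")
table_race <- matrix(
  c( 114,  397,  464,
     486,  968, 1023,
     471,  668,  607,
     135,  288,  564,
     321,  687,  790,
    1226, 2666, 4347,
     439, 1705, 1042,
      74,  298,  383,
     271,  538,  608),
  ncol = 3, byrow = TRUE,
  dimnames = list(majors, c("Asian", "Black", "White"))
)

chi_race <- chisq.test(table_race)

chi_race$statistic      # X-squared; the report prints 801.4298
chi_race$parameter      # degrees of freedom: (9 - 1) * (3 - 1) = 16
chi_race$p.value        # compare against the 0.05 threshold
min(chi_race$expected)  # smallest expected count, should be above 5
```

Pulling out `statistic`, `parameter`, `p.value`, and `expected` separately keeps each number that the conclusion depends on visible.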
Chi-Squared Test for Association: Age Group
Now I will conduct a test for association between age group and category of major.
Null and Alternative Hypothesis
\(H_0\) : Major category is not associated with age group
\(H_a\) : Major category is associated with age group
Bar Graph for Variables
college2 |>
  ggplot(aes(y = major_category, fill = age_group)) +
  geom_bar(position = "stack") +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Age Group Proportions in Each Major", y = "Category of Major") +
  theme_bw(base_family = "serif")
From examining the bar graph, there may not be enough observations in the 30 or Older category, meaning we may have expected counts below 5. This could lead to inaccuracies in our test results.
Now I will make a table to show what the graph represents: the counts in each of these categories.
# Making a table to show all the values
table_age <- table(college2$major_category, college2$age_group)
table_age
                                      20 or Younger 21 - 24 25 - 29 30 or Older
Arts, Design & Communication                    439     254     107         175
Buisness & Hospitality                         1158     581     287         451
Computer Science & IT                           809     405     243         289
Education & Social Sciences                     442     223     101         221
Engineering & Construction                      831     443     215         309
General Studies and Undecided                  3441    2137     990        1671
Health & Wellness                               842     630     545        1169
Public Safety, Law, & Social Sciences           376     165      73         141
Science                                         724     341     182         170
Test & Results
# Performing the chi squared test
chi_age <- chisq.test(table_age)
chi_age
Now I will check the expected counts to validate the findings
# Checking the Expected counts
chi_age$expected
                                      20 or Younger   21 - 24    25 - 29 30 or Older
Arts, Design & Communication               409.4277  233.9910  123.93072    207.6506
Buisness & Hospitality                    1040.1563  594.4570  314.84759    527.5390
Computer Science & IT                      733.1905  419.0238  221.93133    371.8543
Education & Social Sciences                414.4668  236.8709  125.45602    210.2063
Engineering & Construction                 755.0267  431.5033  228.54096    382.9290
General Studies and Undecided             3459.7691 1977.2836 1047.24639   1754.7008
Health & Wellness                         1337.8838  764.6105  404.96747    678.5383
Public Safety, Law, & Social Sciences      317.0440  181.1930   95.96687    160.7961
Science                                    595.0349  340.0669  180.11265    301.7855
All expected counts are above 5.
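The expected-count check can also be done by hand from the observed age table, which shows where the numbers come from. The sketch below retypes the observed counts printed earlier in this section and computes each expected count as (row total × column total) / grand total using `outer()`.

```r
# Observed counts copied from the printed age-group table
majors <- c("Arts, Design & Communication", "Buisness & Hospitality",
            "Computer Science & IT", "Education & Social Sciences",
            "Engineering & Construction", "General Studies and Undecided",
            "Health & Wellness", "Public Safety, Law, & Social Sciences",
            "Science")
ages <- c("20 or Younger", "21 - 24", "25 - 29", "30 or Older")
table_age <- matrix(
  c( 439,  254, 107,  175,
    1158,  581, 287,  451,
     809,  405, 243,  289,
     442,  223, 101,  221,
     831,  443, 215,  309,
    3441, 2137, 990, 1671,
     842,  630, 545, 1169,
     376,  165,  73,  141,
     724,  341, 182,  170),
  ncol = 4, byrow = TRUE,
  dimnames = list(majors, ages)
)

# Expected count for each cell: row total * column total / grand total
expected_manual <- outer(rowSums(table_age), colSums(table_age)) / sum(table_age)

min(expected_manual)       # smallest expected count in the table
all(expected_manual > 5)   # TRUE when the rule of thumb is satisfied

# The manual values should agree with what chisq.test() reports
all.equal(expected_manual, chisq.test(table_age)$expected,
          check.attributes = FALSE)
```

The smallest expected count sits in the 25 - 29 column of Public Safety, Law, & Social Sciences, which is still far above 5.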
Reject the null. There is extremely strong evidence to suggest an association between age_group and major category.
Conclusion
Age group, race, and gender all were significantly associated with an individual’s major category. Each p-value was far below the threshold of 0.05 (<2.2e-16). Because of these associations, further research should be conducted to determine which age groups, genders, and races are most prominent in each major category. Then, the categories can better cater to the needs and interests of their students, as well as provide opportunities for conferences or meetings that might interest them more.
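One possible starting point for that follow-up is to look at standardized residuals, which `chisq.test()` returns as `stdres`. The sketch below is illustrative only: it uses just two rows of the observed race table, so the expected counts and residuals differ from those of the full nine-category test.

```r
# Two rows copied from the observed race table (a subset, for illustration)
sub_race <- matrix(
  c(471,  668,  607,
    439, 1705, 1042),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Computer Science & IT", "Health & Wellness"),
                  c("Asian", "Black", "White"))
)

# Standardized residuals: positive means more students than expected under
# independence, negative means fewer
res <- chisq.test(sub_race)$stdres
round(res, 1)
```

Cells with residuals well beyond about ±2 are the ones contributing most to the association, so they indicate which groups are over- or under-represented in a category.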