Project 1

Author

M Loukinov

DATASET

This dataset contains college application information from 8 MCPS high schools: Albert Einstein, Bethesda-Chevy Chase, Montgomery Blair, Richard Montgomery, Walt Whitman, Walter Johnson, Winston Churchill, and Thomas S. Wootton. The information was self-reported by students at these schools and collected by MCPS. The data show how many students from each school applied to, were admitted to, and enrolled in various colleges across the nation. To be included in this dataset, a college or university had to receive at least 6 applications in total across the 8 schools.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(zoo)


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(ggplot2)
library(treemap)

setwd("C:/Users/Thecr/OneDrive/Desktop/Data 110 Notes")
MocoUni = read_csv("MocoUni.csv")

New names:
Rows: 396 Columns: 28
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(28): ...1, ...2, ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`

#Changed/Fixed column names
colnames(MocoUni) <- as.character(MocoUni[1, ])  # Used ChatGPT/Google; the column names were in row 1
MocoUni <- MocoUni[-1, ]
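
The header fix above can be tried on a small made-up table. This is a minimal sketch, assuming a data frame whose real column names sit in its first row; the tibble `raw` and its values are invented for illustration and do not come from MocoUni.csv. `type_convert()` from readr is shown as one possible way to re-guess the column types afterwards, as an alternative to converting each column by hand.

```r
library(tibble)
library(readr)

# Hypothetical stand-in for a file whose header row was read in as data
raw <- tibble(
  V1 = c("Sort", "College A", "College B"),
  V2 = c("Total Applied", "10", "25"),
  V3 = c("Total Admitted", "4", "20")
)

# Promote the first row to column names, then drop that row
fixed <- raw
colnames(fixed) <- as.character(unlist(fixed[1, ]))
fixed <- fixed[-1, ]

# Same name cleaning as in the document: lower case, underscores for spaces
names(fixed) <- gsub(" ", "_", tolower(names(fixed)))

# Re-guess column types now that the header text is no longer in the data
fixed <- suppressMessages(type_convert(fixed))
fixed
```

After this step the count columns are numeric, so they can be used in arithmetic without a separate `as.numeric()` pass.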

#Saidi Cleaning
names(MocoUni) <- tolower(names(MocoUni))
names(MocoUni) <- gsub(" ","_",names(MocoUni))

#Long version (all value columns)
MocoUniLong <- MocoUni |>
  pivot_longer(
    cols = 2:28,              #Columns to pivot
    names_to = "names",       #Former column names go into this column
    values_to = "values")     #Cell values go into this column

#Long version (columns 2 to 25 only)
MocoUniLong1 <- MocoUni |>
  pivot_longer(
    cols = 2:25,              #Columns to pivot
    names_to = "names",       #Former column names go into this column
    values_to = "values")     #Cell values go into this column

#Change to numeric
MocoUni1 <- MocoUni |>
  mutate(across(c(2:28), ~ as.numeric(as.character(.))))

MocoUni2 <- MocoUni1 |>
  mutate(percent_accepted = total_admitted / total_applied)
MocoUni2 <- MocoUni2 |>
  mutate(across(percent_accepted, \(x) round(x, 4)))   #Anonymous function avoids the across(...) deprecation warning in dplyr 1.1.0+

MocoUni2$percent_accepted = MocoUni2$percent_accepted * 100

#Keep colleges with at least 50 applications
HighApplied <- MocoUni2 |>
  filter(total_applied >= 50)

#Keep colleges with at most 500 applications (drops the high-volume outliers)
OutlierApplied <- MocoUni2 |>
  filter(total_applied <= 500)

#Percent accepted total
MocoUniTop <- MocoUni2 |>                 #Creates a top 20 list
  arrange(desc(percent_accepted)) |>      #desc() sorts in descending order, so the top 20 come first
  mutate(column = "top") |>               #Adds a column labelling these rows as top 20
  head(20)                                #Keeps the first 20 rows
MocoUniTop

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Montgom…                    176                    176                    129
 2 Univers…                      0                      0                      0
 3 Arcadia…                      2                      2                      0
 4 Brigham…                      0                      0                      0
 5 Bryant …                      1                      1                      0
 6 Butler …                      1                      1                      0
 7 Morehou…                      2                      2                      0
 8 Saint J…                      2                      2                      0
 9 Berklee…                      1                      1                      0
10 Fairlei…                      0                      0                      0
11 Embry–R…                      2                      2                      2
12 Ringlin…                      2                      2                      0
13 San Die…                      1                      1                      1
14 Duquesn…                      3                      3                      2
15 Marquet…                      1                      1                      0
16 Univers…                      4                      4                      1
17 Lewis &…                      1                      1                      1
18 Seton H…                      3                      3                      0
19 Bingham…                      2                      2                      1
20 Hobart …                      1                      1                      0
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …

MocoUniBottom <- MocoUni2 |>              #Creates a bottom 20 list
  arrange(percent_accepted) |>            #Sorts in ascending order, so the bottom 20 come first
  mutate(column = "bottom") |>            #Adds a column labelling these rows as bottom 20
  head(20)                                #Keeps the first 20 rows
MocoUniBottom

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 The Jui…                      0                      0                      0
 2 Univers…                      0                      0                      0
 3 Univers…                      4                      1                      1
 4 Harvard…                      3                      0                      0
 5 Dartmou…                      2                      0                      0
 6 Univers…                      2                      0                      0
 7 Stanfor…                      1                      0                      0
 8 Brown U…                     15                      2                      0
 9 Califor…                      0                      0                      0
10 Claremo…                      0                      0                      0
11 Columbi…                      5                      1                      0
12 Harvey …                      0                      0                      0
13 Princet…                      3                      1                      0
14 Univers…                      7                      0                      0
15 Massach…                      0                      0                      0
16 Barnard…                      2                      0                      0
17 Yale Un…                      3                      0                      0
18 Univers…                      2                      0                      0
19 William…                      2                      0                      0
20 Pomona …                      0                      0                      0
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …
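
As an alternative sketch to `arrange()` followed by `head()`, dplyr's `slice_max()` and `slice_min()` select the highest or lowest rows directly. The tibble `toy` below is invented for illustration and is not taken from the dataset; `with_ties = FALSE` is set so that exactly `n` rows come back even when several colleges share the same rate.

```r
library(dplyr)
library(tibble)

# Invented acceptance rates for illustration only
toy <- tibble(
  sort = c("College A", "College B", "College C", "College D"),
  percent_accepted = c(100, 12.5, 64, 3.2)
)

# Two highest acceptance rates, highest first
top2 <- toy |>
  slice_max(percent_accepted, n = 2, with_ties = FALSE) |>
  mutate(column = "top")

# Two lowest acceptance rates, lowest first
bottom2 <- toy |>
  slice_min(percent_accepted, n = 2, with_ties = FALSE) |>
  mutate(column = "bottom")

top2
bottom2
```

With the default `with_ties = TRUE`, a tie at the cutoff would return more than `n` rows, which matters here because many colleges in the top list sit at the same acceptance rate.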

#Percent accepted with a minimum of 50 applications
PacceptedHT <- HighApplied |>             #Creates a top 20 list
  arrange(desc(percent_accepted)) |>      #desc() sorts in descending order, so the top 20 come first
  mutate(column = "top") |>               #Adds a column labelling these rows as top 20
  head(20)                                #Keeps the first 20 rows
PacceptedHT

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Montgom…                    176                    176                    129
 2 Miami U…                      3                      3                      0
 3 Indiana…                      4                      4                      2
 4 Marylan…                     15                     14                      7
 5 Univers…                      0                      0                      0
 6 Oberlin…                      5                      3                      2
 7 Univers…                     10                      8                      2
 8 Temple …                     14                     13                      1
 9 Univers…                      2                      1                      0
10 Univers…                      2                      1                      0
11 Univers…                      4                      3                      1
12 Univers…                      6                      5                      0
13 James M…                     14                     11                      0
14 Ithaca …                      9                      8                      2
15 Univers…                     26                     21                      2
16 McDanie…                     16                     12                      1
17 Towson …                    108                     92                     18
18 Pennsyl…                     28                     25                      3
19 The Ohi…                      0                      0                      0
20 Univers…                     79                     59                     11
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …

PacceptedHB <- HighApplied |>             #Creates a bottom 20 list
  arrange(percent_accepted) |>            #Sorts in ascending order, so the bottom 20 come first
  mutate(column = "bottom") |>            #Adds a column labelling these rows as bottom 20
  head(20)                                #Keeps the first 20 rows
PacceptedHB

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Univers…                      4                      1                      1
 2 Harvard…                      3                      0                      0
 3 Dartmou…                      2                      0                      0
 4 Univers…                      2                      0                      0
 5 Stanfor…                      1                      0                      0
 6 Brown U…                     15                      2                      0
 7 Califor…                      0                      0                      0
 8 Columbi…                      5                      1                      0
 9 Princet…                      3                      1                      0
10 Univers…                      7                      0                      0
11 Massach…                      0                      0                      0
12 Barnard…                      2                      0                      0
13 Yale Un…                      3                      0                      0
14 Univers…                      2                      0                      0
15 William…                      2                      0                      0
16 Johns H…                      5                      1                      1
17 Northwe…                      0                      0                      0
18 Vanderb…                      3                      1                      1
19 Washing…                      1                      0                      0
20 Univers…                      9                      1                      0
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …

MostApplied <- MocoUni2 |>                #Creates a top 20 list
  arrange(desc(total_applied)) |>         #desc() sorts in descending order, so the top 20 come first
  mutate(column = "top") |>               #Adds a column labelling these rows as top 20
  head(20)                                #Keeps the first 20 rows
MostApplied

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Univers…                    154                     77                     36
 2 Montgom…                    176                    176                    129
 3 Univers…                     79                     59                     11
 4 Pennsyl…                     28                     25                      3
 5 Towson …                    108                     92                     18
 6 Univers…                     26                     21                      2
 7 Univers…                      8                      3                      1
 8 Univers…                      9                      1                      0
 9 Virgini…                     20                     17                      0
10 Northea…                     12                      4                      0
11 Univers…                      7                      0                      0
12 Cornell…                      5                      0                      0
13 Univers…                     12                      4                      0
14 New Yor…                     14                      1                      0
15 Boston …                      6                      1                      0
16 Univers…                      4                      3                      0
17 Johns H…                      5                      1                      1
18 Brown U…                     15                      2                      0
19 The Geo…                     15                      7                      0
20 America…                     35                     16                      3
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …

LeastApplied <- MocoUni2 |>               #Creates a bottom 20 list (at least 6 applications)
  arrange(total_applied) |>               #Sorts in ascending order, so the bottom 20 come first
  mutate(column = "bottom") |>            #Adds a column labelling these rows as bottom 20
  head(20)                                #Keeps the first 20 rows
LeastApplied

# A tibble: 20 × 30
   sort     albert_einstein_appl…¹ albert_einstein_admi…² albert_einstein_enro…³
   <chr>                     <dbl>                  <dbl>                  <dbl>
 1 Alaska …                      0                      0                      0
 2 Alverni…                      0                      0                      0
 3 Berklee…                      1                      1                      0
 4 Fairlei…                      0                      0                      0
 5 Florida…                      0                      0                      0
 6 Hope Co…                      0                      0                      0
 7 Indiana…                      0                      0                      0
 8 La Sall…                      1                      1                      0
 9 Lincoln…                      0                      0                      0
10 Loyola …                      0                      0                      0
11 Manhatt…                      0                      0                      0
12 Michiga…                      0                      0                      0
13 Ohio We…                      0                      0                      0
14 Rutgers…                      0                      0                      0
15 San Fra…                      1                      0                      0
16 Seattle…                      2                      2                      0
17 St. Ola…                      0                      0                      0
18 Stonehi…                      0                      0                      0
19 SUNY Co…                      0                      0                      0
20 Univers…                      0                      0                      0
# ℹ abbreviated names: ¹albert_einstein_applied, ²albert_einstein_admitted,
#   ³albert_einstein_enrolled
# ℹ 26 more variables: `bethesda-chevy_chase_applied` <dbl>,
#   `bethesda-chevy_chase_admitted` <dbl>,
#   `bethesda-chevy_chase_enrolled` <dbl>, montgomery_blair_applied <dbl>,
#   montgomery_blair_admitted <dbl>, montgomery_blair_enrolled <dbl>,
#   richard_montgomery_applied <dbl>, richard_montgomery_admitted <dbl>, …

total_admitted ~ total_applied + names

#Scatterplot with a smoothed trend (geom_smooth() defaults to loess here; method = "lm" would give a straight line)
Linreg <- OutlierApplied |>
  ggplot(aes(x = total_admitted, y = total_enrolled)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Total Students Admitted", y = "Total Students Enrolled", title = "Students Admitted vs Enrolled")
Linreg

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
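
The message above shows that `geom_smooth()` drew a loess curve rather than a straight line. A minimal sketch of a linear version follows, assuming a made-up tibble `toy` in place of OutlierApplied: `method = "lm"` asks ggplot2 for a least-squares line, and `lm()` returns the matching intercept and slope. The values are invented, so the coefficients say nothing about the real data.

```r
library(ggplot2)
library(tibble)

# Invented values for illustration only
toy <- tibble(
  total_admitted = c(5, 12, 30, 48, 75, 110),
  total_enrolled = c(1, 3, 6, 11, 15, 24)
)

# Fit enrolled as a linear function of admitted
fit <- lm(total_enrolled ~ total_admitted, data = toy)
coef(fit)   # intercept and slope of the fitted line

# Same scatterplot as in the document, but with a straight-line fit
linplot <- ggplot(toy, aes(x = total_admitted, y = total_enrolled)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Total Students Admitted", y = "Total Students Enrolled",
       title = "Students Admitted vs Enrolled (linear fit)")
linplot
```

The slope can be read as the estimated number of additional enrolled students per additional admitted student, under the assumption that the relationship is roughly linear.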

#DO NOT GRADE!!
treemap(PacceptedHT, index="sort", vSize="total_applied", 
        vColor="percent_accepted", type="manual", border.col = c("purple"), border.lwds = c(5), 
        title = "Montgomery County College Applications", title.legend = "Percent Accepted", 
        palette="Purples")

#Do grade!
treemap(PacceptedHB, index="sort", vSize="total_applied", 
        vColor="percent_accepted", type="manual", border.col = c("purple"), border.lwds = c(5), 
        title = "Montgomery County College Applications", title.legend = "Percent Accepted", 
        palette="Purples")

This dataset required a fair amount of cleaning. First, the column names were all in the first row, and the actual columns were just numbered. I fixed this by setting the column names from the first row and then deleting that row. This was done with AI assistance. I then had to convert all of the values from character to numeric, using mutate() with as.numeric(). I performed the basic Saidi Cleaning on the column names. I created a new column giving the percentage of applicants from all eight schools combined who were accepted to each university. From there, I set aside the less-applied-to colleges by filtering out any university with fewer than 50 applications, and I used this data in my visualization to show the 20 colleges with the lowest acceptance rates for students from Montgomery County.

I would have liked to split things up by school, and I think I have a general idea of how to do that; however, I don’t have the skills in RStudio, nor the time to learn exactly how that would work for this particular dataset. I would also have used the split by school for the linear regression plot and explored which school has the highest acceptance rate, and so on. This was a really interesting dataset to play with, and I’d like to keep working with it as I gain knowledge of RStudio.
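
One possible way to split the data by school is sketched below on an invented two-school table rather than the real file. `pivot_longer()` with the `.value` sentinel and `names_pattern` can turn columns named like `albert_einstein_applied` and `albert_einstein_admitted` into one row per college and school. The regular expression assumes every pivoted column name ends in `_applied`, `_admitted`, or `_enrolled`, as the cleaned names in the printed tables do; the tibble `toy` and its numbers are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented counts for illustration only
toy <- tibble(
  sort = c("College A", "College B"),
  albert_einstein_applied = c(10, 4),
  albert_einstein_admitted = c(5, 1),
  walt_whitman_applied = c(20, 8),
  walt_whitman_admitted = c(15, 2)
)

# One row per college and school, with applied/admitted as separate columns
by_school <- toy |>
  pivot_longer(
    cols = -sort,
    names_to = c("school", ".value"),
    names_pattern = "(.*)_(applied|admitted|enrolled)$"
  )

# Overall acceptance rate per school across all colleges
school_rates <- by_school |>
  group_by(school) |>
  summarise(
    applied = sum(applied),
    admitted = sum(admitted),
    percent_accepted = round(100 * admitted / applied, 2)
  )
school_rates
```

On the real data, the `total_*` columns would need to be left out of `cols` first, since the pattern would otherwise treat "total" as another school.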