DATA 110 Project 1

Author

David Burkart

Introduction:

Being the manager of the Ackerman STEM Learning Center, which provides STEM tutoring to Montgomery College’s students, I am interested in learning how students utilize our tutoring services. Specifically, I would like to know if the number of tutoring requests made for each discipline changes throughout the semester. To answer this, I will use a data set consisting of records of all of the tutoring requests made during the Spring 2024 semester. This data was collected by myself during the aforementioned semester. In the data set there are many variables for each record, but of relevance here is “Timestamp” (date) and “Course” (the course code consisting of the four-letter discipline abbreviation and a three-digit course number). To plot the desired data I will first have to clean and modify the data. Then, I will make an alluvial chart to illustrate the how the requests for each discipline changes over time and in relation to each other.

Load the necessary libraries and the data set:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(alluvial)

Warning: package 'alluvial' was built under R version 4.4.2

library(ggalluvial)

Warning: package 'ggalluvial' was built under R version 4.4.2

ALC <- read_csv("ALC Tutoring Data SP24 Sample.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 3048 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): Timestamp, Day, AM or PM, Course, Declined, No Ans., Tutor #, Tut...
dbl   (1): Wait Time
time  (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(ALC)

# A tibble: 6 × 12
  Timestamp Day   Time     `AM or PM` Course   Declined `No Ans.` `Tutor #`
  <chr>     <chr> <time>   <chr>      <chr>    <chr>    <chr>     <chr>    
1 1/2/2024  Tue   10:33:42 AM         MATH 123 10:35 AM <NA>      <NA>     
2 1/4/2024  Thu   10:09:45 AM         MATH 117 <NA>     <NA>      447      
3 1/5/2024  Fri   01:11:00 PM         MATH PL  <NA>     <NA>      244      
4 1/5/2024  Fri   01:11:35 PM         MATH 117 <NA>     <NA>      447      
5 1/8/2024  Mon   01:48:43 PM         MATH PL  <NA>     <NA>      244      
6 1/10/2024 Wed   11:56:21 AM         MATH PL  <NA>     <NA>      393      
# ℹ 4 more variables: `Tutor Start` <chr>, `Tutor End` <chr>,
#   `Wait Time` <dbl>, `Time w Tutor` <chr>

Fix the format of the timestamp to be recognized as a date and then group into weeks:

ALC2 <- ALC |>
  mutate(date = as.Date(Timestamp, "%m/%d/%Y")) |>
   mutate(week = lubridate::week(ymd(date)))
head(ALC2)

# A tibble: 6 × 14
  Timestamp Day   Time     `AM or PM` Course   Declined `No Ans.` `Tutor #`
  <chr>     <chr> <time>   <chr>      <chr>    <chr>    <chr>     <chr>    
1 1/2/2024  Tue   10:33:42 AM         MATH 123 10:35 AM <NA>      <NA>     
2 1/4/2024  Thu   10:09:45 AM         MATH 117 <NA>     <NA>      447      
3 1/5/2024  Fri   01:11:00 PM         MATH PL  <NA>     <NA>      244      
4 1/5/2024  Fri   01:11:35 PM         MATH 117 <NA>     <NA>      447      
5 1/8/2024  Mon   01:48:43 PM         MATH PL  <NA>     <NA>      244      
6 1/10/2024 Wed   11:56:21 AM         MATH PL  <NA>     <NA>      393      
# ℹ 6 more variables: `Tutor Start` <chr>, `Tutor End` <chr>,
#   `Wait Time` <dbl>, `Time w Tutor` <chr>, date <date>, week <dbl>

Remove all uneccessary columns:

ALC3 <- ALC2 |>
  filter(week > 3) |>
  select(Course, week)

Create a column for each record with only the discipline instead of the course (which requires the creation of a separate database for each discipline):

math <- ALC3 |>
  filter(grepl("MATH", Course)) |>
  mutate(discipline = "MATH") |>
  group_by(week) |>
  summarise(Mathematics = n())

biol <- ALC3 |>
  filter(grepl("BIOL", Course)) |>
  mutate(discipline = "BIOL") |>
  group_by(week) |>
  summarise(Biology = n())

chem <- ALC3 |>
  filter(grepl("CHEM", Course)) |>
  mutate(discipline = "CHEM") |>
  group_by(week) |>
  summarise(Chemistry = n())

cmsc <- ALC3 |>
  filter(grepl("CMSC", Course)) |>
  mutate(discipline = "CMSC") |>
  group_by(week) |>
  summarise(Computer_Science = n())

enee <- ALC3 |>
  filter(grepl("ENEE", Course)) |>
  mutate(discipline = "ENEE") |>
  group_by(week) |>
  summarise(Electrical_Engineering = n())

enes <- ALC3 |>
  filter(grepl("ENES", Course)) |>
  mutate(discipline = "ENES") |>
  group_by(week) |>
  summarise(Engineering_Sciences = n())

phys <- ALC3 |>
  filter(grepl("PHYS", Course)) |>
  mutate(discipline = "PHYS") |>
  group_by(week) |>
  summarise(Physics = n())

Combine each discipline database back into a single database:

ALC4 <- biol |>
  full_join(chem, by=c("week")) |>
  full_join(cmsc, by=c("week")) |>
  full_join(enee, by=c("week")) |>
  full_join(enes, by=c("week")) |>
  full_join(math, by=c("week")) |>
  full_join(phys, by=c("week"))

Change any “NA”s to zeros:

ALC5 <- ALC4 |>
  mutate(across(where(is.numeric), ~replace_na(., 0)))

Source: https://www.geeksforgeeks.org/how-to-replace-na-with-zero-in-dplyr/

Pivot to longer format to condense data for plotting:

ALC6 <- ALC5 |>
  pivot_longer(cols = 2:8, names_to = "Discipline", values_to = "Tutoring_Requests")

Modify the week to represent the week of the semester:

ALC7 <- ALC6 |>
  mutate(semester_week = week - 3)

Make the alluvial chart with number of tutoring requests on the y axis, week on the x axis, and discipline as the alluvium and fill. Modify the theme and aspects of the alluvium for contrast. Modify the x axis intervals so that each week number is shown. Add a descriptive title, axis labels, and source. Also add a line and text to label week 11 as “Spring Break”:

p1 <- ALC7 |>
  ggplot(aes(x = semester_week, y = , Tutoring_Requests, alluvium = Discipline)) + 
  theme_classic() +
  geom_alluvium(aes(fill = Discipline), 
                color = "black",
                width = .01, 
                alpha = .9,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Set3") + 
  scale_x_continuous(breaks = seq(1, 16, by = 1)) +
  labs(title = "Number of Tutoring Requests per Discipline Made at the Ackerman \n STEM Learning Center During the Spring 2024 Semester",
       y = "Number of Tutoring Requests",
       x = "Week of the Semester",
       fill = "Discipline",
       caption = "Source: Ackerman STEM Learning Center") +
  geom_vline(xintercept = 8, linetype = "dotdash", size = 0.5, color = "black") +
  geom_text(aes(x=7.7, y=235, label="Spring Break"), cex=2.7, color="black", angle = 90)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

p1

Prepare the data for a linear regression by getting the total number of requests for each week (and exclude week 8 because it is Spring Break):

ALC8 <- ALC7 |>
group_by(semester_week) |>
  summarise(Total_Requests = sum(Tutoring_Requests)) |>
  filter(semester_week != 8)

Plot the regression to see if there is a correlation between week of the semester and the number of tutoring requests made:

p2 <- ALC8 |>
  ggplot(aes(x = semester_week, y = Total_Requests)) +
  geom_point() +
  labs(title = "Total Tutoring Requests Made Per Week at the Ackerman \n STEM Learning Center During the Spring 2024 Semester",
  caption = "Source: Ackerman STEM Learning Center",
  x = "Week of the Semester",
  y = "Number of Tutoring Requests Made") +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 16, by = 1))+
  geom_smooth(method='lm',formula=y~x)
p2

Perform the correlation analysis to see how well a change in week predicts the number of tutoring sessions:

cor(ALC8$semester_week, ALC8$Total_Requests)

[1] 0.149778

Perform a linear regression to determine how well the linear model fits the data:

linear <- lm(semester_week ~ Total_Requests, data = ALC8)
summary(linear)


Call:
lm(formula = semester_week ~ Total_Requests, data = ALC8)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3689 -4.4108  0.9923  4.0525  8.4163 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     5.38489    5.91022   0.911    0.379
Total_Requests  0.01571    0.02875   0.546    0.594

Residual standard error: 5.054 on 13 degrees of freedom
Multiple R-squared:  0.02243,   Adjusted R-squared:  -0.05276 
F-statistic: 0.2983 on 1 and 13 DF,  p-value: 0.5942

Interpret the results of linear regression analysis:

The equation of the model is: y = 0.016x + 5.385. This means that for each increase in week of the semester, the total number of tutoring requests increases by 0.016. The correlation value of 0.15 is well below 0.5 and close to zero, suggesting there is essentially no correlation between the two variables. This is corroborated by the insignificant p-value of 0.594. Finally, the adjusted R-squared value of -0.053 indicates that very little of the variation in tutoring requests is caused by the week of the semester.

Brief Essay:

Cleaning the data set began by fixing the format of the variable “Timestamp” to be recognized as a date so that each date could be assigned a week number. This was achieved using the mutate and lubridate functions. Next,the filter function was used to remove all unnecessary variables.

The original data set had the course name for each record, but I only needed the course’s discipline. To make this discipline-only column, I had to create a new database for each discipline. This was achieved using grepl, which extracts certain words from an entry, followed by mutate to create the new column. I also added group_by and summarize (using n) to show only the total number of tutoring sessions for each week. After doing this for each discipline, I used full_join to combine each discipline-specific database back into a single database.

I noticed there were some “NA”s in the data so I removed these using mutate and replace_na (this code was sourced from geeksforgeeks.org). To plot the data, I had to re-format the data set into a long format using pivot_longer. Finally, since the week numbers were ordered by week of the year and not by week of the semester, I corrected this by subtracting 3 from each week number using mutate. This was the end of the data cleaning that I did for the visualization (later for the linear regression I prepared the data by getting the total number of requests for each week, excluding week 8 because it is Spring Break, using group_by, summarize, and filter).

The alluvial chart I made graphs the number of tutoring requests per discipline that were made each week during the semester. As an alluvial, it conveys the amount that each discipline makes up of the total tutoring requests for each week. Like a line graph, the visualization allows the viewer to appreciate any trends in how the number of requests for a given discipline changed over time. There isn’t much of this change, except for all disciplines having a lower number of tutoring requests at the beginning and end of the semester (and none during spring break). This is depicted by the relatively uniform width of the band for each discipline throughout the graph.

The order that the bands are stacked in the alluvial indicates which discipline had the highest number of tutoring requests (i.e. the one on top). This allows the viewer to also compare temporal changes between the disciplines. For example, for most weeks math was the most requested discipline but for weeks 5, 10, and 12 chemistry was the most requested discipline. Biology and computer science typically had a relatively similar number of requests throughout the semester but the demand for computer science tutoring was slightly higher than biology during the first half of the semester, then lower during the second half. In these ways, one could describe math and chemistry or biology and computer science as “competing” with each other. Physics and the engineering disciplines always had low numbers of requests.

Given my familiarity with tutoring at the Ackerman Center, none of this was very surprising aside from the decrease in computer science tutoring during the second half of the semester and the weeks when chemistry requests were higher than math (I assumed math would be higher every week). I wouldn’t change anything I did as I was able to make the chart I intended to make, but the small number of requests for physics and the engineering sciences compared to math and chemistry make them difficult to see on this graph. Perhaps a heat map would convey the same information and also allow the values for these disciplines to be viewable.