Step 0: PREPARE THE ENVIRONMENT

The functions for k-means clustering are included in base level R, but let’s install some packages for tidying and visualizing the data more easily.

install.packages("pacman") #makes loading other packages much easier
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library("pacman")

pacman::p_load(
  tidyverse,
  readr,
  magrittr,
  dplyr,
  psych, #basic stats
  ggplot2, #data visualizations
  zoo, #process NA values
  lubridate, #cleaning date values
  factoextra #kmeans visualization
)

Step 1: COLLECTING DATA

This project attempts to examine operational makerspace data to answer the question: what does persistent use look like, and what is the profile of a “persistent” makerspace user?

This data was collected from users of the BeAM makerspaces at UNC Chapel Hill between January 1st, 2023 and June 1st, 2023 (i.e. the Spring 2023 academic semester.) To use the tools in one of the makerspaces, users are required to check in with a staff member at a front desk kiosk. During this interaction, BeAM staff collect multiple points of data including the user’s Personal Identification number (PID), intended purpose for visit, and the tools that they intend to use during their visit. Users are required to check out with a staff member at the end of their visit.

The data collected through this check-in/check-out process was extracted from the BeAM database in four different CSV files:

Let’s read in the primary CSV files that will form the bulk of the data collection. All of the datasets have strings that need be read as factors (e.g. different levels of trainings, purposes, courses, etc.)

PIDOperations <- read.csv("/cloud/project/Data/PIDOperations.csv", stringsAsFactors = TRUE)
PIDPurposeCount <- read.csv("/cloud/project/Data/PIDPurposeCount.csv", stringsAsFactors = TRUE)
PIDToolsCount <- read.csv("/cloud/project/Data/PIDToolsCount.csv", stringsAsFactors = TRUE)
PIDCourseData <- read.csv("/cloud/project/Data/PIDCourseData.csv", stringsAsFactors = TRUE)

head(PIDOperations, n = 5)
##         PID FirstVisit  LastVisit TrainingsCount VisitsCount
## 1 704221707 26/03/2023 26/04/2023              1           9
## 2 704277964    10/1/23 19/01/2023              1           3
## 3 707857617 16/03/2023 24/03/2023              1           3
## 4 710237433     7/2/23     9/5/23              2          16
## 5 711002610 21/03/2023 21/03/2023              1           1
head(PIDPurposeCount, n = 5)
##         PID Purpose.Descr No.Measure.Value
## 1 704221707    Coursework                6
## 2        NA      Personal                2
## 3        NA      Training                1
## 4 704277964      Personal                2
## 5        NA      Training                1
head(PIDToolsCount, n = 5)
##         PID        TrainingName Count
## 1 704221707             BeAM101     5
## 2        NA            Woodshop     4
## 3 704277964 3D Printer (uPrint)     2
## 4        NA             BeAM101     1
## 5 707857617       Shaper Origin     1
head(PIDCourseData, n = 5)
##         PID  CourseID
## 1 730325671 AMST 460H
## 2 730333649 AMST 460H
## 3 730354285 AMST 460H
## 4 730510524 AMST 460H
## 5 730565792 AMST 460H

Step 2: EXPLORING AND PREPARING DATA

There are a number of processes required to clean and prepare this data. First, the PIDPurposeCount and PIDToolsCount have a significant number of ‘NAs’. The extracted data followed a structure that had a single PID in the first column, paired with multiple tools/purposes in the second column (as originally denoted through color coded sections.) This leads to the incorrect assignation of ‘NA’ to rows that are within the corresponding section of a PID, but do not have the PID value present in the same row.

Since we know that the NA values in a particular PID’s section should be replaced with the last listed PID, we can use the ‘na.locf’ function from the ‘zoo’ package to do so. We can perform this function with both the ‘PIDPurposeCount’ and the ‘PIDToolsCount’ files since they have the same structural issue. We’ll check to make sure it’s working as intended by examining the head of each data frame.

PIDPurposeCount <- na.locf(PIDPurposeCount, fromFirst = TRUE)
PIDToolsCount <- na.locf(PIDToolsCount, fromFirst = TRUE)

head(PIDPurposeCount, n = 5)
##         PID Purpose.Descr No.Measure.Value
## 1 704221707    Coursework                6
## 2 704221707      Personal                2
## 3 704221707      Training                1
## 4 704277964      Personal                2
## 5 704277964      Training                1
head(PIDToolsCount, n = 5)
##         PID        TrainingName Count
## 1 704221707             BeAM101     5
## 2 704221707            Woodshop     4
## 3 704277964 3D Printer (uPrint)     2
## 4 704277964             BeAM101     1
## 5 707857617       Shaper Origin     1

The ’PIDOperations file already has one observation per PID, which is the desired format for analysis. However, the other three files - PIDPurposeCount, PIDToolsCount, and PIDCourseData - have instances where a single student might have multiple repeating rows. These files need to be reformatted into a “wide” version using functions from the tidyr package.

The CourseData file in particular needs an additional treatment as it does not feature a measure or count - only the courses that a user mentioned during check-in, with each course listed on a different line. We can create a dummy variable with binary values that will indicate ‘yes’ or ‘no’ to describe whether a PID is associated with each course.

The PIDPurposeCount and PIDToolsCount files will also have NA values that indicate that a user did not check in for that purpose or that tool during the Spring 2023 semester. We should replace those with a value of ‘0’ to make later analysis easier.

PIDPurpose_wide <- spread(PIDPurposeCount, Purpose.Descr, No.Measure.Value) %>%
    mutate_at(c(2:9), ~replace_na(.,0))

PIDTools_wide <- spread(PIDToolsCount, TrainingName, Count) %>%
  mutate_at(c(2:18), ~replace_na(.,0))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tidyr package.
##   Please report the issue at <https://github.com/tidyverse/tidyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
PIDCourseData_wide <- PIDCourseData %>%
    count(PID, CourseID) %>%
    spread(CourseID, n, fill = 0)

colnames(PIDCourseData_wide) <-  gsub("\\s+", "", colnames(PIDCourseData_wide)) #eliminate spaces in column names

head(PIDPurpose_wide, n = 5)
##         PID Consultation Coursework Personal Product/business development
## 1 704221707            0          6        2                            0
## 2 704277964            0          0        2                            0
## 3 707857617            0          0        1                            0
## 4 710237433            0          0       14                            0
## 5 711002610            0          0        0                            0
##   Research Staff after-hours Training Workshop
## 1        0                 0        1        0
## 2        0                 0        1        0
## 3        0                 0        2        0
## 4        0                 0        2        1
## 5        0                 0        1        0
head(PIDTools_wide, n = 5)
##         PID V1 3D Printer (uPrint) 3D Prt (Form 3B,Flex filmnt,etc.) 3DP
## 1 704221707  0                   0                                 0   0
## 2 704277964  0                   2                                 0   0
## 3 707857617  0                   0                                 0   1
## 4 710237433  0                   0                                 0   0
## 5 711002610  0                   0                                 0   0
##   BeAM101 Desktop Embroidery Electronics Embroidery Laser Metal Shop Serger
## 1       5                  0           0          0     0          0      0
## 2       1                  0           0          0     0          0      0
## 3       1                  0           0          0     0          0      0
## 4       0                  0           0          0     0          0      1
## 5       1                  0           0          0     0          0      0
##   Sewing Shaper Origin ShopBot Vinyl Vinyl Cutter (t-shirt) Woodshop
## 1      0             0       0     0                      0        4
## 2      0             0       0     0                      0        0
## 3      0             1       0     0                      0        0
## 4      0             0       1     0                      0       14
## 5      0             0       0     0                      0        0
head(PIDCourseData_wide, n = 5)
##         PID AMST460H APPL089 APPL101 APPL110 APPL240 APPL260 APPL265 APPL285
## 1 704221707        0       1       0       0       0       0       0       0
## 2 714013347        0       0       0       0       0       0       0       0
## 3 720126788        0       0       0       0       0       0       0       0
## 4 720240491        0       0       0       0       0       0       0       0
## 5 720536612        0       0       0       0       0       0       0       0
##   APPL390 APPL412 APPL430 APPL463 ARTS363 BMME298 BMME698 CHEM089 CLAS059
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       1       0       0
## 4       1       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   CLAS133H CMPL246 COMP089 DENT205 DENT212 DRAM766 ENEC468 ENGL143 ENVR089
## 1        0       0       0       0       0       0       0       0       0
## 2        0       0       0       0       0       1       0       0       0
## 3        0       0       0       0       0       0       0       0       0
## 4        0       0       0       0       0       0       0       0       0
## 5        0       0       0       0       0       0       0       0       0
##   GEOG115 GEOG228 IDST101 INLS690 INLS737 KOR346 MASC053 MATH069 MEJO595
## 1       0       0       0       0       0      0       0       0       0
## 2       0       0       0       0       0      0       0       0       0
## 3       0       0       0       0       0      0       0       0       0
## 4       0       0       0       0       0      0       0       0       0
## 5       1       0       0       0       0      0       0       0       0
##   MTSC710 MUSC051 NSCI395 NSCI405 NSCI424 PHYS051 PHYS055 PHYS100 PHYS231
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   PSYC330 RELI105 SPAN321
## 1       0       0       0
## 2       0       0       0
## 3       0       0       0
## 4       0       0       0
## 5       0       0       0

One more housekeeping step: there are several tool categories and purposes that would be better analyzed when consolidated into common groups. For example - ‘3DP’, ‘3DP (Form 3B)’, and ‘3DP (uPrint)’ are all types of 3D printing that are in a distinct category from the other tools. Therefore, an individual’s use across these three columns has been summed into one overall column: T_3DP. Similarly, ‘Consultation’, ‘Training’, and ‘Workshop’ are all programs that are offered by BeAM (rather than an external course project, research project, or personal project.) These columns have been summed into one overall column: P_BeAM.

Note that the prefixes ‘T’ and ‘P’ in the new column names denote the original file that contained that variable. This is to provide clarity when all of the variables are joined into a single data frame.

PIDTools_wide <- 
  PIDTools_wide %>%
  mutate(
     T_BeAM101 = BeAM101,
     T_3DP = rowSums(across(3:5)),
     T_Laser = Laser,
     T_Vinyl = rowSums(across(c(2,16, 17))),
     T_Embroidery = rowSums(across(c(7,9))),
     T_Sewing = rowSums(across(12:13)),
     T_Electronics = Electronics,
     T_Metal = `Metal Shop`,
     T_Wood = rowSums(across(c(14,15,18)))) %>%
  select(1,19:27)

PIDPurpose_wide <- 
  PIDPurpose_wide %>%
  mutate(
     P_Coursework = Coursework,
     P_Research = rowSums(across(5:6)),
     P_BeAM = rowSums(across(c(2,8,9))),
     P_Personal = rowSums(across(c(4,7)))) %>%
  select(1,10:13)

head(PIDTools_wide, n = 5)
##         PID T_BeAM101 T_3DP T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics
## 1 704221707         5     0       0       0            0        0             0
## 2 704277964         1     2       0       0            0        0             0
## 3 707857617         1     1       0       0            0        0             0
## 4 710237433         0     0       0       0            0        1             0
## 5 711002610         1     0       0       0            0        0             0
##   T_Metal T_Wood
## 1       0      4
## 2       0      0
## 3       0      1
## 4       0     15
## 5       0      0
head(PIDPurpose_wide, n = 5)
##         PID P_Coursework P_Research P_BeAM P_Personal
## 1 704221707            6          0      1          2
## 2 704277964            0          0      1          2
## 3 707857617            0          0      2          1
## 4 710237433            0          0      3         14
## 5 711002610            0          0      1          0

Before joining all the data together, let’s clean up the dates that featured in the PIDOperations file. Several functions from lubridate will help to convert all the dates to a consistent format that R can read.

mdy <- mdy(PIDOperations$FirstVisit) 
## Warning: 452 failed to parse.
dmy <- dmy(PIDOperations$FirstVisit) 
dmy[is.na(dmy)] <- mdy[is.na(dmy)]
PIDOperations$FirstVisit <- dmy

mdy <- mdy(PIDOperations$LastVisit) 
## Warning: 456 failed to parse.
dmy <- dmy(PIDOperations$LastVisit) 
dmy[is.na(dmy)] <- mdy[is.na(dmy)]
PIDOperations$LastVisit <- dmy

head(PIDOperations, n = 5)
##         PID FirstVisit  LastVisit TrainingsCount VisitsCount
## 1 704221707 2023-03-26 2023-04-26              1           9
## 2 704277964 2023-01-10 2023-01-19              1           3
## 3 707857617 2023-03-16 2023-03-24              1           3
## 4 710237433 2023-02-07 2023-05-09              2          16
## 5 711002610 2023-03-21 2023-03-21              1           1

With four data frames that have one observation for each unique PID, it’s time to create one larger data frame to use going forward. Some students will not be associated with certain courses, which will result in ‘NA’ values, which we can replace with a value of ‘0’ to indicate no association.

list_df = list(PIDOperations, PIDTools_wide, PIDPurpose_wide, PIDCourseData_wide)
MegaPIDData <- list_df %>% reduce(left_join, by='PID') %>%
  mutate_at(c(2:65), ~replace_na(.,0))

head(MegaPIDData, n = 5)
##         PID FirstVisit  LastVisit TrainingsCount VisitsCount T_BeAM101 T_3DP
## 1 704221707 2023-03-26 2023-04-26              1           9         5     0
## 2 704277964 2023-01-10 2023-01-19              1           3         1     2
## 3 707857617 2023-03-16 2023-03-24              1           3         1     1
## 4 710237433 2023-02-07 2023-05-09              2          16         0     0
## 5 711002610 2023-03-21 2023-03-21              1           1         1     0
##   T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics T_Metal T_Wood
## 1       0       0            0        0             0       0      4
## 2       0       0            0        0             0       0      0
## 3       0       0            0        0             0       0      1
## 4       0       0            0        1             0       0     15
## 5       0       0            0        0             0       0      0
##   P_Coursework P_Research P_BeAM P_Personal AMST460H APPL089 APPL101 APPL110
## 1            6          0      1          2        0       1       0       0
## 2            0          0      1          2        0       0       0       0
## 3            0          0      2          1        0       0       0       0
## 4            0          0      3         14        0       0       0       0
## 5            0          0      1          0        0       0       0       0
##   APPL240 APPL260 APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   BMME298 BMME698 CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212
## 1       0       0       0       0        0       0       0       0       0
## 2       0       0       0       0        0       0       0       0       0
## 3       0       0       0       0        0       0       0       0       0
## 4       0       0       0       0        0       0       0       0       0
## 5       0       0       0       0        0       0       0       0       0
##   DRAM766 ENEC468 ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   KOR346 MASC053 MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424
## 1      0       0       0       0       0       0       0       0       0
## 2      0       0       0       0       0       0       0       0       0
## 3      0       0       0       0       0       0       0       0       0
## 4      0       0       0       0       0       0       0       0       0
## 5      0       0       0       0       0       0       0       0       0
##   PHYS051 PHYS055 PHYS100 PHYS231 PSYC330 RELI105 SPAN321
## 1       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0

After joining, it’s time to create several additional features for this dataset. Let’s start by creating a feature that will sort VisitsCount values into easier-to-understand categories. We’ll summarize VisitsCount to find the mean, then create a three-point scale to categorize values: 1 = single visit, 2 = below mean but more than 1 visit, and 3 = above mean.

summary(MegaPIDData$VisitsCount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   5.662   9.000  32.000
MegaPIDData <- MegaPIDData %>%
  mutate(VisitsLevel = cut(MegaPIDData$VisitsCount, breaks = c(-Inf, 1, 5.662, 1000),
                  labels = c("Once", "Few", "Many")))

First, a feature named CoursesTaken will be a measure of how many makerspace courses a student is associated with. Second, the Timespan feature will display the length of time in days between a user’s first and last visit.Second, we’ll create a TimespanLevel feature that will categorize Timespan on a 3-point scale: 1 = single visit, 2 = brief (less than 4 weeks), and 3 = long (more than 4 weeks).

To make it easier to skim the final dataset, let’s also rearrange the columns slightly to shift the individual course columns towards the end.

#CoursesTaken

CoursesTakenFeature <- PIDCourseData_wide %>%
  mutate(CoursesTaken = rowSums(select(., -PID))) %>%
  select(PID, CoursesTaken)

MegaPIDData <- left_join(MegaPIDData, CoursesTakenFeature, "PID") %>%
  mutate_at(c("CoursesTaken"), ~replace_na(.,0))

head(MegaPIDData, n = 5)
##         PID FirstVisit  LastVisit TrainingsCount VisitsCount T_BeAM101 T_3DP
## 1 704221707 2023-03-26 2023-04-26              1           9         5     0
## 2 704277964 2023-01-10 2023-01-19              1           3         1     2
## 3 707857617 2023-03-16 2023-03-24              1           3         1     1
## 4 710237433 2023-02-07 2023-05-09              2          16         0     0
## 5 711002610 2023-03-21 2023-03-21              1           1         1     0
##   T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics T_Metal T_Wood
## 1       0       0            0        0             0       0      4
## 2       0       0            0        0             0       0      0
## 3       0       0            0        0             0       0      1
## 4       0       0            0        1             0       0     15
## 5       0       0            0        0             0       0      0
##   P_Coursework P_Research P_BeAM P_Personal AMST460H APPL089 APPL101 APPL110
## 1            6          0      1          2        0       1       0       0
## 2            0          0      1          2        0       0       0       0
## 3            0          0      2          1        0       0       0       0
## 4            0          0      3         14        0       0       0       0
## 5            0          0      1          0        0       0       0       0
##   APPL240 APPL260 APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   BMME298 BMME698 CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212
## 1       0       0       0       0        0       0       0       0       0
## 2       0       0       0       0        0       0       0       0       0
## 3       0       0       0       0        0       0       0       0       0
## 4       0       0       0       0        0       0       0       0       0
## 5       0       0       0       0        0       0       0       0       0
##   DRAM766 ENEC468 ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   KOR346 MASC053 MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424
## 1      0       0       0       0       0       0       0       0       0
## 2      0       0       0       0       0       0       0       0       0
## 3      0       0       0       0       0       0       0       0       0
## 4      0       0       0       0       0       0       0       0       0
## 5      0       0       0       0       0       0       0       0       0
##   PHYS051 PHYS055 PHYS100 PHYS231 PSYC330 RELI105 SPAN321 VisitsLevel
## 1       0       0       0       0       0       0       0        Many
## 2       0       0       0       0       0       0       0         Few
## 3       0       0       0       0       0       0       0         Few
## 4       0       0       0       0       0       0       0        Many
## 5       0       0       0       0       0       0       0        Once
##   CoursesTaken
## 1            1
## 2            0
## 3            0
## 4            0
## 5            0
#TimeSpan

MegaPIDData <- MegaPIDData %>%
  mutate(Timespan = difftime(LastVisit,FirstVisit,units="days")) 

MegaPIDData$Timespan <- as.numeric(MegaPIDData$Timespan)

MegaPIDData <- MegaPIDData %>%
  mutate(TimespanLevel = cut(MegaPIDData$Timespan, breaks = c(-Inf, 0, 28, 1000),
                  labels = c("Once", "Brief", "Longterm")))

MegaPIDData <- MegaPIDData[,c(1, 4:5, 66:69, 6:18, 2:3, 19:65)]
                              

head(MegaPIDData, n = 5)
##         PID TrainingsCount VisitsCount VisitsLevel CoursesTaken Timespan
## 1 704221707              1           9        Many            1       31
## 2 704277964              1           3         Few            0        9
## 3 707857617              1           3         Few            0        8
## 4 710237433              2          16        Many            0       91
## 5 711002610              1           1        Once            0        0
##   TimespanLevel T_BeAM101 T_3DP T_Laser T_Vinyl T_Embroidery T_Sewing
## 1      Longterm         5     0       0       0            0        0
## 2         Brief         1     2       0       0            0        0
## 3         Brief         1     1       0       0            0        0
## 4      Longterm         0     0       0       0            0        1
## 5          Once         1     0       0       0            0        0
##   T_Electronics T_Metal T_Wood P_Coursework P_Research P_BeAM P_Personal
## 1             0       0      4            6          0      1          2
## 2             0       0      0            0          0      1          2
## 3             0       0      1            0          0      2          1
## 4             0       0     15            0          0      3         14
## 5             0       0      0            0          0      1          0
##   FirstVisit  LastVisit AMST460H APPL089 APPL101 APPL110 APPL240 APPL260
## 1 2023-03-26 2023-04-26        0       1       0       0       0       0
## 2 2023-01-10 2023-01-19        0       0       0       0       0       0
## 3 2023-03-16 2023-03-24        0       0       0       0       0       0
## 4 2023-02-07 2023-05-09        0       0       0       0       0       0
## 5 2023-03-21 2023-03-21        0       0       0       0       0       0
##   APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363 BMME298 BMME698
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212 DRAM766 ENEC468
## 1       0       0        0       0       0       0       0       0       0
## 2       0       0        0       0       0       0       0       0       0
## 3       0       0        0       0       0       0       0       0       0
## 4       0       0        0       0       0       0       0       0       0
## 5       0       0        0       0       0       0       0       0       0
##   ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737 KOR346 MASC053
## 1       0       0       0       0       0       0       0      0       0
## 2       0       0       0       0       0       0       0      0       0
## 3       0       0       0       0       0       0       0      0       0
## 4       0       0       0       0       0       0       0      0       0
## 5       0       0       0       0       0       0       0      0       0
##   MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424 PHYS051 PHYS055
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
##   PHYS100 PHYS231 PSYC330 RELI105 SPAN321
## 1       0       0       0       0       0
## 2       0       0       0       0       0
## 3       0       0       0       0       0
## 4       0       0       0       0       0
## 5       0       0       0       0       0

Before moving on to model building, we need to normalize any data that will be used for kmeans clustering so that the features will all be oriented on a similar scale. Any categorical/binary features (e.g. the individual course columns) will not be used for this process. I’m also going to leave out the total ‘VisitCounts’ variable, as I want to later compare that value to a user’s assigned cluster to see if additional patterns emerge.

That leaves us with the following features for clustering: TrainingsCount, CoursesTaken, Timespan, 9 different tool counts, and 4 different purpose counts. Note that for building kmeans clustering models, we do not need separate training and testing datasets, so we do not need to split the data for this process.

visits <- MegaPIDData[c(2,5:6,8:20)] 

visits_z <- as.data.frame(lapply(visits, scale))

summary(visits$TrainingsCount) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.839   2.000   6.000
summary(visits_z$TrainingsCount) #checking to confirm the effect of scaling
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7587 -0.7587 -0.7587  0.0000  0.1457  3.7633

Step 3: Training a Model

Before creating a k-means clustering model, we need to decide what value to use for k. I propose starting with a value of 3, based on the different levels of ‘timespan’. These levels mimic different use cases of the makerspace: single-visit users are often visiting the makerspace out of obligation (e.g. ‘I have to take this training for a class’), brief timespan users are often visiting as part of a time-bound project (e.g. ‘i just need to finish this project and then I’m done here’), and longterm timespan users are often using the makerspace throughout the semester for different tasks (‘I have to do a training, then I’ll work on my course project, then I might come back for something else.’)

Let’s set a random seed for reproducibility and then perform k-means clustering on the standardized counts data visits_z into three clusters. The results of this clustering process will be stored in the visits_clusters object.

set.seed(1212)
visits_clusters <- kmeans(visits_z, 3)

Step 4: Evaluating Model Performance

We’ll start by evaluating the number of cases that fall into each group.

visits_clusters$size
## [1] 190 407  92

Since there is unavoidable randomness involved with kmeans, I tested out this process with 5-10 different seeds to check for outliers. Across all of the clusters that were generated, there was always a cluster with less than 100 observations, a cluster with 150-200 observations, and a cluster with 400+ observations.

We can also look at the centers for each cluster using the function below.

visits_clusters$centers
##   TrainingsCount CoursesTaken   Timespan  T_BeAM101      T_3DP   T_Laser
## 1      1.0929550    0.5745562  1.1046245  1.0061672  0.9117477  1.021414
## 2     -0.3587224   -0.4933952 -0.3581375 -0.2891806 -0.3434535 -0.397204
## 3     -0.6702327    0.9961540 -0.6969205 -0.7986442 -0.3635487 -0.352245
##      T_Vinyl T_Embroidery    T_Sewing T_Electronics     T_Metal     T_Wood
## 1  0.5237468  0.109665971 -0.06115244    -0.3640431  0.19444506  0.4262421
## 2 -0.1832483  0.002322114  0.05514997    -0.3781813 -0.04784944 -0.1465700
## 3 -0.2709765 -0.236756903 -0.11768558     2.4248696 -0.18988957 -0.2318694
##   P_Coursework  P_Research     P_BeAM P_Personal
## 1    1.1124216  0.17464914  1.0893223  0.7864141
## 2   -0.4186356 -0.05656289 -0.3578224 -0.2935105
## 3   -0.4453852 -0.11045914 -0.6667119 -0.3256511

Interesting! Using the technique outlined by Lantz in Chapter 9, we can highlight patterns and begin to visualize an interpretation of these clusters. Here is a potential highlighting schema, along with my interpretation and visualization:

Observations about each cluster include:

Let’s try visualizing these clusters:

fviz_cluster(visits_clusters, #kmeans object
visits_z, #data
geom = "point")

Step 5: Improving Model Performance

Remember that we left out features like ‘VisitCounts’ and ‘TimespanLevel’ when we created these kmeans clusters. Let’s add the cluster assignment back into the main dataset to see if patterns emerge between cluster #, total visits, and/or timespan level.

I also want to confirm some of my interpretations from the clustering process - what’s the average number of courses that users in Cluster 3 are taking, compared to their number of Course Check-ins? What’s the average number of visits for Cluster 2 compared to their visits for Sewing/Embroidery?

MegaPIDData$Cluster <- visits_clusters$cluster

aggregate(data = MegaPIDData, VisitsCount ~ Cluster, mean) 
##   Cluster VisitsCount
## 1       1   13.036842
## 2       2    2.970516
## 3       3    2.336957
aggregate(data = MegaPIDData, Timespan ~ Cluster, mean)
##   Cluster Timespan
## 1       1 91.94737
## 2       2 30.25553
## 3       3 15.96739
aggregate(data = MegaPIDData, CoursesTaken ~ Cluster, mean) 
##   Cluster CoursesTaken
## 1       1    1.0052632
## 2       2    0.2751843
## 3       3    1.2934783
aggregate(data = MegaPIDData, P_Coursework ~ Cluster, mean) #let's compare CoursesTaken and P_Coursework averages for Cluster 3
##   Cluster P_Coursework
## 1       1    5.8473684
## 2       2    0.4422604
## 3       3    0.3478261
aggregate(data = MegaPIDData, T_Embroidery + T_Sewing ~ Cluster, mean) #let's compare the combined average of Textiles purposes of Cluster 2 to their overall visits average
##   Cluster T_Embroidery + T_Sewing
## 1       1               0.6894737
## 2       2               0.6805897
## 3       3               0.2500000
aggregate(data = MegaPIDData, T_Electronics ~ Cluster, mean) #let's compare the combined average of Electronics purposes of Cluster 3 to their overall visits average
##   Cluster T_Electronics
## 1       1   0.005263158
## 2       2   0.000000000
## 3       3   1.043478261

I’m starting to get a sense for each of these clusters, but it would help to have some basic counts for TimespanLevel and VisitsCountLevel in each cluster for comparison. Remember that my goal (and research question) is to figure out how these clusters relate to makerspace persistence (i.e. more visits over a longer period of time.)

ClusterVisits <- MegaPIDData %>% group_by(Cluster,VisitsLevel) %>% 
  summarise(total_count=n(),.groups = 'drop') %>%
  as.data.frame()

ClusterTimespan <- MegaPIDData %>% group_by(Cluster,TimespanLevel) %>% 
  summarise(total_count=n(),.groups = 'drop') %>%
  as.data.frame()

ClusterVisits
##   Cluster VisitsLevel total_count
## 1       1         Few           4
## 2       1        Many         186
## 3       2        Once         132
## 4       2         Few         221
## 5       2        Many          54
## 6       3        Once          66
## 7       3         Few          16
## 8       3        Many          10
ClusterTimespan
##   Cluster TimespanLevel total_count
## 1       1         Brief           1
## 2       1      Longterm         189
## 3       2          Once         145
## 4       2         Brief         100
## 5       2      Longterm         162
## 6       3          Once          66
## 7       3         Brief           4
## 8       3      Longterm          22
ggplot(ClusterVisits, aes(x=VisitsLevel, y=total_count, fill=VisitsLevel)) + 
  geom_col() +
  facet_wrap(~Cluster)

ggplot(ClusterTimespan, aes(x=TimespanLevel, y=total_count, fill=TimespanLevel)) + 
  geom_col() +
  facet_wrap(~Cluster)

This concludes the coding portion of the final assignment. Further interpretation and conclusions will be discussed in the accompanying LAK poster paper. For ease of use, I will summarize observations of each cluster below: