The functions for k-means clustering are included in base level R, but let’s install some packages for tidying and visualizing the data more easily.
install.packages("pacman") #makes loading other packages much easier
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library("pacman")
pacman::p_load(
tidyverse,
readr,
magrittr,
dplyr,
psych, #basic stats
ggplot2, #data visualizations
zoo, #process NA values
lubridate, #cleaning date values
factoextra #kmeans visualization
)
This project attempts to examine operational makerspace data to answer the question: what does persistent use look like, and what is the profile of a “persistent” makerspace user?
This data was collected from users of the BeAM makerspaces at UNC Chapel Hill between January 1st, 2023 and June 1st, 2023 (i.e. the Spring 2023 academic semester.) To use the tools in one of the makerspaces, users are required to check in with a staff member at a front desk kiosk. During this interaction, BeAM staff collect multiple points of data including the user’s Personal Identification number (PID), intended purpose for visit, and the tools that they intend to use during their visit. Users are required to check out with a staff member at the end of their visit.
The data collected through this check-in/check-out process was extracted from the BeAM database in four different CSV files:
PIDOperations: This file contains general operational data about each user’s activity during the Spring 2023 semester. That includes their PID, the date of their first visit, the date of their last visit, the total number of trainings that they completed, and the total number of times they visited the makerspaces.
PIDPurposeCount: This file contains information about how many times a user checked in for a particular purpose (i.e. coursework, research, etc.) during the Spring 2023 semester.
PIDToolsCount: This file contains information about how many times a user checked in to use each tool (i.e. 3D printer, sewing machine, etc.) during the Spring 2023 semester.
PIDCourseData: This file contains information about what courses each student checked in under - i.e., if they said “I’m here to work on a project for APPL110.” As part of the check-in procedure, BeAM staff are trained to ask users whether their visit is associated with an academic course and if so, to name the specific course.
Let’s read in the primary CSV files that will form the bulk of the data collection. All of the datasets have strings that need be read as factors (e.g. different levels of trainings, purposes, courses, etc.)
PIDOperations <- read.csv("/cloud/project/Data/PIDOperations.csv", stringsAsFactors = TRUE)
PIDPurposeCount <- read.csv("/cloud/project/Data/PIDPurposeCount.csv", stringsAsFactors = TRUE)
PIDToolsCount <- read.csv("/cloud/project/Data/PIDToolsCount.csv", stringsAsFactors = TRUE)
PIDCourseData <- read.csv("/cloud/project/Data/PIDCourseData.csv", stringsAsFactors = TRUE)
head(PIDOperations, n = 5)
## PID FirstVisit LastVisit TrainingsCount VisitsCount
## 1 704221707 26/03/2023 26/04/2023 1 9
## 2 704277964 10/1/23 19/01/2023 1 3
## 3 707857617 16/03/2023 24/03/2023 1 3
## 4 710237433 7/2/23 9/5/23 2 16
## 5 711002610 21/03/2023 21/03/2023 1 1
head(PIDPurposeCount, n = 5)
## PID Purpose.Descr No.Measure.Value
## 1 704221707 Coursework 6
## 2 NA Personal 2
## 3 NA Training 1
## 4 704277964 Personal 2
## 5 NA Training 1
head(PIDToolsCount, n = 5)
## PID TrainingName Count
## 1 704221707 BeAM101 5
## 2 NA Woodshop 4
## 3 704277964 3D Printer (uPrint) 2
## 4 NA BeAM101 1
## 5 707857617 Shaper Origin 1
head(PIDCourseData, n = 5)
## PID CourseID
## 1 730325671 AMST 460H
## 2 730333649 AMST 460H
## 3 730354285 AMST 460H
## 4 730510524 AMST 460H
## 5 730565792 AMST 460H
There are a number of processes required to clean and prepare this data. First, the PIDPurposeCount and PIDToolsCount have a significant number of ‘NAs’. The extracted data followed a structure that had a single PID in the first column, paired with multiple tools/purposes in the second column (as originally denoted through color coded sections.) This leads to the incorrect assignation of ‘NA’ to rows that are within the corresponding section of a PID, but do not have the PID value present in the same row.
Since we know that the NA values in a particular PID’s section should be replaced with the last listed PID, we can use the ‘na.locf’ function from the ‘zoo’ package to do so. We can perform this function with both the ‘PIDPurposeCount’ and the ‘PIDToolsCount’ files since they have the same structural issue. We’ll check to make sure it’s working as intended by examining the head of each data frame.
PIDPurposeCount <- na.locf(PIDPurposeCount, fromFirst = TRUE)
PIDToolsCount <- na.locf(PIDToolsCount, fromFirst = TRUE)
head(PIDPurposeCount, n = 5)
## PID Purpose.Descr No.Measure.Value
## 1 704221707 Coursework 6
## 2 704221707 Personal 2
## 3 704221707 Training 1
## 4 704277964 Personal 2
## 5 704277964 Training 1
head(PIDToolsCount, n = 5)
## PID TrainingName Count
## 1 704221707 BeAM101 5
## 2 704221707 Woodshop 4
## 3 704277964 3D Printer (uPrint) 2
## 4 704277964 BeAM101 1
## 5 707857617 Shaper Origin 1
The ’PIDOperations file already has one observation per PID, which is the desired format for analysis. However, the other three files - PIDPurposeCount, PIDToolsCount, and PIDCourseData - have instances where a single student might have multiple repeating rows. These files need to be reformatted into a “wide” version using functions from the tidyr package.
The CourseData file in particular needs an additional treatment as it does not feature a measure or count - only the courses that a user mentioned during check-in, with each course listed on a different line. We can create a dummy variable with binary values that will indicate ‘yes’ or ‘no’ to describe whether a PID is associated with each course.
The PIDPurposeCount and PIDToolsCount files will also have NA values that indicate that a user did not check in for that purpose or that tool during the Spring 2023 semester. We should replace those with a value of ‘0’ to make later analysis easier.
PIDPurpose_wide <- spread(PIDPurposeCount, Purpose.Descr, No.Measure.Value) %>%
mutate_at(c(2:9), ~replace_na(.,0))
PIDTools_wide <- spread(PIDToolsCount, TrainingName, Count) %>%
mutate_at(c(2:18), ~replace_na(.,0))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tidyr package.
## Please report the issue at <https://github.com/tidyverse/tidyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
PIDCourseData_wide <- PIDCourseData %>%
count(PID, CourseID) %>%
spread(CourseID, n, fill = 0)
colnames(PIDCourseData_wide) <- gsub("\\s+", "", colnames(PIDCourseData_wide)) #eliminate spaces in column names
head(PIDPurpose_wide, n = 5)
## PID Consultation Coursework Personal Product/business development
## 1 704221707 0 6 2 0
## 2 704277964 0 0 2 0
## 3 707857617 0 0 1 0
## 4 710237433 0 0 14 0
## 5 711002610 0 0 0 0
## Research Staff after-hours Training Workshop
## 1 0 0 1 0
## 2 0 0 1 0
## 3 0 0 2 0
## 4 0 0 2 1
## 5 0 0 1 0
head(PIDTools_wide, n = 5)
## PID V1 3D Printer (uPrint) 3D Prt (Form 3B,Flex filmnt,etc.) 3DP
## 1 704221707 0 0 0 0
## 2 704277964 0 2 0 0
## 3 707857617 0 0 0 1
## 4 710237433 0 0 0 0
## 5 711002610 0 0 0 0
## BeAM101 Desktop Embroidery Electronics Embroidery Laser Metal Shop Serger
## 1 5 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 0 0 0 0 0 0 1
## 5 1 0 0 0 0 0 0
## Sewing Shaper Origin ShopBot Vinyl Vinyl Cutter (t-shirt) Woodshop
## 1 0 0 0 0 0 4
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 0 1 0 0 14
## 5 0 0 0 0 0 0
head(PIDCourseData_wide, n = 5)
## PID AMST460H APPL089 APPL101 APPL110 APPL240 APPL260 APPL265 APPL285
## 1 704221707 0 1 0 0 0 0 0 0
## 2 714013347 0 0 0 0 0 0 0 0
## 3 720126788 0 0 0 0 0 0 0 0
## 4 720240491 0 0 0 0 0 0 0 0
## 5 720536612 0 0 0 0 0 0 0 0
## APPL390 APPL412 APPL430 APPL463 ARTS363 BMME298 BMME698 CHEM089 CLAS059
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 1 0 0
## 4 1 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## CLAS133H CMPL246 COMP089 DENT205 DENT212 DRAM766 ENEC468 ENGL143 ENVR089
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 1 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## GEOG115 GEOG228 IDST101 INLS690 INLS737 KOR346 MASC053 MATH069 MEJO595
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0 0 0
## MTSC710 MUSC051 NSCI395 NSCI405 NSCI424 PHYS051 PHYS055 PHYS100 PHYS231
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## PSYC330 RELI105 SPAN321
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
One more housekeeping step: there are several tool categories and purposes that would be better analyzed when consolidated into common groups. For example - ‘3DP’, ‘3DP (Form 3B)’, and ‘3DP (uPrint)’ are all types of 3D printing that are in a distinct category from the other tools. Therefore, an individual’s use across these three columns has been summed into one overall column: T_3DP. Similarly, ‘Consultation’, ‘Training’, and ‘Workshop’ are all programs that are offered by BeAM (rather than an external course project, research project, or personal project.) These columns have been summed into one overall column: P_BeAM.
Note that the prefixes ‘T’ and ‘P’ in the new column names denote the original file that contained that variable. This is to provide clarity when all of the variables are joined into a single data frame.
PIDTools_wide <-
PIDTools_wide %>%
mutate(
T_BeAM101 = BeAM101,
T_3DP = rowSums(across(3:5)),
T_Laser = Laser,
T_Vinyl = rowSums(across(c(2,16, 17))),
T_Embroidery = rowSums(across(c(7,9))),
T_Sewing = rowSums(across(12:13)),
T_Electronics = Electronics,
T_Metal = `Metal Shop`,
T_Wood = rowSums(across(c(14,15,18)))) %>%
select(1,19:27)
PIDPurpose_wide <-
PIDPurpose_wide %>%
mutate(
P_Coursework = Coursework,
P_Research = rowSums(across(5:6)),
P_BeAM = rowSums(across(c(2,8,9))),
P_Personal = rowSums(across(c(4,7)))) %>%
select(1,10:13)
head(PIDTools_wide, n = 5)
## PID T_BeAM101 T_3DP T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics
## 1 704221707 5 0 0 0 0 0 0
## 2 704277964 1 2 0 0 0 0 0
## 3 707857617 1 1 0 0 0 0 0
## 4 710237433 0 0 0 0 0 1 0
## 5 711002610 1 0 0 0 0 0 0
## T_Metal T_Wood
## 1 0 4
## 2 0 0
## 3 0 1
## 4 0 15
## 5 0 0
head(PIDPurpose_wide, n = 5)
## PID P_Coursework P_Research P_BeAM P_Personal
## 1 704221707 6 0 1 2
## 2 704277964 0 0 1 2
## 3 707857617 0 0 2 1
## 4 710237433 0 0 3 14
## 5 711002610 0 0 1 0
Before joining all the data together, let’s clean up the dates that featured in the PIDOperations file. Several functions from lubridate will help to convert all the dates to a consistent format that R can read.
mdy <- mdy(PIDOperations$FirstVisit)
## Warning: 452 failed to parse.
dmy <- dmy(PIDOperations$FirstVisit)
dmy[is.na(dmy)] <- mdy[is.na(dmy)]
PIDOperations$FirstVisit <- dmy
mdy <- mdy(PIDOperations$LastVisit)
## Warning: 456 failed to parse.
dmy <- dmy(PIDOperations$LastVisit)
dmy[is.na(dmy)] <- mdy[is.na(dmy)]
PIDOperations$LastVisit <- dmy
head(PIDOperations, n = 5)
## PID FirstVisit LastVisit TrainingsCount VisitsCount
## 1 704221707 2023-03-26 2023-04-26 1 9
## 2 704277964 2023-01-10 2023-01-19 1 3
## 3 707857617 2023-03-16 2023-03-24 1 3
## 4 710237433 2023-02-07 2023-05-09 2 16
## 5 711002610 2023-03-21 2023-03-21 1 1
With four data frames that have one observation for each unique PID, it’s time to create one larger data frame to use going forward. Some students will not be associated with certain courses, which will result in ‘NA’ values, which we can replace with a value of ‘0’ to indicate no association.
list_df = list(PIDOperations, PIDTools_wide, PIDPurpose_wide, PIDCourseData_wide)
MegaPIDData <- list_df %>% reduce(left_join, by='PID') %>%
mutate_at(c(2:65), ~replace_na(.,0))
head(MegaPIDData, n = 5)
## PID FirstVisit LastVisit TrainingsCount VisitsCount T_BeAM101 T_3DP
## 1 704221707 2023-03-26 2023-04-26 1 9 5 0
## 2 704277964 2023-01-10 2023-01-19 1 3 1 2
## 3 707857617 2023-03-16 2023-03-24 1 3 1 1
## 4 710237433 2023-02-07 2023-05-09 2 16 0 0
## 5 711002610 2023-03-21 2023-03-21 1 1 1 0
## T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics T_Metal T_Wood
## 1 0 0 0 0 0 0 4
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 1
## 4 0 0 0 1 0 0 15
## 5 0 0 0 0 0 0 0
## P_Coursework P_Research P_BeAM P_Personal AMST460H APPL089 APPL101 APPL110
## 1 6 0 1 2 0 1 0 0
## 2 0 0 1 2 0 0 0 0
## 3 0 0 2 1 0 0 0 0
## 4 0 0 3 14 0 0 0 0
## 5 0 0 1 0 0 0 0 0
## APPL240 APPL260 APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## BMME298 BMME698 CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## DRAM766 ENEC468 ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## KOR346 MASC053 MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## PHYS051 PHYS055 PHYS100 PHYS231 PSYC330 RELI105 SPAN321
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
After joining, it’s time to create several additional features for this dataset. Let’s start by creating a feature that will sort VisitsCount values into easier-to-understand categories. We’ll summarize VisitsCount to find the mean, then create a three-point scale to categorize values: 1 = single visit, 2 = below mean but more than 1 visit, and 3 = above mean.
summary(MegaPIDData$VisitsCount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.000 5.662 9.000 32.000
MegaPIDData <- MegaPIDData %>%
mutate(VisitsLevel = cut(MegaPIDData$VisitsCount, breaks = c(-Inf, 1, 5.662, 1000),
labels = c("Once", "Few", "Many")))
First, a feature named CoursesTaken will be a measure of how many makerspace courses a student is associated with. Second, the Timespan feature will display the length of time in days between a user’s first and last visit.Second, we’ll create a TimespanLevel feature that will categorize Timespan on a 3-point scale: 1 = single visit, 2 = brief (less than 4 weeks), and 3 = long (more than 4 weeks).
To make it easier to skim the final dataset, let’s also rearrange the columns slightly to shift the individual course columns towards the end.
#CoursesTaken
CoursesTakenFeature <- PIDCourseData_wide %>%
mutate(CoursesTaken = rowSums(select(., -PID))) %>%
select(PID, CoursesTaken)
MegaPIDData <- left_join(MegaPIDData, CoursesTakenFeature, "PID") %>%
mutate_at(c("CoursesTaken"), ~replace_na(.,0))
head(MegaPIDData, n = 5)
## PID FirstVisit LastVisit TrainingsCount VisitsCount T_BeAM101 T_3DP
## 1 704221707 2023-03-26 2023-04-26 1 9 5 0
## 2 704277964 2023-01-10 2023-01-19 1 3 1 2
## 3 707857617 2023-03-16 2023-03-24 1 3 1 1
## 4 710237433 2023-02-07 2023-05-09 2 16 0 0
## 5 711002610 2023-03-21 2023-03-21 1 1 1 0
## T_Laser T_Vinyl T_Embroidery T_Sewing T_Electronics T_Metal T_Wood
## 1 0 0 0 0 0 0 4
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 1
## 4 0 0 0 1 0 0 15
## 5 0 0 0 0 0 0 0
## P_Coursework P_Research P_BeAM P_Personal AMST460H APPL089 APPL101 APPL110
## 1 6 0 1 2 0 1 0 0
## 2 0 0 1 2 0 0 0 0
## 3 0 0 2 1 0 0 0 0
## 4 0 0 3 14 0 0 0 0
## 5 0 0 1 0 0 0 0 0
## APPL240 APPL260 APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## BMME298 BMME698 CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## DRAM766 ENEC468 ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## KOR346 MASC053 MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## PHYS051 PHYS055 PHYS100 PHYS231 PSYC330 RELI105 SPAN321 VisitsLevel
## 1 0 0 0 0 0 0 0 Many
## 2 0 0 0 0 0 0 0 Few
## 3 0 0 0 0 0 0 0 Few
## 4 0 0 0 0 0 0 0 Many
## 5 0 0 0 0 0 0 0 Once
## CoursesTaken
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
#TimeSpan
MegaPIDData <- MegaPIDData %>%
mutate(Timespan = difftime(LastVisit,FirstVisit,units="days"))
MegaPIDData$Timespan <- as.numeric(MegaPIDData$Timespan)
MegaPIDData <- MegaPIDData %>%
mutate(TimespanLevel = cut(MegaPIDData$Timespan, breaks = c(-Inf, 0, 28, 1000),
labels = c("Once", "Brief", "Longterm")))
MegaPIDData <- MegaPIDData[,c(1, 4:5, 66:69, 6:18, 2:3, 19:65)]
head(MegaPIDData, n = 5)
## PID TrainingsCount VisitsCount VisitsLevel CoursesTaken Timespan
## 1 704221707 1 9 Many 1 31
## 2 704277964 1 3 Few 0 9
## 3 707857617 1 3 Few 0 8
## 4 710237433 2 16 Many 0 91
## 5 711002610 1 1 Once 0 0
## TimespanLevel T_BeAM101 T_3DP T_Laser T_Vinyl T_Embroidery T_Sewing
## 1 Longterm 5 0 0 0 0 0
## 2 Brief 1 2 0 0 0 0
## 3 Brief 1 1 0 0 0 0
## 4 Longterm 0 0 0 0 0 1
## 5 Once 1 0 0 0 0 0
## T_Electronics T_Metal T_Wood P_Coursework P_Research P_BeAM P_Personal
## 1 0 0 4 6 0 1 2
## 2 0 0 0 0 0 1 2
## 3 0 0 1 0 0 2 1
## 4 0 0 15 0 0 3 14
## 5 0 0 0 0 0 1 0
## FirstVisit LastVisit AMST460H APPL089 APPL101 APPL110 APPL240 APPL260
## 1 2023-03-26 2023-04-26 0 1 0 0 0 0
## 2 2023-01-10 2023-01-19 0 0 0 0 0 0
## 3 2023-03-16 2023-03-24 0 0 0 0 0 0
## 4 2023-02-07 2023-05-09 0 0 0 0 0 0
## 5 2023-03-21 2023-03-21 0 0 0 0 0 0
## APPL265 APPL285 APPL390 APPL412 APPL430 APPL463 ARTS363 BMME298 BMME698
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## CHEM089 CLAS059 CLAS133H CMPL246 COMP089 DENT205 DENT212 DRAM766 ENEC468
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## ENGL143 ENVR089 GEOG115 GEOG228 IDST101 INLS690 INLS737 KOR346 MASC053
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## MATH069 MEJO595 MTSC710 MUSC051 NSCI395 NSCI405 NSCI424 PHYS051 PHYS055
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## PHYS100 PHYS231 PSYC330 RELI105 SPAN321
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
Before moving on to model building, we need to normalize any data that will be used for kmeans clustering so that the features will all be oriented on a similar scale. Any categorical/binary features (e.g. the individual course columns) will not be used for this process. I’m also going to leave out the total ‘VisitCounts’ variable, as I want to later compare that value to a user’s assigned cluster to see if additional patterns emerge.
That leaves us with the following features for clustering: TrainingsCount, CoursesTaken, Timespan, 9 different tool counts, and 4 different purpose counts. Note that for building kmeans clustering models, we do not need separate training and testing datasets, so we do not need to split the data for this process.
visits <- MegaPIDData[c(2,5:6,8:20)]
visits_z <- as.data.frame(lapply(visits, scale))
summary(visits$TrainingsCount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.839 2.000 6.000
summary(visits_z$TrainingsCount) #checking to confirm the effect of scaling
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7587 -0.7587 -0.7587 0.0000 0.1457 3.7633
Before creating a k-means clustering model, we need to decide what value to use for k. I propose starting with a value of 3, based on the different levels of ‘timespan’. These levels mimic different use cases of the makerspace: single-visit users are often visiting the makerspace out of obligation (e.g. ‘I have to take this training for a class’), brief timespan users are often visiting as part of a time-bound project (e.g. ‘i just need to finish this project and then I’m done here’), and longterm timespan users are often using the makerspace throughout the semester for different tasks (‘I have to do a training, then I’ll work on my course project, then I might come back for something else.’)
Let’s set a random seed for reproducibility and then perform k-means clustering on the standardized counts data visits_z into three clusters. The results of this clustering process will be stored in the visits_clusters object.
set.seed(1212)
visits_clusters <- kmeans(visits_z, 3)
We’ll start by evaluating the number of cases that fall into each group.
visits_clusters$size
## [1] 190 407 92
Since there is unavoidable randomness involved with kmeans, I tested out this process with 5-10 different seeds to check for outliers. Across all of the clusters that were generated, there was always a cluster with less than 100 observations, a cluster with 150-200 observations, and a cluster with 400+ observations.
We can also look at the centers for each cluster using the function below.
visits_clusters$centers
## TrainingsCount CoursesTaken Timespan T_BeAM101 T_3DP T_Laser
## 1 1.0929550 0.5745562 1.1046245 1.0061672 0.9117477 1.021414
## 2 -0.3587224 -0.4933952 -0.3581375 -0.2891806 -0.3434535 -0.397204
## 3 -0.6702327 0.9961540 -0.6969205 -0.7986442 -0.3635487 -0.352245
## T_Vinyl T_Embroidery T_Sewing T_Electronics T_Metal T_Wood
## 1 0.5237468 0.109665971 -0.06115244 -0.3640431 0.19444506 0.4262421
## 2 -0.1832483 0.002322114 0.05514997 -0.3781813 -0.04784944 -0.1465700
## 3 -0.2709765 -0.236756903 -0.11768558 2.4248696 -0.18988957 -0.2318694
## P_Coursework P_Research P_BeAM P_Personal
## 1 1.1124216 0.17464914 1.0893223 0.7864141
## 2 -0.4186356 -0.05656289 -0.3578224 -0.2935105
## 3 -0.4453852 -0.11045914 -0.6667119 -0.3256511
Interesting! Using the technique outlined by Lantz in Chapter 9, we can highlight patterns and begin to visualize an interpretation of these clusters. Here is a potential highlighting schema, along with my interpretation and visualization:
Observations about each cluster include:
Cluster 1: With 190 observations, this cluster contains about 27% of this semester’s users. This cluster is associated with higher-than-average values in the number of trainings they have completed and the length of time between their first/last visits. They use the laser cutter significantly more often than the other clusters, and they use the makerspaces for multiple purposes, possibly implicating this cluster as “superusers” of the makerspace.
Cluster 2: With 407 observations, this cluster contains about 60% of this semester’s users. At first, nearly all of this cluster’s values seem to be below average - perhaps implying that they are somewhat infrequent visitors compared to Cluster 1. However, I find it interesting that this cluster has a higher-than-average association with both Embroidery and Sewing tools, which could give them the primary identification of “textiles-driven” users. I wish I had demographic data - I would be very curious to see how gender shakes out in this cluster.
Cluster 3: With 92 observations, this cluster contains about 13% of this semester’s users. Again, this cluster seems to have below-average values mostly across the board. However, this cluster is highly associated with makerspace courses, even moreso than Cluster 1, which may mean that these users are more likely associated with more than one course. Curiously, they do not have a higher-than-average association with the ‘coursework’ purpose, although they do have a higher-than-average association with using the Electronics tools. We could refer to this cluster as “electronics academics”.
Let’s try visualizing these clusters:
fviz_cluster(visits_clusters, #kmeans object
visits_z, #data
geom = "point")
Remember that we left out features like ‘VisitCounts’ and ‘TimespanLevel’ when we created these kmeans clusters. Let’s add the cluster assignment back into the main dataset to see if patterns emerge between cluster #, total visits, and/or timespan level.
I also want to confirm some of my interpretations from the clustering process - what’s the average number of courses that users in Cluster 3 are taking, compared to their number of Course Check-ins? What’s the average number of visits for Cluster 2 compared to their visits for Sewing/Embroidery?
MegaPIDData$Cluster <- visits_clusters$cluster
aggregate(data = MegaPIDData, VisitsCount ~ Cluster, mean)
## Cluster VisitsCount
## 1 1 13.036842
## 2 2 2.970516
## 3 3 2.336957
aggregate(data = MegaPIDData, Timespan ~ Cluster, mean)
## Cluster Timespan
## 1 1 91.94737
## 2 2 30.25553
## 3 3 15.96739
aggregate(data = MegaPIDData, CoursesTaken ~ Cluster, mean)
## Cluster CoursesTaken
## 1 1 1.0052632
## 2 2 0.2751843
## 3 3 1.2934783
aggregate(data = MegaPIDData, P_Coursework ~ Cluster, mean) #let's compare CoursesTaken and P_Coursework averages for Cluster 3
## Cluster P_Coursework
## 1 1 5.8473684
## 2 2 0.4422604
## 3 3 0.3478261
aggregate(data = MegaPIDData, T_Embroidery + T_Sewing ~ Cluster, mean) #let's compare the combined average of Textiles purposes of Cluster 2 to their overall visits average
## Cluster T_Embroidery + T_Sewing
## 1 1 0.6894737
## 2 2 0.6805897
## 3 3 0.2500000
aggregate(data = MegaPIDData, T_Electronics ~ Cluster, mean) #let's compare the combined average of Electronics purposes of Cluster 3 to their overall visits average
## Cluster T_Electronics
## 1 1 0.005263158
## 2 2 0.000000000
## 3 3 1.043478261
I’m starting to get a sense for each of these clusters, but it would help to have some basic counts for TimespanLevel and VisitsCountLevel in each cluster for comparison. Remember that my goal (and research question) is to figure out how these clusters relate to makerspace persistence (i.e. more visits over a longer period of time.)
ClusterVisits <- MegaPIDData %>% group_by(Cluster,VisitsLevel) %>%
summarise(total_count=n(),.groups = 'drop') %>%
as.data.frame()
ClusterTimespan <- MegaPIDData %>% group_by(Cluster,TimespanLevel) %>%
summarise(total_count=n(),.groups = 'drop') %>%
as.data.frame()
ClusterVisits
## Cluster VisitsLevel total_count
## 1 1 Few 4
## 2 1 Many 186
## 3 2 Once 132
## 4 2 Few 221
## 5 2 Many 54
## 6 3 Once 66
## 7 3 Few 16
## 8 3 Many 10
ClusterTimespan
## Cluster TimespanLevel total_count
## 1 1 Brief 1
## 2 1 Longterm 189
## 3 2 Once 145
## 4 2 Brief 100
## 5 2 Longterm 162
## 6 3 Once 66
## 7 3 Brief 4
## 8 3 Longterm 22
ggplot(ClusterVisits, aes(x=VisitsLevel, y=total_count, fill=VisitsLevel)) +
geom_col() +
facet_wrap(~Cluster)
ggplot(ClusterTimespan, aes(x=TimespanLevel, y=total_count, fill=TimespanLevel)) +
geom_col() +
facet_wrap(~Cluster)
This concludes the coding portion of the final assignment. Further interpretation and conclusions will be discussed in the accompanying LAK poster paper. For ease of use, I will summarize observations of each cluster below:
Cluster 1: 190 observations, 27% of Spring 2023 users. Higher-than-average association with: # of trainings, longer timespan between visits, multiple purposes (courses, personal), and multiple tools (BeAM 101, Laser Cutter). Average of 14 visits over 92 days, and typically associated with one academic course. General classification is “longterm users who visit many times.”
Cluster 2: 407 observations, 60% of Spring 2023 users. Higher-than-average association with: Embroidery and Sewing tools. Average of 3 visits over 30 days, only associated with a course about 25% of the time. On average, 23% of their visits are to use Textiles tools (avgT_Textiles / avg#Visits) compared to Cluster 1’s average of 5% Textiles visits and Cluster 3’s average of 11% Textiles visits. General classification is “textiles-driven users who visit one-to-few times.”
Cluster 3: 92 observations, 13% of Spring 2023 users. Higher-than-average association with: courses taken, Electronics tools. Average of 2 visits over 16 days, and typically associated with more than one academic course. However, only 15% of their visits are related to coursework (avgP_Coursework / avg#Visits) while 45% of their visits are to use Electronics equipment. General classification is “electronics-driven users who visit once.”