Lab Usage Data Cleaning and Visualization

The college’s IT service team is interested in finding out what is the usage situation in each computer lab across Carleton campus. The IT service team is hoping by gaining a grasp of past student usage patterns, they will be able to make computer labs managment decisions with increased confidence.

The IT service team provided us the lab usage data of all 77 computer labs every day of the past 2018-2019 academic year. The dataset is a wide-format data containing the number of peak usage of Carleton computer labs.

Data Preparation

Import Dataset.

K2_peaks <- read.table("/Users/mac/Desktop/DataSquad/DataUsage/Login Peaks (ACAD x 2018.09.01-2019.07.01).txt", sep = "\t")

Variables of the raw dataset includes lab names, week names, days of the week. The raw data is in the following format:

Raw Data
V1	V2	V3	V4	V5	V6	V7	V8	V9	V10
Name	Name2	Sunday	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday	NA
AAHI Misc									NA
	Week of 8/26/18							1	NA
	Week of 9/2/18	1	1	2	1	1	1	1	NA
	Week of 9/9/18	2	2	1	2	2	3	2	NA
	Week of 9/16/18	2	3	2	1	1	1	2	NA
	Week of 9/23/18	2	3	3	2	2	1	2	NA

Since we are only interested in data from Monday through Friday, we exclude the Saturday and Sunday data.

labdata <- K2_peaks[-1,-c(3,9,10)]

Since room name only exist in the first row of each group, we use the following function to add room name to each data row.

#Adding room to each data row instead of just first row in the group
for (i in 1: 3450){
  if(labdata[i,1] != ""){
    holder <- labdata[i,1]
  }
  else{
    labdata[i,1] <- holder
  }
}

Data is now in the following wide format, each row contains data of an entire week.

##         Name             Week Monday Tuesday Wednesday Thursday Friday
## 2  AAHI Misc                                                          
## 3  AAHI Misc  Week of 8/26/18                                         
## 4  AAHI Misc   Week of 9/2/18      1       2         1        1      1
## 5  AAHI Misc   Week of 9/9/18      2       1         2        2      3
## 6  AAHI Misc  Week of 9/16/18      3       2         1        1      1
## 7  AAHI Misc  Week of 9/23/18      3       3         2        2      1
## 8  AAHI Misc  Week of 9/30/18      3       1         3        2      2
## 9  AAHI Misc  Week of 10/7/18      2       3         3        3      2
## 10 AAHI Misc Week of 10/14/18      3       4         5        6      4
## 11 AAHI Misc Week of 10/21/18      3       4         2        2      3

We transformed the data from the wide format to the long format, so it becomes an appropriate dataframe for R to interpret. Monday through Friday becomes a category under the Day variable and correspond to a Peak value. 

Empty fields represent days where there was no usage, thus are substituted with 0. 

Non data-containing rows, such as row ##2 in the above data are dropped.

#Substitute empty cells with 0
labdata[is.na(labdata$Peak), 'Peak'] <- 0 

#Change wide-format to long-format
labdata <- labdata %>% gather(key = "Day", value = "Peak", 3:7) %>% filter(Week != "")

The cleaned up data is in the following format.

##         Name             Week    Day Peak
## 1  AAHI Misc  Week of 8/26/18 Monday     
## 2  AAHI Misc   Week of 9/2/18 Monday    1
## 3  AAHI Misc   Week of 9/9/18 Monday    2
## 4  AAHI Misc  Week of 9/16/18 Monday    3
## 5  AAHI Misc  Week of 9/23/18 Monday    3
## 6  AAHI Misc  Week of 9/30/18 Monday    3
## 7  AAHI Misc  Week of 10/7/18 Monday    2
## 8  AAHI Misc Week of 10/14/18 Monday    3
## 9  AAHI Misc Week of 10/21/18 Monday    3
## 10 AAHI Misc Week of 10/28/18 Monday    2

Change the variable types of Peak and Day.

labdata$Peak <- as.numeric(labdata$Peak)
labdata$Day <- as.factor(labdata$Day)

Group the data by the computer labs.

labdata <- labdata[order(labdata$Name),]

A Carleton acdemic year consists of three terms: the fall, the winter, and the spring terms. Each term spans across 11 weeks, including test days. 

Since the data includes values of every week of last year, we matched each week with the corresponding term names. We can now create a categoricla variable 'Term' to label the data. 

We filtered out the weeks that are not in school terms and left only the data we are interested in in a new dataframe labdata_term. Then we created the new 'Term' using the following condition.

labdata_term <- filter(labdata, Week %in% c(fallterm, winterterm, springterm)) 

labdata_term$Term <- ifelse(labdata_term$Week %in% fallterm, labdata_term$Term <- 'fall', ifelse(labdata_term$Week %in% winterterm, labdata_term$Term <- 'winter', labdata_term$Term <- 'spring'))

We would also like to create a 'Building' variable that indicates the campus building each computer lab is located. The following is an example of how Building variable is created for computer labs in CMC, Center for Mathematics and Computing.

labdata_term$Building[labdata_term$Name %in% c('CMC102', 'CMC104', 'CMC109', 'CMC110', 'CMC201', 'CMC304', 'CMC306', 'CMC307','CMC318')] <- 'CMC' #9 CMC labs

We also created a 'WeekId' variable to notate how many weeks it is into the term.

labdata_term$WeekId <- factor(1:11)

Data Visualization

Let’s now dive into the data of individual computer labs, using the example of CMC201 to explore options for visualizing data.

We derived data of just CMC201 through the full dataframe. 

We created a separate dataframe to calculate mean usage of each week so later on we can utilize this statistics in our visuals.

mean.term <- CMC201_terms %>% group_by(Term, Day) %>% dplyr::summarize(mean.term.val = mean(Peak))

In this graph we can see the usage situation in CMC201 by school terms and by days of the week. The red line, as indicated in the legend, is the average peak usage of Mondays through Fridays in the whole term. 
The blue line represents the number of total computers available in the CMC201.

We can also visualize the data by term through slicing the full CMC201 dataset, for example the following is the visualization of CMC201 usage in spring term.

mean.spring <- CMC201_spring %>% group_by(Day) %>% dplyr::summarize(mean.spring.val = mean(Peak))