0.1 Introductory

This file aims to illustrate the process of converting raw seminar attendance data into accumulated attendance counts of every student. By doing this, we can publish the effective attendance counts every week. Therefore, we can remind the students whose attendance times are comparatively lower than others.

The original data is the fingerprint records and this file processes a sample form in year 2011, which you can download at here with password “mzjd”. The outcome is attendance_count.csv, which contains information about the seminar attendance counts of every student.

0.2 Data cleaning and tidying

Here are some problems in the raw data. It contains empty columns, and the class of recorded time is neither “POSIXct” nor “POSIXlt”.

0.2.1 Clean data to remove empty columns.

library(pipeR) #High-performance packages
library(dplyr)
library(plyr)
library(lubridate)
setwd("~/Documents/for life/seminar")
dat0<-read.csv('raw2011.2-2011.6.csv',na.strings = "", header = TRUE, stringsAsFactors = FALSE)
names(dat0)[1:5]<-c('date','id','number','name','class')# Clean data to remove columns of all "NA"
logi.col<-as.data.frame(sapply(dat0,is.na))%>>%sapply(all)
dat.not.na<-dat0[,!logi.col] # remove columns of all "NA"
dat0$id<-as.character(dat0$id)
print(names(dat.not.na))
##  [1] "date"   "id"     "number" "name"   "class"  "X01"    "X02"   
##  [8] "X03"    "X04"    "X05"    "X06"    "X07"

0.2.2 Tidy the time records

time.col.NO<-grep("^X",names(dat.not.na))
time.cols<-lapply(dat.not.na[,time.col.NO],function(x) {
        paste(dat0[,"date"],x)%>>%
                parse_date_time("%Y%m%d %H%M%S",truncated = 2)        
})%>>%as.data.frame#Convert 'time' of character class into 'time' of time class

0.3 Show some special cases and fix them

Since not all record are valid in the raw fingerprint records, we should pay attention to the special cases and guarantee that the valid records are counted and repeated records are neglected.

0.3.1 Special cases

Case1: Information repeated in one seminar: X05 and X06 contain repeated information.

##                id   X01   X02   X03   X04   X05   X06  X07
## 1182 2.772008e+13 14:24 15:41 16:21 18:01 20:00 20:01 <NA>

Case2: Although the difference between X02 and X03 is less than 5mins, they are valid information.

##               id   X01   X02   X03   X04  X05  X06  X07
## 971 2.302008e+13 16:24 18:03 18:07 19:29 <NA> <NA> <NA>

Case3: This case indicate that we can’t separate the data into pairs in order.

##                id   X01   X02   X03   X04   X05   X06  X07
## 1603 2.772009e+13 16:22 16:28 17:54 17:55 19:19 22:23 <NA>

0.3.2 Count valid attendance times

times.oneday<-function(record){
        diff.time<-lapply(record[2:length(time.col.NO)]-record[1:(length(time.col.NO)-1)],as.period,units="minutes")%>>%
                as.data.frame
        logi.diff.time<-diff.time>period(30,units = "minute")
        a<-which(logi.diff.time==TRUE)# index of intervals which longer than 30mins
        times=ifelse(all(diff(a)>1)|length(a)==1,length(a),length(a)-ceiling(length(which(diff(a)==1))/2))# Function `times.oneday` countes the effective attendance of every student in one day
        return(times)
}

In the ifelse command, which is the key command judge whether the records are valid no not, three special cases are taken into consideration. The conditions for judgement is as following:

  • Condition1: all periods which are longer than 30mins are valid attendance
  • Condition2: students attend single seminar in this day
  • Condition3: students attend more than one seminars, and the interval between two seminars is also larger than 30mins, which means we should avoid counting interval as one attendance.

0.3.3 Test function times.oneday

test.outcome<-c(times.oneday(time.cols[1182,]),times.oneday(time.cols[971,]),times.oneday(time.cols[1603,]))
test.table<-data.frame(dat0[c(1182,971,1603),c(2,6:12)],test.outcome)
print(test.table)#Test function `times.oneday` on the special cases listed in the above section.
##                id   X01   X02   X03   X04   X05   X06  X07 test.outcome
## 1182 2.772008e+13 14:24 15:41 16:21 18:01 20:00 20:01 <NA>            2
## 971  2.302008e+13 16:24 18:03 18:07 19:29  <NA>  <NA> <NA>            2
## 1603 2.772009e+13 16:22 16:28 17:54 17:55 19:19 22:23 <NA>            2

0.4 Process all records

0.4.1 Final outcome

Final outcome is the data.frame accumulated.times which contains information about the seminar attendance counts of every student. In application, we can use function write.csv to output accumulated.times into a file.csv.

attend.times<-numeric(nrow(time.cols))
attend.times<-sapply(1:nrow(time.cols),function(i) attend.times[i]<-times.oneday(time.cols[i,]))
#calculate the times of effictive attendance in one day
attend.times.df<-data.frame(dat0[,c("date","id")],attend.times)
accumulated.times<-ddply(attend.times.df,.(id),summarise,counts=sum(attend.times))
head(accumulated.times)
##                id counts
## 1 103840100190232     13
## 2 103840101830233     14
## 3 103840102170234     13
## 4 103840102690235     12
## 5 103840102840237     14
## 6 103840102860236     16

0.4.2 Students with lowest attendance times

# Arrange the counts to remind students whose attendance are in the lowest 10%.
tail(arrange(accumulated.times,desc(counts)), n = nrow(accumulated.times)*0.1)
##                 id counts
## 239 27720081152872      0
## 240 27720081152873      0
## 241 27720081152876      0
## 242 27720081152879      0
## 243 27720081152882      0
## 244 27720081152884      0
## 245 27720081152886      0
## 246 27720081152887      0
## 247 27720081152889      0
## 248 27720081152891      0
## 249 27720081152893      0
## 250 27720081152894      0
## 251 27720081152896      0
## 252 27720081152897      0
## 253 27720081152898      0
## 254 27720081152899      0
## 255 27720081152901      0
## 256 27720081152904      0
## 257 27720081152910      0
## 258 27720081153766      0
## 259 27720081153767      0
## 260 27720081153768      0
## 261 27720091154030      0
## 262 27720101154455      0
## 263 27720101154458      0
## 264 30920078102595      0
## 265           <NA>      0

0.5 Summary

This file contains mainly methods of processing the seminar records data, but in reality, we may separate the final output table into several forms in terms of the class variable. Also, we can merge the name information with the outcome, but due to the comparatively low performance of Rstudio in processing Chinese, this example file omit the name information.

I suggest that WISE gives this work, publishing the accumulated attendance counts every week, to R club. So, on one hand, this helps students to check the seminar counts conveniently. On the other hand, it can convert the technology into productivity which reduce boring and tedious work of Mrs.Zhang.

Thanks for reading.