This file aims to illustrate the process of converting raw seminar attendance data into accumulated attendance counts of every student. By doing this, we can publish the effective attendance counts every week. Therefore, we can remind the students whose attendance times are comparatively lower than others.
The original data is the fingerprint records and this file processes a sample form in year 2011, which you can download at here with password “mzjd”. The outcome is attendance_count.csv, which contains information about the seminar attendance counts of every student.
Here are some problems in the raw data. It contains empty columns, and the class of recorded time is neither “POSIXct” nor “POSIXlt”.
library(pipeR) #High-performance packages
library(dplyr)
library(plyr)
library(lubridate)
setwd("~/Documents/for life/seminar")
dat0<-read.csv('raw2011.2-2011.6.csv',na.strings = "", header = TRUE, stringsAsFactors = FALSE)
names(dat0)[1:5]<-c('date','id','number','name','class')# Clean data to remove columns of all "NA"
logi.col<-as.data.frame(sapply(dat0,is.na))%>>%sapply(all)
dat.not.na<-dat0[,!logi.col] # remove columns of all "NA"
dat0$id<-as.character(dat0$id)
print(names(dat.not.na))
## [1] "date" "id" "number" "name" "class" "X01" "X02"
## [8] "X03" "X04" "X05" "X06" "X07"
time.col.NO<-grep("^X",names(dat.not.na))
time.cols<-lapply(dat.not.na[,time.col.NO],function(x) {
paste(dat0[,"date"],x)%>>%
parse_date_time("%Y%m%d %H%M%S",truncated = 2)
})%>>%as.data.frame#Convert 'time' of character class into 'time' of time class
Since not all record are valid in the raw fingerprint records, we should pay attention to the special cases and guarantee that the valid records are counted and repeated records are neglected.
Case1: Information repeated in one seminar: X05 and X06 contain repeated information.
## id X01 X02 X03 X04 X05 X06 X07
## 1182 2.772008e+13 14:24 15:41 16:21 18:01 20:00 20:01 <NA>
Case2: Although the difference between X02 and X03 is less than 5mins, they are valid information.
## id X01 X02 X03 X04 X05 X06 X07
## 971 2.302008e+13 16:24 18:03 18:07 19:29 <NA> <NA> <NA>
Case3: This case indicate that we can’t separate the data into pairs in order.
## id X01 X02 X03 X04 X05 X06 X07
## 1603 2.772009e+13 16:22 16:28 17:54 17:55 19:19 22:23 <NA>
times.oneday<-function(record){
diff.time<-lapply(record[2:length(time.col.NO)]-record[1:(length(time.col.NO)-1)],as.period,units="minutes")%>>%
as.data.frame
logi.diff.time<-diff.time>period(30,units = "minute")
a<-which(logi.diff.time==TRUE)# index of intervals which longer than 30mins
times=ifelse(all(diff(a)>1)|length(a)==1,length(a),length(a)-ceiling(length(which(diff(a)==1))/2))# Function `times.oneday` countes the effective attendance of every student in one day
return(times)
}
In the ifelse command, which is the key command judge whether the records are valid no not, three special cases are taken into consideration. The conditions for judgement is as following:
times.onedaytest.outcome<-c(times.oneday(time.cols[1182,]),times.oneday(time.cols[971,]),times.oneday(time.cols[1603,]))
test.table<-data.frame(dat0[c(1182,971,1603),c(2,6:12)],test.outcome)
print(test.table)#Test function `times.oneday` on the special cases listed in the above section.
## id X01 X02 X03 X04 X05 X06 X07 test.outcome
## 1182 2.772008e+13 14:24 15:41 16:21 18:01 20:00 20:01 <NA> 2
## 971 2.302008e+13 16:24 18:03 18:07 19:29 <NA> <NA> <NA> 2
## 1603 2.772009e+13 16:22 16:28 17:54 17:55 19:19 22:23 <NA> 2
Final outcome is the data.frame accumulated.times which contains information about the seminar attendance counts of every student. In application, we can use function write.csv to output accumulated.times into a file.csv.
attend.times<-numeric(nrow(time.cols))
attend.times<-sapply(1:nrow(time.cols),function(i) attend.times[i]<-times.oneday(time.cols[i,]))
#calculate the times of effictive attendance in one day
attend.times.df<-data.frame(dat0[,c("date","id")],attend.times)
accumulated.times<-ddply(attend.times.df,.(id),summarise,counts=sum(attend.times))
head(accumulated.times)
## id counts
## 1 103840100190232 13
## 2 103840101830233 14
## 3 103840102170234 13
## 4 103840102690235 12
## 5 103840102840237 14
## 6 103840102860236 16
# Arrange the counts to remind students whose attendance are in the lowest 10%.
tail(arrange(accumulated.times,desc(counts)), n = nrow(accumulated.times)*0.1)
## id counts
## 239 27720081152872 0
## 240 27720081152873 0
## 241 27720081152876 0
## 242 27720081152879 0
## 243 27720081152882 0
## 244 27720081152884 0
## 245 27720081152886 0
## 246 27720081152887 0
## 247 27720081152889 0
## 248 27720081152891 0
## 249 27720081152893 0
## 250 27720081152894 0
## 251 27720081152896 0
## 252 27720081152897 0
## 253 27720081152898 0
## 254 27720081152899 0
## 255 27720081152901 0
## 256 27720081152904 0
## 257 27720081152910 0
## 258 27720081153766 0
## 259 27720081153767 0
## 260 27720081153768 0
## 261 27720091154030 0
## 262 27720101154455 0
## 263 27720101154458 0
## 264 30920078102595 0
## 265 <NA> 0
This file contains mainly methods of processing the seminar records data, but in reality, we may separate the final output table into several forms in terms of the class variable. Also, we can merge the name information with the outcome, but due to the comparatively low performance of Rstudio in processing Chinese, this example file omit the name information.
I suggest that WISE gives this work, publishing the accumulated attendance counts every week, to R club. So, on one hand, this helps students to check the seminar counts conveniently. On the other hand, it can convert the technology into productivity which reduce boring and tedious work of Mrs.Zhang.
Thanks for reading.