数据来自杭州市垃圾分类网站,部分数据如下:

head(read.csv("111.csv"))
##         年份 城区 投放正确率. 分类正确率. 扣分
## 1 2014/01/01 上城          70          70   -3
## 2 2014/01/01 上城          50          60   -4
## 3 2014/01/01 上城          70          70    0
## 4 2014/01/01 下城          50          60  -11
## 5 2014/01/01 下城          70          65   -2
## 6 2014/01/01 下城          75          70   -2

读入数据,并对变量重命名。

#options(warn = -1)
data <- read.csv("111.csv")
names(data) <- c("year", "region", "right_ratio", "wrong_ratio", 
                 "deduct_scores")
dat_clean <- list()

对当日的得分取平均,进行初步数据整理

#dat_clean$time <- seq(as.POSIXct(data$year[1]), 
#as.POSIXct(data$year[length(data$year)]), by = "day")

dat_clean$right_ratio <- tapply(as.numeric(data$right_ratio),
                                list(data$region, data$year),
                                mean, na.rm = T)
dat_clean$wrong_ratio <- tapply(as.numeric(data$wrong_ratio),
                                list(data$region, data$year), 
                                mean, na.rm =  T)
dat_clean$deduct_scores <- tapply(as.numeric(data$deduct_scores),
                                list(data$region, data$year), 
                                mean, na.rm = T)
# dat_clean$names(data)[3] <- tapply(as.numeric(data[3]), 
#                                 list(data[1], data[2]), 
#                                 mean, na.rm = T)

重构数据,將数据转化符合时间序列分析的格式。

library(reshape)
dat_reshape <- list()
for(i in 1:3){ 
  dat_reshape[[i]] <-  as.data.frame(t(dat_clean[[i]]))
  dat_reshape[[i]]$time <- as.Date(rownames(dat_reshape[[i]])) 
  dat_reshape[[i]] <- melt(dat_reshape[[i]], id = (c("time")))
  names(dat_reshape[[i]]) <- c("time", "city", 
                               names(dat_clean)[i])
  
}
dat_work <- merge(dat_reshape[[1]], merge(dat_reshape[[2]], 
                                          dat_reshape[[3]],
                                          by = c("city","time")), 
                  by = c("time", "city"))

填补数据,运用knn近邻的方法对未知 数据进行填补。这是基于如下原理:如果两个样本十分相似,并且其中一个样本对于某一变量值是未知的。那么这个样本的变量值时与另一个样本的相应变量值的相似的概率时很高的。对于相似概念,一般以变量在多元空间中的量度定义的,其中广泛采用的是欧式距离(Euclidean Distance),即定义为两个体相应变量之差的平方和,然后取其平方根。 \[ d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{i=1}^p(\mathbf{x}_i-\mathbf{y}_i)^2} \] 找到近邻后,有两种方式来估计变量值。一种是取频数最大的那个值或中位数。另一种方法是求10个近邻相应变量值的加权和,权重随着待估样本与近邻的欧式距离的增大而减小。我们可以用一个高斯核作为其权重。如果某一近邻与待估样本的欧式距离为\(d\),那么该紧邻对应的权重即为 \[ u(d)=e^{-d} \] 在这里,我们采用的是后一种方法。并且,为了将此方法扩展到离散变量,应用的距离函数为 \[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=m}^p\delta_i(\mathbf{x}_i, \mathbf{y}_i)} \] 其中 \[ \begin{equation*} |x| = \left\{ \begin{array}{ll} 0 & \text{如果 $i$ 是名义变量且 } v_1\neq v_2\\ 1 & \text{如果 $i$ 是名义变量且 } v_1 = v_2\\ (v_1-v_2)^2 & \text{如果 $i$ 是数值型变量} \end{array} \right. \end{equation*} \] 这些距离是在数据经过标准化之后计算的。如下 \[ \mathbf{y}_i = \frac{\mathbf{x}_i-\overline{\mathbf{x}}}{\sigma_{\mathbf{x}}} \]

library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
dat_work[3:5] <- knnImputation(dat_work[3:5], k = 10, 
                               meth = "media")

绘制各个市区的正确投放率(right_ ratio),错误投放率(wrong_ ratio)和扣分(deduct_ scores)。

library(ggplot2)
ylabel <- c("right_ratio", "wrong_ratio", "deduct_scores")
qplot(time, dat_work[,3], data = dat_work, colour = city, 
      geom = c("smooth"), size = 1, ylab = ylabel[1])
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-6

qplot(time, dat_work[,4], data = dat_work, colour = city, 
      geom = c("smooth"), size = 1, ylab = ylabel[2])
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-6

qplot(time, dat_work[,5], data = dat_work, colour = city, 
      geom = c("smooth"), size = 1, ylab = ylabel[3])
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-6