慕课《数据挖掘》课程——练习一
感兴趣的同学可根据提供的学生体侧数据,利用所学数据分析技术发掘数据中隐含的规律,撰写研究报告。可自行编程,也可利用数据挖掘工具和算法包。
## 倒入数据####
TCdata <- readxl::read_excel("数据.xlsx")
## 剔除读取到的多余的空列
TCdata <- TCdata[,1:38]
head(TCdata)## # A tibble: 6 × 38
## 编号 性别 年龄 身高 体重 细胞内液 细胞外液 肌肉重 蛋白质 瘦体重
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 23 155.2 54.0 18.3 8.9 34.9 7.7 37.4
## 2 0 0 25 178.0 60.3 27.3 13.3 52.0 11.5 55.6
## 3 0 1 23 155.0 54.0 18.5 8.9 35.1 7.7 37.7
## 4 27 0 22 168.5 54.5 23.7 11.2 44.7 9.8 47.9
## 5 28 0 23 175.1 66.9 27.6 13.3 52.4 11.5 55.9
## 6 29 0 23 157.7 56.3 22.4 10.1 41.6 9.2 44.6
## # ... with 28 more variables: 无机盐 <dbl>, 脂肪重 <dbl>,
## # 脂肪百分比 <dbl>, 腰臀比 <dbl>, 肥胖度 <dbl>, 体质指数 <dbl>,
## # 肌肉控制 <dbl>, 脂肪控制 <dbl>, 体重控制 <dbl>, 标准体重 <dbl>,
## # 目标体重 <dbl>, 基础代谢率 <dbl>, 躯干肌肉 <dbl>, 躯干骨质 <dbl>,
## # 躯干脂肪 <dbl>, 躯干脂肪百分比 <dbl>, 左上肢肌肉 <dbl>,
## # 左上肢骨质 <dbl>, 左上肢脂肪 <dbl>, 右上肢肌肉 <dbl>,
## # 右上肢骨质 <dbl>, 右上肢脂肪 <dbl>, 左下肢肌肉 <dbl>,
## # 左下肢骨质 <dbl>, 左下肢脂肪 <dbl>, 右下肢肌肉 <dbl>,
## # 右下肢骨质 <dbl>, 右下肢脂肪 <dbl>
summary(TCdata)## 编号 性别 年龄 身高
## Min. : 0.0 Min. :0.0000 Min. :18.00 Min. :148.7
## 1st Qu.: 228.0 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:162.0
## Median : 395.5 Median :0.0000 Median :23.00 Median :169.0
## Mean : 413.6 Mean :0.3525 Mean :23.45 Mean :168.1
## 3rd Qu.: 554.2 3rd Qu.:1.0000 3rd Qu.:24.00 3rd Qu.:173.9
## Max. :8888.0 Max. :1.0000 Max. :35.00 Max. :188.0
## 体重 细胞内液 细胞外液 肌肉重
## Min. : 35.00 Min. :15.40 Min. : 7.600 Min. :29.50
## 1st Qu.: 53.00 1st Qu.:20.07 1st Qu.: 9.875 1st Qu.:38.45
## Median : 60.00 Median :24.45 Median :11.600 Median :46.35
## Mean : 60.57 Mean :24.23 Mean :11.630 Mean :45.97
## 3rd Qu.: 67.00 3rd Qu.:27.52 3rd Qu.:13.200 3rd Qu.:52.20
## Max. :110.00 Max. :40.00 Max. :19.000 Max. :75.60
## 蛋白质 瘦体重 无机盐 脂肪重
## Min. : 6.500 Min. :31.70 Min. :2.300 Min. :-2.600
## 1st Qu.: 8.475 1st Qu.:41.25 1st Qu.:2.800 1st Qu.: 8.275
## Median :10.200 Median :49.60 Median :3.200 Median :11.000
## Mean :10.115 Mean :49.19 Mean :3.211 Mean :11.383
## 3rd Qu.:11.500 3rd Qu.:55.70 3rd Qu.:3.600 3rd Qu.:13.600
## Max. :16.600 Max. :80.50 Max. :4.900 Max. :42.700
## 脂肪百分比 腰臀比 肥胖度 体质指数
## Min. :-6.50 Min. :0.680 Min. : 68.80 Min. :14.10
## 1st Qu.:14.20 1st Qu.:0.800 1st Qu.: 90.88 1st Qu.:19.40
## Median :18.60 Median :0.820 Median : 97.70 Median :21.00
## Mean :18.71 Mean :0.828 Mean : 99.58 Mean :21.34
## 3rd Qu.:22.70 3rd Qu.:0.860 3rd Qu.:105.95 3rd Qu.:22.80
## Max. :39.90 Max. :1.090 Max. :160.90 Max. :35.90
## 肌肉控制 脂肪控制 体重控制 标准体重
## Min. : 0.000 Min. :-17.4000 Min. :-42.6000 Min. :43.60
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:55.00
## Median : 1.600 Median : 0.0000 Median : 0.0000 Median :61.60
## Mean : 2.152 Mean : -0.9148 Mean : -0.1855 Mean :60.88
## 3rd Qu.: 3.600 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:66.42
## Max. :13.000 Max. : 9.8000 Max. : 22.7000 Max. :81.70
## 目标体重 基础代谢率 躯干肌肉 躯干骨质
## Min. :42.00 Min. :1221 Min. :14.80 Min. :1.100
## 1st Qu.:54.17 1st Qu.:1401 1st Qu.:19.27 1st Qu.:1.400
## Median :60.15 Median :1624 Median :23.20 Median :1.600
## Mean :60.38 Mean :1616 Mean :23.02 Mean :1.607
## 3rd Qu.:65.72 3rd Qu.:1792 3rd Qu.:26.10 3rd Qu.:1.800
## Max. :84.50 Max. :2378 Max. :37.90 Max. :2.400
## 躯干脂肪 躯干脂肪百分比 左上肢肌肉 左上肢骨质
## Min. : 0.100 Min. :-6.50 Min. :-3.400 Min. :-0.2000
## 1st Qu.: 4.100 1st Qu.:14.20 1st Qu.: 2.800 1st Qu.: 0.2000
## Median : 5.500 Median :18.60 Median : 3.300 Median : 0.2000
## Mean : 5.702 Mean :18.71 Mean : 3.291 Mean : 0.2309
## 3rd Qu.: 6.800 3rd Qu.:22.70 3rd Qu.: 3.800 3rd Qu.: 0.3000
## Max. :21.400 Max. :39.90 Max. : 5.400 Max. : 0.3000
## 左上肢脂肪 右上肢肌肉 右上肢骨质 右上肢脂肪
## Min. :0.1000 Min. :-28.700 Min. :-2.1000 Min. :0.1000
## 1st Qu.:0.6000 1st Qu.: 2.700 1st Qu.: 0.2000 1st Qu.:0.6000
## Median :0.8000 Median : 3.300 Median : 0.2000 Median :0.8000
## Mean :0.8217 Mean : 3.232 Mean : 0.2281 Mean :0.8213
## 3rd Qu.:1.0000 3rd Qu.: 3.800 3rd Qu.: 0.3000 3rd Qu.:1.0000
## Max. :3.2000 Max. : 5.500 Max. : 0.4000 Max. :3.0000
## 左下肢肌肉 左下肢骨质 左下肢脂肪 右下肢肌肉
## Min. : 5.20 Min. :0.4000 Min. : 0.100 Min. : 5.300
## 1st Qu.: 6.80 1st Qu.:0.5000 1st Qu.: 1.500 1st Qu.: 6.800
## Median : 8.20 Median :0.6000 Median : 1.900 Median : 8.300
## Mean : 25.54 Mean :0.5672 Mean : 2.056 Mean : 8.558
## 3rd Qu.: 9.20 3rd Qu.:0.6000 3rd Qu.: 2.400 3rd Qu.: 9.400
## Max. :8513.00 Max. :1.0000 Max. :23.000 Max. :168.000
## 右下肢骨质 右下肢脂肪
## Min. : 0.4000 Min. : 0.100
## 1st Qu.: 0.5000 1st Qu.: 1.500
## Median : 0.6000 Median : 2.000
## Mean : 0.6791 Mean : 2.076
## 3rd Qu.: 0.6000 3rd Qu.: 2.400
## Max. :51.0000 Max. :20.000
## 该数据集没有缺失值
# sum(complete.cases(TCdata)) #查看完整的数据行数
# sum(is.na(TCdata)) ## 检查是否有缺失值
# md.pattern(TCdata)
## 可视化缺失值的情况
par(family = "STKaiti",cex = 0.7)
aggr(TCdata,numbers=T)## 没有缺失值
par(family = "STKaiti",cex = 0.7)
matrixplot(TCdata)##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
## 该矩阵图均为单一灰度图,每有缺失值上面的图像显示并没有缺失值存在
## 绘制数据的箱线图查看是否有异常值####
TCdatal <- reshape2::melt(TCdata,id.vars = c("编号","性别"))
TCdatal$性别 <- factor(TCdatal$性别,labels = c("男","女"))
# ## 一张图
# ggplot(TCdatal,aes(x = variable,y = value)) +
# theme_bw(base_family = "STKaiti",base_size = 10)+
# geom_boxplot(aes(colour = 性别))
## 每隔变量单独做图
ggplot(TCdatal,aes(x = variable,y = value)) +
theme_bw(base_family = "STKaiti",base_size = 10)+
geom_boxplot(aes(colour = 性别)) +
facet_wrap(~variable,scales = "free") + ggtitle("箱线图")+
labs(x = "体检项目",y = "体检数据") 从上面的图像中可以发现有很多的异常值
## 使用聚类查找异常样本并进行处理
## 获得异常值监测数据,不分析编号、性别、年龄、身高、体重
Lof_data <- TCdata[,c(-1,-2,-3,-4,-5)]
## 数据标准化
Lof_data_scale <- sapply(Lof_data, scale)
## 获得异常值的饭
outlier.scores <- lofactor(Lof_data_scale, k=8)
## 异常值得分可视化
ggplot(data = data.frame(outlier.scores = outlier.scores)) +
theme_bw(base_family = "STKaiti")+
geom_density(aes(outlier.scores),alpha = 0.1)# par(family = "STKaiti",cex = 1)
# plot(density(outlier.scores))
# which(outlier.scores >= 2)## pick top 10 as outliers
outliers <- order(outlier.scores, decreasing=T)[1:10]
## who are outliers
# print(outliers)
# TCdata[outliers,]
##剔除最大10个异常值得分后的密度图
par(family = "STKaiti",cex = 1)
plot(density(outlier.scores[-outliers])) 从上面的新的密度图可以看出异常值排除后效果明显
## 通过主成分分析显示数据点的二维分布情况
par(family = "STKaiti",cex = 1)
n <- nrow(Lof_data_scale)
labels <- 1:n
labels[-outliers] <- "。"
labels[outliers] <- "+"
biplot(prcomp(Lof_data_scale),var.axes = FALSE,cex = 0.8, xlabs = labels) 从主成分结果图上可以发现大部分异常样本提取了出来
加号为剔除的离群异常样本
## 矩阵散点图查看异常数据的提取情况
par(family = "STKaiti",cex = 1)
pch <- rep(".", dim(Lof_data)[1])
pch[outliers] <- "+"
col <- rep("black", dim(Lof_data)[1])
col[outliers] <- "red"
pairs(Lof_data[,seq(1,dim(Lof_data)[2],by = 4)], pch=pch, col=col) 该图也可以显示出异常值的提取情况
## 获取剔除异常值后新的数据集
TCdatadel_olt <- TCdata[-outliers,]
# summary(TCdatadel_olt)
## 绘制数据的箱线图查看异常值情况####
TCdatal <- reshape2::melt(TCdatadel_olt,id.vars = c("编号","性别"))
TCdatal$性别 <- factor(TCdatal$性别,labels = c("男","女"))
## 每隔变量单独做图
ggplot(TCdatal,aes(x = variable,y = value)) +
theme_bw(base_family = "STKaiti",base_size = 10)+
geom_boxplot(aes(colour = 性别),outlier.size = 1) +
facet_wrap(~variable,scales = "free") + ggtitle("剔除异常值后箱线图")+
labs(x = "体检项目",y = "体检数据") 从新的箱线图可以发现异常值的情况的到了改善
## 数据的主成分分析####
fa.parallel(TCdatadel_olt[,c(-1,-2)])## Parallel analysis suggests that the number of factors = 3 and the number of components = 2
上图推荐因子分析可提取前3个主因子
推荐主成分分析提取2个主成分
## 提取前三个主成分
pc <- princomp(TCdatadel_olt[,c(-1,-2)],cor = TRUE,scale = TRUE)
screeplot(pc,main = " 主成分分析方差图")## 主成分分析方差图和累积贡献率
summary(pc)## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 4.5578983 3.3162519 1.11103586 0.95900155
## Proportion of Variance 0.5770677 0.3054869 0.03428891 0.02554678
## Cumulative Proportion 0.5770677 0.8825545 0.91684345 0.94239023
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.80964819 0.69408373 0.45595735 0.408061678
## Proportion of Variance 0.01820917 0.01338201 0.00577492 0.004625398
## Cumulative Proportion 0.96059940 0.97398141 0.97975633 0.984381728
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.384185150 0.363111526 0.33422424 0.233161704
## Proportion of Variance 0.004099951 0.003662499 0.00310294 0.001510122
## Cumulative Proportion 0.988481678 0.992144178 0.99524712 0.996757240
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.1718829248 0.1402065603 0.1222886877 0.1190852242
## Proportion of Variance 0.0008206594 0.0005460522 0.0004154034 0.0003939247
## Cumulative Proportion 0.9975778990 0.9981239512 0.9985393546 0.9989332793
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.0884287058 0.0871591363 0.0790331158 0.0714611656
## Proportion of Variance 0.0002172121 0.0002110199 0.0001735065 0.0001418527
## Cumulative Proportion 0.9991504915 0.9993615113 0.9995350178 0.9996768705
## Comp.21 Comp.22 Comp.23
## Standard deviation 0.05749144628 0.04968790949 0.0449502539
## Proportion of Variance 0.00009181296 0.00006858023 0.0000561257
## Cumulative Proportion 0.99976868348 0.99983726372 0.9998933894
## Comp.24 Comp.25 Comp.26
## Standard deviation 0.03889189165 0.03030543354 0.02213353849
## Proportion of Variance 0.00004201609 0.00002551165 0.00001360815
## Cumulative Proportion 0.99993540551 0.99996091716 0.99997452531
## Comp.27 Comp.28 Comp.29
## Standard deviation 0.01920916838 0.014230282288 0.012066499538
## Proportion of Variance 0.00001024978 0.000005625026 0.000004044456
## Cumulative Proportion 0.99998477509 0.999990400119 0.999994444575
## Comp.30 Comp.31 Comp.32
## Standard deviation 0.009987908692 0.007472693513 0.0057094604580
## Proportion of Variance 0.000002771064 0.000001551143 0.0000009054983
## Cumulative Proportion 0.999997215639 0.999998766782 0.9999996722804
## Comp.33 Comp.34
## Standard deviation 0.0033973376206 0.000505967065206
## Proportion of Variance 0.0000003206084 0.000000007111185
## Cumulative Proportion 0.9999999928888 1.000000000000000
## Comp.35 Comp.36
## Standard deviation 0.00000003304656565479524 0
## Proportion of Variance 0.00000000000000003033543 0
## Cumulative Proportion 0.99999999999999988897770 1
从累积贡献率结果分析,我们提取前三个主成分
## 提取前三个主成分
TCdel_olt.pc <- data.frame(predict(pc,TCdatadel_olt[,c(-1,-2)])[,1:3])
## 三维空间散点图
pch <- rep("。", nrow(TCdatadel_olt))
pch[which(TCdatadel_olt$性别 == 1)] <- "+"
col <- rep("blue", nrow(TCdatadel_olt))
col[which(TCdatadel_olt$性别 == 1)] <- "red"
par(family = "STKaiti")
scatterplot3d(x = TCdel_olt.pc$Comp.1, y=TCdel_olt.pc$Comp.2,
z=TCdel_olt.pc$Comp.3, color = col,pch = pch,
xlab = "主成分1",ylab = "主成分2",zlab = "主成分3",
main = "主成分散点图")
legend("topleft", inset=.05, # location and inset
bty="n", cex=.8, # suppress legend box, shrink text 50%
title="性别",
c("男","女"), fill=c("blue", "red"))## 从主成分图中,我们可以看出使用3个主成分已经很好的将男女分开了从主成分图中,我们可以看出使用3个主成分已经很好的将男女分开了