This data is the data that is used to create college rankings every year. With this data, we will try to create clusters in the data surrounding each school’s In-State Tuition.
Educ<-read.csv("Educ4HW.csv")
summary(Educ)
## College State Public
## Concordia College : 2 NY : 34 Min. :0.0000
## University of St. Thomas : 2 TX : 34 1st Qu.:0.0000
## Westminster College : 2 OH : 29 Median :0.0000
## Alaska Pacific University: 1 IN : 21 Mean :0.2442
## Albertson College : 1 CA : 19 3rd Qu.:0.0000
## Albertus Magnus College : 1 NC : 17 Max. :1.0000
## (Other) :376 (Other):231
## MathSAT VerbSAT ACT Received
## Min. :330.0 Min. :290.0 Min. :11.00 Min. : 77
## 1st Qu.:474.0 1st Qu.:435.0 1st Qu.:21.00 1st Qu.: 682
## Median :516.0 Median :470.0 Median :23.00 Median : 1272
## Mean :515.6 Mean :469.9 Mean :22.82 Mean : 2691
## 3rd Qu.:554.0 3rd Qu.:500.0 3rd Qu.:24.00 3rd Qu.: 3270
## Max. :742.0 Max. :639.0 Max. :31.00 Max. :20192
##
## Accepted Enrolled Pct10 Pct25
## Min. : 61 Min. : 27.0 Min. : 1.00 Min. : 13.00
## 1st Qu.: 535 1st Qu.: 220.0 1st Qu.:16.00 1st Qu.: 42.00
## Median : 969 Median : 383.0 Median :24.00 Median : 54.00
## Mean : 1859 Mean : 778.3 Mean :27.11 Mean : 55.26
## 3rd Qu.: 2140 3rd Qu.: 824.0 3rd Qu.:34.00 3rd Qu.: 68.00
## Max. :15096 Max. :7425.0 Max. :96.00 Max. :100.00
##
## FTUG PTUG ISTuit OSTuit
## Min. : 139 Min. : 1.0 Min. : 608 Min. : 2883
## 1st Qu.: 909 1st Qu.: 81.0 1st Qu.: 4103 1st Qu.: 7248
## Median : 1601 Median : 302.0 Median : 9600 Median : 9700
## Mean : 3611 Mean : 975.2 Mean : 9187 Mean :10285
## 3rd Qu.: 3876 3rd Qu.: 993.0 3rd Qu.:12825 3rd Qu.:12900
## Max. :31643 Max. :21836.0 Max. :24940 Max. :24940
##
## Books PhDs SFRatio
## Min. : 96.0 Min. : 8.00 Min. : 2.50
## 1st Qu.: 475.0 1st Qu.: 60.00 1st Qu.:11.60
## Median : 504.0 Median : 74.00 Median :13.60
## Mean : 552.9 Mean : 71.29 Mean :14.45
## 3rd Qu.: 600.0 3rd Qu.: 84.00 3rd Qu.:16.50
## Max. :2340.0 Max. :100.00 Max. :91.80
##
library(fastcluster)
##
## Attaching package: 'fastcluster'
## The following object is masked from 'package:stats':
##
## hclust
library(stats)
First, we will do a cluster with 6 groups just to see what the data looks like and to see where the clusters are. Then, we will look at which schools are in which groups. When we do that, we will look at the state name because that helps identify each school without using the full school name.
set.seed(1)
grpEdu <- kmeans(Educ[,c(-1,-2)], centers=6, nstart=10)
clusEdu<-cbind(Educ,cluster=grpEdu$cluster)##cluster is a variable within that model
head(clusEdu)##says what cluster it is in
## College State Public MathSAT VerbSAT ACT Received
## 1 Christendom College VA 0 568 568 23 81
## 2 Mount Vernon College DC 0 440 402 20 149
## 3 King's College NY 0 442 465 25 356
## 4 Alaska Pacific University AK 0 490 482 20 193
## 5 Sierra Nevada College NV 0 400 400 21 200
## 6 Trinity College DC 0 470 480 22 247
## Accepted Enrolled Pct10 Pct25 FTUG PTUG ISTuit OSTuit Books PhDs SFRatio
## 1 72 51 33 71 139 3 8730 8730 400 92 9.3
## 2 70 61 15 35 203 138 13780 13780 500 84 5.7
## 3 233 53 16 24 246 18 8190 8190 500 100 7.0
## 4 146 55 16 44 249 869 7560 7560 800 76 11.9
## 5 160 100 5 25 300 50 7500 7500 500 45 7.5
## 6 189 100 19 49 309 639 11412 11412 500 89 8.3
## cluster
## 1 6
## 2 2
## 3 6
## 4 6
## 5 6
## 6 6
o=order(grpEdu$cluster)
data.frame(Educ$State[o],grpEdu$cluster[o])##name of utility company and cluster identity
## Educ.State.o. grpEdu.cluster.o.
## 1 FL 1
## 2 IL 1
## 3 OK 1
## 4 NM 1
## 5 TN 1
## 6 CA 1
## 7 TX 1
## 8 TX 1
## 9 TX 1
## 10 TX 1
## 11 NC 1
## 12 VA 1
## 13 SC 1
## 14 TX 1
## 15 TX 1
## 16 TX 1
## 17 AR 1
## 18 NC 1
## 19 TN 1
## 20 TX 1
## 21 TX 1
## 22 KY 1
## 23 FL 1
## 24 TX 1
## 25 OK 1
## 26 WV 1
## 27 MT 1
## 28 VT 1
## 29 MN 1
## 30 SC 1
## 31 GA 1
## 32 NC 1
## 33 SC 1
## 34 TX 1
## 35 FL 1
## 36 OH 1
## 37 SC 1
## 38 PA 1
## 39 CO 1
## 40 WV 1
## 41 LA 1
## 42 GA 1
## 43 VA 1
## 44 AR 1
## 45 SC 1
## 46 CO 1
## 47 NY 1
## 48 CO 1
## 49 FL 1
## 50 CO 1
## 51 FL 1
## 52 MO 1
## 53 TX 1
## 54 NY 1
## 55 TX 1
## 56 IN 1
## 57 MD 1
## 58 AL 1
## 59 SC 1
## 60 TX 1
## 61 NY 1
## 62 TX 1
## 63 MI 1
## 64 MN 1
## 65 DC 2
## 66 CT 2
## 67 NY 2
## 68 PA 2
## 69 VA 2
## 70 CA 2
## 71 MO 2
## 72 VA 2
## 73 IL 2
## 74 ID 2
## 75 MD 2
## 76 WV 2
## 77 OH 2
## 78 WI 2
## 79 CA 2
## 80 GA 2
## 81 PA 2
## 82 VA 2
## 83 CA 2
## 84 OH 2
## 85 IL 2
## 86 PA 2
## 87 OH 2
## 88 IN 2
## 89 OR 2
## 90 WI 2
## 91 NY 2
## 92 OH 2
## 93 IA 2
## 94 TN 2
## 95 IA 2
## 96 MA 2
## 97 OR 2
## 98 WA 2
## 99 CA 2
## 100 IA 2
## 101 MA 2
## 102 MN 2
## 103 PA 2
## 104 FL 2
## 105 WI 2
## 106 NC 2
## 107 CA 2
## 108 CA 2
## 109 OH 2
## 110 NY 2
## 111 WV 2
## 112 OR 2
## 113 MA 2
## 114 OR 2
## 115 NC 2
## 116 PA 2
## 117 CA 2
## 118 FL 2
## 119 OH 2
## 120 ME 2
## 121 PA 2
## 122 OH 2
## 123 IL 2
## 124 MA 2
## 125 IL 2
## 126 IA 2
## 127 OH 2
## 128 CA 2
## 129 NY 2
## 130 IA 2
## 131 MN 2
## 132 NY 2
## 133 CA 2
## 134 OH 2
## 135 WA 2
## 136 WA 2
## 137 OH 2
## 138 IN 2
## 139 NY 2
## 140 CO 2
## 141 CT 2
## 142 WA 2
## 143 MA 2
## 144 MN 2
## 145 CT 2
## 146 WA 2
## 147 IA 2
## 148 RI 2
## 149 CA 2
## 150 NY 3
## 151 NY 3
## 152 NY 3
## 153 TX 3
## 154 NY 3
## 155 MD 3
## 156 NC 3
## 157 TN 3
## 158 SC 3
## 159 NY 3
## 160 ID 3
## 161 WV 3
## 162 OH 3
## 163 VA 3
## 164 ND 3
## 165 CO 3
## 166 WI 3
## 167 NY 3
## 168 IN 3
## 169 MT 3
## 170 WA 3
## 171 IN 3
## 172 MS 3
## 173 TX 3
## 174 CO 3
## 175 TX 3
## 176 NC 3
## 177 NC 3
## 178 FL 3
## 179 UT 3
## 180 TX 3
## 181 OH 3
## 182 AZ 3
## 183 FL 3
## 184 SC 3
## 185 OK 3
## 186 CA 3
## 187 NC 3
## 188 OH 3
## 189 TX 3
## 190 UT 3
## 191 TX 3
## 192 FL 3
## 193 MO 3
## 194 OH 3
## 195 OH 3
## 196 CO 3
## 197 TN 3
## 198 IN 3
## 199 AL 3
## 200 LA 3
## 201 CA 3
## 202 NE 3
## 203 MN 3
## 204 MD 4
## 205 PA 4
## 206 MA 4
## 207 TN 4
## 208 GA 4
## 209 RI 4
## 210 NC 4
## 211 NY 4
## 212 IL 4
## 213 PA 4
## 214 MA 4
## 215 CA 4
## 216 MA 4
## 217 AZ 5
## 218 MI 5
## 219 AZ 5
## 220 IN 5
## 221 IL 5
## 222 MI 5
## 223 TX 5
## 224 TX 5
## 225 VA 6
## 226 NY 6
## 227 AK 6
## 228 NV 6
## 229 DC 6
## 230 NC 6
## 231 IN 6
## 232 OH 6
## 233 NJ 6
## 234 GA 6
## 235 IN 6
## 236 WA 6
## 237 VA 6
## 238 NY 6
## 239 IN 6
## 240 TX 6
## 241 MI 6
## 242 NC 6
## 243 AL 6
## 244 MI 6
## 245 OK 6
## 246 SC 6
## 247 TX 6
## 248 TN 6
## 249 NC 6
## 250 KS 6
## 251 KY 6
## 252 CA 6
## 253 OH 6
## 254 NC 6
## 255 MO 6
## 256 OH 6
## 257 IA 6
## 258 KY 6
## 259 MO 6
## 260 OH 6
## 261 CA 6
## 262 NY 6
## 263 WI 6
## 264 IN 6
## 265 MS 6
## 266 NY 6
## 267 IA 6
## 268 MT 6
## 269 WV 6
## 270 OH 6
## 271 WV 6
## 272 LA 6
## 273 CA 6
## 274 PA 6
## 275 WV 6
## 276 VA 6
## 277 OR 6
## 278 IN 6
## 279 IN 6
## 280 MO 6
## 281 KY 6
## 282 NM 6
## 283 IL 6
## 284 NY 6
## 285 TX 6
## 286 IN 6
## 287 IL 6
## 288 WI 6
## 289 KY 6
## 290 AR 6
## 291 WV 6
## 292 IN 6
## 293 TN 6
## 294 NY 6
## 295 TX 6
## 296 IA 6
## 297 NY 6
## 298 IN 6
## 299 CA 6
## 300 SC 6
## 301 MO 6
## 302 PA 6
## 303 TX 6
## 304 NY 6
## 305 SC 6
## 306 TN 6
## 307 IL 6
## 308 MS 6
## 309 NY 6
## 310 MT 6
## 311 KY 6
## 312 TX 6
## 313 SC 6
## 314 MI 6
## 315 OH 6
## 316 NY 6
## 317 PA 6
## 318 TX 6
## 319 NC 6
## 320 KY 6
## 321 NC 6
## 322 MN 6
## 323 NY 6
## 324 WA 6
## 325 NY 6
## 326 MO 6
## 327 IA 6
## 328 OH 6
## 329 PA 6
## 330 IL 6
## 331 MO 6
## 332 MN 6
## 333 AL 6
## 334 PA 6
## 335 IN 6
## 336 IN 6
## 337 FL 6
## 338 TX 6
## 339 NC 6
## 340 OH 6
## 341 MI 6
## 342 NY 6
## 343 TX 6
## 344 NY 6
## 345 NY 6
## 346 MN 6
## 347 MN 6
## 348 MN 6
## 349 TX 6
## 350 MO 6
## 351 IN 6
## 352 IL 6
## 353 FL 6
## 354 IN 6
## 355 OH 6
## 356 CA 6
## 357 GA 6
## 358 WI 6
## 359 TN 6
## 360 MN 6
## 361 MI 6
## 362 TX 6
## 363 MO 6
## 364 PA 6
## 365 PA 6
## 366 IN 6
## 367 OK 6
## 368 NY 6
## 369 PA 6
## 370 OH 6
## 371 LA 6
## 372 OH 6
## 373 OK 6
## 374 GA 6
## 375 OH 6
## 376 MI 6
## 377 NY 6
## 378 NY 6
## 379 MN 6
## 380 IL 6
## 381 FL 6
## 382 IL 6
## 383 OH 6
## 384 IL 6
## 385 WI 6
Next, we will summarize and visualize the In-State Tuition data. On the graph, we will plot it against ACT data to see if there is visually a relationship between the two variables. We will also see how the groups are split up on based on these two variables.
summary(Educ$ISTuit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 608 4103 9600 9187 12825 24940
plot(Educ$ISTuit,Educ$ACT,type="n",xlim=c(600,25000),xlab="In-State Tuition", ylab="ACT")
text(x=Educ$ISTuit, y=Educ$ACT, labels=Educ$State, col=rainbow(6)[grpEdu$cluster])
Based on this graph, there is some clear seperation between the red, purple, and yellow groups. However, the green, dark blue, and light blue groups greatly overlap. Therefore, we could possibly lessen the number of groups to three.
Next, we will conduct an F-test to see if In-State Tuition distinguishes the groups.
m1<-aov(ISTuit ~ as.factor(cluster),data=clusEdu)##analysis of variance model
summary(m1)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 5 9.141e+09 1.828e+09 590.3 <2e-16 ***
## Residuals 379 1.174e+09 3.097e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic is the ratio of what we know to what we do not know. Because this F-Statistic is so high, we know a large amount based on the In-State Tuition data. Additionally, the p-value is very small which means that there is a significant difference between these six groups based on in-state tuition costs.
Now we will get the group means for each cluster to compare the between group variation to the in group variation.
grpEdu$centers
## Public MathSAT VerbSAT ACT Received Accepted Enrolled
## 1 0.50000000 479.6719 436.9531 20.93750 1548.281 1067.7031 503.2812
## 2 0.01176471 553.3765 507.3647 24.61176 1810.671 1247.8706 404.0353
## 3 0.98148148 511.9074 456.6481 22.27778 6370.667 4363.1667 2128.8519
## 4 0.00000000 644.3846 564.8462 27.61538 10978.308 5620.4615 1806.0769
## 5 1.00000000 566.6250 490.1250 24.62500 15613.250 11946.0000 5332.7500
## 6 0.00000000 498.2360 458.9441 22.32919 1063.410 850.6584 322.8820
## Pct10 Pct25 FTUG PTUG ISTuit OSTuit Books
## 1 20.56250 46.82812 2297.000 680.0625 3911.781 5922.531 535.2188
## 2 36.47059 66.09412 1580.976 227.1294 15471.082 15471.082 531.2588
## 3 22.24074 53.79630 10753.741 3422.4815 2127.685 6677.074 569.8704
## 4 65.84615 86.00000 7520.462 1481.6154 17264.385 17264.385 629.5385
## 5 39.50000 73.12500 25598.375 3694.1250 2527.875 8605.500 581.2500
## 6 22.66460 50.02484 1400.565 490.5590 10011.745 10011.758 558.0745
## PhDs SFRatio
## 1 62.53125 16.29687
## 2 81.31765 12.10118
## 3 80.48148 17.90370
## 4 90.76923 8.40000
## 5 88.25000 18.18750
## 6 63.96894 14.09689
In some of these variables, there seems to be a larger amount of in-group variation compared to between group variation. For example, the ACT means are very close together as are the Books means. This means that there is not a big enough difference between the groups and fewer groups may solve this.
educClus<-Educ[,c(-1,-2)]
head(educClus)
## Public MathSAT VerbSAT ACT Received Accepted Enrolled Pct10 Pct25 FTUG
## 1 0 568 568 23 81 72 51 33 71 139
## 2 0 440 402 20 149 70 61 15 35 203
## 3 0 442 465 25 356 233 53 16 24 246
## 4 0 490 482 20 193 146 55 16 44 249
## 5 0 400 400 21 200 160 100 5 25 300
## 6 0 470 480 22 247 189 100 19 49 309
## PTUG ISTuit OSTuit Books PhDs SFRatio
## 1 3 8730 8730 400 92 9.3
## 2 138 13780 13780 500 84 5.7
## 3 18 8190 8190 500 100 7.0
## 4 869 7560 7560 800 76 11.9
## 5 50 7500 7500 500 45 7.5
## 6 639 11412 11412 500 89 8.3
eduDis<-dist(educClus,method="euclidean",diag=FALSE,upper=FALSE)##don't want just half of it
uc1<-hclust(eduDis,method="complete")
plot(uc1,hang=0.1,main="Cluster Dendogram")
This is a cluster dendogram of the data. As you look towards the bottom, it becomes difficult to read. However, if you draw a line near the top of the graph, you intersect three lines. This means that three groups may be better than six.
Now we will rerun the cluster code for three groups and plot the clusters on a graph. We will still use In-State Tuition as the x value and ACT score as the y value.
set.seed(1)
grpEdu <- kmeans(Educ[,c(-1,-2)], centers=3, nstart=10)
clusEdu<-cbind(Educ,cluster=grpEdu$cluster)
plot(Educ$ISTuit, Educ$ACT, type="n", xlim=c(600,25000),xlab="In-State Tuition", ylab="ACT")
text(x=Educ$ISTuit, y=Educ$ACT, labels=Educ$State, col=rainbow(6)[grpEdu$cluster])
This graph shows that the individual clusters are somewhat more well-defined. However, the green and yellow clusters are mixed together. Overall, these clusters seem to be better because there is less overlap than the previous plot.
Now we will see how the p-values and the F-Statistics compare to those of the six group cluster.
m1 <- aov(ISTuit ~ as.factor(cluster), data=clusEdu)
summary(m1)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 7.165e+09 3.582e+09 434.4 <2e-16 ***
## Residuals 382 3.150e+09 8.246e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compared to the six-cluster model, the p-values are exactly the same. This means that there is still a significant difference between these groups based on in-state tuition costs. However, the F-Statistic is smaller. We might want to try another number of clusters to see if the F-Statistic increases and keeps the p-value small. We would try either four or five clusters next time.