Cluster Analysis of College Data

This data is the data that is used to create college rankings every year. With this data, we will try to create clusters in the data surrounding each school’s In-State Tuition.

Educ<-read.csv("Educ4HW.csv")
summary(Educ)
##                       College        State         Public      
##  Concordia College        :  2   NY     : 34   Min.   :0.0000  
##  University of St. Thomas :  2   TX     : 34   1st Qu.:0.0000  
##  Westminster College      :  2   OH     : 29   Median :0.0000  
##  Alaska Pacific University:  1   IN     : 21   Mean   :0.2442  
##  Albertson College        :  1   CA     : 19   3rd Qu.:0.0000  
##  Albertus Magnus College  :  1   NC     : 17   Max.   :1.0000  
##  (Other)                  :376   (Other):231                   
##     MathSAT         VerbSAT           ACT           Received    
##  Min.   :330.0   Min.   :290.0   Min.   :11.00   Min.   :   77  
##  1st Qu.:474.0   1st Qu.:435.0   1st Qu.:21.00   1st Qu.:  682  
##  Median :516.0   Median :470.0   Median :23.00   Median : 1272  
##  Mean   :515.6   Mean   :469.9   Mean   :22.82   Mean   : 2691  
##  3rd Qu.:554.0   3rd Qu.:500.0   3rd Qu.:24.00   3rd Qu.: 3270  
##  Max.   :742.0   Max.   :639.0   Max.   :31.00   Max.   :20192  
##                                                                 
##     Accepted        Enrolled          Pct10           Pct25       
##  Min.   :   61   Min.   :  27.0   Min.   : 1.00   Min.   : 13.00  
##  1st Qu.:  535   1st Qu.: 220.0   1st Qu.:16.00   1st Qu.: 42.00  
##  Median :  969   Median : 383.0   Median :24.00   Median : 54.00  
##  Mean   : 1859   Mean   : 778.3   Mean   :27.11   Mean   : 55.26  
##  3rd Qu.: 2140   3rd Qu.: 824.0   3rd Qu.:34.00   3rd Qu.: 68.00  
##  Max.   :15096   Max.   :7425.0   Max.   :96.00   Max.   :100.00  
##                                                                   
##       FTUG            PTUG             ISTuit          OSTuit     
##  Min.   :  139   Min.   :    1.0   Min.   :  608   Min.   : 2883  
##  1st Qu.:  909   1st Qu.:   81.0   1st Qu.: 4103   1st Qu.: 7248  
##  Median : 1601   Median :  302.0   Median : 9600   Median : 9700  
##  Mean   : 3611   Mean   :  975.2   Mean   : 9187   Mean   :10285  
##  3rd Qu.: 3876   3rd Qu.:  993.0   3rd Qu.:12825   3rd Qu.:12900  
##  Max.   :31643   Max.   :21836.0   Max.   :24940   Max.   :24940  
##                                                                   
##      Books             PhDs           SFRatio     
##  Min.   :  96.0   Min.   :  8.00   Min.   : 2.50  
##  1st Qu.: 475.0   1st Qu.: 60.00   1st Qu.:11.60  
##  Median : 504.0   Median : 74.00   Median :13.60  
##  Mean   : 552.9   Mean   : 71.29   Mean   :14.45  
##  3rd Qu.: 600.0   3rd Qu.: 84.00   3rd Qu.:16.50  
##  Max.   :2340.0   Max.   :100.00   Max.   :91.80  
## 
library(fastcluster)
## 
## Attaching package: 'fastcluster'
## The following object is masked from 'package:stats':
## 
##     hclust
library(stats)

Initial Cluster

First, we will do a cluster with 6 groups just to see what the data looks like and to see where the clusters are. Then, we will look at which schools are in which groups. When we do that, we will look at the state name because that helps identify each school without using the full school name.

set.seed(1)
grpEdu <- kmeans(Educ[,c(-1,-2)], centers=6, nstart=10)
clusEdu<-cbind(Educ,cluster=grpEdu$cluster)##cluster is a variable within that model
head(clusEdu)##says what cluster it is in
##                     College State Public MathSAT VerbSAT ACT Received
## 1       Christendom College    VA      0     568     568  23       81
## 2      Mount Vernon College    DC      0     440     402  20      149
## 3            King's College    NY      0     442     465  25      356
## 4 Alaska Pacific University    AK      0     490     482  20      193
## 5     Sierra Nevada College    NV      0     400     400  21      200
## 6           Trinity College    DC      0     470     480  22      247
##   Accepted Enrolled Pct10 Pct25 FTUG PTUG ISTuit OSTuit Books PhDs SFRatio
## 1       72       51    33    71  139    3   8730   8730   400   92     9.3
## 2       70       61    15    35  203  138  13780  13780   500   84     5.7
## 3      233       53    16    24  246   18   8190   8190   500  100     7.0
## 4      146       55    16    44  249  869   7560   7560   800   76    11.9
## 5      160      100     5    25  300   50   7500   7500   500   45     7.5
## 6      189      100    19    49  309  639  11412  11412   500   89     8.3
##   cluster
## 1       6
## 2       2
## 3       6
## 4       6
## 5       6
## 6       6
o=order(grpEdu$cluster)
data.frame(Educ$State[o],grpEdu$cluster[o])##name of utility company and cluster identity
##     Educ.State.o. grpEdu.cluster.o.
## 1              FL                 1
## 2              IL                 1
## 3              OK                 1
## 4              NM                 1
## 5              TN                 1
## 6              CA                 1
## 7              TX                 1
## 8              TX                 1
## 9              TX                 1
## 10             TX                 1
## 11             NC                 1
## 12             VA                 1
## 13             SC                 1
## 14             TX                 1
## 15             TX                 1
## 16             TX                 1
## 17             AR                 1
## 18             NC                 1
## 19             TN                 1
## 20             TX                 1
## 21             TX                 1
## 22             KY                 1
## 23             FL                 1
## 24             TX                 1
## 25             OK                 1
## 26             WV                 1
## 27             MT                 1
## 28             VT                 1
## 29             MN                 1
## 30             SC                 1
## 31             GA                 1
## 32             NC                 1
## 33             SC                 1
## 34             TX                 1
## 35             FL                 1
## 36             OH                 1
## 37             SC                 1
## 38             PA                 1
## 39             CO                 1
## 40             WV                 1
## 41             LA                 1
## 42             GA                 1
## 43             VA                 1
## 44             AR                 1
## 45             SC                 1
## 46             CO                 1
## 47             NY                 1
## 48             CO                 1
## 49             FL                 1
## 50             CO                 1
## 51             FL                 1
## 52             MO                 1
## 53             TX                 1
## 54             NY                 1
## 55             TX                 1
## 56             IN                 1
## 57             MD                 1
## 58             AL                 1
## 59             SC                 1
## 60             TX                 1
## 61             NY                 1
## 62             TX                 1
## 63             MI                 1
## 64             MN                 1
## 65             DC                 2
## 66             CT                 2
## 67             NY                 2
## 68             PA                 2
## 69             VA                 2
## 70             CA                 2
## 71             MO                 2
## 72             VA                 2
## 73             IL                 2
## 74             ID                 2
## 75             MD                 2
## 76             WV                 2
## 77             OH                 2
## 78             WI                 2
## 79             CA                 2
## 80             GA                 2
## 81             PA                 2
## 82             VA                 2
## 83             CA                 2
## 84             OH                 2
## 85             IL                 2
## 86             PA                 2
## 87             OH                 2
## 88             IN                 2
## 89             OR                 2
## 90             WI                 2
## 91             NY                 2
## 92             OH                 2
## 93             IA                 2
## 94             TN                 2
## 95             IA                 2
## 96             MA                 2
## 97             OR                 2
## 98             WA                 2
## 99             CA                 2
## 100            IA                 2
## 101            MA                 2
## 102            MN                 2
## 103            PA                 2
## 104            FL                 2
## 105            WI                 2
## 106            NC                 2
## 107            CA                 2
## 108            CA                 2
## 109            OH                 2
## 110            NY                 2
## 111            WV                 2
## 112            OR                 2
## 113            MA                 2
## 114            OR                 2
## 115            NC                 2
## 116            PA                 2
## 117            CA                 2
## 118            FL                 2
## 119            OH                 2
## 120            ME                 2
## 121            PA                 2
## 122            OH                 2
## 123            IL                 2
## 124            MA                 2
## 125            IL                 2
## 126            IA                 2
## 127            OH                 2
## 128            CA                 2
## 129            NY                 2
## 130            IA                 2
## 131            MN                 2
## 132            NY                 2
## 133            CA                 2
## 134            OH                 2
## 135            WA                 2
## 136            WA                 2
## 137            OH                 2
## 138            IN                 2
## 139            NY                 2
## 140            CO                 2
## 141            CT                 2
## 142            WA                 2
## 143            MA                 2
## 144            MN                 2
## 145            CT                 2
## 146            WA                 2
## 147            IA                 2
## 148            RI                 2
## 149            CA                 2
## 150            NY                 3
## 151            NY                 3
## 152            NY                 3
## 153            TX                 3
## 154            NY                 3
## 155            MD                 3
## 156            NC                 3
## 157            TN                 3
## 158            SC                 3
## 159            NY                 3
## 160            ID                 3
## 161            WV                 3
## 162            OH                 3
## 163            VA                 3
## 164            ND                 3
## 165            CO                 3
## 166            WI                 3
## 167            NY                 3
## 168            IN                 3
## 169            MT                 3
## 170            WA                 3
## 171            IN                 3
## 172            MS                 3
## 173            TX                 3
## 174            CO                 3
## 175            TX                 3
## 176            NC                 3
## 177            NC                 3
## 178            FL                 3
## 179            UT                 3
## 180            TX                 3
## 181            OH                 3
## 182            AZ                 3
## 183            FL                 3
## 184            SC                 3
## 185            OK                 3
## 186            CA                 3
## 187            NC                 3
## 188            OH                 3
## 189            TX                 3
## 190            UT                 3
## 191            TX                 3
## 192            FL                 3
## 193            MO                 3
## 194            OH                 3
## 195            OH                 3
## 196            CO                 3
## 197            TN                 3
## 198            IN                 3
## 199            AL                 3
## 200            LA                 3
## 201            CA                 3
## 202            NE                 3
## 203            MN                 3
## 204            MD                 4
## 205            PA                 4
## 206            MA                 4
## 207            TN                 4
## 208            GA                 4
## 209            RI                 4
## 210            NC                 4
## 211            NY                 4
## 212            IL                 4
## 213            PA                 4
## 214            MA                 4
## 215            CA                 4
## 216            MA                 4
## 217            AZ                 5
## 218            MI                 5
## 219            AZ                 5
## 220            IN                 5
## 221            IL                 5
## 222            MI                 5
## 223            TX                 5
## 224            TX                 5
## 225            VA                 6
## 226            NY                 6
## 227            AK                 6
## 228            NV                 6
## 229            DC                 6
## 230            NC                 6
## 231            IN                 6
## 232            OH                 6
## 233            NJ                 6
## 234            GA                 6
## 235            IN                 6
## 236            WA                 6
## 237            VA                 6
## 238            NY                 6
## 239            IN                 6
## 240            TX                 6
## 241            MI                 6
## 242            NC                 6
## 243            AL                 6
## 244            MI                 6
## 245            OK                 6
## 246            SC                 6
## 247            TX                 6
## 248            TN                 6
## 249            NC                 6
## 250            KS                 6
## 251            KY                 6
## 252            CA                 6
## 253            OH                 6
## 254            NC                 6
## 255            MO                 6
## 256            OH                 6
## 257            IA                 6
## 258            KY                 6
## 259            MO                 6
## 260            OH                 6
## 261            CA                 6
## 262            NY                 6
## 263            WI                 6
## 264            IN                 6
## 265            MS                 6
## 266            NY                 6
## 267            IA                 6
## 268            MT                 6
## 269            WV                 6
## 270            OH                 6
## 271            WV                 6
## 272            LA                 6
## 273            CA                 6
## 274            PA                 6
## 275            WV                 6
## 276            VA                 6
## 277            OR                 6
## 278            IN                 6
## 279            IN                 6
## 280            MO                 6
## 281            KY                 6
## 282            NM                 6
## 283            IL                 6
## 284            NY                 6
## 285            TX                 6
## 286            IN                 6
## 287            IL                 6
## 288            WI                 6
## 289            KY                 6
## 290            AR                 6
## 291            WV                 6
## 292            IN                 6
## 293            TN                 6
## 294            NY                 6
## 295            TX                 6
## 296            IA                 6
## 297            NY                 6
## 298            IN                 6
## 299            CA                 6
## 300            SC                 6
## 301            MO                 6
## 302            PA                 6
## 303            TX                 6
## 304            NY                 6
## 305            SC                 6
## 306            TN                 6
## 307            IL                 6
## 308            MS                 6
## 309            NY                 6
## 310            MT                 6
## 311            KY                 6
## 312            TX                 6
## 313            SC                 6
## 314            MI                 6
## 315            OH                 6
## 316            NY                 6
## 317            PA                 6
## 318            TX                 6
## 319            NC                 6
## 320            KY                 6
## 321            NC                 6
## 322            MN                 6
## 323            NY                 6
## 324            WA                 6
## 325            NY                 6
## 326            MO                 6
## 327            IA                 6
## 328            OH                 6
## 329            PA                 6
## 330            IL                 6
## 331            MO                 6
## 332            MN                 6
## 333            AL                 6
## 334            PA                 6
## 335            IN                 6
## 336            IN                 6
## 337            FL                 6
## 338            TX                 6
## 339            NC                 6
## 340            OH                 6
## 341            MI                 6
## 342            NY                 6
## 343            TX                 6
## 344            NY                 6
## 345            NY                 6
## 346            MN                 6
## 347            MN                 6
## 348            MN                 6
## 349            TX                 6
## 350            MO                 6
## 351            IN                 6
## 352            IL                 6
## 353            FL                 6
## 354            IN                 6
## 355            OH                 6
## 356            CA                 6
## 357            GA                 6
## 358            WI                 6
## 359            TN                 6
## 360            MN                 6
## 361            MI                 6
## 362            TX                 6
## 363            MO                 6
## 364            PA                 6
## 365            PA                 6
## 366            IN                 6
## 367            OK                 6
## 368            NY                 6
## 369            PA                 6
## 370            OH                 6
## 371            LA                 6
## 372            OH                 6
## 373            OK                 6
## 374            GA                 6
## 375            OH                 6
## 376            MI                 6
## 377            NY                 6
## 378            NY                 6
## 379            MN                 6
## 380            IL                 6
## 381            FL                 6
## 382            IL                 6
## 383            OH                 6
## 384            IL                 6
## 385            WI                 6

Summarize In-State Tuition to figure out range of x variable

Next, we will summarize and visualize the In-State Tuition data. On the graph, we will plot it against ACT data to see if there is visually a relationship between the two variables. We will also see how the groups are split up on based on these two variables.

summary(Educ$ISTuit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     608    4103    9600    9187   12825   24940
plot(Educ$ISTuit,Educ$ACT,type="n",xlim=c(600,25000),xlab="In-State Tuition", ylab="ACT")
text(x=Educ$ISTuit, y=Educ$ACT, labels=Educ$State, col=rainbow(6)[grpEdu$cluster])

Based on this graph, there is some clear seperation between the red, purple, and yellow groups. However, the green, dark blue, and light blue groups greatly overlap. Therefore, we could possibly lessen the number of groups to three.

Does In-State Tuition seem to distinguish these groups?

Next, we will conduct an F-test to see if In-State Tuition distinguishes the groups.

m1<-aov(ISTuit ~ as.factor(cluster),data=clusEdu)##analysis of variance model
summary(m1)
##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(cluster)   5 9.141e+09 1.828e+09   590.3 <2e-16 ***
## Residuals          379 1.174e+09 3.097e+06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F statistic is the ratio of what we know to what we do not know. Because this F-Statistic is so high, we know a large amount based on the In-State Tuition data. Additionally, the p-value is very small which means that there is a significant difference between these six groups based on in-state tuition costs.

Get group means for each cluster

Now we will get the group means for each cluster to compare the between group variation to the in group variation.

grpEdu$centers
##       Public  MathSAT  VerbSAT      ACT  Received   Accepted  Enrolled
## 1 0.50000000 479.6719 436.9531 20.93750  1548.281  1067.7031  503.2812
## 2 0.01176471 553.3765 507.3647 24.61176  1810.671  1247.8706  404.0353
## 3 0.98148148 511.9074 456.6481 22.27778  6370.667  4363.1667 2128.8519
## 4 0.00000000 644.3846 564.8462 27.61538 10978.308  5620.4615 1806.0769
## 5 1.00000000 566.6250 490.1250 24.62500 15613.250 11946.0000 5332.7500
## 6 0.00000000 498.2360 458.9441 22.32919  1063.410   850.6584  322.8820
##      Pct10    Pct25      FTUG      PTUG    ISTuit    OSTuit    Books
## 1 20.56250 46.82812  2297.000  680.0625  3911.781  5922.531 535.2188
## 2 36.47059 66.09412  1580.976  227.1294 15471.082 15471.082 531.2588
## 3 22.24074 53.79630 10753.741 3422.4815  2127.685  6677.074 569.8704
## 4 65.84615 86.00000  7520.462 1481.6154 17264.385 17264.385 629.5385
## 5 39.50000 73.12500 25598.375 3694.1250  2527.875  8605.500 581.2500
## 6 22.66460 50.02484  1400.565  490.5590 10011.745 10011.758 558.0745
##       PhDs  SFRatio
## 1 62.53125 16.29687
## 2 81.31765 12.10118
## 3 80.48148 17.90370
## 4 90.76923  8.40000
## 5 88.25000 18.18750
## 6 63.96894 14.09689

In some of these variables, there seems to be a larger amount of in-group variation compared to between group variation. For example, the ACT means are very close together as are the Books means. This means that there is not a big enough difference between the groups and fewer groups may solve this.

Heirarchical Clustering for Education Data

educClus<-Educ[,c(-1,-2)]
head(educClus)
##   Public MathSAT VerbSAT ACT Received Accepted Enrolled Pct10 Pct25 FTUG
## 1      0     568     568  23       81       72       51    33    71  139
## 2      0     440     402  20      149       70       61    15    35  203
## 3      0     442     465  25      356      233       53    16    24  246
## 4      0     490     482  20      193      146       55    16    44  249
## 5      0     400     400  21      200      160      100     5    25  300
## 6      0     470     480  22      247      189      100    19    49  309
##   PTUG ISTuit OSTuit Books PhDs SFRatio
## 1    3   8730   8730   400   92     9.3
## 2  138  13780  13780   500   84     5.7
## 3   18   8190   8190   500  100     7.0
## 4  869   7560   7560   800   76    11.9
## 5   50   7500   7500   500   45     7.5
## 6  639  11412  11412   500   89     8.3

Compute a distance matrix with Euchlidean distance

eduDis<-dist(educClus,method="euclidean",diag=FALSE,upper=FALSE)##don't want just half of it
uc1<-hclust(eduDis,method="complete")
plot(uc1,hang=0.1,main="Cluster Dendogram")

This is a cluster dendogram of the data. As you look towards the bottom, it becomes difficult to read. However, if you draw a line near the top of the graph, you intersect three lines. This means that three groups may be better than six.

Three Clusters

Now we will rerun the cluster code for three groups and plot the clusters on a graph. We will still use In-State Tuition as the x value and ACT score as the y value.

set.seed(1)
grpEdu <- kmeans(Educ[,c(-1,-2)], centers=3, nstart=10)
clusEdu<-cbind(Educ,cluster=grpEdu$cluster)
plot(Educ$ISTuit, Educ$ACT, type="n", xlim=c(600,25000),xlab="In-State Tuition", ylab="ACT")
text(x=Educ$ISTuit, y=Educ$ACT, labels=Educ$State, col=rainbow(6)[grpEdu$cluster])

This graph shows that the individual clusters are somewhat more well-defined. However, the green and yellow clusters are mixed together. Overall, these clusters seem to be better because there is less overlap than the previous plot.

In-State Tuition Cluster Model with 3 Clusters

Now we will see how the p-values and the F-Statistics compare to those of the six group cluster.

m1 <- aov(ISTuit ~ as.factor(cluster), data=clusEdu) 
summary(m1)
##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(cluster)   2 7.165e+09 3.582e+09   434.4 <2e-16 ***
## Residuals          382 3.150e+09 8.246e+06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compared to the six-cluster model, the p-values are exactly the same. This means that there is still a significant difference between these groups based on in-state tuition costs. However, the F-Statistic is smaller. We might want to try another number of clusters to see if the F-Statistic increases and keeps the p-value small. We would try either four or five clusters next time.