Assignment #8

Part 1: Using the mtcars data set, create a kmeans object from the first, second, and third columns

require(datasets)  ## Require datasets library to access mtcars dataset

## Standardize input variables (z-scores)
attach(mtcars)
Z_mpg <- (mpg - mean(mpg)) / sd(mpg)
Z_cyl <- (cyl - mean(cyl)) / sd(cyl)
Z_disp <- (disp - mean(disp)) / sd(disp)
mtcars_stdzd <- as.data.frame(cbind(Z_mpg, Z_cyl, Z_disp))
detach(mtcars)

set.seed(7)        ## Set randomization seed
## Perform a 3-cluster K-means cluster analysis
mtcars_k3 <- kmeans(mtcars_stdzd,
                    centers = 3)
mtcars_k3          ## Display clustering results

## K-means clustering with 3 clusters of sizes 11, 14, 7
## 
## Cluster means:
##         Z_mpg      Z_cyl     Z_disp
## 1  1.09060362 -1.2248578 -1.0132874
## 2 -0.82805177  1.0148821  0.9874085
## 3 -0.05770215 -0.1049878 -0.3825084
## 
## Clustering vector:
##  [1] 3 3 1 3 2 3 2 1 1 3 3 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 3 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 6.069269 6.232612 1.023746
##  (between_SS / total_SS =  85.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
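The manual standardization above can also be written with base R's scale(). The following is a minimal sketch, not the method used to produce the output shown; the names vars, z_manual, z_scale, and max_diff are introduced here for illustration only.

```r
vars <- c("mpg", "cyl", "disp")

# Manual z-scores, column by column: (x - mean) / sd
z_manual <- sapply(mtcars[, vars], function(x) (x - mean(x)) / sd(x))

# scale() centers each column on its mean and divides by its standard deviation
z_scale <- scale(mtcars[, vars])

# Largest absolute difference between the two results (expected to be ~0)
max_diff <- max(abs(z_manual - z_scale))
max_diff
```

If the two approaches agree, max_diff is zero up to floating-point error, so either form could be passed to kmeans().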

What is the size of each cluster?

mtcars_k3$size

## [1] 11 14  7

What are the centers of each cluster?

## Convert cluster centers' standardized scores back to original values
attach(mtcars)
mpg_cntr <- mtcars_k3$centers[, 1] * sd(mpg) + mean(mpg)
cyl_cntr <- mtcars_k3$centers[, 2] * sd(cyl) + mean(cyl)
disp_cntr <- mtcars_k3$centers[, 3] * sd(disp) + mean(disp)
detach(mtcars)
mtcars_k3_cntr <- as.data.frame(cbind(mpg_cntr, cyl_cntr, disp_cntr))

## Display cluster centers
mtcars_k3_cntr

##   mpg_cntr cyl_cntr disp_cntr
## 1 26.66364        4  105.1364
## 2 15.10000        8  353.1000
## 3 19.74286        6  183.3143
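Because standardization is a linear transformation, a center converted back to original units should equal the plain per-cluster mean of the raw variable. The sketch below checks this under the assumption that scale() is used for standardization (the code above standardizes by hand); the names vars, z, fit, centers_orig, and raw_means are illustrative.

```r
vars <- c("mpg", "cyl", "disp")
z <- scale(mtcars[, vars])

set.seed(7)
fit <- kmeans(z, centers = 3)

# Undo the standardization: multiply by each column's sd, then add its mean.
# scale() stores both as attributes of its result.
centers_orig <- sweep(fit$centers, 2, attr(z, "scaled:scale"), `*`)
centers_orig <- sweep(centers_orig, 2, attr(z, "scaled:center"), `+`)

# Per-cluster means of the raw variables, for comparison
raw_means <- aggregate(mtcars[, vars], list(cluster = fit$cluster), mean)

centers_orig
raw_means
```

The two tables should match row for row, which is a quick sanity check on the back-transformation.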

What is the average disp, wt, and qsec of each cluster?

# Merge cluster assignment to original data fields
mtcars_results <- cbind(mtcars[, c(1:3, 6:7)], mtcars_stdzd, mtcars_k3$cluster)
colnames(mtcars_results)[9] <- "cluster"

## Aggregate to get the average and standard deviation of five fields by cluster
mtcars_aggr <- do.call(data.frame,
                       aggregate(cbind(mpg, cyl, disp, wt, qsec) ~ cluster,
                                 data = mtcars_results,
                                 FUN = function(x) c(avg = mean(x), sd = sd(x))))
mtcars_aggr

##   cluster  mpg.avg   mpg.sd cyl.avg cyl.sd disp.avg  disp.sd   wt.avg
## 1       1 26.66364 4.509828       4      0 105.1364 26.87159 2.285727
## 2       2 15.10000 2.560048       8      0 353.1000 67.77132 3.999214
## 3       3 19.74286 1.453567       6      0 183.3143 41.56246 3.117143
##       wt.sd qsec.avg  qsec.sd
## 1 0.5695637 19.13727 1.682445
## 2 0.7594047 16.77214 1.196014
## 3 0.3563455 17.97714 1.706866

Describe each cluster in English

ANSWER: The standard deviation of cyl (number of cylinders) is 0 within every cluster, so the three clusters coincide exactly with the 4-, 6-, and 8-cylinder groups. The clusters are described below:

Cluster 1 consists of the 4-cylinder automobiles. It is the most fuel-efficient cluster, with the highest average mpg (miles per gallon) of the three. It also weighs the least, has the lowest displacement, and has the slowest acceleration (the highest qsec).

Cluster 2 consists of the 8-cylinder automobiles. These are the fastest cars, with an average 1/4 mile time of 16.77 seconds. That performance comes at a price: they weigh about 1.75 times as much as cluster 1 cars and have more than three times the average displacement. They are also the least fuel-efficient, averaging 15.10 miles per gallon against cluster 1’s 26.66.

Cluster 3 is the middle ground: the 6-cylinder vehicles. On miles per gallon, displacement, weight, and 1/4 mile time, it sits between clusters 1 and 2, a trade-off between performance and economy.
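The relationship between cluster membership and cylinder count can be checked directly with a cross-tabulation. This is a minimal sketch that refits the model with scale() for standardization; the names z, fit, and tab are illustrative, and cluster numbering may differ across R versions because the random number generator behind set.seed() has changed over time.

```r
z <- scale(mtcars[, c("mpg", "cyl", "disp")])

set.seed(7)
fit <- kmeans(z, centers = 3)

# Rows are clusters, columns are cylinder counts
tab <- table(cluster = fit$cluster, cyl = mtcars$cyl)
tab
```

If each row of the table has exactly one non-zero cell, every cluster contains cars of a single cylinder count.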

Part 2: Find a data set with at least 4 columns of numeric data and a categorical column

U <- read.csv("timesData.csv")  ## Obtained from Kaggle.com/mylesoneill/world-university-rankings

## Data transformation and manipulation
## Get 2016's top 200 academic institutes
topU.2016 <- U[which(U$year == 2016), ][1:200, ]
topU.2016$international <- as.numeric(as.character(topU.2016$international))
topU.2016$income <- as.numeric(as.character(topU.2016$income))

## Warning: NAs introduced by coercion

topU.2016$total_score <- as.numeric(as.character(topU.2016$total_score))
require(stringr)

## Loading required package: stringr

topU.2016$num_students <- as.numeric(gsub(",", "", as.character(topU.2016$num_students)))
topU.2016$international_students <-
  as.numeric(gsub("%", "", as.character(topU.2016$international_students))) / 100
## Female share: the text before the first space, located separately for each row
topU.2016$female_students <-
  (as.numeric(substr(as.character(topU.2016$female_male_ratio),
                     1,
                     regexpr(" ", as.character(topU.2016$female_male_ratio)) - 1))
  / 100)
topU.2016$world_rank <- as.numeric(gsub("=", "", as.character(topU.2016$world_rank)))
row.names(topU.2016) <- 1:200
topU.2016$seq <- 1:200

topU.2016$continent <-
  ifelse(topU.2016$country %in% c("South Africa"),
         "Africa",
  ifelse(topU.2016$country %in% c("China",
                                  "Hong Kong",
                                  "Japan",
                                  "South Korea",
                                  "Taiwan",
                                  "Singapore",
                                  "Israel"),
         "Asia",
  ifelse(topU.2016$country %in% c("Canada",
                                  "United States of America"),
         "North America",
  ifelse(topU.2016$country %in% c("Australia",
                                  "New Zealand"),
         "Oceania",
         "Europe"))))

Run several scatter plots of the data

require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.3

ggplot(topU.2016) +
  geom_point(aes(teaching, citations)) +
  labs(title = "Teaching vs. Citations")

ggplot(topU.2016) +
  geom_point(aes(teaching, research)) +
  labs(title = "Teaching vs. Research")

ggplot(topU.2016) +
  geom_point(aes(teaching, international)) +
  labs(title = "Teaching vs. International")

Create a kmeans object from the numeric data, you can pick K to be whatever you want

# Standardize input variables
attach(topU.2016)
teaching.Z      <- (teaching - mean(teaching)) / sd(teaching)
international.Z <- (international - mean(international)) / sd(international)
research.Z      <- (research - mean(research)) / sd(research)
citations.Z     <- (citations - mean(citations)) / sd(citations)
detach(topU.2016)

topU.2016.stdz <- as.data.frame(cbind(teaching.Z,
                                      international.Z,
                                      research.Z,
                                      citations.Z))

## Perform a 3-cluster K-means cluster analysis
topU.2016.K3 <- kmeans(topU.2016.stdz,
                       centers = 3)
topU.2016.K3

## K-means clustering with 3 clusters of sizes 48, 68, 84
## 
## Cluster means:
##    teaching.Z international.Z research.Z citations.Z
## 1  1.35344836       0.3698007  1.3569082   0.7269492
## 2 -0.05037795      -1.0402719 -0.1295164  -0.3762204
## 3 -0.73261692       0.6308102 -0.6705295  -0.1108402
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 2 1 1 2 2 1 1 2 2 1 1 2 1 1 2 1 3 1 3 2 3 3 2 2 2 2 3 2 2 2 3 3
##  [71] 3 2 3 2 2 3 2 2 2 3 2 3 3 2 2 3 2 2 3 2 3 2 3 3 2 3 3 3 2 2 3 3 3 2 3
## [106] 3 3 3 2 3 2 3 2 3 3 2 2 2 2 3 3 3 2 2 2 2 3 2 2 3 3 3 3 2 3 3 2 3 3 2
## [141] 3 2 3 3 2 3 2 2 2 3 3 3 2 3 3 2 3 2 3 3 2 3 2 3 3 2 3 3 2 3 3 3 3 2 2
## [176] 3 3 2 3 3 2 3 3 2 3 3 2 3 2 3 3 3 3 2 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  87.53971 183.68037 114.70979
##  (between_SS / total_SS =  51.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
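K-means starts from randomly chosen centers, so cluster labels and sizes can change between runs unless a seed is fixed. The sketch below illustrates this on the built-in iris data, since timesData.csv is not bundled with R; the seed value 42 and nstart = 25 are arbitrary choices for illustration, not settings used to produce the output above.

```r
z <- scale(iris[, 1:4])

# Same seed and same call: the two fits should be identical
set.seed(42)
fit_a <- kmeans(z, centers = 3, nstart = 25)

set.seed(42)
fit_b <- kmeans(z, centers = 3, nstart = 25)

same_clusters <- identical(fit_a$cluster, fit_b$cluster)
same_clusters
```

The nstart argument runs the algorithm from several random starts and keeps the solution with the lowest total within-cluster sum of squares, which makes the result less sensitive to any one starting configuration.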

Determine the size of each cluster

topU.2016.K3$size

## [1] 48 68 84

Determine the centers of each cluster

K3.centers <- as.data.frame(topU.2016.K3$centers)

## Convert standardized scores back to original values
attach(topU.2016)
cntr.teaching      <- K3.centers$teaching.Z * sd(teaching) + mean(teaching)
cntr.international <- K3.centers$international.Z * sd(international) + mean(international)
cntr.research      <- K3.centers$research.Z * sd(research) + mean(research)
cntr.citations     <- K3.centers$citations.Z * sd(citations) + mean(citations)
detach(topU.2016)
K3.centers <- as.data.frame(cbind(cntr.teaching,
                                  cntr.international,
                                  cntr.research,
                                  cntr.citations))
row.names(K3.centers) <- 1:3

## Display cluster centers
K3.centers

##   cntr.teaching cntr.international cntr.research cntr.citations
## 1      72.29583           74.03750      80.72500       92.21875
## 2      49.42941           46.39853      51.34559       78.10000
## 3      38.31667           79.15357      40.65238       81.49643

Compare the clusters to the categorical data column as we did with the iris$Species column

topU.2016$K3.clus <- topU.2016.K3$cluster
table(topU.2016$K3.clus, topU.2016$continent)

##    
##     Africa Asia Europe North America Oceania
##   1      0    2     17            27       2
##   2      0   10     26            32       0
##   3      1    3     62            11       7
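Raw counts can be hard to compare when clusters differ in size, so row proportions are often easier to read. The sketch below uses the iris data mentioned in the prompt rather than the university data; the names z, fit, tab, and row_share are illustrative.

```r
z <- scale(iris[, 1:4])

set.seed(1)
fit <- kmeans(z, centers = 3, nstart = 25)

# Counts of each species within each cluster
tab <- table(cluster = fit$cluster, species = iris$Species)

# Share of each species within a cluster (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
round(row_share, 2)
```

The same call on table(topU.2016$K3.clus, topU.2016$continent) would give the continent mix within each cluster.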

Part 3: For your chosen data set,

<1> Describe what each row of data represents

ANSWER: The data set lists the world’s top 200 universities in 2016 according to Times Higher Education’s World University Rankings. Each row represents one university. The data set was downloaded from kaggle.com.

<2> Describe each of your columns used - give a one sentence description of the column

ANSWER:

Column 1 “teaching”: A score ranging from 0 to 100 that indicates a university’s performance on teaching, with key factors including reputation for teaching, staff-to-student ratio, doctorate-to-bachelor’s ratio, doctorates awarded-to-academic staff ratio, and institutional income.

Column 2 “international”: A score ranging from 0 to 100 that indicates a university’s performance on international outlook, with key factors including international-to-domestic-student ratio, international-to-domestic-staff ratio, and international collaboration.

Column 3 “research”: A score ranging from 0 to 100 that indicates a university’s performance on research, with key factors including reputation for research, research income, and research productivity.

Column 4 “citations”: A score ranging from 0 to 100 that indicates a university’s performance on citations, i.e., research influence.

More information about the ranking methodology can be found on TimesHigherEducation.com.

<3> If you know it, describe how the data was generated

ANSWER: The underlying information used in the Times Higher Education World University Rankings comes mainly from two sources: 1) institutional data provided and signed off by the institutions for use in the rankings, and 2) the Academic Reputation Survey, conducted by Times Higher Education and targeted at experienced, published scholars. Thirteen performance indicators are captured, such as the doctorate-to-bachelor’s ratio and scholarly papers per academic staff member. They are assigned different weights and grouped into five areas: teaching, research, citations, international outlook, and industry income. Performance in each area is scaled to a 0-100 range. The total score is the weighted average of the scores in these five areas.

For the clusters

<1> Describe the size and means of clusters

topU.2016.K3$size

## [1] 48 68 84

aggregate(cbind(total_score,
                teaching, international, research, citations, income,
                num_students, student_staff_ratio, international_students, female_students)
            ~ K3.clus,
          topU.2016,
          FUN = mean)

##   K3.clus total_score teaching international research citations income
## 1       1    80.61905 72.10714      74.07857 80.73810  92.22381  61.30
## 2       2    58.17500 47.73000      47.42500 50.04167  79.35500  59.22
## 3       3    55.32800 38.21467      79.74667 40.04933  81.85733  52.22
##   num_students student_staff_ratio international_students female_students
## 1     23617.79            13.50238              0.2597619       0.4885714
## 2     29028.30            19.08833              0.1123333       0.4890000
## 3     18970.87            17.63733              0.2433333       0.5197333
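The formula interface of aggregate() drops any row with a missing value in the listed columns by default (na.action = na.omit), so group means computed this way can differ from means taken over all rows. The toy data frame below is made up purely to illustrate the behavior; the names df, complete_case, and per_column are illustrative.

```r
# Hypothetical data: group 1 has one missing y value
df <- data.frame(g = c(1, 1, 2, 2),
                 x = c(10, 20, 30, 40),
                 y = c(1, NA, 3, 4))

# Default: the row with the NA is dropped entirely, so x loses a value too
complete_case <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean)

# Alternative: keep all rows and skip NAs column by column
per_column <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean,
                        na.rm = TRUE, na.action = na.pass)

complete_case
per_column
```

In the first result the mean of x for group 1 is based on a single row, while in the second it uses both rows. Which behavior is preferable depends on whether the summary is meant to describe the same set of rows for every column.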

ANSWER:

Cluster 1 consists of 48 institutions. It scores the highest on four of the five performance indicators (teaching, research, citations, and income), and therefore the highest on total score, with an average of 80.62. On average there are 13.5 students for every academic staff member, indicating better staffing. The cluster also has a relatively high share of international students, 26% on average.

Cluster 2 consists of 68 institutions. It scores the lowest on International Outlook, with an average of 47.43, which is also reflected in its low average international student share of 11.2%. There are 19.1 students for every academic staff member, so students have fewer mentoring resources than those in cluster 1.

Cluster 3 contains the remaining 84 institutions. Its average total score of 55.33 is close to cluster 2’s 58.18. The main difference is International Outlook, where it outperforms even cluster 1 with an average score of 79.75. It also has the fewest students of the three clusters, about 19 thousand per institution.

Note that the averages in the table are computed only over rows with no missing values in any of the listed columns, because the formula interface of aggregate() drops incomplete rows by default. This is why they differ slightly from the cluster centers reported earlier.

<2> Give a one- or two-word description to each cluster - in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians

ANSWER:

Cluster 1: Top tier

Cluster 2: Traditional

Cluster 3: Diversified