Part 1

Using the mtcars data set: Create a kmeans object from the first, second, and third columns. What is the size of each cluster? What are the centers of each cluster? What is the average disp, wt, and qsec of each cluster? Describe each cluster in English.

data("mtcars")

First we want to decide how many clusters to create

set.seed(8)
wi = numeric(9)

for (i in 2:10) {
  z = kmeans(mtcars[,1:3], i)
  # store the total within-cluster sum of squares for this k
  wi[i-1] = z$tot.withinss
}
# plot the within-cluster sum of squares against k and look for the elbow
plot(2:10, wi, type='l', xlab="k=2:10", ylab="Within Sum of Squares")
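
As a cross-check on the quantity being plotted, the sketch below fits k-means for each k and confirms that kmeans() splits the total sum of squares as totss = tot.withinss + betweenss. The ratio betweenss / totss therefore rises exactly as the within-cluster sum of squares falls. This is not part of the original analysis: `fits`, `within`, `between`, `total`, and `explained` are illustrative names, and `nstart = 20` is an added assumption.

```r
data("mtcars")
set.seed(8)

ks <- 2:10
# One fit per k; nstart = 20 is an assumption to reduce sensitivity to the random start
fits <- lapply(ks, function(k) kmeans(mtcars[, 1:3], centers = k, nstart = 20))

within  <- sapply(fits, function(z) z$tot.withinss)  # total within-cluster SS
between <- sapply(fits, function(z) z$betweenss)     # between-cluster SS
total   <- sapply(fits, function(z) z$totss)         # total SS (same for every k)

# Share of the total sum of squares explained by the clustering
explained <- between / total
print(data.frame(k = ks, within = within, explained = round(explained, 3)))
```

Either `within` or `explained` can be plotted against k; the elbow appears at the same k in both, because one is a mirror image of the other.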

What is the size of each cluster?

r = kmeans(mtcars[,1:3], centers =4, nstart=20)
r$size
## [1]  4  8  9 11

The sizes of the four clusters are 4, 8, 9, and 11.

What are the centers of each cluster?

r$centers
##        mpg      cyl      disp
## 1 13.67500 8.000000 443.00000
## 2 20.50000 5.500000 164.08750
## 3 27.34444 4.000000  96.55556
## 4 16.19091 7.818182 311.76364
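
Each row of r$centers is simply the mean of the cars assigned to that cluster. A minimal sketch that checks this by recomputing the means from the cluster assignments (`member_means` and `max_gap` are illustrative names, not from the original):

```r
data("mtcars")
set.seed(8)
r <- kmeans(mtcars[, 1:3], centers = 4, nstart = 20)

# Mean of mpg, cyl, and disp among the members of each cluster
member_means <- aggregate(mtcars[, 1:3],
                          by = list(cluster = r$cluster),
                          FUN = mean)

# Largest absolute difference between the recomputed means and r$centers
max_gap <- max(abs(as.matrix(member_means[, -1]) - r$centers))
print(member_means)
print(max_gap)
```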

What is the average disp, wt, and qsec of each cluster?

library(dplyr)
library(knitr)

mtcars %>% 
    mutate(clusters = r$cluster) %>% 
    group_by(clusters) %>% 
    summarise_at(vars(disp, wt, qsec), mean) %>% kable()
clusters      disp       wt     qsec
       1 443.00000 4.966000 17.56750
       2 164.08750 3.118125 18.66250
       3  96.55556 2.089222 18.62333
       4 311.76364 3.576364 16.72545

Describe each cluster in English.

Cluster 1: cars with the highest disp, the lowest mpg, and the highest cyl (all 8-cylinder).
Cluster 2: cars with medium disp, mpg, and cyl.
Cluster 3: cars with the lowest disp, the highest mpg, and the lowest cyl (all 4-cylinder).
Cluster 4: cars with medium-to-high disp, low mpg, and high cyl.
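
Because kmeans() numbers its clusters arbitrarily, it can help to read "highest" and "lowest" off the centers programmatically rather than by eye. A small sketch, assuming the same fit as above (`extremes` is an illustrative name):

```r
data("mtcars")
set.seed(8)
r <- kmeans(mtcars[, 1:3], centers = 4, nstart = 20)

# For each variable, which cluster number has the largest and smallest center
extremes <- data.frame(
  variable = colnames(r$centers),
  highest  = apply(r$centers, 2, which.max),
  lowest   = apply(r$centers, 2, which.min)
)
print(extremes)
```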

Part 2:

Find a data set with at least 4 columns of numeric data and a categorical column. Run several scatter plots of the data. Create a kmeans object from the numeric data; you can pick K to be whatever you want. Determine the size of each cluster. Determine the centers of each cluster. Compare the clusters to the categorical data column as we did with the iris$Species column.

Find a data set with at least 4 columns of numeric data and a categorical column

library(readr)
U <- read_csv("/Users/yousiyan/Downloads/timesData.csv") ## Obtained from Kaggle.com/mylesoneill/world-university-rankings
## Parsed with column specification:
## cols(
##   world_rank = col_character(),
##   university_name = col_character(),
##   country = col_character(),
##   teaching = col_double(),
##   international = col_character(),
##   research = col_double(),
##   citations = col_double(),
##   income = col_character(),
##   total_score = col_character(),
##   num_students = col_number(),
##   student_staff_ratio = col_double(),
##   international_students = col_character(),
##   female_male_ratio = col_character(),
##   year = col_integer()
## )
topU.2016 <- U[which(U$year == 2016), ][1:200, ]
                    ## Get 2016's top 200 academic institutes
topU.2016$international <- as.numeric(as.character(topU.2016$international))
topU.2016$income <- as.numeric(as.character(topU.2016$income))
## Warning: NAs introduced by coercion
topU.2016$total_score <- as.numeric(as.character(topU.2016$total_score))
require(stringr)
topU.2016$num_students <- as.numeric(gsub(",", "", as.character(topU.2016$num_students)))
topU.2016$international_students <-
  as.numeric(gsub("%", "", as.character(topU.2016$international_students))) / 100
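## Take the digits before the first space in each row's ratio string;
## regexpr() is vectorized, so every row uses its own space position
topU.2016$female_students <-
  (as.numeric(substr(as.character(topU.2016$female_male_ratio),
                     1,
                     regexpr(" ", as.character(topU.2016$female_male_ratio)) - 1))
  / 100)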
topU.2016$world_rank <- as.numeric(gsub("=", "", as.character(topU.2016$world_rank)))
row.names(topU.2016) <- 1:200
## Warning: Setting row names on a tibble is deprecated.
topU.2016$seq <- 1:200
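
The female share can also be extracted with a single regular expression. The sketch below assumes the ratio strings look like "33 : 67" (an assumption about the file's format); the values and the names `ratio`, `female_pct`, and `female_share` are made up for illustration.

```r
# Hypothetical ratio strings in the assumed "female : male" format
ratio <- c("33 : 67", "9 : 91", "52 : 48", NA)

# Keep only the leading run of digits, then convert the percentage to a proportion
female_pct   <- sub("^[[:space:]]*([0-9]+).*$", "\\1", ratio)
female_share <- as.numeric(female_pct) / 100
print(female_share)
```

Missing ratios stay NA, and single-digit percentages such as "9 : 91" are handled because the pattern does not depend on a fixed character position.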

topU.2016$continent <-
  ifelse(topU.2016$country %in% c("South Africa"),
         "Africa",
  ifelse(topU.2016$country %in% c("China",
                                  "Hong Kong",
                                  "Japan",
                                  "South Korea",
                                  "Taiwan",
                                  "Singapore",
                                  "Israel"),
         "Asia",
  ifelse(topU.2016$country %in% c("Canada",
                                  "United States of America"),
         "North America",
  ifelse(topU.2016$country %in% c("Australia",
                                  "New Zealand"),
         "Oceania",
         "Europe"))))

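The nested ifelse() calls above can also be written as a named lookup vector, which is easier to extend. A minimal sketch using the same country groups, with "Europe" as the fallback as in the original (`continent_of` and the example `country` vector are illustrative):

```r
# Country-to-continent lookup; any country not listed falls back to "Europe"
continent_of <- c(
  "South Africa" = "Africa",
  "China" = "Asia", "Hong Kong" = "Asia", "Japan" = "Asia",
  "South Korea" = "Asia", "Taiwan" = "Asia", "Singapore" = "Asia",
  "Israel" = "Asia",
  "Canada" = "North America", "United States of America" = "North America",
  "Australia" = "Oceania", "New Zealand" = "Oceania"
)

country   <- c("Japan", "Canada", "Germany", "Australia", "South Africa")
continent <- unname(continent_of[country])   # NA where the country is not listed
continent[is.na(continent)] <- "Europe"
print(continent)
```
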
Run several scatter plots of the data

library(ggplot2)

ggplot(topU.2016) +
  geom_point(aes(teaching, citations)) +
  labs(title = "Teaching vs. Citations")

ggplot(topU.2016) +
  geom_point(aes(teaching, research)) +
  labs(title = "Teaching vs. Research")

ggplot(topU.2016) +
  geom_point(aes(teaching, international)) +
  labs(title = "Teaching vs. International")

Create a kmeans object from the numeric data; you can pick K to be whatever you want

attach(topU.2016)
teaching.Z      <- (teaching - mean(teaching)) / sd(teaching)
international.Z <- (international - mean(international)) / sd(international)
research.Z      <- (research - mean(research)) / sd(research)
citations.Z     <- (citations - mean(citations)) / sd(citations)
detach(topU.2016)

topU.2016.stdz <- as.data.frame(cbind(teaching.Z,
                                      international.Z,
                                      research.Z,
                                      citations.Z))
topU.2016.K3 <- kmeans(topU.2016.stdz,
                       centers = 3)
topU.2016.K3
## K-means clustering with 3 clusters of sizes 68, 84, 48
## 
## Cluster means:
##    teaching.Z international.Z research.Z citations.Z
## 1 -0.05037795      -1.0402719 -0.1295164  -0.3762204
## 2 -0.73261692       0.6308102 -0.6705295  -0.1108402
## 3  1.35344836       0.3698007  1.3569082   0.7269492
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 1 3 3 1 1 3 3 1 1 3 3 1 3 3 1 3 2 3 2 1 2 2 1 1 1 1 2 1 1 1 2 2
##  [71] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 2 1 2 1 2 2 1 2 2 2 1 1 2 2 2 1 2
## [106] 2 2 2 1 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 2 1 1 2 2 2 2 1 2 2 1 2 2 1
## [141] 2 1 2 2 1 2 1 1 1 2 2 2 1 2 2 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 1 1
## [176] 2 2 1 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 183.68037 114.70979  87.53971
##  (between_SS / total_SS =  51.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
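
The manual z-scores used above match what base R's scale() produces, which avoids attach() and detach(). A sketch on made-up scores (the `scores` data frame and its values are hypothetical):

```r
# Hypothetical scores, for illustration only
scores <- data.frame(
  teaching = c(56.2, 71.0, 88.4, 93.1, 64.7),
  research = c(40.5, 62.3, 90.0, 97.2, 55.8)
)

# scale() centers each column on its mean and divides by its standard deviation
stdz <- as.data.frame(scale(scores))

# The same standardization written out by hand for one column
manual <- (scores$teaching - mean(scores$teaching)) / sd(scores$teaching)
print(cbind(scaled = stdz$teaching, manual = manual))
```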

Determine the size of each cluster

topU.2016.K3$size
## [1] 68 84 48
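
Cluster numbers from kmeans() are arbitrary, and without a seed they can change between runs even when the grouping is the same. One way to keep the labels stable is to renumber the clusters by the order of one center coordinate. The sketch below uses iris as a stand-in data set; ordering by Petal.Length is an arbitrary choice made for illustration.

```r
set.seed(1)
Z  <- scale(iris[, 1:4])
km <- kmeans(Z, centers = 3, nstart = 20)

# Rank the clusters by one coordinate of their centers
ord <- order(km$centers[, "Petal.Length"])

# Renumber: the cluster with the smallest center becomes 1, the largest becomes 3
stable <- match(km$cluster, ord)

# Per-cluster means under the new numbering increase from cluster 1 to 3
stable_means <- tapply(Z[, "Petal.Length"], stable, mean)
print(table(stable))
print(stable_means)
```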

Determine the centers of each cluster

K3.centers <- as.data.frame(topU.2016.K3$centers)
attach(topU.2016)
cntr.teaching      <- K3.centers$teaching.Z * sd(teaching) + mean(teaching)
cntr.international <- K3.centers$international.Z * sd(international) + mean(international)
cntr.research      <- K3.centers$research.Z * sd(research) + mean(research)
cntr.citations    <- K3.centers$citations.Z * sd(citations) + mean(citations)
detach(topU.2016)
K3.centers <- as.data.frame(cbind(cntr.teaching,
                                  cntr.international,
                                  cntr.research,
                                  cntr.citations))
row.names(K3.centers) <- 1:3
K3.centers
##   cntr.teaching cntr.international cntr.research cntr.citations
## 1      49.42941           46.39853      51.34559       78.10000
## 2      38.31667           79.15357      40.65238       81.49643
## 3      72.29583           74.03750      80.72500       92.21875
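
Back-transforming standardized centers gives the cluster means on the original scale. The sketch below checks this on iris, used here only as a stand-in, with the centering and scaling values that scale() stores as attributes (`centers_orig` and `group_means` are illustrative names):

```r
X <- as.data.frame(iris[, 1:4])
Z <- scale(X)

set.seed(1)
km <- kmeans(Z, centers = 3, nstart = 20)

mu <- attr(Z, "scaled:center")  # column means used for centering
s  <- attr(Z, "scaled:scale")   # column standard deviations used for scaling

# Undo the standardization: multiply by the sd, then add the mean, column by column
centers_orig <- sweep(sweep(km$centers, 2, s, "*"), 2, mu, "+")

# Means of the original variables among each cluster's members
group_means <- aggregate(X, by = list(cluster = km$cluster), FUN = mean)
print(centers_orig)
print(group_means)
```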

Compare the clusters to the categorical data column as we did with the iris$Species column

topU.2016$K3.clus <- topU.2016.K3$cluster
table(topU.2016$K3.clus, topU.2016$continent)
##    
##     Africa Asia Europe North America Oceania
##   1      0   10     26            32       0
##   2      1    3     62            11       7
##   3      0    2     17            27       2
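
Row proportions make the cross-tabulation easier to compare across clusters of different sizes. The sketch below re-enters the counts from the table above and converts each row to shares (`tab` and `row_share` are illustrative names):

```r
# Counts copied from the cluster-by-continent table above
tab <- matrix(c(0, 10, 26, 32, 0,
                1,  3, 62, 11, 7,
                0,  2, 17, 27, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = 1:3,
                              continent = c("Africa", "Asia", "Europe",
                                            "North America", "Oceania")))

# Share of each continent within each cluster (every row sums to 1)
row_share <- round(prop.table(tab, 1), 2)
print(rowSums(tab))   # 68 84 48, matching the cluster sizes
print(row_share)
```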

Part 3:

For your chosen data set, describe what each row of data represents

ANSWER: The data set lists the world’s top 200 universities in 2016. Each row represents one university. The ranking is based on Times Higher Education’s World University Rankings. The data set was obtained from Kaggle.com/mylesoneill/world-university-rankings.

Describe each of your columns used - give a one-sentence description

ANSWER:

Column 1 “teaching”: A score from 0 to 100 for a university’s teaching performance, based on teaching reputation, staff-to-student ratio, doctorate-to-bachelor’s ratio, doctorates-awarded-to-academic-staff ratio, and institutional income.

Column 2 “international”: A score from 0 to 100 for a university’s international outlook, based on the international-to-domestic-student ratio, the international-to-domestic-staff ratio, and international collaboration.

Column 3 “research”: A score from 0 to 100 for a university’s research performance, based on research reputation, research income, and research productivity.

Column 4 “citations”: A score from 0 to 100 for a university’s citation performance, i.e., its research influence.

If you know it, describe how the data was generated

ANSWER: The underlying information used in the Times Higher Education World University Rankings comes mainly from two sources: 1) institutional data provided and signed off by the institutions for use in the rankings, and 2) the Academic Reputation Survey, which is conducted by Times Higher Education and targets experienced, published scholars. Thirteen performance indicators are captured, such as the doctorate-to-bachelor’s ratio and scholarly papers per academic staff member. These are assigned different weights and grouped into five areas: teaching, research, citations, international outlook, and industry income. Performance in each area is scaled to a 0-100 range. A total score is then calculated as the weighted average of the five area scores.

For the clusters, describe the size and means of clusters

topU.2016.K3$size
## [1] 68 84 48
aggregate(cbind(total_score,
                teaching, international, research, citations, income,
                num_students, student_staff_ratio, international_students, female_students)
            ~ K3.clus,
          topU.2016,
          FUN = mean)
##   K3.clus total_score teaching international research citations income
## 1       1    58.17500 47.73000      47.42500 50.04167  79.35500  59.22
## 2       2    55.32800 38.21467      79.74667 40.04933  81.85733  52.22
## 3       3    80.61905 72.10714      74.07857 80.73810  92.22381  61.30
##   num_students student_staff_ratio international_students female_students
## 1     29028.30            19.08833              0.1123333       0.4890000
## 2     18970.87            17.63733              0.2433333       0.5197333
## 3     23617.79            13.50238              0.2597619       0.4885714

ANSWER: Cluster 3 consists of 48 institutions. This cluster scores the highest on four of the five performance indicators (all except International Outlook), and hence the highest on total score, with an average of 80.62. On average, there are 13.5 students for every academic staff member, indicating better staffing conditions. This cluster also has the highest share of international students, about 26% on average.

Cluster 1 consists of 68 institutions. This cluster scores the lowest on International Outlook, with an average of 47.43, which is also reflected in its low average international student share of 11.2%. Among cluster 1 institutions, there are 19.1 students for every academic staff member, so students have relatively fewer mentoring resources than cluster 3’s students.

Cluster 2 comprises the remaining 84 institutions. It has an average total score of 55.33, close to cluster 1’s 58.18. Its main difference from cluster 1 is its International Outlook, where it even outperforms cluster 3, with an average score of 79.75. This cluster has the smallest institutions of the three, with about 19 thousand students on average.
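
The aggregate() means above do not exactly match the back-transformed centers: for K3.clus 1, teaching is 47.73 here versus 49.43 among the centers. A likely reason, not verified against the data, is that the formula interface of aggregate() defaults to na.action = na.omit, which drops every row with a missing value in any listed column, and income contains NAs after coercion. The sketch below shows the effect on a made-up data frame (`df`, `agg_default`, and `agg_all` are illustrative names):

```r
# Made-up data: the second row has a missing y
df <- data.frame(g = c(1, 1, 2, 2),
                 x = c(10, 20, 30, 40),
                 y = c(1, NA, 3, 4))

# Default: the row with the NA is dropped for every column, so x loses a value too
agg_default <- aggregate(cbind(x, y) ~ g, df, FUN = mean)

# Keep all rows and skip NAs column by column instead
agg_all <- aggregate(cbind(x, y) ~ g, df, FUN = mean,
                     na.rm = TRUE, na.action = na.pass)

print(agg_default)
print(agg_all)
```

In the default call the mean of x for group 1 uses only the first row, while the second call uses both rows.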

Give a one- or two-word description to each cluster - in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians.

ANSWER:

Cluster 1: Traditional
Cluster 2: Diversified
Cluster 3: Top tier