Using the mtcars data set: Create a kmeans object from the first, second, and third columns. What is the size of each cluster? What are the centers of each cluster? What is the average disp, wt, and qsec of each cluster? Describe each cluster in English.
data("mtcars")
First we want to decide how many clusters to create
set.seed(8)
wi = numeric(9)
for (i in 2:10) {
  z = kmeans(mtcars[, 1:3], i)
  # store the total within-cluster sum of squares for k = i
  wi[i - 1] = z$tot.withinss
}
# plot the within-cluster sum of squares against k and look for an elbow
plot(2:10, wi, type = 'l', xlab = "k = 2:10", ylab = "Within Sum of Squares")
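When reading an elbow plot it helps to know which quantity is on the y-axis: the within-cluster sum of squares falls as k grows, while the between-cluster share rises. The sketch below, on the same three mtcars columns, checks that the two shares of the total sum of squares add up to 1. The seed and `nstart` value are arbitrary choices for illustration.

```r
# Sketch: how kmeans() splits the total sum of squares.
set.seed(8)                        # arbitrary seed, for repeatability
x <- mtcars[, 1:3]                 # mpg, cyl, disp
fit <- kmeans(x, centers = 4, nstart = 20)

# Share of the total SS that lies within clusters vs. between clusters.
within_share  <- fit$tot.withinss / fit$totss
between_share <- fit$betweenss / fit$totss

within_share + between_share       # 1, up to floating-point error
```

A plot of `tot.withinss` against k should decrease, so the "elbow" is the point where further clusters stop reducing it much.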
What is the size of each cluster?
r = kmeans(mtcars[,1:3], centers =4, nstart=20)
r$size
## [1] 4 8 9 11
The sizes of the 4 clusters are 4, 8, 9, and 11.
What are the centers of each cluster?
r$centers
## mpg cyl disp
## 1 13.67500 8.000000 443.00000
## 2 20.50000 5.500000 164.08750
## 3 27.34444 4.000000 96.55556
## 4 16.19091 7.818182 311.76364
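One way to sanity-check the reported sizes and centers is to recompute them from the cluster assignment vector. This is a minimal sketch that refits the model with an arbitrary seed, so its cluster numbering may differ from the output above. The names `sizes` and `group_means` are introduced here for illustration.

```r
# Sketch: recompute sizes and centers from the cluster assignments.
set.seed(8)                        # arbitrary seed, for repeatability
x <- mtcars[, 1:3]
fit <- kmeans(x, centers = 4, nstart = 20)

# Size of each cluster, counted from the assignment vector.
sizes <- as.vector(table(fit$cluster))

# Mean of every column within each cluster.
group_means <- aggregate(x, by = list(cluster = fit$cluster), FUN = mean)

all(sizes == fit$size)
all.equal(as.matrix(group_means[, -1]), fit$centers, check.attributes = FALSE)
```

Both checks should return TRUE, since a k-means center is simply the mean of the points assigned to it.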
What is the average disp, wt, and qsec of each cluster?
library(dplyr)
library(knitr)
mtcars %>%
  mutate(clusters = r$cluster) %>%
  group_by(clusters) %>%
  summarise_at(vars(disp, wt, qsec), mean) %>%
  kable()
| clusters | disp | wt | qsec |
|---|---|---|---|
| 1 | 443.00000 | 4.966000 | 17.56750 |
| 2 | 164.08750 | 3.118125 | 18.66250 |
| 3 | 96.55556 | 2.089222 | 18.62333 |
| 4 | 311.76364 | 3.576364 | 16.72545 |
Describe each cluster in English.
Cluster 1: cars that have the highest disp, the lowest mpg, and 8 cylinders.
Cluster 2: cars that have medium disp, medium mpg, and medium cyl.
Cluster 3: cars that have the lowest disp, the highest mpg, and 4 cylinders.
Cluster 4: cars that have medium to high disp, low mpg, and mostly 8 cylinders.
Find a data set with at least 4 columns of numeric data and a categorical column. Run several scatter plots of the data. Create a kmeans object from the numeric data; you can pick K to be whatever you want. Determine the size of each cluster. Determine the centers of each cluster. Compare the clusters to the categorical data column as we did with the iris$Species column.
Find a data set with at least 4 columns of numeric data and a categorical column
library(readr)
U <- read_csv("/Users/yousiyan/Downloads/timesData.csv") ## Obtained from Kaggle.com/mylesoneill/world-university-rankings
## Parsed with column specification:
## cols(
## world_rank = col_character(),
## university_name = col_character(),
## country = col_character(),
## teaching = col_double(),
## international = col_character(),
## research = col_double(),
## citations = col_double(),
## income = col_character(),
## total_score = col_character(),
## num_students = col_number(),
## student_staff_ratio = col_double(),
## international_students = col_character(),
## female_male_ratio = col_character(),
## year = col_integer()
## )
topU.2016 <- U[which(U$year == 2016), ][1:200, ]
## Get 2016's top 200 academic institutes
topU.2016$international <- as.numeric(as.character(topU.2016$international))
topU.2016$income <- as.numeric(as.character(topU.2016$income))
## Warning: NAs introduced by coercion
topU.2016$total_score <- as.numeric(as.character(topU.2016$total_score))
require(stringr)
topU.2016$num_students <- as.numeric(gsub(",", "", as.character(topU.2016$num_students)))
topU.2016$international_students <-
as.numeric(gsub("%", "", as.character(topU.2016$international_students))) / 100
topU.2016$female_students <-
  (as.numeric(substr(as.character(topU.2016$female_male_ratio),
                     1,
                     # position of the first space in each row, not only in row 1
                     regexpr(" ", as.character(topU.2016$female_male_ratio)) - 1))
   / 100)
topU.2016$world_rank <- as.numeric(gsub("=", "", as.character(topU.2016$world_rank)))
row.names(topU.2016) <- 1:200
## Warning: Setting row names on a tibble is deprecated.
topU.2016$seq <- 1:200
topU.2016$continent <-
ifelse(topU.2016$country %in% c("South Africa"),
"Africa",
ifelse(topU.2016$country %in% c("China",
"Hong Kong",
"Japan",
"South Korea",
"Taiwan",
"Singapore",
"Israel"),
"Asia",
ifelse(topU.2016$country %in% c("Canada",
"United States of America"),
"North America",
ifelse(topU.2016$country %in% c("Australia",
"New Zealand"),
"Oceania",
"Europe"))))
Run several scatter plots of the data
library(ggplot2)
ggplot(topU.2016) +
  geom_point(aes(teaching, citations)) +
  labs(title = "Teaching vs. Citations")
ggplot(topU.2016) +
  geom_point(aes(teaching, research)) +
  labs(title = "Teaching vs. Research")
ggplot(topU.2016) +
  geom_point(aes(teaching, international)) +
  labs(title = "Teaching vs. International")
Create a kmeans object from the numeric data; you can pick K to be whatever you want
attach(topU.2016)
teaching.Z <- (teaching - mean(teaching)) / sd(teaching)
international.Z <- (international - mean(international)) / sd(international)
research.Z <- (research - mean(research)) / sd(research)
citations.Z <- (citations - mean(citations)) / sd(citations)
detach(topU.2016)
topU.2016.stdz <- as.data.frame(cbind(teaching.Z,
international.Z,
research.Z,
citations.Z))
topU.2016.K3 <- kmeans(topU.2016.stdz,
centers = 3)
topU.2016.K3
## K-means clustering with 3 clusters of sizes 68, 84, 48
##
## Cluster means:
## teaching.Z international.Z research.Z citations.Z
## 1 -0.05037795 -1.0402719 -0.1295164 -0.3762204
## 2 -0.73261692 0.6308102 -0.6705295 -0.1108402
## 3 1.35344836 0.3698007 1.3569082 0.7269492
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 1 3 3 1 1 3 3 1 1 3 3 1 3 3 1 3 2 3 2 1 2 2 1 1 1 1 2 1 1 1 2 2
## [71] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 2 1 2 1 2 2 1 2 2 2 1 1 2 2 2 1 2
## [106] 2 2 2 1 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 2 1 1 2 2 2 2 1 2 2 1 2 2 1
## [141] 2 1 2 2 1 2 1 1 1 2 2 2 1 2 2 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 1 1
## [176] 2 2 1 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 183.68037 114.70979 87.53971
## (between_SS / total_SS = 51.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
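The columns passed to kmeans() above were standardized by hand as (x - mean) / sd. Base R's `scale()` should give the same z-scores and also stores the means and standard deviations as attributes. The sketch below checks this on `iris`, used only as stand-in numeric data.

```r
# Sketch: manual z-scores vs. scale(), on stand-in data.
x <- iris[, 1:4]

# Column-by-column standardization, as done by hand.
manual_z <- as.data.frame(lapply(x, function(v) (v - mean(v)) / sd(v)))

# The same transformation in one call.
scaled_z <- scale(x)

all.equal(as.matrix(manual_z), scaled_z, check.attributes = FALSE)

# scale() keeps what is needed to undo the transformation later.
attr(scaled_z, "scaled:center")
attr(scaled_z, "scaled:scale")
```

Standardizing matters here because k-means uses Euclidean distance, so a column with a larger spread would otherwise dominate the clustering.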
Determine the size of each cluster
topU.2016.K3$size
## [1] 68 84 48
Determine the centers of each cluster
K3.centers <- as.data.frame(topU.2016.K3$centers)
attach(topU.2016)
cntr.teaching <- K3.centers$teaching.Z * sd(teaching) + mean(teaching)
cntr.international <- K3.centers$international.Z * sd(international) + mean(international)
cntr.research <- K3.centers$research.Z * sd(research) + mean(research)
cntr.citations <- K3.centers$citations.Z * sd(citations) + mean(citations)
detach(topU.2016)
K3.centers <- as.data.frame(cbind(cntr.teaching,
cntr.international,
cntr.research,
cntr.citations))
row.names(K3.centers) <- 1:3
K3.centers
## cntr.teaching cntr.international cntr.research cntr.citations
## 1 49.42941 46.39853 51.34559 78.10000
## 2 38.31667 79.15357 40.65238 81.49643
## 3 72.29583 74.03750 80.72500 92.21875
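The back-transformation above undoes z = (x - mean) / sd for each center coordinate. As a cross-check, the back-transformed centers should equal the per-cluster means of the raw columns. This sketch shows that on `iris` as stand-in data, with an arbitrary seed. The names `centers_raw` and `raw_means` are introduced for illustration.

```r
# Sketch: map k-means centers from z-score units back to original units.
set.seed(1)                         # arbitrary seed, for repeatability
x <- iris[, 1:4]                    # stand-in numeric data
z <- scale(x)
fit <- kmeans(z, centers = 3, nstart = 20)

mu    <- attr(z, "scaled:center")   # column means used by scale()
sigma <- attr(z, "scaled:scale")    # column standard deviations

# x = z * sigma + mu, applied column by column.
centers_raw <- sweep(sweep(fit$centers, 2, sigma, `*`), 2, mu, `+`)

# Cross-check against the per-cluster means of the raw data.
raw_means <- aggregate(x, by = list(cluster = fit$cluster), FUN = mean)
all.equal(centers_raw, as.matrix(raw_means[, -1]), check.attributes = FALSE)
```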
Compare the clusters to the categorical data column as we did with the iris$Species column
topU.2016$K3.clus <- topU.2016.K3$cluster
table(topU.2016$K3.clus, topU.2016$continent)
##
## Africa Asia Europe North America Oceania
## 1 0 10 26 32 0
## 2 1 3 62 11 7
## 3 0 2 17 27 2
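Raw counts are hard to compare across clusters of different sizes, so row proportions can make the table easier to read. The sketch below re-enters the counts from the table above by hand and converts them to within-cluster shares.

```r
# Counts copied from the cluster-by-continent table above.
tab <- matrix(c(0, 10, 26, 32, 0,
                1,  3, 62, 11, 7,
                0,  2, 17, 27, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = 1:3,
                              continent = c("Africa", "Asia", "Europe",
                                            "North America", "Oceania")))

rowSums(tab)                          # 68 84 48, matching the cluster sizes

# Share of each continent within a cluster (each row sums to 1).
round(prop.table(tab, margin = 1), 2)
```

With the live objects, `prop.table(table(topU.2016$K3.clus, topU.2016$continent), margin = 1)` would give the same shares directly.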
For your chosen data set, Describe what each row of data represents
ANSWER: The data set is a list of the world’s top 200 universities in 2016. Each row of data represents one university. The ranking is based on Times Higher Education’s World University Ranking. The data set was obtained from Kaggle.com/mylesoneill/world-university-rankings
Describe each of your columns used - give a one-sentence description
ANSWER:
Column 1 “teaching”: A score ranging from 0 to 100 that indicates a university’s performance on teaching, with key factors including reputation on teaching, staff-to-student ratio, doctorate-to-bachelor’s ratio, doctorates awarded-to-academic staff ratio, and institutional income.
Column 2 “international”: A score ranging from 0 to 100 that indicates a university’s performance on international outlook, with key factors including international-to-domestic-student ratio, international-to-domestic-staff ratio, and international collaboration.
Column 3 “research”: A score ranging from 0 to 100 that indicates a university’s performance on research, with key factors including reputation on research, research income, and research productivity.
Column 4 “citations”: A score ranging from 0 to 100 that indicates a university’s performance on citations, i.e., research influence.
If you know it, describe how the data was generated
ANSWER: The underlying information used in Times’ World University Ranking mainly comes from two sources: 1) institutional data provided and signed off by institutions for use in the rankings, and 2) the Academic Reputation Survey, which is conducted by Times and targets experienced, published scholars. There are 13 performance indicators captured, such as doctorate-to-bachelor’s ratio and scholarly papers per academic staff. They are assigned different weights and grouped into five areas: teaching, research, citations, international outlook, and industry income. Performance in each area is scaled to a 0-100 range. A total score is then calculated as the weighted average of the scores of these five areas.
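To make the last step concrete, here is a minimal sketch of a weighted average over the five area scores. The scores describe a hypothetical institution, and the weights are an assumption for illustration only; this write-up does not state the actual weights, so check the ranking's published methodology before relying on them.

```r
# Hypothetical area scores for one institution (0-100 scale).
scores <- c(teaching = 70, research = 80, citations = 90,
            international = 75, income = 60)

# Assumed weights (not given in this write-up); they sum to 1.
weights <- c(teaching = 0.30, research = 0.30, citations = 0.30,
             international = 0.075, income = 0.025)

# Weighted average of the five areas, matching weights to scores by name.
total_score <- weighted.mean(scores, weights[names(scores)])
total_score
```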
For the clusters: Describe the size and means of the clusters
topU.2016.K3$size
## [1] 68 84 48
aggregate(cbind(total_score,
teaching, international, research, citations, income,
num_students, student_staff_ratio, international_students, female_students)
~ K3.clus,
topU.2016,
FUN = mean)
## K3.clus total_score teaching international research citations income
## 1 1 58.17500 47.73000 47.42500 50.04167 79.35500 59.22
## 2 2 55.32800 38.21467 79.74667 40.04933 81.85733 52.22
## 3 3 80.61905 72.10714 74.07857 80.73810 92.22381 61.30
## num_students student_staff_ratio international_students female_students
## 1 29028.30 19.08833 0.1123333 0.4890000
## 2 18970.87 17.63733 0.2433333 0.5197333
## 3 23617.79 13.50238 0.2597619 0.4885714
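The teaching, international, research, and citations means in this table differ slightly from the back-transformed cluster centers shown earlier. A likely reason is that the formula interface of `aggregate()` drops every row with an NA in any listed column (its default `na.action` is `na.omit`), and coercing income introduced NAs. The sketch below uses a made-up data frame, not the rankings data, to show the effect.

```r
# Made-up data: one missing income value in cluster 1.
df <- data.frame(
  cluster  = c(1, 1, 1, 2, 2),
  teaching = c(40, 80, 60, 70, 80),
  income   = c(30, NA, 50, 60, 70)
)

# Formula interface: the row with the NA is removed before averaging,
# so it is lost for teaching as well.
by_formula <- aggregate(cbind(teaching, income) ~ cluster, data = df, FUN = mean)

# Data-frame interface with na.rm = TRUE: only the missing cell is skipped.
by_column <- aggregate(df[, c("teaching", "income")],
                       by = list(cluster = df$cluster),
                       FUN = mean, na.rm = TRUE)

by_formula$teaching   # cluster 1 mean uses rows 1 and 3 only
by_column$teaching    # cluster 1 mean uses all three rows
```

Passing `na.action = na.pass` together with `na.rm = TRUE` to the formula call is another way to keep the incomplete rows.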
ANSWER: Cluster 3 consists of 48 institutions. This cluster scores the highest on four of the five performance areas (all but International Outlook), and hence the highest on total score, with an average of 80.62. On average, there are 13.5 students for every academic staff member, indicating better staffing conditions. This cluster also has a relatively high percentage of international students, 26% on average.
Cluster 1 consists of 68 institutions. This cluster scores the lowest on International Outlook, with an average of 47.43, which is also reflected in its low average international student rate of 11.2%. Among cluster 1 institutions, there are 19.1 students for every academic staff member, so students have fewer mentoring resources than cluster 3’s students.
Cluster 2 represents the rest, 84 institutions. It has an average total score of 55.33, close to cluster 1’s 58.18. Its main difference from cluster 1 is that it outperforms even cluster 3 on International Outlook, with an average score of 79.75. This cluster has the smallest number of students among the three, about 19 thousand students per institution.
Give a one- or two-word description to each cluster - in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians.
ANSWER: Cluster 1: Traditional. Cluster 2: Diversified. Cluster 3: Top tier.