For this problem, you will use the College dataset from the ISLR2 library. This dataset contains 18 features describing 777 colleges based on the 1995 issue of the US News and World Report Rankings. The goal of this analysis will be to cluster the colleges using the available information, and to identify similarities between public and private schools, as well as schools in certain categories.
Separate the variable “Private” from the rest of the data, leaving you with 17 features. Scale the remaining features and save this dataframe as “college.scaled”. Extract the names of the colleges by creating a new variable that is equal to the row names of the College dataframe. Use the rownames() function in R to do this.
# Load the College dataset
data(College)
# Extract college names
college_names = rownames(College)
# Separate the "Private" variable
college_private = College$Private
# Scale the remaining features
college.scaled = as.data.frame(scale(College[, -which(names(College) == "Private")]))
Set the random seed to 1 for this analysis. Make sure you run this command immediately before you conduct your analysis. To ensure consistency, only run each step one time so that the random seed is not changed.
Conduct K-Means clustering on the scaled data using 1 to 20 clusters. Create a plot of the within sum of squares relative to the number of clusters. Include the plot here. Comment on what you see.
set.seed(1)
wss = sapply(1:20, function(k) {
kmeans(college.scaled, centers = k, nstart = 20, iter.max =100)$tot.withinss
})
# Create a WSS plot
wss_df = data.frame(clusters = 1:20, WSS = wss)
wss_plot = ggplot(wss_df, aes(x = clusters, y = WSS)) +
geom_line() +
geom_point() +
labs(title = "Elbow Plot: Within Sum of Squares",
x = "Number of Clusters",
y = "Total Within Sum of Squares") +
theme_minimal()
# Print the plot
print(wss_plot)
Answer: The plot shows that the K=3 poises as the elbow point for this data split.
Run K-means with 5 clusters. What are the sizes of the clusters? Report the centers of the clusters. Do you observe any trends?
set.seed(1)
kmeans_result <- kmeans(college.scaled, centers = 5, nstart = 20)
# Cluster sizes
cluster_sizes <- table(kmeans_result$cluster)
print("Cluster Sizes:")
## [1] "Cluster Sizes:"
print(cluster_sizes)
##
## 1 2 3 4 5
## 115 263 62 76 261
# Cluster centers
cluster_centers <- kmeans_result$centers
print("Cluster Centers:")
## [1] "Cluster Centers:"
print(cluster_centers)
## Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 0.2721386 0.3137047 0.5386923124 -0.5315055 -0.3660141 0.6450210
## 2 -0.3033590 -0.2667640 -0.3481553781 0.1838807 0.2786669 -0.3805719
## 3 2.4003436 2.6005038 2.6592525186 0.4556717 0.6064091 2.6728336
## 4 0.5032428 0.1287592 -0.0007639187 1.9397421 1.5722245 -0.1330025
## 5 -0.5309589 -0.5246515 -0.5180090031 -0.6241745 -0.7213954 -0.4969138
## P.Undergrad Outstate Room.Board Books Personal PhD
## 1 0.7729491 -0.9925279 -0.672522003 -0.004466688 0.73071917 0.2519379
## 2 -0.2930735 0.5000230 0.432729121 -0.079594076 -0.32701168 0.3461233
## 3 1.4738038 -0.4152657 -0.008509953 0.348983507 0.70634330 0.7991991
## 4 -0.4158054 1.8070466 1.225978520 0.279490402 -0.47979093 1.1530152
## 5 -0.2742737 -0.4940782 -0.494691490 -0.082112401 -0.02052799 -0.9853747
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 0.2848015 0.95205267 -0.7982466 -0.55654090 -0.74731926
## 2 0.3755673 -0.25781012 0.4315247 0.06203482 0.50293905
## 3 0.7454456 0.49319336 -0.4713327 0.09393798 -0.09175983
## 4 1.1105509 -1.22531645 1.3447607 2.12257943 1.15417907
## 5 -1.0043903 0.07993895 -0.3627273 -0.45767489 -0.49179983
Answer: For Cluster 1, F.Undergrad and P.Undergrad is higher, showing that the dataset has higher undergraduates. Outstate (-.9925) is really low compared to the other clusters, which could indicate institutions with more affordable out-of-state tuition fees. For Cluster 2, Enroll (-0.348) is low, suggesting that schools in this particular cluster have smaller students.Grad.Rate(0.503) and PhD(0.346) suggests that schools in this cluster might have a balanced undergraduate and graduation education being offered. For Cluster 3, Top10perc(0.456) and Top25perc(0.606) are higher, indicating that these colleges have many students coming from the top ranks of high schools. Outstate (-0.415) is negatively high, highlighting that these schools have higher tuition fees for out-of-state students. For Cluster 4, Top10perc (1.9) and Top25perc (1.572) are very high, suggesting that this cluster would be including the most prestigious and highly selective schools. This cluster has high Expend(2.123) and Grad.Rate(1.154) which suggests that these schools spend more on students and have higher graduation rates. Cluster 5, seems like a moderate enrollment, with moderate out-of state tuition fees, lower expenditures on its students related to lower graduation rates , contrary to the cluster 4.
Identify the proportion of schools in each cluster that are private. Report that here. Compare these results to the centers of the clusters. What might you infer?
# Create a dataframe with cluster assignments and private status
cluster_private_analysis = data.frame(
cluster = kmeans_result$cluster,
private = college_private
)
# Calculate proportion of private schools in each cluster
private_proportions = cluster_private_analysis %>%
group_by(cluster) %>%
summarize(
total_schools = n(),
private_schools = sum(private == "Yes"),
private_proportion = private_schools / total_schools
)
print("Private School Proportions by Cluster:")
## [1] "Private School Proportions by Cluster:"
print(private_proportions)
## # A tibble: 5 × 4
## cluster total_schools private_schools private_proportion
## <int> <int> <int> <dbl>
## 1 1 115 11 0.0957
## 2 2 263 251 0.954
## 3 3 62 6 0.0968
## 4 4 76 75 0.987
## 5 5 261 222 0.851
Answer: Clusters 2, 4 and 5 have a higher proportion of private schools with respect to the total schools.
Look at the school names in each cluster. Come up with a classification scheme for the groups of schools. Again, does this track with the centers from step 2? Identify three features that seem to be most associated with “elite” schools.
#I took help of AI to find this step
cluster_school_analysis <- data.frame(
school = college_names,
cluster = kmeans_result$cluster
)
# Function to find top features for each cluster
find_top_features = function(cluster_centers, top_n = 3) {
top_features_by_cluster <- apply(cluster_centers, 1, function(cluster) {
# Sort features by their absolute values in descending order
sorted_indices = order(abs(cluster), decreasing = TRUE)
top_indices = sorted_indices[1:top_n]
top_features = names(cluster)[top_indices]
top_values = cluster[top_indices]
data.frame(
feature = top_features,
value = top_values
)
})
return(top_features_by_cluster)
}
# Find top features for each cluster
top_features = find_top_features(cluster_centers)
# Print results
for (i in 1:length(top_features)) {
cat("Cluster", i, "Top Features:\n")
print(top_features[[i]])
cat("\n")
}
## Cluster 1 Top Features:
## feature value
## Outstate Outstate -0.9925279
## S.F.Ratio S.F.Ratio 0.9520527
## perc.alumni perc.alumni -0.7982466
##
## Cluster 2 Top Features:
## feature value
## Grad.Rate Grad.Rate 0.5029391
## Outstate Outstate 0.5000230
## Room.Board Room.Board 0.4327291
##
## Cluster 3 Top Features:
## feature value
## F.Undergrad F.Undergrad 2.672834
## Enroll Enroll 2.659253
## Accept Accept 2.600504
##
## Cluster 4 Top Features:
## feature value
## Expend Expend 2.122579
## Top10perc Top10perc 1.939742
## Outstate Outstate 1.807047
##
## Cluster 5 Top Features:
## feature value
## Terminal Terminal -1.0043903
## PhD PhD -0.9853747
## Top25perc Top25perc -0.7213954
Answer: In comparison with Step 2, Cluster 1 & 5 seem different while the other clusters are showing the same top features. From my understanding, cluster 4 indicates a elite group and the key indicators are Expend (2.123), Top10perc (1.94), Outstate(1.807), Top25perc(1.572), Grad.Rate(1.154) - which aligns with the features of top universities.