Problem Set 4

Applied Problem 1

For this problem, you will use the College dataset from the ISLR2 library. This dataset contains 18 features describing 777 colleges based on the 1995 issue of the US News and World Report Rankings. The goal of this analysis will be to cluster the colleges using the available information, and to identify similarities between public and private schools, as well as schools in certain categories.

Step 0

Separate the variable “Private” from the rest of the data, leaving you with 17 features. Scale the remaining features and save this dataframe as “college.scaled”. Extract the names of the colleges by creating a new variable that is equal to the row names of the College dataframe. Use the rownames() function in R to do this.

# Load the College dataset
data(College)

# Extract college names
college_names = rownames(College)

# Separate the "Private" variable
college_private = College$Private

# Scale the remaining features
college.scaled = as.data.frame(scale(College[, -which(names(College) == "Private")]))

Step 1

Set the random seed to 1 for this analysis. Make sure you run this command immediately before you conduct your analysis. To ensure consistency, only run each step one time so that the random seed is not changed.

Conduct K-Means clustering on the scaled data using 1 to 20 clusters. Create a plot of the within sum of squares relative to the number of clusters. Include the plot here. Comment on what you see.

set.seed(1)
wss = sapply(1:20, function(k) {
  kmeans(college.scaled, centers = k, nstart = 20, iter.max =100)$tot.withinss
})

# Create a WSS plot
wss_df = data.frame(clusters = 1:20, WSS = wss)
wss_plot = ggplot(wss_df, aes(x = clusters, y = WSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot: Within Sum of Squares",
       x = "Number of Clusters",
       y = "Total Within Sum of Squares") +
  theme_minimal()

# Print the plot
print(wss_plot)

Answer: The plot shows that the K=3 poises as the elbow point for this data split.

Step 2

Run K-means with 5 clusters. What are the sizes of the clusters? Report the centers of the clusters. Do you observe any trends?

set.seed(1)
kmeans_result <- kmeans(college.scaled, centers = 5, nstart = 20)

# Cluster sizes
cluster_sizes <- table(kmeans_result$cluster)
print("Cluster Sizes:")

## [1] "Cluster Sizes:"

print(cluster_sizes)

## 
##   1   2   3   4   5 
## 115 263  62  76 261

# Cluster centers
cluster_centers <- kmeans_result$centers
print("Cluster Centers:")

## [1] "Cluster Centers:"

print(cluster_centers)

##         Apps     Accept        Enroll  Top10perc  Top25perc F.Undergrad
## 1  0.2721386  0.3137047  0.5386923124 -0.5315055 -0.3660141   0.6450210
## 2 -0.3033590 -0.2667640 -0.3481553781  0.1838807  0.2786669  -0.3805719
## 3  2.4003436  2.6005038  2.6592525186  0.4556717  0.6064091   2.6728336
## 4  0.5032428  0.1287592 -0.0007639187  1.9397421  1.5722245  -0.1330025
## 5 -0.5309589 -0.5246515 -0.5180090031 -0.6241745 -0.7213954  -0.4969138
##   P.Undergrad   Outstate   Room.Board        Books    Personal        PhD
## 1   0.7729491 -0.9925279 -0.672522003 -0.004466688  0.73071917  0.2519379
## 2  -0.2930735  0.5000230  0.432729121 -0.079594076 -0.32701168  0.3461233
## 3   1.4738038 -0.4152657 -0.008509953  0.348983507  0.70634330  0.7991991
## 4  -0.4158054  1.8070466  1.225978520  0.279490402 -0.47979093  1.1530152
## 5  -0.2742737 -0.4940782 -0.494691490 -0.082112401 -0.02052799 -0.9853747
##     Terminal   S.F.Ratio perc.alumni      Expend   Grad.Rate
## 1  0.2848015  0.95205267  -0.7982466 -0.55654090 -0.74731926
## 2  0.3755673 -0.25781012   0.4315247  0.06203482  0.50293905
## 3  0.7454456  0.49319336  -0.4713327  0.09393798 -0.09175983
## 4  1.1105509 -1.22531645   1.3447607  2.12257943  1.15417907
## 5 -1.0043903  0.07993895  -0.3627273 -0.45767489 -0.49179983

Answer: For Cluster 1, F.Undergrad and P.Undergrad is higher, showing that the dataset has higher undergraduates. Outstate (-.9925) is really low compared to the other clusters, which could indicate institutions with more affordable out-of-state tuition fees. For Cluster 2, Enroll (-0.348) is low, suggesting that schools in this particular cluster have smaller students.Grad.Rate(0.503) and PhD(0.346) suggests that schools in this cluster might have a balanced undergraduate and graduation education being offered. For Cluster 3, Top10perc(0.456) and Top25perc(0.606) are higher, indicating that these colleges have many students coming from the top ranks of high schools. Outstate (-0.415) is negatively high, highlighting that these schools have higher tuition fees for out-of-state students. For Cluster 4, Top10perc (1.9) and Top25perc (1.572) are very high, suggesting that this cluster would be including the most prestigious and highly selective schools. This cluster has high Expend(2.123) and Grad.Rate(1.154) which suggests that these schools spend more on students and have higher graduation rates. Cluster 5, seems like a moderate enrollment, with moderate out-of state tuition fees, lower expenditures on its students related to lower graduation rates , contrary to the cluster 4.

Step 3

Identify the proportion of schools in each cluster that are private. Report that here. Compare these results to the centers of the clusters. What might you infer?

# Create a dataframe with cluster assignments and private status
cluster_private_analysis = data.frame(
  cluster = kmeans_result$cluster,
  private = college_private
)

# Calculate proportion of private schools in each cluster
private_proportions = cluster_private_analysis %>%
  group_by(cluster) %>%
  summarize(
    total_schools = n(),
    private_schools = sum(private == "Yes"),
    private_proportion = private_schools / total_schools
  )

print("Private School Proportions by Cluster:")

## [1] "Private School Proportions by Cluster:"

print(private_proportions)

## # A tibble: 5 × 4
##   cluster total_schools private_schools private_proportion
##     <int>         <int>           <int>              <dbl>
## 1       1           115              11             0.0957
## 2       2           263             251             0.954 
## 3       3            62               6             0.0968
## 4       4            76              75             0.987 
## 5       5           261             222             0.851

Answer: Clusters 2, 4 and 5 have a higher proportion of private schools with respect to the total schools.

Step 4

Look at the school names in each cluster. Come up with a classification scheme for the groups of schools. Again, does this track with the centers from step 2? Identify three features that seem to be most associated with “elite” schools.

#I took help of AI to find this step

cluster_school_analysis <- data.frame(
  school = college_names,
  cluster = kmeans_result$cluster
)

# Function to find top features for each cluster
find_top_features = function(cluster_centers, top_n = 3) {
  top_features_by_cluster <- apply(cluster_centers, 1, function(cluster) {
    # Sort features by their absolute values in descending order
    sorted_indices = order(abs(cluster), decreasing = TRUE)
    top_indices = sorted_indices[1:top_n]
    top_features = names(cluster)[top_indices]
    top_values = cluster[top_indices]
    
    data.frame(
      feature = top_features,
      value = top_values
    )
  })
  
  return(top_features_by_cluster)
}

# Find top features for each cluster
top_features = find_top_features(cluster_centers)

# Print results
for (i in 1:length(top_features)) {
  cat("Cluster", i, "Top Features:\n")
  print(top_features[[i]])
  cat("\n")
}

## Cluster 1 Top Features:
##                 feature      value
## Outstate       Outstate -0.9925279
## S.F.Ratio     S.F.Ratio  0.9520527
## perc.alumni perc.alumni -0.7982466
## 
## Cluster 2 Top Features:
##               feature     value
## Grad.Rate   Grad.Rate 0.5029391
## Outstate     Outstate 0.5000230
## Room.Board Room.Board 0.4327291
## 
## Cluster 3 Top Features:
##                 feature    value
## F.Undergrad F.Undergrad 2.672834
## Enroll           Enroll 2.659253
## Accept           Accept 2.600504
## 
## Cluster 4 Top Features:
##             feature    value
## Expend       Expend 2.122579
## Top10perc Top10perc 1.939742
## Outstate   Outstate 1.807047
## 
## Cluster 5 Top Features:
##             feature      value
## Terminal   Terminal -1.0043903
## PhD             PhD -0.9853747
## Top25perc Top25perc -0.7213954

Answer: In comparison with Step 2, Cluster 1 & 5 seem different while the other clusters are showing the same top features. From my understanding, cluster 4 indicates a elite group and the key indicators are Expend (2.123), Top10perc (1.94), Outstate(1.807), Top25perc(1.572), Grad.Rate(1.154) - which aligns with the features of top universities.