---
title: "R Notebook"
output: html_notebook
---


# 8. Proportion of Variance Explained (PVE) in Two Ways for the USArrests Dataset

## (a) Using the sdev Output of prcomp()

To calculate the Proportion of Variance Explained (PVE) using the built-in PCA function, I applied prcomp() to the USArrests dataset with both centering and scaling enabled. I then used the squared standard deviations (sdev^2) to compute the PVE for each principal component.

```{r}
# Load data
data("USArrests")

# Perform PCA with centering and scaling
pca_out <- prcomp(USArrests, scale. = TRUE)

# Calculate the proportion of variance explained
eigenvalues <- pca_out$sdev^2
pve_a <- eigenvalues / sum(eigenvalues)

# Display the PVE
pve_a
```

**Interpretation:**

The first principal component explains 62.0% of the total variance, the second 24.7%, the third 8.9%, and the fourth 4.3%.
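
A scree plot makes the drop-off easy to see. Here is a minimal sketch using the pve_a vector computed above (summary(pca_out) reports the same proportions):

```{r}
# Scree plot: per-component PVE (solid) and cumulative PVE (dashed)
plot(pve_a, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
lines(cumsum(pve_a), type = "b", lty = 2)
```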

## (b) Using Equation 12.10 Directly

To calculate the Proportion of Variance Explained (PVE) manually using Equation 12.10, I first scaled the USArrests dataset so that all variables have mean zero and standard deviation one, just as in part (a). Then, using the principal component loadings obtained from prcomp(), I followed these steps:

1. Calculated the principal component scores by multiplying the scaled data matrix by the loading matrix.
2. Computed the numerator of Equation 12.10 for each principal component as the sum of that component's squared scores.
3. Computed the denominator as the total sum of squared entries of the scaled data matrix.
4. Divided the numerator by the denominator to obtain the PVE for each principal component.
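
In symbols, Equation 12.10 defines the PVE of the $m$-th principal component as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}},$$

where the $x_{ij}$ are the centered and scaled observations and the $\phi_{jm}$ are the loadings; the code below computes exactly this numerator and denominator.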

```{r}
# Scale the data
scaled_data <- scale(USArrests)

# Perform PCA
pca_out <- prcomp(scaled_data)

# Extract loadings
loadings <- pca_out$rotation

# Calculate scores manually
scores <- scaled_data %*% loadings
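# (these equal pca_out$x up to floating-point error, since the input is already centered)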

# Compute numerator and denominator
numerator <- colSums(scores^2)
denominator <- sum(scaled_data^2)

# Calculate PVE
pve_b <- numerator / denominator
pve_b
```


This result matches the output from part (a) exactly, confirming that both methods give the same PVE values when the data are properly centered and scaled.
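
As a quick sanity check, the two vectors can also be compared programmatically (pve_b carries PC names, so strip them before comparing):

```{r}
# Confirm parts (a) and (b) agree to numerical precision
all.equal(pve_a, unname(pve_b))
```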

# 9. Hierarchical Clustering of the USArrests Data

## (a) Complete Linkage with Euclidean Distance

I performed hierarchical clustering on the USArrests dataset using complete linkage and Euclidean distance. The variables were left unscaled in this part.

```{r}
# Load data
data("USArrests")

# Compute the distance matrix
dist_matrix <- dist(USArrests)

# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram with readable labels
plot(hc_complete,
     main = "Dendrogram - Complete Linkage (Unscaled)",
     xlab = "",
     sub = "",
     cex = 0.6,        # shrink the state labels so they do not overlap
     las = 2)          # rotate the axis annotation
```

## (b) Cut the Dendrogram to Get 3 Clusters

I cut the dendrogram into three clusters using cutree() and then listed which states belong to each cluster.

```{r}
# Cut tree into 3 clusters
clusters_unscaled <- cutree(hc_complete, k = 3)

# Group states by clusters
split(names(clusters_unscaled), clusters_unscaled)
```
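
To see the three-cluster cut on the dendrogram itself, the clusters can be outlined with rect.hclust(); a minimal sketch:

```{r}
# Re-draw the dendrogram and box the three clusters from the cut
plot(hc_complete, main = "Three-Cluster Cut (Unscaled)", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_complete, k = 3, border = "red")
```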

## (c) Hierarchical Clustering After Scaling the Variables

I repeated the clustering after scaling all variables to have standard deviation one.

```{r}
# Scale the variables
scaled_data <- scale(USArrests)

# Compute new distance matrix
dist_scaled <- dist(scaled_data)

# Perform clustering
hc_scaled <- hclust(dist_scaled, method = "complete")

# Plot the scaled dendrogram with readable labels
plot(hc_scaled,
     main = "Dendrogram - Complete Linkage (Scaled)",
     xlab = "",
     sub = "",
     cex = 0.6,        # shrink the state labels so they do not overlap
     las = 2)          # rotate the axis annotation
```

Then I again cut the dendrogram into 3 clusters:

```{r}
clusters_scaled <- cutree(hc_scaled, k = 3)
split(names(clusters_scaled), clusters_scaled)
```
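
To quantify how much the assignments moved, the two cluster vectors computed above can be cross-tabulated:

```{r}
# Cross-tabulate unscaled vs. scaled cluster assignments
table(unscaled = clusters_unscaled, scaled = clusters_scaled)
```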

## (d) Effect of Scaling and Justification

Scaling had a substantial effect on the clustering results.

Before scaling, Assault, which has by far the largest variance of the four variables, dominated the Euclidean distance calculations. As a result, the clustering was driven largely by the raw magnitude of that one crime rate.

After scaling, all variables contributed equally to the distance calculation. This gave a more balanced view, and the cluster assignments changed accordingly.

In my opinion, the variables should be scaled before computing inter-observation dissimilarities, especially when the variables are measured on different scales (e.g., 'Murder' and 'UrbanPop'). Without scaling, clustering reflects the influence of high-variance variables rather than all features equally.
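
This claim is easy to verify: the raw variances of the four variables differ by orders of magnitude, with Assault far in the lead.

```{r}
# Per-variable variances of the unscaled data
apply(USArrests, 2, var)
```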


# 10. K-means Clustering on Simulated Data

## (a) Generate Simulated Data for 3 Classes

A simulated dataset was generated with 60 observations (20 in each of 3 distinct classes) and 50 variables. A different mean shift (0, 3, and 6) was applied to each class to ensure clear separation between them.

```{r}
set.seed(1)

# Generate 20 observations for each class
x1 <- matrix(rnorm(20 * 50, mean = 0), nrow = 20)
x2 <- matrix(rnorm(20 * 50, mean = 3), nrow = 20)
x3 <- matrix(rnorm(20 * 50, mean = 6), nrow = 20)

# Combine all observations
x <- rbind(x1, x2, x3)

# Create true class labels
labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
```

## (b) Perform PCA and Plot the First Two Principal Components

Principal component analysis (PCA) was applied to the dataset, and the first two principal component score vectors were plotted with each class in a different color.

```{r}
# Perform PCA
pca_out <- prcomp(x)

# Plot the first two principal component score vectors
plot(pca_out$x[, 1:2], col = labels, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA - First Two PCs")
```

**Interpretation:**

The three classes appeared well separated in the PCA plot, confirming that the mean shifts were large enough to distinguish the classes visually.

## (c) Perform K-means Clustering with K = 3
K-means clustering was performed with 3 clusters, and the resulting cluster assignments were compared to the true labels using a contingency table.

```{r}
set.seed(1)
km_out <- kmeans(x, centers = 3, nstart = 20)

# Compare clustering with true labels
table(labels, km_out$cluster)
```

**Interpretation:**

The clustering recovered the true classes perfectly; the cluster numbers differ from the class labels only because K-means assigns cluster labels arbitrarily.
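
One simple way to quantify agreement despite the arbitrary numbering is to credit each cluster with its majority class; a minimal sketch (this majority-vote mapping is a convention of this write-up, not part of the kmeans() output):

```{r}
# Credit each cluster with its majority true class and compute accuracy
conf_mat <- table(labels, km_out$cluster)
sum(apply(conf_mat, 2, max)) / sum(conf_mat)
```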

## (d) Perform K-means Clustering with K = 2
K-means clustering was repeated with 2 clusters instead of 3.

```{r}
set.seed(1)
km2 <- kmeans(x, centers = 2, nstart = 20)
table(labels, km2$cluster)
```

**Interpretation:**

With only two clusters, two of the true classes were forced into a single cluster, underrepresenting the data's true three-class structure.

## (e) Perform K-means Clustering with K = 4
K-means clustering was also performed with 4 clusters.

```{r}
set.seed(1)
km4 <- kmeans(x, centers = 4, nstart = 20)
table(labels, km4$cluster)
```

**Interpretation:**

Using four clusters over-partitioned the data: one of the original classes was split into two clusters unnecessarily.
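
As a side check (beyond what the exercise asks for), the total within-cluster sum of squares can be compared across candidate values of K; the bend at K = 3 points to the true number of classes. A minimal sketch:

```{r}
# Total within-cluster SS for K = 1..6 (elbow plot)
set.seed(1)
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```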

## (f) Perform K-means Clustering on the First Two Principal Components

Clustering was applied to the two-dimensional matrix of PC scores instead of the full 50-dimensional dataset.


```{r}
pc_data <- pca_out$x[, 1:2]

set.seed(1)
km_pca <- kmeans(pc_data, centers = 3, nstart = 20)
table(labels, km_pca$cluster)
```

**Interpretation:**

Clustering on the first two principal components preserved the class structure and produced essentially the same assignments as clustering on the original 50-dimensional data.
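
A side-by-side plot of the PC scores, colored once by cluster assignment and once by true class, makes the agreement visible (colors may be permuted between panels because cluster numbering is arbitrary):

```{r}
# PC scores colored by K-means cluster (left) and by true class (right)
par(mfrow = c(1, 2))
plot(pc_data, col = km_pca$cluster, pch = 19, main = "K-means clusters")
plot(pc_data, col = labels, pch = 19, main = "True classes")
par(mfrow = c(1, 1))
```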

## (g) Perform K-means Clustering on Scaled Data

The dataset was scaled using scale(), and K-means clustering was performed again with 3 clusters.

```{r}
x_scaled <- scale(x)

set.seed(1)
km_scaled <- kmeans(x_scaled, centers = 3, nstart = 20)
table(labels, km_scaled$cluster)
```

**Interpretation:**

Scaling standardizes each variable's influence on the distance calculations, but in this simulation it made little difference because all 50 variables were generated on the same scale. In real-world datasets where variables are measured in different units, scaling is important to prevent high-variance features from dominating.
