This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Necessary Libraries
library(ggplot2)
library(dplyr)
library(plotly)
library(factoextra)
# Load the data
data <- read.csv("us_crime_data_02101dc2-349e-48cf-8789-793b608d4210.csv")
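# Keep the state names for labelling (the variable is named `country`, but it holds states)
country <- data$State
# Drop the state-name column (assumed to be the first column), leaving only numeric variables
data <- data[,-1]
# Standardize each variable to mean 0 and standard deviation 1
nd <- scale(data)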
# (a) Obtain a PCA based 3-dimensional projection of the data. PCA to be done using
# standardized data.
# nd is already standardized, so scale. = TRUE is a harmless no-op here
pca_data <- prcomp(nd, scale. = TRUE)
#summary(pca_data)
#dim(pca_data$x)
# Extract first 3 PC
pca_projection <- as.data.frame((pca_data)$x[, 1:3])
# Plot the 3D projection
plot_ly(pca_projection, x = ~PC1, y = ~PC2, z = ~PC3, type = "scatter3d", mode = "markers")
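As a cross-check on the projection above, the following sketch shows that PCA on standardized data amounts to an eigendecomposition of the correlation matrix. It uses R's built-in `USArrests` data as a stand-in, which is an assumption: the CSV loaded above is not reproduced here and may differ.

```r
# Stand-in data (assumption): built-in USArrests, not the CSV used above
X <- scale(USArrests)            # standardize: mean 0, sd 1 per variable
pca <- prcomp(X)                 # PCA on the standardized data
eig <- eigen(cor(USArrests))     # eigendecomposition of the correlation matrix

# PC variances equal the eigenvalues of the correlation matrix
same_variances <- isTRUE(all.equal(pca$sdev^2, eig$values))

# PC scores are the standardized data multiplied by the loading matrix
scores <- X %*% pca$rotation
same_scores <- isTRUE(all.equal(unname(scores), unname(pca$x)))

# The first three score columns give the 3-dimensional projection
projection <- as.data.frame(scores[, 1:3])
print(c(same_variances = same_variances, same_scores = same_scores))
```

The two checks compare variances and scores only; individual loading vectors may differ in sign between `prcomp()` and `eigen()`, because the sign of a principal component is arbitrary.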
# (b) Detect outliers, if any, from the PCA projection plot obtained in (a).
# Using mahalanobis distance for detecting outliers
maha_dist <- mahalanobis(pca_projection, colMeans(pca_projection), cov(pca_projection))
outliers <- which(maha_dist > qchisq(0.975, df = 3))
print("Outliers detected in PCA projection :")
print(country[outliers]) # state names; the State column was already dropped from `data`
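Because principal component scores are uncorrelated, the squared Mahalanobis distance used above reduces to a sum of squared, variance-scaled scores. A minimal sketch of that equivalence, assuming the built-in `USArrests` data as a stand-in for the CSV; the 0.975 chi-square cutoff mirrors the code above and relies on the scores being approximately multivariate normal.

```r
# Stand-in data (assumption): built-in USArrests, not the CSV used above
scores <- prcomp(scale(USArrests))$x[, 1:3]

# Squared Mahalanobis distance of each state from the centre of the scores
d2 <- mahalanobis(scores, colMeans(scores), cov(scores))

# Same quantity by hand: standardize each PC, then sum the squares per row
d2_manual <- rowSums(scale(scores)^2)

# Flag states beyond the 97.5% quantile of a chi-square with 3 df
cutoff <- qchisq(0.975, df = 3)
flagged <- names(d2)[d2 > cutoff]

print(isTRUE(all.equal(d2, d2_manual)))
print(flagged)
```

Keeping the row names on the score matrix means the flagged entries come back as state names directly, without indexing into a separate vector.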
# (c) Obtain scree plot and suggest the number of principal components for efficient data
# dimensional reduction.
fviz_eig(pca_data, addlabels = TRUE)
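Two common rules of thumb can back up a visual reading of the scree plot: the Kaiser rule (keep components whose eigenvalue exceeds 1 on standardized data) and a cumulative-variance threshold. The sketch below uses the built-in `USArrests` data as a stand-in for the CSV, and the 80% threshold is an illustrative assumption, not a value taken from this document.

```r
# Stand-in data (assumption): built-in USArrests, not the CSV used above
pca <- prcomp(scale(USArrests))

# Eigenvalues of the correlation matrix are the squared PC standard deviations
eigenvalues <- pca$sdev^2
var_explained <- eigenvalues / sum(eigenvalues)

# Kaiser rule: count components with eigenvalue greater than 1
n_kaiser <- sum(eigenvalues > 1)

# Threshold rule: smallest number of PCs reaching the chosen cumulative share
threshold <- 0.80  # hypothetical cutoff chosen for illustration
n_threshold <- which(cumsum(var_explained) >= threshold)[1]

print(round(cumsum(var_explained), 3))
print(c(kaiser = n_kaiser, threshold = n_threshold))
```

Base R's `screeplot(pca, type = "lines")` draws a comparable plot without `factoextra`.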
# (d) Find the proportion of total sample variation captured by the first 3 principal components.
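# Rows: standard deviation, proportion of variance, cumulative proportion.
# The answer to (d) is the "Cumulative Proportion" entry in the PC3 column.
prop_var <- summary(pca_data)$importance[,1:3]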
print("Proportion of variance explained by the first 3 PCs : ")
print(prop_var)
# Rank the states by their PC1 score (highest score = rank 1).
# The sign of a principal component is arbitrary, so check the PC1 loadings before interpreting the order.
ranking1 <- data.frame(CaseTag = country, PC1_Score = pca_data$x[, 1]) %>%
  arrange(desc(PC1_Score)) %>%
  mutate(Rank = row_number())
# (e) Find the sample correlation coefficient between the first PC and the variable “assault”.
assault_index <- which(tolower(colnames(data)) == "assault") # case-insensitive, in case the column is named "Assault"
correlation <- cor(pca_data$x[, 1], data[, assault_index])
print("Correlation between first PC and 'assault':")
print(correlation)
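For standardized variables, the correlation between a variable and a principal component equals that variable's loading multiplied by the component's standard deviation. The sketch below checks this identity on the built-in `USArrests` data, used here as an assumed stand-in for the CSV; the column name `Assault` belongs to that data set and may be spelled differently in the file above.

```r
# Stand-in data (assumption): built-in USArrests, not the CSV used above
pca <- prcomp(scale(USArrests))

# Direct sample correlation between PC1 scores and the raw variable
# (correlation is unaffected by the centring and scaling of Assault)
r_direct <- cor(pca$x[, 1], USArrests$Assault)

# Same value from the PCA output: loading times PC standard deviation
r_loading <- pca$rotation["Assault", "PC1"] * pca$sdev[1]

print(c(direct = r_direct, from_loading = r_loading))
print(isTRUE(all.equal(r_direct, r_loading)))
```

The sign of the result depends on the arbitrary orientation of PC1, so only its magnitude should be read as the strength of association.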
# (f) Obtain complete linkage hierarchical cluster dendrogram. Partition the states into 5
# clusters using the constructed dendrogram and list the states in each of the clusters.
dist_matrix <- dist(scale(data)) # Standardize all variables (State column already dropped) and compute Euclidean distances
hc <- hclust(dist_matrix, method = "complete")
plot(hc, labels = as.character(country), hang = -1, main = "Dendrogram of US Crime Data") # Plot dendrogram with state labels
# Cut dendrogram into 5 clusters
clusters <- cutree(hc, k = 5)
data$cluster <- clusters
# List states in each cluster
cluster_list <- split(country, data$cluster)
print("States in each cluster:")
print(cluster_list)
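The clustering steps above can be sketched end to end on the built-in `USArrests` data, which is an assumed stand-in for the CSV. Complete linkage merges the two clusters whose farthest pair of members is closest, and `cutree()` then cuts the tree at the height that yields the requested number of groups.

```r
# Stand-in data (assumption): built-in USArrests, not the CSV used above
X <- scale(USArrests)                        # standardize before computing distances
hc <- hclust(dist(X), method = "complete")   # complete-linkage hierarchical clustering

# Cut the dendrogram into 5 clusters; the result is named by state
groups <- cutree(hc, k = 5)

# List the states in each cluster
members <- split(names(groups), groups)

print(table(groups))
print(members)
```

Splitting on the cluster vector directly avoids adding a `cluster` column to the numeric data, which keeps that data frame safe to reuse in later distance or PCA calls.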