Dataset Description

This project uses the MovieGalaxies dataset for both clustering and dimension reduction. MovieGalaxies is a large-scale collection of movie character interaction networks derived from scene-level co-appearance data. Each movie is represented as a graph, where nodes correspond to characters and edges represent co-occurrences within the same scene, capturing the underlying collaboration structure of the cast.

The dataset consists of two main components:

Movie Interaction Networks
Individual network files are available for each movie, representing the full character interaction structure. These networks are constructed automatically from script and subtitle data using scene-based co-occurrence extraction.
Network Metadata Table (Used in This Project)
A tabular file is provided with one row per movie, containing precomputed graph-theoretic features that summarize the structural properties of each network. These attributes include:
- Weighted Degree – overall interaction strength within the network
- Modularity – strength of community structure
- Path Length and Diameter – global distance characteristics
- Clustering Coefficient – local connectivity and triadic closure
- Connected Components – number of disconnected sub-networks
- Density – proportion of realized connections
- Edges – total number of interactions
- Characters – number of nodes (cast size)

Together, these components provide both fine-grained network representations and high-level structural summaries, making the dataset well suited for clustering and dimension reduction analysis of cinematic collaboration patterns.

Data Loading and Pre-processing

Data Loading

In this step, we load the dataset, select the mathematical network metrics, and determine if the data is suitable for clustering.

# Install packages
#install.packages("tidyverse")
#install.packages("cluster")
#install.packages("factoextra")
#install.packages("flexclust")
#install.packages("fpc")
#install.packages("hopkins")
#install.packages("ClusterR")
#install.packages("dendextend")
#install.packages("dbscan")
#install.packages("kernlab")
#install.packages("NbClust")
#install.packages("reshape2")
#install.packages("corrplot")

# Load libraries
library(tidyverse)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(hopkins)
library(ClusterR)
library(dendextend)
library(dbscan)
library(kernlab)
library(NbClust)
library(reshape2)
library(corrplot)

We first standardize column names, remove duplicates, and inspect variable types and missing values to understand what preprocessing is needed.

file_name <- "C:\\Users\\mevin\\Downloads\\network_metadata.tab" 
# Load the dataset
movies <- read.csv(file_name, sep="\t", header=TRUE,
                   dec=",", stringsAsFactors=FALSE)

# Explore the data
str(movies)

## 'data.frame':    773 obs. of  14 variables:
##  $ GexfID               : int  1 2 3 5 6 7 8 9 10 11 ...
##  $ Title                : chr  "10 Things I Hate About You" "12" "Twelve and Holding" "127 Hours" ...
##  $ IMDB_id              : chr  "tt0147800" "tt0488478" "tt0417385" "tt1542344" ...
##  $ ReleaseDate          : int  1999 2007 2005 2010 1992 2001 2009 1997 1968 2009 ...
##  $ Slug                 : chr  "10-Things-I-Hate-About-You" "12" "Twelve-and-Holding" "127-Hours" ...
##  $ WeightedDegree       : num  13.04 4.23 7.33 3.83 7.94 ...
##  $ Modularity           : num  0.194 0.517 0.473 0.16 0.35 0.344 0.252 0.423 0.45 0.438 ...
##  $ PathLength           : num  1.78 2.17 2.45 1.72 2.04 ...
##  $ Diameter             : num  3 3 4 2 4 4 3 4 3 5 ...
##  $ ClusteringCoefficient: num  0.567 0.61 0.593 0.339 0.705 0.713 0.794 0.713 0.879 0.647 ...
##  $ ConnectedComponents  : num  3 6 2 4 1 4 2 2 1 3 ...
##  $ Density              : num  0.233 0.077 0.14 0.152 0.153 0.12 0.192 0.121 0.291 0.089 ...
##  $ Edges                : int  225 102 139 33 164 265 307 116 122 470 ...
##  $ Characters           : int  25 42 26 11 34 47 37 33 27 69 ...

head(movies)

dim(movies)

## [1] 773  14

summary(movies)

##      GexfID         Title             IMDB_id           ReleaseDate  
##  Min.   :  1.0   Length:773         Length:773         Min.   :1915  
##  1st Qu.:231.0   Class :character   Class :character   1st Qu.:1991  
##  Median :452.0   Mode  :character   Mode  :character   Median :1998  
##  Mean   :455.7                                         Mean   :1995  
##  3rd Qu.:687.0                                         3rd Qu.:2003  
##  Max.   :914.0                                         Max.   :2012  
##      Slug           WeightedDegree     Modularity       PathLength   
##  Length:773         Min.   : 0.389   Min.   :0.0000   Min.   :1.000  
##  Class :character   1st Qu.: 7.133   1st Qu.:0.2640   1st Qu.:1.857  
##  Mode  :character   Median : 8.870   Median :0.3340   Median :2.000  
##                     Mean   :10.247   Mean   :0.3316   Mean   :2.039  
##                     3rd Qu.:12.080   3rd Qu.:0.4040   3rd Qu.:2.182  
##                     Max.   :45.273   Max.   :0.8570   Max.   :3.941  
##     Diameter     ClusteringCoefficient ConnectedComponents    Density      
##  Min.   :1.000   Min.   :0.0000        Min.   : 1.00       Min.   :0.0110  
##  1st Qu.:3.000   1st Qu.:0.6010        1st Qu.: 2.00       1st Qu.:0.1070  
##  Median :4.000   Median :0.6720        Median : 3.00       Median :0.1430  
##  Mean   :3.691   Mean   :0.6661        Mean   : 3.55       Mean   :0.1692  
##  3rd Qu.:4.000   3rd Qu.:0.7390        3rd Qu.: 5.00       3rd Qu.:0.2060  
##  Max.   :9.000   Max.   :1.0000        Max.   :29.00       Max.   :1.0000  
##      Edges         Characters    
##  Min.   :  7.0   Min.   :  5.00  
##  1st Qu.:150.0   1st Qu.: 26.00  
##  Median :204.0   Median : 35.00  
##  Mean   :230.3   Mean   : 36.69  
##  3rd Qu.:291.0   3rd Qu.: 45.00  
##  Max.   :906.0   Max.   :117.00

sum(is.na(movies))

## [1] 0

Preprocessing Strategy

Prior to clustering, the dataset undergoes standard preprocessing steps to ensure analytical reliability:

Verification of variable types and missing values
Selection of relevant numerical network features
Standardization using z-score normalization

Standardization is essential because the features are measured on different scales. Without scaling, variables with larger numeric ranges would disproportionately influence distance calculations and clustering outcomes.

clustering_features <- movies[, c("WeightedDegree", "Modularity", "PathLength", 
                                  "Diameter", "ClusteringCoefficient", 
                                  "ConnectedComponents", "Density", "Edges",
                                  "Characters")]

# Standardize the data (z-score)
movies_scaled <- as.data.frame(lapply(clustering_features, scale))

Clustering

Clustering Tendency Assessment

Before applying clustering algorithms, it is necessary to determine whether meaningful cluster structure exists in the dataset. This is evaluated using:

Hopkins statistic to measure spatial randomness
Ordered dissimilarity matrices (ODM) to visually assess grouping tendencies

# Hopkins statistic
set.seed(123)
hopkins_stat <- hopkins(movies_scaled, m=nrow(movies_scaled)-1)
cat("Hopkins Statistic:", hopkins_stat, "\n")

## Hopkins Statistic: 0.9984177

# Ordered Dissimilarity Matrix (ODM)
dist_matrix <- get_dist(movies_scaled, method="euclidean")
fviz_dist(dist_matrix, show_labels=FALSE) + 
  labs(title="Ordered Dissimilarity Matrix - MovieGalaxies")

Together, the Hopkins statistic and ordered dissimilarity matrix confirm that the dataset exhibits strong non-random structure and clear grouping tendencies, making it suitable for clustering analysis.

Determining the Optimal Number of Clusters

Multiple complementary methods were applied to determine the optimal number of clusters:

Elbow method (Within-Cluster Sum of Squares)
Silhouette method (Cluster cohesion and separation)
Gap statistic (Comparison with reference null distributions)
NbClust ensemble index (Multiple statistical criteria)

# Elbow method
fviz_nbclust(movies_scaled, FUN=kmeans, method="wss") + 
  labs(title="Elbow Method")

# Silhouette method
fviz_nbclust(movies_scaled, FUN=kmeans, method="silhouette") + 
  labs(title="Silhouette Method")

# Gap statistic
gap_stat <- clusGap(
  movies_scaled,
  FUN = function(x, k) kmeans(x, centers = k, nstart = 50, iter.max = 100),
  K.max = 10,
  B = 100
)

fviz_gap_stat(gap_stat)

# NbClust
opt_clusters <- NbClust(movies_scaled, distance="euclidean", min.nc=2, max.nc=10, 
                        method="kmeans", index="ch")
opt_clusters$Best.nc

## Number_clusters     Value_Index 
##           2.000         304.801

All four evaluation methods consistently identify k = 2 as the optimal clustering solution. This convergence across independent criteria provides strong evidence for adopting a two-cluster structure in subsequent analysis.

Clustering Methodology

Multiple clustering families are explored to capture different structural assumptions:

Partition-Based Methods
- K-Means
- PAM (Partitioning Around Medoids)
- CLARA (Clustering Large Applications)
These methods divide the dataset into non-overlapping clusters by minimizing within-cluster dissimilarity.
Hierarchical Methods
- Agglomerative clustering (AGNES)
- Divisive clustering (DIANA)
Hierarchical approaches provide dendrogram-based visualizations that reveal nested grouping relationships.
Density-Based Methods
- DBSCAN
This method identifies dense regions in feature space and isolates noise points, making it suitable for irregular cluster shapes.
Spectral Clustering

Spectral clustering leverages graph-based similarity structures and eigenvalue decomposition to detect non-linear cluster boundaries.

Cluster quality is evaluated using internal validation metrics:

Silhouette width to assess cohesion and separation
Calinski–Harabasz index to evaluate cluster compactness
Agglomerative and divisive coefficients for hierarchical models

Cluster agreement across algorithms is quantified using the Rand Index, allowing comparison of consistency between different clustering approaches.

Partition Based Methods

# K-Means
set.seed(123)
km_result <- kmeans(movies_scaled, centers = 2, nstart = 50)


fviz_cluster(km_result, data = movies_scaled,
             geom = "point") +
  labs(title = "K-Means Clustering (k = 2)")

d <- dist(movies_scaled)

sil_km <- silhouette(km_result$cluster, d)
fviz_silhouette(sil_km)

##   cluster size ave.sil.width
## 1       1  447          0.25
## 2       2  326          0.25

mean(sil_km[,3])

## [1] 0.2532182

ch_km <- calinhara(movies_scaled, km_result$cluster)
ch_km

## [1] 304.801

# PAM
pam_result <- eclust(movies_scaled, "pam", k = 2)

fviz_cluster(pam_result, geom = "point") +
  labs(title = "PAM Clustering (k = 2)")

sil_pam <- silhouette(pam_result$cluster, d)
fviz_silhouette(sil_pam)

##   cluster size ave.sil.width
## 1       1  299          0.26
## 2       2  474          0.25

mean(sil_pam[,3])

## [1] 0.2545918

ch_pam <- calinhara(movies_scaled, pam_result$cluster)
ch_pam

## [1] 299.9225

# CLARA
set.seed(123)  
clara_result <- eclust(movies_scaled, "clara", k = 2, samples = 10)

fviz_cluster(clara_result, geom = "point") +
  labs(title = "CLARA Clustering (k = 2)")

sil_clara <- silhouette(clara_result$cluster, d)
fviz_silhouette(sil_clara) + labs(title = "Silhouette - CLARA")

##   cluster size ave.sil.width
## 1       1  398          0.22
## 2       2  375          0.25

mean_sil_clara <- mean(sil_clara[, 3])
mean_sil_clara

## [1] 0.2363018

ch_clara <- calinhara(movies_scaled, clara_result$cluster)
ch_clara

## [1] 283.1059

# Distance matrix
d <- dist(movies_scaled)

# Silhouette scores
sil_km    <- silhouette(km_result$cluster, d)
sil_pam   <- silhouette(pam_result$cluster, d)
sil_clara <- silhouette(clara_result$cluster, d)

# Mean silhouette width
mean(sil_km[,3])

## [1] 0.2532182

mean(sil_pam[,3])

## [1] 0.2545918

mean(sil_clara[,3])

## [1] 0.2363018

Three partitioning algorithms were applied using the optimal cluster number (k = 2): K-Means, PAM, and CLARA. All methods produced a similar two-group separation pattern in the projected feature space, indicating stable underlying structure.

For K-Means, the average silhouette width was 0.253, indicating moderate cluster separation with reasonable internal cohesion. The Calinski–Harabasz index value of 304.80 further confirms strong between-cluster separation relative to within-cluster variance.

The PAM (Partitioning Around Medoids) algorithm achieved a comparable average silhouette width of 0.255 and a Calinski–Harabasz score of 299.92, demonstrating similar clustering quality while offering greater robustness to outliers due to its medoid-based formulation.

The CLARA algorithm produced a slightly lower average silhouette width of 0.236 and a Calinski–Harabasz index of 283.11, reflecting reduced clustering precision compared to PAM and K-Means, which is expected given CLARA’s subsampling-based approximation approach.

Overall, PAM achieved the highest silhouette score, indicating marginally better cluster cohesion and separation. Therefore, PAM clustering is selected as the primary clustering solution for subsequent analysis and interpretation.

Hierarchical Methods

# Agglomerative (AGNES)
hc_agnes <- agnes(movies_scaled, method="ward")
hc_agnes$ac

## [1] 0.9834808

pltree(hc_agnes, cex=0.6, hang=-1, main="AGNES Dendrogram")

# Divisive (DIANA)
hc_diana <- diana(movies_scaled)
hc_diana$dc

## [1] 0.943126

pltree(hc_diana, cex=0.6, hang=-1, main="DIANA Dendrogram")

# Ward.D2
dist_euc <- dist(movies_scaled, method="euclidean")
hc_complete <- hclust(dist_euc, method="complete")
hc_ward <- hclust(dist_euc, method="ward.D2")

# Plot dendrogram FIRST
plot(hc_ward, labels = FALSE, main = "Ward.D2 Dendrogram")


# Cut dendrogram
clusters_hc <- cutree(hc_ward, k=2)
rect.hclust(hc_ward, k=2, border=2:5)

sil_hc <- silhouette(clusters_hc, dist(movies_scaled))
fviz_silhouette(sil_hc)

##   cluster size ave.sil.width
## 1       1  303          0.24
## 2       2  470          0.25

mean(sil_hc[,3])

## [1] 0.2429154

# Dendrogram comparison
dend1 <- as.dendrogram(hc_complete)
dend2 <- as.dendrogram(hc_ward)
tanglegram(dend1, dend2)

Hierarchical clustering was performed using both agglomerative (AGNES) and divisive (DIANA) approaches to further validate the cluster structure.

The AGNES algorithm (Ward linkage) achieved an agglomerative coefficient of 0.983, indicating a very strong hierarchical structure with well-defined nested clusters. Similarly, the DIANA algorithm produced a high divisive coefficient of 0.943, confirming consistent separation patterns from a top-down perspective.

Standard hierarchical clustering using Ward.D2 linkage produced a clear two-branch dendrogram structure. Cutting the dendrogram at k = 2 resulted in two well-separated clusters, consistent with the optimal cluster number obtained earlier.

The silhouette analysis of the hierarchical solution yielded an average silhouette width of 0.243, which is comparable to the values obtained from partition-based methods. This indicates moderate but stable cluster separation.

Finally, dendrogram comparison between complete linkage and Ward linkage shows broadly similar grouping structures, further supporting the robustness of the two-cluster solution across hierarchical methods.

Hierarchical clustering confirms the presence of two dominant structural groups in the dataset and reinforces the stability of the selected clustering configuration.

Density Based Method (DBSCAN)

# DBSCAN
minPts <- ncol(movies_scaled) + 1

kNNdistplot(movies_scaled, k = minPts)
abline(h = quantile(kNNdist(movies_scaled, k = minPts), 0.95),
       lty = 2)

# Choose eps from elbow visually
eps_val <- 2.2

db_result <- dbscan(movies_scaled,
                    eps = eps_val,
                    MinPts = minPts)

fviz_cluster(db_result, movies_scaled,
             geom = "point",
             show.clust.cent = FALSE) +
  labs(title = "DBSCAN Clustering")

table(db_result$cluster)

## 
##   0   1 
##  21 752

The DBSCAN algorithm was applied to identify dense regions in the feature space and detect potential noise points. The k-nearest neighbor distance plot was used to guide the selection of the neighborhood radius parameter, resulting in ε = 2.2 and MinPts = 10.

The DBSCAN output produced one dominant dense cluster containing the majority of observations, while 21 points were classified as noise or outliers. This indicates that the dataset does not exhibit multiple well-separated density-based clusters, but instead forms a single dense core structure with a small number of isolated observations.

This behavior contrasts with the partition-based and hierarchical methods, which identified two global structural groups. The DBSCAN result suggests that the dataset is characterized by continuous structural variation rather than sharply separated density regions.

Therefore, DBSCAN is not selected as the primary clustering method for this dataset, but it provides useful insight into the presence of outliers and the overall density distribution of movie network structures.

Spectral Clustering

# Spectral clustering
set.seed(123)
spec_result <- specc(as.matrix(movies_scaled),
                     centers = 2,
                     kernel = "rbfdot")

# PCA visualization
pca <- prcomp(movies_scaled)

plot(pca$x[,1], pca$x[,2],
     col = spec_result,
     pch = 19,
     xlab = "PC1",
     ylab = "PC2",
     main = "Spectral Clustering (PCA Projection)")

sil_spec <- silhouette(as.integer(spec_result), dist(movies_scaled))
fviz_silhouette(sil_spec)

##   cluster size ave.sil.width
## 1       1    3          0.59
## 2       2  770          0.63

mean(sil_spec[,3])

## [1] 0.633699

table(spec_result)

## spec_result
##   1   2 
##   3 770

Spectral clustering was applied using a radial basis function (RBF) kernel in order to capture potential non-linear separation patterns in the data. The PCA projection of the resulting assignments shows that the algorithm isolates a very small number of extreme observations from the main data cloud.

The silhouette analysis produced a high average silhouette width (0.634). However, this value is driven by the fact that one cluster contains only three observations, while the remaining majority of movies are grouped into a single large cluster. This extreme imbalance indicates that spectral clustering is effectively behaving as an outlier separation mechanism rather than discovering meaningful global structure.

Although high silhouette values often suggest strong separation, in this case the result reflects trivial partitioning between a few isolated points and the main data mass. Such a solution lacks practical interpretability for structural segmentation of movie networks.

Despite its strong numerical silhouette score, spectral clustering is not selected as the primary clustering solution for this dataset. Instead, the partition-based and hierarchical approaches, which produce balanced and interpretable two-cluster solutions, provide more meaningful representations of the underlying collaboration network structure.

Cluster Agreement Analysis (Rand Index)

# PAM vs KMeans
rand_pam_km <- randIndex(pam_result$cluster, km_result$cluster)

# PAM vs Hierarchical
rand_pam_hc <- randIndex(pam_result$cluster, clusters_hc)

# PAM vs Spectral
rand_pam_spec <- randIndex(pam_result$cluster, as.integer(spec_result))

cat("PAM vs KMeans  -> Rand:", rand_pam_km, "\n")

## PAM vs KMeans  -> Rand: 0.8456403

cat("PAM vs Ward HC -> Rand:", rand_pam_hc, "\n")

## PAM vs Ward HC -> Rand: 0.7753307

cat("PAM vs Spectral-> Rand:", rand_pam_spec, "\n")

## PAM vs Spectral-> Rand: 0.004579441

The Rand Index was used to quantify the similarity between cluster assignments obtained from different algorithms.

The agreement between PAM and K-Means is high (Rand = 0.846), indicating that both partition-based methods identify very similar group structures. The agreement between PAM and hierarchical clustering (Ward) is also substantial (Rand = 0.775), confirming consistency between partitioning and hierarchical approaches.

In contrast, the agreement between PAM and spectral clustering is extremely low (Rand = 0.005), showing that spectral clustering produces a fundamentally different partitioning of the dataset. This divergence reflects the ability of spectral methods to capture non-linear similarity structures that are not detected by distance-based clustering algorithms.

These results show strong agreement among distance-based methods and clear divergence from spectral clustering, reinforcing PAM as the most consistent clustering solution.

Cluster Profiling

Once final cluster assignments are obtained, cluster profiling is performed by computing mean values of network features within each group. This allows interpretability by highlighting how clusters differ in:

Network density
Structural complexity
Actor participation scale
Connectivity strength

movies$cluster <- pam_result$cluster

table(movies$cluster)

## 
##   1   2 
## 299 474

cluster_profile <- aggregate(
  clustering_features,
  by = list(Cluster = movies$cluster),
  mean
)

cluster_profile

profile_long <- melt(cluster_profile, id.vars="Cluster")

ggplot(profile_long,
       aes(x=variable, y=value, fill=factor(Cluster))) +
  geom_bar(stat="identity", position="dodge") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(title="Network Feature Comparison Between Movie Clusters")

The final PAM-based clustering produced 299 movies in Cluster 1 and 474 movies in Cluster 2, indicating moderate imbalance but sufficient representation in both groups.

Cluster 1 exhibits higher weighted degree, higher edge count, and substantially higher density (≈ 0.252), alongside a higher clustering coefficient (≈ 0.73). Together, these features describe tightly connected collaboration networks with strong local cohesion and frequent co-appearances. This cluster also has fewer connected components and a smaller average cast size, consistent with compact ensemble-style productions.

In contrast, Cluster 2 shows lower density (≈ 0.118) and lower clustering (≈ 0.63), combined with more connected components and a larger number of characters. This pattern suggests broader casts organized into multiple loosely connected sub-networks, producing more fragmented and modular collaboration structures. Higher average path length and diameter reinforce the lower overall compactness of these networks.

Interpretation of the Two Movie Network Clusters

Based on the combined evidence from partition-based clustering, hierarchical methods, spectral clustering, and detailed feature profiling, the two identified clusters represent distinct structural archetypes of cinematic collaboration networks.

Cluster 1: Dense Ensemble Collaboration Networks

Cluster 1 corresponds to highly interconnected ensemble-style productions. These movies exhibit:

High network density and edge counts
High clustering coefficient
Strong weighted degree (intensive actor interactions)
Fewer connected components
Smaller cast size
Shorter path lengths and smaller network diameters

This profile reflects films where collaboration is concentrated and highly interlinked, consistent with ensemble casts and tightly integrated character interactions.

From a production perspective, these films tend to have:

Centralized interaction structures
High overlap among cast members
Strong narrative integration between characters

Examples of this archetype often include ensemble dramas, tightly scripted thrillers, and cast-driven productions where character interactions are dense and continuous.

Cluster 2: Large-Scale Modular Collaboration Networks

Cluster 2 represents broad, modular, and structurally fragmented movie networks. These movies are characterized by:

Lower network density
Lower clustering coefficient
Higher modularity
More connected components
Larger cast sizes
Longer path lengths and larger network diameters

This profile reflects films with larger casts where interactions are distributed across subgroups, consistent with segmented narratives and modular collaboration patterns.

These networks typically arise in:

Multi-plot or episodic narratives
Large franchise films
War, fantasy, or historical epics
Movies with parallel storylines

Such productions emphasize scale and narrative breadth rather than dense interpersonal interaction.

Structural Contrast Between the Two Clusters

The two clusters therefore capture a fundamental distinction in cinematic collaboration structure:

Cluster 1 emphasizes interaction intensity and cohesion
Cluster 2 emphasizes scale, modularity, and narrative segmentation

This contrast explains why distance-based methods converged on a stable two-group structure, while spectral clustering produced an outlier-driven partition that was not suitable for interpretable segmentation.

Practical Interpretation

From a network science perspective, the clustering results demonstrate that movies naturally organize into:

Interaction-driven ensembles with dense collaboration patterns
Scale-driven modular productions with distributed interaction structures

This insight shows that network topology provides meaningful structural classification beyond traditional genre labels and highlights how collaboration architecture differs across film production styles.

Dimension Reduction

The dataset contains multiple correlated network metrics, which complicates visualization and interpretation in high-dimensional space. Dimension reduction techniques are therefore applied to:

Reduce redundancy caused by correlated features
Improve visualization of cluster separation
Preserve key structural information
Validate clustering stability under reduced representations

Correlation Analysis

Correlation analysis is first performed to justify dimensional compression.

# distance matrix
d <- dist(movies_scaled, method = "euclidean")

# Correlation
cor_mat <- cor(movies_scaled)

corrplot(cor_mat,
         method = "circle",
         tl.cex = 0.8,
         title = "Correlation Matrix of Network Features")

The correlation matrix reveals strong interdependencies among several network features. In particular, weighted degree, edge count, network density, and clustering coefficient exhibit high positive correlations, indicating overlapping information content related to network connectivity and interaction intensity. Similarly, path length and diameter show strong positive association, reflecting shared structural distance properties.

These correlation patterns confirm substantial feature redundancy and provide empirical justification for applying PCA to compress the feature space while preserving dominant structural variation.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is used as the primary linear dimension reduction technique in this study. PCA transforms the original correlated network features into a new set of orthogonal components that maximize explained variance.

The use of PCA serves three main purposes:

To reduce dimensionality while preserving dominant structural information
To visualize movie networks in low-dimensional space
To interpret latent structural dimensions underlying collaboration patterns

# PCA 
pca <- prcomp(movies_scaled, center = FALSE, scale. = FALSE)

# Variance 
fviz_eig(pca, addlabels = TRUE) +
  labs(title = "PCA Scree Plot: Variance Explained")

The scree plot shows that the first principal component (PC1) explains approximately 48.6% of the total variance, while the second principal component (PC2) explains about 21.0%. Together, the first two components account for nearly 70% of the overall variance in the network feature space.

This indicates that a large proportion of the structural information contained in the original nine-dimensional dataset can be effectively represented using only two latent dimensions. Beyond the second component, each additional principal component contributes progressively smaller amounts of explained variance, suggesting diminishing returns in terms of information gain.

# PCA individuals colored by PAM cluster
fviz_pca_ind(pca,
             geom = "point",
             col.ind = factor(movies$cluster),
             palette = "jco") +
  labs(title = "PCA Projection (PC1 vs PC2) colored by PAM clusters")

The PCA projection shows the distribution of movie networks in the reduced two-dimensional space defined by the first two principal components. Coloring the observations by PAM cluster assignments reveals a clear separation between the two structural groups, primarily along the first principal component (PC1).

This indicates that the dominant variance directions identified by PCA align with the previously detected cluster structure, further supporting the consistency and interpretability of the two-group segmentation.

Although some overlap is present near the boundary region, the overall separation pattern confirms that the identified clusters align with the principal variance directions in the data. This visual consistency further supports the robustness and interpretability of the two-cluster solution.

# PCA variable plot
fviz_pca_var(pca) +
  labs(title = "PCA Variables (Loadings)")

# Contribution plots
fviz_contrib(pca, choice = "var", axes = 1) + 
  labs(title = "Variable Contribution to PC1")

fviz_contrib(pca, choice = "var", axes = 2) + 
  labs(title = "Variable Contribution to PC2")

The PCA loading plot illustrates how the original network features contribute to the first two principal components and how they relate to each other in the reduced feature space.

Variables such as modularity, path length, diameter, connected components, and number of characters load strongly in the positive direction of PC1, indicating that this axis represents increasing network scale and structural fragmentation. In contrast, density and weighted degree load in the negative direction of PC1, reflecting compact, highly connected collaboration networks.

Along PC2, strong positive contributions from edges, weighted degree, clustering coefficient, and number of characters indicate that this axis captures interaction intensity and collaboration activity. Networks with higher PC2 values therefore exhibit stronger actor connectivity and denser local interaction patterns.

The relative angles between vectors further highlight feature relationships. Variables pointing in similar directions (for example, path length and diameter) are positively correlated, while variables pointing in opposite directions (such as density and modularity) exhibit negative relationships.

The loading structure confirms that PCA successfully separates the dataset along two meaningful structural dimensions: one related to global network organization and fragmentation (PC1) and the other related to interaction strength and collaboration intensity (PC2).

round(pca$rotation[,1:2], 3)

##                          PC1    PC2
## WeightedDegree        -0.349  0.403
## Modularity             0.412  0.014
## PathLength             0.412  0.125
## Diameter               0.326  0.226
## ClusteringCoefficient -0.263  0.326
## ConnectedComponents    0.284  0.016
## Density               -0.421 -0.001
## Edges                 -0.074  0.681
## Characters             0.315  0.449

# Variance Table
var_exp <- (pca$sdev^2) / sum(pca$sdev^2)
pca_table <- data.frame(
  PC = paste0("PC", seq_along(var_exp)),
  Var_Explained = round(var_exp, 4),
  Cum_Var = round(cumsum(var_exp), 4)
)
pca_table

Similar to the results of the scree plot, The variance table shows that the first principal component (PC1) explains 48.62% of the total variance, while the second principal component (PC2) explains 21.04%. Together, the first two components account for approximately 69.65% of the overall variance in the dataset.

This confirms that nearly 70% of the structural information contained in the original nine-dimensional network feature space can be preserved using only two principal components. Including the third component increases the cumulative variance to over 80%, but with diminishing marginal gains in interpretability and visualization clarity.

Therefore, retaining PC1 and PC2 provides an effective balance between dimensional reduction and information retention. This supports their use for visualization, cluster interpretation, and structural analysis of movie collaboration networks.

Multidimensional Scaling (MDS)

MDS is applied as a complementary distance-preserving technique. Unlike PCA, which focuses on variance, MDS attempts to preserve pairwise distances between observations.

# mds
mds <- cmdscale(d, k = 2, eig = TRUE)

mds_xy <- as.data.frame(mds$points)
colnames(mds_xy) <- c("MDS1", "MDS2")

plot(mds_xy$MDS1, mds_xy$MDS2,
     col = movies$cluster,
     pch = 19,
     xlab = "MDS Dimension 1",
     ylab = "MDS Dimension 2",
     main = "MDS Projection colored by PAM clusters")

legend("topright",
       legend = paste("Cluster", sort(unique(movies$cluster))),
       col = sort(unique(movies$cluster)),
       pch = 19)

# goodness-of-fit 
mds_gof <- sum(mds$eig[1:2]) / sum(pmax(mds$eig, 0))
mds_gof

## [1] 0.6965266

Multidimensional Scaling (MDS) was applied to obtain a two-dimensional representation of the movie networks while preserving pairwise distance relationships from the original feature space. Unlike PCA, which maximizes variance, MDS focuses on maintaining the original similarity structure of the data.

The MDS projection shows a clear separation between the two PAM-derived clusters, primarily along the first MDS dimension. This indicates that the distance relationships between movies in the original high-dimensional space are well preserved in the reduced representation.

The computed goodness-of-fit value shows that a substantial proportion of the original distance variance is retained by the two-dimensional embedding, confirming that the MDS projection provides a reliable low-dimensional approximation of the network structure.

The consistency between MDS separation patterns and the PCA-based visualization further validates the robustness of the two-cluster solution.This consistency further supports that the observed cluster separation reflects genuine structural differences rather than projection artifacts.

Robustness Check Using Reduced Feature Space

To test clustering stability, PAM clustering is repeated using only principal component scores that explain the majority of variance. The resulting clusters are compared with full-space clustering using agreement indices and silhouette analysis.

var_exp <- (pca$sdev^2) / sum(pca$sdev^2)
r <- which(cumsum(var_exp) >= 0.70)[1]  

pca_scores <- pca$x[, 1:r]

set.seed(123)
pam_pca <- cluster::pam(pca_scores, k = 2)

fviz_cluster(pam_pca, geom = "point") +
  labs(title = "PAM Clustering (k = 2)")

randIndex(pam_pca$clustering, movies$cluster)

##       ARI 
## 0.7219673

sil_pam_pca <- silhouette(pam_pca$clustering, dist(pca_scores))
fviz_silhouette(sil_pam_pca)

##   cluster size ave.sil.width
## 1       1  357          0.28
## 2       2  416          0.34

mean(sil_pam_pca[,3])

## [1] 0.3121871

To evaluate the stability of the clustering solution under dimension reduction, PAM clustering was repeated using principal component scores instead of the original feature space. The number of retained components was chosen to preserve approximately 70% of the total variance, ensuring that most structural information was maintained while reducing dimensional complexity.

The resulting PCA-based PAM clustering shows a visually consistent separation pattern compared to the original feature-space clustering. Quantitative comparison using the Adjusted Rand Index (ARI) yields a value of 0.722, indicating strong agreement between the two clustering solutions.

Silhouette analysis in the reduced PCA space produces an average silhouette width of 0.312, which is higher than the silhouette scores obtained using the original feature space (~0.25). This indicates that dimensional compression preserves the essential cluster structure while improving compactness.

Conclusion

The analysis demonstrates that movie collaboration networks in the MovieGalaxies dataset naturally separate into two meaningful structural types rather than forming a continuous and uniform spectrum. One group is characterized by dense and cohesive collaboration patterns, with stronger connectivity, higher local clustering, higher network density, and fewer disconnected components. These properties reflect productions in which actors frequently co-appear and the cast operates as a tightly integrated ensemble. The second group exhibits larger cast sizes, lower overall density, more disconnected components, and stronger modular organization, corresponding to productions where collaboration is distributed across multiple subgroups and narrative segments.

Dimension reduction further supports this structural distinction by showing that the dominant variation in the dataset can be captured using a small number of latent dimensions. Specifically, the network structure is largely organized along two principal directions: a cohesion versus fragmentation (modularity) contrast and a collaboration intensity versus connectivity strength contrast. In low-dimensional projections, the two clusters remain clearly separated, indicating that the grouping reflects a consistent structural signal across multiple network properties rather than noise or dependence on a single metric. The stability of this pattern is reinforced by the strong agreement between clustering results obtained in the original and reduced feature spaces.

These findings suggest that the “Social DNA” of cinema, as captured through graph-based network features, provides a structurally grounded framework for distinguishing between ensemble-driven productions and large-scale modular collaboration architectures. This network-based perspective offers insight into film organization that extends beyond traditional descriptive categories such as genre or popularity.

References

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/T4HBA3

https://moviegalaxies.com/

The Social DNA of Cinema (Clustering & Dimension Reduction)

Mevin Biju (486097)