problem set 34

# Load required packages
packages <- c("igraph", "dplyr", "ggraph", "ggplot2", "RColorBrewer")

for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}

# Load Enron email data
enron <- read.table("Email-Enron.txt.txt", header = FALSE, sep = "\t")
colnames(enron) <- c("FromNodeId", "ToNodeId")

dim(enron)

## [1] 564   2

# Create undirected weighted edge list
edges_undirected <- enron %>%
  filter(FromNodeId != ToNodeId) %>%
  transmute(
    from = pmin(FromNodeId, ToNodeId),
    to   = pmax(FromNodeId, ToNodeId)
  ) %>%
  count(from, to, name = "weight")
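
To see what the pipeline above does, here is a minimal sketch on a hypothetical four-row edge list (the `toy` data frame and its values are made up for illustration, not taken from the Enron file). Self-loops are dropped, and pmin()/pmax() put each pair in a fixed order so that a message from 1 to 2 and a reply from 2 to 1 are counted as the same undirected edge.

```r
library(dplyr)

# Hypothetical toy input: two messages 1 to 2, one reply 2 to 1, one self-loop
toy <- data.frame(
  FromNodeId = c(1, 2, 1, 3),
  ToNodeId   = c(2, 1, 2, 3)
)

toy_edges <- toy %>%
  filter(FromNodeId != ToNodeId) %>%      # drop the 3-to-3 self-loop
  transmute(
    from = pmin(FromNodeId, ToNodeId),    # smaller id first
    to   = pmax(FromNodeId, ToNodeId)     # larger id second
  ) %>%
  count(from, to, name = "weight")        # one row per pair, weight = message count

toy_edges
```

The three messages between nodes 1 and 2 collapse into a single row with weight 3.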

# Build graph
g_undirected <- graph_from_data_frame(edges_undirected, directed = FALSE)

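# Assign weights explicitly (graph_from_data_frame() already carries the
# weight column over as an edge attribute, so this is a safeguard)
E(g_undirected)$weight <- edges_undirected$weight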

g_undirected

## IGRAPH c8f0432 UNW- 298 499 -- 
## + attr: name (v/c), weight (e/n)
## + edges from c8f0432 (vertex names):
##  [1] 0 --1  1 --10 1 --11 1 --12 1 --13 1 --14 1 --15 1 --16 1 --17 1 --18
## [11] 1 --19 1 --2  1 --20 1 --21 1 --22 1 --23 1 --24 1 --25 1 --26 1 --27
## [21] 1 --28 1 --29 1 --3  1 --30 1 --31 1 --32 1 --33 1 --34 1 --35 1 --36
## [31] 1 --37 1 --38 1 --39 1 --4  1 --40 1 --41 1 --42 1 --43 1 --44 1 --45
## [41] 1 --46 1 --47 1 --48 1 --49 1 --5  1 --50 1 --51 1 --52 1 --53 1 --54
## [51] 1 --55 1 --56 1 --57 1 --58 1 --59 1 --6  1 --60 1 --61 1 --62 1 --63
## [61] 1 --64 1 --65 1 --66 1 --67 1 --68 1 --69 1 --7  1 --70 1 --8  1 --9 
## [71] 10--11 10--12 10--13
## + ... omitted several edges

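# Detect communities using Louvain method
# (recent igraph versions visit vertices in random order, so membership
# can vary between runs unless a seed is set first)
com <- cluster_louvain(g_undirected, weights = E(g_undirected)$weight)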

# Assign cluster membership
V(g_undirected)$cluster <- membership(com)

# Create color palette dynamically
n_clust <- length(unique(V(g_undirected)$cluster))
pal <- colorRampPalette(brewer.pal(8, "Set2"))(n_clust)

# Network visualization using ggraph
set.seed(42)

p <- ggraph(g_undirected, layout = "fr") +
  geom_edge_link(
    aes(width = log1p(weight)),
    alpha = 0.15,
    colour = "grey40"
  ) +
  geom_node_point(
    aes(color = factor(cluster)),
    size = 3
  ) +
  scale_color_manual(values = pal, name = "Cluster") +
  scale_edge_width(range = c(0.2, 2), guide = "none") +
  theme_void()

print(p)

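# Calculate centrality measures
V(g_undirected)$degree <- degree(g_undirected)

# Note: betweenness() and closeness() treat `weights` as path lengths,
# so edges carrying more emails count as longer steps here
V(g_undirected)$betweenness <- betweenness(
  g_undirected,
  weights = E(g_undirected)$weight
)

V(g_undirected)$closeness <- closeness(
  g_undirected,
  weights = E(g_undirected)$weight
)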
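
igraph's betweenness() and closeness() read edge weights as path lengths, so an edge carrying many emails is treated as a long step. If frequent contact should instead count as proximity, one common option is to invert the counts first. The sketch below uses a hypothetical three-node graph (the names and counts are assumptions for illustration) and is not the calculation behind the table that follows.

```r
library(igraph)

# Hypothetical toy graph: a-b and b-c exchange 10 emails each, a-c only 1
toy <- graph_from_data_frame(
  data.frame(
    from   = c("a", "b", "a"),
    to     = c("b", "c", "c"),
    weight = c(10, 10, 1)
  ),
  directed = FALSE
)

# Raw counts as costs: the weak a-c tie is the cheapest route between a and c
bt_raw <- betweenness(toy, weights = E(toy)$weight)

# Inverted counts as costs: the route a-b-c through the strong ties is cheapest
bt_inv <- betweenness(toy, weights = 1 / E(toy)$weight)

rbind(raw = bt_raw, inverted = bt_inv)
```

In this toy case, b sits on the shortest path between a and c only under the inverted weights, so its betweenness is nonzero only in the second row.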

# Create centrality table
top_nodes <- tibble(
  node = V(g_undirected)$name,
  degree = V(g_undirected)$degree,
  betweenness = V(g_undirected)$betweenness,
  closeness = V(g_undirected)$closeness
) %>%
  arrange(desc(degree))

head(top_nodes, 10)

## # A tibble: 10 × 4
##    node  degree betweenness closeness
##    <chr>  <dbl>       <dbl>     <dbl>
##  1 46       112      19572.  0.00144 
##  2 27        78      12591.  0.00133 
##  3 1         70      12051.  0.00126 
##  4 5         62       8120.  0.00126 
##  5 9         39       6139.  0.00118 
##  6 45        25       2771.  0.00116 
##  7 39        18       1540.  0.00114 
##  8 13        15       2608.  0.00114 
##  9 7         14       1490.  0.00108 
## 10 50        11        233.  0.000778

# Summarize community sizes
cluster_summary <- tibble(cluster = V(g_undirected)$cluster) %>%
  count(cluster, name = "n_nodes") %>%
  arrange(desc(n_nodes))

cluster_summary

## # A tibble: 9 × 2
##   cluster    n_nodes
##   <membrshp>   <int>
## 1 3               56
## 2 1               53
## 3 4               49
## 4 6               43
## 5 5               31
## 6 8               26
## 7 2               20
## 8 7               18
## 9 9                2

# Identify smallest community
cluster_sizes <- table(V(g_undirected)$cluster)
smallest_cluster <- as.numeric(
  names(cluster_sizes)[which.min(cluster_sizes)]
)

# Extract subgraph
sub_nodes <- V(g_undirected)[cluster == smallest_cluster]
sub_g <- induced_subgraph(g_undirected, sub_nodes)

# Match color
cluster_levels <- sort(unique(V(g_undirected)$cluster))
cluster_index <- which(cluster_levels == smallest_cluster)
sub_color <- pal[cluster_index]

# Visualize subnetwork
set.seed(42)

p_sub <- ggraph(sub_g, layout = "fr") +
  geom_edge_link(alpha = 0.25, colour = "grey60") +
  geom_node_point(color = sub_color, size = 3, alpha = 0.9) +
  geom_node_text(aes(label = name), size = 3, vjust = 1.8) +
  theme_void()

print(p_sub)

Social Network Analysis (SNA) is a method for understanding how people, organizations, or other entities are connected and how those connections shape behavior, information flow, and power. Instead of focusing only on individual characteristics (like job title or rank), SNA looks at relationships: who communicates with whom, who acts as a bridge, and which groups form tightly connected clusters. This allows analysts to see patterns that are often invisible in spreadsheets or organizational charts.

SNA can be applied in many real-world settings. In businesses, it can be used to analyze communication networks within a company to identify informal leaders, collaboration bottlenecks, or isolated departments. In schools, administrators could examine student friendship networks to understand bullying patterns or identify socially isolated students. In public health, researchers use SNA to study how diseases spread through contact networks. Law enforcement and intelligence agencies also use SNA to examine criminal or extremist networks to determine key coordinators or brokers.

Subgroup or community-level analysis is especially important in these settings because networks are rarely uniform. Most real-world networks contain clusters, smaller groups of tightly connected individuals. These communities often represent teams, friend groups, departments, or coordinated cells. Understanding these subgroups helps reveal where influence is concentrated and how disruptions might affect the overall system. For example, removing one highly connected individual in a central cluster may significantly disrupt communication, while removing someone on the periphery may have little impact.
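
The contrast between removing a central and a peripheral individual can be sketched on a small made-up network. The graph below and the helper `components_after_removal()` are hypothetical and only illustrate the idea; they are not part of the Enron analysis.

```r
library(igraph)

# Hypothetical toy network: "hub" talks to four colleagues; a and b also talk directly
toy <- graph_from_literal(hub - a, hub - b, hub - c, hub - d, a - b)

# Number of connected components left after removing one vertex
components_after_removal <- function(g, v) {
  components(delete_vertices(g, v))$no
}

components_after_removal(toy, "hub")  # central node removed
components_after_removal(toy, "d")    # peripheral node removed
```

Removing hub splits the toy network into three pieces (a with b, c alone, d alone), while removing d leaves everyone else connected in a single component.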

As an original example, consider a large warehouse operation within a third-party logistics (3PL) company. While the official structure may show supervisors, leads, and associates, the real communication network may look very different. By mapping daily problem-solving interactions (such as who employees go to when a picking error occurs or when inventory discrepancies arise), SNA could identify informal leaders who serve as knowledge hubs. Degree centrality could show which employees are most frequently consulted, while betweenness centrality could identify individuals who act as bridges between shifts (for example, between day and night crews). Community detection could reveal natural working clusters, such as teams that collaborate heavily across receiving, picking, and kitting functions.
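
A minimal sketch of the degree versus betweenness distinction, using an invented warehouse network: two fully connected crews and one "swing" employee who links them. All node names and ties are assumptions made for illustration.

```r
library(igraph)

# Hypothetical crews: four day-shift and four night-shift associates,
# each crew fully connected, plus one "swing" employee linking d1 and n1
day   <- paste0("d", 1:4)
night <- paste0("n", 1:4)

edges <- rbind(
  t(combn(day, 2)),      # every pair within the day crew
  t(combn(night, 2)),    # every pair within the night crew
  c("d1", "swing"),
  c("swing", "n1")
)

wh <- graph_from_edgelist(edges, directed = FALSE)

centrality <- data.frame(
  node        = V(wh)$name,
  degree      = degree(wh),
  betweenness = betweenness(wh)
)

# Sort so the strongest bridges appear first
centrality[order(-centrality$betweenness), ]
```

In this toy network, swing has the fewest ties (degree 2) but the highest betweenness, because every shortest path between the day and night crews runs through that one person.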

Subgroup analysis in this warehouse example would be critical. If one small cluster handles a specialized SKU that slows picking time, isolating and analyzing that subgroup could reveal workflow inefficiencies or communication gaps. It could also show whether overtime or labor utilization issues stem from structural bottlenecks rather than staffing shortages.
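
One way to examine such a subgroup is to isolate it and compare its internal connectivity with its ties to everyone else. The sketch below assumes a hypothetical network in which three specialists (k1 to k3) reach the general pickers (p1 to p4) through a single tie; the names and edges are invented for illustration.

```r
library(igraph)

# Hypothetical network: specialists k1-k3 form a triangle, general pickers
# p1-p4 form a looser group, and a single tie (k1-p1) connects the two
net <- graph_from_literal(
  k1 - k2, k1 - k3, k2 - k3,
  p1 - p2, p2 - p3, p3 - p4, p1 - p3,
  k1 - p1
)

specialists <- c("k1", "k2", "k3")

# Isolate the subgroup and measure how tightly it is connected internally
sub <- induced_subgraph(net, specialists)
internal_density <- edge_density(sub)

# Count ties with exactly one endpoint inside the subgroup
endpoints <- ends(net, E(net))
boundary_ties <- sum(xor(endpoints[, 1] %in% specialists,
                         endpoints[, 2] %in% specialists))

c(internal_density = internal_density, boundary_ties = boundary_ties)
```

A subgroup that is fully connected internally but has only one tie to the rest of the floor, as here, is the kind of pattern that could point to a communication gap rather than a staffing shortage.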

Overall, Social Network Analysis provides a powerful way to move beyond formal titles and surface-level metrics to understand real influence, collaboration patterns, and structural vulnerabilities. By examining both the full network and its internal communities, organizations can make more informed decisions about leadership, resource allocation, risk management, and operational efficiency.

Terrylynn Diaz

2026-03-03
