This dataset is the Amazon product co-purchasing network, built from the “Customers Who Bought This Item Also Bought” feature on Amazon. It contains 334,863 nodes (products) and 925,872 edges (co-purchase links), and all nodes belong to a single connected component. The graph has an average clustering coefficient of 0.3967, around 667,000 triangles, and a diameter of 44, meaning the longest shortest path between two products is 44 steps. Overall, the network is large and sparse (about 5.5 links per product on average), yet every product is reachable from every other and local clustering is substantial, reflecting how products are grouped through frequent co-purchasing.
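
As a quick check on how sparse the full network is, the mean degree and density follow directly from the node and edge counts quoted above. This is a minimal sketch; the variable names are illustrative and not part of the original analysis.

```python
# Node and edge counts quoted in the dataset description above.
n_nodes = 334_863
n_edges = 925_872

# Each undirected edge adds one to the degree of two nodes.
mean_degree = 2 * n_edges / n_nodes

# Density: fraction of all possible undirected node pairs that are linked.
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(f"Mean degree: {mean_degree:.2f}")
print(f"Density: {density:.2e}")
```

A mean degree of roughly 5.5 against more than 330,000 nodes is what makes the graph sparse even though it forms a single component.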

import gzip
import networkx as nx
import statistics
import matplotlib.pyplot as plt

This code loads a compressed Amazon co-purchase network file and extracts up to 15,000 edges for analysis. It skips comment lines and lines with fewer than two fields, then builds an undirected graph with NetworkX from the remaining node pairs. Because a truncated sample of the edge list can break into many disconnected pieces, the code isolates the largest connected component (LCC) to focus the analysis on the biggest cluster of related nodes. This cleaned graph is then ready for computing metrics such as diameter, clustering, and degree statistics.
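
The parsing and LCC steps described above can be illustrated on a toy input before running them on the real file. The sketch below is only an illustration: the in-memory edge list, the `parse_edges` helper, and all `toy_*` names are hypothetical and are not part of the original analysis.

```python
import io
import networkx as nx

# Hypothetical miniature edge list in the same "u v" format, with a comment
# line, a blank line, and a one-field (malformed) line mixed in.
toy_file = io.StringIO("# toy edge list\n1\t2\n2\t3\n\n4\t5\n6\n3\t1\n")

def parse_edges(lines, limit):
    """Collect up to `limit` (u, v) pairs, skipping comments and short lines."""
    edges = []
    for line in lines:
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        edges.append((parts[0], parts[1]))
        if len(edges) >= limit:
            break
    return edges

toy_edges = parse_edges(toy_file, limit=10)
toy_graph = nx.Graph(toy_edges)

# connected_components yields one set of nodes per component.
toy_components = list(nx.connected_components(toy_graph))
toy_lcc_nodes = max(toy_components, key=len)
toy_lcc = toy_graph.subgraph(toy_lcc_nodes).copy()

print(len(toy_components), sorted(toy_lcc.nodes()), toy_lcc.number_of_edges())
```

Here the four valid edges form two components, and only the triangle on nodes 1, 2, and 3 survives as the LCC; the pair 4, 5 is discarded, just as small fragments are discarded in the real sample.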

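# Location where the file is saved.
# NOTE: SNAP publishes the edge list as com-amazon.ungraph.txt.gz, while the
# *.cmty.txt.gz files list one community per line; confirm this is the intended input.
path = "C:/Users/Warner_Beast/OneDrive/Documents/CUNY/DATA 620 Web Analytics/Homeworks/Week 3/com-amazon.all.dedup.cmty.txt.gz"
EDGE_LIMIT = 15000  # cap edges for speed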

edges = []
# Open the gz file in text mode; gzip decompresses on the fly, and undecodable bytes are ignored.
with gzip.open(path, mode='rt', encoding="utf-8", errors="ignore") as f:
    for line in f:
        # Skip blank lines and lines starting with '#' (comments/metadata).
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split the line into whitespace-separated fields (node identifiers).
        # Each valid line is expected to contain at least two fields (u v).
        parts = line.split()
        if len(parts) < 2 :
            continue
        u, v = parts[0], parts[1]
        edges.append((u, v))
        if len(edges) >= EDGE_LIMIT:
            break

G_full = nx.Graph() ## Create a new empty undirected graph object in NetworkX.
G_full.add_edges_from(edges) ## Add all the edges we collected into the graph.


if G_full.number_of_nodes() > 0:
    # Find the largest connected component (LCC) of the graph.
    # nx.connected_components yields one set of nodes per component; we pick the biggest one.
    lcc_nodes = max(nx.connected_components(G_full), key=len)
    G = G_full.subgraph(lcc_nodes).copy()
else:
    G = G_full.copy()
# Basic metrics 
n = G.number_of_nodes() # count the number of nodes
m = G.number_of_edges() # count the number of edges


# Degree statistics 
degrees = [d for _, d in G.degree()]
avg_deg = sum(degrees)/len(degrees) if degrees else 0
med_deg = statistics.median(degrees) if degrees else 0
min_deg = min(degrees) if degrees else 0 
max_deg = max(degrees) if degrees else 0
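# Clustering and diameter
from networkx.algorithms import approximation as approx
avg_clustering = nx.average_clustering(G)
# The exact diameter needs a shortest-path search from every node, so it is only
# computed for small graphs; approx.diameter returns a lower bound instead.
if n <= 10000:
    diameter = nx.diameter(G)
    diam_method = "exact"
else:
    diameter = approx.diameter(G)
    diam_method = "approximation"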
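
To show how the two diameter methods relate, the sketch below compares them on a small graph whose diameter is known in advance. The path graph and the `toy_path` name are assumptions made for illustration, and the `seed` value is arbitrary.

```python
import networkx as nx
from networkx.algorithms import approximation as approx

# Hypothetical test graph: a path of 10 nodes, whose diameter is 9 by construction.
toy_path = nx.path_graph(10)

exact_diameter = nx.diameter(toy_path)
# The approximation returns a lower bound on the true diameter.
approx_diameter = approx.diameter(toy_path, seed=42)

print(f"exact: {exact_diameter}, approximate lower bound: {approx_diameter}")
```

Because the approximation can only underestimate, a value reported with `diam_method = "approximation"` should be read as "at least this large".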

This basic graph analysis shows that the largest connected component (LCC) of the sampled graph contains 56 nodes connected by 65 edges. Most nodes are sparsely connected: the minimum and median degree are both 1, the average is about 2.32 (2 × 65 / 56), and the most connected node has 12 links, so a few hubs pull the mean above the median. The average clustering coefficient of 0.11177 is well below the 0.3967 reported for the full network, meaning relatively few triangles form among connected nodes in this sample. The diameter is 10, so the longest shortest path between any two nodes in this component spans ten steps, compared with 44 for the full network. These differences are expected, since the component covers only 56 of the 334,863 products.
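
To make the clustering coefficient concrete, the sketch below computes it on a four-node graph small enough to check by hand. The toy graph and variable names are hypothetical and are not taken from the Amazon data.

```python
import networkx as nx

# Hypothetical toy graph: a triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Local clustering: fraction of a node's neighbor pairs that are themselves linked.
local_clustering = nx.clustering(toy)
toy_avg_clustering = nx.average_clustering(toy)

for node, value in sorted(local_clustering.items()):
    print(node, round(value, 3))
print("average:", round(toy_avg_clustering, 4))
```

Nodes 0 and 1 sit in a closed triangle and score 1, node 2 has only one of its three neighbor pairs linked and scores 1/3, and the degree-1 node scores 0. Degree-1 nodes always contribute zero to the average, which is one reason a sample whose median degree is 1 shows a low average clustering coefficient.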

# ---- 3. Display results ----
print("=== Basic Graph Analysis ===")
## === Basic Graph Analysis ===
print(f"Nodes (LCC): {n}")
## Nodes (LCC): 56
print(f"Edges (LCC): {m}")
## Edges (LCC): 65
print(f"Degree min/median/avg/max: {min_deg} / {med_deg} / {avg_deg:.3f} / {max_deg}")
## Degree min/median/avg/max: 1 / 1.0 / 2.321 / 12
print(f"Average clustering coefficient: {avg_clustering:.5f}")
## Average clustering coefficient: 0.11177
print(f"Diameter: {diameter}  (method: {diam_method})")
## Diameter: 10  (method: exact)
# Keep the 50 highest-degree nodes and plot the subgraph they induce.
top_nodes = sorted(G.degree, key=lambda x: x[1], reverse=True)[:50]
H = G.subgraph([node for node, _ in top_nodes])
plt.figure(figsize=(8, 8))
pos = nx.spring_layout(H, seed=42)
nx.draw_networkx(H, pos, node_size=50, with_labels=False)
plt.title("Subgraph of top-50 high-degree nodes")
plt.show()

plt.figure(figsize=(10, 10))
pos = nx.spring_layout(H, seed=42)
nx.draw_networkx_nodes(H, pos, node_size=300, node_color="lightblue")
nx.draw_networkx_edges(H, pos, alpha=0.4)
nx.draw_networkx_labels(H, pos, font_size=8, font_color="black")
plt.title("Labeled Subgraph of Top-50 High-Degree Nodes")
plt.axis("off")
plt.show()

# Save to GEXF for Gephi
nx.write_gexf(G, "amazon_subgraph.gexf")
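
Before opening the export in Gephi, it can be worth confirming that the file reads back with the same structure. The following is a minimal sketch on a hypothetical three-node graph written to a temporary directory; the file name and variable names are illustrative only.

```python
import os
import tempfile
import networkx as nx

# Hypothetical small graph with string node IDs, like those parsed from the edge list.
toy = nx.Graph([("a", "b"), ("b", "c")])

with tempfile.TemporaryDirectory() as tmp_dir:
    gexf_path = os.path.join(tmp_dir, "toy.gexf")
    nx.write_gexf(toy, gexf_path)
    # Read the file back to confirm nodes and edges survive the round trip.
    reloaded = nx.read_gexf(gexf_path)

print(reloaded.number_of_nodes(), reloaded.number_of_edges())
```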

The YouTube video for this assignment can be found here and on GitHub.