---
title: "Pre-Module Network Data Analysis Individual Assignment"
author: "Jose Beltran Vilardy"
output:
  html_notebook: default
  word_document: default
---

## Table of contents

1. [Download data](<#1. Download data>)
2. [Network structure visualization](<#2. Network structure visualization>)
3. [Data analysis](<#3. Data analysis>)
    - 3.1. [Out-degree distribution](<#3.1 Out-degree distribution>)
    - 3.2. [In-degree distribution](<#3.2 In-degree distribution>)
    - 3.3. [In-degree distribution - log scale](<#3.3 In-degree distribution - log scale>)
    - 3.4. [Average number of inbound co-purchase links, the standard deviation, and the maximum](<#3.4 Average number of inbound co-purchase links, the standard deviation, and the maximum>)
    - 3.5. [10 products with the most inbound co-purchase links](<#3.5 10 products with the most inbound co-purchase links>)




The purpose of this report is to provide the answers and solutions to the Amazon pre-class assignment.


# 1. Download data <a name="1. Download data"></a>


First, the required packages, including igraph, are installed if they are missing and then loaded.


```{r, warning=FALSE, message = FALSE}

# The duplicated "readr" entry has been dropped from the list.
packages <- c("dplyr", "igraph", "readr", "ggplot2", "readxl", "tidyverse", "ggthemes", "knitr", "extrafont", "scales", "lubridate")
# Check which packages are not installed on the system and install only those.
missing_packages <- setdiff(packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
# Attach the packages.
for (package in packages) {
  library(package, character.only = TRUE)
}


```


Second, the required data is loaded from local files using the read.table and read_csv functions.

```{r, warning=FALSE, message = FALSE, fig.width = 12, fig.height = 10}

data1 <- as.data.frame(read.table(file = "graph_subset_rank1000.txt"))
data <- as.data.frame(read.table(file = "graph_subset_rank1000_cc.txt"))
data2 <- as.data.frame(read.table(file = "graph_complete.txt"))
all_books <- read_csv("id_to_titles.csv")


```



# 2. Network structure visualization <a name="2. Network structure visualization"></a>

The file graph_subset_rank1000.txt is loaded to generate a visualization of the network structure.

```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}

data1 <- as.data.frame(read.table(file = "graph_subset_rank1000.txt"))
routes_igraph <- graph_from_data_frame(data1, directed = FALSE)
plot(routes_igraph, 
     edge.arrow.size = 10, 
     layout= layout_in_circle(routes_igraph),  
     edge.width = 1, 
     vertex.label=NA, 
     vertex.size=3.5,
     edge.color="#808080",
     vertex.color = "#F0E130",
     main = "Network  Salesrank under 1,000 - layout_in_circle"
     )

```

The initial plot is generated using the “layout_in_circle” layout. However, it is very hard to gather insights from this layout. For example, it is not possible to see whether the graph is connected.

Other layouts, such as layout.kamada.kawai, provide a better visualization. With this layout it is possible to see that the graph is disconnected, which means it breaks apart naturally into a set of connected groups of nodes called components.


```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}

data1 <- as.data.frame(read.table(file = "graph_subset_rank1000.txt"))
routes_igraph <- graph_from_data_frame(data1, directed = FALSE)
plot(routes_igraph, 
     edge.arrow.size = 10, 
     layout= layout.kamada.kawai,  
     edge.width = 1, 
     vertex.label=NA, 
     vertex.size=3.5,
     edge.color="black",
     vertex.color = "#F0E130",
     main = "Network  Salesrank under 1,000 - layout.kamada.kawai"
     )

```
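
Whether a graph is connected can also be checked numerically rather than read off a plot. The sketch below is a minimal illustration on a made-up edge list (the `toy_edges` data frame is hypothetical, not the assignment data). It uses igraph's `is_connected()` and `components()` to count the components and report their sizes.

```r
library(igraph)

# Hypothetical toy edge list: A-B-C form one group, D-E another
toy_edges <- data.frame(from = c("A", "B", "D"),
                        to   = c("B", "C", "E"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# FALSE when the graph splits into more than one component
connected <- is_connected(toy_graph)

# components() returns the membership of each vertex, the size of each
# component (csize), and the number of components (no)
toy_components <- components(toy_graph)
n_components <- toy_components$no
component_sizes <- sort(toy_components$csize, decreasing = TRUE)
```

The same calls could presumably be applied to `routes_igraph` to confirm what the Kamada-Kawai plot suggests.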

The graph below uses the “layout_nicely” layout, which makes the components within the network easier to see. A larger connected component sits in the center of the plot. It has more nodes than any other component in the graph, which could indicate a cluster of related Amazon products.

```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}

data1 <- as.data.frame(read.table(file = "graph_subset_rank1000.txt"))
routes_igraph <- graph_from_data_frame(data1, directed = FALSE)
plot(routes_igraph, 
     edge.arrow.size = 10, 
     layout= layout_nicely(routes_igraph ),  
     edge.width = 1, 
     vertex.label=NA, 
     vertex.size=3.5,
     edge.color="black",
     vertex.color = "#F0E130",
     main = "Network  Salesrank under 1,000 - layout_nicely "
     )

```

Dividing a graph into its components is, of course, only a first, global way of describing its structure. Within a given component there may be a richer internal structure that matters for interpreting the network. For this reason, the largest component in the network of products with salesrank under 1,000 is loaded from the “graph_subset_rank1000_cc.txt” dataset and plotted separately to give a better understanding of the network.


The graph below displays the largest component in the network. In this graph every node can reach every other node by a path, which indicates that the graph is connected.  

```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}

data <- as.data.frame(read.table(file = "graph_subset_rank1000_cc.txt"))
routes_igraph1 <- graph_from_data_frame(data, directed = FALSE)
plot(routes_igraph1, 
     edge.arrow.size = 20, 
     layout= layout.kamada.kawai,
     vertex.label=NA, 
     edge.color="black",
     vertex.color = "#F0E130",
     vertex.size=4,
     main = "The largest component in the network of products with salesrank under 1,000")

```
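
The file graph_subset_rank1000_cc.txt already contains the largest component. As a rough sketch of how such a subset could be derived from a full graph (an assumption about the approach, not necessarily how the file was produced), the component with the most vertices can be selected with `components()` and `induced_subgraph()`. The edge list is hypothetical.

```r
library(igraph)

# Hypothetical toy edge list with a 4-node component and a 2-node component
toy_edges <- data.frame(from = c("A", "B", "C", "X"),
                        to   = c("B", "C", "D", "Y"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

comp <- components(toy_graph)
# Index of the component with the most vertices
largest_id <- which.max(comp$csize)
# Keep only the vertices that belong to that component
keep <- as.integer(which(comp$membership == largest_id))
largest <- induced_subgraph(toy_graph, keep)

# By construction, the extracted subgraph is connected
is_connected(largest)
```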

Additionally, the graph below uses the degree function to set the node size, so that nodes in the component with more co-purchase links are drawn larger. Because the graph is built with directed = FALSE, the degree counts a node's incoming and outgoing links together.
 


```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}

data <- as.data.frame(read.table(file = "graph_subset_rank1000_cc.txt"))
routes_igraph1 <- graph_from_data_frame(data, directed = FALSE)
# Size each vertex by its degree in the graph being plotted (routes_igraph1,
# not routes_igraph). The graph is undirected, so degree counts all links of a node.
plot.igraph(routes_igraph1, 
            edge.arrow.size = 20, 
            layout= layout.kamada.kawai, 
            edge.color="black",
            vertex.color = "#F0E130",
            vertex.size=degree(routes_igraph1),
            main = "The largest component in the network of products with salesrank under 1,000",
            vertex.label=NA)


```
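
Raw degree values can make low-degree nodes almost invisible and hubs very large. One possible refinement, shown here as a sketch on a hypothetical toy graph, rescales the degree linearly into a fixed size range before plotting. The range of 3 to 15 is an arbitrary choice, not a value taken from the assignment.

```r
library(igraph)

# Hypothetical toy edge list: A is the hub
toy_edges <- data.frame(from = c("A", "A", "A", "B"),
                        to   = c("B", "C", "D", "C"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

deg <- degree(toy_graph)  # named vector: number of links per vertex

# Linearly rescale degree to the range [size_min, size_max]
size_min <- 3
size_max <- 15
vertex_sizes <- size_min +
  (deg - min(deg)) / (max(deg) - min(deg)) * (size_max - size_min)

plot(toy_graph, vertex.size = vertex_sizes, vertex.label = NA)
```

The rescaling assumes that at least two vertices differ in degree; otherwise the denominator is zero.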

# 3. Data analysis <a name="3. Data analysis"></a>

## 3.1 Out-degree distribution <a name="3.1 Out-degree distribution"></a>

To count the number of outgoing links from each product page, the table function is used to tally how often each product appears as the source a of a link a -> b. However, this tally does not include products with zero outgoing links, so a unique list of all products is built and merged with the tally table to account for the zero values. The code below implements this process.


```{r, warning=FALSE, message = FALSE, fig.width = 12, fig.height = 20}
data2 <- as.data.frame(read.table(file = "graph_complete.txt"))
# a = source product, b = target product of a co-purchase link a -> b
colnames(data2) <- c("a", "b")

# Count the outbound links of each source product
data3 <- data.frame(table(data2$a))
colnames(data3) <- c("a", "Freq")

# Getting unique Id values across both columns
a <- data2%>%select(a)
a <- unique(a)
b <- data2%>%select(b)
colnames(b) <- c("a")
b <- unique(b)
c <- rbind(a,b)
c <- unique(c)

# Merging unique values so that products with 0 outbound links are kept
merge <- merge(x= c, y= data3, by=c("a"), all.x = TRUE)
merge <- merge%>%mutate(Freq = ifelse(is.na(Freq), 0, Freq))
# Note: despite its name, in_data holds the out-degree frequency table
in_data <- data.frame(table(merge$Freq))
in_data

```
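
The same out-degree counts, zeros included, can be cross-checked with igraph. This is an alternative sketch, not the method used above: when an edge list is loaded as a directed graph, every product that appears in either column becomes a vertex, so `degree()` with `mode = "out"` reports 0 for products that only receive links. The edge list `toy_links` is hypothetical.

```r
library(igraph)

# Hypothetical edge list using the same column names as data2 (a -> b)
toy_links <- data.frame(a = c(1, 1, 2, 3),
                        b = c(2, 3, 3, 4))
toy_graph <- graph_from_data_frame(toy_links, directed = TRUE)

# Out-degree: links leaving each product; product 4 only receives links
out_deg <- degree(toy_graph, mode = "out")
# In-degree: links arriving at each product; product 1 receives none
in_deg <- degree(toy_graph, mode = "in")

# Frequency table of out-degree values, zeros included
out_deg_table <- table(out_deg)
```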


The graph below displays the out-degree distribution: the maximum number of outbound links is 5, and the most frequent out-degree is 4.

```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 5}

ggplot(data = merge, aes(Freq)) +
  geom_histogram(bins = 6, fill="#00BED8", color = "#00BED8") + theme_classic() +
  theme(
        text=element_text(size=10,  family="Comic Sans MS"),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor =  element_blank(),
        axis.line = element_line(color = "gray"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "Out-degree distribution Plot", x = "Out-Degree", y = "Frequency",
       subtitle = "The maximum number of outbound links is 5 and the most frequent out-degree is 4.",
       caption = "Jose Vilardy (2019) | Network Analytics") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

```


## 3.2 In-degree distribution <a name="3.2 In-degree distribution"></a>

The graph below displays the in-degree distribution. It differs markedly from the out-degree distribution: a few products have more than 200 incoming links, with a maximum of 549, while a much larger number of products have 0 or 1 incoming links. In other words, only a few popular books attract co-purchase links from many other books.


```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 5}
data4 <- data.frame(table(data2$b))
colnames(data4) <- c("b", "Freq")

colnames(c) <- c("b")
merge2 <- merge(x= c, y= data4, by=c("b"), all.x = T)
merge2 <- merge2%>%mutate(Freq = ifelse(is.na(Freq), 0, Freq))
out_data <- data.frame(table(merge2$Freq))

ggplot(data = merge2, aes(x=Freq)) +
  geom_histogram(bins = 100, fill="#00BED8", color = "#00BED8") +  theme_bw() + 
  theme(
        text=element_text(size=10,  family="Comic Sans MS"),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor =  element_blank(),
        axis.line = element_line(color = "gray"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "In-degree distribution Plot", x = "In-Degree", y = "Frequency",
       subtitle = "A larger number of products have 0 or 1 incoming links. The maximum in-degree is 549.",
       caption = "Jose Vilardy (2019) | Network Analytics") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

```

The in-degree distribution has a long tail, a shape characteristic of power-law distributions: the distribution is skewed so that a small number of products have dramatically higher values than the rest of the population, producing a curve whose tail falls off slowly as the value increases.

```{r, warning=FALSE, message = FALSE, fig.width = 12, fig.height = 20}

merge.l <- merge2%>%mutate(Freq = ifelse(Freq == 0, log(0.1), log(Freq)))
out_data2 <- data.frame(table(merge.l$Freq))
out_data2

```
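
A common informal check for a power-law-like tail is to look at the frequency of each degree value on log-log axes, where a power law appears as a roughly straight, downward-sloping line. The sketch below uses made-up in-degree values (the vector `toy_in_degree` is hypothetical) and a simple linear fit on the log-log scale. This is a rough heuristic under those assumptions, not a formal test of a power law.

```r
# Hypothetical in-degree values: many small values, a few large ones
toy_in_degree <- c(rep(0, 50), rep(1, 25), rep(2, 12),
                   rep(4, 6), rep(8, 3), 16, 64)

# Frequency of each distinct in-degree value
counts <- as.data.frame(table(in_degree = toy_in_degree),
                        stringsAsFactors = FALSE)
counts$in_degree <- as.numeric(counts$in_degree)

# log(0) is undefined, so only positive degrees enter the fit
positive <- counts[counts$in_degree > 0, ]
fit <- lm(log(Freq) ~ log(in_degree), data = positive)

# A clearly negative slope is consistent with a long-tailed distribution
slope <- unname(coef(fit)[2])
```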

## 3.3 In-degree distribution - log scale <a name="3.3 In-degree distribution - log scale"></a>

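Because the logarithm of 0 is undefined, products with 0 inbound links were first assigned the value 0.1 so that the log function could be applied. These products therefore appear at log(0.1) ≈ -2.3 in the plot.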

```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 5}

ggplot(data = merge.l, aes(x=Freq)) +
  geom_histogram(bins = 10, fill="#00BED8", color = "#00BED8") +  theme_bw() + 
  theme(
        text=element_text(size=10,  family="Comic Sans MS"),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor =  element_blank(),
        axis.line = element_line(color = "gray"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(title = "In-degree distribution Plot - log scale", x = "log(In-Degree)", y = "Frequency",
       subtitle = "The plot indicates that the most frequent number of incoming links is 0 (log(0.1) = -2.3)",
       caption = "Jose Vilardy (2019) | Network Analytics") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

```

After applying the log function to the in-degree values, the distribution is still skewed. The plot indicates that the most frequent number of incoming links is 0 (log(0.1) ≈ -2.3). A high number of books also have between 1 and 7 in-links.
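
The replacement of 0 by 0.1 is needed because R returns `-Inf` for `log(0)`, which a histogram cannot place on the axis. The short sketch below illustrates this on a hypothetical vector, next to `log1p()`, a different and commonly used transformation (log(1 + x)) that maps 0 to 0 without choosing a replacement value. It is shown only for comparison and is not what the report uses.

```r
# Hypothetical in-degree values, including a zero
toy_in_degree <- c(0, 1, 7, 549)

# log(0) evaluates to -Inf in R
log_of_zero <- log(0)

# Approach used in this report: treat 0 as 0.1 before taking the log
log_shifted <- ifelse(toy_in_degree == 0, log(0.1), log(toy_in_degree))

# Alternative for comparison: log(1 + x), which maps 0 to 0
log_plus_one <- log1p(toy_in_degree)
```

With the 0.1 replacement the zero-degree products land to the left of the products with one in-link (log(1) = 0), so the two groups stay distinguishable in the histogram.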


## 3.4 Average number of inbound co-purchase links, the standard deviation, and the maximum <a name="3.4 Average number of inbound co-purchase links, the standard deviation, and the maximum"></a>

```{r, warning=FALSE, message = FALSE, fig.width = 6, fig.height = 5}

Measures <- c("Mean", "Median", "Standard deviation", "Maximum")
# Variable names are chosen so that they do not shadow the base functions sd() and max()
in_mean <- mean(merge2$Freq)
in_median <- median(merge2$Freq)
in_sd <- sd(merge2$Freq)
in_max <- max(merge2$Freq)

statistics <- c(in_mean, in_median, in_sd, in_max)
Table <- data.frame(Measures, statistics)
Table <- Table%>%mutate(statistics = round(statistics,2))
Table

```

The gap between the mean and the median is consistent with the positive skewness of the data. The maximum value, 549, lies far above the mean, which is evidence of extreme values in the distribution.
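
To illustrate why a mean above the median points to positive skew, the sketch below computes both on a small made-up vector with one extreme value, together with the moment-based skewness coefficient. The vector `toy_freq` is hypothetical and much smaller than the real data.

```r
# Hypothetical link counts: mostly small values plus one extreme value
toy_freq <- c(0, 0, 1, 1, 2, 50)

toy_mean <- mean(toy_freq)      # pulled upward by the extreme value
toy_median <- median(toy_freq)  # unaffected by the extreme value

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
centered <- toy_freq - toy_mean
toy_skewness <- mean(centered^3) / mean(centered^2)^(3 / 2)
```

Here the mean is 9 while the median is 1, and the skewness coefficient is positive, the same pattern the table above is read for.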


## 3.5 10 products with the most inbound co-purchase links <a name="3.5 10 products with the most inbound co-purchase links"></a>

The table below displays the top 10 products with the most inbound co-purchase links.


```{r, warning=FALSE, message = FALSE, fig.width = 10, fig.height = 10}
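all_books <- read_csv("id_to_titles.csv")
# Rename the columns so the id column matches merge2$b
colnames(all_books) <- c("b", "title")

# Attach titles to the in-degree counts, keeping products without a title
merge3 <- merge(x= merge2, y= all_books, by=c("b"), all.x = TRUE)
# Sort by in-degree, highest first, and keep the first 10 rows
merge3 <- merge3%>%arrange(desc(Freq))%>%select(title, Freq)
result <- head(merge3, 10)
result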

```
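
The same ranking can be written as a single dplyr pipeline. The sketch below is one possible equivalent on hypothetical toy tables (`toy_degree` and `toy_titles` are made up and only mimic the columns of `merge2` and `all_books`). It uses `left_join()` and `slice_max()`, with `with_ties = FALSE` so that exactly `n` rows are returned, as `head()` does.

```r
library(dplyr)

# Hypothetical in-degree counts and titles; product 2 has no title
toy_degree <- data.frame(b = c(1L, 2L, 3L, 4L), Freq = c(5, 0, 12, 7))
toy_titles <- data.frame(b = c(1L, 3L, 4L),
                         title = c("Book one", "Book three", "Book four"))

top_products <- toy_degree %>%
  left_join(toy_titles, by = "b") %>%          # products without a title get NA
  slice_max(Freq, n = 2, with_ties = FALSE) %>% # rows with the highest Freq first
  select(title, Freq)
```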