---
title: "Social Network Team Assignment"
author: "Team 14"
date: "11-09-2018"
output: html_notebook
---
1. 	Delete products that are not books from “products” and “copurchase” files.
Note: In social network analysis, it is important to define the boundary of your work; in other words, the boundary of the network.

```{r}
setwd("/Users/brichew/Desktop/UCI MSBA/Fall/BANA 277/Social Networks HW/")
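library(dplyr)  # provides filter()

# choose the copurchase file first, then the products file
copurchase <- read.csv(file.choose(), header = TRUE)
products <- read.csv(file.choose(), header = TRUE)

# keep only books with salesrank <= 150000 and not equal to -1
products <- filter(products, group == "Book" &
                     salesrank <= 150000 &
                     salesrank != -1)

# keep only edges whose Source and Target both remain in products
copurchase <- filter(copurchase, Source %in% products$id &
                        Target %in% products$id)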

```

2. 	Create a variable named in-degree, to show how many “Source” products people who buy “Target” products buy; i.e. how many edges are to the focal product in “co-purchase” network.

```{r}
library(igraph)
library(sqldf)  # sqldf() is used below

# rename the first column of copurchase to "id" so it can be merged with products
colnames(copurchase)[1] <- "id"
combined <- merge(x = products, y = copurchase, by = "id", all.x = TRUE)
head(combined)

# clean data
combined$Target <- as.numeric(combined$Target)
colnames(combined)[3] <- "group1"
combined1 <- sqldf("SELECT id, Target, title, group1, salesrank, review_cnt, downloads, rating FROM combined WHERE Target != 0")
head(combined1)
# the first two columns (id, Target) define the directed edges
g <- graph.data.frame(combined1, directed = TRUE)

in_degree <- degree(g, mode = 'in')
class(in_degree)
```

In-degree is the number of edges pointing into a node; in our case, the number of Source products that link to a given Target product.


3. 	Create a variable named out-degree, to show how many “Target” products people who buy “Source” product also buy; i.e., how many edges are from the focal product in “co-purchase” network.

```{r}
out_degree <- degree(g, mode = 'out')
```
Out-degree is the number of edges leaving a node; in our case, the number of Target products that a given Source product links to.
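
As a quick sanity check on the two definitions, here is a minimal sketch on a small made-up edge list. The data frame `toy_edges` and the node names are hypothetical and are not part of the assignment data.

```r
library(igraph)

# Hypothetical toy edge list: each row is one edge, Source -> Target
toy_edges <- data.frame(
  Source = c("A", "A", "B", "C"),
  Target = c("B", "C", "C", "D")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

toy_in  <- degree(toy_g, mode = "in")   # edges arriving at each node
toy_out <- degree(toy_g, mode = "out")  # edges leaving each node

toy_in[["C"]]   # C is the Target of both A and B
toy_out[["A"]]  # A is the Source for both B and C
```

A node that only appears as a Source (here `A`) has in-degree 0, and a node that only appears as a Target (here `D`) has out-degree 0.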


4. 	Pick up one of the products (in case there are multiple) with highest degree (in-degree + out-degree), and find its subcomponent, i.e., all the products that are connected to this focal product. From this point on, you will work only on this subcomponent.
```{r}
# total degree (in-degree + out-degree) of every node
all_degree <- degree(g, mode = 'total')
# maximum total degree across all nodes
max(all_degree)
# nodes that reach the maximum
all_degree[all_degree == max(all_degree)]
# subcomponent of the chosen focal product, ignoring edge direction
sub <- subcomponent(g, "33", 'all')
sub
```

We ran the `degree` function to compute the total degree (in-degree plus out-degree) of every node and took the maximum, which is 53. We then looked up which nodes have a total degree of 53 and found two: node 4429 and node 33. We chose node 33 as the focal product for our subcomponent.
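
The same steps can be sketched on a small made-up network with two disconnected pieces. The names `toy_edges`, `toy_top`, and `toy_sub_ids` are hypothetical; only `degree`, `subcomponent`, and `as_ids` are the igraph calls already used in this notebook.

```r
library(igraph)

# Hypothetical toy network: {A, B, C, D} are linked, {X, Y} form a separate piece
toy_edges <- data.frame(
  Source = c("A", "A", "D", "X"),
  Target = c("B", "C", "A", "Y")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

toy_total <- degree(toy_g, mode = "total")
# names of all nodes tied for the highest total degree
toy_top <- names(toy_total)[toy_total == max(toy_total)]

# everything connected to the first top node, ignoring edge direction
toy_sub <- subcomponent(toy_g, toy_top[1], mode = "all")
toy_sub_ids <- as_ids(toy_sub)
```

Selecting by `max(toy_total)` rather than a typed-in number keeps the lookup correct if the data change; when several nodes tie, `toy_top[1]` simply takes the first one.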

5. 	Visualize the subcomponent using iGraph, trying out different colors, node and edge sizes and layouts, so that the result is most appealing. Find the diameter, and color the nodes along the diameter. Provide your insights from the visualizations.
```{r}
graph <- induced_subgraph(g, sub)

V(graph)
E(graph)

V(graph)$label <- V(graph)$name
V(graph)$degree <- degree(graph)

plot(graph,
     vertex.color='pink',
     vertex.size= V(graph)$degree*0.2,
     edge.arrow.size=0.01,
     vertex.label.cex=0.01,
     layout=layout.kamada.kawai)

```

```{r}
diameter(graph, directed = T, weights = NA)
d <- get_diameter(graph, weights = NULL)

```
Diameter is the length of the longest shortest path between any two vertices, and we found it to be 9. The 10 vertices on that path, colored red in the graph, are 37895, 27936, 21584, 10889, 11080, 14111, 4429, 2501, 3588, 6676.
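
To illustrate why a diameter of 9 corresponds to 10 vertices, here is a minimal sketch on a made-up directed chain. The graph and the names `toy_diam` and `toy_path` are hypothetical.

```r
library(igraph)

# Hypothetical toy graph: chain A -> B -> C -> D with a side branch B -> E
toy_edges <- data.frame(
  Source = c("A", "B", "C", "B"),
  Target = c("B", "C", "D", "E")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

toy_diam <- diameter(toy_g, directed = TRUE, weights = NA)   # length in edges
toy_path <- as_ids(get_diameter(toy_g, directed = TRUE))     # vertices on that path
length(toy_path) == toy_diam + 1  # a path with k edges visits k + 1 vertices
```

Here the longest shortest path runs from `A` to `D` over 3 edges, so it visits 4 vertices.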
```{r}
V(graph)$color<-"yellow"
V(graph)$color[d]<-"red"

#testing different colors
plot(graph,
     vertex.color=V(graph)$color,
     vertex.size= V(graph)$degree*0.2,
     edge.arrow.size=0.01,
     vertex.label.cex=0.01,
     layout=layout.kamada.kawai)
```
The graph shows 904 vertices. These are the ids of the books connected to the book with id = 33, directly or indirectly. The size of a vertex is proportional to its degree: the bigger the vertex, the more vertices link to it. The edges are unweighted, so the drawn distance between vertices comes from the Kamada-Kawai layout, which tends to place vertices that are only a few links apart close together. The vertices clustered in the middle with short edges are therefore books that are closely connected to one another, while the vertices on the periphery are more loosely connected to the rest of the network.



6. 	Compute various statistics about this network (i.e., subcomponent), including degree distribution, density, and centrality (degree centrality, closeness centrality and between centrality), hub/authority scores, etc. Interpret your results.
```{r}
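deg_dist <- degree_distribution(graph, cumulative = TRUE, mode = "all")
# x runs over the degrees of this subgraph, from 0 to its maximum degree
plot(x = 0:(length(deg_dist) - 1), y = 1 - deg_dist, pch = 19, cex = 1.2, col = "orange", xlab = "Degree", ylab = "Cumulative Frequency")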


dd_all<-degree_distribution(graph,cumulative=T)
plot(dd_all, xlab="Degree")

dd_in<-degree_distribution(graph,cumulative=T,mode="in")
plot(dd_in, xlab="Degree")

dd_out<-degree_distribution(graph,cumulative=T,mode="out")
plot(dd_out, xlab="Degree")

```
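Degree is the number of ties (edges) attached to a node. With `cumulative = TRUE`, each value returned by `degree_distribution` is the share of nodes whose degree is at least the corresponding value, starting from degree 0.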
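
A minimal sketch of how the cumulative degree distribution reads, using a made-up four-node chain. The graph and the names `toy_dist` and `toy_cum` are hypothetical.

```r
library(igraph)

# Hypothetical toy chain A -> B -> C -> D: total degrees are 1, 2, 2, 1
toy_edges <- data.frame(
  Source = c("A", "B", "C"),
  Target = c("B", "C", "D")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

# share of nodes with degree exactly 0, 1, 2
toy_dist <- degree_distribution(toy_g, cumulative = FALSE, mode = "all")
# share of nodes with degree at least 0, 1, 2
toy_cum <- degree_distribution(toy_g, cumulative = TRUE, mode = "all")
```

The first element always refers to degree 0, so the vector has one more entry than the maximum degree.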

```{r}
#density
edge_density(graph, loops=F)

```
Density is the proportion of edges present out of all possible edges in the network. The density of our graph is 5.250029e-05, which is very small; therefore, the network is sparse.
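
For a directed graph without self-loops, the number of possible edges is n(n - 1), so density is the edge count divided by that product. A minimal sketch on a made-up graph (the names `toy_g`, `manual_density`, and `igraph_density` are hypothetical) compares the manual formula with `edge_density`:

```r
library(igraph)

# Hypothetical toy graph: 4 nodes, 3 directed edges
toy_edges <- data.frame(
  Source = c("A", "B", "C"),
  Target = c("B", "C", "D")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

n <- vcount(toy_g)  # number of nodes
m <- ecount(toy_g)  # number of edges

manual_density <- m / (n * (n - 1))                   # 3 / 12
igraph_density <- edge_density(toy_g, loops = FALSE)  # should match the manual value
```

A value close to 0 means few of the possible ties exist, which is what a sparse network looks like.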
```{r}
#centrality
centr_degree(graph)
```

`centr_degree` returns the degree centrality of each of the 904 nodes; our results range from 0 to 53, with 53 being the highest.

```{r}
closeness <- closeness(graph, mode='all', weights=NA)
```
Closeness centrality is based on a node's shortest-path distances to the other nodes: the shorter those distances, the higher the closeness.

```{r}
betweeness <- betweenness(graph, directed = TRUE, weights = NA)
```
Betweenness centrality measures how often a node lies on the shortest paths between other nodes, i.e., how much it acts as a broker connecting others.
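
A minimal sketch of both measures on a made-up three-node chain, where the middle node is the only broker. The graph and the names `toy_close` and `toy_btw` are hypothetical.

```r
library(igraph)

# Hypothetical toy chain A -> B -> C; B sits between A and C
toy_edges <- data.frame(
  Source = c("A", "B"),
  Target = c("B", "C")
)
toy_g <- graph_from_data_frame(toy_edges, directed = TRUE)

# closeness ignoring edge direction: B is one step from everyone
toy_close <- closeness(toy_g, mode = "all", weights = NA)
# betweenness on directed paths: only the A -> C path passes through B
toy_btw <- betweenness(toy_g, directed = TRUE, weights = NA)
```

In this toy case `B` has the highest closeness and the only non-zero betweenness, which matches the intuition of a broker position.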
```{r}
# hub/authority scores
# hub centrality (eigenvector-based)
hub_score1 <- hub.score(graph)$vector

# authority centrality (eigenvector-based)
authority_score1 <- authority.score(graph)$vector
```
7. 	Create a group of variables containing the information of neighbors that “point to” focal products. The variables include:
a.     Neighbors’ mean rating (nghb_mn_rating),
b.     Neighbors’ mean salesrank (nghb_mn_salesrank),
c.     Neighbors’ mean number of reviews (nghb_mn_review_cnt),
Note: you may recall the functions in “dplyr” such as group_by, inner_join, summarize, mean, etc.
```{r}
rating<-copurchase %>%
  group_by(Target) %>%
  inner_join(products, by=c('id'='id'))%>%
  transmute(nghb_mn_rating=mean(rating))

rank<-copurchase %>%
  group_by(Target) %>%
  inner_join(products,by=c('id'='id'))%>%
  transmute(nghb_mn_salesrank=mean(salesrank))


reviews<-copurchase %>%
  group_by(Target) %>%
  inner_join(products,by=c('id'='id'))%>%
  transmute(nghb_mn_review_cnt=mean(review_cnt))

products$id <- as.vector(products$id)
sub_id <- as_ids(sub)
products_sub <- products[products$id %in% sub_id,]

mean <- copurchase %>% 
  group_by(Target) %>% 
  inner_join(products_sub, by = c('id' = 'id')) %>%
  summarise(nghb_mn_rating=mean(rating),
            nghb_mn_salesrank=mean(salesrank),
            nghb_mn_review_cnt=mean(review_cnt))
```
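
The neighbor averages can be checked by hand on a small made-up example. This is a sketch under the assumption, as in the chunk above, that the `id` column of the edge table holds the Source product; `toy_copurchase`, `toy_products`, and `toy_nghb` are hypothetical names and values.

```r
library(dplyr)

# Hypothetical toy data: edges run from `id` (the Source) to `Target`
toy_copurchase <- data.frame(id = c(1, 2, 3), Target = c(3, 3, 1))
toy_products <- data.frame(
  id = c(1, 2, 3),
  rating = c(4.0, 5.0, 3.0),
  salesrank = c(100, 300, 50),
  review_cnt = c(10, 30, 2)
)

toy_nghb <- toy_copurchase %>%
  inner_join(toy_products, by = "id") %>%  # attach each Source's attributes
  group_by(Target) %>%                     # one group per focal product
  summarise(
    nghb_mn_rating = mean(rating),
    nghb_mn_salesrank = mean(salesrank),
    nghb_mn_review_cnt = mean(review_cnt)
  )

# Product 3 is pointed to by products 1 and 2, so its neighbor
# means are the averages of those two rows of toy_products.
toy_nghb
```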
8. 	Include the variables (taking logs where necessary) created in Parts 2-6 above into the “products” information and fit a Poisson regression to predict salesrank of all the books in this subcomponent using products’ own information and their neighbor’s information. Provide an interpretation of your results.
Note: Lower salesrank means higher sales. Data points in the network are related. The performance of one node is influenced by the performance of its neighbors. Also, it’s not necessary that all variables matter.
```{r}
# convert each named score vector to a data frame keyed by node name
in_degree1 <- as.data.frame(in_degree)
in_degree1 <- cbind(newColName = rownames(in_degree1), in_degree1)
rownames(in_degree1) <- 1:nrow(in_degree1)
colnames(in_degree1) <- c("Nodes", "in_degree")

out_degree1 <- as.data.frame(out_degree)
out_degree1 <- cbind(newColName = rownames(out_degree1), out_degree1)
rownames(out_degree1) <- 1:nrow(out_degree1)
colnames(out_degree1) <- c("Nodes", "out_degree")

closeness1 <- as.data.frame(closeness)
closeness1 <- cbind(newColName = rownames(closeness1), closeness1)
rownames(closeness1) <- 1:nrow(closeness1)
colnames(closeness1) <- c("Nodes", "closeness")

betweeness1 <- as.data.frame(betweeness)
betweeness1 <- cbind(newColName = rownames(betweeness1), betweeness1)
rownames(betweeness1) <- 1:nrow(betweeness1)
colnames(betweeness1) <- c("Nodes", "betweeness")

hub_score2 <- as.data.frame(hub_score1)
hub_score2 <- cbind(newColName = rownames(hub_score2), hub_score2)
rownames(hub_score2) <- 1:nrow(hub_score2)
colnames(hub_score2) <- c("Nodes", "hub_score")

authority_score2 <- as.data.frame(authority_score1)
authority_score2 <- cbind(newColName = rownames(authority_score2), authority_score2)
rownames(authority_score2) <- 1:nrow(authority_score2)
colnames(authority_score2) <- c("Nodes", "authority_score")

poisson_data <- sqldf("SELECT hub_score2.Nodes, hub_score, betweeness, authority_score, closeness, in_degree, out_degree, nghb_mn_rating, nghb_mn_salesrank, nghb_mn_review_cnt, products.id, products.review_cnt, products.rating, products.salesrank
                      FROM hub_score2, betweeness1, authority_score2, closeness1, in_degree1,            out_degree1, mean, products
                      WHERE hub_score2.Nodes = betweeness1.Nodes 
                      and hub_score2.Nodes = authority_score2.Nodes
                      and hub_score2.Nodes = closeness1.Nodes
                      and hub_score2.Nodes = in_degree1.Nodes
                      and hub_score2.Nodes = out_degree1.Nodes
                      and hub_score2.Nodes = mean.Target
                      and hub_score2.Nodes = products.id")

# run Poisson regression
salesrating_prediction <- glm(salesrank ~ review_cnt + rating + hub_score + betweeness +
                                authority_score + closeness + in_degree + out_degree +
                                nghb_mn_rating + nghb_mn_salesrank + nghb_mn_review_cnt,
                              family = "poisson", data = poisson_data)
summary(salesrating_prediction)


```
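
When reading the regression output, it may help to recall that a Poisson model uses a log link, so exponentiating a coefficient gives the multiplicative change in the expected outcome for a one-unit increase in that predictor. The sketch below uses simulated data, not the assignment data; `toy`, `toy_fit`, and `toy_ratio` are hypothetical names, and the true slope of 0.5 is chosen only for illustration.

```r
set.seed(1)

# Simulated (not real) data: counts whose log-mean rises with x
toy <- data.frame(x = runif(500, 0, 2))
toy$y <- rpois(500, lambda = exp(1 + 0.5 * toy$x))

toy_fit <- glm(y ~ x, family = "poisson", data = toy)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in x
toy_slope <- coef(toy_fit)[["x"]]
toy_ratio <- exp(toy_slope)
```

Applied to this assignment, a negative coefficient means the predictor is associated with a lower expected salesrank, and since lower salesrank means higher sales, that corresponds to better-selling books.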
 








