Data607FinalProjectSadiaBanuPart2

PROJECT TEAM: SADIA AND BANU DATA 607 FINAL PROJECT (PART2 DOCUMENT)

ANALYSIS:

The second part of our analysis was to learn how to scrape the Me Too movement page on Wikipedia, store the result in Neo4j, and retrieve it again. The goal was to see whether we could explain graph databases and, in the future, expand on modeling topics through Neo4j. This requires installing Neo4j and using the RNeo4j library from GitHub (https://github.com/nicolewhite/RNeo4j#nodes) to query the data model in Cypher.

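# Packages used throughout this document.
library(rvest); library(dplyr); library(tidyr); library(tidytext); library(widyr)
library(igraph); library(ggraph); library(knitr); library(kableExtra); library(RNeo4j)

url <- "https://en.wikipedia.org/wiki/Me_Too_movement"

# Download and parse the page once, then reuse it for both extractions.
page <- read_html(url)

# Section headlines only.
tag_headline <- page %>% html_nodes(xpath = "//h3") %>% html_text()

# Paragraphs and headlines together, in document order.
tag_text <- page %>% html_nodes(xpath = "//p | //h3") %>% html_text()

tag_text[1]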

## [1] "\n"

str(tag_headline)

##  chr [1:71] "Awareness and empathy" "Policies and laws" ...

str(tag_text)

##  chr [1:285] "\n" "\n" ...

df1 <- as.data.frame(tag_headline)

df2 <- as.data.frame(tag_text)
str(df2)

## 'data.frame':    285 obs. of  1 variable:
##  $ tag_text: Factor w/ 284 levels " Japan","\n",..: 2 2 242 210 80 279 25 247 26 53 ...

colnames(df1)[1] <- "matchcol"
colnames(df2)[1] <- "matchcol"

str(df1)

## 'data.frame':    71 obs. of  1 variable:
##  $ matchcol: Factor w/ 71 levels " Japan","\n\t\t\t\tVariants\n\t\t\t",..: 11 53 36 30 58 59 4 5 6 7 ...

str(df2)

## 'data.frame':    285 obs. of  1 variable:
##  $ matchcol: Factor w/ 284 levels " Japan","\n",..: 2 2 242 210 80 279 25 247 26 53 ...

Clean up the scraped data frame and flag each row as a headline or a paragraph, based on the //p and //h3 XPath extraction. Then plot the bigrams with igraph.

df1 %>% 
  mutate_if(is.factor, as.character) %>% 
  glimpse()

## Observations: 71
## Variables: 1
## $ matchcol <chr> "Awareness and empathy", "Policies and laws", "Media ...

df2 %>% 
  mutate_if(is.factor, as.character) %>% 
  glimpse()

## Observations: 285
## Variables: 1
## $ matchcol <chr> "\n", "\n", "The Me Too movement (or #MeToo movement)...

df3 <- merge(df1, df2, by = "matchcol", all.y = TRUE)

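# Add a constant column holding the total number of rows.
df4 <- cbind(df2, length(df2$matchcol))

# Character count of each scraped text element.
list <- nchar(as.character(df4$matchcol))

final <- cbind(df4, list)

tagheadline <- cbind(df1, nchar(as.character(df1$matchcol)))

# Treat anything of 48 characters or fewer as a headline and everything longer as a paragraph.
final1 <- cbind(final, ifelse(nchar(as.character(df4$matchcol)) <= 48, "headline", "Paragraph"))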

str(final1)

## 'data.frame':    285 obs. of  4 variables:
##  $ matchcol                                                                : Factor w/ 284 levels " Japan","\n",..: 2 2 242 210 80 279 25 247 26 53 ...
##  $ length(df2$matchcol)                                                    : int  285 285 285 285 285 285 285 285 285 285 ...
##  $ list                                                                    : int  1 1 325 410 666 165 326 361 669 21 ...
##  $ ifelse(nchar(as.character(df4$matchcol)) <= 48, "headline", "Paragraph"): Factor w/ 2 levels "headline","Paragraph": 1 1 2 2 2 2 2 2 2 1 ...
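
The flag above is derived from string length, so a short paragraph would also be labeled a headline. As a minimal alternative sketch, the flag could instead be derived from membership in the scraped headline vector. The vectors `headlines` and `all_text` below are hypothetical toy stand-ins for `tag_headline` and `tag_text`, not the real page content.

```r
# Toy stand-ins for the scraped vectors (hypothetical values, not the real page).
headlines <- c("Awareness and empathy", "Policies and laws")
all_text  <- c("Intro paragraph text.", "Awareness and empathy",
               "Short paragraph.", "Policies and laws", "Another paragraph.")

# Flag each element by membership in the headline vector rather than by length.
flagged <- data.frame(
  matchcol = all_text,
  flag = ifelse(all_text %in% headlines, "headline", "Paragraph"),
  stringsAsFactors = FALSE
)

print(flagged)
```

With this approach "Short paragraph." stays a paragraph even though it is well under 48 characters. It does assume that no paragraph text is identical to a headline.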

# Final data: keep only rows longer than one character (drops the bare newlines)
newdata <- final1[which(final1$list > 1),]

bigrams <- newdata[110:231,] %>%
  unnest_tokens(token1, matchcol, token = "ngrams", n = 2) %>%
  separate(token1, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
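
To make the bigram step concrete, here is a rough base-R sketch of the same idea: split the text into adjacent word pairs, drop pairs containing a stop word, and count what remains. It is a swapped-in illustration and not the tidytext implementation; the sample sentence and the tiny `stop_set` are hypothetical.

```r
# Hypothetical sample text and a tiny stop-word list for illustration only.
text <- "sexual harassment and sexual assault are forms of sexual harassment"
stop_set <- c("and", "are", "of")

# Tokenize into lowercase words.
words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
words <- words[nzchar(words)]

# Pair each word with the word that follows it.
pairs <- data.frame(word1 = head(words, -1), word2 = tail(words, -1),
                    stringsAsFactors = FALSE)

# Drop any pair in which either word is a stop word.
pairs <- pairs[!(pairs$word1 %in% stop_set | pairs$word2 %in% stop_set), ]

# Count the remaining bigrams, most frequent first.
counts <- as.data.frame(table(bigram = paste(pairs$word1, pairs$word2)),
                        stringsAsFactors = FALSE)
counts <- counts[order(-counts$Freq), ]

print(counts)
```

On this sample the most frequent bigram is "sexual harassment" with a count of 2, followed by "sexual assault" with a count of 1.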

# Count words co-occurring within sections

colnames(newdata)[4] <- "flag"

tokennewdata <- newdata[110:231,] %>%
  unnest_tokens(token1, matchcol, token = "ngrams", n = 1) %>%
  filter(!token1 %in% stop_words$word) %>%
  group_by(token1) %>%
  filter(n() >= 20) %>%
  pairwise_cor(token1, flag, sort = TRUE)
       
tokennewdata[100,]

## # A tibble: 1 x 3
##   item1    item2      correlation
##   <chr>    <chr>            <dbl>
## 1 movement harassment       1.000

Plot the bigrams from the scraped main Me Too Wikipedia page.

# filter for only relatively common combinations
bigram_graph <- bigrams %>%
  filter(n > 1) %>%
  graph_from_data_frame()

bigram_graph

## IGRAPH b133acc DN-- 137 93 -- 
## + attr: name (v/c), n (e/n)
## + edges from b133acc (vertex names):
##  [1] sexual     ->harassment  sexual     ->assault    
##  [3] metoo      ->movement    social     ->media      
##  [5] sexual     ->misconduct  sexual     ->abuse      
##  [7] sexual     ->violence    january    ->2018       
##  [9] sexually   ->assaulted   law        ->enforcement
## [11] prime      ->minister    sexual     ->advances   
## [13] women      ->including   experienced->sexual     
## [15] harvey     ->weinstein   human      ->rights     
## + ... omitted several edges

set.seed(2017)

ggraph(bigram_graph, layout = "kk") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

#Neo4j section

Loop through the rows to assign each paragraph to the headline of its section.

A headline can be followed by several paragraphs, so the loop carries the ID of the most recent headline forward and assigns it to each paragraph that follows. These assignments are needed to build the Neo4j data model.

as.numeric(rownames(newdata[2,]))

## [1] 4

str(newdata)

## 'data.frame':    283 obs. of  4 variables:
##  $ matchcol            : Factor w/ 284 levels " Japan","\n",..: 242 210 80 279 25 247 26 53 39 58 ...
##  $ length(df2$matchcol): int  285 285 285 285 285 285 285 285 285 285 ...
##  $ list                : int  325 410 666 165 326 361 669 21 501 553 ...
##  $ flag                : Factor w/ 2 levels "headline","Paragraph": 2 2 2 2 2 2 2 1 2 2 ...

newdata1 <- newdata %>% mutate_if(is.factor, as.character)
str(newdata1)

## 'data.frame':    283 obs. of  4 variables:
##  $ matchcol            : chr  "The Me Too movement (or #MeToo movement), with a large variety of local and international related names, is a m"| __truncated__ "Similar to other social justice and empowerment movements based upon breaking silence, the purpose of \"Me Too\"| __truncated__ "Following the exposure of the widespread sexual-abuse allegations against Harvey Weinstein in early October 201"| __truncated__ "Widespread media coverage and discussion of sexual harassment, particularly in Hollywood, led to high-profile f"| __truncated__ ...
##  $ length(df2$matchcol): int  285 285 285 285 285 285 285 285 285 285 ...
##  $ list                : int  325 410 666 165 326 361 669 21 501 553 ...
##  $ flag                : chr  "Paragraph" "Paragraph" "Paragraph" "Paragraph" ...

isprevparagraph <- "No"
isprevheadline  <- "No"

nrow(newdata1)

## [1] 283

newdata1$flag[8]

## [1] "headline"

df <- NULL
ptype_id <- 88            # placeholder parent ID used until the first headline is seen
isprevheadline <- "No"    # becomes "Yes" once any headline has been seen

for (i in seq_len(nrow(newdata1))) {
  row_id <- as.numeric(rownames(newdata1[i, ]))

  if (newdata1$flag[i] == "headline") {
    # A headline starts a new section and becomes the parent of the rows that follow.
    isprevheadline <- "Yes"
    ptype_id <- row_id
    type_id <- row_id
  } else if (newdata1$flag[i] == "Paragraph") {
    # A paragraph takes the ID of the most recent headline if there is one;
    # otherwise it keeps its own row ID.
    type_id <- if (isprevheadline == "Yes") ptype_id else row_id
  }

  type_name <- newdata1$flag[i]
  text_id <- i
  text_val <- newdata1$matchcol[i]

  df <- rbind(df, data.frame(type_id, type_name, text_id, text_val, isprevheadline, ptype_id))
}

df[1:13, ] %>% kable() %>% kable_styling(c("striped", "bordered"))

type_id	type_name	text_id	text_val	isprevheadline	ptype_id
1	Paragraph	1	The Me Too movement (or #MeToo movement), with a large variety of local and international related names, is a movement against sexual harassment and sexual assault.[1][2][3] The phrase “Me Too” was initially used in this context on social media in 2006, on Myspace, by sexual harassment survivor and activist Tarana Burke.[4]	No	88
2	Paragraph	2	Similar to other social justice and empowerment movements based upon breaking silence, the purpose of “Me Too”, as initially voiced by Burke as well as those who later adopted the tactic, is to empower women through empathy and strength in numbers, especially young and vulnerable women, by visibly demonstrating how many women have survived sexual assault and harassment, especially in the workplace.[4][5][6]	No	88
3	Paragraph	3	Following the exposure of the widespread sexual-abuse allegations against Harvey Weinstein in early October 2017,[7][8] the movement began to spread virally as a hashtag on social media.[6][9][10] On October 15, 2017, American actress Alyssa Milano posted on Twitter, “If all the women who have ever been sexually harassed or assaulted wrote ‘Me too.’ as a status, then we give people a sense of the magnitude of the problem,” saying that she got the idea from a friend.[11][12][13][14] A number of high-profile posts and responses from American celebrities Gwyneth Paltrow,[15]Ashley Judd,[16]Jennifer Lawrence,[17] and Uma Thurman, among others, soon followed.[18]	No	88
4	Paragraph	4	Widespread media coverage and discussion of sexual harassment, particularly in Hollywood, led to high-profile firings, as well as criticism and backlash.[19][20][21]	No	88
5	Paragraph	5	After millions of people started using the phrase and hashtag in this manner, it spread to dozens of other languages. The scope has become somewhat broader with this expansion, however, and Burke has more recently referred to it as an international movement for justice for marginalized people in marginalized communities.[22]	No	88
6	Paragraph	6	The original purpose of “Me Too” as used by Tarana Burke in 2006 was to empower women through empathy, especially young and vulnerable women. In October 2017, Alyssa Milano encouraged using the phrase as a hashtag to help reveal the extent of problems with sexual harassment and assault by showing how many people have experienced these events themselves.[4][5]	No	88
7	Paragraph	7	After millions of people started using the phrase, and it spread to dozens of other languages, the purpose changed and expanded, and as a result, it has come to mean different things to different people. Tarana Burke accepts the title of “leader” of the movement, but has stated that she considers herself more of a “worker.” Burke has stated that this movement has grown to include both men and women of all colors and ages, as it continues to support marginalized people in marginalized communities.[19][22] There have also been movements by men aimed at changing the culture through personal reflection and future action, including #IDidThat, #IHave, and #IWill.[23]	No	88
8	headline	8	Awareness and empathy	Yes	8
8	Paragraph	9	Analyses of the movement often point to the prevalence of sexual violence, which has been estimated by the World Health Organization to affect one-third of all women worldwide. A 2017 poll by ABC News and The Washington Post also found that 54% of American women report receiving “unwanted and inappropriate” sexual advances with 95% saying that such behavior usually goes unpunished. Others state that #MeToo underscores the need for men to intervene when they witness demeaning behavior.[24][25][26]	Yes	8
8	Paragraph	10	Burke said that #MeToo declares sexual violence sufferers are not alone and should not be ashamed.[27] Burke says sexual violence is usually caused by someone the woman knows, so people should be educated from a young age that they have the right to say no to sexual contact from any person, even after repeat solicitations from an authority or spouse, and to report predatory behavior.[28] Burke advises men to talk to each other about consent, call out demeaning behavior when they see it and try to listen to victims when they tell their stories.[28]	Yes	8
8	Paragraph	11	Alyssa Milano described the reach of #MeToo as helping society understand the “magnitude of the problem” and said, “it’s a standing in solidarity to all those who have been hurt.”[29][30] She stated that the success of #MeToo will require men to take a stand against behavior that objectifies women.[31]	Yes	8
12	headline	12	Policies and laws	Yes	12
12	Paragraph	13	Burke has stated the current purpose of the movement is to give people the resources to have access to healing, and to advocate for changes to laws and policies. Burke has highlighted goals such as processing all untested rape kits, re-examining local school policies, improving the vetting of teachers, and updating sexual harassment policies.[32] She has called for all professionals who work with children to be fingerprinted and subjected to a background check before being cleared to start work. She advocates for sex education that teaches kids to report predatory behavior immediately.[28] Burke supports the #MeToo bill in the US Congress, which would remove the requirement that staffers of the federal government go through months of “cooling off” before being allowed to file a complaint against a Congressperson.[32]	Yes	12
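
The table shows the pattern the loop produces: rows before the first headline keep their own ID, and every later row takes the ID of the most recent headline. The same carry-forward can be sketched without a loop using `cumsum()`. The `flag` vector below is a hypothetical toy input, and this is an illustrative alternative, not the code that produced the table.

```r
# Toy flags standing in for newdata1$flag (hypothetical values).
flag <- c("Paragraph", "Paragraph", "headline", "Paragraph",
          "Paragraph", "headline", "Paragraph")
row_id <- seq_along(flag)

is_head  <- flag == "headline"
section  <- cumsum(is_head)      # 0 until the first headline, then 1, 2, ...
head_ids <- row_id[is_head]      # row IDs of the headlines, in order

# Rows before the first headline keep their own ID; all later rows take the
# ID of the most recent headline.
type_id <- ifelse(section == 0, row_id, head_ids[pmax(section, 1)])

print(data.frame(row_id, flag, type_id))
```

For this input `type_id` comes out as 1, 2, 3, 3, 3, 6, 6, which mirrors the 1 to 7, then 8, 8, 8, 8, then 12, 12 pattern in the table.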

Connect to Neo4j, start the graph, and build the database.

BBGraph <- startGraph("http://localhost:7474/db/data/",username="neo4j",password="cunyuser")


# Empty the database without asking for confirmation
clear(BBGraph, input = FALSE)

Row_nodes = list()
Type_nodes = list()

# Create one Text node and one TextType node per row, indexed by text_id
for (i in 1:nrow(df)){
  Row_nodes[[df$text_id[i]]] = createNode(BBGraph, "Text",
                                          Text_ID = df$text_id[i],
                                          Text_Value = as.character(df$text_val[i]))

  Type_nodes[[df$text_id[i]]] = createNode(BBGraph, "TextType",
                                           Text_ID = df$text_id[i],
                                           Type_ID = df$type_id[i],
                                           Type_Name = as.character(df$type_name[i]))
}

Build Relationship

# Node relationships: link each Text node to its TextType nodes
for (i in 1:nrow(df)){
  myrow  = Row_nodes[[df$text_id[i]]]    # Text node for this row
  myid1  = Type_nodes[[df$text_id[i]]]   # TextType node for this row
  mytype = Type_nodes[[df$type_id[i]]]   # TextType node of the section headline

  createRel(myrow, "matches", myid1)
  createRel(myrow, "is of section", mytype)
  # Type name and text value are already stored as node properties
  # (Type_Name, Text_Value), so they need no relationships of their own.
}
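
Before writing relationships to Neo4j, it can help to preview the "is of section" links as a plain edge table. The sketch below assumes a toy data frame `sections` (hypothetical) with the same `text_id`, `type_id`, and `type_name` columns as `df`; it does not touch the database.

```r
# Toy rows shaped like df (hypothetical values).
sections <- data.frame(
  text_id   = 1:5,
  type_id   = c(1, 2, 2, 2, 5),
  type_name = c("Paragraph", "headline", "Paragraph", "Paragraph", "headline"),
  stringsAsFactors = FALSE
)

# A paragraph belongs to a section when its type_id points at another row.
is_child <- sections$type_name == "Paragraph" &
  sections$type_id != sections$text_id

edge_preview <- data.frame(
  from = sections$text_id[is_child],
  to   = sections$type_id[is_child],
  rel  = "is of section",
  stringsAsFactors = FALSE
)

print(edge_preview)
```

Here rows 3 and 4 are linked to headline row 2, while row 1 (a paragraph before any headline) and the two headlines produce no section edge.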

#Retrieving Data from Neo4j

Query the graph using Cypher, show the results, and plot them with igraph. The query returns the headline row with type ID 17 and the paragraphs that belong to it. The goal was to work out how to pull all paragraphs linked to a headline so that modeling can be performed on that data. The Neo4j model still needs fine-tuning, as shown below.

query = "
MATCH (f:Text)-[r]->(m:TextType) WHERE f.Text_ID = m.Text_ID AND m.Type_ID = 17 RETURN f.Text_ID,f.Text_Value,m.Type_ID,m.Type_Name
"
# execute cypher
edges = cypher(BBGraph, query)

# display the results to confirm that the data is present
edges %>% kable() %>% kable_styling(c("striped", "bordered"))

f.Text_ID	f.Text_Value	m.Type_ID	m.Type_Name
17	Media coverage	17	headline
17	Media coverage	17	headline
19	False reports of sexual assault are very rare, but when they happen, they are put in the spotlight for the public to see. This can give the false impression that the majority of the reported sexual assaults are false. However, false reports of sexual assault only make up 2%–10% of the total number that are reported.[36][37] These figures do not take into account the factor that the majority of victims do not report when they are assaulted or harassed. Misconceptions about false reports are one of the reasons why women are scared to report their experiences with sexual assault - because they are afraid that no one will believe them, that in the process they will have embarrassed and humiliated themselves, in addition to opening themselves up to retribution from the assailants.[38][39]	17	Paragraph
21	There is a discussion on the best ways to handle whisper networks, or private lists of “people to avoid” that are shared unofficially in nearly every major institution or industry where sexual harassment is common due to power imbalances, including government, media, news, and academia. These lists have the stated purpose of warning other workers in the industry and are shared from person-to-person, on forums, in private social media groups, and via spreadsheets. However, it has been argued that these lists can become “weaponized” and be used to spread unsubstantiated gossip - an opinion which has been discussed widely in the media.[43]	17	Paragraph
23	In India, a student gave her friends a list containing names of professors and academics in the Indian university system to be avoided. The list went viral after it was posted on social media.[48] In response to criticism in the media, the authors defended themselves by saying they were only trying to warn their friends, had confirmed every case, and several victims from the list were poor students who had already been punished or ignored when trying to come forward.[49][50]Moira Donegan, a New York City-based journalist, privately shared a crowd-sourced list of “Shitty Media Men” to avoid in publishing and journalism. When it was shared outside her private network, Donegan lost her job. Donegan stated it was unfair so few people had access to the list before it went public; for example, very few women of color received access (and therefore protection) from it. She pointed to her “whiteness, health, education, and class” that allowed her to take the risk of sharing the list and getting fired.[44]	17	Paragraph
18	In the coverage of #MeToo, there has been widespread discussion about the best ways to stop sexual harassment and abuse - for those who are currently being victimized at work, as well as those who are seeking justice for past abuse and trying to find ways to end what they see as a culture of abuse. There is general agreement that a lack of effective reporting options is a major factor that drives unchecked sexual misconduct in the workplace.[35]	17	Paragraph
20	In France, a person who makes a sexual harassment complaint at work is reprimanded or fired 40% of the time, while the accused person is typically not investigated or punished.[40] In the United States, a 2016 report from the Equal Employment Opportunity Commission states that although 25–85% of women say they experience sexual harassment at work, few ever report the incidents, most commonly due to fear of reprisal.[35] There is evidence that in Japan, as few as 4% of rape victims report the crime, and the charges are dropped about half the time.[41][42]	17	Paragraph
22	Defenders say the lists provide a way to warn other vulnerable people in the industry if worried about serious retribution from the abusers, especially if complaints have already been ignored. They say the lists help victims identify each other so they can speak out together and find safety in numbers.[43][44] Sometimes these lists are kept for other reasons. For example, a spreadsheet from the United Kingdom called “High Libido MPs” and dubbed “the spreadsheet of shame” was created by a group of male and female Parliamentary researchers, and contained a list of allegations against nearly 40 Conservative MPs in the British Parliament. It is also rumored that party whips (who are in charge of getting members of Parliament to commit to votes) maintain a “black book” that contains allegations against several lawmakers that can be used for blackmail.[45][46][47] When it is claimed a well-known person’s sexual misconduct was an “open secret”, these lists are often the source.[43] In the wake of #MeToo, several private whisper network lists have been leaked to the public.[43][44]	17	Paragraph
24	The main problem with trying to protect more potential victims by publishing whisper networks is determining the best mechanism to verify allegations in a way that is fair to all parties.[51][52] Some suggestions have included strengthening labor unions in vulnerable industries so workers can report harassment directly to the union instead of to an employer. Another suggestion is to maintain industry hotlines which have the power to trigger third-party investigations.[51] Several apps have been developed which offer various ways to report sexual misconduct, and some of these apps also have the ability to connect victims who have reported the same person.[53]	17	Paragraph
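
The query above hard-codes Type_ID = 17, and its result lists the headline twice and returns rows out of order. A small helper can build the same query for any headline ID while adding DISTINCT and ORDER BY. This is only a sketch: `headline_query` is a hypothetical name, and the function builds the query string without contacting Neo4j.

```r
# Hypothetical helper: build the Cypher query for one headline ID.
headline_query <- function(type_id) {
  # Accept a single whole number only, so nothing else can be spliced into the query.
  stopifnot(is.numeric(type_id), length(type_id) == 1,
            type_id == as.integer(type_id))
  sprintf(paste(
    "MATCH (f:Text)-[r]->(m:TextType)",
    "WHERE f.Text_ID = m.Text_ID AND m.Type_ID = %d",
    "RETURN DISTINCT f.Text_ID, f.Text_Value, m.Type_ID, m.Type_Name",
    "ORDER BY f.Text_ID"
  ), as.integer(type_id))
}

query17 <- headline_query(17)
cat(query17, "\n")
```

The resulting string could then be passed to `cypher(BBGraph, query17)` in the same way as the original query.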

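# Build an igraph object from the query result; the first two columns are used as edge endpoints
igh = graph_from_data_frame(edges)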
ggraph(igh, layout = "kk") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

Neo4j screenshots of the Windows service, graph database, and queries

[Screenshots: Neo4j browser port; MyGraph and Cypher query; Neo4j MyGraph model; Neo4j Windows localhost browser screen]

Works cited

These sites were very helpful in understanding how to connect R to Neo4j. We found few other sites that explained the basic concepts and steps.

http://rpubs.com/myampol/MY-DATA607-Week12-MySQL-to-Neo4j
https://rdrr.io/github/nicolewhite/RNeo4j/
https://neo4j.com/docs/operations-manual/current/