Introduction

The Why:

I recently watched the documentary “Motherland” on PBS, described as a “vérité look at the busiest maternity hospital on the planet, in one of the world’s most populous countries: the Philippines”. As a first-generation American-Born-Filipina, the hyper-realistic film left me in awe thinking about a life I could have led. It showcased the lives of girls younger than I was, having their first child and caught in what seems to be a never-ending cycle of adolescent fertility. One 26 year old in particular, already had six children. I’ve thought about this documentary a lot since I’ve seen it and how we can help empower women or give them access to the right tools so they won’t be caught in a cycle of continuous pregnancy. I asked myself, why is the Philippines having this crisis? So I decided I to take a look at the countries in which adolescent fertility rates are increasing/decreasing, and then determine potential reasons.

The How:

The data obtained in this project is gathered from The World Bank, Wikipedia, and Twitter. The World Bank data has proven to be incredibly through and informative. In another project, I used multiple linear regression analysis to determine important quantitative time-series variables while understanding adolescent pregnancy rates. Values such as women’s health and youth education education were incredibly important values that determined adolescent pregancy. I also noticed that many of the variables that ultimately determined if a country had high amounts of adolescent pregnancy appeared to be incredbily community focused. For example, every percent increase of elder females in a population decreases births per 1000 by 4. So, perhaps an increase in women’s care leads to less teenage births. I believe the relationship to elder populations and teenage pregnancy is this: when there are matronly figures such as mothers, aunts, and grandmothers that can nurture young women in the family, there will be less teenage births. In the Philippines, family is incredibly important. If women can be role models or nurture their young daughters/sons to be careful – despite any governmental believes – then we can positively impact birth rates in the Philippines.

This project however, attemps to use more qualitative datasets. Initially, I wanted to scrape government websites to gain information about country’s responses to womens health/rights, unfortunately, a project of this magnitude would require restructuring many government pages, translating them into one common langauge (English), and through research to determine what webpages to scrape. Instead, I decided to focus on one metric – reactions towards women’s rights using Wikipedia websites titled “Womens Rights in (Country Name)”. These websites had similar strucutres, and often focused on the change in women’s rights in a given country over time. I used sentiment analysis to understand how positive, negative, or trusting (etc.,) these webpages were. Some very slighly correlations could be made using multivariable analysis and backwards elimination.

However, I thought this method might not incorporate the metrics of community I found to be so important before. The book Sapiens by historian Yuval Noah Harari elucidated how the Industrial Revolution spurred a massive change in power. No longer were families as important to our development – they were, of course important, but through the Industrial Revolution, the power given to families shifted to governments. Education or employment was no longer determined by what family you were born into. Of course, it determines part of it, but in many ways an efficent government is the biggest determinant that can ensure education for its constituents. This balance of government and community shape who we are are people, and perhaps, in some places people are shaped more by governments than by their families, or vice-versa. I hope to use wikipedia websites and twitter data to better understand this balance.

Loading Libraries and Obtaining Data

We are using quite a few libraries that help us better understand qualitative datasets, such as tm, RNewsflow, along with libraries to extract twitter data such as twitteR and ROAuth. I uploaded the dataset in question to my github for an easy upload. The original data can be found here. They are World Development Indicators and are very comprehensive.

#library(dplyr)
#library(tidyr)
#library(rvest)
#library(stringr)
#library(SnowballC)
#library(tm)
#library(syuzhet)
#library(ggplot2)
#library(twitteR)
#library(ROAuth)
#library(RNewsflow)

gendered_world_indicators <- read.csv("https://raw.githubusercontent.com/Michelebradley/DATA-606/master/Gender_World%20_Indicators.csv", header=TRUE, check.names = FALSE)

Tidying the Data

tidy_gendered_world_indicators <- gather(gendered_world_indicators, "year", "n", 5:60)
colnames(tidy_gendered_world_indicators)[colnames(tidy_gendered_world_indicators) == "Country Name"] <- "Country"
colnames(tidy_gendered_world_indicators)[colnames(tidy_gendered_world_indicators) == "Indicator Name"] <- "Indicator"
tidy_gendered_world_indicators <- select(tidy_gendered_world_indicators, one_of("Country", "Indicator","year", "n"))
tidy_gendered_world_indicators <- na.omit(tidy_gendered_world_indicators)
head(tidy_gendered_world_indicators)

Finding Biggest Change in Adolescent Pregnancy Rates

Since we are using qualitative data (Wikipedia websites and Twitter data from the last seven days), we can only compare a static numerical value. Therefore, I aimed to find the difference in adolescent pregnancy rates over a recent 15 year change, and the turn of the 21st century.

adolescent <- filter(tidy_gendered_world_indicators, Indicator==unique(tidy_gendered_world_indicators$Indicator)[8]&(year==2000|year==2015))
Difference <- NULL

for (country in unique(adolescent$Country)) {
  temp_00 <- filter(adolescent, Country==country & year==2000)
  temp_15 <- filter(adolescent, Country==country & year==2015)
  n_00 <- temp_00$n
  n_15 <- temp_15$n
  change <- n_15-n_00
  change_temp <- data.frame(country, change)
  Difference <- rbind(change_temp, Difference)
}
head(arrange(Difference, desc(change)))
tail(arrange(Difference, desc(change)))

Visualize the Change – Philippines

Philippines <- tidy_gendered_world_indicators %>%
  filter(Country == "Philippines")
fertility_phi <- Philippines %>%
  filter(Indicator == indicators[8])

PHI_fertility <- ggplot(fertility_phi, aes(year, n))
PHI_fertility + geom_jitter() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Wikipedia articles

I started my analysis by obtaining every “Womens Rights in (Country)” articles available on Wikipedia. There were around 25, so in one dataframe, we now have the country, URL, and the percent change. We did a left join on the URLs so we would only use those countries for analysis.

data <- merge(countries_df, Difference, by="country")
head(arrange(data, desc(change)))

Sentiment Analysis

Next, we scraped all of these webistes using the CSS Selector Gadget tool and cleaned up the words using tm_map. The get_nrc_sentiment function from the syuznet package was then used to obtain setniment from all of these websites. The values were converted to a percentage and written to a dataframe.

obtain_text <- function(url) {
  bullets <- url %>%
    read_html() %>%
    html_nodes("#mw-content-text p") %>%
    html_text()

  text <- url %>%
    read_html() %>%
    html_nodes("#mw-content-text ul li") %>%
    html_text()

bullet <- unlist(str_extract_all(bullets, "\\w+"))
text <- unlist(str_extract_all(text, "\\w+"))
words <- c(bullet, text)

words <- Corpus(VectorSource(words))
# Convert the text to lower case
words <- tm_map(words, content_transformer(tolower))
# Remove numbers
words <- tm_map(words, removeNumbers)
# Remove english common stopwords
words <- tm_map(words, removeWords, stopwords("english"))
# Remove punctuations
words <- tm_map(words, removePunctuation)
# Eliminate extra white spaces
words <- tm_map(words, stripWhitespace)
 
return(words)
}
percent <- NULL

obtain_sentiment <- function(data){
  for (country in unique(data$country)){
    URL <- data[data$country == country, "URLs"]
    Sys.sleep(5)
    wiki_data <- obtain_text(as.character(URL))
    wiki_data <- as.character(wiki_data)
    p <- get_nrc_sentiment(wiki_data)
    total <- sum(p[1,])
    percent_anger <- (as.numeric(p$anger[1])/total)*100
    percent_anticipation <- (as.numeric(p$anticipation[1])/total)*100
    percent_disgust <- (as.numeric(p$disgust[1])/total)*100
    percent_fear <- (as.numeric(p$fear[1])/total)*100
    percent_joy <- (as.numeric(p$joy[1])/total)*100
    percent_sadness <- (as.numeric(p$sadness[1])/total)*100
    percent_surprise <- (as.numeric(p$surprise[1])/total)*100
    percent_trust <- (as.numeric(p$trust[1])/total)*100
    percent_negative <- (as.numeric(p$negative[1])/total)*100
    percent_positive <- (as.numeric(p$positive [1])/total)*100
    values <- data.frame(country, percent_anger, percent_anticipation, percent_disgust, percent_fear, percent_joy, percent_sadness, percent_surprise, percent_trust, percent_negative, percent_positive)
    percent <- rbind(values, percent)
  }
  sentiment <- merge(percent, Difference, by="country")
  return (sentiment)
}

sentiment <- obtain_sentiment(data)
sentiment

Once this data was obtained, we could easily run regression analysis on it and find correlations between the variables and the percent change in adolescent pregnancy over the years.

percent_anger <- sentiment$percent_anger
percent_anticipation <- sentiment$percent_anticipation
percent_disgust <- sentiment$percent_disgust
percent_fear <- sentiment$percent_fear
percent_joy <- sentiment$percent_joy
percent_sadness <- sentiment$percent_sadness
percent_surprise <- sentiment$percent_surprise
percent_trust <- sentiment$percent_trust
percent_negative <- sentiment$percent_negative
percent_positive <- sentiment$percent_positive
change <- sentiment$change
correlation <- lm(change ~ percent_anger + percent_anticipation + percent_disgust + percent_fear + percent_joy + percent_sadness + percent_surprise + percent_trust + percent_negative + percent_positive)
summary(correlation)
## 
## Call:
## lm(formula = change ~ percent_anger + percent_anticipation + 
##     percent_disgust + percent_fear + percent_joy + percent_sadness + 
##     percent_surprise + percent_trust + percent_negative + percent_positive)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1328  -4.7650  -0.1506   4.1663  24.1644 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          403.4656   177.3695   2.275  0.04394 * 
## percent_anger         -2.4565     5.2479  -0.468  0.64886   
## percent_anticipation   0.6362     5.0669   0.126  0.90235   
## percent_disgust        3.6027     4.0024   0.900  0.38734   
## percent_fear          -3.6538     3.6231  -1.008  0.33490   
## percent_joy          -10.0109     5.2822  -1.895  0.08464 . 
## percent_sadness       -0.1374     3.2623  -0.042  0.96715   
## percent_surprise       3.5875     5.7913   0.619  0.54823   
## percent_trust         -9.4312     3.4725  -2.716  0.02008 * 
## percent_negative     -12.6432     3.5825  -3.529  0.00472 **
## percent_positive           NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.46 on 11 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.5072 
## F-statistic: 3.287 on 9 and 11 DF,  p-value: 0.03349

Using backwards elimination, the most important variables were interestingly, percent_fear, percent_joy, percent_trust, and percent_negative.

correlation <- lm(change ~ percent_fear + percent_joy + percent_trust + percent_negative)
summary(correlation)
## 
## Call:
## lm(formula = change ~ percent_fear + percent_joy + percent_trust + 
##     percent_negative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.754  -2.557   2.311   3.773  22.246 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       451.436     86.014   5.248 7.96e-05 ***
## percent_fear       -4.265      1.929  -2.210 0.041990 *  
## percent_joy       -10.813      3.302  -3.275 0.004768 ** 
## percent_trust     -10.356      2.257  -4.588 0.000303 ***
## percent_negative  -13.362      2.352  -5.682 3.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.25 on 16 degrees of freedom
## Multiple R-squared:  0.6849, Adjusted R-squared:  0.6061 
## F-statistic: 8.693 on 4 and 16 DF,  p-value: 0.0006301

\[ \widehat{\mathrm{births \; per \; 1,000 \; women \; ages \; 15-19}} = \beta_0 + \beta_1(Percent \; Fear \;) + \beta_2(Percent \; Joy \;) + \beta_3(Percent \; trust) + \beta_3(Percent \; negative)\]

\[ \widehat{\mathrm{births \; per \; 1,000 \; women \; ages \; 15-19}} = 451.436 + \beta_1(Percent \; Fear \;) - 4.265 (Percent \; Joy \;) - 10.813(Percent \; trust) -13.362 (Percent \; negative)\]

In essence, as the percent fear and negative sentiment in the wikipedia article increases, the less births per 1000 women ages 15-19 there will be in that country. Perhaps because fear and thinking negative ultimately allows us to make change. The ability to see what the problems are can help us to fix them. Joy and trust also decreased birth rates, perhaps because both allow change to be made.

twitteR Analysis

Now, I am attempting to see if we can use tweets to determine if families or governments play a bigger role in each country by taking twitter data that contain the word of the country (and are in english – a potentially biased result) and compare it to wikipedia sites “family” and “government” using documents.compare. However, it doesn’t seem like this is the best method to obtain qualitative data about an entire country – because so many tweets aren’t about either governmental issues or family.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
get_tweets <- function(search){
  tweets <- searchTwitter(search,
                          n=1000,
                          lang="en",
                          retryOnRateLimit = 100,
                          resultType = 'recent')
  return (tweets)
}
clean_text <- function(words) {
words <- Corpus(VectorSource(words))
# Convert the text to lower case
words <- tm_map(words, content_transformer(tolower))
# Remove numbers
words <- tm_map(words, removeNumbers)
# Remove english common stopwords
words <- tm_map(words, removeWords, stopwords("english"))
# Remove punctuations
words <- tm_map(words, removePunctuation)
# Eliminate extra white spaces
words <- tm_map(words, stripWhitespace)
 
return(words)
}
dtm_names <- NULL

for (country in unique(countries_df$country)){
  country_search <- paste(country, "-filter:retweets")
  country_name <- gsub("[[:space:]]", "", country)
  title <- paste(country_name, "_tweets")
  temp <- get_tweets(country_search)
  temp <- clean_text(temp)
  words <- TermDocumentMatrix(temp)
  assign(title, words)
  dtm_names <- c(dtm_names, title)
}
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 834
## [1] "Rate limited .... blocking for a minute and retrying up to 99 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 98 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 97 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 96 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 95 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 94 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 93 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 92 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 91 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 90 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 89 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 88 times ..."
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 7
family <- obtain_text("https://en.wikipedia.org/wiki/Family")
government <- obtain_text("https://en.wikipedia.org/wiki/Government")

familydtm <- TermDocumentMatrix(family)
governmentdtm <- TermDocumentMatrix(government)
compare_documents <- function(dtm){
  similarity_fam <- documents.compare(dtm=dtm, dtm.y=familydtm, measure="cosine", min.similarity = .05)
  similarity_gov <- documents.compare(dtm=dtm, dtm.y=governmentdtm, measure="cosine", min.similarity = .05)
  
  cos_fam <- mean(similarity_fam$similarity)
  cos_gov <- mean(similarity_gov$similarity)
  
  country <- c(cos_fam, cos_gov)
  
  return (country)
}

Iran <- compare_documents(`Iran _tweets`)
## Warning in colnames(dtm) == colnames(dtm.y): longer object length is not a
## multiple of shorter object length

## Warning in colnames(dtm) == colnames(dtm.y): longer object length is not a
## multiple of shorter object length
Iran
## [1] 0.06590974 0.05959314
df <- data.frame(rbind(Iran, Afganistan, SaudiArabia, Cuba, Pakistan, Philippines, India, Brazil, Myanmar, Ukraine, Nepal, Tonga, Haiti, Bahrain, Colombia, Egypt, NorthKorea, PapuNewGuinea, Yemen, Turkey, Iraq, Qatar, Bangladesh, Japan,Lebanon, Seychelles , UnitedArabEmirates))
colnames(df) <- c("family", "government")
df

Womens Rights Wikipedia Pages to Wikipedia

Wikipedia articles came to the rescuse again, and were an improvement over twitter. Using the same articles as before, we compare each document with family and government using document.compare.

dtm_wiki_names <- NULL

for (country in unique(countries_df$country)){
  URL <- countries_df[countries_df$country == country, "URLs"]
  Sys.sleep(5)
  title <- paste(country, "_wiki")
  wiki_data <- obtain_text(as.character(URL))
  temp <- TermDocumentMatrix(wiki_data)
  assign(title, temp)
  dtm_wiki_names <- c(dtm_wiki_names, title)
}
df_wiki

Kmeans

Each country is assigned a similarity score to the documents family and government, and the results are clustered using Kmeans. Our goal is to find countries that act in similar ways when it comes to womens’ rights. However, it seems like there is a linear relationship between the two, something I did not expect. I expected these results to be spread out so there would be a clear destinction between countries that were more effected by government and others more effected by family. The goal was to group these countries together to run analysis, but because the groups were so small, we couldn’t do it. In the future, I’d like to use a bigger dataset or play around with other variables.

kmeansAIC <- function(fit){
  m <- ncol(fit$centers)
  n <- length(fit$cluster)
  k <- nrow(fit$centers)
  D <- fit$tot.withinss
  return(D + 2*m*k)
}
cos_fam <- df_wiki$family
cos_gov <- df_wiki$government
country <- row.names(df_wiki)

cluster_model <- function (fam, gov, country)
{
  kmax <- 10 # the maximum number of clusters we will examine
  kmfit <- list() # create and empty list
  totwss <- rep(0,kmax) # will be filled with total sum of within group sum squares
 
  for (i in 1:kmax){
    data <- cbind(fam,gov)
    kclus <- kmeans(data, centers=i)
    totwss[i] <- kclus$tot.withinss
    kmfit[[i]] <- kclus
  }
  aic=sapply(kmfit,kmeansAIC)
  v = -diff(aic)
  nv = length(v)
  fom = v[1:(nv-1)]/v[2:nv]
  nclus = which.max(fom)+1
  data <- cbind(fam, gov)
  grputil <- kmeans(data, centers=nclus)
  o <- grputil$cluster+0
  clustered <- cbind(fam, gov, o, country)
  colnames(clustered) <- c("Family", "Government", "o", "country")
    clustered <- data.frame(clustered)
  return(list(nclus=nclus, clustered=clustered))
}

kmeans <- cluster_model(cos_fam, cos_gov, country)
kmeans
## $nclus
## [1] 7
## 
## $clustered
##               Family        Government o            country
## 1  0.340227757141049 0.405790545757401 7               Iran
## 2  0.359220469155121 0.416947053739215 4         Afganistan
## 3  0.302735644306434 0.336732397087864 2        SaudiArabia
## 4   0.38985690156711 0.488376488253684 1               Cuba
## 5  0.325167812852401 0.363493426292242 6           Pakistan
## 6  0.379516753402282 0.452411401576391 3        Philippines
## 7  0.334730458340252 0.374413527645256 6              India
## 8  0.332405154970432 0.384968702034286 6             Brazil
## 9  0.349524098124857 0.418291357817552 4            Myanmar
## 10 0.329908090708617 0.409951393408504 7            Ukraine
## 11  0.38995225470309 0.482043108932032 1              Nepal
## 12 0.313264343147211 0.383677378881834 6              Tonga
## 13 0.369611220700527 0.434691420702436 3              Haiti
## 14 0.360634458146549 0.420941094112518 4            Bahrain
## 15 0.378942681591485 0.465146639399321 3           Colombia
## 16 0.354427494208382 0.415473297597646 4              Egypt
## 17 0.380096548499642 0.453287872602909 3         NorthKorea
## 18 0.352704461003168  0.40775172430126 5      PapuNewGuinea
## 19 0.352430068690643 0.396180107820941 5              Yemen
## 20 0.357061559045655 0.392342981278981 5             Turkey
## 21 0.339698740373538 0.389235679479467 7               Iraq
## 22 0.358998709572701 0.451387909661243 3              Qatar
## 23 0.345742683537146 0.439513074012152 4         Bangladesh
## 24  0.35826236459886 0.408742745088318 5              Japan
## 25  0.33966276603252 0.396736339866034 7            Lebanon
## 26 0.434716977496715 0.520940822860509 1         Seychelles
## 27 0.342458477901478 0.430338210782759 4 UnitedArabEmirates
clustered_data <- kmeans$clustered

ggplot(clustered_data, aes(Family, Government)) + geom_text(aes(label=country, color=o)) + theme(axis.ticks.y=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.x=element_blank())

Conclusions

Batch understanding of qualitative data is a difficult process, but I think wikipedia websites are a great tool for understanding concepts and relationships because they are all written in very similar formats. In the future, I’d like to work more with these webpages, and perhaps include some form of link analysis as well.