FCC’s Net Neutrality Repeal Rulemaking: Comments Scoring Project

1. Context:

1. a) Net Neutrality:

Net Neutrality is the principle that governments should require ISPs to treat all data on the internet equally and not discriminate against data, websites, web-based products or online content, for example by charging extra money for a website or slowing down certain content.


The 2015 regulation classified broadband as a highly regulated service: just as cellular networks shouldn’t decide whom one can or cannot call, ISPs shouldn’t interfere with what one can or cannot do and see on the internet.

Sample Corporate Players For Net Neutrality: Websites and Product Companies like Netflix, Amazon, Google, Facebook

Sample Corporate Players Against Net Neutrality: ISPs like AT&T, Comcast, Verizon

1. b) FCC Rule Making:

“Gathering and analyzing comments from the public is an important part of the FCC’s rulemaking process, and it allows the public to participate in developing rules and policies that affect telecommunications and broadcast issues.”

“Whenever Congress enacts a law affecting telecommunications, the FCC starts a proceeding to create the rules and policies required by the new law. The commission also may start a proceeding when an outside party files a petition seeking a new law or change in existing rules.”

So, when a proposal to repeal the 2015 Net Neutrality regulation was made, the Federal Communications Commission opened a public docket where Americans could submit their thoughts and comments on the proposal to inform the FCC’s rulemaking. More than 22 million comments were submitted to the FCC.

2. Opportunity for Data Science:

2. a) Problem Statement:

Can Data Science and Machine Learning identify whether the following situations occurred:

  1. IDs of Americans were stolen by spam bots to input false/ fake comments
  2. Bots compromised the democratic process of FCC policy making through manufactured comments

Surveys are the only sure way of knowing whether the above two scenarios occurred, but they are expensive and impractical for studying the entire population: emails do not guarantee a response, addresses might not be valid, and messages might bounce. The solution is to employ Machine Learning and NLP techniques to score each comment on its probability of being fake. These scores can then be used to answer the above questions probabilistically. Though independent studies have analyzed the datasets, none have employed Machine Learning techniques to score the comments.

2. b) Objective:

Empower through information:

  1. Build a probabilistic repository of IDs harvested without permission for fake opinions.
  2. Build a similar repository of comments, each scored with its probability of being fake.
  3. Publish independent, data-centric research without taking a position on Net Neutrality.

2. c) Data Set:

The dataset is available from the FCC website Docket 17-108. The entire dataset is available as 3 zip files. The entire dataset is > 3 GB in size. A sample of 500,000 comments in 298 MB is used for the exploratory analyses.

3. Exploratory Analyses:

Result 1

The sample dataset contains 500k comments. After trimming whitespace and lowercasing, this is the number of truly unique comments:

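# Chunk options: show code, hide warnings
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
options(warn = -1)
library(stringr)
# Load the 500k-comment sample (the path is specific to the author's machine)
setwd("E:/Dhivya/FCC")
fcc_500k <- read.csv("fcc_500ksample.csv")
# Normalize the comment text: trim whitespace and lowercase
fcc_500k$Message <- tolower(str_trim(fcc_500k$comment))
# Number of distinct normalized comments
length(unique(fcc_500k$Message))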
## [1] 126401

Result 2

So the unique comment rate is:

(length(unique(fcc_500k$Message))/ nrow(fcc_500k))*100
## [1] 25.2802

This data is just text with no target variable. Start Up Policy Lab conducted an independent survey of 450,000 people and received responses from 14,000 of them. The survey results are used here as the target variable, a binary Fake (0/1) flag. https://static1.squarespace.com/static/554441dae4b07d3f990170ea/t/5a32d72d9140b78109fd4dd4/1513281327086/SPL+TiPC+Preliminary+Report+12-14-17.pdf

The base fake rate is: 38.967% (194837 comments out of 500000)

Using this target variable, here is a sample bi-variate study of Fake rates under two buckets of repetition: comments whose text was repeated <= 1000 times and comments whose text was repeated > 1000 times.

Result 3

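library(plyr)
# Count how many times each exact (normalized) message occurs
fcc_500k$Row <- 1
collapse <- ddply(fcc_500k[,c("Message","Row")],.(Message),summarize,Count=sum(Row))
fcc_500k <- merge(fcc_500k, collapse, by="Message", all.x=T)
# Flag comments belonging to the campaigns treated as fake
fcc_500k$Fake <- ifelse(fcc_500k$campaign %in% c(34,16,33,3,37,4,6,7,31,53,11,32,9,5,2,8,55),1,0)
# crosstab() helper loaded from an external script
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
# Bucket the repeat counts: at most 1000 vs. more than 1000
fcc_500k$RepeatedTimesBucket <- cut(fcc_500k$Count, c(1,1000,30000), include.lowest=TRUE)
# Row percentages of Fake within each bucket
crosstab(fcc_500k, row.vars = "RepeatedTimesBucket", col.vars = 'Fake', type="r")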
##                     Fake      0      1    Sum
## RepeatedTimesBucket                          
## [1,1e+03]                 85.75  14.25 100.00
## (1e+03,3e+04]             30.49  69.51 100.00
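The crosstab() helper above is loaded from an external script. As a rough base-R sketch of the same idea (repeat counts, two buckets, row percentages), the following uses made-up toy data, with a cut-off of 2 standing in for the 1000 used in the analysis. The `toy` data frame and its values are assumptions for illustration only.

```r
# Toy data: hypothetical comments and Fake flags (not FCC data).
toy <- data.frame(
  Message = c("a", "a", "a", "b", "c", "c"),
  Fake    = c(1, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Number of times each exact message occurs, attached to every row.
toy$Count <- ave(rep(1, nrow(toy)), toy$Message, FUN = sum)

# Two buckets; the cut-off of 2 stands in for the 1000 used in the analysis.
toy$RepeatedTimesBucket <- cut(toy$Count, c(1, 2, 3), include.lowest = TRUE)

# Row percentages: share of Fake = 0 / 1 within each bucket.
counts  <- table(toy$RepeatedTimesBucket, toy$Fake)
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
print(row_pct)
```

Using `prop.table()` with `margin = 1` normalizes each row to 100%, which is what `type="r"` requests from the helper.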

A rough dictionary of n-grams was created for each side. Comments containing phrases like “unprecedented increase in government control” or “overturn President Obama’s order” were marked under the Net Neutrality Repeal dictionary, and comments containing phrases like “do not repeal” or “support existing net neutrality” were marked under the Net Neutrality Do Not Repeal dictionary. This captures the sentiment of each comment.

Using these dictionaries, we are able to classify 74.83% of comments into Repeal and NoRepeal. The remaining 25.17% of comments are uncategorized because the dictionaries are incomplete at this stage and need to be more exhaustive.

31.07% of comments were pro repeal and 43.76% of comments were against repeal.

Result 4

# Phrase dictionaries; every phrase is lowercase because Message is lowercased
repeal_dictionary <- c("i strongly oppose", "strongly oppose", "oppose", "power grab", "fcc should repeal", "net neutrality order was the corrupt", "unprecedented increase in government control", "obama administration rammed through a massive scheme", "fcc to reverse obama's scheme", "overturn president obama's order", "over-regulation", "before leaving office", "the current fcc regulation", "to the federal communications", "as a concerned taxpayer", "in 2015, chairman tom")
norepeal_dictionary <- c("do not repeal", "don't repeal", "need net neutrality", "support existing net neutrality", "support net neutrality", "protect net neutrality", "in favor of strong net neutrality", "please do not reverse", "do not support repeal", "the fcc's open internet")
# Flag comments that contain any phrase from each dictionary
fcc_500k$Repeal <- ifelse(grepl(paste(repeal_dictionary, collapse = "|"), fcc_500k$Message), 1, 0)
fcc_500k$NoRepeal <- ifelse(grepl(paste(norepeal_dictionary, collapse = "|"), fcc_500k$Message), 1, 0)
fcc_500k$Add <- fcc_500k$Repeal + fcc_500k$NoRepeal
# Repeal takes precedence when a comment matches both dictionaries
fcc_500k$Sentiment <- ifelse(fcc_500k$Repeal==1,"Repeal",ifelse(fcc_500k$NoRepeal==1,"NoRepeal","Unknown"))
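The share of comments in each sentiment class can be tabulated directly from the labels. The sketch below is a minimal illustration on made-up comments with two hypothetical mini-dictionaries (`repeal_terms`, `norepeal_terms`). It matches phrases literally, one at a time, which is one possible way to keep punctuation in a phrase from being read as a regular expression. This is an assumption about a reasonable design, not the method used above.

```r
# Hypothetical mini-dictionaries and comments, for illustration only.
repeal_terms   <- c("power grab", "fcc should repeal")
norepeal_terms <- c("do not repeal", "support net neutrality")

messages <- c(
  "the 2015 order was a power grab",
  "please do not repeal the rules",
  "i support net neutrality",
  "no opinion either way"
)

# Literal (fixed = TRUE) matching cannot use "|" alternation,
# so each phrase is checked separately and the results are OR-ed.
matches_any <- function(text, terms) {
  Reduce(`|`, lapply(terms, function(term) grepl(term, text, fixed = TRUE)))
}

# Repeal takes precedence over NoRepeal, as in the analysis code.
sentiment <- ifelse(matches_any(messages, repeal_terms), "Repeal",
             ifelse(matches_any(messages, norepeal_terms), "NoRepeal", "Unknown"))

# Percentage of comments per sentiment label.
share <- round(prop.table(table(sentiment)) * 100, 2)
print(share)
```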

The Fake rates by sentiment are 0.28% for the against-repeal sentiment and 92.11% for the pro-repeal sentiment.

Result 5

crosstab(fcc_500k, row.vars = "Sentiment", col.vars = 'Fake', type="r")
##           Fake      0      1    Sum
## Sentiment                          
## NoRepeal        99.72   0.28 100.00
## Repeal           7.89  92.11 100.00
## Unknown         59.37  40.63 100.00

The uniqueness rates for each sentiment also raised red flags: while only 8.12% of against-repeal comments had text repeated more than 1000 times, 85.68% of pro-repeal comments had text repeated more than 1000 times.

Result 6

crosstab(fcc_500k, row.vars = "Sentiment", col.vars = 'RepeatedTimesBucket', type="r")
##           RepeatedTimesBucket [1,1e+03] (1e+03,3e+04]    Sum
## Sentiment                                                   
## NoRepeal                          91.88          8.12 100.00
## Repeal                            14.32         85.68 100.00
## Unknown                           42.14         57.86 100.00

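Here’s a sample tri-variate: among the fake comments repeated more than 1000 times, 78.55% carried pro-repeal sentiment. Note that the FakeRate column is each sentiment’s share of the fake comments within a repeat bucket, not the fraction of that cell’s comments that are fake.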

Result 7

# Fake comment count per repeat bucket and sentiment
collapse <- ddply(fcc_500k[,c("RepeatedTimesBucket","Sentiment","Fake")],.(RepeatedTimesBucket,Sentiment),summarize,FakeCount=sum(Fake))
# Each cell's share (%) of the fake comments within its repeat bucket
collapse$FakeRate <- round(collapse$FakeCount/ave(collapse$FakeCount, collapse$RepeatedTimesBucket, FUN=sum)*100, 3)
collapse
##   RepeatedTimesBucket Sentiment FakeCount FakeRate
## 1           [1,1e+03]  NoRepeal       620    1.575
## 2           [1,1e+03]    Repeal     20947   53.215
## 3           [1,1e+03]   Unknown     17796   45.210
## 4       (1e+03,3e+04]  NoRepeal         0    0.000
## 5       (1e+03,3e+04]    Repeal    122125   78.550
## 6       (1e+03,3e+04]   Unknown     33349   21.450
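As a check on how the FakeRate column is derived, the sketch below rebuilds it from the FakeCount values printed in the Result 7 output. The `result7` data frame and the `ShareOfFakes` column name are introduced here only for illustration. Each value is the cell's FakeCount divided by the total FakeCount of its repeat bucket.

```r
# FakeCount values copied from the Result 7 output.
result7 <- data.frame(
  RepeatedTimesBucket = rep(c("[1,1e+03]", "(1e+03,3e+04]"), each = 3),
  Sentiment = rep(c("NoRepeal", "Repeal", "Unknown"), times = 2),
  FakeCount = c(620, 20947, 17796, 0, 122125, 33349),
  stringsAsFactors = FALSE
)

# Total fake comments in each repeat bucket, repeated on every row.
bucket_total <- ave(result7$FakeCount, result7$RepeatedTimesBucket, FUN = sum)

# Each sentiment's share (%) of the fake comments inside its bucket.
result7$ShareOfFakes <- round(result7$FakeCount / bucket_total * 100, 3)
print(result7)

# Total fake comments across all cells; matches the base count of 194837.
total_fake <- sum(result7$FakeCount)
print(total_fake)
```

Computing the fraction of comments in a cell that are fake would additionally need each cell's total comment count, which the output above does not show.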

These are the 20 most common email domains used in the comments (the third entry has no label because it corresponds to a blank email domain):

Result 8

fcc_500k$EmailDomain <- tolower(str_trim(fcc_500k$email_domain))
head(sort(table(fcc_500k$EmailDomain),decreasing = TRUE),20)
## 
##      gmail.com      yahoo.com                   pornhub.com     einrot.com 
##         112871          57625          52830          23015          17877 
##    armyspy.com jourrapide.com      rhyta.com       cuvox.de    fleckens.hu 
##          17790          17756          17480          17466          17397 
##      gustr.com    teleworm.us     dayrep.com  superrito.com    hotmail.com 
##          17369          17312          17260          17164          15045 
##        aol.com       hurra.de    comcast.net        msn.com     icloud.com 
##          14352           8237           4634           2489           2372

These are the Fake rates for the most common domains. Interestingly, comments submitted with addresses from fake mail generators were almost always genuine, possibly because people who entered genuine comments wanted to protect their privacy and so used throwaway email addresses.

Result 9

fcc_500k$Domain <- ifelse(fcc_500k$EmailDomain %in% c("gmail.com","yahoo.com","pornhub.com","einrot.com","armyspy.com","jourrapide.com","rhyta.com","cuvox.de","fleckens.hu","gustr.com","teleworm.us","dayrep.com","superrito.com","hotmail.com","aol.com","hurra.de","comcast.net","msn.com","icloud.com"),fcc_500k$EmailDomain,"Others")
crosstab(fcc_500k, row.vars = "Domain", col.vars = 'Fake', type="r")
##                Fake      0      1    Sum
## Domain                                  
## aol.com              13.68  86.32 100.00
## armyspy.com          99.99   0.01 100.00
## comcast.net          27.28  72.72 100.00
## cuvox.de             99.99   0.01 100.00
## dayrep.com          100.00   0.00 100.00
## einrot.com          100.00   0.00 100.00
## fleckens.hu         100.00   0.00 100.00
## gmail.com            18.87  81.13 100.00
## gustr.com           100.00   0.00 100.00
## hotmail.com          17.40  82.60 100.00
## hurra.de            100.00   0.00 100.00
## icloud.com           10.33  89.67 100.00
## jourrapide.com      100.00   0.00 100.00
## msn.com              19.53  80.47 100.00
## Others               77.38  22.62 100.00
## pornhub.com         100.00   0.00 100.00
## rhyta.com           100.00   0.00 100.00
## superrito.com       100.00   0.00 100.00
## teleworm.us          99.99   0.01 100.00
## yahoo.com            10.06  89.94 100.00

Fake Rates for emails specifically from Fake Mail Generator domains.

Result 10

fcc_500k$FromFakeMailGenerator <- ifelse(fcc_500k$Domain %in% c("einrot.com","armyspy.com","jourrapide.com","rhyta.com","cuvox.de","fleckens.hu","gustr.com","teleworm.us","dayrep.com","superrito.com"),1,0)
crosstab(fcc_500k, row.vars = "FromFakeMailGenerator", col.vars = 'Fake', type="r")
##                       Fake      0      1    Sum
## FromFakeMailGenerator                          
## 0                           40.07  59.93 100.00
## 1                          100.00   0.00 100.00

This sample plot shows four comment campaigns in which the exact same wording was repeated 170697, 37393, 2544 and 31254 times respectively.

Result 11

## Taking only comments that are classified now
take <- fcc_500k[fcc_500k$Campaign!=0,]
take$Group <- ifelse(take$Campaign==1,"AgainstRepeal","ProRepeal")
library(ggplot2)
ggplot(take, aes(Campaign, fill=Group)) + geom_histogram() + ggtitle("Repeat Comment Occurrences") + labs(x="CommentCampaign", y="Repeats")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4. Approach:

  • The next step is to engineer many more such variables to feed into the model and to perform similar bi-variate studies on them. Possible variables include: Campaign, Sentiment, Email domain, Type of domain (Business/ Personal), IP address, Count of words in comment, Comment Uniqueness Index, International address, US State, US City, Name Uniqueness Index and so on.

  • The step after that is to fit a model that predicts the probability of a comment being fake. The first-cut model will be iterated to remove less predictive variables, with performance validated at each step using rank-ordering algorithms and lift charts.
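The two steps above can be sketched on synthetic data. This is a minimal illustration, assuming a logistic regression as the first-cut model and a decile lift table as the rank-ordering check. The feature names (`RepeatCount`, `WordCount`, `FromGenerator`), the simulated relationship and the choice of model are all assumptions, not the project's actual specification.

```r
set.seed(42)  # synthetic data only; no FCC records are used here

n <- 2000
# Hypothetical engineered features.
toy <- data.frame(
  RepeatCount   = sample(c(1, 5, 50, 5000), n, replace = TRUE),
  WordCount     = sample(5:200, n, replace = TRUE),
  FromGenerator = rbinom(n, 1, 0.2)
)

# Simulated target: heavily repeated comments are more likely to be fake.
true_logit <- -2 + 0.5 * log(toy$RepeatCount) - 1.5 * toy$FromGenerator
toy$Fake <- rbinom(n, 1, plogis(true_logit))

# First-cut logistic regression producing a probability of being fake.
fit <- glm(Fake ~ log(RepeatCount) + WordCount + FromGenerator,
           data = toy, family = binomial)
toy$Score <- predict(fit, type = "response")

# Rank-order check: decile 1 holds the highest scores, decile 10 the lowest.
toy$Decile <- cut(rank(-toy$Score, ties.method = "first"),
                  breaks = seq(0, n, length.out = 11), labels = 1:10)

# Observed fake rate per decile, and lift relative to the overall fake rate.
lift <- aggregate(Fake ~ Decile, data = toy, FUN = mean)
lift$Lift <- lift$Fake / mean(toy$Fake)
print(lift)
```

A model that rank-orders well shows lift above 1 in the top deciles and falling lift toward the bottom ones. Variables whose removal leaves this table essentially unchanged are candidates for dropping.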

5. Use Cases:

  1. Build a probabilistic repository of IDs harvested without permission for fake opinions, using a threshold: any ID with a probability of being fake higher than the threshold will be classified as highly likely to be a stolen ID.
  2. Build a similar repository of comments, each scored with its probability of being fake.
  3. Publish independent, data-centric research without taking a position on Net Neutrality.
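The threshold rule in the first use case could look roughly like the following. The comment IDs, scores and the cut-off of 0.8 are all made up for illustration. A real cut-off would have to be chosen from validation results.

```r
# Hypothetical model scores; IDs and values are made up for illustration.
scores <- data.frame(
  CommentID = c("c1", "c2", "c3", "c4"),
  ProbFake  = c(0.97, 0.42, 0.81, 0.05),
  stringsAsFactors = FALSE
)

threshold <- 0.8  # assumed cut-off, not a value from the project

# Flag every comment whose score exceeds the threshold.
scores$LikelyFake <- scores$ProbFake > threshold
flagged <- scores[scores$LikelyFake, ]
print(flagged)
```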

6. Evaluation Metrics:

  • Compare the net sentiment with and without the model, where the model is used to filter out fake comments. If the net sentiment after filtering is “Against Repeal”, then conclude that the FCC’s “Pro Repeal” decision was inaccurate because it did not factor in the influence of fake comments.
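A minimal sketch of this comparison, on made-up scored comments, is shown below. It assumes net sentiment is defined as the pro-repeal count minus the against-repeal count and that comments scoring above an assumed threshold of 0.5 are filtered out. Both definitions are illustrative assumptions.

```r
# Hypothetical scored comments (not FCC data).
scored <- data.frame(
  Sentiment = c("Repeal", "Repeal", "Repeal", "NoRepeal", "NoRepeal"),
  ProbFake  = c(0.95, 0.90, 0.10, 0.05, 0.20),
  stringsAsFactors = FALSE
)

# Net sentiment: pro-repeal count minus against-repeal count.
# Positive means net "Pro Repeal", negative means net "Against Repeal".
net_sentiment <- function(d) {
  sum(d$Sentiment == "Repeal") - sum(d$Sentiment == "NoRepeal")
}

threshold <- 0.5  # assumed cut-off for treating a comment as fake

net_all      <- net_sentiment(scored)
net_filtered <- net_sentiment(scored[scored$ProbFake <= threshold, ])

print(c(all = net_all, filtered = net_filtered))
```

In this toy case the sign flips from positive to negative once likely fakes are removed, which is the pattern the evaluation metric is designed to detect.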