Net Neutrality is the principle that governments should require ISPs to treat all data on the internet equally and not discriminate against data, websites, web-based products, or online content by any means, such as charging extra for a website or slowing down certain content.
The 2015 regulation classified broadband as a highly regulated common-carrier service: just as cellular networks should not decide whom one can or cannot call, ISPs should not interfere with what one can do or see on the internet.
Sample corporate players for Net Neutrality: website and product companies like Netflix, Amazon, Google, and Facebook. Sample corporate players against Net Neutrality: ISPs like AT&T, Comcast, and Verizon.
“Gathering and analyzing comments from the public is an important part of the FCC’s rulemaking process, and it allows the public to participate in developing rules and policies that affect telecommunications and broadcast issues.”
“Whenever Congress enacts a law affecting telecommunications, the FCC starts a proceeding to create the rules and policies required by the new law. The commission also may start a proceeding when an outside party files a petition seeking a new law or change in existing rules.”
So, when a proposal to repeal the 2015 Net Neutrality regulation was made, the Federal Communications Commission opened a public docket where Americans could submit their thoughts and comments on the proposal to inform the FCC’s rulemaking. More than 22 million comments were submitted to the FCC.
The Wall Street Journal ran an article in December 2017 stating that many comments submitted to the platform were fake: personally identifiable information (PII) of many Americans had been stolen and used by spam bots and automated programs to enter fake comments. https://www.wsj.com/articles/millions-of-people-post-comments-on-federal-regulations-many-are-fake-1513099188
The WSJ and a few other investigative journalists and attorneys conducted independent surveys of random samples of the dataset to check whether the people listed had actually submitted the comments. More than 88% of respondents said they had not. https://www.techdirt.com/articles/20171214/03220038805/ny-attorney-general-finds-2-million-fake-fcc-net-neutrality-comments.shtml
Can Data Science and Machine Learning identify whether the following situations could have happened:
1. Personal information of real Americans was harvested and used without their permission to submit comments.
2. Fake comments were generated and submitted in bulk by bots and automated programs.
Surveys are the only sure way of knowing whether the above two scenarios occurred, but surveys are expensive and impractical for the entire population: emails might not guarantee a response, might not be valid, or might bounce. The solution is to employ Machine Learning and NLP techniques to score each comment based on its probability of being fake. These scores can then be used to answer the above questions probabilistically. Though independent studies have analyzed the dataset, none have employed Machine Learning techniques to score the comments.
Empower through information:
1. Build a probabilistic repository of IDs harvested without permission for fake opinions.
2. Build the same kind of repository of comments, each with a probability-of-being-fake score.
3. Publish independent, data-centric research without taking a position on Net Neutrality.
The dataset is available from the FCC website under Docket 17-108 as 3 zip files totaling more than 3 GB. A sample of 500,000 comments (298 MB) is used for the exploratory analyses.
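As a rough sketch of how such a sample could be drawn (the input file name here is illustrative, not the actual FCC file name; for a dump this large, a faster reader such as data.table::fread would be advisable in practice):

# Sketch: draw a random 500k sample from the full unzipped comment dump
full <- read.csv("fcc_17-108_full.csv")           # illustrative file name
set.seed(2017)                                    # reproducible sample
fcc_sample <- full[sample(nrow(full), 500000), ]
write.csv(fcc_sample, "fcc_500ksample.csv", row.names = FALSE)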
The sample dataset contains 500k comments; here is the count of truly unique comments:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
options(warn = -1)
setwd("E:/Dhivya/FCC")                      # local working directory
fcc_500k <- read.csv("fcc_500ksample.csv")  # 500k-comment sample of Docket 17-108
library(stringr)
# Normalize the comment text (trim whitespace, lower-case) before de-duplicating
fcc_500k$Message <- tolower(str_trim(fcc_500k$comment))
length(unique(fcc_500k$Message))
## [1] 126401
So the unique comment rate is:
(length(unique(fcc_500k$Message))/ nrow(fcc_500k))*100
## [1] 25.2802
This data is just text with no target variable. So, Start Up Policy Lab conducted an independent survey of 450,000 people and received responses from 14,000; the survey results are used as the target variable: Fake (0/1) binary. https://static1.squarespace.com/static/554441dae4b07d3f990170ea/t/5a32d72d9140b78109fd4dd4/1513281327086/SPL+TiPC+Preliminary+Report+12-14-17.pdf
The base fake rate is 38.967% (194,837 of the 500,000 comments).
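As a minimal sketch of how such survey responses could be joined back onto the comments as labels (the file and column names here are hypothetical, and this assumes the sample carries a full email column; the labels actually used in this analysis are derived from campaign IDs in the next chunk):

# Hypothetical survey extract: one row per surveyed email with the answer to
# "Did you submit this comment?" (names are illustrative, not the real layout)
survey <- read.csv("survey_responses.csv")   # columns: email, SubmittedByMe
survey$Fake_survey <- ifelse(tolower(survey$SubmittedByMe) == "no", 1, 0)
# Left-join the survey labels onto the comments; unsurveyed rows remain NA
fcc_500k <- merge(fcc_500k, survey[, c("email", "Fake_survey")],
                  by = "email", all.x = TRUE)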
Using this target variable, here is a sample bi-variate study of fake rates across two buckets of comment repetition: comments repeated <= 1000 times and those repeated more than 1000 times.
library(plyr)
fcc_500k$Row <- 1
# Count how many times each distinct comment text appears in the sample
collapse <- ddply(fcc_500k[, c("Message", "Row")], .(Message), summarize, Count = sum(Row))
fcc_500k <- merge(fcc_500k, collapse, by = "Message", all.x = TRUE)
# Flag comments from campaigns treated as fake, based on the survey results
fcc_500k$Fake <- ifelse(fcc_500k$campaign %in% c(34,16,33,3,37,4,6,7,31,53,11,32,9,5,2,8,55), 1, 0)
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
# Bucket repetition counts: up to 1000 repeats vs. more than 1000
fcc_500k$RepeatedTimesBucket <- cut(fcc_500k$Count, c(1, 1000, 30000), include.lowest = TRUE)
crosstab(fcc_500k, row.vars = "RepeatedTimesBucket", col.vars = "Fake", type = "r")
## Fake 0 1 Sum
## RepeatedTimesBucket
## [1,1e+03] 85.75 14.25 100.00
## (1e+03,3e+04] 30.49 69.51 100.00
A rough dictionary for pro-repeal and anti-repeal sentiment was built using n-grams: comments containing phrases like “unprecedented increase in government control” or “overturn president obama’s order” were marked under the Net Neutrality Repeal dictionary, and comments containing phrases like “do not repeal” or “support existing net neutrality” were marked under the Net Neutrality Do Not Repeal dictionary. This captures the sentiment of each comment.
Using these dictionaries, 74.83% of comments can be classified into Repeal and NoRepeal. 25.17% remain uncategorized, as the dictionaries are incomplete at this stage and need to be more exhaustive.
31.07% of comments were pro repeal and 43.76% of comments were against repeal.
# Phrases indicating pro-repeal sentiment (all lower-case, to match Message)
repeal_dictionary <- c("i strongly oppose", "strongly oppose", "oppose", "power grab", "fcc should repeal", "net neutrality order was the corrupt", "unprecedented increase in government control", "obama administration rammed through a massive scheme", "fcc to reverse obama's scheme", "overturn president obama's order", "over-regulation", "before leaving office", "the current fcc regulation", "to the federal communications", "as a concerned taxpayer", "in 2015, chairman tom")
# Phrases indicating anti-repeal sentiment
norepeal_dictionary <- c("do not repeal", "don't repeal", "need net neutrality", "support existing net neutrality", "support net neutrality", "protect net neutrality", "in favor of strong net neutrality", "please do not reverse", "do not support repeal", "the fcc's open internet")
# Flag comments matching any phrase in each dictionary
fcc_500k$Repeal <- ifelse(grepl(paste(repeal_dictionary, collapse = "|"), fcc_500k$Message), 1, 0)
fcc_500k$NoRepeal <- ifelse(grepl(paste(norepeal_dictionary, collapse = "|"), fcc_500k$Message), 1, 0)
# Add == 2 marks comments that match both dictionaries
fcc_500k$Add <- fcc_500k$Repeal + fcc_500k$NoRepeal
# Repeal takes precedence when both match; everything else is Unknown
fcc_500k$Sentiment <- ifelse(fcc_500k$Repeal == 1, "Repeal", ifelse(fcc_500k$NoRepeal == 1, "NoRepeal", "Unknown"))
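The coverage and sentiment-mix figures quoted above can be read directly off the new columns; a quick check:

# Shares of Repeal / NoRepeal / Unknown across the sample
round(prop.table(table(fcc_500k$Sentiment)) * 100, 2)
# Add == 2: comments matching both dictionaries (assigned "Repeal" above)
table(fcc_500k$Add)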
The fake rates by sentiment: 0.28% of anti-repeal comments are fake, versus 92.11% of pro-repeal comments.
crosstab(fcc_500k, row.vars = "Sentiment", col.vars = 'Fake', type="r")
## Fake 0 1 Sum
## Sentiment
## NoRepeal 99.72 0.28 100.00
## Repeal 7.89 92.11 100.00
## Unknown 59.37 40.63 100.00
The uniqueness rates for each sentiment also raise red flags: while only 8.12% of anti-repeal comments were repeated more than 1000 times, 85.68% of pro-repeal comments were.
crosstab(fcc_500k, row.vars = "Sentiment", col.vars = 'RepeatedTimesBucket', type="r")
## RepeatedTimesBucket [1,1e+03] (1e+03,3e+04] Sum
## Sentiment
## NoRepeal 91.88 8.12 100.00
## Repeal 14.32 85.68 100.00
## Unknown 42.14 57.86 100.00
Here is a sample tri-variate study, showing a high fake rate of 78.55% for comments that were repeated more than 1000 times and carry pro-repeal sentiment.
# Fake counts by repetition bucket and sentiment
collapse <- ddply(fcc_500k[, c("RepeatedTimesBucket", "Sentiment", "Fake")],
                  .(RepeatedTimesBucket, Sentiment), summarize, FakeCount = sum(Fake))
# FakeRate = each cell's share of the fake comments within its repetition bucket
collapse$FakeRate <- round(ifelse(collapse$RepeatedTimesBucket == '[1,1e+03]',
                                  collapse$FakeCount / sum(collapse$FakeCount[collapse$RepeatedTimesBucket == '[1,1e+03]']),
                                  collapse$FakeCount / sum(collapse$FakeCount[collapse$RepeatedTimesBucket == '(1e+03,3e+04]'])) * 100, 3)
collapse
## RepeatedTimesBucket Sentiment FakeCount FakeRate
## 1 [1,1e+03] NoRepeal 620 1.575
## 2 [1,1e+03] Repeal 20947 53.215
## 3 [1,1e+03] Unknown 17796 45.210
## 4 (1e+03,3e+04] NoRepeal 0 0.000
## 5 (1e+03,3e+04] Repeal 122125 78.550
## 6 (1e+03,3e+04] Unknown 33349 21.450
These are the 20 most common email domains used in the comments:
# Normalize the email domain field, then list the 20 most frequent domains
fcc_500k$EmailDomain <- tolower(str_trim(fcc_500k$email_domain))
head(sort(table(fcc_500k$EmailDomain), decreasing = TRUE), 20)
##
## gmail.com yahoo.com pornhub.com einrot.com
## 112871 57625 52830 23015 17877
## armyspy.com jourrapide.com rhyta.com cuvox.de fleckens.hu
## 17790 17756 17480 17466 17397
## gustr.com teleworm.us dayrep.com superrito.com hotmail.com
## 17369 17312 17260 17164 15045
## aol.com hurra.de comcast.net msn.com icloud.com
## 14352 8237 4634 2489 2372
These are the fake rates for the 20 most common domains. Interestingly, comments using emails from fake-mail-generator domains were almost always genuine, possibly meaning that people who entered genuine comments wanted to protect their privacy and so used throwaway email addresses.
# Keep the most common domains as-is; bucket everything else as "Others"
fcc_500k$Domain <- ifelse(fcc_500k$EmailDomain %in% c("gmail.com","yahoo.com","pornhub.com","einrot.com","armyspy.com","jourrapide.com","rhyta.com","cuvox.de","fleckens.hu","gustr.com", "teleworm.us","dayrep.com","superrito.com","hotmail.com","aol.com","hurra.de","comcast.net","msn.com","icloud.com"),fcc_500k$EmailDomain,"Others")
crosstab(fcc_500k, row.vars = "Domain", col.vars = 'Fake', type="r")
## Fake 0 1 Sum
## Domain
## aol.com 13.68 86.32 100.00
## armyspy.com 99.99 0.01 100.00
## comcast.net 27.28 72.72 100.00
## cuvox.de 99.99 0.01 100.00
## dayrep.com 100.00 0.00 100.00
## einrot.com 100.00 0.00 100.00
## fleckens.hu 100.00 0.00 100.00
## gmail.com 18.87 81.13 100.00
## gustr.com 100.00 0.00 100.00
## hotmail.com 17.40 82.60 100.00
## hurra.de 100.00 0.00 100.00
## icloud.com 10.33 89.67 100.00
## jourrapide.com 100.00 0.00 100.00
## msn.com 19.53 80.47 100.00
## Others 77.38 22.62 100.00
## pornhub.com 100.00 0.00 100.00
## rhyta.com 100.00 0.00 100.00
## superrito.com 100.00 0.00 100.00
## teleworm.us 99.99 0.01 100.00
## yahoo.com 10.06 89.94 100.00
Fake rates for comments whose emails come specifically from Fake Mail Generator domains:
# Flag the ten Fake Mail Generator domains
fcc_500k$FromFakeMailGenerator <- ifelse(fcc_500k$Domain %in% c("einrot.com","armyspy.com","jourrapide.com","rhyta.com","cuvox.de","fleckens.hu","gustr.com", "teleworm.us","dayrep.com","superrito.com"),1,0)
crosstab(fcc_500k, row.vars = "FromFakeMailGenerator", col.vars = 'Fake', type="r")
## Fake 0 1 Sum
## FromFakeMailGenerator
## 0 40.07 59.93 100.00
## 1 100.00 0.00 100.00
This sample plot shows four comment campaigns in which the exact same comment text was repeated 170,697, 37,393, 2,544 and 31,254 times respectively.
# Taking only comments that belong to a known campaign
take <- fcc_500k[fcc_500k$campaign != 0, ]
take$Group <- ifelse(take$campaign == 1, "AgainstRepeal", "ProRepeal")
library(ggplot2)
ggplot(take, aes(campaign, fill = Group)) + geom_histogram() +
  ggtitle("Repeat Comment Occurrences") + labs(x = "CommentCampaign", y = "Repeats")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The next step is to engineer many more such variables to feed into the model. Candidate variables include: campaign, sentiment, email domain, type of domain (business/personal), IP address, count of words in the comment, a comment uniqueness index, international address, US state, US city, a name uniqueness index, and so on, each with a similar bi-variate study.
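As a small illustration of this feature engineering (the variable names here are illustrative, not final), two of the candidates can be built directly from columns created above:

# Count of words in each comment
fcc_500k$WordCount <- str_count(fcc_500k$Message, "\\S+")
# A simple comment uniqueness index: 1 for unique text, near 0 for mass repeats
fcc_500k$UniquenessIndex <- 1 / fcc_500k$Count
# Bi-variate study of word count against the Fake flag
fcc_500k$WordCountBucket <- cut(fcc_500k$WordCount, c(0, 50, 200, Inf))
crosstab(fcc_500k, row.vars = "WordCountBucket", col.vars = "Fake", type = "r")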
After that, a model will be fit to predict the probability of a comment being fake. The first-cut model will be iterated to drop less predictive variables, validating performance at each step with rank-ordering checks and lift charts.
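A minimal first-cut sketch of such a model, assuming the variables engineered above (a plain logistic regression here; the eventual model and its validation would be more involved):

# First-cut logistic regression scoring each comment's probability of being fake
fit <- glm(Fake ~ Sentiment + FromFakeMailGenerator + RepeatedTimesBucket + WordCount,
           data = fcc_500k, family = binomial)
fcc_500k$FakeScore <- predict(fit, type = "response")
# Crude rank-ordering check: the actual fake rate per score band should rise
# steadily from the lowest band to the highest if the model rank-orders well
breaks <- unique(quantile(fcc_500k$FakeScore, probs = seq(0, 1, 0.1)))
fcc_500k$ScoreDecile <- cut(fcc_500k$FakeScore, breaks, include.lowest = TRUE)
aggregate(Fake ~ ScoreDecile, data = fcc_500k, FUN = mean)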