Abstract

In this final project for DATA 607, we use Federal Election Commission (FEC) data to research campaign contributions and expenditures for the 2016 election.

We analyze the contribution data of the two major-party candidates.

We then analyze how the Political Action Committees (PACs) spent their money in 2016.

Finally, we would like to run a sentiment analysis of Hillary Clinton-related tweets, especially around the Comey letter episode.

Environment Preparation
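
The report leans on the packages below; loading them up front keeps the later chunks reproducible. This is a sketch of the setup chunk (plyr and knitr are called with :: later on, so they only need to be installed):

library(data.table)   # fread() for fast CSV reads
library(dplyr)        # filter() / %>% pipelines
library(stringr)      # str_replace_all(), str_split()
library(mongolite)    # mongo() client for MongoDB
library(ggplot2)      # ggplot(), qplot()
library(twitteR)      # setup_twitter_oauth(), searchTwitter()
library(tm)           # stopwords(), removeWords()
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal() palettes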

Data Acquisition :

We got the data for our first analysis from the FEC website. These files are quite large and contain extensive individual-contributor information. The two candidate contribution files, along with the PAC expenditure file, are loaded below.

# Load the HRC contributions dataset and convert it to a data.frame

hrccontributions=fread('HRC_Cont.csv')
## Read 3506081 rows and 19 (of 19) columns from 0.621 GB file in 00:00:14
hrccontributions=data.frame(hrccontributions)

# Calculating the mean contribution for HRC (V10 is the contribution amount).
mean(hrccontributions$V10)
## [1] 147.0197
summary(hrccontributions$V10)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -20000       15       25      147      100 12780000
# Load the DJT contributions dataset and convert it to a data.frame

djtcontributions=fread('DJT_Cont.csv')
djtcontributions=data.frame(djtcontributions)

# Calculating the mean contribution for DJT
mean(djtcontributions$V10)
## [1] 158.8418
summary(djtcontributions$V10)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -84240.0     28.0     64.0    158.8    160.0  86940.0
# Loading the PAC independent-expenditure file.
pacexpn=fread('FEC_independent-expenditure.csv')
pacexpn=data.frame(pacexpn)

Data Cleansing :

In this data-munging step we use regex and dplyr functions to keep only the fields that interest us and to drop the trivial ones.

# Selecting only the important fields; excluding trivial data.
hrccontributions=subset(hrccontributions,select = c(V3:V13))

# Regex to reduce the date field (V11) to just the month abbreviation
hrccontributions$V11=str_replace_all(hrccontributions$V11, "[:digit:]","")
hrccontributions$V11=str_replace_all(hrccontributions$V11, "-","")
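
A quick sanity check of the two replacements above, using a hypothetical value in the FEC date format (DD-MMM-YY):

# Digits go first, then hyphens: "20-JUN-16" -> "-JUN-" -> "JUN"
str_replace_all(str_replace_all("20-JUN-16", "[:digit:]", ""), "-", "")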


# Selecting only the important fields; excluding trivial data.
djtcontributions=subset(djtcontributions,select = c(V3:V13))

# Regex to reduce the date field (V11) to just the month abbreviation
djtcontributions$V11=str_replace_all(djtcontributions$V11, "[:digit:]","")
djtcontributions$V11=str_replace_all(djtcontributions$V11, "-","")


# Selecting only the important fields; excluding trivial data.
pacexpn=subset(pacexpn,select = c(2,10,11,12,13,14))

Data Management :

We use MongoDB to load the data into a NoSQL database, which comes with its own challenges. After trying many R packages, we finally settled on mongolite. The package itself works very well, but there is not much documentation for it, so even a small search operation requires a fair amount of research to find the correct command.

# Loading HRC contributions into a collection.
con=mongo(collection = "test", db = "test", url = "mongodb://localhost",
verbose = FALSE, options = ssl_options())

con$insert(hrccontributions)
## List of 5
##  $ nInserted  : num 3506081
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
# Pull every contribution above $2,000.
alldata=con$find(query = '{"V10" : { "$gt" : 2000 } }')

# Top 10 contributions sorted by amount, without an index (unitemized Victory Fund entries excluded).
top10 = con$find('{"V10" : { "$gt" : 2000 }, "V4" : { "$ne" : "HILLARY VICTORY FUND - UNITEMIZED" }}', sort = '{"V10": -1}', limit = 10)
knitr::kable(top10)
|V3                      |V4                               |V5            |V6 |V7        |V8                      |V9                       |   V10|V11 |V12 |V13 |
|:-----------------------|:--------------------------------|:-------------|:--|:---------|:-----------------------|:------------------------|-----:|:---|:---|:---|
|Clinton, Hillary Rodham |GOCKE, THOMAS                    |JUPITER       |FL |334691584 |SELF-EMPLOYED           |PHYSICIAN                | 20000|JUN |    |    |
|Clinton, Hillary Rodham |YOUNG, SAMUEL J.                 |TEHACHAPI     |CA |935618652 |SELF-EMPLOYED           |REAL ESTATE INVESTOR     | 10000|JUN |    |X   |
|Clinton, Hillary Rodham |YOUNG, SAMUEL J.                 |TEHACHAPI     |CA |935618652 |INFORMATION REQUESTED   |INFORMATION REQUESTED    |  7300|JUN |    |    |
|Clinton, Hillary Rodham |HILLARY ACTION FUND - UNITEMIZED |NEW YORK      |NY |101855256 |                        |                         |  5620|SEP |    |X   |
|Clinton, Hillary Rodham |PROPPER, GREG                    |LOS ANGELES   |CA |900691429 |PROPPER DALEY           |PHILANTHROPIC CONSULTING |  5400|JUN |    |    |
|Clinton, Hillary Rodham |PICKER, MICHAEL                  |SACRAMENTO    |CA |958413111 |STATE OF CALIFORNIA     |COMMISSIONER             |  5400|NOV |    |    |
|Clinton, Hillary Rodham |AUSTIN, ALAN                     |ATHERTON      |CA |940275458 |N/A                     |RETIRED                  |  5400|AUG |    |    |
|Clinton, Hillary Rodham |FOX, ALAN                        |STUDIO CITY   |CA |916042407 |ACF PROPERTY MANAGEMENT |PRESIDENT                |  5400|MAR |    |    |
|Clinton, Hillary Rodham |BEAUBIEN, JAMES                  |SANTA MONICA  |CA |904023024 |LATHAM & WATKINS LLP    |ATTORNEY                 |  5400|APR |    |    |
|Clinton, Hillary Rodham |CARROLL, DANIEL ASHTON           |SAN FRANCISCO |CA |941151125 |TPG CAPITAL             |INVESTMENT MANAGER       |  5400|MAR |    |    |
# One interesting thing we found while using MongoDB: indexing greatly reduces search time.
system.time(con$find('{"V10" : { "$gt" : 2000 }, "V4" : { "$ne" : "HILLARY VICTORY FUND - UNITEMIZED" }}', sort = '{"V10": -1}', limit = 10))
##    user  system elapsed 
##   0.001   0.000   1.587
# Add an index on the contribution field (V10) to measure how much faster the sorted query becomes.
con$index(add = '{"V10" : 1}')
##   v key._id key.V10  name        ns
## 1 2       1      NA  _id_ test.test
## 2 2      NA       1 V10_1 test.test
system.time(con$find('{"V10" : { "$gt" : 2000 }, "V4" : { "$ne" : "HILLARY VICTORY FUND - UNITEMIZED" }}', sort = '{"V10": -1}', limit = 10))
##    user  system elapsed 
##   0.001   0.000   0.002
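The speedup comes from the index letting MongoDB walk a pre-sorted structure instead of sorting 3.5 million documents on every query. mongolite's index() method can also list or drop indexes (a small sketch; the index name V10_1 comes from the listing above):

con$index()                    # list the current indexes on the collection
# con$index(remove = "V10_1")  # drop the V10 index when it is no longer needed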
#Loading DJT Contribution into a new collection.
con_2=mongo(collection = "test_djt", db = "test", url = "mongodb://localhost",
verbose = FALSE, options = ssl_options())

#Insert the DJT contributions, then pull the top 10 contributions.
con_2$insert(djtcontributions)
## List of 5
##  $ nInserted  : num 782711
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
alldata_trmp = con_2$find(query = '{"V10" : { "$gt" : 2000 } } ', sort = '{"V10": -1}', limit = 10)

#Remove all documents from MongoDB (this keeps the collections and their indexes)
con$remove('{}')
con_2$remove('{}')

#Drop the collections themselves
con$drop()
con_2$drop()

Data Analysis :

We now reproduce the analysis we did with MongoDB functions using dplyr in R.

# Performing the operations that were done in MongoDB, now in R.
# HRC campaign: keep the top 100 contributions by amount.
top_fifty = hrccontributions %>%
         filter(rank(desc(V10)) <= 100) 

# Removing the HRC Victory Fund entries to find individual contributors.
top_fifty = top_fifty %>%
      filter(V4 != "HILLARY VICTORY FUND - UNITEMIZED")
    
# Top 10 individual contributors to the HRC campaign.
knitr::kable(head(plyr::arrange(top_fifty, desc(V10)), n = 10))   
|V3                      |V4                               |V5            |V6 |V7        |V8                      |V9                       |   V10|V11 |V12 |V13 |
|:-----------------------|:--------------------------------|:-------------|:--|:---------|:-----------------------|:------------------------|-----:|:---|:---|:---|
|Clinton, Hillary Rodham |GOCKE, THOMAS                    |JUPITER       |FL |334691584 |SELF-EMPLOYED           |PHYSICIAN                | 20000|JUN |    |    |
|Clinton, Hillary Rodham |YOUNG, SAMUEL J.                 |TEHACHAPI     |CA |935618652 |SELF-EMPLOYED           |REAL ESTATE INVESTOR     | 10000|JUN |    |X   |
|Clinton, Hillary Rodham |YOUNG, SAMUEL J.                 |TEHACHAPI     |CA |935618652 |INFORMATION REQUESTED   |INFORMATION REQUESTED    |  7300|JUN |    |    |
|Clinton, Hillary Rodham |HILLARY ACTION FUND - UNITEMIZED |NEW YORK      |NY |101855256 |                        |                         |  5620|SEP |    |X   |
|Clinton, Hillary Rodham |PROPPER, GREG                    |LOS ANGELES   |CA |900691429 |PROPPER DALEY           |PHILANTHROPIC CONSULTING |  5400|JUN |    |    |
|Clinton, Hillary Rodham |PICKER, MICHAEL                  |SACRAMENTO    |CA |958413111 |STATE OF CALIFORNIA     |COMMISSIONER             |  5400|NOV |    |    |
|Clinton, Hillary Rodham |AUSTIN, ALAN                     |ATHERTON      |CA |940275458 |N/A                     |RETIRED                  |  5400|AUG |    |    |
|Clinton, Hillary Rodham |FOX, ALAN                        |STUDIO CITY   |CA |916042407 |ACF PROPERTY MANAGEMENT |PRESIDENT                |  5400|MAR |    |    |
|Clinton, Hillary Rodham |BEAUBIEN, JAMES                  |SANTA MONICA  |CA |904023024 |LATHAM & WATKINS LLP    |ATTORNEY                 |  5400|APR |    |    |
|Clinton, Hillary Rodham |CARROLL, DANIEL ASHTON           |SAN FRANCISCO |CA |941151125 |TPG CAPITAL             |INVESTMENT MANAGER       |  5400|MAR |    |    |
# DJT campaign: keep the top 100 contributions by amount.
top_fifty_djt = djtcontributions %>%
         filter(rank(desc(V10)) <= 100) 

# Top 10 individual contributors to the DJT campaign.
knitr::kable(head(plyr::arrange(top_fifty_djt, desc(V10)), n = 10))   
|V3               |V4                        |V5          |V6 |V7    |V8                      |V9                    |     V10|V11 |V12 |V13 |
|:----------------|:-------------------------|:-----------|:--|:-----|:-----------------------|:---------------------|-------:|:---|:---|:---|
|Trump, Donald J. |BOCH, ERNIE               |NORWOOD     |MA |02062 |BOCH AUTOMOTIVE GROUP   |EXECUTIVE             |86936.80|AUG |    |    |
|Trump, Donald J. |FERRERO, LOUIS P MR.      |CANTON      |GA |30115 |INFORMATION REQUESTED   |INFORMATION REQUESTED |12500.00|JUL |    |    |
|Trump, Donald J. |CONSERVATIVE ACTION FUND  |ALEXANDRIA  |VA |22314 |                        |                      |10030.24|DEC |    |X   |
|Trump, Donald J. |COBB, ROBERT              |BIRMINGHAM  |AL |35209 |COBB THEATERS           |OWNER                 |10000.00|NOV |    |    |
|Trump, Donald J. |DOBSKI, ROBERT            |BLOOMINGTON |IL |61704 |INFORMATION REQUESTED   |INFORMATION REQUESTED |10000.00|NOV |    |    |
|Trump, Donald J. |GORMAN, L.D. MR.          |HAZARD      |KY |41702 |INFORMATION REQUESTED   |INFORMATION REQUESTED |10000.00|JUL |    |    |
|Trump, Donald J. |ROVT, ALEXANDER MR.       |BROOKLYN    |NY |11234 |INFORMATION REQUESTED   |INFORMATION REQUESTED |10000.00|NOV |    |    |
|Trump, Donald J. |GIGANTE, PETER            |BELLINGHAM  |WA |98225 |SELF-EMPLOYED           |INTERNATIONAL TRADE   |10000.00|AUG |    |    |
|Trump, Donald J. |TANZER, LEONARD J MR.     |SCARSDALE   |NY |10583 |PATIENT CARE ASSOCIATES |PRESIDENT             | 7300.00|NOV |    |    |
|Trump, Donald J. |NORTHCUTT, JOHN D MR. III |FAIRHOPE    |AL |36532 |INFORMATION REQUESTED   |INFORMATION REQUESTED | 5400.00|JUL |    |    |
# PAC expenditure analysis: top 100 independent expenditures for each candidate.
hrcpac = pacexpn[grepl("clin", pacexpn$can_nam, ignore.case = TRUE), ]
hrcunexp = hrcpac %>%
         filter(rank(desc(exp_amo)) <= 100) 

djtpac = pacexpn[grepl("donald", pacexpn$can_nam, ignore.case = TRUE), ]
djtunexp = djtpac %>%
         filter(rank(desc(exp_amo)) <= 100) 

Comparing the top expenditures of both sides' PACs, the pro-DJT PACs spent a large amount on campaign outreach: they were smart enough to invest heavily in online campaigning.
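
To make that comparison reproducible, the sketch below totals expenditures by purpose for each side. It assumes the FEC file's purpose column pur survived the column selection above; adjust the name if a different set of columns was retained:

# Sketch: total PAC spending by purpose for each candidate.
# Assumes `pur` (purpose) was among the retained columns; `can_nam` and
# `exp_amo` are used elsewhere in this report.
pac_summary = pacexpn %>%
    mutate(candidate = ifelse(grepl("clin", can_nam, ignore.case = TRUE), "HRC",
                       ifelse(grepl("donald", can_nam, ignore.case = TRUE), "DJT", NA))) %>%
    filter(!is.na(candidate)) %>%
    group_by(candidate, pur) %>%
    summarise(total = sum(exp_amo, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(candidate, desc(total))
knitr::kable(head(pac_summary, n = 10))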

Plots:

Below are plots of month-wise contributions for both candidates. The plots make clear that for DJT, the months with more contributions also had higher refunds (the negative amounts).

ggplot(hrccontributions, aes(x = V11, y = V10, group = V11, colour = V10)) + 
    geom_line() + ylab("Total Contributions") + 
    xlab("Month")

ggplot(djtcontributions, aes(x = V11, y = V10, group = V11, colour = V10)) + 
    geom_line() + ylab("Total Contributions") + 
    xlab("Month")

Sentiment Analysis:

The aim of my Twitter sentiment analysis was to see how tweets were exchanged during the Comey letter reveal in the 2016 election. But I found that the Twitter search API cannot retrieve tweets that far in the past, so to proceed with the sentiment analysis I looked for a current hashtag. Coming from a Commonwealth country, cricket is the sport I grew up with. Two of the most famous cricketers from our arch-rivals' team had just retired, and Twitter was abuzz with their achievements and the criticism of them. So I searched for tweets with the tag #MisYou and ran the sentiment analysis on those.

Twitter API:

To access the Twitter API, I learned that we need to create an application. Once the application is created, it needs the following credentials (a sketch of the credential file follows this list):

api_key

api_secret

access_token

access_token_secret
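
The credentials live in a separate twitterapi.R that is sourced below and kept out of version control. A minimal sketch of its contents (the values here are placeholders):

# twitterapi.R -- never commit real credentials.
key         <- "YOUR_API_KEY"
secret      <- "YOUR_API_SECRET"
token       <- "YOUR_ACCESS_TOKEN"
tokensecret <- "YOUR_ACCESS_TOKEN_SECRET"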

source("twitterapi.R")

options(httr_oauth_cache=TRUE)
setup_twitter_oauth(key, secret, token, tokensecret)
## [1] "Using direct authentication"
# The search itself (commented out; it was run once and the results cached to CSV)
#data2 <- searchTwitter("#MisYou", n=10000, lang="en")

#df2 <- twListToDF(data2)
# For our initial analysis we downloaded and stored the results as files.
#write.csv(df2,file = "MisYou_2.csv")
#write.csv(df1, file="MisYou.csv")

rm(key, secret, token, tokensecret)

Opinion Lexicon

Next, for the sentiment analysis, we load the lexicons of positive and negative words. The reference for this lexicon is listed below.

positive = scan('opinion-lexicon-English/positive-words.txt',
                  what='character', comment.char=';')
negative = scan('opinion-lexicon-English/negative-words.txt',
                  what='character', comment.char=';')
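
A quick look at what was loaded; the Hu and Liu opinion lexicon ships with roughly 2,000 positive and 4,800 negative terms (approximate figures, and counts vary slightly by version):

length(positive)   # number of positive terms
length(negative)   # number of negative terms
head(positive)     # peek at a few entries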


# Load the cached tweet CSVs and combine them.
texttweet=fread('MisYou.csv')
texttweet=data.frame(texttweet)

texttweet_2=fread('MisYou_2.csv')
texttweet_2=data.frame(texttweet_2)

fulltweettext = rbind(texttweet,texttweet_2)

# Reference for this function given below.
score.sentiment <- function(sentences, good_text, bad_text, .progress='none'){
    scores = plyr::laply(sentences, function(sentence, good_text, bad_text) {
        
        # clean text
        sentence <- gsub('[[:punct:]]', '', sentence)
        sentence <- gsub('[[:cntrl:]]', '', sentence)
        sentence <- gsub('\\d+', '', sentence)
        sentence <- iconv(sentence, 'latin1', 'ASCII', 'byte')
        sentence <- tolower(sentence)        
        words <- unlist(str_split(sentence, '\\s+'))
        
        # compare to lexicon
        pos.matches <- !is.na(match(words, good_text))
        neg.matches <- !is.na(match(words, bad_text))
        
        # calc score as difference between two
        score <- sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, good_text, bad_text, .progress=.progress )
    return(scores)
}
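
Before scoring the full tweet set, a quick spot check on two made-up sentences ("great" and "wonderful" appear in the positive lexicon, "sad" and "bitter" in the negative one):

# Each score is positive matches minus negative matches, so we expect
# a positive score for the first sentence and a negative one for the second.
score.sentiment(c("what a great and wonderful innings",
                  "a sad and bitter farewell"),
                positive, negative)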

fulltweettext$score=score.sentiment(fulltweettext$text,positive,negative)

table(fulltweettext$score)
## 
##    -5    -4    -3    -2    -1     0     1     2     3     4     5     6 
##     1     1    24   139  1132  6943 10146  1212   335    52    12     3

Score Analysis

neutral = length(which(fulltweettext$score == 0))
# Note: these reuse the names `positive`/`negative`, overwriting the lexicon
# vectors loaded earlier (they are no longer needed at this point).
positive = length(which(fulltweettext$score > 0))
negative = length(which(fulltweettext$score < 0))
Sentiment = c("Negative","Neutral","Positive")
Count = c(1289,6943,11760)   # counts hardcoded from the run above

output <- data.frame(Sentiment,Count)

knitr::kable(output)
|Sentiment | Count|
|:---------|-----:|
|Negative  |  1289|
|Neutral   |  6943|
|Positive  | 11760|
qplot(Sentiment, Count, data = output, colour = Count)

## Word Cloud.

# Strip non-graphical characters so downstream text functions do not choke on odd encodings
tweets.txt <- str_replace_all(fulltweettext$text,"[^[:graph:]]", " ") 

# Reference for this function provided below.
clean.text = function(x)
{
  
   # tolower
   x = tolower(x)
   # remove the retweet marker "rt" (word boundaries so "rt" inside words survives)
   x = gsub("\\brt\\b", "", x)
   # remove at
   x = gsub("@\\w+", "", x)
   # remove punctuation
   x = gsub("[[:punct:]]", "", x)
   # remove numbers
   x = gsub("[[:digit:]]", "", x)
   # remove links http
   x = gsub("http\\w+", "", x)
   # remove tabs
   x = gsub("[ |\t]{2,}", "", x)
   # remove blank spaces at the beginning
   x = gsub("^ ", "", x)
   # remove blank spaces at the end
   x = gsub(" $", "", x)
   return(x)
}

cleanText <- clean.text(tweets.txt)
vector <- paste(cleanText,collapse=" ")
remwords <- c("pakistan","cricket","Misbah", "Younis", "Khan", "misbah", "younis","misbahulhaqmisyou","retirementmisyou","newwallpaperalecreditsht","misbahmisyou","togethermisyou")
vector <- removeWords(vector,c(stopwords("english"),remwords))

wordcloud(vector, scale=c(6,0.7), max.words=100, 
           random.order=FALSE, rot.per=0.35,colors=brewer.pal(8,"Dark2"))

Conclusion:

This project was a great learning experience across all the technologies and languages used to complete it. Whether it was R, MongoDB, or Twitter feed analysis, the work was full of challenges and required a lot of research.

Reading a Twitter feed is not as easy as it sounds: Twitter imposes heavy rate limits when tweets are read from a single IP. A cluster of servers behind a load balancer would retrieve tweets more reliably, since the IPs keep switching. I was also unable to search tweets that were around eight months old; there should be a better way to get those tweets.

I had my doubts about MongoDB, but it is a breeze to work with. The R-Mongo integration needs better documentation, but working with Mongo as a whole was relatively easy.

All in all, it was a wonderful learning experience.

Reference

Opinion Lexicon

Sentiment Analyzer Function

The function below is from R-bloggers; it returns sentiment scores when given the text along with the positive and negative lexicons.

score.sentiment <- function(sentences, good_text, bad_text, .progress='none'){
    scores = plyr::laply(sentences, function(sentence, good_text, bad_text) {
        
        # clean text
        sentence <- gsub('[[:punct:]]', '', sentence)
        sentence <- gsub('[[:cntrl:]]', '', sentence)
        sentence <- gsub('\\d+', '', sentence)
        sentence <- iconv(sentence, 'latin1', 'ASCII', 'byte')
        sentence <- tolower(sentence)        
        words <- unlist(str_split(sentence, '\\s+'))
        
        # compare to lexicon
        pos.matches <- !is.na(match(words, good_text))
        neg.matches <- !is.na(match(words, bad_text))
        
        # calc score as difference between two
        score <- sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, good_text, bad_text, .progress=.progress )
    return(scores)
}

Tweets Cleaner Function

clean.text = function(x)
{
  
   # tolower
   x = tolower(x)
   # remove the retweet marker "rt" (word boundaries so "rt" inside words survives)
   x = gsub("\\brt\\b", "", x)
   # remove at
   x = gsub("@\\w+", "", x)
   # remove punctuation
   x = gsub("[[:punct:]]", "", x)
   # remove numbers
   x = gsub("[[:digit:]]", "", x)
   # remove links http
   x = gsub("http\\w+", "", x)
   # remove tabs
   x = gsub("[ |\t]{2,}", "", x)
   # remove blank spaces at the beginning
   x = gsub("^ ", "", x)
   # remove blank spaces at the end
   x = gsub(" $", "", x)
   return(x)
}