4/28/2021
We are finding ways to use the standard Twitter API to predict shifts in approval ratings for US Senators over time.
The standard Twitter API allows users to collect tweets from individual timelines and provides data on each tweet, such as whether or not it was a retweet, how many people liked or “favorited” it, how many times it was retweeted, etc.
Our goal is to use this data to predict shifts in approval ratings over 30 day periods by pairing Twitter data with polling data.
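For the sketches later in this write-up we assume the rtweet R package as our way into the standard API; none of the code is our exact pipeline, just an illustration of each step. A quick peek at that per-tweet metadata looks roughly like this:

    library(rtweet)

    # Peek at the per-tweet metadata the standard API returns for one timeline.
    # Assumes you have already authenticated rtweet with a Twitter token;
    # the handle is just an example and the column names are those exposed by rtweet.
    sample_tweets <- get_timeline("SenSherrodBrown", n = 10)
    sample_tweets[, c("created_at", "text", "is_retweet", "favorite_count", "retweet_count")]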
Initially, we hoped to build a sentiment profile for each tweet based on the sentiment of the responses it received. However, the standard API does not give access to those responses without a premium account, so we shifted our focus to the sentiment and metadata of the senators’ own tweets.
We found a data set of approval ratings broken down into 30 day periods, covering almost a year up to September 2020, that we could use to gauge shifts in approval.
We then selected a set of US senators and pulled a few thousand tweets from each of their timelines.
Another problem we faced is that there is no up-to-date data set of Twitter handles for US senators. There is a Harvard data set that is commonly cited for this purpose, but upon inspection it contains numerous errors (wrong handles, press-office handles, etc.). In response, we built a new CSV file containing names and Twitter handles for the 116th US Congress (minus a couple of members, such as Kamala Harris, who no longer have Senate Twitter accounts).
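The CSV itself is just two columns; the header and rows below are only meant to show the layout assumed by the sketches that follow, not the full file.

    name,handle
    Sherrod Brown,SenSherrodBrown
    Susan Collins,SenatorCollins
    ...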
We selected a subset of senators (all of the senators holding office during the period we had approval ratings for) and modified this data frame to include each senator’s Twitter handle from the CSV we made, so that we could pull tweets. We then used this to generate a larger data frame containing a few thousand tweets for each senator.
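Continuing the rtweet sketch, this step looked roughly like the following (the file name and the 3000-tweet cap are illustrative):

    library(dplyr)
    library(purrr)

    # Read our hand-made handle CSV (columns: name, handle).
    handles <- read.csv("senator_handles_116.csv", stringsAsFactors = FALSE)

    # Pull up to a few thousand of the most recent tweets from each senator's
    # timeline and stack them into one large data frame.
    tweets <- map_dfr(handles$handle, function(h) {
      get_timeline(h, n = 3000) %>%
        mutate(handle = h)
    })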
We tokenized the tweets, removed stopwords and other noise, and used the Bing sentiment lexicon to flag positive and negative words.
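A minimal sketch of that step with the tidytext package, continuing from the tweets data frame above (status_id is the tweet ID column rtweet exposes):

    library(tidytext)
    library(tidyr)

    # One row per word, with stopwords removed.
    tweet_words <- tweets %>%
      select(handle, status_id, created_at, text) %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word")

    # Score each tweet as (number of positive words) - (number of negative words).
    sentiment_scores <- tweet_words %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(handle, status_id, created_at, sentiment) %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
      mutate(score = positive - negative)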
Then we used the tweet metadata to group the tweets into the same 30 day windows used in the approval data, so we could relate the two data sets.
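One way to do that grouping, continuing the sketch (in practice the cut points have to be aligned with the polling windows, which is glossed over here):

    # Attach the metadata we care about and bucket tweets into 30 day windows.
    tweet_windows <- sentiment_scores %>%
      left_join(tweets %>% select(status_id, retweet_count, favorite_count),
                by = "status_id") %>%
      mutate(window = as.Date(cut(as.Date(created_at), breaks = "30 days"))) %>%
      group_by(handle, window) %>%
      summarise(
        mean_score      = mean(score),
        total_retweets  = sum(retweet_count),
        total_favorites = sum(favorite_count),
        n_tweets        = n(),
        .groups = "drop"
      )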
We decided to use a linear regression model to relate the tweets’ metadata and sentiment scores to the approval ratings of the various senators. We then ran predictions of the shift in approval over each of the 30 day periods covered by our data set.
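A sketch of that model, assuming the approval data has been reshaped into a data frame called approval keyed by handle and window with an approval_shift column (the train/test split date is purely illustrative):

    # Join tweet features to approval shifts, fit a linear model, and measure error.
    model_data <- tweet_windows %>%
      inner_join(approval, by = c("handle", "window"))

    train <- filter(model_data, window <  as.Date("2020-06-01"))
    test  <- filter(model_data, window >= as.Date("2020-06-01"))

    fit <- lm(approval_shift ~ mean_score + total_retweets + total_favorites + n_tweets,
              data = train)

    preds <- predict(fit, newdata = test)
    mae   <- mean(abs(preds - test$approval_shift))  # mean absolute error of the predictions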
For approval ratings on a scale of 1 to 100, our model predicted these 30 day shifts with a mean absolute error (MAE) of 3.88.
While the MAE is good, we worry that our model may be biased and could be improved by increasing the amount of data.
Our model can use Twitter metadata paired with sentiment analysis to reasonably predict shifts in approval rating.
Our new (hand-made) data set of Twitter handles for the 116th Congress solves problems faced by previous data sets (like the Harvard data set).
The data we generated by combining approval data with Twitter data could be expanded for future use.
We learned a lot about the Twitter API and think our project could be expanded to analyze replies in an enterprise context.