Lobbyists4America is interested in data-driven approaches that can make its lobbying strategy more effective by showing which politicians to align with and which to avoid. I would like you, as a team, to evaluate my text-analytic approach. I use the two datasets chosen by this institution, Twitter and User, and focus my data questions on climate change policies.
Before any further analysis, the data’s challenges need to be addressed. The datasets lack a date datatype and a classification. The column called created_at does not behave as a date, perhaps due to decryption. The classification, a binary variable often used to describe a qualitative observation, does not exist. Neither dataset has any click attribution either.
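One way to recover a date datatype from created_at is sketched below. It assumes the strings follow Twitter's classic layout (for example "Wed Oct 10 20:19:24 +0000 2018"), which the datasets may or may not use; the function name parse_created_at is hypothetical, and values that do not match are returned as None rather than guessed.

```python
from datetime import datetime

# Assumption: created_at uses Twitter's classic layout,
# e.g. "Wed Oct 10 20:19:24 +0000 2018". Adjust if the data differs.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(raw):
    """Return a timezone-aware datetime, or None if the value does not match."""
    try:
        return datetime.strptime(raw.strip(), TWITTER_FORMAT)
    except (ValueError, AttributeError):
        # ValueError: wrong layout; AttributeError: value is not a string.
        return None


parsed = parse_created_at("Wed Oct 10 20:19:24 +0000 2018")
unparsed = parse_created_at("not a date")
```

Counting how many rows come back as None gives a quick check of whether the assumed layout actually fits the column.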
Not every politician in the datasets tweets about climate change-related topics: 324 of the 545 do. The number of their responses to any particular individual or organization is small. Given what the dimensions of both datasets offer, the most appropriate way to frame the data questions is through text analytics, since the text is rich in information.
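A count like the 324 out of 545 can be obtained by keeping only the politicians with at least one matching tweet. The sketch below is illustrative: the keyword list reuses phrases quoted in this write-up, but whether the actual filter used exactly these terms is an assumption, and the toy tweets and screen names are made up.

```python
import re

# Hypothetical keyword list; the exact filter used on the real data may differ.
CLIMATE_TERMS = [
    "climate change", "global warming", "changing climate",
    "climate crisis", "climate action",
]
CLIMATE_PATTERN = re.compile(
    "|".join(re.escape(term) for term in CLIMATE_TERMS), re.IGNORECASE
)


def climate_authors(tweets):
    """Return the set of authors with at least one climate-related tweet.

    `tweets` is an iterable of (screen_name, text) pairs.
    """
    return {name for name, text in tweets if CLIMATE_PATTERN.search(text)}


# Toy data for illustration only.
toy_tweets = [
    ("rep_a", "We must combat Climate Change now."),
    ("rep_b", "Happy birthday to my staff!"),
    ("rep_a", "Global warming is real."),
    ("rep_c", "Proud to support climate action."),
]
authors = climate_authors(toy_tweets)
share = len(authors) / len({name for name, _ in toy_tweets})
```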
There are three steps for sentiment analysis:
There are three steps for the climate denial detection system:
The calculation in Spark gives a t-score of 1.63 with 203 degrees of freedom, which is only 0.02 away from the critical value. Intuitively, politicians who use terms like “climate crisis”, “combat climate change”, or “climate action” express less negative sentiment than those who use terms like “climate change”, “global warming”, or “changing climate.”
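For readers who want to see how such a t-score arises, here is a minimal sketch of a pooled two-sample t statistic in plain Python. It is an assumption that the comparison was a pooled (equal-variance) test; a whole-number degree of freedom such as 203 is consistent with n1 + n2 - 2, but the Spark calculation may have been set up differently. The sample values are toy negative-sentiment scores, not the real data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, df


# Toy negative-sentiment scores for two groups of politicians.
neutral_terms_group = [0.2, 0.4, 0.6]
action_terms_group = [0.1, 0.2, 0.3]
t_score, degrees_of_freedom = pooled_t(neutral_terms_group, action_terms_group)
```

The resulting t-score is then compared with the critical value for the chosen significance level and the computed degrees of freedom, taken from a t-table or a statistics library.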
Four different classification algorithms are evaluated on the validation set. One of them should be suitable for out-of-sample data, that is, it should not be overfitted. Because my interest is to keep the number of false positives low, the Support Vector Machine is the most suitable.
The Support Vector Machine is then evaluated on the test set, which serves as out-of-sample data. This cross-validation is performed only once. Although the accuracy is better than on the in-sample data, the specificity drops by a few percentage points. Still, since sensitivity is essential, this result is satisfactory.
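The accuracy, sensitivity, and specificity discussed above all come from the same four confusion counts. The sketch below shows the arithmetic in plain Python; the function names and the toy labels are illustrative, and label 1 is assumed to mark the positive class (a tweet flagged as climate denial).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (recall), and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        # Share of actual positives that were caught.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual negatives left alone; falls as false positives rise.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = summarize(y_true, y_pred)
```

Running the same function on the validation and test predictions makes the in-sample versus out-of-sample comparison direct.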