SQL Capstone Project

Jonah Winninghoff

11/23/2020

Business Request

Lobbyists4America is a company that seeks to provide insights to their customers (who aim to affect legislation within the US). They want you to analyze the 2008-2017 congressional tweets in order to understand key topics, members, and relationships within Congress. These insights will help them focus and strengthen their lobbying efforts.

Translating into Data Questions

Each key topic is large and complex, so this data assessment focuses on a single one: the climate crisis.

  • What is each legislator’s stance on the climate crisis?
  • Who does each legislator talk to about this topic, and how often?
  • What is each legislator’s sentiment towards the climate crisis? Is their tone positive or negative?

Entity Relationship Diagram

Image

What are the problems with the data?

Lack of classification

Of the several problems identified, the first is that the data does not classify tweets as positive or negative. This is a problem for two reasons. First, without this classification, a more sophisticated statistical approach cannot be used to make predictions, because labeled data is needed to determine confidence intervals. Second, for the sake of data integrity, one should not rely on data from a different domain for statistical training. That is, using product-related data to train a machine learning model that interprets political text could result in measurement error.

Created_at Variable

The created_at variable should be a date datatype, but it does not behave as one. Lobbyists are likely to be interested in legislators whose political stance changes over time, so a usable date would be valuable. Perhaps the variable is encrypted so that only authorized users can access this information. Before the hypotheses are presented, the next slide explains the data approach.
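Before concluding that created_at is encrypted, one possibility worth ruling out is that it stores Unix epoch seconds. This is only an assumption and is not confirmed by the data. A minimal Python sketch of such a check follows (Python is used here purely for illustration, and the function name is hypothetical). It converts a raw value and tests whether the result falls inside the 2008-2017 window covered by the tweets.

```python
from datetime import datetime, timezone

def plausible_epoch_seconds(raw_value, first_year=2008, last_year=2017):
    """Return the UTC datetime if raw_value looks like epoch seconds
    inside the expected window, otherwise None.
    Assumption: the column might hold Unix epoch seconds."""
    try:
        parsed = datetime.fromtimestamp(int(raw_value), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        # Not an integer, or outside the range the platform can convert.
        return None
    if first_year <= parsed.year <= last_year:
        return parsed
    return None

# Illustrative value only, not taken from the dataset.
print(plausible_epoch_seconds(1500000000))
```

If most values in the column pass this check, the field is likely an ordinary timestamp rather than an encrypted one. If they do not, other encodings would need to be considered.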

Data Approach

Which programming languages will I work with?

Two programming languages are used: one for sentiment analysis and one for data pipelining. R is powerful and well suited to complex natural language processing. I also have more than one year of experience working with it. SQL is run through Apache Spark, a distributed data processing engine whose DataFrame interface can speed up data retrieval and reading.

How can I address the absent classification problem?

As mentioned earlier, R is well suited to natural language processing, so an appropriate lexicon can be applied to identify representatives’ and senators’ political stance on climate change. The objective of this assessment is to identify congressional members who show political ambivalence, which is useful information for lobbyists.
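The lexicon idea can be sketched as follows. The project itself plans to use R, so this Python version is only an illustration. The tiny lexicon, its weights, and the function names are all hypothetical, and a real analysis would use a published sentiment lexicon. Each tweet is scored by summing the weights of the words it contains, and a member is flagged as ambivalent when their tweets include both positive and negative scores.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment weight.
LEXICON = {
    "support": 1, "clean": 1, "protect": 1, "progress": 1,
    "crisis": -1, "threat": -1, "hoax": -1, "costly": -1,
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Sum the lexicon weights of all tokens; unknown words count as 0."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))

def is_ambivalent(tweets):
    """True if one member's tweets contain both positive and negative scores."""
    scores = [sentiment_score(tweet) for tweet in tweets]
    return any(s > 0 for s in scores) and any(s < 0 for s in scores)

# Made-up example tweets, not taken from the dataset.
example = ["We support clean energy", "This plan is a costly threat"]
print([sentiment_score(t) for t in example], is_ambivalent(example))
```

A simple sum ignores negation ("not a threat") and sarcasm, so the scores are a rough first pass rather than a classification.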

How to identify who each legislator talks to

R will be used to build a lexicon of congressional members’ names, so that the text mining extracts only those names from the tweets. The resulting dataset is then used to count how often each legislator talks to each other member. The goal is to develop an interactive network of the congressional members who talk to one another most often.
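A minimal sketch of this counting step is shown below, in Python for illustration only (the project plans to use R). It assumes that members address each other through @-mentions and that a list of member handles is available. The handles, the tweets, and the function name are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical handles standing in for the lexicon of member names.
MEMBER_HANDLES = {"repalpha", "senbeta", "repgamma"}

MENTION_PATTERN = re.compile(r"@(\w+)")

def count_member_mentions(tweets):
    """Count (author, mentioned member) pairs.
    tweets: iterable of (author_handle, tweet_text) tuples.
    Mentions of non-members and self-mentions are ignored."""
    counts = Counter()
    for author, text in tweets:
        author = author.lower()
        for handle in MENTION_PATTERN.findall(text):
            handle = handle.lower()
            if handle in MEMBER_HANDLES and handle != author:
                counts[(author, handle)] += 1
    return counts

# Made-up tweets, not taken from the dataset.
tweets = [
    ("RepAlpha", "Thanks @SenBeta and @someone_else for the climate hearing"),
    ("RepAlpha", "I agree with @SenBeta on climate"),
    ("SenBeta", "@RepAlpha good point"),
]
for (author, target), n in count_member_mentions(tweets).most_common():
    print(author, "->", target, n)
```

The resulting pair counts can serve as weighted edges when drawing the network of members.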

Hypothesis

First Hypothesis

The assumption is that legislators who use the term “climate action” without a contradiction transition are trying to show their stance clearly. That is, if a congressional member tweets or retweets comments that contain the term “climate action” and no contradiction transition at least five times, then their stance is firm.
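The rule in this hypothesis can be sketched as a simple filter and count. The Python version below is illustrative only (the project plans to use R) and rests on two assumptions: that a "contradiction transition" means a contrastive word such as "but" or "however", and that the threshold is at least five qualifying tweets. The word list and function names are hypothetical.

```python
import re

# Assumed set of contradiction transitions; adjust as needed.
CONTRAST_WORDS = {"but", "however", "although", "yet", "nevertheless"}
FIRM_THRESHOLD = 5  # assumed reading: five or more qualifying tweets

def is_unqualified_mention(text):
    """True if the tweet contains 'climate action' and no contrast word."""
    lowered = text.lower()
    if "climate action" not in lowered:
        return False
    words = set(re.findall(r"[a-z]+", lowered))
    return not (words & CONTRAST_WORDS)

def has_firm_stance(tweets):
    """True if enough of a member's tweets are unqualified mentions."""
    qualifying = sum(is_unqualified_mention(tweet) for tweet in tweets)
    return qualifying >= FIRM_THRESHOLD

# Made-up tweets, not taken from the dataset.
firm = ["We need climate action now"] * 5
hedged = ["We need climate action now"] * 4 + ["Climate action matters, but costs are high"]
print(has_firm_stance(firm), has_firm_stance(hedged))
```

Retweets would be passed in the same way as original tweets, since the hypothesis treats both as signals of stance.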

Second Hypothesis

If congressional members talk to one particular member, then other members of the same party might talk to that same member. In other words, the network may resemble a bicycle wheel, with all communications linking to each party’s leader.
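One rough way to check for a wheel-like pattern is to measure how much of a party's communication points at its single most-mentioned member. The Python sketch below is illustrative only (the project plans to use R). The edge list and the function name are hypothetical, and the share it reports is just one possible measure of centralization.

```python
from collections import Counter

def hub_share(edges):
    """Return (hub, share): the most-targeted member and the fraction
    of all edges that point to that member.
    edges: list of (source, target) pairs within one party."""
    if not edges:
        return None, 0.0
    target_counts = Counter(target for _, target in edges)
    hub, hits = target_counts.most_common(1)[0]
    return hub, hits / len(edges)

# Made-up edges, not taken from the dataset.
edges = [("a", "leader"), ("b", "leader"), ("c", "leader"), ("a", "b")]
print(hub_share(edges))  # ('leader', 0.75)
```

A share close to 1 would be consistent with the hypothesis, while a low share would suggest communication is spread across many members.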

Third Hypothesis

If sentiment analysis is applied, then the Republican Party is more likely to show a uniform tone, while the Democratic Party might show a mixed tone.
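This hypothesis could be examined by comparing how widely sentiment scores vary within each party. The Python sketch below is illustrative only (the project plans to use R). It assumes each tweet already has a numeric sentiment score and uses the population standard deviation as a hypothetical measure of how mixed the tone is. The party labels and scores are made up and are not results.

```python
from collections import defaultdict
from statistics import pstdev

def tone_spread_by_party(scored_tweets):
    """Return {party: standard deviation of sentiment scores}.
    scored_tweets: iterable of (party, sentiment_score) pairs.
    A larger value means a more mixed tone within that party."""
    scores_by_party = defaultdict(list)
    for party, score in scored_tweets:
        scores_by_party[party].append(score)
    return {party: pstdev(scores) for party, scores in scores_by_party.items()}

# Made-up scores for two placeholder parties.
scored = [("A", -1), ("A", -1), ("A", -1), ("B", 2), ("B", -2)]
print(tone_spread_by_party(scored))  # {'A': 0.0, 'B': 2.0}
```

Under the hypothesis, the spread for the Republican Party would be smaller than the spread for the Democratic Party.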