R Markdown

This is an R Markdown document containing Sean Amato’s work for the week 3 bridge project.

Background: This data set contains information on politicians’ stances regarding PIPA/SOPA. These were copyright laws proposed back in 2011 that caused protests due to concerns about limits placed on free speech. Generally speaking, media companies supported the laws and internet companies opposed them.

Problem Statement: I would like to know which attribute or combination of attributes (political party, pro PIPA/SOPA donations, anti PIPA/SOPA donations, years in Congress) is the best predictor of a politician’s stance toward these anti-piracy laws back in 2011.

library(ggplot2)
library(dplyr)

Questions 1 & 5: Generate summary statistics and read data from a GitHub link.
First, let’s load the data and then create a summary.

doubloons <- 
  read.csv('https://raw.githubusercontent.com/samato0624/R/main/piracy.csv')
str(doubloons)
## 'data.frame':    534 obs. of  9 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name     : chr  "Ackerman, Gary" "Adams, Sandra" "Aderholt, Robert" "Akin, Todd" ...
##  $ party    : chr  " D" " R" " R" " R" ...
##  $ state    : chr  "NY" "FL" "AL" "MO" ...
##  $ money_pro: int  13350 3500 4779 2500 3500 24250 6750 NA 9000 4500 ...
##  $ money_con: int  14800 5650 23944 8200 2700 10650 1700 NA 6400 36474 ...
##  $ years    : int  30 2 16 12 10 6 2 2 24 4 ...
##  $ stance   : chr  "unknown" "unknown" "unknown" "no" ...
##  $ chamber  : chr  "house" "house" "house" "house" ...
summary(doubloons)
##        X             name              party              state          
##  Min.   :  1.0   Length:534         Length:534         Length:534        
##  1st Qu.:134.2   Class :character   Class :character   Class :character  
##  Median :267.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :267.5                                                           
##  3rd Qu.:400.8                                                           
##  Max.   :534.0                                                           
##                                                                          
##    money_pro        money_con          years          stance         
##  Min.   : -5000   Min.   : -1000   Min.   : 1.00   Length:534        
##  1st Qu.:  4500   1st Qu.:  4500   1st Qu.: 4.00   Class :character  
##  Median : 11700   Median : 10200   Median :10.00   Mode  :character  
##  Mean   : 26326   Mean   : 23193   Mean   :11.76                     
##  3rd Qu.: 27463   3rd Qu.: 23947   3rd Qu.:18.00                     
##  Max.   :571600   Max.   :550000   Max.   :58.00                     
##  NA's   :14       NA's   :35                                         
##    chamber         
##  Length:534        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
median(doubloons$years)
## [1] 10
mean(doubloons$years)
## [1] 11.7603

Let’s make a couple tables to get some counts.

table(doubloons$party)
## 
##   D   I   R 
## 243   2 289
table(doubloons$stance)
## 
## leaning no         no  undecided    unknown        yes 
##         44        122         11        294         63
table(doubloons$chamber)
## 
##  house senate 
##    434    100

First-pass observations: The only noteworthy things are the differences between the money_pro and money_con columns. Companies that were pro PIPA/SOPA gave on average $26,326 to 520 politicians, and anti PIPA/SOPA companies gave on average $23,193 to 499 politicians (534 rows minus the 14 and 35 NA values, respectively); at least that’s what was reported. All I can conclude from this summary is that companies that supported PIPA/SOPA had a larger reported sum of donations. A smaller detail is that the median number of years in Congress is 10, meaning roughly half the members took office during or after the dot-com crash. Observing data around that threshold might be a good start, considering that internet laws passed in the 1990s fueled internet innovation. However, the data needs to be cleaned and transformed to obtain better summary statistics.

Question 2: Perform some basic transformations.
See comments in my code for what I did to clean and transform the data.

#Adding the line below so I can easily rename my columns when I make a new data frame.
column_names <- c("Party", "State", "Money_Pro", "Money_Con", "Years", "Stance", "Chamber")

#Removing the index and name column as they are only unique identifiers and won't 
#help generate patterns.
filtered_out_columns <- 
  data.frame(doubloons$party, doubloons$state, 
             doubloons$money_pro, doubloons$money_con, 
             doubloons$years, doubloons$stance, doubloons$chamber)

#Changing the column names.
colnames(filtered_out_columns) <- column_names

#Removing rows where "Money_Pro" or "Money_Con" have NA or negative values.
#Removing rows where "Stance" is "unknown" or "undecided" because we don't 
#like people who are on the fence.
#Using is.na() for the NA check; comparing against the string "NA" only drops 
#those rows by accident.
filtered_out_rows <- filtered_out_columns %>%  
  filter(!is.na(Money_Pro) & 
           !is.na(Money_Con) & 
           Money_Pro >= 0 & 
           Money_Con >= 0 & 
           Stance != "unknown" & 
           Stance != "undecided")

#Replacing "leaning no" with "no" in the "Stance" column because it's practically a "no".
mutated_table <- filtered_out_rows %>%
  mutate(Stance = ifelse(Stance == 'leaning no', 'no', Stance))

#Removing the Independent. Party values carry a leading space (" D", " I", " R").
mutated_table <- mutated_table %>% filter(Party != " I")

#Creating two new columns.
#The column calculated below is the difference between Money_Pro & Money_Con. 
#This will help me make charts to easily see who paid the politician more.
mutated_table$Money_DifProCon <- mutated_table$Money_Pro - mutated_table$Money_Con

#The column calculated below is the sum of Money_Pro & Money_Con. 
#This is just the total donations the politician received.
mutated_table$Money_SumProCon <- mutated_table$Money_Pro + mutated_table$Money_Con

#Now we can better summarize our data based on "yes" and "no" stances only.
summary(mutated_table)
##     Party              State             Money_Pro        Money_Con     
##  Length:215         Length:215         Min.   :   250   Min.   :   500  
##  Class :character   Class :character   1st Qu.:  5225   1st Qu.:  5250  
##  Mode  :character   Mode  :character   Median : 13900   Median : 11732  
##                                        Mean   : 38036   Mean   : 27955  
##                                        3rd Qu.: 36278   3rd Qu.: 26675  
##                                        Max.   :571600   Max.   :348691  
##      Years         Stance            Chamber          Money_DifProCon  
##  Min.   : 1.0   Length:215         Length:215         Min.   :-141185  
##  1st Qu.: 3.0   Class :character   Class :character   1st Qu.:  -4275  
##  Median : 8.0   Mode  :character   Mode  :character   Median :   1400  
##  Mean   :10.4                                         Mean   :  10082  
##  3rd Qu.:15.5                                         3rd Qu.:  15750  
##  Max.   :48.0                                         Max.   : 241000  
##  Money_SumProCon 
##  Min.   :  1250  
##  1st Qu.: 12775  
##  Median : 31800  
##  Mean   : 65991  
##  3rd Qu.: 67998  
##  Max.   :920291
table(mutated_table$Stance)
## 
##  no yes 
## 159  56
sum(mutated_table$Money_Pro)
## [1] 8177828
sum(mutated_table$Money_Con)
## [1] 6010229

Some noteworthy things here: politicians, regardless of stance, still received more on average from pro PIPA/SOPA companies, yet 159 politicians opposed the proposed law compared to 56 who supported it. This makes me suspect that maybe this was just a bad law, if ~75% of the people who took a stance opposed it while the “Media Company” donors gave ~36% more money than the “Internet Company” donors. Also note that the median years in office only decreased from 10 to 8.
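The two percentages in the paragraph above can be recomputed from the counts and sums already printed. This is a minimal sketch that hard-codes those printed values instead of re-reading the data; the variable names are only for illustration.

```r
# Values copied from the table() and sum() output above.
n_no      <- 159
n_yes     <- 56
total_pro <- 8177828   # sum(mutated_table$Money_Pro)
total_con <- 6010229   # sum(mutated_table$Money_Con)

# Fraction of stance-takers who opposed the law.
share_no <- n_no / (n_no + n_yes)

# How much more the pro side gave, relative to the con side.
extra_pro <- total_pro / total_con - 1

round(100 * share_no)   # about 74
round(100 * extra_pro)  # about 36
```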

Question 3: Display at least one scatter plot, box plot, and histogram.

#First let's make a box plot of total donations (Money_SumProCon) to see what we get.
ggplot(mutated_table, aes(y = Money_SumProCon, x=1)) + geom_boxplot()

The boxplot above has a number of outliers. I really only want to try and predict the stance of an average politician. So I will be filtering out outliers by total donations. Then I will aggregate my next box plots by stance.

removed_outliers <- mutated_table %>% arrange(desc(Money_SumProCon))
#Dropping the 40 largest totals to remove outliers; the plots represent 175 politicians.
removed_outliers <- removed_outliers %>% slice(41:n())
ggplot(removed_outliers, aes(y = Money_Con, x=1)) + geom_boxplot() + facet_wrap(~Stance)

First we removed the 40 politicians with the largest total donations, leaving us with 175 data points to work with. The box plot above shows that “Internet Companies” donated roughly equal amounts to politicians of either stance.
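How many rows a `41:n()` slice keeps after a descending sort is easy to check on a stand-in vector. The sketch below uses base R rather than dplyr, and `totals` is a made-up vector of 215 values, not the real `Money_SumProCon` column.

```r
# Hypothetical stand-in for Money_SumProCon: 215 distinct totals.
totals <- seq(1000, by = 500, length.out = 215)

sorted <- sort(totals, decreasing = TRUE)

# Positions 41..n are kept, so positions 1..40 (the 40 largest) are dropped.
kept    <- sorted[41:length(sorted)]
dropped <- sorted[1:40]

length(kept)     # 175
length(dropped)  # 40
```

The two pieces are complementary: every value in `dropped` is larger than every value in `kept`.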

ggplot(removed_outliers, aes(y = Money_Pro, x=1)) + geom_boxplot() + facet_wrap(~Stance)

The box plot above shows that politicians whose stance was “Yes” received on average more money from Media Companies than politicians whose stance was “No”.

Next we will look at the distribution of the difference between donations. So if a bar is left of the zero, internet companies donated more and right of the zero indicates media companies donated more.

ggplot(removed_outliers, aes(x = Money_DifProCon)) + geom_histogram(bins = 30) + facet_wrap(~Stance)

We produced something of a normal distribution for either stance, with negligible skews favoring the donor they support. What this says to me is that the source of the money didn’t strongly affect the decision of the average politician who was willing to take a stance on this issue.

Okay now let’s check if there is any correlation between number of years in congress and total donations received.

ggplot(removed_outliers, aes(x = Years, y = Money_SumProCon)) + geom_point(aes(color = Stance))

Okay, so this is sort of interesting: the people who received some of the highest donations were not 30+ year members of the House or Senate, but people with 1 to 10 years in office. I interpret this to mean that companies thought they could buy out politicians with less experience.

Let’s explore a few other charts. First, a bar chart with the counts of Republicans and Democrats faceted by stance. Second, the sum of donations given to either party. Third, a scatter plot of pro vs. con donations colored by party and faceted by stance.

ggplot(removed_outliers, 
       aes(x = Party)) + geom_bar(
         stat = "count", fill = "steelblue") + facet_wrap(~Stance)

ggplot(removed_outliers, 
       aes(x = Party, y = Money_SumProCon)) + geom_bar(
         stat = "identity", fill = "steelblue")

ggplot(removed_outliers, 
       aes(x = Money_Pro, y = Money_Con)) + geom_point(
         aes(color = Party)) + facet_wrap(~Stance)

summary(removed_outliers)
##     Party              State             Money_Pro       Money_Con    
##  Length:175         Length:175         Min.   :  250   Min.   :  500  
##  Class :character   Class :character   1st Qu.: 4062   1st Qu.: 4500  
##  Mode  :character   Mode  :character   Median :10250   Median : 8500  
##                                        Mean   :14972   Mean   :13089  
##                                        3rd Qu.:21722   3rd Qu.:20400  
##                                        Max.   :62700   Max.   :63800  
##      Years           Stance            Chamber          Money_DifProCon 
##  Min.   : 1.000   Length:175         Length:175         Min.   :-47700  
##  1st Qu.: 3.500   Class :character   Class :character   1st Qu.: -4125  
##  Median : 8.000   Mode  :character   Mode  :character   Median :   750  
##  Mean   : 9.817                                         Mean   :  1883  
##  3rd Qu.:14.000                                         3rd Qu.:  9450  
##  Max.   :38.000                                         Max.   : 56450  
##  Money_SumProCon
##  Min.   : 1250  
##  1st Qu.:11562  
##  Median :22232  
##  Mean   :28062  
##  3rd Qu.:40372  
##  Max.   :81750

One last thing is to check our outliers.

outliers <- mutated_table %>% 
  arrange(desc(Money_SumProCon)) %>%  # Sort in descending order of total donations
  slice(1:40)                         # The 40 rows that slice(41:n()) dropped earlier
ggplot(outliers,
       aes(x = Money_Pro,
           y = Money_Con)) + geom_point(aes(color = Chamber)) + facet_wrap(~Stance)

In the outliers chart we can see some politicians received a disproportionate amount of money from Media companies.

Question 4:
Conclusions: The data is not clear cut, but here are my thoughts on the best predictor of a random politician’s stance on PIPA/SOPA. The one liberty I will take is assuming that the politician has a stance.
1. My default guess is that a politician is against PIPA/SOPA, because ~75% of the time that’s true based on summary statistics alone.
2. If I know a politician is a Republican there is a ~88% chance they are against PIPA/SOPA; for a Democrat it would be ~66%.
3. If I know the politician received less than 80k in donations my odds stay the same, because it is a weak correlation. However, a politician who was paid 40k+ by the pro side is more likely to support the pro side.
4. There is a weak negative correlation between the number of years in Congress and the donations received, so I would say the chance of supporting the law goes down the longer someone has been in Congress.
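Per-party percentages like the ones in item 2 can be read off a row-normalised contingency table. The sketch below uses only the Party and Stance values of the 20 sampled rows printed at the end of this document (party letters typed without the leading space), so its shares describe that small sample and will not match the full-data figures quoted above.

```r
# Party and Stance for the 20 sampled rows shown in the final output.
party  <- c("R", "D", "R", "D", "R", "R", "R", "D", "D", "D",
            "R", "R", "R", "R", "R", "D", "R", "R", "R", "R")
stance <- c("no", "no", "no", "no", "no", "no", "no", "yes", "no", "yes",
            "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no")

counts <- table(Party = party, Stance = stance)

# margin = 1 divides each row by its row total, giving the share of each
# stance within a party.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Running the same two calls on `mutated_table$Party` and `mutated_table$Stance` would presumably give the full-data version of these shares.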

All in all, it’s much easier to predict when someone is going to be against PIPA/SOPA, but to predict someone’s stance I would use the following logic:
1. If you are in the Republican party, then you are opposed to PIPA/SOPA.
2. If you are a Democrat and received less than 40k from media companies, then you are opposed to PIPA/SOPA.
3. Otherwise you support PIPA/SOPA.

The results of this code change every time it is run, so I won’t know the results until after I knit, but I seem to be right about 70-80% of the time.

random_filter <- data.frame(mutated_table$Party, mutated_table$Money_Pro, mutated_table$Stance)
random_rows <- sample_n(random_filter, 20)
colnames(random_rows) <- c("Party", "Money_Pro", "Stance")
# Party values carry a leading space (" R", " D"), so trim before comparing;
# otherwise the Republican rule never fires.
random_rows$Prediction <- ifelse(trimws(random_rows$Party) == "R", "No",
                                 ifelse(random_rows$Money_Pro >= 40000, "Yes", "No"))
print(random_rows)
##    Party Money_Pro Stance Prediction
## 1      R     35050     no         No
## 2      D      4650     no         No
## 3      R      6500     no         No
## 4      D     19550     no         No
## 5      R     24670     no         No
## 6      R     12000     no         No
## 7      R     74850     no         No
## 8      D    367733    yes        Yes
## 9      D     17850     no         No
## 10     D     11500    yes         No
## 11     R      1500     no         No
## 12     R      5500     no         No
## 13     R      1000     no         No
## 14     R      3380     no         No
## 15     R     36255     no         No
## 16     D      7000    yes         No
## 17     R     56500    yes         No
## 18     R      6750     no         No
## 19     R      3500     no         No
## 20     R     35690     no         No
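
To put a number on how often the rule is right, the predictions can be compared with the recorded stances. This is a rough sketch on the 20 sampled rows printed above, not on the full table. `predict_stance` is a hypothetical helper name, and party values are trimmed because the `str()` output shows them stored with a leading space (" R", " D").

```r
# The 20 sampled rows from the printed output, with Party stored as in the raw data.
sampled <- data.frame(
  Party = c(" R", " D", " R", " D", " R", " R", " R", " D", " D", " D",
            " R", " R", " R", " R", " R", " D", " R", " R", " R", " R"),
  Money_Pro = c(35050, 4650, 6500, 19550, 24670, 12000, 74850, 367733, 17850, 11500,
                1500, 5500, 1000, 3380, 36255, 7000, 56500, 6750, 3500, 35690),
  Stance = c("no", "no", "no", "no", "no", "no", "no", "yes", "no", "yes",
             "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Republicans -> "no"; Democrats -> "yes" only at 40k+ pro money.
predict_stance <- function(party, money_pro) {
  ifelse(trimws(party) == "R", "no",
         ifelse(money_pro >= 40000, "yes", "no"))
}

sampled$Prediction <- predict_stance(sampled$Party, sampled$Money_Pro)

# Share of rows where the rule matches the recorded stance.
rule_accuracy <- mean(sampled$Prediction == sampled$Stance)

# Baseline for comparison: always guessing "no".
baseline_accuracy <- mean(sampled$Stance == "no")

rule_accuracy      # 0.85
baseline_accuracy  # 0.8
```

On this one sample the rule only slightly beats always answering "no", so the same comparison on all of `mutated_table` would be a fairer test than any single draw of 20 rows.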