This is an R Markdown document containing Sean Amato’s work for the week 3 bridge project.
Background: This data set contains information on politicians’ stances regarding PIPA/SOPA. These were copyright bills introduced in 2011 that caused protests due to concerns about limits on free speech. Generally speaking, media companies supported the bills and internet companies opposed them.
Problem Statement: I would like to know which attribute or combination of attributes (political party, pro-PIPA/SOPA donations, anti-PIPA/SOPA donations, years in Congress) is the best predictor of a politician’s stance toward these anti-piracy bills.
library(ggplot2)
library(dplyr)
Questions 1 & 5: Generate summary statistics and read data from a GitHub link.
First, let’s load the data and then create a summary.
doubloons <-
read.csv('https://raw.githubusercontent.com/samato0624/R/main/piracy.csv')
str(doubloons)
## 'data.frame': 534 obs. of 9 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Ackerman, Gary" "Adams, Sandra" "Aderholt, Robert" "Akin, Todd" ...
## $ party : chr " D" " R" " R" " R" ...
## $ state : chr "NY" "FL" "AL" "MO" ...
## $ money_pro: int 13350 3500 4779 2500 3500 24250 6750 NA 9000 4500 ...
## $ money_con: int 14800 5650 23944 8200 2700 10650 1700 NA 6400 36474 ...
## $ years : int 30 2 16 12 10 6 2 2 24 4 ...
## $ stance : chr "unknown" "unknown" "unknown" "no" ...
## $ chamber : chr "house" "house" "house" "house" ...
summary(doubloons)
## X name party state
## Min. : 1.0 Length:534 Length:534 Length:534
## 1st Qu.:134.2 Class :character Class :character Class :character
## Median :267.5 Mode :character Mode :character Mode :character
## Mean :267.5
## 3rd Qu.:400.8
## Max. :534.0
##
## money_pro money_con years stance
## Min. : -5000 Min. : -1000 Min. : 1.00 Length:534
## 1st Qu.: 4500 1st Qu.: 4500 1st Qu.: 4.00 Class :character
## Median : 11700 Median : 10200 Median :10.00 Mode :character
## Mean : 26326 Mean : 23193 Mean :11.76
## 3rd Qu.: 27463 3rd Qu.: 23947 3rd Qu.:18.00
## Max. :571600 Max. :550000 Max. :58.00
## NA's :14 NA's :35
## chamber
## Length:534
## Class :character
## Mode :character
##
##
##
##
median(doubloons$years)
## [1] 10
mean(doubloons$years)
## [1] 11.7603
Let’s make a couple tables to get some counts.
table(doubloons$party)
##
## D I R
## 243 2 289
table(doubloons$stance)
##
## leaning no no undecided unknown yes
## 44 122 11 294 63
table(doubloons$chamber)
##
## house senate
## 434 100
First-pass observations: The only noteworthy thing is the difference between the money_pro and money_con columns. Companies that were pro PIPA/SOPA gave on average $26,326 to 520 politicians, and anti-PIPA/SOPA companies gave on average $23,193 to 499 politicians (534 rows minus the 14 and 35 NA’s respectively), at least according to what was reported. All I can conclude from this summary is that companies supporting PIPA/SOPA had a larger reported sum of donations. A smaller detail is that the median number of years in Congress is 10, meaning roughly half the members took office during or after the dot-com crash. Observing the data around that threshold might be a good start, considering that internet laws passed in the 1990s fueled internet innovation. However, the data needs to be cleaned and transformed to obtain better summary statistics.
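The recipient counts above come from subtracting the NA's in each money column from the 534 rows. A minimal sketch of that check, using a small made-up data frame (the `toy` values are hypothetical and not taken from piracy.csv):

```r
# Hypothetical toy data standing in for doubloons; the values are made up.
toy <- data.frame(
  money_pro = c(13350, 3500, NA, 9000),
  money_con = c(14800, NA, NA, 6400)
)

# Number of rows with a reported (non-NA) amount in each column.
reported <- colSums(!is.na(toy))

# Mean donation among the reported amounts only.
avg <- colMeans(toy, na.rm = TRUE)

print(reported)
print(avg)
```

Applied to the real data, the same two calls on `doubloons[, c("money_pro", "money_con")]` should reproduce the counts and means read off `summary()`.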
Question 2: Perform some basic transformations.
See comments in my code for what I did to clean and transform the
data.
#Adding the line below so I can easily rename my columns when I make a new data frame.
column_names <- c("Party", "State", "Money_Pro", "Money_Con", "Years", "Stance", "Chamber")
#Removing the index and name column as they are only unique identifiers and won't
#help generate patterns.
filtered_out_columns <-
data.frame(doubloons$party, doubloons$state,
doubloons$money_pro, doubloons$money_con,
doubloons$years, doubloons$stance, doubloons$chamber)
#Changing the column names.
colnames(filtered_out_columns) <- column_names
#Removing rows where "Money_Pro" or "Money_Con" is NA or negative.
#Removing rows where "Stance" is "unknown" or "undecided" because we don't
#like people who are on the fence.
#is.na() tests for missing values directly; comparing against the string "NA"
#only drops those rows as a side effect of how filter() treats NA results.
filtered_out_rows <- filtered_out_columns %>%
  filter(!is.na(Money_Pro) &
           !is.na(Money_Con) &
           Money_Pro >= 0 &
           Money_Con >= 0 &
           Stance != "unknown" &
           Stance != "undecided")
#Replacing "leaning no" with "no" in the "Stance" column because it's practically a "no".
mutated_table <- filtered_out_rows %>%
mutate(Stance = ifelse(Stance == 'leaning no', 'no', Stance))
#Removing the Independent. Party codes carry a leading space in the raw data.
mutated_table <- mutated_table %>% filter(Party != " I")
#Creating two new columns.
#The column calculated below is the difference between Money_Pro & Money_Con.
#This will help me make charts to easily see who paid the politician more.
mutated_table$Money_DifProCon <- mutated_table$Money_Pro - mutated_table$Money_Con
#The column calculated below is the sum of Money_Pro & Money_Con.
#This is just the total donations the politician received.
mutated_table$Money_SumProCon <- mutated_table$Money_Pro + mutated_table$Money_Con
#Now we can better summarize our data based on "yes" and "no" stances only.
summary(mutated_table)
## Party State Money_Pro Money_Con
## Length:215 Length:215 Min. : 250 Min. : 500
## Class :character Class :character 1st Qu.: 5225 1st Qu.: 5250
## Mode :character Mode :character Median : 13900 Median : 11732
## Mean : 38036 Mean : 27955
## 3rd Qu.: 36278 3rd Qu.: 26675
## Max. :571600 Max. :348691
## Years Stance Chamber Money_DifProCon
## Min. : 1.0 Length:215 Length:215 Min. :-141185
## 1st Qu.: 3.0 Class :character Class :character 1st Qu.: -4275
## Median : 8.0 Mode :character Mode :character Median : 1400
## Mean :10.4 Mean : 10082
## 3rd Qu.:15.5 3rd Qu.: 15750
## Max. :48.0 Max. : 241000
## Money_SumProCon
## Min. : 1250
## 1st Qu.: 12775
## Median : 31800
## Mean : 65991
## 3rd Qu.: 67998
## Max. :920291
table(mutated_table$Stance)
##
## no yes
## 159 56
sum(mutated_table$Money_Pro)
## [1] 8177828
sum(mutated_table$Money_Con)
## [1] 6010229
Some noteworthy things here: regardless of stance, politicians still received more on average from pro-PIPA/SOPA companies, yet 159 politicians opposed the proposed law compared to 56 who supported it. This makes me suspect that maybe this was just a bad law, if ~75% of the people who took a stance opposed it even though the “Media Company” donors gave ~36% more money than the “Internet Company” donors. Also note that the median years in office only decreased from 10 to 8.
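The two percentages can be rechecked from the `table()` and `sum()` output printed above. A short sketch of the arithmetic (the variable names are only illustrative):

```r
# Figures copied from the table() and sum() output above.
n_no <- 159
n_yes <- 56
total_pro <- 8177828
total_con <- 6010229

# Share of stance-takers who opposed the bills.
share_no <- n_no / (n_no + n_yes)

# How much more the pro side gave in total, relative to the con side.
pro_excess <- total_pro / total_con - 1

print(round(100 * share_no, 1))    # about 74
print(round(100 * pro_excess, 1))  # about 36.1
```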
Question 3: Display at least one scatter plot, box plot, and histogram.
#First let's make a box plot of total donations (Money_SumProCon) to see what we get.
ggplot(mutated_table, aes(y = Money_SumProCon, x=1)) + geom_boxplot()
The boxplot above has a number of outliers. I really only want to try
and predict the stance of an average politician. So I will be filtering
out outliers by total donations. Then I will aggregate my next box plots
by stance.
removed_outliers <- mutated_table %>% arrange(desc(Money_SumProCon))
#slice(41:n()) drops the top 40 politicians by total donations; the plots represent the remaining 175.
removed_outliers <- removed_outliers %>% slice(41:n())
ggplot(removed_outliers, aes(y = Money_Con, x=1)) + geom_boxplot() + facet_wrap(~Stance)
First we removed the 40 largest recipients, leaving us with 175 data
points to work with. The box plot above shows that “Internet Companies”
donated roughly equal amounts, on average, to politicians of either stance.
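A quick way to confirm how many rows `slice(41:n())` keeps is to run it on a stand-in table of the same size. The sketch below uses a hypothetical 215-row data frame with made-up totals; `slice_head()` is used only to show the complementary set:

```r
library(dplyr)

# Hypothetical stand-in for mutated_table: 215 rows with made-up totals.
toy <- data.frame(Money_SumProCon = seq(1000, by = 500, length.out = 215))

sorted <- toy %>% arrange(desc(Money_SumProCon))

# Keeps rows 41..215, i.e. drops the 40 largest totals.
kept <- sorted %>% slice(41:n())

# The complementary set: the 40 largest totals.
dropped <- sorted %>% slice_head(n = 40)

print(nrow(kept))
print(nrow(dropped))
```

Because the row index starts at 1, `slice(41:n())` removes rows 1 through 40, so the kept and dropped sets together account for all 215 rows.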
ggplot(removed_outliers, aes(y = Money_Pro, x=1)) + geom_boxplot() + facet_wrap(~Stance)
The box plot above shows that politicians whose stance was “Yes”
received on average more money from Media Companies than politicians
whose stance was “No”.
Next we will look at the distribution of the difference between donations (Money_Pro minus Money_Con). A bar left of zero means internet companies donated more; a bar right of zero means media companies donated more.
ggplot(removed_outliers, aes(x = Money_DifProCon)) + geom_histogram() + facet_wrap(~Stance)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We produced something of a normal distribution for either stance, with negligible skews favoring the donor they support. What this says to me is that the source of the money didn’t strongly affect the decision of the average politician who was willing to take a stance on this issue.
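The `stat_bin()` message suggests choosing a bin width explicitly. A hedged sketch of one way to do that, on hypothetical toy data: the `binwidth = 5000` value is an assumption chosen for illustration, and `boundary = 0` aligns a bin edge with zero so each bar falls entirely on one side.

```r
library(ggplot2)

# Hypothetical toy data; the real column is removed_outliers$Money_DifProCon.
set.seed(1)
toy <- data.frame(
  Money_DifProCon = round(rnorm(100, mean = 2000, sd = 15000)),
  Stance = rep(c("no", "yes"), times = c(75, 25))
)

# binwidth = 5000 is an assumed value: each bar spans $5,000.
# boundary = 0 puts a bin edge exactly at zero.
p <- ggplot(toy, aes(x = Money_DifProCon)) +
  geom_histogram(binwidth = 5000, boundary = 0) +
  facet_wrap(~Stance)

print(p)
```

With an edge at zero, bars to the left are strictly "internet companies gave more" and bars to the right are strictly "media companies gave more".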
Okay now let’s check if there is any correlation between number of years in congress and total donations received.
ggplot(removed_outliers, aes(x = Years, y = Money_SumProCon)) + geom_point(aes(color = Stance))
Okay, so this is sort of interesting: the people who received some of the
highest donations were not 30+ year members of Congress, but were
actually people with 1 to 10 years. I interpret this to mean that
companies thought they could buy out politicians with less
experience.
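The visual impression from the scatter plot can be backed with a correlation coefficient. A minimal sketch using `cor()` on hypothetical toy values (not the real data); Spearman's rank correlation is included as an assumption-light alternative, since donation totals are heavily skewed:

```r
# Hypothetical toy data; the real columns are removed_outliers$Years
# and removed_outliers$Money_SumProCon.
toy <- data.frame(
  Years = c(1, 2, 4, 6, 8, 12, 20, 30),
  Money_SumProCon = c(60000, 75000, 40000, 55000, 30000, 25000, 20000, 15000)
)

# Pearson correlation (linear association).
r <- cor(toy$Years, toy$Money_SumProCon)

# Spearman correlation (rank-based, less sensitive to skew).
rho <- cor(toy$Years, toy$Money_SumProCon, method = "spearman")

print(c(pearson = r, spearman = rho))
```

A value near 0 would indicate little association, while a clearly negative value would support the reading that newer members received more.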
Let’s explore a few other charts. First, a bar chart with the counts of Republicans and Democrats faceted by stance. Second, the sum of donations given to either party. Third, a scatter plot of Pro vs Con donations colored by party and faceted by stance.
ggplot(removed_outliers,
aes(x = Party)) + geom_bar(
stat = "count", fill = "steelblue") + facet_wrap(~Stance)
ggplot(removed_outliers,
aes(x = Party, y = Money_SumProCon)) + geom_bar(
stat = "identity", fill = "steelblue")
ggplot(removed_outliers,
aes(x = Money_Pro, y = Money_Con)) + geom_point(
aes(color = Party)) + facet_wrap(~Stance)
summary(removed_outliers)
## Party State Money_Pro Money_Con
## Length:175 Length:175 Min. : 250 Min. : 500
## Class :character Class :character 1st Qu.: 4062 1st Qu.: 4500
## Mode :character Mode :character Median :10250 Median : 8500
## Mean :14972 Mean :13089
## 3rd Qu.:21722 3rd Qu.:20400
## Max. :62700 Max. :63800
## Years Stance Chamber Money_DifProCon
## Min. : 1.000 Length:175 Length:175 Min. :-47700
## 1st Qu.: 3.500 Class :character Class :character 1st Qu.: -4125
## Median : 8.000 Mode :character Mode :character Median : 750
## Mean : 9.817 Mean : 1883
## 3rd Qu.:14.000 3rd Qu.: 9450
## Max. :38.000 Max. : 56450
## Money_SumProCon
## Min. : 1250
## 1st Qu.:11562
## Median :22232
## Mean :28062
## 3rd Qu.:40372
## Max. :81750
One last thing is to check our outliers.
outliers <- mutated_table %>%
  arrange(desc(Money_SumProCon)) %>% # Sort in descending order based on total donations
  top_n(40) # The same 40 largest recipients that slice(41:n()) dropped earlier
## Selecting by Money_SumProCon
ggplot(outliers,
aes(x = Money_Pro,
y = Money_Con)) + geom_point(aes(color = Chamber)) + facet_wrap(~Stance)
In the outliers chart we can see some politicians received a
disproportionate amount of money from Media companies.
Question 4:
Conclusions: The data is not clear cut, but here are my thoughts on the
best predictor of a random politician’s stance on PIPA/SOPA. The one
liberty I will take is assuming that the politician has a stance.
1. My default guess is that a politician is against PIPA/SOPA, because ~75% of the time that’s true based on summary statistics alone.
2. If I know a politician is a Republican there is a ~88% chance they are against PIPA/SOPA; for a Democrat it would be ~66%.
3. If I know the politician received less than 80k in donations my odds stay the same, because it is a weak correlation. However, a politician who received 40K+ from the pro side is more likely to support the pro side.
4. There is a weak negative correlation between the number of years in Congress and how much a politician gets from donations, so I would say the chance of supporting the law goes down the longer someone has been in Congress.
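Per-party percentages like the ones in point 2 can be computed with `table()` and `prop.table()`. The sketch below uses a hypothetical ten-row data frame, so its numbers are illustrative only and are not the report's figures:

```r
# Hypothetical toy data; the real columns are mutated_table$Party
# and mutated_table$Stance.
toy <- data.frame(
  Party  = c("R", "R", "R", "R", "D", "D", "D", "D", "D", "D"),
  Stance = c("no", "no", "no", "yes", "no", "no", "no", "no", "yes", "yes")
)

# Cross-tabulate party against stance.
counts <- table(toy$Party, toy$Stance)

# margin = 1 normalizes each row, giving the stance split within each party.
shares <- prop.table(counts, margin = 1)

print(round(100 * shares, 1))
```

On the real data the party codes would need `trimws()` first, since they are stored with a leading space.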
All in all it’s much easier to predict when someone is going to be against PIPA/SOPA, but to predict someone’s stance I would use the following logic:
1. If you are in the Republican party then you are opposed to PIPA/SOPA.
2. If you are a Democrat and received less than 40k from Media companies then you are opposed to PIPA/SOPA.
3. Else you support PIPA/SOPA.
The results of this function change every time the code is run because the rows are sampled at random, so I won’t know the results until after I knit, but I seem to be right about 70 - 80% of the time.
random_filter <- data.frame(mutated_table$Party, mutated_table$Money_Pro, mutated_table$Stance)
random_rows <- sample_n(random_filter, 20)
colnames(random_rows) <- c("Party", "Money_Pro", "Stance")
#Party codes carry a leading space in the raw data (e.g. " R"), so trim before comparing.
random_rows$Prediction <- ifelse (trimws(random_rows$Party) == "R", "No",
ifelse(random_rows$Money_Pro >= 40000, "Yes", "No"))
print(random_rows)
## Party Money_Pro Stance Prediction
## 1 R 35050 no No
## 2 D 4650 no No
## 3 R 6500 no No
## 4 D 19550 no No
## 5 R 24670 no No
## 6 R 12000 no No
## 7 R 74850 no No
## 8 D 367733 yes Yes
## 9 D 17850 no No
## 10 D 11500 yes No
## 11 R 1500 no No
## 12 R 5500 no No
## 13 R 1000 no No
## 14 R 3380 no No
## 15 R 36255 no No
## 16 D 7000 yes No
## 17 R 56500 yes No
## 18 R 6750 no No
## 19 R 3500 no No
## 20 R 35690 no No
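The hit rate for a sample can be computed rather than eyeballed. The sketch below re-enters the 20 sampled rows printed above, applies the rule as stated in the conclusions, and reports the share of correct predictions. The `sampled` and `accuracy` names are illustrative, and the lowercase prediction labels are a choice made here so they compare directly against `Stance`.

```r
# The 20 sampled rows as printed above (Party, Money_Pro, Stance).
sampled <- data.frame(
  Party = c("R", "D", "R", "D", "R", "R", "R", "D", "D", "D",
            "R", "R", "R", "R", "R", "D", "R", "R", "R", "R"),
  Money_Pro = c(35050, 4650, 6500, 19550, 24670, 12000, 74850, 367733,
                17850, 11500, 1500, 5500, 1000, 3380, 36255, 7000,
                56500, 6750, 3500, 35690),
  Stance = c("no", "no", "no", "no", "no", "no", "no", "yes", "no", "yes",
             "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no")
)

# Stated rule: Republicans -> "no"; Democrats with >= 40k pro money -> "yes".
# trimws() guards against padded party codes such as " R".
sampled$Prediction <- ifelse(trimws(sampled$Party) == "R", "no",
                             ifelse(sampled$Money_Pro >= 40000, "yes", "no"))

# Share of rows where the prediction matches the recorded stance.
accuracy <- mean(sampled$Prediction == sampled$Stance)
print(accuracy)  # 0.85 for this particular sample
```

Calling `set.seed()` with any fixed number before `sample_n()` would make the draw repeatable between knits, so the printed table and the accuracy would no longer change from run to run.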