Project Assignment - Money in politics

In this assignment you will mainly work on data processing and data cleaning.Besides, you need to benefit from what we have covered so far regarding the missing data or potential outlier detection.

Packages

In this assignment we will use the following packages:

tidyverse: a collection of packages for doing data analysis in a “tidy” way
naniar: about detecting missing values in different ways
outlier: a collection of functions for testing potential outliers

In addition to these, feel free to benefit other useful packages that we discussed lastly (including dataExplorer etc.)

You should also load these packages in your Console, which you can do by sending the code to your Console by clicking on the Run Current Chunk icon (green arrow pointing right icon).

Note that these packages are also get loaded in your R Markdown environment when you Knit your R Markdown document.

About Data set

The data come from OpenSecrets.org, a “website tracking the influence of money on U.S. politics, and how that money affects policy and citizens’ lives”.

This website is hosted by The Center for Responsive Politics, which is a nonpartisan, independent non-profit that “tracks money in U.S. politics and its effect on elections and public policy.”[^1]

Our first goal is to merge the different data sets provided, to create a single dataset. Since that means repeating a task many times, you should write a function to create the data.

Exercise 1

Write a function using a for loop in R to create a single data set and call it pac_all containing the contributions in all election years given. In your R Markdown file, load pac_all.csv and report its number of observations and variables as a full sentence. Our focus is the data belonging to the years 2020, 2018 and 2016 only

pac_all <- data.frame()


election_years <- c(2020, 2018, 2016)


for (year in election_years) {
  data <- read.csv(paste0("pac_", year, ".csv"))
  
  pac_all <- rbind(pac_all, data)
}

print(paste("The number of observations in pac_all is", nrow(pac_all),
      "and the number of variables is", ncol(pac_all), "."))

## [1] "The number of observations in pac_all is 721 and the number of variables is 7 ."

Data cleaning

In this section we clean the pac_all data frame to prepare it for analysis and visualization. We have two goals in data cleaning:

Separate the country_parent into two such that country and parent company appear in different columns for country-level analysis.
Convert contribution amounts in total, dems, and repubs from character strings to numeric values.
Check each column whether you have any missing values or not (You can benefit from different packages to visualize missingness if you like)

The following exercises walk you through how to make these fixes to the data.

Looking at the functions separete() str_remove() and str_remove_all() are strongly recommended !!

library(tidyr)

#pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = ",")
install.packages("naniar")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

library(naniar)

gg_miss_var(pac_all)

Exercise 2

Use the separate() function to separate country_parent into country and parent columns.
Remove the character strings including $ and , signs in the total, dems, and repubs columns and convert these columns to numeric.
End your code chunk by printing out the top 10 rows of your data frame (if you just type the data frame name it should automatically do this for you).

library(tidyr)
library(dplyr)
library(stringr)
pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = "/")

## Warning: Expected 2 pieces. Additional pieces discarded in 6 rows [129, 152, 371, 400,
## 612, 640].

pac_all <- pac_all %>%
  mutate(
    total = as.numeric(str_remove_all(total,"[$,]" )),
    dems = as.numeric(str_remove_all(dems,"[$,]" )),
    repubs = as.numeric(str_remove_all(repubs,"[$,]" ))
  )


head(pac_all,n = 10)

##     X                                  name     country
## 1   1                              7-Eleven       Japan
## 2   2                 ABB Group (ABB Group) Switzerland
## 3   3                 Accenture (Accenture)     Ireland
## 4   4                   Air Liquide America      France
## 5   5                          Airbus Group Netherlands
## 6   6                          Alkermes Inc     Ireland
## 7   7          Allianz of America (Allianz)     Germany
## 8   8                          AMG Vanadium Netherlands
## 9   9 Anheuser-Busch (Anheuser-Busch InBev)     Belgium
## 10 10                    AON Corp (AON plc)          UK
##                      parent_company  total   dems repubs year
## 1                Seven & I Holdings  20000   1000  19000 2020
## 2                 Asea Brown Boveri  16900   6800  10100 2020
## 3                     Accenture plc  83500  50500  33000 2020
## 4                  L'Air Liquide SA  37800  15800  22000 2020
## 5                      Airbus Group 182000  79000 103000 2020
## 6                      Alkermes Plc  94750  30750  64000 2020
## 7                Allianz AG Holding  71300  36150  35150 2020
## 8  AMG Advanced Metallurgical Group   2000      0   2000 2020
## 9              Anheuser-Busch InBev 336500 174000 162500 2020
## 10                          AON PLC  80500  44000  36500 2020

Exercise 3

Consider one of the variables in your data set (say total column),

Create a boxplot for this numerical variable separately for each year.
Comment on the distributional behavior of the variable, are you able to spot any potential outliers (value far away from the others in general)
Apply related tests from outlier package if you have any potential outlier observation in your data. Confirm or not the considered point is a real outlier or not by interpreting your statistical test results (ie. Dixon’s test etc.)

library(ggplot2)
library(outliers)

ggplot(pac_all, aes(x = factor(year), y = total)) +
  geom_boxplot() +
  labs(title = "Boxplot of Total Contributions by Year",
       x = "Year",
       y = "Total Contributions") +
  facet_wrap(~year, scales = "free")

#boxplot.stats(pac_all$year)$out
boxplot.stats(pac_all$total)$out

##  [1]  336500  321500  733500  447000  347400  397500  566500  727500  496500
## [10]  458000  276400  290000  414582  386460  297000  298500  988500 1161642
## [19]  504000 1615000  495000  358500  611500  638875  278000  348500  279000
## [28]  364000  468750  552000  348700  323000  573000  320250  293000  349500
## [37]  293000  405938  380000  303500  442200  694500  874999  458500 1389750
## [46]  326000  575574  527500  615500  678236  480000  414500  326500  382950
## [55]  573500  402722  383500  431000  343500  366500  458351  352230  566475
## [64]  287200  432250  598500  360500 1445490

#boxplot.stats(pac_all$dems)$out
#boxplot.stats(pac_all$repubs)$out


grubbs.test(pac_all$total, two.sided = TRUE)

## 
##  Grubbs test for one outlier
## 
## data:  pac_all$total
## G = 9.11112, U = 0.88454, p-value < 2.2e-16
## alternative hypothesis: highest value 1615000 is an outlier

# Comment on distributional behavior and identify potential outliers
# You can visually inspect the boxplots for any values that seem far away from the others, indicating potential outliers.

# Apply Dixon's test for potential outliers
#outliers_2020 <- dixon.test(pac_all$total, opposite = TRUE)
#outliers_2018 <- dixon.test(pac_all$total[pac_all$year == 2018], opposite = TRUE)
#outliers_2016 <- dixon.test(pac_all$total[pac_all$year == 2016], opposite = TRUE)

# Print the results of Dixon's test
#print(outliers_2020)
#print(outliers_2018)
#print(outliers_2016)

# Interpret the results of Dixon's test to confirm or reject potential outliers
# If the p-value of the test is less than a chosen significance level (e.g., 0.05), the observation is considered an outlier.

The Grubbs test allows to detect whether the highest or lowest value in a dataset is an outlier.

The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:

H0 : The highest value is not an outlier H1 : The highest value is an outlier

Tests whether the highest value is 1615000 and whether it is an outlier and looking at the p-value, the H0 hypothesis is rejected because the p-value is small and the highest value is an outlier

Exercise 4

Create a line plot of total contributions from all foreign-connected PACs in the Canada and Mexico over the years.
Once you have made the plot, write a brief interpretation of what the graph reveals.

library(ggplot2)
foreign_pacs <- pac_all %>%
  filter(country %in% c("Mexico", "Canada")) %>% 
  group_by(country,year) %>%
  summarise(total_contributions = sum(total), .groups = "drop")



ggplot(foreign_pacs, aes(x = year, y = total_contributions,color  = country,group = country )) +
  geom_line(linewidth = 1.5) +
  labs(title = "Total Contributions from Foreign-Connected pacs in Mexico and Canada",
       x = "Year",
       y = "Total Contributions",
       color = "Country")+
  scale_color_manual(values = c("Canada" = "yellow", "Mexico" = "purple"))