Project Assignment - Money in politics

In this assignment you will mainly work on data processing and data cleaning.Besides, you need to benefit from what we have covered so far regarding the missing data or potential outlier detection.

Packages

In this assignment we will use the following packages:

In addition to these, feel free to benefit other useful packages that we discussed lastly (including dataExplorer etc.)

Note that these packages are also get loaded in your R Markdown environment when you Knit your R Markdown document.

About Data set

The data come from OpenSecrets.org, a “website tracking the influence of money on U.S. politics, and how that money affects policy and citizens’ lives”.

This website is hosted by The Center for Responsive Politics, which is a nonpartisan, independent non-profit that “tracks money in U.S. politics and its effect on elections and public policy.”[^1]

Our first goal is to merge the different data sets provided, to create a single dataset. Since that means repeating a task many times, you should write a function to create the data.

Exercise 1

Write a function using a for loop in R to create a single data set and call it pac_all containing the contributions in all election years given. In your R Markdown file, load pac_all.csv and report its number of observations and variables as a full sentence. Our focus is the data belonging to the years 2020, 2018 and 2016 only

pac_all <- data.frame()


election_years <- c(2020, 2018, 2016)


for (year in election_years) {
  data <- read.csv(paste0("pac_", year, ".csv"))
  
  pac_all <- rbind(pac_all, data)
}

print(paste("The number of observations in pac_all is", nrow(pac_all),
      "and the number of variables is", ncol(pac_all), "."))
## [1] "The number of observations in pac_all is 721 and the number of variables is 7 ."

Data cleaning

In this section we clean the pac_all data frame to prepare it for analysis and visualization. We have two goals in data cleaning:

The following exercises walk you through how to make these fixes to the data.

Looking at the functions separete() str_remove() and str_remove_all() are strongly recommended !!

library(tidyr)

#pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = ",")
install.packages("naniar")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(naniar)

gg_miss_var(pac_all)

Exercise 2

library(tidyr)
library(dplyr)
library(stringr)
pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = "/")
## Warning: Expected 2 pieces. Additional pieces discarded in 6 rows [129, 152, 371, 400,
## 612, 640].
pac_all <- pac_all %>%
  mutate(
    total = as.numeric(str_remove_all(total,"[$,]" )),
    dems = as.numeric(str_remove_all(dems,"[$,]" )),
    repubs = as.numeric(str_remove_all(repubs,"[$,]" ))
  )


head(pac_all,n = 10)
##     X                                  name     country
## 1   1                              7-Eleven       Japan
## 2   2                 ABB Group (ABB Group) Switzerland
## 3   3                 Accenture (Accenture)     Ireland
## 4   4                   Air Liquide America      France
## 5   5                          Airbus Group Netherlands
## 6   6                          Alkermes Inc     Ireland
## 7   7          Allianz of America (Allianz)     Germany
## 8   8                          AMG Vanadium Netherlands
## 9   9 Anheuser-Busch (Anheuser-Busch InBev)     Belgium
## 10 10                    AON Corp (AON plc)          UK
##                      parent_company  total   dems repubs year
## 1                Seven & I Holdings  20000   1000  19000 2020
## 2                 Asea Brown Boveri  16900   6800  10100 2020
## 3                     Accenture plc  83500  50500  33000 2020
## 4                  L'Air Liquide SA  37800  15800  22000 2020
## 5                      Airbus Group 182000  79000 103000 2020
## 6                      Alkermes Plc  94750  30750  64000 2020
## 7                Allianz AG Holding  71300  36150  35150 2020
## 8  AMG Advanced Metallurgical Group   2000      0   2000 2020
## 9              Anheuser-Busch InBev 336500 174000 162500 2020
## 10                          AON PLC  80500  44000  36500 2020

Exercise 3

Consider one of the variables in your data set (say total column),

library(ggplot2)
library(outliers)

ggplot(pac_all, aes(x = factor(year), y = total)) +
  geom_boxplot() +
  labs(title = "Boxplot of Total Contributions by Year",
       x = "Year",
       y = "Total Contributions") +
  facet_wrap(~year, scales = "free")

#boxplot.stats(pac_all$year)$out
boxplot.stats(pac_all$total)$out
##  [1]  336500  321500  733500  447000  347400  397500  566500  727500  496500
## [10]  458000  276400  290000  414582  386460  297000  298500  988500 1161642
## [19]  504000 1615000  495000  358500  611500  638875  278000  348500  279000
## [28]  364000  468750  552000  348700  323000  573000  320250  293000  349500
## [37]  293000  405938  380000  303500  442200  694500  874999  458500 1389750
## [46]  326000  575574  527500  615500  678236  480000  414500  326500  382950
## [55]  573500  402722  383500  431000  343500  366500  458351  352230  566475
## [64]  287200  432250  598500  360500 1445490
#boxplot.stats(pac_all$dems)$out
#boxplot.stats(pac_all$repubs)$out


grubbs.test(pac_all$total, two.sided = TRUE)
## 
##  Grubbs test for one outlier
## 
## data:  pac_all$total
## G = 9.11112, U = 0.88454, p-value < 2.2e-16
## alternative hypothesis: highest value 1615000 is an outlier
# Comment on distributional behavior and identify potential outliers
# You can visually inspect the boxplots for any values that seem far away from the others, indicating potential outliers.

# Apply Dixon's test for potential outliers
#outliers_2020 <- dixon.test(pac_all$total, opposite = TRUE)
#outliers_2018 <- dixon.test(pac_all$total[pac_all$year == 2018], opposite = TRUE)
#outliers_2016 <- dixon.test(pac_all$total[pac_all$year == 2016], opposite = TRUE)

# Print the results of Dixon's test
#print(outliers_2020)
#print(outliers_2018)
#print(outliers_2016)

# Interpret the results of Dixon's test to confirm or reject potential outliers
# If the p-value of the test is less than a chosen significance level (e.g., 0.05), the observation is considered an outlier.

The Grubbs test allows to detect whether the highest or lowest value in a dataset is an outlier.

The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:

H0 : The highest value is not an outlier H1 : The highest value is an outlier

Tests whether the highest value is 1615000 and whether it is an outlier and looking at the p-value, the H0 hypothesis is rejected because the p-value is small and the highest value is an outlier

Exercise 4

library(ggplot2)
foreign_pacs <- pac_all %>%
  filter(country %in% c("Mexico", "Canada")) %>% 
  group_by(country,year) %>%
  summarise(total_contributions = sum(total), .groups = "drop")



ggplot(foreign_pacs, aes(x = year, y = total_contributions,color  = country,group = country )) +
  geom_line(linewidth = 1.5) +
  labs(title = "Total Contributions from Foreign-Connected pacs in Mexico and Canada",
       x = "Year",
       y = "Total Contributions",
       color = "Country")+
  scale_color_manual(values = c("Canada" = "yellow", "Mexico" = "purple"))