In this assignment you will mainly work on data processing and data cleaning.Besides, you need to benefit from what we have covered so far regarding the missing data or potential outlier detection.
In this assignment we will use the following packages:
tidyverse: a collection of packages for doing data analysis in a “tidy” way
naniar: about detecting missing values in different ways
outlier: a collection of functions for testing potential outliers
In addition to these, feel free to benefit other useful packages that we discussed lastly (including dataExplorer etc.)
Note that these packages are also get loaded in your R Markdown environment when you Knit your R Markdown document.
The data come from OpenSecrets.org, a “website tracking the influence of money on U.S. politics, and how that money affects policy and citizens’ lives”.
This website is hosted by The Center for Responsive Politics, which is a nonpartisan, independent non-profit that “tracks money in U.S. politics and its effect on elections and public policy.”[^1]
Our first goal is to merge the different data sets provided, to create a single dataset. Since that means repeating a task many times, you should write a function to create the data.
Write a function using a for loop in R to create a single data set and call it pac_all containing the contributions in all election years given. In your R Markdown file, load pac_all.csv and report its number of observations and variables as a full sentence. Our focus is the data belonging to the years 2020, 2018 and 2016 only
pac_all <- data.frame()
election_years <- c(2020, 2018, 2016)
for (year in election_years) {
data <- read.csv(paste0("pac_", year, ".csv"))
pac_all <- rbind(pac_all, data)
}
print(paste("The number of observations in pac_all is", nrow(pac_all),
"and the number of variables is", ncol(pac_all), "."))## [1] "The number of observations in pac_all is 721 and the number of variables is 7 ."
In this section we clean the pac_all data frame to prepare it for analysis and visualization. We have two goals in data cleaning:
Separate the country_parent into two such that country and parent company appear in different columns for country-level analysis.
Convert contribution amounts in total, dems, and repubs from character strings to numeric values.
Check each column whether you have any missing values or not (You can benefit from different packages to visualize missingness if you like)
The following exercises walk you through how to make these fixes to the data.
Looking at the functions separete() str_remove() and str_remove_all() are strongly recommended !!
library(tidyr)
#pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = ",")
install.packages("naniar")## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
Use the separate() function to separate country_parent into country and parent columns.
Remove the character strings including $ and , signs in the total, dems, and repubs columns and convert these columns to numeric.
End your code chunk by printing out the top 10 rows of your data frame (if you just type the data frame name it should automatically do this for you).
library(tidyr)
library(dplyr)
library(stringr)
pac_all <- separate(pac_all, country_parent, into = c("country", "parent_company"), sep = "/")## Warning: Expected 2 pieces. Additional pieces discarded in 6 rows [129, 152, 371, 400,
## 612, 640].
pac_all <- pac_all %>%
mutate(
total = as.numeric(str_remove_all(total,"[$,]" )),
dems = as.numeric(str_remove_all(dems,"[$,]" )),
repubs = as.numeric(str_remove_all(repubs,"[$,]" ))
)
head(pac_all,n = 10)## X name country
## 1 1 7-Eleven Japan
## 2 2 ABB Group (ABB Group) Switzerland
## 3 3 Accenture (Accenture) Ireland
## 4 4 Air Liquide America France
## 5 5 Airbus Group Netherlands
## 6 6 Alkermes Inc Ireland
## 7 7 Allianz of America (Allianz) Germany
## 8 8 AMG Vanadium Netherlands
## 9 9 Anheuser-Busch (Anheuser-Busch InBev) Belgium
## 10 10 AON Corp (AON plc) UK
## parent_company total dems repubs year
## 1 Seven & I Holdings 20000 1000 19000 2020
## 2 Asea Brown Boveri 16900 6800 10100 2020
## 3 Accenture plc 83500 50500 33000 2020
## 4 L'Air Liquide SA 37800 15800 22000 2020
## 5 Airbus Group 182000 79000 103000 2020
## 6 Alkermes Plc 94750 30750 64000 2020
## 7 Allianz AG Holding 71300 36150 35150 2020
## 8 AMG Advanced Metallurgical Group 2000 0 2000 2020
## 9 Anheuser-Busch InBev 336500 174000 162500 2020
## 10 AON PLC 80500 44000 36500 2020
Consider one of the variables in your data set (say total column),
Create a boxplot for this numerical variable separately for each year.
Comment on the distributional behavior of the variable, are you able to spot any potential outliers (value far away from the others in general)
Apply related tests from outlier package if you have any potential outlier observation in your data. Confirm or not the considered point is a real outlier or not by interpreting your statistical test results (ie. Dixon’s test etc.)
library(ggplot2)
library(outliers)
ggplot(pac_all, aes(x = factor(year), y = total)) +
geom_boxplot() +
labs(title = "Boxplot of Total Contributions by Year",
x = "Year",
y = "Total Contributions") +
facet_wrap(~year, scales = "free")## [1] 336500 321500 733500 447000 347400 397500 566500 727500 496500
## [10] 458000 276400 290000 414582 386460 297000 298500 988500 1161642
## [19] 504000 1615000 495000 358500 611500 638875 278000 348500 279000
## [28] 364000 468750 552000 348700 323000 573000 320250 293000 349500
## [37] 293000 405938 380000 303500 442200 694500 874999 458500 1389750
## [46] 326000 575574 527500 615500 678236 480000 414500 326500 382950
## [55] 573500 402722 383500 431000 343500 366500 458351 352230 566475
## [64] 287200 432250 598500 360500 1445490
#boxplot.stats(pac_all$dems)$out
#boxplot.stats(pac_all$repubs)$out
grubbs.test(pac_all$total, two.sided = TRUE)##
## Grubbs test for one outlier
##
## data: pac_all$total
## G = 9.11112, U = 0.88454, p-value < 2.2e-16
## alternative hypothesis: highest value 1615000 is an outlier
# Comment on distributional behavior and identify potential outliers
# You can visually inspect the boxplots for any values that seem far away from the others, indicating potential outliers.
# Apply Dixon's test for potential outliers
#outliers_2020 <- dixon.test(pac_all$total, opposite = TRUE)
#outliers_2018 <- dixon.test(pac_all$total[pac_all$year == 2018], opposite = TRUE)
#outliers_2016 <- dixon.test(pac_all$total[pac_all$year == 2016], opposite = TRUE)
# Print the results of Dixon's test
#print(outliers_2020)
#print(outliers_2018)
#print(outliers_2016)
# Interpret the results of Dixon's test to confirm or reject potential outliers
# If the p-value of the test is less than a chosen significance level (e.g., 0.05), the observation is considered an outlier.The Grubbs test allows to detect whether the highest or lowest value in a dataset is an outlier.
The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:
H0 : The highest value is not an outlier H1 : The highest value is an outlier
Tests whether the highest value is 1615000 and whether it is an outlier and looking at the p-value, the H0 hypothesis is rejected because the p-value is small and the highest value is an outlier
Create a line plot of total contributions from all foreign-connected PACs in the Canada and Mexico over the years.
Once you have made the plot, write a brief interpretation of what the graph reveals.
library(ggplot2)
foreign_pacs <- pac_all %>%
filter(country %in% c("Mexico", "Canada")) %>%
group_by(country,year) %>%
summarise(total_contributions = sum(total), .groups = "drop")
ggplot(foreign_pacs, aes(x = year, y = total_contributions,color = country,group = country )) +
geom_line(linewidth = 1.5) +
labs(title = "Total Contributions from Foreign-Connected pacs in Mexico and Canada",
x = "Year",
y = "Total Contributions",
color = "Country")+
scale_color_manual(values = c("Canada" = "yellow", "Mexico" = "purple"))