Mini Project 2: Data Wrangling

Learning Goal

Further sharpen data wrangling and data visualization skills

Due date

Description

You will analyze Federal Election Commission data and report your findings in a blog post. The topic is up to you. The best projects will:

You may work with a partner, but are not required to.

GitHub

As with the first mini project, you will use GitHub to collaborate with your, as well as to submit the final product. To create a repository for your mini project, complete the following steps:

  1. Nominate one person to host your repository. Suppose this person's GitHub username is github_id.
  2. Have github_id fork the mini-project-2 repository from cmsc205
  3. Have github_id add other teammates as "Collaborators" (under "Settings")
  4. All teammates:
    1. Open new project in RStudio...
    2. From Version Control...
    3. From GitHub...
    4. Paste the "Clone or download" URL from github_id/mini-project-2’s repo

To submit your final blog post, please follow the instructions from HW-0

The link to blog post is here

Data

The data for this project come from the Federal Election Commission, and are based on the 2011-2012 federal election cycle. These data were collected using the fec R package.

Be sure to read the supporting documentation for these data!! It is your responsibility to know what data you are working with!

There are four tables to consider: house_elections, candidates, committees, and contributions. You can use the following code to load in the data sets.

library(tidyverse)
house_elections <- read_csv("house_elections.csv")
committees <- read_csv("committees.csv")
candidates <- read_csv("candidates.csv")

# The contributions file was too big for GitHub, so it's 
# stored slightly differently.
load("contributions.rda") 

After loading the data sets, verify that your data sets look like those displayed below (be sure that the row and column dimensions match):

glimpse(house_elections)
## Observations: 2,178
## Variables: 10
## $ fec_id         <chr> "B2CA08156", "H0AK00097", "H0AL01030", "H0AL020...
## $ state          <chr> "CA", "AK", "AL", "AL", "AL", "AL", "AL", "AR",...
## $ district       <chr> "08", "00", "01", "02", "05", "07", "07", "01",...
## $ incumbent      <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, T...
## $ candidate_name <chr> "Mitzelfelt, Brad", "Cox, John R.", "Gounares, ...
## $ party          <chr> "R", "R", "R", "R", "R", "D", "R", "R", "R", "R...
## $ primary_votes  <int> 8801, 11179, 3854, 0, 65163, 0, 11537, 0, 0, 0,...
## $ runoff_votes   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ general_votes  <int> 0, 0, 0, 180591, 189185, 232520, 73835, 138800,...
## $ ge_winner      <chr> NA, NA, NA, "W", "W", "W", "N", "W", "W", "W", ...
glimpse(committees)
## Observations: 14,454
## Variables: 15
## $ cmte_id                <chr> "C00000042", "C00000059", "C00000422", ...
## $ cmte_name              <chr> "ILLINOIS TOOL WORKS INC. FOR BETTER GO...
## $ tres_name              <chr> "LYNCH, MICHAEL J. MR.", "GREG SWARENS"...
## $ cmte_st1               <chr> "3600 WEST LAKE AVENUE", "2501 MCGEE", ...
## $ cmte_st2               <chr> NA, "MD#288", "SUITE 600", NA, NA, NA, ...
## $ cmte_city              <chr> "GLENVIEW", "KANSAS CITY", "WASHINGTON"...
## $ cmte_state             <chr> "IL", "MO", "DC", "OK", "KS", "IN", "DC...
## $ cmte_zip               <int> 60026, 64108, 20001, 73107, 66612, 4620...
## $ cmte_dsgn              <chr> "B", "U", "B", "U", "U", "U", "B", "B",...
## $ cmte_type              <chr> "Q", "Q", "Q", "N", "Q", "Q", "Q", "Q",...
## $ cmte_party_affiliation <chr> NA, "UNK", NA, NA, "UNK", NA, "UNK", "U...
## $ cmte_filing_freq       <chr> "Q", "M", "M", "Q", "Q", "Q", "M", "M",...
## $ org_type               <chr> "C", "C", "M", "L", "T", "M", "M", "L",...
## $ connected_org_name     <chr> "ILLINOIS TOOL WORKS INC.", NA, "AMERIC...
## $ cand_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
glimpse(contributions)
## Observations: 396,369
## Variables: 22
## $ cmte_id          <chr> "C00478404", "C00140855", "C00140855", "C0014...
## $ amndt_ind        <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ rpt_type         <chr> "M3", "M3", "M3", "M3", "M3", "M3", "M3", "M3...
## $ transaction_pgi  <chr> "P", "P", "P", "P", "P", "P", "G", "P", "P", ...
## $ image_num        <chr> "11930476751.0", "11930476826.0", "1193047682...
## $ transaction_type <chr> "24K", "24K", "24K", "24K", "24K", "24K", "24...
## $ entity_type      <chr> "COM", "CCM", "CCM", "CCM", "CCM", "CCM", "CC...
## $ name             <chr> "KLINE FOR CONGRESS", "TIM RYAN FOR U.S. CONG...
## $ city             <chr> "BURNSVILLE", "WASHINGTON", "WASHINGTON", "BO...
## $ state            <chr> "MN", "DC", "DC", "MD", "ND", "MI", "MN", "IA...
## $ zip_code         <chr> "55337", "20013", "20005", "20716", "58106", ...
## $ employer         <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ occupation       <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ transaction_dt   <chr> "02252011", "02012011", "02012011", "02222011...
## $ transaction_amt  <dbl> 2400, 1000, 1000, 2500, 1000, 5000, 1000, 100...
## $ other_id         <chr> "C00326629", "C00373464", "C00289983", "C0014...
## $ cand_id          <chr> "H8MN06047", "H2OH17109", "H4KY01040", "H2MD0...
## $ tran_id          <chr> "B37FBC79414E54DD7A1C", "38595006", "38595007...
## $ file_num         <int> 717033, 717042, 717042, 717042, 717043, 71704...
## $ memo_cd          <chr> "", "", "", "", "", "", "", "", "", "X", "", ...
## $ memo_text        <chr> "", "", "", "", "", "", "", "", "", "CHECK 23...
## $ sub_id           <dbl> 4.03182e+18, 4.03172e+18, 4.03172e+18, 4.0317...
glimpse(candidates)
## Observations: 5,628
## Variables: 15
## $ cand_id                <chr> "H0AK00089", "H0AK00097", "H0AL00016", ...
## $ cand_name              <chr> "CRAWFORD, HARRY T JR", "COX, JOHN ROBE...
## $ cand_party_affiliation <chr> "DEM", "REP", "UNK", "REP", "REP", "DEM...
## $ cand_election_yr       <int> 2010, 2012, 2010, 2012, 2012, 2008, 201...
## $ cand_office_state      <chr> "AK", "AK", "AL", "AL", "AL", "AL", "AL...
## $ cand_office            <chr> "H", "H", "H", "H", "H", "H", "H", "H",...
## $ cand_office_district   <int> 0, 0, 7, 1, 2, 5, 5, 5, 5, 5, 6, 7, 7, ...
## $ cand_ici               <chr> "C", "C", "O", "C", "I", "C", "C", "I",...
## $ cand_status            <chr> "P", "N", "C", "C", "C", "C", "C", "C",...
## $ cand_pcc               <chr> "C00466698", "C00525261", "C00464040", ...
## $ cand_st1               <chr> "4350 BUTTE CIR", "PO BOX 1092", "PO BO...
## $ cand_st2               <chr> NA, NA, NA, NA, NA, NA, NA, NA, "SUITE ...
## $ cand_city              <chr> "ANCHORAGE", "ANCHOR POINT", "BIRMINGHA...
## $ cand_state             <chr> "AK", "AK", "AL", "AL", "AL", "AL", "AL...
## $ cand_zip               <int> 99504, 8388607, 35201, 36561, 36106, 35...

Assignment

Our first data represents the 15 candidates with highest transactions amounts. Candidates transaction amounts are different in different time of the year. We took sum of the transaction amounts for each candidate, to compare who had the most contribution.

Our second data is about relationship between primary votes and general votes for Republicans and Democrats. Some candidate has zero primary votes or some candidate have zero general vote for some reasons (for example, for zero primary votes, the candidate might be the only candidate in the party.) Our data only includes votes larger than 0.


library(tidyverse)
committees_transaction_amt <- full_join(candidates, contributions, by = "cand_id") %>%
  select(cmte_id,  transaction_dt, transaction_amt, cand_name, cand_election_yr, cand_party_affiliation)%>%
  filter(transaction_amt > 0) %>%
  group_by(cand_name, cand_election_yr, cand_party_affiliation) %>%
  summarize(total_trans_amt = sum(transaction_amt)) %>%
  arrange(desc(total_trans_amt))
head(committees_transaction_amt, 15)
## Source: local data frame [15 x 4]
## Groups: cand_name, cand_election_yr [15]
## 
##                       cand_name cand_election_yr cand_party_affiliation
##                           <chr>            <int>                  <chr>
## 1                 OBAMA, BARACK             2012                    DEM
## 2  ROMNEY, MITT / RYAN, PAUL D.             2012                    REP
## 3        KAINE, TIMOTHY MICHAEL             2012                    DEM
## 4                GINGRICH, NEWT             2012                    REP
## 5          SANTORUM, RICHARD J.             2012                    REP
## 6                BROWN, SHERROD             2012                    DEM
## 7             THOMPSON, TOMMY G             2012                    REP
## 8                BALDWIN, TAMMY             2012                    DEM
## 9                 ALLEN, GEORGE             2012                    REP
## 10                 MANDEL, JOSH             2012                    REP
## 11          MOURDOCK, RICHARD E             2012                    REP
## 12             BERKLEY, SHELLEY             2012                    DEM
## 13                 HELLER, DEAN             2012                    REP
## 14                 NELSON, BILL             2012                    DEM
## 15                  TESTER, JON             2012                    DEM
## # ... with 1 more variables: total_trans_amt <dbl>
transaction_15 <- head(committees_transaction_amt, 15)

We plot a bar chart to represent this data.

The reason of why we chose a bar chart is to see the difference between the transaction amounts.

transaction_15$cand_party_affiliation <- as.factor(transaction_15$cand_party_affiliation)
graph_1 <- 
  ggplot(transaction_15,
         aes(x = reorder(cand_name, total_trans_amt),
             y = total_trans_amt)) +
  geom_bar(stat = "identity", aes(fill = factor(cand_party_affiliation))) +
  geom_text(aes(label = total_trans_amt, hjust = -.1)) +
  coord_flip() +
  expand_limits(y = 500000000) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual("Party Affiliation", values = c("red", "blue")) +
  labs(x = NULL, y = NULL, 
       title = "Top 15 Transaction Amount in 2012",
       subtitle = "Highest Transction Amount of Candidates in 2012 ")
graph_1

First graph shows that in 2012, the highest transaction amount belongs to Barack Obama, and the second highest transaction amount belongs to Mitt Romney. In the bar chart, we can see 8 Republican transactions and 7 Democrats transactions. Furthermore, we represented the Democrats in red, and Republicans in blue to make the party affiliations easier to recognize visually.


Second graph

Second graph shows that relationship between general votes and primary votes are directly proportional.

The reason of why we chose a scatter plot is to see the primary and general votes and their relations.

house_elections %>% 
  filter(general_votes > 0, primary_votes >0 , party=="R"| party=="D" ) %>% 
  select(party, primary_votes, general_votes, ge_winner) %>%
  filter(!is.na("ge_winner")) %>%
  ggplot(aes(x= primary_votes, y= general_votes)) +
  facet_wrap(~party) + geom_point(stat = "identity", aes(color = factor(ge_winner))) + geom_smooth() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Primary Votes", y = "General Votes",
       title = "Relationship between Primary Votes and General Votes",
       subtitle = "House Elections in the US") +
  scale_color_manual("Win?", values = c("goldenrod1", "dodgerblue1"))
## `geom_smooth()` using method = 'loess'

Here, we can see the relationship between Primary votes and General votes. The graph suggests a strong correlation between the two, and as the number of votes get bigger, they are become more disperse. The election winners are represented as blue, and losers as yellow. There is clear correlation between the number of both votes and whether they are elected, however, the correlation seems weaker in the Democratic party, which we may investigate further in the future.

Ackowledgements

This prompt was adapted from a prompt written by Ben Baumer and R. Jordan Crouser.