Project 2: Data Analysis

Setup

library(RCurl)
library(tidyr)
library(dplyr)
library(knitr)
library(ggplot2)
library(stringr)

DATA SET 1: Congressional Voting Records

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes in 1984 identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

Data was identified by Nkasi Nedd and is available at UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).

Requested Analysis: Which issue attracted the most difference/common ground between members of congress according to party affiliation?

Data Import

votes <- read.csv(text=getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"), 
                  header = FALSE, sep = ",")

Sample of raw data:

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
republican n y n y y y n n n y ? y y y n y
republican n y n y y y n n n n n y y y n ?
democrat ? y y ? y y n n n n y n y y n n
democrat n y y n ? y n n n n y n y n n y
democrat y y y n y y n n n n y ? y y y y
democrat n y y n y y n n n n n n y y y y

Data Tidying

# Set column names
names(votes) <- c("party",
                  "handicapped-infants",
                  "water-project-cost-sharing",
                  "adoption-of-the-budget-resolution",
                  "physician-fee-freeze",
                  "el-salvador-aid",
                  "religious-groups-in-schools",
                  "anti-satellite-test-ban",
                  "aid-to-nicaraguan-contras",
                  "mx-missile",
                  "immigration",
                  "synfuels-corporation-cutback",
                  "education-spending",
                  "superfund-right-to-sue",
                  "crime",
                  "duty-free-exports",
                  "export-administration-act-south-africa")

# Gather all data into a tidy format with 3 variables - party, issue and vote
votes <- votes %>% gather(issue, vote, 2:17)

# Update vote variable with more descriptive values
votes$vote[votes$vote == 'y'] <- "Yes"
votes$vote[votes$vote == 'n'] <- "No"
votes$vote[votes$vote == '?'] <- "Unknown"

Sample of tidy data:

party issue vote
republican handicapped-infants No
republican handicapped-infants No
democrat handicapped-infants Unknown
democrat handicapped-infants No
democrat handicapped-infants Yes
democrat handicapped-infants No

Data Analysis

# Get counts of each vote type by issue
votesum <- votes %>% 
  group_by(issue, vote) %>% 
  summarise(count = n()) %>% 
  arrange(desc(vote), count)

# Convert issues to factor to preserve sort order
votesum$issue <- factor(votesum$issue, levels = unique(votesum$issue))

# Plot vote counts by issue
ggplot(votesum, aes(issue, count, fill = vote)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "", y = "", caption="Plot 1. Split of votes per issue.") +
  scale_fill_manual(values=c("#FF6666", "#C3C3C3", "#49AE49"), name = "Vote")

The plot above shows all 16 issues and illustrates Yes, No, Unknown votes per issue. Party affiliation is not included in this plot. It appears that the 98th Congress was not close to being unanimous on any of the issues. As a side quest, I became curious why so many votes show up as Unknown for export-administration-act-south-africa issue. It appears that the 1984 amendement to the Export Administration Act was a particularly contentious issue during the Reagan Administration.

In order to evaluate which issue had or lacked bipartisan support, I concentrated analysis only on Yea votes. By evaluating where the party split is among the Yea votes, it can be assumed that a similar split exists in the Nay votes. Possible drawback of this approach is that Unknown votes are not considered; however, it should be enough for a quick analysis (especially considering that this data set already includes simplification of congressional votes).

# Ananlyze Yea votes
yesvotes <- votes %>% 
  filter(vote == "Yes") %>%               # Get only Yea votes
  group_by(party, issue) %>%              # Get counts of votes per party and issue
  summarise(count = n()) %>% 
  spread(party, count) %>%                # Spread out results for additional calculations 
  mutate(sum = democrat+republican,       # Number of votes per issue
         demprop = democrat / sum,        # Proportion of democratic votes per issue
         repprop = republican / sum,      # Proportion of repubican votes per issue
         dist05 = abs(0.5-demprop)) %>%   # Distance of split from 0.5 (closer = bipartisan support)
  arrange(dist05)

Analysis of Yea votes per issue and per party affiliation:

issue democrat republican sum demprop repprop dist05
religious-groups-in-schools 123 149 272 0.4522059 0.5477941 0.0477941
immigration 124 92 216 0.5740741 0.4259259 0.0740741
water-project-cost-sharing 120 75 195 0.6153846 0.3846154 0.1153846
crime 90 158 248 0.3629032 0.6370968 0.1370968
export-administration-act-south-africa 173 96 269 0.6431227 0.3568773 0.1431227
superfund-right-to-sue 73 136 209 0.3492823 0.6507177 0.1507177
el-salvador-aid 55 157 212 0.2594340 0.7405660 0.2405660
education-spending 36 135 171 0.2105263 0.7894737 0.2894737
handicapped-infants 156 31 187 0.8342246 0.1657754 0.3342246
anti-satellite-test-ban 200 39 239 0.8368201 0.1631799 0.3368201
synfuels-corporation-cutback 129 21 150 0.8600000 0.1400000 0.3600000
aid-to-nicaraguan-contras 218 24 242 0.9008264 0.0991736 0.4008264
mx-missile 188 19 207 0.9082126 0.0917874 0.4082126
adoption-of-the-budget-resolution 231 22 253 0.9130435 0.0869565 0.4130435
duty-free-exports 160 14 174 0.9195402 0.0804598 0.4195402
physician-fee-freeze 14 163 177 0.0790960 0.9209040 0.4209040
# Sort and gather data for plotting
yesplot <- yesvotes %>% 
  arrange(demprop) %>% 
  gather(party, yescount, 5:6)

# Convert issue to factor to preserve order
yesplot$issue <- factor(yesplot$issue, levels = unique(yesplot$issue))

# Plot split in Yea votes per issue
ggplot(yesplot, aes(issue, yescount, fill = party)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "", y = "", caption="Plot 2. Split of affirmative votes per party.") +
  scale_fill_manual(values=c("#6969FF", "#FF6E6E"), name = "Party", labels = c("Democratic", "Republican"))

The plot above that demonstrates the split between party support of issues is somewhat confusing. It may appear that the duty-free-exports issue was the most contentious one; however, we do not care if an issue was supported more by republican or democractic party only which issue had the most support from one party. For this let us find the issue closest to the 50-50 split between parties and the issue furthest from the split.

yesvotes %>% 
  filter(dist05 == max(dist05)) %>% 
  select(issue)
## # A tibble: 1 × 1
##                  issue
##                  <chr>
## 1 physician-fee-freeze
yesvotes %>% 
  filter(dist05 == min(dist05)) %>% 
  select(issue)
## # A tibble: 1 × 1
##                         issue
##                         <chr>
## 1 religious-groups-in-schools

Conclusion

It appears that physician-fee-freeze was the issue that was split the most along party lines in 1984. The religious-groups-in-school issue enjoyed the most bipartisan support. I became interested in the education-spending issue which seemed to enjoy a lot of republican support in 1984. I consider education to be a topic primarily supported by democrats. Researching the issue further I discovered that the act lowered education spending. That demonstrated that issue names are limited in conveying their importance or meaning. It is critical to know the background and details of data for any real-world analysis.

DATA SET 2: World Bank Data

The data includes world development indicators from 1960 to 2016 covering 217 economies. It is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

Data was identified by Kyle Gilde and is available at the World Bank website (http://data.worldbank.org/data-catalog/world-development-indicators).

Requested Analysis: Specific analysis was not requested. I picked this data set due to its large size to see how data size influences the analysis in R.

Data Import

world <- read.csv("C:\\Temp\\GitHub\\WDI_Data.csv", 
                  header = TRUE, sep = ",")

Data Tidying

# Get data dimensions
dim(world)
## [1] 383328     61
# Get available columns
colnames(world)
##  [1] "Country.Name"   "Country.Code"   "Indicator.Name" "Indicator.Code"
##  [5] "X1960"          "X1961"          "X1962"          "X1963"         
##  [9] "X1964"          "X1965"          "X1966"          "X1967"         
## [13] "X1968"          "X1969"          "X1970"          "X1971"         
## [17] "X1972"          "X1973"          "X1974"          "X1975"         
## [21] "X1976"          "X1977"          "X1978"          "X1979"         
## [25] "X1980"          "X1981"          "X1982"          "X1983"         
## [29] "X1984"          "X1985"          "X1986"          "X1987"         
## [33] "X1988"          "X1989"          "X1990"          "X1991"         
## [37] "X1992"          "X1993"          "X1994"          "X1995"         
## [41] "X1996"          "X1997"          "X1998"          "X1999"         
## [45] "X2000"          "X2001"          "X2002"          "X2003"         
## [49] "X2004"          "X2005"          "X2006"          "X2007"         
## [53] "X2008"          "X2009"          "X2010"          "X2011"         
## [57] "X2012"          "X2013"          "X2014"          "X2015"         
## [61] "X2016"
# Adjust column names
colnames(world) <- str_replace_all(colnames(world), "X", "")

A portion of adjusted data:

Country.Name Country.Code Indicator.Code 1960 1961 1962 1963
Arab World ARB PA.NUS.PPP.05 NA NA NA NA
Arab World ARB PA.NUS.PRVT.PP.05 NA NA NA NA
Arab World ARB EG.ELC.ACCS.ZS NA NA NA NA
Arab World ARB EG.ELC.ACCS.RU.ZS NA NA NA NA
Arab World ARB EG.ELC.ACCS.UR.ZS NA NA NA NA
Arab World ARB EG.NSF.ACCS.ZS NA NA NA NA
Arab World ARB EG.NSF.ACCS.RU.ZS NA NA NA NA
Arab World ARB EG.NSF.ACCS.UR.ZS NA NA NA NA
Arab World ARB WP_time_01.1 NA NA NA NA
Arab World ARB WP_time_01.3 NA NA NA NA
# Convert table to the long form
world <- world %>% 
  gather(year, value, 5:61)

The data includes 1,452 indicators. I have noticed that there are several indicators tracking number of female and male over-age students. Being an over-age student myself, I have decided to do a quick analysis of this category. I did not go into detail what is implied by an over-age student in this data. Relevant indicators are as follows.

world %>% 
  filter(grepl('over-age',Indicator.Name, ignore.case = TRUE)) %>% 
  distinct(Indicator.Name)
##                                                Indicator.Name
## 1                Over-age students, primary (% of enrollment)
## 2 Over-age students, primary, female (% of female enrollment)
## 3     Over-age students, primary, male (% of male enrollment)
# Select data relevant to the indicators to be analyzed
# Remove NAs and zeros
# Put female, male and total numbers into separate columns
students <- world %>% 
  filter(grepl('over-age',Indicator.Name, ignore.case = TRUE), !is.na(value), value != 0)  %>% 
  select(Country.Name, Indicator.Code, year, value) %>% 
  spread(Indicator.Code, value)

# Adjust column names
colnames(students) <- c("country", "year", "students_female", "students_male", "students_all")

A sample of tidy data to be used for analysis:

country year students_female students_male students_all
Afghanistan 1974 11.76900 19.55751 18.43234
Afghanistan 1993 7.13094 7.13159 7.13143
Afghanistan 2014 5.65843 6.27605 6.02927
Albania 2013 7.16190 8.50432 7.87141
Albania 2014 5.97890 7.08751 6.56295
Algeria 1973 15.35597 19.18937 17.69170

Data Analysis

Create boxplots per year for all countries combines.

# Boxplots per year for all countries
ggplot(filter(students, !is.na(students_all)), aes(x = year, y = students_all)) + 
  geom_boxplot() + 
  labs(x = "Year", y = "% of Over-Age Students", caption = "Plot 3. Boxplots for all students by year.") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Create point plot per country per year.

ggplot(filter(students, !is.na(students_all)), aes(x = year, y = country)) + 
  geom_point(aes(color = students_all)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        axis.text.y = element_blank()) +
  scale_color_continuous(limits = c(0,30), name = "") + 
  labs(x = "", y = "Country", caption = "Plot 4. Availability of data by year and country.")

Conclusion

An interesting discovery looking at the boxplots is that the mean of number of over-age students dropped from 1960 to the present. Either the category implies adults going back for high-school level education (then the drop is a positive change) or adults continuing their higher level education (then I would consider the drop to be a negative change). Looking at the point graph, even with limited country detail, it is clear that there are a lot of gaps in the data especially between 1998 and 2012. Additionally, I have discovered that when in small data sets it really doesn’t matter what method you use to analyze it since results come up quickly, with large data sets, efficiency in code may have significant impact on efficiency of analysis.

DATA SET 3: Shipping Analysis

The data includes shipping costs and collections per country. This sample analysis can be used for operational improvements. I have picked this data set because it had a number of issues that required tidying up.

Data was provided by Cesar Espitia.

Requested Analysis: Do collected fees cover company’s cost? What is the shipping cost per month? Per country?

Data Import

shipping <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Discussion5.csv", 
                     header = TRUE, sep = ",")

Raw data:

Date Values China Canada Taiwan Singapore US England Korea Hong.Kong
12/29/16 Price of Carrier 518 NA NA NA NA NA NA NA
Shipping Fees Collected 260 NA NA NA NA NA NA NA
12/30/16 Price of Carrier 232 NA NA NA NA NA NA NA
Shipping Fees Collected 132 NA NA NA NA NA NA NA
1/3/17 Price of Carrier 1143.5 NA NA NA NA NA NA NA
Shipping Fees Collected 752 NA NA NA NA NA NA NA
1/4/17 Price of Carrier 732.5 NA NA NA NA NA NA NA
Shipping Fees Collected 448 NA NA NA NA NA NA NA
1/5/17 Price of Carrier 443 NA NA NA NA NA NA NA
Shipping Fees Collected 336 NA NA NA NA NA NA NA
1/6/17 Price of Carrier 480.5 NA NA NA NA NA NA NA
Shipping Fees Collected 236 NA NA NA NA NA NA NA
1/9/17 Price of Carrier 439.5 193.50000 NA NA NA NA 31.5 13
Shipping Fees Collected 316 228.00000 NA NA NA NA 36.0 20
1/10/17 Price of Carrier 318 279.55994 NA NA NA NA NA 13
Shipping Fees Collected 108 336.00000 NA NA NA NA NA 20
1/11/17 Price of Carrier 325 NA NA 130.50000 NA NA NA NA
Shipping Fees Collected 168 NA NA 30.00000 NA NA NA NA
1/12/17 Price of Carrier 230 116.99997 NA NA NA NA NA NA
Shipping Fees Collected 132 150.00000 NA NA NA NA NA NA
1/13/17 Price of Carrier 229 NA NA NA NA NA NA NA
Shipping Fees Collected #N/A NA NA NA NA NA NA NA
1/16/17 Price of Carrier 189.5 NA 34 53.99999 NA NA NA NA
Shipping Fees Collected 144 NA 45 72.00000 NA NA NA NA
1/17/17 Price of Carrier 470.5 NA NA NA NA NA NA 33
Shipping Fees Collected 196 NA NA NA NA NA NA 45
1/18/17 Price of Carrier 311.5 NA 13 58.79999 NA NA NA NA
Shipping Fees Collected 212 NA 0 66.00000 NA NA NA NA
1/19/17 Price of Carrier 266.5 NA 13 NA NA NA NA NA
Shipping Fees Collected 92 NA 20 NA NA NA NA NA
1/20/17 Price of Carrier 182 NA NA NA 20.92996 NA NA NA
Shipping Fees Collected 148 NA NA NA 85.87000 NA NA NA
1/23/17 Price of Carrier 117 52.20000 111 NA NA NA NA NA
Shipping Fees Collected 64 66.00000 50 NA NA NA NA NA
1/24/17 Price of Carrier 144.5 76.49999 NA NA NA NA NA NA
Shipping Fees Collected 64 90.00000 NA NA NA NA NA NA
1/25/17 Price of Carrier 95.5 NA NA 27.00000 NA NA NA NA
Shipping Fees Collected 76 NA NA 36.00000 NA NA NA NA
1/26/17 Price of Carrier 84 53.99999 NA NA 37.21999 NA NA NA
Shipping Fees Collected 40 66.00000 NA NA 29.82000 NA NA NA
1/27/17 Price of Carrier 121 NA NA 31.50000 NA NA NA NA
Shipping Fees Collected 48 NA NA 36.00000 NA NA NA NA
1/30/17 Price of Carrier 135.5 NA NA NA NA NA NA NA
Shipping Fees Collected 92 NA NA NA NA NA NA NA
1/31/17 Price of Carrier 33.5 31.50000 NA NA 38.51000 NA NA NA
Shipping Fees Collected 16 42.00000 NA NA 33.23000 NA NA NA
2/1/17 Price of Carrier 172.5 31.50000 NA NA 146.80994 NA NA NA
Shipping Fees Collected 76 42.00000 NA NA 121.25000 NA NA NA
2/2/17 Price of Carrier 165 NA NA NA NA NA NA NA
Shipping Fees Collected 40 NA NA NA NA NA NA NA
2/3/17 Price of Carrier 166 125.99998 13 NA 10.33000 31.50000 NA NA
Shipping Fees Collected 100 156.00000 20 NA 0.00000 42.00000 NA NA
2/6/17 Price of Carrier 343 58.50000 68 NA 94.10999 89.99997 NA 135
Shipping Fees Collected 236 78.00000 90 NA 97.39000 120.00000 NA 115
2/7/17 Price of Carrier NA 13 NA 87.83999 NA NA NA
Shipping Fees Collected NA 20 NA 70.28000 NA NA NA

Data Tidying

# Populate blank values in the Date column with a date from previous row
shipping[c(FALSE, TRUE),1] <- shipping[c(TRUE, FALSE), 1]

# Rearrange data
shipping <- shipping %>% 
  gather(Country, Amount, 3:10) %>%     # Collapse data into a narrow format
  filter(!is.na(Amount)) %>%            # Remove NAs
  spread(Values, Amount)                # Expand Cost and Collected values into two columns

# Adjust column names
colnames(shipping) <- c("Date", "Country", "Cost", "Collected")

# Convert cost and collected amounts to number
shipping$Cost <- as.numeric(shipping$Cost)
shipping$Collected <- as.numeric(shipping$Collected)

# Remove records that did not have numeric Cost or Collected values
shipping <- shipping %>% filter(!is.na(Cost), !is.na(Collected))

Tidy data:

Date Country Cost Collected
1/10/17 Canada 279.55994 336.00
1/10/17 China 318.00000 108.00
1/10/17 Hong.Kong 13.00000 20.00
1/11/17 China 325.00000 168.00
1/11/17 Singapore 130.50000 30.00
1/12/17 Canada 116.99997 150.00
1/12/17 China 230.00000 132.00
1/16/17 China 189.50000 144.00
1/16/17 Singapore 53.99999 72.00
1/16/17 Taiwan 34.00000 45.00
1/17/17 China 470.50000 196.00
1/17/17 Hong.Kong 33.00000 45.00
1/18/17 China 311.50000 212.00
1/18/17 Singapore 58.79999 66.00
1/18/17 Taiwan 13.00000 0.00
1/19/17 China 266.50000 92.00
1/19/17 Taiwan 13.00000 20.00
1/20/17 China 182.00000 148.00
1/20/17 US 20.92996 85.87
1/23/17 Canada 52.20000 66.00
1/23/17 China 117.00000 64.00
1/23/17 Taiwan 111.00000 50.00
1/24/17 Canada 76.49999 90.00
1/24/17 China 144.50000 64.00
1/25/17 China 95.50000 76.00
1/25/17 Singapore 27.00000 36.00
1/26/17 Canada 53.99999 66.00
1/26/17 China 84.00000 40.00
1/26/17 US 37.21999 29.82
1/27/17 China 121.00000 48.00
1/27/17 Singapore 31.50000 36.00
1/3/17 China 1143.50000 752.00
1/30/17 China 135.50000 92.00
1/31/17 Canada 31.50000 42.00
1/31/17 China 33.50000 16.00
1/31/17 US 38.51000 33.23
1/4/17 China 732.50000 448.00
1/5/17 China 443.00000 336.00
1/6/17 China 480.50000 236.00
1/9/17 Canada 193.50000 228.00
1/9/17 China 439.50000 316.00
1/9/17 Hong.Kong 13.00000 20.00
1/9/17 Korea 31.50000 36.00
12/29/16 China 518.00000 260.00
12/30/16 China 232.00000 132.00
2/1/17 Canada 31.50000 42.00
2/1/17 China 172.50000 76.00
2/1/17 US 146.80994 121.25
2/2/17 China 165.00000 40.00
2/3/17 Canada 125.99998 156.00
2/3/17 China 166.00000 100.00
2/3/17 England 31.50000 42.00
2/3/17 Taiwan 13.00000 20.00
2/3/17 US 10.33000 0.00
2/6/17 Canada 58.50000 78.00
2/6/17 China 343.00000 236.00
2/6/17 England 89.99997 120.00
2/6/17 Hong.Kong 134.99998 115.00
2/6/17 Taiwan 68.00000 90.00
2/6/17 US 94.10999 97.39
2/7/17 Taiwan 13.00000 20.00
2/7/17 US 87.83999 70.28

Data Analysis

# Get country specific data
countrydata <- shipping %>% 
  group_by(Country) %>% 
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%  # Total amounts per country 
  mutate(Total = SumCollected - SumCost) %>%                         # Get the difference between amounts
  arrange(Total)
Country SumCost SumCollected Total
China 7859.5000 4532.00 -3327.500000
Singapore 301.8000 240.00 -61.799971
Taiwan 265.0000 245.00 -20.000000
US 435.7499 437.84 2.090127
Korea 31.5000 36.00 4.500002
Hong.Kong 194.0000 200.00 6.000025
England 121.5000 162.00 40.500031
Canada 1020.2599 1254.00 233.740126
# Get monthly data
monthdata <- shipping %>% 
  separate(Date, c("Month", "Day", "Year"), sep = "/", remove=FALSE) %>%   # Split date into components
  group_by(Month, Year) %>%                                                # Group by month/year
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
  mutate(Total = SumCollected - SumCost, 
         MonthNo = as.numeric(Month),                      # Convert month to number for sorting
         YearNo = as.numeric(Year),                        # Convert year to number for sorting
         Period = paste(Month, Year, sep="/")) %>%         # Combine month and year for output
  arrange(YearNo, MonthNo) %>%                             # Sort
  group_by() %>%                                           # Remove grouping
  select(Period, SumCost, SumCollected, Total)             # Select field for output
Period SumCost SumCollected Total
12/16 750.00 392.00 -358.0000
1/17 7727.22 5290.92 -2436.2998
2/17 1752.09 1423.92 -328.1698

Conclusion

Based on the sample data it does not appear that the company recoups shipping charges. Shipping to China generates particularly big losses, but shipping to Canada is slightly profitable. Monthly data can be easily generated using dplyr and tidyr with provided format.