DATA 607 Project 2

Project 2: Data Analysis

Setup

library(RCurl)
library(tidyr)
library(dplyr)
library(knitr)
library(ggplot2)
library(stringr)

DATA SET 1: Congressional Voting Records

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes in 1984 identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).

Data was identified by Nkasi Nedd and is available at UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).

Requested Analysis: Which issue attracted the most difference/common ground between members of congress according to party affiliation?

Data Import

votes <- read.csv(text=getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"), 
                  header = FALSE, sep = ",")

Sample of raw data:

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17
republican	n	y	n	y	y	y	n	n	n	y	?	y	y	y	n	y
republican	n	y	n	y	y	y	n	n	n	n	n	y	y	y	n	?
democrat	?	y	y	?	y	y	n	n	n	n	y	n	y	y	n	n
democrat	n	y	y	n	?	y	n	n	n	n	y	n	y	n	n	y
democrat	y	y	y	n	y	y	n	n	n	n	y	?	y	y	y	y
democrat	n	y	y	n	y	y	n	n	n	n	n	n	y	y	y	y

Data Tidying

# Set column names
names(votes) <- c("party",
                  "handicapped-infants",
                  "water-project-cost-sharing",
                  "adoption-of-the-budget-resolution",
                  "physician-fee-freeze",
                  "el-salvador-aid",
                  "religious-groups-in-schools",
                  "anti-satellite-test-ban",
                  "aid-to-nicaraguan-contras",
                  "mx-missile",
                  "immigration",
                  "synfuels-corporation-cutback",
                  "education-spending",
                  "superfund-right-to-sue",
                  "crime",
                  "duty-free-exports",
                  "export-administration-act-south-africa")

# Gather all data into a tidy format with 3 variables - party, issue and vote
votes <- votes %>% gather(issue, vote, 2:17)

# Update vote variable with more descriptive values
votes$vote[votes$vote == 'y'] <- "Yes"
votes$vote[votes$vote == 'n'] <- "No"
votes$vote[votes$vote == '?'] <- "Unknown"

Sample of tidy data:

party	issue	vote
republican	handicapped-infants	No
republican	handicapped-infants	No
democrat	handicapped-infants	Unknown
democrat	handicapped-infants	No
democrat	handicapped-infants	Yes
democrat	handicapped-infants	No

Data Analysis

# Get counts of each vote type by issue
votesum <- votes %>% 
  group_by(issue, vote) %>% 
  summarise(count = n()) %>% 
  arrange(desc(vote), count)

# Convert issues to factor to preserve sort order
votesum$issue <- factor(votesum$issue, levels = unique(votesum$issue))

# Plot vote counts by issue
ggplot(votesum, aes(issue, count, fill = vote)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "", y = "", caption="Plot 1. Split of votes per issue.") +
  scale_fill_manual(values=c("#FF6666", "#C3C3C3", "#49AE49"), name = "Vote")

The plot above shows all 16 issues and illustrates Yes, No, Unknown votes per issue. Party affiliation is not included in this plot. It appears that the 98th Congress was not close to being unanimous on any of the issues. As a side quest, I became curious why so many votes show up as Unknown for export-administration-act-south-africa issue. It appears that the 1984 amendement to the Export Administration Act was a particularly contentious issue during the Reagan Administration.

In order to evaluate which issue had or lacked bipartisan support, I concentrated analysis only on Yea votes. By evaluating where the party split is among the Yea votes, it can be assumed that a similar split exists in the Nay votes. Possible drawback of this approach is that Unknown votes are not considered; however, it should be enough for a quick analysis (especially considering that this data set already includes simplification of congressional votes).

# Ananlyze Yea votes
yesvotes <- votes %>% 
  filter(vote == "Yes") %>%               # Get only Yea votes
  group_by(party, issue) %>%              # Get counts of votes per party and issue
  summarise(count = n()) %>% 
  spread(party, count) %>%                # Spread out results for additional calculations 
  mutate(sum = democrat+republican,       # Number of votes per issue
         demprop = democrat / sum,        # Proportion of democratic votes per issue
         repprop = republican / sum,      # Proportion of repubican votes per issue
         dist05 = abs(0.5-demprop)) %>%   # Distance of split from 0.5 (closer = bipartisan support)
  arrange(dist05)

Analysis of Yea votes per issue and per party affiliation:

issue	democrat	republican	sum	demprop	repprop	dist05
religious-groups-in-schools	123	149	272	0.4522059	0.5477941	0.0477941
immigration	124	92	216	0.5740741	0.4259259	0.0740741
water-project-cost-sharing	120	75	195	0.6153846	0.3846154	0.1153846
crime	90	158	248	0.3629032	0.6370968	0.1370968
export-administration-act-south-africa	173	96	269	0.6431227	0.3568773	0.1431227
superfund-right-to-sue	73	136	209	0.3492823	0.6507177	0.1507177
el-salvador-aid	55	157	212	0.2594340	0.7405660	0.2405660
education-spending	36	135	171	0.2105263	0.7894737	0.2894737
handicapped-infants	156	31	187	0.8342246	0.1657754	0.3342246
anti-satellite-test-ban	200	39	239	0.8368201	0.1631799	0.3368201
synfuels-corporation-cutback	129	21	150	0.8600000	0.1400000	0.3600000
aid-to-nicaraguan-contras	218	24	242	0.9008264	0.0991736	0.4008264
mx-missile	188	19	207	0.9082126	0.0917874	0.4082126
adoption-of-the-budget-resolution	231	22	253	0.9130435	0.0869565	0.4130435
duty-free-exports	160	14	174	0.9195402	0.0804598	0.4195402
physician-fee-freeze	14	163	177	0.0790960	0.9209040	0.4209040

# Sort and gather data for plotting
yesplot <- yesvotes %>% 
  arrange(demprop) %>% 
  gather(party, yescount, 5:6)

# Convert issue to factor to preserve order
yesplot$issue <- factor(yesplot$issue, levels = unique(yesplot$issue))

# Plot split in Yea votes per issue
ggplot(yesplot, aes(issue, yescount, fill = party)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "", y = "", caption="Plot 2. Split of affirmative votes per party.") +
  scale_fill_manual(values=c("#6969FF", "#FF6E6E"), name = "Party", labels = c("Democratic", "Republican"))

The plot above that demonstrates the split between party support of issues is somewhat confusing. It may appear that the duty-free-exports issue was the most contentious one; however, we do not care if an issue was supported more by republican or democractic party only which issue had the most support from one party. For this let us find the issue closest to the 50-50 split between parties and the issue furthest from the split.

yesvotes %>% 
  filter(dist05 == max(dist05)) %>% 
  select(issue)

## # A tibble: 1 × 1
##                  issue
##                  <chr>
## 1 physician-fee-freeze

yesvotes %>% 
  filter(dist05 == min(dist05)) %>% 
  select(issue)

## # A tibble: 1 × 1
##                         issue
##                         <chr>
## 1 religious-groups-in-schools

Conclusion

It appears that physician-fee-freeze was the issue that was split the most along party lines in 1984. The religious-groups-in-school issue enjoyed the most bipartisan support. I became interested in the education-spending issue which seemed to enjoy a lot of republican support in 1984. I consider education to be a topic primarily supported by democrats. Researching the issue further I discovered that the act lowered education spending. That demonstrated that issue names are limited in conveying their importance or meaning. It is critical to know the background and details of data for any real-world analysis.

DATA SET 2: World Bank Data

The data includes world development indicators from 1960 to 2016 covering 217 economies. It is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

Data was identified by Kyle Gilde and is available at the World Bank website (http://data.worldbank.org/data-catalog/world-development-indicators).

Requested Analysis: Specific analysis was not requested. I picked this data set due to its large size to see how data size influences the analysis in R.

Data Import

world <- read.csv("C:\\Temp\\GitHub\\WDI_Data.csv", 
                  header = TRUE, sep = ",")

Data Tidying

# Get data dimensions
dim(world)

## [1] 383328     61

# Get available columns
colnames(world)

##  [1] "Country.Name"   "Country.Code"   "Indicator.Name" "Indicator.Code"
##  [5] "X1960"          "X1961"          "X1962"          "X1963"         
##  [9] "X1964"          "X1965"          "X1966"          "X1967"         
## [13] "X1968"          "X1969"          "X1970"          "X1971"         
## [17] "X1972"          "X1973"          "X1974"          "X1975"         
## [21] "X1976"          "X1977"          "X1978"          "X1979"         
## [25] "X1980"          "X1981"          "X1982"          "X1983"         
## [29] "X1984"          "X1985"          "X1986"          "X1987"         
## [33] "X1988"          "X1989"          "X1990"          "X1991"         
## [37] "X1992"          "X1993"          "X1994"          "X1995"         
## [41] "X1996"          "X1997"          "X1998"          "X1999"         
## [45] "X2000"          "X2001"          "X2002"          "X2003"         
## [49] "X2004"          "X2005"          "X2006"          "X2007"         
## [53] "X2008"          "X2009"          "X2010"          "X2011"         
## [57] "X2012"          "X2013"          "X2014"          "X2015"         
## [61] "X2016"

# Adjust column names
colnames(world) <- str_replace_all(colnames(world), "X", "")

A portion of adjusted data:

Country.Name	Country.Code	Indicator.Code	1960	1961	1962	1963
Arab World	ARB	PA.NUS.PPP.05	NA	NA	NA	NA
Arab World	ARB	PA.NUS.PRVT.PP.05	NA	NA	NA	NA
Arab World	ARB	EG.ELC.ACCS.ZS	NA	NA	NA	NA
Arab World	ARB	EG.ELC.ACCS.RU.ZS	NA	NA	NA	NA
Arab World	ARB	EG.ELC.ACCS.UR.ZS	NA	NA	NA	NA
Arab World	ARB	EG.NSF.ACCS.ZS	NA	NA	NA	NA
Arab World	ARB	EG.NSF.ACCS.RU.ZS	NA	NA	NA	NA
Arab World	ARB	EG.NSF.ACCS.UR.ZS	NA	NA	NA	NA
Arab World	ARB	WP_time_01.1	NA	NA	NA	NA
Arab World	ARB	WP_time_01.3	NA	NA	NA	NA

# Convert table to the long form
world <- world %>% 
  gather(year, value, 5:61)

The data includes 1,452 indicators. I have noticed that there are several indicators tracking number of female and male over-age students. Being an over-age student myself, I have decided to do a quick analysis of this category. I did not go into detail what is implied by an over-age student in this data. Relevant indicators are as follows.

world %>% 
  filter(grepl('over-age',Indicator.Name, ignore.case = TRUE)) %>% 
  distinct(Indicator.Name)

##                                                Indicator.Name
## 1                Over-age students, primary (% of enrollment)
## 2 Over-age students, primary, female (% of female enrollment)
## 3     Over-age students, primary, male (% of male enrollment)

# Select data relevant to the indicators to be analyzed
# Remove NAs and zeros
# Put female, male and total numbers into separate columns
students <- world %>% 
  filter(grepl('over-age',Indicator.Name, ignore.case = TRUE), !is.na(value), value != 0)  %>% 
  select(Country.Name, Indicator.Code, year, value) %>% 
  spread(Indicator.Code, value)

# Adjust column names
colnames(students) <- c("country", "year", "students_female", "students_male", "students_all")

A sample of tidy data to be used for analysis:

country	year	students_female	students_male	students_all
Afghanistan	1974	11.76900	19.55751	18.43234
Afghanistan	1993	7.13094	7.13159	7.13143
Afghanistan	2014	5.65843	6.27605	6.02927
Albania	2013	7.16190	8.50432	7.87141
Albania	2014	5.97890	7.08751	6.56295
Algeria	1973	15.35597	19.18937	17.69170

Data Analysis

Create boxplots per year for all countries combines.

# Boxplots per year for all countries
ggplot(filter(students, !is.na(students_all)), aes(x = year, y = students_all)) + 
  geom_boxplot() + 
  labs(x = "Year", y = "% of Over-Age Students", caption = "Plot 3. Boxplots for all students by year.") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Create point plot per country per year.

ggplot(filter(students, !is.na(students_all)), aes(x = year, y = country)) + 
  geom_point(aes(color = students_all)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        axis.text.y = element_blank()) +
  scale_color_continuous(limits = c(0,30), name = "") + 
  labs(x = "", y = "Country", caption = "Plot 4. Availability of data by year and country.")

Conclusion

An interesting discovery looking at the boxplots is that the mean of number of over-age students dropped from 1960 to the present. Either the category implies adults going back for high-school level education (then the drop is a positive change) or adults continuing their higher level education (then I would consider the drop to be a negative change). Looking at the point graph, even with limited country detail, it is clear that there are a lot of gaps in the data especially between 1998 and 2012. Additionally, I have discovered that when in small data sets it really doesn’t matter what method you use to analyze it since results come up quickly, with large data sets, efficiency in code may have significant impact on efficiency of analysis.

DATA SET 3: Shipping Analysis

The data includes shipping costs and collections per country. This sample analysis can be used for operational improvements. I have picked this data set because it had a number of issues that required tidying up.

Data was provided by Cesar Espitia.

Requested Analysis: Do collected fees cover company’s cost? What is the shipping cost per month? Per country?

Data Import

shipping <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Discussion5.csv", 
                     header = TRUE, sep = ",")

Raw data:

Date	Values	China	Canada	Taiwan	Singapore	US	England	Korea	Hong.Kong
12/29/16	Price of Carrier	518	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	260	NA	NA	NA	NA	NA	NA	NA
12/30/16	Price of Carrier	232	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	132	NA	NA	NA	NA	NA	NA	NA
1/3/17	Price of Carrier	1143.5	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	752	NA	NA	NA	NA	NA	NA	NA
1/4/17	Price of Carrier	732.5	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	448	NA	NA	NA	NA	NA	NA	NA
1/5/17	Price of Carrier	443	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	336	NA	NA	NA	NA	NA	NA	NA
1/6/17	Price of Carrier	480.5	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	236	NA	NA	NA	NA	NA	NA	NA
1/9/17	Price of Carrier	439.5	193.50000	NA	NA	NA	NA	31.5	13
	Shipping Fees Collected	316	228.00000	NA	NA	NA	NA	36.0	20
1/10/17	Price of Carrier	318	279.55994	NA	NA	NA	NA	NA	13
	Shipping Fees Collected	108	336.00000	NA	NA	NA	NA	NA	20
1/11/17	Price of Carrier	325	NA	NA	130.50000	NA	NA	NA	NA
	Shipping Fees Collected	168	NA	NA	30.00000	NA	NA	NA	NA
1/12/17	Price of Carrier	230	116.99997	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	132	150.00000	NA	NA	NA	NA	NA	NA
1/13/17	Price of Carrier	229	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	#N/A	NA	NA	NA	NA	NA	NA	NA
1/16/17	Price of Carrier	189.5	NA	34	53.99999	NA	NA	NA	NA
	Shipping Fees Collected	144	NA	45	72.00000	NA	NA	NA	NA
1/17/17	Price of Carrier	470.5	NA	NA	NA	NA	NA	NA	33
	Shipping Fees Collected	196	NA	NA	NA	NA	NA	NA	45
1/18/17	Price of Carrier	311.5	NA	13	58.79999	NA	NA	NA	NA
	Shipping Fees Collected	212	NA	0	66.00000	NA	NA	NA	NA
1/19/17	Price of Carrier	266.5	NA	13	NA	NA	NA	NA	NA
	Shipping Fees Collected	92	NA	20	NA	NA	NA	NA	NA
1/20/17	Price of Carrier	182	NA	NA	NA	20.92996	NA	NA	NA
	Shipping Fees Collected	148	NA	NA	NA	85.87000	NA	NA	NA
1/23/17	Price of Carrier	117	52.20000	111	NA	NA	NA	NA	NA
	Shipping Fees Collected	64	66.00000	50	NA	NA	NA	NA	NA
1/24/17	Price of Carrier	144.5	76.49999	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	64	90.00000	NA	NA	NA	NA	NA	NA
1/25/17	Price of Carrier	95.5	NA	NA	27.00000	NA	NA	NA	NA
	Shipping Fees Collected	76	NA	NA	36.00000	NA	NA	NA	NA
1/26/17	Price of Carrier	84	53.99999	NA	NA	37.21999	NA	NA	NA
	Shipping Fees Collected	40	66.00000	NA	NA	29.82000	NA	NA	NA
1/27/17	Price of Carrier	121	NA	NA	31.50000	NA	NA	NA	NA
	Shipping Fees Collected	48	NA	NA	36.00000	NA	NA	NA	NA
1/30/17	Price of Carrier	135.5	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	92	NA	NA	NA	NA	NA	NA	NA
1/31/17	Price of Carrier	33.5	31.50000	NA	NA	38.51000	NA	NA	NA
	Shipping Fees Collected	16	42.00000	NA	NA	33.23000	NA	NA	NA
2/1/17	Price of Carrier	172.5	31.50000	NA	NA	146.80994	NA	NA	NA
	Shipping Fees Collected	76	42.00000	NA	NA	121.25000	NA	NA	NA
2/2/17	Price of Carrier	165	NA	NA	NA	NA	NA	NA	NA
	Shipping Fees Collected	40	NA	NA	NA	NA	NA	NA	NA
2/3/17	Price of Carrier	166	125.99998	13	NA	10.33000	31.50000	NA	NA
	Shipping Fees Collected	100	156.00000	20	NA	0.00000	42.00000	NA	NA
2/6/17	Price of Carrier	343	58.50000	68	NA	94.10999	89.99997	NA	135
	Shipping Fees Collected	236	78.00000	90	NA	97.39000	120.00000	NA	115
2/7/17	Price of Carrier		NA	13	NA	87.83999	NA	NA	NA
	Shipping Fees Collected		NA	20	NA	70.28000	NA	NA	NA

Data Tidying

# Populate blank values in the Date column with a date from previous row
shipping[c(FALSE, TRUE),1] <- shipping[c(TRUE, FALSE), 1]

# Rearrange data
shipping <- shipping %>% 
  gather(Country, Amount, 3:10) %>%     # Collapse data into a narrow format
  filter(!is.na(Amount)) %>%            # Remove NAs
  spread(Values, Amount)                # Expand Cost and Collected values into two columns

# Adjust column names
colnames(shipping) <- c("Date", "Country", "Cost", "Collected")

# Convert cost and collected amounts to number
shipping$Cost <- as.numeric(shipping$Cost)
shipping$Collected <- as.numeric(shipping$Collected)

# Remove records that did not have numeric Cost or Collected values
shipping <- shipping %>% filter(!is.na(Cost), !is.na(Collected))

Tidy data:

Date	Country	Cost	Collected
1/10/17	Canada	279.55994	336.00
1/10/17	China	318.00000	108.00
1/10/17	Hong.Kong	13.00000	20.00
1/11/17	China	325.00000	168.00
1/11/17	Singapore	130.50000	30.00
1/12/17	Canada	116.99997	150.00
1/12/17	China	230.00000	132.00
1/16/17	China	189.50000	144.00
1/16/17	Singapore	53.99999	72.00
1/16/17	Taiwan	34.00000	45.00
1/17/17	China	470.50000	196.00
1/17/17	Hong.Kong	33.00000	45.00
1/18/17	China	311.50000	212.00
1/18/17	Singapore	58.79999	66.00
1/18/17	Taiwan	13.00000	0.00
1/19/17	China	266.50000	92.00
1/19/17	Taiwan	13.00000	20.00
1/20/17	China	182.00000	148.00
1/20/17	US	20.92996	85.87
1/23/17	Canada	52.20000	66.00
1/23/17	China	117.00000	64.00
1/23/17	Taiwan	111.00000	50.00
1/24/17	Canada	76.49999	90.00
1/24/17	China	144.50000	64.00
1/25/17	China	95.50000	76.00
1/25/17	Singapore	27.00000	36.00
1/26/17	Canada	53.99999	66.00
1/26/17	China	84.00000	40.00
1/26/17	US	37.21999	29.82
1/27/17	China	121.00000	48.00
1/27/17	Singapore	31.50000	36.00
1/3/17	China	1143.50000	752.00
1/30/17	China	135.50000	92.00
1/31/17	Canada	31.50000	42.00
1/31/17	China	33.50000	16.00
1/31/17	US	38.51000	33.23
1/4/17	China	732.50000	448.00
1/5/17	China	443.00000	336.00
1/6/17	China	480.50000	236.00
1/9/17	Canada	193.50000	228.00
1/9/17	China	439.50000	316.00
1/9/17	Hong.Kong	13.00000	20.00
1/9/17	Korea	31.50000	36.00
12/29/16	China	518.00000	260.00
12/30/16	China	232.00000	132.00
2/1/17	Canada	31.50000	42.00
2/1/17	China	172.50000	76.00
2/1/17	US	146.80994	121.25
2/2/17	China	165.00000	40.00
2/3/17	Canada	125.99998	156.00
2/3/17	China	166.00000	100.00
2/3/17	England	31.50000	42.00
2/3/17	Taiwan	13.00000	20.00
2/3/17	US	10.33000	0.00
2/6/17	Canada	58.50000	78.00
2/6/17	China	343.00000	236.00
2/6/17	England	89.99997	120.00
2/6/17	Hong.Kong	134.99998	115.00
2/6/17	Taiwan	68.00000	90.00
2/6/17	US	94.10999	97.39
2/7/17	Taiwan	13.00000	20.00
2/7/17	US	87.83999	70.28

Data Analysis

# Get country specific data
countrydata <- shipping %>% 
  group_by(Country) %>% 
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%  # Total amounts per country 
  mutate(Total = SumCollected - SumCost) %>%                         # Get the difference between amounts
  arrange(Total)

Country	SumCost	SumCollected	Total
China	7859.5000	4532.00	-3327.500000
Singapore	301.8000	240.00	-61.799971
Taiwan	265.0000	245.00	-20.000000
US	435.7499	437.84	2.090127
Korea	31.5000	36.00	4.500002
Hong.Kong	194.0000	200.00	6.000025
England	121.5000	162.00	40.500031
Canada	1020.2599	1254.00	233.740126

# Get monthly data
monthdata <- shipping %>% 
  separate(Date, c("Month", "Day", "Year"), sep = "/", remove=FALSE) %>%   # Split date into components
  group_by(Month, Year) %>%                                                # Group by month/year
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
  mutate(Total = SumCollected - SumCost, 
         MonthNo = as.numeric(Month),                      # Convert month to number for sorting
         YearNo = as.numeric(Year),                        # Convert year to number for sorting
         Period = paste(Month, Year, sep="/")) %>%         # Combine month and year for output
  arrange(YearNo, MonthNo) %>%                             # Sort
  group_by() %>%                                           # Remove grouping
  select(Period, SumCost, SumCollected, Total)             # Select field for output

Period	SumCost	SumCollected	Total
12/16	750.00	392.00	-358.0000
1/17	7727.22	5290.92	-2436.2998
2/17	1752.09	1423.92	-328.1698

Conclusion

Based on the sample data it does not appear that the company recoups shipping charges. Shipping to China generates particularly big losses, but shipping to Canada is slightly profitable. Monthly data can be easily generated using dplyr and tidyr with provided format.

DATA 607 Project 2

Ilya Kats

March 12, 2017

Project 2: Data Analysis

Setup

DATA SET 1: Congressional Voting Records

Data Import

Data Tidying

Data Analysis

Conclusion

DATA SET 2: World Bank Data

Data Import

Data Tidying

Data Analysis

Conclusion

DATA SET 3: Shipping Analysis

Data Import

Data Tidying

Data Analysis

Conclusion