library(RCurl)
library(tidyr)
library(dplyr)
library(knitr)
library(ggplot2)
library(stringr)
This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes in 1984 identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
Data was identified by Nkasi Nedd and is available at UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).
Requested Analysis: Which issue attracted the most difference/common ground between members of congress according to party affiliation?
votes <- read.csv(text=getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"),
header = FALSE, sep = ",")
Sample of raw data:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| republican | n | y | n | y | y | y | n | n | n | y | ? | y | y | y | n | y |
| republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | ? |
| democrat | ? | y | y | ? | y | y | n | n | n | n | y | n | y | y | n | n |
| democrat | n | y | y | n | ? | y | n | n | n | n | y | n | y | n | n | y |
| democrat | y | y | y | n | y | y | n | n | n | n | y | ? | y | y | y | y |
| democrat | n | y | y | n | y | y | n | n | n | n | n | n | y | y | y | y |
# Set column names
names(votes) <- c("party",
"handicapped-infants",
"water-project-cost-sharing",
"adoption-of-the-budget-resolution",
"physician-fee-freeze",
"el-salvador-aid",
"religious-groups-in-schools",
"anti-satellite-test-ban",
"aid-to-nicaraguan-contras",
"mx-missile",
"immigration",
"synfuels-corporation-cutback",
"education-spending",
"superfund-right-to-sue",
"crime",
"duty-free-exports",
"export-administration-act-south-africa")
# Gather all data into a tidy format with 3 variables - party, issue and vote
votes <- votes %>% gather(issue, vote, 2:17)
# Update vote variable with more descriptive values
votes$vote[votes$vote == 'y'] <- "Yes"
votes$vote[votes$vote == 'n'] <- "No"
votes$vote[votes$vote == '?'] <- "Unknown"
Sample of tidy data:
| party | issue | vote |
|---|---|---|
| republican | handicapped-infants | No |
| republican | handicapped-infants | No |
| democrat | handicapped-infants | Unknown |
| democrat | handicapped-infants | No |
| democrat | handicapped-infants | Yes |
| democrat | handicapped-infants | No |
# Get counts of each vote type by issue
votesum <- votes %>%
group_by(issue, vote) %>%
summarise(count = n()) %>%
arrange(desc(vote), count)
# Convert issues to factor to preserve sort order
votesum$issue <- factor(votesum$issue, levels = unique(votesum$issue))
# Plot vote counts by issue
ggplot(votesum, aes(issue, count, fill = vote)) +
geom_col() +
coord_flip() +
labs(x = "", y = "", caption="Plot 1. Split of votes per issue.") +
scale_fill_manual(values=c("#FF6666", "#C3C3C3", "#49AE49"), name = "Vote")
The plot above shows all 16 issues and illustrates Yes, No, Unknown votes per issue. Party affiliation is not included in this plot. It appears that the 98th Congress was not close to being unanimous on any of the issues. As a side quest, I became curious why so many votes show up as Unknown for export-administration-act-south-africa issue. It appears that the 1984 amendement to the Export Administration Act was a particularly contentious issue during the Reagan Administration.
In order to evaluate which issue had or lacked bipartisan support, I concentrated analysis only on Yea votes. By evaluating where the party split is among the Yea votes, it can be assumed that a similar split exists in the Nay votes. Possible drawback of this approach is that Unknown votes are not considered; however, it should be enough for a quick analysis (especially considering that this data set already includes simplification of congressional votes).
# Ananlyze Yea votes
yesvotes <- votes %>%
filter(vote == "Yes") %>% # Get only Yea votes
group_by(party, issue) %>% # Get counts of votes per party and issue
summarise(count = n()) %>%
spread(party, count) %>% # Spread out results for additional calculations
mutate(sum = democrat+republican, # Number of votes per issue
demprop = democrat / sum, # Proportion of democratic votes per issue
repprop = republican / sum, # Proportion of repubican votes per issue
dist05 = abs(0.5-demprop)) %>% # Distance of split from 0.5 (closer = bipartisan support)
arrange(dist05)
Analysis of Yea votes per issue and per party affiliation:
| issue | democrat | republican | sum | demprop | repprop | dist05 |
|---|---|---|---|---|---|---|
| religious-groups-in-schools | 123 | 149 | 272 | 0.4522059 | 0.5477941 | 0.0477941 |
| immigration | 124 | 92 | 216 | 0.5740741 | 0.4259259 | 0.0740741 |
| water-project-cost-sharing | 120 | 75 | 195 | 0.6153846 | 0.3846154 | 0.1153846 |
| crime | 90 | 158 | 248 | 0.3629032 | 0.6370968 | 0.1370968 |
| export-administration-act-south-africa | 173 | 96 | 269 | 0.6431227 | 0.3568773 | 0.1431227 |
| superfund-right-to-sue | 73 | 136 | 209 | 0.3492823 | 0.6507177 | 0.1507177 |
| el-salvador-aid | 55 | 157 | 212 | 0.2594340 | 0.7405660 | 0.2405660 |
| education-spending | 36 | 135 | 171 | 0.2105263 | 0.7894737 | 0.2894737 |
| handicapped-infants | 156 | 31 | 187 | 0.8342246 | 0.1657754 | 0.3342246 |
| anti-satellite-test-ban | 200 | 39 | 239 | 0.8368201 | 0.1631799 | 0.3368201 |
| synfuels-corporation-cutback | 129 | 21 | 150 | 0.8600000 | 0.1400000 | 0.3600000 |
| aid-to-nicaraguan-contras | 218 | 24 | 242 | 0.9008264 | 0.0991736 | 0.4008264 |
| mx-missile | 188 | 19 | 207 | 0.9082126 | 0.0917874 | 0.4082126 |
| adoption-of-the-budget-resolution | 231 | 22 | 253 | 0.9130435 | 0.0869565 | 0.4130435 |
| duty-free-exports | 160 | 14 | 174 | 0.9195402 | 0.0804598 | 0.4195402 |
| physician-fee-freeze | 14 | 163 | 177 | 0.0790960 | 0.9209040 | 0.4209040 |
# Sort and gather data for plotting
yesplot <- yesvotes %>%
arrange(demprop) %>%
gather(party, yescount, 5:6)
# Convert issue to factor to preserve order
yesplot$issue <- factor(yesplot$issue, levels = unique(yesplot$issue))
# Plot split in Yea votes per issue
ggplot(yesplot, aes(issue, yescount, fill = party)) +
geom_col() +
coord_flip() +
labs(x = "", y = "", caption="Plot 2. Split of affirmative votes per party.") +
scale_fill_manual(values=c("#6969FF", "#FF6E6E"), name = "Party", labels = c("Democratic", "Republican"))
The plot above that demonstrates the split between party support of issues is somewhat confusing. It may appear that the duty-free-exports issue was the most contentious one; however, we do not care if an issue was supported more by republican or democractic party only which issue had the most support from one party. For this let us find the issue closest to the 50-50 split between parties and the issue furthest from the split.
yesvotes %>%
filter(dist05 == max(dist05)) %>%
select(issue)
## # A tibble: 1 × 1
## issue
## <chr>
## 1 physician-fee-freeze
yesvotes %>%
filter(dist05 == min(dist05)) %>%
select(issue)
## # A tibble: 1 × 1
## issue
## <chr>
## 1 religious-groups-in-schools
It appears that physician-fee-freeze was the issue that was split the most along party lines in 1984. The religious-groups-in-school issue enjoyed the most bipartisan support. I became interested in the education-spending issue which seemed to enjoy a lot of republican support in 1984. I consider education to be a topic primarily supported by democrats. Researching the issue further I discovered that the act lowered education spending. That demonstrated that issue names are limited in conveying their importance or meaning. It is critical to know the background and details of data for any real-world analysis.
The data includes world development indicators from 1960 to 2016 covering 217 economies. It is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
Data was identified by Kyle Gilde and is available at the World Bank website (http://data.worldbank.org/data-catalog/world-development-indicators).
Requested Analysis: Specific analysis was not requested. I picked this data set due to its large size to see how data size influences the analysis in R.
world <- read.csv("C:\\Temp\\GitHub\\WDI_Data.csv",
header = TRUE, sep = ",")
# Get data dimensions
dim(world)
## [1] 383328 61
# Get available columns
colnames(world)
## [1] "Country.Name" "Country.Code" "Indicator.Name" "Indicator.Code"
## [5] "X1960" "X1961" "X1962" "X1963"
## [9] "X1964" "X1965" "X1966" "X1967"
## [13] "X1968" "X1969" "X1970" "X1971"
## [17] "X1972" "X1973" "X1974" "X1975"
## [21] "X1976" "X1977" "X1978" "X1979"
## [25] "X1980" "X1981" "X1982" "X1983"
## [29] "X1984" "X1985" "X1986" "X1987"
## [33] "X1988" "X1989" "X1990" "X1991"
## [37] "X1992" "X1993" "X1994" "X1995"
## [41] "X1996" "X1997" "X1998" "X1999"
## [45] "X2000" "X2001" "X2002" "X2003"
## [49] "X2004" "X2005" "X2006" "X2007"
## [53] "X2008" "X2009" "X2010" "X2011"
## [57] "X2012" "X2013" "X2014" "X2015"
## [61] "X2016"
# Adjust column names
colnames(world) <- str_replace_all(colnames(world), "X", "")
A portion of adjusted data:
| Country.Name | Country.Code | Indicator.Code | 1960 | 1961 | 1962 | 1963 |
|---|---|---|---|---|---|---|
| Arab World | ARB | PA.NUS.PPP.05 | NA | NA | NA | NA |
| Arab World | ARB | PA.NUS.PRVT.PP.05 | NA | NA | NA | NA |
| Arab World | ARB | EG.ELC.ACCS.ZS | NA | NA | NA | NA |
| Arab World | ARB | EG.ELC.ACCS.RU.ZS | NA | NA | NA | NA |
| Arab World | ARB | EG.ELC.ACCS.UR.ZS | NA | NA | NA | NA |
| Arab World | ARB | EG.NSF.ACCS.ZS | NA | NA | NA | NA |
| Arab World | ARB | EG.NSF.ACCS.RU.ZS | NA | NA | NA | NA |
| Arab World | ARB | EG.NSF.ACCS.UR.ZS | NA | NA | NA | NA |
| Arab World | ARB | WP_time_01.1 | NA | NA | NA | NA |
| Arab World | ARB | WP_time_01.3 | NA | NA | NA | NA |
# Convert table to the long form
world <- world %>%
gather(year, value, 5:61)
The data includes 1,452 indicators. I have noticed that there are several indicators tracking number of female and male over-age students. Being an over-age student myself, I have decided to do a quick analysis of this category. I did not go into detail what is implied by an over-age student in this data. Relevant indicators are as follows.
world %>%
filter(grepl('over-age',Indicator.Name, ignore.case = TRUE)) %>%
distinct(Indicator.Name)
## Indicator.Name
## 1 Over-age students, primary (% of enrollment)
## 2 Over-age students, primary, female (% of female enrollment)
## 3 Over-age students, primary, male (% of male enrollment)
# Select data relevant to the indicators to be analyzed
# Remove NAs and zeros
# Put female, male and total numbers into separate columns
students <- world %>%
filter(grepl('over-age',Indicator.Name, ignore.case = TRUE), !is.na(value), value != 0) %>%
select(Country.Name, Indicator.Code, year, value) %>%
spread(Indicator.Code, value)
# Adjust column names
colnames(students) <- c("country", "year", "students_female", "students_male", "students_all")
A sample of tidy data to be used for analysis:
| country | year | students_female | students_male | students_all |
|---|---|---|---|---|
| Afghanistan | 1974 | 11.76900 | 19.55751 | 18.43234 |
| Afghanistan | 1993 | 7.13094 | 7.13159 | 7.13143 |
| Afghanistan | 2014 | 5.65843 | 6.27605 | 6.02927 |
| Albania | 2013 | 7.16190 | 8.50432 | 7.87141 |
| Albania | 2014 | 5.97890 | 7.08751 | 6.56295 |
| Algeria | 1973 | 15.35597 | 19.18937 | 17.69170 |
Create boxplots per year for all countries combines.
# Boxplots per year for all countries
ggplot(filter(students, !is.na(students_all)), aes(x = year, y = students_all)) +
geom_boxplot() +
labs(x = "Year", y = "% of Over-Age Students", caption = "Plot 3. Boxplots for all students by year.") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Create point plot per country per year.
ggplot(filter(students, !is.na(students_all)), aes(x = year, y = country)) +
geom_point(aes(color = students_all)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.text.y = element_blank()) +
scale_color_continuous(limits = c(0,30), name = "") +
labs(x = "", y = "Country", caption = "Plot 4. Availability of data by year and country.")
An interesting discovery looking at the boxplots is that the mean of number of over-age students dropped from 1960 to the present. Either the category implies adults going back for high-school level education (then the drop is a positive change) or adults continuing their higher level education (then I would consider the drop to be a negative change). Looking at the point graph, even with limited country detail, it is clear that there are a lot of gaps in the data especially between 1998 and 2012. Additionally, I have discovered that when in small data sets it really doesn’t matter what method you use to analyze it since results come up quickly, with large data sets, efficiency in code may have significant impact on efficiency of analysis.
The data includes shipping costs and collections per country. This sample analysis can be used for operational improvements. I have picked this data set because it had a number of issues that required tidying up.
Data was provided by Cesar Espitia.
Requested Analysis: Do collected fees cover company’s cost? What is the shipping cost per month? Per country?
shipping <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Discussion5.csv",
header = TRUE, sep = ",")
Raw data:
| Date | Values | China | Canada | Taiwan | Singapore | US | England | Korea | Hong.Kong |
|---|---|---|---|---|---|---|---|---|---|
| 12/29/16 | Price of Carrier | 518 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 260 | NA | NA | NA | NA | NA | NA | NA | |
| 12/30/16 | Price of Carrier | 232 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 132 | NA | NA | NA | NA | NA | NA | NA | |
| 1/3/17 | Price of Carrier | 1143.5 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 752 | NA | NA | NA | NA | NA | NA | NA | |
| 1/4/17 | Price of Carrier | 732.5 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 448 | NA | NA | NA | NA | NA | NA | NA | |
| 1/5/17 | Price of Carrier | 443 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 336 | NA | NA | NA | NA | NA | NA | NA | |
| 1/6/17 | Price of Carrier | 480.5 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 236 | NA | NA | NA | NA | NA | NA | NA | |
| 1/9/17 | Price of Carrier | 439.5 | 193.50000 | NA | NA | NA | NA | 31.5 | 13 |
| Shipping Fees Collected | 316 | 228.00000 | NA | NA | NA | NA | 36.0 | 20 | |
| 1/10/17 | Price of Carrier | 318 | 279.55994 | NA | NA | NA | NA | NA | 13 |
| Shipping Fees Collected | 108 | 336.00000 | NA | NA | NA | NA | NA | 20 | |
| 1/11/17 | Price of Carrier | 325 | NA | NA | 130.50000 | NA | NA | NA | NA |
| Shipping Fees Collected | 168 | NA | NA | 30.00000 | NA | NA | NA | NA | |
| 1/12/17 | Price of Carrier | 230 | 116.99997 | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 132 | 150.00000 | NA | NA | NA | NA | NA | NA | |
| 1/13/17 | Price of Carrier | 229 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | #N/A | NA | NA | NA | NA | NA | NA | NA | |
| 1/16/17 | Price of Carrier | 189.5 | NA | 34 | 53.99999 | NA | NA | NA | NA |
| Shipping Fees Collected | 144 | NA | 45 | 72.00000 | NA | NA | NA | NA | |
| 1/17/17 | Price of Carrier | 470.5 | NA | NA | NA | NA | NA | NA | 33 |
| Shipping Fees Collected | 196 | NA | NA | NA | NA | NA | NA | 45 | |
| 1/18/17 | Price of Carrier | 311.5 | NA | 13 | 58.79999 | NA | NA | NA | NA |
| Shipping Fees Collected | 212 | NA | 0 | 66.00000 | NA | NA | NA | NA | |
| 1/19/17 | Price of Carrier | 266.5 | NA | 13 | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 92 | NA | 20 | NA | NA | NA | NA | NA | |
| 1/20/17 | Price of Carrier | 182 | NA | NA | NA | 20.92996 | NA | NA | NA |
| Shipping Fees Collected | 148 | NA | NA | NA | 85.87000 | NA | NA | NA | |
| 1/23/17 | Price of Carrier | 117 | 52.20000 | 111 | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 64 | 66.00000 | 50 | NA | NA | NA | NA | NA | |
| 1/24/17 | Price of Carrier | 144.5 | 76.49999 | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 64 | 90.00000 | NA | NA | NA | NA | NA | NA | |
| 1/25/17 | Price of Carrier | 95.5 | NA | NA | 27.00000 | NA | NA | NA | NA |
| Shipping Fees Collected | 76 | NA | NA | 36.00000 | NA | NA | NA | NA | |
| 1/26/17 | Price of Carrier | 84 | 53.99999 | NA | NA | 37.21999 | NA | NA | NA |
| Shipping Fees Collected | 40 | 66.00000 | NA | NA | 29.82000 | NA | NA | NA | |
| 1/27/17 | Price of Carrier | 121 | NA | NA | 31.50000 | NA | NA | NA | NA |
| Shipping Fees Collected | 48 | NA | NA | 36.00000 | NA | NA | NA | NA | |
| 1/30/17 | Price of Carrier | 135.5 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 92 | NA | NA | NA | NA | NA | NA | NA | |
| 1/31/17 | Price of Carrier | 33.5 | 31.50000 | NA | NA | 38.51000 | NA | NA | NA |
| Shipping Fees Collected | 16 | 42.00000 | NA | NA | 33.23000 | NA | NA | NA | |
| 2/1/17 | Price of Carrier | 172.5 | 31.50000 | NA | NA | 146.80994 | NA | NA | NA |
| Shipping Fees Collected | 76 | 42.00000 | NA | NA | 121.25000 | NA | NA | NA | |
| 2/2/17 | Price of Carrier | 165 | NA | NA | NA | NA | NA | NA | NA |
| Shipping Fees Collected | 40 | NA | NA | NA | NA | NA | NA | NA | |
| 2/3/17 | Price of Carrier | 166 | 125.99998 | 13 | NA | 10.33000 | 31.50000 | NA | NA |
| Shipping Fees Collected | 100 | 156.00000 | 20 | NA | 0.00000 | 42.00000 | NA | NA | |
| 2/6/17 | Price of Carrier | 343 | 58.50000 | 68 | NA | 94.10999 | 89.99997 | NA | 135 |
| Shipping Fees Collected | 236 | 78.00000 | 90 | NA | 97.39000 | 120.00000 | NA | 115 | |
| 2/7/17 | Price of Carrier | NA | 13 | NA | 87.83999 | NA | NA | NA | |
| Shipping Fees Collected | NA | 20 | NA | 70.28000 | NA | NA | NA |
# Populate blank values in the Date column with a date from previous row
shipping[c(FALSE, TRUE),1] <- shipping[c(TRUE, FALSE), 1]
# Rearrange data
shipping <- shipping %>%
gather(Country, Amount, 3:10) %>% # Collapse data into a narrow format
filter(!is.na(Amount)) %>% # Remove NAs
spread(Values, Amount) # Expand Cost and Collected values into two columns
# Adjust column names
colnames(shipping) <- c("Date", "Country", "Cost", "Collected")
# Convert cost and collected amounts to number
shipping$Cost <- as.numeric(shipping$Cost)
shipping$Collected <- as.numeric(shipping$Collected)
# Remove records that did not have numeric Cost or Collected values
shipping <- shipping %>% filter(!is.na(Cost), !is.na(Collected))
Tidy data:
| Date | Country | Cost | Collected |
|---|---|---|---|
| 1/10/17 | Canada | 279.55994 | 336.00 |
| 1/10/17 | China | 318.00000 | 108.00 |
| 1/10/17 | Hong.Kong | 13.00000 | 20.00 |
| 1/11/17 | China | 325.00000 | 168.00 |
| 1/11/17 | Singapore | 130.50000 | 30.00 |
| 1/12/17 | Canada | 116.99997 | 150.00 |
| 1/12/17 | China | 230.00000 | 132.00 |
| 1/16/17 | China | 189.50000 | 144.00 |
| 1/16/17 | Singapore | 53.99999 | 72.00 |
| 1/16/17 | Taiwan | 34.00000 | 45.00 |
| 1/17/17 | China | 470.50000 | 196.00 |
| 1/17/17 | Hong.Kong | 33.00000 | 45.00 |
| 1/18/17 | China | 311.50000 | 212.00 |
| 1/18/17 | Singapore | 58.79999 | 66.00 |
| 1/18/17 | Taiwan | 13.00000 | 0.00 |
| 1/19/17 | China | 266.50000 | 92.00 |
| 1/19/17 | Taiwan | 13.00000 | 20.00 |
| 1/20/17 | China | 182.00000 | 148.00 |
| 1/20/17 | US | 20.92996 | 85.87 |
| 1/23/17 | Canada | 52.20000 | 66.00 |
| 1/23/17 | China | 117.00000 | 64.00 |
| 1/23/17 | Taiwan | 111.00000 | 50.00 |
| 1/24/17 | Canada | 76.49999 | 90.00 |
| 1/24/17 | China | 144.50000 | 64.00 |
| 1/25/17 | China | 95.50000 | 76.00 |
| 1/25/17 | Singapore | 27.00000 | 36.00 |
| 1/26/17 | Canada | 53.99999 | 66.00 |
| 1/26/17 | China | 84.00000 | 40.00 |
| 1/26/17 | US | 37.21999 | 29.82 |
| 1/27/17 | China | 121.00000 | 48.00 |
| 1/27/17 | Singapore | 31.50000 | 36.00 |
| 1/3/17 | China | 1143.50000 | 752.00 |
| 1/30/17 | China | 135.50000 | 92.00 |
| 1/31/17 | Canada | 31.50000 | 42.00 |
| 1/31/17 | China | 33.50000 | 16.00 |
| 1/31/17 | US | 38.51000 | 33.23 |
| 1/4/17 | China | 732.50000 | 448.00 |
| 1/5/17 | China | 443.00000 | 336.00 |
| 1/6/17 | China | 480.50000 | 236.00 |
| 1/9/17 | Canada | 193.50000 | 228.00 |
| 1/9/17 | China | 439.50000 | 316.00 |
| 1/9/17 | Hong.Kong | 13.00000 | 20.00 |
| 1/9/17 | Korea | 31.50000 | 36.00 |
| 12/29/16 | China | 518.00000 | 260.00 |
| 12/30/16 | China | 232.00000 | 132.00 |
| 2/1/17 | Canada | 31.50000 | 42.00 |
| 2/1/17 | China | 172.50000 | 76.00 |
| 2/1/17 | US | 146.80994 | 121.25 |
| 2/2/17 | China | 165.00000 | 40.00 |
| 2/3/17 | Canada | 125.99998 | 156.00 |
| 2/3/17 | China | 166.00000 | 100.00 |
| 2/3/17 | England | 31.50000 | 42.00 |
| 2/3/17 | Taiwan | 13.00000 | 20.00 |
| 2/3/17 | US | 10.33000 | 0.00 |
| 2/6/17 | Canada | 58.50000 | 78.00 |
| 2/6/17 | China | 343.00000 | 236.00 |
| 2/6/17 | England | 89.99997 | 120.00 |
| 2/6/17 | Hong.Kong | 134.99998 | 115.00 |
| 2/6/17 | Taiwan | 68.00000 | 90.00 |
| 2/6/17 | US | 94.10999 | 97.39 |
| 2/7/17 | Taiwan | 13.00000 | 20.00 |
| 2/7/17 | US | 87.83999 | 70.28 |
# Get country specific data
countrydata <- shipping %>%
group_by(Country) %>%
summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>% # Total amounts per country
mutate(Total = SumCollected - SumCost) %>% # Get the difference between amounts
arrange(Total)
| Country | SumCost | SumCollected | Total |
|---|---|---|---|
| China | 7859.5000 | 4532.00 | -3327.500000 |
| Singapore | 301.8000 | 240.00 | -61.799971 |
| Taiwan | 265.0000 | 245.00 | -20.000000 |
| US | 435.7499 | 437.84 | 2.090127 |
| Korea | 31.5000 | 36.00 | 4.500002 |
| Hong.Kong | 194.0000 | 200.00 | 6.000025 |
| England | 121.5000 | 162.00 | 40.500031 |
| Canada | 1020.2599 | 1254.00 | 233.740126 |
# Get monthly data
monthdata <- shipping %>%
separate(Date, c("Month", "Day", "Year"), sep = "/", remove=FALSE) %>% # Split date into components
group_by(Month, Year) %>% # Group by month/year
summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
mutate(Total = SumCollected - SumCost,
MonthNo = as.numeric(Month), # Convert month to number for sorting
YearNo = as.numeric(Year), # Convert year to number for sorting
Period = paste(Month, Year, sep="/")) %>% # Combine month and year for output
arrange(YearNo, MonthNo) %>% # Sort
group_by() %>% # Remove grouping
select(Period, SumCost, SumCollected, Total) # Select field for output
| Period | SumCost | SumCollected | Total |
|---|---|---|---|
| 12/16 | 750.00 | 392.00 | -358.0000 |
| 1/17 | 7727.22 | 5290.92 | -2436.2998 |
| 2/17 | 1752.09 | 1423.92 | -328.1698 |
Based on the sample data it does not appear that the company recoups shipping charges. Shipping to China generates particularly big losses, but shipping to Canada is slightly profitable. Monthly data can be easily generated using dplyr and tidyr with provided format.