IS 607 - Project 2:
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.
I have taken the data set about shipping and prepared my .csv file for my Part II Project.
library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
shipping <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Week6/master/shipping_data.csv", header=TRUE, sep=",")
shipping
## Date Values China Canada Taiwan Singapore
## 1 12/29/2016 Price of Carrier 518 NA NA NA
## 2 Shipping Fees Collected 260 NA NA NA
## 3 12/30/2016 Price of Carrier 232 NA NA NA
## 4 Shipping Fees Collected 132 NA NA NA
## 5 1/3/2017 Price of Carrier 1143.5 NA NA NA
## 6 Shipping Fees Collected 752 NA NA NA
## 7 1/4/2017 Price of Carrier 732.5 NA NA NA
## 8 Shipping Fees Collected 448 NA NA NA
## 9 1/5/2017 Price of Carrier 443 NA NA NA
## 10 Shipping Fees Collected 336 NA NA NA
## 11 1/6/2017 Price of Carrier 480.5 NA NA NA
## 12 Shipping Fees Collected 236 NA NA NA
## 13 1/9/2017 Price of Carrier 439.5 193.50000 NA NA
## 14 Shipping Fees Collected 316 228.00000 NA NA
## 15 1/10/2017 Price of Carrier 318 279.55994 NA NA
## 16 Shipping Fees Collected 108 336.00000 NA NA
## 17 1/11/2017 Price of Carrier 325 NA NA 130.50000
## 18 Shipping Fees Collected 168 NA NA 30.00000
## 19 1/12/2017 Price of Carrier 230 116.99997 NA NA
## 20 Shipping Fees Collected 132 150.00000 NA NA
## 21 1/13/2017 Price of Carrier 229 NA NA NA
## 22 Shipping Fees Collected #N/A NA NA NA
## 23 1/16/2017 Price of Carrier 189.5 NA 34 53.99999
## 24 Shipping Fees Collected 144 NA 45 72.00000
## 25 1/17/2017 Price of Carrier 470.5 NA NA NA
## 26 Shipping Fees Collected 196 NA NA NA
## 27 1/18/2017 Price of Carrier 311.5 NA 13 58.79999
## 28 Shipping Fees Collected 212 NA 0 66.00000
## 29 1/19/2017 Price of Carrier 266.5 NA 13 NA
## 30 Shipping Fees Collected 92 NA 20 NA
## 31 1/20/2017 Price of Carrier 182 NA NA NA
## 32 Shipping Fees Collected 148 NA NA NA
## 33 1/23/2017 Price of Carrier 117 52.20000 111 NA
## 34 Shipping Fees Collected 64 66.00000 50 NA
## 35 1/24/2017 Price of Carrier 144.5 76.49999 NA NA
## 36 Shipping Fees Collected 64 90.00000 NA NA
## 37 1/25/2017 Price of Carrier 95.5 NA NA 27.00000
## 38 Shipping Fees Collected 76 NA NA 36.00000
## 39 1/26/2017 Price of Carrier 84 53.99999 NA NA
## 40 Shipping Fees Collected 40 66.00000 NA NA
## 41 1/27/2017 Price of Carrier 121 NA NA 31.50000
## 42 Shipping Fees Collected 48 NA NA 36.00000
## 43 1/30/2017 Price of Carrier 135.5 NA NA NA
## 44 Shipping Fees Collected 92 NA NA NA
## 45 1/31/2017 Price of Carrier 33.5 31.50000 NA NA
## 46 Shipping Fees Collected 16 42.00000 NA NA
## 47 2/1/2017 Price of Carrier 172.5 31.50000 NA NA
## 48 Shipping Fees Collected 76 42.00000 NA NA
## 49 2/2/2017 Price of Carrier 165 NA NA NA
## 50 Shipping Fees Collected 40 NA NA NA
## 51 2/3/2017 Price of Carrier 166 125.99998 13 NA
## 52 Shipping Fees Collected 100 156.00000 20 NA
## 53 2/6/2017 Price of Carrier 343 58.50000 68 NA
## 54 Shipping Fees Collected 236 78.00000 90 NA
## 55 2/7/2017 Price of Carrier NA 13 NA
## 56 Shipping Fees Collected NA 20 NA
## US England Korea Hong.Kong
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 NA NA NA NA
## 11 NA NA NA NA
## 12 NA NA NA NA
## 13 NA NA 31.5 13
## 14 NA NA 36.0 20
## 15 NA NA NA 13
## 16 NA NA NA 20
## 17 NA NA NA NA
## 18 NA NA NA NA
## 19 NA NA NA NA
## 20 NA NA NA NA
## 21 NA NA NA NA
## 22 NA NA NA NA
## 23 NA NA NA NA
## 24 NA NA NA NA
## 25 NA NA NA 33
## 26 NA NA NA 45
## 27 NA NA NA NA
## 28 NA NA NA NA
## 29 NA NA NA NA
## 30 NA NA NA NA
## 31 20.92996 NA NA NA
## 32 85.87000 NA NA NA
## 33 NA NA NA NA
## 34 NA NA NA NA
## 35 NA NA NA NA
## 36 NA NA NA NA
## 37 NA NA NA NA
## 38 NA NA NA NA
## 39 37.21999 NA NA NA
## 40 29.82000 NA NA NA
## 41 NA NA NA NA
## 42 NA NA NA NA
## 43 NA NA NA NA
## 44 NA NA NA NA
## 45 38.51000 NA NA NA
## 46 33.23000 NA NA NA
## 47 146.80994 NA NA NA
## 48 121.25000 NA NA NA
## 49 NA NA NA NA
## 50 NA NA NA NA
## 51 10.33000 31.50000 NA NA
## 52 0.00000 42.00000 NA NA
## 53 94.10999 89.99997 NA 135
## 54 97.39000 120.00000 NA 115
## 55 87.83999 NA NA NA
## 56 70.28000 NA NA NA
Adding the date to all the rows:
shipping[c(FALSE, TRUE),1] <- shipping[c(TRUE, FALSE), 1]
head(shipping)
## Date Values China Canada Taiwan Singapore US
## 1 12/29/2016 Price of Carrier 518 NA NA NA NA
## 2 12/29/2016 Shipping Fees Collected 260 NA NA NA NA
## 3 12/30/2016 Price of Carrier 232 NA NA NA NA
## 4 12/30/2016 Shipping Fees Collected 132 NA NA NA NA
## 5 1/3/2017 Price of Carrier 1143.5 NA NA NA NA
## 6 1/3/2017 Shipping Fees Collected 752 NA NA NA NA
## England Korea Hong.Kong
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
We can see that the data is in wide format. Need to bring the data to long format using gather():
shipping1 <- gather(shipping, "Country", "Amount", 3:10)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(shipping1)
## Date Values Country Amount
## 1 12/29/2016 Price of Carrier China 518
## 2 12/29/2016 Shipping Fees Collected China 260
## 3 12/30/2016 Price of Carrier China 232
## 4 12/30/2016 Shipping Fees Collected China 132
## 5 1/3/2017 Price of Carrier China 1143.5
## 6 1/3/2017 Shipping Fees Collected China 752
There are N/A in the data, we can remove thoes:
shipping2 <- filter(shipping1, !is.na(Amount))
head(shipping2)
## Date Values Country Amount
## 1 12/29/2016 Price of Carrier China 518
## 2 12/29/2016 Shipping Fees Collected China 260
## 3 12/30/2016 Price of Carrier China 232
## 4 12/30/2016 Shipping Fees Collected China 132
## 5 1/3/2017 Price of Carrier China 1143.5
## 6 1/3/2017 Shipping Fees Collected China 752
Speard the data using spread():
shipping3 <- shipping2 %>%
spread(Values, Amount)
head(shipping3)
## Date Country Price of Carrier Shipping Fees Collected
## 1 1/10/2017 Canada 279.5599426 336
## 2 1/10/2017 China 318 108
## 3 1/10/2017 Hong.Kong 13 20
## 4 1/11/2017 China 325 168
## 5 1/11/2017 Singapore 130.4999996 30
## 6 1/12/2017 Canada 116.9999661 150
Changing the column names:
colnames(shipping3) <- c("Date", "Country", "Price_of_Carrier", "Shipping_Fees_Collected")
head(shipping3)
## Date Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017 Canada 279.5599426 336
## 2 1/10/2017 China 318 108
## 3 1/10/2017 Hong.Kong 13 20
## 4 1/11/2017 China 325 168
## 5 1/11/2017 Singapore 130.4999996 30
## 6 1/12/2017 Canada 116.9999661 150
Converting the Price_of_Carrier and Shipping_Fees_Collected fields to numeric:
shipping3$Price_of_Carrier <- as.numeric(shipping3$Price_of_Carrier)
shipping3$Shipping_Fees_Collected <- as.numeric(shipping3$Shipping_Fees_Collected)
## Warning: NAs introduced by coercion
head(shipping3)
## Date Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017 Canada 279.5599 336
## 2 1/10/2017 China 318.0000 108
## 3 1/10/2017 Hong.Kong 13.0000 20
## 4 1/11/2017 China 325.0000 168
## 5 1/11/2017 Singapore 130.5000 30
## 6 1/12/2017 Canada 117.0000 150
Removing NA:
shipping4 <- shipping3 %>% filter(!is.na(Price_of_Carrier), !is.na(Shipping_Fees_Collected))
head(shipping4)
## Date Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017 Canada 279.5599 336
## 2 1/10/2017 China 318.0000 108
## 3 1/10/2017 Hong.Kong 13.0000 20
## 4 1/11/2017 China 325.0000 168
## 5 1/11/2017 Singapore 130.5000 30
## 6 1/12/2017 Canada 117.0000 150
Now we can aggregate the data for each country:
shipping5 <- shipping4 %>%
group_by(Country) %>%
summarise(Price_of_Carrier = sum(Price_of_Carrier), Shipping_Fees_Collected = sum(Shipping_Fees_Collected))
shipping5
## # A tibble: 8 x 3
## Country Price_of_Carrier Shipping_Fees_Collected
## <chr> <dbl> <dbl>
## 1 Canada 1020 1254
## 2 China 7860 4532
## 3 England 121 162
## 4 Hong.Kong 194 200
## 5 Korea 31.5 36.0
## 6 Singapore 302 240
## 7 Taiwan 265 245
## 8 US 436 438
Using ggplot to understand more:
ggplot(shipping5, aes(x = Country, y=Price_of_Carrier, fill = Country)) +
geom_bar(stat="identity",position="dodge") +
xlab("Country") + ylab("Price_of_Carrier") +
ggtitle("Country v/s Price_of_Carrier")
ggplot(shipping5, aes(x = Country, y=Shipping_Fees_Collected, fill = Country)) +
geom_bar(stat="identity",position="dodge") +
xlab("Country") + ylab("Shipping_Fees_Collected") +
ggtitle("Country v/s Shipping_Fees_Collected")
We can see that China has most Price_of_Carrier and Shipping_Fees_Collected amounts