IS 607 - Project 2:

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.

I have taken the data set about shipping and prepared my .csv file for my Part II Project.

library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
shipping <- read.csv("https://raw.githubusercontent.com/Riteshlohiya/Data607-Week6/master/shipping_data.csv", header=TRUE, sep=",")
shipping
##          Date                  Values  China    Canada Taiwan Singapore
## 1  12/29/2016        Price of Carrier    518        NA     NA        NA
## 2             Shipping Fees Collected    260        NA     NA        NA
## 3  12/30/2016        Price of Carrier    232        NA     NA        NA
## 4             Shipping Fees Collected    132        NA     NA        NA
## 5    1/3/2017        Price of Carrier 1143.5        NA     NA        NA
## 6             Shipping Fees Collected    752        NA     NA        NA
## 7    1/4/2017        Price of Carrier  732.5        NA     NA        NA
## 8             Shipping Fees Collected    448        NA     NA        NA
## 9    1/5/2017        Price of Carrier    443        NA     NA        NA
## 10            Shipping Fees Collected    336        NA     NA        NA
## 11   1/6/2017        Price of Carrier  480.5        NA     NA        NA
## 12            Shipping Fees Collected    236        NA     NA        NA
## 13   1/9/2017        Price of Carrier  439.5 193.50000     NA        NA
## 14            Shipping Fees Collected    316 228.00000     NA        NA
## 15  1/10/2017        Price of Carrier    318 279.55994     NA        NA
## 16            Shipping Fees Collected    108 336.00000     NA        NA
## 17  1/11/2017        Price of Carrier    325        NA     NA 130.50000
## 18            Shipping Fees Collected    168        NA     NA  30.00000
## 19  1/12/2017        Price of Carrier    230 116.99997     NA        NA
## 20            Shipping Fees Collected    132 150.00000     NA        NA
## 21  1/13/2017        Price of Carrier    229        NA     NA        NA
## 22            Shipping Fees Collected   #N/A        NA     NA        NA
## 23  1/16/2017        Price of Carrier  189.5        NA     34  53.99999
## 24            Shipping Fees Collected    144        NA     45  72.00000
## 25  1/17/2017        Price of Carrier  470.5        NA     NA        NA
## 26            Shipping Fees Collected    196        NA     NA        NA
## 27  1/18/2017        Price of Carrier  311.5        NA     13  58.79999
## 28            Shipping Fees Collected    212        NA      0  66.00000
## 29  1/19/2017        Price of Carrier  266.5        NA     13        NA
## 30            Shipping Fees Collected     92        NA     20        NA
## 31  1/20/2017        Price of Carrier    182        NA     NA        NA
## 32            Shipping Fees Collected    148        NA     NA        NA
## 33  1/23/2017        Price of Carrier    117  52.20000    111        NA
## 34            Shipping Fees Collected     64  66.00000     50        NA
## 35  1/24/2017        Price of Carrier  144.5  76.49999     NA        NA
## 36            Shipping Fees Collected     64  90.00000     NA        NA
## 37  1/25/2017        Price of Carrier   95.5        NA     NA  27.00000
## 38            Shipping Fees Collected     76        NA     NA  36.00000
## 39  1/26/2017        Price of Carrier     84  53.99999     NA        NA
## 40            Shipping Fees Collected     40  66.00000     NA        NA
## 41  1/27/2017        Price of Carrier    121        NA     NA  31.50000
## 42            Shipping Fees Collected     48        NA     NA  36.00000
## 43  1/30/2017        Price of Carrier  135.5        NA     NA        NA
## 44            Shipping Fees Collected     92        NA     NA        NA
## 45  1/31/2017        Price of Carrier   33.5  31.50000     NA        NA
## 46            Shipping Fees Collected     16  42.00000     NA        NA
## 47   2/1/2017        Price of Carrier  172.5  31.50000     NA        NA
## 48            Shipping Fees Collected     76  42.00000     NA        NA
## 49   2/2/2017        Price of Carrier    165        NA     NA        NA
## 50            Shipping Fees Collected     40        NA     NA        NA
## 51   2/3/2017        Price of Carrier    166 125.99998     13        NA
## 52            Shipping Fees Collected    100 156.00000     20        NA
## 53   2/6/2017        Price of Carrier    343  58.50000     68        NA
## 54            Shipping Fees Collected    236  78.00000     90        NA
## 55   2/7/2017        Price of Carrier               NA     13        NA
## 56            Shipping Fees Collected               NA     20        NA
##           US   England Korea Hong.Kong
## 1         NA        NA    NA        NA
## 2         NA        NA    NA        NA
## 3         NA        NA    NA        NA
## 4         NA        NA    NA        NA
## 5         NA        NA    NA        NA
## 6         NA        NA    NA        NA
## 7         NA        NA    NA        NA
## 8         NA        NA    NA        NA
## 9         NA        NA    NA        NA
## 10        NA        NA    NA        NA
## 11        NA        NA    NA        NA
## 12        NA        NA    NA        NA
## 13        NA        NA  31.5        13
## 14        NA        NA  36.0        20
## 15        NA        NA    NA        13
## 16        NA        NA    NA        20
## 17        NA        NA    NA        NA
## 18        NA        NA    NA        NA
## 19        NA        NA    NA        NA
## 20        NA        NA    NA        NA
## 21        NA        NA    NA        NA
## 22        NA        NA    NA        NA
## 23        NA        NA    NA        NA
## 24        NA        NA    NA        NA
## 25        NA        NA    NA        33
## 26        NA        NA    NA        45
## 27        NA        NA    NA        NA
## 28        NA        NA    NA        NA
## 29        NA        NA    NA        NA
## 30        NA        NA    NA        NA
## 31  20.92996        NA    NA        NA
## 32  85.87000        NA    NA        NA
## 33        NA        NA    NA        NA
## 34        NA        NA    NA        NA
## 35        NA        NA    NA        NA
## 36        NA        NA    NA        NA
## 37        NA        NA    NA        NA
## 38        NA        NA    NA        NA
## 39  37.21999        NA    NA        NA
## 40  29.82000        NA    NA        NA
## 41        NA        NA    NA        NA
## 42        NA        NA    NA        NA
## 43        NA        NA    NA        NA
## 44        NA        NA    NA        NA
## 45  38.51000        NA    NA        NA
## 46  33.23000        NA    NA        NA
## 47 146.80994        NA    NA        NA
## 48 121.25000        NA    NA        NA
## 49        NA        NA    NA        NA
## 50        NA        NA    NA        NA
## 51  10.33000  31.50000    NA        NA
## 52   0.00000  42.00000    NA        NA
## 53  94.10999  89.99997    NA       135
## 54  97.39000 120.00000    NA       115
## 55  87.83999        NA    NA        NA
## 56  70.28000        NA    NA        NA

Adding the date to all the rows:

shipping[c(FALSE, TRUE),1] <- shipping[c(TRUE, FALSE), 1]
head(shipping)
##         Date                  Values  China Canada Taiwan Singapore US
## 1 12/29/2016        Price of Carrier    518     NA     NA        NA NA
## 2 12/29/2016 Shipping Fees Collected    260     NA     NA        NA NA
## 3 12/30/2016        Price of Carrier    232     NA     NA        NA NA
## 4 12/30/2016 Shipping Fees Collected    132     NA     NA        NA NA
## 5   1/3/2017        Price of Carrier 1143.5     NA     NA        NA NA
## 6   1/3/2017 Shipping Fees Collected    752     NA     NA        NA NA
##   England Korea Hong.Kong
## 1      NA    NA        NA
## 2      NA    NA        NA
## 3      NA    NA        NA
## 4      NA    NA        NA
## 5      NA    NA        NA
## 6      NA    NA        NA

We can see that the data is in wide format. Need to bring the data to long format using gather():

shipping1 <- gather(shipping, "Country", "Amount", 3:10)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(shipping1)
##         Date                  Values Country Amount
## 1 12/29/2016        Price of Carrier   China    518
## 2 12/29/2016 Shipping Fees Collected   China    260
## 3 12/30/2016        Price of Carrier   China    232
## 4 12/30/2016 Shipping Fees Collected   China    132
## 5   1/3/2017        Price of Carrier   China 1143.5
## 6   1/3/2017 Shipping Fees Collected   China    752

There are N/A in the data, we can remove thoes:

shipping2 <- filter(shipping1, !is.na(Amount))
head(shipping2)
##         Date                  Values Country Amount
## 1 12/29/2016        Price of Carrier   China    518
## 2 12/29/2016 Shipping Fees Collected   China    260
## 3 12/30/2016        Price of Carrier   China    232
## 4 12/30/2016 Shipping Fees Collected   China    132
## 5   1/3/2017        Price of Carrier   China 1143.5
## 6   1/3/2017 Shipping Fees Collected   China    752

Speard the data using spread():

shipping3 <- shipping2 %>% 
  spread(Values, Amount)
head(shipping3)
##        Date   Country Price of Carrier Shipping Fees Collected
## 1 1/10/2017    Canada      279.5599426                     336
## 2 1/10/2017     China              318                     108
## 3 1/10/2017 Hong.Kong               13                      20
## 4 1/11/2017     China              325                     168
## 5 1/11/2017 Singapore      130.4999996                      30
## 6 1/12/2017    Canada      116.9999661                     150

Changing the column names:

colnames(shipping3) <- c("Date", "Country", "Price_of_Carrier", "Shipping_Fees_Collected")
head(shipping3)
##        Date   Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017    Canada      279.5599426                     336
## 2 1/10/2017     China              318                     108
## 3 1/10/2017 Hong.Kong               13                      20
## 4 1/11/2017     China              325                     168
## 5 1/11/2017 Singapore      130.4999996                      30
## 6 1/12/2017    Canada      116.9999661                     150

Converting the Price_of_Carrier and Shipping_Fees_Collected fields to numeric:

shipping3$Price_of_Carrier <- as.numeric(shipping3$Price_of_Carrier)
shipping3$Shipping_Fees_Collected <- as.numeric(shipping3$Shipping_Fees_Collected)
## Warning: NAs introduced by coercion
head(shipping3)
##        Date   Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017    Canada         279.5599                     336
## 2 1/10/2017     China         318.0000                     108
## 3 1/10/2017 Hong.Kong          13.0000                      20
## 4 1/11/2017     China         325.0000                     168
## 5 1/11/2017 Singapore         130.5000                      30
## 6 1/12/2017    Canada         117.0000                     150

Removing NA:

shipping4 <- shipping3 %>% filter(!is.na(Price_of_Carrier), !is.na(Shipping_Fees_Collected))
head(shipping4)
##        Date   Country Price_of_Carrier Shipping_Fees_Collected
## 1 1/10/2017    Canada         279.5599                     336
## 2 1/10/2017     China         318.0000                     108
## 3 1/10/2017 Hong.Kong          13.0000                      20
## 4 1/11/2017     China         325.0000                     168
## 5 1/11/2017 Singapore         130.5000                      30
## 6 1/12/2017    Canada         117.0000                     150

Now we can aggregate the data for each country:

shipping5 <- shipping4 %>% 
  group_by(Country) %>% 
  summarise(Price_of_Carrier = sum(Price_of_Carrier), Shipping_Fees_Collected = sum(Shipping_Fees_Collected))  
shipping5  
## # A tibble: 8 x 3
##   Country   Price_of_Carrier Shipping_Fees_Collected
##   <chr>                <dbl>                   <dbl>
## 1 Canada              1020                    1254  
## 2 China               7860                    4532  
## 3 England              121                     162  
## 4 Hong.Kong            194                     200  
## 5 Korea                 31.5                    36.0
## 6 Singapore            302                     240  
## 7 Taiwan               265                     245  
## 8 US                   436                     438

Using ggplot to understand more:

ggplot(shipping5, aes(x = Country, y=Price_of_Carrier, fill = Country)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Country") + ylab("Price_of_Carrier") +
          ggtitle("Country v/s Price_of_Carrier")

ggplot(shipping5, aes(x = Country, y=Shipping_Fees_Collected, fill = Country)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Country") + ylab("Shipping_Fees_Collected") +
          ggtitle("Country v/s Shipping_Fees_Collected")

We can see that China has most Price_of_Carrier and Shipping_Fees_Collected amounts