Data607 Project 2

Goal of this Project:

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: 1. Choose any three of the “wide” datasets. For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis,and conclusions.

FALSE Loading required package: DBI

FALSE Loading required package: bitops

FALSE 
FALSE Attaching package: 'tidyr'

FALSE The following object is masked from 'package:RCurl':
FALSE 
FALSE     complete

FALSE 
FALSE Attaching package: 'dplyr'

FALSE The following objects are masked from 'package:stats':
FALSE 
FALSE     filter, lag

FALSE The following objects are masked from 'package:base':
FALSE 
FALSE     intersect, setdiff, setequal, union

FALSE 
FALSE Attaching package: 'psych'

FALSE The following objects are masked from 'package:ggplot2':
FALSE 
FALSE     %+%, alpha

FALSE 
FALSE Attaching package: 'plotly'

FALSE The following object is masked from 'package:ggplot2':
FALSE 
FALSE     last_plot

FALSE The following object is masked from 'package:stats':
FALSE 
FALSE     filter

FALSE The following object is masked from 'package:graphics':
FALSE 
FALSE     layout

First Dataset

NYC LOTTO- Let’s see how we can increase the odds and win the lottery ! Let’s try to find out which numbers have a high frequency, and which numbers a player should potentially avoid that carry a low frequency.

Dataset original source: https://data.ny.gov/Government-Finance/Lottery-Cash-4-Life-Winning-Numbers-Beginning-2014/kwxv-fwze

# Let's begin by reading in the LotteryCash4Life csv file from github
lotto <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/LotteryCash4Life2014.csv", header = TRUE)

# Let's visualize this
head(lotto)

##   Draw.Date Winning.Numbers Cash.Ball
## 1   6/16/14  09 36 44 53 59         3
## 2   6/19/14  08 13 43 56 60         2
## 3   6/23/14  05 16 21 33 47         4
## 4   6/26/14  15 22 51 52 58         3
## 5   6/30/14  01 04 10 28 33         2
## 6    7/3/14  08 10 25 28 31         2

Next, let’s do some tidying in order to perform our analysis

# Next, let's create separate columns for each ball number
lotto.separated <- unlist(str_extract_all(lotto$Winning.Numbers, "(\\d)."))
lotto.separated <- as.numeric(lotto.separated)

lotto2 <- matrix(lotto.separated, ncol = 5, byrow = TRUE)
lotto2 <- as.data.frame(lotto2)

lotto.untidy <- data.frame(lotto$Draw.Date, lotto2, lotto$Cash.Ball)
colnames(lotto.untidy) <- c("Draw.Date", 1, 2, 3, 4, 5, "Cash.Ball")

# Let's visualize the new columns created
head(lotto.untidy)

##   Draw.Date  1  2  3  4  5 Cash.Ball
## 1   6/16/14  9 36 44 53 59         3
## 2   6/19/14  8 13 43 56 60         2
## 3   6/23/14  5 16 21 33 47         4
## 4   6/26/14 15 22 51 52 58         3
## 5   6/30/14  1  4 10 28 33         2
## 6    7/3/14  8 10 25 28 31         2

Let’s continue to clean up the data

lotto.tidy <- lotto.untidy %>% gather(BallOrder, Number, 2:6) %>% select(-Cash.Ball)
head(lotto.tidy)

##   Draw.Date BallOrder Number
## 1   6/16/14         1      9
## 2   6/19/14         1      8
## 3   6/23/14         1      5
## 4   6/26/14         1     15
## 5   6/30/14         1      1
## 6    7/3/14         1      8

Using a historgram, let’s visualize the data so we can continue analyzing and make some conclusions:

hist(lotto.tidy$Number, breaks = 60, main = "Winning Lotto Numbers", xlab = "Cash Ball Number", ylim = c(0,40), col = 'blue')

Ball.Number <- as.numeric(lotto.tidy$Number)
Ball.Number <- as.data.frame(table(Ball.Number))
head(Ball.Number)

##   Ball.Number Freq
## 1           1   22
## 2           2   16
## 3           3   28
## 4           4   27
## 5           5   30
## 6           6   26

Let’s see the frequency of the ball numbers drawn to see the highest and lowest frequencies.

Freq.Drawn <- Ball.Number %>% arrange(desc(Freq))

head(Freq.Drawn)

##   Ball.Number Freq
## 1          43   37
## 2          37   35
## 3           8   34
## 4          11   31
## 5          56   31
## 6           5   30

tail(Freq.Drawn)

##    Ball.Number Freq
## 55          33   17
## 56          39   17
## 57           2   16
## 58          50   16
## 59          15   15
## 60          42    9

Let’s see the average ball number to be drawn:

Ball.stat <- Freq.Drawn %>% summarise(Mean= mean(Freq))
Ball.stat <- as.numeric(Ball.stat)
Ball.stat

## [1] 23.75

This was a fun excercise and using visualizations and analysis we are able to make some conclusions while answering our initial questions:

The number ball with the lowest frequency = 42 While the highest frequency ball number = 43 The average ball number to be drawn = 23

Second Dataset: Drug use by Age

Question: In this excercise, I would like to find out if those within the 21 years of age continue or decrease their drug abuse till later in life (ie. 65+ years of age)?

#Let's begin by reading in the csv file from Github
drugs <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/drug-use-by-age.csv")
# Let's visualize it
head(drugs[,1:8])

##   age    n alcohol.use alcohol.frequency marijuana.use marijuana.frequency
## 1  12 2798         3.9                 3           1.1                   4
## 2  13 2757         8.5                 6           3.4                  15
## 3  14 2792        18.1                 5           8.7                  24
## 4  15 2956        29.2                 6          14.5                  25
## 5  16 3058        40.1                10          22.5                  30
## 6  17 3038        49.3                13          28.0                  36
##   cocaine.use cocaine.frequency
## 1         0.1               5.0
## 2         0.1               1.0
## 3         0.1               5.5
## 4         0.5               4.0
## 5         1.0               7.0
## 6         2.0               5.0

# We want to see which users used the various drugs, so we'll use .use
#Let's also move the data from wide format to long format
drugs <- drugs %>%
  select(age,ends_with(".use")) %>%
  gather(key = "Class",value = "Use",-age)

#Let's visualize it
head(drugs)

##   age       Class  Use
## 1  12 alcohol.use  3.9
## 2  13 alcohol.use  8.5
## 3  14 alcohol.use 18.1
## 4  15 alcohol.use 29.2
## 5  16 alcohol.use 40.1
## 6  17 alcohol.use 49.3

# Next, for the analysis part we want to chose the following age groups  21/30-34/65+
# Spread drug use data over age columns
drugs_sub <- drugs %>%
  filter(age %in% 21 | age %in% '30-34' | age %in% '65+') %>%
  spread(age,Use)
drugs_sub

##                Class   21 30-34  65+
## 1        alcohol.use 83.2  77.5 49.3
## 2        cocaine.use  4.8   2.1  0.0
## 3          crack.use  0.5   0.5  0.0
## 4   hallucinogen.use  6.3   1.8  0.1
## 5         heroin.use  0.6   0.4  0.0
## 6       inhalant.use  1.4   0.4  0.0
## 7      marijuana.use 33.0  16.4  1.2
## 8           meth.use  0.6   0.4  0.0
## 9      oxycontin.use  1.3   0.9  0.0
## 10 pain.releiver.use  9.0   5.9  0.6
## 11      sedative.use  0.3   0.4  0.0
## 12     stimulant.use  4.1   1.4  0.0
## 13  tranquilizer.use  3.9   3.6  0.2

In conclusion:

Yes, we can say that by comparing drug use starting at 21 years of age to 65+ years of age, there is a significant drop. Marijuana is down at 1.2% compared to its use at age 21 at 33%. Similarly, alcohol abuse is down starting at 83.2% to less than 50% by age 65+.

Third Dataset: Shipping Fees per country

When looking at this messy dataset, there was a strong requirement for tidying hence my choice of using this dataset.

Question we want to answer: What are the shipping costs for each country and at a monthly rate? Also, are these shipping fees feasible for companies in regards to their own costs?

#Let's begin by reading in the csv file from Github
ShippingFees <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/ShippingAnalysis.csv", header = TRUE, sep = ",")
ShippingFees

##        Date                  Values  China    Canada Taiwan Singapore
## 1  12/29/16        Price of Carrier    518        NA     NA        NA
## 2           Shipping Fees Collected    260        NA     NA        NA
## 3  12/30/16        Price of Carrier    232        NA     NA        NA
## 4           Shipping Fees Collected    132        NA     NA        NA
## 5    1/3/17        Price of Carrier 1143.5        NA     NA        NA
## 6           Shipping Fees Collected    752        NA     NA        NA
## 7    1/4/17        Price of Carrier  732.5        NA     NA        NA
## 8           Shipping Fees Collected    448        NA     NA        NA
## 9    1/5/17        Price of Carrier    443        NA     NA        NA
## 10          Shipping Fees Collected    336        NA     NA        NA
## 11   1/6/17        Price of Carrier  480.5        NA     NA        NA
## 12          Shipping Fees Collected    236        NA     NA        NA
## 13   1/9/17        Price of Carrier  439.5 193.50000     NA        NA
## 14          Shipping Fees Collected    316 228.00000     NA        NA
## 15  1/10/17        Price of Carrier    318 279.55994     NA        NA
## 16          Shipping Fees Collected    108 336.00000     NA        NA
## 17  1/11/17        Price of Carrier    325        NA     NA 130.50000
## 18          Shipping Fees Collected    168        NA     NA  30.00000
## 19  1/12/17        Price of Carrier    230 116.99997     NA        NA
## 20          Shipping Fees Collected    132 150.00000     NA        NA
## 21  1/13/17        Price of Carrier    229        NA     NA        NA
## 22          Shipping Fees Collected   #N/A        NA     NA        NA
## 23  1/16/17        Price of Carrier  189.5        NA     34  53.99999
## 24          Shipping Fees Collected    144        NA     45  72.00000
## 25  1/17/17        Price of Carrier  470.5        NA     NA        NA
## 26          Shipping Fees Collected    196        NA     NA        NA
## 27  1/18/17        Price of Carrier  311.5        NA     13  58.79999
## 28          Shipping Fees Collected    212        NA      0  66.00000
## 29  1/19/17        Price of Carrier  266.5        NA     13        NA
## 30          Shipping Fees Collected     92        NA     20        NA
## 31  1/20/17        Price of Carrier    182        NA     NA        NA
## 32          Shipping Fees Collected    148        NA     NA        NA
## 33  1/23/17        Price of Carrier    117  52.20000    111        NA
## 34          Shipping Fees Collected     64  66.00000     50        NA
## 35  1/24/17        Price of Carrier  144.5  76.49999     NA        NA
## 36          Shipping Fees Collected     64  90.00000     NA        NA
## 37  1/25/17        Price of Carrier   95.5        NA     NA  27.00000
## 38          Shipping Fees Collected     76        NA     NA  36.00000
## 39  1/26/17        Price of Carrier     84  53.99999     NA        NA
## 40          Shipping Fees Collected     40  66.00000     NA        NA
## 41  1/27/17        Price of Carrier    121        NA     NA  31.50000
## 42          Shipping Fees Collected     48        NA     NA  36.00000
## 43  1/30/17        Price of Carrier  135.5        NA     NA        NA
## 44          Shipping Fees Collected     92        NA     NA        NA
## 45  1/31/17        Price of Carrier   33.5  31.50000     NA        NA
## 46          Shipping Fees Collected     16  42.00000     NA        NA
## 47   2/1/17        Price of Carrier  172.5  31.50000     NA        NA
## 48          Shipping Fees Collected     76  42.00000     NA        NA
## 49   2/2/17        Price of Carrier    165        NA     NA        NA
## 50          Shipping Fees Collected     40        NA     NA        NA
## 51   2/3/17        Price of Carrier    166 125.99998     13        NA
## 52          Shipping Fees Collected    100 156.00000     20        NA
## 53   2/6/17        Price of Carrier    343  58.50000     68        NA
## 54          Shipping Fees Collected    236  78.00000     90        NA
## 55   2/7/17        Price of Carrier               NA     13        NA
## 56          Shipping Fees Collected               NA     20        NA
##           US   England Korea Hong.Kong
## 1         NA        NA    NA        NA
## 2         NA        NA    NA        NA
## 3         NA        NA    NA        NA
## 4         NA        NA    NA        NA
## 5         NA        NA    NA        NA
## 6         NA        NA    NA        NA
## 7         NA        NA    NA        NA
## 8         NA        NA    NA        NA
## 9         NA        NA    NA        NA
## 10        NA        NA    NA        NA
## 11        NA        NA    NA        NA
## 12        NA        NA    NA        NA
## 13        NA        NA  31.5        13
## 14        NA        NA  36.0        20
## 15        NA        NA    NA        13
## 16        NA        NA    NA        20
## 17        NA        NA    NA        NA
## 18        NA        NA    NA        NA
## 19        NA        NA    NA        NA
## 20        NA        NA    NA        NA
## 21        NA        NA    NA        NA
## 22        NA        NA    NA        NA
## 23        NA        NA    NA        NA
## 24        NA        NA    NA        NA
## 25        NA        NA    NA        33
## 26        NA        NA    NA        45
## 27        NA        NA    NA        NA
## 28        NA        NA    NA        NA
## 29        NA        NA    NA        NA
## 30        NA        NA    NA        NA
## 31  20.92996        NA    NA        NA
## 32  85.87000        NA    NA        NA
## 33        NA        NA    NA        NA
## 34        NA        NA    NA        NA
## 35        NA        NA    NA        NA
## 36        NA        NA    NA        NA
## 37        NA        NA    NA        NA
## 38        NA        NA    NA        NA
## 39  37.21999        NA    NA        NA
## 40  29.82000        NA    NA        NA
## 41        NA        NA    NA        NA
## 42        NA        NA    NA        NA
## 43        NA        NA    NA        NA
## 44        NA        NA    NA        NA
## 45  38.51000        NA    NA        NA
## 46  33.23000        NA    NA        NA
## 47 146.80994        NA    NA        NA
## 48 121.25000        NA    NA        NA
## 49        NA        NA    NA        NA
## 50        NA        NA    NA        NA
## 51  10.33000  31.50000    NA        NA
## 52   0.00000  42.00000    NA        NA
## 53  94.10999  89.99997    NA       135
## 54  97.39000 120.00000    NA       115
## 55  87.83999        NA    NA        NA
## 56  70.28000        NA    NA        NA

#Next, let's start cleaning up the data
ShippingFees[c(FALSE, TRUE),1] <- ShippingFees[c(TRUE, FALSE), 1]
# We can change the position of the data by collapsing it into a narrow format, getting rid of NAs, and show the Cost and Collected values in two columns
ShippingFees <- ShippingFees %>% 
  gather(Country, Amount, 3:10) %>% 
  filter(!is.na(Amount)) %>%            
  spread(Values, Amount)

## Warning: attributes are not identical across measure variables;
## they will be dropped

colnames(ShippingFees) <- c("Date", "Country", "Cost", "Collected")

# We want to change the cost and collected amounts to numbers
ShippingFees$Cost <- as.numeric(ShippingFees$Cost)
ShippingFees$Collected <- as.numeric(ShippingFees$Collected)

## Warning: NAs introduced by coercion

# Let's take out any records which didn't have a numeric Cost or Collected value
ShippingFees <- ShippingFees %>% filter(!is.na(Cost), !is.na(Collected))

head(ShippingFees)

##      Date   Country     Cost Collected
## 1 1/10/17    Canada 279.5599       336
## 2 1/10/17     China 318.0000       108
## 3 1/10/17 Hong.Kong  13.0000        20
## 4 1/11/17     China 325.0000       168
## 5 1/11/17 Singapore 130.5000        30
## 6 1/12/17    Canada 117.0000       150

tail(ShippingFees)

##      Date   Country      Cost Collected
## 57 2/6/17   England  89.99997    120.00
## 58 2/6/17 Hong.Kong 134.99998    115.00
## 59 2/6/17    Taiwan  68.00000     90.00
## 60 2/6/17        US  94.10999     97.39
## 61 2/7/17    Taiwan  13.00000     20.00
## 62 2/7/17        US  87.83999     70.28

# Let's begin the data analysis by looking at data for individual countries. We can look at the totol amounts for each country and also see the amount differences.
countrydata <- ShippingFees %>% 
  group_by(Country) %>% 
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%   
  mutate(Total = SumCollected - SumCost) %>%                         
  arrange(Total)

#Let's visualize this specific country data
head(countrydata)

## # A tibble: 6 x 4
##   Country   SumCost SumCollected    Total
##   <chr>       <dbl>        <dbl>    <dbl>
## 1 China      7860.         4532  -3328.  
## 2 Singapore   302.          240    -61.8 
## 3 Taiwan      265           245    -20   
## 4 US          436.          438.     2.09
## 5 Korea        31.5          36      4.50
## 6 Hong.Kong   194.          200      6.00

tail(countrydata)

## # A tibble: 6 x 4
##   Country   SumCost SumCollected   Total
##   <chr>       <dbl>        <dbl>   <dbl>
## 1 Taiwan      265           245   -20   
## 2 US          436.          438.    2.09
## 3 Korea        31.5          36     4.50
## 4 Hong.Kong   194.          200     6.00
## 5 England     121.          162    40.5 
## 6 Canada     1020.         1254   234.

# Next for analysis, let's retrieve the monthly data by splitting date into components; grouping by the month, year; converting month to number for sorting, combining month and year for output, sorting, removing grouping and selecting the field for output
monthdata <- ShippingFees %>% 
  separate(Date, c("Month", "Day", "Year"), sep = "/", remove=FALSE) %>%  
  group_by(Month, Year) %>%                                                
  summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
  mutate(Total = SumCollected - SumCost, 
         MonthNo = as.numeric(Month),                      
         YearNo = as.numeric(Year),                        
         Period = paste(Month, Year, sep="/")) %>%         
  arrange(YearNo, MonthNo) %>%                             
  group_by() %>%                                           
  select(Period, SumCost, SumCollected, Total)     

#Lets visualize this 
monthdata

## # A tibble: 3 x 4
##   Period SumCost SumCollected  Total
##   <chr>    <dbl>        <dbl>  <dbl>
## 1 12/16     750          392   -358 
## 2 1/17     7727.        5291. -2436.
## 3 2/17     1752.        1424.  -328.

We can say that of all the countries, China has the largest cost. If one was to ship to Canada, it can be worth it making it profitable. Also, overall it doesn’t seem feasible for companies with these shipping costs in regards to profit.

Data607 Project 2

RSingh

3/16/2018

Goal of this Project:

First Dataset

Second Dataset: Drug use by Age

Third Dataset: Shipping Fees per country