The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: 1. Choose any three of the “wide” datasets. For each of the three chosen datasets:
FALSE Loading required package: DBI
FALSE Loading required package: bitops
FALSE
FALSE Attaching package: 'tidyr'
FALSE The following object is masked from 'package:RCurl':
FALSE
FALSE complete
FALSE
FALSE Attaching package: 'dplyr'
FALSE The following objects are masked from 'package:stats':
FALSE
FALSE filter, lag
FALSE The following objects are masked from 'package:base':
FALSE
FALSE intersect, setdiff, setequal, union
FALSE
FALSE Attaching package: 'psych'
FALSE The following objects are masked from 'package:ggplot2':
FALSE
FALSE %+%, alpha
FALSE
FALSE Attaching package: 'plotly'
FALSE The following object is masked from 'package:ggplot2':
FALSE
FALSE last_plot
FALSE The following object is masked from 'package:stats':
FALSE
FALSE filter
FALSE The following object is masked from 'package:graphics':
FALSE
FALSE layout
NYC LOTTO- Let’s see how we can increase the odds and win the lottery ! Let’s try to find out which numbers have a high frequency, and which numbers a player should potentially avoid that carry a low frequency.
Dataset original source: https://data.ny.gov/Government-Finance/Lottery-Cash-4-Life-Winning-Numbers-Beginning-2014/kwxv-fwze
# Let's begin by reading in the LotteryCash4Life csv file from github
lotto <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/LotteryCash4Life2014.csv", header = TRUE)
# Let's visualize this
head(lotto)
## Draw.Date Winning.Numbers Cash.Ball
## 1 6/16/14 09 36 44 53 59 3
## 2 6/19/14 08 13 43 56 60 2
## 3 6/23/14 05 16 21 33 47 4
## 4 6/26/14 15 22 51 52 58 3
## 5 6/30/14 01 04 10 28 33 2
## 6 7/3/14 08 10 25 28 31 2
Next, let’s do some tidying in order to perform our analysis
# Next, let's create separate columns for each ball number
lotto.separated <- unlist(str_extract_all(lotto$Winning.Numbers, "(\\d)."))
lotto.separated <- as.numeric(lotto.separated)
lotto2 <- matrix(lotto.separated, ncol = 5, byrow = TRUE)
lotto2 <- as.data.frame(lotto2)
lotto.untidy <- data.frame(lotto$Draw.Date, lotto2, lotto$Cash.Ball)
colnames(lotto.untidy) <- c("Draw.Date", 1, 2, 3, 4, 5, "Cash.Ball")
# Let's visualize the new columns created
head(lotto.untidy)
## Draw.Date 1 2 3 4 5 Cash.Ball
## 1 6/16/14 9 36 44 53 59 3
## 2 6/19/14 8 13 43 56 60 2
## 3 6/23/14 5 16 21 33 47 4
## 4 6/26/14 15 22 51 52 58 3
## 5 6/30/14 1 4 10 28 33 2
## 6 7/3/14 8 10 25 28 31 2
Let’s continue to clean up the data
lotto.tidy <- lotto.untidy %>% gather(BallOrder, Number, 2:6) %>% select(-Cash.Ball)
head(lotto.tidy)
## Draw.Date BallOrder Number
## 1 6/16/14 1 9
## 2 6/19/14 1 8
## 3 6/23/14 1 5
## 4 6/26/14 1 15
## 5 6/30/14 1 1
## 6 7/3/14 1 8
Using a historgram, let’s visualize the data so we can continue analyzing and make some conclusions:
hist(lotto.tidy$Number, breaks = 60, main = "Winning Lotto Numbers", xlab = "Cash Ball Number", ylim = c(0,40), col = 'blue')
Ball.Number <- as.numeric(lotto.tidy$Number)
Ball.Number <- as.data.frame(table(Ball.Number))
head(Ball.Number)
## Ball.Number Freq
## 1 1 22
## 2 2 16
## 3 3 28
## 4 4 27
## 5 5 30
## 6 6 26
Let’s see the frequency of the ball numbers drawn to see the highest and lowest frequencies.
Freq.Drawn <- Ball.Number %>% arrange(desc(Freq))
head(Freq.Drawn)
## Ball.Number Freq
## 1 43 37
## 2 37 35
## 3 8 34
## 4 11 31
## 5 56 31
## 6 5 30
tail(Freq.Drawn)
## Ball.Number Freq
## 55 33 17
## 56 39 17
## 57 2 16
## 58 50 16
## 59 15 15
## 60 42 9
Let’s see the average ball number to be drawn:
Ball.stat <- Freq.Drawn %>% summarise(Mean= mean(Freq))
Ball.stat <- as.numeric(Ball.stat)
Ball.stat
## [1] 23.75
This was a fun excercise and using visualizations and analysis we are able to make some conclusions while answering our initial questions:
The number ball with the lowest frequency = 42 While the highest frequency ball number = 43 The average ball number to be drawn = 23
Question: In this excercise, I would like to find out if those within the 21 years of age continue or decrease their drug abuse till later in life (ie. 65+ years of age)?
#Let's begin by reading in the csv file from Github
drugs <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/drug-use-by-age.csv")
# Let's visualize it
head(drugs[,1:8])
## age n alcohol.use alcohol.frequency marijuana.use marijuana.frequency
## 1 12 2798 3.9 3 1.1 4
## 2 13 2757 8.5 6 3.4 15
## 3 14 2792 18.1 5 8.7 24
## 4 15 2956 29.2 6 14.5 25
## 5 16 3058 40.1 10 22.5 30
## 6 17 3038 49.3 13 28.0 36
## cocaine.use cocaine.frequency
## 1 0.1 5.0
## 2 0.1 1.0
## 3 0.1 5.5
## 4 0.5 4.0
## 5 1.0 7.0
## 6 2.0 5.0
# We want to see which users used the various drugs, so we'll use .use
#Let's also move the data from wide format to long format
drugs <- drugs %>%
select(age,ends_with(".use")) %>%
gather(key = "Class",value = "Use",-age)
#Let's visualize it
head(drugs)
## age Class Use
## 1 12 alcohol.use 3.9
## 2 13 alcohol.use 8.5
## 3 14 alcohol.use 18.1
## 4 15 alcohol.use 29.2
## 5 16 alcohol.use 40.1
## 6 17 alcohol.use 49.3
# Next, for the analysis part we want to chose the following age groups 21/30-34/65+
# Spread drug use data over age columns
drugs_sub <- drugs %>%
filter(age %in% 21 | age %in% '30-34' | age %in% '65+') %>%
spread(age,Use)
drugs_sub
## Class 21 30-34 65+
## 1 alcohol.use 83.2 77.5 49.3
## 2 cocaine.use 4.8 2.1 0.0
## 3 crack.use 0.5 0.5 0.0
## 4 hallucinogen.use 6.3 1.8 0.1
## 5 heroin.use 0.6 0.4 0.0
## 6 inhalant.use 1.4 0.4 0.0
## 7 marijuana.use 33.0 16.4 1.2
## 8 meth.use 0.6 0.4 0.0
## 9 oxycontin.use 1.3 0.9 0.0
## 10 pain.releiver.use 9.0 5.9 0.6
## 11 sedative.use 0.3 0.4 0.0
## 12 stimulant.use 4.1 1.4 0.0
## 13 tranquilizer.use 3.9 3.6 0.2
In conclusion:
Yes, we can say that by comparing drug use starting at 21 years of age to 65+ years of age, there is a significant drop. Marijuana is down at 1.2% compared to its use at age 21 at 33%. Similarly, alcohol abuse is down starting at 83.2% to less than 50% by age 65+.
When looking at this messy dataset, there was a strong requirement for tidying hence my choice of using this dataset.
Question we want to answer: What are the shipping costs for each country and at a monthly rate? Also, are these shipping fees feasible for companies in regards to their own costs?
#Let's begin by reading in the csv file from Github
ShippingFees <- read.csv("https://github.com/rickidonsingh/Data607/raw/master/ShippingAnalysis.csv", header = TRUE, sep = ",")
ShippingFees
## Date Values China Canada Taiwan Singapore
## 1 12/29/16 Price of Carrier 518 NA NA NA
## 2 Shipping Fees Collected 260 NA NA NA
## 3 12/30/16 Price of Carrier 232 NA NA NA
## 4 Shipping Fees Collected 132 NA NA NA
## 5 1/3/17 Price of Carrier 1143.5 NA NA NA
## 6 Shipping Fees Collected 752 NA NA NA
## 7 1/4/17 Price of Carrier 732.5 NA NA NA
## 8 Shipping Fees Collected 448 NA NA NA
## 9 1/5/17 Price of Carrier 443 NA NA NA
## 10 Shipping Fees Collected 336 NA NA NA
## 11 1/6/17 Price of Carrier 480.5 NA NA NA
## 12 Shipping Fees Collected 236 NA NA NA
## 13 1/9/17 Price of Carrier 439.5 193.50000 NA NA
## 14 Shipping Fees Collected 316 228.00000 NA NA
## 15 1/10/17 Price of Carrier 318 279.55994 NA NA
## 16 Shipping Fees Collected 108 336.00000 NA NA
## 17 1/11/17 Price of Carrier 325 NA NA 130.50000
## 18 Shipping Fees Collected 168 NA NA 30.00000
## 19 1/12/17 Price of Carrier 230 116.99997 NA NA
## 20 Shipping Fees Collected 132 150.00000 NA NA
## 21 1/13/17 Price of Carrier 229 NA NA NA
## 22 Shipping Fees Collected #N/A NA NA NA
## 23 1/16/17 Price of Carrier 189.5 NA 34 53.99999
## 24 Shipping Fees Collected 144 NA 45 72.00000
## 25 1/17/17 Price of Carrier 470.5 NA NA NA
## 26 Shipping Fees Collected 196 NA NA NA
## 27 1/18/17 Price of Carrier 311.5 NA 13 58.79999
## 28 Shipping Fees Collected 212 NA 0 66.00000
## 29 1/19/17 Price of Carrier 266.5 NA 13 NA
## 30 Shipping Fees Collected 92 NA 20 NA
## 31 1/20/17 Price of Carrier 182 NA NA NA
## 32 Shipping Fees Collected 148 NA NA NA
## 33 1/23/17 Price of Carrier 117 52.20000 111 NA
## 34 Shipping Fees Collected 64 66.00000 50 NA
## 35 1/24/17 Price of Carrier 144.5 76.49999 NA NA
## 36 Shipping Fees Collected 64 90.00000 NA NA
## 37 1/25/17 Price of Carrier 95.5 NA NA 27.00000
## 38 Shipping Fees Collected 76 NA NA 36.00000
## 39 1/26/17 Price of Carrier 84 53.99999 NA NA
## 40 Shipping Fees Collected 40 66.00000 NA NA
## 41 1/27/17 Price of Carrier 121 NA NA 31.50000
## 42 Shipping Fees Collected 48 NA NA 36.00000
## 43 1/30/17 Price of Carrier 135.5 NA NA NA
## 44 Shipping Fees Collected 92 NA NA NA
## 45 1/31/17 Price of Carrier 33.5 31.50000 NA NA
## 46 Shipping Fees Collected 16 42.00000 NA NA
## 47 2/1/17 Price of Carrier 172.5 31.50000 NA NA
## 48 Shipping Fees Collected 76 42.00000 NA NA
## 49 2/2/17 Price of Carrier 165 NA NA NA
## 50 Shipping Fees Collected 40 NA NA NA
## 51 2/3/17 Price of Carrier 166 125.99998 13 NA
## 52 Shipping Fees Collected 100 156.00000 20 NA
## 53 2/6/17 Price of Carrier 343 58.50000 68 NA
## 54 Shipping Fees Collected 236 78.00000 90 NA
## 55 2/7/17 Price of Carrier NA 13 NA
## 56 Shipping Fees Collected NA 20 NA
## US England Korea Hong.Kong
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 NA NA NA NA
## 11 NA NA NA NA
## 12 NA NA NA NA
## 13 NA NA 31.5 13
## 14 NA NA 36.0 20
## 15 NA NA NA 13
## 16 NA NA NA 20
## 17 NA NA NA NA
## 18 NA NA NA NA
## 19 NA NA NA NA
## 20 NA NA NA NA
## 21 NA NA NA NA
## 22 NA NA NA NA
## 23 NA NA NA NA
## 24 NA NA NA NA
## 25 NA NA NA 33
## 26 NA NA NA 45
## 27 NA NA NA NA
## 28 NA NA NA NA
## 29 NA NA NA NA
## 30 NA NA NA NA
## 31 20.92996 NA NA NA
## 32 85.87000 NA NA NA
## 33 NA NA NA NA
## 34 NA NA NA NA
## 35 NA NA NA NA
## 36 NA NA NA NA
## 37 NA NA NA NA
## 38 NA NA NA NA
## 39 37.21999 NA NA NA
## 40 29.82000 NA NA NA
## 41 NA NA NA NA
## 42 NA NA NA NA
## 43 NA NA NA NA
## 44 NA NA NA NA
## 45 38.51000 NA NA NA
## 46 33.23000 NA NA NA
## 47 146.80994 NA NA NA
## 48 121.25000 NA NA NA
## 49 NA NA NA NA
## 50 NA NA NA NA
## 51 10.33000 31.50000 NA NA
## 52 0.00000 42.00000 NA NA
## 53 94.10999 89.99997 NA 135
## 54 97.39000 120.00000 NA 115
## 55 87.83999 NA NA NA
## 56 70.28000 NA NA NA
#Next, let's start cleaning up the data
ShippingFees[c(FALSE, TRUE),1] <- ShippingFees[c(TRUE, FALSE), 1]
# We can change the position of the data by collapsing it into a narrow format, getting rid of NAs, and show the Cost and Collected values in two columns
ShippingFees <- ShippingFees %>%
gather(Country, Amount, 3:10) %>%
filter(!is.na(Amount)) %>%
spread(Values, Amount)
## Warning: attributes are not identical across measure variables;
## they will be dropped
colnames(ShippingFees) <- c("Date", "Country", "Cost", "Collected")
# We want to change the cost and collected amounts to numbers
ShippingFees$Cost <- as.numeric(ShippingFees$Cost)
ShippingFees$Collected <- as.numeric(ShippingFees$Collected)
## Warning: NAs introduced by coercion
# Let's take out any records which didn't have a numeric Cost or Collected value
ShippingFees <- ShippingFees %>% filter(!is.na(Cost), !is.na(Collected))
head(ShippingFees)
## Date Country Cost Collected
## 1 1/10/17 Canada 279.5599 336
## 2 1/10/17 China 318.0000 108
## 3 1/10/17 Hong.Kong 13.0000 20
## 4 1/11/17 China 325.0000 168
## 5 1/11/17 Singapore 130.5000 30
## 6 1/12/17 Canada 117.0000 150
tail(ShippingFees)
## Date Country Cost Collected
## 57 2/6/17 England 89.99997 120.00
## 58 2/6/17 Hong.Kong 134.99998 115.00
## 59 2/6/17 Taiwan 68.00000 90.00
## 60 2/6/17 US 94.10999 97.39
## 61 2/7/17 Taiwan 13.00000 20.00
## 62 2/7/17 US 87.83999 70.28
# Let's begin the data analysis by looking at data for individual countries. We can look at the totol amounts for each country and also see the amount differences.
countrydata <- ShippingFees %>%
group_by(Country) %>%
summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
mutate(Total = SumCollected - SumCost) %>%
arrange(Total)
#Let's visualize this specific country data
head(countrydata)
## # A tibble: 6 x 4
## Country SumCost SumCollected Total
## <chr> <dbl> <dbl> <dbl>
## 1 China 7860. 4532 -3328.
## 2 Singapore 302. 240 -61.8
## 3 Taiwan 265 245 -20
## 4 US 436. 438. 2.09
## 5 Korea 31.5 36 4.50
## 6 Hong.Kong 194. 200 6.00
tail(countrydata)
## # A tibble: 6 x 4
## Country SumCost SumCollected Total
## <chr> <dbl> <dbl> <dbl>
## 1 Taiwan 265 245 -20
## 2 US 436. 438. 2.09
## 3 Korea 31.5 36 4.50
## 4 Hong.Kong 194. 200 6.00
## 5 England 121. 162 40.5
## 6 Canada 1020. 1254 234.
# Next for analysis, let's retrieve the monthly data by splitting date into components; grouping by the month, year; converting month to number for sorting, combining month and year for output, sorting, removing grouping and selecting the field for output
monthdata <- ShippingFees %>%
separate(Date, c("Month", "Day", "Year"), sep = "/", remove=FALSE) %>%
group_by(Month, Year) %>%
summarise(SumCost = sum(Cost), SumCollected = sum(Collected)) %>%
mutate(Total = SumCollected - SumCost,
MonthNo = as.numeric(Month),
YearNo = as.numeric(Year),
Period = paste(Month, Year, sep="/")) %>%
arrange(YearNo, MonthNo) %>%
group_by() %>%
select(Period, SumCost, SumCollected, Total)
#Lets visualize this
monthdata
## # A tibble: 3 x 4
## Period SumCost SumCollected Total
## <chr> <dbl> <dbl> <dbl>
## 1 12/16 750 392 -358
## 2 1/17 7727. 5291. -2436.
## 3 2/17 1752. 1424. -328.
We can say that of all the countries, China has the largest cost. If one was to ship to Canada, it can be worth it making it profitable. Also, overall it doesn’t seem feasible for companies with these shipping costs in regards to profit.