The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
The second dataset I chose because I wanted a challenge and this dataset is particularly ‘untidy’ data! The original file is located on googlesheets, so I just downloaded the file as a csv for importing to R.
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project2/DS2-Raffle%20Prizes.csv")
raffle <-data.frame(read.csv(text=x, header=FALSE, stringsAsFactors = FALSE))
Look and the data structure as well as the first and last few rows of data to get a sense of what’s there.
library(tidyverse, quietly=TRUE)
dim(raffle)
## [1] 125 8
glimpse(raffle)
## Observations: 125
## Variables: 8
## $ V1 <chr> "association Golf Outing", "Raffle Prizes", "", "", "quant...
## $ V2 <chr> "", "", "", "", "Item", "", "2016", " Golf Shirts ", " Golf...
## $ V3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V4 <chr> "", "", "as of ", "", "Cost", "", "", "20 each", "40", "200...
## $ V5 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V6 <chr> "", "", "3/15/2018", "", "Status", "", "", "From Tukuru", "...
## $ V7 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V8 <chr> "", "", "", "", "Year", "", "", "", "", "", "", "", "", "",...
head(raffle)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 association Golf Outing NA NA NA
## 2 Raffle Prizes NA NA NA
## 3 NA as of NA 3/15/2018 NA
## 4 NA NA NA
## 5 quantity Item NA Cost NA Status NA Year
## 6 NA NA NA
tail(raffle)
## V1 V2 V3 V4 V5
## 120 4 AMC movie passes NA 50.00 NA
## 121 1 Portable Grill NA 240.00 NA
## 122 5 5 dozen Golf Balls NA 60.00 NA
## 123 5 5 dozen Golf Balls NA 60.00 NA
## 124 1 Gift Basket NA 40.00 NA
## 125 1 ASCE 7 on CD NA 20.00 NA
## V6 V7 V8
## 120 From Matt B NA
## 121 From JFT NA
## 122 From DM NA
## 123 From DM NA
## 124 From DM NA
## 125 From Whatever the hell that is from Joe NA
This data is so messy it’s hard to see what’s going on, but it looks like columns V3, V5, V7, and V8 do not contain any data at all. Columns V3, V5, and V7 are all NA’s and column V8 only has the word ‘Year’ in one observation and the rest look like empty strings. But just to be sure I will use dplyr to look for unique values in each of those columns.
columns <- paste('V',c(3,5,7,8), sep="")
columns
## [1] "V3" "V5" "V7" "V8"
func <- function(column, df){
df %>%
group_by_(column) %>%
summarise(val = n())
}
unlist(lapply(columns, func, raffle))
## V3 val V5 val V7 val V81 V82 val1 val2
## NA "125" NA "125" NA "125" "" "Year" "124" "1"
The results above confirm that those four columns do not contain any data. The word year is the only value entered in the V8 column but that really belongs in the header and there’s nothing else in that column so I can remove it along with the other 3 columns…
I’ll use a select function to remove the empty columns identified above and while I am at it remove the empty/useless rows at the top of the dataframe as well.
tidyraffle <- tail(raffle, 121) %>%
select(-c(3,5,7,8))
head(tidyraffle)
## V1 V2 V4 V6
## 5 quantity Item Cost Status
## 6
## 7 2016
## 8 4 Golf Shirts 20 each From Tukuru
## 9 1 Golf Windvest 40 From AE
## 10 4 Individual Golf Clubs sets 2000 From Matt B
tail(tidyraffle)
## V1 V2 V4 V6
## 120 4 AMC movie passes 50.00 From Matt B
## 121 1 Portable Grill 240.00 From JFT
## 122 5 5 dozen Golf Balls 60.00 From DM
## 123 5 5 dozen Golf Balls 60.00 From DM
## 124 1 Gift Basket 40.00 From DM
## 125 1 ASCE 7 on CD 20.00 From Whatever the hell that is from Joe
colnames(tidyraffle) <- tidyraffle[1, ] # the first row will be the header
tidyraffle <- tidyraffle[-c(1,2), ] # removing the first 2 (header and blank) rows
rownames(tidyraffle) <- NULL
head(tidyraffle)
## quantity Item Cost
## 1 2016
## 2 4 Golf Shirts 20 each
## 3 1 Golf Windvest 40
## 4 4 Individual Golf Clubs sets 2000
## 5 1 Surface Tablet 350.00
## 6 1 Whitney 300
## Status
## 1
## 2 From Tukuru
## 3 From AE
## 4 From Matt B
## 5 From Tukuru to be delivered 6/9/16
## 6 From Tukuru
First print the Item column to see what to filter out based on this column. This is the only column that consistently has data in it so if it is blank we can remove that row.
tidyraffle$Item
## [1] "2016"
## [2] " Golf Shirts "
## [3] " Golf Windvest "
## [4] " Individual Golf Clubs sets "
## [5] " Surface Tablet "
## [6] " Whitney "
## [7] " Drone "
## [8] " Ipad "
## [9] " laptop "
## [10] " 35\" flat screen "
## [11] " Noise cancelling headphones "
## [12] " Blue Tooth speakers "
## [13] " Gopro camera "
## [14] ""
## [15] ""
## [16] ""
## [17] ""
## [18] "2015"
## [19] " Golf Shirts "
## [20] " Golf Windvest "
## [21] " Individual Golf Clubs "
## [22] " Think Pad "
## [23] " tickets and a catalgue for the Whitney "
## [24] " go pro camera "
## [25] " Ipad "
## [26] " laptop "
## [27] " 35\" flat screen "
## [28] " Noise cancelling headphones "
## [29] ""
## [30] " All inclusive American Museum of Natural History "
## [31] ""
## [32] "2014"
## [33] " Golf Shirts "
## [34] " Golf Windvest "
## [35] " Individual Golf Clubs "
## [36] " City Sightseeing Tour of Manhattan "
## [37] " Wine "
## [38] " Lap top (extra per JFT for HS kid) "
## [39] " Movie passes "
## [40] " 39 inch insignia flat screen LED "
## [41] " Noise canceling head phones "
## [42] " Ipads "
## [43] " Pro Shop gft certificate "
## [44] " tickets and a catalgue for the Whitney "
## [45] " All inclusive American Museum of Natural History "
## [46] " Sony Handycam Camera Recorded "
## [47] ""
## [48] "2013"
## [49] " Golf Shirts "
## [50] " Golf Windvest "
## [51] " Individual Golf Clubs "
## [52] " Case of Walden Woods Wine "
## [53] " Flat Screen TV "
## [54] " Ipad mini's (16 GB) "
## [55] " Ipad Mini (32 GB) "
## [56] " Noise Canceling Head Phones "
## [57] " Yankee Tickets "
## [58] " Hyland "
## [59] " gift certificates for $97 each (prescription "
## [60] ""
## [61] " tickets and a catalgue for the Whitney "
## [62] " tickets to Yankees Rays game 6/22/13 "
## [63] ""
## [64] "2012"
## [65] " Golf Shirts "
## [66] " Golf Windvest "
## [67] " Individual Golf Clubs "
## [68] " All-inclusive Museum of Natural history passes "
## [69] ""
## [70] " Flat Screen TV "
## [71] " Kindle Fire "
## [72] " Ipad "
## [73] " Appleton Rum (and game on 8th hole) "
## [74] " Wine "
## [75] " Wine related platters and cheese "
## [76] " 40 Below Vodka "
## [77] " Sony Handman DCR-SX44 mini "
## [78] " camcorder "
## [79] " Dozen Golf Balls "
## [80] ""
## [81] "2011"
## [82] " Golf Shirts"
## [83] " Golf Windvest"
## [84] " Individual Golf Clubs"
## [85] " Dozen golf balls "
## [86] "Remaining Golf Balls"
## [87] "Chances"
## [88] "Dinner for 2 ($150) for Tribeca Grill"
## [89] "$100 Gift Certificate to \"The Stand\""
## [90] "The Kindle"
## [91] "Flat Screen TV"
## [92] "NBA Package (Satchel. T shirt, "
## [93] " water bottle, post its and holder)"
## [94] "HP Mini Lap Top"
## [95] "iPad (32 G)"
## [96] "Case of 40below Vodka"
## [97] ""
## [98] "2010"
## [99] "Gotham Restaurant certificate"
## [100] "iPod Touch"
## [101] "Digital Camera"
## [102] "Flat Screen TV"
## [103] "Ipad"
## [104] "Net Flix 6 month membership"
## [105] "10 Downing Restaurant GC"
## [106] "5 Ninth Restaurant GC"
## [107] "$25 Gift Certificate to Garden of Eden"
## [108] "$150 Gift Certificate from Artisanal"
## [109] ""
## [110] "$100 Gift Certificate to \"The Stand\""
## [111] ""
## [112] "2009"
## [113] "GPS"
## [114] "AMC movie passes"
## [115] "Portable Grill"
## [116] "5 dozen Golf Balls"
## [117] "5 dozen Golf Balls"
## [118] "Gift Basket"
## [119] "ASCE 7 on CD"
# Remove all rows with an empty string in the Item column
tidyraffle <- tidyraffle %>% filter(Item != "")
# Confirm that blank rows are gone...
head(tidyraffle, 20)
## quantity Item Cost
## 1 2016
## 2 4 Golf Shirts 20 each
## 3 1 Golf Windvest 40
## 4 4 Individual Golf Clubs sets 2000
## 5 1 Surface Tablet 350.00
## 6 1 Whitney 300
## 7 1 Drone 300
## 8 1 Ipad 400
## 9 1 laptop 700
## 10 1 35" flat screen 500
## 11 1 Noise cancelling headphones 400
## 12 1 Blue Tooth speakers 373
## 13 1 Gopro camera 250
## 14 2015
## 15 4 Golf Shirts 200.00
## 16 1 Golf Windvest 50.00
## 17 4 Individual Golf Clubs 1,925.00
## 18 1 Think Pad 300.00
## 19 2 tickets and a catalgue for the Whitney 150.00
## 20 1 go pro camera 200.00
## Status
## 1
## 2 From Tukuru
## 3 From AE
## 4 From Matt B
## 5 From Tukuru to be delivered 6/9/16
## 6 From Tukuru
## 7 From Tukuru
## 8 From Gretchen
## 9 From Dewault
## 10 From AE
## 11 From AE
## 12 From Dewault
## 13 From Dewault
## 14
## 15 From Matt B
## 16 From Matt B
## 17
## 18 From Tukuru
## 19 From AE
## 20 From Matt B
OK, so I am thinking that I probably should have just added the year in a new column in the csv file before importing the data to R. It would have been much easier and faster, but I was trying to do everything in R just to get practice… So I think I am going to have to hardcode this section to get the year in it’s own column. I tried to figure out a way to automate it, but each section that is separated by a row with just the year in it is a different size and I couldn’t figure out how to make a conditional function to look for the year and then add that year in a new column (mutate) only until you get to the next row with a year and then change the year to the next value and add that one to the new column, etc… I wanted to automate it because it’s always better to find a solution that would also work on much larger datasets, but I think the probability of getting a really large dataset that is structured this way is somewhat unlikely, so I am just going to hardcode it.
# Add a new Blank Column
tidyraffle$Year <- NA
# Copy year into all corresponding rows. Really wish I could figure out how to automate this!
tidyraffle$Year[2:13] <- tidyraffle[1,2]
tidyraffle$Year[15:25] <- tidyraffle[14,2]
tidyraffle$Year[27:40] <- tidyraffle[26,2]
tidyraffle$Year[42:54] <- tidyraffle[41,2]
tidyraffle$Year[56:69] <- tidyraffle[55,2]
tidyraffle$Year[71:85] <- tidyraffle[70,2]
tidyraffle$Year[87:97] <- tidyraffle[86,2]
tidyraffle$Year[99:105] <- tidyraffle[98,2]
# Remove original year separator rows
tidyraffle <- tidyraffle %>% filter(Year != "")
# Take a peek at the new data structure...
glimpse(tidyraffle)
## Observations: 97
## Variables: 5
## $ quantity <chr> "4", "1", "4", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ Item <chr> " Golf Shirts ", " Golf Windvest ", " Individual Gol...
## $ Cost <chr> "20 each", "40", "2000", " 350.00 ", "300", "300", "4...
## $ Status <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...
The Quantity and Cost columns should be numbers so that we can use them in calculations so let’s convert those… BUT before we do we need to fix any values that have numbers and text in them or R will convert them to NA! So let’s check what values are in those columns and fix them.
tidyraffle %>%
group_by(quantity) %>%
summarise(val = n())
## # A tibble: 4 x 2
## quantity val
## <chr> <int>
## 1 1 68
## 2 2 10
## 3 4 17
## 4 5 2
tidyraffle %>%
group_by(Cost) %>%
summarise(val = n()) %>%
arrange(Cost)
## # A tibble: 70 x 2
## Cost val
## <chr> <int>
## 1 "" 1
## 2 " 1,085.00 " 1
## 3 " 1,281.00 " 1
## 4 " 120.00 " 3
## 5 " 135.00 " 1
## 6 " 150.00 " 1
## 7 " 20.00 " 1
## 8 " 200.00 " 6
## 9 " 204.00 " 1
## 10 " 269.99 " 1
## # ... with 60 more rows
Looks like the quantity is going to be easy, but the cost is another story. First of all there is one value that says “each” in it. From my understanding of the dataset, all the other values are the total cost of the number of items in the quantity column. So I’ll just manually fix that one. Then we need to remove the commas from the rest of the values if they have one.
# fix the Cost in the first row to remove the text
tidyraffle[1,3] <- "80"
tidyraffle[1,]
## quantity Item Cost Status Year
## 1 4 Golf Shirts 80 From Tukuru 2016
# Remove all commas from Cost and convert to numeric
tidyraffle$Cost <- as.numeric(gsub(",","",tidyraffle$Cost))
# Convert quantity to numeric
tidyraffle$quantity <- as.integer(tidyraffle$quantity)
# Check results
glimpse(tidyraffle)
## Observations: 97
## Variables: 5
## $ quantity <int> 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 2, 1,...
## $ Item <chr> " Golf Shirts ", " Golf Windvest ", " Individual Gol...
## $ Cost <dbl> 80, 40, 2000, 350, 300, 300, 400, 700, 500, 400, 373,...
## $ Status <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...
tidyraffle %>%
group_by(quantity) %>%
summarise(val = n())
## # A tibble: 4 x 2
## quantity val
## <int> <int>
## 1 1 68
## 2 2 10
## 3 4 17
## 4 5 2
tidyraffle %>%
group_by(Cost) %>%
summarise(val = n()) %>%
arrange(Cost)
## # A tibble: 48 x 2
## Cost val
## <dbl> <int>
## 1 0. 2
## 2 10.0 1
## 3 19.5 1
## 4 20.0 5
## 5 25.0 1
## 6 30.0 2
## 7 40.0 3
## 8 50.0 4
## 9 60.0 2
## 10 74.0 2
## # ... with 38 more rows
Calculate the Unit Cost of Items
tidyraffle <- tidyraffle %>%
mutate(UnitCost = Cost / quantity)
glimpse(tidyraffle)
## Observations: 97
## Variables: 6
## $ quantity <int> 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 2, 1,...
## $ Item <chr> " Golf Shirts ", " Golf Windvest ", " Individual Gol...
## $ Cost <dbl> 80, 40, 2000, 350, 300, 300, 400, 700, 500, 400, 373,...
## $ Status <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...
## $ UnitCost <dbl> 20.00, 40.00, 500.00, 350.00, 300.00, 300.00, 400.00,...
Let’s see what the average total cost of each raffle prices was in each year
yearly_avg_cost <- tidyraffle %>%
group_by(Year) %>%
summarise(avg_cost = mean(Cost, na.rm = TRUE))
yearly_avg_cost
## # A tibble: 8 x 2
## Year avg_cost
## <chr> <dbl>
## 1 2009 88.6
## 2 2010 191.
## 3 2011 210.
## 4 2012 288.
## 5 2013 295.
## 6 2014 313.
## 7 2015 520.
## 8 2016 474.
ggplot(tidyraffle) +
geom_point(mapping = aes(x=Cost, y=Year))
## Warning: Removed 1 rows containing missing values (geom_point).
Let’s see if we had donors who donated over multiple years?
library(DT)
donors <- tidyraffle %>%
group_by(Status, Year) %>%
summarise(avg_cost = mean(Cost, na.rm = TRUE))
datatable(donors)
Obviously there is still more that could be done to clean up this data. The Status column includes a lot of extraneous data. Additionally, we could remove the word “From” and try to make sure Status entries for the same person are entered consistently so that they could be used for grouping more effectively.