Assignment Instructions

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the ‘wide’ datasets identified in the Week 5 Discussion items. For each of the three chosen datasets:
    • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a ‘wide’ structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
    • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
    • Perform the analysis requested in the discussion item.
    • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset Two - Raffle Prizes

The second dataset I chose because I wanted a challenge and this dataset is particularly ‘untidy’ data! The original file is located on googlesheets, so I just downloaded the file as a csv for importing to R.

Load the Raw Data into R from CSV File

library(RCurl) 
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project2/DS2-Raffle%20Prizes.csv") 
raffle <-data.frame(read.csv(text=x, header=FALSE, stringsAsFactors = FALSE))

Use tidyr to examine the raw data

Look and the data structure as well as the first and last few rows of data to get a sense of what’s there.

library(tidyverse, quietly=TRUE)
dim(raffle)
## [1] 125   8
glimpse(raffle)
## Observations: 125
## Variables: 8
## $ V1 <chr> "association  Golf Outing", "Raffle Prizes", "", "", "quant...
## $ V2 <chr> "", "", "", "", "Item", "", "2016", " Golf Shirts ", " Golf...
## $ V3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V4 <chr> "", "", "as of ", "", "Cost", "", "", "20 each", "40", "200...
## $ V5 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V6 <chr> "", "", "3/15/2018", "", "Status", "", "", "From Tukuru", "...
## $ V7 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ V8 <chr> "", "", "", "", "Year", "", "", "", "", "", "", "", "", "",...
head(raffle)
##                         V1   V2 V3     V4 V5        V6 V7   V8
## 1 association  Golf Outing      NA        NA           NA     
## 2            Raffle Prizes      NA        NA           NA     
## 3                               NA as of  NA 3/15/2018 NA     
## 4                               NA        NA           NA     
## 5                 quantity Item NA   Cost NA    Status NA Year
## 6                               NA        NA           NA
tail(raffle)
##     V1                 V2 V3       V4 V5
## 120  4   AMC movie passes NA   50.00  NA
## 121  1     Portable Grill NA  240.00  NA
## 122  5 5 dozen Golf Balls NA   60.00  NA
## 123  5 5 dozen Golf Balls NA   60.00  NA
## 124  1        Gift Basket NA   40.00  NA
## 125  1       ASCE 7 on CD NA   20.00  NA
##                                          V6 V7 V8
## 120                           From  Matt B  NA   
## 121                              From  JFT  NA   
## 122                                 From DM NA   
## 123                                From DM  NA   
## 124                                 From DM NA   
## 125 From Whatever the hell that is from Joe NA

This data is so messy it’s hard to see what’s going on, but it looks like columns V3, V5, V7, and V8 do not contain any data at all. Columns V3, V5, and V7 are all NA’s and column V8 only has the word ‘Year’ in one observation and the rest look like empty strings. But just to be sure I will use dplyr to look for unique values in each of those columns.

columns <- paste('V',c(3,5,7,8), sep="")
columns
## [1] "V3" "V5" "V7" "V8"
func <- function(column, df){
   df %>%
   group_by_(column) %>%
   summarise(val = n())
}

unlist(lapply(columns, func, raffle))
##     V3    val     V5    val     V7    val    V81    V82   val1   val2 
##     NA  "125"     NA  "125"     NA  "125"     "" "Year"  "124"    "1"

The results above confirm that those four columns do not contain any data. The word year is the only value entered in the V8 column but that really belongs in the header and there’s nothing else in that column so I can remove it along with the other 3 columns…

Remove empty/useless columns/rows

I’ll use a select function to remove the empty columns identified above and while I am at it remove the empty/useless rows at the top of the dataframe as well.

tidyraffle <- tail(raffle, 121) %>%
  select(-c(3,5,7,8))

head(tidyraffle)
##          V1                            V2      V4          V6
## 5  quantity                          Item    Cost      Status
## 6                                                            
## 7                                    2016                    
## 8         4                  Golf Shirts  20 each From Tukuru
## 9         1                Golf Windvest       40     From AE
## 10        4   Individual Golf Clubs sets     2000 From Matt B
tail(tidyraffle)
##     V1                 V2       V4                                      V6
## 120  4   AMC movie passes   50.00                            From  Matt B 
## 121  1     Portable Grill  240.00                               From  JFT 
## 122  5 5 dozen Golf Balls   60.00                                  From DM
## 123  5 5 dozen Golf Balls   60.00                                 From DM 
## 124  1        Gift Basket   40.00                                  From DM
## 125  1       ASCE 7 on CD   20.00  From Whatever the hell that is from Joe

Make first row the variable names

colnames(tidyraffle) <- tidyraffle[1, ] # the first row will be the header
tidyraffle <- tidyraffle[-c(1,2), ] # removing the first 2 (header and blank) rows
rownames(tidyraffle) <- NULL
head(tidyraffle)
##   quantity                          Item     Cost
## 1                                   2016         
## 2        4                  Golf Shirts   20 each
## 3        1                Golf Windvest        40
## 4        4   Individual Golf Clubs sets      2000
## 5        1               Surface Tablet   350.00 
## 6        1                      Whitney       300
##                                 Status
## 1                                     
## 2                          From Tukuru
## 3                              From AE
## 4                          From Matt B
## 5  From Tukuru to be delivered 6/9/16 
## 6                          From Tukuru

Remove Empty Rows

First print the Item column to see what to filter out based on this column. This is the only column that consistently has data in it so if it is blank we can remove that row.

tidyraffle$Item
##   [1] "2016"                                                   
##   [2] " Golf Shirts "                                          
##   [3] " Golf Windvest "                                        
##   [4] "  Individual Golf Clubs sets "                          
##   [5] " Surface Tablet "                                       
##   [6] " Whitney "                                              
##   [7] "  Drone "                                               
##   [8] " Ipad "                                                 
##   [9] " laptop "                                               
##  [10] "  35\" flat screen "                                    
##  [11] " Noise cancelling headphones "                          
##  [12] " Blue Tooth speakers "                                  
##  [13] " Gopro camera "                                         
##  [14] ""                                                       
##  [15] ""                                                       
##  [16] ""                                                       
##  [17] ""                                                       
##  [18] "2015"                                                   
##  [19] "  Golf Shirts "                                         
##  [20] "  Golf Windvest "                                       
##  [21] "  Individual Golf Clubs "                               
##  [22] " Think Pad "                                            
##  [23] " tickets and a catalgue for the Whitney "               
##  [24] " go pro camera "                                        
##  [25] "  Ipad "                                                
##  [26] "  laptop "                                              
##  [27] "  35\" flat screen "                                    
##  [28] " Noise cancelling headphones "                          
##  [29] ""                                                       
##  [30] "  All inclusive American Museum of    Natural History  "
##  [31] ""                                                       
##  [32] "2014"                                                   
##  [33] "  Golf Shirts "                                         
##  [34] "  Golf Windvest "                                       
##  [35] "  Individual Golf Clubs "                               
##  [36] " City Sightseeing Tour of Manhattan "                   
##  [37] " Wine "                                                 
##  [38] " Lap top (extra per JFT for HS kid) "                   
##  [39] " Movie passes "                                         
##  [40] " 39 inch  insignia flat screen LED "                    
##  [41] " Noise canceling head phones "                          
##  [42] " Ipads "                                                
##  [43] " Pro Shop gft certificate "                             
##  [44] "  tickets and a catalgue for the Whitney "              
##  [45] "  All inclusive American Museum of    Natural History  "
##  [46] " Sony Handycam Camera Recorded "                        
##  [47] ""                                                       
##  [48] "2013"                                                   
##  [49] "  Golf Shirts "                                         
##  [50] "  Golf Windvest "                                       
##  [51] "  Individual Golf Clubs "                               
##  [52] " Case of Walden Woods Wine "                            
##  [53] " Flat Screen TV "                                       
##  [54] "  Ipad mini's (16 GB) "                                 
##  [55] "  Ipad Mini (32 GB) "                                   
##  [56] " Noise Canceling Head Phones "                          
##  [57] "  Yankee Tickets  "                                     
##  [58] "   Hyland "                                             
##  [59] "  gift certificates for $97 each (prescription "        
##  [60] ""                                                       
##  [61] "  tickets and a catalgue for the Whitney "              
##  [62] "  tickets to Yankees Rays game 6/22/13 "                
##  [63] ""                                                       
##  [64] "2012"                                                   
##  [65] "  Golf Shirts "                                         
##  [66] "  Golf Windvest "                                       
##  [67] "  Individual Golf Clubs "                               
##  [68] "  All-inclusive Museum of Natural   history passes  "   
##  [69] ""                                                       
##  [70] " Flat Screen TV "                                       
##  [71] " Kindle Fire "                                          
##  [72] " Ipad "                                                 
##  [73] " Appleton Rum (and game on 8th hole) "                  
##  [74] " Wine "                                                 
##  [75] " Wine related platters and cheese "                     
##  [76] " 40 Below Vodka "                                       
##  [77] " Sony Handman DCR-SX44 mini "                           
##  [78] "   camcorder "                                          
##  [79] "  Dozen Golf Balls  "                                   
##  [80] ""                                                       
##  [81] "2011"                                                   
##  [82] " Golf Shirts"                                           
##  [83] " Golf Windvest"                                         
##  [84] " Individual Golf Clubs"                                 
##  [85] " Dozen golf balls "                                     
##  [86] "Remaining Golf Balls"                                   
##  [87] "Chances"                                                
##  [88] "Dinner for 2 ($150) for Tribeca Grill"                  
##  [89] "$100 Gift Certificate to \"The Stand\""                 
##  [90] "The Kindle"                                             
##  [91] "Flat Screen TV"                                         
##  [92] "NBA Package (Satchel. T shirt, "                        
##  [93] "  water bottle, post its and holder)"                   
##  [94] "HP Mini Lap Top"                                        
##  [95] "iPad  (32 G)"                                           
##  [96] "Case of 40below Vodka"                                  
##  [97] ""                                                       
##  [98] "2010"                                                   
##  [99] "Gotham Restaurant certificate"                          
## [100] "iPod Touch"                                             
## [101] "Digital Camera"                                         
## [102] "Flat Screen TV"                                         
## [103] "Ipad"                                                   
## [104] "Net Flix 6 month membership"                            
## [105] "10 Downing Restaurant GC"                               
## [106] "5 Ninth Restaurant GC"                                  
## [107] "$25 Gift Certificate to Garden of Eden"                 
## [108] "$150 Gift Certificate from Artisanal"                   
## [109] ""                                                       
## [110] "$100 Gift Certificate to \"The Stand\""                 
## [111] ""                                                       
## [112] "2009"                                                   
## [113] "GPS"                                                    
## [114] "AMC movie passes"                                       
## [115] "Portable Grill"                                         
## [116] "5 dozen Golf Balls"                                     
## [117] "5 dozen Golf Balls"                                     
## [118] "Gift Basket"                                            
## [119] "ASCE 7 on CD"
# Remove all rows with an empty string in the Item column
tidyraffle <- tidyraffle %>% filter(Item != "")

# Confirm that blank rows are gone...
head(tidyraffle, 20)
##    quantity                                     Item       Cost
## 1                                               2016           
## 2         4                             Golf Shirts     20 each
## 3         1                           Golf Windvest          40
## 4         4              Individual Golf Clubs sets        2000
## 5         1                          Surface Tablet     350.00 
## 6         1                                 Whitney         300
## 7         1                                   Drone         300
## 8         1                                    Ipad         400
## 9         1                                  laptop         700
## 10        1                         35" flat screen         500
## 11        1             Noise cancelling headphones         400
## 12        1                     Blue Tooth speakers         373
## 13        1                            Gopro camera         250
## 14                                              2015           
## 15        4                             Golf Shirts     200.00 
## 16        1                           Golf Windvest      50.00 
## 17        4                   Individual Golf Clubs   1,925.00 
## 18        1                               Think Pad     300.00 
## 19        2  tickets and a catalgue for the Whitney     150.00 
## 20        1                           go pro camera     200.00 
##                                  Status
## 1                                      
## 2                           From Tukuru
## 3                               From AE
## 4                           From Matt B
## 5   From Tukuru to be delivered 6/9/16 
## 6                           From Tukuru
## 7                           From Tukuru
## 8                         From Gretchen
## 9                          From Dewault
## 10                              From AE
## 11                              From AE
## 12                         From Dewault
## 13                         From Dewault
## 14                                     
## 15                          From Matt B
## 16                          From Matt B
## 17                                     
## 18                        From Tukuru  
## 19                            From AE  
## 20                          From Matt B

What to do about the year?

OK, so I am thinking that I probably should have just added the year in a new column in the csv file before importing the data to R. It would have been much easier and faster, but I was trying to do everything in R just to get practice… So I think I am going to have to hardcode this section to get the year in it’s own column. I tried to figure out a way to automate it, but each section that is separated by a row with just the year in it is a different size and I couldn’t figure out how to make a conditional function to look for the year and then add that year in a new column (mutate) only until you get to the next row with a year and then change the year to the next value and add that one to the new column, etc… I wanted to automate it because it’s always better to find a solution that would also work on much larger datasets, but I think the probability of getting a really large dataset that is structured this way is somewhat unlikely, so I am just going to hardcode it.

# Add a new Blank Column
tidyraffle$Year <- NA

# Copy year into all corresponding rows.  Really wish I could figure out how to automate this!
tidyraffle$Year[2:13] <- tidyraffle[1,2] 
tidyraffle$Year[15:25] <- tidyraffle[14,2] 
tidyraffle$Year[27:40] <- tidyraffle[26,2] 
tidyraffle$Year[42:54] <- tidyraffle[41,2] 
tidyraffle$Year[56:69] <- tidyraffle[55,2] 
tidyraffle$Year[71:85] <- tidyraffle[70,2] 
tidyraffle$Year[87:97] <- tidyraffle[86,2] 
tidyraffle$Year[99:105] <- tidyraffle[98,2] 

# Remove original year separator rows
tidyraffle <- tidyraffle %>% filter(Year != "")

# Take a peek at the new data structure...
glimpse(tidyraffle)
## Observations: 97
## Variables: 5
## $ quantity <chr> "4", "1", "4", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ Item     <chr> " Golf Shirts ", " Golf Windvest ", "  Individual Gol...
## $ Cost     <chr> "20 each", "40", "2000", " 350.00 ", "300", "300", "4...
## $ Status   <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year     <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...

Change data types so we can do calculations

The Quantity and Cost columns should be numbers so that we can use them in calculations so let’s convert those… BUT before we do we need to fix any values that have numbers and text in them or R will convert them to NA! So let’s check what values are in those columns and fix them.

tidyraffle %>% 
  group_by(quantity) %>%
  summarise(val = n())
## # A tibble: 4 x 2
##   quantity   val
##   <chr>    <int>
## 1 1           68
## 2 2           10
## 3 4           17
## 4 5            2
tidyraffle %>% 
  group_by(Cost) %>%
  summarise(val = n()) %>%
  arrange(Cost)
## # A tibble: 70 x 2
##    Cost            val
##    <chr>         <int>
##  1 ""                1
##  2 "  1,085.00 "     1
##  3 "  1,281.00 "     1
##  4 "  120.00 "       3
##  5 "  135.00 "       1
##  6 "  150.00 "       1
##  7 "  20.00 "        1
##  8 "  200.00 "       6
##  9 "  204.00 "       1
## 10 "  269.99 "       1
## # ... with 60 more rows

Looks like the quantity is going to be easy, but the cost is another story. First of all there is one value that says “each” in it. From my understanding of the dataset, all the other values are the total cost of the number of items in the quantity column. So I’ll just manually fix that one. Then we need to remove the commas from the rest of the values if they have one.

# fix the Cost in the first row to remove the text
tidyraffle[1,3] <- "80"
tidyraffle[1,]
##   quantity          Item Cost      Status Year
## 1        4  Golf Shirts    80 From Tukuru 2016
# Remove all commas from Cost and convert to numeric
tidyraffle$Cost <- as.numeric(gsub(",","",tidyraffle$Cost))

# Convert quantity to numeric
tidyraffle$quantity <- as.integer(tidyraffle$quantity)

# Check results
glimpse(tidyraffle)
## Observations: 97
## Variables: 5
## $ quantity <int> 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 2, 1,...
## $ Item     <chr> " Golf Shirts ", " Golf Windvest ", "  Individual Gol...
## $ Cost     <dbl> 80, 40, 2000, 350, 300, 300, 400, 700, 500, 400, 373,...
## $ Status   <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year     <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...
tidyraffle %>% 
  group_by(quantity) %>%
  summarise(val = n())
## # A tibble: 4 x 2
##   quantity   val
##      <int> <int>
## 1        1    68
## 2        2    10
## 3        4    17
## 4        5     2
tidyraffle %>% 
  group_by(Cost) %>%
  summarise(val = n()) %>%
  arrange(Cost)
## # A tibble: 48 x 2
##     Cost   val
##    <dbl> <int>
##  1   0.      2
##  2  10.0     1
##  3  19.5     1
##  4  20.0     5
##  5  25.0     1
##  6  30.0     2
##  7  40.0     3
##  8  50.0     4
##  9  60.0     2
## 10  74.0     2
## # ... with 38 more rows

Calculate the Unit Cost of Items

tidyraffle <- tidyraffle %>%
  mutate(UnitCost = Cost / quantity)
glimpse(tidyraffle)
## Observations: 97
## Variables: 6
## $ quantity <int> 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 2, 1,...
## $ Item     <chr> " Golf Shirts ", " Golf Windvest ", "  Individual Gol...
## $ Cost     <dbl> 80, 40, 2000, 350, 300, 300, 400, 700, 500, 400, 373,...
## $ Status   <chr> "From Tukuru", "From AE", "From Matt B", " From Tukur...
## $ Year     <chr> "2016", "2016", "2016", "2016", "2016", "2016", "2016...
## $ UnitCost <dbl> 20.00, 40.00, 500.00, 350.00, 300.00, 300.00, 400.00,...

Let’s finally do a tiny bit of analysis!

Let’s see what the average total cost of each raffle prices was in each year

yearly_avg_cost <- tidyraffle %>%
  group_by(Year) %>%
  summarise(avg_cost = mean(Cost, na.rm = TRUE))
yearly_avg_cost
## # A tibble: 8 x 2
##   Year  avg_cost
##   <chr>    <dbl>
## 1 2009      88.6
## 2 2010     191. 
## 3 2011     210. 
## 4 2012     288. 
## 5 2013     295. 
## 6 2014     313. 
## 7 2015     520. 
## 8 2016     474.
ggplot(tidyraffle) +
  geom_point(mapping = aes(x=Cost, y=Year))
## Warning: Removed 1 rows containing missing values (geom_point).

Let’s see if we had donors who donated over multiple years?

library(DT)
donors <- tidyraffle %>%
  group_by(Status, Year) %>%
  summarise(avg_cost = mean(Cost, na.rm = TRUE))
datatable(donors)

Conclusion

Obviously there is still more that could be done to clean up this data. The Status column includes a lot of extraneous data. Additionally, we could remove the word “From” and try to make sure Status entries for the same person are entered consistently so that they could be used for grouping more effectively.