Installing the Necessary Packages

In this project, we were tasked with tidying and analyzing the data of three different data sets. To kick it off, we’ll install the needed packages:

library("knitr")
library("stringr")
library("tidyr")
library("dplyr")

Untidy Auction Data

The first dataset I’ll tidy and analyse is my own for untidy Auction Data. What makes this data untidy is that the descriptions include a heavy amount of text that could be useful in analysis and the estimate prices are combined with a high and low estimate. Now we can import the data and see what it is we want to pull out. While there is certainly a lot more we can do with the data, the analysis we’ll perform for this dataset is:

auctiondata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/utidy_auction_data.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(auctiondata)
Lot Description Range Price
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 $283
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 $234
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 $357

To tidy this, we’ll seperate the range from two figures to a high/low estimate and also pull out the heigth and width from the description. For the estimates we could try extracting them individually, but it requires a bit more headache since it’s not as easy to skip the first result (as seen below):

#Estimate: High
str_extract(auctiondata$Range, "\\$[[:digit:]]+")
## [1] "$300" "$80"  "$200"
#Estimate: Low
str_extract(auctiondata$Range, "\\$[[:digit:]]+")
## [1] "$300" "$80"  "$200"

Instead I’ll simply extract all using str_extract_all and use seq create the columns from there to add to the data.

e1 <- unlist(str_extract_all(auctiondata$Range, "\\$[[:digit:]]+"))
e1
## [1] "$300" "$400" "$80"  "$120" "$200" "$300"
auctiondata$est_low <- e1[seq(1,6,2)]
auctiondata$est_high <- e1[seq(2,6,2)]
kable(auctiondata)
Lot Description Range Price est_low est_high
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 $283 $300 $400
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 $234 $80 $120
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 $357 $200 $300

This looks great, however the dollar signs will throw off any analysis so we’ll remove them all.

auctiondata$Price <- as.integer(str_replace_all(auctiondata$Price, "\\$", ""))
auctiondata$est_low <- as.integer(str_replace_all(auctiondata$est_low, "\\$", ""))
auctiondata$est_high <- as.integer(str_replace_all(auctiondata$est_high, "\\$", ""))
kable(auctiondata)
Lot Description Range Price est_low est_high
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 283 300 400
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 234 80 120
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 357 200 300

For the analysis, we want to find the total percentage difference between the estimates and actual sales price.

#High Estimate:
str_c(round(((sum(auctiondata$est_high) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")
## [1] "-6.18%"
#Low Estimate:
str_c(round(((sum(auctiondata$est_low) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")
## [1] "-33.64%"

Perhaps this analysis was too easy with the small table. The next two examples should be more complicated:

Untidy Oil Consumption Data

This fictional dataset comes from Kishore Prasad and displays oil consumption details of various brands in the first half of 2015, along with opening balances. The numbers are in the format of Purchased:COnsumed. The Brands in this case are Caltex, Gulf, and Mobil, while the Category includes Engine Oil and GearBox Oil. I’ve started by uploading the dataset into my personal GitHub account so that I can easily import the data for analysis. The analysis request is to:

oildata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_oil_consumption.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(oildata)
Month Category Caltex Gulf Mobil
Open Engine Oil 140 : 000 199 : 000 141 : 000
GearBox Oil 198 : 000 132 : 000 121 : 000
Jan Engine Oil 170 : 103 194 : 132 109 : 127
GearBox Oil 132 : 106 125 : 105 191 : 100
Feb Engine Oil 112 : 133 138 : 113 171 : 101
GearBox Oil 193 : 148 199 : 119 134 : 127
Mar Engine Oil 184 : 100 141 : 141 114 : 108
GearBox Oil 138 : 121 172 : 133 193 : 115
Apr Engine Oil 149 : 150 117 : 118 117 : 118
GearBox Oil 185 : 125 191 : 133 119 : 121
May Engine Oil 170 : 139 104 : 119 200 : 117
GearBox Oil 168 : 117 138 : 102 121 : 146
Jun Engine Oil 159 : 129 170 : 138 169 : 105
GearBox Oil 107 : 129 195 : 141 141 : 112

As we can see from the Brands, the quantity purchased and quantity consumed are seperated by a semi-colon :, so we’ll start by separating the two. Since we’re only looking for the total quantity purchased and consumed, the analysis is (thankfully) made slightly easier since the Month really doesn’t matter. We can ignore that column. We’re going to create three lists, one for each brand containing the purchased and consumed quantities, then add them back into the table.

#Caltex
brand1 <- unlist(str_split(oildata$Caltex, " :  "))
oildata$brand1 <- c("Caltex")
oildata$CalPurchased <- brand1[seq(1,28,2)]
oildata$CalConsumed <- brand1[seq(2,28,2)]

#Gulf
brand2 <- unlist(str_split(oildata$Gulf, " :  "))
oildata$brand2 <- c("Gulf")
oildata$GulfPurchased <- brand2[seq(1,28,2)]
oildata$GulfConsumed <- brand2[seq(2,28,2)]

#Mobil
brand3 <- unlist(str_split(oildata$Mobil, " :  "))
oildata$brand3 <- c("Mobil")
oildata$MobPurchased <- brand3[seq(1,28,2)]
oildata$MobConsumed <- brand3[seq(2,28,2)]

#Remove redunant lines
oildata <- oildata[,c(1:2,6:14)]
kable(head(oildata))
Month Category brand1 CalPurchased CalConsumed brand2 GulfPurchased GulfConsumed brand3 MobPurchased MobConsumed
Open Engine Oil Caltex 140 000 Gulf 199 000 Mobil 141 000
GearBox Oil Caltex 198 000 Gulf 132 000 Mobil 121 000
Jan Engine Oil Caltex 170 103 Gulf 194 132 Mobil 109 127
GearBox Oil Caltex 132 106 Gulf 125 105 Mobil 191 100
Feb Engine Oil Caltex 112 133 Gulf 138 113 Mobil 171 101
GearBox Oil Caltex 193 148 Gulf 199 119 Mobil 134 127

Looking great, but not looking tidy. Let’s make it tidy. To do this, we’ll re-separate all the columns and give them the same column names, then merge them using bind_rows. Yes, this could have been done in the previous step, but what’s the fun in simplicity?

brand1 <- oildata[,c(1:5)]
colnames(brand1) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand2 <- oildata[,c(1:2,6:8)]
colnames(brand2) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand3 <- oildata[,c(1:2,9:11)]
colnames(brand3) <- c("Month", "Category", "Brand", "Purchased", "Consumed")

#Combining the tables
oildata2 <- bind_rows(brand1, brand2, brand3)

#Making sure the right columns are integers
oildata2$Purchased <- as.integer(oildata2$Purchased)
oildata2$Consumed <- as.integer(oildata2$Consumed)
kable(head(oildata2))
Month Category Brand Purchased Consumed
Open Engine Oil Caltex 140 0
GearBox Oil Caltex 198 0
Jan Engine Oil Caltex 170 103
GearBox Oil Caltex 132 106
Feb Engine Oil Caltex 112 133
GearBox Oil Caltex 193 148
kable(tail(oildata2))
Month Category Brand Purchased Consumed
Apr Engine Oil Mobil 117 118
GearBox Oil Mobil 119 121
May Engine Oil Mobil 200 117
GearBox Oil Mobil 121 146
Jun Engine Oil Mobil 169 105
GearBox Oil Mobil 141 112

It’s beautiful! Now let’s analyse:

#Closing balances of each Brand
oildata2 %>% 
  group_by(Brand) %>% 
      summarise(sum(Consumed))
## Source: local data frame [3 x 2]
## 
##    Brand sum(Consumed)
##    (chr)         (int)
## 1 Caltex          1500
## 2   Gulf          1494
## 3  Mobil          1397
#Closing balances of each Category
oildata2 %>% 
  group_by(Category) %>% 
      summarise(sum(Consumed))
## Source: local data frame [2 x 2]
## 
##      Category sum(Consumed)
##         (chr)         (int)
## 1  Engine Oil          2191
## 2 GearBox Oil          2200
#Closing balances of each Brand and Category
oildata2 %>% 
  group_by(Brand, Category) %>% 
      summarise(sum(Consumed))
## Source: local data frame [6 x 3]
## Groups: Brand [?]
## 
##    Brand    Category sum(Consumed)
##    (chr)       (chr)         (int)
## 1 Caltex  Engine Oil           754
## 2 Caltex GearBox Oil           746
## 3   Gulf  Engine Oil           761
## 4   Gulf GearBox Oil           733
## 5  Mobil  Engine Oil           676
## 6  Mobil GearBox Oil           721
#Identifying the most Consumed Brands across the two Categories
oildata2 %>% 
  group_by(Category, Brand) %>% 
    summarise(sum1 = sum(Consumed)) %>% 
      arrange(desc(sum1)) %>% 
        top_n(.,1,sum1)
## Source: local data frame [2 x 3]
## Groups: Category [2]
## 
##      Category  Brand  sum1
##         (chr)  (chr) (int)
## 1  Engine Oil   Gulf   761
## 2 GearBox Oil Caltex   746

That’s awesome.

Untidy Generator Capacity Prices

The third and final dataset in this article comes from Daniel Smilowitz and displays auction pricing across four locations by month for Monthly auctions, Spot auctions, and Strip auctions. Monthly auctions take place monthly before final values are known, spot auctions take place monthly once values are finalized, and strip auctions take place twice a year (May and November) for six months following. The locations of the auctions are: New York City (NYC), Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS). Like the two previous sets, I’ve uploaded this to my personal GitHub for easy importing. The analysis we’ll perform on this data includes:

genprices <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_Generator_Capacity_Prices.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(head(genprices))
Auction Monthly Monthly.1 Monthly.2 Monthly.3 Spot Spot.1 Spot.2 Spot.3 Strip Strip.1 Strip.2 Strip.3
Location NYC LHV LI ROS NYC LHV LI ROS NYC LHV LI ROS
Nov-03 $6.67 $1.15 $0.50 $1.15 $6.98 $1.94 $8.14 $1.94 $6.55 $1.17 $4.00 $1.17
Dec-03 $6.67 $1.58 $5.00 $1.58 $6.98 $1.78 $8.22 $1.78 $6.55 $1.17 $4.00 $1.17
Jan-04 $6.67 $1.65 $8.10 $1.65 $6.98 $1.75 $7.99 $1.75 $6.55 $1.17 $4.00 $1.17
Feb-04 $6.95 $1.67 $7.50 $1.67 $6.98 $1.73 $7.08 $1.73 $6.55 $1.17 $4.00 $1.17
Mar-04 $6.25 $1.65 $7.00 $1.65 $6.98 $1.00 $7.72 $1.00 $6.55 $1.17 $4.00 $1.17

Let’s tidy this information. We know that there are three categories: Monthly, Spot, and Strip so we’ll make a table for each one, and then merge them together. After that we’ll split the Date into Month and Year, followed by changing N/A values to 0, which is a critical assumption and will skew the analysis. We’ll over-write the headers though so it’s a bit more clear which locations we’re looking at:

#Fixing the column names:
colnames(genprices) <- genprices[1,c(1:13)]
genprices <- genprices[2:151,]

#Seperating the three Categories and adding the category names to the three tables:
Monthly1 <- genprices[,1:5]
Monthly1$Category <- c("Monthly")
Spot1 <- genprices[,c(1,6:9)]
Spot1$Category <- c("Spot")
Strip1 <- genprices[,c(1,10:13)]
Strip1$Category <- c("Strip")

#Merging the tables:
genprices2 <- bind_rows(Monthly1, Spot1, Strip1)

#Making sure the price columns are integers and removing the dollar signs
genprices2$NYC <- as.numeric(unlist(str_replace_all(genprices2$NYC, "\\$", "")))
genprices2$LHV <- as.numeric(unlist(str_replace_all(genprices2$LHV, "\\$", "")))
genprices2$LI <- as.numeric(unlist(str_replace_all(genprices2$LI, "\\$", "")))
## Warning: NAs introduced by coercion
genprices2$ROS <- as.numeric(unlist(str_replace_all(genprices2$ROS, "\\$", "")))

#Seperating the month and year:
genprices2[,7:8] <- t(as.data.frame(strsplit(genprices2$Location,"-")))
colnames(genprices2)[c(7:8)] <- c("Month","Year")

#Removing redundant columns and change NA's to 0:
genprices2[is.na(genprices2)] <- 0
genprices <- genprices2[,c(2:8)]
kable(head(genprices))
NYC LHV LI ROS Category Month Year
6.67 1.15 0.50 1.15 Monthly Nov 03
6.67 1.58 5.00 1.58 Monthly Dec 03
6.67 1.65 8.10 1.65 Monthly Jan 04
6.95 1.67 7.50 1.67 Monthly Feb 04
6.25 1.65 7.00 1.65 Monthly Mar 04
6.25 0.99 6.85 0.99 Monthly Apr 04
kable(tail(genprices))
NYC LHV LI ROS Category Month Year
6.67 3.73 1.64 1.25 Strip Nov 15
6.67 3.73 1.64 1.25 Strip Dec 15
6.67 3.73 1.64 1.25 Strip Jan 16
6.67 3.73 1.64 1.25 Strip Feb 16
6.67 3.73 1.64 1.25 Strip Mar 16
6.67 3.73 1.64 1.25 Strip Apr 16

Now it’s time to analyse:

#Highest Monthly prices in each location, example: NYC
genprices %>% 
  group_by(Month, Category) %>% 
    summarise(max1 = max(NYC)) %>% 
      group_by(Month) %>% 
        summarise(max2 = max(max1)) %>% 
          arrange(desc(max2))
## Source: local data frame [12 x 2]
## 
##    Month  max2
##    (chr) (dbl)
## 1    Jun 18.84
## 2    May 18.83
## 3    Jul 18.71
## 4    Aug 18.56
## 5    Sep 18.37
## 6    Oct 18.19
## 7    Nov 10.01
## 8    Dec  9.90
## 9    Apr  9.75
## 10   Mar  9.72
## 11   Jan  9.63
## 12   Feb  9.56
#Average difference between NYC and ROS prices
round(mean(genprices$NYC - genprices$ROS), 2)
## [1] 6.7
#Calendar Year with highest average price across regions, example: LVH
genprices %>% 
  group_by(Year, Category) %>% 
    summarise(max1 = max(LHV)) %>% 
      group_by(Category, Year) %>% 
        summarise(max2 = max(max1)) %>% 
          top_n(.,1,max2)
## Source: local data frame [3 x 3]
## Groups: Category [3]
## 
##   Category  Year  max2
##      (chr) (chr) (dbl)
## 1  Monthly    14 12.25
## 2     Spot    14 12.38
## 3    Strip    14  9.96
#Month or Spot auction more volitile, example: ROS
ROS1 <- filter(genprices, genprices$Category == c("Monthly"))
ROS2 <- filter(genprices, genprices$Category == c("Spot"))
c(sd(ROS1$ROS),sd(ROS2$ROS))
## [1] 1.541426 1.623642
#The Spot auction market is more volitile

#Auction market with the highest price each year, example: LVH
genprices %>% 
  group_by(Year, Category) %>% 
    summarise(max1 = max(LHV)) %>%
      group_by(Year) %>% 
        top_n(.,1,max1)
## Source: local data frame [15 x 3]
## Groups: Year [14]
## 
##     Year Category  max1
##    (chr)    (chr) (dbl)
## 1     03     Spot  1.94
## 2     04     Spot  1.75
## 3     05     Spot  2.00
## 4     06     Spot  3.33
## 5     07     Spot  3.52
## 6     08     Spot  3.40
## 7     09     Spot  4.42
## 8     10     Spot  3.52
## 9     11  Monthly  0.65
## 10    11     Spot  0.65
## 11    12     Spot  2.91
## 12    13     Spot  6.07
## 13    14     Spot 12.38
## 14    15     Spot 10.93
## 15    16    Strip  3.73

Fantastic