DATA607 - Project 2 - Untidy Data

Installing the Necessary Packages
Untidy Auction Data
Untidy Oil Consumption Data
Untidy Generator Capacity Prices

Installing the Necessary Packages

In this project, we were tasked with tidying and analyzing the data of three different data sets. To kick it off, we’ll install the needed packages:

library("knitr")
library("stringr")
library("tidyr")
library("dplyr")

Untidy Auction Data

The first dataset I’ll tidy and analyse is my own for untidy Auction Data. What makes this data untidy is that the descriptions include a heavy amount of text that could be useful in analysis and the estimate prices are combined with a high and low estimate. Now we can import the data and see what it is we want to pull out. While there is certainly a lot more we can do with the data, the analysis we’ll perform for this dataset is:

Find the total percent difference from the high estimate and the sold price
Find the total percent difference from the low estimate and the sold price

auctiondata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/utidy_auction_data.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(auctiondata)

Lot	Description	Range	Price
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	$283
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	$234
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	$357

To tidy this, we’ll seperate the range from two figures to a high/low estimate and also pull out the heigth and width from the description. For the estimates we could try extracting them individually, but it requires a bit more headache since it’s not as easy to skip the first result (as seen below):

#Estimate: High
str_extract(auctiondata$Range, "\\$[[:digit:]]+")

## [1] "$300" "$80"  "$200"

#Estimate: Low
str_extract(auctiondata$Range, "\\$[[:digit:]]+")

## [1] "$300" "$80"  "$200"

Instead I’ll simply extract all using str_extract_all and use seq create the columns from there to add to the data.

e1 <- unlist(str_extract_all(auctiondata$Range, "\\$[[:digit:]]+"))
e1

## [1] "$300" "$400" "$80"  "$120" "$200" "$300"

auctiondata$est_low <- e1[seq(1,6,2)]
auctiondata$est_high <- e1[seq(2,6,2)]
kable(auctiondata)

Lot	Description	Range	Price	est_low	est_high
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	$283	$300	$400
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	$234	$80	$120
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	$357	$200	$300

This looks great, however the dollar signs will throw off any analysis so we’ll remove them all.

auctiondata$Price <- as.integer(str_replace_all(auctiondata$Price, "\\$", ""))
auctiondata$est_low <- as.integer(str_replace_all(auctiondata$est_low, "\\$", ""))
auctiondata$est_high <- as.integer(str_replace_all(auctiondata$est_high, "\\$", ""))
kable(auctiondata)

Lot	Description	Range	Price	est_low	est_high
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	283	300	400
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	234	80	120
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	357	200	300

For the analysis, we want to find the total percentage difference between the estimates and actual sales price.

#High Estimate:
str_c(round(((sum(auctiondata$est_high) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")

## [1] "-6.18%"

#Low Estimate:
str_c(round(((sum(auctiondata$est_low) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")

## [1] "-33.64%"

Perhaps this analysis was too easy with the small table. The next two examples should be more complicated:

Untidy Oil Consumption Data

This fictional dataset comes from Kishore Prasad and displays oil consumption details of various brands in the first half of 2015, along with opening balances. The numbers are in the format of Purchased:COnsumed. The Brands in this case are Caltex, Gulf, and Mobil, while the Category includes Engine Oil and GearBox Oil. I’ve started by uploading the dataset into my personal GitHub account so that I can easily import the data for analysis. The analysis request is to:

Give the closing balance of Category + Brand
Identify the most Consumed Brands across the two Categories

oildata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_oil_consumption.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(oildata)

Month	Category	Caltex	Gulf	Mobil
Open	Engine Oil	140 : 000	199 : 000	141 : 000
	GearBox Oil	198 : 000	132 : 000	121 : 000
Jan	Engine Oil	170 : 103	194 : 132	109 : 127
	GearBox Oil	132 : 106	125 : 105	191 : 100
Feb	Engine Oil	112 : 133	138 : 113	171 : 101
	GearBox Oil	193 : 148	199 : 119	134 : 127
Mar	Engine Oil	184 : 100	141 : 141	114 : 108
	GearBox Oil	138 : 121	172 : 133	193 : 115
Apr	Engine Oil	149 : 150	117 : 118	117 : 118
	GearBox Oil	185 : 125	191 : 133	119 : 121
May	Engine Oil	170 : 139	104 : 119	200 : 117
	GearBox Oil	168 : 117	138 : 102	121 : 146
Jun	Engine Oil	159 : 129	170 : 138	169 : 105
	GearBox Oil	107 : 129	195 : 141	141 : 112

As we can see from the Brands, the quantity purchased and quantity consumed are seperated by a semi-colon :, so we’ll start by separating the two. Since we’re only looking for the total quantity purchased and consumed, the analysis is (thankfully) made slightly easier since the Month really doesn’t matter. We can ignore that column. We’re going to create three lists, one for each brand containing the purchased and consumed quantities, then add them back into the table.

#Caltex
brand1 <- unlist(str_split(oildata$Caltex, " :  "))
oildata$brand1 <- c("Caltex")
oildata$CalPurchased <- brand1[seq(1,28,2)]
oildata$CalConsumed <- brand1[seq(2,28,2)]

#Gulf
brand2 <- unlist(str_split(oildata$Gulf, " :  "))
oildata$brand2 <- c("Gulf")
oildata$GulfPurchased <- brand2[seq(1,28,2)]
oildata$GulfConsumed <- brand2[seq(2,28,2)]

#Mobil
brand3 <- unlist(str_split(oildata$Mobil, " :  "))
oildata$brand3 <- c("Mobil")
oildata$MobPurchased <- brand3[seq(1,28,2)]
oildata$MobConsumed <- brand3[seq(2,28,2)]

#Remove redunant lines
oildata <- oildata[,c(1:2,6:14)]
kable(head(oildata))

Month	Category	brand1	CalPurchased	CalConsumed	brand2	GulfPurchased	GulfConsumed	brand3	MobPurchased	MobConsumed
Open	Engine Oil	Caltex	140	000	Gulf	199	000	Mobil	141	000
	GearBox Oil	Caltex	198	000	Gulf	132	000	Mobil	121	000
Jan	Engine Oil	Caltex	170	103	Gulf	194	132	Mobil	109	127
	GearBox Oil	Caltex	132	106	Gulf	125	105	Mobil	191	100
Feb	Engine Oil	Caltex	112	133	Gulf	138	113	Mobil	171	101
	GearBox Oil	Caltex	193	148	Gulf	199	119	Mobil	134	127

Looking great, but not looking tidy. Let’s make it tidy. To do this, we’ll re-separate all the columns and give them the same column names, then merge them using bind_rows. Yes, this could have been done in the previous step, but what’s the fun in simplicity?

brand1 <- oildata[,c(1:5)]
colnames(brand1) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand2 <- oildata[,c(1:2,6:8)]
colnames(brand2) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand3 <- oildata[,c(1:2,9:11)]
colnames(brand3) <- c("Month", "Category", "Brand", "Purchased", "Consumed")

#Combining the tables
oildata2 <- bind_rows(brand1, brand2, brand3)

#Making sure the right columns are integers
oildata2$Purchased <- as.integer(oildata2$Purchased)
oildata2$Consumed <- as.integer(oildata2$Consumed)
kable(head(oildata2))

Month	Category	Brand	Purchased	Consumed
Open	Engine Oil	Caltex	140	0
	GearBox Oil	Caltex	198	0
Jan	Engine Oil	Caltex	170	103
	GearBox Oil	Caltex	132	106
Feb	Engine Oil	Caltex	112	133
	GearBox Oil	Caltex	193	148

kable(tail(oildata2))

Month	Category	Brand	Purchased	Consumed
Apr	Engine Oil	Mobil	117	118
	GearBox Oil	Mobil	119	121
May	Engine Oil	Mobil	200	117
	GearBox Oil	Mobil	121	146
Jun	Engine Oil	Mobil	169	105
	GearBox Oil	Mobil	141	112

It’s beautiful! Now let’s analyse:

Give the closing balance of Category + Brand
Identify the most Consumed Brands across the two Categories

#Closing balances of each Brand
oildata2 %>% 
  group_by(Brand) %>% 
      summarise(sum(Consumed))

## Source: local data frame [3 x 2]
## 
##    Brand sum(Consumed)
##    (chr)         (int)
## 1 Caltex          1500
## 2   Gulf          1494
## 3  Mobil          1397

#Closing balances of each Category
oildata2 %>% 
  group_by(Category) %>% 
      summarise(sum(Consumed))

## Source: local data frame [2 x 2]
## 
##      Category sum(Consumed)
##         (chr)         (int)
## 1  Engine Oil          2191
## 2 GearBox Oil          2200

#Closing balances of each Brand and Category
oildata2 %>% 
  group_by(Brand, Category) %>% 
      summarise(sum(Consumed))

## Source: local data frame [6 x 3]
## Groups: Brand [?]
## 
##    Brand    Category sum(Consumed)
##    (chr)       (chr)         (int)
## 1 Caltex  Engine Oil           754
## 2 Caltex GearBox Oil           746
## 3   Gulf  Engine Oil           761
## 4   Gulf GearBox Oil           733
## 5  Mobil  Engine Oil           676
## 6  Mobil GearBox Oil           721

#Identifying the most Consumed Brands across the two Categories
oildata2 %>% 
  group_by(Category, Brand) %>% 
    summarise(sum1 = sum(Consumed)) %>% 
      arrange(desc(sum1)) %>% 
        top_n(.,1,sum1)

## Source: local data frame [2 x 3]
## Groups: Category [2]
## 
##      Category  Brand  sum1
##         (chr)  (chr) (int)
## 1  Engine Oil   Gulf   761
## 2 GearBox Oil Caltex   746

That’s awesome.

Untidy Generator Capacity Prices

The third and final dataset in this article comes from Daniel Smilowitz and displays auction pricing across four locations by month for Monthly auctions, Spot auctions, and Strip auctions. Monthly auctions take place monthly before final values are known, spot auctions take place monthly once values are finalized, and strip auctions take place twice a year (May and November) for six months following. The locations of the auctions are: New York City (NYC), Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS). Like the two previous sets, I’ve uploaded this to my personal GitHub for easy importing. The analysis we’ll perform on this data includes:

Which month of the year sees the highest prices in each location?
What is the average difference between NYC and ROS prices?
Which calendar year saw the highest average price across regions (ignoring weighting)?
Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?
What category had the highest price in each year? added myself

genprices <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_Generator_Capacity_Prices.csv"),
                           sep = ",", stringsAsFactors=FALSE)
kable(head(genprices))

Auction	Monthly	Monthly.1	Monthly.2	Monthly.3	Spot	Spot.1	Spot.2	Spot.3	Strip	Strip.1	Strip.2	Strip.3
Location	NYC	LHV	LI	ROS	NYC	LHV	LI	ROS	NYC	LHV	LI	ROS
Nov-03	$6.67	$1.15	$0.50	$1.15	$6.98	$1.94	$8.14	$1.94	$6.55	$1.17	$4.00	$1.17
Dec-03	$6.67	$1.58	$5.00	$1.58	$6.98	$1.78	$8.22	$1.78	$6.55	$1.17	$4.00	$1.17
Jan-04	$6.67	$1.65	$8.10	$1.65	$6.98	$1.75	$7.99	$1.75	$6.55	$1.17	$4.00	$1.17
Feb-04	$6.95	$1.67	$7.50	$1.67	$6.98	$1.73	$7.08	$1.73	$6.55	$1.17	$4.00	$1.17
Mar-04	$6.25	$1.65	$7.00	$1.65	$6.98	$1.00	$7.72	$1.00	$6.55	$1.17	$4.00	$1.17

Let’s tidy this information. We know that there are three categories: Monthly, Spot, and Strip so we’ll make a table for each one, and then merge them together. After that we’ll split the Date into Month and Year, followed by changing N/A values to 0, which is a critical assumption and will skew the analysis. We’ll over-write the headers though so it’s a bit more clear which locations we’re looking at:

#Fixing the column names:
colnames(genprices) <- genprices[1,c(1:13)]
genprices <- genprices[2:151,]

#Seperating the three Categories and adding the category names to the three tables:
Monthly1 <- genprices[,1:5]
Monthly1$Category <- c("Monthly")
Spot1 <- genprices[,c(1,6:9)]
Spot1$Category <- c("Spot")
Strip1 <- genprices[,c(1,10:13)]
Strip1$Category <- c("Strip")

#Merging the tables:
genprices2 <- bind_rows(Monthly1, Spot1, Strip1)

#Making sure the price columns are integers and removing the dollar signs
genprices2$NYC <- as.numeric(unlist(str_replace_all(genprices2$NYC, "\\$", "")))
genprices2$LHV <- as.numeric(unlist(str_replace_all(genprices2$LHV, "\\$", "")))
genprices2$LI <- as.numeric(unlist(str_replace_all(genprices2$LI, "\\$", "")))

## Warning: NAs introduced by coercion

genprices2$ROS <- as.numeric(unlist(str_replace_all(genprices2$ROS, "\\$", "")))

#Seperating the month and year:
genprices2[,7:8] <- t(as.data.frame(strsplit(genprices2$Location,"-")))
colnames(genprices2)[c(7:8)] <- c("Month","Year")

#Removing redundant columns and change NA's to 0:
genprices2[is.na(genprices2)] <- 0
genprices <- genprices2[,c(2:8)]
kable(head(genprices))

NYC	LHV	LI	ROS	Category	Month	Year
6.67	1.15	0.50	1.15	Monthly	Nov	03
6.67	1.58	5.00	1.58	Monthly	Dec	03
6.67	1.65	8.10	1.65	Monthly	Jan	04
6.95	1.67	7.50	1.67	Monthly	Feb	04
6.25	1.65	7.00	1.65	Monthly	Mar	04
6.25	0.99	6.85	0.99	Monthly	Apr	04

kable(tail(genprices))

NYC	LHV	LI	ROS	Category	Month	Year
6.67	3.73	1.64	1.25	Strip	Nov	15
6.67	3.73	1.64	1.25	Strip	Dec	15
6.67	3.73	1.64	1.25	Strip	Jan	16
6.67	3.73	1.64	1.25	Strip	Feb	16
6.67	3.73	1.64	1.25	Strip	Mar	16
6.67	3.73	1.64	1.25	Strip	Apr	16

Now it’s time to analyse:

Which month of the year sees the highest prices in each location?
What is the average difference between NYC and ROS prices?
Which calendar year saw the highest average price across regions (ignoring weighting)?
Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?
What category had the highest price in each year? added myself

#Highest Monthly prices in each location, example: NYC
genprices %>% 
  group_by(Month, Category) %>% 
    summarise(max1 = max(NYC)) %>% 
      group_by(Month) %>% 
        summarise(max2 = max(max1)) %>% 
          arrange(desc(max2))

## Source: local data frame [12 x 2]
## 
##    Month  max2
##    (chr) (dbl)
## 1    Jun 18.84
## 2    May 18.83
## 3    Jul 18.71
## 4    Aug 18.56
## 5    Sep 18.37
## 6    Oct 18.19
## 7    Nov 10.01
## 8    Dec  9.90
## 9    Apr  9.75
## 10   Mar  9.72
## 11   Jan  9.63
## 12   Feb  9.56

#Average difference between NYC and ROS prices
round(mean(genprices$NYC - genprices$ROS), 2)

## [1] 6.7

#Calendar Year with highest average price across regions, example: LVH
genprices %>% 
  group_by(Year, Category) %>% 
    summarise(max1 = max(LHV)) %>% 
      group_by(Category, Year) %>% 
        summarise(max2 = max(max1)) %>% 
          top_n(.,1,max2)

## Source: local data frame [3 x 3]
## Groups: Category [3]
## 
##   Category  Year  max2
##      (chr) (chr) (dbl)
## 1  Monthly    14 12.25
## 2     Spot    14 12.38
## 3    Strip    14  9.96

#Month or Spot auction more volitile, example: ROS
ROS1 <- filter(genprices, genprices$Category == c("Monthly"))
ROS2 <- filter(genprices, genprices$Category == c("Spot"))
c(sd(ROS1$ROS),sd(ROS2$ROS))

## [1] 1.541426 1.623642

#The Spot auction market is more volitile

#Auction market with the highest price each year, example: LVH
genprices %>% 
  group_by(Year, Category) %>% 
    summarise(max1 = max(LHV)) %>%
      group_by(Year) %>% 
        top_n(.,1,max1)

## Source: local data frame [15 x 3]
## Groups: Year [14]
## 
##     Year Category  max1
##    (chr)    (chr) (dbl)
## 1     03     Spot  1.94
## 2     04     Spot  1.75
## 3     05     Spot  2.00
## 4     06     Spot  3.33
## 5     07     Spot  3.52
## 6     08     Spot  3.40
## 7     09     Spot  4.42
## 8     10     Spot  3.52
## 9     11  Monthly  0.65
## 10    11     Spot  0.65
## 11    12     Spot  2.91
## 12    13     Spot  6.07
## 13    14     Spot 12.38
## 14    15     Spot 10.93
## 15    16    Strip  3.73

Fantastic

DATA607 - Project 2 - Untidy Data

Chris Martin

March 6, 2016

Installing the Necessary Packages

Untidy Auction Data

Untidy Oil Consumption Data

Untidy Generator Capacity Prices