In this project, we were tasked with tidying and analyzing the data of three different data sets. To kick it off, we’ll install the needed packages:
library("knitr")
library("stringr")
library("tidyr")
library("dplyr")The first dataset I’ll tidy and analyse is my own for untidy Auction Data. What makes this data untidy is that the descriptions include a heavy amount of text that could be useful in analysis and the estimate prices are combined with a high and low estimate. Now we can import the data and see what it is we want to pull out. While there is certainly a lot more we can do with the data, the analysis we’ll perform for this dataset is:
auctiondata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/utidy_auction_data.csv"),
sep = ",", stringsAsFactors=FALSE)
kable(auctiondata)| Lot | Description | Range | Price |
|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | $283 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | $234 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | $357 |
To tidy this, we’ll seperate the range from two figures to a high/low estimate and also pull out the heigth and width from the description. For the estimates we could try extracting them individually, but it requires a bit more headache since it’s not as easy to skip the first result (as seen below):
#Estimate: High
str_extract(auctiondata$Range, "\\$[[:digit:]]+")## [1] "$300" "$80" "$200"
#Estimate: Low
str_extract(auctiondata$Range, "\\$[[:digit:]]+")## [1] "$300" "$80" "$200"
Instead I’ll simply extract all using str_extract_all and use seq create the columns from there to add to the data.
e1 <- unlist(str_extract_all(auctiondata$Range, "\\$[[:digit:]]+"))
e1## [1] "$300" "$400" "$80" "$120" "$200" "$300"
auctiondata$est_low <- e1[seq(1,6,2)]
auctiondata$est_high <- e1[seq(2,6,2)]
kable(auctiondata)| Lot | Description | Range | Price | est_low | est_high |
|---|---|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | $283 | $300 | $400 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | $234 | $80 | $120 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | $357 | $200 | $300 |
This looks great, however the dollar signs will throw off any analysis so we’ll remove them all.
auctiondata$Price <- as.integer(str_replace_all(auctiondata$Price, "\\$", ""))
auctiondata$est_low <- as.integer(str_replace_all(auctiondata$est_low, "\\$", ""))
auctiondata$est_high <- as.integer(str_replace_all(auctiondata$est_high, "\\$", ""))
kable(auctiondata)| Lot | Description | Range | Price | est_low | est_high |
|---|---|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | 283 | 300 | 400 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | 234 | 80 | 120 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | 357 | 200 | 300 |
For the analysis, we want to find the total percentage difference between the estimates and actual sales price.
#High Estimate:
str_c(round(((sum(auctiondata$est_high) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")## [1] "-6.18%"
#Low Estimate:
str_c(round(((sum(auctiondata$est_low) - sum(auctiondata$Price)) / sum(auctiondata$Price)) * 100, 2),"%")## [1] "-33.64%"
Perhaps this analysis was too easy with the small table. The next two examples should be more complicated:
This fictional dataset comes from Kishore Prasad and displays oil consumption details of various brands in the first half of 2015, along with opening balances. The numbers are in the format of Purchased:COnsumed. The Brands in this case are Caltex, Gulf, and Mobil, while the Category includes Engine Oil and GearBox Oil. I’ve started by uploading the dataset into my personal GitHub account so that I can easily import the data for analysis. The analysis request is to:
oildata <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_oil_consumption.csv"),
sep = ",", stringsAsFactors=FALSE)
kable(oildata)| Month | Category | Caltex | Gulf | Mobil |
|---|---|---|---|---|
| Open | Engine Oil | 140 : 000 | 199 : 000 | 141 : 000 |
| GearBox Oil | 198 : 000 | 132 : 000 | 121 : 000 | |
| Jan | Engine Oil | 170 : 103 | 194 : 132 | 109 : 127 |
| GearBox Oil | 132 : 106 | 125 : 105 | 191 : 100 | |
| Feb | Engine Oil | 112 : 133 | 138 : 113 | 171 : 101 |
| GearBox Oil | 193 : 148 | 199 : 119 | 134 : 127 | |
| Mar | Engine Oil | 184 : 100 | 141 : 141 | 114 : 108 |
| GearBox Oil | 138 : 121 | 172 : 133 | 193 : 115 | |
| Apr | Engine Oil | 149 : 150 | 117 : 118 | 117 : 118 |
| GearBox Oil | 185 : 125 | 191 : 133 | 119 : 121 | |
| May | Engine Oil | 170 : 139 | 104 : 119 | 200 : 117 |
| GearBox Oil | 168 : 117 | 138 : 102 | 121 : 146 | |
| Jun | Engine Oil | 159 : 129 | 170 : 138 | 169 : 105 |
| GearBox Oil | 107 : 129 | 195 : 141 | 141 : 112 |
As we can see from the Brands, the quantity purchased and quantity consumed are seperated by a semi-colon :, so we’ll start by separating the two. Since we’re only looking for the total quantity purchased and consumed, the analysis is (thankfully) made slightly easier since the Month really doesn’t matter. We can ignore that column. We’re going to create three lists, one for each brand containing the purchased and consumed quantities, then add them back into the table.
#Caltex
brand1 <- unlist(str_split(oildata$Caltex, " : "))
oildata$brand1 <- c("Caltex")
oildata$CalPurchased <- brand1[seq(1,28,2)]
oildata$CalConsumed <- brand1[seq(2,28,2)]
#Gulf
brand2 <- unlist(str_split(oildata$Gulf, " : "))
oildata$brand2 <- c("Gulf")
oildata$GulfPurchased <- brand2[seq(1,28,2)]
oildata$GulfConsumed <- brand2[seq(2,28,2)]
#Mobil
brand3 <- unlist(str_split(oildata$Mobil, " : "))
oildata$brand3 <- c("Mobil")
oildata$MobPurchased <- brand3[seq(1,28,2)]
oildata$MobConsumed <- brand3[seq(2,28,2)]
#Remove redunant lines
oildata <- oildata[,c(1:2,6:14)]
kable(head(oildata))| Month | Category | brand1 | CalPurchased | CalConsumed | brand2 | GulfPurchased | GulfConsumed | brand3 | MobPurchased | MobConsumed |
|---|---|---|---|---|---|---|---|---|---|---|
| Open | Engine Oil | Caltex | 140 | 000 | Gulf | 199 | 000 | Mobil | 141 | 000 |
| GearBox Oil | Caltex | 198 | 000 | Gulf | 132 | 000 | Mobil | 121 | 000 | |
| Jan | Engine Oil | Caltex | 170 | 103 | Gulf | 194 | 132 | Mobil | 109 | 127 |
| GearBox Oil | Caltex | 132 | 106 | Gulf | 125 | 105 | Mobil | 191 | 100 | |
| Feb | Engine Oil | Caltex | 112 | 133 | Gulf | 138 | 113 | Mobil | 171 | 101 |
| GearBox Oil | Caltex | 193 | 148 | Gulf | 199 | 119 | Mobil | 134 | 127 |
Looking great, but not looking tidy. Let’s make it tidy. To do this, we’ll re-separate all the columns and give them the same column names, then merge them using bind_rows. Yes, this could have been done in the previous step, but what’s the fun in simplicity?
brand1 <- oildata[,c(1:5)]
colnames(brand1) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand2 <- oildata[,c(1:2,6:8)]
colnames(brand2) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
brand3 <- oildata[,c(1:2,9:11)]
colnames(brand3) <- c("Month", "Category", "Brand", "Purchased", "Consumed")
#Combining the tables
oildata2 <- bind_rows(brand1, brand2, brand3)
#Making sure the right columns are integers
oildata2$Purchased <- as.integer(oildata2$Purchased)
oildata2$Consumed <- as.integer(oildata2$Consumed)
kable(head(oildata2))| Month | Category | Brand | Purchased | Consumed |
|---|---|---|---|---|
| Open | Engine Oil | Caltex | 140 | 0 |
| GearBox Oil | Caltex | 198 | 0 | |
| Jan | Engine Oil | Caltex | 170 | 103 |
| GearBox Oil | Caltex | 132 | 106 | |
| Feb | Engine Oil | Caltex | 112 | 133 |
| GearBox Oil | Caltex | 193 | 148 |
kable(tail(oildata2))| Month | Category | Brand | Purchased | Consumed |
|---|---|---|---|---|
| Apr | Engine Oil | Mobil | 117 | 118 |
| GearBox Oil | Mobil | 119 | 121 | |
| May | Engine Oil | Mobil | 200 | 117 |
| GearBox Oil | Mobil | 121 | 146 | |
| Jun | Engine Oil | Mobil | 169 | 105 |
| GearBox Oil | Mobil | 141 | 112 |
It’s beautiful! Now let’s analyse:
#Closing balances of each Brand
oildata2 %>%
group_by(Brand) %>%
summarise(sum(Consumed))## Source: local data frame [3 x 2]
##
## Brand sum(Consumed)
## (chr) (int)
## 1 Caltex 1500
## 2 Gulf 1494
## 3 Mobil 1397
#Closing balances of each Category
oildata2 %>%
group_by(Category) %>%
summarise(sum(Consumed))## Source: local data frame [2 x 2]
##
## Category sum(Consumed)
## (chr) (int)
## 1 Engine Oil 2191
## 2 GearBox Oil 2200
#Closing balances of each Brand and Category
oildata2 %>%
group_by(Brand, Category) %>%
summarise(sum(Consumed))## Source: local data frame [6 x 3]
## Groups: Brand [?]
##
## Brand Category sum(Consumed)
## (chr) (chr) (int)
## 1 Caltex Engine Oil 754
## 2 Caltex GearBox Oil 746
## 3 Gulf Engine Oil 761
## 4 Gulf GearBox Oil 733
## 5 Mobil Engine Oil 676
## 6 Mobil GearBox Oil 721
#Identifying the most Consumed Brands across the two Categories
oildata2 %>%
group_by(Category, Brand) %>%
summarise(sum1 = sum(Consumed)) %>%
arrange(desc(sum1)) %>%
top_n(.,1,sum1)## Source: local data frame [2 x 3]
## Groups: Category [2]
##
## Category Brand sum1
## (chr) (chr) (int)
## 1 Engine Oil Gulf 761
## 2 GearBox Oil Caltex 746
That’s awesome.
The third and final dataset in this article comes from Daniel Smilowitz and displays auction pricing across four locations by month for Monthly auctions, Spot auctions, and Strip auctions. Monthly auctions take place monthly before final values are known, spot auctions take place monthly once values are finalized, and strip auctions take place twice a year (May and November) for six months following. The locations of the auctions are: New York City (NYC), Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS). Like the two previous sets, I’ve uploaded this to my personal GitHub for easy importing. The analysis we’ll perform on this data includes:
genprices <- read.csv(url("https://raw.githubusercontent.com/chrisgmartin/DATA607/master/untidy_Generator_Capacity_Prices.csv"),
sep = ",", stringsAsFactors=FALSE)
kable(head(genprices))| Auction | Monthly | Monthly.1 | Monthly.2 | Monthly.3 | Spot | Spot.1 | Spot.2 | Spot.3 | Strip | Strip.1 | Strip.2 | Strip.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Location | NYC | LHV | LI | ROS | NYC | LHV | LI | ROS | NYC | LHV | LI | ROS |
| Nov-03 | $6.67 | $1.15 | $0.50 | $1.15 | $6.98 | $1.94 | $8.14 | $1.94 | $6.55 | $1.17 | $4.00 | $1.17 |
| Dec-03 | $6.67 | $1.58 | $5.00 | $1.58 | $6.98 | $1.78 | $8.22 | $1.78 | $6.55 | $1.17 | $4.00 | $1.17 |
| Jan-04 | $6.67 | $1.65 | $8.10 | $1.65 | $6.98 | $1.75 | $7.99 | $1.75 | $6.55 | $1.17 | $4.00 | $1.17 |
| Feb-04 | $6.95 | $1.67 | $7.50 | $1.67 | $6.98 | $1.73 | $7.08 | $1.73 | $6.55 | $1.17 | $4.00 | $1.17 |
| Mar-04 | $6.25 | $1.65 | $7.00 | $1.65 | $6.98 | $1.00 | $7.72 | $1.00 | $6.55 | $1.17 | $4.00 | $1.17 |
Let’s tidy this information. We know that there are three categories: Monthly, Spot, and Strip so we’ll make a table for each one, and then merge them together. After that we’ll split the Date into Month and Year, followed by changing N/A values to 0, which is a critical assumption and will skew the analysis. We’ll over-write the headers though so it’s a bit more clear which locations we’re looking at:
#Fixing the column names:
colnames(genprices) <- genprices[1,c(1:13)]
genprices <- genprices[2:151,]
#Seperating the three Categories and adding the category names to the three tables:
Monthly1 <- genprices[,1:5]
Monthly1$Category <- c("Monthly")
Spot1 <- genprices[,c(1,6:9)]
Spot1$Category <- c("Spot")
Strip1 <- genprices[,c(1,10:13)]
Strip1$Category <- c("Strip")
#Merging the tables:
genprices2 <- bind_rows(Monthly1, Spot1, Strip1)
#Making sure the price columns are integers and removing the dollar signs
genprices2$NYC <- as.numeric(unlist(str_replace_all(genprices2$NYC, "\\$", "")))
genprices2$LHV <- as.numeric(unlist(str_replace_all(genprices2$LHV, "\\$", "")))
genprices2$LI <- as.numeric(unlist(str_replace_all(genprices2$LI, "\\$", "")))## Warning: NAs introduced by coercion
genprices2$ROS <- as.numeric(unlist(str_replace_all(genprices2$ROS, "\\$", "")))
#Seperating the month and year:
genprices2[,7:8] <- t(as.data.frame(strsplit(genprices2$Location,"-")))
colnames(genprices2)[c(7:8)] <- c("Month","Year")
#Removing redundant columns and change NA's to 0:
genprices2[is.na(genprices2)] <- 0
genprices <- genprices2[,c(2:8)]
kable(head(genprices))| NYC | LHV | LI | ROS | Category | Month | Year |
|---|---|---|---|---|---|---|
| 6.67 | 1.15 | 0.50 | 1.15 | Monthly | Nov | 03 |
| 6.67 | 1.58 | 5.00 | 1.58 | Monthly | Dec | 03 |
| 6.67 | 1.65 | 8.10 | 1.65 | Monthly | Jan | 04 |
| 6.95 | 1.67 | 7.50 | 1.67 | Monthly | Feb | 04 |
| 6.25 | 1.65 | 7.00 | 1.65 | Monthly | Mar | 04 |
| 6.25 | 0.99 | 6.85 | 0.99 | Monthly | Apr | 04 |
kable(tail(genprices))| NYC | LHV | LI | ROS | Category | Month | Year |
|---|---|---|---|---|---|---|
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Nov | 15 |
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Dec | 15 |
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Jan | 16 |
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Feb | 16 |
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Mar | 16 |
| 6.67 | 3.73 | 1.64 | 1.25 | Strip | Apr | 16 |
Now it’s time to analyse:
#Highest Monthly prices in each location, example: NYC
genprices %>%
group_by(Month, Category) %>%
summarise(max1 = max(NYC)) %>%
group_by(Month) %>%
summarise(max2 = max(max1)) %>%
arrange(desc(max2))## Source: local data frame [12 x 2]
##
## Month max2
## (chr) (dbl)
## 1 Jun 18.84
## 2 May 18.83
## 3 Jul 18.71
## 4 Aug 18.56
## 5 Sep 18.37
## 6 Oct 18.19
## 7 Nov 10.01
## 8 Dec 9.90
## 9 Apr 9.75
## 10 Mar 9.72
## 11 Jan 9.63
## 12 Feb 9.56
#Average difference between NYC and ROS prices
round(mean(genprices$NYC - genprices$ROS), 2)## [1] 6.7
#Calendar Year with highest average price across regions, example: LVH
genprices %>%
group_by(Year, Category) %>%
summarise(max1 = max(LHV)) %>%
group_by(Category, Year) %>%
summarise(max2 = max(max1)) %>%
top_n(.,1,max2)## Source: local data frame [3 x 3]
## Groups: Category [3]
##
## Category Year max2
## (chr) (chr) (dbl)
## 1 Monthly 14 12.25
## 2 Spot 14 12.38
## 3 Strip 14 9.96
#Month or Spot auction more volitile, example: ROS
ROS1 <- filter(genprices, genprices$Category == c("Monthly"))
ROS2 <- filter(genprices, genprices$Category == c("Spot"))
c(sd(ROS1$ROS),sd(ROS2$ROS))## [1] 1.541426 1.623642
#The Spot auction market is more volitile
#Auction market with the highest price each year, example: LVH
genprices %>%
group_by(Year, Category) %>%
summarise(max1 = max(LHV)) %>%
group_by(Year) %>%
top_n(.,1,max1)## Source: local data frame [15 x 3]
## Groups: Year [14]
##
## Year Category max1
## (chr) (chr) (dbl)
## 1 03 Spot 1.94
## 2 04 Spot 1.75
## 3 05 Spot 2.00
## 4 06 Spot 3.33
## 5 07 Spot 3.52
## 6 08 Spot 3.40
## 7 09 Spot 4.42
## 8 10 Spot 3.52
## 9 11 Monthly 0.65
## 10 11 Spot 0.65
## 11 12 Spot 2.91
## 12 13 Spot 6.07
## 13 14 Spot 12.38
## 14 15 Spot 10.93
## 15 16 Strip 3.73
Fantastic