This project was done in consultation with Chris Martin, who partnered with me on the first project. I will be using the same library and cases as him, but I will do my coding independently.
First, we’ll import the libraries required.
library("plyr")
library("knitr")
library("stringr")
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
For our first example, we will use Chris Martin’s auction data. He describes it here: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&forum_id=_1321839_1&nav=discussion_board&conf_id=_1299369_1&course_id=_1245245_1&message_id=_18955095_1#msg__18955095_1Id
We are told we must find the % difference of an estimate and the sale price; there are two estimates, a high and a low estimate.
The two % differences we must find are:
(Price - High Estimate) / High Estimate
(Price - Low Estimate) / Low Estimate
Looking at the data here - https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv - we see that columns are separated by commas, so we’ll use that to splice up our data table. I’ve reproduced the data on my own Github.
auction <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv"), sep = ",", stringsAsFactors = FALSE)
We can use a regular expression to extract the low range values. The regular expression is simply - “take the first numeric string in the range field.” For the first row, that means taking “300” from" “$300 - $400”.
auction$LowEst <- as.numeric(str_extract(auction[,3],"[:digit:]+"))
Now let’s do the same for the high end of the range. The extract function we used for the low end of the range won’t work, because it starts from the left; using it again will give us the same result.
Instead, we use the str_sub function to draw in a sub string; by entering a negative value, we extract the n rightmost values. Here, we use 5.
Then we use the ‘extract the first set of digits’ command as we had with the low estimate.
auction$HighEst <- as.numeric(str_extract(str_sub(auction[,3], -5), "[:digit:]+"))
Now let’s take a look at our auction data
kable(auction)
| Lot | Description | Range | Price | LowEst | HighEst |
|---|---|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | $283 | 300 | 400 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | $234 | 80 | 120 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | $357 | 200 | 300 |
kable(auction)
| Lot | Description | Range | Price | LowEst | HighEst |
|---|---|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | $283 | 300 | 400 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | $234 | 80 | 120 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | $357 | 200 | 300 |
The $ signs in our price column may complicate computations later, so let’s remove them.
auction$Price <- as.numeric(str_replace_all(auction[,4], "\\$",""))
Now, let’s do our computations for the low and high estimates
auction$LowEstDiff <- str_c(round(100*(auction$LowEst - auction$Price)/auction$Price,0),"%")
auction$HighEstDiff <- str_c(round(100*(auction$HighEst - auction$Price)/auction$Price,0),"%")
kable(auction)
| Lot | Description | Range | Price | LowEst | HighEst | LowEstDiff | HighEstDiff |
|---|---|---|---|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | 283 | 300 | 400 | 6% | 41% |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | 234 | 80 | 120 | -66% | -49% |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | 357 | 200 | 300 | -44% | -16% |
In sum, we’ve identified the percent difference between the estimates and the actual selling price.
Case 2: Oil
Kishore Prasad uploaded a fictional dataset
Oil <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Oil.csv"), sep = ",", stringsAsFactors = FALSE)
kable(Oil)
| Month | Category | Caltex | Gulf | Mobil |
|---|---|---|---|---|
| Open | Engine Oil | 140 : 000 | 199 : 000 | 141 : 000 |
| GearBox Oil | 198 : 000 | 132 : 000 | 121 : 000 | |
| Jan | Engine Oil | 170 : 103 | 194 : 132 | 109 : 127 |
| GearBox Oil | 132 : 106 | 125 : 105 | 191 : 100 | |
| Feb | Engine Oil | 112 : 133 | 138 : 113 | 171 : 101 |
| GearBox Oil | 193 : 148 | 199 : 119 | 134 : 127 | |
| Mar | Engine Oil | 184 : 100 | 141 : 141 | 114 : 108 |
| GearBox Oil | 138 : 121 | 172 : 133 | 193 : 115 | |
| Apr | Engine Oil | 149 : 150 | 117 : 118 | 117 : 118 |
| GearBox Oil | 185 : 125 | 191 : 133 | 119 : 121 | |
| May | Engine Oil | 170 : 139 | 104 : 119 | 200 : 117 |
| GearBox Oil | 168 : 117 | 138 : 102 | 121 : 146 | |
| Jun | Engine Oil | 159 : 129 | 170 : 138 | 169 : 105 |
| GearBox Oil | 107 : 129 | 195 : 141 | 141 : 112 |
First, let’s separate the table into two tables, one for Engine Oil and one for Gearbox Oil. Engine Oil is on odd numbered rows, while Gearbox is on even numbered rows. The table for Gearbox Oil is missing its months, so we’ll add them in
EngRows <- seq(1, length(Oil[,1]),2)
GearRows <- seq(2, length(Oil[,1]),2)
EngineOil <- Oil[EngRows,1:2]
GearboxOil <- Oil[GearRows,1:2]
GearboxOil$Month <- EngineOil$Month
kable(EngineOil)
| Month | Category | |
|---|---|---|
| 1 | Open | Engine Oil |
| 3 | Jan | Engine Oil |
| 5 | Feb | Engine Oil |
| 7 | Mar | Engine Oil |
| 9 | Apr | Engine Oil |
| 11 | May | Engine Oil |
| 13 | Jun | Engine Oil |
kable(GearboxOil)
| Month | Category | |
|---|---|---|
| 2 | Open | GearBox Oil |
| 4 | Jan | GearBox Oil |
| 6 | Feb | GearBox Oil |
| 8 | Mar | GearBox Oil |
| 10 | Apr | GearBox Oil |
| 12 | May | GearBox Oil |
| 14 | Jun | GearBox Oil |
Now, we want to split up each column for each brand, into bought and consumed quantities. We’ll use the convention of the brand name followed by a P for purchased and C for consumed. So we’ll have CaltexP, CaltexC, GulfP, GulfC, MobilP, MobilC.
We’ll extract the purchased portion by simply using a regular expression to extract the first string of digits. To get the consumed portion, we’ll read in the last five characters of that string, and then extract the digits from that.
Since these are strings by default, we’ll have to cast them all to numerics.
#For Engine Oil
EngineOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[EngRows],"[:digit:]+"))
EngineOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[EngRows], -5),"[:digit:]+"))
EngineOil$MobilP <- as.numeric(str_extract(Oil$Mobil[EngRows],"[:digit:]+"))
EngineOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[EngRows], -5),"[:digit:]+"))
EngineOil$GulfP <- as.numeric(str_extract(Oil$Gulf[EngRows],"[:digit:]+"))
EngineOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[EngRows], -5),"[:digit:]+"))
kable(EngineOil)
| Month | Category | CaltexP | CaltexC | MobilP | MobilC | GulfP | GulfC | |
|---|---|---|---|---|---|---|---|---|
| 1 | Open | Engine Oil | 140 | 0 | 141 | 0 | 199 | 0 |
| 3 | Jan | Engine Oil | 170 | 103 | 109 | 127 | 194 | 132 |
| 5 | Feb | Engine Oil | 112 | 133 | 171 | 101 | 138 | 113 |
| 7 | Mar | Engine Oil | 184 | 100 | 114 | 108 | 141 | 141 |
| 9 | Apr | Engine Oil | 149 | 150 | 117 | 118 | 117 | 118 |
| 11 | May | Engine Oil | 170 | 139 | 200 | 117 | 104 | 119 |
| 13 | Jun | Engine Oil | 159 | 129 | 169 | 105 | 170 | 138 |
#For Gearbox Oil
GearboxOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[GearRows],"[:digit:]+"))
GearboxOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[GearRows], -5),"[:digit:]+"))
GearboxOil$MobilP <- as.numeric(str_extract(Oil$Mobil[GearRows],"[:digit:]+"))
GearboxOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[GearRows], -5),"[:digit:]+"))
GearboxOil$GulfP <- as.numeric(str_extract(Oil$Gulf[GearRows],"[:digit:]+"))
GearboxOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[GearRows], -5),"[:digit:]+"))
kable(GearboxOil)
| Month | Category | CaltexP | CaltexC | MobilP | MobilC | GulfP | GulfC | |
|---|---|---|---|---|---|---|---|---|
| 2 | Open | GearBox Oil | 198 | 0 | 121 | 0 | 132 | 0 |
| 4 | Jan | GearBox Oil | 132 | 106 | 191 | 100 | 125 | 105 |
| 6 | Feb | GearBox Oil | 193 | 148 | 134 | 127 | 199 | 119 |
| 8 | Mar | GearBox Oil | 138 | 121 | 193 | 115 | 172 | 133 |
| 10 | Apr | GearBox Oil | 185 | 125 | 119 | 121 | 191 | 133 |
| 12 | May | GearBox Oil | 168 | 117 | 121 | 146 | 138 | 102 |
| 14 | Jun | GearBox Oil | 107 | 129 | 141 | 112 | 195 | 141 |
Now, for our results.
First, how much of each kind of oil is left, for each brand?
That’s answered by the total purchased minus total consumed, for each brand and type of oil. We will have six answers.
CaltexB <- c(EngineOil = sum(EngineOil$CaltexP) - sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexP) - sum(GearboxOil$CaltexC))
MobilB <- c(EngineOil = sum(EngineOil$MobilP) - sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilP) - sum(GearboxOil$MobilC))
GulfB <- c(EngineOil = sum(EngineOil$GulfP) - sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfP) - sum(GearboxOil$GulfC))
OilBalance <- data.frame(CaltexB, MobilB, GulfB)
kable(OilBalance)
| CaltexB | MobilB | GulfB | |
|---|---|---|---|
| EngineOil | 330 | 345 | 302 |
| GearboxOil | 375 | 299 | 419 |
Now let’s identify the most consumed brands, for each engine type. First, we’ll create a table of the consumption figures by brand and oil type. We’ll refer to this as TC, as in Total Consumed.
CaltexTC <- c(EngineOil = sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexC), Total = sum(EngineOil$CaltexC) + sum(GearboxOil$CaltexC))
MobilTC <- c(EngineOil = sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilC), Total = sum(EngineOil$MobilC) + sum(GearboxOil$MobilC))
GulfTC <- c(EngineOil = sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfC), Total = sum(EngineOil$GulfC) + sum(GearboxOil$GulfC))
OilTC <- data.frame(CaltexTC, MobilTC, GulfTC)
kable(OilTC)
| CaltexTC | MobilTC | GulfTC | |
|---|---|---|---|
| EngineOil | 754 | 676 | 761 |
| GearboxOil | 746 | 721 | 733 |
| Total | 1500 | 1397 | 1494 |
We see that Caltex sold the most overall, as well as the most Gearbox Oil, while Gulf sold the most Engine Oil.
Now for the third case: Dan Smilowitz’s.
Generators in the NYS electricity market are paid based on the outcome of competitive Unforced Capacity (UCAP) auctions.
There are 3 auctions: .Monthly - takes place monthly before final values are known .Spot - takes place monthly once values are finalized .Strip - takes place twice a year (May and November) for the following six months
The prices for each auction are set in four different locations - NYC, Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS).
The past ~12.5 years worth of auction results are contained in a table, as depicted below:
All values are the price paid to generators in $/kW-month. I’ve attached the data in a messy .csv here.
Data source: New York Independent System Operator
The Analysis
Which month of the year sees the highest prices in each location?
What is the average difference between NYC and ROS prices?
Which calendar year saw the highest average price across regions (ignoring weighting)?
Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?
Let’s read in the data, and then consolidate the two header rows into one.
generator <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Generator.csv"), sep = ",", stringsAsFactors = FALSE, header = TRUE)
colnames(generator) <- str_c(colnames(generator), generator[1,], sep = "")
kable(head(generator))
| AuctionLocation | MonthlyNYC | Monthly.1LHV | Monthly.2LI | Monthly.3ROS | SpotNYC | Spot.1LHV | Spot.2LI | Spot.3ROS | StripNYC | Strip.1LHV | Strip.2LI | Strip.3ROS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Location | NYC | LHV | LI | ROS | NYC | LHV | LI | ROS | NYC | LHV | LI | ROS |
| Nov-03 | $6.67 | $1.15 | $0.50 | $1.15 | $6.98 | $1.94 | $8.14 | $1.94 | $6.55 | $1.17 | $4.00 | $1.17 |
| Dec-03 | $6.67 | $1.58 | $5.00 | $1.58 | $6.98 | $1.78 | $8.22 | $1.78 | $6.55 | $1.17 | $4.00 | $1.17 |
| Jan-04 | $6.67 | $1.65 | $8.10 | $1.65 | $6.98 | $1.75 | $7.99 | $1.75 | $6.55 | $1.17 | $4.00 | $1.17 |
| Feb-04 | $6.95 | $1.67 | $7.50 | $1.67 | $6.98 | $1.73 | $7.08 | $1.73 | $6.55 | $1.17 | $4.00 | $1.17 |
| Mar-04 | $6.25 | $1.65 | $7.00 | $1.65 | $6.98 | $1.00 | $7.72 | $1.00 | $6.55 | $1.17 | $4.00 | $1.17 |
Some of those header names are ugly, with a redundant period and number in them. Let’s remove them.
colnames(generator) <- str_replace_all(colnames(generator), "\\.[:digit:]","")
colnames(generator) <- str_replace(colnames(generator), "AuctionLocation", "Month")
generator <- generator[-1,]
kable(head(generator))
| Month | MonthlyNYC | MonthlyLHV | MonthlyLI | MonthlyROS | SpotNYC | SpotLHV | SpotLI | SpotROS | StripNYC | StripLHV | StripLI | StripROS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Nov-03 | $6.67 | $1.15 | $0.50 | $1.15 | $6.98 | $1.94 | $8.14 | $1.94 | $6.55 | $1.17 | $4.00 | $1.17 |
| 3 | Dec-03 | $6.67 | $1.58 | $5.00 | $1.58 | $6.98 | $1.78 | $8.22 | $1.78 | $6.55 | $1.17 | $4.00 | $1.17 |
| 4 | Jan-04 | $6.67 | $1.65 | $8.10 | $1.65 | $6.98 | $1.75 | $7.99 | $1.75 | $6.55 | $1.17 | $4.00 | $1.17 |
| 5 | Feb-04 | $6.95 | $1.67 | $7.50 | $1.67 | $6.98 | $1.73 | $7.08 | $1.73 | $6.55 | $1.17 | $4.00 | $1.17 |
| 6 | Mar-04 | $6.25 | $1.65 | $7.00 | $1.65 | $6.98 | $1.00 | $7.72 | $1.00 | $6.55 | $1.17 | $4.00 | $1.17 |
| 7 | Apr-04 | $6.25 | $0.99 | $6.85 | $0.99 | $6.98 | $0.80 | $7.04 | $0.80 | $6.55 | $1.17 | $4.00 | $1.17 |
Next, let’s separate the date column into two columns, months and years; we’ll also scrub the $ signs and cast the non-date contents to numerics.
Year <- as.numeric(unlist(str_extract(generator[,1],"[:digit:]+")))
generator$Month <- str_replace_all(generator$Month,"\\-[:digit:]+","")
generator$Year <- Year
generator <- generator[,c("Year", "Month",colnames(generator)[2:13])]
pricecols <- length(colnames(generator))
for (i in 3:pricecols) {
generator[,i] <- as.numeric(str_replace_all(generator[,i],"\\$",""))
}
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
kable(head(generator))
| Year | Month | MonthlyNYC | MonthlyLHV | MonthlyLI | MonthlyROS | SpotNYC | SpotLHV | SpotLI | SpotROS | StripNYC | StripLHV | StripLI | StripROS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | Nov | 6.67 | 1.15 | 0.50 | 1.15 | 6.98 | 1.94 | 8.14 | 1.94 | 6.55 | 1.17 | 4 | 1.17 |
| 3 | 3 | Dec | 6.67 | 1.58 | 5.00 | 1.58 | 6.98 | 1.78 | 8.22 | 1.78 | 6.55 | 1.17 | 4 | 1.17 |
| 4 | 4 | Jan | 6.67 | 1.65 | 8.10 | 1.65 | 6.98 | 1.75 | 7.99 | 1.75 | 6.55 | 1.17 | 4 | 1.17 |
| 5 | 4 | Feb | 6.95 | 1.67 | 7.50 | 1.67 | 6.98 | 1.73 | 7.08 | 1.73 | 6.55 | 1.17 | 4 | 1.17 |
| 6 | 4 | Mar | 6.25 | 1.65 | 7.00 | 1.65 | 6.98 | 1.00 | 7.72 | 1.00 | 6.55 | 1.17 | 4 | 1.17 |
| 7 | 4 | Apr | 6.25 | 0.99 | 6.85 | 0.99 | 6.98 | 0.80 | 7.04 | 0.80 | 6.55 | 1.17 | 4 | 1.17 |
Now our dataset is clean and ready for analysis.
For the first question:
“Which month of the year sees the highest prices in each location?”
We’ll use the ddply function to compute averages for each monthly cost and location. Then we find the highest average for each location, which happens to be September for all locations.
AvgByMonth <- ddply(generator,~Month,summarise,meanNYC=mean(MonthlyNYC, na.rm = TRUE), meanLHV = mean(MonthlyLHV, na.rm = TRUE), meanLI = mean(MonthlyLI, na.rm = TRUE), meanROS = mean(MonthlyROS, na.rm = TRUE))
MaxMonth <- apply(AvgByMonth, 2, function(x) max(x, na.rm = TRUE))
Question 2: “What is the average difference between NYC and ROS prices?”
For variety, we will now use the spot prices.
NYC_ROS <- generator[ , c(1:2)]
NYC_ROS$Diff <- generator$SpotNYC - generator$SpotROS
kable(head(NYC_ROS))
| Year | Month | Diff | |
|---|---|---|---|
| 2 | 3 | Nov | 5.04 |
| 3 | 3 | Dec | 5.20 |
| 4 | 4 | Jan | 5.23 |
| 5 | 4 | Feb | 5.25 |
| 6 | 4 | Mar | 5.98 |
| 7 | 4 | Apr | 6.18 |
mean(NYC_ROS$Diff, na.rm = TRUE)
## [1] 6.66255
We get a value of a $6.67 difference in the spot price, between NYC and the rest of the state.
Question 3: Which calendar year saw the highest average price across regions (ignoring weighting)?
First, let’s compute an average across the regions. We’ll use the monthly price now. And since our information for the first and last years, 03 and 16, are incomplete, we will exclude those years from our analysis.
MaxYearNY <- generator[ , c(1:2)]
MaxYearNY$Mean <- rowMeans(generator[ , 3:6])
MaxYearNY <- subset(MaxYearNY, (Year > 3) & (Year < 16))
MaxYearNY$Year <- MaxYearNY$Year + 2000
kable(head(MaxYearNY))
| Year | Month | Mean | |
|---|---|---|---|
| 4 | 2004 | Jan | 4.5175 |
| 5 | 2004 | Feb | 4.4475 |
| 6 | 2004 | Mar | 4.1375 |
| 7 | 2004 | Apr | 3.7700 |
| 8 | 2004 | May | 5.6150 |
| 9 | 2004 | Jun | 5.9250 |
Now, we’ll use the ddply function to group our monthly averages by year, and then the apply function to find the max within that.
YearMeansNY <- ddply(MaxYearNY,~Year,summarise, meanYear = mean(Mean))
MaxYearNY <- which.max(YearMeansNY$meanYear)
YearMeansNY$Year[MaxYearNY]
## [1] 2014
2014 had the highest average across all of NY state (using an unweighted average).
Question 4: Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?
Let’s compute the standard deviation for each column. First, we’ll remove the last row because it has some null values
generator1 <- generator[c(1:49, 51:149),1:10]
apply(generator1, 2, function(x) sd(x))
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm
## = na.rm): NAs introduced by coercion
## Year Month MonthlyNYC MonthlyLHV MonthlyLI MonthlyROS
## 3.614458 NA 4.351853 2.771195 2.455219 1.542416
## SpotNYC SpotLHV SpotLI SpotROS
## 4.486423 2.793636 3.050375 1.625278
We see that for each location, the spot has a higher SD than the monthly price. Therefore, the spot is more volatile - although the difference is generally small.