This project was done in consultation with Chris Martin, who partnered with me on the first project. I will be using the same library and cases as him, but I will do my coding independently.

First, we’ll import the libraries required.

library("plyr")
library("knitr")
library("stringr")
library("tidyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For our first example, we will use Chris Martin’s auction data. He describes it here: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&forum_id=_1321839_1&nav=discussion_board&conf_id=_1299369_1&course_id=_1245245_1&message_id=_18955095_1#msg__18955095_1Id

We are told we must find the % difference of an estimate and the sale price; there are two estimates, a high and a low estimate.

The two % differences we must find are:

  1. (Price - High Estimate) / High Estimate

  2. (Price - Low Estimate) / Low Estimate

Looking at the data here - https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv - we see that columns are separated by commas, so we’ll use that to splice up our data table. I’ve reproduced the data on my own Github.

auction <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv"), sep = ",", stringsAsFactors = FALSE)

We can use a regular expression to extract the low range values. The regular expression is simply - “take the first numeric string in the range field.” For the first row, that means taking “300” from" “$300 - $400”.

auction$LowEst <- as.numeric(str_extract(auction[,3],"[:digit:]+"))

Now let’s do the same for the high end of the range. The extract function we used for the low end of the range won’t work, because it starts from the left; using it again will give us the same result.

Instead, we use the str_sub function to draw in a sub string; by entering a negative value, we extract the n rightmost values. Here, we use 5.

Then we use the ‘extract the first set of digits’ command as we had with the low estimate.

auction$HighEst <- as.numeric(str_extract(str_sub(auction[,3], -5), "[:digit:]+"))

Now let’s take a look at our auction data

kable(auction)
Lot Description Range Price LowEst HighEst
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 $283 300 400
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 $234 80 120
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 $357 200 300
kable(auction)
Lot Description Range Price LowEst HighEst
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 $283 300 400
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 $234 80 120
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 $357 200 300

The $ signs in our price column may complicate computations later, so let’s remove them.

auction$Price <- as.numeric(str_replace_all(auction[,4], "\\$",""))

Now, let’s do our computations for the low and high estimates

auction$LowEstDiff <- str_c(round(100*(auction$LowEst - auction$Price)/auction$Price,0),"%")
auction$HighEstDiff <- str_c(round(100*(auction$HighEst - auction$Price)/auction$Price,0),"%")
kable(auction)
Lot Description Range Price LowEst HighEst LowEstDiff HighEstDiff
2001 New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. $300 - $400 283 300 400 6% 41%
2002 Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. $80 - $120 234 80 120 -66% -49%
2003 Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. $200 - $300 357 200 300 -44% -16%

In sum, we’ve identified the percent difference between the estimates and the actual selling price.


Case 2: Oil

Kishore Prasad uploaded a fictional dataset

Oil <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Oil.csv"), sep = ",", stringsAsFactors = FALSE)

kable(Oil)
Month Category Caltex Gulf Mobil
Open Engine Oil 140 : 000 199 : 000 141 : 000
GearBox Oil 198 : 000 132 : 000 121 : 000
Jan Engine Oil 170 : 103 194 : 132 109 : 127
GearBox Oil 132 : 106 125 : 105 191 : 100
Feb Engine Oil 112 : 133 138 : 113 171 : 101
GearBox Oil 193 : 148 199 : 119 134 : 127
Mar Engine Oil 184 : 100 141 : 141 114 : 108
GearBox Oil 138 : 121 172 : 133 193 : 115
Apr Engine Oil 149 : 150 117 : 118 117 : 118
GearBox Oil 185 : 125 191 : 133 119 : 121
May Engine Oil 170 : 139 104 : 119 200 : 117
GearBox Oil 168 : 117 138 : 102 121 : 146
Jun Engine Oil 159 : 129 170 : 138 169 : 105
GearBox Oil 107 : 129 195 : 141 141 : 112

First, let’s separate the table into two tables, one for Engine Oil and one for Gearbox Oil. Engine Oil is on odd numbered rows, while Gearbox is on even numbered rows. The table for Gearbox Oil is missing its months, so we’ll add them in

EngRows <- seq(1, length(Oil[,1]),2)
GearRows <- seq(2, length(Oil[,1]),2)

EngineOil <- Oil[EngRows,1:2]
GearboxOil <- Oil[GearRows,1:2]
GearboxOil$Month <- EngineOil$Month
kable(EngineOil)
Month Category
1 Open Engine Oil
3 Jan Engine Oil
5 Feb Engine Oil
7 Mar Engine Oil
9 Apr Engine Oil
11 May Engine Oil
13 Jun Engine Oil
kable(GearboxOil)
Month Category
2 Open GearBox Oil
4 Jan GearBox Oil
6 Feb GearBox Oil
8 Mar GearBox Oil
10 Apr GearBox Oil
12 May GearBox Oil
14 Jun GearBox Oil

Now, we want to split up each column for each brand, into bought and consumed quantities. We’ll use the convention of the brand name followed by a P for purchased and C for consumed. So we’ll have CaltexP, CaltexC, GulfP, GulfC, MobilP, MobilC.

We’ll extract the purchased portion by simply using a regular expression to extract the first string of digits. To get the consumed portion, we’ll read in the last five characters of that string, and then extract the digits from that.

Since these are strings by default, we’ll have to cast them all to numerics.

#For Engine Oil

EngineOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[EngRows],"[:digit:]+"))
EngineOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[EngRows], -5),"[:digit:]+"))

EngineOil$MobilP <- as.numeric(str_extract(Oil$Mobil[EngRows],"[:digit:]+"))
EngineOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[EngRows], -5),"[:digit:]+"))

EngineOil$GulfP <- as.numeric(str_extract(Oil$Gulf[EngRows],"[:digit:]+"))
EngineOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[EngRows], -5),"[:digit:]+"))

kable(EngineOil)
Month Category CaltexP CaltexC MobilP MobilC GulfP GulfC
1 Open Engine Oil 140 0 141 0 199 0
3 Jan Engine Oil 170 103 109 127 194 132
5 Feb Engine Oil 112 133 171 101 138 113
7 Mar Engine Oil 184 100 114 108 141 141
9 Apr Engine Oil 149 150 117 118 117 118
11 May Engine Oil 170 139 200 117 104 119
13 Jun Engine Oil 159 129 169 105 170 138
#For Gearbox Oil

GearboxOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[GearRows],"[:digit:]+"))
GearboxOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[GearRows], -5),"[:digit:]+"))

GearboxOil$MobilP <- as.numeric(str_extract(Oil$Mobil[GearRows],"[:digit:]+"))
GearboxOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[GearRows], -5),"[:digit:]+"))

GearboxOil$GulfP <- as.numeric(str_extract(Oil$Gulf[GearRows],"[:digit:]+"))
GearboxOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[GearRows], -5),"[:digit:]+"))

kable(GearboxOil)
Month Category CaltexP CaltexC MobilP MobilC GulfP GulfC
2 Open GearBox Oil 198 0 121 0 132 0
4 Jan GearBox Oil 132 106 191 100 125 105
6 Feb GearBox Oil 193 148 134 127 199 119
8 Mar GearBox Oil 138 121 193 115 172 133
10 Apr GearBox Oil 185 125 119 121 191 133
12 May GearBox Oil 168 117 121 146 138 102
14 Jun GearBox Oil 107 129 141 112 195 141

Now, for our results.

First, how much of each kind of oil is left, for each brand?

That’s answered by the total purchased minus total consumed, for each brand and type of oil. We will have six answers.

CaltexB <- c(EngineOil = sum(EngineOil$CaltexP) - sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexP) - sum(GearboxOil$CaltexC))

MobilB <- c(EngineOil = sum(EngineOil$MobilP) - sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilP) - sum(GearboxOil$MobilC))
  
GulfB <- c(EngineOil = sum(EngineOil$GulfP) - sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfP) - sum(GearboxOil$GulfC))

OilBalance <- data.frame(CaltexB, MobilB, GulfB)

kable(OilBalance)
CaltexB MobilB GulfB
EngineOil 330 345 302
GearboxOil 375 299 419

Now let’s identify the most consumed brands, for each engine type. First, we’ll create a table of the consumption figures by brand and oil type. We’ll refer to this as TC, as in Total Consumed.

CaltexTC <- c(EngineOil = sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexC), Total = sum(EngineOil$CaltexC) + sum(GearboxOil$CaltexC))

MobilTC <- c(EngineOil = sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilC), Total = sum(EngineOil$MobilC) + sum(GearboxOil$MobilC))


GulfTC <- c(EngineOil = sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfC), Total = sum(EngineOil$GulfC) + sum(GearboxOil$GulfC))

OilTC <- data.frame(CaltexTC, MobilTC, GulfTC)

kable(OilTC)
CaltexTC MobilTC GulfTC
EngineOil 754 676 761
GearboxOil 746 721 733
Total 1500 1397 1494

We see that Caltex sold the most overall, as well as the most Gearbox Oil, while Gulf sold the most Engine Oil.

Now for the third case: Dan Smilowitz’s.

Generators in the NYS electricity market are paid based on the outcome of competitive Unforced Capacity (UCAP) auctions.

There are 3 auctions: .Monthly - takes place monthly before final values are known .Spot - takes place monthly once values are finalized .Strip - takes place twice a year (May and November) for the following six months

The prices for each auction are set in four different locations - NYC, Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS).

The past ~12.5 years worth of auction results are contained in a table, as depicted below:

All values are the price paid to generators in $/kW-month. I’ve attached the data in a messy .csv here.

Data source: New York Independent System Operator

The Analysis

Which month of the year sees the highest prices in each location?

What is the average difference between NYC and ROS prices?

Which calendar year saw the highest average price across regions (ignoring weighting)?

Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?

https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&forum_id=_1321839_1&nav=discussion_board&conf_id=_1299369_1&course_id=_1245245_1&message_id=_19008069_1#msg__19008069_1Id

Let’s read in the data, and then consolidate the two header rows into one.

generator <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Generator.csv"), sep = ",", stringsAsFactors =  FALSE, header = TRUE)

colnames(generator) <- str_c(colnames(generator), generator[1,], sep = "")
kable(head(generator))
AuctionLocation MonthlyNYC Monthly.1LHV Monthly.2LI Monthly.3ROS SpotNYC Spot.1LHV Spot.2LI Spot.3ROS StripNYC Strip.1LHV Strip.2LI Strip.3ROS
Location NYC LHV LI ROS NYC LHV LI ROS NYC LHV LI ROS
Nov-03 $6.67 $1.15 $0.50 $1.15 $6.98 $1.94 $8.14 $1.94 $6.55 $1.17 $4.00 $1.17
Dec-03 $6.67 $1.58 $5.00 $1.58 $6.98 $1.78 $8.22 $1.78 $6.55 $1.17 $4.00 $1.17
Jan-04 $6.67 $1.65 $8.10 $1.65 $6.98 $1.75 $7.99 $1.75 $6.55 $1.17 $4.00 $1.17
Feb-04 $6.95 $1.67 $7.50 $1.67 $6.98 $1.73 $7.08 $1.73 $6.55 $1.17 $4.00 $1.17
Mar-04 $6.25 $1.65 $7.00 $1.65 $6.98 $1.00 $7.72 $1.00 $6.55 $1.17 $4.00 $1.17

Some of those header names are ugly, with a redundant period and number in them. Let’s remove them.

colnames(generator) <- str_replace_all(colnames(generator), "\\.[:digit:]","")
colnames(generator) <- str_replace(colnames(generator), "AuctionLocation", "Month")
generator <- generator[-1,]

kable(head(generator))
Month MonthlyNYC MonthlyLHV MonthlyLI MonthlyROS SpotNYC SpotLHV SpotLI SpotROS StripNYC StripLHV StripLI StripROS
2 Nov-03 $6.67 $1.15 $0.50 $1.15 $6.98 $1.94 $8.14 $1.94 $6.55 $1.17 $4.00 $1.17
3 Dec-03 $6.67 $1.58 $5.00 $1.58 $6.98 $1.78 $8.22 $1.78 $6.55 $1.17 $4.00 $1.17
4 Jan-04 $6.67 $1.65 $8.10 $1.65 $6.98 $1.75 $7.99 $1.75 $6.55 $1.17 $4.00 $1.17
5 Feb-04 $6.95 $1.67 $7.50 $1.67 $6.98 $1.73 $7.08 $1.73 $6.55 $1.17 $4.00 $1.17
6 Mar-04 $6.25 $1.65 $7.00 $1.65 $6.98 $1.00 $7.72 $1.00 $6.55 $1.17 $4.00 $1.17
7 Apr-04 $6.25 $0.99 $6.85 $0.99 $6.98 $0.80 $7.04 $0.80 $6.55 $1.17 $4.00 $1.17

Next, let’s separate the date column into two columns, months and years; we’ll also scrub the $ signs and cast the non-date contents to numerics.

Year <- as.numeric(unlist(str_extract(generator[,1],"[:digit:]+")))

generator$Month <- str_replace_all(generator$Month,"\\-[:digit:]+","")
generator$Year <- Year
generator <- generator[,c("Year", "Month",colnames(generator)[2:13])]
pricecols <- length(colnames(generator))
for (i in 3:pricecols) {
  generator[,i] <- as.numeric(str_replace_all(generator[,i],"\\$",""))
}
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
kable(head(generator))
Year Month MonthlyNYC MonthlyLHV MonthlyLI MonthlyROS SpotNYC SpotLHV SpotLI SpotROS StripNYC StripLHV StripLI StripROS
2 3 Nov 6.67 1.15 0.50 1.15 6.98 1.94 8.14 1.94 6.55 1.17 4 1.17
3 3 Dec 6.67 1.58 5.00 1.58 6.98 1.78 8.22 1.78 6.55 1.17 4 1.17
4 4 Jan 6.67 1.65 8.10 1.65 6.98 1.75 7.99 1.75 6.55 1.17 4 1.17
5 4 Feb 6.95 1.67 7.50 1.67 6.98 1.73 7.08 1.73 6.55 1.17 4 1.17
6 4 Mar 6.25 1.65 7.00 1.65 6.98 1.00 7.72 1.00 6.55 1.17 4 1.17
7 4 Apr 6.25 0.99 6.85 0.99 6.98 0.80 7.04 0.80 6.55 1.17 4 1.17

Now our dataset is clean and ready for analysis.

For the first question:

“Which month of the year sees the highest prices in each location?”

We’ll use the ddply function to compute averages for each monthly cost and location. Then we find the highest average for each location, which happens to be September for all locations.

AvgByMonth <- ddply(generator,~Month,summarise,meanNYC=mean(MonthlyNYC, na.rm = TRUE), meanLHV = mean(MonthlyLHV, na.rm = TRUE), meanLI = mean(MonthlyLI, na.rm = TRUE), meanROS = mean(MonthlyROS, na.rm = TRUE))

MaxMonth <- apply(AvgByMonth, 2, function(x) max(x, na.rm = TRUE))

Question 2: “What is the average difference between NYC and ROS prices?”

For variety, we will now use the spot prices.

NYC_ROS <- generator[ , c(1:2)]
NYC_ROS$Diff <- generator$SpotNYC - generator$SpotROS
kable(head(NYC_ROS))
Year Month Diff
2 3 Nov 5.04
3 3 Dec 5.20
4 4 Jan 5.23
5 4 Feb 5.25
6 4 Mar 5.98
7 4 Apr 6.18
mean(NYC_ROS$Diff, na.rm = TRUE)
## [1] 6.66255

We get a value of a $6.67 difference in the spot price, between NYC and the rest of the state.


Question 3: Which calendar year saw the highest average price across regions (ignoring weighting)?

First, let’s compute an average across the regions. We’ll use the monthly price now. And since our information for the first and last years, 03 and 16, are incomplete, we will exclude those years from our analysis.

MaxYearNY <- generator[ , c(1:2)] 
MaxYearNY$Mean <- rowMeans(generator[ , 3:6])
MaxYearNY <- subset(MaxYearNY, (Year > 3) & (Year < 16))
MaxYearNY$Year <- MaxYearNY$Year + 2000
kable(head(MaxYearNY))
Year Month Mean
4 2004 Jan 4.5175
5 2004 Feb 4.4475
6 2004 Mar 4.1375
7 2004 Apr 3.7700
8 2004 May 5.6150
9 2004 Jun 5.9250

Now, we’ll use the ddply function to group our monthly averages by year, and then the apply function to find the max within that.

YearMeansNY <- ddply(MaxYearNY,~Year,summarise, meanYear = mean(Mean))

MaxYearNY <- which.max(YearMeansNY$meanYear)
YearMeansNY$Year[MaxYearNY]
## [1] 2014

2014 had the highest average across all of NY state (using an unweighted average).

Question 4: Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?

Let’s compute the standard deviation for each column. First, we’ll remove the last row because it has some null values

generator1 <- generator[c(1:49, 51:149),1:10]
apply(generator1, 2, function(x) sd(x))
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm
## = na.rm): NAs introduced by coercion
##       Year      Month MonthlyNYC MonthlyLHV  MonthlyLI MonthlyROS 
##   3.614458         NA   4.351853   2.771195   2.455219   1.542416 
##    SpotNYC    SpotLHV     SpotLI    SpotROS 
##   4.486423   2.793636   3.050375   1.625278

We see that for each location, the spot has a higher SD than the monthly price. Therefore, the spot is more volatile - although the difference is generally small.