Project 2

This project was done in consultation with Chris Martin, who partnered with me on the first project. I will be using the same library and cases as him, but I will do my coding independently.

First, we’ll import the libraries required.

library("plyr")
library("knitr")
library("stringr")
library("tidyr")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For our first example, we will use Chris Martin’s auction data. He describes it here: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&forum_id=_1321839_1&nav=discussion_board&conf_id=_1299369_1&course_id=_1245245_1&message_id=_18955095_1#msg__18955095_1Id

We are told we must find the % difference of an estimate and the sale price; there are two estimates, a high and a low estimate.

The two % differences we must find are:

(Price - High Estimate) / High Estimate
(Price - Low Estimate) / Low Estimate

Looking at the data here - https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv - we see that columns are separated by commas, so we’ll use that to splice up our data table. I’ve reproduced the data on my own Github.

auction <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Auction.csv"), sep = ",", stringsAsFactors = FALSE)

We can use a regular expression to extract the low range values. The regular expression is simply - “take the first numeric string in the range field.” For the first row, that means taking “300” from" “$300 - $400”.

auction$LowEst <- as.numeric(str_extract(auction[,3],"[:digit:]+"))

Now let’s do the same for the high end of the range. The extract function we used for the low end of the range won’t work, because it starts from the left; using it again will give us the same result.

Instead, we use the str_sub function to draw in a sub string; by entering a negative value, we extract the n rightmost values. Here, we use 5.

Then we use the ‘extract the first set of digits’ command as we had with the low estimate.

auction$HighEst <- as.numeric(str_extract(str_sub(auction[,3], -5), "[:digit:]+"))

Now let’s take a look at our auction data

kable(auction)

Lot	Description	Range	Price	LowEst	HighEst
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	$283	300	400
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	$234	80	120
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	$357	200	300

kable(auction)

Lot	Description	Range	Price	LowEst	HighEst
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	$283	300	400
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	$234	80	120
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	$357	200	300

The $ signs in our price column may complicate computations later, so let’s remove them.

auction$Price <- as.numeric(str_replace_all(auction[,4], "\\$",""))

Now, let’s do our computations for the low and high estimates

auction$LowEstDiff <- str_c(round(100*(auction$LowEst - auction$Price)/auction$Price,0),"%")
auction$HighEstDiff <- str_c(round(100*(auction$HighEst - auction$Price)/auction$Price,0),"%")
kable(auction)

Lot	Description	Range	Price	LowEst	HighEst	LowEstDiff	HighEstDiff
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	283	300	400	6%	41%
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	234	80	120	-66%	-49%
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	357	200	300	-44%	-16%

In sum, we’ve identified the percent difference between the estimates and the actual selling price.

Case 2: Oil

Kishore Prasad uploaded a fictional dataset

Oil <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Oil.csv"), sep = ",", stringsAsFactors = FALSE)

kable(Oil)

Month	Category	Caltex	Gulf	Mobil
Open	Engine Oil	140 : 000	199 : 000	141 : 000
	GearBox Oil	198 : 000	132 : 000	121 : 000
Jan	Engine Oil	170 : 103	194 : 132	109 : 127
	GearBox Oil	132 : 106	125 : 105	191 : 100
Feb	Engine Oil	112 : 133	138 : 113	171 : 101
	GearBox Oil	193 : 148	199 : 119	134 : 127
Mar	Engine Oil	184 : 100	141 : 141	114 : 108
	GearBox Oil	138 : 121	172 : 133	193 : 115
Apr	Engine Oil	149 : 150	117 : 118	117 : 118
	GearBox Oil	185 : 125	191 : 133	119 : 121
May	Engine Oil	170 : 139	104 : 119	200 : 117
	GearBox Oil	168 : 117	138 : 102	121 : 146
Jun	Engine Oil	159 : 129	170 : 138	169 : 105
	GearBox Oil	107 : 129	195 : 141	141 : 112

First, let’s separate the table into two tables, one for Engine Oil and one for Gearbox Oil. Engine Oil is on odd numbered rows, while Gearbox is on even numbered rows. The table for Gearbox Oil is missing its months, so we’ll add them in

EngRows <- seq(1, length(Oil[,1]),2)
GearRows <- seq(2, length(Oil[,1]),2)

EngineOil <- Oil[EngRows,1:2]
GearboxOil <- Oil[GearRows,1:2]
GearboxOil$Month <- EngineOil$Month
kable(EngineOil)

	Month	Category
1	Open	Engine Oil
3	Jan	Engine Oil
5	Feb	Engine Oil
7	Mar	Engine Oil
9	Apr	Engine Oil
11	May	Engine Oil
13	Jun	Engine Oil

kable(GearboxOil)

	Month	Category
2	Open	GearBox Oil
4	Jan	GearBox Oil
6	Feb	GearBox Oil
8	Mar	GearBox Oil
10	Apr	GearBox Oil
12	May	GearBox Oil
14	Jun	GearBox Oil

Now, we want to split up each column for each brand, into bought and consumed quantities. We’ll use the convention of the brand name followed by a P for purchased and C for consumed. So we’ll have CaltexP, CaltexC, GulfP, GulfC, MobilP, MobilC.

We’ll extract the purchased portion by simply using a regular expression to extract the first string of digits. To get the consumed portion, we’ll read in the last five characters of that string, and then extract the digits from that.

Since these are strings by default, we’ll have to cast them all to numerics.

#For Engine Oil

EngineOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[EngRows],"[:digit:]+"))
EngineOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[EngRows], -5),"[:digit:]+"))

EngineOil$MobilP <- as.numeric(str_extract(Oil$Mobil[EngRows],"[:digit:]+"))
EngineOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[EngRows], -5),"[:digit:]+"))

EngineOil$GulfP <- as.numeric(str_extract(Oil$Gulf[EngRows],"[:digit:]+"))
EngineOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[EngRows], -5),"[:digit:]+"))

kable(EngineOil)

	Month	Category	CaltexP	CaltexC	MobilP	MobilC	GulfP	GulfC
1	Open	Engine Oil	140	0	141	0	199	0
3	Jan	Engine Oil	170	103	109	127	194	132
5	Feb	Engine Oil	112	133	171	101	138	113
7	Mar	Engine Oil	184	100	114	108	141	141
9	Apr	Engine Oil	149	150	117	118	117	118
11	May	Engine Oil	170	139	200	117	104	119
13	Jun	Engine Oil	159	129	169	105	170	138

#For Gearbox Oil

GearboxOil$CaltexP <- as.numeric(str_extract(Oil$Caltex[GearRows],"[:digit:]+"))
GearboxOil$CaltexC <- as.numeric(str_extract(str_sub(Oil$Caltex[GearRows], -5),"[:digit:]+"))

GearboxOil$MobilP <- as.numeric(str_extract(Oil$Mobil[GearRows],"[:digit:]+"))
GearboxOil$MobilC <- as.numeric(str_extract(str_sub(Oil$Mobil[GearRows], -5),"[:digit:]+"))

GearboxOil$GulfP <- as.numeric(str_extract(Oil$Gulf[GearRows],"[:digit:]+"))
GearboxOil$GulfC <- as.numeric(str_extract(str_sub(Oil$Gulf[GearRows], -5),"[:digit:]+"))

kable(GearboxOil)

	Month	Category	CaltexP	CaltexC	MobilP	MobilC	GulfP	GulfC
2	Open	GearBox Oil	198	0	121	0	132	0
4	Jan	GearBox Oil	132	106	191	100	125	105
6	Feb	GearBox Oil	193	148	134	127	199	119
8	Mar	GearBox Oil	138	121	193	115	172	133
10	Apr	GearBox Oil	185	125	119	121	191	133
12	May	GearBox Oil	168	117	121	146	138	102
14	Jun	GearBox Oil	107	129	141	112	195	141

Now, for our results.

First, how much of each kind of oil is left, for each brand?

That’s answered by the total purchased minus total consumed, for each brand and type of oil. We will have six answers.

CaltexB <- c(EngineOil = sum(EngineOil$CaltexP) - sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexP) - sum(GearboxOil$CaltexC))

MobilB <- c(EngineOil = sum(EngineOil$MobilP) - sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilP) - sum(GearboxOil$MobilC))
  
GulfB <- c(EngineOil = sum(EngineOil$GulfP) - sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfP) - sum(GearboxOil$GulfC))

OilBalance <- data.frame(CaltexB, MobilB, GulfB)

kable(OilBalance)

	CaltexB	MobilB	GulfB
EngineOil	330	345	302
GearboxOil	375	299	419

Now let’s identify the most consumed brands, for each engine type. First, we’ll create a table of the consumption figures by brand and oil type. We’ll refer to this as TC, as in Total Consumed.

CaltexTC <- c(EngineOil = sum(EngineOil$CaltexC), GearboxOil = sum(GearboxOil$CaltexC), Total = sum(EngineOil$CaltexC) + sum(GearboxOil$CaltexC))

MobilTC <- c(EngineOil = sum(EngineOil$MobilC), GearboxOil = sum(GearboxOil$MobilC), Total = sum(EngineOil$MobilC) + sum(GearboxOil$MobilC))


GulfTC <- c(EngineOil = sum(EngineOil$GulfC), GearboxOil = sum(GearboxOil$GulfC), Total = sum(EngineOil$GulfC) + sum(GearboxOil$GulfC))

OilTC <- data.frame(CaltexTC, MobilTC, GulfTC)

kable(OilTC)

	CaltexTC	MobilTC	GulfTC
EngineOil	754	676	761
GearboxOil	746	721	733
Total	1500	1397	1494

We see that Caltex sold the most overall, as well as the most Gearbox Oil, while Gulf sold the most Engine Oil.

Now for the third case: Dan Smilowitz’s.

Generators in the NYS electricity market are paid based on the outcome of competitive Unforced Capacity (UCAP) auctions.

There are 3 auctions: .Monthly - takes place monthly before final values are known .Spot - takes place monthly once values are finalized .Strip - takes place twice a year (May and November) for the following six months

The prices for each auction are set in four different locations - NYC, Long Island (LI), Lower Hudson Valley (LHV), and Rest of State (ROS).

The past ~12.5 years worth of auction results are contained in a table, as depicted below:

All values are the price paid to generators in $/kW-month. I’ve attached the data in a messy .csv here.

Data source: New York Independent System Operator

The Analysis

Which month of the year sees the highest prices in each location?

What is the average difference between NYC and ROS prices?

Which calendar year saw the highest average price across regions (ignoring weighting)?

Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?

https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&forum_id=_1321839_1&nav=discussion_board&conf_id=_1299369_1&course_id=_1245245_1&message_id=_19008069_1#msg__19008069_1Id

Let’s read in the data, and then consolidate the two header rows into one.

generator <- read.csv(url("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week6/Generator.csv"), sep = ",", stringsAsFactors =  FALSE, header = TRUE)

colnames(generator) <- str_c(colnames(generator), generator[1,], sep = "")
kable(head(generator))

AuctionLocation	MonthlyNYC	Monthly.1LHV	Monthly.2LI	Monthly.3ROS	SpotNYC	Spot.1LHV	Spot.2LI	Spot.3ROS	StripNYC	Strip.1LHV	Strip.2LI	Strip.3ROS
Location	NYC	LHV	LI	ROS	NYC	LHV	LI	ROS	NYC	LHV	LI	ROS
Nov-03	$6.67	$1.15	$0.50	$1.15	$6.98	$1.94	$8.14	$1.94	$6.55	$1.17	$4.00	$1.17
Dec-03	$6.67	$1.58	$5.00	$1.58	$6.98	$1.78	$8.22	$1.78	$6.55	$1.17	$4.00	$1.17
Jan-04	$6.67	$1.65	$8.10	$1.65	$6.98	$1.75	$7.99	$1.75	$6.55	$1.17	$4.00	$1.17
Feb-04	$6.95	$1.67	$7.50	$1.67	$6.98	$1.73	$7.08	$1.73	$6.55	$1.17	$4.00	$1.17
Mar-04	$6.25	$1.65	$7.00	$1.65	$6.98	$1.00	$7.72	$1.00	$6.55	$1.17	$4.00	$1.17

Some of those header names are ugly, with a redundant period and number in them. Let’s remove them.

colnames(generator) <- str_replace_all(colnames(generator), "\\.[:digit:]","")
colnames(generator) <- str_replace(colnames(generator), "AuctionLocation", "Month")
generator <- generator[-1,]

kable(head(generator))

	Month	MonthlyNYC	MonthlyLHV	MonthlyLI	MonthlyROS	SpotNYC	SpotLHV	SpotLI	SpotROS	StripNYC	StripLHV	StripLI	StripROS
2	Nov-03	$6.67	$1.15	$0.50	$1.15	$6.98	$1.94	$8.14	$1.94	$6.55	$1.17	$4.00	$1.17
3	Dec-03	$6.67	$1.58	$5.00	$1.58	$6.98	$1.78	$8.22	$1.78	$6.55	$1.17	$4.00	$1.17
4	Jan-04	$6.67	$1.65	$8.10	$1.65	$6.98	$1.75	$7.99	$1.75	$6.55	$1.17	$4.00	$1.17
5	Feb-04	$6.95	$1.67	$7.50	$1.67	$6.98	$1.73	$7.08	$1.73	$6.55	$1.17	$4.00	$1.17
6	Mar-04	$6.25	$1.65	$7.00	$1.65	$6.98	$1.00	$7.72	$1.00	$6.55	$1.17	$4.00	$1.17
7	Apr-04	$6.25	$0.99	$6.85	$0.99	$6.98	$0.80	$7.04	$0.80	$6.55	$1.17	$4.00	$1.17

Next, let’s separate the date column into two columns, months and years; we’ll also scrub the $ signs and cast the non-date contents to numerics.

Year <- as.numeric(unlist(str_extract(generator[,1],"[:digit:]+")))

generator$Month <- str_replace_all(generator$Month,"\\-[:digit:]+","")
generator$Year <- Year
generator <- generator[,c("Year", "Month",colnames(generator)[2:13])]
pricecols <- length(colnames(generator))
for (i in 3:pricecols) {
  generator[,i] <- as.numeric(str_replace_all(generator[,i],"\\$",""))
}

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

kable(head(generator))

	Year	Month	MonthlyNYC	MonthlyLHV	MonthlyLI	MonthlyROS	SpotNYC	SpotLHV	SpotLI	SpotROS	StripNYC	StripLHV	StripLI	StripROS
2	3	Nov	6.67	1.15	0.50	1.15	6.98	1.94	8.14	1.94	6.55	1.17	4	1.17
3	3	Dec	6.67	1.58	5.00	1.58	6.98	1.78	8.22	1.78	6.55	1.17	4	1.17
4	4	Jan	6.67	1.65	8.10	1.65	6.98	1.75	7.99	1.75	6.55	1.17	4	1.17
5	4	Feb	6.95	1.67	7.50	1.67	6.98	1.73	7.08	1.73	6.55	1.17	4	1.17
6	4	Mar	6.25	1.65	7.00	1.65	6.98	1.00	7.72	1.00	6.55	1.17	4	1.17
7	4	Apr	6.25	0.99	6.85	0.99	6.98	0.80	7.04	0.80	6.55	1.17	4	1.17

Now our dataset is clean and ready for analysis.

For the first question:

“Which month of the year sees the highest prices in each location?”

We’ll use the ddply function to compute averages for each monthly cost and location. Then we find the highest average for each location, which happens to be September for all locations.

AvgByMonth <- ddply(generator,~Month,summarise,meanNYC=mean(MonthlyNYC, na.rm = TRUE), meanLHV = mean(MonthlyLHV, na.rm = TRUE), meanLI = mean(MonthlyLI, na.rm = TRUE), meanROS = mean(MonthlyROS, na.rm = TRUE))

MaxMonth <- apply(AvgByMonth, 2, function(x) max(x, na.rm = TRUE))

Question 2: “What is the average difference between NYC and ROS prices?”

For variety, we will now use the spot prices.

NYC_ROS <- generator[ , c(1:2)]
NYC_ROS$Diff <- generator$SpotNYC - generator$SpotROS
kable(head(NYC_ROS))

	Year	Month	Diff
2	3	Nov	5.04
3	3	Dec	5.20
4	4	Jan	5.23
5	4	Feb	5.25
6	4	Mar	5.98
7	4	Apr	6.18

mean(NYC_ROS$Diff, na.rm = TRUE)

## [1] 6.66255

We get a value of a $6.67 difference in the spot price, between NYC and the rest of the state.

Question 3: Which calendar year saw the highest average price across regions (ignoring weighting)?

First, let’s compute an average across the regions. We’ll use the monthly price now. And since our information for the first and last years, 03 and 16, are incomplete, we will exclude those years from our analysis.

MaxYearNY <- generator[ , c(1:2)] 
MaxYearNY$Mean <- rowMeans(generator[ , 3:6])
MaxYearNY <- subset(MaxYearNY, (Year > 3) & (Year < 16))
MaxYearNY$Year <- MaxYearNY$Year + 2000
kable(head(MaxYearNY))

	Year	Month	Mean
4	2004	Jan	4.5175
5	2004	Feb	4.4475
6	2004	Mar	4.1375
7	2004	Apr	3.7700
8	2004	May	5.6150
9	2004	Jun	5.9250

Now, we’ll use the ddply function to group our monthly averages by year, and then the apply function to find the max within that.

YearMeansNY <- ddply(MaxYearNY,~Year,summarise, meanYear = mean(Mean))

MaxYearNY <- which.max(YearMeansNY$meanYear)
YearMeansNY$Year[MaxYearNY]

## [1] 2014

2014 had the highest average across all of NY state (using an unweighted average).

Question 4: Is the monthly auction or the spot auction more volatile (i.e. which has the most variability in pricing)?

Let’s compute the standard deviation for each column. First, we’ll remove the last row because it has some null values

generator1 <- generator[c(1:49, 51:149),1:10]
apply(generator1, 2, function(x) sd(x))

## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm
## = na.rm): NAs introduced by coercion

##       Year      Month MonthlyNYC MonthlyLHV  MonthlyLI MonthlyROS 
##   3.614458         NA   4.351853   2.771195   2.455219   1.542416 
##    SpotNYC    SpotLHV     SpotLI    SpotROS 
##   4.486423   2.793636   3.050375   1.625278

We see that for each location, the spot has a higher SD than the monthly price. Therefore, the spot is more volatile - although the difference is generally small.

Project 2

Asher Meyers

March 13, 2016