DATA 607 Project 2: Preparing Datasets

Generator Capacity Prices (Dan Smilowitz)

Reading the Data

The data is loaded from a csv. The first two rows of the file have the headers – they are read in, combined, and set as the names:

UCAP         <- read.csv('./UCAP.csv', stringsAsFactors = FALSE, header = FALSE, skip = 2)
UCAP_headers <- read.csv('./UCAP.csv', stringsAsFactors = FALSE, header = FALSE, nrows = 2)
UCAP_names   <- c("Month_Year", rep("", length(UCAP_headers)-1))

for (i in 2:length(UCAP_headers)) {
  UCAP_names[i] = paste(UCAP_headers[1, i], UCAP_headers[2, i], sep = "-")
}

names(UCAP) <- UCAP_names

Month_Year	Monthly-NYC	Monthly-LHV	Monthly-LI	Monthly-ROS	Spot-NYC	Spot-LHV	Spot-LI	Spot-ROS	Strip-NYC	Strip-LHV	Strip-LI	Strip-ROS
November 2003	$6.67	$1.15	$0.50	$1.15	$6.98	$1.94	$8.14	$1.94	$6.55	$1.17	$4.00	$1.17
December 2003	$6.67	$1.58	$5.00	$1.58	$6.98	$1.78	$8.22	$1.78	$6.55	$1.17	$4.00	$1.17
January 2004	$6.67	$1.65	$8.10	$1.65	$6.98	$1.75	$7.99	$1.75	$6.55	$1.17	$4.00	$1.17
February 2004	$6.95	$1.67	$7.50	$1.67	$6.98	$1.73	$7.08	$1.73	$6.55	$1.17	$4.00	$1.17
March 2004	$6.25	$1.65	$7.00	$1.65	$6.98	$1.00	$7.72	$1.00	$6.55	$1.17	$4.00	$1.17
April 2004	$6.25	$0.99	$6.85	$0.99	$6.98	$0.80	$7.04	$0.80	$6.55	$1.17	$4.00	$1.17

Tidying the Data

The dataset is converted to a tiny format with each row representing an observation:

UCAP <- UCAP %>% gather(Auction_Location, Price, 2:13)

Month_Year	Auction_Location	Price
November 2003	Monthly-NYC	$6.67
December 2003	Monthly-NYC	$6.67
January 2004	Monthly-NYC	$6.67
February 2004	Monthly-NYC	$6.95
March 2004	Monthly-NYC	$6.25
April 2004	Monthly-NYC	$6.25

The first two columns contain two variables each, and the third column is stored as a string. The variables are separated using regular expressions and the stringr package, and the prices are converted to values.

UCAP$Year     <- str_extract(UCAP$Month_Year, "[[:digit:]]{4}")
UCAP$Month    <- str_extract(UCAP$Month_Year, "[[:alpha:]]+")
UCAP$Auction  <- str_sub(UCAP$Auction_Location, 1, str_locate(UCAP$Auction_Location, "-")[, 1] - 1)
UCAP$Location <- str_sub(UCAP$Auction_Location, str_locate(UCAP$Auction_Location, "-")[, 1] + 1)
UCAP$Price    <- extract_numeric(UCAP$Price)

UCAP          <- UCAP %>% select(Year, Month, Auction, Location, Price)

Year	Month	Auction	Location	Price
2003	November	Monthly	NYC	6.67
2003	December	Monthly	NYC	6.67
2004	January	Monthly	NYC	6.67
2004	February	Monthly	NYC	6.95
2004	March	Monthly	NYC	6.25
2004	April	Monthly	NYC	6.25

Analyzing the Data

Which month of the year sees the highest prices in each location?

The average price is calculated by location and month. With the monthly means for each location sorted in descending order, returning every 12th row will return the highest value for each location:

UCAP_month <- UCAP %>% 
  group_by(Location, Month) %>%
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(Mean))

UCAP_month[seq(1, nrow(UCAP_month), 12), 1:2]

## Source: local data frame [4 x 2]
## Groups: Location [4]
## 
##   Location Month
##      (chr) (chr)
## 1      LHV  June
## 2       LI   May
## 3      NYC   May
## 4      ROS  June

What is the average difference between NYC and ROS prices?

UCAP_diff <- UCAP %>%
  group_by(Location) %>%
  filter(Location == "NYC" | Location == "ROS") %>%
  summarize(Mean=mean(Price, na.rm = TRUE))

UCAP_diff[1, 2] - UCAP_diff[2,2]

##      Mean
## 1 6.72846

Which calendar year saw the highest average price across regions?

UCAP %>%
  group_by(Year) %>%
  summarize(Mean = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(Mean)) %>%
  top_n(3)

## Source: local data frame [3 x 2]
## 
##    Year     Mean
##   (chr)    (dbl)
## 1  2014 7.532708
## 2  2015 6.254306
## 3  2013 5.695833

Is the monthly auction or the spot auction more volatile?

UCAP %>%
  group_by(Auction) %>%
  filter(Auction == "Monthly" | Auction == "Spot") %>%
  summarize(Variance = var(Price, na.rm = TRUE)) %>%
  arrange(desc(Variance))

## Source: local data frame [2 x 2]
## 
##   Auction Variance
##     (chr)    (dbl)
## 1    Spot 16.66132
## 2 Monthly 15.79204

The spot auction is more volatile than the monthly auction, as it has a higher variance.

Untidy Auction Data (Christopher Martin)

Reading the Data

The data is loaded from a csv.

auction <- read.csv('./untidy_auction_data.csv', stringsAsFactors = FALSE, header = TRUE)

Lot	Description	Range	Price
2001	New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w.	$300 - $400	$283
2002	Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w.	$80 - $120	$234
2003	Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w.	$200 - $300	$357

Tidying the Data

The item descriptions are not needed for this analysis – they are stored in a separate table with the lot number to allow for join functions in dplyr if needed for further analysis.

lot_info <- auction %>% select(Lot, Description)
auction  <- auction %>% select(Lot, Range, Price)

Lot	Range	Price
2001	$300 - $400	$283
2002	$80 - $120	$234
2003	$200 - $300	$357

The auction dataset is tidied, including the creation of two separate values for estimates (Low and High):

auction$Low  <- str_trim(str_sub(auction$Range, 1, str_locate(auction$Range, "-")[, 1] - 1))
auction$High <- str_trim(str_sub(auction$Range, str_locate(auction$Range, "-")[, 1] + 1))
auction <- auction %>% select(Lot, Low, High, Price)

auction <- auction %>% gather(Value, Price, -Lot) %>% arrange(Lot, Value)
auction$Price <- extract_numeric(auction$Price)

Lot	Value	Price
2001	High	400
2001	Low	300
2001	Price	283
2002	High	120
2002	Low	80
2002	Price	234
2003	High	300
2003	Low	200
2003	Price	357

Analyzing the Data

What is the percent different between estimates and sale prices?

The requested analysis asked for the total percent difference between estimates and sale prices. Thus, the data are summarized across lot numbers before calculation.

auction <- auction %>% group_by(Value) %>% summarize(Sum = sum(Price))

HighDiff <- round((auction[auction$Value == "High", 2] - auction[auction$Value == "Price", 2]) / 
                    auction[auction$Value == "Price", 2], 4) * 100
LowDiff <- round((auction[auction$Value == "Low", 2] - auction[auction$Value == "Price", 2]) / 
                   auction[auction$Value == "Price", 2], 4) * 100

The stated problem suggested calculating the percent difference with the estimate as the denominator; however, using the sale price as the denominator is likely more accurate, as it compares to the actual value rather than the predicted value.

The difference between the low estimates and sale prices is -33.64%, and the difference between high estimates and sale prices is -6.18%. In both cases, the negative numbers indicate that the estimates underpredict the actual sale prices.

Distribution of Wealth (Ken Markus)

Reading the Data

The data is loaded from a csv. The first row of the file contains the chart title, and the second row has column names with a number of spaces — as a result, the first two rows will be skipped and the column names entered manually.

wealth <- read.csv('./wealth.csv', stringsAsFactors = FALSE, header = FALSE, skip = 2)
names(wealth) <- c('Year', 'Top1', 'Next19', 'Bottom80')

Year	Top1	Next19	Bottom80
1983	42.90%	48.40%	8.70%
1989	46.90%	46.50%	6.60%
1992	45.60%	46.70%	7.70%
1995	47.20%	45.90%	7.00%
1998	47.30%	43.60%	9.10%
2001	39.70%	51.50%	8.70%

Tidying the Data

Because of the percentage signs in the data, the values are read in as strings. This is corrected, and the numbers are converted to decimal form.

for (i in 2:4) {
  wealth[, i] <- extract_numeric(wealth[, i]) / 100
}

The data is then converted to a “long” data set to allow for easier analysis.

wealth <- wealth %>% gather(Tier, Share, -Year)

Year	Tier	Share
1983	Top1	0.429
1989	Top1	0.469
1992	Top1	0.456
1995	Top1	0.472
1998	Top1	0.473
2001	Top1	0.397

Analyzing the Data

How has the distribution of wealth changed over time?

To analyze the change in distribution of wealth, the values for each group are plotted over time:

From this chart, it is clear that the share of total wealth of the “Next 19%” has increased, surpassing taht of the “Top 1%”" some time between 1998 and 2001. The overall share of these two groups has increased, as the share of the “Bottom 80%” has decreased from almost 10% in 1983 to under 5% in 2010.

DATA 607 Project 2: Preparing Datasets

Dan Smilowitz

March 13, 2016

Generator Capacity Prices (Dan Smilowitz)

Reading the Data

Tidying the Data

Analyzing the Data

Which month of the year sees the highest prices in each location?

What is the average difference between NYC and ROS prices?

Which calendar year saw the highest average price across regions?

Is the monthly auction or the spot auction more volatile?

Untidy Auction Data (Christopher Martin)

Reading the Data

Tidying the Data

Analyzing the Data

What is the percent different between estimates and sale prices?

Distribution of Wealth (Ken Markus)

Reading the Data

Tidying the Data

Analyzing the Data

How has the distribution of wealth changed over time?