The data is loaded from a csv. The first two rows of the file have the headers – they are read in, combined, and set as the names:
UCAP <- read.csv('./UCAP.csv', stringsAsFactors = FALSE, header = FALSE, skip = 2)
UCAP_headers <- read.csv('./UCAP.csv', stringsAsFactors = FALSE, header = FALSE, nrows = 2)
UCAP_names <- c("Month_Year", rep("", length(UCAP_headers)-1))
for (i in 2:length(UCAP_headers)) {
UCAP_names[i] = paste(UCAP_headers[1, i], UCAP_headers[2, i], sep = "-")
}
names(UCAP) <- UCAP_names| Month_Year | Monthly-NYC | Monthly-LHV | Monthly-LI | Monthly-ROS | Spot-NYC | Spot-LHV | Spot-LI | Spot-ROS | Strip-NYC | Strip-LHV | Strip-LI | Strip-ROS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| November 2003 | $6.67 | $1.15 | $0.50 | $1.15 | $6.98 | $1.94 | $8.14 | $1.94 | $6.55 | $1.17 | $4.00 | $1.17 |
| December 2003 | $6.67 | $1.58 | $5.00 | $1.58 | $6.98 | $1.78 | $8.22 | $1.78 | $6.55 | $1.17 | $4.00 | $1.17 |
| January 2004 | $6.67 | $1.65 | $8.10 | $1.65 | $6.98 | $1.75 | $7.99 | $1.75 | $6.55 | $1.17 | $4.00 | $1.17 |
| February 2004 | $6.95 | $1.67 | $7.50 | $1.67 | $6.98 | $1.73 | $7.08 | $1.73 | $6.55 | $1.17 | $4.00 | $1.17 |
| March 2004 | $6.25 | $1.65 | $7.00 | $1.65 | $6.98 | $1.00 | $7.72 | $1.00 | $6.55 | $1.17 | $4.00 | $1.17 |
| April 2004 | $6.25 | $0.99 | $6.85 | $0.99 | $6.98 | $0.80 | $7.04 | $0.80 | $6.55 | $1.17 | $4.00 | $1.17 |
The dataset is converted to a tiny format with each row representing an observation:
UCAP <- UCAP %>% gather(Auction_Location, Price, 2:13)| Month_Year | Auction_Location | Price |
|---|---|---|
| November 2003 | Monthly-NYC | $6.67 |
| December 2003 | Monthly-NYC | $6.67 |
| January 2004 | Monthly-NYC | $6.67 |
| February 2004 | Monthly-NYC | $6.95 |
| March 2004 | Monthly-NYC | $6.25 |
| April 2004 | Monthly-NYC | $6.25 |
The first two columns contain two variables each, and the third column is stored as a string. The variables are separated using regular expressions and the stringr package, and the prices are converted to values.
UCAP$Year <- str_extract(UCAP$Month_Year, "[[:digit:]]{4}")
UCAP$Month <- str_extract(UCAP$Month_Year, "[[:alpha:]]+")
UCAP$Auction <- str_sub(UCAP$Auction_Location, 1, str_locate(UCAP$Auction_Location, "-")[, 1] - 1)
UCAP$Location <- str_sub(UCAP$Auction_Location, str_locate(UCAP$Auction_Location, "-")[, 1] + 1)
UCAP$Price <- extract_numeric(UCAP$Price)
UCAP <- UCAP %>% select(Year, Month, Auction, Location, Price)| Year | Month | Auction | Location | Price |
|---|---|---|---|---|
| 2003 | November | Monthly | NYC | 6.67 |
| 2003 | December | Monthly | NYC | 6.67 |
| 2004 | January | Monthly | NYC | 6.67 |
| 2004 | February | Monthly | NYC | 6.95 |
| 2004 | March | Monthly | NYC | 6.25 |
| 2004 | April | Monthly | NYC | 6.25 |
The average price is calculated by location and month. With the monthly means for each location sorted in descending order, returning every 12th row will return the highest value for each location:
UCAP_month <- UCAP %>%
group_by(Location, Month) %>%
summarize(Mean = mean(Price, na.rm = TRUE)) %>%
arrange(desc(Mean))
UCAP_month[seq(1, nrow(UCAP_month), 12), 1:2]## Source: local data frame [4 x 2]
## Groups: Location [4]
##
## Location Month
## (chr) (chr)
## 1 LHV June
## 2 LI May
## 3 NYC May
## 4 ROS June
UCAP_diff <- UCAP %>%
group_by(Location) %>%
filter(Location == "NYC" | Location == "ROS") %>%
summarize(Mean=mean(Price, na.rm = TRUE))
UCAP_diff[1, 2] - UCAP_diff[2,2]## Mean
## 1 6.72846
UCAP %>%
group_by(Year) %>%
summarize(Mean = mean(Price, na.rm = TRUE)) %>%
arrange(desc(Mean)) %>%
top_n(3)## Source: local data frame [3 x 2]
##
## Year Mean
## (chr) (dbl)
## 1 2014 7.532708
## 2 2015 6.254306
## 3 2013 5.695833
UCAP %>%
group_by(Auction) %>%
filter(Auction == "Monthly" | Auction == "Spot") %>%
summarize(Variance = var(Price, na.rm = TRUE)) %>%
arrange(desc(Variance))## Source: local data frame [2 x 2]
##
## Auction Variance
## (chr) (dbl)
## 1 Spot 16.66132
## 2 Monthly 15.79204
The spot auction is more volatile than the monthly auction, as it has a higher variance.
The data is loaded from a csv.
auction <- read.csv('./untidy_auction_data.csv', stringsAsFactors = FALSE, header = TRUE)| Lot | Description | Range | Price |
|---|---|---|---|
| 2001 | New England Federal mahogany slant front desk, ca. 1810, with allover line inlay, 45" h., 40" w. | $300 - $400 | $283 |
| 2002 | Federal brass and wire fire fender, early 19thc., 13 3/4" h., 30" w. | $80 - $120 | $234 |
| 2003 | Pennsylvania or New Jersey late Chippendale applewood card table, ca. 1790, 29 1/2" h., 36" w. | $200 - $300 | $357 |
The item descriptions are not needed for this analysis – they are stored in a separate table with the lot number to allow for join functions in dplyr if needed for further analysis.
lot_info <- auction %>% select(Lot, Description)
auction <- auction %>% select(Lot, Range, Price)| Lot | Range | Price |
|---|---|---|
| 2001 | $300 - $400 | $283 |
| 2002 | $80 - $120 | $234 |
| 2003 | $200 - $300 | $357 |
The auction dataset is tidied, including the creation of two separate values for estimates (Low and High):
auction$Low <- str_trim(str_sub(auction$Range, 1, str_locate(auction$Range, "-")[, 1] - 1))
auction$High <- str_trim(str_sub(auction$Range, str_locate(auction$Range, "-")[, 1] + 1))
auction <- auction %>% select(Lot, Low, High, Price)
auction <- auction %>% gather(Value, Price, -Lot) %>% arrange(Lot, Value)
auction$Price <- extract_numeric(auction$Price)| Lot | Value | Price |
|---|---|---|
| 2001 | High | 400 |
| 2001 | Low | 300 |
| 2001 | Price | 283 |
| 2002 | High | 120 |
| 2002 | Low | 80 |
| 2002 | Price | 234 |
| 2003 | High | 300 |
| 2003 | Low | 200 |
| 2003 | Price | 357 |
The requested analysis asked for the total percent difference between estimates and sale prices. Thus, the data are summarized across lot numbers before calculation.
auction <- auction %>% group_by(Value) %>% summarize(Sum = sum(Price))
HighDiff <- round((auction[auction$Value == "High", 2] - auction[auction$Value == "Price", 2]) /
auction[auction$Value == "Price", 2], 4) * 100
LowDiff <- round((auction[auction$Value == "Low", 2] - auction[auction$Value == "Price", 2]) /
auction[auction$Value == "Price", 2], 4) * 100The stated problem suggested calculating the percent difference with the estimate as the denominator; however, using the sale price as the denominator is likely more accurate, as it compares to the actual value rather than the predicted value.
The difference between the low estimates and sale prices is -33.64%, and the difference between high estimates and sale prices is -6.18%. In both cases, the negative numbers indicate that the estimates underpredict the actual sale prices.
The data is loaded from a csv. The first row of the file contains the chart title, and the second row has column names with a number of spaces — as a result, the first two rows will be skipped and the column names entered manually.
wealth <- read.csv('./wealth.csv', stringsAsFactors = FALSE, header = FALSE, skip = 2)
names(wealth) <- c('Year', 'Top1', 'Next19', 'Bottom80')| Year | Top1 | Next19 | Bottom80 |
|---|---|---|---|
| 1983 | 42.90% | 48.40% | 8.70% |
| 1989 | 46.90% | 46.50% | 6.60% |
| 1992 | 45.60% | 46.70% | 7.70% |
| 1995 | 47.20% | 45.90% | 7.00% |
| 1998 | 47.30% | 43.60% | 9.10% |
| 2001 | 39.70% | 51.50% | 8.70% |
Because of the percentage signs in the data, the values are read in as strings. This is corrected, and the numbers are converted to decimal form.
for (i in 2:4) {
wealth[, i] <- extract_numeric(wealth[, i]) / 100
}The data is then converted to a “long” data set to allow for easier analysis.
wealth <- wealth %>% gather(Tier, Share, -Year)| Year | Tier | Share |
|---|---|---|
| 1983 | Top1 | 0.429 |
| 1989 | Top1 | 0.469 |
| 1992 | Top1 | 0.456 |
| 1995 | Top1 | 0.472 |
| 1998 | Top1 | 0.473 |
| 2001 | Top1 | 0.397 |
To analyze the change in distribution of wealth, the values for each group are plotted over time:
From this chart, it is clear that the share of total wealth of the “Next 19%” has increased, surpassing taht of the “Top 1%”" some time between 1998 and 2001. The overall share of these two groups has increased, as the share of the “Bottom 80%” has decreased from almost 10% in 1983 to under 5% in 2010.