Fiona’s Data 110 Final Project

Resale Value by Shoe Size - Does Size Matter?

Before loading the dataset into R, I cleaned up the column headers in Excel so they wouldn’t contain any capital letters or spaces. I then reformatted the sale_price and retail_price columns from currency to number, so they wouldn’t contain any dollar signs and would be recognized as numbers in R. Finally, I saved the file as a .csv in my working directory.
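As a rough sketch, the same cleanup could also be done in R rather than Excel. The raw headers and currency strings below are made up for illustration (I am assuming the original export looked something like this); rename_with(), across() and parse_number() are the dplyr and readr functions doing the work.

```r
library(tidyverse)

# Hypothetical raw export: header names and values are assumptions, not the real file
raw <- tibble(
  `Order Date`   = c("9/1/17", "9/2/17"),
  `Sale Price`   = c("$1,097", "$685"),
  `Retail Price` = c("$220", "$220")
)

cleaned <- raw %>%
  # lower-case every header and replace spaces with underscores
  rename_with(~ gsub(" ", "_", tolower(.x))) %>%
  # drop "$" and "," and convert the price columns to numeric
  mutate(across(c(sale_price, retail_price), parse_number))

cleaned
```

Doing the cleanup in code keeps the original file untouched and makes the step repeatable.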

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)

Load in the dataset “stockx_data.csv”, name it “stockx”

stockx <- read_csv("stockx_data.csv")
## Parsed with column specification:
## cols(
##   order_date = col_character(),
##   brand = col_character(),
##   sneaker_name = col_character(),
##   sale_price = col_double(),
##   retail_price = col_double(),
##   release_date = col_character(),
##   shoe_size = col_double(),
##   buyer_region = col_character()
## )

First, I used ggplot to make a histogram comparing retail price by brand. This showed that all Yeezy shoes retail for $220, while Off-White shoes range from $130 to $250. At first, I was thinking of trying to shrink my dataset (since it’s almost 100,000 data points) by only looking at sneakers over, under, or between certain prices. This graph showed me, however, that doing so would likely result in one of the brands not being represented in the dataset.

plot1 <- stockx %>%
  ggplot(aes(x = retail_price, fill = brand)) +
  geom_histogram(position = "identity", alpha = 0.4, binwidth = 5, color = "black") +
  scale_fill_discrete()
plot1
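What the histogram shows about retail prices could also be checked numerically. This is a minimal sketch on a small made-up tibble (the rows in toy are placeholders, not real StockX records); with the real data, stockx would take the place of toy.

```r
library(tidyverse)

# Hypothetical stand-in rows; replace toy with stockx for the real check
toy <- tibble(
  brand        = c("Yeezy", "Yeezy", "Off-White", "Off-White", "Off-White"),
  retail_price = c(220, 220, 130, 190, 250)
)

retail_by_brand <- toy %>%
  group_by(brand) %>%
  summarise(
    n               = n(),                       # number of sales per brand
    min_retail      = min(retail_price),
    max_retail      = max(retail_price),
    distinct_prices = n_distinct(retail_price)   # 1 means a single retail price
  )

retail_by_brand
```

A brand with distinct_prices equal to 1 has a single retail price, so any price cutoff either keeps all of its rows or drops them all.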

Next, I used ggplot to make a histogram to compare sale price and brand. The scale makes it hard to see the data past the $1,500 range, but this shows that most shoes re-sold for under $1,000. While it is hard to see, most of the shoes sold in the $2,000-4,000 range are Off-White brand.

plot3 <- stockx %>%
  ggplot(aes(x = sale_price, fill = brand)) +
  geom_histogram(position = "dodge", alpha = 0.4, binwidth = 500, color = "black") +
  scale_fill_discrete()
plot3
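One possible way to make the high-price tail easier to see is a log-scaled x-axis. This is only a sketch on made-up values (toy is a placeholder, and bins = 30 is an arbitrary choice, since a fixed binwidth in dollars does not carry over to a log scale).

```r
library(tidyverse)

# Hypothetical sale prices; replace toy with stockx for the real plot
toy <- tibble(
  brand      = c("Yeezy", "Yeezy", "Yeezy", "Off-White", "Off-White", "Off-White"),
  sale_price = c(250, 320, 410, 560, 1200, 3800)
)

plot3_log <- toy %>%
  ggplot(aes(x = sale_price, fill = brand)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30, color = "black") +
  # log10 spacing gives the sparse high-price sales more room on the axis
  scale_x_log10()

plot3_log
```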

I then used the mutate function to create a new variable, “profit,” by subtracting retail price from sale price. I made a histogram of profit by brand, but in my opinion this plot didn’t make much sense. Profit values vary so widely that a histogram with profit on the x-axis and count on the y-axis isn’t very useful.

plot4 <- stockx %>%
  mutate(profit = sale_price - retail_price) %>%
  ggplot(aes(x = profit, fill = brand)) +
  geom_histogram(position = "dodge", alpha = 0.4, binwidth = 500, color = "black") +
  scale_fill_discrete()
plot4
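Since profit is spread so widely, a per-brand table of quartiles might be easier to read than the histogram. The sketch below uses a made-up tibble (toy and its values are placeholders) and the same profit definition as above.

```r
library(tidyverse)

# Hypothetical rows; replace toy with stockx for the real summary
toy <- tibble(
  brand        = c("Yeezy", "Yeezy", "Yeezy", "Off-White", "Off-White", "Off-White"),
  sale_price   = c(250, 320, 410, 560, 1200, 3800),
  retail_price = c(220, 220, 220, 190, 190, 190)
)

profit_by_brand <- toy %>%
  mutate(profit = sale_price - retail_price) %>%
  group_by(brand) %>%
  summarise(
    q25_profit    = quantile(profit, 0.25),
    median_profit = median(profit),
    q75_profit    = quantile(profit, 0.75),
    max_profit    = max(profit)
  )

profit_by_brand
```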

Just in case, I tried a bar plot with the same variables and decided against it once I saw the results.

plot6 <- stockx %>%
  mutate(profit = sale_price - retail_price) %>%
  ggplot(aes(x = shoe_size, y = profit, fill = brand)) +
  geom_col(position = "dodge")
plot6

Going back to the scatterplot, I made the points smaller (geom_point(size=.1)) to see if it was more legible that way. It definitely was.

plot7 <- stockx %>%
  mutate(profit = sale_price - retail_price) %>%
  ggplot(aes(x = shoe_size, y = profit, color = brand)) +
  geom_point(size = 0.1)
plot7
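To put a number next to what the scatterplot suggests, the median profit for each shoe size could be tabulated. This is a sketch on made-up rows (toy is a placeholder); the median is my own choice here because a few very expensive sales would pull the mean upward.

```r
library(tidyverse)

# Hypothetical rows; replace toy with stockx for the real table
toy <- tibble(
  shoe_size    = c(8, 8, 9.5, 9.5, 11),
  sale_price   = c(300, 400, 500, 700, 350),
  retail_price = c(220, 220, 220, 220, 220)
)

profit_by_size <- toy %>%
  mutate(profit = sale_price - retail_price) %>%
  group_by(shoe_size) %>%
  summarise(
    n_sales       = n(),
    median_profit = median(profit)
  ) %>%
  arrange(shoe_size)

profit_by_size
```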

Preparing to use plotly to make my visualization interactive, I called up plotly.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

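For my statistical analysis, I thought a linear regression wouldn’t be best to use here, since shoe size isn’t a continuous variable. Instead, I decided to use the summary function to get summary statistics (minimum, quartiles, median, mean and maximum) for each numeric column. It was interesting to see that the median sale price ($370) is well above the median retail price ($220). The summary alone does not show whether every pair of shoes sold for more than its retail price, though, since that requires comparing the two prices row by row.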

summary(stockx)
##   order_date           brand           sneaker_name         sale_price    
##  Length:99956       Length:99956       Length:99956       Min.   : 186.0  
##  Class :character   Class :character   Class :character   1st Qu.: 275.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 370.0  
##                                                           Mean   : 446.6  
##                                                           3rd Qu.: 540.0  
##                                                           Max.   :4050.0  
##   retail_price   release_date         shoe_size      buyer_region      
##  Min.   :130.0   Length:99956       Min.   : 3.500   Length:99956      
##  1st Qu.:220.0   Class :character   1st Qu.: 8.000   Class :character  
##  Median :220.0   Mode  :character   Median : 9.500   Mode  :character  
##  Mean   :208.6                      Mean   : 9.344                     
##  3rd Qu.:220.0                      3rd Qu.:11.000                     
##  Max.   :250.0                      Max.   :17.000
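Whether any pair sold at or below retail can be checked directly by counting rows with non-positive profit. This is a minimal sketch on made-up rows (toy is a placeholder, and its values say nothing about the real result); with the real data, stockx would replace toy.

```r
library(tidyverse)

# Hypothetical rows; replace toy with stockx for the real check
toy <- tibble(
  sale_price   = c(186, 250, 560),
  retail_price = c(220, 220, 190)
)

retail_check <- toy %>%
  mutate(profit = sale_price - retail_price) %>%
  summarise(
    n_sales              = n(),
    n_at_or_below_retail = sum(profit <= 0),  # sales that made no profit
    min_profit           = min(profit)
  )

retail_check
```

If n_at_or_below_retail comes back as 0 on the real data, every sale was above retail.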

Preparing for my final plot, I loaded in ggthemes.

library(ggthemes)

I wanted to play around with themes, colors, text, etc. and decide what I wanted before I created my final plot. So again I used mutate to create the variable “profit,” then I put shoe size on the x-axis, profit on the y-axis, and brand in the legend. I made my points smaller using size=.1. I changed the background to grey, and made Off-White black and Yeezy green. I added axis labels, a legend label and a title. I changed the size and font of the legend text, the axis titles and the axis labels, and made the legend labels and axis tick labels italic. I made the x-axis tick marks go from 3.5 to 17, the range of the shoe sizes, with a label every 1.5 sizes. Lastly, I made the legend background grey.

plot8 <- stockx %>%
  mutate(profit = sale_price - retail_price) %>%
  ggplot(aes(x = shoe_size, y = profit, color = brand)) +
  geom_point(size = 0.1) +
  theme(panel.background = element_rect(fill = "grey", colour = "white")) +
  scale_color_manual(values = c("black", "green3")) +
  xlab("Shoe Size") +
  ylab("Profit ($)") +
  labs(color = "Brand") +
  ggtitle("Resale Value by Shoe Size - Does Size Matter?") +
  theme(legend.text = element_text(size = 10, face = "italic", family = "sans")) +
  theme(axis.text = element_text(size = 8, family = "sans", face = "italic"),
        axis.title = element_text(size = 12, family = "sans")) +
  scale_x_continuous(breaks = seq(3.5, 17, 1.5)) +
  theme(legend.background = element_rect(fill = "grey"))
plot8
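An unnamed vector in scale_color_manual() is matched to the brands in order, so the colors depend on how the brand values happen to sort. A named vector ties each color to a brand explicitly. This is a sketch on made-up rows, and the labels "Off-White" and "Yeezy" are assumptions about how the brands are spelled in the file (unique(stockx$brand) would show the actual values).

```r
library(tidyverse)

# Hypothetical rows; the brand labels are assumed spellings
toy <- tibble(
  brand     = c("Yeezy", "Off-White", "Yeezy", "Off-White"),
  shoe_size = c(8, 9.5, 11, 12.5),
  profit    = c(80, 400, 150, 900)
)

# Names must match the values in the brand column exactly
brand_colors <- c("Off-White" = "black", "Yeezy" = "green3")

plot8_named <- toy %>%
  ggplot(aes(x = shoe_size, y = profit, color = brand)) +
  geom_point(size = 0.1) +
  scale_color_manual(values = brand_colors)

plot8_named
```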

After playing around with customizations, I was happy with what I had and used plotly to make it interactive. Here is my final plot!

# plot8 already holds the customized ggplot, so it can be converted directly
plot9 <- ggplotly(plot8)
plot9