Final Project - Data 110

Introduction

knitr::include_graphics("~/Data 110 101/awesome_lego.gif")

Image Credit (https://tenor.com/4I00.gif)

Topic: How does the relationship between price and pieces vary across themes?

In this project I will be using a Lego data set with 75 observations and 14 variables. I found this data set on OpenIntro; however, its contents were scraped from Brickset.com and BrickInstructions.com. These websites have extensive catalogs of Lego sets and information about those sets.

To answer my question, I will be using three variables:

amazon_price: A numerical variable listing a Lego set’s price on Amazon rather than its in-person price.
pieces: A numerical variable giving the number of pieces in a given Lego set.
theme: A categorical variable giving the theme, or product line, that a Lego set belongs to. Although Lego has more, this data set contains only three themes: City, Friends, and DUPLO.

I became interested in this data set because I personally like building Lego sets. I also find that they make good gifts. Because they’re often gifted, knowing the price range of different types of Lego sets can help people look more efficiently for sets that are within their budget.

Data Preparation

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggplot2) # Plots (already attached by tidyverse)
library(ggfortify) # Diagnostic plots
library(ggridges) # Density ridges plot
library(plotly) # Interactive plot

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

setwd("~/Data 110 101/csv") # Set working directory
lego <- read.csv("lego_sample.csv") # Call in data set

lego_clean <- lego |>
  # select relevant variables
  select(amazon_price, pieces, theme) |>
  group_by(theme) |>
  # Find mean amazon price for each theme for reference
  mutate(avg_price = mean(amazon_price))

lego_clean

## # A tibble: 75 × 4
## # Groups:   theme [3]
##    amazon_price pieces theme  avg_price
##           <dbl>  <int> <chr>      <dbl>
##  1        16         6 DUPLO®      34.3
##  2         9.45      6 DUPLO®      34.3
##  3        39.9      41 DUPLO®      34.3
##  4        56.7      71 DUPLO®      34.3
##  5        37.0      26 DUPLO®      34.3
##  6         9.99     16 DUPLO®      34.3
##  7        22.0      26 DUPLO®      34.3
##  8       129.      105 DUPLO®      34.3
##  9        74.5      38 DUPLO®      34.3
## 10        99.0      37 DUPLO®      34.3
## # ℹ 65 more rows

# Check for NAs
any(is.na(lego_clean))

## [1] FALSE

Multiple Linear Regression

# Create multiple regression model for amazon price using pieces and theme as predictors
lm <- lm(amazon_price ~ pieces + theme, data = lego_clean)
summary(lm)

## 
## Call:
## lm(formula = amazon_price ~ pieces + theme, data = lego_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.068 -12.736  -5.623   6.056  87.219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.52364    6.14050   1.388  0.16945    
## pieces        0.13380    0.01482   9.025 2.11e-13 ***
## themeDUPLO®  21.10939    7.41084   2.848  0.00574 ** 
## themeFriends -7.35361    6.50107  -1.131  0.26180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.98 on 71 degrees of freedom
## Multiple R-squared:  0.543,  Adjusted R-squared:  0.5237 
## F-statistic: 28.12 on 3 and 71 DF,  p-value: 4.281e-12

Equation: \(\widehat{\text{amazon\_price}} = 8.52364 + 0.13380 \times \text{pieces} + 21.10939 \times \text{themeDUPLO} - 7.35361 \times \text{themeFriends}\)

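Since theme is a categorical variable, themeDUPLO and themeFriends are indicator (0 or 1) variables, so at most one of the theme coefficients applies to a given set. These coefficients shift the intercept rather than the slope: DUPLO adds 21.10939, Friends subtracts 7.35361, and City, the reference level, uses the baseline intercept.
The p-values of pieces and themeDUPLO are less than 0.05. This means that they are statistically significant predictors of Amazon price. The p-value of themeFriends (about 0.26) is not, so the model does not show a clear price difference between Friends and City sets with the same number of pieces.
The Adjusted R-squared value is about 0.52, meaning that pieces and theme account for about 52% of the variation in Amazon price. Since they account for only about half of the variation, theme and pieces alone don’t produce a well-fitted model for predicting Amazon price.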
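
As a worked example of the equation above, the sketch below plugs the fitted coefficients into a small helper and predicts the Amazon price of a 100-piece set in each theme. The helper name predict_price and the choice of 100 pieces are my own illustrative assumptions; the coefficients are copied from the summary output.

```r
# Coefficients copied from the summary(lm) output above
coefs <- c(intercept = 8.52364, pieces = 0.13380,
           duplo = 21.10939, friends = -7.35361)

# Hypothetical helper (not part of the original analysis):
# predicted Amazon price for a set with n_pieces pieces in a given theme
predict_price <- function(n_pieces, theme = c("City", "DUPLO", "Friends")) {
  theme <- match.arg(theme)
  # City is the reference level, so it gets no intercept shift
  shift <- switch(theme,
                  City    = 0,
                  DUPLO   = coefs[["duplo"]],
                  Friends = coefs[["friends"]])
  coefs[["intercept"]] + coefs[["pieces"]] * n_pieces + shift
}

# Predicted price of a 100-piece set in each theme
sapply(c("City", "DUPLO", "Friends"), function(t) predict_price(100, t))
# City: 21.90364, DUPLO: 43.01303, Friends: 14.55003
```

The three predictions differ only by the theme shifts, which is what a shared slope with theme-specific intercepts implies.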

Diagnostic Plots

# Create diagnostic plots in a 2x2 grid
autoplot(lm, 1:4, nrow=2, ncol=2)

Residuals vs Fitted

The residuals are somewhat right-skewed around the zero line, and the smoothed line is not entirely horizontal. Both of these aspects of the plot suggest some non-linearity, meaning that the regression model is not a very good fit.

Normal Q-Q

The majority of the points lie along the diagonal reference line, but there is some deviation around the top right of the plot. This suggests some non-normality in the residuals (a heavy right tail), which is consistent with the right skew seen in the Residuals vs Fitted plot.

Scale-Location

There is a slight funnel shape, with the spread of the residuals changing across the fitted values, which indicates non-constant variance among the residuals. This again suggests a poorly fitting model.

Cook’s Distance

Observation 29 is highlighted in this graph, identifying it as an influential point. As this observation is highlighted in all of the diagnostic plots, I might want to examine it and consider removing it from my data set in the future.
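
One possible way to follow up on an influential point is to compute Cook's distance directly, flag observations above a cutoff, and refit the model without them to see how much the coefficients move. The sketch below uses the built-in mtcars data only as a stand-in so that it runs on its own; the 4/n cutoff is a common rule of thumb rather than a fixed standard, and none of these numbers describe the Lego data.

```r
# Stand-in data: mtcars ships with R (NOT the Lego data set)
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs are also used)
threshold <- 4 / nrow(mtcars)
influential <- which(cd > threshold)

# Keep everything except the flagged rows; setdiff() stays safe
# even when no row is flagged
keep <- setdiff(seq_len(nrow(mtcars)), influential)
fit_trimmed <- lm(mpg ~ wt + factor(cyl), data = mtcars[keep, ])

# Compare coefficients with and without the flagged observations
rbind(full = coef(fit), trimmed = coef(fit_trimmed))
```

If the coefficients change very little, the flagged point is unusual but not driving the conclusions; a large change would be a reason to report both fits.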

Regression Plot

# Create ggplot
ggplot(lego_clean, aes(pieces, amazon_price, color = theme)) +
  geom_point() +  # Scatter plot of Pieces vs. Amazon Price
  geom_smooth(method = "lm", se = FALSE) +  # Add a separate linear regression line for each theme
  labs(x = "Pieces",
       y = "Amazon Price",
       color = "Theme",
       caption = "lego_sample.csv",
       title = "Amazon Price Given Pieces and Theme") +  # Axis labels, caption and title
  theme_bw(base_size = 14, base_family = "serif")  + # edit theme, font, and font size
  scale_color_brewer(palette="Set2") # edit color

## `geom_smooth()` using formula = 'y ~ x'

Overall Analysis

Despite significant p-values indicating that pieces and theme are useful predictors of Amazon price, the low Adjusted R-squared value and the diagnostic plots suggest that the model overall is not a great fit. More predictors would need to be included to predict Amazon price more accurately.
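
The regression model above assumes one shared slope for pieces, while geom_smooth() with color = theme fits a separate line for each theme. A minimal sketch of how to check whether the slopes really differ is to add an interaction term and compare the two models with anova(). The built-in iris data is used here only as a stand-in so that the code runs on its own; for this project the analogous formula would presumably be amazon_price ~ pieces * theme, which I have not run here.

```r
# Stand-in data: iris ships with R (NOT the Lego data set)

# Additive model: one shared slope, a different intercept per group
additive <- lm(Petal.Width ~ Petal.Length + Species, data = iris)

# Interaction model: the slope is allowed to differ by group
varying <- lm(Petal.Width ~ Petal.Length * Species, data = iris)

# F-test comparing the nested models; a small p-value would suggest
# that the slope varies across groups
comparison <- anova(additive, varying)
comparison

# Interaction coefficients estimate how each group's slope differs
# from the reference group's slope
coef(varying)
```

The interaction model is the one that matches the per-theme lines drawn in the regression plot, so it speaks more directly to how the price and pieces relationship varies across themes.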

Additional Visualizations

Interactive Plot

# Boxplot of amazon price given theme
p1 <- ggplot(lego_clean, aes(theme, amazon_price, color = theme)) +
  geom_boxplot() +
  theme_bw(base_size = 14, base_family = "serif") +
  
  # Add point that shows mean amazon price for each theme
  stat_summary(fun=mean, geom="point", shape=20, size=4, color="red", fill="red")  +
  scale_color_brewer(palette="Set1") +
  labs(
    title = "Amazon Price Distribution by Lego Set Theme",
    x = "Theme",
    y = "Amazon Price",
    color = "Theme",
    caption = "lego_sample.csv"
  )

# Make plot interactive with plotly
ggplotly(p1)

This boxplot shows the Amazon price distribution of each theme. Hovering over a boxplot shows the five-number summary of the Amazon price for that theme. The “amazon_price” value shown when hovering over the red point on each boxplot is the mean price of that theme. Duplo has the lowest mean and City has the highest. Every theme has at least one outlier; the most extreme one is in Friends.
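
The values shown in the tooltips can also be tabulated directly. The sketch below computes a five-number summary and the mean for each group in base R, using the built-in iris data as a stand-in so that it runs on its own; with the Lego data the grouping variable would be theme and the numeric variable amazon_price. Note that fivenum() uses Tukey's hinges, which may differ slightly from the quartiles drawn by geom_boxplot().

```r
# Stand-in data: iris ships with R (NOT the Lego data set)
by_group <- split(iris$Sepal.Length, iris$Species)

# Five-number summary plus the mean for each group
group_summary <- sapply(by_group, function(x) c(fivenum(x), mean(x)))
rownames(group_summary) <- c("min", "lower_hinge", "median",
                             "upper_hinge", "max", "mean")

# One column per group, one row per statistic
round(group_summary, 2)
```

A table like this makes it easier to compare group means and medians side by side than hovering over each box in turn.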

Density Plot

# Geom density plot of pieces by theme
ggplot(lego_clean, aes(pieces, theme, fill = theme)) +
  geom_density_ridges(alpha = 0.7) +
  theme_bw(base_size = 14, base_family = "serif") +
  scale_fill_brewer(palette="Set1") +
  labs(
    title = "Piece Distribution by Lego Set Theme",
    x = "Pieces",
    y = "Theme",
    fill = "Theme",
    caption = "lego_sample.csv"
  )

## Picking joint bandwidth of 67.3

This density plot shows the distribution of pieces within each theme. All themes have piece counts that are skewed to the right. Comparatively, Duplo’s distribution is centered around the lowest number of pieces. While the City and Friends distributions have irregular shapes, Duplo’s is closer to a bell shape, indicating that the theme has few extreme outliers.

Conclusion (Essay part b)

To answer my question of how the relationship between price and pieces varies across themes: the regression model showed that Duplo has the largest y-intercept, and Duplo’s line in the regression plot has the steepest slope, meaning that for the same number of pieces it’s more expensive than the Friends and City themes. This means that, for sets with the same number of pieces, the order from most to least expensive is Duplo, City, and then Friends.

Reflecting on my findings, it was a bit surprising that the multiple regression showed DUPLO sets tend to be the most expensive for the same number of pieces. This is because Duplo’s mean Amazon price, as seen in the boxplot, is the lowest of the three themes. However, I think what was found in the regression model is due to Duplo having the smallest range in prices and therefore few outliers to skew the mean higher. In terms of the limitations of this exploration, I wish I had figured out how to clean up the tooltip in the interactive boxplot so that the mean highlighted by the red points was clearer. In the future, instead of focusing on a specific question like I did in this project, I would focus on trying to create a line of best fit with more variables, as that would probably be the most helpful in budgeting around Lego sets.