Project 2: DATA-110

Research Questions & Introduction

Multiple Linear Regression: How is LEGO set price related to the quantitative variables pieces, year, and pages and the categorical variable size, and how significant is each of these predictors in a multiple linear regression model?
Visualization: How does the number of pieces relate to the price of LEGO Speed Champions sets?

My dataset was compiled from a JSDSE article by Anna Peterson and Laura Ziegler. The data for their article was scraped from multiple sources, including brickset.com, and covers LEGO sets from Jan 1st, 2018 to Sep 11th, 2020. The dataset has 1304 observations and 14 variables, but the main ones I used are set_name, theme, year, pieces, size, pages, and price. Year is the year the LEGO set was released, theme is the category the LEGO set fits under, set_name is the name of the specific LEGO set, pieces is the number of LEGO bricks included in the set, size is the size category of the set (small or large), pages is the number of pages in the set's instruction manual, and price is the price at which the LEGO set was released for sale.

I plan to explore the relationship between price and several variables in this dataset, and then filter the dataset to only the Speed Champions theme to examine the relationship between price and pieces within that theme. I chose this topic because I love building the Speed Champions car sets LEGO offers and have a collection of them at home. While shopping for sets, I noticed that sets with more pieces seemed to cost more, and I have always wondered whether that pattern holds in the data.

Loading Libraries & Data Set & Observing The Structure

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)
 
setwd("~/Documents/EC/Spring 2026/DATA 110/Project 2")
 
lego <- read_csv("lego_population.csv")

## Rows: 1304 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): set_name, theme, ages, packaging, weight, size
## dbl (8): item_number, pieces, price, amazon_price, year, pages, minifigures,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(lego)

## spc_tbl_ [1,304 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ item_number  : num [1:1304] 41916 41908 11006 11007 41901 ...
##  $ set_name     : chr [1:1304] "Extra Dots - Series 2" "Extra Dots - Series 1" "Creative Blue Bricks" "Creative Green Bricks" ...
##  $ theme        : chr [1:1304] "DOTS" "DOTS" "Classic" "Classic" ...
##  $ pieces       : num [1:1304] 109 109 52 60 33 33 33 33 33 33 ...
##  $ price        : num [1:1304] 3.99 3.99 4.99 4.99 4.99 4.99 4.99 4.99 4.99 4.99 ...
##  $ amazon_price : num [1:1304] 3.44 3.99 4.93 4.93 4.99 4.99 4.99 4.99 4.99 4.99 ...
##  $ year         : num [1:1304] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ ages         : chr [1:1304] "Ages_6+" "Ages_6+" "Ages_4+" "Ages_4+" ...
##  $ pages        : num [1:1304] NA NA 37 37 NA NA NA NA NA NA ...
##  $ minifigures  : num [1:1304] NA NA NA NA NA NA NA NA NA NA ...
##  $ packaging    : chr [1:1304] "Foil pack" "Foil pack" "Box" "Box" ...
##  $ weight       : chr [1:1304] NA NA NA NA ...
##  $ unique_pieces: num [1:1304] 6 6 28 36 10 9 9 12 10 9 ...
##  $ size         : chr [1:1304] "Small" "Small" "Small" "Small" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   item_number = col_double(),
##   ..   set_name = col_character(),
##   ..   theme = col_character(),
##   ..   pieces = col_double(),
##   ..   price = col_double(),
##   ..   amazon_price = col_double(),
##   ..   year = col_double(),
##   ..   ages = col_character(),
##   ..   pages = col_double(),
##   ..   minifigures = col_double(),
##   ..   packaging = col_character(),
##   ..   weight = col_character(),
##   ..   unique_pieces = col_double(),
##   ..   size = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(lego)

## # A tibble: 6 × 14
##   item_number set_name         theme pieces price amazon_price  year ages  pages
##         <dbl> <chr>            <chr>  <dbl> <dbl>        <dbl> <dbl> <chr> <dbl>
## 1       41916 Extra Dots - Se… DOTS     109  3.99         3.44  2020 Ages…    NA
## 2       41908 Extra Dots - Se… DOTS     109  3.99         3.99  2020 Ages…    NA
## 3       11006 Creative Blue B… Clas…     52  4.99         4.93  2020 Ages…    37
## 4       11007 Creative Green … Clas…     60  4.99         4.93  2020 Ages…    37
## 5       41901 Funky Animals B… DOTS      33  4.99         4.99  2020 Ages…    NA
## 6       41902 Sparkly Unicorn… DOTS      33  4.99         4.99  2020 Ages…    NA
## # ℹ 5 more variables: minifigures <dbl>, packaging <chr>, weight <chr>,
## #   unique_pieces <dbl>, size <chr>

Cleaning The Data Set Variables

names(lego) <- gsub("[(). \\-]", "_", names(lego))
names(lego) <- gsub("_$", "", names(lego))
names(lego) <- tolower(names(lego))

head(lego)

## # A tibble: 6 × 14
##   item_number set_name         theme pieces price amazon_price  year ages  pages
##         <dbl> <chr>            <chr>  <dbl> <dbl>        <dbl> <dbl> <chr> <dbl>
## 1       41916 Extra Dots - Se… DOTS     109  3.99         3.44  2020 Ages…    NA
## 2       41908 Extra Dots - Se… DOTS     109  3.99         3.99  2020 Ages…    NA
## 3       11006 Creative Blue B… Clas…     52  4.99         4.93  2020 Ages…    37
## 4       11007 Creative Green … Clas…     60  4.99         4.93  2020 Ages…    37
## 5       41901 Funky Animals B… DOTS      33  4.99         4.99  2020 Ages…    NA
## 6       41902 Sparkly Unicorn… DOTS      33  4.99         4.99  2020 Ages…    NA
## # ℹ 5 more variables: minifigures <dbl>, packaging <chr>, weight <chr>,
## #   unique_pieces <dbl>, size <chr>
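
The three renaming steps above leave this dataset's column names unchanged, since they are already lowercase with underscores (the head() output is the same before and after). As a minimal sketch of what the steps would do to messier names, here they are applied to a few hypothetical column names that are not in the LEGO data:

```r
# Hypothetical raw column names (made up for illustration, not from the LEGO dataset)
raw_names <- c("Item.Number", "Set Name", "Amazon-Price", "Weight (kg)")

# Same three steps as in the cleaning code above
clean_names <- gsub("[(). \\-]", "_", raw_names)  # parentheses, dots, spaces, hyphens -> "_"
clean_names <- gsub("_$", "", clean_names)        # drop a single trailing underscore
clean_names <- tolower(clean_names)               # lowercase everything

print(clean_names)
```

The last name comes out as weight__kg with a double underscore, because the first gsub() replaces each matched character separately rather than collapsing runs of them.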

Filtering for LEGO Speed Champions Theme

lego_cars <- lego |>
  filter(theme == "Speed Champions")
head(lego_cars)

## # A tibble: 6 × 14
##   item_number set_name         theme pieces price amazon_price  year ages  pages
##         <dbl> <chr>            <chr>  <dbl> <dbl>        <dbl> <dbl> <chr> <dbl>
## 1       75891 Chevrolet Camar… Spee…    198  15.0         12.9  2019 Ages…    60
## 2       75892 McLaren Senna    Spee…    219  15.0         12.9  2019 Ages…    64
## 3       75895 1974 Porsche 91… Spee…    180  15.0         13.3  2019 Ages…    80
## 4       75890 Ferrari F40 Com… Spee…    198  15.0         15.0  2019 Ages…    56
## 5       76896 Nissan GT-R NIS… Spee…    298  20.0         19.0  2020 Ages…    80
## 6       76895 Ferrari F8 Trib… Spee…    275  20.0         20.0  2020 Ages…    72
## # ℹ 5 more variables: minifigures <dbl>, packaging <chr>, weight <chr>,
## #   unique_pieces <dbl>, size <chr>

Multiple Linear Regression: Price ~ Pieces + Year + Size + Pages

multiple_model <- lm(price ~ pieces + year + size + pages, data = lego)

summary(multiple_model)

## 
## Call:
## lm(formula = price ~ pieces + year + size + pages, data = lego)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -126.762   -7.764   -2.330    4.640  282.437 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.998e+02  1.764e+03  -0.510 0.610130    
## pieces       8.232e-02  1.936e-03  42.529  < 2e-16 ***
## year         4.587e-01  8.737e-01   0.525 0.599745    
## sizeSmall   -2.084e+01  3.121e+00  -6.676 4.25e-11 ***
## pages        4.119e-02  1.058e-02   3.893 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.47 on 921 degrees of freedom
##   (378 observations deleted due to missingness)
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.8626 
## F-statistic:  1453 on 4 and 921 DF,  p-value: < 2.2e-16

Interpretation:

price = -899.8 + 0.08232 (pieces) + 0.4587 (year) − 20.84 (sizeSmall) + 0.04119 (pages)

Pieces (p < 2e-16): Highly significant. Holding the other predictors fixed, each additional piece is associated with a price increase of about $0.082. With the largest t value (42.5), it is the strongest predictor in the model.
Year (p = 0.600): Not significant. After accounting for pieces, size, and pages, release year has no detectable effect on price.
SizeSmall (p = 4.25e-11): Significant. Holding the other predictors fixed, small sets cost about $20.84 less than large sets.
Pages (p = 0.000106): Significant. Each additional page of instructions is associated with a price increase of about $0.041.
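
Each p-value in the summary comes from a t statistic: the estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom (921 here). A quick sketch that recomputes the year row from the rounded values printed in the output; the variable names are only illustrative:

```r
# Values copied from the "year" row of the summary output above
estimate_year <- 4.587e-01
se_year       <- 8.737e-01
df_resid      <- 921

t_year <- estimate_year / se_year              # t statistic
p_year <- 2 * pt(-abs(t_year), df = df_resid)  # two-sided p-value

round(c(t = t_year, p = p_year), 3)
```

Up to rounding, this reproduces the t value of 0.525 and the p-value of about 0.600 reported for year in the output.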

Adjusted R²: about 0.8626. This means about 86% of the variation in price is explained by pieces, year, size, and pages, which indicates a strong fit.
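
The summary also notes that 378 observations were deleted due to missingness, so the model was fit on 1304 - 378 = 926 sets rather than all 1304. By default, lm() drops any row that has an NA in a variable used by the formula. A minimal sketch of that behavior on a small made-up data frame (the values are hypothetical, not taken from the LEGO data):

```r
# Hypothetical toy data with two incomplete rows (not from the LEGO dataset)
toy <- data.frame(
  price  = c(9.99, 19.99, 29.99, 49.99, 14.99, 99.99),
  pieces = c(100, 250, 400, 700, NA, 1200),
  pages  = c(30, 60, NA, 150, 40, 260)
)

# Rows with no NA in any model variable
n_complete <- sum(complete.cases(toy[, c("price", "pieces", "pages")]))

# lm() silently drops the incomplete rows before fitting
fit <- lm(price ~ pieces + pages, data = toy)

n_complete  # rows available to the model
nobs(fit)   # rows the model actually used
```

On the real data, sum(complete.cases(lego[, c("price", "pieces", "year", "size", "pages")])) should match the 926 rows implied by the output.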

Backwards Elimination on the Multiple Linear Regression Model - Removing Year (Largest P-value)

multiple_model <- lm(price ~ pieces + size + pages, data = lego)

summary(multiple_model)

## 
## Call:
## lm(formula = price ~ pieces + size + pages, data = lego)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.207   -7.950   -2.379    4.695  282.447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.257897   2.978902   8.815  < 2e-16 ***
## pieces        0.082323   0.001935  42.546  < 2e-16 ***
## sizeSmall   -20.823658   3.119741  -6.675 4.27e-11 ***
## pages         0.041160   0.010577   3.891 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 922 degrees of freedom
##   (378 observations deleted due to missingness)
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.8628 
## F-statistic:  1939 on 3 and 922 DF,  p-value: < 2.2e-16

price = 26.2579 + 0.08232 (pieces) − 20.8237 (sizeSmall) + 0.04116 (pages)
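
To see how the fitted equation is used, here is a sketch that plugs in a hypothetical small set with 200 pieces and a 60-page manual (inputs made up for illustration). The coefficients are the rounded ones from the equation above, and predict_price is a helper name introduced here, not part of the analysis:

```r
# Reduced-model equation with the rounded coefficients from the summary output.
# `small` is 1 for a small set and 0 for a large set (the sizeSmall dummy).
predict_price <- function(pieces, pages, small) {
  26.2579 + 0.08232 * pieces - 20.8237 * small + 0.04116 * pages
}

# Hypothetical set: 200 pieces, 60 pages, small
pred <- predict_price(pieces = 200, pages = 60, small = 1)
round(pred, 2)
```

This gives a predicted price of about $24.37. With the fitted model object, predict(multiple_model, newdata = data.frame(pieces = 200, size = "Small", pages = 60)) should give nearly the same value using the unrounded coefficients.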

Adjusted R²: about 0.8628. This means about 86% of the variation in price is explained by pieces, size, and pages. The adjusted R² is marginally higher than in the previous model (0.8628 vs. 0.8626), so dropping year costs nothing in fit.

A reduced model for price using only pieces, size, and pages is preferable: year is not statistically significant, and removing it leaves multiple R² unchanged at 0.8632 while raising adjusted R² slightly, by 0.0002. All three remaining predictors are significant at the 0.001 level, as shown by the three stars next to each one in the summary output.
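
The adjusted R² values in both summaries can be recovered from the multiple R² with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows used (926) and p is the number of coefficients other than the intercept. A short sketch using the rounded R² of 0.8632 printed in both outputs; adj_r2 is a helper name introduced here:

```r
# Adjusted R-squared from multiple R-squared, sample size n, and predictor count p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

full    <- adj_r2(0.8632, n = 926, p = 4)  # pieces, year, sizeSmall, pages
reduced <- adj_r2(0.8632, n = 926, p = 3)  # pieces, sizeSmall, pages

round(c(full = full, reduced = reduced), 4)
```

This returns about 0.8626 for the full model and 0.8628 for the reduced model. Because R² is essentially unchanged, the lighter penalty for three predictors instead of four is what nudges adjusted R² up.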

Highcharter Visualization of Speed Champions Sets Based on # of Pieces

legohb <- brewer.pal(4, "Set1") 
highchart() |>
  hc_add_series(data = lego_cars, type = "scatter", hcaes(x = pieces, y = price, group = set_name)) |> 
  hc_xAxis(title = list(text="# of Pieces")) |>
  hc_yAxis(title = list(text="Price in USD")) |>
  hc_title(text = "Speed Champions Sets Price Based on # of Pieces") |> # Found on Google
  hc_caption(text = "Source: OpenIntro (brickset.com)") |> # Found on Google
  hc_colors(legohb)

Interpretation

In this highcharter scatterplot, we can see the LEGO Speed Champions sets in the dataset, which covers Jan 1st, 2018 to Sep 11th, 2020. I observed that as the number of pieces in a set increased, the price also increased. I wish the legend of the graph were cleaner, as it is cluttered with the long names of each LEGO set.
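
One way to put a number on this trend is the Pearson correlation between pieces and price. The sketch below uses only the six Speed Champions rows shown in the head(lego_cars) output earlier, with prices as rounded in that printout, so it is illustrative rather than a result for the whole theme:

```r
# Six rows copied from the head(lego_cars) printout (prices as displayed, i.e. rounded)
six_sets <- data.frame(
  pieces = c(198, 219, 180, 198, 298, 275),
  price  = c(15.0, 15.0, 15.0, 15.0, 20.0, 20.0)
)

# Pearson correlation between piece count and price for these six sets
r <- cor(six_sets$pieces, six_sets$price)
r
```

For these six rows the correlation is strongly positive (above 0.9). For all Speed Champions sets, the equivalent call would be cor(lego_cars$pieces, lego_cars$price, use = "complete.obs").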

References

Dataset: https://www.openintro.org/data/index.php?data=lego_population
Based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article was scraped from multiple sources, including brickset.com.