lego <- read.csv("lego_population.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Research Question
Does the number of pieces in a LEGO set affect its sale price?
Introduction
A useful application of a linear regression model is analyzing a company’s sales. For this project, I use a data set named lego_population to understand the factors that affect the pricing of LEGO sets. LEGO is a toy company known for its plastic building toys for people of all ages, and in this project I analyze the sets it had for sale from the start of 2018 to September 2020.
The data set has 1304 cases and 14 variables, but the ones I mainly use are pieces, which is the number of pieces in the set, and price, which is the recommended retail price from LEGO. The regression model also includes unique_pieces and year as predictors.
Source for Data Set
“Population of Lego Sets for Sale between Jan. 1, 2018 and Sept. 11, 2020.” Data Sets, www.openintro.org/data/index.php?data=lego_population. Accessed 1 Dec. 2025.
Data Analysis
The data analysis stage for this project is simple, as I am only trying to see by inspection whether there is a strong relationship between the number of pieces in a LEGO set and its price. The first step is to clean the data by removing every row with a missing value (NA) in the variables used in the model: pieces, price, unique_pieces, and year. The next step is to check the data by looking at the first rows, the maximum number of pieces, and the maximum price. The last step is to see whether the set with the most pieces is also the most expensive set.
# Keep only rows with no missing values in the variables used in the model
lego_clean <- lego |>
  filter(!is.na(pieces), !is.na(price), !is.na(unique_pieces), !is.na(year))
head(lego_clean)
## item_number set_name theme pieces price amazon_price year
## 1 41916 Extra Dots - Series 2 DOTS 109 3.99 3.44 2020
## 2 41908 Extra Dots - Series 1 DOTS 109 3.99 3.99 2020
## 3 11006 Creative Blue Bricks Classic 52 4.99 4.93 2020
## 4 11007 Creative Green Bricks Classic 60 4.99 4.93 2020
## 5 41901 Funky Animals Bracelet DOTS 33 4.99 4.99 2020
## 6 41902 Sparkly Unicorn Bracelet DOTS 33 4.99 4.99 2020
## ages pages minifigures packaging weight unique_pieces size
## 1 Ages_6+ NA NA Foil pack <NA> 6 Small
## 2 Ages_6+ NA NA Foil pack <NA> 6 Small
## 3 Ages_4+ 37 NA Box <NA> 28 Small
## 4 Ages_4+ 37 NA Box <NA> 36 Small
## 5 Ages_6+ NA NA Foil pack <NA> 10 Small
## 6 Ages_6+ NA NA Foil pack <NA> 9 Small
max(lego_clean$pieces)
## [1] 6020
max(lego_clean$price)
## [1] 699.99
summary(lego_clean)
## item_number set_name theme pieces
## Min. :10260 Length:1056 Length:1056 Min. : 1.0
## 1st Qu.:40336 Class :character Class :character 1st Qu.: 101.8
## Median :42110 Mode :character Mode :character Median : 227.5
## Mean :50665 Mean : 432.1
## 3rd Qu.:72001 3rd Qu.: 496.2
## Max. :88014 Max. :6020.0
##
## price amazon_price year ages
## Min. : 1.99 Min. : 3.44 Min. :2018 Length:1056
## 1st Qu.: 14.99 1st Qu.: 19.95 1st Qu.:2018 Class :character
## Median : 29.99 Median : 37.33 Median :2019 Mode :character
## Mean : 46.33 Mean : 57.88 Mean :2019
## 3rd Qu.: 49.99 3rd Qu.: 69.96 3rd Qu.:2020
## Max. :699.99 Max. :699.95 Max. :2020
## NA's :232
## pages minifigures packaging weight
## Min. : 1.0 Min. : 1.000 Length:1056 Length:1056
## 1st Qu.: 40.0 1st Qu.: 2.000 Class :character Class :character
## Median : 80.0 Median : 3.000 Mode :character Mode :character
## Mean : 104.6 Mean : 3.206
## 3rd Qu.: 132.0 3rd Qu.: 4.000
## Max. :1527.0 Max. :28.000
## NA's :78 NA's :289
## unique_pieces size
## Min. : 1.0 Length:1056
## 1st Qu.: 50.0 Class :character
## Median : 100.0 Mode :character
## Mean : 130.4
## 3rd Qu.: 175.2
## Max. :1067.0
##
lego_byPrice <- lego_clean |>
  arrange(desc(price))
head(lego_byPrice)
## item_number set_name theme pieces price amazon_price
## 1 75252 Imperial Star Destroyer Star Wars™ 4784 699.99 699.95
## 2 42100 Liebherr R 9800 Powered UP 4108 449.99 443.22
## 3 71043 Hogwarts Castle Harry Potter™ 6020 399.99 399.99
## 4 75978 Diagon Alley Harry Potter™ 5544 399.99 NA
## 5 10261 Roller Coaster Creator Expert 4124 379.99 379.95
## 6 42115 Lamborghini Sián FKP 37 Technic™ 3696 379.99 NA
## year ages pages minifigures packaging weight unique_pieces
## 1 2019 Ages_16+ 444 2 Box <NA> 445
## 2 2019 Ages_12+ 740 NA <NA> <NA> 221
## 3 2018 Ages_16+ 636 28 Box <NA> 624
## 4 2020 Ages_16+ NA 14 Box <NA> 1067
## 5 2018 Ages_16+ 440 11 Box 5.8Kg (12.78 lb) 556
## 6 2020 Ages_18+ 657 NA Box <NA> 293
## size
## 1 Small
## 2 Small
## 3 Small
## 4 Small
## 5 Small
## 6 Small
lego_byPieces <- lego_clean |>
  arrange(desc(pieces))
head(lego_byPieces)
## item_number set_name theme pieces price amazon_price
## 1 71043 Hogwarts Castle Harry Potter™ 6020 399.99 399.99
## 2 75978 Diagon Alley Harry Potter™ 5544 399.99 NA
## 3 75252 Imperial Star Destroyer Star Wars™ 4784 699.99 699.95
## 4 10261 Roller Coaster Creator Expert 4124 379.99 379.95
## 5 42100 Liebherr R 9800 Powered UP 4108 449.99 443.22
## 6 42082 Rough Terrain Crane Technic™ 4057 299.99 284.93
## year ages pages minifigures packaging weight unique_pieces
## 1 2018 Ages_16+ 636 28 Box <NA> 624
## 2 2020 Ages_16+ NA 14 Box <NA> 1067
## 3 2019 Ages_16+ 444 2 Box <NA> 445
## 4 2018 Ages_16+ 440 11 Box 5.8Kg (12.78 lb) 556
## 5 2019 Ages_12+ 740 NA <NA> <NA> 221
## 6 2018 Ages_11+ 1527 NA Box 6.1Kg (13.44 lb) 262
## size
## 1 Small
## 2 Small
## 3 Small
## 4 Small
## 5 Small
## 6 Small
lego_byUnique <- lego_clean |>
  arrange(desc(unique_pieces))
head(lego_byUnique)
## item_number set_name theme pieces price
## 1 75978 Diagon Alley Harry Potter™ 5544 399.99
## 2 70840 Welcome to Apocalypseburg! THE LEGO® MOVIE 2™ 3178 299.99
## 3 70657 NINJAGO City Docks NINJAGO® 3553 229.99
## 4 75222 Betrayal at Cloud City Star Wars™ 2812 349.99
## 5 71043 Hogwarts Castle Harry Potter™ 6020 399.99
## 6 80013 Monkie Kid's Team Secret HQ Monkie Kid 1105 169.99
## amazon_price year ages pages minifigures packaging weight
## 1 NA 2020 Ages_16+ NA 14 Box <NA>
## 2 NA 2019 Ages_16+ 452 13 Box <NA>
## 3 440.00 2018 Ages_12+ 380 14 Box 4.61Kg (10.15 lb)
## 4 668.18 2018 Ages_14+ 388 19 Box <NA>
## 5 399.99 2018 Ages_16+ 636 28 Box <NA>
## 6 NA 2020 Ages_10+ 556 7 <NA> <NA>
## unique_pieces size
## 1 1067 Small
## 2 692 Small
## 3 690 Small
## 4 676 Small
## 5 624 Small
## 6 622 Small
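The last step described above, checking whether the set with the most pieces is also the most expensive one, can be answered directly from the sorted tables. The sketch below is only an illustration: it copies five rows from the head(lego_byPieces) output instead of reading the full data, and the names top_sets and price_per_piece are introduced here and are not part of the original analysis.

```r
# Five rows copied from the head(lego_byPieces) output above
top_sets <- data.frame(
  set_name = c("Hogwarts Castle", "Diagon Alley", "Imperial Star Destroyer",
               "Roller Coaster", "Liebherr R 9800"),
  pieces   = c(6020, 5544, 4784, 4124, 4108),
  price    = c(399.99, 399.99, 699.99, 379.99, 449.99),
  stringsAsFactors = FALSE
)

# Set with the most pieces and set with the highest price
most_pieces    <- top_sets$set_name[which.max(top_sets$pieces)]
most_expensive <- top_sets$set_name[which.max(top_sets$price)]

most_pieces                    # "Hogwarts Castle"
most_expensive                 # "Imperial Star Destroyer"
most_pieces == most_expensive  # FALSE

# price_per_piece is a helper column added here for illustration only
top_sets$price_per_piece <- top_sets$price / top_sets$pieces
```

So the largest set is not the most expensive one, which already hints that piece count alone does not fully determine price.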
Regression Analysis
multiple_model <- lm(price ~ pieces + unique_pieces + year, data = lego_clean)
summary(multiple_model)
##
## Call:
## lm(formula = price ~ pieces + unique_pieces + year, data = lego_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.14 -8.77 -4.12 3.06 320.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.276e+03 1.974e+03 -0.647 0.518
## pieces 7.116e-02 1.990e-03 35.762 < 2e-16 ***
## unique_pieces 7.587e-02 1.114e-02 6.813 1.61e-11 ***
## year 6.348e-01 9.775e-01 0.649 0.516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.1 on 1052 degrees of freedom
## Multiple R-squared: 0.8003, Adjusted R-squared: 0.7997
## F-statistic: 1405 on 3 and 1052 DF, p-value: < 2.2e-16
Interpretation
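Intercept: The predicted price when all the predictors are 0 is about -1276. This value has no practical meaning, because a set with zero pieces in year 0 is far outside the range of the data; it only anchors the fitted equation. Its p-value (0.518) also shows that it is not significantly different from 0.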
Coefficients (slopes):
Pieces: The estimate is 0.0712, so each additional piece is associated with a price increase of about 0.07, holding unique pieces and year constant.
Unique Pieces: The estimate is 0.0759, so each additional unique piece is associated with a price increase of about 0.08, holding pieces and year constant.
Year: The estimate is 0.63, but its p-value (0.516) is high, so there is no evidence that year is related to price once the piece counts are accounted for.
P-values: Only pieces (p < 2e-16) and unique_pieces (p = 1.61e-11) are significant; year (p = 0.516) is not.
Adjusted R²: The adjusted R² is about 0.799, so roughly 80% of the variation in LEGO prices is explained by pieces, unique pieces, and year after adjusting for the number of predictors. This indicates a strong fit.
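To make the coefficients concrete, the fitted equation can be applied by hand. The sketch below copies the rounded estimates from the summary output, so the result is only approximate, and the set being predicted (500 pieces, 150 unique pieces, year 2019) is a hypothetical example rather than a row from the data.

```r
# Coefficients copied (rounded) from the summary(multiple_model) output
coefs <- c(intercept = -1276, pieces = 0.07116,
           unique_pieces = 0.07587, year = 0.6348)

# Hypothetical set, not a row from the data
new_set <- c(pieces = 500, unique_pieces = 150, year = 2019)

# price = intercept + sum of (slope * predictor value)
predicted_price <- coefs[["intercept"]] + sum(coefs[names(new_set)] * new_set)
round(predicted_price, 2)
```

With the fitted model object available, predict(multiple_model, newdata = data.frame(pieces = 500, unique_pieces = 150, year = 2019)) would do the same calculation with the unrounded coefficients.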
Model Assumptions and Diagnostics
plot(lego_clean$pieces, lego_clean$price,
     xlab = "Pieces", ylab = "Price", main = "Price vs Pieces")
# abline() can only draw a one-predictor fit, so use price ~ pieces for the line
abline(lm(price ~ pieces, data = lego_clean), col = 1, lwd = 2)
Interpretation
There is an obvious positive trend: sets with more pieces tend to have higher prices. The relationship looks roughly linear, but it is not perfect, because the points spread out for sets with a very large number of pieces, so the prices of those sets are predicted less accurately.
plot(resid(multiple_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)
Interpretation
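Residuals vs Order: For the first ~700 observations, the residuals scatter randomly around 0, but in the last ~200 observations there are larger spikes, both positive and negative. This indicates that the variability of the residuals is not constant across the data, which could affect the reliability of the model’s standard errors and predictions. Since the first rows of the data are the cheapest sets, the rows appear to be ordered roughly by price, so this pattern likely reflects larger errors for more expensive sets rather than a time effect.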
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
Residuals vs Fitted:
For lower fitted values the residuals form a flat cloud around 0, so linearity is mostly acceptable, but the plot suggests a non-linear pattern for larger sets. The residuals also spread out in a funnel shape as the fitted value increases, which means that LEGO sets with more pieces tend to have larger prediction errors than smaller ones.
Scale-Location:
Consistent with the funnel shape in the Residuals vs Fitted plot, the residuals are not homoscedastic. The model shows heteroscedasticity: the variance of the residuals increases for more expensive sets.
Q–Q plot:
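Both tails deviate from the reference line, with the stronger deviation in the upper (right) tail. This means the residuals are not perfectly normal, mainly because of a few large positive residuals. With 1056 observations this is only a mild concern for the coefficient tests, but prediction intervals for individual sets would be less reliable.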
Residuals vs Leverage:
Most sets have low leverage, which is good because no single one of them affects the model much. A few sets, however, have high leverage, and some points fall near the Cook’s distance curves, meaning that these observations might strongly affect the regression results. Expensive LEGO sets might therefore behave differently from the rest of the data set.
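The heteroscedasticity seen in the Scale-Location plot could also be checked numerically. The following is a minimal sketch of a studentized Breusch-Pagan test done by hand in base R: squared residuals are regressed on the predictors, and n times the R² of that auxiliary regression is compared with a chi-squared distribution. It runs on simulated data (sim is made up here, with an error spread that grows with pieces), not on lego_clean, so the numbers say nothing about the LEGO model itself.

```r
set.seed(1)

# Simulated data (an assumption for illustration): error sd grows with pieces
n   <- 500
sim <- data.frame(pieces = runif(n, 10, 3000))
sim$price <- 5 + 0.08 * sim$pieces + rnorm(n, sd = 0.02 * sim$pieces)

fit <- lm(price ~ pieces, data = sim)

# Auxiliary regression: squared residuals on the predictor(s)
sim$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ pieces, data = sim)

# Test statistic n * R^2, chi-squared with df = number of predictors in aux
lm_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

bp_p < 0.05  # a small p-value points to non-constant variance
```

For the model in this report the same steps would use resid(multiple_model) and the three predictors, with df = 3.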
predictors <- lego_clean[, c("pieces", "unique_pieces", "year")]
cor(predictors)
## pieces unique_pieces year
## pieces 1.00000000 0.771875514 0.026725822
## unique_pieces 0.77187551 1.000000000 0.007649649
## year 0.02672582 0.007649649 1.000000000
Interpretation
From this, we see a fairly high correlation of 0.77 between pieces and unique pieces. This means that the more pieces a set has, the more unique pieces it tends to have. Year is essentially uncorrelated with both pieces (0.03) and unique pieces (0.01).
Overall, the only strongly correlated pair of predictors is pieces and unique_pieces. A correlation of 0.77 corresponds to a variance inflation factor of roughly 1 / (1 - 0.77²) ≈ 2.5 for each of these two predictors, which is below the common rule-of-thumb cutoffs of 5 or 10. Therefore, multicollinearity is not a major concern for this model, although the two piece-count coefficients should be interpreted together.
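The multicollinearity conclusion can be backed up with variance inflation factors (VIFs). One way to get them, sketched below, uses the fact that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. The matrix is copied from the cor(predictors) output above; the names vars, R, and vif are introduced here for illustration.

```r
# Correlation matrix copied from the cor(predictors) output above
vars <- c("pieces", "unique_pieces", "year")
R <- matrix(c(1.00000000, 0.77187551,  0.02672582,
              0.77187551, 1.00000000,  0.007649649,
              0.02672582, 0.007649649, 1.00000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# VIF of each predictor = corresponding diagonal entry of the inverse of R
vif <- setNames(diag(solve(R)), vars)
round(vif, 2)
```

Values near 1 mean a predictor is almost unrelated to the others; values above the usual cutoffs of 5 or 10 would signal a problem.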
Conclusion and Future Directions
To summarize, our model predicts the prices of LEGO sets well, but mainly for sets at the lower end of the data in terms of price and pieces. For more expensive sets and sets with more pieces, the model is less accurate because the residual variance is higher in that range. Overall, however, the fit is strong: with an adjusted R² of 0.799, about 80% of the variation in LEGO prices is explained by pieces, unique pieces, and year.
This relates back to our research question: our findings show that the number of pieces in a LEGO set is strongly associated with its price, so sets with more pieces do cost more on average. However, it is important to note that the prices of sets with very many pieces are harder to predict. This could be due to factors that the model does not take into account, such as the theme; for example, sets from the Star Wars franchise might be more expensive because of high demand. Looking into those factors might improve the model’s accuracy, especially for sets with very large numbers of pieces. Having more observations of such large sets could also help.
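As a rough sketch of the future direction mentioned above, a theme effect could be tested by adding theme as a categorical predictor and comparing the two models. The example below uses simulated data in which one theme carries a built-in price premium. The data frame sim, the three theme labels, and the size of the premium are all assumptions made for illustration, so the output does not describe real LEGO prices.

```r
set.seed(42)

# Simulated data (illustration only): "Star Wars" gets a built-in premium of 30
n   <- 300
sim <- data.frame(
  pieces = round(runif(n, 20, 3000)),
  theme  = factor(sample(c("City", "Friends", "Star Wars"), n, replace = TRUE))
)
sim$price <- 5 + 0.08 * sim$pieces +
  ifelse(sim$theme == "Star Wars", 30, 0) + rnorm(n, sd = 10)

# Model without and with theme as a predictor
fit_base  <- lm(price ~ pieces, data = sim)
fit_theme <- lm(price ~ pieces + theme, data = sim)

adj_base  <- summary(fit_base)$adj.r.squared
adj_theme <- summary(fit_theme)$adj.r.squared

# F-test for whether theme improves the fit beyond pieces alone
anova(fit_base, fit_theme)
```

On the real data, the same comparison would start from multiple_model and add theme to its formula. The theme column has many levels, so it may need grouping first.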
References
“Population of Lego Sets for Sale between Jan. 1, 2018 and Sept. 11, 2020.” Data Sets, www.openintro.org/data/index.php?data=lego_population. Accessed 1 Dec. 2025.