I used the “lego_population” data from openintro.org/data, originally from Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3), 1-7.
The dataset included data on all lego sets for sale between January 1, 2018 and September 11, 2020. There were 1,304 observations and 14 variables: the item number, the name of the lego set, the theme, the number of pieces in the set, the recommended retail price of the set, the price of the set on Amazon, the year it was made, the recommended age range, the number of pages in the instruction manual, the number of minifigures, the type of packaging, the weight of the set, the number of unique pieces, and the size of the pieces.
# Load the packages used throughout: tidyverse for wrangling/plotting, tidymodels for modeling
library(tidyverse)
library(tidymodels)

legos <- read_csv("lego_population.csv", show_col_types = FALSE)
glimpse(legos)
## Rows: 1,304
## Columns: 14
## $ item_number <dbl> 41916, 41908, 11006, 11007, 41901, 41902, 41903, 41911, …
## $ set_name <chr> "Extra Dots - Series 2", "Extra Dots - Series 1", "Creat…
## $ theme <chr> "DOTS", "DOTS", "Classic", "Classic", "DOTS", "DOTS", "D…
## $ pieces <dbl> 109, 109, 52, 60, 33, 33, 33, 33, 33, 33, 33, 6, 7, 95, …
## $ price <dbl> 3.99, 3.99, 4.99, 4.99, 4.99, 4.99, 4.99, 4.99, 4.99, 4.…
## $ amazon_price <dbl> 3.44, 3.99, 4.93, 4.93, 4.99, 4.99, 4.99, 4.99, 4.99, 4.…
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ ages <chr> "Ages_6+", "Ages_6+", "Ages_4+", "Ages_4+", "Ages_6+", "…
## $ pages <dbl> NA, NA, 37, 37, NA, NA, NA, NA, NA, NA, NA, 3, 3, 40, 1,…
## $ minifigures <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N…
## $ packaging <chr> "Foil pack", "Foil pack", "Box", "Box", "Foil pack", "Fo…
## $ weight <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "0.15Kg (0.3…
## $ unique_pieces <dbl> 6, 6, 28, 36, 10, 9, 9, 12, 10, 9, 12, 6, 7, 52, 1, 10, …
## $ size <chr> "Small", "Small", "Small", "Small", "Small", "Small", "S…
Below is the distribution of the number of pieces.
ggplot(legos, aes(x = pieces)) +
  geom_density(size = 2, na.rm = TRUE) +
  theme_classic() +
  labs(x = "Pieces",
       y = "Density")
This distribution is heavily right-skewed, and linear regression works poorly with a response this skewed (the model assumes roughly normal residuals), so I took the log of the number of pieces to get a more symmetric, roughly normal distribution. I also removed all rows with NAs, which reduced the sample to 272 observations, and I converted weight to a numeric value so it could be used in the analysis. Below is a glimpse of the updated data.
legos <- drop_na(legos)
legos <- legos %>%
  mutate(weight = str_remove(weight, "K.*")) %>%  # strip the "Kg (...)" suffix so weight parses as a number
  mutate(weight = as.numeric(weight)) %>%
  mutate(log_pieces = log(pieces))
glimpse(legos)
## Rows: 272
## Columns: 15
## $ item_number <dbl> 60239, 41452, 10900, 60212, 60251, 60206, 60219, 60255, …
## $ set_name <chr> "Police Patrol Car", "Prince Puppycorn Trike", "Police B…
## $ theme <chr> "City", "Unikitty!™", "DUPLO®", "City", "City", "City", …
## $ pieces <dbl> 92, 101, 8, 64, 55, 54, 88, 62, 67, 89, 74, 84, 84, 126,…
## $ price <dbl> 9.99, 9.99, 9.99, 9.99, 9.99, 9.99, 9.99, 9.99, 9.99, 9.…
## $ amazon_price <dbl> 8.35, 8.95, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.99, 8.…
## $ year <dbl> 2019, 2018, 2019, 2019, 2020, 2019, 2019, 2020, 2020, 20…
## $ ages <chr> "Ages_5+", "Ages_5-12", "Ages_2+", "Ages_4+", "Ages_5+",…
## $ pages <dbl> 36, 56, 10, 28, 32, 36, 36, 36, 40, 48, 52, 56, 56, 60, …
## $ minifigures <dbl> 1, 3, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1, 3, 3, 1, 1,…
## $ packaging <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box", …
## $ weight <dbl> 0.15, 0.15, 0.18, 0.13, 0.14, 0.14, 0.16, 0.10, 0.12, 0.…
## $ unique_pieces <dbl> 52, 65, 7, 45, 34, 45, 50, 34, 48, 55, 55, 52, 49, 83, 3…
## $ size <chr> "Small", "Small", "Large", "Small", "Small", "Small", "S…
## $ log_pieces <dbl> 4.521789, 4.615121, 2.079442, 4.158883, 4.007333, 3.9889…
The distribution of the log of the number of pieces is shown below. As you can see, the new distribution looks much closer to normal, so we’ll be making predictions on log_pieces.
ggplot(legos, aes(x = log_pieces)) +
  geom_density(size = 2) +
  theme_classic() +
  labs(x = "Log(Pieces)",
       y = "Density")
I will be using linear regression to make predictions on the response (Y) variable log_pieces. I started by splitting the data into 80% training (217 observations) and 20% testing (55 observations).
set.seed(1309413)
legos_split <- initial_split(legos, prop = 0.8)
legos_train <- training(legos_split)
dim(legos_train)
## [1] 217 15
legos_test <- testing(legos_split)
dim(legos_test)
## [1] 55 15
For my first model, I used seven explanatory (X) variables: price, amazon price, year, pages, minifigures, weight, and unique_pieces. Size was a near-zero variance variable, so I let the step_nzv() filter in my recipe remove it. I also decided not to include ages, theme, or packaging, since they were categorical variables with too many levels to be helpful predictors.
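One quick way to see why size gets filtered out (a check along these lines isn't shown in the original code) is to tabulate it on the training data; nearly every remaining set should fall in a single level, which is exactly the kind of imbalance step_nzv() removes.

legos_train %>%
  count(size, sort = TRUE)   # expect one dominant level (e.g. "Small")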
legos_mod1 <- linear_reg() %>%
  set_engine("lm")
legos_mod1
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
legos_rec1 <- recipe(log_pieces ~ ., data = legos_train) %>%
  update_role(item_number, new_role = "ID") %>%
  update_role(set_name, new_role = "ID") %>%
  step_rm(ages, theme, packaging, pieces) %>%
  step_dummy(all_nominal(), -item_number, -set_name) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
legos_rec1
## Recipe
##
## Inputs:
##
## role #variables
## ID 2
## outcome 1
## predictor 12
##
## Operations:
##
## Variables removed ages, theme, packaging, pieces
## Dummy variables from all_nominal(), -item_number, -set_name
## Zero variance filter on all_predictors()
## Sparse, unbalanced variable filter on all_predictors()
legos_wflow1 <- workflow() %>%
  add_model(legos_mod1) %>%
  add_recipe(legos_rec1)
legos_wflow1
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_rm()
## • step_dummy()
## • step_zv()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
legos_fit1 <- legos_wflow1 %>%
  fit(data = legos_train)
tidy(legos_fit1)
Below is the model equation. Based on the signs of the coefficients, we see that the log number of pieces increases with increases in price, amazon_price, pages, and unique_pieces, while the log number of pieces decreases with increases in year, minifigures, and weight.
\[\widehat{log \ pieces} = 161.216 + 0.010 \times price + 0.002 \times amazon\ price - 0.078 \times year\] \[ + 0.003 \times pages - 0.037 \times minifigures - 0.987 \times weight + 0.008 \times unique\ pieces\]
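Since the model predicts log(pieces), a fitted value has to be exponentiated to get back to an actual piece count. As a rough worked example using the rounded coefficients above (the input values here are made up purely for illustration):

# Hypothetical set: $19.99 list price, $18.99 on Amazon, released in 2019,
# 60 instruction pages, 2 minifigures, 0.3 kg, 80 unique pieces
log_pred <- 161.216 + 0.010 * 19.99 + 0.002 * 18.99 - 0.078 * 2019 +
  0.003 * 60 - 0.037 * 2 - 0.987 * 0.3 + 0.008 * 80
exp(log_pred)   # back-transform from log(pieces) to pieces: roughly 83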
I performed 5-fold cross validation to evaluate Model #1 on the training data. I got an average RMSE of 0.461 and an average R-squared of 0.81, which means that 81% of the variation in the response variable was explained by the model.
set.seed(567381)
folds <- vfold_cv(legos_train, v = 5)
legos_fit_rs1 <- legos_wflow1 %>%
  fit_resamples(folds)
collect_metrics(legos_fit_rs1)
collect_metrics(legos_fit_rs1, summarize = FALSE)
My second model used all of the same variables except for unique_pieces, since that was by far the strongest predictor and I wanted to see how well a model without that variable could make predictions. Therefore, this model used six explanatory (X) variables: price, amazon price, year, pages, minifigures, and weight.
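One way to confirm that unique_pieces dominates the first model (this check isn't part of the original write-up) is to sort the Model #1 coefficients by the magnitude of their t-statistics; unique_pieces should sit at or near the top.

tidy(legos_fit1) %>%
  arrange(desc(abs(statistic)))   # largest |t| first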
legos_mod2 <- linear_reg() %>%
  set_engine("lm")
legos_mod2
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
legos_rec2 <- recipe(log_pieces ~ ., data = legos_train) %>%
  update_role(item_number, new_role = "ID") %>%
  update_role(set_name, new_role = "ID") %>%
  step_rm(ages, theme, packaging, pieces, unique_pieces) %>%
  step_dummy(all_nominal(), -item_number, -set_name) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
legos_rec2
## Recipe
##
## Inputs:
##
## role #variables
## ID 2
## outcome 1
## predictor 12
##
## Operations:
##
## Variables removed ages, theme, packaging, pieces, unique_pieces
## Dummy variables from all_nominal(), -item_number, -set_name
## Zero variance filter on all_predictors()
## Sparse, unbalanced variable filter on all_predictors()
legos_wflow2 <- workflow() %>%
  add_model(legos_mod2) %>%
  add_recipe(legos_rec2)
legos_wflow2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_rm()
## • step_dummy()
## • step_zv()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
legos_fit2 <- legos_wflow2 %>%
  fit(data = legos_train)
tidy(legos_fit2)
Below is the model equation. Based on the signs of the coefficients, we see that the log number of pieces increases with increases in price, amazon_price, and pages, while the log number of pieces decreases with increases in year, minifigures, and weight.
\[\widehat{log \ pieces} = 129.510 + 0.009 \times price + 0.005 \times amazon\ price - 0.062 \times year\] \[ + 0.008 \times pages - 0.017 \times minifigures - 0.669 \times weight\]
To evaluate Model #2, I used the same folds to perform cross validation again, and got an RMSE of 0.588 (higher than Model #1's 0.461) and an R-squared of 0.714 (lower than Model #1's 0.810), which suggests that this model was somewhat weaker.
set.seed(567381)
folds <- vfold_cv(legos_train, v = 5)
legos_fit_rs2 <- legos_wflow2 %>%
  fit_resamples(folds)
collect_metrics(legos_fit_rs2)
collect_metrics(legos_fit_rs2, summarize = FALSE)
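To compare the two models more directly, the summarized cross-validation metrics can be stacked into a single table (a convenience step, not part of the original code):

# Combine the summarized CV metrics from both workflows into one table
bind_rows(
  collect_metrics(legos_fit_rs1) %>% mutate(model = "Model 1"),
  collect_metrics(legos_fit_rs2) %>% mutate(model = "Model 2")
) %>%
  select(model, .metric, mean, std_err)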
The winning model was Model #1, so I used it to make predictions on the testing data. This led to an RMSE of 0.472 and an R-squared of 0.763, which means that on the testing data, about 76.3% of the variation in the response variable was explained by the model.
legos_test_pred1 <- predict(legos_fit1, legos_test) %>%
  bind_cols(legos_test %>% select(log_pieces, set_name))
legos_test_pred1
yardstick::rmse(legos_test_pred1, truth = log_pieces, estimate = .pred)
yardstick::rsq(legos_test_pred1, truth = log_pieces, estimate = .pred)
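The two yardstick calls above could also be combined into a single metric set, which returns both test-set metrics in one tibble (an equivalent alternative to the two calls shown):

# Compute RMSE and R-squared on the test predictions in one call
lego_metrics <- yardstick::metric_set(yardstick::rmse, yardstick::rsq)
lego_metrics(legos_test_pred1, truth = log_pieces, estimate = .pred)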
Below, I visualized Model #1's predictions on all of the data in orange, with the actual data points in black. I plotted these against the number of pages, since pages was a good predictor and spreads the data out reasonably well.
legos_pred <- legos %>%
  select(log_pieces, pieces, weight, pages) %>%
  bind_cols(predict(legos_fit1, legos)) %>%
  bind_cols(predict(legos_fit1, legos, type = "pred_int"))
legos_pred
ggplot(legos_pred) +
  geom_point(aes(x = pages, y = log_pieces)) +
  geom_line(aes(x = pages, y = .pred), color = "darkorange", alpha = 0.7, size = 1) +
  theme_minimal() +
  labs(title = "Actual (Black) and Predicted (Orange) Number of Pieces in Lego Sets",
       x = "Pages in Instruction Manual",
       y = "Log(Pieces)")
One limitation of this research is that the dataset only included lego sets sold between 2018 and 2020, so we cannot generalize to sets from other years. Additionally, many rows of the dataset contained NA values and were removed for analysis, bringing the sample size from 1,304 down to 272.
Future research could use data with more variation in set size (for example, by including more years of sets) so that size could be included as a potentially valuable predictor. Future models could also predict other variables, such as price, weight, or pages. And with more years of data, it would be interesting to investigate how trends in certain variables change over time.