LEGO®

Analyzing the price, brick by brick

Author

Joel Sandbäck, Alva Rolandson, Klara Persson, Porsche Thichan

Introduction

We chose to study the “lego_sample” dataset from OpenIntro because we found it interesting and because LEGO is a subject that most people are familiar with.

In this study we use the variables from the table below that suit our subject: “pieces”, “price”, “unique_pieces”, “theme”, “ages” and “set_name”.

Rows: 75
Columns: 14
$ item_number   <dbl> 10859, 10860, 10862, 10864, 10867, 10870, 10872, 10875, …
$ set_name      <chr> "My First Ladybird", "My First Race Car", "My First Cele…
$ theme         <chr> "DUPLO®", "DUPLO®", "DUPLO®", "DUPLO®", "DUPLO®", "DUPLO…
$ pieces        <dbl> 6, 6, 41, 71, 26, 16, 26, 105, 38, 37, 23, 15, 34, 21, 9…
$ price         <dbl> 4.99, 4.99, 14.99, 49.99, 19.99, 9.99, 24.99, 119.99, 29…
$ amazon_price  <dbl> 16.00, 9.45, 39.89, 56.69, 36.99, 9.99, 21.99, 128.95, 7…
$ year          <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 20…
$ ages          <chr> "Ages_1½-3", "Ages_1½-3", "Ages_1½-3", "Ages_2-5", "Ages…
$ pages         <dbl> 9, 9, 9, 32, 9, 8, 16, 64, 20, 24, 16, 9, 10, 10, 8, 40,…
$ minifigures   <dbl> NA, NA, NA, 2, 3, NA, 1, 3, 3, 1, NA, NA, NA, 2, NA, 3, …
$ packaging     <chr> "Box", "Box", "Box", "Plastic box", "Box", "Box", "Box",…
$ weight        <chr> NA, "0.13Kg (0.29 lb)", NA, "1.41Kg (3.11 lb)", NA, NA, …
$ unique_pieces <dbl> 5, 6, 18, 49, 18, 13, 10, 68, 29, 26, 9, 15, 31, 16, 9, …
$ size          <chr> "Large", "Large", "Large", "Large", "Large", "Large", "L…
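
The overview above has the shape of the output that `glimpse()` from dplyr prints. A minimal sketch of how it could be reproduced, assuming the openintro and dplyr packages are already installed:

```r
library(openintro) # provides the lego_sample dataset
library(dplyr)     # provides glimpse()

# Print one line per column: name, type and the first few values
glimpse(lego_sample)

# The dimensions should match the "Rows" and "Columns" lines above
dim(lego_sample)
```

The `<dbl>` and `<chr>` tags show which columns are numeric and which are text, which matters later when only numeric columns are selected.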

Hypothesis

“The price of LEGO sets is determined by the number of pieces and the theme of the set, with more complex sets tending to be more expensive.”

We chose this hypothesis because it’s intuitive that LEGO set prices are influenced by complexity, both in terms of piece count and theme. Larger sets with more pieces typically require more time and resources to design and produce, leading to higher costs.

Method

We use data visualization to examine the “lego_sample” dataset from the openintro package. With the help of tidyverse (which includes ggplot2 and dplyr), GGally and plotly, we create plots and charts and then judge whether they support our hypothesis.

First we need to install and load the necessary packages:

install.packages("openintro")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("GGally")
install.packages("plotly")
install.packages("dplyr")

library(openintro)
library(tidyverse)
library(ggplot2)
library(GGally)
library(plotly)
library(dplyr)
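
Running `install.packages()` for every package each time the report is rendered is slow. A small sketch of one way to install only what is missing; the helper name `missing_packages` is our own invention and not part of any package:

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("openintro", "tidyverse", "GGally", "plotly")
to_install <- missing_packages(needed)

# Only install in an interactive session, so rendering never triggers a download
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

ggplot2 and dplyr are left out of `needed` because they are installed together with tidyverse.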

After we have loaded these we can examine our dataset by writing this in the console.

view(lego_sample) # Open the dataset "lego_sample" as a table

After viewing the dataset we can begin to form the visualization.

Visualization

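We began by exploring the correlation between all numeric variables in the LEGO dataset. This can help us identify relationships between price, number of pieces, number of unique pieces, and the other numeric variables. Note that “weight” is stored as text in this dataset, so it is not included in the correlogram.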

lego_sample |> # Specify the dataset we want to use (in our case "lego_sample")
  select_if(is.numeric) |> # Select all variables with numeric data
  ggcorr() # Create a correlogram with the selected variables from above
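
A correlogram shows the correlations as colors. To read the actual numbers, the same kind of correlation matrix can be computed with base R. This is a sketch that assumes pairwise-complete Pearson correlations, so that the missing values in “minifigures” do not remove whole rows:

```r
library(openintro)

# Keep only the numeric columns (this drops text columns such as weight)
numeric_cols <- lego_sample[sapply(lego_sample, is.numeric)]

# Pearson correlations; missing values are skipped pair by pair
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")

# Correlation of every numeric variable with price, rounded for reading
round(cor_matrix["price", ], 2)
```

Values close to 1 or -1 indicate a strong linear relationship with price, values close to 0 a weak one.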

After this we wanted to see the differences between the themes, so we made three separate correlograms, one per theme.

Correlation for “Friends”

lego_sample |>
  filter(theme %in% c("Friends")) |> # Filter to only include sets with theme "Friends"
  select_if(is.numeric) |>
  ggcorr()

Correlation for “City”

lego_sample |>
  filter(theme %in% c("City")) |> # Same as code above but with theme "City"
  select_if(is.numeric) |>
  ggcorr()

Correlation for “DUPLO®”

lego_sample |>
  filter(theme %in% c("DUPLO®")) |> # Same as code above but with theme "DUPLO®"
  select_if(is.numeric) |>
  ggcorr()

We found these three correlograms hard to follow and compare to each other, so we decided to move on to a different type of plot.
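
One way to make the themes easier to compare is to reduce each correlogram to the single number we care most about, the correlation between pieces and price. A minimal sketch using base R:

```r
library(openintro)

# Split the dataset into one data frame per theme
by_theme <- split(lego_sample, lego_sample$theme)

# Pearson correlation between pieces and price within each theme
price_piece_cor <- sapply(by_theme, function(d) {
  cor(d$pieces, d$price, use = "complete.obs")
})

round(price_piece_cor, 2)
```

This gives one value per theme that can be compared side by side, at the cost of hiding the other variables.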

Pieces and price

We decided to try a scatter plot instead. With it we can directly see the relationship between the number of pieces and the price. We also used “plotly” so that we can examine each individual data point.

MyPlot <- lego_sample |> # Store our plot in the variable MyPlot
  ggplot(aes(x = pieces, y = price, text = set_name, color = theme)) + # Plot pieces on the x-axis and price on the y-axis, and color the points by theme
  geom_point() # Create a scatter plot using the ggplot and aes from above

ggplotly(MyPlot)

This was much more pleasing to the eye and more helpful for the analysis. But we felt we needed to see the trend of each theme more clearly, so we used “geom_smooth” with the method “lm” to draw a straight line that represents the trend.

MyPlot <- lego_sample |>
  ggplot(aes(x = pieces, y = price, color = theme)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # Fit a straight line to the points. Since color is mapped to theme, one line is drawn per theme
  labs(
    title = "Relationship between LEGO set price and number of pieces",
    x = "Number of pieces",
    y = "Price ($)",
    color = "Theme"
  )

ggplotly(MyPlot)

`geom_smooth()` using formula = 'y ~ x'

With the help of the lines we can see that DUPLO®’s trend line is much steeper than those of Friends and City. We can also see that the price of DUPLO® sets is quite high even though their number of pieces is considerably lower than for Friends and City. This suggests that more factors than the number of pieces affect the price of each set. To check this suspicion, let’s look at the average trend of all three themes combined.
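
The steepness of each trend line can also be read as a number. The sketch below fits the same kind of straight line that `geom_smooth(method = "lm")` draws, once per theme, and extracts the slope, which is the estimated price increase in dollars per additional piece:

```r
library(openintro)

# Fit price ~ pieces separately for each theme and keep the slope
theme_slopes <- sapply(split(lego_sample, lego_sample$theme), function(d) {
  fit <- lm(price ~ pieces, data = d)
  coef(fit)[["pieces"]]
})

round(theme_slopes, 3)
```

A larger slope means a steeper line in the scatter plot.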

MyPlot <- lego_sample |>
  ggplot(aes(x = pieces, y = price, text = set_name, color = theme, group = 1)) + # "group = 1" treats all themes as one group
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.5) + # Draw a thin black straight line. Since we grouped the themes, we only get one line for the combined trend
  labs(
    title = "Relationship between LEGO set price and number of pieces",
    x = "Number of pieces",
    y = "Price ($)",
    color = "Theme"
  )

ggplotly(MyPlot)

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation: text.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

We already suspected this, but the plot now supports it: DUPLO® lies well above the combined line. The trend lines of Friends and City are close in steepness to the combined line, even if they are not exactly as steep.
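
How far each theme sits above or below the combined line can be estimated from the residuals of a single linear model. This is a sketch under the assumption that the black line corresponds to `lm(price ~ pieces)` fitted on all sets:

```r
library(openintro)

# One straight line for all themes together
# na.exclude keeps the residuals aligned with the rows of lego_sample
combined_fit <- lm(price ~ pieces, data = lego_sample, na.action = na.exclude)

# Average vertical distance from the line, per theme, in dollars
mean_residual <- tapply(residuals(combined_fit), lego_sample$theme, mean, na.rm = TRUE)

round(mean_residual, 2)
```

A positive value means that the sets of that theme cost more on average than the combined line predicts for their piece count, and a negative value means they cost less.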

Complexity

We have defined complexity as the number of unique pieces in each set, as we expect it to correlate with build difficulty. We have also based the complexity rating on the suggested age rating of the sets, since sets aimed at older builders are more complex.

We decided to use a violin plot to visualize the distribution of unique pieces in each theme.

lego_sample |>
  ggplot(aes(x = theme, y = unique_pieces, fill = theme)) +
  geom_violin(alpha = 0.6) + # Draw a violin plot from the mapping above
  labs(
    title = "Density distribution of unique pieces by LEGO theme",
    x = "Theme",
    y = "Number of unique pieces",
    fill = "Theme"
  )
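
The shape of each violin can be backed up with a few numbers. A minimal sketch that computes the quartiles of unique pieces per theme with base R:

```r
library(openintro)

# 25%, 50% (median) and 75% quantiles of unique_pieces within each theme
unique_summary <- sapply(
  split(lego_sample$unique_pieces, lego_sample$theme),
  quantile,
  probs = c(0.25, 0.5, 0.75),
  na.rm = TRUE
)

round(unique_summary)
```

Each column is a theme and each row a quantile, so the medians can be compared directly across themes.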

We then created a boxplot of price by suggested age group, colored by theme. It shows which age groups each theme’s sets are aimed at, and it let us see that DUPLO® is indeed aimed at a younger audience, which we take to mean that its sets are less complex.

lego_sample |>
  ggplot(aes(x = ages, y = price, color = theme)) +
  geom_boxplot() + # Create a boxplot with plot created above
  labs(
    title = "Price distribution of LEGO sets by age group", 
    x = "Age group", 
    y = "Price ($)",
    color = "Theme"
  ) +
  # Replace underscores with spaces in the x-axis labels
  scale_x_discrete(labels = function(x) gsub("_", " ", x)) +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1)
  )
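
Which age groups belong to which theme can also be checked with a simple cross-table. A small sketch using base R:

```r
library(openintro)

# Count the sets for every combination of theme (rows) and age group (columns)
age_by_theme <- table(lego_sample$theme, lego_sample$ages)

age_by_theme
```

A theme whose counts fall only in the youngest age groups is aimed at a younger audience.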

Average price per piece

Lastly we wanted to further confirm that DUPLO®’s price per piece is higher than the other themes.

This column chart shows the average price per piece for each LEGO theme, highlighting the themes with the highest cost per piece. The data reveals how pricing varies across different themes.

lego_sample |>
  mutate(price_per_piece = price / pieces) |> # create a new variable with value of price divided by pieces
  group_by(theme) |>
  summarise(
    avg_price_per_piece = mean(price_per_piece, na.rm = TRUE), 
    count = n()
  ) |> # Average price_per_piece within each theme, so that themes with different numbers of sets can be compared fairly
  arrange(desc(avg_price_per_piece)) |>
  ggplot(aes(x = reorder(theme, avg_price_per_piece), y = avg_price_per_piece, fill = theme)) + # "reorder" sorts the columns by height, from the least to the most expensive
  geom_col() + # Draw a column chart from the mapping above
  labs(
    title = "Most expensive LEGO themes ($ per piece)",
    x = "Theme",
    y = "Average price per piece"
  )
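
There are two common ways to compute a price per piece for a theme, and they can give different numbers. The chart uses the mean of the per-set ratios. The sketch below puts it next to the ratio of totals (all dollars divided by all pieces), assuming that price and pieces have no missing values:

```r
library(openintro)
library(dplyr)

price_per_piece_summary <- lego_sample |>
  group_by(theme) |>
  summarise(
    # Every set counts equally, so small sets weigh as much as large ones
    mean_of_ratios = mean(price / pieces, na.rm = TRUE),
    # Every piece counts equally, so large sets dominate
    ratio_of_totals = sum(price, na.rm = TRUE) / sum(pieces, na.rm = TRUE),
    sets = n()
  )

price_per_piece_summary
```

If the two columns rank the themes in the same order, the conclusion drawn from the chart does not depend on which definition is used.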

After examining this chart we can move on to the conclusion.

Conclusion

Lastly we need to evaluate our hypothesis, which is: “The price of LEGO sets is determined by the number of pieces and the theme of the set, with more complex sets tending to be more expensive”.

After examining our visualizations we conclude that the hypothesis does not hold, as the price of DUPLO® sets averages around the same as the other themes despite their lower complexity and lower piece count. This means that other factors, such as size or weight, may affect the price of each set. Further study is required to understand which factors really affect the price.
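
The averages behind this conclusion can be checked with a short summary per theme. A minimal sketch using base R:

```r
library(openintro)

# Mean price, mean piece count and number of sets for each theme
theme_summary <- sapply(split(lego_sample, lego_sample$theme), function(d) {
  c(
    mean_price = mean(d$price, na.rm = TRUE),
    mean_pieces = mean(d$pieces, na.rm = TRUE),
    sets = nrow(d)
  )
})

# Transpose so that each row is a theme
round(t(theme_summary), 2)
```

Similar mean prices together with clearly different mean piece counts would be consistent with what the plots show.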

Reflection

Overall, the project went well. One of the main challenges was at the beginning, figuring out which visualizations best supported our hypothesis. It took some time to determine which aspects of the data were most relevant, such as number of pieces, theme, price, and age groups. Once that was clear, the visualizations fell into place. They complemented each other and helped build a clear story around our hypothesis: that more complex LEGO sets tend to be more expensive.

Another challenge arose when we were writing the scatter plot with three lines that represent the price increase for each theme. Since we wanted the plot to be interactive, we decided to use plotly on the scatter plot. We wanted each point to show the exact price, pieces, theme and name of the set, but we found that the lines did not appear if we wrote “text = set_name” in ggplot’s aes. We tested and found that it works if the scatter plot has no geom_smooth at all, or if we only want one line that combines all the themes and represents the overall price increase. We did not find a solution, so we chose to move forward and left out the set_name for each point in that scatter plot.
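
A possible workaround, which is not used in the plots of this report, is to map “text” only inside geom_point instead of in the main aes. The sketch rests on the assumption that the lines disappear because set_name in the main aes makes ggplot treat every set as its own group, which leaves geom_smooth with single points to fit:

```r
library(openintro)
library(ggplot2)
library(plotly)

MyPlot <- lego_sample |>
  ggplot(aes(x = pieces, y = price, color = theme)) +
  # Assumption: mapping text only in this layer keeps set_name out of the
  # grouping used by geom_smooth. ggplot2 may warn about an unknown aesthetic.
  geom_point(aes(text = set_name)) +
  geom_smooth(method = "lm", se = FALSE) + # Still one straight line per theme
  labs(
    title = "Relationship between LEGO set price and number of pieces",
    x = "Number of pieces",
    y = "Price ($)",
    color = "Theme"
  )

# The hover label of each point should now include the set name
ggplotly(MyPlot)
```

ggplot2 itself does not draw the text aesthetic, so the warning about it can be ignored here; plotly picks it up for the hover label.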