We will be using these packages :
# Importing and wrangling data
library(readr)
library(tidyr)
library(dplyr)
# EDA
library(skimr)
# For plotting
library(ggplot2)
# Modelling
library(ggfortify)
library(rstatix)
library(ggpubr)
library(broom)And we will be using my usual custom ggplot theme :
source("https://raw.githubusercontent.com/davidcarayon/TidyTuesdaySubmissions/master/R/themes.R")
# And adding some IKEA specificities
ikea_yellow <- "#F7E700"
ikea_blue <- "#273FAE"
theme_update(panel.background = element_rect(fill = ikea_blue))So this week’s data is about IKEA furniture. First, we load the data. As there is only one dataframe this week, we’ll read the data manually :
Let’s have a look at the data structure :
## Rows: 3,694
## Columns: 14
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ item_id <chr> "90420332", "368814", "9333523", "80155205", "30180…
## $ name <chr> "FREKVENS", "NORDVIKEN", "NORDVIKEN / NORDVIKEN", "…
## $ category <chr> "Bar furniture", "Bar furniture", "Bar furniture", …
## $ price <dbl> 265, 995, 2095, 69, 225, 345, 129, 195, 129, 2176, …
## $ old_price <chr> "No old price", "No old price", "No old price", "No…
## $ sellable_online <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ link <chr> "https://www.ikea.com/sa/en/p/frekvens-bar-table-in…
## $ other_colors <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "N…
## $ short_description <chr> "Bar table, in/outdoor, 51x51 cm", "Bar ta…
## $ designer <chr> "Nicholai Wiig Hansen", "Francis Cayouette", "Franc…
## $ depth <dbl> NA, NA, NA, 50, 60, 45, 44, 50, 44, NA, 44, 45, 47,…
## $ height <dbl> 99, 105, NA, 100, 43, 91, 95, NA, 95, NA, 103, 102,…
## $ width <dbl> 51, 80, NA, 60, 74, 40, 50, 50, 50, NA, 52, 40, 46,…
We can now have a deeper look into data structure. I especially like the skimr::skim() function for this :
| Name | ikea |
| Number of rows | 3694 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| logical | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| item_id | 0 | 1 | 5 | 8 | 0 | 2962 | 0 |
| name | 0 | 1 | 3 | 27 | 0 | 607 | 0 |
| category | 0 | 1 | 4 | 36 | 0 | 17 | 0 |
| old_price | 0 | 1 | 4 | 13 | 0 | 365 | 0 |
| link | 0 | 1 | 52 | 163 | 0 | 2962 | 0 |
| other_colors | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| short_description | 0 | 1 | 3 | 63 | 0 | 1706 | 0 |
| designer | 0 | 1 | 3 | 1261 | 0 | 381 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| sellable_online | 0 | 1 | 0.99 | TRU: 3666, FAL: 28 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 0 | 1.00 | 1846.50 | 1066.51 | 0 | 923.25 | 1846.5 | 2769.75 | 3693 | ▇▇▇▇▇ |
| price | 0 | 1.00 | 1078.21 | 1374.65 | 3 | 180.90 | 544.7 | 1429.50 | 9585 | ▇▁▁▁▁ |
| depth | 1463 | 0.60 | 54.38 | 29.96 | 1 | 38.00 | 47.0 | 60.00 | 257 | ▇▃▁▁▁ |
| height | 988 | 0.73 | 101.68 | 61.10 | 1 | 67.00 | 83.0 | 124.00 | 700 | ▇▂▁▁▁ |
| width | 589 | 0.84 | 104.47 | 71.13 | 1 | 60.00 | 80.0 | 140.00 | 420 | ▇▅▂▁▁ |
It seems that each piece of furniture has a single ID and a name. These furniture are described by a qualitative variable, category, and a few quantitative variables such as price, depth, height or width. We already can see some NA’s that will have to be taken into account for data analysis.
We can also note that the total number of rows (3694) is larger than the number of unique furniture id’s (2962). We can try to explore why :
# Let's find the duplicates
duplicates <- ikea %>%
group_by(item_id) %>%
count() %>%
filter(n>1) %>%
pull(item_id)
# Filter by duplicates
ikea %>% filter(item_id %in% duplicates) %>%
arrange(item_id) %>%
select(X1:price) %>%
head()| X1 | item_id | name | category | price |
|---|---|---|---|---|
| 1769 | 10091453 | TROFAST | Children’s furniture | 275 |
| 1897 | 10091453 | TROFAST | Nursery furniture | 275 |
| 5 | 10122647 | INGOLF | Bar furniture | 345 |
| 1214 | 10122647 | INGOLF | Chairs | 345 |
| 1757 | 10201673 | SUNDVIK | Children’s furniture | 175 |
| 2801 | 10201673 | SUNDVIK | Tables & desks | 175 |
The mismatch between item ID and the number of rows is due to multiple categories being assigned to the same item, creating multiple lines for the same object. This could eventually be fixed by aggregating these lines with a paste(collapse = ", ") but we will keep the original structure for now.
Speaking about categories, let’s see which categories are the most sold by IKEA :
ikea %>%
group_by(category) %>%
count() %>%
ggplot(aes(x = reorder(category,n), y = n)) +
geom_segment(aes(xend = reorder(category,n), y = 0, yend = n), color = ikea_yellow, size = 2) +
geom_point(shape = 21, size = 4, fill = ikea_yellow, color = "black") +
scale_y_continuous(breaks = seq(0,600,50)) +
coord_flip() +
labs(y = "# of items", x = "Categories")Then we can ask ourselves : Which categories are, on average, the most expensive ones ?
# We convert the prices in euros € using an approximate rate of 1SAR ~ 0.22EUR
ikea <- ikea %>%
mutate(euro_price = price * 0.22)
mean_prices <- ikea %>%
group_by(category) %>%
mutate(mean_price = mean(euro_price, na.rm=TRUE),
median_price = median(euro_price,na.rm=TRUE)) %>%
ungroup()
ggplot(mean_prices,aes(x = reorder(category,median_price), y = euro_price)) +
stat_boxplot(geom = "errorbar", width = 0.3, color = ikea_yellow) +
geom_boxplot(fill = ikea_yellow, color = "black", outlier.color = ikea_yellow) +
coord_flip() +
scale_y_continuous(labels = scales::dollar_format(suffix = "€", prefix = "")) +
labs(x = "Categories", y = "Price in euros €")So, wardrobes, sofas and beds are usually the most expensive pieces of furniture sold by IKEA. But we can clearly see here that there are a lot of outliers !
To practice our linear model skills, we’ll ask ourselves a very simple question : Do Sofas price depends on their width ?
Let’s have a first look to this relation with a simple scatterplot :
ikea_sofas <- ikea %>% filter(category == "Sofas & armchairs")
ggplot(ikea_sofas,aes(x = width, y = euro_price)) +
geom_point(color = ikea_yellow) +
scale_y_continuous(labels = scales::dollar_format(suffix = "€", prefix = "")) +
labs(x = "Sofas width (cm)", y ="Price (€)", title = paste0("IKEA Sofas and armchairs, N = ",nrow(ikea_sofas)," items"))Our linear model proposition seems appropriate. But first, we need to check the LINE conditions :
The linear model we are using is actually an Ordinary Least Squares (OLS) regression, which means that the algorithm used to find the “best fitting line” is the one that minimizes the square of the distance between each point and the line (i.e. the residual values) :
# First run the model
model <- lm(euro_price ~ width, data = ikea_sofas)
augment(model) %>%
ggplot(aes(x = width, y = euro_price)) +
geom_segment(aes(x = width, y = euro_price, yend = .fitted, xend = width), color = "red") +
geom_point(fill = ikea_yellow, color = "black",shape = 21) +
geom_line(aes(y = .fitted), color = ikea_yellow) +
scale_y_continuous(labels = scales::dollar_format(suffix = "€", prefix = "")) +
labs(x = "Sofas width (cm)", y = "Price (€)")We can now check our conditions using the autoplot.lm method from the {ggfortify} package :
Having a look at these graphs, we can see that the normality condition seems to be respected (the residuals are globally fitting the QQ-line). The equality of variances also seems respected when considering the residuals vs fitted graph (except a minor “banana” form). We can also see that the 365th data point seems to be an extreme/outlier value. For future analysis, we decide to remove this particular point :
We can consider that our model doesn’t violate the LINE conditions.
We can fit the model using the usual lm function :
##
## Call:
## lm(formula = euro_price ~ width, data = ikea_sofas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -555.76 -180.02 -6.16 140.54 764.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -208.2941 34.0872 -6.111 3.45e-09 ***
## width 4.4375 0.1642 27.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 247.6 on 270 degrees of freedom
## (155 observations deleted due to missingness)
## Multiple R-squared: 0.7301, Adjusted R-squared: 0.7291
## F-statistic: 730.4 on 1 and 270 DF, p-value: < 2.2e-16
What we can say about the model output :
width , the price increases on average by 4.44€With this model, we could try to predict the price of a new IKEA sofa, based on its width.
For example, we can try this with a new sofa with a width of 123 cm :
# We extract a single line from the dataset
new_data <- data.frame(width = 123)
new_data$predicted_price <- predict(model,newdata = new_data)
new_data| width | predicted_price |
|---|---|
| 123 | 337.5206 |
ggplot(ikea_sofas,aes(x = width, y = euro_price)) +
geom_point(color = ikea_yellow) +
geom_point(data = new_data, aes(x = width, y = predicted_price), size = 4, color = "red") +
scale_y_continuous(labels = scales::dollar_format(suffix = "€", prefix = "")) +
labs(x = "Sofas width (cm)", y ="Price (€)") +
theme(panel.background = element_rect(fill = ikea_blue))But maybe we could build a more complex model, also including height and depth.
The LINE conditions are fulfilled. We can now explore the model summary.
##
## Call:
## lm(formula = euro_price ~ width + height + depth, data = ikea_sofas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -666.97 -185.30 3.42 128.86 796.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -352.98551 72.37197 -4.877 2.07e-06 ***
## width 3.83633 0.24153 15.883 < 2e-16 ***
## height -0.07756 0.99764 -0.078 0.938
## depth 2.53334 0.50990 4.968 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 242.3 on 218 degrees of freedom
## (205 observations deleted due to missingness)
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.7243
## F-statistic: 194.6 on 3 and 218 DF, p-value: < 2.2e-16
What we can say :
depth will raise the price by 2.5€ after controlling for width and height.This time, we will be using a train/set datasets cross-validation method to evaluate the predictive power of our model. We will be using 80% of our dataset as a “train” set and the last 20% as the “test” set.
set.seed(123)
train <- ikea_sofas %>% sample_frac(0.8)
test <- anti_join(ikea_sofas,train)
model <- lm(euro_price ~ width + height + depth, data = train)We can now try to predict the price of buffets in the “test” dataset :
prediction <- augment(model, newdata = test)
ggplot(prediction,aes(x = euro_price, y = .fitted)) +
geom_point(color = ikea_yellow) +
geom_abline(intercept = 0, slope = 1) +
labs(x = "Actual price", y = "Predicted price")We can now evaluate the accuracy of our model by computing the Root Mean Square Error (RMSE) :
RMSE <- prediction %>%
summarise(square_error = (euro_price - .fitted)^2) %>%
summarise(mean_square_error = mean(square_error,na.rm=TRUE)) %>%
summarise(RMSE = sqrt(mean_square_error))
RMSE| RMSE |
|---|
| 239.0272 |
The Root Mean Square Error (RMSE) of the prediction is of 239.03€.
This prediction could be enhanced by the usage of other, more complex regression methods. But we will stop here for now as it is already quite a long analysis.