Click here to read the data manually.
ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')
skimr::skim(ikea)
| Name | ikea |
| Number of rows | 3694 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| logical | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 3 | 27 | 0 | 607 | 0 |
| category | 0 | 1 | 4 | 36 | 0 | 17 | 0 |
| old_price | 0 | 1 | 4 | 13 | 0 | 365 | 0 |
| link | 0 | 1 | 52 | 163 | 0 | 2962 | 0 |
| other_colors | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| short_description | 0 | 1 | 3 | 63 | 0 | 1706 | 0 |
| designer | 0 | 1 | 3 | 1261 | 0 | 381 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| sellable_online | 0 | 1 | 0.99 | TRU: 3666, FAL: 28 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| …1 | 0 | 1.00 | 1846.50 | 1066.51 | 0 | 923.25 | 1846.5 | 2769.75 | 3693 | ▇▇▇▇▇ |
| item_id | 0 | 1.00 | 48632396.79 | 28887094.10 | 58487 | 20390574.00 | 49288078.0 | 70403572.75 | 99932615 | ▇▇▇▇▇ |
| price | 0 | 1.00 | 1078.21 | 1374.65 | 3 | 180.90 | 544.7 | 1429.50 | 9585 | ▇▁▁▁▁ |
| depth | 1463 | 0.60 | 54.38 | 29.96 | 1 | 38.00 | 47.0 | 60.00 | 257 | ▇▃▁▁▁ |
| height | 988 | 0.73 | 101.68 | 61.10 | 1 | 67.00 | 83.0 | 124.00 | 700 | ▇▂▁▁▁ |
| width | 589 | 0.84 | 104.47 | 71.13 | 1 | 60.00 | 80.0 | 140.00 | 420 | ▇▅▂▁▁ |
data <- ikea %>%
# Treat missing values
filter(!is.na(height), !is.na(width), !is.na(depth)) %>%
# Remove unwanted columns
select(-...1, -link, -item_id, -old_price) %>%
# log transform variables with positively skewed distributions
mutate(price = log(price))
The correlation table indicates varying relationships between size
dimensions and price. Positive correlations are observed for larger
widths (e.g., 161.5_Inf with 0.579) and
heights (e.g., 171_Inf with 0.232), suggesting
that larger dimensions generally associate with higher prices.
Conversely, negative correlations are noted for smaller widths
(-Inf_60 with -0.374) and heights
(-Inf_71 with -0.277), indicating that within
these ranges, increases in size tend to correlate with lower prices.
These findings highlight that size impacts pricing, with larger sizes
typically leading to higher prices, though some ranges show the opposite
effect.
# Step 1: Prepare data by binarising data
size_binarized_table <- data %>%
select(price, height, width, depth) %>%
binarize()
size_binarized_table %>% glimpse()
## Rows: 1,899
## Columns: 16
## $ `price__-Inf_5.68697535633982` <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ price__5.68697535633982_6.52209279817015 <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,…
## $ price__6.52209279817015_7.37085996851068 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ price__7.37085996851068_Inf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `height__-Inf_71` <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ height__71_92 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ height__92_171 <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ height__171_Inf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `width__-Inf_60` <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ width__60_93 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__93_161.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__161.5_Inf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `depth__-Inf_40` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ depth__40_47 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,…
## $ depth__47_60 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,…
## $ depth__60_Inf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Step 2: Correlate the data
size_corr_tbl <- size_binarized_table %>%
correlate(price__7.37085996851068_Inf)
size_corr_tbl
## # A tibble: 16 × 3
## feature bin correlation
## <fct> <chr> <dbl>
## 1 price 7.37085996851068_Inf 1
## 2 width 161.5_Inf 0.579
## 3 depth 60_Inf 0.447
## 4 width -Inf_60 -0.374
## 5 price -Inf_5.68697535633982 -0.336
## 6 price 6.52209279817015_7.37085996851068 -0.333
## 7 price 5.68697535633982_6.52209279817015 -0.331
## 8 height -Inf_71 -0.277
## 9 width 60_93 -0.242
## 10 height 171_Inf 0.232
## 11 depth -Inf_40 -0.232
## 12 depth 47_60 -0.146
## 13 height 71_92 0.0687
## 14 depth 40_47 -0.0588
## 15 width 93_161.5 0.0413
## 16 height 92_171 -0.0219
# Step 3: Plot
size_corr_tbl %>%
plot_correlation_funnel()
Based on the boxplot and histogram below, we can see that the category ‘wardrobes’ is on average the most expensive, while Children’s furniture is on average the least expensive. When saying this, we must state that prices greatly fluctuate in some categories, as we can see in the TV & media furnitures category.
data %>%
ggplot(aes(price, as.factor(category))) +
geom_boxplot()
category <- ikea %>%
select(price, category) %>%
mutate(price = log(price))
# Calculate average price per category
category_avg_price <- category %>%
group_by(category) %>%
summarise(avg_price = mean(price, na.rm = TRUE)) %>%
ungroup()
# Plot bar plot of average prices per category
ggplot(category_avg_price, aes(x = reorder(category, avg_price), y = avg_price)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Average Price per Product Category",
x = "Product Category",
y = "Average Price (log-transformed)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The next boxplots tells us how the price of a product is correlated to the fact if that product is also sold in different colors. We can see that when a product is also available in different colors, the price on average is slightly more than if not.
data %>%
ggplot(aes(price, as.factor(other_colors))) +
geom_boxplot()
Similar to previous predictor, we can see that when a product is also sold online, and thus more available to consumers, the average price of that product is more expensive. We can also see, in this case, products that are also sold online vary less in price than the products that are not sold online.
data %>%
ggplot(aes(price, sellable_online)) +
geom_boxplot()
Finally, we look at the short description of these products. The diagram shows us that when the word ‘sliding’ is included in the short description of a product, it is likely to be more expensive than when the word ‘conference’ is included.
data %>%
# tokenize title
unnest_tokens(output = word, input = short_description) %>%
# calculate average price per word
group_by(word) %>%
summarise(price = mean(price),
n = n()) %>%
ungroup() %>%
filter(n > 10, !str_detect(word, "\\d")) %>%
slice_max(order_by = price, n = 20) %>%
# Plot
ggplot(aes(price, fct_reorder(word, price))) +
geom_point() +
labs(y = "Words in Short Description",
x = "Average Price when Word is Included")