Click here to read the data manually.

Import and Clean Data

ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')

skimr::skim(ikea)
Data summary
Name ikea
Number of rows 3694
Number of columns 14
_______________________
Column type frequency:
character 7
logical 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 3 27 0 607 0
category 0 1 4 36 0 17 0
old_price 0 1 4 13 0 365 0
link 0 1 52 163 0 2962 0
other_colors 0 1 2 3 0 2 0
short_description 0 1 3 63 0 1706 0
designer 0 1 3 1261 0 381 0

Variable type: logical

skim_variable n_missing complete_rate mean count
sellable_online 0 1 0.99 TRU: 3666, FAL: 28

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
…1 0 1.00 1846.50 1066.51 0 923.25 1846.5 2769.75 3693 ▇▇▇▇▇
item_id 0 1.00 48632396.79 28887094.10 58487 20390574.00 49288078.0 70403572.75 99932615 ▇▇▇▇▇
price 0 1.00 1078.21 1374.65 3 180.90 544.7 1429.50 9585 ▇▁▁▁▁
depth 1463 0.60 54.38 29.96 1 38.00 47.0 60.00 257 ▇▃▁▁▁
height 988 0.73 101.68 61.10 1 67.00 83.0 124.00 700 ▇▂▁▁▁
width 589 0.84 104.47 71.13 1 60.00 80.0 140.00 420 ▇▅▂▁▁
data <- ikea %>%
  
  # Treat missing values
  filter(!is.na(height), !is.na(width), !is.na(depth)) %>%
    
  # Remove unwanted columns
  select(-...1, -link, -item_id, -old_price) %>%
  
  # log transform variables with positively skewed distributions
  mutate(price = log(price))

Explore Data - Identify good predictors.

Size

The correlation table indicates varying relationships between size dimensions and price. Positive correlations are observed for larger widths (e.g., 161.5_Inf with 0.579) and heights (e.g., 171_Inf with 0.232), suggesting that larger dimensions generally associate with higher prices. Conversely, negative correlations are noted for smaller widths (-Inf_60 with -0.374) and heights (-Inf_71 with -0.277), indicating that within these ranges, increases in size tend to correlate with lower prices. These findings highlight that size impacts pricing, with larger sizes typically leading to higher prices, though some ranges show the opposite effect.

# Step 1: Prepare data by binarising data
size_binarized_table <- data %>%
    select(price, height, width, depth) %>%
    binarize()

size_binarized_table %>% glimpse()
## Rows: 1,899
## Columns: 16
## $ `price__-Inf_5.68697535633982`           <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ price__5.68697535633982_6.52209279817015 <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,…
## $ price__6.52209279817015_7.37085996851068 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ price__7.37085996851068_Inf              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `height__-Inf_71`                        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ height__71_92                            <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ height__92_171                           <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ height__171_Inf                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `width__-Inf_60`                         <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ width__60_93                             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__93_161.5                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__161.5_Inf                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `depth__-Inf_40`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ depth__40_47                             <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,…
## $ depth__47_60                             <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,…
## $ depth__60_Inf                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Step 2: Correlate the data
size_corr_tbl <- size_binarized_table %>%
    correlate(price__7.37085996851068_Inf)

size_corr_tbl
## # A tibble: 16 × 3
##    feature bin                               correlation
##    <fct>   <chr>                                   <dbl>
##  1 price   7.37085996851068_Inf                   1     
##  2 width   161.5_Inf                              0.579 
##  3 depth   60_Inf                                 0.447 
##  4 width   -Inf_60                               -0.374 
##  5 price   -Inf_5.68697535633982                 -0.336 
##  6 price   6.52209279817015_7.37085996851068     -0.333 
##  7 price   5.68697535633982_6.52209279817015     -0.331 
##  8 height  -Inf_71                               -0.277 
##  9 width   60_93                                 -0.242 
## 10 height  171_Inf                                0.232 
## 11 depth   -Inf_40                               -0.232 
## 12 depth   47_60                                 -0.146 
## 13 height  71_92                                  0.0687
## 14 depth   40_47                                 -0.0588
## 15 width   93_161.5                               0.0413
## 16 height  92_171                                -0.0219
# Step 3: Plot
size_corr_tbl %>%
    plot_correlation_funnel()

Product Category

Based on the boxplot and histogram below, we can see that the category ‘wardrobes’ is on average the most expensive, while Children’s furniture is on average the least expensive. When saying this, we must state that prices greatly fluctuate in some categories, as we can see in the TV & media furnitures category.

data %>%
  ggplot(aes(price, as.factor(category))) +
  geom_boxplot()

category <- ikea %>%
    select(price, category) %>%
    mutate(price = log(price))

# Calculate average price per category
category_avg_price <- category %>%
  group_by(category) %>%
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  ungroup()

# Plot bar plot of average prices per category
ggplot(category_avg_price, aes(x = reorder(category, avg_price), y = avg_price)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Average Price per Product Category",
       x = "Product Category",
       y = "Average Price (log-transformed)") +
  theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Are there Other Colors Available

The next boxplots tells us how the price of a product is correlated to the fact if that product is also sold in different colors. We can see that when a product is also available in different colors, the price on average is slightly more than if not.

data %>%
  ggplot(aes(price, as.factor(other_colors))) +
  geom_boxplot()

Is the Product also Sold Online?

Similar to previous predictor, we can see that when a product is also sold online, and thus more available to consumers, the average price of that product is more expensive. We can also see, in this case, products that are also sold online vary less in price than the products that are not sold online.

data %>%
  ggplot(aes(price, sellable_online)) +
  geom_boxplot()

Short Description Correlation to the Price

Finally, we look at the short description of these products. The diagram shows us that when the word ‘sliding’ is included in the short description of a product, it is likely to be more expensive than when the word ‘conference’ is included.

data %>%
  
  # tokenize title
  unnest_tokens(output = word, input = short_description) %>%
  
  # calculate average price per word
  group_by(word) %>%
  summarise(price = mean(price), 
            n     = n()) %>%
  ungroup() %>%
  
  filter(n > 10, !str_detect(word, "\\d")) %>%
  slice_max(order_by = price, n = 20) %>%
  
  # Plot
  ggplot(aes(price, fct_reorder(word, price))) +
 geom_point() + 
  
  labs(y = "Words in Short Description", 
       x = "Average Price when Word is Included")