Import and Clean Data

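library(tidyverse)
library(correlationfunnel)  # binarize(), correlate(), plot_correlation_funnel()
library(tidytext)           # unnest_tokens()
ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')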

skimr::skim(ikea)

Data summary
Name	ikea
Number of rows	3694
Number of columns	14
_______________________
Column type frequency:
character	7
logical	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	3	27	607
category	1	4	36	17
old_price	1	4	13	365
link	1	52	163	2962
other_colors	1	2	3	2
short_description	1	3	63	1706
designer	1	3	1261	381

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
sellable_online	0	1	0.99	TRU: 3666, FAL: 28

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
...1	0	1.00	1846.50	1066.51	0	923.25	1846.5	2769.75	3693	▇▇▇▇▇
item_id	0	1.00	48632396.79	28887094.10	58487	20390574.00	49288078.0	70403572.75	99932615	▇▇▇▇▇
price	0	1.00	1078.21	1374.65	3	180.90	544.7	1429.50	9585	▇▁▁▁▁
depth	1463	0.60	54.38	29.96	1	38.00	47.0	60.00	257	▇▃▁▁▁
height	988	0.73	101.68	61.10	1	67.00	83.0	124.00	700	▇▂▁▁▁
width	589	0.84	104.47	71.13	1	60.00	80.0	140.00	420	▇▅▂▁▁

data <- ikea %>%

  # Drop rows where any size dimension is missing
  filter(!is.na(height), !is.na(width), !is.na(depth)) %>%

  # Remove unwanted columns
  select(-...1, -link, -item_id, -old_price) %>%

  # Log-transform price, which has a positively skewed distribution
  mutate(price = log(price))
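Filtering on all three dimensions keeps only rows where height, width and depth are all present, which is why the data shrinks from 3,694 rows to the 1,899 rows seen later in the binarized table. The sketch below illustrates the effect on a few toy rows (hypothetical values, not taken from the IKEA file) and shows that tidyr's drop_na() is an equivalent way to write the same step.

```r
library(dplyr)
library(tidyr)

# Toy rows (hypothetical values, not from the IKEA file)
toy <- tibble::tibble(
  height = c(10, NA, 30, 40),
  width  = c(5, 6, NA, 8),
  depth  = c(1, 2, 3, 4)
)

# Same pattern as in the cleaning step: keep rows with no missing dimension
by_filter <- toy %>%
  filter(!is.na(height), !is.na(width), !is.na(depth))

# Equivalent formulation with tidyr::drop_na()
by_drop_na <- toy %>%
  drop_na(height, width, depth)

nrow(toy)        # rows before filtering
nrow(by_filter)  # rows that have all three dimensions
```

Dropping rows is the simplest treatment, but it discards every product with even one missing dimension, so it is worth checking how many rows survive before modelling.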

Explore Data - Identify good predictors.

Size

The correlation table shows how each size bin relates to membership in the highest price bin (log price above 7.37, which corresponds to a price of about 1,589). The widest items (width 161.5_Inf, 0.579) and the deepest items (depth 60_Inf, 0.447) are most strongly associated with that bin, while the tallest items (height 171_Inf, 0.232) show a weaker positive association. The narrowest items (width -Inf_60, -0.374) and the shortest items (height -Inf_71, -0.277) are negatively correlated with it. This means that small products are less likely to fall in the top price bin; it does not mean that growing in size lowers the price within those ranges. The intermediate bins have weaker correlations, between -0.242 and 0.069. Overall, larger products tend to be more expensive, with width and depth being stronger signals than height.

# Step 1: Binarize the data (numeric columns are split into bins, one 0/1 column per bin)
size_binarized_table <- data %>%
    select(price, height, width, depth) %>%
    binarize()

size_binarized_table %>% glimpse()

## Rows: 1,899
## Columns: 16
## $ `price__-Inf_5.68697535633982`           <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ price__5.68697535633982_6.52209279817015 <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,…
## $ price__6.52209279817015_7.37085996851068 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ price__7.37085996851068_Inf              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `height__-Inf_71`                        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ height__71_92                            <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ height__92_171                           <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ height__171_Inf                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `width__-Inf_60`                         <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ width__60_93                             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__93_161.5                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ width__161.5_Inf                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `depth__-Inf_40`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ depth__40_47                             <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,…
## $ depth__47_60                             <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,…
## $ depth__60_Inf                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
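Because price was log-transformed before binarizing, the bin edges in the column names are on the log scale. A minimal sketch for translating them back into original price units is shown below; the edge values are copied from the glimpse() output, and the variable names log_edges and price_edges are only illustrative.

```r
# Upper edges of the first three log-price bins, copied from the glimpse() output
log_edges <- c(5.68697535633982, 6.52209279817015, 7.37085996851068)

# exp() undoes the natural-log transform applied with log(price)
price_edges <- exp(log_edges)

round(price_edges, 1)
```

The edges come out at roughly 295, 680 and 1,589, so the target column price__7.37085996851068_Inf flags products priced above about 1,589.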

# Step 2: Correlate the data
size_corr_tbl <- size_binarized_table %>%
    correlate(price__7.37085996851068_Inf)

size_corr_tbl

## # A tibble: 16 × 3
##    feature bin                               correlation
##    <fct>   <chr>                                   <dbl>
##  1 price   7.37085996851068_Inf                   1     
##  2 width   161.5_Inf                              0.579 
##  3 depth   60_Inf                                 0.447 
##  4 width   -Inf_60                               -0.374 
##  5 price   -Inf_5.68697535633982                 -0.336 
##  6 price   6.52209279817015_7.37085996851068     -0.333 
##  7 price   5.68697535633982_6.52209279817015     -0.331 
##  8 height  -Inf_71                               -0.277 
##  9 width   60_93                                 -0.242 
## 10 height  171_Inf                                0.232 
## 11 depth   -Inf_40                               -0.232 
## 12 depth   47_60                                 -0.146 
## 13 height  71_92                                  0.0687
## 14 depth   40_47                                 -0.0588
## 15 width   93_161.5                               0.0413
## 16 height  92_171                                -0.0219
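Each value in this table is a correlation between two 0/1 columns: the target bin and one feature bin. To my understanding, correlate() uses Pearson correlation by default, so the same kind of number can be reproduced with base R's cor(). The sketch below uses made-up indicator vectors (top_price, wide and narrow are hypothetical, not IKEA data) to show how a positive and a negative value arise.

```r
# Toy 0/1 indicators for eight products (hypothetical, not from the IKEA data)
top_price <- c(1, 1, 1, 0, 0, 0, 0, 0)  # 1 = product is in the top price bin
wide      <- c(1, 1, 0, 1, 0, 0, 0, 0)  # 1 = product is in the widest bin
narrow    <- 1 - wide                   # 1 = product is not in the widest bin

# Pearson correlation of two binary vectors (also known as the phi coefficient)
cor_wide   <- cor(top_price, wide)
cor_narrow <- cor(top_price, narrow)

cor_wide    # positive: wide products are over-represented in the top price bin
cor_narrow  # negative: the complementary group is under-represented
```

A negative value therefore says that products in that bin are less often found in the top price bin than products outside it; it says nothing about how price changes within the bin.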

# Step 3: Plot
size_corr_tbl %>%
    plot_correlation_funnel()

Product Category

Based on the boxplot and bar chart below, wardrobes are on average the most expensive category, while children’s furniture is on average the least expensive. Prices vary widely within some categories, however, as the TV & media furniture category shows. Note that the boxplot uses the filtered data, whereas the bar chart is computed from the full ikea table, including products with missing dimensions.

data %>%
  ggplot(aes(price, as.factor(category))) +
  geom_boxplot()

category <- ikea %>%
    select(price, category) %>%
    mutate(price = log(price))

# Calculate average price per category
category_avg_price <- category %>%
  group_by(category) %>%
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  ungroup()

# Plot bar plot of average prices per category
ggplot(category_avg_price, aes(x = reorder(category, avg_price), y = avg_price)) +
  geom_col(fill = "skyblue", color = "black") +
  labs(title = "Average Price per Product Category",
       x = "Product Category",
       y = "Average Price (log-transformed)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Are there Other Colors Available

The next boxplot shows how the price of a product relates to whether that product is also sold in other colors. Products that are available in other colors are, on average, slightly more expensive than products that are not.

data %>%
  ggplot(aes(price, as.factor(other_colors))) +
  geom_boxplot()
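Since price is on the log scale in these plots, a gap between two groups is a ratio rather than an absolute difference. The sketch below, using hypothetical prices rather than the IKEA data, shows how a difference in mean log price converts into "group A is about x times as expensive as group B" (strictly, a ratio of geometric means).

```r
# Hypothetical prices in original units (not from the IKEA data)
price_other_colors    <- c(400, 900, 1500)
price_no_other_colors <- c(300, 700, 1200)

# Difference between the two group means on the log scale
diff_log <- mean(log(price_other_colors)) - mean(log(price_no_other_colors))

# exp() of that difference is the ratio of the groups' geometric mean prices
price_ratio <- exp(diff_log)
price_ratio
```

A small shift on the log axis can therefore still correspond to a noticeable percentage difference in actual prices.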

Is the Product also Sold Online?

As with the previous predictor, products that are also sold online, and are thus more widely available to consumers, are on average more expensive. Their prices also vary less than the prices of products that are not sold online. This comparison should be read with caution, because only 28 of the 3,694 products in the raw data are not sellable online.

data %>%
  ggplot(aes(price, sellable_online)) +
  geom_boxplot()

Short Description Correlation to the Price

Finally, we look at the short description of these products. The plot shows the 20 words with the highest average log price among words that appear more than 10 times. Products whose short description includes the word ‘sliding’ tend to be more expensive than products whose description includes the word ‘conference’.

data %>%

  # Tokenize the short description into one word per row
  unnest_tokens(output = word, input = short_description) %>%

  # Calculate the average log price and the number of occurrences per word
  group_by(word) %>%
  summarise(price = mean(price),
            n     = n()) %>%
  ungroup() %>%

  # Keep words that occur more than 10 times and contain no digits
  filter(n > 10, !str_detect(word, "\\d")) %>%
  slice_max(order_by = price, n = 20) %>%

  # Plot
  ggplot(aes(price, fct_reorder(word, price))) +
  geom_point() +

  labs(y = "Words in Short Description",
       x = "Average Log Price when Word is Included")
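To make the tokenize-and-average step concrete, here is a minimal sketch on three made-up products (the descriptions and prices are hypothetical, not taken from the IKEA data). It follows the same pattern as the pipeline above: one row per word, then the mean log price and count per word.

```r
library(dplyr)
library(tidytext)

# Three made-up products (hypothetical values, not from the IKEA data)
toy <- tibble::tibble(
  short_description = c("Sliding door", "Sliding wardrobe", "Conference chair"),
  price             = log(c(2000, 3000, 500))
)

word_prices <- toy %>%
  # unnest_tokens() lowercases the text and returns one row per word
  unnest_tokens(output = word, input = short_description) %>%
  group_by(word) %>%
  summarise(mean_log_price = mean(price),
            n              = n(),
            .groups        = "drop") %>%
  arrange(desc(mean_log_price))

word_prices
```

A product contributes its price to every word in its description, so words that occur only a few times can reach extreme averages by chance. That is the reason the pipeline above keeps only words with n > 10.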

Apply it to Your Data 1: Ikea Product Prices

Job Boonstoppel

2024-09-08