Explore data

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## New names:
## • `` -> `...1`
## Rows: 3694 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, category, old_price, link, other_colors, short_description, d...
## dbl (6): ...1, item_id, price, depth, height, width
## lgl (1): sellable_online
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Removed 3040 rows containing missing values (`geom_point()`).
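The output above comes from loading the tidyverse, reading the IKEA data, and drawing an exploratory scatterplot; the code itself is not echoed. A minimal sketch of what likely produced it, assuming the TidyTuesday `ikea.csv` file and the columns listed in the specification, might look like this:

```r
library(tidyverse)

# Read the IKEA furniture data (assumed to be the 2020-11-03 TidyTuesday file)
ikea <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv")

# Exploratory scatterplot of price against width; geom_point() silently drops
# rows with missing dimensions, which produces the warning shown above
ikea %>%
  ggplot(aes(width, price)) +
  geom_point(alpha = 0.4) +
  scale_y_log10()
```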

Build a Model

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
## # Bootstrap sampling using stratification 
## # A tibble: 25 × 2
##    splits              id         
##    <list>              <chr>      
##  1 <split [2770/994]>  Bootstrap01
##  2 <split [2770/1003]> Bootstrap02
##  3 <split [2770/1037]> Bootstrap03
##  4 <split [2770/1010]> Bootstrap04
##  5 <split [2770/1014]> Bootstrap05
##  6 <split [2770/1007]> Bootstrap06
##  7 <split [2770/1036]> Bootstrap07
##  8 <split [2770/1016]> Bootstrap08
##  9 <split [2770/1021]> Bootstrap09
## 10 <split [2770/1043]> Bootstrap10
## # ℹ 15 more rows
## ranger_recipe <- 
##   recipe(formula = price ~ ., data = ikea_train) 
## 
## ranger_spec <- 
##   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
##   set_mode("classification") %>% 
##   set_engine("ranger") 
## 
## ranger_workflow <- 
##   workflow() %>% 
##   add_recipe(ranger_recipe) %>% 
##   add_model(ranger_spec) 
## 
## set.seed(67013)
## ranger_tune <-
##   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
## i Creating pre-processing data to finalize unknown parameter: mtry
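The tuning template above (the kind generated by `usemodels::use_ranger()`) still shows `stop()` placeholders; the stratified bootstrap resamples printed earlier suggest it was completed roughly as in the sketch below. The object name `ikea_folds` and the grid size are assumptions, not values taken from the output.

```r
library(tidymodels)

# 25 stratified bootstrap resamples of the training data, matching the
# "Bootstrap sampling using stratification" output above
ikea_folds <- bootstraps(ikea_train, strata = price, times = 25)

set.seed(67013)
ranger_tune <-
  tune_grid(
    ranger_workflow,
    resamples = ikea_folds,  # replaces stop("add your rsample object")
    grid = 11                # assumed number of candidate points
  )
```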

Explore results

## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2     4 rmse    standard   0.340    25 0.00202 Preprocessor1_Model10
## 2     4    10 rmse    standard   0.348    25 0.00229 Preprocessor1_Model05
## 3     5     6 rmse    standard   0.349    25 0.00233 Preprocessor1_Model06
## 4     3    18 rmse    standard   0.350    25 0.00219 Preprocessor1_Model01
## 5     2    21 rmse    standard   0.352    25 0.00198 Preprocessor1_Model08
## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2     4 rsq     standard   0.726    25 0.00333 Preprocessor1_Model10
## 2     4    10 rsq     standard   0.713    25 0.00379 Preprocessor1_Model05
## 3     5     6 rsq     standard   0.711    25 0.00385 Preprocessor1_Model06
## 4     3    18 rsq     standard   0.709    25 0.00369 Preprocessor1_Model01
## 5     2    21 rsq     standard   0.707    25 0.00349 Preprocessor1_Model08
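The two tables above rank the tuning results by RMSE and by R-squared; they match what `show_best()` reports, e.g. (assuming the tuning object is named `ranger_tune`, as in the template above):

```r
# Top five candidate models by each metric
show_best(ranger_tune, metric = "rmse")
show_best(ranger_tune, metric = "rsq")
```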

## Warning: No value of `metric` was given; metric 'rmse' will be used.
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_other()
## • step_clean_levels()
## • step_impute_knn()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 1000
##   min_n = 4
## 
## Computational engine: ranger
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [2770/924]> train/test split <tibble> <tibble> <tibble>     <workflow>
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       0.318 Preprocessor1_Model1
## 2 rsq     standard       0.752 Preprocessor1_Model1
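The finalized workflow and the train/test metrics above are consistent with selecting the best parameters by RMSE, finalizing the workflow, and fitting it once to the initial split. A sketch, assuming the split object is named `ikea_split`:

```r
# select_best() defaults to the first metric (rmse), which triggers the warning above
best_rmse <- select_best(ranger_tune)

final_rf <- finalize_workflow(ranger_workflow, best_rmse)

# Fit on the training set and evaluate once on the held-out test set
ikea_fit <- last_fit(final_rf, ikea_split)
collect_metrics(ikea_fit)
```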

## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  2.41
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
## Warning: No value of `metric` was given; metric 'rmse' will be used.
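The single `.pred` value above is on the log10 scale (10^2.41 is roughly 257 in the original price units) and would come from predicting on a new item with the fitted workflow. The example row below is purely hypothetical; the actual item used is not shown in the output, and the predictor columns are assumed to be name, category, depth, height, and width.

```r
# Pull the fitted workflow out of the last_fit() result
final_fitted <- extract_workflow(ikea_fit)

# Hypothetical new item (made-up values, for illustration only)
new_item <- tibble(
  name = "BILLY",
  category = "Bookcases & shelving units",
  depth = 28, height = 202, width = 80
)

predict(final_fitted, new_data = new_item)  # .pred is on the log10 price scale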

1

What is the research question?

We aim to answer whether we can predict the price of IKEA furniture using a variety of factors.

Describe the data briefly

The data is a collection of furniture items from IKEA, with variables that include category, price, height, width, and depth.

What are the characteristics of the key variables used in the analysis?

The primary variables mostly describe the size and shape of the furniture (depth, height, and width), since these likely reflect how much material and work went into building the item.

2

Describe the differences between the original data and the data transformed for modeling. Why?

A big difference between the original data and the transformed data is the removal of all “NA” values, giving us a more complete and accurate data set to answer our question. On top of that, we also kept the five most prominent categories that seemed to have an effect on price, to further narrow the margin of error.
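A plausible sketch of this transformation, combining the steps also listed under question 3 below (feature selection, log10 transform, converting characters to factors); the exact column choices are an assumption, and missing dimension values are handled later by imputation in the recipe:

```r
ikea_df <- ikea %>%
  # keep only the columns used for modeling (an assumption)
  select(price, name, category, depth, height, width) %>%
  # put price on the log10 scale so very expensive items do not dominate
  mutate(price = log10(price)) %>%
  # convert character columns to factors for the recipe steps
  mutate(across(where(is.character), as.factor))
```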

3

What are the names of data preparation steps mentioned in the video?

What is the name of the machine learning model(s) used in the analysis?

Initial Data Splitting: Split the data into training and test sets.

Feature Selection: Kept only the columns of data most relevant to predicting price.

Data Transformation: Applied a logarithmic transformation (log10) to the price.

Converting Characters to Factors: Converted character columns to factors

Data Imputation: Imputed missing values using k-nearest neighbors (step_impute_knn).

Level Aggregation: Applied step_other to aggregate rare levels in ‘name’ and ‘category’ into a common “other” level.

Level Cleaning: Applied step_clean_levels to tidy up the factor level labels (see the recipe sketch below).

The machine learning model used in the analysis is a random forest with 1,000 trees, tuned over mtry and min_n and fit with the ranger engine.
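A sketch of the preprocessing recipe implied by these steps and by the three recipe steps printed in the fitted workflow above; the threshold value is an assumption:

```r
library(textrecipes)  # provides step_clean_levels()

ranger_recipe <-
  recipe(price ~ ., data = ikea_train) %>%
  # aggregate rare 'name' and 'category' levels into a common "other" level
  step_other(name, category, threshold = 0.01) %>%
  # clean up the factor level labels
  step_clean_levels(name, category) %>%
  # impute missing depth, height, and width with k-nearest neighbors
  step_impute_knn(depth, height, width)
```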

4

What metrics are used in the model evaluation?

The major metrics used in this model evaluation are RMSE (Root Mean Square Error) and R-squared.

RMSE measures the difference between actual and predicted values. A lower RMSE means the model's predictions sit closer to the observed data.

R-squared measures the proportion of variance in the dependent variable that can be explained by the independent variables. A higher R-squared means most of that variance is explained, which is helpful when looking at which variables affect pricing the most.
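As a small illustration of how these metrics behave, both can be computed with yardstick on a set of predictions (a toy example with made-up values, not the model's results):

```r
library(yardstick)
library(dplyr)

# Made-up truth/prediction pairs on the log10 price scale
toy <- tibble(
  truth    = c(2.1, 2.4, 2.8, 3.0, 2.6),
  estimate = c(2.0, 2.5, 2.7, 3.1, 2.6)
)

rmse(toy, truth = truth, estimate = estimate)  # typical size of the errors (lower is better)
rsq(toy, truth = truth, estimate = estimate)   # share of variance explained (higher is better)
```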

5

What are the major findings?

The key findings from the analysis center on the dimensional variables (width, depth, and height) having the biggest effect on the price of a piece of furniture. These variables all had high importance in the model.
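This ranking is the kind of result the vip package (attached above) produces. A sketch of how the variable importance could be computed, reusing the objects from the tuning template; the permutation-importance setting is an assumption:

```r
library(vip)

# Refit the best model with permutation importance enabled
imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune, metric = "rmse")) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(ikea_train) %>%
  extract_fit_parsnip() %>%
  vip()  # the analysis found the dimensional variables near the top
```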