## Rows: 3694 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, category, old_price, link, other_colors, short_description, d...
## dbl (6): ...1, item_id, price, depth, height, width
## lgl (1): sellable_online
## # Bootstrap sampling using stratification
## # A tibble: 25 × 2
## splits id
## <list> <chr>
## 1 <split [2770/994]> Bootstrap01
## 2 <split [2770/1003]> Bootstrap02
## 3 <split [2770/1037]> Bootstrap03
## 4 <split [2770/1010]> Bootstrap04
## 5 <split [2770/1014]> Bootstrap05
## 6 <split [2770/1007]> Bootstrap06
## 7 <split [2770/1036]> Bootstrap07
## 8 <split [2770/1016]> Bootstrap08
## 9 <split [2770/1021]> Bootstrap09
## 10 <split [2770/1043]> Bootstrap10
## # ℹ 15 more rows
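The stratified bootstrap resamples printed above were presumably created along these lines; the object names `ikea_train` and `ikea_boot` and the seed value are assumptions, not shown in the output.

```r
library(tidymodels)

# 25 bootstrap resamples of the training data, stratified on the outcome
set.seed(123)                                        # seed value assumed
ikea_boot <- bootstraps(ikea_train, strata = price, times = 25)
ikea_boot
```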
## ranger_recipe <-
##   recipe(formula = price ~ ., data = ikea_train)
##
## ranger_spec <-
##   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
##   set_mode("regression") %>%   # price is numeric, so this is a regression model
##   set_engine("ranger")
##
## ranger_workflow <-
##   workflow() %>%
##   add_recipe(ranger_recipe) %>%
##   add_model(ranger_spec)
##
## set.seed(67013)
## ranger_tune <-
##   tune_grid(ranger_workflow,
##             resamples = ikea_boot,  # stratified bootstrap resamples (name assumed)
##             grid = 11)              # number of candidate parameter sets (value assumed)
## i Creating pre-processing data to finalize unknown parameter: mtry
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 4 rmse standard 0.340 25 0.00202 Preprocessor1_Model10
## 2 4 10 rmse standard 0.348 25 0.00229 Preprocessor1_Model05
## 3 5 6 rmse standard 0.349 25 0.00233 Preprocessor1_Model06
## 4 3 18 rmse standard 0.350 25 0.00219 Preprocessor1_Model01
## 5 2 21 rmse standard 0.352 25 0.00198 Preprocessor1_Model08
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 4 rsq standard 0.726 25 0.00333 Preprocessor1_Model10
## 2 4 10 rsq standard 0.713 25 0.00379 Preprocessor1_Model05
## 3 5 6 rsq standard 0.711 25 0.00385 Preprocessor1_Model06
## 4 3 18 rsq standard 0.709 25 0.00369 Preprocessor1_Model01
## 5 2 21 rsq standard 0.707 25 0.00349 Preprocessor1_Model08
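The two tables of best candidates above can be pulled from the tuning results with `show_best()`; the calls below are a sketch, assuming the tuning object is named `ranger_tune` as in the template.

```r
# Top 5 parameter combinations by RMSE and by R-squared
show_best(ranger_tune, metric = "rmse", n = 5)
show_best(ranger_tune, metric = "rsq", n = 5)
```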
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_other()
## • step_clean_levels()
## • step_impute_knn()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 2
## trees = 1000
## min_n = 4
##
## Computational engine: ranger
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [2770/924]> train/test split <tibble> <tibble> <tibble> <workflow>
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.318 Preprocessor1_Model1
## 2 rsq standard 0.752 Preprocessor1_Model1
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 2.41
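The finalized workflow, the test-set metrics, and the single prediction shown above could be reproduced roughly as follows; `ikea_split` and the row chosen for the example prediction are assumptions.

```r
# Finalize the workflow with the lowest-RMSE parameters and fit it once on the
# training set, evaluating on the held-out test set
final_rf <- finalize_workflow(ranger_workflow,
                              select_best(ranger_tune, metric = "rmse"))
ikea_fit <- last_fit(final_rf, ikea_split)  # ikea_split: initial train/test split (assumed name)

collect_metrics(ikea_fit)                   # test-set RMSE and R-squared

# Predict the log10 price of a single example item from the test set
predict(extract_workflow(ikea_fit), slice(testing(ikea_split), 1))
```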
We aim to answer whether we can predict the price of IKEA furniture from a variety of factors.
The data set is a collection of IKEA furniture items with variables that include height, width, depth, and price.
The primary predictors mostly describe the size and shape of each piece, which is a reasonable proxy for how much material and work went into building it.
A big difference between the original data and the transformed data is the removal of rows with missing ("NA") values, giving a more complete and accurate data set for answering our question. On top of that, we kept only the five most prominent categories, the ones that appeared to have an effect on price, to further narrow the margin of error.
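A minimal sketch of that filtering, assuming the raw data is in `ikea_raw` and that "most prominent" means the five categories with the most items (the exact rule is an assumption):

```r
library(tidyverse)

top_categories <- ikea_raw %>%
  count(category, sort = TRUE) %>%   # most common product categories
  slice_head(n = 5) %>%
  pull(category)

ikea_df <- ikea_raw %>%
  filter(category %in% top_categories) %>%
  drop_na(depth, height, width)      # drop rows with missing dimensions
```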
The preprocessing followed these steps (a code sketch follows the list):

- Initial data splitting: split the data into training and test sets.
- Feature selection: keep only the columns most relevant to predicting price.
- Data transformation: apply a log10 transformation to the price.
- Converting characters to factors: convert character columns to factors.
- Data imputation: impute missing values (via step_impute_knn()).
- Level aggregation: apply step_other() to pool rare levels of `name` and `category` into a common "other" level.
- Level cleaning: apply step_clean_levels() to tidy up the factor levels.
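A sketch of that preprocessing in tidymodels terms, consistent with the three recipe steps shown in the fitted workflow above (step_other(), step_clean_levels(), step_impute_knn()); the column selection, thresholds, seed, and object names are assumptions:

```r
library(tidymodels)
library(textrecipes)   # provides step_clean_levels()

set.seed(123)                                               # seed value assumed
ikea_split <- ikea_df %>%
  select(price, name, category, depth, height, width) %>%   # feature selection (columns assumed)
  mutate(price = log10(price),                              # log10 transform of the outcome
         across(where(is.character), as.factor)) %>%        # characters -> factors
  initial_split(strata = price)
ikea_train <- training(ikea_split)
ikea_test  <- testing(ikea_split)

ranger_recipe <- recipe(price ~ ., data = ikea_train) %>%
  step_other(name, category, threshold = 0.01) %>%  # pool rare levels into "other"
  step_clean_levels(name, category) %>%             # tidy factor level labels
  step_impute_knn(depth, height, width)             # impute missing dimensions
```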
The major metrics used in this model evaluation are RMSE (root mean squared error) and R-squared.
RMSE measures the typical difference between actual and predicted values, in the units of the outcome (here, log10 price); a lower RMSE means the model fits the data more closely.
R-squared is the proportion of variance in the dependent variable that is explained by the independent variables; a higher R-squared means most of the variance is accounted for, which is helpful when judging which variables affect price the most.
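As a tiny illustration of how these two metrics are computed with yardstick (the numbers here are made up, not taken from the model):

```r
library(yardstick)
library(tibble)

# Toy truth/estimate pairs, purely illustrative
toy <- tibble(truth    = c(2.1, 2.5, 3.0, 2.8),
              estimate = c(2.0, 2.6, 2.9, 2.9))

metric_set(rmse, rsq)(toy, truth = truth, estimate = estimate)
```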
The key finding from the analysis is that the dimensional variables (depth, height, and width) have the biggest effect on the value of a piece of furniture: they all showed high variable importance.
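One way that variable importance could be inspected, using the vip package; refitting with an importance mode enabled is required, and the object names mirror the assumed ones above:

```r
library(vip)

# Refit the best model with permutation importance turned on in ranger
imp_spec <- ranger_spec %>%
  finalize_model(select_best(ranger_tune, metric = "rmse")) %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(imp_spec) %>%
  fit(ikea_train) %>%
  extract_fit_parsnip() %>%
  vip()   # plot importance scores; the dimension variables rank highest
```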