March 13, 2016

Agenda

  • Overview
  • Data Story
  • Modeling
  • Output Analysis

Overview

  • Home Depot Product Search Relevance is a Kaggle competition that aims to improve Home Depot customers' shopping experience.
  • Target: develop a model that accurately predicts the relevance of search results.

Data Story

  • This data set contains a number of products and real customer search terms from Home Depot's website.
  • The challenge: to predict a relevance score for the provided combinations of search terms and products.
  • The relevance is a number between 1 (not relevant) and 3 (highly relevant).

For example, a search for "AA battery" would be considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1).

Data Story Cont..

Data includes the following files:

  • train.csv –> the training set; it contains products, searches, and relevance scores.
## Observations: 74,067
## Variables: 5
## $ id            (int) 2, 3, 9, 16, 17, 18, 20, 21, 23, 27, 34, 35, 37,...
## $ product_uid   (int) 100001, 100001, 100002, 100005, 100005, 100006, ...
## $ product_title (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simpson St...
## $ search_term   (chr) "angle bracket", "l bracket", "deck over", "rain...
## $ relevance     (dbl) 3.00, 2.50, 3.00, 2.33, 2.67, 3.00, 2.67, 3.00, ...

Data Story Cont..

  • Relevance Distribution

Data Story Cont..

  • Relevance Density Function

Data Story Cont..

  • test.csv –> similar to train.csv, but without the relevance scores.

  • product_descriptions.csv contains a text description of each product.

## Observations: 124,428
## Variables: 2
## $ product_uid         (int) 100001, 100002, 100003, 100004, 100005, 10...
## $ product_description (chr) "Not only do angles make joints stronger, ...

Data Story Cont..

  • Combined train, test and product description data
## Observations: 240,760
## Variables: 6
## $ id                  (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid         (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title       (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term         (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance           (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
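
The combination above can be read as stacking the train and test rows and then attaching each product's description by product_uid. A minimal sketch of that idea in plain Python follows; the slides only show R output, so the function name `combine`, the toy rows, and the placeholder description text are all assumptions made for illustration.

```python
# Toy rows: the two scored rows mirror values shown in the output above;
# the unscored row and the description text are hypothetical placeholders.
train = [
    {"id": 2, "product_uid": 100001, "search_term": "angle bracket", "relevance": 3.0},
    {"id": 3, "product_uid": 100001, "search_term": "l bracket", "relevance": 2.5},
]
test = [
    {"id": 1, "product_uid": 100001, "search_term": "example query"},
]
descriptions = {100001: "placeholder description text"}


def combine(train_rows, test_rows, description_by_uid):
    """Stack train and test rows, then left-join descriptions on product_uid."""
    combined = []
    for row in train_rows + test_rows:
        merged = dict(row)
        # Test rows carry no score, so relevance becomes None (NA in the R output).
        merged.setdefault("relevance", None)
        # A product without a description gets None instead of being dropped.
        merged["product_description"] = description_by_uid.get(row["product_uid"])
        combined.append(merged)
    return combined


combined = combine(train, test, descriptions)
```

Keeping the unscored rows in the same table means every later feature step runs once over all products, and the missing relevance value is what separates the two sets again.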

Data Story Cont..

  • The most frequently used search terms
## Source: local data frame [24,601 x 2]
## 
##                  search_term     n
##                        (chr) (int)
## 1       patio chair cushions    36
## 2                        1x4    23
## 3             24 inch vanity    23
## 4    40 gal gas water heater    23
## 5                        4x6    23
## 6                    acrylic    23
## 7   air conditioner portable    23
## 8  air conditioner with heat    23
## 9      allure plank flooring    23
## 10     allure vinyl flooring    23
## ..                       ...   ...
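
A frequency table like the one above amounts to counting how often each search term occurs and sorting by count. The sketch below shows one way to do that in plain Python; the function name `top_search_terms` and the short input list are hypothetical, and the original table was produced in R.

```python
from collections import Counter


def top_search_terms(search_terms, n=10):
    """Return the n most frequent search terms as (term, count) pairs."""
    return Counter(search_terms).most_common(n)


# Hypothetical miniature input; the real data has 24,601 distinct terms.
terms = ["patio chair cushions", "1x4", "patio chair cushions", "acrylic"]
top = top_search_terms(terms, n=2)
```

The same counting applies to product_uid when looking for the most searched products.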

Data Story Cont..

  • The most searched products
## Source: local data frame [124,428 x 2]
## 
##    product_uid     n
##          (int) (int)
## 1       101892    70
## 2       101442    49
## 3       102456    48
## 4       101959    47
## 5       101280    45
## 6       102162    44
## 7       104691    44
## 8       101148    43
## 9       100898    42
## 10      109594    41
## ..         ...   ...

Data Story Cont..

  • attributes.csv provides extended information about a subset of the products (typically detailed technical specifications). Not every product has attributes.
## Observations: 2,044,803
## Variables: 3
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 100001, 10...
## $ name        (chr) "Bullet01", "Bullet02", "Bullet03", "Bullet04", "B...
## $ value       (chr) "Versatile connector for various 90° connections ...

Modeling

  • Step 1: combine attribute keys and values
## Observations: 2,044,648
## Variables: 2
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 100001, 10...
## $ property    (chr) "Bullet01;;Versatile connector for various 90° co...
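
Step 1 can be pictured as gluing each attribute name to its value with the `;;` separator visible in the output. The sketch below is a Python illustration, not the original R code. The row count falls from 2,044,803 to 2,044,648, which suggests that incomplete rows were dropped first; the filter on a missing product_uid is an assumption, as are the function name `combine_key_value` and the sample rows.

```python
def combine_key_value(rows, sep=";;"):
    """Turn (product_uid, name, value) rows into (product_uid, property) rows."""
    combined = []
    for uid, name, value in rows:
        if uid is None:
            # Assumption: rows without a product_uid are discarded.
            continue
        combined.append((uid, f"{name}{sep}{value}"))
    return combined


# Hypothetical sample rows in the shape of attributes.csv.
rows = [
    (100001, "Bullet01", "example bullet text"),
    (100001, "Material", "Galvanized Steel"),
    (None, "Bullet01", "row with no product"),
]
properties = combine_key_value(rows)
```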

Modeling Cont…

  • Step 2: group rows with the same product_uid
## Observations: 86,263
## Variables: 2
## $ Group.1                      (int) 100001, 100002, 100003, 100004, 1...
## $ products_attributes$property (chr) "Bullet01;;Versatile connector fo...

Modeling Cont…

  • Step 3: restore original names
## Observations: 86,263
## Variables: 2
## $ product_uid (int) 100001, 100002, 100003, 100004, 100005, 100006, 10...
## $ property    (chr) "Bullet01;;Versatile connector for various 90° co...
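
Steps 2 and 3 collapse the attribute rows to one row per product and then give the columns readable names again. A rough Python equivalent is sketched below; the function name `group_properties` and the space used to join properties are assumptions, since the slides do not show how the grouped strings are concatenated.

```python
from collections import defaultdict


def group_properties(rows, joiner=" "):
    """Collapse (product_uid, property) rows to one record per product_uid."""
    grouped = defaultdict(list)
    for uid, prop in rows:
        grouped[uid].append(prop)
    # Emit explicit column names, which covers the "restore names" step.
    return [
        {"product_uid": uid, "property": joiner.join(props)}
        for uid, props in grouped.items()
    ]


# Hypothetical sample rows.
rows = [
    (100001, "Bullet01;;example text"),
    (100001, "Material;;Galvanized Steel"),
    (100002, "Bullet01;;another example"),
]
products = group_properties(rows)
```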

Modeling Cont…

  • Step 4: Generate new attribute fields and combine with the main product set
## Joining by: "product_uid"
## Observations: 240,760
## Variables: 12
## $ id                  (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid         (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title       (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term         (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance           (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
## $ property            (chr) "Bullet01;;Versatile connector for various...
## $ bullets             (chr) "Versatile connector for various 90° conn...
## $ yeses               (chr) "", "", "", "", "", "", "", "", "Concrete ...
## $ nos                 (chr) "", "", "", "", "", "", "", "", "Sealer Ti...
## $ keys                (chr) "Gauge Material MFG Brand Name Number of P...
## $ values              (chr) "12 Galvanized Steel Simpson Strong-Tie 1 ...
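
The slides do not spell out how the bullets, yeses, nos, keys and values fields are derived. One plausible reading of the output is sketched below in Python: Bullet attributes contribute their text to bullets, attributes whose value is "Yes" or "No" contribute their name to yeses or nos, and every other attribute contributes its name to keys and its value to values. These rules, the function name `split_attributes`, and the sample pairs are assumptions, not the original implementation.

```python
def split_attributes(pairs):
    """Split (name, value) attribute pairs into five space-joined text fields."""
    fields = {"bullets": [], "yeses": [], "nos": [], "keys": [], "values": []}
    for name, value in pairs:
        flag = value.strip().lower()
        if name.lower().startswith("bullet"):
            fields["bullets"].append(value)   # assumed rule: keep the bullet text
        elif flag == "yes":
            fields["yeses"].append(name)      # assumed rule: keep the attribute name
        elif flag == "no":
            fields["nos"].append(name)        # assumed rule: keep the attribute name
        else:
            fields["keys"].append(name)
            fields["values"].append(value)
    return {key: " ".join(parts) for key, parts in fields.items()}


# Hypothetical attribute pairs for one product.
pairs = [
    ("Bullet01", "example bullet text"),
    ("Gauge", "12"),
    ("Material", "Galvanized Steel"),
    ("Example Feature", "Yes"),
    ("Another Feature", "No"),
]
fields = split_attributes(pairs)
```

A product with no attributes ends up with five empty strings, which matches the empty yeses and nos entries in the output above.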

Modeling Cont…

  • Step 5: Generate the features that are used in the linear regression
## Observations: 240,760
## Variables: 17
## $ id                  (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid         (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title       (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term         (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance           (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
## $ property            (chr) "Bullet01;;Versatile connector for various...
## $ bullets             (chr) "Versatile connector for various 90° conn...
## $ yeses               (chr) "", "", "", "", "", "", "", "", "Concrete ...
## $ nos                 (chr) "", "", "", "", "", "", "", "", "Sealer Ti...
## $ keys                (chr) "Gauge Material MFG Brand Name Number of P...
## $ values              (chr) "12 Galvanized Steel Simpson Strong-Tie 1 ...
## $ bulletsScore        (dbl) 0.8333333, 0.8333333, 0.7777778, 0.7555556...
## $ yesesScore          (dbl) 0.0000000, 0.0000000, 0.0000000, 0.0000000...
## $ nosScore            (dbl) 0.0000000, 0.0000000, 0.0000000, 0.0000000...
## $ keysScore           (dbl) 0.5500000, 0.7500000, 0.3000000, 0.7000000...
## $ valuesScore         (dbl) 0.4428571, 0.6428571, 0.3952381, 0.5500000...
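
The slides do not show how the five scores are computed, so the sketch below should not be read as the original method. As a stand-in, it scores a field by the fraction of search-term words that also occur in the field text, which gives a number between 0 and 1 and returns 0 for an empty field. The function name `overlap_score` and the sample strings are hypothetical.

```python
def overlap_score(search_term, field_text):
    """Fraction of search-term words found in field_text (0.0 to 1.0).

    A stand-in similarity measure, not the scoring used in the slides.
    """
    query_words = search_term.lower().split()
    field_words = set(field_text.lower().split())
    if not query_words or not field_words:
        return 0.0
    hits = sum(1 for word in query_words if word in field_words)
    return hits / len(query_words)


# One of the two query words appears in the field text.
score = overlap_score("angle bracket", "steel angle for joints")
empty = overlap_score("angle bracket", "")
```

Applying one such function to each of the five text fields would produce five numeric columns per row, which is the shape the regression in the later steps expects.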

Modeling Cont…

  • Step 6: Divide the data into training and test sets so that we can fit the model and test its predictions

  • Step 7: Perform linear regression

  • Step 8: Investigate the fitted model

## 
## Call:
## lm(formula = relevance ~ bulletsScore + yesesScore + nosScore + 
##     keysScore + valuesScore, data = product_all_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54892 -0.35275  0.01231  0.53109  0.77971 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.468907   0.003862 639.280  < 2e-16 ***
## bulletsScore -0.278528   0.012414 -22.436  < 2e-16 ***
## yesesScore    0.035908   0.011058   3.247 0.001166 ** 
## nosScore      0.040682   0.010579   3.846 0.000120 ***
## keysScore     0.056294   0.016768   3.357 0.000788 ***
## valuesScore   0.096021   0.012965   7.406 1.31e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5303 on 74061 degrees of freedom
## Multiple R-squared:  0.01389,    Adjusted R-squared:  0.01382 
## F-statistic: 208.6 on 5 and 74061 DF,  p-value: < 2.2e-16
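
To make the coefficient table concrete, the sketch below applies the fitted estimates from the summary to a single row of scores. The coefficients are copied from the output above and the sample scores are the first row of the feature table; the function name `predict_relevance` is hypothetical, and the code is a Python illustration rather than the original R call.

```python
# Estimates copied from the lm() summary above.
INTERCEPT = 2.468907
COEFFICIENTS = {
    "bulletsScore": -0.278528,
    "yesesScore": 0.035908,
    "nosScore": 0.040682,
    "keysScore": 0.056294,
    "valuesScore": 0.096021,
}


def predict_relevance(scores):
    """Linear prediction: intercept plus the weighted sum of the five scores."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * scores[name] for name in COEFFICIENTS
    )


# First row of the feature table shown in Step 5.
first_row = {
    "bulletsScore": 0.8333333,
    "yesesScore": 0.0,
    "nosScore": 0.0,
    "keysScore": 0.55,
    "valuesScore": 0.4428571,
}
prediction = predict_relevance(first_row)  # roughly 2.31
```

Because every score lies between 0 and 1 and the coefficients are small, predictions from this model stay close to the intercept.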

Output Analysis

  • Relevance Distribution

Output Analysis Cont …

  • The values are spread between 2.2 and 2.6, with the peak between 2.3 and 2.4.
  • The distribution is nearly normal, with an outlier at 2.5.
  • Let us see another view.

It confirms our previous observation.

Output Analysis Cont …

  • Let us now investigate the features used in the model to see their effect on the result

It seems that the bullets score is the feature that affects the outlier.

Output Analysis Cont …

  • Now, let us investigate further by seeing how each feature behaves against relevance

It seems that the values score is the feature that drives the relevance.