- Overview
- Data Story
- Modeling
- Output Analysis
March 13, 2016
For example, a search for "AA battery" would be considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1).
Data includes the following files:
## Observations: 74,067
## Variables: 5
## $ id (int) 2, 3, 9, 16, 17, 18, 20, 21, 23, 27, 34, 35, 37,...
## $ product_uid (int) 100001, 100001, 100002, 100005, 100005, 100006, ...
## $ product_title (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simpson St...
## $ search_term (chr) "angle bracket", "l bracket", "deck over", "rain...
## $ relevance (dbl) 3.00, 2.50, 3.00, 2.33, 2.67, 3.00, 2.67, 3.00, ...
test.csv has the same structure as train.csv but does not include relevance scores.
product_descriptions.csv contains a text description of each product.
## Observations: 124,428
## Variables: 2
## $ product_uid (int) 100001, 100002, 100003, 100004, 100005, 10...
## $ product_description (chr) "Not only do angles make joints stronger, ...

## Observations: 240,760
## Variables: 6
## $ id (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
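The 240,760-row table above is consistent with stacking train.csv and test.csv and then joining the product descriptions on product_uid. The following is a minimal base-R sketch of that idea on toy data, assuming that is how the table was built; the object names `train`, `test`, `descriptions`, `stacked` and `product_all` are illustrative and not taken from the original code.

```r
# Toy stand-ins for train.csv, test.csv and product_descriptions.csv
train <- data.frame(
  id = c(2L, 3L),
  product_uid = c(100001L, 100001L),
  search_term = c("angle bracket", "l bracket"),
  relevance = c(3.0, 2.5),
  stringsAsFactors = FALSE
)
test <- data.frame(
  id = c(1L, 4L),
  product_uid = c(100001L, 100002L),
  search_term = c("toy query a", "toy query b"),
  stringsAsFactors = FALSE
)
descriptions <- data.frame(
  product_uid = c(100001L, 100002L),
  product_description = c("toy description 1", "toy description 2"),
  stringsAsFactors = FALSE
)

# test.csv has no relevance column, so add it as NA before stacking
test$relevance <- NA_real_
stacked <- rbind(train, test[, names(train)])

# Left join on product_uid so every search row keeps its description
product_all <- merge(stacked, descriptions, by = "product_uid", all.x = TRUE)
```

Rows that came from the test file keep `NA` in `relevance`, which matches the `NA` values visible in the combined output.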
## Source: local data frame [24,601 x 2]
##
##                   search_term     n
##                         (chr) (int)
## 1        patio chair cushions    36
## 2                         1x4    23
## 3              24 inch vanity    23
## 4     40 gal gas water heater    23
## 5                         4x6    23
## 6                     acrylic    23
## 7    air conditioner portable    23
## 8   air conditioner with heat    23
## 9       allure plank flooring    23
## 10      allure vinyl flooring    23
## ..                        ...   ...

## Source: local data frame [124,428 x 2]
##
##    product_uid     n
##          (int) (int)
## 1       101892    70
## 2       101442    49
## 3       102456    48
## 4       101959    47
## 5       101280    45
## 6       102162    44
## 7       104691    44
## 8       101148    43
## 9       100898    42
## 10      109594    41
## ..         ...   ...
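The two frequency tables above look like dplyr output, but the code that produced them is not shown. As a rough sketch, the same kind of count can be produced in base R without extra packages; the toy `searches` data frame and the `counts` name are assumptions for illustration.

```r
# Toy search log; the real data has 24,601 distinct search terms
searches <- data.frame(
  search_term = c("1x4", "acrylic", "1x4", "4x6", "1x4"),
  stringsAsFactors = FALSE
)

# Tabulate occurrences per search term, then sort from most to least frequent
counts <- as.data.frame(table(search_term = searches$search_term),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[order(-counts$n), ]
```

The same pattern applies to `product_uid` to find the products that appear in the most search pairs.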
## Observations: 2,044,803
## Variables: 3
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 100001, 10...
## $ name (chr) "Bullet01", "Bullet02", "Bullet03", "Bullet04", "B...
## $ value (chr) "Versatile connector for various 90° connections ...

## Observations: 2,044,648
## Variables: 2
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 100001, 10...
## $ property (chr) "Bullet01;;Versatile connector for various 90° co...

## Observations: 86,263
## Variables: 2
## $ Group.1 (int) 100001, 100002, 100003, 100004, 1...
## $ products_attributes$property (chr) "Bullet01;;Versatile connector fo...

## Observations: 86,263
## Variables: 2
## $ product_uid (int) 100001, 100002, 100003, 100004, 100005, 100006, 10...
## $ property (chr) "Bullet01;;Versatile connector for various 90° co...
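The outputs above suggest that each attribute's name and value were fused into a single `name;;value` string and that all rows of a product were then collapsed into one row (the `Group.1` column is what base R's `aggregate()` produces). A minimal sketch of that reshaping on toy data follows; the separator between collapsed properties (`" || "`), the `collapsed` name and the toy values are assumptions, since the original code is not shown.

```r
# Toy stand-in for the attributes table (one row per product attribute)
products_attributes <- data.frame(
  product_uid = c(100001L, 100001L, 100002L),
  name = c("Bullet01", "Bullet02", "Bullet01"),
  value = c("toy value a", "toy value b", "toy value c"),
  stringsAsFactors = FALSE
)

# Fuse name and value into a single "name;;value" string per row
products_attributes$property <- paste(products_attributes$name,
                                      products_attributes$value,
                                      sep = ";;")

# Collapse all properties of a product into one string per product_uid
collapsed <- aggregate(products_attributes$property,
                       by = list(products_attributes$product_uid),
                       FUN = paste, collapse = " || ")

# aggregate() names the result columns Group.1 and x; give them clearer names
names(collapsed) <- c("product_uid", "property")
```

After this step there is one row per product, which is what allows a one-to-one join back onto the search table by `product_uid`.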
## Joining by: "product_uid"
## Observations: 240,760
## Variables: 12
## $ id (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
## $ property (chr) "Bullet01;;Versatile connector for various...
## $ bullets (chr) "Versatile connector for various 90° conn...
## $ yeses (chr) "", "", "", "", "", "", "", "", "Concrete ...
## $ nos (chr) "", "", "", "", "", "", "", "", "Sealer Ti...
## $ keys (chr) "Gauge Material MFG Brand Name Number of P...
## $ values (chr) "12 Galvanized Steel Simpson Strong-Tie 1 ...
## Observations: 240,760
## Variables: 17
## $ id (int) 2, 3, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
## $ product_uid (int) 100001, 100001, 100001, 100001, 100001, 10...
## $ product_title (chr) "Simpson Strong-Tie 12-Gauge Angle", "Simp...
## $ search_term (chr) "angle bracket", "l bracket", "90 degree b...
## $ relevance (dbl) 3.00, 2.50, NA, NA, NA, NA, NA, NA, 3.00, ...
## $ product_description (chr) "Not only do angles make joints stronger, ...
## $ property (chr) "Bullet01;;Versatile connector for various...
## $ bullets (chr) "Versatile connector for various 90° conn...
## $ yeses (chr) "", "", "", "", "", "", "", "", "Concrete ...
## $ nos (chr) "", "", "", "", "", "", "", "", "Sealer Ti...
## $ keys (chr) "Gauge Material MFG Brand Name Number of P...
## $ values (chr) "12 Galvanized Steel Simpson Strong-Tie 1 ...
## $ bulletsScore (dbl) 0.8333333, 0.8333333, 0.7777778, 0.7555556...
## $ yesesScore (dbl) 0.0000000, 0.0000000, 0.0000000, 0.0000000...
## $ nosScore (dbl) 0.0000000, 0.0000000, 0.0000000, 0.0000000...
## $ keysScore (dbl) 0.5500000, 0.7500000, 0.3000000, 0.7000000...
## $ valuesScore (dbl) 0.4428571, 0.6428571, 0.3952381, 0.5500000...
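The report does not show how the five score columns were computed, so the following is not the author's method. As a purely illustrative stand-in, one simple way to score a search term against a text field is the fraction of distinct query words that also appear in that field. `overlap_score` is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: share of distinct query words found in a text field
overlap_score <- function(query, text) {
  # Lowercase and split on anything that is not a letter or digit
  tokenize <- function(s) {
    tokens <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
    tokens[nzchar(tokens)]
  }
  q <- unique(tokenize(query))
  # An empty query matches nothing
  if (length(q) == 0) return(0)
  mean(q %in% tokenize(text))
}

# "angle" is in the title, "bracket" is not, so the score is 0.5
overlap_score("angle bracket", "Simpson Strong-Tie 12-Gauge Angle")
```

A score of this kind lies between 0 and 1, like the score columns above, but it will not reproduce their exact values.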
Step 6: Divide the data into training and test sets so that we can fit the model and then use it for prediction
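Because the rows that came from test.csv have no relevance score, one straightforward way to recover the two sets from the combined table is to split on whether `relevance` is missing. This is a sketch on toy data; `product_all_train` is the name that appears in the model call, while `product_all` and `product_all_test` are assumed names.

```r
# Toy stand-in for the combined table with one score column
product_all <- data.frame(
  id = c(2L, 3L, 1L, 4L),
  relevance = c(3.0, 2.5, NA, NA),
  bulletsScore = c(0.83, 0.83, 0.78, 0.76)
)

# Rows from train.csv carry a relevance label; rows from test.csv are NA
product_all_train <- product_all[!is.na(product_all$relevance), ]
product_all_test  <- product_all[is.na(product_all$relevance), ]
```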
Step 7: Perform linear regression
Step 8: Test model investigation
##
## Call:
## lm(formula = relevance ~ bulletsScore + yesesScore + nosScore +
##     keysScore + valuesScore, data = product_all_train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.54892 -0.35275  0.01231  0.53109  0.77971
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.468907   0.003862 639.280  < 2e-16 ***
## bulletsScore -0.278528   0.012414 -22.436  < 2e-16 ***
## yesesScore    0.035908   0.011058   3.247 0.001166 **
## nosScore      0.040682   0.010579   3.846 0.000120 ***
## keysScore     0.056294   0.016768   3.357 0.000788 ***
## valuesScore   0.096021   0.012965   7.406 1.31e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5303 on 74061 degrees of freedom
## Multiple R-squared:  0.01389, Adjusted R-squared:  0.01382
## F-statistic: 208.6 on 5 and 74061 DF,  p-value: < 2.2e-16
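For readers who want to reproduce the shape of this analysis, the sketch below fits the same formula on simulated data and pulls out the quantities reported in the summary. The data frame `toy_train` and its values are random stand-ins, so the numbers it produces are unrelated to the results above.

```r
set.seed(1)
n <- 200

# Simulated stand-in for product_all_train; the scores are random, not real
toy_train <- data.frame(
  bulletsScore = runif(n),
  yesesScore   = runif(n),
  nosScore     = runif(n),
  keysScore    = runif(n),
  valuesScore  = runif(n)
)
toy_train$relevance <- 2.4 + 0.1 * toy_train$valuesScore + rnorm(n, sd = 0.5)

# Same formula as in the model call shown in the summary
fit <- lm(relevance ~ bulletsScore + yesesScore + nosScore +
            keysScore + valuesScore, data = toy_train)

coefs <- coef(fit)               # intercept plus one slope per score
r2 <- summary(fit)$r.squared     # share of variance explained

# Predicting for new rows requires the same five score columns
preds <- predict(fit, newdata = toy_train[1:5, ])
```

With the real data, `predict()` would be applied to the test rows, whose relevance is unknown.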
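This confirms our earlier inference.
The bullets score, the only predictor with a negative coefficient (-0.28), appears to be the one that affects the outliers,
and the values score, which has the largest positive coefficient (0.096), appears to be the one that drives relevance.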