Recipe Site Traffic Recommendation System

Prediction Assignment

Background:

The marketing and product teams have observed that when a popular recipe is featured on the homepage, total website traffic can increase by up to 40%. However, there is currently no data-driven method to identify popular recipes in advance.

Goal:

Build a predictive model that:

  • identifies which recipes are likely to be popular before they are published,

  • achieves at least 80% accuracy in predicting high-traffic recipes, and

  • provides actionable insights to guide homepage selection and maximize engagement.

1. Data Validation

This dataset has 947 rows and 8 columns. I validated every variable and made several changes afterwards: filled the null values in calories, carbohydrate, sugar and protein with the column means, and replaced the null values in high_traffic with “Low”.

  • recipe: 947 unique identifiers without missing values. No cleaning is needed.
  • calories: 895 non-null values; the 52 missing values are filled with the column mean.
  • carbohydrate: 895 non-null values; the 52 missing values are filled with the column mean.
  • sugar: 895 non-null values; the 52 missing values are filled with the column mean.
  • protein: 895 non-null values; the 52 missing values are filled with the column mean.
  • category: 11 unique values without missing values, whereas the description lists only 10. The extra value is ‘Chicken Breast’, which I merged into ‘Chicken’.
  • servings: 6 unique values without missing values. According to the description it should be numeric, but it is stored as character and contains two extra values, ‘4 as a snack’ and ‘6 as a snack’. I merged them into ‘4’ and ‘6’ and converted the column to integer.
  • high_traffic: “High” is the only non-null value. The 373 null values are replaced with “Low”.

load in necessary packages
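The report does not list its packages explicitly, but the recipes/workflows output shown later implies the tidymodels stack; a minimal sketch of the setup:

```r
# Packages assumed throughout this analysis (based on the tidymodels
# output printed later in the report)
library(tidyverse)   # dplyr, ggplot2, tidyr, stringr, ...
library(tidymodels)  # rsample, recipes, parsnip, workflows, yardstick
```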

overview of the data set

| recipe | calories | carbohydrate | sugar | protein | category      | servings | high_traffic |
|--------|----------|--------------|-------|---------|---------------|----------|--------------|
| 001    | NA       | NA           | NA    | NA      | Pork          | 6        | High         |
| 002    | 35.48    | 38.56        | 0.66  | 0.92    | Potato        | 4        | High         |
| 003    | 914.28   | 42.68        | 3.09  | 2.88    | Breakfast     | 1        | NA           |
| 004    | 97.03    | 30.56        | 38.63 | 0.02    | Beverages     | 4        | High         |
| 005    | 27.05    | 1.85         | 0.80  | 0.53    | Beverages     | 4        | NA           |
| 006    | 691.15   | 3.46         | 1.65  | 53.93   | One Dish Meal | 2        | High         |

look at the missing values

  • validating the dataset for missing values
#>       recipe     calories carbohydrate        sugar      protein     category 
#>            0           52           52           52           52            0 
#>     servings high_traffic 
#>            0          373
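A minimal sketch of how these counts can be produced (the data frame name `recipe_df` is a hypothetical placeholder used throughout):

```r
# Count NA values per column; matches the summary printed above
colSums(is.na(recipe_df))
```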

data wrangling and exploration

  • Only two recipes have ‘4 as a snack’ servings and one has ‘6 as a snack’, so I rename them to “4” and “6” for simplicity and convert the column to numeric.
  • replace null values of high_traffic with “Low”
  • merge ‘Chicken Breast’ into ‘Chicken’ (a sketch of these three steps follows)
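A sketch of the three wrangling steps, again under the assumed name `recipe_df`:

```r
# Clean servings, fill missing traffic labels, merge the extra category level
recipe_df <- recipe_df %>%
  mutate(
    # "4 as a snack" / "6 as a snack" -> "4" / "6", then convert to integer
    servings     = as.integer(str_extract(servings, "\\d+")),
    # recipes never flagged "High" are treated as low traffic
    high_traffic = replace_na(high_traffic, "Low"),
    # fold the undocumented "Chicken Breast" level into "Chicken"
    category     = if_else(category == "Chicken Breast", "Chicken", category)
  )
```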

inspect the data for the new changes

Servings

| servings | n   | percent   |
|----------|-----|-----------|
| 1        | 175 | 0.1847941 |
| 2        | 183 | 0.1932418 |
| 4        | 391 | 0.4128828 |
| 6        | 198 | 0.2090813 |

Category

| category      | n   | percent    |
|---------------|-----|------------|
| Beverages     | 92  | 0.09714889 |
| Breakfast     | 106 | 0.11193242 |
| Chicken       | 172 | 0.18162619 |
| Dessert       | 83  | 0.08764520 |
| Lunch/Snacks  | 89  | 0.09398099 |
| Meat          | 79  | 0.08342133 |
| One Dish Meal | 71  | 0.07497360 |
| Pork          | 84  | 0.08870116 |
| Potato        | 88  | 0.09292503 |
| Vegetable     | 83  | 0.08764520 |

High Traffic

| high_traffic | n   | percent   |
|--------------|-----|-----------|
| High         | 574 | 0.6061246 |
| low          | 373 | 0.3938754 |

  • replace missing numeric values with the column mean, as sketched below
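A minimal sketch of the imputation step (`recipe_df` remains an assumed name):

```r
# Replace NA values in each nutrient column with that column's mean
recipe_df <- recipe_df %>%
  mutate(across(
    c(calories, carbohydrate, sugar, protein),
    ~ replace_na(.x, mean(.x, na.rm = TRUE))
  ))
```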

| recipe | calories | carbohydrate | sugar     | protein | category      | servings | high_traffic |
|--------|----------|--------------|-----------|---------|---------------|----------|--------------|
| 001    | 435.9392 | 35.06968     | 9.046547  | 24.1493 | Pork          | 6        | High         |
| 002    | 35.4800  | 38.56000     | 0.660000  | 0.9200  | Potato        | 4        | High         |
| 003    | 914.2800 | 42.68000     | 3.090000  | 2.8800  | Breakfast     | 1        | low          |
| 004    | 97.0300  | 30.56000     | 38.630000 | 0.0200  | Beverages     | 4        | High         |
| 005    | 27.0500  | 1.85000      | 0.800000  | 0.5300  | Beverages     | 4        | low          |
| 006    | 691.1500 | 3.46000      | 1.650000  | 53.9300 | One Dish Meal | 2        | High         |

Data visualisation

  • servings does not appear to have a strong influence on the target variable: for every servings value, high-traffic recipes outnumber low-traffic ones by a similar margin.
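The comparison behind this observation can be drawn with a grouped bar chart, for example:

```r
# High- vs. low-traffic counts side by side for each servings value
ggplot(recipe_df, aes(x = factor(servings), fill = high_traffic)) +
  geom_bar(position = "dodge") +
  labs(x = "servings", y = "number of recipes", fill = "traffic")
```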

Conclusion:

  • Potato, Pork and Vegetable categories have far more high-traffic recipes than low-traffic ones.
  • One Dish Meal, Lunch/Snacks, Meat and Dessert categories have only slightly more high-traffic recipes than low-traffic ones.

Correlations

  • the heatmap above suggests little to no linear relationship among the five numeric variables: calories, carbohydrate, sugar, protein and servings. All coefficients are close to 0, so the relationships between the variables are weak.
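The matrix behind such a heatmap can be computed directly:

```r
# Pairwise Pearson correlations between the five numeric variables
recipe_df %>%
  select(calories, carbohydrate, sugar, protein, servings) %>%
  cor() %>%
  round(2)
```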

box plots

  • individual plots of the nutrients are shown in the facets below
  • checking whether there are outliers in the nutrients

Histogram

  • the histograms above show that the nutrient distributions are right-skewed

let’s visually inspect single variables

  • look at calories

  • the calories data is right-skewed, as seen in the histogram

  • the points in red indicate potential outliers in the data

Conclusion:

The density plots show no significant dependency between traffic and the following numerical features: calories, carbohydrate, protein, sugar, servings.

Modeling data

#> Training cases: 662
#> Test cases: 285
#> # A tibble: 5 × 7
#>   calories carbohydrate sugar protein category     servings high_traffic
#>      <dbl>        <dbl> <dbl>   <dbl> <chr>           <dbl> <fct>       
#> 1   960.           4.4  44.5    12.1  Dessert             1 0           
#> 2   189.           9.54  6.47    0.34 Beverages           6 0           
#> 3   248.          44.7   2.64   19.9  Chicken             1 1           
#> 4     6.23        56.4   5.6     2.12 Lunch/Snacks        6 1           
#> 5    81.0          0.35  1.27    1.19 Beverages           4 0
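A sketch of the split that yields these counts: a 0.7 proportion of 947 rows gives the 662/285 partition printed above. The seed value and the 0/1 recoding of `high_traffic` are assumptions, since the report does not state them:

```r
set.seed(2024)  # assumed seed; the report does not state one

# Recode the target as a 0/1 factor and drop the identifier column
model_df <- recipe_df %>%
  mutate(high_traffic = factor(if_else(high_traffic == "High", 1, 0))) %>%
  select(-recipe)

recipe_split <- initial_split(model_df, prop = 0.7)
recipe_train <- training(recipe_split)
recipe_test  <- testing(recipe_split)
```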

Train and Evaluate a Binary Classification Model

OK, now we’re ready to train our model by fitting the training features to the training labels (high_traffic).

Preprocess the data for modelling

  • normalize all numerical features
  • convert categorical variables to numeric by creating dummy variables (a sketch follows)
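These two steps match the trained workflow printed below; a minimal sketch:

```r
# Recipe: normalize numeric predictors, dummy-encode the category variable
recipe_rec <- recipe(high_traffic ~ ., data = recipe_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
```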

fit the model
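A sketch of the workflow fit whose printout follows (object names are assumed):

```r
# Bundle preprocessing and a glm logistic regression, then fit
logreg_wf <- workflow() %>%
  add_recipe(recipe_rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

logreg_fit <- fit(logreg_wf, data = recipe_train)
```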

#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_normalize()
#> • step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_normalize()
#> • step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>            (Intercept)                calories            carbohydrate  
#>               -3.17385                 0.05534                 0.03697  
#>                  sugar                 protein                servings  
#>               -0.06993                 0.02482                -0.02061  
#>     category_Breakfast        category_Chicken        category_Dessert  
#>                2.42197                 2.95913                 3.81616  
#>  category_Lunch.Snacks           category_Meat  category_One.Dish.Meal  
#>                3.53970                 4.33127                 3.92688  
#>          category_Pork         category_Potato      category_Vegetable  
#>                6.07030                 6.10736                 7.28483  
#> 
#> Degrees of Freedom: 661 Total (i.e. Null);  647 Residual
#> Null Deviance:       880.6 
#> Residual Deviance: 627.8     AIC: 657.8

  • all variables whose p-value lies below the black line are statistically significant
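The test-set predictions shown below can be produced with something like:

```r
# Attach class predictions and class probabilities to the test set
logreg_preds <- augment(logreg_fit, new_data = recipe_test) %>%
  select(high_traffic, .pred_class, .pred_0, .pred_1)
```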

| high_traffic | .pred_class | .pred_0    | .pred_1   |
|--------------|-------------|------------|-----------|
| 1            | 1           | 0.05392969 | 0.9460703 |
| 1            | 1           | 0.05187582 | 0.9481242 |
| 0            | 0           | 0.55707909 | 0.4429209 |
| 0            | 0           | 0.53024546 | 0.4697545 |
| 1            | 1           | 0.24135347 | 0.7586465 |
| 1            | 1           | 0.20510993 | 0.7948901 |
| 1            | 1           | 0.23067854 | 0.7693215 |
| 1            | 1           | 0.05567235 | 0.9443277 |
| 0            | 0           | 0.69466499 | 0.3053350 |
| 1            | 1           | 0.33510952 | 0.6648905 |

Let’s take a look at the confusion matrix:
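A one-liner with yardstick, using the column names from the prediction table above:

```r
# Cross-tabulate predicted vs. true classes on the test set
conf_mat(logreg_preds, truth = high_traffic, estimate = .pred_class)
```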

#>           Truth
#> Prediction   0   1
#>          0  94  31
#>          1  26 134

What about our other metrics, such as PPV and sensitivity?
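One way to compute them in a single pass, using yardstick's default event level (the first factor level), which reproduces the estimates below:

```r
# PPV, recall, accuracy and F1 evaluated together
class_metrics <- metric_set(ppv, recall, accuracy, f_meas)
class_metrics(logreg_preds, truth = high_traffic, estimate = .pred_class)
```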

| .metric  | .estimator | .estimate |
|----------|------------|-----------|
| ppv      | binary     | 0.7520000 |
| recall   | binary     | 0.7833333 |
| accuracy | binary     | 0.8000000 |
| f_meas   | binary     | 0.7673469 |
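ROC AUC works on the class probabilities rather than the hard class predictions; with the same default event level this is:

```r
# Area under the ROC curve from the predicted probabilities
roc_auc(logreg_preds, truth = high_traffic, .pred_0)
```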

#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.844

Model 2: Random Forest
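A sketch of the second model. The ranger engine and the impurity importance setting are assumptions; the report does not name the engine:

```r
# Random forest on the same preprocessed features
rf_wf <- workflow() %>%
  add_recipe(recipe_rec) %>%
  add_model(rand_forest(mode = "classification") %>%
              set_engine("ranger", importance = "impurity"))

rf_fit <- fit(rf_wf, data = recipe_train)
```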

Model 2: evaluation

#>           Truth
#> Prediction   0   1
#>          0  80  22
#>          1  40 143

#> [[1]]
#> # A tibble: 4 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 ppv      binary         0.784
#> 2 recall   binary         0.667
#> 3 accuracy binary         0.782
#> 4 f_meas   binary         0.721
#> 
#> [[2]]
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.837

let’s make a Variable Importance Plot to see which predictor variables have the most impact in our model
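Assuming the vip package (and the impurity importance set on the ranger engine above), the plot can be drawn with:

```r
library(vip)

# Rank predictors by the importance recorded during training
rf_fit %>%
  extract_fit_parsnip() %>%
  vip()
```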

Conclusion:

For the high-traffic class, the Logistic Regression model achieves a recall of 0.78, accuracy of 0.80 and F1 score of 0.77, while the Random Forest model achieves 0.67, 0.78 and 0.72 respectively. The Logistic Regression model therefore fits the features better and makes fewer prediction errors.

Recommendations for future actions

To help the Product Manager predict which recipes will drive high traffic, we can deploy this Logistic Regression model into production. With roughly 80% of its predictions being correct, the model should help the Product Manager build confidence in generating more traffic to the rest of the website.

To implement and improve the model, I will consider the following steps:

  • Looking for the best way to deploy this model in terms of performance and cost. One option is to deploy the model on edge devices for convenience and security, and to pilot it with newly hired product analysts.
  • Collecting more data, e.g. time to make, cost per serving, ingredients, time spent on the recipe page, incoming links (which sites users came from), and combinations of recipes (which recipes a user visited in the same session as the current one).
  • Feature engineering, e.g. increasing the number of category levels and creating more meaningful features from the existing variables.

KPI and the performance of 2 models using KPI

The company wants to increase the accuracy of high-traffic predictions, so we use prediction accuracy as the KPI for comparing the two models: the higher the percentage, the better the model performs. The Logistic Regression model reaches 80% accuracy, whereas the Random Forest model is lower at 78%.