Initial Simple Linear Regression Model

library(dplyr)  # provides %>% and mutate()
data <- data %>%
  mutate(final_amount_paid = product_amount - cashback + transaction_fee)
# Last week's linear model
linear_model <- lm(final_amount_paid ~ product_amount, data = data)
summary(linear_model)
## 
## Call:
## lm(formula = final_amount_paid ~ product_amount, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.829 -24.777  -0.391  24.482  74.146 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.524e+01  9.010e-01  -28.02   <2e-16 ***
## product_amount  1.000e+00  1.571e-04 6365.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.04 on 4998 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 4.052e+07 on 1 and 4998 DF,  p-value: < 2.2e-16
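One way to read the intercept and residual standard error above: because final_amount_paid is defined as product_amount - cashback + transaction_fee, a model that uses only product_amount leaves the quantity transaction_fee - cashback unexplained. If that quantity is roughly unrelated to product_amount, its mean should show up in the intercept and its spread in the residual standard error. The sketch below illustrates this on simulated data. The data frame `sim`, the sample size, and the uniform ranges are assumptions made for illustration, not the real dataset.

```r
# Simulated stand-in for the real data; all distributions here are assumptions.
set.seed(42)
n <- 5000
sim <- data.frame(
  product_amount  = runif(n, 10, 10000),
  cashback        = runif(n, 0, 100),
  transaction_fee = runif(n, 0, 50)
)
sim$final_amount_paid <- sim$product_amount - sim$cashback + sim$transaction_fee

simple_fit <- lm(final_amount_paid ~ product_amount, data = sim)

# The part of the response that the simple model does not see.
omitted <- sim$transaction_fee - sim$cashback

# The intercept should sit near the mean of the omitted part,
# and the residual standard error near its standard deviation.
c(intercept = unname(coef(simple_fit)[1]), mean_omitted = mean(omitted))
c(resid_se = sigma(simple_fit), sd_omitted = sd(omitted))
```

Running the same two comparisons on the real data (with `data` in place of `sim`) would show whether the intercept of about -25.24 has this interpretation.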

Expanded Model with Additional Terms

Adding Variables

Variables to Consider for the Model

product_amount: This is our primary independent variable.

transaction_fee: This also impacts the final amount paid.

cashback: This is a significant factor influencing the final amount.

loyalty_points: This may also affect the final amount, although it is not included in the model below.

# Build the linear regression model with additional variables
expanded_model <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback + 
                      product_amount:cashback, data = data)

# Summary of the model to evaluate its fit
summary(expanded_model)
## 
## Call:
## lm(formula = final_amount_paid ~ product_amount + transaction_fee + 
##     cashback + product_amount:cashback, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.646e-09 -4.100e-13  1.500e-13  7.700e-13  4.846e-10 
## 
## Coefficients:
##                           Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)             -1.235e-11  1.506e-12 -8.199e+00 3.05e-16 ***
## product_amount           1.000e+00  2.423e-16  4.127e+15  < 2e-16 ***
## transaction_fee          1.000e+00  2.365e-14  4.229e+13  < 2e-16 ***
## cashback                -1.000e+00  2.370e-14 -4.219e+13  < 2e-16 ***
## product_amount:cashback -6.861e-18  4.162e-18 -1.648e+00   0.0993 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43e-11 on 4995 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.762e+31 on 4 and 4995 DF,  p-value: < 2.2e-16

Explanation for Including Variables

transaction_fee: Fees may directly affect the amount paid; including it can reveal how fees impact the final amount.

cashback: Cashback offers can reduce the amount a user ultimately pays, making this variable crucial for understanding spending behavior.

Interaction Term (product_amount:cashback): This term explores whether the effect of the product amount on the final payment changes with the cashback offered.
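Since final_amount_paid was built as product_amount - cashback + transaction_fee, the three main effects can reproduce the response exactly. That would explain coefficients of 1, 1 and -1, an interaction coefficient near zero, and residuals on the order of 1e-11. A minimal sketch of that check follows. It uses simulated data, and the data frame `sim` and its ranges are assumptions, not the real dataset.

```r
# Simulated stand-in for the real data; the ranges below are assumptions.
set.seed(1)
n <- 500
sim <- data.frame(
  product_amount  = runif(n, 10, 10000),
  cashback        = runif(n, 0, 100),
  transaction_fee = runif(n, 0, 50)
)
sim$final_amount_paid <- sim$product_amount - sim$cashback + sim$transaction_fee

fit <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback +
            product_amount:cashback, data = sim)

# Coefficients should come out as roughly 0, 1, 1, -1, 0.
round(coef(fit), 6)

# The largest residual should be on the scale of floating-point rounding error.
max(abs(residuals(fit)))
```

When a fit is this close to exact, the standard errors, t values, and p-values in the summary describe rounding noise rather than sampling variation, so they are best read with caution.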

Multicollinearity Check

library(car)
vif(expanded_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##          product_amount         transaction_fee                cashback 
##                4.137910                1.000081                3.868634 
## product_amount:cashback 
##                7.039634

Interpretation of VIF Results

Product Amount: A VIF of approximately 4.14 suggests moderate multicollinearity but is below the commonly used thresholds for concern (5 or 10).

Transaction Fee: A VIF of approximately 1.00 indicates no multicollinearity.

Cashback: A VIF of approximately 3.87 also suggests moderate multicollinearity but is not a significant issue.

Interaction Term (product_amount:cashback): A VIF of approximately 7.04 indicates potential multicollinearity concerns. This is expected for an interaction term, because the product of product_amount and cashback is correlated with both of its component variables.
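To make the VIF values concrete: for a single model column, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that column on all the other predictor columns. The sketch below computes this by hand in base R. The helper name `vif_by_hand` and the simulated data frame `sim` are hypothetical, introduced only for illustration. For terms with one degree of freedom, the result should match what car::vif() reports.

```r
# Simulated stand-in for the real data; the ranges below are assumptions.
set.seed(2)
n <- 1000
sim <- data.frame(
  product_amount  = runif(n, 10, 10000),
  cashback        = runif(n, 0, 100),
  transaction_fee = runif(n, 0, 50)
)
sim$final_amount_paid <- sim$product_amount - sim$cashback + sim$transaction_fee

fit <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback +
            product_amount:cashback, data = sim)

# Hypothetical helper: VIF for each model-matrix column, obtained by
# regressing that column on the remaining columns and taking 1 / (1 - R^2).
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(fit)
```

A VIF of 7.04 therefore corresponds to an R^2 of about 0.86 when the interaction column is regressed on the other predictors, since 1 - 1/7.04 is roughly 0.858.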

Explanation for Each Variable

Product Amount

Reason for Inclusion: It directly influences the final amount paid by representing the product’s cost.

Multicollinearity Concerns: Moderate VIF (around 4.14) suggests potential multicollinearity but not severe enough for exclusion.

Transaction Fee

Reason for Inclusion: It accounts for additional costs that impact the total final amount paid.

Multicollinearity Concerns: Very low VIF (around 1.00) indicates no multicollinearity issues, allowing for confident inclusion.

Cashback

Reason for Inclusion: It reflects promotional offers that can reduce the final amount paid by consumers.

Multicollinearity Concerns: Moderate VIF (around 3.87) suggests some correlation with other predictors but is still valuable for analysis.

Interaction Term (Product Amount × Cashback)

Reason for Inclusion: It captures the joint effect of product amount and cashback on the final amount paid.

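Multicollinearity Concerns: Higher VIF (around 7.04) indicates possible multicollinearity; consider exclusion if it doesn’t improve model fit. In the model summary its coefficient is about -6.9e-18 with a p-value of 0.0993, so it is not significant at the 5% level.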
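If the interaction is kept, a common alternative to dropping it is to mean-center the two variables before forming their product. Centering leaves the interaction coefficient and the fitted values unchanged and only changes the meaning of the main-effect coefficients, but it usually lowers the VIF of the interaction column. The sketch below illustrates the effect on simulated data. The vectors `p` and `cb` and the helper `vifs` are hypothetical stand-ins, not the real dataset or a library function.

```r
# Simulated stand-ins; the ranges below are assumptions.
set.seed(3)
n  <- 1000
p  <- runif(n, 10, 10000)  # stand-in for product_amount
cb <- runif(n, 0, 100)     # stand-in for cashback

# Hypothetical helper: VIFs are the diagonal of the inverse
# correlation matrix of the predictor columns.
vifs <- function(X) setNames(diag(solve(cor(X))), colnames(X))

# Raw predictors and their product.
raw <- cbind(product_amount = p, cashback = cb, interaction = p * cb)

# Mean-centered predictors and the product of the centered versions.
p_c  <- p - mean(p)
cb_c <- cb - mean(cb)
centered <- cbind(product_amount = p_c, cashback = cb_c,
                  interaction = p_c * cb_c)

vif_raw      <- vifs(raw)
vif_centered <- vifs(centered)
rbind(raw = vif_raw, centered = vif_centered)
```

Whether centering helps on the real data depends on how product_amount and cashback are related there, so this is something to check rather than assume.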

Evaluating the Model

Residuals vs. Fitted Plot

plot(expanded_model, which = 1)

Insight and significance: The residuals are clustered near zero, suggesting very little variation. However, a few outliers (labeled “O1” and “O2”) deviate from this pattern. This could indicate potential issues with specific observations, but otherwise the plot suggests minimal residual variance, consistent with the residual standard error of 2.43e-11 reported in the model summary.

Residual Histogram

hist(residuals(expanded_model), main = "Residual Histogram", xlab = "Residuals", breaks = 20)

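Insight and significance: Here, the residuals are tightly centered around zero with minimal spread, so the residual variance is very low. Residuals this small most likely reflect the fact that final_amount_paid was constructed as product_amount - cashback + transaction_fee, so the model reproduces that identity almost exactly, rather than overfitting in the usual sense.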

Q-Q Plot

plot(expanded_model, which = 2)

Insight and significance: In the Q-Q plot, the residuals are mostly on or near the reference line, though there are some deviations at the extremes. This generally suggests that the residuals are approximately normal, though the outlying points might need further investigation.
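The visual reading of a Q-Q plot can be supplemented with a formal check such as the Shapiro-Wilk test, available in base R as shapiro.test(), which accepts between 3 and 5000 observations. The sketch below contrasts normal and heavy-tailed errors on simulated data. The variables and error distributions are assumptions chosen only to show how the test reacts to deviations in the tails.

```r
# Simulated data; both error distributions are assumptions for illustration.
set.seed(4)
n <- 1000
x <- runif(n, 0, 100)

y_normal <- 1 + 2 * x + rnorm(n)       # normally distributed errors
y_heavy  <- 1 + 2 * x + rt(n, df = 2)  # heavy-tailed errors

fit_normal <- lm(y_normal ~ x)
fit_heavy  <- lm(y_heavy ~ x)

# Shapiro-Wilk test on the residuals of each fit.
sw_normal <- shapiro.test(residuals(fit_normal))
sw_heavy  <- shapiro.test(residuals(fit_heavy))

# A small p-value is evidence against normally distributed residuals.
c(normal = sw_normal$p.value, heavy_tailed = sw_heavy$p.value)
```

On the real model the call would be shapiro.test(residuals(expanded_model)). With a large sample the test can flag departures that are too small to matter in practice, so it complements the plot rather than replacing it.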

Cook’s Distance Plot

plot(expanded_model, which = 4)

Insight and significance: Observations 1, 2, and 2784 are flagged as potentially influential points. Observation 1 exhibits the highest Cook’s distance, close to 1.2, which suggests it has a substantial influence on the model. Typically, a Cook’s distance exceeding 1 is a strong indication of a highly influential point.
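The values behind the Cook’s distance plot can also be inspected directly with cooks.distance() from base R, which makes it easy to list the observations above a chosen cutoff. The sketch below uses simulated data with one planted outlier. The data, the planted point, and the 4/n screening cutoff are assumptions for illustration; the cutoff of 1 is the one used in the text above.

```r
# Simulated data with one planted high-leverage outlier (an assumption).
set.seed(7)
n <- 200
x <- runif(n, 0, 100)
y <- 2 + 0.5 * x + rnorm(n, sd = 5)
x[1] <- 400  # far outside the range of the other x values
y[1] <- 0    # far below the trend at that x

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# Observations above the cutoff of 1.
which(cd > 1)

# The three largest Cook's distances, for comparison with the plot labels.
head(sort(cd, decreasing = TRUE), 3)

# A looser screening rule that is sometimes used: 4 / n.
which(cd > 4 / n)
```

On the real model, cooks.distance(expanded_model)[c(1, 2, 2784)] would show the exact values for the three flagged observations, and refitting without observation 1 would show how much the coefficients actually move.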

Scale-Location Plot

plot(expanded_model, which = 3)

Insight and significance: In the scale-location plot, the square root of the absolute standardized residuals is plotted against the fitted values. Observations O1 and O2 are highlighted, indicating potential issues with these points. There is a cluster of points at the lower end of the fitted values (close to zero), with a few residuals showing higher variability (e.g., the flagged points). Apart from these, the points are scattered without a clear pattern. The flagged observations might therefore indicate outliers or localized heteroscedasticity rather than a systematic trend in the variance.
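A numerical companion to the scale-location plot is a Breusch-Pagan style check for heteroscedasticity: regress the squared residuals on the predictors and compare n times the R^2 of that auxiliary regression with a chi-squared distribution. The sketch below implements this studentized (Koenker) form in base R. The helper name `bp_check` and the simulated data are hypothetical, and this is a sketch of the idea rather than a replacement for a packaged test.

```r
# Simulated data whose error spread grows with x (an assumption).
set.seed(5)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Hypothetical helper: studentized Breusch-Pagan style statistic.
bp_check <- function(fit) {
  e2   <- residuals(fit)^2                      # squared residuals
  X    <- model.matrix(fit)[, -1, drop = FALSE] # predictors, no intercept
  aux  <- lm(e2 ~ X)                            # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared   # n * R^2
  df   <- ncol(X)
  c(statistic = stat, df = df,
    p_value = pchisq(stat, df, lower.tail = FALSE))
}

bp <- bp_check(fit)
bp  # a small p-value points to non-constant residual variance
```

For a model that fits almost exactly, such as the expanded model here, the residuals are dominated by rounding error, so the result of this check would say little about the underlying data.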