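library(dplyr) # provides %>% and mutate()
# Final amount paid = product amount - cashback + transaction fee
data <- data %>%
  mutate(final_amount_paid = product_amount - cashback + transaction_fee)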
# Last week's linear model
linear_model <- lm(final_amount_paid ~ product_amount, data = data)
summary(linear_model)
##
## Call:
## lm(formula = final_amount_paid ~ product_amount, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.829 -24.777 -0.391 24.482 74.146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.524e+01 9.010e-01 -28.02 <2e-16 ***
## product_amount 1.000e+00 1.571e-04 6365.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.04 on 4998 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 4.052e+07 on 1 and 4998 DF, p-value: < 2.2e-16
Adding Variables
Variables to Consider for the Model
product_amount: This is our primary independent variable.
transaction_fee: This also impacts the final amount paid.
cashback: This is a significant factor influencing the final amount.
loyalty_points: This may also affect the final amount, although it is not included in the model below.
# Build the linear regression model with additional variables
expanded_model <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback +
                       product_amount:cashback, data = data)
# Summary of the model to evaluate its fit
summary(expanded_model)
##
## Call:
## lm(formula = final_amount_paid ~ product_amount + transaction_fee +
## cashback + product_amount:cashback, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.646e-09 -4.100e-13 1.500e-13 7.700e-13 4.846e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.235e-11 1.506e-12 -8.199e+00 3.05e-16 ***
## product_amount 1.000e+00 2.423e-16 4.127e+15 < 2e-16 ***
## transaction_fee 1.000e+00 2.365e-14 4.229e+13 < 2e-16 ***
## cashback -1.000e+00 2.370e-14 -4.219e+13 < 2e-16 ***
## product_amount:cashback -6.861e-18 4.162e-18 -1.648e+00 0.0993 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.43e-11 on 4995 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.762e+31 on 4 and 4995 DF, p-value: < 2.2e-16
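The coefficients of 1, 1, and -1 and the residuals on the order of 1e-11 are what one would expect when the response is an exact linear combination of the predictors, which final_amount_paid is by construction. The following is a minimal sketch on simulated data; the data frame `sim`, its value ranges, and the seed are assumptions for illustration and are not the report's dataset.

```r
# Simulated stand-in for the report's data (all values hypothetical)
set.seed(7)
n <- 200
sim <- data.frame(
  product_amount  = runif(n, 10, 10000),
  cashback        = runif(n, 0, 100),
  transaction_fee = runif(n, 0, 50)
)

# Same construction as in the report: the response is a sum of its predictors
sim$final_amount_paid <- with(sim, product_amount - cashback + transaction_fee)

fit <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback, data = sim)

# Slopes recover the defining identity; residuals are only rounding error
print(coef(fit))
print(max(abs(residuals(fit))))
```

With no noise in the response, fit statistics such as R-squared and the t values describe the identity used to build the variable rather than any behavior in the data.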
Explanation for Including Variables
transaction_fee: Fees directly affect the amount paid; including the variable shows how much each unit of fee adds to the final amount (the estimated coefficient is 1.000).
cashback: Cashback reduces the amount a user ultimately pays (estimated coefficient -1.000), making this variable crucial for understanding spending behavior.
Interaction Term (product_amount:cashback): This term tests whether the effect of the product amount on the final payment changes with the cashback offered. In the summary above, its estimate is essentially zero (-6.861e-18) and it is not significant at the 5% level (p = 0.0993).
library(car)
vif(expanded_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## product_amount transaction_fee cashback
## 4.137910 1.000081 3.868634
## product_amount:cashback
## 7.039634
Product Amount: A VIF of approximately 4.14 suggests moderate multicollinearity but is below the commonly used rule-of-thumb thresholds for concern (5 or 10).
Transaction Fee: A VIF of approximately 1.00 indicates no multicollinearity.
Cashback: A VIF of approximately 3.87 also suggests moderate multicollinearity but is not a significant issue.
Interaction Term (product_amount:cashback): A VIF of approximately 7.04 indicates potential multicollinearity. This is typical for an interaction term, which is built from its own main effects and is therefore correlated with them; the vif() message above suggests type = 'predictor' for models with interactions.
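To illustrate where these numbers come from, the sketch below computes VIFs by hand as 1 / (1 - R²), where R² comes from regressing each model-matrix column on the others, and then repeats the calculation after mean-centering the predictors. It runs on simulated data: `sim`, the value ranges, and the helper `manual_vif` are hypothetical names introduced here, not part of the report or of the car package.

```r
# Simulated stand-in for the report's data (all values hypothetical)
set.seed(42)
n <- 500
sim <- data.frame(
  product_amount = runif(n, 10, 10000),
  cashback       = runif(n, 0, 100)
)

# Hypothetical helper: VIF of column j = 1 / (1 - R^2),
# with R^2 from regressing column j on all remaining columns
manual_vif <- function(X) {
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Raw predictors plus their interaction (intercept column dropped)
X_raw <- model.matrix(~ product_amount * cashback, data = sim)[, -1]
vif_raw <- manual_vif(X_raw)

# Mean-center the predictors before forming the interaction
sim_c <- data.frame(scale(sim, center = TRUE, scale = FALSE))
X_cen <- model.matrix(~ product_amount * cashback, data = sim_c)[, -1]
vif_cen <- manual_vif(X_cen)

print(round(vif_raw, 2))
print(round(vif_cen, 2))
```

Centering changes only how the interaction column overlaps with the main effects; it does not change the fitted values of the model, so a high VIF on a raw interaction term is often a parameterization artifact rather than a data problem.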
Product Amount
Reason for Inclusion: It directly influences the final amount paid by representing the product’s cost.
Multicollinearity Concerns: Moderate VIF (around 4.14) suggests potential multicollinearity but not severe enough for exclusion.
Transaction Fee
Reason for Inclusion: It accounts for additional costs that impact the total final amount paid.
Multicollinearity Concerns: Very low VIF (around 1.00) indicates no multicollinearity issues, allowing for confident inclusion.
Cashback
Reason for Inclusion: It reflects promotional offers that can reduce the final amount paid by consumers.
Multicollinearity Concerns: Moderate VIF (around 3.87) suggests some correlation with other predictors but is still valuable for analysis.
Interaction Term (product_amount:cashback)
Reason for Inclusion: It captures the joint effect of product amount and cashback on the final amount paid.
Multicollinearity Concerns: Higher VIF (around 7.04) indicates possible multicollinearity; consider exclusion if it doesn’t improve model fit.
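One way to check whether the interaction term improves model fit is a nested-model F-test with anova(). The sketch below is an illustration on simulated data, not a result from the report: `sim`, the seed, and the added noise term (sd = 5) are assumptions, and the noise is included only because an F-test on an essentially perfect fit is unreliable.

```r
# Simulated stand-in for the report's data (all values hypothetical)
set.seed(1)
n <- 500
sim <- data.frame(
  product_amount  = runif(n, 10, 10000),
  cashback        = runif(n, 0, 100),
  transaction_fee = runif(n, 0, 50)
)

# Assumption: add random noise so the residual variance is not zero
sim$final_amount_paid <- with(sim, product_amount - cashback + transaction_fee) +
  rnorm(n, sd = 5)

# Main effects only, then the same model plus the interaction
main_only <- lm(final_amount_paid ~ product_amount + transaction_fee + cashback,
                data = sim)
with_int <- update(main_only, . ~ . + product_amount:cashback)

# F-test for the single extra parameter
cmp <- anova(main_only, with_int)
print(cmp)
```

A large p-value in the second row of the table would support dropping the interaction; a small one would support keeping it despite the higher VIF.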
Residuals vs. Fitted Plot
plot(expanded_model, which = 1)
Insight and significance: The residuals are clustered near zero, suggesting little variation. However, a few outliers (labeled “O1” and “O2”) deviate from this pattern. This could indicate issues with specific observations, but otherwise the plot suggests minimal residual variance.
Residual Histogram
hist(residuals(expanded_model), main = "Residual Histogram", xlab = "Residuals", breaks = 20)
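Insight and significance: The residuals are tightly centered around zero with minimal spread, so residual variance is low. Residuals this small (the residual standard error is 2.43e-11) are at the level of floating-point rounding: final_amount_paid was defined as product_amount - cashback + transaction_fee, so the predictors reproduce the response exactly. The near-perfect fit therefore reflects how the response was constructed rather than overfitting in the usual sense.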
Q-Q Plot
plot(expanded_model, which = 2)
Insight and significance: In this plot, the residuals lie mostly on or near the line, with some deviations at the extremes. This generally suggests approximately normal residuals, though the points in the tails may need further investigation.
Cook’s Distance Plot
plot(expanded_model, which = 4)
Insight and significance: Observations 1, 2, and 2784 are flagged as potentially influential points. Observation 1 has the highest Cook’s distance, close to 1.2, which suggests it has a substantial influence on the model. A Cook’s distance exceeding 1 is commonly treated as a strong indication of a highly influential point.
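The Cook's distance values behind the plot can also be extracted directly with cooks.distance() and compared against the cutoff of 1. The following sketch uses simulated data with one deliberately injected high-leverage point; the variables `x` and `y`, the seed, and the injected values are assumptions for illustration only.

```r
# Simulated data with a known linear trend (all values hypothetical)
set.seed(3)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)

# Inject one point far from the other x values and far off the trend
x[1] <- 30
y[1] <- 0

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Observations exceeding the rule-of-thumb cutoff of 1
flagged <- which(cd > 1)
print(flagged)

# The three largest values, for comparison with the plot labels
print(head(sort(cd, decreasing = TRUE), 3))
```

Refitting the model without the flagged rows and comparing coefficients is a common follow-up to see how much those observations actually move the estimates.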
Scale-Location Plot
plot(expanded_model, which = 3)
Insight and significance: In the scale-location plot, the square root of the absolute standardized residuals is plotted against the fitted values. The observations O1 and O2 are highlighted, indicating potential issues with these points. There is a cluster of points at the lower end of the fitted values (close to zero), with a few residuals showing higher variability (e.g., the flagged points). Most points are scattered randomly without a clear pattern; the flagged observations might indicate heteroscedasticity or potential outliers.
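A visual reading of the scale-location plot can be complemented by a numerical check. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test in base R: the squared residuals are regressed on the predictor, and n times the R² of that auxiliary regression is compared with a chi-squared distribution. It uses simulated data whose noise grows with `x`; the data, the seed, and the variable names are assumptions and do not come from the report.

```r
# Simulated data with noise that increases with x (all values hypothetical)
set.seed(11)
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R^2, chi-squared with 1 degree of freedom
# (one predictor in the auxiliary regression)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value points to non-constant residual variance. On a model whose residuals are only rounding error, such a test says little, so it is most useful once the response contains real noise.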