Data Dive — Regression Diagnostics

Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(data)

##    Order ID         CustomerName         Category         SubCategory       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      City            OrderDate            Region              Sales     
##  Length:9994        Length:9994        Length:9994        Min.   : 500  
##  Class :character   Class :character   Class :character   1st Qu.:1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1498  
##                                                           Mean   :1497  
##                                                           3rd Qu.:1995  
##                                                           Max.   :2500  
##     Discount          Profit           State          
##  Min.   :0.1000   Min.   :  25.25   Length:9994       
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character  
##  Median :0.2300   Median : 320.78   Mode  :character  
##  Mean   :0.2268   Mean   : 374.94                     
##  3rd Qu.:0.2900   3rd Qu.: 525.63                     
##  Max.   :0.3500   Max.   :1120.95

Building a Multiple Linear Regression Model

Fitting a linear regression model with Sales as the response variable and Profit, Discount, Region, and an interaction term (combining ‘Profit’ and ‘Discount’) as the explanatory variables.

#Checking the linearity between the explanatory variables (Profit, Discount, and Region) and the response variable (Sales).
library(ggplot2)

#Profit vs. Sales
ggplot(data, aes(x = Profit, y = Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Profit", y = "Sales")

## `geom_smooth()` using formula = 'y ~ x'

#Discount vs. Sales
ggplot(data, aes(x = Discount, y = Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Discount", y = "Sales")

## `geom_smooth()` using formula = 'y ~ x'

#Region vs. Sales
ggplot(data, aes(x = Region, y = Sales)) +
  geom_boxplot() +
  labs(x = "Region", y = "Sales")

The terms in our model are:
Intercept: Represents the estimated sales when Profit and Discount are zero and Region is at its reference level.
Continuous Variable: Shows the change in sales for a one-unit increase in profit when discount is zero.
Discount Variable: Shows the change in sales for a one-unit increase in discount when profit is zero.
Interaction Term: Shows how the effect of “Profit” on sales changes as “Discount” increases (and vice versa).
Region Variables: Coefficients for different levels of the categorical variable “Region”. Each coefficient shows the average change in sales for the respective region compared to the reference region.

response_variable <- data$Sales
continuous_variable <- data$Profit
discount_variable <- data$Discount
region_variable <- as.factor(data$Region)
my_model <- lm(response_variable ~ continuous_variable + discount_variable + continuous_variable * discount_variable + region_variable, data = data)
summary(my_model)

## 
## Call:
## lm(formula = response_variable ~ continuous_variable + discount_variable + 
##     continuous_variable * discount_variable + region_variable, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -783.71 -367.01  -77.85  306.22 1359.23 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            976.51829   28.45583  34.317   <2e-16
## continuous_variable                      1.42722    0.06084  23.457   <2e-16
## discount_variable                      -93.02319  113.75589  -0.818    0.414
## region_variableEast                    -13.65877   12.85807  -1.062    0.288
## region_variableNorth                  -290.56674  459.95574  -0.632    0.528
## region_variableSouth                    -9.55883   14.89166  -0.642    0.521
## region_variableWest                      0.17259   12.53391   0.014    0.989
## continuous_variable:discount_variable    0.13493    0.25629   0.526    0.599
##                                          
## (Intercept)                           ***
## continuous_variable                   ***
## discount_variable                        
## region_variableEast                      
## region_variableNorth                     
## region_variableSouth                     
## region_variableWest                      
## continuous_variable:discount_variable    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 459.8 on 9986 degrees of freedom
## Multiple R-squared:  0.3666, Adjusted R-squared:  0.3662 
## F-statistic: 825.8 on 7 and 9986 DF,  p-value: < 2.2e-16
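
The model call above writes out both main effects and the product term. In R's formula syntax, `a * b` already expands to `a + b + a:b`, so the shorter form should give the same fit. A minimal sketch on synthetic data: the data frame `toy`, its values, and the region labels are made up for illustration and are not the Supermart data.

```r
# Synthetic stand-in for the real data (all values are hypothetical)
set.seed(2024)
n <- 300
toy <- data.frame(
  Profit   = runif(n, 25, 1100),
  Discount = runif(n, 0.10, 0.35),
  Region   = factor(sample(c("Central", "East", "South", "West"), n, replace = TRUE))
)
toy$Sales <- 1000 + 1.4 * toy$Profit + rnorm(n, sd = 450)

# Long form: main effects written out next to the product term
fit_long <- lm(Sales ~ Profit + Discount + Profit * Discount + Region, data = toy)

# Short form: `*` expands to both main effects plus the interaction
fit_short <- lm(Sales ~ Profit * Discount + Region, data = toy)

# Both formulas produce the same coefficient names and estimates
same_fit <- isTRUE(all.equal(coef(fit_long), coef(fit_short)))
same_fit
```

Using column names in the formula, rather than vectors extracted with `$`, also keeps the coefficient labels readable (`Profit`, `Discount`) and lets `predict()` work with new data.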

Interpreting the coefficients:

Intercept: The estimated sales when Profit and Discount are zero and Region is at its reference level is around 976.52. This does not have a meaningful interpretation here, since neither zero profit nor zero discount occurs in the data (the minimums are 25.25 and 0.10).
Profit: The coefficient for “Profit” is around 1.43, indicating that, on average, a one-unit increase in profit is associated with an increase of about 1.43 units in sales when discount is zero and region is held fixed.
Discount: The coefficient for “Discount” is around -93.02, but its p-value (0.414) is greater than the typical significance level of 0.05, suggesting that “Discount” may not be a statistically significant predictor of sales in this model, i.e., the observed relationship between “Discount” and “Sales” could plausibly be due to random variation rather than a true relationship.
Region Variables: The region coefficients (East, North, South, West) all have p-values greater than 0.05. Each coefficient compares a region with the reference region, so none of these regions differs detectably from the reference in average sales. The very large standard error for North (459.96) suggests that this region may have very few observations.
Interaction Term: The coefficient for the interaction term is around 0.135. This represents the change in the “Profit” slope for a one-unit increase in “Discount” (equivalently, the change in the “Discount” slope per unit of “Profit”). However, its p-value (0.599) is far above 0.05, so it is not statistically significant.
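
Individual region p-values each compare one region with the reference level, so they do not answer whether Region matters as a whole. One way to check this is to look at the reference level and group sizes, then compare the model with and without Region using a single F-test. A minimal sketch on synthetic data: the data frame `toy` and the region labels (including "Central" as the reference) are hypothetical.

```r
# Synthetic stand-in for the real data (all values are hypothetical)
set.seed(1)
n <- 200
toy <- data.frame(
  Region = factor(sample(c("Central", "East", "South", "West"), n, replace = TRUE)),
  Profit = runif(n, 25, 1100)
)
toy$Sales <- 1000 + 1.4 * toy$Profit + rnorm(n, sd = 450)

# The first factor level is the reference that every coefficient is compared to
reference_level <- levels(toy$Region)[1]

# Group sizes: a level with very few rows will have a very large standard error
region_counts <- table(toy$Region)

# Joint test of all Region coefficients at once (nested-model F-test)
reduced <- lm(Sales ~ Profit, data = toy)
full    <- lm(Sales ~ Profit + Region, data = toy)
region_test <- anova(reduced, full)
region_test
```

The second row of `region_test` reports one F-statistic and p-value for Region as a whole, which is usually easier to interpret than several separate dummy-variable tests.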

Evaluating the model:

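Residuals: The residuals are the differences between observed and predicted values. The residual standard error is 459.8 on 9986 degrees of freedom, so a typical prediction misses the observed sales by roughly 460 units.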
R-squared: The value of 0.3666 suggests that approximately 36.66% of the variability in sales is explained by the model.
F-statistic: The F-statistic is 825.8, with a p-value less than 2.2e-16, indicating that at least one of the predictors (like ‘Profit’) is significantly associated with sales.
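
The F-statistic and adjusted R-squared can be recovered from R-squared and the degrees of freedom, which is a useful sanity check on the summary. A short sketch using only the numbers printed in the model output; the variable names are illustrative.

```r
# Values copied from the model summary
r_squared <- 0.3666   # Multiple R-squared
k         <- 7        # number of predictors (numerator degrees of freedom)
df_resid  <- 9986     # residual degrees of freedom

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (df_resid + k) / df_resid

round(c(F = f_stat, adj_R2 = adj_r_squared), 4)
```

The results should land close to the reported 825.8 and 0.3662; small differences come from R-squared being rounded to four decimals in the printed output.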

Interaction Term (continuous_variable:discount_variable):
The interaction term is not statistically significant (p=0.599), suggesting that the combined effect of “Profit” and “Discount” may not significantly influence sales in this model.
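
To make the interaction coefficient concrete, the implied “Profit” slope at a given discount is the Profit coefficient plus the interaction coefficient times that discount. A minimal sketch using the estimates reported in the model summary; the three discount levels are the minimum, mean, and maximum from the data summary, and the variable names are illustrative.

```r
# Estimates copied from the coefficient table
b_profit      <- 1.42722
b_interaction <- 0.13493

# Minimum, mean, and maximum discount from summary(data)
discount_levels <- c(min = 0.10, mean = 0.2268, max = 0.35)

# Slope of Sales with respect to Profit at each discount level
profit_slope <- b_profit + b_interaction * discount_levels
round(profit_slope, 3)
```

Across the observed discount range the implied slope moves only from about 1.44 to about 1.47, which is consistent with the interaction term not being statistically significant.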

Checking correlation between profit and discount

cor(continuous_variable, discount_variable, method = "pearson")

## [1] 1.747623e-05

A correlation of approximately 0.00001748 between profit and discount means that there is almost no linear relationship between these two variables. As a result, they are unlikely to contribute to multicollinearity issues in our regression model.
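
A pairwise correlation only covers the two main effects. Once a product term is in the model, it can still be strongly related to its components, so a variance inflation factor (VIF) check is a reasonable complement. A minimal sketch that computes VIFs by hand in base R on synthetic data: the data frame `toy`, its column names, and the helper `vif_manual` are hypothetical.

```r
# Synthetic, independent predictors on roughly the same scales as the real data
set.seed(42)
n <- 500
toy <- data.frame(
  profit   = runif(n, 25, 1100),
  discount = runif(n, 0.10, 0.35)
)
toy$profit_x_discount <- toy$profit * toy$discount  # explicit product term

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
round(vifs, 2)
```

If the VIFs for the product term and its components turn out large, centering Profit and Discount before forming the product is a common way to reduce them without changing the interaction estimate.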

Evaluating the model using diagnostic plots

Residual vs. Fitted Value

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ stringr   1.5.1
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
gg_resfitted(my_model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Generally in a residual vs fitted value plot, the residuals should be randomly scattered around a horizontal line at zero. This would indicate that the model fits the data well, with no consistent over- or under-prediction at any given fitted value.

In our plot, the residuals do not appear to be randomly scattered. There seems to be a trend where the residuals are positive for lower fitted values and negative for higher fitted values. This pattern suggests that the model might not be capturing the relationship between the independent and dependent variables very well, which points to a violation of the linearity assumption. Homoscedasticity is a separate assumption: it concerns whether the spread of the residuals stays constant across fitted values, not whether their average stays at zero.
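
A numeric complement to the plot is to average the residuals within bins of the fitted values. If the mean structure is right, the bin means should all sit near zero; bin means that change sign systematically suggest a misspecified relationship. A minimal sketch on synthetic data with a deliberately curved relationship: the data and the helper `binned_resid_mean` are hypothetical.

```r
# Synthetic data where the true relationship is curved but a line is fitted
set.seed(7)
n <- 400
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(n, sd = 2)
fit_line <- lm(y ~ x)

# Mean residual within equal-width bins of the fitted values
binned_resid_mean <- function(fit, bins = 5) {
  groups <- cut(fitted(fit), breaks = bins)
  tapply(resid(fit), groups, mean)
}

bin_means <- binned_resid_mean(fit_line)
round(bin_means, 2)

# Adding the missing curvature should pull the bin means back toward zero
fit_curve <- lm(y ~ x + I(x^2))
round(binned_resid_mean(fit_curve), 2)
```

In this toy case the straight-line fit gives positive means in the outer bins and negative means in the middle, which is the kind of signature a smoother on the residual plot would also show.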

Residuals vs. X Values

residual_plots <- gg_resX(my_model)

print(residual_plots)

## TableGrob (3 x 1) "arrange": 3 grobs
##                     z     cells    name           grob
## continuous_variable 1 (1-1,1-1) arrange gtable[layout]
## discount_variable   2 (2-2,1-1) arrange gtable[layout]
## region_variable     3 (3-3,1-1) arrange gtable[layout]

Residual vs. Continuous_variable

There seems to be a trend where the residuals are positive for lower values of the continuous variable (Profit) and negative for higher values. A systematic trend like this points to a violation of the linearity assumption rather than of homoscedasticity, which concerns the spread of the residuals.

Discount_variable

There is a slight positive trend in the residuals, but it is not very strong. This suggests that there might be a weak departure from linearity in “Discount”. However, it is possible that this is not a significant issue.

Region_variable

Since the boxes for each region variable have similar interquartile ranges, it might suggest that the variance of the residuals is similar across regions.
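
The visual comparison of box heights can be backed up by computing the residual spread per group and running a test for equal variances. A minimal sketch on synthetic data using base R's `bartlett.test()`: the data frame `toy` and the group labels are hypothetical.

```r
# Synthetic data with three groups that share the same error variance
set.seed(3)
toy <- data.frame(
  region = factor(rep(c("A", "B", "C"), each = 100)),
  x      = runif(300)
)
toy$y <- 5 + 2 * toy$x + rnorm(300, sd = 1)
fit <- lm(y ~ x + region, data = toy)

# Interquartile range of the residuals within each group
spread_by_region <- tapply(resid(fit), toy$region, IQR)
round(spread_by_region, 2)

# Bartlett's test: the null hypothesis is equal variances across groups
variance_test <- bartlett.test(resid(fit), toy$region)
variance_test$p.value
```

Bartlett's test is sensitive to non-normal residuals, so it is best read alongside the boxplots rather than in place of them.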

Residual Histogram

gg_reshist(my_model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram of the residuals shows a somewhat symmetrical distribution around zero. This may suggest that the residuals don’t have a strong positive or negative skew.

QQ-Plot

gg_qqplot(my_model)

Generally in a QQ plot, data points that follow a straight diagonal line indicate good agreement between the observed data and a normal distribution. In our plot, most of the data points fall relatively close to the diagonal line, suggesting that the quantiles of the residuals are similar to those of a normal distribution. Hence, the normality assumption appears reasonable for this model, although departures at the tails are still worth checking.
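
How closely the points follow the diagonal can also be summarized with the correlation between theoretical and sample quantiles. A minimal sketch using base R's `qqnorm()` on synthetic residuals: the two samples and the helper `qq_correlation` are hypothetical.

```r
# Two synthetic residual samples: one normal, one right-skewed
set.seed(5)
normal_resid <- rnorm(1000)
skewed_resid <- rexp(1000) - 1   # centered exponential, long right tail

# Correlation between theoretical normal quantiles and sample quantiles;
# values very close to 1 mean the QQ points hug a straight line
qq_correlation <- function(res) {
  qq <- qqnorm(res, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_normal <- qq_correlation(normal_resid)
qq_skewed <- qq_correlation(skewed_resid)
round(c(normal = qq_normal, skewed = qq_skewed), 3)
```

For a model with about ten thousand residuals this kind of summary is practical, whereas `shapiro.test()` only accepts samples of up to 5000 observations.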

Cook’s Distance

gg_cooksd(my_model, threshold = 'matlab')

Several data points have Cook’s distance values much larger than the others. These points could be exerting an undue influence on the fitted regression line, so the coefficient estimates might change noticeably if they were removed.
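
To see which observations are flagged and how much they matter, Cook's distance can be computed directly with base R and the model refitted without the flagged rows. A minimal sketch on synthetic data with one planted outlier. The cutoff of three times the mean Cook's distance is assumed to correspond to the 'matlab' threshold used above; that is worth confirming against the lindia documentation.

```r
# Synthetic data with one planted high-leverage outlier in the last row
set.seed(11)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 30
y[n] <- 0
toy <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = toy)

# Cook's distance for every observation
cd <- cooks.distance(fit_all)

# Assumed rule: flag points above three times the mean Cook's distance
flagged <- which(cd > 3 * mean(cd))

# Refit without the flagged rows and compare the coefficients
fit_trimmed <- lm(y ~ x, data = toy[-flagged, ])
rbind(all = coef(fit_all), trimmed = coef(fit_trimmed))
```

If the two rows of coefficients differ substantially, the flagged points are driving the fit and deserve a closer look before they are kept or dropped.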