Data Dive — GLMs

Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(data)

##    Order ID         CustomerName         Category         SubCategory       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      City            OrderDate            Region              Sales     
##  Length:9994        Length:9994        Length:9994        Min.   : 500  
##  Class :character   Class :character   Class :character   1st Qu.:1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1498  
##                                                           Mean   :1497  
##                                                           3rd Qu.:1995  
##                                                           Max.   :2500  
##     Discount          Profit           State            ProfitRange    
##  Min.   :0.1000   Min.   :  25.25   Length:9994        Min.   :0.0000  
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character   1st Qu.:0.0000  
##  Median :0.2300   Median : 320.78   Mode  :character   Median :0.0000  
##  Mean   :0.2268   Mean   : 374.94                      Mean   :0.2764  
##  3rd Qu.:0.2900   3rd Qu.: 525.63                      3rd Qu.:1.0000  
##  Max.   :0.3500   Max.   :1120.95                      Max.   :1.0000

Loading the required packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)
library(boot)
library(broom)
library(ggthemes)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

head(data)

## # A tibble: 6 × 12
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## #   ProfitRange <dbl>

Choosing ‘Sales’, ‘Discount’ and ‘Region’ as explanatory variables and ‘Profit’ as a response variable.

Building a linear regression model

model <- lm(Profit ~ Sales + Region + Discount, data = data)
coef(summary(model))

##               Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept) -9.0622074 8.643228e+00 -1.0484749 0.29444524
## Sales        0.2514586 3.307932e-03 76.0168634 0.00000000
## RegionEast   8.7205046 5.339119e+00  1.6333228 0.10243265
## RegionNorth 93.7120824 1.910358e+02  0.4905473 0.62375743
## RegionSouth 12.7109614 6.183131e+00  2.0557483 0.03983272
## RegionWest   2.0423265 5.205469e+00  0.3923424 0.69481359
## Discount    10.8420699 2.560484e+01  0.4234383 0.67198463

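The intercept (-9.0622074) is the estimated profit for an order in the reference region with Sales and Discount both equal to zero. No order in the data is close to that point (Sales starts at 500 and Discount at 0.10), so the intercept is an extrapolation with no practical interpretation. Its p-value (0.294) also shows that it is not distinguishable from zero.
Sales: Holding Region and Discount fixed, each one-unit increase in Sales is associated with an estimated increase of about 0.25 units in Profit. This is the only strongly significant predictor (t ≈ 76).
Region (East, North, South, West): Each coefficient is the estimated difference in profit between that region and the reference region. The reference region is the level that does not appear in the coefficient table; by default R uses the alphabetically first level.
- The coefficient for RegionEast (8.7205046) indicates that profit is estimated to be about 8.72 units higher in the East than in the reference region, but the difference is not significant at the 5% level (p = 0.102).
- The coefficient for RegionNorth (93.7120824) suggests a profit about 93.71 units higher in the North, but its standard error (about 191) is far larger than for any other region, which typically means the region has very few observations. With p = 0.624 there is no evidence of a real difference.
- The coefficient for RegionSouth (12.7109614) is the only regional difference that is significant at the 5% level (p = 0.040).
- The coefficient for RegionWest (2.0423265) is small and not significant (p = 0.695).
Discount: The coefficient (10.8420699) is the estimated change in Profit for a one-unit increase in Discount, but Discount only ranges from 0.10 to 0.35 in the data. On that scale, a 0.10 increase in Discount corresponds to about 1.08 units of Profit. The effect is not statistically significant (p = 0.672).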
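The role of the reference region can be illustrated with a small sketch. The data frame `toy` and its values are hypothetical and only stand in for the Supermart data; the point is how R picks the baseline level and how `relevel()` changes it.

```r
# Toy data (hypothetical values, not the Supermart data)
toy <- data.frame(
  Region = c("West", "East", "Central", "East", "West", "Central", "South", "South"),
  Sales  = c(1200, 800, 1500, 950, 2100, 1750, 600, 1300),
  Profit = c(310, 190, 380, 240, 520, 450, 140, 330)
)

# Factor levels are sorted alphabetically, so the first level
# becomes the reference category in lm().
toy$Region  <- factor(toy$Region)
ref_default <- levels(toy$Region)[1]

fit_default <- lm(Profit ~ Sales + Region, data = toy)
names(coef(fit_default))  # the reference level has no coefficient of its own

# relevel() chooses a different baseline. The fitted values do not change;
# only the meaning of the Region coefficients does.
toy$Region2 <- relevel(toy$Region, ref = "South")
fit_south   <- lm(Profit ~ Sales + Region2, data = toy)

# Confidence intervals are often easier to read than p-values alone
confint(fit_default)
```

With the real data, the same idea would apply to `data$Region` before calling `lm()`.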

Evaluating the model using variance inflation factor

vif(model)

##              GVIF Df GVIF^(1/(2*Df))
## Sales    1.000136  1        1.000068
## Region   1.000752  4        1.000094
## Discount 1.000676  1        1.000338

Sales: The GVIF is almost exactly 1 (1.000136), so Sales shows no sign of multicollinearity with the other predictors.

Region: Region is a factor with 4 degrees of freedom, so the comparable figure is GVIF^(1/(2*Df)) = 1.000094, which is also essentially 1. Region is not collinear with Sales or Discount.

Discount: The GVIF for Discount (1.000676) is likewise essentially 1, indicating no multicollinearity issue with the Discount variable.

In summary, all values are far below the rule-of-thumb thresholds that are usually quoted (5 or 10), so multicollinearity is not a concern in our model.
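For intuition about what these numbers measure, here is a minimal sketch of the variance inflation factor for numeric predictors, computed as 1 / (1 - R²) from regressing each predictor on the others. The helper `vif_manual()` and the simulated variables are hypothetical; for factors such as Region, `car::vif()` reports the generalized version instead.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
vifs
```

Values near 1, as in the table above, mean that a predictor is almost unrelated to the others.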

Diagnostic Plots

Residual vs. Fitted Value

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

There seems to be a subtle fanning effect in the residual plot: the spread of the residuals increases as the fitted values on the x-axis increase.

This fanning effect indicates a potential violation of the homoscedasticity assumption of linear regression (the spread of the residuals should be similar across the whole range of fitted values).

While a slight fanning effect might not be a major concern for a large dataset, it is worth keeping in mind. If the error variance is not constant, the standard errors and p-values reported above can be unreliable, and predictions of profit are less precise where the residual spread is larger.
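A visual impression of fanning can be backed up with a formal check. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test in base R on simulated data; the helper `bp_test()` and the simulated `sales` and `profit` vectors are hypothetical. The `car` package loaded earlier also provides `ncvTest()`, a related score test.

```r
set.seed(42)
n      <- 500
sales  <- runif(n, 500, 2500)
# Simulated profit whose error standard deviation grows with sales
profit <- 0.25 * sales + rnorm(n, sd = 0.05 * sales)
fit    <- lm(profit ~ sales)

# Hypothetical helper: regress the squared residuals on the predictors
# and use n * R^2 as a chi-squared statistic.
bp_test <- function(fit) {
  e2   <- residuals(fit)^2
  X    <- model.matrix(fit)[, -1, drop = FALSE]  # predictors without intercept
  aux  <- lm(e2 ~ X)
  stat <- length(e2) * summary(aux)$r.squared
  df   <- ncol(X)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

res <- bp_test(fit)
res
```

A small p-value is evidence against constant error variance.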

Residuals vs. X Values

gg_resX(model, plot.all = FALSE)

## $Sales

## 
## $Region

## 
## $Discount

Residual vs. Sales

There seems to be a subtle fanning effect in the residual plot: the spread of the residuals increases as Sales (on the x-axis) increases. This mirrors the residual vs. fitted plot, which is expected because Sales is by far the strongest predictor in the model.

This fanning effect suggests heteroscedasticity, meaning that the variance of the errors is not constant and that the homoscedasticity assumption of linear regression may be violated. Here the errors appear to be larger for higher Sales values.

While a slight fanning effect might not be a major concern for a large dataset, it’s something to consider. Here’s what this pattern might mean for our model:

The model’s profit predictions might be less precise for high Sales figures, where the residuals are more spread out.
The model might not capture the full relationship between the variables. There might be other factors influencing profit that our model doesn’t account for.
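One common response to heteroscedasticity is to keep the fitted coefficients but report heteroscedasticity-consistent standard errors. The sketch below computes the basic HC0 (White) estimator by hand on simulated data; the data and object names are hypothetical, and this is an illustration of the technique rather than a step taken in the analysis above.

```r
set.seed(7)
n      <- 400
sales  <- runif(n, 500, 2500)
profit <- 0.25 * sales + rnorm(n, sd = 0.05 * sales)  # variance grows with sales
fit    <- lm(profit ~ sales)

X <- model.matrix(fit)
e <- residuals(fit)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)   # each row of X is scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_robust  <- sqrt(diag(vcov_hc0))
cbind(estimate = coef(fit), se_classic, se_robust)
```

If the two sets of standard errors differ noticeably, the classical p-values should be read with caution.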

Residual vs. Discount

There is at most a slight pattern in the residuals against Discount. Because Discount is a predictor in the model, the least-squares residuals are uncorrelated with it by construction, so any visible pattern would have to be a non-linear trend (a sign of a misspecified functional form) or a change in spread (a sign of heteroscedasticity). Either way the effect here is weak and is unlikely to be a significant issue.

Residual vs. Region

The boxes for the regions have similar interquartile ranges, which suggests that the variance of the residuals is similar across regions.
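The comparison of box heights can also be made numerically. The following sketch, on simulated data with hypothetical region sizes, tabulates the number of observations and the residual spread per region; group sizes matter because a region with very few rows gives an unreliable box and a large standard error.

```r
set.seed(9)
# Hypothetical group sizes and values, for illustration only
region <- rep(c("Central", "East", "South", "West"), times = c(60, 50, 40, 30))
sales  <- runif(length(region), 500, 2500)
profit <- 0.25 * sales + rnorm(length(region), sd = 50)
fit    <- lm(profit ~ sales + region)

counts <- table(region)
spread <- data.frame(
  n   = as.vector(counts),
  iqr = as.vector(tapply(residuals(fit), region, IQR)),
  sd  = as.vector(tapply(residuals(fit), region, sd)),
  row.names = names(counts)
)
spread
```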

Residual Histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

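A residual histogram shows the distribution of the residuals; for a well-specified model it should be roughly symmetric and bell-shaped around zero. Beyond that, there is a curvature in the pattern of the residuals: they tend to be negative for lower predicted profits and positive for higher predicted profits. That pattern is a property of the residuals against the fitted values rather than of the histogram itself, and it suggests a non-linear relationship between the explanatory variables (Sales, Discount, Region) and Profit. A linear regression model assumes a straight-line relationship between the variables, so this curvature indicates that the linear model might not be the best fit for our data.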
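One way to probe curvature of this kind is to add a squared term and compare the two nested models with an F-test. The sketch below does this on simulated data in which the true relationship is mildly quadratic; the data frame `toy` and its coefficients are hypothetical.

```r
set.seed(3)
n      <- 300
sales  <- runif(n, 500, 2500)
# Simulated profit with a mild quadratic component
profit <- 50 + 0.1 * sales + 0.00006 * sales^2 + rnorm(n, sd = 40)
toy    <- data.frame(sales, profit)

fit_lin  <- lm(profit ~ sales, data = toy)
fit_quad <- lm(profit ~ sales + I(sales^2), data = toy)

# F-test for nested models: does the squared term improve the fit?
cmp <- anova(fit_lin, fit_quad)
cmp

# AIC gives a second view; lower is better
c(linear = AIC(fit_lin), quadratic = AIC(fit_quad))
```

A significant F-test together with a lower AIC would support the model with the extra term.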

QQ-Plot

gg_qqplot(model)

In a QQ plot, points that follow the straight diagonal line indicate good agreement between the residuals and a normal distribution. In our plot, most of the points fall relatively close to the diagonal line, suggesting that the quantiles of the residuals are similar to those of a normal distribution. Hence, the normality assumption appears reasonable.
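A formal normality test can complement the plot, with one practical caveat: `shapiro.test()` only accepts between 3 and 5000 values, which is fewer than the 9994 rows here. The sketch below therefore tests a random subsample; the vector `res` is simulated and only stands in for `residuals(model)`.

```r
set.seed(11)
n   <- 9994
res <- rnorm(n, sd = 100)  # hypothetical stand-in for residuals(model)

# shapiro.test() is limited to 5000 values, so draw a random subsample
idx <- sample(n, 5000)
sw  <- shapiro.test(res[idx])
sw$statistic
sw$p.value
```

With samples this large the test can flag deviations too small to matter, so the QQ plot remains the more useful guide.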

Cook’s Distance Plot

gg_cooksd(model)

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing missing values (`geom_segment()`).

## Warning: Removed 1 rows containing missing values (`geom_text()`).

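Several data points have Cook’s distance values much larger than the rest, so they may be exerting undue influence on the fitted regression. It is worth inspecting these observations and refitting the model without them to see whether the coefficients change materially. The warnings above show that one observation was dropped from the plot because its Cook’s distance is missing.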
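Flagging and checking influential points can be sketched in base R as follows. The data are simulated with one planted outlier, and the 4 / n cutoff is a common rule of thumb chosen for this illustration; it is an assumption and not necessarily the threshold that `gg_cooksd()` draws.

```r
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
x[n] <- 6     # planted high-leverage point ...
y[n] <- -10   # ... that lies far from the trend

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

threshold <- 4 / n  # rule-of-thumb cutoff (assumption)
flagged   <- which(cd > threshold)
flagged

# Refit without the flagged rows and compare coefficients
keep   <- setdiff(seq_len(n), flagged)
fit_wo <- lm(y ~ x, subset = keep)
rbind(all_rows = coef(fit), without_flagged = coef(fit_wo))
```

If the coefficients barely move after the refit, the flagged points are unusual but not driving the conclusions.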