Data Dive — GLMs

Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(data)

##    Order ID         CustomerName         Category         SubCategory       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      City            OrderDate            Region              Sales     
##  Length:9994        Length:9994        Length:9994        Min.   : 500  
##  Class :character   Class :character   Class :character   1st Qu.:1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1498  
##                                                           Mean   :1497  
##                                                           3rd Qu.:1995  
##                                                           Max.   :2500  
##     Discount          Profit           State            ProfitRange    
##  Min.   :0.1000   Min.   :  25.25   Length:9994        Min.   :0.0000  
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character   1st Qu.:0.0000  
##  Median :0.2300   Median : 320.78   Mode  :character   Median :0.0000  
##  Mean   :0.2268   Mean   : 374.94                      Mean   :0.2764  
##  3rd Qu.:0.2900   3rd Qu.: 525.63                      3rd Qu.:1.0000  
##  Max.   :0.3500   Max.   :1120.95                      Max.   :1.0000

Created a binary variable “ProfitRange” based on a specific range of profit values which can help in analyzing orders falling within certain profit brackets.

If Profit falls between $0 and $500, then ProfitRange = 0.
If Profit is outside the range of $0 and $500, then ProfitRange = 1.

Building a logistic regression model using this data to investigate which factors (such as sales, discounts, and region) influence whether an order falls within the high-profit bracket or not.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggplot2)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Checking the distribution of the binary variable, ‘ProfitRange’

table(data$ProfitRange)

## 
##    0    1 
## 7232 2762

There are 7232 values that fall within the profit range of $0 and $500 and 2762 values which fall oustide the range of $0 and $500.

Plotting the distribution of ‘ProfitRange’

ggplot(data, aes(x = factor(ProfitRange))) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(x = "ProfitRange", y = "Count") +
  ggtitle("Distribution of ProfitRange")

Building the logistic regression model

Using the binary column ProfitRange and explanatory variables Sales, Region and Discount.

model <- glm(ProfitRange ~ Sales + Region + Discount, data = data, family = "binomial")
coef(summary(model))

##                 Estimate   Std. Error      z value      Pr(>|z|)
## (Intercept) -5.427981130 1.454386e-01 -37.32147083 7.361650e-305
## Sales        0.002570929 5.835490e-05  44.05678613  0.000000e+00
## RegionEast   0.157009353 7.378261e-02   2.12799932  3.333714e-02
## RegionNorth -8.406727043 1.194681e+02  -0.07036798  9.439008e-01
## RegionSouth  0.163686368 8.496330e-02   1.92655389  5.403525e-02
## RegionWest   0.072646730 7.217321e-02   1.00656091  3.141458e-01
## Discount     0.372793862 3.536274e-01   1.05419962  2.917915e-01

Interpreting the Coeffecients of each variable:
Sales: For a one-unit increase in sales, the log odds of an order falling within the high-profit bracket increase by 0.00257. Hence, sales have a significant positive effect on the likelihood of an order falling within the high-profit bracket.

Region East: Sales from the East region are associated with a log odds increase of 0.157 in the likelihood of falling within the high-profit bracket compared to the reference region(North, South, or West) . The coefficient is statistically significant (p < 0.05), indicating that region has a significant effect on the likelihood of an order falling within the high-profit bracket.

Region North: Sales from the North region are associated with a log odds decrease of 8.407 in the likelihood of falling within the high-profit bracket compared to the reference region. But, the coefficient is not statistically significant (p > 0.05), indicating that the effect of the North region on profitability may not be significant.

Region South: Sales from the South region are associated with a log odds increase of 0.164 in the likelihood of falling within the high-profit bracket compared to the reference region. The coefficient is marginally significant (p = 0.054), indicating that region may have a modest effect on profitability.

Region West: Sales from the West region are associated with a log odds increase of 0.073 in the likelihood of falling within the high-profit bracket compared to the reference region. The coefficient is not statistically significant (p > 0.05), indicating that the effect of the West region on profitability may not be significant.

Discount: For a one-unit increase in discount, the log odds of a sale falling within the high-profit bracket increase by 0.373. The coefficient is not statistically significant (p > 0.05), indicating that the effect of discount on profitability may not be significant.

Plots to visualize the relationships between ‘ProfitRange’ and the other variables

#Bar plots to visualize the relationships:

ggplot(data, aes(x = factor(ProfitRange), y = Sales)) +
  geom_bar(stat = "summary", fun = "mean", fill = "purple", color = "lavender") +
  labs(x = "ProfitRange", y = "Mean Sales") +
  ggtitle("Mean Sales by ProfitRange")

ggplot(data, aes(x = factor(ProfitRange), y = Discount)) +
  geom_bar(stat = "summary", fun = "mean", fill = "purple", color = "lavender") +
  labs(x = "ProfitRange", y = "Mean Discount") +
  ggtitle("Mean Discount by ProfitRange")

ggplot(data, aes(x = Region, fill = factor(ProfitRange))) +
  geom_bar(position = "fill") +
  labs(x = "Region", y = "Proportion") +
  ggtitle("Proportion of ProfitRange by Region") +
  scale_fill_manual(values = c("1" = "purple", "0" = "lavender"))

Building confidence intervals for coefficients of sales

#Taking coefficient estimate and the standard error of sales
coefficient <- coef(model)["Sales"]
stderr <- coef(summary(model))["Sales", "Std. Error"]

#Computing the critical value for a 95% confidence interval for a two-tailed test
critical_value <- qnorm(0.975)  

#Computing confidenec intervals
ci_lower <- coefficient - critical_value * stderr
ci_upper <- coefficient + critical_value * stderr

cat("95% Confidence Interval for the Coefficient of 'Sales': [", ci_lower, ",", ci_upper, "]")

## 95% Confidence Interval for the Coefficient of 'Sales': [ 0.002456556 , 0.002685303 ]

The interval [0.002456556, 0.002685303] represents a range of possible values for the true coefficient of the “Sales” variable. This means that we are 95% confident that the true coefficient lies within this interval. The interval also indicates that, for each one-unit increase in “Sales,” the log odds of being in the profitable range are estimated to increase by an amount within this range. Therefore, we can conclude that there is a statistically significant relationship between “Sales” and the likelihood of being in the profitable range.

Plotting the confidence interval

ci_df <- data.frame(
  Estimate = coefficient,
  Lower = ci_lower,
  Upper = ci_upper
)

ggplot(ci_df, aes(x = "", y = Estimate, ymin = Lower, ymax = Upper)) +
  geom_pointrange(color = "blue", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "", y = "Coefficient Estimate", title = "95% Confidence Interval for Coefficient of 'Sales'") +
  theme_minimal() +
  coord_flip()

Considering transformation for the explanatory variable, ‘Discount’

set.seed(42)
#Generating random discount values between 0% and 100%
Discount <- runif(100, 0, 1)  
#Generating binary profit range data
ProfitRange <- rbinom(100, 1, 0.5)  
data <- data.frame(Discount = Discount, ProfitRange = ProfitRange)

#Creating a scatterplot of 'Discount' against 'ProfitRange' before transformation
p1 <- ggplot(data, aes(x = Discount, y = ProfitRange)) + 
  geom_point() +
  labs(title = "Before Transformation", x = "Discount", y = "ProfitRange")

# Creating a scatterplot of the square root of 'Discount' against 'ProfitRange' after transformation
p2 <- ggplot(data, aes(x = sqrt(Discount), y = ProfitRange)) + 
  geom_point() +
  labs(title = "After Transformation", x = "sqrt(Discount)", y = "ProfitRange")

# Combining the plots using patchwork
library(patchwork)
p1 + p2

Before the transformation, the discount increased linearly as the profit range increased. This means that the higher the profit range, the greater the discount. After the transformation, the relationship between the discount and the profit range is no longer linear. Instead, the discount increases at a slower rate as the profit range increases. This means that the transformation has made discounts less sensitive to changes in the profit range. In other words, the transformation has compressed the range of discounts that are applied.

The transformation might be needed if the supermarket wants to achieve consistent pricing or limit high-end discounts. However, if they value flexibility in promotions or need to incite sales on high-profit items, the current linear structure might be better.