library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data)
## Order ID CustomerName Category SubCategory
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## City OrderDate Region Sales
## Length:9994 Length:9994 Length:9994 Min. : 500
## Class :character Class :character Class :character 1st Qu.:1000
## Mode :character Mode :character Mode :character Median :1498
## Mean :1497
## 3rd Qu.:1995
## Max. :2500
## Discount Profit State ProfitRange
## Min. :0.1000 Min. : 25.25 Length:9994 Min. :0.0000
## 1st Qu.:0.1600 1st Qu.: 180.02 Class :character 1st Qu.:0.0000
## Median :0.2300 Median : 320.78 Mode :character Median :0.0000
## Mean :0.2268 Mean : 374.94 Mean :0.2764
## 3rd Qu.:0.2900 3rd Qu.: 525.63 3rd Qu.:1.0000
## Max. :0.3500 Max. :1120.95 Max. :1.0000
Created a binary variable “ProfitRange” based on a specific range of profit values which can help in analyzing orders falling within certain profit brackets.
If Profit falls between $0 and $500, then ProfitRange = 0.
If Profit is outside the range of $0 and $500, then ProfitRange = 1.
Building a logistic regression model using this data to investigate which factors (such as sales, discounts, and region) influence whether an order falls within the high-profit bracket or not.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggplot2)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
table(data$ProfitRange)
##
## 0 1
## 7232 2762
There are 7232 values that fall within the profit range of $0 and $500 and 2762 values which fall oustide the range of $0 and $500.
ggplot(data, aes(x = factor(ProfitRange))) +
geom_bar(fill = "skyblue", color = "black") +
labs(x = "ProfitRange", y = "Count") +
ggtitle("Distribution of ProfitRange")
Using the binary column ProfitRange and explanatory variables Sales, Region and Discount.
model <- glm(ProfitRange ~ Sales + Region + Discount, data = data, family = "binomial")
coef(summary(model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.427981130 1.454386e-01 -37.32147083 7.361650e-305
## Sales 0.002570929 5.835490e-05 44.05678613 0.000000e+00
## RegionEast 0.157009353 7.378261e-02 2.12799932 3.333714e-02
## RegionNorth -8.406727043 1.194681e+02 -0.07036798 9.439008e-01
## RegionSouth 0.163686368 8.496330e-02 1.92655389 5.403525e-02
## RegionWest 0.072646730 7.217321e-02 1.00656091 3.141458e-01
## Discount 0.372793862 3.536274e-01 1.05419962 2.917915e-01
Interpreting the Coeffecients of each variable:
Sales: For a one-unit increase in sales, the log odds of an
order falling within the high-profit bracket increase by 0.00257. Hence,
sales have a significant positive effect on the likelihood of an order
falling within the high-profit bracket.
Region East: Sales from the East region are associated with a log odds increase of 0.157 in the likelihood of falling within the high-profit bracket compared to the reference region(North, South, or West) . The coefficient is statistically significant (p < 0.05), indicating that region has a significant effect on the likelihood of an order falling within the high-profit bracket.
Region North: Sales from the North region are associated with a log odds decrease of 8.407 in the likelihood of falling within the high-profit bracket compared to the reference region. But, the coefficient is not statistically significant (p > 0.05), indicating that the effect of the North region on profitability may not be significant.
Region South: Sales from the South region are associated with a log odds increase of 0.164 in the likelihood of falling within the high-profit bracket compared to the reference region. The coefficient is marginally significant (p = 0.054), indicating that region may have a modest effect on profitability.
Region West: Sales from the West region are associated with a log odds increase of 0.073 in the likelihood of falling within the high-profit bracket compared to the reference region. The coefficient is not statistically significant (p > 0.05), indicating that the effect of the West region on profitability may not be significant.
Discount: For a one-unit increase in discount, the log odds of a sale falling within the high-profit bracket increase by 0.373. The coefficient is not statistically significant (p > 0.05), indicating that the effect of discount on profitability may not be significant.
#Bar plots to visualize the relationships:
ggplot(data, aes(x = factor(ProfitRange), y = Sales)) +
geom_bar(stat = "summary", fun = "mean", fill = "purple", color = "lavender") +
labs(x = "ProfitRange", y = "Mean Sales") +
ggtitle("Mean Sales by ProfitRange")
ggplot(data, aes(x = factor(ProfitRange), y = Discount)) +
geom_bar(stat = "summary", fun = "mean", fill = "purple", color = "lavender") +
labs(x = "ProfitRange", y = "Mean Discount") +
ggtitle("Mean Discount by ProfitRange")
ggplot(data, aes(x = Region, fill = factor(ProfitRange))) +
geom_bar(position = "fill") +
labs(x = "Region", y = "Proportion") +
ggtitle("Proportion of ProfitRange by Region") +
scale_fill_manual(values = c("1" = "purple", "0" = "lavender"))
#Taking coefficient estimate and the standard error of sales
coefficient <- coef(model)["Sales"]
stderr <- coef(summary(model))["Sales", "Std. Error"]
#Computing the critical value for a 95% confidence interval for a two-tailed test
critical_value <- qnorm(0.975)
#Computing confidenec intervals
ci_lower <- coefficient - critical_value * stderr
ci_upper <- coefficient + critical_value * stderr
cat("95% Confidence Interval for the Coefficient of 'Sales': [", ci_lower, ",", ci_upper, "]")
## 95% Confidence Interval for the Coefficient of 'Sales': [ 0.002456556 , 0.002685303 ]
The interval [0.002456556, 0.002685303] represents a range of possible values for the true coefficient of the “Sales” variable. This means that we are 95% confident that the true coefficient lies within this interval. The interval also indicates that, for each one-unit increase in “Sales,” the log odds of being in the profitable range are estimated to increase by an amount within this range. Therefore, we can conclude that there is a statistically significant relationship between “Sales” and the likelihood of being in the profitable range.
ci_df <- data.frame(
Estimate = coefficient,
Lower = ci_lower,
Upper = ci_upper
)
ggplot(ci_df, aes(x = "", y = Estimate, ymin = Lower, ymax = Upper)) +
geom_pointrange(color = "blue", size = 1.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(x = "", y = "Coefficient Estimate", title = "95% Confidence Interval for Coefficient of 'Sales'") +
theme_minimal() +
coord_flip()
set.seed(42)
#Generating random discount values between 0% and 100%
Discount <- runif(100, 0, 1)
#Generating binary profit range data
ProfitRange <- rbinom(100, 1, 0.5)
data <- data.frame(Discount = Discount, ProfitRange = ProfitRange)
#Creating a scatterplot of 'Discount' against 'ProfitRange' before transformation
p1 <- ggplot(data, aes(x = Discount, y = ProfitRange)) +
geom_point() +
labs(title = "Before Transformation", x = "Discount", y = "ProfitRange")
# Creating a scatterplot of the square root of 'Discount' against 'ProfitRange' after transformation
p2 <- ggplot(data, aes(x = sqrt(Discount), y = ProfitRange)) +
geom_point() +
labs(title = "After Transformation", x = "sqrt(Discount)", y = "ProfitRange")
# Combining the plots using patchwork
library(patchwork)
p1 + p2
Before the transformation, the discount increased linearly as the profit range increased. This means that the higher the profit range, the greater the discount. After the transformation, the relationship between the discount and the profit range is no longer linear. Instead, the discount increases at a slower rate as the profit range increases. This means that the transformation has made discounts less sensitive to changes in the profit range. In other words, the transformation has compressed the range of discounts that are applied.
The transformation might be needed if the supermarket wants to achieve consistent pricing or limit high-end discounts. However, if they value flexibility in promotions or need to incite sales on high-profit items, the current linear structure might be better.