library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data)
##    Order ID         CustomerName         Category         SubCategory
##  Length:9994        Length:9994        Length:9994        Length:9994
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##      City            OrderDate            Region              Sales
##  Length:9994        Length:9994        Length:9994        Min.   : 500
##  Class :character   Class :character   Class :character   1st Qu.:1000
##  Mode  :character   Mode  :character   Mode  :character   Median :1498
##                                                           Mean   :1497
##                                                           3rd Qu.:1995
##                                                           Max.   :2500
##     Discount          Profit            State            ProfitRange
##  Min.   :0.1000   Min.   :  25.25   Length:9994        Min.   :0.0000
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character   1st Qu.:0.0000
##  Median :0.2300   Median : 320.78   Mode  :character   Median :0.0000
##  Mean   :0.2268   Mean   : 374.94                      Mean   :0.2764
##  3rd Qu.:0.2900   3rd Qu.: 525.63                      3rd Qu.:1.0000
##  Max.   :0.3500   Max.   :1120.95                      Max.   :1.0000
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(boot)
library(broom)
library(ggthemes)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
head(data)
## # A tibble: 6 × 12
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## #   ProfitRange <dbl>
We choose ‘Sales’, ‘Discount’ and ‘Region’ as explanatory variables and ‘Profit’ as the response variable.
model <- lm(Profit ~ Sales + Region + Discount, data = data)
coef(summary(model))
##               Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept) -9.0622074 8.643228e+00 -1.0484749 0.29444524
## Sales        0.2514586 3.307932e-03 76.0168634 0.00000000
## RegionEast   8.7205046 5.339119e+00  1.6333228 0.10243265
## RegionNorth 93.7120824 1.910358e+02  0.4905473 0.62375743
## RegionSouth 12.7109614 6.183131e+00  2.0557483 0.03983272
## RegionWest   2.0423265 5.205469e+00  0.3923424 0.69481359
## Discount    10.8420699 2.560484e+01  0.4234383 0.67198463
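The intercept is -9.0622074, which is the estimated profit for an order in the reference region when Sales and Discount are both zero. This is an extrapolation, since the smallest observed Sales value is 500 and the smallest Discount is 0.10, so the intercept has no practical interpretation here. It is also not significantly different from zero (p = 0.294).
Sales: For every one-unit increase in Sales, Profit is estimated to increase by approximately 0.2514586 units, holding Region and Discount fixed. This is the only strongly significant predictor (t = 76.02).
Region (East, North, South, West): These coefficients represent the estimated difference in profit between each listed region and the reference region (the level omitted from the table), holding Sales and Discount fixed. Only South differs significantly from the reference region at the 5% level (p = 0.040). The standard error for North (about 191) is far larger than for the other regions, which usually signals very few observations in that region.
Discount: For every one-unit increase in Discount, Profit is estimated to increase by approximately 10.8420699 units. Since Discount only ranges from 0.10 to 0.35, a more useful reading is about 1.08 units of profit per 0.10 increase in Discount. This effect is not statistically significant (p = 0.672).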
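The reference region is whichever factor level R puts first, which is alphabetical unless it is set explicitly. The sketch below shows how the reference level could be inspected and changed, and how confidence intervals could be added to the estimates. It runs on simulated data rather than the Supermart file; the column names mirror ours, but the values and the region label "Central" are assumptions made for illustration.

```r
# Simulated stand-in for the Supermart data (values are made up for illustration).
set.seed(1)
n <- 500
sim <- data.frame(
  Sales    = runif(n, 500, 2500),
  Discount = runif(n, 0.10, 0.35),
  Region   = sample(c("Central", "East", "South", "West"), n, replace = TRUE)
)
sim$Profit <- 0.25 * sim$Sales + rnorm(n, sd = 50)

# The first factor level (alphabetical by default) becomes the reference
# and is absorbed into the intercept.
sim$Region <- factor(sim$Region)
levels(sim$Region)[1]            # reference level before releveling

# Choose the reference explicitly instead of relying on alphabetical order.
sim$Region <- relevel(sim$Region, ref = "South")
fit <- lm(Profit ~ Sales + Region + Discount, data = sim)

# Estimates with 95% confidence intervals; an interval that contains 0
# corresponds to a coefficient that is not significant at the 5% level.
ci_table <- cbind(Estimate = coef(fit), confint(fit))
round(ci_table, 3)
```

On the real data, the same `relevel()` and `confint()` calls would be applied to `data$Region` and `model`.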
vif(model)
##              GVIF Df GVIF^(1/(2*Df))
## Sales    1.000136  1        1.000068
## Region   1.000752  4        1.000094
## Discount 1.000676  1        1.000338
Sales: The GVIF value is very close to 1, indicating that there is no meaningful multicollinearity involving the Sales variable.
Region: Region has 4 degrees of freedom, so the adjusted value GVIF^(1/(2*Df)) is the one to compare across terms. It is also very close to 1, so Region is essentially unrelated to the other predictors.
Discount: The GVIF value for Discount is very close to 1 as well, indicating no meaningful multicollinearity involving the Discount variable.
In summary, all values are far below the commonly cited warning thresholds (around 5 to 10), so multicollinearity does not appear to be a concern in our model.
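To make the VIF values less of a black box, here is a minimal sketch of the definition for a single numeric predictor: regress that predictor on all the others and compute 1 / (1 - R^2). The data is simulated with independent predictors, and the helper `vif_by_hand()` is a hypothetical name introduced only for this illustration.

```r
# Simulated predictors that are independent by construction.
set.seed(2)
n <- 500
sim <- data.frame(
  Sales    = runif(n, 500, 2500),
  Discount = runif(n, 0.10, 0.35),
  Region   = factor(sample(c("Central", "East", "South", "West"), n, replace = TRUE))
)

# VIF for one predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors.
vif_by_hand <- function(target, others, data) {
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vif_sales    <- vif_by_hand("Sales", c("Discount", "Region"), sim)
vif_discount <- vif_by_hand("Discount", c("Sales", "Region"), sim)
c(Sales = vif_sales, Discount = vif_discount)
```

For a term with a single degree of freedom the GVIF reported by `vif()` coincides with this ordinary VIF. For a factor such as Region the GVIF is computed differently, which is why the adjusted column exists.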
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
There seems to be a subtle fanning effect in the residual plot: the spread of the residuals increases as the fitted values on the x-axis increase.
This fanning effect indicates a potential violation of the homoscedasticity assumption of linear regression (the spread of the residuals should be similar regardless of the fitted value).
While a slight fanning effect might not be a major concern for a large dataset, it is something to consider, because non-constant variance makes the reported standard errors and p-values less trustworthy.
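A visual impression of fanning can be backed up with a formal check. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R. It uses simulated data in which the error spread grows with sales by construction, so it illustrates the technique rather than reporting a result for our model.

```r
# Simulated data: the error spread grows with sales (heteroscedastic by design).
set.seed(3)
n <- 1000
sales  <- runif(n, 500, 2500)
profit <- 0.25 * sales + rnorm(n, sd = 0.05 * sales)
fit <- lm(profit ~ sales)

# Studentized (Koenker) Breusch-Pagan statistic: regress the squared
# residuals on the predictors; n * R^2 is approximately chi-squared with
# df = number of predictors when the variance is constant.
aux <- lm(resid(fit)^2 ~ sales)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value is evidence against constant variance. Since `car` is already loaded, `ncvTest(model)` offers a related score test on the real model.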
gg_resX(model, plot.all = FALSE)
## $Sales
##
## $Region
##
## $Discount
Residual vs. Sales
There seems to be a subtle fanning effect in this plot as well: the spread of the residuals increases as the Sales values on the x-axis increase.
A fanning effect suggests heteroscedasticity, where the variance of the errors is not constant. Here the errors appear to be larger for higher Sales values.
Here is what this pattern might mean for our model:
The model’s predictions might be less reliable for higher sales figures, where the residuals are most spread out.
The model might not capture the full relationship between the variables. There might be other factors influencing profit that our model does not account for.
Residual vs. Discount
There is a slight positive trend in the residuals, but it is not very strong. A trend in the average residual would point to a misspecified relationship with Discount rather than to non-constant variance. Since the trend is weak, it is unlikely to be a significant issue.
Residual vs. Region
Since the boxes for each region have similar interquartile ranges, the variance of the residuals appears to be similar across regions.
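The comparison of box heights can also be made numerically. The following sketch, on simulated data with made-up region labels, tabulates the number of observations, standard deviation, and interquartile range of the residuals per region, and runs Bartlett's test of equal variances as one possible formal check.

```r
# Simulated data with equal error variance in every region.
set.seed(4)
n <- 800
region <- factor(sample(c("Central", "East", "South", "West"), n, replace = TRUE))
sales  <- runif(n, 500, 2500)
profit <- 0.25 * sales + rnorm(n, sd = 60)
fit <- lm(profit ~ sales + region)

# Spread of the residuals within each region; similar values support the
# constant-variance assumption across groups. The counts matter too:
# a region with very few rows gives an unreliable spread estimate.
spread <- data.frame(
  n   = as.vector(table(region)),
  sd  = as.vector(tapply(resid(fit), region, sd)),
  iqr = as.vector(tapply(resid(fit), region, IQR)),
  row.names = levels(region)
)
spread

# Bartlett's test of equal variances (sensitive to non-normal residuals).
bartlett_p <- bartlett.test(resid(fit), region)$p.value
bartlett_p
```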
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plotted against the predicted profits, the residuals show some curvature: they tend to be negative for lower predicted profits and positive for higher predicted profits. This suggests a non-linear relationship between the explanatory variables (Sales, Discount, Region) and Profit. Because a linear regression model assumes a straight-line relationship between the variables, this curvature indicates that a purely linear model might not be the best fit for our data.
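One way to test whether such curvature is real is to add a squared term and compare the two nested models. This is a minimal sketch on simulated data that contains a mild quadratic component by construction; it is not a claim about what the Supermart data would show.

```r
# Simulated profit with a mild quadratic component in sales.
set.seed(5)
n <- 1000
sales  <- runif(n, 500, 2500)
profit <- 0.10 * sales + 6e-05 * sales^2 + rnorm(n, sd = 40)

fit_linear <- lm(profit ~ sales)
fit_quad   <- lm(profit ~ sales + I(sales^2))

# Nested-model F test: a small p-value means the squared term explains
# variation that the straight-line fit leaves in the residuals.
cmp <- anova(fit_linear, fit_quad)
cmp
p_quad <- cmp[["Pr(>F)"]][2]
```

For our model the analogous comparison would be between `Profit ~ Sales + Region + Discount` and the same formula with `I(Sales^2)` added.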
QQ-Plot
gg_qqplot(model)
In a QQ plot, points that follow the straight diagonal line indicate good agreement between the observed residuals and a normal distribution. In our plot, most of the points fall relatively close to the diagonal line, suggesting that the quantiles of the residuals are similar to those of a normal distribution. Hence, the normality assumption appears reasonable.
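The visual judgement from the QQ plot can be summarized with two simple numbers. The sketch below, on simulated data with normal errors by construction, computes the correlation between the sorted standardized residuals and the theoretical normal quantiles, plus the share of residuals in the tails.

```r
# Simulated regression with normally distributed errors.
set.seed(6)
n <- 1000
x <- runif(n, 500, 2500)
y <- 0.25 * x + rnorm(n, sd = 50)
fit <- lm(y ~ x)

# Sorted standardized residuals against theoretical normal quantiles:
# these are the coordinates a normal QQ plot draws.
r <- sort(rstandard(fit))
q <- qnorm(ppoints(n))

# A correlation near 1 means the points hug the reference line.
qq_cor <- cor(q, r)
qq_cor

# Tail check: share of standardized residuals beyond +/- 2
# (roughly 5% is expected under normality).
tail_share <- mean(abs(r) > 2)
tail_share
```

Deviations from normality usually show up first in the tails, so the second number is worth reading alongside the first.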
gg_cooksd(model)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_segment()`).
## Warning: Removed 1 rows containing missing values (`geom_text()`).
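Several data points have Cook’s distance values much larger than the others. These points could be exerting an undue influence on the fitted regression line, so they are worth inspecting individually, and the model is worth refitting without them to see how much the coefficients change. The warnings show that one observation was dropped from the plot because its value was missing, which can happen when an observation has a leverage of 1 (for example, the only order in its region).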
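A common follow-up is to flag points above a rule-of-thumb cutoff and refit without them. The sketch below uses simulated data with one planted unusual observation. The 4/n cutoff is a convention rather than a formal test, and all values are made up for illustration.

```r
# Simulated data with one planted influential observation.
set.seed(7)
n <- 300
x <- runif(n, 500, 2500)
y <- 0.25 * x + rnorm(n, sd = 50)
x[n] <- 6000   # far outside the range of the other x values
y[n] <- 200    # and far below the trend
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
threshold <- 4 / n                       # rule of thumb, not a formal test
flagged <- which(cd > threshold)
flagged

# Refit without the flagged points and compare the coefficients.
keep <- cd <= threshold
fit_trim <- lm(y ~ x, subset = keep)
rbind(all_points = coef(fit), without_flagged = coef(fit_trim))
```

If the coefficients barely move after the refit, the flagged points are unusual but not driving the conclusions.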