library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(data)
## Order ID CustomerName Category SubCategory
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## City OrderDate Region Sales
## Length:9994 Length:9994 Length:9994 Min. : 500
## Class :character Class :character Class :character 1st Qu.:1000
## Mode :character Mode :character Mode :character Median :1498
## Mean :1497
## 3rd Qu.:1995
## Max. :2500
## Discount Profit State
## Min. :0.1000 Min. : 25.25 Length:9994
## 1st Qu.:0.1600 1st Qu.: 180.02 Class :character
## Median :0.2300 Median : 320.78 Mode :character
## Mean :0.2268 Mean : 374.94
## 3rd Qu.:0.2900 3rd Qu.: 525.63
## Max. :0.3500 Max. :1120.95
Fitting a linear regression model with Sales as the response variable and Profit, Discount, Region, and an interaction term (the product of ‘Profit’ and ‘Discount’) as the explanatory variables.
#Checking the linearity between the explanatory variables (Profit, Discount, and Region) and the response variable (Sales).
library(ggplot2)
#Profit vs. Sales
ggplot(data, aes(x = Profit, y = Sales)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Profit", y = "Sales")
## `geom_smooth()` using formula = 'y ~ x'
#Discount vs. Sales
ggplot(data, aes(x = Discount, y = Sales)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Discount", y = "Sales")
## `geom_smooth()` using formula = 'y ~ x'
#Region vs. Sales
ggplot(data, aes(x = Region, y = Sales)) +
geom_boxplot() +
labs(x = "Region", y = "Sales")
The terms in our model are:
Intercept: Represents the estimated sales when “Profit” and
“Discount” are both zero and the order is in the reference region.
Continuous Variable (Profit): Shows the change in sales for a
one-unit increase in profit when discount is zero.
Discount Variable: Shows the change in sales for a one-unit
increase in discount when profit is zero.
Interaction Term: Shows how the effect of “Profit” on sales
changes as “Discount” increases (and vice versa).
Region Variables: Coefficients for different levels of
the categorical variable “Region”. Each coefficient shows the average
change in sales for the respective region compared to the reference
region.
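The way these terms map onto an R formula can be sketched as follows. This is a minimal illustration on simulated data, not the Supermart file: the data frame `toy`, its values, and the region labels `A` to `D` are all made up for the example. It shows that `Profit * Discount` already expands to both main effects plus the interaction, so listing them separately is equivalent.

```r
# Simulated stand-in for the real data (hypothetical values and region labels)
set.seed(1)
n <- 200
toy <- data.frame(
  Profit   = runif(n, 25, 1100),
  Discount = runif(n, 0.10, 0.35),
  Region   = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
toy$Sales <- 1000 + 1.4 * toy$Profit + rnorm(n, sd = 450)

# `Profit * Discount` expands to Profit + Discount + Profit:Discount,
# so the two formulas below describe the same model.
fit_long  <- lm(Sales ~ Profit + Discount + Profit:Discount + Region, data = toy)
fit_short <- lm(Sales ~ Profit * Discount + Region, data = toy)

# Same terms and same estimates (compared by name)
same_fit <- all.equal(coef(fit_long)[names(coef(fit_short))], coef(fit_short))
same_fit

# Region "A" is the reference level, so it has no coefficient of its own
names(coef(fit_short))
```

Referring to columns by name inside the formula, with `data = toy`, also keeps the coefficient labels readable in the summary output.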
response_variable <- data$Sales
continuous_variable <- data$Profit
discount_variable <- data$Discount
region_variable <- as.factor(data$Region)
my_model <- lm(response_variable ~ continuous_variable + discount_variable + continuous_variable * discount_variable + region_variable, data = data)
summary(my_model)
##
## Call:
## lm(formula = response_variable ~ continuous_variable + discount_variable +
## continuous_variable * discount_variable + region_variable,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -783.71 -367.01 -77.85 306.22 1359.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 976.51829 28.45583 34.317 <2e-16
## continuous_variable 1.42722 0.06084 23.457 <2e-16
## discount_variable -93.02319 113.75589 -0.818 0.414
## region_variableEast -13.65877 12.85807 -1.062 0.288
## region_variableNorth -290.56674 459.95574 -0.632 0.528
## region_variableSouth -9.55883 14.89166 -0.642 0.521
## region_variableWest 0.17259 12.53391 0.014 0.989
## continuous_variable:discount_variable 0.13493 0.25629 0.526 0.599
##
## (Intercept) ***
## continuous_variable ***
## discount_variable
## region_variableEast
## region_variableNorth
## region_variableSouth
## region_variableWest
## continuous_variable:discount_variable
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459.8 on 9986 degrees of freedom
## Multiple R-squared: 0.3666, Adjusted R-squared: 0.3662
## F-statistic: 825.8 on 7 and 9986 DF, p-value: < 2.2e-16
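Intercept: The estimated sales when “Profit” and “Discount” are both
zero, for the reference region, is around 976.52. This does not have a
meaningful interpretation, because neither value occurs in the data (the
minimum discount is 0.10 and the minimum profit is 25.25).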
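Profit: The coefficient for “Profit” is around 1.43,
indicating that, on average, a one-unit increase in profit is associated
with an increase of about 1.43 units in sales when discount is zero and
region is held fixed. This effect is highly significant (p < 2e-16).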
Discount: The coefficient for “Discount” is around
-93.02, but its p-value (0.414) is greater than the typical significance
level of 0.05, so we cannot conclude that “Discount” is a statistically
significant predictor of sales in this model, i.e., the observed
relationship between “Discount” and “Sales” could plausibly be due to
random variation rather than a true relationship.
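Region Variables: Each region coefficient compares that region with the
reference region (the level of “Region” that does not appear in the
output). All four coefficients (East, North, South, West) have p-values
greater than 0.05, so none of these regions differs significantly from
the reference region in sales. The standard error for North (about 460)
is far larger than the others, which suggests that very few orders fall
in that region.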
Interaction Term: The coefficient for the interaction
term is around 0.135. This represents the change in the slope of
“Profit” for a one-unit increase in “Discount” (equivalently, the change
in the slope of “Discount” for a one-unit increase in “Profit”). Its
p-value (0.599) is well above 0.05, so it is not statistically significant.
Residuals: The residuals are the differences
between observed and predicted values. The residual standard error is
459.8 on 9986 degrees of freedom.
R-squared: The value of 0.3666 suggests that
approximately 36.66% of the variability in sales is explained by the
model.
F-statistic: The F-statistic is 825.8, with a p-value
below 2.2e-16, indicating that at least one of the
predictors (like ‘Profit’) has a significant effect on sales.
Interaction Term (continuous_variable:discount_variable):
The interaction term is not statistically significant
(p = 0.599), so there is no evidence that the effect of “Profit” on
sales depends on the level of “Discount” in this model.
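To make the interaction coefficient concrete, the slope of “Profit” implied by the model at a given discount level is the Profit coefficient plus the interaction coefficient times the discount. The short sketch below uses the two estimates copied from the summary output; the helper name `profit_slope` is ours and only for illustration.

```r
# Estimates copied from the summary(my_model) output
b_profit      <- 1.42722   # slope of Profit when Discount = 0
b_interaction <- 0.13493   # change in the Profit slope per unit of Discount

# Implied slope of Profit at a given discount level
profit_slope <- function(discount) b_profit + b_interaction * discount

# Evaluate at the minimum, median, and maximum observed discount
slopes <- profit_slope(c(0.10, 0.23, 0.35))
slopes
```

Across the observed discount range (0.10 to 0.35) the implied slope only moves from about 1.44 to about 1.47, which is consistent with the interaction term not being statistically significant.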
cor(continuous_variable, discount_variable, method = "pearson")
## [1] 1.747623e-05
A correlation of approximately 0.00001748 between profit and discount means that there is almost no linear relationship between these two variables. As a result, they are unlikely to contribute to multicollinearity issues in our regression model.
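A pairwise correlation only covers the two main effects. One way to check all model columns, including the product term, is a variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing each column on the others. The sketch below is an assumption-laden illustration on simulated data (the vectors `profit` and `discount` are generated here, not read from the file) and `vif_by_hand` is a hypothetical helper written with base R only.

```r
set.seed(2)
n <- 500
profit   <- runif(n, 25, 1100)     # simulated, not the real data
discount <- runif(n, 0.10, 0.35)   # simulated, generated independently of profit

# VIF of one column = 1 / (1 - R^2) from regressing it on the other columns
vif_by_hand <- function(X) {
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

X_main <- cbind(profit = profit, discount = discount)
X_int  <- cbind(X_main, interaction = profit * discount)

vif_main <- vif_by_hand(X_main)  # near 1 when the two predictors are uncorrelated
vif_int  <- vif_by_hand(X_int)   # the product term is correlated with its components
vif_main
vif_int
```

In this simulation the main effects alone have VIFs close to 1, while adding the uncentred product term raises them, since the product is built from the same two columns. Whether the same happens in our data would need to be checked on the fitted model itself.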
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ stringr 1.5.1
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
gg_resfitted(my_model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Generally in a residual vs fitted value plot, the residuals should be randomly scattered around a horizontal line at zero. This would indicate that the model fits the data well, with no consistent over- or under-prediction at any given fitted value.
In our plot, the residuals do not appear to be randomly scattered. There seems to be a trend where the residuals are positive for lower fitted values and negative for higher fitted values. This pattern suggests that the model might not be capturing the relationship between the independent and dependent variables very well. A systematic trend in the residuals points to a violation of the linearity assumption; homoscedasticity is a separate assumption that concerns whether the spread of the residuals stays constant across the fitted values.
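A numerical companion to the residual vs fitted plot is to bin the fitted values and look at the mean residual in each bin. The sketch below does this on simulated data with a deliberately curved relationship, so the variables `x`, `y`, and `fit` are illustrative assumptions and not our model.

```r
set.seed(3)
n <- 1000
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(n)   # curved truth with constant noise (simulated)
fit <- lm(y ~ x)                # a straight-line fit misses the curvature

# Split the fitted values into five equal-sized bins
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), probs = seq(0, 1, 0.2)),
            include.lowest = TRUE)

# Mean residual per bin: values drifting away from 0 indicate that the
# model systematically over- or under-predicts in that range
bin_mean <- tapply(resid(fit), bins, mean)
round(bin_mean, 2)
```

Here the bin means are positive at both ends and negative in the middle, the numerical signature of a missed curve. The same binning could be applied to `fitted(my_model)` and `resid(my_model)`.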
residual_plots <- gg_resX(my_model)
print(residual_plots)
## TableGrob (3 x 1) "arrange": 3 grobs
## z cells name grob
## continuous_variable 1 (1-1,1-1) arrange gtable[layout]
## discount_variable 2 (2-2,1-1) arrange gtable[layout]
## region_variable 3 (3-3,1-1) arrange gtable[layout]
Residual vs. Continuous_variable
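There seems to be a trend where the residuals are positive for lower values of “Profit” and negative for higher values. A trend in the average residual suggests that the linear term does not fully capture the relationship with “Profit”, which concerns the linearity assumption rather than homoscedasticity.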
Discount_variable
There is a slight positive trend in the residuals, but it is not very strong. This suggests a possible weak departure from linearity in “Discount”, although it is unlikely to be a significant issue.
Region_variable
Since the boxes for each region variable have similar interquartile ranges, it might suggest that the variance of the residuals is similar across regions.
gg_reshist(my_model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram of the residuals shows a somewhat symmetrical distribution around zero. This may suggest that the residuals don’t have a strong positive or negative skew.
gg_qqplot(my_model)
Generally in a QQ Plot, data points that follow a straight diagonal line indicate good agreement between the observed data and a normal distribution. In our plot, most of the data points fall relatively close to the diagonal line, suggesting that the quantiles of the residuals are similar to those of a normal distribution. Hence, the normality assumption appears reasonable for this model.
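What the QQ plot compares can be reproduced by hand: the sorted, standardised residuals against the matching quantiles of a standard normal distribution. The sketch below uses simulated residuals (the vector `res` is generated here and is not taken from `my_model`) and summarises how closely the points follow a straight line with a single correlation.

```r
set.seed(4)
res <- rnorm(500, sd = 460)   # simulated residuals, not those of my_model

# Standardise and sort the residuals: these are the sample quantiles
sample_q <- sort((res - mean(res)) / sd(res))
# Matching quantiles of the standard normal distribution
theory_q <- qnorm(ppoints(length(res)))

# A correlation close to 1 means the points hug a straight line
qq_cor <- cor(theory_q, sample_q)
qq_cor
```

Replacing `res` with `resid(my_model)` would give the corresponding value for our model, and `qqnorm(res); qqline(res)` draws the same comparison in base R.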
gg_cooksd(my_model, threshold = 'matlab')
Several data points have Cook’s distance values much larger than the others. These points could be exerting undue influence on the fitted regression line, so the coefficient estimates may change noticeably if they are removed.
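One way to follow up is to flag the influential points and refit without them to see how much the coefficients move. The sketch below does this on simulated data with one planted outlier. It assumes that the 'matlab' threshold corresponds to three times the mean Cook’s distance; that reading of the lindia option should be confirmed against the package documentation.

```r
set.seed(5)
n <- 200
x <- runif(n, 0, 10)
x[1] <- 10                       # put the first point at a high-leverage position
y <- 3 + 2 * x + rnorm(n)        # simulated data, true slope is 2
y[1] <- y[1] + 30                # and make it an outlier in y

fit <- lm(y ~ x)
cd <- cooks.distance(fit)        # base R (stats) function

cutoff <- 3 * mean(cd)           # assumed equivalent of threshold = 'matlab'
influential <- which(cd > cutoff)
influential

# Refit on the remaining points and compare the coefficients
keep <- cd <= cutoff
fit_trim <- lm(y ~ x, subset = keep)
rbind(all = coef(fit), trimmed = coef(fit_trim))
```

If the two rows of coefficients are close, the flagged points matter little in practice; a large difference would justify inspecting those orders individually before deciding how to treat them.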