Data Dive — Regression Modeling

Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 11
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 3 more variables: Discount <dbl>, Profit <dbl>, State <chr>

Choosing ‘Sales’ as the response variable

“Sales” acts as a scorecard for how well a business is doing, and the goal is to find out which factors are associated with high or low sales. For that reason, ‘Sales’ is chosen as the response variable.

response_variable <- data$Sales

Choosing a categorical variable (‘Category’) that might influence ‘Sales’

categorical_variable <- data$Category
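
Before testing, it can help to look at the per-category means that the ANOVA will compare. The sketch below is an illustration only: it uses a small made-up data frame (`toy`, with invented sales figures) instead of the Supermart file, and only the column names and category labels mirror the real data.

```r
# Hypothetical toy data: the Sales values are invented, NOT taken from Supermart.csv
toy <- data.frame(
  Sales    = c(800, 1200, 1100, 900, 1500, 1300),
  Category = c("Beverages", "Food Grains", "Oil & Masala",
               "Beverages", "Food Grains", "Oil & Masala")
)

# Treat Category as a factor so each label becomes one group
toy$Category <- factor(toy$Category)

# Mean of Sales within each category (rows follow the factor level order)
group_means <- aggregate(Sales ~ Category, data = toy, FUN = mean)
print(group_means)
```

On the real data, the same call with `data = data` would give one mean per category, which are the quantities the null hypothesis below says are equal.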

Devising a null hypothesis for ANOVA

Null Hypothesis: The mean sales are the same across all categories.

alpha <- 0.05
anova_result <- aov(response_variable ~ categorical_variable, data = data)
summary(anova_result)

##                        Df    Sum Sq Mean Sq F value Pr(>F)
## categorical_variable    6 2.251e+06  375240   1.125  0.345
## Residuals            9987 3.331e+09  333549

The p-value for the categorical variable is 0.345. This is the probability of observing an F-statistic at least as large as 1.125 if the mean sales were truly the same across all categories.
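
As a rough check, the F-statistic and p-value can be recomputed by hand from the rounded numbers printed in the ANOVA table. This is a sketch under the assumption that the printed mean squares are accurate enough; small differences from the table are due to rounding.

```r
# Values copied from the ANOVA table above
ms_between <- 375240   # Mean Sq for categorical_variable
ms_within  <- 333549   # Mean Sq for Residuals
df_between <- 6        # number of categories minus 1
df_within  <- 9987     # residual degrees of freedom

# F is the ratio of the between-group to the within-group mean square
f_stat <- ms_between / ms_within

# The p-value is the upper-tail probability of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 3))
print(round(p_value, 3))
```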

Since 0.345 > 0.05 (the chosen significance level), we fail to reject the null hypothesis.

Conclusion:

There is not enough evidence to conclude that mean sales differ across the categories. The p-value is greater than the significance level, so the observed differences in mean sales are consistent with random variation.
Failing to reject the null hypothesis does not prove that the category means are equal. It only means that, at the 0.05 level, these data do not show a statistically significant effect of category on sales.
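
An effect size can put this result in context. Eta-squared (the share of total variation in Sales attributable to Category) is not part of the original analysis; it is added here only as an illustration, computed from the rounded sums of squares in the ANOVA table.

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_between <- 2.251e6   # Sum Sq for categorical_variable
ss_within  <- 3.331e9   # Sum Sq for Residuals

# Eta-squared: between-group variation as a fraction of total variation
eta_sq <- ss_between / (ss_between + ss_within)

print(signif(eta_sq, 3))
```

A value this close to zero would be in line with the ANOVA result: category accounts for well under 1% of the variation in sales.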

Choosing ‘Profit’ as a continuous variable influencing the response

continuous_variable <- data$Profit

Building a linear regression model

linear_model <- lm(response_variable ~ continuous_variable, data = data)
summary(linear_model)

## 
## Call:
## lm(formula = response_variable ~ continuous_variable, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -778.11 -366.57  -77.86  306.56 1360.19 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         950.24669    8.53214  111.37   <2e-16 ***
## continuous_variable   1.45718    0.01917   76.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 459.7 on 9992 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.3664 
## F-statistic:  5779 on 1 and 9992 DF,  p-value: < 2.2e-16

Coefficients:

Intercept (950.25):

The intercept (950.25) is the estimated value of the response variable (“Sales”) when Profit is zero.
In the context of our data, zero profit might not be a realistic case because businesses usually aim to make a profit. The intercept therefore serves mainly to anchor the fitted line, and its specific value (950.25) might not have a clear practical meaning.

Continuous Variable (1.45718):

The coefficient of the continuous variable (1.45718) is the estimated average change in the response variable (Sales) when Profit increases by one unit.
So, for every additional unit of profit, Sales are estimated to be 1.45718 units higher on average.
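
To make the two coefficients concrete, the fitted line can be evaluated at a few profit values. The helper `predict_sales` below is a hypothetical name introduced for illustration, and the profit values are arbitrary examples, not rows from the data.

```r
# Coefficients copied from the regression summary above
intercept <- 950.24669
slope     <- 1.45718

# Hypothetical helper: fitted Sales for a given Profit
predict_sales <- function(profit) {
  intercept + slope * profit
}

# Fitted sales at profit = 0, 100 and 500 (arbitrary example values)
print(predict_sales(c(0, 100, 500)))

# Moving profit up by one unit changes the fitted value by the slope
print(predict_sales(101) - predict_sales(100))
```

With the fitted model object itself, `predict(linear_model, newdata = ...)` would be the usual route; the manual version is shown only to expose the arithmetic.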

The positive coefficient indicates a positive linear relationship between Profit and Sales.

The high t-value (76.02) and low p-value (< 2e-16) for this coefficient indicate that the relationship is statistically significant.
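
The t-value is simply the estimate divided by its standard error, and in a regression with a single predictor its square matches the overall F-statistic. The sketch below redoes that arithmetic from the rounded values in the summary, so the results agree with the printed 76.02 and 5779 only up to rounding.

```r
# Values copied from the coefficient table above (rounded as printed)
estimate  <- 1.45718
std_error <- 0.01917
df_resid  <- 9992

# t-value: how many standard errors the estimate is away from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

print(round(t_value, 2))
print(p_value < 2.2e-16)   # far below any usual significance level
print(round(t_value^2))    # close to the F-statistic in the summary
```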

The R-squared value of 0.3664 suggests that the model explains about 36.64% of the variance in the response variable.
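
The R-squared value can be cross-checked against the F-statistic and degrees of freedom reported in the same summary. This is a sketch based on the rounded printed values; the implied correlation is an added illustration, not a figure from the original output.

```r
# Values copied from the last lines of the regression summary
f_stat   <- 5779
df_model <- 1
df_resid <- 9992

# R-squared recovered from the F-statistic
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)

# Sample size implied by the degrees of freedom
n <- df_model + df_resid + 1

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# With one predictor, the correlation is the square root of R-squared
# (positive here because the slope is positive)
r <- sqrt(r_squared)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared, r = r), 4))
```

The implied sample size equals the 9994 rows reported when the file was loaded, which is a quick sanity check that no rows were dropped during fitting.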

Conclusion:

Based on the model, Profit appears to have a statistically significant, positive relationship with Sales.
Businesses might consider focusing on increasing profit to optimize sales.
However, we should remember that correlation does not imply causation, and other factors not included in the model could also influence sales.