Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 3 more variables: Discount <dbl>, Profit <dbl>, State <chr>
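
The column-specification message above appears because `read_csv()` had to guess the column types. Supplying `col_types` pins the types and silences the message. The sketch below is only an illustration: it reads a three-row inline sample rather than the real file (the rows are copied from the preview above, keeping four of the eleven columns), and storing `Category` as a factor is an assumption, not something the original analysis does.

```r
library(readr)

# Inline sample standing in for Supermart.csv; rows copied from the preview above.
sample_csv <- I("Order ID,Category,Region,Sales
OD1,Oil & Masala,North,1254
OD2,Beverages,South,749
OD3,Food Grains,West,2360
")

sample_data <- read_csv(
  sample_csv,
  col_types = cols(
    `Order ID` = col_character(),
    Category   = col_factor(),    # assumption: keep the grouping column as a factor
    Region     = col_character(),
    Sales      = col_double()
  )
)

# Inspect the resulting column types.
str(sample_data)
```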

Choosing ‘Sales’ as the response variable

‘Sales’ is a direct measure of how well the business is performing, and the goal is to identify which factors are associated with higher or lower sales. ‘Sales’ is therefore chosen as the response variable.

response_variable <- data$Sales

Choosing a categorical variable (‘Category’) that might influence ‘Sales’

categorical_variable <- data$Category

Devising a null hypothesis for ANOVA

Null Hypothesis: The mean sales are the same across all categories.

alpha <- 0.05
anova_result <- aov(response_variable ~ categorical_variable, data = data)
summary(anova_result)
##                        Df    Sum Sq Mean Sq F value Pr(>F)
## categorical_variable    6 2.251e+06  375240   1.125  0.345
## Residuals            9987 3.331e+09  333549
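
As an aside on reading the `Df` column: a one-way ANOVA has (number of groups - 1) degrees of freedom for the factor and (number of observations - number of groups) for the residuals. The sketch below shows this on a small simulated data frame (the `toy` values are made up for illustration and are not the Supermart data). It also refers to columns by name in the formula, which is an alternative to passing extracted vectors.

```r
# Simulated stand-in data: 3 categories with 20 observations each (made-up values).
set.seed(1)
toy <- data.frame(
  Category = rep(c("Beverages", "Food Grains", "Snacks"), each = 20),
  Sales    = round(runif(60, min = 500, max = 2500))
)

# Column names in the formula give readable term labels in the ANOVA table.
toy_fit   <- aov(Sales ~ Category, data = toy)
toy_table <- summary(toy_fit)[[1]]
print(toy_table)

# Degrees of freedom: (3 - 1) for Category and (60 - 3) for the residuals.
toy_df <- toy_table[["Df"]]
toy_df
```

In the real table above, the same rule gives 7 - 1 = 6 for the seven categories and 9994 - 7 = 9987 for the residuals.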

The p-value for the categorical variable is 0.345. This is the probability of observing an F-statistic at least as large as 1.125 if there were no actual differences in mean sales between the groups (categories).

Since 0.345 > 0.05 (the chosen significance level), we fail to reject the null hypothesis.
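
The p-value can be reproduced from the F value and the two degrees of freedom in the ANOVA table, since it is the upper-tail area of the F distribution. A minimal sketch using only the rounded numbers printed above (the variable names are illustrative):

```r
alpha  <- 0.05
f_stat <- 1.125   # F value from the ANOVA table
df1    <- 6       # degrees of freedom for the categorical variable
df2    <- 9987    # residual degrees of freedom

# Upper-tail probability: P(F >= f_stat) when the null hypothesis is true.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
round(p_value, 3)

# Decision rule: reject the null hypothesis only if the p-value is below alpha.
reject_null <- p_value < alpha
reject_null
```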

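Conclusion:

At the 5% significance level, the data provide no evidence that mean sales differ across categories.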

Choosing ‘Profit’ as a continuous variable influencing the response

continuous_variable <- data$Profit

Building a linear regression model

linear_model <- lm(response_variable ~ continuous_variable, data = data)
summary(linear_model)
## 
## Call:
## lm(formula = response_variable ~ continuous_variable, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -778.11 -366.57  -77.86  306.56 1360.19 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         950.24669    8.53214  111.37   <2e-16 ***
## continuous_variable   1.45718    0.01917   76.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 459.7 on 9992 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.3664 
## F-statistic:  5779 on 1 and 9992 DF,  p-value: < 2.2e-16

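Coefficients:

Intercept (950.25): the predicted sales when profit is zero.

Continuous Variable (1.45718): each one-unit increase in profit is associated with an average increase of about 1.46 in sales.

The positive coefficient for the continuous variable (‘Profit’) indicates a positive linear relationship between profit and sales.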
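
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below plugs the two estimates from the summary into the regression equation; `predict_sales` is a hypothetical helper written for this illustration, not part of the original analysis.

```r
intercept <- 950.24669   # estimated sales when profit is 0
slope     <- 1.45718     # estimated change in sales per one-unit change in profit

# Hypothetical helper: fitted sales for a given profit value.
predict_sales <- function(profit) intercept + slope * profit

# Fitted sales at a few profit levels.
predict_sales(c(0, 100, 500))

# Predictions one profit unit apart differ by exactly the slope.
predict_sales(101) - predict_sales(100)
```

With the fitted model object available, `predict(linear_model)` returns the same kind of fitted values for the observed data.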

The high t-value (76.02) and low p-value (< 2e-16) for the slope indicate that the relationship is statistically significant.

The R-squared value of 0.3664 suggests that the model explains about 36.64% of the variance in the response variable.
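
The numbers in the summary are linked, which gives a quick sanity check. In a regression with a single predictor, the t value is the estimate divided by its standard error, the F-statistic equals the squared t value, and R-squared equals F / (F + residual degrees of freedom). A short sketch using the rounded values printed above, so the results agree only up to rounding:

```r
estimate  <- 1.45718   # slope estimate
std_error <- 0.01917   # standard error of the slope
resid_df  <- 9992      # residual degrees of freedom
f_reported <- 5779     # F-statistic from the summary

# t value: estimate divided by its standard error.
t_value <- estimate / std_error
t_value

# With one predictor, F equals t squared (up to rounding of the inputs).
t_value^2

# R-squared recovered from the F-statistic and residual degrees of freedom.
r_squared <- f_reported / (f_reported + resid_df)
round(r_squared, 4)
```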

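Conclusion:

Profit is a statistically significant predictor of sales, but it explains only about 37% of the variance, so most of the variation in sales is associated with other factors.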