library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (3): Sales, Discount, Profit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 3 more variables: Discount <dbl>, Profit <dbl>, State <chr>
“Sales” acts as a scorecard for how well the business is doing, and we want to identify which factors are associated with high or low sales. We therefore chose ‘Sales’ as the response variable.
response_variable <- data$Sales
categorical_variable <- data$Category
Null Hypothesis: The mean sales are the same across all categories.
Alternative Hypothesis: At least one category has a mean sales value that differs from the others.
alpha <- 0.05
anova_result <- aov(response_variable ~ categorical_variable, data = data)
summary(anova_result)
##                        Df    Sum Sq Mean Sq F value Pr(>F)
## categorical_variable    6 2.251e+06  375240   1.125  0.345
## Residuals            9987 3.331e+09  333549
The p-value for the categorical variable is 0.345. This p-value is the probability of observing an F-statistic at least as large as the one obtained (1.125) if there were no actual differences between the group (category) means.
Since 0.345 > 0.05 (the chosen significance level), we fail to reject the null hypothesis.
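As a quick check on the p-value reported above, the upper tail of the F distribution can be evaluated directly from the numbers in the ANOVA table. This is only a sketch: the values are typed in by hand from the printed output, so they carry its rounding, and the names `f_stat`, `df_between` and `df_within` are ours, chosen for illustration.

```r
# Values copied by hand from the printed ANOVA table (rounded as displayed).
f_stat     <- 1.125   # F value for categorical_variable
df_between <- 6       # degrees of freedom for the categories
df_within  <- 9987    # residual degrees of freedom

# Upper-tail probability: the chance of an F at least this large
# when the null hypothesis (equal category means) is true.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Compare against the significance level used in the analysis.
alpha <- 0.05
reject_null <- p_value < alpha

round(p_value, 3)
reject_null
```

The result should land close to the 0.345 shown by `summary(anova_result)`, with any small gap coming from the rounded F value.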
Conclusion:
There is not enough evidence to conclude that there are significant differences in sales across the categories. The p-value is greater than the significance level, suggesting that any observed differences in mean sales could be due to random chance.
Based on this analysis, Category does not show a statistically significant association with sales at the 5% significance level. This does not prove that the category means are identical, only that the data do not show a detectable difference between them.
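To put this result in perspective, one could also look at the share of the total variation in Sales that the categories account for (often called eta squared). The sketch below is our addition rather than part of the original analysis. It uses the rounded sums of squares from the ANOVA table, and the variable names are illustrative.

```r
# Sums of squares copied by hand from the ANOVA table (rounded as displayed).
ss_between <- 2.251e6   # Sum Sq for categorical_variable
ss_within  <- 3.331e9   # Sum Sq for Residuals

# Eta squared: proportion of total variation attributed to the categories.
eta_sq <- ss_between / (ss_between + ss_within)

# Express as a percentage for easier reading.
eta_sq_percent <- 100 * eta_sq
eta_sq_percent
```

The share comes out at well under one tenth of one percent, which is in line with the non-significant F-test.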
continuous_variable <- data$Profit
linear_model <- lm(response_variable ~ continuous_variable, data = data)
summary(linear_model)
##
## Call:
## lm(formula = response_variable ~ continuous_variable, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -778.11 -366.57  -77.86  306.56 1360.19
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         950.24669    8.53214  111.37   <2e-16 ***
## continuous_variable   1.45718    0.01917   76.02   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459.7 on 9992 degrees of freedom
## Multiple R-squared: 0.3664, Adjusted R-squared: 0.3664
## F-statistic: 5779 on 1 and 9992 DF, p-value: < 2.2e-16
Intercept (950.25):
The intercept (950.25) represents the estimated value of the response variable (“Sales”) when Profit is zero.
In the context of our data, an order with exactly zero profit may not be a meaningful case, because businesses usually aim to make a profit. The intercept is therefore best read as the baseline of the fitted line rather than as a quantity with a clear practical meaning.
Continuous Variable (1.45718):
The coefficient on the continuous variable, Profit (1.45718), tells us how much the response variable (Sales) is estimated to change when Profit increases by one unit.
So, for every additional unit of profit, Sales are estimated to increase by about 1.46 units, on average.
The positive coefficient indicates a positive linear relationship between Profit and Sales.
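To make the intercept and slope concrete, the fitted line can be written out as a small function. This is a sketch using the coefficients typed in from the printed summary. `predict_sales` is a hypothetical helper of ours, and with the fitted object available one would normally call `predict(linear_model, ...)` instead.

```r
# Coefficients copied by hand from the printed model summary.
intercept <- 950.24669   # estimated Sales when Profit is zero
slope     <- 1.45718     # estimated change in Sales per unit of Profit

# Hypothetical helper: estimated Sales for a given Profit under the fitted line.
predict_sales <- function(profit) {
  intercept + slope * profit
}

# Estimated Sales at a few illustrative Profit values.
estimates <- predict_sales(c(0, 100, 500))
estimates

# Moving Profit up by one unit changes the estimate by exactly the slope.
one_unit_change <- predict_sales(101) - predict_sales(100)
one_unit_change
```

At a Profit of zero the function returns the intercept, and each extra unit of Profit adds 1.45718 to the estimate. The Profit values 0, 100 and 500 are arbitrary inputs chosen for illustration, not values taken from the data.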
The high t-value (76.02) and low p-value (< 2.2e-16) indicate that the relationship is statistically significant.
The R-squared value of 0.3664 suggests that the model explains about 36.64% of the variance in the response variable.
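With a single predictor, the numbers in the summary are tied together: the F-statistic equals the square of the slope's t-value, and it can also be recovered from R-squared and the residual degrees of freedom. The sketch below checks this using the rounded values from the printed output, so small discrepancies are expected. The variable names are ours.

```r
# Values copied by hand from the printed model summary (rounded as displayed).
r_squared <- 0.3664
t_slope   <- 76.02
df_resid  <- 9992

# Correlation between Sales and Profit; positive because the slope is positive.
correlation <- sqrt(r_squared)

# Two routes to the overall F-statistic in simple linear regression.
f_from_t  <- t_slope^2
f_from_r2 <- r_squared / (1 - r_squared) * df_resid

c(correlation = correlation, f_from_t = f_from_t, f_from_r2 = f_from_r2)
```

Both routes should come out near the reported F-statistic of 5779, and the implied correlation between Sales and Profit is roughly 0.6.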
Based on the model, Profit has a statistically significant, positive association with Sales.
Businesses might therefore look at what drives Profit when trying to improve Sales.
However, we should remember that correlation does not imply causation, and other factors not included in the model could also influence sales.