df <- read.csv('/Users/fahadmehfooz/Desktop/IUPUI/First Semester/Intro to Statistics/Intro to Stats Dataset/Dataset 1/Superstore.csv')
colnames(df)
##  [1] "Row.ID"        "Order.ID"      "Order.Date"    "Ship.Date"    
##  [5] "Ship.Mode"     "Customer.ID"   "Customer.Name" "Segment"      
##  [9] "Country"       "City"          "State"         "Postal.Code"  
## [13] "Region"        "Product.ID"    "Category"      "Sub.Category" 
## [17] "Product.Name"  "Sales"         "Quantity"      "Discount"     
## [21] "Profit"

Sales is our response variable.

Sub.Category is the categorical variable that I believe influences the sales of a superstore.

H0: Subcategory has no effect on sales (mean sales are equal across all subcategories).

Ha: Subcategory has an effect on sales (mean sales differ for at least one subcategory).

# If there are more than 10 subcategories, consolidate them
if (length(unique(df$Sub.Category)) > 10) {
  # Keep the 10 most frequent subcategories and group the rest into an "Other" category
  subcat_counts <- table(df$Sub.Category)
  small_subcats <- names(subcat_counts)[order(subcat_counts)][1:(length(subcat_counts)-10)]
  df$Sub.Category[df$Sub.Category %in% small_subcats] <- 'Other'
}
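
The consolidation above relies on Sub.Category being a character column, which is what read.csv returns by default in R 4.0 and later; with a factor column, assigning the new value "Other" would produce NA values instead. A minimal sketch of the same lumping idea on a made-up vector (the values and the cutoff `keep_n` are hypothetical, not taken from the Superstore data):

```r
# Toy stand-in for df$Sub.Category; these values are invented for illustration.
subcat <- c("A", "A", "A", "B", "B", "C", "D")
keep_n <- 2  # hypothetical cutoff (the report keeps the 10 largest)

# Count each level and keep the names of the keep_n most frequent ones.
counts <- sort(table(subcat), decreasing = TRUE)
keep <- names(counts)[seq_len(min(keep_n, length(counts)))]

# Everything outside the kept set is relabelled "Other".
lumped <- ifelse(subcat %in% keep, subcat, "Other")
print(table(lumped))
```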

result <- aov(Sales ~ Sub.Category, data = df)
anova_summary <- summary(result)
print(anova_summary)
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## Sub.Category   10 3.111e+08 31105056   86.97 <2e-16 ***
## Residuals    9983 3.571e+09   357666                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
significance_level <- 0.05
p_value <- anova_summary[[1]]$'Pr(>F)'[1]

if (p_value < significance_level) {
  print("Reject the null hypothesis: Sales differ among the subcategories.")
} else {
  print("Do not reject the null hypothesis: There's no significant difference in sales among the subcategories.")
}
## [1] "Reject the null hypothesis: Sales differ among the subcategories."
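
As a sanity check on the ANOVA table above, the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with 10 and 9983 degrees of freedom. A small sketch using the rounded numbers printed in the output (so the result matches only to the printed precision):

```r
# Values copied from the printed ANOVA table (rounded as shown there).
ms_between <- 31105056   # Mean Sq for Sub.Category
ms_within  <- 357666     # Mean Sq for Residuals
df_between <- 10
df_within  <- 9983

# F statistic = between-group mean square / within-group mean square.
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value.
p_val <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("F:", round(f_stat, 2), "\n")
cat("p-value below 2e-16:", p_val < 2e-16, "\n")
```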

Quantity is another numeric column that might influence the response variable.

# Create the scatter plot of Quantity against Sales
plot(df$Quantity, df$Sales,
     main="Scatter Plot of Quantity vs Sales",
     xlab="Quantity",
     ylab="Sales",
     col="blue",
     pch=16)

The relationship looks roughly linear, although we might have to apply a transformation to bring it closer to a linear pattern.
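
If a transformation turns out to be needed, one common option is to model the logarithm of the response. The sketch below is only an illustration on a small made-up data frame (`toy` and its values are hypothetical, not the Superstore data), and it assumes a strictly positive response, since the log of zero or a negative number cannot be used in a regression.

```r
# Made-up data in which the response grows multiplicatively with the predictor.
toy <- data.frame(Quantity = 1:8)
toy$Sales <- 10 * exp(0.3 * toy$Quantity)

# Fit on the raw scale and on the log scale of the response.
fit_raw <- lm(Sales ~ Quantity, data = toy)
fit_log <- lm(log(Sales) ~ Quantity, data = toy)

# On the log scale the slope is the change in log(Sales) per extra unit.
print(coef(fit_raw))
print(coef(fit_log))
```

On the real data the analogous call would be lm(log(Sales) ~ Quantity, data = df), provided every Sales value is positive.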

Building and checking the fit of the model

# Building the linear regression model
model <- lm(Sales ~ Quantity, data=df)

# Evaluate the fit
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Sales ~ Quantity, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -790.4  -181.9  -114.6     2.0 22284.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.725     12.063   1.386    0.166    
## Quantity      56.242      2.745  20.489   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 610.6 on 9992 degrees of freedom
## Multiple R-squared:  0.04032,    Adjusted R-squared:  0.04022 
## F-statistic: 419.8 on 1 and 9992 DF,  p-value: < 2.2e-16

Hypothesis Test 1:

H0: The intercept is 0. HA: The intercept != 0

Result: The t-value is 1.386 and the p-value is 0.166. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: the intercept is not statistically different from zero.

Hypothesis Test 2:

H0: Coefficient of Quantity is 0. HA: Coefficient of Quantity != 0

Result: The t-value is 20.489, and the p-value is <2e-16, which is very close to zero. Since the p-value is much less than 0.05, we reject the null hypothesis. This suggests that the coefficient for Quantity is statistically significant, meaning Quantity has a significant relationship with the response variable.

Overall model fit hypothesis test:

H0: All regression coefficients (except the intercept) are equal to zero. HA: At least one coefficient is not equal to zero.

Result: The F-statistic is 419.8 with a p-value of <2.2e-16. Given this very small p-value, we reject the null hypothesis. This implies that the single predictor, Quantity, is useful in predicting the response. Note, however, that R-squared is only 0.0403, so Quantity explains about 4% of the variation in Sales.
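
The three test statistics above are linked by simple arithmetic: each t-value is the estimate divided by its standard error, and with a single predictor the overall F statistic equals the squared t-value of the slope. A short sketch using the rounded values printed in the summary (so results agree only to that precision):

```r
# Estimates and standard errors copied from the coefficient table above.
est <- c(Intercept = 16.725, Quantity = 56.242)
se  <- c(Intercept = 12.063, Quantity = 2.745)
df_resid <- 9992

# t statistic = estimate / standard error.
t_val <- est / se

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(abs(t_val), df_resid, lower.tail = FALSE)

print(round(t_val, 3))
print(signif(p_val, 3))

# With one predictor, the overall F statistic is the squared slope t value.
f_stat <- unname(t_val["Quantity"]^2)
cat("F:", round(f_stat, 1), "\n")
```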

# Extract coefficients
coefficients <- model_summary$coefficients
print(coefficients)
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 16.72524  12.062925  1.38650 1.656254e-01
## Quantity    56.24188   2.745015 20.48873 2.019935e-91
# Extract R-squared
r_squared <- model_summary$r.squared

# Extract Adjusted R-squared
adj_r_squared <- model_summary$adj.r.squared

# Extract Residual Standard Error
residual_se <- model_summary$sigma

# Print the remaining extracted details
cat("R-squared:", r_squared, "\n")
## R-squared: 0.04031854
cat("Adjusted R-squared:", adj_r_squared, "\n")
## Adjusted R-squared: 0.0402225
cat("Residual Standard Error:", residual_se, "\n")
## Residual Standard Error: 610.5822

Interpretation of Coefficients:

- Intercept (16.72524): The intercept suggests that when no items are sold (Quantity = 0), the response variable (Sales) would be approximately 16.73. This doesn’t make practical sense: if no items are sold, one would expect no revenue. The value is an extrapolation of the fitted line to Quantity = 0, and since it is not statistically significant, we don’t put much emphasis on it.

- Quantity (56.24188): Each additional item sold is associated with an average increase in Sales of about $56.24.
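
To make the interpretation concrete, the fitted line can be evaluated by hand. The sketch below uses the coefficients printed above; the quantities in `qty` are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the fitted model output above.
b0 <- 16.72524   # intercept
b1 <- 56.24188   # slope for Quantity

# Fitted Sales for a given order quantity under the simple linear model.
predict_sales <- function(quantity) b0 + b1 * quantity

qty <- c(1, 3, 5)   # hypothetical quantities chosen for illustration
print(data.frame(Quantity = qty, PredictedSales = predict_sales(qty)))

# The difference between consecutive quantities is always the slope.
print(predict_sales(4) - predict_sales(3))
```

With the fitted object available, predict(model, newdata = data.frame(Quantity = qty)) should give the same fitted values up to rounding of the coefficients.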

Recommendations & Interpretations:

Profitability: The response variable here is Sales (revenue), so each additional item sold is associated with about $56.24 of extra revenue, not profit. Profit would be this value minus the associated costs, so knowing the costs would give a clearer picture of profitability.

Inventory Management: Since revenue increases consistently with the number of items sold, it might be worth ensuring that high-volume items are always in stock and perhaps prominently displayed.

Diagnostic plots for the model

par(mfrow=c(2,2))
plot(model)

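The diagnostic plots look broadly acceptable, although the residual summary above (median -114.6, maximum 22284.3) suggests a strongly right-skewed error distribution, so the normal Q-Q plot deserves a closer look.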

Adding one more variable, Discount, to the model to see how the model changes

Discount would directly affect the net profit and even the sales. If a good discount is offered, a customer is more likely to make the purchase.

# Building the linear regression model
model2 <- lm(Sales ~ Quantity + Discount, data=df)

# evaluate the fit
model_summary <- summary(model2)
print(model_summary)
## 
## Call:
## lm(formula = Sales ~ Quantity + Discount, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -775.8  -188.3  -113.7     2.0 22315.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   30.562     12.881   2.373  0.01768 *  
## Quantity      56.314      2.744  20.523  < 2e-16 ***
## Discount     -90.335     29.574  -3.055  0.00226 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 610.3 on 9991 degrees of freedom
## Multiple R-squared:  0.04121,    Adjusted R-squared:  0.04102 
## F-statistic: 214.7 on 2 and 9991 DF,  p-value: < 2.2e-16

Changes after including the new variable Discount:

The intercept has almost doubled (from 16.725 to 30.562) and is now statistically significant (p < 0.05) in the new model. The coefficient for Quantity remains almost the same (56.242 versus 56.314), indicating that adding Discount as a predictor didn’t change the relationship between Quantity and Sales much.

Holding Quantity constant, each unit increase in Discount is associated with a decrease of 90.335 in Sales. This indicates a negative relationship between Discount and Sales, which is counterintuitive: as the discount increases, sales seem to decrease. The relationship might be due to confounding factors not considered in the model, or the way discounts are applied may not be driving sales effectively.
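
To get a feel for the size of the Discount coefficient, the fitted equation can be evaluated at two discount levels. The sketch below assumes Discount is stored as a fraction (for example 0.2 for 20%), which is not confirmed by the output above; the quantity and discount values are hypothetical.

```r
# Coefficients copied from the model2 summary above.
b0 <- 30.562       # intercept
b_qty <- 56.314    # Quantity
b_disc <- -90.335  # Discount

# Fitted Sales under model2 for a given quantity and discount.
predict_sales2 <- function(quantity, discount) {
  b0 + b_qty * quantity + b_disc * discount
}

# Same hypothetical quantity, with and without an assumed 20% discount.
no_disc   <- predict_sales2(3, 0)
with_disc <- predict_sales2(3, 0.2)

cat("No discount: ", no_disc, "\n")
cat("20% discount:", with_disc, "\n")
cat("Difference:  ", with_disc - no_disc, "\n")
```

Under that assumption, a full unit change in Discount would mean going from 0% to 100% off, so smaller steps such as the one above are easier to interpret.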

The addition of the Discount variable has slightly improved the explanatory power of the model (R-squared rises from 0.04032 to 0.04121, and adjusted R-squared from 0.04022 to 0.04102), but the change is minimal.

The F-statistic has decreased in the new model (from 419.8 to 214.7), but since the number of predictors has also increased from 1 to 2, the two values are not directly comparable. The important aspect is that the p-value remains extremely low (< 2.2e-16), indicating that the model is statistically significant.
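
Both the adjusted R-squared values and the F-statistics can be reproduced from R-squared alone. The sketch below uses the standard formulas, with n = 9994 inferred from the first model's residual degrees of freedom (9992 + 2 coefficients); the inputs are the rounded values printed in the two summaries, so results agree only to that precision.

```r
n <- 9994  # inferred: residual df 9992 plus 2 estimated coefficients

# Adjusted R-squared: penalises R-squared for the number of predictors p.
adj_r2 <- function(r2, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic expressed through R-squared.
f_from_r2 <- function(r2, p) (r2 / p) / ((1 - r2) / (n - p - 1))

r2_m1 <- 0.04032  # Sales ~ Quantity
r2_m2 <- 0.04121  # Sales ~ Quantity + Discount

cat("Adjusted R-squared:", round(adj_r2(r2_m1, 1), 5), round(adj_r2(r2_m2, 2), 5), "\n")
cat("F statistic:", round(f_from_r2(r2_m1, 1), 1), round(f_from_r2(r2_m2, 2), 1), "\n")
```

Dividing by p = 2 instead of p = 1 roughly halves the numerator while R-squared barely moves, which is why the F-statistic falls even though the second model fits no worse.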