df <- read.csv('/Users/fahadmehfooz/Desktop/IUPUI/First Semester/Intro to Statistics/Intro to Stats Dataset/Dataset 1/Superstore.csv')
colnames(df)
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date"
## [5] "Ship.Mode" "Customer.ID" "Customer.Name" "Segment"
## [9] "Country" "City" "State" "Postal.Code"
## [13] "Region" "Product.ID" "Category" "Sub.Category"
## [17] "Product.Name" "Sales" "Quantity" "Discount"
## [21] "Profit"
# If there are more than 10 subcategories, keep the 10 most frequent ones
# and lump the remaining, smallest ones into a single "Other" level
if (length(unique(df$Sub.Category)) > 10) {
  subcat_counts <- table(df$Sub.Category)
  # Names of the (k - 10) subcategories with the smallest counts
  small_subcats <- names(subcat_counts)[order(subcat_counts)][1:(length(subcat_counts) - 10)]
  df$Sub.Category[df$Sub.Category %in% small_subcats] <- 'Other'
}
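To see why this rule leaves eleven groups rather than ten, here is a small self-contained sketch on made-up data. The `toy` data frame and its labels `S1` to `S13` are hypothetical and not part of the Superstore file; only the lumping logic is the same as above.

```r
# Hypothetical toy data: 13 made-up subcategories, where label S_i appears i times
labels <- paste0("S", 1:13)
toy <- data.frame(Sub.Category = rep(labels, times = 1:13),
                  stringsAsFactors = FALSE)

counts <- table(toy$Sub.Category)

# Same rule as above: the (k - 10) least frequent labels get lumped together
small <- names(counts)[order(counts)][1:(length(counts) - 10)]
toy$Sub.Category[toy$Sub.Category %in% small] <- "Other"

# Ten kept labels plus "Other"
n_groups <- length(unique(toy$Sub.Category))
print(small)
print(n_groups)
```

Ten kept labels plus "Other" make eleven groups, which is why the ANOVA that follows reports 10 degrees of freedom for Sub.Category.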
result <- aov(Sales ~ Sub.Category, data = df)
anova_summary <- summary(result)
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sub.Category 10 3.111e+08 31105056 86.97 <2e-16 ***
## Residuals 9983 3.571e+09 357666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
significance_level <- 0.05
p_value <- anova_summary[[1]]$'Pr(>F)'[1]
if (p_value < significance_level) {
print("Reject the null hypothesis: Sales differ among the subcategories.")
} else {
print("Do not reject the null hypothesis: There's no significant difference in sales among the subcategories.")
}
## [1] "Reject the null hypothesis: Sales differ among the subcategories."
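As a sanity check, the F value and p-value can be rebuilt from the table entries alone. This is only arithmetic on the rounded numbers printed above, so small rounding differences are expected. `eta_sq` (between-group sum of squares over the total) is an extra effect-size summary that the analysis above does not report; it is included here as an assumption about what might be useful.

```r
# Numbers copied from the printed ANOVA table (already rounded there)
ss_between <- 3.111e+08   # Sum Sq, Sub.Category
ss_resid   <- 3.571e+09   # Sum Sq, Residuals
ms_between <- 31105056    # Mean Sq, Sub.Category
ms_resid   <- 357666      # Mean Sq, Residuals
df_between <- 10
df_resid   <- 9983

# F is the ratio of the two mean squares
f_stat <- ms_between / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_val <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

# Share of the total variation in Sales attributed to Sub.Category
eta_sq <- ss_between / (ss_between + ss_resid)

cat("F value:", round(f_stat, 2), "\n")
cat("p-value below 2e-16:", p_val < 2e-16, "\n")
cat("Eta squared:", round(eta_sq, 3), "\n")
```

The ratio reproduces the F value of about 86.97. The share of variance comes out at roughly 8%, so the differences among subcategories are clearly significant, but Sub.Category alone leaves most of the variation in Sales unexplained.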
# Create the scatter plot of Quantity (x-axis) against Sales (y-axis)
plot(df$Quantity, df$Sales,
     main = "Scatter Plot of Quantity vs Sales",
     xlab = "Quantity",
     ylab = "Sales",
     col = "blue",
     pch = 16)
The relationship looks roughly linear, although some transformation may be needed to bring it closer to a linear pattern.
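One transformation that is often tried for a right-skewed response such as Sales is the logarithm. The sketch below is only an illustration on simulated data: the `toy` data frame, its coefficients, and the choice of `log()` are all assumptions, not something taken from the Superstore analysis.

```r
# Simulated, hypothetical data (NOT the Superstore file): Sales grows
# multiplicatively with Quantity and has right-skewed noise
set.seed(42)
toy <- data.frame(Quantity = sample(1:14, 500, replace = TRUE))
toy$Sales <- exp(3 + 0.25 * toy$Quantity + rnorm(500, sd = 1))

fit_raw <- lm(Sales ~ Quantity, data = toy)        # untransformed response
fit_log <- lm(log(Sales) ~ Quantity, data = toy)   # log-transformed response

# R-squared of each fit, each measured on its own response scale
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
cat("R-squared, raw Sales:", round(r2_raw, 3), "\n")
cat("R-squared, log(Sales):", round(r2_log, 3), "\n")
```

On the real data the same idea would be `lm(log(Sales) ~ Quantity, data = df)`, assuming all Sales values are positive. The two R-squared values are on different scales, so they are a rough guide rather than a formal comparison; the residual plots of each fit are the better check.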
# Building the linear regression model
model <- lm(Sales ~ Quantity, data=df)
# Evaluate the fit
model_summary <- summary(model)
print(model_summary)
##
## Call:
## lm(formula = Sales ~ Quantity, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.4 -181.9 -114.6 2.0 22284.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.725 12.063 1.386 0.166
## Quantity 56.242 2.745 20.489 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 610.6 on 9992 degrees of freedom
## Multiple R-squared: 0.04032, Adjusted R-squared: 0.04022
## F-statistic: 419.8 on 1 and 9992 DF, p-value: < 2.2e-16
H0: The intercept is 0. HA: The intercept != 0.
Result: The t-value is 1.386 and the p-value is 0.166. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. The intercept is not statistically distinguishable from zero in this model.
H0: The coefficient of Quantity is 0. HA: The coefficient of Quantity != 0.
Result: The t-value is 20.489 and the p-value is < 2e-16. Since the p-value is far below 0.05, we reject the null hypothesis. The coefficient for Quantity is statistically significant, meaning Quantity has a significant linear relationship with Sales.
H0: All regression coefficients (except the intercept) are equal to zero. HA: At least one coefficient is not equal to zero.
Result: The F-statistic is 419.8 with a p-value of < 2.2e-16. Given this very small p-value, we reject the null hypothesis. This implies that at least one predictor, in this case Quantity, is useful in predicting Sales. With a single predictor, this is the same test as the t-test on the Quantity coefficient.
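The three test statistics above can be recomputed by hand from the coefficient table. The sketch below uses the rounded estimates and standard errors from the printed summary, so the results match only up to rounding; the variable names are my own.

```r
# Rounded values copied from the coefficient table above
est_int <- 16.725; se_int <- 12.063   # (Intercept)
est_qty <- 56.242; se_qty <- 2.745    # Quantity
df_resid <- 9992                      # residual degrees of freedom

# t-value = estimate / standard error
t_int <- est_int / se_int
t_qty <- est_qty / se_qty

# Two-sided p-values from the t distribution
p_int <- 2 * pt(-abs(t_int), df_resid)
p_qty <- 2 * pt(-abs(t_qty), df_resid)

# With one predictor, the overall F-statistic is the squared t-value of the slope
f_overall <- t_qty^2

cat("t (Intercept):", round(t_int, 3), " p:", round(p_int, 3), "\n")
cat("t (Quantity):", round(t_qty, 3), " p below 2e-16:", p_qty < 2e-16, "\n")
cat("F from t squared:", round(f_overall, 1), "\n")
```

Squaring the Quantity t-value gives about 419.8, which is the F-statistic reported at the bottom of the summary.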
# Extract coefficients
coefficients <- model_summary$coefficients
print(coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.72524 12.062925 1.38650 1.656254e-01
## Quantity 56.24188 2.745015 20.48873 2.019935e-91
# Extract R-squared
r_squared <- model_summary$r.squared
# Extract Adjusted R-squared
adj_r_squared <- model_summary$adj.r.squared
# Extract Residual Standard Error
residual_se <- model_summary$sigma
# Print extracted details
cat("R-squared:", r_squared, "\n")
## R-squared: 0.04031854
cat("Adjusted R-squared:", adj_r_squared, "\n")
## Adjusted R-squared: 0.0402225
cat("Residual Standard Error:", residual_se, "\n")
## Residual Standard Error: 610.5822
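The adjusted R-squared follows from the ordinary R-squared once the sample size and the number of predictors are known. A minimal sketch, using the printed values and recovering the sample size from the residual degrees of freedom (the variable names here are illustrative):

```r
r_squared <- 0.04031854   # printed R-squared
df_resid  <- 9992         # residual degrees of freedom from the model summary
p         <- 1            # number of predictors (Quantity only)

# Sample size: residual df plus the number of estimated coefficients
n <- df_resid + p + 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

cat("n:", n, "\n")
cat("Adjusted R-squared:", round(adj_r_squared, 7), "\n")
```

With almost ten thousand observations and a single predictor the penalty is tiny, which is why the two values differ only in the fourth decimal place.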
- Intercept (16.72524): The intercept suggests that when no items are sold (Quantity = 0), Sales would be approximately 16.73. This doesn’t make practical sense: if no items are sold, one would expect no sales. The intercept mainly anchors the fitted line, and since it’s not statistically significant, we don’t put much emphasis on it.
- Quantity (56.24188): Each additional item sold in the superstore is associated with an average increase in Sales of about $56.24.
Profitability: The response here is Sales (revenue), not profit, so the $56.24 is extra revenue per additional item. Profit would be this value minus the associated costs, so knowing costs would give a clearer picture of profitability.
Inventory Management: If the store sees a consistent increase in revenue as more units are sold, it might be worth ensuring that high-volume items are always in stock and perhaps prominently displayed.
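To make the coefficient interpretation concrete, the fitted line can be evaluated by hand for a few quantities. This is a sketch using the printed estimates; `predict_sales` is a hypothetical helper, not a function from the analysis above.

```r
# Estimates copied from the coefficient table
b0 <- 16.72524   # intercept
b1 <- 56.24188   # slope for Quantity

# Hypothetical helper: fitted Sales for a given Quantity
predict_sales <- function(quantity) {
  b0 + b1 * quantity
}

quantities <- c(1, 3, 5)
fitted_sales <- predict_sales(quantities)
print(data.frame(Quantity = quantities, Fitted.Sales = round(fitted_sales, 2)))

# Going from 3 to 4 items changes the fitted value by exactly the slope
step <- predict_sales(4) - predict_sales(3)
cat("Change per extra item:", round(step, 2), "\n")
```

With the fitted model object available, `predict(model, newdata = data.frame(Quantity = c(1, 3, 5)))` should give the same fitted values up to rounding.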
par(mfrow=c(2,2))
plot(model)
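The diagnostic plots look broadly acceptable here, although the residual summary above (median -114.6, maximum 22284.3) points to a long right tail that is worth keeping in mind.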
# Building the linear regression model
model2 <- lm(Sales ~ Quantity + Discount, data=df)
# evaluate the fit
model_summary <- summary(model2)
print(model_summary)
##
## Call:
## lm(formula = Sales ~ Quantity + Discount, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -775.8 -188.3 -113.7 2.0 22315.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.562 12.881 2.373 0.01768 *
## Quantity 56.314 2.744 20.523 < 2e-16 ***
## Discount -90.335 29.574 -3.055 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 610.3 on 9991 degrees of freedom
## Multiple R-squared: 0.04121, Adjusted R-squared: 0.04102
## F-statistic: 214.7 on 2 and 9991 DF, p-value: < 2.2e-16
The intercept has almost doubled (from 16.725 to 30.562) and is now statistically significant (p = 0.018 < 0.05) in the new model. The coefficient for Quantity remains almost the same (56.242 vs. 56.314), indicating that adding Discount as a predictor didn’t change the relationship between Quantity and Sales much.
Holding Quantity constant, each one-unit increase in Discount is associated with a decrease in Sales of 90.335. If Discount is stored as a fraction (so that one unit means a 100% discount), this corresponds to roughly $9.03 lower Sales per 10 percentage points of discount. This negative relationship is counterintuitive, since one would expect discounts to raise sales. It might be due to confounding factors not considered in the model, or it could be that the type or application of discounts isn’t effectively driving sales.
Adding the Discount variable has only slightly improved the explanatory power of the model: R-squared rises from 0.04032 to 0.04121, and adjusted R-squared from 0.04022 to 0.04102.
The F-statistic has decreased (from 419.8 to 214.7), but since the model now has two predictors instead of one, the two values aren’t directly comparable. The important point is that the p-value remains extremely low (< 2.2e-16), indicating that the model as a whole is statistically significant.
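The drop in the F-statistic can be traced back to how F is built from R-squared: the explained share is divided by the number of predictors. The sketch below recomputes both F values from the rounded R-squared figures in the two summaries, so it agrees with the printed output only up to rounding. `f_from_r2` is a hypothetical helper name.

```r
# Overall F-statistic from R-squared, sample size n and number of predictors p
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

n <- 9994  # residual df plus estimated coefficients (9992 + 2, or 9991 + 3)

f_model1 <- f_from_r2(0.04032, n, p = 1)  # Sales ~ Quantity
f_model2 <- f_from_r2(0.04121, n, p = 2)  # Sales ~ Quantity + Discount

# For one added predictor, the partial F for that term equals its squared t-value
partial_f_discount <- (-3.055)^2
p_partial <- pf(partial_f_discount, 1, 9991, lower.tail = FALSE)

cat("F, model 1:", round(f_model1, 1), "\n")
cat("F, model 2:", round(f_model2, 1), "\n")
cat("Partial F for Discount:", round(partial_f_discount, 2),
    " p:", round(p_partial, 5), "\n")
```

R-squared barely changes between the two models while the divisor `p` goes from 1 to 2, so the overall F roughly halves. The question of whether Discount adds anything is answered by the partial F (equivalently, the t-test on Discount), and on the fitted objects `anova(model, model2)` would run that comparison directly.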