##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
1. Part 1 - Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.
2. Part 2 - Select a categorical column of data (explanatory variable) that you expect might influence the response variable.
Let us consider categorical column to be Region. It could help figure out how much of sales of particular product would happen in certain regions based on circumstances/situations related to region. Therefore, Explanatory Variable = Region
Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions.
Null Hypothesis - There is no significant difference in the mean Sales across different Regions in the Global Superstore dataset. H0 suggests that the sales values in different regions are not significantly different from each other, and any observed differences are due to random variability or chance. \[ H_0: \text{AvgSales_south = AvgSales_central = AvgSales_west = AvgSales_east} \]
Alternative hypothesis (Ha) would then be the opposite of the null hypothesis, suggesting that there is a significant difference in the mean Sales across different Regions.
ANOVA TEST :-
m <- aov(Sales ~ Region, data = Superstore_data)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Region 3 9.330e+05 311007 0.801 0.493
## Residuals 9990 3.881e+09 388458
pairwise.t.test(Superstore_data$Sales, Superstore_data$Region, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Superstore_data$Sales and Superstore_data$Region
##
## Central East South
## East 1 - -
## South 1 1 -
## West 1 1 1
##
## P value adjustment method: bonferroni
From the above pair plot, It looks like Regions have somewhat same average sale with respect to other regions
This means that irrespective of the region the average sales come somewhat similar. There is not enough evidence to conclude the same but since the category of products are Office supplies, technology and furniture it doesn’t affect the regions as it is essential and blooming as products. So it would be safe to assume that sales is not affected by region based on the categories.
3. Part 3 - Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. - Consider the other continuous variable to be Profit which might influence how the sales happen. - So the 2 continuous variables are Sales and Profit. There is a possibility that each of them influence the other. Hoping it to be linear.
Superstore_data |>
ggplot(mapping = aes(x = Profit, y = Sales)) +
geom_point(size = 2, color = 'darkblue') +
theme_minimal()
Superstore_data |>
ggplot(mapping = aes(x = Profit, y = Sales)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
- Looking at the above plots can say there the dataset looks to showcase
positive relationship but in reality it is not good for finding
relationship between the Sales and Profit continuous variables. It
doesnt seem to appropriate plotting of line. - We can see that the
points are scattered all over and line doesnt even pass through it.
Hence can say that the relationship of getting linear regressive line
between Sales and Profit aint correct.
model <- lm(Sales ~ Profit, Superstore_data)
model$coefficients
## (Intercept) Profit
## 193.333563 1.274543
But if needed for the above model of linear regression line that can be drawn, it would have the equation as : Sales (y) = m * Profit( x= 1.27) + 193.333 With the above equation can determine certain relation between sales and profit. If \(x_1 (Profit)= 0\), then the expected value for \(y_1\) (Sales )is \(\hat{\beta}_0 = 193.33\).
But I dont think its an ideal way to say that there is positive relationship for all points. As certain instances are observed where the Sales are high but can see a Loss.
Hypothesis Test for linear regresiion: Consider that the coefficient for a model can be very close to zero, i.e., the relationship between \(X\) and \(Y\) would be essentially zero. This can be a hypothesis test, and we can determine if there is enough evidence against the claim that there is no relationship between a variable and the response.
model <- lm(Sales ~ Profit, Superstore_data)
tidy(model, conf.int = TRUE)
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 193. 5.51 35.1 3.70e-254 183. 204.
## 2 Profit 1.27 0.0234 54.6 0 1.23 1.32
The summary(model) output includes information about hypothesis tests for each coefficient: - Estimate” column: Provides the estimated value of the coefficient. - “Std. Error” column: Provides the standard error of the estimate. - “t value” column: Gives the t-statistic for the hypothesis test. - “Pr(>|t|)” column: Provides the p-value for the hypothesis test.
From above can see p value = 3.7 > than 0.05, hence Null hypothesis is accpeted. But that aint right in reality as the relation between sales and profit aint entirely positive and straight.
Diagnostic Plot:
model <- lm(Sales ~ Profit, Superstore_data)
gg_resfitted(model) +
theme_minimal()
- For this particular plot, we can already see that one of our
assumptions is being violated, in that the variance in errors increases
for certain higher sales prices.
Error Metrics Not that important in inferential statistics ,
#sum of squared errors
sse <- sum(model$residuals ^ 2)
# mean squared error
mse <- mean(model$residuals ^ 2)
# root mean squared error
rmse <- sqrt(mse)
# mean absolute error
mae <- mean(abs(model$residuals))
# mean absolute percent error
mape <- mean((abs(model$residuals) + 1) /
(predict(model) + 1))
c('sse' = sse, 'mse' = mse, 'rmse' = rmse, 'mae' = mae, 'mape' = mape)
## sse mse rmse mae mape
## 2.990782e+09 2.992578e+05 5.470446e+02 2.342424e+02 4.067352e-01
4. Part 4: Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t).
Lets add in another variable as Profit to the initial ANOVA test Hypothesis Test. Hence the Response variable = Sales and Explanatory Variable = Region and profit
But before doing that lets narrow down the dataset so that we look at only Profits and not lossed (i.e. negative values within Profit which indicate loss)
we’ll create a subset of the Superstore dataset (with profit that is values > =0 ), and we’ll create a new column that “binarizes” the region column into “Central and West” or “East and North”.
sup_basic <- Superstore_data |>
filter( Profit >= 0) |>
mutate(Region_split = ifelse(Region %in%
c("Central"),
1, ifelse(Region %in%
c("West"),
2,ifelse(Region %in%
c("East"),
3, 4))))
sup_basic |>
group_by(Region_split) |>
summarize(num = n())
## # A tibble: 4 × 2
## Region_split num
## <dbl> <int>
## 1 1 1582
## 2 2 2885
## 3 3 2295
## 4 4 1361
sup_grouped <-
sup_basic |>
group_by(Region_split) |>
summarise(mean_price = mean(Sales))
sup_basic |>
ggplot() +
facet_wrap(vars(Region_split), labeller = label_both) +
geom_point(mapping = aes(x = Profit, y = Sales)) +
geom_hline(data = sup_grouped,
mapping = aes(yintercept = mean_price),
color = 'darkorange', linetype = 'dashed') +
scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
labs(title = "Sales Price by Profit",
subtitle = "Faceted by the Region where products are bought",
x = "Profit", y = "Sales") +
theme_minimal()
model <- lm(Sales ~ Profit + Region_split, data = sup_basic)
model$coefficients
## (Intercept) Profit Region_split
## 99.827220 2.475268 -3.951989
We can interpret our coefficients thus:
Maybe include an interaction term, but explain why you included it.
Interaction Terms
Sometimes, we have a relationship between two variables such that the line formed between one and the response changes based on the other.
sup_basic |>
ggplot(mapping = aes(x = Profit, y = Sales,
color = factor(Region_split))) +
geom_jitter(height = 0, width = 0.1, shape = 'o', size = 3) +
geom_smooth(method = 'lm', se = FALSE, linewidth = 0.5) +
scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
scale_color_brewer(palette = 'Dark2') +
labs(title = "Sales Price by Profit",
subtitle = "Colored by the Region of product bought (1= Central; 2= West; 3 = East; 4= North)",
x = "Profit (x-jittered)", y = "Sales",
color = 'Region split i.e. 1= Central; 2= West; 3 = East; 4= North?') +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Since points are all overlapping, becomes difficult to interpret the added interaction Terms. we cant really see/differentiate that when the region changes, the Profit does not really affect the Sales price much for North and Central Region. But, when it is West, an increase in Profit has a slightly positive effect on Sales Price. Same for East region.
model <- lm(Sales ~ Profit + Region_split
+ Profit:Region_split, sup_basic)
# to view more coefficients a bit easier
tidy(model) |>
select(term, estimate) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 124.
## 2 Profit 2.1
## 3 Region_split -15.5
## 4 Profit:Region_split 0.2
I dont think it is a good idea to create an interaction term for the same.However to interpret the results, we can have say the following:
we only need to worry about the fact that a product having a Profit will tend to have a sales price of about $2.1 more.
On the other hand, if the region is other, then we have a Profit (in this case) will tend to be $(2.1 + $0.2 = $2.3) more.