Tidyverse is a collection of R packages designed for data science and data manipulation.
It makes visualization and analysis in R easier and more efficient.
library(tidyverse)
df_store <- (read.csv('C:/Users/Noki/Downloads/archive (8)/Sample - Superstore.csv'))
view(df_store)
df_store
Create a function called rename to replace the dots in column names with underscores (title.title becomes title_title).
Convert column names to lower case with tolower().
Go through the columns and verify that each one has the correct data type.
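# Replace every "." in the column names with "_"
# Note: this definition masks dplyr::rename() for the rest of the session
rename <- function(x){
names(x) <- names(x) %>% str_replace_all('\\.', '_')
return(x)
}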
df_store <- rename(df_store)
df_store
names(df_store) <- tolower(names(df_store))
df_store
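The two renaming steps above can also be combined into a single pipeline step. The following is a minimal sketch on a small made-up data frame (`toy` and `toy_clean` are hypothetical names, not part of the original analysis), using `rename_with()` from dplyr:

```r
library(dplyr)    # rename_with()
library(stringr)  # str_replace_all()

# Hypothetical toy data with dotted, title-case names like those read.csv() produces
toy <- data.frame(Row.ID = 1:2, Order.Date = c("1/2/2016", "3/4/2016"))

# Replace dots with underscores, then lower-case, in one pass over the names
toy_clean <- toy %>%
  rename_with(~ tolower(str_replace_all(.x, "\\.", "_")))

names(toy_clean)
```

Because `rename_with()` is a separate function from `rename()`, it is not affected by the custom `rename()` defined above.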
# Count missing values in each column
colSums(is.na(df_store))
Row_ID Order_ID Order_Date Ship_Date Ship_Mode Customer_ID Customer_Name Segment Country City State Postal_Code Region
0 0 0 0 0 0 0 0 0 0 0 0 0
Product_ID Category Sub_Category Product_Name Sales Quantity Discount Profit
0 0 0 0 0 0 0 0
# Report the class of every column (class(df_store) on its own only returns "data.frame")
sapply(df_store, class)
summary(df_store)
row_id order_id order_date ship_date ship_mode customer_id customer_name segment country city
Min. : 1 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994
1st Qu.:2499 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character
Median :4998 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
Mean :4998
3rd Qu.:7496
Max. :9994
state postal_code region product_id category sub_category product_name sales quantity discount
Length:9994 Min. : 1040 Length:9994 Length:9994 Length:9994 Length:9994 Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
Class :character 1st Qu.:23223 Class :character Class :character Class :character Class :character Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
Mode :character Median :56431 Mode :character Mode :character Mode :character Mode :character Mode :character Median : 54.490 Median : 3.00 Median :0.2000
Mean :55190 Mean : 229.858 Mean : 3.79 Mean :0.1562
3rd Qu.:90008 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
Max. :99301 Max. :22638.480 Max. :14.00 Max. :0.8000
profit
Min. :-6599.978
1st Qu.: 1.729
Median : 8.666
Mean : 28.657
3rd Qu.: 29.364
Max. : 8399.976
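The summary shows order_date and ship_date stored as character, so one follow-up to the data type check would be converting them to dates. The sketch below uses a made-up data frame, and the month/day/year format string is an assumption about how the CSV stores dates; it should be checked against the actual file before use.

```r
# Hypothetical toy data; the "%m/%d/%Y" layout is an assumption about the CSV
toy <- data.frame(
  order_date = c("11/8/2016", "6/12/2016"),
  ship_date  = c("11/11/2016", "6/16/2016"),
  stringsAsFactors = FALSE
)

# Convert both character columns to Date in one step
date_cols <- c("order_date", "ship_date")
toy[date_cols] <- lapply(toy[date_cols], as.Date, format = "%m/%d/%Y")

sapply(toy, class)               # confirm the new column classes
toy$ship_date - toy$order_date   # date arithmetic works once the type is right
```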
ggplot(df_store, aes(x = category, y = sales)) +
geom_boxplot()
ggplot(df_store, aes(x = sales, y = profit)) +
geom_point() +
labs(title = "Profit vs. Sales", x = "Sales", y = "Profit")
sales_summary <- df_store %>%
group_by(region, category) %>%
summarise(mean_sales = mean(sales))
`summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
sales_summary
sales_summary_wide <- pivot_wider(sales_summary, names_from = region, values_from = mean_sales)
sales_summary_wide
# create groups based on a condition
r1 <- subset(df_store, region == "South")
r2 <- subset(df_store, region == "East")
r3 <- subset(df_store, region == 'Central')
r4 <- subset(df_store, region == 'West')
# Gather the mean sales of each region
mean_r1 <- mean(r1$sales)
mean_r2 <- mean(r2$sales)
mean_r3 <- mean(r3$sales)
mean_r4 <- mean(r4$sales)
# Standard deviation
sd_r1 <- sd(r1$sales)
sd_r2 <- sd(r2$sales)
sd_r3 <- sd(r3$sales)
sd_r4 <- sd(r4$sales)
# Print
mean_r1
[1] 241.8036
mean_r2
[1] 238.3361
mean_r3
[1] 215.7727
mean_r4
[1] 226.4932
sd_r1
[1] 774.7963
sd_r2
[1] 620.7127
sd_r3
[1] 632.779
sd_r4
[1] 524.8769
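The per-region means and standard deviations above can also be gathered in one table instead of eight separate variables. This is a minimal sketch on made-up numbers (`toy` and `region_stats` are hypothetical names), using `group_by()` and `summarise()`:

```r
library(dplyr)

# Hypothetical toy data standing in for df_store
toy <- data.frame(
  region = c("South", "South", "East", "East"),
  sales  = c(10, 30, 20, 60)
)

# One row per region with the mean, standard deviation, and row count
region_stats <- toy %>%
  group_by(region) %>%
  summarise(mean_sales = mean(sales), sd_sales = sd(sales), n = n())

region_stats
```

Keeping the statistics in a single data frame makes them easy to print side by side or pass to a plot.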
t.test(r1$sales, r2$profit, var.equal = TRUE)
Two Sample t-test
data: r1$sales and r2$profit
t = 13.265, df = 4466, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
178.6804 240.6553
sample estimates:
mean of x mean of y
241.80365 32.13581
t.test(r1$sales, r3$profit, var.equal = TRUE)
Two Sample t-test
data: r1$sales and r3$profit
t = 12.745, df = 3941, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
190.1447 259.2772
sample estimates:
mean of x mean of y
241.80365 17.09271
t.test(r1$sales, r4$profit, var.equal = TRUE)
Two Sample t-test
data: r1$sales and r4$profit
t = 14.485, df = 4821, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
179.8101 236.0991
sample estimates:
mean of x mean of y
241.80365 33.84903
t.test(r2$sales, r3$profit, var.equal = TRUE)
Two Sample t-test
data: r2$sales and r3$profit
t = 15.815, df = 5169, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
193.8189 248.6679
sample estimates:
mean of x mean of y
238.33611 17.09271
t.test(r2$sales, r4$profit, var.equal = TRUE)
Two Sample t-test
data: r2$sales and r4$profit
t = 17.871, df = 6049, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
182.0558 226.9184
sample estimates:
mean of x mean of y
238.33611 33.84903
t.test(r3$sales, r4$profit, var.equal = TRUE)
Two Sample t-test
data: r3$sales and r4$profit
t = 15.483, df = 5524, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
158.8899 204.9574
sample estimates:
mean of x mean of y
215.77266 33.84903
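Each call above compares sales in one region with profit in another, so the tests mix two different measures. To ask whether sales differ between regions, both arguments would need to be sales. The sketch below shows that form on simulated data (all values are made up, and `south`, `east`, and `res` are hypothetical names). It also leaves out `var.equal = TRUE`, so `t.test()` falls back to Welch's test, which may be a safer choice given how different the regional standard deviations are.

```r
set.seed(1)  # simulated data, for illustration only

toy <- data.frame(
  region = rep(c("South", "East"), each = 50),
  sales  = c(rnorm(50, mean = 240, sd = 60), rnorm(50, mean = 230, sd = 60))
)

south <- subset(toy, region == "South")
east  <- subset(toy, region == "East")

# Same measure on both sides; Welch's test (the default) does not assume equal variances
res <- t.test(south$sales, east$sales)
res$p.value
```

On the real data, the equivalent call would be `t.test(r1$sales, r2$sales)`, repeated for each pair of regions.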
ggplot(df_store) +
geom_point(aes(profit, region))
The point chart above is not the best visualization for this comparison, so I created a column chart below to better understand the data.
Region is the x-axis.
Profit is the y-axis.
stat = "summary" computes a summary statistic for each x value, and fun = "mean" sets that statistic to the mean.
The chart therefore shows the mean profit for each region.
ggplot(df_store, aes(x = region, y = profit)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Mean Profit by Region", x = "Region", y = "Mean Profit")
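The same chart can also be built by computing the means explicitly first. This makes clear what is being drawn, since geom_col() plots the values it is given without summarising them. A minimal sketch on made-up numbers (`toy`, `mean_profit`, and `p` are hypothetical names):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for df_store
toy <- data.frame(
  region = c("South", "South", "West", "West"),
  profit = c(10, 30, -5, 25)
)

# Step 1: one row per region holding the mean profit
mean_profit <- toy %>%
  group_by(region) %>%
  summarise(mean_profit = mean(profit))

# Step 2: geom_col() draws those precomputed means as bar heights
p <- ggplot(mean_profit, aes(x = region, y = mean_profit)) +
  geom_col() +
  labs(title = "Mean Profit by Region", x = "Region", y = "Mean Profit")
p
```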
Using a bar chart to compare the number of order lines in each region by category.
position = "dodge" places the bars side by side.
geom_bar() with no y aesthetic counts rows, so the bar heights are order-line counts rather than sales totals.
Office supplies has the most order lines in every region.
ggplot(df_store, aes(x = region, fill = category)) +
geom_bar(position = "dodge") +
labs(title = "Orders by Region and Category", x = "Region", y = "Number of Order Lines", fill = "Category")
Q2:
df_store$category <- as.factor(df_store$category)
nova <- aov(profit ~ category, data=df_store)
summary(nova)
Df Sum Sq Mean Sq F value Pr(>F)
category 2 5898009 2949004 54.31 <2e-16 ***
Residuals 9991 542495827 54298
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The F-value tests whether there is a significant difference in means between the groups.
An F-value of 54.31 on 2 and 9991 degrees of freedom is large, which indicates that the category means differ by more than the within-category variation alone would explain.
The p-value (< 2e-16) is strong evidence against the null hypothesis. Therefore, we can reject the null hypothesis of no difference in mean profit between categories.
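The ANOVA says that at least one category mean differs, but not which ones. A common follow-up is Tukey's honest significant difference test, available in base R as `TukeyHSD()`. The sketch below runs it on simulated data (all values are made up, and `fit` and `tukey` are hypothetical names); it is meant only to show the shape of the call, not results for the Superstore data.

```r
set.seed(42)  # simulated data, for illustration only

toy <- data.frame(
  category = factor(rep(c("Furniture", "Office Supplies", "Technology"), each = 30)),
  profit   = c(rnorm(30, 10, 20), rnorm(30, 20, 20), rnorm(30, 80, 20))
)

fit <- aov(profit ~ category, data = toy)

# Pairwise differences in mean profit with adjusted confidence intervals and p-values
tukey <- TukeyHSD(fit)
tukey
```

On the real data, the equivalent call would be `TukeyHSD(nova)`.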
Based on the bar chart below, we can see that office supplies and technology show a higher average profit than furniture.
ggplot(df_store, aes(x = category, y = profit)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Profit by Category", x = "Category", y = "Average Profit")
This faceted bar chart compares the average profit for each category in each region.
ggplot(df_store, aes(x = category, y = profit)) +
geom_bar(stat = "summary", fun = "mean") +
facet_wrap(~ region) +
labs(title = "Average Profit by Category and Region", x = "Category", y = "Average Profit")
Question 1:
Are sales in certain regions higher compared to others?
Question 2:
Is there a significant difference in profit between different categories?
We can conclude that certain regions have higher sales compared to others, so we reject the null hypothesis for the first question. We also have evidence that office supplies and technology have higher profits compared to furniture. Therefore, we can also reject the null hypothesis for the second question.
I used t-tests and ANOVA to compare means in my analysis. The results showed that there is a significant difference in profit between different categories, with furniture having a lower mean profit than the other categories. I provided visualizations to better understand what the tests were showing. Overall, the analysis suggests that we may benefit from focusing our time and business decisions on higher-profit categories and regions, such as office supplies and technology in our East and West regions.
Further analysis on this data may be warranted to determine why and how these factors are contributing to these differences.