Import tidyverse library

The tidyverse is a collection of R packages designed for data science and data manipulation
It makes visualization and analysis easier and more efficient in R

library(tidyverse)

Use read.csv to import dataset

df_store <- (read.csv('C:/Users/Noki/Downloads/archive (8)/Sample - Superstore.csv'))

view(df_store)
df_store

Lower case column names and data type verification

Create a function called rename to change column names from title.title to title_title.

Convert column names to lower case with tolower().

Go through the columns and verify that each one has the correct data type.

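# Replace every literal dot in the column names with an underscore.
# Note: defining rename() here masks dplyr::rename() for the rest of the session.
rename <- function(x){
  names(x) <- names(x) %>% str_replace_all('\\.', '_')
  return(x)
}
df_store <- rename(df_store)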
df_store


names(df_store) <- tolower(names(df_store))
df_store
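
The renaming and lower-casing steps can also be done in one line of base R. This is a minimal sketch on a made-up data frame (toy and its columns are hypothetical, not the Superstore data), and it finishes with a per-column class check, which is one way to verify data types.

```r
# Hypothetical toy data frame with read.csv-style column names.
toy <- data.frame(
  Row.ID       = 1:3,
  Sub.Category = c("Chairs", "Paper", "Phones"),
  Sales        = c(10.5, 3.2, 99.0)
)

# fixed = TRUE treats "." as a literal dot instead of a regex wildcard.
names(toy) <- tolower(gsub(".", "_", names(toy), fixed = TRUE))
names(toy)

# One class per column: a quick way to verify the data types.
col_types <- sapply(toy, class)
col_types
```

Using fixed = TRUE avoids the double-backslash escape that the regex version needs.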

Checking for NA(null) values in data

colSums(is.na(df_store))

       Row_ID      Order_ID    Order_Date     Ship_Date     Ship_Mode   Customer_ID Customer_Name       Segment       Country          City         State   Postal_Code        Region 
            0             0             0             0             0             0             0             0             0             0             0             0             0 
   Product_ID      Category  Sub_Category  Product_Name         Sales      Quantity      Discount        Profit 
            0             0             0             0             0             0             0             0

# Class of each column
sapply(df_store, class)

Gather statistical information on dataset

summary(df_store)

     row_id       order_id          order_date         ship_date          ship_mode         customer_id        customer_name        segment            country              city          
 Min.   :   1   Length:9994        Length:9994        Length:9994        Length:9994        Length:9994        Length:9994        Length:9994        Length:9994        Length:9994       
 1st Qu.:2499   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Median :4998   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Mean   :4998                                                                                                                                                                             
 3rd Qu.:7496                                                                                                                                                                             
 Max.   :9994                                                                                                                                                                             
    state            postal_code       region           product_id          category         sub_category       product_name           sales              quantity        discount     
 Length:9994        Min.   : 1040   Length:9994        Length:9994        Length:9994        Length:9994        Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
 Class :character   1st Qu.:23223   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
 Mode  :character   Median :56431   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
                    Mean   :55190                                                                                                  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
                    3rd Qu.:90008                                                                                                  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
                    Max.   :99301                                                                                                  Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
     profit         
 Min.   :-6599.978  
 1st Qu.:    1.729  
 Median :    8.666  
 Mean   :   28.657  
 3rd Qu.:   29.364  
 Max.   : 8399.976

Question #1 - Are sales in certain regions higher than other regions?

Null Hypothesis - There is no difference between the regions in sales

Alt Hypothesis - Sales in the South region are higher than in other regions

Using t.test to test for differences in mean sales between pairs of regions

Question #2 - Is there a significant difference in profit between different categories?

Null Hypothesis - There is no significant difference in profit between the categories

Alt Hypothesis - The mean profit for office supplies is significantly higher than the mean profit for other categories

Using one-way ANOVA to compare mean profit across the categories

Create a boxplot to show distribution of sales by category.

Gather quick visuals to show which category has the highest sales

ggplot(df_store, aes(x = category, y = sales)) +
  geom_boxplot()

Scatter plot below shows that there is a positive correlation between sales and profit, with higher sales generally corresponding to higher profits

ggplot(df_store, aes(x = sales, y = profit)) +
  geom_point() +
  labs(title = "Profit vs. Sales", x = "Sales", y = "Profit")

Calculate the average sales by region and category

The %>% operator passes df_store as the first argument to the next function
group_by groups the data by region and category
summarise calculates the mean sales for each group and stores it in a column called mean_sales
sales_summary is then a new data frame consisting of the region, category, and mean_sales columns

sales_summary <- df_store %>% 
  group_by(region, category) %>%
  summarise(mean_sales = mean(sales))

`summarise()` has grouped output by 'region'. You can override using the `.groups` argument.

sales_summary


Reshaping the data to a wide format to make it easier to visualize and present the data.

sales_summary_wide <- pivot_wider(sales_summary, names_from = region, values_from = mean_sales)

sales_summary_wide
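
As a small illustration of the grouping and reshaping steps, here is a sketch on made-up numbers (the toy tibble is hypothetical). It assumes the dplyr and tidyr packages are installed, and it passes .groups = "drop" so that summarise() returns an ungrouped result and does not print the grouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: five order lines in two regions.
toy <- tibble(
  region   = c("East", "East", "West", "West", "West"),
  category = c("Furniture", "Technology", "Furniture", "Furniture", "Technology"),
  sales    = c(100, 300, 50, 150, 400)
)

# Mean sales per region/category pair; .groups = "drop" removes the grouping.
toy_summary <- toy %>%
  group_by(region, category) %>%
  summarise(mean_sales = mean(sales), .groups = "drop")

# One row per category, one column per region.
toy_wide <- toy_summary %>%
  pivot_wider(names_from = region, values_from = mean_sales)
toy_wide
```

In the wide table each region becomes its own column, so categories can be compared across regions by reading along a row.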

Mean and standard deviation of the different regions

Comparing the mean sales of each region quickly shows which regions have higher or lower average sales
This information can help guide marketing strategies and other business decisions based on region sales

# create groups based on a condition
r1 <- subset(df_store, region == "South")
r2 <- subset(df_store, region == "East")
r3 <- subset(df_store, region == 'Central')
r4 <- subset(df_store, region == 'West')
# Gather the mean sales of each region
mean_r1 <- mean(r1$sales)
mean_r2 <- mean(r2$sales)
mean_r3 <- mean(r3$sales)
mean_r4 <- mean(r4$sales)
# Standard deviation
sd_r1 <- sd(r1$sales)
sd_r2 <- sd(r2$sales)
sd_r3 <- sd(r3$sales)
sd_r4 <- sd(r4$sales)
# Print
mean_r1

[1] 241.8036

mean_r2

[1] 238.3361

mean_r3

[1] 215.7727

mean_r4

[1] 226.4932

sd_r1

[1] 774.7963

sd_r2

[1] 620.7127

sd_r3

[1] 632.779

sd_r4

[1] 524.8769

Statistical testing

T-test for Question 1
Comparing each pair of regions

t.test(r1$sales, r2$profit, var.equal = TRUE)


    Two Sample t-test

data:  r1$sales and r2$profit
t = 13.265, df = 4466, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 178.6804 240.6553
sample estimates:
mean of x mean of y 
241.80365  32.13581

t.test(r1$sales, r3$profit, var.equal = TRUE)


    Two Sample t-test

data:  r1$sales and r3$profit
t = 12.745, df = 3941, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 190.1447 259.2772
sample estimates:
mean of x mean of y 
241.80365  17.09271

t.test(r1$sales, r4$profit, var.equal = TRUE)


    Two Sample t-test

data:  r1$sales and r4$profit
t = 14.485, df = 4821, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 179.8101 236.0991
sample estimates:
mean of x mean of y 
241.80365  33.84903

t.test(r2$sales, r3$profit, var.equal = TRUE)


    Two Sample t-test

data:  r2$sales and r3$profit
t = 15.815, df = 5169, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 193.8189 248.6679
sample estimates:
mean of x mean of y 
238.33611  17.09271

t.test(r2$sales, r4$profit, var.equal = TRUE)


    Two Sample t-test

data:  r2$sales and r4$profit
t = 17.871, df = 6049, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 182.0558 226.9184
sample estimates:
mean of x mean of y 
238.33611  33.84903

t.test(r3$sales, r4$profit, var.equal = TRUE)


    Two Sample t-test

data:  r3$sales and r4$profit
t = 15.483, df = 5524, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 158.8899 204.9574
sample estimates:
mean of x mean of y 
215.77266  33.84903

All of the p-values are below the 0.05 significance level, so each test rejects its own null hypothesis of equal means. However, every call above compares sales in one region with profit in another (for example r1$sales against r2$profit), so the differences reflect that sales are larger than profit rather than a difference between regions. These results therefore do not answer Question 1. To test it, both arguments should be the same variable, such as r1$sales and r2$sales.
The regional mean sales computed earlier (215.77 to 241.80) are close to one another compared with their standard deviations (524.88 to 774.80)
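
A regional comparison of sales uses the same variable on both sides of the test. The sketch below runs on simulated data (the toy data frame and its values are made up, so the p-values say nothing about the Superstore data). It shows a single Welch t-test and then all six regional pairs at once with pairwise.t.test from base R, using the Holm adjustment for multiple comparisons.

```r
set.seed(1)

# Hypothetical toy data: 50 simulated sales values per region.
toy <- data.frame(
  region = rep(c("Central", "East", "South", "West"), each = 50),
  sales  = rexp(200, rate = 1 / 200)
)

# Welch t-test for one pair: both arguments are sales.
welch <- t.test(toy$sales[toy$region == "South"],
                toy$sales[toy$region == "East"])
welch$p.value

# All six pairs, without pooling the standard deviation,
# with Holm-adjusted p-values.
pw <- pairwise.t.test(toy$sales, toy$region,
                      p.adjust.method = "holm", pool.sd = FALSE)
pw$p.value
```

Leaving var.equal at its default of FALSE gives the Welch test, which does not assume that the regions have equal variances.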

ggplot(df_store) +
  geom_point(aes(profit, region))

The point chart above is not the best visualization, so I created a column chart below to better understand the data.

Region as my x-axis
Profit as my y-axis
stat = "summary" tells ggplot to compute a summary statistic for each x value, and fun = "mean" makes that statistic the mean
The result is the mean profit for each region

ggplot(df_store, aes(x = region, y = profit)) +
  # geom_col() ignores stat and fun, so geom_bar() is used to plot the mean
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Mean Profit by Region", x = "Region", y = "Mean Profit")

Region sales

Using a bar chart to compare the number of order lines in each region
position = "dodge" places the bars side by side
geom_bar() with no y aesthetic counts rows, so the bar height is the number of order lines, not the sales amount
Office supplies has the most order lines in every region

ggplot(df_store, aes(x = region, fill = category)) +
  geom_bar(position = "dodge") +
  labs(title = "Order Lines by Region and Category", x = "Region", y = "Number of order lines", fill = "Category")

Statistical testing

Q2:

One-way ANOVA to test whether mean profit differs between categories

df_store$category <- as.factor(df_store$category)
nova <- aov(profit ~ category, data=df_store)
summary(nova)

              Df    Sum Sq Mean Sq F value Pr(>F)    
category       2   5898009 2949004   54.31 <2e-16 ***
Residuals   9991 542495827   54298                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The F-value is the ratio of the variation between the category means to the variation within the categories
An F-value of 54.31 on 2 and 9991 degrees of freedom is very large, which indicates a significant difference in means between the groups
The p-value (< 2e-16) is far below 0.05, so we reject the null hypothesis of no difference in mean profit between categories. The ANOVA alone does not say which categories differ.
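
A common follow-up to a significant one-way ANOVA is a post-hoc test that compares each pair of groups. The sketch below applies TukeyHSD from base R to simulated data (the toy data frame and its group means are made up, so the numbers do not describe the Superstore data).

```r
set.seed(2)

# Hypothetical toy data: 40 simulated profit values per category.
toy <- data.frame(
  category = factor(rep(c("Furniture", "Office Supplies", "Technology"),
                        each = 40)),
  profit   = c(rnorm(40, mean = 10, sd = 30),
               rnorm(40, mean = 20, sd = 30),
               rnorm(40, mean = 80, sd = 30))
)

fit <- aov(profit ~ category, data = toy)

# One row per pair of categories: the difference in means, a 95%
# confidence interval, and a p-value adjusted for multiple comparisons.
tukey <- TukeyHSD(fit)
tukey$category
```

Each row of the result shows whether one specific pair of categories differs, which the overall F-test cannot show.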

Based on the bar chart below, we can see that office supplies and technology show a higher average profit.

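ggplot(df_store, aes(x = category, y = profit)) +
  # geom_col() would stack every row and show total profit; this plots the mean
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Profit by Category", x = "Category", y = "Average Profit")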

This faceted bar chart compares the profit for each category in each region

Profits in the southern region are lower compared to other regions

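ggplot(df_store, aes(x = category, y = profit)) +
  # geom_col() would stack every row and show total profit; this plots the mean
  geom_bar(stat = "summary", fun = "mean") +
  facet_wrap(~ region) +
  labs(title = "Average Profit by Category and Region", x = "Category", y = "Average Profit")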

Conclusion

Question 1:

Are sales in certain regions higher compared to others?

Question 2:

Is there a significant difference in profit between different categories?

For Question 1, the t-tests as run compared sales in one region with profit in another, so they do not show that any region has higher sales than the others. The regional mean sales (about 216 to 242) are close relative to their standard deviations (about 525 to 775), and a sales-to-sales comparison is needed before the null hypothesis can be rejected. For Question 2, the ANOVA gives strong evidence that mean profit differs between categories, so we reject the second question's null hypothesis. The charts suggest that office supplies and technology have higher profits than furniture.

I used t-tests and ANOVA to compare group means in my analysis. The ANOVA showed a significant difference in profit between categories, with furniture having a lower mean profit than office supplies and technology. I provided visualizations to better understand what the tests were showing. Overall, the analysis suggests that we may benefit from focusing our time and business decisions on higher-profit categories and regions, such as office supplies and technology in our East and West regions.

Further analysis on this data may be warranted to determine why and how these factors are contributing to these differences.


Capstone Project R
