STATISTICAL ANALYSIS OF SUPERSTORE DATA SET

INTRODUCTION

In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, there is no statistical test.

It is important to decide a priori which hypotheses are confirmatory (that is, testing some presupposed relationship) and which are exploratory (suggested by the data). No single study can support a whole series of hypotheses, so a sensible plan is to severely limit the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies. A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P<0.025. Note that, since tests are rarely independent, this is a very conservative procedure, i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask “are the data independent?” This can be difficult to decide, but as a rule of thumb, results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex, and social class, are not independent.
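
Returning to the Bonferroni correction mentioned above, here is a minimal R sketch (with hypothetical p-values, not taken from this data set) of applying it either by adjusting the threshold or with the built-in p.adjust():

# Hypothetical p-values from n = 4 hypothesis tests (illustrative only)
p_values <- c(0.012, 0.030, 0.001, 0.200)
n <- length(p_values)
# Compare each p-value against the adjusted threshold 0.05/n
p_values < 0.05 / n
# Equivalently, inflate the p-values and compare against the usual 0.05
p.adjust(p_values, method = "bonferroni") < 0.05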

Parametric Tests

Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (they require a smaller sample size) than nonparametric tests. Common parametric tests include the following:

* One-sample t-test
* Paired and unpaired t-test
* Analysis of Variance (ANOVA)
* Pearson correlation

One-Sample t-Test

The one-sample t-test examines whether the mean of a population is statistically different from a known or hypothesized value. It is a parametric test, also known as the single-sample t-test. The variable used in this test is known as the test variable, and its mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

1. A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles.
2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test whether samples taken from units in an apartment building exceed the clearance levels.
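
As an illustration of the first example, here is a minimal R sketch with simulated fill volumes (the sample size of 40 and the volumes are invented for demonstration):

set.seed(42)
# Simulated fill volumes (ml) for 40 randomly sampled bottles (illustrative only)
fill_volumes <- rnorm(40, mean = 151, sd = 2)
# H0: the true mean fill volume equals the 150 ml target
t.test(fill_volumes, mu = 150, alternative = "two.sided")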

Activate the required packages

library(haven)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(agricolae)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(plotrix)

Load the CSV data set

Superstore_data<-read.csv("C:\\Users\\LUMUMBA\\Downloads\\Superstore_data.csv")
attach(Superstore_data)

View the first and last few observations of your data set

head(Superstore_data,10)
tail(Superstore_data,10)

The One-Sample t-Test

What is the one-sample t-test?

The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.

When can I use the test?

You can use the test for continuous data. Your data should be a random sample from a normal population.

What if my data isn’t nearly normally distributed?

If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
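
As a minimal sketch of this fallback on the Superstore sales: the Shapiro-Wilk test screens for normality (on a random subsample, since shapiro.test() accepts at most 5000 values), and the Wilcoxon signed-rank test is the usual nonparametric counterpart of the one-sample t-test run below:

# Shapiro-Wilk normality test on a random subsample of the sales
set.seed(1)
shapiro.test(sample(Superstore_data$Sales, 5000))
# Nonparametric alternative to the one-sample t-test
wilcox.test(Superstore_data$Sales, mu = 200, alternative = "two.sided")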

The company assumes that from 2016 to 2019, its average sales were approximately 200. Test this hypothesis to determine whether the company's management was right in this assumption.

t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  Superstore_data$Sales
## t = 4.7893, df = 9993, p-value = 1.698e-06
## alternative hypothesis: true mean is not equal to 200
## 95 percent confidence interval:
##  217.6375 242.0785
## sample estimates:
## mean of x 
##   229.858

Interpretation

In the above output, the p-value of the test is 1.698e-06, which is far below the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of sales from 2016 to 2019 is not equal to 200 at the 5% level of significance; the sample mean is approximately 229.86.

Independent T-Test

The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.

Sales per Region (West and East Regions)

data22<-Superstore_data %>%
  dplyr::select(Region, Sales)%>%
  filter(Region == "West"|
           Region == "East")

head(data22,5)
t.test(Sales ~ Region,alternative="two.sided",data = data22)
## 
##  Welch Two Sample t-test
## 
## data:  Sales by Region
## t = 0.79611, df = 5603.9, p-value = 0.426
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
##  -17.31977  41.00552
## sample estimates:
## mean in group East mean in group West 
##           238.3361           226.4932

Interpretation

In the above output, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: at the 5% level of significance, there is no statistically significant difference in mean sales between the West and East regions.
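
Note that the output above is a Welch two-sample t-test: by default, R's t.test() does not assume equal variances. As a minimal sketch, one could check the equal-variance assumption with an F test and, if it seems plausible, re-run the pooled-variance version:

# F test for equality of variances between the East and West regions
var.test(Sales ~ Region, data = data22)
# If the variances look comparable, the classic pooled-variance t-test applies
t.test(Sales ~ Region, data = data22, var.equal = TRUE)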

Sales per Category (Technology and Furniture)

data23<-Superstore_data %>%
  dplyr::select(Category, Sales)%>%
  filter(Category == "Technology"|
           Category == "Furniture")

head(data23,5)
t.test(Sales ~ Category,alternative="two.sided",data = data23)
## 
##  Welch Two Sample t-test
## 
## data:  Sales by Category
## t = -3.6721, df = 2497.7, p-value = 0.0002456
## alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
## 95 percent confidence interval:
##  -157.8094  -47.9394
## sample estimates:
##  mean in group Furniture mean in group Technology 
##                 349.8349                 452.7093

Interpretation

In the above output, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Furniture categories is not equal to zero at the 5% level of significance.

Sales per Category (Technology and Office Supplies)

data24<-Superstore_data %>%
  dplyr::select(Category, Sales)%>%
  filter(Category == "Technology"|
           Category == "Office Supplies")

head(data24,5)
t.test(Sales ~ Category,alternative="two.sided",data = data24)
## 
##  Welch Two Sample t-test
## 
## data:  Sales by Category
## t = -12.694, df = 1982.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
## 95 percent confidence interval:
##  -384.8897 -281.8807
## sample estimates:
## mean in group Office Supplies      mean in group Technology 
##                      119.3241                      452.7093

Interpretation

In the above output, the p-value of the test is less than 2.2e-16, far below the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Office Supplies categories is not equal to zero at the 5% level of significance.
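
Because the samples are large, even modest differences can reach statistical significance, so it is sensible to gauge the magnitude of the difference as well. Here is a minimal sketch of Cohen's d, computed by hand for the Office Supplies versus Technology comparison:

# Split sales by category and compute the pooled standard deviation
grp <- split(data24$Sales, data24$Category)
n1 <- length(grp[[1]]); n2 <- length(grp[[2]])
pooled_sd <- sqrt(((n1 - 1) * var(grp[[1]]) + (n2 - 1) * var(grp[[2]])) /
                    (n1 + n2 - 2))
# Cohen's d: standardized difference between the two group means
(mean(grp[[1]]) - mean(grp[[2]])) / pooled_sd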

ONE-WAY ANALYSIS OF VARIANCE

One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.

The company assumes that from 2016 to 2019, the average sales made in the three categories (Technology, Furniture, and Office Supplies) were equal. Test this assumption.

anova_model<-aov(Sales~Category)
summary(anova_model)
##               Df    Sum Sq  Mean Sq F value Pr(>F)    
## Category       2 1.959e+08 97940872   265.5 <2e-16 ***
## Residuals   9991 3.686e+09   368906                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 265.5 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in average sales across the three categories at the 5% level of significance.
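
Before leaning on this result, it is worth checking the ANOVA assumptions; a minimal sketch using base R (Bartlett's test for homogeneity of variances, plus the default residual diagnostic plots):

# Homogeneity of variances across the three categories
bartlett.test(Sales ~ Category, data = Superstore_data)
# Residual diagnostics (normality, equal variance) for the fitted model
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1))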

Post Hoc Test

Tukey's HSD is a commonly used post hoc test; here it determines which pairs of categories differ in sales from 2016 to 2019.

tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sales ~ Category)
## 
## $Category
##                                 diff        lwr       upr p adj
## Office Supplies-Furniture  -230.5108 -266.45583 -194.5657 0e+00
## Technology-Furniture        102.8744   57.56303  148.1858 3e-07
## Technology-Office Supplies  333.3852  295.51937  371.2510 0e+00

From the results above, a significant difference in average sales exists between every pair of categories, as indicated by adjusted p-values below 0.05 for all pairs. These results can be visualized as shown below.

plot(tukey_hsd, las=1)

The company also assumes that from 2016 to 2019, the average sales made across all seventeen sub-categories were equal. Test this assumption.

anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## Sub.Category   16 7.763e+08 48515873   155.9 <2e-16 ***
## Residuals    9977 3.105e+09   311253                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 155.9 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in average sales across the seventeen sub-categories at the 5% level of significance.

Post Hoc Test

tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sales ~ Sub.Category)
## 
## $Sub.Category
##                                  diff          lwr          upr     p adj
## Appliances-Accessories     14.7811064   -98.352371   127.914584 1.0000000
## Art-Accessories          -181.9057697  -279.299223   -84.512317 0.0000000
## Binders-Accessories       -82.4140438  -167.571615     2.743527 0.0707858
## Bookcases-Accessories     287.8850290   142.479391   433.290667 0.0000000
## Chairs-Accessories        316.3578159   212.227967   420.487665 0.0000000
## Copiers-Accessories      1982.9670138  1738.872771  2227.061256 0.0000000
## Envelopes-Accessories    -151.1068795  -290.643768   -11.569991 0.0187882
## Fasteners-Accessories    -202.0378297  -350.263756   -53.811903 0.0003125
## Furnishings-Accessories  -120.1489362  -213.413396   -26.884476 0.0010412
## Labels-Accessories       -181.6715489  -304.305120   -59.037978 0.0000400
## Machines-Accessories     1429.5787092  1236.717765  1622.439653 0.0000000
## Paper-Accessories        -158.6905119  -245.436934   -71.944090 0.0000000
## Phones-Accessories        155.2369304    60.389846   250.084015 0.0000021
## Storage-Accessories        48.6159493   -47.347208   144.579107 0.9455486
## Supplies-Accessories       29.6755961  -126.561825   185.913017 0.9999997
## Tables-Accessories        432.8201673   304.435984   561.204351 0.0000000
## Art-Appliances           -196.6868761  -309.258580   -84.115172 0.0000002
## Binders-Appliances        -97.1951502  -199.365184     4.974884 0.0839321
## Bookcases-Appliances      273.1039226   117.124105   429.083741 0.0000002
## Chairs-Appliances         301.5767095   183.128706   420.024713 0.0000000
## Copiers-Appliances       1968.1859073  1717.648678  2218.723136 0.0000000
## Envelopes-Appliances     -165.8879859  -316.411895   -15.364077 0.0147424
## Fasteners-Appliances     -216.8189361  -375.431134   -58.206739 0.0002927
## Furnishings-Appliances   -134.9300426  -243.949139   -25.910946 0.0022497
## Labels-Appliances        -196.4526554  -331.455976   -61.449334 0.0000641
## Machines-Appliances      1414.7976027  1213.844256  1615.750949 0.0000000
## Paper-Appliances         -173.4716183  -276.969665   -69.973572 0.0000009
## Phones-Appliances         140.4558240    30.079770   250.831878 0.0013283
## Storage-Appliances         33.8348429   -77.501726   145.171412 0.9998021
## Supplies-Appliances        14.8944897  -151.229065   181.018045 1.0000000
## Tables-Appliances         418.0390609   277.791414   558.286708 0.0000000
## Binders-Art                99.4917259    15.081912   183.901540 0.0052743
## Bookcases-Art             469.7907987   324.821821   614.759777 0.0000000
## Chairs-Art                498.2635856   394.744359   601.782813 0.0000000
## Copiers-Art              2164.8727835  1921.038405  2408.707162 0.0000000
## Envelopes-Art              30.7988902  -108.282913   169.880694 0.9999974
## Fasteners-Art             -20.1320600  -167.929659   127.665539 1.0000000
## Furnishings-Art            61.7568335   -30.825370   154.339037 0.6491218
## Labels-Art                  0.2342208  -121.881288   122.349730 1.0000000
## Machines-Art             1611.4844789  1418.952537  1804.016420 0.0000000
## Paper-Art                  23.2152578   -62.797221   109.227737 0.9999589
## Phones-Art                337.1427001   242.966407   431.318994 0.0000000
## Storage-Art               230.5217190   135.221496   325.821942 0.0000000
## Supplies-Art              211.5813658    55.750251   367.412481 0.0003411
## Tables-Art                614.7259370   486.836518   742.615356 0.0000000
## Bookcases-Binders         370.2990728   233.250425   507.347720 0.0000000
## Chairs-Binders            398.7718597   306.671058   490.872661 0.0000000
## Copiers-Binders          2065.3810576  1826.170834  2304.591281 0.0000000
## Envelopes-Binders         -68.6928357  -199.498322    62.112651 0.9265977
## Fasteners-Binders        -119.6237859  -259.661129    20.413557 0.2055723
## Furnishings-Binders       -37.7348924  -117.345140    41.875355 0.9698971
## Labels-Binders            -99.2575051  -211.856461    13.341450 0.1629283
## Machines-Binders         1511.9927530  1325.351719  1698.633787 0.0000000
## Paper-Binders             -76.2764681  -148.140741    -4.412195 0.0244962
## Phones-Binders            237.6509742   156.192387   319.109562 0.0000000
## Storage-Binders           131.0299931    48.274572   213.785414 0.0000059
## Supplies-Binders          112.0896399   -36.401652   260.580932 0.4181442
## Tables-Binders            515.2342111   396.398061   634.070362 0.0000000
## Chairs-Bookcases           28.4727869  -121.105104   178.050678 0.9999997
## Copiers-Bookcases        1695.0819848  1428.412678  1961.751291 0.0000000
## Envelopes-Bookcases      -438.9919085  -615.063096  -262.920721 0.0000000
## Fasteners-Bookcases      -489.9228587  -672.956859  -306.888859 0.0000000
## Furnishings-Bookcases    -408.0339652  -550.261879  -265.806052 0.0000000
## Labels-Bookcases         -469.5565779  -632.558150  -306.555006 0.0000000
## Machines-Bookcases       1141.6936801   920.954324  1362.433037 0.0000000
## Paper-Bookcases          -446.5755409  -584.617062  -308.534020 0.0000000
## Phones-Bookcases         -132.6480986  -275.918784    10.622587 0.1080946
## Storage-Bookcases        -239.2690797  -383.281050   -95.257110 0.0000013
## Supplies-Bookcases       -258.2094329  -447.789631   -68.629235 0.0003179
## Tables-Bookcases          144.9351383   -22.435763   312.306039 0.1863844
## Copiers-Chairs           1666.6091979  1420.006796  1913.211600 0.0000000
## Envelopes-Chairs         -467.4646954  -611.344120  -323.585271 0.0000000
## Fasteners-Chairs         -518.3956456  -670.716593  -366.074698 0.0000000
## Furnishings-Chairs       -436.5067521  -536.151146  -336.862358 0.0000000
## Labels-Chairs            -498.0293648  -625.582250  -370.476480 0.0000000
## Machines-Chairs          1113.2208933   917.195157  1309.246630 0.0000000
## Paper-Chairs             -475.0483278  -568.620158  -381.476498 0.0000000
## Phones-Chairs            -161.1208855  -262.248108   -59.993663 0.0000049
## Storage-Chairs           -267.7418666  -369.916586  -165.567147 0.0000000
## Supplies-Chairs          -286.6822198  -446.809910  -126.554529 0.0000001
## Tables-Chairs             116.4623514   -16.628761   249.553464 0.1723836
## Envelopes-Copiers       -2134.0738932 -2397.589097 -1870.558690 0.0000000
## Fasteners-Copiers       -2185.0048435 -2453.222376 -1916.787311 0.0000000
## Furnishings-Copiers     -2103.1159499 -2345.330687 -1860.901213 0.0000000
## Labels-Copiers          -2164.6385627 -2419.606624 -1909.670502 0.0000000
## Machines-Copiers         -553.3883046  -848.625604  -258.151005 0.0000000
## Paper-Copiers           -2141.6575257 -2381.437969 -1901.877082 0.0000000
## Phones-Copiers          -1827.7300833 -2070.558600 -1584.901566 0.0000000
## Storage-Copiers         -1934.3510645 -2177.617681 -1691.084447 0.0000000
## Supplies-Copiers        -1953.2914176 -2226.018114 -1680.564721 0.0000000
## Tables-Copiers          -1550.1468465 -1807.930124 -1292.363569 0.0000000
## Fasteners-Envelopes       -50.9309502  -229.338317   127.476417 0.9999128
## Furnishings-Envelopes      30.9579433  -105.264385   167.180272 0.9999962
## Labels-Envelopes          -30.5646695  -188.353313   127.223974 0.9999996
## Machines-Envelopes       1580.6855886  1363.767155  1797.604023 0.0000000
## Paper-Envelopes            -7.5836324  -139.429015   124.261751 1.0000000
## Phones-Envelopes          306.3438099   169.033094   443.654526 0.0000000
## Storage-Envelopes         199.7228288    61.638829   337.806829 0.0000749
## Supplies-Envelopes        180.7824756    -4.334771   365.899722 0.0644947
## Tables-Envelopes          583.9270468   421.628674   746.225420 0.0000000
## Furnishings-Fasteners      81.8888935   -63.221082   226.998869 0.8728069
## Labels-Fasteners           20.3662808  -145.156039   185.888600 1.0000000
## Machines-Fasteners       1631.6165388  1409.009286  1854.223792 0.0000000
## Paper-Fasteners            43.3473178   -97.661856   184.356492 0.9997704
## Phones-Fasteners          357.2747601   211.142577   503.406943 0.0000000
## Storage-Fasteners         250.6537790   103.794754   397.512804 0.0000005
## Supplies-Fasteners        231.7134258    39.961562   423.465289 0.0034553
## Tables-Fasteners          634.8579970   465.031191   804.684803 0.0000000
## Labels-Furnishings        -61.5226128  -180.371140    57.325914 0.9347411
## Machines-Furnishings     1549.7276453  1359.251076  1740.204214 0.0000000
## Paper-Furnishings         -38.5415757  -119.849148    42.765997 0.9698815
## Phones-Furnishings        275.3858666   185.486205   365.285528 0.0000000
## Storage-Furnishings       168.7648855    77.688504   259.841267 0.0000000
## Supplies-Furnishings      149.8245323    -3.459881   303.108945 0.0639163
## Tables-Furnishings        552.9691034   428.195395   677.742812 0.0000000
## Machines-Labels          1611.2502581  1404.799158  1817.701358 0.0000000
## Paper-Labels               22.9810370   -90.824299   136.786373 0.9999993
## Phones-Labels             336.9084794   216.814007   457.002952 0.0000000
## Storage-Labels            230.2874982   109.309647   351.265350 0.0000000
## Supplies-Labels           211.3471451    38.613696   384.080594 0.0027717
## Tables-Labels             614.4917162   466.474111   762.509322 0.0000000
## Paper-Machines          -1588.2692211 -1775.640525 -1400.897917 0.0000000
## Phones-Machines         -1274.3417787 -1465.598238 -1083.085319 0.0000000
## Storage-Machines        -1380.9627599 -1572.775146 -1189.150374 0.0000000
## Supplies-Machines       -1399.9031130 -1627.923278 -1171.882949 0.0000000
## Tables-Machines          -996.7585419 -1206.676532  -786.840552 0.0000000
## Phones-Paper              313.9274423   230.809266   397.045619 0.0000000
## Storage-Paper             207.3064612   122.916951   291.695972 0.0000000
## Supplies-Paper            188.3661080    38.957964   337.774252 0.0015901
## Tables-Paper              591.5106792   471.530846   711.490513 0.0000000
## Storage-Phones           -106.6209811  -199.317352   -13.924610 0.0078092
## Supplies-Phones          -125.5613343  -279.813794    28.691125 0.2813586
## Tables-Phones             277.5832369   151.622179   403.544295 0.0000000
## Supplies-Storage          -18.9403532  -173.881566   136.000860 1.0000000
## Tables-Storage            384.2042180   257.400644   511.007792 0.0000000
## Tables-Supplies           403.1445712   226.282053   580.007090 0.0000000

Make an interpretation of the results above. Briefly: many pairs of sub-categories differ significantly in average sales (adjusted p < 0.05), with Copiers and Machines standing far above all other sub-categories, while pairs such as Appliances-Accessories and Storage-Accessories show no significant difference.

Plot

plot(tukey_hsd2, las=1)

Two-Way Analysis of Variance

What Is a 2-Way ANOVA?

ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.

Convert the string variables of interest into factors.

Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category <- as.factor(Superstore_data$Sub.Category)

Run the multiple linear model.

linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)

Convert the linear model into its equivalent two-factor ANOVA table.

Two_factor_anova<-anova(linear_model)
Two_factor_anova

From the results above, the F statistic for Category is 314.67 with an associated p-value of approximately 0, which is less than 0.01. This indicates a significant difference in mean sales across the categories at the 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19, also with a p-value of approximately 0, which is less than 0.01. This is a clear indication that there exists a significant difference in average sales across the sub-categories at the 1% level of significance.
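
Note that the model above contains main effects only: in this data set, Sub.Category is nested within Category (each sub-category belongs to a single category), so a Category-by-Sub.Category interaction is not estimable. With two crossed factors, such as Category and Region, a full two-way ANOVA with an interaction term can be fitted; a minimal sketch:

# Two-way ANOVA with interaction for two crossed factors
two_way_model <- aov(Sales ~ Category * Region, data = Superstore_data)
summary(two_way_model)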

Visualize the results using stargazer

stargazer(Two_factor_anova, type="text")
## 
## =================================================================================
## Statistic N       Mean            St. Dev.            Min              Max       
## ---------------------------------------------------------------------------------
## Df        3     3,331.000         5,755.608            2              9,977      
## Sum Sq    3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
## Mean Sq   3  46,569,095.000    49,015,303.000     311,253.000    97,940,872.000  
## F value   2      223.927           128.325          133.188          314.666     
## Pr(> F)   2       0.000             0.000              0                0        
## ---------------------------------------------------------------------------------

Linear Regression

Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.

In order to run a linear regression, it is important to first establish that there is a linear association between the two variables; for instance, one needs to create a scatter plot of the two variables to assess the linear association. Consider the scatter plot below for Sales and Profit.

Scatter Plot

plot(Sales, Profit, main = "A scatter plot of Sales and Profit")

Transform one or both variables and create the scatter plot. (Note: Profit contains negative values, so log(Profit) produces NaNs for those observations, hence the warnings below.)

plot(log(Sales),Profit)

plot(Sales, log(Profit))
## Warning in log(Profit): NaNs produced

Log-transform both variables

plot(log(Sales),log(Profit))
## Warning in log(Profit): NaNs produced

Run the correlation to determine the strength of association between the two variables of interest

cor(Sales,Profit)
## [1] 0.4790643

Run the Regression model

reg_model<-lm(Profit~Sales, data = Superstore_data)

Visualize the results

stargazer(reg_model, type="text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               Profit           
## -----------------------------------------------
## Sales                        0.180***          
##                               (0.003)          
##                                                
## Constant                    -12.733***         
##                               (2.192)          
##                                                
## -----------------------------------------------
## Observations                   9,994           
## R2                             0.230           
## Adjusted R2                    0.229           
## Residual Std. Error     205.639 (df = 9992)    
## F Statistic         2,976.247*** (df = 1; 9992)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
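
A quick way to see what this model implies is to overlay the fitted line on the raw scatter plot; a minimal sketch:

# Scatter plot of the raw data with the fitted OLS line overlaid
plot(Sales, Profit, main = "Profit vs Sales with the fitted regression line")
abline(reg_model, col = "red", lwd = 2)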

Generate the regression residuals

Resid<-residuals(reg_model)

Summarize the residuals

summary(Resid)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -7397.542     2.581    14.634     0.000    21.658  5261.551

From the results above, the residuals have a mean of 0.00, consistent with the zero-conditional-mean assumption. Next, we examine the covariance between the residuals and the independent variable; for an OLS fit this should be numerically zero by construction.

cov(Resid,Sales)
## [1] -5.062127e-11

The covariance is approximately zero, as expected: OLS residuals are uncorrelated with the regressor by construction. This supports the validity of the fitted model, although on its own it does not establish that changes in Sales cause changes in Profit.
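
Beyond the mean and covariance checks, a residuals-versus-fitted plot helps reveal nonlinearity or heteroscedasticity; a minimal sketch using the objects created above:

# Residuals against fitted values: a visible pattern would signal misspecification
plot(fitted(reg_model), Resid,
     xlab = "Fitted Profit", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, lty = 2)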

Data Visualization

Histogram

hist(log(Sales),breaks = 15)

summary(log(Sales))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.8119  2.8495  3.9980  4.1098  5.3468 10.0274
ggplot(data=Superstore_data, aes(log(Sales))) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                size = 1)+
 labs(title = "Histogram showing the distribution of Sales for Superstore Company",subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).

Put the title and subtitle at the centre of the graph

ggplot(data=Superstore_data, aes(log(Sales))) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                size = 1)+
 labs(title = "Histogram showing the distribution of Sales for Superstore Company",subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.subtitle = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).

Pie Chart (Average Sales per Region)

Summarize the average sales from various regions

Superstore_data %>% group_by(Region) %>% 
  drop_na() %>% 
  summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <-  c(215.7727, 236.1127, 241.8036,226.4932)
Regions <-  c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Regions,explode = 0.1, main = "Pie Chart of Sales from Various Regions of the United States")

Plot the 2D pie chart with percentage labels.

pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
   fill = rainbow(length(average_sales)))

Pie Chart (Average Sales per Category)

Summarize the average sales from various categories

Superstore_data %>% group_by(Category) %>% 
  drop_na() %>% 
  summarize(average_sales=mean(Sales))

Create the data for the chart

# Create data for the graph.
average_sales <-  c(347.7488, 119.0760,452.5782)
Sales_category <-  c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of Sales from Various Categories")

Create the sales pie chart with percentages

pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
   fill = rainbow(length(average_sales)))

Bar graphs

Create the data for the chart (average sales per category)

average_sales1 <-  c(347.7488, 119.0760,452.5782)
Sales_category1 <-  c("Furniture","Office Supplies","Technology")

Plot the bar chart

barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",col="blue",
main="Bar graph of the average sales for the various categories",border="red")