In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, testing some presupposed relationship) and which are exploratory (suggested by the data). No single study can support a whole series of hypotheses, so a sensible plan is to severely limit the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies. A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P < 0.025. Note that, since tests are rarely independent, this is a very conservative procedure, i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask “are the data independent?” This can be difficult to decide, but as a rule of thumb, results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.
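As a minimal sketch of the Bonferroni rule in R, using purely hypothetical p-values (the numbers below are illustrative, not from any study):

p_values <- c(0.012, 0.030, 0.041)   # hypothetical p-values from 3 independent tests
n <- length(p_values)
0.05 / n                             # Bonferroni-adjusted significance level: 0.05/3
p_values < 0.05 / n                  # which results remain significant after correction
# Equivalently, inflate the p-values themselves and compare against 0.05
p.adjust(p_values, method = "bonferroni")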
Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests. Parametric tests include the following:

* One-sample t-test
* Paired and unpaired t-test
* Analysis of Variance (ANOVA)
* Pearson correlation

One-sample t-test

The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test; it is also known as the Single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

1. A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles.
2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test whether samples taken from units in an apartment building exceed the clearance level.
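To make example 1 concrete, here is a minimal sketch with simulated data; the fill volumes below are hypothetical, generated only to illustrate the call:

set.seed(1)
volumes <- rnorm(30, mean = 151, sd = 2)   # 30 hypothetical bottle volumes, in ml
# H0: mean fill volume = 150 ml vs H1: mean fill volume != 150 ml
t.test(volumes, mu = 150)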
library(haven)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(agricolae)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
library(plotrix)
Superstore_data<-read.csv("C:\\Users\\LUMUMBA\\Downloads\\Superstore_data.csv")
attach(Superstore_data)
head(Superstore_data,10)
tail(Superstore_data,10)
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
You can use the test for continuous data. Your data should be a random sample from a normal population.
If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
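A minimal sketch of that workflow, assuming a small hypothetical sample (shapiro.test() requires between 3 and 5000 observations, so it suits small samples rather than the full Superstore data):

set.seed(42)
x <- rexp(20, rate = 1/150)   # a skewed hypothetical sample of 20 values
shapiro.test(x)               # Shapiro-Wilk test of normality
# If normality is doubtful, the Wilcoxon signed-rank test is the
# nonparametric analogue of the one-sample t-test
wilcox.test(x, mu = 150)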
The company assumes that from 2016 to 2019 it made average sales of approximately 200. Test this hypothesis to determine whether management was right in its assumption.
t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)
##
## One Sample t-test
##
## data: Superstore_data$Sales
## t = 4.7893, df = 9993, p-value = 1.698e-06
## alternative hypothesis: true mean is not equal to 200
## 95 percent confidence interval:
## 217.6375 242.0785
## sample estimates:
## mean of x
## 229.858
In the above output, the p-value of the test is approximately 1.7e-06, which is far below the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of sales from 2016 to 2019 is not equal to 200 at the 5% level of significance.
The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.
data22<-Superstore_data %>%
dplyr::select(Region, Sales)%>%
filter(Region == "West"|
Region == "East")
head(data22,5)
t.test(Sales ~ Region,alternative="two.sided",data = data22)
##
## Welch Two Sample t-test
##
## data: Sales by Region
## t = 0.79611, df = 5603.9, p-value = 0.426
## alternative hypothesis: true difference in means between group East and group West is not equal to 0
## 95 percent confidence interval:
## -17.31977 41.00552
## sample estimates:
## mean in group East mean in group West
## 238.3361 226.4932
In the above output, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no statistically significant evidence at the 5% level that the mean sales of the West and East regions differ.
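Note that t.test() in R defaults to the Welch version, which does not assume equal group variances. If one prefers the classical pooled t-test, a reasonable first step (a sketch, not part of the original analysis) is to compare the two group variances:

var.test(Sales ~ Region, data = data22)   # F test of equal variances
# If the variances look comparable, request the pooled test explicitly:
t.test(Sales ~ Region, data = data22, var.equal = TRUE)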
data23<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Furniture")
head(data23,5)
t.test(Sales ~ Category,alternative="two.sided",data = data23)
##
## Welch Two Sample t-test
##
## data: Sales by Category
## t = -3.6721, df = 2497.7, p-value = 0.0002456
## alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
## 95 percent confidence interval:
## -157.8094 -47.9394
## sample estimates:
## mean in group Furniture mean in group Technology
## 349.8349 452.7093
In the above output, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference between sales in the Technology category and the Furniture category is not equal to zero at the 5% level of significance.
data24<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Office Supplies")
head(data24,5)
t.test(Sales ~ Category,alternative="two.sided",data = data24)
##
## Welch Two Sample t-test
##
## data: Sales by Category
## t = -12.694, df = 1982.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
## 95 percent confidence interval:
## -384.8897 -281.8807
## sample estimates:
## mean in group Office Supplies mean in group Technology
## 119.3241 452.7093
In the above output, the p-value of the test is less than 2.2e-16, far below the significance level alpha = 0.05. We can conclude that the true mean difference between sales in the Technology category and the Office Supplies category is not equal to zero at the 5% level of significance.
One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
anova_model<-aov(Sales~Category)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Category 2 1.959e+08 97940872 265.5 <2e-16 ***
## Residuals 9991 3.686e+09 368906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 265.5 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in average sales across the three categories at the 5% level of significance.
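Before relying on the F test, it is common to check the ANOVA assumptions. A minimal sketch of such diagnostics on the fitted model (not part of the original analysis):

bartlett.test(Sales ~ Category, data = Superstore_data)   # homogeneity of variances
plot(anova_model, which = 1:2)   # residuals vs fitted values and normal Q-Q plot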
Tukey’s HSD is the most commonly used post hoc test to determine which pairs of categories differ in sales from 2016 to 2019.
tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sales ~ Category)
##
## $Category
## diff lwr upr p adj
## Office Supplies-Furniture -230.5108 -266.45583 -194.5657 0e+00
## Technology-Furniture 102.8744 57.56303 148.1858 3e-07
## Technology-Office Supplies 333.3852 295.51937 371.2510 0e+00
From the results above, a significant difference in average Sales exists between every pair of categories, as indicated by adjusted p-values below 0.05 for all pairs. These results can be visualized as shown below.
plot(tukey_hsd, las=1)
anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sub.Category 16 7.763e+08 48515873 155.9 <2e-16 ***
## Residuals 9977 3.105e+09 311253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 155.9 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in average sales across the seventeen sub-categories at the 5% level of significance.
tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sales ~ Sub.Category)
##
## $Sub.Category
## diff lwr upr p adj
## Appliances-Accessories 14.7811064 -98.352371 127.914584 1.0000000
## Art-Accessories -181.9057697 -279.299223 -84.512317 0.0000000
## Binders-Accessories -82.4140438 -167.571615 2.743527 0.0707858
## Bookcases-Accessories 287.8850290 142.479391 433.290667 0.0000000
## Chairs-Accessories 316.3578159 212.227967 420.487665 0.0000000
## Copiers-Accessories 1982.9670138 1738.872771 2227.061256 0.0000000
## Envelopes-Accessories -151.1068795 -290.643768 -11.569991 0.0187882
## Fasteners-Accessories -202.0378297 -350.263756 -53.811903 0.0003125
## Furnishings-Accessories -120.1489362 -213.413396 -26.884476 0.0010412
## Labels-Accessories -181.6715489 -304.305120 -59.037978 0.0000400
## Machines-Accessories 1429.5787092 1236.717765 1622.439653 0.0000000
## Paper-Accessories -158.6905119 -245.436934 -71.944090 0.0000000
## Phones-Accessories 155.2369304 60.389846 250.084015 0.0000021
## Storage-Accessories 48.6159493 -47.347208 144.579107 0.9455486
## Supplies-Accessories 29.6755961 -126.561825 185.913017 0.9999997
## Tables-Accessories 432.8201673 304.435984 561.204351 0.0000000
## Art-Appliances -196.6868761 -309.258580 -84.115172 0.0000002
## Binders-Appliances -97.1951502 -199.365184 4.974884 0.0839321
## Bookcases-Appliances 273.1039226 117.124105 429.083741 0.0000002
## Chairs-Appliances 301.5767095 183.128706 420.024713 0.0000000
## Copiers-Appliances 1968.1859073 1717.648678 2218.723136 0.0000000
## Envelopes-Appliances -165.8879859 -316.411895 -15.364077 0.0147424
## Fasteners-Appliances -216.8189361 -375.431134 -58.206739 0.0002927
## Furnishings-Appliances -134.9300426 -243.949139 -25.910946 0.0022497
## Labels-Appliances -196.4526554 -331.455976 -61.449334 0.0000641
## Machines-Appliances 1414.7976027 1213.844256 1615.750949 0.0000000
## Paper-Appliances -173.4716183 -276.969665 -69.973572 0.0000009
## Phones-Appliances 140.4558240 30.079770 250.831878 0.0013283
## Storage-Appliances 33.8348429 -77.501726 145.171412 0.9998021
## Supplies-Appliances 14.8944897 -151.229065 181.018045 1.0000000
## Tables-Appliances 418.0390609 277.791414 558.286708 0.0000000
## Binders-Art 99.4917259 15.081912 183.901540 0.0052743
## Bookcases-Art 469.7907987 324.821821 614.759777 0.0000000
## Chairs-Art 498.2635856 394.744359 601.782813 0.0000000
## Copiers-Art 2164.8727835 1921.038405 2408.707162 0.0000000
## Envelopes-Art 30.7988902 -108.282913 169.880694 0.9999974
## Fasteners-Art -20.1320600 -167.929659 127.665539 1.0000000
## Furnishings-Art 61.7568335 -30.825370 154.339037 0.6491218
## Labels-Art 0.2342208 -121.881288 122.349730 1.0000000
## Machines-Art 1611.4844789 1418.952537 1804.016420 0.0000000
## Paper-Art 23.2152578 -62.797221 109.227737 0.9999589
## Phones-Art 337.1427001 242.966407 431.318994 0.0000000
## Storage-Art 230.5217190 135.221496 325.821942 0.0000000
## Supplies-Art 211.5813658 55.750251 367.412481 0.0003411
## Tables-Art 614.7259370 486.836518 742.615356 0.0000000
## Bookcases-Binders 370.2990728 233.250425 507.347720 0.0000000
## Chairs-Binders 398.7718597 306.671058 490.872661 0.0000000
## Copiers-Binders 2065.3810576 1826.170834 2304.591281 0.0000000
## Envelopes-Binders -68.6928357 -199.498322 62.112651 0.9265977
## Fasteners-Binders -119.6237859 -259.661129 20.413557 0.2055723
## Furnishings-Binders -37.7348924 -117.345140 41.875355 0.9698971
## Labels-Binders -99.2575051 -211.856461 13.341450 0.1629283
## Machines-Binders 1511.9927530 1325.351719 1698.633787 0.0000000
## Paper-Binders -76.2764681 -148.140741 -4.412195 0.0244962
## Phones-Binders 237.6509742 156.192387 319.109562 0.0000000
## Storage-Binders 131.0299931 48.274572 213.785414 0.0000059
## Supplies-Binders 112.0896399 -36.401652 260.580932 0.4181442
## Tables-Binders 515.2342111 396.398061 634.070362 0.0000000
## Chairs-Bookcases 28.4727869 -121.105104 178.050678 0.9999997
## Copiers-Bookcases 1695.0819848 1428.412678 1961.751291 0.0000000
## Envelopes-Bookcases -438.9919085 -615.063096 -262.920721 0.0000000
## Fasteners-Bookcases -489.9228587 -672.956859 -306.888859 0.0000000
## Furnishings-Bookcases -408.0339652 -550.261879 -265.806052 0.0000000
## Labels-Bookcases -469.5565779 -632.558150 -306.555006 0.0000000
## Machines-Bookcases 1141.6936801 920.954324 1362.433037 0.0000000
## Paper-Bookcases -446.5755409 -584.617062 -308.534020 0.0000000
## Phones-Bookcases -132.6480986 -275.918784 10.622587 0.1080946
## Storage-Bookcases -239.2690797 -383.281050 -95.257110 0.0000013
## Supplies-Bookcases -258.2094329 -447.789631 -68.629235 0.0003179
## Tables-Bookcases 144.9351383 -22.435763 312.306039 0.1863844
## Copiers-Chairs 1666.6091979 1420.006796 1913.211600 0.0000000
## Envelopes-Chairs -467.4646954 -611.344120 -323.585271 0.0000000
## Fasteners-Chairs -518.3956456 -670.716593 -366.074698 0.0000000
## Furnishings-Chairs -436.5067521 -536.151146 -336.862358 0.0000000
## Labels-Chairs -498.0293648 -625.582250 -370.476480 0.0000000
## Machines-Chairs 1113.2208933 917.195157 1309.246630 0.0000000
## Paper-Chairs -475.0483278 -568.620158 -381.476498 0.0000000
## Phones-Chairs -161.1208855 -262.248108 -59.993663 0.0000049
## Storage-Chairs -267.7418666 -369.916586 -165.567147 0.0000000
## Supplies-Chairs -286.6822198 -446.809910 -126.554529 0.0000001
## Tables-Chairs 116.4623514 -16.628761 249.553464 0.1723836
## Envelopes-Copiers -2134.0738932 -2397.589097 -1870.558690 0.0000000
## Fasteners-Copiers -2185.0048435 -2453.222376 -1916.787311 0.0000000
## Furnishings-Copiers -2103.1159499 -2345.330687 -1860.901213 0.0000000
## Labels-Copiers -2164.6385627 -2419.606624 -1909.670502 0.0000000
## Machines-Copiers -553.3883046 -848.625604 -258.151005 0.0000000
## Paper-Copiers -2141.6575257 -2381.437969 -1901.877082 0.0000000
## Phones-Copiers -1827.7300833 -2070.558600 -1584.901566 0.0000000
## Storage-Copiers -1934.3510645 -2177.617681 -1691.084447 0.0000000
## Supplies-Copiers -1953.2914176 -2226.018114 -1680.564721 0.0000000
## Tables-Copiers -1550.1468465 -1807.930124 -1292.363569 0.0000000
## Fasteners-Envelopes -50.9309502 -229.338317 127.476417 0.9999128
## Furnishings-Envelopes 30.9579433 -105.264385 167.180272 0.9999962
## Labels-Envelopes -30.5646695 -188.353313 127.223974 0.9999996
## Machines-Envelopes 1580.6855886 1363.767155 1797.604023 0.0000000
## Paper-Envelopes -7.5836324 -139.429015 124.261751 1.0000000
## Phones-Envelopes 306.3438099 169.033094 443.654526 0.0000000
## Storage-Envelopes 199.7228288 61.638829 337.806829 0.0000749
## Supplies-Envelopes 180.7824756 -4.334771 365.899722 0.0644947
## Tables-Envelopes 583.9270468 421.628674 746.225420 0.0000000
## Furnishings-Fasteners 81.8888935 -63.221082 226.998869 0.8728069
## Labels-Fasteners 20.3662808 -145.156039 185.888600 1.0000000
## Machines-Fasteners 1631.6165388 1409.009286 1854.223792 0.0000000
## Paper-Fasteners 43.3473178 -97.661856 184.356492 0.9997704
## Phones-Fasteners 357.2747601 211.142577 503.406943 0.0000000
## Storage-Fasteners 250.6537790 103.794754 397.512804 0.0000005
## Supplies-Fasteners 231.7134258 39.961562 423.465289 0.0034553
## Tables-Fasteners 634.8579970 465.031191 804.684803 0.0000000
## Labels-Furnishings -61.5226128 -180.371140 57.325914 0.9347411
## Machines-Furnishings 1549.7276453 1359.251076 1740.204214 0.0000000
## Paper-Furnishings -38.5415757 -119.849148 42.765997 0.9698815
## Phones-Furnishings 275.3858666 185.486205 365.285528 0.0000000
## Storage-Furnishings 168.7648855 77.688504 259.841267 0.0000000
## Supplies-Furnishings 149.8245323 -3.459881 303.108945 0.0639163
## Tables-Furnishings 552.9691034 428.195395 677.742812 0.0000000
## Machines-Labels 1611.2502581 1404.799158 1817.701358 0.0000000
## Paper-Labels 22.9810370 -90.824299 136.786373 0.9999993
## Phones-Labels 336.9084794 216.814007 457.002952 0.0000000
## Storage-Labels 230.2874982 109.309647 351.265350 0.0000000
## Supplies-Labels 211.3471451 38.613696 384.080594 0.0027717
## Tables-Labels 614.4917162 466.474111 762.509322 0.0000000
## Paper-Machines -1588.2692211 -1775.640525 -1400.897917 0.0000000
## Phones-Machines -1274.3417787 -1465.598238 -1083.085319 0.0000000
## Storage-Machines -1380.9627599 -1572.775146 -1189.150374 0.0000000
## Supplies-Machines -1399.9031130 -1627.923278 -1171.882949 0.0000000
## Tables-Machines -996.7585419 -1206.676532 -786.840552 0.0000000
## Phones-Paper 313.9274423 230.809266 397.045619 0.0000000
## Storage-Paper 207.3064612 122.916951 291.695972 0.0000000
## Supplies-Paper 188.3661080 38.957964 337.774252 0.0015901
## Tables-Paper 591.5106792 471.530846 711.490513 0.0000000
## Storage-Phones -106.6209811 -199.317352 -13.924610 0.0078092
## Supplies-Phones -125.5613343 -279.813794 28.691125 0.2813586
## Tables-Phones 277.5832369 151.622179 403.544295 0.0000000
## Supplies-Storage -18.9403532 -173.881566 136.000860 1.0000000
## Tables-Storage 384.2042180 257.400644 511.007792 0.0000000
## Tables-Supplies 403.1445712 226.282053 580.007090 0.0000000
From the Tukey output above, pairs of sub-categories with an adjusted p-value below 0.05 (for example, Copiers-Accessories and Machines-Accessories) differ significantly in average sales, while pairs such as Appliances-Accessories, Storage-Accessories and Supplies-Accessories do not. These results can be plotted as shown below.
plot(tukey_hsd2, las=1)
ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.
Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category<- as.factor((Superstore_data$Sub.Category))
linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)
Two_factor_anova<-anova(linear_model)
Two_factor_anova
From the results above, the F statistic for Category is 314.67 with an associated p-value well below 0.01, indicating a significant difference in mean Sales across the categories at the 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19, also with a p-value well below 0.01, a clear indication that there exists a significant difference in average Sales across the sub-categories at the 1% level of significance.
stargazer(Two_factor_anova, type="text")
##
## =================================================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------------------------------------
## Df 3 3,331.000 5,755.608 2 9,977
## Sum Sq 3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
## Mean Sq 3 46,569,095.000 49,015,303.000 311,253.000 97,940,872.000
## F value 2 223.927 128.325 133.188 314.666
## Pr(> F) 2 0.000 0.000 0 0
## ---------------------------------------------------------------------------------
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
In order to run a linear regression, it is important to first ensure that there is a well-established linear association between the two variables; for instance, one needs to create a scatter plot of the two variables to assess the linear association. Consider the scatter plots below for Sales and Profit.
plot(Sales, Profit, main = "A scatter plot of Sales and Profit")
plot(log(Sales),Profit)
plot(Sales, log(Profit))
## Warning in log(Profit): NaNs produced
plot(log(Sales),log(Profit))
## Warning in log(Profit): NaNs produced
## Run the correlation to determine the strength of association between the two variables of interest
cor(Sales,Profit)
## [1] 0.4790643
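cor() reports only the strength of the linear association. As a small complement (an addition here, not part of the original analysis), cor.test() also supplies a significance test and a confidence interval for the correlation coefficient:

cor.test(Sales, Profit)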
reg_model<-lm(Profit~Sales, data = Superstore_data)
stargazer(reg_model, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## Profit
## -----------------------------------------------
## Sales 0.180***
## (0.003)
##
## Constant -12.733***
## (2.192)
##
## -----------------------------------------------
## Observations 9,994
## R2 0.230
## Adjusted R2 0.229
## Residual Std. Error 205.639 (df = 9992)
## F Statistic 2,976.247*** (df = 1; 9992)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Resid<-residuals(reg_model)
summary(Resid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7397.542 2.581 14.634 0.000 21.658 5261.551
From the results above, the residuals have a mean of 0.00. Note that this is guaranteed by least squares whenever the model includes an intercept, so it is a consistency check on the fit rather than a test of the zero-conditional-mean assumption. Next, we examine the covariance between the residuals and the independent variable.
cov(Resid,Sales)
## [1] -5.062127e-11
The covariance is numerically zero (up to floating-point error). Again, this is a mechanical property of least squares, which constructs the residuals to be uncorrelated with the regressors: it confirms the model was fitted correctly, but on its own it is not evidence that changes in Sales cause changes in Profit.
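A further sanity check worth sketching here (not in the original analysis) is R’s standard set of diagnostic plots for a fitted lm object, which assess linearity, normality of the residuals, and influential points:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(reg_model)        # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # restore the default layout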
hist(log(Sales),breaks = 15)
summary(log(Sales))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8119 2.8495 3.9980 4.1098 5.3468 10.0274
ggplot(data=Superstore_data, aes(log(Sales))) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(-2,12))+
stat_function(fun = dnorm,
args = list(mean = mean(log(Sales)),
sd = sd(log(Sales))),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of Sales for Superstore Company",subtitle = "Sales from 2016 to 2019",
caption="Data Collected and compiled by the Superstore Company")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
## Put the title and subtitle at the centre of the graph
ggplot(data=Superstore_data, aes(log(Sales))) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(-2,12))+
stat_function(fun = dnorm,
args = list(mean = mean(log(Sales)),
sd = sd(log(Sales))),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of Sales for Superstore Company",subtitle = "Sales from 2016 to 2019",
caption="Data Collected and compiled by the Superstore Company")+
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.subtitle = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
Summarize the average sales from various regions
Superstore_data %>% group_by(Region) %>%
drop_na() %>%
summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Regions,explode = 0.1, main = "Pie Chart of Sales from Various Regions of the United States")
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
fill = rainbow(length(average_sales)))
Summarize the average sales from various categories
Superstore_data %>% group_by(Category) %>%
drop_na() %>%
summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(347.7488, 119.0760,452.5782)
Sales_category <- c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of Sales from Various Categories")
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
fill = rainbow(length(average_sales)))
average_sales1 <- c(347.7488, 119.0760,452.5782)
Sales_category1 <- c("Furniture","Office Supplies","Technology")
barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",col="blue",
main="Bar graph of the average sales for the various categories",border="red")