1.Desriptive Statistics and Inferential statistics are 2 main pillar in field of Statistics . While Descriptive statistics present summary information like central tendency , skweness and dispersion about a sample data , inferential statistics is used to make inferences or prediction about the larger dataset using a sample dataset.
Unlike Inferential statistics,descriptive statistics is always accuate
and precise.
For Example , to describe the SAT score of 20 students out of 100 in a
class we use descriptive statistics like mean , median ,mode ,varianve
,standard deviations but to make inferences about the general overall
score of 100 students we use hyposthesis test which is inferential
statistics .
2.Data visualisation is
important aspect of data analytics as it gives the pictorial
representation about the shape , distribution and outliers in the
dataset.
3.R is majorly used by statisticians and data scientists to make predictions and analyze trends in data. It has many application in several domain like Finanace ,healthcare . For example , Bank Of America uses R to report on finances and gain insights about monetory losses.
2.ANALYSIS SECTIONIn this section we are going to perform several data analysis tasks on our dataset.
#Libraries used
library(tidyverse)
library(readxl)
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(kableExtra)
#Datasets used
M2Data2023 <- read_excel("DataSets/M2Project_Data-1.xlsx")
TASK 1 Presenting the First and last 5 records of our dataset.
M2head<- head(M2Data2023,5)
M2tail<-tail(M2Data2023,5)
M2head_tail<- rbind(M2head,M2tail)
knitr::kable(M2head_tail,
align = "c",
format = "markdown",
caption = "First and Last 5 rows",
table.attr="style='width:20%;'")
| Region | Market | Company_Segment | Product_Category | Product_SubCategory | Price | Quantity | Sales | Profits | ShippingCost |
|---|---|---|---|---|---|---|---|---|---|
| Central US | USCA | Consumer | Technology | Phones | 221.98 | 2 | 443.96 | 62.15 | 40.77 |
| Oceania | Asia Pacific | Corporate | Furniture | Chairs | 3709.40 | 9 | 33384.60 | -288.77 | 923.63 |
| Oceania | Asia Pacific | Consumer | Technology | Phones | 5175.17 | 9 | 46576.53 | 919.97 | 915.49 |
| Western Europe | Europe | Home Office | Technology | Phones | 2892.51 | 5 | 14462.55 | -96.54 | 910.16 |
| Western Africa | Africa | Consumer | Technology | Copiers | 2832.96 | 8 | 22663.68 | 311.52 | 903.04 |
| Eastern Asia | Asia Pacific | Consumer | Furniture | Tables | 2614.69 | 7 | 18302.83 | -821.96 | 203.26 |
| Western US | USCA | Corporate | Office Supplies | Appliances | 69.48 | 1 | 69.48 | 20.84 | 12.04 |
| Oceania | Asia Pacific | Consumer | Technology | Copiers | 636.78 | 2 | 1273.56 | 286.50 | 203.20 |
| South America | LATAM | Corporate | Furniture | Bookcases | 2751.20 | 10 | 27512.00 | 110.00 | 203.13 |
| Southeastern Asia | Asia Pacific | Corporate | Technology | Phones | 1587.00 | 3 | 4761.00 | -76.56 | 203.08 |
Calculating the number of records for each market type in dataset
#creating table
M2Market_table<-M2Data2023$Market %>%
table() %>%
sort(decreasing = TRUE) %>% as.data.frame()
##Assign column names
colnames(M2Market_table) <- c("Markets","Freq")
##display Table
knitr::kable(M2Market_table,align = "c",
format = "markdown",
caption = "Frequency ditribution of Markets",
table.attr="style='width:20%;'")
| Markets | Freq |
|---|---|
| Asia Pacific | 365 |
| Europe | 248 |
| USCA | 200 |
| LATAM | 133 |
| Africa | 54 |
Asia Pacific has highest no of records
365
1.It can be observed from the above table that Africa has
lowest no of records 54
TASK 3
Creating the BarPlot for above obtained Table
attach(M2Market_table)
barplot(Freq~Markets,data = M2Market_table,
col=brewer.pal(5,"Set3"),
main="Freq distribution of Markets",
ylab="Freq",ylim=c(0,400))
Asia Pacific has highest no of records.
1.It can be observed from the above table that Africa has
lowest no of records .
TASK 4
Creating a Piechart to present % distribution of product category in Africa Market
##filtering data
M2Data_Africa= filter(M2Data2023,M2Data2023$Market=="Africa")
#Creating Table
M2Data_Africatble<-M2Data_Africa$Product_Category %>% table() %>%
sort(decreasing = TRUE) %>%
as.data.frame()
#Assign columns
colnames(M2Data_Africatble)<- c("Category","Freq")
##Adding a new column for Percentages
M2Data_Africatble<-M2Data_Africatble %>%
mutate(perc =round((M2Data_Africatble$Freq/sum(M2Data_Africatble$Freq))*100,2))
#Display the table
attach(M2Data_Africatble)
knitr::kable(M2Data_Africatble,align = "c",
format = "markdown",
caption = "percentage ditribution of products in Africa",
table.attr="style='width:20%;'")
| Category | Freq | perc |
|---|---|---|
| Technology | 30 | 55.56 |
| Furniture | 13 | 24.07 |
| Office Supplies | 11 | 20.37 |
#Adding new column for lables:-
M2Data_Africatble$lables=paste(M2Data_Africatble$Category,M2Data_Africatble$perc,sep="\n")
##Piechart
par(mai=c(1,2,2,2))
pie(M2Data_Africatble$perc,labels=M2Data_Africatble$lables,
radius=1,
main="Pie Chart showing Percentage distribution of product category in Africa",
col=c("pink","light green","gray"))
Technology has highest no of records of
55.56 %
2.It can be observed from the following PieChart that
Office Supplies has lowest no of records of
20.37 %
TASK 5
Calculating frequency distribution of product Sub-category in Africa Market
##Creating Table
M2Data_Africatble2<-M2Data_Africa$Product_SubCategory %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame()
##assign column names
colnames(M2Data_Africatble2)<- c("SubCategory","Freq")
##creating barplot
attach(M2Data_Africatble2)
barplot(Freq~SubCategory,data=M2Data_Africatble2,
main="Distribution of Sub-Category in Africa")
Phones has highest no of records of
13
2.It can be observed from the following Barplot that
Storage has lowest no of records of 3
TASK 6
Creating a presentable Barplot to present distribution of product Sub-category in Africa Market
par(mai=c(1,1,0.2,0.1))
par(mfrow=c(1,1))
barplot(Freq~SubCategory,data=M2Data_Africatble2,
main="Distribution of Sub-Category in Africa",
col=brewer.pal(10,"Dark2"),width=2.5,
ylab="Frequency",ylim=c(0,15),las=1,cex.names=0.6,
cex.axes=0.5,space=1,xlab="Sub-Category",
horiz = FALSE)
Calculating mean Sales Per product Sub-category in Africa Market
#calculating mean sales
meansales=tapply(M2Data_Africa$Sales,M2Data_Africa$Product_SubCategory , mean)%>% sort(decreasing = TRUE)%>% round(digits=2)
meansales_df<-as.data.frame(meansales)
colnames(meansales_df)<-c("Mean_Sales")
#presenting table
knitr::kable(meansales_df,align = "c",
format = "markdown",
caption = "mean sales per sub category in Africa",
table.attr="style='width:20%;'")
| Mean_Sales | |
|---|---|
| Copiers | 26338.29 |
| Storage | 21289.20 |
| Chairs | 19306.76 |
| Phones | 15001.70 |
| Tables | 14738.97 |
| Bookcases | 10441.84 |
| Appliances | 8601.98 |
| Machines | 6991.88 |
| Accessories | 6478.98 |
#Using dot plot
dotchart(meansales,
main="mean sales per sub category in Africa",
xlab="Mean Sales",bg=brewer.pal(10,"Dark2"),
color = brewer.pal(10,"Dark2"),xlim=c(0,28000))
Copiers has the highest average sales
of 2.633829\times 10^{4}
2.From the above dot plot we can say that
Accessories has the highest average
sales of 6478.98
Calculating Total Sales Per Region in Africa Market
#Calculating total sales Per region
salesperregion=tapply(M2Data_Africa$Sales,M2Data_Africa$Region , sum)%>% sort(decreasing = TRUE)
knitr::kable(salesperregion,align = "c",
format = "markdown",
caption = "Total sales per Region in Africa",
table.attr="style='width:20%;'")
| x | |
|---|---|
| Central Africa | 205523.8 |
| North Africa | 178792.3 |
| Southern Africa | 161749.4 |
| Western Africa | 116827.0 |
| Eastern Africa | 96575.4 |
par(mai=c(1,1.5,0.2,0.2))
barplot(salesperregion,
main="Distribution of Sales per Region in Africa",
col=brewer.pal(10,"Set2"),width=2.5,
xlab="Frequency",xlim=c(0,300000),las=1,cex.names=0.8,
cex.axes=0.8,space=1,ylab="",
horiz = TRUE)
Central Africa has the highest total
sales of 2.0552376\times 10^{5}
2.From the above dot plot we can say that
Eastern Africa has the highest total
sales of 9.65754\times 10^{4}
Calculating Mean shipping Cost per Region in Africa Market
#Calculating mean
meanshpcost=tapply(M2Data_Africa$ShippingCost,M2Data_Africa$Product_SubCategory, mean)%>% sort(decreasing = TRUE)
#presenting table
knitr::kable(meanshpcost,align = "c",
format = "markdown",
caption = "mean shpping cost per sub category in Africa",
table.attr="style='width:20%;'")
| x | |
|---|---|
| Phones | 402.2385 |
| Copiers | 398.3986 |
| Appliances | 397.1350 |
| Chairs | 353.0800 |
| Storage | 346.5200 |
| Bookcases | 284.4883 |
| Tables | 278.2225 |
| Accessories | 268.2300 |
| Machines | 261.7283 |
#Using dot plot
par(mai=c(1,1.5,0.2,0.2))
par(mfrow=c(1,1))
barplot(meanshpcost,
main="Mean shipping cost per sub category in Africa",
col=brewer.pal(10,"Spectral"),width=2.5,
xlab="Frequency",xlim=c(0,500),las=1,cex.names=0.8,
cex.axes=0.8,space=1,ylab="",
horiz = TRUE)
Phones has the highest total sales of
402.24
From the above dot plot we can say that
Machines has the highest total sales of
261.73
Creating Boxplot and Histogram for profit
par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,1))
hist(M2Data2023$Profits,breaks=50,
col="pink",xlab="Profits",
main="Histogram for profit",las=1)
boxplot(M2Data2023$Profits,
horizontal=TRUE,col="light blue",xlab="Profits",
main="Boxplot for Profits")
Creating a histogram and Boxplot for Profits in Latin American market
##Filtering Data
M2_t13LATAM<-filter(M2Data2023,M2Data2023$Market=="LATAM")
par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,1))
hist(M2_t13LATAM$Profits,breaks=50,
col="red",xlab="Profits",
main="Histogram for profit in LATAM",las=1,xlim=c(-2000,2000),
ylim=c(0,20))
boxplot(M2_t13LATAM$Profits,
horizontal=TRUE,col="yellow",xlab="Profits",
main="Boxplot for Profits in LATAM")
Calculating Total sales per Region in Latin America
#calculating total sales per Region
totalsales_latam=tapply(M2_t13LATAM$Sales,M2_t13LATAM$Region, sum)%>% sort(decreasing = TRUE)
#presenting table
knitr::kable(totalsales_latam,align = "c",
format = "markdown",
caption = "Total Sales per Region in LATAM",
table.attr="style='width:20%;'")
| x | |
|---|---|
| Central America | 924226.2 |
| South America | 457623.3 |
| Caribbean | 196775.2 |
#Using dot plot
par(mai=c(1,1.5,0.2,0.2))
par(mfrow=c(1,1))
barplot(totalsales_latam,
main="Total Sales per Region in LATAM",
col=brewer.pal(10,"BrBG"),width=2.5,
ylab="Frequency",las=1,cex.names=0.8,
cex.axes=0.8,space=1,xlab="Region",ylim=c(0,1000000),
horiz = FALSE)
Central America has the highest total sales of
9.2422616\times 10^{5}
2.From the above dot plot we can say that Caribbean has the
highest total sales of 1.9677522\times 10^{5}
TASK 14
Presenting Boxplots for Profits Per Region in LATAM
boxplot(M2_t13LATAM$Profits~M2_t13LATAM$Region,
horizontal=FALSE,col=brewer.pal(3,"Set3"),
main="Boxplot for Profits in LATAM",
xlab="Region",ylab="Profits",las=1)
Presenting table with cumulative frequency and probabilities
#Creating dataframe with calculated columns
M2Data_t15<- M2Data2023$Product_SubCategory %>% table() %>%
sort(decreasing = TRUE) %>%
as.data.frame()%>% rename(Frequency=Freq)%>%
mutate(Cumulative_Frequency=cumsum(Frequency),
Probability=Frequency/nrow(M2Data2023),
Cumulative_Probability=cumsum(Probability))
#assign column names
colnames(M2Data_t15)<-c("SubCategory","Frequency","Cumulative Frequency","Probability","Cumulative Probability")
#display table
knitr::kable(M2Data_t15,align = "c",
format = "markdown",
caption = "Table 15: cumulative frequency and Probabilities for Product Sub-Categories",
table.attr="style='width:20%;'") %>%
kable_classic(bootstrap_options= "striped",
full_width=TRUE,
position="center",
font_size=12)
| SubCategory | Frequency | Cumulative Frequency | Probability | Cumulative Probability |
|---|---|---|---|---|
| Phones | 179 | 179 | 0.179 | 0.179 |
| Bookcases | 130 | 309 | 0.130 | 0.309 |
| Copiers | 126 | 435 | 0.126 | 0.435 |
| Appliances | 125 | 560 | 0.125 | 0.560 |
| Chairs | 95 | 655 | 0.095 | 0.655 |
| Tables | 87 | 742 | 0.087 | 0.742 |
| Machines | 52 | 794 | 0.052 | 0.794 |
| Storage | 45 | 839 | 0.045 | 0.839 |
| Accessories | 38 | 877 | 0.038 | 0.877 |
| Binders | 38 | 915 | 0.038 | 0.915 |
| Paper | 32 | 947 | 0.032 | 0.947 |
| Art | 18 | 965 | 0.018 | 0.965 |
| Furnishings | 14 | 979 | 0.014 | 0.979 |
| Supplies | 7 | 986 | 0.007 | 0.986 |
| Fasteners | 6 | 992 | 0.006 | 0.992 |
| Labels | 6 | 998 | 0.006 | 0.998 |
| Envelopes | 2 | 1000 | 0.002 | 1.000 |
Plotting Probability graphs
M2Data_t15$Newcolumn1=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Frequency),sep = "\n")
M2Data_t15$Newcolumn2=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Probability),sep = "\n")
par(mai=c(0.5,1.2,0.5,0.1))
par(mfrow=c(2,2))
##Pie chart
pie(M2Data_t15$Frequency,labels=,M2Data_t15$Newcolumn1,
radius=0.8,cex=0.8,
main="Frequency distribution ",
col=brewer.pal(10,"BrBG"))
barplot(M2Data_t15$`Cumulative Frequency`~M2Data_t15$SubCategory,
main="Cumulative Frequency",
col=brewer.pal(10,"PRGn"),width=1,
xlab="Frequency",las=1,cex.names=0.8,
cex.axes=0.5,space=1,ylab="",
horiz = TRUE)
pie(M2Data_t15$Probability,labels=,M2Data_t15$Newcolumn2,
radius=0.8,
main=" Probability distribution ",cex=0.8,
col=brewer.pal(10,"RdBu"))
barplot(M2Data_t15$`Cumulative Probability`~M2Data_t15$SubCategory,
main="Cumulative Probability",
col=brewer.pal(10,"PiYG"),width=1,
xlab="Frequency",las=1,cex.names=0.8,
cex.axes=0.5,space=1,ylab="",
horiz = TRUE)
Free graph on subset of Data
#Filtering Data on Market
M2_t13Europe<-filter(M2Data2023,M2Data2023$Market=="Europe")
#Creating data frame with sum of sales per Product category
totalsales_europe=tapply(M2_t13Europe$Sales,M2_t13Europe$Product_Category, sum)%>% sort(decreasing = TRUE)%>% as.data.frame()
#Assign column names
colnames(totalsales_europe)<-c("TotalSales")
#Adding Percentage column
totalsales_europe$Percentage=round((totalsales_europe$TotalSales/sum(totalsales_europe$TotalSales))*100,2)
#Adding new columns for labels
totalsales_europe$product=rownames(totalsales_europe)
totalsales_europe$labels=paste(totalsales_europe$product,totalsales_europe$Percentage,sep="\n")
#print(totalsales_europe)
#Creating Plots
par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,2))
hist(M2_t13Europe$Sales,breaks=50,
col="red",xlab="Sales",
main="Histogram for Sales in Europe",las=1,xlim=c(0,100000),ylim=c(0,50))
boxplot(M2_t13Europe$Sales~M2_t13Europe$Region,
horizontal=TRUE,col=brewer.pal(5,"Set3"),
main="Boxplot for Sales in Europe",ylab="",las=1,xlab="Sales")
pie(totalsales_europe$Percentage,labels = totalsales_europe$labels,
radius=0.8,
main=" % distribution of Sales by products ",cex=0.8,
col=brewer.pal(10,"RdBu"))
1.Through this report, we learnt many
ways of representing data visually like histogram , barplot, piechart
and Boxplot
2.We learnt the use of dplyr package in creating
summaries of data based on groups like mean sales , total sales.
3.
We learnt to create filtered dataset based on given condition.
4.From the above report it is observed that Asia Pacific
has highest no of records 365and Africa has
lowest no of records 54.
5.It is observed from the
above report that Technology has highest no of records and
Office Supplies has lowest no of records in Africa.
https://rpubs.com/Dee_Chiluiza/barplots https://rpubs.com/Dee_Chiluiza/995745 https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
https://www.marsja.se/how-to-add-a-column-to-dataframe-in-r-with-tibble-dplyr/
https://www.r-bloggers.com/2022/08/r-program-to-concatenate-two-strings/
https://www.javaassignmenthelp.com/blog/applications-of-r/
https://digitaschools.com/data-types-in-r-tutorial/