YASIN_ALY6000Project2

ALY6000-20312 INTRODUCTION TO ANALYTICS
NORTHEASTERN UNIVERSITY
SANA YASIN
Date :18 February, 2023
Project 2 Report
Professor: Dr. Dee Chiluiza, PHD

1.INTRODUCTION

1.Desriptive Statistics and Inferential statistics are 2 main pillar in field of Statistics . While Descriptive statistics present summary information like central tendency , skweness and dispersion about a sample data , inferential statistics is used to make inferences or prediction about the larger dataset using a sample dataset.

Unlike Inferential statistics,descriptive statistics is always accuate and precise.

For Example , to describe the SAT score of 20 students out of 100 in a class we use descriptive statistics like mean , median ,mode ,varianve ,standard deviations but to make inferences about the general overall score of 100 students we use hyposthesis test which is inferential statistics .

2.Data visualisation is important aspect of data analytics as it gives the pictorial representation about the shape , distribution and outliers in the dataset.

3.R is majorly used by statisticians and data scientists to make predictions and analyze trends in data. It has many application in several domain like Finanace ,healthcare . For example , Bank Of America uses R to report on finances and gain insights about monetory losses.
2.ANALYSIS SECTION

In this section we are going to perform several data analysis tasks on our dataset.

#Libraries used library(tidyverse) library(readxl) library(magrittr) library(dplyr) library(RColorBrewer) library(kableExtra) #Datasets used M2Data2023 <- read_excel("DataSets/M2Project_Data-1.xlsx")
TASK 1

Presenting the First and last 5 records of our dataset.

M2head<- head(M2Data2023,5) M2tail<-tail(M2Data2023,5) M2head_tail<- rbind(M2head,M2tail) knitr::kable(M2head_tail, align = "c", format = "markdown", caption = "First and Last 5 rows", table.attr="style='width:20%;'")

First and Last 5 rows

Region Market Company_Segment Product_Category Product_SubCategory Price Quantity Sales Profits ShippingCost

Central US USCA Consumer Technology Phones 221.98 2 443.96 62.15 40.77

Oceania Asia Pacific Corporate Furniture Chairs 3709.40 9 33384.60 -288.77 923.63

Oceania Asia Pacific Consumer Technology Phones 5175.17 9 46576.53 919.97 915.49

Western Europe Europe Home Office Technology Phones 2892.51 5 14462.55 -96.54 910.16

Western Africa Africa Consumer Technology Copiers 2832.96 8 22663.68 311.52 903.04

Eastern Asia Asia Pacific Consumer Furniture Tables 2614.69 7 18302.83 -821.96 203.26

Western US USCA Corporate Office Supplies Appliances 69.48 1 69.48 20.84 12.04

Oceania Asia Pacific Consumer Technology Copiers 636.78 2 1273.56 286.50 203.20

South America LATAM Corporate Furniture Bookcases 2751.20 10 27512.00 110.00 203.13

Southeastern Asia Asia Pacific Corporate Technology Phones 1587.00 3 4761.00 -76.56 203.08

TASK 2

Calculating the number of records for each market type in dataset

#creating table M2Market_table<-M2Data2023$Market %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame() ##Assign column names colnames(M2Market_table) <- c("Markets","Freq") ##display Table knitr::kable(M2Market_table,align = "c", format = "markdown", caption = "Frequency ditribution of Markets", table.attr="style='width:20%;'")

Frequency ditribution of Markets

Markets Freq

Asia Pacific 365

Europe 248

USCA 200

LATAM 133

Africa 54

1.It can be observed from the above table that Asia Pacific has highest no of records 365

1.It can be observed from the above table that Africa has lowest no of records 54

TASK 3

Creating the BarPlot for above obtained Table

attach(M2Market_table) barplot(Freq~Markets,data = M2Market_table, col=brewer.pal(5,"Set3"), main="Freq distribution of Markets", ylab="Freq",ylim=c(0,400))

1.It can be observed from the above table that Asia Pacific has highest no of records.

1.It can be observed from the above table that Africa has lowest no of records .

TASK 4

Creating a Piechart to present % distribution of product category in Africa Market

##filtering data M2Data_Africa= filter(M2Data2023,M2Data2023$Market=="Africa") #Creating Table M2Data_Africatble<-M2Data_Africa$Product_Category %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame() #Assign columns colnames(M2Data_Africatble)<- c("Category","Freq") ##Adding a new column for Percentages M2Data_Africatble<-M2Data_Africatble %>% mutate(perc =round((M2Data_Africatble$Freq/sum(M2Data_Africatble$Freq))*100,2)) #Display the table attach(M2Data_Africatble) knitr::kable(M2Data_Africatble,align = "c", format = "markdown", caption = "percentage ditribution of products in Africa", table.attr="style='width:20%;'")

percentage ditribution of products in Africa

Category Freq perc

Technology 30 55.56

Furniture 13 24.07

Office Supplies 11 20.37

#Adding new column for lables:- M2Data_Africatble$lables=paste(M2Data_Africatble$Category,M2Data_Africatble$perc,sep="\n") ##Piechart par(mai=c(1,2,2,2)) pie(M2Data_Africatble$perc,labels=M2Data_Africatble$lables, radius=1, main="Pie Chart showing Percentage distribution of product category in Africa", col=c("pink","light green","gray"))

1.It can be observed from the following PieChart that Technology has highest no of records of 55.56 %

2.It can be observed from the following PieChart that Office Supplies has lowest no of records of 20.37 %

TASK 5

Calculating frequency distribution of product Sub-category in Africa Market

##Creating Table M2Data_Africatble2<-M2Data_Africa$Product_SubCategory %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame() ##assign column names colnames(M2Data_Africatble2)<- c("SubCategory","Freq") ##creating barplot attach(M2Data_Africatble2) barplot(Freq~SubCategory,data=M2Data_Africatble2, main="Distribution of Sub-Category in Africa")

1.It can be observed from the following Barplot that Phones has highest no of records of 13

2.It can be observed from the following Barplot that Storage has lowest no of records of 3

TASK 6

Creating a presentable Barplot to present distribution of product Sub-category in Africa Market

par(mai=c(1,1,0.2,0.1)) par(mfrow=c(1,1)) barplot(Freq~SubCategory,data=M2Data_Africatble2, main="Distribution of Sub-Category in Africa", col=brewer.pal(10,"Dark2"),width=2.5, ylab="Frequency",ylim=c(0,15),las=1,cex.names=0.6, cex.axes=0.5,space=1,xlab="Sub-Category", horiz = FALSE)

TASK 7

Calculating mean Sales Per product Sub-category in Africa Market

#calculating mean sales meansales=tapply(M2Data_Africa$Sales,M2Data_Africa$Product_SubCategory , mean)%>% sort(decreasing = TRUE)%>% round(digits=2) meansales_df<-as.data.frame(meansales) colnames(meansales_df)<-c("Mean_Sales") #presenting table knitr::kable(meansales_df,align = "c", format = "markdown", caption = "mean sales per sub category in Africa", table.attr="style='width:20%;'")

mean sales per sub category in Africa

Mean_Sales

Copiers 26338.29

Storage 21289.20

Chairs 19306.76

Phones 15001.70

Tables 14738.97

Bookcases 10441.84

Appliances 8601.98

Machines 6991.88

Accessories 6478.98

#Using dot plot dotchart(meansales, main="mean sales per sub category in Africa", xlab="Mean Sales",bg=brewer.pal(10,"Dark2"), color = brewer.pal(10,"Dark2"),xlim=c(0,28000))

1.From the above dot plot we can say that Copiers has the highest average sales of 2.633829\times 10^{4}

2.From the above dot plot we can say that Accessories has the highest average sales of 6478.98

TASK 8

Calculating Total Sales Per Region in Africa Market

#Calculating total sales Per region salesperregion=tapply(M2Data_Africa$Sales,M2Data_Africa$Region , sum)%>% sort(decreasing = TRUE) knitr::kable(salesperregion,align = "c", format = "markdown", caption = "Total sales per Region in Africa", table.attr="style='width:20%;'")

Total sales per Region in Africa

x

Central Africa 205523.8

North Africa 178792.3

Southern Africa 161749.4

Western Africa 116827.0

Eastern Africa 96575.4

par(mai=c(1,1.5,0.2,0.2)) barplot(salesperregion, main="Distribution of Sales per Region in Africa", col=brewer.pal(10,"Set2"),width=2.5, xlab="Frequency",xlim=c(0,300000),las=1,cex.names=0.8, cex.axes=0.8,space=1,ylab="", horiz = TRUE)

1.From the above dot plot we can say that Central Africa has the highest total sales of 2.0552376\times 10^{5}

2.From the above dot plot we can say that Eastern Africa has the highest total sales of 9.65754\times 10^{4}

TASK 9

Calculating Mean shipping Cost per Region in Africa Market

#Calculating mean meanshpcost=tapply(M2Data_Africa$ShippingCost,M2Data_Africa$Product_SubCategory, mean)%>% sort(decreasing = TRUE) #presenting table knitr::kable(meanshpcost,align = "c", format = "markdown", caption = "mean shpping cost per sub category in Africa", table.attr="style='width:20%;'")

mean shpping cost per sub category in Africa

x

Phones 402.2385

Copiers 398.3986

Appliances 397.1350

Chairs 353.0800

Storage 346.5200

Bookcases 284.4883

Tables 278.2225

Accessories 268.2300

Machines 261.7283

#Using dot plot par(mai=c(1,1.5,0.2,0.2)) par(mfrow=c(1,1)) barplot(meanshpcost, main="Mean shipping cost per sub category in Africa", col=brewer.pal(10,"Spectral"),width=2.5, xlab="Frequency",xlim=c(0,500),las=1,cex.names=0.8, cex.axes=0.8,space=1,ylab="", horiz = TRUE)

From the above dot plot we can say that Phones has the highest total sales of 402.24

From the above dot plot we can say that Machines has the highest total sales of 261.73

TASK 11

Creating Boxplot and Histogram for profit

par(mai=c(1,1,0.2,0.4)) par(mfrow=c(2,1)) hist(M2Data2023$Profits,breaks=50, col="pink",xlab="Profits", main="Histogram for profit",las=1) boxplot(M2Data2023$Profits, horizontal=TRUE,col="light blue",xlab="Profits", main="Boxplot for Profits")

TASK 12

Creating a histogram and Boxplot for Profits in Latin American market

##Filtering Data M2_t13LATAM<-filter(M2Data2023,M2Data2023$Market=="LATAM") par(mai=c(1,1,0.2,0.4)) par(mfrow=c(2,1)) hist(M2_t13LATAM$Profits,breaks=50, col="red",xlab="Profits", main="Histogram for profit in LATAM",las=1,xlim=c(-2000,2000), ylim=c(0,20)) boxplot(M2_t13LATAM$Profits, horizontal=TRUE,col="yellow",xlab="Profits", main="Boxplot for Profits in LATAM")

TASK 13

Calculating Total sales per Region in Latin America

#calculating total sales per Region totalsales_latam=tapply(M2_t13LATAM$Sales,M2_t13LATAM$Region, sum)%>% sort(decreasing = TRUE) #presenting table knitr::kable(totalsales_latam,align = "c", format = "markdown", caption = "Total Sales per Region in LATAM", table.attr="style='width:20%;'")

Total Sales per Region in LATAM

x

Central America 924226.2

South America 457623.3

Caribbean 196775.2

#Using dot plot par(mai=c(1,1.5,0.2,0.2)) par(mfrow=c(1,1)) barplot(totalsales_latam, main="Total Sales per Region in LATAM", col=brewer.pal(10,"BrBG"),width=2.5, ylab="Frequency",las=1,cex.names=0.8, cex.axes=0.8,space=1,xlab="Region",ylim=c(0,1000000), horiz = FALSE)

1.From the above dot plot we can say that Central America has the highest total sales of 9.2422616\times 10^{5}

2.From the above dot plot we can say that Caribbean has the highest total sales of 1.9677522\times 10^{5}

TASK 14

Presenting Boxplots for Profits Per Region in LATAM

boxplot(M2_t13LATAM$Profits~M2_t13LATAM$Region, horizontal=FALSE,col=brewer.pal(3,"Set3"), main="Boxplot for Profits in LATAM", xlab="Region",ylab="Profits",las=1)

TASK 15

Presenting table with cumulative frequency and probabilities

#Creating dataframe with calculated columns M2Data_t15<- M2Data2023$Product_SubCategory %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame()%>% rename(Frequency=Freq)%>% mutate(Cumulative_Frequency=cumsum(Frequency), Probability=Frequency/nrow(M2Data2023), Cumulative_Probability=cumsum(Probability)) #assign column names colnames(M2Data_t15)<-c("SubCategory","Frequency","Cumulative Frequency","Probability","Cumulative Probability") #display table knitr::kable(M2Data_t15,align = "c", format = "markdown", caption = "Table 15: cumulative frequency and Probabilities for Product Sub-Categories", table.attr="style='width:20%;'") %>% kable_classic(bootstrap_options= "striped", full_width=TRUE, position="center", font_size=12)

Table 15: cumulative frequency and Probabilities for Product Sub-Categories

SubCategory Frequency Cumulative Frequency Probability Cumulative Probability

Phones 179 179 0.179 0.179

Bookcases 130 309 0.130 0.309

Copiers 126 435 0.126 0.435

Appliances 125 560 0.125 0.560

Chairs 95 655 0.095 0.655

Tables 87 742 0.087 0.742

Machines 52 794 0.052 0.794

Storage 45 839 0.045 0.839

Accessories 38 877 0.038 0.877

Binders 38 915 0.038 0.915

Paper 32 947 0.032 0.947

Art 18 965 0.018 0.965

Furnishings 14 979 0.014 0.979

Supplies 7 986 0.007 0.986

Fasteners 6 992 0.006 0.992

Labels 6 998 0.006 0.998

Envelopes 2 1000 0.002 1.000

TASK 16

Plotting Probability graphs

M2Data_t15$Newcolumn1=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Frequency),sep = "\n") M2Data_t15$Newcolumn2=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Probability),sep = "\n") par(mai=c(0.5,1.2,0.5,0.1)) par(mfrow=c(2,2)) ##Pie chart pie(M2Data_t15$Frequency,labels=,M2Data_t15$Newcolumn1, radius=0.8,cex=0.8, main="Frequency distribution ", col=brewer.pal(10,"BrBG")) barplot(M2Data_t15$`Cumulative Frequency`~M2Data_t15$SubCategory, main="Cumulative Frequency", col=brewer.pal(10,"PRGn"),width=1, xlab="Frequency",las=1,cex.names=0.8, cex.axes=0.5,space=1,ylab="", horiz = TRUE) pie(M2Data_t15$Probability,labels=,M2Data_t15$Newcolumn2, radius=0.8, main=" Probability distribution ",cex=0.8, col=brewer.pal(10,"RdBu")) barplot(M2Data_t15$`Cumulative Probability`~M2Data_t15$SubCategory, main="Cumulative Probability", col=brewer.pal(10,"PiYG"),width=1, xlab="Frequency",las=1,cex.names=0.8, cex.axes=0.5,space=1,ylab="", horiz = TRUE)

TASK 17

Free graph on subset of Data

#Filtering Data on Market M2_t13Europe<-filter(M2Data2023,M2Data2023$Market=="Europe") #Creating data frame with sum of sales per Product category totalsales_europe=tapply(M2_t13Europe$Sales,M2_t13Europe$Product_Category, sum)%>% sort(decreasing = TRUE)%>% as.data.frame() #Assign column names colnames(totalsales_europe)<-c("TotalSales") #Adding Percentage column totalsales_europe$Percentage=round((totalsales_europe$TotalSales/sum(totalsales_europe$TotalSales))*100,2) #Adding new columns for labels totalsales_europe$product=rownames(totalsales_europe) totalsales_europe$labels=paste(totalsales_europe$product,totalsales_europe$Percentage,sep="\n") #print(totalsales_europe) #Creating Plots par(mai=c(1,1,0.2,0.4)) par(mfrow=c(2,2)) hist(M2_t13Europe$Sales,breaks=50, col="red",xlab="Sales", main="Histogram for Sales in Europe",las=1,xlim=c(0,100000),ylim=c(0,50)) boxplot(M2_t13Europe$Sales~M2_t13Europe$Region, horizontal=TRUE,col=brewer.pal(5,"Set3"), main="Boxplot for Sales in Europe",ylab="",las=1,xlab="Sales") pie(totalsales_europe$Percentage,labels = totalsales_europe$labels, radius=0.8, main=" % distribution of Sales by products ",cex=0.8, col=brewer.pal(10,"RdBu"))

Conclusions:-

1.Through this report, we learnt many ways of representing data visually like histogram , barplot, piechart and Boxplot
2.We learnt the use of dplyr package in creating summaries of data based on groups like mean sales , total sales.
3. We learnt to create filtered dataset based on given condition.
4.From the above report it is observed that Asia Pacific has highest no of records 365and Africa has lowest no of records 54.
5.It is observed from the above report that Technology has highest no of records and Office Supplies has lowest no of records in Africa.

Bibliography:-

https://rpubs.com/Dee_Chiluiza/barplots https://rpubs.com/Dee_Chiluiza/995745 https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/ https://www.marsja.se/how-to-add-a-column-to-dataframe-in-r-with-tibble-dplyr/ https://www.r-bloggers.com/2022/08/r-program-to-concatenate-two-strings/ https://www.javaassignmenthelp.com/blog/applications-of-r/ https://digitaschools.com/data-types-in-r-tutorial/

Region	Market	Company_Segment	Product_Category	Product_SubCategory	Price	Quantity	Sales	Profits	ShippingCost
Central US	USCA	Consumer	Technology	Phones	221.98	2	443.96	62.15	40.77
Oceania	Asia Pacific	Corporate	Furniture	Chairs	3709.40	9	33384.60	-288.77	923.63
Oceania	Asia Pacific	Consumer	Technology	Phones	5175.17	9	46576.53	919.97	915.49
Western Europe	Europe	Home Office	Technology	Phones	2892.51	5	14462.55	-96.54	910.16
Western Africa	Africa	Consumer	Technology	Copiers	2832.96	8	22663.68	311.52	903.04
Eastern Asia	Asia Pacific	Consumer	Furniture	Tables	2614.69	7	18302.83	-821.96	203.26
Western US	USCA	Corporate	Office Supplies	Appliances	69.48	1	69.48	20.84	12.04
Oceania	Asia Pacific	Consumer	Technology	Copiers	636.78	2	1273.56	286.50	203.20
South America	LATAM	Corporate	Furniture	Bookcases	2751.20	10	27512.00	110.00	203.13
Southeastern Asia	Asia Pacific	Corporate	Technology	Phones	1587.00	3	4761.00	-76.56	203.08

	Mean_Sales
Copiers	26338.29
Storage	21289.20
Chairs	19306.76
Phones	15001.70
Tables	14738.97
Bookcases	10441.84
Appliances	8601.98
Machines	6991.88
Accessories	6478.98

	x
Central Africa	205523.8
North Africa	178792.3
Southern Africa	161749.4
Western Africa	116827.0
Eastern Africa	96575.4

	x
Phones	402.2385
Copiers	398.3986
Appliances	397.1350
Chairs	353.0800
Storage	346.5200
Bookcases	284.4883
Tables	278.2225
Accessories	268.2300
Machines	261.7283

	x
Central America	924226.2
South America	457623.3
Caribbean	196775.2

SubCategory	Frequency	Cumulative Frequency	Probability	Cumulative Probability
Phones	179	179	0.179	0.179
Bookcases	130	309	0.130	0.309
Copiers	126	435	0.126	0.435
Appliances	125	560	0.125	0.560
Chairs	95	655	0.095	0.655
Tables	87	742	0.087	0.742
Machines	52	794	0.052	0.794
Storage	45	839	0.045	0.839
Accessories	38	877	0.038	0.877
Binders	38	915	0.038	0.915
Paper	32	947	0.032	0.947
Art	18	965	0.018	0.965
Furnishings	14	979	0.014	0.979
Supplies	7	986	0.007	0.986
Fasteners	6	992	0.006	0.992
Labels	6	998	0.006	0.998
Envelopes	2	1000	0.002	1.000