ALY6000-20312 INTRODUCTION TO ANALYTICS
NORTHEASTERN UNIVERSITY
SANA YASIN
Date :18 February, 2023
Project 2 Report
Professor: Dr. Dee Chiluiza, PHD

1.INTRODUCTION

1.Desriptive Statistics and Inferential statistics are 2 main pillar in field of Statistics . While Descriptive statistics present summary information like central tendency , skweness and dispersion about a sample data , inferential statistics is used to make inferences or prediction about the larger dataset using a sample dataset.

Unlike Inferential statistics,descriptive statistics is always accuate and precise.

For Example , to describe the SAT score of 20 students out of 100 in a class we use descriptive statistics like mean , median ,mode ,varianve ,standard deviations but to make inferences about the general overall score of 100 students we use hyposthesis test which is inferential statistics .

2.Data visualisation is important aspect of data analytics as it gives the pictorial representation about the shape , distribution and outliers in the dataset.

3.R is majorly used by statisticians and data scientists to make predictions and analyze trends in data. It has many application in several domain like Finanace ,healthcare . For example , Bank Of America uses R to report on finances and gain insights about monetory losses.

2.ANALYSIS SECTION

In this section we are going to perform several data analysis tasks on our dataset.

#Libraries used
library(tidyverse)
library(readxl)
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(kableExtra)
#Datasets used 
M2Data2023 <- read_excel("DataSets/M2Project_Data-1.xlsx")
TASK 1

Presenting the First and last 5 records of our dataset.

M2head<- head(M2Data2023,5)
M2tail<-tail(M2Data2023,5)
M2head_tail<- rbind(M2head,M2tail)
knitr::kable(M2head_tail,
               align = "c",
               format = "markdown",
               caption = "First and Last 5 rows",
               table.attr="style='width:20%;'")
First and Last 5 rows
Region Market Company_Segment Product_Category Product_SubCategory Price Quantity Sales Profits ShippingCost
Central US USCA Consumer Technology Phones 221.98 2 443.96 62.15 40.77
Oceania Asia Pacific Corporate Furniture Chairs 3709.40 9 33384.60 -288.77 923.63
Oceania Asia Pacific Consumer Technology Phones 5175.17 9 46576.53 919.97 915.49
Western Europe Europe Home Office Technology Phones 2892.51 5 14462.55 -96.54 910.16
Western Africa Africa Consumer Technology Copiers 2832.96 8 22663.68 311.52 903.04
Eastern Asia Asia Pacific Consumer Furniture Tables 2614.69 7 18302.83 -821.96 203.26
Western US USCA Corporate Office Supplies Appliances 69.48 1 69.48 20.84 12.04
Oceania Asia Pacific Consumer Technology Copiers 636.78 2 1273.56 286.50 203.20
South America LATAM Corporate Furniture Bookcases 2751.20 10 27512.00 110.00 203.13
Southeastern Asia Asia Pacific Corporate Technology Phones 1587.00 3 4761.00 -76.56 203.08
TASK 2

Calculating the number of records for each market type in dataset

#creating table
M2Market_table<-M2Data2023$Market %>%
  table() %>% 
  sort(decreasing = TRUE) %>% as.data.frame()
##Assign column names
colnames(M2Market_table) <- c("Markets","Freq")
##display Table
knitr::kable(M2Market_table,align = "c",
               format = "markdown",
               caption = "Frequency ditribution of Markets",
               table.attr="style='width:20%;'")
Frequency ditribution of Markets
Markets Freq
Asia Pacific 365
Europe 248
USCA 200
LATAM 133
Africa 54
1.It can be observed from the above table that Asia Pacific has highest no of records 365

1.It can be observed from the above table that Africa has lowest no of records 54

TASK 3

Creating the BarPlot for above obtained Table


attach(M2Market_table)
barplot(Freq~Markets,data = M2Market_table,
        col=brewer.pal(5,"Set3"),
        main="Freq distribution of Markets",
        ylab="Freq",ylim=c(0,400))

1.It can be observed from the above table that Asia Pacific has highest no of records.

1.It can be observed from the above table that Africa has lowest no of records .

TASK 4

Creating a Piechart to present % distribution of product category in Africa Market


##filtering data
M2Data_Africa= filter(M2Data2023,M2Data2023$Market=="Africa")
#Creating Table
M2Data_Africatble<-M2Data_Africa$Product_Category %>% table() %>%
  sort(decreasing = TRUE) %>% 
  as.data.frame() 
#Assign columns
colnames(M2Data_Africatble)<- c("Category","Freq")
##Adding a new column for Percentages 
M2Data_Africatble<-M2Data_Africatble %>% 
  mutate(perc =round((M2Data_Africatble$Freq/sum(M2Data_Africatble$Freq))*100,2))

#Display the table 
attach(M2Data_Africatble)
knitr::kable(M2Data_Africatble,align = "c",
               format = "markdown",
               caption = "percentage ditribution of products in Africa",
               table.attr="style='width:20%;'")
percentage ditribution of products in Africa
Category Freq perc
Technology 30 55.56
Furniture 13 24.07
Office Supplies 11 20.37
#Adding new column for lables:-
M2Data_Africatble$lables=paste(M2Data_Africatble$Category,M2Data_Africatble$perc,sep="\n")
##Piechart
par(mai=c(1,2,2,2))
pie(M2Data_Africatble$perc,labels=M2Data_Africatble$lables,
    radius=1,
    main="Pie Chart showing Percentage distribution of product category in Africa",
    col=c("pink","light green","gray"))

1.It can be observed from the following PieChart that Technology has highest no of records of 55.56 %

2.It can be observed from the following PieChart that Office Supplies has lowest no of records of 20.37 %

TASK 5

Calculating frequency distribution of product Sub-category in Africa Market


##Creating Table
M2Data_Africatble2<-M2Data_Africa$Product_SubCategory %>% table() %>% sort(decreasing = TRUE) %>% as.data.frame()
##assign column names
colnames(M2Data_Africatble2)<- c("SubCategory","Freq")
##creating barplot
attach(M2Data_Africatble2)
barplot(Freq~SubCategory,data=M2Data_Africatble2,
          main="Distribution of Sub-Category in Africa")

1.It can be observed from the following Barplot that Phones has highest no of records of 13

2.It can be observed from the following Barplot that Storage has lowest no of records of 3

TASK 6

Creating a presentable Barplot to present distribution of product Sub-category in Africa Market


par(mai=c(1,1,0.2,0.1))
par(mfrow=c(1,1))
barplot(Freq~SubCategory,data=M2Data_Africatble2,
          main="Distribution of Sub-Category in Africa",
          col=brewer.pal(10,"Dark2"),width=2.5,
          ylab="Frequency",ylim=c(0,15),las=1,cex.names=0.6,
          cex.axes=0.5,space=1,xlab="Sub-Category",
        horiz = FALSE)

TASK 7

Calculating mean Sales Per product Sub-category in Africa Market

#calculating mean sales
meansales=tapply(M2Data_Africa$Sales,M2Data_Africa$Product_SubCategory , mean)%>% sort(decreasing = TRUE)%>% round(digits=2)

meansales_df<-as.data.frame(meansales)
colnames(meansales_df)<-c("Mean_Sales")
#presenting table
knitr::kable(meansales_df,align = "c",
               format = "markdown",
               caption = "mean sales per sub category in Africa",
               table.attr="style='width:20%;'")
mean sales per sub category in Africa
Mean_Sales
Copiers 26338.29
Storage 21289.20
Chairs 19306.76
Phones 15001.70
Tables 14738.97
Bookcases 10441.84
Appliances 8601.98
Machines 6991.88
Accessories 6478.98
#Using dot plot
dotchart(meansales,
         main="mean sales per sub category in Africa",
         xlab="Mean Sales",bg=brewer.pal(10,"Dark2"),
         color = brewer.pal(10,"Dark2"),xlim=c(0,28000))

1.From the above dot plot we can say that Copiers has the highest average sales of 2.633829\times 10^{4}

2.From the above dot plot we can say that Accessories has the highest average sales of 6478.98

TASK 8

Calculating Total Sales Per Region in Africa Market

#Calculating total sales Per region
salesperregion=tapply(M2Data_Africa$Sales,M2Data_Africa$Region , sum)%>% sort(decreasing = TRUE)

knitr::kable(salesperregion,align = "c",
               format = "markdown",
               caption = "Total sales per Region in Africa",
               table.attr="style='width:20%;'")
Total sales per Region in Africa
x
Central Africa 205523.8
North Africa 178792.3
Southern Africa 161749.4
Western Africa 116827.0
Eastern Africa 96575.4
par(mai=c(1,1.5,0.2,0.2))
barplot(salesperregion,
          main="Distribution of Sales per Region in Africa",
          col=brewer.pal(10,"Set2"),width=2.5,
          xlab="Frequency",xlim=c(0,300000),las=1,cex.names=0.8,
          cex.axes=0.8,space=1,ylab="",
        horiz = TRUE)

1.From the above dot plot we can say that Central Africa has the highest total sales of 2.0552376\times 10^{5}

2.From the above dot plot we can say that Eastern Africa has the highest total sales of 9.65754\times 10^{4}

TASK 9

Calculating Mean shipping Cost per Region in Africa Market

#Calculating mean 
meanshpcost=tapply(M2Data_Africa$ShippingCost,M2Data_Africa$Product_SubCategory, mean)%>% sort(decreasing = TRUE)

#presenting table
knitr::kable(meanshpcost,align = "c",
               format = "markdown",
               caption = "mean shpping cost per sub category in Africa",
               table.attr="style='width:20%;'")
mean shpping cost per sub category in Africa
x
Phones 402.2385
Copiers 398.3986
Appliances 397.1350
Chairs 353.0800
Storage 346.5200
Bookcases 284.4883
Tables 278.2225
Accessories 268.2300
Machines 261.7283
#Using dot plot
par(mai=c(1,1.5,0.2,0.2))
par(mfrow=c(1,1))
barplot(meanshpcost,
          main="Mean shipping cost per sub category in Africa",
          col=brewer.pal(10,"Spectral"),width=2.5,
          xlab="Frequency",xlim=c(0,500),las=1,cex.names=0.8,
          cex.axes=0.8,space=1,ylab="",
        horiz = TRUE)

From the above dot plot we can say that Phones has the highest total sales of 402.24

From the above dot plot we can say that Machines has the highest total sales of 261.73

TASK 11

Creating Boxplot and Histogram for profit

par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,1))
hist(M2Data2023$Profits,breaks=50,
     col="pink",xlab="Profits",
     main="Histogram for profit",las=1)
boxplot(M2Data2023$Profits,
        horizontal=TRUE,col="light blue",xlab="Profits",
        main="Boxplot for Profits")

TASK 12

Creating a histogram and Boxplot for Profits in Latin American market

##Filtering Data
M2_t13LATAM<-filter(M2Data2023,M2Data2023$Market=="LATAM")
par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,1))

hist(M2_t13LATAM$Profits,breaks=50,
     col="red",xlab="Profits",
     main="Histogram for profit in LATAM",las=1,xlim=c(-2000,2000),
     ylim=c(0,20))

boxplot(M2_t13LATAM$Profits,
        horizontal=TRUE,col="yellow",xlab="Profits",
        main="Boxplot for Profits in LATAM")

TASK 13

Calculating Total sales per Region in Latin America

#calculating total sales per Region
totalsales_latam=tapply(M2_t13LATAM$Sales,M2_t13LATAM$Region, sum)%>% sort(decreasing = TRUE)

#presenting table
knitr::kable(totalsales_latam,align = "c",
               format = "markdown",
               caption = "Total Sales per Region in LATAM",
               table.attr="style='width:20%;'")
Total Sales per Region in LATAM
x
Central America 924226.2
South America 457623.3
Caribbean 196775.2
#Using dot plot
par(mai=c(1,1.5,0.2,0.2))
par(mfrow=c(1,1))
barplot(totalsales_latam,
          main="Total Sales per Region in LATAM",
          col=brewer.pal(10,"BrBG"),width=2.5,
          ylab="Frequency",las=1,cex.names=0.8,
          cex.axes=0.8,space=1,xlab="Region",ylim=c(0,1000000),
        horiz = FALSE)

1.From the above dot plot we can say that Central America has the highest total sales of 9.2422616\times 10^{5}

2.From the above dot plot we can say that Caribbean has the highest total sales of 1.9677522\times 10^{5}

TASK 14

Presenting Boxplots for Profits Per Region in LATAM

boxplot(M2_t13LATAM$Profits~M2_t13LATAM$Region,
        horizontal=FALSE,col=brewer.pal(3,"Set3"),
        main="Boxplot for Profits in LATAM",
        xlab="Region",ylab="Profits",las=1)

TASK 15

Presenting table with cumulative frequency and probabilities

#Creating dataframe with calculated columns 
M2Data_t15<- M2Data2023$Product_SubCategory %>% table() %>%
  sort(decreasing = TRUE) %>%
  as.data.frame()%>% rename(Frequency=Freq)%>%
  mutate(Cumulative_Frequency=cumsum(Frequency),
         Probability=Frequency/nrow(M2Data2023),
         Cumulative_Probability=cumsum(Probability))
#assign column names
colnames(M2Data_t15)<-c("SubCategory","Frequency","Cumulative Frequency","Probability","Cumulative Probability")
#display table
knitr::kable(M2Data_t15,align = "c",
               format = "markdown",
               caption = "Table 15: cumulative frequency and Probabilities for Product Sub-Categories",
               table.attr="style='width:20%;'") %>%
  kable_classic(bootstrap_options= "striped",
                full_width=TRUE,
                position="center",
                font_size=12)
Table 15: cumulative frequency and Probabilities for Product Sub-Categories
SubCategory Frequency Cumulative Frequency Probability Cumulative Probability
Phones 179 179 0.179 0.179
Bookcases 130 309 0.130 0.309
Copiers 126 435 0.126 0.435
Appliances 125 560 0.125 0.560
Chairs 95 655 0.095 0.655
Tables 87 742 0.087 0.742
Machines 52 794 0.052 0.794
Storage 45 839 0.045 0.839
Accessories 38 877 0.038 0.877
Binders 38 915 0.038 0.915
Paper 32 947 0.032 0.947
Art 18 965 0.018 0.965
Furnishings 14 979 0.014 0.979
Supplies 7 986 0.007 0.986
Fasteners 6 992 0.006 0.992
Labels 6 998 0.006 0.998
Envelopes 2 1000 0.002 1.000
TASK 16

Plotting Probability graphs

M2Data_t15$Newcolumn1=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Frequency),sep = "\n")
M2Data_t15$Newcolumn2=paste(M2Data_t15$SubCategory,as.character(M2Data_t15$Probability),sep = "\n")

par(mai=c(0.5,1.2,0.5,0.1))
par(mfrow=c(2,2))
##Pie chart
pie(M2Data_t15$Frequency,labels=,M2Data_t15$Newcolumn1,
    radius=0.8,cex=0.8,
    main="Frequency distribution ",
    col=brewer.pal(10,"BrBG"))

barplot(M2Data_t15$`Cumulative Frequency`~M2Data_t15$SubCategory,
          main="Cumulative Frequency",
          col=brewer.pal(10,"PRGn"),width=1,
          xlab="Frequency",las=1,cex.names=0.8,
          cex.axes=0.5,space=1,ylab="",
        horiz = TRUE)

pie(M2Data_t15$Probability,labels=,M2Data_t15$Newcolumn2,
    radius=0.8,
    main=" Probability distribution ",cex=0.8,
    col=brewer.pal(10,"RdBu"))

barplot(M2Data_t15$`Cumulative Probability`~M2Data_t15$SubCategory,
          main="Cumulative Probability",
          col=brewer.pal(10,"PiYG"),width=1,
          xlab="Frequency",las=1,cex.names=0.8,
          cex.axes=0.5,space=1,ylab="",
        horiz = TRUE)

TASK 17

Free graph on subset of Data

#Filtering Data on Market
M2_t13Europe<-filter(M2Data2023,M2Data2023$Market=="Europe")
#Creating data frame with sum of sales per Product category
totalsales_europe=tapply(M2_t13Europe$Sales,M2_t13Europe$Product_Category, sum)%>% sort(decreasing = TRUE)%>% as.data.frame()
#Assign column names
colnames(totalsales_europe)<-c("TotalSales")
#Adding Percentage column
totalsales_europe$Percentage=round((totalsales_europe$TotalSales/sum(totalsales_europe$TotalSales))*100,2)
#Adding new columns for labels
totalsales_europe$product=rownames(totalsales_europe)
totalsales_europe$labels=paste(totalsales_europe$product,totalsales_europe$Percentage,sep="\n")
#print(totalsales_europe)
#Creating Plots 
par(mai=c(1,1,0.2,0.4))
par(mfrow=c(2,2))
hist(M2_t13Europe$Sales,breaks=50,
     col="red",xlab="Sales",
     main="Histogram for Sales in Europe",las=1,xlim=c(0,100000),ylim=c(0,50))

boxplot(M2_t13Europe$Sales~M2_t13Europe$Region,
        horizontal=TRUE,col=brewer.pal(5,"Set3"),
        main="Boxplot for Sales in Europe",ylab="",las=1,xlab="Sales")

pie(totalsales_europe$Percentage,labels = totalsales_europe$labels,
    radius=0.8,
    main=" % distribution of Sales by products ",cex=0.8,
    col=brewer.pal(10,"RdBu"))

Conclusions:-

1.Through this report, we learnt many ways of representing data visually like histogram , barplot, piechart and Boxplot
2.We learnt the use of dplyr package in creating summaries of data based on groups like mean sales , total sales.
3. We learnt to create filtered dataset based on given condition.
4.From the above report it is observed that Asia Pacific has highest no of records 365and Africa has lowest no of records 54.
5.It is observed from the above report that Technology has highest no of records and Office Supplies has lowest no of records in Africa.

Bibliography:-

https://rpubs.com/Dee_Chiluiza/barplots https://rpubs.com/Dee_Chiluiza/995745 https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/ https://www.marsja.se/how-to-add-a-column-to-dataframe-in-r-with-tibble-dplyr/ https://www.r-bloggers.com/2022/08/r-program-to-concatenate-two-strings/ https://www.javaassignmenthelp.com/blog/applications-of-r/ https://digitaschools.com/data-types-in-r-tutorial/