ALY6000 Introduction to Analytics
Northeastern University

Student Name : Omkar Nitin Sadekar

Date : 30 July 2021

R Project 2 Report


INTRODUCTION
1. Descriptive statistics refers to data analysis that explains, shows, or summarizes data in a comprehensible way so that patterns can emerge. Descriptive statistics are important because raw data, especially in large volumes, is difficult to interpret directly. Summarizing the data presents it in a more meaningful form that is easier to understand.
Inferential statistics are techniques that allow us to use samples to make generalizations about the populations from which the samples were drawn. It is therefore important that the sample accurately represents the population; the process of achieving this is called sampling. Inferential statistics arise from the fact that sampling naturally incurs sampling error, so a sample is not expected to represent the population perfectly.
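
The distinction can be illustrated with a minimal sketch. The order values below are made up for illustration and are not taken from M2Data. The mean and standard deviation describe the sample itself, while the confidence interval from t.test() is an inferential statement about the population the sample came from.

```r
# Hypothetical sample of 10 order values (made up for illustration)
order_values <- c(120, 95, 143, 110, 87, 132, 101, 150, 99, 125)

# Descriptive statistics: summarize the sample itself
sample_mean <- mean(order_values)
sample_sd   <- sd(order_values)

# Inferential statistics: a 95% confidence interval for the population mean,
# based on the one-sample t procedure
ci <- t.test(order_values, conf.level = 0.95)$conf.int

print(sample_mean)
print(sample_sd)
print(ci)
```

The interval is wider than a single number because it accounts for the sampling error described above.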

2. Data presentation is not only used to make a report look more visually appealing, although effective presentation does make the results more engaging to read. The main reason for extracting and presenting the relevant data is to show the reader that you can select the data most appropriate for answering your research questions and work with it graphically so that its inherent correlations and relationships stand out. A lengthy data table may technically serve the same purpose, but forcing the reader to ‘discover’ the pertinent data among a mass of numbers is a symptom of poor research.

3. Banks and other financial institutions use R for credit risk modeling and other kinds of risk analysis. Banks make extensive use of the Mortgage Haircut Model, which allows them to take over the property in the event of a loan default. Mortgage haircut modeling involves the sales price distribution, the volatility of the sales price, and the calculation of the expected shortfall.
For these purposes, R is typically used in conjunction with proprietary tools such as SAS. Banks also use R for financial reporting: data scientists can use R to analyze financial losses and make use of R’s visualization tools.


ANALYSIS


Task 1: Presenting a table with first and last 5 observations

• Here, the first and last 5 observations of the dataset are presented in a single formatted table.

head(M2Data, 5)
## # A tibble: 5 × 10
##   Region Market Company_Segment Product_Category Product_SubCate… Price Quantity
##   <chr>  <chr>  <chr>           <chr>            <chr>            <dbl>    <dbl>
## 1 Centr… USCA   Consumer        Technology       Phones            222.        2
## 2 Ocean… Asia … Corporate       Furniture        Chairs           3709.        9
## 3 Ocean… Asia … Consumer        Technology       Phones           5175.        9
## 4 Weste… Europe Home Office     Technology       Phones           2893.        5
## 5 Weste… Africa Consumer        Technology       Copiers          2833.        8
## # … with 3 more variables: Sales <dbl>, Profits <dbl>, ShippingCost <dbl>
tail(M2Data, 5)
## # A tibble: 5 × 10
##   Region         Market Company_Segment Product_Category Product_SubCate…  Price
##   <chr>          <chr>  <chr>           <chr>            <chr>             <dbl>
## 1 Eastern Asia   Asia … Consumer        Furniture        Tables           2615. 
## 2 Western US     USCA   Corporate       Office Supplies  Appliances         69.5
## 3 Oceania        Asia … Consumer        Technology       Copiers           637. 
## 4 South America  LATAM  Corporate       Furniture        Bookcases        2751. 
## 5 Southeastern … Asia … Corporate       Technology       Phones           1587  
## # … with 4 more variables: Quantity <dbl>, Sales <dbl>, Profits <dbl>,
## #   ShippingCost <dbl>
rbind(head(M2Data,5), tail(M2Data, 5))
## # A tibble: 10 × 10
##    Region        Market Company_Segment Product_Category Product_SubCate…  Price
##    <chr>         <chr>  <chr>           <chr>            <chr>             <dbl>
##  1 Central US    USCA   Consumer        Technology       Phones            222. 
##  2 Oceania       Asia … Corporate       Furniture        Chairs           3709. 
##  3 Oceania       Asia … Consumer        Technology       Phones           5175. 
##  4 Western Euro… Europe Home Office     Technology       Phones           2893. 
##  5 Western Afri… Africa Consumer        Technology       Copiers          2833. 
##  6 Eastern Asia  Asia … Consumer        Furniture        Tables           2615. 
##  7 Western US    USCA   Corporate       Office Supplies  Appliances         69.5
##  8 Oceania       Asia … Consumer        Technology       Copiers           637. 
##  9 South America LATAM  Corporate       Furniture        Bookcases        2751. 
## 10 Southeastern… Asia … Corporate       Technology       Phones           1587  
## # … with 4 more variables: Quantity <dbl>, Sales <dbl>, Profits <dbl>,
## #   ShippingCost <dbl>
knitr::kable(rbind(head(M2Data,5), tail(M2Data, 5)))
|Region            |Market       |Company_Segment |Product_Category |Product_SubCategory |   Price| Quantity|    Sales| Profits| ShippingCost|
|:-----------------|:------------|:---------------|:----------------|:-------------------|-------:|--------:|--------:|-------:|------------:|
|Central US        |USCA         |Consumer        |Technology       |Phones              |  221.98|        2|   443.96|   62.15|        40.77|
|Oceania           |Asia Pacific |Corporate       |Furniture        |Chairs              | 3709.40|        9| 33384.60| -288.77|       923.63|
|Oceania           |Asia Pacific |Consumer        |Technology       |Phones              | 5175.17|        9| 46576.53|  919.97|       915.49|
|Western Europe    |Europe       |Home Office     |Technology       |Phones              | 2892.51|        5| 14462.55|  -96.54|       910.16|
|Western Africa    |Africa       |Consumer        |Technology       |Copiers             | 2832.96|        8| 22663.68|  311.52|       903.04|
|Eastern Asia      |Asia Pacific |Consumer        |Furniture        |Tables              | 2614.69|        7| 18302.83| -821.96|       203.26|
|Western US        |USCA         |Corporate       |Office Supplies  |Appliances          |   69.48|        1|    69.48|   20.84|        12.04|
|Oceania           |Asia Pacific |Consumer        |Technology       |Copiers             |  636.78|        2|  1273.56|  286.50|       203.20|
|South America     |LATAM        |Corporate       |Furniture        |Bookcases           | 2751.20|       10| 27512.00|  110.00|       203.13|
|Southeastern Asia |Asia Pacific |Corporate       |Technology       |Phones              | 1587.00|        3|  4761.00|  -76.56|       203.08|


• Data analysts are expected to work with large datasets, and it is hard to view an entire dataset at once. The head and tail functions give us a quick glance at the beginning and end of a large dataset.


Task 2: Finding categories of Market and their frequencies

• We intend to find the categories of Market and their frequency distribution.

knitr::kable(rbind(table(M2Data$Market)))
| Africa| Asia Pacific| Europe| LATAM| USCA|
|------:|------------:|------:|-----:|----:|
|     54|          365|    248|   133|  200|


• Frequency indicates the number of occurrences of a value in the data. The table above shows how many sales occurred in each market category.
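
As a small worked example, the counts in the table above can be converted to relative frequencies. The vector below is typed in by hand from that table so the sketch runs without M2Data; the variable names are illustrative only.

```r
# Market counts copied from the frequency table above
market_counts <- c(Africa = 54, "Asia Pacific" = 365, Europe = 248,
                   LATAM = 133, USCA = 200)

# Relative frequency: each count divided by the total number of observations
market_share <- market_counts / sum(market_counts)

print(round(market_share, 3))
# Market with the most observations
print(names(which.max(market_counts)))
```

With the dataset loaded, prop.table(table(M2Data$Market)) should give the same proportions directly.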


Task 3: Plotting a Bar Graph for Market and their Frequencies

• A bar graph of the markets and their frequencies is plotted, with colors added using the RColorBrewer library. The text() function is used to display the frequency values.

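library(RColorBrewer)  # provides brewer.pal()
table1 = table(M2Data$Market)
plot2 = barplot(table1, horiz = TRUE, xlab = 'frequency', ylab = 'category', xlim = c(0,400), col = brewer.pal(6, "Accent"))
# For horizontal bars, x is the count and y is the bar midpoint; pos = 4 puts each label to the right of its bar
text(x = table1, y = plot2, labels = table1, cex = 0.8, pos = 4)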


• A bar graph is a great way of visualizing data. As seen above, the frequency of each market category is easy to read, and the different colors make the observations more visually appealing.


Task 4: Analysing product category and frequency of African market using Pie Chart

• Here, we analyse the frequency of product categories in the African market and visualize it using a pie chart.

t4Africa = dplyr::filter(M2Data, Market=="Africa")
tablet4 = table(t4Africa$Product_Category)
pie(tablet4)


• From the pie chart above, we can say that Technology products account for the largest share of observations in the African market, followed by Furniture and then Office Supplies. Pie charts are one of the simplest ways to visualize data and are well suited to showing part-to-whole relationships for categorical data.


Task 5: Analyzing product subcategory and their frequencies

• A barplot is used to visualize the product subcategories and their frequencies in the African market.

task5_table = table(t4Africa$Product_SubCategory)
# Extend the y-axis slightly so the count label above the tallest bar is not clipped
t5bar = barplot(task5_table, ylim = c(0, max(task5_table) * 1.1))
text(x = t5bar, y = task5_table, labels = task5_table, cex = 0.8, pos = 3)


• The barplot shows that Phones are the most frequently sold product subcategory in the African market. A barplot provides a systematic view of the data and its frequency values.


Task 6: Improving the barplot of subcategories and their frequencies

• Adding labels and colors and setting margins for an eye-catching visualization.

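# par() only affects plots drawn after it, so the margins are set before barplot();
# these margin values are chosen to leave room for the axis labels and the title
par(mar = c(5, 4, 3, 1))
barplot(task5_table, xlab = "Subcategory", ylab = "Frequency", col = brewer.pal(6, "Accent"),
        main = "Product Subcategory")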


• Barplots offer good scope for adding attractive features for clear visualization. With the RColorBrewer library, we can add colors to the bars in the barplot. Adding labels and setting margins is also important for viewing the observations in a systematic way.


Task 7: Finding Mean Sales per Subcategory

• In this task we find the average sales per subcategory in the African market and use a dot plot to display the observations.

Mean_Sales = tapply(t4Africa$Sales, t4Africa$Product_SubCategory, mean)
knitr::kable(Mean_Sales)
|            |         x|
|:-----------|---------:|
|Accessories |  6478.980|
|Appliances  |  8601.975|
|Bookcases   | 10441.840|
|Chairs      | 19306.760|
|Copiers     | 26338.286|
|Machines    |  6991.880|
|Phones      | 15001.698|
|Storage     | 21289.200|
|Tables      | 14738.970|
dotchart(Mean_Sales)


• Dot plots are simple statistical plots suitable for small datasets. A dot plot is convenient when we need to compare a numeric value across categories and read it precisely.
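
To make the grouping step explicit, here is a minimal sketch of how tapply() computes a mean per group. The data frame is hypothetical (made-up values, not M2Data), and the column names are chosen for illustration only.

```r
# Hypothetical sales records (values made up for illustration)
sales <- data.frame(
  subcategory = c("Phones", "Phones", "Chairs", "Chairs", "Copiers"),
  amount      = c(100, 300, 50, 150, 400),
  stringsAsFactors = FALSE
)

# tapply(values, groups, function): split amount by subcategory, then average each group
mean_by_group <- tapply(sales$amount, sales$subcategory, mean)
print(mean_by_group)

# Sorting makes the ranking easy to read in a table or dot chart
print(sort(mean_by_group, decreasing = TRUE))
```

The same pattern with sum instead of mean gives group totals.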


Task 8: Finding total sales per region in the African market

• We interpret the total sales for each region in the African market.

Total_Sales = tapply(t4Africa$Sales, t4Africa$Region, sum)
knitr::kable(Total_Sales)
|                |        x|
|:---------------|--------:|
|Central Africa  | 205523.8|
|Eastern Africa  |  96575.4|
|North Africa    | 178792.3|
|Southern Africa | 161749.4|
|Western Africa  | 116827.0|
# par() only affects plots drawn after it, so it is called before barplot()
par(mai = c(1, 0.6, 1, 1))
barplot(Total_Sales, xlab = "Region", ylab = "Sale", col = brewer.pal(6, "Accent"),
        main = "Total Regional Sales")


• The barplot shows the total sales for each region. Eastern Africa has the lowest total sales, so it should be the main focus area for increasing total sales.


Task 9: Finding average shipping cost per region in the African market

• The average shipping cost per region is analyzed for the African market.

Mean_Shipping = tapply(t4Africa$ShippingCost, t4Africa$Region, mean)
knitr::kable(Mean_Shipping)
|                |        x|
|:---------------|--------:|
|Central Africa  | 354.3857|
|Eastern Africa  | 386.9600|
|North Africa    | 326.8583|
|Southern Africa | 325.5718|
|Western Africa  | 351.1562|
barplot(Mean_Shipping, xlab="Region",ylab="Shipping Cost",  col = brewer.pal(6, "Accent"),main="Average Shipping Cost")


• The barplot shows that Eastern Africa has the highest average shipping cost, while the previous plot shows that it has the lowest total sales. A strategy to improve sales and reduce shipping costs in this region is therefore needed.
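
The comparison between the two tables can also be checked numerically. The sketch below retypes the values reported in Task 8 and Task 9 into one data frame; the data frame and column names are illustrative and not part of the original analysis.

```r
# Values copied from the Task 8 and Task 9 tables
africa <- data.frame(
  region        = c("Central Africa", "Eastern Africa", "North Africa",
                    "Southern Africa", "Western Africa"),
  total_sales   = c(205523.8, 96575.4, 178792.3, 161749.4, 116827.0),
  mean_shipping = c(354.3857, 386.9600, 326.8583, 325.5718, 351.1562),
  stringsAsFactors = FALSE
)

# Region with the lowest total sales and region with the highest mean shipping cost
lowest_sales     <- africa$region[which.min(africa$total_sales)]
highest_shipping <- africa$region[which.max(africa$mean_shipping)]

print(lowest_sales)
print(highest_shipping)
# Regions ordered from lowest to highest total sales
print(africa[order(africa$total_sales), ])
```

Both lookups return Eastern Africa, which matches the reading of the two barplots.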


Task 10: Differences between data type designations used in R

• There are several classes grouped under “numeric”; the two most common are double (for double-precision floating-point values) and integer.
• R automatically converts between numeric classes when necessary, so it makes little difference to the average user whether the value 3 is stored as an integer or as a double.
• Because most math is done in double precision, double is frequently the default storage type.
• Because integers take up less storage, we may choose to store a vector as integers if we know it will never be converted to doubles (for example, ID values or indexes).
• However, if the values are going to be used in any math that would convert them to doubles, it is probably best to store them as doubles from the start.
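
These points can be checked at the console. The following is a minimal sketch using only base R; the variable names are illustrative.

```r
x_double  <- 3      # numeric literals are stored as double by default
x_integer <- 3L     # the L suffix requests an integer

print(typeof(x_double))
print(typeof(x_integer))

# The two values compare as equal even though their storage types differ
print(x_double == x_integer)

# Arithmetic with a double, and division, return doubles
print(typeof(x_integer + 0.5))
print(typeof(x_integer / 2L))

# Integer vectors use less memory than double vectors of the same length
ids_integer <- 1:1000
ids_double  <- as.numeric(ids_integer)
print(object.size(ids_integer) < object.size(ids_double))
```

The automatic conversion in the arithmetic lines is why storing values as integers only pays off when they are used purely as identifiers or indexes.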


Task 11: Analyzing Profits

• Here, we visualize the profits using a boxplot and a histogram.

# Stack the two plots vertically; mar gives the margins in lines (bottom, left, top, right)
par(mfcol = c(2,1),
    mar = c(4,4,0.5,2))
boxplot(M2Data$Profits,
        horizontal = TRUE)
hist(M2Data$Profits,
     breaks = 50,
     main = "Histogram",
     xlab = "Profits",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,100))


• From the plots above, we can say that the median profit lies between 0 and 1000. A histogram and a boxplot together give a good picture of the distribution of a large dataset.


Task 12: Finding profits in the Latin American Market

• We examine the profits in the Latin American market using a boxplot and a histogram.

t13LATAM = dplyr::filter(M2Data, Market=="LATAM")
# Stack the two plots vertically; mar gives the margins in lines (bottom, left, top, right)
par(mfcol = c(2,1),
    mar = c(4,4,0.5,2))
LATAM_profits = t13LATAM$Profits
boxplot(LATAM_profits, main = "Boxplot",
        horizontal = TRUE)
hist(LATAM_profits,
     breaks = 50,
     main = "Histogram",
     xlab = "Profits",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,20),
     xlim = c(-2000,1500))


• In the Latin American market, most profits fall in the range of 0 to 500, though there are some outliers as well. The boxplot shows the outliers explicitly, and the histogram shows the distribution of profits.
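
The outliers that a boxplot draws as separate points can also be listed numerically. This is a minimal sketch on made-up profit values (not the LATAM data); boxplot.stats() applies the same 1.5 x IQR whisker rule that boxplot() uses by default.

```r
# Hypothetical profit values (made up for illustration)
profits <- c(-1500, -50, 20, 60, 90, 120, 150, 210, 300, 1400)

# Median and quartiles, the values a boxplot summarizes with its box
print(median(profits))
print(quantile(profits, c(0.25, 0.75)))

# boxplot.stats() returns the points that fall beyond the whiskers
outliers <- boxplot.stats(profits)$out
print(outliers)
```

With the real data, boxplot.stats(LATAM_profits)$out would list the profit values plotted as individual points.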


Task 13: Finding total sales in Latin American market

• We find the total sales for each region in the Latin American market.

Total_Sales1 = tapply(t13LATAM$Sales, t13LATAM$Region, sum)
knitr::kable(Total_Sales1)
|                |        x|
|:---------------|--------:|
|Caribbean       | 196775.2|
|Central America | 924226.2|
|South America   | 457623.3|


• The table above shows the total sales in the Latin American market for its three subregions: Caribbean, Central America and South America. Sales are lowest in the Caribbean and highest in Central America.
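
To put the three totals on a common scale, each region's share of total LATAM sales can be computed. The vector below is retyped from the table above; the variable names are illustrative.

```r
# Totals copied from the Task 13 table
latam_sales <- c(Caribbean = 196775.2, "Central America" = 924226.2,
                 "South America" = 457623.3)

# Share of each region in total LATAM sales, as a percentage
latam_share <- 100 * latam_sales / sum(latam_sales)
print(round(latam_share, 1))

# Region with the largest share
print(names(which.max(latam_share)))
```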


Task 14: Finding Regional Profits in the Latin American market

• We find the regional profits of the Latin American market and visualize them with a boxplot.

boxplot(Profits ~ Region, data = t13LATAM,
        xlab = "Regions", ylab = "Profits made", main = "Boxplot",
        horizontal = FALSE)


• Using a boxplot, we compare the profits across regions in the Latin American market. The median profit of Central America is lower than that of the other regions.


Task 15: Probability distribution table for Subcategories of products

• A table containing the frequency, cumulative frequency, probability and cumulative probability of each product subcategory is built.

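library(dplyr)       # %>%, rename(), mutate()
library(kableExtra)  # kable_classic()

# Frequency table of subcategories, extended with cumulative and relative columns
t15 = M2Data$Product_SubCategory %>%
  table() %>%
  as.data.frame() %>%
  rename(Coloumn1 = Freq) %>%
  mutate(coloumn2 = cumsum(Coloumn1),
         coloumn3 = Coloumn1/nrow(M2Data),
         coloumn4 = cumsum(coloumn3))
# Replace the temporary column names with descriptive ones
colnames(t15) <- c('Product Type','Frequency','Cum Frequency','Probability','Cum Probability')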
knitr::kable(t15,
             digits = 2,
             caption = "Probability of product Subcategory",
             format = "html",
             table.attr = "style='width:40%;'",
             align = 'c')%>%
  kable_classic(bootstrap_options = "striped",
                full_width = TRUE,
                position = "center",
                font_size = 12)
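Probability of product Subcategory

| Product Type | Frequency | Cum Frequency | Probability | Cum Probability |
|:------------:|:---------:|:-------------:|:-----------:|:---------------:|
| Accessories  |    38     |      38       |    0.04     |      0.04       |
| Appliances   |    125    |      163      |    0.12     |      0.16       |
| Art          |    18     |      181      |    0.02     |      0.18       |
| Binders      |    38     |      219      |    0.04     |      0.22       |
| Bookcases    |    130    |      349      |    0.13     |      0.35       |
| Chairs       |    95     |      444      |    0.10     |      0.44       |
| Copiers      |    126    |      570      |    0.13     |      0.57       |
| Envelopes    |     2     |      572      |    0.00     |      0.57       |
| Fasteners    |     6     |      578      |    0.01     |      0.58       |
| Furnishings  |    14     |      592      |    0.01     |      0.59       |
| Labels       |     6     |      598      |    0.01     |      0.60       |
| Machines     |    52     |      650      |    0.05     |      0.65       |
| Paper        |    32     |      682      |    0.03     |      0.68       |
| Phones       |    179    |      861      |    0.18     |      0.86       |
| Storage      |    45     |      906      |    0.04     |      0.91       |
| Supplies     |     7     |      913      |    0.01     |      0.91       |
| Tables       |    87     |     1000      |    0.09     |      1.00       |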


• The table above shows the frequency, cumulative frequency, probability and cumulative probability of each product subcategory. Phones are the most frequent subcategory, accounting for 18% of the observations.
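
The four columns are related by simple arithmetic, which the following base R sketch reproduces. The frequencies are retyped from the table above so the example runs without M2Data; the data frame and column names are illustrative.

```r
# Frequencies copied from the table above
freq <- c(Accessories = 38, Appliances = 125, Art = 18, Binders = 38,
          Bookcases = 130, Chairs = 95, Copiers = 126, Envelopes = 2,
          Fasteners = 6, Furnishings = 14, Labels = 6, Machines = 52,
          Paper = 32, Phones = 179, Storage = 45, Supplies = 7, Tables = 87)

prob_table <- data.frame(
  frequency       = freq,                     # count per subcategory
  cum_frequency   = cumsum(freq),             # running total of counts
  probability     = freq / sum(freq),         # count divided by total observations
  cum_probability = cumsum(freq) / sum(freq)  # running total of probabilities
)
print(round(prob_table, 2))
```

The last cumulative frequency equals the number of observations, and the last cumulative probability equals 1, which is a quick consistency check on any table of this kind.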


Task 16: Plotting Frequencies and Probability of product Subcategory

• We plot pie charts for frequency and probability, and barplots for cumulative frequency and cumulative probability.

par(mfrow=c(2,2))
# Without labels, pie() numbers the slices 1, 2, ...; the subcategory names are passed explicitly
pie(t15$Frequency, labels = t15$`Product Type`, radius = 1,
    col = brewer.pal(ncol(t15), "Paired"),
    border = "white",
    lty = 1,
    cex=0.9,
    font = 3)
barplot(t15$`Cum Frequency`, xlab="Product Subcategory",ylab="Frequency",  col = brewer.pal(6, "Accent"),main="Cumulative Frequency")
pie(t15$Probability, labels = t15$`Product Type`, radius = 1,
    col = brewer.pal(ncol(t15), "Paired"),
    border = "white",
    lty = 1,
    cex=0.9,
    font = 3)
barplot(t15$`Cum Probability`, xlab="Product Subcategory",ylab="Probability",  col = brewer.pal(6, "Accent"),main="Cumulative Probability")


• The pie charts display the frequency and probability of each subcategory, and the barplots display the cumulative frequency and cumulative probability.


Task 17: Average sale of products in company segment

• We find the average sales of products for each company segment and visualize them with a barplot.

Segment_Sale = tapply(M2Data$Sales, M2Data$Company_Segment, mean)
barplot(Segment_Sale, xlab="Segment",ylab="Sale",  col = brewer.pal(6, "Accent"),
                main="Average Segment Sale")


• We infer that the average sale in the Corporate segment is higher than in the Consumer and Home Office segments. Thus, to increase sales, research and development and a good marketing strategy for Consumer and Home Office products are needed.


CONCLUSION
In this R Project, I learned how to analyze data, show data visually, and do basic calculations using exploratory analysis. I have a thorough understanding of R Markdown and its purpose. This assignment, I believe, helped me develop a solid understanding of analytics and R programming.
When reviewing market statistics, several factors influence market growth, sales, and profitability. To reach a conclusion and move forward with strategy development, a thorough grasp of all these elements is required, and visually clear graphs are needed to gain a basic understanding of the situation. So, for each analysis, we created a graphical representation using bar plots, box plots, histograms, and pie charts. From these data, we can conclude that a robust sales strategy is required in the African market. We can sell more technology products in African regions because the sale of technology products is the largest. Phones are also the most popular electronic gadgets on the African continent. Eastern Africa has the lowest sales and the highest shipping costs. As a result, a strategy to reduce shipping costs and improve sales in the Eastern African market is required. The products in the Corporate segment have the highest average sale. As a result, products for corporate sale must be available at all times.


BIBLIOGRAPHY
1. Chiluiza, D. Introduction to analytics using R, R Studio and R Markdown. Short manual series. Retrieved from https://rpubs.com/Dee_Chiluiza/home
2. R Applications: 9 Real World Use Cases of R Programming. Retrieved from https://techvidvan.com/tutorials/r-applications/
3. Schwartzberg, J. Present Your Data Like a Pro. Harvard Business Review. Retrieved from https://hbr.org/2020/02/present-your-data-like-a-pro
4. Gallo, A. A Refresher on Statistical Significance. Harvard Business Review. Retrieved from https://hbr.org/2016/02/a-refresher-on-statistical-significance



APPENDIX
Wickham, H. R for Data Science. O’Reilly. https://r4ds.had.co.nz/index.html
Field, A. Discovering Statistics Using R. SAGE Publications Ltd.