M6 Project Report on R
ALY 6000 - CRN 90281: Introduction to Analytics
College of Professional Studies (CPS), Analytics
Northeastern University
Professor: Prof. Diana Chiluiza, PhD

By: Zeeshan Ahmad Ansari

Date of Submission: 19 August, 2023


Library

#The report utilizes a set of libraries for various data processing and visualization tasks.

library(tidyverse)
library(readxl)
library(readr)
library(RColorBrewer)
library(ggplot2)
library(magrittr)
library(kableExtra)
library(dplyr)
library(knitr)
library(FSA)
library(plotly)

library(graphics)
library(scales)  # For formatting numbers

#Dataset_Employed_in_this_M3_Report

M6Data = read_excel("D:/Quater_1/ALY 6000/M6 Project/DataSets/M6_project_dataset(2023).xlsx")

# Define custom colors for each species
custom_colors <- c("#43a2ca", "slategrey", "red", "green", "lightblue", "orange", "yellow")


Introduction

In today’s business world, understanding sales, markets, and how different parts of a company work together is crucial for success. This report dives into a dataset that holds information about sales transactions happening across different places and types of products. By analyzing this data, we aim to uncover useful insights that can help businesses make better decisions.

The data shows us how sales happen in different places like cities and countries. It also shows us what kinds of products people like to buy. In this we have different groups of customers, like people buying things for their homes, offices, or technology needs. Each group has its own preferences and the impact it will create on the company.Another important part is looking at different parts of a company, like the departments that handle furniture, technology, and office supplies.

This data contains 17 columns and 1000 rows. The columns are divided into categorical and numerical variables. While closly looking we got that we have overall 5 numerical variables namely Product Price, Quantity, Shipping Cost Each, Returns and Loss per return rest all are categorical varibales.

Descriptive statistics are a set of methods that are used to summarize and describe key features of a dataset. It aids in comprehending data’s central tendencies, variability, and distribution. This includes computing metrics such as mean (average), median (middle value), and mode (most frequent value), as well as investigating the data’s range, standard deviation, and percentiles (Bluman,2010).

It aids in the discovery of insights from massive amounts of data, assisting in decision-making and plan building. Businesses, for example, might use these statistics to better understand client purchase patterns, identify popular products etc. It’s also useful for determining revenue patterns, monitoring inventory levels, and assessing staff performance.

Inferential statistics involves drawing conclusions and making predictions about a larger population based on a sample of data. It uses probability and statistical techniques to make informed statements about a group beyond the observed data. This includes methods like hypothesis testing, confidence intervals, and regression analysis (Bluman,2010).

Inferential statistics is a strong tool in real-world business undertakings. It assists businesses in making decisions and forecasting by analyzing a smaller selection of data to discover wider trends. For example, a corporation can use inferential statistics to determine whether a new marketing plan is likely to enhance sales across the board. It can also be used to forecast future market patterns and assess the impact of policy changes.

An R Script is a plain text file containing a series of R commands and code that can be executed sequentially. It’s used to write, edit, and run R code without any additional formatting or narrative content. R Scripts are commonly used by programmers and analysts to automate tasks, perform data analysis, and develop functions or algorithms (Kabacoff, 2015).

R Markdown is a file format that combines R code, text, and formatting into a single document. It allows you to seamlessly blend narrative text, code, and visualizations. R Markdown files are processed by an R Markdown rendering engine to create dynamic reports, presentations, documents, and even web pages. When rendered, an R Markdown file produces output that can include code results, tables, plots, and text (Kabacoff, 2015).

Both R Scripts and R Markdown files allow us to run R code and perform data analysis. Both files are used within the R programming environment to utilize its functionalists. We can manipulate and process data using R commands in both R Scripts and R Markdown files (Kabacoff, 2015).

The main difference is that R Scripts are just for giving instructions to our computer. It’s like telling it what to do. R Markdown, on the other hand, is for telling a whole story. We can mix our words with R code and pictures to create something like this report or a presentation that people can read and see (Kabacoff, 2015).

R Scripts are great for making our computer do specific tasks and work with numbers, but they don’t make things look good. R Markdown, on the other hand, makes things look nice and combines words and numbers in a way that people can understand.

So, R Scripts are like computer instructions, and R Markdown is like a storytelling tool that combines words and code to create impressive reports and presentations.


Analysis Section


Task_1 Finding the Basic Descriptive Statistics of the Dataset

In this analysis, we are presenting an overview of the entire dataset through a combination of a summary table and two informative graphs. The summary table showcases essential descriptive statistics for numeric variables such as Quantity, Discount and Profit. The graphs, a box plot and a histogram, visualize the distribution of the Profit variable. This comprehensive approach offers insights into the dataset’s numerical trends, distribution, and outliers, enhancing our understanding of the data’s characteristics and potential patterns.

data_summary <- t(summary(M6Data))

kable(data_summary, format = "html",caption = "Transpose Summary of the Data Set", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0,  background = "slategrey" ,bold = TRUE, color = "white")
Transpose Summary of the Data Set
OrderDate Min. :2020-01-01 00:00:00.0 1st Qu.:2020-05-09 00:00:00.0 Median :2020-08-06 00:00:00.0 Mean :2020-07-24 05:12:28.7 3rd Qu.:2020-10-19 00:00:00.0 Max. :2020-12-25 00:00:00.0
ProductID Length:1000 Class :character Mode :character NA NA NA
City Length:1000 Class :character Mode :character NA NA NA
State Length:1000 Class :character Mode :character NA NA NA
Country Length:1000 Class :character Mode :character NA NA NA
Region Length:1000 Class :character Mode :character NA NA NA
Market Length:1000 Class :character Mode :character NA NA NA
Segment Length:1000 Class :character Mode :character NA NA NA
Department Length:1000 Class :character Mode :character NA NA NA
Division Length:1000 Class :character Mode :character NA NA NA
OrderPriority Length:1000 Class :character Mode :character NA NA NA
ShipMode Length:1000 Class :character Mode :character NA NA NA
Product_Price Min. : 151.9 1st Qu.: 361.1 Median : 501.4 Mean : 548.9 3rd Qu.: 684.3 Max. :3558.4
Quantity Min. : 20.0 1st Qu.: 45.0 Median :133.0 Mean :158.5 3rd Qu.:251.2 Max. :400.0
Shipping_Cost_Each Min. : 0.000 1st Qu.: 3.890 Median : 8.585 Mean :14.325 3rd Qu.:21.413 Max. :59.710
Returns Min. : 6.00 1st Qu.: 14.00 Median : 40.00 Mean : 47.59 3rd Qu.: 75.25 Max. :120.00
Loss_Per_Return Min. : 18.23 1st Qu.: 73.54 Median : 145.66 Mean : 168.92 3rd Qu.: 240.63 Max. :1352.20
# Select relevant numerical columns
numeric_cols <- c("Product_Price", 
                  "Quantity", 
                  "Shipping_Cost_Each", 
                  "Returns", 
                  "Loss_Per_Return")
numeric_data <- M6Data[, numeric_cols]

# Generate basic descriptive statistics
data_summary_M6 <- summary(numeric_data)

kable(data_summary_M6, caption = "Summary of Numeric Data",format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0,  background = "slategrey" ,bold = TRUE, color = "white")
Summary of Numeric Data
Product_Price Quantity Shipping_Cost_Each Returns Loss_Per_Return
Min. : 151.9 Min. : 20.0 Min. : 0.000 Min. : 6.00 Min. : 18.23
1st Qu.: 361.1 1st Qu.: 45.0 1st Qu.: 3.890 1st Qu.: 14.00 1st Qu.: 73.54
Median : 501.4 Median :133.0 Median : 8.585 Median : 40.00 Median : 145.66
Mean : 548.9 Mean :158.5 Mean :14.325 Mean : 47.59 Mean : 168.92
3rd Qu.: 684.3 3rd Qu.:251.2 3rd Qu.:21.413 3rd Qu.: 75.25 3rd Qu.: 240.63
Max. :3558.4 Max. :400.0 Max. :59.710 Max. :120.00 Max. :1352.20
# Create a histogram for Product_Price
# Extract Product_Price column
product_prices <- M6Data$Product_Price

# Create a histogram
hist(product_prices, breaks = 20, col = "lightblue", border = "black",
     xlab = "Product Price", ylab = "Frequency",
     ylim = c(0, 350),
     main = "Histogram of Product Price")

# Create a box plot for Quantity
boxplot(M6Data$Quantity, col = "lightgreen",
        xlab = "", ylab = "Quantity",
        main = "Box Plot of Quantity")

Observations to Task_1:

As we analysed our dataset using summary command (Kabacoff, 2015) we got the information about the numerical variables. By employing this function, we gained access to key statistical measures for each numeric column, facilitating a comprehensive understanding of the dataset’s characteristics. These measures encompass the minimum value, the value at the first quartile, the median (representing the second quartile), the arithmetic mean, the value at the third quartile, and the maximum value associated with each numeric variable. We tranposed the table so that we can have a better visualization. After getting information about the numerical values we again used summary command to get a better picture of the numerical variables.

A histogram is (Chiluiza, 2022) generated to visualize the distribution of “Product_Price” in the dataset. It clearly shows that the distribution is right-skewed, while many products have relatively lower prices, there are some products with higher prices that contribute to the spread in the distribution. The interval [250, 500] has the highest count with more than 300 observations. This suggests that a significant portion of products falls within the price range of $250 to $550.

The box plot (Chiluiza, 2022) for the quantity data shows that the median (Q2) value of is around 130 which suggests that the middle value of product quantities sold is around this number. This indicates that a significant portion of sales transactions involves quantities close to this value. The slight asymmetry of the box plot suggests a possible skewness in the distribution. The longer whisker on one side indicates that some transactions involve larger quantities, contributing to the skewness.


Task_2 Evaluation of the distribution of the shipping charges

We want to view and evaluate the distribution of shipping charges within the provided dataset. For this we can use a histogram and a horizontal box plot to properly understand the distribution, variability, and core tendency of shipping costs. The visual comparison will allow us to detect probable outliers, evaluate the general distribution shape, and gain insight into the dataset’s usual values and extremes.

# Calculate mean and median
mean_shipping_cost <- mean(M6Data$Shipping_Cost_Each)
median_shipping_cost <- median(M6Data$Shipping_Cost_Each)

par(fig=c(0, 1, 0, 0.8))

hist(M6Data$Shipping_Cost_Each,
     xlab = "Shipping Cost", ylab = "Frequency",
     ylim = c(0, 400),
     main = "",
     col = "lightgrey", border = "black")
abline(v = mean_shipping_cost, 
       col = "red", 
       lty = 1)
abline(v = median_shipping_cost, 
       col = "blue", 
       lty = 2)
text(mean_shipping_cost, 
     70, 
     round(mean_shipping_cost, 2), 
     col = "red", 
     pos = 4,
     srt=90,
     cex=0.8)
text(median_shipping_cost, 
     80, 
     round(median_shipping_cost, 2),
     col = "blue",
     pos = 4,
     srt=90,
     cex=0.8)

legend(30,300, legend = c("Mean Shipping Cost", "Median Shipping Cost"), 
       col = c("red", "blue"), lty = c(1, 2))


par(fig=c(0, 1, 0.45, 1), new=TRUE)

boxplot(M6Data$Shipping_Cost_Each, horizontal = TRUE,
        notch=TRUE,
        varwidth= TRUE,
        main = "Comparing Distribution of Shipping Cost with Box and Histogram",
        col = "lightgrey", border = "black")
abline(v = mean_shipping_cost, 
       col = "red", 
       lty = 1)
abline(v = median_shipping_cost, 
       col = "blue", 
       lty = 2)
text(mean_shipping_cost, 
     1, 
     round(mean_shipping_cost, 2),
     col = "red", 
     pos = 4,
     srt=90, 
     cex=0.8)
text(median_shipping_cost, 
     1, 
     round(median_shipping_cost, 2), 
     col = "blue", 
     pos = 4,
     srt=90, 
     cex=0.8)

Observations to Task_2:

The frequency distribution of shipping costs is depicted by the histogram. The data is skewed to the right, with the majority of shipping charges clustered near the bottom of the scale. This suggests that the majority of orders have lower shipping prices, but a few orders have higher shipping expenses.

The horizontal box plot confirms the observations of the histogram. The median (blue line) shipping cost is lower than the mean (red line), emphasizing the skewness. Outliers are visible, as certain transportation charges drastically surpass the average range.

We may compare the effectiveness of each strategy by using both visuals side by side. The histogram shows the frequency distribution and value spread more clearly. The box plot, on the other hand, provides information about quartiles, medians, and ranges.

The median, which is less impacted by outliers, represents the data’s central trend better. The mean is being pulled upwards by extreme values.

The combined study enables us to learn that, while the majority of orders have relatively lower delivery expenses, a small number of orders have significantly higher charges. This could be because to variables like quicker shipping or remote delivery areas.

In conclusion, the variable distribution study, which included both a histogram and a horizontal box plot, shed light on the skewed distribution of shipping costs. This method enabled us to observe central tendencies, identify outliers, and obtain insights into the dataset’s range of shipping charges.


Task_3 Use a Box Plot to Compare Shipping Costs Across Markets

By employing a box plot, we aim to visually compare the distribution of shipping costs across different markets. This allows us to identify variations in shipping charges, discern markets with higher and lower shipping costs, and gain insights into the range of cost discrepancies across various markets.

max_shipping_cost <- max(M6Data$Shipping_Cost_Each)
min_shipping_cost <- min(M6Data$Shipping_Cost_Each)


market_with_max <- M6Data$Market[which.max(M6Data$Shipping_Cost_Each)]
market_with_min <- M6Data$Market[which.min(M6Data$Shipping_Cost_Each)]

boxplot(Shipping_Cost_Each ~ Market, data = M6Data,
        notch = TRUE, 
        horizontal = FALSE,
        col = custom_colors, 
        border = "black",
        main = "Shipping Cost versus Markets",
        xlab = "Markets", 
        ylab = "Shipping Cost",
        ylim = c(0, 70))

Observations to Task_3:

As we look to the box plots above the horizontal lines inside the box represents the median, which is a good indicator of the central tendency of the distribution. If a box’s median is higher, it means that market has higher shipping cost, and if the median is lower, it means that market has lower shipping cost.

From the visual inspection, we can clearly see that the Africa has the higher shipping cost and USCA has the lowest shipping cost.

Also when we calculated it we get Africa is the highest shipping cost and USCA has the lowest shipping cost.


Task_4 Calculate Mean Shipping Costs per Market and Visualization

Using tapply() function, we can compute the mean shipping cost for each market. We will be generating a bar plot to visually compare the mean shipping costs across different markets. The comparison can offer insights distinct from the previous box plot, providing us with the understanding of market-specific shipping cost.

Initially, the tapply() function is utilized to calculate the mean shipping cost for each market segment. This process involves grouping the data by market and then applying the mean function to the shipping cost column within each market group. The resulting mean values are stored in an object.

Following this, a bar plot is created using the object to illustrate the mean shipping costs across different markets. The bar heights reflect the average shipping expenses in each market, facilitating a visual comparison.

Similarly, we used tapply() function with the median function and we can investigate how the median and total shipping costs vary across markets. This approach enables us to explore the central tendency for different market segments. The resulting bar plots for median can be compared to the initial mean bar plot, which will be insightful for the distribution of shipping.

mean_shipping_by_market <- tapply(M6Data$Shipping_Cost_Each, M6Data$Market, mean)

barplot(mean_shipping_by_market, 
        main = "Mean Shipping Cost per Market",
        xlab = "Market", ylab = "Mean Shipping Cost",
        ylim = c(0,40),
        col = custom_colors, border = "black")

median_shipping_per_market <- tapply(M6Data$Shipping_Cost_Each, M6Data$Market, median)

barplot(median_shipping_per_market, 
        main = "Median Shipping Costs per Market",
        xlab = "Market", ylab = "Median Shipping Cost",
        ylim = c(0,50),
        col = custom_colors, border = "black")

Observations to Task_4:

The mean shipping cost bar plot shows the average shipping expenses per market, aiding in identifying markets with higher or lower costs. As African market has the highest mean shipping cost than the USCA one. By applying different aggregation functions, such as the central tendency (median), we can uncover distinct aspects of the shipping cost distribution. As the from the graph we can see that Africa has the median value approximately around 41 which is still validating from the previous box plots.


Task_5 Create a Box Plot displaying Shipping Cost versus Shipping Mode

By utilizing a box plot, we are visually comparing the distribution of shipping costs based on different shipping modes. This graphical representation provides insight into the variability and core patterns of shipping costs connected with various shipment methods.

boxplot(Shipping_Cost_Each ~ ShipMode, data = M6Data,
        notch = TRUE, horizontal = FALSE,
        col = custom_colors, border = "black",
        main = "Shipping Cost versus Shipping Mode",
        xlab = "Shipping Mode", ylab = "Shipping Cost",
        ylim = c(0,65))

Observations to Task_5:

The box plot (Chiluiza, 2022) clearly depicts the distribution of shipping costs across different shipping modes. TheFirst Class mode shows a wider range of costs compared to other modes, with some potential outliers on the higher end.Second Class and First Class and Standard Class shipping modes exhibit relatively narrower interquartile ranges, indicating more consistent cost distribution. Same Day shipping mode demonstrates a relatively balanced distribution with moderate variability. The plot provides a visual understanding of the spread and central tendencies of shipping costs for each mode, aiding in identifying potential cost-saving opportunities and optimizing shipping strategies.


Task_6 Usage of Mutate function to create new Column

Using the pipes and the code mutate(), we enhance our dataset by calculating a new column called Total sales, obtained from the multiplication of the Price and Quantity variables. This allows us to better understand the overall sales generated from each transaction. By creating a new object to store this updated dataset, the original data is intact. The new “Total sales” column and its relationship with the existing variables will be usefull for our analysis.

NewDataSet_Name <- M6Data %>% 
  mutate(Total_Sales = Product_Price * Quantity)

glimpse(NewDataSet_Name[1, ])
## Rows: 1
## Columns: 18
## $ OrderDate          <dttm> 2020-05-08
## $ ProductID          <chr> "TEC75553"
## $ City               <chr> "Yaounde"
## $ State              <chr> "Centre"
## $ Country            <chr> "Cameroon"
## $ Region             <chr> "Central Africa"
## $ Market             <chr> "Africa"
## $ Segment            <chr> "Home Office"
## $ Department         <chr> "Technology"
## $ Division           <chr> "Accessories"
## $ OrderPriority      <chr> "High"
## $ ShipMode           <chr> "First Class"
## $ Product_Price      <dbl> 360.04
## $ Quantity           <dbl> 154
## $ Shipping_Cost_Each <dbl> 50.84
## $ Returns            <dbl> 46
## $ Loss_Per_Return    <dbl> 126.01
## $ Total_Sales        <dbl> 55446.16
Observations to Task_6:

The new “Total sales” column provides a comprehensive view of the revenue generated by each transaction, considering both the product price and quantity sold. By creating a new object to store the modified dataset, the original dataset remains unaltered, ensuring data integrity and allowing for comparisons between the two versions. By using the glimpse() function on the first row of the enriched dataset, we efficiently inspect the new column while maintaining focus on the context of other variables, preventing an overwhelming display of observations.


Task_7 Analysing the highest sale of the company according to Market, Department and Segments

In this analysis, we will be using recently created “Total sales” column to identify where the company achieved its highest sales figures across different dimensions: Market, Segment, and Department. Utilizing the power of the pipe operator and various dplyr functions, we group the data accordingly, aiming to uncover the most profitable sectors within the company’s operations.

sales_by_market <- NewDataSet_Name %>%
  group_by(Market) %>%
  summarise(Total_Sales = sum(Total_Sales)) %>%
  arrange(Total_Sales)

formatted_sales <- dollar(sales_by_market$Total_Sales, big.mark = ",", prefix = "$")

formatted_data <- data.frame(Market = sales_by_market$Market,
                              Formatted_Sales = formatted_sales)

Market_sales_table = kable(formatted_data, format = "html", align = "l", 
      caption = "Total Sales by Market") %>%
  kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")


ggplot(sales_by_market, aes(x = reorder(Market, -Total_Sales), y = Total_Sales, fill = Market)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) +  # Add value labels
  labs(title = "Total Sales by Market",
       x = "Market", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1)) +
  scale_fill_manual(values = custom_colors) +  # Using custom colors
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"))

Total Sales by Market
Market Formatted_Sales
Africa $5,127,296
USCA $9,533,931
LATAM $14,005,415
Europe $24,627,127
Asia Pacific $24,704,625
Observations to Task_7 for Total Sales in Market:

The Asia Pacific market exhibits the highest total sales, with a staggering value of approximately $24.7 million followed by a margin with the Europe market with $24.6 million. This indicates that the Asia Pacific market holds a significant portion of the company’s overall sales. The LATAM (Latin America) market showcases impressive total sales of approximately $14 million, positioning it as a key contributor to the company’s sales revenue. While the African market demonstrates lower total sales compared to other regions, it still contributes notably with around $5.1 million in sales.

The Asia Pacific and European market stands out as the leader in terms of total sales, indicating a strong business presence and successful operations in the region.

The “LATAM” market’s impressive sales suggest promising growth potential in Latin American regions.

In this we used inline r code to print the table.

sales_by_segment <- NewDataSet_Name %>%
  group_by(Segment) %>%
  summarise(Total_Sales = sum(Total_Sales)) %>%
  arrange(Total_Sales)

formatted_sales_segment <- dollar(sales_by_segment$Total_Sales, big.mark = ",", prefix = "$")

formatted_data_segment <- data.frame(Segment = sales_by_segment$Segment,
                                     Formatted_Sales = formatted_sales_segment)

Segment_sales_table = kable(formatted_data_segment, format = "html", align = "l", 
      caption = "Total Sales by Segment") %>%
  kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")

ggplot(sales_by_segment, aes(x = reorder(Segment, -Total_Sales), y = Total_Sales, fill = Segment)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) +  # Add value labels
  labs(title = "Total Sales by Segment",
       x = "Segment", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1)) +
  scale_fill_manual(values = custom_colors) +  # Using custom colors
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"))

Total Sales by Segment
Segment Formatted_Sales
Home Office $12,980,145
Corporate $23,377,596
Consumer $41,640,653
Observations to Task_7 for Total Sales in Segment:

The Consumer segment emerges as the leader in total sales, boasting a substantial figure of around $41.6 million. This indicates that the consumer market is a significant revenue driver for the company. The Corporate segment follows closely with total sales amounting to approximately $23.4 million. Corporate clients also play a substantial role in the company’s sales revenue. The Home Office segment demonstrates commendable performance, contributing significantly to the company’s revenue with total sales of roughly $13 million.

In this we used inline r code to print the table.

sales_by_department <- NewDataSet_Name %>%
  group_by(Department) %>%
  summarise(Total_Sales = sum(Total_Sales)) %>%
  arrange(Total_Sales)

formatted_sales_department <- dollar(sales_by_department$Total_Sales, big.mark = ",", prefix = "$")

formatted_data_department <- data.frame(Department = sales_by_department$Department,
                                     Formatted_Sales = formatted_sales_department)

Department_sales_table = kable(formatted_data_department, format = "html", align = "l", 
      caption = "Total Sales by Department") %>%
  kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")

ggplot(sales_by_department, aes(x = reorder(Department, -Total_Sales), y = Total_Sales, fill = Department)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) +  # Add value labels
  labs(title = "Total Sales by Department",
       x = "Department", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1)) +
  scale_fill_manual(values = custom_colors) +  # Using custom colors
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"))

Total Sales by Segment
Segment Formatted_Sales
Home Office $12,980,145
Corporate $23,377,596
Consumer $41,640,653
Observations to Task_7 for Total Sales in Department:

The Office Supplies department emerges as the leader in terms of total sales, boasting an impressive figure of approximately $43.5 million. This department plays a crucial role in generating revenue for the company. The Technology department follows closely with total sales amounting to around $27.8 million. This indicates a strong market demand for technological products offered by the company. The Furniture department demonstrates commendable performance, contributing significantly to the company’s revenue with total sales of roughly $6.8 million.

In this we used inline r code to print the table.


Task_8 HOTS (High Order Thinking Skill) Question

In this task we will try to understand three commands which until now I was mostly confused these are mutate() , group_by() and filter() functions in R.

  1. mutate() : This function is used to create new variables (columns) in a data frame based on transformations or calculations involving existing columns. Basically it adds new columns in the data frame keeping all exixting rows.

Example: Lets take a data frame which contains height and weight and we need another column named BMI, so we can create it using mutate().

  1. group_by(): This function is used to categorize data by one or more categorical variables. It is frequently used in conjunction with aggregation methods to study data within each group. It doesn’t change the number of rows in the data frame but instead prepares the data for subsequent group-wise operations.

Example: Let say we have a data frame of student scores with columns for “StudentID,” “Subject,” and “Score.” We want to analyze the average score for each subject. This is where the group_by() function will be used.

  1. filter():This function is used to subset or filter rows in a data frame based on certain criteria. It enables us to keep only the rows that meet specified criteria. It reduces the number of rows in the data frame by removing rows that do not meet the specified conditions.

Example: Selecting rows with certain values, filtering data based on dates, retaining rows with specific characteristics.

In nutshell we can say mutate() adds new columns to the data frame, group_by() groups data by one or more categorical variables for further group-wise operations and filter() selects rows that meet specified conditions, reducing the number of rows in the data frame.

These three operations are frequently used in various ways to accomplish complicated data transformations and analysis. For example, we can filter() rows first, then group_by() specific columns, and then modify() to add computed columns to each group.

Now for analysis we can have a question like :

What is the average shipping cost for product category (Department) in different regions some (Market) for the company?

The question aims to analyze the average shipping cost for each product category (Department) across different regions (Market) let say Africa, Europe and LATAM for the company’s operations. We want to see if there are any variations in shipping costs based on the type of product and the region it is being shipped to.

So, here we have three data variables Shipping_Cost_Each as numerical variable , Department and Market as categorical variables.

Code Application:

We can use the mutate(), group_by(), and filter() functions to solve this question. We can use mutate() to calculate the average shipping cost for each product category (Department) in different regions (Market) and apply group_by() to group the data by Department and Market. After that we can use summarise() to calculate the average shipping cost within each group. After that we will filter the results to focus on the regions (Markets) of interest Africa, Europe and LATAM.

average_shipping_cost <- NewDataSet_Name %>%
  group_by(Department, Market) %>%
  summarise(Avg_Shipping_Cost = mean(Shipping_Cost_Each)) %>%
  filter(Market %in% c("Africa", "Europe", "LATAM"))

kable(average_shipping_cost, format = "html", align = "c", 
      caption = "Average Shipping Cost by Department and Market") %>%
  kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
Average Shipping Cost by Department and Market
Department Market Avg_Shipping_Cost
Furniture Africa 28.47111
Furniture Europe 10.50464
Furniture LATAM 15.57906
Office Supplies Africa 35.79333
Office Supplies Europe 16.71920
Office Supplies LATAM 21.57053
Technology Africa 33.76115
Technology Europe 13.15522
Technology LATAM 15.02463
average_shipping_cost <- NewDataSet_Name %>%
  group_by(Department, Market) %>%
  summarise(Avg_Shipping_Cost = mean(Shipping_Cost_Each)) %>%
  filter(Market %in% c("Africa", "Europe", "LATAM"))

ggplot(average_shipping_cost, aes(x = Department, y = Avg_Shipping_Cost, fill = Market)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.6) +
  labs(title = "Average Shipping Cost by Department and Market",
       x = "Department", y = "Average Shipping Cost") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1)) +
  scale_fill_manual(values = c("Africa" = "skyblue", "Europe" = "lightgreen", "LATAM" = "lightpink")) +
  geom_text(aes(label = scales::dollar_format(prefix = "$")(Avg_Shipping_Cost)), vjust = -0.5, position = position_dodge(width = 0.8)) +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"))

Observations to Task_8:

From the above bar graph shows the average shipping cost (Avg_Shipping_Cost) for different combinations of Department and Market. Seeing this we can make the following meaningful inferences and observations:

  1. The unique average shipping costs show that shipping costs vary among departments and markets. The average shipping costs differ greatly amongst the three departments (Furniture, Office Supplies, and Technology).

  2. Among all combinations, the Furniture department in the Europe market has the lowest average delivery cost of around $10.50. In the European market, the department of Technology has a comparatively cheap average delivery cost of roughly $13.16.

  3. With an average cost of around $35.79, the Office Supplies department in the Africa market has the highest average delivery cost. In the Africa market, the department of Technology also has a somewhat high average shipping cost of around $33.76.

  4. The shipping costs tend to be higher in the Africa market compared to the Europe and LATAM markets for most departments.

  5. There are significant variances in typical shipping prices among the three departments within each market. In the LATAM region, for example, the average shipping cost for Office Supplies is significantly greater than for Furniture and Technology.

The insights acquired from this table can help businesses make cost-cutting decisions. Identifying departments and markets with high shipping costs, for example, can motivate further investigation to optimize shipping techniques or investigate potential cost-cutting agreements.

The differing shipping rates across markets could be related to factors such as shipping distances, local legislation, demand variations, and accessible delivery choices.


CONCLUSION

In this report, we investigated the analysis of a dataset based on sales-related data, with the goal of uncovering insights and trends within the business context. This dataset contains a profusion of variables, each of which represents a different facet of sales transactions in various markets, segments, and departments. We attempted to obtain relevant observations and conclusions that may drive business decisions by combining data manipulation, visualization approaches, and statistical metrics.

We began our adventure by utilizing the summary() command, which allowed us to comprehend essential statistical properties of the numeric variables in the dataset. This stage laid the groundwork for understanding the numerical aspects of the dataset, such as spread and central tendencies. We next conducted additional analysis using visualization tools, beginning with histograms. These visualizations revealed the distribution patterns of variables such as Product_Price which indicated a right-skewed distribution and identified the most common price ranges.

Following that, we used box plots to better comprehend data like Quantity and Shipping_Cost_Each. The quantity box plot revealed central tendencies and outliers, while the shipping cost histogram suggested a right-skewed distribution. The horizontal box plot supported these findings by emphasizing the skewness and presence of outliers in the shipping cost data.

When histograms and box plots were compared, their relative strengths were shown. Histograms showed distribution patterns and value dispersion more clearly, whereas box plots showed quartiles, medians, and probable outliers.

Our investigation led to the addition of the Total sales column, which was computed as the product of price and quantity. This extra knowledge enabled us to conduct a comprehensive analysis of sales revenue. According to the data, the Asia Pacific and European markets contributed the most to the company’s revenues, with significant amounts of around $24.7 million and $24.6 million, respectively. The Latin American (LATAM) market also displayed remarkable sales figures, demonstrating the region’s growth potential.

In addition, we investigated the typical delivery prices across departments and markets. This analysis provided useful insights, revealing trends such as greater transportation costs in the Africa region and variable expenses between departments. Departments such as Office Supplies in Africa have higher average transportation costs, begging for additional inquiry into potential cost-cutting techniques.

Finally, this in-depth study gave critical insights into the dataset’s dynamics, providing significant information for business decisions. We discovered distribution patterns, central tendencies, and market-specific trends, which enabled us to identify possible areas for improvement and optimization. The project not only taught us data analysis skills, but it also supplied us with practical insights that businesses can use to improve operational efficiency and make educated strategic decisions. We navigated this investigation using a strategic strategy that included data exploration, visualization, and interpretation, discovering hidden patterns and trends inside the domain of sales data.

To create this report, I followed a systematic approach. Initially, I familiarized myself with the dataset and its context. Then, I predominantly worked with R scripts to code and analyze the data. I often wrote and tested code in R script, and once I was satisfied with the results, I transferred the code to the markdown format for better presentation. This way, I ensured that the technical aspects were accurately conveyed and by this I feel comfortable. This back-and-forth process helped create a report that tells a clear story about the data, making sure the technical details are easy to understand.


Acknowledgments

I would like to extend my heartfelt gratitude to you Professor for your exceptional dedication to teaching and guiding us. As an international student, I was initially scared about adapting to a new educational environment. However, your unwavering commitment to fostering a positive learning atmosphere, coupled with your comprehensive course content and project modules, made a profound impact on my academic journey.

The course material you developed, whether it was RPubs, assignment projects, Quizzes, or discussion forums, offered a great basis for understanding R programming and Analytics. The project modules and discussion forums were quite helpful in strengthening my learning and allowing me to put what I learnt in class into practice. These hands-on experiences have boosted my confidence in managing R.

In just six classes, I went from being a cautious student to someone who is comfortable with R. You have established an environment in which learning is both pleasurable and rewarding.

I’d want to thank you for your remarkable efforts in creating the course content and accompanying resources.I’m carrying the knowledge and abilities I’ve gained under your supervision with me as I continue my academic path. Thank you for your excellent instruction and constant support.

I am grateful to my classmates for their warm support, which eased my homesickness and made me feel at home. Their friendship and assistance have been invaluable, creating a welcoming environment that has made my learning journey even more enjoyable.


BIBLIOGRAPHY

  1. Bluman, A. G. (2010). In Elementary statistics. A step by Step Approach (9th ed., pp. 28–29). essay, McGraw-Hill.

  2. Kabacoff, R. (2015). R in action : data analysis and graphics with R (Second edition., pp. 30,516-520). Manning Publications.

  3. Dee Chiluiza. (2022, June 25).RPubs. https://rpubs.com/Dee_Chiluiza/796492

  4. Dee Chiluiza. (2022, June 25).RPubs. https://rpubs.com/Dee_Chiluiza/barplots

  5. Dee Chiluiza. (2022, June 25).RPubs. https://rpubs.com/Dee_Chiluiza/boxplots


Appendix

This report contains an R Markdown file named as follows Ansari_ALY6000Project_M6.Rmd