M6 Project Report on R
ALY 6000 - CRN 90281: Introduction to Analytics
College of Professional Studies (CPS), Analytics
Northeastern University
Professor: Prof. Diana Chiluiza, PhD
By: Zeeshan Ahmad Ansari
Date of Submission: 19 August 2023
Library
#The report utilizes a set of libraries for various data processing and visualization tasks.
library(tidyverse)
library(readxl)
library(readr)
library(RColorBrewer)
library(ggplot2)
library(magrittr)
library(kableExtra)
library(dplyr)
library(knitr)
library(FSA)
library(plotly)
library(graphics)
library(scales) # For formatting numbers
# Dataset employed in this M6 report
M6Data <- read_excel("D:/Quater_1/ALY 6000/M6 Project/DataSets/M6_project_dataset(2023).xlsx")
# Define custom colors for each species
custom_colors <- c("#43a2ca", "slategrey", "red", "green", "lightblue", "orange", "yellow")
Introduction
In today’s business world, understanding sales, markets, and how different parts of a company work together is crucial for success. This report dives into a dataset that holds information about sales transactions happening across different places and types of products. By analyzing this data, we aim to uncover useful insights that can help businesses make better decisions.
The data shows us how sales happen in different places like cities and countries. It also shows us what kinds of products people like to buy. The dataset covers different groups of customers, such as people buying for their homes, offices, or technology needs; each group has its own preferences and its own impact on the company. Another important part is looking at different parts of a company, like the departments that handle furniture, technology, and office supplies.
This data contains 17 columns and 1000 rows. The columns are divided into categorical and numerical variables. On closer inspection, there are five numerical variables, namely Product Price, Quantity, Shipping Cost Each, Returns and Loss per Return; the rest are categorical variables.
Descriptive statistics are a set of methods used to summarize and describe key features of a dataset. They aid in comprehending the data's central tendencies, variability, and distribution. This includes computing metrics such as the mean (average), median (middle value), and mode (most frequent value), as well as investigating the data's range, standard deviation, and percentiles (Bluman, 2010).
It aids in the discovery of insights from massive amounts of data, assisting in decision-making and plan building. Businesses, for example, might use these statistics to better understand client purchase patterns, identify popular products etc. It’s also useful for determining revenue patterns, monitoring inventory levels, and assessing staff performance.
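As a brief illustration (a minimal sketch using a small invented vector, not the project dataset), these measures can be computed in R as follows:
# Hypothetical example: basic descriptive statistics for a small numeric vector
sample_values <- c(12, 15, 15, 18, 22, 27, 31)
mean(sample_values)        # arithmetic mean (average)
median(sample_values)      # middle value
range(sample_values)       # minimum and maximum
sd(sample_values)          # standard deviation
quantile(sample_values)    # quartiles / percentiles
# Base R has no built-in mode function; one simple approach for the most frequent value:
names(sort(table(sample_values), decreasing = TRUE))[1]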
Inferential statistics involves drawing conclusions and making predictions about a larger population based on a sample of data. It uses probability and statistical techniques to make informed statements about a group beyond the observed data. This includes methods like hypothesis testing, confidence intervals, and regression analysis (Bluman, 2010).
Inferential statistics is a strong tool in real-world business undertakings. It assists businesses in making decisions and forecasting by analyzing a smaller selection of data to discover wider trends. For example, a corporation can use inferential statistics to determine whether a new marketing plan is likely to enhance sales across the board. It can also be used to forecast future market patterns and assess the impact of policy changes.
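As a brief illustration (a minimal sketch on simulated data, not drawn from the project dataset), a one-sample t-test and its confidence interval can be obtained in R as follows:
# Hypothetical example: inferential statistics on simulated sales values
set.seed(123)
sample_sales <- rnorm(30, mean = 500, sd = 50)   # simulated sample of 30 sales values
t_result <- t.test(sample_sales, mu = 480)       # test whether the population mean differs from 480
t_result$p.value                                 # p-value for the hypothesis test
t_result$conf.int                                # 95% confidence interval for the mean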
An R Script is a plain text file containing a series of R commands and code that can be executed sequentially. It’s used to write, edit, and run R code without any additional formatting or narrative content. R Scripts are commonly used by programmers and analysts to automate tasks, perform data analysis, and develop functions or algorithms (Kabacoff, 2015).
R Markdown is a file format that combines R code, text, and formatting into a single document. It allows you to seamlessly blend narrative text, code, and visualizations. R Markdown files are processed by an R Markdown rendering engine to create dynamic reports, presentations, documents, and even web pages. When rendered, an R Markdown file produces output that can include code results, tables, plots, and text (Kabacoff, 2015).
Both R Scripts and R Markdown files allow us to run R code and perform data analysis. Both file types are used within the R programming environment and give access to the same functionality. We can manipulate and process data using R commands in both R Scripts and R Markdown files (Kabacoff, 2015).
The main difference is that R Scripts are just for giving instructions to our computer. It’s like telling it what to do. R Markdown, on the other hand, is for telling a whole story. We can mix our words with R code and pictures to create something like this report or a presentation that people can read and see (Kabacoff, 2015).
R Scripts are great for making our computer do specific tasks and work with numbers, but they don’t make things look good. R Markdown, on the other hand, makes things look nice and combines words and numbers in a way that people can understand.
So, R Scripts are like computer instructions, and R Markdown is like a storytelling tool that combines words and code to create impressive reports and presentations.
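For illustration, a minimal R Markdown file (a generic sketch, not this report's actual file) could look like the following; the narrative text sits outside the chunk, while the R code inside the chunk is executed when the file is knitted:
---
title: "Example Report"
output: html_document
---
This paragraph is narrative text that will appear in the rendered document.
```{r example-chunk}
# R code executed during knitting; its output appears in the report
summary(cars)
```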
Analysis Section
Task_1 Finding the Basic
Descriptive Statistics of the Dataset
In this analysis, we present an overview of the entire dataset through a combination of a summary table and two informative graphs. The summary table showcases essential descriptive statistics for the numeric variables Product_Price, Quantity, Shipping_Cost_Each, Returns and Loss_Per_Return. The graphs, a histogram of Product_Price and a box plot of Quantity, visualize the distributions of those variables. This comprehensive approach offers insights into the dataset's numerical trends, distribution, and outliers, enhancing our understanding of the data's characteristics and potential patterns.
data_summary <- t(summary(M6Data))
kable(data_summary, format = "html",caption = "Transpose Summary of the Data Set", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, background = "slategrey" ,bold = TRUE, color = "white")
| Variable |  |  |  |  |  |  |
|---|---|---|---|---|---|---|
| OrderDate | Min. :2020-01-01 00:00:00.0 | 1st Qu.:2020-05-09 00:00:00.0 | Median :2020-08-06 00:00:00.0 | Mean :2020-07-24 05:12:28.7 | 3rd Qu.:2020-10-19 00:00:00.0 | Max. :2020-12-25 00:00:00.0 |
| ProductID | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| City | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| State | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Country | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Region | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Market | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Segment | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Department | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Division | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| OrderPriority | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| ShipMode | Length:1000 | Class :character | Mode :character | NA | NA | NA |
| Product_Price | Min. : 151.9 | 1st Qu.: 361.1 | Median : 501.4 | Mean : 548.9 | 3rd Qu.: 684.3 | Max. :3558.4 |
| Quantity | Min. : 20.0 | 1st Qu.: 45.0 | Median :133.0 | Mean :158.5 | 3rd Qu.:251.2 | Max. :400.0 |
| Shipping_Cost_Each | Min. : 0.000 | 1st Qu.: 3.890 | Median : 8.585 | Mean :14.325 | 3rd Qu.:21.413 | Max. :59.710 |
| Returns | Min. : 6.00 | 1st Qu.: 14.00 | Median : 40.00 | Mean : 47.59 | 3rd Qu.: 75.25 | Max. :120.00 |
| Loss_Per_Return | Min. : 18.23 | 1st Qu.: 73.54 | Median : 145.66 | Mean : 168.92 | 3rd Qu.: 240.63 | Max. :1352.20 |
# Select relevant numerical columns
numeric_cols <- c("Product_Price",
"Quantity",
"Shipping_Cost_Each",
"Returns",
"Loss_Per_Return")
numeric_data <- M6Data[, numeric_cols]
# Generate basic descriptive statistics
data_summary_M6 <- summary(numeric_data)
kable(data_summary_M6, caption = "Summary of Numeric Data",format = "html", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, background = "slategrey" ,bold = TRUE, color = "white")
| Product_Price | Quantity | Shipping_Cost_Each | Returns | Loss_Per_Return |
|---|---|---|---|---|
| Min. : 151.9 | Min. : 20.0 | Min. : 0.000 | Min. : 6.00 | Min. : 18.23 |
| 1st Qu.: 361.1 | 1st Qu.: 45.0 | 1st Qu.: 3.890 | 1st Qu.: 14.00 | 1st Qu.: 73.54 |
| Median : 501.4 | Median :133.0 | Median : 8.585 | Median : 40.00 | Median : 145.66 |
| Mean : 548.9 | Mean :158.5 | Mean :14.325 | Mean : 47.59 | Mean : 168.92 |
| 3rd Qu.: 684.3 | 3rd Qu.:251.2 | 3rd Qu.:21.413 | 3rd Qu.: 75.25 | 3rd Qu.: 240.63 |
| Max. :3558.4 | Max. :400.0 | Max. :59.710 | Max. :120.00 | Max. :1352.20 |
# Create a histogram for Product_Price
# Extract Product_Price column
product_prices <- M6Data$Product_Price
# Create a histogram
hist(product_prices, breaks = 20, col = "lightblue", border = "black",
xlab = "Product Price", ylab = "Frequency",
ylim = c(0, 350),
main = "Histogram of Product Price")
# Create a box plot for Quantity
boxplot(M6Data$Quantity, col = "lightgreen",
xlab = "", ylab = "Quantity",
main = "Box Plot of Quantity")
As we analysed our dataset using the summary() command (Kabacoff, 2015), we obtained information about the numerical variables. This function gives key statistical measures for each numeric column, facilitating a comprehensive understanding of the dataset's characteristics: the minimum value, the first quartile, the median (the second quartile), the arithmetic mean, the third quartile, and the maximum value of each numeric variable. We transposed the table for better readability. After reviewing the full summary, we ran summary() again on only the numeric columns to get a clearer picture of those variables.
A histogram is generated (Chiluiza, 2022) to visualize the distribution of "Product_Price" in the dataset. It clearly shows that the distribution is right-skewed: while many products have relatively low prices, a smaller number of higher-priced products contribute to the spread of the distribution. The interval [250, 500] has the highest count, with more than 300 observations, suggesting that a significant portion of products falls within the price range of $250 to $500.
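A quick numerical check of this skewness (a small sketch, assuming the M6Data object loaded earlier) is to compare the mean and the median of the price column; a mean well above the median is consistent with a right-skewed distribution:
# Compare mean and median of Product_Price to confirm right skewness
mean(M6Data$Product_Price)     # about 549 per the summary table, pulled up by expensive products
median(M6Data$Product_Price)   # about 501, closer to the bulk of the prices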
The box plot (Chiluiza, 2022) for the quantity data shows that the median (Q2) is around 130, which suggests that the middle value of product quantities sold is near this number and that a significant portion of sales transactions involves quantities close to it. The slight asymmetry of the box plot suggests possible skewness in the distribution; the longer whisker on one side indicates that some transactions involve larger quantities, contributing to the skew.
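The quartiles behind this box plot can also be read off directly (a brief sketch using the same M6Data object):
# Five-number summary and interquartile range for Quantity
fivenum(M6Data$Quantity)   # minimum, Q1, median (about 133), Q3, maximum
IQR(M6Data$Quantity)       # spread of the middle 50% of quantities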
Task_2 Evaluation of the
distribution of the shipping charges
We want to view and evaluate the distribution of shipping charges within the provided dataset. For this we can use a histogram and a horizontal box plot to properly understand the distribution, variability, and central tendency of shipping costs. The visual comparison will allow us to detect probable outliers, evaluate the general shape of the distribution, and gain insight into the dataset's typical values and extremes.
# Calculate mean and median
mean_shipping_cost <- mean(M6Data$Shipping_Cost_Each)
median_shipping_cost <- median(M6Data$Shipping_Cost_Each)
par(fig=c(0, 1, 0, 0.8))  # draw the histogram in the lower 80% of the plotting region
hist(M6Data$Shipping_Cost_Each,
xlab = "Shipping Cost", ylab = "Frequency",
ylim = c(0, 400),
main = "",
col = "lightgrey", border = "black")
abline(v = mean_shipping_cost,
col = "red",
lty = 1)
abline(v = median_shipping_cost,
col = "blue",
lty = 2)
text(mean_shipping_cost,
70,
round(mean_shipping_cost, 2),
col = "red",
pos = 4,
srt=90,
cex=0.8)
text(median_shipping_cost,
80,
round(median_shipping_cost, 2),
col = "blue",
pos = 4,
srt=90,
cex=0.8)
legend(30,300, legend = c("Mean Shipping Cost", "Median Shipping Cost"),
col = c("red", "blue"), lty = c(1, 2))
par(fig=c(0, 1, 0.45, 1), new=TRUE)  # overlay the box plot in the upper portion of the same device
boxplot(M6Data$Shipping_Cost_Each, horizontal = TRUE,
notch=TRUE,
varwidth= TRUE,
main = "Comparing Distribution of Shipping Cost with Box and Histogram",
col = "lightgrey", border = "black")
abline(v = mean_shipping_cost,
col = "red",
lty = 1)
abline(v = median_shipping_cost,
col = "blue",
lty = 2)
text(mean_shipping_cost,
1,
round(mean_shipping_cost, 2),
col = "red",
pos = 4,
srt=90,
cex=0.8)
text(median_shipping_cost,
1,
round(median_shipping_cost, 2),
col = "blue",
pos = 4,
srt=90,
cex=0.8)
Observations to Task_2:
The frequency distribution of shipping costs is depicted by the histogram. The data is skewed to the right, with the majority of shipping charges clustered near the bottom of the scale. This suggests that the majority of orders have lower shipping prices, but a few orders have higher shipping expenses.
The horizontal box plot confirms the observations of the histogram. The median (blue line) shipping cost is lower than the mean (red line), emphasizing the skewness. Outliers are visible, as certain transportation charges drastically surpass the average range.
We may compare the effectiveness of each strategy by using both visuals side by side. The histogram shows the frequency distribution and value spread more clearly. The box plot, on the other hand, provides information about quartiles, medians, and ranges.
The median, which is less impacted by outliers, represents the data’s central trend better. The mean is being pulled upwards by extreme values.
The combined view shows that, while the majority of orders have relatively low shipping expenses, a small number of orders have significantly higher charges. This could be due to factors such as faster shipping modes or remote delivery areas.
In conclusion, the variable distribution study, which included both a histogram and a horizontal box plot, shed light on the skewed distribution of shipping costs. This method enabled us to observe central tendencies, identify outliers, and obtain insights into the dataset’s range of shipping charges.
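The outliers flagged by the box plot whiskers can be counted explicitly with the usual 1.5 × IQR rule (a small sketch, assuming the same M6Data object):
# Count shipping-cost outliers using the 1.5 * IQR rule
q1 <- quantile(M6Data$Shipping_Cost_Each, 0.25)
q3 <- quantile(M6Data$Shipping_Cost_Each, 0.75)
upper_fence <- q3 + 1.5 * (q3 - q1)            # whisker limit above the third quartile
sum(M6Data$Shipping_Cost_Each > upper_fence)   # number of unusually expensive shipments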
Task_3 Use a Box Plot to
Compare Shipping Costs Across Markets
By employing a box plot, we aim to visually compare the distribution of shipping costs across different markets. This allows us to identify variations in shipping charges, discern markets with higher and lower shipping costs, and gain insights into the range of cost discrepancies across various markets.
max_shipping_cost <- max(M6Data$Shipping_Cost_Each)
min_shipping_cost <- min(M6Data$Shipping_Cost_Each)
# Markets containing the single highest and lowest individual shipping charges
market_with_max <- M6Data$Market[which.max(M6Data$Shipping_Cost_Each)]
market_with_min <- M6Data$Market[which.min(M6Data$Shipping_Cost_Each)]
boxplot(Shipping_Cost_Each ~ Market, data = M6Data,
notch = TRUE,
horizontal = FALSE,
col = custom_colors,
border = "black",
main = "Shipping Cost versus Markets",
xlab = "Markets",
ylab = "Shipping Cost",
ylim = c(0, 70))
Looking at the box plots above, the horizontal line inside each box represents the median, which is a good indicator of the central tendency of the distribution. If a box's median is higher, that market has higher typical shipping costs; if the median is lower, that market has lower typical shipping costs.
From visual inspection, we can clearly see that Africa has the highest shipping costs and USCA has the lowest.
The calculation above points the same way: the single highest individual shipping charge occurs in the Africa market and the lowest in USCA.
Task_4 Calculate Mean
Shipping Costs per Market and Visualization
Using the tapply() function, we can compute the mean shipping cost for each market. We will generate a bar plot to visually compare the mean shipping costs across different markets. This comparison can offer insights distinct from the previous box plot, giving us an understanding of market-specific shipping costs.
Initially, the tapply() function is utilized to calculate the mean shipping cost for each market segment. This process involves grouping the data by market and then applying the mean function to the shipping cost column within each market group. The resulting mean values are stored in an object.
Following this, a bar plot is created using the object to illustrate the mean shipping costs across different markets. The bar heights reflect the average shipping expenses in each market, facilitating a visual comparison.
Similarly, we use the tapply() function with the median function to investigate how the median shipping cost varies across markets; the same pattern with sum() would give total shipping costs per market, as sketched at the end of this task. This approach lets us explore the central tendency for different market segments, and the resulting median bar plot can be compared with the initial mean bar plot, which is insightful for the distribution of shipping costs.
mean_shipping_by_market <- tapply(M6Data$Shipping_Cost_Each, M6Data$Market, mean)
barplot(mean_shipping_by_market,
main = "Mean Shipping Cost per Market",
xlab = "Market", ylab = "Mean Shipping Cost",
ylim = c(0,40),
col = custom_colors, border = "black")
median_shipping_per_market <- tapply(M6Data$Shipping_Cost_Each, M6Data$Market, median)
barplot(median_shipping_per_market,
main = "Median Shipping Costs per Market",
xlab = "Market", ylab = "Median Shipping Cost",
ylim = c(0,50),
col = custom_colors, border = "black")
The mean shipping cost bar plot shows the average shipping expenses per market, aiding in identifying markets with higher or lower costs: the African market has the highest mean shipping cost, well above that of USCA. By applying different aggregation functions, such as the median, we can uncover distinct aspects of the shipping cost distribution. From the graph we can see that Africa has a median value of approximately 41, which is consistent with the previous box plots.
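As mentioned above, the same tapply() pattern extends to totals; the following is a brief sketch (not rendered in the original report) that sums shipping costs per market:
# Total shipping cost per market using the same tapply() pattern
total_shipping_by_market <- tapply(M6Data$Shipping_Cost_Each, M6Data$Market, sum)
barplot(total_shipping_by_market,
        main = "Total Shipping Cost per Market",
        xlab = "Market", ylab = "Total Shipping Cost",
        col = custom_colors, border = "black")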
Task_5 Create a Box Plot
displaying Shipping Cost versus Shipping Mode
By utilizing a box plot, we are visually comparing the distribution of shipping costs based on different shipping modes. This graphical representation provides insight into the variability and core patterns of shipping costs connected with various shipment methods.
boxplot(Shipping_Cost_Each ~ ShipMode, data = M6Data,
notch = TRUE, horizontal = FALSE,
col = custom_colors, border = "black",
main = "Shipping Cost versus Shipping Mode",
xlab = "Shipping Mode", ylab = "Shipping Cost",
ylim = c(0,65))
The box plot (Chiluiza, 2022) clearly depicts the distribution of shipping costs across different shipping modes. The First Class mode shows a wider range of costs compared to other modes, with some potential outliers on the higher end. The Second Class and Standard Class shipping modes exhibit relatively narrower interquartile ranges, indicating a more consistent cost distribution. The Same Day shipping mode demonstrates a relatively balanced distribution with moderate variability. The plot provides a visual understanding of the spread and central tendencies of shipping costs for each mode, aiding in identifying potential cost-saving opportunities and optimizing shipping strategies.
Task_6 Usage of Mutate
function to create new Column
Using pipes and the mutate() function, we enhance our dataset by calculating a new column called Total_Sales, obtained by multiplying the Product_Price and Quantity variables. This allows us to better understand the overall sales generated from each transaction. By creating a new object to store this updated dataset, the original data remains intact. The new Total_Sales column and its relationship with the existing variables will be useful for our analysis.
NewDataSet_Name <- M6Data %>%
mutate(Total_Sales = Product_Price * Quantity)
glimpse(NewDataSet_Name[1, ])
## Rows: 1
## Columns: 18
## $ OrderDate <dttm> 2020-05-08
## $ ProductID <chr> "TEC75553"
## $ City <chr> "Yaounde"
## $ State <chr> "Centre"
## $ Country <chr> "Cameroon"
## $ Region <chr> "Central Africa"
## $ Market <chr> "Africa"
## $ Segment <chr> "Home Office"
## $ Department <chr> "Technology"
## $ Division <chr> "Accessories"
## $ OrderPriority <chr> "High"
## $ ShipMode <chr> "First Class"
## $ Product_Price <dbl> 360.04
## $ Quantity <dbl> 154
## $ Shipping_Cost_Each <dbl> 50.84
## $ Returns <dbl> 46
## $ Loss_Per_Return <dbl> 126.01
## $ Total_Sales <dbl> 55446.16
Observations to Task_6: The new Total_Sales column provides a comprehensive view of the revenue generated by each transaction, considering both the product price and the quantity sold. By creating a new object to store the modified dataset, the original dataset remains unaltered, ensuring data integrity and allowing comparisons between the two versions. By using the glimpse() function on the first row of the enriched dataset, we efficiently inspect the new column while keeping the other variables in view, preventing an overwhelming display of observations.
Task_7 Analysing the
highest sale of the company according to Market, Department and Segments
In this analysis, we will be using the recently created Total_Sales column
to identify where the company achieved its highest sales figures across
different dimensions: Market, Segment, and
Department. Utilizing the power of the pipe operator and
various dplyr functions, we group the data accordingly, aiming to
uncover the most profitable sectors within the company’s operations.
sales_by_market <- NewDataSet_Name %>%
group_by(Market) %>%
summarise(Total_Sales = sum(Total_Sales)) %>%
arrange(Total_Sales)
formatted_sales <- dollar(sales_by_market$Total_Sales, big.mark = ",", prefix = "$")
formatted_data <- data.frame(Market = sales_by_market$Market,
Formatted_Sales = formatted_sales)
Market_sales_table = kable(formatted_data, format = "html", align = "l",
caption = "Total Sales by Market") %>%
kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
ggplot(sales_by_market, aes(x = reorder(Market, -Total_Sales), y = Total_Sales, fill = Market)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) + # Add value labels
labs(title = "Total Sales by Market",
x = "Market", y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(hjust = 1)) +
scale_fill_manual(values = custom_colors) + # Using custom colors
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
| Market | Formatted_Sales |
|---|---|
| Africa | $5,127,296 |
| USCA | $9,533,931 |
| LATAM | $14,005,415 |
| Europe | $24,627,127 |
| Asia Pacific | $24,704,625 |
The Asia Pacific market exhibits the highest total sales, with a staggering value of approximately $24.7 million, followed closely by the Europe market at $24.6 million. This indicates that the Asia Pacific market holds a significant portion of the company's overall sales. The LATAM (Latin America) market shows impressive total sales of approximately $14 million, positioning it as a key contributor to the company's sales revenue. While the African market demonstrates lower total sales compared to other regions, it still contributes notably with around $5.1 million in sales.
The Asia Pacific and European markets stand out as the leaders in terms of total sales, indicating a strong business presence and successful operations in those regions.
The “LATAM” market’s impressive sales suggest promising growth potential in Latin American regions.
The table above was printed using inline R code referencing the Market_sales_table object.
sales_by_segment <- NewDataSet_Name %>%
group_by(Segment) %>%
summarise(Total_Sales = sum(Total_Sales)) %>%
arrange(Total_Sales)
formatted_sales_segment <- dollar(sales_by_segment$Total_Sales, big.mark = ",", prefix = "$")
formatted_data_segment <- data.frame(Segment = sales_by_segment$Segment,
Formatted_Sales = formatted_sales_segment)
Segment_sales_table = kable(formatted_data_segment, format = "html", align = "l",
caption = "Total Sales by Segment") %>%
kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
ggplot(sales_by_segment, aes(x = reorder(Segment, -Total_Sales), y = Total_Sales, fill = Segment)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) + # Add value labels
labs(title = "Total Sales by Segment",
x = "Segment", y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(hjust = 1)) +
scale_fill_manual(values = custom_colors) + # Using custom colors
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
| Segment | Formatted_Sales |
|---|---|
| Home Office | $12,980,145 |
| Corporate | $23,377,596 |
| Consumer | $41,640,653 |
The Consumer segment emerges as the leader in total sales,
boasting a substantial figure of around $41.6 million. This
indicates that the consumer market is a significant revenue driver for
the company. The Corporate segment follows closely with
total sales amounting to approximately $23.4 million.
Corporate clients also play a substantial role in the company’s sales
revenue. The Home Office segment demonstrates commendable
performance, contributing significantly to the company’s revenue with
total sales of roughly $13 million.
The table above was printed using inline R code referencing the Segment_sales_table object.
sales_by_department <- NewDataSet_Name %>%
group_by(Department) %>%
summarise(Total_Sales = sum(Total_Sales)) %>%
arrange(Total_Sales)
formatted_sales_department <- dollar(sales_by_department$Total_Sales, big.mark = ",", prefix = "$")
formatted_data_department <- data.frame(Department = sales_by_department$Department,
Formatted_Sales = formatted_sales_department)
Department_sales_table = kable(formatted_data_department, format = "html", align = "l",
caption = "Total Sales by Department") %>%
kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14)%>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
ggplot(sales_by_department, aes(x = reorder(Department, -Total_Sales), y = Total_Sales, fill = Department)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::dollar_format(prefix = "$")(Total_Sales)), vjust = -0.5) + # Add value labels
labs(title = "Total Sales by Department",
x = "Department", y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(hjust = 1)) +
scale_fill_manual(values = custom_colors) + # Using custom colors
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
The Office Supplies department emerges as the leader in
terms of total sales, boasting an impressive figure of approximately
$43.5 million. This department plays a crucial role in
generating revenue for the company. The Technology
department follows closely with total sales amounting to around
$27.8 million. This indicates a strong market demand for
technological products offered by the company. The
Furniture department demonstrates commendable performance,
contributing significantly to the company’s revenue with total sales of
roughly $6.8 million.
As before, the department table was prepared for printing with inline R code referencing the Department_sales_table object.
Task_8 HOTS (High Order
Thinking Skill) Question
In this task we will try to understand three commands that, until now, I have found the most confusing: the mutate(), group_by() and filter() functions in R.
mutate(): This function is used to create new variables (columns) in a data frame based on transformations or calculations involving existing columns. Basically, it adds new columns to the data frame while keeping all existing rows. Example: take a data frame that contains height and weight and suppose we need another column named BMI; we can create it using mutate(), as sketched below.
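A minimal sketch of this hypothetical height/weight example (the data frame below is invented purely for illustration):
# Hypothetical data frame with height (m) and weight (kg)
people <- data.frame(Height = c(1.65, 1.80, 1.72),
                     Weight = c(60, 85, 70))
people <- people %>%
  mutate(BMI = Weight / Height^2)   # mutate() adds the new BMI column
people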
group_by(): This function is used to group data by one or more categorical variables. It is frequently used in conjunction with aggregation methods to study data within each group. It doesn't change the number of rows in the data frame but instead prepares the data for subsequent group-wise operations. Example: let's say we have a data frame of student scores with columns for "StudentID", "Subject", and "Score", and we want the average score for each subject; this is where group_by() comes in, as sketched below.
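A short sketch of this hypothetical student-score example (again, the data are invented):
# Hypothetical student scores grouped by subject
scores <- data.frame(StudentID = c(1, 2, 3, 4),
                     Subject   = c("Math", "Math", "Science", "Science"),
                     Score     = c(88, 92, 75, 81))
scores %>%
  group_by(Subject) %>%                # prepare group-wise operations
  summarise(Avg_Score = mean(Score))   # average score per subject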
filter(): This function is used to subset or filter rows in a data frame based on certain criteria. It enables us to keep only the rows that meet specified conditions, reducing the number of rows by removing those that do not. Examples: selecting rows with certain values, filtering data based on dates, or retaining rows with specific characteristics (see the sketch below).
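A brief sketch of filter() applied to the same hypothetical scores data frame defined above:
# Keep only rows with a score of at least 80
scores %>%
  filter(Score >= 80)   # rows that do not meet the condition are dropped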
In a nutshell, mutate() adds new columns to the data frame, group_by() groups data by one or more categorical variables for further group-wise operations, and filter() selects rows that meet specified conditions, reducing the number of rows in the data frame.
These three operations are frequently combined to accomplish complicated data transformations and analysis. For example, we can filter() rows first, then group_by() specific columns, and then mutate() to add computed columns within each group, as in the sketch below.
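Putting the three verbs together on the same hypothetical scores data (a small sketch of the filter-then-group-then-mutate pattern just described):
# Filter, group, then add a computed column within each group
scores %>%
  filter(Score >= 75) %>%                               # keep qualifying rows
  group_by(Subject) %>%                                 # group by subject
  mutate(Score_vs_Subject_Avg = Score - mean(Score))    # per-group computed column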
Now, for analysis, we can pose a question like:
What is the average shipping cost for each product category (Department) in different regions (Market) for the company?
The question aims to analyze the average shipping cost for each product category (Department) across different regions (Market), say Africa, Europe and LATAM, for the company's operations. We want to see whether there are any variations in shipping costs based on the type of product and the region it is being shipped to.
So, here we have three data variables: Shipping_Cost_Each as a numerical variable, and Department and Market as categorical variables.
Code Application:
We can use the group_by(), summarise(), and filter() functions to answer this question. We apply group_by() to group the data by Department and Market, then summarise() to calculate the average shipping cost within each group. Finally, we filter() the results to focus on the regions (Markets) of interest: Africa, Europe and LATAM. (mutate() is not needed here, because summarise() already collapses each group to a single summary row.)
average_shipping_cost <- NewDataSet_Name %>%
group_by(Department, Market) %>%
summarise(Avg_Shipping_Cost = mean(Shipping_Cost_Each)) %>%
filter(Market %in% c("Africa", "Europe", "LATAM"))
kable(average_shipping_cost, format = "html", align = "c",
caption = "Average Shipping Cost by Department and Market") %>%
kable_styling(full_width = TRUE, bootstrap_options = "striped", font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
| Department | Market | Avg_Shipping_Cost |
|---|---|---|
| Furniture | Africa | 28.47111 |
| Furniture | Europe | 10.50464 |
| Furniture | LATAM | 15.57906 |
| Office Supplies | Africa | 35.79333 |
| Office Supplies | Europe | 16.71920 |
| Office Supplies | LATAM | 21.57053 |
| Technology | Africa | 33.76115 |
| Technology | Europe | 13.15522 |
| Technology | LATAM | 15.02463 |
ggplot(average_shipping_cost, aes(x = Department, y = Avg_Shipping_Cost, fill = Market)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.6) +
labs(title = "Average Shipping Cost by Department and Market",
x = "Department", y = "Average Shipping Cost") +
theme_minimal() +
theme(axis.text.x = element_text(hjust = 1)) +
scale_fill_manual(values = c("Africa" = "skyblue", "Europe" = "lightgreen", "LATAM" = "lightpink")) +
geom_text(aes(label = scales::dollar_format(prefix = "$")(Avg_Shipping_Cost)), vjust = -0.5, position = position_dodge(width = 0.8)) +
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
Observations to Task_8:
The bar graph above shows the average shipping cost (Avg_Shipping_Cost) for different combinations of Department and Market. From it we can make the following inferences and observations:
Average shipping costs vary among departments and markets, and they differ greatly amongst the three departments (Furniture, Office Supplies, and Technology).
Among all combinations, the Furniture department in the Europe market has the lowest average shipping cost, around $10.50. In the European market, the Technology department also has a comparatively low average shipping cost of roughly $13.16.
With an average cost of around $35.79, the Office Supplies department in the Africa market has the highest average shipping cost. In the Africa market, the Technology department also has a fairly high average shipping cost of around $33.76.
The shipping costs tend to be higher in the Africa market compared to the Europe and LATAM markets for most departments.
There are significant variances in typical shipping prices among the three departments within each market. In the LATAM region, for example, the average shipping cost for Office Supplies is significantly greater than for Furniture and Technology.
The insights acquired from this table can help businesses make cost-cutting decisions. Identifying departments and markets with high shipping costs, for example, can motivate further investigation to optimize shipping techniques or investigate potential cost-cutting agreements.
The differing shipping rates across markets could be related to factors such as shipping distances, local legislation, demand variations, and accessible delivery choices.
CONCLUSION
In this report, we investigated the analysis of a dataset based on sales-related data, with the goal of uncovering insights and trends within the business context. This dataset contains a profusion of variables, each of which represents a different facet of sales transactions in various markets, segments, and departments. We attempted to obtain relevant observations and conclusions that may drive business decisions by combining data manipulation, visualization approaches, and statistical metrics.
We began our adventure by utilizing the summary() command,
which allowed us to comprehend essential statistical properties of the
numeric variables in the dataset. This stage laid the groundwork for
understanding the numerical aspects of the dataset, such as spread and
central tendencies. We next conducted additional analysis using
visualization tools, beginning with histograms. These visualizations
revealed the distribution patterns of variables such as
Product_Price which indicated a right-skewed distribution
and identified the most common price ranges.
Following that, we used box plots to better comprehend data like
Quantity and Shipping_Cost_Each. The quantity
box plot revealed central tendencies and outliers, while the shipping
cost histogram suggested a right-skewed distribution. The horizontal box
plot supported these findings by emphasizing the skewness and presence
of outliers in the shipping cost data.
When histograms and box plots were compared, their relative strengths were shown. Histograms showed distribution patterns and value dispersion more clearly, whereas box plots showed quartiles, medians, and probable outliers.
Our investigation led to the addition of the Total sales
column, which was computed as the product of price and
quantity. This extra knowledge enabled us to conduct a
comprehensive analysis of sales revenue. According to the data, the Asia
Pacific and European markets contributed the most to the company’s
revenues, with significant amounts of around $24.7 million
and $24.6 million, respectively. The Latin American (LATAM)
market also displayed remarkable sales figures, demonstrating the
region’s growth potential.
In addition, we investigated average shipping costs across departments and markets. This analysis provided useful insights, revealing trends such as higher shipping costs in the Africa region and variable costs between departments. Departments such as Office Supplies in Africa have higher average shipping costs, calling for further inquiry into potential cost-saving strategies.
Finally, this in-depth study gave critical insights into the dataset's dynamics, providing significant information for business decisions. We discovered distribution patterns, central tendencies, and market-specific trends, which enabled us to identify possible areas for improvement and optimization. The project not only taught us data analysis skills, but it also supplied practical insights that businesses can use to improve operational efficiency and make informed strategic decisions. We navigated this investigation using a structured approach that included data exploration, visualization, and interpretation, uncovering hidden patterns and trends within the sales data.
To create this report, I followed a systematic approach. Initially, I familiarized myself with the dataset and its context. Then, I predominantly worked with R scripts to code and analyze the data. I often wrote and tested code in an R script, and once I was satisfied with the results, I transferred the code to the R Markdown format for better presentation. This way, I ensured that the technical aspects were accurately conveyed while working in the environment I find most comfortable. This back-and-forth process helped create a report that tells a clear story about the data, making sure the technical details are easy to understand.
Acknowledgments
I would like to extend my heartfelt gratitude to you Professor for your exceptional dedication to teaching and guiding us. As an international student, I was initially scared about adapting to a new educational environment. However, your unwavering commitment to fostering a positive learning atmosphere, coupled with your comprehensive course content and project modules, made a profound impact on my academic journey.
The course material you developed, whether it was RPubs, assignment projects, Quizzes, or discussion forums, offered a great basis for understanding R programming and Analytics. The project modules and discussion forums were quite helpful in strengthening my learning and allowing me to put what I learnt in class into practice. These hands-on experiences have boosted my confidence in managing R.
In just six classes, I went from being a cautious student to someone who is comfortable with R. You have established an environment in which learning is both pleasurable and rewarding.
I'd like to thank you for your remarkable efforts in creating the course content and accompanying resources. I am carrying the knowledge and abilities I've gained under your supervision with me as I continue my academic path. Thank you for your excellent instruction and constant support. I am also grateful to my classmates for their warm support, which eased my homesickness and made me feel at home. Their friendship and assistance have been invaluable, creating a welcoming environment that has made my learning journey even more enjoyable.
BIBLIOGRAPHY
Bluman, A. G. (2010). Elementary statistics: A step by step approach (9th ed., pp. 28–29). McGraw-Hill.
Kabacoff, R. (2015). R in action: Data analysis and graphics with R (2nd ed., pp. 30, 516–520). Manning Publications.
Chiluiza, D. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/796492
Chiluiza, D. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/barplots
Chiluiza, D. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/boxplots
Appendix
This report contains an R Markdown file named as follows
Ansari_ALY6000Project_M6.Rmd