(The purpose of this week’s data dive is for you to think critically about the importance of documenting your model, but also the importance of referencing the documentation for the data you’re using.)
Part 1: A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.
Part 2: At least one element or your data that is unclear even after reading the documentation
Part 3: Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.2 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.3 âś” tibble 3.2.1
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
1. Part 1 - A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.
One of the columns that I feel is unclear without any
documentation is the column name - Ship Mode. It does
not justify the values appropriately.
Within a Superstore data set, I presume that the Shipping Mode should
define the mode via which the products were being sold and received by
the customers. So considering the same, within the Data set, we can see
the Ship.Mode Column has categorical value defining 4 different
types.
They are as follows -
As seen above there are 4 shipping modes. From the name of each, we
cannot directly understand their meaning. Only for “Same Day” we can
maybe interpret that the products will be received by the customer on
the same day. But am not sure on what the other 3 types within the
shipping mode signify.
First Class, Second Class and Standard Class inform that they are some
shipping mode designed to differentiate on, when they might be obtained
by the buyer. But exact details on the range of days, is unknown from
the words. Hence I would call SHIP MODE as an unclear column.
unique(Superstore_data$Ship.Mode)
## [1] "Second Class" "Standard Class" "First Class" "Same Day"
For certain records, can observe that the value in Profit column
is negative, and a lot more negative than how much ever they are sold.
Partly, the reason could be that a lot of discount was given compared to
the actual cost of the product. Not clear as to why the superstore would
do so.
For instance there is a Product which falls under Technology Category,
Machine Sub-category which sees the lowest profit. It is about 6599.978$
Loss. While the quantity for the product is 5 and sales is 4499.985$. So
the company incurred a major loss. This is somewhat a weird case.
m<-min(Superstore_data$Profit)
Superstore_data[Superstore_data$Profit == m,]
## Ship.Mode Segment Country City State Postal.Code Region
## 7773 Standard Class Consumer United States Lancaster Ohio 43130 East
## Category Sub.Category Sales Quantity Discount Profit
## 7773 Technology Machines 4499.985 5 0.7 -6599.978
When we look into State Column and count the
number of records, we cant really differentiate if actually there are
less customers in those states who purchased the products.
Also, another variation for the same would be, if the data collection
for the records are not done properly.
Maybe the data is for certain time span alone. But this is unknown to
come to a conclusion as to why certain states are having less
records
From the below count calculation for each state, can see that State -
Wyoming has only 1 record in the entire dataset. Followed by West
Virginia with 4 records. Hence, I believe without documentation of the
same, we wouldnt know the reason for them being the least in terms of
records.
## # A tibble: 10 Ă— 2
## State total_count_state
## <chr> <int>
## 1 Kansas 24
## 2 Idaho 21
## 3 Montana 15
## 4 South Dakota 12
## 5 Vermont 11
## 6 District of Columbia 10
## 7 Maine 8
## 8 North Dakota 7
## 9 West Virginia 4
## 10 Wyoming 1
2. Part 2 - At least one element or your data that is unclear even after reading the documentation
Even after reading the documentation, I feel Ship Mode is an unclear column. The documentation suggests the following. It has 4 values, they are First Class, Same Day, Second Class and Standard Class. Each value is defines the limit for being delayed as follows:
Same day: >=0 days
First class: >=2 days
Second class: >=3 days
Standard class: >= 5 days
But the detail of when the product was purchased or even when the
product reached the customer is unknown. Like the order_date or
delivered_date is unknown.
So maybe a bit more clarity on the Ship Mode column is possible, if
those two information is known.
Apart from that I believe rest all columns are self explanatory or even understood from the documentation. i.e. Segment, Country, City, State, Postal.Code, Region, Category, Sub.Category, Sales, Quantity, Discount, Profit
3. Part 3 - Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.
Superstore_data |>
ggplot(mapping = aes(x=Ship.Mode, color = Ship.Mode, fill=Ship.Mode)) +
geom_bar()
Not sure as to why people would choose a standard delivery over the same day or even first class or even Second class
There is an un-clarity of why the products are reaching the customers after 5 days in most cases. Is it the mode that customers are given an opportunity to select.? Or Are vendors the reason for the same?
It would have been better, if there was a column that indicated the distance between vendor city and customer city. This could have helped understand the reason of Ship Mode being divided into 4 different types.
Also, the count that we saw in the above plot, suggests the number of orders delivered with Standard Class Ship Mode were more.
Following is a graph between ship mode and quantity indicating
the quantity within each order.
ggplot(Superstore_data) +
aes(x = Ship.Mode, y = Quantity, color=Ship.Mode, fill=Ship.Mode) +
geom_bar(stat = "identity")
Overall can see the quantity of products in standard class is more than others i.e. more than 20000 products were purchased with Standard ship mode
Also, lets check if the products had an effect in their sales due
to the ship mode.
ggplot(Superstore_data) +
aes(x = Ship.Mode, y = Sales, color=Ship.Mode, fill=Ship.Mode) +
geom_bar(stat = "identity")
I believe since most products were bought from Standard class, the sales for same didn’t find much of a difference. Like the sales for it were high.
Also, lets check if the products had an effect in their Profit
due to the ship mode.
ggplot(Superstore_data) +
aes(x = Ship.Mode, y = Profit, color=Ship.Mode, fill=Ship.Mode) +
geom_bar(stat = "identity")
Since there were most products bought from Standard class, naturally they saw loss/profit more in number. But the loss were not as much as the profit, but when compared with other type of classes Standard class did see some loss