Week 5 Assignment

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
# dplyr is already attached by library(tidyverse), so it is not loaded again

Columns/values unclear until I read the documentation

Ownership

When I first looked at the data set, I didn’t understand what the column “Ownership” meant. The values in this column are 1, 2, 3, and 4. To make sense of them, I read the documentation, which explains that this company buys used cars from people and resells them, at a profit, to people who want to buy used cars.

# Load the listings; read.csv() already assumes a comma separator
df <- read.csv('Cars24.csv')
df |>
  group_by(Ownership) |>
  summarize(count = n()) |>
  arrange(desc(count))
## # A tibble: 4 × 2
##   Ownership count
##       <int> <int>
## 1         1  4448
## 2         2  1264
## 3         3   191
## 4         4    15

The documentation describes the “Ownership” column as the “number of previous owners the car has”, which explains the values above.

Why would they encode data this way?

I think using whole numbers to record the number of previous owners is straightforward. For some regression techniques, though, it may be more convenient to recode the column as 1 and 0 (1 meaning more than one previous owner and 0 meaning only one previous owner) instead of a range of whole numbers, at the cost of losing the distinction between 2, 3, and 4 owners.
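The 1/0 recoding described above can be sketched as follows. This is only an illustration: `cars_demo` is a small made-up data frame (not the real Cars24 data), and the column name `multiple_owners` is a hypothetical choice.

```r
library(dplyr)

# Toy data for illustration only; these rows are not from Cars24.csv.
cars_demo <- tibble(Ownership = c(1L, 2L, 1L, 3L, 4L))

# multiple_owners is a hypothetical name:
# 1 = more than one previous owner, 0 = exactly one previous owner.
cars_recoded <- cars_demo |>
  mutate(multiple_owners = as.integer(Ownership > 1))

cars_recoded
```

Keeping the original “Ownership” column alongside the indicator means the full count is still available if a model can use it directly.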

What would have happened if I didn’t read the documentation?

If I hadn’t read the documentation, I might have interpreted the values as category codes, such as:

  • 1 - Individual owned

  • 2 - Company owned

  • 3 - Government owned

  • 4 - Rentals

Price

The column heading “Price” makes it clear what kind of data the column holds. What I couldn’t tell was whether it contains the price the used car was bought for or the price it is being sold for.

After reading the documentation, it is clear that the “Price” column contains the price that each used car sells for.

Why would they encode data this way?

Storing price as a number preserves precision, so no financial detail is lost. It also makes aggregate functions such as sum and average easy to apply, and makes comparing two vehicles by price much simpler.
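As a rough sketch of why a numeric price column is convenient, the example below aggregates and sorts a tiny made-up table. The data frame `price_demo`, its `Car` column, and the prices are all assumptions for illustration; none of them come from Cars24.csv.

```r
library(dplyr)

# Toy data for illustration only; these prices are made up.
price_demo <- tibble(
  Car   = c("A", "B", "C"),
  Price = c(450000, 725000, 310000)
)

# Aggregate functions work directly on a numeric column.
price_summary <- price_demo |>
  summarize(total_price = sum(Price), mean_price = mean(Price))

# Comparing vehicles by price is a simple sort.
cheapest_first <- price_demo |>
  arrange(Price)

price_summary
cheapest_first
```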

What would have happened if I didn’t read the documentation?

I might have assumed that the “Price” column contains the price the used car was bought for and continued my exploration on that basis. The analysis would then rest on a misread column and lead to flawed conclusions.

Values in the column Fuel

Let’s look at the values in the column “Fuel”.

# df was already loaded above, so the file does not need to be read again
df |>
  group_by(Fuel) |>
  summarize(count = n()) |>
  arrange(desc(count))
## # A tibble: 5 × 2
##   Fuel         count
##   <chr>        <int>
## 1 Petrol        3787
## 2 Diesel        1964
## 3 Petrol + CNG   147
## 4 Petrol + LPG    18
## 5 Electric         2

From the values in this column, I could tell that it records the kind of fuel the car runs on. I did have questions about what the values “Petrol + CNG” and “Petrol + LPG” meant.

Why would they encode data this way?

This is a categorical variable with 5 categories, as shown above, and each car falls into exactly one of them. I understand what Petrol, Diesel, and Electric mean, but I needed to do some digging on CNG and LPG.

What would have happened if I didn’t read the documentation?

This column is fairly easy to understand even without the documentation: it gives the type of fuel each car runs on. The only part that needs some clarification is the two categories “Petrol + CNG” and “Petrol + LPG”.

Columns unclear even after reading the documentation

The “Petrol + CNG” and “Petrol + LPG” values in the “Fuel” column were still unclear to me after reading the documentation. After searching online, I learned that some cars can run on two kinds of fuel, in this case Petrol and Compressed Natural Gas (CNG) or Petrol and Liquefied Petroleum Gas (LPG).

These vehicles are called bi-fuel vehicles, and they have some advantages compared to a vehicle that has a single fuel source.

  1. Longer range: With two fuel tanks, the car carries more fuel and therefore has more range.
  2. Cost savings: Fuels like LPG and CNG might be cheaper compared to petrol.
  3. Reduced emissions: LPG and CNG are generally cleaner-burning fuels than petrol, so emissions tend to be lower.
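If the bi-fuel distinction matters for an analysis, one possible approach is to derive a flag from the “Fuel” labels. The sketch below uses the five category labels from the counts above in a toy data frame; `fuel_demo` and the column name `bi_fuel` are hypothetical, and the rule (a label containing “+” lists two fuels) is an assumption based on how the labels are written.

```r
library(dplyr)
library(stringr)

# The five labels match the Fuel categories above; the rows are toy data.
fuel_demo <- tibble(
  Fuel = c("Petrol", "Diesel", "Petrol + CNG", "Petrol + LPG", "Electric")
)

# bi_fuel is a hypothetical column name: TRUE when the label lists two fuels.
fuel_flagged <- fuel_demo |>
  mutate(bi_fuel = str_detect(Fuel, fixed("+")))

fuel_flagged
```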

Visualization

Misinterpreting “Ownership” Column

Without reading the documentation, one could assume that the “Ownership” values represent the type of ownership (e.g., individual, company, rental), which would be misleading during analysis. In the plot below, the x-axis label gives no indication of what the values mean, which leaves the reader guessing.

df |>
  ggplot(aes(x = factor(Ownership), y = Price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  ggtitle("Price Distribution by Ownership") +
  xlab("Ownership") +
  ylab("Price (INR)") +
  theme_minimal() 

The plot below adds an annotation and a clearer x-axis label, which makes it easier for a reader to understand.

df |>
  ggplot(aes(x = factor(Ownership), y = Price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  ggtitle("Price Distribution by Ownership") +
  xlab("Ownership (Number of Previous Owners)") +
  ylab("Price (INR)") +
  annotate(
    "text", x = 2, y = 5e6, color = "red",
    label = "Ownership could be misinterpreted as\nindividual, company, etc."
  ) +
  theme_minimal() 

Significant Risk

A misinterpretation of the column can lead to faulty decision-making in the analysis. For example, if a user thought that cars with “Ownership = 1” were individually owned and “Ownership = 2” were company-owned, they might draw conclusions about price trends for those categories, which would be entirely wrong.

Reducing consequences

  • Documentation: Providing the user with a link to the documentation gives them background on the underlying data, which helps eliminate assumptions the user might otherwise make.

  • Clear Labeling: Using clearer labels for axes and adding explanations in visualizations (e.g., “Number of Previous Owners” for the x-axis) reduces ambiguity.
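One further way to apply the clear-labeling idea is to spell out the meaning in the tick labels themselves. The sketch below is only an illustration under stated assumptions: `label_demo` is made-up data (not Cars24.csv), and the wording in `owner_labels` is a hypothetical choice.

```r
library(ggplot2)

# Toy data for illustration only; these rows are not from Cars24.csv.
label_demo <- data.frame(
  Ownership = c(1, 1, 2, 2, 3, 4),
  Price     = c(500000, 650000, 420000, 480000, 300000, 250000)
)

# Hypothetical tick labels spelling out what each code means.
owner_labels <- c(
  "1" = "1 owner", "2" = "2 owners",
  "3" = "3 owners", "4" = "4 owners"
)

labeled_plot <- ggplot(label_demo, aes(x = factor(Ownership), y = Price)) +
  geom_boxplot() +
  scale_x_discrete(labels = owner_labels) +  # replace bare codes with text
  xlab("Number of Previous Owners") +
  ylab("Price (INR)") +
  theme_minimal()

labeled_plot
```

With text on the ticks, a reader is less likely to mistake the codes for ownership types even if the axis title is cropped or overlooked.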