library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
#discourage scientific notation in printed output
options(scipen = 6)
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
There were a few attributes in the dataset that weren’t clear to me until I read the documentation: Price, Year, and Kilometer.
In this dataset, the Price attribute is the vehicle’s selling price in Indian Rupees. I assumed it was in US Dollars until I read the dataset’s discussion page on Kaggle. The US Dollar’s status as a “global currency” has led to it being treated as the default unit of price even in non-US contexts, but situations like these are a reminder that the world doesn’t revolve around the USA.
The Year attribute represents the vehicle’s manufacturing year. The ambiguity of the attribute’s name could lead to confusion with the year the vehicle was sold.
What we would call mileage is stored in the Kilometer attribute, the total distance the vehicle has been driven, in kilometers. The USA is still one of the few holdouts against metrication, but the name wasn’t too confusing, just not very good. I would have called it something like “Kilometers.Driven”.
unique(df$Owner)
## [1] "First" "Second" "Third" "Fourth"
## [5] "UnRegistered Car" "4 or More"
The Owner attribute represents the number of previous owners the car had, but the “UnRegistered Car” value confuses me because from what I looked up, it’s illegal to sell unregistered vehicles in India. The only value I can see in recording a dataset of unregistered vehicles is for police to track them down, but releasing information like this to the public feels irresponsible.
Additionally, the “Fourth” value is separate from “4 or More” for unknown reasons. I assume it’s just a mistake.
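If “Fourth” really is a stray duplicate of “4 or More”, one way to tidy the column is to fold the two labels together and give the levels a sensible order. The sketch below is only an illustration: it runs on a toy tibble built from the six labels printed above rather than on `df` itself, the names `owner_demo` and `owner_clean` are made up for the example, and putting “UnRegistered Car” last is an arbitrary choice.

```r
library(dplyr)
library(forcats)

# Toy stand-in for df$Owner: one row per label printed by unique(df$Owner).
# (Hypothetical object, not the real data.)
owner_demo <- tibble(
  Owner = c("First", "Second", "Third", "Fourth", "UnRegistered Car", "4 or More")
)

owner_clean <- owner_demo %>%
  mutate(
    # Assumption: "Fourth" is a mislabelled "4 or More", so merge the two levels
    Owner = fct_collapse(Owner, "4 or More" = c("Fourth", "4 or More")),
    # Order the levels by number of owners; "UnRegistered Car" goes last (arbitrary)
    Owner = fct_relevel(Owner, "First", "Second", "Third", "4 or More", "UnRegistered Car")
  )

# Count rows per cleaned level
count(owner_clean, Owner)
```

The same `mutate()` call could be applied to `df` directly, assuming the merge is actually the right interpretation of the data.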
#add a USD price column (fixed rate of 0.012 USD per INR), rename Price to INR,
#then stack both price columns into one long dataframe for plotting
df$USD <- df$Price * 0.012
df <- df %>% rename(INR = Price)
df_price <- stack(df[c("INR", "USD")])
ggplot(df_price, aes(ind, values)) + geom_boxplot()
This isn’t the prettiest way to show it, but the plot illustrates just how many more Indian Rupees it takes to match a given amount in US Dollars.
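Because USD here is just INR multiplied by a constant, a log-scaled axis may make the comparison easier to read: the two boxes keep the same shape and differ only by a fixed vertical shift, instead of the USD box being squashed flat against zero. The sketch below assumes the same 0.012 rate and uses a handful of placeholder prices invented purely for illustration (they are not taken from the dataset); `demo` and `demo_long` are hypothetical names.

```r
library(ggplot2)
library(tidyr)

# Placeholder INR prices, made up for illustration only (not from the dataset)
demo <- data.frame(INR = c(250000, 480000, 750000, 1200000, 3500000))
demo$USD <- demo$INR * 0.012  # same fixed conversion rate as in the code above

# Long format: one row per (currency, price) pair
demo_long <- pivot_longer(demo, cols = c(INR, USD),
                          names_to = "currency", values_to = "price")

# Boxplot on a log10 y-axis so both currencies are readable on one scale
p <- ggplot(demo_long, aes(currency, price)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "Currency", y = "Price (log scale)")
p
```

Swapping `demo_long` for the real `df_price` (with `ind` and `values` in place of `currency` and `price`) should give the equivalent plot for the full data.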