Importing Libraries and Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)

#remove scientific notation
options(scipen = 6)

df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")

Relationships Between Numeric Variables

# Keep a copy of the raw dataset
df_raw <- df
# Convert Indian rupees and metric units to US dollars and US customary units
df <- df |>
  mutate(
    years_since = year(now()) - Year,               # years since manufacture
    PriceUSD    = Price * 0.012,                    # rupees to dollars at a fixed rate
    Mileage     = Kilometer * 0.621371,             # kilometers to miles
    LengthInch  = Length * 0.0393701,               # millimeters to inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,    # liters to gallons
    Volume      = LengthInch * WidthInch * HeightInch  # cubic inches
  )

Using the mutate function, I created a column for the number of years since manufacture and converted the other measurements to their US counterparts for ease of understanding: Indian rupees to US dollars, kilometers to miles, millimeters to inches, and liters to gallons. I also computed each vehicle’s volume in cubic inches as the product of its length, width, and height in inches.
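As a quick sanity check on the conversion factors, the same arithmetic can be applied to a single made-up vehicle. The values in `toy` are hypothetical and chosen only to give round numbers; the factors are the ones used in the code above, including the fixed rate of 0.012 dollars per rupee.

```r
# Hypothetical single vehicle in the dataset's original units
toy <- data.frame(
  Price = 1000000,          # Indian rupees
  Kilometer = 100000,       # kilometers driven
  Length = 4000,            # millimeters
  Fuel.Tank.Capacity = 40   # liters
)

# Apply the same conversion factors used for df
converted <- transform(
  toy,
  PriceUSD    = Price * 0.012,
  Mileage     = Kilometer * 0.621371,
  LengthInch  = Length * 0.0393701,
  FuelGallons = Fuel.Tank.Capacity * 0.264172
)

print(converted[, c("PriceUSD", "Mileage", "LengthInch", "FuelGallons")])
```

If the printed values look wrong for a vehicle whose size and price are known, the factor or the source unit is the first thing to recheck.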

We will use price in US dollars as the response variable, and use volume in cubic inches and miles traveled as the explanatory variables.

options(scipen = 10000)
ggplot(df, aes(x = Volume, y = PriceUSD)) +
  geom_point() +
  labs(x = "Volume (cubic inches)", y = "Price (USD)", title = "Comparison of Volume to Price")
## Warning: Removed 64 rows containing missing values or values outside the scale range
## (`geom_point()`).

Most vehicles are priced below $100,000, and the data points thin out past 800,000 cubic inches. About a dozen outliers lie beyond each of those thresholds.

ggplot(df, aes(x = Mileage, y = PriceUSD)) +
  geom_point() +
  labs(x = "Miles Traveled", y = "Price (USD)", title = "Comparison of Miles Traveled to Price")

There’s a very dense cloud of data points below approximately 100,000 USD and 200,000 miles traveled. Only about three data points lie past 200,000 miles, and several dozen lie beyond 100,000 USD.
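The counts read off the plots by eye can be confirmed directly, since a logical comparison can be summed. The sketch below uses a small hypothetical data frame, `toy`, with the same column names as `df`; the thresholds are the ones mentioned above.

```r
# Hypothetical stand-in for df, including one missing price
toy <- data.frame(
  PriceUSD = c(8000, 15000, 42000, 120000, 250000, NA),
  Mileage  = c(30000, 85000, 210000, 12000, 5000, 40000)
)

# Comparisons return TRUE/FALSE, so sum() counts the rows past each threshold;
# na.rm = TRUE skips rows where the value is missing
n_over_price   <- sum(toy$PriceUSD > 100000, na.rm = TRUE)
n_over_mileage <- sum(toy$Mileage > 200000, na.rm = TRUE)

print(c(price_over_100k = n_over_price, miles_over_200k = n_over_mileage))
```

Replacing `toy` with `df` would give the actual number of points past each threshold.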

Calculating Correlation Coefficients and Confidence Intervals

df <- df |> drop_na(Volume)
round(cor(df$Volume, df$PriceUSD), 2)
## [1] 0.52
round(cor(df$Mileage, df$PriceUSD), 2)
## [1] -0.15

Above are the Pearson correlation coefficients for volume versus price and for miles driven versus price. There’s a moderate positive correlation between volume and price (0.52) and a weak negative correlation between miles driven and price (-0.15). This is mostly in line with my expectations of how those attributes affect price. I expected the correlation between miles driven and price to be stronger, but the weak value is probably due to the outliers.
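One way to probe whether outliers are driving a Pearson coefficient is to compare it with Spearman’s rank correlation, which uses only the ordering of the values and so is less sensitive to extreme points. The sketch below uses hypothetical vectors, not the vehicle data: twenty points on a perfectly decreasing line plus one extreme point in the opposite corner.

```r
# Hypothetical data: 20 points on a decreasing line, plus one extreme point
x <- c(1:20, 100)
y <- c(20:1, 100)

# Pearson uses the raw values, so the single extreme point dominates
pearson <- cor(x, y, method = "pearson")

# Spearman uses ranks, so the extreme point counts as just one more rank
spearman <- cor(x, y, method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 2))
```

Here the Pearson coefficient comes out positive while the Spearman coefficient stays negative. On the real data, the analogous call would be `cor(df$Mileage, df$PriceUSD, method = "spearman")`; a noticeably stronger negative value there would support the outlier explanation.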

# Calculate the mean of the sample data
meanValue <- mean(df$PriceUSD)

# Compute the sample size
n <- length(df$PriceUSD)

# Find the standard deviation
stdev <- sd(df$PriceUSD)

# Find the standard error of the mean
standardError <- stdev / sqrt(n)

# Find the critical t value for a 95% confidence level
# (named tCritical so it does not mask base::t)
alpha <- 0.05
dof <- n - 1
tCritical <- qt(p = alpha / 2, df = dof, lower.tail = FALSE)
marginError <- tCritical * standardError

# Calculate the lower bound and upper bound
lowerBound <- meanValue - marginError
upperBound <- meanValue + marginError

# Print the confidence interval
print(c(lowerBound, upperBound))
## [1] 19312.20 21883.88

At a 95% confidence level, the confidence interval for the mean price in USD for my dataset is 19,312.20 to 21,883.88 dollars. That is a tight interval, but it describes the uncertainty in the average price of used cars in India, not the range that most individual cars fall into; the scatterplots above show prices running well past $100,000. Compared to Indianapolis (note: this is just a “gut sense” observation based on skimming web pages, not a formal statement), pricing seems less varied, although the standard deviation, not this interval, would be the measure to compare. As far as used vehicles go, an average price within the interval I calculated is on the cheaper end; this seems fairly intuitive given the usual motivation for buying used vehicles: lower prices. I’d also have to consider how many cheaper vehicles are produced relative to more expensive ones, and how much that skews the population of used vehicles towards cheaper models.
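The manual interval can be cross-checked against R’s built-in `t.test()`, which reports the same t-based confidence interval for a mean. The sketch below also computes the middle 95% of the individual values with `quantile()`, to show that this is a separate and much wider range than the interval for the mean. The `prices` vector is simulated and hypothetical; it stands in for `df$PriceUSD`.

```r
set.seed(1)
# Hypothetical right-skewed prices standing in for df$PriceUSD
prices <- rlnorm(500, meanlog = 9.5, sdlog = 0.8)

# Manual 95% confidence interval for the mean
n <- length(prices)
margin <- qt(0.975, df = n - 1) * sd(prices) / sqrt(n)
manual <- c(mean(prices) - margin, mean(prices) + margin)

# Built-in equivalent: a one-sample t-test reports the same interval
builtin <- as.numeric(t.test(prices, conf.level = 0.95)$conf.int)

# Middle 95% of the individual prices, which answers a different question
spread <- quantile(prices, c(0.025, 0.975))

print(rbind(manual = manual, builtin = builtin))
print(spread)
```

The first two rows should agree, while the quantile range is far wider: the interval for the mean narrows as the sample size grows, but the spread of individual prices does not.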