Week 6 Assignment

Importing data and necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
# The file is comma-separated, so read.csv() reads it with the same defaults
df <- read.csv('Cars24.csv')

head(df, n = 5)
##   X Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1 0   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2 1    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3 2    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4 3    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5 4      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642

Calculating a Column

Creating a calculated column “Age” by subtracting the model year (year of manufacture) from the current year.

# Add a column named 'Age' holding the age of each car in years.
# Note: year(now()) depends on the run date; the output below was produced in 2024.
df$Age <- year(now()) - df$Model.Year

head(df)
##   X Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1 0   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2 1    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3 2    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4 3    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5 4      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6 5    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly. Age
## 1    Manual         2          7350   8
## 2    Manual         1          7790  13
## 3    Manual         2          5098  13
## 4    Manual         1          6816  13
## 5 Automatic         1          4642   9
## 6    Manual         1          5554  12
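
Because year(now()) changes with the run date, the ages shown above will shift if the report is knitted again in a later year. A minimal sketch of one way to pin the calculation, assuming a fixed reference year (the name reference_year and the value 2024 are illustrative assumptions chosen to match the output above, not part of the original analysis):

```r
# Hypothetical fixed reference year; 2024 reproduces the ages printed above.
reference_year <- 2024L

# Model years of the first six cars, copied from the head(df) output.
model_year <- c(2016L, 2011L, 2011L, 2011L, 2015L, 2012L)

age <- reference_year - model_year
age
```

On the full data, the equivalent line would be df$Age <- reference_year - df$Model.Year.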

Pair 1 - Price Vs Age

Visualization

df |>
  ggplot(aes(x = Age, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("Price vs Age") +
  xlab("Age of the Car (Years)") +
  ylab("Price (INR)") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Conclusions based on the plot

  • The plot shows a negative trend: as the age of the car increases, the price tends to decrease. This is expected, since cars depreciate over time.

  • After about 10 years, the prices of most cars converge towards a lower range, suggesting that beyond a certain age cars are valued similarly regardless of exactly how old they are.

  • The fitted line describes the average only. Some older cars are priced well above it, and these exceptions (outliers) should be investigated further.
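
The red line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so its slope can be obtained directly from lm(). The sketch below uses only the six rows printed earlier, so the numbers are illustrative and not the full-data estimate; with so few rows the estimate is very noisy and need not match the trend in the plot.

```r
# First six cars from the head(df) output above.
cars <- data.frame(
  Age   = c(8, 13, 13, 13, 9, 12),
  Price = c(330399, 350199, 229199, 306399, 208699, 249699)
)

# Same model that geom_smooth(method = "lm") fits: Price ~ Age.
fit <- lm(Price ~ Age, data = cars)

# The slope is the estimated change in price (INR) per extra year of age.
slope <- unname(coef(fit)["Age"])
slope
```

Replacing cars with df would give the slope of the line shown in the plot.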

Calculating correlation coefficient

cor_price_age <- cor(df$Price, df$Age)

cor_price_age
## [1] -0.4744073

The correlation coefficient of -0.474 indicates a moderate negative correlation between the age of a car and its price. This value is consistent with the visualization:

  • The plot shows that as the age of the car increases, the price generally decreases, which matches the negative sign of the coefficient.

  • The outliers seen for older cars (e.g., cars over 15 years old that have relatively high prices) weaken the correlation. Without them the coefficient might be closer to -1, but with them included a value of -0.474 makes sense.
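
For reference, cor() computes Pearson's coefficient by default: the covariance of the two variables divided by the product of their standard deviations. The sketch below reproduces that formula by hand on the six rows printed earlier and also computes Spearman's rank correlation, which is typically less sensitive to extreme values such as high-priced older cars. The six-row sample is only illustrative; its values will not match the full-data result of -0.474.

```r
age   <- c(8, 13, 13, 13, 9, 12)
price <- c(330399, 350199, 229199, 306399, 208699, 249699)

# Pearson's r written out from its definition.
dx <- age - mean(age)
dy <- price - mean(price)
pearson_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalents: Pearson (default) and rank-based Spearman.
pearson  <- cor(age, price)
spearman <- cor(age, price, method = "spearman")

c(manual = pearson_manual, pearson = pearson, spearman = spearman)
```

Comparing the Pearson and Spearman values on the full data is one quick way to gauge how much the outliers influence the coefficient.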

Building a confidence interval

# Calculate the mean and standard deviation of the Price
mean_price <- mean(df$Price, na.rm = TRUE)
sd_price <- sd(df$Price, na.rm = TRUE)

# Number of cars in the dataset (sample size)
n <- sum(!is.na(df$Price))

# Set confidence level (e.g., 95%)
confidence_level <- 0.95
alpha <- 1 - confidence_level

# Z-score for 95% confidence level
z_score <- qnorm(1 - alpha/2)

# Calculate the margin of error
margin_of_error <- z_score * (sd_price / sqrt(n))

# Calculate the confidence interval
ci_lower <- mean_price - margin_of_error
ci_upper <- mean_price + margin_of_error

# Output the confidence interval
ci_lower
## [1] 509837.2
ci_upper
## [1] 526268.8
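
The steps above use the normal quantile from qnorm(), which is a reasonable approximation for a large sample. As a hedged alternative, the same calculation can be wrapped in a small helper that uses the t distribution, the more conservative choice when the standard deviation is estimated from the sample. The function name mean_ci is hypothetical, and the six prices are just the rows printed earlier.

```r
# Hypothetical helper: t-based confidence interval for a mean.
mean_ci <- function(x, conf_level = 0.95) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

price <- c(330399, 350199, 229199, 306399, 208699, 249699)
ci <- mean_ci(price)

# t.test() reports the same interval.
builtin <- as.numeric(t.test(price, conf.level = 0.95)$conf.int)
ci
```

Calling mean_ci(df$Price) or mean_ci(df$Driven..Kms.) would replace the repeated step-by-step calculation; for a large sample the result should be very close to the z-based interval.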

Explanation

  • Interpretation of the Confidence Interval: Based on the sample data, we are 95% confident that the true mean price of cars in the population falls between INR 509,837 and INR 526,269. This range is an estimate of the average price of used cars in the market that the dataset was sampled from.

  • Significance: This confidence interval gives insight into the average price level of cars in the market. It can guide consumers and dealerships in setting expectations for the average market value of used cars.

  • Further Investigation: The interval describes the uncertainty in the estimated mean, not the spread of individual prices. Individual prices can still vary widely around the mean because of factors such as the car’s brand, condition, age, and location. Computing separate intervals for groups defined by these factors would show how the average price differs between them.
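
The width of a confidence interval for the mean depends on both the spread of the data and the sample size, so it shrinks as more cars are sampled even when individual prices remain just as variable. A small simulation sketch illustrates this; the prices are randomly generated (a mean of 500,000 and a standard deviation of 200,000 are arbitrary assumptions), not taken from Cars24.csv.

```r
set.seed(42)  # for reproducibility

# Simulated prices; the distribution parameters are arbitrary assumptions.
prices <- rnorm(10000, mean = 500000, sd = 200000)

# Width of a z-based 95% confidence interval for the mean.
ci_width <- function(x) 2 * qnorm(0.975) * sd(x) / sqrt(length(x))

width_small <- ci_width(prices[1:100])  # first 100 simulated cars
width_large <- ci_width(prices)         # all 10,000 simulated cars

c(sd_small = sd(prices[1:100]), sd_large = sd(prices),
  width_small = width_small, width_large = width_large)
```

The two standard deviations should be of similar size while the interval for the larger sample is much narrower, which is why sd() rather than the interval width is the right measure of price variability.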

Pair 2 - Age Vs Kilometers driven

Visualization

df |>
  ggplot(aes(x = Age, y = Driven..Kms.)) +
  geom_point(color = 'grey') +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("Age vs Kilometers Driven") +
  xlab("Age") +
  ylab("Kilometers Driven") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Conclusion based on the plot

  • Positive Correlation: There is a positive relationship between the age of the car and the number of kilometers driven. Older cars tend to have more kilometers driven, though the points are widely scattered around the trend line.

  • The plot shows a high level of variability in kilometers driven, especially in the 5–10 years age range, where some cars have been driven close to 750,000 kilometers while others have been driven significantly less.

  • There are notable outliers in the dataset where certain cars have been driven an unusually high number of kilometers relative to their age, especially around the 10-year mark.

This suggests that while age plays a role in how much a car has been driven, other factors like usage intensity or type of ownership may also contribute significantly to the total kilometers.
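
One common way to flag unusually high readings like those mentioned above is the 1.5 × IQR rule. The sketch below applies it to the six odometer readings printed earlier plus one made-up extreme value (750,000 km, added purely for illustration). The multiplier of 1.5 is a convention, not something the original analysis used.

```r
# Six readings from head(df) plus one hypothetical extreme value.
kms <- c(10674, 20979, 47330, 19662, 11256, 28434, 750000)

# First and third quartiles and the interquartile range.
q <- unname(quantile(kms, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Flag values more than 1.5 * IQR outside the quartiles.
is_outlier <- kms < q[1] - 1.5 * iqr | kms > q[2] + 1.5 * iqr
kms[is_outlier]
```

Applied to df$Driven..Kms., the flagged rows would be a starting point for checking whether the high readings are data-entry errors or genuinely heavy usage.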

Calculating correlation coefficient

cor_Age_kmDriven <- cor(df$Age, df$Driven..Kms.)

cor_Age_kmDriven
## [1] 0.4086148

The correlation coefficient is 0.41, indicating a moderate positive correlation. This agrees with the plot: older cars generally have more kilometers driven, but the scatter around the trend line is considerable. The coefficient therefore matches both the visual pattern and the intuition that cars accumulate kilometers as they age.
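
Unlike the mean() and sd() calls in this report, the cor() calls cannot pass na.rm = TRUE: cor() has no such argument and returns NA if either column contains a missing value. Whether Cars24.csv has missing values is not shown here, so the following is only a sketch of a defensive variant, using the six rows printed earlier with one value blanked out for illustration:

```r
age <- c(8, 13, 13, 13, 9, 12)
kms <- c(10674, 20979, 47330, NA, 11256, 28434)  # NA inserted for illustration

# Default behaviour: any missing value makes the result NA.
cor_default <- cor(age, kms)

# Drop incomplete pairs before computing the coefficient.
cor_complete <- cor(age, kms, use = "complete.obs")

c(default = cor_default, complete = cor_complete)
```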

Building a confidence interval

mean_kms_driven <- mean(df$Driven..Kms., na.rm = TRUE)
sd_kms_driven <- sd(df$Driven..Kms., na.rm = TRUE)

# Calculating the sample size
n <- sum(!is.na(df$Driven..Kms.))

# Calculating the standard error
se_kms_driven <- sd_kms_driven / sqrt(n)

# Calculating the 95% confidence interval
alpha <- 0.05
z_value <- qnorm(1 - alpha / 2)

# Lower and upper bounds of the confidence interval
lower_bound <- mean_kms_driven - z_value * se_kms_driven
upper_bound <- mean_kms_driven + z_value * se_kms_driven

cat("95% Confidence Interval for Kilometers Driven: [", lower_bound, ", ", upper_bound, "]\n")
## 95% Confidence Interval for Kilometers Driven: [ 59763.47 ,  61922.09 ]

Explanation

The 95% confidence interval means that we are 95% confident that the true population mean of kilometers driven lies between 59,763 km and 61,922 km, where the population is the set of cars that this sample represents. It is a statement about the average, not about any individual car.

In particular, the interval does not mean that 95% of cars have been driven between these values; it means that we are 95% confident that the population average lies in this range.
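
The "95% confident" statement refers to the procedure: if many samples were drawn and an interval computed from each, about 95% of those intervals would contain the true mean. A simulation sketch of that idea, using made-up normally distributed odometer readings (true mean 60,000 km, standard deviation 30,000 km, 200 cars per sample; all arbitrary assumptions):

```r
set.seed(1)  # for reproducibility

true_mean <- 60000
n <- 200
z <- qnorm(0.975)

# Does the z-based 95% interval from one simulated sample cover the true mean?
covers <- function() {
  x <- rnorm(n, mean = true_mean, sd = 30000)
  half_width <- z * sd(x) / sqrt(n)
  abs(mean(x) - true_mean) <= half_width
}

# Share of 2,000 simulated intervals that contain the true mean.
coverage <- mean(replicate(2000, covers()))
coverage
```

The resulting share should be close to 0.95, which is the sense in which a single interval from this report is "95% confident".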