Week 4 Assignment

Importing data set

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library(ggplot2)
library(patchwork)
df <- read_csv("Cars24.csv")
## New names:
## • `` -> `...1`
## Rows: 5918 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Car Brand, Model, Location, Fuel, Gear
## dbl (6): ...1, Price, Model Year, Driven (Kms), Ownership, EMI (monthly)
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df, n = 5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1     0 Hyundai     EonERA PL… 330399         2016 Hyderab… Petr…          10674
## 2     1 Maruti      Wagon R 1… 350199         2011 Hyderab… Petr…          20979
## 3     2 Maruti      Alto K10L… 229199         2011 Hyderab… Petr…          47330
## 4     3 Maruti      RitzVXI B… 306399         2011 Hyderab… Petr…          19662
## 5     4 Tata        NanoTWIST… 208699         2015 Hyderab… Petr…          11256
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>

Collecting 5 random samples

To collect 5 random samples from df (the population), I pass the output of the sample function to df as row indices and store each result in its own variable.

I use a sample size of 3500 because the question asks for more than 50% of the population; 3500 of 5918 rows is about 59%.

sample_df1 <- df[sample(nrow(df), 3500),]
sample_df2 <- df[sample(nrow(df), 3500),]
sample_df3 <- df[sample(nrow(df), 3500),]
sample_df4 <- df[sample(nrow(df), 3500),]
sample_df5 <- df[sample(nrow(df), 3500),]
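
The five draws above change on every run because no seed is set. Below is a minimal sketch of a reproducible version. It assumes a stand-in data frame `toy_df` (hypothetical; the real `df` comes from Cars24.csv) and an arbitrary seed value, and it keeps the samples in a named list instead of five separate variables.

```r
# Stand-in for the real data: 5918 rows with an id column (hypothetical data).
toy_df <- data.frame(id = seq_len(5918), price = seq_len(5918) * 100)

set.seed(2024)  # arbitrary seed, chosen only so the draws can be reproduced

# Draw 5 samples of 3500 rows each, without replacement within a sample.
samples <- lapply(1:5, function(i) toy_df[sample(nrow(toy_df), 3500), ])
names(samples) <- paste0("sample_df", 1:5)

# Every sample has 3500 rows and the same columns as the source.
sapply(samples, nrow)
```

Keeping the samples in a list means later steps can be applied to all of them with one `lapply` call rather than five near-identical lines.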

Let’s look at the first 5 rows of each sample.

head(sample_df1,5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1  2644 Maruti      <NA>       572399         2016 Delhi    Dies…          55768
## 2  4282 Hyundai     i10MAGNA … 201399         2010 Mumbai   Petr…          39989
## 3  3127 Mercedes    Benz C Cl… 475000         2009 Mumbai   Petr…          65805
## 4  1735 Maruti      RitzVXI B… 295799         2011 Delhi    Petr…         115774
## 5  4980 Maruti      Alto K10L… 272099         2011 Bangalo… Petr…          82396
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df2,5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1  3349 Honda       CityS MT … 230899         2009 Mumbai   Petr…         110210
## 2  3894 Hyundai     Elite i20… 574899         2015 Mumbai   Dies…          37423
## 3  5540 Maruti      Zen Estil… 241299         2010 Chennai  Petr…          43110
## 4  4533 Renault     <NA>       414499         2018 Bangalo… Petr…          19093
## 5   120 Maruti      AltoLXI    242199         2012 Hyderab… Petr…          64304
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df3,5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1   954 Maruti      ErtigaVDI… 6.57e5         2017 Delhi    Dies…          49237
## 2  1690 Mercedes    Benz E Cl… 1.41e6         2013 Delhi    Dies…          52110
## 3  3260 Maruti      CelerioVX… 3.99e5         2014 Mumbai   Petr…          58624
## 4  4654 Toyota      Etios Liv… 3.57e5         2011 Bangalo… Petr…          54931
## 5  3313 Hyundai     Creta1.6 … 1.17e6         2017 Mumbai   Dies…          37664
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df4,5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1  4581 Maruti      Alto 800L… 3.03e5         2013 Bangalo… Petr…          16178
## 2  3376 Maruti      SwiftVDI   6.95e5         2019 Mumbai   Dies…          66257
## 3  1193 Mahindra    <NA>       1.20e6         2018 Delhi    Dies…          20741
## 4  3585 Hyundai     Elite i20… 5.45e5         2014 Mumbai   Dies…         912380
## 5  5412 Maruti      Alto 800L… 2.96e5         2013 Chennai  Petr…          43405
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df5,5)
## # A tibble: 5 × 11
##    ...1 `Car Brand` Model       Price `Model Year` Location Fuel  `Driven (Kms)`
##   <dbl> <chr>       <chr>       <dbl>        <dbl> <chr>    <chr>          <dbl>
## 1  1346 Renault     Duster85 … 456599         2014 Delhi    Dies…          67522
## 2  5680 Maruti      RitzVXI B… 267599         2010 Chennai  Petr…          83163
## 3  5206 Hyundai     i10SPORTZ… 343699         2011 Bangalo… Petr…          48576
## 4  3824 Maruti      Wagon R 1… 411599         2015 Mumbai   Petr…          32660
## 5   738 Maruti      A StarVXI  238999         2010 Delhi    Petr…          57489
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>

Let’s check the dimensions of each sample.

dim(sample_df1)
## [1] 3500   11
dim(sample_df2)
## [1] 3500   11
dim(sample_df3)
## [1] 3500   11
dim(sample_df4)
## [1] 3500   11
dim(sample_df5)
## [1] 3500   11

All five samples have the same dimensions. The checks below confirm that no two of the compared samples are identical row for row. The samples still overlap: since each one holds 3500 of the 5918 rows, any two of them share at least 1082 rows (3500 + 3500 - 5918).

all(sample_df1 == sample_df2)
## [1] FALSE
all(sample_df2 == sample_df3)
## [1] FALSE
all(sample_df3 == sample_df4)
## [1] FALSE
all(sample_df4 == sample_df5)
## [1] FALSE
all(sample_df5 == sample_df1)
## [1] FALSE
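
The comparisons above only show that the samples are not identical; they do not say how much the samples overlap. The sketch below counts shared rows between two draws. It uses plain index vectors in place of a row identifier column, and the seed is an arbitrary assumption.

```r
n_pop <- 5918
n_samp <- 3500

set.seed(1)  # arbitrary seed for reproducibility
ids_a <- sample(n_pop, n_samp)  # row ids of one sample
ids_b <- sample(n_pop, n_samp)  # row ids of another sample

# Lower bound on the overlap (pigeonhole argument).
min_overlap <- 2 * n_samp - n_pop

# Observed overlap and the value expected under simple random sampling.
observed_overlap <- length(intersect(ids_a, ids_b))
expected_overlap <- n_samp^2 / n_pop

c(min = min_overlap, observed = observed_overlap, expected = round(expected_overlap))
```

On the real samples, the same count could be obtained with `intersect(sample_df1[["...1"]], sample_df2[["...1"]])`, assuming the `...1` column is a unique row identifier, which the printed rows suggest but do not prove.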

Detecting Anomaly

Inspecting each sample

Here I find the top 5 brands in each sample and plot them as bar charts to compare them visually.

top5_sample1 <- sample_df1 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample2 <- sample_df2 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample3 <- sample_df3 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample4 <- sample_df4 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample5 <- sample_df5 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
p1 <- ggplot(sample_df1 |> filter(`Car Brand` %in% top5_sample1$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 1") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(sample_df2 |> filter(`Car Brand` %in% top5_sample2$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 2") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p3 <- ggplot(sample_df3 |> filter(`Car Brand` %in% top5_sample3$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 3") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p4 <- ggplot(sample_df4 |> filter(`Car Brand` %in% top5_sample4$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 4") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p5 <- ggplot(sample_df5 |> filter(`Car Brand` %in% top5_sample5$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 5") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p1 + p2 + p3 + p4 + p5 + plot_annotation(title = "Top 5 brands in each sample")
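
The top-5 step is repeated once per sample, so it could be wrapped in a small helper. The sketch below is one way to do that; `top_brands` and the toy data are hypothetical names introduced for illustration. It uses `slice_max()`, which dplyr recommends in place of the superseded `top_n()`.

```r
library(dplyr)

# Return the n most frequent brands in a data frame (ties are kept).
top_brands <- function(data, n = 5) {
  data |>
    count(`Car Brand`, name = "Count") |>
    slice_max(Count, n = n)
}

# Toy data (hypothetical), only to show the shape of the result.
toy <- tibble(`Car Brand` = c("A", "A", "A", "B", "B", "C"))
top_brands(toy, n = 2)
```

Unlike `top_n()`, `slice_max()` returns the rows sorted from the highest count down, so the ranking is visible directly in the output.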

Referring to the population

The plots show that Honda, Hyundai, Maruti, Toyota and Volkswagen are the top 5 brands in 4 of the 5 samples. In sample 5, Renault takes the place of Volkswagen. So one sample may disagree with the population on the top 5 brands.

Let’s check what the top 5 brands are in the entire population.

df |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
##   `Car Brand` count
##   <chr>       <int>
## 1 Honda         465
## 2 Hyundai      1281
## 3 Maruti       2819
## 4 Toyota        301
## 5 Volkswagen    192

As the above data frame shows, the top 5 brands of the entire population are Honda, Hyundai, Maruti, Toyota and Volkswagen.

Anomaly found

Comparing the 5 samples with the population shows that the top 5 brands of sample 5 do not match those of the population. Sample 5 therefore contains an anomaly.

Let’s check the top 5 brands for some bigger samples and see whether this anomaly is present in them as well.

Checking with first big sample

df[sample(nrow(df), 4000),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
##   `Car Brand` count
##   <chr>       <int>
## 1 Honda         295
## 2 Hyundai       854
## 3 Maruti       1923
## 4 Toyota        203
## 5 Volkswagen    138

Checking with second big sample

df[sample(nrow(df), 4200),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
##   `Car Brand` count
##   <chr>       <int>
## 1 Honda         319
## 2 Hyundai       941
## 3 Maruti       1983
## 4 Toyota        206
## 5 Volkswagen    140

Checking with third big sample

df[sample(nrow(df), 4500),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
##   `Car Brand` count
##   <chr>       <int>
## 1 Honda         351
## 2 Hyundai       979
## 3 Maruti       2130
## 4 Toyota        222
## 5 Volkswagen    152

None of the three bigger samples shows the anomaly, which suggests that bigger samples make such anomalies less likely. Three draws are limited evidence, though.
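
Three draws cannot show how often the anomaly occurs at a given sample size. The sketch below estimates that rate by simulation. The first five brand counts come from the population table above; the Renault count and the four filler brands are hypothetical values chosen only so that the total is 5918 rows, so the resulting rates illustrate the method and are not results for the real data.

```r
# Brand counts: the first five are from the population table above;
# Renault and the filler brands are hypothetical.
brand_counts <- c(Maruti = 2819, Hyundai = 1281, Honda = 465, Toyota = 301,
                  Volkswagen = 192, Renault = 180,
                  Other1 = 170, Other2 = 170, Other3 = 170, Other4 = 170)
population <- rep(names(brand_counts), times = brand_counts)
pop_top5 <- names(sort(brand_counts, decreasing = TRUE))[1:5]

# Share of simulated samples whose top 5 brands differ from the population's.
mismatch_rate <- function(size, reps = 500) {
  mean(replicate(reps, {
    counts <- table(sample(population, size))
    samp_top5 <- names(sort(counts, decreasing = TRUE))[1:5]
    !setequal(samp_top5, pop_top5)
  }))
}

set.seed(42)  # arbitrary seed
sapply(c(500, 3500, 4500, 5500), mismatch_rate)
```

The rate is zero by construction when the sample is the whole population, and it should tend to shrink as the sample size approaches that limit. How quickly it shrinks depends on how close the fifth and sixth brands are in the real data.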

Finding Consistency

I create a function named summarize_location that groups the data by location and returns the count and percentage share of each location.

summarize_location <- function(data, location_column) {
  data |>
    count(!!sym(location_column)) |>
    mutate(Percentage = round(100 * n / sum(n), 1))
}
df_summary <- summarize_location(df, "Location")
sample1_summary <- summarize_location(sample_df1, "Location")
sample2_summary <- summarize_location(sample_df2, "Location")
sample3_summary <- summarize_location(sample_df3, "Location")
sample4_summary <- summarize_location(sample_df4, "Location")
sample5_summary <- summarize_location(sample_df5, "Location")
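
The six calls above could also be written as one pass over a list. The sketch below shows the idea on toy data. The helper is repeated so the block runs on its own, with an `rlang::` prefix added so it does not depend on which packages are attached; `toy_samples` and its values are hypothetical.

```r
library(dplyr)

# Same helper as above, repeated so this block is self-contained.
summarize_location <- function(data, location_column) {
  data |>
    count(!!rlang::sym(location_column)) |>
    mutate(Percentage = round(100 * n / sum(n), 1))
}

# Toy samples (hypothetical data) standing in for sample_df1 ... sample_df5.
toy_samples <- list(
  s1 = tibble(Location = c("Delhi", "Delhi", "Mumbai", "Chennai")),
  s2 = tibble(Location = c("Delhi", "Mumbai", "Mumbai", "Mumbai"))
)

# One summary per sample, stacked into a single table with a `sample` column.
all_summaries <- lapply(toy_samples, summarize_location,
                        location_column = "Location") |>
  bind_rows(.id = "sample")
all_summaries
```

A stacked table like this keeps track of which sample each row came from, which makes later grouping and averaging across samples straightforward.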

Inspecting each sample

pie1 <- ggplot(sample1_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 1") +
  theme_void() + 
  theme(legend.position = "none") 

pie2 <- ggplot(sample2_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 2") +
  theme_void() + 
  theme(legend.position = "none")

pie3 <- ggplot(sample3_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 3") +
  theme_void() + 
  theme(legend.position = "none")

pie4 <- ggplot(sample4_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 4") +
  theme_void() + 
  theme(legend.position = "none")

pie5 <- ggplot(sample5_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 5") +
  theme_void()



pie1 + pie2 + pie3 + pie4 + pie5 + plot_annotation(title = "Location split up in each sample")
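
The five pie charts differ only in their data, title and legend, so the plotting code could be factored into one function. The sketch below is one possible version; `make_pie`, its `show_legend` argument and the toy summary are hypothetical names introduced here. It uses `geom_col()`, which is equivalent to `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Build one pie chart from a summary table with Location and Percentage columns.
make_pie <- function(summary_tbl, title, show_legend = FALSE) {
  ggplot(summary_tbl, aes(x = "", y = Percentage, fill = Location)) +
    geom_col(width = 1) +
    coord_polar("y") +
    geom_text(aes(label = paste0(Percentage, "%")),
              position = position_stack(vjust = 0.5), size = 3) +
    labs(title = title) +
    theme_void() +
    theme(legend.position = if (show_legend) "right" else "none")
}

# Toy summary (hypothetical values) to show the call.
toy_summary <- data.frame(Location = c("Delhi", "Mumbai"), Percentage = c(60, 40))
pie_toy <- make_pie(toy_summary, "Toy sample")
```

With the real summaries, a call such as `make_pie(sample5_summary, "Sample 5", show_legend = TRUE)` would reproduce the last chart, which is the only one that keeps its legend.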

Referring to the population

ggplot(df_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y") + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5)) +
  labs(title = "Location split of the entire population") +
  theme_void()

Consistency found

Looking at the location split for the samples and the population, I don’t see any major differences. To be sure, let’s look at the actual numbers by averaging the counts across the samples.

The code below computes the average count per location across the 5 samples and the corresponding percentage.

location_avg <- bind_rows(sample1_summary, sample2_summary, sample3_summary, sample4_summary, sample5_summary)|>
  group_by(Location) |>
  summarise(avg = mean(n))

location_avg <- location_avg |>
  mutate(Percentage = round(avg/sum(avg)*100,1))

location_avg
## # A tibble: 5 × 3
##   Location    avg Percentage
##   <chr>     <dbl>      <dbl>
## 1 Bangalore  491.       14  
## 2 Chennai    371.       10.6
## 3 Delhi     1370.       39.1
## 4 Hyderabad  248         7.1
## 5 Mumbai    1020.       29.1

Below is the percentage breakdown by location for the population.

df_summary
## # A tibble: 5 × 3
##   Location      n Percentage
##   <chr>     <int>      <dbl>
## 1 Bangalore   822       13.9
## 2 Chennai     614       10.4
## 3 Delhi      2312       39.1
## 4 Hyderabad   410        6.9
## 5 Mumbai     1760       29.7

Comparing the two tables, the location breakdown is consistent between the 5 samples and the entire population: every location’s share differs by at most 0.6 percentage points.
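
The comparison can also be made explicit by putting the two sets of percentages side by side. The sketch below copies the values from the two tables above into a small data frame (`comparison` is a name introduced here) and computes the gap per location.

```r
# Percentages copied from the two tables above.
comparison <- data.frame(
  Location   = c("Bangalore", "Chennai", "Delhi", "Hyderabad", "Mumbai"),
  sample_pct = c(14.0, 10.6, 39.1, 7.1, 29.1),
  pop_pct    = c(13.9, 10.4, 39.1, 6.9, 29.7)
)

# Difference in percentage points, sample average minus population.
comparison$diff <- round(comparison$sample_pct - comparison$pop_pct, 1)
comparison

# Largest absolute gap across the five locations.
max(abs(comparison$diff))
```

The largest gap is for Mumbai, where the sample average is 0.6 percentage points below the population share; Delhi matches exactly.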

Conclusion

From the above analysis, I understand that an anomaly observed in a single sample is not representative of the entire population, and that anomalies become less likely with a bigger sample of the population.

A pattern that holds across all the samples indicates consistency. Such patterns should still be validated with bigger samples to be sure.

In future analyses, using a bigger sample should give more reliable and generalizable insights. Sampling multiple times and getting consistent results across the samples gives me more reason to trust the findings.