library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library(ggplot2)
library(patchwork)
df <- read_csv("Cars24.csv")
## New names:
## Rows: 5918 Columns: 11
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): Car Brand, Model, Location, Fuel, Gear dbl (6): ...1, Price, Model Year,
## Driven (Kms), Ownership, EMI (monthly)
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(df, n = 5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 0 Hyundai EonERA PL… 330399 2016 Hyderab… Petr… 10674
## 2 1 Maruti Wagon R 1… 350199 2011 Hyderab… Petr… 20979
## 3 2 Maruti Alto K10L… 229199 2011 Hyderab… Petr… 47330
## 4 3 Maruti RitzVXI B… 306399 2011 Hyderab… Petr… 19662
## 5 4 Tata NanoTWIST… 208699 2015 Hyderab… Petr… 11256
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
To draw 5 random samples from df (the population), I pass the row indices returned by sample() back into df and store each result in its own variable. sample() draws without replacement by default, so no row is repeated within a sample.
Each sample has 3,500 rows, about 59% of the 5,918 rows in the population, since the question asked for more than 50% of the population.
sample_df1 <- df[sample(nrow(df), 3500),]
sample_df2 <- df[sample(nrow(df), 3500),]
sample_df3 <- df[sample(nrow(df), 3500),]
sample_df4 <- df[sample(nrow(df), 3500),]
sample_df5 <- df[sample(nrow(df), 3500),]
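The five near-identical lines above can also be written as a single call, and fixing the random seed makes the draws repeatable. The sketch below is a minimal illustration on a toy data frame (toy_df, the sample size of 60 and the seed value 123 are assumptions, not part of the original analysis); with the real data, df would take the place of toy_df and 3500 the place of 60. Note that setting a seed would produce different rows from the ones printed in this report.

```r
# Toy stand-in for the population (hypothetical data, 100 rows).
toy_df <- data.frame(id = 1:100, price = seq(1000, 100000, length.out = 100))

# Fix the seed so the same rows are drawn on every run (123 is arbitrary).
set.seed(123)

# Draw 5 samples of 60 rows each, without replacement within a sample.
samples <- replicate(
  5,
  toy_df[sample(nrow(toy_df), 60), ],
  simplify = FALSE
)

# Each element of the list is one sample.
sapply(samples, nrow)
```

Keeping the samples in a list also makes it easy to apply the same step to all of them with lapply() or sapply().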
Let’s look at the first 5 rows of each sample.
head(sample_df1,5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 2644 Maruti <NA> 572399 2016 Delhi Dies… 55768
## 2 4282 Hyundai i10MAGNA … 201399 2010 Mumbai Petr… 39989
## 3 3127 Mercedes Benz C Cl… 475000 2009 Mumbai Petr… 65805
## 4 1735 Maruti RitzVXI B… 295799 2011 Delhi Petr… 115774
## 5 4980 Maruti Alto K10L… 272099 2011 Bangalo… Petr… 82396
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df2,5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 3349 Honda CityS MT … 230899 2009 Mumbai Petr… 110210
## 2 3894 Hyundai Elite i20… 574899 2015 Mumbai Dies… 37423
## 3 5540 Maruti Zen Estil… 241299 2010 Chennai Petr… 43110
## 4 4533 Renault <NA> 414499 2018 Bangalo… Petr… 19093
## 5 120 Maruti AltoLXI 242199 2012 Hyderab… Petr… 64304
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df3,5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 954 Maruti ErtigaVDI… 6.57e5 2017 Delhi Dies… 49237
## 2 1690 Mercedes Benz E Cl… 1.41e6 2013 Delhi Dies… 52110
## 3 3260 Maruti CelerioVX… 3.99e5 2014 Mumbai Petr… 58624
## 4 4654 Toyota Etios Liv… 3.57e5 2011 Bangalo… Petr… 54931
## 5 3313 Hyundai Creta1.6 … 1.17e6 2017 Mumbai Dies… 37664
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df4,5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 4581 Maruti Alto 800L… 3.03e5 2013 Bangalo… Petr… 16178
## 2 3376 Maruti SwiftVDI 6.95e5 2019 Mumbai Dies… 66257
## 3 1193 Mahindra <NA> 1.20e6 2018 Delhi Dies… 20741
## 4 3585 Hyundai Elite i20… 5.45e5 2014 Mumbai Dies… 912380
## 5 5412 Maruti Alto 800L… 2.96e5 2013 Chennai Petr… 43405
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
head(sample_df5,5)
## # A tibble: 5 × 11
## ...1 `Car Brand` Model Price `Model Year` Location Fuel `Driven (Kms)`
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1346 Renault Duster85 … 456599 2014 Delhi Dies… 67522
## 2 5680 Maruti RitzVXI B… 267599 2010 Chennai Petr… 83163
## 3 5206 Hyundai i10SPORTZ… 343699 2011 Bangalo… Petr… 48576
## 4 3824 Maruti Wagon R 1… 411599 2015 Mumbai Petr… 32660
## 5 738 Maruti A StarVXI 238999 2010 Delhi Petr… 57489
## # ℹ 3 more variables: Gear <chr>, Ownership <dbl>, `EMI (monthly)` <dbl>
Let’s check the dimensions of each sample.
dim(sample_df1)
## [1] 3500 11
dim(sample_df2)
## [1] 3500 11
dim(sample_df3)
## [1] 3500 11
dim(sample_df4)
## [1] 3500 11
dim(sample_df5)
## [1] 3500 11
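All the samples have the same dimensions, but they are not identical, as the code below shows. Note that == compares the data frames cell by cell in row order, so FALSE only tells us that two samples do not contain the same rows in the same order. Some overlap between them is unavoidable, since each sample is drawn from the same population: two samples of 3,500 rows out of 5,918 must share at least 1,082 rows.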
all(sample_df1 == sample_df2)
## [1] FALSE
all(sample_df2 == sample_df3)
## [1] FALSE
all(sample_df3 == sample_df4)
## [1] FALSE
all(sample_df4 == sample_df5)
## [1] FALSE
all(sample_df5 == sample_df1)
## [1] FALSE
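Because == compares rows by position, it does not say how much two samples overlap. A minimal sketch of one way to measure that is to compare row identifiers, shown here on toy data (the sizes N and n are assumptions; in the real data the ...1 index column could play the role of the identifier). For two independent samples of size n drawn without replacement from N rows, the expected number of shared rows is n * n / N, which is roughly 2,070 for n = 3,500 and N = 5,918.

```r
# Hypothetical population of 1000 row identifiers.
set.seed(1)
N <- 1000
n <- 600
ids_a <- sample(N, n)  # row ids in the first sample
ids_b <- sample(N, n)  # row ids in the second sample

# Rows that appear in both samples.
shared <- intersect(ids_a, ids_b)
length(shared)

# Expected overlap for two independent samples drawn without replacement.
expected_overlap <- n * n / N
expected_overlap

# Lower bound on the overlap when 2 * n exceeds N.
min_overlap <- max(0, 2 * n - N)
min_overlap
```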
Next, I find the top 5 brands in each sample and plot them as bar charts to compare them visually.
top5_sample1 <- sample_df1 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample2 <- sample_df2 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample3 <- sample_df3 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample4 <- sample_df4 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
top5_sample5 <- sample_df5 |> group_by(`Car Brand`) |> summarise(Count = n()) |> top_n(5)
## Selecting by Count
p1 <- ggplot(sample_df1 |> filter(`Car Brand` %in% top5_sample1$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 1") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(sample_df2 |> filter(`Car Brand` %in% top5_sample2$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 2") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p3 <- ggplot(sample_df3 |> filter(`Car Brand` %in% top5_sample3$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 3") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4 <- ggplot(sample_df4 |> filter(`Car Brand` %in% top5_sample4$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 4") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5 <- ggplot(sample_df5 |> filter(`Car Brand` %in% top5_sample5$`Car Brand`), aes(x = `Car Brand`)) + geom_bar() + labs(x = "Car Brands", y = "Count", title = "Sample 5") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p1 + p2 + p3 + p4 + p5 + plot_annotation(title = "Top 5 brands in each sample")
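The same top-5 step is repeated for every sample, so it can be wrapped in a small helper and applied to a list of samples. The sketch below uses base R on a made-up brand vector (top_k, brands and toy_samples are illustrative names, not from the original code); it is one possible refactoring, not the code that produced the plots above.

```r
# Return the k most frequent values of a vector, with their counts.
top_k <- function(x, k = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  head(counts, k)
}

# Hypothetical brand column for illustration.
brands <- c(rep("A", 50), rep("B", 30), rep("C", 20), rep("D", 10),
            rep("E", 5), rep("F", 2))

set.seed(2)
toy_samples <- replicate(3, sample(brands, 80), simplify = FALSE)

# One call covers every sample; the result is a list of count tables.
top_by_sample <- lapply(toy_samples, top_k, k = 5)
top_by_sample[[1]]

# Top 5 of the full toy population, for comparison.
names(top_k(brands))
```

Within dplyr, top_n() has been superseded by slice_max(), which would be the closer equivalent if the pipeline style of this report is kept.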
The plots show that Honda, Hyundai, Maruti, Toyota and Volkswagen are the top 5 brands in 4 of the 5 samples. In sample 5, Renault appears in place of Volkswagen, so the top 5 brands of this one sample may not agree with the population.
Let’s check which brands are the top 5 in the entire population.
df |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
## `Car Brand` count
## <chr> <int>
## 1 Honda 465
## 2 Hyundai 1281
## 3 Maruti 2819
## 4 Toyota 301
## 5 Volkswagen 192
As the output above shows, the top 5 brands in the entire population are Honda, Hyundai, Maruti, Toyota and Volkswagen.
Comparing the 5 samples with the population, it is clear that
the top 5 brands of sample 5 do not match those of the population. Hence
sample 5 is an anomaly. Volkswagen is the smallest of the population's top 5
(192 cars), so a brand with a similar count can overtake it by chance in a single sample.
Let’s check the top 5 brands for some bigger samples and see whether
this anomaly is still present in them.
First big sample (4,000 rows):
df[sample(nrow(df), 4000),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
## `Car Brand` count
## <chr> <int>
## 1 Honda 295
## 2 Hyundai 854
## 3 Maruti 1923
## 4 Toyota 203
## 5 Volkswagen 138
Checking with a second big sample (4,200 rows):
df[sample(nrow(df), 4200),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
## `Car Brand` count
## <chr> <int>
## 1 Honda 319
## 2 Hyundai 941
## 3 Maruti 1983
## 4 Toyota 206
## 5 Volkswagen 140
Checking with a third big sample (4,500 rows):
df[sample(nrow(df), 4500),] |>
  group_by(`Car Brand`) |>
  summarise(count = n()) |>
  top_n(n = 5)
## Selecting by count
## # A tibble: 5 × 2
## `Car Brand` count
## <chr> <int>
## 1 Honda 351
## 2 Hyundai 979
## 3 Maruti 2130
## 4 Toyota 222
## 5 Volkswagen 152
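With the bigger samples the anomaly did not reappear: all three have the same top 5 brands as the population. Three draws are only limited evidence, but they are consistent with the idea that bigger samples are less prone to such anomalies.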
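One way to probe this more systematically is to repeat the sampling many times at several sample sizes and record how often a sample's top 5 differs from the population's. The sketch below does this on a synthetic population whose brand labels and counts are invented for illustration (they are not the Cars24 figures); top_set, mismatch_rate and the chosen sizes are likewise assumptions.

```r
# Synthetic population: brand labels and counts are hypothetical.
counts <- c(A = 2800, B = 1300, C = 470, D = 300, E = 190, F = 175, G = 160)
population <- rep(names(counts), times = counts)

# Names of the k most frequent values.
top_set <- function(x, k = 5) {
  names(head(sort(table(x), decreasing = TRUE), k))
}

pop_top <- top_set(population)

# Share of repeated samples whose top-k set differs from the population's.
mismatch_rate <- function(size, reps = 200) {
  mismatches <- replicate(reps, {
    s <- sample(population, size)  # drawn without replacement
    !setequal(top_set(s), pop_top)
  })
  mean(mismatches)
}

set.seed(3)
sizes <- c(500, 3500, length(population))
rates <- sapply(sizes, mismatch_rate)
data.frame(size = sizes, mismatch_rate = rates)
```

When the sample is the whole population the mismatch rate is zero by construction; how quickly it falls with sample size depends on how close the fifth and sixth brands are in the population.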
Next, I create a function named summarize_location that groups the data by location and returns the count and percentage share of each location.
summarize_location <- function(data, location_column) {
  data |>
    count(!!sym(location_column)) |>
    mutate(Percentage = round(100 * n / sum(n), 1))
}
df_summary <- summarize_location(df, "Location")
sample1_summary <- summarize_location(sample_df1, "Location")
sample2_summary <- summarize_location(sample_df2, "Location")
sample3_summary <- summarize_location(sample_df3, "Location")
sample4_summary <- summarize_location(sample_df4, "Location")
sample5_summary <- summarize_location(sample_df5, "Location")
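To make clear what summarize_location returns, here is a base R sketch of the same count-and-percentage computation on a small made-up data frame (toy_cars and summarize_column are hypothetical names). It is meant only to illustrate the shape of the output, not to replace the dplyr version above.

```r
# Hypothetical data: 10 cars across 3 locations.
toy_cars <- data.frame(
  Location = c("Delhi", "Delhi", "Delhi", "Delhi", "Delhi",
               "Mumbai", "Mumbai", "Mumbai", "Chennai", "Chennai")
)

# Count rows per value of one column and add the percentage share.
summarize_column <- function(data, column) {
  tab <- table(data[[column]])
  out <- data.frame(
    value = names(tab),
    n = as.integer(tab),
    stringsAsFactors = FALSE
  )
  out$Percentage <- round(100 * out$n / sum(out$n), 1)
  out
}

toy_summary <- summarize_column(toy_cars, "Location")
toy_summary
```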
pie1 <- ggplot(sample1_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 1") +
  theme_void() +
  theme(legend.position = "none")
pie2 <- ggplot(sample2_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 2") +
  theme_void() +
  theme(legend.position = "none")
pie3 <- ggplot(sample3_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 3") +
  theme_void() +
  theme(legend.position = "none")
pie4 <- ggplot(sample4_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 4") +
  theme_void() +
  theme(legend.position = "none")
pie5 <- ggplot(sample5_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Sample 5") +
  theme_void()
pie1 + pie2 + pie3 + pie4 + pie5 + plot_annotation(title = "Location split up in each sample")
ggplot(df_summary, aes(x = "", y = Percentage, fill = Location)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Location split of the entire population") +
  theme_void()
Looking at the location splits for the samples and the population, I don’t see any major differences. To be sure, let’s look at the actual numbers by averaging the counts across the samples.
The code below computes the average count per location across the 5 samples and its percentage share.
location_avg <- bind_rows(sample1_summary, sample2_summary, sample3_summary, sample4_summary, sample5_summary) |>
  group_by(Location) |>
  summarise(avg = mean(n))
location_avg <- location_avg |>
  mutate(Percentage = round(avg / sum(avg) * 100, 1))
location_avg
## # A tibble: 5 × 3
## Location avg Percentage
## <chr> <dbl> <dbl>
## 1 Bangalore 491. 14
## 2 Chennai 371. 10.6
## 3 Delhi 1370. 39.1
## 4 Hyderabad 248 7.1
## 5 Mumbai 1020. 29.1
Below is the percentage breakdown by location for the population.
df_summary
## # A tibble: 5 × 3
## Location n Percentage
## <chr> <int> <dbl>
## 1 Bangalore 822 13.9
## 2 Chennai 614 10.4
## 3 Delhi 2312 39.1
## 4 Hyderabad 410 6.9
## 5 Mumbai 1760 29.7
Comparing the two tables, the location breakdown averaged over the 5 samples is consistent with that of the entire population: no location differs by more than 0.6 percentage points.
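The comparison can be made explicit by subtracting the two percentage columns. The sketch below retypes the values from the two tables above by hand (the vector names are illustrative) and reports the gap in percentage points for each location.

```r
locations <- c("Bangalore", "Chennai", "Delhi", "Hyderabad", "Mumbai")

# Percentages copied from the two tables printed above.
sample_avg_pct <- c(14.0, 10.6, 39.1, 7.1, 29.1)
population_pct <- c(13.9, 10.4, 39.1, 6.9, 29.7)

# Difference in percentage points (sample average minus population).
diff_pp <- round(sample_avg_pct - population_pct, 1)

comparison <- data.frame(
  Location = locations,
  sample_avg_pct,
  population_pct,
  diff_pp
)
comparison

# Largest absolute gap across locations.
max(abs(diff_pp))
```

The largest gap is for Mumbai, where the sample average is 0.6 percentage points below the population share.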
From the above analysis, I understand that an anomaly observed in a single sample is not necessarily representative of the entire population, and that such anomalies can be reduced by drawing a bigger sample.
A pattern that appears across all the samples indicates consistency, but it should still be validated with bigger samples to be sure.
For my future analyses, I take away that using a bigger sample gives more reliable and generalizable insights, and that when repeated sampling gives consistent results across the samples, I can place more trust in the conclusions.