library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(pwr)
library(dplyr)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
df <- read.delim('Cars24.csv', sep = ",")
df <- df |>
  filter(Gear %in% c("Automatic", "Manual"))
head(df)
## X Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 0 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 1 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 2 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 3 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 4 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 5 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly.
## 1 Manual 2 7350
## 2 Manual 1 7790
## 3 Manual 2 5098
## 4 Manual 1 6816
## 5 Automatic 1 4642
## 6 Manual 1 5554
Null Hypothesis (H0): There is no difference in the mean kilometers driven between cars with manual and automatic transmissions.
Alternative Hypothesis (H1): There is a difference in the mean kilometers driven between cars with manual and automatic transmissions.
stat <- df |>
  group_by(Gear) |>
  summarize(
    mean_finishing = mean(Driven..Kms., na.rm = TRUE),
    sd_finishing = sd(Driven..Kms., na.rm = TRUE),
    n = n()
  )
print(stat)
## # A tibble: 2 × 4
## Gear mean_finishing sd_finishing n
## <chr> <dbl> <dbl> <int>
## 1 Automatic 55534. 36739. 578
## 2 Manual 62188. 42705. 5075
The summary statistics above show that cars with manual transmission have a higher mean kilometers driven (about 62,188 km) than cars with automatic transmission (about 55,534 km).
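Whether a gap of this size is larger than sampling noise can be checked with a Welch two-sample t-test. The sketch below is a hand computation from the rounded summary values printed above, not a call on the raw data, so its result will differ slightly from `t.test()` on `df`. All variable names are illustrative.

```r
# Summary statistics copied (rounded) from the table above
m_manual <- 62188; s_manual <- 42705; n_manual <- 5075
m_auto   <- 55534; s_auto   <- 36739; n_auto   <- 578

# Variance of each group mean, without assuming equal variances
v_manual <- s_manual^2 / n_manual
v_auto   <- s_auto^2 / n_auto
se_diff  <- sqrt(v_manual + v_auto)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (m_manual - m_auto) / se_diff
df_welch <- (v_manual + v_auto)^2 /
  (v_manual^2 / (n_manual - 1) + v_auto^2 / (n_auto - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

cat("t =", round(t_stat, 2),
    " df =", round(df_welch, 1),
    " p =", signif(p_value, 3), "\n")
```

With the raw data available, `t.test(Driven..Kms. ~ Gear, data = df)` should give the same test without the rounding.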
I am setting the significance level to 5%, the conventional choice. This means accepting a 5% probability of rejecting the null hypothesis when it is in fact true, which is a Type I error.
I am setting the power to 0.85. This means that if a true difference of the assumed size exists, there is an 85% probability of correctly rejecting the null hypothesis and a 15% probability of failing to reject it, which is a Type II error.
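The two error rates can be illustrated with a small simulation. This is a sketch on made-up normal data, not on the Cars24 data: the group size of 30 and the true effect of 0.5 standard deviations are hypothetical values chosen only for illustration.

```r
set.seed(1)              # fixed seed so the sketch is repeatable
alpha       <- 0.05
n_sims      <- 2000
n_per_group <- 30        # hypothetical group size

# p-value of a two-sample t-test on simulated normal data whose true
# standardized mean difference is `effect`
sim_p <- function(effect) {
  x <- rnorm(n_per_group, mean = 0)
  y <- rnorm(n_per_group, mean = effect)
  t.test(x, y)$p.value
}

# No true difference: every rejection is a Type I error
type1_rate <- mean(replicate(n_sims, sim_p(0)) < alpha)

# True difference of 0.5 SD: every non-rejection is a Type II error
power_est  <- mean(replicate(n_sims, sim_p(0.5)) < alpha)
type2_rate <- 1 - power_est

cat("Type I rate:", type1_rate,
    " power:", power_est,
    " Type II rate:", type2_rate, "\n")
```

The Type I rate should sit near alpha regardless of sample size, while the power depends on both the effect size and the group size.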
cohen.d(d = filter(df, Gear == "Manual") |> pluck("Driven..Kms."),
        f = filter(df, Gear == "Automatic") |> pluck("Driven..Kms."))
##
## Cohen's d
##
## d estimate: 0.1579147 (negligible)
## 95 percent confidence interval:
## lower upper
## 0.07180602 0.24402343
The estimate of about 0.158 falls below the conventional 0.2 threshold for a small effect, which is why the output labels it negligible. The 95% confidence interval (0.07 to 0.24) excludes zero, so the difference in kilometers driven between manual and automatic cars is unlikely to be zero, but it is small relative to the spread of the data.
In conclusion, transmission type (Manual vs. Automatic) is only weakly associated with the distance driven in this dataset.
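To make the effect size less of a black box, Cohen's d can be recomputed by hand as the mean difference divided by the pooled standard deviation. The sketch below uses the rounded group summaries from the earlier table, so it only approximates the `cohen.d()` output; the variable names are illustrative.

```r
# Rounded group summaries from the table above
m1 <- 62188; s1 <- 42705; n1 <- 5075   # Manual
m2 <- 55534; s2 <- 36739; n2 <- 578    # Automatic

# Pooled standard deviation, weighting each variance by its degrees of freedom
pooled_sd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Cohen's d: standardized difference between the two means
cohens_d <- (m1 - m2) / pooled_sd
round(cohens_d, 3)
```

Because the manual group is much larger, the pooled standard deviation is dominated by the manual group's spread.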
sample_size <- pwr.t.test(d = 0.157, sig.level = 0.05, power = 0.85, type = "two.sample")
sample_size
##
## Two-sample t test power calculation
##
## n = 729.4608
## d = 0.157
## sig.level = 0.05
## power = 0.85
## alternative = two.sided
##
## NOTE: n is number in *each* group
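From the above output, we need approximately 730 cars per group to detect an effect of d = 0.157 with 85% power at a 5% significance level. The dataset has 5075 manual cars but only 578 automatic cars, so the automatic group falls short of this equal-group-size target.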
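The same calculation can be cross-checked with base R. The sketch below calls `stats::power.t.test()` with an explicit namespace prefix, because loading `pwrss` masks `power.t.test` (see the attach message above). Setting `sd = 1` makes `delta` a standardized effect, which is an assumption made here to mirror the `d` argument of `pwr.t.test()`.

```r
# Solve for n per group given a standardized effect, alpha and power.
# sd = 1 so that delta is expressed in standard deviation units.
res <- stats::power.t.test(delta = 0.157, sd = 1,
                           sig.level = 0.05, power = 0.85,
                           type = "two.sample",
                           alternative = "two.sided")

# Round up, since a fractional car cannot be sampled
n_per_group <- ceiling(res$n)
n_per_group
```

Agreement between the two functions is a quick sanity check that the inputs were entered as intended.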
manual <- df$Driven..Kms.[df$Gear == "Manual"]
auto <- df$Driven..Kms.[df$Gear == "Automatic"]
mean_manual <- mean(manual)
mean_auto <- mean(auto)
sd_manual <- sd(manual)
sd_auto <- sd(auto)
n_manual <- length(manual)
n_auto <- length(auto)
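# Note: mu2 and sd2 are not supplied, so pwrss.t.2means() uses its defaults
# (mu2 = 0, sd2 = sd1). The result below is therefore the sample size needed
# to distinguish the manual mean from zero, not from the automatic mean.
# kappa is the allocation ratio n1 / n2.
test <- pwrss.t.2means(mu1 = mean_manual,
                       sd1 = sd(pluck(df, "Driven..Kms.")),
                       kappa = n_manual/n_auto,
                       power = .85, alpha = 0.05,
                       alternative = "not equal")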
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.85
## n1 = 43
## n2 = 5
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 46
## Non-centrality parameter = 3.12
## Type I error rate = 0.05
## Type II error rate = 0.15
plot(test)
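For comparison, the sample sizes for the actual manual-versus-automatic contrast with unequal group sizes can be approximated by hand. The sketch below uses a normal approximation with the rounded group summaries from the earlier table; it is not the method `pwrss` uses internally, so it should be read as a rough cross-check only. Variable names are illustrative.

```r
# Rounded group summaries from the table above
m1 <- 62188; s1 <- 42705   # Manual
m2 <- 55534; s2 <- 36739   # Automatic
kappa <- 5075 / 578        # allocation ratio n1 / n2
alpha <- 0.05
power <- 0.85

# Normal quantiles for a two-sided test and the target power
z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)

# Var(mean1 - mean2) = s1^2 / (kappa * n2) + s2^2 / n2, solved for n2
n2 <- (z_alpha + z_beta)^2 * (s1^2 / kappa + s2^2) / (m1 - m2)^2
n1 <- kappa * n2

c(n1 = ceiling(n1), n2 = ceiling(n2))
```

Supplying `mu2` and `sd2` to `pwrss.t.2means()` would give the t-based version of the same quantity.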
The Cohen’s d value is about 0.157. By conventional thresholds this is below a small effect (0.2), so there is only a weak difference in the kilometers driven between manual and automatic cars.
The sample size n = 730 tells us that we need approximately 730 cars per group to detect this effect with 85% power at a 0.05 significance level.
alpha = 0.05 and beta = 0.15 (beta = 1 - power)
Null Hypothesis: There is no relationship between the age of a car and its average price, meaning the age of the car does not affect its price.
Alternative Hypothesis: There is a relationship between the age of a car and its average price, such that the price decreases as the car’s age increases.
I compute a new column, age, and categorize each car as New (under 10 years old) or Old (10 years or more). Similarly, I categorize each car’s price as Low (below the median price) or High (at or above the median).
df$age <- year(now()) - df$Model.Year
df$age_category <- ifelse(df$age < 10, "New", "Old")
median_price <- median(df$Price, na.rm = TRUE)
df$price_category <- ifelse(df$Price < median_price, "Low", "High")
contingency_table <- table(df$age_category, df$price_category)
contingency_table
##
## High Low
## New 2141 741
## Old 686 2085
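The categorization can be seen on a toy example. Because the age is computed from the current date, the New/Old split will shift when the analysis is rerun in a later year; the sketch below pins a `reference_year` instead. Both the toy data frame `cars` and the value of `reference_year` are hypothetical and chosen only for illustration.

```r
# Toy data; values are made up for illustration
cars <- data.frame(
  Model.Year = c(2011, 2016, 2019, 2008),
  Price      = c(229199, 330399, 520000, 150000)
)

# Hypothetical fixed year, used in place of year(now()) for reproducibility
reference_year <- 2024

# Age in years, then the same two binary splits used in the analysis
cars$age            <- reference_year - cars$Model.Year
cars$age_category   <- ifelse(cars$age < 10, "New", "Old")
cars$price_category <- ifelse(cars$Price < median(cars$Price), "Low", "High")

# 2 x 2 table of age category against price category
table(cars$age_category, cars$price_category)
```

Splitting at the median places roughly half of the cars in each price category, which keeps the table margins balanced.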
fisher_test_result <- fisher.test(contingency_table)
fisher_test_result
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 7.771885 9.920140
## sample estimates:
## odds ratio
## 8.776611
Since the p-value (< 2.2e-16) is far smaller than the significance level (0.05), we reject the null hypothesis: there is a significant association between the age category of the car and its price category.
The odds of a New car falling into the High price category are approximately 8.78 times the odds for an Old car. The 95% confidence interval for the odds ratio (7.77 to 9.92) lies well above 1, which is consistent with a strong association and not a chance result.
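The odds ratio can be reproduced by hand from the contingency table. The sketch below computes the sample (cross-product) odds ratio and a Wald confidence interval on the log scale. `fisher.test()` reports a conditional maximum likelihood estimate with an exact interval instead, so small differences from its output are expected. Variable names are illustrative.

```r
# Counts copied from the contingency table above
tab <- matrix(c(2141,  741,
                 686, 2085),
              nrow = 2, byrow = TRUE,
              dimnames = list(age = c("New", "Old"),
                              price = c("High", "Low")))

# Odds of a High price within each age category
odds_new <- tab["New", "High"] / tab["New", "Low"]
odds_old <- tab["Old", "High"] / tab["Old", "Low"]

# Sample odds ratio: how many times larger the odds are for New cars
or_sample <- odds_new / odds_old

# Wald 95% interval on the log scale, then back-transformed
se_log_or <- sqrt(sum(1 / tab))
ci_wald   <- exp(log(or_sample) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(odds_ratio = or_sample, lower = ci_wald[1], upper = ci_wald[2]), 2)
```

An interval that excludes 1 corresponds to rejecting the hypothesis of no association at the 5% level.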
I have plotted a scatter plot of the age of the car against its price. The black line in the graph is the fitted linear regression line, which shows that the price of the car decreases as its age increases.
ggplot(df, aes(x = age, y = Price)) +
  geom_point(color = "grey") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Scatter Plot of Car Age vs Price",
       x = "Age of Car (Years)",
       y = "Price (INR)") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'