Week 7 Assignment

Importing data and necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test
library(pwr)
library(dplyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
# Read the comma-separated Cars24 listing data
df <- read.csv("Cars24.csv")

# Keep only rows with a recognized transmission type
df <- df |>
  filter(Gear %in% c("Automatic", "Manual"))

head(df)
##   X Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1 0   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2 1    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3 2    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4 3    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5 4      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6 5    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642
## 6    Manual         1          5554

Hypothesis 1

Null Hypothesis (H0): There is no difference in the mean kilometers driven between cars with automatic and manual transmissions.

Alternate Hypothesis (H1): There is a difference in the mean kilometers driven between cars with automatic and manual transmissions.

# Mean, standard deviation, and count of kilometers driven per transmission type
stat <- df |>
  group_by(Gear) |>
  summarize(
    mean_finishing = mean(Driven..Kms., na.rm = TRUE),
    sd_finishing = sd(Driven..Kms., na.rm = TRUE),
    n = n()
  )

print(stat)
## # A tibble: 2 × 4
##   Gear      mean_finishing sd_finishing     n
##   <chr>              <dbl>        <dbl> <int>
## 1 Automatic         55534.       36739.   578
## 2 Manual            62188.       42705.  5075

The summary statistics show that cars with manual transmission have a higher mean kilometers driven (about 62,188 km) than cars with automatic transmission (about 55,534 km).

Significance level

I am choosing a significance level of 5%, a widely used convention. A 5% significance level means that, if the null hypothesis is true, there is a 5% probability of rejecting it by mistake. This is called a Type I error.

Power Level

I am setting the power to 0.85. This means that, if a true effect of the assumed size exists, there is an 85% probability of correctly rejecting the null hypothesis and a 15% probability of making a Type II error (failing to reject a false null hypothesis).

Effect Size

cohen.d(d = filter(df, Gear == "Manual") |> pluck("Driven..Kms."),
        f = filter(df, Gear == "Automatic") |> pluck("Driven..Kms."))
## 
## Cohen's d
## 
## d estimate: 0.1579147 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## 0.07180602 0.24402343

Cohen’s d is about 0.16, which is below the conventional 0.2 threshold for a small effect, so effsize labels it negligible. The 95% confidence interval (0.07 to 0.24) excludes zero, so the difference is distinguishable from zero, but it is small in practical terms.

In conclusion, the type of gear (Manual vs. Automatic) is only weakly associated with the distance driven in this dataset.
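
As a cross-check on the effect size, Cohen's d can be recomputed by hand from the group summaries printed earlier. This is a minimal sketch that assumes the pooled-standard-deviation form of Cohen's d and uses the rounded means, standard deviations, and counts from the summary table; the variable names are illustrative and are not part of the original analysis.

```r
# Group summaries copied from the summary table (rounded values)
m_manual <- 62188; s_manual <- 42705; n_manual <- 5075
m_auto   <- 55534; s_auto   <- 36739; n_auto   <- 578

# Pooled standard deviation across both groups
s_pooled <- sqrt(((n_manual - 1) * s_manual^2 + (n_auto - 1) * s_auto^2) /
                   (n_manual + n_auto - 2))

# Cohen's d: difference in means expressed in pooled-SD units
d_manual_vs_auto <- (m_manual - m_auto) / s_pooled
round(d_manual_vs_auto, 3)
```

Because the inputs are rounded, the result should agree with the cohen.d() estimate only to about three decimal places.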

Sample Size Calculation

sample_size <- pwr.t.test(d = 0.157, sig.level = 0.05, power = 0.85, type = "two.sample")
sample_size
## 
##      Two-sample t test power calculation 
## 
##               n = 729.4608
##               d = 0.157
##       sig.level = 0.05
##           power = 0.85
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

From the above output, we need approximately 730 cars per group to detect an effect of d = 0.157 with 85% power at a 5% significance level.
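
The per-group requirement assumes equal group sizes, while the data has 5,075 manual and 578 automatic cars. The sketch below repeats the calculation with base R and compares it with the effective per-group size of the unbalanced design, taken here as the harmonic mean of the two group sizes (an approximation). It calls stats::power.t.test explicitly because pwrss masks power.t.test in this session; the variable names are illustrative.

```r
# Required n per group for d = 0.157 (sd = 1 puts delta on the Cohen's d scale)
required <- stats::power.t.test(delta = 0.157, sd = 1,
                                sig.level = 0.05, power = 0.85,
                                type = "two.sample",
                                alternative = "two.sided")
n_required <- ceiling(required$n)

# Group sizes copied from the summary table
n_manual <- 5075
n_auto   <- 578

# Effective per-group size of an unbalanced two-group design (harmonic mean)
n_effective <- 2 * n_manual * n_auto / (n_manual + n_auto)

c(required = n_required, effective = round(n_effective))
```

With these group sizes the effective per-group size is about 1,038, which is above the requirement even though the automatic group alone has only 578 cars.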

Perform Hypothesis Test ~ Two Sample t-test

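# Kilometers driven for each transmission type
manual <- df$Driven..Kms.[df$Gear == "Manual"]
auto <- df$Driven..Kms.[df$Gear == "Automatic"]

# Group means, standard deviations, and sizes
mean_manual <- mean(manual)
mean_auto <- mean(auto)

sd_manual <- sd(manual)
sd_auto <- sd(auto)

n_manual <- length(manual)
n_auto <- length(auto)

# Power analysis with pwrss. Note: mu2 is not supplied and defaults to 0,
# so this sizes a test of mean_manual against 0 rather than against
# mean_auto, which is why the required sample sizes below are so small.
# Supplying mu2 = mean_auto would size the test for the observed difference.
test <- pwrss.t.2means(mu1 = mean_manual, 
                       sd1 = sd(pluck(df, "Driven..Kms.")),
                       kappa = n_manual/n_auto,
                       power = .85, alpha = 0.05, 
                       alternative = "not equal")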
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.85 
##   n1 = 43 
##   n2 = 5 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 46 
##  Non-centrality parameter = 3.12 
##  Type I error rate = 0.05 
##  Type II error rate = 0.15
plot(test)
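
The code above is a power calculation and does not itself test the hypothesis. A Welch two-sample t-test can be sketched from the group summaries printed earlier. This is an illustration under stated assumptions: it uses the rounded means, standard deviations, and counts from the summary table, and the variable names are not part of the original analysis.

```r
# Group summaries copied from the summary table (rounded values)
m_manual <- 62188; s_manual <- 42705; n_manual <- 5075
m_auto   <- 55534; s_auto   <- 36739; n_auto   <- 578

# Squared standard error contributed by each group
v_manual <- s_manual^2 / n_manual
v_auto   <- s_auto^2 / n_auto

# Welch t statistic for the difference in means
t_stat <- (m_manual - m_auto) / sqrt(v_manual + v_auto)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_manual + v_auto)^2 /
  (v_manual^2 / (n_manual - 1) + v_auto^2 / (n_auto - 1))

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p = p_value)

# With the full data frame loaded, the corresponding call would be:
# t.test(Driven..Kms. ~ Gear, data = df)
```

A p-value below 0.05 here would mean rejecting H0 at the chosen significance level; given the effect size, that would still describe a small difference.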

Conclusion

  • Cohen’s d is about 0.16, which effsize classifies as negligible (below the 0.2 threshold for a small effect). Manual cars are driven somewhat more than automatic cars, but the difference is weak.

  • The sample size calculation gives n = 730, meaning we need approximately 730 cars per group for 85% power at a 0.05 significance level.

  • alpha = 0.05 and beta = 0.15 (1 - power)

Hypothesis 2

Null Hypothesis: There is no relationship between the age of a car and its average price, meaning the age of the car does not affect its price.

Alternative Hypothesis: There is a relationship between the age of a car and its average price, such that the price decreases as the car’s age increases.

I compute a new column “age” as the current year minus the model year, and categorize cars under 10 years old as “New” and the rest as “Old”. Similarly, I categorize the price as “Low” if it is below the median price and “High” otherwise.

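# Age in years; year(now()) uses the run date, so results shift if re-run in a later year
df$age <- year(now()) - df$Model.Year
# Cars under 10 years old are "New"; the rest are "Old"
df$age_category <- ifelse(df$age < 10, "New", "Old")

# Prices below the median are "Low"; prices at or above it are "High"
median_price <- median(df$Price, na.rm = TRUE)
df$price_category <- ifelse(df$Price < median_price, "Low", "High")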
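
To make the categorization rules concrete, here is a small sketch on a toy data frame. The four rows and the fixed `reference_year` are hypothetical values chosen for illustration. Fixing the year, instead of calling year(now()), keeps the categories reproducible.

```r
# Hypothetical reference year (an assumption, not the year the report was run)
reference_year <- 2024

# Toy data for illustration only
cars <- data.frame(
  Model.Year = c(2011, 2015, 2016, 2019),
  Price      = c(229199, 208699, 330399, 500000)
)

# Age in years, then "New" if under 10 years old, otherwise "Old"
cars$age <- reference_year - cars$Model.Year
cars$age_category <- ifelse(cars$age < 10, "New", "Old")

# "Low" if strictly below the median price, otherwise "High"
median_price <- median(cars$Price)
cars$price_category <- ifelse(cars$Price < median_price, "Low", "High")

cars
```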

Fisher’s Exact Test

Contingency table

contingency_table <- table(df$age_category, df$price_category)
contingency_table
##      
##       High  Low
##   New 2141  741
##   Old  686 2085

Performing Fisher’s exact test

fisher_test_result <- fisher.test(contingency_table)
fisher_test_result
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  7.771885 9.920140
## sample estimates:
## odds ratio 
##   8.776611

Conclusion

Since the p-value (< 2.2e-16) is far smaller than the significance level (0.05), we reject the null hypothesis: there is a statistically significant association between the age category of the car and its price category.

The odds of a “New” car falling into the “High” price category are approximately 8.78 times the odds for an “Old” car (equivalently, “Old” cars have about 8.78 times the odds of being “Low” priced). Such a small p-value means a table this unbalanced would be very unlikely if age and price category were unrelated, and the 95% confidence interval for the odds ratio (7.77 to 9.92) lies well above 1, which supports the same conclusion.
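
The direction of the odds ratio can be checked from the contingency table counts. The sketch below rebuilds the 2 × 2 table from the printed counts, computes the sample odds ratio by hand, and reruns fisher.test on it. The variable names are illustrative. fisher.test reports a conditional maximum likelihood estimate, so its odds ratio can differ slightly from the hand-computed cross-product ratio.

```r
# Counts copied from the contingency table (rows: age, columns: price)
tab <- matrix(c(2141, 686, 741, 2085), nrow = 2,
              dimnames = list(age = c("New", "Old"),
                              price = c("High", "Low")))

# Odds of a "High" price within each age category
odds_new_high <- tab["New", "High"] / tab["New", "Low"]
odds_old_high <- tab["Old", "High"] / tab["Old", "Low"]

# Sample (cross-product) odds ratio: New vs. Old
or_sample <- odds_new_high / odds_old_high

# Fisher's exact test on the same table
fit <- fisher.test(tab)

c(sample_or = or_sample, conditional_mle = unname(fit$estimate))
```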

Visualization

I have plotted a scatter plot of the age of the car against its price. The black line in the graph is the fitted linear regression line, which shows that the price of the car decreases as its age increases.

ggplot(df, aes(x = age, y = Price)) +
  geom_point(color = "grey") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Scatter Plot of Car Age vs Price",
       x = "Age of Car (Years)",
       y = "Price (INR)") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'