Write the first research question. Perform the statistical hypothesis tests to answer your research question. Perform both the parametric test and the corresponding non-parametric test and explain the results. For the parametric test, check all necessary assumptions. Finally, describe which test (parametric or non-parametric) is more suitable for your particular case and why. Also calculate the effect size and explain it. Finally, answer your research question clearly.
mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Seller_Type Transmission Owner
## 1 Dealer Manual 0
## 2 Dealer Manual 0
## 3 Dealer Manual 0
## 4 Dealer Manual 0
## 5 Dealer Manual 0
## 6 Dealer Manual 0
Explain your data (what is the unit of observation, sample size, definition of all variables, units of measurement, etc.).
Unit of observation: Single car
Sample size: 301
Definition of all variables:
In summary, this dataset contains information about cars being sold, including their specifications and details about the sellers.
# Load the dataset
mydata <- read.csv("car data.csv", header = TRUE, sep = ",")
str(mydata)
## 'data.frame': 301 obs. of 9 variables:
## $ Car_Name : chr "ritz" "sx4" "ciaz" "wagon r" ...
## $ Year : int 2014 2013 2017 2011 2014 2018 2015 2015 2016 2015 ...
## $ Selling_Price: num 3.35 4.75 7.25 2.85 4.6 9.25 6.75 6.5 8.75 7.45 ...
## $ Present_Price: num 5.59 9.54 9.85 4.15 6.87 9.83 8.12 8.61 8.89 8.92 ...
## $ Kms_Driven : int 27000 43000 6900 5200 42450 2071 18796 33429 20273 42367 ...
## $ Fuel_Type : chr "Petrol" "Diesel" "Petrol" "Petrol" ...
## $ Seller_Type : chr "Dealer" "Dealer" "Dealer" "Dealer" ...
## $ Transmission : chr "Manual" "Manual" "Manual" "Manual" ...
## $ Owner : int 0 0 0 0 0 0 0 0 0 0 ...
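# Owner counts previous owners (0 to 3); collapse it to two levels so that
# values above 1 are not silently turned into NA
mydata$Owner <- factor(mydata$Owner > 0,
                       levels = c(FALSE, TRUE),
                       labels = c("No Prev Owner", "Prev Owner"))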
head(mydata)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Seller_Type Transmission Owner
## 1 Dealer Manual No Prev Owner
## 2 Dealer Manual No Prev Owner
## 3 Dealer Manual No Prev Owner
## 4 Dealer Manual No Prev Owner
## 5 Dealer Manual No Prev Owner
## 6 Dealer Manual No Prev Owner
library(pastecs)
round(stat.desc(mydata),2)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## nbr.val NA 301.00 301.00 301.00 3.010000e+02 NA
## nbr.null NA 0.00 0.00 0.00 0.000000e+00 NA
## nbr.na NA 0.00 0.00 0.00 0.000000e+00 NA
## min NA 2003.00 0.10 0.32 5.000000e+02 NA
## max NA 2018.00 35.00 92.60 5.000000e+05 NA
## range NA 15.00 34.90 92.28 4.995000e+05 NA
## sum NA 606102.00 1403.05 2296.17 1.112111e+07 NA
## median NA 2014.00 3.60 6.40 3.200000e+04 NA
## mean NA 2013.63 4.66 7.63 3.694721e+04 NA
## SE.mean NA 0.17 0.29 0.50 2.241400e+03 NA
## CI.mean NA 0.33 0.58 0.98 4.410860e+03 NA
## var NA 8.36 25.83 74.72 1.512190e+09 NA
## std.dev NA 2.89 5.08 8.64 3.888688e+04 NA
## coef.var NA 0.00 1.09 1.13 1.050000e+00 NA
## Seller_Type Transmission Owner
## nbr.val NA NA NA
## nbr.null NA NA NA
## nbr.na NA NA NA
## min NA NA NA
## max NA NA NA
## range NA NA NA
## sum NA NA NA
## median NA NA NA
## mean NA NA NA
## SE.mean NA NA NA
## CI.mean NA NA NA
## var NA NA NA
## std.dev NA NA NA
## coef.var NA NA NA
Variable: Selling_Price
Mean: 4.66
Median: 3.60
Interpretation: The mean of Selling_Price exceeds the median, which indicates a right-skewed price distribution. A small number of cars with very high selling prices (possibly luxury cars) pull the average up.
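The pull of a few expensive cars on the mean can be illustrated with a small sketch. It uses the six Selling_Price values from the head() output plus the sample maximum of 35 from the descriptive table. This seven-value toy vector is an illustration only, not the full dataset.

```r
# Six Selling_Price values from the first rows of the data
prices <- c(3.35, 4.75, 7.25, 2.85, 4.60, 9.25)

# Same values plus the sample maximum (35) as a single extreme observation
prices_with_max <- c(prices, 35)

# The mean reacts strongly to the extreme value, the median barely moves
summary_table <- data.frame(
  sample = c("first six cars", "first six cars + max"),
  mean   = c(mean(prices), mean(prices_with_max)),
  median = c(median(prices), median(prices_with_max))
)
print(summary_table)
```

A gap between mean and median that opens up in this way is the same pattern seen in the full sample (mean 4.66, median 3.60).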
Variable: Owner (number of previous owners)
Range: 0 to 3 owners.
Mean: 0.04
Standard deviation: 0.25
Interpretation: The mean is close to 0, so the vast majority of cars have no previous owner.
Variable: Year
Minimum (min): 2003
Maximum (max): 2018
Interpretation: The oldest car in the dataset was purchased in 2003, while the most recent was purchased in 2018.
Variable: Kms_Driven
Minimum (min): 500 km
Maximum (max): 500,000 km
Interpretation: The car with the fewest kilometers driven (500 km) has hardly been used and is probably almost new. The car with 500,000 km reflects very intensive use, possibly as a commercial or work vehicle.
Research question: Is there a significant difference in Selling_Price between cars with Manual and Automatic transmission?
H0: μManual = μAutomatic
H1: μManual ≠ μAutomatic
#Descriptive statistics by group
library(psych)
describeBy(x = mydata$Selling_Price , group = mydata$Transmission)
##
## Descriptive statistics by group
## group: Automatic
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 40 9.42 8.76 5.8 8.54 7.97 0.17 33 32.83 0.77 -0.54 1.39
## ------------------------------------------------------------
## group: Manual
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 261 3.93 3.78 3.25 3.43 3.63 0.1 35 34.9 2.71 16.6 0.23
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
Manual <- ggplot(mydata[mydata$Transmission == "Manual", ], aes(x = Selling_Price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 1, col = "black", fill = "lightgreen") +
  ylab("Frequency") +
  ggtitle("Manual")
Automatic <- ggplot(mydata[mydata$Transmission == "Automatic", ], aes(x = Selling_Price)) +
  theme_linedraw() +
  geom_histogram(binwidth = 1, col = "yellow", fill = "lightblue") +
  ylab("Frequency") +
  ggtitle("Automatic")
library(ggpubr)
ggarrange(Manual, Automatic,
          ncol = 2, nrow = 1)
#Shapiro Test
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
  group_by(Transmission) %>%
  shapiro_test(Selling_Price)
## # A tibble: 2 × 4
## Transmission variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Automatic Selling_Price 0.878 4.76e- 4
## 2 Manual Selling_Price 0.800 1.28e-17
Interpretation:
For both groups (Automatic and Manual), the p-values are far below 0.05, so we reject the null hypothesis of the Shapiro-Wilk test that Selling_Price is normally distributed within the group.
Automatic: We reject H0 of normally distributed Selling_Price for cars with automatic transmission (p < 0.001).
Manual: We reject H0 of normally distributed Selling_Price for cars with manual transmission (p < 0.001).
Since the normality assumption is violated in both groups, a non-parametric test is more suitable for comparing Selling_Price between the two groups. Here this is the Wilcoxon rank-sum test, also called the Mann-Whitney U test.
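For reference, the same per-group normality check can be written with base R only. The sketch below uses R's built-in mtcars data (mpg by transmission type am) purely as a stand-in, because the car data file itself is not reproduced here. Its p-values therefore say nothing about Selling_Price; the names cars_demo and shapiro_table are illustrative.

```r
# Stand-in data: fuel consumption (mpg) by transmission (am: 0 = automatic, 1 = manual)
cars_demo <- data.frame(
  value = mtcars$mpg,
  group = factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# Run the Shapiro-Wilk test separately within each group
shapiro_by_group <- lapply(split(cars_demo$value, cars_demo$group), shapiro.test)

# Collect W statistics and p-values in one table
shapiro_table <- data.frame(
  group     = names(shapiro_by_group),
  statistic = sapply(shapiro_by_group, function(res) unname(res$statistic)),
  p         = sapply(shapiro_by_group, function(res) res$p.value),
  row.names = NULL
)
print(shapiro_table)
```

Replacing cars_demo with the real data (value = Selling_Price, group = Transmission) should give the same kind of table as the shapiro_test() call above.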
Before running the Wilcoxon rank-sum test, we also run the parametric test to check our knowledge.
Parametric test / Just to check knowledge
Independent samples
t.test(mydata$Selling_Price ~ mydata$Transmission,
       var.equal = FALSE,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$Selling_Price by mydata$Transmission
## t = 3.9055, df = 41.248, p-value = 0.0003417
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
## 2.650698 8.325318
## sample estimates:
## mean in group Automatic mean in group Manual
## 9.420000 3.931992
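As a cross-check, the Welch statistic can be rebuilt from the group summaries in the describeBy() output. This is a sketch under the assumption that the rounded means and standard deviations are accurate enough, so the result should agree with t = 3.9055 and df = 41.248 only up to rounding.

```r
# Group summaries copied from the describeBy() output (rounded values)
n_auto <- 40;  mean_auto <- 9.42; sd_auto <- 8.76
n_man  <- 261; mean_man  <- 3.93; sd_man  <- 3.78

# Squared standard errors of the two group means
se2_auto <- sd_auto^2 / n_auto
se2_man  <- sd_man^2 / n_man

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (mean_auto - mean_man) / sqrt(se2_auto + se2_man)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_auto + se2_man)^2 /
  (se2_auto^2 / (n_auto - 1) + se2_man^2 / (n_man - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The degrees of freedom land close to the size of the smaller group (40 automatic cars) because that group has by far the larger variance and therefore dominates the standard error.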
Non-Parametric test / Wilcoxon rank-sum test
wilcox.test(mydata$Selling_Price ~ mydata$Transmission,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Selling_Price by mydata$Transmission
## W = 6902.5, p-value = 0.001028
## alternative hypothesis: true location shift is not equal to 0
Interpretation:
H0: The distribution location of Selling_Price is the same in the two groups (Manual and Automatic transmission).
H1: The distribution location of Selling_Price differs between the two groups.
We reject H0 (p = 0.001). There is a statistically significant difference in Selling_Price between cars with Manual and Automatic transmission.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata$Selling_Price ~ mydata$Transmission,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.32 | [0.14, 0.48]
interpret_rank_biserial(0.32)
## [1] "large"
## (Rules: funder2019)
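The effect size can be tied back to the test statistic with a short sketch. It assumes the rank-biserial correlation is defined as 2W / (n1 * n2) - 1, where W is the Wilcoxon statistic of the first group (Automatic). With the values reported above this definition reproduces r = 0.32.

```r
W  <- 6902.5   # Wilcoxon rank-sum statistic reported above
n1 <- 40       # number of Automatic cars
n2 <- 261      # number of Manual cars

# Share of (Automatic, Manual) pairs in which the automatic car has the
# higher Selling_Price, counting ties as one half
prob_superiority <- W / (n1 * n2)

# Rank-biserial correlation: favourable minus unfavourable share of pairs
r_rank_biserial <- 2 * prob_superiority - 1

round(c(prob_superiority = prob_superiority, r = r_rank_biserial), 2)
```

Read this way, r = 0.32 means that in roughly two out of three Automatic-Manual pairs the automatic car is the more expensive one.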
Conclusion: Based on the sample data, cars with automatic transmission have a significantly higher selling price than manual cars (p = 0.001); the difference in distribution location is large (r = 0.32).
The group means (9.42 vs. 3.93) and medians (5.80 vs. 3.25) point the same way: automatic cars are more expensive than manual cars.
RQ2 (2 points): Write the second research question. Using two numerical variables from your dataset, calculate the appropriate correlation coefficient and explain it. Justify your decision. Perform the appropriate statistical test and interpret the result obtained. Answer your research question clearly.
Research question: Is there a significant correlation between the current price of the car (Present_Price) and its mileage (Kms_Driven)?
H0: ρ = 0. There is no correlation between Present_Price and Kms_Driven.
H1: ρ ≠ 0. There is a correlation between Present_Price and Kms_Driven.
mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Seller_Type Transmission Owner
## 1 Dealer Manual 0
## 2 Dealer Manual 0
## 3 Dealer Manual 0
## 4 Dealer Manual 0
## 5 Dealer Manual 0
## 6 Dealer Manual 0
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[, c("Present_Price", "Kms_Driven")], smooth = FALSE)
Scatterplot Matrix
This plot visualizes the distributions of Present_Price and Kms_Driven individually and their relationship.
Present_Price and Kms_Driven show right-skewed distributions (most cars are cheaper and have low mileage).
#GGpairs Correlation
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(mydata[, c("Present_Price", "Kms_Driven")])
In the ggpairs plot, the correlation coefficient (r = 0.204) suggests a weak positive relationship, and the three asterisks (***) mark it as statistically significant.
# Pearson
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("Present_Price", "Kms_Driven")]), type = "pearson")
## Present_Price Kms_Driven
## Present_Price 1.0 0.2
## Kms_Driven 0.2 1.0
##
## n= 301
##
##
## P
## Present_Price Kms_Driven
## Present_Price 4e-04
## Kms_Driven 4e-04
We reject H0 at p < 0.001.
There is a weak, positive linear relationship between Present_Price and Kms_Driven (r = 0.20).
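The p-value reported by rcorr() can be checked by hand. The sketch below assumes the usual t test for H0: ρ = 0 with n - 2 degrees of freedom and uses the rounded coefficient r = 0.204 from the ggpairs plot, so it matches the reported 4e-04 only up to rounding.

```r
r <- 0.204   # Pearson correlation reported by ggpairs
n <- 301     # sample size

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in one variable that is linearly explained by the other
r_squared <- r^2

round(c(t = t_stat, p = p_value, r_squared = r_squared), 4)
```

The squared coefficient shows why the relationship counts as weak despite being significant: only about 4% of the variance is shared, and the large sample (n = 301) is what makes such a small correlation detectable.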
RQ3 (3 points): Write the third research question. Using two categorical variables, perform the Pearson Chi2 test. Make sure that the necessary assumptions are met. Write down the null hypothesis and the alternative hypothesis as well as your findings based on the p-value of the test. Show empirical and theoretical frequencies and explain them. Also calculate the standardized residuals and interpret them. Calculate the effect size. Answer your research question clearly.
Research question: Is there a significant association between the fuel type (Fuel_Type) and the transmission type (Transmission) of the cars?
H0: There is no association between Fuel_Type and Transmission.
H1: There is an association between Fuel_Type and Transmission.
mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Seller_Type Transmission Owner
## 1 Dealer Manual 0
## 2 Dealer Manual 0
## 3 Dealer Manual 0
## 4 Dealer Manual 0
## 5 Dealer Manual 0
## 6 Dealer Manual 0
# Transforming categorical variables
mydata$Fuel_Type_Factor <- factor(mydata$Fuel_Type)
mydata$Transmission_Factor <- factor(mydata$Transmission)
Categorical variables: Fuel_Type and Transmission are converted into factors for proper analysis.
We use the Pearson Chi-squared test for practice, as we did in class. In our case Fisher's exact test is more appropriate, as the analysis will show.
results <- chisq.test(mydata$Fuel_Type, mydata$Transmission,
                      correct = FALSE)
## Warning in chisq.test(mydata$Fuel_Type, mydata$Transmission, correct = FALSE):
## Chi-squared approximation may be incorrect
results
##
## Pearson's Chi-squared test
##
## data: mydata$Fuel_Type and mydata$Transmission
## X-squared = 3.1651, df = 2, p-value = 0.2054
Interpretation:
First, the p-value of 0.2054 is greater than 0.05, so we cannot reject H0. There is insufficient statistical evidence of an association between the two variables.
Second, we look at the observed (empirical) frequencies, i.e. the actual counts in the cross-tabulation of Fuel_Type and Transmission, the expected (theoretical) frequencies, i.e. the counts we would expect if the variables were not associated, and the standardized residuals, which show how far each observed count lies from its expected count.
round(results$observed)
## mydata$Transmission
## mydata$Fuel_Type Automatic Manual
## CNG 0 2
## Diesel 12 48
## Petrol 28 211
round(results$expected, 2)
## mydata$Transmission
## mydata$Fuel_Type Automatic Manual
## CNG 0.27 1.73
## Diesel 7.97 52.03
## Petrol 31.76 207.24
# Residuals: (observed - expected) / sqrt(expected)
round(results$residuals, 2)
## mydata$Transmission
## mydata$Fuel_Type Automatic Manual
## CNG -0.52 0.20
## Diesel 1.43 -0.56
## Petrol -0.67 0.26
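If there were no relationship between Fuel_Type and Transmission, we would expect 207.24 Petrol cars with Manual transmission; we observe 211, which is very close. The residual of 0.26 lies well inside ±1.96, so this cell does not deviate significantly from independence. The same holds for every other cell: the largest residual is 1.43, for Diesel cars with Automatic transmission. Note that the expected frequencies for CNG (0.27 and 1.73) are below 5, which is why R warns that the Chi-squared approximation may be incorrect.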
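The three tables above can be reproduced from the observed counts alone. The following sketch re-enters the observed frequencies by hand and rebuilds the expected frequencies, the residuals and the test statistic; the object names are illustrative and not part of the original analysis.

```r
# Observed frequencies copied from the output above
observed <- matrix(c(0, 12, 28, 2, 48, 211), nrow = 3,
                   dimnames = list(Fuel_Type = c("CNG", "Diesel", "Petrol"),
                                   Transmission = c("Automatic", "Manual")))
n <- sum(observed)

# Expected frequency under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Residuals and the Chi-squared statistic built from them
pearson_res <- (observed - expected) / sqrt(expected)
chi2 <- sum(pearson_res^2)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

# Assumption check: share of cells with an expected frequency below 5
share_below_5 <- mean(expected < 5)

round(expected, 2)
round(pearson_res, 2)
round(c(chi2 = chi2, df = df, p = p_value, share_below_5 = share_below_5), 4)
```

Two of the six cells have an expected frequency below 5, which is more than the commonly used rule of thumb of 20% allows, so the Chi-squared p-value should be treated with caution here.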
library(effectsize)
effectsize::cramers_v(mydata$Fuel_Type, mydata$Transmission)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.06 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.06)
## [1] "very small"
## (Rules: funder2019)
Interpretation:
The value of the effect size 0.06 is interpreted as a very small effect, which reinforces that the relationship between the variables is practically null.
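The reported value is labelled "adj.", so it is not the textbook Cramér's V. The sketch below assumes that the adjustment is the small-sample bias correction of phi-squared and of the table dimensions; under that assumption the reported 0.06 is reproduced from the Chi-squared statistic.

```r
chi2  <- 3.1651   # Pearson Chi-squared statistic reported above
n     <- 301      # sample size
n_row <- 3        # Fuel_Type categories
n_col <- 2        # Transmission categories

# Unadjusted Cramer's V
v_raw <- sqrt((chi2 / n) / min(n_row - 1, n_col - 1))

# Bias-corrected version (assumed to be what "adj." refers to)
phi2_adj <- max(0, chi2 / n - (n_row - 1) * (n_col - 1) / (n - 1))
row_adj  <- n_row - (n_row - 1)^2 / (n - 1)
col_adj  <- n_col - (n_col - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(row_adj - 1, col_adj - 1))

round(c(v_raw = v_raw, v_adj = v_adj), 2)
```

Both versions are small, so the interpretation does not depend on which one is reported.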
### Fisher's exact test
fisher.test(mydata$Fuel_Type, mydata$Transmission)
##
## Fisher's Exact Test for Count Data
##
## data: mydata$Fuel_Type and mydata$Transmission
## p-value = 0.2092
## alternative hypothesis: two.sided
Interpretation:
Because some expected frequencies are below 5 (the CNG row), we also run Fisher's exact test, which does not rely on the Chi-squared approximation. The p-value of 0.2092 is greater than 0.05, so we fail to reject the null hypothesis: there is no statistical evidence of an association between Fuel_Type and Transmission.
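Both tests can also be run directly on the table of counts, which makes the comparison easy to repeat without the data file. This is a sketch with the observed frequencies re-entered by hand; the object names are illustrative.

```r
# Observed frequencies copied from the Chi-squared output
observed <- matrix(c(0, 12, 28, 2, 48, 211), nrow = 3,
                   dimnames = list(Fuel_Type = c("CNG", "Diesel", "Petrol"),
                                   Transmission = c("Automatic", "Manual")))

# Exact test of independence; no large-sample approximation involved
fisher_res <- fisher.test(observed)

# Chi-squared test on the same table for comparison; the small-expected-count
# warning is suppressed here on purpose because it is already discussed above
chisq_res <- suppressWarnings(chisq.test(observed, correct = FALSE))

round(c(fisher_p = fisher_res$p.value, chisq_p = chisq_res$p.value), 4)
```

The two p-values are close to each other, so the violated assumption does not change the decision in this case.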
Conclusion: We cannot say that there is a significant association between Fuel_Type and Transmission. Fisher's exact test (p = 0.209) and the Chi-squared test (p = 0.205) agree, and the effect size is very small (Cramér's V = 0.06).