Write the first research question. Perform the statistical hypothesis tests to answer your research question. Perform both the parametric test and the corresponding non-parametric test and explain the results. For the parametric test, check all necessary assumptions. Finally, describe which test (parametric or non-parametric) is more suitable for your particular case and why. Also calculate the effect size and explain it. Finally, answer your research question clearly.

mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##        Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Seller_Type Transmission Owner
## 1      Dealer       Manual     0
## 2      Dealer       Manual     0
## 3      Dealer       Manual     0
## 4      Dealer       Manual     0
## 5      Dealer       Manual     0
## 6      Dealer       Manual     0

Explanation of data

Explain your data (what is the unit of observation, sample size, definition of all variables, units of measurement, etc.).

Unit of observation: Single car

Sample size: 301

Definition of all variables:

Car_Name: Variable representing the name and model of the car.
Year: Numeric variable indicating the year the car was bought.
Selling_Price: Variable representing the price at which the owner wants to sell the car (measured in USD).
Present_Price: Variable representing the current ex-showroom price of the car (measured in USD).
Kms_Driven: Variable showing the total kilometers the car has been driven.
Fuel_Type: Variable representing the type of fuel the car uses (e.g., Petrol, Diesel, etc).
Seller_Type: Variable defining whether the seller is a dealer or an individual.
Transmission: Variable defining whether the car has a manual or automatic transmission.
Owner: Variable indicating how many owners the car has had previously (e.g., 0 for no previous owner, 1 for one previous owner).

In summary, this dataset contains information about cars being sold, including their specifications and details about the sellers.

Name of the source: https://www.kaggle.com/datasets/athirags/car-data

Manipulation of the data

# Load the dataset
mydata <- read.csv("car data.csv", header = TRUE, sep = ",")

str(mydata)

## 'data.frame':    301 obs. of  9 variables:
##  $ Car_Name     : chr  "ritz" "sx4" "ciaz" "wagon r" ...
##  $ Year         : int  2014 2013 2017 2011 2014 2018 2015 2015 2016 2015 ...
##  $ Selling_Price: num  3.35 4.75 7.25 2.85 4.6 9.25 6.75 6.5 8.75 7.45 ...
##  $ Present_Price: num  5.59 9.54 9.85 4.15 6.87 9.83 8.12 8.61 8.89 8.92 ...
##  $ Kms_Driven   : int  27000 43000 6900 5200 42450 2071 18796 33429 20273 42367 ...
##  $ Fuel_Type    : chr  "Petrol" "Diesel" "Petrol" "Petrol" ...
##  $ Seller_Type  : chr  "Dealer" "Dealer" "Dealer" "Dealer" ...
##  $ Transmission : chr  "Manual" "Manual" "Manual" "Manual" ...
##  $ Owner        : int  0 0 0 0 0 0 0 0 0 0 ...

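# Recode Owner as a two-level factor. Owner ranges from 0 to 3, so every car
# with at least one previous owner is grouped under "Prev Owner" instead of
# being turned into NA by levels = c(0, 1)
mydata$Owner <- factor(ifelse(mydata$Owner == 0, "No Prev Owner", "Prev Owner"),
                       levels = c("No Prev Owner", "Prev Owner"))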

head(mydata)

##        Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Seller_Type Transmission         Owner
## 1      Dealer       Manual No Prev Owner
## 2      Dealer       Manual No Prev Owner
## 3      Dealer       Manual No Prev Owner
## 4      Dealer       Manual No Prev Owner
## 5      Dealer       Manual No Prev Owner
## 6      Dealer       Manual No Prev Owner

library(pastecs)
round(stat.desc(mydata),2)

##          Car_Name      Year Selling_Price Present_Price   Kms_Driven Fuel_Type
## nbr.val        NA    301.00        301.00        301.00 3.010000e+02        NA
## nbr.null       NA      0.00          0.00          0.00 0.000000e+00        NA
## nbr.na         NA      0.00          0.00          0.00 0.000000e+00        NA
## min            NA   2003.00          0.10          0.32 5.000000e+02        NA
## max            NA   2018.00         35.00         92.60 5.000000e+05        NA
## range          NA     15.00         34.90         92.28 4.995000e+05        NA
## sum            NA 606102.00       1403.05       2296.17 1.112111e+07        NA
## median         NA   2014.00          3.60          6.40 3.200000e+04        NA
## mean           NA   2013.63          4.66          7.63 3.694721e+04        NA
## SE.mean        NA      0.17          0.29          0.50 2.241400e+03        NA
## CI.mean        NA      0.33          0.58          0.98 4.410860e+03        NA
## var            NA      8.36         25.83         74.72 1.512190e+09        NA
## std.dev        NA      2.89          5.08          8.64 3.888688e+04        NA
## coef.var       NA      0.00          1.09          1.13 1.050000e+00        NA
##          Seller_Type Transmission Owner
## nbr.val           NA           NA    NA
## nbr.null          NA           NA    NA
## nbr.na            NA           NA    NA
## min               NA           NA    NA
## max               NA           NA    NA
## range             NA           NA    NA
## sum               NA           NA    NA
## median            NA           NA    NA
## mean              NA           NA    NA
## SE.mean           NA           NA    NA
## CI.mean           NA           NA    NA
## var               NA           NA    NA
## std.dev           NA           NA    NA
## coef.var          NA           NA    NA

Descriptive statistics

Selling Price

Mean: 4.66

Median: 3.60

Interpretation: The mean of Selling_Price is greater than the median, indicating that the price distribution is skewed to the right. This suggests that there are some cars with very high selling prices (possibly luxury cars) that are driving up the average.
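The gap between the mean and the median can be turned into a rough number. The sketch below is an added illustration, not part of the original analysis: it applies Pearson's median skewness coefficient to the rounded Selling_Price values from the stat.desc output above.

```r
# Summary values for Selling_Price, copied from the stat.desc output (rounded)
sp_mean   <- 4.66
sp_median <- 3.60
sp_sd     <- 5.08

# Pearson's median skewness coefficient: 3 * (mean - median) / sd
# A positive value points to a right-skewed distribution
pearson_skew <- 3 * (sp_mean - sp_median) / sp_sd
round(pearson_skew, 2)
```

A clearly positive value is in line with the right skew described in the interpretation.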

Owner:

Range: 0 to 3 owners.

Mean: 0.04

Standard deviation: 0.25

Interpretation: Most cars have very few previous owners.

Year (Year of purchase):

Minimum (min): 2003

Maximum (max): 2018

Interpretation: The oldest car in the dataset was purchased in 2003, while the most recent was purchased in 2018.

Kms_Driven (Kilometers driven):

Minimum (min): 500 km

Maximum (max): 500,000 km

Interpretation: The car with the fewest kilometers driven has hardly been used (500 km), and is probably almost new or has been used very little. On the other hand, the vehicle with 500,000 km reflects very intensive use, possibly as a commercial or work car.

RSQ 1

Is there any significant difference in “Selling_Price” between Manual and Automatic cars?

H0: μManual = μAutomatic

H1: μManual ≠ μAutomatic

#Descriptive statistics by group
library(psych)
describeBy(x = mydata$Selling_Price , group = mydata$Transmission)

## 
##  Descriptive statistics by group 
## group: Automatic
##    vars  n mean   sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 40 9.42 8.76    5.8    8.54 7.97 0.17  33 32.83 0.77    -0.54 1.39
## ------------------------------------------------------------ 
## group: Manual
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 261 3.93 3.78   3.25    3.43 3.63 0.1  35  34.9 2.71     16.6 0.23

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

Manual <- ggplot(mydata[mydata$Transmission == "Manual",  ], aes(x = Selling_Price )) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, col = "black", fill = "lightgreen") +
  ylab("Frequency") +
  ggtitle("Manual")

Automatic <- ggplot(mydata[mydata$Transmission == "Automatic",  ], aes(x = Selling_Price)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, col = "yellow", fill = "lightblue") +
  ylab("Frequency") +
  ggtitle("Automatic")

library(ggpubr)
ggarrange(Manual, Automatic,
          ncol = 2, nrow = 1)

#Shapiro Test
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(Transmission) %>%
  shapiro_test(Selling_Price)

## # A tibble: 2 × 4
##   Transmission variable      statistic        p
##   <chr>        <chr>             <dbl>    <dbl>
## 1 Automatic    Selling_Price     0.878 4.76e- 4
## 2 Manual       Selling_Price     0.800 1.28e-17

Interpretation:

For both groups (Automatic and Manual), the p-values are far below 0.05, so we reject the null hypothesis of the Shapiro-Wilk test (H0: Selling_Price is normally distributed within the group). The selling price does not follow a normal distribution in either group.

Automatic: We reject H0 of normality of Selling_Price for cars with automatic transmission (p < 0.001).

Manual: We reject H0 of normality of Selling_Price for cars with manual transmission (p < 0.001).

Since the data are not normally distributed, a non-parametric test is more suitable for comparing Selling_Price between the two groups. In this case we use the Wilcoxon rank-sum test, also called the Mann-Whitney U test.
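The same per-group normality check can also be written in base R. The sketch below uses simulated values (the data frame toy and its columns are hypothetical, not the car dataset) to show how a skewed group is flagged by the Shapiro-Wilk test.

```r
# Hypothetical data: simulated prices, NOT the car dataset
set.seed(1)
toy <- data.frame(
  price = c(rlnorm(100, meanlog = 1, sdlog = 1),  # right-skewed group
            rnorm(100, mean = 5, sd = 1)),        # roughly symmetric group
  group = rep(c("skewed", "symmetric"), each = 100)
)

# Shapiro-Wilk p-value per group (H0: the values are normally distributed)
p_by_group <- tapply(toy$price, toy$group,
                     function(v) shapiro.test(v)$p.value)
p_by_group
```

With the real data, the equivalent call would be tapply(mydata$Selling_Price, mydata$Transmission, function(v) shapiro.test(v)$p.value).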

Before running the Wilcoxon rank-sum test, we also run the parametric test for practice, even though its normality assumption is violated.

Parametric test / Just to check knowledge

Independent samples

t.test(mydata$Selling_Price ~ mydata$Transmission, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$Selling_Price by mydata$Transmission
## t = 3.9055, df = 41.248, p-value = 0.0003417
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
##  2.650698 8.325318
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                9.420000                3.931992
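To see where the t value and the fractional degrees of freedom come from, the Welch statistic can be recomputed from the group summaries. This is a sketch based on the rounded means and standard deviations from the describeBy output, so the result only approximately matches the t.test output.

```r
# Group summaries copied from the describeBy output (rounded to 2 decimals)
m_auto <- 9.42; s_auto <- 8.76; n_auto <- 40    # Automatic
m_man  <- 3.93; s_man  <- 3.78; n_man  <- 261   # Manual

# Squared standard error of each group mean
v_auto <- s_auto^2 / n_auto
v_man  <- s_man^2  / n_man

# Welch t statistic: mean difference divided by its unpooled standard error
t_welch <- (m_auto - m_man) / sqrt(v_auto + v_man)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_man)^2 /
  (v_auto^2 / (n_auto - 1) + v_man^2 / (n_man - 1))

c(t = t_welch, df = df_welch)
```

The degrees of freedom stay close to the size of the smaller group because the Automatic group has both the larger variance and far fewer observations.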

Non-Parametric test / Wilcoxon rank-sum test

wilcox.test(mydata$Selling_Price ~ mydata$Transmission,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Selling_Price by mydata$Transmission
## W = 6902.5, p-value = 0.001028
## alternative hypothesis: true location shift is not equal to 0

Interpretation:

H0: There is no significant difference in the distribution location of Selling_Price between the two groups (Manual and Automatic transmission).

H1: There is a significant difference in the distribution location of Selling_Price between the two groups.

We reject H0 (p = 0.001). This means there is a statistically significant difference in Selling_Price between cars with Manual and Automatic transmission.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydata$Selling_Price ~ mydata$Transmission,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.32              | [0.14, 0.48]

interpret_rank_biserial(0.32)

## [1] "large"
## (Rules: funder2019)
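The rank-biserial correlation can be checked by hand from the Wilcoxon statistic. A minimal sketch, assuming W is the Mann-Whitney U statistic of the first group (Automatic), which is how wilcox.test reports it:

```r
# Values copied from the outputs above
W  <- 6902.5   # statistic reported by wilcox.test
n1 <- 40       # number of Automatic cars
n2 <- 261      # number of Manual cars

# Rank-biserial correlation: share of (Automatic, Manual) pairs in which the
# Automatic car is more expensive, minus the share in which it is cheaper
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
```

The positive sign means that Automatic cars tend to rank higher on Selling_Price than Manual cars.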

Conclusion: Based on the sample data, cars with automatic transmission have a significantly higher selling price than manual cars (p = 0.001). The difference in distribution location is large (r = 0.32).

Based on the mean of Selling_Price (9.42 vs. 3.93), automatic cars are more expensive than manual ones.

RSQ2

RQ2 (2 points): Write the second research question. Using two numerical variables from your dataset, calculate the appropriate correlation coefficient and explain it. Justify your decision. Perform the appropriate statistical test and interpret the result obtained. Answer your research question clearly.

Research question: Is there a significant correlation between the current price of the car (Present_Price) and its mileage (Kms_Driven)?

H0: ρ = 0. There is no correlation between Present_Price and Kms_Driven.

H1: ρ ≠ 0. There is a correlation between Present_Price and Kms_Driven.

mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##        Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Seller_Type Transmission Owner
## 1      Dealer       Manual     0
## 2      Dealer       Manual     0
## 3      Dealer       Manual     0
## 4      Dealer       Manual     0
## 5      Dealer       Manual     0
## 6      Dealer       Manual     0

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata[, c("Present_Price", "Kms_Driven")], smooth = FALSE)

Scatterplot Matrix

This plot visualizes the distributions of Present_Price and Kms_Driven individually and their relationship.

Present_Price and Kms_Driven show right-skewed distributions (most cars are cheaper and have low mileage).

#GGpairs Correlation

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(mydata[, c("Present_Price", "Kms_Driven")])

In the ggpairs plot, the correlation coefficient (r = 0.204) suggests a weak relationship, and its statistical significance is indicated visually (***).

# Pearson
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[, c("Present_Price", "Kms_Driven")]), type = "pearson")

##               Present_Price Kms_Driven
## Present_Price           1.0        0.2
## Kms_Driven              0.2        1.0
## 
## n= 301 
## 
## 
## P
##               Present_Price Kms_Driven
## Present_Price               4e-04     
## Kms_Driven    4e-04

We reject H0 at p < 0.001.

There is a positive, weak linear relationship between Present_Price and Kms_Driven (r = 0.20).
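The p-value of the correlation test follows from r and the sample size alone. A short sketch, using r = 0.204 from the ggpairs output and n = 301:

```r
# Values copied from the outputs above
r <- 0.204   # Pearson correlation between Present_Price and Kms_Driven
n <- 301     # number of cars

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
```

Because both variables are right-skewed, a rank-based coefficient could be reported alongside Pearson's r, for example with cor.test(mydata$Present_Price, mydata$Kms_Driven, method = "spearman"); this is a suggestion, not something the original analysis ran.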

RSQ3

RQ3 (3 points): Write the third research question. Using two categorical variables, perform the Pearson Chi2 test. Make sure that the necessary assumptions are met. Write down the null hypothesis and the alternative hypothesis as well as your findings based on the p-value of the test. Show empirical and theoretical frequencies and explain them. Also calculate the standardized residuals and interpret them. Calculate the effect size. Answer your research question clearly.

Research question: Is there a significant association between the fuel type (Fuel_Type) and the transmission type (Transmission) of the cars?

H0: There is no association between Fuel_Type and Transmission.

H1: There is an association between Fuel_Type and Transmission.

mydata <- read.table("./car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##        Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Seller_Type Transmission Owner
## 1      Dealer       Manual     0
## 2      Dealer       Manual     0
## 3      Dealer       Manual     0
## 4      Dealer       Manual     0
## 5      Dealer       Manual     0
## 6      Dealer       Manual     0

# Transforming categorical variables
mydata$Fuel_Type_Factor <- factor(mydata$Fuel_Type)
mydata$Transmission_Factor <- factor(mydata$Transmission)

Categorical variables: Fuel_Type and Transmission are converted into factors for proper analysis.

We use the Chi-squared test for practice, as we did in class. In our case Fisher's exact test of independence is more appropriate, as the analysis will show.

results <- chisq.test(mydata$Fuel_Type, mydata$Transmission, 
                      correct = FALSE)

## Warning in chisq.test(mydata$Fuel_Type, mydata$Transmission, correct = FALSE):
## Chi-squared approximation may be incorrect

results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Fuel_Type and mydata$Transmission
## X-squared = 3.1651, df = 2, p-value = 0.2054

Interpretation:

First, the p-value is 0.2054 (greater than 0.05), so we cannot reject H0. There is insufficient statistical evidence to claim that the two variables are associated.

Secondly, we look at the observed frequencies (the actual counts for each combination of Fuel_Type and Transmission), the expected frequencies (the counts that would be expected if there were no association between the variables) and the residuals, which tell us how far the observed counts are from the expected ones. Note that results$res partially matches results$residuals, so the values shown are Pearson residuals, (observed - expected) / sqrt(expected); the standardized residuals are stored in results$stdres.

round(results$observed)

##                 mydata$Transmission
## mydata$Fuel_Type Automatic Manual
##           CNG            0      2
##           Diesel        12     48
##           Petrol        28    211

round(results$expected, 2)

##                 mydata$Transmission
## mydata$Fuel_Type Automatic Manual
##           CNG         0.27   1.73
##           Diesel      7.97  52.03
##           Petrol     31.76 207.24

round(results$res, 2)

##                 mydata$Transmission
## mydata$Fuel_Type Automatic Manual
##           CNG        -0.52   0.20
##           Diesel      1.43  -0.56
##           Petrol     -0.67   0.26

For example, the observed frequency of 211 Petrol cars with Manual transmission is very close to the 207.24 that would be expected if there were no relationship between Fuel_Type and Transmission. The residual of 0.26 is far below 1.96 in absolute value (not significant), as are the residuals of all other cells, which supports the lack of an association between the two variables.
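The expected frequencies and both kinds of residuals can be reproduced by hand from the observed table. The sketch below rebuilds the table from the counts shown above, so it runs without the CSV file.

```r
# Observed counts copied from results$observed (filled column by column)
observed <- matrix(c(0, 12, 28,     # Automatic: CNG, Diesel, Petrol
                     2, 48, 211),   # Manual:    CNG, Diesel, Petrol
                   nrow = 3,
                   dimnames = list(Fuel_Type    = c("CNG", "Diesel", "Petrol"),
                                   Transmission = c("Automatic", "Manual")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals (what results$res returns)
pearson_res <- (observed - expected) / sqrt(expected)

# Standardized residuals (what results$stdres returns): Pearson residuals
# rescaled by the row and column proportions
std_res <- pearson_res /
  sqrt(outer(1 - rowSums(observed) / n, 1 - colSums(observed) / n))

# Chi-squared statistic: sum of squared Pearson residuals
chi_sq <- sum(pearson_res^2)

round(std_res, 2)
```

Standardized residuals are usually compared against ±1.96; a cell beyond that bound would deviate significantly from independence at the 5% level.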

library(effectsize)
effectsize::cramers_v(mydata$Fuel_Type, mydata$Transmission)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.06              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.06)

## [1] "very small"
## (Rules: funder2019)

Interpretation:

The value of the effect size 0.06 is interpreted as a very small effect, which reinforces that the relationship between the variables is practically null.
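The reported value can be traced back to the chi-squared statistic. The sketch below computes the plain Cramér's V and a bias-corrected version; it assumes that "(adj.)" in the output refers to the usual small-sample bias correction, which is an assumption about the package's internals rather than something stated in the output.

```r
# Values copied from the chi-squared test above
chi_sq <- 3.1651
n      <- 301
n_row  <- 3   # fuel types
n_col  <- 2   # transmission types

# Plain Cramer's V
v_raw <- sqrt(chi_sq / (n * (min(n_row, n_col) - 1)))

# Bias-corrected version (assumed to correspond to the "(adj.)" output)
phi2_adj <- max(0, chi_sq / n - (n_row - 1) * (n_col - 1) / (n - 1))
row_adj  <- n_row - (n_row - 1)^2 / (n - 1)
col_adj  <- n_col - (n_col - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(row_adj - 1, col_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

The correction shrinks the estimate because, in a sample, the chi-squared statistic is positive on average even when the variables are independent.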

### Fisher's  probability test
fisher.test(mydata$Fuel_Type, mydata$Transmission)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata$Fuel_Type and mydata$Transmission
## p-value = 0.2092
## alternative hypothesis: two.sided

Interpretation:

In this case we use Fisher's Exact Test, which is preferred when some expected frequencies are smaller than 5 (here, both CNG cells). The p-value of 0.2092 is greater than 0.05, so we fail to reject the null hypothesis. This means there is no statistical evidence of an association between the variables Fuel_Type and Transmission.
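The choice of Fisher's test can be backed by a quick check of the expected frequencies. A minimal sketch, using the rounded expected counts shown earlier and the common rule of thumb that no more than 20% of the expected counts should be below 5 and none below 1:

```r
# Expected counts copied from results$expected (rounded, filled by column)
expected <- matrix(c(0.27, 7.97, 31.76,    # Automatic: CNG, Diesel, Petrol
                     1.73, 52.03, 207.24), # Manual:    CNG, Diesel, Petrol
                   nrow = 3,
                   dimnames = list(Fuel_Type    = c("CNG", "Diesel", "Petrol"),
                                   Transmission = c("Automatic", "Manual")))

# Share of cells with an expected count below 5
share_below_5 <- mean(expected < 5)

# Is any expected count below 1?
any_below_1 <- any(expected < 1)

share_below_5
any_below_1
```

Both CNG cells fall below 5 and one of them below 1, which is what triggers the warning printed by chisq.test.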

Conclusion: We cannot say that there is a significant association between Fuel_Type and Transmission. The two variables appear not to be associated, and the p-value of Fisher's exact test (p = 0.209) supports this conclusion.

Homework 1

Tomas Miklic

2025-01-12
