MVA-HW-2 correlation

The dataset was taken from: https://www.kaggle.com/datasets/pushpakhinglaspure/used-car-price-prediction

Research Questions

1. Is there a correlation between present price and kilometers driven?
2. Is there an association between transmission of the car and fuel type of the car?

#install.packages("readxl")

library(readxl)

## Warning: package 'readxl' was built under R version 4.3.2

mydata <- read_xlsx("~/IMB/Mutivariat analysis/car data.xlsx")
head(mydata)

## # A tibble: 6 × 9
##   Car_Name     Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014           335           559      27000 Petrol    Dealer     
## 2 sx4          2013           475           954      43000 Diesel    Dealer     
## 3 ciaz         2017           725           985       6900 Petrol    Dealer     
## 4 wagon r      2011           285           415       5200 Petrol    Dealer     
## 5 swift        2014            46           687      42450 Diesel    Dealer     
## 6 vitara bre…  2018           925           983       2071 Diesel    Dealer     
## # ℹ 2 more variables: Transmission <chr>, Owner <dbl>

Explanation of dataset

Unit of observation: cars
Sample size: n = 301
Definition of variables:
Year: the car’s manufacturing year
Selling_Price: the actual price at which the car was sold
Present_Price: the current price at which the car is listed
Kms_Driven: total distance driven, in kilometers
Fuel_Type: categorical variable specifying the car’s fuel type (e.g., petrol, diesel)
Seller_Type: categorical variable indicating whether the seller is an individual or a dealer
Transmission: categorical variable specifying the car’s transmission type (e.g., manual, automatic)
Owner: categorical variable indicating whether the car has had previous owners or is being sold by the first owner

The full dataset has 301 units, so I will draw a random sample of 100 units for the purpose of hypothesis testing and checking normality.

set.seed(1) #Setting initial point of sampling
mydata <- mydata[sample(nrow(mydata),100),]
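Why the seed matters can be illustrated with a minimal sketch that works only with row indices (301 rows, as in the full dataset). The names idx_first and idx_second are illustrative and not part of the original analysis:

```r
# Draw the same 100-of-301 sample twice, resetting the seed before each draw
set.seed(1)
idx_first <- sample(301, 100)    # sampling without replacement (the default)

set.seed(1)
idx_second <- sample(301, 100)

identical(idx_first, idx_second) # checks that both draws select the same rows
anyDuplicated(idx_first) == 0    # checks that no row is selected twice
```

Without the set.seed() call, every run of the script would generally pick a different set of 100 cars, and the results below would change from run to run.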

Data manipulations

#for the purpose of the easier analysis, I will exclude some of the variables I won't use in the analysis
mydata2 <- mydata[,c(-2,-7,-9)]
head(mydata2)

## # A tibble: 6 × 6
##   Car_Name         Selling_Price Present_Price Kms_Driven Fuel_Type Transmission
##   <chr>                    <dbl>         <dbl>      <dbl> <chr>     <chr>       
## 1 Hero Passion Pro            45            55       1000 Petrol    Manual      
## 2 Honda CB Hornet…             8            87       3000 Petrol    Manual      
## 3 city                       335            11      87934 Petrol    Manual      
## 4 city                        67            10      18828 Petrol    Manual      
## 5 TVS Wego                    25            52      22000 Petrol    Automatic   
## 6 innova                     349          1346     197176 Diesel    Manual

#creating factor versions of the categorical variables, so that R treats them as categorical rather than as plain text
mydata2$TransmissionF <- factor(mydata2$Transmission,
 levels = c("Manual", "Automatic"),
 labels =c("Manual", "Automatic"))

mydata2$Fuel_TypeF <- factor(mydata2$Fuel_Type,
 levels = c("Petrol", "Diesel"),
 labels =c("Petrol", "Diesel"))
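One thing to keep in mind with factor(): any value that is not listed in levels is silently converted to NA. A minimal sketch on a made-up vector (the third fuel type "CNG" is a hypothetical value used only for illustration, not a statement about this sample):

```r
fuel <- c("Petrol", "Diesel", "CNG", "Petrol")   # made-up values
fuel_f <- factor(fuel, levels = c("Petrol", "Diesel"))

table(fuel_f, useNA = "ifany")   # the unlisted value is counted as NA
sum(is.na(fuel_f))               # number of values lost by the recoding
```

In this report the factor counts add up to the full sample of 100 (82 petrol and 18 diesel in the summary output), so no observations were lost in the recoding.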

head(mydata2)

## # A tibble: 6 × 8
##   Car_Name         Selling_Price Present_Price Kms_Driven Fuel_Type Transmission
##   <chr>                    <dbl>         <dbl>      <dbl> <chr>     <chr>       
## 1 Hero Passion Pro            45            55       1000 Petrol    Manual      
## 2 Honda CB Hornet…             8            87       3000 Petrol    Manual      
## 3 city                       335            11      87934 Petrol    Manual      
## 4 city                        67            10      18828 Petrol    Manual      
## 5 TVS Wego                    25            52      22000 Petrol    Automatic   
## 6 innova                     349          1346     197176 Diesel    Manual      
## # ℹ 2 more variables: TransmissionF <fct>, Fuel_TypeF <fct>

summary(mydata2[,c(-1,-5,-6)])

##  Selling_Price    Present_Price      Kms_Driven       TransmissionF  Fuel_TypeF
##  Min.   :   2.0   Min.   :  10.0   Min.   :  1000   Manual   :86    Petrol:82  
##  1st Qu.:  25.0   1st Qu.:  76.0   1st Qu.: 14875   Automatic:14    Diesel:18  
##  Median :  63.0   Median : 136.0   Median : 33494                              
##  Mean   : 216.7   Mean   : 478.5   Mean   : 35699                              
##  3rd Qu.: 272.5   3rd Qu.: 655.8   3rd Qu.: 48825                              
##  Max.   :1999.0   Max.   :3596.0   Max.   :197176

library(psych)
describe(mydata2[,c(-1,-5,-6,-7,-8)])

##               vars   n     mean       sd median  trimmed      mad  min    max
## Selling_Price    1 100   216.67   338.30     63   140.72    84.51    2   1999
## Present_Price    2 100   478.51   717.40    136   308.20   153.45   10   3596
## Kms_Driven       3 100 35698.61 29287.81  33494 32288.20 25788.34 1000 197176
##                range skew kurtosis      se
## Selling_Price   1997 2.61     8.15   33.83
## Present_Price   3586 2.65     7.41   71.74
## Kms_Driven    196176 2.15     8.47 2928.78

Explanations of parameters

Selling_Price - min: The lowest selling price (the actual price at which the car was sold) in the sample was 2 units.
Present_Price - mean: The present price (the current price at which the car is listed) was on average 478.51 units.
Kms_Driven - median: Half of the cars in the sample had driven 33494 kilometers or less, and half had driven more than 33494 kilometers.
Selling_Price - skewness: The skewness value of 2.61 indicates that the distribution of selling price is right-skewed (positively skewed).

Research Question

Is there a correlation between present price and kilometers driven?

Assumptions:

Both variables are numeric (This assumption is met, since present price and kilometers driven are both numeric)
Normality of variables (We treat this assumption as met, since the sample is large enough, n = 100)
Linear relationship
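The normality assumption can also be checked formally instead of being assumed. A minimal sketch with the Shapiro-Wilk test from base R, run on simulated right-skewed data (the vector skewed is made up for illustration and is not the car data):

```r
set.seed(1)
skewed <- rexp(100, rate = 1 / 500)   # simulated right-skewed values, n = 100

# H0: the variable is normally distributed
sw <- shapiro.test(skewed)
sw$p.value          # a small p-value speaks against normality
sw$p.value < 0.05   # whether H0 is rejected at the 5% level
```

On the sample used in this report, the corresponding call would be shapiro.test(mydata2$Present_Price), and likewise for Kms_Driven.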

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(mydata2$Present_Price, mydata2$Kms_Driven, 
            smooth = FALSE,
            ylim = c(1000, 197176),
            xlim = c(0, 3596),
            main = "Relationship between Present price and Kilometers driven",
            xlab = "Present price", 
            ylab = "Kilometers driven")

- Based on the scatter plot, we can conclude that the linearity assumption is violated; however, for educational purposes we will assume linearity and proceed with the Pearson correlation test.
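When linearity is doubtful, a rank-based coefficient such as Spearman's rho is a common alternative, since it only requires a monotonic relationship. A minimal sketch on simulated data (x and y are made up for illustration; this is not the analysis performed in this report):

```r
set.seed(42)
x <- rexp(100, rate = 1 / 500)                  # simulated right-skewed variable
y <- 20000 * log1p(x) + rnorm(100, sd = 5000)   # monotone but non-linear in x

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# exact = FALSE requests the asymptotic approximation for the p-value
res <- cor.test(x, y, method = "spearman", exact = FALSE)
c(pearson = pearson, spearman = spearman, p_value = res$p.value)
```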

cor(mydata2$Present_Price, mydata2$Kms_Driven,
    method = "pearson",
    use = "complete.obs")

## [1] 0.3229115

The relationship between the present price of the cars and kilometers driven is positive and semi-strong (r = 0.32).

# cor.test() drops incomplete pairs on its own, so no 'use' argument is needed
cor.test(mydata2$Present_Price, mydata2$Kms_Driven,
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  mydata2$Present_Price and mydata2$Kms_Driven
## t = 3.3776, df = 98, p-value = 0.00105
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1350596 0.4883554
## sample estimates:
##       cor 
## 0.3229115

H0: The correlation between Present_Price and Kms_Driven is equal to 0.
H1: The correlation between Present_Price and Kms_Driven is not equal to 0.

Based on the sample data, we can reject H0 (p = 0.001) and conclude that there is a linear relationship between the present price of the car and the kilometers driven by the car.
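The test statistic reported by cor.test() follows from the correlation coefficient and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. A short check using the values from the output above (the variable names are illustrative):

```r
r <- 0.3229115   # sample correlation from the output above
n <- 100         # sample size

t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

round(t_stat, 4)    # compare with t in the cor.test() output
round(p_value, 5)   # compare with the p-value in the cor.test() output
```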

Categorical variables analysis

Pearson Chi2 test

Research question

Is there an association between transmission of the car and fuel type of the car?

Assumptions:

Observations must be independent. (This assumption is met.)
All expected frequencies are greater than 5
In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this will reduce the power of the test.

results <- chisq.test(mydata2$TransmissionF, mydata2$Fuel_TypeF, 
                      correct = TRUE)

## Warning in chisq.test(mydata2$TransmissionF, mydata2$Fuel_TypeF, correct =
## TRUE): Chi-squared approximation may be incorrect

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata2$TransmissionF and mydata2$Fuel_TypeF
## X-squared = 0.54043, df = 1, p-value = 0.4623

Hypothesis

H0: There is no association between transmission type of the car and fuel type of the car.
H1: There is an association between transmission type of the car and fuel type of the car.

Based on the sample data, we cannot reject the null hypothesis (p = 0.462); we cannot say that there is an association between transmission type of the car and fuel type of the car.

addmargins(results$observed)

##                      mydata2$Fuel_TypeF
## mydata2$TransmissionF Petrol Diesel Sum
##             Manual        72     14  86
##             Automatic     10      4  14
##             Sum           82     18 100

Expected/theoretical frequencies

round(results$expected, 2)

##                      mydata2$Fuel_TypeF
## mydata2$TransmissionF Petrol Diesel
##             Manual     70.52  15.48
##             Automatic  11.48   2.52
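The expected frequencies above can be reproduced by hand: each one is the row total multiplied by the column total, divided by the sample size. A minimal sketch that re-enters the observed counts from the contingency table (the object names are illustrative):

```r
# Observed counts from the contingency table above
observed <- matrix(c(72, 10, 14, 4), nrow = 2,
                   dimnames = list(Transmission = c("Manual", "Automatic"),
                                   Fuel = c("Petrol", "Diesel")))

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

any(expected < 5)   # flags a violation of the expected-frequency assumption
```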

Because not all expected frequencies are greater than 5 (the expected frequency for automatic diesel cars is 2.52), I have to use Fisher’s exact probability test of independence, which is a nonparametric test.

However, I will still show the code for the proportion tables (structure) and interpret one frequency from each of them, for educational purposes.

addmargins(round(prop.table(results$observed), 3))

##                      mydata2$Fuel_TypeF
## mydata2$TransmissionF Petrol Diesel  Sum
##             Manual      0.72   0.14 0.86
##             Automatic   0.10   0.04 0.14
##             Sum         0.82   0.18 1.00

0.72 - Out of all the cars in the sample, 72% had a manual transmission and used petrol.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##                      mydata2$Fuel_TypeF
## mydata2$TransmissionF Petrol Diesel   Sum
##             Manual     0.837  0.163 1.000
##             Automatic  0.714  0.286 1.000

0.714 - Out of all the cars with an automatic transmission, 71.4% use petrol as fuel.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##                      mydata2$Fuel_TypeF
## mydata2$TransmissionF Petrol Diesel
##             Manual     0.878  0.778
##             Automatic  0.122  0.222
##             Sum        1.000  1.000

0.222 - Out of all the cars that use diesel as fuel, 22.2% are automatic.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata2$TransmissionF, mydata2$Fuel_TypeF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.05              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.05)

## [1] "very small"
## (Rules: funder2019)
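For reference, the unadjusted Cramér's V is sqrt(chi2 / (n * (k - 1))), where chi2 is the Pearson statistic without continuity correction and k is the smaller of the number of rows and columns. The sketch below computes it from the observed counts and then applies a small-sample bias correction. That this particular correction is the one behind the "adj." value reported above is an assumption, although it agrees with that value after rounding:

```r
observed <- matrix(c(72, 10, 14, 4), nrow = 2)   # counts from the observed table
n     <- sum(observed)
n_row <- nrow(observed)
n_col <- ncol(observed)

expected <- outer(rowSums(observed), colSums(observed)) / n
chi2 <- sum((observed - expected)^2 / expected)   # no continuity correction

# Unadjusted Cramer's V
v <- sqrt(chi2 / (n * (min(n_row, n_col) - 1)))

# Bias-corrected version (assumed to correspond to the "adj." value)
phi2_adj <- max(0, chi2 / n - (n_row - 1) * (n_col - 1) / (n - 1))
row_adj  <- n_row - (n_row - 1)^2 / (n - 1)
col_adj  <- n_col - (n_col - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(row_adj - 1, col_adj - 1))

round(c(unadjusted = v, adjusted = v_adj), 3)
```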

Fisher’s exact probability test

fisher.test(mydata2$TransmissionF, mydata2$Fuel_TypeF)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata2$TransmissionF and mydata2$Fuel_TypeF
## p-value = 0.2728
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4088362 8.4529474
## sample estimates:
## odds ratio 
##   2.039865

H0: The odds ratio is equal to 1.
H1: The odds ratio is not equal to 1.

Based on the sample data, we cannot reject the null hypothesis (p = 0.273). There is not enough evidence that the odds of having a certain transmission type differ between the two fuel types, so we cannot conclude that there is an association between transmission and fuel type of the car.

interpret_oddsratio(2.04)

## [1] "small"
## (Rules: chen2010)
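The odds ratio can also be read directly from the observed counts. The sketch below computes the sample (cross-product) odds ratio and compares it with the estimate from fisher.test(), which reports the conditional maximum likelihood estimate and therefore differs slightly (the object names are illustrative):

```r
observed <- matrix(c(72, 10, 14, 4), nrow = 2,
                   dimnames = list(Transmission = c("Manual", "Automatic"),
                                   Fuel = c("Petrol", "Diesel")))

# Odds of petrol (vs. diesel) for manual cars divided by the same odds for automatic cars
or_sample <- (observed[1, 1] * observed[2, 2]) / (observed[1, 2] * observed[2, 1])

# Conditional maximum likelihood estimate reported by fisher.test()
or_fisher <- unname(fisher.test(observed)$estimate)

round(c(sample = or_sample, fisher = or_fisher), 3)
```

In the sample, the odds of a petrol engine are about twice as high for manual cars as for automatic cars, but the wide confidence interval above includes 1, which is consistent with not rejecting H0.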

Anej Levpuscek

2024-01-17
