Automobile Price Analysis Project

Date : 23/05/2024

Author : Omar Soub


Introduction


Data Set and Data types view


Data set (first six rows):

num_of_doors body_style drive_wheels engine_location length width height curb_weight num_of_cylinders engine_size fuel_system peak_rpm city_mpg highway_mpg price
two convertible rwd front 168.8 64.1 48.8 2548 four 130 mpfi 5000 21 27 13495
two convertible rwd front 168.8 64.1 48.8 2548 four 130 mpfi 5000 21 27 16500
two hatchback rwd front 171.2 65.5 52.4 2823 six 152 mpfi 5000 19 26 16500
four sedan fwd front 176.6 66.2 54.3 2337 four 109 mpfi 5500 24 30 13950
four sedan 4wd front 176.6 66.4 54.3 2824 five 136 mpfi 5500 18 22 17450
two sedan fwd front 177.3 66.3 53.1 2507 five 136 mpfi 5500 19 25 15250

Data set dimensions (rows, columns)

## [1] 205  15

Data types and some details using the glimpse function

## Rows: 205
## Columns: 15
## $ num_of_doors     <chr> "two", "two", "two", "four", "four", "two", "four", "…
## $ body_style       <chr> "convertible", "convertible", "hatchback", "sedan", "…
## $ drive_wheels     <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fwd…
## $ engine_location  <chr> "front", "front", "front", "front", "front", "front",…
## $ length           <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192.…
## $ width            <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4,…
## $ height           <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9,…
## $ curb_weight      <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086,…
## $ num_of_cylinders <chr> "four", "four", "six", "four", "five", "five", "five"…
## $ engine_size      <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 108…
## $ fuel_system      <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi…
## $ peak_rpm         <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500,…
## $ city_mpg         <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, 2…
## $ highway_mpg      <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, 2…
## $ price            <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 1892…

# df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],as.factor)                                
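The conversion shown in the line above turns every character column into a factor, which is consistent with the skim output below reporting six factor columns. A minimal sketch of that idiom in base R follows; `df_toy` and its values are illustrative stand-ins, not the report's data.

```r
# Toy data frame standing in for the real data (values are illustrative only)
df_toy <- data.frame(
  num_of_doors = c("two", "four", "four"),
  drive_wheels = c("rwd", "fwd", "4wd"),
  price        = c(13495L, 13950L, 17450L),
  stringsAsFactors = FALSE
)

# Find the character columns once, then convert each of them to a factor
is_chr <- sapply(df_toy, is.character)
df_toy[is_chr] <- lapply(df_toy[is_chr], as.factor)

# Numeric columns are left untouched
str(df_toy)
```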

Data types and some details using the skim function

Data summary
Name Piped data
Number of rows 205
Number of columns 15
_______________________
Column type frequency:
factor 6
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
num_of_doors 0 1 FALSE 2 fou: 114, two: 91
body_style 0 1 FALSE 5 sed: 96, hat: 70, wag: 25, har: 8
drive_wheels 0 1 FALSE 3 fwd: 120, rwd: 76, 4wd: 9
engine_location 0 1 FALSE 2 fro: 202, rea: 3
num_of_cylinders 0 1 FALSE 7 fou: 159, six: 24, fiv: 11, eig: 5
fuel_system 0 1 FALSE 8 mpf: 94, 2bb: 66, idi: 20, 1bb: 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
length 0 1 174.05 12.34 141.1 166.3 173.2 183.1 208.1 ▁▅▇▃▁
width 0 1 65.91 2.15 60.3 64.1 65.5 66.9 72.3 ▁▇▇▂▁
height 0 1 53.72 2.44 47.8 52.0 54.1 55.5 59.8 ▁▆▇▆▂
curb_weight 0 1 2555.57 520.68 1488.0 2145.0 2414.0 2935.0 4066.0 ▃▇▅▃▁
engine_size 0 1 126.91 41.64 61.0 97.0 120.0 141.0 326.0 ▇▆▂▁▁
peak_rpm 0 1 5125.12 477.09 4150.0 4800.0 5200.0 5500.0 6600.0 ▂▇▇▂▁
city_mpg 0 1 25.22 6.54 13.0 19.0 24.0 30.0 49.0 ▆▇▅▂▁
highway_mpg 0 1 30.75 6.89 16.0 25.0 30.0 34.0 54.0 ▂▇▆▁▁
price 0 1 13121.15 7908.96 5118.0 7775.0 10245.0 16500.0 45400.0 ▇▃▁▁▁

Define categorical variables

## [1] "num_of_doors"     "body_style"       "drive_wheels"     "engine_location" 
## [5] "num_of_cylinders" "fuel_system"

Define numerical variables

## [1] "length"      "width"       "height"      "curb_weight" "engine_size"
## [6] "peak_rpm"    "city_mpg"    "highway_mpg" "price"
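One way to obtain the two name vectors above is to select columns by type. The sketch below assumes the character columns have already been converted to factors; `df_toy`, `categorical_vars` and `numerical_vars` are hypothetical names, and only `numerical_data` appears later in the report.

```r
# Toy data frame (illustrative values only)
df_toy <- data.frame(
  body_style = factor(c("sedan", "wagon", "sedan")),
  length     = c(176.6, 192.7, 177.3),
  price      = c(13950L, 17710L, 15250L)
)

# Column names by type: factors are categorical, everything numeric is numerical
categorical_vars <- names(df_toy)[sapply(df_toy, is.factor)]
numerical_vars   <- names(df_toy)[sapply(df_toy, is.numeric)]

# Subset used for the numerical summaries and the correlation matrix
numerical_data <- df_toy[numerical_vars]

categorical_vars
numerical_vars
```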

Categorical variables EDA

Categorical variables distribution

num_of_doors body_style drive_wheels engine_location num_of_cylinders fuel_system
four:114 convertible: 6 4wd: 9 front:202 eight : 5 mpfi :94
two : 91 hardtop : 8 fwd:120 rear : 3 five : 11 2bbl :66
NA hatchback :70 rwd: 76 NA four :159 idi :20
NA sedan :96 NA NA six : 24 1bbl :11
NA wagon :25 NA NA three : 1 spdi : 9
NA NA NA NA twelve: 1 4bbl : 3
NA NA NA NA two : 4 (Other): 2

Categorical variables Visualization


Numerical variables EDA

Numerical variables distribution

length width height curb_weight engine_size peak_rpm city_mpg highway_mpg price
Min. :141.1 Min. :60.30 Min. :47.80 Min. :1488 Min. : 61.0 Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00 1st Qu.:2145 1st Qu.: 97.0 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7775
Median :173.2 Median :65.50 Median :54.10 Median :2414 Median :120.0 Median :5200 Median :24.00 Median :30.00 Median :10245
Mean :174.0 Mean :65.91 Mean :53.72 Mean :2556 Mean :126.9 Mean :5125 Mean :25.22 Mean :30.75 Mean :13121
3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50 3rd Qu.:2935 3rd Qu.:141.0 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16500
Max. :208.1 Max. :72.30 Max. :59.80 Max. :4066 Max. :326.0 Max. :6600 Max. :49.00 Max. :54.00 Max. :45400

Numerical variables Visualization


Visualizing the relation between the numerical variables:


Visualizing the relation between the price variable and the other numerical variables:

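The charts above suggest that the relationship between price and most of the other numerical variables is monotonic (consistently increasing or decreasing, but not necessarily linear). For that reason the correlations below use Spearman's rank correlation, which measures monotonic association, rather than Pearson's, which measures linear association.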


The correlation between the numerical variables:

cor(numerical_data,method = "spearman") %>% kable()
variable length width height curb_weight engine_size peak_rpm city_mpg highway_mpg price
length 1.0000000 0.8882006 0.5251484 0.8904148 0.7826164 -0.2696342 -0.6700131 -0.6979487 0.7911858
width 0.8882006 1.0000000 0.3502788 0.8638148 0.7706150 -0.1988831 -0.6876902 -0.7009986 0.7792643
height 0.5251484 0.3502788 1.0000000 0.3458525 0.1998109 -0.2989955 -0.0686248 -0.1325124 0.2673450
curb_weight 0.8904148 0.8638148 0.3458525 1.0000000 0.8777393 -0.2361785 -0.8129473 -0.8343846 0.8793285
engine_size 0.7826164 0.7706150 0.1998109 0.8777393 1.0000000 -0.2730466 -0.7300556 -0.7213416 0.8019996
peak_rpm -0.2696342 -0.1988831 -0.2989955 -0.2361785 -0.2730466 1.0000000 -0.1316799 -0.0575427 -0.0926102
city_mpg -0.6700131 -0.6876902 -0.0686248 -0.8129473 -0.7300556 -0.1316799 1.0000000 0.9677382 -0.7947509
highway_mpg -0.6979487 -0.7009986 -0.1325124 -0.8343846 -0.7213416 -0.0575427 0.9677382 1.0000000 -0.7985104
price 0.7911858 0.7792643 0.2673450 0.8793285 0.8019996 -0.0926102 -0.7947509 -0.7985104 1.0000000
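To illustrate why a rank-based coefficient suits monotonic but non-linear relationships like those in the matrix above, the sketch below compares Pearson and Spearman on synthetic data. The vectors `x` and `y` are made up for the illustration and are unrelated to the car data.

```r
# A strictly increasing but clearly non-linear relationship (synthetic)
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y, method = "pearson")   # penalised by the curvature
spearman <- cor(x, y, method = "spearman")  # depends only on the ordering

# Spearman's rho is Pearson's r computed on the ranks of the data
spearman_by_hand <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand)
```

Because `y` always increases with `x`, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.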


E. Inferential analysis

E.1: Price by number of doors

num_of_doors count mean sd min Q1 median Q3 max
four 114 13493.95 7393.148 5389 7996 11071.5 16886.25 40960
two 91 12654.12 8529.957 5118 7114 9895.0 15145.00 45400

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  price by num_of_doors
## W = 5980.5, p-value = 0.06022
## alternative hypothesis: true location shift is not equal to 0

As the p-value (0.06022) is greater than 0.05, we fail to reject the null hypothesis: there is no statistically significant difference in the price distribution between the two-door and four-door groups.
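A two-group comparison of this kind can be sketched with `wilcox.test` from base R. The data frame `toy` below is synthetic (log-normal prices drawn with a fixed seed), so its p-value has no connection to the result reported above. `exact = FALSE` is set on the assumption that the normal approximation with continuity correction, as named in the output above, is wanted.

```r
set.seed(1)

# Synthetic two-group data (illustrative only, not the report's data)
toy <- data.frame(
  num_of_doors = factor(rep(c("four", "two"), times = c(30, 25))),
  price = c(rlnorm(30, meanlog = 9.3, sdlog = 0.5),
            rlnorm(25, meanlog = 9.2, sdlog = 0.5))
)

# Group medians, for orientation
aggregate(price ~ num_of_doors, data = toy, FUN = median)

# Wilcoxon rank sum test: H0 is that the location shift between groups is 0
res <- wilcox.test(price ~ num_of_doors, data = toy,
                   exact = FALSE, correct = TRUE)

# Decision at the 5% level: reject H0 only when the p-value is below 0.05
alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"
c(p_value = signif(res$p.value, 4), decision = decision)
```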

E.2: Price by drive_wheels

drive_wheels count mean sd min Q1 median Q3 max
4wd 9 9847.000 3295.135 6695 7898.00 8778.0 11259.00 17450
fwd 120 9259.517 3376.005 5118 6933.00 8192.0 10407.50 23875
rwd 76 19606.184 9117.895 6785 13361.25 16872.5 22508.75 45400

## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by drive_wheels
## Kruskal-Wallis chi-squared = 91.896, df = 2, p-value < 2.2e-16

As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between at least two of the drive wheels groups.
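With more than two groups, the rank-based analogue is the Kruskal-Wallis test, available as `kruskal.test` in base R. The sketch below runs it on synthetic data in which one group is deliberately shifted upward; the group sizes and parameters are assumptions chosen for illustration.

```r
set.seed(2)

# Synthetic three-group data; the "rwd" group is drawn from a higher distribution
toy <- data.frame(
  drive_wheels = factor(rep(c("4wd", "fwd", "rwd"), times = c(9, 40, 30))),
  price = c(rlnorm(9,  meanlog = 9.1, sdlog = 0.3),
            rlnorm(40, meanlog = 9.1, sdlog = 0.3),
            rlnorm(30, meanlog = 9.8, sdlog = 0.4))
)

# H0: all groups share the same price distribution
kw <- kruskal.test(price ~ drive_wheels, data = toy)

kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom = number of groups - 1
kw$p.value
```

A significant result only says that at least one group differs; it does not say which one, so pairwise follow-up tests are needed.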

## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df$price and df$drive_wheels 
## 
##     4wd     fwd    
## fwd 1.00000 -      
## rwd 0.00056 < 2e-16
## 
## P value adjustment method: bonferroni

The pairwise comparisons show that the price distribution of the rwd group differs significantly from both other groups (adjusted p-value 0.00056 against 4wd and below 2e-16 against fwd). The 4wd and fwd groups do not differ significantly from each other (adjusted p-value 1).
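The pairwise follow-up can be sketched with `pairwise.wilcox.test`, which runs a Wilcoxon rank sum test for every pair of groups and adjusts the p-values. The data are synthetic, and the vector `raw` holds made-up p-values used only to show what the Bonferroni adjustment does.

```r
set.seed(3)

# Synthetic three-group data (illustrative only)
drive_wheels <- factor(rep(c("4wd", "fwd", "rwd"), times = c(9, 40, 30)))
price <- c(rlnorm(9,  meanlog = 9.1, sdlog = 0.3),
           rlnorm(40, meanlog = 9.1, sdlog = 0.3),
           rlnorm(30, meanlog = 9.8, sdlog = 0.4))

# All pairwise rank sum tests with Bonferroni-adjusted p-values
pw <- pairwise.wilcox.test(price, drive_wheels,
                           p.adjust.method = "bonferroni")
pw$p.value   # lower-triangular matrix, one cell per pair of groups

# Bonferroni multiplies each raw p-value by the number of comparisons
# and caps the result at 1
raw <- c(0.4, 0.0002, 1e-8)
adjusted <- p.adjust(raw, method = "bonferroni")
adjusted
```

Bonferroni is conservative: with many groups the multiplier grows quickly, which is one reason so many cells in the larger tables of this report equal 1.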

E.3: Price by body_style

body_style count mean sd min Q1 median Q3 max
convertible 6 21890.500 11187.802 11595 14246.25 17084.5 30709.25 37028
hardtop 8 22208.500 14555.521 8249 9341.50 19687.5 32903.00 45400
hatchback 70 9886.429 4111.640 5118 6564.00 8428.5 11848.75 22018
sedan 96 14369.531 8483.830 5389 7990.00 11078.5 17770.00 41315
wagon 25 12371.960 5120.949 6918 8013.00 11694.0 15750.00 28248
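A per-group summary with the same columns as the table above can be sketched in base R by splitting the price vector by group. The report itself may well use a different approach (for example a dplyr pipeline); `summarise_price` is a hypothetical helper and the values in `toy` are illustrative.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  body_style = factor(c("sedan", "sedan", "sedan",
                        "wagon", "wagon",
                        "hatchback", "hatchback", "hatchback")),
  price = c(13950, 17450, 15250, 12000, 9000, 6500, 7200, 8100)
)

# Hypothetical helper returning the statistics shown in the report's tables
summarise_price <- function(x) {
  c(count  = length(x),
    mean   = mean(x),
    sd     = sd(x),
    min    = min(x),
    Q1     = unname(quantile(x, 0.25)),
    median = median(x),
    Q3     = unname(quantile(x, 0.75)),
    max    = max(x))
}

# One row per group, one column per statistic
stats <- t(sapply(split(toy$price, toy$body_style), summarise_price))
stats
```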

## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by body_style
## Kruskal-Wallis chi-squared = 25.764, df = 4, p-value = 3.531e-05

As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between at least two of the body style groups.

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df$price and df$body_style 
## 
##           convertible hardtop hatchback sedan
## hardtop   1.000       -       -         -    
## hatchback 0.015       0.102   -         -    
## sedan     0.505       1.000   0.001     -    
## wagon     0.296       1.000   0.119     1.000
## 
## P value adjustment method: bonferroni

After Bonferroni adjustment, the price distributions differ significantly (adjusted p-value below 0.05) for the following pairs:
1. hatchback and convertible (adjusted p-value 0.015).
2. sedan and hatchback (adjusted p-value 0.001).

E.4: Price by fuel_system

fuel_system count mean sd min Q1 median Q3 max
1bbl 11 7555.545 1390.121 5399 6692.00 7295.0 8370.00 10295
2bbl 66 7514.894 1661.844 5118 6509.75 7324.0 8246.25 14869
4bbl 3 12145.000 1374.773 10945 11395.00 11845.0 12745.00 13645
idi 20 15838.150 7759.844 7099 9120.00 13852.5 19375.50 31600
mfi 1 12964.000 NA 12964 12964.00 12964.0 12964.00 12964
mpfi 94 17389.543 8694.507 6695 11331.50 15720.0 18845.00 45400
spdi 9 10990.444 2741.731 7689 9279.00 9959.0 12764.00 14869
spfi 1 11048.000 NA 11048 11048.00 11048.0 11048.00 11048

## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by fuel_system
## Kruskal-Wallis chi-squared = 118.43, df = 7, p-value < 2.2e-16

As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between at least two of the fuel_system groups.

## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df$price and df$fuel_system 
## 
##      1bbl    2bbl    4bbl   idi    mfi    mpfi   spdi  
## 2bbl 1.0000  -       -      -      -      -      -     
## 4bbl 0.3537  0.1518  -      -      -      -      -     
## idi  0.0085  3.4e-06 1.0000 -      -      -      -     
## mfi  1.0000  1.0000  1.0000 1.0000 -      -      -     
## mpfi 1.9e-05 < 2e-16 1.0000 1.0000 1.0000 -      -     
## spdi 0.0847  0.0056  1.0000 1.0000 1.0000 0.1319 -     
## spfi 1.0000  1.0000  1.0000 1.0000 1.0000 1.0000 1.0000
## 
## P value adjustment method: bonferroni

After Bonferroni adjustment, the price distributions differ significantly (adjusted p-value below 0.05) for the following pairs:
1. idi group with 1bbl and 2bbl groups.
2. mpfi group with 1bbl and 2bbl groups.
3. spdi group with 2bbl group.

E.5: Price by num_of_cylinders

num_of_cylinders count mean sd min Q1 median Q3 max
eight 5 32769.80 14449.028 8249 34184.0 35056.0 40960.0 45400
five 11 20615.55 7540.322 6695 16350.0 18920.0 26864.0 31600
four 159 10301.01 3954.715 5118 7324.0 8949.0 12534.5 22625
six 24 23671.83 8850.138 13499 16374.5 21037.5 32319.5 41315
three 1 5151.00 NA 5151 5151.0 5151.0 5151.0 5151
twelve 1 36000.00 NA 36000 36000.0 36000.0 36000.0 36000
two 4 13020.00 2079.062 10945 11620.0 12745.0 14145.0 15645

## 
##  Kruskal-Wallis rank sum test
## 
## data:  price by num_of_cylinders
## Kruskal-Wallis chi-squared = 72.995, df = 6, p-value = 9.92e-14

As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between at least two of the num_of_cylinders groups.

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  df$price and df$num_of_cylinders 
## 
##        eight   five    four    six     three   twelve 
## five   1.00000 -       -       -       -       -      
## four   0.07531 0.00073 -       -       -       -      
## six    1.00000 1.00000 7.7e-11 -       -       -      
## three  1.00000 1.00000 1.00000 1.00000 -       -      
## twelve 1.00000 1.00000 1.00000 1.00000 1.00000 -      
## two    1.00000 1.00000 1.00000 0.11047 1.00000 1.00000
## 
## P value adjustment method: bonferroni

After Bonferroni adjustment, the price distributions differ significantly (adjusted p-value below 0.05) for the following pairs:
1. four group with five group.
2. six group with four group.

We exclude the three-cylinder and twelve-cylinder groups from these comparisons, as each contains only one car, which is not enough to support any conclusion.
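One possible way to apply such an exclusion before testing is to drop groups below a minimum size and then rerun the pairwise comparisons. The sketch below uses synthetic data; the threshold `min_n` is a hypothetical choice, not a value taken from the report.

```r
set.seed(4)

# Synthetic data with two single-observation groups (illustrative only)
cyl <- factor(c(rep("four", 20), rep("six", 10), rep("five", 6),
                "three", "twelve"))
price <- c(rlnorm(20, meanlog = 9.1,  sdlog = 0.3),
           rlnorm(10, meanlog = 10.0, sdlog = 0.3),
           rlnorm(6,  meanlog = 9.9,  sdlog = 0.3),
           5151, 36000)

# Hypothetical threshold: keep only groups with at least this many cars
min_n  <- 2
counts <- table(cyl)
keep   <- cyl %in% names(counts)[counts >= min_n]

# Drop the small groups and their now-unused factor levels
cyl_kept   <- droplevels(cyl[keep])
price_kept <- price[keep]

pw <- pairwise.wilcox.test(price_kept, cyl_kept,
                           p.adjust.method = "bonferroni")
pw$p.value
```

Removing groups before testing reduces the number of comparisons, so the Bonferroni-adjusted p-values would differ from those in the table above, which was computed with all seven groups.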


Final Conclusions: