Automobile Price Analysis Project
Date : 23/05/2024
- Introduction
- Data Set and Data types view
- Categorical variables EDA
- Numerical variables EDA
- Visualizing The Relation between The numerical variables :
- Visualizing The Relation between The Price variable and other numerical variables :
- The correlation between The numerical variables :
- E. inferential analysis
- Final Conclusions:
Introduction
- In this chapter, we use the R programming language to carry out an inferential analysis of the data in support of decision making.
Data Set and Data types view
Data Set (1st five rows) :
| num_of_doors | body_style | drive_wheels | engine_location | length | width | height | curb_weight | num_of_cylinders | engine_size | fuel_system | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| two | convertible | rwd | front | 168.8 | 64.1 | 48.8 | 2548 | four | 130 | mpfi | 5000 | 21 | 27 | 13495 |
| two | convertible | rwd | front | 168.8 | 64.1 | 48.8 | 2548 | four | 130 | mpfi | 5000 | 21 | 27 | 16500 |
| two | hatchback | rwd | front | 171.2 | 65.5 | 52.4 | 2823 | six | 152 | mpfi | 5000 | 19 | 26 | 16500 |
| four | sedan | fwd | front | 176.6 | 66.2 | 54.3 | 2337 | four | 109 | mpfi | 5500 | 24 | 30 | 13950 |
| four | sedan | 4wd | front | 176.6 | 66.4 | 54.3 | 2824 | five | 136 | mpfi | 5500 | 18 | 22 | 17450 |
| two | sedan | fwd | front | 177.3 | 66.3 | 53.1 | 2507 | five | 136 | mpfi | 5500 | 19 | 25 | 15250 |
data set dimensions
## [1] 205 15
Data types and some details using the glimpse function
## Rows: 205
## Columns: 15
## $ num_of_doors <chr> "two", "two", "two", "four", "four", "two", "four", "…
## $ body_style <chr> "convertible", "convertible", "hatchback", "sedan", "…
## $ drive_wheels <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fwd…
## $ engine_location <chr> "front", "front", "front", "front", "front", "front",…
## $ length <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192.…
## $ width <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4,…
## $ height <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9,…
## $ curb_weight <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086,…
## $ num_of_cylinders <chr> "four", "four", "six", "four", "five", "five", "five"…
## $ engine_size <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 108…
## $ fuel_system <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi…
## $ peak_rpm <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500,…
## $ city_mpg <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, 2…
## $ highway_mpg <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, 2…
## $ price <int> 13495, 16500, 16500, 13950, 17450, 15250, 17710, 1892…
- Convert variables from the character data type to factor
# df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
Data types and some details using skim function
| Name | Piped data |
| Number of rows | 205 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| num_of_doors | 0 | 1 | FALSE | 2 | fou: 114, two: 91 |
| body_style | 0 | 1 | FALSE | 5 | sed: 96, hat: 70, wag: 25, har: 8 |
| drive_wheels | 0 | 1 | FALSE | 3 | fwd: 120, rwd: 76, 4wd: 9 |
| engine_location | 0 | 1 | FALSE | 2 | fro: 202, rea: 3 |
| num_of_cylinders | 0 | 1 | FALSE | 7 | fou: 159, six: 24, fiv: 11, eig: 5 |
| fuel_system | 0 | 1 | FALSE | 8 | mpf: 94, 2bb: 66, idi: 20, 1bb: 11 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| length | 0 | 1 | 174.05 | 12.34 | 141.1 | 166.3 | 173.2 | 183.1 | 208.1 |
| width | 0 | 1 | 65.91 | 2.15 | 60.3 | 64.1 | 65.5 | 66.9 | 72.3 |
| height | 0 | 1 | 53.72 | 2.44 | 47.8 | 52.0 | 54.1 | 55.5 | 59.8 |
| curb_weight | 0 | 1 | 2555.57 | 520.68 | 1488.0 | 2145.0 | 2414.0 | 2935.0 | 4066.0 |
| engine_size | 0 | 1 | 126.91 | 41.64 | 61.0 | 97.0 | 120.0 | 141.0 | 326.0 |
| peak_rpm | 0 | 1 | 5125.12 | 477.09 | 4150.0 | 4800.0 | 5200.0 | 5500.0 | 6600.0 |
| city_mpg | 0 | 1 | 25.22 | 6.54 | 13.0 | 19.0 | 24.0 | 30.0 | 49.0 |
| highway_mpg | 0 | 1 | 30.75 | 6.89 | 16.0 | 25.0 | 30.0 | 34.0 | 54.0 |
| price | 0 | 1 | 13121.15 | 7908.96 | 5118.0 | 7775.0 | 10245.0 | 16500.0 | 45400.0 |
Define categorical variables
## [1] "num_of_doors" "body_style" "drive_wheels" "engine_location"
## [5] "num_of_cylinders" "fuel_system"
Define numerical variables
## [1] "length" "width" "height" "curb_weight" "engine_size"
## [6] "peak_rpm" "city_mpg" "highway_mpg" "price"
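The type conversion and the two variable lists above could be produced along the following lines. This is only a sketch: the toy data frame copies three columns from the first rows shown earlier, and the names `categorical_vars` and `numerical_vars` are assumptions, since the report does not show the code behind these outputs.

```r
# Toy data frame: three columns taken from the first six rows shown earlier.
df <- data.frame(
  num_of_doors = c("two", "two", "two", "four", "four", "two"),
  drive_wheels = c("rwd", "rwd", "rwd", "fwd", "4wd", "fwd"),
  price        = c(13495L, 16500L, 16500L, 13950L, 17450L, 15250L),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor.
is_chr <- sapply(df, is.character)
df[is_chr] <- lapply(df[is_chr], as.factor)

# Derive the variable lists from the column classes
# (hypothetical names, not taken from the report).
categorical_vars <- names(df)[sapply(df, is.factor)]
numerical_vars   <- names(df)[sapply(df, is.numeric)]

print(categorical_vars)
print(numerical_vars)
```

Deriving the lists from the column classes keeps them in sync with the data if columns are added or removed later.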
Categorical variables EDA
Categorical variables distribution
| num_of_doors | body_style | drive_wheels | engine_location | num_of_cylinders | fuel_system |
|---|---|---|---|---|---|
| four:114 | convertible: 6 | 4wd: 9 | front:202 | eight : 5 | mpfi :94 |
| two : 91 | hardtop : 8 | fwd:120 | rear : 3 | five : 11 | 2bbl :66 |
| NA | hatchback :70 | rwd: 76 | NA | four :159 | idi :20 |
| NA | sedan :96 | NA | NA | six : 24 | 1bbl :11 |
| NA | wagon :25 | NA | NA | three : 1 | spdi : 9 |
| NA | NA | NA | NA | twelve: 1 | 4bbl : 3 |
| NA | NA | NA | NA | two : 4 | (Other): 2 |
Categorical variables Visualization
Numerical variables EDA
Numerical variables distribution
| length | width | height | curb_weight | engine_size | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|
| Min. :141.1 | Min. :60.30 | Min. :47.80 | Min. :1488 | Min. : 61.0 | Min. :4150 | Min. :13.00 | Min. :16.00 | Min. : 5118 |
| 1st Qu.:166.3 | 1st Qu.:64.10 | 1st Qu.:52.00 | 1st Qu.:2145 | 1st Qu.: 97.0 | 1st Qu.:4800 | 1st Qu.:19.00 | 1st Qu.:25.00 | 1st Qu.: 7775 |
| Median :173.2 | Median :65.50 | Median :54.10 | Median :2414 | Median :120.0 | Median :5200 | Median :24.00 | Median :30.00 | Median :10245 |
| Mean :174.0 | Mean :65.91 | Mean :53.72 | Mean :2556 | Mean :126.9 | Mean :5125 | Mean :25.22 | Mean :30.75 | Mean :13121 |
| 3rd Qu.:183.1 | 3rd Qu.:66.90 | 3rd Qu.:55.50 | 3rd Qu.:2935 | 3rd Qu.:141.0 | 3rd Qu.:5500 | 3rd Qu.:30.00 | 3rd Qu.:34.00 | 3rd Qu.:16500 |
| Max. :208.1 | Max. :72.30 | Max. :59.80 | Max. :4066 | Max. :326.0 | Max. :6600 | Max. :49.00 | Max. :54.00 | Max. :45400 |
Numerical variables Visualization
Visualizing The Relation between The numerical variables :
Visualizing The Relation between The Price variable and other numerical variables :
We can see from the charts above that the relationships between the variables are monotonic (consistently increasing or decreasing, but not necessarily linear).
The correlation between The numerical variables :
- As the data are not normally distributed, we apply the Spearman method
cor(numerical_data, method = "spearman") %>% kable()
| | length | width | height | curb_weight | engine_size | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|
| length | 1.0000000 | 0.8882006 | 0.5251484 | 0.8904148 | 0.7826164 | -0.2696342 | -0.6700131 | -0.6979487 | 0.7911858 |
| width | 0.8882006 | 1.0000000 | 0.3502788 | 0.8638148 | 0.7706150 | -0.1988831 | -0.6876902 | -0.7009986 | 0.7792643 |
| height | 0.5251484 | 0.3502788 | 1.0000000 | 0.3458525 | 0.1998109 | -0.2989955 | -0.0686248 | -0.1325124 | 0.2673450 |
| curb_weight | 0.8904148 | 0.8638148 | 0.3458525 | 1.0000000 | 0.8777393 | -0.2361785 | -0.8129473 | -0.8343846 | 0.8793285 |
| engine_size | 0.7826164 | 0.7706150 | 0.1998109 | 0.8777393 | 1.0000000 | -0.2730466 | -0.7300556 | -0.7213416 | 0.8019996 |
| peak_rpm | -0.2696342 | -0.1988831 | -0.2989955 | -0.2361785 | -0.2730466 | 1.0000000 | -0.1316799 | -0.0575427 | -0.0926102 |
| city_mpg | -0.6700131 | -0.6876902 | -0.0686248 | -0.8129473 | -0.7300556 | -0.1316799 | 1.0000000 | 0.9677382 | -0.7947509 |
| highway_mpg | -0.6979487 | -0.7009986 | -0.1325124 | -0.8343846 | -0.7213416 | -0.0575427 | 0.9677382 | 1.0000000 | -0.7985104 |
| price | 0.7911858 | 0.7792643 | 0.2673450 | 0.8793285 | 0.8019996 | -0.0926102 | -0.7947509 | -0.7985104 | 1.0000000 |
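To illustrate why a rank-based coefficient suits monotonic but non-linear relationships, the sketch below compares Pearson and Spearman on a small made-up example (not the report's data). It also checks the standard identity that Spearman's rho equals Pearson's r computed on the ranks.

```r
# Monotonic but non-linear toy relationship (illustrative values only).
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's rho is Pearson's r applied to the ranks of each variable.
spearman_by_hand <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand))
```

Because `y` always increases with `x`, the ranks line up perfectly and Spearman reports a perfect association, while Pearson is pulled below 1 by the curvature.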
- Visualizing The correlation
E. inferential analysis
E.1: Price by number of doors
- E.1.1 : Price by number of doors-Basic description
| num_of_doors | count | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| four | 114 | 13493.95 | 7393.148 | 5389 | 7996 | 11071.5 | 16886.25 | 40960 |
| two | 91 | 12654.12 | 8529.957 | 5118 | 7114 | 9895.0 | 15145.00 | 45400 |
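A per-group description table of this shape can be built in base R as sketched below. The six prices are the first rows shown at the start of the report, so the numbers will not match the full-data table above. The helper name `describe` is an assumption; the report's own code is not shown.

```r
# First six rows of the data set (num_of_doors and price only).
toy <- data.frame(
  num_of_doors = c("two", "two", "two", "four", "four", "two"),
  price        = c(13495, 16500, 16500, 13950, 17450, 15250)
)

# Hypothetical helper: the summary statistics used in the description tables.
describe <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  c(count = length(x), mean = mean(x), sd = sd(x), min = min(x),
    Q1 = q[1], median = q[2], Q3 = q[3], max = max(x))
}

# One row per group: split the prices by group, summarise, stack the rows.
price_by_doors <- do.call(rbind, lapply(split(toy$price, toy$num_of_doors), describe))
print(price_by_doors)
```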
- E.1.2 : Applying wilcox.test (for variables with two groups) to find out whether the price distribution differs significantly between the number-of-doors groups:
##
## Wilcoxon rank sum test with continuity correction
##
## data: price by num_of_doors
## W = 5980.5, p-value = 0.06022
## alternative hypothesis: true location shift is not equal to 0
As the p-value is greater than 0.05, we fail to reject the null hypothesis: the test gives no statistically significant evidence that the price distribution differs between the number-of-doors groups.
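The test and the decision rule can be sketched on a small example. The prices below are made up for illustration, and `alpha` and `decision` are hypothetical names. The sketch also recomputes the statistic by hand to show that the `W` reported by `wilcox.test` is the Mann-Whitney U of the first factor level.

```r
# Illustrative prices only (made-up values, not rows from the data set).
toy <- data.frame(
  num_of_doors = factor(rep(c("two", "four"), each = 4)),
  price        = c(5100, 7100, 9900, 15100, 5400, 8000, 11100, 16900)
)

# Two-sided rank sum test; exact = FALSE requests the normal approximation
# with continuity correction, the variant named in the output above.
res <- wilcox.test(price ~ num_of_doors, data = toy, exact = FALSE, correct = TRUE)

# Mann-Whitney U by hand: rank all prices together, sum the ranks of the
# first factor level ("four"), then subtract n1 * (n1 + 1) / 2.
is_four   <- toy$num_of_doors == "four"
n1        <- sum(is_four)
u_by_hand <- sum(rank(toy$price)[is_four]) - n1 * (n1 + 1) / 2

# Decision at the 5% level: a large p-value means we fail to reject H0.
alpha    <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

print(res$statistic)
print(u_by_hand)
print(decision)
```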
- E.1.3 : Visualizing the statistical test
For features with two categories we use the Mann-Whitney U test. It is equivalent to the Wilcoxon rank sum test with continuity correction that wilcox.test() performs on two unpaired groups. The Wilcoxon signed rank test is not suitable here, because it is meant for paired samples.
E.2: Price by drive_wheels
- E.2.1 : Price by drive_wheels-Basic description
| drive_wheels | count | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| 4wd | 9 | 9847.000 | 3295.135 | 6695 | 7898.00 | 8778.0 | 11259.00 | 17450 |
| fwd | 120 | 9259.517 | 3376.005 | 5118 | 6933.00 | 8192.0 | 10407.50 | 23875 |
| rwd | 76 | 19606.184 | 9117.895 | 6785 | 13361.25 | 16872.5 | 22508.75 | 45400 |
- E.2.2 : Applying kruskal.test (for variables with more than two groups) to find out whether the price distribution differs significantly between the drive-wheels groups:
##
## Kruskal-Wallis rank sum test
##
## data: price by drive_wheels
## Kruskal-Wallis chi-squared = 91.896, df = 2, p-value < 2.2e-16
As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between the drive-wheels groups.
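As a rough illustration of what `kruskal.test` computes, the sketch below runs it on made-up prices for three groups and recomputes the statistic from the rank sums, using the standard formula for data without ties: H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1). The data and variable names are assumptions for illustration only.

```r
# Illustrative prices only (made-up values with no ties).
toy <- data.frame(
  drive_wheels = factor(rep(c("4wd", "fwd", "rwd"), each = 5)),
  price = c(6700, 7900, 8800, 11300, 12400,
            5100, 6900, 8200, 10400, 23900,
            13400, 16900, 22500, 30000, 45400)
)

res <- kruskal.test(price ~ drive_wheels, data = toy)

# Kruskal-Wallis H by hand (valid as written because there are no ties).
r         <- rank(toy$price)
n_total   <- nrow(toy)
rank_sums <- tapply(r, toy$drive_wheels, sum)
group_n   <- tapply(r, toy$drive_wheels, length)
h_by_hand <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / group_n) -
  3 * (n_total + 1)

# The p-value comes from a chi-squared distribution with (groups - 1) df.
p_by_hand <- pchisq(h_by_hand, df = nlevels(toy$drive_wheels) - 1, lower.tail = FALSE)

print(res)
print(c(h_by_hand = h_by_hand, p_by_hand = p_by_hand))
```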
- E.2.3 : Visualizing the statistical test
- E.2.4 : To locate the differences in detail, we apply pairwise.wilcox.test
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: df$price and df$drive_wheels
##
## 4wd fwd
## fwd 1.00000 -
## rwd 0.00056 < 2e-16
##
## P value adjustment method: bonferroni
As we can see, the price distribution of the rwd group differs significantly from that of each of the other two groups (4wd, fwd), as the adjusted p-values are less than 0.05. Between the 4wd and fwd groups there is no significant difference, as the adjusted p-value is greater than 0.05.
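The Bonferroni adjustment used here multiplies each raw p-value by the number of pairwise comparisons and caps the result at 1. The sketch below checks this on made-up prices (not the report's data); the variable names are assumptions.

```r
# Illustrative prices only (made-up values with no ties).
toy <- data.frame(
  drive_wheels = factor(rep(c("4wd", "fwd", "rwd"), each = 5)),
  price = c(6700, 7900, 8800, 11300, 12400,
            5100, 6900, 8200, 10400, 23900,
            13400, 16900, 22500, 30000, 45400)
)

# All pairwise rank sum tests with Bonferroni-adjusted p-values.
pw <- pairwise.wilcox.test(toy$price, toy$drive_wheels,
                           p.adjust.method = "bonferroni")
print(pw$p.value)

# Bonferroni by hand for the fwd vs rwd pair.
pair          <- droplevels(subset(toy, drive_wheels %in% c("fwd", "rwd")))
raw_p         <- wilcox.test(price ~ drive_wheels, data = pair)$p.value
n_comparisons <- choose(nlevels(toy$drive_wheels), 2)  # 3 groups give 3 pairs
adj_by_hand   <- min(1, raw_p * n_comparisons)
print(adj_by_hand)
```

Bonferroni is conservative: with many groups, the multiplier grows quickly, which is one reason several adjusted p-values in tables like these are reported as exactly 1.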
E.3: Price by body_style
- E.3.1 : Price by body_style-Basic description
| body_style | count | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| convertible | 6 | 21890.500 | 11187.802 | 11595 | 14246.25 | 17084.5 | 30709.25 | 37028 |
| hardtop | 8 | 22208.500 | 14555.521 | 8249 | 9341.50 | 19687.5 | 32903.00 | 45400 |
| hatchback | 70 | 9886.429 | 4111.640 | 5118 | 6564.00 | 8428.5 | 11848.75 | 22018 |
| sedan | 96 | 14369.531 | 8483.830 | 5389 | 7990.00 | 11078.5 | 17770.00 | 41315 |
| wagon | 25 | 12371.960 | 5120.949 | 6918 | 8013.00 | 11694.0 | 15750.00 | 28248 |
- E.3.2 : Applying kruskal.test (for variables with more than two groups) to find out whether the price distribution differs significantly between the body_style groups:
##
## Kruskal-Wallis rank sum test
##
## data: price by body_style
## Kruskal-Wallis chi-squared = 25.764, df = 4, p-value = 3.531e-05
As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between the body style groups.
- E.3.3 : Visualizing the statistical test
- E.3.4 : To locate the differences in detail, we apply pairwise.wilcox.test
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: df$price and df$body_style
##
## convertible hardtop hatchback sedan
## hardtop 1.000 - - -
## hatchback 0.015 0.102 - -
## sedan 0.505 1.000 0.001 -
## wagon 0.296 1.000 0.119 1.000
##
## P value adjustment method: bonferroni
As we can see, the price distributions differ significantly (adjusted p-value less than 0.05) when we compare:
1. the hatchback group with the convertible group.
2. the sedan group with the hatchback group.
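The significant pairs can also be read off programmatically instead of by eye. The sketch below re-enters the adjusted p-values from the body_style output above as a matrix and lists every pair below 0.05. In practice the matrix would come from the `p.value` element of the `pairwise.wilcox.test` result; the names `p_adj` and `significant` are assumptions.

```r
# Adjusted p-values copied from the pairwise output above (lower triangle only).
p_adj <- matrix(
  c(1.000,    NA,    NA,    NA,
    0.015, 0.102,    NA,    NA,
    0.505, 1.000, 0.001,    NA,
    0.296, 1.000, 0.119, 1.000),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("hardtop", "hatchback", "sedan", "wagon"),
                  c("convertible", "hardtop", "hatchback", "sedan"))
)

# Row/column positions of all cells below the 5% level (NA cells are skipped).
idx <- which(p_adj < 0.05, arr.ind = TRUE)

significant <- data.frame(
  group_1    = rownames(p_adj)[idx[, "row"]],
  group_2    = colnames(p_adj)[idx[, "col"]],
  p_adjusted = p_adj[idx],
  stringsAsFactors = FALSE
)
print(significant)
```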
E.4: Price by fuel_system
- E.4.1 : Price by fuel_system-Basic description
| fuel_system | count | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| 1bbl | 11 | 7555.545 | 1390.121 | 5399 | 6692.00 | 7295.0 | 8370.00 | 10295 |
| 2bbl | 66 | 7514.894 | 1661.844 | 5118 | 6509.75 | 7324.0 | 8246.25 | 14869 |
| 4bbl | 3 | 12145.000 | 1374.773 | 10945 | 11395.00 | 11845.0 | 12745.00 | 13645 |
| idi | 20 | 15838.150 | 7759.844 | 7099 | 9120.00 | 13852.5 | 19375.50 | 31600 |
| mfi | 1 | 12964.000 | NA | 12964 | 12964.00 | 12964.0 | 12964.00 | 12964 |
| mpfi | 94 | 17389.543 | 8694.507 | 6695 | 11331.50 | 15720.0 | 18845.00 | 45400 |
| spdi | 9 | 10990.444 | 2741.731 | 7689 | 9279.00 | 9959.0 | 12764.00 | 14869 |
| spfi | 1 | 11048.000 | NA | 11048 | 11048.00 | 11048.0 | 11048.00 | 11048 |
- E.4.2 : Applying kruskal.test (for variables with more than two groups) to find out whether the price distribution differs significantly between the fuel_system groups:
##
## Kruskal-Wallis rank sum test
##
## data: price by fuel_system
## Kruskal-Wallis chi-squared = 118.43, df = 7, p-value < 2.2e-16
As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between the fuel_system groups.
- E.4.3 : Visualizing the statistical test
- E.4.4 : To locate the differences in detail, we apply pairwise.wilcox.test
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: df$price and df$fuel_system
##
## 1bbl 2bbl 4bbl idi mfi mpfi spdi
## 2bbl 1.0000 - - - - - -
## 4bbl 0.3537 0.1518 - - - - -
## idi 0.0085 3.4e-06 1.0000 - - - -
## mfi 1.0000 1.0000 1.0000 1.0000 - - -
## mpfi 1.9e-05 < 2e-16 1.0000 1.0000 1.0000 - -
## spdi 0.0847 0.0056 1.0000 1.0000 1.0000 0.1319 -
## spfi 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
##
## P value adjustment method: bonferroni
As we can see, the price distributions differ significantly (adjusted p-value less than 0.05) when we compare:
1. the idi group with the 1bbl and 2bbl groups.
2. the mpfi group with the 1bbl and 2bbl groups.
3. the spdi group with the 2bbl group.
E.5: Price by num_of_cylinders
- E.5.1 : Price by num_of_cylinders-Basic description
| num_of_cylinders | count | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| eight | 5 | 32769.80 | 14449.028 | 8249 | 34184.0 | 35056.0 | 40960.0 | 45400 |
| five | 11 | 20615.55 | 7540.322 | 6695 | 16350.0 | 18920.0 | 26864.0 | 31600 |
| four | 159 | 10301.01 | 3954.715 | 5118 | 7324.0 | 8949.0 | 12534.5 | 22625 |
| six | 24 | 23671.83 | 8850.138 | 13499 | 16374.5 | 21037.5 | 32319.5 | 41315 |
| three | 1 | 5151.00 | NA | 5151 | 5151.0 | 5151.0 | 5151.0 | 5151 |
| twelve | 1 | 36000.00 | NA | 36000 | 36000.0 | 36000.0 | 36000.0 | 36000 |
| two | 4 | 13020.00 | 2079.062 | 10945 | 11620.0 | 12745.0 | 14145.0 | 15645 |
- E.5.2 : Applying kruskal.test (for variables with more than two groups) to find out whether the price distribution differs significantly between the num_of_cylinders groups:
##
## Kruskal-Wallis rank sum test
##
## data: price by num_of_cylinders
## Kruskal-Wallis chi-squared = 72.995, df = 6, p-value = 9.92e-14
As the p-value is less than 0.05, we reject the null hypothesis: the price distribution differs significantly between the num_of_cylinders groups.
- E.5.3 : Visualizing the statistical test
- E.5.4 : To locate the differences in detail, we apply pairwise.wilcox.test
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: df$price and df$num_of_cylinders
##
## eight five four six three twelve
## five 1.00000 - - - - -
## four 0.07531 0.00073 - - - -
## six 1.00000 1.00000 7.7e-11 - - -
## three 1.00000 1.00000 1.00000 1.00000 - -
## twelve 1.00000 1.00000 1.00000 1.00000 1.00000 -
## two 1.00000 1.00000 1.00000 0.11047 1.00000 1.00000
##
## P value adjustment method: bonferroni
As we can see, the price distributions differ significantly (adjusted p-value less than 0.05) when we compare:
1. the four-cylinder group with the five-cylinder group.
2. the six-cylinder group with the four-cylinder group.
We exclude the three- and twelve-cylinder groups from these conclusions, because each contains only one car, which is too few samples for a meaningful comparison.
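One way to drop such under-sampled groups before testing is sketched below. The threshold `min_n` is a hypothetical choice, not one stated in the report. The three- and twelve-cylinder prices are the single observations from the table above; the other prices are made up.

```r
# Toy data: one car each for "three" and "twelve", several for the others.
toy <- data.frame(
  num_of_cylinders = factor(c("four", "four", "four", "six", "six", "six",
                              "three", "twelve")),
  price = c(7300, 8900, 12500, 16400, 21000, 32300, 5151, 36000)
)

min_n <- 2  # hypothetical minimum group size (an assumption)

# Keep only the groups that reach the minimum size, then drop unused levels
# so that later tests do not see empty groups.
counts   <- table(toy$num_of_cylinders)
keep     <- names(counts)[counts >= min_n]
filtered <- droplevels(toy[toy$num_of_cylinders %in% keep, ])

print(table(filtered$num_of_cylinders))
```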
Final Conclusions:
1. The number of doors (two vs. four) does not significantly affect the price distribution of the automobiles in the data set.
2. The type of drive wheels (front-wheel, rear-wheel, or four-wheel drive) significantly affects the price distribution of the automobiles in the data set: the rwd group differs from both the 4wd and fwd groups, while the 4wd and fwd groups do not differ significantly from each other.
3. The body style of automobiles (e.g., sedan, hatchback, wagon) significantly affects the price distribution of the automobiles in the data set, specifically between:
A. the hatchback group and the convertible group.
B. the sedan group and the hatchback group.
4. The type of fuel system in automobiles (e.g., fuel injection, carburetor) significantly affects the price distribution of the automobiles in the data set, specifically between:
A. the idi group and the 1bbl and 2bbl groups.
B. the mpfi group and the 1bbl and 2bbl groups.
C. the spdi group and the 2bbl group.
5. The number of cylinders in an automobile's engine significantly affects the price distribution of the automobiles in the data set, specifically between:
A. the four-cylinder group and the five-cylinder group.
B. the six-cylinder group and the four-cylinder group.