-Duong Nguyen Anh Tuan (leader) 10622062
-Nguyễn Thành Tài 10622036
-Huỳnh Quang Khải 10622017
-Đoàn Trần Bách Việt 10322031
-Lý Lê Phương Dung 10622054
-Trần Minh Quân 10622033
If you are planning to buy a car, you may be wondering which model to choose, where it was manufactured, and which brand to pick, without knowing which factors significantly affect the price. In this report, our team uses statistical methods to give a detailed view of the factors affecting car prices. Prices differ between car models, and within each model other factors such as the number of seats, the origin, or the year of manufacture may also affect the price. This report aims to give readers an easy-to-understand view of the dataset and of what each analysis shows.
We obtained this dataset from Kaggle.com. It has 8 columns and 200 rows. In the original data, the author recorded factors including origin, model, year of manufacture, and the specific price of each car. We therefore use those same factors to analyze car prices in our reduced dataset.
We now apply one-way ANOVA and the Tukey method to our dataset to find out whether factors such as the number of seats, the place of assembly, the year of manufacture, or the model affect price.
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 = µ4 = … = µ9 versus HA: at least one pair of means differs
µ1: the mean price of Coupe
µ2: the mean price of Crossover
µ3: the mean price of Hatchback
µ4: the mean price of Pickup Truck
µ5: the mean price of Sedan
µ6: the mean price of SUV
µ7: the mean price of Truck
µ8: the mean price of Van/Minivan
µ9: the mean price of Wagon
We use R, a programming language, to carry out the calculations. The code is as follows:
library(readxl)
dataset <- read_excel("dataset.xlsx")
one.way <- aov(PRICE ~ factor(MODEL), data = dataset)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(MODEL) 8 6.301e+19 7.876e+18 3.976 0.000225 ***
## Residuals 188 3.724e+20 1.981e+18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 3.976
P-value:
p-value = 0.000225 < 0.05
Since the p-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level. Thus, there is sufficient evidence that the car model has a significant effect on price.
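As a cross-check, the F statistic and p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. This is a minimal sketch: `anova_from_table` is a hypothetical helper introduced here for illustration, and the inputs are the rounded values from the table, so the result agrees only up to rounding.

```r
# Hypothetical helper (not part of the report's code): rebuild F and its
# p-value from the sums of squares and degrees of freedom of an ANOVA table.
anova_from_table <- function(ss_between, df_between, ss_resid, df_resid) {
  ms_between <- ss_between / df_between   # mean square for the factor
  ms_resid   <- ss_resid / df_resid       # mean square for the residuals
  f_value    <- ms_between / ms_resid
  # Upper-tail probability of the F distribution
  p_value    <- pf(f_value, df_between, df_resid, lower.tail = FALSE)
  list(f_value = f_value, p_value = p_value)
}

# Rounded values taken from the factor(MODEL) table above
model_test <- anova_from_table(6.301e19, 8, 3.724e20, 188)
print(model_test)
```

The same helper can be applied to any other one-way ANOVA table by passing its sums of squares and degrees of freedom.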
We further conduct the Tukey multiple comparison procedure to discover which μi are different and by how much. The R code is as follows:
TukeyHSD(one.way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PRICE ~ factor(MODEL), data = dataset)
##
## $`factor(MODEL)`
## diff lwr upr p adj
## Crossover-Coupe -6016666667 -10524636014 -1508697319 0.0014020
## Hatchback-Coupe -6192812500 -10745638369 -1639986631 0.0010228
## Pickuptruck-Coupe -5911600000 -10437565305 -1385634695 0.0019811
## Sedan-Coupe -5564514286 -10044058207 -1084970365 0.0041910
## SUV-Coupe -5163569444 -9611026392 -716112497 0.0102793
## Truck-Coupe -6025055556 -10562978518 -1487132594 0.0015264
## Van/Minivan-Coupe -5301300000 -9933773178 -668826822 0.0122574
## Wagon-Coupe -4029000000 -10275425559 2217425559 0.5285438
## Hatchback-Crossover -176145833 -1601690909 1249399243 0.9999852
## Pickuptruck-Crossover 105066667 -1232213152 1442346485 0.9999996
## Sedan-Crossover 452152381 -718432562 1622737324 0.9529946
## SUV-Crossover 853097222 -187973704 1894168149 0.2057938
## Truck-Crossover -8388889 -1385596273 1368818496 1.0000000
## Van/Minivan-Crossover 715366667 -947090286 2377823619 0.9144206
## Wagon-Crossover 1987666667 -2520302681 6495636014 0.9027255
## Pickuptruck-Hatchback 281212500 -1200257400 1762682400 0.9996136
## Sedan-Hatchback 628298214 -704632715 1961229144 0.8638559
## SUV-Hatchback 1029243056 -191520815 2250006926 0.1751000
## Truck-Hatchback 167756944 -1349851679 1685365567 0.9999938
## Van/Minivan-Hatchback 891512500 -888992729 2672017729 0.8190960
## Wagon-Hatchback 2163812500 -2389013369 6716638369 0.8583069
## Sedan-Pickuptruck 347085714 -890994820 1585166249 0.9937802
## SUV-Pickuptruck 748030556 -368393636 1864454747 0.4746368
## Truck-Pickuptruck -113455556 -1548472796 1321561685 0.9999996
## Van/Minivan-Pickuptruck 610300000 -1100354091 2320954091 0.9706383
## Wagon-Pickuptruck 1882600000 -2643365305 6408565305 0.9287187
## SUV-Sedan 400944841 -509195133 1311084815 0.9031742
## Truck-Sedan -460541270 -1741644809 820562269 0.9692546
## Van/Minivan-Sedan 263214286 -1320543656 1846972227 0.9998571
## Wagon-Sedan 1535514286 -2944029635 6015058207 0.9770252
## Truck-SUV -861486111 -2025438792 302466569 0.3339281
## Van/Minivan-SUV -137730556 -1628317280 1352856169 0.9999985
## Wagon-SUV 1134569444 -3312887503 5582026392 0.9967453
## Van/Minivan-Truck 723755556 -1018289303 2465800414 0.9291787
## Wagon-Truck 1996055556 -2541867406 6533978518 0.9039384
## Wagon-Van/Minivan 1272300000 -3360173178 5904773178 0.9945886
plot( TukeyHSD(one.way), las= 1, col="brown")
The 95% confidence interval for the difference between Crossover and Coupe:
µ2 - µ1 ∈ (-10524636014, -1508697319)
Since the confidence interval does not contain 0, we can conclude at the 95% family-wise confidence level that Crossover and Coupe have different mean prices. Because the whole interval is negative, the mean price of Coupe is higher than that of Crossover.
The 95% confidence interval for the difference between Hatchback and Coupe:
µ3 - µ1 ∈ (-10745638369, -1639986631)
Since the confidence interval does not contain 0, we can conclude at the 95% family-wise confidence level that Hatchback and Coupe have different mean prices. Again the interval is entirely negative, so the mean price of Coupe is higher than that of Hatchback.
The 95% confidence interval for the difference between Pickup Truck and Coupe:
µ4 - µ1 ∈ (-10437565305, -1385634695)
Since the confidence interval does not contain 0, we can conclude at the 95% family-wise confidence level that Pickup Truck and Coupe have different mean prices. The mean price of Coupe is higher than that of Pickup Truck.
Because the Tukey table has many rows, we apply the same reasoning to the remaining pairs. Across the whole table, the Coupe has the highest mean price: its difference from every other type is significant except for Wagon (p adj = 0.5285), and no pair that excludes Coupe differs significantly.
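Reading every row of a long Tukey table by hand is error-prone, so the check "does the interval contain 0?" can be automated. The sketch below uses a small synthetic data frame (`toy`, with three made-up body types and prices) rather than the report's dataset; only the column names MODEL and PRICE follow the report.

```r
set.seed(1)
# Synthetic stand-in data: NOT the report's dataset, values are made up
toy <- data.frame(
  MODEL = rep(c("Coupe", "Sedan", "SUV"), each = 10),
  PRICE = c(rnorm(10, mean = 6000, sd = 300),
            rnorm(10, mean = 1000, sd = 300),
            rnorm(10, mean = 1100, sd = 300))
)

fit <- aov(PRICE ~ factor(MODEL), data = toy)
tk  <- TukeyHSD(fit)$`factor(MODEL)`   # matrix with columns diff, lwr, upr, p adj

# A pair differs significantly when its interval lies entirely on one side of 0
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0
print(cbind(tk, excludes_zero))
```

An interval that excludes 0 corresponds to an adjusted p-value below 0.05, so either column can be used to pick out the significant pairs.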
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 = µ4 versus HA: at least one pair of means differs
µ1: the mean price of cars manufactured in 2017
µ2: the mean price of cars manufactured in 2021
µ3: the mean price of cars manufactured in 2022
µ4: the mean price of cars manufactured in 2023
We use R to carry out the calculations. The code is as follows:
one.way <- aov(PRICE~factor(YEAR), data = dataset)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(YEAR) 3 1.626e+19 5.421e+18 2.496 0.0611 .
## Residuals 193 4.191e+20 2.172e+18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 2.496
P-value:
p-value = 0.0611 > 0.05
Since the p-value is larger than 0.05, we fail to reject the null hypothesis at the 5% significance level. Thus, there is not sufficient evidence that the year of manufacture has a significant effect on price.
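The same decision can be reached by comparing the observed F value with the critical value of the F distribution instead of using the p-value. A minimal sketch, using the F value and degrees of freedom from the factor(YEAR) table; the variable names are illustrative:

```r
f_observed <- 2.496                          # F value from the factor(YEAR) table
f_critical <- qf(0.95, df1 = 3, df2 = 193)   # upper 5% point of F(3, 193)

# H0 is rejected only when the observed F exceeds the critical value
reject_h0 <- f_observed > f_critical
cat("critical value:", round(f_critical, 3), " reject H0:", reject_h0, "\n")
```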
Suppose a customer has settled on a car model and wants to learn specifically which factors affect the price of that model. As an example, we take the Crossover model and test whether the number of seats affects its price; the same approach can be applied to other models and other factors.
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 versus HA: at least one pair of means differs
µ1: the mean price of Crossover cars with 5 seats
µ2: the mean price of Crossover cars with 7 seats
µ3: the mean price of Crossover cars with 8 seats
We use R to carry out the calculations. The code is as follows:
library(readxl)
CROSSOVERMODEL <- read_excel("CROSSOVERMODEL.xlsx")
one.way <- aov(PRICE~factor(SEAT), data = CROSSOVERMODEL)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(SEAT) 2 3.017e+17 1.509e+17 4.442 0.0246 *
## Residuals 21 7.132e+17 3.396e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 4.442
P-value:
p-value = 0.0246 < 0.05
Since the p-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level. Thus, there is sufficient evidence that the number of seats has a significant effect on the price of Crossover cars.
We further conduct the Tukey multiple comparison procedure to discover which μi are different and by how much. The R code is as follows:
TukeyHSD(one.way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PRICE ~ factor(SEAT), data = CROSSOVERMODEL)
##
## $`factor(SEAT)`
## diff lwr upr p adj
## 7-5 -56000000 -539463693 427463693 0.9542005
## 8-5 219636364 25744420 413528307 0.0246460
## 8-7 275636364 -209514817 760787544 0.3431588
plot(TukeyHSD(one.way, conf.level=.95), las = 2)
The 95% confidence interval for the difference between 7 seats and 5 seats:
µ2 - µ1 ∈ (-539463693, 427463693)
Since the confidence interval contains 0, there is no evidence at the 95% family-wise confidence level that 7-seat and 5-seat cars have different mean prices.
The 95% confidence interval for the difference between 8 seats and 5 seats:
µ3 - µ1 ∈ (25744420, 413528307)
Since the confidence interval does not contain 0, it is plausible at the 95% family-wise confidence level that 8-seat and 5-seat cars have different mean prices. Because the interval is entirely positive, 8-seat cars have a higher mean price than 5-seat cars.
The 95% confidence interval for the difference between 8 seats and 7 seats:
µ3 - µ2 ∈ (-209514817, 760787544)
Since the confidence interval contains 0, there is no evidence at the 95% family-wise confidence level that 8-seat and 7-seat cars have different mean prices.
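The seat analysis above reads a separate Crossover file, but the same per-model test could also be run by filtering the full dataset. The sketch below assumes a data frame with the report's column names MODEL, SEAT and PRICE; the data are synthetic and `seat_anova` is a hypothetical helper, not the authors' code.

```r
set.seed(2)
# Synthetic stand-in for the full dataset; values are made up
toy <- data.frame(
  MODEL = rep(c("Crossover", "Sedan"), each = 12),
  SEAT  = rep(c(5, 7, 8), times = 8),
  PRICE = rnorm(24, mean = 7e8, sd = 1e8)
)

# Hypothetical helper: one-way ANOVA of PRICE on SEAT within a single model
seat_anova <- function(data, model) {
  one_model <- subset(data, MODEL == model)   # keep only the chosen model
  aov(PRICE ~ factor(SEAT), data = one_model)
}

fit <- seat_anova(toy, "Crossover")
print(summary(fit))
```

Calling `seat_anova(toy, "Sedan")` repeats the analysis for another model without preparing a new file.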
The Welch two-sample t-test is a statistical hypothesis test that investigates whether there is a significant difference between the means of two independent groups that may have unequal variances. It compares the two group means while taking the variability within each group into account.
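To make the role of the within-group variability concrete, the test statistic and the Welch-Satterthwaite degrees of freedom can be computed by hand and compared with R's built-in `t.test`. This is a sketch on synthetic samples `x` and `y` (made-up sizes, means and spreads), not on the report's data.

```r
set.seed(3)
# Synthetic samples with unequal sizes and variances (illustrative only)
x <- rnorm(17, mean = 740, sd = 120)
y <- rnorm(7,  mean = 610, sd = 60)

# Squared standard error of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

builtin <- t.test(x, y)   # var.equal = FALSE by default, i.e. the Welch test
cat("by hand:", t_stat, df_welch, "\n")
cat("t.test :", builtin$statistic, builtin$parameter, "\n")
```

The degrees of freedom are generally not a whole number, which is why the outputs in this report show values such as 19.309.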
There are two hypotheses for the t-test:
- H0: µ1 = µ2: the mean prices of the two types of origin are equal.
- HA: µ1 ≠ µ2: the mean prices of the two types of origin are not equal.
To compare the price of cars (Crossover model) by origin, we perform the t-test using R. The code looks like this:
t.test(PRICE ~ ORIGIN, data= CROSSOVERMODEL)
##
## Welch Two Sample t-test
##
## data: PRICE by ORIGIN
## t = 2.2196, df = 19.309, p-value = 0.0386
## alternative hypothesis: true difference in means between group Domestic assembly and group Imported is not equal to 0
## 95 percent confidence interval:
## 7873426 263319851
## sample estimates:
## mean in group Domestic assembly mean in group Imported
## 741882353 606285714
• data: the data used in the two-sample t-test (PRICE grouped by ORIGIN: Domestic assembly and Imported).
• t: the t test statistic. The positive value of 2.2196 indicates that the Domestic assembly sample mean is larger than the Imported sample mean.
• df: the degrees of freedom associated with the t statistic (19.309).
• p-value: indicates the statistical significance of the result. The p-value is 0.0386, which is lower than alpha (0.05), indicating that a difference at least this large would be unlikely if the two group means were equal.
• alternative hypothesis: the alternative hypothesis can be chosen. In our case, it is that the true difference in means is not equal to zero.
• 95 percent confidence interval: we are 95% confident that the true difference between the two population means lies within (7873426, 263319851).
• sample estimates: the sample means of each group, 741882353 for Domestic assembly and 606285714 for Imported. On average, domestically assembled cars in the sample have a higher price than imported ones.
In conclusion, because the p-value (0.0386) is smaller than the significance level (0.05), we reject the null hypothesis (H0). The results of the Welch two-sample t-test give evidence of a statistically significant difference in mean price between Domestic assembly and Imported Crossover cars.
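The reported confidence interval can be reconstructed from the other numbers in the output, which shows how the pieces fit together. The sketch below uses only the printed sample means, t value and degrees of freedom; because these are rounded, the result matches the printed interval only approximately. The variable names are illustrative.

```r
# Values copied from the t.test output for PRICE by ORIGIN
mean_domestic <- 741882353
mean_imported <- 606285714
t_value  <- 2.2196
df_welch <- 19.309

diff_means <- mean_domestic - mean_imported
std_error  <- diff_means / t_value             # because t = difference / SE
margin     <- qt(0.975, df_welch) * std_error  # two-sided 95% margin of error

ci <- c(lower = diff_means - margin, upper = diff_means + margin)
print(ci)
```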
There are two hypotheses for the t-test:
- H0: µ1 = µ2: the mean prices of the two types of transmission are equal.
- HA: µ1 ≠ µ2: the mean prices of the two types of transmission are not equal.
To compare the price of cars (Crossover model) by transmission, we perform the t-test using R. The code looks like this:
t.test(PRICE~TRANSMISSION, data=CROSSOVERMODEL)
##
## Welch Two Sample t-test
##
## data: PRICE by TRANSMISSION
## t = -0.94959, df = 20.204, p-value = 0.3535
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
## -158805974 59405974
## sample estimates:
## mean in group Automatic mean in group Manual
## 694050000 743750000
• data: the data used in the two-sample t-test (PRICE grouped by TRANSMISSION: Automatic and Manual).
• t: the t test statistic. The negative value of -0.94959 indicates that the Automatic sample mean is smaller than the Manual sample mean.
• df: the degrees of freedom associated with the t statistic (20.204).
• p-value: indicates the statistical significance of the result. The p-value is 0.3535, which is larger than alpha (0.05), so a difference of this size could easily arise by chance.
• alternative hypothesis: the alternative hypothesis can be chosen. In our case, it is that the true difference in means is not equal to zero.
• 95 percent confidence interval: we are 95% confident that the true difference between the two population means lies within (-158805974, 59405974). This interval contains 0.
• sample estimates: the sample means of each group, 694050000 for Automatic and 743750000 for Manual.
In conclusion, because the p-value (0.3535) is larger than the significance level (0.05), we fail to reject the null hypothesis (H0). The results of the Welch two-sample t-test do not provide enough evidence of a statistically significant difference in mean price between Automatic and Manual Crossover cars.
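The p-values in both t-test outputs follow directly from the t statistic and the degrees of freedom. A minimal sketch, where `two_sided_p` is a hypothetical helper and the inputs are the rounded values printed above:

```r
# Hypothetical helper: two-sided p-value for a t statistic with df degrees of freedom
two_sided_p <- function(t_value, df) {
  2 * pt(-abs(t_value), df)   # probability in both tails of the t distribution
}

p_origin       <- two_sided_p(2.2196, 19.309)     # PRICE by ORIGIN
p_transmission <- two_sided_p(-0.94959, 20.204)   # PRICE by TRANSMISSION

cat("origin:", round(p_origin, 4), " transmission:", round(p_transmission, 4), "\n")
```

Only the size of t matters for a two-sided test, which is why the sign of the transmission statistic does not change its p-value.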
Using ANOVA, the Tukey procedure, and the t-test, we analyzed the dataset to learn which factors affect the price of a car. The car model is an important factor in whether a car is expensive: if you want to buy a car in a low price segment, the Coupe model is not a reasonable choice. The year of manufacture did not show a significant effect on price. Once a suitable car model has been chosen, the Crossover example shows that other factors such as the place of assembly and the number of seats also affect the price, while the type of transmission did not show a significant effect.