-Duong Nguyen Anh Tuan (leader) 10622062
-Nguyễn Thành Tài 10622036
-Huỳnh Quang Khải 10622017
-Đoàn Trần Bách Việt 10322031
-Lý Lê Phương Dung 10622054
-Trần Minh Quân 10622033
If you want to buy a car, you may wonder which model to choose, where it was manufactured, and which brand to buy, without knowing which factors significantly affect the price. In this report, our team uses analytical methods to give a detailed view of the factors affecting car prices. Each car model has a different price, and within each model, other factors such as the number of seats, origin, or year of manufacture also affect the price. This report aims to give readers an easy-to-understand view of each variable in the dataset.
We obtained this dataset from Kaggle.com. It has 8 columns and 200 rows. In the original data, the author used factors including origin, model, year of manufacture, and the specific price of each car. We therefore use the same factors to analyze car prices in our reduced dataset.
pacman::p_load(tidyverse, ggplot2)
library(readxl)
dataset <- read_excel("dataset.xlsx")
ggplot(data = dataset) +
geom_bar(mapping = aes(x = SEAT, fill=ORIGIN),width=0.8)
First, this bar chart displays the number of seats for two types of cars: cars that are assembled domestically and cars that are imported. Overall, the differences between the seat counts are clear. Five-seat vehicles make up the majority, and their lead is greater among domestically assembled cars. On the other hand, neither domestic nor foreign automakers make cars with nine to fifteen seats.
ggplot(data = dataset) +
geom_bar(mapping = aes(x = YEAR, fill = MODEL))
Next, the second bar chart displays the year of manufacture for each car model (Coupe, Hatchback, Pickup truck, SUV, Truck, Van/Minivan, and Wagon). The majority of cars were produced in 2022 and 2023. Furthermore, the Truck and SUV lines span the production years 2017, 2021, 2022, and 2023.
ggplot(data = dataset) +
geom_bar(mapping = aes(x = SEAT, fill = TRANSMISSION))
This bar chart displays the number of cars produced with automatic and manual transmissions, by number of seats. Cars with five or seven seats are primarily built with automatic transmissions. Manual transmission cars, on the other hand, are far fewer and range from 5 to 16 seats.
# Column names are upper-case (as in the aes() calls above); dataset$origin does not exist and only triggers a warning
table(dataset$ORIGIN)
soluong <- c(114,83)
tinhtrang <- c("Domestic assembly","Imported" )
phantram <- round(soluong/ sum(soluong)*100,2)
tinhtrang <- paste(tinhtrang, phantram)
tinhtrang <- paste(tinhtrang, "%", sep="" )
pie(soluong, labels = tinhtrang, col=c("green","blue"), main= "Percentage distribution of vehicle origin")
The pie chart breaks vehicle origin down into domestically assembled and imported vehicles. About 57.87% of the vehicles are domestically assembled, while about 42.13% are imported. Domestic assembly refers to vehicles manufactured within the country, by domestic automakers or by international companies with local plants. Imported vehicles are produced abroad and then imported for sale. The distribution shows a substantial production presence within the country, alongside a large share of imported vehicles in the market.
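The counts in the pie-chart code are typed in by hand. As a minimal sketch, the same labels can be derived directly from a category column with `table()` and `prop.table()`. Here the `origin` vector is rebuilt from the counts reported above (114 and 83) only so that the example runs on its own; with the real data one would presumably pass `dataset$ORIGIN` instead, which is an assumption based on the column name used in the plotting code.

```r
# Rebuild a stand-in ORIGIN column from the counts used in the report
# (114 domestic, 83 imported); this vector is illustrative only.
origin <- rep(c("Domestic assembly", "Imported"), times = c(114, 83))

counts  <- table(origin)                        # frequency of each category
percent <- round(100 * prop.table(counts), 2)   # share of each category in %
labels  <- paste0(names(counts), " ", percent, "%")

print(labels)
# To draw the chart:
# pie(counts, labels = labels, main = "Percentage distribution of vehicle origin")
```

Deriving the counts from the column keeps the chart in step with the data if rows are added or removed.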
# The column is named MODEL (upper-case); dataset$model does not exist
table(dataset$MODEL)
soluong <- c(1,24,16,20,35,72,18,10,1)
tinhtrang <- c("Coupe","Crossover","Hatchback","Pickuptruck","Sedan","SUV","Truck","Van/Minivan","Wagon" )
phantram <- round(soluong/ sum(soluong)*100,2)
tinhtrang <- paste(tinhtrang, phantram)
tinhtrang <- paste(tinhtrang, "%", sep="" )
# Nine colours, one per model, so that no colour is recycled
pie(soluong, labels = tinhtrang, col=c("red","blue","gray","pink","purple","yellow","green","brown","orange"), main= "Percentage distribution of vehicle model")
The second pie chart displays the percentage of each car model in the dataset. SUVs make up the largest proportion (36.55%), while Coupe and Wagon make up the smallest (0.51% each).
# The column is named YEAR (upper-case); dataset$year does not exist
table(dataset$YEAR)
soluong <- c(2,3,87,105)
tinhtrang <- c("2017","2021","2022","2023" )
phantram <- round(soluong/ sum(soluong)*100,2)
tinhtrang <- paste(tinhtrang, phantram)
tinhtrang <- paste(tinhtrang, "%", sep="" )
pie(soluong, labels = tinhtrang, col=c("green","blue","gray","yellow"), main= "Percentage distribution of year of manufacture")
The third pie chart shows the share of each production year in the dataset (2017, 2021, 2022, and 2023). Cars manufactured in 2023 make up the largest share (53.3%), followed closely by 2022 (44.16%). In contrast, 2017 has the smallest share (1.02%).
boxplot(log(dataset$PRICE)~dataset$MODEL,col = "pink", xlab="model", ylab="log(price)")
According to the boxplot, the typical log price is about 20.5 for coupes, 20.1 for hatchbacks and sedans, 20.8 for SUVs, 19.8 for trucks, and 20.6 for wagons. These figures outline an estimated price level for each type of vehicle, suggesting that trucks tend to be priced slightly lower than other categories, while SUVs tend to be priced slightly higher.
boxplot(log(dataset$PRICE)~dataset$YEAR, col="red",xlab="year", ylab="log(price)")
Car prices in Vietnam have changed markedly in recent years. In 2021, car prices decreased. According to experts, the COVID-19 epidemic caused Vietnam’s economy to decline rapidly, which affected businesses and forced them to reduce prices to attract buyers. In 2022 and 2023, the economic situation was more stable than in 2021, so car prices increased again. Because of this crisis, however, car prices remained lower than in 2017. According to experts, Vietnam’s economy will recover after 2023, which could push car prices higher in the following years.
Figure 9: Relationship between number of seats and price
plot(log(PRICE) ~ SEAT, data = dataset, col="purple", xlab="seat", ylab="log(price)")
The relationship between the number of seats and the price of a car can vary based on several factors. Generally, larger vehicles with more seating capacity tend to have higher prices due to increased manufacturing costs, larger size, and potentially more features. However, within a specific category, other factors like brand, model, trim level, technology, and additional amenities also influence the price. Additionally, certain high-end or luxury vehicles might have fewer seats but come with a higher price due to their advanced features and exclusivity. Ultimately, the correlation between the number of seats and the price can vary significantly depending on the specific car and its market segment.
boxplot(log(dataset$PRICE)~dataset$TRANSMISSION, col="pink",xlab="transmission", ylab="log(price)")
On the log scale, the average price of a manual transmission car is about 19.8, and that of an automatic transmission car is higher, at about 20.5. According to the figure, automatic cars have a higher average price, which may be because they have a more modern design than manual cars. On the other hand, manual transmission cars can save fuel, so many people still choose them for transportation, especially in the service industry. In addition, with a manual transmission the driver decides when to change gears to suit the road, speed, and driving purpose, rather than depending on pre-programmed algorithms as in an automatic transmission car.
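Because the boxplots show `log(PRICE)`, the values 19.8 and 20.5 are natural logarithms, not prices. The sketch below converts them back to the original price scale. It assumes the two numbers are read off the plot as typical log prices, and the variable names are illustrative.

```r
# Typical log prices quoted in the text (R's log() is the natural logarithm)
log_manual <- 19.8
log_auto   <- 20.5

price_manual <- exp(log_manual)   # back-transform to the price scale
price_auto   <- exp(log_auto)

# A difference on the log scale corresponds to a ratio on the price scale
ratio <- exp(log_auto - log_manual)
print(round(ratio, 2))
```

A gap of 0.7 on the log scale therefore means that the typical automatic car costs about twice as much as the typical manual car, which is easy to miss when reading the log values alone.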
boxplot(log(dataset$PRICE)~dataset$ORIGIN, col="yellow",xlab="origin", ylab="log(price)")
On the log scale, the average price of domestically assembled cars is about 23, while that of imported cars is about 22. According to our analysis, domestically assembled cars currently sell for more than imported cars partly because the cost of producing cars in Vietnam is about 20% higher than abroad, according to data from industry experts. The Vietnamese automobile market is still small, with a few dozen car brands that each offer dozens of models. Each car line is therefore sold in limited quantities, which increases production costs. Imported cars, however, are subject to fairly high taxes and transportation costs when they enter the country, which may be why their price is only slightly lower than that of domestic cars.
We now apply one-way ANOVA and the Tukey method to our dataset to find out whether factors such as seats, place of assembly, year of manufacture, or model affect price.
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 = µ4 = … = µ9 versus HA: at least one pair of means differs
µ1: the mean price of the Coupe model
µ2: the mean price of the Crossover model
µ3: the mean price of the Hatchback model
µ4: the mean price of the Pickup Truck model
µ5: the mean price of the Sedan model
µ6: the mean price of the SUV model
µ7: the mean price of the Truck model
µ8: the mean price of the Van/Minivan model
µ9: the mean price of the Wagon model
We use R, a programming language, to perform the calculations. The code is as follows:
library(readxl)
dataset <- read_excel("dataset.xlsx")
one.way <- aov(PRICE ~ factor(MODEL), data = dataset)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(MODEL) 8 6.301e+19 7.876e+18 3.976 0.000225 ***
## Residuals 188 3.724e+20 1.981e+18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 3.976
P-value:
p-value = 0.000225 < 0.05
Since the p-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level. Thus, there is sufficient evidence that the car model has a significant effect on price.
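To make the link between the ANOVA table and the decision explicit, the following sketch recomputes the F statistic and p-value from the mean squares and degrees of freedom printed in the table for MODEL. The variable names are illustrative; the numbers are the rounded values shown in the output, so the result agrees with the table only up to rounding.

```r
# Values copied from the ANOVA table for factor(MODEL)
ms_model <- 7.876e18   # mean square between models
ms_resid <- 1.981e18   # residual mean square
df_model <- 8
df_resid <- 188
alpha    <- 0.05

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_model / ms_resid

# The p-value is the upper tail of the F distribution with (8, 188) df
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

reject_h0 <- p_value < alpha
print(c(F = f_stat, p = p_value))
print(reject_h0)
```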
We further conduct the Tukey multiple comparison procedure to discover which μi are different and by how much. The R code is as follows:
TukeyHSD(one.way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PRICE ~ factor(MODEL), data = dataset)
##
## $`factor(MODEL)`
## diff lwr upr p adj
## Crossover-Coupe -6016666667 -10524636014 -1508697319 0.0014020
## Hatchback-Coupe -6192812500 -10745638369 -1639986631 0.0010228
## Pickuptruck-Coupe -5911600000 -10437565305 -1385634695 0.0019811
## Sedan-Coupe -5564514286 -10044058207 -1084970365 0.0041910
## SUV-Coupe -5163569444 -9611026392 -716112497 0.0102793
## Truck-Coupe -6025055556 -10562978518 -1487132594 0.0015264
## Van/Minivan-Coupe -5301300000 -9933773178 -668826822 0.0122574
## Wagon-Coupe -4029000000 -10275425559 2217425559 0.5285438
## Hatchback-Crossover -176145833 -1601690909 1249399243 0.9999852
## Pickuptruck-Crossover 105066667 -1232213152 1442346485 0.9999996
## Sedan-Crossover 452152381 -718432562 1622737324 0.9529946
## SUV-Crossover 853097222 -187973704 1894168149 0.2057938
## Truck-Crossover -8388889 -1385596273 1368818496 1.0000000
## Van/Minivan-Crossover 715366667 -947090286 2377823619 0.9144206
## Wagon-Crossover 1987666667 -2520302681 6495636014 0.9027255
## Pickuptruck-Hatchback 281212500 -1200257400 1762682400 0.9996136
## Sedan-Hatchback 628298214 -704632715 1961229144 0.8638559
## SUV-Hatchback 1029243056 -191520815 2250006926 0.1751000
## Truck-Hatchback 167756944 -1349851679 1685365567 0.9999938
## Van/Minivan-Hatchback 891512500 -888992729 2672017729 0.8190960
## Wagon-Hatchback 2163812500 -2389013369 6716638369 0.8583069
## Sedan-Pickuptruck 347085714 -890994820 1585166249 0.9937802
## SUV-Pickuptruck 748030556 -368393636 1864454747 0.4746368
## Truck-Pickuptruck -113455556 -1548472796 1321561685 0.9999996
## Van/Minivan-Pickuptruck 610300000 -1100354091 2320954091 0.9706383
## Wagon-Pickuptruck 1882600000 -2643365305 6408565305 0.9287187
## SUV-Sedan 400944841 -509195133 1311084815 0.9031742
## Truck-Sedan -460541270 -1741644809 820562269 0.9692546
## Van/Minivan-Sedan 263214286 -1320543656 1846972227 0.9998571
## Wagon-Sedan 1535514286 -2944029635 6015058207 0.9770252
## Truck-SUV -861486111 -2025438792 302466569 0.3339281
## Van/Minivan-SUV -137730556 -1628317280 1352856169 0.9999985
## Wagon-SUV 1134569444 -3312887503 5582026392 0.9967453
## Van/Minivan-Truck 723755556 -1018289303 2465800414 0.9291787
## Wagon-Truck 1996055556 -2541867406 6533978518 0.9039384
## Wagon-Van/Minivan 1272300000 -3360173178 5904773178 0.9945886
plot( TukeyHSD(one.way), las= 1, col="brown")
The 95% confidence interval for the difference between Crossover and Coupe:
µ2 - µ1 ∈ (-10524636014, -1508697319)
Since the confidence interval does not contain 0, we can conclude with 95% confidence that Crossover and Coupe have different mean prices. In addition, the mean price of Coupe is higher than that of Crossover.
The 95% confidence interval for the difference between Hatchback and Coupe:
µ3 - µ1 ∈ (-10745638369, -1639986631)
Since the confidence interval does not contain 0, it is plausible with 95% confidence that Coupe and Hatchback have different mean prices. Additionally, Coupe may have a higher mean price than Hatchback.
The 95% confidence interval for the difference between Pickup Truck and Coupe:
µ4 - µ1 ∈ (-10437565305, -1385634695)
Since the confidence interval does not contain 0, we can conclude with 95% confidence that Coupe and Pickup Truck have different mean prices. In addition, the mean price of Coupe is higher than that of the Pickup Truck model.
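Because the Tukey table contains many comparisons, we apply the same method to the remaining rows. After analyzing the whole table, we found that the Coupe model has the highest mean price among the car models; the only Coupe comparison whose interval contains 0 is Wagon-Coupe (adjusted p-value = 0.5285).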
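Reading every row of a long Tukey table by eye is error-prone. As a sketch, the rows whose confidence interval excludes 0 can be selected programmatically. The data frame `toy` below is simulated and purely hypothetical (three made-up groups A, B, C); only the filtering step is meant to carry over to the real `TukeyHSD(one.way)` result.

```r
set.seed(42)

# Hypothetical data: groups A and B are similar, group C is clearly higher
toy <- data.frame(
  MODEL = factor(rep(c("A", "B", "C"), each = 8)),
  PRICE = c(rnorm(8, 100, 5), rnorm(8, 102, 5), rnorm(8, 150, 5))
)

fit <- aov(PRICE ~ MODEL, data = toy)

# TukeyHSD() returns one matrix per factor with columns diff, lwr, upr, p adj
tukey <- as.data.frame(TukeyHSD(fit)$MODEL)

# A pair differs at the 5% family-wise level when its interval excludes 0
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
significant <- tukey[tukey$excludes_zero, ]

print(significant)
```

For the report's model, the component would be accessed by its term name, for example ``TukeyHSD(one.way)$`factor(MODEL)` ``, matching the heading printed in the output above.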
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 = µ4 versus HA: at least one pair of means differs
µ1: the mean price of cars manufactured in 2017
µ2: the mean price of cars manufactured in 2021
µ3: the mean price of cars manufactured in 2022
µ4: the mean price of cars manufactured in 2023
We use R, a programming language, to perform the calculations. The code is as follows:
one.way <- aov(PRICE~factor(YEAR), data = dataset)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(YEAR) 3 1.626e+19 5.421e+18 2.496 0.0611 .
## Residuals 193 4.191e+20 2.172e+18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 2.496
P-value:
p-value = 0.0611 > 0.05
Since the p-value is larger than 0.05, we fail to reject the null hypothesis at the 5% significance level. Thus, there is not sufficient evidence that the year of manufacture has a significant effect on price.
Suppose a customer has reviewed the options, chosen a satisfactory car model, and now wants to learn specifically about the factors that affect the price of that model. As an example, we choose the Crossover model and analyze whether the number of seats affects its price; the same approach can then be applied to other models and other factors.
First, let us look at the Crossover car model.
Figure 12: Relationship between seat and origin of Crossover model
library(readxl)
CROSSOVERMODEL <- read_excel("CROSSOVERMODEL.xlsx")
ggplot(data = CROSSOVERMODEL) +
geom_bar(mapping = aes(x = SEAT, fill = ORIGIN))
This bar chart shows the number of seats in Crossover cars that are imported and assembled domestically. Crossover cars are typically produced in a 5-seat configuration. Seven-seat Crossover cars appear only as imports, and in limited numbers. The 8-seat line is assembled domestically and is produced in greater numbers than the 7-seat line.
Figure 13: Relationship between seat and transmission of Crossover model
ggplot(data = CROSSOVERMODEL) +
geom_bar(mapping = aes(x = SEAT, fill = TRANSMISSION))
This bar chart displays the number of Crossover cars made with automatic and manual transmissions, by number of seats. Apart from the small number of seven-seat cars, models with five and eight seats have far more automatic than manual cars. Manual transmission cars appear only among the eight-seat models.
ggplot(data = CROSSOVERMODEL) +
geom_bar(mapping = aes(x = SEAT, fill = BRAND))
This last bar chart displays the number of Crossover cars from Hyundai and Toyota, by number of seats. Overall, the two automakers have clearly different production priorities: Hyundai prioritizes 5-seat Crossover vehicles, whereas Toyota primarily builds 7-seat vehicles.
boxplot(log(CROSSOVERMODEL$PRICE)~CROSSOVERMODEL$BRAND, col="green",xlab="brand", ylab="log(price)")
boxplot(log(CROSSOVERMODEL$PRICE)~CROSSOVERMODEL$TRANSMISSION, col="black",xlab="transmission", ylab="log(price)")
boxplot(log(CROSSOVERMODEL$PRICE)~CROSSOVERMODEL$ORIGIN, col="gray",xlab="origin", ylab="log(price)")
boxplot(log(CROSSOVERMODEL$PRICE)~CROSSOVERMODEL$SEAT, col="orange",xlab="seat", ylab="log(price)")
Hypothesis testing at a 5% significance level (α = 0.05):
H0: µ1 = µ2 = µ3 versus HA: at least one pair of means differs
µ1: the mean price of Crossover cars with 5 seats
µ2: the mean price of Crossover cars with 7 seats
µ3: the mean price of Crossover cars with 8 seats
We use R, a programming language, to perform the calculations. The code is as follows:
library(readxl)
CROSSOVERMODEL <- read_excel("CROSSOVERMODEL.xlsx")
one.way <- aov(PRICE~factor(SEAT), data = CROSSOVERMODEL)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(SEAT) 2 3.017e+17 1.509e+17 4.442 0.0246 *
## Residuals 21 7.132e+17 3.396e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, we have:
F-statistic:
F = 4.442
P-value:
p-value = 0.0246 < 0.05
Since the p-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level. Thus, there is sufficient evidence that the number of seats has a significant effect on the price of Crossover cars.
We further conduct the Tukey multiple comparison procedure to discover which μi are different and by how much. The R code is as follows:
TukeyHSD(one.way)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = PRICE ~ factor(SEAT), data = CROSSOVERMODEL)
##
## $`factor(SEAT)`
## diff lwr upr p adj
## 7-5 -56000000 -539463693 427463693 0.9542005
## 8-5 219636364 25744420 413528307 0.0246460
## 8-7 275636364 -209514817 760787544 0.3431588
plot(TukeyHSD(one.way, conf.level=.95), las = 2)
The 95% confidence interval for the difference between 7 seats and 5 seats:
µ2 - µ1 ∈ (-539463693, 427463693)
Since the confidence interval contains 0, there is no evidence at the 95% confidence level that 7-seat and 5-seat cars differ in mean price.
The 95% confidence interval for the difference between 8 seats and 5 seats:
µ3 - µ1 ∈ (25744420, 413528307)
Since the confidence interval does not contain 0, it is plausible with 95% confidence that 8-seat and 5-seat cars have different mean prices. Additionally, 8-seat cars may have a higher mean price than 5-seat cars.
The 95% confidence interval for the difference between 8 seats and 7 seats:
µ3 - µ2 ∈ (-209514817, 760787544)
Since the confidence interval contains 0, there is no evidence at the 95% confidence level that 8-seat and 7-seat cars differ in mean price.
The Welch two-sample t-test is a statistical test that investigates whether there is a significant difference between the means of two independent groups that may have unequal variances. The test compares the means of the two groups while accounting for the variability within each group.
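To show what the test computes, the sketch below reproduces the Welch statistic, its approximate degrees of freedom (the Welch-Satterthwaite formula), and the two-sided p-value by hand, and compares them with `t.test()`. The samples `x` and `y` are simulated and purely hypothetical; they only stand in for two price groups with unequal variances.

```r
set.seed(7)

# Two hypothetical independent samples with different spreads
x <- rnorm(12, mean = 740, sd = 60)
y <- rnorm(9,  mean = 610, sd = 150)

# Squared standard error of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df_manual, lower.tail = FALSE)

# t.test() uses var.equal = FALSE by default, i.e. the Welch test
res <- t.test(x, y)
print(c(t = t_manual, df = df_manual, p = p_manual))
```

The non-integer degrees of freedom in the report's output (for example 19.309) come from this approximation.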
There are two hypotheses for the t-test:
- H0: µ1 = µ2: the mean prices of the two types of origin are equal.
- HA: µ1 ≠ µ2: the mean prices of the two types of origin are not equal.
To compare the price of Crossover cars by origin, we perform the t-test using R. The code looks like this:
t.test(PRICE ~ ORIGIN, data= CROSSOVERMODEL)
##
## Welch Two Sample t-test
##
## data: PRICE by ORIGIN
## t = 2.2196, df = 19.309, p-value = 0.0386
## alternative hypothesis: true difference in means between group Domestic assembly and group Imported is not equal to 0
## 95 percent confidence interval:
## 7873426 263319851
## sample estimates:
## mean in group Domestic assembly mean in group Imported
## 741882353 606285714
• data: the data used in the two-sample t-test (PRICE by ORIGIN, with groups Domestic assembly and Imported).
• t: the t test statistic. The positive t-value of 2.2196 indicates that the Domestic assembly sample mean is larger than the Imported sample mean.
• df: the degrees of freedom associated with the t statistic (19.309).
• p-value: indicates the statistical significance of the result. The p-value is 0.0386, which is lower than alpha (0.05), indicating that a difference this large between the two groups would be unlikely if the true means were equal.
• alternative hypothesis: the alternative hypothesis being tested. In our case, it checks whether the true difference in means is not equal to zero.
• 95 percent confidence interval: we are 95% confident that the true difference in means between the two groups lies within the range (7873426, 263319851).
• sample estimates: the sample means of each group, which are 741882353 for Domestic assembly and 606285714 for Imported. On average, domestically assembled Crossover cars have a higher price than imported ones.
In conclusion, because the p-value (0.0386) is smaller than the significance level (0.05), we reject the null hypothesis (H0). The results of the Welch two-sample t-test indicate a statistically significant difference in mean price between Domestic assembly and Imported.
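The quantities described in the bullets can also be read from the object that `t.test()` returns, which avoids copying numbers by hand. The sketch below uses a small simulated data frame `toy` (hypothetical values, with the same column names as the report's data) and collects the statistic, degrees of freedom, p-value, confidence interval, and the resulting decision.

```r
set.seed(3)

# Hypothetical data with the same column names as the report's data
toy <- data.frame(
  ORIGIN = rep(c("Domestic assembly", "Imported"), times = c(10, 8)),
  PRICE  = c(rnorm(10, 740, 80), rnorm(8, 600, 120))
)

res   <- t.test(PRICE ~ ORIGIN, data = toy)   # Welch test by default
alpha <- 0.05                                  # matches the default 95% interval

# Collect the fields of the returned "htest" object in one row
summary_row <- data.frame(
  t         = unname(res$statistic),
  df        = unname(res$parameter),
  p_value   = res$p.value,
  ci_lower  = res$conf.int[1],
  ci_upper  = res$conf.int[2],
  reject_h0 = res$p.value < alpha
)

print(summary_row)
```

Rejecting H0 at α = 0.05 and a 95% confidence interval that excludes 0 are two views of the same result, so the two should always agree.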
There are two hypotheses for the t-test:
- H0: µ1 = µ2: the mean prices of the two types of transmission are equal.
- HA: µ1 ≠ µ2: the mean prices of the two types of transmission are not equal.
To compare the price of Crossover cars by transmission, we perform the t-test using R. The code looks like this:
t.test(PRICE~TRANSMISSION, data=CROSSOVERMODEL)
##
## Welch Two Sample t-test
##
## data: PRICE by TRANSMISSION
## t = -0.94959, df = 20.204, p-value = 0.3535
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
## -158805974 59405974
## sample estimates:
## mean in group Automatic mean in group Manual
## 694050000 743750000
• data: the data used in the two-sample t-test (PRICE by TRANSMISSION, with groups Automatic and Manual).
• t: the t test statistic. The negative t-value of -0.94959 indicates that the Automatic sample mean is smaller than the Manual sample mean.
• df: the degrees of freedom associated with the t statistic (20.204).
• p-value: indicates the statistical significance of the result. The p-value is 0.3535, which is larger than alpha (0.05).
• alternative hypothesis: the alternative hypothesis being tested. In our case, it checks whether the true difference in means is not equal to zero.
• 95 percent confidence interval: we are 95% confident that the true difference in means between the two groups lies within the range (-158805974, 59405974).
• sample estimates: the sample means of each group, which are 694050000 for Automatic and 743750000 for Manual.
In conclusion, because the p-value (0.3535) is larger than the significance level (0.05), we fail to reject the null hypothesis (H0). The results of the Welch two-sample t-test do not provide enough evidence of a statistically significant difference in mean price between Automatic and Manual.
Methods such as ANOVA, the Tukey procedure, and the t-test have helped us analyze the dataset and identify which factors affect the price of a car. For example, the car model is an important factor in determining whether a car is expensive. If you want to buy a low-segment car, the Coupe model is not a reasonable choice. Furthermore, once you have chosen a suitable car model, other factors such as the place of assembly and the number of seats also strongly affect the choice of a suitable car within your price range.
View(dataset)
View(CROSSOVERMODEL)