This project analyzes current trends in the car market in Argentina, using the argentina_cars dataset from Kaggle.
#read.csv() is part of base R, so no extra package is needed
df<-read.csv("argentina_cars.csv")
head(df)
## money brand model year color fuel_type door gear
## 1 10350000 Toyota Corolla Cross 2022 Plateado Nafta 5 Automática
## 2 10850000 Jeep Compass 2022 Blanco Nafta 5 Automática
## 3 35500 Jeep Compass 2022 Gris oscuro Nafta 5 Automática
## 4 19000 Citroën C4 Cactus 2022 Gris oscuro Nafta 5 Automática
## 5 5800000 Toyota Corolla 2019 Gris Nafta 4 Manual
## 6 34500 Jeep Compass 2022 Negro Nafta 5 Automática
## motor body_type kilometres currency
## 1 SUV 500 pesos
## 2 2.4 SUV 500 pesos
## 3 2.4 SUV 500 dólares
## 4 SUV 550 dólares
## 5 1.8 Sedán 9000 pesos
## 6 1.3 SUV 10500 dólares
We first check whether the dataset contains any missing (NA) or empty values.
colSums(df=="")
## money brand model year color fuel_type door
## 0 0 0 0 11 0 0
## gear motor body_type kilometres currency
## 1 11 1 0 0
colSums(is.na(df))
## money brand model year color fuel_type door
## 0 0 0 0 0 0 0
## gear motor body_type kilometres currency
## 0 0 0 0 0
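The two checks above can be combined into one pass that also catches whitespace-only strings. This is a minimal sketch on a toy data frame: the values and the helper name `is_blank` are hypothetical and do not come from argentina_cars.csv.

```r
# Toy data frame (hypothetical values, not from argentina_cars.csv)
toy <- data.frame(
  color = c("Gris", "", "  ", NA),
  motor = c("1.8", "2.4", "", "1.3"),
  stringsAsFactors = FALSE
)

# Treat NA, empty strings and whitespace-only strings as missing
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Count missing entries per column
missing_counts <- sapply(toy, function(col) sum(is_blank(col)))
print(missing_counts)
```

Applied to the real data, the same call would be `sapply(df, function(col) sum(is_blank(col)))`.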
The color and motor columns each have 11 empty values, and gear and body_type have one each; there are no NA values. These gaps have little effect on the analysis below. Next, the currency variable shows that prices are listed in both dollars and pesos, so we create a new column that holds every price in US dollars. According to Google, 1 dollar is worth 172.44 pesos. We also create a new column, age, computed as the current year (2022) minus the car's year.
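#Convert peso prices to dollars; dollar prices are kept as they are
df$money_dollar<-ifelse(df[,"currency"]=='pesos',df[,"money"]/172.44,df[,"money"])
#Age of the car in years, relative to 2022
df$age<-2022-df$year
#Drop the first column (money), now replaced by money_dollar
df<-df[,-1]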
head(df)
## brand model year color fuel_type door gear motor
## 1 Toyota Corolla Cross 2022 Plateado Nafta 5 Automática
## 2 Jeep Compass 2022 Blanco Nafta 5 Automática 2.4
## 3 Jeep Compass 2022 Gris oscuro Nafta 5 Automática 2.4
## 4 Citroën C4 Cactus 2022 Gris oscuro Nafta 5 Automática
## 5 Toyota Corolla 2019 Gris Nafta 4 Manual 1.8
## 6 Jeep Compass 2022 Negro Nafta 5 Automática 1.3
## body_type kilometres currency money_dollar age
## 1 SUV 500 pesos 60020.88 0
## 2 SUV 500 pesos 62920.44 0
## 3 SUV 500 dólares 35500.00 0
## 4 SUV 550 dólares 19000.00 0
## 5 Sedán 9000 pesos 33634.89 3
## 6 SUV 10500 dólares 34500.00 0
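The conversion step can also be written as a small reusable function. This is a sketch under stated assumptions: the function name `to_dollars` is hypothetical, and the default rate of 172.44 pesos per dollar is the one quoted in the text, which would need updating for any other date.

```r
# Hypothetical helper: convert prices to dollars based on the currency label.
# The default rate (172.44 pesos per dollar) is the one quoted in the text.
to_dollars <- function(money, currency, rate = 172.44) {
  stopifnot(length(money) == length(currency), rate > 0)
  # Peso prices are divided by the rate; everything else is left unchanged
  ifelse(currency == "pesos", money / rate, money)
}

# First and third rows of head(df) shown earlier
prices <- to_dollars(c(10350000, 35500), c("pesos", "dólares"))
print(round(prices, 2))
```

Keeping the rate as an argument makes it easy to rerun the analysis with a different exchange rate.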
Next, we check whether there are any outliers in the car prices. Because the data may have been collected from online listings, some prices could be fake.
boxplot(df$money_dollar)
The outliers here are the prices of luxury cars. For example, the maximum price is $430,000, for an Audi R8 that has been driven only 3,000 kilometres. These prices are therefore plausible and reflect the variety of the car market.
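To see which rows a boxplot flags, the points can be extracted numerically instead of read off the plot. The sketch below uses hypothetical prices, not values from the dataset; `boxplot.stats()` is the base R function that `boxplot()` relies on to compute its whiskers and outliers.

```r
# Hypothetical prices in dollars (not taken from the dataset)
price <- c(12000, 15000, 18000, 21000, 25000, 30000, 430000)

# boxplot.stats() flags points beyond 1.5 times the hinge spread
stats <- boxplot.stats(price)
outliers <- stats$out

# Indices of the flagged rows, useful for looking up brand and model
outlier_rows <- which(price %in% outliers)
print(outliers)
print(outlier_rows)
```

On the real data, `df[which(df$money_dollar %in% boxplot.stats(df$money_dollar)$out), ]` would list the flagged cars so that each one can be checked by hand.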
The dataset covers many different brands, and we want to know which of them currently leads the car market in Argentina.
prop_table<-prop.table(table(df$brand))
prop_table<-data.frame(prop_table)
colnames(prop_table)<-c("Brands","Freq")
print(prop_table)
## Brands Freq
## 1 Audi 0.015686275
## 2 Baic 0.001960784
## 3 BMW 0.025490196
## 4 Chery 0.001960784
## 5 Chevrolet 0.100000000
## 6 Citroën 0.062745098
## 7 Dodge 0.007843137
## 8 DS 0.003921569
## 9 Fiat 0.049019608
## 10 Ford 0.111764706
## 11 Honda 0.035294118
## 12 Hyundai 0.017647059
## 13 Jeep 0.039215686
## 14 Kia 0.005882353
## 15 Mercedes-Benz 0.027450980
## 16 Mini 0.003921569
## 17 Mitsubishi 0.003921569
## 18 Nissan 0.015686275
## 19 Peugeot 0.096078431
## 20 Porsche 0.001960784
## 21 RAM 0.011764706
## 22 Renault 0.088235294
## 23 Subaru 0.001960784
## 24 Suzuki 0.001960784
## 25 Toyota 0.092156863
## 26 Volkswagen 0.172549020
## 27 Volvo 0.003921569
library(ggplot2)
ggplot(prop_table,aes(x="",y=Freq,fill=Brands))+geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0) +theme_void()
The Argentine car market is diverse, with many brands from different countries. The best-represented brands are Ford (11.2%), Chevrolet (10.0%), Peugeot (9.6%) and Toyota (9.2%), and Volkswagen has the largest share at 17.25%. It is worth mentioning that none of the brands in the data is Argentine.
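Sorting the shares makes the leading brands easier to read than the alphabetical table. A minimal sketch on a hypothetical brand vector (the real column is `df$brand`):

```r
# Hypothetical brand column (the real one is df$brand)
brand <- c("Volkswagen", "Ford", "Volkswagen", "Toyota", "Ford", "Volkswagen")

# Share of each brand, sorted from largest to smallest
share <- sort(prop.table(table(brand)), decreasing = TRUE)

# Tidy table with percentages rounded to one decimal
top <- data.frame(Brands = names(share),
                  Pct = round(100 * as.numeric(share), 1))
print(top)
```

Calling `head(top, 5)` on the real data would give the five most common brands directly.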
We want to look at the price distribution of each brand to see whether price explains why some brands are more popular than others.
#Boxplot
ggplot(data=df,aes(x = money_dollar, y = brand, fill = brand)) +
  geom_boxplot(alpha = .6, outlier.alpha = 0) +
  geom_jitter(shape = 1, alpha = .2) +
  theme_minimal()
Audi and BMW have extremely wide price distributions, and most of their prices are much higher than those of other brands. The wide range also suggests that Audi and BMW offer some models that are affordable for buyers who do not have a large budget but still want to own an Audi or a BMW. In contrast, Volkswagen prices are moderate, with the most expensive model at $82,922. Toyota is similar, with prices slightly higher than Volkswagen's. This may explain why Argentines prefer Volkswagen or Toyota to BMW or Audi.
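The comparison read off the boxplot can be backed by a per-brand summary table. The sketch below uses hypothetical prices; on the real data the same `tapply()` calls would take `df$money_dollar` and `df$brand`.

```r
# Hypothetical prices in dollars (not taken from the dataset)
toy <- data.frame(
  brand = c("Audi", "Audi", "Volkswagen", "Volkswagen", "Volkswagen"),
  money_dollar = c(40000, 430000, 15000, 30000, 80000)
)

# Median and maximum price per brand (tapply orders brands alphabetically)
med <- tapply(toy$money_dollar, toy$brand, median)
mx <- tapply(toy$money_dollar, toy$brand, max)

summary_by_brand <- data.frame(brand = names(med),
                               median = as.numeric(med),
                               max = as.numeric(mx))
print(summary_by_brand)
```

The median is used rather than the mean because a single luxury listing can pull a brand's mean far upward.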
Moreover, non-petrol cars have become more popular in recent years, driven by improvements in technology and by higher fuel prices. We will check whether electric cars are now popular in the Argentine car market.
#Pie chart and frequency table
fuel_table<-prop.table(table(df$fuel_type))
fuel_table<-data.frame(fuel_table)
colnames(fuel_table)<-c("Type","Freq")
pie(fuel_table$Freq,labels= paste(fuel_table$Type,round(fuel_table$Freq,3)),col=c("royalblue2","lightseagreen","lightskyblue1","cornflowerblue"),main="Fuel type")
Surprisingly, there is not a single electric car among the roughly 500 cars in the data. Nafta (petrol) is still the most common fuel in Argentina, accounting for more than 80% of the cars. As in the previous analysis, we look at the price distribution for each of the 4 fuel types to see which type of car currently costs more.
#Boxplot
ggplot(data=df,aes(x = fuel_type, y = money_dollar, fill = fuel_type)) +
  geom_boxplot(alpha = .6, outlier.alpha = 0.5) +
  geom_jitter(shape = 1, alpha = 0.5) +
  theme_minimal()
Price may affect which cars Argentines demand. Petrol cars are less expensive than the other types, which may be why they make up more than 80% of the market. On the other hand, only 0.6% of the cars are Híbrido/Nafta (petrol hybrid), possibly because their prices are much higher than the others.
The price of a used car could depend on its age or on the kilometres it has been driven. In this step, we analyze whether age and kilometres are related to the selling price.
library("ggplot2")
library("GGally")
data<-df[,c("age","kilometres","money_dollar")]
#Correlation plot
ggpairs(data)+theme_bw()
There is indeed a strong positive correlation between kilometres and age: the older the car, the more kilometres it has been driven. We also confirm that older cars are cheaper, since age and price are negatively correlated. Similarly, the more kilometres a car has been driven, the lower its price.
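The correlations shown in the pairs plot can also be obtained as plain numbers with base R. The sketch below runs on hypothetical values; on the real data, `cor(data)` would use the three columns selected above.

```r
# Hypothetical cars (not taken from the dataset)
toy <- data.frame(
  age = c(0, 1, 3, 5, 8, 10),
  kilometres = c(500, 12000, 40000, 65000, 110000, 150000),
  money_dollar = c(35000, 31000, 26000, 21000, 14000, 10000)
)

# Pearson correlation matrix of the three numeric variables
cor_mat <- cor(toy, method = "pearson")
print(round(cor_mat, 2))

# cor.test() adds a p-value and confidence interval for a single pair
ct <- cor.test(toy$age, toy$money_dollar)
print(ct$p.value)
```

Because car prices are right-skewed, `method = "spearman"` may be a reasonable alternative, as it depends only on ranks.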
Since Volkswagen is currently the leading brand in Argentina, we want to estimate the average price of Volkswagen cars in the whole Argentine market. We will construct a 95% confidence interval for the average price of Volkswagen cars. The population standard deviation is unknown and we only have the sample mean and sample standard deviation, so we use the t-distribution.
volkswagen_car<-df[which(df$brand=="Volkswagen"),]
#Size of data
n_volk<-length(volkswagen_car$money_dollar)
#Sample mean of data
x_volk<-mean(volkswagen_car$money_dollar)
#Sample SD of data
sd_volk<-sd(volkswagen_car$money_dollar)
#Critical t value for a 95% interval
t_crit<-qt(0.975,df=n_volk-1)
#Upper bound
x_volk+t_crit*(sd_volk/sqrt(n_volk))
## [1] 35101.53
#Lower bound
x_volk-t_crit*(sd_volk/sqrt(n_volk))
## [1] 28197.88
#t.test function
t.test(volkswagen_car$money_dollar,conf.level=0.95)
##
## One Sample t-test
##
## data: volkswagen_car$money_dollar
## t = 18.224, df = 87, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 28197.88 35101.53
## sample estimates:
## mean of x
## 31649.7
We are 95% confident that the average price of Volkswagen cars in Argentina lies between $28,198 and $35,102.
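The manual calculation and `t.test()` agree because both compute the mean plus or minus the critical t value times the standard error. A minimal sketch that checks this on simulated data (the function name `t_interval` and the simulated prices are hypothetical):

```r
# Hypothetical helper: two-sided t confidence interval for a mean
t_interval <- function(x, conf.level = 0.95) {
  n <- length(x)
  # Critical value of the t distribution with n - 1 degrees of freedom
  t_crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)
  # Standard error of the sample mean
  se <- sd(x) / sqrt(n)
  mean(x) + c(-1, 1) * t_crit * se
}

# Simulated prices, for illustration only
set.seed(1)
x <- rnorm(30, mean = 30000, sd = 8000)

manual <- t_interval(x)
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
print(all.equal(manual, builtin))
```

Changing `conf.level` in both calls gives matching intervals at any other confidence level.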
Next, we estimate the proportion of Volkswagen cars in the Argentine market. The sample proportion of Volkswagen is p = 0.1725. We will construct a 90% confidence interval for this proportion, using the total number of cars in the sample as n.
#Sample proportion
p_bar<-0.1725
#Total number of cars in the sample
n_total<-nrow(df)
#Standard error of the sample proportion
se_p<-sqrt((p_bar*(1-p_bar))/n_total)
#Critical z value for a 90% interval
z_crit<-qnorm(0.95)
#Upper bound
p_bar + (z_crit*se_p)
## [1] 0.2000182
#Lower bound
p_bar - (z_crit*se_p)
## [1] 0.1449818
We are 90% confident that the proportion of Volkswagen cars in the Argentine market is between 14.5% and 20.0%.
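The normal-approximation interval for a proportion can be wrapped in a small function. This is a sketch, not the author's code: the function name `prop_interval` is hypothetical, and the counts x = 88 Volkswagen cars out of n = 510 cars in total are assumptions inferred from the outputs above (df = 87 in the t-test and the frequency table). Here n is the total sample size, since the proportion is measured over all cars.

```r
# Hypothetical helper: normal-approximation (Wald) interval for a proportion
prop_interval <- function(x, n, conf.level = 0.90) {
  p_hat <- x / n
  # Critical value of the standard normal distribution
  z_crit <- qnorm(1 - (1 - conf.level) / 2)
  # Standard error of the sample proportion, based on the total sample size
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z_crit * se, upper = p_hat + z_crit * se)
}

# Assumed counts: 88 Volkswagen cars among 510 cars in total
ci <- prop_interval(x = 88, n = 510)
print(round(ci, 3))

# Base R alternative: prop.test() returns a score interval instead
ci_score <- prop.test(88, 510, conf.level = 0.90, correct = FALSE)$conf.int
print(round(as.numeric(ci_score), 3))
```

The two intervals differ slightly because `prop.test()` uses the score method rather than the Wald formula.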
We want to test whether the average price of Audi cars is higher than the average price of BMW cars in Argentina. We set up the hypotheses:
H0: population mean(Audi) = population mean(BMW);
Ha: population mean(Audi) > population mean(BMW).
Assuming H0 is true, we apply an unpooled (Welch) t-test, which does not assume equal variances in the two groups.
#Data of Audi car
audi<-df[which(df$brand=="Audi"),]
#Data of BMW car
bmw<-df[which(df$brand=="BMW"),]
#Unpooled t-test
t.test(x=audi$money_dollar,y=bmw$money_dollar,
alternative='greater',
mu=0,
var.equal=FALSE,
conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: audi$money_dollar and bmw$money_dollar
## t = 1.0867, df = 7.5428, p-value = 0.1553
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -42060.86 Inf
## sample estimates:
## mean of x mean of y
## 109123.16 51082.04
At a significance level of 5%, the p-value = 0.1553 > 0.05, so we fail to reject H0. The data do not provide sufficient evidence that the average price of Audi cars is higher than the average price of BMW cars in Argentina, even though the sample mean for Audi ($109,123) is more than twice that for BMW ($51,082).
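The statistic, degrees of freedom and one-sided p-value reported by the Welch test can be reproduced by hand, which shows why the degrees of freedom are not a whole number. The sketch below uses simulated data; the function name `welch` and the sample sizes and parameters are hypothetical.

```r
# Hypothetical helper: one-sided Welch t-test (Ha: mean(x) > mean(y))
welch <- function(x, y) {
  # Estimated variance of each sample mean
  vx <- var(x) / length(x)
  vy <- var(y) / length(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- pt(t_stat, df = df, lower.tail = FALSE)
  c(t = t_stat, df = df, p.value = p_value)
}

# Simulated prices, for illustration only
set.seed(42)
x <- rnorm(8, mean = 100000, sd = 90000)
y <- rnorm(13, mean = 50000, sd = 20000)

manual <- welch(x, y)
ref <- t.test(x, y, alternative = "greater", var.equal = FALSE)
print(round(manual, 4))
```

When one small sample has a much larger variance than the other, the degrees of freedom fall toward the size of that sample minus one, and the test needs a large difference in means to reach significance.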