Is there a statistically significant difference in the average price of used cars in the US in 2010 and 2015?
This research question is broad and could be narrowed further. To keep the analysis simple, however, the average price of used cars in the US is compared between 2010 and 2015 only. The variables required to answer the question are therefore the year in which the car was sold (“year”; 2010 or 2015) and the price of the car (“sellingprice”).
Null hypothesis: The average price of used cars in the US is the same in 2010 and 2015.
Alternative hypothesis: The average price of used cars in the US is different in 2010 and 2015.
data <- read.csv("car_prices.csv")
#Keep only the two columns needed for the analysis.
data_final <- data[, c("year", "sellingprice")]
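#Filter for the years 2010 and 2015.
#Note: %in% tests membership. Writing == c(2010, 2015) recycles the two values
#down the rows, triggers a "longer object length" warning, and keeps only some
#of the matching rows. The outputs shown in this report were produced with ==.
data_final <- data_final[data_final$year %in% c(2010, 2015), ]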
#Head
head(data_final)
## year sellingprice
## 2 2015 21500
## 4 2015 27750
## 6 2015 10900
## 12 2015 17700
## 14 2015 21500
## 16 2015 14100
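Filtering on more than one value is a step where `==` and `%in%` behave differently. The following is a minimal sketch on a made-up vector (`years` is hypothetical and not taken from the dataset) showing how `==` recycles the comparison values along the vector, while `%in%` tests set membership:

```r
# Hypothetical toy vector, not taken from car_prices.csv
years <- c(2010, 2010, 2015, 2015, 2012, 2010)

# `==` recycles c(2010, 2015): position 1 is compared with 2010,
# position 2 with 2015, position 3 with 2010, and so on.
keep_eq <- years == c(2010, 2015)

# `%in%` checks each element against the whole set of values.
keep_in <- years %in% c(2010, 2015)

sum(keep_eq)  # rows kept by the recycled comparison
sum(keep_in)  # rows kept by the membership test
```

In this toy case the recycled comparison keeps fewer rows than the membership test, even though five of the six values are 2010 or 2015.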
Variables:
year = the year in which the car was sold.
sellingprice = the price at which the used car was sold (USD).
Unit of Observation:
The unit of observation is one car.
Sample Size:
The sample size for the used cars sold in 2010 is 13253.
The sample size for the used cars sold in 2015 is 4744.
The overall sample size is 17997 cars.
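Group sizes like these can be read off with `table()` and `nrow()`. A small sketch, assuming a data frame shaped like `data_final` (the values in `toy` are invented for illustration):

```r
# Invented example data with the same two columns as data_final
toy <- data.frame(
  year         = c(2010, 2010, 2015, 2010, 2015),
  sellingprice = c(9500, 12000, 21500, 8000, 27750)
)

# Number of observations per year and in total
n_by_year <- table(toy$year)
n_total   <- nrow(toy)

n_by_year
n_total
```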
Website: Kaggle
Name of the dataset: Used Car Auction Prices
URL: https://www.kaggle.com/datasets/tunguz/used-car-auction-prices?resource=download
#Create Factor Variables for "year"
data_final$year_factor <- factor(data_final$year,
                                 levels = c(2010, 2015),
                                 labels = c("2010", "2015"))
#Summary Statistics by Year
library(psych)
describeBy(data_final$sellingprice, group = data_final$year_factor)
##
## Descriptive statistics by group
## group: 2010
## vars n mean sd median trimmed mad min max range skew
## X1 1 13253 12465 7787.57 10500 11568.39 5930.4 275 154000 153725 3.79
## kurtosis se
## X1 43.79 67.65
## ------------------------------------------------------------
## group: 2015
## vars n mean sd median trimmed mad min max range skew
## X1 1 4744 26075.44 13881.04 21600 24285.37 9785.16 1100 173000 171900 1.81
## kurtosis se
## X1 6.53 201.53
Explanation:
Comparing the mean and median selling prices, used cars were considerably more expensive in 2015 than in 2010: the mean rises from 12,465 to 26,075 USD and the median from 10,500 to 21,600 USD. The standard deviation increased as well (7,788 vs. 13,881 USD), so selling prices are spread more widely around the mean in 2015 than in 2010. The minimum and maximum prices were likewise higher in 2015.
Since the data belong to two different groups of units (one from 2010 and one from 2015) and each unit is measured once, either an independent-samples t-test or a Wilcoxon rank-sum test is appropriate.
The requirements for a parametric independent-samples t-test are as follows:
The variable (= sellingprice) is numeric.
The variable (= sellingprice) is normally distributed in each population.
The data come from two independent populations.
The variable (= sellingprice) has the same variance in each population.
If these requirements are violated, a Wilcoxon rank-sum test is conducted instead.
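The equal-variance requirement can be checked formally as well as by comparing standard deviations. The sketch below uses simulated right-skewed prices (the data, seed, and distribution parameters are assumptions for illustration, not the auction data) and the Fligner-Killeen test from base R, which is less sensitive to non-normality than an F test:

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated log-normal prices for two groups with different spread
sim <- data.frame(
  year  = factor(rep(c("2010", "2015"), each = 500)),
  price = c(rlnorm(500, meanlog = 9.2, sdlog = 0.5),
            rlnorm(500, meanlog = 10.0, sdlog = 0.8))
)

# Group standard deviations for a first impression
tapply(sim$price, sim$year, sd)

# Fligner-Killeen test: the null hypothesis is equal variances across groups
fk <- fligner.test(price ~ year, data = sim)
fk$p.value
```

A small p-value here would speak against the equal-variance assumption and therefore against the pooled-variance t-test.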
Testing the Assumptions:
#shapiro.test(data_final$sellingprice)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
#Plot for each year
data_final_2010 <- data_final[data_final$year == 2010, ]
data_final_2015 <- data_final[data_final$year == 2015, ]
ggplot(data_final_2010, aes(x = sellingprice)) + geom_histogram(binwidth = 50, color = "black") + xlab("Price")
ggplot(data_final_2015, aes(x = sellingprice)) + geom_histogram(binwidth = 50, color = "black") + xlab("Price")
As the samples are clearly not normally distributed (both are right-skewed, with skewness 3.79 in 2010 and 1.81 in 2015), the Wilcoxon rank-sum test is conducted.
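A formal normality test can complement the histograms. `shapiro.test()` in base R accepts between 3 and 5000 observations, so for a group as large as the 2010 sample it has to be run on a random subsample. A sketch with simulated skewed prices (the data, seed, and parameters are assumptions for illustration):

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated right-skewed prices standing in for one year's sample
prices <- rlnorm(13000, meanlog = 9.2, sdlog = 0.6)

# shapiro.test() is limited to 5000 values, so draw a random subsample
sub <- sample(prices, 5000)

# Null hypothesis: the data come from a normal distribution
sw <- shapiro.test(sub)
sw$p.value
```

With samples this large, even small departures from normality yield tiny p-values, so the result is best read alongside the histograms rather than on its own.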
Research hypotheses:
Null hypothesis: The location of the distribution of the selling price of used cars is the same in 2010 and 2015.
Alternative hypothesis: The location of the distribution of the selling price of used cars is not the same in 2010 and 2015.
wilcox.test(data_final$sellingprice ~ data_final$year_factor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: data_final$sellingprice by data_final$year_factor
## W = 9531550, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
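For two independent samples, the statistic `W` reported by `wilcox.test()` counts the pairs in which an observation from the first group exceeds one from the second, with ties counted as one half. A toy sketch with invented values (`x` and `y` are hypothetical) makes this concrete:

```r
# Invented toy samples without ties
x <- c(1, 3, 5)      # first group
y <- c(2, 4, 6, 8)   # second group

# Count pairs (x_i, y_j) with x_i > y_j, plus half of any tied pairs
pairs_x_gt_y <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# The same number is reported as W by the rank-sum test
res <- wilcox.test(x, y, paired = FALSE)
unname(res$statistic)
pairs_x_gt_y
```

In the analysis above the first group is 2010, so `W` counts the pairs in which a 2010 price exceeds a 2015 price.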
#install.packages("effectsize")
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(data_final$sellingprice ~ data_final$year_factor,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.70 | [-0.71, -0.69]
interpret_rank_biserial(0.70)
## [1] "very large"
## (Rules: funder2019)
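The rank-biserial correlation can be recovered by hand from the reported test statistic and the group sizes using r = 2W / (n1 * n2) - 1. This is a sketch of the standard formula for two independent samples, not necessarily the exact computation path inside `effectsize`; the inputs are the values printed above:

```r
W  <- 9531550  # Wilcoxon rank-sum statistic from the test output
n1 <- 13253    # cars in the 2010 group
n2 <- 4744     # cars in the 2015 group

# Share of (2010, 2015) pairs in which the 2010 price is higher
# (ties counted as one half)
p_2010_higher <- W / (n1 * n2)

# Rank-biserial correlation
r_rb <- 2 * p_2010_higher - 1
round(r_rb, 2)
```

With these inputs, roughly 15% of the pairs have the higher price in 2010, which gives r of about -0.70. The negative sign reflects that 2010 is the first factor level, and `interpret_rank_biserial()` is applied to the magnitude.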
Based on the sample data, we find that the selling price of used cars differs between 2010 and 2015 (p < 0.001). Hence, we reject the null hypothesis and conclude that the location of the distribution of the selling price of used cars is not the same in 2010 and 2015.
Prices in 2015 are higher than in 2010, and the difference in distribution is very large (r = -0.70; the sign is negative because 2010 is the first group, and the magnitude is 0.70).