Research Question

1.1 Clear research question that can be tested statistically

Is there a statistically significant difference in the average price of used cars in the US in 2010 and 2015?

1.2 Which variables need to be collected to answer your research question?

This is a very broad research question that could be narrowed down further. To keep the analysis simple, however, the average price of used cars in the US is compared between 2010 and 2015 only. The variables required to answer the question are therefore the year in which the car was sold (“year”; 2010 or 2015) and the price of the car (“sellingprice”).

1.3 Write the research hypotheses.

Null hypothesis: The average price of used cars in the US is the same in 2010 and 2015.

Alternative hypothesis: The average price of used cars in the US is different in 2010 and 2015.

Data

2.1 Import of the data, presentation of the data using a function head, definition of the variables used

data <- read.csv("car_prices.csv")

#Keep only the two columns needed for the analysis (columns 1 and 15 of the raw data).
data_final <- data[, c("year", "sellingprice")]

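#Filter for the years 2010 and 2015.
#%in% is needed here: year == c(2010, 2015) recycles the two values down the rows
#(odd rows are compared with 2010, even rows with 2015), drops matching rows that
#sit at the wrong position and raises the warning "longer object length is not a
#multiple of shorter object length".
#The output shown in this report was generated with the == comparison.
data_final <- data_final[data_final$year %in% c(2010, 2015), ]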
#Head
head(data_final)
##    year sellingprice
## 2  2015        21500
## 4  2015        27750
## 6  2015        10900
## 12 2015        17700
## 14 2015        21500
## 16 2015        14100
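
Filtering a column against several values depends on how R compares vectors of different lengths. The following is a minimal sketch on a made-up vector (toy_year is hypothetical and not taken from car_prices.csv) that contrasts == with %in%. The toy vector has six elements, a multiple of two, so R recycles silently and raises no warning here.

```r
# Hypothetical years, not taken from car_prices.csv.
toy_year <- c(2010, 2010, 2015, 2015, 2012, 2010)

# == recycles c(2010, 2015) down the vector: element 1 is compared with 2010,
# element 2 with 2015, element 3 with 2010, and so on.
keep_eq <- toy_year == c(2010, 2015)

# %in% checks every element against the whole set of values.
keep_in <- toy_year %in% c(2010, 2015)

sum(keep_eq)  # 2 elements kept
sum(keep_in)  # 5 elements kept
```

Only %in% keeps every element that equals 2010 or 2015, regardless of its position.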

Variables:

  • year = the year in which the car was sold.

  • sellingprice = the price at which the used car was sold (USD).

2.2 Definition of the unit of observation and the sample size

Unit of Observation:

The unit of observation is one car.

Sample Size:

The sample size for the used cars sold in 2010 is 13253.

The sample size for the used cars sold in 2015 is 4744.

The overall sample size is 17997 cars.
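
Sample sizes like these can be read directly from the data frame. The sketch below uses a small made-up data frame (toy and its values are hypothetical) with the same two column names as data_final. Applied to data_final, the same calls would return the group sizes and the overall size.

```r
# Hypothetical data with the same column names as data_final.
toy <- data.frame(
  year         = c(2010, 2010, 2015, 2010, 2015),
  sellingprice = c(9500, 12000, 21500, 8700, 27750)
)

# Number of cars per year.
n_by_year <- table(toy$year)

# Overall number of cars.
n_total <- nrow(toy)

n_by_year
n_total
```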

2.3 Source of the data set

Website: Kaggle

Name of the dataset: Used Car Auction Prices

URL: https://www.kaggle.com/datasets/tunguz/used-car-auction-prices?resource=download

2.4 Basic descriptive statistics - estimate a few parameters (e.g., functions summary, describe, etc.) and explanation

#Create Factor Variables for "year"
data_final$year_factor <- factor(data_final$year, 
                         levels = c(2010, 2015),
                         labels = c("2010", "2015"))

#Summary Statistics by Year
library(psych)
describeBy(data_final$sellingprice, group = data_final$year_factor)
## 
##  Descriptive statistics by group 
## group: 2010
##    vars     n  mean      sd median  trimmed    mad min    max  range skew
## X1    1 13253 12465 7787.57  10500 11568.39 5930.4 275 154000 153725 3.79
##    kurtosis    se
## X1    43.79 67.65
## ------------------------------------------------------------ 
## group: 2015
##    vars    n     mean       sd median  trimmed     mad  min    max  range skew
## X1    1 4744 26075.44 13881.04  21600 24285.37 9785.16 1100 173000 171900 1.81
##    kurtosis     se
## X1     6.53 201.53

Explanation:

Both the mean and the median selling price are roughly twice as high in 2015 as in 2010 (mean: 26075.44 vs. 12465 USD; median: 21600 vs. 10500 USD). The standard deviation is also larger in 2015 (13881.04 vs. 7787.57 USD), so selling prices are spread more widely around the mean in 2015 than in 2010. The minimum (1100 vs. 275 USD) and maximum (173000 vs. 154000 USD) prices are likewise higher in 2015. In both years the mean exceeds the median and the skewness is positive (3.79 in 2010, 1.81 in 2015), which indicates right-skewed price distributions.

Analysis

3.1 Determine which statistical test to use and why

Because the data come from two separate groups of cars (one from 2010 and one from 2015) and each car is measured only once, the two samples are independent. The appropriate test is therefore either an independent samples t-test or its non-parametric alternative, the Wilcoxon rank-sum test.

3.2 Evaluate all assumptions

The requirements for a parametric independent samples t-test are as follows:

  • The variable (= sellingprice) is numeric.

  • The variable (= sellingprice) is normally distributed in each population.

  • Data must come from two different independent populations.

  • The variable (= sellingprice) has the same variance in each population.

If these requirements are violated, a Wilcoxon rank-sum test is conducted instead.

Testing the Assumptions:

  1. The variable is numeric (a price in USD), so no test is needed here.
  2. To test whether the price is normally distributed, we would usually use the Shapiro-Wilk test. However, shapiro.test() in R only accepts sample sizes between 3 and 5000, and the sample contains 17997 cars. Therefore, normality is assessed visually with a histogram for each year.
#shapiro.test(data_final$sellingprice)

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
#Plot for each year
data_final_2010 <- data_final[data_final$year == 2010, ]
data_final_2015 <- data_final[data_final$year == 2015, ]

#A bin width of 1000 USD keeps the bars readable over a price range of more than 150000 USD.
ggplot(data_final_2010, aes(x = sellingprice)) + geom_histogram(binwidth = 1000, color = "black") + xlab("Price")

ggplot(data_final_2015, aes(x = sellingprice)) + geom_histogram(binwidth = 1000, color = "black") + xlab("Price")

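As the histograms and the skewness and kurtosis values from 2.4 (skewness 3.79 and 1.81; kurtosis 43.79 and 6.53) show, the samples are clearly not normally distributed. Hence, the Wilcoxon rank-sum test is conducted.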
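
A normal Q-Q plot is a common visual complement to a histogram when checking normality. The sketch below uses simulated right-skewed values (toy_price is hypothetical, drawn from a log-normal distribution, and is not the car data). Its size is set to 4744, the size of the 2015 group, to illustrate that a group of at most 5000 observations can still be passed to shapiro.test(), whereas the 2010 group (13253 cars) cannot.

```r
# Simulated right-skewed "prices"; NOT the car_prices.csv data.
set.seed(1)
toy_price <- rlnorm(4744, meanlog = 10, sdlog = 0.5)

# Q-Q plot against the normal distribution: points that bend away from the
# reference line in the upper tail indicate right skew.
qqnorm(toy_price, main = "Normal Q-Q plot (simulated prices)")
qqline(toy_price)

# Shapiro-Wilk test; this runs because the sample size is not above 5000.
sw <- shapiro.test(toy_price)
sw$p.value < 0.05  # TRUE means normality is rejected at the 5% level
```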

3.3 & 3.4 Perform the appropriate statistical test based on the results of the assumption evaluation and its interpretation. If non-parametric alternative is needed, write also the research hypotheses. Calculation of the effect size and its interpretation.

Research hypotheses:

Null hypothesis: The location of the distribution of the selling price of used cars is the same in 2010 and 2015.

Alternative hypothesis: The location of the distribution of the selling price of used cars is not the same in 2010 and 2015.

wilcox.test(data_final$sellingprice ~ data_final$year_factor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE, 
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  data_final$sellingprice by data_final$year_factor
## W = 9531550, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
#install.packages("effectsize")
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize(wilcox.test(data_final$sellingprice ~ data_final$year_factor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE, 
            alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.70             | [-0.71, -0.69]
interpret_rank_biserial(0.70)
## [1] "very large"
## (Rules: funder2019)
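
The rank-biserial correlation can be cross-checked by hand from the test statistic and the two group sizes. The sketch below assumes the usual relationship r = 2W / (n1 * n2) - 1, where W is the statistic reported by wilcox.test() for the first group (here 2010). The variable names are chosen for illustration, and the toy vectors x and y are made up to check the formula on a case that can be counted by hand.

```r
# Values copied from the output above.
W  <- 9531550   # Wilcoxon rank sum statistic for the first group (2010)
n1 <- 13253     # cars in 2010
n2 <- 4744      # cars in 2015

# Share of (2010, 2015) pairs in which the 2010 car has the higher price
# (ties count as half a pair).
p_2010_higher <- W / (n1 * n2)

# Rank-biserial correlation: share of pairs favouring 2010 minus the
# share of pairs favouring 2015.
r_rb <- 2 * p_2010_higher - 1

round(p_2010_higher, 2)  # 0.15
round(r_rb, 2)           # -0.7

# Hand-checkable toy case: x exceeds y in 3 of the 12 possible pairs.
x <- c(1, 3, 5)
y <- c(2, 4, 6, 8)
W_toy <- unname(wilcox.test(x, y)$statistic)       # 3
r_toy <- 2 * W_toy / (length(x) * length(y)) - 1   # -0.5
```

The hand calculation reproduces the reported value of -0.70. The negative sign means that a 2010 car has the higher price in only about 15% of all 2010/2015 pairs.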

Conclusion

Based on the sample data, we find that the selling price of used cars differs between 2010 and 2015 (p < 0.001). Hence, we reject the null hypothesis and conclude that the location of the distribution of the selling price of used cars is not the same in 2010 and 2015.
Prices in 2015 are higher than in 2010, and the difference in distribution is very large (r = -0.70; the negative sign indicates that the 2010 prices tend to be lower than the 2015 prices).