mydata <- read.table("./cars.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## 1 14.0 8 350 165 4209 12 1972 US.
## 2 31.9 4 89 71 1925 14 1980 Europe.
## 3 17.0 8 302 140 3449 11 1971 US.
## 4 15.0 8 400 150 3761 10 1971 US.
## 5 30.5 4 98 63 2051 17 1978 US.
## 6 23.0 8 350 125 3900 17 1980 US.
Unit of observation is a single car. Sample size is 261.
Explanation of variables:
Source: https://www.kaggle.com/datasets/abineshkumark/carsdata?resource=download
library(pastecs)
round(stat.desc(mydata), 2)
## mpg cylinders cubicinches hp weightlbs time.to.60
## nbr.val 261.00 261.00 259.00 261.00 258.00 261.00
## nbr.null 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 2.00 0.00 3.00 0.00
## min 10.00 3.00 68.00 46.00 1613.00 8.00
## max 46.60 8.00 455.00 230.00 4997.00 25.00
## range 36.60 5.00 387.00 184.00 3384.00 17.00
## sum 6040.80 1459.00 52038.00 27760.00 776537.00 4058.00
## median 22.00 6.00 156.00 95.00 2867.50 16.00
## mean 23.14 5.59 200.92 106.36 3009.83 15.55
## SE.mean 0.48 0.11 6.79 2.51 53.17 0.18
## CI.mean.0.95 0.95 0.21 13.37 4.94 104.70 0.35
## var 61.21 3.00 11937.38 1640.25 729382.65 8.47
## std.dev 7.82 1.73 109.26 40.50 854.04 2.91
## coef.var 0.34 0.31 0.54 0.38 0.28 0.19
## year brand
## nbr.val 261.00 NA
## nbr.null 0.00 NA
## nbr.na 0.00 NA
## min 1971.00 NA
## max 1983.00 NA
## range 12.00 NA
## sum 515950.00 NA
## median 1977.00 NA
## mean 1976.82 NA
## SE.mean 0.23 NA
## CI.mean.0.95 0.44 NA
## var 13.23 NA
## std.dev 3.64 NA
## coef.var 0.00 NA
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
##
## extract
mydata <- drop_na(mydata)
library(pastecs)
round(stat.desc(mydata), 2)
## mpg cylinders cubicinches hp weightlbs time.to.60
## nbr.val 256.00 256.00 256.00 256.00 256.00 256.00
## nbr.null 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00
## min 10.00 3.00 70.00 46.00 1613.00 8.00
## max 46.60 8.00 455.00 230.00 4997.00 25.00
## range 36.60 5.00 385.00 184.00 3384.00 17.00
## sum 5935.90 1431.00 51546.00 27341.00 769650.00 3967.00
## median 22.00 5.00 156.00 95.00 2832.50 16.00
## mean 23.19 5.59 201.35 106.80 3006.45 15.50
## SE.mean 0.49 0.11 6.85 2.54 53.47 0.18
## CI.mean.0.95 0.97 0.21 13.49 5.01 105.31 0.36
## var 61.95 3.05 12006.06 1655.42 732003.70 8.44
## std.dev 7.87 1.75 109.57 40.69 855.57 2.91
## coef.var 0.34 0.31 0.54 0.38 0.28 0.19
## year brand
## nbr.val 256.00 NA
## nbr.null 0.00 NA
## nbr.na 0.00 NA
## min 1971.00 NA
## max 1983.00 NA
## range 12.00 NA
## sum 506068.00 NA
## median 1977.00 NA
## mean 1976.83 NA
## SE.mean 0.23 NA
## CI.mean.0.95 0.45 NA
## var 13.22 NA
## std.dev 3.64 NA
## coef.var 0.00 NA
The average time to 60 for cars is 15.50 s.
The cars in the analysis are from years 1971 to 1983.
Their mpg ranges for 36.60, from 10 mpg to 46.60 mpg.
The median of cylinders is 5 meaning half cars have less cylinders and half more.
Research question: Is there a significant difference in the number of cylinders in vehicles based on the country of production?
\(H_0\): \(𝜇_{Eu}\) = \(𝜇_{USA}\) = \(𝜇_{Jpn}\) \(H_1\): \(𝜇_{Eu}\) ≠ \(𝜇_{USA}\) ≠ \(𝜇_{Jpn}\)
library(psych)
describeBy(x = mydata$cylinders, group = mydata$brand)
##
## Descriptive statistics by group
## group: Europe.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 47 4.15 0.47 4 4.03 0 4 6 2 3.04 8.39 0.07
## ------------------------------------------------------------
## group: Japan.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 51 4.16 0.64 4 4 0 3 6 3 2.06 3.98 0.09
## ------------------------------------------------------------
## group: US.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 158 6.48 1.63 6 6.59 2.97 4 8 4 -0.46 -1.36 0.13
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = cylinders)) +
geom_histogram(binwidth = 1, fill="darkgreen") +
facet_wrap(~brand, ncol = 1) +
ylab("Frequency")
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(brand) %>%
shapiro_test(cylinders)
## # A tibble: 3 × 4
## brand variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 " Europe." cylinders 0.359 4.85e-13
## 2 " Japan." cylinders 0.460 1.97e-12
## 3 " US." cylinders 0.759 7.78e-15
We reject the H0 for normal distribution of cars produced in Europe at p < 0.001.
We reject the H0 for normal distribution of cars produced in US at p < 0.001.
We reject the H0 for normal distribution of cars produced in Japan at p < 0.001.
The conclusion from the Shapiro-Wilk test and the graphs is that the distribution for all three groups is not normal. For that reason we need to perform the non-parametric test for testing the hypothesis.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(mydata$cylinders, group = mydata$brand)
## Warning in leveneTest.default(mydata$cylinders, group = mydata$brand):
## mydata$brand coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 76.621 < 2.2e-16 ***
## 253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject H0 at p < 0.001. Variances are not the same between groups.
ANOVA_results <- aov(cylinders ~ brand,
data = mydata)
summary(ANOVA_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## brand 2 327.8 163.89 92.11 <2e-16 ***
## Residuals 253 450.1 1.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kruskal.test(cylinders ~ brand,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: cylinders by brand
## Kruskal-Wallis chi-squared = 109.94, df = 2, p-value < 2.2e-16
Because normality was not met in our case is more suitable to perform the non-parametric test. In this case this is Kruskal-Wallis rank sum test.
We reject H0 at p < 0.001. At least 1 distribution location is different.
kruskal_effsize(cylinders ~ brand,
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 cylinders 256 0.427 eta2[H] large
The effect size large.
library(rstatix)
groups_nonpar <- wilcox_test(cylinders ~ brand,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata)
groups_nonpar
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 cylinders " Europ… " Jap… 47 51 1243 5.82e- 1 1 e+ 0 ns
## 2 cylinders " Europ… " US." 47 158 1032 1.14e-15 3.42e-15 ****
## 3 cylinders " Japan… " US." 51 158 1136 2.08e-16 6.24e-16 ****
pwc <- mydata %>%
wilcox_test(cylinders ~ brand,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(cylinders ~ brand,
data = mydata)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggboxplot(mydata, x = "brand", y = "cylinders", add = "point", ylim=c(0, 12)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = brand), color = "orange",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "orange",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = brand)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
We found that the distribution of number of cylinders a car has differs for at least one of the countries of origin (x² = 109.94, p < 0.01). The effect size was large (η2 = 0.43). Post-Hoc test revealed differences for two pair of groups, between US and Europe cars and between US and Japan cars. There is no statistical difference between Europe and Japan cars.
Research question: Is there a correlation between the size of the engine and the weight of the car?
\(H_0\): \(ρ_{eng,weight}\) = 0 \(H_1\): \(ρ_{eng,weight}\) ≠ 0
library(car)
scatterplotMatrix(mydata[ , c(3,5)], smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(3,5)]),
type = "pearson")
## cubicinches weightlbs
## cubicinches 1.00 0.93
## weightlbs 0.93 1.00
##
## n= 256
##
##
## P
## cubicinches weightlbs
## cubicinches 0
## weightlbs 0
cor.test(mydata$cubicinches, mydata$weightlbs,
method = "pearson",
exact = FALSE)
##
## Pearson's product-moment correlation
##
## data: mydata$cubicinches and mydata$weightlbs
## t = 40.334, df = 254, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9113390 0.9448898
## sample estimates:
## cor
## 0.9300272
We reject H0 at p < 0.001. The correlation between engine size and weight is statistically significant. Linear relationship between engine size and weight is positive and very strong.
Research question: Does the country of production vary depending on the year of production?
\(H_0\): There is no association between year of production and country of production.
\(H_1\): There is association between year of production and country of production.
To perform the test needed in this task we first need to change both variables to ordinal variables.
mydata$yearF <- ifelse(test = mydata$year <= 1977,
yes = 0,
no = 1)
head(mydata)
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand yearF
## 1 14.0 8 350 165 4209 12 1972 US. 0
## 2 31.9 4 89 71 1925 14 1980 Europe. 1
## 3 17.0 8 302 140 3449 11 1971 US. 0
## 4 15.0 8 400 150 3761 10 1971 US. 0
## 5 30.5 4 98 63 2051 17 1978 US. 1
## 6 23.0 8 350 125 3900 17 1980 US. 1
results <- chisq.test(mydata$brand, mydata$yearF)
results
##
## Pearson's Chi-squared test
##
## data: mydata$brand and mydata$yearF
## X-squared = 10.564, df = 2, p-value = 0.005083
We reject H0 at p = 0.006. We found association between country of production and year of production.
addmargins(results$observed)
## mydata$yearF
## mydata$brand 0 1 Sum
## Europe. 31 16 47
## Japan. 19 32 51
## US. 96 62 158
## Sum 146 110 256
In the table above the empirical frequencies are shown.
Between the years 1971 and 1977 31 cars were build in Europe.
Between the years 1977 and 1983 16 cars were build in Europe.
Between the years 1971 and 1977 19 cars were build in Japan.
Between the years 1977 and 1983 32 cars were build in Japan.
Between the years 1971 and 197 96 cars were build in US.
Between the years 1977 and 1983 62 cars were build in US .
round(results$expected, 2)
## mydata$yearF
## mydata$brand 0 1
## Europe. 26.80 20.20
## Japan. 29.09 21.91
## US. 90.11 67.89
In the table above the theoretical frequencies are shown. All the numbers are bigger than 5, so the 2nd assumption is met.
If there is no association we would expect 26.80 cars to be build in Europe between the years 1971 and 1977.
If there is no association we would expect 20.20 cars to be build in Europe between the years 1977 and 1983.
If there is no association we would expect 29.09 cars to be build in Japan between the years 1971 and 1977.
If there is no association we would expect 21.91 cars to be build in Japan between the years 1977 and 1983.
If there is no association we would expect 90.11 cars to be build in US between the years 1971 and 1977.
If there is no association we would expect 67.89 cars to be build in US between the years 1977 and 1983.
round(results$residuals, 2)
## mydata$yearF
## mydata$brand 0 1
## Europe. 0.81 -0.93
## Japan. -1.87 2.15
## US. 0.62 -0.71
In the combination Japan and years 1977 - 1983 there is more than expected number of cars produced (α = 5%).
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$brand, mydata$yearF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.18 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.18)
## [1] "small"
## (Rules: funder2019)
The effect size is small.
In the years 1977 to 1983 there were more than expected cars produced in Japan. For other countries we cannot say there is a statistical difference between expected and real number of produced cars.