HW1

Importing and describing data

mydata <- read.table("./cars.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##    mpg cylinders cubicinches  hp weightlbs time.to.60 year    brand
## 1 14.0         8         350 165      4209         12 1972      US.
## 2 31.9         4          89  71      1925         14 1980  Europe.
## 3 17.0         8         302 140      3449         11 1971      US.
## 4 15.0         8         400 150      3761         10 1971      US.
## 5 30.5         4          98  63      2051         17 1978      US.
## 6 23.0         8         350 125      3900         17 1980      US.

Unit of observation is a single car. Sample size is 261.

Explanation of variables:

MPG: Fuel efficiency in miles per gallon,
Cylinders: Number of cylinders an engine has,
Cubicinches: Engine size in cubic inches,
hp: Measurement for power of a car,
Weightlbs: Weight of a car in lbs,
time.to.60: Time needed form 0 to 60 mph,
year: The year of production,
brand: The country of origin (US, Europe, Japan).

Source: https://www.kaggle.com/datasets/abineshkumark/carsdata?resource=download

library(pastecs)
round(stat.desc(mydata), 2)

##                  mpg cylinders cubicinches       hp weightlbs time.to.60
## nbr.val       261.00    261.00      259.00   261.00    258.00     261.00
## nbr.null        0.00      0.00        0.00     0.00      0.00       0.00
## nbr.na          0.00      0.00        2.00     0.00      3.00       0.00
## min            10.00      3.00       68.00    46.00   1613.00       8.00
## max            46.60      8.00      455.00   230.00   4997.00      25.00
## range          36.60      5.00      387.00   184.00   3384.00      17.00
## sum          6040.80   1459.00    52038.00 27760.00 776537.00    4058.00
## median         22.00      6.00      156.00    95.00   2867.50      16.00
## mean           23.14      5.59      200.92   106.36   3009.83      15.55
## SE.mean         0.48      0.11        6.79     2.51     53.17       0.18
## CI.mean.0.95    0.95      0.21       13.37     4.94    104.70       0.35
## var            61.21      3.00    11937.38  1640.25 729382.65       8.47
## std.dev         7.82      1.73      109.26    40.50    854.04       2.91
## coef.var        0.34      0.31        0.54     0.38      0.28       0.19
##                   year brand
## nbr.val         261.00    NA
## nbr.null          0.00    NA
## nbr.na            0.00    NA
## min            1971.00    NA
## max            1983.00    NA
## range            12.00    NA
## sum          515950.00    NA
## median         1977.00    NA
## mean           1976.82    NA
## SE.mean           0.23    NA
## CI.mean.0.95      0.44    NA
## var              13.23    NA
## std.dev           3.64    NA
## coef.var          0.00    NA

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:pastecs':
## 
##     extract

mydata <- drop_na(mydata)

library(pastecs)
round(stat.desc(mydata), 2)

##                  mpg cylinders cubicinches       hp weightlbs time.to.60
## nbr.val       256.00    256.00      256.00   256.00    256.00     256.00
## nbr.null        0.00      0.00        0.00     0.00      0.00       0.00
## nbr.na          0.00      0.00        0.00     0.00      0.00       0.00
## min            10.00      3.00       70.00    46.00   1613.00       8.00
## max            46.60      8.00      455.00   230.00   4997.00      25.00
## range          36.60      5.00      385.00   184.00   3384.00      17.00
## sum          5935.90   1431.00    51546.00 27341.00 769650.00    3967.00
## median         22.00      5.00      156.00    95.00   2832.50      16.00
## mean           23.19      5.59      201.35   106.80   3006.45      15.50
## SE.mean         0.49      0.11        6.85     2.54     53.47       0.18
## CI.mean.0.95    0.97      0.21       13.49     5.01    105.31       0.36
## var            61.95      3.05    12006.06  1655.42 732003.70       8.44
## std.dev         7.87      1.75      109.57    40.69    855.57       2.91
## coef.var        0.34      0.31        0.54     0.38      0.28       0.19
##                   year brand
## nbr.val         256.00    NA
## nbr.null          0.00    NA
## nbr.na            0.00    NA
## min            1971.00    NA
## max            1983.00    NA
## range            12.00    NA
## sum          506068.00    NA
## median         1977.00    NA
## mean           1976.83    NA
## SE.mean           0.23    NA
## CI.mean.0.95      0.45    NA
## var              13.22    NA
## std.dev           3.64    NA
## coef.var          0.00    NA

The average time to 60 for cars is 15.50 s.

The cars in the analysis are from years 1971 to 1983.

Their mpg ranges for 36.60, from 10 mpg to 46.60 mpg.

The median of cylinders is 5 meaning half cars have less cylinders and half more.

Task 1

Research question: Is there a significant difference in the number of cylinders in vehicles based on the country of production?

\(H_0\): \(𝜇_{Eu}\) = \(𝜇_{USA}\) = \(𝜇_{Jpn}\) \(H_1\): \(𝜇_{Eu}\) ≠ \(𝜇_{USA}\) ≠ \(𝜇_{Jpn}\)

library(psych)
describeBy(x = mydata$cylinders, group = mydata$brand)

## 
##  Descriptive statistics by group 
## group:  Europe.
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 47 4.15 0.47      4    4.03   0   4   6     2 3.04     8.39 0.07
## ------------------------------------------------------------ 
## group:  Japan.
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 51 4.16 0.64      4       4   0   3   6     3 2.06     3.98 0.09
## ------------------------------------------------------------ 
## group:  US.
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 158 6.48 1.63      6    6.59 2.97   4   8     4 -0.46    -1.36 0.13

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = cylinders)) +
  geom_histogram(binwidth = 1, fill="darkgreen") +
  facet_wrap(~brand, ncol = 1) + 
  ylab("Frequency")

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(brand) %>%
  shapiro_test(cylinders)

## # A tibble: 3 × 4
##   brand      variable  statistic        p
##   <chr>      <chr>         <dbl>    <dbl>
## 1 " Europe." cylinders     0.359 4.85e-13
## 2 " Japan."  cylinders     0.460 1.97e-12
## 3 " US."     cylinders     0.759 7.78e-15

We reject the H0 for normal distribution of cars produced in Europe at p < 0.001.

We reject the H0 for normal distribution of cars produced in US at p < 0.001.

We reject the H0 for normal distribution of cars produced in Japan at p < 0.001.

The conclusion from the Shapiro-Wilk test and the graphs is that the distribution for all three groups is not normal. For that reason we need to perform the non-parametric test for testing the hypothesis.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(mydata$cylinders, group = mydata$brand)

## Warning in leveneTest.default(mydata$cylinders, group = mydata$brand):
## mydata$brand coerced to factor.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2  76.621 < 2.2e-16 ***
##       253                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject H0 at p < 0.001. Variances are not the same between groups.

Paramtric test - One way anova

ANOVA_results <- aov(cylinders ~ brand, 
                     data = mydata)

summary(ANOVA_results)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## brand         2  327.8  163.89   92.11 <2e-16 ***
## Residuals   253  450.1    1.78                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non parametric test

kruskal.test(cylinders ~ brand, 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  cylinders by brand
## Kruskal-Wallis chi-squared = 109.94, df = 2, p-value < 2.2e-16

Because normality was not met in our case is more suitable to perform the non-parametric test. In this case this is Kruskal-Wallis rank sum test.

We reject H0 at p < 0.001. At least 1 distribution location is different.

Effect size

kruskal_effsize(cylinders ~ brand, 
             data = mydata)

## # A tibble: 1 × 5
##   .y.           n effsize method  magnitude
## * <chr>     <int>   <dbl> <chr>   <ord>    
## 1 cylinders   256   0.427 eta2[H] large

The effect size large.

library(rstatix)

groups_nonpar <- wilcox_test(cylinders ~ brand,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar

## # A tibble: 3 × 9
##   .y.       group1   group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>     <chr>    <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 cylinders " Europ… " Jap…    47    51      1243 5.82e- 1 1   e+ 0 ns          
## 2 cylinders " Europ… " US."    47   158      1032 1.14e-15 3.42e-15 ****        
## 3 cylinders " Japan… " US."    51   158      1136 2.08e-16 6.24e-16 ****

pwc <- mydata %>%
  wilcox_test(cylinders ~ brand, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(cylinders ~ brand, 
                                data = mydata)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggboxplot(mydata, x = "brand", y = "cylinders", add = "point", ylim=c(0, 12)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = brand), color = "orange",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "orange", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = brand)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

We found that the distribution of number of cylinders a car has differs for at least one of the countries of origin (x² = 109.94, p < 0.01). The effect size was large (η2 = 0.43). Post-Hoc test revealed differences for two pair of groups, between US and Europe cars and between US and Japan cars. There is no statistical difference between Europe and Japan cars.

Task 2

Research question: Is there a correlation between the size of the engine and the weight of the car?

\(H_0\): \(ρ_{eng,weight}\) = 0 \(H_1\): \(ρ_{eng,weight}\) ≠ 0

library(car)
scatterplotMatrix(mydata[ , c(3,5)], smooth = FALSE)

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c(3,5)]),
      type = "pearson")

##             cubicinches weightlbs
## cubicinches        1.00      0.93
## weightlbs          0.93      1.00
## 
## n= 256 
## 
## 
## P
##             cubicinches weightlbs
## cubicinches              0       
## weightlbs    0

cor.test(mydata$cubicinches, mydata$weightlbs,
         method = "pearson",
         exact = FALSE)

## 
##  Pearson's product-moment correlation
## 
## data:  mydata$cubicinches and mydata$weightlbs
## t = 40.334, df = 254, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9113390 0.9448898
## sample estimates:
##       cor 
## 0.9300272

We reject H0 at p < 0.001. The correlation between engine size and weight is statistically significant. Linear relationship between engine size and weight is positive and very strong.

Task 3

Research question: Does the country of production vary depending on the year of production?

\(H_0\): There is no association between year of production and country of production.

\(H_1\): There is association between year of production and country of production.

To perform the test needed in this task we first need to change both variables to ordinal variables.

mydata$yearF <- ifelse(test = mydata$year <= 1977,
                      yes = 0,
                      no = 1)

head(mydata)

##    mpg cylinders cubicinches  hp weightlbs time.to.60 year    brand yearF
## 1 14.0         8         350 165      4209         12 1972      US.     0
## 2 31.9         4          89  71      1925         14 1980  Europe.     1
## 3 17.0         8         302 140      3449         11 1971      US.     0
## 4 15.0         8         400 150      3761         10 1971      US.     0
## 5 30.5         4          98  63      2051         17 1978      US.     1
## 6 23.0         8         350 125      3900         17 1980      US.     1

results <- chisq.test(mydata$brand, mydata$yearF)
results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$brand and mydata$yearF
## X-squared = 10.564, df = 2, p-value = 0.005083

We reject H0 at p = 0.006. We found association between country of production and year of production.

addmargins(results$observed)

##             mydata$yearF
## mydata$brand   0   1 Sum
##      Europe.  31  16  47
##      Japan.   19  32  51
##      US.      96  62 158
##     Sum      146 110 256

In the table above the empirical frequencies are shown.

Between the years 1971 and 1977 31 cars were build in Europe.

Between the years 1977 and 1983 16 cars were build in Europe.

Between the years 1971 and 1977 19 cars were build in Japan.

Between the years 1977 and 1983 32 cars were build in Japan.

Between the years 1971 and 197 96 cars were build in US.

Between the years 1977 and 1983 62 cars were build in US .

round(results$expected, 2)

##             mydata$yearF
## mydata$brand     0     1
##      Europe. 26.80 20.20
##      Japan.  29.09 21.91
##      US.     90.11 67.89

In the table above the theoretical frequencies are shown. All the numbers are bigger than 5, so the 2nd assumption is met.

If there is no association we would expect 26.80 cars to be build in Europe between the years 1971 and 1977.

If there is no association we would expect 20.20 cars to be build in Europe between the years 1977 and 1983.

If there is no association we would expect 29.09 cars to be build in Japan between the years 1971 and 1977.

If there is no association we would expect 21.91 cars to be build in Japan between the years 1977 and 1983.

If there is no association we would expect 90.11 cars to be build in US between the years 1971 and 1977.

If there is no association we would expect 67.89 cars to be build in US between the years 1977 and 1983.

round(results$residuals, 2)

##             mydata$yearF
## mydata$brand     0     1
##      Europe.  0.81 -0.93
##      Japan.  -1.87  2.15
##      US.      0.62 -0.71

In the combination Japan and years 1977 - 1983 there is more than expected number of cars produced (α = 5%).

Effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata$brand, mydata$yearF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.18              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.18)

## [1] "small"
## (Rules: funder2019)

The effect size is small.

In the years 1977 to 1983 there were more than expected cars produced in Japan. For other countries we cannot say there is a statistical difference between expected and real number of produced cars.

HW1

Matej Peroša

2025-01-12

Importing and describing data

Task 1

Paramtric test - One way anova

Non parametric test

Effect size

Task 2

Task 3

Effect size