options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("psych")
## 
## The downloaded binary packages are in
##  /var/folders/x6/v0zd1d6546jfd0_m0dvbb_kc0000gn/T//RtmpdBx9Fo/downloaded_packages
library(psych)
data()
data(package = .packages(all.available=TRUE))

install.packages("MASS")

Task 1

Research Question for Task 1:

“Is there a significant difference in city fuel efficiency (MPG.city) between American and non-American cars in the Cars93 dataset?”

Car Data 1993

library(MASS)        
data(Cars93)         
head(Cars93)        
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1        Acura Integra   Small      12.9  15.9      18.8       25          31
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
## 3         Audi      90 Compact      25.9  29.1      32.3       20          26
## 4         Audi     100 Midsize      30.8  37.7      44.6       19          26
## 5          BMW    535i Midsize      23.7  30.0      36.2       22          30
## 6        Buick Century Midsize      14.2  15.7      17.3       22          31
##              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1               None      Front         4        1.8        140 6300
## 2 Driver & Passenger      Front         6        3.2        200 5500
## 3        Driver only      Front         6        2.8        172 5500
## 4 Driver & Passenger      Front         6        2.8        172 5500
## 5        Driver only       Rear         4        3.5        208 5700
## 6        Driver only      Front         4        2.2        110 5200
##   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1         2890             Yes               13.2          5    177       102
## 2         2335             Yes               18.0          5    195       115
## 3         2280             Yes               16.9          5    180       102
## 4         2535             Yes               21.1          6    193       106
## 5         2545             Yes               21.1          4    186       109
## 6         2565              No               16.4          6    189       105
##   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1    68          37           26.5           11   2705 non-USA Acura Integra
## 2    71          38           30.0           15   3560 non-USA  Acura Legend
## 3    67          37           28.0           14   3375 non-USA       Audi 90
## 4    70          37           31.0           17   3405 non-USA      Audi 100
## 5    69          39           27.0           13   3640 non-USA      BMW 535i
## 6    69          41           28.0           16   2880     USA Buick Century
?Cars93

Explanation of variables:

Description: The Cars93 data frame has 93 rows and 27 columns.

This data frame contains the following columns:

Manufacturer: Manufacturer.

Model: Model.

Type: a factor with levels “Small”, “Sporty”, “Compact”, “Midsize”, “Large” and “Van”.

Min.Price: Minimum Price (in $1,000): price for a basic version.

Price: Midrange Price (in $1,000): average of Min.Price and Max.Price.

Max.Price: Maximum Price (in $1,000): price for “a premium version”.

MPG.city: City MPG (miles per US gallon by EPA rating).

MPG.highway: Highway MPG.

AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.

DriveTrain: Drive train type: rear wheel, front wheel or 4WD; (factor).

Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).

EngineSize: Engine size (litres).

Horsepower: Horsepower (maximum).

RPM: RPM (revs per minute at maximum horsepower).

Rev.per.mile: Engine revolutions per mile (in highest gear).

Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).

Fuel.tank.capacity: Fuel tank capacity (US gallons).

Passengers: Passenger capacity (persons)

Length: Length (inches).

Wheelbase: Wheelbase (inches).

Width: Width (inches).

Turn.circle: U-turn space (feet).

Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).

Luggage.room: Luggage capacity (cubic feet) (missing for vans).

Weight: Weight (pounds).

Origin: Of non-USA or USA company origins? (factor).

Make: Combination of Manufacturer and Model (character).

Hypotheses

H0: American and non-American cars have the same city fuel efficiency (MPG.city).

H1: American and non-American cars differ in city fuel efficiency (MPG.city).

cars_subset <- subset(Cars93, select = c(Origin, MPG.city))

head(cars_subset)
##    Origin MPG.city
## 1 non-USA       25
## 2 non-USA       18
## 3 non-USA       20
## 4 non-USA       19
## 5 non-USA       22
## 6     USA       22
result <- describeBy(Cars93$MPG.city, group = Cars93$Origin, mat = TRUE)


result
##     item  group1 vars  n     mean       sd median  trimmed    mad min max range
## X11    1     USA    1 48 20.95833 3.994455     20 20.60000 4.4478  15  31    16
## X12    2 non-USA    1 45 23.86667 6.672876     22 22.86486 5.9304  17  46    29
##          skew  kurtosis        se
## X11 0.7968452 0.0939812 0.5765500
## X12 1.4299847 1.8877575 0.9947336

Parametric test

library(ggplot2) 
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(Cars93, aes(x = MPG.city)) + 
  geom_histogram(binwidth = 5, colour="salmon", fill = "salmon") +
  facet_wrap(~Origin, ncol = 1) +
  ylab("Frequency")

t.test(cars_subset$MPG.city ~ cars_subset$Origin,
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  cars_subset$MPG.city by cars_subset$Origin
## t = -2.5296, df = 71.024, p-value = 0.01364
## alternative hypothesis: true difference in means between group USA and group non-USA is not equal to 0
## 95 percent confidence interval:
##  -5.2008385 -0.6158282
## sample estimates:
##     mean in group USA mean in group non-USA 
##              20.95833              23.86667
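
As a cross-check, the Welch statistic can be reproduced by hand from the group summaries printed by describeBy. The following is a minimal sketch in base R; the variable names (m_usa, s_usa, and so on) are illustrative, and the numbers are copied from the printed output rather than recomputed from the data, so small rounding differences are expected.

```r
# Group summaries copied from the describeBy output (not recomputed from the data)
m_usa <- 20.95833; s_usa <- 3.994455; n_usa <- 48
m_non <- 23.86667; s_non <- 6.672876; n_non <- 45

# Squared standard errors of the two group means
v_usa <- s_usa^2 / n_usa
v_non <- s_non^2 / n_non

# Welch t statistic: mean difference divided by the unpooled standard error
t_stat <- (m_usa - m_non) / sqrt(v_usa + v_non)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_usa + v_non)^2 /
  (v_usa^2 / (n_usa - 1) + v_non^2 / (n_non - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The values should agree with the t.test output (t = -2.5296, df = 71.024, p-value = 0.01364) up to rounding.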

Non-parametric test

wilcox.test(cars_subset$MPG.city ~ cars_subset$Origin, 
            correct = FALSE, 
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  cars_subset$MPG.city by cars_subset$Origin
## W = 809.5, p-value = 0.03693
## alternative hypothesis: true location shift is not equal to 0

Testing for normality

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Cars93 %>%
  group_by(Origin) %>%
  shapiro_test(MPG.city)
## # A tibble: 2 × 4
##   Origin  variable statistic         p
##   <fct>   <chr>        <dbl>     <dbl>
## 1 USA     MPG.city     0.929 0.00625  
## 2 non-USA MPG.city     0.850 0.0000359

H0: The distribution of MPG.city for US cars is normal. With p = 0.00625 < 0.05, we reject H0; normality cannot be assumed.

H0: The distribution of MPG.city for non-US cars is normal. With p = 0.0000359 < 0.05, we reject H0; normality cannot be assumed.

Normality is rejected in both groups, so the parametric t-test is not appropriate and the non-parametric test is used instead.

Effect size

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize(wilcox.test(cars_subset$MPG.city ~ cars_subset$Origin,
               correct = FALSE,
               exact = FALSE,
               alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.25             | [-0.46, -0.02]

The negative rank-biserial correlation (r = -0.25) indicates that non-US cars tend to have higher MPG.city values (better fuel efficiency) than US cars. The 95% confidence interval ([-0.46, -0.02]) does not include 0, which is consistent with the significant Wilcoxon test. By common rules of thumb, |r| = 0.25 corresponds to a small-to-medium difference in fuel efficiency between the two groups.
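
The rank-biserial value can be traced back to the W statistic of the Wilcoxon test. The sketch below assumes the standard definition (share of pairs favoring the first group minus share favoring the second); the variable names are illustrative and the inputs are copied from the printed output.

```r
# Values copied from the wilcox.test output and the group sizes in describeBy
W     <- 809.5   # statistic reported for the first group (USA)
n_usa <- 48
n_non <- 45

# W counts the (USA, non-USA) pairs in which the USA car has the higher
# MPG.city, with ties counted as 0.5, so this is the share of pairs
# favoring USA cars
p_usa_higher <- W / (n_usa * n_non)

# Rank-biserial correlation: share favoring USA minus share favoring non-USA
r_rb <- 2 * p_usa_higher - 1

round(c(p_usa_higher = p_usa_higher, r_rb = r_rb), 3)
```

Read this way, a USA car has the higher city MPG in only about 37% of all USA/non-USA pairings, which is what the negative sign expresses.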

effectsize::cohens_d(cars_subset$MPG.city ~ cars_subset$Origin,
                     pooled_sd = FALSE)
## Cohen's d |         95% CI
## --------------------------
## -0.53     | [-0.95, -0.11]
## 
## - Estimated using un-pooled SD.

Cohen’s d = -0.53: This indicates a medium negative effect size. Non-US cars tend to have higher MPG.city values (better fuel efficiency) compared to US cars.
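
For reference, the un-pooled Cohen's d can be recomputed from the group means and standard deviations in the describeBy output. This is a sketch under the assumption that the un-pooled standardizer is the square root of the average of the two group variances; the names are illustrative.

```r
# Group means and standard deviations copied from the describeBy output
m_usa <- 20.95833; s_usa <- 3.994455
m_non <- 23.86667; s_non <- 6.672876

# Un-pooled standardizer: square root of the mean of the two group variances
sd_unpooled <- sqrt((s_usa^2 + s_non^2) / 2)

# Cohen's d: mean difference (USA minus non-USA) in standard deviation units
d <- (m_usa - m_non) / sd_unpooled

round(d, 2)
```

The mean difference of roughly 2.9 MPG therefore corresponds to about half a standard deviation.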

Interpretation and Conclusion

Because normality is rejected in both groups, the non-parametric Wilcoxon rank sum test is the appropriate test here.

Null Hypothesis (H₀): The MPG.city distributions of American and non-American cars have the same location (location shift = 0).

Alternative Hypothesis (H₁): The distributions differ in location (true location shift is not equal to 0).

p-value: p = 0.03693. Because the p-value is below the significance level of α = 0.05, we reject the null hypothesis. There is a statistically significant difference in MPG.city between American and non-American cars.

Conclusion: American and non-American cars differ significantly in city fuel efficiency (MPG.city). The two-sided test itself does not indicate the direction, but the group medians (20 vs. 22 MPG) and the effect sizes (rank-biserial r = -0.25, Cohen’s d = -0.53) show that non-American cars tend to be more fuel-efficient.

Task 2

Research question: “Is there a significant correlation between city fuel efficiency (MPG.city) and engine power (Horsepower) in the Cars93 dataset?”

Numerical Variables:

MPG.city (miles per gallon in city driving): measures fuel efficiency.

Horsepower: measures the engine’s maximum power output.

Correlation Coefficient: Both variables are numeric, so Pearson’s correlation coefficient (r) is the natural first choice, provided the relationship is roughly linear and both variables are approximately normally distributed. If normality does not hold, Spearman’s rank correlation is the alternative.

Statistical Test: A hypothesis test for the correlation coefficient will be conducted to determine whether the observed correlation is statistically significant.

Hypotheses:

Null Hypothesis (H₀): There is no correlation between MPG.city and Horsepower (ρ = 0).

Alternative Hypothesis (H₁): There is a correlation between MPG.city and Horsepower (ρ ≠ 0).

cars_subset <- subset(Cars93, select = c(Horsepower, MPG.city))

head(cars_subset)
##   Horsepower MPG.city
## 1        140       25
## 2        200       18
## 3        172       20
## 4        172       19
## 5        208       22
## 6        110       22
describe(Cars93[, c("MPG.city", "Horsepower")])
##            vars  n   mean    sd median trimmed   mad min max range skew
## MPG.city      1 93  22.37  5.62     21   21.61  4.45  15  46    31 1.65
## Horsepower    2 93 143.83 52.37    140  138.95 44.48  55 300   245 0.92
##            kurtosis   se
## MPG.city       3.58 0.58
## Horsepower     0.90 5.43
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(~ MPG.city + Horsepower, 
                  data = Cars93,
                  smooth = FALSE, 
                  col = c("blue")) 

install.packages("GGally")
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(cars_subset)

The correlation coefficient (corr = -0.673) shown in the ggpairs plot is Pearson’s r and indicates a strong negative correlation between MPG.city and Horsepower. The three asterisks (***) signify that this correlation is highly statistically significant (p < 0.001). As horsepower increases, city fuel efficiency (MPG.city) tends to decrease.
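
The significance level behind the asterisks can be checked with the usual t-test for a Pearson correlation. The sketch below takes r from the plot and n = 93 from the descriptive statistics; because r is rounded to three decimals, the result is approximate, and the variable names are illustrative.

```r
r <- -0.673  # Pearson correlation read off the ggpairs plot (rounded)
n <- 93      # number of cars in Cars93

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
p_value < 0.001  # threshold associated with three asterisks
```

Note that this test relies on approximate normality of both variables.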

install.packages("ggpubr")
library(ggpubr)

ggqqplot(Cars93$MPG.city, title = "Q-Q-Plot: MPG.city")

ggqqplot(Cars93$Horsepower, title = "Q-Q-Plot: Horsepower")

Q-Q Plot for Horsepower: The points deviate from the straight diagonal line, especially at the extremes (both low and high values). This indicates that the Horsepower variable is not normally distributed, with more extreme values than expected under a normal distribution.

Q-Q Plot for MPG.city: Similarly, the points show deviations from the straight diagonal line, particularly at the tails (low and high ends). This suggests that the MPG.city variable is also not normally distributed, with deviations from the expected normality at the extremes.

shapiro.test(Cars93$MPG.city)
## 
##  Shapiro-Wilk normality test
## 
## data:  Cars93$MPG.city
## W = 0.85831, p-value = 5.763e-08
shapiro.test(Cars93$Horsepower)
## 
##  Shapiro-Wilk normality test
## 
## data:  Cars93$Horsepower
## W = 0.93581, p-value = 0.0001916

MPG.city: The variable is not normally distributed, as the p-value (5.763e-08) is less than the significance level (α = 0.05). We reject the null hypothesis of normality.

Horsepower: The variable is not normally distributed, as the p-value (0.0001916) is less than the significance level (α = 0.05). We reject the null hypothesis of normality.

Spearman Rank Correlation

cor.test(Cars93$MPG.city, Cars93$Horsepower, method = "spearman")
## Warning in cor.test.default(Cars93$MPG.city, Cars93$Horsepower, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Cars93$MPG.city and Cars93$Horsepower
## S = 239846, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.7893071
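
The reported S statistic and rho are two views of the same quantity. As a sketch, and assuming the standard identity S = (n^3 - n)(1 - rho) / 6 that links them, rho can be recovered from the printed S; the inputs are copied from the output above, with n = 93 taken from the descriptive statistics.

```r
S <- 239846  # statistic from the cor.test output (printed as a rounded value)
n <- 93      # number of cars in Cars93

# Invert S = (n^3 - n) * (1 - rho) / 6 to recover Spearman's rho
rho <- 1 - 6 * S / (n^3 - n)

round(rho, 4)
```

Because S is printed rounded, the recovered rho matches the reported -0.7893071 only to about five decimal places.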

Interpretation and Conclusion

Null Hypothesis (H₀): There is no correlation between MPG.city and Horsepower (ρ = 0).

Alternative Hypothesis (H₁): There is a correlation between MPG.city and Horsepower (ρ ≠ 0).

Correlation Coefficient (ρ): ρ = -0.789: This indicates a strong negative correlation between MPG.city and Horsepower. Higher horsepower values are associated with lower MPG.city values, meaning cars with more powerful engines are less fuel-efficient in city driving.

p-value: p < 2.2e-16: The p-value is extremely small, much smaller than the significance level (α = 0.05). This allows us to reject the null hypothesis (H₀) and conclude that the observed correlation is statistically significant.

Conclusion: There is a statistically significant strong negative correlation between city fuel efficiency (MPG.city) and engine power (Horsepower). As horsepower increases, fuel efficiency in city driving tends to decrease.

Task 3

Research Question: Is there an association between the origin of the car (Origin) and the type of airbag system (AirBags) included?

Origin: Categorical variable with two levels: “USA” and “non-USA”. This variable represents the geographic origin of the car, which is relevant for market segmentation.

AirBags: Categorical variable with three levels: “None”, “Driver only”, and “Driver & Passenger”. This variable relates to car safety features, which could differ based on the origin.

Null Hypothesis (H₀): There is no association between the origin of the car and the type of airbag system. (The two variables are independent.)

Alternative Hypothesis (H₁): There is an association between the origin of the car and the type of airbag system. (The two variables are not independent.)

contingency_table <- table(Cars93$Origin, Cars93$AirBags)
print(contingency_table)
##          
##           Driver & Passenger Driver only None
##   USA                      9          23   16
##   non-USA                  7          20   18
results <- chisq.test(contingency_table, correct = FALSE)

print(results)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 0.48068, df = 2, p-value = 0.7864

p-value = 0.7864: The p-value is much larger than the significance level (α = 0.05).

Conclusion: We fail to reject the null hypothesis (H₀). The data provide no evidence of an association between the origin of the car (Origin) and the type of airbag system (AirBags); the observed table is consistent with independence.

# Observed (empirical) frequencies
print(contingency_table)
##          
##           Driver & Passenger Driver only None
##   USA                      9          23   16
##   non-USA                  7          20   18
# Expected (theoretical) frequencies
print(results$expected)
##          
##           Driver & Passenger Driver only     None
##   USA               8.258065    22.19355 17.54839
##   non-USA           7.741935    20.80645 16.45161

Assumptions:

Independence of Observations: Each car in the dataset is categorized uniquely by its origin (USA or non-USA) and type of airbag system, ensuring independent observations.

Categorical Variables: Both Origin and AirBags are categorical variables, meeting the requirement for the Chi-Square test.

Expected Frequencies: All expected cell frequencies in the contingency table are ≥ 5, satisfying the assumption of minimum expected counts.

Comparison of Observed vs. Expected Frequencies

The observed frequencies are very close to the expected frequencies in all categories. This suggests that there are no large deviations between the observed and theoretical frequencies, which aligns with the p-value indicating no significant association.
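
To make the comparison concrete, the expected frequencies and the test statistic can be rebuilt from the observed table alone. The sketch below re-enters the counts from the printed contingency table by hand; the object names are illustrative.

```r
# Observed counts copied from the printed contingency table
observed <- matrix(c(9, 23, 16,
                     7, 20, 18),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Origin  = c("USA", "non-USA"),
                                   AirBags = c("Driver & Passenger",
                                               "Driver only", "None")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
round(c(chi_sq = chi_sq, df = df, p = p_value), 4)
```

No observed count differs from its expected count by more than about 1.5 cars, which is why the statistic is so small.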

print(results$stdres)
##          
##           Driver & Passenger Driver only       None
##   USA              0.4079043   0.3356268 -0.6671318
##   non-USA         -0.4079043  -0.3356268  0.6671318

Summary of Residual Interpretation

The standardized residuals for all cells are small (within ±2), indicating that the observed frequencies do not significantly deviate from the expected frequencies. This aligns with the earlier Chi-Square test result (p = 0.7864), supporting the conclusion that Origin and AirBags are statistically independent.
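
The standardized residuals can likewise be derived by hand. This sketch uses the usual formula, (observed - expected) divided by the estimated standard deviation of that difference under independence, with the counts re-entered from the printed table; the object names are illustrative.

```r
# Observed counts copied from the printed contingency table
observed <- matrix(c(9, 23, 16,
                     7, 20, 18),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Origin  = c("USA", "non-USA"),
                                   AirBags = c("Driver & Passenger",
                                               "Driver only", "None")))
n     <- sum(observed)
row_p <- rowSums(observed) / n  # marginal shares of Origin
col_p <- colSums(observed) / n  # marginal shares of AirBags

# Expected counts under independence
expected <- outer(row_p, col_p) * n

# Standardized residual:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_p, 1 - col_p))

round(std_res, 4)
any(abs(std_res) > 2)  # TRUE would flag a cell that deviates notably
```

With only two rows, the residuals in each column are mirror images of each other, as in the printed output.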

library(effectsize)
effectsize::cramers_v(Cars93$Origin, Cars93$AirBags)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Interpretation: The adjusted Cramér’s V is 0.00, a negligible effect size. Together with the Chi-Square test result (p = 0.7864), this gives no evidence of a statistically or practically meaningful association between Origin and AirBags. The confidence interval [0.00, 1.00] is one-sided with the upper bound fixed at 1, so it does not by itself bound the effect from above.
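
An adjusted value of exactly 0.00 can look surprising, so it helps to see where it comes from. The sketch below computes the unadjusted Cramér's V from the chi-squared statistic and then applies a small-sample bias correction of the Bergsma type. That this is the adjustment behind the "(adj.)" label is an assumption; the output itself does not say which correction is used.

```r
chi_sq <- 0.48068  # Pearson chi-squared statistic from the chisq.test output
n <- 93            # number of cars
k <- 2             # levels of Origin
l <- 3             # levels of AirBags

# Unadjusted Cramer's V
phi2  <- chi_sq / n
v_raw <- sqrt(phi2 / min(k - 1, l - 1))

# Assumed bias correction: subtract the value of phi2 expected under
# independence and truncate at zero, with matching corrections to k and l
phi2_adj <- max(0, phi2 - (k - 1) * (l - 1) / (n - 1))
k_adj    <- k - (k - 1)^2 / (n - 1)
l_adj    <- l - (l - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(k_adj - 1, l_adj - 1))

round(c(v_raw = v_raw, v_adj = v_adj), 3)
```

Under this assumption the unadjusted V is about 0.07, and the correction truncates it to zero because the observed chi-squared statistic is smaller than what chance alone would be expected to produce.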

Conclusion

Research Question: Is there an association between the origin of the car (Origin) and the type of airbag system (AirBags) in the Cars93 dataset?

Answer: Based on the Pearson Chi-Square test (X² = 0.481, df = 2, p = 0.7864), there is no statistically significant association between the origin of the car and the type of airbag system. The p-value (0.7864) is much greater than the significance level (α = 0.05), indicating that we fail to reject the null hypothesis of independence.

Additionally, the effect size, measured by Cramér’s V (0.00), confirms that there is no meaningful relationship between these variables. The observed and expected frequencies align closely, with no significant deviations in any specific category (as indicated by the standardized residuals).

Conclusion: In the Cars93 data, there is no evidence that the type of airbag system (AirBags) depends on the car’s origin (Origin).