options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("psych")
##
## The downloaded binary packages are in
## /var/folders/x6/v0zd1d6546jfd0_m0dvbb_kc0000gn/T//RtmpdBx9Fo/downloaded_packages
library(psych)
data()
data(package = .packages(all.available=TRUE))
install.packages("MASS")
Research Question for Task 1:
“Is there a significant difference in city fuel efficiency (MPG.city) between American and non-American cars in the Cars93 dataset?”
Car Data 1993
library(MASS)
data(Cars93)
head(Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1 Acura Integra Small 12.9 15.9 18.8 25 31
## 2 Acura Legend Midsize 29.2 33.9 38.7 18 25
## 3 Audi 90 Compact 25.9 29.1 32.3 20 26
## 4 Audi 100 Midsize 30.8 37.7 44.6 19 26
## 5 BMW 535i Midsize 23.7 30.0 36.2 22 30
## 6 Buick Century Midsize 14.2 15.7 17.3 22 31
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1 None Front 4 1.8 140 6300
## 2 Driver & Passenger Front 6 3.2 200 5500
## 3 Driver only Front 6 2.8 172 5500
## 4 Driver & Passenger Front 6 2.8 172 5500
## 5 Driver only Rear 4 3.5 208 5700
## 6 Driver only Front 4 2.2 110 5200
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1 2890 Yes 13.2 5 177 102
## 2 2335 Yes 18.0 5 195 115
## 3 2280 Yes 16.9 5 180 102
## 4 2535 Yes 21.1 6 193 106
## 5 2545 Yes 21.1 4 186 109
## 6 2565 No 16.4 6 189 105
## Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
## 1 68 37 26.5 11 2705 non-USA Acura Integra
## 2 71 38 30.0 15 3560 non-USA Acura Legend
## 3 67 37 28.0 14 3375 non-USA Audi 90
## 4 70 37 31.0 17 3405 non-USA Audi 100
## 5 69 39 27.0 13 3640 non-USA BMW 535i
## 6 69 41 28.0 16 2880 USA Buick Century
?Cars93
Description: The Cars93 data frame has 93 rows and 27 columns.
This data frame contains the following columns:
Manufacturer: Manufacturer.
Model: Model.
Type: a factor with levels “Small”, “Sporty”, “Compact”, “Midsize”, “Large” and “Van”.
Min.Price: Minimum Price (in $1,000): price for a basic version.
Price: Midrange Price (in $1,000): average of Min.Price and Max.Price.
Max.Price: Maximum Price (in $1,000): price for “a premium version”.
MPG.city: City MPG (miles per US gallon by EPA rating).
MPG.highway: Highway MPG.
AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.
DriveTrain: Drive train type: rear wheel, front wheel or 4WD; (factor).
Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).
EngineSize: Engine size (litres).
Horsepower: Horsepower (maximum).
RPM: RPM (revs per minute at maximum horsepower).
Rev.per.mile: Engine revolutions per mile (in highest gear).
Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).
Fuel.tank.capacity: Fuel tank capacity (US gallons).
Passengers: Passenger capacity (persons)
Length: Length (inches).
Wheelbase: Wheelbase (inches).
Width: Width (inches).
Turn.circle: U-turn space (feet).
Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).
Luggage.room: Luggage capacity (cubic feet) (missing for vans).
Weight: Weight (pounds).
Origin: Of non-USA or USA company origins? (factor).
Make: Combination of Manufacturer and Model (character).
H0: American and non-American cars have the same city fuel efficiency (MPG.city).
H1: American and non-American cars differ in city fuel efficiency (MPG.city).
cars_subset <- subset(Cars93, select = c(Origin, MPG.city))
head(cars_subset)
## Origin MPG.city
## 1 non-USA 25
## 2 non-USA 18
## 3 non-USA 20
## 4 non-USA 19
## 5 non-USA 22
## 6 USA 22
result <- describeBy(Cars93$MPG.city, group = Cars93$Origin, mat = TRUE)
result
## item group1 vars n mean sd median trimmed mad min max range
## X11 1 USA 1 48 20.95833 3.994455 20 20.60000 4.4478 15 31 16
## X12 2 non-USA 1 45 23.86667 6.672876 22 22.86486 5.9304 17 46 29
## skew kurtosis se
## X11 0.7968452 0.0939812 0.5765500
## X12 1.4299847 1.8877575 0.9947336
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(Cars93, aes(x = MPG.city)) +
  geom_histogram(binwidth = 5, colour = "salmon", fill = "salmon") +
  facet_wrap(~Origin, ncol = 1) +
  ylab("Frequency")
t.test(cars_subset$MPG.city ~ cars_subset$Origin,
       var.equal = FALSE,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: cars_subset$MPG.city by cars_subset$Origin
## t = -2.5296, df = 71.024, p-value = 0.01364
## alternative hypothesis: true difference in means between group USA and group non-USA is not equal to 0
## 95 percent confidence interval:
## -5.2008385 -0.6158282
## sample estimates:
## mean in group USA mean in group non-USA
## 20.95833 23.86667
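The Welch statistic, its degrees of freedom and the confidence interval can be reproduced from the group summaries alone. The following is a minimal base R sketch that uses the n, mean and sd values from the describeBy output; the variable names are illustrative and not part of the original analysis.

```r
# Summary statistics copied from the describeBy() output above
n_usa <- 48; mean_usa <- 20.95833; sd_usa <- 3.994455
n_non <- 45; mean_non <- 23.86667; sd_non <- 6.672876

# Squared standard errors of the two group means
v_usa <- sd_usa^2 / n_usa
v_non <- sd_non^2 / n_non

# Welch t statistic (USA minus non-USA, the order used by t.test)
se_diff <- sqrt(v_usa + v_non)
t_stat <- (mean_usa - mean_non) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_usa + v_non)^2 /
  (v_usa^2 / (n_usa - 1) + v_non^2 / (n_non - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_usa - mean_non) + c(-1, 1) * qt(0.975, df_welch) * se_diff

round(c(t = t_stat, df = df_welch, p = p_value), 4)
round(ci, 4)
```

Up to rounding of the inputs, these values should agree with the t.test output (t = -2.5296, df = 71.024).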
wilcox.test(cars_subset$MPG.city ~ cars_subset$Origin,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: cars_subset$MPG.city by cars_subset$Origin
## W = 809.5, p-value = 0.03693
## alternative hypothesis: true location shift is not equal to 0
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Cars93 %>%
  group_by(Origin) %>%
  shapiro_test(MPG.city)
## # A tibble: 2 × 4
## Origin variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 USA MPG.city 0.929 0.00625
## 2 non-USA MPG.city 0.850 0.0000359
H0: The distribution of MPG.city for US cars is normal. With p = 0.00625 < 0.05 we reject H0; normality cannot be assumed.
H0: The distribution of MPG.city for non-US cars is normal. With p = 0.0000359 < 0.05 we reject H0; normality cannot be assumed.
Normality is rejected in both groups, so a non-parametric test is preferred over a parametric one.
Effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(cars_subset$MPG.city ~ cars_subset$Origin,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.25 | [-0.46, -0.02]
The negative rank-biserial correlation (r = -0.25) indicates that non-US cars tend to have higher MPG.city values (better fuel efficiency) than US cars; the sign reflects the group order (USA first). The 95% confidence interval ([-0.46, -0.02]) excludes 0, which is consistent with the significant Wilcoxon test. By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), |r| = 0.25 is a small-to-medium effect.
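The rank-biserial correlation can be recovered directly from the Wilcoxon statistic and the group sizes. The sketch below uses W from the wilcox.test output and the group sizes from the describeBy output; the variable names are illustrative.

```r
# Values copied from the wilcox.test() and describeBy() output above
W <- 809.5
n_usa <- 48
n_non <- 45

# W counts the (USA, non-USA) pairs in which the USA car has the higher
# MPG.city, with tied pairs counted as one half
n_pairs <- n_usa * n_non
prop_usa_higher <- W / n_pairs

# Rank-biserial correlation: share of pairs favouring USA minus the
# share favouring non-USA
r_rb <- 2 * prop_usa_higher - 1
round(r_rb, 2)
```

A value of -0.25 means that, over all USA/non-USA pairs, pairs in which the non-USA car has the higher MPG.city outnumber the opposite case by 25 percentage points.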
effectsize::cohens_d(cars_subset$MPG.city ~ cars_subset$Origin,
                     pooled_sd = FALSE)
## Cohen's d | 95% CI
## --------------------------
## -0.53 | [-0.95, -0.11]
##
## - Estimated using un-pooled SD.
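Cohen’s d = -0.53 indicates a medium effect by Cohen's conventional benchmarks (0.5 = medium). The negative sign reflects the group order (USA minus non-USA): non-US cars tend to have higher MPG.city values (better fuel efficiency) than US cars. Because d is based on means and standard deviations, it is reported here only as a supplement to the rank-based effect size.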
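The un-pooled Cohen's d can be checked by hand from the group summaries. This sketch assumes that pooled_sd = FALSE standardises the mean difference by the square root of the average of the two group variances; that assumption reproduces the reported value.

```r
# Group means and standard deviations from the describeBy() output above
mean_usa <- 20.95833; sd_usa <- 3.994455
mean_non <- 23.86667; sd_non <- 6.672876

# Un-pooled standardiser (assumed form): root of the mean of the two variances
sd_unpooled <- sqrt((sd_usa^2 + sd_non^2) / 2)

# Standardised mean difference, USA minus non-USA
d <- (mean_usa - mean_non) / sd_unpooled
round(d, 2)
```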
Since normality is rejected in both groups, the non-parametric Wilcoxon rank sum test is the appropriate choice.
Hypotheses:
Null Hypothesis (H₀): The distribution of MPG.city is the same for American and non-American cars (true location shift is equal to 0).
Alternative Hypothesis (H₁): The distributions differ in location (true location shift is not equal to 0).
p-value: p = 0.03693. Since the p-value is below the significance level α = 0.05, we reject the null hypothesis. There is a statistically significant difference in MPG.city between American and non-American cars.
Conclusion: American and non-American cars differ significantly in city fuel efficiency (MPG.city). The two-sided test itself does not state the direction, but the descriptive statistics (median 22 vs. 20, mean 23.87 vs. 20.96 for non-USA vs. USA) and the effect sizes show that non-American cars tend to have better city fuel efficiency.
Research question: “Is there a significant correlation between city fuel efficiency (MPG.city) and engine power (Horsepower) in the Cars93 dataset?”
Numerical Variables: MPG.city (Miles per gallon in city driving): Measures fuel efficiency. Horsepower: Measures the engine’s power output.
Correlation Coefficient: Both variables are continuous, so Pearson’s correlation coefficient (r) is the first candidate for measuring the strength and direction of a linear association. Its significance test assumes approximately normal data; if that assumption is violated, Spearman’s rank correlation (ρ) is used instead.
Statistical Test: A hypothesis test for the correlation coefficient will be conducted to determine whether the observed correlation is statistically significant. Hypotheses:
Null Hypothesis (H₀): There is no correlation between MPG.city and Horsepower (ρ = 0).
Alternative Hypothesis (H₁): There is a correlation between MPG.city and Horsepower (ρ ≠ 0).
cars_subset <- subset(Cars93, select = c(Horsepower, MPG.city))
head(cars_subset)
## Horsepower MPG.city
## 1 140 25
## 2 200 18
## 3 172 20
## 4 172 19
## 5 208 22
## 6 110 22
describe(Cars93[, c("MPG.city", "Horsepower")])
## vars n mean sd median trimmed mad min max range skew
## MPG.city 1 93 22.37 5.62 21 21.61 4.45 15 46 31 1.65
## Horsepower 2 93 143.83 52.37 140 138.95 44.48 55 300 245 0.92
## kurtosis se
## MPG.city 3.58 0.58
## Horsepower 0.90 5.43
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(~ MPG.city + Horsepower,
                  data = Cars93,
                  smooth = FALSE,
                  col = c("blue"))
install.packages("GGally")
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(cars_subset)
The correlation coefficient shown in the ggpairs plot (Corr = -0.673, Pearson's r by default) indicates a strong negative linear correlation between MPG.city and Horsepower. The three asterisks (***) signify that this correlation is highly statistically significant (p < 0.001). As horsepower increases, city fuel efficiency (MPG.city) tends to decrease.
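The significance stars can be checked with the usual t test for a Pearson correlation. The sketch below starts from the rounded coefficient in the plot and the sample size of 93 cars, so the result is approximate; the variable names are illustrative.

```r
r <- -0.673  # Pearson correlation as displayed by ggpairs (rounded)
n <- 93      # number of cars in Cars93

# Under H0 (rho = 0), t follows a t distribution with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
```

The resulting p-value is far below 0.001, in line with the three asterisks. Note that this test relies on approximate normality of the data.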
install.packages("ggpubr")
library(ggpubr)
ggqqplot(Cars93$MPG.city, title = "Q-Q-Plot: MPG.city")
ggqqplot(Cars93$Horsepower, title = "Q-Q-Plot: Horsepower")
Q-Q Plot for Horsepower: The points deviate from the straight diagonal line, especially at the extremes (both low and high values). This indicates that the Horsepower variable is not normally distributed, with more extreme values than expected under a normal distribution.
Q-Q Plot for MPG.city: Similarly, the points show deviations from the straight diagonal line, particularly at the tails (low and high ends). This suggests that the MPG.city variable is also not normally distributed, with deviations from the expected normality at the extremes.
shapiro.test(Cars93$MPG.city)
##
## Shapiro-Wilk normality test
##
## data: Cars93$MPG.city
## W = 0.85831, p-value = 5.763e-08
shapiro.test(Cars93$Horsepower)
##
## Shapiro-Wilk normality test
##
## data: Cars93$Horsepower
## W = 0.93581, p-value = 0.0001916
MPG.city: The variable is not normally distributed, as the p-value (5.763e-08) is less than the significance level (α = 0.05). We reject the null hypothesis of normality.
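Horsepower: The variable is not normally distributed, as the p-value (0.0001916) is less than the significance level (α = 0.05). We reject the null hypothesis of normality.
Because both variables deviate from normality, the correlation is tested with Spearman’s rank correlation rather than Pearson’s r.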
cor.test(Cars93$MPG.city, Cars93$Horsepower, method = "spearman")
## Warning in cor.test.default(Cars93$MPG.city, Cars93$Horsepower, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Cars93$MPG.city and Cars93$Horsepower
## S = 239846, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.7893071
Null Hypothesis (H₀): There is no correlation between MPG.city and Horsepower (ρ = 0).
Alternative Hypothesis (H₁): There is a correlation between MPG.city and Horsepower (ρ ≠ 0).
Correlation Coefficient (ρ): ρ = -0.789. This indicates a strong negative monotonic association between MPG.city and Horsepower. Higher horsepower values are associated with lower MPG.city values, meaning cars with more powerful engines tend to be less fuel-efficient in city driving.
p-value: p < 2.2e-16. The p-value is far below the significance level (α = 0.05). We therefore reject the null hypothesis (H₀) and conclude that the observed correlation is statistically significant.
Conclusion: There is a statistically significant strong negative correlation between city fuel efficiency (MPG.city) and engine power (Horsepower). As horsepower increases, fuel efficiency in city driving tends to decrease.
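Spearman's ρ is Pearson's r computed on the ranks of the data, and the S statistic printed by cor.test is a simple transformation of it. The sketch below illustrates the first point with small hypothetical vectors (not Cars93 data) and then recovers ρ from the reported S and n = 93.

```r
# Hypothetical toy vectors with ties, for illustration only (not Cars93 data)
hp  <- c(90, 110, 110, 140, 200, 300)
mpg <- c(31, 25, 25, 22, 18, 17)

# Spearman's rho equals Pearson's r on the ranks (ties receive mid-ranks)
rho_ranks <- cor(rank(hp), rank(mpg))
rho_spearman <- cor(hp, mpg, method = "spearman")
all.equal(rho_ranks, rho_spearman)

# Recover rho from the S statistic in the cor.test() output above,
# using S = (n^3 - n) * (1 - rho) / 6
S <- 239846
n <- 93
rho_from_S <- 1 - 6 * S / (n^3 - n)
round(rho_from_S, 3)
```

Because the test works on ranks, it measures monotonic rather than strictly linear association and is not affected by the skewness seen in the Q-Q plots.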
Research Question: Is there an association between the origin of the car (Origin) and the type of airbag system (AirBags) included?
Origin: Categorical variable with two levels: “USA” and “non-USA”. It indicates whether the manufacturer is a USA or non-USA company, which is relevant for market segmentation.
AirBags: Categorical variable with three levels: “None”, “Driver only”, and “Driver & Passenger”. This variable relates to car safety features, which could differ based on the origin.
Null Hypothesis (H₀): There is no association between the origin of the car and the type of airbag system. (The two variables are independent.)
Alternative Hypothesis (H₁): There is an association between the origin of the car and the type of airbag system. (The two variables are not independent.)
contingency_table <- table(Cars93$Origin, Cars93$AirBags)
print(contingency_table)
##
## Driver & Passenger Driver only None
## USA 9 23 16
## non-USA 7 20 18
results <- chisq.test(contingency_table, correct = FALSE)
print(results)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.48068, df = 2, p-value = 0.7864
p-value = 0.7864: The p-value is much larger than the typical significance level (α = 0.05).
Conclusion: We fail to reject the null hypothesis (H₀). The data show no significant association between the origin of the car (Origin) and the type of airbag system (AirBags).
# Empirical (observed) frequencies
print(contingency_table)
##
## Driver & Passenger Driver only None
## USA 9 23 16
## non-USA 7 20 18
# Theoretical (expected) frequencies
print(results$expected)
##
## Driver & Passenger Driver only None
## USA 8.258065 22.19355 17.54839
## non-USA 7.741935 20.80645 16.45161
Assumptions:
Independence of Observations: Each car in the dataset is categorized uniquely by its origin (USA or non-USA) and type of airbag system, ensuring independent observations.
Categorical Variables: Both Origin and AirBags are categorical variables, meeting the requirement for the Chi-Square test.
Expected Frequencies: All expected cell frequencies in the contingency table are ≥ 5, satisfying the assumption of minimum expected counts.
Comparison of Observed vs. Expected Frequencies
The observed frequencies are very close to the expected frequencies in all categories. This suggests that there are no large deviations between the observed and theoretical frequencies, which aligns with the p-value indicating no significant association.
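The expected frequencies and the test statistic can be reproduced by hand from the observed table. A minimal base R sketch follows; the counts are copied from the contingency table above and the variable names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(9, 23, 16,
                     7, 20, 18),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Origin = c("USA", "non-USA"),
                                   AirBags = c("Driver & Passenger",
                                               "Driver only", "None")))

# Expected count under independence: row total * column total / grand total
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
round(c(X2 = chi_sq, df = df, p = p_value), 4)
```

The same expected table also confirms the minimum-count assumption, since its smallest entry is above 5.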
print(results$stdres)
##
## Driver & Passenger Driver only None
## USA 0.4079043 0.3356268 -0.6671318
## non-USA -0.4079043 -0.3356268 0.6671318
Summary of Residual Interpretation
The standardized residuals for all cells are small (within ±2), indicating that the observed frequencies do not significantly deviate from the expected frequencies. This aligns with the earlier Chi-Square test result (p = 0.7864), supporting the conclusion that Origin and AirBags are statistically independent.
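For reference, a standardized residual divides the raw difference between observed and expected counts by its estimated standard deviation under independence. The sketch below recomputes the values from the observed counts in base R; the variable names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(9, 23, 16,
                     7, 20, 18),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Origin = c("USA", "non-USA"),
                                   AirBags = c("Driver & Passenger",
                                               "Driver only", "None")))

n <- sum(observed)
row_prop <- rowSums(observed) / n   # share of each origin
col_prop <- colSums(observed) / n   # share of each airbag type
expected <- n * outer(row_prop, col_prop)

# Standardized residual:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

round(std_res, 3)
```

In a table with two rows, the residuals in each column are equal in size and opposite in sign, which is why the two rows of the output mirror each other.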
library(effectsize)
effectsize::cramers_v(Cars93$Origin, Cars93$AirBags)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
Interpretation: The bias-adjusted Cramér’s V is 0.00, a negligible effect size. The one-sided confidence interval [0.00, 1.00] has its upper bound fixed at 1 and is therefore uninformative. Combined with the Chi-Square test result (p = 0.7864), there is no evidence of a statistically or practically meaningful association between Origin and AirBags.
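The adjusted value of exactly 0.00 follows from the small-sample bias correction. The sketch below computes the unadjusted Cramér's V from the chi-squared statistic and then applies a Bergsma-style correction; that this matches the adjustment used internally by effectsize is an assumption, although it does reproduce the reported value.

```r
chi_sq <- 0.48068  # X-squared from the chisq.test() output above
n <- 93            # number of cars
k <- 2             # levels of Origin
l <- 3             # levels of AirBags

# Unadjusted Cramer's V
phi2 <- chi_sq / n
v_raw <- sqrt(phi2 / min(k - 1, l - 1))

# Bias-corrected version (assumed form): subtract the expected value of
# phi2 under independence and truncate at zero
phi2_adj <- max(0, phi2 - (k - 1) * (l - 1) / (n - 1))
k_adj <- k - (k - 1)^2 / (n - 1)
l_adj <- l - (l - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(k_adj - 1, l_adj - 1))

round(c(V = v_raw, V_adjusted = v_adj), 3)
```

The unadjusted V is about 0.07, already negligible; the correction truncates it to 0 because the observed chi-squared statistic is smaller than its expected value under independence.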
Research Question: Is there an association between the origin of the car (Origin) and the type of airbag system (AirBags) in the Cars93 dataset?
Answer: Based on the Pearson Chi-Square test (X² = 0.481, df = 2, p = 0.7864), there is no statistically significant association between the origin of the car and the type of airbag system. The p-value is much greater than the significance level (α = 0.05), so we fail to reject the null hypothesis of independence.
Additionally, the effect size, measured by the adjusted Cramér’s V (0.00), is negligible. The observed and expected frequencies align closely, and no cell shows a notable deviation (all standardized residuals are within ±2).
Conclusion: The data provide no evidence that the type of airbag system (AirBags) depends on the origin of the car (Origin).