library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
## here() starts at D:/MS Data Analytics/Classes Completed/STA 6443 Algorithms/Homeworks/Homework 1
library(ggplot2)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
here::here()
## [1] "D:/MS Data Analytics/Classes Completed/STA 6443 Algorithms/Homeworks/Homework 1"
cars=read.csv("Cars.csv")  # read dataset 

Exercise 1


(A) Create a combined mpg variable called MPG_Combo which combines 60% of the MPG_City and 40% of the MPG_Highway. Obtain a box plot for MPG_Combo and comment on what the plot tells us about fuel inefficiencies.

  • The boxplot of the combined MPG shows several outliers beyond both whiskers. Ignoring those outliers, the mean and median are close together (roughly 22 MPG) and the spread of the distribution is not wide. The distribution is slightly right-skewed, and most vehicles get between 20 and 25 MPG combined.


MPG_Combo <- 0.6*cars$MPG_City+0.4*cars$MPG_Highway  # combined mpg variable (60% city, 40% highway)
cars=data.frame(cars, MPG_Combo)   # data frame with MPG_Combo 

AV <- ggplot(data = cars)+
  geom_boxplot(mapping = aes("", MPG_Combo)) +
  xlab("MPG Combo") +
  ylab("MPG")
AV

AH <- ggplot(data = cars)+
  geom_boxplot(mapping = aes(MPG_Combo, "")) +
  xlab("MPG") +
  ylab("MPG Combo")
AH
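The weighting and the outlier reading above can be checked numerically. The following is a minimal sketch on a small made-up data frame (the `toy` values and the `combine_mpg` helper are hypothetical and not part of Cars.csv); it uses base R's `boxplot.stats()`, which reports the points that fall beyond the whiskers.

```r
# Hypothetical toy data; the real values live in Cars.csv
toy <- data.frame(
  MPG_City    = c(17, 24, 20, 60, 18),
  MPG_Highway = c(23, 31, 28, 51, 24)
)

# Same 60/40 weighting as MPG_Combo, written as a small helper (hypothetical name)
combine_mpg <- function(city, highway, w_city = 0.6) {
  w_city * city + (1 - w_city) * highway
}

toy$MPG_Combo <- combine_mpg(toy$MPG_City, toy$MPG_Highway)

# Points beyond 1.5 times the hinge spread are the ones a boxplot draws as outliers
outliers <- boxplot.stats(toy$MPG_Combo)$out
print(toy)
print(outliers)
```

Running the same `boxplot.stats()` call on `cars$MPG_Combo` would list the vehicles drawn as individual points in the plots above.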

(B) Obtain box plots for MPG_Combo by Type and comment on any differences you notice between the different vehicle types combined fuel efficiency.

  • Separating combined MPG by vehicle type reveals several differences. The combined MPG of SUVs and sports cars looks closest to a normal distribution: roughly symmetric with a narrow spread. Sedans show the most variability and the most outliers. Wagons also have a wide spread, and their distribution may not be symmetric. Trucks appear to be the least fuel efficient of the vehicle types; their combined MPG is right-skewed and has at least one outlier.


B <- ggplot(cars, aes(x = Type, y = MPG_Combo, fill = Type))
B <- B + geom_boxplot()
B <- B + labs(title = " MPG Combo by Vehicle Type")
B

# 
# 
# 
# boxplot(cars$MPG_Combo~cars$Type,
#         main = "MPG Combo by Type",
#         xlab = "Type",
#         ylab = "MPG Combo",)
# 
# boxplot(cars$MPG_Combo~cars$Type,
#         xlab = "Cars",ylab = "MPG",
#         main="MPG_Combo by Type",
#         border=(c("black","black","black","black","black")),
#         horizontal = FALSE,col=(c("red","blue","green","yellow","orange")))

(C) Obtain basic descriptive statistics for Horsepower for all vehicles. Comment on any general features and statistics of the data. Use visual and quantitative methods to comment on whether an assumption of Normality would be reasonable for Horsepower variable.

  • The distribution of Horsepower is not normal: the histogram is strongly right-skewed and the points in the Q-Q plot depart from the reference line. The mean and median differ noticeably, which reflects this asymmetry. The sample skewness is close to +1 (0.95), and the Shapiro-Wilk test gives a very small p-value (2.32e-11), so we reject the assumption that Horsepower follows a normal distribution.


qqnorm(cars$Horsepower); qqline(cars$Horsepower, col=2)

CV <- ggplot(data = cars)+
  geom_boxplot(mapping = aes("", Horsepower)) +
  xlab("Combined Horsepower") +
  ylab("Horsepower")
CV

hist(cars$Horsepower,main="Horsepower", xlab="Horsepower");

skewness(cars$Horsepower, na.rm=TRUE) # skewness function in package "e1071"
## [1] 0.9528091
shapiro.test(cars$Horsepower)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Horsepower
## W = 0.94573, p-value = 2.32e-11
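The quantitative part of this check can be collected in one place. The sketch below uses the built-in `mtcars$hp` column purely as a stand-in for `cars$Horsepower` (an assumption, since Cars.csv is not bundled with R) and computes the plain moment coefficient of skewness by hand; `e1071::skewness()` applies a small-sample adjustment by default, so its value can differ slightly.

```r
# Stand-in data: mtcars ships with R; swap in cars$Horsepower for the real analysis
hp <- mtcars$hp

# Basic descriptive statistics
desc <- c(n = length(hp), mean = mean(hp), median = median(hp), sd = sd(hp))

# Moment coefficient of skewness: third central moment over variance^(3/2)
m2 <- mean((hp - mean(hp))^2)
m3 <- mean((hp - mean(hp))^3)
skew <- m3 / m2^(3 / 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(hp)

print(round(desc, 2))
print(skew)
print(sw$p.value)
```

A mean above the median together with a positive skewness points to a right tail, which is the same pattern described for Horsepower above.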

(D) Use visual and quantitative methods to comment on whether an assumption of normality would be reasonable for Horsepower variable by Type, especially for Sports, SUV, and Truck (i.e., check normality of Horsepower from Type of i) Sports, ii) SUV, and iii) Truck).

  • There are 49 sports cars, 60 SUVs, and 24 trucks in the data set. The Shapiro-Wilk test rejects normality of Horsepower for each of the three types at the 0.05 level (p = 0.019 for Sports, 0.044 for SUV, and 0.017 for Truck), and the histograms and Q-Q plots agree. The distribution of Horsepower is skewed and asymmetric for each of the three types, and within each type the mean and median differ noticeably.


S1 <- filter(cars, Type == "Sports")

ggplot(S1, aes(x = Type, y = Horsepower)) +
 geom_boxplot(fill = "green") +
 labs(title = " Horsepower by Sports")

shapiro.test(cars[cars$Type=="Sports", "Horsepower"])
## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Type == "Sports", "Horsepower"]
## W = 0.94276, p-value = 0.01898
S2 <- filter(cars, Type == "SUV")

ggplot(S2, aes(x = Type, y = Horsepower)) +
 geom_boxplot( fill = "violet") +
 labs(title = " Horsepower by SUV")

shapiro.test(cars[cars$Type=="SUV", "Horsepower"])
## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Type == "SUV", "Horsepower"]
## W = 0.95945, p-value = 0.04423
S3 <- filter(cars, Type == "Truck")

ggplot(S3, aes(x = Type, y = Horsepower)) +
 geom_boxplot(fill = "yellow") +
 labs(title = " Horsepower by Truck")

shapiro.test(cars[cars$Type=="Truck", "Horsepower"])
## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Type == "Truck", "Horsepower"]
## W = 0.8951, p-value = 0.01697
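The three per-type tests follow the same pattern, so they can also be run in one pass. This is a sketch only: it splits the built-in `mtcars$hp` by `cyl` as a stand-in for Horsepower by Type, and the `result` table layout is an illustrative choice.

```r
# Stand-in grouping: horsepower by cylinder count from the built-in mtcars data
groups <- split(mtcars$hp, mtcars$cyl)

# One row per group: sample size, Shapiro-Wilk W, and its p-value
result <- data.frame(
  group   = names(groups),
  n       = unname(sapply(groups, length)),
  W       = unname(sapply(groups, function(v) shapiro.test(v)$statistic)),
  p_value = unname(sapply(groups, function(v) shapiro.test(v)$p.value))
)

# Flag groups where normality is rejected at the 0.05 level
result$reject_normality <- result$p_value < 0.05
print(result)
```

With the homework data, `split(cars$Horsepower, cars$Type)` would play the role of `groups`, and the `n` column would reproduce the group counts quoted above.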

Exercise 2

Perform a hypothesis test of whether SUV has a different Horsepower than Truck, and state your conclusions.

(A) Which test should we perform, and why? Justify your answer based on findings on Exercise 1 (d).

  • The Shapiro-Wilk tests in Exercise 1 (D) reject normality of Horsepower for both SUV (p = 0.044) and Truck (p = 0.017). Because the normality assumption of the two-sample t-test is not met, we should perform the nonparametric Wilcoxon rank-sum test.


(B) Specify null and alternative hypotheses.

  • H0: Horsepower for SUVs and Trucks comes from the same distribution.
  • H1: One of the groups (either SUV or Truck) tends to have higher Horsepower than the other.


(C) State the conclusion based on the test result.

  • The Wilcoxon rank-sum test does not reject the null hypothesis at the 0.05 level (p-value = 0.3942). We therefore do not have enough evidence to conclude that the distributions of Horsepower for SUVs and Trucks differ.


cars_filtered <-cars #made a copy of cars to filter out data.

library(dplyr)

cars_filtered <- cars %>% filter(Type %in% c("SUV","Truck")) # keep only the SUV and Truck rows in the new data frame


boxplot(Horsepower ~ Type, data = cars_filtered, main="Horsepower between SUV and Truck",
        xlab="SUV or Truck", ylab="Horsepower")

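qqnorm(cars_filtered$Horsepower[cars_filtered$Type=="SUV"], main="Normal Q-Q Plot: SUV")
qqline(cars_filtered$Horsepower[cars_filtered$Type=="SUV"], col = 2)   # reference line for the SUV points
qqnorm(cars_filtered$Horsepower[cars_filtered$Type=="Truck"], main="Normal Q-Q Plot: Truck")
qqline(cars_filtered$Horsepower[cars_filtered$Type=="Truck"], col = 2) # reference line for the Truck points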

shapiro.test(cars_filtered$Horsepower[cars_filtered$Type=="SUV"])
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_filtered$Horsepower[cars_filtered$Type == "SUV"]
## W = 0.95945, p-value = 0.04423
shapiro.test(cars_filtered$Horsepower[cars_filtered$Type=="Truck"])
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_filtered$Horsepower[cars_filtered$Type == "Truck"]
## W = 0.8951, p-value = 0.01697
# non-parametric wilcox test
wilcox.test(Horsepower ~ Type, data=cars_filtered, exact=FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Horsepower by Type
## W = 806.5, p-value = 0.3942
## alternative hypothesis: true location shift is not equal to 0
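To make the `W` statistic in this output less opaque, here is a small worked sketch on made-up numbers (the vectors `a` and `b` are hypothetical, chosen without ties). R's `W` is the rank sum of the first group minus its smallest possible value, n1(n1 + 1)/2; with the formula interface the first group is the first level of `Type`.

```r
# Hypothetical horsepower values for two small groups (no ties)
a <- c(210, 235, 250, 275, 300)  # plays the role of the first group
b <- c(190, 205, 240, 260)       # plays the role of the second group

# Rank all observations together, then sum the ranks of the first group
r <- rank(c(a, b))
rank_sum_a <- sum(r[seq_along(a)])

# Subtract the minimum possible rank sum for a group of this size
n1 <- length(a)
W_manual <- rank_sum_a - n1 * (n1 + 1) / 2

# Compare with the statistic reported by wilcox.test()
W_r <- unname(wilcox.test(a, b)$statistic)
print(c(manual = W_manual, wilcox.test = W_r))
```

A `W` near n1 * n2 / 2 means the two groups are well interleaved in the ranking, which is the situation a large p-value reflects.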

Exercise 3

Perform a hypothesis test of whether Wind in July has a different speed (mph) than Wind in August.

(A) Which test should we perform, and why? See QQ-plot and perform the Shapiro-Wilk test for normality check.

  • The Wind distributions for July and August are both consistent with normality: the points in the Q-Q plots fall close to a straight line, and the Shapiro-Wilk p-values (0.1564 and 0.937) are greater than the 0.05 significance level. We therefore use a two-sample t-test. The equal-variance tests (F test p = 0.7418, Bartlett p = 0.7417) give no evidence that the two variances differ, so we can perform the pooled two-sample t-test.


(B) Specify null and alternative hypotheses.

  • H0: mean(Wind of July) = mean(Wind of Aug).
  • H1: mean(Wind of July) != mean(Wind of Aug).


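library(dplyr)

# NOTE: R parses `<--` as `<- -`, so the next line negates every column of
# airquality (Wind, Month, ...). That is why the months appear as -7 and -8 from
# here on. The sign flip does not change any of the p-values reported below;
# the line is kept so the code matches the recorded output.
airquality <--airquality
airquality_filtered <- airquality %>% filter(Month %in% c(-7,-8)) # keep only the July and August rows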


boxplot(Wind ~ Month, data = airquality_filtered, main="Wind Speed between August and July",
        xlab="August and July", ylab="Wind Speed")

qqnorm(airquality_filtered$Wind[airquality_filtered$Month==-7]);
qqline(airquality_filtered$Wind[airquality_filtered$Month==-7], col = 2)

qqnorm(airquality_filtered$Wind[airquality_filtered$Month==-8]);
qqline(airquality_filtered$Wind[airquality_filtered$Month==-8], col = 2)
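Because R reads `<--` as `<- -`, the chunk above works on a sign-flipped copy of the data, which is where the negative month codes come from. A minimal sketch of the same filtering step without the negation might look like the following; the names `aq`, `aq_summer`, and `wind_means` are illustrative, and `datasets::airquality` is used so that a modified copy in the workspace is not picked up.

```r
# Read the untouched built-in data set explicitly from the datasets package
aq <- datasets::airquality

# Keep only July (7) and August (8)
aq_summer <- subset(aq, Month %in% c(7, 8))

# Label the months so plots and test output are easier to read
aq_summer$Month <- factor(aq_summer$Month, levels = c(7, 8),
                          labels = c("July", "August"))

# Mean wind speed (mph) per month
wind_means <- tapply(aq_summer$Wind, aq_summer$Month, mean)
print(table(aq_summer$Month))
print(wind_means)
```

With this version, wind speeds and group means are positive, and the boxplot and Q-Q plot calls would subset on `"July"` and `"August"` rather than on -7 and -8.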

(C) State the conclusion based on the test result.

  • The pooled two-sample t-test gives a large p-value (0.8527), so we do not have enough evidence to reject the null hypothesis. The data are consistent with equal mean wind speeds in July and August.


shapiro.test(airquality_filtered$Wind[airquality_filtered$Month==-7])
## 
##  Shapiro-Wilk normality test
## 
## data:  airquality_filtered$Wind[airquality_filtered$Month == -7]
## W = 0.95003, p-value = 0.1564
shapiro.test(airquality_filtered$Wind[airquality_filtered$Month==-8])
## 
##  Shapiro-Wilk normality test
## 
## data:  airquality_filtered$Wind[airquality_filtered$Month == -8]
## W = 0.98533, p-value = 0.937
# Equal variance test to decide between the pooled t-test and the Satterthwaite (Welch) t-test
var.test(Wind ~ Month, airquality_filtered, 
         alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  Wind by Month
## F = 1.129, num df = 30, denom df = 30, p-value = 0.7418
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5443957 2.3415780
## sample estimates:
## ratio of variances 
##           1.129046
bartlett.test(Wind ~ Month, airquality_filtered)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Wind by Month
## Bartlett's K-squared = 0.10861, df = 1, p-value = 0.7417
# parametric t-test
t.test(Wind ~ Month, airquality_filtered, 
         alternative = "two.sided",var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Wind by Month
## t = 0.1865, df = 60, p-value = 0.8527
## alternative hypothesis: true difference in means between group -8 and group -7 is not equal to 0
## 95 percent confidence interval:
##  -1.443108  1.739883
## sample estimates:
## mean in group -8 mean in group -7 
##        -8.793548        -8.941935
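The pooled t statistic reported here can be reproduced by hand, which shows what `var.equal=TRUE` does. The sketch below is an illustration on the un-negated `datasets::airquality` values (variable names are illustrative): it pools the two sample variances, forms the t statistic with n1 + n2 - 2 degrees of freedom, and compares the result with `t.test()`.

```r
# Untouched built-in data, split into the two months
aq <- datasets::airquality
july   <- aq$Wind[aq$Month == 7]
august <- aq$Wind[aq$Month == 8]

n1 <- length(july)
n2 <- length(august)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(july) + (n2 - 1) * var(august)) / (n1 + n2 - 2)

# t statistic for the difference in means (July minus August)
t_manual <- (mean(july) - mean(august)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# Reference result from the built-in pooled test
fit <- t.test(july, august, var.equal = TRUE)
print(c(t_manual = t_manual, p_manual = p_manual))
print(c(t_builtin = unname(fit$statistic), p_builtin = fit$p.value))
```

Dropping `var.equal = TRUE` would switch `t.test()` to the Welch (Satterthwaite) version, which does not pool the variances and adjusts the degrees of freedom instead.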