HW_1_Lacy

EXERCISE 1

Read Dataset:

cars = read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/Cars.csv", header = TRUE)

Ex 1. Part (a)

Combined MPG Variable & New Data Frame

Boxplot:

boxplot(cars$MPG_Combo, ylab= "MPG")

This boxplot shows that most of the data lie between about 15 and 30 MPG, with the middle 50% of cars between roughly 20 and 25 MPG. Combined fuel efficiency is therefore fairly low for a typical car, and most of the outliers lie above the upper whisker at about 30 MPG.
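
The values read off a boxplot can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the built-in mtcars$mpg is used as a stand-in so the snippet runs without Cars.csv, and cars$MPG_Combo would take its place in the homework.

```r
# Stand-in data: mtcars ships with R. Swap in cars$MPG_Combo for the real analysis.
x <- mtcars$mpg

# boxplot.stats() returns the numbers that boxplot() draws:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out   = points plotted individually as outliers
bs <- boxplot.stats(x)
bs$stats
bs$out

# Share of observations inside the box (between the hinges); about 50% by construction.
inside_box <- mean(x >= bs$stats[2] & x <= bs$stats[4])
inside_box
```

Comparing bs$stats with the plot shows where the box and whiskers end, and bs$out lists exactly which observations are drawn as outliers.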

Ex 1. Part (b)

Boxplot:

boxplot(MPG_Combo ~ Type, data = cars, xlab = "Car Type", ylab = "MPG")

Based on the boxplot, Sedans have the widest range in fuel efficiency, with a minimum of 15 and a maximum of 35 MPG and outliers reaching above 40 MPG. Wagons have their middle 50% of data points between 23 and 28 MPG. SUVs have the lowest minimum at 11 MPG and the lowest maximum at 25 MPG. Trucks are similar to SUVs, but with a much higher maximum MPG (16).

Ex 1. Part (c)

Descriptive Statistics:

summary(cars$Horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   100.0   165.0   210.0   216.8   255.0   500.0

mean(cars$Horsepower)

## [1] 216.76

var(cars$Horsepower)

## [1] 5085.952

range(cars$Horsepower)

## [1] 100 500

Qualitative Normality Check of Horsepower:

par(mfrow = c(2, 2))
boxplot(cars$Horsepower, xlab = "Horsepower", main = "Normality of Horsepower")
hist(cars$Horsepower, xlab = "Horsepower", main = "Normality of Horsepower")
qqnorm(cars$Horsepower); qqline(cars$Horsepower, col = 2)
plot(density(cars$Horsepower), xlab = "Horsepower", main = "Normality of Horsepower")

According to the visual checks, the variable appears roughly normal. However, the large number of points that deviate from the reference line in the QQ-plot suggests that it may not follow a normal distribution.

Quantitative/Formal Normality Check of Horsepower:

shapiro.test(cars$Horsepower)

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Horsepower
## W = 0.94573, p-value = 2.32e-11

The Shapiro-Wilk test gives a p-value (2.32e-11) that is far below our significance level of 0.05, so we reject the null hypothesis of normality. Consistent with the QQ-plot, this is strong evidence that the Horsepower variable does not follow a normal distribution.

Ex 1. Part (d)

Qualitative Normality Check of Horsepower by Type:

Boxplot:

boxplot(Horsepower ~ Type, data = cars, main = "Horsepower by Type",
        xlab = "Car Type", ylab = "Horsepower")

Histograms:

par(mfrow = c(1, 3))
hist(cars$Horsepower[cars$Type=="Sports"], xlab = "Horsepower", main = "Sports Horsepower")
hist(cars$Horsepower[cars$Type=="SUV"], xlab = "Horsepower", main = "SUV Horsepower")
hist(cars$Horsepower[cars$Type=="Truck"], xlab = "Horsepower", main = "Truck Horsepower")

QQ-Plots:

par(mfrow = c(1, 3))
qqnorm(cars$Horsepower[cars$Type=="Sports"], main = "Sports Horsepower"); qqline(cars$Horsepower[cars$Type=="Sports"], col = 2)
qqnorm(cars$Horsepower[cars$Type=="SUV"], main = "SUV Horsepower"); qqline(cars$Horsepower[cars$Type=="SUV"], col = 2)
qqnorm(cars$Horsepower[cars$Type=="Truck"], main = "Truck Horsepower"); qqline(cars$Horsepower[cars$Type=="Truck"], col = 2)

Density:

par(mfrow = c(1, 3))
plot(density(cars$Horsepower[cars$Type=="Sports"]), main = "Sports Horsepower")
plot(density(cars$Horsepower[cars$Type=="SUV"]), main = "SUV Horsepower")
plot(density(cars$Horsepower[cars$Type=="Truck"]), main = "Truck Horsepower")

Based on the qualitative checks, none of the three types of cars appears to follow a normal distribution. None of the plots shows a clear bell curve, and there are several outliers in each Type.

Quantitative Normality Check of Horsepower by Type:

shapiro.test(cars$Horsepower[cars$Type=="Sports"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Horsepower[cars$Type == "Sports"]
## W = 0.94276, p-value = 0.01898

shapiro.test(cars$Horsepower[cars$Type=="SUV"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Horsepower[cars$Type == "SUV"]
## W = 0.95945, p-value = 0.04423

shapiro.test(cars$Horsepower[cars$Type=="Truck"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Horsepower[cars$Type == "Truck"]
## W = 0.8951, p-value = 0.01697

According to the formal tests, all three Types of cars have a p-value below our significance level of 0.05 (Sports: 0.01898, SUV: 0.04423, Truck: 0.01697), so we reject normality for each of the three.
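
The three separate shapiro.test() calls can also be run in one pass per group. The following is a sketch under stated assumptions: mtcars (built in) stands in for the cars data, with hp playing the role of Horsepower and cyl the role of Type, so the p-values it prints are not the homework's results.

```r
# Stand-in data: mtcars is built in. For the homework, use cars$Horsepower
# and cars$Type restricted to Sports, SUV, and Truck instead.
hp    <- mtcars$hp
group <- factor(mtcars$cyl)

# Run shapiro.test() once per group and keep only the p-values.
p_values <- tapply(hp, group, function(v) shapiro.test(v)$p.value)

# Tabulate the decision at the 5% significance level.
alpha <- 0.05
data.frame(
  group            = names(p_values),
  p_value          = round(as.numeric(p_values), 4),
  reject_normality = as.numeric(p_values) < alpha
)
```

Collecting the p-values in one table makes it easier to see which groups fall below the significance level and by how much.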

EXERCISE 2

Ex 2. Part (a)

Based on the findings from Exercise 1 Part (d), we should perform the non-parametric Wilcoxon rank sum test (wilcox.test in R). The t-test assumes that both groups are normally distributed, and neither the SUV nor the Truck car Type follows a normal distribution, although SUV came close (p-value = 0.04423).

Ex 2. Part (b)

Null Hypothesis (Ho): SUVs and Trucks have the same Horsepower distribution (same median)

Alternative Hypothesis (Ha): SUVs and Trucks have different Horsepower distributions (different medians)

Ex 2. Part (c)

Creation of a Subset of Car Types for SUV and Truck:
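
The code that builds subCars does not appear in this report. The following is a hypothetical sketch that mirrors the subset() call used for subAir in Exercise 3. The toy data frame and its values are invented for illustration only, so that the snippet runs without Cars.csv.

```r
# Toy stand-in for the cars data; these rows are made up for illustration.
toy_cars <- data.frame(
  Type       = c("SUV", "Truck", "Sedan", "SUV", "Sports", "Truck"),
  Horsepower = c(240, 210, 150, 275, 300, 190)
)

# Keep only the two car types being compared.
subCars <- subset(toy_cars, Type %in% c("SUV", "Truck"))

# Store Type as a factor with only the levels still present, so that
# the formula Horsepower ~ Type sees exactly two groups.
subCars$Type <- droplevels(factor(subCars$Type))

table(subCars$Type)
```

wilcox.test() with a formula requires a grouping variable with exactly two levels, which is why unused levels are dropped after subsetting.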

Wilcox Test:

wilcox.test(Horsepower ~ Type, data = subCars, exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Horsepower by Type
## W = 806.5, p-value = 0.3942
## alternative hypothesis: true location shift is not equal to 0

Based on the results of the Wilcoxon test, the p-value (0.3942) is greater than our significance level of 0.05. Therefore, we fail to reject the Null Hypothesis (Ho): there is no significant evidence that SUVs and Trucks differ in Horsepower distribution/median.
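
The decision rule can be written out explicitly. This is a small sketch that uses the p-value reported in the output above; alpha = 0.05 is the significance level used throughout this report.

```r
alpha   <- 0.05
p_value <- 0.3942  # p-value reported by wilcox.test() in the output above

# Reject H0 only when the p-value falls below the significance level.
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
decision
```

Since 0.3942 is larger than 0.05, the rule gives "fail to reject H0".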

EXERCISE 3

Ex 3. Part (a)

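We are comparing two samples in the test (July and August wind speeds). Neither sample shows evidence against normality (Shapiro-Wilk p-values of 0.1564 and 0.937), and the F test gives no evidence of unequal variances (p-value of 0.7418). Therefore, we should perform the POOLED T-TEST. The descriptive statistics, qualitative and quantitative normality checks, and variance equality test are shown below.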

Descriptive Statistics:

summary(airquality$Wind)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.700   7.400   9.700   9.958  11.500  20.700

mean(airquality$Wind)

## [1] 9.957516

var(airquality$Wind)

## [1] 12.41154

range(airquality$Wind)

## [1]  1.7 20.7

Qualitative Normality Check:

par(mfrow = c(1, 2))
plot(density(airquality$Wind[airquality$Month==7]), main = "July Windspeed")
plot(density(airquality$Wind[airquality$Month==8]), main = "August Windspeed")

par(mfrow = c(1, 2))
qqnorm(airquality[airquality$Month==7, "Wind"], main = "July Windspeed"); qqline(airquality[airquality$Month==7, "Wind"], col = 2)
qqnorm(airquality[airquality$Month==8, "Wind"], main = "August Windspeed"); qqline(airquality[airquality$Month==8, "Wind"], col = 2)

Quantitative Normality Check:

shapiro.test(airquality[airquality$Month==7, "Wind"])

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 7, "Wind"]
## W = 0.95003, p-value = 0.1564

shapiro.test(airquality[airquality$Month==8, "Wind"])

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 8, "Wind"]
## W = 0.98533, p-value = 0.937

Variance Equality Test (the data must be subset to July and August first):

subAir = subset(airquality, Month %in% c(7,8))

var.test(Wind ~ Month, subAir, alternative = "two.sided")

## 
##  F test to compare two variances
## 
## data:  Wind by Month
## F = 0.8857, num df = 30, denom df = 30, p-value = 0.7418
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4270624 1.8368992
## sample estimates:
## ratio of variances 
##          0.8857035
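
The normality and variance checks above can be combined into a single test-selection step. This is a sketch, not the only reasonable decision procedure: the names july, august, and chosen_test are introduced here for illustration, and airquality ships with R, so the snippet runs as-is.

```r
# airquality is a built-in dataset, so no file is needed.
july   <- airquality$Wind[airquality$Month == 7]
august <- airquality$Wind[airquality$Month == 8]
alpha  <- 0.05

# Normality within each month (Shapiro-Wilk): TRUE means normality is not rejected.
normal_july   <- shapiro.test(july)$p.value   >= alpha
normal_august <- shapiro.test(august)$p.value >= alpha

# Equality of variances (F test): TRUE means equal variances are not rejected.
equal_var <- var.test(july, august)$p.value >= alpha

# Pick the two-sample test from the outcome of the checks.
chosen_test <- if (!(normal_july && normal_august)) {
  "Wilcoxon rank sum test"
} else if (equal_var) {
  "pooled t-test"
} else {
  "Welch (Satterthwaite) t-test"
}
chosen_test
```

With the p-values shown above (0.1564, 0.937, and 0.7418), all three checks pass and the sketch selects the pooled t-test.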

Ex 3. Part (b)

Null Hypothesis (Ho): the means of the July and August Windspeeds are EQUAL

Alternative Hypothesis (Ha): the means of the July and August Windspeeds are NOT EQUAL

Ex 3. Part (c)

Two-Sided Pooled T-Test:

t.test(Wind ~ Month, subAir, alternative = "two.sided", var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  Wind by Month
## t = 0.1865, df = 60, p-value = 0.8527
## alternative hypothesis: true difference in means between group 7 and group 8 is not equal to 0
## 95 percent confidence interval:
##  -1.443108  1.739883
## sample estimates:
## mean in group 7 mean in group 8 
##        8.941935        8.793548

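According to the T-Test, the p-value (0.8527) is greater than our significance level of 0.05, and the 95% confidence interval for the difference in means (-1.443, 1.740) contains 0. The small difference between the July (8.94) and August (8.79) sample means is not statistically significant, so we FAIL TO REJECT the Null Hypothesis.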
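
The conclusion can be read directly from the object that t.test() returns. This is a minimal sketch using the same call as above; the names res, decision, and ci_contains_zero are introduced here for illustration.

```r
# Same subset and test as in Part (c); airquality is built in.
subAir <- subset(airquality, Month %in% c(7, 8))
res <- t.test(Wind ~ Month, data = subAir,
              alternative = "two.sided", var.equal = TRUE)

alpha <- 0.05
res$p.value   # two-sided p-value
res$conf.int  # 95% confidence interval for the difference in means

# Reject H0 only when the p-value falls below the significance level.
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

# For a two-sided test at the 5% level, the 95% interval contains 0
# exactly when the test fails to reject H0.
ci_contains_zero <- res$conf.int[1] < 0 && res$conf.int[2] > 0

decision
ci_contains_zero
```

The p-value and the confidence interval lead to the same conclusion, which is a useful cross-check when interpreting the printed output.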

HW_1_Lacy_Burke

Lacy Burke

9/11/2021