Homework 1

Author: Abha Jha

abc123: wnn231

output:
  html_document: default

install.packages("dplyr")

library(e1071)
library(fBasics)
library(tidyverse)
library(devtools)
library(dplyr)

Exercise 1

a)

# Combined MPG: 40% city weight, 60% highway weight
MPGCombo = 0.4 * CARS$MPG_City + 0.6 * CARS$MPG_Highway

CARS = data.frame(CARS, MPGCombo)

boxplot(MPGCombo, main = "MPG Combined (Highway and City)", ylab = "Miles per Gallon", xlab = "40% City and 60% Highway Observations")

Comments = The center of the distribution (the median line of the boxplot) is around 22 mpg (approx.), which is expected from combining the City and Highway MPG. There are multiple outliers, especially above the upper whisker (this is because the Highway data, which contain higher MPG values, carry a 60% weight).
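As a small illustration of the weighting used above, the sketch below applies the same 40/60 combination to a made-up data frame and lists the points a boxplot would draw as outliers. The name cars_demo and its values are hypothetical and are not taken from the CARS data.

```r
# Hypothetical toy data; the values are invented for illustration only.
cars_demo <- data.frame(
  MPG_City    = c(18, 24, 60, 14),
  MPG_Highway = c(25, 32, 51, 18)
)

# Weighted combination: 40% city, 60% highway.
cars_demo$MPGCombo <- 0.4 * cars_demo$MPG_City + 0.6 * cars_demo$MPG_Highway
print(cars_demo$MPGCombo)

# Points lying beyond the boxplot whiskers (1.5 * IQR rule by default).
outliers <- boxplot.stats(cars_demo$MPGCombo)$out
print(outliers)
```

Running boxplot.stats() on the real MPGCombo column would list the outlying values directly, which makes it easier to check which cars they belong to.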

b)

unique(CARS$Type) # [1] "SUV" "Sedan" "Sports" "Wagon" "Truck"

ggplot(data = CARS) + geom_boxplot(mapping = aes(y = MPGCombo)) + facet_wrap(~ Type, nrow = 2)

Comments = The Wagon and Sedan types are noticeably the top performers, and the Truck type is the lowest. The Sedan outliers are also worth analyzing: they might be hybrid cars or cars with a small engine.

c)

Horsepower

summary(CARS$Horsepower)

Min. 1st Qu. Median Mean 3rd Qu. Max.

100.0 165.0 210.0 216.8 255.0 500.0

hist(CARS$Horsepower, main = "Histogram of Horsepower", xlab = "Horsepower")

skewness(CARS$Horsepower)

Skewness: [1] 0.9528091 ### Calculated by "moment"

Comment = The Horsepower variable does not follow a normal distribution: it is right-skewed (the skewness coefficient is almost 1). The data start at 100 horsepower and the majority of values lie around 100-250, while a few outliers, most likely sports cars, stretch the right tail all the way to 500 horsepower.
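To make the skewness coefficient less of a black box, here is a minimal base-R sketch of the classical moment-based formula (third central moment divided by the second central moment raised to 1.5). The function name moment_skewness and the two small vectors are hypothetical; package implementations may use a slightly different denominator, so their output can differ a little from this sketch.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with NA values dropped.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)  # hypothetical symmetric data
right_skewed <- c(1, 1, 1, 10)    # hypothetical data with a long right tail

print(moment_skewness(symmetric))
print(moment_skewness(right_skewed))
```

A value near 0 suggests a symmetric distribution, while a clearly positive value, as reported for Horsepower, indicates a longer right tail.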

d)

# labs() takes `title`, not `main`, for the plot title
ggplot(data = CARS) + geom_histogram(mapping = aes(x = Horsepower)) + facet_wrap(~ Type, nrow = 2) + labs(title = "Horsepower by Type of car")

shapiro.test(CARS$Horsepower)

W = 0.94573, p-value = 2.32e-11

unique(CARS$Type) # [1] "SUV" "Sedan" "Sports" "Wagon" "Truck"

# Horsepower values for each car type. CARS is a base data.frame here, so each result is a numeric vector.
Sedan = CARS[CARS$Type == "Sedan", "Horsepower"]
SUV = CARS[CARS$Type == "SUV", "Horsepower"]
Sports = CARS[CARS$Type == "Sports", "Horsepower"]
Wagon = CARS[CARS$Type == "Wagon", "Horsepower"]
Truck = CARS[CARS$Type == "Truck", "Horsepower"]

# The objects above are already vectors, so they are passed to shapiro.test() directly (no $Horsepower).
shapiro.test(Sedan)
# W = 0.95154, p-value = 1.205e-07
# Since the p-value is < 0.05 (alpha), the data do not follow a normal distribution.

shapiro.test(Sports)
# W = 0.94276, p-value = 0.01898
# Since the p-value is < 0.05 (alpha), the data do not follow a normal distribution.

shapiro.test(SUV)
# W = 0.95945, p-value = 0.04423
# The p-value is only slightly below 0.05, so normality is formally rejected. Being flexible,
# we could assume the data follow a distribution very close to a normal one (the histogram
# suggests the same).

shapiro.test(Truck)
# W = 0.8951, p-value = 0.01697
# Since the p-value is < 0.05 (alpha), the data do not follow a normal distribution.

shapiro.test(Wagon)
# W = 0.94074, p-value = 0.09525
# This is the only car type whose data could follow a normal distribution (p-value > 0.05),
# even though the histogram does not show a clearly normal shape.
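The per-type tests above repeat the same call five times. A more compact alternative, sketched here under the assumption of a data frame with a grouping column and a numeric column, is to split the values by group and apply shapiro.test() to each piece. The data frame demo and its simulated values are hypothetical stand-ins for CARS.

```r
set.seed(1)

# Hypothetical stand-in for CARS: one roughly normal group, one right-skewed group.
demo <- data.frame(
  Type       = rep(c("A", "B"), each = 30),
  Horsepower = c(rnorm(30, mean = 200, sd = 30), 100 + rexp(30, rate = 1 / 100))
)

# Split the numeric column by group, then collect one Shapiro-Wilk p-value per group.
p_values <- sapply(
  split(demo$Horsepower, demo$Type),
  function(x) shapiro.test(x)$p.value
)

print(p_values)
# Flag the groups for which normality is rejected at alpha = 0.05.
print(p_values < 0.05)
```

With the real data, the same pattern would be split(CARS$Horsepower, CARS$Type), giving all five p-values in one named vector.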

Exercise 2

a) I will perform the Wilcoxon rank-sum test, since at least one of my data sets is not normally distributed (SUV could be treated as normally distributed if we are flexible, but we will assume it is not).

b)

Ho : Both populations have the same distribution (same median)

Ha : The distribution of one population is shifted to the left or right of the other.

wilcox.test(SUV, Truck, alternative = "two.sided")

Results: data: SUV and Truck

W = 806.5, p-value = 0.3942

alternative hypothesis: true location shift is not equal to 0

c)

Conclusion: Since the p-value (0.3942) is larger than 0.05, we fail to reject Ho. The data do not give me any reason to conclude that the median horsepower of SUVs differs from that of Trucks. However, interpreting the Wilcoxon rank-sum test as a comparison of medians assumes that the two distributions have a similar shape, which might not be the case with the Truck data set.
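To show where the W statistic comes from, the following sketch runs the rank-sum test on two small made-up samples and checks that W equals the number of (x, y) pairs in which the first sample's value is larger. The vectors suv_hp and truck_hp are hypothetical and contain no ties; they are not the CARS values.

```r
# Hypothetical horsepower samples (no ties), for illustration only.
suv_hp   <- c(185, 200, 215, 230, 240, 265, 275, 295)
truck_hp <- c(143, 175, 190, 210, 232, 250, 300, 305)

res <- wilcox.test(suv_hp, truck_hp, alternative = "two.sided")

# Without ties, W counts the pairs where the first sample exceeds the second.
w_by_hand <- sum(outer(suv_hp, truck_hp, ">"))

print(res$statistic)
print(w_by_hand)

# Decision at the 0.05 significance level.
decision <- if (res$p.value < 0.05) "reject Ho" else "fail to reject Ho"
print(decision)
```

Reading W as a pair count also explains why a value near half of n1 * n2 points to no location shift between the two groups.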

Exercise 3

a)

View(airquality)

data.frame(airquality)
summary(airquality)

month_July = subset(airquality, Month == 7)
month_August = subset(airquality, Month == 8)

par(mfrow = c(1, 2))
hist(month_July$Wind, main = "Wind for July", xlab = "Wind")
hist(month_August$Wind, main = "Wind for August", xlab = "Wind")

summary(month_July$Wind)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 4.100   6.900   8.600   8.942  10.900  14.900

summary(month_August$Wind)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 2.300   6.600   8.600   8.794  11.200  15.500

par(mfrow = c(1, 2))

qqnorm(month_July$Wind, main = "Q-Q Plot July")
qqnorm(month_August$Wind, main = "Q-Q Plot August")

par(mfrow = c(1, 1))

shapiro.test(month_July$Wind)
# data: month_July$Wind
# W = 0.95003, p-value = 0.1564

shapiro.test(month_August$Wind)
# data: month_August$Wind
# W = 0.98533, p-value = 0.937

The Shapiro-Wilk test does not reject normality for either month (both p-values > 0.05), therefore the t-test is in order. We will not assume equal variances (var.equal = FALSE, the default in t.test(), which gives the Welch test).

b)

For the Null Hypothesis we will assume that both means are the same, while for

the Alternative Hypothesis we will test if they are different from each other.

Ho : mJuly = mAugust

Ha : mJuly ≠ mAugust

c)

t.test(month_July$Wind, month_August$Wind, alternative = "two.sided")

data: month_July$Wind and month_August$Wind

t = 0.1865, df = 59.78, p-value = 0.8527

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-1.443229 1.740003

sample estimates:

mean of x mean of y

8.941935 8.793548
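The fractional degrees of freedom in the output come from the Welch correction for unequal variances. As a rough check, the sketch below recomputes the t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand from the built-in airquality data and compares them with t.test(). The variable names are chosen here for illustration.

```r
july   <- airquality$Wind[airquality$Month == 7]
august <- airquality$Wind[airquality$Month == 8]

# Squared standard error of each sample mean.
se2_july   <- var(july) / length(july)
se2_august <- var(august) / length(august)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_manual  <- (mean(july) - mean(august)) / sqrt(se2_july + se2_august)
df_manual <- (se2_july + se2_august)^2 /
  (se2_july^2 / (length(july) - 1) + se2_august^2 / (length(august) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

res <- t.test(july, august, alternative = "two.sided", var.equal = FALSE)

print(c(t = t_manual, df = df_manual, p = p_manual))
print(c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value))
```

Both printed lines should agree, which confirms that the reported test is the Welch version rather than the pooled-variance t-test.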

Since the p-value = 0.8527 is larger than the 0.05 significance level, we fail to reject the null hypothesis (Ho), that means the two means are significantly