Homework 1

Exercise 1: Descriptive Statistics

(A)

Create a combined mpg variable called MPG_Combo which combines 55% of the MPG_City and 45% of the MPG_Highway. Obtain a box plot for MPG_Combo and comment on what the plot tells us about fuel efficiencies.

MPG_Combo <- 0.55*cars$MPG_City+0.45*cars$MPG_Highway 
cars=data.frame(cars, MPG_Combo)   
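As a quick sanity check, the same weighting can be tried on a tiny made-up data frame. This is only a sketch: `toy` and its values are hypothetical and are not taken from the cars data.

```r
# Hypothetical toy data; these values are NOT from the cars data set.
toy <- data.frame(MPG_City = c(20, 30), MPG_Highway = c(30, 40))

# Same 55% / 45% weighting as above, stored directly as a new column.
toy$MPG_Combo <- 0.55 * toy$MPG_City + 0.45 * toy$MPG_Highway
toy$MPG_Combo
```

Each combined value lands between the city and highway figures, slightly closer to the city one because it carries the larger weight.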

boxplot(MPG_Combo, 
        main = "Distribution of Fuel Efficiency",
        ylab = "MPG_Combo",
        col = "Red",
        border = "Blue",
        horizontal = FALSE
)  

points(mean(MPG_Combo, na.rm=TRUE), col = "White")

The box plot looks roughly symmetrical in the central box, but the mean point sits above the median line.
Mean > Median, which suggests the data is slightly right skewed.
We can also see a group of outliers above the upper whisker.
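The two visual cues used here (mean above median, points beyond the upper whisker) can also be checked numerically. The sketch below uses a small made-up vector `x`, which is an assumption for illustration and not the MPG_Combo data; `boxplot.stats()` applies the same 1.5 * IQR rule that `boxplot()` uses to flag outliers.

```r
# Hypothetical sample: evenly spread values plus three large ones.
x <- c(seq(18, 32, by = 1), 46, 50, 55)

stats <- boxplot.stats(x)   # same rule boxplot() uses for whiskers and outliers

mean(x) > median(x)         # TRUE here: the large values pull the mean up
stats$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                   # points drawn individually beyond the whiskers
```

Running the same calls on `MPG_Combo` would give the numbers behind the plot.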

(B)

Obtain box plots for MPG_Combo by Type and comment on any differences you notice between the different vehicle types' combined fuel efficiency.

boxplot(MPG_Combo ~ Type, data=cars,
            main = "Distribution of Fuel Efficiency by Type of Vehicle",
            ylab = "MPG_Combo",
            xlab = "Type",
            col = "red",
            border = "blue",
            horizontal = FALSE
)

All of the vehicle types look fairly symmetrical except Truck.
Truck looks right skewed rather than symmetrical, since most of its values sit close to its minimum.
Sedan has the widest range of values, while Truck has the narrowest.
Sedan also has more outliers than the other types.
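The comparison of ranges read off the plot could be backed up with a per-group calculation. A minimal sketch, assuming a made-up data frame `toy` with hypothetical values (not the cars data):

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  Type      = c("Sedan", "Sedan", "Sedan", "Truck", "Truck", "Truck"),
  MPG_Combo = c(18, 27, 40, 15, 17, 20)
)

# Range width (max - min) of MPG_Combo within each Type.
spread <- tapply(toy$MPG_Combo, toy$Type, function(v) diff(range(v)))
spread
```

Replacing `toy` with `cars` would give the range width for every vehicle type in one call.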

(C)

Obtain basic descriptive statistics for Invoice for all vehicles. Comment on any general features and statistics of the data. Use visual and quantitative methods to comment on whether an assumption of Normality would be reasonable for Invoice variable.

summary(cars$Invoice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9875   18973   25672   30096   35777  173560

qqnorm(cars$Invoice)
qqline(cars$Invoice , col = "red")

shapiro.test(cars$Invoice)

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Invoice
## W = 0.77353, p-value < 2.2e-16

The summary shows that the mean (30096) is greater than the median (25672), which tells us the data is right skewed. The maximum (173560) is also far above the 3rd quartile (35777), which points to a long right tail.
The QQ plot shows the data does not follow a normal distribution, since the points do not follow the QQ line.
The Shapiro-Wilk test gives a very small p-value (< 2.2e-16), so we reject the null hypothesis that Invoice is normally distributed.
In conclusion: both the QQ plot and the Shapiro-Wilk test show that an assumption of normality would not be reasonable for Invoice.

(D)

Use visual and quantitative methods to comment on whether an assumption of normality would be reasonable for Invoice variable by Origin (i.e., check normality of Invoice from i) Europe, ii) Asian, and iii) USA cars).

Qualitative

boxplot(Invoice ~ Origin, data=cars,
            main = "Invoice Vs. Origin",
            ylab = "Invoice",
            xlab = "Origin",
            col = "red",
            border = "blue",
            horizontal = FALSE
)

library(ggplot2)  # provides ggplot(), geom_histogram(), facet_wrap()
histogram_plot <- ggplot(data=cars, mapping=aes(x=Invoice)) +
  geom_histogram(aes(fill=Origin, color=Origin), alpha = 0.25, bins=40) + facet_wrap(Origin~.)
histogram_plot

The boxplot shows that Invoice is right skewed for each Origin, since for each one most values sit close to the minimum of its range.
The Invoice histograms do not look bell shaped, meaning that Invoice is not normally distributed within any Origin.
Each of the histograms has a long right tail.

Quantitative

shapiro.test(cars[cars$Origin=='Europe', "Invoice"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "Europe", "Invoice"]
## W = 0.79809, p-value = 1.024e-11

shapiro.test(cars[cars$Origin=='Asia', "Invoice"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "Asia", "Invoice"]
## W = 0.84696, p-value = 2.012e-11

shapiro.test(cars[cars$Origin=='USA', "Invoice"])

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "USA", "Invoice"]
## W = 0.89222, p-value = 6.42e-09

Based on the Shapiro-Wilk tests, we can infer that Invoice does not follow a normal distribution for any of the Origins, because each p-value (1.024e-11, 2.012e-11, 6.42e-09) is very small compared to the .05 significance level.
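The three separate calls can also be written as one grouped call. The sketch below runs on simulated data: `toy`, its seed, and the lognormal parameters are assumptions chosen only to produce right-skewed values, and they stand in for the real Invoice column.

```r
# Simulated, hypothetical data standing in for Invoice by Origin.
set.seed(42)
toy <- data.frame(
  Origin  = rep(c("Asia", "Europe", "USA"), each = 100),
  Invoice = rlnorm(300, meanlog = 10, sdlog = 1)   # strongly right skewed
)

# One Shapiro-Wilk p-value per Origin.
p_values <- sapply(split(toy$Invoice, toy$Origin),
                   function(v) shapiro.test(v)$p.value)
p_values < 0.05   # TRUE means normality is rejected for that group
```

With the real data, `split(cars$Invoice, cars$Origin)` would take the place of the simulated columns.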

Exercise 2: Hypothesis Testing

Perform a hypothesis test of whether cars originated in Europe have different invoice price than Asian cars, and state your conclusions.

(A)

Which test should we perform, and why? Justify your answer based on findings on Exercise 1 (d).

We should conduct a two-sample test. Since neither group (Asia and Europe) follows a normal distribution, a Wilcoxon Rank Sum Test should be conducted.

(B)

Specify null and alternative hypotheses.

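Ho: The Invoice values of Asian and European cars come from the same distribution.
Ha: The two distributions differ by a location shift, i.e. one group tends to have larger Invoice values than the other.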

(C)

State the conclusion based on the test result.

library(dplyr)  # filter() here is dplyr::filter, not stats::filter
asia_europe = filter(cars, Origin == 'Asia' | Origin == 'Europe')

wilcox.test(Invoice ~Origin, data=asia_europe, exact=FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Invoice by Origin
## W = 2344, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Based on the Shapiro-Wilk tests from Exercise 1 (D), neither Asia nor Europe follows a normal distribution.
Since normality does not hold, we conduct the Wilcoxon Rank Sum Test.
Conclusion: The p-value of the Wilcoxon test (< 2.2e-16) is very small compared to the .05 significance level. We reject the null hypothesis in favor of the alternative that one group tends to have greater Invoice values than the other.
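The two-sided test above says the groups differ but not in which direction. One way to see the direction is to ask `wilcox.test()` for the estimated location shift (`conf.int = TRUE`) and to compare group medians. The sketch below uses simulated data: `toy`, the seed, and the lognormal parameters are assumptions and do not reproduce the cars results.

```r
# Simulated, hypothetical data; the Europe values are shifted upward on purpose.
set.seed(7)
toy <- data.frame(
  Origin  = rep(c("Asia", "Europe"), each = 60),
  Invoice = c(rlnorm(60, meanlog = 10,   sdlog = 0.4),
              rlnorm(60, meanlog = 10.5, sdlog = 0.4))
)

res <- wilcox.test(Invoice ~ Origin, data = toy, exact = FALSE, conf.int = TRUE)
res$p.value                               # small: the groups differ
res$estimate                              # estimated shift, first level minus second (Asia - Europe)
tapply(toy$Invoice, toy$Origin, median)   # group medians show the same direction
```

A negative estimate means the first group (Asia in this toy example) tends to have the lower values.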

Exercise 3: Hypothesis Testing

(A)

Which test should we perform, and why? See QQ-plot and perform Shapiro-Wilk test for normality check.

We should perform a two-sample test, since we are comparing two groups (Wind in July and Wind in August) to decide whether their values are the same or differ from each other.

First check for normality

View(airquality)
summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

july_august = filter(airquality, 
                      Month ==7 | Month==8)
qqnorm(airquality$Wind)
qqline(airquality$Wind , col = "red")

shapiro.test(airquality[airquality$Month==7, "Wind"])

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 7, "Wind"]
## W = 0.95003, p-value = 0.1564

shapiro.test(airquality[airquality$Month==8, "Wind"])

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 8, "Wind"]
## W = 0.98533, p-value = 0.937

Based on the Shapiro-Wilk tests, both p-values (0.1564 for July, 0.937 for August) are greater than the .05 significance level, so we do not reject normality for either month.
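The QQ plot above uses Wind from all months, while the Shapiro-Wilk tests look at July and August separately. A per-month QQ plot would match the tests more closely. This is a sketch of one way to draw it; `airquality` ships with base R, and the object names `wind_july` and `wind_august` are introduced here only for illustration.

```r
# airquality is part of base R (datasets package), so this runs as is.
wind_july   <- airquality$Wind[airquality$Month == 7]
wind_august <- airquality$Wind[airquality$Month == 8]

op <- par(mfrow = c(1, 2))   # two plots side by side
qqnorm(wind_july, main = "Wind, July")
qqline(wind_july, col = "red")
qqnorm(wind_august, main = "Wind, August")
qqline(wind_august, col = "red")
par(op)                      # restore the previous plot layout
```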
Next, conduct var.test to check for equal variances

var.test(Wind ~Month, july_august, alternative ="two.sided")

## 
##  F test to compare two variances
## 
## data:  Wind by Month
## F = 0.8857, num df = 30, denom df = 30, p-value = 0.7418
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4270624 1.8368992
## sample estimates:
## ratio of variances 
##          0.8857035

Since the variance test p-value (0.7418) is greater than the .05 significance level, we do not reject equal variances for months 7 and 8.
Run the pooled t-test, keeping var.equal=TRUE since the variances can be treated as equal.

t.test(Wind ~Month, july_august,alternative ="two.sided", var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  Wind by Month
## t = 0.1865, df = 60, p-value = 0.8527
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.443108  1.739883
## sample estimates:
## mean in group 7 mean in group 8 
##        8.941935        8.793548
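To see where the pooled statistic comes from, it can be recomputed by hand from the two samples. This is a sketch using the standard pooled-variance formula; the variable names are introduced here for illustration, and the result should agree with the t, df, and p-value printed above.

```r
x <- airquality$Wind[airquality$Month == 7]   # July
y <- airquality$Wind[airquality$Month == 8]   # August
n1 <- length(x)
n2 <- length(y)

# Pooled variance: weighted average of the two sample variances.
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df       <- n1 + n2 - 2
p_manual <- 2 * pt(-abs(t_manual), df)        # two-sided p-value

c(t = t_manual, df = df, p = p_manual)
```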

(B)