Descriptive Statistics & One- and Two-Sample Inferential Tests

Exercise 1

(a)

The box plot for MPG_Combo looks roughly symmetric around the median, so it is reasonable to treat combined fuel efficiency as approximately symmetric. There are a few low outliers and a larger number of high outliers.
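The low and high outliers read off a box plot can also be counted directly. The sketch below uses `boxplot.stats()`, the base R helper behind `boxplot()`, on simulated values; `mpg_combo` is a hypothetical stand-in and not the course data.

```r
# Simulated stand-in for MPG_Combo (hypothetical values, not the course data)
set.seed(42)
mpg_combo <- c(rnorm(400, mean = 24, sd = 4), 45, 50, 55, 8)

# boxplot.stats() returns the five box-plot statistics and the flagged outliers
stats    <- boxplot.stats(mpg_combo)
whiskers <- stats$stats[c(1, 5)]   # ends of the lower and upper whiskers
outliers <- stats$out              # points drawn individually on the plot

# Split the outliers into those below and above the whiskers
low_outliers  <- outliers[outliers < whiskers[1]]
high_outliers <- outliers[outliers > whiskers[2]]
cat("low outliers:", length(low_outliers),
    " high outliers:", length(high_outliers), "\n")
```

By default a point is flagged when it lies more than 1.5 times the box length beyond the box, which is why a long upper tail shows up as a cluster of high outliers.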

Exercise 1

(b)

The most obvious difference across the box plots by type is how right-skewed the trucks' combined MPG is, which tells us that trucks tend to have low combined MPG, with the exception of the one outlier.

- Sedans have the best overall combined MPG, with several high outliers showing that some models get exceptionally good gas mileage.
- The data for sports cars look almost perfectly symmetric. Their gas mileage is about average overall: not bad, but not great either.
- SUVs, like trucks, have poor gas mileage, and the SUV group contains the lowest combined MPG of any type.
- Wagons have the second-best combined MPG.

Exercise 1

(c)

We can use several visual and descriptive techniques to judge whether the Invoice variable is normally distributed.

The first visual tool is a box plot. It shows many outliers, and their number is our first clue that the data do not follow a normal distribution. We can also use a histogram. Normally distributed data produce a bell-shaped histogram, and it is clear in our example that the data are strongly right-skewed. A third option is the Q-Q plot. The circles are the observed data and the line shows where the points would fall under a normal distribution. The points will rarely sit exactly on the line, but here they depart from it strongly.

For descriptive tools, the first option is to compare the mean and median of Invoice. The closer the two values are, the more symmetric the data. Here the mean (30096) is well above the median (25672), another indication that the data are right-skewed. Lastly, we can use the Shapiro-Wilk normality test, whose null hypothesis is that the data come from a normal distribution. At the 0.05 significance level, a p-value above 0.05 means we do not have enough evidence to reject the null, so the data are consistent with normality; it does not prove normality. In this example the p-value is below 0.05 (p < 2.2e-16), so we reject the null and conclude that Invoice does NOT follow a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9875   18973   25672   30096   35777  173560

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Invoice
## W = 0.77353, p-value < 2.2e-16
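The descriptive checks above can be sketched in a few lines of base R. Because the `cars` data file is not included here, this example simulates a right-skewed variable; `invoice` and its parameters are assumptions for illustration only.

```r
# Simulated right-skewed prices (hypothetical stand-in for cars$Invoice)
set.seed(1)
invoice <- rlnorm(400, meanlog = 10.2, sdlog = 0.5)

# Five-number summary plus the mean, as in the output above
print(summary(invoice))

# A mean well above the median points to right skew
mean_minus_median <- mean(invoice) - median(invoice)

# Shapiro-Wilk test; H0: the data come from a normal distribution
sw <- shapiro.test(invoice)
print(sw)
reject_normality <- sw$p.value < 0.05
```

The visual checks would use `boxplot(invoice)`, `hist(invoice)`, and `qqnorm(invoice)` followed by `qqline(invoice)` on the same vector.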

Exercise 1

(d)

Based on the box plots, none of the three Origin groups looks normally distributed: each has large outliers and is slightly right-skewed. The Shapiro-Wilk normality test gives a p-value below 0.05 for every Origin group, so we have sufficient evidence to reject the null hypothesis and conclude that none of them is normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "Europe", "Invoice"]
## W = 0.79809, p-value = 1.024e-11

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "Asia", "Invoice"]
## W = 0.84696, p-value = 2.012e-11

## 
##  Shapiro-Wilk normality test
## 
## data:  cars[cars$Origin == "USA", "Invoice"]
## W = 0.89222, p-value = 6.42e-09
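Running the test once per group can be done in a single step rather than three separate calls. This is a minimal sketch on simulated data; the data frame `cars_sim`, its group sizes, and its values are hypothetical and only mirror the column names used above.

```r
# Simulated stand-in with the same column names as the cars data (hypothetical values)
set.seed(2)
cars_sim <- data.frame(
  Origin  = rep(c("Asia", "Europe", "USA"), times = c(150, 120, 140)),
  Invoice = c(rlnorm(150, 10.0, 0.5),
              rlnorm(120, 10.5, 0.5),
              rlnorm(140, 10.1, 0.4))
)

# Split Invoice by Origin and keep the Shapiro-Wilk p-value for each group
p_by_origin <- sapply(split(cars_sim$Invoice, cars_sim$Origin),
                      function(x) shapiro.test(x)$p.value)
print(p_by_origin)

# TRUE where normality is rejected at the 0.05 level
not_normal <- p_by_origin < 0.05
```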

Exercise 2

(a)

We learned in Exercise 1 that Invoice does not follow a normal distribution; now we want to check whether invoice prices differ between European and Asian cars. First we use the Shapiro-Wilk test to check normality within each group. Both p-values are below 0.05, so we conclude that neither group is normally distributed. When the two groups are not normal, we use the Wilcoxon rank-sum test, a nonparametric alternative to the two-sample t-test.

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Invoice[cars$Origin == "Europe"]
## W = 0.79809, p-value = 1.024e-11

## 
##  Shapiro-Wilk normality test
## 
## data:  cars$Invoice[cars$Origin == "Asia"]
## W = 0.84696, p-value = 2.012e-11

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Invoice by Origin
## W = 2344, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
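As a sketch of how the rank-sum test works, the example below runs `wilcox.test()` with the same `Invoice ~ Origin` formula and then recomputes the W statistic by hand from the ranks. The data are simulated; `two_groups` and its values are hypothetical.

```r
# Simulated two-group data (hypothetical stand-in for the Europe/Asia subset)
set.seed(3)
two_groups <- data.frame(
  Origin  = rep(c("Asia", "Europe"), times = c(150, 120)),
  Invoice = c(rlnorm(150, 10.0, 0.5), rlnorm(120, 10.5, 0.5))
)

# Two-sided Wilcoxon rank-sum test; H0: location shift equals 0
wt <- wilcox.test(Invoice ~ Origin, data = two_groups)
print(wt)

# W by hand: rank all values together, sum the ranks of the first group,
# then subtract the smallest possible rank sum n1 * (n1 + 1) / 2
ranks    <- rank(two_groups$Invoice)
n1       <- sum(two_groups$Origin == "Asia")
w_manual <- sum(ranks[two_groups$Origin == "Asia"]) - n1 * (n1 + 1) / 2
```

Because the test uses ranks rather than raw prices, a handful of very expensive cars cannot dominate the result the way they would in a t-test.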

Exercise 2

(b)

H0: Invoice prices of European and Asian cars come from the same distribution.

H1: The two distributions differ by a location shift, so one group, either European or Asian, tends to have larger invoice prices.

Exercise 2

(c)

The Wilcoxon rank-sum test gives a p-value below 0.05 (p < 2.2e-16), so we have enough evidence to reject the null hypothesis and conclude that invoice prices for European and Asian cars differ, with one group tending to have larger values than the other.

Exercise 3

(a)

To determine which test to run, we must first check the normality of wind speeds in July and August. We can do this visually using a Q-Q plot, where the points follow the reference line closely. Next, we use the Shapiro-Wilk test. The p-value is above 0.05 for each month (0.1564 for July and 0.937 for August), so we do not have enough evidence to reject the null and can treat the data as approximately normal.

Since the data are approximately normal, we next check whether the variances are equal. The F test returns a p-value above 0.05 (p = 0.7418), so we cannot reject the null hypothesis and treat the two groups as having equal variance. Because the variances can be treated as equal, we conduct the pooled two-sample t-test. It again gives a p-value above 0.05 (p = 0.8527), so we do not have enough evidence to reject the null: the data give no evidence that mean wind speed in August differs from that in July.

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 7, "Wind"]
## W = 0.95003, p-value = 0.1564

## 
##  Shapiro-Wilk normality test
## 
## data:  airquality[airquality$Month == 8, "Wind"]
## W = 0.98533, p-value = 0.937

## 
##  F test to compare two variances
## 
## data:  Wind by Month
## F = 0.8857, num df = 30, denom df = 30, p-value = 0.7418
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4270624 1.8368992
## sample estimates:
## ratio of variances 
##          0.8857035

## 
##  Two Sample t-test
## 
## data:  Wind by Month
## t = 0.1865, df = 60, p-value = 0.8527
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.443108  1.739883
## sample estimates:
## mean in group 7 mean in group 8 
##        8.941935        8.793548
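The sequence of checks for this exercise can be sketched with R's built-in `airquality` data set. The call forms below are an assumption chosen to match the output shown above; the original code is not part of this report, and `jul_aug` is a name introduced here.

```r
# Keep only July (7) and August (8) from the built-in airquality data
jul_aug <- subset(airquality, Month %in% c(7, 8))

# Step 1: Shapiro-Wilk normality test for each month
sw_july   <- shapiro.test(jul_aug$Wind[jul_aug$Month == 7])
sw_august <- shapiro.test(jul_aug$Wind[jul_aug$Month == 8])

# Step 2: F test for equal variances; H0: ratio of variances equals 1
f_test <- var.test(Wind ~ Month, data = jul_aug)

# Step 3: pooled two-sample t-test (var.equal = TRUE pools the variances)
pooled <- t.test(Wind ~ Month, data = jul_aug, var.equal = TRUE)
print(pooled)
```

Had the F test rejected equal variances, dropping `var.equal = TRUE` would give Welch's t-test, which is the default in `t.test()`.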

Exercise 3

(b)

H0: The mean wind speed in July is equal to the mean wind speed in August.

H1: The mean wind speeds in July and August are not equal.

Exercise 3

(c)

Based on the pooled t-test, we do not have sufficient evidence to reject the null hypothesis, so we cannot conclude that the mean wind speeds in July and August differ.

Travis Compton

2020
