Z— title: “Week 5 Lecture” output: html_document date: “2024-06-12” —

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Means, Modes and Medians

We’ve discussed means previously, and we all know how to do them:

example<-c(5,5,5,5,10)

sum(example)/length(example)
## [1] 6
# OR just...

mean(example)
## [1] 6

But as we discussed last week, mean is sometimes not the most appropriate model. Sometimes the mean hides the most important part of the data.

MODE

You may or may not remember from college algebra, but “MODE” is a number that represents the most common value of a particular variable.

There is no R command I’m aware of to display MODE automatically, but we don’t need one: just count the variables.

mode_example<-c(5,5,5,5,5,5,5,5,10)

mode_example %>% table(.)
## .
##  5 10 
##  8  1
# the 'period' in table(.) means "everything", so we are saying, "take mode_example, and create a table where EVERYTHING is counted"
hist(mode_example,breaks = 20)

Why is this important? Being able to look at a dataset and describe it’s mode is useful for deciding if your data is normal or not. Here is a histogram of various cuts from the diamond dataset.

Lets learn how to summarize and arrange numerical data:

diamond_mode_table<-diamonds %>% group_by(x) %>% summarize(n=n()) %>% arrange(-n)


head(diamond_mode_table)
## # A tibble: 6 × 2
##       x     n
##   <dbl> <int>
## 1  4.37   448
## 2  4.34   437
## 3  4.33   429
## 4  4.38   428
## 5  4.32   425
## 6  4.35   407
hist(diamonds$x,breaks=50)

hist(diamonds$x,breaks=50,probability = T)
lines(density(diamonds$x),col="red",lwd=3)

As we can see, this is more of a bimodal distribution. It’s not perfectly bimodal, but it almost is. By definition, frequently used models cannot be used on this type of data.

We will discuss assumptions of models soon, but for now remember this: data must be NORMALLY DISTRIBUTED for usual statistical tests.

It’s possible to do some fancy statistics on data that isn’t normally distributed, but that’s considered an advanced technique. We might save that for the end of class, or as needed.

Trimodal data is also possible:

trimodal_data<-c(1,2,2,2,2,2,3,4,4,4,4,4,5,6,6,6,6,6,7)

hist(trimodal_data,breaks=9)

MEDIAN

Median is an alternate measurement of central tendency that is typically used when the data is heavily skewed. The reason here is that median is “relatively unaffected by extreme scores at either end of the distribution” (Zoe et al, 2012).

For example, in normally distributed data, mean and median are basically the same:

normal1<-rnorm(1000,mean=50)



mean(normal1)
## [1] 49.96204
median(normal1)
## [1] 49.96788

Observe this graph of the data:

normal_mean<-mean(normal1)
normal_median<-median(normal1)

hist(normal1)
abline(v=normal_mean,col='blue',lwd=5)
abline(v=normal_median,col='red',lwd=2)

But much (perhaps most?) data in the real world is skewed. Take the prices of diamonds, for example:

hist(diamonds$price)

If you think about it, it would be really odd if diamond prices were distributed “normally”. That would mean almost all diamond prices cluster towards the middle-range – almost nobody buying very expensive or very cheap diamonds. In reality, most diamonds are “priced to sell”.

In skewed data like this, the mean and median are often very different:

mean(diamonds$price)
## [1] 3932.8
median(diamonds$price)
## [1] 2401
d_mean<-mean(diamonds$price)
d_median<-median(diamonds$price)


hist(diamonds$price)
abline(v=d_mean,col='blue',lwd=5)
abline(v=d_median,col='red',lwd=5)
legend("topright",c("Mean","Median"),col=c("blue","red"),lwd=5)

The mean and median US Salary is often skewed this way as well (lots of people make very little, some people make a lot), so the “average” US salary is higher than the median (middle) salary. The “average” is getting “pulled upward” by people who make a lot of money – just like the “average” price is being pulled upward by high diamond prices.

FORMING A TESTABLE HYPOTHESIS

Every testable hypothesis is a statement about a population that can be accepted or rejected. The population doesn’t have to be people, it can be a collection of anything (school districts, cars, diamonds, etc.)

Can we form a testable hypothesis using the CARS dataset?

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

am = transmission (1 = manual, 0 = automatic) vs = engine shape (1 = vshape, 0 = other shape) cyl = number of cylinders

What about transmission versus miles per gallon?

ggplot(mtcars,aes(x=am,y=mpg,group=am)) + geom_boxplot() 

#Could prevent ggplot from treating AM as continuous value by adding “as factor” to x=am

Seems like manual transmission gets better mileage. How do we phrase this as a research question?

Hypothesis 0 (the NULL hypothesis): there is NO DIFFERENCE in miles per gallon between transmission types

Hypothesis 1 (the ALTERNATIVE HYPOTHESIS): manual transmission will have a higher miles-per-gallon rating than automatic transmission

We can test this mathematically: the first variable is the dependent variable (what we are trying to explain). In the example below, we are trying to explain fuel efficiency (MPG)– you typically only have one (if you want many, you’ll need to run many different tests)

t.test(mpg~am,data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Alternative hypothesis explaination: is there a true difference in averages between these two groups (in this case, between automatic and manual)

You’ll notice that the t.test gives you the alternative hypothesis: the true difference in means between group 0 and group 1 is not equal to 0.

Technically, the p-value is saying “the null hypothesis can be ruled out with a high degree of precision (p=0.0013). What it’s really saying is”the alternative hypothesis is probably true”.

#What is the p-value? very simple, casual explanation: What is the chance that the null hypothesis can be ruled out? When the p-value is small, there is a good chance that the null hypothesis can be ruled out What is the chances that the alternative hypothesis is true? The lower the value, the higher the chance that it is true

Conclusion: manual transmissions tend to have higher mpg than automatic transmissions.

What about horsepower versus cylinders?

hist(mtcars$hp)

median(mtcars$hp)
## [1] 123
mean(mtcars$hp)
## [1] 146.6875

Median: 123 Mean: 146.6875 (the mean is being dragged up by the two cars with high horsepower relative to the other vehicles)

cars_data<-mtcars

ggplot(cars_data,aes(x=cyl,y=hp,group=cyl)) + geom_boxplot() 

[the black dot above six cylinders is an outlier– ggplot has some sort of fancy calculation that determines if a value is an outlier]

There are three types of cylinders now – 4, 6 and 8.

Hypothesis 0 (THE NULL HYPOTHESIS): There is NO DIFFERENCE in the horsepower means between cylinders

Hypothesis 1 (THE ALTERNATIVE HYPOTHESIS): There IS a DIFFERENCE in the horsepower means between cylinders.

We can’t use a regular t-test for this (I will explain why later– basically because there’s three type of cylinders. t-tests are for two values).

For comparing multiple means, we use what’s called ANOVA (Analysis Of Variance):

anova_model<-aov(hp~cyl,data=cars_data)

summary(anova_model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          1 100984  100984   67.71 3.48e-09 ***
## Residuals   30  44743    1491                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The stars represent how significant the pr>f (p-value equivalent) is- *** is very significant

We see here that the means of horsepower really ARE different depending on cylinders. The p-value is 3.48e-09, which is something like “0.00000000348”. This means that the null hypothesis (there is no difference) can be rejected with some certainty. There is probably a difference between cylinders! (.05 is the critical value– the cut off point, basically, that was decided in the 1920s by some guy). We can reject the null hypothesis– there is a true difference!

The problem with ANOVA is that it doesn’t tell you which means are different, or why. It’s just saying “the means are different across groups”. This is why t-tests and ANOVA are the beginning of research, rather than the end. It just tells us the null hypothesis can be rejected for the whole equation– nothing more specific.

Is there a way to do more advanced hypothesis testing? (Yes! You will do this!) ANOVA and t-tests are good for initial examination– it’s not the end of the research, it’s the beginning!

What if we decided that number of cylinders probably does affect horsepower, but this is because cars with more cylinders tend to have manual transmissions. How could we see the effect of different variables at the same time?

Null Hypothesis (Hypothesis 0): Number of cylinders has no effect upon horsepower, even when controlling for the effect of transmission, engine displacement and weight.

Alternative hypothesis (Hypothesis 1): Number of cylinders DOES have an effect on horsepower, even when controlling for the effect of transmission, engine displacement and weight.

lm stands for linear model

cars_lm<-lm(hp~cyl+am+disp+wt,data=mtcars)

summary(cars_lm)
## 
## Call:
## lm(formula = hp ~ cyl + am + disp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.742 -20.014  -3.174  13.217  98.321 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -96.4060    47.2054  -2.042  0.05100 . 
## cyl          24.2415     8.1027   2.992  0.00586 **
## am           51.0524    17.3211   2.947  0.00653 **
## disp          0.1736     0.1583   1.096  0.28266   
## wt           10.0430    15.5515   0.646  0.52387   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.63 on 27 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.745 
## F-statistic: 23.64 on 4 and 27 DF,  p-value: 1.738e-08

Pr(>|t|) is basically a p-value– it’s a p-value for the individual independent values. The p-value at the bottom (1.738e-08) means the null hypothesis can be rejected– there is a difference! Dsp and wt– big margin of error (28% and 52%)– these are not that important to horsepower. Cyl and am (transmission type) are important to horsepower. Cyl is the most important. Estimate– this column basically shows (in this model) how many extra horsepowers you would get for every 1 cyl added– it’s basically a slope! Thw wt and disp can be ignored.

Now we see that cylinders DOES have an effect on HP, even when variables like transmission type, weight and engine displacement are accounted for. Cylinders have a p value of “0.0058” which means the null hypothesis can be safely discarded.

We will discuss how to perform these analysis later on, for now you just need to formulate a null hypothesis and an alternative hypothesis for your data.

HOMEWORK: Complete the Testable Hypothesis Selection