Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(corrplot)
corrplot 0.92 loaded
Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results
faithful<-faithful#eruptions = length of eruption time in minutes#waiting = time between eruptions in minutescor.test(faithful$eruptions,faithful$waiting, method="spearman")
Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7779721
The Spearman’s rho is 0.7779721, which we can round to 0.78. As indicated by the plot, there appears to be a strong, positive correlation between waiting time and eruption time. According to the correlation test, p = 2.2 x 10^-16, which is statistically significant, meaning we could reject H0 if we were doing a hypothesis test.
DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)
#distribution of class of patients admitted to self poisoning (A) and gastro units (B)#socioclasssocioclass<-c("I","II","III","IV","V")selfpoison<-c(17,25,39,42,32)gastro<-c(22,46,73,91,57)patients<-data.frame(socioclass,selfpoison,gastro)head(patients)
socioclass selfpoison gastro
1 I 17 22
2 II 25 46
3 III 39 73
4 IV 42 91
5 V 32 57
#remove the categories so the data are clean for chi-squarepat2<-patients[,-1]
H0: there is no relationship between self-poisoning and gastro units, they are independent of each other HA: there is a relationship between self-poisoning and gastro units, they are not independent of each other.
#perform a chisq test -- write a hypothesis and then perform the test!pat2chi<-chisq.test(pat2)pat2chi
The p value is 0.7379, which is not less than 0.05, so we fail to reject H0.
Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!
H0: There is no linear relationship between waiting time and eruption time
HA: there is a linear relationship between waiting time and eruption time
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
The results of the linear regression indicate that there is a strong, positive linear relationship between waiting time and eruption time. The summary tells us that around 81% of the variation in eruption time can be explained by waiting time. We can conclude that this relationship is statistically significant with the p-value of 2.2x10^-16, which is less than 0.05. Here is a graph to visualize this relationship:
The following objects are masked from 'package:ggsci':
scale_color_material, scale_colour_material, scale_fill_material
check_model(lmfaithful)
I ran check_model, which checks 3/4 of the assumptions for linear regression. The first panel (top left) shows that the model and the data match pretty well, the blue and green lines mostly align, but not fully. This is not technically one of our assumptions but I’m adding it anyway. The top right panel checks for linearity, which looks pretty good here but is still a little wavy, so we are ok to assume that the relationship between waiting time and eruption is approximately linear. The next panel checks for our assumption of homogeneity of variance, which is also looking ok as the green reference line is horizontal and mostly flat, just a bit wavy. The check of our assumption of normality of residuals seems really good, the data all falls along the reference line. This assumptions check does not account for one of our assumptions, the assumption of independence. This is a limitation of this method of checking, and given the data that we have, we cannot know for sure if it is independent.
Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.
library(palmerpenguins)
The statistical test that I want to use here t-test.
H0: there is no difference between the average body mass (g) of Adelie and Gentoo penguins
HA: there is a difference in the average body mass (g) of Adelie and Gentoo penguins
Welch Two Sample t-test
data: penguins_G$body_mass_g and penguins_A$body_mass_g
t = 23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1259.525 1491.183
sample estimates:
mean of x mean of y
5076.016 3700.662
# A tibble: 2 × 5
species avg_bm sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 Adelie 3701. 459. 151 37.3
2 Gentoo 5076. 504. 123 45.5
ggplot(data=bodymass, aes(x=species, y=avg_bm, color=species))+geom_point()+geom_jitter(data=penguinsAG, aes(x=species,y=body_mass_g, color=species),alpha=0.3,size=0.3)+geom_errorbar(data=bodymass,aes(x=species,ymin=avg_bm-se,ymax=avg_bm+se, width=0.2))+theme_classic()+theme(axis.text=element_text(size=10))+labs(x='Species',y='Average Body Mass in Grams',title='Silly Penguins')+theme(plot.title=element_text(hjust=0.5))
From looking at the graph alone, we can already tell that there is a difference in The graph visually shows us that there is a difference in the average body mass between the two species of penguin, the means are very far away from each other, and there is no overlap of the error bars. This result is further supported by the t-test, which tells us that we have a p-value of 2.2x10^-16, which is less than 0.05. Based on these results, we can reject our null hypothesis that there was no difference between the two means.
Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R
Done (eye twitching)
Make your t-test graph from Essentials better – add the raw data in behind your means + error bars
I already did this in the Essentials :)
ggplot(data=bodymass, aes(x=species, y=avg_bm, color=species))+geom_point()+geom_jitter(data=penguinsAG, aes(x=species,y=body_mass_g, color=species),alpha=0.3,size=0.3)+geom_errorbar(data=bodymass,aes(x=species,ymin=avg_bm-se,ymax=avg_bm+se, width=0.2))+theme_classic()+theme(axis.text=element_text(size=10))+labs(x='Species',y='Average Body Mass in Grams',title='Silly Penguins')+theme(plot.title=element_text(hjust=0.5))
What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive of additive. Write a hypothesis, filter data as needed, run the test, make a graph, interpret the results.
H0: there is no effect of body mass on the bill length of Chinstrap penguins, no effect of sex on bill length, and no interactive effect of sex and body mass on bill length
HA: there is an interactive effect of body mass and sex on the bill length of Chinstrap penguins
Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = penguinsC)
Residuals:
Min 1Q Median 3Q Max
-4.6911 -1.1108 -0.2264 0.7923 10.9076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.982910 5.196611 6.924 2.53e-09 ***
body_mass_g 0.003003 0.001469 2.044 0.045 *
sexmale 11.056140 6.924654 1.597 0.115
body_mass_g:sexmale -0.001973 0.001870 -1.055 0.295
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared: 0.5036, Adjusted R-squared: 0.4803
F-statistic: 21.64 on 3 and 64 DF, p-value: 8.606e-10
anova(lmchin)
Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
body_mass_g 1 197.10 197.101 34.0126 1.959e-07 ***
sex 1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex 1 6.45 6.453 1.1136 0.2953
Residuals 64 370.88 5.795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the graph and the linear regression and anova, we fail to reject the null hypothesis. The p-value for the interaction between body mass and sex is 0.295, which is greater than 0.05, so we cannot determine that there is an interactive effect. This means there is no difference in the relationship between bill length and body mass based on sex. It is still interesting to note that sex and body mass have individual statistically significant effects on bill length. There is a positive effect of body mass on bill length, so bigger penguins have bigger bills. There is also an effect of sex, in this case male penguins have a longer bill length than females.
Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): observed, location_details, county, state, season, title, classif...
dbl (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Welch Two Sample t-test
data: bigfoot_OH$n and bigfoot_WA$n
t = -2.8504, df = 91.002, p-value = 0.0054
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.431579 -1.148963
sample estimates:
mean of x mean of y
5.315789 9.106061
According to the graph and t-test above, there is a statistically significant difference in the average bigfoot encounters in Washington and Ohio. The p value is 0.0054, which is less than 0.05. This result is corroborated by the graph, which shows a visible difference between the means and no overlap of the error bars.