Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(performance)library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
2. Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results
Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7779721
The (r=0.77797) positive and a (p value = 2.2e-16 < 0.05) suggest that there is a statistically significant relatively strong positive correlation between eruption time and waiting time.
3. DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)
#distribution of class of patients admitted to self poisoning (A) and gastro units (B)#socioclasssocioclass<-c("I","II","III","IV","V")selfpoison<-c(17,25,39,42,32)gastro<-c(22,46,73,91,57)patients<-data.frame(socioclass,selfpoison,gastro)head(patients)
socioclass selfpoison gastro
1 I 17 22
2 II 25 46
3 III 39 73
4 IV 42 91
5 V 32 57
\(H_0\):The two variables are independent. (There is no correlation between the number of patients from various socioclasses admitted to gastro and self poison units)
\(H_1\):The two variables relate to each other.(There is a correlation between the number of patients from various socioclasses admitted to gastro and self poison units)
X-squared = 1.9885, and p-value = 0.7379(statistically not significant). Therefore, we fail to reject the null hypothesis and conclude that the two variables are in fact not related to each other. That is; they are independent and that there’s is no difference in going to poison control or gastro units
4. Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!
p<-ggplot(data= faithful, mapping =aes(x = waiting, y= eruptions))+geom_point()+geom_smooth(method = lm, se =TRUE, fullrange=TRUE) +# Se - Don't add shaded confidence regionlabs(title ="Relationship between length of eruption and waiting time") +theme(plot.title =element_text(hjust =0.5))+theme_classic()p
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
check_model(faith_lm)
summary(faith_lm)
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
linearity is met ( strong positive relationship between eruption time and waiting time - plot p)
Normality is met as the points fall flat and horizontal on the reference line
Equal variance is not quite met (homogeneity of variance plot)
Outliers - from the Leverage plot, there does not see to be any high leverage or very influential outliers and so this condition is also met.
Independence: we do not have much information of how the participants in the study were selected, however we can assume independence.
5 Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.
Welch Two Sample t-test
data: pengs_adelie$body_mass_g and pengs_gentoo$body_mass_g
t = -23.254, df = 242.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1503.702 -1268.843
sample estimates:
mean of x mean of y
3706.164 5092.437
pvalue = 2.2e-16 and thus there is a significant difference between the body mass of Adelie and Gentoo
# A tibble: 2 × 5
species mean_bm sd_bm n se_bm
<fct> <dbl> <dbl> <int> <dbl>
1 Adelie 3706. 459. 146 38.0
2 Gentoo 5092. 501. 119 46.0
p1<-ggplot(mean_pengs, aes(x=species, y=mean_bm, color=species))+geom_point()+geom_errorbar(aes(x=species, ymin=mean_bm-se_bm, ymax=mean_bm+se_bm), width=0.2)+scale_color_aaas()+theme_classic()+labs(title='Body mass of penguins')+theme(plot.title =element_text(hjust =0.5))p1
\(H_0\):There is no difference in body mass between Adelie and Gentoo penguins
\(H_1\):There is a difference in body mass between Adelie and Gentoo penguins
The p-value = 2.2e-16 and so there is a significant difference in the body mass between Adelie and Gentoo penguins. The scatter plot shows that the average body mass of the Gentoo penguins is higher than Adelie Penguins.
1. Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R
2. Make your t-test graph from Essentials better – add the raw data in behind your means + error bars
p1<-ggplot(mean_pengs, aes(y=mean_bm, x=species, color = species))+geom_point()+labs(title='Body mass of penguins')+theme_classic()+theme(plot.title =element_text(hjust =0.5))p1+geom_errorbar(aes(x=species, ymin=mean_bm-se_bm, ymax=mean_bm+se_bm), width=0.2)+scale_color_aaas()+geom_jitter(data=pengs_1,aes(y = body_mass_g, x = species))
3 What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive or additive. Write a hypothesis, filter data as needed, run the test, make a graph, interpret the results.
\(H_0\): The body mass and sex do not affect the bill length of the Chinstrap penguins.
\(H_1\): The body mass and sex does have an affect on the bill length of the Chinstrap penguins.
Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
body_mass_g 1 197.10 197.101 34.0126 1.959e-07 ***
sex 1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex 1 6.45 6.453 1.1136 0.2953
Residuals 64 370.88 5.795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data= pengs_2, mapping =aes(x = body_mass_g, y= bill_length_mm, color = sex))+geom_point()+geom_smooth(method = lm, se =TRUE, fullrange=TRUE) +# Se - Don't add shaded confidence regionlabs(title ="Effect of body mass and sex on the bill length of Chinstrap Penguins") +theme(plot.title =element_text(hjust =0.5))+theme_classic()
`geom_smooth()` using formula = 'y ~ x'
The p values =1.959e-07 and 8.337e-07 show that both the body mass and sex respectively, have a significant effect on the bill length of the Chinstrap penguins. interactive effect of body mass and sex on bill length. However, the interactive effect of body mass and sex does not have significant effect on bill length of chinstrap(p value = 0.2953). This can further be seen on the plot, where we see a relatively strong correlation between the body mass and sex but no significant correlation in bill length and body mass within the sex categories (slopes are pretty much flat)
4 Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): observed, location_details, county, state, season, title, classif...
dbl (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Welch Two Sample t-test
data: sighting by state
t = -3.1298, df = 98.619, p-value = 0.002301
alternative hypothesis: true difference in means between group Ohio and group Washington is not equal to 0
95 percent confidence interval:
-5.292127 -1.185345
sample estimates:
mean in group Ohio mean in group Washington
5.053571 8.292308
big3 <- big2 %>%group_by(state) %>% dplyr::summarize(mean_sightings =mean(sighting),sd_sightings =sd(sighting), n =n(), se = sd_sightings/sqrt(n))tibble(big3)
# A tibble: 2 × 5
state mean_sightings sd_sightings n se
<chr> <dbl> <dbl> <int> <dbl>
1 Ohio 5.05 3.76 56 0.502
2 Washington 8.29 7.29 65 0.905
p5<-ggplot() +geom_point(data = big3, aes(x = state, y = mean_sightings, color = state)) +labs(x="State", y="Mean Annual Sightings", title ="Comparison of Mean annual Bigfoot Sightings in Washington and Ohio", color ="State") +theme(plot.title =element_text(vjust =0.5)) +theme_classic()p5+geom_errorbar(data = big3, aes(x = state, ymin = mean_sightings - se, ymax = mean_sightings + se, color = state)) +geom_jitter(data = big2, aes(x = state, y = sighting))
\(H_0\) that there is no difference in annual bigfoot sightings between Ohio and Washington
\(H_A\) that there is a difference in annual bigfoot sightings between Ohio and Washington
The p-value = 0.002301 < 0.05 from the t.test and so we can reject the null hypothesis that there is no annual difference in bigfoot sightings between Ohio and Washington. ORThis shows that there is a statistically significant difference in the annual sightings of bigfoot in Washington and Ohio. The model output for mean bigfoot sightings in Ohio and Washington 5.053571 and 8.292308 respectively.
The plot also shows that there is definitely difference in mean annual sightings of bigfoot between Ohio and Washington. In particular, Washington has higher mean annual bigfoot sightings than Ohio.