── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(ggplot2)library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(corrplot)
corrplot 0.92 loaded
library(lubridate)
2. Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results
faithful<-faithful#eruptions = length of eruption time in minutes#waiting = time between eruptions in minutesggplot(data = faithful, aes(x = eruptions, y = waiting)) +geom_point() +geom_smooth(method = lm) +theme_bw()
Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7779721
Spearman’s Rho is 0.778, which indicates that there is a strong correlation between length of eruption time and the waiting time between eruptions. The p-value is less than 0.05, which means that the correlation is significant.
3. DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)
#distribution of class of patients admitted to self poisoning (A) and gastro units (B)#socioclasssocioclass<-c("I","II","III","IV","V")selfpoison<-c(17,25,39,42,32)gastro<-c(22,46,73,91,57)patients<-data.frame(socioclass,selfpoison,gastro)head(patients)
socioclass selfpoison gastro
1 I 17 22
2 II 25 46
3 III 39 73
4 IV 42 91
5 V 32 57
#remove the categories so the data are clean for chi-squarepat2<-patients[,-1]#perform a chisq test -- write a hypothesis and then perform the test!pat2_chi <-chisq.test(pat2)pat2_chi
Null hypothesis: The is no significant correlation between social class and which unit patients were admitted to, aka they are independent of each other. Alternative hypothesis: There is a significant correlation between social class and which unit patients were admitted to, aka they are dependent on each other. Results: The p-value is 0.7379, which is greater than 0.05. Therefore, we fail to reject the null hypothesis that there is no significant correlation between social class and which unit patients were admitted to.
4. Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!
#Perform the linear regressionlm_faithful <-lm(eruptions ~ waiting, data = faithful)#Report the summary of your regressionsummary(lm_faithful)
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
#Provide a graphggplot(data = faithful, aes(x = waiting, y = eruptions)) +geom_point() +geom_smooth(method = lm) +theme_bw()
`geom_smooth()` using formula = 'y ~ x'
#Test assumptionscheck_model(lm_faithful)
Null hypothesis: There is no relationship between the waiting time between eruptions and length of eruption time; Waiting time between eruptions has no effect on length of eruption time. Alternative hypothesis: There is a relationship between the waiting time between eruptions and length of eruption time; Waiting time has an effect on length of eruption time. Interpretation of results: Adjusted R-squared is 0.8108, meaning that there is a strong positive linear correlation between waiting time between eruptions and length of eruption time. The p-value is less than 0.05, meaning that the relationship between the variables is significant and we can reject our null hypothesis: Waiting time between eruptions has an effect on length of eruption time. Interpretation of assumptions test: I tested my assumptions using check model, and it shows that the data has linearity, homogeneity of variance, does not have influential observations, and has normal residuals. We do not know enough about the experimental design to know if there is independence of residual error terms, but will assume that there is independence in order to proceed with the analysis.
5 Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.
#Load datalibrary(palmerpenguins)penguins <- penguins# b) Filtering datapenguins_ade <- penguins %>%filter(species =="Adelie")penguins_gent <- penguins %>%filter(species =="Gentoo")penguins_ttest <- penguins %>%filter(species =="Adelie"| species =="Gentoo")# c) Run your testt.test(penguins_ade$body_mass_g, penguins_gent$body_mass_g)
Welch Two Sample t-test
data: penguins_ade$body_mass_g and penguins_gent$body_mass_g
t = -23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1491.183 -1259.525
sample estimates:
mean of x mean of y
3700.662 5076.016
# d) Graphpenguins_mean <- penguins %>%filter(species =="Adelie"| species =="Gentoo") %>%group_by(species) %>%drop_na() %>%summarise(mean =mean(body_mass_g), sd =sd(body_mass_g), n =n(), se = sd/sqrt(n))ggplot(data = penguins_mean, aes(x = species, y = mean)) +geom_point() +geom_errorbar(data = penguins_mean, aes(x = species, ymin = mean - se, max = mean +se)) +theme_bw() +labs(x ='Species', y ='Mean body mass (g)')
Null hypothesis: There is no relationship between body mass and whether a penguin is Adelie or Gentoo; penguin body mass does not vary with species. Alternative hypothesis: There is a relationship between body mass and whether a penguin is Adelie or Gentoo; penguin body mass varies significantly by species. a) Statistical test: I will be using a t-test because it is used to determine whether the means of two groups are different from each other. We are looking at whether the mean body mass is different between two groups (species) of penguins. Interpretation: The p-value is less than 0.05, therefore the difference in mean body mass between Adelie and Gentoo penguins is significant and we reject the null hypothesis. The graph shows that Gentoo penguins have a significantly higher body mass than Adelie penguins. ————————————————————————
1 Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R
2 Make your t-test graph from Essentials better – add the raw data in behind your means + error bars
ggplot(data = penguins_mean, aes(x = species, y = mean)) +geom_point(data = penguins_ttest, aes(x = species, y = body_mass_g, color = species)) +geom_point() +geom_errorbar(data = penguins_mean, aes(x = species, ymin = mean - se, max = mean +se)) +theme_bw() +labs(x ='Species', y ='Mean body mass (g)')
3 What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive or additive. Write a hypothesis, filter data as needed, run the test, make a graph, intepret the results.
# Filter datapenguins_chin <- penguins %>%filter(species =='Chinstrap')# Run linear regression testlm_chin <-lm(bill_length_mm ~ body_mass_g * sex, data = penguins_chin)summary(lm_chin)
Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = penguins_chin)
Residuals:
Min 1Q Median 3Q Max
-4.6911 -1.1108 -0.2264 0.7923 10.9076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.982910 5.196611 6.924 2.53e-09 ***
body_mass_g 0.003003 0.001469 2.044 0.045 *
sexmale 11.056140 6.924654 1.597 0.115
body_mass_g:sexmale -0.001973 0.001870 -1.055 0.295
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared: 0.5036, Adjusted R-squared: 0.4803
F-statistic: 21.64 on 3 and 64 DF, p-value: 8.606e-10
anova(lm_chin)
Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
body_mass_g 1 197.10 197.101 34.0126 1.959e-07 ***
sex 1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex 1 6.45 6.453 1.1136 0.2953
Residuals 64 370.88 5.795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Make a graphggplot(data = penguins_chin, aes(x = body_mass_g, y = bill_length_mm, color = sex)) +geom_point() +geom_smooth(method ='lm')
`geom_smooth()` using formula = 'y ~ x'
Null hypothesis: There is no relationship between body mass and bill length, and/or there is no effect of sex on bill length, and/or the relationship between body mass and bill length does not vary based on sex. Alternative hypothesis: The relationship between body mass and bill length in Chinstrap penguins varies based on sex. Interpretation: According to the ANOV table, the p-value for the relationship between bill length and body mass is less than 0.05. Therefore, body mass has a significant effect on bill length and we reject the null hypothesis. A higher body mass means a greater bill length. The p-value for the relationship between bill length and sex is also less than 0.05, so sex has a significant effect on bill length and we reject the null hypothesis. Bill length is greater in male penguins compared to female penguins. The p-value for the interactive effect of body mass and sex on bill length is greater than 0.05 (it is 0.2953). Therefore, there is not a significant effect and we fail to reject the null hypothesis.
4 Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): observed, location_details, county, state, season, title, classif...
dbl (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Welch Two Sample t-test
data: bigfoot_o$n and bigfoot_w$n
t = -3.1298, df = 98.619, p-value = 0.002301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.292127 -1.185345
sample estimates:
mean of x mean of y
5.053571 8.292308
# Calculate a mean across yearscolnames(bigfoot_f)[3] <-"count"bigfoot_mean <- bigfoot_f %>%group_by(state) %>%summarise(mean =mean(count), sd =sd(count), n =n(), se = sd/sqrt(n))ggplot(data = bigfoot_mean, aes(x = state, y = mean)) +geom_point() +geom_errorbar(data = bigfoot_mean, aes(x = state, ymin = mean - se, max = mean + se)) +theme_bw() +labs(x ="State", y ="Mean count")
Null hypothesis: Mean count of bigfoot sightings does not vary by state. Alternative hypothesis: Mean count of bigfoot sightings varies significantly by state. Interpretation: The p-value is 0.002, which is less than 0.05. The graph shows us that the mean sightings in Washington are higher than the mean sightings in Ohio. Therefore, mean bigfoot sightings by year are significantly higher in Washington compared to Ohio.