Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
library(car) #contains some statistical tests we need to assess assumptions
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Warning in cor.test.default(faithful1$eruptions, faithful1$waiting, method =
"spearman"): Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: faithful1$eruptions and faithful1$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7779721
#eruptions = length of eruption time in minutes#waiting = time between eruptions in minutes
H0:The correlation between these two variables is 0
Ha:The correlation between these two variables is not 0
Since the correlation coefficient is 0.778, there is a strong positive relationship between the two values Additionally, the small p-value(p<0.001) suggests that the correlation is statistically significant.
#distribution of class of patients admitted to self poisoning (A) and gastro units (B)#socioclasssocioclass<-c("I","II","III","IV","V")selfpoison<-c(17,25,39,42,32)gastro<-c(22,46,73,91,57)patients<-data.frame(socioclass,selfpoison,gastro)head(patients)
socioclass selfpoison gastro
1 I 17 22
2 II 25 46
3 III 39 73
4 IV 42 91
5 V 32 57
#perform a chisq test -- write a hypothesis and then perform the test!#H0: There is no association between socioclass, and the selfpoisoning and gastro#Ha: There is an association between socioclass, and the selfpoisoning and gastropatientschi<-chisq.test(pat2)patientschi
#Since p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis that there is no association between socioclass, and the selfpoisoning and gastro.
#Simple Linear Regressionggplot(faithful1, aes(x=eruptions, y=waiting))+geom_point()+geom_smooth(method='lm')+theme_classic()
`geom_smooth()` using formula = 'y ~ x'
#Summary of the regressionlm_faith<-lm(waiting~eruptions, data=faithful1)summary(lm_faith)
Call:
lm(formula = waiting ~ eruptions, data = faithful1)
Residuals:
Min 1Q Median 3Q Max
-12.0796 -4.4831 0.2122 3.9246 15.9719
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.4744 1.1549 28.98 <2e-16 ***
eruptions 10.7296 0.3148 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.914 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
#Check_modelcheck_model(lm_faith)
eruptions = length of eruption time in minutes waiting = time between eruptions in minutes
Hypothesis
H0: There is no correlation between eruptions and waiting Ha: There is a correlation between eruptions and waiting
From the summary, the intercept is 33.4744 and slope is 10.7296. This means that, on average, as the length of eruption time in minutes increases by 1, the time between eruptions in minutes increases by 10.7296. Both coefficients are statistically significant (p<0.001), indicating a strong linear relation ship between the waiting time and the eruption duration. The adusted R-squared value of 0.8108 suggests that approximately 81% of the data in the data can be explained by this model.Therefore, we have enough evidence to reject our null hypothesis that there is no correlation between eruptions and waiting.
*Check Model 1. Linearity of the data: Although the linearity graph shows that it is not completely flat and horizontal, the linearity is good enough.
Normality of residuals: From the graph we can see that the dots are following the line, our assumption is good.
Homogeneity of residual variance: Although the line is not completely flat and horizontal, the Homogeneity of residual variance is good enough for liner regressions.
Independence of residual error terms: There is no way to check this without understanding how the data collected.
library(palmerpenguins)head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
penguins1<-penguins#a) Tell me what statistical test you are using#T-test#b) filter the data appropriatelypen_body<-penguins1 %>%select(species,body_mass_g) %>%filter(species=="Adelie"|species=="Gentoo") %>%drop_na(body_mass_g)pen_body
#c) run your test#Calculate means and error and plot!qpen_body1<-pen_body %>%group_by(species) %>% dplyr::summarize(mean=mean(body_mass_g),sd=sd(body_mass_g),n=n(),se=sd/sqrt(n))pen_body1
# A tibble: 2 × 5
species mean sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 Adelie 3701. 459. 151 37.3
2 Gentoo 5076. 504. 123 45.5
#dplyr <-to make r to use tydyverse package when you get error "Error in summarize(., mean = mean(body_mass_g), sd = sd(body_mass_g), : argument "by" is missing, with no default"#Make a graphggplot(pen_body1,aes(x=species,y=mean, color=species))+geom_point()+geom_errorbar(aes(x=species,ymin=mean-se, ymax=mean+se),width=0.2)+scale_color_aaas()+theme_classic()+labs(title="Body Mass")
#test "Does the body mass differ between species?"#H0: The mean body mass is the same between species#Ha: The mean body mass differ between speciest.test(data=pen_body, body_mass_g~species, alternative='two.sided', var.equal=FALSE)
Welch Two Sample t-test
data: body_mass_g by species
t = -23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0
95 percent confidence interval:
-1491.183 -1259.525
sample estimates:
mean in group Adelie mean in group Gentoo
3700.662 5076.016
#Since the p-value is < 2.2e-16, which is very small and indicates strong evidence against the null hypothesis that the mean body mass is the same between the two groups.
ggplot(pen_body1, aes(x = species, y = mean, color = species)) +geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width =0.2) +geom_point() +geom_point(data = pen_body, aes(x = species, y = body_mass_g), position =position_jitterdodge(dodge.width =0.75), alpha =0.3) +scale_color_aaas() +theme_classic() +labs(title ="Body Mass", y ="Mean Body Mass (g)", x =NULL)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
# A tibble: 68 × 4
species body_mass_g sex bill_length_mm
<fct> <int> <fct> <dbl>
1 Chinstrap 3500 female 46.5
2 Chinstrap 3900 male 50
3 Chinstrap 3650 male 51.3
4 Chinstrap 3525 female 45.4
5 Chinstrap 3725 male 52.7
6 Chinstrap 3950 female 45.2
7 Chinstrap 3250 female 46.1
8 Chinstrap 3750 male 51.3
9 Chinstrap 4150 female 46
10 Chinstrap 3700 male 51.3
# … with 58 more rows
ggplot(pen_chin,aes(x=body_mass_g,y=bill_length_mm, color=sex))+geom_point()+scale_color_aaas()+theme_classic()+labs(title="Body Mass vs Bill Length")+geom_smooth(method="lm")
Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = pen_chin)
Residuals:
Min 1Q Median 3Q Max
-4.6911 -1.1108 -0.2264 0.7923 10.9076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.982910 5.196611 6.924 2.53e-09 ***
body_mass_g 0.003003 0.001469 2.044 0.045 *
sexmale 11.056140 6.924654 1.597 0.115
body_mass_g:sexmale -0.001973 0.001870 -1.055 0.295
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared: 0.5036, Adjusted R-squared: 0.4803
F-statistic: 21.64 on 3 and 64 DF, p-value: 8.606e-10
anova(lm1)
Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
body_mass_g 1 197.10 197.101 34.0126 1.959e-07 ***
sex 1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex 1 6.45 6.453 1.1136 0.2953
Residuals 64 370.88 5.795
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#There is a significant correlation between body mass and bill length (p<0.001), sex and bill length(p<0.001). From the graph we can see that the as body mass increases, the bill length also increases. For sex and bill length, males have larger bill length than females. However, the pvalue for bodymass:sex is 0.2953 (>0.05) which suggests that sex doesn't change the correlation between body mass and bill length.
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): observed, location_details, county, state, season, title, classif...
dbl (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigfoot
# A tibble: 5,021 × 28
observed locat…¹ county state season title latit…² longi…³ date number
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 "I was c… <NA> Winst… Alab… Summer <NA> NA NA NA 30680
2 "Ed L. w… "East … Valde… Alas… Fall <NA> NA NA NA 1261
3 "While a… "Great… Washi… Rhod… Fall Repo… 41.4 -71.5 1974-09-20 6496
4 "Hello, … "I wou… York … Penn… Summer <NA> NA NA NA 8000
5 "It was … "Loggi… Yamhi… Oreg… Spring <NA> NA NA NA 703
6 "My two … "The c… Washi… Okla… Fall Repo… 35.3 -99.2 1973-09-28 9765
7 "I was s… "Vince… Washi… Ohio Summer Repo… 39.4 -81.7 1971-08-01 4983
8 "Well la… "Both … Westc… New … Fall Repo… 41.3 -73.7 2010-09-01 31940
9 "I grew … "The W… Washo… Neva… Fall Repo… 39.6 -120. 1970-09-01 5692
10 "heh i k… "the r… Warre… New … Fall <NA> NA NA NA 438
# … with 5,011 more rows, 18 more variables: classification <chr>,
# geohash <chr>, temperature_high <dbl>, temperature_mid <dbl>,
# temperature_low <dbl>, dew_point <dbl>, humidity <dbl>, cloud_cover <dbl>,
# moon_phase <dbl>, precip_intensity <dbl>, precip_probability <dbl>,
# precip_type <chr>, pressure <dbl>, summary <chr>, uv_index <dbl>,
# visibility <dbl>, wind_bearing <dbl>, wind_speed <dbl>, and abbreviated
# variable names ¹location_details, ²latitude, ³longitude
#filter the data setbigfoot_ow<-bigfoot %>%filter(state=="Ohio"|state=="Washington") %>%mutate(year=year(date))%>%group_by(state,year)%>%count()%>%drop_na() bigfoot_ow
# A tibble: 121 × 3
# Groups: state, year [121]
state year n
<chr> <dbl> <int>
1 Ohio 1956 1
2 Ohio 1958 2
3 Ohio 1962 1
4 Ohio 1963 1
5 Ohio 1967 1
6 Ohio 1968 1
7 Ohio 1970 1
8 Ohio 1971 2
9 Ohio 1972 2
10 Ohio 1973 1
# … with 111 more rows
#Null Hypothesis: There is no difference in the number of annual bigfoot sightings between Ohio and Washington#Alternate Hypothesis: There is difference in the number of annual bigfoot sightings between Ohio and Washingtonbigfoot_ow_test<-bigfoot_ow%>%group_by(state) %>% dplyr::summarize(mean=mean(n), sd=sd(n),n=n(),se=sd/sqrt(n))bigfoot_ow_test
# A tibble: 2 × 5
state mean sd n se
<chr> <dbl> <dbl> <int> <dbl>
1 Ohio 5.05 3.76 56 0.502
2 Washington 8.29 7.29 65 0.905
#Graphggplot(data=bigfoot_ow_test,aes(x=state,y=mean))+geom_point()+geom_errorbar(data=bigfoot_ow_test,aes(x=state,ymin=mean-se,ymax=mean+se))#The graph suggests the significant difference in the number of bigfoot sightings between Ohio and Washington, Washington has higher number of bigfoot sightings than Ohio.
#T-testt.test(data=bigfoot_ow, n~state, alternative='two.sided', var.equal=FALSE) #p-value<0.01 shows that there is a significant difference in the number of bigfoot sightings between Ohio and Washington. Graph also supports this.
Welch Two Sample t-test
data: n by state
t = -3.1298, df = 98.619, p-value = 0.002301
alternative hypothesis: true difference in means between group Ohio and group Washington is not equal to 0
95 percent confidence interval:
-5.292127 -1.185345
sample estimates:
mean in group Ohio mean in group Washington
5.053571 8.292308