Tasha's workbook

Author

natasha

overview: lesson 1

Data Analysis: focusing on the basics. Covers the main aspects of dealing with data and the idea that less is more in statistics.

Research methods: covering the theoretical and philosophical aspects of doing science. Making sense of science and working on writing and reading skills.

test

Sample data of penguins

library(palmerpenguins)
data(package = 'palmerpenguins')
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

histograms

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

data("penguins")
# group_by() is not needed here: the color and fill aesthetics
# already split the histogram by species
penguins %>% 
  ggplot(aes(x = bill_length_mm, color = species, fill = species)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

boxplots

library(tidyverse)
library(palmerpenguins)

data("penguins")
# group_by() is not needed here: mapping species to x already
# draws one box per species
penguins %>% 
  ggplot(aes(x = species, 
             y = bill_length_mm, 
             color = species, 
             fill = species)) +
  geom_boxplot(alpha = 0.5) +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 16))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

species of penguins

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x=species,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

visualising correlations

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = bill_depth_mm))+
  geom_point()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

moments of centrality

mean, median and mode
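A minimal sketch of the three measures in R. The vector `x` is made up purely for illustration, and `stat_mode()` is a hypothetical helper: base R has no built-in function for the statistical mode (the built-in `mode()` reports the storage type of an object instead).

```r
# Made-up illustrative values, including one missing value
x <- c(2, 4, 4, 5, 7, 9, 4, NA)

# Hypothetical helper: returns the most frequent value(s) in x
stat_mode <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values
  values <- unique(x)                    # distinct values
  counts <- tabulate(match(x, values))   # how often each one occurs
  values[counts == max(counts)]          # keep the most frequent one(s)
}

mean(x, na.rm = TRUE)    # arithmetic mean, ignoring NA
median(x, na.rm = TRUE)  # middle value of the sorted data, ignoring NA
stat_mode(x)             # most frequent value
```

The `na.rm = TRUE` argument matters for data such as `penguins`, where some measurements are missing; without it `mean()` and `median()` return `NA`.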

moments of dispersion

  1. variance
  2. standard deviation
  3. standard error
  4. range and quantiles

checking via histograms

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as_tibble() %>% 
  ggplot(aes(value))+
  geom_histogram(color="#DD4A48", fill="#DD4A48")+
  geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)), 
             linetype="dashed")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

checking via boxplot

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as_tibble() %>% 
  ggplot(aes(value))+
  geom_boxplot(fill="#DD4A48",alpha=0.7)

types of variables

categorical:

  1. ordinal: categories that have a natural order
  2. nominal: categories that have no ranking order
  3. binary: nominal variables with only two categories

numerical:

  1. discrete: numeric values that can only take certain values (for example, counts)
  2. continuous: measured numeric values that can be any number within a particular range
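As a rough sketch, the same variable types can be represented in R as follows. The data frame `survey` and all of its values are made up for illustration only.

```r
# Made-up example data: one column per variable type
survey <- data.frame(
  # ordinal: ordered factor, the levels carry a ranking
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE),
  # nominal: plain factor, the levels have no ranking
  island = factor(c("Biscoe", "Dream", "Torgersen")),
  # binary: only two possible categories
  is_adult = c(TRUE, FALSE, TRUE),
  # discrete: whole-number counts stored as integers
  eggs = c(2L, 1L, 2L),
  # continuous: measured values stored as doubles
  mass_g = c(3750.5, 3800.0, 3250.2)
)

str(survey)                        # shows the type of every column
survey$size[1] < survey$size[2]    # comparisons work for ordered factors
```

Storing ordinal data as an ordered factor lets R sort and compare the categories in their intended order rather than alphabetically.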

Inductive vs deductive reasoning

They are opposite approaches to reasoning that differ in where they start and what they use to reach a conclusion.

Inductive: observation → pattern → hypothesis → theory

Deductive: theory → hypothesis → observation → confirmation

types of good and bad questions

Bad questions:

  1. Is there any difference between A and B?
  2. Is A bigger than B?
  3. Can X influence Y?

Good questions:

  1. What explains the differences between A and B?
  2. What makes A bigger than B?
  3. How can X influence Y?

diamonds

diamonds %>%                                 # utilizes the diamonds dataset
  group_by(color, clarity) %>%               # groups data by the color and clarity variables
  mutate(price200 = mean(price)) %>%         # creates a new variable (average price by group)
  ungroup() %>%                              # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%          # new variable, original price + $10
  select(cut, color, clarity,
         price, price200, random10) %>%      # retains only these listed columns
  arrange(color) %>%                         # orders the data by color
  group_by(cut) %>%                          # groups data by cut
  mutate(dis = n_distinct(price),            # counts the number of unique price values per cut
         rowID = row_number()) %>%           # numbers each row consecutively within each cut
  ungroup()                                  # final ungrouping of data

practice

library(tidyverse)
View(diamonds)
diamonds %>% arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
library(tidyverse)
View(diamonds)
diamonds %>% arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
diamonds %>% arrange(desc(cut))
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows
library(tidyverse)
View(diamonds)
diamonds %>% 
  mutate(price200 = price - 250)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43       76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31       76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31       77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63       84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75       85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48       86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47       86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53       87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49       87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39       88
# ℹ 53,930 more rows
library(tidyverse)
View(diamonds)
diamonds %>% select(1:7)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows
library(tidyverse)
View(diamonds)
diamonds %>% group_by(cut, clarity) %>% summarize(n = n())
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 40 × 3
# Groups:   cut [5]
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows

week 4 - andrews formative exercise - crickets

library(tidyverse)
library(modeldata)

Attaching package: 'modeldata'
The following object is masked _by_ '.GlobalEnv':

    penguins
The following object is masked from 'package:palmerpenguins':

    penguins
?ggplot
starting httpd help server ...
 done
?crickets
View(crickets)
##the data of crickets
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

##modifying properties of the plot
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# Learn more about the options for geom_point()
# with ?geom_point
# Adding another layer


ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") 
`geom_smooth()` using formula = 'y ~ x'

# Other plots

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

?theme_minimal()
# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

?facet_wrap

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

moments of dispersion

  1. variance
  2. standard deviation
  3. standard error
  4. range
  5. quantiles
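A short sketch of how each of these measures can be computed in base R, reusing the simulated `normal` sample from the histogram check above (100 draws from a standard normal distribution). The standard error is taken here to be the standard error of the mean.

```r
set.seed(999)
normal <- rnorm(100)   # same simulated sample as in the histogram check

variance    <- var(normal)                      # sample variance
std_dev     <- sd(normal)                       # standard deviation = sqrt(variance)
std_error   <- std_dev / sqrt(length(normal))   # standard error of the mean
value_range <- range(normal)                    # minimum and maximum
quartiles   <- quantile(normal, probs = c(0.25, 0.5, 0.75))  # quartiles

variance
std_dev
std_error
value_range
quartiles
```

The standard deviation describes the spread of the individual values, while the standard error describes the uncertainty of the sample mean and shrinks as the sample size grows.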

why do we need a hypothesis?

  1. it is a candidate explanation for a phenomenon
  2. it contains predictions and expectations
  3. it feeds back into theory
  4. it advances science

A good working hypothesis is everything: a wrong hypothesis will misguide you, whereas a good one keeps you excited and up to date. Writing one may need training.

A statement

  1. it is affirmative
  2. it is not a question
  3. must lead to expectations if confirmed
  4. self explanatory

types of hypotheses

scientific hypotheses

  1. candidate statements to explain an observed phenomenon
  2. meant to generate logical predictions
  3. working guidelines
  4. Null Hypothesis: This states that there is no effect or relationship. For instance, “There is no difference in plant growth between plants grown in sunlight and those grown in the dark.”
  5. Alternative Hypothesis: This proposes an effect or relationship. For example, “Plants grown in sunlight will grow taller than those grown in the dark.”

statistical hypotheses

Statistical hypotheses are specific statements about a population parameter or a process that can be tested using statistical methods.

  1. logical predictions
  2. confirmed by stats
  3. can be drawn in a graph
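As an illustration, the plant-growth hypotheses above can be turned into a statistical test. This is only a sketch: the heights below are simulated with `rnorm()` (the means, standard deviation and sample size are arbitrary assumptions), not real measurements.

```r
set.seed(1)

# Simulated plant heights in cm (assumed values, for illustration only)
sunlight <- rnorm(20, mean = 15, sd = 2)
dark     <- rnorm(20, mean = 12, sd = 2)

# Null hypothesis:        mean height in sunlight = mean height in the dark
# Alternative hypothesis: mean height in sunlight > mean height in the dark
result <- t.test(sunlight, dark, alternative = "greater")

result$statistic   # t statistic
result$p.value     # probability of a difference at least this large if the null were true

# The statistical hypothesis can also be drawn in a graph
boxplot(list(sunlight = sunlight, dark = dark),
        ylab = "Height (cm)")
```

A small p-value is evidence against the null hypothesis; it does not prove the scientific hypothesis that motivated the test.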

writing a hypothesis

You need to tell a story: do not use subheadings and never refer to statistical hypotheses. You can use the if/then method.

  1. Identify the variables: Determine your independent variable (the one you change) and dependent variable (the one you measure).

  2. Make a prediction: State what you expect to happen based on your understanding of the topic.

  3. Be specific: Include details that clarify your prediction.

A hypothesis typically follows an “If… then…” format. For example, “If increasing temperature increases the rate of a chemical reaction, then higher temperatures will lead to faster reactions.”

week 5

Frequency tests

  1. chi-square
  2. G-Tests
  3. Contingency tables
  4. log-linear models

powerful for testing associations between categorical variables.
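A minimal sketch of a frequency test, assuming the `diamonds` data from ggplot2 used in the practice section. It builds a contingency table of cut by clarity and runs a chi-square test of independence on it.

```r
library(ggplot2)   # provides the diamonds dataset

# Contingency table: counts of diamonds for every cut x clarity combination
cut_by_clarity <- table(diamonds$cut, diamonds$clarity)
cut_by_clarity

# Chi-square test of independence between the two categorical variables
freq_test <- chisq.test(cut_by_clarity)
freq_test

# Expected counts under independence, useful for checking the test's assumptions
round(freq_test$expected)
```

The degrees of freedom are (rows - 1) x (columns - 1), so a table with 5 cuts and 8 clarity grades gives 28.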

means tests

  1. T-Tests (two levels)
  2. ANOVAs (three or more levels)
  3. non-parametric equivalents
  4. nested and two way
  5. post-hoc tests

widely used for testing differences in means.
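A brief sketch of the means tests listed above, using the built-in `iris` data that also appears later in this workbook. The choice of sepal length and of the two species in the t-test is arbitrary.

```r
# T-test: compares the means of two levels
two_species <- droplevels(subset(iris, Species != "virginica"))
t_res <- t.test(Sepal.Length ~ Species, data = two_species)  # Welch t-test by default
t_res

# One-way ANOVA: compares the means of three or more levels
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_fit)

# Post-hoc test: which pairs of species differ from each other?
posthoc <- TukeyHSD(anova_fit)
posthoc
```

The ANOVA only says that at least one mean differs; the post-hoc test is what identifies the specific pairs.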

Correlations and models

  1. correlations
  2. many variations
  3. linear models
  4. many variations

highly predictive and powerful but depend on many conditions.
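A small sketch of a correlation and a linear model in base R, again using the built-in `iris` data. The pair of variables (petal length and petal width) is an arbitrary choice for illustration.

```r
# Pearson correlation between two continuous variables
cor_res <- cor.test(iris$Petal.Length, iris$Petal.Width)
cor_res$estimate   # correlation coefficient r
cor_res$p.value

# Linear model: predict petal width from petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(fit)       # slope, intercept, R-squared and p-values

# Diagnostic plots help check the conditions the model depends on
# (linearity, normal residuals, constant variance)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```

A correlation only measures the strength of the association, whereas the linear model also gives an equation that can be used for prediction.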

Logistic models

  1. logistic models
  2. predictive of odds
  3. similar in logic to frequency tests
  4. similar in calculations to linear models

highly predictive and powerful but can be complex to interpret
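A minimal sketch of a logistic model, assuming the built-in `mtcars` data (not otherwise used in this workbook) because it contains a binary outcome: `am`, where 0 is an automatic and 1 a manual gearbox. The predictor `wt` (car weight) is an arbitrary choice.

```r
# Logistic regression: model the odds of a manual gearbox from car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
odds_ratios

# Predicted probability of a manual gearbox for a car with wt = 3
prob <- predict(fit, newdata = data.frame(wt = 3), type = "response")
prob
```

The call looks almost the same as `lm()`, which reflects the similarity in calculations; the interpretation in terms of odds is where the extra complexity comes from.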

formative exercise

boxplots

Boxplots are useful for visualizing the distribution of a dataset, highlighting the median, quartiles, and potential outliers. They can be associated with several statistical tests and analyses, including:

  1. Kruskal-Wallis Test: A non-parametric test used to compare three or more independent groups. Boxplots can visually represent the distributions of these groups.

  2. Mann-Whitney U Test: Another non-parametric test that compares two independent groups. Boxplots can show the median and spread of the data for both groups.

  3. ANOVA: While boxplots are not directly associated with ANOVA, they can be used to visualize the distribution of data across multiple groups, helping to interpret ANOVA results.

  4. T-tests: Similar to ANOVA, boxplots can display the distributions of two groups being compared with a t-test.

  5. Outlier detection: Boxplots inherently display potential outliers, making them useful for visualizing and identifying outliers in the context of any statistical analysis

data(iris)

# Create a boxplot of Sepal.Length by Species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Boxplot of Sepal Length by Species",
        xlab = "Species",
        ylab = "Sepal Length",
        col = c("lightblue", "lightgreen", "lightpink"))

  1. Sepal.Length ~ Species: This formula indicates that you want to plot sepal lengths (the dependent variable) grouped by species (the independent variable).

  2. data = iris: Specifies that the data comes from the iris dataset.

  3. main, xlab, ylab, and col: Customize the title, axis labels, and colors of the boxes.
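The non-parametric tests mentioned above can be run on the same `iris` data as the boxplot. This is a sketch only; the pair of species used for the Mann-Whitney U test is an arbitrary choice.

```r
# Kruskal-Wallis test: compares three or more independent groups
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)
kw

# Mann-Whitney U test (called the Wilcoxon rank-sum test in R):
# compares two independent groups
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
mw <- wilcox.test(Sepal.Length ~ Species, data = two_species,
                  exact = FALSE)   # the data contain ties, so use the normal approximation
mw
```

Both tests work on ranks rather than the raw values, which is why they pair naturally with the medians and quartiles shown in a boxplot.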

linegraphs

  1. T-test: To compare means between two groups over time or conditions.

  2. ANOVA (Analysis of Variance): To compare means across multiple groups or time points.

  3. Chi-Square Test: When categorical data is involved, to see if there’s a significant association between variables over time.

  4. Mann-Kendall Trend Test: A non-parametric test for identifying trends in time series data.

data(iris)
library(ggplot2)
summary_data <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Create the line graph
ggplot(summary_data, aes(x = Species, y = Sepal.Length, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Sepal Length by Species",
       x = "Species",
       y = "Average Sepal Length") +
  theme_minimal()

  1. aggregate(): This function computes the mean sepal length for each species.

  2. ggplot2: A popular package for creating graphics in R.

  3. geom_line(): Adds the lines connecting the points.

  4. geom_point(): Adds points to represent the mean values.

  5. labs(): Adds labels for the title and axes.

  6. theme_minimal(): Applies a minimal theme to the plot.

scattergraph

  1. T-test: To compare means between two groups over time or conditions.

  2. ANOVA (Analysis of Variance): To compare means across multiple groups or time points.

  3. Regression Analysis: To assess relationships between variables, often fitting a line to the data to identify trends.

  4. Chi-Square Test: When categorical data is involved, to see if there’s a significant association between variables over time.

  5. Mann-Kendall Trend Test: A non-parametric test for identifying trends in time series data.

These tests help interpret the data represented in scatter graphs, offering insights into trends and relationships.

data(iris)


library(ggplot2)

# Create a scatter graph
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3) +  # Adjust size of points
  labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()

  1. aes(x = Sepal.Length, y = Sepal.Width, color = Species): This specifies that Sepal.Length will be on the x-axis, Sepal.Width on the y-axis, and points will be colored by species.

  2. geom_point(): This function creates the scatter plot. You can adjust the size of the points using the size argument.

  3. labs(): Adds titles and labels for the axes.

  4. theme_minimal(): Applies a clean, minimal theme for better aesthetics.

Barcharts

library(ggplot2)
iris_summary <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
ggplot(iris_summary, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Average Sepal Length by Species",
         x = "Species",
         y = "Average Sepal Length") +
    theme_minimal()

  1. Proportion Tests: To compare the proportions of different categories within groups.

  2. Kruskal-Wallis Test: A non-parametric alternative to ANOVA when the assumptions of normality are not met.

  3. Chi-Square Test: Used to determine if there is a significant association between two categorical variables.

  4. ANOVA (Analysis of Variance): When comparing means across multiple groups, ANOVA can help assess whether there are any statistically significant differences.

  5. T-test: If comparing the means of two groups, a t-test can determine if the differences are significant.

mosquitos

brainstorming

questions to possibly ask?

some cool questions

  1. What specific differences in wing span exist between male and female mosquitoes across various species?

  2. Does wing length affect the vulnerability of male and female mosquitoes to predators?

  3. How does nutrition during the larval stage affect wing length differently in male and female mosquitoes?

  4. How might variations in wing length between male and female mosquitoes affect their roles as disease vectors?

library(ggplot2)

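# Read the tab-separated mosquito data (columns used here: sex, wing)
mosquitos <- read.table("mosquitos.txt", header = TRUE, sep = "\t")

# Plot the mean wing length per sex; stat = "identity" would stack every
# individual measurement and show the sum of wing lengths instead
ggplot(mosquitos, aes(x = sex, y = wing)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Mean wing length by sex", x = "sex", y = "wing")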