overview 1 lesson

Data Analysis: focusing on the basics.Covering aspects dealing with data and less is MORE in statistics

Research methods: covering the theoretical and philosophical aspects of doing science. Making sense of science and working on writing and reading skills.

test

Sample data of penguins

library(palmerpenguins)
data(package = 'palmerpenguins')
head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

bloxplot

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(palmerpenguins)

data("penguins")
penguins %>% 
group_by(species) %>% 
  ggplot(aes(x=species, 
             y=bill_length_mm, 
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

speices of peguins

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x=species,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Visualising correlations

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = bill_depth_mm))+
  geom_point()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

moments of centrality mean. medium and mode
moments of dispersion variance standard deviation standard error range and quarantines

checking via histograms

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as.tibble() %>% 
  ggplot(aes(value))+
  geom_histogram(color="#DD4A48", fill="#DD4A48")+
  geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)), 
             linetype="dashed")

Warning: `as.tibble()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` instead.
ℹ The signature and semantics have changed, see `?as_tibble`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

checking via boxplot

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as.tibble() %>% 
  ggplot(aes(value))+
  geom_boxplot(fill="#DD4A48",alpha=0.7)

- diamonds diamonds%>% (i) #utilizes the diamonds dataset group_by(color,clarity)%>% #groups data by the color and clarity variables.

mutate(price200=mean(price))%>% #creates new variables (average price by groups) ungroup()%>% #data no longer grouped by color and clarity mutate(random=10+price)%>%
nw variable,original price+$10 select(cut,color,clarity,price,price200,random10)%>% #retain only these listed columns.
arrange(color)%>% #visualize data ordered by color. group_by(cut)%>% #group data by cut mutate(dis=n_distinct(price)
counts the number of unique price values per cut. rowID=row_number())%>%
numbers each row consecutively for each cut ungroup() #final ungrouping of data.

library(tidyverse)
View(diamonds)
diamonds %>% arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

library(tidyverse)
View(diamonds)
diamonds %>% arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

library(tidyverse)
View(diamonds)
diamonds %>% arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% arrange(desc(cut))

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows

library(tidyverse)
View(diamonds)
diamonds %>% 
  mutate(price200 = price - 250)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43       76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31       76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31       77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63       84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75       85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48       86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47       86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53       87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49       87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39       88
# ℹ 53,930 more rows

library(tidyverse)
View(diamonds)
diamonds %>% select(1:7)

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

library(tidyverse)
View(diamonds)
diamonds %>% group_by(cut, clarity) %>% summarize(n = n())

`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

# A tibble: 40 × 3
# Groups:   cut [5]
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows

library(tidyverse)
library(modeldata)


Attaching package: 'modeldata'

The following object is masked _by_ '.GlobalEnv':

    penguins

The following object is masked from 'package:palmerpenguins':

    penguins

##the data of crickets
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

##modifying properties of the plot
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# Adding another layer


ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

`geom_smooth()` using formula = 'y ~ x'

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

# Other plots

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

?facet_wrap

starting httpd help server ... done

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

week 5 excirsie

data(iris)

# Create a boxplot of Sepal.Length by Species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Boxplot of Sepal Length by Species",
        xlab = "Species",
        ylab = "Sepal Length",
        col = c("lightblue", "lightgreen", "lightpink"))

data(iris)
library(ggplot2)
summary_data <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Create the line graph
ggplot(summary_data, aes(x = Species, y = Sepal.Length, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Sepal Length by Species",
       x = "Species",
       y = "Average Sepal Length") +
  theme_minimal()

data(iris)


library(ggplot2)

# Create a scatter graph
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3) +  # Adjust size of points
  labs(title = "Scatter Plot of Sepal Length vs. Sepal Width",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()

library(ggplot2)
iris_summary <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
ggplot(iris_summary, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(title = "Average Sepal Length by Species",
         x = "Species",
         y = "Average Sepal Length") +
    theme_minimal()

# Types of variables

Categorical

Categorical data refers to data that can be divided into distinct categories or groups. These categories may represent qualitative characteristics or attributes rather than numerical values.

types of categorical

Nominal Data: Definition: Categories without a specific order. Examples: Gender (male, female), eye color (blue, green, brown), types of cuisine (Italian, Mexican, Chinese).
Ordinal Data: Definition: Categories with a meaningful order, but the intervals between categories are not uniform. Examples: Satisfaction ratings (very satisfied, satisfied, neutral, dissatisfied, very dissatisfied), education levels (high school, bachelor’s, master’s, doctorate).

Numerical

Numerical data, also known as quantitative data, consists of values that represent quantities and can be measured or counted. This type of data can be further divided into two main categories: discrete and continuous.

Types of Numerical Data

Discrete Data: Definition: Consists of distinct, separate values that can be counted. Discrete data typically represents counts of items or occurrences. Examples: Number of students in a classroom, number of cars in a parking lot, or the number of goals scored in a game.
Continuous Data: Definition: Can take any value within a given range and is typically measured rather than counted. Continuous data can be subdivided infinitely. Examples: Height, weight, temperature, or time.

The four families of statistical tests

Parametric Tests These tests assume that the data follows a specific distribution (typically a normal distribution). They are generally more powerful than non-parametric tests when their assumptions are met. Examples: t-tests (e.g., one-sample, independent, paired) ANOVA (Analysis of Variance)
Non-Parametric Tests These tests do not assume a specific distribution and are often used when the data do not meet the assumptions required for parametric tests. They are useful for ordinal data or when sample sizes are small. Examples: Mann-Whitney U Test (for comparing two independent groups) Wilcoxon Signed-Rank Test (for comparing two related groups) Kruskal-Wallis Test (for comparing more than two groups) Chi-Square Test (for categorical data)
Bayesian Tests These tests incorporate prior beliefs or information into the analysis, allowing for a more flexible approach to inference. Bayesian methods provide a framework for updating beliefs in light of new evidence. Examples: Bayesian t-test Bayesian ANOVA Bayesian Regression
Resampling Methods These methods involve repeatedly drawing samples from the data (with or without replacement) to estimate the sampling distribution of a statistic. They are particularly useful for hypothesis testing and confidence interval estimation. Examples: Bootstrap (to estimate the distribution of a statistic) Permutation Tests (to assess the significance of a test statistic)

Frequency tests

Chi-square G-tests
Contingecy tables
Log-linear models Powerful for testing associations between categorical variables

Mean tests

T-tests (two levels)
Anovas (3+ levels)
Non-parametric equivalents Nested and two-way…
Post-hoc tests (Tukey HSD, Student, etc.)

Correlations and models

Correlations – many variations
Linear models – many variations

Logistic models

Logistic models
Predictive of odds
Similar inlogic to frequency tests
Similar in calculation to linear models

Moments of dispersion

Variance Standard
deviation Standard
Error Range
Quantiles

# formative exercise 6

what is one-proportion Z-test?

The One proportion Z-test is used to compare an observed proportion to a theoretical one, when there are only two categories. This article describes the basics of one-proportion z-test and provides practical examples using R software.

For example, we have a population of mice containing half male and have female (p = 0.5 = 50%). Some of these mice (n = 160) have developed a spontaneous cancer, including 95 male and 65 female.

Typical research questions are:

whether the observed proportion of male (po ) is equal to the expected proportion (pe )?
whether the observed proportion of male (po ) is less than the expected proportion (pe )?
whether the observed proportion of male (p ) is greater than the expected proportion (pe )?

Formula of the test statistic

The test statistic (also known as z-test) can be calculated as follow:

z=p^−popo(1−po)n√z=p^−popo(1−po)n

zz = Test statistics
nn = Sample size
popo = Null hypothesized value
p^p^ = Observed proportion

R functions: binom.test() & prop.test()

The R functions binom.test() and prop.test() can be used to perform one-proportion test:

binom.test(): compute exact binomial test. Recommended when sample size is small
prop.test(): can be used when sample size is large ( N > 30). It uses a normal approximation to binomial

The syntax of the two functions are exactly the same. The simplified format is as follow:

binom.test(x, n, p = 0.5, alternative = "two.sided") prop.test(x, n, p = NULL, alternative = "two.sided",           correct = TRUE)

res <- prop.test(x = 95, n = 160, p = 0.5, 
                 correct = FALSE)
# Printing the results
res


    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5163169 0.6667870
sample estimates:
      p 
0.59375

The function returns:

the value of Pearson’s chi-squared test statistic.
a p-value
a 95% confidence intervals
an estimated probability of success (the proportion of male with cancer)

Interpretation of the result

The p-value of the test is 0.01771, which is less than the significance level alpha = 0.05. We can conclude that the proportion of male with cancer is significantly different from 0.5 with a p-value = 0.01771.

# printing the p-value
res$p.value

[1] 0.01770607

# printing the mean
res$estimate

      p 
0.59375

# printing the confidence interval
res$conf.int

[1] 0.5163169 0.6667870
attr(,"conf.level")
[1] 0.95

What is two-proportions z-test?

The two-proportions z-test is used to compare two observed proportions. This article describes the basics of two-proportions *z-test and provides pratical examples using R sfoftware**.

We want to know, whether the proportions of smokers are the same in the two groups of individuals?

For example, we have two groups of individuals:

Group A with lung cancer: n = 500
Group B, healthy individuals: n = 500

The number of smokers in each group is as follow:

Group A with lung cancer: n = 500, 490 smokers, pA=490/500=98pA=490/500=98
Group B, healthy individuals: n = 500, 400 smokers, pB=400/500=80pB=400/500=80

In this setting:

The overall proportion of smokers is p=frac(490+400)500+500=89p=frac(490+400)500+500=89
The overall proportion of non-smokers is q=1−p=11

Typical research questions are:

whether the observed proportion of smokers in group A (pApA) is equal to the observed proportion of smokers in group (pBpB)?
whether the observed proportion of smokers in group A (pApA) is less than the observed proportion of smokers in group (pBpB)?
whether the observed proportion of smokers in group A (pApA) is greater than the observed proportion of smokers in group (pBpB)?

he test statistic (also known as z-test) can be calculated as follow:

z=pA−pBpq/nA+pq/nB−√

pApA is the proportion observed in group A with size nAnA
pBpB is the proportion observed in group B with size nBnB
pp and qq are the overall proportions

if |z|<1.96|z|<1.96, then the difference is not significant at 5%
if |z|≥1.96|z|≥1.96, then the difference is significant at 5%
The significance level (p-value) corresponding to the z-statistic can be read in the z-table. We’ll see how to compute it in R.

The R functions prop.test() can be used as follow:

prop.test(x, n, p = NULL, alternative = "two.sided",           correct = TRUE)

x: a vector of counts of successes
n: a vector of count trials
alternative: a character string specifying the alternative hypothesis
correct: a logical indicating whether Yates’ continuity correction should be applied where possible

res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res


    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80

The function returns:

the value of Pearson’s chi-squared test statistic.
a p-value
a 95% confidence intervals

if you want to test whether the observed proportion of smokers in group A (pApA) is less than the observed proportion of smokers in group (pBpB), type this:

prop.test(x = c(490, 400), n = c(500, 500),            alternative = "less")

an estimated probability of success (the proportion of smokers in the two groups)

Or, if you want to test whether the observed proportion of smokers in group A (pApA) is greater than the observed proportion of smokers in group (pBpB), type this:

prop.test(x = c(490, 400), n = c(500, 500),               alternative = "greater")

Interpretation of the result

The p-value of the test is 2.36310^{-19}, which is less than the significance level alpha = 0.05. We can conclude that the proportion of smokers is significantly different in the two groups with a p-value = 2.36310^{-19}.

What is chi-square goodness of fit test?

The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, in a situation where we have two or more categories in a discrete data. In other words, it compares multiple observed proportions to expected probabilities.

R function: chisq.test()

The R function chisq.test() can be used as follow:

chisq.test(x, p)

x: a numeric vector
p: a vector of probabilities of the same length of x.

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res


    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07

The function returns: the value of chi-square test statistic (“X-squared”) and a a p-value.

The p-value of the test is 8.80310^{-7}, which is less than the significance level alpha = 0.05. We can conclude that the colors are significantly not commonly distributed with a p-value = 8.80310^{-7}.

# Access to the expected values
res$expected

[1] 52.66667 52.66667 52.66667

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res


    Chi-squared test for given probabilities

data:  tulip
X-squared = 0.20253, df = 2, p-value = 0.9037

The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.

The result of chisq.test() function is a list containing the following components:

statistic: the value the chi-squared test statistic.
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count

Chi-Square Test of Independence in R

The chi-square test of independence is used to analyze the frequency table (i.e. contengency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables. This article describes the basics of chi-square test and provides practical examples using R software.

# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# head(housetasks)

The data is a contingency table containing 13 housetasks and their distribution in the couple:

rows are the different tasks
values are the frequencies of the tasks done :
by the wife only
alternatively
by the husband only
or jointly

Graphical display of contengency tables

Contingency table can be visualized using the function balloonplot() [in gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

library("gplots")


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package garphics:

library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

The argument shade is used to color the graph
The argument las = 2 produces vertical labels

Blue color indicates that the observed value is higher than the expected value if the data were random
Red color specifies that the observed value is lower than the expected value if the data were random

From this mosaic plot, it can be seen that the housetasks Laundry, Main_meal, Dinner and breakfeast (blue color) are mainly done by the wife in our example.

Chi-square test basics

Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.

Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent

For each cell of the table, we have to calculate the expected value under null hypothesis.

For a given cell, the expected value is calculated as follow:

e=row.sum∗col.sumgrand.totale=row.sum∗col.sumgrand.total

The Chi-square statistic is calculated as follow:

χ2=∑(o−e)2eχ2=∑(o−e)2e

o is the observed value
e is the expected value

This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with df=(r−1)(c−1)df=(r−1)(c−1) degrees of freedom and p = 0.05.
- r is the number of rows in the contingency table
- c is the number of column in the contingency table
If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.

Compute chi-square test in R

Chi-square statistic can be easily computed using the function chisq.test() as follow:

chisq <- chisq.test(housetasks)
chisq


    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16

# Observed counts
chisq$observed

           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

# Expected counts
round(chisq$expected,2)

            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70

If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:

r=o−ee√

The contribution (in %) of a given cell to the total Chi-square score is calculated as follow:

contrib=r2χ2

Access to the values returned by chisq.test() function

The result of chisq.test() function is a list containing the following components:

statistic: the value the chi-squared test statistic.
parameter: the degrees of freedom
p.value: the p-value of the test
observed: the observed count
expected: the expected count

The format of the R code to use for getting these values is as follow:
```
# printing the p-value chisq$p.value # printing the mean chisq$estimate
```

the examples

Data format: Contingency tables

# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# head(housetasks)

Graphical display of contengency tables

Contingency table can be visualized using the function balloonplot() [in gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.

library("gplots")
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr) #for printing html-friendly tables.

mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	cyl	n
2seater	8	5
compact	4	32
compact	5	2
compact	6	13
midsize	4	16
midsize	6	23
midsize	8	2
minivan	4	1
minivan	6	10
pickup	4	3
pickup	6	10
pickup	8	20
subcompact	4	21
subcompact	5	2
subcompact	6	7
subcompact	8	5
suv	4	8
suv	6	16
suv	8	38

mpg%>%
  group_by(class, cyl)%>%
  summarise(n=n())%>%
  spread(cyl, n)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	5
compact	32	2	13	NA
midsize	16	NA	23	2
minivan	1	NA	10	NA
pickup	3	NA	10	20
subcompact	21	2	7	5
suv	8	NA	16	38

One advantage of dplyr is that we can determine what kind of summary statistic we want to see very easily by adjusting our summarize() input.

Here instead of displaying frequencies, we can get the average number of city miles by class & cyl

mpg%>%
  group_by(class, cyl)%>%
  summarise(mean_cty=mean(cty))%>%
  spread(cyl, mean_cty)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	15.40000
compact	21.37500	21	16.92308	NA
midsize	20.50000	NA	17.78261	16.00000
minivan	18.00000	NA	15.60000	NA
pickup	16.00000	NA	14.50000	11.80000
subcompact	22.85714	20	17.00000	14.80000
suv	18.00000	NA	14.50000	12.13158

mpg%>%
  group_by(class, cyl)%>%
  summarise(max_cty=max(cty))%>%
  spread(cyl, max_cty)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	16
compact	33	21	18	NA
midsize	23	NA	19	16
minivan	18	NA	17	NA
pickup	17	NA	16	14
subcompact	35	20	18	15
suv	20	NA	17	14

We can find proportions by creating a new, calculated variable dividing row frequency by table frequency.

mpg%>%
  group_by(class)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%   # our new proportion variable
  kable()

class	n	prop
2seater	5	0.0213675
compact	47	0.2008547
midsize	41	0.1752137
minivan	11	0.0470085
pickup	33	0.1410256
subcompact	35	0.1495726
suv	62	0.2649573

We can create a contingency table of proportion values by applying the same spread command as before. Vary the group_by() and spread() arguents to produce proportions of different variables.

mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%
  subset(select=c("class","cyl","prop"))%>%   #drop the frequency value
  spread(class, prop)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

cyl	2seater	compact	midsize	minivan	pickup	subcompact	suv
4	NA	0.6808511	0.3902439	0.0909091	0.0909091	0.6000000	0.1290323
5	NA	0.0425532	NA	NA	NA	0.0571429	NA
6	NA	0.2765957	0.5609756	0.9090909	0.3030303	0.2000000	0.2580645
8	1	NA	0.0487805	NA	0.6060606	0.1428571	0.6129032

table() is a quick way to pull together row/column frequencies and proportions for categorical variables

Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.

table(mpg$class, mpg$cyl)

            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38

The table frequency can also be called by using the ftable() command.

mpg_table<- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
ftable(mpg_table)

             4  5  6  8
                       
2seater      0  0  0  5
compact     32  2 13  0
midsize     16  0 23  2
minivan      1  0 10  0
pickup       3  0 10 20
subcompact  21  2  7  5
suv          8  0 16 38

For row frequencies, we use the margin.table() command, with the 1 argument.

margin.table(mpg_table, 1)


   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62

For column frequencies, we use the margin.table() command, with the 2 argument

margin.table(mpg_table, 2)


 4  5  6  8 
81  4 79 70

We can get the proportion values for our variable combinations as well.

For proportion of the entire table, we use the prop.table() command.

prop.table(mpg_table)     #proportion of entire table

            
                       4           5           6           8
  2seater    0.000000000 0.000000000 0.000000000 0.021367521
  compact    0.136752137 0.008547009 0.055555556 0.000000000
  midsize    0.068376068 0.000000000 0.098290598 0.008547009
  minivan    0.004273504 0.000000000 0.042735043 0.000000000
  pickup     0.012820513 0.000000000 0.042735043 0.085470085
  subcompact 0.089743590 0.008547009 0.029914530 0.021367521
  suv        0.034188034 0.000000000 0.068376068 0.162393162

Assumptions

Continuous
Normally Distributed
Random Sample
Enough Data

assessment title; title page- word count 300

Biodiversity and Conservation

Title
Affiliation
Abstract of 150 to 250 words which should be divided into the following sections: Purpose, Methods, Results, Conclusion.

4 to 6 keywords which can be used for indexing purposes.

Title: The link between our farm lands and giant tortoises; different characteristics of farmlands influencing tortoises behavior using visual observations.

Authors and Affiliation : Kyana N.Pike, Stephen Blake, Lain J.Gorden and Lin Schwardkopf

Abstract: Land sharing between nature and humans can be shown to be effective, however, this isn’t always going to work. The purpose of this study is to show that different characteristics of farmlands can influence tortoises. The methods were behavioral observations in two different seasons between the wet season march to May and the dry season November to December, a total of 242 behavior observations. This was carried out by locating the tortoise and then observing for 30 minutes from 5-15 m, using a pair of binoculars. The influences of time spent on eating, resting and walking were recorded. It was shown that tortoises within farmland as more likely to walk in areas with no vegetation and were more likely to rest at lower carapace temperatures. However, tortoises in abounded areas had different patterns when observed, this may be because of the lack of livestock and tourism. Tortoises need more time to digest their food within cooler temperatures compared to warmer temperaturess. Overall, this study shows the differences within giant tortoise using different farm types and how this indicates which actives the tortoises would choose to be able to reduce energy, however more investigation should be done on the different habitat qualities in farms and how this can affect tortoises.

keywords: Galapagos tortoises, farm, behavioral, characteristics, vegetation.

EXERCISE 5 – CORRELATION

Which of the following correlation coefficients expresses the strongest association?

a) 0.55 b) 0.09 c) -0.77 d) 0.1 e) 1.05

answer is C, expresses the strongest association, as it represents a strong negative correlation.

We have five representative samples of people aged 15, 20, 30, 45 and 60 years who completed a questionnaire of political conservatism. In these 5 samples in the given order were the average scores of political conservatism as follows: 60, 85, 80, 70, . Correlation between age and political conservatism is: b) 1.0 b) -1.0 c) linear d) nonlinear

we need to look at the relationship between the two variables based on the given data. so the answer is D , nonlinear.

How is Pearson’s coefficient influenced by:

a) Limited variability → can reduce the correlation, especially if the variability in one or both variables is low.
b) Differences in distribution → can distort the correlation, especially if distributions are non-normal.
c) Outliers → can drastically change the correlation, either inflating or deflating it.
d) Extreme groups → can lead to a biased correlation, as it may not be representative of the entire range of data.

What information can we gather based on scatterplot?

Direction (positive/negative/no correlation)
Strength (strong/weak/moderate correlation)
Linearity (linear vs. nonlinear)
Outliers
Clusters or subgroups within the data
Trends and patterns (if applicable, especially over time)
Homogeneity of variance (homoscedasticity vs. heteroscedasticity)

If you use Pearson’s correlation coefficient for computing correlation between variables where the relationship between X and Y is not linear, how it influence the coefficient?

If the relationship between XXX and YYY is not linear and you use Pearson’s correlation coefficient, it can lead to:

Weak or misleading correlation values (even if there is a strong nonlinear relationship),
Misrepresentation of the direction of the relationship (positive or negative),
Inaccurate conclusions about the strength and nature of the relationship between the variables.

For non-linear relationships, it’s better to use alternative methods like Spearman’s rank correlation or to apply nonlinear regression models for a more accurate analysis.

What are marginal frequencies?

Marginal frequencies are the sums of the rows and columns in a contingency table.
They represent the total count of observations for each category of one variable, ignoring the other variable.
Marginal frequencies are helpful for understanding the overall distribution of each variable and are often used in various statistical analyses like probability calculations and Chi-square tests.

If correlation between X and Y is 0.5, how does the correlation change if we transform X to T-scores?

A linear transformation of XXX (such as converting to T-scores) does not change the correlation between XXX and YYY. Thus, if the original correlation between XXX and YYY is 0.5, the correlation between T(X)T(X)T(X) and YYY will also be 0.5.