This is Renad’s online workbook

Author

Renad Galal (Ray)

Published

May 10, 2025

Week 2 post session exercise

Task 2 Data Wrangling 🧹🧹🧹

  • task date: 04/10/2024

Meet the penguins 🐧🐧🐧

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

Who Are the Palmer Penguins? 🐧🐧🐧

The Palmer Penguins are three types of penguins that live on some islands in Antarctica:

  • Adélie Penguins

  • Chinstrap Penguins

  • Gentoo Penguins

Scientists measured things about these penguins, like:

  • How long their flippers are (the part they use to swim).

  • How heavy they are (called body mass).

  • How long and deep their bills (beaks) are.

They collected this information to learn more about the penguins and how they’re doing in their environment.

Flipper and bill lengths

The plot below shows the relationship between flipper and bill lengths of these penguins.
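
The rendered plot itself isn't reproduced here, but a minimal ggplot2 sketch along these lines (assuming the palmerpenguins and ggplot2 packages are loaded; variable names follow the penguins dataset) would draw it:

library(palmerpenguins)   # provides the 'penguins' data frame
library(ggplot2)

# Scatter plot of flipper length vs bill length, coloured by species
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
  geom_point(na.rm = TRUE) +   # na.rm drops rows with missing measurements
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       title = "Palmer penguins: flipper vs bill length")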

So, even though these penguins live far away in Antarctica, we can use R and cool pictures like scatter plots and box plots to learn a lot about them. It’s a fun way to use data to understand the world!

Week 3 post session exercise

10/10/2024

A few important notes to remember from chapters 4, 5, and 6 of the R for Graduates book:

  1. to load tidyverse (it's installed already) use: library(tidyverse)
  2. Pipe operator %>%: used to pass the result of one function to another in a readable way.
  3. to view the diamonds dataset: view(diamonds)
  4. to display the structure or internal details: str(diamonds)
  5. if you need help you can access the help page with: ?diamonds
  6. We can take a quick look at the variable names using: names(diamonds)
  7. key functions: mutate(), summarize(), group_by() and ungroup(), filter(), select(), arrange() (a tiny example follows below)
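
As a quick illustration (my own toy example, not from the book) of the pipe and a few of those verbs on the diamonds dataset:

library(tidyverse)

diamonds %>%
  filter(carat > 1) %>%                       # keep only diamonds over 1 carat
  group_by(cut) %>%                           # one group per cut
  summarize(mean_price = mean(price)) %>%     # average price within each cut
  arrange(desc(mean_price)) %>%               # most expensive cut first
  ungroup()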

Exercises for Data analysis 💎💎💎

  1. make sure to load tidyverse: library(tidyverse).

  2. now we type the code given in the exercise manually and explain what it does:

    Problems:

    A

    library(tidyverse)
    midwest %>% 
      group_by(state) %>%  # Group the dataset by 'state'.
      summarize(poptotalmean = mean(poptotal),  # Calculate the mean of 'poptotal' for each state.
                poptotalmed = median(poptotal),  # Calculate the median of 'poptotal' for each state.
                popmax = max(poptotal),  # Find the maximum 'poptotal' value for each state.
                popmin = min(poptotal),  # Find the minimum 'poptotal' value for each state.
                popdistinct = n_distinct(poptotal),  # Count the number of unique 'poptotal' values for each state.
                popfirst = first(poptotal),  # Get the first value of 'poptotal' in each state.
                popany = any(poptotal < 5000),  # Check if any 'poptotal' value is less than 5000 in each state.
                popany2 = any(poptotal > 2000000)) %>%  # Check if any 'poptotal' value is greater than 2,000,000.
      ungroup()  # Remove the grouping, returning a non-grouped dataset.
    # A tibble: 5 × 9
      state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
      <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
    1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
    2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
    3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
    4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
    5 WI          67941.      33528   959275   3890          72    15682 TRUE  
    # ℹ 1 more variable: popany2 <lgl>

    B

    midwest %>% 
      group_by(state) %>%  # Group the data by 'state'.
      summarize(num5k = sum(poptotal < 5000),  # Count how many 'poptotal' values are less than 5000 for each state.
                num2mil = sum(poptotal > 2000000),  # Count how many 'poptotal' values are greater than 2,000,000 for each state.
                numrows = n()) %>%  # Count the total number of rows (observations) in each state.
      ungroup()  # Remove the grouping, returning the dataset without state groups.
    # A tibble: 5 × 4
      state num5k num2mil numrows
      <chr> <int>   <int>   <int>
    1 IL        1       1     102
    2 IN        0       0      92
    3 MI        1       1      83
    4 OH        0       0      88
    5 WI        2       0      72

    C

    # part I
    midwest %>%                               # Start with the 'midwest' dataset and pipe it to the next function
      group_by(county) %>%                    # Group the data by the 'county' column, so the operations apply to each county separately
      summarize(x = n_distinct(state)) %>%    # Create a new column 'x' that counts the distinct number of states within each county
      arrange(desc(x)) %>%                    # Sort the result in descending order of 'x' (the distinct state count)
      ungroup()                               # Remove the grouping after summarizing to return an ungrouped data frame
    # A tibble: 320 × 2
       county         x
       <chr>      <int>
     1 CRAWFORD       5
     2 JACKSON        5
     3 MONROE         5
     4 ADAMS          4
     5 BROWN          4
     6 CLARK          4
     7 CLINTON        4
     8 JEFFERSON      4
     9 LAKE           4
    10 WASHINGTON     4
    # ℹ 310 more rows
    # part II
    midwest %>%                               # Start with the 'midwest' dataset
      group_by(county) %>%                    # Group the data by 'county'
      summarize(x = n()) %>%                  # Create a new column 'x' that counts the total number of rows in each county
      ungroup()                               # Remove the grouping after summarizing
    # A tibble: 320 × 2
       county        x
       <chr>     <int>
     1 ADAMS         4
     2 ALCONA        1
     3 ALEXANDER     1
     4 ALGER         1
     5 ALLEGAN       1
     6 ALLEN         2
     7 ALPENA        1
     8 ANTRIM        1
     9 ARENAC        1
    10 ASHLAND       2
    # ℹ 310 more rows
    # part III
    midwest %>%                               # Start with the 'midwest' dataset
      group_by(county) %>%                    # Group the data by 'county'
      summarize(x = n_distinct(county)) %>%   # Create a new column 'x' that counts the distinct number of 'county' values within each county (which will always be 1)
      ungroup()                               # Remove the grouping to return an ungrouped data frame
    # A tibble: 320 × 2
       county        x
       <chr>     <int>
     1 ADAMS         1
     2 ALCONA        1
     3 ALEXANDER     1
     4 ALGER         1
     5 ALLEGAN       1
     6 ALLEN         1
     7 ALPENA        1
     8 ANTRIM        1
     9 ARENAC        1
    10 ASHLAND       1
    # ℹ 310 more rows

    D

    diamonds %>%                                # Start with the 'diamonds' dataset
      group_by(clarity) %>%                     # Group the data by the 'clarity' column (each unique value of clarity will form a group)
      summarize(a = n_distinct(color),          # Create a new column 'a' that counts the number of distinct colors in each clarity group
                b = n_distinct(price),          # Create a new column 'b' that counts the number of distinct prices in each clarity group
                c = n()) %>%                    # Create a new column 'c' that counts the total number of diamonds in each clarity group (the total number of rows)
      ungroup()                                 # Remove the grouping to return an ungrouped data frame
    # A tibble: 8 × 4
      clarity     a     b     c
      <ord>   <int> <int> <int>
    1 I1          7   632   741
    2 SI2         7  4904  9194
    3 SI1         7  5380 13065
    4 VS2         7  5051 12258
    5 VS1         7  3926  8171
    6 VVS2        7  2409  5066
    7 VVS1        7  1623  3655
    8 IF          7   902  1790

    E

    # part I
    diamonds %>%                               # Start with the 'diamonds' dataset
      group_by(color, cut) %>%                 # Group the data by both 'color' and 'cut' columns (each unique combination of color and cut forms a group)
      summarize(m = mean(price),               # Create a new column 'm' that calculates the mean (average) price for each color-cut combination
                s = sd(price)) %>%             # Create a new column 's' that calculates the standard deviation of the price for each color-cut combination
      ungroup()                                # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'color'. You can override using the
    `.groups` argument.
    # A tibble: 35 × 4
       color cut           m     s
       <ord> <ord>     <dbl> <dbl>
     1 D     Fair      4291. 3286.
     2 D     Good      3405. 3175.
     3 D     Very Good 3470. 3524.
     4 D     Premium   3631. 3712.
     5 D     Ideal     2629. 3001.
     6 E     Fair      3682. 2977.
     7 E     Good      3424. 3331.
     8 E     Very Good 3215. 3408.
     9 E     Premium   3539. 3795.
    10 E     Ideal     2598. 2956.
    # ℹ 25 more rows
    # part II
    diamonds %>%                                # Start with the 'diamonds' dataset
      group_by(cut, color) %>%                  # Group the data by 'cut' and 'color' (note the order is now 'cut' first, then 'color')
      summarize(m = mean(price),                # Create a new column 'm' that calculates the mean (average) price for each cut-color combination
                s = sd(price)) %>%              # Create a new column 's' that calculates the standard deviation of the price for each cut-color combination
      ungroup()                                 # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'cut'. You can override using the `.groups`
    argument.
    # A tibble: 35 × 4
       cut   color     m     s
       <ord> <ord> <dbl> <dbl>
     1 Fair  D     4291. 3286.
     2 Fair  E     3682. 2977.
     3 Fair  F     3827. 3223.
     4 Fair  G     4239. 3610.
     5 Fair  H     5136. 3886.
     6 Fair  I     4685. 3730.
     7 Fair  J     4976. 4050.
     8 Good  D     3405. 3175.
     9 Good  E     3424. 3331.
    10 Good  F     3496. 3202.
    # ℹ 25 more rows
    # part III
    diamonds %>%                                   # Start with the 'diamonds' dataset
      group_by(cut, color, clarity) %>%            # Group the data by 'cut', 'color', and 'clarity' (each unique combination of these forms a group)
      summarize(m = mean(price),                   # Create a new column 'm' that calculates the mean (average) price for each cut-color-clarity combination
                s = sd(price),                     # Create a new column 's' that calculates the standard deviation of the price for each group
                msale = m * 0.80) %>%              # Create a new column 'msale' where the sale price is 80% of the mean price (i.e., a 20% discount)
      ungroup()                                    # Remove the grouping to return an ungrouped data frame
    `summarise()` has grouped output by 'cut', 'color'. You can override using the
    `.groups` argument.
    # A tibble: 276 × 6
       cut   color clarity     m     s msale
       <ord> <ord> <ord>   <dbl> <dbl> <dbl>
     1 Fair  D     I1      7383  5899. 5906.
     2 Fair  D     SI2     4355. 3260. 3484.
     3 Fair  D     SI1     4273. 3019. 3419.
     4 Fair  D     VS2     4513. 3383. 3610.
     5 Fair  D     VS1     2921. 2550. 2337.
     6 Fair  D     VVS2    3607  3629. 2886.
     7 Fair  D     VVS1    4473  5457. 3578.
     8 Fair  D     IF      1620.  525. 1296.
     9 Fair  E     I1      2095.  824. 1676.
    10 Fair  E     SI2     4172. 3055. 3338.
    # ℹ 266 more rows

    F

    diamonds %>%                                  # Start with the 'diamonds' dataset
      group_by(cut) %>%                           # Group the data by the 'cut' column (each unique value of cut forms a group)
      summarize(potato = mean(depth),             # Create a new column 'potato' that calculates the mean of the 'depth' for each cut group
                pizza = mean(price),              # Create a new column 'pizza' that calculates the mean of the 'price' for each cut group
            popcorn = median(y),              # Create a new column 'popcorn' that calculates the median of the 'y' dimension (diamond width in mm) for each cut group
                pineapple = potato - pizza,       # Create a new column 'pineapple' that calculates the difference between the mean depth and mean price
                papaya = pineapple ^ 2,           # Create a new column 'papaya' that calculates the square of the 'pineapple' value (the squared difference between depth and price)
                peach = n()) %>%                  # Create a new column 'peach' that counts the total number of diamonds (rows) for each cut group
      ungroup()                                   # Remove the grouping to return an ungrouped data frame
    # A tibble: 5 × 7
      cut       potato pizza popcorn pineapple    papaya peach
      <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
    1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
    2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
    3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
    4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
    5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

    G

    # part I
    diamonds %>%                                    # Start with the 'diamonds' dataset
      group_by(color) %>%                           # Group the data by 'color' (each unique diamond color forms a group)
      summarize(m = mean(price)) %>%                # Create a new column 'm' that calculates the mean price for each color
      mutate(x1 = str_c("Diamond color ", color),   # Create a new column 'x1' by concatenating the string "Diamond color " with the diamond's color value
             x2 = 5) %>%                            # Create a new column 'x2' that assigns the value 5 to each row (same for every color group)
      ungroup()                                     # Remove the grouping to return an ungrouped data frame
    # A tibble: 7 × 4
      color     m x1                 x2
      <ord> <dbl> <chr>           <dbl>
    1 D     3170. Diamond color D     5
    2 E     3077. Diamond color E     5
    3 F     3725. Diamond color F     5
    4 G     3999. Diamond color G     5
    5 H     4487. Diamond color H     5
    6 I     5092. Diamond color I     5
    7 J     5324. Diamond color J     5
    # part II
    diamonds %>%                                    # Start with the 'diamonds' dataset
      group_by(color) %>%                           # Group the data by 'color'
      summarize(m = mean(price)) %>%                # Create a new column 'm' that calculates the mean price for each color
      ungroup() %>%                                 # Remove the grouping after summarizing, so the next operations are not grouped
      mutate(x1 = str_c("Diamond color ", color),   # Create a new column 'x1' by concatenating "Diamond color " with the color value
             x2 = 5)                                 # Create a new column 'x2' with a constant value of 5 for all rows
    # A tibble: 7 × 4
      color     m x1                 x2
      <ord> <dbl> <chr>           <dbl>
    1 D     3170. Diamond color D     5
    2 E     3077. Diamond color E     5
    3 F     3725. Diamond color F     5
    4 G     3999. Diamond color G     5
    5 H     4487. Diamond color H     5
    6 I     5092. Diamond color I     5
    7 J     5324. Diamond color J     5

    H

    # part I
    diamonds %>%                                     # Start with the 'diamonds' dataset
      group_by(color) %>%                            # Group the data by 'color' (each unique diamond color forms a group)
      mutate(x1 = price * 0.5) %>%                   # Create a new column 'x1' that calculates half the price of each diamond in the group
      summarize(m = mean(x1)) %>%                    # Create a new column 'm' that calculates the mean of the 'x1' values (mean half price) for each color group
      ungroup()                                      # Remove the grouping to return an ungrouped data frame
    # A tibble: 7 × 2
      color     m
      <ord> <dbl>
    1 D     1585.
    2 E     1538.
    3 F     1862.
    4 G     2000.
    5 H     2243.
    6 I     2546.
    7 J     2662.
    # part II
    diamonds %>%                                     # Start with the 'diamonds' dataset
      group_by(color) %>%                            # Group the data by 'color'
      mutate(x1 = price * 0.5) %>%                   # Create a new column 'x1' that calculates half the price of each diamond in the group
      ungroup() %>%                                  # Remove the grouping to return an ungrouped data frame
      summarize(m = mean(x1))                        # Calculate the mean of the 'x1' values (mean half price) across the entire dataset, since it is now ungrouped
    # A tibble: 1 × 1
          m
      <dbl>
    1 1966.
  3. after that we answer these questions:

Q1: Why is grouping data necessary?

Grouping allows performing aggregations on subsets of the data rather than on the entire dataset, and it enables us to analyze different segments of the data independently. It is also essential for calculating statistics that are specific to a subset of the data.

Q2: Why is ungrouping data necessary?

To return to a standard data frame structure for subsequent operations. Once ungrouped, you can freely manipulate the data frame without being constrained by the previous grouping.

Q3: When should you ungroup data?

1.  After Aggregation

2.  Before Operations that Shouldn't Be Grouped

3.  Before Merging or Joining

Q4: If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No, you do not need to use ungroup() if you have not previously used group_by().

  1. Extra exercise:

    view (diamonds)

    # Arrange diamonds by various criteria

    Lowest to highest price

    diamonds_low_high <- diamonds %>%
      arrange(price)

    Highest to lowest price

    diamonds_high_low <- diamonds %>%
      arrange(desc(price))

    Lowest price and cut

    diamonds_low_price_cut <- diamonds %>%
      arrange(price, cut)

    Highest price and cut

    diamonds_high_price_cut <- diamonds %>%
      arrange(desc(price), cut)

    Arrange by lowest to highest price and worst to best clarity

    # Arrange diamonds by lowest to highest price and worst to best clarity
    diamonds_sorted <- diamonds %>%
      arrange(price, clarity)
    
    # View the sorted dataset
    head(diamonds_sorted)
    # A tibble: 6 × 10
      carat cut       color clarity depth table price     x     y     z
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

    Create a new variable ‘salePrice’ reflecting a $250 discount

    # Create a new variable 'salePrice' reflecting a $250 discount
    diamonds_with_sale_price <- diamonds %>%
      mutate(salePrice = price - 250)
    
    # View the updated dataset
    head(diamonds_with_sale_price)
    # A tibble: 6 × 11
      carat cut       color clarity depth table price     x     y     z salePrice
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86

    Remove the x, y, and z variables from the diamonds dataset (hint: select()).

    # Load the necessary library
    library(tidyverse)
    
    # Remove the x, y, and z variables from the dataset
    diamonds_cleaned <- diamonds %>%
      select(-x, -y, -z)
    
    # View the updated dataset
    head(diamonds_cleaned)
    # A tibble: 6 × 7
      carat cut       color clarity depth table price
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
    1  0.23 Ideal     E     SI2      61.5    55   326
    2  0.21 Premium   E     SI1      59.8    61   326
    3  0.23 Good      E     VS1      56.9    65   327
    4  0.29 Premium   I     VS2      62.4    58   334
    5  0.31 Good      J     SI2      63.3    58   335
    6  0.24 Very Good J     VVS2     62.8    57   336

    Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

    # Load the necessary library
    library(tidyverse)
    
    # Group by cut and count the number of diamonds for each cut
    diamonds_by_cut <- diamonds %>%
      group_by(cut) %>%
      summarize(number_of_diamonds = n())
    
    # View the result
    print(diamonds_by_cut)
    # A tibble: 5 × 2
      cut       number_of_diamonds
      <ord>                  <int>
    1 Fair                    1610
    2 Good                    4906
    3 Very Good              12082
    4 Premium                13791
    5 Ideal                  21551

    Create a new column named totalNum that calculates the total number of diamonds.

    # Load the necessary library
    library(tidyverse)
    
    # Calculate the total number of diamonds and add it as a new column
    diamonds_with_totalNum <- diamonds %>%
      mutate(totalNum = n())
    
    # View the first few rows of the dataset with the new column
    head(diamonds_with_totalNum)
    # A tibble: 6 × 11
      carat cut       color clarity depth table price     x     y     z totalNum
      <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
    1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
    2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
    3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
    4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
    5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
    6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940

Exercises for Research methods 💎💎💎

Good question:

What percentage of diamonds fall within a specific price range (e.g., $1,000 to $5,000)?

Bad question:

Are diamonds shiny?
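
Returning to the good question above, a hedged sketch of how that percentage could be computed (the $1,000 to $5,000 range is just the example from the question; output not shown):

library(tidyverse)

diamonds %>%
  summarize(pct_in_range = mean(price >= 1000 & price <= 5000) * 100)
# mean() of a logical vector gives the proportion of TRUEs, so this is the
# percentage of diamonds priced between $1,000 and $5,000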

Week 4 Post session formative exercise

There are two exercises: one for Data analysis and one for Research methods

Data analysis exercise:

1. Follow the video tutorial and try to reproduce the same graphs. ✅

download Andrew’s script (to copy and paste into an R script and use the code directly) here.

my answer:

The basics:

  • first load tidyverse: library(tidyverse)

  • then load library(modeldata), which includes the crickets dataset that we’ll work on :)

Tip

make sure you have installed the modeldata package!!

library (tidyverse)
library (modeldata)

Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins
  • now just view the dataset you’re working on:
view(crickets)

the data show how fast the crickets chirped at different temperatures (see ?crickets) 🦗🦗🦗

the first thing to know about graphing in R with ggplot is that the most important decisions about a plot are which variables are being used and how they’re being displayed (x axis or y axis, color, size, or anything else)

Tip

for plotting we mostly need only one function in R: ggplot()

  • ok so:
ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point ()# first argument is just the data set, then which variables are being displayed

  # then after "+" how those variables are going to be mapped for the actual visualization (display)
  • that’s all you need to make a scatter plot! :)

  • now if we want to change the appearance of this plot (we always do it after we specify the data):

ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point () +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # to change the labels and add a title and a caption that says where we got the data from 

  • now if we want to add more features, like coloring by species, we need the species variable from the crickets dataset to be displayed as well.
ggplot (crickets, aes( x = temp, 
                       y = rate,
                       color = species)) + 
  geom_point () +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # we just add color = species to the ggplot

  • now if we want to change “species” to “Species” in the legend, we can’t just do that by changing it in the code, because it’s written in lowercase in the dataset.

  • so we have to give it a label inside labs(): color = "Species"

ggplot (crickets, aes( x = temp, 
                       y = rate,
                       color = species)) + 
  geom_point () +
  labs (x = "Temperature", 
        y = "Chirp rate",
        color = "Species", 
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # so we have to give it a command: color = "Species"

there u go 😃 ❤️

  • now that we’re done with the basics we’ll move on to modifications

  • Modifying basic properties of the plot

ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point () +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # for this example and upcoming ones we don't want the colored species

  • now what if you want the dots to be red
Tip

it’s not done inside aes(), because aes() is for mapping variables to how they’re displayed

ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point (color = "red") +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # so the basic property of the graph is added inside the `geom_point`

pretty cool huh! 😎👍

  • we can also change the size and the opacity of the points: (following the same steps)
ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point (color = "red",
              size = 2,
              alpha = .3,
              shape = "square") +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # we added the size and the opacity and the shape

Tip

This is particularly helpful if you have overplotting

  • Now, to learn more about the options for a geom_ function

  • use ?geom_point

  • Adding another layer

 ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point () +
  geom_smooth () +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") # using the same one without colors, we add the layer which is geom_smooth
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

this is so cool 🫨

  • but it’s a wiggly loess curve rather than the straight line we want, so we add the argument method = "lm"
 ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point () +
  geom_smooth (method = "lm") +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") 
`geom_smooth()` using formula = 'y ~ x'

awesome that’s better 👍

  • to make it even better: we add se = FALSE, which removes the standard error ribbon
 ggplot (crickets, aes( x = temp, 
                       y = rate)) + 
  geom_point () +
  geom_smooth (method = "lm",
               se = FALSE) +
  labs (x = "Temperature", 
        y = "Chirp rate",
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") 
`geom_smooth()` using formula = 'y ~ x'

  • now we can do it with colored species:
ggplot (crickets, aes( x = temp, 
                       y = rate,
                       color = species)) + 
  geom_point () +
  geom_smooth(method = "lm",
              se = FALSE) +  # note the "+" here: without it, labs() below becomes a separate call and just prints a labels object
  labs (x = "Temperature", 
        y = "Chirp rate",
        color = "Species", 
        title = "Cricket chirps",
        caption = "Source: McDonald (2009)") 
`geom_smooth()` using formula = 'y ~ x'

FABULOUS!! 🤩

Tip

Learn more about the options for geom_point() with ?geom_point

  • now to learn other plots:

  • a histogram of the chirp rates:

ggplot(crickets, aes(x = rate)) + 
  geom_histogram() # in this case I only want chirp rates, so I only map x; a histogram represents a single quantitative variable, so only one column is considered here
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

there u go! 😀

  • adjusting bins:
ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # now we add/change bins, and this was for one quantitative variable

  • now we change the word “histogram” to “freqpoly” to represent it in a different way
ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15) 

Tip

the fundamental feature of this plot is that chirp rate is on the x axis

  • ok now let’s do a plot for the species variable

  • it’s a single categorical variable and I want a **bar chart**

ggplot(crickets, aes(x = species)) + 
  geom_bar ()

  • then to add colors
ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue") # we didn't specify a variable we just said make it all blue or black

  • lets try to specify a variable:
 ggplot(crickets, aes(x = species, 
                     fill = species)) +
 geom_bar()

Tip

to add the colorblind-friendly palette: scale_fill_brewer(palette = "Dark2")

  • the problem now is that we can see species (the label) twice: on the x axis and in the legend

  • we don’t want the legend, so let’s remove it (show.legend = FALSE)

 ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) 

Tip

a histogram is your go-to default for displaying a single quantitative variable; a scatter plot is good for two quantitative variables

  • let’s say we want to show one quantitative variable and one categorical variable
 ggplot(crickets, aes(x = species, 
                     y = rate)) + 
   geom_boxplot ()

  • now let’s add color and fill
 ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) 

Tip

you should always think ahead whether your data is quantitative or qualitative

  • use this image to help:

  • In this case I’m interested in a data question relating chirp rates to the different species. For example: “Do these different species have different chirp rates?” or “Which species has a higher chirp rate on average?”

  • This visualization of data should help us answer those questions :)

Tip

in Rstudio go to help > cheat sheets

  • now remove the gray background by adding theme_minimal()
 ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  theme_minimal() # highly recommended!!!

Tip

to add more themes or know more go to: ?theme_minimal()

faceting (splitting the plot into panels)

ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2") # for color blind

  • they’re stacked on top of each other so they don’t look that great
 ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

?facet_wrap # basically wrap it by species to make it look better 

very cool so much better!!

  • finally we want to arrange these plots vertically. Why? Because having them side by side doesn’t really show the difference
ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

2. Use your WorkBook and organize it in your own way ✅

3. Publish the WorkBook in Rpubs

Research methods exercise: ✅

  • Read this Wikipedia page on Scientific Method and this brief tutorial on how to craft a strong research hypothesis.

  • Now, use your Work Book to write a short paragraph that summarizes your understanding of what a good research hypothesis is

A good research hypothesis is a clear, concise, and testable statement that predicts the relationship between variables in a study. Here are the key characteristics that define a strong research hypothesis:

My answer:

Characteristics of a Good Research Hypothesis: 🔍🔍🔍

  • The hypothesis should be empirically testable through experimentation, observation, or data analysis. It should allow for the collection of data that can support or refute the hypothesis.

  • The wording should be clear, leaving no room for confusion. A well-defined hypothesis avoids vague terms and ensures that anyone reading it understands the variables involved and the expected relationship.

  • A good hypothesis is specific and straightforward about what it predicts. It should avoid unnecessary complexity or overly elaborate statements.

  • It should address a significant question or gap in the existing literature, contribute to the field of study, and advance understanding of the topic.

  • It should logically follow from previous research findings, reflecting an understanding of the current state of knowledge in the field.

  • Depending on the nature of the research, a hypothesis can be directional (predicting a specific outcome) or non-directional (indicating a relationship without specifying the direction).

Examples of Good Research Hypotheses: 🤓🤓🤓

  1. Health Sciences:
    • If individuals consume a high-fiber diet, then their cholesterol levels will decrease compared to those who do not consume a high-fiber diet.
  2. Education:
    • Students who receive regular feedback from their instructors will demonstrate higher academic performance than those who do not receive feedback.
  3. Environmental Science:
    • Increased urbanization negatively affects the biodiversity of local ecosystems compared to rural areas.
  4. Psychology:
    • Individuals who practice mindfulness meditation will report lower levels of anxiety than those who do not practice mindfulness.

End of task 👌👌👌

Week 5 formative exercise

Q: Look at the graphics below and identify which tests can be associated to them. Explain why!

A:

  1. boxplot: t-test or ANOVA, as it compares the distribution of data between groups.
  2. histogram: chi-square goodness-of-fit test, as it helps assess the distribution of a single variable.
  3. scatterplot: correlation or linear regression, as it is used to assess relationships between two variables.
  4. bar chart: t-test or ANOVA, as it often summarizes data for different categories, and you need a test like the t-test or ANOVA to determine whether the differences in means between groups are statistically significant. (A hedged sketch of these tests on the iris data follows below.)
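
A minimal sketch of how those tests might be run on the iris data (illustrative only; I chose the variables myself, and the output is not shown):

data(iris)

# Boxplot / bar chart style question: compare a mean across groups (ANOVA; t-test for 2 groups)
summary(aov(Sepal.Length ~ Species, data = iris))

# Histogram style question: chi-square goodness of fit on a discretized variable
size <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "small", "big")
chisq.test(table(size))

# Scatterplot style question: correlation and linear regression
cor.test(iris$Petal.Length, iris$Petal.Width)
summary(lm(Petal.Width ~ Petal.Length, data = iris))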

Try to reproduce these graphs using the dataset “iris” (pre-built in R), which you can access just by calling data(iris)

Answer:

  1. boxplot
  2. histogram
  3. scatterplot
  4. bar chart

we’ll apply what we learned in the last session (week 4), but with iris data

don’t forget to load ggplot2: library(ggplot2)

To reproduce a boxplot

# Load ggplot2 package
library(ggplot2)

# Create boxplot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Boxplot of Sepal Length by Species",
       x = "Species",
       y = "Sepal Length") +
  theme_minimal()

To reproduce a histogram

  # Load ggplot2 package
library(ggplot2)

# Create density plot by species for Petal Length
ggplot(iris, aes(x = Petal.Length, color = Species, fill = Species)) +
  geom_density(alpha = 0.5) +  # Use density curves
  labs(title = "Density Plot of Petal Length by Species",
       x = "Petal Length",
       y = "Density") +
  theme_minimal()

To reproduce a scatterplot

# Load ggplot2 package
library(ggplot2)

# Create scatter plot with one linear regression line for all species
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # Add points with size and transparency
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # Add a single linear regression line
  labs(title = "Scatter Plot of Petal Length vs. Petal Width with Linear Regression Line",
       x = "Petal Length",
       y = "Petal Width") +
  theme_minimal() +
  theme(legend.position = "top")  # Optional: move legend position
`geom_smooth()` using formula = 'y ~ x'

To reproduce a bar chart

Hint: for this graph you’ll need to create a new variable using the following code

iris %>% mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))
# Create new variable 'size' and then create a bar chart
iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big")) %>%
  ggplot(aes(x = size, fill = Species)) +
  geom_bar(position = "dodge") +  # Create the bar chart with dodged bars
  labs(title = "Count of Iris Samples by Size and Species",
       x = "Size",
       y = "Count") +
  theme_minimal()

End of task :)

Week 6 post session

Comparing proportions in R: function: prop.test()

Tip

for z-test prop.test()

Tip

for Chi-square chisq.test()

One-proportion z-Test:

What is One-proportion z-Test?

  • is used to compare an observed proportion to a theoretical one, when there are only two categories.

Example:

  • we have a population of mice containing half males and half females (p = 0.5 = 50%). Some of these mice (n = 160) have developed a spontaneous cancer, including 95 males and 65 females.
  • The R functions binom.test() and prop.test() can be used to perform a one-proportion test:
  • We want to know whether the cancer affects males more than females.
  • if |z|<1.96, then the difference is not significant at 5%
  • if |z|≥1.96, then the difference is significant at 5%
res <- prop.test(x = 95, n = 160, p = 0.5, 
                 correct = FALSE)
# Printing the results
res 

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5163169 0.6667870
sample estimates:
      p 
0.59375 

Result:

The p-value of the test is 0.01771, which is less than the significance level alpha = 0.05. We can conclude that the proportion of males with cancer is significantly different from 0.5, with a p-value = 0.01771.

I’m not sure I fully understand this bit

two-proportions z-test

What is two-proportions z-test?

  • The two-proportions z-test is used to compare two observed proportions.

example

  • For example, we have two groups of individuals:
  1. Group A, with lung cancer: n = 500
  2. Group B, healthy individuals: n = 500
  • The number of smokers in each group is as follows:
  1. Group A, with lung cancer: n = 500, 490 smokers, pA = 490/500 = 98%
  2. Group B, healthy individuals: n = 500, 400 smokers, pB = 400/500 = 80%
  • In this setting:
  1. The overall proportion of smokers is p = (490 + 400)/(500 + 500) = 89%
  2. The overall proportion of non-smokers is q = 1 − p = 11%

We want to know whether the proportions of smokers are the same in the two groups of individuals (is A equal to, less than, or greater than B?).

we already know, but we want the p-value (I think)

Tip
  • if |z|<1.96, then the difference is not significant at 5%
  • if |z|≥1.96, then the difference is significant at 5%
res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80 

Result

  • The p-value of the test is 2.363 × 10^-19, which is less than the significance level alpha = 0.05. We can conclude that the proportion of smokers is significantly different in the two groups, with a p-value = 2.363 × 10^-19.

what??

chi-square goodness of fit test

what is chi-square goodness of fit test

  • The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, in a situation where we have two or more categories of discrete data.
  • In other words, it compares multiple observed proportions to expected probabilities.

example

  • we collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.

Q1

  • Are these colors equally common?
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07

result

  • The p-value of the test is 8.803 × 10^-7, which is less than the significance level alpha = 0.05. We can conclude that the colors are significantly different from being equally common, with a p-value = 8.803 × 10^-7.

Q2

  • We want to know, if there is any significant difference between the observed proportions and the expected proportions.
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 0.20253, df = 2, p-value = 0.9037

result

  • The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.

Chi-square test basics

Tip

Chi-square test: examines whether rows and columns of a contingency table are statistically significantly associated

Tip

Null hypothesis (H0): the row and the column variables of the contingency table are independent.

Tip

Alternative hypothesis (H1): row and column variables are dependent

Chi-Square Test of Independence in R

what is Chi-Square Test of Independence in R?

  • The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables.

  • The chi-square test evaluates whether there is a significant association between the categories of the two variables.

example

# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# head(housetasks)
  • Contingency table can be visualized using the function balloonplot() [in gplots package].
  • install.packages("gplots") then run this code:
library("gplots")

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

  • It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package graphics
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

  • There is another package named vcd, which can be used to make a mosaic plot (function mosaic()) or an association plot (function assoc()).
  • install.packages("vcd") then run this code:
# install.packages("vcd")
library("vcd")
Loading required package: grid
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)

result

chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
  • the row and the column variables are statistically significantly associated (p-value < 2.2e-16, effectively zero).

End of task :)

Week 7

Pre session

Mean tests

T- test

library (tidyverse)
library (patchwork)
library (gapminder)
  • in a t-test we’re usually asking about the average of the data that we have
  • we’re asking “Is that average different?”
  • we’ll start with a **single sample t-test**

How does hypothesis test work?

  • Null Hypothesis (H₀): This is the hypothesis of “no effect” or “no difference.” It represents the status quo or baseline assumption (e.g., “The mean test score is 50”).
  • Alternative Hypothesis (H₁ or Hₐ): This hypothesis reflects the opposite of the null hypothesis (e.g., “The mean test score is not 50” or “The mean is greater than 50”).

Significance Level (α)

  • This is the probability threshold used to decide whether to reject the null hypothesis, often set at 0.05 (5%).
  • A lower significance level means stricter evidence is required to reject H₀.

p-value

  • The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one observed, given that the null hypothesis is true.
  • A low p-value (typically below α = 0.05) suggests that the observed data is unlikely under H₀, providing evidence in favor of the alternative hypothesis.

Make a Decision

  • Reject H₀: If the p-value is less than or equal to α, there is significant evidence to reject the null hypothesis in favor of the alternative hypothesis.
  • Fail to Reject H₀: If the p-value is greater than α, there is insufficient evidence to reject the null hypothesis.

In short, hypothesis testing is a way to use sample data to make informed decisions or draw conclusions about the population based on statistical evidence.
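
A tiny sketch of that decision rule in R, with a made-up sample (the numbers are purely illustrative):

set.seed(1)
x <- rnorm(20, mean = 52, sd = 5)   # hypothetical sample of 20 test scores

res <- t.test(x, mu = 50)           # H0: the mean test score is 50
alpha <- 0.05
if (res$p.value <= alpha) {
  print("Reject H0")                # significant evidence against H0
} else {
  print("Fail to reject H0")        # insufficient evidence against H0
}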

ANOVA test watch this video

  • What to focus on / what is important: being able to understand the question and interpret the results.
  • You don’t need to understand the mathematics underneath it, and writing the code is EZ
  • in the video he used the gapminder dataset
  • aov() stands for analysis of variance
  • when we run an ANOVA test it tells us whether or not the groups are all different from each other, but it does not specify if maybe two of them are in fact not different, so we can tell R to check that for us.
  • TukeyHSD(), which stands for Tukey’s honestly significant difference
  • in the video, when we did that, it gave us the difference between each pair of continents separately and the p-values for them
  • Asia and the Americas did not have a significant difference (it makes sense looking at the graph as well)
  • On the other hand, Europe vs the Americas and Europe vs Asia had significant differences. Their p-values were much smaller than 0.05, so we reject the null hypothesis for both (unlike Asia vs the Americas, where the p-value was higher). A hedged sketch of this workflow is below.
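
A hedged sketch of that workflow on gapminder (the exact filtering in the video may differ; here I just take the 2007 rows for three continents):

library(gapminder)

# One-way ANOVA: does mean life expectancy differ between continents in 2007?
gap07 <- droplevels(subset(gapminder, year == 2007 &
                             continent %in% c("Americas", "Asia", "Europe")))
fit <- aov(lifeExp ~ continent, data = gap07)
summary(fit)      # overall test: are the group means all equal?

TukeyHSD(fit)     # pairwise differences between continents with adjusted p-values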

The difference between ANOVA and the t-test is that with ANOVA we compare more than 2 groups (for example, Europe and the Americas only: t-test; but if we add Asia: ANOVA)

Tip

Null hypothesis (H0): no difference Alternative hypothesis (H1): difference

that’s it for the pre-session! pretty easy huh?

Week 7 post-session

Reproduce these examples in your workbook

2. Comparing one-sample mean to a standard known mean

2.1 One-sample T-test (parametric): one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ).

For example, compare whether the mean weight of mice differs from 200 mg, a value determined in a previous study.

  • Typical research questions are: is the mean (m) of the sample equal to, less than, or greater than the theoretical mean (μ)?
  • If the p-value is less than or equal to the significance level 0.05, we can reject the null hypothesis and accept the alternative hypothesis.
  • In other words, we conclude that the sample mean is significantly different from the theoretical mean.

Visualize your data and compute one-sample t-test in R

# Install ggpubr R package for data visualization:
if(!require(devtools)) install.packages("devtools")
Loading required package: devtools
Warning: package 'devtools' was built under R version 4.4.2
Loading required package: usethis
devtools::install_github("kassambara/ggpubr")
Skipping install of 'ggpubr' from a github remote, the SHA1 (6aeb4f70) has not changed since last install.
  Use `force = TRUE` to force installation

R function to compute one-sample t-test

  • t.test()

Import your data into R (ok so apparently I think this is the code for importing your data, so maybe no need for the Excel stuff? but running this code takes forever…)

  • just do library(ggplot2), then library(ggpubr), then View(my_data), and then it works :)

  • Here, we’ll use an example data set containing the weight of 10 mice.

  • We want to know, if the average weight of the mice differs from 25g?

set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

Check your data

# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 

Visualize your data using box plots

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Preliminary test to check one-sample t-test assumptions

  • Is this a large sample? - No, because n < 30.
  • Since the sample size is not large enough (less than 30, central limit theorem), we need to check whether the data follow a normal distribution.
# to check the normality
shapiro.test(my_data$weight) # => p-value = 0.6993

    Shapiro-Wilk normality test

data:  my_data$weight
W = 0.9526, p-value = 0.6993

From the output, the p-value is greater than the significance level 0.05, implying that the distribution of the data is not significantly different from the normal distribution. In other words, we can assume normality.

# Visual inspection of the data normality using Q-Q plots
library("ggpubr")
ggqqplot(my_data$weight, ylab = "Men's weight",
         ggtheme = theme_minimal())

Compute one-sample t-test

# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res 

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 

Interpretation of the result:

The p-value of the test is 7.953 × 10^-6, which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g with a p-value = 7.953 × 10^-6.

Access to the values returned by t.test() function

# printing the p-value
res$p.value
[1] 7.953383e-06
# printing the mean
res$estimate
mean of x 
    19.25 
# printing the confidence interval
res$conf.int
[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95

One-Sample Wilcoxon Signed Rank Test in R

What is it?

  • The one-sample Wilcoxon signed rank test is a non-parametric alternative to one-sample t-test when the data cannot be assumed to be normally distributed.
  • It’s used to determine whether the median of the sample is equal to a known standard value (i.e. theoretical value).

Research questions and statistical hypotheses:

  • is the median (m) of the sample equal to, less than, or greater than the theoretical value (m0)?

Visualize your data and compute one-sample Wilcoxon test in R

Install ggpubr R package for data visualization

  • already done that in the previous example, so I’ll just:
library (ggplot2)
library (ggpubr)

R function to compute one-sample Wilcoxon test:

  • wilcox.test()

Import your data into R

  • done that as well

Check your data

# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 

Visualize your data using box plots

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Compute one-sample Wilcoxon test

We want to know, if the average weight of the mice differs from 25g (two-tailed test)?

# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)
Warning in wilcox.test.default(my_data$weight, mu = 25): cannot compute exact
p-value with ties
# Printing the results
res 

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
# print only the p-value
res$p.value
[1] 0.005793045

Interpretation of the result:

The p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. We can reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g with a p-value = 0.005793.

Tip
  • Note that:
#if you want to test whether the median weight of mice is less than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "less")
Warning in wilcox.test.default(my_data$weight, mu = 25, alternative = "less"):
cannot compute exact p-value with ties

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.002897
alternative hypothesis: true location is less than 25
# Or, if you want to test whether the median weight of mice is greater than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "greater")
Warning in wilcox.test.default(my_data$weight, mu = 25, alternative =
"greater"): cannot compute exact p-value with ties

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.9979
alternative hypothesis: true location is greater than 25

Unpaired Two-Samples T-test in R

What is it?

  • The unpaired two-samples t-test is used to compare the mean of two independent groups.

For example, suppose that we have measured the weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB).

  • In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.

Research questions and statistical hypotheses

  • is the mean of group A (mA) equal to, less than, or greater than the mean of group B (mB)?

Visualize your data and compute unpaired two-samples t-test in R and Install ggpubr R package for data visualization and Import your data into R

  • done before

R function to compute unpaired two-samples t-test:

  • t.test()

Check your data

# Print all data
print(my_data)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2

Preliminary test to check independent t-test assumptions

  • Assumption 1: Are the two samples independent?
  • Assumption 2: Do the data from each of the 2 groups follow a normal distribution?
  • From the output, the two p-values are greater than the significance level 0.05, implying that the distribution of the data is not significantly different from the normal distribution. In other words, we can assume normality.

Compute unpaired two-samples t-test

  • Question : Is there any significant difference between women and men weights?
  1. Compute independent t-test - Method 1: The data are saved in two different numeric vectors.

  2. Compute independent t-test - Method 2: The data are saved in a data frame.

  • As you can see, the two methods give the same results (sketched below).
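Since the code for the two methods isn't reproduced here, a minimal sketch, reusing the hypothetical women_weight / men_weight vectors from the assumption checks above:

# Method 1: the data are saved in two different numeric vectors
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res

# Method 2: the data are saved in a data frame with a grouping column
weights_df <- data.frame(
  group  = rep(c("Woman", "Man"), each = 9),
  weight = c(women_weight, men_weight)
)
res <- t.test(weight ~ group, data = weights_df, var.equal = TRUE)
res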

Interpretation of the result:

The p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. We can conclude that men’s average weight is significantly different from women’s average weight with a p-value = 0.01327.

Access to the values returned by t.test() function

# printing the p-value
res$p.value
[1] 0.005793045
# printing the mean
res$estimate
NULL
# printing the confidence interval
res$conf.int
NULL

I didn't put all the code in this example, so some of it won't work, and that's OK; I get the general idea :)

Unpaired Two-Samples Wilcoxon Test in R

What is it?

  • The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.

Visualize your data and compute Wilcoxon test in R

R function to compute Wilcoxon test:

  • wilcox.test()

Import your data into R

Check your data
print(my_data)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2

Visualize your data using box plots (check the website)

Compute unpaired two-samples Wilcoxon test (check the website)
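The code is on the website rather than here; a sketch of what it could look like, assuming the weights_df data frame from the t-test sketch above:

# Box plots of weight by group
boxplot(weight ~ group, data = weights_df,
        xlab = "Group", ylab = "Weight", col = c("lightblue", "lightpink"))

# Unpaired two-samples Wilcoxon (Mann-Whitney) test
res <- wilcox.test(weight ~ group, data = weights_df)
res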

Interpretation of the result:

  • The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.

Comparing the means of paired samples

Paired samples t-test (parametric)

What is paired samples t-test?

  • The paired samples t-test is used to compare the means of two related groups of samples. In this case, you have two values (i.e., a pair of values) for the same samples.

As an example of data:

  • 20 mice received a treatment X during 3 months.
  • We want to know whether the treatment X has an impact on the weight of the mice.
  • To answer this question, the weight of the 20 mice has been measured before and after the treatment.
  • This gives us 20 values before treatment and 20 values after treatment, from measuring the weight of the same mice twice.

In such situations, the paired t-test can be used to compare the mean weights before and after treatment.

Research questions and statistical hypotheses

  • Is the mean difference (m) equal to, less than, or greater than 0?

Visualize your data and compute paired t-test in R

R function to compute paired t-test

  • t.test()

Import your data into R

  • done

Check your data (of course, this example really uses a different dataset, so we would have to import it, but you get the idea)

# Print all data
print(my_data)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2

Visualize your data using box plots

Preliminary tests to check paired t-test assumptions

  • Assumption 1: Are the two samples paired?

Yes, since the data have been collected from measuring twice the weight of the same mice.

  • Assumption 2: Is this a large sample?

No, because n < 30.

Tip

Since the sample size is not large enough (less than 30), we need to check whether the differences of the pairs follow a normal distribution.

  • How to check the normality?

Use Shapiro-Wilk normality test as described at: Normality Test in R.
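A minimal sketch of that check, using hypothetical before/after weights for ten mice (made-up values, not the actual course data):

# Hypothetical before/after weights for the same mice (made-up values)
before <- c(200.1, 190.9, 192.7, 213.0, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
after  <- c(392.9, 393.2, 345.1, 393.0, 434.0, 427.9, 422.0, 383.9, 392.3, 352.2)

# Shapiro-Wilk normality test on the paired differences
d <- before - after
shapiro.test(d)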

Compute paired samples t-test
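Using the hypothetical before/after vectors defined above, the paired test itself might look like this:

# Paired samples t-test (weight before vs. after treatment)
res <- t.test(before, after, paired = TRUE)
res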

Interpretation of the result

  • The p-value of the test is 6.2e-09, which is less than the significance level alpha = 0.05. We can then reject the null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment, with a p-value of 6.2e-09.

Access to the values returned by t.test() function

# printing the p-value
res$p.value
[1] 0.005793045
# printing the mean
res$estimate
NULL
# printing the confidence interval
res$conf.int
NULL

Paired samples Wilcoxon test (non-parametric)

Visualize your data and compute Wilcoxon test in R

R function to compute Wilcoxon test

  • wilcox.test()

Import your data into R

Check your data

print(my_data)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2

Visualize your data using box plots

Compute paired samples Wilcoxon test

  • Question: Are there any significant changes in the weights of the mice before and after treatment?
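A sketch of the test, again assuming the hypothetical before/after vectors from the paired t-test section:

# Paired samples Wilcoxon signed-rank test
res <- wilcox.test(before, after, paired = TRUE)
res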

Interpretation of the result

  • The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment with a p-value = 0.001953.

Comparing the means of more than two groups

One-way ANOVA test

What is one-way ANOVA test?

  • The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups.
  • In one-way ANOVA, the data are organized into several groups based on a single grouping variable (also called a factor variable).

How one-way ANOVA test works?

Import your data

Check your data

Visualize your data

Compute one-way ANOVA test

  • The R function aov() can be used to answer this question.
  • The function summary.aov() is used to summarize the analysis of variance model.
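The dataset isn't named in this part of the workbook; a minimal sketch using R's built-in PlantGrowth data (an assumption), which records plant weight under three experimental conditions:

# One-way ANOVA: plant weight across the three treatment groups
data("PlantGrowth")
res.aov <- aov(weight ~ group, data = PlantGrowth)
summary(res.aov)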

Interpretation of the result

  • As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.

Two-Way ANOVA Test in R

What is two-way ANOVA test?

  • Two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable.

Two-way ANOVA test hypotheses

  • There is no difference in the means of factor A
  • There is no difference in means of factor B
  • There is no interaction between factors A and B
  • The alternative hypothesis for cases 1 and 2 is: the means are not equal. The alternative hypothesis for case 3 is: there is an interaction between A and B.

Assumptions of two-way ANOVA test

  • Two-way ANOVA, like all ANOVA tests, assumes that the observations within each cell are normally distributed and have equal variances. We’ll show you how to check these assumptions after fitting ANOVA.

Compute two-way ANOVA test in R: balanced designs

  • Balanced designs correspond to the situation where we have equal sample sizes within the levels of our independent grouping variables.

Import your data into R

Check your data

Visualize your data

Compute two-way ANOVA test

  • The R function aov() can be used to answer this question.
  • The function summary.aov() is used to summarize the analysis of variance model.
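The interpretation below refers to supp, dose and tooth length, which matches R's built-in ToothGrowth data, so here is a sketch under that assumption:

# Two-way ANOVA: tooth length by supplement type and dose (ToothGrowth data)
data("ToothGrowth")
ToothGrowth$dose <- factor(ToothGrowth$dose)  # treat dose as a grouping factor
res.aov2 <- aov(len ~ supp * dose, data = ToothGrowth)
summary(res.aov2)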

Interpret the results

From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:

  • the p-value of supp is 0.000429 (significant), which indicates that the levels of supp are associated with significantly different tooth lengths.
  • the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significantly different tooth lengths.
  • the p-value for the supp*dose interaction is 0.02 (significant), which indicates that the relationship between dose and tooth length depends on the supp method.

MANOVA Test in R: Multivariate Analysis of Variance

What is MANOVA test?

  • In situations where there are multiple response variables, you can test them simultaneously using a multivariate analysis of variance (MANOVA). This section describes how to compute a MANOVA in R.

Import your data into R

Compute MANOVA test

  • The function manova() can be used as follows:
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look to see which differ
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpret the results

  • From the output above, it can be seen that the two variables are highly significantly different among Species.

Kruskal-Wallis test

What is Kruskal-Wallis test?

  • The Kruskal-Wallis test is a non-parametric alternative to the one-way ANOVA, used to compare more than two independent groups when the normality assumption is not met.

Import your data into R

Check your data

Visualize the data using box plots

Compute Kruskal-Wallis test

  • We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.
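A minimal sketch, assuming the same PlantGrowth data as in the one-way ANOVA example above:

# Kruskal-Wallis rank sum test: plant weight across the three groups
kruskal.test(weight ~ group, data = PlantGrowth)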

Interpret

  • As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.

End of task 7 :)

Week 8

Correlations + intro

Pre session: Read and try to reproduce the code in this chapter

# Data for Height and Weight (left plot)
height <- c(60, 62, 65, 67, 70, 72, 75)  # Heights in inches
weight1 <- c(115, 120, 130, 140, 155, 165, 180)  # Weights in pounds

# Data for Exercise and Weight (right plot)
exercise <- c(1, 2, 3, 4, 5, 6, 7)  # Hours of exercise per week
weight2 <- c(180, 170, 160, 150, 140, 130, 120)  # Weights in pounds

# Set up the plotting area to display two plots side by side
par(mfrow = c(1, 2))  # 1 row, 2 columns

# Left plot: Height vs. Weight
plot(height, weight1,
     main = "Height vs. Weight", 
     xlab = "Height (inches)", 
     ylab = "Weight (pounds)", 
     col = "blue", pch = 19, cex = 1.5)

# Right plot: Exercise vs. Weight
plot(exercise, weight2,
     main = "Exercise vs. Weight", 
     xlab = "Exercise (hours/week)", 
     ylab = "Weight (pounds)", 
     col = "green", pch = 19, cex = 1.5)

# Reset plotting layout
par(mfrow = c(1, 1))  # Restore to single plot layout
# Simulating some example data
set.seed(123)
height <- rnorm(100, mean = 65, sd = 5)  # Height in inches
weight <- rnorm(100, mean = 150, sd = 30)  # Weight in pounds
exercise <- rnorm(100, mean = 5, sd = 2)  # Hours of exercise per week

# Plot 1: Scatter plot of height vs weight
plot(height, weight, main="Height vs Weight", xlab="Height (inches)", ylab="Weight (pounds)", pch=19, col="blue")

# Plot 2: Scatter plot of exercise vs weight
plot(exercise, weight, main="Exercise vs Weight", xlab="Exercise Hours/Week", ylab="Weight (pounds)", pch=19, col="red")

# Sample data
height_group <- factor(c(1, 1, 2, 2, 3, 3, 4, 4))  # Categorical variable (height group)
weight <- c(130, 150, 160, 170, 180, 190, 200, 210)  # Quantitative variable (weight)

# Create the boxplot
boxplot(weight ~ height_group, main="Boxplot of Weight by Height Group",
        xlab="Height Group", ylab="Weight", col="lightblue")

  • they’re not exactly the same but close enough?

Post-session: Practice correlations https://is.muni.cz/el/phil/jaro2017/HUMB002/um/68004555/Exercise5_-_Correlation.pdf then update your workbook and submit it

  • Q1: C, has strongest association/ the one closest to either +1 or -1.
  • Q2: A, represents the strongest positive relationship among them.
  • Q3: D, the trend does not follow a straight line (an outlier exists).
  • Q4: a:d, b:e, c:a, d:b, e:c.
  • Q5:
  1. limited variability > reduces the strength of r.

  2. differences in distribution of the correlated variables > may not accurately reflect their relationship.

  3. outliers > can strongly distort r, either inflating or deflating the value depending on their position.

  4. using extreme groups as samples > can artificially inflate r.

  • Q6:
  1. taller > more weight > positive

  2. older > faster (less time) > negative

  3. if you're good at math, you're good at reading > positive

  4. negative

  5. 0 > no meaningful or logical relationship

  6. 0 > no meaningful or logical relationship

  7. positive > older higher/ newer lower

  8. positive (hotter more water)

  • Q7:

the correlation coefficient is going to change a bit in > strength > slightly less negative perhaps (the “age” variable > a direct measure of how long the car has been used, whereas “year of production” > an indirect calculation).
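Since the whole exercise is about r, here is a quick sketch of actually computing a correlation coefficient in R. It re-simulates height and weight (the earlier weight vector was overwritten by the box plot example), so the values are illustrative only:

# Re-simulate example data (independent draws, so r should be close to 0)
set.seed(123)
height <- rnorm(100, mean = 65, sd = 5)    # height in inches
weight <- rnorm(100, mean = 150, sd = 30)  # weight in pounds

# Pearson correlation coefficient
cor(height, weight)

# Correlation test with p-value and confidence interval
cor.test(height, weight)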

End of task 8 :)

Week 9

Linear models

Post-session formative exercise: link

1. load the dataset

View(mtcars)

2. create the linear model + summarize it

# Create a linear model
model <- lm(mpg ~ wt, data = mtcars)

3. visualize the model

# Plot data points
plot(mtcars$wt, mtcars$mpg, main = "Linear Model: mpg vs wt",
     xlab = "Weight (wt)", ylab = "Miles per Gallon (mpg)",
     pch = 19, col = "blue")

# Add regression line
abline(model, col = "red", lwd = 2)

4. View the model’s coefficients (intercept and slope)

# View coefficients
coef(model)
(Intercept)          wt 
  37.285126   -5.344472 
  • The slope (-5.344472) indicates that for every 1,000 lb increase in vehicle weight, the fuel efficiency decreases by approximately 5.34 miles per gallon.
  • Thus, heavier cars are less fuel-efficient, with the model quantifying this trend.
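To make the slope concrete, a quick sketch predicting mpg for a hypothetical car weighing 3,000 lb (wt = 3, since wt is in 1,000-lb units):

# Predicted mpg for a hypothetical 3,000 lb car
predict(model, newdata = data.frame(wt = 3))
# roughly 37.285 - 5.344 * 3 = 21.25 mpg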

5. Summarize the model

# Summary of the model
summary(model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

coefficients:

  • intercept = 37.2851 > When the vehicle weight (wt) is 0 (hypothetical), the predicted MPG is 37.2851.
  • slope = -5.3445 > For every 1,000-lb increase in vehicle weight, the MPG decreases by 5.3445.
  • P-value = 1.294e-10
  • Both coefficients are statistically significant, as indicated by the low p-values.

significance:

  • The p-value for wt (1.294e-10) is very small, indicating that the weight significantly affects MPG.

model fit:

  • Multiple R-squared = 0.7528
  • The model explains 75.28% of the variation in mpg.
  • Residual standard error = 3.046 on 30 degrees of freedom > The average deviation of the predicted MPG from the actual MPG.

for anova model

# ANOVA for the linear model
anova(model)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
wt         1 847.73  847.73  91.375 1.294e-10 ***
Residuals 30 278.32    9.28                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Degrees of Freedom (Df) = 1, Residuals = 30
  • Sum of Squares (Sum Sq) = 847.73 > indicates the variation in mpg explained by the weight. Residuals = 278.32 > represents unexplained variation.
  • Mean Squares (Mean Sq) = 847.73, Residuals = 9.28 > the mean square for the model (847.73) is much higher than for the residuals (9.28), reflecting the predictor’s significant contribution.
  • F-value = 91.375 > The high F-statistic indicates the model is a good fit.
  • p-value = 1.294e-10 > The relationship between wt and mpg is statistically significant.

end of data analysis task

Research methods task:

end of week 9 task :)

Week 10 logistic regression

about logistic regression

  • Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x).

  • It is used to model a binary outcome, that is a variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-diseased.

  • Logistic regression belongs to a family named Generalized Linear Models (GLM), developed to extend the linear regression model to other situations. Other synonyms are binary logistic regression, binomial logistic regression and logit model.

  • Logistic regression does not directly return the class of observations. It allows us to estimate the probability (p) of class membership.

  • The probability will range between 0 and 1. You need to decide the threshold probability at which the category flips from one to the other. By default, this is set to p = 0.5, but in practice it should be set based on the purpose of the analysis.

logistic regression function in R

  • glm()

Example:

  1. We install and load the packages
install.packages("caret") # caret for easy machine learning workflow
Installing package into 'C:/Users/renad/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'caret' successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package 'caret'
Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
C:\Users\renad\AppData\Local\R\win-library\4.4\00LOCK\caret\libs\x64\caret.dll
to C:\Users\renad\AppData\Local\R\win-library\4.4\caret\libs\x64\caret.dll:
Permission denied
Warning: restored 'caret'

The downloaded binary packages are in
    C:\Users\renad\AppData\Local\Temp\Rtmpeyw2DR\downloaded_packages
library(tidyverse) # tidyverse for easy data manipulation and visualization
library(caret) # caret for easy machine learning workflow
Warning: package 'caret' was built under R version 4.4.2
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
  2. We prepare the data
Tip

Logistic regression works for data that contain continuous and/or categorical predictor variables.

install.packages("mlbench")
Installing package into 'C:/Users/renad/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'mlbench' successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package 'mlbench'
Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
C:\Users\renad\AppData\Local\R\win-library\4.4\00LOCK\mlbench\libs\x64\mlbench.dll
to C:\Users\renad\AppData\Local\R\win-library\4.4\mlbench\libs\x64\mlbench.dll:
Permission denied
Warning: restored 'mlbench'

The downloaded binary packages are in
    C:\Users\renad\AppData\Local\Temp\Rtmpeyw2DR\downloaded_packages
library (mlbench)
Warning: package 'mlbench' was built under R version 4.4.2
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
    pregnant glucose pressure triceps insulin mass pedigree age diabetes
375        2     122       52      43     158 36.2    0.816  28      neg
290        5     108       72      43      75 36.1    0.263  33      neg
575        1     143       86      30     330 30.1    0.892  23      neg
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing logistic regression

Tip

The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells R that we want to fit a logistic regression.

Quick start R code

# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.053e+01  1.440e+00  -7.317 2.54e-13 ***
pregnant     1.005e-01  6.127e-02   1.640  0.10092    
glucose      3.710e-02  6.486e-03   5.719 1.07e-08 ***
pressure    -3.876e-04  1.383e-02  -0.028  0.97764    
triceps      1.418e-02  1.998e-02   0.710  0.47800    
insulin      5.940e-04  1.508e-03   0.394  0.69371    
mass         7.997e-02  3.180e-02   2.515  0.01190 *  
pedigree     1.329e+00  4.823e-01   2.756  0.00585 ** 
age          2.718e-02  2.020e-02   1.346  0.17840    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 398.80  on 313  degrees of freedom
Residual deviance: 267.18  on 305  degrees of freedom
AIC: 285.18

Number of Fisher Scoring iterations: 5
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

Simple logistic regression

  • The simple logistic regression is used to predict the probability of class membership based on one single predictor variable.

  • The following R code builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration:

model <- glm( diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose      0.04327234 0.005341133  8.101716 5.418949e-16
  • The output above shows the estimates of the regression beta coefficients and their significance levels. The intercept (b0) is -6.16 and the coefficient of the glucose variable is 0.043.

  • Predictions can easily be made using the function predict(). Use the option type = "response" to directly obtain the probabilities:

newdata <- data.frame(glucose = c(20,  180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
    1     2 
"neg" "pos" 
  • The logistic function gives an s-shaped probability curve, illustrated as follows:
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model", 
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
    )
`geom_smooth()` using formula = 'y ~ x'

Multiple logistic regression

  • The multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables, as follows:
model <- glm( diabetes ~ glucose + mass + pregnant, 
                data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose      0.03886154 0.005404219  7.190962 6.433636e-13
mass         0.09458458 0.023529905  4.019760 5.825738e-05
pregnant     0.14466661 0.045125729  3.205857 1.346611e-03
  • Here, we want to include all the predictor variables available in the data set. This is done using ~.:
model <- glm( diabetes ~., data = train.data, family = binomial)
summary(model)$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01
  • From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:
  1. Estimate: the intercept (b0) and the beta coefficient estimates associated to each predictor variable
  2. Std.Error: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.
  3. z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)
  4. Pr(>|z|): The p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate is
  • Note that the functions coef() and summary() can be used to extract only the coefficients, as follows:
coef(model)
  (Intercept)      pregnant       glucose      pressure       triceps 
-1.053400e+01  1.005031e-01  3.709621e-02 -3.875933e-04  1.417771e-02 
      insulin          mass      pedigree           age 
 5.939876e-04  7.997447e-02  1.329149e+00  2.718224e-02 
summary(model )$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01
  • It can be seen that only 3 of the 8 predictors are significantly associated with the outcome at the 0.05 level: glucose, mass and pedigree.

  • The coefficient estimate of the variable glucose is b = 0.037, which is positive. This means that an increase in glucose is associated with an increase in the probability of being diabetes-positive. However, the coefficient for the variable pressure is b = -0.0004, which is negative. This means that an increase in blood pressure is associated with a decreased probability of being diabetes-positive.

  • Here, as we have a small number of predictors (n = 8), we can manually select the most significant ones:

model <- glm( diabetes ~ pregnant + glucose + pressure + mass + pedigree, 
                data = train.data, family = binomial)

Making predictions

Tip

We’ll make predictions using the test data in order to evaluate the performance of our logistic regression model.

Steps:

  1. Predict the class membership probabilities of observations based on the predictor variables.
  2. Assign the observations to the class with the highest probability score (i.e., above 0.5).

  • The R function predict() can be used to predict the probability of being diabetes-positive, given the predictor values.

  • Predict the probabilities of being diabetes-positive:

probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
       19        21        32        55        64        71 
0.1352603 0.5127526 0.6795376 0.7517408 0.2734867 0.1648174 
  • Predict the class of individuals:
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
   19    21    32    55    64    71 
"neg" "pos" "pos" "pos" "neg" "neg" 

Assessing model accuracy

Tip

The model accuracy is measured as the proportion of observations that have been correctly classified. Inversely, the classification error is defined as the proportion of observations that have been misclassified.

  • Proportion of correctly classified observations:
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

The classification prediction accuracy is about 76%, which is good. The misclassification error rate is 24%.
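A quick extra check (not part of the original tutorial text): cross-tabulating predicted vs. observed classes shows where that 24% error comes from.

# Confusion matrix: predicted vs. observed diabetes status on the test set
table(Predicted = predicted.classes, Observed = test.data$diabetes)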

End of task