Talane’s Workbook

Author

Talane Bowne

Why Am I Here?

I am here to take research methods and data analysis as part of the MRes program for Endangered Species Recovery and Conservation program at Nottingham Trent University. My hopes are to acquire skills during this course and year to do bat conservation. Data analysis is a necessary skill for research.

Picture of cute bat as motivation to remember why I am here. Source link through image.

Picture of cute bat as motivation to remember why I am here. Source link through image.

Week One

This week entailed an introduction to the course to include downloading R and Rstudio as well as beginning to learn the basics of R. Once upon a time I had a brief introduction to R but have forgotten most of everything beyond installing the program and changing the mode to dark mode.

Screenshot of my RStudio in dark mode.

After setting up my RStudio to my chosen aesthetic I navigated the task of going through the resources posted on the module page. Using the videos and links provided I began the painful process of remembering how to function in RStudio again by practicing with data. As I have no data yet for my research project I decided to make my own by surveying myself on my feelings of starting this course. So the figure below is a Bar chart about how I feel at the start of this course with each category being rated on a scale of 0-10. The higher the rating the stronger the feeling.

Code
# Data about my feelings
data <- data.frame(
  Feelings=c("Need to Pass Course","R Dread","R Confidence") ,  
  Rating=c(10,8,1)
  )

# Bar plot of my feelings 
ggplot(data, aes(x=Feelings, y=Rating)) + 
  geom_bar(stat = "identity",fill="purple")+ 
  theme_dark() 

Week Two

Week two we are exploring more of how to start working with R and playing more with data. This week I will use an actual data set, specifically the penguins data set, and not just make up data about my feelings. The tasks for week two are as follows:

  • Work more with the penguin data
  • Insert a picture
  • Insert a video

Playing with Penguins…data

Professor Felipe encouraged us to look at the data first before we got too deep into creating code for it. This way we get used to checking the data for errors first before messing around with it and potentially getting defeated if there is simply an error with our data.

Code
penguins %>% glimpse() # Getting a small look at the data and see what variables are in the file, see the columns and some rows
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

After getting a glimpse at the data I can play around with it further by making a figures. Code retrieved from the week 3 power point and then altered to look at different variables, personalize the colors, add titles as well as labels to the figures.

Code
#| label: Histogram of penguin species' body mass

data("penguins")
penguins %>% 
group_by(species) %>% 
  ggplot(aes(x=body_mass_g, color=species, fill=species))+
  geom_histogram(bins = 33) + theme_dark() + scale_color_manual(values=c("orchid", "purple", "black")) + scale_fill_manual(values=c("orchid", "purple", "black")) + labs(x = "Body Mass (g)", 
       y = "Frequency", 
       title = "Penguin's Body Mass")

Code
#| label: Boxplot of penguin's bill length by sex and species

penguins %>% 
  na.omit() %>% 
  ggplot(aes(x=sex, 
             y = bill_length_mm,
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16)) + theme_dark() + 
  scale_color_manual(values=c("orchid", "purple", "black")) + scale_fill_manual(values=c("orchid", "purple", "black")) + labs(x = "Sex", 
       y = "Bill Length (mm)", 
       title = "Penguin's Bill Length by Sex")

Code
#| Label: Penguin species' bill length by body mass

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = body_mass_g,
             color=species, 
             fill=species))+
  geom_point() + theme_dark() + scale_color_manual(values=c("orchid", "purple", "black")) + scale_fill_manual(values=c("orchid", "purple", "black")) + theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16)) + labs(x = "Bill Length (mm)", 
       y = "Body Mass (g)", 
       title = "Penguin's Body Mass by Bill Length")

Add an Image

Another part of the workbook assignment this week was to add an image to the workbook. During week one I had placed an image in my workbook but I am not opposed to having an excuse to place more pictures of bats on my page.

Image from Bat Conservation International of the Congress Street Bridge bat colony in Austin, Texas, USA. Source link through image.

Image from Bat Conservation International of the Congress Street Bridge bat colony in Austin, Texas, USA. Source link through image.

Add a Video

The last part of the workbook exercise for this week was to add a video to our page. I would be remiss in providing pictures of the Congress Street Bridge colony in flight without attaching a video of the nightly flight. The video is also from Bat Conservation International and was posted to their Youtube page.

Week Three

This week is dedicated to getting used to wrangling/managing our data in R. For this exercise I will be following the prompts and code of R for Graduate Students by Y. Wendy Huynh, specifically the chapter 6.6.1 and 6.7 exercises.

Problem A

Code
midwest %>%  # indicates working with the midwest database and %>% is the pipe to simplify code
  group_by(state) %>% # group the data in this chunk by state to look at states separately
  summarize(poptotalmean = mean(poptotal), # the start of a nested function to create a summary, and new variables, this being the mean population, grouped by state. 
            poptotalmed = median(poptotal), # the population total median grouped by state. 
            popmax = max(poptotal), # creates the variable for maximum population entry grouped by state 
            popmin = min(poptotal), # creates the variable for minimum population entry grouped by state 
            popdistinct = n_distinct(poptotal), # number of unique totals of the all population totals grouped by state
            popfirst = first(poptotal), # the first population entry grouped by state
            popany = any(poptotal < 5000), # indicates true or false if the state has a population total entry less than 5000
            popany2 = any(poptotal > 2000000)) %>% # indicates true or false for it the state has a population total entry over 2000000.
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

Code
midwest %>% # indicates working with the midwest database and %>% is the pipe to simplify code
  group_by(state) %>% # group the data in this chunk by state to look at states separately
  summarize(num5k = sum(poptotal < 5000), # the start of nested function, shows the number of states that have counties with populations less than 5000
            num2mil = sum(poptotal > 2000000), # shows the number of states that have counties with populations more than 2000000
            numrows = n()) %>% # The total number of entries/counties grouped by state. 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

Part I

Code
midwest %>% # indicates working with the midwest database and %>% is the pipe to simplify code
  group_by(county) %>% # groups the data by county to look at county data separately 
  summarize(x = n_distinct(state)) %>% # shows how many states has the same county name
  arrange(desc(x)) %>% # arranges in descending order the column listing how many states share a county name 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

Part II

Code
# How does n() differ from n_distinct()? n() counts rows, not unique entries, while n_distinct() counts unique entries for variable. 
# When would they be the same? different? In this case they are the same because it's looking at how many rows that county occurs in and each state would list the county. In other functions it could be different, the n_distinct() is going to ignore duplicates and only count unique entries. 
midwest %>% # indicates working with the midwest database and %>% is the pipe to simplify code
  group_by(county) %>% # groups the data by county to look at county data separately 
  summarize(x = n()) %>% # the number of rows that county appears in the data 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Part III

Code
# hint: 
# - How many distinctly different counties are there for each county? 320 
# - Can there be more than 1 (county) county in each county? No, all same named county will be counted as 1 since it's not unique. 
# - What if we replace 'county' with 'state'? Yes as multiple states have counties with the same name as we've seen in previous chunks. 
midwest %>%  # indicates working with the midwest database and %>% is the pipe to simplify code
  group_by(county) %>% # groups the data by county to look at county data separately
  summarize(x = n_distinct(county)) %>% # shows only the unique county entries specifically in the county variable 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows

Problem D

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(clarity) %>% # groups the data by clarity to look at clarity data separately
  summarize(a = n_distinct(color), # number of unique entries for color in each clarity type 
            b = n_distinct(price), # number of unique entries for price for each clarity type 
            c = n()) %>% # lists the number of rows that exist for each clarity type
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

Part I

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(color, cut) %>% # groups the data by color and cut to look at data separated by both color and cut. 
  summarize(m = mean(price), # shows the mean price of the data when grouped by both color and cut. 
            s = sd(price)) %>% # shows the standard deviation of the price when grouped by both color and cut. 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Part II

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(cut, color) %>% # groups the data by cut and color to look at data separated by both cut and color. 
  summarize(m = mean(price), # shows the mean price of the data when grouped by both cut and color. Organization changes but not the mean from part I. 
            s = sd(price)) %>% # shows the standard deviation of the price when grouped by both cut and color. Organization changes but not the standard deviation from part I. 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows

Part III

Code
# - How good is the sale if the price of diamonds equaled msale? 20% off...? Especially the premium and ideal cuts have a noticeable price drop with the sale price. 
# - e.x. The diamonds are x% off original price in msale.
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(cut, color, clarity) %>% # groups the data by cut, color and clarity to look at data separated by both cut, color and clarity. 
  summarize(m = mean(price), # mean of the data when grouped by cut, color and clarity. 
            s = sd(price), # standard deviation of data when grouped by cut, color and clarity. 
            msale = m * 0.80) %>% # The "sale" price of 80% of the mean of the price when grouped by cut, color and clarity, 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(cut) %>% # groups the data by cut to look at cut data separately.
  summarize(potato = mean(depth), # potato is the column name being set for the mean of the depth of the diamond data being grouped by cut, 
            pizza = mean(price),  # pizza is the column name being set for the mean of the price of the diamond data being grouped by cut. 
            popcorn = median(y), # popcorn is column name being set for the median of the width of the diamonds data being grouped by cut.
            pineapple = potato - pizza, # pineapple is the column name set for the mean diamond depth - mean diamond price for the diamond dataset as grouped by cut. 
            papaya = pineapple ^ 2, # papaya is the column name set for the square of mean depth - the mean price both being grouped by cut. 
            peach = n()) %>% # 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

Part I

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(color) %>% # groups the data by color to look at color data separately.
  summarize(m = mean(price)) %>% # provides the mean of the prices as grouped by color. 
  mutate(x1 = str_c("Diamond color ", color), # adds the label, "diamond color," for the color vector. 
         x2 = 5) %>% #Serves no purpose until I find a resource or am told differently. Just makes stuff equal 5. 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Part II

What does the first ungroup() do? Is it useful here? Why/why not? Eliminated the repetitive code to not disrupt my pretty workboook, The answer is no, the ungroup() is not useful as it’s not at the end of the grouping code. The first ungroup just stops the grouping before the code is done using said group. Why isn’t there a closing ungroup() after the mutate()? I don’t know, The text says there should be, And at this point in my life with R I go by resources provided by professors and web searches.

Problem H

Part I

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(color) %>% # groups the data by color to look at color data separately.
  mutate(x1 = price * 0.5) %>% # create a new variable which will be 0.5 of the price based on the group by color. 
  summarize(m = mean(x1)) %>% # This line asks for the mean of half of the price of diamond data set grouped by color. 
  ungroup() # Ends the grouping as to not potentially cause issue with the data for further functions
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Part II

What’s the difference between part I and II? I didn’t put the code in a chunk because I didn’t want to wreck future script. I remember from the chapter that any group requires an ungroup, so the ungroup being before another function just ruins the resulting data.

diamonds %>% group_by(color) %>% mutate(x1 = price * 0.5) %>% ungroup() %>%
summarize(m = mean(x1))

6.6.1 Questions

Why is grouping data necessary? To specify parameters you want to evaluate.

Why is ungrouping data necessary? To clear the parameters you set, leaving the data unaffected and no longer altering further functions. Ungrouping removes the prior set parameters for the dataset you want to analyze.

When should you ungroup data? Any time you group data.

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? Not that I know, but I will likely look more into now. From what I know from the text, ungroup only works for functions that start with group_by.

6.7 Extra Practice

Question 1

Code
View(diamonds) # Opens up the dataset in a new tab to allow you to view the data and see all the variables. 

Question 2

Part I
Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  arrange(price) # arranges the data by the price column in ascending order, lowest to highest prices.
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
Part II
Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  arrange(desc(price))  # arranges the data by the price column in descending order, highest to lowest prices.
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
Part III
Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  arrange(price) %>% # arranges the data by the price column in ascending order, lowest to highest prices. The pipe in this line connects the actions of the next line 
  arrange(cut) # Arranges the data by the cut column in ascending order from worst to best cut, due to previous line it will be arranged by cut and then price.  
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows
Part IV
Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  arrange(desc(price)) %>% # arranges the data by the price column in descending order, highest to lowest prices. The pipe in this line connects the actions of the next line 
  arrange(desc(cut)) # Arranges the data by the cut column in descending order from best to worst cut, due to previous line it will be arranged by cut and then price.  
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows

Question 3

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  arrange(price) %>% # arranges the data by the price column in ascending order, lowest to highest prices. The pipe in this line connects the actions of the next line 
  arrange(clarity) # arranges the clarity in ascending order from worst to best. Previous line allows the data to be organized by clarity and then price. 
# A tibble: 53,940 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 2  0.32 Good    D     I1       64      54   361  4.33  4.36  2.78
 3  0.31 Premium F     I1       62.9    59   394  4.33  4.29  2.71
 4  0.34 Ideal   D     I1       62.5    57   413  4.47  4.49  2.8 
 5  0.34 Ideal   D     I1       61.4    55   413  4.5   4.52  2.77
 6  0.32 Premium E     I1       60.9    58   444  4.42  4.38  2.68
 7  0.43 Premium H     I1       62      59   452  4.78  4.83  2.98
 8  0.41 Good    G     I1       63.8    56   467  4.7   4.74  3.01
 9  0.32 Good    D     I1       64      54   468  4.36  4.33  2.78
10  0.32 Ideal   E     I1       60.7    57   490  4.45  4.41  2.69
# ℹ 53,930 more rows

Question 4

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  mutate(SalePrice = price - 250) # creates a new column showing the price - $250
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z SalePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

Question 5

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  select(-x, -y, -z) # removes the x, y, and z columns in the output. 
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

Question 6

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  group_by(cut) %>% # groups the data by the cut variable to look at the cut data separately 
  summarize(total= n()) %>% # shows a summary of how many rows/diamonds there are for each cut category
  ungroup() # undoes the grouping to not alter the date for further functions
# A tibble: 5 × 2
  cut       total
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

Question 7

Code
diamonds %>% # indicates working with the diamonds database and %>% is the pipe to simplify code.
  mutate(totalNum = n()) # adds a column showing the total number of diamonds in the data 
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

Research Methods

Come up with a good and bad research question based on the Diamonds dataset.

Bad Question

Do diamonds get bought and sold?

Good Question

What variables of diamonds, if any, impact the price of diamonds more or less than other variables?

Week Four

This week we are learning more about exploring out data. Particularly we are learning more about the different types of graphs we can produce and how utilizing the graphs can show us information about our data that we may not have known otherwise. Running our data through the different types of graphs can help us find the type of graph that answers our research questions the best or even helps you ask more questions.

Data Exploration

The following figures will be based off the video Data Visualization with R in 36 Minutes. The instructions were to reproduce the graphs and organize it in my own way. My way of making the graphs my own will mainly be making them dark themed with purple accents to stick with my aesthetic. I will also annotate the code to further ensure I know what the lines of code do.

Graph One, Original

Code
library(modeldata) # indicates the library we'll be using 

ggplot(crickets, aes(x = temp, y = rate)) + # creating a plot of the cricket data looking at rate by temperature 
  geom_point() + # indicates the figure is to be a scatter plot
  labs(x = "Temperature",  # allows you to make labels, x = is the x label
       y = "Chirp rate", # y = is the y axis label 
       title = "Cricket chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph One, My Version

Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) + # creating a plot of the cricket data looking at rate by temperature and making the species different colors 
  geom_point() + # indicates the figure is to be a scatter plot
  labs(x = "Temperature", # allows you to make labels, x = is the x label 
       y = "Chirp Rate", # y = is the y axis label 
       color = "Species", # this is the figure legend label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") + # adds a caption to the graph
      theme_dark() + # makes the background darker
  scale_color_manual(values=c("darkmagenta", "darkorchid4")) + # allows you to choose the colors for the outlines
      scale_fill_manual(values=c("darkmagenta", "darkorchid4")) # allows for choosing the fill colors 

Graph Two, Original

Code
ggplot(crickets, aes(x = temp, y = rate)) + # creating a plot of the cricket data looking at rate by temperature
  geom_point(color = "red", # indicates the figure is to be a scatter plot, makes the data points all red 
     size = 2, # Changes the size of the data points 
     alpha = .3, # changes the transparency of the data points 
     shape = "square") + # changes the shape of the data points 
  labs(x = "Temperature", # allows you to make labels, x = is the x label 
       y = "Chirp rate", # y = is the y axis label
       title = "Cricket chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Two, My Version

Code
ggplot(crickets, aes(x = temp, y = rate)) + # creating a plot of the cricket data looking at rate by temperature
  geom_point(aes(color = species),  # indicates the figure is to be a scatter plot, the aes allows me to separate the data points by species. 
     size = 3, # Changes the size of the data points 
     alpha = .4, # changes the transparency of the data points 
     shape = "diamond") + # changes the shape of the data points 
  theme_dark() + # makes the background darker
  labs(x = "Temperature", # allows you to make labels, x = is the x label 
       y = "Chirp Rate", # y = is the y axis label 
       title = "Cricket Chirps", # adds a title to the graph
       color = "Species",  # this is the figure legend label
       caption = "Source: McDonald (2009)") + # adds a caption to the graph
  scale_color_manual(values=c("darkmagenta", "darkorchid4")) + # allows you to choose the colors for the outlines
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) # allows for choosing the fill colors 

Graph Three, Original

Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) + # creating a plot of the cricket data looking at rate by temperature and making the species different colors 
  geom_point() +  # indicates the figure is to be a scatter plot
  geom_smooth(method = "lm", # adds a regression line
              se = FALSE) + # ignores the standard error
  labs(x = "Temperature", # allows you to make labels, x = is the x label 
       y = "Chirp rate", # y = is the y axis label 
       color = "Species", # this is the figure legend label
       title = "Cricket chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") + # adds a caption to the graph
  scale_color_brewer(palette = "Dark2") # makes the theme a high contrast color palette 

Graph Three, My Version

Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) + # creating a plot of the cricket data looking at rate by temperature and making the species different colors 
  geom_point(size = 3, # indicates the figure is to be a scatter plot, size changes the size of the data points
             shape = "diamond") + # changes the shape of the data points 
  geom_smooth(method = "lm", # adds a regression line
              se = FALSE) + # ignores the standard error
  theme_dark() + # makes the background darker
  scale_color_manual(values=c("darkmagenta", "darkorchid4")) + # allows you to choose the colors for the outlines
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) + # allows for choosing the fill colors
  labs(x = "Temperature", # allows you to make labels, x = is the x label 
       y = "Chirp Rate", # y = is the y axis label
       color = "Species", # this is the figure legend label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Four, Original

Code
ggplot(crickets, aes(x = rate)) + # Creating a graph of the cricket data with the x value being the rate of cricket chirps. 
  geom_histogram(bins = 15) # indicates the graph is to be a histogram and sets the bins to 15 

Graph Four, My Version

Code
ggplot(crickets, aes(x = rate)) + # Creating a graph of the cricket data with the x value being the rate of cricket chirps. 
  geom_histogram(color = "black", fill="purple", bins = 15) + # indicates the graph is to be a histogram, customizes the color, fill color,  and sets the bins to 15 
  theme_dark() + # makes the background darker
  labs(x = "Chirp Rate", # x = is the x axis label
       y = "Frequency",  # y = is the y axis label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Five, Original

Code
ggplot(crickets, aes(x = species, fill = species)) + # Creating a graph of the cricket data with the x value being the the species of cricket, and the color fill indicating to separate the date by species and use different colors for the species 
  geom_bar(show.legend = FALSE) + # indicates to make a bar chart and to not show the icon legend 
  scale_fill_brewer(palette = "Dark2") # makes the theme a high contrast color palette 

Graph Five, My Version

Code
ggplot(crickets, aes(x = species, fill = species)) + # Creating a graph of the cricket data with the x value being the the species of cricket, and the color fill indicating to separate the date by species and use different colors for the species 
  geom_bar(show.legend = FALSE) + # indicates to make a bar chart and to not show the icon legend 
  theme_dark() + # makes the background darker
scale_color_manual(values=c("darkmagenta", "darkorchid4")) + # allows you to choose the colors for the outlines
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) + # allows for choosing the fill colors
labs(x = "Species", # x = is the x axis label
       y = "Observed", # y = is the y axis label
       title = "Crickets", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Six, Original

Code
ggplot(crickets, aes(x = rate)) + # Creating a graph of the cricket data with the x value being the the rate of cricket chirps 
  geom_freqpoly(bins = 15) # shows the same data as the bar chart but as a connected line, makes the bins 15. 

Graph Six, My Version

Code
ggplot(crickets, aes(x = rate)) + # Creating a graph of the cricket data with the x value being the the rate of cricket chirps 
  geom_freqpoly(bins = 15, color = "darkorchid4") + # shows the same data as the chirp rate bar chart but as a connected line, makes the bins 15, color sets the color of the line 
  theme_dark() + # makes the background darker
  labs(x = "Chirp Rate", # x = is the x axis label
       y = "Observed", # y = is the y axis label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Seven, Original

Code
ggplot(crickets, aes(x = species, y = rate, color = species)) + # sets the data to be used, sets the x and y values, sets the date to be separated by species. 
  geom_boxplot(show.legend = FALSE) + # indicates to make a boxplot and to not show the legend
  scale_color_brewer(palette = "Dark2") + # makes the theme a high contrast color palette 
  theme_minimal() # sets the theme for the graph

Graph Seven, My Version

Code
ggplot(crickets, aes(x = species, y = rate, color = species, fill = species)) + # sets the data to be used, sets the x and y values, sets the date to be separated by species. 
  geom_boxplot(show.legend = FALSE) + # indicates to make a boxplot and to not show the legend
  theme_dark () + # makes the background darker
  scale_color_manual(values=c("black", "black")) + # allows you to choose the colors for the outlines
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) + # allows for choosing the fill colors
  labs(x = "Species", # x = is the x axis label
       y = "Chirp Rate", # y = is the y axis label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Graph Eight, Original

Code
ggplot(crickets, aes(x = rate, fill = species)) + # sets the data to be used, sets the x and y values, sets the fill values to separate the data by species. 
  geom_histogram(bins = 15, show.legend = FALSE) + # indicates to create a histogram graph and set the bins to 15. Also turns off the figure legend, 
  facet_wrap(~species, # creates separate figures per species in the data 
             ncol = 1) + # sets there to be one column
  scale_fill_brewer(palette = "Dark2") + # makes the theme a high contrast color palette
  theme_minimal() # sets the theme for the graph

Graph Eight, My Version

Code
ggplot(crickets, aes(x = rate, Color = species, fill = species)) +  # sets the data to be used, sets the x and y values, sets the fill values to separate the data by species. 
  geom_histogram(bins = 15, show.legend = FALSE) + # indicates to create a histogram graph and set the bins to 15. Also turns off the figure legend, 
  facet_wrap(~species, # creates separate figures per species in the data 
             ncol = 1) + # sets there to be one column
  theme_dark() + # makes the background darker
  scale_color_manual(values=c("black", "black")) + # allows you to choose the colors for the outlines
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) + # allows for choosing the fill colors
  labs(x = "Species", # x = is the x axis label
       y = "Chirp Rate", # y = is the y axis label
       title = "Cricket Chirps", # adds a title to the graph
       caption = "Source: McDonald (2009)") # adds a caption to the graph

Research Methods

What is a good hypothesis?

A good hypothesis is one that comes about first by making observations and having some sort of background knowledge of what you are looking to research. A good hypothesis is not one that you can just form without any experience or insight into what you would like to research. In order to gather evidence to support a good hypothesis, the hypothesis has to be one that can also be rejected. You cannot start a good research project with a good hypothesis and the only outcome is that the hypothesis will be supported. A good hypothesis will also provide insight from being tested. The insight gained from testing a hypothesis can range from having evidence to support your hypothesis to asking a knew question based on the findings. If you start your experiment with a good foundation, a good hypothesis, then the knowledge gained from testing will typically be useful.

Week Five

Oof okay, so this week has me having to really stop myself from hiding under the covers, terrified of the stats monsters in my wardrobe. It has been a hot minute since I took stats 101 and this reminded me of how much I forgot. Not even speaking about the first time I was supposed to learn about R, those are the under the bed monsters…we don’t talk about them. Luckily, I can acknowledge that while this is an initial knowledge dump, this is still an introduction to each of the different statistical tests and there will be more time to get re-familiarized with them. This week the quest for the workbook is to identify the statistical tests associated with the different graphs provided and attempt to recreated them. Both of the tasks will be combined into the one subsection for each graph as I will try to recreate it and then speak about the tests I think could work with the figure. All graphs will be recreated based upon the Iris data.

Graph One

Code
data("iris")
iris %>% 
  ggplot(aes(x=Species, 
             y = Sepal.Length,
             color=Species, fill=Species))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16)) + theme_dark() + 
  scale_color_manual(values=c("orchid", "purple", "black")) + scale_fill_manual(values=c("orchid", "purple", "black")) + labs(x = "Species", 
       y = "Sepal Length (mm)", 
       title = "Iris Sepal Length by Species")

In graph one the x variable is a category and the y value is numerical. The graph looks at how the species affect the sepal length. The graph illustrates the differences between the 3 iris species. The boxy plot also shows the distribution of the data for each species category. For each category the box and whiskers shows the minimum, maximum, median, 1st quartile, 3rd quartile and the dot is the outlier of the data. The predictor variable, the x variable, is a category and the outcome variable, y variable, is a numeric which means that a mean comparison test is indicated. Only one outcome variable is shown on this graph for this question so an ANOVA test should be suitable.

Graph Two

Code
data("iris")
iris %>% 
ggplot(aes(x = Petal.Length, color=Species, fill=Species)) +
  geom_density(alpha = 0.7) +
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16)) + theme_dark() + 
  scale_color_manual(values=c("orchid", "purple", "black")) + scale_fill_manual(values=c("orchid", "purple", "black")) +
  labs(title = "Density Plot of Iris Species Petal Lengths",
       x = "Petal Length",
       y = "Density")

Graph two is a density plot and is a way of visualizing the distribution of the data. The x value is petal length which is a numerical variable. The density is a frequency variable. The graph looks at the differences of petal length between the 3 species. The data does not appear to have a normal distribution. The data is not paired and as the outcome being ordinal a Mann–Whitney U test and Kruskal–Wallis test should be suitable teats for this question.

Graoh Three

Code
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color=Species)) + 
  geom_point(aes(shape=Species, color=Species), size = 3) +
  scale_shape_manual(values=c("diamond", "triangle", "circle")) +
  geom_smooth(method = "lm", color = "darkmagenta") + 
  theme_dark() + 
  scale_color_manual(values=c("orchid", "purple", "black")) + 
  scale_fill_manual(values=c("orchid", "purple", "black")) + 
  labs(x = "Petal Length", 
       y = "Petal Width", 
       color = "Species", 
       title = "Iris Species Petal Length by Width")

Both the x an y variables for graph three are numerical and the scatter plot looks at the relationship between petal length and petal width. While the data is separated by species the only predictor variable being looked at with this question is petal length with is continuous. Because of this a simple regression is a good test for this data. With this linear model we have the regression line and the shadow behind it is the standard error. Pearson correlation test could also be done to examine if the variables have a linear relationship.

Graph Four

Code
data(iris)
iris.new <-
mutate(iris, size=ifelse(Sepal.Length < median(Sepal.Length),
  "small", "big")) 
  ggplot(iris.new, aes(x = Species, color=size, fill=size)) +
  geom_bar(position="dodge") + 
  theme_dark() + 
scale_color_manual(values=c("darkmagenta", "darkorchid4")) + 
  scale_fill_manual(values=c("darkmagenta", "darkorchid4")) +
  labs(x = "Species", 
       y = "Count", 
       title = "Sepal Size of Iris Species")

Graphs four looks at the frequency small and large sepal size of the 3 iris species. The sepal size was a numerical variable that we made into two categories based on that numerical data. The graph looks at how frequently the 3 iris species were observed to have either big or small sepals. The graph shows the differences of sepal sizes between the 3 species. Given that we have created a binary for the data a Chi-Squared test would be an appropriate to assess if the differences are due to the relationship of the species variable.

Week Six

Week six we did a group project as a part of an exercise in picking the appropriate statistical analyses for the data sets given, creating questions for the data as well as annotating code for the data that we worked with for make figures and tables. My group was with Hephzibah, Tamsin, Alice, Marrisa and Ellelouise. The document was submitted via Tamsin.

We also learned about spotting the difference between good and bad titles for research papers. As well as going into more detailed about different frequency tests.

Week Seven

This week was focused around what makes a good and bad abstract with the formative assignment being to create our own good titles and abstracts for a paper provided. The other part of this week was around the different types of mean tests we can do. Admittedly I am behind on doing the post sessions for the statistical analyses in my workbook for weeks 6 and 7. I am American, and while I take responsibility for not doing the post sessions in a timely manner…I got stressed and depressed this week with the election then election results that I just fell off track. Now it’s just up to me to make up for it in the following weeks to catch up.

Title

How Santa Cruz Galapagos Tortoises Utilize Three Types of Farmlands During Migration

Authors

Kyana N. Pike, Stephen Blake, Iain J. Gordon & Lin Schwarzkopf

Affiliations

James Cook University; Galapagos National Park; Saint Louis University; Max Planck Institute for Animal Behaviour; Wildcare Institute; Central Queensland University; James Hutton Institute; The Australian National University; CSIRO; Protected Places Initiative

Abstract

Western Santa Cruz Galapagos tortoises (Chelonoidis porteri), residing on Santa Cruz Island, are critically endangered ecosystem engineers. The tortoises migrate during the dry season to the highlands which are around 88% agricultural land, be it abandoned, used for livestock or repurposed for tourism. This study evaluated behaviors the tortoises exhibit in the different agricultural habitats in terms of time eating, resting and walking. Studied in 2019 with 114 observations made in the wet season, March-May, and 128 observations in the dry season, November-December. Tortoise behaviors recorded from 0630-1230 by an observer 5-15m away. Once spotted all actions and length of actions were logged for 30 minutes. Habitat characteristics also noted in a 1m² area with vegetation coverage, density and mean height recorded. Thermal images utilized to obtain temperatures of the tortoises and ground. Sex of the tortoise and curved carapace length also obtained. Indications in the analyses showed that resting time had a strong relationship with land type and temperature. Walking time was most affected by vegetation height. The abandoned agricultural land was least suitable for migratory behaviors of the tortoise as it had the most shade, lower temperatures and tall vegetation. The touristic locations were the most ideal for eating and the level of disturbance was suitable for walking and eating. The livestock area was theorized as having potential for conservation efforts such as increasing areas of shade, cutting to the tortoises’ ideal height and rotating the stocking of livestock could create ideal habits for the tortoises’ migratory behaviors.

Keywords

Chelonoidis porteri; agriculture; land sharing; behavior; activity; habitat quality

Week Six Catch Up

After much delay, procrastination, internal protest and mental blockage I am finally catching up on the post sessions. The first part is putting into practice with R tests to compare proportions via t-tests and Chi-squared tests. Mosrt work recreated via STHDA.com chapters.

Part One

Running the examples from the website explaining the different types of t-tests and Chi-squared

One Proportion Z Test

Code
# per the text a z test looks at observed proportional data and compares it to theoretical proportions. The example of the chapter of one proportion test uses data about mize and if a higher proportion of one gender is experiencing cancer
res <- prop.test(x = 95, n = 160, p = 0.5, 
                 correct = FALSE)
# Printing the results
res 

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5163169 0.6667870
sample estimates:
      p 
0.59375 
Code
# printing the p-value
res$p.value
[1] 0.01770607
Code
# printing the mean
res$estimate
      p 
0.59375 
Code
# printing the confidence interval
res$conf.int
[1] 0.5163169 0.6667870
attr(,"conf.level")
[1] 0.95
Code
# tests is male mice with cancer is less than 0.5 
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE,
           alternative = "less")

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000000 0.6555425
sample estimates:
      p 
0.59375 
Code
# tests is male mice with cancer is more than 0.5 
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE,
           alternative = "less")

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000000 0.6555425
sample estimates:
      p 
0.59375 

Two Proportion Z Test

Code
# Compares two proportions. Both being observed data. This data set used is meant to represent two groups, those with cancer and those healthy, then compare those who smoke and those who do not within each category. 

res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res 

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80 
Code
# printing the p-value
res$p.value
[1] 2.363439e-19
Code
# printing the mean
res$estimate
prop 1 prop 2 
  0.98   0.80 
Code
# printing the confidence interval
res$conf.int
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
Code
# tests to see if there are less smokers with cancer than there are without

prop.test(x = c(490, 400), n = c(500, 500),
           alternative = "less")

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value = 1
alternative hypothesis: less
95 percent confidence interval:
 -1.0000000  0.2131742
sample estimates:
prop 1 prop 2 
  0.98   0.80 
Code
# tests to see if there are more smokers with cancer than there are without

prop.test(x = c(490, 400), n = c(500, 500),
              alternative = "greater")

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: greater
95 percent confidence interval:
 0.1468258 1.0000000
sample estimates:
prop 1 prop 2 
  0.98   0.80 

Chi-square Goodness of Fit

Code
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
Code
# Access to the expected values
res$expected
[1] 52.66667 52.66667 52.66667
Code
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 0.20253, df = 2, p-value = 0.9037
Code
# printing the p-value
res$p.value
[1] 0.9036928
Code
# the observed is not significantly different meaning what was obseved is close to expected
Code
# printing the mean
res$estimate
NULL

Chi-Square Test of Independence

Code
# Import the data
file_path <- "https://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# head(housetasks)
Code
#| Label: Chi-square; contingency table 
# looks for an association between to variable's categories 

library("gplots")
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Code
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

Code
library("vcd")
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)

Code
chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
Code
# Observed counts
chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
Code
round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556
Code
# Contibution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445
Code
# Visualize the contribution

library("corrplot")

corrplot(contrib, is.cor = FALSE)

Code
# printing the p-value
chisq$p.value
[1] 0
Code
# printing the mean
chisq$estimate
NULL

Part Two: Practice

Code
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)

mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%
  subset(select=c("class","cyl","prop"))%>%   #drop the frequency value
  spread(class, prop)%>%
  kable()
cyl 2seater compact midsize minivan pickup subcompact suv
4 NA 0.6808511 0.3902439 0.0909091 0.0909091 0.6000000 0.1290323
5 NA 0.0425532 NA NA NA 0.0571429 NA
6 NA 0.2765957 0.5609756 0.9090909 0.3030303 0.2000000 0.2580645
8 1 NA 0.0487805 NA 0.6060606 0.1428571 0.6129032
Code
# Load required libraries
library("gplots")
library("ggplot2") # for the mpg dataset

# summary
mpg_summary <- table(mpg$class, mpg$manufacturer)

# format
dt <- as.table(as.matrix(mpg_summary))

# balloon plot
balloonplot(t(dt), 
            main = "Distribution of Car Classes by Manufacturer", 
            xlab = "Manufacturer", 
            ylab = "Class", 
            label = FALSE, 
            show.margins = FALSE)

Code
library("ggplot2")  
library("graphics") 

# contingency table 
dt <- table(mpg$class, mpg$manufacturer)

# mosaic plot
mosaicplot(dt, 
           shade = TRUE,    
           las = 2,         
           main = "Distribution of Car Classes by Manufacturer")

Code
library("vcd")
library("ggplot2") 

# contingency table
dt <- table(mpg$class, mpg$manufacturer)

# Table subswt, first 5 rows 
dt_subset <- head(dt, 5)

#  zero-only rows or columns
dt_subset <- dt_subset[apply(dt_subset, 1, sum) > 0, apply(dt_subset, 2, sum) > 0]

# association plot
assoc(dt_subset, 
      shade = TRUE,  
      las = 3,       
      main = "Subset of Car Classes by Manufacturer")

Code
dt <- table(mpg$class, mpg$manufacturer)

chisq <- chisq.test(dt)# chi-squared test
chisq #Display

    Pearson's Chi-squared test

data:  dt
X-squared = 464.37, df = 84, p-value < 2.2e-16
Code
# Observed counts
chisq$observed
            
             audi chevrolet dodge ford honda hyundai jeep land rover lincoln
  2seater       0         5     0    0     0       0    0          0       0
  compact      15         0     0    0     0       0    0          0       0
  midsize       3         5     0    0     0       7    0          0       0
  minivan       0         0    11    0     0       0    0          0       0
  pickup        0         0    19    7     0       0    0          0       0
  subcompact    0         0     0    9     9       7    0          0       0
  suv           0         9     7    9     0       0    8          4       3
            
             mercury nissan pontiac subaru toyota volkswagen
  2seater          0      0       0      0      0          0
  compact          0      2       0      4     12         14
  midsize          0      7       5      0      7          7
  minivan          0      0       0      0      0          0
  pickup           0      0       0      0      7          0
  subcompact       0      0       0      4      0          6
  suv              4      4       0      6      8          0
Code
# Expected counts
round(chisq$expected,2)
            
             audi chevrolet dodge ford honda hyundai jeep land rover lincoln
  2seater    0.38      0.41  0.79 0.53  0.19    0.30 0.17       0.09    0.06
  compact    3.62      3.82  7.43 5.02  1.81    2.81 1.61       0.80    0.60
  midsize    3.15      3.33  6.48 4.38  1.58    2.45 1.40       0.70    0.53
  minivan    0.85      0.89  1.74 1.18  0.42    0.66 0.38       0.19    0.14
  pickup     2.54      2.68  5.22 3.53  1.27    1.97 1.13       0.56    0.42
  subcompact 2.69      2.84  5.53 3.74  1.35    2.09 1.20       0.60    0.45
  suv        4.77      5.03  9.80 6.62  2.38    3.71 2.12       1.06    0.79
            
             mercury nissan pontiac subaru toyota volkswagen
  2seater       0.09   0.28    0.11   0.30   0.73       0.58
  compact       0.80   2.61    1.00   2.81   6.83       5.42
  midsize       0.70   2.28    0.88   2.45   5.96       4.73
  minivan       0.19   0.61    0.24   0.66   1.60       1.27
  pickup        0.56   1.83    0.71   1.97   4.79       3.81
  subcompact    0.60   1.94    0.75   2.09   5.09       4.04
  suv           1.06   3.44    1.32   3.71   9.01       7.15
Code
library(corrplot)
corrplot(chisq$residuals, is.cor = FALSE)

Code
# Contibution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
            
               audi chevrolet  dodge   ford  honda hyundai   jeep land rover
  2seater     0.083    11.195  0.170  0.115  0.041   0.064  0.037      0.018
  compact     7.720     0.822  1.600  1.081  0.389   0.606  0.346      0.173
  midsize     0.002     0.181  1.396  0.943  0.340   1.815  0.302      0.151
  minivan     0.182     0.192 10.618  0.253  0.091   0.142  0.081      0.040
  pickup      0.547     0.577  7.839  0.737  0.273   0.425  0.243      0.121
  subcompact  0.580     0.612  1.192  1.594  9.371   2.475  0.258      0.129
  suv         1.027     0.673  0.173  0.184  0.514   0.799  3.513      1.756
            
             lincoln mercury nissan pontiac subaru toyota volkswagen
  2seater      0.014   0.018  0.060   0.023  0.064  0.156      0.124
  compact      0.130   0.173  0.031   0.216  0.108  0.843      2.921
  midsize      0.113   0.151  2.108   4.180  0.528  0.039      0.234
  minivan      0.030   0.040  0.132   0.051  0.142  0.344      0.273
  pickup       0.091   0.121  0.395   0.152  0.425  0.218      0.820
  subcompact   0.097   0.129  0.419   0.161  0.374  1.095      0.205
  suv          1.317   1.756  0.019   0.285  0.305  0.024      1.541
Code
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)

Code
# printing the p-value
chisq$p.value
[1] 5.267718e-54
Code
# printing the mean
chisq$estimate
NULL

Week Seven Catch Up

My aesthetic choices are thrown to the wayside as I look to work on practicing the post session content to know how to apply the content for future projects. this week is predominantly a comparison of means, various types of circumstances which require you to compare means. All work recreated via STHDA.com chapters.

One sample T-test

Compare an observed/known mean to a standard/hypothetical mean

Code
set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)
Code
# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
Code
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
Code
library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Code
shapiro.test(my_data$weight) # => p-value = 0.6993

    Shapiro-Wilk normality test

data:  my_data$weight
W = 0.9526, p-value = 0.6993
Code
library("ggpubr")
ggqqplot(my_data$weight, ylab = "Men's weight",
         ggtheme = theme_minimal())

Code
# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res 

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 
Code
# printing the p-value
res$p.value
[1] 7.953383e-06
Code
# printing the mean
res$estimate
mean of x 
    19.25 
Code
# printing the confidence interval
res$conf.int
[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95
Code
# check if mean weight is less than
t.test(my_data$weight, mu = 25,
              alternative = "less")

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
     -Inf 20.41105
sample estimates:
mean of x 
    19.25 
Code
# check if mean weight is more than 
t.test(my_data$weight, mu = 25,
              alternative = "greater")

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 1
alternative hypothesis: true mean is greater than 25
95 percent confidence interval:
 18.08895      Inf
sample estimates:
mean of x 
    19.25 

One-Sample Wilcoxon Signed Rank Test

One sampled t-test alternative, non-parametric, no normal distribution assumed.

Code
set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)
Code
# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
Code
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
Code
library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Code
# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)
# Printing the results
res 

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
Code
# print only the p-value
res$p.value
[1] 0.005793045
Code
wilcox.test(my_data$weight, mu = 25,
              alternative = "less")

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.002897
alternative hypothesis: true location is less than 25
Code
wilcox.test(my_data$weight, mu = 25,
              alternative = "greater")

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.9979
alternative hypothesis: true location is greater than 25

Unpaired Two-Sample T-test

Compares the means of two separate groups.

Code
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
  group = rep(c("Woman", "Man"), each = length(women_weight)),
  weight = c(women_weight, men_weight)
)

# View the data frame
print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4
Code
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group count  mean    sd
  <chr> <int> <dbl> <dbl>
1 Man       9  69.0  9.38
2 Woman     9  52.1 15.6 
Code
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("purple", "orchid"),
        ylab = "Weight", xlab = "Groups")

Code
# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))# p = 0.1

    Shapiro-Wilk normality test

data:  weight[group == "Man"]
W = 0.86425, p-value = 0.1066
Code
# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"])) # p = 0.6

    Shapiro-Wilk normality test

data:  weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
Code
# Compute t-test
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res

    Two Sample t-test

data:  women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.748019  -4.029759
sample estimates:
mean of x mean of y 
 52.10000  68.98889 
Code
# printing the p-value
res$p.value
[1] 0.0132656
Code
# printing the mean
res$estimate
mean of x mean of y 
 52.10000  68.98889 
Code
# printing the confidence interval
res$conf.int
[1] -29.748019  -4.029759
attr(,"conf.level")
[1] 0.95
Code
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "less")

    Two Sample t-test

data:  weight by group
t = 2.7842, df = 16, p-value = 0.9934
alternative hypothesis: true difference in means between group Man and group Woman is less than 0
95 percent confidence interval:
     -Inf 27.47924
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 
Code
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "greater")

    Two Sample t-test

data:  weight by group
t = 2.7842, df = 16, p-value = 0.006633
alternative hypothesis: true difference in means between group Man and group Woman is greater than 0
95 percent confidence interval:
 6.298536      Inf
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 

Unpaired Two-Samples Wilcoxon Test

Also known as the Mann Whitney, non parametric alternative for the two sample, unpaired t-test, does not assume normal distribution, independent samples.

Code
res <- wilcox.test(women_weight, men_weight)
res

    Wilcoxon rank sum test with continuity correction

data:  women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
Code
res <- wilcox.test(weight ~ group, data = my_data,
                   exact = FALSE)
res

    Wilcoxon rank sum test with continuity correction

data:  weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
Code
# Print the p-value only
res$p.value
[1] 0.02711657
Code
wilcox.test(weight ~ group, data = my_data, 
        exact = FALSE, alternative = "less")

    Wilcoxon rank sum test with continuity correction

data:  weight by group
W = 66, p-value = 0.9892
alternative hypothesis: true location shift is less than 0
Code
wilcox.test(weight ~ group, data = my_data,
        exact = FALSE, alternative = "greater")

    Wilcoxon rank sum test with continuity correction

data:  weight by group
W = 66, p-value = 0.01356
alternative hypothesis: true location shift is greater than 0

Paired Samples T-test

Compares the means of 2 related groups.

Code
# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )
Code
# Print all data
print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2
Code
library("dplyr")
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group  count  mean    sd
  <chr>  <int> <dbl> <dbl>
1 after     10  394.  29.4
2 before    10  199.  18.5
Code
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("purple", "orchid"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

Code
# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()

Code
# compute the difference
d <- with(my_data, 
        weight[group == "before"] - weight[group == "after"])
# Shapiro-Wilk normality test for the differences
shapiro.test(d) # => p-value = 0.6141

    Shapiro-Wilk normality test

data:  d
W = 0.94536, p-value = 0.6141
Code
# Compute t-test
res <- t.test(before, after, paired = TRUE)
res

    Paired t-test

data:  before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean difference 
        -194.49 

Paired Samples Wilcoxon Test in R

Non-parametric method for a paired t-test alternative.

Code
res <- wilcox.test(before, after, paired = TRUE)
res

    Wilcoxon signed rank exact test

data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

One-Way ANOVA Test

Like an unpaired t-test, but more than two groups, one way ANOVA is grouping the multiple groups by a single grouping variable.

Code
my_data <- PlantGrowth
Code
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
   weight group
1    6.15  trt2
2    3.83  trt1
3    5.29  trt2
4    5.12  trt2
5    4.50  ctrl
6    4.17  trt1
7    5.87  trt1
8    5.33  ctrl
9    5.26  trt2
10   4.61  ctrl
Code
# Show the levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"
Code
my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))
Code
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 3 × 4
  group count  mean    sd
  <ord> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443
Code
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("purple", "orchid", "darkorchid4"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

Code
# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

Code
# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
TukeyHSD(res.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ group, data = my_data)

$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064
Code
library(multcomp)

summary(glht(res.aov, linfct = mcp(group = "Tukey")))

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = weight ~ group, data = my_data)

Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)  
trt1 - ctrl == 0  -0.3710     0.2788  -1.331   0.3908  
trt2 - ctrl == 0   0.4940     0.2788   1.772   0.1979  
trt2 - trt1 == 0   0.8650     0.2788   3.103   0.0121 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
Code
pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 

data:  my_data$weight and my_data$group 

     ctrl  trt1 
trt1 0.194 -    
trt2 0.132 0.013

P value adjustment method: BH 
Code
# 1. Homogeneity of variances
plot(res.aov, 1)

Code
library(car)
leveneTest(weight ~ group, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.1192 0.3412
      27               
Code
oneway.test(weight ~ group, data = my_data)

    One-way analysis of means (not assuming equal variances)

data:  weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739
Code
pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH", pool.sd = FALSE)

    Pairwise comparisons using t tests with non-pooled SD 

data:  my_data$weight and my_data$group 

     ctrl  trt1 
trt1 0.250 -    
trt2 0.072 0.028

P value adjustment method: BH 
Code
# 2. Normality
plot(res.aov, 2)

Code
# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.96607, p-value = 0.4379
Code
# Non-parametric alternative to one-way ANOVA test

kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Two Way ANOVA Test

Evaluates two grouped variable on a response variable.

Code
my_data <- ToothGrowth
Code
set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
1  21.5   VC  2.0
2  17.3   VC  1.0
3  27.3   OJ  2.0
4  18.5   VC  2.0
5   8.2   OJ  0.5
6  26.4   OJ  1.0
7  25.8   OJ  1.0
8   5.2   VC  0.5
9   6.4   VC  0.5
10  9.4   OJ  0.5
Code
# Check the structure
str(my_data)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Code
# Convert dose as a factor and recode the levels
# as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose, 
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
head(my_data)
   len supp dose
1  4.2   VC D0.5
2 11.5   VC D0.5
3  7.3   VC D0.5
4  5.8   VC D0.5
5  6.4   VC D0.5
6 10.0   VC D0.5
Code
table(my_data$supp, my_data$dose)
    
     D0.5 D1 D2
  OJ   10 10 10
  VC   10 10 10
Code
# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
          palette = c("purple", "orchid"))

Code
# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("purple", "orchid"))

Code
res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4   14.02 0.000429 ***
dose         2 2426.4  1213.2   82.81  < 2e-16 ***
Residuals   56  820.4    14.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
group_by(my_data, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
# A tibble: 6 × 5
# Groups:   supp [2]
  supp  dose  count  mean    sd
  <fct> <fct> <int> <dbl> <dbl>
1 OJ    D0.5     10 13.2   4.46
2 OJ    D1       10 22.7   3.91
3 OJ    D2       10 26.1   2.66
4 VC    D0.5     10  7.98  2.75
5 VC    D1       10 16.8   2.52
6 VC    D2       10 26.1   4.80
Code
TukeyHSD(res.aov3, which = "dose")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)

$dose
          diff       lwr       upr   p adj
D1-D0.5  9.130  6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1    6.365  3.597488  9.132512 2.7e-06
Code
pairwise.t.test(my_data$len, my_data$dose,
                p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 

data:  my_data$len and my_data$dose 

   D0.5    D1     
D1 1.0e-08 -      
D2 4.4e-16 1.4e-05

P value adjustment method: BH 
Code
# 1. Homogeneity of variances
plot(res.aov3, 1)

Code
library(car)
leveneTest(len ~ supp*dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54               
Code
# 2. Normality
plot(res.aov3, 2)

Code
# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.98499, p-value = 0.6694
Code
library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
Anova Table (Type III tests)

Response: len
             Sum Sq Df F value    Pr(>F)    
(Intercept) 1750.33  1 132.730 3.603e-16 ***
supp         137.81  1  10.450  0.002092 ** 
dose         885.26  2  33.565 3.363e-10 ***
supp:dose    108.32  2   4.107  0.021860 *  
Residuals    712.11 54                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MANOVA Test: Multivariate Analysis of Variance

The testing of multiple response variables

Code
my_data <- iris
Code
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.2         3.5          1.5         0.2     setosa
2           5.7         2.6          3.5         1.0 versicolor
3           6.3         3.3          6.0         2.5  virginica
4           6.5         3.2          5.1         2.0  virginica
5           6.3         3.4          5.6         2.4  virginica
6           6.4         2.8          5.6         2.2  virginica
7           6.8         3.2          5.9         2.3  virginica
8           7.9         3.8          6.4         2.0  virginica
9           6.2         2.9          4.3         1.3 versicolor
10          7.1         3.0          5.9         2.1  virginica
Code
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Look to see which differ
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Kruskal-Wallis Test

Non-parametric alternative for a one way ANOVA

Code
my_data <- PlantGrowth
Code
# print the head of the file
head(my_data)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
Code
my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))
Code
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
# A tibble: 3 × 6
  group count  mean    sd median   IQR
  <ord> <int> <dbl> <dbl>  <dbl> <dbl>
1 ctrl     10  5.03 0.583   5.15 0.743
2 trt1     10  4.66 0.794   4.55 0.662
3 trt2     10  5.53 0.443   5.44 0.467
Code
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("purple", "orchid", "darkorchid4"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

Code
# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

Code
kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Code
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                 p.adjust.method = "BH")

    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  PlantGrowth$weight and PlantGrowth$group 

     ctrl  trt1 
trt1 0.199 -    
trt2 0.095 0.027

P value adjustment method: BH 

Week Eight Catch Up

This week is looking at correlations, look at plotting the realtionships between variables.

A part of the post session is a quiz so the following are the answers to such.

  1. C

  2. A

  3. D

  4. A:C; B:D; C:E; D:A; E:B