I am here to take Research Methods and Data Analysis as part of the MRes in Endangered Species Recovery and Conservation at Nottingham Trent University. My hope is to acquire skills during this course and year that I can apply to bat conservation. Data analysis is a necessary skill for research.
Picture of cute bat as motivation to remember why I am here. Source link through image.
Week One
This week entailed an introduction to the course, including downloading R and RStudio and beginning to learn the basics of R. Once upon a time I had a brief introduction to R, but I have forgotten most of it beyond installing the program and switching to dark mode.
Screenshot of my RStudio in dark mode.
After setting up my RStudio to my chosen aesthetic, I navigated the resources posted on the module page. Using the videos and links provided, I began the painful process of remembering how to function in RStudio by practicing with data. As I have no data yet for my research project, I decided to make my own by surveying myself on my feelings about starting this course. The figure below is a bar chart of how I feel at the start of this course, with each category rated on a scale of 0-10; the higher the rating, the stronger the feeling.
Code
library(ggplot2)  # load ggplot2 for plotting

# Data about my feelings
data <- data.frame(
  Feelings = c("Need to Pass Course", "R Dread", "R Confidence"),
  Rating = c(10, 8, 1)
)

# Bar plot of my feelings
ggplot(data, aes(x = Feelings, y = Rating)) +
  geom_bar(stat = "identity", fill = "purple") +
  theme_dark()
Week Two
In week two we are exploring more of how to start working with R and playing more with data. This week I will use an actual data set, specifically the penguins data set, rather than made-up data about my feelings. The tasks for week two are as follows:
Work more with the penguin data
Insert a picture
Insert a video
Playing with Penguins…data
Professor Felipe encouraged us to look at the data before getting too deep into writing code for it. This way we get used to checking the data for errors first, rather than getting defeated later because there was simply an error in our data.
Code
penguins %>%
  glimpse()  # Getting a small look at the data: what variables are in the file, the columns and some rows
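Beyond glimpse(), there are a couple of other quick checks I could run before plotting. This is just a sketch of the habit of checking data first, not part of the assigned tasks; it uses base R's summary() and a count of missing values per column.

Code

# Quick summaries of every column in the penguins data
summary(penguins)

# How many missing values are in each column?
colSums(is.na(penguins))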
After getting a glimpse of the data I can play around with it further by making figures. The code was retrieved from the week 3 PowerPoint and then altered to look at different variables, personalize the colors, and add titles and labels to the figures.
Code
#| label: Histogram of penguin species' body mass

data("penguins")

penguins %>%
  group_by(species) %>%
  ggplot(aes(x = body_mass_g, color = species, fill = species)) +
  geom_histogram(bins = 33) +
  theme_dark() +
  scale_color_manual(values = c("orchid", "purple", "black")) +
  scale_fill_manual(values = c("orchid", "purple", "black")) +
  labs(x = "Body Mass (g)", y = "Frequency", title = "Penguin's Body Mass")
Code
#| label: Boxplot of penguin's bill length by sex and species

penguins %>%
  na.omit() %>%
  ggplot(aes(x = sex, y = bill_length_mm, color = species, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  theme(axis.text = element_text(size = 16), axis.title = element_text(size = 16)) +
  theme_dark() +
  scale_color_manual(values = c("orchid", "purple", "black")) +
  scale_fill_manual(values = c("orchid", "purple", "black")) +
  labs(x = "Sex", y = "Bill Length (mm)", title = "Penguin's Bill Length by Sex")
Code
#| label: Penguin species' bill length by body mass

penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species, fill = species)) +
  geom_point() +
  theme_dark() +
  scale_color_manual(values = c("orchid", "purple", "black")) +
  scale_fill_manual(values = c("orchid", "purple", "black")) +
  theme(axis.text = element_text(size = 16), axis.title = element_text(size = 16)) +
  labs(x = "Bill Length (mm)", y = "Body Mass (g)", title = "Penguin's Body Mass by Bill Length")
Add an Image
Another part of the workbook assignment this week was to add an image to the workbook. During week one I had placed an image in my workbook but I am not opposed to having an excuse to place more pictures of bats on my page.
Image from Bat Conservation International of the Congress Street Bridge bat colony in Austin, Texas, USA. Source link through image.
Add a Video
The last part of the workbook exercise for this week was to add a video to our page. I would be remiss in providing pictures of the Congress Street Bridge colony in flight without attaching a video of the nightly flight. The video is also from Bat Conservation International and was posted to their YouTube page.
Week Three
This week is dedicated to getting used to wrangling/managing our data in R. For this exercise I will be following the prompts and code of R for Graduate Students by Y. Wendy Huynh, specifically the exercises in sections 6.6.1 and 6.7.
Problem A
Code
midwest %>%  # indicates working with the midwest data set; %>% is the pipe to simplify code
  group_by(state) %>%  # group the data in this chunk by state to look at states separately
  summarize(poptotalmean = mean(poptotal),       # start of a nested summary: the mean population, grouped by state
            poptotalmed = median(poptotal),      # the median total population, grouped by state
            popmax = max(poptotal),              # the maximum population entry, grouped by state
            popmin = min(poptotal),              # the minimum population entry, grouped by state
            popdistinct = n_distinct(poptotal),  # the number of unique population totals, grouped by state
            popfirst = first(poptotal),          # the first population entry, grouped by state
            popany = any(poptotal < 5000),       # TRUE/FALSE: does the state have a county with a population under 5,000?
            popany2 = any(poptotal > 2000000)) %>%  # TRUE/FALSE: does the state have a county with a population over 2,000,000?
  ungroup()  # ends the grouping so it does not cause issues for further functions
midwest %>%  # indicates working with the midwest data set; %>% is the pipe to simplify code
  group_by(state) %>%  # group the data in this chunk by state to look at states separately
  summarize(num5k = sum(poptotal < 5000),       # the number of counties in each state with populations under 5,000
            num2mil = sum(poptotal > 2000000),  # the number of counties in each state with populations over 2,000,000
            numrows = n()) %>%                  # the total number of entries/counties per state
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C
Part I
Code
midwest %>%  # indicates working with the midwest data set; %>% is the pipe to simplify code
  group_by(county) %>%  # groups the data by county to look at county data separately
  summarize(x = n_distinct(state)) %>%  # shows how many states have a county with the same name
  arrange(desc(x)) %>%  # arranges the column counting how many states share a county name in descending order
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Part II
Code
# How does n() differ from n_distinct()?
#   n() counts rows, while n_distinct() counts the unique entries of a variable.
# When would they be the same? Different?
#   Here they are the same because each county name appears at most once per state, so counting
#   the rows for a county and counting its distinct states gives the same number. In other cases
#   they can differ, because n_distinct() ignores duplicates and only counts unique entries.

midwest %>%  # indicates working with the midwest data set; %>% is the pipe to simplify code
  group_by(county) %>%  # groups the data by county to look at county data separately
  summarize(x = n()) %>%  # the number of rows in which that county name appears
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Part III
Code
# Hints:
# - How many distinctly different counties are there for each county? 320.
# - Can there be more than 1 (county) county in each county? No, every row with the same county
#   name is counted as 1, since it is not unique.
# - What if we replace 'county' with 'state'? Then the counts can be greater than 1, as multiple
#   states have counties with the same name, as seen in the previous chunks.

midwest %>%  # indicates working with the midwest data set; %>% is the pipe to simplify code
  group_by(county) %>%  # groups the data by county to look at county data separately
  summarize(x = n_distinct(county)) %>%  # counts only the unique county entries within the county variable
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
Problem D
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(clarity) %>%  # groups the data by clarity to look at clarity data separately
  summarize(a = n_distinct(color),  # the number of unique entries for color in each clarity type
            b = n_distinct(price),  # the number of unique entries for price in each clarity type
            c = n()) %>%            # the number of rows that exist for each clarity type
  ungroup()  # ends the grouping so it does not cause issues for further functions
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(color, cut) %>%  # groups the data by color and cut to look at the data separated by both
  summarize(m = mean(price),   # the mean price when grouped by both color and cut
            s = sd(price)) %>%  # the standard deviation of price when grouped by both color and cut
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
Part II
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(cut, color) %>%  # groups the data by cut and color to look at the data separated by both
  summarize(m = mean(price),   # the mean price when grouped by cut and color; the ordering changes but not the means from Part I
            s = sd(price)) %>%  # the standard deviation of price when grouped by cut and color; likewise unchanged from Part I
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Part III
Code
# - How good is the sale if the price of diamonds equaled msale? 20% off. The premium and ideal
#   cuts in particular show a noticeable price drop at the sale price.
# - e.g. the diamonds are 20% off the original mean price in msale.

diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(cut, color, clarity) %>%  # groups the data by cut, color and clarity to look at the data separated by all three
  summarize(m = mean(price),   # the mean price when grouped by cut, color and clarity
            s = sd(price),     # the standard deviation of price when grouped by cut, color and clarity
            msale = m * 0.80) %>%  # the "sale" price: 80% of the mean price for each cut, color and clarity group
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
Problem F
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(cut) %>%  # groups the data by cut to look at cut data separately
  summarize(potato = mean(depth),      # potato is the column name for the mean depth of the diamonds, grouped by cut
            pizza = mean(price),       # pizza is the column name for the mean price of the diamonds, grouped by cut
            popcorn = median(y),       # popcorn is the column name for the median width (y) of the diamonds, grouped by cut
            pineapple = potato - pizza,  # pineapple is the mean depth minus the mean price, grouped by cut
            papaya = pineapple^2,      # papaya is the square of (mean depth - mean price), grouped by cut
            peach = n()) %>%           # peach is the number of rows (diamonds) in each cut group
  ungroup()  # ends the grouping so it does not cause issues for further functions
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(color) %>%  # groups the data by color to look at color data separately
  summarize(m = mean(price)) %>%  # the mean price, grouped by color
  mutate(x1 = str_c("Diamond color ", color),  # adds the label "Diamond color" in front of each color value
         x2 = 5) %>%  # serves no purpose until I find a resource or am told differently; it just makes everything equal 5
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Part II
What does the first ungroup() do? Is it useful here? Why/why not? (I have left out the repetitive code so it does not disrupt my pretty workbook.) The answer is no, the ungroup() is not useful because it is not at the end of the grouping code; the first ungroup() simply stops the grouping before the code is finished using that group (a quick demonstration follows below). Why isn't there a closing ungroup() after the mutate()? I don't know. The text says there should be, and at this point in my life with R I go by the resources provided by professors and web searches.
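To convince myself about the early ungroup(), here is a minimal sketch using the diamonds data already loaded above: when ungroup() comes before summarize(), the grouping is gone and everything collapses to a single row.

Code

# Grouping kept: one mean price per cut (five rows)
diamonds %>%
  group_by(cut) %>%
  summarize(m = mean(price))

# ungroup() placed before summarize(): the grouping is dropped,
# so summarize() returns one overall mean (a single row)
diamonds %>%
  group_by(cut) %>%
  ungroup() %>%
  summarize(m = mean(price))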
Problem H
Part I
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(color) %>%  # groups the data by color to look at color data separately
  mutate(x1 = price * 0.5) %>%  # creates a new variable that is 0.5 of the price, computed within each color group
  summarize(m = mean(x1)) %>%  # the mean of half of the price, grouped by color
  ungroup()  # ends the grouping so it does not cause issues for further functions
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Part II
What's the difference between Part I and Part II? I didn't put the Part II code in a chunk because I didn't want to wreck the script that follows. I remember from the chapter that any group_by() requires an ungroup(), so an ungroup() placed before another function changes the resulting data.
Why is grouping data necessary? To specify the groups you want to evaluate.
Why is ungrouping data necessary? To clear the grouping you set, leaving the data unaffected and no longer altering further functions. Ungrouping removes the grouping previously applied to the data set you want to analyze.
When should you ungroup data? Any time you group data.
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? Not that I know of, but I will likely look into it more now. From what I understand from the text, ungroup() only matters after functions that start with group_by() (a small sketch below checks this).
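As a small check of those last answers, a sketch (again with the diamonds data) showing that a grouped mutate() behaves differently from an ungrouped one, and that a pipeline without group_by() has nothing to ungroup:

Code

# Grouped mutate: the mean is calculated within each cut group
diamonds %>%
  group_by(cut) %>%
  mutate(mean_price_by_cut = mean(price)) %>%
  ungroup()

# No group_by(), so nothing to ungroup: mean(price) is the overall mean
diamonds %>%
  mutate(mean_price_overall = mean(price))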
6.7 Extra Practice
Question 1
Code
View(diamonds) # Opens up the dataset in a new tab to allow you to view the data and see all the variables.
Question 2
Part I
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  arrange(price)  # arranges the data by the price column in ascending order, lowest to highest prices
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Part II
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  arrange(desc(price))  # arranges the data by the price column in descending order, highest to lowest prices
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
Part III
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  arrange(price) %>%  # arranges the data by the price column in ascending order; the pipe connects this to the next line
  arrange(cut)  # arranges the data by the cut column in ascending order, worst to best cut; combined with the previous line the data ends up ordered by cut and then price
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  arrange(desc(price)) %>%  # arranges the data by the price column in descending order; the pipe connects this to the next line
  arrange(desc(cut))  # arranges the data by the cut column in descending order, best to worst cut; combined with the previous line the data ends up ordered by cut and then price
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
Question 3
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  arrange(price) %>%  # arranges the data by the price column in ascending order; the pipe connects this to the next line
  arrange(clarity)  # arranges the data by clarity in ascending order, worst to best; with the previous line the data is organized by clarity and then price
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
2 0.32 Good D I1 64 54 361 4.33 4.36 2.78
3 0.31 Premium F I1 62.9 59 394 4.33 4.29 2.71
4 0.34 Ideal D I1 62.5 57 413 4.47 4.49 2.8
5 0.34 Ideal D I1 61.4 55 413 4.5 4.52 2.77
6 0.32 Premium E I1 60.9 58 444 4.42 4.38 2.68
7 0.43 Premium H I1 62 59 452 4.78 4.83 2.98
8 0.41 Good G I1 63.8 56 467 4.7 4.74 3.01
9 0.32 Good D I1 64 54 468 4.36 4.33 2.78
10 0.32 Ideal E I1 60.7 57 490 4.45 4.41 2.69
# ℹ 53,930 more rows
Question 4
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  mutate(SalePrice = price - 250)  # creates a new column showing the price minus $250
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z SalePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
Question 5
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  select(-x, -y, -z)  # removes the x, y, and z columns from the output
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
Question 6
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  group_by(cut) %>%  # groups the data by the cut variable to look at cut data separately
  summarize(total = n()) %>%  # a summary of how many rows/diamonds there are in each cut category
  ungroup()  # undoes the grouping so it does not alter the data for further functions
# A tibble: 5 × 2
cut total
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
Question 7
Code
diamonds %>%  # indicates working with the diamonds data set; %>% is the pipe to simplify code
  mutate(totalNum = n())  # adds a column showing the total number of diamonds in the data
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Research Methods
Come up with a good and bad research question based on the Diamonds dataset.
Bad Question
Do diamonds get bought and sold?
Good Question
Which diamond variables, if any, impact the price of diamonds more than others?
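As a first pass at the good question, here is a sketch of how it could be explored with the diamonds data already used above. This is only one possible starting point, not a full analysis, and the choice of carat for the visual check is my own assumption about an interesting predictor.

Code

# Fit a rough linear model of price against several candidate predictors
# to get a first look at their contributions
price_model <- lm(price ~ carat + cut + color + clarity, data = diamonds)
summary(price_model)

# A quick visual check of one candidate predictor, carat
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1, color = "purple") +
  geom_smooth(method = "lm") +
  theme_dark()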
Week Four
This week we are learning more about exploring our data. In particular, we are learning about the different types of graphs we can produce and how they can show us information about our data that we may not have known otherwise. Running our data through different types of graphs can help us find the graph that best answers our research questions, or even helps us ask new ones.
Data Exploration
The following figures are based on the video Data Visualization with R in 36 Minutes. The instructions were to reproduce the graphs and organize them in my own way. My way of making the graphs my own will mainly be giving them a dark theme with purple accents to stick with my aesthetic. I will also annotate the code to further ensure I know what each line does.
Graph One, Original
Code
library(modeldata)  # the package that provides the crickets data

ggplot(crickets, aes(x = temp, y = rate)) +  # a plot of the cricket data looking at chirp rate by temperature
  geom_point() +  # makes the figure a scatter plot
  labs(x = "Temperature",  # x axis label
       y = "Chirp rate",   # y axis label
       title = "Cricket chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph One, My Version
Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) +  # a plot of the cricket data looking at chirp rate by temperature, with the species in different colors
  geom_point() +  # makes the figure a scatter plot
  labs(x = "Temperature",  # x axis label
       y = "Chirp Rate",   # y axis label
       color = "Species",  # the figure legend label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)") +  # adds a caption to the graph
  theme_dark() +  # makes the background darker
  scale_color_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4"))     # choose the fill colors
Graph Two, Original
Code
ggplot(crickets, aes(x = temp, y = rate)) +  # a plot of the cricket data looking at chirp rate by temperature
  geom_point(color = "red",  # makes the figure a scatter plot with all data points red
             size = 2,       # changes the size of the data points
             alpha = .3,     # changes the transparency of the data points
             shape = "square") +  # changes the shape of the data points
  labs(x = "Temperature",  # x axis label
       y = "Chirp rate",   # y axis label
       title = "Cricket chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Two, My Version
Code
ggplot(crickets, aes(x = temp, y = rate)) +  # a plot of the cricket data looking at chirp rate by temperature
  geom_point(aes(color = species),  # makes the figure a scatter plot; the aes() lets me separate the data points by species
             size = 3,    # changes the size of the data points
             alpha = .4,  # changes the transparency of the data points
             shape = "diamond") +  # changes the shape of the data points
  theme_dark() +  # makes the background darker
  labs(x = "Temperature",  # x axis label
       y = "Chirp Rate",   # y axis label
       title = "Cricket Chirps",  # adds a title to the graph
       color = "Species",  # the figure legend label
       caption = "Source: McDonald (2009)") +  # adds a caption to the graph
  scale_color_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4"))     # choose the fill colors
Graph Three, Original
Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) +  # a plot of the cricket data looking at chirp rate by temperature, with the species in different colors
  geom_point() +  # makes the figure a scatter plot
  geom_smooth(method = "lm",  # adds a regression line
              se = FALSE) +   # does not show the standard error band
  labs(x = "Temperature",  # x axis label
       y = "Chirp rate",   # y axis label
       color = "Species",  # the figure legend label
       title = "Cricket chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)") +  # adds a caption to the graph
  scale_color_brewer(palette = "Dark2")  # uses a high-contrast color palette
Graph Three, My Version
Code
ggplot(crickets, aes(x = temp, y = rate, color = species)) +  # a plot of the cricket data looking at chirp rate by temperature, with the species in different colors
  geom_point(size = 3,  # makes the figure a scatter plot; size changes the size of the data points
             shape = "diamond") +  # changes the shape of the data points
  geom_smooth(method = "lm",  # adds a regression line
              se = FALSE) +   # does not show the standard error band
  theme_dark() +  # makes the background darker
  scale_color_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4")) +   # choose the fill colors
  labs(x = "Temperature",  # x axis label
       y = "Chirp Rate",   # y axis label
       color = "Species",  # the figure legend label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Four, Original
Code
ggplot(crickets, aes(x = rate)) +  # a graph of the cricket data with the x value being the chirp rate
  geom_histogram(bins = 15)  # makes the graph a histogram and sets the number of bins to 15
Graph Four, My Version
Code
ggplot(crickets, aes(x = rate)) +  # a graph of the cricket data with the x value being the chirp rate
  geom_histogram(color = "black", fill = "purple", bins = 15) +  # makes the graph a histogram, customizes the outline and fill colors, and sets the bins to 15
  theme_dark() +  # makes the background darker
  labs(x = "Chirp Rate",  # x axis label
       y = "Frequency",   # y axis label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Five, Original
Code
ggplot(crickets, aes(x = species, fill = species)) +  # a graph of the cricket data with the x value being the species; fill separates the data by species and colors them differently
  geom_bar(show.legend = FALSE) +  # makes a bar chart and hides the legend
  scale_fill_brewer(palette = "Dark2")  # uses a high-contrast color palette
Graph Five, My Version
Code
ggplot(crickets, aes(x = species, fill = species)) +  # a graph of the cricket data with the x value being the species; fill separates the data by species and colors them differently
  geom_bar(show.legend = FALSE) +  # makes a bar chart and hides the legend
  theme_dark() +  # makes the background darker
  scale_color_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4")) +   # choose the fill colors
  labs(x = "Species",   # x axis label
       y = "Observed",  # y axis label
       title = "Crickets",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Six, Original
Code
ggplot(crickets, aes(x = rate)) +  # a graph of the cricket data with the x value being the chirp rate
  geom_freqpoly(bins = 15)  # shows the same data as the histogram but as a connected line, with 15 bins
Graph Six, My Version
Code
ggplot(crickets, aes(x = rate)) +  # a graph of the cricket data with the x value being the chirp rate
  geom_freqpoly(bins = 15, color = "darkorchid4") +  # shows the same data as the chirp rate histogram but as a connected line, with 15 bins; color sets the line color
  theme_dark() +  # makes the background darker
  labs(x = "Chirp Rate",  # x axis label
       y = "Observed",    # y axis label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Seven, Original
Code
ggplot(crickets, aes(x = species, y = rate, color = species)) +  # sets the data to be used, the x and y values, and separates the data by species
  geom_boxplot(show.legend = FALSE) +  # makes a boxplot and hides the legend
  scale_color_brewer(palette = "Dark2") +  # uses a high-contrast color palette
  theme_minimal()  # sets the theme for the graph
Graph Seven, My Version
Code
ggplot(crickets, aes(x = species, y = rate, color = species, fill = species)) +  # sets the data to be used, the x and y values, and separates the data by species
  geom_boxplot(show.legend = FALSE) +  # makes a boxplot and hides the legend
  theme_dark() +  # makes the background darker
  scale_color_manual(values = c("black", "black")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the fill colors
  labs(x = "Species",     # x axis label
       y = "Chirp Rate",  # y axis label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Graph Eight, Original
Code
ggplot(crickets, aes(x = rate, fill = species)) +  # sets the data to be used, the x value, and uses fill to separate the data by species
  geom_histogram(bins = 15, show.legend = FALSE) +  # makes a histogram with 15 bins and hides the legend
  facet_wrap(~species,  # creates a separate panel per species in the data
             ncol = 1) +  # sets there to be one column of panels
  scale_fill_brewer(palette = "Dark2") +  # uses a high-contrast color palette
  theme_minimal()  # sets the theme for the graph
Graph Eight, My Version
Code
ggplot(crickets, aes(x = rate, color = species, fill = species)) +  # sets the data to be used, the x value, and uses color/fill to separate the data by species
  geom_histogram(bins = 15, show.legend = FALSE) +  # makes a histogram with 15 bins and hides the legend
  facet_wrap(~species,  # creates a separate panel per species in the data
             ncol = 1) +  # sets there to be one column of panels
  theme_dark() +  # makes the background darker
  scale_color_manual(values = c("black", "black")) +  # choose the outline colors
  scale_fill_manual(values = c("darkmagenta", "darkorchid4")) +  # choose the fill colors
  labs(x = "Chirp Rate",  # x axis label (the histogram's x variable is chirp rate)
       y = "Frequency",   # y axis label
       title = "Cricket Chirps",  # adds a title to the graph
       caption = "Source: McDonald (2009)")  # adds a caption to the graph
Research Methods
What is a good hypothesis?
A good hypothesis is one that comes about by first making observations and having some background knowledge of what you are looking to research; it is not one you can form without any experience or insight into the subject. To gather evidence in support of a good hypothesis, the hypothesis also has to be one that can be rejected: you cannot start a good research project with a hypothesis whose only possible outcome is that it will be supported. A good hypothesis will also provide insight from being tested. The insight gained can range from having evidence to support your hypothesis to asking a new question based on the findings. If you start your experiment with a good foundation, a good hypothesis, then the knowledge gained from testing will typically be useful.
Week Five
Oof, okay, so this week has me really having to stop myself from hiding under the covers, terrified of the stats monsters in my wardrobe. It has been a hot minute since I took stats 101 and this reminded me of how much I have forgotten. Not even speaking about the first time I was supposed to learn R; those are the under-the-bed monsters... we don't talk about them. Luckily, I can acknowledge that while this feels like an initial knowledge dump, it is still just an introduction to the different statistical tests, and there will be more time to get re-familiarized with them. This week the quest for the workbook is to identify the statistical tests associated with the different graphs provided and to attempt to recreate them. Both tasks are combined into one subsection per graph: I recreate the graph and then discuss the tests I think could work with the figure. All graphs are recreated from the iris data.
Graph One
Code
data("iris")iris %>%ggplot(aes(x=Species, y = Sepal.Length,color=Species, fill=Species))+geom_boxplot(alpha=0.7)+theme(axis.text=element_text(size=16),axis.title=element_text(size=16)) +theme_dark() +scale_color_manual(values=c("orchid", "purple", "black")) +scale_fill_manual(values=c("orchid", "purple", "black")) +labs(x ="Species", y ="Sepal Length (mm)", title ="Iris Sepal Length by Species")
In graph one the x variable is a category and the y variable is numerical. The graph looks at how species affects sepal length and illustrates the differences between the three iris species. The box plot also shows the distribution of the data within each species category: for each category the box and whiskers show the minimum, maximum, median, 1st quartile and 3rd quartile, and the dots are outliers. The predictor (x) variable is a category and the outcome (y) variable is numeric, which indicates a comparison of means. Only one outcome variable is shown on this graph for this question, so an ANOVA should be suitable.
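A minimal sketch of how that ANOVA could be run on the iris data; this is only a check of the reasoning, not part of the original task.

Code

# One-way ANOVA: does mean sepal length differ between the three iris species?
res_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(res_aov)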
Graph two is a density plot, which is a way of visualizing the distribution of the data. The x value is petal length, a numerical variable, and the density axis is a frequency-like variable. The graph looks at the differences in petal length between the three species. The data does not appear to have a normal distribution. The data is not paired, so with a non-normal outcome a Mann-Whitney U test (for two groups) or a Kruskal-Wallis test (for three groups) should be suitable tests for this question.
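The code for graph two did not make it into my workbook, so below is a sketch of how such a density plot could be recreated from the iris data (the styling is my own assumption), followed by the Kruskal-Wallis test mentioned above.

Code

# Density plot of petal length by species (approximate recreation, keeping my dark/purple theme)
ggplot(iris, aes(x = Petal.Length, color = Species, fill = Species)) +
  geom_density(alpha = 0.5) +
  theme_dark() +
  scale_color_manual(values = c("orchid", "purple", "black")) +
  scale_fill_manual(values = c("orchid", "purple", "black")) +
  labs(x = "Petal Length", y = "Density", title = "Iris Petal Length by Species")

# Kruskal-Wallis test: do petal lengths differ between the three species?
kruskal.test(Petal.Length ~ Species, data = iris)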
Graph Three
Code
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(aes(shape = Species, color = Species), size = 3) +
  scale_shape_manual(values = c("diamond", "triangle", "circle")) +
  geom_smooth(method = "lm", color = "darkmagenta") +
  theme_dark() +
  scale_color_manual(values = c("orchid", "purple", "black")) +
  scale_fill_manual(values = c("orchid", "purple", "black")) +
  labs(x = "Petal Length", y = "Petal Width", color = "Species", title = "Iris Species Petal Length by Width")
Both the x and y variables for graph three are numerical, and the scatter plot looks at the relationship between petal length and petal width. While the data is separated by species, the only predictor variable for this question is petal length, which is continuous. Because of this a simple linear regression is a good test for this data. With this linear model we have the regression line, and the shaded band behind it is the standard error. A Pearson correlation test could also be done to examine whether the variables have a linear relationship.
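A sketch of both options, assuming the untransformed iris variables are used directly:

Code

# Simple linear regression of petal width on petal length
res_lm <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(res_lm)

# Pearson correlation between petal length and petal width
cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")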
Graph Four
Code
data(iris)

iris.new <- mutate(iris, size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))

ggplot(iris.new, aes(x = Species, color = size, fill = size)) +
  geom_bar(position = "dodge") +
  theme_dark() +
  scale_color_manual(values = c("darkmagenta", "darkorchid4")) +
  scale_fill_manual(values = c("darkmagenta", "darkorchid4")) +
  labs(x = "Species", y = "Count", title = "Sepal Size of Iris Species")
Graph four looks at the frequency of small and large sepal sizes across the three iris species. Sepal size was a numerical variable that we turned into two categories based on the numerical data. The graph shows how frequently each of the three iris species was observed to have big or small sepals, illustrating the differences in sepal size between species. Given that we have created a binary category from the data, a Chi-squared test would be appropriate to assess whether sepal size is associated with species.
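A sketch of that Chi-squared test, using the iris.new data frame created above:

Code

# Contingency table of species by sepal-size category
size_table <- table(iris.new$Species, iris.new$size)
size_table

# Chi-squared test of independence between species and sepal size
chisq.test(size_table)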
Week Six
In week six we did a group project as part of an exercise in picking the appropriate statistical analyses for the data sets given, creating questions for the data, and annotating the code we used to make figures and tables. My group was Hephzibah, Tamsin, Alice, Marrisa and Ellelouise. The document was submitted via Tamsin.
We also learned about spotting the difference between good and bad titles for research papers, as well as going into more detail about different frequency tests.
Week Seven
This week was focused on what makes a good and bad abstract, with the formative assignment being to create our own good title and abstract for a paper provided. The other part of this week was about the different types of mean tests we can do. Admittedly, I am behind on doing the post-sessions for the statistical analyses in my workbook for weeks 6 and 7. I am American, and while I take responsibility for not doing the post-sessions in a timely manner, I got stressed and depressed this week with the election and its results and fell off track. Now it is up to me to make up for it in the following weeks and catch up.
Title
How Santa Cruz Galapagos Tortoises Utilize Three Types of Farmlands During Migration
Authors
Kyana N. Pike, Stephen Blake, Iain J. Gordon & Lin Schwarzkopf
Affiliations
James Cook University; Galapagos National Park; Saint Louis University; Max Planck Institute for Animal Behaviour; Wildcare Institute; Central Queensland University; James Hutton Institute; The Australian National University; CSIRO; Protected Places Initiative
Abstract
Western Santa Cruz Galapagos tortoises (Chelonoidis porteri), residing on Santa Cruz Island, are critically endangered ecosystem engineers. The tortoises migrate during the dry season to the highlands, which are around 88% agricultural land, be it abandoned, used for livestock or repurposed for tourism. This study evaluated the behaviors the tortoises exhibit in the different agricultural habitats in terms of time spent eating, resting and walking. The study took place in 2019, with 114 observations made in the wet season (March-May) and 128 observations in the dry season (November-December). Tortoise behaviors were recorded from 06:30-12:30 by an observer 5-15 m away; once a tortoise was spotted, all actions and their durations were logged for 30 minutes. Habitat characteristics were also noted in a 1 m² area, with vegetation coverage, density and mean height recorded. Thermal images were used to obtain temperatures of the tortoises and the ground, and the sex and curved carapace length of each tortoise were recorded. The analyses indicated that resting time had a strong relationship with land type and temperature, while walking time was most affected by vegetation height. The abandoned agricultural land was least suitable for the tortoises' migratory behaviors as it had the most shade, lower temperatures and tall vegetation. The touristic locations were the most suitable for eating, and their level of disturbance was compatible with walking and eating. The livestock areas were theorized as having potential for conservation efforts: increasing areas of shade, cutting vegetation to the tortoises' ideal height and rotating the stocking of livestock could create ideal habitats for the tortoises' migratory behaviors.
Keywords
Chelonoidis porteri; agriculture; land sharing; behavior; activity; habitat quality
Week Six Catch Up
After much delay, procrastination, internal protest and mental blockage, I am finally catching up on the post-sessions. The first part is putting into practice the R tests used to compare proportions, via proportion (z) tests and Chi-squared tests. Most of the work is recreated from the STHDA.com chapters.
Part One
Running the examples from the website explaining the different types of proportion tests and Chi-squared tests.
One Proportion Z Test
Code
# Per the text, a one-proportion z-test compares an observed proportion to a theoretical proportion.
# The chapter's example uses data about mice, asking whether a higher proportion of one sex is
# experiencing cancer.
res <- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE)

# Printing the results
res
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
# Tests whether the proportion of male mice with cancer is less than 0.5
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "less")
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
0.0000000 0.6555425
sample estimates:
p
0.59375
Code
# Intended to test whether the proportion of male mice with cancer is more than 0.5
# (that would need alternative = "greater"; the call below, as run, repeats the "less" test)
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "less")
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
0.0000000 0.6555425
sample estimates:
p
0.59375
Two Proportion Z Test
Code
# Compares two observed proportions. The data set represents two groups, those with cancer and
# those who are healthy, and compares the proportion of smokers in each group.
res <- prop.test(x = c(490, 400), n = c(500, 500))

# Printing the results
res
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
# Tests whether there are proportionally fewer smokers with cancer than without
prop.test(x = c(490, 400), n = c(500, 500), alternative = "less")
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value = 1
alternative hypothesis: less
95 percent confidence interval:
-1.0000000 0.2131742
sample estimates:
prop 1 prop 2
0.98 0.80
Code
# Tests whether there are proportionally more smokers with cancer than without
prop.test(x = c(490, 400), n = c(500, 500), alternative = "greater")
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: greater
95 percent confidence interval:
0.1468258 1.0000000
sample estimates:
prop 1 prop 2
0.98 0.80
Chi-square Goodness of Fit
Code
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res
Chi-squared test for given probabilities
data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
Code
# Access the expected values
res$expected
[1] 52.66667 52.66667 52.66667
Code
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res
Chi-squared test for given probabilities
data: tulip
X-squared = 0.20253, df = 2, p-value = 0.9037
Code
# Printing the p-value
res$p.value
[1] 0.9036928
Code
# The observed counts are not significantly different, meaning what was observed is close to what was expected
Code
# Printing the "mean"; a chisq.test() result has no estimate element, so this returns NULL
res$estimate
NULL
Chi-Square Test of Independence
Code
# Import the data
file_path <- "https://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# head(housetasks)
Code
#| label: Chi-square; contingency table

# Looks for an association between the categories of two variables
library("gplots")

# 1. Convert the data to a table
dt <- as.table(as.matrix(housetasks))

# 2. Graph
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)
# Compute the Chi-square test and each cell's contribution (assumed from the STHDA chapter;
# chisq and contrib are used below but were missing from my original chunk)
chisq <- chisq.test(housetasks)
contrib <- 100 * chisq$residuals^2 / chisq$statistic

# Visualize the contribution
library("corrplot")
corrplot(contrib, is.cor = FALSE)
Code
# Printing the p-value
chisq$p.value
[1] 0
Code
# Printing the "mean"; the chisq test object has no estimate element, so this returns NULL
chisq$estimate
NULL
Part Two: Practice
Code
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)

mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  subset(select = c("class", "cyl", "prop")) %>%  # drop the frequency values
  spread(class, prop) %>%
  kable()
| cyl | 2seater | compact   | midsize   | minivan   | pickup    | subcompact | suv       |
|----:|--------:|----------:|----------:|----------:|----------:|-----------:|----------:|
|   4 |      NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000  | 0.1290323 |
|   5 |      NA | 0.0425532 | NA        | NA        | NA        | 0.0571429  | NA        |
|   6 |      NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000  | 0.2580645 |
|   8 |       1 | NA        | 0.0487805 | NA        | 0.6060606 | 0.1428571  | 0.6129032 |
Code
# Load required libraries
library("gplots")
library("ggplot2")  # for the mpg dataset

# Summary table
mpg_summary <- table(mpg$class, mpg$manufacturer)

# Format
dt <- as.table(as.matrix(mpg_summary))

# Balloon plot
balloonplot(t(dt), main = "Distribution of Car Classes by Manufacturer",
            xlab = "Manufacturer", ylab = "Class",
            label = FALSE, show.margins = FALSE)
Code
library("ggplot2") library("graphics") # contingency table dt <-table(mpg$class, mpg$manufacturer)# mosaic plotmosaicplot(dt, shade =TRUE, las =2, main ="Distribution of Car Classes by Manufacturer")
Code
library("vcd")library("ggplot2") # contingency tabledt <-table(mpg$class, mpg$manufacturer)# Table subswt, first 5 rows dt_subset <-head(dt, 5)# zero-only rows or columnsdt_subset <- dt_subset[apply(dt_subset, 1, sum) >0, apply(dt_subset, 2, sum) >0]# association plotassoc(dt_subset, shade =TRUE, las =3, main ="Subset of Car Classes by Manufacturer")
# Compute the Chi-square test and each cell's contribution for the mpg contingency table
# (assumed to mirror the housetasks example above; chisq and contrib are used below)
chisq <- chisq.test(dt)
contrib <- 100 * chisq$residuals^2 / chisq$statistic

# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
Code
# Printing the p-value
chisq$p.value
[1] 5.267718e-54
Code
# Printing the "mean"; the chisq test object has no estimate element, so this returns NULL
chisq$estimate
NULL
Week Seven Catch Up
My aesthetic choices are thrown to the wayside as I work on practicing the post-session content so I know how to apply it to future projects. This week is predominantly a comparison of means: the various circumstances that require you to compare means. All work is recreated from the STHDA.com chapters.
One sample T-test
Compare an observed/known mean to a standard/hypothetical mean
# One-sample t-test
# (my_data is assumed to be the STHDA example data frame of ten mice weights)
res <- t.test(my_data$weight, mu = 25)

# Printing the results
res
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
Code
# Printing the p-value
res$p.value
[1] 7.953383e-06
Code
# Printing the mean
res$estimate
mean of x
19.25
Code
# Printing the confidence interval
res$conf.int
[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95
Code
# Check whether the mean weight is less than 25
t.test(my_data$weight, mu = 25, alternative = "less")
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
-Inf 20.41105
sample estimates:
mean of x
19.25
Code
# Check whether the mean weight is more than 25
t.test(my_data$weight, mu = 25, alternative = "greater")
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 1
alternative hypothesis: true mean is greater than 25
95 percent confidence interval:
18.08895 Inf
sample estimates:
mean of x
19.25
One-Sample Wilcoxon Signed Rank Test
A non-parametric alternative to the one-sample t-test; no normal distribution is assumed.
# One-sample Wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)

# Printing the results
res
Wilcoxon signed rank test with continuity correction
data: my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
Code
# Print only the p-value
res$p.value
[1] 0.005793045
Code
wilcox.test(my_data$weight, mu =25,alternative ="less")
Wilcoxon signed rank test with continuity correction
data: my_data$weight
V = 0, p-value = 0.002897
alternative hypothesis: true location is less than 25
Code
wilcox.test(my_data$weight, mu =25,alternative ="greater")
Wilcoxon signed rank test with continuity correction
data: my_data$weight
V = 0, p-value = 0.9979
alternative hypothesis: true location is greater than 25
Unpaired Two-Sample T-test
Compares the means of two separate groups.
Code
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

# Create a data frame
my_data <- data.frame(
  group = rep(c("Woman", "Man"), each = length(women_weight)),
  weight = c(women_weight, men_weight)
)

# View the data frame
print(my_data)
group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 Man 9 69.0 9.38
2 Woman 9 52.1 15.6
Code
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
          color = "group", palette = c("purple", "orchid"),
          ylab = "Weight", xlab = "Groups")
Code
# Shapiro-Wilk normality test for the men's weights
with(my_data, shapiro.test(weight[group == "Man"]))  # p = 0.1
Shapiro-Wilk normality test
data: weight[group == "Man"]
W = 0.86425, p-value = 0.1066
Code
# Shapiro-Wilk normality test for the women's weights
with(my_data, shapiro.test(weight[group == "Woman"]))  # p = 0.6
Shapiro-Wilk normality test
data: weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
Two Sample t-test
data: women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-29.748019 -4.029759
sample estimates:
mean of x mean of y
52.10000 68.98889
t.test(weight ~ group, data = my_data, var.equal = TRUE, alternative = "less")
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.9934
alternative hypothesis: true difference in means between group Man and group Woman is less than 0
95 percent confidence interval:
-Inf 27.47924
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
Code
t.test(weight ~ group, data = my_data, var.equal = TRUE, alternative = "greater")
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.006633
alternative hypothesis: true difference in means between group Man and group Woman is greater than 0
95 percent confidence interval:
6.298536 Inf
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
Unpaired Two-Samples Wilcoxon Test
Also known as the Mann-Whitney test: a non-parametric alternative to the unpaired two-sample t-test. It does not assume a normal distribution and is used for independent samples.
Code
res <- wilcox.test(women_weight, men_weight)
res
Wilcoxon rank sum test with continuity correction
data: women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
Code
res <- wilcox.test(weight ~ group, data = my_data, exact = FALSE)
res
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
Code
# Print the p-value only
res$p.value
[1] 0.02711657
Code
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "less")
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.9892
alternative hypothesis: true location shift is less than 0
Code
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "greater")
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.01356
alternative hypothesis: true location shift is greater than 0
Paired Samples T-test
Compares the means of 2 related groups.
Code
# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)

# Create a data frame
my_data <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)
Code
# Print all data
print(my_data)
group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 after 10 394. 29.4
2 before 10 199. 18.5
Code
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
          color = "group", palette = c("purple", "orchid"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")
Code
# Subset weight data before treatment
before <- subset(my_data, group == "before", weight, drop = TRUE)
# Subset weight data after treatment
after <- subset(my_data, group == "after", weight, drop = TRUE)

# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
Code
# Compute the differences
d <- with(my_data, weight[group == "before"] - weight[group == "after"])

# Shapiro-Wilk normality test for the differences
shapiro.test(d)  # => p-value = 0.6141
Shapiro-Wilk normality test
data: d
W = 0.94536, p-value = 0.6141
Paired t-test
data: before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-215.5581 -173.4219
sample estimates:
mean difference
-194.49
Paired Samples Wilcoxon Test in R
A non-parametric alternative to the paired samples t-test.
Code
res <- wilcox.test(before, after, paired = TRUE)
res
Wilcoxon signed rank exact test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
One-Way ANOVA Test
Like an unpaired t-test but for more than two groups: a one-way ANOVA compares the means of multiple groups defined by a single grouping variable.
Code
my_data <- PlantGrowth
Code
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
# A tibble: 3 × 4
group count mean sd
<ord> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
Code
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
          color = "group", palette = c("purple", "orchid", "darkorchid4"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")
Code
# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ...)
library("ggpubr")
ggline(my_data, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")
Code
# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
TukeyHSD(res.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = my_data)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
Pairwise comparisons using t tests with pooled SD
data: my_data$weight and my_data$group
ctrl trt1
trt1 0.194 -
trt2 0.132 0.013
P value adjustment method: BH
Code
# 1. Homogeneity of variances
plot(res.aov, 1)
Code
library(car)
leveneTest(weight ~ group, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 1.1192 0.3412
27
Code
oneway.test(weight ~ group, data = my_data)
One-way analysis of means (not assuming equal variances)
data: weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739
Pairwise comparisons using t tests with non-pooled SD
data: my_data$weight and my_data$group
ctrl trt1
trt1 0.250 -
trt2 0.072 0.028
P value adjustment method: BH
Code
# 2. Normality
plot(res.aov, 2)
Code
# Extract the residuals
aov_residuals <- residuals(object = res.aov)
# Run the Shapiro-Wilk test
shapiro.test(x = aov_residuals)
Shapiro-Wilk normality test
data: aov_residuals
W = 0.96607, p-value = 0.4379
Code
# Non-parametric alternative to the one-way ANOVA test
kruskal.test(weight ~ group, data = my_data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Two Way ANOVA Test
Evaluates the effect of two grouping variables on a response variable.
# (my_data is assumed here to be the ToothGrowth data set, as in the STHDA chapter)
# Convert dose to a factor and recode the levels as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose,
                       levels = c(0.5, 1, 2),
                       labels = c("D0.5", "D1", "D2"))
head(my_data)
# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
          palette = c("purple", "orchid"))
Code
# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ...)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("purple", "orchid"))
Code
res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
# Two-way ANOVA with an interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
Pairwise comparisons using t tests with pooled SD
data: my_data$len and my_data$dose
D0.5 D1
D1 1.0e-08 -
D2 4.4e-16 1.4e-05
P value adjustment method: BH
Code
# 1. Homogeneity of variances
plot(res.aov3, 1)
Code
library(car)
leveneTest(len ~ supp * dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 1.7086 0.1484
54
Code
# 2. Normality
plot(res.aov3, 2)
Code
# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run the Shapiro-Wilk test
shapiro.test(x = aov_residuals)
Shapiro-Wilk normality test
data: aov_residuals
W = 0.98499, p-value = 0.6694
Code
library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
# A tibble: 3 × 6
group count mean sd median IQR
<ord> <int> <dbl> <dbl> <dbl> <dbl>
1 ctrl 10 5.03 0.583 5.15 0.743
2 trt1 10 4.66 0.794 4.55 0.662
3 trt2 10 5.53 0.443 5.44 0.467
Code
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
          color = "group", palette = c("purple", "orchid", "darkorchid4"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")
Code
# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ...)
library("ggpubr")
ggline(my_data, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")
Code
kruskal.test(weight ~ group, data = my_data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: PlantGrowth$weight and PlantGrowth$group
ctrl trt1
trt1 0.199 -
trt2 0.095 0.027
P value adjustment method: BH
Week Eight Catch Up
This week is looking at correlations: plotting and testing the relationships between variables.
Part of the post-session is a quiz, so the following are my answers to it.
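Before the quiz answers, a small warm-up sketch of the kind of correlation work this week covers, using the iris data from earlier; the choice of variables is just for illustration.

Code

# Scatter plot of two numeric variables to eyeball the relationship
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point(color = "purple") +
  geom_smooth(method = "lm") +
  theme_dark()

# Correlation coefficient and a test of the relationship
cor(iris$Petal.Length, iris$Sepal.Length)
cor.test(iris$Petal.Length, iris$Sepal.Length, method = "pearson")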