Quarto workbook - Rosie Longmore

library(tidyverse)
library(palmerpenguins)

Week 1

This week I have learnt how to render a workbook in RStudio.

I have played around with graph themes and colours.

Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Meet the penguins

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.
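
The plot code does not appear in this rendering; a minimal sketch of the kind of plot described (flipper length against bill length, coloured by species) might look like this:

ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  labs(x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Species")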

Week 3

This week I have completed task 6.1 from the RStudio workbook and looked at good and bad questions based on the data sets being used.

Post session tasks

Task 1

Data analysis

1a Follow the instructions and comment the purpose of each command.

Ensuring that tidyverse is loaded in your library:

# ensuring tidyverse is loaded in the library:

library(tidyverse)

view(diamonds)

diamonds %>%                          # use the diamonds dataset
  group_by(color, clarity) %>%        # group the data by color and clarity
  mutate(price200 = mean(price)) %>%  # new variable: mean price within each group
  ungroup() %>%                       # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%   # new variable: original price + $10
  select(cut, color, clarity, price,
         price200, random10) %>%      # retain only these listed columns
  arrange(color) %>%                  # sort the rows by color
  group_by(cut) %>%                   # group the data by cut
  mutate(dis = n_distinct(price),     # count the number of unique prices per cut
         rowID = row_number()) %>%    # number each row consecutively within each cut
  ungroup()                           # final ungrouping of the data
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
Problem A - purpose of code is written above
library(tidyverse)

# loading the midwest data set

data("midwest")

midwest %>%
  # grouping by state
  group_by(state) %>%
  # summarizing the data:
  summarize(poptotalmean = mean(poptotal),      # mean of poptotal
            poptotalmed = median(poptotal),     # median of poptotal
            popmax = max(poptotal),             # maximum poptotal
            popmin = min(poptotal),             # minimum poptotal
            popdistinct = n_distinct(poptotal), # number of distinct poptotal values
            popfirst = first(poptotal),         # first value of poptotal
            popany = any(poptotal < 5000),      # TRUE if any poptotal is below 5000
            popany2 = any(poptotal > 2000000)) %>% # TRUE if any poptotal is above 2 million
  # final ungrouping of the data
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>
Problem B - purpose of code is written above
# selecting the midwest data set

midwest %>%
  # grouping by state
  group_by(state) %>%
  # num5k counts how many counties have poptotal below 5000,
  # num2mil counts how many have poptotal above 2 million,
  # and numrows records the number of rows in each group
  summarize(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>%
  # ungrouping all the data
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
Problem C - The purpose of the code is written above

#| label: task-c-part-1

midwest %>%
  # grouping the data by county
  group_by(county) %>%
  # new column x: the number of distinct states each county name appears in
  summarize(x = n_distinct(state)) %>%
  # arranging the rows in descending order of x
  arrange(desc(x)) %>%
  # final ungrouping of the data
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
# How does n() differ from n_distinct()? n() counts all rows, including duplicates, whereas n_distinct() counts only the distinct values.

# When would they be the same? Different? They would be the same if all the values were unique, and different if there were duplicates.
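
A small illustration of the difference (the values here are made up for the example):

tibble(x = c(1, 2, 2, 3)) %>%
  summarize(rows = n(),                 # n() counts every row: 4
            distinct = n_distinct(x))   # n_distinct() ignores the duplicate 2: 3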

# selects the midwest data

midwest %>%
  # groups by county
  group_by(county) %>%
  # creates a new column x containing the number of rows for each county name
  summarize(x = n()) %>%
  # ungrouping
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
Problem D
# selecting the diamonds data set

diamonds %>%
  # grouping by clarity
  group_by(clarity) %>%
  # new columns: a = number of distinct colors, b = number of distinct prices,
  # c = total number of rows, including duplicates
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  # final ungrouping
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790
Problem E
# selects the diamonds data set

diamonds %>%
  # groups by colour and cut
  group_by(color, cut) %>%
  # new columns: m = mean price, s = standard deviation of price
  summarize(m = mean(price),
            s = sd(price)) %>%
  # ungrouping
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

2 Why is grouping data necessary?

Raw data is rarely ready to be analysed; grouping data helps with:

  • summary statistics

  • identifying trends

  • comparison

  • data visualisation

  • manipulating data

  • managing large data sets

3 Why is ungrouping data necessary?

  • return to original structure

  • create further analysis

  • clarity

4 When should you ungroup data?

  • after summary calculations

  • before applying new functions

  • when you want to do an operation that includes the whole data set

5 If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No, as the data remains ungrouped; if no grouping is involved, you don’t need to apply ungroup().
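
A minimal sketch showing why this matters, reusing the diamonds data from above: a grouped mutate() leaves the tibble grouped, so a later summary still runs per group until ungroup() is called.

grouped <- diamonds %>%
  group_by(cut) %>%
  mutate(avg = mean(price))   # the result is still grouped by cut

grouped %>%
  summarize(m = mean(price))  # runs per cut: 5 rows

grouped %>%
  ungroup() %>%
  summarize(m = mean(price))  # runs over the whole data set: 1 row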

Task 2

Go further and create your own commented code for the simple tasks listed in session 6.7 “Extra Practice” 

#| label: task-6-7
#| error: false

# selecting the diamonds data set

library(tidyverse)

view(diamonds)

diamonds %>%
  # arrange the diamonds by price from lowest to highest
  # (note: group_by(price) is not needed for sorting - it just leaves
  # the result grouped, as the output below shows)
  group_by(price) %>%
  arrange(price)
# A tibble: 53,940 × 10
# Groups:   price [11,602]
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Exercise for Research Methods

A good question for the diamonds data set:

Do the cut, colour and clarity have an effect on diamond prices?

A bad question:

What is the highest price a diamond can ever be?

Week 4

Formative exercise data analysis - post session

This week I watched Andrew’s video and reproduced his R script.

It is important to first look at the type of data you are working with to distinguish which plot to use.

First of all, the video looked at plotting a basic scatter graph, then at changing aesthetics such as the shape and colour of the points.

A scatter plot is good for two quantitative variables.

# Ensuring correct packages are running and selecting the data

library(tidyverse)
library(modeldata)

Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins
?ggplot
?crickets
View(crickets)

# The crickets data contains different species, with temperature in one column and the rate of chirping in another

# We specify the colour, x and y axes etc. before specifying the actual data plot

# The first argument is the data set; we're telling R that we want the x axis to be temp and the y to be rate, then plotting it as a scatter graph

ggplot(crickets, aes(x = temp, y = rate)) +
  geom_point() +
  # to change the labels and give a title
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  # adding colour to the points - each colour is a different species
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  # a more colour-blind-friendly palette
  scale_color_brewer(palette = "Dark2")

Looking at modifying the basic features of the plot:

# Modifying the basic properties of the plot

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# This changes the colour, size and shape of the points; alpha changes the opacity - helpful if you have overplotting

Then I added another layer, in this case a smooth line of best fit:

# Adding another layer

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  
  # this adds a regression line; method = "lm" means a linear model, and se = FALSE takes away the error shading
  
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

Then I looked at adding other layers to the modified plot:

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") 
`geom_smooth()` using formula = 'y ~ x'

Then I looked at other plots:

# A histogram - bins sets the number of bars (more bins means narrower bars)

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

A frequency polygon is similar to a histogram, just presented in a different way:

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

A bar chart for the species variable:

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

# Adding some extra modifications

ggplot(crickets, aes(x = species, 
                     fill = species,
                         )) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

A boxplot is good for one categorical and one quantitative variable:

In this case the quantitative variable is rate and the categorical is species.

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

# theme_minimal() takes out the default grey background from the plot

In RStudio, the Help menu has a cheat-sheet option which can help you decide which graph to use.

Faceting:

This is useful when looking at chirp rate per species in a histogram.

# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

# this is not very easy to interpret
# A better way of presenting it 

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

# This is a different way of viewing the above; it shows the difference between the two species more clearly

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

Running ?facet_wrap brings up a help guide for faceting.

Research Methods formative task

My understanding of a good research hypothesis:

When doing research, it is important to think of questions and hypotheses before you start collecting your data.

The hypothesis should be clear and specific. It is important that the hypothesis is relevant: it should relate to existing studies and contribute to the field of study. It should be easy to identify the different variables in the hypothesis so the study is repeatable. The hypothesis should be predictive, predicting trends between the variables.

Week 5

Choosing the right statistical analysis   

Post session 5

The statistical test would be an ANOVA.

The predictor variable is species, as it is on the x axis. Species is categorical. The outcome variable is sepal length, which is quantitative. This means we use a comparison-of-means test. Three different species are being looked at and there is one outcome variable - sepal length.
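
A minimal sketch of how this could be run in R, assuming the iris data used in the session:

summary(aov(Sepal.Length ~ Species, data = iris))  # one-way ANOVA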

The statistical test would be a simple regression.

The predictor variable is petal length, as this is on the x axis. Petal length is quantitative. The outcome variable is density, which is also quantitative. There is one predictor variable - petal length.
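
A hedged sketch: "density" comes from the session figure and is not a column in iris, so Sepal.Length stands in for the outcome here purely as an illustration:

summary(lm(Sepal.Length ~ Petal.Length, data = iris))  # simple regression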

The statistical test would be a multiple regression.

The predictor variable is petal length, which is quantitative. The outcome variable, petal width, is quantitative. There is more than one predictor variable, as species is being looked at too.
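
A minimal sketch using the iris variables named above:

summary(lm(Petal.Width ~ Petal.Length + Species, data = iris))  # multiple regression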

The statistical test would be a chi-squared test.

The predictor variable is species, which is categorical. The outcome variable is big or small, which is also categorical.
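
A minimal sketch; the big/small variable is hypothetical here, derived from the median sepal length purely for illustration:

size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")
chisq.test(table(iris$Species, size))  # chi-squared test of independence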

Plotting the graphs

# Boxplot
library(ggplot2)
library(tidyverse)

data("iris")
view(iris)

ggplot(iris, aes(x = Species, 
                 y = Sepal.Length,
                 color = Species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2")

# histogram
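
The histogram chunk is empty in this rendering; a minimal sketch for one quantitative iris variable (petal length, chosen as an example) might be:

ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 15)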