Quarto workbook - Rosie Longmore

library(tidyverse)
library(palmerpenguins)

Week 1

This week I have learnt how to render a workbook in RStudio.

I have played around with graph themes and colours.

Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Meet the penguins

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.
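The plot itself was produced by a hidden chunk in the rendered output; a minimal sketch of code that would reproduce it (my own reconstruction, not the exact template chunk):

# scatter plot of flipper length against bill length, coloured by species
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Species", shape = "Species")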

Week 3

This week I have completed task 6.1 from the RStudio workbook and looked at good and bad questions based on the data sets being used.

Post session tasks

Task 1

Data analysis

1a Follow the instructions, comment the purpose of each command.

Ensuring that tidyverse is loaded in your library:

# ensuring tidyverse is loaded in the library:

library(tidyverse)

# Example:

view(diamonds)

diamonds %>%                         

# utilizes the diamonds dataset
  
  group_by(color, clarity) %>%      
  
# groups data by color and clarity variables
 
mutate(price200 = mean(price)) %>% 
  
# creates new variable (average price by groups)
  
ungroup() %>%                      

# data no longer grouped by color and clarity
  
mutate(random10 = 10 + price) %>%  

# new variable, original price + $10
  

select(cut, color,clarity, price, price200, random10) %>% 
  
# retain only these listed columns
  
arrange(color) %>%
  
# orders the data by color
  
group_by(cut) %>%       

# groups data by cut
  
mutate(dis = n_distinct(price),
       
# counts the number of unique prices per cut

rowID = row_number()) %>%
  
# numbers each row consecutively for each cut
  
# final ungrouping of data

ungroup() 
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
Problem A - purpose of code is written above
library(tidyverse)

# Selecting the midwest data set

data("midwest")


# starting the pipe with the midwest data

midwest %>%   
  
# grouping by state
  
  group_by(state) %>% 
  
# summarizing data - calculating the mean of poptotal and renaming it poptotalmean
  
  summarize(poptotalmean = mean(poptotal),
            
# median of poptotal and renaming it poptotalmed

   poptotalmed = median(poptotal),

# finding the max poptotal and renaming popmax

popmax = max(poptotal),

# finding the min poptotal and renaming popmin

popmin = min(poptotal),

# counting the number of distinct values in poptotal and naming it popdistinct

popdistinct = n_distinct(poptotal),

# find the first value in poptotal and renaming popfirst

popfirst = first(poptotal),

# checking whether any poptotal value is less than 5000 and naming it popany

popany = any(poptotal < 5000),

# checking whether any poptotal value is greater than 2 million and naming it popany2

popany2 = any(poptotal > 2000000)) %>% 
  
# final ungrouping of data
  
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>
Problem B - purpose of code is written above
# selecting the midwest data set
midwest %>% 
  
# grouping by state
  
group_by(state) %>% 
  
# summarizing the data - counting how many counties have poptotal less than 5000 and naming it num5k
  
summarize(num5k = sum(poptotal < 5000),
          
# counting how many counties have poptotal greater than 2 million and naming it num2mil

num2mil = sum(poptotal > 2000000),

# creating a new column called numrows to count number of rows in current group

numrows = n()) %>% 
  
# ungrouping all the data
  
ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
Problem C - The purpose of the code is written above

#| label: task-c-part-1

# selects midwest data

midwest %>% 
  
# groups data by county
  
 group_by(county) %>% 

# creates a new column called x that shows the number of distinct states each county name appears in
  
summarize(x = n_distinct(state)) %>% 

# arranging the rows by x in descending order
  
arrange(desc(x)) %>% 

# final ungrouping of data
  
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
# How does n() differ from n_distinct()? - n() counts all rows including duplicates, whereas n_distinct() only counts distinct values and not duplicates

# When would they be the same? different? - They would be the same if all values were unique and different if there were duplicates
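A quick illustration of the difference on a single column (my own example, separate from the task):

diamonds %>% 
  summarize(all_rows = n(),                    # counts every row, duplicates included
            unique_prices = n_distinct(price)) # counts each distinct price only once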

# selects midwest data

midwest %>% 
  
# groups by county
  
  group_by(county) %>% 

# creates a new column named x that counts the number of rows for each county
  
  summarize(x = n()) %>% 
  
# ungrouping
  
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
Problem D
# selecting diamonds data set

diamonds %>% 

# grouping by clarity
  
group_by(clarity) %>%

# creating a new column a which is the number of distinct values in the colour category, a new column b which is the number of distinct prices, and a new column c which counts all rows (including duplicates)
  
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>% 
  
# final ungrouping
  
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790
Problem E
# selects diamonds data set

diamonds %>% 
  
# groups by colour and cut
  
  group_by(color, cut) %>% 

# creating a new column named m that shows the mean price and a new column called s which shows the standard deviation in the price
  
  summarize(m = mean(price),
            s = sd(price)) %>% 
 
  # ungrouping
  
 ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

2 Why is grouping data necessary?

Raw data is rarely ready to be analysed as it is; grouping data can help with:

  • summary statistics

  • identifying trends

  • comparison

  • data visualisation

  • manipulating data

  • managing large data sets

3 Why is ungrouping data necessary?

  • return to original structure

  • create further analysis

  • clarity

4 When should you ungroup data?

  • after summary calculations

  • before applying new functions

  • when you want to do an operation that includes the whole data set

5 If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No, as the data remains ungrouped; if there is no grouping involved you don’t need to apply ungroup().
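A small illustration of why ungrouping matters (my own example): a mean calculated while the data are still grouped is a per-group mean, whereas the same mutate() after ungroup() uses the whole data set.

diamonds %>% 
  group_by(cut) %>% 
  mutate(mean_by_cut = mean(price)) %>%   # mean price within each cut
  ungroup() %>% 
  mutate(mean_overall = mean(price)) %>%  # mean price across all diamonds
  select(cut, price, mean_by_cut, mean_overall)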

Task 2

Go further and create your own commented code for the simple tasks listed in session 6.7 “Extra Practice” 

#| label: task-6-7
#| error: false

# selecting diamonds data set

library(tidyverse)

view(diamonds)

diamonds %>%
  
# grouping by price, then arranging diamonds in price from lowest to highest
  
group_by(price) %>% 

arrange(price)
# A tibble: 53,940 × 10
# Groups:   price [11,602]
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Exercise for Research Methods

A good question for the diamonds data set:

Do cut, colour and clarity have an effect on diamond prices?

A bad question:

What is the highest price a diamond can ever be?

Week 4

Formative exercise data analysis - post session

This week I watched Andrew’s video and reproduced his R script.

It is important to first look at the type of data you are working with to decide which plot to use:

First, the script looked at plotting a basic scatter graph, then at changing aesthetics such as the shape and colour of the points.

A scatter plot is good for two quantitative variables

# Ensuring correct packages are running and selecting the data

library(tidyverse)
library(modeldata)

Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins
?ggplot
?crickets
View(crickets)

# The crickets data contains the species, the temperature in one column and the rate of chirping in another

# We specify the colour, x and y axes etc. before specifying the actual plot

# The first argument is the data set; we're telling R that we want the x axis to be temp and the y axis to be rate, then plotting it as a scatter graph

ggplot(crickets, aes(x = temp, y = rate)) +
 
    geom_point()+
# to change the labels, give a title
  
    labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

ggplot(crickets, aes(x = temp, 
                     y = rate,
               
                        color = species)) + 

   # adding colour to points - each colour is a different species 
  
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  
  # more colour blind friendly 
 
  scale_color_brewer(palette = "Dark2")

Looking at modifying the basic features of the plot:

# Modifying the basic properties of the plot

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# This changes the colour, size and shape of the points; alpha changes opacity - helpful if you have overplotting

Then I added another layer, in this case a smooth line of best fit.

# Adding another layer

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  
# this adds a regression line; method = "lm" fits a linear model, and se = FALSE removes the standard-error shading
  
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

Then I looked at adding other layers to the modified plot:

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") 
`geom_smooth()` using formula = 'y ~ x'

Then I looked at other plots:

# A histogram - bins sets the number of bars (and therefore their width)

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

A frequency polygon; this is similar to a histogram, just presented in a different way:

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

A bar chart for the species variable:

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

# Adding some extra modifications

ggplot(crickets, aes(x = species, 
                     fill = species,
                         )) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

A boxplot is good for one categorical and one quantitative variable:

In this case the quantitative variable is rate and the categorical is species.

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

# theme_minimal() takes out the default grey background from the plot

In RStudio, the Help menu has a cheat sheet option which can help you decide which graph to use.

Faceting:

This is useful when looking at chirp rate per species in a histogram.

# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

# this is not very easy to interpret
# A better way of presenting it 

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

# This is a way of viewing the above differently; it shows the difference between the two species more clearly

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

?facet_wrap can bring up a help guide for faceting

Research Methods formative task

My understanding of a good research hypothesis:

When doing research, it is important to think of questions and hypotheses before you start collecting your data.

The hypothesis should be clear and specific. It is important that the hypothesis is relevant; it should relate to existing studies and contribute to the field of study. It should be easy to identify the different variables in the hypothesis to make it repeatable. The hypothesis should be predictive - predicting trends between the variables.

Week 5

Choosing the right statistical analysis   

Post session 5

The statistical test would be an ANOVA.

The predictor variable is species as it is on the x axis. Species is categorical. The outcome variable is sepal length, which is quantitative. This means we use a comparison of means test. Three different species are being looked at and there is one outcome variable - sepal length.

The statistical test would be a simple regression.

The predictor variable is petal length as this is on the x axis. Petal length is quantitative. The outcome variable is density, which is quantitative. There is one predictor variable - petal length.

The statistical test would be a multiple regression.

The predictor variable is petal length, which is quantitative. The outcome variable, petal width, is quantitative. There is more than one predictor variable, as species is included too.

The statistical test would be a chi-squared test.

The predictor variable is species which is categorical. The outcome variable is big or small which is categorical.
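A minimal sketch of how each of these tests might be run on the iris data (my own illustration; the exact outcome variables and the big/small split are assumptions, not taken from the session):

# ANOVA: comparing mean sepal length across the three species
summary(aov(Sepal.Length ~ Species, data = iris))

# Simple regression: one quantitative predictor
summary(lm(Petal.Width ~ Petal.Length, data = iris))

# Multiple regression: petal length plus species as predictors
summary(lm(Petal.Width ~ Petal.Length + Species, data = iris))

# Chi-squared test: species against an assumed big/small split of sepal length
size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")
chisq.test(table(iris$Species, size))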

Plotting the graphs

# Boxplot
library(ggplot2)
library(tidyverse)

data("iris")
view(iris)

ggplot(iris, aes(x = Species, 
                     y = Sepal.Length,
                     color = Species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2")

# histogram
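The histogram chunk was left unfinished; a minimal sketch, assuming sepal length as the quantitative variable (matching the boxplot above):

ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 15)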

Week 6

This week we took a look at frequency tests.

Comparing Proportions in R - Easy Guides - Wiki - STHDA

The webpage above gives a great insight into statistical testing.

I have reproduced some examples from the webpage:

# Import the data

file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"

housetasks <- read.delim(file_path, row.names = 1)

# print the data

housetasks
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

The data is a contingency table containing 13 housetasks and their distribution in the couple:

  • rows are the different tasks

  • values are the frequencies of the tasks done:

  • by the wife only

  • alternatively

  • by the husband only

  • or jointly

A contingency table can be visualized using the function balloonplot() [in the gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
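A sketch of the call, following the STHDA example (it assumes the gplots package is installed; the data frame is first converted to a table object):

library("gplots")

# convert the data frame to a table and draw the balloon plot
dt <- as.table(as.matrix(housetasks))
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)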

It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package graphics:

library("graphics")
mosaicplot(dt, shade = TRUE, las = 2, main = "housetasks")


  • Blue color indicates that the observed value is higher than the expected value if the data were random

  • Red color specifies that the observed value is lower than the expected value if the data were random

From this mosaic plot, it can be seen that the housetasks Laundry, Main_meal, Dinner and Breakfeast (blue color) are mainly done by the wife in our example.

There is another package named vcd, which can be used to make a mosaic plot (function mosaic()) or an association plot (function assoc()).
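A sketch of the vcd versions, following the STHDA guide (it assumes the vcd package is installed and uses the dt table created above):

library("vcd")

# mosaic plot with residual-based shading
mosaic(dt, shade = TRUE, legend = TRUE)

# association plot of the first five tasks
assoc(head(dt, 5), shade = TRUE)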



Chi squared test:

chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).

The observed and the expected counts can be extracted from the result of the test as follows:

# Observed counts
chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
# Expected counts
round(chisq$expected,2)
            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70

If you want to know which cells contribute most to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:

$$r = \frac{o - e}{\sqrt{e}}$$

The above formula returns the so-called Pearson residuals (r) for each cell (or standardized residuals)

Cells with the highest absolute standardized residuals contribute the most to the total Chi-square score.

Pearson residuals can be easily extracted from the output of the function chisq.test():

round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556
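Since the squared residuals sum to the total Chi-square statistic, each cell's contribution can also be expressed as a percentage (the same calculation appears in the STHDA guide):

# contribution of each cell to the total Chi-square score, in percent
contrib <- 100 * chisq$residuals^2 / chisq$statistic
round(contrib, 3)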

Week 7