library(tidyverse)
library(palmerpenguins)
Quarto workbook - Rosie Longmore
Week 1
This week I have learnt how to render a workbook in RStudio.
I have played around with graph themes and colours.
Meet Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
Meet the penguins
The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
The plot below shows the relationship between flipper and bill lengths of these penguins.
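The template plot itself is not reproduced here; below is a minimal sketch of the kind of scatter plot described, assuming the flipper_length_mm and bill_length_mm columns from palmerpenguins (loaded at the top of this workbook):
# minimal sketch of the flipper vs bill length plot described above
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  labs(x = "Flipper length (mm)",
       y = "Bill length (mm)",
       title = "Flipper and bill length")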
Week 3
This week I have completed task 6.1 from the RStudio workbook and looked at good and bad questions based on the data sets being used.
Post session tasks
Task 1
Data analysis
1a Follow the instructions, comment the purpose of each command.
Ensuring that tidyverse is loaded in your library:
# ensuring tidyverse is loaded in the library:
library(tidyverse)
# Example:
view(diamonds)
diamonds %>%
# utilizes the diamonds dataset
group_by(color, clarity) %>%
# groups data by color and clarity variables
mutate(price200 = mean(price)) %>%
# creates new variable (average price by groups)
ungroup() %>%
# data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>%
# new variable, original price + $10
select(cut, color, clarity, price, price200, random10) %>%
# retain only these listed columns
arrange(color) %>%
# sorts the rows by color
group_by(cut) %>%
# groups data by cut
mutate(dis = n_distinct(price),
# counts the number of unique prices per cut
rowID = row_number()) %>%
# numbers each row consecutively for each cut
ungroup()
# final ungrouping of data
# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
Problem A - purpose of code is written above
library(tidyverse)
# Selecting the midwest data set
data("midwest")
# piping the midwest data into the commands below
midwest %>%
# grouping by state
group_by(state) %>%
# summarizing data - calculating the mean of poptotal and renaming it poptotalmean
summarize(poptotalmean = mean(poptotal),
# median of poptotal and renaming it poptotalmed
poptotalmed = median(poptotal),
# finding the max poptotal and renaming popmax
popmax = max(poptotal),
# finding the min poptotal and renaming popmin
popmin = min(poptotal),
# counting the number of distinct values in poptotal and naming it popdistinct
popdistinct = n_distinct(poptotal),
# find the first value in poptotal and renaming popfirst
popfirst = first(poptotal),
# checking whether any poptotal value is less than 5000 and naming it popany
popany = any(poptotal < 5000),
# checking whether any poptotal value is greater than 2 million and naming it popany2
popany2 = any(poptotal > 2000000)) %>%
# final ungrouping of data
ungroup()
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem B - purpose of code is written above
# selecting the midwest data set
midwest %>%
# grouping by state
group_by(state) %>%
# summarizing the data - counting the number of counties where poptotal is less than 5000 and naming it num5k
summarize(num5k = sum(poptotal < 5000),
# counting the number of counties where poptotal is greater than 2 million and naming it num2mil
num2mil = sum(poptotal > 2000000),
# creating a new column called numrows to count number of rows in current group
numrows = n()) %>%
# ungrouping all the data
ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C - The purpose of the code is written above
#| label: task-c-part-1
# selects midwest data
midwest %>%
# groups data by county
group_by(county) %>%
# create a new column called x that shows the number of distinct values in states
summarize(x = n_distinct(state)) %>%
# arranging the rows in descending order of x
arrange(desc(x)) %>%
# final ungrouping of data
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
# How does n() differ from n_distinct()? - n() counts all rows including duplicates, whereas n_distinct() only counts distinct values, ignoring duplicates
# When would they be the same? different? - they would be the same if all values were unique, and different if there were duplicates
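A toy example makes the difference concrete (the small tibble below is made up for illustration):
# toy example with made-up data: n() vs n_distinct()
df <- tibble(grp = c("a", "a", "a", "b", "b"),
             val = c(1, 1, 2, 3, 4))
df %>%
  group_by(grp) %>%
  summarize(rows = n(),                        # counts all rows, duplicates included
            unique_vals = n_distinct(val)) %>% # counts distinct values only
  ungroup()
# group "a": rows = 3 but unique_vals = 2 (the value 1 appears twice)
# group "b": rows = 2 and unique_vals = 2 (all values unique)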
# selects midwest data
midwest %>%
# groups by county
group_by(county) %>%
# creates a new column named x that counts the number of rows for each county
summarize(x = n()) %>%
# ungrouping
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Problem D
# selecting diamonds data set
diamonds %>%
# grouping by clarity
group_by(clarity) %>%
# creating a new column a with the number of distinct values of color, a column b with the number of distinct prices, and a column c with the total count of rows including duplicates
summarize(a = n_distinct(color),
b = n_distinct(price),
c = n()) %>%
# final ungrouping
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem E
# selects diamonds data set
diamonds %>%
# groups by colour and cut
group_by(color, cut) %>%
# creating a new column named m that shows the mean price and a column s that shows the standard deviation of the price
summarize(m = mean(price),
s = sd(price)) %>%
# ungrouping
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
2 Why is grouping data necessary?
Raw data is rarely ready to be analysed; grouping data can help with:
summary statistics
identifying trends
comparison
data visualisation
manipulating data
managing large data sets
3 Why is ungrouping data necessary?
returning to the original structure
enabling further analysis
clarity
4 When should you ungroup data?
after summary calculations
before applying new functions
when you want to do an operation that includes the whole data set
5 If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No. The data remains ungrouped, so if there is no grouping involved you don't need to apply ungroup().
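A short sketch of why this matters, using the diamonds data from above - without ungroup(), any later verb still operates within each cut group:
# minimal sketch: ungroup() before operations on the whole data set
diamonds %>%
  group_by(cut) %>%
  mutate(cut_mean = mean(price)) %>%  # per-cut mean; grouping still active
  ungroup() %>%                       # drop the grouping
  mutate(overall_mean = mean(price))  # mean over all 53,940 rows
# without the ungroup(), overall_mean would also be calculated per cut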
Task 2
Go further and create your own commented code for the simple tasks listed in session 6.7 “Extra Practice”
#| label: task-6-7
#| error: false
# selecting diamonds data set
library(tidyverse)
view(diamonds)
diamonds %>%
# grouping by price (not strictly needed just to arrange)
group_by(price) %>%
# arranging diamonds by price from lowest to highest
arrange(price)
# A tibble: 53,940 × 10
# Groups: price [11,602]
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Exercise for Research Methods
A good question for the diamonds data set:
Does the cut, colour and clarity have an effect on diamond prices?
A bad question:
What is the highest price a diamond can ever be?
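A quick exploratory sketch for the good question, using the diamonds data from ggplot2 (cut is shown here; colour and clarity could be swapped in the same way):
# sketch: does cut relate to price?
ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot(show.legend = FALSE)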
Week 4
Formative exercise data analysis - post session
This week I watched Andrew’s video and reproduced his R script.
It is important to first look at the type of data you are working with to decide which plot to use:
First of all it looked at plotting a basic scatter graph, then looked at changing the aesthetics such as the shape and colour of the points.
A scatter plot is good for two quantitative variables
# Ensuring correct packages are running and selecting the data
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
?ggplot
starting httpd help server ...
done
?crickets
View(crickets)
# The crickets data contains different species, temperature in one column and the rate of chirping in another
# We specify the colour, x and y axes etc. before specifying the actual plot type
# The first argument is the data set; we're telling R that we want the x axis to be temp and the y axis to be rate, then plotting it as a scatter graph
ggplot(crickets, aes(x = temp, y = rate)) +
geom_point() +
# to change the labels, give a title
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
# adding colour to points - each colour is a different species
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
# more colour blind friendly
scale_color_brewer(palette = "Dark2")
Looking at modifying the basic features of the plot:
# Modifying the basic properties of the plot
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 2,
alpha = .3,
shape = "square") +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
# This changes the colour, size and shape of the points; alpha changes opacity - helpful if you have overplotting
Then I added another layer, in this case a smooth line of best fit:
# Adding another layer
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
# this adds a regression line; method = "lm" means a linear model, and se = FALSE removes the standard-error shading
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'
Then I looked at adding other layers to the modified plot:
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Then I looked at other plots:
# A histogram - bins sets the number of bars (and therefore their width)
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) # one quantitative variable
A frequency polygon is similar to a histogram, just presented in a different way:
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
A bar chart for the species variable:
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black",
fill = "lightblue")
# Adding some extra modifications
ggplot(crickets, aes(x = species,
fill = species)) +
geom_bar(show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2")
A boxplot is good for one categorical and one quantitative variable:
In this case the quantitative variable is rate and the categorical is species.
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
# theme_minimal() takes out the default grey background from the plot
In RStudio, the Help menu has a cheat sheet option which can help you decide which graph to use.
Faceting:
This is useful when looking at chirp rate per species in a histogram.
# faceting
# not great:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
# this is not very easy to interpret
# A better way of presenting it
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) +
scale_fill_brewer(palette = "Dark2")
# This is a way of viewing the above differently, this shows the difference between the two species more clearly
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Running ?facet_wrap brings up a help guide for faceting.
Research Methods formative task
My understanding of a good research hypothesis:
When doing research, it is important to think of questions and hypotheses before you start collecting your data.
The hypothesis should be clear and specific. It is important that the hypothesis is relevant: it should relate to existing studies and contribute to the field of study. It should be easy to identify the different variables in the hypothesis to make it repeatable. The hypothesis should be predictive, predicting trends between the variables.
Week 5
Choosing the right statistical analysis
Post session 5
The statistical test would be an ANOVA.
The predictor variable is species, as it is on the x axis. Species is categorical. The outcome variable is sepal length, and this is quantitative. This means we use a comparison of means test. Three different species are being looked at and there is one outcome variable - sepal length.
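A minimal sketch of that test in R, using the built-in iris data:
# minimal sketch: one-way ANOVA of sepal length by species
data("iris")
model <- aov(Sepal.Length ~ Species, data = iris)
summary(model) # F test for differences in mean sepal length between species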
The predictor variable is petal length, as this is on the x axis. Petal length is quantitative. The outcome variable is density, which is also quantitative. There is one predictor variable - petal length.
The predictor variable is petal length, which is quantitative. The outcome variable, petal width, is quantitative. There is more than one predictor variable, as it is looking at species too.
The statistical test would be a chi-squared test.
The predictor variable is species which is categorical. The outcome variable is big or small which is categorical.
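A minimal sketch of such a test, assuming a hypothetical big/small category derived from petal length (the median split below is made up for illustration):
# minimal sketch: chi-squared test of species against a made-up size category
iris_size <- iris %>%
  mutate(size = if_else(Petal.Length > median(Petal.Length), "big", "small"))
chisq.test(table(iris_size$Species, iris_size$size))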
Plotting the graphs
# Boxplot
library(ggplot2)
library(tidyverse)
data("iris")
view(iris)
ggplot(iris, aes(x = Species,
y = Sepal.Length,
color = Species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2")
# histogram
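The histogram code itself is not included here; below is a minimal sketch of what it could look like, assuming sepal length as the variable of interest (the variable choice is an assumption):
# minimal sketch: histogram of sepal length (variable choice assumed)
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 15)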