library(tidyverse)
library(palmerpenguins)
Quarto workbook - Rosie Longmore
Week 1
This week I have learnt how to render a workbook in RStudio.
I have played around with graph themes and colours.
Meet Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
Meet the penguins
The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
The plot below shows the relationship between flipper and bill lengths of these penguins.
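A minimal sketch of that kind of plot (the exact code and styling used in the rendered version are assumed here):
# scatter plot of flipper length against bill length, coloured by species
ggplot(penguins, aes(x = flipper_length_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
labs(x = "Flipper length (mm)",
y = "Bill length (mm)",
title = "Flipper and bill length of the Palmer penguins")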
Week 3
This week I have completed task 6.1 from the RStudio workbook and looked at good and bad questions based on the data sets being used.
Post session tasks
Task 1
Data analysis
1a Follow the instructions, comment the purpose of each command.
Ensuring that tidyverse is loaded in your library:
# ensuring tidyverse is loaded in the library:
library(tidyverse)
# Example:
library(tidyverse)
view(diamonds)
# utilizes the diamonds dataset
diamonds %>%
group_by(color, clarity) %>%
# groups data by color and clarity variables
mutate(price200 = mean(price)) %>%
# creates new variable (average price by groups)
ungroup() %>%
# data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>%
# new variable, original price + $10
select(cut, color,clarity, price, price200, random10) %>%
# retain only these listed columns
arrange(color) %>%
# orders the rows by color
group_by(cut) %>%
# groups data by cut
mutate(dis = n_distinct(price),
# counts the number of unique prices per cut
rowID = row_number()) %>%
# numbers each row consecutively for each cut
ungroup()
# final ungrouping of data
# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
Problem A - purpose of code is written above
library(tidyverse)
# Selecting the midwest data set
data("midwest")
# selecting the midwest data
midwest %>%
# grouping by state
group_by(state) %>%
# summarizing data - calculating the mean of poptotal and renaming it poptotalmean
summarize(poptotalmean = mean(poptotal),
# median of poptotal and renaming it poptotalmed
poptotalmed = median(poptotal),
# finding the max poptotal and renaming popmax
popmax = max(poptotal),
# finding the min poptotal and renaming popmin
popmin = min(poptotal),
# counting the number of distinct values in poptotal, named popdistinct
popdistinct = n_distinct(poptotal),
# find the first value in poptotal and renaming popfirst
popfirst = first(poptotal),
# checking whether any poptotal value is less than 5000, named popany
popany = any(poptotal < 5000),
# checking whether any poptotal value is greater than 2 million, named popany2
popany2 = any(poptotal > 2000000)) %>%
# final ungrouping of data
ungroup()
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem B - purpose of code is written above
# selecting the midwest data set
midwest %>%
# grouping by state
group_by(state) %>%
# summarizing the data - counting the number of counties with poptotal less than 5000, named num5k
summarize(num5k = sum(poptotal < 5000),
# counting the number of counties with poptotal greater than 2 million, named num2mil
num2mil = sum(poptotal > 2000000),
# creating a new column called numrows to count number of rows in current group
numrows = n()) %>%
# ungrouping all the data
ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C - The purpose of the code is written above
#| label: task-c-part-1
# selects midwest data
midwest %>%
# groups data by county
group_by(county) %>%
# create a new column called x that shows the number of distinct values of state for each county
summarize(x = n_distinct(state)) %>%
# arranging the rows in descending order of x
arrange(desc(x)) %>%
# final ungrouping of data
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
# How does n() differ from n_distinct()? - n() counts all rows including duplicates, whereas n_distinct() only counts distinct values and ignores duplicates
# When would they be the same? different? - they would be the same if all values were unique and different if there were duplicates
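A minimal sketch illustrating the difference (the tiny tibble here is made up):
# three rows but only two distinct values
tibble(val = c(1, 1, 2)) %>%
summarize(all_rows = n(),
unique_vals = n_distinct(val))
# all_rows is 3, unique_vals is 2; with no duplicate values both would be 3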
# selects midwest data
midwest %>%
# groups by county
group_by(county) %>%
# creates a new column named x which counts the number of rows for each county
summarize(x = n()) %>%
# ungrouping
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Problem D
# selecting diamonds data set
diamonds %>%
# grouping by clarity
group_by(clarity) %>%
# creating a new column a with the number of distinct values of color, a new column b with the number of distinct prices, and a new column c with the total row count including duplicates
summarize(a = n_distinct(color),
b = n_distinct(price),
c = n()) %>%
# final ungrouping
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem E
# selects diamonds data set
diamonds %>%
# groups by colour and cut
group_by(color, cut) %>%
# creating a new column named m that shows the mean price and a new column called s which shows the standard deviation in the price
summarize(m = mean(price),
s = sd(price)) %>%
# ungrouping
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
2 Why is grouping data necessary?
Raw data is rarely ready to be analysed; grouping data can help with:
calculating summary statistics
identifying trends
making comparisons
data visualisation
manipulating data
managing large data sets
3 Why is ungrouping data necessary?
returning the data to its original structure
allowing further analysis on the whole data set
clarity
4 When should you ungroup data?
after summary calculations
before applying new functions
when you want to do an operation that includes the whole data set (see the sketch below)
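For example, a rough sketch using the diamonds data from above (my own illustration, not part of the task):
diamonds %>%
group_by(cut) %>%
# group-wise mean: one value per cut
mutate(cut_mean = mean(price)) %>%
# drop the grouping so the next step uses the whole data set
ungroup() %>%
# overall mean across all diamonds, not per cut
mutate(overall_mean = mean(price))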
5 If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No. The data remains ungrouped; if there is no grouping involved, you don't need to apply ungroup().
Task 2
Go further and create your own commented code for the simple tasks listed in session 6.7 “Extra Practice”
#| label: task-6-7
#| error: false
# selecting diamonds data set
library(tidyverse)
view(diamonds)
diamonds %>%
# arrange diamonds in price from lowest to highest
group_by(price) %>%
arrange(price)
# A tibble: 53,940 × 10
# Groups: price [11,602]
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Exercise for Research Methods
A good question for the diamonds data set:
Does the cut, colour and clarity have an effect on diamond prices?
A bad question:
What is the highest price a diamond can ever be?
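One way I might start exploring the good question (a rough sketch using just one of the three predictors):
# boxplot of price by cut - a first look at whether cut affects price
ggplot(diamonds, aes(x = cut, y = price)) +
geom_boxplot()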
Week 4
Formative exercise data analysis - post session
This week I watched Andrew’s video and reproduced his R script.
It is important to first look at the type of data you are working with to distinguish which plot to use:
First of all it looked at plotting a basic scatter graph, then looked at changing the aesthetics such as the shape and colour of the points.
A scatter plot is good for two quantitative variables
# Ensuring correct packages are running and selecting the data
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
?ggplot
starting httpd help server ...
done
?crickets
View(crickets)
# The crickets data contains different species, heat in one column and the rate of chirping in another
# We specify the colour, x and y axis etc. before specifying the actual data plot
# The first argument is the data set; we're telling R that we want the x axis to be temp and the y to be rate, then plotting it as a scatter graph
ggplot(crickets, aes(x = temp, y = rate)) +
geom_point()+
# to change the labels, give a title
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
# adding colour to points - each colour is a different species
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
# more colour blind friendly
scale_color_brewer(palette = "Dark2")
Looking at modifying the basic features of the plot:
# Modifying the basic properties of the plot
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 2,
alpha = .3,
shape = "square") +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
# This changes the colour, size and shape of the points, alpha changes opacity - helpful if you have over plotting
Then I added another layer, in this case a smooth line of best fit:
# Adding another layer
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
# this adds a regression line, method lm means a linear model, se false takes away the error shading
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'
Then I looked at adding other layers to the modified plot:
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Then I looked at other plots:
# A histogram - bins sets the number of bars
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) # one quantitative variable
A frequency polygon, this is similar to a histogram just presented in a different way:
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
A bar chart for the species variable:
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black",
fill = "lightblue")
# Adding some extra modifications
ggplot(crickets, aes(x = species,
fill = species)) +
geom_bar(show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2")
A boxplot is good for one categorical and one quantitative variable:
In this case the quantitative variable is rate and the categorical is species.
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
# theme_minimal() takes out the default grey background from the plot
In RStudio, the Help menu has a cheat sheet option which can help you decide which graph to use.
Faceting:
This is useful when looking at chirp rate per species in a histogram.
# faceting
# not great:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
# this is not very easy to interpret
# A better way of presenting it
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) +
scale_fill_brewer(palette = "Dark2")
# This is a way of viewing the above differently, this shows the difference between the two species more clearly
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Running ?facet_wrap brings up a help guide for faceting.
Research Methods formative task
My understanding of a good research hypothesis:
When doing research, it is important to think of questions and hypotheses before you start collecting your data.
The hypothesis should be clear and specific. It is important that the hypothesis is relevant: it should relate to existing studies and contribute to the field of study. It should be easy to identify the different variables in the hypothesis to make it repeatable. The hypothesis should be predictive - predicting trends between the variables.
Week 5
Choosing the right statistical analysis
Post session 5
The statistical test would be an ANOVA.
The predictor variable is species as it is on the x axis. Species is categorical. The outcome variable is sepal length and this is quantitative. This means we use a comparison of means test. Three different species are being looked at and there is one outcome variable - sepal length.
The predictor variable is petal length as this is on the x axis. Petal length is quantitative. The outcome variable is density, which is quantitative. There is one predictor variable - petal length.
The predictor variable is petal length, which is quantitative. The outcome variable, petal width, is quantitative. There is more than one predictor variable as it is looking at species too.
The statistical test would be a chi-squared test.
The predictor variable is species which is categorical. The outcome variable is big or small which is categorical.
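A rough sketch of how these two tests could be run on the iris data (the big/small variable is hypothetical here, made by splitting sepal length at its median):
# ANOVA: does mean sepal length differ between the three species?
summary(aov(Sepal.Length ~ Species, data = iris))
# chi-squared: species vs a made-up big/small size category
size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")
chisq.test(table(iris$Species, size))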
Plotting the graphs
# Boxplot
library(ggplot2)
library(tidyverse)
data("iris")
view(iris)
ggplot(iris, aes(x = Species,
y = Sepal.Length,
color = Species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2")
# histogram
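A minimal sketch of the histogram I had in mind (variable choice assumed - sepal length):
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 15)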
Week 6
This week we took a look at frequency tests.
Comparing Proportions in R - Easy Guides - Wiki - STHDA
The webpage above gives a great insight into statistical testing.
I have reproduced some examples from the webpage:
# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
# print the data
housetasks
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
The data is a contingency table containing 13 housetasks and their distribution in the couple:
rows are the different tasks
values are the frequencies of the tasks done:
by the wife only
alternatively
by the husband only
or jointly
A contingency table can be visualized using the function balloonplot() [in the gplots package]. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
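A sketch of that call, assuming the gplots package is installed:
library("gplots")
# convert the data frame to a table for plotting
dt <- as.table(as.matrix(housetasks))
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
label = FALSE, show.margins = FALSE)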
It’s also possible to visualize a contingency table as a mosaic plot. This is done using the function mosaicplot() from the built-in R package graphics:
library("graphics")
mosaicplot(as.matrix(housetasks), shade = TRUE, las = 2, main = "housetasks")
- Blue color indicates that the observed value is higher than the expected value if the data were random
- Red color specifies that the observed value is lower than the expected value if the data were random
From this mosaic plot, it can be seen that the housetasks *Laundry, Main_meal, Dinner and Breakfeast* (blue color) are mainly done by the wife in our example.
There is another package named *vcd*, which can be used to make a mosaic plot (function *mosaic*()) or an association plot (function *assoc*())
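A sketch of those vcd calls, assuming the vcd package is installed:
library("vcd")
# mosaic plot with residual-based shading
mosaic(as.table(as.matrix(housetasks)), shade = TRUE, legend = TRUE)
# association plot
assoc(as.table(as.matrix(housetasks)), shade = TRUE)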
Chi squared test:
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).
The observed and the expected counts can be extracted from the result of the test as follows:
# Observed counts
chisq$observed
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
# Expected counts
round(chisq$expected,2)
Wife Alternating Husband Jointly
Laundry 60.55 25.63 38.45 51.37
Main_meal 52.64 22.28 33.42 44.65
Dinner 37.16 15.73 23.59 31.52
Breakfeast 48.17 20.39 30.58 40.86
Tidying 41.97 17.77 26.65 35.61
Dishes 38.88 16.46 24.69 32.98
Shopping 41.28 17.48 26.22 35.02
Official 33.03 13.98 20.97 28.02
Driving 47.82 20.24 30.37 40.57
Finances 38.88 16.46 24.69 32.98
Insurance 47.82 20.24 30.37 40.57
Repairs 56.77 24.03 36.05 48.16
Holidays 55.05 23.30 34.95 46.70
If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:
$r = \frac{o - e}{\sqrt{e}}$
The above formula returns the so-called Pearson residuals (r) for each cell (or standardized residuals)
Cells with the highest absolute standardized residuals contribute the most to the total Chi-square score.
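As a check, the residuals can also be computed by hand from the observed and expected counts extracted above (a small sketch of my own):
# Pearson residuals computed manually: (observed - expected) / sqrt(expected)
round((chisq$observed - chisq$expected) / sqrt(chisq$expected), 3)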
Pearson residuals can be easily extracted from the output of the function chisq.test():
round(chisq$residuals, 3)
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
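The relative contribution of each cell to the total Chi-square statistic can be expressed as a percentage (a sketch; the squared residuals sum to the X-squared value):
# percentage contribution of each cell to the total Chi-square statistic
contrib <- 100 * chisq$residuals^2 / chisq$statistic
round(contrib, 3)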