In these first tutorials, we have learnt how to create a basic Quarto file and how to perform basic functions in RStudio.
I have learnt to complete basic functions and assign values:
1 + 1
[1] 2
a <- 2
a + 1
[1] 3
I have learnt how to define vectors and combine them to make a data frame:
freq <- c(12, 17, 5, 11, 2, 7)
species <- c("Buzzard", "Hobby", "Sparrow", "Pigeon", "Harrier", "Hawk")
spec_freq <- data.frame(species, freq)
spec_freq
  species freq
1 Buzzard   12
2   Hobby   17
3 Sparrow    5
4  Pigeon   11
5 Harrier    2
6    Hawk    7
However, often we want to work with large data sets which we need to import into R. This can be done by going to ‘File’ and ‘Import Dataset’.
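The same import can also be done in code rather than through the menu. The following is a minimal sketch, assuming the data are stored as a CSV file; the file here is a temporary one written just for the example, and the name csv_path is hypothetical.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# In practice csv_path would point at your own file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("species,freq", "Buzzard,12", "Hobby,17", "Sparrow,5"), csv_path)

# read.csv() is base R and returns a data frame
birds <- read.csv(csv_path)

# Check the column names and types that were imported
str(birds)
```

Importing in code means the step is repeated every time the document is rendered, which the menu route does not guarantee.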
I have imported a data set called ‘penguins’. I can view this by selecting it in the environment window:
You can see this has 344 entries, so to get information from it we need to manipulate it in different ways. First we need to load the packages that allow us to do this:
library(tidyverse)
library(psych)
I can calculate descriptive statistics for any column of my choosing. (For some reason it would not render when I tried this with the penguins data set, but it would with my birds data set. I will have to look into this!)
describe(freq)
   vars n mean  sd median trimmed  mad min max range skew kurtosis   se
X1    1 6    9 5.4      9       9 5.19   2  17    15 0.14    -1.66 2.21
If I know which statistic I want, I can request that one specifically:
mean(freq)
[1] 9
Finally I learnt how to create graphs. Again it would not let me render, however here is the graph that I produced and the code used:
I now feel comfortable doing basic functions in R. I began to look at statistical analyses and graph functions. One issue I came across was getting R to ignore NA values in the data frame. I looked up the na.omit() function, but it still didn’t allow me to do numerical analyses on the data. This is something that I will have to look into further. I also had problems rendering certain chunks of code, even though they ran well in the chunks themselves.
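I have not confirmed what caused the NA problem, but two common causes are forgetting na.rm = TRUE and not saving the result of na.omit(). The sketch below shows both, using a made-up vector (bill_length and df are hypothetical names, not the real penguins data).

```r
# Hypothetical measurements with two missing values
bill_length <- c(39.1, NA, 40.3, 36.7, NA, 38.9)

# Missing values propagate, so this returns NA
mean(bill_length)

# na.rm = TRUE drops the NA values before the mean is calculated
mean_no_na <- mean(bill_length, na.rm = TRUE)
mean_no_na

# na.omit() returns a new object; it does not change the original,
# so the result has to be assigned to be of any use
df <- data.frame(id = 1:6, bill_length = bill_length)
df_complete <- na.omit(df)
nrow(df_complete)
```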
This week we looked at the tidyverse package and how this can help us manipulate data sets.
First we looked at the mutate() function. It can be used to add columns to a data set.
midwest.new <- # this ensures my new columns are saved in a new dataset
  midwest %>%
  mutate(child.to.adult = (percchildbelowpovert / percadultpoverty),
         ratio.adult = (popadults / poptotal),
         perc.adult = (ratio.adult * 100))
Here I have added 3 new columns. One to show the ratio of children to adults below the poverty line, one to give the ratio of adults in the population, and one to give the percentage of adults.
recode() can be used to alter a value in the dataset. It is most commonly used for correcting inconsistencies.
data %>% mutate(Variable = recode(Variable, "old value" = "new value"))
For example, the dataset below uses different labels for the same category.
print(dataset)
# A tibble: 6 × 2
  Sex    TestScore
  <fct>      <dbl>
1 male          10
2 m             20
3 M             10
4 Female        25
5 Female        12
6 Female         5
This can be fixed using recode() as shown below:
dataset.new <-
  dataset %>%
  mutate(Sex.new = recode(Sex,
                          "m" = "Male",
                          "M" = "Male",
                          "male" = "Male"))
print(dataset.new %>% select(TestScore, Sex.new))
# A tibble: 6 × 2
  TestScore Sex.new
      <dbl> <fct>
1        10 Male
2        20 Male
3        10 Male
4        25 Female
5        12 Female
6         5 Female
group_by() can be useful if we want to compare certain groups in our data, for example males vs females, age groups, or something else!
For example, I have a data set of test scores. To compare males and females I can group the data by sex and then carry out some analyses:
data %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n()) %>%     # calculates the total number
  ungroup() # It is important to remember to always ungroup afterwards!
# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25
Using a comma we can group by more than one factor. We can also combine with mutate() to make a new column specific to that factor. For example:
data.new <-
  data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>%   # calculates the average age of males and females
  mutate(x = mean(Score)) %>% # calculates the average score of males and females
  ungroup()
print(data.new)
# A tibble: 50 × 6
ID Sex Age Score m x
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 male 26 0.01 29.2 0.487
2 2 female 25 0.418 29.0 0.437
3 3 male 39 0.014 29.2 0.487
4 4 female 37 0.09 29.0 0.437
5 5 male 31 0.061 29.2 0.487
6 6 female 34 0.328 29.0 0.437
7 7 male 34 0.656 29.2 0.487
8 8 female 30 0.002 29.0 0.437
9 9 male 26 0.639 29.2 0.487
10 10 female 33 0.173 29.0 0.437
# ℹ 40 more rows
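The example above groups by Sex only. To illustrate grouping by more than one factor with a comma, here is a small sketch on made-up data; the data frame scores and the AgeGroup column are hypothetical and are not part of the data set used above.

```r
library(dplyr)

# Hypothetical test scores with two grouping factors
scores <- data.frame(
  Sex      = c("male", "male", "female", "female", "male", "female"),
  AgeGroup = c("young", "old", "young", "old", "young", "young"),
  Score    = c(0.2, 0.4, 0.6, 0.8, 0.4, 0.2)
)

# One row per combination of Sex and AgeGroup that occurs in the data
by_sex_age <- scores %>%
  group_by(Sex, AgeGroup) %>%
  summarize(m = mean(Score), # mean score within each combination
            n = n(),         # number of rows within each combination
            .groups = "drop") # returns an ungrouped result

print(by_sex_age)
```

Setting .groups = "drop" does the same job as a trailing ungroup() in this case.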
With filter() we can retain specific rows of data:
diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
This will return any rows which have a cut of Fair or Good and a price of $600 or less.
With select() we can choose which variables we want to see:
diamonds %>% select(cut, color)
# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
The arrange() function sorts the data by the stated variable (alphabetically or from lowest to highest). If we wrap the variable in desc(), it sorts in reverse order.
View all of the variable names in diamonds:
View(diamonds) # This allows us to see the dataset
1. Arrange the diamonds by:
a. Lowest to highest price
diamonds %>% arrange(price)
b. Highest to lowest price
diamonds %>% arrange(desc(price))
c. Lowest price and cut
diamonds %>% arrange(price, cut) # This will arrange first by lowest price and then by lowest cut if two prices are the same
d. Highest price and cut
diamonds %>% arrange(desc(price), desc(cut))
2. Arrange the diamonds by lowest to highest price and worst to best clarity.
diamonds %>% arrange(price, clarity) # clarity is an ordered factor, so ascending order runs from worst to best
3. Create a new variable named salePrice to reflect a discount of $250 off the original cost of each diamond
diamonds.new <-
  diamonds %>% mutate(salePrice = (price - 250))
diamonds.new %>% select(salePrice)
# A tibble: 53,940 × 1
salePrice
<dbl>
1 76
2 76
3 77
4 84
5 85
6 86
7 86
8 87
9 87
10 88
# ℹ 53,930 more rows
4. Remove the x, y, and z variables from the diamonds dataset
diamonds %>%
  select(-x, -y, -z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
5. Determine the number of diamonds there are for each cut value
diamond_counts <- diamonds %>%
  group_by(cut) %>%
  summarize(count = n())
print(diamond_counts)
# A tibble: 5 × 2
cut count
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
6. Create a new column named totalNum that calculates the total number of diamonds.
diamonds.new <-
  diamonds %>%
  mutate(totalNum = n())
diamonds.new %>%
  select(totalNum)
# A tibble: 53,940 × 1
totalNum
<int>
1 53940
2 53940
3 53940
4 53940
5 53940
6 53940
7 53940
8 53940
9 53940
10 53940
# ℹ 53,930 more rows
count() collapses the rows and counts the number of observations per group of values.
diamonds %>% count(cut)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
rename() does what it says on the tin:
#|output: false
diamonds %>% rename(PRICE = price)
# A tibble: 53,940 × 10
carat cut color clarity depth table PRICE x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Suppose I have a dataset of test scores:
Names <- c("John", "James", "Amelia", "Jim", "Suzie")
Scores <- c(2, 44, 76, 18, 55)
Test <- data.frame(Names, Scores)
print(Test)
Names Scores
1 John 2
2 James 44
3 Amelia 76
4 Jim 18
5 Suzie 55
I can use ifelse() to create another column based on the data we already have:
Test %>%
  mutate(Result = ifelse(Scores > 40, "Pass", "Fail"))
Names Scores Result
1 John 2 Fail
2 James 44 Pass
3 Amelia 76 Pass
4 Jim 18 Fail
5 Suzie 55 Pass
First we give ifelse() a condition to check. The second argument is the value it assigns when the condition is true, and the third is the value it assigns when the condition is false.
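The three arguments can be seen more directly on a plain vector. This is a small sketch with made-up scores; the 70 cut-off for "Distinction" is an assumption added purely to show how calls can be nested.

```r
# Hypothetical scores, including one exactly on the boundary
scores <- c(2, 44, 76, 18, 55, 40)

# ifelse(condition, value if TRUE, value if FALSE), applied element by element
result <- ifelse(scores > 40, "Pass", "Fail")
result

# Calls can be nested to create more than two categories
grade <- ifelse(scores >= 70, "Distinction",
                ifelse(scores > 40, "Pass", "Fail"))
grade
```

Note that a score of exactly 40 fails here, because the condition is strictly greater than 40.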
This week we have been looking at ggplot and producing graphs.
Scatter Graph
ggplot(crickets, aes(x=temp, y= rate)) +
geom_point()
This produces a graph using the data ‘crickets’. We tell it which variables we want on the x and y axes and what type of graph we want.
Once we have the basics we can tidy it up:
ggplot(crickets, aes(x=temp, y= rate)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp Rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
For example adding labels!
ggplot(crickets, aes(x=temp, y= rate, colour = species)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp Rate",
title = "Cricket chirps",
colour = "Species",
caption = "Source: McDonald (2009)")
Or use colour to provide extra information!
When using colour we can improve accessibility by adding the code:
scale_colour_brewer(palette = "Dark2")
This can be added as the last layer of the plot.
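As a rough illustration of where that line goes, here is a complete plot with the palette added as the final layer. The crickets data are not bundled with R, so this sketch assumes the built-in iris data as a stand-in.

```r
library(ggplot2)

# iris is used here only as a stand-in for the crickets data
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(x = "Petal length",
       y = "Petal width",
       colour = "Species") +
  scale_colour_brewer(palette = "Dark2") # palette added as the last layer

p
```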
Geom Properties
These go inside the geom_*() call, for example colour:
ggplot(crickets, aes(x=temp, y= rate)) +
geom_point(colour = "red",
size = 2, #point size
alpha = .8, #opacity
shape = "square")
ggplot(crickets, aes(x=temp, y= rate)) +
geom_point()+
geom_smooth(method = "lm",
se = FALSE) # removes the shaded confidence band
`geom_smooth()` using formula = 'y ~ x'
If we add a regression line when we have split the species by colour, we get two lines:
ggplot(crickets, aes(x=temp, y= rate, colour = species)) +
geom_point()+
geom_smooth(method = "lm",
se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Histograms
ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)
Frequency Polygons
ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)
Bar Graph
ggplot(crickets, aes(x= species))+
geom_bar(colour = "black",
fill = "lightblue")
To colour the bars by species we can:
ggplot(crickets, aes(x = species, fill = species))+
geom_bar(show.legend = FALSE)+
scale_fill_brewer(palette = "Dark2")
Box Plot
ggplot(crickets, aes(x=species, y = rate, fill = species))+
geom_boxplot(show.legend = FALSE) +
theme_minimal() #Removes background
Faceting
ggplot(crickets, aes(x=rate, fill= species))+
geom_histogram(bins=15) +
scale_fill_brewer(palette = "Dark2")
This is not very clear. Separate plots would be a lot clearer:
ggplot(crickets, aes(x=rate, fill = species))+
geom_histogram(bins=15, show.legend = FALSE) +
facet_wrap(~species) #wrap by species
ggplot(crickets, aes(x=rate, fill = species))+
geom_histogram(bins=15, show.legend = FALSE) +
facet_wrap(~species, ncol = 1)
This helps us to compare the two species with increased ease.
This week we had an introduction to statistical tests:
Top Left: Frequency Test: e.g. Chi-Squared
Bottom Left: Mean Test: e.g. t-test or ANOVA
Bottom Right: Correlations/models e.g. Regression
Top Right: Logistic: e.g. prediction of odds
ggplot(iris, aes(x=Species, y=Sepal.Length, colour = Species))+
geom_boxplot()+
labs(x= "Species", y="Sepal.Length", colour = "Species")
Here we are looking at differences in sepal length between different iris species. The predictor variable is categorical and the outcome is quantitative. Because there are more than two groups, an ANOVA test will allow us to see whether there is a statistically significant difference between the means.
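We have not covered how to run the test yet, but as a sketch of what the ANOVA described above might look like in base R (aov() and TukeyHSD() come from the built-in stats package):

```r
# One-way ANOVA: quantitative outcome (Sepal.Length) by categorical predictor (Species)
fit <- aov(Sepal.Length ~ Species, data = iris)

# The summary table contains the F statistic and its p-value
summary(fit)

# Extract the p-value for the Species term
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# A post-hoc test shows which pairs of species differ
TukeyHSD(fit)
```

The ANOVA itself only says that at least one mean differs; the post-hoc comparison is what identifies which species are different from each other.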
ggplot(iris, aes(x= Petal.Length, fill = Species))+
geom_density(alpha = 0.4)
ANOVA again?
ggplot(iris, aes(x= Petal.Length, y= Petal.Width))+
geom_point(mapping = aes(colour = Species, shape = Species))+
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Here we have two quantitative variables. A suitable test would be to test correlation. One way to do this would be a simple regression.
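A minimal sketch of both options in base R, assuming a straight-line relationship is reasonable (the object names ct and model are my own):

```r
# Pearson correlation between the two quantitative variables
ct <- cor.test(iris$Petal.Length, iris$Petal.Width)
ct$estimate # correlation coefficient
ct$p.value  # p-value for the null hypothesis of zero correlation

# Simple linear regression: petal width predicted from petal length
model <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(model) # slope, intercept, and R-squared
```

This is the same straight line that geom_smooth(method = "lm") draws on the plot.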
iris.new <-
  iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length),
                       "small", "big"))
ggplot(iris.new, aes(x = Species, fill = size)) +
geom_bar(position = "dodge")
Chi-squared?
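If a chi-squared test is the right choice here, a sketch of it in base R might look like the following. It rebuilds the size column without the tidyverse so that it runs on its own; iris_sized is a hypothetical name.

```r
# Recreate the small/big split using base R only
iris_sized <- iris
iris_sized$size <- ifelse(iris_sized$Sepal.Length < median(iris_sized$Sepal.Length),
                          "small", "big")

# Contingency table of counts: species by size
counts <- table(iris_sized$Species, iris_sized$size)
counts

# Chi-squared test of independence between species and size
test <- chisq.test(counts)
test
```

Both variables are categorical, so the test works on the counts in the table rather than on the original measurements.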