Week One

Quarto

In these first tutorials, we have learnt how to create a basic Quarto file and how to perform basic functions in RStudio.

Using RStudio

I have learnt to carry out basic calculations and assign values to variables:

1+1

[1] 2

a<-2
a+1

[1] 3

Creating a Data Frame

I have learnt how to define vectors and combine them to make a data frame:

freq <- c(12, 17, 5, 11, 2, 7)
species <- c("Buzzard", "Hobby", "Sparrow", "Pigeon", "Harrier", "Hawk")
spec_freq <- data.frame(species,freq)

spec_freq

  species freq
1 Buzzard   12
2   Hobby   17
3 Sparrow    5
4  Pigeon   11
5 Harrier    2
6    Hawk    7

Importing Data and Loading Packages

Often, however, we want to work with large data sets that need to be imported into R. This can be done by going to ‘File’ and then ‘Import Dataset’.

I have imported a data set called ‘penguins’. I can view this by selecting it in the environment window:

The data set has 344 entries, so to get information from it we need to manipulate it in different ways. First we need to load the packages that allow us to do this:

library(tidyverse)
library(psych)

I can calculate descriptive statistics for any column of my choosing. (For some reason it would not let me render when I tried this with the penguins data set, but it would when I used my birds data set. I will have to look into this!)

describe(freq)

   vars n mean  sd median trimmed  mad min max range skew kurtosis   se
X1    1 6    9 5.4      9       9 5.19   2  17    15 0.14    -1.66 2.21

If I know which statistic I want, I can ask for that one specifically:

mean(freq)

[1] 9

Finally I learnt how to create graphs. Again it would not let me render, however here is the graph that I produced and the code used:

Summary and Problems faced:

I now feel comfortable doing basic functions in R, and I have begun to look at statistical analyses and graph functions. One issue I came across was getting R to ignore NA values in the data frame: I looked up the na.omit() function, but it still did not allow me to do numerical analyses on the data. This is something that I will have to look into further. I also had problems rendering certain chunks of code, even though they ran well in the blocks themselves.
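I do not yet know exactly why na.omit() did not work for me, but one common way to handle NA values is the na.rm argument that many summary functions accept. The sketch below uses a made-up vector (bill_length is a hypothetical name, not the real penguins data) to show the idea:

```r
# Hypothetical measurements with one missing value (not the real penguins data)
bill_length <- c(39.1, 39.5, NA, 36.7)

mean(bill_length)                # gives NA because one value is missing
mean(bill_length, na.rm = TRUE)  # drops the NA before calculating the mean
sd(bill_length, na.rm = TRUE)    # the same argument works for sd()

# na.omit() returns a cleaned copy, so the result has to be saved to be used
clean <- na.omit(bill_length)
mean(clean)
```

One thing to check is whether the result of na.omit() was assigned to a new object; calling it on its own does not change the original data.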

Week 2: Data Wrangling

This week we looked at the tidyverse package and how this can help us manipulate data sets.

Mutate()

First we looked at the mutate() function. It can be used to add columns to a data set.

midwest.new <- #this ensures my new columns are saved in a new dataset
  midwest %>% 
  mutate(child.to.adult = (percchildbelowpovert / percadultpoverty),
         ratio.adult = (popadults / poptotal),
         perc.adult = (ratio.adult * 100))

Here I have added three new columns: one giving the ratio of the percentage of children below the poverty line to the percentage of adults in poverty, one giving the proportion of adults in the population, and one giving the percentage of adults.

Recode()

Recode can be used to alter a value in the dataset. It is most commonly used for correcting inconsistencies.

data %>% mutate(Variable = recode(Variable, "old value" = "new value"))

For example, the dataset below has different labels for the same category.

print(dataset)

# A tibble: 6 × 2
  Sex    TestScore
  <fct>      <dbl>
1 male          10
2 m             20
3 M             10
4 Female        25
5 Female        12
6 Female         5

This can be fixed using recode() as shown below:

dataset.new <- 
  dataset %>% 
  mutate(Sex.new = recode(Sex, 
                          "m" = "Male",
                          "M" = "Male",
                          "male" ="Male"))

print(dataset.new %>% select(TestScore, Sex.new))

# A tibble: 6 × 2
  TestScore Sex.new
      <dbl> <fct>  
1        10 Male   
2        20 Male   
3        10 Male   
4        25 Female 
5        12 Female 
6         5 Female

Group_by()

group_by() is useful when we want to compare groups in our data, for example males vs females, age groups, or something else!

For example, I have a data set of test scores. To compare males and females, I can group the data by sex and then carry out some analyses:

data %>% 
  group_by(Sex) %>% 
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n()) %>%     # counts the number of rows in each group
  ungroup() # It is important to remember to always ungroup afterwards!

# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25

Using a comma we can group by more than one factor. We can also combine group_by() with mutate() to add a new column holding the value for each group. For example:

data.new <-
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>%   # calculates the average age of males and females
  mutate(x = mean(Score)) %>% # calculates the average score of males and females
  ungroup() 
print(data.new)

# A tibble: 50 × 6
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.487
 2     2 female    25 0.418  29.0 0.437
 3     3 male      39 0.014  29.2 0.487
 4     4 female    37 0.09   29.0 0.437
 5     5 male      31 0.061  29.2 0.487
 6     6 female    34 0.328  29.0 0.437
 7     7 male      34 0.656  29.2 0.487
 8     8 female    30 0.002  29.0 0.437
 9     9 male      26 0.639  29.2 0.487
10    10 female    33 0.173  29.0 0.437
# ℹ 40 more rows
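Grouping by more than one factor can be sketched as follows. My test score data is not included here, so this example assumes the diamonds data set that ships with ggplot2 instead; the names price_summary and mean_price are my own choices for illustration.

```r
library(dplyr)
library(ggplot2)  # only needed here for the diamonds data set

# Group by two factors at once by separating them with a comma
price_summary <- diamonds %>%
  group_by(cut, color) %>%
  summarize(mean_price = mean(price),  # mean price for each cut/colour pair
            n = n(),                   # number of diamonds in each pair
            .groups = "drop")          # return an ungrouped result

print(price_summary)
```

Setting .groups = "drop" does the same job as calling ungroup() at the end of the pipe.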

Filter()

With filter() we can retain specific rows of data:

diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

This returns every diamond with a cut of Fair or Good and a price of $600 or less.

Select()

We can choose which variables we want to see:

diamonds %>% select(cut, color)

# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows

Arrange()

The arrange() function sorts the data by the variable stated (either alphabetically or from lowest to highest). If we wrap the variable in desc(), it will be sorted in reverse order.

6.7 Extra Practice:

View all of the variable names in diamonds:

View(diamonds) # This opens the whole dataset in a viewer; names(diamonds) would list just the variable names

1. Arrange the diamonds by:

a. Lowest to highest price

diamonds %>% arrange(price)

b. Highest to lowest price

diamonds %>% arrange(desc(price))

c. Lowest price and cut

diamonds %>% arrange(price, cut) # arranges by lowest price first, then by lowest cut where prices are tied

d. Highest price and cut

diamonds %>% arrange(desc(price), desc(cut))

2. Arrange the diamonds by lowest to highest price and worst to best clarity.

diamonds %>% arrange(price, clarity) # no desc() needed: both are sorted from lowest to highest

3. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond

diamonds.new <-
  diamonds %>% mutate(salePrice = price - 250)
diamonds.new %>% select(salePrice) # select() shows just the new column

# A tibble: 53,940 × 1
   salePrice
       <dbl>
 1        76
 2        76
 3        77
 4        84
 5        85
 6        86
 7        86
 8        87
 9        87
10        88
# ℹ 53,930 more rows

4. Remove the x, y, and z variables from the diamonds dataset

diamonds %>%
  select(-x, -y, -z)

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

5. Determine the number of diamonds there are for each cut value

diamond_counts <- diamonds %>%
  group_by(cut) %>%
  summarize(count = n())
print(diamond_counts)

# A tibble: 5 × 2
  cut       count
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

6. Create a new column named totalNum that calculates the total number of diamonds.

diamonds.new <-
  diamonds %>%
  mutate(totalNum = n()) # n() gives the total number of rows
diamonds.new %>%
  select(totalNum) # select() shows just the new column

# A tibble: 53,940 × 1
   totalNum
      <int>
 1    53940
 2    53940
 3    53940
 4    53940
 5    53940
 6    53940
 7    53940
 8    53940
 9    53940
10    53940
# ℹ 53,930 more rows

Count()

Collapses the rows and counts the number of observations per group of values.

diamonds %>% count(cut)

# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

Rename()

Does what it says on the tin. The new name goes on the left of the = and the old name on the right:

diamonds %>% rename(PRICE = price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table PRICE     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

ifelse()

Suppose I have a data set of test scores:

Names <- c("John", "James", "Amelia", "Jim", "Suzie")
Scores <- c(2, 44, 76, 18, 55)
Test <- data.frame(Names, Scores)
print(Test)

   Names Scores
1   John      2
2  James     44
3 Amelia     76
4    Jim     18
5  Suzie     55

I can use ifelse() to create another column based on the data we already have:

Test %>%
  mutate(Result = ifelse(Scores>40, "Pass", "Fail"))

   Names Scores Result
1   John      2   Fail
2  James     44   Pass
3 Amelia     76   Pass
4    Jim     18   Fail
5  Suzie     55   Pass

ifelse() takes three arguments: first the condition to check, then the value to assign if the condition is true, and finally the value to assign if it is false.
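Because the value assigned when the condition is false can itself be another ifelse(), more than two outcomes are possible. This is a small sketch using the same scores; the 70 mark threshold and the "Distinction" label are made up for illustration.

```r
Scores <- c(2, 44, 76, 18, 55)

# The second ifelse() is only reached when Scores >= 70 is false
Grade <- ifelse(Scores >= 70, "Distinction",
                ifelse(Scores > 40, "Pass", "Fail"))

print(Grade)
```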

Week 3: Visualising Data

This week we have been looking at ggplot2 and producing graphs.

Scatter Graph

ggplot(crickets, aes(x=temp, y= rate)) +
geom_point()

This produces a graph using the ‘crickets’ data. We tell it which variables we want on the x and y axes and what type of graph we want.

Once we have the basics we can tidy it up

ggplot(crickets, aes(x=temp, y= rate)) +
geom_point() +
  labs(x = "Temperature",
       y = "Chirp Rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

For example adding labels!

ggplot(crickets, aes(x=temp, y= rate, colour = species)) +
geom_point() +
  labs(x = "Temperature",
       y = "Chirp Rate",
       title = "Cricket chirps",
       colour = "Species",
       caption = "Source: McDonald (2009)")

Or use colour to provide extra information!

When using colour we can improve accessibility by adding the following line of code:

scale_colour_brewer(palette = "Dark2")

This is added as the last layer of the plot, joined on with a +.
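As a sketch of where the palette line sits in a complete plot, the example below builds a small made-up data frame (chirps is a hypothetical stand-in, not the real crickets data) so that it can run on its own:

```r
library(ggplot2)

# Made-up values for illustration only (not the real crickets data)
chirps <- data.frame(
  temp    = c(20, 22, 24, 26, 21, 23, 25, 27),
  rate    = c(60, 66, 73, 80, 50, 55, 61, 67),
  species = rep(c("Species A", "Species B"), each = 4)
)

p <- ggplot(chirps, aes(x = temp, y = rate, colour = species)) +
  geom_point() +
  labs(x = "Temperature", y = "Chirp Rate", colour = "Species") +
  scale_colour_brewer(palette = "Dark2")  # palette layer added last, after a +

print(p)
```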

Geom Properties

These go inside the geom_*() call. For example, colour:

ggplot(crickets, aes(x=temp, y= rate)) +
geom_point(colour = "red", 
           size = 2, #point size
           alpha = .8, #opacity
           shape = "square")

Regression Lines

ggplot(crickets, aes(x=temp, y= rate)) +
  geom_point()+
  geom_smooth(method = "lm", 
              se = FALSE) # hides the confidence band around the line

`geom_smooth()` using formula = 'y ~ x'

If we add a regression line when we have split the species by colour, we get two lines, one per species:

ggplot(crickets, aes(x=temp, y= rate, colour = species)) +
  geom_point()+
  geom_smooth(method = "lm", 
              se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

Other Plots:

One Quantitative Variable:

Histograms

ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)

Frequency Polygons

ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)

One Categorical Variable:

Bar Graph

ggplot(crickets, aes(x= species))+
  geom_bar(colour = "black", 
           fill = "lightblue")

To specify colour to species we can:

ggplot(crickets, aes(x = species, fill = species))+
  geom_bar(show.legend = FALSE)+
  scale_fill_brewer(palette = "Dark2")

One Categorical and One Quantitative Variable

Box Plot

ggplot(crickets, aes(x=species, y = rate, fill = species))+
  geom_boxplot(show.legend = FALSE) +
  theme_minimal() # uses a minimal theme without the grey background

Faceting

ggplot(crickets, aes(x=rate, fill= species))+
  geom_histogram(bins=15) +
  scale_fill_brewer(palette = "Dark2")

This is not very clear. Separate plots would be a lot clearer:

ggplot(crickets, aes(x=rate, fill = species))+
  geom_histogram(bins=15, show.legend = FALSE) +
  facet_wrap(~species) #wrap by species

ggplot(crickets, aes(x=rate, fill = species))+
  geom_histogram(bins=15, show.legend = FALSE) +
  facet_wrap(~species, ncol = 1)

This helps us to compare the two species with increased ease.

Week 4: Statistical Analysis

This week we had an introduction to statistical tests:

Top Left: Frequency Test: e.g. Chi-Squared

Bottom Left: Mean Test: e.g. t-test or ANOVA

Bottom Right: Correlations/models e.g. Regression

Top Right: Logistic: e.g. prediction of odds

ggplot(iris, aes(x=Species, y=Sepal.Length, colour = Species))+
  geom_boxplot()+
  labs(x= "Species", y="Sepal.Length", colour = "Species")

Here we are looking at differences in Sepal Length between different iris species. The predictor variable is categorical and the outcome is quantitative. Because there are more than two groups, an ANOVA test will allow us to see if there is a statistically significant difference between the means.
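We have not covered how to run the test yet, but as a rough sketch, a one-way ANOVA on the built-in iris data could look like this in base R. The object name sepal_aov is my own, and the follow-up with TukeyHSD() is one common option rather than the only one.

```r
# One-way ANOVA: does mean sepal length differ between the species?
sepal_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(sepal_aov)

# If the ANOVA is significant, Tukey's HSD shows which pairs of species differ
TukeyHSD(sepal_aov)
```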

ggplot(iris, aes(x= Petal.Length, fill = Species))+
  geom_density(alpha = 0.4)

ANOVA again?

ggplot(iris, aes(x= Petal.Length, y= Petal.Width))+
  geom_point(mapping = aes(colour = Species, shape = Species))+
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Here we have two quantitative variables, so a suitable approach would be to test for a correlation. One way to do this would be a simple linear regression.
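A minimal sketch of both options in base R, assuming the same variables as in the plot (Petal.Length on the x axis and Petal.Width on the y axis); petal_cor and petal_lm are names I have chosen for illustration:

```r
# Pearson correlation between the two petal measurements
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width)
print(petal_cor)

# Simple linear regression: petal width predicted from petal length
petal_lm <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(petal_lm)
```

This fits the same kind of straight line that geom_smooth(method = "lm") draws on the plot.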

iris.new <-
  iris %>%
  mutate(size=ifelse(Sepal.Length < median(Sepal.Length),
                     "small", "big"))

ggplot(iris.new, aes(x = Species, fill = size)) +
  geom_bar(position = "dodge")

Chi-squared?
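If a chi-squared test is the right choice here, a rough sketch could look like the following. It rebuilds the size column in base R so that it runs on its own; iris_size, size_table and size_test are names I have made up for this example.

```r
# Recreate the small/big split on a copy of iris (same rule as above)
iris_size <- iris
iris_size$size <- ifelse(iris_size$Sepal.Length < median(iris_size$Sepal.Length),
                         "small", "big")

# Count how many flowers of each species fall into each size group
size_table <- table(iris_size$Species, iris_size$size)
print(size_table)

# Chi-squared test of independence between species and size
size_test <- chisq.test(size_table)
print(size_test)

# Expected counts, useful for checking that none of them are very small
size_test$expected
```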