R basics and normal distributions

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. R code chunks look like this:

# This is a comment line, which we will often use as a question prompt.
# Do not change the name of the chunk because it is used when grading.
note <- "This is a line of R code that sets a variable to a value"
note # This line will print the value to the knitted document.

## [1] "This is a line of R code that sets a variable to a value"

We will also use multiple choice questions within this assignment. You will use an X to indicate you choice, as shown below.

Which of the following is a shade of red?

aquamarine
gold
eggshell
maroon

Gradescope Unit Tests

Gradescope is the website that we will be using to grade your RMarkdown exercises, and you will access it via Canvas. To submit, you will upload your RMarkdown file to Gradescope, and Gradescope will run your file and its output through many different unit tests to validate your code. Each unit test will check something specific about your code and if something fails, you will received a report about the failure. You may submit as many times as you want up until the deadline for the assignment.

There is a learning curve attached to Gradescope, but once you get the hang of using unit testing to fix your code, it will improve your productivity. Due to limitations of the unit tests, some unit tests will be difficult to pass on first submission, but in these cases the failure messages will help you fix your code to match what the unit test expects.

Unit testing is new to BIO 415/514 this semester. There may be errors with the assignments or the unit tests, please reach out to us with any issues or feedback. We have tried our best to make the unit test output understandable by novice R uses, if you have problems understanding your unit test reports, please reach out to us for help.

Gradescope Tips

Treat unit testing as a form of real-time feedback that allows you to improve your code.
Pay attention to the failure reports, as they often contain information about what you need to fix.
Do not upload data files, as they will be available on available in the working directory on Gradscope’s server.
Do not call setwd() in your RMarkdown, as your local directory structure will not match the server’s directory structure.
Do not try to install packages or other interactive operations as Gradescope’s interface does not support that.

Basic R exercises

Once you have completed all of the exercises below, use the Knit button to generate an html file with all of your code and output. Then submit both the html file and this R Markdown file on Canvas.

Question 1

What is the sex ratio at birth of red deer? You have data from six deer mothers on the sex of all their offspring, as follows:

Deer 1: 3 female, 2 male
Deer 2: 2 female, 5 male
Deer 3: 1 female, 3 male
Deer 4: 5 female, 4 male
Deer 5: 4 female, 1 male
Deer 6: 3 female, 2 male

Use the prompts in the code chunk below to write a script to analyze these data. For each comment, write the corresponding code immediately below, as in the first example.

# Store the data in two vectors, one for males and one for females.
male <- c(2, 5, 3, 4, 1, 2)
female <- c(3, 2, 1, 5, 4, 3)

# Calculate the sex ratio for Deer 1. That is, what proportion of her
# offspring are female?
female[1] / (male[1] + female[1])

## [1] 0.6

# Calculate the sex ratio for all of the deer mothers.
sex_ratio <- female / (female + male)
print (sex_ratio)

## [1] 0.6000000 0.2857143 0.2500000 0.5555556 0.8000000 0.6000000

# Do this in one single R command.
# Do not separately calculate the ratio for each mother.

# Calculate the average sex ratio across the six mothers.
average_sex_ratio <- mean(sex_ratio)
print(average_sex_ratio)

## [1] 0.5152116

Question 1d

Do red deer appear to have an equal sex ratio at birth?

No, the average for this sample is not 0.5.
Yes, the average for this sample is very close to 0.6.
No, the average for this sample is 0.52
Yes, the average for this sample is very close to 0.5.

Question 2

Body mass index (BMI) is a measure of body weight designed to account for differences in height. It is equal to weight divided by height squared, with weight measured in kilograms and height in meters. You have data on the height and weight of ten people, as follows:

167 cm, 64 kg
175 cm, 72 kg
180 cm, 73 kg
155 cm, 65 kg
173 cm, 75 kg
185 cm, 74 kg
181 cm, 82 kg
163 cm, 69 kg
179 cm, 79 kg
170 cm, 72 kg

Normally, you would input these two columns of data into a spreadsheet via Excel or Google Sheets and save it as a CSV file (comma separated values) with appropriate column labels, e.g. “height” and “weight.” However, Gradescope will not have access to your local CSV unless you upload it alongside your script. Because this data is small enough, we will input create a data frame to hold it directly in R.

Use the prompts in the code chunk below to write a script to analyze this data.

# We can construct a data frame to hold the first three values
data.frame(
    height = c(167, 175, 180),
    weight = c(64, 72, 73)
)

##   height weight
## 1    167     64
## 2    175     72
## 3    180     73

# Construct a data frame to hold all values of this data.
# Save it to a variable called `body_data`.

  height <- c(167, 175, 180, 155, 173, 185, 181, 163, 179, 170)
  weight <- c(64, 72, 73, 65, 75, 74, 82, 69, 79, 72)
  body_data <- data.frame(height, weight)

# Show the size of the data frame.
dim(body_data)

## [1] 10  2

# Show the names of the data frame's variables.
names(body_data)

## [1] "height" "weight"

# Calculate the average weight of the ten people.
average_weight <- mean(body_data$weight)
print(average_weight)

## [1] 72.5

# Calculate the BMI of each person and store it in the data frame.
# Do this with one single command; do not separately calculate the BMI for each
# person. Don't forget that BMI expects height in meters, not centimeters.
body_data$bmi <- body_data$weight / ((body_data$height / 100)^2)
  print(body_data)

##    height weight      bmi
## 1     167     64 22.94812
## 2     175     72 23.51020
## 3     180     73 22.53086
## 4     155     65 27.05515
## 5     173     75 25.05931
## 6     185     74 21.62162
## 7     181     82 25.02976
## 8     163     69 25.97012
## 9     179     79 24.65591
## 10    170     72 24.91349

# Make a scatterplot of weight vs BMI. Be sure to label the plot axes.
plot(body_data$weight, body_data$bmi,
     xlab = "Weight (kg)", ylab = "Body mass index",
     main = "Scatterplot of Weight vs. Body Mass Index")

Question 2g

Does BMI appear to depend on weight?

No, there is no obvious relationship between BMI and weight.
Yes, there is no obvious relationship between BMI and weight.
No, there is an obvious relationship between BMI and weight.
Yes, there is an obvious relationship between BMI and weight.

Question 3

How fast does the concentration of a toxin in the bloodstream decrease? A typical pattern is that the concentration decreases by a fixed proportion each unit time (e.g., it goes down by half every 2 hours). File toxin.csv contains data on the concentration (in parts per million) of a toxin in the bloodstream of a rat, measured every hour for eight hours. Use the prompts in the code chunk below to write a script to analyze these data.

# Read the data into a data frame called `toxin_data`.
toxin_data <- read.csv("toxin.csv")
print(toxin_data)

##   concentration time
## 1         112.0    1
## 2          53.0    2
## 3          22.0    3
## 4           9.0    4
## 5           3.0    5
## 6           1.0    6
## 7           0.6    7
## 8           0.2    8

str(toxin_data)

## 'data.frame':    8 obs. of  2 variables:
##  $ concentration: num  112 53 22 9 3 1 0.6 0.2
##  $ time         : int  1 2 3 4 5 6 7 8

# Plot toxin concentration over time with appropriate axis labels.
plot(toxin_data$time,
     log(toxin_data$concentration),
     xlab = "Time (hours)",
     ylab = "Toxin concentration (ppm)",
     main = "Toxin Concentration Over Time")

# Plot the logarithm of concentration over time with appropriate axis labels.
plot(toxin_data$time, log(toxin_data$concentration),
     xlab = "Time (hours)",
     ylab = "Log toxin concentration (ppm)",
     main = "Log of Toxin Concentration Over Time")

Question 3d

How do the two plots compare?

Both the toxin concentration and the log of toxin concentration decline at a steady rate.
Toxin concentration shrinks at a lower rate as time goes on. The log of toxin concentration declines at a steady rate.
Toxin concentration grows over time. The log of toxin concentration declines over time.
Both toxin concentration and the log of toxin concentration increase at a steady rate.

Question 4

How do the weights of male and female monitor lizards compare? File lizards.csv contains the weights of ten male and ten female lizards (in kilograms). Use the prompts in the code chunk below to write a script to analyze these data.

# Read the data into a data frame called `lizard_data`.
lizard_data <- read.csv("lizards.csv")
print(lizard_data)

##       sex weight
## 1    male    0.8
## 2    male    1.0
## 3    male    0.7
## 4    male    0.6
## 5    male    1.2
## 6    male    0.8
## 7    male    0.8
## 8    male    1.3
## 9    male    1.3
## 10   male    0.6
## 11 female    1.0
## 12 female    1.4
## 13 female    1.5
## 14 female    1.6
## 15 female    1.3
## 16 female    1.3
## 17 female    1.1
## 18 female    1.2
## 19 female    1.3
## 20 female    1.5

# Calculate the average weight of males and the average weight of females.
# Create a vector called `averages` that contains the male average and the
# female average in that order.
# Hint: You can manually create a vector using the `c()` function:
#       `c(some_thing, other_thing)`.
male_avg <- mean(lizard_data$weight[lizard_data$sex == "male"])
female_avg <- mean(lizard_data$weight[lizard_data$sex == "female"])
averages <- c(male_avg, female_avg)
print(averages)

## [1] 0.91 1.32

# Make a bar plot showing the average weight of each sex with appropriate axis
# labels, bar labels, and main title.
barplot(averages, names.arg = c("Male", "Female"),
        xlab = "Sex",
        ylab = "Weight (g)",
        main = "Average weights of male and female lizards")

Question 4d

Does one sex seem bigger?

No, both males and females had similar average weights.
Yes, females in the sample had a higher average weight.
Yes, males in the sample had a higher average weight.
Yes, males were placed on the left and females on the right.

Question 5

Can lion age be told by the amount of black pigmentation on the nose? File lions.csv contains the age (in years) and proportion of black pigmentation on the nose for 20 male lions. Use the prompts in the code chunk below to write a script to analyze these data.

# Read the data into a data frame called `lion_data`.
lion_data <- read.csv("lions.csv")
print(lion_data)

##     age black
## 1   2.3  0.03
## 2   3.5  0.13
## 3   2.0  0.37
## 4  11.0  0.85
## 5  10.6  1.00
## 6   6.5  0.63
## 7   5.7  0.55
## 8   1.1  0.26
## 9   0.6  0.04
## 10 12.7  0.96
## 11 11.3  0.91
## 12  8.8  0.47
## 13 10.9  0.65
## 14 11.1  0.70
## 15  7.6  0.51

# Make a scatter plot of the relation between age and proportion of black
# pigmentation on the nose. Use appropriate axis labels.
plot(lion_data$age, lion_data$black_proportion,
     xlab = "Age (years)", 
     ylab = "Proportion of black pigmentation on nose",
     main = "Age vs. Proportion of Black Pigmentation")

Based on this plot, what can you say about the usefulness of nose pigmentation for estimating the age of male lions?

Question 5c

It looks promising because it can be easily measured from photographs.
It doesn’t seem useful because only old lions have high levels of black pigmentation.
It looks promising because the black proportion seems to increase steadily with increasing age.
It doesn’t seem useful because there is no relationship between age and pigmentation.

Normal probability exercises

Please watch the video lectures about the normal distribution before doing these exercises! Also, please first watch the tutorial video about working with normal distributions in R.

Question 6

In many animals, island populations diverge in size from mainland populations of the same species. You are interested in finding out if this is the case for monitor lizards, which are widespread on the Australian continent and surrounding islands. Based on earlier studies, you believe that the mean and variance for the lengths of adult mainland lizards are as follows: Mean = 100 cm, Variance = 400. You also make the reasonable assumption that length is a normal random variable. Before collecting any data on the Kangaroo Island lizards, you will use this prior knowledge to say something about their expected probability distribution. Use the prompts in the code chunk below to write a script that plots the PDF and the CDF.

We will create a plots of the probability and cumulative density functions of lizard lengths. The plots will include informative axis titles and an overall titles.

# First enter parameter values using the variables `lizard_mu` and `lizard_var`.
lizard_mu <- 100
lizard_var <- 400
lizard_sd <- sqrt(lizard_var)

# Create a vector of X values centered on lizard_mu and covering 6 standard
# deviations. Save this vector into a variable called `lizard_x`.
lizard_x <- seq(lizard_mu - 3 * lizard_sd, 
                lizard_mu + 3 * lizard_sd, length = 100)

# Using `lizard_x`, plot a line graph of the PDF for lizard lengths.
# Include informative axis titles and an overall title.
plot(lizard_x, dnorm(lizard_x, mean = lizard_mu, sd = lizard_sd),
     type = "l",
     xlab = "Length (cm)",
     ylab = "Probability density",
     main = "Probability Density Function of Lizard Lengths")

# Using `lizard_x`, plot a line graph of the CDF for lizard lengths.
# Include informative axis titles and an overall title.
plot(lizard_x, pnorm(lizard_x, mean = lizard_mu, sd = lizard_sd),
     type = "l",
     xlab = "Length (cm)",
     ylab = "Cumulative probability",
     main = "Cumulative Probability Function of Lizard Lengths")

Question 7

Write another script to answer the following questions about what you expect to find when you measure Kangaroo island lizards, assuming that they follow the same probability distribution as mainland lizards. Inspect the plots you made in 1 to check if your answers make sense.

# What is the probability density for a length of 75 cm?
dnorm(75, mean = lizard_mu, sd = lizard_sd)

## [1] 0.009132454

# What is the probability that a lizard will be...
# Less than or equal to 75 cm?
pnorm(75, mean = lizard_mu, sd = lizard_sd)

## [1] 0.1056498

# What is the probability that a lizard will be...
# Greater than 120 cm?
1 - pnorm(120, mean = lizard_mu, sd = lizard_sd)

## [1] 0.1586553

# What is the probability that a lizard will be...
# Between 95 and 115cm?
pnorm(115, mean = lizard_mu, sd = lizard_sd) - pnorm(95, mean = lizard_mu, sd = lizard_sd)

## [1] 0.372079

# What is the probability that a lizard will be...
# At least 40 cm different from the mean, in either direction?
pnorm(60, mean = lizard_mu, sd = lizard_sd) + (1 - pnorm(140, mean = lizard_mu, sd = lizard_sd))

## [1] 0.04550026

# What is the probability that a lizard will be...
# Further than 0.7 standard deviations below the mean?
cutoff <- lizard_mu - 0.7 * lizard_sd
pnorm(cutoff, mean = lizard_mu, sd = lizard_sd)

## [1] 0.2419637

# What is the probability that a lizard will be...
# Closer than 1.3 standard deviations to the mean?
lower <- lizard_mu - 1.3 * lizard_sd
upper <- lizard_mu + 1.3 * lizard_sd
pnorm(upper, lizard_mu, lizard_sd) - pnorm(lower, lizard_mu, lizard_sd)

## [1] 0.806399

# What is the probability that a lizard will be...
# Further than 1.5 standard deviations from the mean?
lower <- lizard_mu - 1.5 * lizard_sd
upper <- lizard_mu + 1.5 * lizard_sd
pnorm(lower, lizard_mu, lizard_sd) + (1 - pnorm(upper, lizard_mu, lizard_sd))

## [1] 0.1336144

# What are the quartiles of the distribution? That is, what are the 0.25, 0.5,
# and 0.75 quantiles? Return the values as a vector of quartiles.
# Hint: You can manually create a vector using the `c()` function:
#       `c(some_thing, other_thing)`.
quartiles <- qnorm(c(0.25, 0.5, 0.75), mean = lizard_mu, sd = lizard_sd)
print(quartiles)

## [1]  86.5102 100.0000 113.4898

# 2/3 of observations are expected to lie below what value?
value_2thirds <- round(qnorm(2/3, mean = lizard_mu, sd = lizard_sd))
print(value_2thirds)

## [1] 109

# 80% of observations are expected to lie above what value?
qnorm(0.2, mean = lizard_mu, sd = lizard_sd)

## [1] 83.16758

Question 8

Britain’s domestic intelligence service MI5 places an upper limit on the height of its spies, on the assumption that people who are too tall do not blend in well with the crowd. To be a spy, men must be no taller than 180 cm (~5 feet 11 inches) and women no taller than 173 cm (~5 feet 8 inches). Write a script to answer the questions in the code chunk below.

# If the mean height of British men is 177 cm, with a standard deviation of
# 7.1 cm, what proportion of British men are excluded from being spies by this
# height restriction? Assume that height follows a normal distribution.
1 - pnorm(180, mean = 177, sd = 7.1)

## [1] 0.3363172

# The mean height of British women is 163.3 cm, with a standard deviation of
# 6.4 cm. Assuming a normal distribution of female height, what fraction of
# women meet MI5's height standard?
pnorm(173, mean = 163.3, sd = 6.4)

## [1] 0.9351929

# Imagine that MI5 wants to change its maximum height for female spies. Its
# goal is to exclude the same proportion of women as men. What should the new
# maximum height for women be?
male_cutoff <- 1 - pnorm(180, mean = 177, sd = 7.1)
qnorm(1 - male_cutoff, mean = 163.3, sd = 6.4)

## [1] 166.0042

# Sean Connery, the original James Bond, was 183 cm tall. By how many standard
# deviations did he exceed the height limit for spies?
(183 - 180) / 7.1

## [1] 0.4225352