1 Chick Weight

Here we will explore some data stored in R called ChickWeight. It contains the weight of chicks in grams as they grow from day 0 to day 21.

Don’t worry if you don’t understand all the code yet - just read it carefully and see what it does.

Notice that the YAML header above sets code_folding: hide. This means that when you press Knit, all the code is run and the .html file contains the output, with the code blocks hidden at the side until you click to show them.

1.1 Background to Data

Read about the background to the data.

?ChickWeight
  • Putting eval=F in the code blocks means that the code is not run when knitted.

1.2 Structure of the Data

Have a look at the first 6 rows of the data.

head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Have a look at the last 6 rows of the data.

tail(ChickWeight)
##     weight Time Chick Diet
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4

How many observations are in the data? How many variables are in the data?

dim(ChickWeight)
## [1] 578   4
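A minimal sketch of some other ways to get the same information, assuming the built-in ChickWeight data is available (it ships with base R). nrow() and ncol() return the two parts of dim() separately, and str() gives a compact overview of every variable. The names n_obs and n_vars are only for illustration.

```r
# Number of observations (rows) and variables (columns), stored separately
n_obs  <- nrow(ChickWeight)
n_vars <- ncol(ChickWeight)
n_obs
n_vars

# Compact overview: each variable's name, type and first few values
str(ChickWeight)
```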

What are the names of the variables?

names(ChickWeight)
## [1] "weight" "Time"   "Chick"  "Diet"

Note: The names of variables are case sensitive. Best practice is for all the variables to use the same convention, i.e. here they should all start with capitals. However, you have to use the data you are given, which may be messy!
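To see case sensitivity in action, here is a small sketch using the built-in ChickWeight data. It checks two spellings against the variable names, then shows what happens when you ask for a column that does not exist.

```r
# Check which spellings match a variable name exactly
"weight" %in% names(ChickWeight)   # TRUE: matches the variable name
"Weight" %in% names(ChickWeight)   # FALSE: capital W does not match

# Asking for a column that does not exist gives NULL rather than an error,
# so a typo in the case of a name can go unnoticed
is.null(ChickWeight$Weight)
```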


1.3 Explore the data

First, isolate the diet variable by using ChickWeight$Diet, and store it in diet.

diet = ChickWeight$Diet
# diet <- ChickWeight$Diet  ## alternate option!

Note: RStudio has code completion, so it will auto-predict your commands. When you type ChickWeight$, the names of all the variables will come up. Easy!

What was the most common type of diet fed to the chicks?

table(diet)
## diet
##   1   2   3   4 
## 220 120 120 118
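Rather than reading the largest count off the table by eye, you could ask R for it. The sketch below is one possible way, not the only one. Note that table() counts rows of the data, and each chick is measured on several days, so the last line also counts the distinct chicks on each diet. The names diet_counts and most_common are only for illustration.

```r
diet <- ChickWeight$Diet
diet_counts <- table(diet)

# Label of the diet with the largest count
most_common <- names(which.max(diet_counts))
most_common

# The same counts as proportions of all rows
round(prop.table(diet_counts), 3)

# Number of distinct chicks on each diet (each chick has several rows)
tapply(ChickWeight$Chick, ChickWeight$Diet, function(x) length(unique(x)))
```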

Second, isolate the weight variable, and store it in weight.

weight = ChickWeight$weight

What are the minimum and maximum weights of the chicks? Use min() and max().

min(weight)
## [1] 35
max(weight)
## [1] 373
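As a possible shortcut, range() returns the minimum and maximum together, and summary() adds the quartiles and mean. The sketch below also looks up the rows where the extremes occur, assuming you want to know which chick and day they belong to. The name weight_range is only for illustration.

```r
weight <- ChickWeight$weight

# Minimum and maximum in one call: c(min, max)
weight_range <- range(weight)
weight_range

# Five-number summary plus the mean
summary(weight)

# Rows holding the lightest and heaviest measurements
ChickWeight[c(which.min(weight), which.max(weight)), ]
```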

In later weeks, we learn about the plot command. But for now, run this code and see if you can see any patterns.

plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)
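The colours in the plot are easier to interpret with axis labels and a legend. This is a sketch of one way to add them using base R graphics. The label text, title and legend position are choices made here, not part of the worksheet.

```r
# Same scatter plot, with axis labels and a title
diet_codes <- as.integer(ChickWeight$Diet)   # 1 to 4, used as colour numbers
plot(ChickWeight$Time, ChickWeight$weight,
     col  = diet_codes,
     xlab = "Time (days)",
     ylab = "Weight (grams)",
     main = "Chick weight over time, coloured by diet")

# Legend matching each colour number to its diet
legend("topleft",
       legend = levels(ChickWeight$Diet),
       col    = seq_along(levels(ChickWeight$Diet)),
       pch    = 1,
       title  = "Diet")
```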

2 Smoking Study (UK)

Here we have a go at analysing an external data set, the Smoking data from Week 1 lectures. For each part, check the code, run the code, and then write your answer.

2.1 Import the data

  • First make sure your Lab1Worksheet.Rmd file is in your MATH1005 folder. Then download the data from Canvas, store it in a folder called data, inside your MATH1005 folder, and then run the following code. You also need to remove the eval = F before you knit, otherwise the chunk won’t run!
smoking = read.csv("data/simpsons_smoking.csv", header=T)
  • Alternatively, you may store the Rmd file in the same folder as the csv file, and then remove the data/ part of the code.

  • Pro Tip: you can hover your cursor over the Lab1Worksheet.Rmd in the top left corner to see where the file is.

  • The type of data (here .csv) must match up with the command (read.csv). So if the data file has the ending .xlsx, we need to load a package (such as readxl) and use read_excel.
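If read.csv fails with a "cannot open file" error, the path is usually the problem. The sketch below checks whether the file exists before reading it, and otherwise prints the folder R is currently working in. It assumes the folder layout described above. The name csv_path is only for illustration.

```r
# Path to the data, relative to the folder containing the Rmd file
csv_path <- "data/simpsons_smoking.csv"

if (file.exists(csv_path)) {
  smoking <- read.csv(csv_path, header = TRUE)
  str(smoking)   # quick check that the import worked
} else {
  # Show where R is looking, to help find the mismatch
  message("File not found. R is currently working in: ", getwd())
}
```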

2.2 Examine the data

  • What is the size of the data file? What do the rows and columns represent? Is this the full data from the UK study, or a summary?
dim(smoking)
names(smoking)
  • Can you see any patterns?
smoking

2.3 Research Question: Is the mortality rate higher for smokers or non-smokers?

2.3.1 First, consider the overall mortality rates

  • Calculate the mortality rate for smokers.
sum(smoking$SmokersDied)/sum(smoking$SmokersDied+smoking$SmokersSurvived)
  • Calculate the mortality rate for non-smokers.
# Put your code here

2.3.2 Second, examine the mortality rate by age groups

  • Was the mortality rate higher for smokers or non-smokers in the 18-24 age group?
# Consider Smokers 18-24
sum(smoking$SmokersDied[1])/sum(smoking$SmokersDied[1]+smoking$SmokersSurvived[1])
# Consider Non-Smokers 18-24
sum(smoking$NonSmokersDied[1])/sum(smoking$NonSmokersDied[1]+smoking$NonSmokersSurvived[1])

Note: smoking$SmokersDied[1] selects the 1st entry of smoking$SmokersDied.
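Here is a short sketch of how square-bracket indexing works, using small made-up vectors (died and survived are hypothetical, not the smoking data). The last line shows that R can compute the rate for every group at once, since arithmetic on vectors works entry by entry.

```r
# Made-up counts for three groups, for illustration only
died     <- c(2, 5, 9)
survived <- c(48, 45, 41)

died[1]      # 1st entry
died[2:3]    # 2nd and 3rd entries

# Rate for the 1st group only
died[1] / (died[1] + survived[1])

# Rates for all groups at once
rates <- died / (died + survived)
rates
```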

  • Was the mortality rate higher for smokers or non-smokers in the 65-74 age group?
## Put your code here

3 Simpson’s Paradox

To practice your understanding of Simpson’s Paradox, consider the following example.

Suppose we ask 1000 people to taste-test Pepsi, and say whether they like it. Similarly, we ask 1000 people to taste-test Coke, and say whether they like it.

The results are as follows (to 1 dp).

Drink   Male               Female            Total
Pepsi   760/900 = 84.4%    40/100 = 40%      800/1000 = 80%
Coke    600/700 = 85.7%    150/300 = 50%     750/1000 = 75%
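The percentages in the table can be checked in R. The sketch below enters the counts from the table as two matrices (the names liked and asked are only for illustration), then computes the percentage who liked each drink within each group and overall.

```r
# Counts from the table: how many liked each drink, and how many were asked
liked <- matrix(c(760,  40,
                  600, 150),
                nrow = 2, byrow = TRUE,
                dimnames = list(Drink = c("Pepsi", "Coke"),
                                Sex   = c("Male", "Female")))
asked <- matrix(c(900, 100,
                  700, 300),
                nrow = 2, byrow = TRUE,
                dimnames = dimnames(liked))

# Percentage who liked each drink, within each group (to 1 dp)
by_group <- round(100 * liked / asked, 1)
by_group

# Percentage who liked each drink, over everyone asked
overall <- round(100 * rowSums(liked) / rowSums(asked), 1)
overall
```

Notice that the groups are very different sizes for the two drinks, which is worth keeping in mind when comparing the overall figures with the within-group figures.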

Which statement do you think is true?

  • The data provides evidence that more people like Pepsi than Coke.
  • The data provides evidence that more people like Coke than Pepsi.
  • The data does not provide enough information to determine overall preference for Coke and Pepsi.