Here we will explore some data stored in R called
ChickWeight. It contains the weight of chicks in grams as
they grow from day 0 to day 21.
Don’t worry if you don’t understand all the code yet - just read it carefully and see what it does.
Notice that the YAML above sets code_folding: hide. This
means that when you press Knit, all the code is run and the
.html file contains the output, with the code blocks folded away at the
side.
Read about the background to the data.
?ChickWeight
eval=F in the code blocks means that the code
is not run when knitted.
Have a look at the first 6 rows of the data.
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Have a look at the last 6 rows of the data.
tail(ChickWeight)
## weight Time Chick Diet
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
How many observations are in the data? How many variables are in the data?
dim(ChickWeight)
## [1] 578 4
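The two numbers returned by dim() are rows first, then columns. As a small sketch, nrow() and ncol() return each one separately; the names n_obs and n_vars are just illustrative choices.

```r
# ChickWeight ships with R (datasets package), so nothing needs to be loaded.
n_obs  <- nrow(ChickWeight)   # number of rows = observations
n_vars <- ncol(ChickWeight)   # number of columns = variables

n_obs    # 578, the first entry of dim(ChickWeight)
n_vars   # 4, the second entry of dim(ChickWeight)
```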
What are the names of the variables?
names(ChickWeight)
## [1] "weight" "Time" "Chick" "Diet"
Note: The names of variables are case sensitive. Best practice is for all the variables to use the same convention, i.e. here they should all start with a capital letter. However, you have to use the data you are given, which may be messy!
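A quick way to see the case sensitivity for yourself is to check a name against names(ChickWeight). This is only a sketch; has_Time and has_time are illustrative names.

```r
# Check whether a given spelling matches a variable name exactly.
has_Time <- "Time" %in% names(ChickWeight)   # capital T, as stored in the data
has_time <- "time" %in% names(ChickWeight)   # lower-case t is a different name

has_Time   # TRUE
has_time   # FALSE
```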
First, isolate the diet variable by using
ChickWeight$Diet, and store it in diet.
diet = ChickWeight$Diet
# diet <- ChickWeight$Diet ## alternate option!
Note: RStudio has code completion, so it will auto-predict your
commands. When you type ChickWeight$, the names of all
the variables will come up. Easy!
What was the most common type of diet fed to the chicks?
table(diet)
## diet
## 1 2 3 4
## 220 120 120 118
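Rather than reading the largest count off the table by eye, you can ask R for it. This is a minimal sketch; counts and most_common are illustrative names. Keep in mind that table() counts rows (measurements), not individual chicks.

```r
diet   <- ChickWeight$Diet
counts <- table(diet)                     # number of rows for each diet

most_common <- names(which.max(counts))   # label of the diet with the largest count
most_common                               # "1"
max(counts)                               # 220
```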
Second, isolate the weight variable, and store it in
weight.
weight = ChickWeight$weight
What are the minimum and maximum weights of the chicks? Use
min() and max().
min(weight)
## [1] 35
max(weight)
## [1] 373
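As an optional shortcut, range() returns the minimum and maximum together, and summary() adds the quartiles and the mean. The name weight_range is an illustrative choice.

```r
weight <- ChickWeight$weight

weight_range <- range(weight)   # c(minimum, maximum)
weight_range                    # 35 373

summary(weight)                 # min, quartiles, mean and max in one call
```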
In later weeks, we will learn about the plot command. But for
now, run this code and see if you can spot any patterns.
plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)
Here we have a go at analysing an external data set, the Smoking data from Week 1 lectures. For each part, check the code, run the code, and then write your answer.
First, make sure your Lab1Worksheet.Rmd file is in your
MATH1005 folder. Then download the data from Canvas, store
it in a folder called data inside your
MATH1005 folder, and run the following code. You also
need to remove the eval = F before you knit, otherwise the
chunk won’t run!
smoking = read.csv("data/simpsons_smoking.csv", header=TRUE)
Alternatively, you may store the Rmd file in the
same folder as the csv file, and then
remove the data/ part of the code.
Pro Tip: you can hover your cursor over the
Lab1Worksheet.Rmd in the top left corner to see where the
file is.
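If read.csv cannot find the file, it can help to check which folder R is working in and whether the relative path resolves from there. This sketch assumes the same file name and data folder used above; csv_path and found are illustrative names.

```r
getwd()   # the folder R is currently working in

csv_path <- "data/simpsons_smoking.csv"
found    <- file.exists(csv_path)   # TRUE only if the file is where the path says
found
```

When you knit, the working folder is the one containing the Rmd file, so a result of FALSE usually means the data folder or the file name does not match.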
The file type (here .csv) must match the
command (read.csv). So if the file name
ends in .xlsx, we need to load a package (such as readxl) and use
read_excel.
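To make the idea of matching the command to the file type concrete, here is a rough sketch that picks a reader from the file ending. The helper read_by_extension() is hypothetical and not part of the lab; it assumes the readxl package is installed if you want to read .xlsx files.

```r
# Hypothetical helper: choose the reading command from the file ending.
read_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))   # e.g. "csv" or "xlsx"
  switch(ext,
    csv  = read.csv(path, header = TRUE),
    xlsx = {
      # read_excel() comes from the readxl package, which must be installed
      if (!requireNamespace("readxl", quietly = TRUE)) {
        stop("Install readxl first: install.packages(\"readxl\")")
      }
      readxl::read_excel(path)
    },
    stop("No reader set up for files ending in .", ext)
  )
}

tools::file_ext("data/simpsons_smoking.csv")   # "csv"
```

With the lab data in place, read_by_extension("data/simpsons_smoking.csv") would do the same job as the read.csv line above.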
dim(smoking)
names(smoking)
smoking
sum(smoking$SmokersDied)/sum(smoking$SmokersDied+smoking$SmokersSurvived)
# Put your code here
# Consider Smokers 18-24
sum(smoking$SmokersDied[1])/sum(smoking$SmokersDied[1]+smoking$SmokersSurvived[1])
# Consider Non-Smokers 18-24
sum(smoking$NonSmokersDied[1])/sum(smoking$NonSmokersDied[1]+smoking$NonSmokersSurvived[1])
Note: smoking$SmokersDied[1] selects the 1st entry of
smoking$SmokersDied.
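Square brackets work the same way on any vector. As a sketch using the ChickWeight data from earlier (so it runs without the smoking file), with w as an illustrative name:

```r
w <- ChickWeight$weight

w[1]           # 1st entry: 42, matching row 1 of head(ChickWeight)
w[1:3]         # first three entries: 42 51 59
w[length(w)]   # last entry: 264, matching the last row of tail(ChickWeight)
```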
## Put your code here
To practice your understanding of Simpson’s Paradox, consider the following example.
Suppose we ask 1000 people to taste-test Pepsi, and say whether they like it. Similarly, we ask 1000 people to taste-test Coke, and say whether they like it.
The results are as follows (to 1 dp).
| Drink | Male | Female | Total |
|---|---|---|---|
| Pepsi | 760 / 900 = 84.4% | 40 / 100 = 40.0% | 800 / 1000 = 80.0% |
| Coke | 600 / 700 = 85.7% | 150 / 300 = 50.0% | 750 / 1000 = 75.0% |
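
The percentages in the table can be reproduced in R. This is a minimal sketch that types in the counts from the table by hand; the object names liked, tasted, by_group and overall are illustrative.

```r
# Counts copied from the table: rows are drinks, columns are groups.
liked <- matrix(c(760,  40,
                  600, 150),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Pepsi", "Coke"), c("Male", "Female")))
tasted <- matrix(c(900, 100,
                   700, 300),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Pepsi", "Coke"), c("Male", "Female")))

by_group <- round(100 * liked / tasted, 1)                     # % who liked it, within each group
overall  <- round(100 * rowSums(liked) / rowSums(tasted), 1)   # % who liked it, groups combined

by_group
overall
```

Compare which drink has the higher percentage within each column of by_group with which drink has the higher percentage in overall.
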
Which statement do you think is true?