Here we will explore some data stored in R called
ChickWeight. It contains the weight of chicks in grams as
they grow from day 0 to day 21.
Don’t worry if you don’t understand all the code yet - just read it carefully and see what it does.
Notice that the YAML above sets code_folding: hide. This
means that when you press Knit, all the code is run and the
.html file contains the output, with the code blocks hidden at the
side.
Read about the background to the data.
eval=F in the code block means that the code
is not run when knitted.
?ChickWeight
Have a look at the first 6 rows of the data.
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Have a look at the last 6 rows of the data.
tail(ChickWeight)
## weight Time Chick Diet
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
To grasp the scale of the dataset, use
dim(ChickWeight), which returns the number of rows
(observations) followed by the number of columns (variables).
How many observations are in the data? How many variables are in the data?
dim(ChickWeight)
## [1] 578 4
There are 578 observations and 4 variables.
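The two numbers returned by dim() can also be obtained one at a time. A minimal sketch, assuming only the built-in ChickWeight data and the base R functions nrow() and ncol() (the names n_obs and n_vars are chosen here for illustration):

```r
# Number of rows (observations) and columns (variables), one at a time
n_obs = nrow(ChickWeight)
n_vars = ncol(ChickWeight)

n_obs
n_vars
```

This can be handy later when you need just one of the two values, for example to divide by the number of observations.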
What are the names of the variables?
names(ChickWeight)
## [1] "weight" "Time" "Chick" "Diet"
Note: The names of variables are case sensitive. Best practice is for all the variables to follow the same convention, i.e. here they should all start with a capital letter. However, you have to use the data you are given, which may be messy!
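To see what case sensitivity means in practice, here is a small sketch using the built-in ChickWeight data. It compares the correctly capitalised Time with a lower-case time, which is not a column in the data.

```r
# The column is called "Time" (capital T), so this works:
head(ChickWeight$Time)

# A lower-case "time" does not match any column, so R quietly
# returns NULL instead of raising an error:
is.null(ChickWeight$time)
```

Because R does not raise an error here, a mistyped variable name can go unnoticed, so it is worth checking names() first.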
First, isolate the diet variable by using
ChickWeight$Diet, and store it in
diet.
diet = ChickWeight$Diet
What was the most common type of diet fed to the chicks?
Use the table(diet) command. This function counts
the frequency of each diet type, showing which diet appears most often.
Note that each row of the data is one measurement, so these are counts of
measurements rather than counts of chicks.
table(diet)
## diet
## 1 2 3 4
## 220 120 120 118
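Rather than reading the most common diet off the table by eye, it can be extracted in code. The following is a sketch using base R only; the names diet_counts, most_common and chicks_per_diet are illustrative and not part of the worksheet.

```r
# Frequency of each diet across all rows (measurements)
diet_counts = table(ChickWeight$Diet)

# Label of the diet with the largest count
most_common = names(diet_counts)[which.max(diet_counts)]
most_common

# Each chick is measured several times, so also count distinct chicks per diet
chicks_per_diet = tapply(ChickWeight$Chick, ChickWeight$Diet,
                         function(x) length(unique(x)))
chicks_per_diet
```

The second count matters if the question is about how many chicks, rather than how many measurements, were on each diet.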
Second, isolate the weight variable, and store it in
weight.
weight = ChickWeight$weight
What is the minimum and maximum weight of the chicks? Use
min() and max().
min(weight)
## [1] 35
max(weight)
## [1] 373
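To see which rows these extreme weights belong to, use which.min() and which.max(), which return the position of the smallest and largest value.
ChickWeight[which.min(weight), ]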
## weight Time Chick Diet
## 196 35 2 18 1
ChickWeight[which.max(weight), ]
## weight Time Chick Diet
## 400 373 21 35 3
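As a shortcut, the base R function range() returns the minimum and maximum together. A minimal sketch (weight_range is a name chosen for this example):

```r
weight = ChickWeight$weight

# range() returns c(minimum, maximum) in a single call
weight_range = range(weight)
weight_range
```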
The Plot Command
In later weeks, we will learn about the plot
command. But for now, run this code and see if you can spot any
patterns.
plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)
The command above generates a scatterplot showing how chick weights vary
over time, with data points colored by diet type.
ChickWeight$Time specifies the
x-axis, representing time in days.
ChickWeight$weight sets the y-axis,
showing the weights of the chicks.
col=ChickWeight$Diet colors the
data points according to the diet type, making it easier to observe if
different diets influence weight gain patterns over time.
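The colors are easier to interpret with axis labels and a legend. The following is one possible sketch, not part of the original worksheet. It assumes that Diet is a factor with levels 1 to 4, so that the colors used by plot() are the default palette colors 1 to 4.

```r
# Same scatterplot as before, with axis labels added
plot(ChickWeight$Time, ChickWeight$weight, col = ChickWeight$Diet,
     xlab = "Time (days)", ylab = "Weight (grams)")

# Diet is a factor, so its integer codes 1, 2, 3, 4 are used as colors
diet_levels = levels(ChickWeight$Diet)

# Add a legend that maps each color back to its diet
legend("topleft", legend = diet_levels, col = seq_along(diet_levels),
       pch = 1, title = "Diet")
```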
Here we have a go at analysing an external data set, the Smoking data from Week 1 lectures. For each part, check the code, run the code, and then write your answer.
Make sure your Lab1Worksheet.Rmd file is in your
MATH1062 folder. Then download the data from Canvas, store
it in a folder called data inside your
MATH1062 folder, and run the following code. You also
need to remove the eval = F before you knit, otherwise the
chunk won’t run!
smoking = read.csv("data/simpsons_smoking.csv", header=T)
The R code to import a CSV file uses the
read.csv() function. Here’s a breakdown of
the provided code:
smoking is the variable where the
imported dataset will be stored in R.
read.csv() is the function used to
read a CSV file.
"data/simpsons_smoking.csv"
specifies the path to the CSV file. This path assumes the CSV file is
located in a subfolder named data within
your current working directory.
header=T indicates that the first
row of the CSV file contains the column names (headers).
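If you want to see read.csv() at work without downloading anything, the following sketch writes a tiny CSV to a temporary file and reads it back. The temporary file and the name demo are made up for this example. The two data rows are copied from the first two age groups of the smoking data, with only three of its columns.

```r
# Write a small CSV file to a temporary location (illustration only)
tmp = tempfile(fileext = ".csv")
writeLines(c("Age,SmokersDied,SmokersSurvived",
             "18-24,2,53",
             "25-34,3,121"), tmp)

# header = TRUE (which can be abbreviated to T) treats the first line as column names
demo = read.csv(tmp, header = TRUE)

dim(demo)
names(demo)
```

With header = TRUE the first line becomes the column names, so demo has 2 rows and 3 columns.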
Alternatively, you may store the Rmd file in the
same folder as the csv file, and then
remove the data/ part of the code.
Pro Tip: you can hover your cursor over the
Lab1Worksheet.Rmd in the top left corner to see where the
file is.
The type of data (here .csv) must match the
command (read.csv). So if the file name
ends in .xlsx, we need to load a package (such as readxl) and use
read_excel.
What is the size of the data file? What do the rows and columns represent? Is this the full data from the UK study, or a summary?
The dim() function gives the
dimensions of the dataset (the number of rows and columns),
which tells you the size of the data. The
names() function lists all column names,
giving insight into what each column represents.
dim(smoking)
## [1] 7 5
names(smoking)
## [1] "Age" "SmokersDied" "SmokersSurvived"
## [4] "NonSmokersDied" "NonSmokersSurvived"
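dim(smoking) returns 7 5,
so the dataset consists of 7 rows and 5 columns.
Each row is one age group, so this is a summary table of counts rather than the full individual-level data from the study.
The columns represent different variables associated with the study, specifically: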
Age: The age group
category.
SmokersDied: The number of smokers
who died in each age group.
SmokersSurvived: The number of
smokers who survived in each age group.
NonSmokersDied: The number of
non-smokers who died in each age group.
NonSmokersSurvived: The number of
non-smokers who survived in each age group.
head(smoking)
## Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 1 18-24 2 53 1 61
## 2 25-34 3 121 5 152
## 3 35-44 14 95 7 114
## 4 45-54 27 103 12 66
## 5 55-64 51 64 40 81
## 6 65-74 29 7 101 28
tail(smoking)
## Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 2 25-34 3 121 5 152
## 3 35-44 14 95 7 114
## 4 45-54 27 103 12 66
## 5 55-64 51 64 40 81
## 6 65-74 29 7 101 28
## 7 75+ 13 0 64 0
Can you see any patterns?
smoking
Mortality Rate Comparison: By comparing
SmokersDied and
NonSmokersDied within the same age groups,
it’s possible to assess if smokers have a higher mortality rate than
non-smokers, which is a key research question. An expected pattern is
that smokers may exhibit higher mortality rates than non-smokers across
most, if not all, age groups.
Survival Rate Analysis: Similar to mortality
rates, survival rates (SmokersSurvived and
NonSmokersSurvived) can be analyzed. The
pattern here may show a higher survival rate among non-smokers compared
to smokers, reinforcing the potential health risks associated with
smoking.
Simpson’s Paradox: The table is broken down by age group. When the counts are aggregated across age groups, Simpson’s Paradox might occur: a trend that appears in the aggregated data can reverse when the data is broken down into groups (such as age groups).
smokers_died = sum(smoking$SmokersDied)
smokers_survived = sum(smoking$SmokersSurvived)
mortality_rate_smokers = smokers_died/(smokers_died+smokers_survived)
print(mortality_rate_smokers)
## [1] 0.2388316
mortality_rate_smokers = sum(smoking$SmokersDied) / sum(smoking$SmokersDied + smoking$SmokersSurvived)
print(mortality_rate_smokers)
## [1] 0.2388316
nonsmokers_died = sum(smoking$NonSmokersDied)
nonsmokers_survived = sum(smoking$NonSmokersSurvived)
mortality_rate_nonsmokers = nonsmokers_died/(nonsmokers_died + nonsmokers_survived)
print(mortality_rate_nonsmokers)
## [1] 0.3142077
mortality_rate_nonsmokers = sum(smoking$NonSmokersDied) / sum(smoking$NonSmokersDied + smoking$NonSmokersSurvived)
print(mortality_rate_nonsmokers)
## [1] 0.3142077
The calculated mortality rate for smokers is approximately 23.88%, while the rate for non-smokers is higher, at approximately 31.42%. This indicates that, within this dataset, non-smokers have a higher mortality rate than smokers, which is an unexpected outcome and contrary to common assumptions regarding the health impacts of smoking.
Examining the 18-24 Age Group
smoking$SmokersDied[1] selects the 1st entry of
smoking$SmokersDied.
smoking[1, ]
## Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 1 18-24 2 53 1 61
smokers_mortality_rate_18_24 = smoking$SmokersDied[1]/(smoking$SmokersDied[1]+smoking$SmokersSurvived[1])
print(smokers_mortality_rate_18_24)
## [1] 0.03636364
nonsmokers_mortality_rate_18_24 = smoking$NonSmokersDied[1]/(smoking$NonSmokersDied[1]+smoking$NonSmokersSurvived[1])
print(nonsmokers_mortality_rate_18_24)
## [1] 0.01612903
In the 18-24 age group, the mortality rate among smokers (about 3.6%) is more than double that among non-smokers (about 1.6%). These rates rest on only 2 and 1 deaths respectively, so they should be read with caution, but they indicate that, at least within this age group, smoking could be associated with a higher risk of mortality.
which(smoking$Age == "65-74")
## [1] 6
smoking[6, ]
## Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 6 65-74 29 7 101 28
# Row 6 is the 65-74 age group, as found by which() above
smoking_mortality_rate_65_74 = smoking$SmokersDied[6] / (smoking$SmokersDied[6] + smoking$SmokersSurvived[6])
print(smoking_mortality_rate_65_74)
## [1] 0.8055556
nonsmoking_mortality_rate_65_74 = smoking$NonSmokersDied[6] / (smoking$NonSmokersDied[6] + smoking$NonSmokersSurvived[6] )
print(nonsmoking_mortality_rate_65_74)
## [1] 0.7829457
Both smokers and non-smokers in the 65-74 age group show high mortality rates, with smokers having a marginally higher rate than non-smokers. This suggests that in this age group, smoking may be associated with a slight increase in mortality rate, although the difference is not very large.
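The two age groups above were worked out one entry at a time. Because R arithmetic is vectorised, the same calculation can be done for every age group at once. The sketch below rebuilds the smoking table by hand from the output printed earlier, so that it runs without the CSV file; the names smokers_rate, nonsmokers_rate and rates are chosen here for illustration.

```r
# Counts transcribed from the head(smoking) and tail(smoking) output
smoking = data.frame(
  Age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
  SmokersDied = c(2, 3, 14, 27, 51, 29, 13),
  SmokersSurvived = c(53, 121, 95, 103, 64, 7, 0),
  NonSmokersDied = c(1, 5, 7, 12, 40, 101, 64),
  NonSmokersSurvived = c(61, 152, 114, 66, 81, 28, 0)
)

# Mortality rate in every age group at once (element-wise division)
smokers_rate = smoking$SmokersDied /
  (smoking$SmokersDied + smoking$SmokersSurvived)
nonsmokers_rate = smoking$NonSmokersDied /
  (smoking$NonSmokersDied + smoking$NonSmokersSurvived)

# Side-by-side comparison, rounded to 3 decimal places
rates = data.frame(Age = smoking$Age,
                   Smokers = round(smokers_rate, 3),
                   NonSmokers = round(nonsmokers_rate, 3))
rates
```

If the counts are transcribed correctly, the smokers' rate is the higher one in five of the seven age groups. The exceptions are 25-34, where the non-smokers' rate is higher, and 75+, where nobody in either group survived.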
To practice your understanding of Simpson’s Paradox, consider the following example.
Suppose we ask 1000 people to taste-test Pepsi, and say whether they like it. Similarly, we ask 1000 people to taste-test Coke, and say whether they like it.
The results are as follows (to 1 dp).
| Drink | Male | Female | Total |
|---|---|---|---|
| Pepsi | 760 / 900 = 84.4% | 40 / 100 = 40% | 800/1000 = 80% |
| Coke | 600 / 700 = 85.7% | 150/300 = 50% | 750/1000 = 75% |
Which statement do you think is true?
This example is a classic illustration of Simpson’s Paradox, where the aggregated data tells a different story from the disaggregated data. Here’s a breakdown of the situation:
Disaggregated (by gender):
Among males, the preference for Pepsi is 84.4%, while for Coke, it is 85.7%.
Among females, the preference for Pepsi is 40%, and for Coke, it is 50%.
In both gender groups, Coke is preferred over Pepsi.
Aggregated:
The overall preference for Pepsi is 80% (800 out of 1000).
The overall preference for Coke is 75% (750 out of 1000).
When the data is combined, it suggests that Pepsi is preferred over Coke.
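The percentages in the table can be reproduced in a few lines of R. This is a sketch only: the matrices liked and tasted are built by hand from the counts in the table, and their names are chosen for this example.

```r
# Number of people who liked each drink, by gender (from the table)
liked = matrix(c(760, 40,
                 600, 150),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Pepsi", "Coke"), c("Male", "Female")))

# Number of people who tasted each drink, by gender (from the table)
tasted = matrix(c(900, 100,
                  700, 300),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Pepsi", "Coke"), c("Male", "Female")))

# Disaggregated: percentage who liked the drink within each gender
by_gender = round(100 * liked / tasted, 1)
by_gender

# Aggregated: percentage who liked the drink, ignoring gender
overall = round(100 * rowSums(liked) / rowSums(tasted), 1)
overall
```

Comparing by_gender with overall shows the reversal: Coke has the higher percentage in both columns of by_gender, while Pepsi has the higher percentage in overall.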
Conclusion regarding the statements:
The aggregated data might lead you to believe that “The data provides evidence that more people like Pepsi than Coke.” However, when looking at the preferences broken down by gender, it is evident that, in each subgroup, Coke is preferred over Pepsi. This demonstrates Simpson’s Paradox, where the trend observed in aggregated data is the opposite of the trend observed in disaggregated data. The reversal arises because the two samples have different gender mixes: 900 of the 1000 Pepsi tasters were male, compared with 700 of the 1000 Coke tasters, and males liked both drinks far more often than females did.
Therefore, the statement that might seem true based on the aggregated data doesn’t hold when you consider the detailed breakdown by gender. This scenario emphasizes the importance of examining data at a granular level, especially when making decisions or drawing conclusions from the data.
Given this paradox, the statement best supported once gender is taken into account is “The data provides evidence that more people like Coke than Pepsi.” Coke is preferred within both the male and the female group, even though the aggregated data suggests Pepsi is more popular overall. This underscores Simpson’s Paradox and the need for careful data analysis.
Comment on Simpson’s Paradox for Smoking
Overall Mortality Rates: The overall analysis suggested that non-smokers had a higher mortality rate than smokers, which is counterintuitive given the well-documented health risks associated with smoking. This could mislead one to conclude that smoking has a protective effect, which contradicts existing medical knowledge.
Mortality Rates by Age Groups:
In the 18-24 age group, the mortality rate for smokers was more than double that for non-smokers (about 3.6% versus 1.6%), although both rates are based on very few deaths. This aligns more closely with the expected impact of smoking on health.
In the 65-74 age group, both groups showed high mortality rates, with smokers slightly higher, suggesting other factors may significantly influence mortality in older age, potentially diluting the observable effect of smoking alone.
Simpson’s Paradox Explanation:
These outcomes illustrate Simpson’s Paradox because, when analyzing the data as a whole, one might conclude that smoking does not affect mortality, or even infer that it reduces mortality risk. However, once age group is taken into account, smokers had the higher mortality rate in both of the age groups examined above.
The initial overall conclusion fails to account for age, a confounding variable that is related to both smoking status and mortality. Older individuals have much higher mortality rates regardless of smoking status, and in this sample the non-smokers are older on average: 193 of the 732 non-smokers are aged 65 or over, compared with only 49 of the 582 smokers. The overall non-smoker mortality rate is therefore pulled up by age, which masks the higher mortality of smokers within the age groups examined.
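The role of age can be checked by comparing the age mix of the two groups. The sketch below rebuilds the smoking table by hand from the output printed earlier and computes the share of each group aged 65 or over. The names smokers_n, nonsmokers_n, older and the two share variables are chosen for this example.

```r
# Counts transcribed from the head(smoking) and tail(smoking) output
smoking = data.frame(
  Age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
  SmokersDied = c(2, 3, 14, 27, 51, 29, 13),
  SmokersSurvived = c(53, 121, 95, 103, 64, 7, 0),
  NonSmokersDied = c(1, 5, 7, 12, 40, 101, 64),
  NonSmokersSurvived = c(61, 152, 114, 66, 81, 28, 0)
)

# Group sizes in each age group
smokers_n = smoking$SmokersDied + smoking$SmokersSurvived
nonsmokers_n = smoking$NonSmokersDied + smoking$NonSmokersSurvived

# TRUE for the two oldest age groups
older = smoking$Age %in% c("65-74", "75+")

# Share of each group that is aged 65 or over
share_older_smokers = sum(smokers_n[older]) / sum(smokers_n)
share_older_nonsmokers = sum(nonsmokers_n[older]) / sum(nonsmokers_n)

round(c(Smokers = share_older_smokers, NonSmokers = share_older_nonsmokers), 3)
```

A much larger share of the non-smokers falls in the two oldest age groups, where mortality is highest for everyone, which is what drives up their overall mortality rate.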
This example underscores the critical need for careful data analysis, emphasizing the importance of considering all relevant variables and stratifying data accordingly to avoid misleading conclusions that could result from phenomena like Simpson’s Paradox.