Chick Weight

Here we will explore some data stored in R called ChickWeight. It contains the weight of chicks in grams as they grow from day 0 to day 21.

Don’t worry if you don’t understand all the code yet - just read it carefully and see what it does.

Notice, in the YAML above, that code_folding is set to hide. This means that when you press knit, all the code is run and the .html file contains the output, with each code block hidden by default behind a button at the side.

Background to Data

Read about the background to the data.

Putting eval=F in the code blocks means that the code is not run when knitted.

?ChickWeight

Structure of the Data

Initial Exploration

Have a look at the first 6 rows of the data.

head(ChickWeight)

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Have a look at the last 6 rows of the data.

tail(ChickWeight)

##     weight Time Chick Diet
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4

Dimensions

To grasp the scale of the dataset, use dim(ChickWeight). It returns two numbers: the number of observations (rows) followed by the number of variables (columns).

How many observations are in the data? How many variables are in the data?

dim(ChickWeight)

## [1] 578   4

There are 578 observations and 4 variables.
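As a cross-check on dim(), the two counts can also be pulled out separately. This is a minimal sketch; the names n_obs and n_vars are chosen here for illustration and are not part of the worksheet.

```r
# ChickWeight ships with R (datasets package), so no import is needed.
n_obs  = nrow(ChickWeight)   # number of rows (observations)
n_vars = ncol(ChickWeight)   # number of columns (variables)
c(n_obs, n_vars)

# str() lists each variable with its type and first few values.
str(ChickWeight)
```

str() is a convenient way to see the structure of the data in one call, since it reports the type of each variable as well as the dimensions.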

What are the names of the variables?

names(ChickWeight)

## [1] "weight" "Time"   "Chick"  "Diet"

Note: The names of variables are case sensitive. Best practice is for all the variables to use the same convention, i.e. here, they should all start with capitals. However, you have to use the data you are given, which may be messy!

Explore the data

Isolating Variables and Finding the Most Common Diet

First, isolate the diet variable by using ChickWeight$Diet, and store it in diet.

diet = ChickWeight$Diet

What was the most common type of diet fed to the chicks?

To determine the most common type of diet, use the table(diet) command. This function counts the frequency of each diet type, providing a clear view of which diet is most prevalent among the chicks.

table(diet)

## diet
##   1   2   3   4 
## 220 120 120 118

Diet 1 was the most common type, with 220 recorded observations compared with 118 to 120 for each of the other diets.
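Rather than reading the largest count off the table by eye, R can report it directly. The sketch below is one possible approach; diet_counts and most_common are illustrative names.

```r
diet_counts = table(ChickWeight$Diet)        # frequency of each diet
most_common = names(which.max(diet_counts))  # label of the largest count
most_common

# Share of all observations recorded under each diet
round(prop.table(diet_counts), 3)
```

which.max() returns the position of the first maximum, so if two diets were tied only the first would be reported.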

Second, isolate the weight variable, and store it in weight.

weight = ChickWeight$weight

What are the minimum and maximum weights of the chicks? Use min() and max().

min(weight)

## [1] 35

max(weight)

## [1] 373

The minimum weight was 35 grams, while the maximum weight was 373 grams.

ChickWeight[which.min(weight), ]

##     weight Time Chick Diet
## 196     35    2    18    1

ChickWeight[which.max(weight), ]

##     weight Time Chick Diet
## 400    373   21    35    3

These commands return the full observations (rows) for the minimum and maximum weights, showing which chick, day, and diet each extreme belongs to: Chick 18 on Diet 1 at day 2, and Chick 35 on Diet 3 at day 21.
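The minimum and maximum can also be found in a single call with range(). The following is a small sketch; weight_range is an illustrative name.

```r
weight_range = range(ChickWeight$weight)   # c(minimum, maximum)
weight_range

# Every row whose weight equals either extreme.
# Unlike which.min() and which.max(), this keeps ties if there are any.
ChickWeight[ChickWeight$weight %in% weight_range, ]
```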

The Plot Command

In later weeks, we learn about the plot command. But for now, run this code and see if you can see any patterns.

plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)

The provided plot command plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet) generates a scatterplot displaying how chick weights vary over time, with data points colored by diet type.
- ChickWeight$Time specifies the x-axis, representing time in days.
- ChickWeight$weight sets the y-axis, showing the weights of the chicks.
- col=ChickWeight$Diet colors the data points according to the diet type, making it easier to observe if different diets influence weight gain patterns over time.
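The plot is easier to read with axis labels and a legend showing which colour belongs to which diet. This sketch assumes the default colour palette; the label text and legend position are choices made here, not part of the worksheet.

```r
diet_levels = levels(ChickWeight$Diet)   # Diet is a factor with one level per diet

plot(ChickWeight$Time, ChickWeight$weight,
     col  = ChickWeight$Diet,            # factor codes index the colour palette
     xlab = "Time (days)", ylab = "Weight (g)")

# Colours 1, 2, ... in the legend match the factor codes used above
legend("topleft", legend = diet_levels,
       col = seq_along(diet_levels), pch = 1, title = "Diet")
```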

Smoking Study (UK)

Here we have a go at analysing an external data set, the Smoking data from Week 1 lectures. For each part, check the code, run the code, and then write your answer.

Import the data

First make sure your Lab1Worksheet.Rmd file is in your MATH1062 folder. Then download the data from Canvas, store it in a folder called data, inside your MATH1062 folder, and then run the following code. You also need to remove the eval = F before you knit, otherwise the chunk won’t run!

smoking = read.csv("data/simpsons_smoking.csv", header=T)

The R code to import a CSV file uses the read.csv() function. Here’s a breakdown of the provided code:
- smoking is the variable where the imported dataset will be stored in R.
- read.csv() is the function used to read a CSV file.
- "data/simpsons_smoking.csv" specifies the path to the CSV file. This path assumes the CSV file is located in a subfolder named data within your current working directory.
- header=T indicates that the first row of the CSV file contains the column names (headers).
Alternatively, you may store the Rmd file in the same folder as the csv file, and then remove the data/ part of the code.
Pro Tip: you can hover your cursor over the Lab1Worksheet.Rmd in the top left corner to see where the file is.
The type of data (here .csv) must match the command (read.csv). So if the file name ends in .xlsx, we need to load a package (such as readxl) and use read_excel.
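If the import fails, the usual cause is that the file is not where R is looking. A hedged sketch of a check that can be run before importing is given below; csv_path is an illustrative name, and the message text is made up for this example.

```r
csv_path = "data/simpsons_smoking.csv"   # path relative to the working directory

if (file.exists(csv_path)) {
  smoking = read.csv(csv_path, header = TRUE)
} else {
  # Report where R is looking so the folder layout can be corrected
  message("Cannot find ", csv_path, " from ", getwd())
}
```

getwd() prints the current working directory, which is the folder the relative path is resolved against.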

Examine the data

What is the size of the data file? What do the rows and columns represent? Is this the full data from the UK study, or a summary?

The dim() function will tell you the dimensions of the dataset, indicating the number of rows and columns, which helps in understanding the size of the data file. The names() function lists all column names, giving insight into what each column represents.

dim(smoking)

## [1] 7 5

names(smoking)

## [1] "Age"                "SmokersDied"        "SmokersSurvived"   
## [4] "NonSmokersDied"     "NonSmokersSurvived"

The output [1] 7 5 shows that the dataset consists of 7 rows and 5 columns.
The columns represent different variables associated with the study, specifically:
- Age: The age group category.
- SmokersDied: The number of smokers who died in each age group.
- SmokersSurvived: The number of smokers who survived in each age group.
- NonSmokersDied: The number of non-smokers who died in each age group.
- NonSmokersSurvived: The number of non-smokers who survived in each age group.

head(smoking)

##     Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 1 18-24           2              53              1                 61
## 2 25-34           3             121              5                152
## 3 35-44          14              95              7                114
## 4 45-54          27             103             12                 66
## 5 55-64          51              64             40                 81
## 6 65-74          29               7            101                 28

tail(smoking)

##     Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 2 25-34           3             121              5                152
## 3 35-44          14              95              7                114
## 4 45-54          27             103             12                 66
## 5 55-64          51              64             40                 81
## 6 65-74          29               7            101                 28
## 7   75+          13               0             64                  0

Each row represents an aggregated summary of data for a specific age group. This means individual observations are not listed separately; instead, data is compiled into age categories.

Can you see any patterns?

smoking

Mortality Rate Comparison: The counts in SmokersDied and NonSmokersDied cannot be compared directly because the groups differ in size. Dividing each count by its group total (died plus survived) within the same age group gives mortality rates, which address the key research question of whether smokers die at a higher rate than non-smokers. An expected pattern is that smokers show higher mortality rates than non-smokers in most, if not all, age groups.
Survival Rate Analysis: Survival rates (from SmokersSurvived and NonSmokersSurvived) can be analysed in the same way. Because each survival rate is one minus the corresponding mortality rate, a higher survival rate among non-smokers would point to the same health risks of smoking.
Simpson’s Paradox: Each row here is one age group. If the rows are pooled into a single overall rate, Simpson’s Paradox might occur: the pooled data can show one trend while the age groups, taken separately, show the opposite trend.

Research Question: Is the mortality rate higher for smokers or non-smokers?

First, consider the overall mortality rates

Calculate the mortality rate for smokers: Out of all those that smoked, what proportion died? Smokers Died / (Smokers Died + Smokers Survived)
1. Calculate the total number of deaths among smokers across all age groups.
2. Calculate the total number of smokers (both died and survived) across all age groups.
3. Divide the first total by the second.

smokers_died = sum(smoking$SmokersDied)
smokers_survived = sum(smoking$SmokersSurvived)
mortality_rate_smokers = smokers_died/(smokers_died+smokers_survived)
print(mortality_rate_smokers)

## [1] 0.2388316

mortality_rate_smokers = sum(smoking$SmokersDied) / sum(smoking$SmokersDied + smoking$SmokersSurvived)
print(mortality_rate_smokers)

## [1] 0.2388316

Calculate the mortality rate for non-smokers: Out of all those that did not smoke, what proportion died? Non-Smokers Died / (Non-Smokers Died + Non-Smokers Survived)

nonsmokers_died = sum(smoking$NonSmokersDied)
nonsmokers_survived = sum(smoking$NonSmokersSurvived)
mortality_rate_nonsmokers = nonsmokers_died/(nonsmokers_died + nonsmokers_survived)
print(mortality_rate_nonsmokers)

## [1] 0.3142077

mortality_rate_nonsmokers = sum(smoking$NonSmokersDied) / sum(smoking$NonSmokersDied + smoking$NonSmokersSurvived)
print(mortality_rate_nonsmokers)

## [1] 0.3142077

The calculated mortality rate for smokers is approximately 23.88%, while the rate for non-smokers is higher, at approximately 31.42%. This indicates that, within this dataset, non-smokers have a higher mortality rate than smokers, which is an unexpected outcome and contrary to common assumptions regarding the health impacts of smoking.
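Both overall rates can be computed together from the column totals. The sketch below re-enters the seven rows from the printed head() and tail() output so that it runs without the CSV file; totals and overall_rates are illustrative names.

```r
# Data re-entered from the printed output above (not read from the CSV)
smoking = data.frame(
  Age                = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
  SmokersDied        = c(2, 3, 14, 27, 51, 29, 13),
  SmokersSurvived    = c(53, 121, 95, 103, 64, 7, 0),
  NonSmokersDied     = c(1, 5, 7, 12, 40, 101, 64),
  NonSmokersSurvived = c(61, 152, 114, 66, 81, 28, 0)
)

totals = colSums(smoking[, -1])   # column totals, leaving out the Age column

# Deaths divided by group size (died + survived) for each group
overall_rates = c(
  Smokers    = totals[["SmokersDied"]] /
               (totals[["SmokersDied"]] + totals[["SmokersSurvived"]]),
  NonSmokers = totals[["NonSmokersDied"]] /
               (totals[["NonSmokersDied"]] + totals[["NonSmokersSurvived"]])
)
round(overall_rates, 4)
```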

Second, examine the mortality rate by age groups

Examining the 18-24 Age Group

Did more smokers or non-smokers die in the 18-24 age group?
- Note: smoking$SmokersDied[1] selects the 1st entry of smoking$SmokersDied.

smoking[1, ]

##     Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 1 18-24           2              53              1                 61

smokers_mortality_rate_18_24 = smoking$SmokersDied[1]/(smoking$SmokersDied[1]+smoking$SmokersSurvived[1])
print(smokers_mortality_rate_18_24)

## [1] 0.03636364

nonsmokers_mortality_rate_18_24 = smoking$NonSmokersDied[1]/(smoking$NonSmokersDied[1]+smoking$NonSmokersSurvived[1])
print(nonsmokers_mortality_rate_18_24)

## [1] 0.01612903

In the 18-24 age group, the mortality rate among smokers is higher than among non-smokers. Specifically, smokers in this age group have a mortality rate that is more than double that of their non-smoking counterparts. This observation indicates that, at least within this age group, smoking could be associated with a higher risk of mortality.

Did more smokers or non-smokers die in the 65-74 age group?

which(smoking$Age == "65-74")

## [1] 6

smoking[6, ]

##     Age SmokersDied SmokersSurvived NonSmokersDied NonSmokersSurvived
## 6 65-74          29               7            101                 28

# Row 6 is the 65-74 age group, as which() confirmed above

smoking_mortality_rate_65_74 = smoking$SmokersDied[6] / (smoking$SmokersDied[6] + smoking$SmokersSurvived[6])
print(smoking_mortality_rate_65_74)

## [1] 0.8055556

nonsmoking_mortality_rate_65_74 = smoking$NonSmokersDied[6] / (smoking$NonSmokersDied[6] + smoking$NonSmokersSurvived[6] )
print(nonsmoking_mortality_rate_65_74)

## [1] 0.7829457

Both smokers and non-smokers in the 65-74 age group show high mortality rates, with smokers having a marginally higher rate than non-smokers. This suggests that in this age group, smoking may be associated with a slight increase in mortality rate, although the difference is not very large.
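The same calculation can be done for every age group at once, because R divides column by column. This is a sketch using data re-entered from the printed output above; smoker_rate, nonsmoker_rate and rates_by_age are illustrative names.

```r
# Data re-entered from the printed output above (not read from the CSV)
smoking = data.frame(
  Age                = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
  SmokersDied        = c(2, 3, 14, 27, 51, 29, 13),
  SmokersSurvived    = c(53, 121, 95, 103, 64, 7, 0),
  NonSmokersDied     = c(1, 5, 7, 12, 40, 101, 64),
  NonSmokersSurvived = c(61, 152, 114, 66, 81, 28, 0)
)

# One mortality rate per age group: deaths / (deaths + survivors)
smoker_rate    = smoking$SmokersDied /
                 (smoking$SmokersDied + smoking$SmokersSurvived)
nonsmoker_rate = smoking$NonSmokersDied /
                 (smoking$NonSmokersDied + smoking$NonSmokersSurvived)

rates_by_age = data.frame(Age        = smoking$Age,
                          Smokers    = round(smoker_rate, 3),
                          NonSmokers = round(nonsmoker_rate, 3))
rates_by_age
```

Reading across each row of rates_by_age shows which group has the higher rate in that age band, without repeating the calculation row by row.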

Comment on Simpson’s Paradox for Smoking

Overall Mortality Rates: The overall analysis suggested that non-smokers had a higher mortality rate than smokers, which is counterintuitive given the well-documented health risks associated with smoking. This could mislead one to conclude that smoking has a protective effect, which contradicts existing medical knowledge.

Mortality Rates by Age Groups:

In the 18-24 age group, the mortality rate for smokers (3.6%) was more than double that for non-smokers (1.6%), although each figure rests on only one or two deaths. This aligns more closely with the expected impact of smoking on health.
In the 65-74 age group, both groups showed high mortality rates (80.6% for smokers and 78.3% for non-smokers), with smokers slightly higher. This suggests that other factors, such as age itself, strongly influence mortality in older age and may dilute the observable effect of smoking alone.

Simpson’s Paradox Explanation:

These outcomes illustrate Simpson’s Paradox. Analysing the data as a whole, one might conclude that smoking does not affect mortality, or even that it reduces mortality risk. Comparing smokers and non-smokers within the same age group gives the opposite picture: in both age groups examined here (18-24 and 65-74), smokers had the higher mortality rate.
The overall comparison fails to account for age, a confounding variable. Older individuals have higher mortality rates regardless of smoking status, and in this sample the non-smokers are older: about 26% of non-smokers (193 of 732) are aged 65 or over, compared with about 8% of smokers (49 of 582). The pooled non-smoker rate is therefore pulled up by age rather than by smoking status, which masks the higher within-group rates for smokers.
This example shows why relevant variables should be identified, and the data stratified by them, before conclusions are drawn from pooled rates.
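The role of age can be checked by comparing how old each group is. The sketch below, using data re-entered from the printed output above, computes the share of each group aged 65 or over; the cut-off at 65 and the names older and share_older are choices made for this illustration.

```r
# Data re-entered from the printed output above (not read from the CSV)
smoking = data.frame(
  Age                = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
  SmokersDied        = c(2, 3, 14, 27, 51, 29, 13),
  SmokersSurvived    = c(53, 121, 95, 103, 64, 7, 0),
  NonSmokersDied     = c(1, 5, 7, 12, 40, 101, 64),
  NonSmokersSurvived = c(61, 152, 114, 66, 81, 28, 0)
)

# Group size in each age band
smokers_n    = smoking$SmokersDied    + smoking$SmokersSurvived
nonsmokers_n = smoking$NonSmokersDied + smoking$NonSmokersSurvived

older = smoking$Age %in% c("65-74", "75+")   # TRUE for the two oldest bands

# Proportion of each group that falls in the oldest bands
share_older = c(Smokers    = sum(smokers_n[older])    / sum(smokers_n),
                NonSmokers = sum(nonsmokers_n[older]) / sum(nonsmokers_n))
round(share_older, 3)
```

A large gap between the two shares means the pooled comparison is partly a comparison of older people with younger people, not only of non-smokers with smokers.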

Simpson’s Paradox

To practice your understanding of Simpson’s Paradox, consider the following example.

Suppose we ask 1000 people to taste-test Pepsi, and say whether they like it. Similarly, we ask 1000 people to taste-test Coke, and say whether they like it.

The results are as follows (to 1 dp).

Drink	Male	Female	Total
Pepsi	760 / 900 = 84.4%	40 / 100 = 40.0%	800 / 1000 = 80.0%
Coke	600 / 700 = 85.7%	150 / 300 = 50.0%	750 / 1000 = 75.0%

Which statement do you think is true?

The data provides evidence that more people like Pepsi than Coke.
The data provides evidence that more people like Coke than Pepsi.
The data does not provide enough information to determine overall preference for Coke and Pepsi.

This example is a classic illustration of Simpson’s Paradox, where the aggregated data tells a different story from the disaggregated data. Here’s a breakdown of the situation:

Disaggregated (by gender):

Among males, the preference for Pepsi is 84.4%, while for Coke, it is 85.7%.
Among females, the preference for Pepsi is 40%, and for Coke, it is 50%.

In both gender groups, Coke is preferred over Pepsi.

Aggregated:

The overall preference for Pepsi is 80% (800 out of 1000).
The overall preference for Coke is 75% (750 out of 1000).

When the data is combined, it suggests that Pepsi is preferred over Coke.

Conclusion regarding the statements:

Taken alone, the aggregated data points to the first statement, “The data provides evidence that more people like Pepsi than Coke.” Broken down by gender, however, Coke is preferred in each subgroup. This is Simpson’s Paradox: the trend in the aggregated data is the opposite of the trend in the disaggregated data.
The reversal comes from the make-up of the two tasting groups. Males gave higher ratings to both drinks, and the Pepsi group contained 900 males and 100 females while the Coke group contained 700 males and 300 females. Pepsi’s overall lead therefore reflects who was asked rather than a stronger preference for Pepsi.
Once gender is taken into account, the most defensible choice is the second statement, “The data provides evidence that more people like Coke than Pepsi.” The aggregated figures alone are not a reliable guide, which is why the data should be examined within groups before drawing a conclusion.
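The reversal can be reproduced in R from the counts in the table. This is a minimal sketch; the data frame taste and its column names are made up for this example.

```r
# Counts taken from the taste-test table above
taste = data.frame(
  Drink  = c("Pepsi", "Pepsi", "Coke", "Coke"),
  Gender = c("Male", "Female", "Male", "Female"),
  Liked  = c(760, 40, 600, 150),
  Asked  = c(900, 100, 700, 300)
)

# Percentage who liked the drink within each gender (disaggregated)
taste$Rate = round(100 * taste$Liked / taste$Asked, 1)
taste

# Percentage who liked the drink overall (aggregated across gender)
liked_total = tapply(taste$Liked, taste$Drink, sum)
asked_total = tapply(taste$Asked, taste$Drink, sum)
overall = round(100 * liked_total / asked_total, 1)
overall
```

Comparing the Rate column within each gender against the overall percentages shows the two comparisons pointing in opposite directions.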

MATH1062 Lab1 Worksheet

Introduction to R

© University of Sydney MATH1062

24 March 2024