# a) Each row of the data matrix represents one UK resident.
head(smoking, 5)
## gender age maritalStatus highestQualification nationality ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## grossIncome region smoke amtWeekends amtWeekdays type
## 1 2,600 to 5,200 The North No NA NA
## 2 Under 2,600 The North Yes 12 12 Packets
## 3 28,600 to 36,400 The North No NA NA
## 4 10,400 to 15,600 The North No NA NA
## 5 2,600 to 5,200 The North No NA NA
# b) There are 1,691 participants in total.
nrow(smoking)
## [1] 1691
# c)
# Categorical (nominal): gender, maritalStatus, nationality, ethnicity, region, smoke, type
# Categorical (ordinal): highestQualification, grossIncome
# Numerical (discrete): age, amtWeekends, amtWeekdays
str(smoking)
## 'data.frame': 1691 obs. of 12 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ maritalStatus : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ highestQualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
## $ nationality : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
## $ ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ grossIncome : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
## $ region : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ amtWeekends : int NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amtWeekdays : int NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : Factor w/ 5 levels "","Both/Mainly Hand-Rolled",..: 1 5 1 1 1 1 5 1 4 5 ...
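The str() output shows the ordinal variables stored as ordinary factors, whose levels are sorted alphabetically rather than by rank. The sketch below shows one way an ordering could be made explicit. It uses a toy vector containing only the four income bands visible in the head() output, not the full 10-level variable, and the names income, band_order, income_nominal, and income_ordinal are introduced purely for illustration.

```r
# Toy vector: only the four income bands visible in the head() output above.
# The real grossIncome variable has 10 levels, which are not all shown here.
income <- c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400",
            "10,400 to 15,600", "2,600 to 5,200")

# Default factor(): levels are sorted alphabetically, so their order carries no meaning.
income_nominal <- factor(income)
levels(income_nominal)

# Ordered factor: levels are listed explicitly from the lowest to the highest band.
band_order <- c("Under 2,600", "2,600 to 5,200",
                "10,400 to 15,600", "28,600 to 36,400")
income_ordinal <- factor(income, levels = band_order, ordered = TRUE)

# Comparisons now respect the band order.
income_ordinal[2] < income_ordinal[1]  # is "Under 2,600" below "2,600 to 5,200"?
is.ordered(income_ordinal)
```

With an ordered factor, functions such as sort(), min(), and max() follow the declared ranking instead of alphabetical order.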
e: Provided the sample size is large enough, a causal relationship between exercise and mental health can be established, because this is a properly designed experiment in which both the sampling and the assignment to groups were random.
f: One reservation is that the study should account for each patient's normal exercise pattern. There is also an ethical concern: participants should not be prevented from exercising.
library(ggplot2)
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
scores <- as.data.frame(scores)
ggplot(scores, aes(x = "", y = scores)) + geom_boxplot() + coord_cartesian(ylim = c(50, 100))
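To see which points the box plot would mark as outliers, the usual 1.5 x IQR rule can be applied by hand. This is a minimal sketch, assuming R's default quantile method (type 7), which matches the summary() output above. The names score_values, lower_fence, upper_fence, and outliers are introduced only for illustration.

```r
# Same 20 scores as above, kept as a plain numeric vector (illustrative name)
score_values <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                  79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

q1  <- unname(quantile(score_values, 0.25))  # 72.75, as in summary()
q3  <- unname(quantile(score_values, 0.75))  # 82.25, as in summary()
iqr <- q3 - q1                               # 9.5

lower_fence <- q1 - 1.5 * iqr  # 58.5
upper_fence <- q3 + 1.5 * iqr  # 96.5

# Scores outside the fences are flagged as potential outliers
outliers <- score_values[score_values < lower_fence | score_values > upper_fence]
outliers  # 57
```

Only the minimum score, 57, falls below the lower fence of 58.5, so it is the single point this rule flags.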
a : The mosaic plot shows that patients who received a transplant had greater odds of surviving, which suggests that the two variables are not independent.
b : The transplant treatment increases patients' survival time.
c : 88.24% of the control group died, compared with 65.22% of the treatment group.
library(data.table)
heartTr <- as.data.table(heartTr)
# Count patients in each transplant/survival combination
ht <- heartTr[, .(count = .N), by = .(transplant, survived)]
# Proportion of each group that died
control.dead.ratio <- ht[transplant == "control" & survived == "dead"]$count / sum(ht[transplant == "control"]$count)
treatment.dead.ratio <- ht[transplant == "treatment" & survived == "dead"]$count / sum(ht[transplant == "treatment"]$count)
control.dead.ratio
## [1] 0.8823529
treatment.dead.ratio
## [1] 0.6521739
#alive
sum(ht[survived == "alive"]$count)
## [1] 28
#dead
sum(ht[survived == "dead"]$count)
## [1] 75
#treatment
sum(ht[transplant == "treatment"]$count)
## [1] 69
#control
sum(ht[transplant == "control"]$count)
## [1] 34
# difference in proportion dead (treatment - control)
treatment.dead.ratio - control.dead.ratio
## [1] -0.230179
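One way to judge whether a difference of about -0.23 could plausibly arise by chance is a randomization (permutation) test. The sketch below is not part of the original solution. It rebuilds one row per patient from the counts printed above (30 of 34 control patients and 45 of 69 treatment patients died, which follows from the group totals and proportions), shuffles the group labels, and recomputes the difference each time. The seed, the number of shuffles, and all variable names are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Rebuild one row per patient from the counts above
group   <- rep(c("control", "treatment"), times = c(34, 69))
outcome <- c(rep(c("dead", "alive"), times = c(30, 4)),   # control patients
             rep(c("dead", "alive"), times = c(45, 24)))  # treatment patients

# Difference in proportion dead: treatment minus control
diff_dead <- function(grp, out) {
  mean(out[grp == "treatment"] == "dead") - mean(out[grp == "control"] == "dead")
}

observed <- diff_dead(group, outcome)  # about -0.23, matching the value above

# Shuffle the group labels to simulate independence of transplant and survival
n_shuffles <- 10000
simulated <- replicate(n_shuffles, diff_dead(sample(group), outcome))

# Share of shuffles with a difference at least as negative as the observed one
p_value <- mean(simulated <= observed)
p_value
```

A small p_value would indicate that a difference this large is unlikely if transplant and survival were independent, which would support the conclusion drawn from the mosaic plot.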