Assignments

Chapter 1 - Introduction to Data

  • Practice: 1.7 (available in R using the data(iris) command), 1.9, 1.23, 1.33, 1.55, 1.69
  • Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70 (use library(openintro); data(heartTr) to load the data)

# a) Each row of the data matrix represents one UK resident.
head(smoking, 5)
##   gender age maritalStatus highestQualification nationality ethnicity
## 1   Male  38      Divorced     No Qualification     British     White
## 2 Female  42        Single     No Qualification     British     White
## 3   Male  40       Married               Degree     English     White
## 4 Female  40       Married               Degree     English     White
## 5 Female  39       Married         GCSE/O Level     British     White
##        grossIncome    region smoke amtWeekends amtWeekdays    type
## 1   2,600 to 5,200 The North    No          NA          NA        
## 2      Under 2,600 The North   Yes          12          12 Packets
## 3 28,600 to 36,400 The North    No          NA          NA        
## 4 10,400 to 15,600 The North    No          NA          NA        
## 5   2,600 to 5,200 The North    No          NA          NA
# b) There are 1,691 participants in total.
nrow(smoking)
## [1] 1691
# c)
# Categorical (nominal): gender, maritalStatus, nationality, ethnicity, region, smoke, type
# Categorical (ordinal): highestQualification, grossIncome
# Numerical (discrete): age, amtWeekends, amtWeekdays

str(smoking)
## 'data.frame':    1691 obs. of  12 variables:
##  $ gender              : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
##  $ age                 : int  38 42 40 40 39 37 53 44 40 41 ...
##  $ maritalStatus       : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
##  $ highestQualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
##  $ nationality         : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
##  $ ethnicity           : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ grossIncome         : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
##  $ region              : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ smoke               : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
##  $ amtWeekends         : int  NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amtWeekdays         : int  NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                : Factor w/ 5 levels "","Both/Mainly Hand-Rolled",..: 1 5 1 1 1 1 5 1 4 5 ...
  • a: Each row of the data matrix represents one UK resident.
  • b: There are 1,691 participants in total.
  • c:
    • Categorical (nominal): gender, maritalStatus, nationality, ethnicity, region, smoke, type
    • Categorical (ordinal): highestQualification, grossIncome
    • Numerical (discrete): age, amtWeekends, amtWeekdays
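The str() output above lists the grossIncome levels alphabetically (starting with "10,400 to 15,600"), so R does not treat the income bands as ordered by default. The following is a minimal sketch of how an ordinal variable could be encoded explicitly. It uses a small hypothetical vector whose labels are copied from the head(smoking) output, not the full data set, and the names income, band_order, and income_ordinal are illustrative.

```r
# Hypothetical sample of income bands (labels taken from the head() output above)
income <- c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400",
            "10,400 to 15,600", "2,600 to 5,200")

# Only the bands that appear in this small sample, listed from lowest to highest
band_order <- c("Under 2,600", "2,600 to 5,200",
                "10,400 to 15,600", "28,600 to 36,400")

# ordered = TRUE makes the factor ordinal, so comparisons between bands are allowed
income_ordinal <- factor(income, levels = band_order, ordered = TRUE)

income_ordinal[2] < income_ordinal[1]  # is "Under 2,600" below "2,600 to 5,200"?
table(income_ordinal)                  # counts are reported in band order
```

The same approach should carry over to highestQualification, given a sensible ranking of its eight levels.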
  • a: 160 children between the ages of 5 and 15.
  • b: To generalize the findings to the population, the sample must be randomly selected. The study description does not say that it was, so the answer is no.
  • a:
    • First, nothing indicates that this study was a properly designed experiment with defined control and treatment groups.
    • Second, the “other factors” mentioned above are never specified, which is a warning sign for bias.
    • Third, there is no mention of randomization of the sample.
    • In conclusion, this is an observational study, so it is improper to infer causation.
  • b: It is a common mistake to assume that correlation implies causation. As noted above, this is an observational study, so we cannot draw a causal conclusion from it.
  • a: Prospective experiment.
  • b:
    • treatment: exercise twice a week
    • control: instructed not to exercise
  • c: Yes, the study blocks by age group:
    • 18-30
    • 31-40
    • 41-55
  • d: No. The patients know whether they are in the group to exercise or not.
  • e: Provided the sample size is large enough, yes. This is a properly designed experiment in which both the sampling and the assignment to groups were random, so a causal relationship between exercise and mental health can be established.

  • f: One reservation is that the study should account for each patient’s usual exercise habits. There is also an ethical concern: people should not be prevented from exercising.

library(ggplot2)
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
scores <- as.data.frame(scores)
ggplot(scores, aes(x = "", y = scores)) + geom_boxplot() + coord_cartesian(ylim = c(50, 100))
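geom_boxplot() draws a point as an outlier when it lies more than 1.5 times the IQR beyond the quartiles. The base-R sketch below checks which scores that rule flags. The variable names (exam_scores, lower_fence, upper_fence) are illustrative, and the vector is defined again here so the snippet does not depend on the data frame created above.

```r
# Same 20 scores as above, kept as a plain numeric vector
exam_scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles as reported by summary(): Q1 = 72.75, Q3 = 82.25
q <- unname(quantile(exam_scores, c(0.25, 0.75)))
iqr <- q[2] - q[1]                      # 9.5

# 1.5 * IQR fences
lower_fence <- q[1] - 1.5 * iqr         # 58.5
upper_fence <- q[2] + 1.5 * iqr         # 96.5

# Scores outside the fences are plotted as individual points
outliers <- exam_scores[exam_scores < lower_fence | exam_scores > upper_fence]
outliers                                # 57
```

Only the lowest score, 57, falls outside the fences, which matches the single point drawn below the lower whisker.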

  • a: 2 (symmetric distribution)
  • b: 3 (multimodal distribution)
  • c: 1 (right-skewed distribution)
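  • a: Right-skewed, with a long tail in the upper range. The median and IQR are preferable because they are robust to outliers.
  • b: Approximately normal and symmetric, so the mean and standard deviation are appropriate (the median would give a similar center).
  • c: Right-skewed, so the median and IQR should be used.
  • d: Right-skewed, so the median and IQR should be used to limit the effect of outliers.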
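The preference for the median and IQR in skewed distributions can be illustrated with a small made-up example. The values below are hypothetical and are not taken from the exercise; the sketch only shows how a single extreme value moves the mean and SD far more than the median and IQR.

```r
# Hypothetical symmetric sample, then the same sample with one extreme value added
symmetric    <- c(2, 3, 4, 5, 6)
with_outlier <- c(symmetric, 60)

# Center: the mean jumps, the median barely moves
c(mean = mean(symmetric),    median = median(symmetric))     # 4 and 4
c(mean = mean(with_outlier), median = median(with_outlier))  # about 13.33 and 4.5

# Spread: the SD inflates, the IQR changes only slightly
c(sd = sd(symmetric),    iqr = IQR(symmetric))
c(sd = sd(with_outlier), iqr = IQR(with_outlier))
```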

  • a: The mosaic plot shows that patients who received a transplant had greater odds of surviving. This suggests that the two variables are not independent.

  • b: The transplant treatment increases survival time for patients.

  • c: 88.24% of the control group died, compared with 65.22% of the treatment group.

library(openintro)   # provides the heartTr data set
library(data.table)
data(heartTr)
heartTr <- as.data.table(heartTr)
# count patients in each transplant/survival combination
ht <- heartTr[, .(count = .N), by = .(transplant, survived)]
# proportion of deaths within each group
control.dead.ratio <- ht[transplant == "control" & survived == "dead"]$count / sum(ht[transplant == "control"]$count)
treatment.dead.ratio <- ht[transplant == "treatment" & survived == "dead"]$count / sum(ht[transplant == "treatment"]$count)

control.dead.ratio
## [1] 0.8823529
treatment.dead.ratio
## [1] 0.6521739
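The same proportions can be cross-checked in base R without loading the data set. The sketch below rebuilds the 2 x 2 table from cell counts back-calculated from the group totals and death proportions reported in this exercise (30 of 34 control patients and 45 of 69 treatment patients died). Treat those cell counts as derived values, and the names counts and row_props as illustrative.

```r
# Cell counts back-calculated from the reported totals and proportions
counts <- matrix(c( 4, 30,    # control:   alive, dead
                   24, 45),   # treatment: alive, dead
                 nrow = 2, byrow = TRUE,
                 dimnames = list(transplant = c("control", "treatment"),
                                 survived   = c("alive", "dead")))

# Row proportions: share of each group that was alive or dead
row_props <- prop.table(counts, margin = 1)
round(row_props[, "dead"], 4)   # control 0.8824, treatment 0.6522

# Margins of the table: survival totals and group sizes
colSums(counts)                 # alive 28, dead 75
rowSums(counts)                 # control 34, treatment 69
```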
  • d-i :
    • H0: Independence model - transplant has no effect on length of survival
    • H1: Alternative model - transplant increases the length of survival
  • d-ii: 28, 75, 69, 34, -23.02%
#alive
sum(ht[survived == "alive"]$count)
## [1] 28
#dead
sum(ht[survived == "dead"]$count)
## [1] 75
#treatment
sum(ht[transplant == "treatment"]$count)
## [1] 69
#control
sum(ht[transplant == "control"]$count)
## [1] 34
# difference in death proportions (treatment - control)
treatment.dead.ratio - control.dead.ratio
## [1] -0.230179
  • d-iii: Under the independence model, the differences in death proportions across the 100 simulations are centered near 0, so a difference as extreme as the observed -23.02% would rarely arise by chance alone. This is strong evidence against H0: we reject it and conclude that the heart transplant lengthened survival.
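
The card-shuffling procedure described in d-ii can be sketched in base R as follows. This is an illustration under stated assumptions rather than the textbook's own code: the seed is arbitrary, 1,000 shuffles are used instead of 100 to get a steadier estimate, and the names simulate_diff, sim_diffs, and p_value are hypothetical. The 45 and 30 in observed_diff are the death counts implied by the proportions computed above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# 28 "alive" cards and 75 "dead" cards, as in d-ii
outcomes    <- c(rep("alive", 28), rep("dead", 75))
n_treatment <- 69                   # the remaining 34 cards form the control group
observed_diff <- 45 / 69 - 30 / 34  # treatment - control, about -0.2302

# One shuffle: deal cards to the two groups at random and
# return the difference in death proportions (treatment - control)
simulate_diff <- function() {
  shuffled  <- sample(outcomes)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

n_sims    <- 1000                   # assumption: more shuffles than the 100 in the text
sim_diffs <- replicate(n_sims, simulate_diff())

mean(sim_diffs)                     # should be close to 0 under independence

# Fraction of shuffles at least as extreme as the observed difference
# (small tolerance guards against floating-point rounding)
p_value <- mean(sim_diffs <= observed_diff + 1e-9)
p_value
```

A small p_value means that shuffles rarely produce a difference as large as the observed one, which is the basis for rejecting H0 in d-iii.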