1.8 Smoking habits of UK residents.

A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.

  1. Each row represents one UK resident in the sample (each row is unique). It records the resident’s sex, age, marital status, gross income, smoking status (yes/no), and the number of cigarettes smoked per day on weekends and on weekdays.

  2. A total of 1691 participants were included.

  3. sex, marital and smoke are categorical and nominal (non-ordinal). age, amtWeekends and amtWeekdays are numerical and discrete. grossIncome looks numerical and discrete (it does not allow decimal places), but because it is recorded in bands (“Under”, “Above” and “to”) it can also be interpreted as categorical and ordinal, depending on how you look at it.
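The variable types above map onto R's column types roughly as follows. This is a minimal sketch on a few made-up rows: the values, the income band labels, and the name `smoking_demo` are illustrative assumptions, not data from the survey.

```r
# Hypothetical rows for illustration only; not values from the survey.
smoking_demo <- data.frame(
  # Nominal categorical variables: unordered factors
  sex     = factor(c("Female", "Male", "Male")),
  marital = factor(c("Married", "Single", "Divorced")),
  smoke   = factor(c("No", "Yes", "No")),
  # Discrete numerical variables: integers (NA marks a missing value)
  age         = c(42L, 53L, 27L),
  amtWeekends = c(NA, 12L, NA),
  amtWeekdays = c(NA, 12L, NA),
  # Income recorded in bands: an ordered factor (assumed band labels)
  grossIncome = factor(
    c("2,600 to 5,200", "Above 36,400", "Under 2,600"),
    levels  = c("Under 2,600", "2,600 to 5,200", "Above 36,400"),
    ordered = TRUE
  )
)

str(smoking_demo)

# Ordered factors support comparisons; unordered factors do not.
smoking_demo$grossIncome[1] < smoking_demo$grossIncome[2]
```

Storing the income bands as an ordered factor keeps their ranking without pretending the gaps between bands are equal, which matches the ordinal reading given above.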

1.10 Cheaters, scope of inference.

Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.

  1. The population of interest is all children between the ages of 5 and 15, and the sample is the 160 children between the ages of 5 and 15 who took part in the study.

  2. Provided the 160 children were selected at random from the population, the results can be generalized to all children between the ages of 5 and 15. Because the study is an experiment (instruction vs. no instruction), it can be used to establish causal relationships.

1.28 Reading the paper

  1. Given that sample size is large enough and

  2. The statement is false, as the conclusion states only that “the researchers found that children who had behavioral issues and those who were IDENTIFIED as bullies were twice as likely to have shown symptoms of sleep disorders”. In other words, the study reports an association between bullying behavior and symptoms of sleep disorders; it does not show that one leads to the other.

1.36

  1. It is an experiment, since the researcher is interested in a causal relationship and assigned subjects, selected by stratified random sampling, to treatment and control groups.

  2. Treatment group: the half of the 18-30, 31-40 and 41-55 year olds who were instructed to exercise twice a week. Control group: the remaining people, who were instructed not to exercise.

  3. It seems that the study did not make use of blocking, since there are only two groups: people who exercised and people who did not. If there were other groups who exercised once a week, three times a week, and so on, to identify low, medium and high risk groups, we could say blocking was used.

  4. I think blinding is not used, since half of the people were INSTRUCTED not to exercise and would probably be aware that they are in the control group. But if they did not know the purpose of the research itself, which is to determine the effect of exercise on mental health, we can say the study made some use of blinding.

  5. I think the study needs to be improved before it can establish a causal relationship between exercise and mental health. Blocking and blinding should be used more effectively to give better results. I want to know how much exercise is optimal for maintaining good mental health: would it be once a week, or twice a week? Too much exercise might even harm mental health. In addition, to make the study more accurate, blinding should be used so that participants can CHEAT LESS during the experiment. In conclusion, I do not think the conclusions of this study can be generalized to the population at large.

  6. As long as they do better on blocking and blinding as I mentioned above, I would give funding. Otherwise, no.

1.48

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
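The call that produced this summary is not shown. A minimal sketch of how it could be reproduced is given below; the `scores` vector is an assumption (20 exam scores believed to be the data for this exercise), not something stated in this write-up.

```r
# Assumed data: 20 exam scores that are not shown in the original write-up.
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Five-number summary plus the mean
summary(scores)
```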

1.50

a - 2, b - 3, c - 1

1.56

  1. Positively skewed, since the median is closer to the 1st quartile than to the 3rd quartile. Because the distribution is positively skewed and the outlier is very large, the IQR is the better measure of variability here, as it describes where the middle 50% of the data lie. The median is the better measure of center because it is not affected by the outlier.

  2. Symmetric, since the difference between the median and the 1st quartile is the same as the difference between the median and the 3rd quartile. Because the distribution is symmetric and the outlier is not very far from the 3rd quartile, the standard deviation is the better measure of variability here (approximately normally distributed). The mean is the better measure of center because the left and right tails are balanced; in a normal distribution the mean and the median are the same.

  3. Positively skewed, since the median is closer to the 1st quartile than to the 3rd quartile, as a few students drink excessively. Because the distribution is positively skewed and the outlier is very large, the IQR is the better measure of variability here, as it describes where the middle 50% of the data lie. The median is the better measure of center because it is not affected by the outlier.

  4. Positively skewed, since the median is closer to the 1st quartile than to the 3rd quartile. Because the distribution is positively skewed and the outlier is very large, the IQR is the better measure of variability here, as it describes where the middle 50% of the data lie. The median is the better measure of center because it is not affected by the outlier.
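The reasoning used in these answers, that the median and IQR resist a large outlier while the mean and standard deviation do not, can be checked with a small sketch. The values and the helper `spread()` below are hypothetical, chosen only for illustration.

```r
# Hypothetical values for illustration; not data from the exercise.
base <- c(2, 3, 3, 4, 4, 5, 5, 6, 7)
with_outlier <- c(base, 60)  # same data plus one very large value

# Center and spread measures collected into one named vector
spread <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

# One row per data set, so the effect of the outlier can be read across rows
round(rbind(base = spread(base), with_outlier = spread(with_outlier)), 2)
```

In the printed table the mean and standard deviation change far more between the two rows than the median and IQR do, which is why the latter pair is preferred for skewed data with extreme values.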

1.70

  1. Survival is not independent of transplant status, because patients who received a transplant had a higher chance of surviving than those who did not.

  2. It suggests that the heart transplant is highly effective, since the 1st quartile, median and 3rd quartile of survival time are all higher in the treatment group than in the control group. Even the outliers in the treatment group had longer survival times than those in the control group.

library(openintro)  # provides the heartTr data set

## number of patients who died in the control group
control_d <- nrow(subset(heartTr, survived == 'dead' & transplant == 'control'))

## total number of patients in the control group
control_total <- nrow(subset(heartTr, transplant == 'control'))

## proportion of control patients who died
control_d / control_total
## [1] 0.8823529
## i.e. 88.24%

## number of patients who died in the treatment group
treatment_d <- nrow(subset(heartTr, survived == 'dead' & transplant == 'treatment'))

## total number of patients in the treatment group
treatment_total <- nrow(subset(heartTr, transplant == 'treatment'))

## proportion of treatment patients who died
treatment_d / treatment_total
## [1] 0.6521739
## i.e. 65.22%
  1. What are the claims being tested?

The claims being tested are the two hypotheses. The null hypothesis says that having a transplant has no effect on the survival rate, and the alternative hypothesis says that having a transplant leads to a higher survival rate.

  1. We write alive on NUMBER OF ALIVE cards representing patients who were alive at the end of the study, and dead on NUMBER OF DEAD cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size NUMBER OF PATIENTS WHO RECEIVED THE TRANSPLANT representing treatment, and another group of size NUMBER OF PATIENTS WHO DID NOT RECEIVE THE TRANSPLANT representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at ZERO. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are LESS THAN OR EQUAL TO THE OBSERVED DIFFERENCE IN PROPORTIONS (the observed difference, 0.6522 - 0.8824, is negative). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

  2. What do the simulation results shown below suggest about the effectiveness of the transplant program?

It seems that the TREATMENT group has a lower proportion of deaths than the CONTROL group in most cases, so we can reject the null hypothesis and conclude that a heart transplant leads to a higher survival rate.
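The card-shuffling procedure described above can be sketched in a few lines of R. The group sizes and death counts below are assumptions, chosen to be consistent with the printed proportions (30/34 and 45/69); they are not stated in the text. The seed is arbitrary. Because the observed difference (treatment - control) in the proportion of deaths is negative, "at least as extreme" here means at or below the observed value.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Assumed counts (not stated in the text), consistent with the printed
# proportions 30/34 = 0.8824 and 45/69 = 0.6522.
n_treatment <- 69
n_control   <- 34
n_dead      <- 75
n_alive     <- 28

# One card per patient, labelled by outcome
cards <- c(rep("dead", n_dead), rep("alive", n_alive))

# Observed difference in the proportion of deaths (treatment - control)
obs_diff <- 45 / n_treatment - 30 / n_control

# Shuffle the cards, deal them into two groups, record the difference;
# repeat 100 times as in the description.
sim_diffs <- replicate(100, {
  shuffled  <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
})

# Fraction of shuffles at least as extreme as the observed difference
p_value <- mean(sim_diffs <= obs_diff)

obs_diff
p_value
```

A small `p_value` indicates that a difference as large as the observed one rarely arises from shuffling alone, which is the basis for rejecting the null hypothesis.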