Homework-1

data(iris)

1.8 Smoking habits of UK residents.

# The URL must be a single unbroken string; a line break inside the quotes would corrupt it
theURL <- "https://raw.githubusercontent.com/jbryer/DATA606Spring2017/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/smoking.csv"
smoking <- read.table(file = theURL, header = TRUE, sep = ",")
dim(smoking)

## [1] 1691   12

a) What does each row of the data matrix represent?

Each row represents a respondent to the survey.

b) How many participants were included in the survey?

1691 participants were included.

c) Indicate whether each variable in the study is numerical or categorical.

variables <- names(smoking)
dataType <- c("Categorical", "Numerical", "Categorical", "Categorical", "Categorical", 
              "Categorical", "Categorical", "Categorical", "Categorical", "Numerical", 
              "Numerical", "Categorical")
dataSubType <- c("Nominal", "Discrete", "Nominal", "Ordinal", "Nominal", "Nominal", 
                 "Ordinal", "Nominal", "Nominal", "Discrete", "Discrete", "Nominal")  
data.frame(variables, dataType, dataSubType)

##               variables    dataType dataSubType
## 1                gender Categorical     Nominal
## 2                   age   Numerical    Discrete
## 3         maritalStatus Categorical     Nominal
## 4  highestQualification Categorical     Ordinal
## 5           nationality Categorical     Nominal
## 6             ethnicity Categorical     Nominal
## 7           grossIncome Categorical     Ordinal
## 8                region Categorical     Nominal
## 9                 smoke Categorical     Nominal
## 10          amtWeekends   Numerical    Discrete
## 11          amtWeekdays   Numerical    Discrete
## 12                 type Categorical     Nominal
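
As a rough cross-check of the table above, a first-pass split into numerical and categorical can be read off the column types. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than the real `smoking` data, and `firstPass` is a name introduced only for illustration. The nominal/ordinal/discrete subtype still has to be judged by hand.

```r
# Hypothetical three-row stand-in for the survey data (values are invented)
toy <- data.frame(
  gender = c("Male", "Female", "Female"),
  age    = c(38L, 42L, 40L),
  smoke  = c("No", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Numeric columns are labelled "Numerical"; everything else "Categorical"
firstPass <- ifelse(sapply(toy, is.numeric), "Numerical", "Categorical")
data.frame(variables = names(toy), dataType = unname(firstPass))
```

A variable stored as a number is not always numerical in the statistical sense (an ID code, for example), so this is only a starting point.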

1.10 Cheaters, scope of inference.

a) Identify the population of interest and the sample in this study.

The population of interest is all children between the ages of 5 and 15. The sample is the 160 children between the ages of 5 and 15 who took part in the study.

b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

Since the study is experimental and not observational, the findings can establish causal relationships. If the children were selected at random from the population, then the results can also be generalized to that population.

1.28 Reading the paper.

a) Based on this study, can we conclude that smoking causes dementia later in life? Explain.

This is an observational study, not an experimental study. Therefore, no causal relationship can be confirmed.

b) A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?

This is an observational study, so it cannot by itself show a causal connection. We can only say that the study provides evidence of a possible association between sleep disorders and bullying. To show a causal connection, a randomized experiment would be needed.

1.36 Exercise and mental health

a) What type of study is this?

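This is an experiment, since subjects are assigned to an exercise or a no-exercise condition. Stratified sampling describes how the subjects are selected, not the type of study.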

b) What are the treatment and control groups in this study?

The treatment group is the half of the subjects who were instructed to exercise twice a week. The control group is the other half, who were instructed not to exercise.

c) Does this study make use of blocking? If so, what is the blocking variable?

Yes, the blocking variable is the age groups.
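
To make the idea of blocking concrete, the sketch below randomly assigns half of the subjects within each age group to each arm. The roster, the age-group labels, the seed, and the helper `assignHalf` are all illustrative assumptions, not part of the exercise.

```r
set.seed(606)  # arbitrary seed so the assignment is reproducible

# Hypothetical roster: 4 subjects in each of three age groups
subjects <- data.frame(
  id       = 1:12,
  ageGroup = rep(c("18-30", "31-40", "41-55"), each = 4)
)

# Within one block, shuffle an equal number of the two labels
assignHalf <- function(x) {
  sample(rep(c("exercise", "no exercise"), length.out = length(x)))
}

# ave() applies assignHalf separately to each age group (the blocking variable)
subjects$group <- ave(character(nrow(subjects)), subjects$ageGroup, FUN = assignHalf)

# Every age group should contribute equally to both arms
table(subjects$ageGroup, subjects$group)
```

Because randomization happens inside each block, age cannot end up unevenly distributed between the treatment and control groups.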

d) Does this study make use of blinding?

No.

e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

The results of this study can be used to establish a causal relationship because it is an experimental study. The conclusions cannot be generalized to the population at large because the study covers only three age groups.

f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

The separation of age groups is not well supported. The age range is 12 years for the first group, 9 years for the second group, and 14 years for the third group. This uneven range of ages across groups may affect the result.

1.48 Stats scores

finalScore <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)
boxplot(finalScore)
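
The numbers behind the box plot can be inspected directly. This is a small sketch using base R's `fivenum()` and `boxplot.stats()`; the names `fiveNum` and `outliers` are introduced here for illustration.

```r
finalScore <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)

# Minimum, lower hinge, median, upper hinge, maximum
fiveNum <- fivenum(finalScore)
fiveNum

# Points more than 1.5 hinge-spreads beyond the box, which boxplot() draws individually
outliers <- boxplot.stats(finalScore)$out
outliers
```

Note that `fivenum()` reports hinges, which can differ slightly from the quartiles returned by `quantile()`.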

1.50 Mix-and-match

(a) matches (2)

The distribution is symmetric, unimodal, and bell-shaped.

(b) matches (3)

The distribution is even and close to uniform.

(c) matches (1)

The distribution is skewed to the right.

1.56 Distributions and appropriate statistics, Part II

(a)

The distribution is right skewed. The median would best represent a typical observation. Variability would be best represented by the IQR.

(b)

The distribution is symmetric. The mean would best represent a typical observation. Variability would be best represented by the standard deviation.

(c)

The distribution is right skewed. The median would best represent a typical observation. Variability would be best represented by the IQR.

(d)

The distribution is right skewed. The median would best represent a typical observation. Variability would be best represented by the IQR.
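
The preference for the median and IQR under skew can be illustrated with a toy example. The vector `x` below is invented for this purpose and is not taken from any of the exercises.

```r
# Hypothetical right-skewed sample: one large value stretches the right tail
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 50)

centre <- c(mean = mean(x), median = median(x))
spread <- c(sd = sd(x), IQR = IQR(x))
centre
spread

# Dropping the extreme value moves the mean and SD far more than the median and IQR
xTrimmed <- x[x < 50]
c(mean = mean(xTrimmed), median = median(xTrimmed))
c(sd = sd(xTrimmed), IQR = IQR(xTrimmed))
```

The mean and standard deviation are pulled strongly by the single extreme value, while the median and IQR barely change, which is why they are the safer summaries for skewed data.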

1.70 Heart transplants

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked _by_ '.GlobalEnv':
## 
##     smoking

## The following object is masked from 'package:datasets':
## 
##     cars

data(heartTr)
names(heartTr)

## [1] "id"         "acceptyear" "age"        "survived"   "survtime"  
## [6] "prior"      "transplant" "wait"

levels(heartTr$survived)

## [1] "alive" "dead"

levels(heartTr$transplant)

## [1] "control"   "treatment"

# Count patients in each cell of the survival-by-group table
ctrlDead <- nrow(subset(heartTr, heartTr$survived == 'dead' & heartTr$transplant == 'control'))
trtDead <- nrow(subset(heartTr, heartTr$survived == 'dead' & heartTr$transplant == 'treatment'))
ctrlAlive <- nrow(subset(heartTr, heartTr$survived == 'alive' & heartTr$transplant == 'control'))
trtAlive <- nrow(subset(heartTr, heartTr$survived == 'alive' & heartTr$transplant == 'treatment'))
ctrlDead

## [1] 30

trtDead

## [1] 45

ctrlAlive

## [1] 4

trtAlive

## [1] 24

a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant?

No, survival is not independent of transplant. The mosaic plot clearly indicates a relationship between survival and transplant: the proportion of patients alive is much higher in the treatment group than in the control group.

b) What do the box plots below suggest about the efficacy of the heart transplant treatment?

The box plots show that the heart transplant treatment is highly effective. In the control group, almost all patients died within 500 days, and 75% died within 100 days. In the treatment group, 75% of the patients survived almost 1500 days.

c) What proportion of patients in the treatment group and what proportion of patients in the control group died?

In the treatment group, 45/69 = 0.6521739 died. In the control group, 30/34 = 0.8823529 died. The difference (treatment - control) is about -0.230179.
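
These proportions follow directly from the four counts reported earlier (30, 45, 4, 24). As a sketch, they can be reproduced from a 2x2 table without reloading the data; the object names `counts`, `propByGroup`, and `obsDiff` are introduced here for illustration.

```r
# Cell counts taken from the output above; matrix() fills column by column
counts <- matrix(
  c(30, 4,    # control:   dead, alive
    45, 24),  # treatment: dead, alive
  nrow = 2,
  dimnames = list(survived = c("dead", "alive"),
                  transplant = c("control", "treatment"))
)

# Proportions within each group (each column sums to 1)
propByGroup <- prop.table(counts, margin = 2)
propByGroup

# Difference in death proportions, treatment minus control
obsDiff <- propByGroup["dead", "treatment"] - propByGroup["dead", "control"]
obsDiff
```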

d) One approach for investigating whether or not the treatment is effective is to use a randomization technique.

The claims being tested are:

\(H_0\) Independence model: The transplant treatment has no effect on survival. The observed higher survival rate was due to chance.

\(H_A\) Alternative model: The transplant treatment has an effect on patient survival. The observed higher survival rate was not due to chance.

We write “alive” on 28 cards representing patients who were alive at the end of the study, and “dead” on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of “dead” cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than -0.230179. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

The simulation shows that a difference in proportions below -0.230179 is highly unlikely to arise by chance alone (about a 3% chance). The null hypothesis is therefore rejected in favor of the alternative: the transplant is effective.
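
The card-shuffling procedure can be sketched in a few lines of R. This is an illustration under stated assumptions, not the simulation behind the figure quoted above: the seed is arbitrary, `nSim` is set to 10000 rather than 100 to reduce simulation noise, and the comparison counts shuffles at least as extreme as the observed difference.

```r
set.seed(606)  # arbitrary seed for reproducibility

# 28 "alive" cards and 75 "dead" cards, as described above
cards <- c(rep("alive", 28), rep("dead", 75))
obsDiff <- 45 / 69 - 30 / 34  # observed treatment - control difference

nSim <- 10000  # assumption: more repetitions than the 100 used above
simDiffs <- replicate(nSim, {
  shuffled  <- sample(cards)
  treatment <- shuffled[1:69]    # first 69 cards stand in for the treatment group
  control   <- shuffled[70:103]  # remaining 34 cards stand in for the control group
  mean(treatment == "dead") - mean(control == "dead")
})

# Fraction of shuffles at least as extreme as the observed difference
# (the small tolerance guards against floating-point ties)
pValue <- mean(simDiffs <= obsDiff + 1e-12)
pValue
```

A histogram of `simDiffs` (for example `hist(simDiffs)`) should be roughly centered at 0, with the observed difference far out in the left tail.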

Homework-1

Jun Yan
