Chapter 1 - Introduction to Data

1.8 Smoking habits of UK residents (p57)

What does each row of the data matrix represent?

ANS: Each row of a data matrix represents the case or object of various variables. It represents a sample data of a group.

How many participants were included in the survey?

ANS: There are a total of 1691 participants included in the survery.

Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

ANS: sex: Categorical age: Numerical, Discrete marital: Categorical grossIncome: Categorical, Ordinal smoke: Categorical amtWeekends: Numerical, Discrete amtWeekdays: Numerical, Discrete

1.10 Cheaters, scope of inference (p58)

Identify the population of interest and the sample in this study.

ANS: sample –160 children. population – 5 to 15 years old

Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

ANS: Making causal conclusion based on observational data is not recommended. The more samples the researchers study the more accruately they can estimate the effect of the explanatory variable.

1.28 Reading the paper (p62)

An article titled Risks: Smokers Found More Prone to Dementia states the following:

ANS: Based on this study, smokers are prone to Dementia to some degree but not entirely. This study does not provide what other reasons cause dimentia or the correlation between percentage of non smokers and dimentia. The research also does not provide whether the smokers had other habits like alcohol or drugs and the influence of it for Dementia. Based on this article, causal conclusion of smoking causes Dementia is not completely acceptable.

Another article titled The School Bully Is Sleepy

ANS: The research says kids who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders. whereas the friend says “sleep disorders lead to bullying in school children.” The interpretation of a leads to b may not be same as b leads to a.

1.36 Exercise and mental health.

What type of study is this?

ANS: This study is randomization. (b) What are the treatment and control groups in this study?

ANS: The treatment group is the the group instructed to exercise twice a week. The control group is the group instructed not to exercise.

Does this study make use of blocking? If so, what is the blocking variable?

ANS: Yes, this study makes use of blocking by choosing participants in different age group.

Does this study make use of blinding?

ANS: No, participants know whether they are exercising or not.

Comment on whether or not the results of the study can be used to establish a causal rela- tionship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

ANS: Stratifed sampling can be used to generalize to the population at large. However the population of the strata is not mentioned here.

Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

ANS: Additional details are needed for this proposal are following. based on this the decision would be taken. 1. sample size 2. duration of the study and also is exercising twice sufficient for this study 3. impact of blinding the control group for this study period

library(ggplot2)
data <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
scores <- data.frame(scores=data,
                     type=rep("score", length(data)))
# Define the boxplot
bp <- ggplot(data=scores, aes(x=type, y=scores)) + 
  geom_boxplot(colour="#6666FF") +
  labs(title="Boxplot of Statistics Scores", x="Stats Final Exam Scores", y="Score")

bp

1.50 Mix-and-match.

ANS: (a) The distribution is mostly symmetric and matches the (2) boxplot. (b) The distribution is fairly evenly distributed, and multimodal. It matches the (3) boxplot. (c) The distribution is unimodel and matches the (1) boxplot.

1.56 Distributions and appropriate statistics (p69)

ANS: (a) Q1 - 25% - Below $350,000; Q2 - 50% - Below $450,000 (Median); Q3 - 75% - Below $1,000,000; Meaningful # of houses that cost more than $6,000,000. Distribution: Right skewed due to significant number of houses that are more than $1M. Mean could be in the 3rd Quartile as meaningful # of houses cost more than 6M. Mean could be more than median. IQR would better represent the observation.

Q1 - 25% - Below $300,000; Q2 - 50% - Below $600,000 (Median); Q3 - 75% - Below $900,000; Very few houses that cost more than $1,200,000. Distribution: symmetric due to the proportion of houses in each of the quartiles, though slightly right skewed due to the few houses costing more than $1.2M. IQR seems the better represenation due to the distribution of the data.
Number of alcoholic drinks

Distribution: Left skewed due to the few drinks consumed under 21. IQR seems the better represenation of variability because it wouldn’t be influenced by the few excessive drinkers.

Annual salaries

Distribution: Left skewed due to the more number of lower salaries and a few very high salaries on the right.

Mean would be influenced by the very high salaries and would show a higher average salary.

IQR is best for variability. Standard deviation would be influenced by the few high salaries.

1.70 Heart transplants (p74)

Based on mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

The mosaic plot shows more survival rate in the treatment group rather than the control group. The treatment group size is wide. looks like there were more subjects in this group. The mosaic alone is not enough to claim survival is independent or dependent, but it appears that survival rate improves with, and is not entirely independent of, the transplant treatment.

What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment?

The boxplots show the control group have a short survival time (much less than 250 days) with some outliers with extended survival times, while the treatment group has a much larger quartile range with a median close to 250 days and up above 500 days for the third quartile.

Considering this interpretation of the boxplots, the transplant appear to be effective in extending the survival time of the subjects significantly.

What proportion of patients in the treatment group and what proportion of patients in the control group died?

The following R code shows loading the Heart Transplant data and computing the proportions for each group:

library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
# Load the data from our copy downloaded from course site.
heart <- read.table("https://raw.githubusercontent.com/jbryer/DATA606Fall2016/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/heartTr.csv", sep="," , stringsAsFactors=FALSE, header=TRUE)
# Using dplyr to group and count.
agg <- tally(group_by(heart, transplant, survived))
agg

## Source: local data frame [4 x 3]
## Groups: transplant [?]
## 
##   transplant survived     n
##        <chr>    <chr> <int>
## 1    control    alive     4
## 2    control     dead    30
## 3  treatment    alive    24
## 4  treatment     dead    45

# Compute totals and died for each group
totalControl <- sum(agg[agg$transplant == "control",]$n)
totalControl

## [1] 34

diedControl <- sum(agg[agg$transplant == "control" & 
                              agg$survived == "dead",]$n)
diedControl

## [1] 30

totalTreatment <- sum(agg[agg$transplant == "treatment",]$n)
totalTreatment

## [1] 69

diedTreatment <- sum(agg[agg$transplant == "treatment" & 
                                agg$survived == "dead",]$n)
diedTreatment

## [1] 45

# Control porportion died
ratioControlDied <- diedControl / totalControl
ratioControlDied

## [1] 0.8823529

# Treatment porportion died
ratioTreatmentDied <- diedTreatment / totalTreatment
ratioTreatmentDied

## [1] 0.6521739

One approach for investigating whether or not the treatment is effective is to use a randomization technique.

What are the claims being tested? The claim being tested is that a heart transplant increased lifespan.
The paragraph below describes the set up for such approach, if we were to do it with- out using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.

We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shu✏e these cards and split them into two groups: one group of size representing treatment rtotalTreatment, and another group of size representing control rtotalControl. We calculate the di↵erence between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0 . Lastly, we calculate the fraction of simulations where the simulated di↵erences in proportions are r ratioTreatmentDied - ratioControlDied or less. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative. iii. What do the simulation results shown below suggest about the e↵ectiveness of the trans- plant program? the transplant program is effective

Raghu_606_HW1

Raghu Ramnath

2/4/2017