Exercise 1.8 Smoking habits of UK residents

(a) What does each row of the data matrix represent?
      Each row represents one case: a UK resident in the survey, described by a collection of attributes about that person and their smoking habits.

(b) How many participants were included in the survey?
      The last observation is numbered 1691, so 1,691 participants were included.

(c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

Exercise 1.10 Cheaters, scope of inference

(a) Identify the population of interest and the sample in this study
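      Population of interest: all children between the ages of 5 and 15. Sample: the 160 children between the ages of 5 and 15 who took part in the study.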

(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships
      The results cannot be generalized, because half of the children were guided by explicit instructions; the findings also cannot be used to establish a causal relationship.

Exercise 1.28 Reading the paper

(a) An article titled Risks: Smokers Found More Prone to Dementia states the following
      The article gives no clear evidence of which other factors the researchers adjusted for, so we cannot conclude that the statement is true.

(b) Another article titled The School Bully Is Sleepy states the following
      The boys sampled may have had sleep disorders for reasons unrelated to bullying, so we cannot justify this conclusion.

Exercise 1.36 Exercise and mental health

(a) What type of study is this?
      This is an experiment: subjects are assigned to a treatment or a control group, and the sample is drawn with a stratified sampling strategy.

(b) What are the treatment and control groups in this study?
      Treatment group: exercises twice a week
      Control group: does not exercise

(c) Does this study make use of blocking? If so, what is the blocking variable?
      Yes. Age is the blocking variable, with blocks of 18-30, 31-40 and 41-55 year olds.

(d) Does this study make use of blinding?
      No evidence of blinding

(e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.
      Yes, this study can establish a causal relationship. Whether the conclusions generalize to the population at large depends heavily on the number of cases considered.

(f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?
      Perhaps not. Funding can limit the number of cases observed, however, and if the study is meant to generalize to a large population, cases should be picked randomly across different factors such as living area and economic situation.

Exercise 1.48 Stats scores

Create a box plot of the distribution of these scores.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
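
The summary above can be reproduced, and the box plot drawn, with a short sketch like the one below. The `scores` vector is an assumption: it is the list of 20 scores as given in the textbook exercise, not shown in this document, and it reproduces the summary values above exactly.

```r
# Assumed data: the 20 final exam scores from the textbook exercise
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Five-number summary plus the mean, for comparison with the output above
summary(scores)

# Horizontal box plot; boxplot() also returns the statistics it plotted
b <- boxplot(scores, horizontal = TRUE,
             main = "Stats scores", xlab = "Score")
```

The returned list `b` holds the whisker ends and hinges in `b$stats` and any points flagged as outliers in `b$out`, which is handy for checking the plot against the summary.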

Exercise 1.50 Mix-and-match

Describe the distribution in the histograms below and match them to the box plots
      (a)-(2) Symmetric, unimodal
      (b)-(3) Uniform
      (c)-(1) Skewed right

Exercise 1.56 Distributions and appropriate statistics

(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.
      Skewed right: the median is much closer to the first quartile than to the third, and the very expensive houses are outliers. The median and IQR are the appropriate measures.

(b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.
      Symmetric, roughly normal: the median lies midway between the first and third quartiles. The mean and standard deviation are appropriate; the IQR is not the best choice here.

(c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.
      Skewed right: most students drink nothing and a few drink heavily. The median and IQR are appropriate.

(d) Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than the all other employees.
      Skewed right: the median and IQR are appropriate for these data.
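
Why the median and IQR suit skewed data can be seen in a small sketch. The salary figures below are hypothetical, made up purely for illustration (in thousands of dollars); they are not from the exercise.

```r
# Hypothetical salaries in thousands of dollars (illustration only)
staff     <- c(42, 45, 48, 50, 52, 55, 58, 60, 63)
with_exec <- c(staff, 2500)   # add one very highly paid executive

# Compare centre and spread with and without the extreme value
compare <- rbind(
  staff     = c(mean = mean(staff), median = median(staff),
                sd = sd(staff), IQR = IQR(staff)),
  with_exec = c(mean = mean(with_exec), median = median(with_exec),
                sd = sd(with_exec), IQR = IQR(with_exec))
)
round(compare, 1)
```

In this sketch a single extreme salary moves the mean and standard deviation a great deal, while the median and IQR barely change, which is the reason for preferring them in cases (a), (c) and (d).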

Exercise 1.70 Heart transplants

(a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.
      The mosaic plot shows that survival is better among patients who received a transplant, so survival is not independent of transplant status.

(b) What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment.
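      The box plots suggest that patients in the treatment group survived longer than those in the control group, so the treatment appears to be effective.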

library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
data(heartTr)
#str(heartTr)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#heartTr
trans_data <- sqldf("select transplant,survived,count(*) as total from heartTr group by transplant,survived")
trans_data
##   transplant survived total
## 1    control    alive     4
## 2    control     dead    30
## 3  treatment    alive    24
## 4  treatment     dead    45
nrow(heartTr)
## [1] 103

(c) What proportion of patients in the treatment group and what proportion of patients in the control group died?
       Control group: 30 of 34 patients died (30/34 = 0.882). Treatment group: 45 of 69 patients died (45/69 = 0.652).
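
The proportion who died within each group can be checked directly from the counts in the table above. The sketch below retypes those counts into a matrix, so it does not need the openintro or sqldf packages; the object names are illustrative.

```r
# Counts copied from the sqldf output above
counts <- matrix(c( 4, 30,
                   24, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(transplant = c("control", "treatment"),
                                 survived   = c("alive", "dead")))

# Row proportions: each count divided by its own group's total
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)

# Proportion of deaths in each group
died <- row_props[, "dead"]
died
```

Using `margin = 1` divides by the group sizes (34 and 69) rather than by the overall total of 103, which is what the question asks for.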

(d) One approach for investigating whether or not the treatment is effective is to use a randomization technique.
i. What are the claims being tested?

      Randomization technique: H0 (null hypothesis): transplant and survival are independent of each other. HA (alternative hypothesis): transplant and survival are not independent of each other.

ii. The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
       We write alive on 28 cards (control 4, treatment 24) representing patients who were alive at the end of the study, and dead on 75 cards (control 30, treatment 45) representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the observed difference, 45/69 - 30/34 = 0.652 - 0.882 = -0.230. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
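
The card-shuffling procedure described above can be sketched in R. This is a minimal illustration using only the counts from the table, not the simulation behind the textbook's figure; the seed, the helper `simulate_diff()` and the other object names are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_treatment <- 69                               # 24 alive + 45 dead
n_control   <- 34                               #  4 alive + 30 dead
cards <- c(rep("alive", 28), rep("dead", 75))   # one card per patient

# Observed difference in the proportion of deaths (treatment - control)
observed_diff <- 45 / n_treatment - 30 / n_control

# One shuffle: deal the first 69 cards to treatment, the rest to control
simulate_diff <- function() {
  shuffled  <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

n_sims    <- 100
sim_diffs <- replicate(n_sims, simulate_diff())

# Fraction of shuffles at or below the observed difference
# (small tolerance guards against floating-point ties)
fraction_extreme <- mean(sim_diffs <= observed_diff + 1e-9)

observed_diff
fraction_extreme
```

With only 100 shuffles the fraction is coarse; increasing `n_sims` gives a smoother null distribution at little cost.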

iii. What do the simulation results shown below suggest about the effectiveness of the transplant program?
       Very few of the simulated differences are at or below the observed difference of -0.230179, so we reject the null hypothesis in favor of the alternative and conclude that the heart transplant is effective for these patients.