Week1

#1.10 Cheaters, scope of inference.
#160 Children between 5 and 15
#Flipping a coin, half asked not to cheat, all rewarded if reported white

#A) Identify the population of interest and the sample in this study
#Population: all children between 5 and 15
#Sample: 160 children between 5 and 15 in the experiment

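#Answer: The question does not specify whether the 160 children were randomly sampled or were volunteers. If they were randomly sampled, the results could be generalized to the wider population; because random sampling is not stated, we cannot say whether they generalize. Whether a causal relationship can be established depends on how the instruction not to cheat was assigned: random assignment to the two groups would support a causal conclusion, while random sampling by itself would not.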

#1.20 Stressed out, Part I. A study that surveyed a random sample of otherwise healthy high school students found that they are more likely to get muscle cramps when they are stressed. The study also noted that students drink more coffee and sleep less when they are stressed.

#A) what type of study is this?

#This study is an observational study

#B) Can this study be used to conclude a causal relationship between increased stress and muscle cramps

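#No. Because this is an observational study, it can show an association between stress and muscle cramps but cannot establish that stress causes them.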

#C) State possible confounding variables that might explain the observed relationship between increased stress and muscle cramps.

#Possible confounders include reduced sleep and increased coffee intake, both of which the study observed alongside stress. For example, students are likely more stressed around finals, when they sleep less and drink more coffee in order to study. Sitting for extended periods while studying could also cause muscle cramps.

#1.30 Stressed out, Part II. In a study evaluating the relationship between stress and muscle cramps, half the subjects are randomly assigned to be exposed to increased stress by being placed into an elevator that falls rapidly and stops abruptly, and the other half are left at no or baseline stress.

#A) What type of study is this?

#This is an experimental study.

#B) Can this study be used to conclude a causal relationship between increased stress and muscle cramps?

#Yes. Subjects were randomly assigned to the increased-stress or baseline condition and a control group was included, so the experiment can support a causal conclusion. However, the abrupt stop of the elevator could itself cause muscle cramps, which would confound the effect of stress.
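
#A minimal sketch of the random assignment step described in 1.30. The exercise does not give a sample size, so the 40 subjects, the seed, and the names subjects, group, and assignment are all assumptions made for illustration.

```r
# Hypothetical subject IDs; the exercise does not state how many subjects there were
set.seed(1)  # fixed seed so the shuffle is reproducible
subjects <- 1:40

# Shuffle a balanced vector of labels so exactly half the subjects land in each group
group <- sample(rep(c("stress", "baseline"), each = length(subjects) / 2))
assignment <- data.frame(subject = subjects, group = group)

# Each group should contain half of the subjects
table(assignment$group)
```

#Because chance alone decides who gets which label, other traits of the subjects should be balanced across groups on average, which is what supports the causal reading.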

#1.40 Office Productivity. Office productivity is relatively low when the employees feel no stress about their work or job security. However, high levels of stress can also lead to reduced employee productivity. Sketch a plot to represent the relationship between stress and productivity.

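# Illustrative values chosen to sketch the shape: productivity is highest at moderate stress
prod <-   c(1,.8,.7,.75,.8,.5,.1,.1,.2,.2,.4,.55,.2,.4)
stress <- c(.5,.6,.65,.5,.55,.4,1,.1,.3,.9,.3,.6,.2,.8)

prod.df <- data.frame(prod, stress)
library(ggplot2)
# Scatter plot with a smoother to show the rise and fall of productivity with stress
ggplot(prod.df, aes(stress, prod)) + geom_point() + geom_smooth()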

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
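
#The relationship described in 1.40 can also be sketched as an idealized curve rather than scattered points. The quadratic form below is an assumption chosen only because it produces an inverted U on a 0 to 1 scale; the exercise does not specify a functional form, and stress_grid and prod_ideal are illustrative names.

```r
# Grid of stress levels on an assumed 0-to-1 scale
stress_grid <- seq(0, 1, by = 0.01)

# Assumed inverted-U: productivity is 0 at both extremes and peaks at moderate stress
prod_ideal <- 1 - 4 * (stress_grid - 0.5)^2

# Base-graphics line plot of the idealized relationship
plot(stress_grid, prod_ideal, type = "l",
     xlab = "Stress", ylab = "Productivity",
     main = "Idealized stress vs. productivity")
```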

#1.50 Mix-and-match. Describe the distributions in the histograms below and match them to the box plots.

#A) Matches box plot 2, the distribution of this data is unimodal, symmetric, and appears it could be normally distributed

#B) Matches box plot 3. The distribution of this data may be symmetric but does not look normally distributed. It may be uniform, but a histogram with a smaller bin width would be needed to confirm this.

#C) The box plot that matches this histogram is 1. The distribution of this data is unimodal and has a right skew.

#1.60 A new statistic. The statistic mean/median can be used as a measure of skewness. Suppose we have a distribution where all observations are greater than 0, x_i > 0. What is the expected shape of the distribution under the following conditions? Explain your reasoning.

#A) mean/median = 1: the shape is expected to be symmetric, since the mean equals the median.

#B) mean/median < 1: the shape is expected to have a negative (left) skew. The long left tail pulls the mean below the median, so the ratio is typically less than 1.

#C) mean/median > 1: the shape is expected to have a positive (right) skew. The long right tail pulls the mean above the median, so the ratio is typically greater than 1.
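
#The three cases in 1.60 can be checked on example data. This is a sketch under stated assumptions: the three distributions below are chosen for illustration and do not come from the exercise, skew_ratio is a hypothetical helper, and evenly spaced quantiles are used instead of random draws so the result is deterministic.

```r
# Hypothetical helper: the mean/median statistic from the exercise
skew_ratio <- function(x) mean(x) / median(x)

# Evenly spaced probabilities, used to build deterministic example data
p <- ppoints(1000)

symmetric  <- qnorm(p, mean = 10, sd = 1)  # symmetric around 10, all values > 0
right_skew <- qexp(p)                      # long right tail, all values > 0
left_skew  <- 10 - qexp(p)                 # mirror image: long left tail, all values > 0

ratios <- c(symmetric  = skew_ratio(symmetric),
            right_skew = skew_ratio(right_skew),
            left_skew  = skew_ratio(left_skew))
round(ratios, 3)
```

#The symmetric case should give a ratio of about 1, the right-skewed case a ratio above 1, and the left-skewed case a ratio below 1.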

#1.70 Heart Transplants. [...] The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died.

#A) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

#Based on the mosaic plot, patients in the treatment group had a higher survival rate than those in the control group. This suggests that survival is not independent of whether a patient got a transplant.

#B) What do the box plots below suggest about the efficacy of the heart transplant treatment?

#The box plots suggest that patients who received a transplant tended to survive longer than those who did not. This holds across the distribution of survival times, not only at the median.

#C) What proportion of patients in the treatment group and what proportion of the patients in the control group died?
library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

data("heartTr")
# Count the patients in each group and the deaths within each group
patient_control_dead <- nrow(subset(heartTr, transplant == "control" & survived == "dead"))
patient_control <- nrow(subset(heartTr, transplant == "control"))
patient_treatment_dead <- nrow(subset(heartTr, transplant == "treatment" & survived == "dead"))
patient_treatment <- nrow(subset(heartTr, transplant == "treatment"))
# Proportion of each group that died
patient_control_dead_ratio <- patient_control_dead/patient_control
patient_treatment_dead_ratio <- patient_treatment_dead/patient_treatment
patient_control_dead_ratio

## [1] 0.8823529

patient_treatment_dead_ratio

## [1] 0.6521739
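
#The same proportions can be computed directly from the counts given in the exercise text (30 of 34 control patients died, 45 of 69 treatment patients died), without loading the data set. The names deaths, group_sizes, death_props, and diff_props are illustrative.

```r
# Counts taken from the exercise statement
deaths      <- c(control = 30, treatment = 45)
group_sizes <- c(control = 34, treatment = 69)

# Proportion of each group that died
death_props <- deaths / group_sizes
death_props

# Difference in proportions (treatment - control)
diff_props <- death_props[["treatment"]] - death_props[["control"]]
diff_props
```

#These should agree with the values computed from heartTr above.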

#D) One approach for investigating whether or not the treatment is effective is to use a randomization technique.
  #i. What are the claims being tested?
    #The claim being tested is whether or not the transplant increases the patient's lifespan.
  #ii. The paragraph below describes the setup for such an approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
    #a)
        Patient_Alive <- sum(heartTr$survived == "alive")
        Patient_Alive

## [1] 28

    #b)
        Patient_Dead <- sum(heartTr$survived == "dead")
        Patient_Dead

## [1] 75

    #c)
        patient_treatment

## [1] 69

    #d)
        patient_control

## [1] 34

    #e)
        patient_treatment_dead_ratio - patient_control_dead_ratio

## [1] -0.230179

#We write alive on [28] cards representing patients who were alive at the end of the study, and dead on [75] cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size [69] representing treatment, and another group of size [34] representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at [approximately zero]. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are [at least as extreme as our observed difference (-23.02%)]. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
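
#The card-shuffling procedure described above can be sketched in R. This is a minimal illustration, not the simulation that produced the results shown in the exercise: the seed is arbitrary, n_sims is set to 1000 rather than the 100 repetitions in the text (an assumption, to get a smoother distribution), and "at least as extreme" is read one-sidedly as a difference at or below the observed one.

```r
set.seed(2019)  # arbitrary seed, used only for reproducibility

# One card per patient: 28 alive and 75 dead, as in the text
cards <- rep(c("alive", "dead"), times = c(28, 75))
n_treatment <- 69  # the remaining 34 cards form the control group

# Observed difference in death proportions (treatment - control)
observed_diff <- 45 / 69 - 30 / 34

n_sims <- 1000  # assumption: more repetitions than the 100 mentioned in the text
sim_diffs <- replicate(n_sims, {
  shuffled  <- sample(cards)                                # shuffle the cards
  treatment <- shuffled[1:n_treatment]                      # first 69 cards
  control   <- shuffled[(n_treatment + 1):length(shuffled)] # last 34 cards
  mean(treatment == "dead") - mean(control == "dead")
})

# The simulated differences should be centered near zero
mean(sim_diffs)

# Fraction of simulations at or below the observed difference
p_value <- mean(sim_diffs <= observed_diff)
p_value
```

#A small p_value means that shuffles under the null hypothesis rarely produce a difference as large as the observed one.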
        
  #iii. What do the simulation results shown below suggest about the effectiveness of the transplant program?

  #Based on the simulation, a difference of -23.02% would arise from chance alone only about 2% of the time. Because such an outcome is rare under the null hypothesis, we reject the null hypothesis in favor of the alternative, which suggests that the transplant program is effective.

Joseph Sansevero

January 14, 2019