library(datasets)
data(iris)
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
data("heartTr")

Question 1.8

1.8 (a) Each row is an observation, or case.

1.8 (b) Based on the text, a total of 1691 participants were included in the survey.

1.8 (c) The following variables are categorical:

sex: not ordinal
marital: not ordinal
smoke: not ordinal
grossIncome: the data record it as a range, so it is categorical and ordinal. It would be a continuous numerical variable if an exact amount had been recorded instead of a range.

The following variables are numerical:

age: discrete in this case, although in theory it would be continuous if age were measured down to milliseconds or nanoseconds
AmtWeekends: discrete, but the way respondents entered the data can cause confusion and make it look categorical
AmtWeekdays: discrete, but the way respondents entered the data can cause confusion and make it look categorical
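
The distinction between nominal, ordinal, and numerical variables can be sketched in R as follows. This is only an illustration: the values and the income ranges are hypothetical and are not taken from the survey data.

```r
# Hypothetical values for illustration only (not from the survey data)
sex <- factor(c("Female", "Male", "Female"))          # nominal: no natural order

income_levels <- c("Under 5,200", "5,200 to 10,400", "Above 10,400")  # hypothetical ranges
gross_income <- factor(c("5,200 to 10,400", "Under 5,200", "Above 10,400"),
                       levels = income_levels, ordered = TRUE)         # ordinal

age <- c(38L, 42L, 40L)                               # numerical, discrete

is.ordered(sex)                    # nominal factors carry no ordering
is.ordered(gross_income)           # ordered factors do
gross_income[1] > gross_income[2]  # comparisons are meaningful for ordinal data
is.numeric(age)
```

Storing an income range as an ordered factor keeps the ordering without pretending the ranges are exact amounts.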

Question 1.10

1.10 (a) The population is children between the ages of 5 and 15. The sample is the 160 children in the study, who are assumed to represent this population.

1.10 (b) I personally feel that 160 children is a relatively small sample for this population. However, there are other factors to take into account: is religion part of the upbringing, and are country and culture considered? If the sample was drawn from a specific county or state, then the results can be generalized to that population, but several other factors can influence the results. For example, are children from certain cultures more prone to lying than others? Assuming that a cluster or multistage sample was used, it is possible that the results are representative.

Question 1.28

1.28 (a)

I do not feel that this study can definitively conclude that smoking causes dementia later in life. The sample consists of 23,123 health plan members who voluntarily participated in a survey. Also, the study only covers people who were 50 to 60 years old. We do not know whether these participants were lifelong smokers or whether other confounding variables led to 25% of the group developing dementia. To make this even more complex, we do not know what other habits were common from 1978 to 1985, although we do know that there were drug epidemics during that period (e.g., the war on drugs). We also do not know how the prior history of these individuals compares with that of people who did not participate in the study.

1.28 (b) This study is even more flawed than the previous one. Parents and teachers will not know whether a student is truly deprived of sleep. They are going by proxy information, based on what they asked the children in the study; the children themselves are not the ones answering the survey.

Question 1.36

  1. What type of study? The question states that the researchers start with stratified sampling using 3 age groups: 18-30, 31-40, and 41-55. We can think of these as 3 teams. Within each team, the sample is split at random, which makes this a randomized experiment: one half is instructed to exercise twice a week and the other half is instructed not to.
  2. The treatment group is broken into three groups (18-30, 31-40, 41-55), and each of these is further broken into two groups: one asked to exercise twice a week and one asked not to.
  3. It does not appear that blocking is used in this experiment. For example, the question does not explicitly state that higher age groups are assumed to be associated with mental health issues, so we do not suspect that blocking is used here. If the experimenters knew that the risks differed across age groups, then age would be considered a blocking variable.
  4. It is not random because the second half is instructed not to exercise.
  5. It is possible that the results can be generalized to the public. However, because this is not a blind study, the results could be influenced by the instructions of the experimenter: half of the participants were told not to exercise, which could skew the results.
  6. I would have reservations because (1) the experiment is not blind, since some participants are told not to exercise, (2) it focuses on people aged 18-55, which excludes a significant portion of the population, and (3) there may be confounding variables that lead to the results.
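
The design described in item 1 (three age groups, each split at random into two halves) can be sketched in R as follows. The participant table, the group size of 20, and the seed are assumptions made for illustration; the exercise does not give them.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical participants: 20 people in each of the three age groups
participants <- data.frame(
  id = 1:60,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 20)
)

# Within each age group, randomly assign half to each arm
participants$arm <- NA_character_
for (g in unique(participants$age_group)) {
  rows <- which(participants$age_group == g)
  arms <- rep(c("exercise", "no exercise"), length.out = length(rows))
  participants$arm[rows] <- sample(arms)  # random permutation of the labels
}

# Check that every age group is split evenly between the two arms
assignment_counts <- table(participants$age_group, participants$arm)
assignment_counts
```

Randomizing inside each age group keeps the two arms balanced on age, so age cannot explain a difference between them.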

Question 1.48 Create a Box Plot

# given: 

data <- c(57,66,69,71,72,73,74,77,78,78,79,79,81,81,82,83,83,88,89,94)
data
##  [1] 57 66 69 71 72 73 74 77 78 78 79 79 81 81 82 83 83 88 89 94
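# named data_min / data_max so the base functions min() and max() are not masked
data_min <- min(data)
data_max <- max(data)
# quartiles taken as the medians of the lower and upper halves of the data
q1 <- 72.5
q3 <- 82.5
iqr <-  q3 - q1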
iqr
## [1] 10
boxplot(data)
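
As a cross-check on the hand-computed quartiles, the same numbers can be obtained with `fivenum()`, which returns Tukey's five-number summary. The sketch below also computes the 1.5 × IQR fences that a box plot uses to flag outliers. The variable names (`scores`, `lower_fence`, and so on) are my own and are not part of the exercise.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(scores)
five

# Spread between the hinges (matches q3 - q1 computed by hand)
iqr_hinges <- five[4] - five[2]

# Points beyond 1.5 * IQR from the hinges are drawn as outliers
lower_fence <- five[2] - 1.5 * iqr_hinges
upper_fence <- five[4] + 1.5 * iqr_hinges
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

Note that `quantile()` with its default settings interpolates differently and gives 72.75 and 82.25 for the quartiles, so `IQR(scores)` would not match the value of 10 used here.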

Question 1.50 Mix-and-Match. Based on the diagrams, the following histograms and boxplots match:

  1. –> (2) Reason: this boxplot is mostly centrally distributed, with some outliers on the right
  2. –> (3) Reason: there are no clear outliers in this example, and the boxplot demonstrates the distribution
  3. –> (1) Reason: the boxplot shows several high values on the right, which matches the right skew of the histogram

Question 1.56

1.56

  1. This would be right skewed because 75% of the house values would be on the left side, with the remaining 25% stretching to the right and a meaningful number of values at 6 million. The median would fall to the left of the mean, because the few very high values pull the mean upward.
  2. This would be right skewed, since 75% of the houses were under 900,000 and only a few houses would be outliers at 1,200,000. The median would fall to the left of the mean, because the few very high values pull the mean upward.
  3. The age group for most college kids is 18-21. If we assume that most kids do not drink and abide by the law, then the graph should be skewed to the right, since there will be a higher number of students aged 18, 19, and 20 who do not drink.
  4. This is similar to our baseball example in the textbook. Most of the salaries would fall on the left side, which would make it a right skewed distribution.
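
The effect of a few very large values on the mean and the median can be seen with a small made-up example. The prices below are hypothetical (in thousands of dollars) and are not the values from the exercise.

```r
# Hypothetical house prices in thousands of dollars (not from the exercise)
prices <- c(350, 400, 450, 500, 1000, 6000)

price_mean <- mean(prices)      # pulled upward by the 6,000 house
price_median <- median(prices)  # stays near the bulk of the data

c(mean = price_mean, median = price_median)
```

In a right skewed distribution the mean sits above the median, which is why the median and IQR are the better summaries for this kind of data.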

Question 1.70

1.70 (a) Based on the chart, survival does not appear to be independent of whether the patient received a heart transplant. The control group had a much higher death rate than the treatment group (88% vs. 65%). The key point is that survival time increases with treatment.

##
total_control <- subset(heartTr, transplant=="control")
total_treatment <- subset(heartTr, transplant=="treatment")


# total in control
total_control
##     id acceptyear age survived survtime prior transplant wait
## 1   15         68  53     dead        1    no    control   NA
## 2   43         70  43     dead        2    no    control   NA
## 3   61         71  52     dead        2    no    control   NA
## 4   75         72  52     dead        2    no    control   NA
## 5    6         68  54     dead        3    no    control   NA
## 6   42         70  36     dead        3    no    control   NA
## 7   54         71  47     dead        3    no    control   NA
## 9   85         73  47     dead        5    no    control   NA
## 10   2         68  51     dead        6    no    control   NA
## 11 103         67  39     dead        6    no    control   NA
## 12  12         68  53     dead        8    no    control   NA
## 13  48         71  56     dead        9    no    control   NA
## 14 102         74  40    alive       11    no    control   NA
## 15  35         70  43     dead       12    no    control   NA
## 17  31         69  54     dead       16    no    control   NA
## 20   5         68  20     dead       18    no    control   NA
## 21  77         72  41     dead       21    no    control   NA
## 22  99         73  49     dead       21    no    control   NA
## 25 101         74  49    alive       31    no    control   NA
## 26  66         72  53     dead       32    no    control   NA
## 27  29         69  50     dead       35    no    control   NA
## 28  17         68  20     dead       36    no    control   NA
## 29  19         68  59     dead       37    no    control   NA
## 32   8         68  45     dead       40    no    control   NA
## 33  44         70  42     dead       40    no    control   NA
## 36   1         67  30     dead       50    no    control   NA
## 44  62         71  39     dead       69    no    control   NA
## 51   9         68  47     dead       85    no    control   NA
## 55  32         71  41     dead      102    no    control   NA
## 59  37         71  41     dead      149    no    control   NA
## 67  27         69   8     dead      263    no    control   NA
## 73  91         73  47     dead      340    no    control   NA
## 78  82         71  29    alive      427    no    control   NA
## 99  26         69  30    alive     1400    no    control   NA
# total in treatment
total_treatment
##      id acceptyear age survived survtime prior transplant wait
## 8    38         70  41     dead        5    no  treatment    5
## 16   95         73  40     dead       16    no  treatment    2
## 18    3         68  54     dead       16    no  treatment    1
## 19   74         72  29     dead       17    no  treatment    5
## 23   20         69  55     dead       28    no  treatment    1
## 24   70         72  52     dead       30    no  treatment    5
## 30    4         68  40     dead       39    no  treatment   36
## 31  100         74  35    alive       39   yes  treatment   38
## 34   16         68  56     dead       43    no  treatment   20
## 35   45         71  36     dead       45    no  treatment    1
## 37   22         69  42     dead       51    no  treatment   12
## 38   39         70  50     dead       53    no  treatment    2
## 39   10         68  42     dead       58    no  treatment   12
## 40   35         71  52     dead       61    no  treatment   10
## 41   37         70  61     dead       66    no  treatment   19
## 42   68         72  45     dead       68    no  treatment    3
## 43   60         71  49     dead       68    no  treatment    3
## 45   28         69  53     dead       72    no  treatment   71
## 46   47         71  47     dead       72    no  treatment   21
## 47   32         69  64     dead       77    no  treatment   17
## 48   65         72  51     dead       78    no  treatment   12
## 49   83         73  53     dead       80    no  treatment   32
## 50   13         68  54     dead       81    no  treatment   17
## 52   73         72  56     dead       90    no  treatment   27
## 53   79         72  53     dead       96    no  treatment   67
## 54   36         70  48     dead      100    no  treatment   46
## 56   98         73  28    alive      109    no  treatment   96
## 57   87         73  46     dead      110    no  treatment   60
## 58   97         73  23    alive      131    no  treatment   21
## 60   11         68  47     dead      153    no  treatment   26
## 61   94         73  43     dead      165   yes  treatment    4
## 62   96         73  26    alive      180    no  treatment   13
## 63   90         73  52     dead      186   yes  treatment  160
## 64   53         71  47     dead      188    no  treatment   41
## 65   89         73  51     dead      207    no  treatment  139
## 66   24         69  51     dead      219    no  treatment   83
## 68   93         73  47    alive      265    no  treatment   28
## 69   51         71  48     dead      285    no  treatment   32
## 70   67         73  19     dead      285    no  treatment   57
## 71   16         68  49     dead      308    no  treatment   28
## 72   84         73  42     dead      334    no  treatment   37
## 74   92         73  44    alive      340    no  treatment  310
## 75   58         71  47     dead      342   yes  treatment   21
## 76   88         73  54    alive      370    no  treatment   31
## 77   86         73  48    alive      397    no  treatment    8
## 79   81         73  52    alive      445    no  treatment    6
## 80   80         72  46    alive      482   yes  treatment   26
## 81   78         72  48    alive      515    no  treatment  210
## 82   76         72  52    alive      545   yes  treatment   46
## 83   64         72  48     dead      583   yes  treatment   32
## 84   72         72  26    alive      596    no  treatment    4
## 85   71         72  47    alive      630    no  treatment   31
## 86   69         72  47    alive      670    no  treatment   10
## 87    7         68  50     dead      675    no  treatment   51
## 88   23         69  58     dead      733    no  treatment    3
## 89   63         71  32    alive      841    no  treatment   27
## 90   30         69  44     dead      852    no  treatment   16
## 91   59         71  41    alive      915    no  treatment   78
## 92   56         71  38    alive      941    no  treatment   67
## 93   50         71  45     dead      979   yes  treatment   83
## 94   46         71  48     dead      995   yes  treatment    2
## 95   21         69  43     dead     1032    no  treatment    8
## 96   49         71  36    alive     1141   yes  treatment   36
## 97   41         70  45    alive     1321   yes  treatment   58
## 98   14         68  53     dead     1386    no  treatment   37
## 100  40         70  48    alive     1407   yes  treatment   41
## 101  34         69  40    alive     1571    no  treatment   23
## 102  33         69  48    alive     1586    no  treatment   51
## 103  25         69  33    alive     1799    no  treatment   25

1.70 (b)

The box plot for the control vs. the treatment group suggests that treatment is effective in increasing survival time. This is visible in the wider IQR for the treatment group, since the box contains the middle 50% of the data. The median is also higher for the treatment group than for the control group.
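
The comparison can be reproduced without loading the package by typing in the `survtime` values from the two tables printed above. This is only a sketch: the vectors were transcribed by hand from that output, and the vector names are my own.

```r
# survtime values transcribed from the printed control table (34 patients)
control_survtime <- c(1, 2, 2, 2, 3, 3, 3, 5, 6, 6, 8, 9, 11, 12, 16, 18, 21,
                      21, 31, 32, 35, 36, 37, 40, 40, 50, 69, 85, 102, 149,
                      263, 340, 427, 1400)

# survtime values transcribed from the printed treatment table (69 patients)
treatment_survtime <- c(5, 16, 16, 17, 28, 30, 39, 39, 43, 45,
                        51, 53, 58, 61, 66, 68, 68, 72, 72, 77,
                        78, 80, 81, 90, 96, 100, 109, 110, 131, 153,
                        165, 180, 186, 188, 207, 219, 265, 285, 285, 308,
                        334, 340, 342, 370, 397, 445, 482, 515, 545, 583,
                        596, 630, 670, 675, 733, 841, 852, 915, 941, 979,
                        995, 1032, 1141, 1321, 1386, 1407, 1571, 1586, 1799)

# The line inside each box is the median, not the mean
median(control_survtime)
median(treatment_survtime)

# Side-by-side box plots of survival time by group
boxplot(list(control = control_survtime, treatment = treatment_survtime),
        ylab = "survtime")
```

With the `heartTr` data frame loaded, `boxplot(survtime ~ transplant, data = heartTr)` should draw the same picture directly.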

1.70 (c)

Using R below, the proportion of patients who died is 88% in the control group and 65% in the treatment group.

total_control <- nrow(subset(heartTr, transplant=="control"))
total_treatment <- nrow(subset(heartTr, transplant=="treatment"))
### print totals for each
total_control
## [1] 34
total_treatment
## [1] 69
total_control_dead <- nrow(subset(heartTr, transplant=="control" & survived=="dead"))
total_treatment_dead <- nrow(subset(heartTr, transplant=="treatment" & survived=="dead"))

### total_control

total_control_dead
## [1] 30
total_treatment_dead
## [1] 45
# Proportion of control who died
prop_dead_control <- total_control_dead / total_control
# Proportion of treatment who died
prop_dead_treatment <- total_treatment_dead / total_treatment

# print out prop for control who died

sprintf("%.4f", prop_dead_control)
## [1] "0.8824"
# print out prop for treatment who died
sprintf("%.4f", prop_dead_treatment)
## [1] "0.6522"
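
The same proportions can also be read off a two-way table in one step. The sketch below rebuilds the table from the counts printed above (34 control patients with 30 deaths, 69 treatment patients with 45 deaths) so that it runs without the package; the object names are my own.

```r
# Counts taken from the output above; alive = total - dead
counts <- matrix(c(4, 30,    # control:   alive, dead
                   24, 45),  # treatment: alive, dead
                 nrow = 2,
                 dimnames = list(survived = c("alive", "dead"),
                                 transplant = c("control", "treatment")))

# Column proportions: share alive / dead within each group
col_props <- prop.table(counts, margin = 2)
round(col_props, 4)
```

With the data loaded, `table(heartTr$survived, heartTr$transplant)` would be expected to give the same counts.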

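1.70 (d) (i) What are the claims being tested? The claim being tested is whether the treatment increases survival. Under the null hypothesis (H0), survival is independent of treatment and the observed difference is due to the natural randomness of a sample. Under the alternative hypothesis (HA), survival is not independent of treatment and the difference is statistically significant.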

Blanks:

  1. 28 (4 control, 24 treatment)
  2. 75 (30 control, 45 treatment)
  3. 69 treatment
  4. 34 control
  5. centered at 0
  6. a difference in proportions of about 0.2302 (23.02%)

The simulation below suggests that the difference is statistically significant. A difference of 0.2302 lies near the far right of the distribution and is very unlikely to be due to chance variation alone.
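
A randomization test along the lines of the blanks above can be sketched in R as follows. It uses only the counts from this write-up (75 dead, 28 alive, 34 control, 69 treatment). The seed, the number of repetitions, and the variable names are my own choices, not part of the exercise.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# 75 "dead" cards and 28 "alive" cards, as in the blanks above
outcome <- c(rep("dead", 75), rep("alive", 28))
n_control <- 34  # the remaining 69 cards form the treatment group

# Observed difference in death proportions (control - treatment)
observed_diff <- 30 / 34 - 45 / 69

# Shuffle the cards, deal 34 to control, and record the difference each time
sim_diffs <- replicate(10000, {
  shuffled <- sample(outcome)
  control <- shuffled[1:n_control]
  treatment <- shuffled[-(1:n_control)]
  mean(control == "dead") - mean(treatment == "dead")
})

# Share of shuffles at least as extreme as the observed difference
# (small tolerance guards against floating-point ties)
p_value <- mean(sim_diffs >= observed_diff - 1e-9)

round(observed_diff, 4)
p_value
hist(sim_diffs, main = "Simulated differences under independence")
abline(v = observed_diff, lty = 2)
```

Under independence the simulated differences are centered at 0, so a small `p_value` means a difference as large as the observed one rarely arises from shuffling alone. The exact value depends on the seed, so none is quoted here.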