library('DATA606')          # Load the package
## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
## 
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
## 
##     demo
vignette(package='DATA606') # Lists vignettes in the DATA606 package
## no vignettes found
vignette('os3')             # Loads a PDF of the OpenIntro Statistics book
## Warning: vignette 'os3' not found
data(package='DATA606')     # Lists data available in the package
data(iris)

Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70

1.8 Smoking habits of UK residents

Answer:
  1. Each row represents a participant (a UK resident) in the survey
  2. 1691 participants are included in the survey
  3. sex is nominal categorical,
    age is discrete numerical,
    marital is nominal categorical,
    grossincome is ordinal categorical,
    smoke is nominal categorical,
    amtweekends is discrete numerical,
    amtweekdays is discrete numerical
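
As a rough illustration of how these variable types could be represented in R, here is a minimal sketch. The data frame `smoking_toy`, its rows, and the income labels are made up for illustration and are not taken from the survey data.

```r
# Hypothetical rows; the values and income labels are illustrative only.
income_levels <- c("Under 2,600", "2,600 to 5,200", "Above 36,400")

smoking_toy <- data.frame(
  sex         = factor(c("Female", "Male", "Male")),        # nominal categorical
  age         = c(42L, 53L, 40L),                           # discrete numerical
  grossincome = factor(c("Under 2,600", "2,600 to 5,200", "Above 36,400"),
                       levels = income_levels,
                       ordered = TRUE),                     # ordinal categorical
  amtweekends = c(12L, NA, 6L)                              # discrete numerical
)

# First class of each column: unordered factor, integer, or ordered factor
sapply(smoking_toy, function(col) class(col)[1])

# Only an ordered factor supports comparisons such as "<"
smoking_toy$grossincome[1] < smoking_toy$grossincome[3]
```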

1.10 Cheaters, scope of inference

Answer:
1. The population of interest is children between the ages of 5 and 15, and the sample is 160 children.

2. The results of the study could indicate how children of different ages and genders react to specific instructions, and could show that some categories of children are reward driven and do not care about honesty.

1.28 Reading the paper.

Answer:
a) I don't think the study has enough data points to conclude that smoking leads to dementia. The study would be more conclusive if it were conducted on a large group of smokers versus an equally large group of non-smokers, with all other things being equal. Alternatively, the starting dataset could have been a larger sample of candidates with dementia, broken down by possible causes.

b) I don't think the study is conclusive, or that it has the right data set to determine whether lack of sleep has any relation to bullying. In this case lack of sleep could very well be a result of other social or medical factors, which may also be the cause of the bullying.

1.36 Exercise and mental health.

a) Experiment
b) The treatment group is the group of subjects that will exercise twice a week. The control group is the set of subjects that will not exercise.
c) Yes. In this case the age groups are the blocks.
d) No, the study does not make use of blinding.
e) Yes, I think this study can help determine a causal relationship between exercise and mental health, though for a tighter relationship many other variables that could potentially impact mental health would have to be held equal.
The conclusions can be generalized to the larger population, as the subjects are randomly chosen.
f) Yes, my reservation is that the study doesn't indicate how other factors affecting mental health are being ruled out. These could distort the results.
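
The blocking idea in (c) can be sketched as follows. This is only an illustration under stated assumptions: the 12 subjects, the `age_group` labels, and the seed are hypothetical and are not taken from the exercise.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical subjects: 4 per age block (labels are made up for illustration)
subjects <- data.frame(
  id        = 1:12,
  age_group = rep(c("young", "middle", "older"), each = 4),
  stringsAsFactors = FALSE
)

# Randomize within each block so that every age group is split evenly
subjects$group <- NA_character_
for (block in unique(subjects$age_group)) {
  rows <- which(subjects$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  subjects$group[rows] <- sample(labels)  # random order within the block
}

# Each age group should contribute equally to treatment and control
table(subjects$age_group, subjects$group)
```

Randomizing inside each block, rather than across all subjects at once, keeps age balanced between the treatment and control groups.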

1.48 Stats scores.

Answer:
escore <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(escore)

summary(escore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
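
To check whether the box plot should flag any score as an outlier, the usual 1.5 x IQR rule can be applied to the quartiles reported by `summary()`. This is a minimal sketch; note that `boxplot()` itself computes its whiskers from hinges, so its fences can differ slightly from the quartile-based ones below.

```r
escore <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

q   <- quantile(escore, c(0.25, 0.75))  # Q1 = 72.75, Q3 = 82.25
iqr <- IQR(escore)                      # 9.5

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[[1]] - 1.5 * iqr       # 58.5
upper_fence <- q[[2]] + 1.5 * iqr       # 96.5

# Scores outside the fences; here only the minimum score, 57
outliers <- escore[escore < lower_fence | escore > upper_fence]
outliers
```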

1.50 Mix-and-match

Answer:
a) histogram matches with the boxplot (2), as the median is around 60 and the middle 50% of the data (2nd and 3rd quartiles) lies approximately between 58 and 68. The distribution is symmetric.

b) histogram matches with the boxplot (3). The data are spread roughly evenly between 1 and 100, with similar Y-axis values.

c) histogram matches with (1). The histogram is concentrated on the left, which indicates that the lower range is the most common and that the frequency then drops quickly.

1.56 Distributions and appropriate statistics, Part II

Answer:
  1. Right skewed as 75% of the houses cost below 450,000 and the remaining 25% are spread out between 450,000 and over 6,000,000. The center is better described by the median, and therefore the IQR is best suited for describing variability.

  2. Symmetric. Variability best described by SD

  3. Right skewed as 0 is the starting and most likely point. Variability best described by IQR

  4. Symmetric. Variability described by SD
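
The reasoning behind choosing the median and IQR for skewed data can be illustrated with a small sketch. The `prices` vector below is hypothetical (in thousands) and is not the data from the exercise; it only mimics a right-skewed distribution with one very expensive house.

```r
# Hypothetical house prices in thousands, with one extreme value
prices <- c(200, 300, 350, 400, 450, 500, 1000, 6000)

mean(prices)    # 1150, pulled upward by the 6000
median(prices)  # 425, close to the typical house

sd(prices)      # inflated by the extreme value
IQR(prices)     # describes the spread of the middle 50% only

# Dropping the extreme value moves the mean far more than the median
prices_trimmed <- prices[prices < 6000]
mean(prices) - mean(prices_trimmed)
median(prices) - median(prices_trimmed)
```

For the symmetric cases the mean and median are close to each other, so the mean and SD remain reasonable summaries.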

1.70 Heart transplants.

Answer:
  1. Survival is not independent of whether or not a patient got a transplant. The plot shows that the survival rate is higher for the treatment patients, who got a transplant.

  2. The box plot shows that while the survival time for the treatment group goes over 1000 days, 50% of the patients live only up to approximately 200 days.

library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
data(heartTr)
names(heartTr)
## [1] "id"         "acceptyear" "age"        "survived"   "survtime"  
## [6] "prior"      "transplant" "wait"
#heartTr[10,(heartTr$transplant != "dead")]

# Proportion of control patients who died
control_per <- mean(heartTr$survived[heartTr$transplant == "control"] == "dead")

## 88.2 percent of control patients died

# Proportion of treatment patients who died
treat_per <- mean(heartTr$survived[heartTr$transplant == "treatment"] == "dead")

table(heartTr$transplant)
## 
##   control treatment 
##        34        69
death_rate_dif <- treat_per - control_per 

control_per
## [1] 0.8823529
treat_per
## [1] 0.6521739
death_rate_dif
## [1] -0.230179
## 65.2% of treatment patients died
    1. The claim being tested via randomization is the null hypothesis that the treatment has no effect on lifespan, against the alternative that the transplant increases lifespan.
    2. We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are -0.23 or lower. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

    3. The simulated result shows that most of the differences are close to zero, so a difference as extreme as the observed -0.23 would be unusual under the independence model. This is evidence against the independence model.
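
The card-shuffling procedure described above can be sketched in R. The counts are derived from the proportions and group sizes computed earlier (30 of 34 control patients and 45 of 69 treatment patients died, so 75 dead and 28 alive in total). The seed is an arbitrary choice, and the resulting fraction will vary from run to run with only 100 shuffles.

```r
set.seed(606)  # arbitrary seed, for reproducibility only

# One "card" per patient: 28 alive and 75 dead
cards   <- c(rep("alive", 28), rep("dead", 75))
n_treat <- 69                      # treatment group size; the other 34 are control

# Observed difference in death proportions (treatment - control)
obs_diff <- 45 / 69 - 30 / 34      # about -0.23

# Shuffle the cards 100 times and recompute the difference each time
sim_diffs <- replicate(100, {
  shuffled <- sample(cards)
  treat    <- shuffled[1:n_treat]
  control  <- shuffled[(n_treat + 1):length(shuffled)]
  mean(treat == "dead") - mean(control == "dead")
})

# Fraction of simulated differences at least as extreme as the observed one
# (a tiny tolerance guards against floating-point ties)
p_value <- mean(sim_diffs <= obs_diff + 1e-12)
p_value

hist(sim_diffs, main = "Simulated differences (treatment - control)")
abline(v = obs_diff, lty = 2)      # mark the observed difference
```

A small `p_value` would mean that a difference of -0.23 or lower rarely arises when outcomes are assigned to groups purely by chance.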