library('DATA606') # Load the package
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
vignette(package='DATA606') # Lists vignettes in the DATA606 package
## no vignettes found
vignette('os3') # Loads a PDF of the OpenIntro Statistics book
## Warning: vignette 'os3' not found
data(package='DATA606') # Lists data available in the package
data(iris)
1. The population of interest here is children between the ages of 5 and 15, and the sample is 160 children.
2. The results of the study could indicate how children of different ages and genders react to specific instructions, and could show that certain categories of children are reward-driven and do not care about honesty.
a) I don't think the study has enough data points to conclude that smoking leads to dementia. The study would be more conclusive if it compared a large group of smokers with an equally large group of non-smokers, all other things being equal. Alternatively, the starting dataset could have been a larger sample of people with dementia, broken down by possible causes.
b) I don't think the study is conclusive, or that it has the right data set to determine whether lack of sleep is related to bullying. In this case lack of sleep could very well be a result of other social or medical factors, which may also be the cause of bullying.
a) Experiment
b) The treatment group is the group of subjects that will exercise twice a week. The control group is the set of subjects that will not exercise.
c) Yes. In this case age groups are blocks
d) No, the study does not make use of blinding.
e) Yes, I think this study can help determine a causal relationship between exercise and mental health, though for a tighter relationship many other variables that could potentially impact mental health would have to be held equal.
The conclusions can be generalized for larger population as the subjects are randomly chosen.
f) Yes, my reservation would be due to the fact that the study doesn't indicate how other factors affecting mental health are being ruled out. These could mislead the results.
escore <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(escore)
summary(escore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
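As a cross-check on the plot, the sketch below recomputes the quantities a box plot is built from. It relies on R's default conventions: `boxplot()` draws the box from Tukey's hinges (`fivenum()`), which can differ slightly from the quartiles printed by `summary()`, and it flags points lying more than 1.5 times the hinge spread beyond the box. The variable names other than `escore` are illustrative.

```r
escore <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(escore)

# Spread between the hinges and the 1.5 * spread fences
spread      <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * spread
upper_fence <- five[4] + 1.5 * spread

# Scores outside the fences are the ones drawn as individual points
outliers <- escore[escore < lower_fence | escore > upper_fence]

# boxplot.stats() applies the same rule and should agree
stats_out <- boxplot.stats(escore)$out

print(five)
print(outliers)
```

Comparing `five` with the `summary()` output shows why the box edges in the plot need not match the printed 1st and 3rd quartiles exactly.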
a) The histogram matches box plot (2): the median is around 60 and the middle 50% of the data (between the first and third quartiles) lies between approximately 58 and 68. The distribution is symmetric.
b) The histogram matches box plot (3). The data are spread roughly evenly between 1 and 100, with similar Y-axis values.
c) The histogram matches box plot (1). The histogram is concentrated on the left, which indicates that the lower range is the most common and that the frequency then drops quickly.
Right skewed, as 75% of the houses cost below 450,000 and the remaining 25% are spread out between 450,000 and over 6,000,000. The center is best described by the median, and the variability by the IQR.
Symmetric. Variability best described by SD
Right skewed as 0 is the starting and most likely point. Variability best described by IQR
Symmetric. Variability described by SD
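To illustrate why the median and IQR are preferred for skewed data, here is a small sketch on simulated values. The data are hypothetical (log-normal draws, not the house prices from the exercise), and the seed and parameters are arbitrary. The point is only that a long right tail pulls the mean and SD much more than the median and IQR.

```r
set.seed(606)  # arbitrary seed, for reproducibility only

# Hypothetical right-skewed sample (NOT the data from the exercise)
prices <- rlnorm(1000, meanlog = 12.5, sdlog = 0.8)

# Same sample with one extreme value appended, mimicking a very expensive house
prices_tail <- c(prices, 6e6)

# Collect center and spread measures for one vector
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

compare <- rbind(original = describe(prices),
                 with_tail = describe(prices_tail))
print(round(compare))

# Relative change in each measure caused by the single extreme value
rel_change <- compare["with_tail", ] / compare["original", ] - 1
print(round(rel_change, 4))
```

For a symmetric distribution without extreme values the mean and median roughly coincide, which is why the SD is a reasonable choice in the symmetric cases.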
Survival is not independent of whether or not a patient got a transplant. The plot shows that the survival rate is higher for the treatment patients, who got a transplant.
The box plot shows that while the survival time for the treatment group goes over 1000 days, 50% of the patients live only up to approximately 200 days.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
data(heartTr)
names(heartTr)
## [1] "id" "acceptyear" "age" "survived" "survtime"
## [6] "prior" "transplant" "wait"
#heartTr[10,(heartTr$transplant != "dead")]
# Proportion of control patients who died
control_per <- mean(heartTr$survived[heartTr$transplant == "control"] == "dead")
## 88.2 percent of control patients died
# Proportion of treatment patients who died
treat_per <- mean(heartTr$survived[heartTr$transplant == "treatment"] == "dead")
table(heartTr$transplant)
##
## control treatment
## 34 69
death_rate_dif <- treat_per - control_per
control_per
## [1] 0.8823529
treat_per
## [1] 0.6521739
death_rate_dif
## [1] -0.230179
## 65.2% of treatment patients died
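The same proportions can also be read off a two-way table. The sketch below rebuilds that table from the counts implied by the output above rather than loading `heartTr`: 34 control patients of whom 30 died, and 69 treatment patients of whom 45 died. The death counts are back-calculated from the printed proportions, so treat them as an assumption.

```r
# Counts back-calculated from the printed output (assumption):
# 0.8823529 * 34 = 30 control deaths, 0.6521739 * 69 = 45 treatment deaths
counts <- matrix(c(4, 30,
                   24, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(transplant = c("control", "treatment"),
                                 survived   = c("alive", "dead")))

# Row proportions: share of each group that was alive or dead
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# Difference in death rates (treatment - control)
death_rate_dif <- row_props["treatment", "dead"] - row_props["control", "dead"]
print(death_rate_dif)
```

With the data loaded, `prop.table(table(heartTr$transplant, heartTr$survived), margin = 1)` should give the same row proportions in one step.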
We write alive on index cards representing patients who were alive at the end of the study, and dead on index cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are -0.23 or lower, that is, at least as extreme as the observed difference. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
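The simulation results show that most of the simulated differences are close to zero, so the observed difference of -0.23 would be unusual if survival were independent of transplant. This is evidence against the independence model.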
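The card-shuffling procedure can be sketched in R as follows. This is a minimal illustration, not the simulation behind the textbook figure. It uses the group sizes from the `table()` output (34 control, 69 treatment) and death counts back-calculated from the printed proportions (30 and 45), which are an assumption. The seed, the helper `simulate_dif()`, and the other variable names are hypothetical.

```r
set.seed(606)  # arbitrary seed, for reproducibility only

n_control   <- 34
n_treatment <- 69

# One card per patient: 75 dead (30 + 45) and 28 alive (4 + 24)
cards <- c(rep("dead", 75), rep("alive", 28))

# Observed difference in death proportions (treatment - control)
observed_dif <- 45 / n_treatment - 30 / n_control

# Shuffle the cards, deal them into two groups, return the difference
simulate_dif <- function() {
  shuffled  <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

n_sims   <- 100
sim_difs <- replicate(n_sims, simulate_dif())

# Fraction of simulated differences at least as extreme as the observed one
# (one-sided: at or below the observed negative difference; the small
# tolerance guards against floating-point ties)
p_value <- mean(sim_difs <= observed_dif + 1e-12)

print(summary(sim_difs))
print(p_value)
hist(sim_difs, main = "Simulated differences under independence",
     xlab = "treatment - control")
abline(v = observed_dif, lty = 2)  # mark the observed difference
```

With only 100 shuffles the estimated fraction is coarse. Raising `n_sims` gives a smoother distribution and a more stable estimate at little cost.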