Habib U Khan

Data 606 - Homework 1

Problem 1.8

  1. The number of rows is 1691.
  2. 1691
  3. sex (categorical - nominal), age (numerical - continuous), marital (categorical - nominal), grossincome (categorical - ordinal), smoke (categorical - nominal), amtWeekends (numerical - discrete), amtWeekdays (numerical - discrete)
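
One way to check answers like these in R is to look at the number of rows and the class of each column. The sketch below uses a tiny made-up data frame; the name `survey` and all of its values are hypothetical stand-ins for the real data set, and income is stored as an ordered factor only to illustrate an ordinal variable.

```r
# Hypothetical miniature of the survey data; all values are invented.
income_levels <- c("Under 2,600", "10,400 to 15,600", "Above 36,400")
survey <- data.frame(
  sex = factor(c("Female", "Male", "Male")),
  age = c(42, 44, 53),
  grossincome = factor(c("Under 2,600", "Above 36,400", "10,400 to 15,600"),
                       levels = income_levels, ordered = TRUE),
  smoke = factor(c("No", "Yes", "No")),
  amtWeekends = c(NA, 12, NA)
)

nrow(survey)                                  # one row per participant
sapply(survey, function(col) class(col)[1])   # storage class of each variable
is.ordered(survey$grossincome)                # ordinal: income bands are ordered
is.ordered(survey$smoke)                      # nominal: no ordering
```

On the real data, `nrow()` would return the number of participants.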

Problem 1.10

  1. The population of interest is children between the ages of 5 and 15, and the sample size is 160.
  2. The sample has only 160 children, so the results cannot be generalized to the entire population; a larger sample is needed. The findings of the study cannot be used to establish causal relationships.

Problem 1.28

  1. Since this is an observational study, it cannot be concluded that smoking causes dementia later in life. An observational study can show an association but cannot tell whether one variable causes the other. Also, participants should be sampled randomly for unbiased results; in this case they volunteered, so the sample may be biased.
  2. My friend is assuming that sleep disorders lead to bullying in school children, but the study is not an experiment, so it cannot establish whether sleep disorders cause bullying in school children.

Problem 1.36

  1. Prospective study
  2. Treatment group: the exercise group. Control group: the ‘No Exercise’ group.
  3. The study blocks on age, with groups of 18 - 30, 31 - 40, and 41 - 55 years
  4. No blinding has been used in this study
  5. Since the study takes random samples within strata, it has the potential to establish a causal relationship between exercise and mental health. The question does not state the sample size, but if a large sample is taken from each group, the results can be generalized to the entire population.
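
The blocked design in this problem can be sketched in R as follows. This is only an illustration under stated assumptions: the `participants` data frame, the block size of 10, and the seed are all hypothetical, and the point is simply that treatment labels are shuffled separately within each age group.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Hypothetical participants: 10 people in each age block.
participants <- data.frame(
  id = 1:30,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 10)
)

# Within each age block, shuffle an equal number of the two arm labels (1 and 2).
arm <- ave(
  participants$id, participants$age_group,
  FUN = function(ids) sample(rep(c(1, 2), length.out = length(ids)))
)
participants$group <- c("Exercise", "No exercise")[arm]

# Each block should be split evenly between the two arms.
assignment_table <- table(participants$age_group, participants$group)
assignment_table
```

Because the shuffle happens inside each block, age is balanced across the two arms by construction.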

Problem 1.48

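# Final exam scores for the 20 students, sorted in ascending order
statsscores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
# Box plot of the scores; ylab labels the vertical axis
boxplot(statsscores, main = "Final Exam Scores", ylab = "Score")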

summary(statsscores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
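
To see which scores a box plot would flag as outliers, the 1.5 x IQR rule can be applied by hand. This is a minimal sketch using the quartiles reported by `summary()`. Note that `boxplot()` itself uses hinges from `fivenum()`, which can differ slightly from these quartiles, although both versions of the rule flag the same score here.

```r
statsscores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# First and third quartiles (the same values summary() reports).
q <- unname(quantile(statsscores, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: points beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- statsscores[statsscores < lower_fence | statsscores > upper_fence]
outliers  # only the lowest score, 57, falls outside the fences
```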

Problem 1.50

  1. Symmetric, unimodal -> Plot 2
  2. Uniform -> Plot 3
  3. Right skewed -> Plot 1

Problem 1.56

  1. The data set is right skewed, with a meaningful number of houses costing more than 6 million dollars, so the median and IQR would be preferred.
  2. Here the distribution seems symmetric, with the data spread evenly around the center in a bell-shaped curve. The mean and standard deviation work better here.
  3. As I understand the question, the data is right skewed, so the median and IQR are preferred here.
  4. Only a few people (outliers) have salaries much higher than the other employees, so the data is right skewed and the median and IQR would be better choices.
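
A small numeric sketch shows why the median and IQR are preferred for right-skewed data. The salaries below are made up for illustration (in thousands of dollars) and are not taken from the exercise.

```r
# Hypothetical salaries: most are similar, but two executives earn far more.
salaries <- c(40, 42, 45, 47, 50, 52, 55, 58, 300, 450)

mean(salaries)    # pulled upward by the two large values
median(salaries)  # stays near the typical salary
sd(salaries)      # inflated by the outliers
IQR(salaries)     # based on the middle 50%, so far less affected
```

The mean and standard deviation respond strongly to the two extreme salaries, while the median and IQR describe the typical employee.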

Problem 1.70

  1. Based on the mosaic plot, a meaningful number of patients survived in both groups, whether or not they received a transplant, so some other variable(s) may also affect patient survival.
  2. Although a notable number of patients died in both groups, the boxplot indicates that patients who received a transplant had longer survival times, which suggests that the heart transplant is effective.
  3. Treatment group: 45 of 69 died, 0.65 or about 65 percent.
     Control group: 30 of 34 died, 0.88 or about 88 percent.
  4. i - H0: Survival of patients and whether they have had a transplant are independent.
         Ha: Survival of patients and whether they have had a transplant are not independent.
     ii - We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than or equal to 45/69 - 30/34 = 0.65 - 0.88 = -0.23. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
     iii - The simulation results suggest that the transplant program is effective: only 2 simulations produced a difference at least as extreme as the observed 23 percentage points, so such a result rarely happens by chance. We therefore reject H0 in favor of Ha and conclude that transplant and survival of patients are not independent.
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
data(heartTr)
mosaicplot(table(heartTr$transplant, heartTr$survived))
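
The card-shuffling procedure described in part 4 can be sketched in base R. This is a minimal illustration, not the textbook's own simulation: the seed, the helper name `simulate_once`, and the variable names are assumptions, and the resulting fraction will vary from run to run without a fixed seed. Because it tracks the proportion of dead cards, the observed difference (treatment - control) is negative, about -0.23.

```r
set.seed(606)  # arbitrary seed, chosen only for reproducibility

# One "card" per patient: 28 alive and 75 dead.
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in proportion dead (treatment - control).
observed <- 45 / 69 - 30 / 34

# Shuffle the cards, deal 69 to treatment and 34 to control,
# and return the difference in the proportion of dead cards.
simulate_once <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:69]
  control <- shuffled[70:103]
  mean(treatment == "dead") - mean(control == "dead")
}

n_sims <- 100  # the exercise uses 100 repetitions
sim_diffs <- replicate(n_sims, simulate_once())

# Fraction of shuffles with a difference at or below the observed one.
p_value <- mean(sim_diffs <= observed)
p_value
```

A small fraction means a difference this large would rarely arise from shuffling alone, which is evidence against independence. Raising `n_sims` gives a more stable estimate.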