ANS: Each row of the data matrix represents one case (observational unit), here a single survey participant. Taken together, the rows make up the sample.
ANS: A total of 1691 participants were included in the survey.
ANS: sex: categorical; age: numerical, discrete; marital: categorical; grossIncome: categorical, ordinal; smoke: categorical; amtWeekends: numerical, discrete; amtWeekdays: numerical, discrete.
ANS: Sample: 160 children. Population: children 5 to 15 years old.
ANS: Drawing causal conclusions from observational data is not recommended. The more cases the researchers sample, the more accurately they can estimate the effect of the explanatory variable.
ANS: Based on this study, smokers appear to be more prone to dementia, but not with certainty. The study does not report what other factors cause dementia, or the relationship between the percentage of non-smokers and dementia. It also does not say whether the smokers had other habits, such as alcohol or drug use, that might influence dementia. Based on this article, the causal conclusion that smoking causes dementia cannot be fully accepted.
ANS: The research says that kids who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders, whereas the friend says “sleep disorders lead to bullying in school children.” The statement that a leads to b is not the same as the statement that b leads to a; the study reports an association, not its direction.
ANS: This study is a randomized experiment.
(b) What are the treatment and control groups in this study?
ANS: The treatment group is the group instructed to exercise twice a week. The control group is the group instructed not to exercise.
ANS: Yes, this study makes use of blocking: participants are grouped by age before being assigned to treatment or control.
ANS: No, participants know whether they are exercising or not.
ANS: Stratified sampling can be used to generalize to the population at large. However, the size of each stratum in the population is not mentioned here.
ANS: The following additional details are needed for this proposal; the decision would be based on them.
1. Sample size
2. Duration of the study, and whether exercising twice a week is sufficient for this study
3. Impact of blinding the control group over the study period
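The blocking described in the answers above can be sketched in R. This is only an illustration under stated assumptions: the participant table, the age-group labels, and the block size of four are all hypothetical and are not taken from the study.

```r
# Hypothetical participants: 12 people in three made-up age groups (blocks).
set.seed(1)  # arbitrary seed so the sketch is reproducible
participants <- data.frame(
  id = 1:12,
  age_group = rep(c("younger", "middle", "older"), each = 4),
  stringsAsFactors = FALSE
)

# Within each age block, randomly assign half to treatment and half to control.
participants$group <- NA_character_
for (block in unique(participants$age_group)) {
  rows <- which(participants$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$group[rows] <- sample(labels)
}

# Each block should contribute equally to both arms.
assignment_table <- table(participants$age_group, participants$group)
assignment_table
```

Because the assignment is randomized separately inside each block, age cannot be confounded with the exercise treatment.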
library(ggplot2)
data <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
scores <- data.frame(scores=data,
                     type=rep("score", length(data)))
# Define the boxplot
bp <- ggplot(data=scores, aes(x=type, y=scores)) +
  geom_boxplot(colour="#6666FF") +
  labs(title="Boxplot of Statistics Scores", x="Stats Final Exam Scores", y="Score")
bp
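As a cross-check on the plot, the numbers a boxplot is built from can be computed directly. The sketch below uses the same 20 scores and assumes R's default quantile method (type 7) and the usual 1.5 × IQR rule for flagging outliers; the variable names are illustrative.

```r
# Same exam scores as in the boxplot code above.
exam_scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Five-number summary: min, Q1, median, Q3, max (default quantile type 7).
five_num <- quantile(exam_scores, probs = c(0, 0.25, 0.5, 0.75, 1))

# Interquartile range and the 1.5 * IQR fences used to flag outliers.
iqr <- IQR(exam_scores)
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr
outliers <- exam_scores[exam_scores < lower_fence | exam_scores > upper_fence]

five_num
outliers
```

Under these assumptions the lowest score, 57, falls below the lower fence, so it is the one point expected to be drawn separately from the whisker.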
ANS: (a) The distribution is mostly symmetric and matches boxplot (2). (b) The distribution is fairly evenly spread and multimodal; it matches boxplot (3). (c) The distribution is unimodal and matches boxplot (1).
ANS: (a) Q1: 25% of houses cost below $350,000; Q2 (median): 50% below $450,000; Q3: 75% below $1,000,000; a meaningful number of houses cost more than $6,000,000. Distribution: right skewed, due to the significant number of houses that cost more than $1M. The mean could be more than the median, and could lie in the third quartile, because a meaningful number of houses cost more than $6M. The IQR would better represent the variability of the observations.
Q1: 25% of houses cost below $300,000; Q2 (median): 50% below $600,000; Q3: 75% below $900,000; very few houses cost more than $1,200,000. Distribution: roughly symmetric, given the proportion of houses in each of the quartiles, though slightly right skewed due to the few houses costing more than $1.2M. The IQR seems the better representation of variability given the distribution of the data.
Number of alcoholic drinks
Distribution: Right skewed, because students under 21 consume few drinks while a few students drink excessively. The IQR seems the better representation of variability because it would not be influenced by the few excessive drinkers.
Distribution: Right skewed, due to the large number of lower salaries and a few very high salaries on the right.
The mean would be influenced by the very high salaries and would show a higher average salary than the median.
The IQR is best for variability. The standard deviation would be influenced by the few high salaries.
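A small numeric illustration of this point, using made-up salary figures (in thousands of dollars) that are not from the exercise: most values are modest and two are very high.

```r
# Hypothetical salaries in thousands of dollars: mostly low, two very high.
salaries <- c(30, 32, 35, 36, 38, 40, 42, 45, 250, 400)

# Measures of center: the mean is pulled toward the high salaries.
salary_mean <- mean(salaries)
salary_median <- median(salaries)

# Measures of spread: the standard deviation is inflated by the same values,
# while the IQR only depends on the middle half of the data.
salary_sd <- sd(salaries)
salary_iqr <- IQR(salaries)

c(mean = salary_mean, median = salary_median, sd = salary_sd, iqr = salary_iqr)
```

In this sketch the mean sits far above the median and the standard deviation far above the IQR, which is why the median and IQR are the safer summaries for a right-skewed distribution.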
The mosaic plot shows a higher survival rate in the treatment group than in the control group. The treatment group's column is wider, which indicates that there were more subjects in this group. The mosaic plot alone is not enough to claim that survival is independent of or dependent on treatment, but it appears that the survival rate improves with, and is not entirely independent of, the transplant treatment.
The boxplots show that the control group has a short survival time (much less than 250 days), with some outliers that have extended survival times, while the treatment group has a much larger interquartile range, with a median close to 250 days and a third quartile above 500 days.
Based on this interpretation of the boxplots, the transplant appears to be effective in extending the survival time of the subjects significantly.
The following R code shows loading the Heart Transplant data and computing the proportions for each group:
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
# Load the data from our copy downloaded from course site.
heart <- read.table("https://raw.githubusercontent.com/jbryer/DATA606Fall2016/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/heartTr.csv", sep="," , stringsAsFactors=FALSE, header=TRUE)
# Using dplyr to group and count.
agg <- tally(group_by(heart, transplant, survived))
agg
## Source: local data frame [4 x 3]
## Groups: transplant [?]
##
## transplant survived n
## <chr> <chr> <int>
## 1 control alive 4
## 2 control dead 30
## 3 treatment alive 24
## 4 treatment dead 45
# Compute totals and died for each group
totalControl <- sum(agg[agg$transplant == "control",]$n)
totalControl
## [1] 34
diedControl <- sum(agg[agg$transplant == "control" &
                       agg$survived == "dead",]$n)
diedControl
## [1] 30
totalTreatment <- sum(agg[agg$transplant == "treatment",]$n)
totalTreatment
## [1] 69
diedTreatment <- sum(agg[agg$transplant == "treatment" &
                         agg$survived == "dead",]$n)
diedTreatment
## [1] 45
# Control proportion died
ratioControlDied <- diedControl / totalControl
ratioControlDied
## [1] 0.8823529
# Treatment proportion died
ratioTreatmentDied <- diedTreatment / totalTreatment
ratioTreatmentDied
## [1] 0.6521739
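The same proportions can be reproduced without downloading the data, starting from the four counts printed in the output above. This is a minimal base R sketch; the object names are illustrative and the counts are typed in by hand.

```r
# Counts copied from the tally output above (rows: group, columns: outcome).
counts <- matrix(
  c(4, 30,
    24, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(transplant = c("control", "treatment"),
                  survived = c("alive", "dead"))
)

# Row-wise proportions: share alive and share dead within each group.
props <- prop.table(counts, margin = 1)

# Observed difference in the proportion who died (treatment - control).
observed_diff <- props["treatment", "dead"] - props["control", "dead"]

props
observed_diff
```

The difference is negative because a smaller share of the treatment group died (about 0.65) than of the control group (about 0.88).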
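What are the claims being tested? The null hypothesis is that survival is independent of whether the patient received a transplant. The alternative claim being tested is that a heart transplant increased lifespan.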
The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are -0.2302 or less. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
iii. What do the simulation results shown below suggest about the effectiveness of the transplant program? The transplant program is effective.
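The card-shuffling procedure described above can be sketched in R. This is a minimal sketch rather than the simulation the exercise itself shows: the seed, the function name, and the variable names are assumptions, and the counts are the ones computed earlier (28 alive, 75 dead, 69 treatment, 34 control).

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# One card per patient: 28 alive and 75 dead.
cards <- c(rep("alive", 28), rep("dead", 75))
n_treatment <- 69
n_control <- 34

# Observed difference in proportion dead (treatment - control).
observed_diff <- 45 / 69 - 30 / 34

# Shuffle the cards, deal them into two groups, and return the
# difference in the proportion of dead cards (treatment - control).
simulate_diff <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:n_treatment]
  control <- shuffled[(n_treatment + 1):(n_treatment + n_control)]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat 100 times, as in the exercise, to build the null distribution.
sim_diffs <- replicate(100, simulate_diff())

# Fraction of simulated differences at or below the observed difference.
p_value <- mean(sim_diffs <= observed_diff)
p_value
```

A small fraction here would mean that a difference as extreme as the observed one rarely arises from shuffling alone, which is the basis for rejecting the null hypothesis. With only 100 repetitions the estimate is coarse, so more repetitions would give a steadier value.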