This is Homework 1 for Data 606, Spring 2019.
The assigned problems to be graded are: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70
Each problem will be discussed in a separate section of this report.
Each row of the data displayed in the textbook on p57 for problem 1.8 represents an observation. Each observation is a response to the survey from one person (who is a UK resident).
There are 1691 participants in the survey.
The R code below shows the number of participants and the structure of the data, including the first few values of each variable.
nrow( openintro::smoking )
## [1] 1691
str(openintro::smoking)
## 'data.frame': 1691 obs. of 12 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ maritalStatus : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ highestQualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
## $ nationality : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
## $ ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ grossIncome : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
## $ region : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ amtWeekends : int NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amtWeekdays : int NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : Factor w/ 5 levels "","Both/Mainly Hand-Rolled",..: 1 5 1 1 1 1 5 1 4 5 ...
| Variable Name | Type | SubType |
|---|---|---|
| gender | categorical | nominal |
| age | numerical | discrete |
| maritalStatus | categorical | nominal |
| highestQualification | categorical | ordinal |
| nationality | categorical | nominal |
| ethnicity | categorical | nominal |
| grossIncome | categorical | ordinal |
| region | categorical | nominal |
| smoke | categorical | nominal |
| amtWeekends | numerical | discrete |
| amtWeekdays | numerical | discrete |
| type | categorical | nominal |
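The top-level type in the table can be read off each column's R class. The sketch below is a minimal illustration, assuming a small data frame re-entered by hand from the first five values shown in the `str()` output; the names `smoking_head` and `var_type` are hypothetical.

```r
# First five values of four columns, re-entered by hand from the str() output above
smoking_head <- data.frame(
  gender = factor(c("Male", "Female", "Male", "Female", "Female")),
  age = c(38L, 42L, 40L, 40L, 39L),
  smoke = factor(c("No", "Yes", "No", "No", "No")),
  amtWeekends = c(NA, 12L, NA, NA, NA)
)

# Factors are treated as categorical, integer/double columns as numerical
var_type <- sapply(smoking_head, function(col) {
  if (is.factor(col)) "categorical" else if (is.numeric(col)) "numerical" else "other"
})
print(var_type)
```

The subtype still needs judgment: the class alone does not say whether a factor such as grossIncome is nominal or ordinal, since it is stored as an unordered factor.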
I read the original study by Alessandro Bucciol and Marco Piovesan, “Luck or cheating? A field experiment on honesty with children”. The paper is available to download from SSRN (https://dx.doi.org/10.2139/ssrn.1307694).
The population of interest is all children between the ages of 5 and 15. According to the textbook, the sample consists of 160 children observed at a summer camp. The paper itself states that the sample consists of 182 children.
The study was conducted in a summer camp in Padua, Italy.
The study’s conclusion cannot be generalized to ALL children in the world. There are 3 potential problems related to the sample selection.
A reasonable causal inference from the study is that telling kids to be honest can affect their behavior. The study involves a controlled experiment, and the conclusions are valid within the sample only. To establish validity for the broader population, follow-up studies that control for countries, cultures, and socio-economic effects would be needed.
Based on the study cited in the New York Times, we cannot conclude smoking causes dementia. The reason is that the study does not control for confounding variables.
For example, a confounding variable could be a genetic defect which causes the tendency to smoke AND Alzheimer’s disease. This genetic factor could also have a quantitative dimension. More defects in the chromosome near the vulnerable area could increase both smoking and dementia risk. If this assumption were true, then smoking does not cause dementia but is only a correlated outcome. The study does not rule out this assumption.
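The confounding argument can be illustrated with a small simulation. The sketch below is purely hypothetical: the variable `gene`, the probabilities, and the sample size are illustrative assumptions, not values from the study. In the simulated data smoking has no effect on dementia, yet the two are associated in the overall sample.

```r
# Hypothetical simulation: a latent genetic factor raises the chance of both
# smoking and dementia; smoking itself has no effect on dementia.
set.seed(606)
n <- 100000
gene <- rbinom(n, 1, 0.2)                               # assumed: 20% carry the factor
smoke <- rbinom(n, 1, ifelse(gene == 1, 0.6, 0.2))      # carriers smoke more often
dementia <- rbinom(n, 1, ifelse(gene == 1, 0.3, 0.05))  # depends on gene only

# Dementia rate by smoking status in the whole sample
rate_overall <- tapply(dementia, smoke, mean)
# Dementia rate by smoking status within each gene group
rate_by_gene <- tapply(dementia, list(gene = gene, smoke = smoke), mean)

print(round(rate_overall, 3))
print(round(rate_by_gene, 3))
```

Smokers show a clearly higher dementia rate overall, but within each gene group the rates for smokers and non-smokers are nearly the same, which is the pattern a pure confounder would produce.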
We cannot justify the statement that sleep disorders cause bullying.
A confounding variable could cause both. One possible confounding variable is abusive parental behavior. We cannot even conclude that students who display disruptive behavior have a significant risk of sleep disorders, because the article does not state the frequency of sleep disorders, even though it states that one third of the students in the sample have bullying issues.
Suppose the typical frequency of sleep disorders is 1 percent among all students.
If the rate among bullies is double that, it is 2 percent amongst bullies.
So even though a third of the students are bullies, only about 2 percent of those bullies (fewer than 1 in every 33) would have sleep disorders.
In short, bullies may have a slightly elevated risk of sleep disorders.
The study is an experiment since a treatment is assigned to the sample.
The treatment group is the subsample assigned to exercise twice a week. The group spans all three age blocks: 18-30, 31-40, and 41-55 years of age.
This study uses blocking to group the sample by age bracket. The blocking variable is age.
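Blocked random assignment of this kind can be sketched in a few lines of R. The subject IDs, the block sizes, and the helper name `assign_within_block` are hypothetical; only the three age brackets come from the study description.

```r
# Hypothetical roster: 12 subjects in each of the three age brackets
set.seed(1)
subjects <- data.frame(
  id = 1:36,
  age_block = rep(c("18-30", "31-40", "41-55"), each = 12)
)

# Randomly assign half of a block to treatment and half to control
assign_within_block <- function(n) {
  sample(rep(c("treatment", "control"), length.out = n))
}

# ave() applies the function separately within each age block
subjects$group <- ave(
  rep("", nrow(subjects)), subjects$age_block,
  FUN = function(x) assign_within_block(length(x))
)

# Each block contributes equally to both groups
print(table(subjects$age_block, subjects$group))
```

Randomizing inside each block guarantees that age is balanced across the treatment and control groups, which is the purpose of blocking.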
This study does not make use of blinding because the subjects know if they are getting the treatment or not.
The results of this study cannot be used to establish a causal relationship because the experimental design is poor. No controls are in place to identify whether the subjects were already exercising before participating in the study. For example, people in the control group are included even if they previously exercised twice per week. Thus, the control group may be actively performing the treatment. The level of exercise is also not described in enough detail to ensure that the treatment is applied similarly to all subjects. The study does not control for or block by gender, employment status, or residential location, all of which can be important to mental health outcomes.
I don't believe the study should be funded. Instead, the proposed study's design should be revised to apply better controls.
Twenty introductory student scores on a final exam are plotted in a horizontal box plot. The summary statistics below match those stated in the textbook.
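We conclude that the vector of scores in this RMD file is consistent with the textbook’s, with the caveat that the vector as entered below holds 18 values rather than twenty. The distribution is slightly left skewed, with the median closer to the 75th percentile than to the 25th.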
scores = c( 57, 66, 69, 71, 72, 73, 77, 78, 78, 79, 79, 81, 81, 82, 83, 88, 89, 94)
boxplot(scores, main="Student Scores on Final Exam" , horizontal = TRUE)
summary(scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.25 78.50 77.61 81.75 94.00
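As a quick check on the skew claim, the distances from the median to each quartile can be compared directly. This is a sketch that reuses the `scores` vector exactly as entered above; the names `lower_gap` and `upper_gap` are illustrative.

```r
scores <- c(57, 66, 69, 71, 72, 73, 77, 78, 78, 79, 79, 81, 81, 82, 83, 88, 89, 94)

# Default (type 7) quantiles, the same ones summary() reports
q <- quantile(scores, c(0.25, 0.5, 0.75))
lower_gap <- unname(q[2] - q[1])  # median minus 1st quartile
upper_gap <- unname(q[3] - q[2])  # 3rd quartile minus median

print(length(scores))                 # number of scores entered (18)
print(c(lower_gap, upper_gap))        # 6.25 and 3.25
print(mean(scores) < median(scores))  # TRUE: mean below median, typical of left skew
```

The lower half of the box is almost twice as wide as the upper half, which agrees with a mild left skew.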
The histogram with a symmetric distribution and a mean of approximately 60 should map to box plot 2.
The histogram with a nearly uniform distribution over the range 0 to 100 and a mean near 50 should correspond to box plot 3.
The histogram with a right skew and a mean near 2 should correspond to box plot 1.
The distribution is right skewed because the right tail is long: manors can be arbitrarily expensive. The median house is the most typical, at around 450,000. The mean could be much higher because of the right skew. The IQR is the better measure of variability because it is not as influenced by the addition of large, expensive houses.
The distribution is symmetric, with few outliers. The median and mean are likely to be close. I prefer the median price because it is robust to outliers. The IQR is a good measure of variation: 900,000 - 300,000 = 600,000.
College student drinks: the distribution is very right skewed. A few people drink, but most don’t. The median is zero and the mean is non-zero. I think zero (the median) is a better metric of the typical student because most don’t drink. The standard deviation is a better metric of variability because it is non-zero. The IQR is likely to be zero because 75% of students have less than 1 drink per day.
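The point about the IQR collapsing to zero can be seen with made-up data. The counts below are hypothetical (80 non-drinkers and 20 drinkers among 100 students), chosen only to mimic the shape described.

```r
# Hypothetical data: 80 students with 0 drinks, 20 students with 1 to 10 drinks
drinks <- c(rep(0, 80), rep(c(1, 2, 3, 5, 10), times = 4))

print(median(drinks))  # 0: the typical student does not drink
print(IQR(drinks))     # 0: both quartiles sit at zero
print(mean(drinks))    # 0.84: pulled up by the few drinkers
print(sd(drinks))      # positive, so it still registers the spread
```

When at least 75% of the values are identical, the IQR carries no information about the spread, while the standard deviation still responds to the right tail.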
Annual salaries of employees at a Fortune 500 company: the distribution is right skewed. A few managers make a lot of money. The median is a more robust measure of income. Most people earn near the median income, not the mean, which is higher. The IQR is better because it is robust to outliers like manager salaries.
Based on the mosaic plot, survival appears to be dependent on getting treatment. With the treatment, the proportion of alive subjects is higher than in the control group.
The boxplot suggests the treatment group has a higher median survival time and a much larger IQR. This is evident in the 75th percentile of the treatment group being higher than all survival times in the control group except for one patient.
I conclude that 65.2% of the treatment group died and 88.2% of the control group died. To calculate this, I use the tidyverse packages (tibble and dplyr). I transform the heartTr data table to extract the proportions of survivors and non-survivors within the treatment and control groups.
First, I group by transplant and survived status to get detailed subsample counts. Then, I group by transplant only to get the counts of treatment and control groups. Then, I join the two summaries on the group type (transplant) to calculate the row-wise proportions.
library(tidyverse)  # loads tibble and dplyr (as_tibble, group_by, summarise, %>%)
h = as_tibble(openintro::heartTr)
by_treatment_outcome = h %>% group_by( transplant, survived)
( detailed_outcome = ( by_treatment_outcome %>% summarise( count = n() ) ) )
## # A tibble: 4 x 3
## # Groups: transplant [?]
## transplant survived count
## <fct> <fct> <int>
## 1 control alive 4
## 2 control dead 30
## 3 treatment alive 24
## 4 treatment dead 45
by_treatment = by_treatment_outcome %>%
  group_by( transplant ) %>%
  summarise( TotalTreatment=n())
(mergedata = detailed_outcome %>%
  inner_join(by_treatment, by = "transplant") %>%
  mutate( proportionOfTreatment = count / TotalTreatment) )
## # A tibble: 4 x 5
## # Groups: transplant [2]
## transplant survived count TotalTreatment proportionOfTreatment
## <fct> <fct> <int> <int> <dbl>
## 1 control alive 4 34 0.118
## 2 control dead 30 34 0.882
## 3 treatment alive 24 69 0.348
## 4 treatment dead 45 69 0.652
The relevant values are stated in the 2nd and 4th rows above.
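The same row-wise proportions can be cross-checked in base R from the four counts printed above. This is a sketch that re-enters the counts by hand rather than reading `openintro::heartTr`; the object names `counts` and `props` are illustrative.

```r
# Counts copied from the grouped summary above
counts <- matrix(
  c(4, 30,
    24, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(transplant = c("control", "treatment"),
                  survived = c("alive", "dead"))
)

# margin = 1 divides each count by its row (group) total
props <- prop.table(counts, margin = 1)
print(round(props, 3))
```

The "dead" column reproduces the 0.882 (control) and 0.652 (treatment) figures without the join step.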
Part i
The claims being tested by the heart transplant study are: The treatment can improve survival rate of patients (to the end of the study period). The treatment can increase the median survival time of patients (to the end of the study).
Part ii (FILL IN THE BLANKS)
Using the summary statistics tables from section (c) above.
We write alive on 28 cards representing patients who were alive at the end of the study and dead on 75 cards representing patients who were not.
We shuffle these cards and split them into two groups:
one group of size 69 representing treatment, and another group of size 34 representing control.
We repeat this 100 times to build a distribution centered at zero.
Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than the observed difference from the data.
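The card-shuffling procedure above can be sketched as a short simulation. The seed and the variable names are assumptions for illustration; the counts (28 alive, 75 dead, groups of 69 and 34) and the 100 repetitions come from the steps above. The difference is taken as the treatment death proportion minus the control death proportion.

```r
set.seed(606)  # arbitrary seed, for reproducibility only
cards <- c(rep("alive", 28), rep("dead", 75))
observed_diff <- 45 / 69 - 30 / 34  # treatment minus control death proportion

# One shuffle: deal 69 cards to "treatment" and the remaining 34 to "control"
simulate_once <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:69]
  control <- shuffled[70:103]
  mean(treatment == "dead") - mean(control == "dead")
}

sim_diffs <- replicate(100, simulate_once())

# Fraction of shuffles with a difference at or below the observed one
p_value <- mean(sim_diffs <= observed_diff)
print(round(observed_diff, 3))
print(p_value)
```

Because the labels are shuffled independently of the outcomes, the simulated differences scatter around zero, and the fraction in the last step estimates how often chance alone would produce a gap as large as the observed one.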
Part iii
The simulated results suggest that the observed difference from the data (0.652 - 0.882 = -0.23) is at the far left tail of the simulated distribution. Since a difference this large would rarely arise by chance alone, we conclude the heart transplant program is effective.