1.8 Smoking habits of UK residents. A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that "£" stands for British Pounds Sterling, "cig" stands for cigarettes, and "N/A" refers to a missing component of the data.
library (tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'readr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Warning: package 'forcats' was built under R version 3.3.3
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Datasmoking <- read.csv('D:/606 Jason Bryer Wang HomePC/DATA606Fall2018-master/data/openintro.org/Ch 1 Exercise Data/smoking.csv')
head(Datasmoking)
## gender age maritalStatus highestQualification nationality ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## grossIncome region smoke amtWeekends amtWeekdays type
## 1 2,600 to 5,200 The North No NA NA
## 2 Under 2,600 The North Yes 12 12 Packets
## 3 28,600 to 36,400 The North No NA NA
## 4 10,400 to 15,600 The North No NA NA
## 5 2,600 to 5,200 The North No NA NA
## 6 15,600 to 20,800 The North No NA NA
dim(Datasmoking)
## [1] 1691 12
summary(Datasmoking)
## gender age maritalStatus highestQualification
## Female:965 Min. :16.00 Divorced :161 No Qualification :586
## Male :726 1st Qu.:34.00 Married :812 GCSE/O Level :308
## Median :48.00 Separated: 68 Degree :262
## Mean :49.84 Single :427 Other/Sub Degree :127
## 3rd Qu.:65.50 Widowed :223 Higher/Sub Degree:125
## Max. :97.00 A Levels :105
## (Other) :178
## nationality ethnicity grossIncome
## English :833 Asian : 41 5,200 to 10,400 :396
## British :538 Black : 34 10,400 to 15,600:268
## Scottish:142 Chinese: 27 2,600 to 5,200 :257
## Other : 71 Mixed : 14 15,600 to 20,800:188
## Welsh : 66 Refused: 13 20,800 to 28,600:155
## Irish : 23 Unknown: 2 Under 2,600 :133
## (Other) : 18 White :1560 (Other) :294
## region smoke amtWeekends amtWeekdays
## London :182 No :1270 Min. : 0.00 Min. : 0.00
## Midlands & East Anglia:443 Yes: 421 1st Qu.:10.00 1st Qu.: 7.00
## Scotland :148 Median :15.00 Median :12.00
## South East :252 Mean :16.41 Mean :13.75
## South West :157 3rd Qu.:20.00 3rd Qu.:20.00
## The North :426 Max. :60.00 Max. :55.00
## Wales : 83 NA's :1270 NA's :1270
## type
## :1270
## Both/Mainly Hand-Rolled: 10
## Both/Mainly Packets : 42
## Hand-Rolled : 72
## Packets : 297
##
##
Each row represents one UK resident who took part in the survey.
1691 participants were included in the survey.
Sex: Categorical. Age: Discrete Numerical. Marital: Categorical. grossIncome: Categorical (ordinal). Smoke: Categorical. amtWeekends: Discrete Numerical. amtWeekdays: Discrete Numerical.
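The classification above can be checked against how R stores each column. The following is a minimal sketch: `smoking_head` is a hypothetical three-row data frame typed in by hand from the first rows of the `head()` output, not the full survey file, and `stringsAsFactors = TRUE` is set explicitly to mimic the factor columns seen in the `summary()` output.

```r
# Hypothetical mini data frame copied from the first three rows shown by head()
smoking_head <- data.frame(
  gender        = c("Male", "Female", "Male"),
  age           = c(38L, 42L, 40L),
  maritalStatus = c("Divorced", "Single", "Married"),
  grossIncome   = c("2,600 to 5,200", "Under 2,600", "28,600 to 36,400"),
  smoke         = c("No", "Yes", "No"),
  amtWeekends   = c(NA, 12L, NA),   # NA for non-smokers
  amtWeekdays   = c(NA, 12L, NA),
  stringsAsFactors = TRUE           # store text columns as factors (categorical)
)

# Storage class of each column: factor = categorical, integer = discrete numerical
col_types <- sapply(smoking_head, class)
print(col_types)
```

Note that `grossIncome` is stored as a plain factor, so R does not know that its levels are ordered; that ordering is a judgment made by the reader.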
1.10 Cheaters, scope of inference. Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children's characteristics within each group.
Identify the population of interest and the sample in this study. Answer: The population of interest is all children between the ages of 5 and 15. The sample is the 160 children who took part in the experiment.
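Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships. Answer: The description does not say that the 160 children were randomly sampled, so the results cannot safely be generalized to all children aged 5 to 15. The study is an experiment rather than an observational study, because the researchers controlled which children received the instruction. If the children were randomly assigned to the two groups and the sample is large enough, the difference in cheating rates can be attributed to the instruction; without random assignment, no causal conclusion can be drawn.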
1.28 Reading the paper. Below are excerpts from two articles published in the NY Times:
Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.
Answer: This is survey data. The subjects are self-selected and rely on recollection of past events (a retrospective study), so this is not an experiment. The study therefore cannot establish causation. It is not even a prospective observational study.
A friend of yours who read the article says, "The study shows that sleep disorders lead to bullying in school children." Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?
Answer: No. The study shows a correlation, but correlation does NOT imply causation; a causal link would need further investigation to establish.
1.36 Exercise and mental health. A researcher is interested in the effects of exercise on mental health and he proposes the following study: Use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41-55 year olds from the population. Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise. Conduct a mental health exam at the beginning and at the end of the study, and compare the results.
Answer: It is a stratified, randomized, prospective experiment with age strata as blocks (a randomized block design).
Treatment: Exercising twice a week. Control: Not exercising at all.
Yes, it uses blocking. The blocking variable is age group, with three levels (18-30, 31-40, and 41-55).
No, this study does NOT use blinding, of either the study subjects or the researchers. Both the subjects and the researchers know which subjects are exercising.
Answer: If the study was truly randomized as noted in the question stem and the sample size was large enough, this could be generalized to the population at large. Since this is an experiment, a causal relationship could be established.
Restricting a group from exercising could be ethically wrong.
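The blocked assignment described in this exercise can be sketched in a few lines of R. This is an illustration under stated assumptions, not the researcher's procedure: the `subjects` data frame, the 20 subjects per age group, and the seed are all hypothetical.

```r
# Hypothetical roster: 20 subjects in each of the three age blocks
subjects <- data.frame(
  id        = 1:60,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 20)
)

set.seed(606)  # arbitrary seed so the assignment is reproducible
subjects$arm <- NA_character_

# Within each age block, randomly assign half to each arm
for (g in unique(subjects$age_group)) {
  idx <- which(subjects$age_group == g)
  labels <- rep(c("exercise", "control"), length.out = length(idx))
  subjects$arm[idx] <- sample(labels)  # shuffle labels inside the block
}

# Every block should show an even split between the two arms
assignment_table <- table(subjects$age_group, subjects$arm)
print(assignment_table)
```

Shuffling within each block, rather than across the whole roster, is what guarantees that age group is balanced between the treatment and control arms.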
1.48 Stats scores. Below are the final exam scores of twenty introductory statistics students.
57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94
Create a box plot of the distribution of these scores. The five number summary provided below may be useful.
Min    Q1     Q2 (Median)   Q3     Max
57     72.5   78.5          82.5   94
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
boxplot(scores, main = "Exam Scores", xlab = "Student Scores", ylab = "Grades")
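The quartiles printed by `summary()` (72.75 and 82.25) differ slightly from the ones given in the exercise (72.5 and 82.5). This is because `summary()` uses R's default interpolated quantiles, while the exercise values match Tukey's hinges, which `fivenum()` returns and which `boxplot()` uses to draw the box. A short check:

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max)
hinges <- fivenum(scores)
print(hinges)  # 57.0 72.5 78.5 82.5 94.0

# Default quantile() method (type 7), the one summary() reports
q_default <- quantile(scores, probs = c(0.25, 0.75), names = FALSE)
print(q_default)  # 72.75 82.25

# Values boxplot() would flag as outliers under the default 1.5 * IQR rule
flagged <- boxplot.stats(scores)$out
print(flagged)
```

Under the default rule the lower fence is 72.5 - 1.5 * 10 = 57.5, so the score of 57 falls just below it and is drawn as a separate point rather than as the end of the whisker.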
1.50 Mix-and-match. Describe the distribution in the histograms below and match them to the box plots.
Symmetrical and Unimodal. Matches with 2.
Symmetrical. Multimodal. Matches with 3.
Right Skew and Unimodal. Matches with 1.
1.56 Distributions and appropriate statistics, Part II. For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
Answer: The median and IQR are the best choices, since the data is right skewed (a small number of very high housing prices stretch the upper tail).
Answer: This data set could be presented with either mean/SD or median/IQR, because it is symmetrical.
Answer: This data is right skewed (a few heavy drinkers versus the typical 0-2 drinks per week), so median/IQR will be a better fit for data interpretation.
Answer: Again, this will be right skewed. Most employees will make around $80,000 to $250,000 per year, whereas the CEO can make up to 20 million per year, and a handful of other top-level executives are also likely to make more than a million per year. This data is again better described with median/IQR.
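The reasoning in these answers can be illustrated with a small numeric example. The salaries below are made up for illustration (in thousands of dollars) and are not taken from any data set; they simply mimic a company where most employees earn modest amounts and two executives earn far more.

```r
# Hypothetical salaries in thousands of dollars, including two very high earners
salaries <- c(80, 95, 110, 130, 150, 180, 220, 250, 1200, 20000)

salary_mean   <- mean(salaries)    # pulled upward by the two large values
salary_median <- median(salaries)  # stays near the typical employee
salary_sd     <- sd(salaries)      # inflated by the extreme values
salary_iqr    <- IQR(salaries)     # spread of the middle half only

print(c(mean = salary_mean, median = salary_median))
print(c(sd = salary_sd, IQR = salary_iqr))
```

The mean lands far above what almost every employee earns, while the median and IQR describe the bulk of the data, which is why they are preferred for right-skewed distributions.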
1.70 Heart transplants. The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died.
library(openintro)
## Warning: package 'openintro' was built under R version 3.3.3
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
data(heartTr)
head(heartTr)
## id acceptyear age survived survtime prior transplant wait
## 1 15 68 53 dead 1 no control NA
## 2 43 70 43 dead 2 no control NA
## 3 61 71 52 dead 2 no control NA
## 4 75 72 52 dead 2 no control NA
## 5 6 68 54 dead 3 no control NA
## 6 42 70 36 dead 3 no control NA
dim(heartTr)
## [1] 103 8
unique(heartTr$transplant)
## [1] control treatment
## Levels: control treatment
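library(ggplot2)
# Box plots of survival time for the control and treatment groups
ggplot(data = heartTr) +
  geom_boxplot(mapping = aes(x = transplant, y = survtime))
# Mosaic plot of transplant group against survival status
mosaicplot(~ transplant + survived, data = heartTr)
library(tidyverse)
# Number of patients in each group/outcome combination
count(heartTr, transplant, survived)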
## # A tibble: 4 x 3
## transplant survived n
## <fct> <fct> <int>
## 1 control alive 4
## 2 control dead 30
## 3 treatment alive 24
## 4 treatment dead 45
Answer: Based on the mosaic plot, survival status appears to depend on whether the patient received a transplant. A larger share of the treatment group was alive at the end of the study, so transplantation appears to be associated with better survival.
The box plots suggest that the treatment group survived longer than the control group.
What proportion of patients in the treatment group and what proportion of patients in the control group died? Answer: Control group: 30/34 = 0.8824 = 88.2%. Treatment group: 45/69 = 0.6522 = 65.2%.
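These proportions can be reproduced from the counts reported by `count()` above. The sketch below types the four counts into a matrix by hand (the name `counts` is introduced here for illustration) so that it runs without the `openintro` package.

```r
# Counts from the count(heartTr, transplant, survived) output
counts <- matrix(
  c(4, 30,
    24, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(transplant = c("control", "treatment"),
                  survived   = c("alive", "dead"))
)

# Proportion of patients who died within each group
prop_dead <- counts[, "dead"] / rowSums(counts)
print(round(prop_dead, 4))  # control 0.8824, treatment 0.6522

# Observed difference in death rates (treatment - control)
obs_diff <- unname(prop_dead["treatment"] - prop_dead["control"])
print(round(obs_diff, 4))  # -0.2302
```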
One approach for investigating whether or not the treatment is effective is to use a randomization technique.
(Null hypothesis, H0). Transplant status does not affect the survival of these patients.
(Alternative hypothesis, HA). Transplant status does affect survival: transplanted patients were more likely to survive than non-transplanted patients.
We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than or equal to -0.2302, the observed difference (45/69 - 30/34). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
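Answer: The simulation results indicate that the null hypothesis should be rejected. Note that 0.2302 is the size of the observed difference in proportions, not a probability, so it is not the number to compare with the 0.05 cutoff. The relevant quantity is the fraction of simulated differences at least as extreme as the observed one, and rejection requires that fraction to be small. When it is, there is strong evidence to support the hypothesis that transplantation improves patients' survival, and the finding is unlikely to have occurred by chance.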
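The card-shuffling procedure described in this exercise can be sketched directly in R. This is one possible implementation under stated assumptions: the variable names and the seed are chosen here for illustration, and the fraction it prints depends on the seed and on the small number of repetitions (100, as in the exercise).

```r
# 28 "alive" cards and 75 "dead" cards, one per patient
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in death rates (treatment - control)
obs_diff <- 45 / 69 - 30 / 34

set.seed(2018)  # arbitrary seed so the run is reproducible
sim_diffs <- replicate(100, {
  shuffled  <- sample(cards)     # shuffle all 103 cards
  treatment <- shuffled[1:69]    # first 69 cards play the treatment group
  control   <- shuffled[70:103]  # remaining 34 play the control group
  mean(treatment == "dead") - mean(control == "dead")
})

# Fraction of shuffles with a difference at least as low as the observed one
frac_extreme <- mean(sim_diffs <= obs_diff)
print(frac_extreme)
```

A small value of `frac_extreme` means that random shuffling rarely produces a gap as large as the one observed, which is the evidence against the null hypothesis. Increasing the number of repetitions gives a more stable estimate.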