1.8 Smoking habits of UK residents. A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Data downloaded from https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools
file: node-28452-28452>11263-Smoking_tcm86-13253.xls
location: https://github.com/niteen11/MSDS/blob/master/DATA606/Dataset/11263-Smoking_tcm86-13253_dataset.csv
smoking_data <-read.csv('C:\\NITEEN\\CUNY\\Spring 2018\\DATA 606\\LAB1\\node-28452-28452\\11263-Smoking_tcm86-13253_dataset.csv')
head(smoking_data)
## Sex Age Marital.Status Highest.Qualification Nationality Ethnicity
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## Gross.Income Region Smoke. Amount.Weekends
## 1 £2600 to less than £5200 The North No N/A
## 2 Less than £2600 The North Yes 12
## 3 £28600 to less than £36400 The North No N/A
## 4 £10400 to less than £15600 The North No N/A
## 5 £2600 to less than £5200 The North No N/A
## 6 £15600 to less than £20800 The North No N/A
## Amount.Weekdays Type X X.1
## 1 N/A N/A NA NA
## 2 12 Packets NA NA
## 3 N/A N/A NA NA
## 4 N/A N/A NA NA
## 5 N/A N/A NA NA
## 6 N/A N/A NA NA
summary(smoking_data)
## Sex Age Marital.Status Highest.Qualification
## Female:966 Min. :16.00 Divorced :161 No Qualification :586
## Male :727 1st Qu.:34.00 Married :812 GCSE/O Level :308
## Median :48.00 Separated: 69 Degree :262
## Mean :49.82 Single :428 Other/Sub Degree :127
## 3rd Qu.:65.00 Widowed :223 Higher/Sub Degree:125
## Max. :97.00 A Levels :105
## (Other) :180
## Nationality Ethnicity Gross.Income
## English :835 Asian : 41 £5200 to less than £10400 :396
## British :538 Black : 34 £10400 to less than £15600:269
## Scottish:142 Chinese: 27 £2600 to less than £5200 :257
## Other : 71 Mixed : 14 £15600 to less than £20800:188
## Welsh : 66 Refused: 13 £20800 to less than £28600:155
## Irish : 23 Unknown: 2 Less than £2600 :133
## (Other) : 18 White :1562 (Other) :295
## Region Smoke. Amount.Weekends Amount.Weekdays
## London :183 No :1270 N/A :1270 N/A :1270
## Midlands & East Anglia:443 Yes: 423 20 : 111 20 : 83
## Scotland :148 10 : 69 10 : 80
## South East :252 15 : 58 15 : 56
## South West :157 5 : 32 5 : 28
## The North :427 30 : 27 12 : 17
## Wales : 83 (Other): 126 (Other): 159
## Type X X.1
## Both/Mainly Hand-Rolled: 10 Mode:logical Mode:logical
## Both/Mainly Packets : 42 NA's:1693 NA's:1693
## Hand-Rolled : 73
## N/A :1270
## Packets : 298
##
##
dim(smoking_data)
## [1] 1693 14
nrow(smoking_data)
## [1] 1693
Ans: Each row represents one observation, that is, one survey participant.
Ans: According to the book there are 1691 observations, and hence 1691 participants. However, the data I downloaded from the source site mentioned in the book contains 1693 participants.
Ans Below are the :
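A minimal sketch of how the variable types might be checked in R. It rebuilds the first two rows shown in the head(smoking_data) output above for a subset of columns (the header "Smoke" is a simplification of "Smoke."), and assumes that reading "N/A" as a missing value is the desired behaviour, which the original read.csv() call does not do.

```r
# Two rows copied from the head(smoking_data) output above (subset of columns).
csv_text <- "Sex,Age,Smoke,Amount.Weekends,Amount.Weekdays,Type
Male,38,No,N/A,N/A,N/A
Female,42,Yes,12,12,Packets"

# na.strings = "N/A" makes R treat the survey's "N/A" entries as missing,
# so the cigarette counts are read as numbers instead of text.
smoking_sample <- read.csv(text = csv_text, na.strings = "N/A",
                           stringsAsFactors = FALSE)

# Class of each column: a first hint at numerical vs. categorical variables.
var_types <- sapply(smoking_sample, class)
print(var_types)
```

Whether a categorical variable such as income band is ordinal still has to be judged from its meaning, not from its R class.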
1.10 Cheaters, scope of inference. Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.
Ans: The population of interest is children between the ages of 5 and 15. The sample is the 160 children between 5 and 15 who took part in the study.
Ans: The exercise describes an experiment in which the researchers decided which children were told not to cheat, so it is not an observational study. If the children were randomly assigned to the instruction and no-instruction groups, the findings can be used to establish a causal relationship. The results could be generalized to the population only if the sample was truly random and drawn from throughout the nation (though I suspect that n = 160 may be too small).
1.28 Reading the paper. Below are excerpts from two articles published in the NY Times:
Ans: It is an observational study, not an experiment, so it is hard to conclude that smoking causes dementia. There also appears to be bias, since the participants were volunteers rather than a random selection, and we have no insight into other factors.
Ans: We can conclude that there is a correlation, but not necessarily causation. The study does not explain how the students were selected, whether locally or at random.
1.36 Exercise and mental health. A researcher is interested in the effects of exercise on mental health and he proposes the following study: Use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41- 55 year olds from the population. Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise. Conduct a mental health exam at the beginning and at the end of the study, and compare the results.
Ans: Prospective experiment.
Ans: Treatment: exercising twice a week.
Control: not exercising at all.
Ans: The blocks are ages 18-30, 31-40, and 41-55.
Ans: The study does not use blinding.
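Ans: A causal relationship can be established because subjects were randomly assigned to the treatment and control groups, and the conclusions can be generalized because the subjects were randomly sampled from the population.
Ans: I cannot recommend this study as proposed, because instructing a group not to exercise at all does not seem right.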
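The blocked random assignment described in this exercise could be sketched as follows. The twelve subject ids and the four subjects per age group are hypothetical and serve only to illustrate the mechanics; the seed is arbitrary.

```r
set.seed(606)  # arbitrary seed so the illustration is reproducible

# Hypothetical subjects: four per age block.
subjects <- data.frame(
  id = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Within each block, shuffle an equal number of treatment and control labels.
subjects$group <- NA_character_
for (block in unique(subjects$age_group)) {
  rows <- which(subjects$age_group == block)
  labels <- rep(c("exercise", "no exercise"), length.out = length(rows))
  subjects$group[rows] <- sample(labels)
}

# Each block should contribute half of its subjects to each group.
assignment_table <- table(subjects$age_group, subjects$group)
print(assignment_table)
```

Randomizing inside each block keeps age balanced between the two groups, so age cannot explain a difference in the mental health results.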
1.48 - Stats scores. Below are the final exam scores of twenty introductory statistics students. 57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94. Create a box plot of the distribution of these scores. The five number summary provided below may be useful.
score <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
boxplot(score)
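As a small check on the box plot, the 1.5 x IQR rule can be applied by hand. The sketch below uses only base R; note that boxplot.stats() works from hinges (via fivenum()), which can differ slightly from the quartiles reported by summary().

```r
score <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
           79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles and IQR as reported by summary().
quartiles <- quantile(score, c(0.25, 0.75))
iqr_score <- IQR(score)

# Fences of the 1.5 x IQR rule; points beyond them are drawn individually.
lower_fence <- quartiles[[1]] - 1.5 * iqr_score
upper_fence <- quartiles[[2]] + 1.5 * iqr_score

# Observations that boxplot() itself would flag as outliers.
flagged <- boxplot.stats(score)$out
print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```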
1.50 Mix-and-match. Describe the distribution in the histograms below and match them to the box plots.
Ans: Symmetric and unimodal. It matches box plot 2.
Multimodal. It matches box plot 3.
Right skewed and unimodal. It matches box plot 1.
1.56 Distributions and appropriate statistics, Part II. For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning:
Ans: Right skewed; use the median and IQR. The median and IQR are better summaries of the data when there are extreme values.
Ans: Symmetric; either the mean and SD or the median and IQR can be used because the distribution is symmetric.
Ans: Right skewed; use the median and IQR. Given the right skew and the possible outliers, I believe the median and IQR are the better fit.
Ans: Right skewed; use the median and IQR. A few very high salaries would skew the data to the right.
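To illustrate why the median and IQR are preferred for right-skewed data, here is a short sketch. The salary figures are hypothetical and were chosen only to mimic a few very high earners among otherwise similar salaries.

```r
# Hypothetical salaries: eight typical employees and two highly paid executives.
salaries <- c(28000, 30000, 32000, 35000, 38000,
              40000, 45000, 50000, 250000, 900000)

# Centre: the mean is pulled towards the two extreme values, the median is not.
salary_mean <- mean(salaries)
salary_median <- median(salaries)

# Spread: the SD is inflated by the extremes, the IQR only looks at the middle half.
salary_sd <- sd(salaries)
salary_iqr <- IQR(salaries)

print(c(mean = salary_mean, median = salary_median,
        sd = salary_sd, iqr = salary_iqr))
```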
1.70 Heart transplants. The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
data(heartTr)
head(heartTr)
## id acceptyear age survived survtime prior transplant wait
## 1 15 68 53 dead 1 no control NA
## 2 43 70 43 dead 2 no control NA
## 3 61 71 52 dead 2 no control NA
## 4 75 72 52 dead 2 no control NA
## 5 6 68 54 dead 3 no control NA
## 6 42 70 36 dead 3 no control NA
dim(heartTr)
## [1] 103 8
summary(heartTr)
## id acceptyear age survived
## Min. : 1.0 Min. :67.00 Min. : 8.00 alive:28
## 1st Qu.: 26.5 1st Qu.:69.00 1st Qu.:41.00 dead :75
## Median : 49.0 Median :71.00 Median :47.00
## Mean : 51.4 Mean :70.62 Mean :44.64
## 3rd Qu.: 77.5 3rd Qu.:72.00 3rd Qu.:52.00
## Max. :103.0 Max. :74.00 Max. :64.00
##
## survtime prior transplant wait
## Min. : 1.0 no :91 control :34 Min. : 1.00
## 1st Qu.: 33.5 yes:12 treatment:69 1st Qu.: 10.00
## Median : 90.0 Median : 26.00
## Mean : 310.2 Mean : 38.42
## 3rd Qu.: 412.0 3rd Qu.: 46.00
## Max. :1799.0 Max. :310.00
## NA's :34
unique(heartTr$transplant)
## [1] control treatment
## Levels: control treatment
library(ggplot2)
ggplot(data = heartTr)+
geom_boxplot(mapping = aes(x=transplant,y=survtime))
mosaicplot(data = heartTr, ~transplant+survived, color='#ADD8E6')
Ans: It appears that patients who received a transplant were more likely to be alive at the end of the study.
Ans: The box plots indicate that patients in the treatment group survived longer than patients in the control group.
count(heartTr,transplant,survived)
## # A tibble: 4 x 3
## transplant survived n
## <fctr> <fctr> <int>
## 1 control alive 4
## 2 control dead 30
## 3 treatment alive 24
## 4 treatment dead 45
control.dead.ratio <-30/34
control.dead.ratio
## [1] 0.8823529
treatment.survive.ratio <- 24/69
treatment.survive.ratio
## [1] 0.3478261
treatment.dead.ratio <- 45/69
treatment.dead.ratio
## [1] 0.6521739
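The same proportions can be derived from the table of counts instead of being typed in by hand. This is a sketch that rebuilds the four counts reported by count() above as a matrix, so it does not depend on the openintro package being installed.

```r
# Counts taken from the count(heartTr, transplant, survived) output above.
counts <- matrix(c(4, 30, 24, 45), nrow = 2, byrow = TRUE,
                 dimnames = list(transplant = c("control", "treatment"),
                                 survived = c("alive", "dead")))

# Row proportions: share of each group that was alive or dead.
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 4))

# Difference in the proportion of deaths, control minus treatment.
diff_dead <- row_props["control", "dead"] - row_props["treatment", "dead"]
print(round(diff_dead, 4))
```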
Ans: (Null hypothesis, H0) Survival does not depend on whether a patient receives a transplant; transplant and survival are independent.
(Alternative hypothesis, HA) Patients who receive a transplant are more likely to survive; the observed proportions and survival times are higher in the treatment group.
We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the observed difference, which is -0.2302 when computed as treatment - control (0.2302 in magnitude). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
round(control.dead.ratio - treatment.dead.ratio, 4)
## [1] 0.2302
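Ans: The 100 simulated differences are centered near 0, which is what we would expect if transplant and survival were independent. Because a difference as large as the observed 0.2302 rarely occurs in that distribution, we can reject the null hypothesis and conclude that transplanted patients are more likely to survive.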
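The card-shuffling procedure described above can be sketched in R as follows. The seed is arbitrary and the variable names are illustrative, so individual runs will not reproduce the book's figure exactly.

```r
set.seed(606)  # arbitrary seed so the illustration is reproducible

# One "card" per patient: 28 alive and 75 dead, as in the study.
outcomes <- c(rep("alive", 28), rep("dead", 75))
n_treatment <- 69

# Observed difference in the proportion of deaths (treatment - control).
observed_diff <- 45 / 69 - 30 / 34

# Shuffle the cards, deal 69 to treatment and 34 to control, record the difference.
sim_diffs <- replicate(100, {
  shuffled <- sample(outcomes)
  treatment <- shuffled[1:n_treatment]
  control <- shuffled[(n_treatment + 1):length(shuffled)]
  mean(treatment == "dead") - mean(control == "dead")
})

# Fraction of shuffles at least as extreme as the observed difference.
p_value <- mean(sim_diffs <= observed_diff)
print(p_value)
```

With only 100 shuffles the estimated fraction is coarse; increasing the number of replications gives a more stable estimate.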