library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tibble' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.

      sex     age  marital  grossIncome         smoke  amtWeekends  amtWeekdays
1     Female  42   Single   Under £2,600        Yes    12 cig/day   12 cig/day
2     Male    44   Single   £10,400 to £15,600  No     N/A          N/A
3     Male    53   Married  Above £36,400       Yes    6 cig/day    6 cig/day
...
1691  Male    40   Single   £2,600 to £5,200    Yes    8 cig/day    8 cig/day

(a) What does each row of the data matrix represent?
(b) How many participants were included in the survey?
(c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
Reading the smoking data from GitHub (originally downloaded from www.stem.org.uk):
smokingdata <- read.table('https://raw.githubusercontent.com/maharjansudhan/DATA606/master/smokingdata.csv',sep = ",")
head(smokingdata)
## V1 V2 V3 V4 V5 V6
## 1 Sex Age Marital Status Highest Qualification Nationality Ethnicity
## 2 Male 38 Divorced No Qualification British White
## 3 Female 42 Single No Qualification British White
## 4 Male 40 Married Degree English White
## 5 Female 40 Married Degree English White
## 6 Female 39 Married GCSE/O Level British White
## V7 V8 V9 V10
## 1 Gross Income Region Smoke? Amount Weekends
## 2 £2600 to less than £5200 The North No N/A
## 3 Less than £2600 The North Yes 12
## 4 £28600 to less than £36400 The North No N/A
## 5 £10400 to less than £15600 The North No N/A
## 6 £2600 to less than £5200 The North No N/A
## V11 V12
## 1 Amount Weekdays Type
## 2 N/A N/A
## 3 12 Packets
## 4 N/A N/A
## 5 N/A N/A
## 6 N/A N/A
summary(smokingdata)
## V1 V2 V3 V4
## Female:966 40 : 43 Divorced :161 No Qualification :586
## Sex: 1 34 : 40 Marital Status: 1 GCSE/O Level :308
## Male :727 31 : 38 Married :812 Degree :262
## 42 : 37 Separated : 69 Other/Sub Degree :127
## 33 : 36 Single :428 Higher/Sub Degree:125
## 39 : 35 Widowed :223 A Levels :105
## (Other):1465 (Other) :181
## V5 V6 V7
## English :835 White :1562 £5200 to less than £10400 :396
## British :538 Asian : 41 £10400 to less than £15600:269
## Scottish:142 Black : 34 £2600 to less than £5200 :257
## Other : 71 Chinese: 27 £15600 to less than £20800:188
## Welsh : 66 Mixed : 14 £20800 to less than £28600:155
## Irish : 23 Refused: 13 Less than £2600 :133
## (Other) : 19 (Other): 3 (Other) :296
## V8 V9 V10 V11
## Midlands & East Anglia:443 No :1270 N/A :1270 N/A :1270
## The North :427 Smoke?: 1 20 : 111 20 : 83
## South East :252 Yes : 423 10 : 69 10 : 80
## London :183 15 : 58 15 : 56
## South West :157 5 : 32 5 : 28
## Scotland :148 30 : 27 12 : 17
## (Other) : 84 (Other): 127 (Other): 160
## V12
## Both/Mainly Hand-Rolled: 10
## Both/Mainly Packets : 42
## Hand-Rolled : 73
## N/A :1270
## Packets : 298
## Type : 1
##
dim(smokingdata)
## [1] 1694 12
nrow(smokingdata)
## [1] 1694
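What does each row of the data matrix represent?
Answer: Each row represents one observation, that is, one UK resident who took part in the survey.
How many participants were included in the survey?
Answer: The downloaded file has 1694 rows, one of which is the header row that was read in as data, so there are 1694 - 1 = 1693 participants in the survey.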
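The subtraction above is needed only because the header row was read in as data. A minimal sketch of an alternative, assuming the file's first row holds the column names and that missing values are coded as "N/A": the inline `csv_text` below is a hypothetical stand-in that copies a few columns and rows from the `head()` output, so the example runs without network access.

```r
# Hypothetical inline sample copied from the head() output above;
# the real file would be read from its URL in the same way.
csv_text <- paste(
  c("Sex,Age,Marital Status,Smoke?,Amount Weekends",
    "Male,38,Divorced,No,N/A",
    "Female,42,Single,Yes,12",
    "Male,40,Married,No,N/A"),
  collapse = "\n"
)

# header = TRUE treats the first row as column names;
# na.strings = "N/A" turns the survey's missing-value code into NA.
smoking_sample <- read.csv(
  text = csv_text,
  header = TRUE,
  na.strings = "N/A",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

nrow(smoking_sample)   # counts participants only, no header row
str(smoking_sample)    # Age and Amount Weekends are now numeric
```

Read this way, `nrow()` gives the number of participants directly and the numeric columns are not turned into factors.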
Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
Answer:
Sex: Categorical
Age: Numerical, discrete
Marital Status: Categorical
Highest Qualification: Categorical, ordinal
Nationality: Categorical
Ethnicity: Categorical
Gross Income: Categorical, ordinal (recorded as income ranges rather than exact amounts)
Region: Categorical
Smoke?: Categorical
Amount Weekends: Numerical, discrete
Amount Weekdays: Numerical, discrete
Type: Categorical (the cigarette types have no natural order)
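An ordinal variable such as Gross Income can be stored in R as an ordered factor. The following is a small sketch, not part of the original analysis: the level labels are taken from the `summary()` output above, only four of the income bands are listed, and the three example values are hypothetical.

```r
# Income bands from the summary() output, listed from lowest to highest.
# Only a subset of the bands is included in this sketch.
income_levels <- c(
  "Less than £2600",
  "£2600 to less than £5200",
  "£5200 to less than £10400",
  "£10400 to less than £15600"
)

# Three hypothetical respondents
income <- factor(
  c("£5200 to less than £10400", "Less than £2600", "£10400 to less than £15600"),
  levels = income_levels,
  ordered = TRUE
)

is.ordered(income)      # the factor carries an order
income[2] < income[1]   # comparisons follow the order of the levels
sort(income)            # sorts by income band, not alphabetically
```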
Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.
Answer: The population of interest is all children between the ages of 5 and 15. The sample is the 160 children who took part in the study.
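Answer: The description does not say that the children were randomly selected, so we cannot be sure the results generalize to all children aged 5 to 15. If the children had been randomly selected, the results could be generalized to that population. The study is described as an experiment, so if the children were randomly assigned to the instruction and no instruction groups, the difference in cheating rates can be attributed to the instruction.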
Below are excerpts from two articles published in the NY Times:
Answer: Based on the given information, we cannot conclude that smoking causes dementia later in life, because this is an observational study and not an experiment. The volunteers were not randomly selected, and other factors could explain the association.
Answer: The given information does not state that the data were collected randomly, so we cannot assume that they were. There might be a correlation in the data, but correlation does not imply causation. Sleep disorders might have other causes that are not mentioned above, such as an uncomfortable bed or food intake.
A researcher is interested in the effects of exercise on mental health and he proposes the following study: Use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41- 55 year olds from the population. Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise. Conduct a mental health exam at the beginning and at the end of the study, and compare the results.
Answer: This is an experiment.
Answer: Treatment group: exercise twice a week. Control group: no exercise at all.
Answer: Yes, the study uses blocking. The blocking variable is age group: 18-30, 31-40, and 41-55.
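Random assignment within blocks can be sketched as follows. This is an illustration under assumptions, not the researcher's procedure: the `subjects` data frame, with four people per age group, is hypothetical, and the seed is fixed only to make the example reproducible.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical subjects: four per age group (the blocks)
subjects <- data.frame(
  id = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)
subjects$group <- NA_character_

# Within each block, randomly assign half to each condition
for (g in unique(subjects$age_group)) {
  idx <- which(subjects$age_group == g)
  labels <- rep(c("exercise", "no exercise"), length.out = length(idx))
  subjects$group[idx] <- sample(labels)
}

# Each age group contributes equally to both conditions
assignment_table <- table(subjects$age_group, subjects$group)
assignment_table
```

Because the shuffling happens separately inside each age group, age cannot end up unevenly split between the exercise and no exercise groups.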
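Answer: No, this study does not use blinding. The subjects know whether or not they are exercising.
Answer: Yes, a causal relationship can be established because the subjects are randomly assigned to the treatment and control groups. Because the subjects are also randomly sampled, the conclusions can be generalized to the population.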
Answer: I don’t think this kind of study needs funding, because in daily life people have their own way of living. Whether or not someone works out depends on their work life and personal life, so this study would not make much difference.
Below are the final exam scores of twenty introductory statistics students. 57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94 Create a box plot of the distribution of these scores. The five number summary provided below may be useful.
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
boxplot(scores)
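As a supplementary check, the 1.5 x IQR rule can be applied to the same scores to see which points a box plot would draw individually. This is a sketch using the quartiles reported by `summary()`; note that `boxplot()` itself uses hinges, which can differ slightly from these quartiles.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

q1 <- unname(quantile(scores, 0.25))  # first quartile
q3 <- unname(quantile(scores, 0.75))  # third quartile
iqr <- q3 - q1                        # interquartile range

# Fences: points beyond these are flagged as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```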
Describe the distribution in the histograms below and match them to the box plots.
Answer:
1) Figure a is symmetric and unimodal and matches Figure 2.
2) Figure b is multimodal and matches Figure 3.
3) Figure c is right skewed and matches Figure 1.
For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
Answer: The distribution is right skewed. The median best represents a typical observation and the IQR best represents the variability, because neither is pulled upward by the few very high values.
Answer: The distribution is symmetric, so the mean and standard deviation are appropriate. The median and IQR would give similar results.
Answer: Since most students do not drink but a few drink excessively, the distribution will be right skewed. The median best represents a typical observation and the IQR best represents the variability.
Answer: Since only a few executives earn much higher salaries, the distribution will be right skewed. The median best represents a typical observation and the IQR best represents the variability.
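The effect of a few large values on the mean and standard deviation can be illustrated with a small example. The `salaries` vector below is hypothetical (in thousands), chosen only to mimic a right-skewed distribution with one very high earner.

```r
# Hypothetical salaries in thousands: nine typical values and one executive
salaries <- c(40, 42, 45, 48, 50, 52, 55, 58, 60, 1000)

mean(salaries)    # pulled far above the typical salary by the single large value
median(salaries)  # stays close to the typical salary

sd(salaries)      # inflated by the single large value
IQR(salaries)     # reflects the spread of the middle half of the data
```

The median and IQR barely react to the extreme value, which is why they are preferred for skewed distributions.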
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
data(heartTr)
head(heartTr)
## id acceptyear age survived survtime prior transplant wait
## 1 15 68 53 dead 1 no control NA
## 2 43 70 43 dead 2 no control NA
## 3 61 71 52 dead 2 no control NA
## 4 75 72 52 dead 2 no control NA
## 5 6 68 54 dead 3 no control NA
## 6 42 70 36 dead 3 no control NA
dim(heartTr)
## [1] 103 8
summary(heartTr)
## id acceptyear age survived
## Min. : 1.0 Min. :67.00 Min. : 8.00 alive:28
## 1st Qu.: 26.5 1st Qu.:69.00 1st Qu.:41.00 dead :75
## Median : 49.0 Median :71.00 Median :47.00
## Mean : 51.4 Mean :70.62 Mean :44.64
## 3rd Qu.: 77.5 3rd Qu.:72.00 3rd Qu.:52.00
## Max. :103.0 Max. :74.00 Max. :64.00
##
## survtime prior transplant wait
## Min. : 1.0 no :91 control :34 Min. : 1.00
## 1st Qu.: 33.5 yes:12 treatment:69 1st Qu.: 10.00
## Median : 90.0 Median : 26.00
## Mean : 310.2 Mean : 38.42
## 3rd Qu.: 412.0 3rd Qu.: 46.00
## Max. :1799.0 Max. :310.00
## NA's :34
unique(heartTr$transplant)
## [1] control treatment
## Levels: control treatment
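# Box plot of survival time by transplant group
ggplot(data = heartTr) + geom_boxplot(mapping = aes(x = transplant, y = survtime))
# Mosaic plot of transplant group against survival status
mosaicplot(~ transplant + survived, data = heartTr)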
Answer: According to the mosaic plot, transplant patients appear to have a better chance of survival.
Answer: The box plot shows that the treatment group survived longer than the control group.
Answer:
count(heartTr,transplant,survived)
## # A tibble: 4 x 3
## transplant survived n
## <fct> <fct> <int>
## 1 control alive 4
## 2 control dead 30
## 3 treatment alive 24
## 4 treatment dead 45
cdratio <- 30/34
cdratio
## [1] 0.8823529
tdratio <- 45/69
tdratio
## [1] 0.6521739
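The same proportions can be computed from a two-way table instead of typing the counts into each division. This is a sketch that re-enters the four counts from the `count()` output above as a matrix; `counts` and `death_prop` are names introduced here for illustration.

```r
# Counts taken from the count(heartTr, transplant, survived) output
counts <- matrix(
  c(4, 30,
    24, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(transplant = c("control", "treatment"),
                  survived = c("alive", "dead"))
)

# Row proportions: share of each group that was alive or dead
row_props <- prop.table(counts, margin = 1)

# Proportion of patients who died in each group
death_prop <- row_props[, "dead"]
death_prop

# Difference in death proportions, control minus treatment
unname(death_prop["control"] - death_prop["treatment"])
```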
Answer: The null hypothesis is that the transplant is not a factor in patient survival: there is no relationship between transplant and survival. The alternative hypothesis is that transplant patients are more likely to live longer than non-transplant patients.
Answer: We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at or below the observed difference of -0.23 (equivalently, where control - treatment is at least 0.23). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
cdratio - tdratio
## [1] 0.230179
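The card-shuffling procedure described above can be sketched in R. This is an illustration of the randomization technique rather than the simulation used in the exercise: the seed is arbitrary, and the variable names are introduced here.

```r
set.seed(606)  # arbitrary seed for reproducibility

# 28 "alive" cards and 75 "dead" cards, as in the study
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in death proportions (treatment - control)
observed_diff <- 45 / 69 - 30 / 34

n_sims <- 100
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(cards)       # shuffle the cards
  treatment <- shuffled[1:69]     # first 69 cards form the treatment group
  control <- shuffled[70:103]     # remaining 34 cards form the control group
  mean(treatment == "dead") - mean(control == "dead")
})

# Fraction of simulated differences at least as extreme as the observed one
p_value <- mean(sim_diffs <= observed_diff)
p_value
```

A small `p_value` would indicate that a difference as large as the observed one rarely arises from shuffling alone. With only 100 repetitions the estimate is coarse, so a larger `n_sims` gives a more stable result.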
Answer: Since the transplant patients appear more likely to live longer, the results suggest that the transplant is a good option.