DATA606_W1_Lab_1_MatheeshaThambeliyagodage

1.8 Smoking habits of UK residents.

A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.

if (!require("openintro")) install.packages('openintro')

## Loading required package: openintro

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

#data(smoking)
#str(smoking)
#head(smoking)
#summary(smoking)
#dim(smoking)
#summary(smoking[,"age"])
#smoking

(a) What does each row of the data matrix represent?

summary(smoking)

##     gender         age          maritalStatus        highestQualification
##  Female:965   Min.   :16.00   Divorced :161   No Qualification :586      
##  Male  :726   1st Qu.:34.00   Married  :812   GCSE/O Level     :308      
##               Median :48.00   Separated: 68   Degree           :262      
##               Mean   :49.84   Single   :427   Other/Sub Degree :127      
##               3rd Qu.:65.50   Widowed  :223   Higher/Sub Degree:125      
##               Max.   :97.00                   A Levels         :105      
##                                               (Other)          :178      
##    nationality    ethnicity              grossIncome 
##  English :833   Asian  :  41   5,200 to 10,400 :396  
##  British :538   Black  :  34   10,400 to 15,600:268  
##  Scottish:142   Chinese:  27   2,600 to 5,200  :257  
##  Other   : 71   Mixed  :  14   15,600 to 20,800:188  
##  Welsh   : 66   Refused:  13   20,800 to 28,600:155  
##  Irish   : 23   Unknown:   2   Under 2,600     :133  
##  (Other) : 18   White  :1560   (Other)         :294  
##                     region    smoke       amtWeekends     amtWeekdays   
##  London                :182   No :1270   Min.   : 0.00   Min.   : 0.00  
##  Midlands & East Anglia:443   Yes: 421   1st Qu.:10.00   1st Qu.: 7.00  
##  Scotland              :148              Median :15.00   Median :12.00  
##  South East            :252              Mean   :16.41   Mean   :13.75  
##  South West            :157              3rd Qu.:20.00   3rd Qu.:20.00  
##  The North             :426              Max.   :60.00   Max.   :55.00  
##  Wales                 : 83              NA's   :1270    NA's   :1270   
##                       type     
##                         :1270  
##  Both/Mainly Hand-Rolled:  10  
##  Both/Mainly Packets    :  42  
##  Hand-Rolled            :  72  
##  Packets                : 297  
##                                
##

Each row represents one UK resident who took part in the survey (one case). The following variables were recorded for each participant:

gender,

age,

maritalStatus,

highestQualification,

nationality,

ethnicity,

grossIncome,

region,

smoke,

amtWeekends,

amtWeekdays,

type

(b) How many participants were included in the survey? -> 1691

dim(smoking)

## [1] 1691   12

(c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

gender is a nominal categorical variable
age is a discrete numerical variable
maritalStatus is a nominal categorical variable
highestQualification is an ordinal categorical variable
nationality is a nominal categorical variable
ethnicity is a nominal categorical variable
grossIncome is an ordinal categorical variable
region is a nominal categorical variable
smoke is a nominal categorical variable
amtWeekends is a discrete numerical variable (a count of cigarettes, summarized above with a mean and median)
amtWeekdays is a discrete numerical variable (a count of cigarettes, summarized above with a mean and median)
type is a nominal categorical variable
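The classification above can be cross-checked against how R stores each column. The sketch below is only illustrative: `survey_demo` is a hypothetical data frame with made-up values that mirrors a few of the survey columns, and it does not use the real `smoking` data.

```r
# Hypothetical mini data frame (made-up values) mirroring a few survey columns.
survey_demo <- data.frame(
  gender      = factor(c("Female", "Male", "Female")),
  age         = c(38L, 42L, 61L),
  grossIncome = factor(c("2,600 to 5,200", "Under 2,600", "5,200 to 10,400"),
                       levels = c("Under 2,600", "2,600 to 5,200", "5,200 to 10,400"),
                       ordered = TRUE),
  amtWeekends = c(NA, 12L, NA)
)

# Numeric columns are numerical variables; factor columns are categorical,
# and ordered factors are ordinal.
var_kind <- sapply(survey_demo, function(col) {
  if (is.numeric(col)) "numerical"
  else if (is.ordered(col)) "categorical (ordinal)"
  else "categorical (nominal)"
})
print(var_kind)
```

Whether a numerical variable is discrete or continuous, and whether a factor should be treated as ordered, still has to be decided from what the variable measures; R's storage type is only a hint.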

1.10 Cheaters, scope of inference.

Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.

(a) Identify the population of interest and the sample in this study.

Sample: the 160 children who took part in the study. Population: all children between the ages of 5 and 15.

(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

The results of the experiment are hard to generalize for the following reasons.

1- The sample size is small,

2- The sample may not be representative of the population, since the description does not say the children were randomly sampled,

3- Not all explanatory variables were identified or monitored, so other factors could be influencing the response variable

1.28 Reading the paper

An article titled Risks: Smokers Found More Prone to Dementia states the following:

(a).Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.

No, this is an observational study.

(b). A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?

No, the statement is not justified: it implies a causal relationship between sleep disorders and bullying, but this was an observational study, which can only show an association. A better conclusion would be “School children identified as bullies are more likely to suffer from sleep disorders than non-bullies.”

1.36 Exercise and Mental Health.

(a) What type of study is this?

Experiment.

(b) What are the treatment and control groups in this study?

Treatment is exercise twice a week. Control is no exercise.

(c) Does this study make use of blocking? If so, what is the blocking variable?

Yes, the blocking variable is age.
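As a rough illustration of what blocking on age looks like in practice, the sketch below randomly assigns subjects to treatment or control separately within each age block. The data frame `subjects`, its age group labels, and the group sizes are hypothetical and are not taken from the study.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical subjects: 12 people in three made-up age blocks.
subjects <- data.frame(
  id        = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4)
)

# Within each age block, shuffle an equal mix of the two labels.
subjects$group <- NA_character_
for (g in unique(subjects$age_group)) {
  idx <- which(subjects$age_group == g)
  labels <- rep(c("treatment", "control"), length.out = length(idx))
  subjects$group[idx] <- sample(labels)
}

# Each block ends up with the same number of treated and control subjects.
assignment_table <- table(subjects$age_group, subjects$group)
print(assignment_table)
```

Because the randomization happens inside each block, age is balanced across the two groups by design.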

(d) Does this study make use of blinding?

(e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

This is an experiment, so a causal conclusion is reasonable. Since the sample is random, the conclusion can be generalized to the population at large. However, we must consider that a placebo effect is possible.

(f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

Yes. Randomly sampled people should not be required to participate in a clinical trial, and there are also ethical concerns about the plan to instruct one group not to participate in a healthy behavior, which in this case is exercise.

1.48 Box Plot of Final Exam Scores

FinScor <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(FinScor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00

boxplot(FinScor)
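To see which points a box plot would flag, the quartiles from the summary can be turned into the usual 1.5 x IQR fences. This is a minimal sketch using `quantile()` and `IQR()`; `boxplot()` itself works from hinges, which can differ slightly from these quartiles.

```r
FinScor <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
             79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles and interquartile range.
q   <- quantile(FinScor, c(0.25, 0.75))
iqr <- IQR(FinScor)

# Points beyond 1.5 * IQR from the quartiles are treated as potential outliers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers <- FinScor[FinScor < lower_fence | FinScor > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

With these scores the only value outside the fences is the minimum, 57.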

1.50 Mix-and-match.

Describe the distribution in the histograms below and match them to the box plots.

a.The matching box plot is number 2. The distribution is unimodal (one peak) and symmetric.
b.The matching box plot is number 3. The distribution is multimodal (several peaks) and symmetric.
c.The matching box plot is number 1. The distribution is unimodal (one peak) and right skewed.

1.56 Distributions and appropriate statistics

(a).Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.

1.Right skewed
2.Median - Because the distribution is right skewed.
3.IQR - Because it is not inflated by the few very expensive houses, so it represents the typical variability better than the standard deviation.

(b).Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.

1.Symmetrical distribution
2.Mean - Because the distribution is symmetrical.
3.Standard Deviation - Because there are no extreme values to distort it.

(c).Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.

1.Right skewed - Because most values are at or near 0 and a few are very high.
2.Median - Because the distribution is right skewed.
3.IQR - Because it is not inflated by the few students who drink excessively.

(d).Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.

1.Right skewed
2.Median - Because the distribution is right skewed.
3.IQR - Because it is not inflated by the few very high executive salaries.
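The effect of a few extreme values on these statistics can be checked with a small example. The sketch below uses made-up salaries (in thousands), not data from any of the exercises, to compare how the mean and standard deviation react to one very large value versus the median and IQR.

```r
# Hypothetical salaries in thousands (made-up values): five typical
# employees and one executive with a much higher salary.
salaries <- c(40, 45, 50, 55, 60, 1000)

center <- c(mean = mean(salaries), median = median(salaries))
spread <- c(sd = sd(salaries), IQR = IQR(salaries))
print(center)
print(spread)

# Dropping the executive barely moves the median and IQR,
# but changes the mean and standard deviation a lot.
typical <- salaries[salaries < 1000]
print(c(mean = mean(typical), median = median(typical)))
print(c(sd = sd(typical), IQR = IQR(typical)))
```

This is why the median and IQR are usually preferred for skewed distributions, while the mean and standard deviation suit symmetric ones.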

1.70 Heart transplants

heatTP <- read.csv("https://raw.githubusercontent.com/jbryer/DATA606Fall2016/master/Data/Data%20from%20openintro.org/Ch%201%20Exercise%20Data/heartTr.csv")
head(heatTP)

##   id acceptyear age survived survtime prior transplant wait
## 1 15         68  53     dead        1    no    control   NA
## 2 43         70  43     dead        2    no    control   NA
## 3 61         71  52     dead        2    no    control   NA
## 4 75         72  52     dead        2    no    control   NA
## 5  6         68  54     dead        3    no    control   NA
## 6 42         70  36     dead        3    no    control   NA

summary(heatTP)

##        id          acceptyear         age         survived 
##  Min.   :  1.0   Min.   :67.00   Min.   : 8.00   alive:28  
##  1st Qu.: 26.5   1st Qu.:69.00   1st Qu.:41.00   dead :75  
##  Median : 49.0   Median :71.00   Median :47.00             
##  Mean   : 51.4   Mean   :70.62   Mean   :44.64             
##  3rd Qu.: 77.5   3rd Qu.:72.00   3rd Qu.:52.00             
##  Max.   :103.0   Max.   :74.00   Max.   :64.00             
##                                                            
##     survtime      prior        transplant      wait       
##  Min.   :   1.0   no :91   control  :34   Min.   :  1.00  
##  1st Qu.:  33.5   yes:12   treatment:69   1st Qu.: 10.00  
##  Median :  90.0                           Median : 26.00  
##  Mean   : 310.2                           Mean   : 38.42  
##  3rd Qu.: 412.0                           3rd Qu.: 46.00  
##  Max.   :1799.0                           Max.   :310.00  
##                                           NA's   :34

(a).Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

From the mosaic plot, survival is not independent of transplant status: a noticeably larger share of patients is alive in the treatment group than in the control group.

mosaicplot(table(heatTP$transplant,heatTP$survived))

(b) What do the box plots suggest about the efficacy (effectiveness) of the heart transplant treatment?

From the box plots, the efficacy of the heart transplant treatment is not very good: most of the treated patients (45 of 69) died. However, the treatment group still did better than the control group.

(c) What proportion of patients in the treatment group and what proportion of patients in the control group died?

percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}

#-----------------------------

Total Death

ded_por <- (75/(34+69))
percent(ded_por)

## [1] "72.82%"

Treated Death

Tret_ded_por <- (45/(69))
percent(Tret_ded_por)

## [1] "65.22%"

Control Death

# The control group has 34 patients (see summary above), 30 of whom died
Cont_ded_por <- (30/(34))
percent(Cont_ded_por)

## [1] "88.24%"
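The same proportions can be read off a two-way table instead of being typed in by hand. The sketch below rebuilds the counts from the figures in this lab (45 deaths among 69 treated patients, 30 deaths among the 34 control patients shown in the summary); the matrix `counts` is constructed here for illustration rather than computed from the CSV.

```r
# Counts taken from the figures in this lab: 45 of 69 treated patients died,
# 30 of 34 control patients died; the remainder were alive.
counts <- matrix(c(4, 30,
                   24, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("control", "treatment"),
                                 survived = c("alive", "dead")))

# Row proportions: share of each group that was alive or dead.
row_props <- prop.table(counts, margin = 1)
print(round(100 * row_props, 2))
```

With the data loaded, `prop.table(table(heatTP$transplant, heatTP$survived), margin = 1)` should give the same row proportions.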

(d) One approach for investigating whether or not the treatment is effective is to use a randomization technique.

(i). What are the claims being tested?

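The claims being tested are whether the transplant is effective or not. Null hypothesis: survival is independent of whether the patient received a transplant. Alternative hypothesis: survival depends on whether the patient received a transplant, with transplant patients more likely to survive.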

(ii).The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.

We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at or below the observed difference (45/69 - 30/34, about -0.23). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
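The card-shuffling procedure can also be sketched in R. This is a minimal simulation under the counts given in this lab (28 alive, 75 dead, groups of 69 and 34, 45 and 30 deaths observed); the seed and the names `cards`, `one_shuffle`, `sim_diffs`, and `p_sim` are choices made for this illustration.

```r
set.seed(606)  # arbitrary seed, only so the sketch is reproducible

# One card per patient: 28 "alive" and 75 "dead".
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in death proportions (treatment - control).
obs_diff <- 45 / 69 - 30 / 34

# Shuffle the cards, deal 69 to "treatment" and 34 to "control",
# and return the difference in the proportion of dead cards.
one_shuffle <- function() {
  shuffled  <- sample(cards)
  treatment <- shuffled[1:69]
  control   <- shuffled[70:103]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat the shuffle 100 times, as in the description.
sim_diffs <- replicate(100, one_shuffle())

# Fraction of simulated differences at or below the observed one.
p_sim <- mean(sim_diffs <= obs_diff)
print(c(observed = obs_diff, fraction = p_sim))
```

A small fraction means that a difference as large as the observed one rarely arises from shuffling alone, which is evidence against the null hypothesis.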

(iii).What do the simulation results shown below suggest about the effectiveness of the transplant program?

The simulation results suggest that the transplant program is effective.
