Assignment #1

Raj Kumar

Exercise: 1.8

Q: What does each row of the data matrix represent?

A: Each row represents an individual observation, that is, one person in the study.

Q: How many participants were included in the survey?

A: 1691

Q: Indicate if each variable in the study is numeric or categorical.

A: Categorical variables: sex (nominal), marital (nominal), grossIncome (ordinal), smoke (nominal)

A: Numeric variables: age (discrete), amtWeekends (discrete), amtWeekdays (discrete)
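As a rough illustration of the classification above, the variable types could be encoded in R as follows. This is only a sketch: the three rows and the income labels ("low", "middle", "high") are made-up placeholders, not values from the actual survey; only the column names come from the exercise.

```r
# Toy data frame: the rows are hypothetical, only the column names are from the exercise
survey_toy <- data.frame(
  sex = factor(c("Female", "Male", "Male")),            # nominal
  marital = factor(c("Single", "Married", "Single")),   # nominal
  grossIncome = factor(c("low", "high", "middle"),
                       levels = c("low", "middle", "high"),
                       ordered = TRUE),                  # ordinal (placeholder labels)
  smoke = factor(c("Yes", "No", "No")),                  # nominal
  age = c(42L, 53L, 31L),                                # discrete numeric
  amtWeekends = c(12L, 0L, 0L),                          # discrete numeric
  amtWeekdays = c(10L, 0L, 0L)                           # discrete numeric
)

# Show the class of every column
sapply(survey_toy, function(col) class(col)[1])
```

Ordered factors keep the ranking of the income brackets, which is what distinguishes an ordinal variable from a nominal one.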

Exercise: 1.10 Cheaters, scope of influence

Q(a)

Answer: The population of interest is children between the ages of 5 and 15.

Q(b)

Answer:

- The study cannot be generalized, as the sample is not large enough.

- Also, since the study only focuses on students aged 5-15, it cannot be generalized to the wider population.

- The study cannot be used to establish a causal relationship.

Exercise: 1.28 Reading the Paper

Q(a)

Answer: The study does show that smoking is associated with a higher risk of dementia, but it cannot be concluded that smoking causes dementia. The study focused only on a sample of 50-60 year olds and was observational, and causation cannot be determined from an observational study. Not all smokers developed dementia, and other contributing factors may have been missed, such as the participant's age, genetics, exercise, and diet.

Q(b)

Answer: No, that statement is not justified. This is an observational study. Based on the study, sleep disorders might be associated with bullying, but association does not mean causation. I feel that sleep disorders and bullying might simply covary. A conclusion that can be drawn is: “The study shows that bullying might be associated with sleep disorders.”

Exercise: 1.36 Exercise and Mental Health

Q (a) What type of study is this?

This is an experiment using stratified sampling followed by simple random sampling.

Q (b) What are the treatment and control groups in this study?

There are 3 treatment groups and 3 control groups, one of each per age stratum. Everyone who was told to exercise regularly is part of a treatment group.

Q (c) Does this study make use of blocking?

This study does not block on variables such as sex. However, I feel the age groups used in this study act as a kind of blocking.
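To make the idea of blocking by age group concrete, here is a minimal sketch of randomized assignment within blocks. The participant table, the age-group labels, and the group sizes are hypothetical and chosen only for illustration.

```r
set.seed(1)  # for a reproducible shuffle

# Hypothetical participants: 4 people in each of 3 age blocks
participants <- data.frame(
  id = 1:12,
  age_group = rep(c("young", "middle", "older"), each = 4)
)

# Within each age block, randomly assign half to treatment and half to control
participants$group <- NA_character_
for (block in unique(participants$age_group)) {
  rows <- which(participants$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$group[rows] <- sample(labels)
}

# Every block should contain the same number of treatment and control members
block_counts <- table(participants$age_group, participants$group)
block_counts
```

Because the shuffle happens separately inside each age group, age cannot end up unevenly split between treatment and control.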

Q (d) Does this study use blinding?

No. This study does not use blinding, as both the treatment and control groups know about their treatment.

Q (e) Can the study be used to establish a causal relationship between exercise and mental health?

I am not sure if the elements of this study are enough to establish a causal relationship between mental health and exercise. I agree that exercise could be an important factor, but I am a bit confused, as there could be additional contributing factors such as the patients' medical history, history of abuse, or economic climate.

Dear Professor, could you please help explain this to me? In any study, there can always be other factors that were not considered but which could be important for the study. Can we establish causal relationships in such cases?

Q (f) Would you have any reservations about the study proposal?

This is a good study, but it does not take into account many important variables, without which determining causation would be incorrect.

Before funding this study, I would like to add additional variables such as:

- severity_of_mental_illness

- current_mental_health

- current_physical_health

- history_of_abuse

- exercise_time

- etc

Also, we need to make sure the sample size is large enough to support a causal conclusion.

Exercise: 1.48

Create a box plot of distribution of these scores

exam_scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Summary
summary(exam_scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
#Mean
mean(exam_scores)
## [1] 77.7
#IQR
IQR(exam_scores)
## [1] 9.5
#Standard Deviation
sd(exam_scores)
## [1] 8.442374

Box plot of distributions

boxplot(exam_scores)
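To see which scores the box plot would flag as outliers, the usual 1.5 × IQR rule can be applied by hand. This is a sketch using the quartiles reported by `summary()`; note that `boxplot()` itself is based on hinges from `boxplot.stats()`, which can differ slightly from these quartiles, so the comparison at the end is included as a check.

```r
exam_scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles and IQR, matching the summary() output above
q1 <- quantile(exam_scores, 0.25, names = FALSE)
q3 <- quantile(exam_scores, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Scores outside the fences are drawn as individual points
outliers <- exam_scores[exam_scores < lower_fence | exam_scores > upper_fence]
outliers

# The points that boxplot() would actually mark, for comparison
boxplot.stats(exam_scores)$out
```

Both approaches flag only the lowest score, 57, so the box plot shows a single low outlier and no high ones.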

Box plot of distributions using ggplot2

library(ggplot2)
scoredf <- data.frame(exam_scores)
# Refer to the column by name inside aes(); the empty x value gives a single box
ggplot(scoredf, aes(x = "", y = exam_scores)) +
    geom_boxplot() +
    theme_bw()

Exercise: 1.50

Histogram (a): Can be matched with Box Plot (2) because

- The values range from 50-70

- The mean looks to be around 60

- Q1 is around 57 and Q3 is around 63 as most values are near the median

Histogram (b): Can be matched with Box Plot (3) because

- The values range from 0-100

- The mean and median look to be around 50

- Q1 is around 25 and Q3 is around 75

Histogram (c): Can be matched with Box Plot (1) because

- The values range from 0-6

- The median looks to be around 1

- Q1 is around 1 and Q3 is around 2

- also since this is the last one left … LoL!!

Exercise: 1.56: Distributions and appropriate statistics

(a):

- The distribution would be skewed to the right, as more than 75% of houses cost below $1 million

- The observations are best represented by the median, because the houses that cost above $6 million would pull the mean up a lot. There would be a significant number of outliers in the data

- The variability of the observations is best represented by the IQR. SD is only better when the data follow a roughly symmetric bell curve

(b):

- The distribution would be mostly symmetric, as nearly all the data falls in well-defined quartiles. There would be some skew to the right, but not a significant amount.

- The observations are best represented by the mean or the median. This would depend on the size of the sample. If the sample is large, then a few houses above $1.2M would not cause a big difference between the median and the mean.

- The variability of the observations is best represented by the SD or the IQR, as the data would look like a bell curve.

(c):

- The distribution would be heavily skewed to the right: the number of students who drink 1-2 drinks would be very large, the number who drink 3-4 drinks would be smaller, and since most students don't drink excessively, the number who drink 5-6 drinks might be very small, forming a long right tail.

- The observations are best represented by the median

- The variability of the observations is best represented by the IQR, as the data is heavily skewed.

(d):

- The distribution would be skewed to the right. Lower-wage employees would be highest in number, followed by slightly higher-wage managers, followed by very high-wage executives forming the long right tail…. (income disparity someone!!) LoL!

- The observations are best represented by the median

- The variability of the observations is best represented by the IQR. Heavily skewed data = use IQR (my rule of thumb!)
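The rule of thumb above (skewed data: prefer median and IQR) can be checked with a small simulation. This is only a sketch: the salaries are simulated from a log-normal distribution with made-up parameters, and the single executive salary is a hypothetical value added to show the effect of one extreme observation.

```r
set.seed(42)  # for reproducibility

# Simulated right-skewed salaries (hypothetical parameters, not real data)
salaries <- rlnorm(1000, meanlog = log(50000), sdlog = 0.8)

# In right-skewed data the mean sits above the median
center_before <- c(mean = mean(salaries), median = median(salaries))
spread_before <- c(sd = sd(salaries), iqr = IQR(salaries))

# Add one very high hypothetical executive salary and recompute
salaries_exec <- c(salaries, 1e7)
center_after <- c(mean = mean(salaries_exec), median = median(salaries_exec))
spread_after <- c(sd = sd(salaries_exec), iqr = IQR(salaries_exec))

# How much each statistic moved because of the single extreme value
center_after - center_before
spread_after - spread_before
```

The mean and SD shift noticeably after adding one extreme salary, while the median and IQR barely move, which is why they are the more robust choice for skewed data.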

Exercise: 1.70 Heart Transplant

(a) Is survival independent of transplant ?

Based on the numbers, approximately 35% of the people who got a transplant survived, while only about 12% of those who did NOT get a transplant survived. This suggests that survival is not independent of transplant.

treatment_total <- 69
treatment_survived <- 69-45
control_total <- 34
control_survived <- 4
treatment_survived_prop <- (69-45)/69
treatment_survived_prop
## [1] 0.3478261
control_survived_prop <- 4/34
control_survived_prop
## [1] 0.1176471
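The same counts can also be arranged as a contingency table, which makes the row proportions and the observed difference in survival easy to read off. This is a sketch; the object names are illustrative, and the number of deaths in the control group (30) is derived as 34 - 4.

```r
# 2 x 2 table of outcomes by group, built from the counts above
outcomes <- matrix(
  c(69 - 45, 45,   # treatment: alive, dead
    4, 34 - 4),    # control:   alive, dead
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("treatment", "control"),
                  outcome = c("alive", "dead"))
)

# Proportion alive and dead within each group (each row sums to 1)
row_props <- prop.table(outcomes, margin = 1)
round(row_props, 3)

# Observed difference in survival proportions, treatment minus control
observed_diff <- row_props["treatment", "alive"] - row_props["control", "alive"]
observed_diff
```

The difference of roughly 0.23 is the observed statistic that a randomization test would compare against chance.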

(b)

The box plots show that

- The survival time of people who did not get a transplant was very low (a few days?)

- The median survival time of people who did get a transplant was 150+ days, and survival went up to 1500 days (4+ years)

- Even the outliers among people who did not get a transplant survived relatively few days. Only one outlier seemed to have survived 1500 days.

(c) Proportions

- The proportion of people in the treatment group who died was 65% (45 of 69)

- The proportion of people in the control group who died was 88% (30 of 34)

(d) (i) What are the claims being tested

The claim being tested is that the experimental heart transplant increases lifespan.

(d) (ii) Fill in the blanks

- We write alive on 28 cards representing patients who were alive at the end of the study

- dead on 75 cards representing patients who were not

- one group of size 69 representing treatment

- another group of size 34 representing control

- build a distribution centered at 0.

- proportions are equal/similar to the study

(d) (iii) Simulation results

The simulation results suggest that the null hypothesis (that there is no difference between the treatment and control groups) should be rejected. A difference of 23 percentage points between the two groups is unlikely to arise by chance alone. So the heart transplant does appear to increase life span.
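The card-shuffling procedure from (d)(ii) can be sketched in R as a randomization test. This is an illustrative sketch rather than the textbook's own simulation: the seed, the number of repetitions (10,000), and the function name `simulate_diff` are arbitrary choices.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# 28 "alive" cards and 75 "dead" cards, as described above
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in survival proportions (treatment minus control)
observed_diff <- 24 / 69 - 4 / 34

# Shuffle the cards, deal 69 to treatment and 34 to control,
# and return the difference in the proportion alive
simulate_diff <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:69]
  control <- shuffled[70:103]
  mean(treatment == "alive") - mean(control == "alive")
}

# Null distribution of the difference; it should be centered near 0
sim_diffs <- replicate(10000, simulate_diff())

# Share of shuffles with a difference at least as large as the observed one
# (a small tolerance guards against floating-point ties)
p_value <- mean(sim_diffs >= observed_diff - 1e-9)
p_value

hist(sim_diffs, main = "Simulated differences under independence",
     xlab = "Difference in proportion alive")
abline(v = observed_diff, lty = 2)
```

If only a small fraction of shuffles produce a difference as large as the observed one, chance alone is an unlikely explanation for the gap between the groups.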
