MSDS Spring 2018

DATA 606 Statistics and Probability for Data Analytics

Jiadi Li

Chapter 1: Introduction to Data

HW 1: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70 (use the library(openintro); data(heartTr) to load the data)

1.8 Smoking habits of UK residents

Each row of the data matrix represents one observation (a participant in the survey).
There are 1,691 participants (the number of rows).
Variables:
sex - categorical (not ordinal); age - numerical (continuous);
marital - categorical (not ordinal); grossIncome - categorical (ordinal);
smoke - categorical (not ordinal); amtWeekends - numerical (discrete, a count of cigarettes);
amtWeekdays - numerical (discrete, a count of cigarettes).
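
The classification above can be checked mechanically once a data frame is loaded. The sketch below is only an illustration: the data frame `toy` and all of its values are made up (they are not the real survey data), and only the column names follow the exercise. It reports whether R stores each column as an ordered factor, a plain factor, or a number.

```r
# Hypothetical toy rows using the variable names from the exercise;
# the values are made up and are NOT the real survey data.
income_levels <- c("low", "middle", "high")  # assumed ordering, for illustration
toy <- data.frame(
  sex         = factor(c("Female", "Male", "Male")),
  age         = c(42, 44, 53),
  marital     = factor(c("Single", "Single", "Married")),
  grossIncome = factor(c("low", "high", "middle"),
                       levels = income_levels, ordered = TRUE),
  smoke       = factor(c("Yes", "No", "Yes")),
  amtWeekends = c(12, NA, 6),
  amtWeekdays = c(12, NA, 6)
)

n_obs <- nrow(toy)  # one row per participant

# Label each column: ordered factor -> ordinal, factor -> categorical, else numerical
var_type <- sapply(toy, function(col) {
  if (is.ordered(col)) {
    "categorical (ordinal)"
  } else if (is.factor(col)) {
    "categorical (not ordinal)"
  } else {
    "numerical"
  }
})
print(var_type)
```

R cannot tell discrete from continuous numerical variables on its own, so that part of the classification still has to come from knowing what each column measures.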

1.10 Cheaters, scope of inference

Population of interest: all children between the ages of 5 and 15.
Sample in the study: the 160 children who took part in the research.
The results should not be generalized: many other variables need to be taken into consideration, and the sample size needs to be larger.

1.28 Reading the Paper

Yes, assuming that the adjustment for other factors adequately accounted for the effects of aging. The conclusion does not report a percentage of participants; rather, it reports how much the likelihood increased at each smoking level.
No, the data collection method is problematic, since only parents' and teachers' observations were collected. Even if the two conditions are positively correlated, a causal relationship should not be established based merely on the description: bullying might lead to sleep disorders instead.

1.36 Exercise and mental health

It’s an experiment.
The treatment group consists of those assigned to exercise twice a week; the control group consists of those assigned not to exercise.
Yes, by separating the participants into three groups based on age, the experiment uses age as a blocking variable.
No, since the participants know whether they were assigned to exercise, blinding is not used.
No, since the amount of exercise the participants did before the study is not taken into consideration. Moreover, their physical health is also a factor. Because these factors are unaccounted for, the results should not be generalized yet.
No, since the experiment is not well designed.
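
Blocking on age means randomizing separately inside each age group, so every group contributes equally to both arms. Below is a minimal sketch under stated assumptions: the roster, the block labels ("younger", "middle", "older"), the block size of 6, and the seed are all hypothetical and are not taken from the study.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Hypothetical roster: 6 participants in each of three age blocks
participants <- data.frame(
  id        = 1:18,
  age_group = rep(c("younger", "middle", "older"), each = 6)
)

# Within each block, shuffle a balanced set of arm codes (1 or 2)
arm_code <- ave(
  participants$id, participants$age_group,
  FUN = function(ids) sample(rep(c(1, 2), length.out = length(ids)))
)
participants$arm <- c("treatment", "control")[arm_code]

# Every age block should be split evenly between the two arms
assignment_table <- table(participants$age_group, participants$arm)
print(assignment_table)
```

Randomizing within blocks, rather than across the whole roster at once, guarantees that age is balanced between the treatment and control groups.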

1.48 Stats scores

scores <- c(57,66,69,71,72,73,74,77,78,79,79,81,81,82,83,83,88,89,94)
summary(scores)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.50   79.00   77.68   82.50   94.00

boxplot(scores)
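
To see which points the box plot would flag as outliers, the usual 1.5 x IQR fences can be computed by hand. This is a sketch using the quartiles that `summary()` reports; `boxplot()` itself works from hinges (see `fivenum()`), which can differ slightly from these quartiles on other data sets. The variable names are illustrative.

```r
scores <- c(57,66,69,71,72,73,74,77,78,79,79,81,81,82,83,83,88,89,94)

q   <- quantile(scores, c(0.25, 0.75))  # same quartiles summary() reports
iqr <- unname(q[2] - q[1])

# Conventional fences: 1.5 * IQR beyond the quartiles
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[2] + 1.5 * iqr)

# Scores outside the fences are drawn as individual points
outliers <- scores[scores < lower_fence | scores > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

On these scores the fences come out at 57.5 and 97.5, so the minimum score of 57 falls just below the lower fence and is the only point flagged.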

1.50 Mix-and-match

(a) Close to a normal distribution with very few outliers, centered at x = 60 and ranging from 50 to 72. It should be matched with (2).
(b) Close to a uniform distribution, ranging from x = 0 to x = 100. It should be matched with (3).
(c) Skewed to the right, with many outliers on the right side, ranging from x = 0 to x = 8. It should be matched with (1).

1.56 Distributions and appropriate statistics

Right skewed: median and IQR, because there is a meaningful number of high-priced houses while 75% cost below $1,000,000.
The data should be symmetric with very few outliers, so the mean and standard deviation are better indicators.
Right skewed: median and IQR, because most students don’t drink at all.
Right skewed: median and IQR, because only a few high-level executives earn much higher salaries than all the other employees.
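
The reason for preferring the median and IQR on skewed data is that a few extreme values pull the mean and standard deviation much more than they pull the median and IQR. The sketch below illustrates this with made-up numbers; the vectors `base` and `skewed` and the helper `describe()` are hypothetical and unrelated to the exercise data.

```r
# Hypothetical, roughly symmetric values (made up for illustration)
base <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)

# Same values plus one extreme observation, giving a long right tail
skewed <- c(base, 40)

# Center and spread, measured both ways
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

comparison <- rbind(base = describe(base), skewed = describe(skewed))
print(round(comparison, 2))
```

In this example the single extreme value nearly doubles the mean and inflates the standard deviation several times over, while the median stays the same and the IQR moves only slightly.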

1.70 Heart Transplants

Survival is not independent of treatment, based on the mosaic plot: the survival rate is higher for those who received the treatment.
The box plots suggest that the treatment is helpful, since survival times in the treatment group are much longer.
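
The mosaic plot reading can be backed up with row proportions. The sketch below is hedged: the counts are typed in from the textbook exercise as an assumption rather than computed from the loaded data, and the `heartTr` column names mentioned in the comment (`transplant`, `survived`) are also assumed, so both should be verified after running `data(heartTr)`.

```r
# Counts assumed from the textbook exercise; verify against the data, e.g.
# table(heartTr$transplant, heartTr$survived) (column names assumed).
counts <- matrix(
  c( 4, 30,   # control:   alive, dead
    24, 45),  # treatment: alive, dead
  nrow = 2, byrow = TRUE,
  dimnames = list(group   = c("control", "treatment"),
                  outcome = c("alive", "dead"))
)

# Row proportions: share of each group that survived or died
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# If survival were independent of treatment, these two would be about equal
survival_rate <- row_props[, "alive"]
```

With these counts roughly 12% of the control group and 35% of the treatment group survived, which is the gap the mosaic plot displays. A difference in proportions alone does not rule out chance; that would need a formal test.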