This document is the Data 606 Homework-1 for the Fall-2018 semester. The homework consists of 2 parts - Practice and Graded.

Part-A: Practice. The practice questions are answered below. 1.7 - Fisher’s Irises

data("iris")
print("Iris data set looks like below")
## [1] "Iris data set looks like below"
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  (a) How many cases were included in the data? Answer: 150
  (b) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete. Answer: 4 numerical variables, all continuous:
  1. sepal length
  2. sepal width
  3. petal length
  4. petal width
  (c) How many categorical variables are included in the data, and what are they? List the corresponding levels (categories). Answer: 1 categorical variable, Species, with 3 levels: setosa, versicolor and virginica.
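As a quick check of the answers above, here is a minimal sketch in base R that counts the cases and separates the numerical columns from the categorical one. The variable names `n_cases`, `numeric_vars` and `factor_vars` are only illustrative and are not part of the original homework.

```r
data("iris")

# Number of cases (rows) in the data set
n_cases <- nrow(iris)

# Names of the numerical columns and of the categorical (factor) columns
numeric_vars <- names(iris)[sapply(iris, is.numeric)]
factor_vars  <- names(iris)[sapply(iris, is.factor)]

print(n_cases)
print(numeric_vars)
print(factor_vars)

# Levels (categories) of the categorical variable
print(levels(iris$Species))
```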

1.9 Air pollution and birth outcomes, scope of inference. (a) Identify the population of interest and the sample in this study. Answer: Population of interest - all births in Southern California. Sample in the study - the 143,196 births between 1989 and 1993 for which data was collected.

  (b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships. Answer: If the data was collected randomly and without bias from all types of people living in Southern California, then the results can be generalized to the whole population. The sample is also large, and a bigger sample generally supports generalization better, although size alone does not remove bias. The study can show an explanatory and response relationship: if there is an association between the 2 variables, air pollution and preterm births, then air pollution is the explanatory variable and preterm birth is the response. But because the data was only observed, an association does not establish a causal relationship.

1.23 Haters are gonna hate, study confirms. (a) What are the cases? Answer: The 200 men and women surveyed. Each case (each row of the dataset) holds one person’s ratings for the subjects and for the microwave.

(b) What is (are) the response variable(s) in this study? Answer: The rating each participant gave the microwave is the response variable. The researchers concluded that people’s ratings of the microwave are associated with their ratings of the subjects, so the ratings for the subjects are the explanatory variables and the rating for the microwave is the response variable.

(c) What is (are) the explanatory variable(s) in this study? Answer: The ratings for the subjects are the explanatory variables, as explained in the previous answer.

(d) Does the study employ random sampling? Answer: Yes, it is an example of random sampling.

(e) Is this an observational study or an experiment? Explain your reasoning. Answer: This is an experiment. The researchers randomly picked 200 people and then asked them questions. Any scenario where the researcher decides the explanatory variables and gives them to the sample to perform / evaluate is an experiment. An observational study generally uses data collected from a scenario that is already complete or already happening.

(f) Can we establish a causal link between the explanatory and response variables? Answer: No, a causal link cannot be established, even though this is a clear case of an explanatory and response relationship.

(g) Can the results of the study be generalized to the population at large? Answer: Yes, the results can be generalized, with exceptions.

1.33 (a) What type of study is this? Answer: This is an experiment, as the study applies treatments.

(b) How many factors are considered in this study? Identify them, and describe their levels. Answer: 2 factors, light and noise, each with 3 levels. Light levels - fluorescent overhead lighting, yellow overhead lighting, no overhead lighting (only desk lamps). Noise levels - no noise, construction noise, and human chatter noise.

(c) What is the role of the sex variable in this study? Answer: According to the researchers, light and noise might have different effects on males and females. Hence they included an equal number of males and females in the experiment. This allows them to study how light and noise affect the 2 sexes differently.
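To see how the light and noise conditions combine into treatments, here is a small sketch using `expand.grid`. It assumes every light condition is paired with every noise condition, which the answer above does not state explicitly. The object names `light`, `noise` and `treatments` are only illustrative.

```r
# Conditions listed in the answer above
light <- c("fluorescent overhead", "yellow overhead", "desk lamps only")
noise <- c("no noise", "construction noise", "human chatter")

# Assumption: the design is fully crossed, so each light condition
# is combined with each noise condition
treatments <- expand.grid(light = light, noise = noise,
                          stringsAsFactors = FALSE)

n_treatments <- nrow(treatments)
print(treatments)
print(n_treatments)
```

Under that assumption there are 3 x 3 = 9 treatment combinations.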

1.55 (a) Number of pets per household. Answer: This will be symmetric, as the number of pets does not vary much between households. The mean will represent a typical observation, as it gives the average number of pets in the sampled households. The standard deviation will be a good measure of variability, as this is a very symmetric dataset and an extreme value, such as 10 pets or more, is very unlikely.

(b) Distance to work, i.e. number of miles between work and home. Answer: This will most probably be a right skewed distribution, meaning very few people commute from very far away. The median and IQR are good choices because they are less affected by the skew.

(c) Heights of adult males. Answer: Symmetrical, as there will be very few adult males who are very short and similarly few who are very tall. The median and IQR are better options here, as most adult male heights fall within a limited range.
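The reasoning for the right skewed case can be illustrated with simulated data. The sketch below draws hypothetical commute distances from an exponential distribution. Both the distribution and the average of 10 miles are assumptions made only for illustration, not data from the exercise.

```r
set.seed(606)

# Hypothetical commute distances in miles (assumed right skewed:
# many short commutes, a few very long ones)
distance <- rexp(1000, rate = 1 / 10)

# Measures of center: the long right tail pulls the mean above the median
center <- c(mean = mean(distance), median = median(distance))

# Measures of spread: the SD is inflated by the tail, the IQR is not
spread <- c(sd = sd(distance), IQR = IQR(distance))

print(round(center, 2))
print(round(spread, 2))
```

In a sample like this the mean lands above the median, which is why the median and IQR describe a typical commute better than the mean and standard deviation.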

1.69 (a) (i) False. The statement compares absolute counts, not rates. So irrespective of the final conclusion, the reasoning is not correct. The rate of cardiovascular problems in the patients on Pioglitazone treatment is 5,386 / 159,978. Similarly, the rate of cardiovascular problems in the patients on Rosiglitazone treatment is 2,593 / 67,593.

(ii) True

(iii) False. The reasoning statement is true on its own, and the conclusion statement is also true on its own. But the reasoning given is not the actual reason for the serious cardiovascular problems in the patients using Rosiglitazone. This is caused by Rosiglitazone itself.

(iv) True

(b) 7,979 / 227,571
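The fractions quoted above can be evaluated in R to compare the rates directly. This is a minimal sketch using only the counts given in the answers. The variable names are illustrative.

```r
# Counts quoted in the answers above
cv_pio  <- 5386;  n_pio  <- 159978   # Pioglitazone
cv_rosi <- 2593;  n_rosi <- 67593    # Rosiglitazone

# Rate of cardiovascular problems within each treatment
rate_pio  <- cv_pio / n_pio
rate_rosi <- cv_rosi / n_rosi

# Overall rate, regardless of treatment
rate_overall <- (cv_pio + cv_rosi) / (n_pio + n_rosi)

print(round(c(pioglitazone = rate_pio,
              rosiglitazone = rate_rosi,
              overall = rate_overall), 4))
```

The rates come out to roughly 3.4% for Pioglitazone and 3.8% for Rosiglitazone, so the larger absolute count for Pioglitazone does not mean a higher rate.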

Graded Problems: 1.8 (a) Each row of the data matrix represents a case.

(b) 1691 participants were included in the survey.

(c) The variables and their types:
  1. sex - categorical, nominal
  2. age - numerical, continuous
  3. marital - categorical, ordinal
  4. grossIncome - categorical, ordinal
  5. smoke - categorical, ordinal
  6. amtWeekends - numerical, discrete
  7. amtWeekdays - numerical, discrete

1.10 (a) Population of interest - Children. Sample in the study - 160 children between the ages of 5 and 15.

(b) The results of the study can be generalized to the population if the sample was selected randomly. The findings can show whether there is a correlation between the variables, or an explanatory and response relationship. But a causal relationship might not be established.

1.28 (a) Yes, this is a causal relationship. As the study results clearly show, pack-a-day smokers have a 37% higher chance of having dementia, and 2-packs-a-day smokers have a 44% higher chance. This clearly means that the more a person is addicted to cigarettes, the more likely he or she is to have dementia later in life.

(b) No, the first statement is not justified. Disruptive behaviour or bullying and the rate of sleep disorders might be correlated or associated, but that does not imply a causal relationship.

No causal relationship can be derived from this study. But the 2 variables are associated. The children who are subject to bullying in school are more prone to take this erratic behaviour to their beds.

1.36 (a) This is an experiment. The people were split into groups before the study, the 2 groups were asked to perform / not to perform a certain set of tasks, and then the results were studied.

(b) Treatment group - the group of people who were asked to exercise twice a week. Control group - the group of people who were asked not to exercise.

(c) Yes, this study uses blocking. The blocking variable is age. The 3 blocks are the age groups 18-30, 31-40 and 41-55. This ensures that the 2 groups have representatives from all 3 blocks, as age is also a factor in how important exercising is to remain healthy.

(d) This study does not use blinding. Blinding means that the participants do not know during the course of the study whether they are in the treatment group or the control group. This is generally done using a placebo. In this study, the treatment group knows that they are being asked to exercise, which is the treatment, and the control group knows that they are the control group, as they are asked not to exercise.

(e) Yes, if the results signify a correlation between mental health and exercise, then a causal relationship can be suggested. The study results can be generalized to the whole population of adults aged 18 - 55, as the sample has good representation from multiple sub-groups within this age range.

(f) The mental health of a person also depends on how intensively her or his profession uses mental capability. So, to make sure the study correctly determines how exercise affects mental health, the groups and the blocks should have equal representation from the professions of the people involved in the study. Also, the sample should be large enough to generalize the study conclusion to a bigger population.
Apart from these points, I think the study will give relevant conclusions for the intended question.
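The blocking described in part (c) can be sketched as a random assignment carried out separately inside each age block. The participant table below is hypothetical: the 30 people per block and the column names `id`, `age_block` and `group` are assumptions for illustration only, not numbers from the exercise.

```r
set.seed(606)

# Hypothetical participants: 30 per age block (block sizes are an assumption)
participants <- data.frame(
  id = 1:90,
  age_block = rep(c("18-30", "31-40", "41-55"), each = 30)
)

# Within each age block, shuffle an equal number of 1s and 2s so that
# half of the block goes to each group
group_code <- ave(
  participants$id, participants$age_block,
  FUN = function(ids) sample(rep(c(1, 2), length.out = length(ids)))
)
participants$group <- ifelse(group_code == 1, "treatment", "control")

# Each block should contribute equally to both groups
assignment <- table(participants$age_block, participants$group)
print(assignment)
```

Randomizing inside each block, rather than over the whole sample, guarantees that every age group is equally represented in the treatment and control groups.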

1.48 - Provided as a separate attachment, which shows a box plot drawn by hand and then scanned.

1.50 (a) Symmetrical - This goes with box plot # 3 (b) Almost Symmetrical - This goes with box plot # 2 (c) Right skewed - This goes with box plot # 1

1.56 (a) Survival and whether or not the patient got the treatment are dependent. In the mosaic plot, the number of patients in the treatment group is almost twice the number of patients in the control group. Moreover, the proportion alive is considerably smaller within the control group than within the treatment group. So a patient’s chances of survival were much higher in the treatment group than they would have been in the control group. Hence, survival depends on the group.

  (b) The box plot gives a fair idea of how long the patients survived in the 2 groups - control group and treatment group. The treatment group patients lived longer, even though all the patients were gravely ill. In the control group, the patients who did not get a transplant, most died within 100 days. Of the patients who got the treatment / transplant, 25% survived for another 200 to 650 days, while another 25% survived for 650 to 1450 days. It is very clear that the patients who underwent the transplant increased their survival time, so they could spend at least some more time with their loved ones. A few of the treatment patients even survived for more than 5 years, while no patient in the control group survived that long.

  (c) By the end of the study (going by the mosaic plot), around 33% of the patients from the treatment group were still alive. At the same point, only 12% of the patients from the control group had survived.

  (d) (i) Competing claims: H0 - Whether or not a patient underwent the transplant has no bearing on whether the patient survives longer or dies early. H1 - Whether or not a patient underwent the transplant clearly affects the chances that the patient lives longer or dies earlier.

  (ii) We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at approximately 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the observed difference. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

  (iii) The distribution of these simulated differences is centered around -0.025.
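The card-shuffling procedure described above can be sketched in R. The card counts (28 alive, 75 dead) and the 100 repetitions come from the description. The group sizes are taken as 69 for treatment and 34 for control, an assumption based on the earlier remark that the treatment group is almost twice the size of the control group. The names `simulate_diff` and `fraction_as_extreme` are hypothetical helpers, and the observed difference is deliberately left as an input because it is not stated in this document.

```r
set.seed(606)

# One card per patient: 28 alive and 75 dead
cards <- c(rep("alive", 28), rep("dead", 75))

# Assumption: treatment is the larger group (69), control the smaller (34)
n_treatment <- 69
n_control   <- 34

# Shuffle the cards, deal them into the two groups, and return the
# difference in the proportion of dead cards (treatment - control)
simulate_diff <- function() {
  shuffled  <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control   <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat the shuffle 100 times to build the null distribution
sim_diffs <- replicate(100, simulate_diff())
print(summary(sim_diffs))

# Hypothetical helper: fraction of simulated differences at or below an
# observed difference supplied by the reader (a negative value would mean
# fewer deaths in the treatment group)
fraction_as_extreme <- function(observed) mean(sim_diffs <= observed)
```

Because the shuffling ignores which group a patient was really in, the simulated differences scatter around 0. A small value from `fraction_as_extreme()` for the study's observed difference would count as evidence against H0.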