DATA606 - Week 1 Homework

1.8 - Smoking habits of UK residents

(a) What does each row of the data matrix represent?

Each row is an observation, in this case an individual resident.

(b) How many participants were included in the survey?

There are a total of 1691 participants.

(c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

Sex is a nominal categorical variable
Age is a discrete numerical variable (it does not appear that there are fractional ages represented)
Marital is a nominal categorical variable
GrossIncome is a continuous numerical variable
Smoke is a nominal categorical variable
AmtWeekends is a discrete numerical variable
AmtWeekdays is a discrete numerical variable

1.10 - Cheaters, scope of inference

(a) Identify the population of interest and the sample in this study.

The population of interest appears to be children aged 5 to 15. The sample is simply the 160 children who took part in the study.

(b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

The results can be generalized to the population only if the participants in the sample were chosen randomly and generally represent the population at large. As for causality, provided the children were randomly assigned to the experimental conditions, the findings could indeed be used to establish a causal relationship.

1.28 - Reading the paper

(a) Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.

I don’t think that we can conclude anything from the brief excerpt. We would need several more pieces of information to draw that conclusion, such as how the study was conducted, who the participants were, and how the “…adjusting for other factors…” was done.

(b) A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?

No, that statement is not justified by the brief excerpt. Some other variable that was not controlled for could affect both sleep and behavior. At best, this shows a correlation between bullying behavior and trouble sleeping; no causality should be inferred.

1.36 - Exercise and mental health

(a) What type of study is this?

This is an experiment.

(b) What are the treatment and control groups in this study?

The treatment group consists of the participants who were instructed to exercise twice a week. The control group consists of those who were instructed not to exercise.

(c) Does this study make use of blocking? If so, what is the blocking variable?

No, this study doesn’t seem to make use of blocking.

(d) Does this study make use of blinding?

No, not according to the text.

(e) Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

No, I don’t believe that the study, as described, can be used to establish a causal relationship between exercise and mental health, mostly because there don’t seem to be any controls in place for confounding variables. For example, some participants may have jobs requiring a good deal of manual labor, which could influence the results depending on which group they were placed in. This is also why I don’t feel that the results can be generalized to the population at large.

(f) Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

In its current form, yes, I would have reservations. I would require that the approach be refined to control for more variables.

1.48 - Stats scores

Below are the final exam scores of twenty introductory statistics students.

scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

Create a box plot of the distribution of these scores. The five number summary provided below may be useful.

boxplot(scores, main = "Intro to Statistics - Final Exams", ylab = "Score")
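The exercise mentions a five number summary that is not reproduced here. As a sketch, it can be recomputed from `scores` with base R's `fivenum()`, which returns the hinges that `boxplot()` draws, while `boxplot.stats()` reports any points that fall beyond the whiskers. The variable names `five` and `outliers` are introduced here for illustration and are not part of the exercise.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey's five number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(scores)
print(five)

# Points lying more than 1.5 * (hinge spread) beyond the box; boxplot() draws these as dots
outliers <- boxplot.stats(scores)$out
print(outliers)
```

Note that `quantile(scores)` uses a different default interpolation, so its quartiles can differ slightly from the hinges used by the box plot.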

1.50 - Mix and match.

Describe the distribution in the histograms below and match them to the box plots.

Histogram (a) is unimodal and much of the data appears to be near 60 (small IQR) with little to no skew (evenly balanced whiskers and outlier dots). This matches boxplot (2) well.
Histogram (b) is multimodal and the data is evenly distributed throughout the entire range (large IQR) with no discernible skew. This matches boxplot (3) very well.
Histogram (c) is unimodal and the data seems centered between 1 and 2 with a significant positive skew. This matches boxplot (1) quite nicely (notice the positive outliers indicating a skew).

1.56 - Distributions and appropriate statistics, Part II

For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.

(a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.

I would expect this distribution to be right (positively) skewed. The IQR here is $650,000 ($1,000,000 - $350,000), so by the 1.5 × IQR rule any house priced above $1,975,000 ($1,000,000 + 1.5 × $650,000) would be an outlier. Since there are “…a meaningful number of houses that cost more than $6,000,000”, the median would be more representative of a typical observation, and the IQR would be more appropriate to describe variability.
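The outlier cutoff in (a) comes from the usual 1.5 × IQR rule, under which the upper fence is Q3 + 1.5 × IQR. A minimal sketch of that arithmetic, using only the quartiles given in the question; `upper_fence` is a helper name introduced here for illustration.

```r
# Upper outlier fence under the 1.5 * IQR rule (helper name is illustrative)
upper_fence <- function(q1, q3) {
  q3 + 1.5 * (q3 - q1)
}

q1 <- 350000; med <- 450000; q3 <- 1000000  # quartiles given in part (a)

iqr_a <- q3 - q1                # interquartile range
fence_a <- upper_fence(q1, q3)  # prices above this count as outliers

# Distance from the median to each quartile; a much longer upper gap suggests right skew
gaps_a <- c(lower = med - q1, upper = q3 - med)

print(iqr_a)
print(fence_a)
print(gaps_a)
```

Houses above $6,000,000 sit far beyond this fence, which is why the median and IQR are the safer summaries here.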

(b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.

This seems to be the opposite of (a) above. The quartiles are evenly spaced around the median, and the IQR is $600,000, which puts the outlier cutoff at $1.8M ($900,000 + 1.5 × $600,000). Since very few houses cost even more than $1.2M, the data appear to have no real positive skew and the distribution should be roughly symmetric. Thus, the mean would be a good measure of a “typical” house and the standard deviation would be a good measure of variability.
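For (b), the same quartile arithmetic can be used to check symmetry as well as the outlier cutoff. This is only a sketch based on the three quartiles stated in the question; the variable names are illustrative.

```r
q1 <- 300000; med <- 600000; q3 <- 900000  # quartiles given in part (b)

# Equal gaps on both sides of the median are consistent with a symmetric distribution
gaps_b <- c(lower = med - q1, upper = q3 - med)

# Upper outlier fence under the 1.5 * IQR rule
fence_b <- q3 + 1.5 * (q3 - q1)

print(gaps_b)
print(fence_b)
```

Very few houses cost more than $1,200,000, so almost nothing reaches the fence, which supports using the mean and standard deviation.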

(c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.

Since, in this example, few college students drink alcohol, most observations would be zero. Thus, there may be a positive skew. However, if “most” is taken to mean more than 50% of the students, then the median would be 0, which is a bit misleading (it makes it sound like college students do not drink at all). Since only a “few” drink excessively, the positive skew shouldn’t be too severe. I would probably use the mean as a “typical” value and the standard deviation as a measure of variability (the large number of zeros will keep this small, which reflects the stated “most of these students don’t drink”).
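To see why the median can come out as 0 when most students do not drink, here is a toy illustration. The `drinks` vector is entirely made up for this sketch (20 hypothetical students, 14 of whom report zero drinks) and is not data from the exercise.

```r
# Hypothetical weekly drink counts for 20 students (invented for illustration only)
drinks <- c(rep(0, 14), 1, 2, 2, 3, 12, 15)

median_drinks <- median(drinks)  # 0 whenever more than half of the values are 0
mean_drinks <- mean(drinks)      # pulled above 0 by the few heavy drinkers
sd_drinks <- sd(drinks)

print(c(median = median_drinks, mean = mean_drinks, sd = sd_drinks))
```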

(d) Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.

Similarly to (a), this distribution would have a positive skew. Because the executives are likely earning salaries an order of magnitude larger (or more!), the skew may be more pronounced. Thus, I would lean towards using median as a measure of the typical salary. I’d also tend to use IQR as a measure of variability.

1.70 - Heart transplants

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study.

(a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

It does not appear that survival is independent of whether the patient had a transplant. The proportion of survivors among those who had a transplant was more than double the proportion among those who did not, so it appears that transplants had a positive impact on survival.

(b) What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment.

Survival time is markedly longer for those who had the transplant. The median survival time of the transplant group is several times higher than that of the control group, and the middle 50% of the transplant group’s data (the box) sits considerably higher than the middle 50% of the control group’s.

(c) What proportion of patients in the treatment group and what proportion of patients in the control group died?

            alive   dead
control     0.118   0.882
treatment   0.348   0.652

It appears that about 88% of the control group died by the end of the study, whereas only about 65% of the treatment group died.
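These proportions can be reproduced from the underlying counts. The counts below are reconstructed from the reported proportions and the group sizes used in part (d) (69 treatment, 34 control), so treat them as an assumption rather than values read from the original data set.

```r
# Counts reconstructed from the reported proportions and group sizes (assumption)
transplant <- matrix(
  c(4, 30,    # control:   alive, dead
    24, 45),  # treatment: alive, dead
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("control", "treatment"),
                  outcome = c("alive", "dead"))
)

# Row proportions: the share of each group that was alive or dead at the end of the study
props <- round(prop.table(transplant, margin = 1), 3)
print(props)
```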

(d) One approach for investigating whether or not the treatment is effective is to use a randomization technique.

What are the claims being tested?

The claim being tested is that the experimental transplant treatment extends the lifespan of patients, against the null hypothesis that the treatment has no effect on survival.

The paragraph below describes the setup for such an approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.

We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are equal to or less than the difference we observed in our experiment (0.652 - 0.882 = -0.230). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
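The card-shuffling procedure can also be sketched in R. This sketch assumes 103 patients in total (69 treatment plus 34 control), of whom 28 were alive, which leaves 75 dead, with 45 of the deaths in the treatment group and 30 in the control group as implied by the proportions in part (c). Because the observed difference (treatment - control) in the proportion dead is negative, the sketch counts simulated differences that are equal to or less than it. The seed and the function name `simulate_diff` are arbitrary choices made for illustration.

```r
set.seed(606)  # arbitrary seed, used only so the sketch is reproducible

# One card per patient: 28 alive and 75 dead (103 patients in total)
cards <- c(rep("alive", 28), rep("dead", 75))

# Observed difference in the proportion dead (treatment - control), about -0.230
observed <- 45 / 69 - 30 / 34

# Shuffle the cards, deal 69 to treatment and 34 to control, and return the difference
simulate_diff <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:69]
  control <- shuffled[70:103]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat the shuffle 100 times, as in the exercise
diffs <- replicate(100, simulate_diff())

# Fraction of shuffles at least as extreme as the observed difference
p_value <- mean(diffs <= observed)
print(p_value)
```

A small value of `p_value` would indicate that a difference this large is unlikely under the independence model; using more than 100 repetitions would give a more stable estimate.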