1.8) Smoking habits of UK Residents:
a) What does each row of data represent? The data matrix has user-level granularity, so each row pertains to one UK resident.
b) How many participants were included in the survey? There are 1691 participants in the study.
c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
The variables in this data frame can be categorized as follows:
-Sex: Categorical
-Age: Numerical, Discrete
-Marital Status: Categorical
-Gross Income: Categorical, Ordinal
-Smoke: Categorical
-Amount of cigarettes per day on Weekends: Numerical, Discrete
-Amount of cigarettes per day on Weekdays: Numerical, Discrete
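As a minimal sketch of how these variable types might be encoded in R: the tiny data frame below is made up for illustration and is not the survey data, and the column names and income labels are hypothetical.

```r
# Hypothetical toy data; names and values are illustrative only
toy <- data.frame(
  sex = factor(c("Female", "Male", "Female")),        # categorical, not ordinal
  age = c(38L, 42L, 25L),                             # numerical, discrete
  gross_income = factor(c("low", "high", "middle"),   # categorical, ordinal
                        levels = c("low", "middle", "high"),
                        ordered = TRUE),
  amt_weekends = c(0L, 12L, 5L)                       # numerical, discrete
)

# Ordered factors carry a ranking; unordered factors do not
income_is_ordinal <- is.ordered(toy$gross_income)
sex_is_ordinal <- is.ordered(toy$sex)

# Comparisons are meaningful only for the ordered factor
above_low <- toy$gross_income > "low"
```

Storing an ordinal variable as an ordered factor keeps the level ranking available to later summaries and plots.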
1.10) Cheaters, Scope of Inference
a) Identify the population of interest and the sample in this study. The population of interest consists of all children between the ages of 5 and 15. The sample is the 160 children in this age range who took part in the study.
b) Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships. I don’t think the results can be generalized, for a few reasons. The sample size might not be sufficient. The biggest reason is that the sample was not chosen at random from the population. Because the sample consists only of those who participated, causal relationships are not so easily established either.
1.28) Reading the Paper
a) Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning. It is not possible to conclude from this study that smoking causes dementia later in life. Causation is established through randomized experiments, and this study is observational. Note also that the participants were aged 50-60 and were 23 years older when the findings were recorded, so factors other than smoking may have contributed to dementia over that period. Separating those factors from smoking would require a randomized experiment.
b) A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study? The statement is not justified. As in part a, this is not a randomized experiment. It is only a survey, hence causation cannot be determined. The best description of the conclusion is that there is likely an association between sleep disorders and bullying: a child who suffers from a sleep disorder may be more likely to commit acts of bullying.
1.36) Exercise and Mental Health
a) What type of study is this? The study is a randomized block experiment. The participants are partitioned into subsets according to age range. The participants were selected through stratified sampling, and within each subset the treatment was assigned to half at random.
b) What are the treatment and control groups in this study? The treatment group consists of the participants who were told to exercise twice a week. The control group consists of the participants who were instructed not to exercise.
c) Does this study make use of blocking? If so, what is the blocking variable? Yes. As described in part a, participants were partitioned by age range before the treatment was assigned, so age is the blocking variable. Blocking variables are for the most part observed characteristics that the researchers do not assign, such as age or gender.
d) Has blinding been used in this study? The study does not use blinding. There are no details indicating that any participant was purposely kept in the dark regarding the experiment.
e) Comment on whether or not we can make a causal statement, and indicate whether or not we can generalize the conclusion to the population at large. A causal statement can be made from the results of this study, because the treatment was randomly assigned. The results cannot be generalized. We should consider some additional blocking variables such as gender and profession.
1.48) Stat Scores Below are the final exam scores of 20 introductory statistics students. 57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94 Create a box plot of the Distribution of these scores.
scores = c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(scores)
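To read the box plot numerically, the sketch below computes the five-number summary and the 1.5 x spread fences that `boxplot()` uses by default to flag points. The variable names (`five`, `hinge_spread`, `flagged`) are chosen here for illustration.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(scores)

# Distance between the hinges (the box height in the plot)
hinge_spread <- five[4] - five[2]

# Points beyond 1.5 x the box height from a hinge are drawn individually
lower_fence <- five[2] - 1.5 * hinge_spread
upper_fence <- five[4] + 1.5 * hinge_spread
flagged <- scores[scores < lower_fence | scores > upper_fence]

print(five)
print(flagged)
```

With these scores the hinges are 72.5 and 82.5, so the lower fence sits at 57.5 and the lowest score, 57, falls just outside it.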
This box plot can be improved with the use of ggplot2.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Name the column explicitly so that aes() maps a column, not the whole data frame
scores_df = data.frame(score = c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94))
ggplot(data = scores_df, aes(x = "", y = score)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(55, 95))
1.50) Mix-and-Match
Describe the distribution in the histograms below and match them to the box plots.
-Histogram A matches plot 2, with a median around 60.
-Histogram B matches plot 3, with a median around 50.
-Histogram C matches plot 1, with a median around 1.
1.56) Distributions and appropriate statistics, Part II
For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
a) Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.
-The median in this scenario is 450,000. The difference between the median and the first quartile is 450,000 - 350,000 = 100,000. The difference between the third quartile and the median is 1,000,000 - 450,000 = 550,000. The upper difference is bigger, which indicates a right skew. The interquartile range is 1,000,000 - 350,000 = 650,000. A benchmark for identifying outliers is any value greater than 1,000,000 + (1.5 x 650,000) = 1,975,000. A considerable number of houses cost more than 6,000,000, so outliers are certainly present. The median should be used to describe a typical observation, and the interquartile range best describes the variability, since the distribution is skewed.
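The arithmetic in part a can be checked with a short sketch. The quartiles are the ones given in the question. The variable names and the 1.5 x IQR fence rule are the usual box plot convention, assumed here rather than stated in the exercise.

```r
# Quartiles given in part a
q1  <- 350000
med <- 450000
q3  <- 1000000

# Distance from the median to each quartile
lower_gap <- med - q1
upper_gap <- q3 - med

# A longer upper gap points to a right skew, a longer lower gap to a left skew
skew_direction <- if (upper_gap > lower_gap) {
  "right"
} else if (upper_gap < lower_gap) {
  "left"
} else {
  "symmetric"
}

# Interquartile range and the usual upper fence for flagging outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

print(c(lower_gap = lower_gap, upper_gap = upper_gap, iqr = iqr,
        upper_fence = upper_fence))
print(skew_direction)
```

Houses above $6,000,000 lie far beyond the fence of 1,975,000, which is why the median and IQR are the safer summaries here.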
b) Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.
-The median is 600,000. The difference between the median and the first quartile is 600,000 - 300,000 = 300,000. The difference between the third quartile and the median is 900,000 - 600,000 = 300,000. Because the two differences are equal, the distribution is roughly symmetric rather than skewed. We are told that very few houses cost more than 1,200,000, so extreme values are rare. For a symmetric distribution with few extreme values, the mean best describes a typical observation and the standard deviation best represents variability.
c) Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.
-The number of drinks is bounded below by 0, since nobody can consume a negative number of drinks. Since most students don’t drink and only a few drink excessively, there is a right skew. The data are composed of discrete counts. The median can describe the typical observation while the interquartile range can describe variability.
1.70) Heart transplants. The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Another variable called survived was used to indicate whether or not the patient was alive at the end of the study.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
data(heartTr)
mosaicplot(table(heartTr$transplant,heartTr$survived))
-The mosaic plot shows that a larger share of the patients in the treatment group survived than of those in the control group. The plot therefore suggests that survival is not independent of the transplant.
What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment? -The box plots seem to indicate that survival among those who were given the transplant has a somewhat even distribution, and that at every quartile the transplant group survived longer than the control group.
What proportion of patients in the treatment group and what proportion of patients in the control group died? Within each group, we divide the number of patients who died by the total number of patients in that group.
table(heartTr$survived)
##
## alive dead
## 28 75
table(heartTr$transplant)
##
## control treatment
## 34 69
The cross-tabulation of the two variables gives the number of deaths within each group.
table(heartTr$transplant, heartTr$survived)
##
## alive dead
## control 4 30
## treatment 24 45
For the treatment group:
45/69
## [1] 0.6521739
For the control group:
30/34
## [1] 0.8823529
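The within-group proportions can also be read off a two-way table in one step. The sketch below rebuilds the table by hand so that it runs without the openintro package. The cell counts (4, 30, 24, 45) are an assumption taken from the cross-tabulation of `transplant` and `survived`, and the object names are illustrative.

```r
# Two-way table of group by outcome; cell counts assumed from the cross-tabulation
counts <- matrix(c(4, 30,
                   24, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("control", "treatment"),
                                 outcome = c("alive", "dead")))

# margin = 1 divides each cell by its row total, giving proportions within a group
row_props <- prop.table(counts, margin = 1)

# Proportion of each group that died
died <- row_props[, "dead"]
print(round(died, 3))
```

The row totals (34 and 69) and column totals (28 and 75) match the one-way tables above, which is a quick check that the cell counts are consistent.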
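The paragraph below describes the setup for such an approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.
-28
-75
-69
-34
-0
-equal to or less than the observed difference
The observed difference in the proportion who died (treatment minus control), with 45 of the 69 treatment patients and 30 of the 34 control patients dying:
(45/69)-(30/34)
## [1] -0.230179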
What do the simulation results shown below suggest about the effectiveness of the transplant program? -The simulation results indicate that the transplant does affect, and improve, survival.
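The card-shuffling procedure described in the fill-in-the-blank paragraph can be sketched in a few lines. This is an illustrative simulation, not the one used to produce the textbook figure. The group sizes (69 and 34) and outcome totals (28 alive, 75 dead) come from the tables above, while the observed count of 45 deaths in the treatment group, the seed, and the number of repetitions are assumptions made for this sketch.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# One "card" per patient: 28 alive and 75 dead
outcomes <- c(rep("alive", 28), rep("dead", 75))
n_trt <- 69
n_ctl <- 34

# Difference in death proportions (treatment - control) for a given
# number of dead cards dealt to the treatment pile
diff_in_death_rate <- function(dead_trt) {
  dead_ctl <- 75 - dead_trt
  dead_trt / n_trt - dead_ctl / n_ctl
}

# Observed difference, assuming 45 of the 69 treatment patients died
observed_diff <- diff_in_death_rate(45)

# Shuffle the cards, deal the first 69 to treatment, record the difference
n_sims <- 10000
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(outcomes)
  diff_in_death_rate(sum(shuffled[seq_len(n_trt)] == "dead"))
})

# Fraction of shuffles with a difference at least as low as the observed one
p_value <- mean(sim_diffs <= observed_diff)
print(observed_diff)
print(p_value)
```

If survival were independent of the transplant, the simulated differences would center on 0, so a small fraction here suggests the observed difference is unlikely to be due to chance alone.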