A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.
#install.packages("openintro") ## already installed
library(openintro)## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
#data(package='openintro') ## this just lists the names of the datasets -- the relevant one is "smoking"
data(smoking)
#str(smoking) ## the smoking dataset includes more columns than are shown in the textbook
smoking_subsetdf = subset(smoking,
select=c(gender,age,maritalStatus,grossIncome,smoke,amtWeekends,amtWeekdays))
names(smoking_subsetdf)[1]="sex"
names(smoking_subsetdf)[3]="marital"
str(smoking_subsetdf)## 'data.frame': 1691 obs. of 7 variables:
## $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : int 38 42 40 40 39 37 53 44 40 41 ...
## $ marital : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ grossIncome: Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
## $ smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ amtWeekends: int NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amtWeekdays: int NA 12 NA NA NA NA 6 NA 8 12 ...
head(smoking_subsetdf)## sex age marital grossIncome smoke amtWeekends amtWeekdays
## 1 Male 38 Divorced 2,600 to 5,200 No NA NA
## 2 Female 42 Single Under 2,600 Yes 12 12
## 3 Male 40 Married 28,600 to 36,400 No NA NA
## 4 Female 40 Married 10,400 to 15,600 No NA NA
## 5 Female 39 Married 2,600 to 5,200 No NA NA
## 6 Female 37 Married 15,600 to 20,800 No NA NA
tail(smoking_subsetdf)## sex age marital grossIncome smoke amtWeekends amtWeekdays
## 1686 Female 31 Single 5,200 to 10,400 No NA NA
## 1687 Male 22 Single 2,600 to 5,200 No NA NA
## 1688 Female 49 Divorced 2,600 to 5,200 Yes 20 20
## 1689 Male 45 Married 5,200 to 10,400 No NA NA
## 1690 Female 51 Married 2,600 to 5,200 Yes 20 20
## 1691 Male 31 Married 10,400 to 15,600 No NA NA
Each row represents the responses from an individual participant in the survey with respect to the attribute identified at the top of the respective column.
The study included 1691 participants.
sex: categorical nominal (not ordinal).
levels = Female, Male
age: numerical - although age should be considered continuous (e.g., allowing for fractional years), in the present dataframe it has actually been stored discretely, as integer (i.e., full-years only):
class(smoking_subsetdf$age)## [1] "integer"
marital: categorical nominal (not ordinal), as it doesn’t make sense to impose an ordering on the following:
levels = Divorced,Married,Separated,Single,Widowed
grossincome: categorical : (see the following discussion regarding nominal vs. ordinal):
summary(smoking_subsetdf$grossIncome)## 10,400 to 15,600 15,600 to 20,800 2,600 to 5,200 20,800 to 28,600
## 268 188 257 155
## 28,600 to 36,400 5,200 to 10,400 Above 36,400 Refused
## 79 396 89 108
## Under 2,600 Unknown
## 133 18
In theory, this variable should be ordinal, because the income bands can be ranked.
However, in this dataset, this column has actually been stored as a non-ordered factor:
Is smoking_subsetdf$grossIncome stored as a factor?
is.factor(smoking_subsetdf$grossIncome)## [1] TRUE
str(smoking_subsetdf$grossIncome)## Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
Is smoking_subsetdf$grossIncome stored as an ORDERED factor?
is.ordered(smoking_subsetdf$grossIncome)## [1] FALSE
summary(smoking_subsetdf$grossIncome)## 10,400 to 15,600 15,600 to 20,800 2,600 to 5,200 20,800 to 28,600
## 268 188 257 155
## 28,600 to 36,400 5,200 to 10,400 Above 36,400 Refused
## 79 396 89 108
## Under 2,600 Unknown
## 133 18
The way in which the levels have actually been assigned fails to make use of the ordering, as the levels are stored alphabetically, disregarding the numerical semantics associated with each band:
for (i in 1:10) { cat(paste("grossIncome level", i,":",
levels(smoking_subsetdf$grossIncome)[i]), sep="\n")}## grossIncome level 1 : 10,400 to 15,600
## grossIncome level 2 : 15,600 to 20,800
## grossIncome level 3 : 2,600 to 5,200
## grossIncome level 4 : 20,800 to 28,600
## grossIncome level 5 : 28,600 to 36,400
## grossIncome level 6 : 5,200 to 10,400
## grossIncome level 7 : Above 36,400
## grossIncome level 8 : Refused
## grossIncome level 9 : Under 2,600
## grossIncome level 10 : Unknown
This can be fixed by (manually) designating an ordering on each of the levels, e.g. :
ordering=c(9,3,6,1,2,4,5,7,8,10)
orderedlevels = levels(smoking_subsetdf$grossIncome)[ordering]
for (i in 1:10) { cat(paste("grossIncome ordered level", i,":",
orderedlevels[i]), sep="\n")}## grossIncome ordered level 1 : Under 2,600
## grossIncome ordered level 2 : 2,600 to 5,200
## grossIncome ordered level 3 : 5,200 to 10,400
## grossIncome ordered level 4 : 10,400 to 15,600
## grossIncome ordered level 5 : 15,600 to 20,800
## grossIncome ordered level 6 : 20,800 to 28,600
## grossIncome ordered level 7 : 28,600 to 36,400
## grossIncome ordered level 8 : Above 36,400
## grossIncome ordered level 9 : Refused
## grossIncome ordered level 10 : Unknown
An ordered factor can thus be created (where I have arbitrarily placed “Refused” and “Unknown” at the end of the list):
orderedincome = ordered(smoking_subsetdf$grossIncome,levels=orderedlevels)
# Is this an ORDERED factor?
is.ordered(orderedincome)## [1] TRUE
summary(orderedincome)## Under 2,600 2,600 to 5,200 5,200 to 10,400 10,400 to 15,600
## 133 257 396 268
## 15,600 to 20,800 20,800 to 28,600 28,600 to 36,400 Above 36,400
## 188 155 79 89
## Refused Unknown
## 108 18
So, grossIncome can thus be converted from a nominal factor to an ordered factor:
smoking_subsetdf$grossIncome = orderedincome
# Is grossIncome NOW stored as an ORDERED factor?
is.ordered(smoking_subsetdf$grossIncome)## [1] TRUE
str(smoking_subsetdf$grossIncome)## Ord.factor w/ 10 levels "Under 2,600"<..: 2 1 7 4 2 5 8 4 2 3 ...
summary(smoking_subsetdf$grossIncome)## Under 2,600 2,600 to 5,200 5,200 to 10,400 10,400 to 15,600
## 133 257 396 268
## 15,600 to 20,800 20,800 to 28,600 28,600 to 36,400 Above 36,400
## 188 155 79 89
## Refused Unknown
## 108 18
Thus, although the grouping of income into bands fundamentally creates a categorical ordinal variable, the way in which it was stored in the dataframe did not make use of the ordinality, limiting its use to nominal unless changed to an ordered factor as shown.
smoke: categorical nominal (not ordinal) - the values are [Yes|No]
amtWeekends: numerical discrete (for those who smoke, the number is given as full cigarettes per day, stored as an integer)
amtWeekdays: numerical discrete (for those who smoke, the number is given as full cigarettes per day, stored as an integer)
Presumably one doesn’t smoke a fractional portion of a cigarette (though, as a non-smoker, I can’t be certain of this.) Thus the figures for amount smoked are not stored as a continuous variable.
Exercise 1.5 introduces a study where researchers studying the relationship between honesty, age, and self-control conducted an experiment on 160 children between the ages of 5 and 15. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. Half the students were explicitly told not to cheat and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children’s characteristics within each group.
The population of interest is all children between the ages of 5 and 15.
The sample in this study is a subset of the children who attended the CUS (“Centro Universitario Sportivo”) Summer Camp of Padua, Italy in July 2008 and who were present on the two specific days on which the experiment was conducted.
The findings should not be generalized to the population because the specific nature of the children who participated in the experiment is unlikely to be representative of all children.
The day camp is comprised of children living in a specific area (Padua, Italy) whose families are presumably able to afford a weekly fee of approximately 100 euros to cover their attendance. (No indication is given as to whether fees are waived in the case of children from poorer families for whom the above fee would be a hardship.)
This could result in a biased sample, e.g., by excluding children from poorer families.
Among the children attending the camp, some did not participate in the experiment because their parents did not grant permission. This could further bias the results, as no indication is given as to the characteristics of those children who were excluded vs. those who participated.
While the study might be generalizable to children in Northern Italy of similar socioeconomic background to those who participated, further generalization would not be supportable absent repetition of the experiment across diverse locations. (Indeed, the study may not be generalizable across the whole of Italy, as the large difference in economic performance between wealthy North and the impoverished South is well-known.)
Because this is an experiment (rather than an observational study), it is appropriate to use the findings of the study to establish causal relationships.
Here, the research question is “Does explicitly telling children not to cheat affect their likelihood to cheat?”
The results of the experiment suggest that doing so does indeed reduce cheating, moreso in girls than in boys, with noticeable differences in other subgroups as well.
“Researchers analyzed data from 23,123 health plan members who participated in a voluntary exam and health behavior survey from 1978 to 1985, when they were 50-60 years old. 23 years later, about 25% of the group had dementia, including 1,136 with Alzheimer’s disease and 416 with vascular dementia. After adjusting for other factors, the researchers concluded that pack-aday smokers were 37% more likely than nonsmokers to develop dementia, and the risks went up with increased smoking; 44% for one to two packs a day; and twice the risk for more than two packs.”
As this was an observational study, rather than than an experiment, it is not appropriate to reach a causal conclusion.
Indeed, in the original journal article,
https://jamanetwork.com/journals/jamainternalmedicine/articlepdf/226695/ioi05150_333_339.pdf the authors do not claim “causation’, but rather they refer to an increased”association" of dementia with midlife smoking.
“The University of Michigan study, collected survey data from parents on each child’s sleep habits and asked both parents and teachers to assess behavioral concerns. About a third of the students studied were identified by parents or teachers as having problems with disruptive behavior or bullying. The researchers found that children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.”
A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.”
Again, this is an observational study, not a controlled, randomized experiment.
It displays an association between sleep disorders and behavioral problems/bullying, but this is not adequate to prove causation.
Indeed, in the original journal article, https://www.sciencedirect.com/science/article/pii/S1389945711001572
the authors write: “Our study was cross-sectional and cannot prove causality”
They add, “Evidence is growing to support the idea that SDB [‘sleep-deprived breathing’] may cause or contribute to disruptive behavior disorders.”
The authors conclude:
“Our findings do not prove a cause-and-effect relationship but raise the possibility that addressing the underpinnings of childhood sleepiness may offer a largely untapped opportunity to reduce the common problem of aggressive behavior in schoolchildren.”
Thus, the friend’s statement is not justified.
A researcher is interested in the effects of exercise on mental health and he proposes the following study:
Use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41- 55 year olds from the population.
Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise.
Conduct a mental health exam at the beginning and at the end of the study, and compare the results.
This is a randomized, controlled experiment to examine the effects of exercise on mental health.
The treatment group consists of the half of the participants (within each age group) who are advised to exercise twice each week during the course of the study.
The control group consists of the other half of the participants, who are advised to refrain from exercise during the study.
Yes, the experiment makes use of age-based blocking, as the participants are grouped by age before being placed into treatment vs. control groups. This ensures that an equal number of participants within each age group will be assigned to exercise vs. non-exercise groups.
No, because the experiment requires the participants to exercise or not exercise, it cannot be conducted without the participants knowing to which group they have been assigned.
Because this is a randomized, controlled experiment, the results can be used to establish a causal relationship.
As the experimental design indicates that the selection of the participants and assignment into groups should use “stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41- 55 year olds from the population,” it would be appropriate to generalize the conclusions to the population at large (where such population consisted of individuals between the ages of 18 and 55.)
My first reservation would be, how can you ensure that those subjects who are randomly placed into the treatment group and are directed to exercise actually do so, and those placed into the control group, and thus directed to not exercise, do not do so.
Another reservation would be whether confounding effects bearing no relationship to exercise or not could impact the mental health of a substantial number of participants.
Such confounding effects may include:
medications taken by a participant (and changes to such medications during the course of the study)
major life event(s) affecting individual participants during the course of the study, e.g., illness of a family member, marriage/divorce, job change/loss, etc., which could impact the mental health assessment
other factors that the study proposer hasn’t considered
Additionally, there is no indication given as to the expected length of the study. The proposer doesn’t state whether he/she wants to run this study for a month, a year, etc. (Presumably a longer study would require more funding.)
Below are the final exam scores of twenty introductory statistics students.
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)Create a box plot of the distribution of these scores.
boxplot(scores, main=c("Final Exam Scores"), col="lightblue")
boxplot(scores, main=c("Final Exam Scores"), col="yellow" , notch = TRUE, add = TRUE)The summary provided below may be useful.
summary(scores)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 72.75 78.50 77.70 82.25 94.00
For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR.
Explain your reasoning.
This distribution exhibits a strong right skew, in that there are significant number of very expensive homes which “pull” the right tail of the distribution. As such, the mean value of a house would be greater (perhaps substantially) than the median value.
In this case, the best measure to represent a “typical” observation in the data would be the median, and the best measure of variability would be the IQR, as these two measures are robust to outliers.
This distribution appears to be closest to a “Normal” (Gaussian) distribution, where the mean is close to the median = 600,000. The IQR of (300k,900k) is symmetric around the median, and the absence of many expensive houses means that the values in the lower quartile (0,300k) are closely reflected in the upper quartile (900k,+) .
For such a symmetric distribution, the mean would be best represent a typical observation in the data (i.e., its measure of central tendency) and the standard deviation would best represent the variability of the observations.
This distribution would be extremely right-skewed, because many college students may not drink at all, resulting in a sizable result at zero, while (as is stated above, only a few drink “excessively.”
In such case, the median would best represent a typical observation, and the IQR would be represent the variability.
[Note: there may be extreme variability between campuses, with heavier drinking expected at institutions known to be “party schools,” and presumably less drinking at more studious (or, religiously-oriented) schools.]
Again, this distribution would be extremely right-skewed with a few high-earners pulling the “right tail” of the distribution, while the rank-and-file workers would be earning considerably less.
In such case, the median would best represent a typical observation, and the IQR would be represent the variability.
library(ggplot2)##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:openintro':
##
## diamonds
library(openintro)
data(heartTr)
table(heartTr$transplant, heartTr$survived)##
## alive dead
## control 4 30
## treatment 24 45
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.
Each patient entering the program was designated an official heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart.
Some patients got a transplant and some did not.
The variable transplant indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not.
Another variable called survived was used to indicate whether or not the patient was alive at the end of the study.
No, survival is not independent of whether or not a patient received a transplant.
Very few (4) members of the control group (who did not receive a transplant) were still alive at the conclusion of the study. A significantly larger number of members (24) of the treatment group (each of whom did receive a heart transplant) did remain alive as of the end of the study.
boxplot(survtime~transplant, data=heartTr, ylab="Number of days survived",
main="Survival time for transplant recipients vs. non-recipients")The boxplots show that members of the treatment group (i.e., those who did receive a transplant) lived much longer than members of the control group, who did not. This suggests that the heart transplant treatment is effective in extending survival.
ggplot(heartTr, aes(x=transplant, y=survtime, fill=transplant)) +
geom_boxplot() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())In the treatment group, 45/69 = 65.2 percent of the patients died, while in the control group, 30/34 = 88.2 percent died.
Thus, the treatment group exhibited an improvement of 23 percent.
Under the null hypothesis \(H_{0}\), whether a participant remained alive as of the end of the study is independent of whether he/she received a heart transplant (treatment group) or not (control group); the observed difference would only happen rarely.
Under the alternative hypothesis \(H_{A}\), receipt of a heart transplant does have an effect on longevity, and the observed difference was actually due to the fact that a heart transplant significantly increases lifespan (for those who are ill enough to be in need of a transplant), which explains the large difference of -23% in the rate of survival to the end of the study.
We write alive on __28__ cards representing patients who were alive at the end of the study, and dead on __75__ cards representing patients who were not.
Then, we shuffle these cards and split them into two groups:
One group of size __69__ representing treatment, and another group of size __34__ representing control.
We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value.
We repeat this 100 times to build a distribution centered at __zero__.
Lastly, we calculate the fraction of simulations where the simulated differences in proportions are __less than the (-23 percent difference) observed in the actual study__ . If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
#Recreate the histogram from the text
# confusion matrix - col and row sums
TotAlive = 28
TotDead = 75
TotTreatment = 69
TotControl = 34
# simulation results - by counting the number of dots in each column in the textbook historgram
SimTreatAlive = c(rep(24,2) ,
rep(23,1) ,
rep(22,8) ,
rep(21,15),
rep(20,18),
rep(19,19),
rep(18,12),
rep(17,11),
rep(16,7) ,
rep(15,4) ,
rep(14,3))
#SimTreatAlive
SimTreatDead = TotTreatment - SimTreatAlive
#SimTreatDead
PercentTreatAlive = SimTreatAlive / TotTreatment
#PercentTreatAlive
PercentTreatDead = SimTreatDead / TotTreatment
#PercentTreatDead
SimControlAlive = TotAlive - SimTreatAlive
#SimControlAlive
SimControlDead = TotControl - SimControlAlive
#SimControlDead
PercentControlAlive = SimControlAlive / TotControl
#PercentControlAlive
PercentControlDead = SimControlDead / TotControl
#PercentControlDead
# compute the difference in percent dead, between Treatment and Control, from simulation results
PercentDeadDiff = PercentTreatDead - PercentControlDead
# get the unique values for the percentages
uniq = unique(PercentDeadDiff)
#uniq
# find out how far apart the above values are, for positioning of breaks
dif = diff(uniq)
# put the breaks halfway between each value
breaks = uniq - dif[1]/2
#breaks
# append one more break on the far right
breaks[length(breaks)+1]=breaks[length(breaks)]+dif[length(dif)]
#breaks
# draw the histogram
hist(x = PercentDeadDiff, breaks = breaks)First, the stacking of results in vertical bars above specific points reflects the fact that because of the specific number of participants involved (n=103), with the above counts of alive vs. dead and treatment vs. control, it is not possible to obtain exactly zero as the difference in the proportions.
The peak value on the chart corresponds to a simulated result with 19 members of the treatment group alive; the x-value below this bar is \[ \frac{50}{69} - \frac{25}{34} = .7246 - .7352 = -.0107 \]
There are 19 dots in this column, which indicates that 19 of 100 simulations returned a result with 19 members of the treatment group alive and 50 dead, while 9 members of the control group survived vs. 25 dead.
Each column to the left represents the simulations where the number of survivors in the treatment group increased by 1, while the number of survivors in the control group was reduced by one.
Thus, the second-tallest column, containing 18 dots, represents the 18 simulations in which 20 members of the treatment group lived, as did 8 members of the control group. The difference in proportions (on the x-axis) can be computed as \[ \frac{49}{69} - \frac{26}{34} = .7101 - .7647 = -.0546 \] Continuing to the left, each bar plot represents the number of simulations where the number of survivors in the treatment group is 21, then 22, then 23, and finally at the leftmost point on the graph, 24.
This leftmost column reflects the number of simulations in which the count of survivors and deaths in each of the treatment and control groups happens to match the results from the actual study, where 24 members of the treatment group survive, but only 4 members of the control group do.
This corresponds to the difference in proportions computed as \[ \frac{45}{69} - \frac{30}{34} = .6522 - .8824 = -.2302 \]
So, out of 100 simulations where we distrubute 28 alive cards and 75 dead cards across 69 treatment slots and 34 control slots, we have only 2 simulations which correspond to the values from the actual study (and none which exceed it.)
Accordingly, the “p-value” of this simulation is 0.02 < 0.05, which provides evidence at 95% confidence to reject the null hypothesis \(H_{0}\) in favor of the alternative \(H_{A}\), which states that
the receipt of a heart transplant does have a significant effect on longevity, and the observed difference of -23 percent was actually due to the fact that a heart transplant significantly increases lifespan (for those who are ill enough to be in need of a transplant).