Question 1

A March 2006 article in the International Journal of Obesity described a study involving 422 children aged 5-10 from primary schools in the city of Trois-Rivieres, Quebec (Chaput, Brunet, & Tremblay, 2006). The researchers found that children who reported sleeping more hours per night were less likely to be obese than children who reported sleeping fewer hours.

  1. Is this an experiment or observational study? How do you know? This is an observational study, since the researchers are correlating two pre-existing conditions, without manipulating any of the variables that are being measured.

  2. What are the observational units? The children.

  3. Identify the independent variables. Are they quantitative or categorical? The independent variable is hours of sleep per night. It is continuous quantitative.

  4. Identify the dependent variables. Are they quantitative or categorical? The dependent variable is obesity status (obese or not obese). It is categorical (binary), since each child is classified into one of two groups.

  5. What is a possible statistic in this study? The proportion of obese children among the 422 children sampled from the primary schools.

  6. What is a possible parameter of interest in this study? The proportion of obese children among all children aged 5-10 in the primary schools of Trois-Rivieres.

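  7. Is it legitimate to conclude from this study that less sleep caused the higher rate of obesity in Quebec children? If so, explain. If not, identify a confounding variable and explain why its effect on the response is confounded with that of the independent variable. No, we cannot conclude causation from this study because it is observational: no variables were manipulated, and children were not randomly assigned to sleep durations. Possible confounding variables include diet, lifestyle, genetics, pre-existing health conditions, and family history of obesity. For example, a child's diet may be related both to how long the child sleeps and to the child's weight, so the effect of diet on obesity cannot be separated from the effect of sleep.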

Question 2

An article in the June 10, 2002 issue of the Archives of Internal Medicine reported on a study of the effectiveness of a nicotine lozenge for helping smokers quit smoking (Shiffman et al., 2002). Newspaper advertisements sought volunteers who were smokers interested in quitting. The 1818 volunteers selected to participate in the study were randomly assigned to receive either the nicotine lozenge or a placebo lozenge (with no active ingredient).

  a. Is this an experiment or observational study? How do you know? This is an experiment, since the researchers manipulated the independent variable by randomly assigning participants to the treatment conditions.

  1. What are the observational units? The volunteers who were smokers interested in quitting.

  2. The article reports on many background variables such as age, weight, gender, smoking amount, and whether the person made a previous attempt to quit smoking. It shows that these two groups (nicotine lozenge and placebo lozenge) had similar distributions for these variables at the start of the study. Why do the researchers report this information? Do you think they were pleased that the distributions for these variables were similar between the two groups? Explain how this is helpful to the kind of conclusion they were hoping to draw. The researchers report this information to show that random assignment produced two groups that were comparable at the start of the study. They were likely pleased: variables that are distributed similarly in both groups are nuisance variables rather than confounding variables. They affect both treatment groups equally, so they cannot explain a difference in quitting rates between the groups.

  3. The researchers found that the proportion of subjects who successfully abstained from smoking was substantially higher in the nicotine lozenge group than in the placebo group. Is it legitimate to conclude that the nicotine lozenge was responsible for this higher rate of quitting? Explain. Yes. Because participants were randomly assigned to the two groups and the background variables were balanced, the only systematic difference between the groups was the lozenge they received. A substantially higher proportion of abstainers in the nicotine lozenge group can therefore be attributed causally to the lozenge.
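
The balancing effect of random assignment discussed in this question can be illustrated with a small simulation. This is a minimal sketch, not the study's data: the ages are simulated, equal group sizes are assumed, and the names `n`, `age`, and `group` are made up for illustration.

```r
# Simulated illustration only; the ages below are NOT from Shiffman et al.
set.seed(1)
n <- 1818                                  # number of volunteers, from the article
age <- rnorm(n, mean = 40, sd = 10)        # hypothetical background variable

# Randomly assign half of the volunteers to each condition (assumed equal split)
group <- sample(rep(c("lozenge", "placebo"), each = n / 2))

# Group sizes and mean age per group
group_sizes <- table(group)
group_means <- tapply(age, group, mean)
print(group_sizes)
print(group_means)

# Absolute difference in mean age between the two groups
age_gap <- abs(unname(diff(group_means)))
print(age_gap)
```

With groups this large, the difference in mean age produced by chance alone is small relative to the spread of ages, which is the pattern the researchers reported for their real background variables.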

Question 3

In June of 1992, the American Academy of Pediatrics (AAP) issued a recommendation that healthy infants be placed on their backs or on their sides, rather than on their stomachs, to sleep. This recommendation was based on mounting evidence that sleeping on their stomachs might be related to occurrences of sudden infant death syndrome (SIDS). The recommendation was strengthened in 1994 and accompanied by a national public education campaign called Back to Sleep. The National Infant Sleep Position Study was launched in 1992 to determine how widely this recommendation was adopted for American infants. The researchers took a random sample of households with infants younger than eight months, obtained from what they described as a “nationally representative list” generated from public birth records, infant photography companies, and infant formula companies. In 1992, they made 2068 calls and completed 1002 interviews. The primary reasons that some calls did not result in interviews were that the number was not a household’s, the household’s eligibility could not be determined, or the respondent declined to be interviewed.

  a. Identify the population of interest and the sample. The population of interest was all American households with infants under the age of 8 months. The sample was the 1002 households that completed interviews, drawn at random from the “nationally representative list.”

  1. What is the sample size? Sample size = 1002

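  2. Explain how this sampling method differs from a simple random sample. In a simple random sample, every household in the population has an equal chance of being selected. Here, only households that appeared on the list (built from public birth records, infant photography companies, and infant formula companies) could be selected, and many of the 2068 calls did not produce an interview because the number was not a household’s, eligibility could not be determined, or the respondent declined. Households missing from the list or not responding had no chance of being represented.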

  3. Some of the findings from the study were that the proportion of infants placed on their stomachs to sleep fell from 70% in 1992 to 24% in 1996, and the proportion of infants placed on their backs rose from 13% in 1992 to 35% in 1996. Are these numbers parameters or statistics? Explain. These are statistics, since they are proportions computed from the infants in the study (the sample) rather than from all American infants (the population).
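
The gap between calls made and interviews completed in 1992 can be quantified directly from the numbers given in the question. A minimal sketch; the variable names are illustrative.

```r
calls <- 2068        # calls made in 1992
interviews <- 1002   # completed interviews (the sample size)

# Share of calls that resulted in a completed interview
completion_rate <- interviews / calls
print(round(completion_rate, 3))
```

Roughly half of the calls produced an interview, which is why nonresponse and ineligible numbers matter when judging how well the sample represents the population.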

Question 4

Consider the following scores: 21, 34, 34, 16, 37, 28. Do all of the following “by hand” in R. You can use simple functions (such as the sum function) but don’t use R’s built-in functions (such as mean) to compute the mean, SD, var, etc. Show your work.

a. Calculate the mean.

b. Calculate the variance (using n in the denominator).

c. Calculate the standard deviation (using n in the denominator).

d. Calculate the median.

e. Calculate the IQR.

f. Calculate the H-spread and note any outliers.

g. What is the z-score of 37?

h. Plot a histogram. Bin size = 5, start at 15, left closed interval (i.e., 15 goes in the 15-20 bin). You can use R or not.

i. Note any skew

matrix <- c(21, 34, 34, 16, 37, 28)

#a
mean <- sum(matrix)/6
print(mean)
## [1] 28.33333
#b
sdiff <- (matrix-mean)^2
print(sdiff)
## [1]  53.7777778  32.1111111  32.1111111 152.1111111  75.1111111   0.1111111
var <- sum(sdiff)/6
print(var)
## [1] 57.55556
#c
sd <- sqrt(var)
print(sd)
## [1] 7.586538
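#d
# n is even, so the median is the average of the two middle sorted values
sorted <- sort(matrix)
median <- (sorted[3] + sorted[4])/2
print(median)
## [1] 31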
#e
Q1 <- quantile(matrix, probs = 0.25)
Q3 <- quantile(matrix, probs = 0.75)
IQR <- Q3-Q1
print(IQR)
##   75% 
## 11.25
#f
bp <- boxplot(matrix) 

print(bp)
## $stats
##      [,1]
## [1,]   16
## [2,]   21
## [3,]   31
## [4,]   34
## [5,]   37
## 
## $n
## [1] 6
## 
## $conf
##          [,1]
## [1,] 22.61458
## [2,] 39.38542
## 
## $out
## numeric(0)
## 
## $group
## numeric(0)
## 
## $names
## [1] "1"
bp$stats
##      [,1]
## [1,]   16
## [2,]   21
## [3,]   31
## [4,]   34
## [5,]   37
names(bp)
## [1] "stats" "n"     "conf"  "out"   "group" "names"
h_spread <- bp$stats[4] - bp$stats[2]
h_spread
## [1] 13
outliers <- bp$out
print(outliers)
## numeric(0)
#g
# use the by-hand mean (part a) and the n-denominator SD (part c);
# the built-in sd() divides by n - 1 and would give a different z-score
z37 <- (37 - mean)/sd
print(z37)
## [1] 1.142374
#h
hist(matrix, breaks = seq(15, max(matrix)+5, by = 5), right = FALSE, main = "Histogram", xlab = "Value")

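#i the histogram is slightly left-skewed (negatively skewed): most values fall in the upper bins (30-40), the tail extends toward the lower values, and the mean (28.33) is below the median (31).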
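
Part f relies on `boxplot` to obtain the hinges. As a cross-check, the hinges, H-spread, and outlier fences can also be computed by hand. This sketch assumes that the hinges are the medians of the lower and upper halves of the sorted data and that an outlier lies more than 1.5 H-spreads beyond a hinge (the usual Tukey convention); the names `scores`, `lower_hinge`, and so on are illustrative.

```r
scores <- sort(c(21, 34, 34, 16, 37, 28))   # sorted: 16 21 28 34 34 37

# With n = 6, each half holds 3 values, so each hinge is the middle value of its half
lower_hinge <- scores[2]
upper_hinge <- scores[5]
h_spread_by_hand <- upper_hinge - lower_hinge

# Fences: 1.5 H-spreads beyond each hinge (assumed convention)
lower_fence <- lower_hinge - 1.5 * h_spread_by_hand
upper_fence <- upper_hinge + 1.5 * h_spread_by_hand

# Any score outside the fences would be flagged as an outlier
outliers_by_hand <- scores[scores < lower_fence | scores > upper_fence]
print(c(h_spread = h_spread_by_hand, lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers_by_hand)
```

The H-spread of 13 agrees with the value taken from `bp$stats`, and with fences at 1.5 and 53.5 no score is flagged as an outlier.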

Question 5

The mean of Y is 100 and the SD of Y is 10.

a. You add 10 to each score.

i. What is the new mean? 110

ii. What is the new standard deviation? 10

b. You multiply each score by 10.

i. What is the new mean? 1000

ii. What is the new standard deviation? 100
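
These answers follow from the rules for linear transformations: adding a constant shifts the mean but leaves the SD unchanged, while multiplying by a constant scales both the mean and the SD. A minimal sketch that checks this numerically on made-up scores; the vector `y` is constructed for illustration so that its mean is 100 and its SD is 10, and it is not data from the assignment.

```r
set.seed(42)
raw <- rnorm(50)

# Standardize, then rescale so that mean(y) is 100 and sd(y) is 10
y <- 100 + 10 * (raw - mean(raw)) / sd(raw)

added <- y + 10        # part a: add 10 to each score
multiplied <- y * 10   # part b: multiply each score by 10

print(c(mean(added), sd(added)))
print(c(mean(multiplied), sd(multiplied)))
```

Up to floating-point rounding, the first line shows a mean of 110 with an SD of 10, and the second a mean of 1000 with an SD of 100.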

Question 6

Consider the following data

Group A: 21, 40, 34, 34, 16, 37, 21, 38, 11, 34, 38, 26, 27, 33, 47

Group B: 20, 16, 15, 38, 53, 61, 23, 44, 32, 34, 25, 19, 27, 14, 39

Use R to:

a. Plot a histogram of Group A.

b. Plot a boxplot of Group A.

c. Plot a barplot of the means of Groups A and B (no error bars for now).

d. Plot a lineplot with one line for each group (assume the ordering matters, i.e., plot left to right, no error bars for now).

A <- c(21, 40, 34, 34, 16, 37, 21, 38, 11, 34, 38, 26, 27, 33, 47)
B <- c(20, 16, 15, 38, 53, 61, 23, 44, 32, 34, 25, 19, 27, 14, 39)

#a
hist(A, breaks = 5, main = "Group A", xlab = "Value")

#b
boxplot(A, main = "Group A", ylab = "Value")

#c
means <- c(mean(A), mean(B))
names(means) <- c("GroupA", "GroupB")
barplot(means, main = "means of A and B", ylab = "mean")

#d
plot(A, type = "l", ylim = range(c(A, B)), xlab = "Index", ylab = "Value", main = "Group A and B")
lines(B, col = "red")
legend("topleft", legend = c("Group A", "Group B"), col = c("black", "red"), lty = 1)

Question 7

Consider the following data from 5 subjects in conditions 1-3. You may use software, but show work for partial credit.

Subject  C1  C2  C3
1         7  11   3
2        31  15  12
3        16  40   5
4        21  42  19
5        35  45   4

a. What is $Y_{23}$?

b. What is $Y_{.3}$?

c. Compute $\bar{Y}_{.1}$.

d. Compute $\bar{Y}_{2.}$.

e. Compute $\sum_{i=1}^{5} Y_{i1}$.

f. Compute $\sum_{j=1}^{3} (\bar{Y}_{.j} - \bar{Y}_{..})^2$.

g. Compute $\sum_{i=1}^{5} \sum_{j=1}^{3} Y_{ij}^2$.

df <- data.frame(Subject = 1:5, C1 = c(7, 31, 16, 21, 35), C2 = c(11, 15, 40, 42, 45), C3 = c(3, 12, 5, 19, 4))
df <- df[,-1]
#a
df[2,3]
## [1] 12
#b
df[ ,3]
## [1]  3 12  5 19  4
#c
mean(df[,1])
## [1] 22
#d
mean(as.numeric(df[2,]))
## [1] 19.33333
#e
sum(df[1:5, 1])
## [1] 110
#f
mean_total <- mean(unlist(df))
colmeans <- colMeans(df)
sum((colmeans - mean_total)^2)
## [1] 245.84
#g
sum(df^2)
## [1] 9222
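
The subscript notation in this question maps directly onto matrix indexing. The sketch below assumes the common convention that a dot in place of a subscript means summing over that index and a bar means averaging; under that reading, the quantity in part b is the total of condition 3 rather than the list of its values. The names `Y`, `col_total_3`, and so on are illustrative.

```r
# Rows are subjects (i = 1..5), columns are conditions (j = 1..3)
Y <- cbind(C1 = c(7, 31, 16, 21, 35),
           C2 = c(11, 15, 40, 42, 45),
           C3 = c(3, 12, 5, 19, 4))

y_23 <- Y[2, 3]                # single score: subject 2, condition 3
col_total_3 <- sum(Y[, 3])     # dot over i: total of condition 3 (assumed convention)
col_means <- colMeans(Y)       # mean of each condition
row_means <- rowMeans(Y)       # mean of each subject
grand_mean <- mean(Y)          # mean of all 15 scores

print(y_23)
print(col_total_3)
print(col_means)
print(row_means[2])
print(grand_mean)
```

The condition means (22, 30.6, 8.6) and the grand mean (20.4) are the ingredients of part f, and `row_means[2]` reproduces part d.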

Question 8

Consider two tests. Test A has mean 50 and sd 10. Test B has mean 20 and sd 5.

a. Who did better, a student who scored 59 on Test A or a student who scored 25 on Test B? Why?

To check this, we compare z-scores. The student who scored 25 on Test B did better: their z-score is (25 - 20)/5 = 1, compared with (59 - 50)/10 = 0.9 for the student who scored 59 on Test A.

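b. A student scores a 66 on Test A. What is the equivalent score on Test B?

The equivalent score is the one with the same z-score. A 66 on Test A has z = (66 - 50)/10 = 1.6, which corresponds to 20 + 1.6(5) = 28 on Test B.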

Amean <- 50
ASD <- 10
Bmean <- 20
BSD <- 5
#a
studenta <- (59-Amean)/ASD
studentb <- (25-Bmean)/BSD

#b
zTestA <- (66-Amean)/ASD
XTestB <- Bmean + (zTestA*BSD)
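
The conversion in part b generalizes to any pair of tests: standardize on the first scale, then unstandardize on the second. A minimal sketch; `convert_score` is a hypothetical helper written for illustration, not a built-in R function.

```r
# Convert a score from one test scale to another by matching z-scores
convert_score <- function(x, mean_from, sd_from, mean_to, sd_to) {
  z <- (x - mean_from) / sd_from   # z-score on the original test
  mean_to + z * sd_to              # same z-score expressed on the target test
}

z_a <- (59 - 50) / 10   # part a: student who took Test A
z_b <- (25 - 20) / 5    # part a: student who took Test B
equivalent_b <- convert_score(66, mean_from = 50, sd_from = 10,
                              mean_to = 20, sd_to = 5)
print(c(z_a = z_a, z_b = z_b, equivalent_b = equivalent_b))
```

Writing the conversion as a function keeps the two steps (standardize, then rescale) explicit and makes it easy to reuse with other means and SDs.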

Question 9

Do all of the following in R.

Show your code and the results.

a. Read in the file Ex2_13.csv.

b. Use head to show the first few lines.

c. What data point is row 10, column 2 (use [])?

d. Show the first row (use []).

e. Show all grades for test_2 (use $).

f. Show all grades for test_2 (use []).

g. Show all students (both test scores) who got greater than a 45 on test_1.

h. Show only the test_1 scores for all students who got greater than a 45 on test_1.

i. Show all students (both test scores) who got exactly a 25 on test_2.

j. What is the mean of test_1?

k. What is the median of test_1?

l. What is the sd of test_1?

m. What is the variance of test_1?

#a
data <- read.csv('Data_sets/Ex2_13.csv', header=TRUE)
#b
head(data)
##   test_1 test_2
## 1     37     43
## 2     39     52
## 3     40     36
## 4     32     25
## 5     46     49
## 6     38     45
#c
data[10,2]
## [1] 45
#d
data[1,]
##   test_1 test_2
## 1     37     43
#e
data$test_2
##  [1] 43 52 36 25 49 45 36 59 41 45 45 46 49 42 44 37 51 42 61 51 56 69 38 29 43
## [26] 50 52 51 40 51 39 55 46 41 48 63 56 60 44 29 40 56 71 40 52 52 46 49 37 40
#f
data[,2]
##  [1] 43 52 36 25 49 45 36 59 41 45 45 46 49 42 44 37 51 42 61 51 56 69 38 29 43
## [26] 50 52 51 40 51 39 55 46 41 48 63 56 60 44 29 40 56 71 40 52 52 46 49 37 40
#g
data[data$test_1>45, ]
##    test_1 test_2
## 5      46     49
## 19     48     61
## 26     47     50
## 50     48     40
#h
data[data$test_1>45, 1]
## [1] 46 48 47 48
#i
data[data$test_2==25, ]
##   test_1 test_2
## 4     32     25
#j
mean(data$test_1)
## [1] 38.6
#k
median(data$test_1)
## [1] 38.5
#l
sd(data$test_1)
## [1] 4.615856
#m
var(data$test_1)
## [1] 21.30612

Question 10

Consider the following histogram of scores. (Note: A 1 will go in the 1-2 bin.)

a. What proportion of the histogram is between 3 (inclusive) and 5 (exclusive)?

Total of the bin frequencies: 2 + 8 + 5 + 2 + 1 + 2 = 20

Frequency between 3 and 5: 2 + 1 = 3

Proportion = 3/20 = 0.15, so 15% of the histogram is between 3 (inclusive) and 5 (exclusive).

  1. Is the mean of the histogram more likely above or below the median? The mean is likely above the median, since the histogram is skewed to the right (positively skewed): the long tail of high scores pulls the mean up more than it pulls the median.
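
The proportion in part a can be reproduced from the bin frequencies. The histogram itself is not reproduced in this text, so the sketch assumes that the six bins are 0-1 through 5-6, which is the layout consistent with the frequencies used in the answer; treat the bin edges as an assumption.

```r
counts <- c(2, 8, 5, 2, 1, 2)   # bin frequencies used in the answer to part a
bin_start <- 0:5                # ASSUMED left edges: bins 0-1, 1-2, ..., 5-6

# Bins whose left edge is 3 or 4 cover the interval from 3 (inclusive) to 5 (exclusive)
in_range <- bin_start >= 3 & bin_start < 5
proportion <- sum(counts[in_range]) / sum(counts)
print(proportion)
```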