3.1 a) give two examples of discrete random variables
# 1. number of live births in a sample on 12/25/21
# 2. number of flights leaving from PDX on 1/14/22
b) give two examples of continuous random variables
# 1. BMI in a sample of patients
# 2. GPA of OHSU-PSU SPH students
3.5 a) P(T13 <= ?) = .10
qt(0.1, 13)
## [1] -1.350171
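dt(2.05, 28) # note: dt() returns the t density at 2.05, not a probability; pt() gives cumulative probabilities #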
## [1] 0.05205059
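The relationship between the quantile and probability functions used above can be checked with a minimal sketch. The two-sided tail probability for T28 is included only as an illustration; it is an assumption about what a probability question at 2.05 might ask, since the question text is not shown here.

```r
# qt() is the quantile function and pt() is the cumulative distribution function,
# so pt(qt(p, df), df) should return p.
q_13 <- qt(0.10, df = 13)        # value with 10% of the T13 distribution below it
p_back <- pt(q_13, df = 13)      # recovers 0.10

# Illustration only (assumed question): two-sided tail probability P(|T28| >= 2.05)
p_two_sided <- 2 * pt(-2.05, df = 28)

q_13
p_back
p_two_sided
```

Unlike dt(), which returns the height of the density curve, pt() returns an area under the curve, which is what a probability statement needs.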
3.10
# a)
## false
# b)
## false
# c)
## false
# d)
## false
# e) the variance is like an average of squared deviations from the mean
## true
3.11 suppose the weight of male patients registered at a diet clinic has the normal distribution with mean 190 and variance 100 a) for a random sample of patients of size n = 25, the expression P(Wbar < 180) is equivalent to saying P(Z > ?)
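w_bar <- 190 # population mean of the weights #
w <- 180 # cutoff for the sample mean #
sig_sq <- 100 # population variance #
n <- 25 # sample size #
z_stat <- ((w_bar - w) / sqrt(sig_sq/n)) # P(Wbar < 180) = P(Z < -5) = P(Z > 5) by symmetry #
z_stat # answer, P(Z > 5) #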
## [1] 5
error <- qnorm(.9)*sqrt(sig_sq/n) # margin so that the central interval for Wbar has probability 0.80 #
left <- w_bar - error # lower bound #
right <- w_bar + error # upper bound #
left
## [1] 187.4369
right
## [1] 192.5631
# P(187.44 < Wbar < 192.56) = 0.8
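As a check on 3.11, the standardization and the central 80% interval can be verified directly with pnorm() and qnorm(). This is a minimal sketch; the variable names (mu_11, se_11, and so on) are chosen here for illustration and are not part of the original solution.

```r
mu_11 <- 190                      # population mean
sigma2_11 <- 100                  # population variance
n_11 <- 25                        # sample size
se_11 <- sqrt(sigma2_11 / n_11)   # standard error of the sample mean (= 2)

# P(Wbar < 180) computed directly and as P(Z > 5)
p_lower <- pnorm(180, mean = mu_11, sd = se_11)
p_upper <- pnorm((mu_11 - 180) / se_11, lower.tail = FALSE)

# Central 80% interval for Wbar and its coverage probability
bounds_11 <- qnorm(c(0.10, 0.90), mean = mu_11, sd = se_11)
coverage_11 <- diff(pnorm(bounds_11, mean = mu_11, sd = se_11))

p_lower
p_upper
bounds_11
coverage_11
```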
3.12 The limits of a 95% CI for the mean µ of a normal population with unknown variance are found by adding to and subtracting from the sample mean a certain multiple of the estimated standard error of the sample mean. If the sample size on which the confidence interval is based is 28, the multiple referred to in the previous sentence is the number ___?
qt(.975, 27)
## [1] 2.051831
3.13 a random sample of 32 persons attending a diet clinic was found to have lost an average of 30 pounds, with sample standard deviation of 11. For these data, a 99% CI for the true mean weight loss by all patients attending the clinic would have the limits (?,?)
# calculating the error #
bar_13 <- 30
sig <- 11
n_13 <- 32
error_13 <- qt(.995, df=n_13 - 1) * (sig / sqrt(n_13))
# calculating left and right bounds #
left_13 <- bar_13 - error_13
right_13 <- bar_13 + error_13
left_13
## [1] 24.66409
right_13
## [1] 35.33591
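The same calculation can be wrapped in a small helper for any t-based confidence interval computed from summary statistics. This is a sketch only; the function name t_ci and its arguments are hypothetical and introduced here for illustration.

```r
# t-based confidence interval for a mean from summary statistics
# (t_ci is a hypothetical helper, not part of the original solution)
t_ci <- function(xbar, s, n, level = 0.95) {
  alpha <- 1 - level
  margin <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

# Problem 3.13: n = 32, mean = 30, sd = 11, 99% CI
ci_13 <- t_ci(xbar = 30, s = 11, n = 32, level = 0.99)
ci_13
```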
3.14 From two normal populations assumed to have equal variances, independent random samples of sizes 15 and 19 are drawn. The first sample (n=15) yielded mean and standard deviation 111.6 and 9.5 respectively. The second sample (n=19) gave mean and standard deviation 100.9 and 11.5. The estimated standard error of the difference in means is ____?
#obtaining the pooled sample variance
n1 <- 15
n2 <- 19
s1 <- 9.5*9.5
s2 <- 11.5*11.5
sp2 <- (((n1 - 1)*s1) + ((n2 - 1)*s2)) / (n1 + n2 - 2)
sp2 # pooled sample variance #
## [1] 113.875
# obtaining the estimated standard error of the difference in sample means
sem <- sqrt((sp2/n1) + (sp2/n2))
sem # estimated standard error of the difference in sample means #
## [1] 3.685795
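The two steps above (pooled variance, then standard error of the difference) can be combined into one function that takes the sample sizes and standard deviations. A minimal sketch; pooled_se is a hypothetical name introduced here.

```r
# Estimated standard error of the difference in means under equal variances
# (pooled_se is a hypothetical helper, not part of the original solution)
pooled_se <- function(n1, s1, n2, s2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
  sqrt(sp2 * (1 / n1 + 1 / n2))
}

se_14 <- pooled_se(n1 = 15, s1 = 9.5, n2 = 19, s2 = 11.5)
se_14
```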
3.15 Suppose, using above data, we wish to test H0: µ1 = µ2 versus HA: µ1 > µ2. We got the test statistic 2.55 a) what conclusions should be drawn for α = 0.05?
df15 <- n1+n2-2 # obtaining df for the function #
1 - pt(2.55, df=df15) # calculating p-value to compare w/alpha #
## [1] 0.007880889
# p-value (0.0079) < alpha (0.05): reject H0 and conclude that µ1 > µ2
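The p-value comparison is equivalent to comparing the test statistic with the upper-tail critical value. The sketch below shows both routes for the one-sided test in 3.15; the variable names are chosen here for illustration.

```r
t_obs_15 <- 2.55            # test statistic given in the problem
df_15 <- 15 + 19 - 2        # pooled degrees of freedom
alpha_15 <- 0.05

t_crit_15 <- qt(1 - alpha_15, df = df_15)               # upper-tail critical value
p_value_15 <- pt(t_obs_15, df = df_15, lower.tail = FALSE)

reject_by_crit <- t_obs_15 > t_crit_15
reject_by_p <- p_value_15 < alpha_15

t_crit_15
p_value_15
c(reject_by_crit, reject_by_p)
```

Both rules always agree, so either one can be reported.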
3.16 test the null hypothesis that the true population average body weight is the same for two independent diagnosis groups from one hospital versus the alternative hypothesis that these two population averages are different
# creating data sets
samp1 <- c(132, 145, 124, 122, 165, 144, 151)
samp2 <- c(141, 139, 172, 131, 150, 125)
#step 1:
# H0: mean weight pop 1 = mean weight pop 2
# HA: mean weight pop 1 != mean weight pop 2
#step 2:
t.test(samp1, samp2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: samp1 and samp2
## t = -0.2893, df = 11, p-value = 0.7777
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22.13483 16.99197
## sample estimates:
## mean of x mean of y
## 140.4286 143.0000
# from the t.test output: t = -0.2893 on 11 df, p-value = 0.7777
# the p-value is far above any conventional alpha, so we fail to reject the null hypothesis
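To see where the t.test() numbers come from, the pooled t statistic can be computed by hand from the same two samples. This is a sketch for checking purposes; the names n_a, n_b, sp2_16, and so on are introduced here.

```r
samp1 <- c(132, 145, 124, 122, 165, 144, 151)
samp2 <- c(141, 139, 172, 131, 150, 125)

n_a <- length(samp1)
n_b <- length(samp2)

# Pooled variance and standard error of the difference in means
sp2_16 <- ((n_a - 1) * var(samp1) + (n_b - 1) * var(samp2)) / (n_a + n_b - 2)
se_16 <- sqrt(sp2_16 * (1 / n_a + 1 / n_b))

# t statistic and two-sided p-value
t_manual <- (mean(samp1) - mean(samp2)) / se_16
p_manual <- 2 * pt(-abs(t_manual), df = n_a + n_b - 2)

# Built-in equal-variance test for comparison
fit_16 <- t.test(samp1, samp2, var.equal = TRUE)

c(t_manual = t_manual, p_manual = p_manual)
```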
3.17 independent random samples are drawn from two normal populations (same variance). Sample 1: n = 5, mean = 86.4, sd = 8.0 Sample 2: n = 7, mean = 78.6, sd = 10 The limits of a 99% CI are found by adding and subtracting from the difference in samples means a certain multiple of the estimated standard error of this difference. The multiple is the number ___?
#calculating degrees freedom for two-sample t test
n.17.1 <- 5
n.17.2 <- 7
df.17 <- n.17.1 + n.17.2 - 2
#calculating 99% CI multiple for df=10
qt(.995, df=df.17)
## [1] 3.169273
3.20 Critical region: |T| >= 2.5, Tdata = -2.75. What statements would be correct?
# a) H0 should be rejected: this statement is TRUE, since |Tdata| = 2.75 >= 2.5
2 * (1 - pt(2.75, df = Inf))
## [1] 0.005959526
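Setting df = Inf makes the t distribution coincide with the standard normal, so the p-value above is a large-sample (normal) p-value. A short sketch comparing the two, with the critical-region decision alongside; the variable names are chosen here.

```r
t_data <- -2.75

p_t_inf <- 2 * pt(-abs(t_data), df = Inf)   # t distribution with infinite df
p_norm <- 2 * pnorm(-abs(t_data))           # standard normal

reject_20 <- abs(t_data) >= 2.5             # critical region |T| >= 2.5

c(p_t_inf = p_t_inf, p_norm = p_norm)
reject_20
```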
3.21 xbar1 = 125.2 mmHg, xbar2 = 125.4 mmHg; H0: mu1 = mu2 (mean SBP in mmHg) was rejected at the alpha = 0.001 level
# statement c) is the most reasonable: the difference is statistically significant, but a difference in SBP of 125.4 - 125.2 = 0.2 mmHg is too small to change a clinical diagnosis
3.22 the choice of an alternative hypothesis should depend primarily upon… # the only correct choice is: b) what the investigator is interested in determining
3.23 for each of the areas in the figure labeled a, b, c, and d, choose the corresponding label: alpha, 1 - alpha, beta, or 1 - beta
# a) 1 - alpha
# b) alpha
# c) beta
# d) 1 - beta
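The four areas can be made concrete with a numerical example. The sketch below assumes a one-sided z test with alpha = 0.05 and a true mean that sits 3 standard errors above the null value; both numbers are hypothetical and are not taken from the textbook figure.

```r
alpha_ex <- 0.05     # hypothetical significance level
shift_ex <- 3        # hypothetical true mean under HA, in standard-error units

crit_ex <- qnorm(1 - alpha_ex)                     # rejection cutoff under H0

area_alpha <- pnorm(crit_ex, lower.tail = FALSE)   # P(reject H0 | H0 true)
area_1_minus_alpha <- pnorm(crit_ex)               # P(fail to reject | H0 true)
area_beta <- pnorm(crit_ex - shift_ex)             # P(fail to reject | HA true)
area_power <- 1 - area_beta                        # P(reject H0 | HA true)

c(alpha = area_alpha, one_minus_alpha = area_1_minus_alpha,
  beta = area_beta, power = area_power)
```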
3.25) suppose that H0: µ1 = µ2 and 0.005 < P < 0.01. What conclusion is most appropriate? # reject H0 because P is small (less than 0.01)
Homework Chapter 5:
5.1 The ages (x) and dry weights (y) of 11 chick embryos ranging in age from 6 to 16 days are given, along with the logarithms of the weights (z).
#uploading the dataset
setwd("~/Desktop")
library(readxl)
hw1 <- read_excel("~/Desktop/CH05Q01.xls")
#scatter plot 1
plot(x=hw1$AGE, y=hw1$DRYWGT,
main="Scatter plot of Dry Weight vs. Age",
xlab="Age (days)",
ylab="Dry Weight (grams)")
#scatter plot 2
plot(x=hw1$AGE, y=hw1$LOGDRYWG,
main="Common Logarithms of Dry Weight vs. Age",
xlab="Age (days)",
ylab="Common Log of Dry Weight")
# the relationship between X and Y is positive nonlinear while the relationship between X and Z is positive linear.
state the simple linear regression models for these two regressions: Y regressed on X and Z regressed on X
# Y = beta_0 + beta_1 * X + E
# Z = beta'_0 + beta'_1 * X + E
determine the least-squares estimates of each of the regression lines in part b)
model_yonx <- lm(DRYWGT~AGE, data=hw1) # SLR analysis for Y on X #
model_yonx
##
## Call:
## lm(formula = DRYWGT ~ AGE, data = hw1)
##
## Coefficients:
## (Intercept) AGE
## -1.8845 0.2351
# yhat = -1.885 + 0.235x
model_zonx <- lm(LOGDRYWG~AGE, data=hw1) # SLR analysis for Z on X #
model_zonx
##
## Call:
## lm(formula = LOGDRYWG ~ AGE, data = hw1)
##
## Coefficients:
## (Intercept) AGE
## -2.6893 0.1959
# Zhat = -2.689 + 0.196x
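The least-squares estimates reported by lm() come from closed-form formulas for the slope and intercept. The sketch below applies those formulas to a small made-up data set (not the chick embryo data, which is not reproduced here) and compares them with lm().

```r
# Hypothetical data, for illustration only
x_ex <- c(6, 8, 10, 12, 14, 16)
y_ex <- c(1.1, 1.9, 3.2, 3.8, 5.1, 5.9)

# Slope = Sxy / Sxx, intercept = ybar - slope * xbar
b1_ex <- sum((x_ex - mean(x_ex)) * (y_ex - mean(y_ex))) / sum((x_ex - mean(x_ex))^2)
b0_ex <- mean(y_ex) - b1_ex * mean(x_ex)

fit_ex <- lm(y_ex ~ x_ex)

c(intercept = b0_ex, slope = b1_ex)
coef(fit_ex)
```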
# Linear regression line on Dry Weight vs Age
plot(x=hw1$AGE, y=hw1$DRYWGT,
main="Scatter plot of Dry Weight vs. Age",
xlab="Age (days)",
ylab="Dry Weight (grams)")
abline(lm(DRYWGT~AGE, data=hw1), col="maroon") # draw the fitted line after the plot exists #
# linear regression line on common log of dry weight vs age
plot(x=hw1$AGE, y=hw1$LOGDRYWG,
main="Common Logarithms of Dry Weight vs. Age",
xlab="Age (days)",
ylab="Common Log of Dry Weight")
abline(lm(LOGDRYWG~AGE, data=hw1), col="navy") # draw the fitted line after the plot exists #
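# The regression line for the common log of dry weight vs. age fits the data better than the line for dry weight vs. age: the points in the Z-on-X plot lie close to the line, while the Y-on-X plot is visibly curved. It is therefore more appropriate to run a regression of Z on X.
# Note that a smaller slope estimate does not by itself indicate a better fit; fit should be judged from the scatter of the points about the line (for example, r-squared or a residual plot).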
#95% CI of Z on X
confint(model_zonx, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -2.7585602 -2.6199489
## AGE 0.1898356 0.2019462
# 95% CI for the intercept: (-2.759, -2.620)
# 95% CI for the slope: (0.190, 0.202)
# Neither interval contains 0, so at the alpha = 0.05 level of significance we have evidence that both the true intercept and the true slope differ from 0.
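The interval that confint() reports for a slope is the estimate plus or minus a t multiple of its standard error, with n - 2 degrees of freedom. The sketch below reproduces that by hand on a small made-up data set (not the chick embryo data).

```r
# Hypothetical data, for illustration only
x_ex <- c(6, 8, 10, 12, 14, 16)
y_ex <- c(1.1, 1.9, 3.2, 3.8, 5.1, 5.9)
fit_ex <- lm(y_ex ~ x_ex)

coefs_ex <- coef(summary(fit_ex))
est_ex <- coefs_ex["x_ex", "Estimate"]
se_ex <- coefs_ex["x_ex", "Std. Error"]
t_mult <- qt(0.975, df = df.residual(fit_ex))   # df = n - 2

ci_manual <- c(est_ex - t_mult * se_ex, est_ex + t_mult * se_ex)
ci_builtin <- confint(fit_ex, parm = "x_ex", level = 0.95)

# An interval that excludes 0 corresponds to rejecting H0: slope = 0 at alpha = 0.05
excludes_zero <- ci_manual[1] > 0 || ci_manual[2] < 0

ci_manual
ci_builtin
excludes_zero
```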
library(tidyr)
library(ggplot2) # prepping R to use GGplot for following problem #
model <- lm(LOGDRYWG~AGE, data=hw1) #SLR analysis#
pred <- predict(model, interval = "prediction") # 95% prediction interval for an individual future response (log dry weight), not for the mean response #
## Warning in predict.lm(model, interval = "prediction"): predictions on current data refer to _future_ responses
new.data <- cbind(hw1, pred) # adding prediction to original data frame #
# 95% CI and prediction bands for Z on X #
ggplot(new.data,
aes(x=AGE, y=LOGDRYWG))+
geom_point()+
geom_smooth(method=lm,
formula=y~x,
color="lightblue",
se=TRUE)+
geom_line(aes(y=lwr),
color="violet",
linetype="dashed")+
geom_line(aes(y=upr),
color="violet",
linetype="dashed")+
ggtitle("Common Logarithms of Dry Weight vs. Age with Regression Line,\n95% Confidence Band for the Mean Response, and\n95% Prediction Band for Individual Responses")
# get 95% point-wise CI for mean response (LOGDRYWG) at AGE = 8 #
new.dat <- data.frame(AGE=8)
predict(model, newdata = new.dat, interval = 'confidence')
## fit lwr upr
## 1 -1.122127 -1.148522 -1.095733
# 95% CI when age = 8: (-1.149, -1.096) #
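The confidence interval above is for the mean response at a given age; a prediction interval for a single new observation at the same age is always wider. The sketch below shows the difference on a small made-up data set (not the chick embryo data); all names ending in _ex are introduced here.

```r
# Hypothetical data, for illustration only
age_ex <- c(6, 8, 10, 12, 14, 16)
logwt_ex <- c(-1.52, -1.10, -0.74, -0.33, 0.05, 0.46)
fit_ex <- lm(logwt_ex ~ age_ex)

new_ex <- data.frame(age_ex = 9)

# Interval for the mean response vs. interval for one new observation
ci_mean <- predict(fit_ex, newdata = new_ex, interval = "confidence", level = 0.95)
pi_new <- predict(fit_ex, newdata = new_ex, interval = "prediction", level = 0.95)

width_ci <- ci_mean[, "upr"] - ci_mean[, "lwr"]
width_pi <- pi_new[, "upr"] - pi_new[, "lwr"]

ci_mean
pi_new
c(width_ci = width_ci, width_pi = width_pi)
```

Both intervals share the same center (the fitted value); only the width differs, because the prediction interval also accounts for the variability of an individual observation around the line.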
5.4 A sociologist assigned to a correctional institution was interested in studying the relationship between intelligence and delinquency. A delinquency index was formulated to account for the severity and the frequency of the crimes committed, while intelligence was measured by IQ.
# upload data frame #
library(readxl)
CH05Q04 <- read_excel("CH05Q04.xls")
#plotting data with regression line
ggplot(CH05Q04,
aes(x=IQ, y=DI))+
geom_point()+
geom_smooth(method=lm,
formula=y~x,
color="lightblue",
se=FALSE)+
ggtitle("Scatterplot of DI vs. IQ with Regression Line")
how do you account for the fact that Yhat = 52.273 when IQ = 0, even though the delinquency index goes no higher than 50?
# IQ = 0 lies far outside the range of IQ values observed in the sample, so Yhat = 52.273 is an extrapolation of the fitted line. The line should only be used for prediction within the observed IQ range, and the intercept has no practical interpretation here.
Decide whether the outlier IQ = 134 and DI = 39.6 has an effect on the estimate of the IQ-DI relationship
#comparing the slope of the regression line with all points included to that of the regression line without the outlier
model5.4 <- lm(CH05Q04$DI~CH05Q04$IQ) # slope of regression with all points included #
summary(model5.4)
##
## Call:
## lm(formula = CH05Q04$DI ~ CH05Q04$IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7013 -4.7454 0.0903 3.6447 20.6838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.2729 12.0485 4.339 0.000508 ***
## CH05Q04$IQ -0.2489 0.1154 -2.157 0.046535 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.704 on 16 degrees of freedom
## Multiple R-squared: 0.2253, Adjusted R-squared: 0.1769
## F-statistic: 4.654 on 1 and 16 DF, p-value: 0.04653
yhat_noout <- 70.846 + -0.444 * CH05Q04$IQ # fitted values from the regression line with the outlier removed, as given by the textbook #
betahat0 <- 70.846 # intercept with the outlier removed #
betahat1 <- -0.444 # slope with the outlier removed #
# yhat = 52.2729 + -0.2489 * IQ is the best-fit line before removing the outlier #
# step 0:
alpha <- 0.05
# step 1:
# H0: b_sub_1 = 0
# HA: b_sub_1 != 0
# step 2:
betahat1 <- -0.444
Syx <- 4.933
Sx <- 14.693
n_54 <- 12
T_54 <- betahat1 / (Syx/ (Sx * sqrt(n_54 - 1)))
T_54 # T statistic for hypothesis test, H0: slope = 0 with outlier removed #
## [1] -4.386101
# step 3:
df_54 <- n_54 - 2 # degrees freedom, n-2 #
pval <- 2 * pt(-abs(T_54), df=df_54) # two-sided p value for the test #
pval
## [1] 0.001364339
# step 4:
pval < alpha # p value is less than alpha, reject null hypothesis that slope of regression line = 0 when outlier is removed #
## [1] TRUE
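#step 5:
# we have found evidence that the slope of the regression line, when the outlier is removed, is significantly different from 0 at the alpha = 0.05 level
# the slope estimate changes from -0.249 (all points) to -0.444 (outlier removed), so the outlier has a noticeable effect on the estimated IQ-DI relationship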
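The slope test in steps 2 to 4 can be packaged as one function of the summary statistics. This is a sketch that uses the same standard-error formula as the code above, Syx / (Sx * sqrt(n - 1)); the function name slope_t_test is hypothetical.

```r
# t test of H0: slope = 0 from summary statistics
# (slope_t_test is a hypothetical helper, not part of the original solution)
slope_t_test <- function(b1, s_yx, s_x, n) {
  se_b1 <- s_yx / (s_x * sqrt(n - 1))        # standard error of the slope
  t_stat <- b1 / se_b1
  p_val <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value
  c(t = t_stat, p = p_val)
}

# Same inputs as in the steps above
res_54 <- slope_t_test(b1 = -0.444, s_yx = 4.933, s_x = 14.693, n = 12)
res_54
```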
cor(CH05Q04$IQ, CH05Q04$DI)# testing the correlation between IQ and DI #
## [1] -0.474686
# the correlation between IQ and DI is negative (r = -0.475), which suggests that the delinquency index tends to decrease as IQ increases
# next, test the significance of the slope with all points included
summary(model5.4) # testing statistical significance of the original regression line slope #
##
## Call:
## lm(formula = CH05Q04$DI ~ CH05Q04$IQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7013 -4.7454 0.0903 3.6447 20.6838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.2729 12.0485 4.339 0.000508 ***
## CH05Q04$IQ -0.2489 0.1154 -2.157 0.046535 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.704 on 16 degrees of freedom
## Multiple R-squared: 0.2253, Adjusted R-squared: 0.1769
## F-statistic: 4.654 on 1 and 16 DF, p-value: 0.04653
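Instead of relying on the textbook's coefficients for the reduced data, the model could be refit with the outlier dropped and the two slopes compared directly. The sketch below uses a small made-up data frame that merely contains a point at IQ = 134, DI = 39.6; it is not the CH05Q04 data, and the object names are introduced here.

```r
# Hypothetical data, for illustration only (last row plays the role of the outlier)
dat_ex <- data.frame(
  IQ = c(85, 90, 95, 100, 105, 110, 115, 134),
  DI = c(45, 42, 40, 37, 35, 32, 30, 39.6)
)

fit_all <- lm(DI ~ IQ, data = dat_ex)

# Drop the outlying point and refit
dat_noout <- subset(dat_ex, !(IQ == 134 & DI == 39.6))
fit_noout <- lm(DI ~ IQ, data = dat_noout)

slopes_ex <- c(all_points = coef(fit_all)[["IQ"]],
               outlier_removed = coef(fit_noout)[["IQ"]])
slopes_ex
```

With the real data, the same two calls applied to CH05Q04 would give the before and after slopes without typing any coefficients by hand.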