3.1
a) give two examples of discrete random variables
# 1. number of live births in a sample on 12/25/21
# 2. number of flights leaving from PDX on 1/14/22
b) give two examples of continuous random variables
# 1. BMI in a sample of patients
# 2. GPA among OHSU-PSU SPH students

3.5 a) P(T13 =< ?) = .10

qt(0.1, 13)

## [1] -1.350171

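b) P(|T28| >= 2.05) = ?

2 * pt(-2.05, df = 28) # two-sided tail probability; dt() would return the density at 2.05, not a probability #

# the result is approximately 0.05, since qt(0.975, 28) is about 2.048 #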

3.10

# a)
## false
# b)
## false
# c)
## false
# d)
## false
# e) the variance is like an average of squared deviations from the mean
## true

3.11 suppose the weight of male patients registered at a diet clinic has the normal distribution with mean 190 and variance 100
a) for a random sample of patients of size n = 25, the expression P(Wbar < 180) is equivalent to saying P(Z > ?)

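w_bar <- 190 # population mean, which is also the mean of Wbar #
w <- 180 # cutoff for the sample mean #
sig_sq <- 100 # population variance #
n <- 25
z_stat <- ((w_bar - w) / sqrt(sig_sq/n)) # P(Wbar < 180) = P(Z < -5) = P(Z > 5) by symmetry #
z_stat # answer, P(Z > 5) #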

## [1] 5

find an interval (a, b) such that P(a < Wbar < b) = 0.80 for the same random sample

error <- qnorm(.9)*sqrt(sig_sq/n) # calculating the error #
left <- w_bar - error # lower bound #
right <- w_bar + error # upper bound #

left

## [1] 187.4369

right

## [1] 192.5631

P(187.44 < Wbar < 192.56) = 0.80
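Both parts of 3.11 can be checked numerically. This is a minimal sketch, assuming Wbar is normal with mean 190 and standard error sqrt(100/25) = 2; the names `mu`, `se`, `bounds`, and `coverage` are illustrative and not part of the original solution.

```r
mu <- 190               # mean of the sampling distribution of Wbar
se <- sqrt(100 / 25)    # standard error of Wbar

# part a: P(Wbar < 180) should equal P(Z > 5) by symmetry of the normal
p_direct <- pnorm(180, mean = mu, sd = se)
p_upper  <- pnorm(5, lower.tail = FALSE)

# part b: the central interval built from the 0.90 quantile should cover 80%
bounds   <- mu + c(-1, 1) * qnorm(0.90) * se
coverage <- diff(pnorm(bounds, mean = mu, sd = se))
```

If the standardization is right, `p_direct` and `p_upper` agree and `coverage` comes out as 0.80.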

3.12 The limits of a 95% CI for the mean µ of a normal population with unknown variance are found by adding to and subtracting from the sample mean a certain multiple of the estimated standard error of the sample mean. If the sample size on which the confidence interval is based is 28, the multiple referred to in the previous sentence is the number ___?

qt(.975, 27)

## [1] 2.051831

3.13 a random sample of 32 persons attending a diet clinic was found to have lost an average of 30 pounds, with sample standard deviation of 11. For these data, a 99% CI for the true mean weight loss by all patients attending the clinic would have the limits (?,?)

# calculating the error #
bar_13 <- 30
sig <- 11
n_13 <- 32
error_13 <- qt(.995, df=n_13 - 1) * (sig / sqrt(n_13))
# calculating left and right bounds #
left_13 <- bar_13 - error_13
right_13 <- bar_13 + error_13

left_13

## [1] 24.66409

right_13

## [1] 35.33591

the interval would be (24.664, 35.336)
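The same calculation can be wrapped in a small helper so that the confidence level is an argument. This is a sketch only; `t_ci` is a hypothetical helper name, not a base R function, and it assumes a one-sample t interval computed from summary statistics.

```r
# t-based confidence interval from a sample mean, sample sd, and sample size
t_ci <- function(xbar, s, n, level = 0.95) {
  alpha  <- 1 - level
  margin <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

ci_313 <- t_ci(xbar = 30, s = 11, n = 32, level = 0.99)
ci_313
```

Calling `t_ci(30, 11, 32, level = 0.95)` would give the narrower 95% interval for the same data.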

3.14 From two normal populations assumed to have equal variances, independent random samples of sizes 15 and 19 are drawn. The first sample (n=15) yielded mean and standard deviation 111.6 and 9.5 respectively. The second sample (n=19) gave mean and standard deviation 100.9 and 11.5. The estimated standard error of the difference in means is ____?

#obtaining the pooled sample variance 
n1 <- 15
n2 <- 19
s1 <- 9.5*9.5
s2 <- 11.5*11.5
sp2 <- (((n1 - 1)*s1) + ((n2 - 1)*s2)) / (n1 + n2 - 2)
sp2 # pooled sample variance #

## [1] 113.875

# obtaining the estimated standard error of the difference in sample means
sem <- sqrt((sp2/n1) + (sp2/n2))
sem # estimated standard error of the difference in sample means #

## [1] 3.685795
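The two steps above (pooled variance, then standard error of the difference) can be combined into one function. This is a sketch under the equal-variance assumption stated in the problem; `pooled_se` is a hypothetical name, and it takes standard deviations rather than variances as input.

```r
# estimated standard error of (xbar1 - xbar2) under equal population variances
pooled_se <- function(n1, s1, n2, s2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
  sqrt(sp2 * (1 / n1 + 1 / n2))
}

se_314 <- pooled_se(n1 = 15, s1 = 9.5, n2 = 19, s2 = 11.5)
se_314
```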

3.15 Suppose, using above data, we wish to test H0: µ1 = µ2 versus HA: µ1 > µ2. We got the test statistic 2.55 a) what conclusions should be drawn for α = 0.05?

df15 <- n1+n2-2 # obtaining df for the function #
1 - pt(2.55, df=df15) # calculating p-value to compare w/alpha #

## [1] 0.007880889

0.0079 < 0.05, reject the null hypothesis; there is evidence that µ1 is greater than µ2

what about α = 0.01?
# 0.0079 < 0.01, reject the null hypothesis; there is evidence that µ1 is greater than µ2
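The same decisions can be reached by comparing the test statistic with one-sided critical values instead of comparing the p-value with alpha. This is a sketch assuming the upper-tailed test and df = 15 + 19 - 2 = 32 from above; the variable names are illustrative.

```r
t_obs  <- 2.55
df_315 <- 15 + 19 - 2

# upper-tail p-value, written with lower.tail = FALSE instead of 1 - pt()
p_upper <- pt(t_obs, df = df_315, lower.tail = FALSE)

# one-sided critical values for alpha = 0.05 and alpha = 0.01
crit   <- qt(c(0.95, 0.99), df = df_315)
reject <- t_obs > crit   # TRUE means H0 is rejected at that alpha
```

The two approaches always agree: the statistic exceeds the critical value exactly when the p-value is below alpha.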

3.16 test the null hypothesis that the true population average body weight is the same for two independent diagnosis groups from one hospital versus the alternative hypothesis that these two population averages are different

# creating data sets
samp1 <- c(132, 145, 124, 122, 165, 144, 151)
samp2 <- c(141, 139, 172, 131, 150, 125)

step 0

alpha = 0.05

#step 1
# H0: mean weight pop 1 = mean weight pop 2
# HA: mean weight pop 1 != mean weight pop 2

#step 2:

t.test(samp1, samp2, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  samp1 and samp2
## t = -0.2893, df = 11, p-value = 0.7777
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22.13483  16.99197
## sample estimates:
## mean of x mean of y 
##  140.4286  143.0000

# from output of t.test, t = -0.2893

step 3:

from output of t.test, p = P(|T| >= |t|) = 0.7777 (two-sided)

step 4:

p > alpha: 0.7777 > 0.05

#fail to reject null hypothesis

step 5

We failed to find evidence of a significant difference in the population average body weight between the two independent diagnosis groups from one hospital, at the alpha = 0.05 level.
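To see where the numbers in the t.test output come from, the pooled two-sample t statistic can be computed by hand and compared with the built-in result. This is a sketch using the same two samples; `t_manual` and `p_manual` are illustrative names.

```r
samp1 <- c(132, 145, 124, 122, 165, 144, 151)
samp2 <- c(141, 139, 172, 131, 150, 125)
n1 <- length(samp1)
n2 <- length(samp2)

# pooled variance and the equal-variance t statistic
sp2      <- ((n1 - 1) * var(samp1) + (n2 - 1) * var(samp2)) / (n1 + n2 - 2)
t_manual <- (mean(samp1) - mean(samp2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# two-sided p-value: probability of a statistic at least as extreme as |t|
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

fit <- t.test(samp1, samp2, var.equal = TRUE)
c(manual = t_manual, builtin = unname(fit$statistic))
```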

3.17 independent random samples are drawn from two normal populations (same variance).
Sample 1: n = 5, mean = 86.4, sd = 8.0
Sample 2: n = 7, mean = 78.6, sd = 10
The limits of a 99% CI are found by adding to and subtracting from the difference in sample means a certain multiple of the estimated standard error of this difference. The multiple is the number ___?

#calculating degrees of freedom for two-sample t test
n.17.1 <- 5
n.17.2 <- 7
df.17 <- n.17.1 + n.17.2 - 2
#calculating 99% CI multiple for df=10
qt(.995, df=df.17)

## [1] 3.169273

3.20 Critical region: |T| >= 2.5, Tdata = -2.75. What statements would be correct?
a) H0 should be rejected
#this statement is TRUE

2 * (1 - pt(2.75, df = Inf))

## [1] 0.005959526

b) the alpha is the probability that, if H0 is true, T is either greater than 2.75 or less than -2.75
#this statement is TRUE
#c, d, e are all FALSE as 3.5 lies in the rejection area

3.21 xbar1 = 125.2 mmHg, xbar2 = 125.4 mmHg
H0: mu1 = mu2 (mean SBP in mmHg) was rejected at the alpha = 0.001 level
#statement c) is most reasonable: the difference is statistically significant, but a difference in SBP of 125.4 - 125.2 = 0.2 mmHg is too small to change a clinical diagnosis

3.22 the choice of an alternative hypothesis should depend primarily upon…
# the only correct choice is: b) what the investigator is interested in determining

3.23 for each of the areas in the figure labeled a, b, c, and d, choose a corresponding label: alpha, 1 - alpha, beta, 1 - beta
# a) 1 - alpha
# b) alpha
# c) beta
# d) 1 - beta

3.25 suppose that H0: µ1 = µ2 and 0.005 < P < 0.01. What conclusion is most appropriate?
# reject H0 because P is small (less than 0.01)

Homework Chapter 5:

5.1 The age and dry weights (y) of 11 chick embryos ranging in age 6 to 16 days (x) are given, along with the logarithms of the weights (z).

#uploading the dataset
setwd("~/Desktop")
library(readxl) 
hw1 <- read_excel("~/Desktop/CH05Q01.xls")

observe the following two scatter diagrams and describe the relationships

#scatter plot 1
plot(x=hw1$AGE, y=hw1$DRYWGT, 
     main="Scatter plot of Dry Weight vs. Age", 
     xlab="Age (days)", 
     ylab="Dry Weight (grams)")

#scatter plot 2

plot(x=hw1$AGE, y=hw1$LOGDRYWG, 
     main="Common Logarithms of Dry Weight vs. Age", 
     xlab="Age (days)", 
     ylab="Common Log of Dry Weight")

# the relationship between X and Y is positive nonlinear while the relationship between X and Z is positive linear.

state the simple linear regression models for these two regressions: Y regressed on X and Z regressed on X
# Y = β0 + β1 * X + E
# Z = β0' + β1' * X + E

determine the least-squares estimates of each of the regression lines in part b)

model_yonx <- lm(DRYWGT~AGE, data=hw1)  # SLR analysis for Y on X #
model_yonx

## 
## Call:
## lm(formula = DRYWGT ~ AGE, data = hw1)
## 
## Coefficients:
## (Intercept)          AGE  
##     -1.8845       0.2351

# yhat = -1.885 + 0.235x

model_zonx <- lm(LOGDRYWG~AGE, data=hw1)  # SLR analysis for Z on X #
model_zonx

## 
## Call:
## lm(formula = LOGDRYWG ~ AGE, data = hw1)
## 
## Coefficients:
## (Intercept)          AGE  
##     -2.6893       0.1959

# Zhat = -2.689 + 0.196x
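The coefficients reported by lm() are the closed-form least-squares estimates: slope = Sxy / Sxx and intercept = ybar - slope * xbar. The sketch below checks this on R's built-in `cars` data, used here only as a stand-in because the chick embryo file is not bundled with R; the names `b0` and `b1` are illustrative.

```r
x <- cars$speed
y <- cars$dist

# closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# the same estimates from lm()
fit <- lm(dist ~ speed, data = cars)
rbind(manual = c(b0, b1), lm = unname(coef(fit)))
```

The same comparison would apply to `model_yonx` and `model_zonx` with AGE as x.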

sketch each estimated line on the appropriate scatter diagram, which has the best fit? Which is it more appropriate to run a linear regression on?

# Linear regression line on Dry Weight vs Age
plot(x=hw1$AGE, y=hw1$DRYWGT, 
     main="Scatter plot of Dry Weight vs. Age", 
     xlab="Age (days)", 
     ylab="Dry Weight (grams)")
abline(lm(DRYWGT~AGE, data=hw1), col="maroon") # add the fitted line after the plot is drawn #

# linear regression line on common log of dry weight vs age
plot(x=hw1$AGE, y=hw1$LOGDRYWG, 
     main="Common Logarithms of Dry Weight vs. Age", 
     xlab="Age (days)", 
     ylab="Common Log of Dry Weight")
abline(lm(LOGDRYWG~AGE, data=hw1), col="navy") # add the fitted line after the plot is drawn #

# The regression line for the common log of dry weight vs. age fits better than the line for dry weight vs. age, so it is more appropriate to run the regression of Z on X. The points in the Z-on-X plot fall close to a straight line, while the Y-on-X plot shows clear curvature around the fitted line. The size of the slope estimate does not measure fit, since the two responses are on different scales.

find the 95% CI on the true slope and intercept for Z on X. Interpret each interval with regard to the null hypothesis that the true parameter is 0

#95% CI of Z on X
confint(model_zonx, level=0.95)

##                  2.5 %     97.5 %
## (Intercept) -2.7585602 -2.6199489
## AGE          0.1898356  0.2019462

# 95% CI intercept (-2.759, -2.620)
# 95% CI slope (0.190, 0.202)

H0: true parameter = 0. Neither of these confidence intervals contains the null value 0. We are 95% confident that the true intercept lies within the range (-2.759, -2.620) and that the true slope lies within the range (0.190, 0.202).

#At the alpha = 0.05 level of significance, we found evidence that both the true intercept and the true slope are significantly different from 0.
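The intervals from confint() are estimate ± t critical value × standard error, with n - 2 degrees of freedom. The sketch below rebuilds them by hand on the built-in `cars` data, used only as a stand-in for the chick embryo data; `manual_ci` is an illustrative name.

```r
fit <- lm(dist ~ speed, data = cars)

coefs <- coef(summary(fit))
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))   # df = n - 2 for simple regression

manual_ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual_ci
confint(fit, level = 0.95)   # should match manual_ci
```

An interval that excludes 0 corresponds to rejecting H0: parameter = 0 in a two-sided test at alpha = 0.05.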

find and sketch 95% confidence and prediction bands. Find and interpret an approximate 95% confidence interval for the mean response of an 8-day-old chick

library(tidyr)
library(ggplot2) # prepping R to use GGplot for following problem #

model <- lm(LOGDRYWG~AGE, data=hw1) #SLR analysis#

pred <- predict(model, interval = "prediction") # 95% prediction interval for an individual future response (log dry weight) at each observed age #

## Warning in predict.lm(model, interval = "prediction"): predictions on current data refer to _future_ responses

new.data <- cbind(hw1, pred) # adding prediction to original data frame #

# 95% CI and prediction bands for Z on X #
ggplot(new.data, 
       aes(x=AGE, y=LOGDRYWG))+
  geom_point()+
  geom_smooth(method=lm, 
              formula=y~x, 
              color="lightblue", 
              se=TRUE)+
  geom_line(aes(y=lwr), 
            color="violet", 
            linetype="dashed")+
  geom_line(aes(y=upr), 
            color="violet", 
            linetype="dashed")+
  ggtitle("Scatterplot of Common Logarithms of Dry Weight vs. Age with Regression Line,\n95% confidence band for mean response (shaded), and\n95% prediction band for individual response (dashed)")

# get 95% point-wise CI for mean response (LOGDRYWG) at AGE = 8 #
new.dat <- data.frame(AGE=8)
predict(model, newdata = new.dat, interval = 'confidence')

##         fit       lwr       upr
## 1 -1.122127 -1.148522 -1.095733

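# 95% CI when age = 8: (-1.149, -1.096) #
# We are 95% confident that the mean common log of dry weight for 8-day-old chick embryos lies between -1.149 and -1.096. #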
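The difference between the confidence band (for the mean response) and the prediction band (for a single new observation) can be seen at one x value. This is a sketch on the built-in `cars` data as a stand-in; the choice of `speed = 15` is arbitrary.

```r
fit   <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = 15)

# interval for the mean response at speed = 15
ci_mean <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
# interval for one new observation at speed = 15
pi_new  <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

width_ci <- ci_mean[, "upr"] - ci_mean[, "lwr"]
width_pi <- pi_new[, "upr"] - pi_new[, "lwr"]
c(confidence = width_ci, prediction = width_pi)
```

Both intervals share the same center (the fitted value), but the prediction interval is wider because it also includes the variability of an individual response around the mean.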

5.4 A sociologist assigned to a correctional institution was interested in studying the relationship between intelligence and delinquency. A delinquency index was formulated to account for the severity and the frequency of the crimes committed, while intelligence was measured by IQ.

# upload data frame #
library(readxl)
CH05Q04 <- read_excel("CH05Q04.xls")

#plotting data with regression line
ggplot(CH05Q04,
       aes(x=IQ, y=DI))+
  geom_point()+
  geom_smooth(method=lm, 
              formula=y~x, 
              color="lightblue", 
              se=FALSE)+
  ggtitle("Scatterplot of DI vs. IQ with Regression Line")

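how do you account for the fact that Yhat = 52.273 when IQ = 0, even though the delinquency index goes no higher than 50?
# IQ = 0 lies far outside the range of observed IQ values, so the intercept is an extrapolation and not a meaningful prediction; we account for this by only using the fitted line within the observed range of IQ, where the predicted values stay within the range of the delinquency index

Decide whether the outlier (IQ = 134, DI = 39.6) has an effect on the estimate of the IQ-DI relationship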

#comparing the slope of the regression line with all points included to that of the regression line without the outlier 

model5.4 <- lm(CH05Q04$DI~CH05Q04$IQ) # slope of regression with all points included #
summary(model5.4)

## 
## Call:
## lm(formula = CH05Q04$DI ~ CH05Q04$IQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7013  -4.7454   0.0903   3.6447  20.6838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.2729    12.0485   4.339 0.000508 ***
## CH05Q04$IQ   -0.2489     0.1154  -2.157 0.046535 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.704 on 16 degrees of freedom
## Multiple R-squared:  0.2253, Adjusted R-squared:  0.1769 
## F-statistic: 4.654 on 1 and 16 DF,  p-value: 0.04653

yhat_noout <- 70.846 + -0.444 * CH05Q04$IQ # fitted values from the regression line with outlier point removed, as given by textbook #
betahat0 <- 70.846
betahat1 <- -0.444
# yhat = 52.2729 + -0.2489 * IQ, best fit line before removing outlier #

We can see that the best fit line is clearly influenced by this outlier: removing it changes the slope from -0.2489 to -0.444, almost doubling its magnitude, and raises the intercept from 52.273 to 70.846. The outlier was flattening the slope of the best fit line.

test the null hypothesis of zero slope when the outlier is removed, given that S_Y|X = 4.933, S_X = 14.693, and n = 12. alpha = 0.05

# step 0: 
alpha <- 0.05
# step 1:
# H0: b_sub_1 = 0
# HA: b_sub_1 != 0
# step 2:
betahat1 <- -0.444
Syx <- 4.933
Sx <- 14.693
n_54 <- 12
T_54 <- betahat1 / (Syx/ (Sx * sqrt(n_54 - 1)))
T_54 # T statistic for hypothesis test, H0: slope = 0 with outlier removed #

## [1] -4.386101

# step 3: 
df_54 <- n_54 - 2 # degrees of freedom, n-2 #
pval <- 2 * pt(-4.386101, df=df_54) # two-sided p value: twice the lower-tail area below the observed T #
pval

## [1] 0.001364339

# step 4:
pval < alpha # p value is less than alpha, reject null hypothesis that slope of regression line = 0 when outlier is removed #

## [1] TRUE

#step 5:
# we have found evidence that the slope of the regression line, when the outlier is removed, is significantly different from 0, at the alpha = 0.05 level
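Steps 2 and 3 can be collected into one function of the summary statistics, using SE(slope) = S_Y|X / (S_X * sqrt(n - 1)). This is a sketch; `slope_t_test` is a hypothetical helper name, and it assumes a two-sided test of H0: slope = 0 in simple linear regression.

```r
# t test of zero slope from summary statistics
slope_t_test <- function(b1, s_yx, s_x, n) {
  se_b1  <- s_yx / (s_x * sqrt(n - 1))        # standard error of the slope
  t_stat <- b1 / se_b1
  p_val  <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value
  c(t = t_stat, df = n - 2, p = p_val)
}

res_54 <- slope_t_test(b1 = -0.444, s_yx = 4.933, s_x = 14.693, n = 12)
res_54
```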

for these data would you conclude that the delinquency index decreases as IQ increases?

cor(CH05Q04$IQ, CH05Q04$DI)# testing the correlation between IQ and DI #

## [1] -0.474686

#we see that the correlation between IQ and DI is negative, which suggests that the delinquency index decreases as IQ increases.
#I will now test the significance of the slope to confirm this

summary(model5.4) # testing statistical significance of the original regression line slope #

## 
## Call:
## lm(formula = CH05Q04$DI ~ CH05Q04$IQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7013  -4.7454   0.0903   3.6447  20.6838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.2729    12.0485   4.339 0.000508 ***
## CH05Q04$IQ   -0.2489     0.1154  -2.157 0.046535 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.704 on 16 degrees of freedom
## Multiple R-squared:  0.2253, Adjusted R-squared:  0.1769 
## F-statistic: 4.654 on 1 and 16 DF,  p-value: 0.04653

From the test, Pr(>|t|) = 0.046535, which is less than the predetermined alpha value of 0.05.

We conclude, at the alpha = 0.05 level, that the slope of the regression line is statistically significant, and thus that we have found evidence that the delinquency index decreases as IQ increases.
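In simple linear regression, the test of zero slope and the test of zero correlation are the same test, so the t value in the summary can be recovered from r alone via t = r * sqrt(n - 2) / sqrt(1 - r^2). This is a sketch using the correlation reported above and n = 18 (the 16 residual degrees of freedom plus 2); the variable names are illustrative.

```r
r <- -0.474686   # cor(IQ, DI) from the output above
n <- 18          # residual df of 16 plus 2 estimated coefficients

t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_from_r <- 2 * pt(-abs(t_from_r), df = n - 2)

c(t = t_from_r, p = p_from_r, r_squared = r^2)
```

The results should agree with the slope row of the summary (t value -2.157, p-value 0.0465) and with Multiple R-squared (0.2253), up to rounding.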

BSTA 512 Homework 1

Aine Huntington

1/14/2022