Suggest the required sample size for each of the following analyses:
Submit an HTML file of your findings.
Simple linear regression: at least 20 observations. Example: you are investigating the relationship between hours of study (predictor variable) and exam scores (response variable) among high school students. With one predictor (hours of study), the minimum sample size is 20 observations. The simulation below uses 50 observations, which exceeds this minimum.
set.seed(123)
hours_of_study <- rnorm(50, mean = 20, sd = 5)
exam_scores <- 50 + 2 * hours_of_study + rnorm(50, mean = 0, sd = 5)
data <- data.frame(hours_of_study, exam_scores)
models <- lm(exam_scores ~ hours_of_study, data = data)
summary(models)
##
## Call:
## lm(formula = exam_scores ~ hours_of_study, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3222 -2.4311 -0.1352 2.4702 10.1279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.4396 2.9180 17.63 <2e-16 ***
## hours_of_study 1.9649 0.1411 13.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.571 on 48 degrees of freedom
## Multiple R-squared: 0.8017, Adjusted R-squared: 0.7975
## F-statistic: 194 on 1 and 48 DF, p-value: < 2.2e-16
plot(models, which = 2)
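To see what the 20-observation minimum means in practice, the following is a minimal sketch that simulates the same study-hours model at n = 20 and n = 50 and compares the standard error of the slope. The helper `slope_se` is a hypothetical name introduced here for illustration. The values will not reproduce the summary above exactly, because the random draws differ.

```r
# Hypothetical helper: simulate the study-hours model and return the
# standard error of the estimated slope for a given sample size n.
slope_se <- function(n) {
  hours_of_study <- rnorm(n, mean = 20, sd = 5)
  exam_scores <- 50 + 2 * hours_of_study + rnorm(n, mean = 0, sd = 5)
  fit <- lm(exam_scores ~ hours_of_study)
  summary(fit)$coefficients["hours_of_study", "Std. Error"]
}

set.seed(123)
# Slope standard error at the 20-observation minimum and at n = 50
se_by_n <- sapply(c(n20 = 20, n50 = 50), slope_se)
print(se_by_n)
```

A smaller standard error at the larger sample size would indicate a more precise slope estimate, which is the reason for requiring a minimum number of observations.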
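Multiple linear regression: at least 10 to 20 observations per predictor. Example: you want to predict a student's college GPA (response variable) from hours of study, extracurricular activities, and high school GPA (three predictor variables). With three predictors, the rule suggests 30 to 60 observations. The simulation below uses 50.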
set.seed(123)
hours_of_study <- rnorm(50, mean = 20, sd = 5)
extracurricular <- rnorm(50, mean = 3, sd = 1)
high_school_GPA <- rnorm(50, mean = 3.5, sd = 0.5)
college_GPA <- 2 + 1.5 * hours_of_study + 0.8 * extracurricular + 0.5 * high_school_GPA + rnorm(50, mean = 0, sd = 1)
data <- data.frame(hours_of_study, extracurricular, high_school_GPA, college_GPA)
modelm <- lm(college_GPA ~ hours_of_study + extracurricular + high_school_GPA, data = data)
summary(modelm)
##
## Call:
## lm(formula = college_GPA ~ hours_of_study + extracurricular +
## high_school_GPA, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4680 -0.6755 -0.1069 0.6309 3.0194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.28083 1.25450 2.615 0.012017 *
## hours_of_study 1.47437 0.02905 50.747 < 2e-16 ***
## extracurricular 0.62984 0.15032 4.190 0.000125 ***
## high_school_GPA 0.44380 0.27509 1.613 0.113520
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9406 on 46 degrees of freedom
## Multiple R-squared: 0.9825, Adjusted R-squared: 0.9814
## F-statistic: 862.6 on 3 and 46 DF, p-value: < 2.2e-16
plot(modelm, which = 2)
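The arithmetic behind the per-predictor rule can be sketched as a small function. `required_n` is a hypothetical helper written for this illustration, and the 10 to 20 range is the rule of thumb used in this report, not a formal power calculation.

```r
# Hypothetical helper: range of sample sizes implied by the
# "10 to 20 observations per predictor" rule of thumb.
required_n <- function(n_predictors, per_predictor = c(min = 10, max = 20)) {
  n_predictors * per_predictor
}

# Three predictors: hours of study, extracurricular activities, high school GPA
n_needed <- required_n(3)
print(n_needed)

# The simulated data set above has 50 rows; check it against the lower bound
n_available <- 50
print(n_available >= n_needed[["min"]])
```

With three predictors the function returns a range of 30 to 60, so the 50 simulated observations sit above the lower bound but below the stricter upper bound.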
Logistic regression: at least 10 to 20 events per predictor variable, where events are counted in the less frequent outcome class. With three predictors and 20 events per predictor, that is 3 × 20 = 60 events. Example: you are studying the likelihood of a customer making a purchase (binary outcome: 1 for purchase, 0 for no purchase) based on their age, income, and browsing time on a website.
set.seed(123)
age <- rnorm(100, mean = 30, sd = 5)
income <- rnorm(100, mean = 50000, sd = 10000)
browsing_time <- rnorm(100, mean = 30, sd = 10)
purchase <- rbinom(100, size = 1, prob = plogis(-2 + 0.1 * age + 0.0005 * income + 0.5 * browsing_time))
data <- data.frame(age, income, browsing_time, purchase)
modell <- glm(purchase ~ age + income + browsing_time, data = data, family = "binomial")
summary(modell)
##
## Call:
## glm(formula = purchase ~ age + income + browsing_time, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 3.415e+05 0 1
## age -1.165e-10 7.917e+03 0 1
## income -1.005e-14 3.707e+00 0 1
## browsing_time 1.265e-11 3.801e+03 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 99 degrees of freedom
## Residual deviance: 5.8016e-10 on 96 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
plot(modell, which = 2)
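The fit above is degenerate: the null deviance is 0 and every standard error is enormous. At the predictor means the simulated linear predictor is about -2 + 0.1 * 30 + 0.0005 * 50000 + 0.5 * 30 = 41, so the purchase probability is essentially 1 and no non-purchases are generated. The sketch below is one possible way to get a usable example. The coefficients are assumptions chosen for illustration (they are not from the original analysis) so that both outcomes occur, and the code then counts events per predictor.

```r
set.seed(123)
n <- 100
age <- rnorm(n, mean = 30, sd = 5)
income <- rnorm(n, mean = 50000, sd = 10000)
browsing_time <- rnorm(n, mean = 30, sd = 10)

# Assumed coefficients (not from the original): chosen so the linear
# predictor is about -0.1 at the predictor means.
eta <- -6 + 0.05 * age + 0.00004 * income + 0.08 * browsing_time
purchase <- rbinom(n, size = 1, prob = plogis(eta))

# Count the less frequent outcome and divide by the number of predictors
events <- min(sum(purchase == 1), sum(purchase == 0))
events_per_predictor <- events / 3
print(table(purchase))
print(events_per_predictor)

# Refit the logistic regression on the rescaled simulation
sim_data <- data.frame(age, income, browsing_time, purchase)
fit <- glm(purchase ~ age + income + browsing_time,
           data = sim_data, family = binomial)
print(summary(fit)$coefficients)
```

Because the less frequent outcome can account for at most half of the sample, 100 observations can supply at most 50 events. Reaching the 60 events suggested above would therefore take at least 120 observations, and more if the outcomes are unbalanced.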
# Multiple Linear Regression Example
# Generate some example data
set.seed(123)
hours_of_study <- rnorm(50, mean = 20, sd = 5)
extracurricular <- rnorm(50, mean = 3, sd = 1)
high_school_GPA <- rnorm(50, mean = 3.5, sd = 0.5)
college_GPA <- 2 + 1.5 * hours_of_study + 0.8 * extracurricular + 0.5 * high_school_GPA + rnorm(50, mean = 0, sd = 1)
# Create a data frame
data <- data.frame(hours_of_study, extracurricular, high_school_GPA, college_GPA)
# Fit the multiple linear regression model
model <- lm(college_GPA ~ hours_of_study + extracurricular + high_school_GPA, data = data)
# Print the summary
summary(model)
##
## Call:
## lm(formula = college_GPA ~ hours_of_study + extracurricular +
## high_school_GPA, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4680 -0.6755 -0.1069 0.6309 3.0194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.28083 1.25450 2.615 0.012017 *
## hours_of_study 1.47437 0.02905 50.747 < 2e-16 ***
## extracurricular 0.62984 0.15032 4.190 0.000125 ***
## high_school_GPA 0.44380 0.27509 1.613 0.113520
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9406 on 46 degrees of freedom
## Multiple R-squared: 0.9825, Adjusted R-squared: 0.9814
## F-statistic: 862.6 on 3 and 46 DF, p-value: < 2.2e-16
plot(model)