Problem # 8

Provide guidance on the required sample sizes for the following analyses:

Simple Linear Regression
Multiple Linear Regression
Logistic Regression

Submit an HTML file of your findings.

Simple Linear Regression

A common rule of thumb is at least 20 observations. Example: you are investigating the relationship between hours of study (predictor variable) and exam scores (response variable) among high school students. With a single predictor (hours of study), the minimum sample size is 20 observations. The simulation below uses 50 observations, which exceeds this minimum.

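# Generate some example data: 50 students, true slope of 2 plus random noise
set.seed(123)
hours_of_study <- rnorm(50, mean = 20, sd = 5)
exam_scores <- 50 + 2 * hours_of_study + rnorm(50, mean = 0, sd = 5)

# Create a data frame
data <- data.frame(hours_of_study, exam_scores)

# Fit the simple linear regression model
models <- lm(exam_scores ~ hours_of_study, data = data)

# Print the summary
summary(models)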

## 
## Call:
## lm(formula = exam_scores ~ hours_of_study, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3222  -2.4311  -0.1352   2.4702  10.1279 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     51.4396     2.9180   17.63   <2e-16 ***
## hours_of_study   1.9649     0.1411   13.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.571 on 48 degrees of freedom
## Multiple R-squared:  0.8017, Adjusted R-squared:  0.7975 
## F-statistic:   194 on 1 and 48 DF,  p-value: < 2.2e-16

plot(models, which = 2)
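
The 20-observation rule of thumb can be checked against a specific scenario by simulation. The sketch below is illustrative only: `simulate_power` is a hypothetical helper that is not part of the original analysis, and the weak slope of 0.3 is an assumed value chosen to contrast with the slope of 2 used above. It estimates how often the slope is detected at the 5% level when n = 20.

```r
# Hypothetical helper: estimate, by simulation, the probability that the
# slope in a simple linear regression is significant at level alpha.
simulate_power <- function(n, slope, sd_x = 5, sd_error = 5,
                           n_sims = 500, alpha = 0.05) {
  p_values <- replicate(n_sims, {
    x <- rnorm(n, mean = 20, sd = sd_x)
    y <- 50 + slope * x + rnorm(n, mean = 0, sd = sd_error)
    # p-value of the t-test for the slope coefficient
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  mean(p_values < alpha)
}

set.seed(123)
power_strong <- simulate_power(n = 20, slope = 2)    # slope used in the example
power_weak   <- simulate_power(n = 20, slope = 0.3)  # assumed weaker effect

power_strong
power_weak
```

Under these assumptions, the strong slope is detected in nearly every simulated sample at n = 20, while the weaker slope is detected noticeably less often. This suggests the rule of thumb is a floor rather than a guarantee: smaller effects or noisier data call for more observations.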

Multiple Linear Regression

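A common rule of thumb is at least 10 to 20 observations per predictor. Example: you want to predict a student’s college GPA (response variable) based on hours of study, extracurricular activities, and high school GPA (three predictor variables). With three predictors, this gives a minimum of 3 × 10 = 30 to 3 × 20 = 60 observations.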

# Generate some example data
set.seed(123)
hours_of_study <- rnorm(50, mean = 20, sd = 5)
extracurricular <- rnorm(50, mean = 3, sd = 1)
high_school_GPA <- rnorm(50, mean = 3.5, sd = 0.5)
college_GPA <- 2 + 1.5 * hours_of_study + 0.8 * extracurricular + 0.5 * high_school_GPA + rnorm(50, mean = 0, sd = 1)

# Create a data frame
data <- data.frame(hours_of_study, extracurricular, high_school_GPA, college_GPA)

# Fit the multiple linear regression model
modelm <- lm(college_GPA ~ hours_of_study + extracurricular + high_school_GPA, data = data)

# Print the summary
summary(modelm)

## 
## Call:
## lm(formula = college_GPA ~ hours_of_study + extracurricular + 
##     high_school_GPA, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4680 -0.6755 -0.1069  0.6309  3.0194 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.28083    1.25450   2.615 0.012017 *  
## hours_of_study   1.47437    0.02905  50.747  < 2e-16 ***
## extracurricular  0.62984    0.15032   4.190 0.000125 ***
## high_school_GPA  0.44380    0.27509   1.613 0.113520    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9406 on 46 degrees of freedom
## Multiple R-squared:  0.9825, Adjusted R-squared:  0.9814 
## F-statistic: 862.6 on 3 and 46 DF,  p-value: < 2.2e-16

plot(modelm, which = 2)
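
The per-predictor rule reduces to simple arithmetic. The following is a minimal sketch, assuming the 10 to 20 observations-per-predictor guideline stated in this section; `required_n_range` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: minimum sample size range under the
# 10 to 20 observations-per-predictor rule of thumb.
required_n_range <- function(n_predictors, per_predictor = c(10, 20)) {
  stopifnot(n_predictors >= 1)
  setNames(n_predictors * per_predictor, c("lower", "upper"))
}

n_range <- required_n_range(3)      # three predictors, as in the GPA example
n_range                             # lower = 30, upper = 60

# The simulated data set above has 50 rows
n_simulated <- 50
n_simulated >= n_range[["lower"]]   # TRUE: meets the lower bound
n_simulated >= n_range[["upper"]]   # FALSE: below the stricter bound
```

The simulated sample of 50 therefore satisfies the lenient end of the guideline but not the strict end.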

Logistic Regression

A common rule of thumb is at least 10 to 20 events per predictor variable. With three predictors and 20 events per predictor, 3 × 20 = 60 events are needed. Example: you are studying the likelihood of a customer making a purchase (binary outcome: 1 for purchase, 0 for no purchase) based on their age, income, and browsing time on a website.
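
Because the rule counts events rather than rows, the total sample size also depends on how often the outcome occurs. The sketch below works through that arithmetic; `required_sample_logistic` is a hypothetical helper, and the 25% purchase rate is an assumed value used only for illustration.

```r
# Hypothetical helper: events and total sample size under an
# events-per-variable (EPV) rule of thumb.
required_sample_logistic <- function(n_predictors, events_per_predictor, event_rate) {
  stopifnot(event_rate > 0, event_rate < 1)
  events_needed <- n_predictors * events_per_predictor
  # The rule is usually applied to the less frequent of the two outcomes
  limiting_rate <- min(event_rate, 1 - event_rate)
  total_n <- ceiling(events_needed / limiting_rate)
  c(events_needed = events_needed, total_n = total_n)
}

# Three predictors, 20 events each, assumed purchase rate of 25%
result <- required_sample_logistic(3, 20, 0.25)
result  # events_needed = 60, total_n = 240
```

With a rarer outcome, the same 60 events would require a proportionally larger total sample.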

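# Generate some example data
# Note: the income term alone contributes about 0.0005 * 50000 = 25 to the
# linear predictor, so the purchase probability is essentially 1 for every row
set.seed(123)
age <- rnorm(100, mean = 30, sd = 5)
income <- rnorm(100, mean = 50000, sd = 10000)
browsing_time <- rnorm(100, mean = 30, sd = 10)
purchase <- rbinom(100, size = 1, prob = plogis(-2 + 0.1 * age + 0.0005 * income + 0.5 * browsing_time))

# Create a data frame
data <- data.frame(age, income, browsing_time, purchase)

# Fit the logistic regression model
modell <- glm(purchase ~ age + income + browsing_time, data = data, family = "binomial")

# Print the summary
summary(modell)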

## 
## Call:
## glm(formula = purchase ~ age + income + browsing_time, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    2.657e+01  3.415e+05       0        1
## age           -1.165e-10  7.917e+03       0        1
## income        -1.005e-14  3.707e+00       0        1
## browsing_time  1.265e-11  3.801e+03       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 99  degrees of freedom
## Residual deviance: 5.8016e-10  on 96  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

plot(modell, which = 2)
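
In the output above, the null deviance is 0, the standard errors are enormous, and Fisher scoring stopped at 25 iterations. A null deviance of 0 means every simulated outcome has the same value: the income term alone contributes about 0.0005 × 50000 = 25 to the linear predictor, so `plogis()` returns a probability of essentially 1 and there are no non-purchases to model. The following is a sketch of one way to obtain a usable simulation. The sample size of 300 and the coefficients on centered predictors are assumed values chosen for illustration, not part of the original analysis.

```r
set.seed(123)
n <- 300  # assumed sample size, chosen so both outcomes exceed 60 cases

age <- rnorm(n, mean = 30, sd = 5)
income <- rnorm(n, mean = 50000, sd = 10000)
browsing_time <- rnorm(n, mean = 30, sd = 10)

# Assumed coefficients on centered predictors, which keep the linear
# predictor near 0 so that both purchases and non-purchases occur
linear_predictor <- -0.5 +
  0.05 * (age - 30) +
  0.00003 * (income - 50000) +
  0.05 * (browsing_time - 30)
purchase <- rbinom(n, size = 1, prob = plogis(linear_predictor))

# Count events and compare with the 60-event target (3 predictors * 20)
n_events <- sum(purchase)
n_non_events <- n - n_events
c(events = n_events, non_events = n_non_events)

sim_data <- data.frame(age, income, browsing_time, purchase)
fit <- glm(purchase ~ age + income + browsing_time,
           data = sim_data, family = "binomial")
summary(fit)
```

Checking the event count before fitting is a quick way to confirm that the events-per-predictor guideline is met for both outcome categories.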

Justin Lian G. Caballero

2024-12-10
