Probability, distribution and Statistics basics

1. Uncertainty (6 points)

Listed three random events in your daily life, and how do you/people usually deal with them? Three random events in my daily life include: 1. Playing games of League of Legends, where the other players in the lobby are randomized from active players also queuing at the same time in North America. This randomization is usually based on ranking, however since I have no control over it I deal with it by just playing the game and fore-fitting if its too hard to win. 2. Flipping a coin is a random event in daily life, with two options being heads or tails. I have a hard time making decisions and I like to use flipping a coin to help, like heads to eat rice and tails for noodles. If I don’t like the result of the coin flip, I know how I really feel. 3. Buying a lottery tickets is a random event in real life, where people usually deal with the randomness by choosing numbers that they like, or allowing the machine to randomly generate their number.

2. Probability (12 points)

please calculate the probability of the following events:

a)

A fair coin is flipped 5 times. What is the probability of getting exactly 3 heads? (Hint: binomial distribution) Binomial(5,3) = 5!/(3!(5-3)!) = 5!/(3!2!) = (5*4)/2 = 10 P(X=3)=binomial(5,3)(1/2)3(1/2)2 =10(1/8)(1/4) =10/32 =5/16

b)

A fair coin is flipped 5 times. What is the probability of getting at least 3 heads and at most 4 heads? (Hint: the event can be decomposed into two events: getting exactly 3 heads and getting exactly 4 heads. The probability of the union of two events can be calculated after you calculate the probability of each event.) P(X=3)= 5/16 P(X=4)=(5!/(4!1!))(1/2)4(1/2)1 = 5(1/16)(1/2) = 5/32 P(X=3 or X=4| roll 5 times)= P(X=3|roll 5 times) + P(X=4|roll 5 times) = 10/32+5/32 = 15/32 ### c) A box contains 5 red balls and 5 black balls. Two balls are drawn at random. What is the probability that both balls are red? (Hint: hypergeometric distribution) P(First red)= 5/10 P(Second red|First red)= 4/9 P(both red)= P(first red)P(second red|first red)= (1/2)(4/9) =2/9

d)

Two coins are in a box. One is a fair coin, and the other is a double-headed coin. A coin is selected at random and flipped. The coin lands heads. What is the probability that the coin is the fair coin? (Hint: Bayes’ theorem) P(heads)=P(heads|fair coin)P(fair coin) + P(heads|double-head coin)P(double-head coin) =(1/2)(1/2)+(1)(1/2) =3/4 P(fair|heads)=P(heads|fair)P(fair)/P(heads) =(1/2)(1/2)/(3/4) =1/3

3. Descriptive statistics (12 points)

Please calculate the following statistics:

a)

The mean, median, standard deviation, range, and quantiles of the following data:

data <- c(rnorm(100, mean = 10, sd = 2), 50, 75, 100, 1)
mean_val <- mean(data)
median_val <-median(data)
sd_val <- sd(data)
range_val <- range(data)
quantiles <- quantile(data)

cat("Mean:", mean_val, "\n")
## Mean: 11.85484
cat("Median:", median_val, "\n")
## Median: 10.06745
cat("Standard Deviation:", sd_val, "\n")
## Standard Deviation: 11.68614
cat("Range:", range_val[1], "to", range_val[2], "\n")
## Range: 1 to 100
cat("Quantiles:\n")
## Quantiles:
print(quantiles)
##         0%        25%        50%        75%       100% 
##   1.000000   8.683781  10.067454  11.614267 100.000000

b)

view the distribution of the data, what do you find?

# Load package
library(ggplot2)

# Convert data to a data frame
data_df <- data.frame(values = data)

# Histogram
ggplot(data_df, aes(x = values)) +
  geom_histogram(bins = 30, color = "black", fill = "skyblue", alpha = 0.7) +
  ggtitle("Histogram of Data") +
  xlab("Values") +
  ylab("Frequency") +
  theme_minimal()

# Boxplot
ggplot(data_df, aes(y = values)) +
  geom_boxplot(fill = "pink", color = "black", alpha = 0.7) +
  ggtitle("Boxplot of Data") +
  ylab("Values") +
  theme_minimal()

c)

remove the outliers from the data and calculate the mean, median, standard deviation, range, and quantiles of the data after removing the outliers. (Hint: you can define the outliers using the boxplot’s rule.)

##IQR calculations
Q1 <- quantile(data, 0.25)
Q3 <- quantile(data, 0.75)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - (1.5*IQR_val)
upper_bound <- Q3 - (1.5*IQR_val)

#removing outliers
clean_data <- data[data >= lower_bound & data <= upper_bound]

c_mean <- mean(clean_data)
c_median <- median(clean_data)
c_sd <- sd(clean_data)
c_range <- range(clean_data)
c_quantiles <- quantile(clean_data)

##print
cat("Mean after removing outliers:", c_mean, "\n")
## Mean after removing outliers: 6.411333
cat("Median after removing outliers:", c_median, "\n")
## Median after removing outliers: 6.734585
cat("Standard Deviation after removing outliers:", c_sd, "\n")
## Standard Deviation after removing outliers: 0.8907822
cat("Range after removing outliers:", c_range[1], "to", c_range[2], "\n")
## Range after removing outliers: 4.653954 to 7.183366
cat("Quantiles after removing outliers:\n")
## Quantiles after removing outliers:
print(c_quantiles)
##       0%      25%      50%      75%     100% 
## 4.653954 6.269732 6.734585 7.000707 7.183366

Inferential statistics

4. Hypothesis testing with continuous data (12 points):

You are given two groups of data representing the scores on two different types of training programs for employees. Each employee’s score is given as follows:

  • Group 1 (Training Program A): 68,75,80,72,74,79,85,90,88,92

  • Group 2 (Training Program B): 60,65,63,62,64,61,70,66,68,67

You aim to test the difference in means:

a)

State the null and alternative hypotheses for testing whether there is a significant difference between the means of the two training programs H0:There is no significant difference between the mean scores of Training Program A and Training Program B, where H0: u_a=u_b. H1: There is a significant difference between the mean scores of Training Program A and Training Program B, where H1” u_a =/= (does not equal) u_b.

b)

Perform a two-sample t-test (independent samples) for the means of the two groups.

group_A <- c(68,75,80,72,74,79,85,90,88,92)
group_B <- c(60,65,63,62,64,61,70,66,68,67)

t_test <- t.test(group_A, group_B, alternative = "two.sided", var.equal = TRUE)
print(t_test)
## 
##  Two Sample t-test
## 
## data:  group_A and group_B
## t = 5.6518, df = 18, p-value = 2.317e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.863872 21.536128
## sample estimates:
## mean of x mean of y 
##      80.3      64.6
cat("Test Statistic (t-value):", t_test$statistic, "\n")
## Test Statistic (t-value): 5.651774
cat("P-value:", t_test$p.value, "\n")
## P-value: 2.317381e-05

c)

Calculate the test statistic and p-value. Test Statistic (t-value): 5.651774 P-value: 2.317381e-05 ### d) Use a significance level of 0.05. Do you reject the null hypothesis? What is your conclusion? We reject the null hypothesis as our p_value < 0.05, concluding that there is a statistically significant difference between Training Program A and Training Program B.

5. Hypothesis testing with discrete data (12 points):

A research team conducted a study to determine whether a new vaccine reduces the risk of contracting a specific illness. They collected data from two groups of people: one group that received the vaccine and another group that did not. The data below shows the number of individuals who got sick and those who remained healthy in each group.

  • Vaccinated: total 200 people (30 sick and 170 healthy)

  • Unvaccinated: total 200 people (50 sick and 150 healthy)

a)

State the null and alternative hypotheses for testing whether the vaccine reduces the risk of contracting the illness. H0: The vaccine does not reduce the risk of contracting the illness, the probability of getting sick is the same for both vaccinated and unvaccinated individuals. P_vaccinated = P_unvaccinated. H1: The vaccine reduces the risk of contracting the illness, the probability of getting sick is lower in the vaccinated individuals than unvaccinated individuals. P_vaccinated<P_unvaccinated

b)

Perform a chi-square test to determine whether there is a significant difference in the risk of contracting the illness between the vaccinated and unvaccinated groups.

data_matrix <- matrix(c(20, 170, 50, 150), nrow= 2, byrow = TRUE)

rownames(data_matrix) <- c("Sick", "Healthy")
colnames(data_matrix) <- c("Vaccinated", "Unvaccinated")

chi_test <- chisq.test(data_matrix)

print(chi_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data_matrix
## X-squared = 12.894, df = 1, p-value = 0.0003296

c)

Calculate the test statistic and p-value.

cat("Chi-Square Test Statistic:", chi_test$statistic, "\n")
## Chi-Square Test Statistic: 12.89448
cat("P-value:", chi_test$p.value, "\n")
## P-value: 0.000329553

d)

Use a significance level of 0.05. Do you reject the null hypothesis? What is your conclusion? We reject the null hypothesis since p-value < 0.05, therefore there is evidence to conclude that the vaccination reduces the risk of contracting the illness and therefore the difference between the vaccinated and unvaccinated groups is statistically significant.

Resampling

6. Permutation test (12 points):

use the permutation test procedure to test whether the vaccine reduces the risk of contracting the illness in the previous question.

#reproductability
set.seed(42)

#data
vaccinated <- c(rep(1,30), rep(0,170)) # 1 = sick, 0 = healthy
unvaccinated <- c(rep(1,50), rep(0, 150))

data_matrix <- c(vaccinated, unvaccinated)
group_labels <- c(rep("Vaccinated", length(vaccinated)), rep("Unvaccinated", length(unvaccinated)))

#calculate obs differences 
obs_diff <- mean(vaccinated) - mean(unvaccinated)

#permutation test function
perm_test <- function(data, labels, num_permutations = 10000) {
  perm_diffs <- numeric(num_permutations) # stores differences
  
  for (i in 1:num_permutations) {
    shuffled_labels <- sample(labels)
    
    group1 <- data[shuffled_labels == "Vaccinated"]
    group2 <- data[shuffled_labels == "Unvaccinated"]
    
    perm_diffs[i] <- mean(group1) - mean(group2)
  }
    p_value <- mean(perm_diffs <= obs_diff)
    
    return(p_value)
}

p_value_perm <- perm_test(data_matrix, group_labels, num_permutations = 10000)

cat("Observed Difference in Promotion Rates:", obs_diff, "\n")
## Observed Difference in Promotion Rates: -0.1
cat("Permutation Test P-value:", p_value_perm, "\n")
## Permutation Test P-value: 0.0069

7. Bootstrap (12 points):

use the bootstrap procedure to estimate the 95% confidence interval of the risk difference between the vaccinated and unvaccinated groups in the previous question.

#reproductability
set.seed(42)

#data
vaccinated <- c(rep(1,30), rep(0,170)) # 1 = sick, 0 = healthy
unvaccinated <- c(rep(1,50), rep(0, 150))

#observed risk difference
rd_obs <- mean(unvaccinated) - mean(vaccinated)

# Bootstrap function
set.seed(123)  # for reproducibility
n_boot <- 1000

bootstrap_diffs <- function(vaccinated, unvaccinated, num_bootstrap = 10000) {
  boot_rd <- numeric(num_bootstrap)
  
  for (i in 1:num_bootstrap) {
    #resampling
    boot_vaccinated <- sample(vaccinated, replace = TRUE)
    boot_unvaccinated <- sample(unvaccinated, replace = TRUE)
    
    #compute risk differences
    boot_rd[i] <- mean(boot_unvaccinated) - mean(boot_vaccinated)
  }
  # Calculate 95% confidence interval (percentile method)
ci_lower <- quantile(boot_rd, 0.025)
ci_upper <- quantile(boot_rd, 0.975)

return(list(ci_lower = ci_lower, ci_upper = ci_upper, boot_rd = boot_rd))

}

bootstrap_results <- bootstrap_diffs(vaccinated, unvaccinated, num_bootstrap = 10000)

# Extract confidence interval and bootstrapped differences
ci_lower <- bootstrap_results$ci_lower
ci_upper <- bootstrap_results$ci_upper
boot_rd <- bootstrap_results$boot_rd  # Bootstrapped risk differences

# Print results
cat("Observed Difference in Survival Rates:", rd_obs, "\n")
## Observed Difference in Survival Rates: 0.1
cat("95% Confidence Interval: [", ci_lower, ",", ci_upper, "]\n")
## 95% Confidence Interval: [ 0.025 , 0.175 ]
# Visualize the bootstrap distribution
hist(boot_rd, breaks = 50, main = "Bootstrap Distribution of Sickness Rate Difference",
     xlab = "Difference in Sickness Rates (Unvaccinated - Vaccinated)", border = "blue")
abline(v = c(ci_lower, ci_upper), col = "red", lwd = 2, lty = 2)
abline(v = rd_obs, col = "darkgreen", lwd = 2)
text(rd_obs, max(table(cut(rd_obs, breaks = 50))), 
     paste("Observed Diff =", round(rd_obs, 3)), pos = 4, col = "darkgreen")

8. False discovery (6 points):

A research team conducted an RNA-seq experiment to compare gene expression levels between cancer patients and healthy individuals. They performed t-tests for 20 genes to assess whether their expression levels significantly differ between the two groups. The table below shows the raw p-values obtained from the tests.

p_values <- c(
  GENE1 = 0.0013, GENE2 = 0.0321, GENE3 = 0.0458, GENE4 = 0.0089, GENE5 = 0.1123, 
  GENE6 = 0.0754, GENE7 = 0.0024, GENE8 = 0.0976, GENE9 = 0.1543, GENE10 = 0.0105, 
  GENE11 = 0.0038, GENE12 = 0.0789, GENE13 = 0.0641, GENE14 = 0.0217, GENE15 = 0.1895, 
  GENE16 = 0.0592, GENE17 = 0.0413, GENE18 = 0.0872, GENE19 = 0.0059, GENE20 = 0.0364
)

a)

explain the meaning of p-values The p_value is a statistical measure that helps to determine the significance of test results. It represents, probability of obtaining the test statistic as extreme rather than observed, under the assumption the null hypothesis is true. Here the null hypothesis is that the gene is not differentially expressed between cancer and healthy individuals.

b)

Before applying the BH procedure, how many genes had a raw p-value below 0.05?

num_significant <- sum(p_values < 0.05)
cat("Number of genes with p-value < 0.05:", num_significant, "\n")
## Number of genes with p-value < 0.05: 11

c)

After applying the BH procedure, how many genes remain significant at FDR = 0.05? Why the number of significant genes is different before and after applying the BH procedure?

#BH procedure multiple testing

# Define the p-values
p_values <- c(
  GENE1 = 0.0013, GENE2 = 0.0321, GENE3 = 0.0458, GENE4 = 0.0089, GENE5 = 0.1123, 
  GENE6 = 0.0754, GENE7 = 0.0024, GENE8 = 0.0976, GENE9 = 0.1543, GENE10 = 0.0105, 
  GENE11 = 0.0038, GENE12 = 0.0789, GENE13 = 0.0641, GENE14 = 0.0217, GENE15 = 0.1895, 
  GENE16 = 0.0592, GENE17 = 0.0413, GENE18 = 0.0872, GENE19 = 0.0059, GENE20 = 0.0364
)

#apply BH
adjusted_pval <- p.adjust(p_values, method = "BH")

#combine into data frame

p_results <- data.frame(Gene = names(p_values), Raw_P_Value = p_values, Adjusted_P_Value = adjusted_pval)

# Sort by adjusted p-value for better interpretation
p_results <- p_results[order(p_results$Adjusted_P_Value), ]

print(p_results)
##          Gene Raw_P_Value Adjusted_P_Value
## GENE1   GENE1      0.0013       0.02400000
## GENE7   GENE7      0.0024       0.02400000
## GENE11 GENE11      0.0038       0.02533333
## GENE19 GENE19      0.0059       0.02950000
## GENE4   GENE4      0.0089       0.03500000
## GENE10 GENE10      0.0105       0.03500000
## GENE14 GENE14      0.0217       0.06200000
## GENE2   GENE2      0.0321       0.08025000
## GENE20 GENE20      0.0364       0.08088889
## GENE17 GENE17      0.0413       0.08260000
## GENE3   GENE3      0.0458       0.08327273
## GENE13 GENE13      0.0641       0.09861538
## GENE16 GENE16      0.0592       0.09861538
## GENE6   GENE6      0.0754       0.10520000
## GENE12 GENE12      0.0789       0.10520000
## GENE18 GENE18      0.0872       0.10900000
## GENE8   GENE8      0.0976       0.11482353
## GENE5   GENE5      0.1123       0.12477778
## GENE9   GENE9      0.1543       0.16242105
## GENE15 GENE15      0.1895       0.18950000
num_sig_BH <- sum(adjusted_pval < .05)

cat("Number of significant genes after BH correction (FDR < .05):", num_sig_BH, "\n")
## Number of significant genes after BH correction (FDR < .05): 6

Modeling

9. Regression and prediction (8 points):

a) Linear Regression

Using mtcars dataset, fit a linear regression model to predict Miles Per Gallon (mpg) based on Horsepower (hp), and predict the mpg when hp = 100

data("mtcars")

lm_model <- lm(mpg ~ hp, data=mtcars)

summary(lm_model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
new_data <- data.frame(hp=100)

predicted_mpg <- predict (lm_model, new_data)

cat("Predicted mpg when horsepower = 100:", predicted_mpg, "\n")
## Predicted mpg when horsepower = 100: 23.27603

b) Logistic Regression

Using iris dataset, fit a logistic regression model to predict whether a flower is setosa (1) or not (0) based on Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. Use the first 100 samples as the training set and the remaining samples as the test set. Calculate the accuracy of the model on the test set.

data("iris")

# Predict whether a species is "setosa" based on sepal and petal measurements
iris$Setosa <- ifelse(iris$Species == "setosa", 1, 0)

#split data
train_data <- iris[1:100, ]
test_data <- iris[101:150, ]


iris_logistic <- glm(Setosa~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = train_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Statistical tests for logistic regression
summary(iris_logistic)
## 
## Call:
## glm(formula = Setosa ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, family = binomial, data = train_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -6.556 601950.495       0        1
## Sepal.Length      9.879 194223.317       0        1
## Sepal.Width       7.418  92924.482       0        1
## Petal.Length    -19.054 144516.044       0        1
## Petal.Width     -25.033 216059.004       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3863e+02  on 99  degrees of freedom
## Residual deviance: 1.3166e-09  on 95  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25
#predict probability on test set
test_prob <- predict(iris_logistic, newdata = test_data, type = "response")

#convert probability into binary predictions with a threshold of .5
test_pred <- ifelse(test_prob > 0.5, 1, 0)

#accuracy
accuracy <- mean(test_pred == test_data$Setosa)

cat("Accuracy of logistic regression model on the test set:", accuracy, "\n")
## Accuracy of logistic regression model on the test set: 1

10. Model selection (8 points):

a) AIC

Using mtcars dataset, fit linear regression models to predict mpg according to the following specifications and select the best model using the AIC criterion:

  • Model 1: hp, wt, and disp as the predictor variables

  • Model 2: hp and wt as the predictor variables

  • Model 3: hp as the predictor variable

data(mtcars)

model1 <- lm(mpg~hp + wt + disp, data = mtcars)
model2 <- lm(mpg~hp + wt, data = mtcars)
model3 <- lm(mpg~hp, data = mtcars)

#compute aic values
aic_values <- c(Model_1 = AIC(model1), Model_2 = AIC(model2), Model_3 = AIC(model3))

print(aic_values)
##  Model_1  Model_2  Model_3 
## 158.6430 156.6523 181.2386
best_model <- names(which.min(aic_values))

cat("Best model based on AIC:", best_model, "\n")
## Best model based on AIC: Model_2

b) Cross-validation

Using iris dataset, fit a logistic regression model to predict whether a flower is setosa (1) or not (0) based on the following specifications and select the best model using cross-validation:

  • Model 1: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width as the predictor variables

  • Model 2: Sepal.Length and Petal.Length as the predictor variables

  • Model 3: Petal.Length as the predictor variable

library(boot)
library(glmnet)  # Load glmnet
## Loading required package: Matrix
## Loaded glmnet 4.1-8
data(iris)

iris$Setosa <- ifelse(iris$Species == "setosa", 1, 0)

# Standardize predictors to help with convergence
iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")] <- 
  scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])

model1 <- glm(Setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
              data = iris, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model2 <- glm(Setosa ~ Sepal.Length + Petal.Length, 
              data = iris, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model3 <- glm(Setosa ~ Petal.Length, 
              data = iris, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# 10-fold cross validation
set.seed(42)
cv_model1 <- cv.glm(iris, model1, K=10)$delta[1]
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv_model2 <- cv.glm(iris, model2, K=10)$delta[1]
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv_model3 <- cv.glm(iris, model3, K=10)$delta[1]
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cv_errors <- c(Model_1 = cv_model1, Model_2 = cv_model2, Model_3 = cv_model3)

print(cv_errors)
##      Model_1      Model_2      Model_3 
## 2.668769e-07 3.270618e-15 2.315645e-13
best_model <- names(which.min(cv_errors))

cat("Best model based on cross-validation:", best_model, "\n" )
## Best model based on cross-validation: Model_2