Jednoduchá logistická regresia

Zadanie: Vytvorte logistický regresný model modelujúci prejdenie skúškou (premenná pass) v závislosti od počtu hodín venovaných príprave (premenná study_hours). Vypočítajte koreláciu medzi premennými. Tiež vyrátajte confusion maticu a metriky modelu. Prešiel by, na základe modelu, student, ktorý venoval príprave 6 hodín?

data <- data.frame(
  study_hours = c(2, 3, 5, 7, 8, 10, 11, 13, 15, 16),
  pass = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1) # 0: fail, 1: pass
)

# Korelácia medzi premennými
correlation <- cor(data$study_hours, data$pass)
cat("Korelácia medzi study_hours a pass: ", correlation, "\n")
## Korelácia medzi study_hours a pass:  0.8056963
# Logistická regresia
model <- glm(pass ~ study_hours, data = data, family = binomial)
## Warning: glm.fit: алгоритм не сошелся
## Warning: glm.fit: возникли подогнанные вероятности 0 или 1
summary(model)
## 
## Call:
## glm(formula = pass ~ study_hours, family = binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -134.44  193058.17  -0.001    0.999
## study_hours     22.39   31296.13   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.2217e+01  on 9  degrees of freedom
## Residual deviance: 7.5703e-10  on 8  degrees of freedom
## AIC: 4
## 
## Number of Fisher Scoring iterations: 25
# Predikcie pravdepodobností
glm.probs <- predict(model, type = "response")

# Confusion matica
threshold <- 0.5
glm.pred <- ifelse(glm.probs > threshold, 1, 0)
cm <- table(glm.pred, data$pass)
cm
##         
## glm.pred 0 1
##        0 3 0
##        1 0 7
# Metriky modelu
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2,2] / sum(cm[,2])
recall <- cm[2,2] / sum(cm[2,])

cat("Accuracy: ", accuracy, "\n")
## Accuracy:  1
cat("Precision: ", precision, "\n")
## Precision:  1
cat("Recall: ", recall, "\n")
## Recall:  1
# Predikcia pre 6 hodín
new_data <- data.frame(study_hours = 6)
prediction <- predict(model, newdata = new_data, type = "response")
predicted_class <- ifelse(prediction > threshold, 1, 0)
cat("Predikcia pre 6 hodín prípravy (pravdepodobnosť): ", prediction, "\n")
## Predikcia pre 6 hodín prípravy (pravdepodobnosť):  0.4780112
cat("Predikovaná trieda: ", ifelse(predicted_class == 1, "PASS", "FAIL"), "\n")
## Predikovaná trieda:  FAIL

Násobná logistická regresia – generované dáta

Zadanie: Vytvorte logistický regresný model modelujúci pravdepodobnosť kúpenia produktu (premenná purchase) v závislosti od premenných age, gender, income a ad_click. Vypočítajte koreláciu medzi vysvetľujúcimi premennými. Tiež vyrátajte confusion maticu a metriky modelu a zhodnoťte ich.

data <- data.frame(
  age = c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
  gender = factor(c("male", "female", "female", "male", "male", "female", "female", "male",
                    "male", "female")),
  income = c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95),
  ad_click = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 1),
  purchase = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 0)
)

# Korelácia vysvetľujúcich premenných
numeric_data <- data[, c("age", "income", "ad_click")]
cors <- cor(numeric_data)
cors
##          age income ad_click
## age        1      1        0
## income     1      1        0
## ad_click   0      0        1
# Logistická regresia
model <- glm(purchase ~ age + gender + income + ad_click, data = data, family = binomial)
## Warning: glm.fit: возникли подогнанные вероятности 0 или 1
summary(model)
## 
## Call:
## glm(formula = purchase ~ age + gender + income + ad_click, family = binomial, 
##     data = data)
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -83.480 244529.833   0.000        1
## age              3.084   6025.081   0.001        1
## gendermale      78.168 157973.344   0.000        1
## income              NA         NA      NA       NA
## ad_click      -156.524 266443.732  -0.001        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3863e+01  on 9  degrees of freedom
## Residual deviance: 4.7545e-10  on 6  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25
# Predikcie a Confusion matica
glm.probs <- predict(model, type = "response")
threshold <- 0.5
glm.pred <- ifelse(glm.probs > threshold, 1, 0)
cm <- table(glm.pred, data$purchase)
cm
##         
## glm.pred 0 1
##        0 5 0
##        1 0 5
# Metriky
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2,2] / sum(cm[,2]) 
recall <- cm[2,2] / sum(cm[2,])    

cat("Accuracy: ", accuracy, "\n")
## Accuracy:  1
cat("Precision: ", precision, "\n")
## Precision:  1
cat("Recall: ", recall, "\n")
## Recall:  1

Násobná logistická regresia - dataset

Zadanie: Vytvorte logistický regresný model predikujúci, či má auto automatickú (premenná am, hodnota 0) alebo manuálnu (premenná am, hodnota 1) prevodovku v závislosti od premenných hp a wt, všetky sa nachádzajú v datasete mtcars. Vypočítajte confusion maticu a metriky modelu a zhodnoťte ich. Sú v danom datasete premenné, ktoré dávajú lepší model a ak áno, ktoré to sú?

data(mtcars)
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
summary(model)
## 
## Call:
## glm(formula = am ~ hp + wt, family = binomial, data = mtcars)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 18.86630    7.44356   2.535  0.01126 * 
## hp           0.03626    0.01773   2.044  0.04091 * 
## wt          -8.08348    3.06868  -2.634  0.00843 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 10.059  on 29  degrees of freedom
## AIC: 16.059
## 
## Number of Fisher Scoring iterations: 8
# Predikcie a Confusion matica
glm.probs <- predict(model, type = "response")
threshold <- 0.5
glm.pred <- ifelse(glm.probs > threshold, 1, 0)
cm <- table(glm.pred, mtcars$am)
cm
##         
## glm.pred  0  1
##        0 18  1
##        1  1 12
# Metriky modelu
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2,2] / sum(cm[,2]) 
recall <- cm[2,2] / sum(cm[2,])    

cat("Accuracy: ", accuracy, "\n")
## Accuracy:  0.9375
cat("Precision: ", precision, "\n")
## Precision:  0.9230769
cat("Recall: ", recall, "\n")
## Recall:  0.9230769
# ROC krivka a AUC
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Присоединяю пакет: 'pROC'
## Следующие объекты скрыты от 'package:stats':
## 
##     cov, smooth, var
roc_score <- roc(mtcars$am, glm.probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_score)
plot(roc_score, main = "ROC curve -- Logistic Regression")

cat("AUC hodnotenie: ", auc_value, "\n")
## AUC hodnotenie:  0.9838057