Data Mining

Data:25/10/2021

Pacotes Necessários para Análise

library(haven)

## Warning: package 'haven' was built under R version 4.0.5

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

library(readr)

## Warning: package 'readr' was built under R version 4.0.5

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.0.5

library(caret)

## Warning: package 'caret' was built under R version 4.0.5

## Loading required package: lattice

library(Rcpp)

## Warning: package 'Rcpp' was built under R version 4.0.5

Dataset: voorbeeld7_1.sav

1º Import the data from voorbeeld7_1.sav and save the table under the name chol1.

chol1 <- read_sav("C:/Users/naian/Documents/UTwente/DataScience/voorbeeld7_1.sav")
View(chol1)

2º Make a scatterplot, with the function ggplot, from the column cholesterol chol (y-axis) and age leeftijd (x-axis). Add then a regression line to the graph with geom_smooth(method = lm”).

ggplot(chol1, aes(x = leeftijd, y = chol, color = "pink")) + 
  geom_point(size = 2) +
    geom_smooth(method = "lm") +
      labs(title = "Chol1", x = "Leeftijd", y = "Chol")

## `geom_smooth()` using formula 'y ~ x'

3º Fit a linear model for chol with leeftijd using the function lm. The formula for the model is chol˜leeftijd. Save the fitobject under the name fit1. View the result with summary(fit1).

fit1 <- lm(chol ~ leeftijd, data = chol1)

summary(fit1)

## 
## Call:
## lm(formula = chol ~ leeftijd, data = chol1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7045 -0.4643 -0.0691  0.4638  1.9101 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.458374   0.491525   5.002 1.25e-06 ***
## leeftijd    0.061799   0.007901   7.822 3.04e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7033 on 198 degrees of freedom
## Multiple R-squared:  0.2361, Adjusted R-squared:  0.2322 
## F-statistic: 61.19 on 1 and 198 DF,  p-value: 3.035e-13

4º Fit a model for chol with leeftijd, bmi, sekse and alcohol (lm(y ∼ x1 + x2 + x3)). Which factors are statistically significant?

fit2 <- lm(chol ~ leeftijd + bmi + sekse + alcohol, data = chol1)

summary(fit2)

## 
## Call:
## lm(formula = chol ~ leeftijd + bmi + sekse + alcohol, data = chol1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.66843 -0.38463  0.00177  0.33904  1.71594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.18291    0.60862  -0.301 0.764096    
## leeftijd     0.03285    0.00838   3.921 0.000122 ***
## bmi          0.13297    0.01917   6.937 5.76e-11 ***
## sekse        1.00121    0.18916   5.293 3.23e-07 ***
## alcohol      0.34919    0.10873   3.212 0.001544 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6303 on 195 degrees of freedom
## Multiple R-squared:  0.3958, Adjusted R-squared:  0.3834 
## F-statistic: 31.93 on 4 and 195 DF,  p-value: < 2.2e-16

Answer: According to the data, all factors are statistically significant

5º Add the residuals from the model fit2 to the table chol1 and make a histogram from the residuals.

ggplot(chol1, aes(fit2$residuals)) +
  geom_histogram(fill = "green", color = "black") +
    labs(title = "Model Fit2 and Chol1", x = "Residual", y = "Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Dataset: births.csv

1º Importando os dados

births <- read.csv("C:/Users/naian/Documents/UTwente/DataScience/births.csv")
View(births)

2º Recode the variable child_birth into a new variable home where home=’at_home’ if the childbirth was a so-called first line child birth, at home, if not then home=’not_at_home’. Use for this the function mutate, factor and if_else. So finally the variable home should be a factor variable with levels at_home and not_at_home.

births1 <- births %>% 
    rename(home = child_birth) %>% 
      mutate(home = factor(ifelse(home == "first line child birth, at home", "at_home", "not_at_home")))

3º Recode the variable parity in a new variable pari where pari has level primi if it concerns a first childbirth and multi if it is the second or more childbirth. You can do this again with mutate and if_else

births2 <- births1 %>% 
  rename(pari = parity) %>% 
    mutate(pari = factor(ifelse(pari == 1, "primi", "multi")))

View(births2)

4º Recode the variable etnicity into a new variable etni where etni has level Dutch if the woman was Dutch and Not Dutch if she was not Dutch. Hint: use table(etnicity) to look which levels there are in the variable etnicity.

births3 <- births2 %>% 
  rename(etni = etnicity) %>% 
    mutate(etni = factor(ifelse(etni == "Dutch", "Dutch", "Not_Dutch")))

View(births3)

5º Make a logistic regression model with the function glm for the probability of childbirth at home with the variables pari, age_cat (= age categorised), etni and urban (urbanisation degree). View the outcomes from the model with the summary function.

births_glm <- glm(
 home == "at_home" ~ pari + age_cat + etni + urban,
  data = births3
)

summary(births_glm)

## 
## Call:
## glm(formula = home == "at_home" ~ pari + age_cat + etni + urban, 
##     data = births3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6218  -0.3780  -0.2434   0.4460   0.9976  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.523084   0.007876  66.414  < 2e-16 ***
## pariprimi         -0.228245   0.004436 -51.451  < 2e-16 ***
## age_cat> 35 year   0.016378   0.008802   1.861   0.0628 .  
## age_cat25-29 year  0.015326   0.006786   2.258   0.0239 *  
## age_cat30-34 year  0.030903   0.006997   4.417 1.01e-05 ***
## etniNot_Dutch     -0.225722   0.006500 -34.726  < 2e-16 ***
## urbanmoderate      0.011792   0.006154   1.916   0.0554 .  
## urbannot          -0.066752   0.006523 -10.234  < 2e-16 ***
## urbanstrong        0.067840   0.006354  10.677  < 2e-16 ***
## urbanvery strong   0.055364   0.007535   7.347 2.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2181603)
## 
##     Null deviance: 12046  on 49702  degrees of freedom
## Residual deviance: 10841  on 49693  degrees of freedom
## AIC: 65389
## 
## Number of Fisher Scoring iterations: 2

6º Using the function rpart (with option method = class”) from the package rpart a decision tree model can be estimated on the data. Make a decision tree for the probability of childbirth at home with the same variables as in the logistic regression model. View the decision tree with the function rpart.plot from package rpart.plot.

births_tree <- rpart(home ~ pari + age_cat + etni + urban,
                     method = 'class', data = births3)

rpart.plot(births_tree)

7º For assessing which model, the logistic regression model or the decision tree fits better the data we should fit the models on a training set and calculate accuracy statistics on a test set (or use cross validation). For this we use the package caret. The following code will calculate several accuracy measures based on a split-file strategy. Which model fits the data better?

set.seed(100)
mydat <- births3
trainRowNumbers <- createDataPartition(mydat$home, p=0.8, list=FALSE)
trainData <- mydat[trainRowNumbers,]
testData <- mydat[-trainRowNumbers,]
fit_logreg <- train(home ~ pari + age_cat + etni + urban, data = trainData,
                    method="glm", family="binomial")
predicted1 <- predict(fit_logreg, testData)
confusionMatrix(reference = testData$home, data = predicted1,
                mode='everything', positive='at_home')

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    at_home not_at_home
##   at_home        2227        1601
##   not_at_home    1874        4238
##                                           
##                Accuracy : 0.6504          
##                  95% CI : (0.6409, 0.6598)
##     No Information Rate : 0.5874          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2715          
##                                           
##  Mcnemar's Test P-Value : 3.947e-06       
##                                           
##             Sensitivity : 0.5430          
##             Specificity : 0.7258          
##          Pos Pred Value : 0.5818          
##          Neg Pred Value : 0.6934          
##               Precision : 0.5818          
##                  Recall : 0.5430          
##                      F1 : 0.5617          
##              Prevalence : 0.4126          
##          Detection Rate : 0.2240          
##    Detection Prevalence : 0.3851          
##       Balanced Accuracy : 0.6344          
##                                           
##        'Positive' Class : at_home         
##

fit_rpart <- train(home ~ pari + age_cat + etni + urban, data = trainData,
                   method="rpart")
predicted2 <- predict(fit_rpart, testData)
confusionMatrix(reference = testData$home, data = predicted2,
                mode='everything', positive='at_home')

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    at_home not_at_home
##   at_home        2227        1601
##   not_at_home    1874        4238
##                                           
##                Accuracy : 0.6504          
##                  95% CI : (0.6409, 0.6598)
##     No Information Rate : 0.5874          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2715          
##                                           
##  Mcnemar's Test P-Value : 3.947e-06       
##                                           
##             Sensitivity : 0.5430          
##             Specificity : 0.7258          
##          Pos Pred Value : 0.5818          
##          Neg Pred Value : 0.6934          
##               Precision : 0.5818          
##                  Recall : 0.5430          
##                      F1 : 0.5617          
##              Prevalence : 0.4126          
##          Detection Rate : 0.2240          
##    Detection Prevalence : 0.3851          
##       Balanced Accuracy : 0.6344          
##                                           
##        'Positive' Class : at_home         
##

The first: confusion matrix of the logistic regression model. The Second: confusion matrix of the decision tree. According to the confusion matrices, the accuracies of the two models are the same, which means there is no one model fitting the data better.

Data_Mining

Data Mining

Dataset: voorbeeld7_1.sav

Dataset: births.csv