Pacotes Necessários para Análise
library(haven)
## Warning: package 'haven' was built under R version 4.0.5
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.0.5
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
library(Rcpp)
## Warning: package 'Rcpp' was built under R version 4.0.5
1º Import the data from voorbeeld7_1.sav and save the table under the name chol1.
chol1 <- read_sav("C:/Users/naian/Documents/UTwente/DataScience/voorbeeld7_1.sav")
View(chol1)
2º Make a scatterplot, with the function ggplot, from the column cholesterol chol (y-axis) and age leeftijd (x-axis). Add then a regression line to the graph with geom_smooth(method = lm”).
ggplot(chol1, aes(x = leeftijd, y = chol, color = "pink")) +
geom_point(size = 2) +
geom_smooth(method = "lm") +
labs(title = "Chol1", x = "Leeftijd", y = "Chol")
## `geom_smooth()` using formula 'y ~ x'
3º Fit a linear model for chol with leeftijd using the function lm. The formula for the model is chol˜leeftijd. Save the fitobject under the name fit1. View the result with summary(fit1).
fit1 <- lm(chol ~ leeftijd, data = chol1)
summary(fit1)
##
## Call:
## lm(formula = chol ~ leeftijd, data = chol1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7045 -0.4643 -0.0691 0.4638 1.9101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.458374 0.491525 5.002 1.25e-06 ***
## leeftijd 0.061799 0.007901 7.822 3.04e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 198 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2322
## F-statistic: 61.19 on 1 and 198 DF, p-value: 3.035e-13
4º Fit a model for chol with leeftijd, bmi, sekse and alcohol (lm(y ∼ x1 + x2 + x3)). Which factors are statistically significant?
fit2 <- lm(chol ~ leeftijd + bmi + sekse + alcohol, data = chol1)
summary(fit2)
##
## Call:
## lm(formula = chol ~ leeftijd + bmi + sekse + alcohol, data = chol1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.66843 -0.38463 0.00177 0.33904 1.71594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.18291 0.60862 -0.301 0.764096
## leeftijd 0.03285 0.00838 3.921 0.000122 ***
## bmi 0.13297 0.01917 6.937 5.76e-11 ***
## sekse 1.00121 0.18916 5.293 3.23e-07 ***
## alcohol 0.34919 0.10873 3.212 0.001544 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6303 on 195 degrees of freedom
## Multiple R-squared: 0.3958, Adjusted R-squared: 0.3834
## F-statistic: 31.93 on 4 and 195 DF, p-value: < 2.2e-16
Answer: According to the data, all factors are statistically significant
5º Add the residuals from the model fit2 to the table chol1 and make a histogram from the residuals.
ggplot(chol1, aes(fit2$residuals)) +
geom_histogram(fill = "green", color = "black") +
labs(title = "Model Fit2 and Chol1", x = "Residual", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
1º Importando os dados
births <- read.csv("C:/Users/naian/Documents/UTwente/DataScience/births.csv")
View(births)
2º Recode the variable child_birth into a new variable home where home=’at_home’ if the childbirth was a so-called first line child birth, at home, if not then home=’not_at_home’. Use for this the function mutate, factor and if_else. So finally the variable home should be a factor variable with levels at_home and not_at_home.
births1 <- births %>%
rename(home = child_birth) %>%
mutate(home = factor(ifelse(home == "first line child birth, at home", "at_home", "not_at_home")))
3º Recode the variable parity in a new variable pari where pari has level primi if it concerns a first childbirth and multi if it is the second or more childbirth. You can do this again with mutate and if_else
births2 <- births1 %>%
rename(pari = parity) %>%
mutate(pari = factor(ifelse(pari == 1, "primi", "multi")))
View(births2)
4º Recode the variable etnicity into a new variable etni where etni has level Dutch if the woman was Dutch and Not Dutch if she was not Dutch. Hint: use table(etnicity) to look which levels there are in the variable etnicity.
births3 <- births2 %>%
rename(etni = etnicity) %>%
mutate(etni = factor(ifelse(etni == "Dutch", "Dutch", "Not_Dutch")))
View(births3)
5º Make a logistic regression model with the function glm for the probability of childbirth at home with the variables pari, age_cat (= age categorised), etni and urban (urbanisation degree). View the outcomes from the model with the summary function.
births_glm <- glm(
home == "at_home" ~ pari + age_cat + etni + urban,
data = births3
)
summary(births_glm)
##
## Call:
## glm(formula = home == "at_home" ~ pari + age_cat + etni + urban,
## data = births3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6218 -0.3780 -0.2434 0.4460 0.9976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.523084 0.007876 66.414 < 2e-16 ***
## pariprimi -0.228245 0.004436 -51.451 < 2e-16 ***
## age_cat> 35 year 0.016378 0.008802 1.861 0.0628 .
## age_cat25-29 year 0.015326 0.006786 2.258 0.0239 *
## age_cat30-34 year 0.030903 0.006997 4.417 1.01e-05 ***
## etniNot_Dutch -0.225722 0.006500 -34.726 < 2e-16 ***
## urbanmoderate 0.011792 0.006154 1.916 0.0554 .
## urbannot -0.066752 0.006523 -10.234 < 2e-16 ***
## urbanstrong 0.067840 0.006354 10.677 < 2e-16 ***
## urbanvery strong 0.055364 0.007535 7.347 2.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2181603)
##
## Null deviance: 12046 on 49702 degrees of freedom
## Residual deviance: 10841 on 49693 degrees of freedom
## AIC: 65389
##
## Number of Fisher Scoring iterations: 2
6º Using the function rpart (with option method = class”) from the package rpart a decision tree model can be estimated on the data. Make a decision tree for the probability of childbirth at home with the same variables as in the logistic regression model. View the decision tree with the function rpart.plot from package rpart.plot.
births_tree <- rpart(home ~ pari + age_cat + etni + urban,
method = 'class', data = births3)
rpart.plot(births_tree)
7º For assessing which model, the logistic regression model or the decision tree fits better the data we should fit the models on a training set and calculate accuracy statistics on a test set (or use cross validation). For this we use the package caret. The following code will calculate several accuracy measures based on a split-file strategy. Which model fits the data better?
set.seed(100)
mydat <- births3
trainRowNumbers <- createDataPartition(mydat$home, p=0.8, list=FALSE)
trainData <- mydat[trainRowNumbers,]
testData <- mydat[-trainRowNumbers,]
fit_logreg <- train(home ~ pari + age_cat + etni + urban, data = trainData,
method="glm", family="binomial")
predicted1 <- predict(fit_logreg, testData)
confusionMatrix(reference = testData$home, data = predicted1,
mode='everything', positive='at_home')
## Confusion Matrix and Statistics
##
## Reference
## Prediction at_home not_at_home
## at_home 2227 1601
## not_at_home 1874 4238
##
## Accuracy : 0.6504
## 95% CI : (0.6409, 0.6598)
## No Information Rate : 0.5874
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2715
##
## Mcnemar's Test P-Value : 3.947e-06
##
## Sensitivity : 0.5430
## Specificity : 0.7258
## Pos Pred Value : 0.5818
## Neg Pred Value : 0.6934
## Precision : 0.5818
## Recall : 0.5430
## F1 : 0.5617
## Prevalence : 0.4126
## Detection Rate : 0.2240
## Detection Prevalence : 0.3851
## Balanced Accuracy : 0.6344
##
## 'Positive' Class : at_home
##
fit_rpart <- train(home ~ pari + age_cat + etni + urban, data = trainData,
method="rpart")
predicted2 <- predict(fit_rpart, testData)
confusionMatrix(reference = testData$home, data = predicted2,
mode='everything', positive='at_home')
## Confusion Matrix and Statistics
##
## Reference
## Prediction at_home not_at_home
## at_home 2227 1601
## not_at_home 1874 4238
##
## Accuracy : 0.6504
## 95% CI : (0.6409, 0.6598)
## No Information Rate : 0.5874
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2715
##
## Mcnemar's Test P-Value : 3.947e-06
##
## Sensitivity : 0.5430
## Specificity : 0.7258
## Pos Pred Value : 0.5818
## Neg Pred Value : 0.6934
## Precision : 0.5818
## Recall : 0.5430
## F1 : 0.5617
## Prevalence : 0.4126
## Detection Rate : 0.2240
## Detection Prevalence : 0.3851
## Balanced Accuracy : 0.6344
##
## 'Positive' Class : at_home
##
The first: confusion matrix of the logistic regression model. The Second: confusion matrix of the decision tree. According to the confusion matrices, the accuracies of the two models are the same, which means there is no one model fitting the data better.