We will analyze the file ‘Childbirths’, which contains data on childbirths in a US city.
It consists of 42 observations and 16 variables (target and predictor variables),
and the model to be used is a logistic model.
For exploratory data analysis, we preprocess the data with Principal Component Analysis (PCA).
PCA is a useful technique for exploratory data analysis,
allowing you to better visualize the variation present in a dataset with many variables.
It is particularly helpful in the case of "wide" datasets, where you have many
variables for each sample.
Check the categorical variables smoker, lowbwt, and mage35; these categorical variables should be
removed before applying PCA.
cb <- read.csv("d:/PCA/childb.csv")
dim(cb)
## [1] 42 16
head(cb)
## ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360 56 4.55 34 44 0 20 0 162 57
## 2 1016 53 4.32 36 40 0 19 0 171 62
## 3 462 58 4.10 39 41 0 35 0 172 58
## 4 1187 53 4.07 38 44 0 20 0 174 68
## 5 553 54 3.94 37 42 0 24 0 175 66
## 6 1636 51 3.93 38 38 0 29 0 165 61
## fage fedyrs fnocig fheight lowbwt mage35
## 1 23 10 35 179 0 0
## 2 19 12 0 183 0 0
## 3 31 16 25 185 0 1
## 4 26 14 25 189 0 0
## 5 30 12 0 184 0 0
## 6 31 16 0 180 0 0
unique(cb$smoker)
## [1] 0 1
unique(cb$lowbwt)
## [1] 0 1
unique(cb$mage35)
## [1] 0 1
cb <- read.csv("d:/PCA/childb.csv")
cb.pca <- prcomp(cb[,c(1:5,7,14)], center = TRUE,scale. = TRUE)  # columns ID, Length, Birthweight, Headcirc, Gestation, mage, fheight; the categorical smoker, lowbwt, mage35 are excluded
summary(cb.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7211 1.1660 0.9620 0.8781 0.74450 0.5048 0.41568
## Proportion of Variance 0.4232 0.1942 0.1322 0.1101 0.07918 0.0364 0.02468
## Cumulative Proportion 0.4232 0.6174 0.7496 0.8597 0.93891 0.9753 1.00000
str(cb.pca)
## List of 5
## $ sdev : num [1:7] 1.721 1.166 0.962 0.878 0.744 ...
## $ rotation: num [1:7, 1:7] -0.0989 -0.5192 -0.5273 -0.436 -0.4855 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:7] 894.07 51.33 3.31 34.6 39.19 ...
## ..- attr(*, "names")= chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## $ scale : Named num [1:7] 467.616 2.936 0.604 2.4 2.643 ...
## ..- attr(*, "names")= chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## $ x : num [1:42, 1:7] -2.72 -1.61 -3.04 -2.64 -1.95 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
To plot the PCA we use library(ggbiplot), which can be installed easily from GitHub using library(devtools).
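The installation and loading chunk is not shown here; a minimal sketch, assuming the usual vqv/ggbiplot GitHub repository, would be:
library(devtools)                 # provides install_github()
install_github("vqv/ggbiplot")    # one-time install of ggbiplot from GitHub
library(ggbiplot)                 # loading it also pulls in ggplot2, plyr, scales, grid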
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
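The biplot discussed next comes from the basic call that the text refers to later:
ggbiplot(cb.pca)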
The axes are seen as arrows originating from the center point. Here, you see that the variables
Headcirc, Birthweight, Length, and Gestation all contribute to PC1, with higher values
in those variables moving the samples to the left on this plot. This lets you see how the
data points relate to the axes, but it's not very informative without knowing which point
corresponds to which sample (cb).
Next, provide an argument to ggbiplot: give it the rownames of cb as labels.
This will label each point with the corresponding row of cb:
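A sketch of that call (ggbiplot accepts a labels argument):
ggbiplot(cb.pca, labels = rownames(cb))   # label each point with its row name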
Now you can see the variables Length, Birthweight, Headcirc, and Gestation grouped together on the
top-left side, which makes sense given the analysis.
How else can you try to better understand your data?
From the PCA chart produced by ggbiplot(cb.pca), we use the important variables Length,
Birthweight, Headcirc, and Gestation to build the model.
First we analyze how the variables Length, Birthweight, Headcirc, and Gestation
affect the variable lowbwt.
Here Length, Birthweight, Headcirc, and Gestation are the predictor variables and
lowbwt is the target.
cb1 <- cb[,c(2:5,15)]
head(cb1)
## Length Birthweight Headcirc Gestation lowbwt
## 1 56 4.55 34 44 0
## 2 53 4.32 36 40 0
## 3 58 4.10 39 41 0
## 4 53 4.07 38 44 0
## 5 54 3.94 37 42 0
## 6 51 3.93 38 38 0
Split the data into training and test sets, with 70% of the data for training and 30% for testing. Use library(caret) to create the sampling, and set.seed(12) to make the random sample reproducible; a sketch of this step is given below.
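The chunk that creates train and test is not shown; a minimal sketch consistent with the description (set.seed(12), 70/30 split via caret) might look like this, where idx is an assumed helper name:
library(caret)                                                 # loads lattice as a dependency
set.seed(12)                                                   # reproducible sampling
idx <- createDataPartition(cb1$lowbwt, p = 0.7, list = FALSE)  # indices for ~70% of the rows
train <- cb1[idx, ]                                            # training set
test  <- cb1[-idx, ]                                           # remaining ~30% held out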
## Loading required package: lattice
log_model <- glm(lowbwt~.,data=train)  # no family argument, so glm() defaults to the gaussian family
summary(log_model)
##
## Call:
## glm(formula = lowbwt ~ ., data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.32097 -0.12826 -0.02667 0.09163 0.54792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.907150 1.469766 1.298 0.2068
## Length -0.005280 0.030363 -0.174 0.8634
## Birthweight -0.294016 0.159287 -1.846 0.0773 .
## Headcirc -0.004883 0.028268 -0.173 0.8643
## Gestation -0.009684 0.026031 -0.372 0.7131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05660298)
##
## Null deviance: 2.6897 on 28 degrees of freedom
## Residual deviance: 1.3585 on 24 degrees of freedom
## AIC: 5.5313
##
## Number of Fisher Scoring iterations: 2
From summary(log_model) above, we can conclude that only the Birthweight variable shows any influence on lowbwt,
and even that effect is not significant, since its Pr(>|t|) of 0.0773 does not satisfy Pr(>|t|) <= 0.05.
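Note that the introduction calls for a logistic model, while the fit above used glm()'s default gaussian family (as the dispersion line in the summary shows). A sketch of the corresponding logistic fit, using the hypothetical name log_model_bin, would pass family = binomial:
log_model_bin <- glm(lowbwt ~ ., data = train, family = binomial)  # logistic regression on the binary target
summary(log_model_bin)                                             # coefficients are on the log-odds scale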
cb <- read.csv("d:/PCA/childb.csv")
cb2 <- cb[,c(2:5,16)]
head(cb2)
## Length Birthweight Headcirc Gestation mage35
## 1 56 4.55 34 44 0
## 2 53 4.32 36 40 0
## 3 58 4.10 39 41 1
## 4 53 4.07 38 44 0
## 5 54 3.94 37 42 0
## 6 51 3.93 38 38 0
#library(caret)
set.seed(21)
spl2 <- sample(nrow(cb2),nrow(cb2)*0.7)
train2 <- cb2[spl2,]
test2 <- cb2[-spl2,]
cb2$mage35 <- as.factor(cb2$mage35)  # note: train2 and test2 were created above, so their mage35 column is still numeric
log_model2 <- glm(mage35~.,data=train2)
summary(log_model2)
##
## Call:
## glm(formula = mage35 ~ ., data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.28844 -0.15120 -0.07003 0.07324 0.77083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.13723 1.52243 -2.718 0.0120 *
## Length 0.07118 0.03007 2.367 0.0263 *
## Birthweight -0.33365 0.16122 -2.069 0.0494 *
## Headcirc 0.02184 0.02931 0.745 0.4635
## Gestation 0.02366 0.03079 0.768 0.4497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.08176392)
##
## Null deviance: 2.6897 on 28 degrees of freedom
## Residual deviance: 1.9623 on 24 degrees of freedom
## AIC: 16.197
##
## Number of Fisher Scoring iterations: 2
1. Use PCA and its visualisations (the PCA biplots) to enhance understanding of the variables used.
2. From summary(log_model2), we can see that the most influential variable is Length, with Pr(>|t|) = 0.0263,
followed by Birthweight, with Pr(>|t|) = 0.0494.