We will analyze the file ‘Childbirths’, which contains data on childbirths in a US city.
It consists of 42 observations and 16 variables (target and predictor variables),
and the model to be used is a logistic model.
For exploratory data analysis, we preprocess the data with Principal Component Analysis (PCA).
PCA is a useful technique for exploratory data analysis,
allowing you to better visualize the variation present in a dataset with many variables.
It is particularly helpful in the case of "wide" datasets, where you have many
variables for each sample.
Check the categorical variables smoker, lowbwt, and mage35; these categorical variables should be
removed before applying PCA.
cb <- read.csv("d:/PCA/childb.csv")
dim(cb)
## [1] 42 16
head(cb)
## ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360 56 4.55 34 44 0 20 0 162 57
## 2 1016 53 4.32 36 40 0 19 0 171 62
## 3 462 58 4.10 39 41 0 35 0 172 58
## 4 1187 53 4.07 38 44 0 20 0 174 68
## 5 553 54 3.94 37 42 0 24 0 175 66
## 6 1636 51 3.93 38 38 0 29 0 165 61
## fage fedyrs fnocig fheight lowbwt mage35
## 1 23 10 35 179 0 0
## 2 19 12 0 183 0 0
## 3 31 16 25 185 0 1
## 4 26 14 25 189 0 0
## 5 30 12 0 184 0 0
## 6 31 16 0 180 0 0
unique(cb$smoker)
## [1] 0 1
unique(cb$lowbwt)
## [1] 0 1
unique(cb$mage35)
## [1] 0 1
cb <- read.csv("d:/PCA/childb.csv")
cb.pca <- prcomp(cb[,c(1:5,7,14)], center = TRUE,scale. = TRUE)  # columns ID, Length, Birthweight, Headcirc, Gestation, mage, fheight; the categorical smoker, lowbwt, mage35 are excluded
summary(cb.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7211 1.1660 0.9620 0.8781 0.74450 0.5048 0.41568
## Proportion of Variance 0.4232 0.1942 0.1322 0.1101 0.07918 0.0364 0.02468
## Cumulative Proportion 0.4232 0.6174 0.7496 0.8597 0.93891 0.9753 1.00000
str(cb.pca)
## List of 5
## $ sdev : num [1:7] 1.721 1.166 0.962 0.878 0.744 ...
## $ rotation: num [1:7, 1:7] -0.0989 -0.5192 -0.5273 -0.436 -0.4855 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:7] 894.07 51.33 3.31 34.6 39.19 ...
## ..- attr(*, "names")= chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## $ scale : Named num [1:7] 467.616 2.936 0.604 2.4 2.643 ...
## ..- attr(*, "names")= chr [1:7] "ID" "Length" "Birthweight" "Headcirc" ...
## $ x : num [1:42, 1:7] -2.72 -1.61 -3.04 -2.64 -1.95 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
To plot the PCA we use library(ggbiplot), which can be installed easily from GitHub using library(devtools).
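The installation and loading chunk is not shown here; a minimal sketch, assuming the usual vqv/ggbiplot GitHub repository, would be:
library(devtools)                 # provides install_github()
install_github("vqv/ggbiplot")    # one-time install of ggbiplot from GitHub
library(ggbiplot)                 # loading it also pulls in ggplot2, plyr, scales, grid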
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
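The biplot discussed next comes from the basic call that the text refers to later:
ggbiplot(cb.pca)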
The axes are seen as arrows originating from the center point. Here, you see that the variables
Headcirc, Birthweight, Length, and Gestation all contribute to PC1, with higher values
in those variables moving the samples to the left on this plot. This lets you see how the
data points relate to the axes, but it's not very informative without knowing which point
corresponds to which sample (cb).
Next, provide an argument to ggbiplot: give it the rownames of cb as labels.
This will label each point with the corresponding row of cb:
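A sketch of that call (ggbiplot accepts a labels argument):
ggbiplot(cb.pca, labels = rownames(cb))   # label each point with its row name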
Now you can see the variables Length, Birthweight, Headcirc, and Gestation grouped together on the
top-left side, which makes sense given the analysis.
How else can you try to better understand your data?
From the PCA chart produced by ggbiplot(cb.pca), we use the important variables Length,
Birthweight, Headcirc, and Gestation to build the model.
First we analyze how the variables Length, Birthweight, Headcirc, and Gestation
affect the variable lowbwt.
Here Length, Birthweight, Headcirc, and Gestation are the predictor variables and
lowbwt is the target.
cb1 <- cb[,c(2:5,15)]
head(cb1)
## Length Birthweight Headcirc Gestation lowbwt
## 1 56 4.55 34 44 0
## 2 53 4.32 36 40 0
## 3 58 4.10 39 41 0
## 4 53 4.07 38 44 0
## 5 54 3.94 37 42 0
## 6 51 3.93 38 38 0
Split the data into training and test sets, with 70% of the data for training and 30% for testing. Use library(caret) to create the sampling, and set.seed(12) to make the random sample reproducible; a sketch of this step is given below.
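The chunk that creates train and test is not shown; a minimal sketch consistent with the description (set.seed(12), 70/30 split via caret) might look like this, where idx is an assumed helper name:
library(caret)                                                 # loads lattice as a dependency
set.seed(12)                                                   # reproducible sampling
idx <- createDataPartition(cb1$lowbwt, p = 0.7, list = FALSE)  # indices for ~70% of the rows
train <- cb1[idx, ]                                            # training set
test  <- cb1[-idx, ]                                           # remaining ~30% held out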
## Loading required package: lattice
log_model <- glm(lowbwt~.,data=train)  # no family argument, so glm() defaults to the gaussian family
summary(log_model)
##
## Call:
## glm(formula = lowbwt ~ ., data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.32097 -0.12826 -0.02667 0.09163 0.54792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.907150 1.469766 1.298 0.2068
## Length -0.005280 0.030363 -0.174 0.8634
## Birthweight -0.294016 0.159287 -1.846 0.0773 .
## Headcirc -0.004883 0.028268 -0.173 0.8643
## Gestation -0.009684 0.026031 -0.372 0.7131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05660298)
##
## Null deviance: 2.6897 on 28 degrees of freedom
## Residual deviance: 1.3585 on 24 degrees of freedom
## AIC: 5.5313
##
## Number of Fisher Scoring iterations: 2
From summary(log_model) above, we can conclude that only the Birthweight variable shows any influence on lowbwt,
and even that effect is not significant, since its Pr(>|t|) of 0.0773 does not satisfy Pr(>|t|) <= 0.05.
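Note that the introduction calls for a logistic model, while the fit above used glm()'s default gaussian family (as the dispersion line in the summary shows). A sketch of the corresponding logistic fit, using the hypothetical name log_model_bin, would pass family = binomial:
log_model_bin <- glm(lowbwt ~ ., data = train, family = binomial)  # logistic regression on the binary target
summary(log_model_bin)                                             # coefficients are on the log-odds scale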
cb <- read.csv("d:/PCA/childb.csv")
cb2 <- cb[,c(2:5,16)]
head(cb2)
## Length Birthweight Headcirc Gestation mage35
## 1 56 4.55 34 44 0
## 2 53 4.32 36 40 0
## 3 58 4.10 39 41 1
## 4 53 4.07 38 44 0
## 5 54 3.94 37 42 0
## 6 51 3.93 38 38 0
#library(caret)
set.seed(21)
spl2 <- sample(nrow(cb2),nrow(cb2)*0.7)
train2 <- cb2[spl2,]
test2 <- cb2[-spl2,]
cb2$mage35 <- as.factor(cb2$mage35)  # note: train2 and test2 were created above, so their mage35 column is still numeric
log_model2 <- glm(mage35~.,data=train2)
summary(log_model2)
##
## Call:
## glm(formula = mage35 ~ ., data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.28844 -0.15120 -0.07003 0.07324 0.77083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.13723 1.52243 -2.718 0.0120 *
## Length 0.07118 0.03007 2.367 0.0263 *
## Birthweight -0.33365 0.16122 -2.069 0.0494 *
## Headcirc 0.02184 0.02931 0.745 0.4635
## Gestation 0.02366 0.03079 0.768 0.4497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.08176392)
##
## Null deviance: 2.6897 on 28 degrees of freedom
## Residual deviance: 1.9623 on 24 degrees of freedom
## AIC: 16.197
##
## Number of Fisher Scoring iterations: 2
1. Use PCA and its visualisations (the PCA biplots) to enhance understanding of the variables used.
2. From summary(log_model2), we can see that the most influential variable is Length, with Pr(>|t|) = 0.0263,
followed by Birthweight, with Pr(>|t|) = 0.0494.