library(broom)
library(knitr)
download.file('http://www.openintro.org/stat/data/bdims.RData', destfile = 'bdims.RData')
load(
"bdims.RData"
)
head(bdims)
str(bdims)
## 'data.frame': 507 obs. of 25 variables:
## $ bia.di: num 42.9 43.7 40.1 44.3 42.5 43.3 43.5 44.4 43.5 42 ...
## $ bii.di: num 26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
## $ bit.di: num 31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
## $ che.de: num 17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
## $ che.di: num 28 30.8 31.7 28.2 29.4 31.3 31.7 28.8 27.5 28 ...
## $ elb.di: num 13.1 14 13.9 13.9 15.2 14 16.1 15.1 14.1 15.6 ...
## $ wri.di: num 10.4 11.8 10.9 11.2 11.6 11.5 12.5 11.9 11.2 12 ...
## $ kne.di: num 18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
## $ ank.di: num 14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
## $ sho.gi: num 106 110 115 104 108 ...
## $ che.gi: num 89.5 97 97.5 97 97.5 ...
## $ wai.gi: num 71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
## $ nav.gi: num 74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
## $ hip.gi: num 93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
## $ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
## $ bic.gi: num 32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
## $ for.gi: num 26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
## $ kne.gi: num 34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
## $ cal.gi: num 36.5 37.5 37.3 34.8 38.6 36.1 40.3 36.7 35 38.6 ...
## $ ank.gi: num 23.5 24.5 21.9 23 24.4 23.5 23.6 22.5 22 22.2 ...
## $ wri.gi: num 16.5 17 16.9 16.6 18 16.9 18.8 18 16.5 16.9 ...
## $ age : int 21 23 28 23 22 21 26 27 23 21 ...
## $ wgt : num 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
## $ hgt : num 174 175 194 186 187 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
mdims <- subset(bdims, sex == 1)
fdims <-subset(bdims, sex == 0)
hist(mdims$hgt, main = "Male Heights", xlab = "Male heights/(cm)")
hist(fdims$hgt, main = 'Female Heights', xlab = "Female heighs / (cm)")
Both graphs have a general bell-curve shape, although male’s is abit more symetric about the center. female distribution graph also has a shorter range than that of the males.
fhgtmean <-mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)
hist(fdims$hgt, probability = TRUE, main = "females heights", xlab = "females heights in cm")
x<-140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = 2)
The histogram of female heights does appear to have a normal distribution. Both graphs peak at about the same point in the middle and have their tails symmetric about the center.
qqnorm(fdims$hgt)
qqline(fdims$hgt)
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)
The distribution of the simulated data is normal. Almost all of the points are exactly on the line, with the exception of the tails.
qqnormsim(fdims$hgt)
## Exercize 4:
The Normal Probability plot looks similat to the plots created for the simulated data. most of the points are located on the line, with, yet agin, with slight variation at the tails. The plots do, indeed, provide evicence that the female heights are nearly normal.
qqnorm(fdims$wgt)
qqline(fdims$wgt)
hist(fdims$wgt, main = "Females Weight",xlab = "females weight in kgs")
qqnormsim(fdims$wgt)
The distribution of female weights does, indeed, seem to be normal.
1-pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
sum(fdims$hgt < 169)/length(fdims$hgt
)
## [1] 0.7307692
pnorm(q = 169, fhgtmean, fhgtsd)
## [1] 0.7358822
sum(fdims$hgt < 169)/length(fdims$hgt
)
## [1] 0.7307692
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
1- pnorm(q = 55, fwgtmean, fwgtsd)
## [1] 0.7198584
sum(fdims$wgt > 55) / length(fdims$wgt)
## [1] 0.6923077
Height had a closer agreement btwn the two calculation methods.
str(fdims)
## 'data.frame': 260 obs. of 25 variables:
## $ bia.di: num 37.6 36.7 34.8 36.6 35.5 37 35.5 37.4 37.8 38.6 ...
## $ bii.di: num 25 26.4 25.9 27.9 28.2 28 26.5 30.2 29 28.8 ...
## $ bit.di: num 31.3 31 30.2 31.8 31 32 29.2 33.2 32.6 33.2 ...
## $ che.de: num 16.2 16.8 16.4 19.3 18.2 15.1 15.4 18.8 18.6 19.7 ...
## $ che.di: num 24.9 24.5 24.2 24.9 26.2 25.7 24.5 26.6 25 29.4 ...
## $ elb.di: num 11.2 12.1 11.3 12.3 11.5 12.5 12.3 13.3 12.1 13.4 ...
## $ wri.di: num 9.2 9.9 8.9 9.5 9.1 10 9.4 10.7 9.8 11.5 ...
## $ kne.di: num 17 19.3 17 18.6 17.2 17.2 17.2 19.8 17.8 20.9 ...
## $ ank.di: num 12.3 12.8 12.2 13 12.4 13.2 12 13.8 12.7 13.2 ...
## $ sho.gi: num 95 99.5 88 97 103.3 ...
## $ che.gi: num 83 78.5 75 86.5 91 79.5 77 88 85 98.8 ...
## $ wai.gi: num 66.5 61.5 61.2 78 70.5 66.5 58 74.5 73.5 90.5 ...
## $ nav.gi: num 79 70.5 66.5 91 80.5 ...
## $ hip.gi: num 92 90.5 91 99.5 91.5 ...
## $ thi.gi: num 53.5 57.7 53 61.5 55 54 49.5 64 65.3 61.1 ...
## $ bic.gi: num 24.3 27.8 24 28 26.9 26.5 24.1 29.2 29 33.6 ...
## $ for.gi: num 20.5 24 22 24 22.7 22.5 22 26.2 23.4 26.6 ...
## $ kne.gi: num 32 38.5 32.5 35.2 33 34 32.5 38.5 35.3 37.2 ...
## $ cal.gi: num 32.2 38.5 32.5 36.7 33.3 35 32 38 37.4 35.8 ...
## $ ank.gi: num 21 22.5 19 23 19.9 23 19 22 21.6 22.6 ...
## $ wri.gi: num 13.5 15 14 15 14.5 14.5 13.9 16.8 15.2 16.3 ...
## $ age : int 22 20 19 25 21 23 26 22 28 40 ...
## $ wgt : num 51.6 59 49.2 63 53.6 59 47.6 69.8 66.8 75.2 ...
## $ hgt : num 161 168 160 157 156 ...
## $ sex : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
summary(fdims)
## bia.di bii.di bit.di che.de
## Min. :32.40 Min. :18.70 Min. :24.70 Min. :14.30
## 1st Qu.:35.17 1st Qu.:26.20 1st Qu.:30.00 1st Qu.:16.50
## Median :36.40 Median :27.80 Median :31.50 Median :17.50
## Mean :36.50 Mean :27.58 Mean :31.46 Mean :17.72
## 3rd Qu.:37.80 3rd Qu.:29.20 3rd Qu.:32.90 3rd Qu.:18.70
## Max. :42.60 Max. :33.30 Max. :37.80 Max. :26.80
## che.di elb.di wri.di kne.di
## Min. :22.2 Min. : 9.90 Min. : 8.100 Min. :15.7
## 1st Qu.:24.9 1st Qu.:11.80 1st Qu.: 9.400 1st Qu.:17.3
## Median :25.9 Median :12.40 Median : 9.800 Median :18.0
## Mean :26.1 Mean :12.37 Mean : 9.874 Mean :18.1
## 3rd Qu.:27.1 3rd Qu.:12.90 3rd Qu.:10.400 3rd Qu.:18.7
## Max. :33.2 Max. :15.00 Max. :12.200 Max. :24.3
## ank.di sho.gi che.gi wai.gi
## Min. : 9.90 Min. : 85.9 Min. : 72.60 Min. : 57.90
## 1st Qu.:12.40 1st Qu.: 96.1 1st Qu.: 81.97 1st Qu.: 64.75
## Median :13.00 Median : 99.5 Median : 85.50 Median : 68.30
## Mean :13.03 Mean :100.3 Mean : 86.06 Mean : 69.80
## 3rd Qu.:13.72 3rd Qu.:103.9 3rd Qu.: 89.50 3rd Qu.: 72.75
## Max. :15.50 Max. :129.5 Max. :109.00 Max. :101.50
## nav.gi hip.gi thi.gi bic.gi
## Min. : 64.00 Min. : 78.80 Min. :46.30 Min. :22.4
## 1st Qu.: 76.65 1st Qu.: 90.75 1st Qu.:53.77 1st Qu.:26.4
## Median : 82.35 Median : 94.95 Median :56.40 Median :27.8
## Mean : 83.75 Mean : 95.65 Mean :57.20 Mean :28.1
## 3rd Qu.: 90.03 3rd Qu.: 99.50 3rd Qu.:59.80 3rd Qu.:29.8
## Max. :121.10 Max. :128.30 Max. :75.70 Max. :40.3
## for.gi kne.gi cal.gi ank.gi
## Min. :19.60 Min. :29.00 Min. :28.40 Min. :17.40
## 1st Qu.:22.68 1st Qu.:33.50 1st Qu.:33.10 1st Qu.:20.30
## Median :23.60 Median :35.00 Median :34.85 Median :21.10
## Mean :23.76 Mean :35.26 Mean :35.01 Mean :21.21
## 3rd Qu.:24.70 3rd Qu.:36.80 3rd Qu.:36.60 3rd Qu.:22.00
## Max. :30.80 Max. :49.00 Max. :45.40 Max. :26.00
## wri.gi age wgt hgt sex
## Min. :13.00 Min. :18.00 Min. : 42.0 Min. :147.2 0:260
## 1st Qu.:14.50 1st Qu.:22.00 1st Qu.: 54.5 1st Qu.:160.0 1: 0
## Median :15.00 Median :26.00 Median : 59.0 Median :164.5
## Mean :15.06 Mean :28.77 Mean : 60.6 Mean :164.9
## 3rd Qu.:15.60 3rd Qu.:34.00 3rd Qu.: 65.6 3rd Qu.:169.5
## Max. :18.20 Max. :67.00 Max. :105.2 Max. :182.9
Normalize
normalize <- function(x){
(x-mean(x))/(sd(x))
}
Before performing a linear Regression we, need to check the bivariate distribution of the data.
library(corrplot)
## corrplot 0.84 loaded
data1<-fdims[,c(18:24)]
kpl <- cor(fdims[,c(18:24)])
corrplot(kpl, method = 'number' )
The correlation plot shows that kne.gi has a very strong postive relation with wgt. hence we further visualize to the bivariate relationship btwn the variables.
pairs(fdims[,c(18,23)],col = 1:10)
# Simple Regression for Kne.gi and Wgt variables.
model1 <- lm(fdims$wgt~fdims$kne.gi)
kable(tidy(model1))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -49.808360 | 4.4602462 | -11.16718 | 0 |
| fdims$kne.gi | 3.131275 | 0.1261604 | 24.81978 | 0 |
Thus the regression model estimeted is ;
wght = -49.808360 + 3.131275*Kne.gi
This means that a unit change of Kne.gi results to an increas of 3.13units of wght when other factors are kept constant.
From the above table, all the parameters are significant since their p-value is less than 0.05.
summary(model1)
##
## Call:
## lm(formula = fdims$wgt ~ fdims$kne.gi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7457 -3.4034 -0.3893 2.7856 22.6400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.8084 4.4602 -11.17 <2e-16 ***
## fdims$kne.gi 3.1313 0.1262 24.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.234 on 258 degrees of freedom
## Multiple R-squared: 0.7048, Adjusted R-squared: 0.7037
## F-statistic: 616 on 1 and 258 DF, p-value: < 2.2e-16
The adjusted R-squared value of our model is 0.7037 with an overal p-value of 2.2e-16. This implies that our model of prediction has a good fit from the data.
Logistic regression is done when the dependent variable is binary or dichotomous. We are therefore going to transform Kne.gi variable to 0 or 1 based on the mean of the variable.
MEAN <- mean(fdims$kne.gi)
cols <- names(data1)[-1]
dim(data1)
## [1] 260 7
data1[cols]<-lapply(data1[,-1], normalize)
data1$new_kne.gi <- as.factor(ifelse(fdims$kne.gi < MEAN, 1, 0))
kable(table(data1$new_kne.gi))
| Var1 | Freq |
|---|---|
| 0 | 121 |
| 1 | 139 |
View(data1)
# Logistic Regression
model <- glm(data1$new_kne.gi ~ data1$wgt, family = "binomial")
glance(model)
kable(tidy(model))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.0330202 | 0.1742122 | -0.1895403 | 0.8496694 |
| data1$wgt | -2.6531544 | 0.3344659 | -7.9325098 | 0.0000000 |
This logist regression model is estimated as
new_kne.gi <- -.0330202 - 2.6531544*wgt . This means that a unit change of wgt results in a decrease of 0.0330202 in new_kne.gi. From the above table, the wgt parameter is signinficant since its p-value is less than 0.05.
summary(model)
##
## Call:
## glm(formula = data1$new_kne.gi ~ data1$wgt, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9832 -0.5751 0.2031 0.6312 3.4308
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03302 0.17421 -0.190 0.85
## data1$wgt -2.65315 0.33447 -7.933 2.15e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 359.19 on 259 degrees of freedom
## Residual deviance: 214.09 on 258 degrees of freedom
## AIC: 218.09
##
## Number of Fisher Scoring iterations: 6
library(rpart)
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.1
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
attach(data1)
rpart1<-rpart(new_kne.gi~wgt, data = data1, method = "class")
fancyRpartPlot(rpart1,main = " Transformed kne.gi Variable ")
Trees…
#tree1<-tree(new_kne.gi~wgt, data = data1)