library(broom)
library(knitr)

Labs

download.file('http://www.openintro.org/stat/data/bdims.RData', destfile = 'bdims.RData')
load(
  "bdims.RData"
)


head(bdims)
str(bdims)
## 'data.frame':    507 obs. of  25 variables:
##  $ bia.di: num  42.9 43.7 40.1 44.3 42.5 43.3 43.5 44.4 43.5 42 ...
##  $ bii.di: num  26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
##  $ bit.di: num  31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
##  $ che.de: num  17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
##  $ che.di: num  28 30.8 31.7 28.2 29.4 31.3 31.7 28.8 27.5 28 ...
##  $ elb.di: num  13.1 14 13.9 13.9 15.2 14 16.1 15.1 14.1 15.6 ...
##  $ wri.di: num  10.4 11.8 10.9 11.2 11.6 11.5 12.5 11.9 11.2 12 ...
##  $ kne.di: num  18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
##  $ ank.di: num  14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
##  $ sho.gi: num  106 110 115 104 108 ...
##  $ che.gi: num  89.5 97 97.5 97 97.5 ...
##  $ wai.gi: num  71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
##  $ nav.gi: num  74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
##  $ hip.gi: num  93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
##  $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
##  $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
##  $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
##  $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
##  $ cal.gi: num  36.5 37.5 37.3 34.8 38.6 36.1 40.3 36.7 35 38.6 ...
##  $ ank.gi: num  23.5 24.5 21.9 23 24.4 23.5 23.6 22.5 22 22.2 ...
##  $ wri.gi: num  16.5 17 16.9 16.6 18 16.9 18.8 18 16.5 16.9 ...
##  $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
##  $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
##  $ hgt   : num  174 175 194 186 187 ...
##  $ sex   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
mdims <- subset(bdims, sex == 1)
fdims <-subset(bdims, sex == 0)

Exercise 1:

hist(mdims$hgt, main = "Male Heights", xlab = "Male heights/(cm)")

hist(fdims$hgt, main = 'Female Heights', xlab = "Female heighs / (cm)")

Both graphs have a general bell-curve shape, although male’s is abit more symetric about the center. female distribution graph also has a shorter range than that of the males.

The Normal Distribution

fhgtmean <-mean(fdims$hgt)
fhgtsd <- sd(fdims$hgt)

Density CUrve

hist(fdims$hgt, probability = TRUE, main = "females heights", xlab = "females heights in cm")
x<-140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = 2)

Exercise 2:

The histogram of female heights does appear to have a normal distribution. Both graphs peak at about the same point in the middle and have their tails symmetric about the center.

Evaluating the Normal Distribution.

qqnorm(fdims$hgt)
qqline(fdims$hgt)

sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)

Exercise 3:

qqnorm(sim_norm)
qqline(sim_norm)

The distribution of the simulated data is normal. Almost all of the points are exactly on the line, with the exception of the tails.

qqnormsim(fdims$hgt)

## Exercize 4:

The Normal Probability plot looks similat to the plots created for the simulated data. most of the points are located on the line, with, yet agin, with slight variation at the tails. The plots do, indeed, provide evicence that the female heights are nearly normal.

Exersize 5:

qqnorm(fdims$wgt)
qqline(fdims$wgt)

hist(fdims$wgt, main = "Females Weight",xlab = "females weight in kgs")

qqnormsim(fdims$wgt)

The distribution of female weights does, indeed, seem to be normal.

Normal Probabilities.

1-pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
sum(fdims$hgt < 169)/length(fdims$hgt
                            )
## [1] 0.7307692

Exercise 6

  1. What is the probability that a randomly chose young adult female is shorter than 169cm?
pnorm(q = 169, fhgtmean, fhgtsd)
## [1] 0.7358822
sum(fdims$hgt < 169)/length(fdims$hgt
                            )
## [1] 0.7307692
  1. What id the probability that a randomly chosen young adult female weights over 55kg?
fwgtmean <- mean(fdims$wgt)
fwgtsd <- sd(fdims$wgt)
1- pnorm(q = 55, fwgtmean, fwgtsd)
## [1] 0.7198584
sum(fdims$wgt > 55) / length(fdims$wgt)
## [1] 0.6923077

Height had a closer agreement btwn the two calculation methods.

Machine Learning Algorithms

str(fdims)
## 'data.frame':    260 obs. of  25 variables:
##  $ bia.di: num  37.6 36.7 34.8 36.6 35.5 37 35.5 37.4 37.8 38.6 ...
##  $ bii.di: num  25 26.4 25.9 27.9 28.2 28 26.5 30.2 29 28.8 ...
##  $ bit.di: num  31.3 31 30.2 31.8 31 32 29.2 33.2 32.6 33.2 ...
##  $ che.de: num  16.2 16.8 16.4 19.3 18.2 15.1 15.4 18.8 18.6 19.7 ...
##  $ che.di: num  24.9 24.5 24.2 24.9 26.2 25.7 24.5 26.6 25 29.4 ...
##  $ elb.di: num  11.2 12.1 11.3 12.3 11.5 12.5 12.3 13.3 12.1 13.4 ...
##  $ wri.di: num  9.2 9.9 8.9 9.5 9.1 10 9.4 10.7 9.8 11.5 ...
##  $ kne.di: num  17 19.3 17 18.6 17.2 17.2 17.2 19.8 17.8 20.9 ...
##  $ ank.di: num  12.3 12.8 12.2 13 12.4 13.2 12 13.8 12.7 13.2 ...
##  $ sho.gi: num  95 99.5 88 97 103.3 ...
##  $ che.gi: num  83 78.5 75 86.5 91 79.5 77 88 85 98.8 ...
##  $ wai.gi: num  66.5 61.5 61.2 78 70.5 66.5 58 74.5 73.5 90.5 ...
##  $ nav.gi: num  79 70.5 66.5 91 80.5 ...
##  $ hip.gi: num  92 90.5 91 99.5 91.5 ...
##  $ thi.gi: num  53.5 57.7 53 61.5 55 54 49.5 64 65.3 61.1 ...
##  $ bic.gi: num  24.3 27.8 24 28 26.9 26.5 24.1 29.2 29 33.6 ...
##  $ for.gi: num  20.5 24 22 24 22.7 22.5 22 26.2 23.4 26.6 ...
##  $ kne.gi: num  32 38.5 32.5 35.2 33 34 32.5 38.5 35.3 37.2 ...
##  $ cal.gi: num  32.2 38.5 32.5 36.7 33.3 35 32 38 37.4 35.8 ...
##  $ ank.gi: num  21 22.5 19 23 19.9 23 19 22 21.6 22.6 ...
##  $ wri.gi: num  13.5 15 14 15 14.5 14.5 13.9 16.8 15.2 16.3 ...
##  $ age   : int  22 20 19 25 21 23 26 22 28 40 ...
##  $ wgt   : num  51.6 59 49.2 63 53.6 59 47.6 69.8 66.8 75.2 ...
##  $ hgt   : num  161 168 160 157 156 ...
##  $ sex   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
summary(fdims)
##      bia.di          bii.di          bit.di          che.de     
##  Min.   :32.40   Min.   :18.70   Min.   :24.70   Min.   :14.30  
##  1st Qu.:35.17   1st Qu.:26.20   1st Qu.:30.00   1st Qu.:16.50  
##  Median :36.40   Median :27.80   Median :31.50   Median :17.50  
##  Mean   :36.50   Mean   :27.58   Mean   :31.46   Mean   :17.72  
##  3rd Qu.:37.80   3rd Qu.:29.20   3rd Qu.:32.90   3rd Qu.:18.70  
##  Max.   :42.60   Max.   :33.30   Max.   :37.80   Max.   :26.80  
##      che.di         elb.di          wri.di           kne.di    
##  Min.   :22.2   Min.   : 9.90   Min.   : 8.100   Min.   :15.7  
##  1st Qu.:24.9   1st Qu.:11.80   1st Qu.: 9.400   1st Qu.:17.3  
##  Median :25.9   Median :12.40   Median : 9.800   Median :18.0  
##  Mean   :26.1   Mean   :12.37   Mean   : 9.874   Mean   :18.1  
##  3rd Qu.:27.1   3rd Qu.:12.90   3rd Qu.:10.400   3rd Qu.:18.7  
##  Max.   :33.2   Max.   :15.00   Max.   :12.200   Max.   :24.3  
##      ank.di          sho.gi          che.gi           wai.gi      
##  Min.   : 9.90   Min.   : 85.9   Min.   : 72.60   Min.   : 57.90  
##  1st Qu.:12.40   1st Qu.: 96.1   1st Qu.: 81.97   1st Qu.: 64.75  
##  Median :13.00   Median : 99.5   Median : 85.50   Median : 68.30  
##  Mean   :13.03   Mean   :100.3   Mean   : 86.06   Mean   : 69.80  
##  3rd Qu.:13.72   3rd Qu.:103.9   3rd Qu.: 89.50   3rd Qu.: 72.75  
##  Max.   :15.50   Max.   :129.5   Max.   :109.00   Max.   :101.50  
##      nav.gi           hip.gi           thi.gi          bic.gi    
##  Min.   : 64.00   Min.   : 78.80   Min.   :46.30   Min.   :22.4  
##  1st Qu.: 76.65   1st Qu.: 90.75   1st Qu.:53.77   1st Qu.:26.4  
##  Median : 82.35   Median : 94.95   Median :56.40   Median :27.8  
##  Mean   : 83.75   Mean   : 95.65   Mean   :57.20   Mean   :28.1  
##  3rd Qu.: 90.03   3rd Qu.: 99.50   3rd Qu.:59.80   3rd Qu.:29.8  
##  Max.   :121.10   Max.   :128.30   Max.   :75.70   Max.   :40.3  
##      for.gi          kne.gi          cal.gi          ank.gi     
##  Min.   :19.60   Min.   :29.00   Min.   :28.40   Min.   :17.40  
##  1st Qu.:22.68   1st Qu.:33.50   1st Qu.:33.10   1st Qu.:20.30  
##  Median :23.60   Median :35.00   Median :34.85   Median :21.10  
##  Mean   :23.76   Mean   :35.26   Mean   :35.01   Mean   :21.21  
##  3rd Qu.:24.70   3rd Qu.:36.80   3rd Qu.:36.60   3rd Qu.:22.00  
##  Max.   :30.80   Max.   :49.00   Max.   :45.40   Max.   :26.00  
##      wri.gi           age             wgt             hgt        sex    
##  Min.   :13.00   Min.   :18.00   Min.   : 42.0   Min.   :147.2   0:260  
##  1st Qu.:14.50   1st Qu.:22.00   1st Qu.: 54.5   1st Qu.:160.0   1:  0  
##  Median :15.00   Median :26.00   Median : 59.0   Median :164.5          
##  Mean   :15.06   Mean   :28.77   Mean   : 60.6   Mean   :164.9          
##  3rd Qu.:15.60   3rd Qu.:34.00   3rd Qu.: 65.6   3rd Qu.:169.5          
##  Max.   :18.20   Max.   :67.00   Max.   :105.2   Max.   :182.9

Normalize

normalize <- function(x){
  (x-mean(x))/(sd(x))
}

SImple Regression.

Before performing a linear Regression we, need to check the bivariate distribution of the data.

library(corrplot)
## corrplot 0.84 loaded
data1<-fdims[,c(18:24)]
kpl <- cor(fdims[,c(18:24)])
corrplot(kpl, method = 'number' )

The correlation plot shows that kne.gi has a very strong postive relation with wgt. hence we further visualize to the bivariate relationship btwn the variables.

pairs(fdims[,c(18,23)],col = 1:10)

# Simple Regression for Kne.gi and Wgt variables.

model1 <- lm(fdims$wgt~fdims$kne.gi)
kable(tidy(model1))
term estimate std.error statistic p.value
(Intercept) -49.808360 4.4602462 -11.16718 0
fdims$kne.gi 3.131275 0.1261604 24.81978 0

Thus the regression model estimeted is ;

wght = -49.808360 + 3.131275*Kne.gi

This means that a unit change of Kne.gi results to an increas of 3.13units of wght when other factors are kept constant.

From the above table, all the parameters are significant since their p-value is less than 0.05.

summary(model1)
## 
## Call:
## lm(formula = fdims$wgt ~ fdims$kne.gi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7457  -3.4034  -0.3893   2.7856  22.6400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -49.8084     4.4602  -11.17   <2e-16 ***
## fdims$kne.gi   3.1313     0.1262   24.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.234 on 258 degrees of freedom
## Multiple R-squared:  0.7048, Adjusted R-squared:  0.7037 
## F-statistic:   616 on 1 and 258 DF,  p-value: < 2.2e-16

The adjusted R-squared value of our model is 0.7037 with an overal p-value of 2.2e-16. This implies that our model of prediction has a good fit from the data.

Logistic Regression.

Logistic regression is done when the dependent variable is binary or dichotomous. We are therefore going to transform Kne.gi variable to 0 or 1 based on the mean of the variable.

MEAN <- mean(fdims$kne.gi)
cols <- names(data1)[-1]
dim(data1)
## [1] 260   7
data1[cols]<-lapply(data1[,-1], normalize)
data1$new_kne.gi <- as.factor(ifelse(fdims$kne.gi < MEAN, 1, 0))
kable(table(data1$new_kne.gi))
Var1 Freq
0 121
1 139
View(data1)
# Logistic Regression
model <- glm(data1$new_kne.gi ~ data1$wgt, family = "binomial")
glance(model)
kable(tidy(model))
term estimate std.error statistic p.value
(Intercept) -0.0330202 0.1742122 -0.1895403 0.8496694
data1$wgt -2.6531544 0.3344659 -7.9325098 0.0000000

This logist regression model is estimated as

new_kne.gi <- -.0330202 - 2.6531544*wgt . This means that a unit change of wgt results in a decrease of 0.0330202 in new_kne.gi. From the above table, the wgt parameter is signinficant since its p-value is less than 0.05.

summary(model)
## 
## Call:
## glm(formula = data1$new_kne.gi ~ data1$wgt, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9832  -0.5751   0.2031   0.6312   3.4308  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.03302    0.17421  -0.190     0.85    
## data1$wgt   -2.65315    0.33447  -7.933 2.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 359.19  on 259  degrees of freedom
## Residual deviance: 214.09  on 258  degrees of freedom
## AIC: 218.09
## 
## Number of Fisher Scoring iterations: 6

Classification Tree

library(rpart)
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.1
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
attach(data1)
rpart1<-rpart(new_kne.gi~wgt, data = data1, method = "class")
fancyRpartPlot(rpart1,main = " Transformed kne.gi Variable ")

Trees…

#tree1<-tree(new_kne.gi~wgt, data = data1)