Well, don't quote me on this! I'm just brushing up on Stat 101 with Statistics by David Freedman.

I got the data from:

http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html

library(knitr)
library(ggplot2)
library(dplyr)
library(scales)

It has 2 variables, Height [inches] and Weight [kg]. The source's default unit for weight was pounds.

Let’s load the CSV

# data1<-read.csv("Book1.csv")
data1<-read.csv("C:/r data warehouse/height weight/book1.csv")
sum(is.na(data1))
## [1] 0
str(data1)
## 'data.frame':    25000 obs. of  2 variables:
##  $ Height.Inches.: num  65.8 71.5 69.4 68.2 67.8 ...
##  $ Weight.Kg.    : num  51.3 61.9 69.4 64.6 65.5 ...

It's clean & tidy data with 2 variables, Height & Weight, and 25000 obs. I think this will be a good dataset.

Let's make a scatter plot. Let's assume that a person's weight depends on his height, so that taller people weigh more.

Before we start, let's rename the variables so that typing becomes painless.

data1 <- rename(data1, height=Height.Inches., weight=Weight.Kg.)
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_abline(intercept = 0)+
        scale_x_continuous(limits = c(0, 80))+
        scale_y_continuous(limits = c(0,80))

The plot above is a zoomed-out version with a 45-degree line running through it. We'll now see the standard view plotted by R.

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))

The scatter plot has a rugby-ball shape and suggests a positive association between height and weight.

The Correlation coefficient

Now that you have the scatter plot, the next step in summarizing the data is to find the averages of height and weight.

mean.height<-mean(data1$height);mean.height
## [1] 67.99311
mean.weight<-mean(data1$weight);mean.weight
## [1] 57.64221

Now let's plot those means.

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)

The averages of height and weight, plotted as dashed lines, mark the point of averages, which sits at the centre of the cloud.

Now, to describe the spread of the cloud, we'll use the standard deviations of x and y.

sd.height<-sd(data1$height);sd.height
## [1] 1.901679
sd.weight<-sd(data1$weight);sd.weight
## [1] 5.28929
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")

From the above chart we can see that most of the observations fall within 2 SDs above and below the means.
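We can check that directly by counting the fraction of points whose height and weight both lie within 2 SDs of their means (a quick check using the means and SDs computed above; for data shaped like this it should come out around 90% or more):

mean(abs(data1$height - mean.height) < 2 * sd.height &
     abs(data1$weight - mean.weight) < 2 * sd.weight)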

Now, to find the strength of the association between height and weight we need the correlation coefficient - a measure of linear association, or clustering around a line. The closer r (the correlation coefficient) gets to 1, the stronger the linear association between the variables. r falls between -1 and 1.

The SD line

Points in a scatter diagram generally seem to cluster around the SD line. This line goes through the point of averages and through all the points that are an equal number of SDs away from the average for both variables.

Finding SD ratio:

sd.ratio <- sd.weight/sd.height; sd.ratio
## [1] 2.781379

Now let's plot the SD line. We know its slope, and the line must pass through the point of averages.

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_abline(slope = 2.781379, intercept = -131, col="green")
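A small aside: the intercept of -131 above was eyeballed. Since the SD line has slope sd.ratio and must pass through the point of averages, the intercept can be computed instead (sd.line.intercept is just a name introduced here):

sd.line.intercept <- mean.weight - sd.ratio * mean.height
sd.line.intercept   # ~ -131.5, close to the -131 hard-coded above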

Computing the Correlation coefficient

std.height<-(data1$height-mean.height)/sd.height
std.weight<-(data1$weight-mean.weight)/sd.weight

product<-std.height*std.weight
product<-as.data.frame(product)
str(product)
## 'data.frame':    25000 obs. of  1 variable:
##  $ product: num  1.404 1.494 1.645 0.154 -0.159 ...
r<-mean(product$product);r
## [1] 0.5028384

We can also get r with the following command:

cor(data1)
##           height    weight
## height 1.0000000 0.5028585
## weight 0.5028585 1.0000000
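The two values agree to four decimal places but not exactly. The tiny difference is only because mean() divides by n while sd() divides by n - 1; rescaling the hand-computed r closes the gap:

n <- nrow(data1)
r * n / (n - 1)   # ~0.5028585, matching cor()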

Now why does r work?

Let's overlay the products on our plot.

product$x<- data1$height

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_abline(slope = 2.781379, intercept = -131, col="green")+
        geom_point(data=product, aes(x,product))

ggplot(product, aes(x,product))+geom_point(aes(col=(product<0)))

Hmmm… the above plot looks like a mirror image - symmetric. I'm not sure, but this could be due to our r being ~.50. The points are coloured by whether the product is negative or positive. We could also divide the plot into 4 quadrants. We'll leave this for later investigation.

Note that our r is a pure number without units, since x and y were converted to standard units when calculating it.

Our r can also be misleading in the presence of outliers and non-linear association.

For school children shoe size is strongly correlated with reading skills. However, learning new words does not make the feet get bigger. Instead, there is a third factor involved - age. Here age is a confounder.

CORRELATION MEASURES ASSOCIATION. BUT ASSOCIATION IS NOT THE SAME AS CAUSATION.

Regression

The regression method describes how one variable depends on another - how much of an increase in weight [y] is associated with a unit increase in height [x]?

Let's plot weight vs height again:

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_abline(slope = 2.781379, intercept = -131, col="green")+
        scale_x_continuous(breaks = 60:80)+
        scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))

        #geom_smooth()

If we look at the SD line and the observations around it, they are almost symmetrical. This is due to r being ~.50, and this is where r comes in: associated with an increase of one SD in height there is an increase of only ~.50 SDs in weight, on average.

average height + SD of height = 67.99 + 1.90 = 69.89

r x SD of weight = .50 x 5.289 = 2.6445

So the average weight of these men is around

2.6445 kg + 57.64 kg[mean of weight] = 60.2845 kg


average height + 2 SD of height = 67.99 + 2x1.90 = 71.79

r x 2xSD of weight = .50 x 2x5.289 = 5.289

So the average weight of these men is around

5.289 kg + 57.64 kg[mean of weight] = 62.929 kg


average height - 2 SD of height = 67.99 - 2x1.90 = 64.19

r x (-2) x SD of weight = .50 x (-2) x 5.289 = -5.289

So the average weight of these men is around

-5.289 kg + 57.64 kg [mean of weight] = 52.351 kg


Let's save those coordinates in a data frame.

rg.line<-data.frame(x=c(69.89,71.79,64.19), y=c(60.2845,62.929,52.351))
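As a cross-check (a sketch using the exact r instead of the rounded .50), the same three estimates can be computed in one line with the rule "estimate = mean of y + r x SDy x (x - mean of x)/SDx"; the results land within a few hundredths of a kg of the hand calculations (h is just a temporary vector of the three heights):

h <- c(69.89, 71.79, 64.19)
mean.weight + r * sd.weight * (h - mean.height) / sd.height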

We’ll plot those coordinates in our plot.

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_abline(slope = 2.781379, intercept = -131, col="green")+
        scale_x_continuous(breaks = 60:80)+
        scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))+
        geom_segment(data=rg.line, aes(x=rg.line$x[2], y=rg.line$y[2], xend=rg.line$x[3], yend=rg.line$y[3],colour = "segment"))+
        geom_point(data=rg.line, aes(x,y, fill='grey', shape=23, size=3))+
         scale_shape_identity()+
        geom_smooth(method="lm")

So, as you can see, those 3 points have been plotted. The straight line passing through them is our regression line. The regression line goes through the point of averages.

All the points of the form (height, estimated average weight at that height) fall on the regression line.

THE REGRESSION LINE FOR Y ON X ESTIMATES THE AVERAGE VALUE OF Y CORRESPONDING TO EACH VALUE OF X.

ASSOCIATED WITH EACH INCREASE OF ONE SD IN X THERE IS AN INCREASE OF ONLY r X SDs IN Y, ON AVERAGE.
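A useful consequence: the regression line has slope r x SDy/SDx and passes through the point of averages, so its coefficients can be recovered from the summary statistics alone. A quick check against lm() (rg.slope and rg.intercept are names introduced here; expect small rounding differences):

rg.slope <- r * sd.weight / sd.height                  # ~1.40
rg.intercept <- mean.weight - rg.slope * mean.height   # ~ -37.5
c(rg.intercept, rg.slope)
coef(lm(weight ~ height, data = data1))                # should agree closely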

IN VIRTUALLY ALL TEST-RETEST SITUATIONS, THE BOTTOM GROUP ON THE FIRST TEST WILL ON AVERAGE SHOW SOME IMPROVEMENT ON THE SECOND TEST, AND THE TOP GROUP WILL ON AVERAGE FALL BACK. THIS IS THE REGRESSION EFFECT. THINKING THAT THE REGRESSION EFFECT MUST BE DUE TO SOMETHING IMPORTANT, NOT JUST THE SPREAD AROUND THE LINE, IS CALLED THE REGRESSION FALLACY.


Let's say a guy is 3.5 inches taller than average; what is the average weight for men of his height, using r?

How many SDs of height is 3.5 inches?

3.5/1.901 ≈ 1.84 SDs

So, in standard units, the estimate is r x 1.84 = .50 x 1.84 = 0.92 SDs of weight, i.e. 0.92 x 5.289 ≈ 4.87 kg

so the average weight is = average of weight + 4.87 = 57.64 + 4.87 ≈ 62.5 kg

If we look at our regression line, a height of ~71.5 corresponds to roughly 62.5 kg

#data1%>% group_by(height) %>% summarise(M=mean(weight))
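Checking the same example with the exact summary statistics (the r rule again, nothing new assumed):

mean.weight + r * sd.weight * 3.5 / sd.height   # ~62.5 kg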

R.M.S. (root-mean-square error) Error for Regression

ERROR = ACTUAL VALUE - PREDICTED VALUE, i.e. e = y - ŷ

Another word for the errors is residuals.

The prediction error for a point is its vertical distance from the regression line: how far the actual value sits above or below the line.

The overall size of the errors is measured by the RMS error. The RMS error of the regression line for predicting weight from height is

\[\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2}\]

where each e, i.e. y - ŷ, is an observed value minus the predicted value. The same quantity can also be computed as

\[\sqrt{1-r^2}\cdot SD_y\]

The rms error for regression says how far typical points are above or below the regression line.

lm.model<-lm(weight~height, data=data1)
lm.model
## 
## Call:
## lm(formula = weight ~ height, data = data1)
## 
## Coefficients:
## (Intercept)       height  
##     -37.456        1.399
names(lm.model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
lm.model.sm<-summary(lm.model)
names(lm.model.sm)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
str(lm.model)
## List of 12
##  $ coefficients : Named num [1:2] -37.5 1.4
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
##  $ residuals    : Named num [1:25000] -3.299 -0.659 9.804 6.607 8.097 ...
##   ..- attr(*, "names")= chr [1:25000] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:25000] -9114.03 -420.54 9.83 6.63 8.12 ...
##   ..- attr(*, "names")= chr [1:25000] "(Intercept)" "height" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:25000] 54.6 62.6 59.6 58 57.4 ...
##   ..- attr(*, "names")= chr [1:25000] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:25000, 1:2] -1.58e+02 6.32e-03 6.32e-03 6.32e-03 6.32e-03 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:25000] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "height"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.01 1.01
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 24998
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = weight ~ height, data = data1)
##  $ terms        :Classes 'terms', 'formula' length 3 weight ~ height
##   .. ..- attr(*, "variables")= language list(weight, height)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "weight" "height"
##   .. .. .. ..$ : chr "height"
##   .. ..- attr(*, "term.labels")= chr "height"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(weight, height)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
##  $ model        :'data.frame':   25000 obs. of  2 variables:
##   ..$ weight: num [1:25000] 51.3 61.9 69.4 64.6 65.5 ...
##   ..$ height: num [1:25000] 65.8 71.5 69.4 68.2 67.8 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 weight ~ height
##   .. .. ..- attr(*, "variables")= language list(weight, height)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "weight" "height"
##   .. .. .. .. ..$ : chr "height"
##   .. .. ..- attr(*, "term.labels")= chr "height"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(weight, height)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
##  - attr(*, "class")= chr "lm"
#lm.model$model
#plot(lm.model)

residuals<- data.frame(data1$weight-lm.model$fitted.values)
residuals1<- data.frame(lm.model$residuals)
mean(residuals$data1.weight...lm.model.fitted.values)
## [1] -2.931794e-17
mean(lm.model$residuals)
## [1] -2.562205e-17
nrow(data1)
## [1] 25000
rms<- sqrt(sum((lm.model$residuals)^2)/nrow(data1))
rms
## [1] 4.571805
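This agrees with the sqrt(1 - r^2) x SDy formula quoted earlier - a quick check with the r and sd.weight computed above:

sqrt(1 - r^2) * sd.weight   # ~4.57, essentially the same as rms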

So, to get the residuals in R it's lm.model$residuals - these are our errors.

Let's plot the residuals vs height.

Mathematically, the residuals from the regression line must average out to 0.

However, the locally weighted (loess) smooth below is not perfectly horizontal.

ggplot(data=lm.model, aes(x=height, y=residuals))+geom_jitter(aes(alpha=0.5))+
        geom_smooth()
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.

Now, lets plot regression line in residual plot. The residual from the regression line must average out to 0. Therefore, the regression line for the residual must be horizontal.

ggplot(data=lm.model, aes(x=height, y=residuals))+geom_jitter(aes(alpha=0.5))+
        geom_smooth(method="lm")
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
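Incidentally, the "Don't know how to automatically pick scale" message seems to appear because residuals in the aes() call picks up the data frame we created earlier rather than a column of the model. A minimal sketch that sidesteps it by building an explicit data frame (resid.df is just a throwaway name):

resid.df <- data.frame(height = data1$height, resid = lm.model$residuals)
ggplot(resid.df, aes(height, resid)) +
        geom_point(alpha = 0.2) +
        geom_smooth(method = "lm")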

When all the vertical strips in a scatter diagram show similar amounts of spread it is HOMOSCEDASTIC.

ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
        geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
        geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
        geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
        geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
        geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
        geom_abline(slope = 2.781379, intercept = -131, col="green")+
        scale_x_continuous(breaks = 60:80)+
        scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))+
        geom_segment(data=rg.line, aes(x=rg.line$x[2], y=rg.line$y[2], xend=rg.line$x[3], yend=rg.line$y[3],colour = "segment"))+
        geom_point(data=rg.line, aes(x,y, fill='grey', shape=23, size=3))+
         scale_shape_identity()+
        geom_smooth(method="lm")

RMS error for the regression line of y on x can be figured by

\[\sqrt{1-r^2}\cdot SD_y\]

Let's split data1 into training and test sets.

smp_size <- floor(0.75 * nrow(data1))
set.seed(123)
train_data1 <- sample(seq_len(nrow(data1)), size = smp_size)

train<- data1[train_data1,]
test<- data1[-train_data1,]

lm.train<- lm(weight~height, train)
predict1<-predict(lm.train, test)
pp<-as.data.frame(predict1)


rmse1<- sqrt(sum((test$weight-pp$predict1)^2)/nrow(test))
rmse1
## [1] 4.521159
sd.test.weight<-sd(test$weight)
cor.test<-cor(test)
cor.test<-cor.test[1,2]

rmse.test <- sqrt(1-(cor.test)^2)*sd.test.weight
rms-rmse1
## [1] 0.05064625

Very close - the training RMS and the test RMSE differ by only about 0.05 kg, and the sqrt(1 - r^2) x SDy estimate on the test set (rmse.test) is in the same ballpark.

#plot(data1$height,lm.model$residuals)
#plot(lm(lm.model$residuals~data1$height))

Probability

We'll look closely at the frequency theory of probability.

THE CHANCE OF SOMETHING GIVES THE PERCENTAGE OF TIME IT IS EXPECTED TO HAPPEN WHEN THE BASIC PROCESS IS DONE OVER AND OVER AGAIN, INDEPENDENTLY AND UNDER THE SAME CONDITIONS.

When you draw at random, all the tickets in the box have the same chance to be picked.

set.seed(1)
sample(1:6, 6, replace = T)
## [1] 2 3 4 6 2 6
#sample(s1, 6, replace=F)

The above example draws from the box with replacement; the commented-out line shows the without-replacement version.

Consider a box with tickets R, G, G, drawn with replacement.

The probability of picking R from the box is 1/3 and the probability of picking G is 2/3.

Two tickets are drawn at random without replacement from a box containing 1, 2, 3, 4.

What is the chance that the second ticket is 4?

Ans. 1/4

what is the chance that the second ticket is 4 given the first is 2?

Ans. 1/3

Another way to put the above question is P(2nd ticket is 4 | 1st ticket is 2) - the conditional probability that the second ticket is 4 given that the first was 2.
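Both answers are easy to check with a rough Monte Carlo simulation (a quick sketch; draws is a throwaway name):

set.seed(1)
draws <- replicate(1e5, sample(1:4, 2))      # two tickets without replacement, many times
mean(draws[2, ] == 4)                         # ~ 1/4
mean(draws[2, draws[1, ] == 2] == 4)          # ~ 1/3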

Multiplication rule

The chance that two things will both happen equals the chance that the first will happen, multiplied by the chance that the second will happen given the first has happened.

Q. Two cards will be dealt off the top of a well-shuffled deck. What is the chance that the first will be the seven of clubs and the second the queen of hearts?

Ans. 1/52 x 1/51 = 1/2652

Two things are independent if the chances for the second given the first are the same no matter how the first one turns out. Otherwise, the two are dependent.

When drawing at random with replacement the draws are independent. Without replacement, the draws are dependent.

Q. Two dice are thrown. What is the chance of getting a total of 2 spots, i.e. a single spot on each die?

Ans. There are 6 x 6 = 36 possible ways for the dice to fall.

Only one of those 36 ways gives a total of 2 spots, so the chance is 1/36.
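A quick simulation check of the 1/36 (a sketch):

set.seed(1)
mean(replicate(1e5, sum(sample(1:6, 2, replace = TRUE)) == 2))   # ~ 0.028, i.e. ~1/36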

Addition Rule

To find the chance that at least one of two things will happen, check to see if they are mutually exclusive. If they are, add the chances.

Q. The chance that the top card is the ace of spades or the bottom card is the ace of spades equals:

Ans. 1/52 + 1/52

Binomial Formula

Number of ways to arrange 4 R's and 1 G:

5!/(4! x 1!)

The numerator is the factorial of the total count, i.e. 4 + 1 = 5; the 4 in the denominator is for the R's and the 1 for the G.

Here 0! = 1

The binomial coefficient is

\[\frac{n!}{k!(n-k)!}\]

where n is the number of trials and k is the number of times the event occurs.

The binomial formula:

\[\frac{n!}{k!(n-k)!}\cdot p^k\cdot (1-p)^{n-k}\]

where p is the probability that the event will occur on any particular trial.

Q. A die is rolled 10 times. What is the chance of getting exactly 2 aces?

A. n = 10; k = 2; n - k = 8; p = 1/6, so 1 - p = 5/6.

Therefore,

\[\frac{10!}{2!(10-2)!}\cdot\left(\frac{1}{6}\right)^2\cdot\left(\frac{5}{6}\right)^{10-2}\]
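R can evaluate this directly, either from the formula or with dbinom() (a quick check):

choose(10, 2) * (1/6)^2 * (5/6)^8   # ~0.29
dbinom(2, size = 10, prob = 1/6)    # the same number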

Law of averages

Let's toss a coin 10,000 times and see how many H and T we get.

The assumption that in the long run the number of H and the number of T even out exactly is FALSE.

set.seed(2)
coin.toss<-sample(x=c("H","T"), 10000, replace=TRUE, prob = c(.5,.5))
table(coin.toss)
## coin.toss
##    H    T 
## 5061 4939
coin.toss<-as.data.frame(coin.toss)
coin.toss<- coin.toss %>% mutate(Index=c(1:10000))

#ggplot(coin.toss, aes(as.factor(coin.toss)))+geom_bar(width=.5)
set.seed(2)
coin.toss2<-sample(x=c(1,0), 10000, replace=TRUE, prob = c(.5,.5))
table(coin.toss2)
## coin.toss2
##    0    1 
## 4939 5061
coin.toss2<-as.data.frame(coin.toss2)
coin.toss2<- coin.toss2 %>% mutate(Index=c(1:10000))

coin.toss2 <- coin.toss2 %>%
        mutate(no.toss.div.2=Index/2, Cumul.prob=cumsum(coin.toss2), Difference=Cumul.prob-no.toss.div.2)

coin.toss2.Headonly<- coin.toss2 %>% filter(coin.toss2>0) %>% select(coin.toss2,Index,Cumul.prob,Difference)

coin.toss2.Headonly<- coin.toss2.Headonly%>% 
        mutate(Diff.div.Index=(Difference/Index)*100)


ggplot(coin.toss2.Headonly, aes(Index, Difference))+geom_line()+
        labs(x="Coin toss", y="Chance of error")

# df.coin2<- as.data.frame(coin.toss2.Headonly)
# 
ggplot(coin.toss2.Headonly, aes(coin.toss2,Index))+geom_bar(stat = 'identity')

#barplot(table(coin.toss2.Headonly$Cumul.prob), density = 100, main = "2 Dice Sum, 10000 Rolls")
# x22<-table(coin.toss2.Headonly$Cumul.prob)
# hist(x22$)
# #ggplot(coin.toss, aes(as.factor(coin.toss)))+geom_bar(width=.5)

In 10,000 tosses you expect to get 5000 H, right?

Not exactly. 5000 is only the expected number; in reality you could get 5001, 4998, 5007… The amount off from 5000 is what we call the chance error.

chance error = number of H - half the number of tosses [the expected number]

Therefore, for our 10,000 tosses:

the number of H is 5061 and the total number of tosses is 10,000, so half is 5000,

and 5061 - 5000 = 61 is our chance error.

The plot above shows that as the number of tosses increases, the difference between the number of heads and the expected number is likely to be quite large in absolute terms.

However, if we compare the chance error to the number of tosses, the difference is quite small. At the 10,000th toss the chance error was 61, and 61 divided by 10,000 tosses is

61/10000
## [1] 0.0061

which is very small.

So: number of HEADS = half the no. of tosses + chance error

e.g. at the 9,998th toss: no. of H = 4999 + 61 = 5060

Let's plot the chance error relative to the number of tosses, i.e. chance error / tosses.

ggplot(coin.toss2.Headonly, aes(Index,Diff.div.Index))+geom_line()

CHANCE PROCESS:

Two main ideas -

  1. Find an analogy between the process being studied (sampling voters in the poll example) and drawing numbers at random from a box.

  2. Connect the variability you want to know about (e.g. in the estimate for the Democratic vote) with the chance variability in the sum of the numbers drawn from the box.

The analogy between a chance process and drawing from a box is called a box model. More complicated processes can be dealt with through the same analogy.

THE SUM OF DRAWS:

Let's say we have a box of tickets numbered 1, 2, 3, 4, 5, 6.

Imagine drawing at random from the box with replacement.

Imagine taking 25 draws from the box.

set.seed(2)
sample.tick<-c(1:6)
s.1<-sample(sample.tick, 25, replace = T)
s.1
##  [1] 2 5 4 2 6 6 1 6 3 4 4 2 5 2 3 6 6 2 3 1 4 3 6 1 3

Sum of s.1 is

sum(s.1)
## [1] 90

So our first set of 25 draws, added up, came to 90. Let's repeat this process 10 times.

set.seed(2)
s.2<-replicate(10, sum(sample(sample.tick, 25, replace = T)))
s.2
##  [1]  90  88  84  85  90  84  86  77  90 102

So drawing 25 tickets from the box of 1:6, adding them up, and repeating this process 10 times gives 10 different results. The chance variability is easy to see. The sum can be as small as 1 x 25 = 25 and as large as 6 x 25 = 150.

The expected value and standard error

Suppose the box holds 4 tickets and 100 draws are made with replacement. What is the expected value of the sum, if the tickets are

1, 1, 1, 5 ?

There are four tickets in the box, so the 5 should come up about 1/4 of the time and a 1 about 3/4 of the time. Over 100 draws the expected sum is 100 x 1/4 x 5 + 100 x 3/4 x 1 = 125 + 75 = 200.

FORMULA: the expected value for the “sum” of draws made at random with replacement from the box equals

(number of draws) x (average of box)

100 x [(1+1+1+5)/4] = 100 x 2 = 200

AGAIN, NOTE THAT THE FORMULA IS FOR THE SUM OF THE DRAWS, NOT THE AVERAGE.

Each ticket won't come up exactly in proportion to its probability across the draws, so the “SUM” will be off the expected value by a chance error:

SUM = expected value + chance error

\[SE = \sqrt{\text{number of draws}}\cdot(\text{SD of box})\]

A SUM IS LIKELY TO BE AROUND ITS EXPECTED VALUE, BUT TO BE OFF BY A CHANCE ERROR SIMILAR TO THE STANDARD ERROR.
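Putting the two formulas together for the 1, 1, 1, 5 box (a quick sketch; note the book's "SD of the box" divides by N rather than N - 1, so I compute it by hand instead of with sd()):

box <- c(1, 1, 1, 5)
n.draws <- 100
n.draws * mean(box)                          # expected value of the sum: 200
sd.box <- sqrt(mean((box - mean(box))^2))    # SD of the box, ~1.73 (sd(box) would give 2)
sqrt(n.draws) * sd.box                       # SE of the sum, ~17.3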

So let's go back to our example:

We have 1:6 in the box.

The expected value of the SUM of the draws = (number of draws) x (average of box)

mean(c(1:6))
## [1] 3.5

therefore, 25 * 3.5 = 87.5

25 * 3.5
## [1] 87.5

Now let's calculate its SE.

\[SE = \sqrt{\text{number of draws}}\cdot(\text{SD of box})\]

sd(1:6)
## [1] 1.870829
sqrt(25)*sd(1:6)
## [1] 9.354143

Therefore SE = 9.35

So the sum is likely to be around the expected value of 87.5, give or take about 9.35.

Now let's repeat the process 10 times.

set.seed(2)
s.3<-replicate(10, sum(sample(sample.tick, 25, replace = T)))
s.3
##  [1]  90  88  84  85  90  84  86  77  90 102

We know sum = expected value + chance error.

So for the first one, i.e. 90, we have 90 - 87.5 = 2.5 as the chance error, which is less than one SE.

s.3-87.5
##  [1]   2.5   0.5  -3.5  -2.5   2.5  -3.5  -1.5 -10.5   2.5  14.5

So we can see that 2 out of the 10 chance errors were larger than 1 SE (9.35) in absolute value; the rest were smaller.

x.1<-c(0,1)
sd(x.1)
## [1] 0.7071068
mean(c(2,2))
## [1] 2
sqrt(10000)
## [1] 100

Probability Histogram

A probability histogram represents chance, not data.

Let's take a pair of dice and throw them; the outcome we record is the sum of the two dice.

set.seed(2)
pair.dice<- c(1:6,1:6)
sum(pair.dice)
## [1] 42
rolls.100<-replicate(100, sum(sample(pair.dice, 2, replace = T)))
hist(rolls.100)

rolls.10000<-replicate(1000000, sum(sample(pair.dice, 2, replace = T)))
#hist(rolls.10000, probability = T)

#rolls.10000 <-replicate(10000, sum(sample(pair.dice, 2, replace = T, prob = table(outer(1:6,1:6,"+")) / 36)))
#hist(rolls.1000000)
rolls.10000  <- as.data.frame(rolls.10000 )
ggplot(rolls.10000 , aes(rolls.10000 ))+stat_bin(bins = 2)

(RStudio tip: Ctrl+Shift+C adds/removes # comment markers on the selected lines.)

So what we are doing below is tossing a coin 1000 times, counting the number of heads, and repeating that process 10,000 times.

The plot shows the simulated counts of heads clustering around the expected value; the differences from the expected value are due to chance variability.

Chance error = observed value - expected value

set.seed(2)
results <- list()
 for(i in 1:10000) {
     coinTosses   <- cumsum(sample(c(0,1), 1000, replace = TRUE, prob = c(.5,.5))) 
     results[[i]] <- coinTosses[length(coinTosses)]
    
 }

no.heads<-as.data.frame(unlist(results))
m.1<-mean(no.heads$`unlist(results)`)
s.1<-sd(no.heads$`unlist(results)`)

# hist(unlist(results), main = "No of Heads",col = "lightblue", breaks = 100)

# ggplot(no.heads, aes(unlist(results)))+stat_bin(bins = 50)


ggplot(no.heads, aes(unlist(results)))+geom_histogram(aes(y=..density..),binwidth=2)+
        stat_function(fun=dnorm, args=list(mean=m.1, sd=s.1))+
        geom_vline(aes(xintercept=mean(no.heads$`unlist(results)`)))+
        scale_x_continuous(breaks = seq(440,560, length.out = 9))
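For what it's worth, the same simulation can be written more compactly with replicate() (an equivalent sketch; it won't reproduce the exact counts above unless the seed and the sampling call match exactly):

set.seed(2)
heads.10000 <- replicate(10000, sum(sample(c(0, 1), 1000, replace = TRUE)))
hist(heads.10000, breaks = 50)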

sd(c(0,2,3,4,6))
## [1] 2.236068
mean((c(0,2,3,4,6)))
## [1] 3

Note on SD difference between R and hand

See the Wikipedia article on standard deviation for the discussion of estimating standard deviations. The ‘by hand’ box formula divides by N, while R's sd() divides by N − 1 (the unbiased-variance version), so multiplying R's value by sqrt((N − 1)/N) recovers the by-hand number. Here is a key quote:

The term standard deviation of the sample is used for the uncorrected estimator (using N), while the term sample standard deviation is used for the corrected estimator (using N − 1). The denominator N − 1 is the number of degrees of freedom in the vector of residuals.

Now, for the coin toss we have a 0-1 box; sd(c(0,1)) is .7071068, and the SE for 1000 tosses is 22.36068. Our expected number of heads in 1000 tosses is the number of tosses x the mean of the 0-1 box:

1000 x .5 = 500 H

To find the chance of getting exactly 500 heads, we use the Z conversion, i.e. converting the number of heads into standard units.

SE.1<-sd(c(0,1)) *sqrt(1000)
SE.1
## [1] 22.36068
sd(c(0,1)) 
## [1] 0.7071068
mean(c(0,1)) 
## [1] 0.5
# no.heads <- no.heads %>% rename(H=`unlist(results)`)
# 
#no.heads <- no.heads %>% mutate(Z=(H-500)/SE.1)

# no.heads <- round(no.heads$Z)

To find the chance of getting exactly 500 heads, we convert the number of heads into standard units (the commented-out code above would redraw the histogram in z units).

z = (observed - expected) / SE

So, to find the chance of exactly 500 heads, we take the area between 499.5 and 500.5 (the block of the histogram for 500), converted to standard units:

(499.5-500)/22.36 = -0.02236

(500.5-500)/22.36= 0.02236

The area under the normal curve (mean 0, SD 1) to the left of z = 0.02236 is 0.5089196, and to the left of z = -0.02236 it is 0.4910804.

We want the area between them; subtracting gives about 1.78%.

l.1<-pnorm(q=0.02236, mean=0, sd=1 ,lower.tail = T )
l.1
## [1] 0.5089196
(1-(1-l.1)*2)*100
## [1] 1.783921
l.2<-pnorm(q=-0.02236, mean=0, sd=1 ,lower.tail = T )
l.2
## [1] 0.4910804
l.1-l.2
## [1] 0.01783921
qnorm(.68, mean=500 ,sd=23)
## [1] 510.7571
#ggplot(no.heads, aes(Z))+geom_histogram(aes(y=..density..),binwidth =.5)+
        #stat_function(fun = dnorm, args=list(mean=mean(no.heads$Z), sd=sd(no.heads$Z)))
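One caveat, tying back to the SD note above: the book's "SD of the box" is the population SD, which for a box that is half 0s and half 1s is 0.5 rather than sd(c(0,1)) = 0.707. Redoing the calculation with that SD (a sketch) gives a noticeably larger answer, about 2.5%, which also matches the exact binomial chance:

se.book <- sqrt(1000) * 0.5                      # ~15.8
pnorm(0.5 / se.book) - pnorm(-0.5 / se.book)     # ~0.025
dbinom(500, size = 1000, prob = 0.5)             # exact chance of 500 heads, also ~0.025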

A coin tossed 100 times

set.seed(2)
results.2 <- list()
 for(i in 1:100) {
     coinTosses.1   <- cumsum(sample(c(0,1), 100, replace = TRUE, prob = c(.5,.5))) 
     results.2[[i]] <- coinTosses.1[length(coinTosses.1)]
    
 }


no.heads.1<-as.data.frame(unlist(results.2))

colnames(no.heads.1) <- "H"

hist(no.heads.1$H)

pnorm(q=0.0707, mean=0, sd=1)
## [1] 0.5281817

Remember: the standard error is sqrt(number of draws) * SD [it is the number of draws, not the number of repetitions/replications].

set.seed(2)
hist(replicate(10000, sum(sample(c(1,2,9), 500,replace=T))), breaks = 50)

# replicate(100, sum(sample(c(1,2,9), 100,replace=T)))

sum(sample(c(1,2,9), 500,replace=T))
## [1] 1992
expected.value<- 500*mean(sample(c(1,2,9)))
expected.value
## [1] 2000
sd.value<- sd(c(1,2,9))
sd.value
## [1] 4.358899
std.error<- sqrt(500)*sd.value
std.error
## [1] 97.46794

When the number of repetitions is large (as with the coin tosses, where the number of draws per repetition is fixed), the empirical histogram will be close to the probability histogram.

When the number of draws is large, the probability histogram for the sum will be close to the normal curve.

Consequently, when the number of repetitions and the number of draws are both large, the empirical histogram for the sums will be close to the normal curve.

Simple random sampling means drawing at random without replacement.

male<- rep(x=1, 3091)
female<- rep(x=0, 3581)

socio <- c(male, female)
sociao <- as.data.frame(socio)

mean(sociao$socio)
## [1] 0.4632794
sd(sociao$socio)
## [1] 0.4986871
# health.study<-rbind(male, female)

A politician wants to enter a primary in a district with 100,000 eligible voters, but only if he has a good chance of winning.

He hires a survey organization, which takes a simple random sample of 2,500 voters.

Our box model: one ticket per voter, marked 1 for a yes vote and 0 for a no vote; 2,500 tickets are drawn.

However, we don't know the fraction of the 100,000-voter population who would vote yes or no.

We can use the bootstrap method: the fraction in the sample should be close to the fraction in the box, i.e. the population.

Say we found 1,328 voted yes and 2500 - 1328 = 1,172 voted no. The fractions:

1328/2500
## [1] 0.5312
(2500-1328)/2500
## [1] 0.4688

so 53.12 % voted yes and 46.88 % voted No.

The number of yes votes in the sample equals the sample size x the sample mean:

2500 * (1328/2500)
## [1] 1328

The SD of the box, estimated with the shortcut for a 0-1 box, will be

sqrt(.5312 * .4688)
## [1] 0.4990256

That uses the shortcut sqrt((fraction of 1s) x (fraction of 0s)). Otherwise you can do it the long way:

xx.12<- rep(1, 1328)
xx.13<- rep(0, 1172)

xx.14<- c(xx.12, xx.13)
sd(xx.14)
## [1] 0.4991254

So our SE is

sqrt(2500) * sd(xx.14)
## [1] 24.95627

So what this is saying is that our estimate of

1328/2500 = 53.12 %

is likely to be off by about

24.95/2500 ≈ 1 percentage point.

So in a sense it's a good result for the politician, since the estimate is very unlikely to be off by as much as 3 percentage points.

The bootstrap: when sampling from a 0-1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0s and 1s in the sample for the unknown fractions in the box. The estimate is good when the sample is reasonably large.

Let's do an example.

Let's say our population has 80% 1s and 20% 0s, but we act as if we don't know that.

pop.1<- rep(1, 8000)
zeros.1<- rep(0, 2000)

pop.2<- c(pop.1,zeros.1)

Let's pick a sample of 2,500 from it.

sample.1<- sample(pop.2, 2500, replace = F)

sum(sample.1)
## [1] 1993

So we have 1993 1s, that is 79.72%, and 507 0s, that is 20.28%.

SE is

sqrt(2500)* sd(sample.1)
## [1] 20.10826

so, as a percentage of the sample, the SE is ~20/2500 ≈ 0.8 of a percentage point.

That is the likely size of the chance error in the sample percentage. The expected value of the sample percentage is the population percentage, 80%; our observed 79.72% is off by only about 0.3 points, well within one SE.

Confidence interval

In the example above, 79.72% of the sample voted yes. How far from 79.72% can the population percentage be?

Our SE gives the answer. We know our SE is about 0.8 of a percentage point, so that is the likely size of the chance error.

Errors larger than 2 SEs occur infrequently. Let's calculate 2 SEs below and above the sample percentage:

79.72% ± 2 x 0.8%

79.72 + 2*.8
## [1] 81.32
79.72 - 2*.8
## [1] 78.12

So that is our 95% confidence interval: you can be 95% confident that the population percentage is caught inside the interval from 78.12% to 81.32% (and indeed the true 80% is).

A sample percentage will be off the population percentage because of chance error. The SE tells you the likely size of the amount off.

The Accuracy of Averages

Let's say we have tickets 1:7 in the box. We draw 25 tickets from the box (with replacement), average them, and repeat the process 25 times.

set.seed(2)
sample.2<- c(1:7)
s.size<-25
rep<- 25
s_mean<- rep(NA, rep)


for ( i in 1:rep) {
        my_sample <- sample(sample.2, s.size, replace = T)
        s_mean[i] <- mean(my_sample)
}

# s_mean
hist(s_mean, breaks = 6)

So the average of the box is 4

mean(sample.2)
## [1] 4

Now find SE

We know the SD of the box is 2.16:

sd(sample.2)
## [1] 2.160247

We know the expected value of the sum is 25 x 4 = 100.

If we wanted the sum, the SE would be SQRT(sample size) * SD.

But since we want the average, the SE is SQRT(sample size) * SD / (sample size),

which we can simplify:

(sample size)^(1/2) * SD / (sample size)

= [ (sample size)^(1/2) / (sample size) ] * SD

= SD / SQRT(sample size)

For the average of 25 draws: the SE of the sum is sqrt(25) x 2.16 ≈ 10.8, so

(100 ± 10.8)/25 = 100/25 ± 10.8/25

= 4 ± 0.43

sd(sample.2)/sqrt(s.size)
## [1] 0.4320494

(In the book's family-income example:) the SD says how far family incomes are from the average - for typical families.

The SE says how far sample averages are from the population average - for typical samples.

Tests of Significance

Let's say we have a large box of numbered tickets. Dr. Null says the average is 50, whereas Dr. Alt says it is different from 50. To find out, they draw 500 tickets at random. The sample average turns out to be 48 and the SD 15.3.

Dr. Null says the box average could still be 50: the sample average of 48 is only 2 away, which is small compared with the SD of 15.3.

Dr. Alt replies that we need to look at the SE, not the SD, since the SE tells us how far the sample average is likely to be from its expected value, i.e. the average of the box.

SE of the sum = sqrt(500) x 15.3 ≈ 342, so SE of the average = 342/500 ≈ 0.7

so Z = (48 - 50)/0.7 ≈ -3

Dr. Alt says: if your theory is right, the sample average is 3 SEs below its expected value. We can't explain that difference by chance; the difference is real. So the average of the tickets in the box isn't 50, it's some other number. (Note that the expected value of the sample average is the average of the tickets in the box.)

This sort of calculation is called a test of significance.

Dr. Alt claimed that the difference was real - the question being argued is whether the difference just reflects chance variation, or whether the average of all the tickets in the box really is different from 50.

NOTE: In statistics, the null hypothesis and the alternative hypothesis are statements about the box, NOT JUST THE SAMPLE. EACH HYPOTHESIS REPRESENTS ONE SIDE OF THE ARGUMENT.

NULL HYPOTHESIS - THE AVERAGE OF THE BOX EQUALS 50. ALTERNATIVE HYPOTHESIS - THE AVERAGE OF THE BOX IS LESS THAN 50.

Basically, the null hypothesis corresponds to the idea that an observed difference is due to chance.

Z = (observed - expected)/SE

Z says how many SEs away the observed value is from its expected value - the expected value computed from the box, not the sample.

When we refer to the Z table, the area to the left of -3 under the normal curve is ridiculously small. The chance of getting a sample average 3 SEs or more below its expected value is about 1 in 1000. [Pause and note what that statement is saying: if the box average really were 50 and we took 500 draws from that box, the chance of the sample average coming out at 48 or lower is about 1 in 1000.]

That is strong evidence against the null hypothesis.

The observed significance level (P) is the chance of getting a test statistic as extreme as or more extreme than the observed one. The smaller the chance, the stronger the evidence against the null.

From the Z table, the area to the left of z = -3 is about 0.0013, which is the P value.
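The same numbers in R (pnorm gives the area to the left; se.avg and z are just working names):

se.avg <- sqrt(500) * 15.3 / 500   # SE of the sample average, ~0.68
z <- (48 - 50) / se.avg            # ~ -2.9
pnorm(z)                           # ~0.002; rounding z to -3 gives the ~0.001 above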

The P value of a test is the chance of getting a test statistic this extreme, assuming the null hypothesis is right. P is NOT the chance of the null hypothesis being right.

The Z test is used for reasonably large samples, when the normal approximation can be used for the probability histogram of the average of the draws.

Many investigators draw the line at 5%: if P is less than 5%, the result is called statistically significant.

If P is less than 1%, the result is called highly significant.

The t test

In the t test, t = (observed - expected)/SE,

where the SD is first adjusted for the small sample: SD+ = sqrt[no. of measurements/(no. of measurements - 1)] x SD, and the SE is then computed from SD+ in the usual way,

and degrees of freedom = no. of measurements - 1.
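To round this off, here is what such a test looks like in R on a handful of made-up measurements (hypothetical numbers, purely to show the call; t.test() applies the N - 1 adjustment and reports df = N - 1):

x <- c(48.2, 49.1, 50.3, 47.8, 48.9)   # hypothetical measurements
t.test(x, mu = 50)                     # tests whether the true average is 50; df = 4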