Well, don't quote me on this! I'm just brushing up my Stat 101 with Statistics by David Freedman.
I got the data from:
http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html
library(knitr)
library(ggplot2)
library(dplyr)
library(scales)
It has 2 variables, Height [inches] and Weight [kg]; the original source data had weight in pounds.
Let’s load the CSV
# data1<-read.csv("Book1.csv")
data1<-read.csv("C:/r data warehouse/height weight/book1.csv")
sum(is.na(data1))
## [1] 0
str(data1)
## 'data.frame': 25000 obs. of 2 variables:
## $ Height.Inches.: num 65.8 71.5 69.4 68.2 67.8 ...
## $ Weight.Kg. : num 51.3 61.9 69.4 64.6 65.5 ...
It's a clean and tidy dataset: 2 variables, Height and Weight, and 25,000 observations. I think this will be a good dataset.
Let's draw a scatter plot. Let's assume that a person's weight depends on height, so that taller people weigh more.
Before we start, let's rename the variables so that typing becomes painless.
data1 <- rename(data1, height=Height.Inches., weight=Weight.Kg.)
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_abline(intercept = 0)+
scale_x_continuous(limits = c(0, 80))+
scale_y_continuous(limits = c(0,80))
The plot above is a zoomed-out version with a 45-degree line running through it. We'll now see the standard view plotted by R.
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))
The scatter plot has a rugby-ball shape, suggesting a positive association between height and weight.
Now that we have the scatter plot, the next step in summarizing the data is to find the averages of height and weight.
mean.height<-mean(data1$height);mean.height
## [1] 67.99311
mean.weight<-mean(data1$weight);mean.weight
## [1] 57.64221
Now let's plot those means.
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)
The dashed lines mark the average height and average weight; they cross at the point of averages, which sits at the centre of the cloud.
Now, to describe the spread of the cloud, we'll use the standard deviations of x and y.
sd.height<-sd(data1$height);sd.height
## [1] 1.901679
sd.weight<-sd(data1$weight);sd.weight
## [1] 5.28929
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")
From the chart above we can see that most of the observations fall within 2 SDs above and below the mean.
Now, to find the strength of the association between height and weight we need the correlation coefficient, r. It is a measure of linear association, or clustering around a line. The closer r gets to ±1, the stronger the linear association between the variables; r always falls between -1 and 1.
Points in a scatter diagram generally seem to cluster around the SD line. This line goes through the point of averages and through all the points that are an equal number of SDs away from their averages.
Finding SD ratio:
sd.ratio <- sd.weight/sd.height; sd.ratio
## [1] 2.781379
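Since the SD line passes through the point of averages, its intercept follows directly from the slope. A small sketch (sd.intercept is just a name introduced here) reproducing the -131 hardcoded in the plots below:
# the SD line passes through (mean.height, mean.weight) with slope sd.ratio
sd.intercept <- mean.weight - sd.ratio * mean.height
sd.intercept   # comes out around -131.5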
Now let's plot the SD line. We know the slope, and the line must pass through the point of averages.
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_abline(slope = 2.781379, intercept = -131, col="green")
std.height<-(data1$height-mean.height)/sd.height
std.weight<-(data1$weight-mean.weight)/sd.weight
product<-std.height*std.weight
product<-as.data.frame(product)
str(product)
## 'data.frame': 25000 obs. of 1 variable:
## $ product: num 1.404 1.494 1.645 0.154 -0.159 ...
r<-mean(product$product);r
## [1] 0.5028384
We can also get r with the following command:
cor(data1)
## height weight
## height 1.0000000 0.5028585
## weight 0.5028585 1.0000000
Now, why does r work?
Let's plot the products on our plot.
product$x<- data1$height
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_abline(slope = 2.781379, intercept = -131, col="green")+
geom_point(data=product, aes(x,product))
ggplot(product, aes(x,product))+geom_point(aes(col=(product<0)))
Hmmm... the plot above looks like a mirror image - symmetric. I'm not sure, but this could be due to our r being about 0.50. The points are coloured by the sign of the product (red-ish for positive, blue-green for negative). We could also divide the plot into 4 quadrants. We'll leave this for later investigation.
Note that r is a pure number without units, since x and y were converted to standard units when calculating it.
r can also be misleading in the presence of outliers and non-linear association.
For school children, shoe size is strongly correlated with reading skill. However, learning new words does not make the feet grow bigger. Instead, there is a third factor involved, age. Here age is a confounder.
CORRELATION MEASURES ASSOCIATION. BUT ASSOCIATION IS NOT THE SAME AS CAUSATION.
The regression method describes how one variable depends on another - how much of an increase in weight [y] is associated with a unit increase in height [x]?
Let's plot weight vs. height again:
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_abline(slope = 2.781379, intercept = -131, col="green")+
scale_x_continuous(breaks = 60:80)+
scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))
#geom_smooth()
If we look at the observations plotted around the SD line, they are almost symmetrical about it; this is related to r being ~0.50. And this is where r comes in: associated with an increase of one SD in height there is an increase of only ~0.50 SDs in weight, on average.
For men one SD above average height:
average height + 1 SD of height = 67.99 + 1.90 = 69.89 inches
r x SD of weight = 0.50 x 5.289 = 2.6445 kg
So the average weight of these men is around
2.6445 kg + 57.64 kg [mean weight] = 60.2845 kg
For men two SDs above average height:
average height + 2 SDs of height = 67.99 + 2 x 1.90 = 71.79 inches
r x 2 SDs of weight = 0.50 x 2 x 5.289 = 5.289 kg
So the average weight of these men is around
5.289 kg + 57.64 kg [mean weight] = 62.929 kg
For men two SDs below average height:
average height - 2 SDs of height = 67.99 - 2 x 1.90 = 64.19 inches
r x 2 SDs of weight = 0.50 x 2 x 5.289 = 5.289 kg
So the average weight of these men is around
57.64 kg [mean weight] - 5.289 kg = 52.351 kg
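Rather than typing these three points by hand, we can compute them; a small sketch using the exact r from above instead of the rounded 0.50 (so the numbers differ slightly from the hand calculations):
# heights 1, 2 and -2 SDs away from the average, with the regression estimate for each
k <- c(1, 2, -2)
rg.check <- data.frame(x = mean.height + k * sd.height,
                       y = mean.weight + r * k * sd.weight)
rg.check   # roughly (69.9, 60.3), (71.8, 63.0), (64.2, 52.3)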
Let's save those coordinates in a data frame.
rg.line<-data.frame(x=c(69.89,71.79,64.19), y=c(60.2845,62.929,52.351))
We’ll plot those coordinates in our plot.
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_abline(slope = 2.781379, intercept = -131, col="green")+
scale_x_continuous(breaks = 60:80)+
scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))+
geom_segment(data=rg.line, aes(x=rg.line$x[2], y=rg.line$y[2], xend=rg.line$x[3], yend=rg.line$y[3],colour = "segment"))+
geom_point(data=rg.line, aes(x,y, fill='grey', shape=23, size=3))+
scale_shape_identity()+
geom_smooth(method="lm")
So, as you can see, those 3 points have been plotted. The straight line passing through them is our regression line, and it goes through the point of averages.
All the points (height, estimated average weight) fall on the regression line.
THE REGRESSION LINE FOR Y ON X ESTIMATES THE AVERAGE VALUE OF Y CORRESPONDING TO EACH VALUE OF X.
ASSOCIATED WITH EACH INCREASE OF ONE SD IN X THERE IS AN INCREASE OF ONLY r X SDs IN Y, ON AVERAGE.
IN VIRTUALLY ALL TEST-RETEST SITUATIONS, THE BOTTOM GROUP ON THE FIRST TEST WILL, ON AVERAGE, SHOW SOME IMPROVEMENT ON THE SECOND TEST, AND THE TOP GROUP WILL, ON AVERAGE, FALL BACK. THIS IS THE REGRESSION EFFECT. THINKING THAT THE REGRESSION EFFECT MUST BE DUE TO SOMETHING IMPORTANT, AND NOT JUST THE SPREAD AROUND THE LINE, IS CALLED THE REGRESSION FALLACY.
Let's say a guy is 3.5 inches taller than average. What does r predict for his average weight?
How many SDs of height is 3.5 inches?
3.5/1.90 ≈ 1.84 SDs
So the predicted gain in weight is r x 1.84 SDs of weight = 0.50 x 1.84 x 5.289 ≈ 4.87 kg
so his predicted average weight is 57.64 + 4.87 ≈ 62.5 kg
If we look at our regression line, a height of ~71.5 corresponds to a weight of ~62.5, which agrees.
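A quick check of that hand calculation, reusing the r, mean, and SD values computed earlier:
# predicted average weight for a man 3.5 inches taller than average
mean.weight + r * (3.5 / sd.height) * sd.weight   # roughly 62.5 kg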
#data1%>% group_by(height) %>% summarise(M=mean(weight))
ERROR = ACTUAL VALUE - PREDICTED VALUE, i.e. e = y - ŷ
Another word for the errors is residuals.
The prediction error for a point is its vertical distance from the regression line (the actual value minus the value predicted by the line).
The overall size of the errors is measured by the RMS error. The RMS error of the regression line for predicting weight from height is
\[\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2}\]
and it can also be computed as
\[\sqrt{1-r^2}\cdot SD_y\]
The rms error for regression says how far typical points are above or below the regression line.
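Before fitting a model, we can sanity-check the second formula with the r and SD we already have; the result should match the RMS of the residuals computed further below:
sqrt(1 - r^2) * sd.weight   # roughly 4.57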
lm.model<-lm(weight~height, data=data1)
lm.model
##
## Call:
## lm(formula = weight ~ height, data = data1)
##
## Coefficients:
## (Intercept) height
## -37.456 1.399
names(lm.model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
lm.model.sm<-summary(lm.model)
names(lm.model.sm)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
str(lm.model)
## List of 12
## $ coefficients : Named num [1:2] -37.5 1.4
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
## $ residuals : Named num [1:25000] -3.299 -0.659 9.804 6.607 8.097 ...
## ..- attr(*, "names")= chr [1:25000] "1" "2" "3" "4" ...
## $ effects : Named num [1:25000] -9114.03 -420.54 9.83 6.63 8.12 ...
## ..- attr(*, "names")= chr [1:25000] "(Intercept)" "height" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:25000] 54.6 62.6 59.6 58 57.4 ...
## ..- attr(*, "names")= chr [1:25000] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:25000, 1:2] -1.58e+02 6.32e-03 6.32e-03 6.32e-03 6.32e-03 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:25000] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "height"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.01 1.01
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 24998
## $ xlevels : Named list()
## $ call : language lm(formula = weight ~ height, data = data1)
## $ terms :Classes 'terms', 'formula' length 3 weight ~ height
## .. ..- attr(*, "variables")= language list(weight, height)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. ..$ : chr "height"
## .. ..- attr(*, "term.labels")= chr "height"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(weight, height)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## $ model :'data.frame': 25000 obs. of 2 variables:
## ..$ weight: num [1:25000] 51.3 61.9 69.4 64.6 65.5 ...
## ..$ height: num [1:25000] 65.8 71.5 69.4 68.2 67.8 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 weight ~ height
## .. .. ..- attr(*, "variables")= language list(weight, height)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. .. ..$ : chr "height"
## .. .. ..- attr(*, "term.labels")= chr "height"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(weight, height)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## - attr(*, "class")= chr "lm"
#lm.model$model
#plot(lm.model)
residuals<- data.frame(data1$weight-lm.model$fitted.values)
residuals1<- data.frame(lm.model$residuals)
mean(residuals$data1.weight...lm.model.fitted.values)
## [1] -2.931794e-17
mean(lm.model$residuals)
## [1] -2.562205e-17
nrow(data1)
## [1] 25000
rms<- sqrt(sum((lm.model$residuals)^2)/nrow(data1))
rms
## [1] 4.571805
So, to get the residuals from the model we use lm.model$residuals; these are our errors.
Let's plot the residuals vs. height.
Mathematically, the residuals from the regression line must average out to 0.
However, the locally weighted (loess) smooth of the residuals below is not perfectly horizontal.
ggplot(data=lm.model, aes(x=height, y=residuals))+geom_jitter(aes(alpha=0.5))+
geom_smooth()
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Now, let's add a regression line to the residual plot. Since the residuals from the regression line average out to 0, the regression line fitted to the residuals must be horizontal.
ggplot(data=lm.model, aes(x=height, y=residuals))+geom_jitter(aes(alpha=0.5))+
geom_smooth(method="lm")
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
When all the vertical strips in a scatter diagram show similar amounts of spread it is HOMOSCEDASTIC.
ggplot(data1, aes(height, weight))+geom_point(aes(alpha=.2))+
geom_vline(xintercept = mean(data1$height),linetype = 2, col='red', size=1)+
geom_hline(yintercept = mean(data1$weight),linetype = 2,col="yellow", size=1)+
geom_vline(xintercept=2*sd.height+mean.height, linetype=2,col='red')+
geom_vline(xintercept=-2*sd.height+mean.height, linetype=2,col='red')+
geom_hline(yintercept =2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_hline(yintercept =-2*sd.weight+mean.weight,linetype=2,col="yellow")+
geom_abline(slope = 2.781379, intercept = -131, col="green")+
scale_x_continuous(breaks = 60:80)+
scale_y_continuous(breaks = c(35,40,45,50,55,60,65,70,75,80))+
geom_segment(data=rg.line, aes(x=rg.line$x[2], y=rg.line$y[2], xend=rg.line$x[3], yend=rg.line$y[3],colour = "segment"))+
geom_point(data=rg.line, aes(x,y, fill='grey', shape=23, size=3))+
scale_shape_identity()+
geom_smooth(method="lm")
RMS error for the regression line of y on x can be figured by
\[\sqrt{1-r^2}.SDy\]
Let's split data1 into training and test sets.
smp_size <- floor(0.75 * nrow(data1))
set.seed(123)
train_data1 <- sample(seq_len(nrow(data1)), size = smp_size)
train<- data1[train_data1,]
test<- data1[-train_data1,]
lm.train<- lm(weight~height, train)
predict1<-predict(lm.train, test)
pp<-as.data.frame(predict1)
rmse1<- sqrt(sum((test$weight-pp$predict1)^2)/nrow(test))
rmse1
## [1] 4.521159
sd.test.weight<-sd(test$weight)
cor.test<-cor(test)
cor.test<-cor.test[1,2]
rmse.test <- sqrt(1-(cor.test)^2)*sd.test.weight
rms-rmse1
## [1] 0.05064625
Very close.
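For completeness, the formula-based estimate rmse.test computed above was never printed; if you run it, it should also come out close to rmse1:
rmse.test            # RMS error from sqrt(1 - r^2) * SD of weight, on the test set
rmse.test - rmse1    # should be small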
#plot(data1$height,lm.model$residuals)
#plot(lm(lm.model$residuals~data1$height))
Next, we'll look more closely at the frequency theory of probability.
THE CHANCE OF SOMETHING GIVES THE PERCENTAGE OF TIME IT IS EXPECTED TO HAPPEN WHEN THE BASIC PROCESS IS DONE OVER AND OVER AGAIN, INDEPENDENTLY AND UNDER THE SAME CONDITIONS.
When you draw at random, all the tickets in the box have the same chance to be picked.
set.seed(1)
sample(1:6, 6, replace = T)
## [1] 2 3 4 6 2 6
#sample(s1, 6, replace=F)
The example above draws with replacement; the commented-out line shows a draw without replacement.
Example: for a box containing R G G, drawn with replacement, the probability of picking R is 1/3 and the probability of picking G is 2/3.
Two tickets are drawn at random without replacement from the box 1, 2, 3, 4.
What is the chance that the second ticket is 4?
Ans. 1/4
What is the chance that the second ticket is 4, given the first is 2?
Ans. 1/3
Another way to put the second question is P(2nd ticket is 4 | 1st ticket is 2), the conditional probability that the second ticket is 4 given the first was 2.
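These two answers can also be checked by simulation; a small sketch (the seed and number of repetitions are arbitrary):
set.seed(1)
draws <- replicate(100000, sample(1:4, 2, replace = FALSE))   # each column is one pair of draws
mean(draws[2, ] == 4)                                         # P(2nd ticket is 4), near 1/4
mean(draws[2, draws[1, ] == 2] == 4)                          # P(2nd is 4 | 1st is 2), near 1/3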
The chance that two things will both happen equals the chance that the first will happen, multiplied by the chance that the second will happen given the first has happened.
Q. Two cards will be dealt off the top of a well-shuffled deck. What is the chance that the first will be the seven of clubs and the second will be the queen of hearts?
Ans. 1/52 x 1/51 = 1/2652
Two things are independent if the chance for the second given the first are the same, no matter how the first one turns out. Otherwise the two things are dependent.
When drawing at random with replacement the draws are independent. Without replacement, the draws are dependent.
Q. Two dice are thrown. What is the chance of getting a total of 2 spots, i.e. 1 and 1?
Ans. There are 6 x 6 = 36 possible ways for the dice to fall.
Only one of them (1 and 1) gives a total of 2 spots, so the chance is 1/36.
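A quick simulation check of the 1/36 figure (again, the seed and number of repetitions are arbitrary):
set.seed(1)
two.dice <- replicate(100000, sum(sample(1:6, 2, replace = TRUE)))   # two independent dice
mean(two.dice == 2)                                                  # should be near 1/36 = 0.0278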
To find the chance that at least one of two things will happen, check to see if they are mutually exclusive. If they are, add the chances.
Q. The chance that the top card is the ace of spades or the bottom card is the ace of spades equals:
Ans. 1/52 + 1/52
Number of ways to arrange 4 R's and 1 G:
5!/(4! x 1!) = 5
The numerator uses the total count, 4 + 1 = 5; in the denominator, 4! is for the R's and 1! is for the G.
Note that 0! = 1.
The binomial coefficient is \[\frac{n!}{k!(n-k)!}\]
where n is the number of trials and k is the number of times the event occurs.
The binomial formula is
\[\frac{n!}{k!(n-k)!}\cdot p^k\cdot(1-p)^{n-k}\]
where p is the probability that the event will occur on any particular trial.
Q. A die is rolled 10 times. What is the chance of getting exactly 2 aces?
A. n = 10; k = 2; n - k = 8; p = 1/6 (so 1 - p = 5/6).
Therefore,
\[\frac{10!}{2!(10-2)!}\cdot(1/6)^2\cdot(5/6)^{10-2}\]
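We can check this against R's binomial functions:
choose(10, 2) * (1/6)^2 * (5/6)^8   # plugging into the binomial formula, about 0.29
dbinom(2, size = 10, prob = 1/6)    # the same chance via dbinom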
LAW OF AVERAGES:
Let's toss a coin 10,000 times and see how many heads and tails we get.
The assumption that in the long run the number of heads and the number of tails even out is FALSE.
set.seed(2)
coin.toss<-sample(x=c("H","T"), 10000, replace=TRUE, prob = c(.5,.5))
table(coin.toss)
## coin.toss
## H T
## 5061 4939
coin.toss<-as.data.frame(coin.toss)
coin.toss<- coin.toss %>% mutate(Index=c(1:10000))
#ggplot(coin.toss, aes(as.factor(coin.toss)))+geom_bar(width=.5)
set.seed(2)
coin.toss2<-sample(x=c(1,0), 10000, replace=TRUE, prob = c(.5,.5))
table(coin.toss2)
## coin.toss2
## 0 1
## 4939 5061
coin.toss2<-as.data.frame(coin.toss2)
coin.toss2<- coin.toss2 %>% mutate(Index=c(1:10000))
coin.toss2 <- coin.toss2 %>%
mutate(no.toss.div.2=Index/2, Cumul.prob=cumsum(coin.toss2), Difference=Cumul.prob-no.toss.div.2)
coin.toss2.Headonly<- coin.toss2 %>% filter(coin.toss2>0) %>% select(coin.toss2,Index,Cumul.prob,Difference)
coin.toss2.Headonly<- coin.toss2.Headonly%>%
mutate(Diff.div.Index=(Difference/Index)*100)
ggplot(coin.toss2.Headonly, aes(Index, Difference))+geom_line()+
labs(x="Coin toss", y="Chance of error")
# df.coin2<- as.data.frame(coin.toss2.Headonly)
#
ggplot(coin.toss2.Headonly, aes(coin.toss2,Index))+geom_bar(stat = 'identity')
#barplot(table(coin.toss2.Headonly$Cumul.prob), density = 100, main = "2 Dice Sum, 10000 Rolls")
# x22<-table(coin.toss2.Headonly$Cumul.prob)
# hist(x22$)
# #ggplot(coin.toss, aes(as.factor(coin.toss)))+geom_bar(width=.5)
In 10,000 tosses you expect to get 5,000 heads, right?
Not exactly. You expect around 5,000 heads, but in reality you could get 5,001, 4,998, 5,007, and so on. The amount off from 5,000 is what we call the chance error.
chance error = number of heads - half the number of tosses [the expected number]
Therefore, for our 10,000 tosses:
the number of heads is 5,061 and half of 10,000 is 5,000,
so 5061 - 5000 = 61 is our chance error.
The plot above shows that as the number of tosses increases, the difference between the number of heads and the expected number is likely to be quite large in absolute terms.
However, if we compare the chance error to the number of tosses, the difference is quite small. For our 10,000th toss the chance error was 61, and 61 divided by 10,000 tosses is
61/10000
## [1] 0.0061
which is very small.
So: number of HEADS = half the number of tosses + chance error.
E.g. at the 9,998th toss, number of heads = 4,999 + 61 = 5,060.
Let's plot the chance error relative to the number of tosses, i.e. chance error / tosses:
ggplot(coin.toss2.Headonly, aes(Index,Diff.div.Index))+geom_line()
CHANCE PROCESS:
Two main ideas -
Find an analogy between the process being studied (sampling voters in the poll example) and drawing numbers at random from a box.
Connect the variability you want to know about ( for eg in the estimate for the Democratic vote) with the chance variability in the sum of the numbers drawn from the box.
The analogy between a chance process and drawing from a box is called a box model. More complicated processes can be dealt with through the same analogy.
THE SUM OF DRAWS:
Let's say we have a box of tickets numbered 1, 2, 3, 4, 5, 6.
Imagine drawing at random from it, with replacement.
Imagine taking 25 draws from the box.
set.seed(2)
sample.tick<-c(1:6)
s.1<-sample(sample.tick, 25, replace = T)
s.1
## [1] 2 5 4 2 6 6 1 6 3 4 4 2 5 2 3 6 6 2 3 1 4 3 6 1 3
Sum of s.1 is
sum(s.1)
## [1] 90
So our first run of drawing 25 times and adding them up came to 90; let's repeat this process 10 times.
set.seed(2)
s.2<-replicate(10, sum(sample(sample.tick, 25, replace = T)))
s.2
## [1] 90 88 84 85 90 84 86 77 90 102
So drawing 25 tickets from the box of 1:6, adding them up, and repeating this process 10 times gives 10 different results. The chance variability is easy to see. The sum can be as small as 1 x 25 = 25 and as large as 6 x 25 = 150.
Now suppose the box holds the four tickets 1, 1, 1, 5 and 100 draws are made with replacement. What would the expected value of the sum be?
There are four tickets in the box: 5 should come up about 1/4 of the time and 1 about 3/4 of the time, so for 100 draws: 100 x 1/4 x 5 + 100 x 3/4 x 1 = 200.
FORMULA: the expected value for the SUM of draws made at random with replacement from a box equals
(number of draws) x (average of box)
100 x [(1+1+1+5)/4] = 200
AGAIN, NOTE THE FORMULA IS FOR THE SUM OF THE DRAWS, NOT THE AVERAGE.
Each ticket won't appear exactly in proportion to its probability; the SUM will be off the expected value by a chance error:
SUM = expected value + chance error
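A small simulation sketch of the 1,1,1,5 example (the seed and the 1,000 repetitions are arbitrary choices): the sums hover around the expected value of 200, but each one is off by a chance error.
set.seed(3)
box <- c(1, 1, 1, 5)
sums.100 <- replicate(1000, sum(sample(box, 100, replace = TRUE)))
mean(sums.100)    # close to 100 * mean(box) = 200
range(sums.100)   # individual sums scatter around 200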
\[SE = \sqrt{\text{number of draws}} \times (\text{SD of box})\]
A SUM IS LIKELY TO BE AROUND ITS EXPECTED VALUE, BUT TO BE OFF BY A CHANCE ERROR SIMILAR TO THE STANDARD ERROR.
So let's go back to our example:
we have 1:6 in the box.
The expected value of the sum of the draws = (number of draws) x (average of box)
mean(c(1:6))
## [1] 3.5
therefore, 25 * 3.5 = 87.5
25 * 3.5
## [1] 87.5
Now let's calculate its SE:
\[SE = \sqrt{\text{number of draws}} \times (\text{SD of box})\]
sd(1:6)
## [1] 1.870829
sqrt(25)*sd(1:6)
## [1] 9.354143
Therefore SE = 9.35.
So the sum is likely to be around the expected value of 87.5, give or take 9.35 or so.
Now let's repeat the process 10 times.
set.seed(2)
s.3<-replicate(10, sum(sample(sample.tick, 25, replace = T)))
s.3
## [1] 90 88 84 85 90 84 86 77 90 102
We know sum = expected value + chance error.
So for the first one, i.e. 90, the chance error is 90 - 87.5 = 2.5, which is less than one SE.
s.3-87.5
## [1] 2.5 0.5 -3.5 -2.5 2.5 -3.5 -1.5 -10.5 2.5 14.5
So 2 out of the 10 sums were off by more than one SE (9.35); the rest were within one SE.
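A quick count of how many of the 10 sums were more than one SE from the expected value, just to confirm the statement above:
sum(abs(s.3 - 87.5) > sqrt(25) * sd(1:6))   # should come out as 2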
x.1<-c(0,1)
sd(x.1)
## [1] 0.7071068
mean(c(2,2))
## [1] 2
sqrt(10000)
## [1] 100
A probability histogram represents chance, not data.
Let's take a pair of dice and throw them; the outcome we record is the sum of the two dice.
set.seed(2)
pair.dice<- c(1:6,1:6)
sum(pair.dice)
## [1] 42
rolls.100<-replicate(100, sum(sample(pair.dice, 2, replace = T)))
hist(rolls.100)
rolls.10000<-replicate(1000000, sum(sample(pair.dice, 2, replace = T)))
#hist(rolls.10000, probability = T)
#rolls.10000 <-replicate(10000, sum(sample(pair.dice, 2, replace = T, prob = table(outer(1:6,1:6,"+")) / 36)))
#hist(rolls.1000000)
rolls.10000 <- as.data.frame(rolls.10000 )
ggplot(rolls.10000 , aes(rolls.10000 ))+stat_bin(bins = 2)
What we are doing below: toss a coin 1,000 times, count the number of heads, and repeat this process 10,000 times.
The plot shows the distribution of the number of heads around its expected value; the differences from the expected value are due to chance variability.
chance error = observed value - expected value
set.seed(2)
results <- list()
for(i in 1:10000) {
coinTosses <- cumsum(sample(c(0,1), 1000, replace = TRUE, prob = c(.5,.5)))
results[[i]] <- coinTosses[length(coinTosses)]
}
no.heads<-as.data.frame(unlist(results))
m.1<-mean(no.heads$`unlist(results)`)
s.1<-sd(no.heads$`unlist(results)`)
# hist(unlist(results), main = "No of Heads",col = "lightblue", breaks = 100)
# ggplot(no.heads, aes(unlist(results)))+stat_bin(bins = 50)
ggplot(no.heads, aes(unlist(results)))+geom_histogram(aes(y=..density..),binwidth=2)+
stat_function(fun=dnorm, args=list(mean=m.1, sd=s.1))+
geom_vline(aes(xintercept=mean(no.heads$`unlist(results)`)))+
scale_x_continuous(breaks = seq(440,560, length.out = 9))
sd(c(0,2,3,4,6))
## [1] 2.236068
mean((c(0,2,3,4,6)))
## [1] 3
See the Wikipedia article on standard deviation for the discussion about estimating standard deviations. Using the formula employed "by hand" leads to a biased estimate, hence the correction of sqrt((N-1)/N). Here is a key quote:
"The term standard deviation of the sample is used for the uncorrected estimator (using N), while the term sample standard deviation is used for the corrected estimator (using N - 1). The denominator N - 1 is the number of degrees of freedom in the vector of residuals."
Now, back to the 0/1 coin-toss box. Taking its SD as sd(c(0,1)) = 0.7071068 (computed below), the SE for 1,000 draws is 22.36068. The expected number of heads in 1,000 tosses is the number of tosses x the mean of the 0/1 box:
1000 x 0.5 = 500 heads
To find the chance of getting exactly 500 heads we use the Z conversion, i.e. we convert the number of heads into standard units.
SE.1<-sd(c(0,1)) *sqrt(1000)
SE.1
## [1] 22.36068
sd(c(0,1))
## [1] 0.7071068
mean(c(0,1))
## [1] 0.5
# no.heads <- no.heads %>% rename(H=`unlist(results)`)
#
#no.heads <- no.heads %>% mutate(Z=(H-500)/SE.1)
# no.heads <- round(no.heads$Z)
To find the chance of getting exactly 500 heads we convert the number of heads into standard units (z units):
z = (observed - expected)/SE
So, for exactly 500 heads (i.e. between 499.5 and 500.5):
(499.5 - 500)/22.36 = -0.02236
(500.5 - 500)/22.36 = 0.02236
The area under the normal curve (mean 0, SD 1) to the left of z = 0.02236 is 0.5089, and to the left of z = -0.02236 it is 0.4911.
We want the area between them; subtracting gives about 1.78%.
l.1<-pnorm(q=0.02236, mean=0, sd=1 ,lower.tail = T )
l.1
## [1] 0.5089196
(1-(1-l.1)*2)*100
## [1] 1.783921
l.2<-pnorm(q=-0.02236, mean=0, sd=1 ,lower.tail = T )
l.2
## [1] 0.4910804
l.1-l.2
## [1] 0.01783921
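One caveat worth flagging: R's sd() divides by N - 1, so sd(c(0,1)) is 0.707, whereas the SD of a box holding one 0 and one 1 (dividing by N, as in Freedman) is 0.5. With that box SD the SE for 1,000 tosses is sqrt(1000) * 0.5 ≈ 15.8, and the chance of exactly 500 heads comes out near 2.5%, which agrees with the exact binomial chance; a quick check:
se.box <- sqrt(1000) * 0.5                 # SE using the box SD of 0.5
pnorm(0.5/se.box) - pnorm(-0.5/se.box)     # normal approximation, about 0.025
dbinom(500, size = 1000, prob = 0.5)       # exact binomial chance, about 0.025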
qnorm(.68, mean=500 ,sd=23)
## [1] 510.7571
#ggplot(no.heads, aes(Z))+geom_histogram(aes(y=..density..),binwidth =.5)+
#stat_function(fun = dnorm, args=list(mean=mean(no.heads$Z), sd=sd(no.heads$Z)))
Now a coin tossed 100 times, repeated 100 times:
set.seed(2)
results.2 <- list()
for(i in 1:100) {
coinTosses.1 <- cumsum(sample(c(0,1), 100, replace = TRUE, prob = c(.5,.5)))
results.2[[i]] <- coinTosses.1[length(coinTosses.1)]
}
no.heads.1<-as.data.frame(unlist(results.2))
colnames(no.heads.1) <- "H"
hist(no.heads.1$H)
pnorm(q=0.0707, mean=0, sd=1)
## [1] 0.5281817
Remember: the standard error is sqrt(number of draws) x SD [it depends on the number of draws, not on the number of repetitions/replicates].
set.seed(2)
hist(replicate(10000, sum(sample(c(1,2,9), 500,replace=T))), breaks = 50)
# replicate(100, sum(sample(c(1,2,9), 100,replace=T)))
sum(sample(c(1,2,9), 500,replace=T))
## [1] 1992
expected.value<- 500*mean(c(1,2,9))
expected.value
## [1] 2000
sd.value<- sd(c(1,2,9))
sd.value
## [1] 4.358899
std.error<- sqrt(500)*sd.value
std.error
## [1] 97.46794
When the number of repetitions is large (with the number of draws per repetition fixed, as in the coin toss), the empirical histogram will be close to the probability histogram.
When the number of draws is large, the probability histogram for the sum will be close to the normal curve.
Consequently, when the number of repetitions and the number of draws are both large, the empirical histogram for the sums will be close to the normal curve.
Simple random sampling means drawing at random without replacement.
male<- rep(x=1, 3091)
female<-(rep(x=0, 3581))
#
socio <- c(male, female)
sociao <- as.data.frame(socio)
mean(sociao$socio)
## [1] 0.4632794
sd(sociao$socio)
## [1] 0.4986871
# health.study<-rbind(male, female)
A politician wants to enter a primary in a district with 100,000 eligible voters, but only if he has a good chance of winning.
He hires a survey organization, which takes a simple random sample of 2,500 voters.
Our box model: one ticket per voter, 1 for a yes vote and 0 for a no vote, and 2,500 draws.
However, we don't know the fraction of the 100,000 voters in the population who would vote yes or no.
We can use the bootstrap method: the fraction in the sample should be close to the fraction in the box, i.e. the population.
We find that 1,328 voted yes and 2500 - 1328 = 1,172 voted no. The fractions:
1328/2500
## [1] 0.5312
(2500-1328)/2500
## [1] 0.4688
So 53.12% voted yes and 46.88% voted no.
The expected value for the sum of the draws (the number of yes votes) equals the number of draws x the average of the box, which we estimate from the sample:
2500 * (1328/2500)
## [1] 1328
The SD of the box, estimated from the sample, is
sqrt(.5312 * .4688)
## [1] 0.4990256
That uses the shortcut formula for a 0-1 box, sqrt((fraction of 1s) x (fraction of 0s)). Otherwise you can do it the long way:
xx.12<- rep(1, 1328)
xx.13<- rep(0, 1172)
xx.14<- c(xx.12, xx.13)
sd(xx.14)
## [1] 0.4991254
So our SE is
sqrt(2500) * sd(xx.14)
## [1] 24.95627
So what this is saying: our estimate of
1328/2500 = 53.12% is likely to be off by about
24.96/2500 ≈ 1 percentage point.
So in a sense it's a good result for the politician, since the estimate is unlikely to be off by as much as 3 percentage points.
Let's do an example.
Say our population has 80% 1s and 20% 0s, but we act as if we don't know that.
pop.1<- rep(1, 8000)
zeros.1<- rep(0, 2000)
pop.2<- c(pop.1,zeros.1)
Let's pick a sample of 2,500 from it (without replacement).
sample.1<- sample(pop.2, 2500, replace = F)
sum(sample.1)
## [1] 1993
So we have 1,993 1s, that is 79.72%, and 507 0s, that is 20.28%.
The SE (for the count of 1s) is
sqrt(2500)* sd(sample.1)
## [1] 20.10826
As a percentage, that is about 20/2500 = 0.8%.
Our sample percentage of 79.72% differs from the true 80% by only about 0.3 of a percentage point, well within one SE.
Now suppose we only knew the sample: 79.72% voted yes. How far can the population percentage be from 79.72%?
Our SE gives the answer: the SE for the sample percentage is 0.8%, which is the likely size of the chance error.
Errors larger than 2 SEs occur infrequently. Let's calculate 2 SEs below and above the sample percentage:
79.72% +- 2 x 0.8%
79.72 + 2*.8
## [1] 81.32
79.72 - 2*.8
## [1] 78.12
So that's our 95% confidence interval: we can be 95% confident that the population percentage is caught inside the interval from 78.12% to 81.32%.
A sample percentage will be off the population percentage due to chance error. The SE tells you the likely size of the amount off.
Let's say we have tickets 1:7 in the box. We draw 25 tickets from the box, average them, and repeat the process 25 times.
set.seed(2)
sample.2<- c(1:7)
s.size<-25
rep<- 25
s_mean<- rep(NA, rep)
for ( i in 1:rep) {
my_sample <- sample(sample.2, s.size, replace = T)
s_mean[i] <- mean(my_sample)
}
# s_mean
hist(s_mean, breaks = 6)
So the average of the box is 4
mean(sample.2)
## [1] 4
Now let's find the SE.
We know the SD of the box is 2.16:
sd(sample.2)
## [1] 2.160247
We know the expected value for the sum is 25 x 4 = 100.
For the sum, the SE would be sqrt(sample size) x SD = sqrt(25) x 2.16 ≈ 10.8.
But since we want the average, we divide by the sample size. The SE for the average is
sqrt(sample size) x SD / (sample size)
which simplifies to
SD / sqrt(sample size)
So the average of 25 draws is about
100/25 +- 10.8/25
= 4 +- 0.43
sd(sample.2)/sqrt(s.size)
## [1] 0.4320494
The SD says how far individual family incomes are from the average, for typical families.
The SE says how far sample averages are from the population average, for typical samples.
Let's say we have a large box of numbered tickets. Dr. Null says the average of the box is 50, whereas Dr. Alt says it is different from 50. To find out, they draw 500 tickets at random. The sample average turns out to be 48 and the SD is 15.3.
Dr. Null says the box average is 50: a sample average of 48 is not a big difference, and it is small relative to the SD of 15.3.
Dr. Alt says we need to look at the SE, not the SD, since the SE tells us how far the sample average is likely to be from its expected value, i.e. the average of the box.
SE for the sum = sqrt(500) x 15.3 ≈ 342, so SE for the average = 342/500 ≈ 0.7
so Z = (48 - 50)/0.7 ≈ -3
Dr. Alt continues: if your theory is right, the sample average is 3 SEs below its expected value. We can't explain that difference by chance; the difference is real. So the average of the tickets in the box isn't 50, it's some other number. Note that the expected value of the sample average is the average of the tickets in the box.
This sort of calculation is called a test of significance.
Dr. Alt claimed that the difference was real. The question is whether the difference just reflects chance variation, or whether the average of all the tickets in the box really differs from 50.
NOTE: in statistics, the null hypothesis and the alternative hypothesis are statements about the box, NOT JUST THE SAMPLE. Each one represents one side of the argument.
NULL HYPOTHESIS - THE AVERAGE OF THE BOX EQUALS 50. ALTERNATIVE HYPOTHESIS - THE AVERAGE OF THE BOX IS LESS THAN 50.
Basically, the null hypothesis corresponds to the idea that an observed difference is due to chance.
Z = (observed - expected)/SE
Z says how many SEs away an observed value is from its expected value, where the expected value is computed from the box, not the sample.
When we refer to the Z table, the area to the left of -3 under the normal curve is ridiculously small. The chance of getting a sample average 3 SEs or more below its expected value is about 1 in 1,000. [Pause and note what that statement is saying: if the box average really was 50, and we took 500 draws from that box, the chance of the sample average coming out 3 SEs or more below 50, i.e. 48 or less, is about 1 in 1,000.]
There is strong evidence against the null hypothesis.
The observed significance level (P) is the chance of getting a test statistic as extreme as, or more extreme than, the observed one. The smaller that chance, the stronger the evidence against the null.
From the Z table, the area to the left of z = -3 is about 0.0013, which is the P-value.
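A quick check with R's normal CDF:
pnorm(-3)   # about 0.0013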
The P-value of a test is the chance of getting a test statistic as big as (or bigger than) the observed one, computed assuming the null hypothesis is right. P is NOT the chance of the null hypothesis being right.
The Z test is used for reasonably large samples when the normal approx can be used on the probability histogram for the average of the draws.
Many investigators draw the line at 5%: if P is less than 5%, the result is called statistically significant.
If P is less than 1%, the result is called highly significant.
*** The t test
For the t test, t = (observed - expected)/SE,
where the SE for the average is SD+/sqrt(number of measurements), with SD+ = sqrt[(number of measurements)/(number of measurements - 1)] x SD of the measurements,
and the degrees of freedom = number of measurements - 1.
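A minimal sketch of the t test by hand, on a made-up set of 5 measurements (the numbers are hypothetical, just for illustration), compared against R's built-in t.test:
x <- c(78, 83, 68, 72, 88)     # hypothetical measurements
mu <- 70                       # null hypothesis: the box average is 70
n <- length(x)
sd.plus <- sqrt(n/(n - 1)) * sqrt(mean((x - mean(x))^2))   # SD+ corrects the SD of the measurements
se.avg <- sd.plus / sqrt(n)                                 # SE for the average
(mean(x) - mu) / se.avg                                     # t statistic, with n - 1 = 4 degrees of freedom
t.test(x, mu = mu)$statistic                                # should match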