Statistical Inference for Data Science: Core Concepts


IID Random Variables

Random variables are said to be independent and identically distributed (IID) if they are mutually independent and are all drawn from the same distribution. IID samples are important because they are the model for a random sample, which is the default starting point for most statistical inference.

Note: An IID sequence does not require that every outcome in the sample space be equally likely. For example, repeated throws of a loaded die produce an IID sequence even though the outcomes are biased.
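For instance, an IID sequence of throws of a loaded die can be simulated in R as follows (the probability weights below are invented purely for illustration):

# a loaded die: throws are IID, but the six faces are not equally likely
loaded_probs <- c(0.05, 0.10, 0.10, 0.15, 0.20, 0.40)
throws <- sample(1:6, size = 1000, replace = TRUE, prob = loaded_probs)
table(throws) / length(throws)  # empirical frequencies approximate loaded_probs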

Expected Value

In a probability distribution, the weighted average of the possible values of a random variable, with weights given by their respective theoretical probabilities, is known as the expected value, usually written E(X).

The expected value tells you what to expect in an experiment “in the long run”, after many trials. In many cases, the expected value is not itself a possible outcome in the sample space (see the dice example below).

Formula

The weighted average formula for expected value is given by multiplying each possible value for the random variable by the probability that the random variable takes that value, and summing all these products.
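As a quick worked example of this definition, the expected value of a single roll of a fair die can be computed directly:

# each face 1..6 of a fair die has probability 1/6
faces <- 1:6
probs <- rep(1/6, 6)
sum(faces * probs)  # 3.5 -- not itself a possible outcome of a single roll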

# simulate n rolls of a fair six-sided die
Roll1Die <- function(n) sample(1:6, n, replace = TRUE)
# sum of two dice, repeated 100 times
TwoDie <- Roll1Die(100) + Roll1Die(100)
hist(TwoDie, breaks = seq(1.5, 12.5, by = 1), ylim = c(0, 0.2), prob = TRUE, xlab = "Sum of Two Die Rolls", main = "Demonstration of Expected Value", col = "lightblue")

[Figure: histogram of the simulated sums of two dice ("Demonstration of Expected Value")]

# expected value = long-run average value over many repetitions of the experiment (7 for the sum of two fair dice)
mean(TwoDie)
## [1] 6.75



Statistical Robustness

A robust statistic is resistant to non-intuitive behavior in the results produced by deviations from assumptions (e.g., of normality). This means that if the assumptions are only approximately met, a robust estimator will still have reasonable efficiency and reasonably small bias, as well as being asymptotically unbiased, i.e., having a bias that tends toward 0 as the sample size tends toward infinity.
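As a small sketch of this idea (the numbers below are made up for illustration), compare how the mean and the median react when a single extreme outlier is added to a sample:

x <- c(2, 3, 3, 4, 5, 5, 6)   # hypothetical small sample
x_outlier <- c(x, 500)        # the same sample with one extreme outlier added
mean(x); mean(x_outlier)      # the mean jumps from 4 to 66
median(x); median(x_outlier)  # the median (a robust statistic) barely moves (4 to 4.5)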



Pearson’s Correlation Coefficient

Pearson's Correlation Coefficient is not robust: adding or removing a single outlier can significantly affect the results.
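A quick sketch of this sensitivity, using synthetic data rather than the Boston example below:

set.seed(42)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)     # strongly correlated synthetic data
cor(x, y)                        # high Pearson correlation
x2 <- c(x, 10); y2 <- c(y, -10)  # add a single extreme outlier
cor(x2, y2)                      # the coefficient changes drastically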

You can quantify the strength of a linear relationship between two variables by calculating a correlation coefficient; this corresponds to “R” in regression table output. The coefficient of determination, R-squared, is its square in simple linear regression.

library(MASS)    # Boston housing data
data(Boston)
attach(Boston)   # make the columns (rm, medv, ...) directly accessible
cor(rm, medv)    # Pearson correlation between average rooms and median home value
## [1] 0.6953599
plot(rm, medv, xlab = "Avg # of Rooms Per Dwelling", ylab = "Median Value of Occupied Homes (1000s)", main = "Visualizing the Relationship", type = "p", col = "red")
reg <- lm(medv ~ rm)   # simple linear regression of value on rooms
abline(reg, col = "blue", lwd = 5)

[Figure: scatterplot of rm vs. medv with the fitted regression line ("Visualizing the Relationship")]

Spearman’s Rank Correlation Coefficient

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers in the tails of both samples.

Running the correlation again with the Spearman method, the results indicate a somewhat weaker relationship between the two variables:

cor(rm, medv, method = "spearman")
## [1] 0.6335764
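Spearman's coefficient is simply Pearson's correlation computed on the ranks of the data, which is why extreme values have less influence. A quick way to verify this with the same Boston columns:

# Spearman = Pearson correlation of the ranks (ties receive average ranks)
cor(rank(rm), rank(medv))  # should match the Spearman value above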



Coefficient of Variation

In order to compare the variation of one dataset to another, we use the coefficient of variation, which is the standard deviation divided by the mean (often expressed as a percentage, as in the function below).

# coefficient of variation, expressed as a percentage
CV <- function(sd, mean) {
  (sd / mean) * 100
}

CV(sd(rm), mean(rm))
## [1] 11.17992
CV(sd(medv), mean(medv))
## [1] 40.81651



Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in the population due to cost or time constraints.

Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of a sample of the overall population. MLE accomplishes this by treating the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable given the model.

Note:

The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution.

# example data: number of right turns during 300 three-minute intervals at a specific intersection, modeled with a Poisson distribution
X <- c(rep(0,14), rep(1,30), rep(2,36), rep(3,68), rep(4, 43), rep(5,43), rep(6, 30), rep(7,14), rep(8,10), rep(9, 6), rep(10,4), rep(11,1), rep(12,1)) 

hist(X, main = "Histogram of Number of Right Turns", xlab = "Number of Right Turns per Interval", right = FALSE, prob = TRUE, col = "red", ylim = c(0, 0.25))

[Figure: histogram of the number of right turns per interval]

n <- length(X)
# negative log-likelihood of a Poisson(lambda) sample:
#   n*lambda - sum(X)*log(lambda) + sum(log(X!))
# minimizing this is equivalent to maximizing the log-likelihood
negloglike <- function(lam) { n * lam - sum(X) * log(lam) + sum(log(factorial(X))) }

#evaluate the function by providing a value for lambda
negloglike(0.3) 
## [1] 2583.046
# after defining the negative log-likelihood, we can perform the optimization as follows
# nlm() is R's non-linear minimization function
# in the output, $minimum is the minimized negative log-likelihood and $estimate is the MLE of the parameter (lambda, the mean, in this case)
out <- nlm(negloglike, p = c(0.5), hessian = TRUE)
out
## $minimum
## [1] 667.183
## 
## $estimate
## [1] 3.893331
## 
## $gradient
## [1] -2.569636e-05
## 
## $hessian
##          [,1]
## [1,] 77.03948
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10
# check the numerical estimate against the sample mean, which is the closed-form MLE of lambda for a Poisson sample (they should be very similar)
mean(X)
## [1] 3.893333
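Because hessian = TRUE was requested, the returned Hessian can also be used to obtain an approximate standard error for the estimate (a sketch relying on the usual large-sample approximation):

# the inverse Hessian of the negative log-likelihood approximates the variance of the MLE
sqrt(1 / out$hessian[1, 1])  # roughly 0.114, essentially sqrt(lambda_hat / n)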