Estimating Parameters from data - sample size, central tendency, variation, skew, kurtosis

SAMPLE SIZE - denoted by ‘n’. This is an important factor when sampling *** finish later

CENTRAL TENDENCY

  1. MODE - shows the most frequently represented class of data. Distributions can be multi-modal, so not the best estimate of central tendency.
# rm(x) # removes any existing vectors by the name x
distribution = read.table("C:\\temp\\mode.txt", header = TRUE)
attach(distribution)
names(distribution)
## [1] "fx" "fy" "fz" "x"

To understand distribution of fx and fy, we will not plot them on a histogram.

par(mfrow=c(1,2))
barplot(fx,names=as.character(x))
barplot(fy,names=as.character(x))

2. MEDIAN - shows the middle value from a ranked list. Not sensitive to outliers. In our example, each value of x is repeated fx number of times. To find and plot the median, we need to first create the actual list and then rank it.

y = rep(x,fx) # create actual list of values to find median for
y = sort(y) # rank this list
length(y) # find the total number of data points in this list
## [1] 33

This shows that the list contains an odd no. of elements (n=33). Therefore, median will be the (n/2)th or 17th element. The same can be calculated using R as follows:

y[ceiling(length(y)/2)] #celiling returns the integer part of our answer
## [1] 14

However, one can avoid the above code and use R’s built-in function.

median(y)
## [1] 14

3a. ARITHEMATIC MEAN

mean(y)
## [1] 13.78788

3b. GEOMETRIC MEAN - used to average variables that change exponentially/multiplicatively.

prod(y)^(1/length(y))       # approach 1 - nth root of the product of y values
## [1] 13.71531
exp(sum(log(y))/length(y))  # approach 2 - using logarithm ("log of times is plus")
## [1] 13.71531

3c. HARMONIC MEAN - Used for averaging variable with a “rate” characteristic.

1/mean(1/y)
## [1] 13.64209
  1. SKEW - Long tail to the left is negative skew. Long tail to the right is positive skew.
# par(mfrow=c(1,1))
barplot(fz,names.arg = as.character(x))

Above bar plot shows a long-tail on the right side, therefore the distribution of fz is a positively skewed. Now, we will create the distribution x,fz (x contains labels, fz contains their frequencies) and then produce a summary of all measures of central tendency for it.

w = rep(x,fz)
AM = mean(w)
AM
## [1] 12.60606
GM = prod(w)^(1/(length(w)))
GM
## [1] 12.4038
HM = 1/mean(1/w)
HM
## [1] 12.22037
Median = median(w)
Median
## [1] 12
# Modes are at 11 and 16, with x=11 showing the largest mode.

Therefore, for above positively skewed, bi-modal distribution: AM>GM>HM>Median>Mode

VARIATION Here we will discuss the different ways to measure variation in a variable.

y=c(13,7,5,12,9,15,6,11,9,7,12)
plot(y,ylim = c(0,20))

1. Range - shows the min and max values of a variable

range(y)
## [1]  5 15

Range is highly impacted by outliers and the uncertainty it represents increases with increasing sample size. Therfore, we will try to find a more reliable measure of variability. One way is to fit the average value/mean of y through the data and measure how far each value is from the mean.

plot(y,ylim = c(0,20))
abline(mean(y),0)

Important concepts - sum of squares, mean square, degrees of freedom (n-p, where n is sample size, p is no. of parameters estimated from data)

  1. VARIANCE = (sum of squares/degrees of freedom)

Working on the ozone levels in Gardens A,B,C example:

A=c(3,4,4,3,2,3,1,3,5,2)
B=c(5,5,6,7,4,4,3,5,6,5)
C=c(3,3,2,1,10,4,3,11,3,10)

mean(A)
## [1] 3
mean(B)
## [1] 5
mean(C)
## [1] 5

Now, we have 3 samples and means of each. Now, we calculate vectors containing differences y-mean(y):

dA = A-3
dB = B-5
dC = C-5

Next, sum of squares of these differences:

SSA = sum(dA^2)
SSB = sum(dB^2)
SSC = sum(dC^2)

Obtaining variances by dividing each sum of squares by deg of freedom:

s2A = SSA/(length(A)-1)
s2B = SSB/(length(B)-1)
s2C = SSC/(length(C)-1)
s2A
## [1] 1.333333
s2B
## [1] 1.333333
s2C
## [1] 14.22222

Calculating variance using built-in function:

S2A = var(A)
S2B = var(B)
S2C = var(C)
S2A
## [1] 1.333333
S2B
## [1] 1.333333
S2C
## [1] 14.22222

Above example shows that: - 2 samples with the same mean can have very different variances - 2 samples with the same variance can have very different means Therefore, its important to check “variance ratio” for 2 samples as it should not be >4 (large variance/small variance.

When to use variance? - for establishing unreliability (Standard error = sqrt(variance/sample size)) - for testing hypotheses

sqrt(var(A)/length(A))
## [1] 0.3651484
sqrt(var(B)/length(B))
## [1] 0.3651484
sqrt(var(C)/length(C))
## [1] 1.19257

Constructing a Confidence Interval = Multiply the z or t-value with standard error

# For B, alpha=95%, deg of freedom = n-1 = 9
qt(0.025,9)*sqrt(var(B)/length(B))
## [1] -0.826023

Therefore, we can say that the average ozone conc in Garden B was Mean+-CI, i.e., 5+-0.826 (95% C.I.,n=10)

QUANTILES Creating a std normal dist of 1000 random numbers

z = rnorm(1000)
mean(z)
## [1] 0.003637856
quantile(z,c(0.025,0.975))
##      2.5%     97.5% 
## -2.146163  1.990625

For a std normal dist, 2.5% of the values should be < -1.96 and 97.5% of the values <+1.96. Next trying with a larger sample size:

z = rnorm(10000)
mean(z)
## [1] 0.008095526
quantile(z,c(0.025,0.975))
##      2.5%     97.5% 
## -1.926039  1.971935

Results are much more closer to what we expected with a larger sample size.

ROBUST ESTIMATOR - MEAN ABSOLUTE DEVIATION

Below we compare MAD vs SD on robustness

p = c(3,4,6,4,5,2,4,5,1,5,4,6)
mad(p)
## [1] 1.4826
sd(p)
## [1] 1.505042
mean(p)
## [1] 4.083333

Adding an outlier to our list

p1 = c(p,100)
mad(p1)
## [1] 1.4826
sd(p1)
## [1] 26.64149
mean(p1)
## [1] 11.46154
sd(p1)/mad(p1)   # this calculates the ratio of b/w sd and mad for our updated sample
## [1] 17.96944

Therefore, if we allow a window of upto 4x differences in sd and MAD, anything outside can be defined as an outlier. Next, creating a function to check for outliers:

outlier = function(x) {
  if(sd(x)>4*mad(x)) print("Outliers present")
  else print("Deviation is reasonable")
}

outlier(p)
## [1] "Deviation is reasonable"
outlier(p1)
## [1] "Outliers present"
SINGLE SAMPLE ESTIMATION
r light=read.csv("C:\\temp\\Course Data\\light.csv", header=TRUE) names(light)
## [1] "speed"
r summary(light)
## speed ## Min. : 650 ## 1st Qu.: 850 ## Median : 940 ## Mean : 909 ## 3rd Qu.: 980 ## Max. :1070
r hist(light$speed)
Above histogram shows a negatively skewed distribution for “speed”. The same is evident from the values of median and mean - median is bigger than the mean here.
INTERQUARTILE RANGE = difference b/w 1st and 3rd quartile = 980-850 = 130. This is a great measure for testing outliers (=any value > 1.5x of the interquartile range above the 3rd quartile or below the 1st quartile).
“Non-normality, outliers and serial correlation can all invalidate inferences made by standard parametric tests like Student’s t-test. Much better in such cases to use a robust non-parametric technique like Wilcoxon’s signed-rank test.”
INFERENCE IN THE 1-SAMPLE CASE We want to test the hypothesis that Michelson’s estimate of the speed of light is significantly different from the value of 299,990 thought to prevail at the time.
r #library(ctest) wilcox.test(light$speed,mu=990)
## Warning in wilcox.test.default(light$speed, mu = 990): cannot compute exact p- ## value with ties
## ## Wilcoxon signed rank test with continuity correction ## ## data: light$speed ## V = 22.5, p-value = 0.00213 ## alternative hypothesis: true location is not equal to 990 Result: Accept alternate hypothesis - true mu != 990
For demonstration purposes, using the t-test on above problem:
r t.test(light$speed,mu=990)
## ## One Sample t-test ## ## data: light$speed ## t = -3.4524, df = 19, p-value = 0.002669 ## alternative hypothesis: true mean is not equal to 990 ## 95 percent confidence interval: ## 859.8931 958.1069 ## sample estimates: ## mean of x ## 909 The above results show that the hypothesized value (mu=990) is far outside the 95% CI. The sample mean is highly significantly different from the hypothesized mean.
Applying 2-sample t-test on the “ozone levels in 3 gardens” problem: Hnull = the mean ozone concs in Gardens A and B are the same (Diff in means = 0) Halt = the mean ozone concs in Gardens A and B are not the same (Diff in means != 0)
r qt(0.975,18)
## [1] 2.100922 Above is the critical value of student’s t (boundary established by setting a 95% CI net on a 2-tailed test)
r t.test(A,B)
## ## Welch Two Sample t-test ## ## data: A and B ## t = -3.873, df = 18, p-value = 0.001115 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -3.0849115 -0.9150885 ## sample estimates: ## mean of x mean of y ## 3 5 Since the value of the test statistic is much larger than the critical t-value (sign of the test statistic does not matter in t-test), therefore we reject Hnull. Another way to look at it is that the CI of the test statistic does not include a “zero”, therefore the probability of the diff of means being zero is very low.

WILCOX RANK SUM TEST - an alternative to the t-test when the standard errors are not normal (which implies that the variable distribution is not normal).

Looking at the standard errors in the ozone-garden example:

par(mfrow=c(1,2))
hist(B,breaks=c(0.5:11.5))
hist(C,breaks=c(0.5:11.5))

The errors in B look normal but that’s not the case with C. Performing the Wilcox test on this:

wilcox.test(B,C)
## Warning in wilcox.test.default(B, C): cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  B and C
## W = 66, p-value = 0.2349
## alternative hypothesis: true location shift is not equal to 0

Since p-value is large enough (>0.05), we accept the null hypothesis, i.e., mean ozone concs in Gardens B and C are the same. The last statement in the result above is only meant to display the alt hyp (just for clarity) and is not the final result being expressed by the function.

TEST ON PAIRED SAMPLES - correlation expected between the 2 samples. Not effective when the correl is negative though.
Below data are a composite biodiversity score based on a kick sample of aquatic invertebrates.The elements of x and y are paired because the 2 samples were taken on the same river, upstream (y) or downstream (x) of a sewage outfall.
```r x=c(20,15,10,5,20,15,10,5,20,15,10,5,20,15,10,5) y=c(23,16,10,4,22,15,12,7,21,16,11,5,22,14,10,6)
t.test(x,y) # 2-sample t-test ```
## ## Welch Two Sample t-test ## ## data: x and y ## t = -0.40876, df = 29.755, p-value = 0.6856 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -5.248256 3.498256 ## sample estimates: ## mean of x mean of y ## 12.500 13.375
r t.test(x,y, paired = T) # paired t-test
## ## Paired t-test ## ## data: x and y ## t = -3.0502, df = 15, p-value = 0.0081 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -1.4864388 -0.2635612 ## sample estimates: ## mean of the differences ## -0.875
According to the first test, there is no difference b/w means since p-value is >0.05. However, when we feed the function the information of the samples being paired, the results are completely different. Test 2 shows that the means of the 2 samples are significantly different since p-value < 0.05.

CENTRAL LIMIT THEOREM

Generating 1000 random numbers from a negative binomial distribution with 2 parameters ‘size’ = 1, ‘probability’ = 0.2

par(mfrow=c(1,2))
y=rnbinom(1000,1,.2)
mean(y)
## [1] 4.079
var(y)
## [1] 19.61437
table(y)
## y
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
## 177 161 128 104  94  75  62  35  36  25  22  10  11  10  14   7   8   5   2   2 
##  20  21  23  25  26  28  29  30 
##   2   3   1   2   1   1   1   1
hist(y,breaks=-0.5:38.5)

Creating a normal distribution of the mean values:

my = numeric(1000)
for (i in 1:1000) {
  y=rnbinom(30, 1, 0.2)
  my[i]=mean(y)}
hist(my)

par(mfrow=c(1,1))