Chapter 13 Elementary Statistics

Statistics is the practice of turning data into information to identify trends and understand features of populations.

13.1 Describing Raw Data

Raw data are the recorded observations themselves.
A variable is a characteristic of an individual in a population.

13.1.1 Numeric Variables

Numeric variables are observations recorded as numbers.
The two types are discrete and continuous.
Discrete numeric variables are observations that you can count.
Continuous numeric variables can take any of infinitely many values within a range.
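
As a rough illustration, R's storage types often mirror this distinction. The sketch below uses the built-in quakes data frame (which appears later in these notes), where station counts are stored as integers and magnitudes as doubles. Treat the storage type as a hint rather than a rule, since counts can also be stored as doubles.

```r
# Number of reporting stations: a count, so a discrete numeric variable
is.integer(quakes$stations)

# Magnitude: measured on a continuous scale
is.double(quakes$mag)

# How many distinct values each variable actually takes in this sample
length(unique(quakes$stations))
length(unique(quakes$mag))
```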

13.1.2 Categorical Variables

Categorical variables are also known as discrete or qualitative variables. Categorical variables can be further categorized as either nominal, ordinal or dichotomous.

Nominal variables are variables that have two or more categories, but which cannot be logically ranked. For example, a real estate agent could classify their types of property into distinct categories such as houses, condos, co-ops or bungalows. So “type of property” is a nominal variable with 4 categories called houses, condos, co-ops and bungalows. Of note, the different categories of a nominal variable can also be referred to as groups or levels of the nominal variable. Another example of a nominal variable would be classifying where people live in the USA by state. In this case there will be many more levels of the nominal variable (50 in fact).

Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either “male” or “female”. This is an example of a dichotomous variable (and also a nominal variable). Another example might be if we asked a person if they owned a mobile phone. Here, we may categorise mobile phone ownership as either “Yes” or “No”. In the real estate agent example, if type of property had been classified as either residential or commercial then “type of property” would be a dichotomous variable.

Ordinal variables are variables that can be ordered or ranked. They have two or more categories just like nominal variables. So if you asked someone if they liked the policies of the Democratic Party and they could answer either “Not very much”, “They are OK” or “Yes, a lot” then you have an ordinal variable. Why? Because you have 3 categories, namely “Not very much”, “They are OK” and “Yes, a lot” and you can rank them from the most positive (Yes, a lot), to the middle response (They are OK), to the least positive (Not very much). However, whilst we can rank the levels, we cannot place a “value” to them; we cannot say that “They are OK” is twice as positive as “Not very much” for example.
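
The distinction between nominal and ordinal variables can be sketched in R with factors. The vectors below are made-up illustrations based on the examples above (the names property and opinion are hypothetical): a plain factor for the nominal case and an ordered factor for the ordinal case.

```r
# Nominal: categories with no logical ranking
property <- factor(c("house", "condo", "co-op", "bungalow", "condo"))
nlevels(property)      # 4
is.ordered(property)   # FALSE

# Ordinal: the same kind of object, but with an explicit ranking of levels
opinion <- factor(c("They are OK", "Yes, a lot", "Not very much", "They are OK"),
                  levels = c("Not very much", "They are OK", "Yes, a lot"),
                  ordered = TRUE)
is.ordered(opinion)    # TRUE

# Ordered factors support rank comparisons, but not arithmetic
opinion[1] < opinion[2]   # TRUE: "They are OK" ranks below "Yes, a lot"

# Counts are reported in level order, not alphabetical order
table(opinion)
```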

# examine the dataframe chickwts
chickwts[1:5,]
# describe chickwts
(?chickwts)

Here is the output:
Description

An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens.

Usage

chickwts

Format

A data frame with 71 observations on the following 2 variables.

weight
a numeric variable giving the chick weight.

feed
a factor giving the feed type.

Details

Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.

# so we take a look at weight and feed
chickwts$weight
 [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213 257 244 271 243 230 248
[26] 327 329 250 193 271 316 267 199 171 158 248 423 340 392 339 341 226 320 295 334 322 297 318 325 257
[51] 303 315 380 153 263 242 206 344 258 368 390 379 260 404 318 352 359 216 222 283 332
chickwts$feed
 [1] horsebean horsebean horsebean horsebean horsebean horsebean horsebean horsebean horsebean horsebean
[11] linseed   linseed   linseed   linseed   linseed   linseed   linseed   linseed   linseed   linseed  
[21] linseed   linseed   soybean   soybean   soybean   soybean   soybean   soybean   soybean   soybean  
[31] soybean   soybean   soybean   soybean   soybean   soybean   sunflower sunflower sunflower sunflower
[41] sunflower sunflower sunflower sunflower sunflower sunflower sunflower sunflower meatmeal  meatmeal 
[51] meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  casein   
[61] casein    casein    casein    casein    casein    casein    casein    casein    casein    casein   
[71] casein   
Levels: casein horsebean linseed meatmeal soybean sunflower

From the values, weight is a numeric variable (continuous).
From the values, feed is a categorical variable (nominal).
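
Rather than judging by eye, the same conclusion can be checked programmatically. This is a small sketch using base R inspection functions on the built-in chickwts data frame.

```r
# weight is stored as a numeric vector
class(chickwts$weight)      # "numeric"

# feed is a factor, and it is not an ordered one, so it is nominal
is.factor(chickwts$feed)    # TRUE
is.ordered(chickwts$feed)   # FALSE
nlevels(chickwts$feed)      # 6

# str() summarises the type of every column at once
str(chickwts)
```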

13.1.3 Univariate and Multivariate Data

Univariate data relate to only ONE variable.
Multivariate data consist of several variables that are not meaningful on their own (e.g. GPS latitude and longitude).

quakes[1:5,]
plot(quakes$long,quakes$lat,xlab="longitude",ylab="Latitude",main="Wilson plot")

13.1.4 Parameter or Statistics?

A statistic and a parameter are very similar. They are both descriptions of groups, like “50% of dog owners prefer X Brand dog food.” The difference between a statistic and a parameter is that a statistic describes a sample, whereas a parameter describes an entire population.
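
The difference can be sketched in R. Purely as an assumption for illustration, the 1000 magnitudes in the built-in quakes data are treated here as if they were a whole population; the names population, mu, samp and samp.mean are hypothetical.

```r
# ASSUMPTION: pretend the recorded magnitudes are the entire population
population <- quakes$mag
mu <- mean(population)            # a parameter: describes the population

# Draw a random sample of 50 observations from that population
set.seed(1)                       # fixed seed so the sketch is repeatable
samp <- sample(population, size = 50)
samp.mean <- mean(samp)           # a statistic: describes the sample

# The statistic estimates the parameter but will rarely equal it exactly
c(parameter = mu, statistic = samp.mean)
```

Rerunning the sketch with a different seed gives a different statistic, while the parameter stays fixed.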

13.2 Summary Statistics

Now let's calculate some statistics with R.

13.2.1 Centrality: Mean, Median, Mode

Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.

The mode.
The mode is the measurement that has the greatest frequency, (the most common!). Although it isn’t used that much, it is useful when differences are rare or when the differences are non numerical. The prototypical example of something is usually the mode.

For the example grades listed in the mean calculation below, 2.8 and 3.2 are tied as modes: each grade occurs three times.

The median.
The median is the number at which half your measurements are more than that number and half are less than that number. The median is actually a better measure of centrality than the mean if your data are skewed, meaning lopsided.

If, for example, you have a dozen ordinary folks and one millionaire, the distribution of their wealth would be lopsided towards the ordinary people, and the millionaire would be an outlier, or highly deviant member of the group. The millionaire would influence the mean a great deal, making it seem like all the members of the group are doing quite well. The median would actually be closer to the mean of all the people other than the millionaire.
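
The effect is easy to reproduce. The wealth figures below are invented for illustration only: twelve ordinary values plus one millionaire.

```r
# HYPOTHETICAL data: twelve ordinary people and one millionaire
ordinary <- seq(40000, 62000, by = 2000)   # 12 values
wealth <- c(ordinary, 1e6)

mean(wealth)      # 124000: pulled far upwards by the single outlier
median(wealth)    # 52000: barely affected
mean(ordinary)    # 51000: the mean without the millionaire
```

The median of the whole group (52000) sits close to the mean of the ordinary people (51000), while the mean of the whole group does not.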

The median for our example is 3.0. Half the people scored lower, and half higher (and one exactly).

The mean.
The mean is just the average. It is the sum of all your measurements, divided by the number of measurements. This is the most used measure of central tendency, because of its mathematical qualities. It works best if the data is distributed very evenly across the range, or is distributed in the form of a normal or bell-shaped curve (see below). One interesting thing about the mean is that it represents the expected value if the distribution of measurements were random! Here is what the formula looks like:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

So 3.0 + 2.8 + 2.8 + 2.4 + 3.2 + 2.8 + 1.8 + 3.8 + 2.6 + 3.4 + 2.4 + 4.0 + 3.4 + 3.2 + 3.2 is 44.8. Divide that by 15 and that is the mean or average for our example: about 2.99.

xdata<-c(2,4.4,3,3,2,2.2,2,4)
xdata
[1] 2.0 4.4 3.0 3.0 2.0 2.2 2.0 4.0

Let’s compute the stats:

x.bar<-mean(xdata)
x.bar
[1] 2.825

Compute the median:

m.bar<-median(xdata)
m.bar
[1] 2.6

To find the mode in R, use the table function, which gives you the frequencies you need:

xtab<-table(xdata)
xtab
xdata
  2 2.2   3   4 4.4 
  3   1   2   1   1 
# now find the min and the max
min(xdata)
[1] 2
max(xdata)
[1] 4.4
# to get both in one go
range(xdata)
[1] 2.0 4.4
# when max is applied to a table it will output the frequencies
max(xtab)
[1] 3

The value 2 was seen 3 times. Finally, you can construct a logical flag vector to get the mode from the table.

d.bar <-xtab[xtab==max(xtab)]
d.bar
2 
3 

The first number is the mode, and the second number is its frequency. Let’s look at the quakes dataset:

Qtab<-table(quakes$mag)
Qtab[Qtab==max(Qtab)]
4.5 
107 

This gives us the magnitude (4.5) that occurred the most (107 times).
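
The table-and-flag approach can be wrapped in a small helper. Base R has no built-in function for the statistical mode, so find.modes below is a hypothetical name for a sketch, not an existing function. It returns every value tied for the highest frequency.

```r
# HYPOTHETICAL helper: return all values tied for the highest frequency
find.modes <- function(x) {
  tab <- table(x)
  # names() of a table are character strings, even for numeric data
  names(tab)[tab == max(tab)]
}

find.modes(c(2, 4.4, 3, 3, 2, 2.2, 2, 4))   # "2"
find.modes(c("a", "b", "b", "a", "c"))      # "a" "b" (a tie)

# Convert back to numbers when the input was numeric
as.numeric(find.modes(quakes$mag))          # 4.5
```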

# when computing the mean with NA or NaN values, use na.rm=TRUE
test<-c(1,4,6,NA)
test1<-c(1,4,6,NaN)
mean(test)
[1] NA
mean(test1)
[1] NaN
# use na.rm=TRUE
mean(test,na.rm=TRUE)
[1] 3.666667
mean(test1,na.rm=TRUE)
[1] 3.666667
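
Before dropping missing values it is often worth counting them. The sketch below reuses the same two vectors and shows what na.rm=TRUE does behind the scenes.

```r
test <- c(1, 4, 6, NA)
test1 <- c(1, 4, 6, NaN)

# Count the missing entries; is.na() is TRUE for both NA and NaN
sum(is.na(test))    # 1
sum(is.na(test1))   # 1

# is.nan() is stricter: it is TRUE only for NaN
is.nan(NA)          # FALSE
is.nan(NaN)         # TRUE

# na.rm=TRUE is equivalent to removing the flagged entries first
mean(test[!is.na(test)])      # 3.666667
median(test, na.rm = TRUE)    # 4
```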

The benefit of [tapply]: suppose you wanted to compute summary statistics for each group.

Instead of doing this:

mean(chickwts$weight[chickwts$feed=="casein"])
[1] 323.5833
mean(chickwts$weight[chickwts$feed=="horsebean"])
[1] 160.2
mean(chickwts$weight[chickwts$feed=="linseed"])
[1] 218.75
mean(chickwts$weight[chickwts$feed=="meatmeal"])
[1] 276.9091
mean(chickwts$weight[chickwts$feed=="soybean"])
[1] 246.4286
mean(chickwts$weight[chickwts$feed=="sunflower"])
[1] 328.9167
  

You could use [tapply]:

tapply(chickwts$weight, INDEX=chickwts$feed,FUN=mean)
   casein horsebean   linseed  meatmeal   soybean sunflower 
 323.5833  160.2000  218.7500  276.9091  246.4286  328.9167 

Easier!
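
For comparison, here is a sketch of two other base R routes to the same group means, plus tapply with a different summary function. Which one to prefer is largely a matter of the output shape you want.

```r
# split() cuts the weights into one vector per feed; sapply() summarises each
group.means <- sapply(split(chickwts$weight, chickwts$feed), mean)
group.means

# aggregate() with a formula returns a data frame instead of a named vector
aggregate(weight ~ feed, data = chickwts, FUN = mean)

# tapply works with any summary function, e.g. the median
tapply(chickwts$weight, INDEX = chickwts$feed, FUN = median)
```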

13.2.2 Counts, Percentages, and Proportions

It is useful to know the number of observations of categorical variables.

# find the number of each type of feeds
table(chickwts$feed)

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12 
# then to get the proportions
table(chickwts$feed)/nrow(chickwts)

   casein horsebean   linseed  meatmeal   soybean sunflower 
0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141 

You can also use the fact that TRUE counts as 1 and FALSE counts as 0.

sum(chickwts$feed=="soybean")/nrow(chickwts)
mean(chickwts$feed=="soybean")

Then we can get into more complicated cases, like what proportion of chicks were fed soybean or horsebean?

mean(chickwts$feed=="soybean"|chickwts$feed=="horsebean")
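
When several categories are involved, chaining comparisons with | gets long. A sketch of the same calculation using the %in% operator:

```r
# TRUE for every chick whose feed is one of the listed categories
either <- chickwts$feed %in% c("soybean", "horsebean")

sum(either)    # 24 chicks (14 soybean + 10 horsebean)
mean(either)   # 24/71, about 0.338
```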

You can again use [tapply]:

tapply(chickwts$weight,INDEX=chickwts$feed,FUN=function(x) length(x)/nrow(chickwts))
   casein horsebean   linseed  meatmeal   soybean sunflower 
0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141 

You can use the [round] function to round the numbers:


round(table(chickwts$feed)/nrow(chickwts),digits = 3)
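
Base R also provides prop.table, which turns a table of counts into proportions without dividing by nrow by hand. Multiplying by 100 gives the percentages named in this section's title.

```r
feed.counts <- table(chickwts$feed)

# Proportions: identical to table(chickwts$feed)/nrow(chickwts)
feed.props <- prop.table(feed.counts)
feed.props

# Percentages, rounded to one decimal place
round(100 * feed.props, digits = 1)

# Proportions always sum to 1
sum(feed.props)
```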

13.2.3 Quartiles, Percentiles, and the 5 number summary

A quantile is a value computed from a collection of numeric measurements that indicates an observation’s rank when compared to all the other present observations.

Quantiles divide the ordered observations of a sample into groups of set proportions.
So the median is a quantile, as it divides the sample into two halves (it is the 0.5th quantile).
A quantile can also be expressed as a percentile: the p-th quantile is equivalent to the (100 x p)-th percentile.

# review the contents of xdata
xdata
[1] 2.0 4.4 3.0 3.0 2.0 2.2 2.0 4.0
# get the quantile
quantile(xdata,probs=0.8)
80% 
3.6 
quantile(xdata,probs=c(0,0.25,.5,.75,1))
  0%  25%  50%  75% 100% 
2.00 2.00 2.60 3.25 4.40 

The above is called the ‘five number summary’ of xdata.

summary(xdata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   2.600   2.825   3.250   4.400 
# compute the lower and upper quartiles of the weights of the chicks in the chickwts.
quantile(chickwts$weight, probs=c(0.25,.75))
  25%   75% 
204.5 323.5 
# compute using quake data
summary(quakes$mag[quakes$depth<400])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.00    4.40    4.60    4.67    4.90    6.40 
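
It may help to see where a value such as the 80% quantile of 3.6 comes from. By default, quantile interpolates linearly between the sorted observations (its type 7 method). The function manual.quantile below is a hypothetical name for a sketch of that default for a single probability p; it is not a replacement for quantile.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

# HYPOTHETICAL helper: linear interpolation between order statistics
manual.quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1        # fractional position in the sorted data
  lo <- floor(h)              # observation just below that position
  hi <- min(lo + 1, n)        # observation just above (capped at n)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

manual.quantile(xdata, 0.8)    # 3.6: position 6.6, between 3 and 4
manual.quantile(xdata, 0.75)   # 3.25
manual.quantile(xdata, 0.5)    # 2.6, the median
```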

13.2.4 Spread: Variance, Standard Deviation, and Interquartile Range

Spread measures how dispersed your data are.

xdata
[1] 2.0 4.4 3.0 3.0 2.0 2.2 2.0 4.0
ydata<-c(1,4.4,1,3,2,2.2,2,7)
ydata
[1] 1.0 4.4 1.0 3.0 2.0 2.2 2.0 7.0
mean(xdata)
[1] 2.825
mean(ydata)
[1] 2.825

Notice that both xdata and ydata have the same mean. Let us visualize them by plotting:

plot(xdata,type="n", xlab="", ylab="data vector", yaxt="n",bty="n")
abline(h=c(3,3.5),lty=2, col="gray")
abline(v=2.825,lwd=2,lty=3)
text(c(0.8,0.8),c(3,3.5),labels=c("x","y"))
points(jitter(c(xdata,ydata)),c(rep(3,length(xdata)),rep(3.5,length(ydata))))

The plot compares two hypothetical data vectors that share an identical mean.

Variance measures the degree of spread of numeric observations around their arithmetic mean.
Standard deviation is the square root of the variance. Interquartile range (IQR) measures the width of the middle 50 percent of the data, that is, the data lying within 25 percent of the observations on either side of the median. As such it is the difference between the upper and lower quartiles of your data.

var(xdata)
[1] 0.9078571
sd(xdata)
[1] 0.9528154
IQR(xdata)
[1] 1.25
# confirm relationship between variance and standard deviation
sqrt(var(xdata))
[1] 0.9528154
as.numeric(quantile(xdata,0.75)-quantile(xdata,0.25))
[1] 1.25

The square root of the variance, 0.9528154, matches the standard deviation.
The difference between the upper and lower quartiles, 1.25, matches the IQR.
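
The variance itself can be reproduced from its definition. This sketch assumes the sample variance, which divides the sum of squared deviations by n - 1; that is the version var computes.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
n <- length(xdata)

# Squared distance of each observation from the mean
sq.dev <- (xdata - mean(xdata))^2

# Sample variance: sum of squared deviations divided by n - 1
manual.var <- sum(sq.dev) / (n - 1)
manual.var          # 0.9078571

# Standard deviation: back on the original scale of the data
sqrt(manual.var)    # 0.9528154
```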

Now compute the same measures for ydata:

sd(ydata)
[1] 2.012639
IQR(ydata)
[1] 1.6
var(ydata)
[1] 4.050714

These confirm that the ydata values are more spread out around the mean (the vertical dotted line in the plot above).

# what is the standard deviation of chickwts?
sd(chickwts$weight)
[1] 78.0737

Roughly speaking, this means that a typical chick's weight lies about 78 grams away from the mean weight.
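
Reading the standard deviation as a "typical distance from the mean" is an approximation. The literal average distance is the mean absolute deviation, which the sketch below computes for comparison; the standard deviation is never smaller, because squaring gives more weight to large deviations. The names w and mean.abs.dev are hypothetical.

```r
w <- chickwts$weight

# Literal average distance of a chick's weight from the mean weight
mean.abs.dev <- mean(abs(w - mean(w)))
mean.abs.dev

# The standard deviation weights large deviations more heavily
sd(w)

# Spread can also be compared across feed groups
tapply(chickwts$weight, INDEX = chickwts$feed, FUN = sd)
```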

13.2.5 Covariance and Correlation

Use correlation to analyse the relationship between two numeric variables to assess trends.
Covariance expresses how much two numeric variables “change together”.
Positive results mean that they are positively related (the figures move in the same direction).
Negative results mean that they are negatively related (the figures move in opposite directions).

The relationship grows weaker the nearer the result is to zero. Correlation allows you to interpret the covariance further by identifying both the direction and the strength of any association.

# recall xdata and ydata
xdata
[1] 2.0 4.4 3.0 3.0 2.0 2.2 2.0 4.0
ydata
[1] 1.0 4.4 1.0 3.0 2.0 2.2 2.0 7.0
cov(xdata,ydata)
[1] 1.479286
cov(xdata,ydata)/(sd(xdata)*sd(ydata))
[1] 0.7713962
cor(xdata,ydata)
[1] 0.7713962
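
The covariance can be rebuilt from its definition in the same way as the variance. The sketch below assumes the sample version (dividing by n - 1) and also shows why correlation is easier to interpret: rescaling one variable changes the covariance but leaves the correlation untouched.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)
n <- length(xdata)

# Sample covariance: average product of paired deviations from the means
manual.cov <- sum((xdata - mean(xdata)) * (ydata - mean(ydata))) / (n - 1)
manual.cov                      # 1.479286

# Changing the units of x by a factor of 10 scales the covariance by 10...
cov(10 * xdata, ydata)          # 14.79286

# ...but the correlation is unit-free and always lies between -1 and 1
cor(10 * xdata, ydata)          # 0.7713962
```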

Plotting…

plot(xdata,ydata,pch=13,cex=1.5)

Looking at the quakes dataset and plotting magnitude against the number of stations:

plot(quakes$mag,quakes$stations,xlab="magnitude",ylab="No of stations")

cov(quakes$mag,quakes$stations)
[1] 7.508181
cor(quakes$mag,quakes$stations)
[1] 0.8511824

Remember: correlation allows you to measure association, not causation.

13.2.6 Outliers

Outliers are observations that do not appear to fit with the rest of the data. They usually present themselves as extreme values.

# example of outliers by plotting
foo<-c(0.6,-0.6,.1,-0.2,-1.0,0.4,.3,-1.8,1.1,6.0)
foo
 [1]  0.6 -0.6  0.1 -0.2 -1.0  0.4  0.3 -1.8  1.1  6.0
plot(foo,rep(0,10),yaxt="n",ylab="",bty="n",cex=2,cex.axis=1.5,cex.lab=1.5)
abline(h=0,col="gray",lty=2)
arrows(5,0.5,5.9,0.1,lwd=2)
text(5,0.7,labels="outlier?",cex=3)

# bivariate example
bar<-c(0.1,.3,1.3,.6,.2,-1.7,.8,.9,-.8,-1)
baz<-c(-0.3,.9,2.8,2.3,1.2,-4.1,-.4,4.1,-2.3,-100)
plot(bar,baz,axes=TRUE,cex=2,cex.axis=1.5,cex.lab=1.5)
arrows(-.5,-80,-.94,-97,lwd=2)
text(-0.45,-74,labels="outlier?",cex=3)
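
Plots suggest candidates by eye. One common convention, used here as an assumption and not as a definitive test, flags any value lying more than 1.5 times the IQR below the lower quartile or above the upper quartile. A sketch applied to foo:

```r
foo <- c(0.6, -0.6, 0.1, -0.2, -1.0, 0.4, 0.3, -1.8, 1.1, 6.0)

# ASSUMED rule of thumb: fences at 1.5 * IQR beyond the quartiles
q <- quantile(foo, probs = c(0.25, 0.75))
lower.fence <- q[1] - 1.5 * IQR(foo)
upper.fence <- q[2] + 1.5 * IQR(foo)

# Values outside the fences are flagged as possible outliers
foo[foo < lower.fence | foo > upper.fence]   # 6

# A single extreme value can move the mean noticeably
mean(foo)              # 0.49
mean(foo[foo != 6])    # about -0.12
```

Whether a flagged value should be removed is a judgment call that depends on how the data were collected.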

