In this document, we introduce and work with measures of central tendency, dispersion, skewness and kurtosis for univariate data. We then explore the basics of plotting numerical and categorical data.
We also learn a quick way of defining and coding functions.
As usual, we do this using various examples to increase our fluency with R.
List of functions we work with:
Central Tendency Measures: mean(), median(), quantile()
Measures of Dispersion: var(), sd(), IQR(); developing functions for skewness and kurtosis calculations
Viewing numerical data using histograms, density plots, boxplots, and quantile plots
Viewing categorical data using bar charts
The Michelson dataset from the HistData package records the 100 measurements by Michelson of the speed of light (in the variable velocity).
Examine the measures of central tendency for this data, particularly the mean and median. By comparing the mean and median, do you think the data is generally right or left skewed?
Write a function to calculate the skewness of the data.
Examine the measures of dispersion, particularly the variance, standard deviation, and IQR.
Write a function to calculate the kurtosis of the data. What do you think of it, compared to a standardized normal distribution?
Identify the 10th, 25th, 50th, 75th, and 90th quantiles of the data.
Write a function to calculate the 90th quantile of the data. Is the result equivalent to what you found in the previous section using the quantile() function?
Check the boxplot of the data.
Examine the histogram plot of the data and play with its arguments.
Determine the number of observations in each bin of the histogram, for a histogram with bins of width 50 each.
Generate a density plot.
Examine the data quantiles against the quantiles of normal data using qqnorm() plots.
Before we do anything, we will install and load the package HistData. We then call the Michelson dataset.
#install.packages("HistData")
library(HistData)
Michelson
## velocity
## 1 850
## 2 740
## 3 900
## 4 1070
## 5 930
## 6 850
## 7 950
## 8 980
## 9 980
## 10 880
## 11 1000
## 12 980
## 13 930
## 14 650
## 15 760
## 16 810
## 17 1000
## 18 1000
## 19 960
## 20 960
## 21 960
## 22 940
## 23 960
## 24 940
## 25 880
## 26 800
## 27 850
## 28 880
## 29 900
## 30 840
## 31 830
## 32 790
## 33 810
## 34 880
## 35 880
## 36 830
## 37 800
## 38 790
## 39 760
## 40 800
## 41 880
## 42 880
## 43 880
## 44 860
## 45 720
## 46 720
## 47 620
## 48 860
## 49 970
## 50 950
## 51 880
## 52 910
## 53 850
## 54 870
## 55 840
## 56 840
## 57 850
## 58 840
## 59 840
## 60 840
## 61 890
## 62 810
## 63 810
## 64 820
## 65 800
## 66 770
## 67 760
## 68 740
## 69 750
## 70 760
## 71 910
## 72 920
## 73 890
## 74 860
## 75 880
## 76 720
## 77 840
## 78 850
## 79 850
## 80 780
## 81 890
## 82 840
## 83 780
## 84 810
## 85 760
## 86 810
## 87 790
## 88 810
## 89 820
## 90 850
## 91 870
## 92 870
## 93 810
## 94 740
## 95 810
## 96 940
## 97 950
## 98 800
## 99 810
## 100 870
The above code prints out the dataset Michelson from the HistData package.
Now, with the $ sign we can access a specific variable of a dataset.
mean(Michelson$velocity)
## [1] 852.4
median(Michelson$velocity)
## [1] 850
Given that the mean and median are very close, the data should be roughly symmetrical in shape. In clearly right-skewed (positively skewed) data, the median is smaller than the mean: the mean is easily influenced by extreme observations and moves towards them, so in clearly right-skewed data the mean is pulled to the right. The opposite is true for left-skewed data.
Here the mean and median are very close, and thus it is hard to tell whether the data is left or right skewed, even though the mean is slightly bigger than the median.
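The function below implements one common definition of sample skewness, the average cubed z-score (several variants with small-sample corrections exist; this is the simplest one):

$$\text{skewness} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^3$$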
skewness = function(x) {
  z_score = (x - mean(x)) / sd(x) # standardize the observations
  zscore_cubic = z_score^3        # cube the z-scores
  sum(zscore_cubic) / length(x)   # return their average
}
skewness(Michelson$velocity)
## [1] -0.01798641
As can be seen, the measure is very small and negative. We cannot rely on it much to claim the distribution is left skewed, because the mean and the median are in fact extremely close.
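As a quick sanity check on simulated data (a sketch; the exact values vary with the seed), the function should return values near 0 for a symmetric sample and clearly positive values for a right-skewed one:

set.seed(1)
skewness(rnorm(1e4)) # approximately 0 for a symmetric normal sample
skewness(rexp(1e4))  # clearly positive for a right-skewed exponential sample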
var(Michelson$velocity)
## [1] 6242.667
sd(Michelson$velocity)
## [1] 79.01055
IQR(Michelson$velocity)
## [1] 85
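These measures are related: the standard deviation is the square root of the variance, and the IQR is the difference between the 0.75 and 0.25 quantiles. A minimal check:

sqrt(var(Michelson$velocity))                     # equals sd()
diff(quantile(Michelson$velocity, c(0.25, 0.75))) # equals IQR()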
kurtosis = function(x) {
  z_score = (x - mean(x)) / sd(x) # standardize the observations
  zscore_fourth = z_score^4       # fourth power of the z-scores
  out = sum(zscore_fourth) / length(x)
  out - 3                         # subtract 3 to obtain excess kurtosis
}
kurtosis(Michelson$velocity)
## [1] 0.1985863
This distribution has a positive excess kurtosis compared to a standardized normal distribution, whose excess kurtosis is ZERO (a normal distribution has a raw kurtosis of 3; our function subtracts 3, which is why we compare against zero). Notice that because we are working with z-scores, the result is automatically comparable with the kurtosis of a standardized normal distribution.
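The kurtosis() function above implements excess kurtosis, the average fourth-power z-score minus 3:

$$\text{excess kurtosis} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^4 - 3$$

As a quick sanity check on simulated data (values vary slightly with the seed), normal samples should give values near 0, while heavy-tailed samples give clearly positive values:

set.seed(1)
kurtosis(rnorm(1e4))      # near 0 for normal data
kurtosis(rt(1e4, df = 5)) # positive for heavy-tailed t-distributed data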
quantile(Michelson$velocity, c(0.1, 0.2, 0.25, 0.5, 0.75, 0.90))
## 10% 20% 25% 50% 75% 90%
## 760.0 798.0 807.5 850.0 892.5 960.0
Indeed the 0.5 quantile is the median, as we found earlier. To interpret, one can say, for example, that 90% of the data are smaller than 960 (which is the 0.9 quantile).
ninetieth_percentile = function(x) {
  n = length(x)
  sorted_vec = sort(x, decreasing = FALSE) # sort in ascending order
  sorted_vec[0.9 * n]                      # the 90th of the 100 sorted values
}
ninetieth_percentile(Michelson$velocity)
## [1] 960
Our function returns exactly the same result as quantile(x, 0.9) does here. Note, however, that quantile() interpolates between order statistics by default, so the two approaches need not agree for every data set.
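A hypothetical generalization (not required here) takes the probability p as an argument; ceiling() handles cases where p times the sample size is not a whole number:

nth_percentile = function(x, p) {
  sorted_vec = sort(x)
  sorted_vec[ceiling(p * length(x))]
}
nth_percentile(Michelson$velocity, 0.9) # returns 960, as above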
boxplot(Michelson$velocity, horizontal = TRUE)
hist(Michelson$velocity) # a simple histogram of data
hist(Michelson$velocity, probability = T)
The latter code turns the histogram into a probability (density) histogram whose total area sums to 1.
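We can verify that the total area is indeed 1 by using the object that hist() returns, which always carries a density component (a minimal sketch):

h = hist(Michelson$velocity, plot = FALSE)
sum(h$density * diff(h$breaks)) # total area of the probability histogram; equals 1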
Now we can determine the bin size of the histogram ourselves, using the breaks argument.
bin = seq(600, 1200, by=200)
bin
## [1] 600 800 1000 1200
hist(Michelson$velocity, breaks = bin, probability = F)
The above is a very blocky histogram, because the bins are very large and therefore few.
Remember that the use of the hist() function in the first part of this answer generated bins of width 50. Now we use a width of 5 to make a histogram that seems too spiky, for having too many bins.
bin= seq(600, 1200, by = 5)
bin
## [1] 600 605 610 615 620 625 630 635 640 645 650 655 660 665
## [15] 670 675 680 685 690 695 700 705 710 715 720 725 730 735
## [29] 740 745 750 755 760 765 770 775 780 785 790 795 800 805
## [43] 810 815 820 825 830 835 840 845 850 855 860 865 870 875
## [57] 880 885 890 895 900 905 910 915 920 925 930 935 940 945
## [71] 950 955 960 965 970 975 980 985 990 995 1000 1005 1010 1015
## [85] 1020 1025 1030 1035 1040 1045 1050 1055 1060 1065 1070 1075 1080 1085
## [99] 1090 1095 1100 1105 1110 1115 1120 1125 1130 1135 1140 1145 1150 1155
## [113] 1160 1165 1170 1175 1180 1185 1190 1195 1200
hist(Michelson$velocity, breaks = bin)
Adding axis labels using xlab and ylab:
hist(Michelson$velocity, xlab = "Velocity", ylab = "Observed Frequency")
Making a probability histogram using the freq argument:
hist(Michelson$velocity, freq= F, xlab = "Velocity", ylab = "Observed Frequency")
Setting freq = F is equivalent to setting the probability argument to T.
The plot argument of the hist() function, if set to FALSE, suppresses the drawing of the histogram in the output.
hist(Michelson$velocity, plot= T, xlab = "Velocity", ylab = "Observed Frequency")
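In fact, the object returned by hist() contains the bin counts, so plot = FALSE offers a direct way to count the observations in each bin (a minimal sketch, using width-50 bins):

h = hist(Michelson$velocity, breaks = seq(600, 1200, by = 50), plot = FALSE)
h$counts # number of observations in each width-50 bin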
Alternatively, the cut() and table() functions do the job for us. First we define our bins:
bins = seq(600, 1200, by = 50)
x = Michelson$velocity
cut_outcome = cut(x, breaks = bins)
The cut() command categorizes each value of x into one of the bins created, so the outcome of cut() is a factor of 100 elements whose levels are the 12 ranges, each of width fifty. For each element, the bin in which that element falls is printed out. See for yourself:
cut_outcome #Prints out the result of cut()
## [1] (800,850] (700,750] (850,900]
## [4] (1.05e+03,1.1e+03] (900,950] (800,850]
## [7] (900,950] (950,1e+03] (950,1e+03]
## [10] (850,900] (950,1e+03] (950,1e+03]
## [13] (900,950] (600,650] (750,800]
## [16] (800,850] (950,1e+03] (950,1e+03]
## [19] (950,1e+03] (950,1e+03] (950,1e+03]
## [22] (900,950] (950,1e+03] (900,950]
## [25] (850,900] (750,800] (800,850]
## [28] (850,900] (850,900] (800,850]
## [31] (800,850] (750,800] (800,850]
## [34] (850,900] (850,900] (800,850]
## [37] (750,800] (750,800] (750,800]
## [40] (750,800] (850,900] (850,900]
## [43] (850,900] (850,900] (700,750]
## [46] (700,750] (600,650] (850,900]
## [49] (950,1e+03] (900,950] (850,900]
## [52] (900,950] (800,850] (850,900]
## [55] (800,850] (800,850] (800,850]
## [58] (800,850] (800,850] (800,850]
## [61] (850,900] (800,850] (800,850]
## [64] (800,850] (750,800] (750,800]
## [67] (750,800] (700,750] (700,750]
## [70] (750,800] (900,950] (900,950]
## [73] (850,900] (850,900] (850,900]
## [76] (700,750] (800,850] (800,850]
## [79] (800,850] (750,800] (850,900]
## [82] (800,850] (750,800] (800,850]
## [85] (750,800] (800,850] (750,800]
## [88] (800,850] (800,850] (800,850]
## [91] (850,900] (850,900] (800,850]
## [94] (700,750] (800,850] (900,950]
## [97] (900,950] (750,800] (800,850]
## [100] (850,900]
## 12 Levels: (600,650] (650,700] (700,750] (750,800] (800,850] ... (1.15e+03,1.2e+03]
Now it is enough to count the frequency of each level. For example, if (800,850] is repeated 45 times, it means that “exactly” 45 observations have fallen within this range. The function table() will do the counting for us, as will summary().
table(cut_outcome)
## cut_outcome
## (600,650] (650,700] (700,750]
## 2 0 7
## (750,800] (800,850] (850,900]
## 16 30 22
## (900,950] (950,1e+03] (1e+03,1.05e+03]
## 11 11 0
## (1.05e+03,1.1e+03] (1.1e+03,1.15e+03] (1.15e+03,1.2e+03]
## 1 0 0
summary(cut_outcome)
## (600,650] (650,700] (700,750]
## 2 0 7
## (750,800] (800,850] (850,900]
## 16 30 22
## (900,950] (950,1e+03] (1e+03,1.05e+03]
## 11 11 0
## (1.05e+03,1.1e+03] (1.1e+03,1.15e+03] (1.15e+03,1.2e+03]
## 1 0 0
plot(density(Michelson$velocity))
Note: the density() function calculates the coordinates of the density curve but does not draw it. The output of that calculation is then passed into plot() and plotted.
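The smoothness of the estimate is controlled by the bandwidth; the adjust argument of density() scales the default bandwidth (a small sketch of two variations):

plot(density(Michelson$velocity, adjust = 0.5)) # wigglier estimate
plot(density(Michelson$velocity, adjust = 2))   # smoother estimate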
qqnorm(Michelson$velocity)
Note: For “normally” distributed data the 68-95-99.7 rule of thumb applies, describing how much of the data is expected to lie within 1, 2, and 3 standard deviations of the mean. This implies that a z-score of −2 corresponds roughly to the 2.5th percentile, −1 to the 16th percentile, and so on. The quantile-normal plot takes these theoretical quantiles and compares them to the quantiles of a sample data set; corresponding quantiles are plotted as pairs. If the distributions have the same shape, then the points will lie more or less on a straight line; if the two shapes are not similar, the points will not be collinear. Of course, samples have sampling variability, so the points are not expected to lie exactly on a line. The key to reading the graphic is to judge what counts as more or less straight.
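A reference line makes the plot easier to judge; qqline() adds a line passing through the first and third quartiles of the sample:

qqnorm(Michelson$velocity)
qqline(Michelson$velocity) # line through the quartile pairs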
The data set central.park (in the UsingR package) holds the coded variable “WX”, representing bad weather (e.g., 1 for “fog”, 18 for “thunder”, etc.). NA is used when none of the types occurred.
Explore this variable and make a factor of it.
Remove the NAs and make a new factor.
Rename the labels of the levels from numbers to meaningful characters for the new factor. In particular, turn 1 into “fog” and 18 into “thunder”, etc. Then explore this factor using the table() and summary() functions and see if you can find out the frequencies of each level.
Using barplot(), generate a bar chart with each level of the new factor on the x axis and its respective frequency on the y axis.
Before we use the dataset, let's load its package, UsingR, and then summon the dataset.
library(UsingR)
central.park$WX # Just checking the variable we want to examine later.
## [1] 18 1 NA NA 1 18 18 1 18 18 1 18 NA NA NA NA NA 1 NA NA 1 1 1
## [24] 1 18 1 NA 18 18 18 18
summary(central.park$WX)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 18.000 9.905 18.000 18.000 10
str(central.park$WX)
## num [1:31] 18 1 NA NA 1 18 18 1 18 18 ...
class(central.park$WX)
## [1] "numeric"
This initial analysis demonstrates that the variable is numeric. We should make it a factor, because the numbers carry no numeric meaning; they are codes for the levels of a factor.
my_fac = factor(central.park$WX) #making a factor vector.
my_fac
## [1] 18 1 <NA> <NA> 1 18 18 1 18 18 1 18 <NA> <NA>
## [15] <NA> <NA> <NA> 1 <NA> <NA> 1 1 1 1 18 1 <NA> 18
## [29] 18 18 18
## Levels: 1 18
class(my_fac)
## [1] "factor"
The latter code shows that the class of this object is now factor, no longer numeric.
which(is.na(my_fac)) #Returns the positions of the NA elements in the vector.
## [1] 3 4 13 14 15 16 17 19 20 27
my_new_fac = my_fac[!is.na(my_fac)] #Making a new factor that only contains non-NA elements.
my_new_fac
## [1] 18 1 1 18 18 1 18 18 1 18 1 1 1 1 1 18 1 18 18 18 18
## Levels: 1 18
class(my_new_fac) #Because it is built from a factor, my_new_fac is also a factor.
## [1] "factor"
Note: is.na(x) returns TRUE or FALSE for each element, checking whether it is NA. Putting is.na() inside the [] brackets would select the NAs; we want the opposite to be transferred to the new factor, so we use the symbol ! to negate the logic.
levels(my_new_fac)
## [1] "1" "18"
Then we relabel them:
levels(my_new_fac) = c("Fog", "Thunder")
See what happened:
my_new_fac
## [1] Thunder Fog Fog Thunder Thunder Fog Thunder Thunder
## [9] Fog Thunder Fog Fog Fog Fog Fog Thunder
## [17] Fog Thunder Thunder Thunder Thunder
## Levels: Fog Thunder
summary(my_new_fac)
## Fog Thunder
## 10 11
table(my_new_fac)
## my_new_fac
## Fog Thunder
## 10 11
Note: both summary() and table() return the same output, a sort of frequency table.
Just to explore:
class(table(my_new_fac))
## [1] "table"
This shows that the output of table() is a table object. Note that the output of summary(), though it looks the same as that of table(), is an integer vector.
barplot(table(my_new_fac)) #Frequency of each category.
barplot(table(my_new_fac), horiz = T) # draw the graph horizontally.
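As with hist(), axis labels can be added through the xlab and ylab arguments (a small cosmetic variation):

barplot(table(my_new_fac), xlab = "Weather type", ylab = "Frequency")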