In this document, we introduce and work with measures of central tendency, dispersion, skewness and kurtosis for univariate data. We then explore the basics of plotting numerical and categorical data.
We also learn a quick way of defining and coding functions.
As usual, we do this using various examples to increase our fluency with R.
List of functions we work with:
Central Tendency Measures: mean(), median(), quantile()
Measures of Dispersion: var(), sd(), IQR(); developing functions for skewness and kurtosis calculations
Viewing numerical data using histograms, density plots, boxplots, and quantile plots
Viewing categorical data using bar charts
The Michelson dataset from the HistData package records the 100 measurements by Michelson of the speed of light (in the variable velocity).
Examine the measures of central tendency for this data, particularly the mean and median. By comparing the mean and median, do you think the data is generally right or left skewed?
Write a function to calculate the skewness of the data.
Examine the measures of dispersion, particularly the variance, standard deviation, and IQR.
Write a function to calculate the kurtosis of the data. What do you think of it, compared to a standardized normal distribution?
Identify the 10th, 25th, 50th, 75th, and 90th quantiles of the data.
Write a function to calculate the 90th quantile of the data. Is the result equivalent to what you found in the previous section using the quantile() function?
Check the boxplot of the data.
Examine the histogram plot of the data and play with its arguments.
Determine the number of observations in each bin of the histogram, for a histogram with bins of width 50 each.
Generate a density plot.
Examine the data quantiles against the quantiles of normal data using qqnorm() plots.
Before we do anything, we will install and load the package HistData. We then call the Michelson dataset.
#install.packages("HistData")
library(HistData)
Michelson
## velocity
## 1 850
## 2 740
## 3 900
## 4 1070
## 5 930
## 6 850
## 7 950
## 8 980
## 9 980
## 10 880
## 11 1000
## 12 980
## 13 930
## 14 650
## 15 760
## 16 810
## 17 1000
## 18 1000
## 19 960
## 20 960
## 21 960
## 22 940
## 23 960
## 24 940
## 25 880
## 26 800
## 27 850
## 28 880
## 29 900
## 30 840
## 31 830
## 32 790
## 33 810
## 34 880
## 35 880
## 36 830
## 37 800
## 38 790
## 39 760
## 40 800
## 41 880
## 42 880
## 43 880
## 44 860
## 45 720
## 46 720
## 47 620
## 48 860
## 49 970
## 50 950
## 51 880
## 52 910
## 53 850
## 54 870
## 55 840
## 56 840
## 57 850
## 58 840
## 59 840
## 60 840
## 61 890
## 62 810
## 63 810
## 64 820
## 65 800
## 66 770
## 67 760
## 68 740
## 69 750
## 70 760
## 71 910
## 72 920
## 73 890
## 74 860
## 75 880
## 76 720
## 77 840
## 78 850
## 79 850
## 80 780
## 81 890
## 82 840
## 83 780
## 84 810
## 85 760
## 86 810
## 87 790
## 88 810
## 89 820
## 90 850
## 91 870
## 92 870
## 93 810
## 94 740
## 95 810
## 96 940
## 97 950
## 98 800
## 99 810
## 100 870
The above code prints out the dataset Michelson from the HistData package.
Now, with the $ sign we can access a specific variable of a dataset.
mean(Michelson$velocity)
## [1] 852.4
median(Michelson$velocity)
## [1] 850
Given that the mean and median are very close, the data should be roughly symmetrical in shape. In clearly right-skewed (positively skewed) data, the median is smaller than the mean: the mean is easily influenced by extreme observations and moves towards them, so in clearly right-skewed data the mean is pulled to the right. The opposite is true for left-skewed data.
Here the mean and median are very close, and thus it is hard to tell whether the data is left or right skewed, even though the mean is slightly bigger than the median.
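The function below implements one common definition of sample skewness, the average cubed z-score (several variants with small-sample corrections exist; this is the simplest one):

$$\text{skewness} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^3$$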
skewness = function(x) {
  z_score = (x - mean(x)) / sd(x) # standardize the observations
  zscore_cubic = z_score^3        # cube the z-scores
  sum(zscore_cubic) / length(x)   # return their average
}
skewness(Michelson$velocity)
## [1] -0.01798641
As can be seen, the measure is very small and negative. We cannot rely on it much to claim the distribution is left skewed, because the mean and the median are in fact extremely close.
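As a quick sanity check on simulated data (a sketch; the exact values vary with the seed), the function should return values near 0 for a symmetric sample and clearly positive values for a right-skewed one:

set.seed(1)
skewness(rnorm(1e4)) # approximately 0 for a symmetric normal sample
skewness(rexp(1e4))  # clearly positive for a right-skewed exponential sample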
var(Michelson$velocity)
## [1] 6242.667
sd(Michelson$velocity)
## [1] 79.01055
IQR(Michelson$velocity)
## [1] 85
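These measures are related: the standard deviation is the square root of the variance, and the IQR is the difference between the 0.75 and 0.25 quantiles. A minimal check:

sqrt(var(Michelson$velocity))                     # equals sd()
diff(quantile(Michelson$velocity, c(0.25, 0.75))) # equals IQR()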
kurtosis = function(x) {
  z_score = (x - mean(x)) / sd(x) # standardize the observations
  zscore_fourth = z_score^4       # fourth power of the z-scores
  out = sum(zscore_fourth) / length(x)
  out - 3                         # subtract 3 to obtain excess kurtosis
}
kurtosis(Michelson$velocity)
## [1] 0.1985863
This distribution has a positive excess kurtosis compared to a standardized normal distribution, whose excess kurtosis is ZERO (a normal distribution has a raw kurtosis of 3; our function subtracts 3, which is why we compare against zero). Notice that because we are working with z-scores, the result is automatically comparable with the kurtosis of a standardized normal distribution.
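The kurtosis() function above implements excess kurtosis, the average fourth-power z-score minus 3:

$$\text{excess kurtosis} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^4 - 3$$

As a quick sanity check on simulated data (values vary slightly with the seed), normal samples should give values near 0, while heavy-tailed samples give clearly positive values:

set.seed(1)
kurtosis(rnorm(1e4))      # near 0 for normal data
kurtosis(rt(1e4, df = 5)) # positive for heavy-tailed t-distributed data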
quantile(Michelson$velocity, c(0.1, 0.2, 0.25, 0.5, 0.75, 0.90))
## 10% 20% 25% 50% 75% 90%
## 760.0 798.0 807.5 850.0 892.5 960.0
Indeed the 0.5 quantile is the median, as we found earlier. To interpret, one can say, for example, that 90% of the data are smaller than 960 (which is the 0.9 quantile).
ninetieth_percentile = function(x) {
  n = length(x)
  sorted_vec = sort(x, decreasing = FALSE) # sort in ascending order
  sorted_vec[0.9 * n]                      # the 90th of the 100 sorted values
}
ninetieth_percentile(Michelson$velocity)
## [1] 960
Our function returns exactly the same result as quantile(x, 0.9) does here. Note, however, that quantile() interpolates between order statistics by default, so the two approaches need not agree for every data set.
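A hypothetical generalization (not required here) takes the probability p as an argument; ceiling() handles cases where p times the sample size is not a whole number:

nth_percentile = function(x, p) {
  sorted_vec = sort(x)
  sorted_vec[ceiling(p * length(x))]
}
nth_percentile(Michelson$velocity, 0.9) # returns 960, as above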
boxplot(Michelson$velocity, horizontal = TRUE)
hist(Michelson$velocity) # a simple histogram of data
hist(Michelson$velocity, probability = T)
The latter code turns the histogram into a probability (density) histogram whose total area sums to 1.
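We can verify that the total area is indeed 1 by using the object that hist() returns, which always carries a density component (a minimal sketch):

h = hist(Michelson$velocity, plot = FALSE)
sum(h$density * diff(h$breaks)) # total area of the probability histogram; equals 1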
Now we can determine the bin size of the histogram ourselves, using the breaks argument.
bin = seq(600, 1200, by=200)
bin
## [1] 600 800 1000 1200
hist(Michelson$velocity, breaks = bin, probability = F)
The above is a very blocky histogram, because the bins are very large and therefore few.
Remember that the use of the hist() function in the first part of this answer generated bins of width 50. Now we use a width of 5 to make a histogram that seems too spiky, for having too many bins.
bin= seq(600, 1200, by = 5)
bin
## [1] 600 605 610 615 620 625 630 635 640 645 650 655 660 665
## [15] 670 675 680 685 690 695 700 705 710 715 720 725 730 735
## [29] 740 745 750 755 760 765 770 775 780 785 790 795 800 805
## [43] 810 815 820 825 830 835 840 845 850 855 860 865 870 875
## [57] 880 885 890 895 900 905 910 915 920 925 930 935 940 945
## [71] 950 955 960 965 970 975 980 985 990 995 1000 1005 1010 1015
## [85] 1020 1025 1030 1035 1040 1045 1050 1055 1060 1065 1070 1075 1080 1085
## [99] 1090 1095 1100 1105 1110 1115 1120 1125 1130 1135 1140 1145 1150 1155
## [113] 1160 1165 1170 1175 1180 1185 1190 1195 1200
hist(Michelson$velocity, breaks = bin)
Adding axis labels using xlab and ylab:
hist(Michelson$velocity, xlab = "Velocity", ylab = "Observed Frequency")
Making a probability histogram using the freq argument:
hist(Michelson$velocity, freq= F, xlab = "Velocity", ylab = "Observed Frequency")
Setting freq = F is equivalent to setting the probability argument to T.
The plot argument of the hist() function, if set to FALSE, suppresses the drawing of the histogram in the output.
hist(Michelson$velocity, plot= T, xlab = "Velocity", ylab = "Observed Frequency")
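In fact, the object returned by hist() contains the bin counts, so plot = FALSE offers a direct way to count the observations in each bin (a minimal sketch, using width-50 bins):

h = hist(Michelson$velocity, breaks = seq(600, 1200, by = 50), plot = FALSE)
h$counts # number of observations in each width-50 bin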
Alternatively, the cut() and table() functions do the job for us. First we define our bins:
bins = seq(600, 1200, by = 50)
x = Michelson$velocity
cut_outcome = cut(x, breaks = bins)
The cut() command categorizes each value of x into one of the bins created, so the outcome of cut() is a factor of 100 elements whose levels are the 12 ranges, each of width fifty. For each element, the bin in which that element falls is printed out. See for yourself:
cut_outcome #Prints out the result of cut()
## [1] (800,850] (700,750] (850,900]
## [4] (1.05e+03,1.1e+03] (900,950] (800,850]
## [7] (900,950] (950,1e+03] (950,1e+03]
## [10] (850,900] (950,1e+03] (950,1e+03]
## [13] (900,950] (600,650] (750,800]
## [16] (800,850] (950,1e+03] (950,1e+03]
## [19] (950,1e+03] (950,1e+03] (950,1e+03]
## [22] (900,950] (950,1e+03] (900,950]
## [25] (850,900] (750,800] (800,850]
## [28] (850,900] (850,900] (800,850]
## [31] (800,850] (750,800] (800,850]
## [34] (850,900] (850,900] (800,850]
## [37] (750,800] (750,800] (750,800]
## [40] (750,800] (850,900] (850,900]
## [43] (850,900] (850,900] (700,750]
## [46] (700,750] (600,650] (850,900]
## [49] (950,1e+03] (900,950] (850,900]
## [52] (900,950] (800,850] (850,900]
## [55] (800,850] (800,850] (800,850]
## [58] (800,850] (800,850] (800,850]
## [61] (850,900] (800,850] (800,850]
## [64] (800,850] (750,800] (750,800]
## [67] (750,800] (700,750] (700,750]
## [70] (750,800] (900,950] (900,950]
## [73] (850,900] (850,900] (850,900]
## [76] (700,750] (800,850] (800,850]
## [79] (800,850] (750,800] (850,900]
## [82] (800,850] (750,800] (800,850]
## [85] (750,800] (800,850] (750,800]
## [88] (800,850] (800,850] (800,850]
## [91] (850,900] (850,900] (800,850]
## [94] (700,750] (800,850] (900,950]
## [97] (900,950] (750,800] (800,850]
## [100] (850,900]
## 12 Levels: (600,650] (650,700] (700,750] (750,800] (800,850] ... (1.15e+03,1.2e+03]
Now it is enough to count the frequency of each level. For example, if (800,850] is repeated 45 times, it means that “exactly” 45 observations have fallen within this range. The function table() will do the counting for us, as will summary().
table(cut_outcome)
## cut_outcome
## (600,650] (650,700] (700,750]
## 2 0 7
## (750,800] (800,850] (850,900]
## 16 30 22
## (900,950] (950,1e+03] (1e+03,1.05e+03]
## 11 11 0
## (1.05e+03,1.1e+03] (1.1e+03,1.15e+03] (1.15e+03,1.2e+03]
## 1 0 0
summary(cut_outcome)
## (600,650] (650,700] (700,750]
## 2 0 7
## (750,800] (800,850] (850,900]
## 16 30 22
## (900,950] (950,1e+03] (1e+03,1.05e+03]
## 11 11 0
## (1.05e+03,1.1e+03] (1.1e+03,1.15e+03] (1.15e+03,1.2e+03]
## 1 0 0
plot(density(Michelson$velocity))
Note: the density() function calculates the coordinates of the density curve but does not draw it. The output of that calculation is then passed into plot() and plotted.
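The smoothness of the estimate is controlled by the bandwidth; the adjust argument of density() scales the default bandwidth (a small sketch of two variations):

plot(density(Michelson$velocity, adjust = 0.5)) # wigglier estimate
plot(density(Michelson$velocity, adjust = 2))   # smoother estimate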
qqnorm(Michelson$velocity)
Note: For “normally” distributed data the 68-95-99.7 rule of thumb applies, describing how much of the data is expected to lie within 1, 2, and 3 standard deviations of the mean. This implies that a z-score of −2 corresponds roughly to the 2.5th percentile, −1 to the 16th percentile, and so on. The quantile-normal plot takes these theoretical quantiles and compares them to the quantiles of a sample data set; corresponding quantiles are plotted as pairs. If the distributions have the same shape, then the points will lie more or less on a straight line; if the two shapes are not similar, the points will not be collinear. Of course, samples have sampling variability, so the points are not expected to lie exactly on a line. The key to reading the graphic is to judge what counts as more or less straight.
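A reference line makes the plot easier to judge; qqline() adds a line passing through the first and third quartiles of the sample:

qqnorm(Michelson$velocity)
qqline(Michelson$velocity) # line through the quartile pairs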
The data set central.park (in the UsingR package) holds the coded variable “WX”, representing bad weather (e.g., 1 for “fog”, 18 for “thunder”, etc.). NA is used when none of the types occurred.
Explore this variable and make a factor of it.
Remove the NAs and make a new factor.
Rename the labels of the levels from numbers to meaningful characters for the new factor. In particular, turn 1 into “fog” and 18 into “thunder”, etc. Then explore this factor using the table() and summary() functions and see if you can find out the frequencies of each level.
Using barplot(), generate a bar chart with each level of the new factor on the x axis and its respective frequency on the y axis.
Before we use the dataset, let's load its package, UsingR, and then summon the dataset.
library(UsingR)
central.park$WX # Just checking the variable we want to examine later.
## [1] 18 1 NA NA 1 18 18 1 18 18 1 18 NA NA NA NA NA 1 NA NA 1 1 1
## [24] 1 18 1 NA 18 18 18 18
summary(central.park$WX)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 18.000 9.905 18.000 18.000 10
str(central.park$WX)
## num [1:31] 18 1 NA NA 1 18 18 1 18 18 ...
class(central.park$WX)
## [1] "numeric"
This initial analysis demonstrates that the variable is numeric. We should make it a factor, because the numbers carry no numeric meaning; they are codes for the levels of a factor.
my_fac = factor(central.park$WX) #making a factor vector.
my_fac
## [1] 18 1 <NA> <NA> 1 18 18 1 18 18 1 18 <NA> <NA>
## [15] <NA> <NA> <NA> 1 <NA> <NA> 1 1 1 1 18 1 <NA> 18
## [29] 18 18 18
## Levels: 1 18
class(my_fac)
## [1] "factor"
The latter code shows that the class of this object is now factor, no longer numeric.
which(is.na(my_fac)) #Returns the positions of the NA elements in the vector.
## [1] 3 4 13 14 15 16 17 19 20 27
my_new_fac = my_fac[!is.na(my_fac)] #Making a new factor that only contains non-NA elements.
my_new_fac
## [1] 18 1 1 18 18 1 18 18 1 18 1 1 1 1 1 18 1 18 18 18 18
## Levels: 1 18
class(my_new_fac) #Because it is built from a factor, my_new_fac is also a factor.
## [1] "factor"
Note: is.na(x) returns TRUE or FALSE for each element, checking whether it is NA. Putting is.na() inside the [] brackets would select the NAs; we want the opposite to be transferred to the new factor, so we use the symbol ! to negate the logic.
levels(my_new_fac)
## [1] "1" "18"
Then we relabel them:
levels(my_new_fac) = c("Fog", "Thunder")
See what happened:
my_new_fac
## [1] Thunder Fog Fog Thunder Thunder Fog Thunder Thunder
## [9] Fog Thunder Fog Fog Fog Fog Fog Thunder
## [17] Fog Thunder Thunder Thunder Thunder
## Levels: Fog Thunder
summary(my_new_fac)
## Fog Thunder
## 10 11
table(my_new_fac)
## my_new_fac
## Fog Thunder
## 10 11
Note: both summary() and table() return the same output, a sort of frequency table.
Just to explore:
class(table(my_new_fac))
## [1] "table"
This shows that the output of table() is a table object. Note that the output of summary(), though it looks the same as that of table(), is an integer vector.
barplot(table(my_new_fac)) #Frequency of each category.
barplot(table(my_new_fac), horiz = T) # draw the graph horizontally.
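As with hist(), axis labels can be added through the xlab and ylab arguments (a small cosmetic variation):

barplot(table(my_new_fac), xlab = "Weather type", ylab = "Frequency")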