This part of the refresher course is based on:

Landers, R.N. (2019). A Step-by-Step Introduction to Statistics for Business. Sage Publications

We will discuss all topics covered in the textbook, and show how they can be implemented in R.

1. The language of statistics

In statistics and data science, we often use data sets which have a rectangular shape. A rectangle, or matrix, has two dimensions which we call rows and columns.

The rows represent cases (or persons, objects, research units). The columns provide information about these cases. Although we will encounter some other ways to store and present information, this type of data set is by far the most common.

The data are measurements of the variables that appear in the columns.

For example, the first column is a simple serial number, from 1 to 7, to identify the case.

Each case represents, for example, a transaction. The second column then indicates which of our three stores A to C is responsible for this transaction.

The variables q1 to q3 contain structured information. What do we mean by structure? In principle, q1 is measured for each and every transaction, and we don't have variables that are unique to specific transactions.
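As an illustration, here is a minimal sketch of such a data set in R; the store labels and the values of q1 to q3 are made up purely for illustration.

# A made-up example of a rectangular data set with 7 transactions
transactions <- data.frame(
  id    = 1:7,                             # serial number identifying each case
  store = c("A","B","A","C","B","C","A"),  # store responsible for the transaction (invented)
  q1    = c(12, 7, 9, 14, 5, 11, 8),       # invented values for the structured variables
  q2    = c(3, 1, 4, 2, 5, 3, 2),
  q3    = c(110, 65, 80, 150, 40, 95, 70)
)
transactions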

Unstructured information, in comparison, could be transcripts from interviews with several interviewees on a particular topic. Even if you ask the same questions, the answers can go in any direction, and what your interviewees say, how many and which words they use, will be unique to every interviewee.

Scales of Measurement

The information in your data set is based on measurements of some sort.

When measuring things, we can make use of four types of scales.

Nominal: a scale of measurement for data with meaningful labels

Example: Gender

Ordinal: a scale of measurement for nominal data with order

Example: “1st", “2nd”, “3rd”

Interval: a scale of measurement for ordinal data with meaningful distances between values

Example: temperature in degrees Celsius

Ratio: a scale of measurement for interval data with a meaningful zero point

Example: sales in euros.

The difference between interval and ratio scales is subtle. If the outside temperature on a particular day is 20 degrees Celsius, you cannot say that it is twice as warm as 10 degrees Celsius. It depends on the scale used. Had we used Fahrenheit instead of Celsius, the ratio would not have been a factor of 2.

For the relationship between the two scales we can use the formula:

\(F = \frac{9}{5} C + 32\)

Below, we have written the user-defined function CtoF() for the conversion from Celsius to Fahrenheit.

We apply the formula to a range of temperatures in degrees Celsius.

CtoF <- function(c)
{
  f <- 9/5*c + 32
  cat(c,"degrees Celsius =",f,"degrees Fahrenheit\n")
}

tempC <- c(0,5,10,20,30)
for(i in tempC) {
  CtoF(i)
}
## 0 degrees Celsius = 32 degrees Fahrenheit
## 5 degrees Celsius = 41 degrees Fahrenheit
## 10 degrees Celsius = 50 degrees Fahrenheit
## 20 degrees Celsius = 68 degrees Fahrenheit
## 30 degrees Celsius = 86 degrees Fahrenheit

Indeed, you see that while it is tempting to state that 10 degrees Celsius is twice as warm as 5 degrees Celsius, the very same temperatures on the Fahrenheit scale are 50 and 41, respectively. On that scale the ratio is nowhere near a factor of 2!

Still, repeatedly adding 5 degrees Celsius, is equivalent to adding 9 degrees on the Fahrenheit scale. That is, differences (of, say, 5 degrees) on the Celsius scale correspond to differences (of 9 degrees) on the Fahrenheit scale. These scales satisfy the criteria for interval scales, but not for ratio scales.

Ordinal scales are quite common. We use them frequently, in for example surveys. We can ask you the following question about the contents of this chapter.

To what extent do you agree or disagree with the following statement? "This module is the best refresher course in statistics in the world."

  1. Totally disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Totally agree
Typically, we would code the answers numerically, for instance from 1=totally disagree to 5=totally agree. This coding scheme suggests that the differences between the points on the scale are the same (equidistant), like temperatures on the Celsius or Fahrenheit scales! But this is doubtful. It may be that the threshold for respondents to totally agree or disagree is very high, while at the same time they feel comfortable avoiding the neutral category even if they only slightly agree or disagree with the statement. That is, the distances at the extreme ends of the scale are larger than in the middle. A coding scheme like {1 4 5 6 9} may better reflect the true distances between the answers.

But we do not know, and therefore - for the sake of ease - we assume equidistance. This assumption allows us to compute mean scores. A mean score of 3.6 would then mean that overall the feeling is slightly positive, somewhere in between neutral and agree.
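As a small illustration, here is a made-up set of ten coded responses whose mean happens to be 3.6.

responses <- c(4, 3, 5, 4, 2, 4, 3, 5, 4, 2)  # hypothetical coded answers on the 1-5 scale
mean(responses)                               # 3.6: slightly positive, between neutral and agree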

Ordinal scales only imply that a score of 4 is more, or better than a score of 3, but we do not know how much better.

Nominal scales are just labels. A question on gender can be numerically coded (1 or 2, for males and females, or the other way round) but the numerical code is just used for convenience. Taking the average (1.5, if our group has as many males as females) makes no sense.

Later on we will see that we can still use nominal scales (like group membership) when analyzing quantitative data.

Exercise

Sally has opened a pizza restaurant. She delivers pizzas in the neighbourhood. For each delivery she collects the following information:

  1. Size of the pizza (small, medium, large)
  2. Number of pizzas ordered
  3. Day of the week
  4. Distance from restaurant to delivery address
  5. Billing amount

What are the scales of measurements, for 1 to 5?

2. Working with numbers, and data display

In statistics and data science, we are dealing with lots of data. The key challenge is to briefly summarize the data. To that end, we make use of graphs and numbers.

For example, your teachers may make use of different types of assessment, to check if you have studied hard enough to master all the things they have tried to teach you. At the end of the course, you may have gone through many exams and assignments, all of them graded with a number between 1 and 10. This system will generate a lot of information. For all students in the class, we will have grades on all exams and assignments, and the data can be organized in a data set.

If we have a small class and only two subjects, then an example of a data set may look like the figure below.

The data are organized in an Excel sheet, of which the figure above is a screenshot. Excel files can be read directly by R, but it is more common to share data as text files in which the values are separated by commas - hence, csv (comma-separated values) files.

csv files, too, can be read by R directly. The first line of a csv file normally contains the names (labels) of the variables, just like we would do in Excel. You have to tell R that this is the case.

grades <- read.table("grades.csv", header=TRUE, sep=",")
grades
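As an aside, read.csv() is a convenience wrapper around read.table() with header=TRUE and sep="," as its defaults, so the following sketch is equivalent:

grades <- read.csv("grades.csv")  # header=TRUE and sep="," are the defaults here
str(grades)                       # a compact overview of the variables and their types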

You can have a look at the data, and find an answer to many questions.

  1. How many male and female students are there in class?
  2. What is the minimum, maximum and average age?
  3. What are the minimum, maximum and average grades for the two subjects?
  4. What is the average grade for each student?
  5. Is there a (positive) correlation between the grades for the two subjects? That is, do students with low/high grades for subject 1 also tend to have low/high grades for subject 2?

Try to find these answers from visual inspection! You will notice two things:

  1. Firstly, some tasks are easier than others. Especially (5) is not straightforward!
  2. It is time consuming and error prone, even with this small data set.

Well, that's why we have computers and software. We can give simple instructions, and the computer will do it faster and without errors. Try the summary() command.

summary(grades)
##        id            name              gender               age      
##  Min.   : 1.00   Length:10          Length:10          Min.   :18.0  
##  1st Qu.: 3.25   Class :character   Class :character   1st Qu.:20.0  
##  Median : 5.50   Mode  :character   Mode  :character   Median :20.0  
##  Mean   : 5.50                                         Mean   :20.3  
##  3rd Qu.: 7.75                                         3rd Qu.:21.0  
##  Max.   :10.00                                         Max.   :22.0  
##     grade_1         grade_2     
##  Min.   : 4.00   Min.   : 4.00  
##  1st Qu.: 6.25   1st Qu.: 6.25  
##  Median : 8.00   Median : 8.50  
##  Mean   : 7.50   Mean   : 7.70  
##  3rd Qu.: 9.00   3rd Qu.: 9.00  
##  Max.   :10.00   Max.   :10.00

Computers are smart. But not that smart. In the output you will see that R recognizes name as a string variable, containing text. You cannot compute an average, or most other statistics, from text. On the other hand, since we stored id as a numerical code, R blindly computes its average, even though that number is meaningless.

For age and the grades, statistical information is available. The summary() command returns the minimum and maximum values, and the median and the mean, among other things. No student has a grade below 4 on either subject. The mean grade for subject 2 is slightly higher than for subject 1. And so on.

The grades have a so-called distribution, in the range of 4 to 10. While the mean grade is an interesting statistic, we are also interested in the distribution per se. Are the grades evenly distributed, or more bell shaped? We can plot a histogram of the data.

While the summary() command can be applied to the data set grades and all variables that are part of it, histograms are plotted for a single variable. Variable grade_1 has to be referred to as grades$grade_1. Just grade_1 does not suffice, since in principle it can be an object by itself!

You have control over the appearance of the histogram. Below, we set the number of breaks and coloured the bars red. When these arguments are omitted, the hist() command uses default values.

hist(grades$grade_1, breaks=10, col="red")

Again, ggplot produces nicer (and publication-quality) graphs. To the basic part, in which we tell R that grade_1 goes on the x-axis, we add layers ("geoms").

Note that in the basic part we first have to tell R which data set to use, and R then knows that grade_1 is a variable in that data set!

It is common to play around with settings like binwidth and colours and sizes, until you are happy with the result. We also added a dashed red line representing the mean of grade_1.

library(ggplot2)
ggplot(grades, aes(x=grade_1)) +
    geom_histogram(binwidth=1, colour="black", fill="white") +
    geom_vline(aes(xintercept=mean(grade_1, na.rm=T)), color="red", 
    linetype="dashed", size=1)    

3. The normal distribution

Understanding Normal Distribution

The normal distribution is the most common type of distribution assumed in statistical analyses.

The normal distribution has two parameters: the mean and the standard deviation. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.

For a normal distribution:

  • 68% of the observations are within ± one standard deviation of the mean
  • 95% are within ± two standard deviations, and
  • 99.7% are within ± three standard deviations.
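These percentages can be verified in R with the cumulative distribution function pnorm(), which by default refers to the standard normal distribution:

# Probability mass within 1, 2 and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997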

The normal distribution model is motivated by the Central Limit Theorem which states that averages calculated from independent, identically distributed random variables have approximately normal distributions, regardless of the type of distribution from which the variables are sampled.

Skewness and Kurtosis

Real life data rarely, if ever, follow a perfect normal distribution.

The skewness and kurtosis coefficients measure how different a given distribution is from a normal distribution.

Skewness measures the symmetry of a distribution. The normal distribution is symmetric and has a skewness of zero. If the distribution of a data set has a skewness less than zero, or negative skewness, then the left tail of the distribution is longer than the right tail; positive skewness implies that the right tail of the distribution is longer than the left.

Kurtosis measures the thickness of the tail ends of a distribution in relation to the tails of the normal distribution. Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution. Distributions with low kurtosis exhibit tail data that is generally less extreme than the tails of the normal distribution.
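Base R has no built-in functions for these coefficients. As a sketch, assuming the moments package is installed (other packages, such as e1071, offer similar functions), we could compute them for simulated normal data:

# install.packages("moments")
library(moments)
set.seed(123)
x <- rnorm(1000)   # simulated draws from a normal distribution
skewness(x)        # close to 0 for a symmetric distribution
kurtosis(x)        # close to 3 for a normal distribution (moments reports raw, not excess, kurtosis)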

Surprisingly many things in real life closely follow a normal distribution:

  • Height of people
  • Blood pressure
  • IQs
  • ...

One website mentioned salaries as another example of normally distributed variables. However, that is doubtful. Maybe it holds true in some economies, but not in all economies and not worldwide. One (theoretical) issue is that normal distributions are continuous distributions ranging from -∞ to +∞. Well, salaries don't. I guess it's fiscally possible to have negative incomes, but if your employer offers you -10 dollars per hour, chances are you prefer staying home and "earning" 10 dollars more by doing nothing. That is, on the left, the distribution of salaries is cut off at zero, while on the right some people earn incredibly high salaries. In other words, the distribution is not symmetrical, but right-skewed.


4. Central tendency

Above we said that the theoretical normal distribution is characterized by various parameters. In actual, or empirical, distributions we compute statistics. For the normal distribution, it suffices to know the mean and the standard deviation. If we know these two, then we know the exact shape of the distribution, and implicitly also other parameters like skewness and kurtosis.

We will focus on the mean and the standard deviation.

Mean (or average)

The mean (or average) is a measure of central tendency. Most of us know, at least intuitively, what the mean stands for. Technically, the mean is the sum of all observations divided by the number of observations. In words, the mean represents the center of the data.

Although we seldom think about it too deeply, using the mean has some drawbacks. One is that it is sensitive to outliers. Suppose that we measure the incomes of two groups of people. The two groups happen to be identical, except for the one person with the highest income. An outlier lifts the mean of the whole group, often substantially.

group1 <- c(10,8,12,15,7,8,11,10,14,13)
group2 <- c(10,8,12,15,7,8,11,10,14,130)
(mean(group1))
## [1] 10.8
(mean(group2))
## [1] 22.5

You notice that the means differ strongly! In the second group there is only 1 person out of 10 that earns above average; the vast majority earns much less.

Median

As an alternative, you can make use of the median as a measure of central tendency. The median is simply the middle value in the distribution after sorting them from low to high. The median is less sensitive to (or robust to) extreme values at the end of the distribution. For incomes and salaries, it is usually more informative to use the median, as it tells us more about the center of the group!


We can ask R to do the sorting:

(sort(group2))
##  [1]   7   8   8  10  10  11  12  14  15 130

A slight complication is that, with an even number of observations, there is no single observation in the middle. Here, with 10 observations, we have five observations at the bottom and five at the top, and the (imaginary) middle observation would be somewhere in between the fifth and the sixth observation. In the case of an even number of observations, we compute the median as:

Median = (5th observation + 6th observation)/2

In our example:

Sorted data: 7 8 8 10 10 / 11 12 14 15 130

Median = (10 + 11) / 2 = 10.5

R will do the job for you, with the median() function.

(median(group2))
## [1] 10.5

Mode

For the sake of completeness, we mention a third measure of central tendency: the mode.

The mode is the most common value in the data set. In our example, there are actually two modes, 8 and 10. Both values occur twice.

You can verify this with the table() command, which lists all possible values and their frequency in the data set. We apply cbind() to the output of table, to show the values and their frequencies in columns rather than rows.

(cbind(table(group2)))
##     [,1]
## 7      1
## 8      2
## 10     2
## 11     1
## 12     1
## 14     1
## 15     1
## 130    1
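Base R has no built-in function for the statistical mode (the mode() function returns the storage type of an object, not the most common value). A minimal sketch is to pick the value(s) with the highest frequency from the table:

freq <- table(group2)
as.numeric(names(freq)[freq == max(freq)])  # returns 8 and 10, the two modes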

You can imagine that for many continuous variables, most values are unique. For example, net salaries in the Netherlands are pretty much unique, as they are derived from gross salaries minus taxes, contributions to pension schemes, individual allowances, and so on. Since contributions to pension schemes depend on individual traits (age), and taxes depend on the household situation and other sources of income, among others, the probability of two persons earning the exact same income is close to zero. Each net salary is unique, and there are as many modes as there are individuals and net salaries! Not very helpful.

A way out of this problem is to define income or salary brackets, like 1,000 to 2,000; 2,000 to 3,000 Euro, and so on. A bracket then contains all unique salaries in a range of salaries.

Distributions can be displayed using histograms. The hist() function, by default, selects the number of brackets and their width. It is generally better to override the default values, for the best presentation of the data. But to give you an idea, below is the histogram of the group 1 data.

Note that the definition of the brackets makes a big difference! The brackets are 6-8, 8-10, and so on. But since we do have values of 8 and 10, right on the dividing line between brackets, it is crucially important to know whether, for example, 8 is counted in the first or the second bracket!

For continuous variables, in contrast to the seemingly discrete numbers in group 1, this is not a big concern.

hist(group1)
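By default hist() uses right-closed brackets, so a value of 8 is counted in the (6,8] bracket. If you prefer left-closed brackets, you can set the right argument to FALSE:

hist(group1, right=FALSE)  # brackets are now left-closed, e.g. [8,10)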

5. Variability

Now that we know how to identify the middle of the distribution, the next step is to identify the width of it.

Range

One way to identify the width of the distribution, is to find the lowest and highest values in the distribution (the minimum and the maximum). The range then is simply computed as \(maximum - minimum\).

(min1 <- min(group1))
## [1] 7
(max1 <- max(group1))
## [1] 15
cat("The range of group 1 incomes is", max1 - min1)
## The range of group 1 incomes is 8
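As a shortcut, base R's range() function returns the minimum and the maximum in one go, and diff() of that pair gives the range itself:

range(group1)        # 7 15
diff(range(group1))  # 8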

It is obvious that the range is highly sensitive to outliers. In group 2, containing one person with a very high income, the range is dominated by this one outlier.

(min2 <- min(group2))
## [1] 7
(max2 <- max(group2))
## [1] 130
cat("The range of group 1 incomes is", max2 - min2)
## The range of group 1 incomes is 123

The range is seldom used in statistics and data science for analytic purposes. However, the minimum, maximum and range are often used in the early stages of the data analysis process, especially when checking the validity of the data. For example, if your school registers grades from 1 to 10 (as in the Dutch system), then analyzing those grades for whatever purpose starts with checking that all grades have a \(minimum \geq 1\) and a \(maximum \leq 10\). If not, then something is wrong (for example, missing grades coded 99 are included in the data).
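A minimal sketch of such a check, using a made-up vector of grades in which a missing grade was (wrongly) coded as 99:

grades_dutch <- c(7, 8, 5, 99, 6)                   # hypothetical grades; 99 is a missing-value code
min(grades_dutch) >= 1 & max(grades_dutch) <= 10    # FALSE signals a problem in the data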

Deviation

A lot of analyses in statistics and data science are based on the principle of differences, deviations and distances. You will encounter these terms time and again throughout the program!

Taking our group 1 incomes, we can say something about all 10 group members, in terms of their differences to one another, or in terms of their deviation from the group average.

We have already calculated the mean income for group 1. It is then easy to calculate how far all members are from the mean of the group.

For example, we now see that the first person, who earns 10, is 0.8 below the mean of the group.

(mean1 <- mean(group1))
## [1] 10.8
(group1dev <- group1 - mean1)
##  [1] -0.8 -2.8  1.2  4.2 -3.8 -2.8  0.2 -0.8  3.2  2.2

But our interest is not in the individual deviations; we want a summary of how far the group members, taken together, are removed from the mean. It is tempting to take the sum of all deviations as such a measure. But there's a problem here!

Using the sum() function to take the sum of the differences, stored in group1dev, we see that the outcome is zero. That is, the positive and negative deviations cancel out! By definition, the mean is the centre of the data, and the sum of deviations from the mean will cancel out!

Note that we use the round() function. We do so to prevent R from displaying a veeeeeery small number, something ending in "e-15", meaning that the first part of the expression has to be multiplied by \(10^{-15}\), or 0.000000000000001.

round(sum(group1dev),4)
## [1] 0

In order to prevent negative and positive deviations from cancelling each other out, we can reason as follows. A deviation, positive or negative, is a measure of variability. So rather than taking the sum of the deviations, it makes sense to take the sum of the absolute deviations.

The absolute deviation is just the deviation without the sign (plus or minus). For example: \(|-0.8| = 0.8\), and \(|+1.2| = 1.2\). The "\(|x|\)" means "the absolute value of \(x\)". Absolute values are non-negative, by definition.

Since the absolute deviations are always positive, their sum is a measure of the variability of the data.

Let's see how it pans out. Given the outlier we expect this measure to be substantially bigger in group 2 as compared to group 1.

For group 1 we have:

(mean1 <- mean(group1))
## [1] 10.8
group1dev    <- group1 - mean1
group1absdev <- abs(group1 - mean1)
(cbind(group1dev,group1absdev))
##       group1dev group1absdev
##  [1,]      -0.8          0.8
##  [2,]      -2.8          2.8
##  [3,]       1.2          1.2
##  [4,]       4.2          4.2
##  [5,]      -3.8          3.8
##  [6,]      -2.8          2.8
##  [7,]       0.2          0.2
##  [8,]      -0.8          0.8
##  [9,]       3.2          3.2
## [10,]       2.2          2.2
sum(group1dev)
## [1] -7.105427e-15
sum(group1absdev)
## [1] 22

For group 2 we have:

(mean2 <- mean(group2))
## [1] 22.5
group2dev    <- group2 - mean2
group2absdev <- abs(group2 - mean2)
(cbind(group2dev,group2absdev))
##       group2dev group2absdev
##  [1,]     -12.5         12.5
##  [2,]     -14.5         14.5
##  [3,]     -10.5         10.5
##  [4,]      -7.5          7.5
##  [5,]     -15.5         15.5
##  [6,]     -14.5         14.5
##  [7,]     -11.5         11.5
##  [8,]     -12.5         12.5
##  [9,]      -8.5          8.5
## [10,]     107.5        107.5
sum(group2dev)
## [1] 0
sum(group2absdev)
## [1] 215

Since the sum of absolute deviations will go up with the number of observations, a measure to compare groups of different sizes would be to divide the sum by the number of observations, to obtain the mean absolute deviation.
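In R, this boils down to one line per group; the results (2.2 versus 21.5) again show the impact of the outlier:

mean(abs(group1 - mean(group1)))  # 22/10  = 2.2
mean(abs(group2 - mean(group2)))  # 215/10 = 21.5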

Standard deviation

For a variety of reasons, the (mean) absolute deviation is not used often.

An alternative way to deal with the challenge of negative deviations is to square them. As we have learned, the product of two negative numbers gives a positive number:

\(-a * -a = a^{2}\)

\(-0.2 * -0.2 = 0.04\)

\(-0.8 * -0.8 = 0.64\)

Note that the effect of squaring is that larger deviations get a bigger weight. While (absolute) deviations of 0.2 and 0.8 differ by a factor 4, their squared counterparts differ by a factor of 16! That is, squared deviations are more sensitive to observations at the extreme ends of the distribution.
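A one-line check of this factor-of-16 claim:

c(0.2, 0.8)^2   # 0.04 and 0.64
0.64 / 0.04     # a factor of 16, versus a factor of 4 on the original scale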

The procedure for establishing a (mean) squared deviation is similar to what we did for the absolute deviations.

Let's do this step-by-step, to familiarize yourself with the concept of the standard deviation - which is vital to many of the techniques you will encounter in statistics and data science.

(mean1 <- mean(group1))
## [1] 10.8
group1dev    <- group1 - mean1
group1sqdev  <- group1dev * group1dev
(cbind(group1dev,group1sqdev))
##       group1dev group1sqdev
##  [1,]      -0.8        0.64
##  [2,]      -2.8        7.84
##  [3,]       1.2        1.44
##  [4,]       4.2       17.64
##  [5,]      -3.8       14.44
##  [6,]      -2.8        7.84
##  [7,]       0.2        0.04
##  [8,]      -0.8        0.64
##  [9,]       3.2       10.24
## [10,]       2.2        4.84
sum(group1sqdev)
## [1] 65.6

The sum of squared deviations will increase with the number of observations. Therefore, it is more meaningful to take the average, which allows comparing this statistic across groups (or data sets) of unequal size.

The result, the mean of the squared deviations, is what we call the variance.

varg1 <- sum(group1sqdev)/length(group1)
cat("The variance of incomes in group 1 is",varg1)
## The variance of incomes in group 1 is 6.56

A disadvantage of the variance measure is that, by squaring the deviations, the result is in squared units (dollars squared, if income is measured in dollars). In order to get a result in the original units, we take the square root.

\(\text{Variance} = \sigma^2 = \dfrac{\text{Sum of Squared Deviations}}{n}\)

\(\text{Standard deviation} = \sigma = \sqrt{\text{Variance}}\)

As you would expect, R has functions for variance and sd: var(), and sd().

var(group1)
## [1] 7.288889
sd(group1)
## [1] 2.699794

Oops!!

The var() function gives a different result from ours. How come?

It is typically not possible to measure entire populations. Researchers use samples from populations in order to estimate the parameters (like the mean and the variance) of the population. Smart statisticians have found that variances based on samples underestimate the population variance. This is called bias.

Luckily, there's a simple way to set things right and get an unbiased estimate of the population variance. Rather than dividing by the size of the sample (\(n\)), we divide by \(n-1\). In large samples, this correction is negligible, as \(n \approx n-1\), but in our small set of incomes of 10 group members, it makes a difference.

The functions var() and sd() use the correction for bias, assuming (rightly or wrongly) that our data are sample data. Since, in data science, we are relying on large volumes of data (say, big data) we don't have to worry about it. But for the sake of completeness, let's make an amendment to our own earlier calculations.

varg1 <- sum(group1sqdev)/(length(group1)-1)
cat("The variance of incomes in group 1 is",varg1,"\nAnd the standard deviation is", sqrt(varg1))
## The variance of incomes in group 1 is 7.288889 
## And the standard deviation is 2.699794

Exercises

Exercise 1

Below are 7 scores from employee performance appraisals of a company. The 7 scores can be considered a sample from the large population of employees of the company.

Compute, for each of the four aspects of employee performance:

  • The mean and the median
  • The minimum, maximum and range
  • The variance and standard deviation

Solution

It is always handy to put these data in a data frame.

jk <- c(2,3,4,5,7,7,3)
cs <- c(3,2,4,4,6,6,7)
at <- c(1,4,4,6,5,6,2)
di <- c(3,6,3,7,7,7,1)
appraisal <- data.frame(jk,cs,at,di); appraisal

It is, of course, perfectly fine to replicate some of the R-code we have used above to illustrate the concepts. Actually, we would encourage you to do so, for two reasons. Firstly, you will familiarize yourself with using R-code, and secondly, doing it step-by-step gives you a deeper understanding of what R is doing behind the scenes.
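For example, a short sketch that applies the base R functions column by column, using sapply():

sapply(appraisal, mean)
sapply(appraisal, median)
sapply(appraisal, min)
sapply(appraisal, max)
sapply(appraisal, max) - sapply(appraisal, min)  # the range per variable
sapply(appraisal, var)                           # sample variance (division by n-1)
sapply(appraisal, sd)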

But once you know what you are doing, and fully understand the various measures of central tendency and variability, it is better to find a package that does all of the work for you. A great source for popular packages is Quick-R. There you can find, among many other things, packages for descriptive statistics. One of these packages is pastecs, which has many functions including stat.desc().

You have to install pastecs before using it. After installing it, you can invoke its functions by running the library(pastecs) command. The stat.desc() function has options that are TRUE by default. See for yourself what happens if you switch them off by setting them to FALSE!

Again, we use the round() function to create reader-friendly output (2 decimals, rather than the default of many decimals).

Note that we have used T, as short for TRUE; likewise, you can use F for FALSE. These abbreviations are regarded as bad practice in programming, as T and F can be used as object names, which may confuse R.

# install.packages("pastecs")
library(pastecs)
round(stat.desc(appraisal, basic=T,desc=T),2)

Guiding you through the output:

  • nbr.val is short for the number of values (7 valid observations, for all 4 variables)
  • nbr.null and nbr.na are the numbers of null values and not-available values. In stat.desc(), a null value is a value equal to zero, while an na is a value that for some reason was not recorded - what we usually call a missing value. In almost all projects you will encounter na's but no null's.
  • min and max are the lowest and highest values for each of the four variables. They are all in the range from 1 to 7, which is OK as the appraisal employs a 7-point scale
  • The range is simply \(max - min\)
  • The sum is the sum of the values. As the sum is highest for discipline (34), the employees score better on that aspect than on the other three. The sum itself, however, has no natural interpretation here; adding up the incomes of individuals to a group income, in contrast, would make sense.
  • The median and mean are the same values that would result from using mean() and median(), for all the variables separately. Applying the stat.desc() function to a data frame, is considerably quicker!

Note that, as measures of central tendency, the median and the mean are quite similar for three of the four aspects (or variables). However, as we have argued before, the median is not sensitive to outliers. In discipline there is an outlier (1) at the bottom of the distribution. More than half of the values are 6 or higher, implying a median of 6. The mean is sensitive to the outlying value of 1, which brings the group mean down to 4.86. In general, \(median < mean\) is typical for right-skewed distributions, while \(median > mean\) (like here) indicates a left-skewed distribution, with some pretty low values (potential outliers). The definition and detection of outliers is a topic by itself, to which we will turn as the need arises.

  • We will not discuss SE.mean and CI.mean, here.
  • var and st.dev give the same results as var() and sd()

Exercise 2

Employee gender, employee salary, and sales by employee are provided below for company XYZ. In addition, the company uses productivity ratios (sales divided by salary), which still have to be computed from the available data.

Compute all appropriate measures of central tendency and variability.