This tutorial is Part 2 of my Stats in R tutorial series. If you missed the first one, you can check it out at the following link. Today I will illustrate the following:
There are many types of variables, and it is important to recognize what they are and how they behave, especially in R, which is sensitive to coding mistakes that arise from using them improperly. For most of this tutorial we will just use discrete data, but it is good to be aware of what other variable types exist and how to handle them before moving on to other analyses in R.
There are generally two types of numeric values. Discrete variables take on whole number values. Typically people refer to discrete variables as countable. Continuous variables on the other hand can take on any value, which includes decimals.
I have never found this explanation all that useful to be honest, because it assumes that you can’t “count up” other numeric values in other ways that don’t involve discrete variables (summing decimal values is a form of counting). The dumb mnemonic I have used in the past to remember this is that discrete values are sort of like secret agents. To be discrete, one must count down from planting a bomb in a bad guy’s evil lair by using whole numbers. Whole numbers like 5, 4, 3, 2….
Continuous values are different in that they can include decimals. They are like my neighbor Bill. He continues to say whatever he thinks to the chagrin of everyone around him. Whether it’s the 42.5 meters of soil he had to pack in his yard or the 2.5 inches of rain that came down on it the next day, it never ceases to amaze me how continuous his ramblings become…
Additionally, numeric data can be classified as either interval or ratio. Interval data has known, equal intervals (IQ of 70, 80, 90, etc.). Ratio data is like interval data, but it contains an absolute zero. Temperature in Fahrenheit, for example, cannot be considered ratio data because zero does not represent a true absence of temperature (values like -32 degrees are possible), but it can be considered interval data because each degree is equal in length.
Enough about that. Let’s create some vectors like in our last
tutorial. We will label one discrete and the other
continuous.
discrete <- c(8, 2, 0, 1, 8)
continuous <- c(4.5, 3.1, 8, 9.7, 1.2)
discrete
## [1] 8 2 0 1 8
continuous
## [1] 4.5 3.1 8.0 9.7 1.2
Spot something? If any of the values in a vector are decimals, the whole numbers will be displayed as decimals too. In the case of our continuous vector, the number 8 was automatically converted to 8.0, whereas the 8 in the discrete vector remained unchanged. One last thing…you can explicitly tell R to store numbers as integers (aka whole numbers) by simply adding an L to each number, like the numbers below.
l.number <- c(24L, 3L, 10L, 5L, 4L)
l.number
## [1] 24 3 10 5 4
Nominal variables are usually easy to remember because they have a name and “name” sounds like “nominal”, so that is a good way of remembering. These are sometimes also called categorical variables. However, this is slightly confusing in the context of R, as they can sometimes be lumped in with ordinal variables, which are variables that have ranks. This is because both can be represented by what are called factor variables in R. In R, there are two major ways to store such variables: as character vectors or as factors.
As an example, below we create the same nominal variable with each coding: one as a character vector and one as a factor.
var.char <- c("First","Second","Third")
var.fac <- factor(c("First","Second","Third"))
If we run var.char, we get this text string by
itself:
var.char
## [1] "First" "Second" "Third"
If we run var.fac, we get the same string, but it
specifies the levels.
var.fac
## [1] First Second Third
## Levels: First Second Third
This is not a major issue right now, but it can pose problems if you have numbers that are stored as factors when they should be treated as numeric. For now, I will mostly just note their existence, though a quick illustration of the kind of problem I mean is shown below.
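As a minimal sketch (the object name num.fac is just for illustration), converting a factor of numbers directly to numeric returns the underlying level codes rather than the original values:
num.fac <- factor(c(10, 20, 30))
as.numeric(num.fac)                # returns the level codes 1 2 3, not the original numbers
as.numeric(as.character(num.fac))  # returns 10 20 30 as intended
For now, just keep this quirk in mind.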
Though logical values are not usually defined in the context of the ordering of variables in statistics, they are important to know about. Two logical values you should know are TRUE and FALSE, otherwise denoted as T or F in R with no difference in functionality. These are often used as “switches” for turning function arguments on or off (as shown in the previous tutorial and later in this one), but they can also function as logical tests. For example, let’s see if \(2 + 2 = 5\). Make sure to use == rather than a single =, or R will try to assign a value to an object.
2 + 2 == 5
## [1] FALSE
What if we try the logical \(2 + 2 = 4\) instead?
2 + 2 == 4
## [1] TRUE
You can also use this to test the relations of objects to each other.
object.1 <- (400 + 20)/5
object.2 <- (20 + 400)/5
object.1 == object.2
## [1] TRUE
object.1 > object.2
## [1] FALSE
Probably the most important logical value is a special case: NA, which stands for “not available.” This denotes missing data, and if you are not careful it can cause major issues. As discussed later on, vectors that contain missing values can cause problems with calculations if the NAs are not removed. To check where missing data is located in a vector, one can use is.na.
vec.na <- c(4,3,NA,NA,1)
is.na(vec.na)
## [1] FALSE FALSE TRUE TRUE FALSE
The logical values we learned earlier are now put to use here. The
is.na function has told us that only locations 3 and 4 in
the vector are missing, whereas the rest of the data is “observed”. What
if we want to remove missing values from a vector? We can use the
na.omit function.
na.omit(vec.na)
## [1] 4 3 1
## attr(,"na.action")
## [1] 3 4
## attr(,"class")
## [1] "omit"
You may have also noticed some operators already in use that allow comparison. These are:
- >: greater than
- >=: greater than or equal to
- <: less than
- <=: less than or equal to
- ==: equal to
- !=: not equal to

You can try these out with the logical tests we used earlier; a couple of quick examples with operators we have not used yet are shown below. After that, we will move on to samples and populations.
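As a quick illustration (the numbers here are arbitrary), here are a few of the operators we did not use in the earlier tests:
10 >= 10   # returns TRUE
3 != 4     # returns TRUE
5 <= 2     # returns FALSE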
A population is the set of all individuals of interest in a study. If we were interested in every smoker in the world, this would be a population of smokers. A parameter is a numerical value that describes the population, such as the mean number of cigarettes smoked by this population. A sample, by comparison, is a small subset of this population that is supposed to represent in some way the entire population. If we randomly chose 50 smokers from a restaurant, this would be a sample of the population of all smokers. Numerical values from a sample are called statistics. Today’s topic will mostly cover descriptive statistics, which are directly measured from a sample. However, most of statistics also includes inferential statistics, which are used to draw inference about a population indirectly.
We make this distinction for important reasons. First, we almost never know what the actual population parameters are, so we use samples to try to approximate their actual value. Second, because samples are only a portion of the population, sample statistics are more prone to bias: their variance will not be exactly the same as that of the actual population, which introduces error into our measurements. Scientists try to minimize this error as best as possible when sampling from the population, as they want accurate predictions from their models.
A butterfly expert may sample a subset of butterflies to make inferences on the entire population of butterflies.
By the end of this tutorial, you will simulate a population as well as sample from that population. For now, let’s move on to some descriptive statistics, which will be useful for basic data analysis as well as understanding future inferential statistics.
Many of these terms you have probably learned in math already. Nevertheless, it will be helpful to review them here to guide you in our later stats tutorials. These concepts will also be used for our simulated data later on. We can start off fairly light by deriving the sum of a set of numbers, defined by the following formula:
\[ {\sum_{i=1}^{n}}x_i = x_1 + x_2 + ... + x_n \]
where \(i\) is the index of the number (first number, second number, etc.) and \(n\) is the last number (the “upper limit” of summation). The \(\Sigma\) notation here is also important…it is used often to denote summation, and here we are just talking about the sum of x, or \(\Sigma{x}\). Put simply, we are just adding numbers like you are probably used to. We can do this in R by either typing a sum out manually or saving the summation as an object and printing its result. We will try both below to see if they match. First we will derive \(\Sigma{x} = x_1 + x_2 + x_3\) or explicitly \(\Sigma{x} = 2 + 3 + 6\).
2 + 3 + 6
## [1] 11
x.sum <- 2 + 3 + 6
x.sum
## [1] 11
Of course manually typing out the numbers isn’t super helpful, and many operations in R may require stringing together sums to do complex actions. Instead of doing this by hand, we can also just use the sum function. This time we will create a vector (vectors were previously discussed in Tutorial 1 if you want a review). We will then sum its numbers with the sum function.
x1 <- c(60,33,56)
sum(x1)
## [1] 149
The result should be \(149\). One very important note that applies to many vectorized math functions in R…if you have a missing value NA, the function will return NA unless you tell it not to. An example is shown below.
x.na <- c(4,NA,7)
sum(x.na)
## [1] NA
This is of course problematic if you just want \(4 + 7\). To avoid this, many of these
arithmetic functions have an argument called na.rm. Set
na.rm = T or na.rm = TRUE to get rid of the
missing value.
sum(x.na, na.rm = T)
## [1] 11
The mean, one of the three most commonly taught measures of central tendency, is closely related to summation. In a normally distributed sample, the mean has the convenient property of telling us where the data is centrally located, or where most of the data “piles up.” It is also used in the formulas of inferential tests such as t-tests and ANOVAs. The mean of a vector is defined as:
\[ \bar{x} = \frac{\Sigma{x}}{n} \]
The \(\bar{x}\) here is simply notation for the mean of \(x\). To simplify, the mean (\(\bar{x}\)) is the sum of the terms (\(\Sigma{x}\)) divided by the number of terms (\(n\)). We can create another vector to test this out.
x2 <- c(10,60,70)
n <- 3
(sum(x2))/n
## [1] 46.66667
Take note of your parentheses…if you don’t carefully wrap the functions, the order of operations in the math will be wrong. Writing sum(x2)/n is equally valid here, but remember that multiplication and division are applied before addition and subtraction, so in a longer expression only the term directly next to the division sign gets divided unless you add parentheses. Here I explicitly wrap the sum of x2 as the numerator and then divide it by the denominator n. However, we can speed this process up by simply using the function mean.
mean(x2)
## [1] 46.66667
Again, if there are any NA values present, you will need to make sure
you remove them with the na.rm = T argument.
x.na <- c(NA,5,4,7,NA)
mean(x.na, na.rm = T)
## [1] 5.333333
The median of a vector can be found one of two ways. If the number of terms in a vector is odd, one just has to find the middle-most point. For example, if we have a vector \(x\) like below:
\[ x = \begin{bmatrix} 33,82,100 \end{bmatrix} \]
Then \(82\) would be its median. However, when the vector has an even number of terms like below:
\[ x = \begin{bmatrix} 33,82,100,300 \end{bmatrix} \]
You take the middle two values, sum them, then divide by 2. In other words:
\[ \frac{82+100}{2} = 91 \]
Note that when doing this, one must always sort the numbers in numerical order, or else it is difficult to obtain this number. As always, we can do this manually, either by typing the numbers ourselves:
x3 <- c(1,5,9,20)
(5+9)/2
## [1] 7
…or by using vector notation (discussed in Tutorial 1). Since we are using a vector and not a matrix, we just have to specify one location. The below vector’s two middle points are located at location 2 and location 3, as specified below:
sum(x3[2]+x3[3])/2
## [1] 7
You can see that both ways get you the same thing, but again this is
cumbersome. We can instead use the median function in R
(and use na.rm = T if there are NA values).
median(x3)
## [1] 7
Deriving the mode is quite simple. You just need to find the value that has the highest frequency. While there is no native function in R for finding the statistical mode, you can quickly tabulate data with the table function to see the frequency of each value.
x4 <- c(40,40,40,30,10,10,20)
table(x4)
## x4
## 10 20 30 40
## 2 1 1 3
We can see our mode here is \(40\). Be careful when doing this with continuous data, as decimal amounts can sometimes lead to misleading mode values. The table function also works on categorical data like factor or character vectors, but to expand our R vocabulary we can instead load a library and use the count function from plyr, which returns each value and its frequency as a data frame. Here is how it’s done.
library(plyr)
x5 <- c("cat","cat","cat","monkey","horse")
count(x5)
## x freq
## 1 cat 3
## 2 horse 1
## 3 monkey 1
There are other more advanced ways of getting counts, including just creating your own mode function; a quick sketch of one is shown below. After that, we will keep things simple and move on to measures of spread.
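As a minimal sketch (the name get.mode is just illustrative), a homemade mode function can tabulate the values and return whichever appears most often:
get.mode <- function(x){
  counts <- table(x)                    # frequency of each value
  names(counts)[counts == max(counts)]  # value(s) with the highest frequency
}
get.mode(x4)   # should return "40"
get.mode(x5)   # should return "cat"
Note that it returns the winning value(s) as text, and will return more than one value if there is a tie.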
Measures of spread tell us to what degree our data varies. The plot below shows a visual depiction of fluctuation in data points.
Plot depicting raw values and their distance from the mean.
The red line represents the mean, and the black dots are numbers from a given vector. If we trace how far away they are from the mean, we get the black lines that are shown in the plot. This should give you some idea of how consistent or inconsistent data can be. Some data points are quite close to the line, while others are quite far away. The distance from the mean represents the natural variance in the data and tells us how consistent these values are.
We can first get a very basic idea of what the spread is by checking the range. This is defined as:
\[ range(x) = max(x) - min(x) \]
We simply have to subtract the lowest value of \(x\) from its highest value. If, for example, we have a vector \(x\) like so:
\[ x = \begin{bmatrix} 4,67,99,302 \end{bmatrix} \]
Our range would be:
\[ range(x) = 302 - 4 = 298 \]
One would think the range function would automatically calculate this for us…
x6 <- c(4,67,99,302)
range(x6)
## [1] 4 302
However, you can see that it only returns the two values needed for the range (the minimum and maximum), not the range itself. We can create a quick function ourselves, which will just plug in what we did mathematically above.
better.range <- function(x){
range <- max(x)-min(x)
return(range)
}
Remember from Tutorial 1 how this was accomplished? We first name the function better.range, define its input (our vector) as x, save the calculation of the range as range, then return this value when we run the function.
better.range(x6)
## [1] 298
Sometimes when you want a job done, you gotta do it yourself. In any case, we can see that the range is 298. This means there is a distance of nearly 300 between our lowest and highest observed values, which seems to indicate a wide spread of data. However, this still doesn’t say much on its own.
Another measure of spread is the interquartile range (IQR). This is calculated with the following formula:
\[ \text{IQR} = \text{Q3} - \text{Q1} \]
\(Q\) here stands for a given quartile. Quartiles divide the data (more generically, they are a type of “quantile”) into four parts: Q1 is the value below which 25% of the data fall, Q2 is the 50% point (the median), Q3 is the value below which 75% of the data fall, and Q4 is the maximum. The IQR is useful because the middle 50% of our data lies between Q1 and Q3, giving us some perspective of where most of our data should be, as well as how wide that range is. The IQR also gives us a convenient way of flagging outliers, which are discussed later.
A simpler way of thinking about this is a median-based approach. I show another vector as an example:
\[ x = \begin{bmatrix} 10,33,50,341,987,2006,2008 \end{bmatrix} \] The median here is \(341\). To find Q1, we simply take the median of the values below 341. Here that would be 33. To find Q3, we do the same for the values above 341. This would be 2006. So our IQR can be calculated as so:
\[
\text{IQR}(x) = 2006 - 33 = 1973
\] The IQR function calculates this with ease; to match the median-based calculation above, one must select the quantile type by including type = 1.
x7 <- c(10,33,50,341,987,2006,2008)
IQR(x7, type = 1)
## [1] 1973
Once again, the IQR doesn’t say much by itself. But now that we know what the IQR and quartiles are, we can intuitively interpret one of the most commonly used plots in statistics…the boxplot.
A boxplot contains the following metrics:

- the median (the thick bar in the middle of the box)
- the IQR (the box itself, stretching from Q1 to Q3)
- whiskers extending to the minimum and maximum non-outlier values
- any outliers, plotted as individual points
We haven’t spoken much about outliers so far, but we will use a boxplot first to show why they are useful. A boxplot can be made pretty quickly with the boxplot function, though it is also quite basic on its own.
x8 <- c(5,12,22,48,52,67,90,111,202,300,544,900,1200)
boxplot(x8)
This boxplot is pretty ugly, so let’s prettify it a bit before explaining what it actually does. We will set col = "steelblue" to change the color, change the name of the plot with main, flip it sideways with horizontal = T, slightly change the width of the box with boxwex, and label the x-axis with xlab.
boxplot(x8,
col = "steelblue",
main = "Boxplot of Data",
horizontal = T,
xlab = "Value",
boxwex = .7)
Let’s look at this plot now. The thick black bar in the middle of the box is the median. The width of the blue box is the IQR, with Q1 located at the left edge of the box (sometimes called the lower “hinge”) and Q3 at the right edge. The bars furthest from the box (connected to it by dashed lines called “whiskers”) normally mark the minimum and maximum non-outlier values. However, we have two suspicious looking dots all the way to the right. These are the dreaded outliers we have been avoiding talking about until now.
Outliers are data points that are considered extreme and may unduly affect the distribution of your data. Outliers are flagged by different criteria depending on the program and purpose, but one way of identifying outliers for a single variable is using the 1.5 Rule. To calculate outliers that are far higher than your expected data, you use the following formula:
\[ \text{higher range outlier} = \text{Q3} + 1.5\times{\text{IQR}} \]
For finding an outlier well below what we would expect, a similar formula can be used:
\[ \text{lower range outlier} = \text{Q1} - 1.5\times{\text{IQR}} \]
We can check if this matches up with our boxplot. First, we will
split up our data by quartiles using the quantile function.
We specify Q1 with .25, the median with .50, and Q3 with .75. We will
then convert the output into a matrix so it removes unnecessary column
names. We will then derive the IQR, calculate the 1.5 cutoff, and then
add/subtract to the appropriate quartile.
q <- quantile(x8,
c(.25,.50,.75))
q.matrix <- matrix(q)
iqr.criterion <- IQR(x8, type = 1)
cutoff <- iqr.criterion*1.5
upper.outlier <- q.matrix[3] + cutoff
lower.outlier <- q.matrix[1] - cutoff
We will first print upper.outlier to see what is
considered an outlier above the IQR.
upper.outlier
## [1] 678
We get a value of 678, and the two dots to the right of the boxplot clearly exceed this cutoff, which is why they are flagged as outliers. What about outliers below our IQR?
lower.outlier
## [1] -330
We get a value of -330. None of our data falls below this value, so we are safe from outliers on the low end. Now let’s look at our boxplot with all this information before us.
A normally distributed variable will have a boxplot with few outliers, a median close to the middle of the box, and whiskers of roughly equal length. Here we can see our data is fairly skewed to the right. We will save the discussion of skew for later, but just know for now that the data is not normally distributed.
As a final note, one should not use the outlier method discussed above as a golden rule. You should check to see if the data is meaningful or if it is something worthy of removal, such as a data entry error that is obviously incorrect.
Before we can explore variance, we need to know what deviance is. Deviance can mean different things depending on the context, but in terms of the mean, it is simply how much our raw data deviates from the mean, or in geometric terms, its distance from the mean. This is defined as:
\[ \text{deviance} = x - \bar{x} \]
This is of course easy to obtain in R.
x9 <- c(3,6,9,1,0,3)
dev <- x9 - mean(x9)
dev
## [1] -0.6666667 2.3333333 5.3333333 -2.6666667 -3.6666667 -0.6666667
Notice though that if we sum the deviance in its current form:
sum(dev)
## [1] 1.332268e-15
We get zero! The output may look strange, but this is a result of floating point arithmetic, a subject too complicated for this tutorial. The important part is that when a calculation should come out to exactly zero, you will sometimes instead get a tiny number in scientific notation, as close to zero as the computer’s binary representation allows. You can hand calculate the sum and see that this poses an issue if we want to understand variance, as we know the vector doesn’t vary by zero. A quick demonstration of this floating point behavior is shown below; after that, we square the deviance values to turn them into positive quantities that can be meaningfully summed.
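As a brief aside (this snippet is not part of the variance calculation), the same floating point issue shows up in even simpler arithmetic, and the all.equal function offers a tolerant way to compare such values:
0.1 + 0.2 == 0.3                 # returns FALSE due to floating point error
all.equal(0.1 + 0.2, 0.3)        # returns TRUE, since it compares within a small tolerance
isTRUE(all.equal(sum(dev), 0))   # returns TRUE; our deviance sum is effectively zero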
squares <- dev^2
Now that our data is squared, we just sum them. This is called the sum of squares and is an important part of later statistical calculations.
sum.squares <- sum(squares)
Now we have our sum of squares, but this still doesn’t say much about our data. We need to divide it by the number of observations. If we somehow had measurements for an entire population, we would calculate its variance directly with the following formula:
\[ \sigma^2 = \frac{\Sigma(x-\mu)^2}{N} \] where \(\sigma^2\) is the population variance, \(\mu\) is the population mean, and \(N\) is the population size. However, since we are using a sample (our vector here), this is not quite as accurate as we would like. Because the variability in a sample tends to underestimate the variability of the population it came from, we can’t just divide by \(n\). Instead we divide by \(n-1\) to reduce the bias in the estimate.
\[ s^2 = \frac{\Sigma(x-\bar{x})^2}{n-1} \]
Remember to use length here to calculate the number of
observations in the vector (our \(n\)
in this case).
variance <- (sum.squares)/(length(x9)-1)
variance
## [1] 11.06667
As always there is an easier way in R. Just use the var
function instead.
var(x9)
## [1] 11.06667
And you see the number is the same.
The standard deviation only requires one additional step. You just take the square root of the variance.
\[ s = \sqrt{\frac{\Sigma(x-\bar{x})^2}{n-1}} \]
We can do this with our previously estimated variance object by using the sqrt function to take its square root.
st.dev <- sqrt(variance)
st.dev
## [1] 3.32666
To check if we did this correctly, we can compare it to the native
sd function, which automatically calculates standard
deviation for a vector.
sd(x9)
## [1] 3.32666
Once more, they are the same.
Now that we know standard deviation, we can delve into z-scores. Z-scores are essentially how many standard deviations a raw score is from the mean. Z-scores are useful because, unlike raw scores, they give you an approximation of a score’s location in the distribution. A z-score of 1, for example, places a data point at roughly the 84th percentile of a normally distributed set of scores (we will verify this with the pnorm function shortly). The formula for z-scores is:
\[ z = \frac{x-\bar{x}}{s} \] In other words, you take a raw score, subtract the mean from the score, and then divide by the standard deviation. The manual way of doing that is shown below.
x10 <- c(10,30,44,30,22,1)
z <- (x10-mean(x10))/sd(x10)
z
## [1] -0.83087290 0.46399396 1.37040076 0.46399396 -0.05395279 -1.41356298
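To check the percentile claim from earlier, we can feed a z-score into the pnorm function, which returns the proportion of a standard normal distribution that falls below that value (this aside assumes the data is roughly normal):
pnorm(1)      # roughly 0.84, i.e., a z-score of 1 sits near the 84th percentile
pnorm(z[1])   # the approximate percentile of the first score in x10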
Returning to our vector, the scale function achieves the same result as the manual calculation, though its output is in an array format.
scale(x10)
## [,1]
## [1,] -0.83087290
## [2,] 0.46399396
## [3,] 1.37040076
## [4,] 0.46399396
## [5,] -0.05395279
## [6,] -1.41356298
## attr(,"scaled:center")
## [1] 22.83333
## attr(,"scaled:scale")
## [1] 15.4456
To convert a z-score to a raw score, one uses the formula below:
\[
x = \bar{x} + (z\times{s})
\] The first z-score in our x10 vector is -0.83. The
mean of the vector is 22.83 and the standard deviation is 15.45. Thus to
obtain the raw score, we use:
\[ x = 22.83 + (-0.83 \times{15.45}) \] I rounded the values here so it isn’t as precise, but you should get an \(x\) that is close to 10, which is the original score for the first number in our vector.
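We can also double-check this in R with the unrounded values, which recovers the original score exactly (up to floating point precision):
mean(x10) + (z[1] * sd(x10))   # should return 10, the first raw score in x10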
The final topic before we move on to simulation is a couple of measures that help define the shape of our distributions. One is called skew, which is how asymmetrical the data is. A typical normal distribution looks something like a bell curve. However, a positively skewed distribution has most of its data on the left side of the graph with a “tail” of data on the right. Contrarily, a negatively skewed distribution has its data clustered to the right of the plot with a tail to the left. Examples of a normal distribution, positive skew, and negative skew are shown below.
Examples of skew.
To check skew, we can use a density plot to see what its shape looks like. First, we will estimate the general shape of the distribution with density, then plot it with R’s basic plot function. We will also give it a title with the main argument.
x <- c(0.1,0.3,0.7,1,1,1,2,4,6,8,10,2,2,2,3,3,3,5)
d <- density(x)
plot(d,
main = "Right-Skewed Distribution")
We can see that the data is indeed quite skewed. However, its kurtosis, or the sharpness of its peak (and heaviness of its tails), does not appear problematic. There may be issues with the distribution if either of these conditions is extreme. If your data shows substantial skew or excess kurtosis, it is not normally distributed, and you may need to apply either non-parametric tests or some transformation to the data to fix it. A quick way of putting numbers on these two shape properties is shown below.
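As a hedged sketch (this assumes the moments package is installed, which is not used elsewhere in this tutorial), the skewness and kurtosis functions give numeric summaries of these shape properties:
# install.packages("moments") if it is not already installed
library(moments)
skewness(x)   # positive values indicate right (positive) skew
kurtosis(x)   # values well above 3 suggest a sharper peak / heavier tails than normal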
Now that we have spent a heavy amount of time on distributions,
descriptive functions, and variables, we can go ahead and simulate our
first population. As of 2022, there were around 300,000 residents in
St. Louis, Missouri. Let’s assume their mean weight was around 200
pounds and this deviated between residents by about 25 pounds. We can
simulate this in R fairly quickly by specifying these parameters in the
rnorm function. This function simulates data from a normal
distribution with the following arguments:
- n: the total number of data points.
- mean: the mean of these data points.
- sd: the standard deviation of these data points.

Great news! We have covered all of this already. Below we simply simulate this data with the arguments given above. First, we set a random seed, which can be any number (I just use “123” here). This initiates a pseudo-random number generator so that the simulated data can be reproduced by anyone who runs the same code. You can choose to skip this step if you do not need the results to be reproducible. Whichever you choose, we will save an object called population that simulates the data using the arguments we mentioned.
set.seed(123)
population <- rnorm(n = 300000,
mean = 200,
sd = 25)
Now this isn’t super useful on its own. However, if we plot this data, it may make it more tangible. In the previous tutorial, I showed how to make a histogram. Using the same method, we will produce a histogram of the simulated data. We will use the hist function, specify that our data is population, color the plot steel blue, and give it both an x-axis label and a title.
hist(x = population,
col = "steelblue",
xlab = "Weight (lbs)",
main = "Weight Distribution in St. Louis")
You can see that the population values are quite symmetrical. This is the famous “bell curve” or Gaussian distribution, which large collections of data often (but, I must stress, not always) approximate. Remember that, because we simulated this population ourselves, its parameters are known. We can check them directly with the functions we have learned so far. We should end up with a mean of 200 and a standard deviation of 25.
mean(population)
## [1] 200.0466
sd(population)
## [1] 25.0286
This seems to match up pretty closely to what we started with. Now let us pretend that we have randomly sampled around 50 residents of St. Louis for some study based on the same data. We can sample from our previously simulated population of residents by using the sample function. This function randomly selects a pre-specified number of data points. Simply dictate where you sample from and how many observations you want to draw.
sample <- sample(population,
size = 50)
Then we will run a similar histogram.
hist(x = sample,
col = "steelblue",
xlab = "Weight (lbs)",
main = "Weight Distribution in St. Louis Sample (n = 50)")
You can see the distribution of data looks different. There is more dramatic variation in values, and the frequency of data points changes depending on where you look in the distribution. However, you will notice that it is not vastly different. If a sample is representative of a population (which is never guaranteed), it should share at least some characteristics with the population. You can see here that despite the distribution looking quite different, the mean is at least close to the original population value. This is why one should always be mindful of where they sample data from. If it is not characteristic of the population at large, it can vary dramatically from the true population values.
To reiterate how sample statistics vary from population parameters, let’s check the mean and standard deviation again.
mean(sample)
## [1] 194.109
sd(sample)
## [1] 18.60818
We can see that the sample varies somewhat from the population. Because it was drawn randomly from a well-behaved population, the difference is not extreme, but real research rarely obtains such a convenient sample. Consider this when conducting statistics, as error is easy to introduce and difficult to prevent.
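As a final, small illustration (this snippet is my addition, not part of the original walkthrough), we can draw several more samples of 50 and watch the sample means fluctuate around the population mean of 200; the exact values will differ on every run:
replicate(5, mean(sample(population, size = 50)))   # five new sample means, each near but not exactly 200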
Hopefully this tutorial has taught you a number of useful statistical concepts and functions in R. Feel free to leave a comment below if you liked the content and look out for Part 3 in the near future.
Check out my other RPubs if you want to learn more about stats in R.
Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful and you would like to support me, please consider buying me a cup of coffee.
Enjoy your day and happy coding!