2025-11-17

The importance of proper sampling

In statistics, proper sampling methods are essential to the unbiased collection of data. Improper sampling can go wrong in several ways, such as not collecting data truly randomly when random collection is expected, collecting too small a sample, or collecting data from a single population and generalizing the results to multiple populations.

Example

A data collector could survey all people walking out of the campus gym and ask if they believe school funding should be put toward gym equipment. Using the results of this survey to support a claim that ‘students’ on campus want school funding to go toward new gym equipment would be biased, because no other students were surveyed. The collector could say that the gym-going students surveyed on Tuesday of the week of November 23rd support school funding going toward new gym equipment, but they could not generalize this result to all students on campus.
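To make the bias in this example concrete, here is a minimal simulation sketch in R. Every number in it (campus size, share of gym-goers, support rates, sample size) is a hypothetical assumption chosen for illustration, not a value from the survey in this presentation. It compares a gym-only sample with a simple random sample of the whole campus.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical campus: none of these numbers come from the survey
n_students  <- 10000
is_gym_goer <- runif(n_students) < 0.20         # assume about 20% use the gym
p_support   <- ifelse(is_gym_goer, 0.75, 0.30)  # assumed support rates per group
supports    <- runif(n_students) < p_support

# Support across the whole simulated campus
true_support <- mean(supports)

# Biased design: survey 200 people, all of them gym-goers
gym_sample   <- sample(which(is_gym_goer), 200)
gym_estimate <- mean(supports[gym_sample])

# Unbiased design: simple random sample of 200 students from the whole campus
srs_sample   <- sample(n_students, 200)
srs_estimate <- mean(supports[srs_sample])

round(c(true = true_support, gym_only = gym_estimate, random = srs_estimate), 2)
```

Under these assumptions, the gym-only estimate should land well above the campus-wide value, while the random sample should land close to it. The gym-only number is still a fair description of gym-goers; it only becomes misleading when reported as the opinion of all students.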

Supporting data

Here we have a potential data set for the example outlined in the last slide. You may scroll through the data.

Time_leaving Age Student_Status For_Equipment_Funding
605 19 1 1
621 19 1 1
636 19 1 1
642 20 1 1
655 18 1 1
701 21 1 1
727 25 1 1
728 18 1 1
810 19 1 1
813 20 1 0
846 20 1 1
848 20 1 1
900 24 1 1
924 20 1 1
929 21 1 1
931 21 1 0
938 19 1 1
944 23 1 1
957 23 1 0
1011 22 1 1
1017 18 1 1
1023 18 1 1
1042 17 1 1
1059 20 1 1
1103 23 1 0
1118 20 1 0
1124 23 1 0
1146 22 1 1
1149 19 1 1
1230 18 1 1
1242 22 1 1
1245 22 1 1
1246 34 0 0
1251 18 1 1
1301 18 1 1
1317 18 1 0
1329 18 1 0
1335 19 1 1
1340 20 1 1
1354 19 1 0
1420 18 1 0
1422 24 1 0
1437 19 1 1
1458 20 1 1
1503 20 1 0
1507 20 1 1
1509 41 0 0
1536 21 1 1
1538 21 1 1
1543 21 1 0
1550 21 1 1
1608 23 1 0
1619 23 1 1
1626 22 1 1
1633 23 1 0
1637 22 1 0
1641 19 1 1
1642 18 1 1
1643 25 1 1
1644 19 1 0
1659 22 1 1

Supporting data (cont.)

The surveyor noted the time that each willing participant (i.e., someone who was willing to answer her question) walked out of the gym. Times are recorded in 24-hour HHMM format (for example, 1329 is 1:29 PM). She asked each participant's age, student status, and whether or not they agreed that school funding should go toward the purchase of new gym equipment. In the last two columns, ‘1’ means ‘yes, I am a student’ and ‘yes, I believe that school funding should go toward new gym equipment’; ‘0’ means ‘no.’
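For readers who want to work with the table in R, here is one possible way to build it as a data frame named gym_data (the name used by the plotting code later in this presentation). The values are transcribed from the table; the two label columns, and the reading of 0 as "no", are assumptions added here for readability.

```r
# Columns transcribed from the table in this presentation (61 participants)
gym_data <- data.frame(
  Time_leaving = c(
     605,  621,  636,  642,  655,  701,  727,  728,  810,  813,
     846,  848,  900,  924,  929,  931,  938,  944,  957, 1011,
    1017, 1023, 1042, 1059, 1103, 1118, 1124, 1146, 1149, 1230,
    1242, 1245, 1246, 1251, 1301, 1317, 1329, 1335, 1340, 1354,
    1420, 1422, 1437, 1458, 1503, 1507, 1509, 1536, 1538, 1543,
    1550, 1608, 1619, 1626, 1633, 1637, 1641, 1642, 1643, 1644,
    1659),
  Age = c(
    19, 19, 19, 20, 18, 21, 25, 18, 19, 20,
    20, 20, 24, 20, 21, 21, 19, 23, 23, 22,
    18, 18, 17, 20, 23, 20, 23, 22, 19, 18,
    22, 22, 34, 18, 18, 18, 18, 19, 20, 19,
    18, 24, 19, 20, 20, 20, 41, 21, 21, 21,
    21, 23, 23, 22, 23, 22, 19, 18, 25, 19,
    22),
  # Everyone is a student except participants 33 and 47
  Student_Status = replace(rep(1, 61), c(33, 47), 0),
  For_Equipment_Funding = c(
    1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
    1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
    1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
    1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
    0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
    1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
    1)
)

# Hypothetical helper columns: readable labels for the 0/1 codes
# (assumes 0 means "no", which the slide implies but does not state)
gym_data$Student_Label <- factor(gym_data$Student_Status,
                                 levels = c(0, 1),
                                 labels = c("Non-student", "Student"))
gym_data$Opinion_Label <- factor(gym_data$For_Equipment_Funding,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

str(gym_data)
```

Keeping the original 0/1 columns alongside the labels means numeric summaries such as means and counts still work on the raw codes.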

Supporting data (cont.)

Let’s graph this data. From these plots, we can see that more students were for than against funding going toward new gym equipment. But let’s remember the context of this data collection! What should the collector say when she reports this visual finding (disregarding significance)?
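The visual impression can be backed by a simple count. The sketch below tabulates the opinions of the students in the sample; the two vectors are transcribed from the table, and the variable names are illustrative choices rather than anything from the original analysis.

```r
# For_Equipment_Funding codes for all 61 participants, in table order
funding <- c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
  1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
  0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
  1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
  1)

# Student_Status: everyone is a student except participants 33 and 47
student <- replace(rep(1, 61), c(33, 47), 0)

# Keep only the students, then count each opinion
students_only <- funding[student == 1]
counts <- table(factor(students_only, levels = c(0, 1),
                       labels = c("Against", "For")))
counts

# Share of surveyed students in favor
prop_for <- mean(students_only)
round(prop_for, 2)
```

In this data set, 42 of the 59 surveyed students (about 71%) were in favor and 17 were against. That figure describes students who were leaving the gym, so a careful report would attach that qualifier to it.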

Reporting – good example (accounting for sampling bias)

“Based on this 3D scatterplot, we can see that the overwhelming majority of participants were students, and all students were between the ages of 17 and 25, with 17 and 25 being uncommon student ages. Students left the gym throughout the day; the time is noted here in case early gym activity is correlated with enthusiasm for the gym and thus a preference for gym funding. The visual takeaway here is that more students exiting the gym between 6 AM and 5 PM on November 23rd are for rather than against school funding going toward new gym equipment. However, please note that this is a sample of a sub-population within the student population: on-campus students who are gym-goers and who were specifically at the gym on November 23rd. Because these students were using the gym, it comes as no surprise to my team that the majority of them are in favor of school funding going to new gym equipment.”

Reporting – bad example (generalizing to a wider population)

“Based on this 3D scatterplot, more students are for rather than against school funding going toward new gym equipment, with ages, times leaving the gym, and student status varying throughout our sample to account for differences between students. It seems that on-campus students are in favor of school funding going toward new gym equipment.”

Supporting data (cont.)

This plot shows the ages of participants as categories. We can see that there is no obvious visual trend between participant age and participant opinion, or between time of gym exit and participant opinion. Keep in mind that this data was created by me to illustrate an example!

Behind the scenes: R code

Here is the code behind the last graph you saw, which was a ggplot2 plot created in R:

library(ggplot2)  # provides ggplot(), geom_point(), and the scale/label helpers

# Treat age and opinion as categories rather than continuous numbers
category_age <- factor(gym_data$Age)
category_opinion <- factor(gym_data$For_Equipment_Funding)
x_axis_ticks <- c("No", "Yes")  # axis labels for opinion codes 0 and 1

# Opinion on the x-axis, exit time on the y-axis, one color per distinct age
g <- ggplot(gym_data, aes(category_opinion, Time_leaving))
g + geom_point(aes(color = category_age)) +
  scale_color_manual(values = c("red", "pink", "green", "blue", "cyan",
                                "purple", "orange", "black", "yellow",
                                "darkgreen", "cyan4")) +
  scale_x_discrete(labels = x_axis_ticks) +
  labs(x = "Opinion on using funding for new equipment",
       y = "Time exiting gym", color = "Participant Age") +
  ggtitle("Student preference for funding gym equipment -- ggplot2 no. 2")

Let’s practice with this data

We can derive new statistics from this data! Let’s find the mean age of gym-goers who want funding to go toward new gym equipment.

\(\mu = \frac{\sum_{i=1}^{n} X_i}{n}\)

[1] "mean = 20.2142857142857"
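The original code for this calculation is not shown, so the following is only a sketch of one way to reproduce it. It applies the formula above to the ages of participants coded 1 in For_Equipment_Funding; both vectors are transcribed from the table, and the variable names are illustrative.

```r
# Ages of all 61 participants, in table order
age <- c(
  19, 19, 19, 20, 18, 21, 25, 18, 19, 20,
  20, 20, 24, 20, 21, 21, 19, 23, 23, 22,
  18, 18, 17, 20, 23, 20, 23, 22, 19, 18,
  22, 22, 34, 18, 18, 18, 18, 19, 20, 19,
  18, 24, 19, 20, 20, 20, 41, 21, 21, 21,
  21, 23, 23, 22, 23, 22, 19, 18, 25, 19,
  22)

# For_Equipment_Funding codes, in the same order
funding <- c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
  1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
  0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
  1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
  1)

# Keep only participants in favor, then apply the formula: sum of X_i over n
supporter_ages <- age[funding == 1]
n <- length(supporter_ages)
mean_age <- sum(supporter_ages) / n

print(paste("mean =", mean_age))
```

Here n is 42 and the ages sum to 849, which gives the mean reported above. In practice, mean(supporter_ages) does the same thing in one call; the explicit sum is written out only to mirror the formula.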

Let’s practice with this data

We can derive new statistics from this data! Let’s find the mean time of exit of gym-goers who want funding to go toward new gym equipment.

\(\mu = \frac{\sum_{i=1}^{n} X_i}{n}\)

[1] "mean = 1135.7380952381"

This measure is interesting because it indicates that supporters' exit times lean toward the earlier part of the day; the mean falls before noon, at about 11:35 AM. (The value above averages the HHMM codes directly, so it only approximates the true average clock time.) This could potentially be explained by the notion that people more motivated to go to the gym may go earlier in the day, and that people more motivated to go to the gym would be in support of using school funding to revamp gym equipment.
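Because Time_leaving is stored as an HHMM integer, averaging it directly treats 10:59 and 11:00 as 41 units apart rather than one minute apart. A sketch of a more careful calculation converts each time to minutes past midnight before averaging. The times below are those of the 42 participants coded 1 in For_Equipment_Funding, transcribed from the table; the variable names are illustrative.

```r
# Exit times (HHMM) of the 42 participants in favor of equipment funding
supporter_times <- c(
   605,  621,  636,  642,  655,  701,  727,  728,  810,  846,
   848,  900,  924,  929,  938,  944, 1011, 1017, 1023, 1042,
  1059, 1146, 1149, 1230, 1242, 1245, 1251, 1301, 1335, 1340,
  1437, 1458, 1507, 1536, 1538, 1550, 1619, 1626, 1641, 1642,
  1643, 1659)

# Naive mean of the HHMM codes (the value reported on the slide)
hhmm_mean <- mean(supporter_times)

# Convert HHMM to minutes past midnight: hours * 60 + minutes
minutes <- (supporter_times %/% 100) * 60 + supporter_times %% 100
mean_minutes <- mean(minutes)

# Convert the mean back to a clock time, rounded to the nearest minute
rounded <- as.integer(round(mean_minutes))
mean_clock <- sprintf("%02d:%02d", rounded %/% 60, rounded %% 60)

mean_clock
```

For this data the minutes-based mean is about 694.8 minutes past midnight, which is roughly 11:35 AM. The naive HHMM mean happens to land close to that here, but the two can differ noticeably for other data sets.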

Thank you!

Thank you for perusing this presentation! Please keep in mind that this is an example data set used to illustrate a point about biased sampling and biased reporting. This was also a chance for me to demonstrate how to create some R plots and how to calculate and possibly interpret mean values. Thank you!