The Importance of Proper Sampling

2025-11-17

In statistics, proper sampling methods are essential to collecting unbiased data. Improper sampling can take several forms, such as failing to collect data randomly when random collection is expected, collecting too small a sample, or collecting data from a single population and generalizing the results to multiple populations.

Example

A data collector could survey all people walking out of the campus gym and ask if they believe school funding should be put toward gym equipment. Using the results of this survey to support a claim that ‘students’ on campus want school funding to go toward new gym equipment would be biased, because only gym-goers were surveyed. The collector could say that most of the gym-going students surveyed on Tuesday of the week of November 23rd on campus are in support of school funding going toward new gym equipment, but they could not generalize this result to all students on campus.
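To see why surveying only gym-goers misleads, here is a small simulation sketch in R. Every number in it is hypothetical (the campus size, the share of gym-goers, and the support rates are made up purely for illustration and are not taken from the survey in this presentation). It compares an estimate from a gym-only sample with one from a simple random sample of the whole campus.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical campus: 10,000 students, about 20% of whom use the gym.
n_students <- 10000
gym_goer <- rbinom(n_students, 1, 0.20) == 1

# Assumed (made-up) support rates: 70% among gym-goers, 30% among others.
supports <- rbinom(n_students, 1, ifelse(gym_goer, 0.70, 0.30)) == 1

# Biased sample: 200 students drawn only from gym-goers.
gym_sample <- sample(which(gym_goer), 200)
# Simple random sample: 200 students drawn from the whole campus.
random_sample <- sample(n_students, 200)

true_support    <- mean(supports)
gym_estimate    <- mean(supports[gym_sample])
random_estimate <- mean(supports[random_sample])

# Compare the campus-wide truth with the two estimates.
print(round(c(true = true_support, gym_only = gym_estimate,
              random = random_estimate), 2))
```

Under these assumptions, the gym-only estimate should land near the gym-goers' support rate rather than the campus-wide one, which is the kind of overgeneralization the example warns about.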

Supporting data

Here we have a potential data set for the example outlined on the previous slide. You may scroll through the data.

Time_leaving	Age	Student_Status	For_Equipment_Funding
605	19	1	1
621	19	1	1
636	19	1	1
642	20	1	1
655	18	1	1
701	21	1	1
727	25	1	1
728	18	1	1
810	19	1	1
813	20	1	0
846	20	1	1
848	20	1	1
900	24	1	1
924	20	1	1
929	21	1	1
931	21	1	0
938	19	1	1
944	23	1	1
957	23	1	0
1011	22	1	1
1017	18	1	1
1023	18	1	1
1042	17	1	1
1059	20	1	1
1103	23	1	0
1118	20	1	0
1124	23	1	0
1146	22	1	1
1149	19	1	1
1230	18	1	1
1242	22	1	1
1245	22	1	1
1246	34	0	0
1251	18	1	1
1301	18	1	1
1317	18	1	0
1329	18	1	0
1335	19	1	1
1340	20	1	1
1354	19	1	0
1420	18	1	0
1422	24	1	0
1437	19	1	1
1458	20	1	1
1503	20	1	0
1507	20	1	1
1509	41	0	0
1536	21	1	1
1538	21	1	1
1543	21	1	0
1550	21	1	1
1608	23	1	0
1619	23	1	1
1626	22	1	1
1633	23	1	0
1637	22	1	0
1641	19	1	1
1642	18	1	1
1643	25	1	1
1644	19	1	0
1659	22	1	1
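For readers who want to follow along in R, the table above could be entered as a data frame roughly like this. This is only a sketch: the name `gym_data` is borrowed from the plotting code later in this presentation, and typing the columns in as vectors is an assumption about a convenient way to load the data, not necessarily how it was originally stored.

```r
# Sketch: enter the survey table as a data frame named gym_data.
gym_data <- data.frame(
  Time_leaving = c(
    605, 621, 636, 642, 655, 701, 727, 728, 810, 813,
    846, 848, 900, 924, 929, 931, 938, 944, 957, 1011,
    1017, 1023, 1042, 1059, 1103, 1118, 1124, 1146, 1149, 1230,
    1242, 1245, 1246, 1251, 1301, 1317, 1329, 1335, 1340, 1354,
    1420, 1422, 1437, 1458, 1503, 1507, 1509, 1536, 1538, 1543,
    1550, 1608, 1619, 1626, 1633, 1637, 1641, 1642, 1643, 1644,
    1659),
  Age = c(
    19, 19, 19, 20, 18, 21, 25, 18, 19, 20,
    20, 20, 24, 20, 21, 21, 19, 23, 23, 22,
    18, 18, 17, 20, 23, 20, 23, 22, 19, 18,
    22, 22, 34, 18, 18, 18, 18, 19, 20, 19,
    18, 24, 19, 20, 20, 20, 41, 21, 21, 21,
    21, 23, 23, 22, 23, 22, 19, 18, 25, 19,
    22),
  # Every respondent is a student except rows 33 and 47 (ages 34 and 41).
  Student_Status = replace(rep(1, 61), c(33, 47), 0),
  For_Equipment_Funding = c(
    1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
    1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
    1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
    1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
    0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
    1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
    1)
)

# Quick structural check: 61 rows, 4 numeric columns.
str(gym_data)
```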

Supporting data (cont.)

The surveyor noted the time that each willing participant (i.e., someone who was willing to answer her question) walked out of the gym. She asked for the participants’ ages, student status, and whether or not they agreed that school funding should go toward the purchase of new gym equipment. ‘1’ means ‘yes I am a student’ and ‘yes I believe that school funding should go toward new gym equipment’; ‘0’ means ‘no’ in each case.

Supporting data (cont.)

Let’s graph this data. From these plots, we can see that more students were for than against funding going toward new gym equipment. But let’s remember the context of this data collection! What should the collector say when she reports this visual finding (disregarding significance)?
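The counts behind that visual impression can be checked directly. The sketch below re-enters only the two columns it needs from the table shown earlier and cross-tabulates them; the variable names `student`, `funding`, and `counts` are illustrative choices, not names used elsewhere in this presentation.

```r
# Student status: everyone is a student except respondents 33 and 47.
student <- replace(rep(1, 61), c(33, 47), 0)

# For_Equipment_Funding column, in the same row order as the table.
funding <- c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
  1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
  0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
  1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
  1)

# Cross-tabulate student status against opinion, with readable labels.
counts <- table(
  Status  = factor(student, levels = c(0, 1),
                   labels = c("Non-student", "Student")),
  Opinion = factor(funding, levels = c(0, 1),
                   labels = c("Against", "For"))
)
print(counts)

# Share of surveyed students (not all campus students) in favor.
student_share_for <- mean(funding[student == 1])
print(round(student_share_for, 2))
```

In this data set, 42 of the 59 surveyed students are in favor and 17 are against. That share describes only the gym-goers who were surveyed.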

Supporting data (cont.)

Reporting – good example (accounting for sampling bias)

“Based on this 3D scatterplot, we can see that the overwhelming majority of participants were students, and all students were between the ages of 17 and 25, with 17 and 25 being uncommon student ages. Students left the gym throughout the day; the time is noted here in case how early someone uses the gym is correlated with enthusiasm for the gym and thus a preference for gym funding. The visual takeaway here is that more students exiting the gym between 6 AM and 5 PM on November 23rd are for rather than against school funding going toward new gym equipment. However, please note that this is a sample of a sub-population within the student population: on-campus students who are gym-goers and who were specifically at the gym on November 23rd. Because these students were using the gym, it comes as no surprise to my team that the majority of them are in favor of school funding going to new gym equipment.”

Reporting – bad example (generalizing to a wider population)

“Based on this 3D scatterplot, more students are for rather than against school funding going toward new gym equipment, with ages, times leaving the gym, and student status varying throughout our sample to account for differences between students. It seems that on-campus students are in favor of school funding going toward new gym equipment.”

Supporting data (cont.)

This plot shows the ages of participants as categories. We can see that there is no obvious visual trend between participant age and participant opinion, or between time of gym exit and participant opinion. Keep in mind that this data was created by me to illustrate an example!

Behind the scenes: R code

Here is the code behind the last graph you saw, which was a ggplot2 plot created in R:

library(ggplot2)

# Treat age and opinion as categories rather than numbers
category_age <- factor(gym_data$Age)
category_opinion <- factor(gym_data$For_Equipment_Funding)
x_axis_ticks <- c("No", "Yes")

# Opinion on the x-axis, exit time on the y-axis, one color per distinct age
g <- ggplot(gym_data, aes(category_opinion, Time_leaving))
g + geom_point(aes(color = category_age)) +
  scale_color_manual(values = c("red", "pink", "green", "blue", "cyan",
    "purple", "orange", "black", "yellow", "darkgreen", "cyan4")) +
  scale_x_discrete(labels = x_axis_ticks) +
  labs(x = "Opinion on using funding for new equipment",
       y = "Time exiting gym", color = "Participant Age") +
  ggtitle("Student preference for funding gym equipment -- ggplot2 no. 2")

Let’s practice with this data

We can compute new statistics from this data! Let’s find the mean age of gym-goers who want funding to go toward new gym equipment.

\(\mu = \frac{1}{n}\sum_{i=1}^{n} X_i\)

[1] "mean = 20.2142857142857"
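As a quick check of this output against the table shown earlier: 42 respondents answered 1 for For_Equipment_Funding, and their ages sum to 849, so

\(\mu = \frac{849}{42} \approx 20.21\)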

Let’s practice with this data

We can compute new statistics from this data! Let’s find the mean time of exit of gym-goers who want funding to go toward new gym equipment.

\(\mu = \frac{1}{n}\sum_{i=1}^{n} X_i\)

[1] "mean = 1135.7380952381"

This measure is interesting because it indicates that supporters’ exit times fall, on average, before noon. Note that Time_leaving is coded as HHMM, so 1135.74 is not itself a clock time; converting each time to minutes after midnight before averaging gives a mean of about 11:35 AM. This could potentially be explained by the notion that people more motivated to go to the gym may go earlier in the day, and people more motivated to go to the gym would be in support of using school funding to revamp gym equipment.
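Because Time_leaving stores clock times as HHMM integers (so 1059 is followed by 1103, not 1060), one way to read the mean as a time of day is to convert each value to minutes after midnight first. The sketch below does that for the supporters only. The variable names are illustrative, and the two vectors are re-entered from the table shown earlier so the block runs on its own.

```r
# Time_leaving and For_Equipment_Funding columns, in table row order.
time_leaving <- c(
  605, 621, 636, 642, 655, 701, 727, 728, 810, 813,
  846, 848, 900, 924, 929, 931, 938, 944, 957, 1011,
  1017, 1023, 1042, 1059, 1103, 1118, 1124, 1146, 1149, 1230,
  1242, 1245, 1246, 1251, 1301, 1317, 1329, 1335, 1340, 1354,
  1420, 1422, 1437, 1458, 1503, 1507, 1509, 1536, 1538, 1543,
  1550, 1608, 1619, 1626, 1633, 1637, 1641, 1642, 1643, 1644,
  1659)
funding <- c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
  1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
  0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
  1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
  1)

# Keep only respondents in favor of equipment funding.
supporters <- time_leaving[funding == 1]

# Split HHMM into hours and minutes, then express as minutes after midnight.
minutes <- (supporters %/% 100) * 60 + supporters %% 100
mean_minutes <- mean(minutes)

# Round to the nearest minute and format back as HH:MM.
total <- as.integer(round(mean_minutes))
mean_clock <- sprintf("%02d:%02d", total %/% 60L, total %% 60L)
print(mean_clock)
```

Here the raw HHMM mean (1135.74) happens to sit close to the converted value (about 11:35), but that is a coincidence of this data set; averaging HHMM codes directly is not reliable in general.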

Thank you!

Thank you for perusing this presentation! Please keep in mind that this is an example data set used to illustrate a point about biased sampling and biased reporting. This was also a chance for me to demonstrate how to create some R plots and how to calculate and possibly interpret mean values. Thank you!