Hey there! Welcome to another R project of mine. This time I’m not really doing anything sophisticated with programming or analytics, but I’m exploring a phenomena that has to do with the power of crowds guessing. The idea is: when asked to estimate a number that no one would exactly know, but can give a reasonable guess about, the average of the answers will get freakishly close to the truth.

I’ve asked about 34 of my network to fill out a 10-question survey asking these types of questions. The questions, along with their answers and the average of the guesses can be found below:

How tall is the Empire State Building, in ft? Answer: 1,453, Avg Guess: 1,562

What year was Justin Timberlake born? Answer: 1981, Avg Guess: 1983

What is the diameter of earth, in miles? Answer: 7,926 miles, Avg Guess: 7,646

What year did Montana achieve statehood? Answer: 1889, Avg Guess: 1882

What is the gestation period for hippos, in months? Answer: 8 months, Avg Guess: 11 months

What, in inches, is the annual rainfall in the Sahara? Answer: 3 inches, Avg Guess: 4.4 inches

How far, in miles, is Omaha, NE from London, UK? Answer: 4,292 miles, Avg Guess: 4,767 miles

How much, in lbs, does the average male grizzly adult weigh? Answer: 600 lbs, Avg Answer: 926 lbs

How many people does the Tennessee Titans’ stadium seat? Answer: 69,143, Avg Guess: 54,754

How many grams of sugar does a 16 oz Pumpkin Spice Latte from Starbucks have? Answer: 50g, Avg Guess: 47g

Anyhow, let’s dive a little deeper and see what we have. I used the packages: tidyverse and ggplot2.

And with that…

The Empire State Building, lit up green for my NY Jets”

The data is pretty straightforward. I created a Google survey and cleaned up a few unreasonable/joke entries in the Excel file that was generated. I then imported the dataset into R. Once in R, there wasn’t a need for any processing.

I ran a simple histogram to give insight to the gender and age breakdown of survey takers. For the most part there is an even breakdown in gender with an obvious skewing of age around my age of 28.

I included a summary() function that will show a decent amount of info about how the results fell with dispersion. You can stop reading here, but I have some nice visuals to help further illustrate.

#importing the dataset
KDP2 <- read_excel("KLove Data Project pt 2 (Responses) (1).xlsx")
#giving a breakdown of gender/age of survey takers
ggplot(KDP2, aes(x =`Age?`, fill = `Gender/Sex?` ))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#showing a brief summary of all the answers to the questions
summary(KDP2)
##    Timestamp                   Gender/Sex?             Age?      
##  Min.   :2021-10-11 13:03:50   Length:33          Min.   :19.00  
##  1st Qu.:2021-10-11 14:29:24   Class :character   1st Qu.:25.00  
##  Median :2021-10-11 15:14:10   Mode  :character   Median :27.00  
##  Mean   :2021-10-11 22:29:48                      Mean   :27.24  
##  3rd Qu.:2021-10-12 09:36:00                      3rd Qu.:28.00  
##  Max.   :2021-10-13 11:10:40                      Max.   :38.00  
##                                                                  
##  What is the height of the Empire State Building (including pinnacle), in ?
##  Min.   :   69                                                             
##  1st Qu.:  850                                                             
##  Median : 1100                                                             
##  Mean   : 1563                                                             
##  3rd Qu.: 1475                                                             
##  Max.   :13000                                                             
##  NA's   :4                                                                 
##  What year was Justin Timberlake born? What is the diameter of earth in miles?
##  Min.   :1969                          Min.   :    4                          
##  1st Qu.:1981                          1st Qu.: 6592                          
##  Median :1983                          Median : 9000                          
##  Mean   :1983                          Mean   :12364                          
##  3rd Qu.:1986                          3rd Qu.:19000                          
##  Max.   :1990                          Max.   :40000                          
##                                        NA's   :6                              
##  What year did Montana achieve statehood?
##  Min.   :1800                            
##  1st Qu.:1856                            
##  Median :1888                            
##  Mean   :1883                            
##  3rd Qu.:1900                            
##  Max.   :1979                            
##  NA's   :1                               
##  What is the gestation period, in months, for hippos? (gestation is how long a fetus develops in the womb before birth, humans take 9 months)
##  Min.   : 3.00                                                                                                                               
##  1st Qu.: 7.00                                                                                                                               
##  Median :12.00                                                                                                                               
##  Mean   :11.61                                                                                                                               
##  3rd Qu.:17.00                                                                                                                               
##  Max.   :20.00                                                                                                                               
##                                                                                                                                              
##  What is the annual rainfall, in inches, for the Sahara Desert?
##  Min.   : 0.000                                                
##  1st Qu.: 1.000                                                
##  Median : 3.000                                                
##  Mean   : 4.434                                                
##  3rd Qu.: 5.750                                                
##  Max.   :16.000                                                
##  NA's   :2                                                     
##  How far apart, in miles, are London and Omaha, Nebraska?
##  Min.   :    1                                           
##  1st Qu.: 3000                                           
##  Median : 4000                                           
##  Mean   : 4767                                           
##  3rd Qu.: 5000                                           
##  Max.   :25000                                           
##  NA's   :4                                               
##  How much, in lbs, does the average adult male North American grizzly bear weigh?
##  Min.   : 200.0                                                                  
##  1st Qu.: 500.0                                                                  
##  Median : 690.0                                                                  
##  Mean   : 926.3                                                                  
##  3rd Qu.: 900.0                                                                  
##  Max.   :6000.0                                                                  
##                                                                                  
##  How many people does the Tennessee Titan's stadium seat?
##  Min.   : 15000                                          
##  1st Qu.: 40000                                          
##  Median : 50000                                          
##  Mean   : 54754                                          
##  3rd Qu.: 69036                                          
##  Max.   :100000                                          
##  NA's   :1                                               
##  How many grams of sugar are in 16 oz pumpkin spiced latte from Starbucks?
##  Min.   :  8.00                                                           
##  1st Qu.: 30.25                                                           
##  Median : 43.50                                                           
##  Mean   : 47.33                                                           
##  3rd Qu.: 59.00                                                           
##  Max.   :140.00                                                           
##  NA's   :3

Great, now to some actual fun stuff.

To make things simple, I’m going to run all three charts that I have ready. They’re very self explanatory to a point. The only things to note is for each, the red lines indicate the answer to the question and the blue lines indicate the avg guesses. You’ll notice how freakishly close everything is despite the dispersion of answers.

I only did three charts because I didn’t want to fluff up this project. You get the point. I will say this phenomena is partially behind how decision trees and random forests works. Like I mentioned in previous projects, a decision tree is like asking a friend a question where they can ask a few clarifying questions and then give a guess. A random forest is when you ask eight friends that question with the same process of clarifying and then you average five of those guesses. As the number of guesses increase, generally the accuracy does too. Now I could’ve made this experiment more like that by randomly selecting 25 of the 34 or so guesses, but I’m lazy.

ggplot(KDP2, aes(`What is the height of the Empire State Building (including pinnacle), in ?`,fill =`Gender/Sex?`))+
  geom_density(kernel="gaussian")+
  geom_vline(xintercept = 1453, color = "red", size = 1)+
  geom_vline(xintercept = 1562, color = "blue", size = 1)+
  theme(legend.position = "bottom")
## Warning: Removed 4 rows containing non-finite values (stat_density).

ggplot(KDP2, aes(`How many grams of sugar are in 16 oz pumpkin spiced latte from Starbucks?`,fill =`Gender/Sex?`))+
  geom_area(stat="bin")+
  geom_vline(xintercept = 50, color = "red", size = 1)+
  geom_vline(xintercept = 47, color = "blue", size = 1)+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).

ggplot(KDP2, aes(x=`What is the diameter of earth in miles?`, fill =`Gender/Sex?`))+
  geom_dotplot()+
  geom_vline(xintercept = 7926, color = "red", size = 1)+
  geom_vline(xintercept = 7646, color = "blue", size = 1)+
  theme(legend.position = "bottom")
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite values (stat_bindot).

That’s all today, thanks for checking out!