knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(miscset)
##
## Attaching package: 'miscset'
##
## The following object is masked from 'package:dplyr':
##
## collapse
dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(10)
Our goal is to generate a collection of 5-10 random samples (drawn with replacement) from at least 6 columns of the data. To do this, we create a function that draws the samples and returns them as a single data frame.
run_sampling <- function(column, sampleCount) {
  dColumnID <- c()
  dColumns <- list()
  for (i in 1:sampleCount) {
    # Label each sample and draw 50% of the rows with replacement,
    # keeping only the requested column
    dColumnID[i] <- paste("Sample", i, sep = "_")
    dColumns[[i]] <- dataset_olympics |>
      sample_frac(0.5, replace = TRUE) |>
      pluck(column)
  }
  # Bind the sampled columns into one data frame and label them
  dFrame <- bind_cols(dColumns)
  names(dFrame) <- dColumnID
  return(dFrame)
}
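As a side note, the `New names` messages printed below come from `bind_cols()` receiving unnamed columns. A minimal variant (a sketch, not the function used for the runs that follow) could name the list of sampled columns before binding:

run_sampling_named <- function(column, sampleCount) {
  # Hypothetical variant: name each sampled column up front so bind_cols()
  # does not need to invent placeholder names such as `...1`
  dColumns <- lapply(1:sampleCount, function(i) {
    dataset_olympics |> sample_frac(0.5, replace = TRUE) |> pluck(column)
  })
  names(dColumns) <- paste("Sample", 1:sampleCount, sep = "_")
  bind_cols(dColumns)
}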
Below are the results for each of the chosen columns, starting with Weight:
result <- run_sampling("Weight",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA 93 70 78 66 75 52
## 2 64 93 79 NA 48 55 62 NA
## 3 73 44 75 70 87 70 NA 60
## 4 60 54 78 80 82 55 73 65
## 5 50 80 70 79 90 56 69 NA
## 6 75 69 94 76 47 73 NA 45
## 7 76 72 72 NA NA 106 NA 120
## 8 55 50 NA 54 68 71 65 93
## 9 NA 101 63 77 78 NA NA 55
## 10 57 58 NA NA 95 NA NA 72
## # ℹ 34,990 more rows
print(colMeans(result,na.rm = TRUE))
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## 70.94226 70.99851 70.95531 70.82500 70.96143 70.87109 70.88323 70.68560
We can observe many NA values; NAs are common in the Weight and Age attributes of this dataset. Nevertheless, the mean is roughly the same for every sample.
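As a quick check (a small sketch added for illustration, not part of the original output), the share of NA values in each sample can be computed directly:

result |>
  summarise(across(everything(), ~ mean(is.na(.x))))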
result <- run_sampling("Age",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20 21 20 31 19 22 26 23
## 2 23 22 21 27 17 NA 20 21
## 3 21 16 22 24 24 23 25 20
## 4 30 19 23 31 25 20 19 NA
## 5 23 26 22 22 29 22 21 33
## 6 23 24 23 26 27 23 23 20
## 7 34 19 23 29 35 32 27 NA
## 8 20 22 21 27 19 24 29 25
## 9 32 18 25 26 20 16 21 NA
## 10 20 27 21 34 22 21 21 20
## # ℹ 34,990 more rows
#print(colMeans(result,na.rm = TRUE))
summary(result)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## Min. :12.00 Min. :12.0 Min. :12.00 Min. :11.0 Min. :12.00
## 1st Qu.:21.00 1st Qu.:21.0 1st Qu.:21.00 1st Qu.:21.0 1st Qu.:21.00
## Median :24.00 Median :24.0 Median :25.00 Median :24.0 Median :25.00
## Mean :25.62 Mean :25.6 Mean :25.66 Mean :25.6 Mean :25.67
## 3rd Qu.:28.00 3rd Qu.:28.0 3rd Qu.:28.00 3rd Qu.:28.0 3rd Qu.:28.00
## Max. :88.00 Max. :76.0 Max. :88.00 Max. :76.0 Max. :84.00
## NA's :1331 NA's :1340 NA's :1329 NA's :1367 NA's :1310
## Sample_6 Sample_7 Sample_8
## Min. :11.00 Min. :12.00 Min. :12.00
## 1st Qu.:21.00 1st Qu.:21.00 1st Qu.:21.00
## Median :25.00 Median :24.00 Median :25.00
## Mean :25.67 Mean :25.65 Mean :25.68
## 3rd Qu.:28.00 3rd Qu.:28.00 3rd Qu.:28.00
## Max. :88.00 Max. :76.00 Max. :76.00
## NA's :1383 NA's :1409 NA's :1333
Similar to Weight, the mean is about the same for all 8 samples. The summaries also show that the minimum values are very close across samples (less than one unit apart), while the NA counts vary much more, ranging from 1,310 to 1,409. Using 50% of the dataset as the sample size gives near-consistent samples.
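To quantify that variation (a minimal sketch for illustration), the NA counts and the spread of the sample means can be pulled out with base R:

colSums(is.na(result))                 # NA count per sample
range(colMeans(result, na.rm = TRUE))  # spread of the sample means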
result <- run_sampling("City",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Helsinki Paris Seoul Mexico … London Munich Seoul Paris
## 2 Helsinki Lake Pl… Roma Montreal Melbour… Helsinki Helsinki Montreal
## 3 Rio de Janeiro Barcelo… Munich Barcelo… Torino Lilleha… London Innsbru…
## 4 Rio de Janeiro Montreal London Rio de … Paris Salt La… Tokyo London
## 5 Seoul Munich London Rio de … Beijing Stockho… Mexico … Albertv…
## 6 Sochi Atlanta Nagano London St. Lou… Rio de … Atlanta Torino
## 7 Los Angeles Atlanta Paris Moskva Stockho… Paris Torino Tokyo
## 8 London Athina Atlanta Torino Atlanta Grenoble Vancouv… Seoul
## 9 Los Angeles Calgary Rio de … Rio de … Sydney Innsbru… Roma Munich
## 10 Munich Sydney Salt La… Montreal Calgary Montreal Lilleha… Helsinki
## # ℹ 34,990 more rows
#print(colMeans(result,na.rm = TRUE))
ggplotGrid(ncol = 2,
           lapply(c("Sample_1", "Sample_4"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               coord_flip()
           }))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Comparing the bar charts for two chosen subsamples, the counts are consistent for most cities, with mild deviations for values such as 'London', 'Amsterdam', and 'Munich'. These are among the highest-count values in the data.
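To back the visual comparison with numbers (a minimal sketch, assuming the two samples of interest are Sample_1 and Sample_4), the city counts can be joined side by side:

city_counts <- full_join(
  count(result, City = Sample_1, name = "n_sample_1"),
  count(result, City = Sample_4, name = "n_sample_4"),
  by = "City"
)
city_counts |> arrange(desc(n_sample_1))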
result <- run_sampling("Year",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1928 2014 2016 2014 1948 2008 2000 2012
## 2 1980 1920 2014 1912 1956 2002 1976 1912
## 3 1928 1984 1988 2016 2008 1976 1984 1988
## 4 1980 1980 1988 2016 1972 1924 1932 1972
## 5 1992 1984 1920 2016 1964 1948 2012 1936
## 6 1984 2012 1936 2008 1998 1972 1984 1908
summary(result)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## Min. :1896 Min. :1896 Min. :1896 Min. :1896 Min. :1896
## 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960
## Median :1984 Median :1984 Median :1984 Median :1984 Median :1984
## Mean :1978 Mean :1978 Mean :1978 Mean :1978 Mean :1978
## 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002
## Max. :2016 Max. :2016 Max. :2016 Max. :2016 Max. :2016
## Sample_6 Sample_7 Sample_8
## Min. :1896 Min. :1896 Min. :1896
## 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960
## Median :1988 Median :1984 Median :1984
## Mean :1978 Mean :1978 Mean :1978
## 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002
## Max. :2016 Max. :2016 Max. :2016
The subsamples have very similar means, and the 1st and 3rd quartiles are identical across all samples.
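Because summary() rounds the means to whole years, a quick sketch (added for illustration) of the unrounded means makes the similarity more precise:

result |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))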
result <- run_sampling("Medal",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Gold <NA> <NA> <NA> Silver Gold <NA> <NA>
## 2 Gold <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> Gold <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> Silver <NA>
## 5 <NA> <NA> <NA> Bronze <NA> <NA> Bronze <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> Bronze <NA>
ggplotGrid(ncol = 3,
           lapply(c("Sample_1", "Sample_4", "Sample_8"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               geom_text(stat = 'count', aes(label = ..count..), vjust = 0)
           }))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can observe that the counts vary considerably across samples for every medal category. Since there are only four categories (Gold, Silver, Bronze, and NA for no medal), larger relative variation between samples is expected. I believe this column could be converted into four binary indicator columns, one per category, which would make filtering and sampling the data easier.
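A minimal sketch of that idea (the indicator column names are hypothetical and this is not applied to the analysis above):

medal_flags <- dataset_olympics |>
  mutate(
    medal_gold   = as.integer(!is.na(Medal) & Medal == "Gold"),
    medal_silver = as.integer(!is.na(Medal) & Medal == "Silver"),
    medal_bronze = as.integer(!is.na(Medal) & Medal == "Bronze"),
    medal_none   = as.integer(is.na(Medal))
  )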
result <- run_sampling("Season",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Summer Summer Summer Summer Summer Summer Summer Summer
## 2 Summer Summer Summer Summer Winter Winter Summer Summer
## 3 Summer Summer Summer Winter Summer Summer Winter Summer
## 4 Summer Summer Winter Summer Summer Summer Summer Summer
## 5 Summer Summer Winter Summer Summer Summer Summer Summer
## 6 Summer Summer Summer Summer Summer Summer Winter Winter
ggplotGrid(ncol = 3,
           lapply(c("Sample_1", "Sample_4", "Sample_8"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               geom_text(stat = 'count', aes(label = ..count..), vjust = 0)
           }))
Similar to the Medal field, we see variation in the counts in each
subsample.
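To see those counts directly (a small sketch for illustration), the samples can be pivoted to long form and tallied:

result |>
  pivot_longer(everything(), names_to = "sample", values_to = "season") |>
  count(sample, season)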
result <- run_sampling("Sport",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Cross Country … Handball Rowing Wrestli… Athleti… Hockey Biathlon Rowing
## 2 Sailing Sailing Swimming Athleti… Basketb… Hockey Cross C… Football
## 3 Athletics Rowing Shooting Boxing Rowing Weightl… Gymnast… Boxing
## 4 Shooting Boxing Biathlon Judo Cycling Wrestli… Gymnast… Water P…
## 5 Athletics Baseball Athleti… Ice Hoc… Hockey Rowing Athleti… Swimming
## 6 Gymnastics Athleti… Gymnast… Cycling Sailing Wrestli… Short T… Short T…
# Sample 1
sresult1 <- result |> count(Sample_1)
print(sresult1)
## # A tibble: 63 × 2
## Sample_1 n
## <chr> <int>
## 1 Alpine Skiing 1200
## 2 Alpinism 3
## 3 Archery 305
## 4 Art Competitions 466
## 5 Athletics 5286
## 6 Badminton 164
## 7 Baseball 115
## 8 Basketball 612
## 9 Beach Volleyball 68
## 10 Biathlon 604
## # ℹ 53 more rows
sresult2 <- result |> count(Sample_3)
print(sresult2)
## # A tibble: 64 × 2
## Sample_3 n
## <chr> <int>
## 1 Alpine Skiing 1176
## 2 Alpinism 6
## 3 Archery 299
## 4 Art Competitions 452
## 5 Athletics 5219
## 6 Badminton 144
## 7 Baseball 113
## 8 Basketball 632
## 9 Basque Pelota 1
## 10 Beach Volleyball 65
## # ℹ 54 more rows
# Print all counts side by side (code removed: even with a seed, the count tables do not have the same number of rows, so they cannot be bound together)
#sDF <- cbind(sresult1,sresult2)
#print(sDF)
A key observation is that not all sports appear in every sample: the count tables have 62, 63, or 64 rows, so the sampling misses certain sports, for example Roque, one of the least frequent sports in the data. To bound this anomaly, I believe filtering out sports with very few occurrences would give a better basis for analysis.
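A minimal sketch of that filtering idea (the threshold of 50 rows is an assumption chosen only for illustration):

sport_counts <- dataset_olympics |> count(Sport, name = "n_total")
common_sports <- sport_counts |> filter(n_total >= 50) |> pull(Sport)  # assumed cut-off
dataset_filtered <- dataset_olympics |> filter(Sport %in% common_sports)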