knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(miscset)
##
## Attaching package: 'miscset'
##
## The following object is masked from 'package:dplyr':
##
## collapse
dataset_olympics <- read_delim("dataset_olympics.csv")
## Rows: 70000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(10)
Our goal is to generate a collection of 5-10 random samples (drawn with replacement) from at least 6 columns of the data. To do this, we create a function that draws the samples and returns them as a single data frame.
run_sampling <- function(column, sampleCount) {
  dColumnID <- c()
  dColumns <- list()
  for (i in 1:sampleCount) {
    # Label each sample and draw 50% of the rows with replacement,
    # keeping only the requested column
    dColumnID[i] <- paste("Sample", i, sep = "_")
    dColumns[[i]] <- dataset_olympics |>
      sample_frac(0.5, replace = TRUE) |>
      pluck(column)
  }
  # Bind the sampled columns into one data frame and label them
  dFrame <- bind_cols(dColumns)
  names(dFrame) <- dColumnID
  return(dFrame)
}
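As a side note, the `New names` messages printed below come from `bind_cols()` receiving unnamed columns. A minimal variant (a sketch, not the function used for the runs that follow) could name the list of sampled columns before binding:

run_sampling_named <- function(column, sampleCount) {
  # Hypothetical variant: name each sampled column up front so bind_cols()
  # does not need to invent placeholder names such as `...1`
  dColumns <- lapply(1:sampleCount, function(i) {
    dataset_olympics |> sample_frac(0.5, replace = TRUE) |> pluck(column)
  })
  names(dColumns) <- paste("Sample", 1:sampleCount, sep = "_")
  bind_cols(dColumns)
}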
Below are the results for each of the chosen columns, starting with Weight:
result <- run_sampling("Weight",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA 93 70 78 66 75 52
## 2 64 93 79 NA 48 55 62 NA
## 3 73 44 75 70 87 70 NA 60
## 4 60 54 78 80 82 55 73 65
## 5 50 80 70 79 90 56 69 NA
## 6 75 69 94 76 47 73 NA 45
## 7 76 72 72 NA NA 106 NA 120
## 8 55 50 NA 54 68 71 65 93
## 9 NA 101 63 77 78 NA NA 55
## 10 57 58 NA NA 95 NA NA 72
## # ℹ 34,990 more rows
print(colMeans(result,na.rm = TRUE))
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## 70.94226 70.99851 70.95531 70.82500 70.96143 70.87109 70.88323 70.68560
We can observe many NA values; NAs are common in the Weight and Age attributes of this dataset. Nevertheless, the mean is roughly the same for every sample.
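As a quick check (a small sketch added for illustration, not part of the original output), the share of NA values in each sample can be computed directly:

result |>
  summarise(across(everything(), ~ mean(is.na(.x))))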
result <- run_sampling("Age",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20 21 20 31 19 22 26 23
## 2 23 22 21 27 17 NA 20 21
## 3 21 16 22 24 24 23 25 20
## 4 30 19 23 31 25 20 19 NA
## 5 23 26 22 22 29 22 21 33
## 6 23 24 23 26 27 23 23 20
## 7 34 19 23 29 35 32 27 NA
## 8 20 22 21 27 19 24 29 25
## 9 32 18 25 26 20 16 21 NA
## 10 20 27 21 34 22 21 21 20
## # ℹ 34,990 more rows
#print(colMeans(result,na.rm = TRUE))
summary(result)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## Min. :12.00 Min. :12.0 Min. :12.00 Min. :11.0 Min. :12.00
## 1st Qu.:21.00 1st Qu.:21.0 1st Qu.:21.00 1st Qu.:21.0 1st Qu.:21.00
## Median :24.00 Median :24.0 Median :25.00 Median :24.0 Median :25.00
## Mean :25.62 Mean :25.6 Mean :25.66 Mean :25.6 Mean :25.67
## 3rd Qu.:28.00 3rd Qu.:28.0 3rd Qu.:28.00 3rd Qu.:28.0 3rd Qu.:28.00
## Max. :88.00 Max. :76.0 Max. :88.00 Max. :76.0 Max. :84.00
## NA's :1331 NA's :1340 NA's :1329 NA's :1367 NA's :1310
## Sample_6 Sample_7 Sample_8
## Min. :11.00 Min. :12.00 Min. :12.00
## 1st Qu.:21.00 1st Qu.:21.00 1st Qu.:21.00
## Median :25.00 Median :24.00 Median :25.00
## Mean :25.67 Mean :25.65 Mean :25.68
## 3rd Qu.:28.00 3rd Qu.:28.00 3rd Qu.:28.00
## Max. :88.00 Max. :76.00 Max. :76.00
## NA's :1383 NA's :1409 NA's :1333
Similar to Weight, the mean is about the same for all 8 samples. The summaries also show that the minimum values are very close across samples (less than one unit apart), while the NA counts vary much more, ranging from 1,310 to 1,409. Using 50% of the dataset as the sample size gives near-consistent samples.
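To quantify that variation (a minimal sketch for illustration), the NA counts and the spread of the sample means can be pulled out with base R:

colSums(is.na(result))                 # NA count per sample
range(colMeans(result, na.rm = TRUE))  # spread of the sample means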
result <- run_sampling("City",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
print(result)
## # A tibble: 35,000 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Helsinki Paris Seoul Mexico … London Munich Seoul Paris
## 2 Helsinki Lake Pl… Roma Montreal Melbour… Helsinki Helsinki Montreal
## 3 Rio de Janeiro Barcelo… Munich Barcelo… Torino Lilleha… London Innsbru…
## 4 Rio de Janeiro Montreal London Rio de … Paris Salt La… Tokyo London
## 5 Seoul Munich London Rio de … Beijing Stockho… Mexico … Albertv…
## 6 Sochi Atlanta Nagano London St. Lou… Rio de … Atlanta Torino
## 7 Los Angeles Atlanta Paris Moskva Stockho… Paris Torino Tokyo
## 8 London Athina Atlanta Torino Atlanta Grenoble Vancouv… Seoul
## 9 Los Angeles Calgary Rio de … Rio de … Sydney Innsbru… Roma Munich
## 10 Munich Sydney Salt La… Montreal Calgary Montreal Lilleha… Helsinki
## # ℹ 34,990 more rows
#print(colMeans(result,na.rm = TRUE))
ggplotGrid(ncol = 2,
           lapply(c("Sample_1", "Sample_4"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               coord_flip()
           }))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Comparing the bar charts for two chosen subsamples, the counts are consistent for most cities, with mild deviations for values such as 'London', 'Amsterdam', and 'Munich'. These are among the highest-count values in the data.
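To back the visual comparison with numbers (a minimal sketch, assuming the two samples of interest are Sample_1 and Sample_4), the city counts can be joined side by side:

city_counts <- full_join(
  count(result, City = Sample_1, name = "n_sample_1"),
  count(result, City = Sample_4, name = "n_sample_4"),
  by = "City"
)
city_counts |> arrange(desc(n_sample_1))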
result <- run_sampling("Year",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1928 2014 2016 2014 1948 2008 2000 2012
## 2 1980 1920 2014 1912 1956 2002 1976 1912
## 3 1928 1984 1988 2016 2008 1976 1984 1988
## 4 1980 1980 1988 2016 1972 1924 1932 1972
## 5 1992 1984 1920 2016 1964 1948 2012 1936
## 6 1984 2012 1936 2008 1998 1972 1984 1908
summary(result)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## Min. :1896 Min. :1896 Min. :1896 Min. :1896 Min. :1896
## 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960
## Median :1984 Median :1984 Median :1984 Median :1984 Median :1984
## Mean :1978 Mean :1978 Mean :1978 Mean :1978 Mean :1978
## 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002
## Max. :2016 Max. :2016 Max. :2016 Max. :2016 Max. :2016
## Sample_6 Sample_7 Sample_8
## Min. :1896 Min. :1896 Min. :1896
## 1st Qu.:1960 1st Qu.:1960 1st Qu.:1960
## Median :1988 Median :1984 Median :1984
## Mean :1978 Mean :1978 Mean :1978
## 3rd Qu.:2002 3rd Qu.:2002 3rd Qu.:2002
## Max. :2016 Max. :2016 Max. :2016
The subsamples have very similar means, and the 1st and 3rd quartiles are identical across all samples.
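Because summary() rounds the means to whole years, a quick sketch (added for illustration) of the unrounded means makes the similarity more precise:

result |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))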
result <- run_sampling("Medal",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Gold <NA> <NA> <NA> Silver Gold <NA> <NA>
## 2 Gold <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> Gold <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> Silver <NA>
## 5 <NA> <NA> <NA> Bronze <NA> <NA> Bronze <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> Bronze <NA>
ggplotGrid(ncol = 3,
           lapply(c("Sample_1", "Sample_4", "Sample_8"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               geom_text(stat = 'count', aes(label = ..count..), vjust = 0)
           }))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can observe that the counts vary considerably across samples for every medal category. Since there are only four categories (Gold, Silver, Bronze, and NA for no medal), larger relative variation between samples is expected. I believe this column could be converted into four binary indicator columns, one per category, which would make filtering and sampling the data easier.
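A minimal sketch of that idea (the indicator column names are hypothetical and this is not applied to the analysis above):

medal_flags <- dataset_olympics |>
  mutate(
    medal_gold   = as.integer(!is.na(Medal) & Medal == "Gold"),
    medal_silver = as.integer(!is.na(Medal) & Medal == "Silver"),
    medal_bronze = as.integer(!is.na(Medal) & Medal == "Bronze"),
    medal_none   = as.integer(is.na(Medal))
  )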
result <- run_sampling("Season",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Summer Summer Summer Summer Summer Summer Summer Summer
## 2 Summer Summer Summer Summer Winter Winter Summer Summer
## 3 Summer Summer Summer Winter Summer Summer Winter Summer
## 4 Summer Summer Winter Summer Summer Summer Summer Summer
## 5 Summer Summer Winter Summer Summer Summer Summer Summer
## 6 Summer Summer Summer Summer Summer Summer Winter Winter
ggplotGrid(ncol = 3,
           lapply(c("Sample_1", "Sample_4", "Sample_8"), function(col) {
             ggplot(result, aes_string(col)) +
               geom_bar() +
               geom_text(stat = 'count', aes(label = ..count..), vjust = 0)
           }))
Similar to the Medal field, we see variation in the counts in each
subsample.
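To see those counts directly (a small sketch for illustration), the samples can be pivoted to long form and tallied:

result |>
  pivot_longer(everything(), names_to = "sample", values_to = "season") |>
  count(sample, season)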
result <- run_sampling("Sport",8)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
head(result)
## # A tibble: 6 × 8
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7 Sample_8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Cross Country … Handball Rowing Wrestli… Athleti… Hockey Biathlon Rowing
## 2 Sailing Sailing Swimming Athleti… Basketb… Hockey Cross C… Football
## 3 Athletics Rowing Shooting Boxing Rowing Weightl… Gymnast… Boxing
## 4 Shooting Boxing Biathlon Judo Cycling Wrestli… Gymnast… Water P…
## 5 Athletics Baseball Athleti… Ice Hoc… Hockey Rowing Athleti… Swimming
## 6 Gymnastics Athleti… Gymnast… Cycling Sailing Wrestli… Short T… Short T…
# Sample 1
sresult1 <- result |> count(Sample_1)
print(sresult1)
## # A tibble: 63 × 2
## Sample_1 n
## <chr> <int>
## 1 Alpine Skiing 1200
## 2 Alpinism 3
## 3 Archery 305
## 4 Art Competitions 466
## 5 Athletics 5286
## 6 Badminton 164
## 7 Baseball 115
## 8 Basketball 612
## 9 Beach Volleyball 68
## 10 Biathlon 604
## # ℹ 53 more rows
sresult2 <- result |> count(Sample_3)
print(sresult2)
## # A tibble: 64 × 2
## Sample_3 n
## <chr> <int>
## 1 Alpine Skiing 1176
## 2 Alpinism 6
## 3 Archery 299
## 4 Art Competitions 452
## 5 Athletics 5219
## 6 Badminton 144
## 7 Baseball 113
## 8 Basketball 632
## 9 Basque Pelota 1
## 10 Beach Volleyball 65
## # ℹ 54 more rows
# Print all counts side by side (code removed: even with a seed, the count tables do not have the same number of rows, so they cannot be bound together)
#sDF <- cbind(sresult1,sresult2)
#print(sDF)
A key observation is that not all sports appear in every sample: the count tables have 62, 63, or 64 rows, so the sampling misses certain sports, for example Roque, one of the least frequent sports in the data. To bound this anomaly, I believe filtering out sports with very few occurrences would give a better basis for analysis.
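A minimal sketch of that filtering idea (the threshold of 50 rows is an assumption chosen only for illustration):

sport_counts <- dataset_olympics |> count(Sport, name = "n_total")
common_sports <- sport_counts |> filter(n_total >= 50) |> pull(Sport)  # assumed cut-off
dataset_filtered <- dataset_olympics |> filter(Sport %in% common_sports)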