Milestone 2: Descriptive Stats of Olympics Dataset

Load Packages
Overview
Preamble
Getting and Cleaning Data
Sampling
Variable conversions
Missing values
Data imputation with R
- Sample statistics of the Age with sample size 7 and NAs removed
- Median imputation
Descriptive Statistics
Data Exploration with SQL
Data imputation with SQL
- Median values
- Discussion 2
Replacement of NA values
SQL data visualization
Definition of terms
Parameters
Summary
References

Load Packages

The following R packages are used for this analysis:

readr
RSQLite
dplyr
ggplot2

Overview

This project is sequel to Proposal for Data Analysis Project which is also the Milestone 1 in the capstone project.

In this Milestone 2 project, descriptive statistics are used on the dataset. Descriptive statistical methods involve the use of statistical techniques to quantitatively give a detailed summary of the sample dataset. Some of these techniques include:

Central Tendency: Including Mean, Median and Mode
Dispersion: Like Range, Standard deviation, Variance etc
Skewness: The sample data could be left or right skewed
Kurtosis: kurtosis could be platykurtic, mesokurtic or leptokurtic

Graphical representation is also used in descriptive statistics.

Preamble

R and SQL will be used for this analysis. Similar descriptive statistical results should be expected using either R or SQL. This approach should ensure a two way data validation. The R codes are not displayed, only its outputs are seen. Conversely, both the SQL codes and its outputs will be seen because this project is really about SQL and not R.

Getting and Cleaning Data

Using R

Data display of olympics_data

# A tibble: 6 x 15
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>

Observe the sum of NAs and columns with NA values

[1] 363853

    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0   9474  60171  62875      0      0      0      0      0 
  City  Sport  Event  Medal 
     0      0      0 231333

An observation of the olympics data showed lots of missing values, total sum of 363853. Some of the columns has quite a high level of NAs. The Medal has high number of missing values probably due to the fact that most athletes in the Olympics do not win any medals.

Data display of noc_data

# A tibble: 6 x 3
  NOC   region      notes               
  <chr> <chr>       <chr>               
1 AFG   Afghanistan <NA>                
2 AHO   Curacao     Netherlands Antilles
3 ALB   Albania     <NA>                
4 ALG   Algeria     <NA>                
5 AND   Andorra     <NA>                
6 ANG   Angola      <NA>

Merging data files

The two dataset can be merged using the common NOC variable. The left_join function of the dplyr package is used to join the noc_data data with 3 variables to olympics_data data with 15. The resultant data will now have 17 variables.

# A tibble: 6 x 17
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
#   region <chr>, notes <chr>

tibble [271,116 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
 $ Name  : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
 $ Sex   : chr [1:271116] "M" "M" "M" "M" ...
 $ Age   : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
 $ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
 $ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
 $ Team  : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
 $ NOC   : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
 $ Games : chr [1:271116] "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
 $ Year  : num [1:271116] 1992 2012 1920 1900 1988 ...
 $ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
 $ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
 $ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
 $ Medal : chr [1:271116] NA NA NA "Gold" ...
 $ region: chr [1:271116] "China" "China" "Denmark" "Denmark" ...
 $ notes : chr [1:271116] NA NA NA NA ...

Sampling

The new data from the merger i.e nocolympics has over 200,000 rows and 17 columns. A sample of size 10000 is taken from the data to compute the descriptive statistics.

# A tibble: 6 x 17
      ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
   <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1   9967 Pave~ M        28    172     70 Czec~ TCH   1992~  1992 Winter Albe~
2 111428 Helg~ F        20    171     61 Arge~ ARG   1968~  1968 Winter Gren~
3 127942 Paul~ M        29     NA    109 Germ~ GER   1936~  1936 Summer Berl~
4  37121 Juli~ M        18     NA     NA Germ~ GER   1900~  1900 Summer Paris
5  23695 Leon~ M        53    173     76 Unit~ USA   1932~  1932 Summer Los ~
6   5056 Ferh~ M        23    178     68 Turk~ TUR   2016~  2016 Summer Rio ~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
#   region <chr>, notes <chr>

Variable conversions

Examining the class of each of the columns revealed inappropriate data type assigned to some of the columns.

         ID        Name         Sex         Age      Height      Weight 
  "numeric" "character" "character"   "numeric"   "numeric"   "numeric" 
       Team         NOC       Games        Year      Season        City 
"character" "character" "character"   "numeric" "character" "character" 
      Sport       Event       Medal      region       notes 
"character" "character" "character" "character" "character"

The following variables will be changed from:

Character to factor

Team
NOC
Games
Season
City
Sport
Event
region

Character to labelled factor

Sex

Characer to ordered factor

Medal

Number to factor

Year

[1] "Bronze" "Silver" "Gold"   "None"

tibble [10,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : num [1:10000] 9967 111428 127942 37121 23695 ...
 $ Name  : chr [1:10000] "Pavel Benc" "Helga Mara Sista de Salvateli" "Paul Wahl" "Julius Frey" ...
 $ Sex   : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 1 1 1 1 1 ...
 $ Age   : num [1:10000] 28 20 29 18 53 23 23 22 32 20 ...
 $ Height: num [1:10000] 172 171 NA NA 173 178 179 174 183 173 ...
 $ Weight: num [1:10000] 70 61 109 NA 76 68 70 65 73 64 ...
 $ Team  : Factor w/ 369 levels "30. Februar",..: 86 14 124 124 348 336 113 111 128 82 ...
 $ NOC   : Factor w/ 204 levels "AFG","AHO","ALB",..: 179 9 71 71 196 188 64 63 72 48 ...
 $ Games : Factor w/ 51 levels "1896 Summer",..: 39 27 14 2 12 51 41 7 28 32 ...
 $ Year  : Factor w/ 35 levels "1896","1900",..: 23 17 11 2 10 35 25 7 18 20 ...
 $ Season: Factor w/ 2 levels "Summer","Winter": 2 2 1 1 1 1 1 1 1 1 ...
 $ City  : Factor w/ 42 levels "Albertville",..: 1 13 8 27 19 28 5 3 24 23 ...
 $ Sport : Factor w/ 58 levels "Alpine Skiing",..: 14 1 57 47 3 24 36 4 4 11 ...
 $ Event : Factor w/ 618 levels "Alpine Skiing Men's Combined",..: 176 9 577 511 40 293 363 64 82 144 ...
 $ Medal : Ord.factor w/ 4 levels "Bronze"<"Silver"<..: 4 4 4 4 4 4 4 3 4 1 ...
 $ region: Factor w/ 187 levels "Afghanistan",..: 45 7 62 62 179 172 58 57 63 42 ...
 $ notes : chr [1:10000] NA NA NA NA ...

Missing values

The data merger introduced NULL values into the new data. These NULL values can greatly affect the outcome of any computation done on the new data.

[1] 14466

    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0    317   2117   2218      0      0      0      0      0 
  City  Sport  Event  Medal region  notes 
     0      0      0      0     12   9802

The total count of missing value in the sample_olympics data is 22961 from some of the variables like notes, Medal, Height etc.

Data imputation with R

A few sample of the sample_olympics data is selected with the na values removed and descriptive statistics is computed for columns with missing values.

Sample statistics of the `Age` with sample size 7 and NAs removed

Adding missing grouping variables: `Sport`

# A tibble: 7 x 6
  Sport                mean median   min   max     SD
  <fct>               <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 Cycling              24.9     24    17    42   4.41
2 Speed Skating        24.0     24    14    41   4.20
3 Archery              26.7     24    17    63   8.37
4 Rhythmic Gymnastics  19.4     19    15    27   3.11
5 Military Ski Patrol  32       32    32    32 NaN   
6 Judo                 25.0     25    18    36   3.59
7 Rugby Sevens         26.2     27    19    33   4.09

Median imputation

The median values of the following columns will be used to replace missing values in these columns.

Age
Height
Weight

The outputs below showed no NA values because these values have been imputed with new values.

    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0      0      0      0      0      0      0      0      0 
  City  Sport  Event  Medal region  notes 
     0      0      0      0      0      0

# A tibble: 6 x 17
# Groups:   Sport, Age [6]
     ID Name  Sex     Age Height Weight Team  NOC   Games Year  Season City 
  <dbl> <chr> <fct> <dbl>  <dbl>  <dbl> <fct> <fct> <fct> <fct> <fct>  <fct>
1 62669 "Rok~ Male     21    180     71 Yugo~ YUG   1980~ 1980  Summer Mosk~
2 80059 "Zdr~ Male     25    182     72 Yugo~ YUG   1976~ 1976  Summer Mont~
3 57810 "Kar~ Male     29    173     74 Yugo~ YUG   1984~ 1984  Summer Los ~
4 44254 "Pet~ Male     23    190     70 Yugo~ YUG   1972~ 1972  Summer Muni~
5 33672 "Min~ Male     38    189     81 Yugo~ YUG   1980~ 1980  Summer Mosk~
6 24258 "Mil~ Male     23    186     83 Yugo~ YUG   1980~ 1980  Summer Mosk~
# ... with 5 more variables: Sport <fct>, Event <fct>, Medal <ord>,
#   region <fct>, notes <chr>

Descriptive Statistics

Visualization of age distribution

Visualization of height distribution

Visualization of weight distribution

Table 1

Descriptive statistics	Age	Height	Weight
`Mean`	24.54	177.28	71.61
`Median`	24	178	72
`Standard deviation`	4.99	9.81	11.74

Data Exploration with SQL

Create database and tables

[1] "nc"    "olymp"

Number of rows in olymp table


SELECT COUNT(*) AS OLYMPICS_COUNT FROM olymp

1 records
OLYMPICS_COUNT
271116


SELECT * FROM olymp LIMIT 6;

6 records
ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
1	A Dijiang	M	24	180	80	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball	NA
2	A Lamusi	M	23	170	60	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight	NA
3	Gunnar Nielsen Aaby	M	24	NA	NA	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football	NA
4	Edgar Lindenau Aabye	M	34	NA	NA	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
5	Christine Jacoba Aaftink	F	21	185	82	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	NA
5	Christine Jacoba Aaftink	F	21	185	82	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 1,000 metres	NA

Number of rows in nc table


SELECT COUNT(*) AS NOC_COUNT FROM nc

1 records
NOC_COUNT
230

Tables columns and datatypes

olymp table


PRAGMA table_xinfo(olymp);

Displaying records 1 - 10
cid	name	type	dflt_value
0	ID	REAL	NA
1	Name	TEXT	NA
2	Sex	TEXT	NA
3	Age	REAL	NA
4	Height	REAL	NA
5	Weight	REAL	NA
6	Team	TEXT	NA
7	NOC	TEXT	NA
8	Games	TEXT	NA
9	Year	REAL	NA


SELECT sql FROM sqlite_master WHERE tbl_name = 'olymp';

1 records
sql
CREATE TABLE `olymp` (

ID REAL, Name TEXT, Sex TEXT, Age REAL, Height REAL, Weight REAL, Team TEXT, NOC TEXT, Games TEXT, Year REAL, Season TEXT, City TEXT, Sport TEXT, Event TEXT, Medal TEXT ) |

nc table


PRAGMA table_xinfo(nc);

3 records
cid	name	type	dflt_value
0	NOC	TEXT	NA
1	region	TEXT	NA
2	notes	TEXT	NA

Discussion 1

From the results above, the data types of some of the columns are not appropriate as seen earlier while using R for the analysis. notnull column mean the column will not allow NULL and the column dflt_value, mean default value for the column. Thought the tables would not accept NULL values but NA values are allowed instead of NULL. This fact is also confirmed using R

Data imputation with SQL

Count the NA values for the variables Age, Height and Weight

Median values


--Median age

SELECT Age AS Median_Age
FROM (SELECT Age
      FROM olymp
      ORDER BY Age
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))

7 records
Median_Age
24
24
24
24
24
24
24


-- Median height
SELECT Height AS Median_Height
FROM (SELECT Height
      FROM olymp
      ORDER BY Height
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))

7 records
Median_Height
171
171
171
171
171
171
171


-- Median weight

SELECT Weight AS Median_Weight
FROM (SELECT Weight
      FROM olymp
      ORDER BY Weight
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))

7 records
Median_Weight
64
64
64
64
64
64
64

Discussion 2

From the outputs of both R and SQL, the sum of NA values for columns of interest are the same. Also, the similar median values are gotten from both R and SQL. Thus, ensuring two way data validation.

Replacement of NA values

Data with NAs


SELECT Sport
      ,Age
      ,Height 
      ,Weight
FROM olymp
GROUP BY Sport
LIMIT 15

Update data


UPDATE [olymp]
SET [Age]=24
WHERE [Age]= 'NA';


UPDATE [olymp]
SET [Height]=171
WHERE [Height]= 'NA';

UPDATE [olymp]
SET [Weight]=64
WHERE [Weight]= 'NA';

Data without NAs


SELECT Sport
      ,Age
      ,Height 
      ,Weight
FROM olymp
GROUP BY Sport
LIMIT 15

SQL data visualization

Age distribution

Height distribution

Weight distribution

Definition of terms

Sample: is a set of individuals or objects selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.
Population: a group of individuals that share at least one characteristic in common
Statistic: a statistical value that is calculated from all the values in a sample
Parameter: a statistical value that is calculated from all the values in a whole population

Parameters

Population mean


SELECT
      round(avg(Age)) AS Avg_Population_Age
     ,round(avg(Height)) AS Avg_Population_Height
     ,round(avg(Weight)) AS Avg_Population_Weight
FROM olymp;

Population median


SELECT Age AS Median_Popultion_Age
FROM (SELECT Age
      FROM olymp
      ORDER BY Age
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp));

Table 2

Variables	Population stat		Descriptive stat
.	Mean	Median	Mean	Median
`Age`	26	24	25	25
`Height`	174	171	177	178
`Weight`	69	64	72	72

Discussion 3

Similar trends were observed comparing the distributions of the above variables using R programming and SQL respectively.

Summary

The following conclusions can be inferred from the Olympics data

The age of most athletes in the games is between 24 and 26 years
The height of athletes is from 171 to 178cm
The weight is from 64 to 72kg

This milestone 2 project does not yet validate any of the assumptions made in milestone 1. Hopefully, some of the assumptions will be clarified by milestone 3. As at now no addition question comes to mind concerning the data.

References

Difference between Descriptive and Inferential Statistics
Descriptive statistics
Descriptive statistics
Juan Klopper: Understanding Clinical Research: Behind the Statistics
forpas
DB Browser for SQLite
Caitlin Lonski
R Core Team (2019)
ggplot2: Elegant Graphics for Data Analysis
RSQLite: 'SQLite' Interface for R
readr: Read Rectangular Text Data
dplyr: A Grammar of Data Manipulation

Milestone 2: Descriptive Stats of Olympics Dataset - 120 years of data

SQL for Data Science Week 2 Capstone Project

Olusola Afuwape

2020-08-21