Load Packages

The following R packages are used for this analysis:

  1. readr
  2. RSQLite
  3. dplyr
  4. ggplot2

Overview

This project is sequel to Proposal for Data Analysis Project which is also the Milestone 1 in the capstone project.

In this Milestone 2 project, descriptive statistics are used on the dataset. Descriptive statistical methods involve the use of statistical techniques to quantitatively give a detailed summary of the sample dataset. Some of these techniques include:

Graphical representation is also used in descriptive statistics.

Preamble

R and SQL will be used for this analysis. Similar descriptive statistical results should be expected using either R or SQL. This approach should ensure a two way data validation. The R codes are not displayed, only its outputs are seen. Conversely, both the SQL codes and its outputs will be seen because this project is really about SQL and not R.

Getting and Cleaning Data

Using R

Data display of olympics_data


# A tibble: 6 x 15
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>


Observe the sum of NAs and columns with NA values


[1] 363853
    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0   9474  60171  62875      0      0      0      0      0 
  City  Sport  Event  Medal 
     0      0      0 231333 


An observation of the olympics data showed lots of missing values, total sum of 363853. Some of the columns has quite a high level of NAs. The Medal has high number of missing values probably due to the fact that most athletes in the Olympics do not win any medals.

Data display of noc_data


# A tibble: 6 x 3
  NOC   region      notes               
  <chr> <chr>       <chr>               
1 AFG   Afghanistan <NA>                
2 AHO   Curacao     Netherlands Antilles
3 ALB   Albania     <NA>                
4 ALG   Algeria     <NA>                
5 AND   Andorra     <NA>                
6 ANG   Angola      <NA>                


Merging data files

The two dataset can be merged using the common NOC variable. The left_join function of the dplyr package is used to join the noc_data data with 3 variables to olympics_data data with 15. The resultant data will now have 17 variables.


# A tibble: 6 x 17
     ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
  <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1     1 A Di~ M        24    180     80 China CHN   1992~  1992 Summer Barc~
2     2 A La~ M        23    170     60 China CHN   2012~  2012 Summer Lond~
3     3 Gunn~ M        24     NA     NA Denm~ DEN   1920~  1920 Summer Antw~
4     4 Edga~ M        34     NA     NA Denm~ DEN   1900~  1900 Summer Paris
5     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
6     5 Chri~ F        21    185     82 Neth~ NED   1988~  1988 Winter Calg~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
#   region <chr>, notes <chr>
tibble [271,116 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
 $ Name  : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
 $ Sex   : chr [1:271116] "M" "M" "M" "M" ...
 $ Age   : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
 $ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
 $ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
 $ Team  : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
 $ NOC   : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
 $ Games : chr [1:271116] "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
 $ Year  : num [1:271116] 1992 2012 1920 1900 1988 ...
 $ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
 $ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
 $ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
 $ Medal : chr [1:271116] NA NA NA "Gold" ...
 $ region: chr [1:271116] "China" "China" "Denmark" "Denmark" ...
 $ notes : chr [1:271116] NA NA NA NA ...


Sampling

The new data from the merger i.e nocolympics has over 200,000 rows and 17 columns. A sample of size 10000 is taken from the data to compute the descriptive statistics.


# A tibble: 6 x 17
      ID Name  Sex     Age Height Weight Team  NOC   Games  Year Season City 
   <dbl> <chr> <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
1   9967 Pave~ M        28    172     70 Czec~ TCH   1992~  1992 Winter Albe~
2 111428 Helg~ F        20    171     61 Arge~ ARG   1968~  1968 Winter Gren~
3 127942 Paul~ M        29     NA    109 Germ~ GER   1936~  1936 Summer Berl~
4  37121 Juli~ M        18     NA     NA Germ~ GER   1900~  1900 Summer Paris
5  23695 Leon~ M        53    173     76 Unit~ USA   1932~  1932 Summer Los ~
6   5056 Ferh~ M        23    178     68 Turk~ TUR   2016~  2016 Summer Rio ~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
#   region <chr>, notes <chr>


Variable conversions

Examining the class of each of the columns revealed inappropriate data type assigned to some of the columns.

         ID        Name         Sex         Age      Height      Weight 
  "numeric" "character" "character"   "numeric"   "numeric"   "numeric" 
       Team         NOC       Games        Year      Season        City 
"character" "character" "character"   "numeric" "character" "character" 
      Sport       Event       Medal      region       notes 
"character" "character" "character" "character" "character" 


The following variables will be changed from:

Character to factor

  • Team
  • NOC
  • Games
  • Season
  • City
  • Sport
  • Event
  • region


Character to labelled factor

  • Sex


Characer to ordered factor

  • Medal


Number to factor

  • Year


[1] "Bronze" "Silver" "Gold"   "None"  
tibble [10,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID    : num [1:10000] 9967 111428 127942 37121 23695 ...
 $ Name  : chr [1:10000] "Pavel Benc" "Helga Mara Sista de Salvateli" "Paul Wahl" "Julius Frey" ...
 $ Sex   : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 1 1 1 1 1 ...
 $ Age   : num [1:10000] 28 20 29 18 53 23 23 22 32 20 ...
 $ Height: num [1:10000] 172 171 NA NA 173 178 179 174 183 173 ...
 $ Weight: num [1:10000] 70 61 109 NA 76 68 70 65 73 64 ...
 $ Team  : Factor w/ 369 levels "30. Februar",..: 86 14 124 124 348 336 113 111 128 82 ...
 $ NOC   : Factor w/ 204 levels "AFG","AHO","ALB",..: 179 9 71 71 196 188 64 63 72 48 ...
 $ Games : Factor w/ 51 levels "1896 Summer",..: 39 27 14 2 12 51 41 7 28 32 ...
 $ Year  : Factor w/ 35 levels "1896","1900",..: 23 17 11 2 10 35 25 7 18 20 ...
 $ Season: Factor w/ 2 levels "Summer","Winter": 2 2 1 1 1 1 1 1 1 1 ...
 $ City  : Factor w/ 42 levels "Albertville",..: 1 13 8 27 19 28 5 3 24 23 ...
 $ Sport : Factor w/ 58 levels "Alpine Skiing",..: 14 1 57 47 3 24 36 4 4 11 ...
 $ Event : Factor w/ 618 levels "Alpine Skiing Men's Combined",..: 176 9 577 511 40 293 363 64 82 144 ...
 $ Medal : Ord.factor w/ 4 levels "Bronze"<"Silver"<..: 4 4 4 4 4 4 4 3 4 1 ...
 $ region: Factor w/ 187 levels "Afghanistan",..: 45 7 62 62 179 172 58 57 63 42 ...
 $ notes : chr [1:10000] NA NA NA NA ...


Missing values

The data merger introduced NULL values into the new data. These NULL values can greatly affect the outcome of any computation done on the new data.


[1] 14466
    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0    317   2117   2218      0      0      0      0      0 
  City  Sport  Event  Medal region  notes 
     0      0      0      0     12   9802 


The total count of missing value in the sample_olympics data is 22961 from some of the variables like notes, Medal, Height etc.


Data imputation with R

A few sample of the sample_olympics data is selected with the na values removed and descriptive statistics is computed for columns with missing values.


Sample statistics of the Age with sample size 7 and NAs removed


Adding missing grouping variables: `Sport`
# A tibble: 7 x 6
  Sport                mean median   min   max     SD
  <fct>               <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 Cycling              24.9     24    17    42   4.41
2 Speed Skating        24.0     24    14    41   4.20
3 Archery              26.7     24    17    63   8.37
4 Rhythmic Gymnastics  19.4     19    15    27   3.11
5 Military Ski Patrol  32       32    32    32 NaN   
6 Judo                 25.0     25    18    36   3.59
7 Rugby Sevens         26.2     27    19    33   4.09


Median imputation


The median values of the following columns will be used to replace missing values in these columns.

  • Age
  • Height
  • Weight


The outputs below showed no NA values because these values have been imputed with new values.


    ID   Name    Sex    Age Height Weight   Team    NOC  Games   Year Season 
     0      0      0      0      0      0      0      0      0      0      0 
  City  Sport  Event  Medal region  notes 
     0      0      0      0      0      0 
# A tibble: 6 x 17
# Groups:   Sport, Age [6]
     ID Name  Sex     Age Height Weight Team  NOC   Games Year  Season City 
  <dbl> <chr> <fct> <dbl>  <dbl>  <dbl> <fct> <fct> <fct> <fct> <fct>  <fct>
1 62669 "Rok~ Male     21    180     71 Yugo~ YUG   1980~ 1980  Summer Mosk~
2 80059 "Zdr~ Male     25    182     72 Yugo~ YUG   1976~ 1976  Summer Mont~
3 57810 "Kar~ Male     29    173     74 Yugo~ YUG   1984~ 1984  Summer Los ~
4 44254 "Pet~ Male     23    190     70 Yugo~ YUG   1972~ 1972  Summer Muni~
5 33672 "Min~ Male     38    189     81 Yugo~ YUG   1980~ 1980  Summer Mosk~
6 24258 "Mil~ Male     23    186     83 Yugo~ YUG   1980~ 1980  Summer Mosk~
# ... with 5 more variables: Sport <fct>, Event <fct>, Medal <ord>,
#   region <fct>, notes <chr>


Descriptive Statistics


Visualization of age distribution



Visualization of height distribution



Visualization of weight distribution



Table 1


Descriptive statistics Age Height Weight
Mean 24.54 177.28 71.61
Median 24 178 72
Standard deviation 4.99 9.81 11.74


Data Exploration with SQL


Create database and tables


[1] "nc"    "olymp"


Number of rows in olymp table



SELECT COUNT(*) AS OLYMPICS_COUNT FROM olymp
1 records
OLYMPICS_COUNT
271116



SELECT * FROM olymp LIMIT 6;
6 records
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
1 A Dijiang M 24 180 80 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NA
2 A Lamusi M 23 170 60 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NA
3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NA
4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NA
5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 1,000 metres NA


Number of rows in nc table



SELECT COUNT(*) AS NOC_COUNT FROM nc
1 records
NOC_COUNT
230


Tables columns and datatypes


olymp table



PRAGMA table_xinfo(olymp);
Displaying records 1 - 10
cid name type notnull dflt_value pk hidden
0 ID REAL 0 NA 0 0
1 Name TEXT 0 NA 0 0
2 Sex TEXT 0 NA 0 0
3 Age REAL 0 NA 0 0
4 Height REAL 0 NA 0 0
5 Weight REAL 0 NA 0 0
6 Team TEXT 0 NA 0 0
7 NOC TEXT 0 NA 0 0
8 Games TEXT 0 NA 0 0
9 Year REAL 0 NA 0 0



SELECT sql FROM sqlite_master WHERE tbl_name = 'olymp';
1 records
sql
CREATE TABLE olymp (

ID REAL, Name TEXT, Sex TEXT, Age REAL, Height REAL, Weight REAL, Team TEXT, NOC TEXT, Games TEXT, Year REAL, Season TEXT, City TEXT, Sport TEXT, Event TEXT, Medal TEXT ) |


nc table



PRAGMA table_xinfo(nc);
3 records
cid name type notnull dflt_value pk hidden
0 NOC TEXT 0 NA 0 0
1 region TEXT 0 NA 0 0
2 notes TEXT 0 NA 0 0


Discussion 1

From the results above, the data types of some of the columns are not appropriate as seen earlier while using R for the analysis. notnull column mean the column will not allow NULL and the column dflt_value, mean default value for the column. Thought the tables would not accept NULL values but NA values are allowed instead of NULL. This fact is also confirmed using R


Data imputation with SQL

Count the NA values for the variables Age, Height and Weight



Median values



--Median age

SELECT Age AS Median_Age
FROM (SELECT Age
      FROM olymp
      ORDER BY Age
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))
7 records
Median_Age
24
24
24
24
24
24
24



-- Median height
SELECT Height AS Median_Height
FROM (SELECT Height
      FROM olymp
      ORDER BY Height
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))
7 records
Median_Height
171
171
171
171
171
171
171



-- Median weight

SELECT Weight AS Median_Weight
FROM (SELECT Weight
      FROM olymp
      ORDER BY Weight
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp))
7 records
Median_Weight
64
64
64
64
64
64
64


Discussion 2

From the outputs of both R and SQL, the sum of NA values for columns of interest are the same. Also, the similar median values are gotten from both R and SQL. Thus, ensuring two way data validation.


Replacement of NA values


Data with NAs



SELECT Sport
      ,Age
      ,Height 
      ,Weight
FROM olymp
GROUP BY Sport
LIMIT 15



Update data



UPDATE [olymp]
SET [Age]=24
WHERE [Age]= 'NA';


UPDATE [olymp]
SET [Height]=171
WHERE [Height]= 'NA';

UPDATE [olymp]
SET [Weight]=64
WHERE [Weight]= 'NA';


Data without NAs



SELECT Sport
      ,Age
      ,Height 
      ,Weight
FROM olymp
GROUP BY Sport
LIMIT 15




SQL data visualization


Age distribution



Height distribution



Weight distribution



Definition of terms


Parameters


Population mean



SELECT
      round(avg(Age)) AS Avg_Population_Age
     ,round(avg(Height)) AS Avg_Population_Height
     ,round(avg(Weight)) AS Avg_Population_Weight
FROM olymp;



Population median



SELECT Age AS Median_Popultion_Age
FROM (SELECT Age
      FROM olymp
      ORDER BY Age
      LIMIT 7
      OFFSET (SELECT (COUNT(*) - 1) / 2
      FROM olymp));



Table 2


Variables Population stat Descriptive stat
. Mean Median Mean Median
Age 26 24 25 25
Height 174 171 177 178
Weight 69 64 72 72


Discussion 3

Similar trends were observed comparing the distributions of the above variables using R programming and SQL respectively.


Summary


The following conclusions can be inferred from the Olympics data

This milestone 2 project does not yet validate any of the assumptions made in milestone 1. Hopefully, some of the assumptions will be clarified by milestone 3. As at now no addition question comes to mind concerning the data.


References

  1. Difference between Descriptive and Inferential Statistics
  2. Descriptive statistics
  3. Descriptive statistics
  4. Juan Klopper: Understanding Clinical Research: Behind the Statistics
  5. forpas
  6. DB Browser for SQLite
  7. Caitlin Lonski
  8. R Core Team (2019)
  9. ggplot2: Elegant Graphics for Data Analysis
  10. RSQLite: 'SQLite' Interface for R
  11. readr: Read Rectangular Text Data
  12. dplyr: A Grammar of Data Manipulation