The following R packages are used for this analysis:
readr
RSQLite
dplyr
ggplot2
This project is sequel to Proposal for Data Analysis Project which is also the Milestone 1 in the capstone project.
In this Milestone 2 project, descriptive statistics are used on the dataset. Descriptive statistical methods involve the use of statistical techniques to quantitatively give a detailed summary of the sample dataset. Some of these techniques include:
Mean, Median and ModeRange, Standard deviation, Variance etcGraphical representation is also used in descriptive statistics.
R and SQL will be used for this analysis. Similar descriptive statistical results should be expected using either R or SQL. This approach should ensure a two way data validation. The R codes are not displayed, only its outputs are seen. Conversely, both the SQL codes and its outputs will be seen because this project is really about SQL and not R.
# A tibble: 6 x 15
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Di~ M 24 180 80 China CHN 1992~ 1992 Summer Barc~
2 2 A La~ M 23 170 60 China CHN 2012~ 2012 Summer Lond~
3 3 Gunn~ M 24 NA NA Denm~ DEN 1920~ 1920 Summer Antw~
4 4 Edga~ M 34 NA NA Denm~ DEN 1900~ 1900 Summer Paris
5 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
6 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
# ... with 3 more variables: Sport <chr>, Event <chr>, Medal <chr>
[1] 363853
ID Name Sex Age Height Weight Team NOC Games Year Season
0 0 0 9474 60171 62875 0 0 0 0 0
City Sport Event Medal
0 0 0 231333
An observation of the olympics data showed lots of missing values, total sum of 363853. Some of the columns has quite a high level of NAs. The Medal has high number of missing values probably due to the fact that most athletes in the Olympics do not win any medals.
# A tibble: 6 x 3
NOC region notes
<chr> <chr> <chr>
1 AFG Afghanistan <NA>
2 AHO Curacao Netherlands Antilles
3 ALB Albania <NA>
4 ALG Algeria <NA>
5 AND Andorra <NA>
6 ANG Angola <NA>
The two dataset can be merged using the common NOC variable. The left_join function of the dplyr package is used to join the noc_data data with 3 variables to olympics_data data with 15. The resultant data will now have 17 variables.
# A tibble: 6 x 17
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 A Di~ M 24 180 80 China CHN 1992~ 1992 Summer Barc~
2 2 A La~ M 23 170 60 China CHN 2012~ 2012 Summer Lond~
3 3 Gunn~ M 24 NA NA Denm~ DEN 1920~ 1920 Summer Antw~
4 4 Edga~ M 34 NA NA Denm~ DEN 1900~ 1900 Summer Paris
5 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
6 5 Chri~ F 21 185 82 Neth~ NED 1988~ 1988 Winter Calg~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
# region <chr>, notes <chr>
tibble [271,116 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:271116] 1 2 3 4 5 5 5 5 5 5 ...
$ Name : chr [1:271116] "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
$ Sex : chr [1:271116] "M" "M" "M" "M" ...
$ Age : num [1:271116] 24 23 24 34 21 21 25 25 27 27 ...
$ Height: num [1:271116] 180 170 NA NA 185 185 185 185 185 185 ...
$ Weight: num [1:271116] 80 60 NA NA 82 82 82 82 82 82 ...
$ Team : chr [1:271116] "China" "China" "Denmark" "Denmark/Sweden" ...
$ NOC : chr [1:271116] "CHN" "CHN" "DEN" "DEN" ...
$ Games : chr [1:271116] "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
$ Year : num [1:271116] 1992 2012 1920 1900 1988 ...
$ Season: chr [1:271116] "Summer" "Summer" "Summer" "Summer" ...
$ City : chr [1:271116] "Barcelona" "London" "Antwerpen" "Paris" ...
$ Sport : chr [1:271116] "Basketball" "Judo" "Football" "Tug-Of-War" ...
$ Event : chr [1:271116] "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
$ Medal : chr [1:271116] NA NA NA "Gold" ...
$ region: chr [1:271116] "China" "China" "Denmark" "Denmark" ...
$ notes : chr [1:271116] NA NA NA NA ...
The new data from the merger i.e nocolympics has over 200,000 rows and 17 columns. A sample of size 10000 is taken from the data to compute the descriptive statistics.
# A tibble: 6 x 17
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 9967 Pave~ M 28 172 70 Czec~ TCH 1992~ 1992 Winter Albe~
2 111428 Helg~ F 20 171 61 Arge~ ARG 1968~ 1968 Winter Gren~
3 127942 Paul~ M 29 NA 109 Germ~ GER 1936~ 1936 Summer Berl~
4 37121 Juli~ M 18 NA NA Germ~ GER 1900~ 1900 Summer Paris
5 23695 Leon~ M 53 173 76 Unit~ USA 1932~ 1932 Summer Los ~
6 5056 Ferh~ M 23 178 68 Turk~ TUR 2016~ 2016 Summer Rio ~
# ... with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>,
# region <chr>, notes <chr>
Examining the class of each of the columns revealed inappropriate data type assigned to some of the columns.
ID Name Sex Age Height Weight
"numeric" "character" "character" "numeric" "numeric" "numeric"
Team NOC Games Year Season City
"character" "character" "character" "numeric" "character" "character"
Sport Event Medal region notes
"character" "character" "character" "character" "character"
The following variables will be changed from:
[1] "Bronze" "Silver" "Gold" "None"
tibble [10,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:10000] 9967 111428 127942 37121 23695 ...
$ Name : chr [1:10000] "Pavel Benc" "Helga Mara Sista de Salvateli" "Paul Wahl" "Julius Frey" ...
$ Sex : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 1 1 1 1 1 ...
$ Age : num [1:10000] 28 20 29 18 53 23 23 22 32 20 ...
$ Height: num [1:10000] 172 171 NA NA 173 178 179 174 183 173 ...
$ Weight: num [1:10000] 70 61 109 NA 76 68 70 65 73 64 ...
$ Team : Factor w/ 369 levels "30. Februar",..: 86 14 124 124 348 336 113 111 128 82 ...
$ NOC : Factor w/ 204 levels "AFG","AHO","ALB",..: 179 9 71 71 196 188 64 63 72 48 ...
$ Games : Factor w/ 51 levels "1896 Summer",..: 39 27 14 2 12 51 41 7 28 32 ...
$ Year : Factor w/ 35 levels "1896","1900",..: 23 17 11 2 10 35 25 7 18 20 ...
$ Season: Factor w/ 2 levels "Summer","Winter": 2 2 1 1 1 1 1 1 1 1 ...
$ City : Factor w/ 42 levels "Albertville",..: 1 13 8 27 19 28 5 3 24 23 ...
$ Sport : Factor w/ 58 levels "Alpine Skiing",..: 14 1 57 47 3 24 36 4 4 11 ...
$ Event : Factor w/ 618 levels "Alpine Skiing Men's Combined",..: 176 9 577 511 40 293 363 64 82 144 ...
$ Medal : Ord.factor w/ 4 levels "Bronze"<"Silver"<..: 4 4 4 4 4 4 4 3 4 1 ...
$ region: Factor w/ 187 levels "Afghanistan",..: 45 7 62 62 179 172 58 57 63 42 ...
$ notes : chr [1:10000] NA NA NA NA ...
The data merger introduced NULL values into the new data. These NULL values can greatly affect the outcome of any computation done on the new data.
[1] 14466
ID Name Sex Age Height Weight Team NOC Games Year Season
0 0 0 317 2117 2218 0 0 0 0 0
City Sport Event Medal region notes
0 0 0 0 12 9802
The total count of missing value in the sample_olympics data is 22961 from some of the variables like notes, Medal, Height etc.
A few sample of the sample_olympics data is selected with the na values removed and descriptive statistics is computed for columns with missing values.
Age with sample size 7 and NAs removedAdding missing grouping variables: `Sport`
# A tibble: 7 x 6
Sport mean median min max SD
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cycling 24.9 24 17 42 4.41
2 Speed Skating 24.0 24 14 41 4.20
3 Archery 26.7 24 17 63 8.37
4 Rhythmic Gymnastics 19.4 19 15 27 3.11
5 Military Ski Patrol 32 32 32 32 NaN
6 Judo 25.0 25 18 36 3.59
7 Rugby Sevens 26.2 27 19 33 4.09
The median values of the following columns will be used to replace missing values in these columns.
The outputs below showed no NA values because these values have been imputed with new values.
ID Name Sex Age Height Weight Team NOC Games Year Season
0 0 0 0 0 0 0 0 0 0 0
City Sport Event Medal region notes
0 0 0 0 0 0
# A tibble: 6 x 17
# Groups: Sport, Age [6]
ID Name Sex Age Height Weight Team NOC Games Year Season City
<dbl> <chr> <fct> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 62669 "Rok~ Male 21 180 71 Yugo~ YUG 1980~ 1980 Summer Mosk~
2 80059 "Zdr~ Male 25 182 72 Yugo~ YUG 1976~ 1976 Summer Mont~
3 57810 "Kar~ Male 29 173 74 Yugo~ YUG 1984~ 1984 Summer Los ~
4 44254 "Pet~ Male 23 190 70 Yugo~ YUG 1972~ 1972 Summer Muni~
5 33672 "Min~ Male 38 189 81 Yugo~ YUG 1980~ 1980 Summer Mosk~
6 24258 "Mil~ Male 23 186 83 Yugo~ YUG 1980~ 1980 Summer Mosk~
# ... with 5 more variables: Sport <fct>, Event <fct>, Medal <ord>,
# region <fct>, notes <chr>
| Descriptive statistics | Age | Height | Weight |
|---|---|---|---|
Mean |
24.54 | 177.28 | 71.61 |
Median |
24 | 178 | 72 |
Standard deviation |
4.99 | 9.81 | 11.74 |
[1] "nc" "olymp"
SELECT COUNT(*) AS OLYMPICS_COUNT FROM olymp| OLYMPICS_COUNT |
|---|
| 271116 |
SELECT * FROM olymp LIMIT 6;| ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A Dijiang | M | 24 | 180 | 80 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NA |
| 2 | A Lamusi | M | 23 | 170 | 60 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NA |
| 3 | Gunnar Nielsen Aaby | M | 24 | NA | NA | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NA |
| 4 | Edgar Lindenau Aabye | M | 34 | NA | NA | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 5 | Christine Jacoba Aaftink | F | 21 | 185 | 82 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NA |
| 5 | Christine Jacoba Aaftink | F | 21 | 185 | 82 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NA |
SELECT COUNT(*) AS NOC_COUNT FROM nc| NOC_COUNT |
|---|
| 230 |
PRAGMA table_xinfo(olymp);| cid | name | type | notnull | dflt_value | pk | hidden |
|---|---|---|---|---|---|---|
| 0 | ID | REAL | 0 | NA | 0 | 0 |
| 1 | Name | TEXT | 0 | NA | 0 | 0 |
| 2 | Sex | TEXT | 0 | NA | 0 | 0 |
| 3 | Age | REAL | 0 | NA | 0 | 0 |
| 4 | Height | REAL | 0 | NA | 0 | 0 |
| 5 | Weight | REAL | 0 | NA | 0 | 0 |
| 6 | Team | TEXT | 0 | NA | 0 | 0 |
| 7 | NOC | TEXT | 0 | NA | 0 | 0 |
| 8 | Games | TEXT | 0 | NA | 0 | 0 |
| 9 | Year | REAL | 0 | NA | 0 | 0 |
SELECT sql FROM sqlite_master WHERE tbl_name = 'olymp';| sql |
|---|
CREATE TABLE olymp ( |
ID REAL, Name TEXT, Sex TEXT, Age REAL, Height REAL, Weight REAL, Team TEXT, NOC TEXT, Games TEXT, Year REAL, Season TEXT, City TEXT, Sport TEXT, Event TEXT, Medal TEXT ) |
PRAGMA table_xinfo(nc);| cid | name | type | notnull | dflt_value | pk | hidden |
|---|---|---|---|---|---|---|
| 0 | NOC | TEXT | 0 | NA | 0 | 0 |
| 1 | region | TEXT | 0 | NA | 0 | 0 |
| 2 | notes | TEXT | 0 | NA | 0 | 0 |
From the results above, the data types of some of the columns are not appropriate as seen earlier while using R for the analysis. notnull column mean the column will not allow NULL and the column dflt_value, mean default value for the column. Thought the tables would not accept NULL values but NA values are allowed instead of NULL. This fact is also confirmed using R
Count the NA values for the variables Age, Height and Weight
--Median age
SELECT Age AS Median_Age
FROM (SELECT Age
FROM olymp
ORDER BY Age
LIMIT 7
OFFSET (SELECT (COUNT(*) - 1) / 2
FROM olymp))| Median_Age |
|---|
| 24 |
| 24 |
| 24 |
| 24 |
| 24 |
| 24 |
| 24 |
-- Median height
SELECT Height AS Median_Height
FROM (SELECT Height
FROM olymp
ORDER BY Height
LIMIT 7
OFFSET (SELECT (COUNT(*) - 1) / 2
FROM olymp))| Median_Height |
|---|
| 171 |
| 171 |
| 171 |
| 171 |
| 171 |
| 171 |
| 171 |
-- Median weight
SELECT Weight AS Median_Weight
FROM (SELECT Weight
FROM olymp
ORDER BY Weight
LIMIT 7
OFFSET (SELECT (COUNT(*) - 1) / 2
FROM olymp))| Median_Weight |
|---|
| 64 |
| 64 |
| 64 |
| 64 |
| 64 |
| 64 |
| 64 |
From the outputs of both R and SQL, the sum of NA values for columns of interest are the same. Also, the similar median values are gotten from both R and SQL. Thus, ensuring two way data validation.
SELECT Sport
,Age
,Height
,Weight
FROM olymp
GROUP BY Sport
LIMIT 15
UPDATE [olymp]
SET [Age]=24
WHERE [Age]= 'NA';
UPDATE [olymp]
SET [Height]=171
WHERE [Height]= 'NA';
UPDATE [olymp]
SET [Weight]=64
WHERE [Weight]= 'NA';
SELECT Sport
,Age
,Height
,Weight
FROM olymp
GROUP BY Sport
LIMIT 15
SELECT
round(avg(Age)) AS Avg_Population_Age
,round(avg(Height)) AS Avg_Population_Height
,round(avg(Weight)) AS Avg_Population_Weight
FROM olymp;
SELECT Age AS Median_Popultion_Age
FROM (SELECT Age
FROM olymp
ORDER BY Age
LIMIT 7
OFFSET (SELECT (COUNT(*) - 1) / 2
FROM olymp));| Variables | Population stat | Descriptive stat | ||
|---|---|---|---|---|
| . | Mean | Median | Mean | Median |
Age |
26 | 24 | 25 | 25 |
Height |
174 | 171 | 177 | 178 |
Weight |
69 | 64 | 72 | 72 |
Similar trends were observed comparing the distributions of the above variables using R programming and SQL respectively.
The following conclusions can be inferred from the Olympics data
This milestone 2 project does not yet validate any of the assumptions made in milestone 1. Hopefully, some of the assumptions will be clarified by milestone 3. As at now no addition question comes to mind concerning the data.