Milestone 4: Presenting Your Findings

Olusola Afuwape
16th September, 2020

Contents

  • Review of findings during different data analyses
  • Discussion of technical challenges during data analysis
  • Visualization
  • Correlation, regression and relationship
  • Hypothesis
  • Recommendations

The project analysis began with:

  1. Creation of the database and an overview of its variables and observations
  2. Construction of an Entity Relationship Diagram (ERD)
  3. Postulation of hypotheses
  4. Questions that come to mind

Tables:
[1] "nc"    "olymp"

Columns of olymp:
 [1] "ID"     "Name"   "Sex"    "Age"    "Height" "Weight" "Team"   "NOC"   
 [9] "Games"  "Year"   "Season" "City"   "Sport"  "Event"  "Medal" 

Columns of nc:
[1] "NOC"    "region" "notes" 

First rows of olymp:

| ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A Dijiang | M | 24 | 180 | 80 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NA |
| 2 | A Lamusi | M | 23 | 170 | 60 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NA |
| 3 | Gunnar Nielsen Aaby | M | 24 | NA | NA | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NA |
| 4 | Edgar Lindenau Aabye | M | 34 | NA | NA | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 5 | Christine Jacoba Aaftink | F | 21 | 185 | 82 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NA |
| 5 | Christine Jacoba Aaftink | F | 21 | 185 | 82 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NA |

First rows of nc:

| NOC | region | notes |
|---|---|---|
| AFG | Afghanistan | NA |
| AHO | Curacao | Netherlands Antilles |
| ALB | Albania | NA |
| ALG | Algeria | NA |
| AND | Andorra | NA |
| ANG | Angola | NA |

Entity Relationship Diagram (ERD)

Descriptive statistical analyses were also performed on some of the variables to better understand their distributions. These variables include:

  1. Age
  2. Height
  3. Weight


| Descriptive statistics | Age | Height | Weight |
|---|---|---|---|
| Mean | 24.54 | 177.28 | 71.61 |
| Median | 24 | 178 | 72 |
| Standard deviation | 4.99 | 9.81 | 11.74 |
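
The summary statistics above can be reproduced along the following lines. This is a minimal sketch in base R: the `athletes` data frame and the `describe` helper are hypothetical stand-ins, and the values are illustrative rather than taken from the project data.

```r
# Hypothetical sample of athlete records; the values are illustrative only.
athletes <- data.frame(
  Age    = c(24, 23, 24, 34, 21, NA),
  Height = c(180, 170, NA, NA, 185, 185),
  Weight = c(80, 60, NA, NA, 82, 82)
)

# Summarise one numeric column, ignoring missing values.
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# One column of results per variable, rounded to two decimals.
stats <- round(sapply(athletes, describe), 2)
print(stats)
```

Passing `na.rm = TRUE` matters here because, without it, a single NA in a column makes the mean, median and standard deviation NA as well.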


Data after imputing NA values



Comparing Descriptive and Population stats


The project analysis concluded by answering some of the questions.

  1. Which country has the most Olympic medals?
  2. Creating new metrics
  3. Correlation and relationships between variables

The United States has participated the most and has won the most medals



New metric

Country_appearance


| Team | Country_appearance |
|---|---|
| United States | 17847 |
| France | 11988 |
| Great Britain | 11404 |
| Italy | 10260 |
| Germany | 9326 |
| Canada | 9279 |
| Japan | 8289 |
| Sweden | 8052 |
| Australia | 7513 |
| Hungary | 6547 |
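
The slides do not state how Country_appearance was computed. The sketch below assumes it is the number of rows recorded for each Team, and uses a small hypothetical `olymp` data frame in place of the project data.

```r
# Hypothetical rows standing in for the olymp table (one row per athlete-event).
olymp <- data.frame(
  Team = c("China", "China", "Denmark", "Denmark/Sweden",
           "Netherlands", "Netherlands"),
  stringsAsFactors = FALSE
)

# Count rows per team, then sort from most to fewest appearances.
counts <- sort(table(olymp$Team), decreasing = TRUE)

# Reshape into a two-column data frame with the same column names as the metric.
country_appearance <- data.frame(
  Team = names(counts),
  Country_appearance = as.integer(counts),
  stringsAsFactors = FALSE
)
print(country_appearance)
```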

New metric

Medal_views


| Medal | Medal_views |
|---|---|
| Gold | 13372 |
| Bronze | 13295 |
| Silver | 13116 |
| NA | 0 |

Technical challenges

  • RStudio was used for all the analyses
  • Several R packages were used, including readr, ggplot2 and RSQLite
  • The RSQLite package did not return any output when used to compute the SUM of NA values
  • A non-R tool, DB Browser for SQLite, was used instead to make this computation
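
As an alternative to computing the count in SQL, missing values can be counted in base R once a table has been loaded into a data frame. This is a sketch of a different technique from the one used in the project; the `olymp` data frame below is a hypothetical stand-in with illustrative values.

```r
# Hypothetical extract of the olymp table; the values are illustrative only.
olymp <- data.frame(
  Age    = c(24, 23, 24, 34, 21),
  Height = c(180, 170, NA, NA, 185),
  Weight = c(80, 60, NA, NA, 82),
  Medal  = c(NA, NA, NA, "Gold", NA),
  stringsAsFactors = FALSE
)

# is.na() marks each missing cell; colSums() totals them per column.
na_counts <- colSums(is.na(olymp))
print(na_counts)
```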

Visualization: Height distribution

Visualization: Weight distribution

Visualization: Wordcloud for Teams

Correlation

  • Measures the strength of the linear association between two variables
  • Measured by the correlation coefficient, denoted R
  • Properties include:
    1. magnitude, which indicates the strength of the association
    2. sign, which indicates whether the association is positive or negative
    3. R always lies between -1 and +1
    4. R = 0 indicates no linear relationship
  • Correlation does not imply causation
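
The properties above can be checked with a short sketch in base R. The `height` and `weight` vectors are made-up illustrative values, not project data.

```r
# Illustrative values only; not taken from the Olympic data set.
height <- c(170, 175, 180, 185, 190)
weight <- c(60, 68, 72, 80, 85)

# Pearson's R from its definition: covariance divided by the
# product of the two standard deviations.
r_manual <- cov(height, weight) / (sd(height) * sd(weight))

# The built-in cor() computes the same quantity (Pearson by default).
r_builtin <- cor(height, weight)

# Negating one variable flips the sign of R but keeps its magnitude.
r_flipped <- cor(height, -weight)

print(c(manual = r_manual, builtin = r_builtin, flipped = r_flipped))
```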

Relationship

  • Two variables that show some connection are associated (dependent)
  • Two variables that show no connection are independent
  • An association between variables can be positive or negative

Regression

  • Is there a correlation or relationship between the variables Height and Weight?
  • A simple linear model is used for the regression
  • Either variable can serve as the response, with the other as the explanatory variable
  • The regression line passes through the center of the data, the point of means of the two variables
  • The intercept is where the line crosses the y-axis
  • Equivalently, the intercept is the predicted value of the response variable when the explanatory variable is zero
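
A minimal sketch of such a model in base R follows, with Height as the response and Weight as the explanatory variable. The data are simulated under assumed parameters (seed, means and slope are arbitrary choices), so the fitted coefficients are not the project's results.

```r
# Simulated data; the parameters below are arbitrary assumptions.
set.seed(1)
weight <- rnorm(100, mean = 72, sd = 12)
height <- 140 + 0.5 * weight + rnorm(100, sd = 5)

# Simple linear regression: height is the response, weight the predictor.
fit <- lm(height ~ weight)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["weight"]]

# The fitted line evaluated at the mean weight gives the mean height,
# i.e. the line passes through the point of means.
pred_at_mean <- intercept + slope * mean(weight)

# The intercept is the prediction when the explanatory variable is zero.
pred_at_zero <- unname(predict(fit, newdata = data.frame(weight = 0)))

print(c(intercept = intercept, slope = slope))
```

In this setting the intercept is an extrapolation, since a weight of zero lies far outside the observed range.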

Linear modeling and visualization

Scatterplot of height and weight


Hypotheses

  • Null hypothesis: there is no correlation between the two variables
  • Alternative hypothesis: there is a correlation between the two variables
  • The p-value (< 2.2e-16) is far below the 0.05 significance level
  • The null hypothesis is therefore rejected in favour of the alternative hypothesis
  • There is also an association between Team and Medal
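
The decision rule above can be sketched with `cor.test()` from base R's stats package. The data are simulated under assumed parameters, so the resulting p-value is illustrative and is not the project's result.

```r
# Simulated data; the parameters below are arbitrary assumptions.
set.seed(42)
weight <- rnorm(200, mean = 72, sd = 12)
height <- 140 + 0.5 * weight + rnorm(200, sd = 5)

# Test the null hypothesis that the true correlation is zero
# (Pearson's method is the default).
test <- cor.test(height, weight)

# Reject the null hypothesis when the p-value is below the significance level.
alpha <- 0.05
reject_null <- test$p.value < alpha

print(test$estimate)
print(reject_null)
```

When a p-value is smaller than machine precision, R prints it as `< 2.2e-16`, which is the form reported in the bullet list above.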

Recommendations

  • There is a strong, positive, linear correlation between height and weight
  • The explanatory variable Weight is a significant predictor of the response variable Height
  • The 120 years of Olympic data were collected through observational studies
  • Observational studies lack the principles of experimental design
  • Thus, the findings cannot be used to infer causality
  • Observational studies can only establish correlation