## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

0.1 Problem 1

Reading Data

0.1.1 1

Exploring Dataset “PKDataset1”

## # A tibble: 6 x 14
##       P name  age   gender raceethnicity month   day  year streetaddress
##   <dbl> <chr> <chr> <chr>  <chr>         <chr> <dbl> <dbl> <chr>        
## 1     2 Matt… 22    Male   Black         Janu…     1  2015 1050 Carl Gr…
## 2     4 Lewi… 47    Male   White         Janu…     2  2015 4505 SW Mast…
## 3     7 Tim … 53    Male   Asian/Pacifi… Janu…     2  2015 600 E Island…
## 4     5 Mich… 19    Male   White         Janu…     3  2015 2600 Kaumual…
## 5     6 John… 23    Male   Hispanic/Lat… Janu…     3  2015 500 North Ol…
## 6     8 Matt… 32    Male   White         Janu…     4  2015 630 Valencia…
## # ... with 5 more variables: city <chr>, state <chr>,
## #   classification <chr>, lawenforcementagency <chr>, armed <chr>

The output above shows that the dataset has 14 attributes, along with the data type of each. The first column, P, appears to be a record identifier. My understanding of the remaining attributes is as follows:

  • name: name of the victim
  • age: age of the victim
  • gender: gender of the victim
  • raceethnicity: race/ethnicity of the victim
  • month: month of the killing
  • day: day of the killing
  • year: year of the killing
  • streetaddress: street address of the killing
  • city: city where the incident happened
  • state: state where the incident happened
  • classification: how the killing occurred (e.g. Gunshot, Death in custody)
  • lawenforcementagency: law enforcement agency involved
  • armed: whether the victim was armed, and with what (e.g. No, Firearm)
## [1] 1145   14

There are 1145 observations in the dataset and 14 attributes.

## Observations: 1,145
## Variables: 14
## $ P                    <dbl> 2, 4, 7, 5, 6, 8, 91, 9, 10, 1010, 11, 12...
## $ name                 <chr> "Matthew Ajibade", "Lewis Lembke", "Tim E...
## $ age                  <chr> "22", "47", "53", "19", "23", "32", "18",...
## $ gender               <chr> "Male", "Male", "Male", "Male", "Male", "...
## $ raceethnicity        <chr> "Black", "White", "Asian/Pacific Islander...
## $ month                <chr> "January", "January", "January", "January...
## $ day                  <dbl> 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6,...
## $ year                 <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
## $ streetaddress        <chr> "1050 Carl Griffin Dr", "4505 SW Masters ...
## $ city                 <chr> "Savannah", "Aloha", "Shelton", "Kaumakan...
## $ state                <chr> "GA", "OR", "WA", "HI", "KS", "CA", "OK",...
## $ classification       <chr> "Death in custody", "Gunshot", "Gunshot",...
## $ lawenforcementagency <chr> "Chatham County Sheriff's Office", "Washi...
## $ armed                <chr> "No", "Firearm", "Firearm", "No", "No", "...
##        P              name               age               gender         
##  Min.   :   2.0   Length:1145        Length:1145        Length:1145       
##  1st Qu.: 291.0   Class :character   Class :character   Class :character  
##  Median : 583.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 584.8                                                           
##  3rd Qu.: 879.0                                                           
##  Max.   :1169.0                                                           
##  raceethnicity         month                day             year     
##  Length:1145        Length:1145        Min.   : 1.00   Min.   :2015  
##  Class :character   Class :character   1st Qu.: 8.00   1st Qu.:2015  
##  Mode  :character   Mode  :character   Median :15.00   Median :2015  
##                                        Mean   :15.64   Mean   :2015  
##                                        3rd Qu.:23.00   3rd Qu.:2015  
##                                        Max.   :31.00   Max.   :2015  
##  streetaddress          city              state          
##  Length:1145        Length:1145        Length:1145       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  classification     lawenforcementagency    armed          
##  Length:1145        Length:1145          Length:1145       
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
## 

The output above shows the summary statistics for the data. For each numerical variable (P, day and year) we have the minimum, 1st quartile, median, mean, 3rd quartile and maximum; character variables report only their length and class.
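
Note that age is read as a character column in PKDataset1, so summary() reports no numeric statistics for it. The following is a minimal sketch of converting it to numeric. It assumes the column contains non-numeric placeholders such as "Unknown" (an assumption; the toy vector below is illustrative and not taken from the data).

```r
# Toy vector standing in for PKDataset1$age (hypothetical values).
age_chr <- c("22", "47", "Unknown", "19")

# as.numeric() turns non-numeric strings into NA; the coercion warning is suppressed.
age_num <- suppressWarnings(as.numeric(age_chr))

# summary() now reports min, quartiles, mean, max and the number of NAs.
summary(age_num)
```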

Exploring Dataset “PKDataset2”

## # A tibble: 6 x 14
##      id name  date                manner_of_death armed   age gender race 
##   <dbl> <chr> <dttm>              <chr>           <chr> <dbl> <chr>  <chr>
## 1     3 Tim … 2015-01-02 00:00:00 shot            gun      53 M      A    
## 2     4 Lewi… 2015-01-02 00:00:00 shot            gun      47 M      W    
## 3     5 John… 2015-01-03 00:00:00 shot and Taser… unar…    23 M      H    
## 4     8 Matt… 2015-01-04 00:00:00 shot            toy …    32 M      W    
## 5     9 Mich… 2015-01-04 00:00:00 shot            nail…    39 M      H    
## 6    11 Kenn… 2015-01-04 00:00:00 shot            gun      18 M      W    
## # ... with 6 more variables: city <chr>, state <chr>,
## #   signs_of_mental_illness <lgl>, threat_level <chr>, flee <chr>,
## #   body_camera <lgl>

The output above shows that the dataset has 14 attributes, along with the data type of each. The first column, id, appears to be a record identifier. My understanding of the remaining attributes is as follows:

  • name: name of the victim
  • date: date of the killing
  • manner_of_death: how the victim was killed (e.g. shot, shot and Tasered)
  • armed: whether the victim was armed, and with what (e.g. gun, unarmed, toy weapon)
  • age: age of the victim
  • gender: gender of the victim
  • race: race of the victim
  • city: city where the incident happened
  • state: state where the incident happened
  • signs_of_mental_illness: whether the victim showed signs of mental illness
  • threat_level: threat level posed by the victim (e.g. attack, other)
  • flee: whether the victim was trying to flee
  • body_camera: whether a body camera was in use during the incident (presumably worn by an officer)
## [1] 1312   14

There are 1312 observations in the dataset and 14 attributes.

## Observations: 1,312
## Variables: 14
## $ id                      <dbl> 3, 4, 5, 8, 9, 11, 13, 15, 16, 17, 19,...
## $ name                    <chr> "Tim Elliot", "Lewis Lee Lembke", "Joh...
## $ date                    <dttm> 2015-01-02, 2015-01-02, 2015-01-03, 2...
## $ manner_of_death         <chr> "shot", "shot", "shot and Tasered", "s...
## $ armed                   <chr> "gun", "gun", "unarmed", "toy weapon",...
## $ age                     <dbl> 53, 47, 23, 32, 39, 18, 22, 35, 34, 47...
## $ gender                  <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ race                    <chr> "A", "W", "H", "W", "H", "W", "H", "W"...
## $ city                    <chr> "Shelton", "Aloha", "Wichita", "San Fr...
## $ state                   <chr> "WA", "OR", "KS", "CA", "CO", "OK", "A...
## $ signs_of_mental_illness <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE...
## $ threat_level            <chr> "attack", "attack", "other", "attack",...
## $ flee                    <chr> "Not fleeing", "Not fleeing", "Not fle...
## $ body_camera             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
##        id             name                date                    
##  Min.   :   3.0   Length:1312        Min.   :2015-01-02 00:00:00  
##  1st Qu.: 436.8   Class :character   1st Qu.:2015-04-30 00:00:00  
##  Median : 791.5   Mode  :character   Median :2015-08-28 00:00:00  
##  Mean   : 784.4                      Mean   :2015-08-31 10:08:02  
##  3rd Qu.:1140.2                      3rd Qu.:2015-12-27 06:00:00  
##  Max.   :1501.0                      Max.   :2016-04-28 00:00:00  
##                                                                   
##  manner_of_death       armed                age           gender         
##  Length:1312        Length:1312        Min.   : 6.00   Length:1312       
##  Class :character   Class :character   1st Qu.:27.00   Class :character  
##  Mode  :character   Mode  :character   Median :34.00   Mode  :character  
##                                        Mean   :36.47                     
##                                        3rd Qu.:45.00                     
##                                        Max.   :86.00                     
##                                        NA's   :24                        
##      race               city              state          
##  Length:1312        Length:1312        Length:1312       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  signs_of_mental_illness threat_level           flee          
##  Mode :logical           Length:1312        Length:1312       
##  FALSE:977               Class :character   Class :character  
##  TRUE :335               Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  body_camera    
##  Mode :logical  
##  FALSE:1207     
##  TRUE :105      
##                 
##                 
##                 
## 

The output above shows the summary statistics for the data. For each numerical variable we have the minimum, 1st quartile, median, mean, 3rd quartile and maximum; age also has 24 missing values, and the logical variables are summarised as TRUE/FALSE counts.

Contrasting features of the two datasets

Both datasets have many features in common, but PKDataset1 offers the following that PKDataset2 does not:

  • Street Address of the killing
  • Law enforcement agency involved

These are the things that PKDataset2 provides over PKDataset1:

  • Whether the victim showed signs of mental illness
  • Whether the victim was trying to flee
  • Whether a body camera was in use

The other attributes are largely shared, although their names differ between the two datasets.

0.1.2 2

Checking for missing values

PKDataset1

## # A tibble: 6 x 14
##       P name  age   gender raceethnicity month   day  year streetaddress
##   <dbl> <chr> <chr> <chr>  <chr>         <chr> <dbl> <dbl> <chr>        
## 1    91 Kenn… 18    Male   White         Janu…     4  2015 <NA>         
## 2   276 Mya … 27    Female Black         March    30  2015 <NA>         
## 3   339 Sant… 24    Male   Hispanic/Lat… April    20  2015 <NA>         
## 4   360 Bill… 29    Male   White         April    26  2015 <NA>         
## 5   940 Dari… 30    Male   White         Octo…    20  2015 <NA>         
## 6  1154 Shun… 64    Male   Unknown       Dece…     3  2015 <NA>         
## # ... with 5 more variables: city <chr>, state <chr>,
## #   classification <chr>, lawenforcementagency <chr>, armed <chr>
## [1] 14

There are 14 rows with missing values in the dataset. I simply omit these rows and create a new PKDataset1 without them.
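
A minimal sketch of this step in base R, using a small toy data frame in place of PKDataset1 (the values and the names pk1 and pk1_clean are hypothetical). It illustrates one common way to list and drop incomplete rows, not necessarily the exact code used here.

```r
# Toy data frame standing in for PKDataset1 (hypothetical values).
pk1 <- data.frame(
  name          = c("A", "B", "C"),
  streetaddress = c("1 Main St", NA, "3 Oak Ave"),
  stringsAsFactors = FALSE
)

# Rows that contain at least one missing value.
incomplete <- pk1[!complete.cases(pk1), ]
nrow(incomplete)

# Drop those rows; na.omit() keeps only the complete cases.
pk1_clean <- na.omit(pk1)
```

Dropping rows is the simplest option; it discards the whole record even when only one field, such as the street address, is missing.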

PKDataset2

## # A tibble: 94 x 14
##       id name  date                manner_of_death armed   age gender race 
##    <dbl> <chr> <dttm>              <chr>           <chr> <dbl> <chr>  <chr>
##  1   110 Will… 2015-01-25 00:00:00 shot            gun      59 M      <NA> 
##  2   584 Alej… 2015-02-20 00:00:00 shot            gun      NA M      H    
##  3   244 John… 2015-03-30 00:00:00 shot            gun      54 M      <NA> 
##  4   534 Mark… 2015-04-09 00:00:00 shot and Taser… vehi…    54 M      <NA> 
##  5   433 Jose… 2015-05-07 00:00:00 shot            knife    72 M      <NA> 
##  6   503 Jame… 2015-05-31 00:00:00 shot            gun      40 M      <NA> 
##  7   523 Jame… 2015-06-08 00:00:00 shot            gun      54 M      <NA> 
##  8   542 Raym… 2015-06-11 00:00:00 shot            gun      86 M      <NA> 
##  9   604 Bria… 2015-07-02 00:00:00 shot            cros…    59 M      <NA> 
## 10   641 Char… 2015-07-14 00:00:00 shot            gun      76 M      <NA> 
## # ... with 84 more rows, and 6 more variables: city <chr>, state <chr>,
## #   signs_of_mental_illness <lgl>, threat_level <chr>, flee <chr>,
## #   body_camera <lgl>

What other data could you imagine would be valuable to consolidate the existing data?

  • Population at city and state granularity would be valuable for building an index of killings relative to population, giving a better understanding of police killings.

  • Latitude-longitude data would make it possible to create geospatial graphs or charts, which would be a good tool for understanding the geographic pattern of police killings.
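
The population-relative index suggested above could be computed along these lines. This is only a sketch: the population figures are made up for illustration, and the names incidents, population and rates are hypothetical.

```r
# Hypothetical toy inputs: one row per incident, plus a state population table.
incidents  <- data.frame(state = c("CA", "CA", "TX", "OR"),
                         stringsAsFactors = FALSE)
population <- data.frame(state = c("CA", "TX", "OR"),
                         pop   = c(4e6, 2e6, 1e6),   # made-up figures
                         stringsAsFactors = FALSE)

# Count incidents per state, then join the counts to the population table.
counts <- as.data.frame(table(state = incidents$state),
                        responseName = "killings",
                        stringsAsFactors = FALSE)
rates <- merge(counts, population, by = "state")

# Index: killings per million residents, highest first.
rates$per_million <- rates$killings / (rates$pop / 1e6)
rates[order(-rates$per_million), ]
```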

0.1.3 3

0.1.3.1 Visual Exploration of “PKDataset1”

Victims’ age vs. count

The graph above shows the number of killings by victim age group. Most victims are between 20 and 35 years old, and the bars become smaller as age increases.

Count v/s Age and Gender of Shooting Victims

This is the same distribution as above, now drawn as stacked bars, with each bar showing the share of the three genders in the dataset. The Non-Conforming category is almost negligible, and there are far fewer female victims than male victims. Most female victims are aged 35-40.

Top Cities Showing Age of Shooting Victims

This graph shows the top cities by number of victims, with the age distribution for each city. Los Angeles has the highest number of victims, followed by Houston.

US Police Shootings By Year

Since we have data for only one year, this single bar represents the total number of shootings that year.

Manner of Death

This graph shows how the victims were killed. Most victims were clearly killed by gunshot, followed by Taser.

What Month and Manner of Death

The graph above shows in which months the most deaths were reported, with the manner of death broken down for each month. The highest number of killings happened in July, followed by March.
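
Because month is stored as text in PKDataset1, counts and bars are ordered alphabetically unless the calendar order is set explicitly. A minimal sketch of one way to do that in base R, using a toy vector (hypothetical values) in place of the real column:

```r
# Toy vector standing in for PKDataset1$month (hypothetical values).
month_chr <- c("March", "January", "July", "March", "July", "July")

# month.name is base R's built-in vector "January", ..., "December";
# using it as the factor levels fixes the calendar order.
month_fct <- factor(month_chr, levels = month.name)

# Counts per month in calendar order (months without rows show 0).
month_counts <- table(month_fct)

# Month with the most incidents.
names(which.max(month_counts))
```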

Multiple Graphs

Two graphs: the first shows the gender distribution of the victims. The second shows whether the victim was armed, with the gender distribution for each armed category.

0.1.3.2 Visual Exploration of “PKDataset2”

Our date variable comes in two formats, so before we can use it we need to parse it into a single format. We do this with the parse_date_time() function from the lubridate package. Afterwards, to preserve the original date variable, we create a new “Date” column and give it a proper date type using the base R as.Date() function. We can then extract the month, day and year using the lubridate package.
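
The steps above can be sketched as follows. The two formats in the toy vector are an assumption, since the actual formats in PKDataset2 are not shown, and the variable names are hypothetical.

```r
library(lubridate)

# Toy vector mixing two date formats (assumed formats, hypothetical values).
raw_date <- c("2015-01-02", "01/03/2015")

# parse_date_time() tries the given orders and returns POSIXct values;
# capital Y requires a four-digit year, which keeps the two orders unambiguous.
parsed <- parse_date_time(raw_date, orders = c("Ymd", "mdY"))

# Keep a plain Date column, then extract the components.
date_only  <- as.Date(parsed)
date_month <- month(date_only)
date_day   <- day(date_only)
date_year  <- year(date_only)
```

If the real column uses other formats, only the orders argument would need to change.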

Distribution of Victims according to age

The graph shows the distribution of victims according to age.

Age and Gender of Shooting Victims

The graph shows the distribution of victims by age, with the gender split within each age category.

Top Cities Showing Age of Shooting Victims

Distribution of victims across the top cities, with the age distribution marked in a different color for each city. Los Angeles tops the list, followed by San Francisco, for this dataset.

US Police Shootings By Year

The graph shows US police shootings for each year.

Manner of Death

The graph shows the distribution of manner of death for each year.

What Month and Manner of Death

The graph shows the distribution of incidents over the months, broken down by manner of death.

Multiple Graphs

The first graph shows the distribution of victims by gender. The second graph shows whether the victim was armed and, if so, with what weapon; each category is split into males and females.

0.2 Problem 2

Reading Data

0.2.1 1

Exploring Dataset “OMDataset1”

## # A tibble: 6 x 10
##   Games Sport Event `Athlete(s)` CountryCode CountryName Medal Result Unit 
##   <dbl> <chr> <chr> <chr>        <chr>       <chr>       <chr> <chr>  <chr>
## 1  2012 Athl… 1000… Mo FARAH     GBR         Great Brit… Gold  1.910… M:SS…
## 2  2012 Athl… 1000… Galen RUPP   USA         United Sta… Silv… 1.910… M:SS…
## 3  2012 Athl… 1000… Tariku BEKE… ETH         Ethiopia    Bron… 1.911… M:SS…
## 4  2012 Athl… 1000… Tirunesh DI… ETH         Ethiopia    Gold  2.107… M:SS…
## 5  2012 Athl… 1000… Sally Jepko… KEN         Kenya       Silv… 2.113… M:SS…
## 6  2012 Athl… 1000… Vivian CHER… KEN         Kenya       Bron… 2.118… M:SS…
## # ... with 1 more variable: ResultInSeconds <dbl>

The output above shows that the dataset has 10 attributes, along with the data type of each. My understanding of the attributes is as follows:

  • Games: year in which the games were held
  • Sport: kind of sport
  • Event: kind of event
  • Athlete(s): name of the athlete(s)
  • CountryCode: code of the country the athlete represented
  • CountryName: name of the country the athlete represented
  • Medal: which medal was won
  • Result: the athlete’s result (time) in the event
  • Unit: unit in which the result was recorded
  • ResultInSeconds: result in seconds
## [1] 4093   10

Dataset has 4093 observations and 10 attributes.

##      Games         Sport              Event            Athlete(s)       
##  Min.   :1896   Length:4093        Length:4093        Length:4093       
##  1st Qu.:1956   Class :character   Class :character   Class :character  
##  Median :1980   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1974                                                           
##  3rd Qu.:2000                                                           
##  Max.   :2012                                                           
##                                                                         
##  CountryCode        CountryName           Medal          
##  Length:4093        Length:4093        Length:4093       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     Result              Unit           ResultInSeconds   
##  Length:4093        Length:4093        Min.   :    9.63  
##  Class :character   Class :character   1st Qu.:   60.88  
##  Mode  :character   Mode  :character   Median :  180.70  
##                                        Mean   :  755.05  
##                                        3rd Qu.:  382.07  
##                                        Max.   :17946.00  
##                                        NA's   :23
## Observations: 4,093
## Variables: 10
## $ Games           <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012...
## $ Sport           <chr> "Athletics", "Athletics", "Athletics", "Athlet...
## $ Event           <chr> "10000m Men", "10000m Men", "10000m Men", "100...
## $ `Athlete(s)`    <chr> "Mo FARAH", "Galen RUPP", "Tariku BEKELE", "Ti...
## $ CountryCode     <chr> "GBR", "USA", "ETH", "ETH", "KEN", "KEN", "AUS...
## $ CountryName     <chr> "Great Britain", "United States of America", "...
## $ Medal           <chr> "Gold", "Silver", "Bronze", "Gold", "Silver", ...
## $ Result          <chr> "1.9102083333333335E-2", "1.9107638888888889E-...
## $ Unit            <chr> "M:SS:DD", "M:SS:DD", "M:SS:DD", "M:SS:DD", "M...
## $ ResultInSeconds <dbl> 1650.42, 1650.90, 1651.43, 1820.75, 1826.37, 1...
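
Comparing Result with ResultInSeconds in the output above suggests that Result holds the time as a fraction of a 24-hour day. This is an inference from the printed values, not something the data documents. A minimal check on the first row, under that assumption:

```r
# First Result value from the output above, stored as text.
result_chr <- "1.9102083333333335E-2"

# Assumption: the value is a fraction of a day, so multiply by 86400 seconds.
result_sec <- as.numeric(result_chr) * 24 * 60 * 60

# Rounded to two decimals this gives 1650.42, matching ResultInSeconds for this row.
round(result_sec, 2)
```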

Exploring Dataset “OMDataset2”

## # A tibble: 6 x 10
##   `List of medallis… X__1  X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 
##   <chr>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA>               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 DISCLAIMER: The I… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 3 <NA>               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 4 City               Edit… Sport Disc… Athl… NOC   Gend… Event Even… Medal
## 5 Antwerp            1920  Aqua… Divi… PRIE… USA   Men   10m … M     Bron…
## 6 Antwerp            1920  Aqua… Divi… PINK… USA   Men   10m … M     Gold

The output above shows that the dataset has 10 attributes, all read as character; the real column names sit in the fourth row. My understanding of the attributes is as follows:

  • City: city where the games were hosted
  • Edition: year in which the games were held
  • Sport: kind of sport
  • Discipline: kind of discipline
  • Athlete: name of the athlete
  • NOC: National Olympic Committee code of the athlete’s country (e.g. USA, SWE)
  • Gender: gender of the athlete
  • Event: kind of event
  • Event_gender: gender category of the event (e.g. M, W)
  • Medal: medal won
## [1] 26398    10

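The raw import has 26398 rows. Once the three leading non-data rows and the embedded header row are excluded, the dataset has 26394 observations and 10 attributes.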

##  List of medallists at the Games of the Olympiad per edition, sport, discipline, gender and event
##  Length:26398                                                                                    
##  Class :character                                                                                
##  Mode  :character                                                                                
##      X__1               X__2               X__3          
##  Length:26398       Length:26398       Length:26398      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      X__4               X__5               X__6          
##  Length:26398       Length:26398       Length:26398      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      X__7               X__8               X__9          
##  Length:26398       Length:26398       Length:26398      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

From above we can see the summary of each attribute.

## Observations: 26,398
## Variables: 10
## $ `List of medallists at the Games of the Olympiad per edition, sport, discipline, gender and event` <chr> ...
## $ X__1                                                                                               <chr> ...
## $ X__2                                                                                               <chr> ...
## $ X__3                                                                                               <chr> ...
## $ X__4                                                                                               <chr> ...
## $ X__5                                                                                               <chr> ...
## $ X__6                                                                                               <chr> ...
## $ X__7                                                                                               <chr> ...
## $ X__8                                                                                               <chr> ...
## $ X__9                                                                                               <chr> ...

The above table shows the attributes and their data types.

Contrasting features of the datasets

The first dataset focuses more on athlete performance, with time and related features as attributes. The second dataset has features such as host city and athlete gender, which are missing from the first dataset.

0.2.2 2

OMDataset1

Checking missing values and creating dataframe by removing missing values

## # A tibble: 23 x 10
##    Games Sport Event `Athlete(s)` CountryCode CountryName Medal Result
##    <dbl> <chr> <chr> <chr>        <chr>       <chr>       <chr> <chr> 
##  1  2012 Athl… 4x10… DISQUALIFIE… <NA>        <NA>        Silv… <NA>  
##  2  1928 Rowi… Doub… Viktor Fles… AUT         Austria     Bron… No re…
##  3  1928 Rowi… Eigh… Jack Hand, … CAN         Canada      Bron… No re…
##  4  1924 Rowi… Coxl… Hans Walter… SUI         Switzerland Bron… No re…
##  5  1924 Rowi… Doub… Heini Thoma… SUI         Switzerland Bron… No re…
##  6  1924 Rowi… Eigh… Pietro Ivan… KIT         Italy       Bron… No re…
##  7  1920 Rowi… Eigh… Tollef Toll… NOR         Norway      Bron… No re…
##  8  1920 Rowi… Sing… Darcy Hadfi… NZL         New Zealand Bron… No re…
##  9  1912 Rowi… Eigh… Rudolf Reic… DEU         Germany     Bron… No re…
## 10  1912 Rowi… Sing… Hugo Kusick  EST         Estonia     Bron… No re…
## # ... with 13 more rows, and 2 more variables: Unit <chr>,
## #   ResultInSeconds <dbl>

There are 23 rows with missing values.

The new OMDataset1 has no missing values now.

OMDataset2

## # A tibble: 6 x 10
##   `List of medallis… X__1  X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 
##   <chr>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA>               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 DISCLAIMER: The I… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 3 <NA>               <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 4 City               Edit… Sport Disc… Athl… NOC   Gend… Event Even… Medal
## 5 Antwerp            1920  Aqua… Divi… PRIE… USA   Men   10m … M     Bron…
## 6 Antwerp            1920  Aqua… Divi… PINK… USA   Men   10m … M     Gold

As the output above shows, the fourth row is actually the header of the data frame, and the first three rows hold non-data content (blank rows and a disclaimer). We have to fix this before we can start the analysis. Let’s do it:

## # A tibble: 6 x 10
##   City  Edition Sport Discipline Athlete NOC   Gender Event Event_gender
##   <chr> <chr>   <chr> <chr>      <chr>   <chr> <chr>  <chr> <chr>       
## 1 DISC… <NA>    <NA>  <NA>       <NA>    <NA>  <NA>   <NA>  <NA>        
## 2 <NA>  <NA>    <NA>  <NA>       <NA>    <NA>  <NA>   <NA>  <NA>        
## 3 City  Edition Sport Discipline Athlete NOC   Gender Event Event_gender
## 4 Antw… 1920    Aqua… Diving     PRIEST… USA   Men    10m … M           
## 5 Antw… 1920    Aqua… Diving     PINKST… USA   Men    10m … M           
## 6 Antw… 1920    Aqua… Diving     ADLERZ… SWE   Men    10m … M           
## # ... with 1 more variable: Medal <chr>

The header has been adjusted. Let’s now remove the leftover non-data rows before we get started:

## # A tibble: 6 x 10
##   City  Edition Sport Discipline Athlete NOC   Gender Event Event_gender
##   <chr> <chr>   <chr> <chr>      <chr>   <chr> <chr>  <chr> <chr>       
## 1 Antw… 1920    Aqua… Diving     PRIEST… USA   Men    10m … M           
## 2 Antw… 1920    Aqua… Diving     PINKST… USA   Men    10m … M           
## 3 Antw… 1920    Aqua… Diving     ADLERZ… SWE   Men    10m … M           
## 4 Antw… 1920    Aqua… Diving     OLLIVI… SWE   Women  10m … W           
## 5 Antw… 1920    Aqua… Diving     FRYLAN… DEN   Women  10m … W           
## 6 Antw… 1920    Aqua… Diving     ARMSTR… GBR   Women  10m … W           
## # ... with 1 more variable: Medal <chr>
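
The two clean-up steps above can be sketched in base R as follows. The toy data frame only mimics the raw layout (junk rows, then a header row, then data); the names raw, header_row and om2 are hypothetical, and this is one possible approach rather than the exact code used here.

```r
# Toy data frame mimicking the raw layout (hypothetical, shortened to two columns).
raw <- data.frame(
  X1 = c(NA, "DISCLAIMER (text elided)", NA, "City", "Antwerp", "Antwerp"),
  X2 = c(NA, NA, NA, "Edition", "1920", "1920"),
  stringsAsFactors = FALSE
)

# Row that holds the real column names.
header_row <- 4
names(raw) <- as.character(unlist(raw[header_row, ]))

# Drop the junk rows and the header row itself, then reset the row names.
om2 <- raw[-seq_len(header_row), ]
rownames(om2) <- NULL
```

Because every column was read as character, a column such as Edition would still need as.integer() before it can be used numerically.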

Now the data frame looks clean and ready for analysis. Before moving ahead, let’s check for missing values and remove any from the data frame.

## # A tibble: 0 x 10
## # ... with 10 variables: City <chr>, Edition <chr>, Sport <chr>,
## #   Discipline <chr>, Athlete <chr>, NOC <chr>, Gender <chr>, Event <chr>,
## #   Event_gender <chr>, Medal <chr>

There are no missing values in the data frame. Let’s move on to the analysis part.

0.2.3 3

0.2.3.1 Visual Exploration of “OMDataset1”

The above chart has the year of the games on the x-axis and the number of medals won on the y-axis. It is an interactive chart, so you can hover over it to see the number of medals for each year. The number of medals has generally increased over time. The pie chart gives the distribution of gold, silver and bronze medals.

The above chart shows the top 10 athletes across the Olympic Games held from 1896 to 2012.

The most medals were won in Athletics, followed by Swimming.

0.2.3.2 Visual Exploration of “OMDataset2”

Medals won in different years, and the distribution of total medals over the years. The number dipped for a short period and has increased continuously since.

The above graph shows medals won in different host cities, with the distribution of medals by city in the pie chart. Los Angeles has the highest number of medals won according to this dataset. According to the pie chart, Los Angeles has hosted the Olympics twice.

Medals won by country, with the distribution of medals won by different countries in the pie chart. The USA has won the most medals in this dataset. The order is as follows: Australia, Denmark, Finland, France …

Top 10 athletes for the Olympic Games from 1920 to 2008. Larisa Latynina tops the list, followed by Michael Phelps.
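
A ranking like this can be derived by counting one row per medal for each athlete. A minimal base R sketch, with a toy vector of placeholder names standing in for the Athlete column (the names and the variable top_n are hypothetical):

```r
# Toy vector standing in for the Athlete column, one entry per medal.
athlete <- c("A", "B", "A", "C", "A", "B")

# table() counts medals per athlete; sort in decreasing order and keep the top n.
top_n <- 2
top_athletes <- head(sort(table(athlete), decreasing = TRUE), top_n)
top_athletes
```

For the chart described above, top_n would be 10.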

The graph shows medals won in different sports. Aquatics has the most medals, followed by Athletics.