## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Reading Data
Exploring Dataset “PKDataset1”
## # A tibble: 6 x 14
## P name age gender raceethnicity month day year streetaddress
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 2 Matt… 22 Male Black Janu… 1 2015 1050 Carl Gr…
## 2 4 Lewi… 47 Male White Janu… 2 2015 4505 SW Mast…
## 3 7 Tim … 53 Male Asian/Pacifi… Janu… 2 2015 600 E Island…
## 4 5 Mich… 19 Male White Janu… 3 2015 2600 Kaumual…
## 5 6 John… 23 Male Hispanic/Lat… Janu… 3 2015 500 North Ol…
## 6 8 Matt… 32 Male White Janu… 4 2015 630 Valencia…
## # ... with 5 more variables: city <chr>, state <chr>,
## # classification <chr>, lawenforcementagency <chr>, armed <chr>
From the output above, we can see that the dataset has 14 attributes, along with the data type of each one. My understanding of what each attribute represents is as follows:
## [1] 1145 14
There are 1145 observations in the dataset and 14 attributes.
## Observations: 1,145
## Variables: 14
## $ P <dbl> 2, 4, 7, 5, 6, 8, 91, 9, 10, 1010, 11, 12...
## $ name <chr> "Matthew Ajibade", "Lewis Lembke", "Tim E...
## $ age <chr> "22", "47", "53", "19", "23", "32", "18",...
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "...
## $ raceethnicity <chr> "Black", "White", "Asian/Pacific Islander...
## $ month <chr> "January", "January", "January", "January...
## $ day <dbl> 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6,...
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
## $ streetaddress <chr> "1050 Carl Griffin Dr", "4505 SW Masters ...
## $ city <chr> "Savannah", "Aloha", "Shelton", "Kaumakan...
## $ state <chr> "GA", "OR", "WA", "HI", "KS", "CA", "OK",...
## $ classification <chr> "Death in custody", "Gunshot", "Gunshot",...
## $ lawenforcementagency <chr> "Chatham County Sheriff's Office", "Washi...
## $ armed <chr> "No", "Firearm", "Firearm", "No", "No", "...
## P name age gender
## Min. : 2.0 Length:1145 Length:1145 Length:1145
## 1st Qu.: 291.0 Class :character Class :character Class :character
## Median : 583.0 Mode :character Mode :character Mode :character
## Mean : 584.8
## 3rd Qu.: 879.0
## Max. :1169.0
## raceethnicity month day year
## Length:1145 Length:1145 Min. : 1.00 Min. :2015
## Class :character Class :character 1st Qu.: 8.00 1st Qu.:2015
## Mode :character Mode :character Median :15.00 Median :2015
## Mean :15.64 Mean :2015
## 3rd Qu.:23.00 3rd Qu.:2015
## Max. :31.00 Max. :2015
## streetaddress city state
## Length:1145 Length:1145 Length:1145
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## classification lawenforcementagency armed
## Length:1145 Length:1145 Length:1145
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
From the output above we can see the summary statistics for the data: the minimum, maximum, median, mean, and 1st and 3rd quartile values for every numeric variable. Because age is stored as character in this dataset, it is not summarized numerically.
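The exploration steps above can be sketched in base R as follows. This is only an illustration on a three-row toy data frame (`pk_toy` is a hypothetical name; the values are the first rows of the columns shown above), not the code that produced the output in this report.

```r
# Toy stand-in for PKDataset1, using the first three rows shown above.
pk_toy <- data.frame(
  P = c(2, 4, 7),
  age = c("22", "47", "53"),  # stored as character, as in PKDataset1
  day = c(1, 2, 2),
  stringsAsFactors = FALSE
)

head(pk_toy)     # first rows of the data
dim(pk_toy)      # number of rows and columns
str(pk_toy)      # column types, similar in spirit to dplyr::glimpse()
summary(pk_toy)  # quartiles for numeric columns, length/class for character ones

# Converting age to numeric lets summary() report its quartiles as well.
pk_toy$age_num <- as.numeric(pk_toy$age)
summary(pk_toy$age_num)
```

Converting `age` before summarizing is a design choice: it keeps the original character column intact while still giving a numeric view of it.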
Exploring Dataset “PKDataset2”
## # A tibble: 6 x 14
## id name date manner_of_death armed age gender race
## <dbl> <chr> <dttm> <chr> <chr> <dbl> <chr> <chr>
## 1 3 Tim … 2015-01-02 00:00:00 shot gun 53 M A
## 2 4 Lewi… 2015-01-02 00:00:00 shot gun 47 M W
## 3 5 John… 2015-01-03 00:00:00 shot and Taser… unar… 23 M H
## 4 8 Matt… 2015-01-04 00:00:00 shot toy … 32 M W
## 5 9 Mich… 2015-01-04 00:00:00 shot nail… 39 M H
## 6 11 Kenn… 2015-01-04 00:00:00 shot gun 18 M W
## # ... with 6 more variables: city <chr>, state <chr>,
## # signs_of_mental_illness <lgl>, threat_level <chr>, flee <chr>,
## # body_camera <lgl>
From the output above, we can see that the dataset has 14 attributes, along with the data type of each one. My understanding of what each attribute represents is as follows:
## [1] 1312 14
There are 1312 observations in the dataset and 14 attributes.
## Observations: 1,312
## Variables: 14
## $ id <dbl> 3, 4, 5, 8, 9, 11, 13, 15, 16, 17, 19,...
## $ name <chr> "Tim Elliot", "Lewis Lee Lembke", "Joh...
## $ date <dttm> 2015-01-02, 2015-01-02, 2015-01-03, 2...
## $ manner_of_death <chr> "shot", "shot", "shot and Tasered", "s...
## $ armed <chr> "gun", "gun", "unarmed", "toy weapon",...
## $ age <dbl> 53, 47, 23, 32, 39, 18, 22, 35, 34, 47...
## $ gender <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ race <chr> "A", "W", "H", "W", "H", "W", "H", "W"...
## $ city <chr> "Shelton", "Aloha", "Wichita", "San Fr...
## $ state <chr> "WA", "OR", "KS", "CA", "CO", "OK", "A...
## $ signs_of_mental_illness <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE...
## $ threat_level <chr> "attack", "attack", "other", "attack",...
## $ flee <chr> "Not fleeing", "Not fleeing", "Not fle...
## $ body_camera <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## id name date
## Min. : 3.0 Length:1312 Min. :2015-01-02 00:00:00
## 1st Qu.: 436.8 Class :character 1st Qu.:2015-04-30 00:00:00
## Median : 791.5 Mode :character Median :2015-08-28 00:00:00
## Mean : 784.4 Mean :2015-08-31 10:08:02
## 3rd Qu.:1140.2 3rd Qu.:2015-12-27 06:00:00
## Max. :1501.0 Max. :2016-04-28 00:00:00
##
## manner_of_death armed age gender
## Length:1312 Length:1312 Min. : 6.00 Length:1312
## Class :character Class :character 1st Qu.:27.00 Class :character
## Mode :character Mode :character Median :34.00 Mode :character
## Mean :36.47
## 3rd Qu.:45.00
## Max. :86.00
## NA's :24
## race city state
## Length:1312 Length:1312 Length:1312
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## signs_of_mental_illness threat_level flee
## Mode :logical Length:1312 Length:1312
## FALSE:977 Class :character Class :character
## TRUE :335 Mode :character Mode :character
##
##
##
##
## body_camera
## Mode :logical
## FALSE:1207
## TRUE :105
##
##
##
##
From the output above we can see the summary statistics for the data: the minimum, maximum, median, mean, and 1st and 3rd quartile values for every numeric variable. For the logical variables, the counts of TRUE and FALSE are shown, and age has 24 missing values.
Contrasting features of the two datasets
Both datasets have many features in common, but these are the things PKDataset1 offers over PKDataset2:
These are the things that PKDataset2 provides over PKDataset1:
The other attributes are largely shared, although their names differ between the two datasets.
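One way to check which attributes are unique to each dataset is to compare the column names directly. The sketch below uses the column names from the two listings above; the `same_as` mapping between differently named columns is my own assumption about which columns play the same role, not something the data confirms.

```r
# Column names copied from the listings of each dataset above.
cols1 <- c("P", "name", "age", "gender", "raceethnicity", "month", "day",
           "year", "streetaddress", "city", "state", "classification",
           "lawenforcementagency", "armed")
cols2 <- c("id", "name", "date", "manner_of_death", "armed", "age", "gender",
           "race", "city", "state", "signs_of_mental_illness", "threat_level",
           "flee", "body_camera")

# Assumed equivalences (PKDataset1 name = PKDataset2 name); hypothetical mapping.
same_as <- c(P = "id", raceethnicity = "race",
             classification = "manner_of_death",
             month = "date", day = "date", year = "date")

# Translate PKDataset1 names into PKDataset2 names where a match is assumed.
cols1_mapped <- ifelse(cols1 %in% names(same_as), same_as[cols1], cols1)

only_in_1 <- cols1[!(cols1_mapped %in% cols2)]  # no counterpart in PKDataset2
only_in_2 <- setdiff(cols2, cols1_mapped)       # no counterpart in PKDataset1

print(only_in_1)
print(only_in_2)
```

Under this mapping, the street address and law enforcement agency appear only in PKDataset1, while the mental illness, threat level, fleeing, and body camera columns appear only in PKDataset2.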
Checking for missing values
PKDataset1
## # A tibble: 6 x 14
## P name age gender raceethnicity month day year streetaddress
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 91 Kenn… 18 Male White Janu… 4 2015 <NA>
## 2 276 Mya … 27 Female Black March 30 2015 <NA>
## 3 339 Sant… 24 Male Hispanic/Lat… April 20 2015 <NA>
## 4 360 Bill… 29 Male White April 26 2015 <NA>
## 5 940 Dari… 30 Male White Octo… 20 2015 <NA>
## 6 1154 Shun… 64 Male Unknown Dece… 3 2015 <NA>
## # ... with 5 more variables: city <chr>, state <chr>,
## # classification <chr>, lawenforcementagency <chr>, armed <chr>
## [1] 14
There are 14 rows with missing values in the dataset. I simply drop every row that contains a missing value and create a new PKDataset1 as below:
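A minimal sketch of this step in base R, on a toy data frame with hypothetical values (`pk_toy`, `has_na`, and `pk_clean` are illustrative names, not the ones used in the original analysis):

```r
# Toy data with one missing street address (values are placeholders).
pk_toy <- data.frame(
  name = c("A", "B", "C"),
  streetaddress = c("address 1", NA, "address 3"),
  stringsAsFactors = FALSE
)

# Flag rows that contain at least one missing value.
has_na <- !complete.cases(pk_toy)

pk_toy[has_na, ]  # inspect the incomplete rows
sum(has_na)       # count them

# Keep only complete rows.
pk_clean <- na.omit(pk_toy)
```

Note that dropping rows this way discards every column of an incomplete record, even when only one field (here the street address) is missing.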
PKDataset2
## # A tibble: 94 x 14
## id name date manner_of_death armed age gender race
## <dbl> <chr> <dttm> <chr> <chr> <dbl> <chr> <chr>
## 1 110 Will… 2015-01-25 00:00:00 shot gun 59 M <NA>
## 2 584 Alej… 2015-02-20 00:00:00 shot gun NA M H
## 3 244 John… 2015-03-30 00:00:00 shot gun 54 M <NA>
## 4 534 Mark… 2015-04-09 00:00:00 shot and Taser… vehi… 54 M <NA>
## 5 433 Jose… 2015-05-07 00:00:00 shot knife 72 M <NA>
## 6 503 Jame… 2015-05-31 00:00:00 shot gun 40 M <NA>
## 7 523 Jame… 2015-06-08 00:00:00 shot gun 54 M <NA>
## 8 542 Raym… 2015-06-11 00:00:00 shot gun 86 M <NA>
## 9 604 Bria… 2015-07-02 00:00:00 shot cros… 59 M <NA>
## 10 641 Char… 2015-07-14 00:00:00 shot gun 76 M <NA>
## # ... with 84 more rows, and 6 more variables: city <chr>, state <chr>,
## # signs_of_mental_illness <lgl>, threat_level <chr>, flee <chr>,
## # body_camera <lgl>
What other data could you imagine would be valuable to consolidate the existing data?
Population figures at the city and state level would be valuable for building an index of killings relative to population, which would give a better understanding of police killings.
Latitude-longitude data would make it possible to create geospatial graphs or charts, which would be a good tool for understanding the geographic pattern of police killings.
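To illustrate the population index idea, here is a small sketch of a per-100,000 rate by state. Everything in it is hypothetical: the state codes `AA`, `BB`, `CC` and the population figures are placeholders, and a real analysis would need an external population table that is not part of either dataset.

```r
# One row per incident; placeholder state codes, not real data.
killings <- data.frame(state = c("AA", "AA", "BB", "CC", "CC", "CC"),
                       stringsAsFactors = FALSE)

# Hypothetical population lookup table.
population <- data.frame(state = c("AA", "BB", "CC"),
                         population = c(2e6, 5e5, 6e6),
                         stringsAsFactors = FALSE)

# Count incidents per state.
counts <- as.data.frame(table(state = killings$state),
                        responseName = "killings",
                        stringsAsFactors = FALSE)

# Join counts with population and compute the rate per 100,000 people.
rates <- merge(counts, population, by = "state")
rates$per_100k <- rates$killings / rates$population * 1e5

print(rates)
```

In this toy example the state with the most incidents (`CC`) has the lowest rate, which is the kind of reversal a population index is meant to reveal.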
Victims' age vs. count
The above graph shows the distribution of victims by age group against the number of killings. Most victims are between 20 and 35 years old, and the bars become smaller as age increases.
Count v/s Age and Gender of Shooting Victims
This is the same distribution as above, but now plotted as stacked bars, with each bar showing the share of the three genders in the dataset. The Non-Conforming category is almost negligible, and there are far fewer female victims than male victims. Most female victims are aged 35-40.
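The counts behind a stacked age-by-gender chart can be sketched in base R as below. The data frame is a toy example; the `"Unknown"` age value is an assumption about how a non-numeric age might appear, since age is stored as character in PKDataset1, and the 5-year bin width is also my choice.

```r
# Toy victims table; values are illustrative only.
victims <- data.frame(
  age = c("22", "47", "53", "19", "23", "32", "Unknown"),
  gender = c("Male", "Male", "Male", "Male", "Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

# Non-numeric ages become NA; the coercion warning is expected here.
victims$age_num <- suppressWarnings(as.numeric(victims$age))

# Bin ages into 5-year groups such as [20,25).
victims$age_group <- cut(victims$age_num, breaks = seq(0, 100, by = 5),
                         right = FALSE)

# Cross-tabulate age group by gender and drop empty age groups.
counts <- table(victims$age_group, victims$gender)
counts <- counts[rowSums(counts) > 0, , drop = FALSE]

print(counts)
```

Passing `t(counts)` to `barplot()` would draw one bar per age group, stacked by gender.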
Top Cities Showing Age of Shooting Victims
This graph shows the top cities by number of victims, along with the age distribution for each city. Los Angeles has the highest number of victims, followed by Houston.
US Police Shootings By Year
Since we have data for only one year, this single bar represents the total number of shootings that year.
Manner of Death
This graph shows how most victims were killed. Clearly, most victims were killed by gunshot, followed by Taser.
What Month and Manner of Death
The above graph shows the months in which the most deaths were reported, and how the manner of death is distributed within each month. The highest number of killings happened in July, followed by March.
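Because month is stored as text in PKDataset1, a by-month summary needs the months put into calendar order first, otherwise they sort alphabetically. A minimal sketch on toy data (the rows are hypothetical; `month.name` is a constant built into base R):

```r
# Toy incidents; month names as text, as in PKDataset1.
incidents <- data.frame(
  month = c("March", "January", "July", "July", "January", "July"),
  classification = c("Gunshot", "Death in custody", "Gunshot",
                     "Taser", "Gunshot", "Gunshot"),
  stringsAsFactors = FALSE
)

# Order months by calendar position rather than alphabetically.
incidents$month <- factor(incidents$month, levels = month.name)

# Incidents per month, split by manner of death.
by_month <- table(incidents$month, incidents$classification)

# Month with the highest total.
totals <- rowSums(by_month)
busiest <- names(totals)[which.max(totals)]

print(by_month)
print(busiest)
```

Months with no incidents still appear as zero rows, which keeps the x-axis of a chart complete.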
Multiple Graphs
Two graphs: the first shows the gender distribution of the victims. The second shows whether the victim was armed, with the gender distribution for each armed category.
Our date variable has two formats. Before we can use it, we need to parse it into a single format, which we can do with the parse_date_time() function from the lubridate package. Afterwards, to preserve the original date variable, we create a new “Date” column and give it a proper date format using the base R as.Date() function. We can then extract the month, day, and year using the lubridate package.
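The parsing steps described above might look roughly like this. The two input formats are not shown in this report, so the year-month-day and day/month/year strings below are assumptions chosen for illustration, as are the data frame and column names other than `Date`.

```r
library(lubridate)

# Toy data with two assumed date formats (hypothetical examples).
shootings <- data.frame(date = c("2015-01-02", "03/01/2015"),
                        stringsAsFactors = FALSE)

# Try each order in turn: year-month-day, then day-month-year.
parsed <- parse_date_time(shootings$date, orders = c("ymd", "dmy"))

# Keep the original column and add a proper Date column next to it.
shootings$Date <- as.Date(parsed)

# Extract the components with lubridate accessors.
shootings$month <- month(shootings$Date)
shootings$day   <- day(shootings$Date)
shootings$year  <- year(shootings$Date)

print(shootings)
```

If the real data mixes day-first and month-first dates, the `orders` argument has to be chosen with care, because a string like 03/01/2015 is ambiguous on its own.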
Distribution of Victims according to age
The graph shows the distribution of victims according to age.
Age and Gender of Shooting Victims
The graph shows the distribution of victims by age, with the gender distribution for each age category.
Top Cities Showing Age of Shooting Victims
Distribution of victims over the top cities, with the age distribution marked by a different color for each city. Los Angeles tops the list, followed by San Francisco, for this dataset.
US Police Shootings By Year
The graph shows US police shootings for each year.
Manner of Death
The graph shows the distribution of manner of death for each year.
What Month and Manner of Death
The graph shows the distribution of incidents over the months, broken down by manner of death.
Multiple Graphs
The first graph shows the distribution of victims as male and female. The second graph shows whether the victim was armed and with what weapon, with the distribution of males and females for each category.
Reading Data
Exploring Dataset “OMDataset1”
## # A tibble: 6 x 10
## Games Sport Event `Athlete(s)` CountryCode CountryName Medal Result Unit
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2012 Athl… 1000… Mo FARAH GBR Great Brit… Gold 1.910… M:SS…
## 2 2012 Athl… 1000… Galen RUPP USA United Sta… Silv… 1.910… M:SS…
## 3 2012 Athl… 1000… Tariku BEKE… ETH Ethiopia Bron… 1.911… M:SS…
## 4 2012 Athl… 1000… Tirunesh DI… ETH Ethiopia Gold 2.107… M:SS…
## 5 2012 Athl… 1000… Sally Jepko… KEN Kenya Silv… 2.113… M:SS…
## 6 2012 Athl… 1000… Vivian CHER… KEN Kenya Bron… 2.118… M:SS…
## # ... with 1 more variable: ResultInSeconds <dbl>
From the output above, we can see that the dataset has 10 attributes, along with the data type of each one. My understanding of what each attribute represents is as follows:
## [1] 4093 10
Dataset has 4093 observations and 10 attributes.
## Games Sport Event Athlete(s)
## Min. :1896 Length:4093 Length:4093 Length:4093
## 1st Qu.:1956 Class :character Class :character Class :character
## Median :1980 Mode :character Mode :character Mode :character
## Mean :1974
## 3rd Qu.:2000
## Max. :2012
##
## CountryCode CountryName Medal
## Length:4093 Length:4093 Length:4093
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Result Unit ResultInSeconds
## Length:4093 Length:4093 Min. : 9.63
## Class :character Class :character 1st Qu.: 60.88
## Mode :character Mode :character Median : 180.70
## Mean : 755.05
## 3rd Qu.: 382.07
## Max. :17946.00
## NA's :23
## Observations: 4,093
## Variables: 10
## $ Games <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012...
## $ Sport <chr> "Athletics", "Athletics", "Athletics", "Athlet...
## $ Event <chr> "10000m Men", "10000m Men", "10000m Men", "100...
## $ `Athlete(s)` <chr> "Mo FARAH", "Galen RUPP", "Tariku BEKELE", "Ti...
## $ CountryCode <chr> "GBR", "USA", "ETH", "ETH", "KEN", "KEN", "AUS...
## $ CountryName <chr> "Great Britain", "United States of America", "...
## $ Medal <chr> "Gold", "Silver", "Bronze", "Gold", "Silver", ...
## $ Result <chr> "1.9102083333333335E-2", "1.9107638888888889E-...
## $ Unit <chr> "M:SS:DD", "M:SS:DD", "M:SS:DD", "M:SS:DD", "M...
## $ ResultInSeconds <dbl> 1650.42, 1650.90, 1651.43, 1820.75, 1826.37, 1...
Exploring Dataset “OMDataset2”
## # A tibble: 6 x 10
## `List of medallis… X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 DISCLAIMER: The I… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 City Edit… Sport Disc… Athl… NOC Gend… Event Even… Medal
## 5 Antwerp 1920 Aqua… Divi… PRIE… USA Men 10m … M Bron…
## 6 Antwerp 1920 Aqua… Divi… PINK… USA Men 10m … M Gold
From the output above, we can see that the dataset has 10 attributes, along with the data type of each one. My understanding of what each attribute represents is as follows:
## [1] 26398 10
The raw dataset has 26398 rows and 10 attributes. The first four rows are not data (three filler rows and the header row), which leaves 26394 actual observations.
## List of medallists at the Games of the Olympiad per edition, sport, discipline, gender and event
## Length:26398
## Class :character
## Mode :character
## X__1 X__2 X__3
## Length:26398 Length:26398 Length:26398
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## X__4 X__5 X__6
## Length:26398 Length:26398 Length:26398
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## X__7 X__8 X__9
## Length:26398 Length:26398 Length:26398
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
From above we can see the summary of each attribute.
## Observations: 26,398
## Variables: 10
## $ `List of medallists at the Games of the Olympiad per edition, sport, discipline, gender and event` <chr> ...
## $ X__1 <chr> ...
## $ X__2 <chr> ...
## $ X__3 <chr> ...
## $ X__4 <chr> ...
## $ X__5 <chr> ...
## $ X__6 <chr> ...
## $ X__7 <chr> ...
## $ X__8 <chr> ...
## $ X__9 <chr> ...
The output above shows the attributes and their data types.
Contrasting features of the datasets
The first dataset focuses more on athlete performance and has time and related features as attributes. The second dataset has features like the city and the gender of the athlete, which are missing from the first dataset.
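On the time-related attributes of OMDataset1: the `Result` values shown earlier look like fractions of a day, and multiplying them by 86400 (seconds per day) reproduces the `ResultInSeconds` values for the same rows. The sketch below checks that reading on the first two rows; treating `Result` as a day fraction is my inference from the numbers, not something the dataset documents.

```r
# First two Result values from the listing above, plus a missing entry.
result_raw <- c("1.9102083333333335E-2", "1.9107638888888889E-2", NA)

# Convert text to numeric; missing or non-numeric entries stay NA.
day_fraction <- suppressWarnings(as.numeric(result_raw))

# Assumed interpretation: fraction of a 24-hour day, so scale by 86400 seconds.
seconds <- round(day_fraction * 86400, 2)

print(seconds)  # compare with ResultInSeconds: 1650.42 and 1650.90
```

This only applies to rows whose unit is a time format; other units would need their own handling.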
OMDataset1
Checking missing values and creating dataframe by removing missing values
## # A tibble: 23 x 10
## Games Sport Event `Athlete(s)` CountryCode CountryName Medal Result
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2012 Athl… 4x10… DISQUALIFIE… <NA> <NA> Silv… <NA>
## 2 1928 Rowi… Doub… Viktor Fles… AUT Austria Bron… No re…
## 3 1928 Rowi… Eigh… Jack Hand, … CAN Canada Bron… No re…
## 4 1924 Rowi… Coxl… Hans Walter… SUI Switzerland Bron… No re…
## 5 1924 Rowi… Doub… Heini Thoma… SUI Switzerland Bron… No re…
## 6 1924 Rowi… Eigh… Pietro Ivan… KIT Italy Bron… No re…
## 7 1920 Rowi… Eigh… Tollef Toll… NOR Norway Bron… No re…
## 8 1920 Rowi… Sing… Darcy Hadfi… NZL New Zealand Bron… No re…
## 9 1912 Rowi… Eigh… Rudolf Reic… DEU Germany Bron… No re…
## 10 1912 Rowi… Sing… Hugo Kusick EST Estonia Bron… No re…
## # ... with 13 more rows, and 2 more variables: Unit <chr>,
## # ResultInSeconds <dbl>
There are 23 rows with missing values.
The new OMDataset1 has no missing values now.
OMDataset2
## # A tibble: 6 x 10
## `List of medallis… X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 DISCLAIMER: The I… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 City Edit… Sport Disc… Athl… NOC Gend… Event Even… Medal
## 5 Antwerp 1920 Aqua… Divi… PRIE… USA Men 10m … M Bron…
## 6 Antwerp 1920 Aqua… Divi… PINK… USA Men 10m … M Gold
As we can see above, the 4th row is actually the header of the dataframe, and the first three rows hold non-data content (empty rows and a disclaimer). We have to fix this before we can start the analysis. Let’s do it:
## # A tibble: 6 x 10
## City Edition Sport Discipline Athlete NOC Gender Event Event_gender
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 DISC… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 City Edition Sport Discipline Athlete NOC Gender Event Event_gender
## 4 Antw… 1920 Aqua… Diving PRIEST… USA Men 10m … M
## 5 Antw… 1920 Aqua… Diving PINKST… USA Men 10m … M
## 6 Antw… 1920 Aqua… Diving ADLERZ… SWE Men 10m … M
## # ... with 1 more variable: Medal <chr>
The header has been adjusted. Let’s remove the leftover non-data rows before we get started:
## # A tibble: 6 x 10
## City Edition Sport Discipline Athlete NOC Gender Event Event_gender
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Antw… 1920 Aqua… Diving PRIEST… USA Men 10m … M
## 2 Antw… 1920 Aqua… Diving PINKST… USA Men 10m … M
## 3 Antw… 1920 Aqua… Diving ADLERZ… SWE Men 10m … M
## 4 Antw… 1920 Aqua… Diving OLLIVI… SWE Women 10m … W
## 5 Antw… 1920 Aqua… Diving FRYLAN… DEN Women 10m … W
## 6 Antw… 1920 Aqua… Diving ARMSTR… GBR Women 10m … W
## # ... with 1 more variable: Medal <chr>
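The header repair shown above can be sketched in base R as follows. This is a compact illustration on a toy three-column data frame, not the exact steps used to produce the output; the names `raw`, `header_row`, and `clean` are hypothetical, and the disclaimer text is elided on purpose.

```r
# Toy version of the raw sheet: filler rows, then the real header in row 4.
raw <- data.frame(
  title = c(NA, "DISCLAIMER (text elided)", NA, "City", "Antwerp", "Antwerp"),
  X__1  = c(NA, NA, NA, "Edition", "1920", "1920"),
  X__2  = c(NA, NA, NA, "Medal", "Bronze", "Gold"),
  stringsAsFactors = FALSE
)

header_row <- 4

# Drop everything up to and including the header row.
clean <- raw[-seq_len(header_row), , drop = FALSE]

# Promote the header row's values to column names.
names(clean) <- unlist(raw[header_row, ], use.names = FALSE)
rownames(clean) <- NULL

# Everything was read as text, so convert numeric columns explicitly.
clean$Edition <- as.integer(clean$Edition)

print(clean)
```

An alternative is to skip the filler rows when reading the file, if the reading function in use offers a skip option, which avoids the all-character columns in the first place.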
Now the dataframe looks clean and ready for analysis. Before moving ahead, let’s check for missing values and remove any from the dataframe.
## # A tibble: 0 x 10
## # ... with 10 variables: City <chr>, Edition <chr>, Sport <chr>,
## # Discipline <chr>, Athlete <chr>, NOC <chr>, Gender <chr>, Event <chr>,
## # Event_gender <chr>, Medal <chr>
There are no missing values in the dataframe. Let’s move on to the analysis.
The above chart has the year of the Games on the x-axis and the number of medals won on the y-axis. It is an interactive chart, so you can hover over it to see the number of medals for each year. Overall, the number of medals has been increasing over the years. The pie chart gives the distribution of gold, silver, and bronze medals.
The above chart shows the top 10 athletes across the Olympic Games held from 1896 to 2012.
The highest number of medals was won in Athletics, followed by Swimming.
The above graph shows the medals won in different cities, with the distribution of medals by city in the pie chart. Los Angeles has the highest number of medals won according to this dataset. According to the pie chart, Los Angeles has hosted the Olympics twice.
Medals won by country, with the distribution of medals won by different countries in the pie chart. The USA is the highest medal winner in this case. The order is as follows - Australia, Denmark, Finland, France …
Top 10 athletes for the Olympic Games from 1920 to 2008: Larisa Latynina tops the list, followed by Michael Phelps.
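A top-N ranking like the one above boils down to counting rows per athlete and sorting. A minimal base R sketch on toy data (the athlete labels are placeholders, and each row is assumed to represent one medal):

```r
# One row per medal; placeholder athlete labels.
medals <- data.frame(Athlete = c("A", "B", "A", "C", "A", "B"),
                     stringsAsFactors = FALSE)

# Count medals per athlete and sort from most to fewest.
medal_counts <- sort(table(medals$Athlete), decreasing = TRUE)

# Keep the top 2 here; the charts in this report use the top 10.
top_n <- head(medal_counts, 2)

print(top_n)
```

For team events this counts one medal per listed row, so how the dataset records team members affects the ranking.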
The graph shows medals won in different sports. Aquatics has the highest number of medals, followed by Athletics.