The dataset I am going to look at includes all people killed by a police officer in 2016. I gathered this dataset a while back, uploaded the file to my GitHub, and created a raw link for it. You can see the original from the Guardian newspaper here. I’ll now read it into my code.
df <- read.csv("https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/the-counted-2016.csv",encoding = "UTF-8")
head(df)
## uid name age gender raceethnicity month day year
## 1 20161 Joshua Sisson 30 Male White January 1 2016
## 2 20162 Germonta Wallace 30 Male Black January 3 2016
## 3 20163 Sean O'Brien 37 Male White January 2 2016
## 4 20164 Rodney Turner 22 Male Black January 4 2016
## 5 20165 Eric Senegal 27 Male Black January 4 2016
## 6 20166 David Zollo 54 Male White January 5 2016
## streetaddress city state classification
## 1 4200 6th Ave San Diego CA Gunshot
## 2 2600 Watson Dr Charlotte NC Gunshot
## 3 100 Washington St Livingston MT Gunshot
## 4 3600 NW 42nd St Oklahoma City OK Gunshot
## 5 Gene Stanley Rd Ragley LA Gunshot
## 6 151 S Bishop Ave Clifton Heights PA Gunshot
## lawenforcementagency armed
## 1 San Diego Police Department Knife
## 2 Charlotte-Mecklenburg Police Department Firearm
## 3 Livingston Police Department Knife
## 4 Oklahoma City Police Department Firearm
## 5 Beauregard Parish Sheriff's Office Unknown
## 6 Upper Darby Police Department Knife
The head command above gives a small sample of the data: only the first six rows. Printing all of the data should be avoided!
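To check the size and column types of a dataset without printing every row, base R offers dim, str, and head with an n argument. The sketch below is only an illustration: sample_df is a made-up stand-in built from three of the rows shown above, so it runs without a network connection.

```r
# A small stand-in for df, built from three rows of the output above
csv_text <- "uid,name,age,gender
20161,Joshua Sisson,30,Male
20162,Germonta Wallace,30,Male
20163,Sean O'Brien,37,Male"

# read.csv can parse a string directly through its text argument
sample_df <- read.csv(text = csv_text)

dim(sample_df)          # number of rows, then number of columns
str(sample_df)          # column names and types, one line per column
head(sample_df, n = 2)  # only the first two rows
```

The same three calls should work unchanged on df once it has been read from the raw link.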
If you are not going to load the data from a link (GitHub only supports small data files), you can upload the file directly into RStudio. Use Upload and select your file; I did this with a file called disney.xlsx. Now that it is in, I’ll load it into R.
library(readxl)
df2 = read_excel("disney.xlsx")
## New names:
## * `` -> ...1
head(df2)
## # A tibble: 6 × 32
## ...1 title `Production comp… `Release date` `Running time` Country Language
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0 Acade… Walt Disney Prod… ['May 19, 1937… 41 minutes (7… United… English
## 2 1 Snow … Walt Disney Prod… ['December 21,… 83 minutes United… English
## 3 2 Pinoc… Walt Disney Prod… ['February 7, … 88 minutes United… English
## 4 3 Fanta… Walt Disney Prod… ['November 13,… 126 minutes United… English
## 5 4 The R… Walt Disney Prod… ['June 20, 194… 74 minutes United… English
## 6 5 Dumbo Walt Disney Prod… ['October 23, … 64 minutes United… English
## # … with 25 more variables: Running time (int) <dbl>, Budget (float) <dbl>,
## # Box office (float) <dbl>, Release date (datetime) <dttm>, imdb <chr>,
## # metascore <chr>, rotten_tomatoes <dbl>, Directed by <chr>,
## # Produced by <chr>, Written by <chr>, Based on <chr>, Starring <chr>,
## # Music by <chr>, Distributed by <chr>, Budget <chr>, Box office <chr>,
## # Story by <chr>, Narrated by <chr>, Cinematography <chr>, Edited by <chr>,
## # Screenplay by <chr>, Production companies <chr>, Adaptation by <chr>, …
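The "New names" message in the output above says that the first column of the spreadsheet had no header, so it was given the name ...1. In the printed rows it appears to hold nothing but a row index. Here is a minimal sketch of dropping such a column. The data frame df2_demo and its film titles are made up for illustration, so the code runs without disney.xlsx.

```r
# Hypothetical stand-in for df2: an index column plus one real column
df2_demo <- data.frame(
  index = c("0", "1", "2"),
  title = c("Film A", "Film B", "Film C")
)
# Mimic the automatic name that appeared in the output above
names(df2_demo)[1] <- "...1"

# Keep every column except the auto-named one
df2_clean <- df2_demo[, names(df2_demo) != "...1", drop = FALSE]
names(df2_clean)
```

The drop = FALSE argument keeps the result a data frame even when only one column is left.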
There are some datasets included in packages for R. You can find a list here. I’ll load something from the list.
Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
This will NOT work for the project…iris or cars might…
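One likely obstacle with Titanic is that it is stored as a four-dimensional table of counts rather than a data frame with one row per observation. As a rough sketch (not a claim about what the project requires), base R can flatten the table into a data frame of counts; the name titanic_df is my own.

```r
class(Titanic)                         # "table", not "data.frame"

# One row per Class/Sex/Age/Survived combination, with the count in Freq
titanic_df <- as.data.frame(Titanic)

head(titanic_df)
sum(titanic_df$Freq)                   # total number of people counted
```

Even after this conversion each row is a count for a group, not an individual passenger, which may still be limiting.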
I found mention of a baseball stats package, so I installed it and added it to my library. You can find more info here.
library(Lahman)
#help(Lahman)
head(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
library(babynames)
head(babynames)
## # A tibble: 6 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
library(nycflights13)
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
library(fec16)
head(candidates)
## # A tibble: 6 × 15
## cand_id cand_name cand_pty_affili… cand_election_yr cand_office_st cand_office
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 H0AL02087 ROBY, MA… REP 2016 AL H
## 2 H0AL02095 JOHN, RO… IND 2016 AL H
## 3 H0AL05163 BROOKS, … REP 2016 AL H
## 4 H0AL07086 SEWELL, … DEM 2016 AL H
## 5 H0AR01083 CRAWFORD… REP 2016 AR H
## 6 H0AR03055 WOMACK, … REP 2016 AR H
## # … with 9 more variables: cand_office_district <chr>, cand_ici <chr>,
## # cand_status <chr>, cand_pcc <chr>, cand_st1 <chr>, cand_st2 <chr>,
## # cand_city <chr>, cand_st <chr>, cand_zip <chr>
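Each of the packages loaded above has to be installed once before library can find it. The sketch below shows one way to check which ones are still missing. The helper missing_packages is a name I made up, and the install line is left commented out because it needs an internet connection.

```r
# Packages loaded in this section
pkgs <- c("Lahman", "babynames", "nycflights13", "fec16")

# Return the entries of pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

Checking first avoids reinstalling packages every time the document is knitted.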
The UCI Machine Learning Repository is a great resource for datasets of excellent quality. I was able to load the Iris dataset from a link on the site.
read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
## X5.1 X3.5 X1.4 X0.2 Iris.setosa
## 1 4.9 3.0 1.4 0.2 Iris-setosa
## 2 4.7 3.2 1.3 0.2 Iris-setosa
## 3 4.6 3.1 1.5 0.2 Iris-setosa
## 4 5.0 3.6 1.4 0.2 Iris-setosa
## 5 5.4 3.9 1.7 0.4 Iris-setosa
## 6 4.6 3.4 1.4 0.3 Iris-setosa
## 7 5.0 3.4 1.5 0.2 Iris-setosa
## 8 4.4 2.9 1.4 0.2 Iris-setosa
## 9 4.9 3.1 1.5 0.1 Iris-setosa
## 10 5.4 3.7 1.5 0.2 Iris-setosa
## 11 4.8 3.4 1.6 0.2 Iris-setosa
## 12 4.8 3.0 1.4 0.1 Iris-setosa
## 13 4.3 3.0 1.1 0.1 Iris-setosa
## 14 5.8 4.0 1.2 0.2 Iris-setosa
## 15 5.7 4.4 1.5 0.4 Iris-setosa
## 16 5.4 3.9 1.3 0.4 Iris-setosa
## 17 5.1 3.5 1.4 0.3 Iris-setosa
## 18 5.7 3.8 1.7 0.3 Iris-setosa
## 19 5.1 3.8 1.5 0.3 Iris-setosa
## 20 5.4 3.4 1.7 0.2 Iris-setosa
## 21 5.1 3.7 1.5 0.4 Iris-setosa
## 22 4.6 3.6 1.0 0.2 Iris-setosa
## 23 5.1 3.3 1.7 0.5 Iris-setosa
## 24 4.8 3.4 1.9 0.2 Iris-setosa
## 25 5.0 3.0 1.6 0.2 Iris-setosa
## 26 5.0 3.4 1.6 0.4 Iris-setosa
## 27 5.2 3.5 1.5 0.2 Iris-setosa
## 28 5.2 3.4 1.4 0.2 Iris-setosa
## 29 4.7 3.2 1.6 0.2 Iris-setosa
## 30 4.8 3.1 1.6 0.2 Iris-setosa
## 31 5.4 3.4 1.5 0.4 Iris-setosa
## 32 5.2 4.1 1.5 0.1 Iris-setosa
## 33 5.5 4.2 1.4 0.2 Iris-setosa
## 34 4.9 3.1 1.5 0.1 Iris-setosa
## 35 5.0 3.2 1.2 0.2 Iris-setosa
## 36 5.5 3.5 1.3 0.2 Iris-setosa
## 37 4.9 3.1 1.5 0.1 Iris-setosa
## 38 4.4 3.0 1.3 0.2 Iris-setosa
## 39 5.1 3.4 1.5 0.2 Iris-setosa
## 40 5.0 3.5 1.3 0.3 Iris-setosa
## 41 4.5 2.3 1.3 0.3 Iris-setosa
## 42 4.4 3.2 1.3 0.2 Iris-setosa
## 43 5.0 3.5 1.6 0.6 Iris-setosa
## 44 5.1 3.8 1.9 0.4 Iris-setosa
## 45 4.8 3.0 1.4 0.3 Iris-setosa
## 46 5.1 3.8 1.6 0.2 Iris-setosa
## 47 4.6 3.2 1.4 0.2 Iris-setosa
## 48 5.3 3.7 1.5 0.2 Iris-setosa
## 49 5.0 3.3 1.4 0.2 Iris-setosa
## 50 7.0 3.2 4.7 1.4 Iris-versicolor
## 51 6.4 3.2 4.5 1.5 Iris-versicolor
## 52 6.9 3.1 4.9 1.5 Iris-versicolor
## 53 5.5 2.3 4.0 1.3 Iris-versicolor
## 54 6.5 2.8 4.6 1.5 Iris-versicolor
## 55 5.7 2.8 4.5 1.3 Iris-versicolor
## 56 6.3 3.3 4.7 1.6 Iris-versicolor
## 57 4.9 2.4 3.3 1.0 Iris-versicolor
## 58 6.6 2.9 4.6 1.3 Iris-versicolor
## 59 5.2 2.7 3.9 1.4 Iris-versicolor
## 60 5.0 2.0 3.5 1.0 Iris-versicolor
## 61 5.9 3.0 4.2 1.5 Iris-versicolor
## 62 6.0 2.2 4.0 1.0 Iris-versicolor
## 63 6.1 2.9 4.7 1.4 Iris-versicolor
## 64 5.6 2.9 3.6 1.3 Iris-versicolor
## 65 6.7 3.1 4.4 1.4 Iris-versicolor
## 66 5.6 3.0 4.5 1.5 Iris-versicolor
## 67 5.8 2.7 4.1 1.0 Iris-versicolor
## 68 6.2 2.2 4.5 1.5 Iris-versicolor
## 69 5.6 2.5 3.9 1.1 Iris-versicolor
## 70 5.9 3.2 4.8 1.8 Iris-versicolor
## 71 6.1 2.8 4.0 1.3 Iris-versicolor
## 72 6.3 2.5 4.9 1.5 Iris-versicolor
## 73 6.1 2.8 4.7 1.2 Iris-versicolor
## 74 6.4 2.9 4.3 1.3 Iris-versicolor
## 75 6.6 3.0 4.4 1.4 Iris-versicolor
## 76 6.8 2.8 4.8 1.4 Iris-versicolor
## 77 6.7 3.0 5.0 1.7 Iris-versicolor
## 78 6.0 2.9 4.5 1.5 Iris-versicolor
## 79 5.7 2.6 3.5 1.0 Iris-versicolor
## 80 5.5 2.4 3.8 1.1 Iris-versicolor
## 81 5.5 2.4 3.7 1.0 Iris-versicolor
## 82 5.8 2.7 3.9 1.2 Iris-versicolor
## 83 6.0 2.7 5.1 1.6 Iris-versicolor
## 84 5.4 3.0 4.5 1.5 Iris-versicolor
## 85 6.0 3.4 4.5 1.6 Iris-versicolor
## 86 6.7 3.1 4.7 1.5 Iris-versicolor
## 87 6.3 2.3 4.4 1.3 Iris-versicolor
## 88 5.6 3.0 4.1 1.3 Iris-versicolor
## 89 5.5 2.5 4.0 1.3 Iris-versicolor
## 90 5.5 2.6 4.4 1.2 Iris-versicolor
## 91 6.1 3.0 4.6 1.4 Iris-versicolor
## 92 5.8 2.6 4.0 1.2 Iris-versicolor
## 93 5.0 2.3 3.3 1.0 Iris-versicolor
## 94 5.6 2.7 4.2 1.3 Iris-versicolor
## 95 5.7 3.0 4.2 1.2 Iris-versicolor
## 96 5.7 2.9 4.2 1.3 Iris-versicolor
## 97 6.2 2.9 4.3 1.3 Iris-versicolor
## 98 5.1 2.5 3.0 1.1 Iris-versicolor
## 99 5.7 2.8 4.1 1.3 Iris-versicolor
## 100 6.3 3.3 6.0 2.5 Iris-virginica
## 101 5.8 2.7 5.1 1.9 Iris-virginica
## 102 7.1 3.0 5.9 2.1 Iris-virginica
## 103 6.3 2.9 5.6 1.8 Iris-virginica
## 104 6.5 3.0 5.8 2.2 Iris-virginica
## 105 7.6 3.0 6.6 2.1 Iris-virginica
## 106 4.9 2.5 4.5 1.7 Iris-virginica
## 107 7.3 2.9 6.3 1.8 Iris-virginica
## 108 6.7 2.5 5.8 1.8 Iris-virginica
## 109 7.2 3.6 6.1 2.5 Iris-virginica
## 110 6.5 3.2 5.1 2.0 Iris-virginica
## 111 6.4 2.7 5.3 1.9 Iris-virginica
## 112 6.8 3.0 5.5 2.1 Iris-virginica
## 113 5.7 2.5 5.0 2.0 Iris-virginica
## 114 5.8 2.8 5.1 2.4 Iris-virginica
## 115 6.4 3.2 5.3 2.3 Iris-virginica
## 116 6.5 3.0 5.5 1.8 Iris-virginica
## 117 7.7 3.8 6.7 2.2 Iris-virginica
## 118 7.7 2.6 6.9 2.3 Iris-virginica
## 119 6.0 2.2 5.0 1.5 Iris-virginica
## 120 6.9 3.2 5.7 2.3 Iris-virginica
## 121 5.6 2.8 4.9 2.0 Iris-virginica
## 122 7.7 2.8 6.7 2.0 Iris-virginica
## 123 6.3 2.7 4.9 1.8 Iris-virginica
## 124 6.7 3.3 5.7 2.1 Iris-virginica
## 125 7.2 3.2 6.0 1.8 Iris-virginica
## 126 6.2 2.8 4.8 1.8 Iris-virginica
## 127 6.1 3.0 4.9 1.8 Iris-virginica
## 128 6.4 2.8 5.6 2.1 Iris-virginica
## 129 7.2 3.0 5.8 1.6 Iris-virginica
## 130 7.4 2.8 6.1 1.9 Iris-virginica
## 131 7.9 3.8 6.4 2.0 Iris-virginica
## 132 6.4 2.8 5.6 2.2 Iris-virginica
## 133 6.3 2.8 5.1 1.5 Iris-virginica
## 134 6.1 2.6 5.6 1.4 Iris-virginica
## 135 7.7 3.0 6.1 2.3 Iris-virginica
## 136 6.3 3.4 5.6 2.4 Iris-virginica
## 137 6.4 3.1 5.5 1.8 Iris-virginica
## 138 6.0 3.0 4.8 1.8 Iris-virginica
## 139 6.9 3.1 5.4 2.1 Iris-virginica
## 140 6.7 3.1 5.6 2.4 Iris-virginica
## 141 6.9 3.1 5.1 2.3 Iris-virginica
## 142 5.8 2.7 5.1 1.9 Iris-virginica
## 143 6.8 3.2 5.9 2.3 Iris-virginica
## 144 6.7 3.3 5.7 2.5 Iris-virginica
## 145 6.7 3.0 5.2 2.3 Iris-virginica
## 146 6.3 2.5 5.0 1.9 Iris-virginica
## 147 6.5 3.0 5.2 2.0 Iris-virginica
## 148 6.2 3.4 5.4 2.3 Iris-virginica
## 149 5.9 3.0 5.1 1.8 Iris-virginica
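Notice the column names in the output above (X5.1, X3.5, and so on): the file has no header row, so read.csv used the first record as the header and only 149 rows remain. A minimal sketch of a fix is to pass header = FALSE and supply names. The column names below are my own labels, not part of the file, and the three records are pasted inline so the example runs offline.

```r
# First three records of iris.data, taken from the output above
iris_text <- "5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa"

# My own labels for the five fields (an assumption, not read from the file)
iris_cols <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

# header = FALSE keeps the first record as data instead of using it as names
iris_uci <- read.csv(text = iris_text, header = FALSE, col.names = iris_cols)
iris_uci
```

For the real file, the text argument would be replaced by the URL, and wrapping the result in head would avoid printing every row.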
I see that a lot of you have found great datasets on Kaggle. Rather than download and upload, I want to show how to pull directly from kaggle.com. This process requires a Kaggle account.
library(httr)
username <- "nurfnick"
# Prompt for the key instead of storing it in the script (RStudio only)
authkey <- .rs.askForPassword("foo")
dataset <- httr::GET("https://www.kaggle.com/ahsen1330/us-police-shootings/download",
                     httr::authenticate(username, authkey, type = "basic"))
# Download the zip archive to a temporary file and read the CSV inside it
temp <- tempfile()
download.file(dataset$url, temp)
data <- read.csv(unz(temp, "train.csv"))
# Delete the temporary file
unlink(temp)