R-Ladies: www.datateka.com
Dec 2017
Special Thanks
and
Katarina Kosmina
Please answer our survey at: https://tatjanakec.shinyapps.io/RLadies_Form
What is R?
The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.
R is a comprehensive, free and open-source language for data analysis, distributed under the GNU General Public License, with no licensing costs associated with it.
Because it runs on all major platforms (Windows, macOS and Linux), R is widely applicable and simple to integrate into existing IT structures.
You can download R for free from:
https://cran.r-project.org/
and RStudio, a free and open-source integrated development environment (IDE) for R
https://www.rstudio.com/
Data Science
The vast amounts of data we now collect, and pay more and more attention to, have given rise to the new discipline of data science.
Growing data volumes, and the demand for knowledge and insights extracted from data that are easy to understand, are the motivating forces of data science.
With the explosion of “Big Data” problems, data science has become a very hot field in many scientific areas as well as marketing, finance, and other business and social study disciplines. Hence, there is a growing demand for business and social scientific researchers with statistical, modelling and computing skills.
We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship, improve the human condition, and create commercial and social value.
The field of data science is emerging at the intersection of statistics, computer science and design. R provides a great platform for this multidisciplinary work. It is incredibly powerful, and if you want to move towards data science it is a strong choice as the first language in which to grow your skills in data manipulation, data analysis and visualisation.
Data Science skills are in ever increasing demand!
Why R?
The R system has an extensive library of packages that offer state-of-the-art capabilities.
Many of the analyses that they offer are not available in any of the standard statistical software packages.
The functionalities of:
R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages.
R enables easy experimentation and exploration, which improves data analysis.
R supports reporting modern data analyses in a reproducible manner, which makes an analysis more useful to others because the data and code that produced it can be made available.
R Community
“The R community is one of R's best features!”
Revolutions: Daily news about using open source R
Supported by the R Foundation for Statistical Computing, and with the strong and open engagement of developers and users from all walks of life, from science to commerce, it is hard to envisage any commercial corporation developing a sustainable business model with the same innovative drive and power as the R community.
The collaboration among statisticians and other scientists engaged in statistical computing, together with the growing interest and engagement of large companies, creates an altruistic R community, which is the force behind R's conquest of the field of data analytics. As a result, R becomes a more powerful resource, and more usable and attractive to data scientists and analysts.
R Community: list of resources
ROpenSci: “R community is not just for 'power users' or developers. It's a place for users and people interested in learning more about R”; Provides list of useful links:
#rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
Local R meetup groups — a google search may show that there's one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
Rweekly — an incredible weekly recap of all things R
R-bloggers — an awesome resource to find posts from many different bloggers using R
DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research
Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)
Who uses R?
Some of the major domains using R include:
How to practice?
Platform for data science competitions
~.~.~.~.~.~.~.~
Annual competition for students and professionals to develop and test their Big Data Analytics and Data Science skills against friends, colleagues and top professionals around the world
~.~.~.~.~.~.~.~
R for Data Science by Garrett Grolemund & Hadley Wickham: http://r4ds.had.co.nz/
~.~.~.~.~.~.~.~
Efficient R programming by Colin Gillespie & Robin Lovelace: https://csgillespie.github.io/efficientR/
Tools needed in a typical data science project:
R for Data Science
by Garrett Grolemund & Hadley Wickham
http://r4ds.had.co.nz/index.html
Real Example
Does declawing (onychectomy) cause harm to cats? Analyzing 17 years' worth of shelter admissions data.
Install and load packages and data
# The tidyverse is a collection of R packages designed for data science
# Install the complete tidyverse with
# install.packages("tidyverse")
# load the complete tidyverse with
library(tidyverse)
# Install and load the ggplot2 package: grammar of graphics
# (ggplot2 is already attached by library(tidyverse); loading it again is harmless)
# install.packages("ggplot2")
library(ggplot2)
# Install and load the magick package for advanced image-processing in R
# install.packages("magick")
library(magick)
# load the data saved on your computer
# cat_claw <- read.csv("declawing_data_sample.csv", stringsAsFactors = TRUE)
# or load the data directly from the website
# stringsAsFactors = TRUE reads text columns as factors, as in the output that follows
# (this was the default before R 4.0.0)
cat_claw <- read.csv(url("http://www.declawing-project.org/wp-content/uploads/2017/08/declawing_data_sample.csv"), stringsAsFactors = TRUE)
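The effect of the import options can be tried out on a few lines of inline text before touching the real file. This is a minimal sketch: the rows in `csv_text` are made up for illustration and are not the shelter data, and `toy` is a hypothetical object name. `stringsAsFactors = TRUE` is set explicitly so that text columns become factors, and `na.strings = ""` turns empty fields into NA at import time.

```r
# Toy CSV text standing in for the real file (values are invented)
csv_text <- "Animal.Name,Gender,Declawed
HARLEY,M,None
,F,Front
MOPSY,F,Both"

# text = lets read.csv() parse a string instead of a file or URL
toy <- read.csv(text = csv_text,
                stringsAsFactors = TRUE,  # character columns become factors
                na.strings = "")          # empty fields become NA

# inspect the column types and the missing name in row 2
str(toy)
```

Handling empty fields at import is one option; replacing them after loading also works.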
Have a look at the data: head()
# let us look at the first three rows of the data
head(cat_claw, n = 3)
Animal.ID Animal.Name Species Gender Date.Of.Birth Primary.Breed
1 1032415 HARLEY Cat M 9/18/1999 Domestic Shorthair
2 1032962 TRUCKER Cat M 4/10/1998 Domestic Shorthair
3 1033799 Cat M 2/2/2000 Domestic Longhair
Secondary.Breed Declawed Distinguishing.Markings Purebred BodyWeight
1 Mix None 0 0
2 Mix None 0 0
3 Mix None 0 2
BodyWeightUnit PrimaryColor SecondaryColor ColorPattern
1 <NA> Black White <NA>
2 <NA> Grey <NA> Tiger
3 pound Black <NA> <NA>
Intake.Date Intake.Type Intake.Subtype
1 03/18/2000 00:14:00 Owner/Guardian Surrender Schedule
2 04/06/2000 00:45:00 Stray Walk In
3 05/02/2000 00:37:00 Owner/Guardian Surrender Walk In
Reason Reason.Category
1 <NA>
2 <NA>
3 Too Many Pets Owner problem
Have a look at the structure of the data: str()
# look at the structure of the data
str(cat_claw)
'data.frame': 200 obs. of 20 variables:
$ Animal.ID : int 1032415 1032962 1033799 1033965 1038328 1048494 1052572 1053299 1054811 1057979 ...
$ Animal.Name : Factor w/ 183 levels "","Abigail","ALEXANDER",..: 63 176 1 1 1 138 116 23 52 163 ...
$ Species : Factor w/ 1 level "Cat": 1 1 1 1 1 1 1 1 1 1 ...
$ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 2 1 1 ...
$ Date.Of.Birth : Factor w/ 195 levels "1/14/2010","1/19/1999",..: 176 71 47 68 163 65 171 177 194 73 ...
$ Primary.Breed : Factor w/ 7 levels "Domestic Longhair",..: 3 3 1 1 3 3 3 3 3 3 ...
$ Secondary.Breed : Factor w/ 5 levels "","Domestic Shorthair",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Declawed : Factor w/ 3 levels "Both","Front",..: 3 3 3 3 3 2 3 2 3 3 ...
$ Distinguishing.Markings: Factor w/ 27 levels "","Black spot on snout",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Purebred : int 0 0 0 0 0 0 0 0 0 0 ...
$ BodyWeight : num 0 0 2 0 0 0 0 12 0 0 ...
$ BodyWeightUnit : Factor w/ 1 level "pound": NA NA 1 NA NA NA NA 1 NA NA ...
$ PrimaryColor : Factor w/ 8 levels "Beige","Black",..: 2 6 2 7 2 3 2 8 8 7 ...
$ SecondaryColor : Factor w/ 14 levels "Beige","Black",..: 14 NA NA 14 14 3 NA 2 2 NA ...
$ ColorPattern : Factor w/ 10 levels "Brindle","Calico",..: NA 7 NA NA NA 7 9 NA NA 7 ...
$ Intake.Date : Factor w/ 199 levels "01/06/2006 00:57:00",..: 34 45 62 64 119 77 137 150 169 21 ...
$ Intake.Type : Factor w/ 4 levels "Owner/Guardian Surrender",..: 1 3 1 1 3 1 1 1 1 1 ...
$ Intake.Subtype : Factor w/ 6 levels "Abandoned","Animal Control",..: 4 6 6 6 6 6 6 6 6 6 ...
$ Reason : Factor w/ 25 levels "","Abandoned",..: 1 1 24 24 1 14 3 3 23 21 ...
$ Reason.Category : Factor w/ 3 levels "Behavior other",..: NA NA 3 3 NA 3 3 3 2 3 ...
Do it in a tidy way: glimpse()
# the previous output was messy as it didn't fit on the slide.
# we want to look at the structure of the data, see as much data
# as possible and identify the data type of each variable
glimpse(cat_claw)
Observations: 200
Variables: 20
$ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
$ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth <fctr> 9/18/1999, 4/10/1998, 2/2/2000, 3/7/2...
$ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
$ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date <fctr> 03/18/2000 00:14:00, 04/06/2000 00:45...
$ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason <fctr> , , Too Many Pets, Too Many Pets, , N...
$ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
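If only the data types are of interest, base R can list them without any extra package. The sketch below uses a small invented data frame (`toy` and its values are hypothetical, chosen to mimic three of the columns) rather than the full data set.

```r
# Small invented data frame with an integer, a numeric and a factor column
toy <- data.frame(
  Animal.ID  = c(1L, 2L, 3L),
  BodyWeight = c(0, 2, 12),
  Declawed   = factor(c("None", "Front", "None"))
)

# class of every column, one entry per variable
col_classes <- sapply(toy, class)
print(col_classes)

# how many columns there are of each class
print(table(col_classes))
```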
What to focus on?
# Note that variable 'Declawed' is the main variable of interest
# with three possible outcomes
summary(cat_claw$Declawed)
Both Front None
8 92 100
# convert the dates (Date.Of.Birth and Intake.Date) to the same Date format
library(lubridate)
cat_claw$Date.Of.Birth <- as.Date(cat_claw$Date.Of.Birth, format='%m/%d/%Y')
cat_claw$Intake.Date <- as.Date(cat_claw$Intake.Date, format='%m/%d/%Y')
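To see what the format string does, here is a minimal base R sketch on a few date strings copied from the first rows of the data. The object names (`dob_raw`, `intake_raw`, `bad`) are hypothetical. Note that as.Date() ignores the trailing time in the intake strings, and that a string which does not match the format gives NA rather than an error.

```r
# Raw strings in month/day/year order, as they appear in the CSV
dob_raw    <- c("9/18/1999", "4/10/1998")
intake_raw <- c("03/18/2000 00:14:00", "04/06/2000 00:45:00")

# %m = month, %d = day, %Y = four-digit year
dob    <- as.Date(dob_raw, format = "%m/%d/%Y")
intake <- as.Date(intake_raw, format = "%m/%d/%Y")  # trailing time is ignored

print(dob)
print(intake)

# a string in a different layout does not match the format and becomes NA
bad <- as.Date("1999-09-18", format = "%m/%d/%Y")
print(bad)
```

Checking for NA after the conversion is a quick way to spot dates stored in an unexpected layout.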
How does it look?
# check the data
glimpse(cat_claw)
Observations: 200
Variables: 20
$ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
$ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
$ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
$ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
$ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason <fctr> , , Too Many Pets, Too Many Pets, , N...
$ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
How old are the cats?
# calculate age in days
cat_claw$diff_in_days <- cat_claw$Intake.Date - cat_claw$Date.Of.Birth
summary(cat_claw$diff_in_days) # summary for class type: 'difftime'
Length Class Mode
200 difftime numeric
# summary for diff_in_days as numeric (does everything seem ok?)
summary(as.numeric(cat_claw$diff_in_days))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-322 122 730 1070 1822 5114
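The age calculation can be checked by hand on the first three cats, whose dates appear in the output above. This is a sketch with hypothetical object names; it converts the difftime to a plain number and then to years by dividing by 365, the same approximation used later in the plot.

```r
# Dates of birth and intake dates of the first three cats
dob    <- as.Date(c("1999-09-18", "1998-04-10", "2000-02-02"))
intake <- as.Date(c("2000-03-18", "2000-04-06", "2000-05-02"))

# difftime with explicit units, so the result is always in days
age_days <- difftime(intake, dob, units = "days")

# convert to a plain numeric vector before summarising
age_days_num <- as.numeric(age_days, units = "days")
print(age_days_num)

# approximate age in years (365 days per year, ignoring leap years)
age_years <- round(age_days_num / 365, 2)
print(age_years)
```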
Identify 'incorrect' observation(s)
# locate the observation with the smallest (here negative) diff_in_days
# note: which.min() returns a single index only
ind <- which.min(as.numeric(cat_claw$diff_in_days))
ind
[1] 30
# remove the observation with intake date before date of birth
# save it as a new data set
cat <- cat_claw[-ind,]
# replace empty spaces with NA
cat$Animal.Name[cat$Animal.Name == ""] <- NA
cat$Distinguishing.Markings[cat$Distinguishing.Markings == ""] <- NA
cat$Reason[cat$Reason == ""] <- NA
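When more than one observation might be wrong, a logical condition finds all of them at once. The following is a sketch on invented data (`toy`, `neg` and `clean` are hypothetical names, and only the value -322 is taken from the summary above); it is an alternative to which.min(), not the method used on the slides.

```r
# Invented data: two rows have an intake date before the date of birth
toy <- data.frame(
  Animal.Name  = factor(c("HARLEY", "TRUCKER", "", "MOPSY")),
  diff_in_days = c(182, -322, 90, -5)
)

# indices of ALL rows with a negative age
neg <- which(toy$diff_in_days < 0)
print(neg)

# keep rows by logical condition; this is safe even when no row is negative,
# whereas toy[-integer(0), ] would return zero rows
clean <- toy[toy$diff_in_days >= 0, ]

# replace empty names with NA, then drop factor levels that are no longer used
clean$Animal.Name[clean$Animal.Name == ""] <- NA
clean$Animal.Name <- droplevels(clean$Animal.Name)
print(clean)
```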
Check the data
glimpse(cat)
Observations: 199
Variables: 21
$ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name <fctr> HARLEY, TRUCKER, NA, NA, NA, PUDDY TA...
$ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
$ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
$ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason <fctr> NA, NA, Too Many Pets, Too Many Pets,...
$ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
$ diff_in_days <time> 182 days, 727 days, 90 days, 61 days,...
Plot Age vs Declawed using Boxplot
boxplot(as.numeric(diff_in_days) ~ Declawed, data = cat, horizontal = TRUE)
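The numbers behind a boxplot can also be computed directly. This sketch uses a small invented data frame (`toy` and its values are hypothetical) to show group counts, proportions and the median age per group with base R.

```r
# Invented ages (in days) for five cats in three declawing groups
toy <- data.frame(
  Declawed = factor(c("None", "None", "Front", "Front", "Both")),
  age_days = c(182, 727, 90, 1822, 730)
)

# number of cats per group, and the same as proportions
counts <- table(toy$Declawed)
props  <- round(prop.table(counts), 2)
print(counts)
print(props)

# median age per group: the thick line inside each box of a boxplot
med <- tapply(toy$age_days, toy$Declawed, median)
print(med)
```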
Can we make it more attractive looking?
Create graph using ggplot
# The image_graph() function opens a new graphics device, similar to e.g. png() or x11().
# It returns an image object to which the plot(s) will be written
fig <- image_graph(width = 600, height = 600, res = 96)
# plot Age (in years) vs Declawed
ggplot(cat, aes(Declawed, round(as.numeric(diff_in_days) / 365, 2))) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(position = position_jitter(width = 0.30), shape = 20, size = 3, aes(colour = Declawed), alpha = 0.75) +
  # fun = mean requires ggplot2 >= 3.3.0; older versions use fun.y = mean
  stat_summary(fun = mean, shape = 23, size = 3, fill = "orange", col = "black", geom = "point") +
  labs(title = "Cats: Age vs Declawed", x = "Declawed", y = "Age (years)") +
  theme(panel.border = element_rect(fill = NA, colour = "black", size = 2)) +
  theme(plot.title = element_text(size = 20, vjust = 2))
# save the last plot as an image; ggsave() is called on its own, not added with `+`
ggsave('~/Documents/my_R/RLadiesMNE/ggplot_image.png')
# close the graphics device opened by image_graph()
dev.off()
Adding an animation to a graph: Read gif and background files
# read cat gif file
cat_gif <- image_read("http://media.giphy.com/media/q0ujUmppx3Fu0/giphy.gif")
#
# Background image
graph_bg <- image_read("~/Documents/my_R/RLadiesMNE/ggplot_image.png")
background <- image_background(image_scale(graph_bg, "650"), "white", flatten = TRUE)
Adding an animation to a graph
# Combine and flatten frames
frames <- image_apply(cat_gif, function(frame) {
image_composite(background, frame, offset = "+410+10")
})
# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)
format width height colorspace filesize
1 gif 650 650 sRGB 0
2 gif 650 650 sRGB 0
3 gif 650 650 sRGB 0
4 gif 650 650 sRGB 0
5 gif 650 650 sRGB 0
6 gif 650 650 sRGB 0
7 gif 650 650 sRGB 0
8 gif 650 650 sRGB 0
Happy Plotting!