What is R?

You can download R for free from:
https://cran.r-project.org/

RStudio, a free and open-source integrated development environment (IDE) for R, is available from:
https://www.rstudio.com/

Data Science

Why R?

  1. The R system has an extensive library of packages that offer state-of-the-art abilities.
    Many of the analyses they offer are not available in any of the standard statistical packages.

  2. The functionalities of R:
    1. R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages.

    2. R enables easy experimentation and exploration, which improves data analysis.

    3. R supports reporting modern data analyses in a reproducible manner. This makes an analysis more useful to others, because the data and code that actually produced it can be made available.

R Community

“The R community is one of R’s best features!” (Revolutions: Daily news about using open source R)

List of resources

ROpenSci: “R community is not just for ‘power users’ or developers. It’s a place for users and people interested in learning more about R”. It also provides a list of useful links:

#rstats hashtag: a responsive, welcoming, and inclusive community of R users to interact with on Twitter

R-Ladies: a worldwide organization focused on promoting gender diversity within the R community, with more than 30 local chapters

Local R meetup groups: a Google search may show that there’s one in your area! If not, consider starting one! Face-to-face meetups for users of all levels are incredibly valuable

Rweekly: an incredible weekly recap of all things R

R-bloggers: an awesome resource to find posts from many different bloggers using R

Data Carpentry and Software Carpentry: a resource of openly available lessons that promote and model reproducible research

Stack Overflow: chances are your R question has already been answered here (with additional resources for people looking for jobs)

Who uses R?

Some of the major domains using R include:

Top companies using R are:

How do we do it?

Tools needed in a typical data science project:

R for Data Science by Garrett Grolemund & Hadley Wickham

http://r4ds.had.co.nz/index.html

Real Example

Does declawing (onychectomy) cause harm to cats? To explore the question, we analyze 17 years’ worth of shelter admissions data.

The dataset captures specifics about the individual cat (declawed status, age, breed, coat color, etc.) as well as the primary reason for admission. Some of the admission reasons are unconnected to the animal (e.g., moving, can’t afford pet, allergies), but others are based on problematic behaviors exhibited by the cat (e.g., house-soiling, aggressive to other animals, aggressive to people). Available to us is a CSV file containing 200 sample records.

Cat_Data

Do it in R

# Install and load packages and data 
# The tidyverse is a collection of R packages designed for data science
# Install the complete tidyverse with
# install.packages("tidyverse")
# load the complete tidyverse with
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
# Install and load the ggplot2 package: grammar of graphics
# (ggplot2 is already attached by library(tidyverse); loading it again is harmless)
# install.packages("ggplot2")
library(ggplot2)
# Install and load the magick package for advanced image-processing in R
# install.packages("magick")
library(magick)
## Linking to ImageMagick 6.9.9.9
## Enabled features: cairo, fontconfig, freetype, fftw, pango, rsvg, webp
## Disabled features: ghostscript, lcms, x11
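# load the data saved on your computer
# cat_claw <- read.csv("declawing_data_sample.csv", stringsAsFactors = TRUE)
# or load the data directly from the website
# (stringsAsFactors = TRUE reproduces the factor columns shown below;
# since R 4.0.0 text columns are read as character by default)
cat_claw <- read.csv(url("http://www.declawing-project.org/wp-content/uploads/2017/08/declawing_data_sample.csv"), stringsAsFactors = TRUE)
# Have a look at the data: head()
# let us look at the first three rows of the data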
head(cat_claw, n = 3)
##   Animal.ID Animal.Name Species Gender Date.Of.Birth      Primary.Breed
## 1   1032415      HARLEY     Cat      M     9/18/1999 Domestic Shorthair
## 2   1032962     TRUCKER     Cat      M     4/10/1998 Domestic Shorthair
## 3   1033799                 Cat      M      2/2/2000  Domestic Longhair
##   Secondary.Breed Declawed Distinguishing.Markings Purebred BodyWeight
## 1             Mix     None                                0          0
## 2             Mix     None                                0          0
## 3             Mix     None                                0          2
##   BodyWeightUnit PrimaryColor SecondaryColor ColorPattern
## 1           <NA>        Black          White         <NA>
## 2           <NA>         Grey           <NA>        Tiger
## 3          pound        Black           <NA>         <NA>
##           Intake.Date              Intake.Type Intake.Subtype
## 1 03/18/2000 00:14:00 Owner/Guardian Surrender       Schedule
## 2 04/06/2000 00:45:00                    Stray        Walk In
## 3 05/02/2000 00:37:00 Owner/Guardian Surrender        Walk In
##          Reason Reason.Category
## 1                          <NA>
## 2                          <NA>
## 3 Too Many Pets   Owner problem
# Have a look at the structure of the data: str()
# str() lists every variable with its type and first few values
str(cat_claw)
## 'data.frame':    200 obs. of  20 variables:
##  $ Animal.ID              : int  1032415 1032962 1033799 1033965 1038328 1048494 1052572 1053299 1054811 1057979 ...
##  $ Animal.Name            : Factor w/ 183 levels "","Abigail","ALEXANDER",..: 63 176 1 1 1 138 116 23 52 163 ...
##  $ Species                : Factor w/ 1 level "Cat": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 2 1 1 ...
##  $ Date.Of.Birth          : Factor w/ 195 levels "1/14/2010","1/19/1999",..: 176 71 47 68 163 65 171 177 194 73 ...
##  $ Primary.Breed          : Factor w/ 7 levels "Domestic Longhair",..: 3 3 1 1 3 3 3 3 3 3 ...
##  $ Secondary.Breed        : Factor w/ 5 levels "","Domestic Shorthair",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Declawed               : Factor w/ 3 levels "Both","Front",..: 3 3 3 3 3 2 3 2 3 3 ...
##  $ Distinguishing.Markings: Factor w/ 27 levels "","Black spot on snout",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Purebred               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BodyWeight             : num  0 0 2 0 0 0 0 12 0 0 ...
##  $ BodyWeightUnit         : Factor w/ 1 level "pound": NA NA 1 NA NA NA NA 1 NA NA ...
##  $ PrimaryColor           : Factor w/ 8 levels "Beige","Black",..: 2 6 2 7 2 3 2 8 8 7 ...
##  $ SecondaryColor         : Factor w/ 14 levels "Beige","Black",..: 14 NA NA 14 14 3 NA 2 2 NA ...
##  $ ColorPattern           : Factor w/ 10 levels "Brindle","Calico",..: NA 7 NA NA NA 7 9 NA NA 7 ...
##  $ Intake.Date            : Factor w/ 199 levels "01/06/2006 00:57:00",..: 34 45 62 64 119 77 137 150 169 21 ...
##  $ Intake.Type            : Factor w/ 4 levels "Owner/Guardian Surrender",..: 1 3 1 1 3 1 1 1 1 1 ...
##  $ Intake.Subtype         : Factor w/ 6 levels "Abandoned","Animal Control",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ Reason                 : Factor w/ 25 levels "","Abandoned",..: 1 1 24 24 1 14 3 3 23 21 ...
##  $ Reason.Category        : Factor w/ 3 levels "Behavior other",..: NA NA 3 3 NA 3 3 3 2 3 ...
# Do it in a tidy way: glimpse()
# the previous output was messy as it didn't fit on the slide.
# we want to see as much of the data as possible
# and identify the data type of each variable
glimpse(cat_claw)
## Observations: 200
## Variables: 20
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name             <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
## $ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth           <fctr> 9/18/1999, 4/10/1998, 2/2/2000, 3/7/2...
## $ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed                <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date             <fctr> 03/18/2000 00:14:00, 04/06/2000 00:45...
## $ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason                  <fctr> , , Too Many Pets, Too Many Pets, , N...
## $ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...
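
The Factor and <fctr> columns in the output above depend on how read.csv() treats text. Before R 4.0.0 text columns were converted to factors by default; from R 4.0.0 onwards they stay character unless stringsAsFactors = TRUE is set. A minimal sketch of the difference, using a made-up three-row CSV (the toy_csv text and its values are invented for illustration and are not the shelter data):

```r
# Toy CSV text standing in for a file; NOT the real shelter data
toy_csv <- "Name,Declawed
HARLEY,None
PUDDY TAT,Front
MOP,None"

# Strings kept as character vectors (the default since R 4.0.0)
as_chr <- read.csv(text = toy_csv, stringsAsFactors = FALSE)
# Strings converted to factors (the default before R 4.0.0)
as_fct <- read.csv(text = toy_csv, stringsAsFactors = TRUE)

class(as_chr$Declawed)   # "character"
class(as_fct$Declawed)   # "factor"

# summary() counts the levels of a factor,
# but only describes length and type for a character vector
summary(as_fct$Declawed)
summary(as_chr$Declawed)

# table() gives the counts for either type
table(as_chr$Declawed)
```

If summary() on a text column reports Length, Class and Mode instead of counts, the column was most likely read as character.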

What to focus on?

# Note that variable 'Declawed' is the main variable of interest
# with three possible outcomes
summary(cat_claw$Declawed)
##  Both Front  None 
##     8    92   100
# convert the dates (Date.Of.Birth and Intake.Date) to the same Date format
cat_claw$Date.Of.Birth <- as.Date(cat_claw$Date.Of.Birth, format='%m/%d/%Y')
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2017c.1.0/zoneinfo/Europe/Podgorica'
cat_claw$Intake.Date <- as.Date(cat_claw$Intake.Date, format='%m/%d/%Y')
# How does it look?
# check the data
glimpse(cat_claw)
## Observations: 200
## Variables: 20
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name             <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
## $ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
## $ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed                <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
## $ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason                  <fctr> , , Too Many Pets, Too Many Pets, , N...
## $ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...
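
The date conversion above only works when the format string matches the text. A small sketch with made-up date strings in the same month/day/year layout shows two behaviours worth knowing: as.Date() ignores whatever follows the part matched by the format (such as the time of day in Intake.Date), and it returns NA rather than an error when the text does not match. The vector names are illustrative only.

```r
# Made-up examples in the same layout as the shelter data
dob_txt    <- c("9/18/1999", "4/10/1998")
intake_txt <- c("03/18/2000 00:14:00", "04/06/2000 00:45:00")

dob    <- as.Date(dob_txt,    format = "%m/%d/%Y")
intake <- as.Date(intake_txt, format = "%m/%d/%Y")  # trailing time is ignored

dob
intake

# A format that does not match the text gives NA rather than an error
wrong <- as.Date(dob_txt, format = "%Y-%m-%d")
wrong

# Subtracting two Date vectors gives a 'difftime' measured in days
age_days <- intake - dob
as.numeric(age_days)
```

Checking for unexpected NA values with sum(is.na(...)) after a conversion is a cheap way to catch a wrong format string.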

How old are the cats?

# calculate age in days
cat_claw$diff_in_days <- cat_claw$Intake.Date - cat_claw$Date.Of.Birth
summary(cat_claw$diff_in_days) # summary for class type: 'difftime'
##   Length    Class     Mode 
##      200 difftime  numeric
# summary for diff_in_days as numeric (does everything seem ok?)
summary(as.numeric(cat_claw$diff_in_days))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -322     122     730    1070    1822    5114
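# Identify 'incorrect' observation(s)
# locate the observation with the smallest (negative) diff_in_days
# note: which.min() returns one position only, even if several values are negative
ind <- which.min(as.numeric(cat_claw$diff_in_days))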
ind
## [1] 30
# remove the observation with intake date before date of birth
# save it as a new data set
cat <- cat_claw[-ind,]
# replace empty strings with NA
cat$Animal.Name[cat$Animal.Name == ""] <- NA
cat$Distinguishing.Markings[cat$Distinguishing.Markings == ""] <- NA
cat$Reason[cat$Reason == ""] <- NA
# Check the data
glimpse(cat)
## Observations: 199
## Variables: 21
## $ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name             <fctr> HARLEY, TRUCKER, NA, NA, NA, PUDDY TA...
## $ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
## $ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed                <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
## $ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason                  <fctr> NA, NA, Too Many Pets, Too Many Pets,...
## $ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...
## $ diff_in_days            <time> 182 days, 727 days, 90 days, 61 days,...
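
The clean-up above locates one suspicious record with which.min() and blanks out three columns by hand. One possible generalisation, sketched here on an invented data frame (column names borrowed from the shelter data, values made up), finds every negative age with which() and replaces empty strings in all factor columns in one step.

```r
# Invented records; rows 2 and 4 have an intake date before the date of birth
toy <- data.frame(
  Animal.Name  = factor(c("HARLEY", "", "MOP", "")),
  Reason       = factor(c("", "Moving", "", "Allergies")),
  diff_in_days = c(182, -322, 90, -5)
)

# Positions of ALL negative ages (which.min() would return only one of them)
bad <- which(toy$diff_in_days < 0)
bad
length(bad)

# Drop them only if any were found: toy[-integer(0), ] would drop every row
clean <- if (length(bad) > 0) toy[-bad, ] else toy

# Replace "" with NA in every factor column, then drop the unused "" level
is_fct <- vapply(clean, is.factor, logical(1))
clean[is_fct] <- lapply(clean[is_fct], function(x) {
  x[x == ""] <- NA
  droplevels(x)
})

clean
```

The guard around the negative index matters: indexing with an empty negative vector returns no rows at all, which is an easy way to lose a whole data set.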

What to plot?

Plot Age vs Declawed using Boxplot

boxplot(as.numeric(diff_in_days) ~ Declawed, data = cat, horizontal = TRUE)
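
A boxplot draws a five-number summary for each group. As a rough sketch of what the plot encodes, the same numbers can be computed directly; the small data frame below is invented for illustration and is not the shelter data.

```r
# Invented ages (in days) for two declaw groups
toy <- data.frame(
  Declawed     = factor(c("None", "None", "None", "Front", "Front", "Front")),
  diff_in_days = c(100, 400, 700, 365, 730, 1095)
)

# Age in years, rounded to two decimals
toy$age_years <- round(toy$diff_in_days / 365, 2)

# Median per group: the thick line inside each box
med <- tapply(toy$age_years, toy$Declawed, median)
med

# Mean per group: not drawn by boxplot(), but useful to compare with the median
avg <- tapply(toy$age_years, toy$Declawed, mean)
avg

# Minimum, lower hinge, median, upper hinge, maximum per group
tapply(toy$age_years, toy$Declawed, fivenum)
```

Comparing the mean with the median per group gives a quick hint of skew before any plot is drawn.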

Can we make it more attractive looking?

Create graph using ggplot

# The image_graph() function opens a new graphics device similar to e.g. png() or x11().
# It returns an image object to which the plot(s) will be written
fig <- image_graph(width = 600, height = 600, res = 96)

# plot Age (in years) vs Declawed and keep the plot as an object
# (fun and linewidth replace the older fun.y and size arguments in current ggplot2)
p <- ggplot(cat, aes(Declawed, round(as.numeric(diff_in_days) / 365, 2))) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(position = position_jitter(width = 0.30), shape = 20, size = 3, aes(colour = Declawed), alpha = 0.75) +
  stat_summary(fun = mean, shape = 23, size = 3, fill = "orange", colour = "black", geom = "point") +
  labs(title = "Cats: Age vs Declawed", x = "Declawed", y = "Age (years)") +
  theme(panel.border = element_rect(fill = NA, colour = "black", linewidth = 2)) +
  theme(plot.title = element_text(size = 20, vjust = 2))
# draw the plot on the open device
print(p)
# save it as an image; ggsave() is called on its own, not added to the plot with +
ggsave('~/Documents/my_R/RLadiesMNE/ggplot_image.png', plot = p)
## Saving 6.25 x 6.25 in image

Adding an animation to a graph:

Read gif and background files

# read cat gif file
cat_gif <- image_read("http://media.giphy.com/media/q0ujUmppx3Fu0/giphy.gif")
# Background image
graph_bg <- image_read("~/Documents/my_R/RLadiesMNE/ggplot_image.png")
background <- image_background(image_scale(graph_bg, "650"), "white", flatten = TRUE)
# Combine and flatten frames
frames <- image_apply(cat_gif, function(frame) {
  image_composite(background, frame, offset = "+410+10")
})
# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)
##   format width height colorspace filesize
## 1    gif   650    650       sRGB        0
## 2    gif   650    650       sRGB        0
## 3    gif   650    650       sRGB        0
## 4    gif   650    650       sRGB        0
## 5    gif   650    650       sRGB        0
## 6    gif   650    650       sRGB        0
## 7    gif   650    650       sRGB        0
## 8    gif   650    650       sRGB        0

Happy Plotting!

Copyright © www.DataTeka.com