Why R?

R-Ladies: www.datateka.com
Dec 2017

Special Thanks

https://startit.rs

and

Katarina Kosmina

Please answer our survey at: https://tatjanakec.shinyapps.io/RLadies_Form

What is R?

  • The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.

  • R is a comprehensive, free and open-source language for data analysis (released under the GNU General Public License), with no licensing costs associated with it.

  • R runs on all major platforms (Windows, macOS and Linux), so it is widely applicable and simple to integrate into existing IT structures.

You can download R for free from:
https://cran.r-project.org/

and RStudio, a free and open-source integrated development environment (IDE) for R
https://www.rstudio.com/

Data Science

  • The vast amount of data now being collected, and the growing attention paid to it, has given rise to the new discipline of data science.

  • Growing data volumes, and the demand for knowledge and insights that are easy to extract from data and easy to understand, are the driving force behind data science.

  • With the explosion of “Big Data” problems, data science has become a very hot field in many scientific areas as well as marketing, finance, and other business and social study disciplines. Hence, there is a growing demand for business and social scientific researchers with statistical, modelling and computing skills.

  • We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship, improve the human condition, and create commercial and social value.

  • The field of data science is emerging at the intersection of statistics, computer science and design. R provides a great platform for this multidisciplinary work. It is incredibly powerful, and as such it should be the first language you learn for data manipulation, data analysis and visualisation if you want to move towards data science.

Data Science skills are in ever increasing demand!

Why R?

  • The R system has an extensive library of packages that offer state-of-the-art capabilities.
    Many of the analyses they offer are not available in any of the standard statistical packages.

  • The functionalities of:

    • data manipulation,
    • data analysis and
    • visualisation

    implemented in R are incomparable.

  • R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages.

  • R enables easy experimentation and exploration, which improves data analysis.

  • R supports reporting modern data analyses in a reproducible manner, which makes an analysis more useful to others because the data and code that produced it can be made available.

R Community

“The R community is one of R's best features!”

Source: Revolutions (daily news about using open source R)

  • Supported by the R Foundation for Statistical Computing, and with the strong and open engagement of developers and users from all walks of life, from science to commerce, it is hard to envisage any commercial corporation developing a sustainable business model with the same innovative drive and power as the R community.

  • The collaboration among statisticians and other scientists engaged in statistical computing, together with the growing interest and engagement of large companies, creates an altruistic R community. That community is the force with which R is conquering the field of data analytics. As a result, R becomes a more powerful resource that is more usable and attractive to data scientists and analysts.

R Community: list of resources

ROpenSci: “R community is not just for 'power users' or developers. It's a place for users and people interested in learning more about R”. It provides a list of useful links:

#rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter

R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters

Local R meetup groups — a google search may show that there's one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable

Rweekly — an incredible weekly recap of all things R

R-bloggers — an awesome resource to find posts from many different bloggers using R

DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research

Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)

Who uses R?

Some of the major domains using R include:

How to practice?

kaggle www.kaggle.com/

Platform for data science competitions

~.~.~.~.~.~.~.~

texata http://www.texata.com/

Annual competition for students and professionals to develop and test their Big Data Analytics and Data Science skills against friends, colleagues and top professionals around the world

~.~.~.~.~.~.~.~

R for Data Science by Garrett Grolemund & Hadley Wickham: http://r4ds.had.co.nz/

~.~.~.~.~.~.~.~

Efficient R programming by Colin Gillespie & Robin Lovelace: https://csgillespie.github.io/efficientR/

Tools needed in a typical data science project:

R for Data Science by Garrett Grolemund & Hadley Wickham

http://r4ds.had.co.nz/index.html

Real Example

Does declawing (onychectomy) cause harm to cats? Analyzing 17 years' worth of shelter admissions data.

  • The dataset captures specifics about the individual cat (declawed status, age, breed, coat color, etc.) as well as the primary reason for admission. Some of the admission reasons are unconnected to the animal (e.g., moving, can’t afford pet, allergies) — but some reasons are based on problematic behaviors exhibited by the cat (e.g., house-soiling, aggressive to other animals, aggressive to people). Available to us is a CSV file containing 200 sample records.

Install and load packages and data

# The tidyverse is a collection of R packages designed for data science
# Install the complete tidyverse with
# install.packages("tidyverse")
# load the core tidyverse packages with
library(tidyverse)
# Install and load the ggplot2 package: grammar of graphics
# (already attached by library(tidyverse); loaded explicitly here for clarity)
# install.packages("ggplot2")
library(ggplot2)
# Install and load the magick package for advanced image processing in R
# install.packages("magick")
library(magick)
# load the data saved on your computer
# cat_claw <- read.csv("declawing_data_sample.csv", stringsAsFactors = TRUE)
# or load the data directly from the website
# stringsAsFactors = TRUE keeps text columns as factors, as in the output that follows
# (this was the default before R 4.0.0)
cat_claw <- read.csv(url("http://www.declawing-project.org/wp-content/uploads/2017/08/declawing_data_sample.csv"), stringsAsFactors = TRUE)

Have a look at the data: head()

# let us look at the first three rows of the data
head(cat_claw, n = 3)
  Animal.ID Animal.Name Species Gender Date.Of.Birth      Primary.Breed
1   1032415      HARLEY     Cat      M     9/18/1999 Domestic Shorthair
2   1032962     TRUCKER     Cat      M     4/10/1998 Domestic Shorthair
3   1033799                 Cat      M      2/2/2000  Domestic Longhair
  Secondary.Breed Declawed Distinguishing.Markings Purebred BodyWeight
1             Mix     None                                0          0
2             Mix     None                                0          0
3             Mix     None                                0          2
  BodyWeightUnit PrimaryColor SecondaryColor ColorPattern
1           <NA>        Black          White         <NA>
2           <NA>         Grey           <NA>        Tiger
3          pound        Black           <NA>         <NA>
          Intake.Date              Intake.Type Intake.Subtype
1 03/18/2000 00:14:00 Owner/Guardian Surrender       Schedule
2 04/06/2000 00:45:00                    Stray        Walk In
3 05/02/2000 00:37:00 Owner/Guardian Surrender        Walk In
         Reason Reason.Category
1                          <NA>
2                          <NA>
3 Too Many Pets   Owner problem

Have a look at the structure of the data: str()

# look at the structure of the data
str(cat_claw)
'data.frame':   200 obs. of  20 variables:
 $ Animal.ID              : int  1032415 1032962 1033799 1033965 1038328 1048494 1052572 1053299 1054811 1057979 ...
 $ Animal.Name            : Factor w/ 183 levels "","Abigail","ALEXANDER",..: 63 176 1 1 1 138 116 23 52 163 ...
 $ Species                : Factor w/ 1 level "Cat": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gender                 : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 2 1 1 ...
 $ Date.Of.Birth          : Factor w/ 195 levels "1/14/2010","1/19/1999",..: 176 71 47 68 163 65 171 177 194 73 ...
 $ Primary.Breed          : Factor w/ 7 levels "Domestic Longhair",..: 3 3 1 1 3 3 3 3 3 3 ...
 $ Secondary.Breed        : Factor w/ 5 levels "","Domestic Shorthair",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Declawed               : Factor w/ 3 levels "Both","Front",..: 3 3 3 3 3 2 3 2 3 3 ...
 $ Distinguishing.Markings: Factor w/ 27 levels "","Black spot on snout",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Purebred               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BodyWeight             : num  0 0 2 0 0 0 0 12 0 0 ...
 $ BodyWeightUnit         : Factor w/ 1 level "pound": NA NA 1 NA NA NA NA 1 NA NA ...
 $ PrimaryColor           : Factor w/ 8 levels "Beige","Black",..: 2 6 2 7 2 3 2 8 8 7 ...
 $ SecondaryColor         : Factor w/ 14 levels "Beige","Black",..: 14 NA NA 14 14 3 NA 2 2 NA ...
 $ ColorPattern           : Factor w/ 10 levels "Brindle","Calico",..: NA 7 NA NA NA 7 9 NA NA 7 ...
 $ Intake.Date            : Factor w/ 199 levels "01/06/2006 00:57:00",..: 34 45 62 64 119 77 137 150 169 21 ...
 $ Intake.Type            : Factor w/ 4 levels "Owner/Guardian Surrender",..: 1 3 1 1 3 1 1 1 1 1 ...
 $ Intake.Subtype         : Factor w/ 6 levels "Abandoned","Animal Control",..: 4 6 6 6 6 6 6 6 6 6 ...
 $ Reason                 : Factor w/ 25 levels "","Abandoned",..: 1 1 24 24 1 14 3 3 23 21 ...
 $ Reason.Category        : Factor w/ 3 levels "Behavior other",..: NA NA 3 3 NA 3 3 3 2 3 ...
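
The `Factor w/ 3 levels ... : 3 3 3 ...` lines in the `str()` output can be puzzling at first. A minimal sketch of what they mean, rebuilding the first ten `Declawed` values from the codes printed above (the object name `declawed` is introduced here for illustration only): a factor stores each value as an integer code that points into its vector of levels.

```r
# First ten Declawed values, reconstructed from the codes 3 3 3 3 3 2 3 2 3 3
declawed <- factor(c("None", "None", "None", "None", "None",
                     "Front", "None", "Front", "None", "None"),
                   levels = c("Both", "Front", "None"))

levels(declawed)        # the distinct categories, in level order
as.integer(declawed)    # the integer codes that str() prints
as.character(declawed)  # back to the original labels
table(declawed)         # counts per level, including levels with no cases
```

Because "Both" does not occur among these ten values, `table()` still reports it with a count of zero, which is one practical reason for keeping categorical columns as factors.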

Do it in a tidy way: glimpse()

# the previous output was messy as it didn't fit on the slide.
# we want to look at the structure of the data, see as much data
# as possible, and identify the data type of each variable
glimpse(cat_claw)
Observations: 200
Variables: 20
$ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name             <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
$ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth           <fctr> 9/18/1999, 4/10/1998, 2/2/2000, 3/7/2...
$ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed                <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
$ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date             <fctr> 03/18/2000 00:14:00, 04/06/2000 00:45...
$ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason                  <fctr> , , Too Many Pets, Too Many Pets, , N...
$ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...

What to focus on?

# Note that variable 'Declawed' is the main variable of interest
# with three possible outcomes
summary(cat_claw$Declawed)
 Both Front  None 
    8    92   100 
# convert the dates (date of birth and intake date) to the same Date format
library(lubridate)
cat_claw$Date.Of.Birth <- as.Date(cat_claw$Date.Of.Birth, format='%m/%d/%Y')
cat_claw$Intake.Date <- as.Date(cat_claw$Intake.Date, format='%m/%d/%Y')
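
To see what the `format` string does, here is a small sketch using the date strings from the first three rows printed by `head()`. The vector names `dob` and `intake` are assumptions made for this illustration and are not part of the original analysis.

```r
# Sample strings copied from the head() output
dob    <- c("9/18/1999", "4/10/1998", "2/2/2000")
intake <- c("03/18/2000 00:14:00", "04/06/2000 00:45:00", "05/02/2000 00:37:00")

# %m = month, %d = day, %Y = four-digit year
dob_date <- as.Date(dob, format = "%m/%d/%Y")

# as.Date() reads only as far as the format requires,
# so the trailing time of day is ignored
intake_date <- as.Date(intake, format = "%m/%d/%Y")

class(dob_date)      # "Date"
format(dob_date)     # ISO style: year-month-day
format(intake_date)

# a string that does not match the format gives NA rather than an error
as.Date("18.09.1999", format = "%m/%d/%Y")
```

Checking for unexpected `NA` values after the conversion (for example with `sum(is.na(...))`) is a cheap way to catch rows whose dates were written in a different style.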

How does it look?

# check the data
glimpse(cat_claw)
Observations: 200
Variables: 20
$ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name             <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
$ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
$ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed                <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
$ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
$ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason                  <fctr> , , Too Many Pets, Too Many Pets, , N...
$ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...

How old are the cats?

# calculate age in days
cat_claw$diff_in_days <- cat_claw$Intake.Date - cat_claw$Date.Of.Birth
summary(cat_claw$diff_in_days) # summary for class type: 'difftime'
  Length    Class     Mode 
     200 difftime  numeric 
# summary for diff_in_days as numeric (does everything seem ok?)
summary(as.numeric(cat_claw$diff_in_days))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -322     122     730    1070    1822    5114 
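
A worked example for a single cat may help to read these numbers. The sketch below uses the dates of the first row (HARLEY); the object names are illustrative assumptions, and dividing by 365 is only an approximation of a year.

```r
dob    <- as.Date("1999-09-18")
intake <- as.Date("2000-03-18")

age <- intake - dob                      # subtracting Dates gives a 'difftime'
class(age)

age_days  <- as.numeric(age, units = "days")   # plain number of days
age_years <- round(age_days / 365, 2)          # approximate age in years

age_days
age_years

# an intake date before the date of birth yields a negative age,
# which points to a data-entry problem
bad_intake <- as.Date("1999-01-01")
as.numeric(bad_intake - dob, units = "days") < 0
```

The result of 182 days agrees with the first value of `diff_in_days` shown later in the `glimpse()` output.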

Identify 'incorrect' observation(s)

# identify the observation with the smallest (negative) diff_in_days;
# note that which.min() returns a single index only
ind <- which.min(as.numeric(cat_claw$diff_in_days))
ind
[1] 30
# remove the observation with intake date before date of birth
# save it as a new data set
cat <- cat_claw[-ind,]
# replace empty spaces with NA
cat$Animal.Name[cat$Animal.Name == ""] <- NA
cat$Distinguishing.Markings[cat$Distinguishing.Markings == ""] <- NA
cat$Reason[cat$Reason == ""] <- NA
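
`which.min()` returns only the position of the single smallest value. If more than one record could have a negative age, `which()` with a logical condition finds all of them and also answers the "how many?" question. The following is a sketch on made-up numbers, not on the shelter data:

```r
# Made-up ages in days: two are negative and one is missing
age_days <- c(182, -322, 90, NA, -5, 727)

which.min(age_days)          # position of the single smallest value
neg <- which(age_days < 0)   # positions of all negative values (NA is skipped)
neg
length(neg)                  # how many negative values there are

# Drop every flagged element; guard against the empty case,
# because x[-integer(0)] would select nothing at all
cleaned <- if (length(neg) > 0) age_days[-neg] else age_days
cleaned
```

The same guard matters for data frames: `df[-integer(0), ]` returns zero rows, so removing "no bad rows" without the check would silently empty the data set.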

Check the data

glimpse(cat)
Observations: 199
Variables: 21
$ Animal.ID               <int> 1032415, 1032962, 1033799, 1033965, 10...
$ Animal.Name             <fctr> HARLEY, TRUCKER, NA, NA, NA, PUDDY TA...
$ Species                 <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
$ Gender                  <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
$ Date.Of.Birth           <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
$ Primary.Breed           <fctr> Domestic Shorthair, Domestic Shorthai...
$ Secondary.Breed         <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
$ Declawed                <fctr> None, None, None, None, None, Front, ...
$ Distinguishing.Markings <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ Purebred                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ BodyWeight              <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
$ BodyWeightUnit          <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
$ PrimaryColor            <fctr> Black, Grey, Black, Orange, Black, Br...
$ SecondaryColor          <fctr> White, NA, NA, White, White, Brown, N...
$ ColorPattern            <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
$ Intake.Date             <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
$ Intake.Type             <fctr> Owner/Guardian Surrender, Stray, Owne...
$ Intake.Subtype          <fctr> Schedule, Walk In, Walk In, Walk In, ...
$ Reason                  <fctr> NA, NA, Too Many Pets, Too Many Pets,...
$ Reason.Category         <fctr> NA, NA, Owner problem, Owner problem,...
$ diff_in_days            <time> 182 days, 727 days, 90 days, 61 days,...

Plot Age vs Declawed using Boxplot

boxplot(as.numeric(diff_in_days) ~ Declawed, data = cat, horizontal = TRUE)
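
The numbers behind a boxplot can also be computed directly. Below is a small sketch on made-up ages (the data frame `toy` and its values are invented for illustration and are not taken from the shelter data):

```r
# Made-up ages (in days), three cats per group
toy <- data.frame(
  Declawed = factor(c("None", "None", "None", "Front", "Front", "Front")),
  age_days = c(60, 120, 900, 400, 1500, 2000)
)

# median age per group: the thick line inside each box
med <- tapply(toy$age_days, toy$Declawed, median)
med

# five-number summary for one group:
# minimum, lower hinge, median, upper hinge, maximum
front_summary <- fivenum(toy$age_days[toy$Declawed == "Front"])
front_summary
```

The box in a boxplot spans the lower and upper hinges, so comparing these summaries across groups gives the same picture as the plot in numeric form.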


Can we make it more attractive looking?

Create graph using ggplot

# The image_graph() function opens a new graphics device similar to e.g. png() or x11().
# It returns an image object to which the plot(s) will be written
fig <- image_graph(width = 600, height = 600, res = 96)

# plot Age (in years) vs Declawed and keep the plot as an object
p <- ggplot(cat, aes(Declawed, round(as.numeric(diff_in_days) / 365, 2))) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(position = position_jitter(width = 0.30), shape = 20, size = 3, aes(colour = Declawed), alpha = 0.75) +
  # fun and linewidth replace the deprecated fun.y and size (ggplot2 >= 3.4.0)
  stat_summary(fun = mean, shape = 23, size = 3, fill = "orange", col = "black", geom = "point") +
  labs(title = "Cats: Age vs Declawed", x = "Declawed", y = "Age (years)") +
  theme(panel.border = element_rect(fill = NA, colour = "black", linewidth = 2)) +
  theme(plot.title = element_text(size = 20, vjust = 2))
print(p)  # draw the plot on the image_graph() device

# save the plot as an image; ggsave() is called on its own, not added with +
ggsave('~/Documents/my_R/RLadiesMNE/ggplot_image.png', plot = p)
dev.off()  # close the image_graph() device

Adding an animation to a graph: Read gif and background files

# read cat gif file
cat_gif <- image_read("http://media.giphy.com/media/q0ujUmppx3Fu0/giphy.gif")  
#
# Background image
graph_bg <- image_read("~/Documents/my_R/RLadiesMNE/ggplot_image.png")
background <- image_background(image_scale(graph_bg, "650"), "white", flatten = TRUE)

Adding an animation to a graph

# Combine and flatten frames
frames <- image_apply(cat_gif, function(frame) {
  image_composite(background, frame, offset = "+410+10")
})

# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)
  format width height colorspace filesize
1    gif   650    650       sRGB        0
2    gif   650    650       sRGB        0
3    gif   650    650       sRGB        0
4    gif   650    650       sRGB        0
5    gif   650    650       sRGB        0
6    gif   650    650       sRGB        0
7    gif   650    650       sRGB        0
8    gif   650    650       sRGB        0

Happy Plotting!