The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.
R is a comprehensive, free and open-source language for data analysis, distributed under the GNU General Public License, with no licensing costs associated with it.
R runs on all major platforms (Windows, macOS and Linux), which makes it widely applicable and simple to integrate into existing IT structures.
You can download R for free from:
https://cran.r-project.org/
You can also download RStudio, a free and open-source integrated development environment (IDE) for R, from:
https://www.rstudio.com/
The vast and growing amount of data that we have begun to take notice of has given rise to the new discipline of data science.
Its motivating force is the need to cope with ever larger volumes of data and to present the knowledge and insights extracted from them in an easily understandable form.
With the explosion of “Big Data” problems, data science has become a very hot field in many scientific areas as well as marketing, finance, and other business and social study disciplines. Hence, there is a growing demand for business and social scientific researchers with statistical, modelling and computing skills.
We can now identify patterns and regularities in data of all sorts that allow us to advance scholarship, improve the human condition, and create commercial and social value.
The field of data science is emerging at the intersection of the fields of statistics, computer science and design. R provides a great platform for this multidisciplinary work. It is incredibly powerful, which makes it a natural first language to learn for data manipulation, data analysis and visualisation if you want to move towards data science.
The R system has an extensive library of packages that offer state-of-the-art capabilities.
Many of the analyses that they offer are not available in any of the standard statistical software packages.
R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages.
R enables easy experimentation and exploration, which improves data analysis.
R supports reporting modern data analyses in a reproducible manner, which makes an analysis more useful to others because the data and the code that actually produced it can be made available.
“The R community is one of R’s best features!” (Revolutions: daily news about using open source R)
The R community is supported by the R Foundation for Statistical Computing and by the strong, open engagement of developers and users from all walks of life, from science to commerce. It is hard to envisage any commercial corporation developing a sustainable business model with the same innovative drive and power as the R community.
The collaboration amongst statisticians and other scientists engaged in statistical computing, together with the growing interest and engagement of large companies, creates an altruistic R community, and this community is the force with which R is conquering the field of data analytics. As a result, R becomes a more powerful resource that is more usable and attractive to data scientists and analysts.
rOpenSci: “R community is not just for ‘power users’ or developers. It’s a place for users and people interested in learning more about R”. rOpenSci also provides a list of useful links:
#rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
Local R meetup groups — a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
Rweekly — an incredible weekly recap of all things R
R-bloggers — an awesome resource to find posts from many different bloggers using R
DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research
Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)
Some of the major domains using R include:
Top companies using R are:
Tools needed in a typical data science project:
R for Data Science by Garrett Grolemund & Hadley Wickham
Does declawing (onychectomy) cause harm to cats? Analyzing 17 years’ worth of shelter admissions data.
The dataset captures specifics about the individual cat (declawed status, age, breed, coat color, etc.) as well as the primary reason for admission. Some of the admission reasons are unconnected to the animal (e.g., moving, can’t afford pet, allergies), but others are based on problematic behaviors exhibited by the cat (e.g., house-soiling, aggressive to other animals, aggressive to people). Available to us is a CSV file containing 200 sample records.
Cat_Data
# Install and load packages and data
# The tidyverse is a collection of R packages designed for data science
# Install the complete tidyverse with
# install.packages("tidyverse")
# load the complete tidyverse with
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
# Install and load the ggplot2 package: grammar of graphics
# (ggplot2 is already attached by library(tidyverse); loading it again is harmless)
# install.packages("ggplot2")
library(ggplot2)
# Install and load the magick package for advanced image processing in R
# install.packages("magick")
library(magick)
## Linking to ImageMagick 6.9.9.9
## Enabled features: cairo, fontconfig, freetype, fftw, pango, rsvg, webp
## Disabled features: ghostscript, lcms, x11
# load the data saved on your computer
# cat_claw <- read.csv("declawing_data_sample.csv", stringsAsFactors = TRUE)
# or load the data directly from the website
# (stringsAsFactors = TRUE reproduces the factor columns shown below; since R 4.0.0 the default is FALSE)
cat_claw <- read.csv(url("http://www.declawing-project.org/wp-content/uploads/2017/08/declawing_data_sample.csv"), stringsAsFactors = TRUE)
# Have a look at the data: head()
# let us look at the first three rows of the data
head(cat_claw, n = 3)
## Animal.ID Animal.Name Species Gender Date.Of.Birth Primary.Breed
## 1 1032415 HARLEY Cat M 9/18/1999 Domestic Shorthair
## 2 1032962 TRUCKER Cat M 4/10/1998 Domestic Shorthair
## 3 1033799 Cat M 2/2/2000 Domestic Longhair
## Secondary.Breed Declawed Distinguishing.Markings Purebred BodyWeight
## 1 Mix None 0 0
## 2 Mix None 0 0
## 3 Mix None 0 2
## BodyWeightUnit PrimaryColor SecondaryColor ColorPattern
## 1 <NA> Black White <NA>
## 2 <NA> Grey <NA> Tiger
## 3 pound Black <NA> <NA>
## Intake.Date Intake.Type Intake.Subtype
## 1 03/18/2000 00:14:00 Owner/Guardian Surrender Schedule
## 2 04/06/2000 00:45:00 Stray Walk In
## 3 05/02/2000 00:37:00 Owner/Guardian Surrender Walk In
## Reason Reason.Category
## 1 <NA>
## 2 <NA>
## 3 Too Many Pets Owner problem
# Have a look at the structure of the data: str()
str(cat_claw)
## 'data.frame': 200 obs. of 20 variables:
## $ Animal.ID : int 1032415 1032962 1033799 1033965 1038328 1048494 1052572 1053299 1054811 1057979 ...
## $ Animal.Name : Factor w/ 183 levels "","Abigail","ALEXANDER",..: 63 176 1 1 1 138 116 23 52 163 ...
## $ Species : Factor w/ 1 level "Cat": 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 2 1 1 ...
## $ Date.Of.Birth : Factor w/ 195 levels "1/14/2010","1/19/1999",..: 176 71 47 68 163 65 171 177 194 73 ...
## $ Primary.Breed : Factor w/ 7 levels "Domestic Longhair",..: 3 3 1 1 3 3 3 3 3 3 ...
## $ Secondary.Breed : Factor w/ 5 levels "","Domestic Shorthair",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Declawed : Factor w/ 3 levels "Both","Front",..: 3 3 3 3 3 2 3 2 3 3 ...
## $ Distinguishing.Markings: Factor w/ 27 levels "","Black spot on snout",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Purebred : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BodyWeight : num 0 0 2 0 0 0 0 12 0 0 ...
## $ BodyWeightUnit : Factor w/ 1 level "pound": NA NA 1 NA NA NA NA 1 NA NA ...
## $ PrimaryColor : Factor w/ 8 levels "Beige","Black",..: 2 6 2 7 2 3 2 8 8 7 ...
## $ SecondaryColor : Factor w/ 14 levels "Beige","Black",..: 14 NA NA 14 14 3 NA 2 2 NA ...
## $ ColorPattern : Factor w/ 10 levels "Brindle","Calico",..: NA 7 NA NA NA 7 9 NA NA 7 ...
## $ Intake.Date : Factor w/ 199 levels "01/06/2006 00:57:00",..: 34 45 62 64 119 77 137 150 169 21 ...
## $ Intake.Type : Factor w/ 4 levels "Owner/Guardian Surrender",..: 1 3 1 1 3 1 1 1 1 1 ...
## $ Intake.Subtype : Factor w/ 6 levels "Abandoned","Animal Control",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ Reason : Factor w/ 25 levels "","Abandoned",..: 1 1 24 24 1 14 3 3 23 21 ...
## $ Reason.Category : Factor w/ 3 levels "Behavior other",..: NA NA 3 3 NA 3 3 3 2 3 ...
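The integers that `str()` prints after each `Factor` are level codes, not the values themselves. The sketch below uses a small made-up vector (an assumption for illustration, not the shelter data) to show how codes map back to labels. By default `factor()` sorts the levels alphabetically, which is why `None` appears as code 3 in the `Declawed` row above.

```r
# Toy vector for illustration only (not taken from the data set)
declawed <- factor(c("None", "None", "Front", "Both", "None"))

levels(declawed)                    # "Both" "Front" "None"
codes <- as.integer(declawed)       # 3 3 2 1 3
labels <- levels(declawed)[codes]   # map the codes back to their labels

identical(labels, as.character(declawed))   # TRUE
```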
# Do it in a tidy way: glimpse()
# The previous output was messy because it did not fit on the slide.
# glimpse() shows as much of the data as possible
# and identifies the data type of each variable
glimpse(cat_claw)
## Observations: 200
## Variables: 20
## $ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
## $ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth <fctr> 9/18/1999, 4/10/1998, 2/2/2000, 3/7/2...
## $ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
## $ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date <fctr> 03/18/2000 00:14:00, 04/06/2000 00:45...
## $ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason <fctr> , , Too Many Pets, Too Many Pets, , N...
## $ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
# Note that variable 'Declawed' is the main variable of interest
# with three possible outcomes
summary(cat_claw$Declawed)
## Both Front None
## 8 92 100
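Counts are often easier to read as shares. As a minimal sketch, the counts printed above are typed in by hand here (the vector name `declawed_counts` is hypothetical) and converted to proportions with `prop.table()`.

```r
# Counts as reported by summary(cat_claw$Declawed) above
declawed_counts <- c(Both = 8, Front = 92, None = 100)

# Divide each count by the total to get proportions
declawed_share <- prop.table(declawed_counts)
round(declawed_share, 2)
```

With the data frame loaded, `prop.table(table(cat_claw$Declawed))` should give the same shares without retyping the counts.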
# convert both dates (Date.Of.Birth and Intake.Date) to class Date so that they share the same format
cat_claw$Date.Of.Birth <- as.Date(cat_claw$Date.Of.Birth, format='%m/%d/%Y')
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2017c.1.0/zoneinfo/Europe/Podgorica'
cat_claw$Intake.Date <- as.Date(cat_claw$Intake.Date, format='%m/%d/%Y')
# How does it look?
# check the data
glimpse(cat_claw)
## Observations: 200
## Variables: 20
## $ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name <fctr> HARLEY, TRUCKER, , , , PUDDY TAT, MOP...
## $ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
## $ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> , , , , , , , , , , , , , , , , , , ,...
## $ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
## $ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason <fctr> , , Too Many Pets, Too Many Pets, , N...
## $ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
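The same format string works for both columns even though `Intake.Date` also carries a time. A small sketch with two raw strings copied from the `head()` output shows why: `as.Date()` reads only as much of the string as the format needs and ignores trailing characters. The last line uses a made-up string to show what happens when the format does not match.

```r
# Two raw strings in the formats shown by head() above
dob_raw    <- "9/18/1999"
intake_raw <- "03/18/2000 00:14:00"

dob    <- as.Date(dob_raw,    format = "%m/%d/%Y")
intake <- as.Date(intake_raw, format = "%m/%d/%Y")  # trailing time is ignored

format(dob)      # "1999-09-18"
format(intake)   # "2000-03-18"

# A string that does not match the format gives NA rather than an error
mismatch <- as.Date("18.09.1999", format = "%m/%d/%Y")
is.na(mismatch)  # TRUE
```

Checking `sum(is.na(...))` on a converted column is a quick way to spot rows that failed to parse.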
# calculate age in days
cat_claw$diff_in_days <- cat_claw$Intake.Date - cat_claw$Date.Of.Birth
summary(cat_claw$diff_in_days) # summary for class type: 'difftime'
## Length Class Mode
## 200 difftime numeric
# summary for diff_in_days as numeric (does everything seem ok?)
summary(as.numeric(cat_claw$diff_in_days))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -322 122 730 1070 1822 5114
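Subtracting two `Date` vectors gives a `difftime` object, which is why the first `summary()` call above reported only length, class and mode. The sketch below repeats the calculation for the first three cats, using the dates shown in the `glimpse()` output, and converts the result to plain numbers. Dividing by 365 is the same approximation for years that is used later for plotting.

```r
# Dates of the first three cats, as printed by glimpse() above
dob    <- as.Date(c("1999-09-18", "1998-04-10", "2000-02-02"))
intake <- as.Date(c("2000-03-18", "2000-04-06", "2000-05-02"))

age <- intake - dob                           # class "difftime", in days
age_days  <- as.numeric(age, units = "days")  # 182 727 90
age_years <- round(age_days / 365, 2)         # 0.50 1.99 0.25
```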
# Identify 'incorrect' observation(s)
# find the observation with the most negative diff_in_days
# (which.min() returns the index of the single smallest value only)
ind <- which.min(as.numeric(cat_claw$diff_in_days))
ind
## [1] 30
# remove the observation with intake date before date of birth
# save the result as a new data set
cat <- cat_claw[-ind,]
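When more than one record may be wrong, `which()` with a condition finds all of them, whereas `which.min()` finds only one. The following sketch uses made-up ages (an assumption for illustration, not the shelter data) with two negative values.

```r
# Made-up ages in days; two of them are negative on purpose
age_days <- c(182, -322, 90, -5, 730)

which.min(age_days)          # 2: only the single smallest value
bad <- which(age_days < 0)   # 2 4: every negative value
length(bad)                  # how many records are affected

# Keep only the rows with a non-negative age
toy <- data.frame(id = 1:5, age_days = age_days)
toy_clean <- toy[toy$age_days >= 0, ]
```

Filtering with a logical condition is also safer than `toy[-bad, ]`: if `bad` happens to be empty, the negative-index form returns a data frame with no rows at all.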
# replace empty spaces with NA
cat$Animal.Name[cat$Animal.Name == ""] <- NA
cat$Distinguishing.Markings[cat$Distinguishing.Markings == ""] <- NA
cat$Reason[cat$Reason == ""] <- NA
# Check the data
glimpse(cat)
## Observations: 199
## Variables: 21
## $ Animal.ID <int> 1032415, 1032962, 1033799, 1033965, 10...
## $ Animal.Name <fctr> HARLEY, TRUCKER, NA, NA, NA, PUDDY TA...
## $ Species <fctr> Cat, Cat, Cat, Cat, Cat, Cat, Cat, Ca...
## $ Gender <fctr> M, M, M, M, M, M, F, M, F, F, M, F, M...
## $ Date.Of.Birth <date> 1999-09-18, 1998-04-10, 2000-02-02, 2...
## $ Primary.Breed <fctr> Domestic Shorthair, Domestic Shorthai...
## $ Secondary.Breed <fctr> Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mi...
## $ Declawed <fctr> None, None, None, None, None, Front, ...
## $ Distinguishing.Markings <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Purebred <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BodyWeight <dbl> 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12....
## $ BodyWeightUnit <fctr> NA, NA, pound, NA, NA, NA, NA, pound,...
## $ PrimaryColor <fctr> Black, Grey, Black, Orange, Black, Br...
## $ SecondaryColor <fctr> White, NA, NA, White, White, Brown, N...
## $ ColorPattern <fctr> NA, Tiger, NA, NA, NA, Tiger, Tortois...
## $ Intake.Date <date> 2000-03-18, 2000-04-06, 2000-05-02, 2...
## $ Intake.Type <fctr> Owner/Guardian Surrender, Stray, Owne...
## $ Intake.Subtype <fctr> Schedule, Walk In, Walk In, Walk In, ...
## $ Reason <fctr> NA, NA, Too Many Pets, Too Many Pets,...
## $ Reason.Category <fctr> NA, NA, Owner problem, Owner problem,...
## $ diff_in_days <time> 182 days, 727 days, 90 days, 61 days,...
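One detail to keep in mind with factors: assigning `NA` changes the values but leaves the empty string in the list of levels. The sketch below uses a made-up vector of names (an assumption for illustration) and `droplevels()` to remove the unused level.

```r
# Made-up names; "" marks a missing value, as in the raw CSV
name <- factor(c("HARLEY", "", "TRUCKER", ""))

name[name == ""] <- NA
levels(name)               # "" is still listed as a level

name <- droplevels(name)   # drop levels that no longer occur
levels(name)               # "HARLEY" "TRUCKER"
sum(is.na(name))           # 2
```

An alternative is to treat empty strings as missing at import time, for example `read.csv(..., na.strings = c("", "NA"))`.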
Plot Age vs Declawed using Boxplot
boxplot(as.numeric(diff_in_days) ~ Declawed, data = cat, horizontal = TRUE)
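A boxplot compares the groups visually; the numbers behind it can be obtained with `tapply()`. The sketch below uses a tiny made-up data frame (the values are invented for illustration and are not results from the shelter data) to compute the median age in years per group.

```r
# Made-up ages (in days) and declaw status, for illustration only
toy <- data.frame(
  Declawed = factor(c("None", "None", "Front", "Front", "Both", "Both")),
  age_days = c(365, 1095, 1460, 2190, 2920, 3650)
)

# Median age in years for each level of Declawed
median_age <- tapply(toy$age_days / 365, toy$Declawed, median)
median_age
```

Applied to the cleaned data, the same call would be `tapply(as.numeric(cat$diff_in_days) / 365, cat$Declawed, median)`.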
Can we make it more attractive looking?
# The image_graph() function opens a new graphics device, similar to png() or x11().
# It returns an image object to which the plot(s) will be written.
fig <- image_graph(width = 600, height = 600, res = 96)
# Plot age (in years, rounded to 2 decimals) vs Declawed and keep the plot in p.
# Note: fun = mean needs ggplot2 >= 3.3.0; older versions call this argument fun.y.
p <- ggplot(cat, aes(Declawed, round(as.numeric(diff_in_days) / 365, 2))) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(position = position_jitter(width = 0.30), shape = 20, size = 3, aes(colour = Declawed), alpha = 0.75) +
  stat_summary(fun = mean, shape = 23, size = 3, fill = "orange", col = "black", geom = "point") +
  labs(title = "Cats: Age vs Declawed", x = "Declawed", y = "Age (years)") +
  theme(panel.border = element_rect(fill = NA, colour = "black", size = 2)) +
  theme(plot.title = element_text(size = 20, vjust = 2))
print(p)
# Save the plot as an image; ggsave() is a separate call, not a layer to add with `+`
ggsave('~/Documents/my_R/RLadiesMNE/ggplot_image.png', plot = p)
# Close the graphics device opened by image_graph()
dev.off()
## Saving 6.25 x 6.25 in image
Read gif and background files
# read cat gif file
cat_gif <- image_read("http://media.giphy.com/media/q0ujUmppx3Fu0/giphy.gif")
#
# Background image
graph_bg <- image_read("~/Documents/my_R/RLadiesMNE/ggplot_image.png")
background <- image_background(image_scale(graph_bg, "650"), "white", flatten = TRUE)
# Combine and flatten frames
frames <- image_apply(cat_gif, function(frame) {
  image_composite(background, frame, offset = "+410+10")
})
# Turn frames into animation
animation <- image_animate(frames, fps = 10)
print(animation)
## format width height colorspace filesize
## 1 gif 650 650 sRGB 0
## 2 gif 650 650 sRGB 0
## 3 gif 650 650 sRGB 0
## 4 gif 650 650 sRGB 0
## 5 gif 650 650 sRGB 0
## 6 gif 650 650 sRGB 0
## 7 gif 650 650 sRGB 0
## 8 gif 650 650 sRGB 0