RStudio Interface
Using the Iris flower data set
From Wikipedia, the free encyclopedia
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3]
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
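The linear discriminant idea described above can be tried directly in R. The following is only a minimal sketch, assuming the MASS package (a recommended package that ships with standard R installations) is available; lda() is one common implementation and is not necessarily the exact computation Fisher used. The model is evaluated on the same 150 flowers it was fitted on, so the accuracy is optimistic.

```r
# MASS is assumed to be installed; it is not used elsewhere in this tutorial
library(MASS)

# Fit a linear discriminant model using all four measurements
fit <- lda(Species ~ ., data = iris)

# Predict the species of the same 150 flowers (resubstitution)
pred <- predict(fit, iris)$class

# Confusion table: rows are the true species, columns the predicted species
confusion <- table(actual = iris$Species, predicted = pred)
confusion

# Share of flowers classified correctly
accuracy <- mean(pred == iris$Species)
accuracy
```

For an honest estimate of accuracy, the model would need to be checked on flowers that were held out of the fit.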
For more command reference resources, visit www.rdocumentation.org
dim - Retrieve or set the dimension of an object.
dim(iris)
## [1] 150 5
names - Functions to get (or set) the names of an object
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
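Both dim and names can also set values, not just retrieve them. A small sketch of the "set" form for names, working on a copy (the variable name flowers is only illustrative) so the built-in iris data set stays unchanged:

```r
# Work on a copy so the built-in data set is untouched
flowers <- iris

# Replace the column names with lower-case versions
names(flowers) <- tolower(names(flowers))
names(flowers)

# nrow() and ncol() return the two components of dim() separately
nrow(flowers)
ncol(flowers)
```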
To describe the structure of the iris data set, we can use the "str" command, which compactly displays the internal structure of an arbitrary R object. It is a diagnostic function and an alternative to "summary", which we will use shortly.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
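summary() describes each column over all 150 flowers. Since the data set contains three species, a per-species breakdown is often more informative. One possible sketch uses aggregate() from base R; the name species_means is only illustrative.

```r
# Mean of every numeric column, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

The same pattern works with other summary functions, for example FUN = median or FUN = sd.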
To view data within the data set, we can index the data frame by row. Below, we display rows 58 through 63 of the iris data set.
iris[58:63, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
Or we can display the first or last 5 rows using the "head" or "tail" commands, which are similar to the OS shell commands of the same names.
head(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
tail(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
We can also look at the values in one column. Here are the first five values of Sepal.Width.
iris[1:5, "Sepal.Width"]
## [1] 3.5 3.0 3.2 3.1 3.6
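There are several equivalent ways to pull values out of a single column. The sketch below compares three of them; the variable names are only illustrative.

```r
# Three equivalent ways to get the first five Sepal.Width values
by_index   <- iris[1:5, "Sepal.Width"]      # row and column index
by_dollar  <- iris$Sepal.Width[1:5]         # $ extracts the column as a vector
by_bracket <- iris[["Sepal.Width"]][1:5]    # [[ ]] extracts the column by name

# All three return the same numeric vector
identical(by_index, by_dollar) && identical(by_dollar, by_bracket)

# drop = FALSE keeps the result as a one-column data frame instead of a vector
one_column <- iris[1:5, "Sepal.Width", drop = FALSE]
class(one_column)
```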
Mean, median and range: mean(), median(), range()
Quartiles and percentiles: quantile()
range(iris$Sepal.Length)
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
## 10% 30% 65%
## 4.80 5.27 6.20
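The median and the interquartile range can be read off the quantiles shown above. A short sketch that confirms the relationship, using x as an illustrative shorthand for the column:

```r
x <- iris$Sepal.Length

# The median is the 50th percentile
median(x)
quantile(x, 0.5)

# The interquartile range is the distance between the 75th and 25th percentiles
IQR(x)
unname(diff(quantile(x, c(0.25, 0.75))))
```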
Calculate the Mean of the variable Sepal.Length
mean(iris$Sepal.Length)
## [1] 5.843333
Calculate the Standard Deviation for the variable Sepal.Length
sd(iris$Sepal.Length)
## [1] 0.8280661
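sd() computes the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. The following sketch reproduces the value by hand; manual_sd is only an illustrative name.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
manual_sd

# Agrees with the built-in function
all.equal(manual_sd, sd(x))
```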
As a rule of thumb, the following guidelines on the strength of a relationship, as measured by the correlation coefficient r, are often useful (though many experts would disagree somewhat on the choice of boundaries). Source: Explorable.
Value of r                        Strength of relationship
-1.0 to -0.5 or 1.0 to 0.5        Strong
-0.5 to -0.3 or 0.3 to 0.5        Moderate
-0.3 to -0.1 or 0.1 to 0.3        Weak
-0.1 to 0.1                       None or very weak
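The rule of thumb above can be written as a small helper function. This is only a sketch: strength_of_r is a hypothetical name, and because the ranges in the guidelines share their end points, the assumption made here is that a boundary value falls into the weaker category (for example, r = 0.5 counts as Moderate).

```r
# Hypothetical helper: label a correlation coefficient by its absolute size.
# Intervals are [0, 0.1], (0.1, 0.3], (0.3, 0.5] and (0.5, 1].
strength_of_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.1, 0.3, 0.5, 1),
      labels = c("None or very weak", "Weak", "Moderate", "Strong"),
      include.lowest = TRUE)
}

# Apply it to a few example values
example_r <- c(-0.7, -0.2, 0.05, 0.4)
data.frame(r = example_r, strength = strength_of_r(example_r))
```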
Determine whether there is a correlation between Sepal.Length and Petal.Width
cor(iris$Sepal.Length, iris$Petal.Width)
## [1] 0.8179411
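cor() needs numeric input, so it applies to the four measurement columns but not to the Species factor. As a sketch of two natural follow-ups, the code below computes all pairwise correlations at once and then uses cor.test() to check whether the correlation computed above differs significantly from zero. The variable names are illustrative.

```r
# Pairwise Pearson correlations for the four numeric columns
cor_matrix <- cor(iris[, 1:4])
round(cor_matrix, 2)

# Significance test for the Sepal.Length / Petal.Width correlation
ct <- cor.test(iris$Sepal.Length, iris$Petal.Width)
ct$estimate
ct$p.value
```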
Now, let's display the iris dataset in a series of plots to get a sense for the shape, dispersion and trend of the data.
plot(iris)
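For a data frame with several columns, plot() draws a scatterplot matrix. A possible variation, sketched here with base R's pairs(), restricts the matrix to the four numeric columns and colors the points by species. The color choices and the name species_colors are arbitrary.

```r
# One color per flower, chosen by species (setosa, versicolor, virginica)
species_colors <- c("red", "green3", "blue")[as.integer(iris$Species)]

# Scatterplot matrix of the four measurements, colored by species
pairs(iris[, 1:4], col = species_colors, pch = 19,
      main = "Iris measurements by species")
```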
Finally, let's visualize the data in a scatterplot with some color and magnitude.
library(ggplot2)
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
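qplot() is marked as deprecated in recent ggplot2 releases (3.4.0 and later), so it may be worth knowing the equivalent ggplot() form. The following sketch should produce essentially the same scatterplot, assuming ggplot2 is installed; p is only an illustrative name.

```r
library(ggplot2)

# Same mapping as the qplot() call: position, color by species,
# point size by petal width
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                      color = Species, size = Petal.Width)) +
  geom_point()
p
```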
library(ggplot2)
# Create Data
data <- data.frame(
group=LETTERS[1:5],
value=c(13,7,9,21,2)
)
# Basic piechart
ggplot(data, aes(x="", y=value, fill=group)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void() # remove background, grid, numeric labels
# install.packages(c("leaflet", "htmlwidgets"))  # run once if the packages are not yet installed
# Library
library(leaflet)
# Load example data (Fiji earthquakes) and keep only the first 100 rows
data(quakes)
quakes <- head(quakes, 100)
# Create a color palette with handmade bins.
mybins <- seq(4, 6.5, by=0.5)
mypalette <- colorBin( palette="YlOrBr", domain=quakes$mag, na.color="transparent", bins=mybins)
# Prepare the text for the tooltip:
mytext <- paste(
"Depth: ", quakes$depth, "<br/>",
"Stations: ", quakes$stations, "<br/>",
"Magnitude: ", quakes$mag, sep="") %>%
lapply(htmltools::HTML)
# Final Map
m <- leaflet(quakes) %>%
addTiles() %>%
setView( lat=-27, lng=170 , zoom=4) %>%
addProviderTiles("Esri.WorldImagery") %>%
addCircleMarkers(~long, ~lat,
fillColor = ~mypalette(mag), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
label = mytext,
labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
) %>%
addLegend( pal=mypalette, values=~mag, opacity=0.9, title = "Magnitude", position = "bottomright" )
m
library(ggplot2)
# create factors with value labels
mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am,levels=c(0,1), labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),labels=c("4cyl","6cyl","8cyl"))
# Kernel density plots for mpg
# grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5), main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
ylab="Density")
# Scatterplot of mpg vs. hp for each combination of gears and cylinders
# in each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am, facets=gear~cyl, size=I(3), xlab="Horsepower", ylab="Miles per Gallon")
# Separate regressions of mpg on weight for each number of cylinders
# ggplot() is used so that method and formula apply only to the smoothing layer
ggplot(mtcars, aes(wt, mpg, color=cyl)) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x) +
  labs(title="Regression of MPG on Weight",
       x="Weight", y="Miles per Gallon")
# Boxplots of mpg by number of gears
# observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),
fill=gear, main="Mileage by Gear Number",
xlab="", ylab="Miles per Gallon")