This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
The purpose of this notebook is to provide some examples and problems for students to explore the use of R in analyzing data. For this set of problems, we use the built-in dataset called ‘iris’, based on Ronald Fisher’s pioneering 1936 work on statistics in biology. It is a multivariate data set introduced in his paper, “The use of multiple measurements in taxonomic problems.”
Because it is included in the R distribution, you can get immediate help by entering:
?iris
The ? is a general-purpose tool for command-line help.
Here is an illustration of the iris attributes.
If you want to explore other included datasets, type
library(help = "datasets")
This will give you a list of all datasets included in the datasets package for R.
Nice! We have inline help that describes our dataset. This is available for all included datasets.
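The same listing can also be reached programmatically. The following is a minimal sketch that assumes only a standard R installation; the variable names `available` and `dataset_names` are illustrative and not part of the original notebook.

```r
# help("iris") is the function form of ?iris; run it interactively to open the help page

# data() returns an object whose $results matrix describes the datasets in a package
available <- data(package = "datasets")
dataset_names <- available$results[, "Item"]

# Check that iris is among the bundled datasets
"iris" %in% dataset_names
```

Having the names in a character vector makes it easy to search them, for example with `grep()`.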
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
First some information about the dataset.
dim(iris)
## [1] 150 5
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
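A few other base R calls complement dim, summary, and str. This is a small sketch; the names `column_classes` and `species_counts` are illustrative only.

```r
# First rows of the data frame
head(iris, 3)

# Class of each column: four numeric measurements and one factor
column_classes <- sapply(iris, class)
column_classes

# Number of observations per species
species_counts <- table(iris$Species)
species_counts
```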
Here is a quick look at the data graphically. It demonstrates both the built-in plot function and the ggplot library.
Note that the ggplot shows each species in a different color, so that plot provides more information.
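The code behind those two plots is not shown in this rendering. The following is a sketch of what such a comparison could look like, assuming the ggplot2 package is installed. Plotting sepal width against sepal length is an assumption here, not necessarily the pair of variables the original plots used, and `p_sepal` is an illustrative name.

```r
library(ggplot2)

# Base graphics: a single scatter plot with no species information
plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
     main = "Sepal width vs. sepal length",
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")

# ggplot2: the same points, colored by species
p_sepal <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
p_sepal
```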
The next plot shows regions of each species. Thanks to Yu Yang Liu for this code on Kaggle. https://www.kaggle.com/c34klh123/iris-data-with-ggplot-shiny/code
convexHull is a function that returns the rows of a data frame that form the convex hull of its (Sepal.Length, Sepal.Width) points. It is applied to each species in turn to generate the iris2 data frame.
Note: The “marginal points” are not in the region. They are just outside the region.
library(ggplot2)
# Return the rows of df that form the convex hull of its sepal measurements
convexHull <- function(df) df[chull(df$Sepal.Length, df$Sepal.Width), ]
# Apply convexHull to each species separately
iris2 <- plyr::ddply(iris, "Species", convexHull)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(data = iris, aes(color = Species)) +
  geom_polygon(data = iris2, alpha = .3, aes(Sepal.Length, Sepal.Width, fill = Species)) +
  theme(legend.position = "bottom", plot.title = element_text(size = 15, hjust = 0.5)) +
  annotate("segment", x = 6, xend = 5.8, y = 3.75, yend = 4, arrow = arrow(), color = "black") +
  annotate("segment", x = 6.2, xend = 6.2, y = 3.65, yend = 3.4, arrow = arrow(), color = "black") +
  annotate("segment", x = 6.1, xend = 6, y = 3.65, yend = 3.4, arrow = arrow(), color = "black") +
  annotate("text", x = 6.21, y = 3.72, label = "marginal points", color = "black", size = 3)
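To see what convexHull does for a single species, it can help to call chull directly. The sketch below uses only base R; the names `setosa`, `hull_idx`, and `hull_rows` are illustrative and do not appear in the original code.

```r
# Subset one species so the hull is computed on its points only
setosa <- iris[iris$Species == "setosa", ]

# chull() returns the row positions of the points that form the convex hull
hull_idx <- chull(setosa$Sepal.Length, setosa$Sepal.Width)

# Indexing with those positions keeps only the hull vertices,
# which is what convexHull() does for each species via ddply
hull_rows <- setosa[hull_idx, ]
nrow(hull_rows)  # number of vertices in the setosa polygon
```

Because chull returns the vertices in clockwise order, geom_polygon can connect the rows in the order given without any further sorting.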
It is your turn now. You should fill in each section with R code to answer the question.
Hint: try using the dplyr library operators like group_by, summarize, and %>% (pipe).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Start with this:
iris %>%
group_by(Species)
## # A tibble: 150 x 5
## # Groups: Species [3]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
#
# add summarize here
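As an illustration of how group_by and summarize fit together, here is a minimal sketch on the mtcars dataset rather than iris, so that the exercise itself is left open. It assumes the dplyr package is installed; the names `mpg_by_cyl`, `n`, and `mean_mpg` are illustrative.

```r
library(dplyr)

# Group the cars by number of cylinders, then collapse each group to one row
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))
mpg_by_cyl
```

The result has one row per group, which is the shape a per-species summary of iris would also take.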
This should produce six graphs that look something like the ggplot below, which I created using the mtcars dataset. First, the built-in plot function, which simply plots mpg vs. hp:
plot(x = mtcars$hp, y = mtcars$mpg, main = "Miles per gallon vs. horsepower", xlab = "Horsepower", ylab = "Miles per Gallon")
Here are similar plots using ggplot, which produces three side-by-side graphs with the same x and y axes:
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + facet_wrap(~cyl, nrow = 1)
Try using ggplot to place the graphs three across and two down, with the sepal measurements in the upper row of graphs and the petal measurements in the lower row.
# Put the code for your plot here.
Now with one of the ggplot renderings, add a linear regression line to the plot. For example:
# (by default includes 95% confidence region)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth(method = lm)
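For reference, the line and shaded region in that plot correspond to an ordinary least-squares fit of mpg on hp. The sketch below reproduces the fit in base R; the names `fit`, `new_hp`, and `band`, and the three horsepower values, are chosen here purely for illustration.

```r
# Ordinary least-squares fit of mpg on hp
fit <- lm(mpg ~ hp, data = mtcars)
coef(fit)

# Fitted values with a 95% confidence band for the mean response
new_hp <- data.frame(hp = c(100, 200, 300))
band <- predict(fit, newdata = new_hp, interval = "confidence", level = 0.95)
band
```

The `lwr` and `upr` columns of `band` give the lower and upper edges of the confidence region at each horsepower value.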
Add the linear regression line, with the confidence interval.
# Your code goes here:
Here is a pointer to dplyr and tidyr:
https://rpubs.com/bradleyboehmke/data_wrangling
# Your code goes here:
Here is an example:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
mtcars$am[which(mtcars$am == 0)] = 'Automatic'
mtcars$am[which(mtcars$am == 1)] = 'Manual'
mtcars$am = as.factor(mtcars$am)
p = plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am,
colors = c('#BF382A', '#0C4B8E')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'Gross horsepower'),
zaxis = list(title = '1/4 mile time')))
p
Generate a similar plot with the plot_ly function.
# Your code here
If you were considering a machine learning algorithm to predict iris species based on the measured attributes, for which species do you think the algorithm will do best?
Why?
With which two species is there a possibility of errors in prediction?
Why?
Nice web app for visualization https://yuyangliu.shinyapps.io/iris_result/
https://www.kaggle.com/c34klh123/iris-data-with-ggplot-shiny/notebook
http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
Nice 3d plotting libraries
Plotly - https://plot.ly/r/
Discussion of pipes https://www.r-bloggers.com/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/
Nice collection of ggplot graphs: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html