For Project #2, I took the Reed colleges student data who are enrolled since 2008 - 2015 per state and answer few of the questions as: - Which state has highest enrollment - How is the enrollment on year on year basis

So I started my analysis by data extraction from site using xpath, this helps me to extract the entire table in the most efficient way and store it in a temp table. Although it is recommended for making a csv file for the data, I find working directly on the dynamic site more challenging and a learning experience for me. Hope it is acceptable for grades.

library(rvest)
## Warning: package 'rvest' was built under R version 3.2.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.2.2
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
theurl <- "http://www.reed.edu/ir/geographic_states.html"
studentTable<-theurl%>%
  html() %>%
  html_node(xpath='//*[@id="mainContent"]/table')%>%
  html_table(fill=TRUE)
## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")

Cleaning the data is the next most important thing to do, for which tidyr and dplyr packages were very helpful, So I made the data as wide structure by first filtering the data table without Total rows and then gather the Years to be a column of the table. Now I need to make the Year and Enrolled field data properties relevant by coverting from factor to numeric and attach it with a data frame for easy manupulation.

studentTable<-filter(studentTable,State!='Total')
studentTable<-gather(studentTable,"Year","Enrolled",2:9)
studentTable$Enrolled<-as.numeric(studentTable$Enrolled)
studentTable$Year<-as.numeric(as.character(studentTable$Year))
Year<-c(studentTable$Year)
State<-c(studentTable$State)
Enrolled<-c(studentTable$Enrolled)
studentTable<-data.frame(State,Year,Enrolled)

Although rvest has group and summary like ready functions which can help in analysis, I always preferred SQLDF package because of my familiarity and convenient to use SQL, so I used SQLDF package to answer both of my questions as enrollment per year and enrollment per state.

yearEnrollment<- sqldf(c("select sum(Enrolled) As enrolledTot, Year
                  from studentTable group by Year order by Year desc"))
## Loading required package: tcltk
studentTable<-sqldf(c("select sum(Enrolled) As enrolledTot, State
                  from studentTable group by State order by enrolledTot desc"))

Final step is to Visualization of the data for which I used ggplot2 package

ggplot(data = studentTable, aes(x = State, y = enrolledTot, fill = State)) + geom_histogram(stat="identity") + ggtitle("State vs Enrollment") +
xlab("States") + ylab("Total Enrollment")

studentTable
##    enrolledTot           State
## 1          728      California
## 2          244 Foreign Schools
## 3          218          Oregon
## 4          194      Washington
## 5          187        New York
## 6          143   Massachusetts
## 7          124           Texas
## 8           71        Colorado
## 9           65        Illinois
## 10          63     Connecticut
## 11          58      New Jersey
## 12          54       Minnesota
## 13          52         Florida
## 14          44    Pennsylvania
## 15          42        Maryland
## 16          40         Arizona
## 17          37        Virginia
## 18          35   New Hampshire
## 19          35      New Mexico
## 20          31         Georgia
## 21          28        Missouri
## 22          28  North Carolina
## 23          28  Washington, DC
## 24          26           Idaho
## 25          25        Michigan
## 26          24          Hawaii
## 27          24       Wisconsin
## 28          23            Ohio
## 29          18            Iowa
## 30          18         Vermont
## 31          16       Tennessee
## 32          15          Alaska
## 33          15       Louisiana
## 34          15           Maine
## 35          14            Utah
## 36          13         Indiana
## 37          11          Nevada
## 38          11        Oklahoma
## 39          10         Montana
## 40          10    Rhode Island
## 41           7         Alabama
## 42           5          Kansas
## 43           5        Kentucky
## 44           5        Nebraska
## 45           5  South Carolina
## 46           3        Arkansas
## 47           3        Delaware
## 48           3     Mississippi
## 49           3    South Dakota
## 50           3   West Virginia
## 51           3         Wyoming
## 52           0    North Dakota
ggplot(data = yearEnrollment, aes(x = Year, y = enrolledTot, fill = Year)) + geom_histogram(stat="identity") + ggtitle("Year vs Enrollment") +
xlab("Year") + ylab("Total Enrollment")

yearEnrollment
##   enrolledTot Year
## 1         414 2015
## 2         347 2014
## 3         356 2013
## 4         320 2012
## 5         374 2011
## 6         373 2010
## 7         368 2009
## 8         330 2008