For Project #2, I took the Reed colleges student data who are enrolled since 2008 - 2015 per state and answer few of the questions as: - Which state has highest enrollment - How is the enrollment on year on year basis
So I started my analysis by data extraction from site using xpath, this helps me to extract the entire table in the most efficient way and store it in a temp table. Although it is recommended for making a csv file for the data, I find working directly on the dynamic site more challenging and a learning experience for me. Hope it is acceptable for grades.
library(rvest)
## Warning: package 'rvest' was built under R version 3.2.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.2.2
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
theurl <- "http://www.reed.edu/ir/geographic_states.html"
studentTable<-theurl%>%
html() %>%
html_node(xpath='//*[@id="mainContent"]/table')%>%
html_table(fill=TRUE)
## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")
Cleaning the data is the next most important thing to do, for which tidyr and dplyr packages were very helpful, So I made the data as wide structure by first filtering the data table without Total rows and then gather the Years to be a column of the table. Now I need to make the Year and Enrolled field data properties relevant by coverting from factor to numeric and attach it with a data frame for easy manupulation.
studentTable<-filter(studentTable,State!='Total')
studentTable<-gather(studentTable,"Year","Enrolled",2:9)
studentTable$Enrolled<-as.numeric(studentTable$Enrolled)
studentTable$Year<-as.numeric(as.character(studentTable$Year))
Year<-c(studentTable$Year)
State<-c(studentTable$State)
Enrolled<-c(studentTable$Enrolled)
studentTable<-data.frame(State,Year,Enrolled)
Although rvest has group and summary like ready functions which can help in analysis, I always preferred SQLDF package because of my familiarity and convenient to use SQL, so I used SQLDF package to answer both of my questions as enrollment per year and enrollment per state.
yearEnrollment<- sqldf(c("select sum(Enrolled) As enrolledTot, Year
from studentTable group by Year order by Year desc"))
## Loading required package: tcltk
studentTable<-sqldf(c("select sum(Enrolled) As enrolledTot, State
from studentTable group by State order by enrolledTot desc"))
Final step is to Visualization of the data for which I used ggplot2 package
ggplot(data = studentTable, aes(x = State, y = enrolledTot, fill = State)) + geom_histogram(stat="identity") + ggtitle("State vs Enrollment") +
xlab("States") + ylab("Total Enrollment")
studentTable
## enrolledTot State
## 1 728 California
## 2 244 Foreign Schools
## 3 218 Oregon
## 4 194 Washington
## 5 187 New York
## 6 143 Massachusetts
## 7 124 Texas
## 8 71 Colorado
## 9 65 Illinois
## 10 63 Connecticut
## 11 58 New Jersey
## 12 54 Minnesota
## 13 52 Florida
## 14 44 Pennsylvania
## 15 42 Maryland
## 16 40 Arizona
## 17 37 Virginia
## 18 35 New Hampshire
## 19 35 New Mexico
## 20 31 Georgia
## 21 28 Missouri
## 22 28 North Carolina
## 23 28 Washington, DC
## 24 26 Idaho
## 25 25 Michigan
## 26 24 Hawaii
## 27 24 Wisconsin
## 28 23 Ohio
## 29 18 Iowa
## 30 18 Vermont
## 31 16 Tennessee
## 32 15 Alaska
## 33 15 Louisiana
## 34 15 Maine
## 35 14 Utah
## 36 13 Indiana
## 37 11 Nevada
## 38 11 Oklahoma
## 39 10 Montana
## 40 10 Rhode Island
## 41 7 Alabama
## 42 5 Kansas
## 43 5 Kentucky
## 44 5 Nebraska
## 45 5 South Carolina
## 46 3 Arkansas
## 47 3 Delaware
## 48 3 Mississippi
## 49 3 South Dakota
## 50 3 West Virginia
## 51 3 Wyoming
## 52 0 North Dakota
ggplot(data = yearEnrollment, aes(x = Year, y = enrolledTot, fill = Year)) + geom_histogram(stat="identity") + ggtitle("Year vs Enrollment") +
xlab("Year") + ylab("Total Enrollment")
yearEnrollment
## enrolledTot Year
## 1 414 2015
## 2 347 2014
## 3 356 2013
## 4 320 2012
## 5 374 2011
## 6 373 2010
## 7 368 2009
## 8 330 2008