The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] (A minimal sketch of this step appears just after this task list.)
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
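As a minimal, generic sketch of the tidying step above (the file name my_wide_data.csv and its id/year columns are hypothetical, and the code relies on the tidyr and dplyr packages loaded in the next section), the reshape from wide to long might look like:

# Read a hypothetical wide file (one column per year) and reshape it to long form
wide <- read.csv("my_wide_data.csv", stringsAsFactors = FALSE)
long <- wide %>%
  gather(key = "year", value = "value", -id) %>%  # collapse every column except id
  arrange(id, year)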
Libraries Used:
library(XML)
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.5.3
## Loading required package: bitops
library(rlist)
## Warning: package 'rlist' was built under R version 3.5.3
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.5.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(png)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
Another discussion post that piqued my interest, and that would be a good candidate for HTML scraping, was Tony Mei's fantasy football stats post.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY.png"
include_graphics(img)
Knowing there is a lot of fantasy football information available, I took it upon myself to find a website that publishes these stats in an HTML page suitable for scraping. I found one, shown below, and started scraping.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY_page.png"
include_graphics(img)
# Download the raw HTML (SSL verification turned off for RCurl)
heurl <- getURL("https://www.footballdb.com/fantasy-football/index.html?pos=QB%2CRB%2CWR%2CTE&yr=2019&wk=all&rules=1", .opts = list(ssl.verifypeer = FALSE))
# Parse every <table> element on the page into a list of data frames
fantasytables <- readHTMLTable(heurl)
# Drop any NULL entries and record each remaining table's row count
fantasytables <- list.clean(fantasytables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(fantasytables, function(t) dim(t)[1]))
# Inspect the structure of what was scraped
str(fantasytables)
## List of 7
## $ NULL:'data.frame': 100 obs. of 19 variables:
## ..$ Player: Factor w/ 100 levels "A.J. BrownA. Brown",..: 87 60 80 23 13 73 43 33 40 41 ...
## ..$ Bye : Factor w/ 9 levels "10","11","12",..: 2 8 3 8 1 5 9 1 6 7 ...
## ..$ Pts* : Factor w/ 52 levels "104.00","122.00",..: 4 3 2 1 52 51 51 50 50 49 ...
## ..$ Att : Factor w/ 28 levels "0","1","106",..: 18 10 18 7 15 14 22 6 12 13 ...
## ..$ Cmp : Factor w/ 25 levels "0","101","102",..: 5 19 4 23 20 20 7 17 21 19 ...
## ..$ Yds : Factor w/ 31 levels "0","1,061","1,069",..: 14 6 15 8 30 7 16 29 27 11 ...
## ..$ TD : Factor w/ 11 levels "0","10","12",..: 3 2 2 11 11 11 9 8 2 11 ...
## ..$ Int : Factor w/ 8 levels "0","1","2","3",..: 1 3 1 4 3 3 8 2 3 6 ...
## ..$ 2Pt : Factor w/ 3 levels "0","1","2": 1 1 1 1 2 1 1 1 1 1 ...
## ..$ Att : Factor w/ 39 levels "0","1","10","12",..: 11 15 4 4 7 4 5 8 8 6 ...
## ..$ Yds : Factor w/ 63 levels "-1","-3","-8",..: 11 28 56 63 54 54 12 59 55 29 ...
## ..$ TD : Factor w/ 6 levels "0","1","2","3",..: 3 2 1 2 2 1 2 4 1 1 ...
## ..$ 2Pt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Rec : Factor w/ 31 levels "0","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Yds : Factor w/ 65 levels "0","105","109",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ TD : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ 2Pt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ FL : Factor w/ 5 levels "0","1","2","3",..: 2 1 2 1 1 3 4 3 2 1 ...
## ..$ TD : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
## $ NULL:'data.frame': 4 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 4 levels "Passing 2-Point Conversions",..: 4 3 2 1
## ..$ Point Value : Factor w/ 4 levels "-2 points","1 point for every 25 yards",..: 2 4 1 3
## $ NULL:'data.frame': 3 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 3 levels "Rushing 2-Point Conversions",..: 3 2 1
## ..$ Point Value : Factor w/ 3 levels "1 point for every 10 yards",..: 1 3 2
## $ NULL:'data.frame': 4 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 4 levels "Receiving 2-Point Conversions",..: 4 3 2 1
## ..$ Point Value : Factor w/ 4 levels "1 point (for PPR scoring)",..: 1 2 4 3
## $ NULL:'data.frame': 2 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 2 levels "Fumbles Lost",..: 2 1
## ..$ Point Value : Factor w/ 2 levels "-2 points","6 points": 2 1
## $ NULL:'data.frame': 2 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 2 levels "Kicking FGs",..: 2 1
## ..$ Point Value : Factor w/ 2 levels "0-49 yards: 3 points50+ yards: 5 points",..: 2 1
## $ NULL:'data.frame': 8 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 8 levels "Defense/Special Teams TDs",..: 1 4 7 3 2 8 5 6
## ..$ Point Value : Factor w/ 4 levels "0-6 Points Against: 10 points7-13 points against: 7 points14-20 points against: 4 points21-27 points against: 1 points",..: 4 3 2 3 3 3 3 1
Like the Wikipedia page, this HTML page had more than one table embedded in it: seven, in fact.
# Give the seven parsed tables usable names, then pull out the player-stats table
names(fantasytables) <- c("table1", "table2", "table3", "table4", "table5", "table6", "table7")
best_fantasy_table <- fantasytables[["table1"]]
head(best_fantasy_table, 3)
## Player Bye Pts* Att Cmp Yds TD Int 2Pt Att Yds TD
## 1 Russell WilsonR. Wilson 11 147.00 156 114 1,409 12 0 0 27 120 2
## 2 Lamar JacksonL. Jackson 8 125.00 134 87 1,110 10 2 0 36 238 1
## 3 Patrick MahomesP. Mahomes 12 122.00 156 106 1,510 10 0 0 12 64 0
## 2Pt Rec Yds TD 2Pt FL TD
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0
As you can see from the data frame, Russell Wilson, Lamar Jackson, and Patrick Mahomes were the top three fantasy football players for the 2019-2020 season.
I do not have much knowledge of fantasy football or of the related statistics that would feed into further analysis. However, now that the data is tidied, I would like to do some further analysis on this dataset in the future.
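One cleanup step that would have to come first: readHTMLTable() returns every column of the player table as a factor, with commas embedded in the larger yardage figures. A minimal sketch of that conversion (assuming the column layout shown above; not evaluated here):

# Every column except Player is numeric data stored as a factor, some with
# "1,409"-style thousands separators; base R indexing sidesteps the duplicated
# column names (Att, Yds, TD, 2Pt).
best_fantasy_clean <- best_fantasy_table
best_fantasy_clean[-1] <- lapply(best_fantasy_clean[-1],
                                 function(x) as.numeric(gsub(",", "", as.character(x))))
# Optionally make the repeated passing/rushing/receiving headers unique
names(best_fantasy_clean) <- make.unique(names(best_fantasy_clean))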
For my final analysis for this project, I wanted to aggregate two datasets with similar variables, pertaining to the same subject area, into one. Here I chose the top 10 cities in America, with one dataset focusing on property crimes and the other on violent crimes. I found both of the datasets that I aggregated on the website shown below.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/crime.png"
include_graphics(img)
# Read the two crime datasets (property and violent) straight from GitHub
property <- read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA607/master/PropertyCrimeRates_1.csv")
violent <- read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA607/master/ViolentCrimeRates.csv")
names(property)
## [1] "City"
## [2] "State"
## [3] "Population"
## [4] "Property.Crime"
## [5] "Burglary"
## [6] "Larceny.Theft"
## [7] "Motor.Vehicle.Theft"
## [8] "Arson"
## [9] "Property.Crime.Rate.Per.100.000.People"
## [10] "Burglary.Rate.Per.100.000.People"
## [11] "Larceny.Theft.Rate.Per.100.000.People"
## [12] "Motor.Vehicle.Theft.Rate.Per.100.000.People"
## [13] "Arson.Rate.Per.100.000.People"
## [14] "Latitude"
## [15] "Longitude"
names(violent)
## [1] "City"
## [2] "State"
## [3] "Population"
## [4] "Violent.Crimes"
## [5] "Murder.and.Non.Negligent.Manslaughter"
## [6] "Rape"
## [7] "Robbery"
## [8] "Aggravated.Assault"
## [9] "Violent.Crime.Rate.Per.100.000.People"
## [10] "Murder.Rate.Per.100.000.People"
## [11] "Rape.Rate.Per.100.000.People"
## [12] "Robbery.Rate.Per.100.000.People"
## [13] "Aggravated.Assault.Rate.Per.100.000.People"
## [14] "Latitude"
## [15] "Longitude"
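Since the merge below joins the two files on City, it is worth checking first that they cover the same ten cities; any mismatch would surface as NA totals. A quick sanity check (a sketch, not evaluated here):

# Cities present in one file but missing from the other
setdiff(property$City, violent$City)
setdiff(violent$City, property$City)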
# Keep only the key columns plus each file's headline crime count
property2 <- select(property, City, State, Property.Crime)
violent2 <- select(violent, City, State, Violent.Crimes)
# Merge on City, drop the duplicated State column, and add a combined total
total <- merge(x = property2, y = violent2, by = "City", all = TRUE)
total$State.y <- NULL
total$Total <- total$Property.Crime + total$Violent.Crimes
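Before plotting, ordering the merged table makes the ranking easy to read (a sketch, not evaluated here):

# Order the combined crime counts from highest to lowest
total %>%
  arrange(desc(Total)) %>%
  head(3)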
# Bar chart of combined (property + violent) crime per city, faceted by city
ggplot(data = total, aes(x = State.x, y = Total)) +
  geom_bar(stat = 'identity', aes(fill = State.x)) +
  geom_text(aes(y = Total, label = Total, group = City), vjust = -0.4) +
  labs(title = "Total Crime in Cities of America, 2014",
       y = "Total") +
  facet_wrap(~City, ncol = 10) +
  theme_bw()
According to the above graph, New York City has the highest total crime count among the 10 most crime-filled cities in the United States.