Objective

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

  3. Perform the analysis requested in the discussion item.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Libraries Used:

library(XML)
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.5.3
## Loading required package: bitops
library(rlist)
## Warning: package 'rlist' was built under R version 3.5.3
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.5.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(png)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3

Fantasy Football Dataset

Another discussion post I noticed that peaked my interest and would be good for HTML scrapping, was Tony Mei’s fantasy football stats post.

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY.png"
include_graphics(imgage)

I took it upon myself, knowning there is a lot of fantasy football information, to find a website that would have the stats on their HTML page for scrapping. I found a website, seen below, and started scrapping.

Data Import
imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY_page.png"
include_graphics(imgage)

heurl <- getURL("https://www.footballdb.com/fantasy-football/index.html?pos=QB%2CRB%2CWR%2CTE&yr=2019&wk=all&rules=1",.opts = list(ssl.verifypeer = FALSE) )
 fantasytables <- readHTMLTable(heurl)
Tidying of Data
fantasytables <- list.clean(fantasytables, fun = is.null, recursive = FALSE)
 n.rows <- unlist(lapply(fantasytables, function(t) dim(t)[1]))
 
 str(fantasytables)
## List of 7
##  $ NULL:'data.frame':    100 obs. of  19 variables:
##   ..$ Player: Factor w/ 100 levels "A.J. BrownA. Brown",..: 87 60 80 23 13 73 43 33 40 41 ...
##   ..$ Bye   : Factor w/ 9 levels "10","11","12",..: 2 8 3 8 1 5 9 1 6 7 ...
##   ..$ Pts*  : Factor w/ 52 levels "104.00","122.00",..: 4 3 2 1 52 51 51 50 50 49 ...
##   ..$ Att   : Factor w/ 28 levels "0","1","106",..: 18 10 18 7 15 14 22 6 12 13 ...
##   ..$ Cmp   : Factor w/ 25 levels "0","101","102",..: 5 19 4 23 20 20 7 17 21 19 ...
##   ..$ Yds   : Factor w/ 31 levels "0","1,061","1,069",..: 14 6 15 8 30 7 16 29 27 11 ...
##   ..$ TD    : Factor w/ 11 levels "0","10","12",..: 3 2 2 11 11 11 9 8 2 11 ...
##   ..$ Int   : Factor w/ 8 levels "0","1","2","3",..: 1 3 1 4 3 3 8 2 3 6 ...
##   ..$ 2Pt   : Factor w/ 3 levels "0","1","2": 1 1 1 1 2 1 1 1 1 1 ...
##   ..$ Att   : Factor w/ 39 levels "0","1","10","12",..: 11 15 4 4 7 4 5 8 8 6 ...
##   ..$ Yds   : Factor w/ 63 levels "-1","-3","-8",..: 11 28 56 63 54 54 12 59 55 29 ...
##   ..$ TD    : Factor w/ 6 levels "0","1","2","3",..: 3 2 1 2 2 1 2 4 1 1 ...
##   ..$ 2Pt   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Rec   : Factor w/ 31 levels "0","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Yds   : Factor w/ 65 levels "0","105","109",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ TD    : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ 2Pt   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ FL    : Factor w/ 5 levels "0","1","2","3",..: 2 1 2 1 1 3 4 3 2 1 ...
##   ..$ TD    : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NULL:'data.frame':    4 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 4 levels "Passing 2-Point Conversions",..: 4 3 2 1
##   ..$ Point Value  : Factor w/ 4 levels "-2 points","1 point for every 25 yards",..: 2 4 1 3
##  $ NULL:'data.frame':    3 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 3 levels "Rushing 2-Point Conversions",..: 3 2 1
##   ..$ Point Value  : Factor w/ 3 levels "1 point for every 10 yards",..: 1 3 2
##  $ NULL:'data.frame':    4 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 4 levels "Receiving 2-Point Conversions",..: 4 3 2 1
##   ..$ Point Value  : Factor w/ 4 levels "1 point (for PPR scoring)",..: 1 2 4 3
##  $ NULL:'data.frame':    2 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 2 levels "Fumbles Lost",..: 2 1
##   ..$ Point Value  : Factor w/ 2 levels "-2 points","6 points": 2 1
##  $ NULL:'data.frame':    2 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 2 levels "Kicking FGs",..: 2 1
##   ..$ Point Value  : Factor w/ 2 levels "0-49 yards: 3 points50+ yards: 5 points",..: 2 1
##  $ NULL:'data.frame':    8 obs. of  2 variables:
##   ..$ Stat Category: Factor w/ 8 levels "Defense/Special Teams TDs",..: 1 4 7 3 2 8 5 6
##   ..$ Point Value  : Factor w/ 4 levels "0-6 Points Against: 10 points7-13 points against: 7 points14-20 points against: 4 points21-27 points against: 1 points",..: 4 3 2 3 3 3 3 1

Like the Wikipedia page, there was more that one table embedded into the HTML page itself, seven actually.

names(fantasytables) <- c("table1", "table2","table3", "table4", "table5","table6","table7")
 best_fantasy_table = fantasytables[["table1"]]
 head(best_fantasy_table,3)
##                      Player Bye   Pts* Att Cmp   Yds TD Int 2Pt Att Yds TD
## 1   Russell WilsonR. Wilson  11 147.00 156 114 1,409 12   0   0  27 120  2
## 2   Lamar JacksonL. Jackson   8 125.00 134  87 1,110 10   2   0  36 238  1
## 3 Patrick MahomesP. Mahomes  12 122.00 156 106 1,510 10   0   0  12  64  0
##   2Pt Rec Yds TD 2Pt FL TD
## 1   0   0   0  0   0  1  0
## 2   0   0   0  0   0  0  0
## 3   0   0   0  0   0  1  0

As you can see from the dataframe. Russell Wilson, Lamar Jackson, and Patrick Mahomes were the top three fantasy football players for the 2019-2020 season.

I do not have much knowledge about fantasy football or the relative information that would go into further analysis. However, after this Tidy, I would like to do some further analysis on this dataset in the future.

Property and Violent Crime Datasets Aggregated for Analysis

For my final analysis for this project, I wanted to Aggregate two different data-set with similar variable and pertaining to the same area into one. Here I chose to do the top 10 City in America, with one data set focusing on property crimes and the other to do with violent crimes. I found both datasets that I aggregated on this website.

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/crime.png"
include_graphics(imgage)

Import Two Datasets
property <- read.csv(file="https://raw.githubusercontent.com/josephsimone/DATA607/master/PropertyCrimeRates_1.csv")

 violent <- read.csv(file="https://raw.githubusercontent.com/josephsimone/DATA607/master/ViolentCrimeRates.csv")
Variable names
names(property)
##  [1] "City"                                       
##  [2] "State"                                      
##  [3] "Population"                                 
##  [4] "Property.Crime"                             
##  [5] "Burglary"                                   
##  [6] "Larceny.Theft"                              
##  [7] "Motor.Vehicle.Theft"                        
##  [8] "Arson"                                      
##  [9] "Property.Crime.Rate.Per.100.000.People"     
## [10] "Burglary.Rate.Per.100.000.People"           
## [11] "Larceny.Theft.Rate.Per.100.000.People"      
## [12] "Motor.Vehicle.Theft.Rate.Per.100.000.People"
## [13] "Arson.Rate.Per.100.000.People"              
## [14] "Latitude"                                   
## [15] "Longitude"
names(violent)
##  [1] "City"                                      
##  [2] "State"                                     
##  [3] "Population"                                
##  [4] "Violent.Crimes"                            
##  [5] "Murder.and.Non.Negligent.Manslaughter"     
##  [6] "Rape"                                      
##  [7] "Robbery"                                   
##  [8] "Aggravated.Assault"                        
##  [9] "Violent.Crime.Rate.Per.100.000.People"     
## [10] "Murder.Rate.Per.100.000.People"            
## [11] "Rape.Rate.Per.100.000.People"              
## [12] "Robbery.Rate.Per.100.000.People"           
## [13] "Aggravated.Assault.Rate.Per.100.000.People"
## [14] "Latitude"                                  
## [15] "Longitude"
Creation of smaller dataframes
property2<-select(property, City, State, Property.Crime)

violent2<-select(violent, City, State, Violent.Crimes)
Merge two datatframes
total<-merge(x = property2, y = violent2, by = "City", all = TRUE)

total$State.y <- NULL
Creating of Total Crime for Analyis
total$Total <- (total$Property.Crime + total$Violent.Crimes)
GGPLOT for Total Crime by City
ggplot(data = total, aes(x=State.x,y=Total))+
   
    geom_bar(stat = 'identity',aes(fill=State.x))+
    geom_text(aes( y = Total, 
                  label = paste(Total),
                  group = City,
                  vjust = -0.4)) +
    labs(title = "Total Crime in Citys of America, 2014", 
         
         y = "Total") +
    facet_wrap(~City, ncol = 10)+
    theme_bw()

According to the above graph, New York City has the highest total crimes out of the 10 most crime filled cities in the United States.