The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] (A minimal sketch of this step appears just after this task list.)
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
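As a minimal, generic sketch of the tidying step above (the file name my_wide_data.csv and its id/year columns are hypothetical, and the code relies on the tidyr and dplyr packages loaded in the next section), the reshape from wide to long might look like:

# Read a hypothetical wide file (one column per year) and reshape it to long form
wide <- read.csv("my_wide_data.csv", stringsAsFactors = FALSE)
long <- wide %>%
  gather(key = "year", value = "value", -id) %>%  # collapse every column except id
  arrange(id, year)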
Libraries Used:
library(XML)
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.5.3
## Loading required package: bitops
library(rlist)
## Warning: package 'rlist' was built under R version 3.5.3
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.5.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(png)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
Another discussion post that piqued my interest, and that would be a good candidate for HTML scraping, was Tony Mei's fantasy football stats post.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY.png"
include_graphics(img)
Knowing there is a lot of fantasy football information available, I took it upon myself to find a website that publishes these stats in an HTML page suitable for scraping. I found one, shown below, and started scraping.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/FANTASY_page.png"
include_graphics(img)
# Download the raw HTML (SSL verification turned off for RCurl)
heurl <- getURL("https://www.footballdb.com/fantasy-football/index.html?pos=QB%2CRB%2CWR%2CTE&yr=2019&wk=all&rules=1", .opts = list(ssl.verifypeer = FALSE))
# Parse every <table> element on the page into a list of data frames
fantasytables <- readHTMLTable(heurl)
# Drop any NULL entries and record each remaining table's row count
fantasytables <- list.clean(fantasytables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(fantasytables, function(t) dim(t)[1]))
# Inspect the structure of what was scraped
str(fantasytables)
## List of 7
## $ NULL:'data.frame': 100 obs. of 19 variables:
## ..$ Player: Factor w/ 100 levels "A.J. BrownA. Brown",..: 87 60 80 23 13 73 43 33 40 41 ...
## ..$ Bye : Factor w/ 9 levels "10","11","12",..: 2 8 3 8 1 5 9 1 6 7 ...
## ..$ Pts* : Factor w/ 52 levels "104.00","122.00",..: 4 3 2 1 52 51 51 50 50 49 ...
## ..$ Att : Factor w/ 28 levels "0","1","106",..: 18 10 18 7 15 14 22 6 12 13 ...
## ..$ Cmp : Factor w/ 25 levels "0","101","102",..: 5 19 4 23 20 20 7 17 21 19 ...
## ..$ Yds : Factor w/ 31 levels "0","1,061","1,069",..: 14 6 15 8 30 7 16 29 27 11 ...
## ..$ TD : Factor w/ 11 levels "0","10","12",..: 3 2 2 11 11 11 9 8 2 11 ...
## ..$ Int : Factor w/ 8 levels "0","1","2","3",..: 1 3 1 4 3 3 8 2 3 6 ...
## ..$ 2Pt : Factor w/ 3 levels "0","1","2": 1 1 1 1 2 1 1 1 1 1 ...
## ..$ Att : Factor w/ 39 levels "0","1","10","12",..: 11 15 4 4 7 4 5 8 8 6 ...
## ..$ Yds : Factor w/ 63 levels "-1","-3","-8",..: 11 28 56 63 54 54 12 59 55 29 ...
## ..$ TD : Factor w/ 6 levels "0","1","2","3",..: 3 2 1 2 2 1 2 4 1 1 ...
## ..$ 2Pt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Rec : Factor w/ 31 levels "0","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Yds : Factor w/ 65 levels "0","105","109",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ TD : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ 2Pt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ FL : Factor w/ 5 levels "0","1","2","3",..: 2 1 2 1 1 3 4 3 2 1 ...
## ..$ TD : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
## $ NULL:'data.frame': 4 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 4 levels "Passing 2-Point Conversions",..: 4 3 2 1
## ..$ Point Value : Factor w/ 4 levels "-2 points","1 point for every 25 yards",..: 2 4 1 3
## $ NULL:'data.frame': 3 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 3 levels "Rushing 2-Point Conversions",..: 3 2 1
## ..$ Point Value : Factor w/ 3 levels "1 point for every 10 yards",..: 1 3 2
## $ NULL:'data.frame': 4 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 4 levels "Receiving 2-Point Conversions",..: 4 3 2 1
## ..$ Point Value : Factor w/ 4 levels "1 point (for PPR scoring)",..: 1 2 4 3
## $ NULL:'data.frame': 2 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 2 levels "Fumbles Lost",..: 2 1
## ..$ Point Value : Factor w/ 2 levels "-2 points","6 points": 2 1
## $ NULL:'data.frame': 2 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 2 levels "Kicking FGs",..: 2 1
## ..$ Point Value : Factor w/ 2 levels "0-49 yards: 3 points50+ yards: 5 points",..: 2 1
## $ NULL:'data.frame': 8 obs. of 2 variables:
## ..$ Stat Category: Factor w/ 8 levels "Defense/Special Teams TDs",..: 1 4 7 3 2 8 5 6
## ..$ Point Value : Factor w/ 4 levels "0-6 Points Against: 10 points7-13 points against: 7 points14-20 points against: 4 points21-27 points against: 1 points",..: 4 3 2 3 3 3 3 1
Like the Wikipedia page, this HTML page had more than one table embedded in it: seven, in fact.
# Give the seven parsed tables usable names, then pull out the player-stats table
names(fantasytables) <- c("table1", "table2", "table3", "table4", "table5", "table6", "table7")
best_fantasy_table <- fantasytables[["table1"]]
head(best_fantasy_table, 3)
## Player Bye Pts* Att Cmp Yds TD Int 2Pt Att Yds TD
## 1 Russell WilsonR. Wilson 11 147.00 156 114 1,409 12 0 0 27 120 2
## 2 Lamar JacksonL. Jackson 8 125.00 134 87 1,110 10 2 0 36 238 1
## 3 Patrick MahomesP. Mahomes 12 122.00 156 106 1,510 10 0 0 12 64 0
## 2Pt Rec Yds TD 2Pt FL TD
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0
As you can see from the data frame, Russell Wilson, Lamar Jackson, and Patrick Mahomes were the top three fantasy football players for the 2019-2020 season.
I do not have much knowledge of fantasy football or of the related statistics that would feed into further analysis. However, now that the data is tidied, I would like to do some further analysis on this dataset in the future.
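One cleanup step that would have to come first: readHTMLTable() returns every column of the player table as a factor, with commas embedded in the larger yardage figures. A minimal sketch of that conversion (assuming the column layout shown above; not evaluated here):

# Every column except Player is numeric data stored as a factor, some with
# "1,409"-style thousands separators; base R indexing sidesteps the duplicated
# column names (Att, Yds, TD, 2Pt).
best_fantasy_clean <- best_fantasy_table
best_fantasy_clean[-1] <- lapply(best_fantasy_clean[-1],
                                 function(x) as.numeric(gsub(",", "", as.character(x))))
# Optionally make the repeated passing/rushing/receiving headers unique
names(best_fantasy_clean) <- make.unique(names(best_fantasy_clean))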
For my final analysis for this project, I wanted to aggregate two datasets with similar variables, pertaining to the same subject area, into one. Here I chose the top 10 cities in America, with one dataset focusing on property crimes and the other on violent crimes. I found both of the datasets that I aggregated on the website shown below.
img <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/crime.png"
include_graphics(img)
# Read the two crime datasets (property and violent) straight from GitHub
property <- read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA607/master/PropertyCrimeRates_1.csv")
violent <- read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA607/master/ViolentCrimeRates.csv")
names(property)
## [1] "City"
## [2] "State"
## [3] "Population"
## [4] "Property.Crime"
## [5] "Burglary"
## [6] "Larceny.Theft"
## [7] "Motor.Vehicle.Theft"
## [8] "Arson"
## [9] "Property.Crime.Rate.Per.100.000.People"
## [10] "Burglary.Rate.Per.100.000.People"
## [11] "Larceny.Theft.Rate.Per.100.000.People"
## [12] "Motor.Vehicle.Theft.Rate.Per.100.000.People"
## [13] "Arson.Rate.Per.100.000.People"
## [14] "Latitude"
## [15] "Longitude"
names(violent)
## [1] "City"
## [2] "State"
## [3] "Population"
## [4] "Violent.Crimes"
## [5] "Murder.and.Non.Negligent.Manslaughter"
## [6] "Rape"
## [7] "Robbery"
## [8] "Aggravated.Assault"
## [9] "Violent.Crime.Rate.Per.100.000.People"
## [10] "Murder.Rate.Per.100.000.People"
## [11] "Rape.Rate.Per.100.000.People"
## [12] "Robbery.Rate.Per.100.000.People"
## [13] "Aggravated.Assault.Rate.Per.100.000.People"
## [14] "Latitude"
## [15] "Longitude"
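Since the merge below joins the two files on City, it is worth checking first that they cover the same ten cities; any mismatch would surface as NA totals. A quick sanity check (a sketch, not evaluated here):

# Cities present in one file but missing from the other
setdiff(property$City, violent$City)
setdiff(violent$City, property$City)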
# Keep only the key columns plus each file's headline crime count
property2 <- select(property, City, State, Property.Crime)
violent2 <- select(violent, City, State, Violent.Crimes)
# Merge on City, drop the duplicated State column, and add a combined total
total <- merge(x = property2, y = violent2, by = "City", all = TRUE)
total$State.y <- NULL
total$Total <- total$Property.Crime + total$Violent.Crimes
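Before plotting, ordering the merged table makes the ranking easy to read (a sketch, not evaluated here):

# Order the combined crime counts from highest to lowest
total %>%
  arrange(desc(Total)) %>%
  head(3)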
# Bar chart of combined (property + violent) crime per city, faceted by city
ggplot(data = total, aes(x = State.x, y = Total)) +
  geom_bar(stat = 'identity', aes(fill = State.x)) +
  geom_text(aes(y = Total, label = Total, group = City), vjust = -0.4) +
  labs(title = "Total Crime in Cities of America, 2014",
       y = "Total") +
  facet_wrap(~City, ncol = 10) +
  theme_bw()
According to the above graph, New York City has the highest total crime count among the 10 most crime-filled cities in the United States.