Data607-Project2

Rathish Parayil Sasidharan

Overview

Choose any three of the “wide” data sets identified in the Week 6 Discussion items.
(You may use your own data set; please don’t use my Sample Post data set, since that was used in your Week 6 assignment!)
For each of the three chosen data sets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the data set.
You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
[Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen data sets: The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.

Prerequisite

Load required packages

library(tidyr)
library(dplyr)
library(ggplot2)

Prison admin and population

Problem

Below is a dataset on prison admissions and population at the county level. It was used for the NYTimes article “A Small Indiana County Sends More People to Prison Than San Francisco and Durham, N.C., Combined. Why?”

Dataset

The data is in wide format with prison admissions/10k residents, prison admissions count, and prison population count at three different time points reported per county. Also the year information is part of the column name

Read the data

#Read the CSV data
prisonData<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/county-prison-admissions.csv",stringsAsFactors=FALSE)
head(prisonData,10)

##    fips          county state admitsPer10k2006 admitsPer10k2013
## 1  1001  Autauga County    AL         44.25665        19.225189
## 2  1003  Baldwin County    AL         24.63739        17.703371
## 3  1005  Barbour County    AL         75.39988        10.378827
## 4  1007     Bibb County    AL         21.97416        11.109136
## 5  1009   Blount County    AL         16.13939        11.781012
## 6  1011  Bullock County    AL         66.07110         8.486563
## 7  1013   Butler County    AL         36.36540        37.458721
## 8  1015  Calhoun County    AL         46.04380        36.894987
## 9  1017 Chambers County    AL         50.66628        24.579371
## 10 1019 Cherokee County    AL         24.96735        38.292169
##    admitsPer10k2014 valid06 valid13 valid14 population2006 population2013
## 1         18.593736    true    true    true          51328          55136
## 2         16.540820    true    true    true         168121         195443
## 3         12.273589    true    true    true          27861          26978
## 4          6.664889    true    true    true          22099          22504
## 5          6.930127    true    true    true          55485          57720
## 6                NA    true    true                  10776          10605
## 7         34.489555    true    true    true          20815          20289
## 8         37.095828    true    true    true         115388         116547
## 9         25.824627    true    true    true          34945          34175
## 10        31.877712    true    true    true          25466          26115
##    population2014 admissions2006 admissions2013 admissions2014 source
## 1           55395            243            106            103   NCRP
## 2          200111            461            346            331   NCRP
## 3           26887            206             28             33   NCRP
## 4           22506             50             25             15   NCRP
## 5           57719             93             68             40   NCRP
## 6           10764             71              9                  NCRP
## 7           20296             75             76             70   NCRP
## 8          115916            542            430            430   NCRP
## 9           34076            173             84             88   NCRP
## 10          26037             65            100             83   NCRP

Tidy and transform the data

Clean up the column name ,extract the year information from column name

prisonData[,"admissions2014"] <- as.numeric(prisonData[, "admissions2014"])

## Warning: NAs introduced by coercion

#extract the year information from column name (extract last 4 characters of the column name)
#convert the data into long format by creating a new column for "year" value
prisonData.long <- pivot_longer(prisonData, cols = -c("fips","county","state","valid06","valid13","valid14","source"), 
                          names_to = c("type","year"), 
                          names_sep = -4, 
                         values_drop_na = TRUE,
                         values_to = "values2")



#values_ptypes = list(values2 = double())

#widen the data by converting the admitsPer10k,population,admissions column values into new columns for each year
#clean up the redundant columns
prisonDataFinal<-pivot_wider(prisonData.long ,names_from = type, values_from = values2) %>% select (fips,county,state,year,admitsPer10k,population,admissions)

head(prisonDataFinal,10)

## # A tibble: 10 x 7
##     fips county         state year  admitsPer10k population admissions
##    <int> <chr>          <chr> <chr>        <dbl>      <dbl>      <dbl>
##  1  1001 Autauga County AL    2006          44.3      51328        243
##  2  1001 Autauga County AL    2013          19.2      55136        106
##  3  1001 Autauga County AL    2014          18.6      55395        103
##  4  1003 Baldwin County AL    2006          24.6     168121        461
##  5  1003 Baldwin County AL    2013          17.7     195443        346
##  6  1003 Baldwin County AL    2014          16.5     200111        331
##  7  1005 Barbour County AL    2006          75.4      27861        206
##  8  1005 Barbour County AL    2013          10.4      26978         28
##  9  1005 Barbour County AL    2014          12.3      26887         33
## 10  1007 Bibb County    AL    2006          22.0      22099         50

Analysis

Find the county,state and year which has maximum admissions ?

prisonDataFinal %>% filter(admissions == max(prisonDataFinal$admissions,na.rm = TRUE))

## # A tibble: 1 x 7
##    fips county             state year  admitsPer10k population admissions
##   <int> <chr>              <chr> <chr>        <dbl>      <dbl>      <dbl>
## 1  6037 Los Angeles County CA    2006          34.2    9737955      33793

US County populations data:

Problem

US County populations data:

https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-04-30/week5_acs2015_county_data.csv

The description of the data is here:

Read the data

#Read the data
usPopulationDf<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/acs2015_county_data.csv")
head(usPopulationDf,10)

##    CensusId   State   County TotalPop   Men Women Hispanic White Black Native
## 1      1001 Alabama  Autauga    55221 26745 28476      2.6  75.8  18.5    0.4
## 2      1003 Alabama  Baldwin   195121 95314 99807      4.5  83.1   9.5    0.6
## 3      1005 Alabama  Barbour    26932 14497 12435      4.6  46.2  46.7    0.2
## 4      1007 Alabama     Bibb    22604 12073 10531      2.2  74.5  21.4    0.4
## 5      1009 Alabama   Blount    57710 28512 29198      8.6  87.9   1.5    0.3
## 6      1011 Alabama  Bullock    10678  5660  5018      4.4  22.2  70.7    1.2
## 7      1013 Alabama   Butler    20354  9502 10852      1.2  53.3  43.8    0.1
## 8      1015 Alabama  Calhoun   116648 56274 60374      3.5  73.0  20.3    0.2
## 9      1017 Alabama Chambers    34079 16258 17821      0.4  57.3  40.3    0.2
## 10     1019 Alabama Cherokee    26008 12975 13033      1.5  91.7   4.8    0.6
##    Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty
## 1    1.0       0   40725  51281      2391        24974            1080    12.9
## 2    0.7       0  147695  50254      1263        27317             711    13.4
## 3    0.4       0   20714  32964      2973        16824             798    26.7
## 4    0.1       0   17495  38678      3995        18431            1618    16.8
## 5    0.1       0   42345  45813      3141        20532             708    16.7
## 6    0.2       0    8057  31938      5884        17580            2055    24.6
## 7    0.4       0   15581  32229      1793        18390             714    25.4
## 8    0.9       0   88612  41703       925        21374             489    20.5
## 9    0.8       0   26462  34177      2949        21071            1366    21.6
## 10   0.3       0   20600  36296      1710        21811            1556    19.2
##    ChildPoverty Professional Service Office Construction Production Drive
## 1          18.6         33.2    17.0   24.2          8.6       17.1  87.5
## 2          19.2         33.1    17.7   27.1         10.8       11.2  84.7
## 3          45.3         26.8    16.1   23.1         10.8       23.1  83.8
## 4          27.9         21.5    17.9   17.8         19.0       23.7  83.2
## 5          27.2         28.5    14.1   23.9         13.5       19.9  84.9
## 6          38.4         18.8    15.0   19.7         20.1       26.4  74.9
## 7          39.2         27.5    16.6   21.9         10.3       23.7  84.5
## 8          31.6         27.3    17.7   24.2         10.5       20.4  85.3
## 9          37.2         23.3    14.5   26.3         11.5       24.4  85.1
## 10         30.1         29.3    16.0   19.5         13.7       21.5  83.9
##    Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork
## 1      8.8     0.1  0.5         1.3        1.8        26.5    23986        73.6
## 2      8.8     0.1  1.0         1.4        3.9        26.4    85953        81.5
## 3     10.9     0.4  1.8         1.5        1.6        24.1     8597        71.8
## 4     13.5     0.5  0.6         1.5        0.7        28.8     8294        76.8
## 5     11.2     0.4  0.9         0.4        2.3        34.9    22189        82.0
## 6     14.9     0.7  5.0         1.7        2.8        27.5     3865        79.5
## 7     12.4     0.0  0.8         0.6        1.7        24.6     7813        77.4
## 8      9.4     0.2  1.2         1.2        2.7        24.1    47401        74.1
## 9     11.9     0.2  0.3         0.4        2.1        25.1    13689        85.1
## 10    12.1     0.2  0.6         0.7        2.5        27.4    10155        73.1
##    PublicWork SelfEmployed FamilyWork Unemployment
## 1        20.9          5.5        0.0          7.6
## 2        12.3          5.8        0.4          7.5
## 3        20.8          7.3        0.1         17.6
## 4        16.1          6.7        0.4          8.3
## 5        13.5          4.2        0.4          7.7
## 6        15.1          5.4        0.0         18.0
## 7        16.2          6.2        0.2         10.9
## 8        20.8          5.0        0.1         12.3
## 9        12.1          2.8        0.0          8.9
## 10       18.5          7.9        0.5          7.9

Tidy and transform the data

#clean up the columns
usPopulationClndf <- usPopulationDf %>%select(1:12)
#Convert the 7:12 ethnicity values into ethnicity column and it's value into
usPopulationDfLong <- pivot_longer(usPopulationClndf,cols = c(7:12),names_to ="ethnicity",values_to ="ethnicPopulation")
head(usPopulationDfLong,10)

## # A tibble: 10 x 8
##    CensusId State   County  TotalPop   Men Women ethnicity ethnicPopulation
##       <int> <chr>   <chr>      <int> <int> <int> <chr>                <dbl>
##  1     1001 Alabama Autauga    55221 26745 28476 Hispanic               2.6
##  2     1001 Alabama Autauga    55221 26745 28476 White                 75.8
##  3     1001 Alabama Autauga    55221 26745 28476 Black                 18.5
##  4     1001 Alabama Autauga    55221 26745 28476 Native                 0.4
##  5     1001 Alabama Autauga    55221 26745 28476 Asian                  1  
##  6     1001 Alabama Autauga    55221 26745 28476 Pacific                0  
##  7     1003 Alabama Baldwin   195121 95314 99807 Hispanic               4.5
##  8     1003 Alabama Baldwin   195121 95314 99807 White                 83.1
##  9     1003 Alabama Baldwin   195121 95314 99807 Black                  9.5
## 10     1003 Alabama Baldwin   195121 95314 99807 Native                 0.6

Analysis

Which state and county has the highest Hispanic population ?

usPopulationDfLong %>% filter(ethnicity =='Hispanic') %>%
filter(ethnicPopulation == max(ethnicPopulation,na.rm = TRUE))

## # A tibble: 7 x 8
##   CensusId State      County     TotalPop   Men Women ethnicity ethnicPopulation
##      <int> <chr>      <chr>         <int> <int> <int> <chr>                <dbl>
## 1    72019 Puerto Ri~ Barranqui~    29805 14695 15110 Hispanic              99.9
## 2    72029 Puerto Ri~ CanÃ³vanas    47432 22973 24459 Hispanic              99.9
## 3    72079 Puerto Ri~ Lajas         24722 12065 12657 Hispanic              99.9
## 4    72087 Puerto Ri~ LoÃza         28454 13335 15119 Hispanic              99.9
## 5    72093 Puerto Ri~ Maricao        6315  3144  3171 Hispanic              99.9
## 6    72095 Puerto Ri~ Maunabo       11701  5779  5922 Hispanic              99.9
## 7    72107 Puerto Ri~ Orocovis      22595 11376 11219 Hispanic              99.9

Rising murder rate across different US cities

Problem

Rising murder rate across different US cities

The data is wide format with one column for each year’s murders

Read the data

#Read the data
murderRate <-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/murder_data.csv")
head(murderRate,10)

##           city     state X2015_murders X2016_murders X2017_murders
## 1      Chicago  Illinois           378           536           620
## 2      Orlando   Florida            19            73            88
## 3      Memphis Tennessee           114           158           178
## 4      Phoenix   Arizona            72           111           120
## 5    Las Vegas    Nevada            90           125           120
## 6  San Antonio     Texas            78           111           132
## 7   Louisville  Kentucky            52            79            91
## 8       Dallas     Texas            95           118           121
## 9      Houston     Texas           191           212           230
## 10  Fort Wayne   Indiana            17            34            42
##    X2018_murders
## 1            765
## 2            102
## 3            192
## 4            131
## 5            140
## 6            148
## 7            102
## 8            136
## 9            246
## 10            53
##                                                                                   source
## 1  https://portal.chicagopolice.org/portal/page/portal/ClearPath/News/Crime%20Statistics
## 2                                                                                   OPD 
## 3                                                                                    MPD
## 4                                                                                   PPD 
## 5              http://www.lvmpd.com/Sections/Homicide/HomicideLog/tabid/454/Default.aspx
## 6                                                                                  SAPD 
## 7                           https://louisvilleky.gov/government/police/lmpd-transparency
## 8                                                                                    DPD
## 9                                         http://www.houstontx.gov/police/cs/index-2.htm
## 10                                                                                 FWPD 
##        as_of
## 1  10/2/2016
## 2  9/22/2016
## 3  9/11/2016
## 4  8/31/2016
## 5  9/28/2016
## 6  9/26/2016
## 7  8/31/2016
## 8  8/31/2016
## 9  8/31/2016
## 10 9/26/2016

Tidy and transform the data

#clean up the column name
colnames(murderRate)[3] <- '2015_murders'
colnames(murderRate)[4] <- '2016_murders'
colnames(murderRate)[5] <- '2017_murders'
colnames(murderRate)[6] <- '2018_murders'

#convert the column names into year and murder
murderRate.long <- pivot_longer(murderRate, cols = -c("city","state","source","as_of"), 
                          names_to = c("year","murders"), 
                          names_sep = "_", 
                         values_drop_na = TRUE,
                         values_to = "murder")

#clean up the redundant columns
murderRate.final<-murderRate.long %>% select(city,state,year,murder)
head(murderRate.final,10)

## # A tibble: 10 x 4
##    city    state     year  murder
##    <chr>   <chr>     <chr>  <int>
##  1 Chicago Illinois  2015     378
##  2 Chicago Illinois  2016     536
##  3 Chicago Illinois  2017     620
##  4 Chicago Illinois  2018     765
##  5 Orlando Florida   2015      19
##  6 Orlando Florida   2016      73
##  7 Orlando Florida   2017      88
##  8 Orlando Florida   2018     102
##  9 Memphis Tennessee 2015     114
## 10 Memphis Tennessee 2016     158

Analysis

Which year,city,state had the highest murder ?

murderRate.final %>% filter(murder == max(murder,na.rm = TRUE))

## # A tibble: 1 x 4
##   city    state    year  murder
##   <chr>   <chr>    <chr>  <int>
## 1 Chicago Illinois 2018     765

Movie Rating

Problem

This is a sample data which is in wide format with movie names are listed as columns

Read the data

#Read the data
movieRating <-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/movie_data.csv")
head(movieRating,10)

##   RatingId GREENLAND THE.CROODS..A.NEW.AGE MULAN SUMMERLAND GREYHOUND
## 1        1         5                     3     1          2         5
## 2        2        NA                     2     4          3         5
## 3        3         1                    NA     5          5        NA
## 4        4         3                     2    NA          4         3
## 5        5         3                    NA     2          5         5
## 6        6         4                    NA     3          3         4

Tidy and transform the data

#convert the data into long format
movieRating.long <- pivot_longer(movieRating,cols = c(2:6),names_to ="movies",values_to ="rating")


#Add meaningful value to the rating 
movieRatingFinal <- mutate(movieRating.long,review = case_when(rating == 5 ~'Excellent',rating == 4 ~'Good',rating == 3 ~'Not Bad',rating == 2 ~'Bad',rating == 1 ~'Awful' ))
 
head(movieRatingFinal,10)

## # A tibble: 10 x 4
##    RatingId movies                rating review   
##       <int> <chr>                  <int> <chr>    
##  1        1 GREENLAND                  5 Excellent
##  2        1 THE.CROODS..A.NEW.AGE      3 Not Bad  
##  3        1 MULAN                      1 Awful    
##  4        1 SUMMERLAND                 2 Bad      
##  5        1 GREYHOUND                  5 Excellent
##  6        2 GREENLAND                 NA <NA>     
##  7        2 THE.CROODS..A.NEW.AGE      2 Bad      
##  8        2 MULAN                      4 Good     
##  9        2 SUMMERLAND                 3 Not Bad  
## 10        2 GREYHOUND                  5 Excellent

Analysis

Which movie has the highest rating

movieRatingAgg <- aggregate(movieRatingFinal$rating , by=list(Name=movieRatingFinal$movies), FUN=sum,na.rm = TRUE)

movieRatingAgg %>% filter(x == max(x,na.rm = TRUE))

##         Name  x
## 1  GREYHOUND 22
## 2 SUMMERLAND 22

Summary

Read different dataset which are untidy ,used tidyr and dplyr inorder to tidy and transform the data. Also performed different analysis on the data.