Choose any three of the “wide” data sets identified in the Week 6 Discussion items.
(You may use your own data set; please don’t use my Sample Post data set, since that was used in your Week 6 assignment!)
For each of the three chosen data sets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the data set.
You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
[Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen data sets: The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
Load required packages
library(tidyr)
library(dplyr)
library(ggplot2)
Below is a dataset on prison admissions and population at the county level. It was used for the NYTimes article “A Small Indiana County Sends More People to Prison Than San Francisco and Durham, N.C., Combined. Why?”
The data is in wide format with prison admissions/10k residents, prison admissions count, and prison population count at three different time points reported per county. Also the year information is part of the column name
#Read the CSV data
prisonData<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/county-prison-admissions.csv",stringsAsFactors=FALSE)
head(prisonData,10)
## fips county state admitsPer10k2006 admitsPer10k2013
## 1 1001 Autauga County AL 44.25665 19.225189
## 2 1003 Baldwin County AL 24.63739 17.703371
## 3 1005 Barbour County AL 75.39988 10.378827
## 4 1007 Bibb County AL 21.97416 11.109136
## 5 1009 Blount County AL 16.13939 11.781012
## 6 1011 Bullock County AL 66.07110 8.486563
## 7 1013 Butler County AL 36.36540 37.458721
## 8 1015 Calhoun County AL 46.04380 36.894987
## 9 1017 Chambers County AL 50.66628 24.579371
## 10 1019 Cherokee County AL 24.96735 38.292169
## admitsPer10k2014 valid06 valid13 valid14 population2006 population2013
## 1 18.593736 true true true 51328 55136
## 2 16.540820 true true true 168121 195443
## 3 12.273589 true true true 27861 26978
## 4 6.664889 true true true 22099 22504
## 5 6.930127 true true true 55485 57720
## 6 NA true true 10776 10605
## 7 34.489555 true true true 20815 20289
## 8 37.095828 true true true 115388 116547
## 9 25.824627 true true true 34945 34175
## 10 31.877712 true true true 25466 26115
## population2014 admissions2006 admissions2013 admissions2014 source
## 1 55395 243 106 103 NCRP
## 2 200111 461 346 331 NCRP
## 3 26887 206 28 33 NCRP
## 4 22506 50 25 15 NCRP
## 5 57719 93 68 40 NCRP
## 6 10764 71 9 NCRP
## 7 20296 75 76 70 NCRP
## 8 115916 542 430 430 NCRP
## 9 34076 173 84 88 NCRP
## 10 26037 65 100 83 NCRP
Clean up the column name ,extract the year information from column name
prisonData[,"admissions2014"] <- as.numeric(prisonData[, "admissions2014"])
## Warning: NAs introduced by coercion
#extract the year information from column name (extract last 4 characters of the column name)
#convert the data into long format by creating a new column for "year" value
prisonData.long <- pivot_longer(prisonData, cols = -c("fips","county","state","valid06","valid13","valid14","source"),
names_to = c("type","year"),
names_sep = -4,
values_drop_na = TRUE,
values_to = "values2")
#values_ptypes = list(values2 = double())
#widen the data by converting the admitsPer10k,population,admissions column values into new columns for each year
#clean up the redundant columns
prisonDataFinal<-pivot_wider(prisonData.long ,names_from = type, values_from = values2) %>% select (fips,county,state,year,admitsPer10k,population,admissions)
head(prisonDataFinal,10)
## # A tibble: 10 x 7
## fips county state year admitsPer10k population admissions
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1001 Autauga County AL 2006 44.3 51328 243
## 2 1001 Autauga County AL 2013 19.2 55136 106
## 3 1001 Autauga County AL 2014 18.6 55395 103
## 4 1003 Baldwin County AL 2006 24.6 168121 461
## 5 1003 Baldwin County AL 2013 17.7 195443 346
## 6 1003 Baldwin County AL 2014 16.5 200111 331
## 7 1005 Barbour County AL 2006 75.4 27861 206
## 8 1005 Barbour County AL 2013 10.4 26978 28
## 9 1005 Barbour County AL 2014 12.3 26887 33
## 10 1007 Bibb County AL 2006 22.0 22099 50
Find the county,state and year which has maximum admissions ?
prisonDataFinal %>% filter(admissions == max(prisonDataFinal$admissions,na.rm = TRUE))
## # A tibble: 1 x 7
## fips county state year admitsPer10k population admissions
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 6037 Los Angeles County CA 2006 34.2 9737955 33793
US County populations data:
The description of the data is here:
#Read the data
usPopulationDf<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/acs2015_county_data.csv")
head(usPopulationDf,10)
## CensusId State County TotalPop Men Women Hispanic White Black Native
## 1 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4
## 2 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6
## 3 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2
## 4 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4
## 5 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3
## 6 1011 Alabama Bullock 10678 5660 5018 4.4 22.2 70.7 1.2
## 7 1013 Alabama Butler 20354 9502 10852 1.2 53.3 43.8 0.1
## 8 1015 Alabama Calhoun 116648 56274 60374 3.5 73.0 20.3 0.2
## 9 1017 Alabama Chambers 34079 16258 17821 0.4 57.3 40.3 0.2
## 10 1019 Alabama Cherokee 26008 12975 13033 1.5 91.7 4.8 0.6
## Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty
## 1 1.0 0 40725 51281 2391 24974 1080 12.9
## 2 0.7 0 147695 50254 1263 27317 711 13.4
## 3 0.4 0 20714 32964 2973 16824 798 26.7
## 4 0.1 0 17495 38678 3995 18431 1618 16.8
## 5 0.1 0 42345 45813 3141 20532 708 16.7
## 6 0.2 0 8057 31938 5884 17580 2055 24.6
## 7 0.4 0 15581 32229 1793 18390 714 25.4
## 8 0.9 0 88612 41703 925 21374 489 20.5
## 9 0.8 0 26462 34177 2949 21071 1366 21.6
## 10 0.3 0 20600 36296 1710 21811 1556 19.2
## ChildPoverty Professional Service Office Construction Production Drive
## 1 18.6 33.2 17.0 24.2 8.6 17.1 87.5
## 2 19.2 33.1 17.7 27.1 10.8 11.2 84.7
## 3 45.3 26.8 16.1 23.1 10.8 23.1 83.8
## 4 27.9 21.5 17.9 17.8 19.0 23.7 83.2
## 5 27.2 28.5 14.1 23.9 13.5 19.9 84.9
## 6 38.4 18.8 15.0 19.7 20.1 26.4 74.9
## 7 39.2 27.5 16.6 21.9 10.3 23.7 84.5
## 8 31.6 27.3 17.7 24.2 10.5 20.4 85.3
## 9 37.2 23.3 14.5 26.3 11.5 24.4 85.1
## 10 30.1 29.3 16.0 19.5 13.7 21.5 83.9
## Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork
## 1 8.8 0.1 0.5 1.3 1.8 26.5 23986 73.6
## 2 8.8 0.1 1.0 1.4 3.9 26.4 85953 81.5
## 3 10.9 0.4 1.8 1.5 1.6 24.1 8597 71.8
## 4 13.5 0.5 0.6 1.5 0.7 28.8 8294 76.8
## 5 11.2 0.4 0.9 0.4 2.3 34.9 22189 82.0
## 6 14.9 0.7 5.0 1.7 2.8 27.5 3865 79.5
## 7 12.4 0.0 0.8 0.6 1.7 24.6 7813 77.4
## 8 9.4 0.2 1.2 1.2 2.7 24.1 47401 74.1
## 9 11.9 0.2 0.3 0.4 2.1 25.1 13689 85.1
## 10 12.1 0.2 0.6 0.7 2.5 27.4 10155 73.1
## PublicWork SelfEmployed FamilyWork Unemployment
## 1 20.9 5.5 0.0 7.6
## 2 12.3 5.8 0.4 7.5
## 3 20.8 7.3 0.1 17.6
## 4 16.1 6.7 0.4 8.3
## 5 13.5 4.2 0.4 7.7
## 6 15.1 5.4 0.0 18.0
## 7 16.2 6.2 0.2 10.9
## 8 20.8 5.0 0.1 12.3
## 9 12.1 2.8 0.0 8.9
## 10 18.5 7.9 0.5 7.9
#clean up the columns
usPopulationClndf <- usPopulationDf %>%select(1:12)
#Convert the 7:12 ethnicity values into ethnicity column and it's value into
usPopulationDfLong <- pivot_longer(usPopulationClndf,cols = c(7:12),names_to ="ethnicity",values_to ="ethnicPopulation")
head(usPopulationDfLong,10)
## # A tibble: 10 x 8
## CensusId State County TotalPop Men Women ethnicity ethnicPopulation
## <int> <chr> <chr> <int> <int> <int> <chr> <dbl>
## 1 1001 Alabama Autauga 55221 26745 28476 Hispanic 2.6
## 2 1001 Alabama Autauga 55221 26745 28476 White 75.8
## 3 1001 Alabama Autauga 55221 26745 28476 Black 18.5
## 4 1001 Alabama Autauga 55221 26745 28476 Native 0.4
## 5 1001 Alabama Autauga 55221 26745 28476 Asian 1
## 6 1001 Alabama Autauga 55221 26745 28476 Pacific 0
## 7 1003 Alabama Baldwin 195121 95314 99807 Hispanic 4.5
## 8 1003 Alabama Baldwin 195121 95314 99807 White 83.1
## 9 1003 Alabama Baldwin 195121 95314 99807 Black 9.5
## 10 1003 Alabama Baldwin 195121 95314 99807 Native 0.6
Which state and county has the highest Hispanic population ?
usPopulationDfLong %>% filter(ethnicity =='Hispanic') %>%
filter(ethnicPopulation == max(ethnicPopulation,na.rm = TRUE))
## # A tibble: 7 x 8
## CensusId State County TotalPop Men Women ethnicity ethnicPopulation
## <int> <chr> <chr> <int> <int> <int> <chr> <dbl>
## 1 72019 Puerto Ri~ Barranqui~ 29805 14695 15110 Hispanic 99.9
## 2 72029 Puerto Ri~ Canóvanas 47432 22973 24459 Hispanic 99.9
## 3 72079 Puerto Ri~ Lajas 24722 12065 12657 Hispanic 99.9
## 4 72087 Puerto Ri~ LoÃza 28454 13335 15119 Hispanic 99.9
## 5 72093 Puerto Ri~ Maricao 6315 3144 3171 Hispanic 99.9
## 6 72095 Puerto Ri~ Maunabo 11701 5779 5922 Hispanic 99.9
## 7 72107 Puerto Ri~ Orocovis 22595 11376 11219 Hispanic 99.9
Rising murder rate across different US cities
The data is wide format with one column for each year’s murders
#Read the data
murderRate <-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/murder_data.csv")
head(murderRate,10)
## city state X2015_murders X2016_murders X2017_murders
## 1 Chicago Illinois 378 536 620
## 2 Orlando Florida 19 73 88
## 3 Memphis Tennessee 114 158 178
## 4 Phoenix Arizona 72 111 120
## 5 Las Vegas Nevada 90 125 120
## 6 San Antonio Texas 78 111 132
## 7 Louisville Kentucky 52 79 91
## 8 Dallas Texas 95 118 121
## 9 Houston Texas 191 212 230
## 10 Fort Wayne Indiana 17 34 42
## X2018_murders
## 1 765
## 2 102
## 3 192
## 4 131
## 5 140
## 6 148
## 7 102
## 8 136
## 9 246
## 10 53
## source
## 1 https://portal.chicagopolice.org/portal/page/portal/ClearPath/News/Crime%20Statistics
## 2 OPD
## 3 MPD
## 4 PPD
## 5 http://www.lvmpd.com/Sections/Homicide/HomicideLog/tabid/454/Default.aspx
## 6 SAPD
## 7 https://louisvilleky.gov/government/police/lmpd-transparency
## 8 DPD
## 9 http://www.houstontx.gov/police/cs/index-2.htm
## 10 FWPD
## as_of
## 1 10/2/2016
## 2 9/22/2016
## 3 9/11/2016
## 4 8/31/2016
## 5 9/28/2016
## 6 9/26/2016
## 7 8/31/2016
## 8 8/31/2016
## 9 8/31/2016
## 10 9/26/2016
#clean up the column name
colnames(murderRate)[3] <- '2015_murders'
colnames(murderRate)[4] <- '2016_murders'
colnames(murderRate)[5] <- '2017_murders'
colnames(murderRate)[6] <- '2018_murders'
#convert the column names into year and murder
murderRate.long <- pivot_longer(murderRate, cols = -c("city","state","source","as_of"),
names_to = c("year","murders"),
names_sep = "_",
values_drop_na = TRUE,
values_to = "murder")
#clean up the redundant columns
murderRate.final<-murderRate.long %>% select(city,state,year,murder)
head(murderRate.final,10)
## # A tibble: 10 x 4
## city state year murder
## <chr> <chr> <chr> <int>
## 1 Chicago Illinois 2015 378
## 2 Chicago Illinois 2016 536
## 3 Chicago Illinois 2017 620
## 4 Chicago Illinois 2018 765
## 5 Orlando Florida 2015 19
## 6 Orlando Florida 2016 73
## 7 Orlando Florida 2017 88
## 8 Orlando Florida 2018 102
## 9 Memphis Tennessee 2015 114
## 10 Memphis Tennessee 2016 158
Which year,city,state had the highest murder ?
murderRate.final %>% filter(murder == max(murder,na.rm = TRUE))
## # A tibble: 1 x 4
## city state year murder
## <chr> <chr> <chr> <int>
## 1 Chicago Illinois 2018 765
This is a sample data which is in wide format with movie names are listed as columns
#Read the data
movieRating <-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/movie_data.csv")
head(movieRating,10)
## RatingId GREENLAND THE.CROODS..A.NEW.AGE MULAN SUMMERLAND GREYHOUND
## 1 1 5 3 1 2 5
## 2 2 NA 2 4 3 5
## 3 3 1 NA 5 5 NA
## 4 4 3 2 NA 4 3
## 5 5 3 NA 2 5 5
## 6 6 4 NA 3 3 4
#convert the data into long format
movieRating.long <- pivot_longer(movieRating,cols = c(2:6),names_to ="movies",values_to ="rating")
#Add meaningful value to the rating
movieRatingFinal <- mutate(movieRating.long,review = case_when(rating == 5 ~'Excellent',rating == 4 ~'Good',rating == 3 ~'Not Bad',rating == 2 ~'Bad',rating == 1 ~'Awful' ))
head(movieRatingFinal,10)
## # A tibble: 10 x 4
## RatingId movies rating review
## <int> <chr> <int> <chr>
## 1 1 GREENLAND 5 Excellent
## 2 1 THE.CROODS..A.NEW.AGE 3 Not Bad
## 3 1 MULAN 1 Awful
## 4 1 SUMMERLAND 2 Bad
## 5 1 GREYHOUND 5 Excellent
## 6 2 GREENLAND NA <NA>
## 7 2 THE.CROODS..A.NEW.AGE 2 Bad
## 8 2 MULAN 4 Good
## 9 2 SUMMERLAND 3 Not Bad
## 10 2 GREYHOUND 5 Excellent
Which movie has the highest rating
movieRatingAgg <- aggregate(movieRatingFinal$rating , by=list(Name=movieRatingFinal$movies), FUN=sum,na.rm = TRUE)
movieRatingAgg %>% filter(x == max(x,na.rm = TRUE))
## Name x
## 1 GREYHOUND 22
## 2 SUMMERLAND 22
Read different dataset which are untidy ,used tidyr and dplyr inorder to tidy and transform the data. Also performed different analysis on the data.