The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
I will be using following datasets from week 6 discussion forums.
Income
limits data - Khyati Naik (me)
Zillow Home Values -
Sanielle Worrell
Gun
violence data - Keeno Glanville
I downloaded the data from above websites and have saved it in my github repository. I will be reading in daatsets from my github repository to perform data manipulation and analysis.
library(tidyverse)
#provide the github data path
dt_inc_lim <- "https://raw.githubusercontent.com/Naik-Khyati/data_prep_proj2/main/data/Section8-FY22.csv"
raw_dt_inc_lim <- read.csv(dt_inc_lim, sep=",", stringsAsFactors=FALSE)
glimpse(raw_dt_inc_lim)
## Rows: 4,765
## Columns: 35
## $ State_Alpha <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ fips2010 <dbl> 100199999, 100399999, 100599999, 100799999, 100999999…
## $ cbsasub <chr> "METRO33860M33860", "METRO19300M19300", "NCNTY01005N0…
## $ median2022 <int> 75500, 85500, 46400, 84800, 84800, 48200, 56100, 6880…
## $ Metro_Area_Name <chr> "Montgomery, AL MSA", "Daphne-Fairhope-Foley, AL MSA"…
## $ county_town_name <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ l50_1 <int> 26450, 29950, 21600, 29700, 29700, 21600, 21600, 2410…
## $ l50_2 <int> 30200, 34200, 24650, 33950, 33950, 24650, 24650, 2755…
## $ l50_3 <int> 34000, 38500, 27750, 38200, 38200, 27750, 27750, 3100…
## $ l50_4 <int> 37750, 42750, 30800, 42400, 42400, 30800, 30800, 3440…
## $ l50_5 <int> 40800, 46200, 33300, 45800, 45800, 33300, 33300, 3720…
## $ l50_6 <int> 43800, 49600, 35750, 49200, 49200, 35750, 35750, 3995…
## $ l50_7 <int> 46850, 53050, 38200, 52600, 52600, 38200, 38200, 4270…
## $ l50_8 <int> 49850, 56450, 40700, 56000, 56000, 40700, 40700, 4545…
## $ ELI_1 <int> 15900, 18000, 13590, 17850, 17850, 13590, 13590, 1450…
## $ ELI_2 <int> 18310, 20550, 18310, 20400, 20400, 18310, 18310, 1831…
## $ ELI_3 <int> 23030, 23100, 23030, 23030, 23030, 23030, 23030, 2303…
## $ ELI_4 <int> 27750, 27750, 27750, 27750, 27750, 27750, 27750, 2775…
## $ ELI_5 <int> 32470, 32470, 32470, 32470, 32470, 32470, 32470, 3247…
## $ ELI_6 <int> 37190, 37190, 35750, 37190, 37190, 35750, 35750, 3719…
## $ ELI_7 <int> 41910, 41910, 38200, 41910, 41910, 38200, 38200, 4191…
## $ ELI_8 <int> 46630, 46630, 40700, 46630, 46630, 40700, 40700, 4545…
## $ l80_1 <int> 42300, 47900, 34550, 47500, 47500, 34550, 34550, 3855…
## $ l80_2 <int> 48350, 54750, 39450, 54300, 54300, 39450, 39450, 4405…
## $ l80_3 <int> 54400, 61600, 44400, 61100, 61100, 44400, 44400, 4955…
## $ l80_4 <int> 60400, 68400, 49300, 67850, 67850, 49300, 49300, 5505…
## $ l80_5 <int> 65250, 73900, 53250, 73300, 73300, 53250, 53250, 5950…
## $ l80_6 <int> 70100, 79350, 57200, 78750, 78750, 57200, 57200, 6390…
## $ l80_7 <int> 74900, 84850, 61150, 84150, 84150, 61150, 61150, 6830…
## $ l80_8 <int> 79750, 90300, 65100, 89600, 89600, 65100, 65100, 7270…
## $ state <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ county <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29…
## $ County_Name <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ state_name <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
## $ metro <int> 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
The dataset has 4765 rows and 35 columns.
vli_wide <- raw_dt_inc_lim %>% select(metro, starts_with('l50_')) %>%
group_by (metro) %>%
summarise(across(everything(), list(mean)))
vli_long <- vli_wide %>% gather("vli_cat","mean_vli", 2:9)
head(vli_long)
## # A tibble: 6 × 3
## metro vli_cat mean_vli
## <int> <chr> <dbl>
## 1 0 l50_1_1 26747.
## 2 1 l50_1_1 32725.
## 3 0 l50_2_1 30560.
## 4 1 l50_2_1 37393.
## 5 0 l50_3_1 34384.
## 6 1 l50_3_1 42067.
Since the fill variable has to be of character to create a chart.
# converting integer to character to be used in fill for the chart
vli_long$metro = as.character(vli_long$metro)
ggplot(vli_long, aes(x = vli_cat, y = mean_vli, fill = metro )) +
geom_bar(position = "dodge",stat = "identity") +
xlab("# Person in Family") +
ylab("Mean Very Low Income") +
theme_minimal()
Income limit for household of 1 member is the lowest and income limit for household of 8 members is the highest. For all sizes of families (1 to 8), metro income limit is higher than non metro income limit.
vli_metro_wide <- vli_long %>% spread(metro,mean_vli)
vli_metro_wide <- vli_metro_wide %>% rename('non_metro'='0', 'metro'='1') %>% mutate(pct_met_high = (metro/non_metro-1)*100)
head(vli_metro_wide)
## # A tibble: 6 × 4
## vli_cat non_metro metro pct_met_high
## <chr> <dbl> <dbl> <dbl>
## 1 l50_1_1 26747. 32725. 22.3
## 2 l50_2_1 30560. 37393. 22.4
## 3 l50_3_1 34384. 42067. 22.3
## 4 l50_4_1 38176. 46716. 22.4
## 5 l50_5_1 41258. 50482. 22.4
## 6 l50_6_1 44311. 54215. 22.4
Mean income in metro areas is about 22% higher than non metro areas for very low income households. Across all family size (1 to 8), mean income is consistently higher by the same proportion of 22%.
#provide the github data path
dt_zhv <- "https://raw.githubusercontent.com/Naik-Khyati/data_prep_proj2/main/data/State_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
raw_dt_zhv <- read.csv(dt_zhv, sep=",", stringsAsFactors=FALSE)
glimpse(raw_dt_zhv)
## Rows: 51
## Columns: 277
## $ RegionID <int> 9, 54, 14, 43, 47, 21, 44, 16, 36, 30, 40, 56, 59, 8, 26, …
## $ SizeRank <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ RegionName <chr> "California", "Texas", "Florida", "New York", "Pennsylvani…
## $ RegionType <chr> "state", "state", "state", "state", "state", "state", "sta…
## $ StateName <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ X2000.01.31 <dbl> 199037, 114095, 115670, 150013, 106281, 147480, 110729, 13…
## $ X2000.02.29 <dbl> 199883, 114219, 116011, 150780, 106641, 147803, 110893, 13…
## $ X2000.03.31 <dbl> 200947, 114259, 116372, 151421, 106931, 148243, 111066, 13…
## $ X2000.04.30 <dbl> 203187, 114465, 117073, 152775, 107498, 149108, 111517, 13…
## $ X2000.05.31 <dbl> 205587, 114515, 117740, 153998, 107988, 149999, 112063, 13…
## $ X2000.06.30 <dbl> 208086, 114586, 118403, 155256, 108434, 150866, 112649, 13…
## $ X2000.07.31 <dbl> 210682, 114591, 119056, 156279, 108885, 151822, 113145, 13…
## $ X2000.08.31 <dbl> 213433, 114808, 119728, 157127, 109278, 152932, 113607, 13…
## $ X2000.09.30 <dbl> 216199, 115059, 120412, 157870, 109694, 154078, 114066, 13…
## $ X2000.10.31 <dbl> 218822, 115321, 121110, 158720, 110090, 155213, 114518, 13…
## $ X2000.11.30 <dbl> 221448, 115653, 121846, 159850, 110576, 156245, 114975, 13…
## $ X2000.12.31 <dbl> 224027, 116033, 122614, 161062, 111082, 157223, 115427, 14…
## $ X2001.01.31 <dbl> 226097, 116268, 123381, 162223, 111587, 158137, 115863, 14…
## $ X2001.02.28 <dbl> 228240, 116225, 124189, 163187, 112075, 158957, 116234, 14…
## $ X2001.03.31 <dbl> 230690, 116019, 125044, 164142, 112606, 159916, 116622, 14…
## $ X2001.04.30 <dbl> 233777, 115939, 125974, 165161, 113184, 160842, 117011, 14…
## $ X2001.05.31 <dbl> 236629, 116032, 126940, 166252, 113671, 161964, 117389, 14…
## $ X2001.06.30 <dbl> 239108, 116169, 127940, 167407, 114143, 163005, 117707, 14…
## $ X2001.07.31 <dbl> 241455, 116292, 129027, 168589, 114640, 164099, 118057, 14…
## $ X2001.08.31 <dbl> 243800, 116382, 130178, 169807, 115188, 165209, 118439, 14…
## $ X2001.09.30 <dbl> 246136, 116489, 131345, 170951, 115723, 166385, 118820, 14…
## $ X2001.10.31 <dbl> 248340, 116613, 132503, 172175, 116198, 167582, 119183, 14…
## $ X2001.11.30 <dbl> 250418, 116685, 133592, 173467, 116718, 168618, 119508, 14…
## $ X2001.12.31 <dbl> 252377, 116754, 134681, 174863, 117206, 169487, 119818, 14…
## $ X2002.01.31 <dbl> 253884, 116821, 135720, 176260, 117716, 170285, 120087, 14…
## $ X2002.02.28 <dbl> 255546, 117019, 136713, 177542, 118259, 171174, 120350, 14…
## $ X2002.03.31 <dbl> 257505, 117381, 137707, 178639, 118855, 172299, 120633, 15…
## $ X2002.04.30 <dbl> 260207, 117846, 138778, 179766, 119516, 173241, 120920, 15…
## $ X2002.05.31 <dbl> 263230, 118304, 139918, 180954, 120190, 174268, 121221, 15…
## $ X2002.06.30 <dbl> 266535, 118801, 141098, 182332, 120900, 175171, 121516, 15…
## $ X2002.07.31 <dbl> 270418, 119281, 142284, 183764, 121681, 176369, 121846, 15…
## $ X2002.08.31 <dbl> 274548, 119798, 143583, 185368, 122562, 177633, 122165, 15…
## $ X2002.09.30 <dbl> 278669, 120211, 144811, 186961, 123448, 178886, 122485, 15…
## $ X2002.10.31 <dbl> 282537, 120562, 146047, 188505, 124304, 180052, 122816, 15…
## $ X2002.11.30 <dbl> 286382, 120866, 147195, 189992, 124997, 180905, 123178, 15…
## $ X2002.12.31 <dbl> 290199, 121131, 148525, 191592, 125704, 181883, 123548, 15…
## $ X2003.01.31 <dbl> 293431, 121386, 149876, 193171, 126396, 182822, 123889, 15…
## $ X2003.02.28 <dbl> 296475, 121582, 151303, 194764, 127143, 183953, 124242, 15…
## $ X2003.03.31 <dbl> 299798, 121837, 152759, 196566, 127860, 184979, 124575, 15…
## $ X2003.04.30 <dbl> 303735, 122058, 154330, 198257, 128644, 186146, 124932, 15…
## $ X2003.05.31 <dbl> 308044, 122274, 155933, 199967, 129603, 187159, 125273, 15…
## $ X2003.06.30 <dbl> 311736, 122473, 157490, 201597, 130722, 188098, 125667, 15…
## $ X2003.07.31 <dbl> 315864, 122661, 159027, 203342, 131893, 189001, 126092, 15…
## $ X2003.08.31 <dbl> 320270, 122903, 160625, 205368, 132950, 190201, 126594, 15…
## $ X2003.09.30 <dbl> 325320, 123094, 162281, 207520, 133997, 191567, 127084, 15…
## $ X2003.10.31 <dbl> 330049, 123310, 163946, 209668, 134976, 192871, 127559, 15…
## $ X2003.11.30 <dbl> 334587, 123436, 165559, 211399, 135851, 193942, 127927, 15…
## $ X2003.12.31 <dbl> 338999, 123562, 167092, 212976, 136633, 194888, 128278, 15…
## $ X2004.01.31 <dbl> 343702, 124102, 168652, 214629, 137334, 196004, 128578, 15…
## $ X2004.02.29 <dbl> 349090, NA, 170235, 216234, 138275, 197117, 128951, 159140…
## $ X2004.03.31 <dbl> 355174, NA, 172026, 217543, 139189, 198283, 129339, 159506…
## $ X2004.04.30 <dbl> 362112, 128271, 173996, 218922, 140237, 199581, 129712, 16…
## $ X2004.05.31 <dbl> 370122, 129224, 176301, 220597, 141455, 201016, 130073, 16…
## $ X2004.06.30 <dbl> 378974, 129689, 178900, 222653, 142860, 202604, 130419, 16…
## $ X2004.07.31 <dbl> 387711, 130251, 181765, 224889, 144487, 204055, 130803, 16…
## $ X2004.08.31 <dbl> 395866, 130750, 184917, 227088, 145947, 205495, 131123, 16…
## $ X2004.09.30 <dbl> 403633, 131386, 188001, 229422, 147422, 206719, 131438, 16…
## $ X2004.10.31 <dbl> 411215, 131932, 190895, 231627, 148626, 207982, 131738, 16…
## $ X2004.11.30 <dbl> 417616, 132414, 193581, 233900, 149949, 209256, 132064, 16…
## $ X2004.12.31 <dbl> 424103, 132835, 196434, 236024, 151233, 210700, 132372, 16…
## $ X2005.01.31 <dbl> 429685, 133415, 199469, 238097, 152601, 212005, 132735, 16…
## $ X2005.02.28 <dbl> 436252, 133918, 202768, 240152, 153592, 213348, 133130, 16…
## $ X2005.03.31 <dbl> 443040, 134418, 206517, 242003, 154631, 214670, 133533, 16…
## $ X2005.04.30 <dbl> 450862, 134801, 210908, 244040, 155825, 216182, 133906, 16…
## $ X2005.05.31 <dbl> 458029, 135198, 215708, 246208, 157322, 217755, 134215, 16…
## $ X2005.06.30 <dbl> 464500, 135603, 220890, 248426, 159024, 219302, 134545, 16…
## $ X2005.07.31 <dbl> 470776, 135919, 226389, 250799, 160787, 220776, 134837, 16…
## $ X2005.08.31 <dbl> 477297, 136218, 231998, 253271, 162606, 222231, 135167, 16…
## $ X2005.09.30 <dbl> 483827, 136521, 237338, 255866, 164233, 223864, 135461, 16…
## $ X2005.10.31 <dbl> 490183, 136730, 242107, 258447, 165693, 225429, 135765, 16…
## $ X2005.11.30 <dbl> 494866, 137031, 246517, 260647, 167032, 226786, 136057, 16…
## $ X2005.12.31 <dbl> 499214, 137225, 250674, 262940, 168300, 228087, 136311, 17…
## $ X2006.01.31 <dbl> 502729, 137451, 254595, 264861, 169390, 229411, 136503, 17…
## $ X2006.02.28 <dbl> 505986, 137746, 258243, 266662, 170327, 230696, 136651, 17…
## $ X2006.03.31 <dbl> 509194, 138135, 262211, 268068, 171238, 231975, 136793, 17…
## $ X2006.04.30 <dbl> 512001, 138658, 266624, 269276, 172255, 233055, 136959, 17…
## $ X2006.05.31 <dbl> 514920, 139287, 270598, 270461, 173432, 234231, 137130, 17…
## $ X2006.06.30 <dbl> 516585, 139913, 273738, 271738, 174709, 234944, 137273, 17…
## $ X2006.07.31 <dbl> 516854, 140506, 275637, 272906, 175808, 235550, 137395, 17…
## $ X2006.08.31 <dbl> 516593, 140999, 276843, 273924, 176623, 236018, 137446, 17…
## $ X2006.09.30 <dbl> 515027, 141570, 277072, 274443, 177128, 236388, 137427, 17…
## $ X2006.10.31 <dbl> 513686, 142331, 276707, 274828, 177547, 236572, 137358, 17…
## $ X2006.11.30 <dbl> 512291, 142986, 276074, 274976, 177881, 236679, 137213, 17…
## $ X2006.12.31 <dbl> 511194, 143645, 275006, 275242, 178167, 236924, 137017, 17…
## $ X2007.01.31 <dbl> 510336, 143870, 273819, 275431, 178484, 237449, 136833, 17…
## $ X2007.02.28 <dbl> 509175, 143971, 272625, 275583, 178898, 237891, 136738, 17…
## $ X2007.03.31 <dbl> 507418, 143968, 271247, 275867, 179433, 238173, 136561, 17…
## $ X2007.04.30 <dbl> 505218, 144348, 269654, 275879, 179870, 238164, 136383, 17…
## $ X2007.05.31 <dbl> 501555, 144839, 267502, 276118, 180309, 238003, 136183, 17…
## $ X2007.06.30 <dbl> 497687, 145398, 264886, 276041, 180602, 237705, 136025, 17…
## $ X2007.07.31 <dbl> 492554, 145819, 261773, 276104, 180980, 237240, 135776, 17…
## $ X2007.08.31 <dbl> 487542, 146306, 258425, 276187, 181183, 236744, 135493, 17…
## $ X2007.09.30 <dbl> 481698, 146766, 255060, 276157, 181237, 235967, 135312, 17…
## $ X2007.10.31 <dbl> 475676, 147174, 251604, 276400, 181279, 235195, 135056, 17…
## $ X2007.11.30 <dbl> 469729, 147489, 247948, 276559, 181375, 234413, 134584, 17…
## $ X2007.12.31 <dbl> 463195, 147793, 244275, 276579, 181423, 233422, 134031, 17…
## $ X2008.01.31 <dbl> 456917, 147392, 240154, 276578, 181285, 232374, 133478, 17…
## $ X2008.02.29 <dbl> 448998, 146871, 236247, 276306, 181226, NA, 132975, 175487…
## $ X2008.03.31 <dbl> 439505, 146226, 231954, 276097, 181109, NA, 132590, 174484…
## $ X2008.04.30 <dbl> 428469, 146095, 227960, 275618, 181044, 227152, 132255, 17…
## $ X2008.05.31 <dbl> 417161, 145928, 223498, 274887, 180742, NA, 132014, 173036…
## $ X2008.06.30 <dbl> 406642, 145727, 219351, 273960, 180331, 224434, 131698, 17…
## $ X2008.07.31 <dbl> 395775, 145564, 215030, 272761, 179590, 223532, 131399, 17…
## $ X2008.08.31 <dbl> 385543, 145094, 210500, 271554, 178685, 220748, 130996, 17…
## $ X2008.09.30 <dbl> 376244, 144477, 205921, 270191, 177792, 218955, 130519, 16…
## $ X2008.10.31 <dbl> 368453, 143791, 201287, 268509, 177073, 215937, 129861, 16…
## $ X2008.11.30 <dbl> 360429, 143171, 196772, 266723, 176364, 213516, 129238, 16…
## $ X2008.12.31 <dbl> 352774, 142612, 192500, 265094, 175555, 210352, 128438, 16…
## $ X2009.01.31 <dbl> 345731, 142092, 188494, 263605, 174747, 207031, 127535, 16…
## $ X2009.02.28 <dbl> 341313, 141949, 185038, 262363, 174324, 204660, 126578, 15…
## $ X2009.03.31 <dbl> 337318, 141956, 181702, 261243, 174156, 202170, 125802, 15…
## $ X2009.04.30 <dbl> 333307, 142042, 178317, 260308, 174025, 200952, 125360, 15…
## $ X2009.05.31 <dbl> 329379, 142072, 175089, 259418, 173707, 199057, 125086, 15…
## $ X2009.06.30 <dbl> 326150, 142031, 172052, 258391, 173301, 197780, 124815, 15…
## $ X2009.07.31 <dbl> 323966, 142051, 169596, 257350, 172890, 196491, 124421, 15…
## $ X2009.08.31 <dbl> 322020, 141841, 167242, 256209, 172383, 195283, 123992, 15…
## $ X2009.09.30 <dbl> 320559, 141632, 165085, 255343, 171938, 193341, 123623, 14…
## $ X2009.10.31 <dbl> 320052, 141365, 163326, 254693, 171676, 191576, 123380, 14…
## $ X2009.11.30 <dbl> 321760, 141497, 162320, 254317, 171691, 189894, 123066, 14…
## $ X2009.12.31 <dbl> 324346, 141640, 161474, 254041, 171822, 189345, 122783, 14…
## $ X2010.01.31 <dbl> 326486, 141638, 160800, 253817, 172134, 188434, 122486, 14…
## $ X2010.02.28 <dbl> 326863, 141609, 160175, 253848, 172522, 188087, 122439, 14…
## $ X2010.03.31 <dbl> 328049, 141697, 159803, 253929, 173073, 186836, 122377, 14…
## $ X2010.04.30 <dbl> 329519, 142002, 159434, 254408, 173541, 187275, 122438, 14…
## $ X2010.05.31 <dbl> 331090, 142240, 158880, 254794, 173783, 186710, 122438, 14…
## $ X2010.06.30 <dbl> 330380, 142212, 158026, 255368, 173716, 187339, 122386, 14…
## $ X2010.07.31 <dbl> 328690, 141817, 156873, 255604, 173306, 185326, 122060, 14…
## $ X2010.08.31 <dbl> 326277, 141187, 155473, 255304, 172683, 184175, 121627, 14…
## $ X2010.09.30 <dbl> 324169, 140531, 154005, 254555, 171807, 181990, 121065, 14…
## $ X2010.10.31 <dbl> 321870, 139909, 152488, 253384, 170839, 180569, 120449, 13…
## $ X2010.11.30 <dbl> 319983, 139274, 150966, 252329, 169800, 178247, 119691, 13…
## $ X2010.12.31 <dbl> 318820, 138762, 149550, 251270, 168874, 176209, 118862, 13…
## $ X2011.01.31 <dbl> 317873, 138307, 148091, 250644, 168096, 174411, 118115, 13…
## $ X2011.02.28 <dbl> 316849, 137842, 146745, 250128, 167484, 173017, 117415, 13…
## $ X2011.03.31 <dbl> 315038, 137426, 145325, 249844, 167029, 172052, 116913, 13…
## $ X2011.04.30 <dbl> 313299, 137044, 143996, 249305, 166465, 170666, 116366, 13…
## $ X2011.05.31 <dbl> 310414, 136752, 142652, 248773, 165812, 169876, 115873, 13…
## $ X2011.06.30 <dbl> 308628, 136383, 141502, 247920, 164959, 168582, 115403, 12…
## $ X2011.07.31 <dbl> 306980, 136237, 140438, 247192, 164179, 168081, 115126, 12…
## $ X2011.08.31 <dbl> 304998, 136112, 139672, 246482, 163574, 166903, 114903, 12…
## $ X2011.09.30 <dbl> 302559, 136061, 139107, 245848, 163170, 165894, 114627, 12…
## $ X2011.10.31 <dbl> 300074, 135876, 138846, 245189, 162802, 163915, 114257, 12…
## $ X2011.11.30 <dbl> 299424, 135594, 138726, 244448, 162397, 162685, 113864, 12…
## $ X2011.12.31 <dbl> 298844, 135303, 138676, 243879, 162006, 160863, 113511, 12…
## $ X2012.01.31 <dbl> 298635, 135217, 138739, 243364, 161678, 159961, 113055, 12…
## $ X2012.02.29 <dbl> 298383, 135466, 138845, 242832, 161440, 158290, 112617, 12…
## $ X2012.03.31 <dbl> 298524, 135913, 139159, 242459, 161342, 158087, 112366, 12…
## $ X2012.04.30 <dbl> 299218, 136317, 139549, 242355, 161371, 157766, 112325, 12…
## $ X2012.05.31 <dbl> 300456, 136742, 140021, 242570, 161452, 158072, 112307, 12…
## $ X2012.06.30 <dbl> 302221, 137154, 140518, 242845, 161402, 157979, 112251, 12…
## $ X2012.07.31 <dbl> 304240, 137655, 141176, 243020, 161224, 157382, 112234, 12…
## $ X2012.08.31 <dbl> 306538, 138014, 141971, 243114, 161018, 156749, 112277, 12…
## $ X2012.09.30 <dbl> 309123, 138223, 142944, 243122, 160828, 155964, 112324, 12…
## $ X2012.10.31 <dbl> 312608, 138385, 144081, 242997, 160708, 156209, 112328, 12…
## $ X2012.11.30 <dbl> 316463, 138748, 145269, 243013, 160629, 156519, 112283, 12…
## $ X2012.12.31 <dbl> 321847, 139261, 146640, 243167, 160671, 156777, 112260, 12…
## $ X2013.01.31 <dbl> 327341, 139947, 148205, 243430, 160833, 157302, 112286, 12…
## $ X2013.02.28 <dbl> 333693, 140683, 149931, 243937, 161103, 158096, 112416, 12…
## $ X2013.03.31 <dbl> 340160, 141591, 151616, 244450, 161455, 158858, 112564, 13…
## $ X2013.04.30 <dbl> 347140, 142537, 153188, 245359, 161920, 159829, 112868, 13…
## $ X2013.05.31 <dbl> 354166, 143476, 154949, 246275, 162394, 160689, 113325, 13…
## $ X2013.06.30 <dbl> 361540, 144411, 156651, 247303, 162886, 162145, 113933, 13…
## $ X2013.07.31 <dbl> 368559, 145362, 158611, 248166, 163321, 163305, 114495, 13…
## $ X2013.08.31 <dbl> 375058, 146253, 160633, 248802, 163606, 164770, 114942, 13…
## $ X2013.09.30 <dbl> 380799, 147173, 162866, 249363, 163846, 165943, 115214, 13…
## $ X2013.10.31 <dbl> 386201, 148166, 164990, 249993, 163969, 167508, 115534, 13…
## $ X2013.11.30 <dbl> 391681, 149220, 167055, 250670, 164189, 168227, 115840, 14…
## $ X2013.12.31 <dbl> 396876, 150178, 169015, 251266, 164359, 169197, 116098, 14…
## $ X2014.01.31 <dbl> 402202, 151109, 170811, 251873, 164608, 169778, 116061, 14…
## $ X2014.02.28 <dbl> 406778, 152114, 172352, 252503, 164891, 170852, 115966, 14…
## $ X2014.03.31 <dbl> 409219, 153042, 173656, 253287, 165189, 171439, 115890, 14…
## $ X2014.04.30 <dbl> 410132, 153884, 174600, 254079, 165402, 171798, 116011, 14…
## $ X2014.05.31 <dbl> 410463, 154733, 175512, 254815, 165569, 172381, 116259, 14…
## $ X2014.06.30 <dbl> 411002, 155717, 176246, 255450, 165814, 173246, 116606, 14…
## $ X2014.07.31 <dbl> 411459, 156604, 177091, 255830, 166052, 173953, 116959, 14…
## $ X2014.08.31 <dbl> 411259, 157442, 177684, 256154, 166173, 174188, 117090, 14…
## $ X2014.09.30 <dbl> 411942, 158222, 178502, 256431, 166171, 174478, 117207, 14…
## $ X2014.10.31 <dbl> 413546, 159109, 179642, 256729, 166157, 174710, 117222, 15…
## $ X2014.11.30 <dbl> 416257, 160116, 180835, 257032, 166210, 175119, 117461, 15…
## $ X2014.12.31 <dbl> 418698, 161219, 182104, 257456, 166390, 175549, 117791, 15…
## $ X2015.01.31 <dbl> 421284, 162373, 183201, 257830, 166551, 175948, 118313, 15…
## $ X2015.02.28 <dbl> 424047, 163540, 184381, 258123, 166718, 176436, 118878, 15…
## $ X2015.03.31 <dbl> 427057, 164731, 185318, 258301, 167022, 177153, 119434, 15…
## $ X2015.04.30 <dbl> 429170, 165902, 186199, 258778, 167529, 178089, 119959, 15…
## $ X2015.05.31 <dbl> 431433, 166994, 187127, 259403, 168121, 178856, 120532, 15…
## $ X2015.06.30 <dbl> 433060, 168090, 188161, 260218, 168454, 178977, 121063, 15…
## $ X2015.07.31 <dbl> 433725, 169090, 189144, 260980, 168581, 179096, 121545, 15…
## $ X2015.08.31 <dbl> 435015, 169909, 190301, 261646, 168696, 179013, 121913, 15…
## $ X2015.09.30 <dbl> 437202, 170631, 191587, 262237, 168905, 179084, 122226, 15…
## $ X2015.10.31 <dbl> 440934, 171521, 192975, 262657, 169183, 179100, 122497, 15…
## $ X2015.11.30 <dbl> 443916, 172476, 194264, 263094, 169437, 179275, 122816, 16…
## $ X2015.12.31 <dbl> 448004, 173548, 195767, 263391, 169702, 179867, 123125, 16…
## $ X2016.01.31 <dbl> 451830, 174619, 197263, 263687, 170106, 180680, 123549, 16…
## $ X2016.02.29 <dbl> 455184, 175657, 198805, 264226, 170682, 182138, 123977, 16…
## $ X2016.03.31 <dbl> 456935, 176568, 199943, 264996, 171085, 183298, 124573, 16…
## $ X2016.04.30 <dbl> 459213, 177473, 201174, 265966, 171424, 184183, 125173, 16…
## $ X2016.05.31 <dbl> 461417, 178528, 202293, 266932, 171604, 184617, 125716, 16…
## $ X2016.06.30 <dbl> 463755, 179554, 203611, 267791, 172057, 184963, 126166, 16…
## $ X2016.07.31 <dbl> 464367, 180549, 204917, 268573, 172460, 185436, 126586, 16…
## $ X2016.08.31 <dbl> 465952, 181315, 206218, 269380, 172946, 185778, 127035, 16…
## $ X2016.09.30 <dbl> 467769, 182139, 207386, 270146, 173284, 186153, 127509, 16…
## $ X2016.10.31 <dbl> 471333, 183004, 208543, 271027, 173683, 186628, 128000, 16…
## $ X2016.11.30 <dbl> 474161, 184214, 209895, 271983, 174145, 187472, 128587, 16…
## $ X2016.12.31 <dbl> 476690, 185579, 211332, 273123, 174887, 188622, 129260, 17…
## $ X2017.01.31 <dbl> 479565, 186786, 212657, 274208, 175399, 189989, 129998, 17…
## $ X2017.02.28 <dbl> 482525, 187946, 213801, 274968, 175862, 191224, 130744, 17…
## $ X2017.03.31 <dbl> 485476, 188842, 214862, 275701, 176331, 192504, 131512, 17…
## $ X2017.04.30 <dbl> 487933, 189810, 216231, 276655, 177125, 193470, 132342, 17…
## $ X2017.05.31 <dbl> 490856, 190607, 217567, 278031, 177968, 194224, 133129, 17…
## $ X2017.06.30 <dbl> 494155, 191469, 218929, 279665, 178650, 194472, 133824, 17…
## $ X2017.07.31 <dbl> 495683, 192150, 220066, 281143, 179249, 194716, 134381, 17…
## $ X2017.08.31 <dbl> 497374, 192849, 221368, 282479, 179863, 195000, 135028, 17…
## $ X2017.09.30 <dbl> 499704, 193615, 222629, 283680, 180577, 195592, 135731, 17…
## $ X2017.10.31 <dbl> 503836, 194599, 223951, 284933, 181387, 196253, 136544, 18…
## $ X2017.11.30 <dbl> 508085, 195643, 225206, 286257, 182132, 197073, 137269, 18…
## $ X2017.12.31 <dbl> 512309, 196669, 226614, 287434, 182946, 198083, 137976, 18…
## $ X2018.01.31 <dbl> 516966, 197668, 227673, 288779, 183856, 199237, 138497, 18…
## $ X2018.02.28 <dbl> 522015, 198672, 228826, 290180, 184797, 200275, 139205, 18…
## $ X2018.03.31 <dbl> 526349, 199711, 230087, 291730, 186081, 200934, 140047, 18…
## $ X2018.04.30 <dbl> 529734, 200784, 231601, 293149, 187192, 201321, 141136, 18…
## $ X2018.05.31 <dbl> 532445, 201784, 233108, 294442, 188275, 201667, 142153, 19…
## $ X2018.06.30 <dbl> 535166, 202772, 234423, 295673, 188854, 201946, 142969, 19…
## $ X2018.07.31 <dbl> 536884, 203620, 235658, 296976, 189406, 202250, 143712, 19…
## $ X2018.08.31 <dbl> 538273, 204589, 236780, 298371, 189928, 202438, 144243, 19…
## $ X2018.09.30 <dbl> 539162, 205437, 237889, 299844, 190562, 202911, 144758, 19…
## $ X2018.10.31 <dbl> 540899, 206373, 239006, 301169, 191198, 203320, 145270, 19…
## $ X2018.11.30 <dbl> 541954, 207201, 240276, 302428, 191971, 203902, 145941, 19…
## $ X2018.12.31 <dbl> 541811, 208184, 241623, 303552, 192768, 204382, 146680, 19…
## $ X2019.01.31 <dbl> 541188, 209334, 242873, 304688, 193626, 204779, 147558, 19…
## $ X2019.02.28 <dbl> 540315, 210413, 243687, 305771, 194381, 205382, 148431, 20…
## $ X2019.03.31 <dbl> 540144, 211145, 244248, 306996, 194934, 205815, 149324, 20…
## $ X2019.04.30 <dbl> 540867, 211585, 244742, 307966, 195361, 206281, 149950, 20…
## $ X2019.05.31 <dbl> 541905, 212004, 245166, 308807, 195900, 206393, 150640, 20…
## $ X2019.06.30 <dbl> 542789, 212483, 245660, 309296, 196502, 206419, 151246, 20…
## $ X2019.07.31 <dbl> 542752, 212991, 246195, 309607, 197149, 206441, 151905, 20…
## $ X2019.08.31 <dbl> 543167, 213624, 246573, 309782, 197737, 206472, 152409, 20…
## $ X2019.09.30 <dbl> 544231, 214399, 246875, 309873, 198286, 206487, 153014, 20…
## $ X2019.10.31 <dbl> 546162, 214848, 247319, 310027, 198913, 206429, 153592, 20…
## $ X2019.11.30 <dbl> 547591, 215234, 248029, 310436, 199555, 206433, 154048, 20…
## $ X2019.12.31 <dbl> 549605, 215443, 248747, 311140, 200279, 206638, 154561, 20…
## $ X2020.01.31 <dbl> 552164, 216044, 249783, 312089, 200984, 207044, 155295, 20…
## $ X2020.02.29 <dbl> 556088, 216585, 251093, 313174, 201639, 207487, 156152, 21…
## $ X2020.03.31 <dbl> 560612, 217364, 252651, 314156, 202412, 207902, 157054, 21…
## $ X2020.04.30 <dbl> 563997, 218041, 254097, 315084, 203184, 208066, 157914, 21…
## $ X2020.05.31 <dbl> 563896, 218617, 255283, 315544, 203693, 207972, 158832, 21…
## $ X2020.06.30 <dbl> 562830, 219184, 256184, 316079, 204349, 207970, 159725, 21…
## $ X2020.07.31 <dbl> 565069, 219991, 256968, 317182, 205482, 208708, 160706, 21…
## $ X2020.08.31 <dbl> 571713, 221383, 258515, 319023, NA, 210007, 162051, 216806…
## $ X2020.09.30 <dbl> 580209, 223303, 260617, 321702, 209589, 212198, 163563, 21…
## $ X2020.10.31 <dbl> 588419, 225557, 263149, 325073, NA, 214376, 165321, 220879…
## $ X2020.11.30 <dbl> 595980, 227822, 265746, 328783, 214821, 216628, 167383, 22…
## $ X2020.12.31 <dbl> 602825, 229996, 268449, 332430, 217570, 218495, 169344, 22…
## $ X2021.01.31 <dbl> 608991, 232265, 271597, 335746, 220349, 219969, 171623, 22…
## $ X2021.02.28 <dbl> 616012, 235043, 274730, 339063, 222808, 222026, 173711, 23…
## $ X2021.03.31 <dbl> 623779, 237996, 278433, 342309, 225323, 223924, 176194, 23…
## $ X2021.04.30 <dbl> 633956, 241646, 282437, 345529, 228015, 226447, 178546, 23…
## $ X2021.05.31 <dbl> 648122, 245980, 287865, 349618, 231597, 228768, 181084, 24…
## $ X2021.06.30 <dbl> 665178, 251417, 294710, 354568, 235469, 231931, 183719, 24…
## $ X2021.07.31 <dbl> 683587, 257072, 302651, 360213, 238942, 235397, 186336, 25…
## $ X2021.08.31 <dbl> 697149, 262742, 310696, 365058, 241732, 238801, 188660, 25…
## $ X2021.09.30 <dbl> 706869, 267595, 318591, 369103, 243898, 241341, 190723, 26…
## $ X2021.10.31 <dbl> 711808, 271867, 325971, 372165, 245890, 243579, 192448, 27…
## $ X2021.11.30 <dbl> 717220, 275502, 333282, 375055, 247762, 245842, 194167, 27…
## $ X2021.12.31 <dbl> 723926, 279765, 340088, 377914, 250004, 248349, 196257, 28…
## $ X2022.01.31 <dbl> 733988, 284428, 347640, 381216, 252544, 250702, 198574, 28…
## $ X2022.02.28 <dbl> 745945, 289868, 355566, 385005, 255326, 253070, 201174, 29…
## $ X2022.03.31 <dbl> 760729, 295282, 365267, 389766, 257958, 255747, 203466, 29…
## $ X2022.04.30 <dbl> 775012, 301326, 375948, 394721, 260578, 258600, 206047, 30…
## $ X2022.05.31 <dbl> 785789, 306785, 386674, 399829, 262976, 261789, 208315, 31…
## $ X2022.06.30 <dbl> 788081, 311477, 395718, 403950, 265107, 264469, 210562, 31…
## $ X2022.07.31 <dbl> 784926, 313941, 401934, 407002, 266601, 266341, 212040, 31…
## $ X2022.08.31 <dbl> 775876, 314837, 405489, 408833, 267594, 267079, 213335, 32…
Zillow homw value dataset has 51 rows and 277 columns.
zhv_long <- raw_dt_zhv %>%
select(-RegionID, -SizeRank, -RegionType, -StateName) %>%
gather('period','home_val',2:ncol(.))
glimpse(zhv_long)
## Rows: 13,872
## Columns: 3
## $ RegionName <chr> "California", "Texas", "Florida", "New York", "Pennsylvania…
## $ period <chr> "X2000.01.31", "X2000.01.31", "X2000.01.31", "X2000.01.31",…
## $ home_val <dbl> 199037, 114095, 115670, 150013, 106281, 147480, 110729, 131…
zhv_long_dt <- zhv_long %>%
separate(period, c("yr", "mo" , "day"),"\\.")
zhv_long_dt$yr <- as.numeric(gsub('X', '', zhv_long_dt$yr))
zhv_long_dt$mo <- as.numeric(zhv_long_dt$mo)
zhv_long_dt$date <- as.Date(with(zhv_long_dt, paste(yr, mo, 1,sep="-")), "%Y-%m-%d")
# keep only numeric values in yr column
head(zhv_long_dt)
## RegionName yr mo day home_val date
## 1 California 2000 1 31 199037 2000-01-01
## 2 Texas 2000 1 31 114095 2000-01-01
## 3 Florida 2000 1 31 115670 2000-01-01
## 4 New York 2000 1 31 150013 2000-01-01
## 5 Pennsylvania 2000 1 31 106281 2000-01-01
## 6 Illinois 2000 1 31 147480 2000-01-01
zhv_long_ts <- zhv_long_dt %>% filter (RegionName %in% c('Virginia', 'Maryland', 'North Carolina'))
ggplot(data = zhv_long_ts, aes(x=date, y=home_val)) + geom_line(aes(colour=RegionName)) + theme_minimal()
It is an interesting trend chart, as we can see that average home prices in North Carolina and Virginia were nearly the same in 2000. But, by 2022, Virginia home prices are much higher than North Carolina home prices. Maryland has the highest average home prices, among the 3 states.
zhv_long_reg <- zhv_long_dt %>%
filter (yr==max(zhv_long_dt$yr)) %>%
group_by (RegionName) %>%
summarise(mean_hv = mean(home_val)) %>%
arrange(mean_hv)
glimpse(zhv_long_reg)
## Rows: 51
## Columns: 2
## $ RegionName <chr> "West Virginia", "Mississippi", "Arkansas", "Oklahoma", "Io…
## $ mean_hv <dbl> 137052.5, 163117.6, 174626.4, 177118.8, 188413.0, 193936.8,…
ggplot(zhv_long_reg, aes(x = reorder(RegionName,-mean_hv), y = mean_hv )) +
geom_bar(position = "dodge",stat = "identity", fill="steelblue") +
xlab("State") +
ylab("Mean Home Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In 2022, Hawaii has the highest home prices. It is followed by
California and DC.
West Virginia has the lowest home prices followed by Mississippi.
#provide the github data path
dt_gv <- "https://raw.githubusercontent.com/Naik-Khyati/data_prep_proj2/main/data/gun-violence-data_01-2013_03-2018.csv"
raw_dt_gv <- read.csv(dt_gv, sep=",", stringsAsFactors=FALSE)
glimpse(raw_dt_gv)
## Rows: 239,677
## Columns: 23
## $ incident_id <int> 461105, 460726, 478855, 478925, 478959, 47…
## $ date <chr> "1/1/2013", "1/1/2013", "1/1/2013", "1/5/2…
## $ state <chr> "Pennsylvania", "California", "Ohio", "Col…
## $ city_or_county <chr> "Mckeesport", "Hawthorne", "Lorain", "Auro…
## $ address <chr> "1506 Versailles Avenue and Coursin Street…
## $ n_killed <int> 0, 1, 1, 4, 2, 4, 5, 0, 0, 1, 1, 1, 2, 0, …
## $ n_injured <int> 4, 3, 3, 0, 2, 0, 0, 5, 4, 6, 3, 3, 3, 5, …
## $ incident_url_fields_missing <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ congressional_district <int> 14, 43, 9, 6, 6, 1, 1, 2, 9, 7, 3, 1, 3, 1…
## $ gun_stolen <chr> "", "", "0::Unknown||1::Unknown", "", "0::…
## $ gun_type <chr> "", "", "0::Unknown||1::Unknown", "", "0::…
## $ incident_characteristics <chr> "Shot - Wounded/Injured||Mass Shooting (4+…
## $ n_guns_involved <int> NA, NA, 2, NA, 2, NA, 2, NA, NA, NA, 1, 1,…
## $ notes <chr> "Julian Sims under investigation: Four Sho…
## $ participant_age <chr> "0::20", "0::20", "0::25||1::31||2::33||3:…
## $ participant_age_group <chr> "0::Adult 18+||1::Adult 18+||2::Adult 18+|…
## $ participant_gender <chr> "0::Male||1::Male||3::Male||4::Female", "0…
## $ participant_name <chr> "0::Julian Sims", "0::Bernard Gillis", "0:…
## $ participant_relationship <chr> "", "", "", "", "3::Family", "", "5::Famil…
## $ participant_status <chr> "0::Arrested||1::Injured||2::Injured||3::I…
## $ participant_type <chr> "0::Victim||1::Victim||2::Victim||3::Victi…
## $ state_house_district <int> NA, 62, 56, 40, 62, 72, 10, 93, 11, NA, 28…
## $ state_senate_district <int> NA, 35, 13, 28, 27, 11, 14, 5, 7, 44, 10, …
This data set has 239,677 rows and 23 columns.
raw_dt_gv <- raw_dt_gv %>% mutate(date_fmt = as.Date(date, format= "%m/%d/%Y"))
raw_dt_gv$year <- as.numeric(format(raw_dt_gv$date_fmt,'%Y'))
yr_inci_cnt <- raw_dt_gv %>% group_by(year) %>% summarise(incid_count=n())
ggplot(yr_inci_cnt, aes(x = year, y = incid_count )) +
geom_bar(position = "dodge",stat = "identity", fill="darkgreen") +
xlab("year") +
ylab("# Incident") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There were fewest incidents in 2013. There was a spike in number of
incidents in 2014 where incidents rose to above 50,000. Since, 2014,
incidents have been on an increasing trend. However, in 2018, we see a
drop in number of incidents.
gv_long_dt <- raw_dt_gv %>% select (year,n_injured,n_killed) %>%
group_by(year) %>% summarise_each(list(sum)) %>%
gather('impact_type','nbr_impact',2:3)
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
ggplot(gv_long_dt, aes(x = year, y = nbr_impact, fill=impact_type )) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(impact_share * 100), '%')),
position = position_stack(vjust = 0.5)) +
xlab("year") +
ylab("# of impacted person") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
unique(raw_dt_gv$state)
## [1] "Pennsylvania" "California" "Ohio"
## [4] "Colorado" "North Carolina" "Oklahoma"
## [7] "New Mexico" "Louisiana" "Maryland"
## [10] "Tennessee" "Missouri" "District of Columbia"
## [13] "Illinois" "Delaware" "Utah"
## [16] "Michigan" "Georgia" "Indiana"
## [19] "Mississippi" "New York" "Florida"
## [22] "Washington" "South Carolina" "Arizona"
## [25] "Kentucky" "New Jersey" "Virginia"
## [28] "Wisconsin" "Rhode Island" "Texas"
## [31] "Alabama" "Kansas" "Connecticut"
## [34] "West Virginia" "Minnesota" "Nevada"
## [37] "Nebraska" "Massachusetts" "Hawaii"
## [40] "New Hampshire" "Iowa" "Alaska"
## [43] "Arkansas" "Idaho" "Oregon"
## [46] "Wyoming" "Maine" "North Dakota"
## [49] "Montana" "Vermont" "South Dakota"
state_gv <- raw_dt_gv %>% group_by(state) %>% summarise(incid_count = n())
ggplot(state_gv, aes(x = reorder(state,-incid_count), y = incid_count )) +
geom_bar(position = "dodge",stat = "identity", fill="steelblue") +
xlab("State") +
ylab("Gun Violence Incident Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Illinois has the most number of incidents followed by California,
Florida and Texas.
Hawaii, Vermont and Wyoming has fewest number of incidents.
We prepared 3 different wide datasets to answer corresponding research questions. Gun Violence dataset had nearly 240,000 rows, whereas Zillow home value data has 277 columns. Thus, I got the opportunity to work with relatively large dataset, which were wide too.
In this project, I performed various data manipulation tasks such as string split, string substitution ,conversion from one data type to another, creating a date variable from multiple columns, selecting only few columns based on starting characters, converting data from wide to long and vice versa, aggregating and summarizing the data, adding new columns to the data frame, subsetting the data, sorting the data etc. Moreover, I also created stack and line charts to answer to answer research questions.