Our task here is to create an example. Using one or more tidyverse packages and any dataset from fivethirtyeight.com or Kaggle, we create a programming sample “vignette” that demonstrates how to use one or more capabilities of the selected tidyverse package with the selected dataset.
URL: https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv
First, we load the library.
The core tidyverse attaches the following packages: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Then we load the dataset into R.
# load data
hate_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv"
# note: from here on, hate_url holds the data frame rather than the URL string
hate_url <- read.csv(hate_url)
head(hate_url)
## state median_household_income share_unemployed_seasonal
## 1 Alabama 42278 0.060
## 2 Alaska 67629 0.064
## 3 Arizona 49254 0.063
## 4 Arkansas 44922 0.052
## 5 California 60487 0.059
## 6 Colorado 60940 0.040
## share_population_in_metro_areas share_population_with_high_school_degree
## 1 0.64 0.821
## 2 0.63 0.914
## 3 0.90 0.842
## 4 0.69 0.824
## 5 0.97 0.806
## 6 0.80 0.893
## share_non_citizen share_white_poverty gini_index share_non_white
## 1 0.02 0.12 0.472 0.35
## 2 0.04 0.06 0.422 0.42
## 3 0.10 0.09 0.455 0.49
## 4 0.04 0.12 0.458 0.26
## 5 0.13 0.09 0.471 0.61
## 6 0.06 0.07 0.457 0.31
## share_voters_voted_trump hate_crimes_per_100k_splc
## 1 0.63 0.12583893
## 2 0.53 0.14374012
## 3 0.50 0.22531995
## 4 0.60 0.06906077
## 5 0.33 0.25580536
## 6 0.44 0.39052330
## avg_hatecrimes_per_100k_fbi
## 1 1.8064105
## 2 1.6567001
## 3 3.4139280
## 4 0.8692089
## 5 2.3979859
## 6 2.8046888
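Since this vignette is about the tidyverse, the same loading step could also be done with readr's `read_csv()`, which returns a tibble and lets us declare column types up front. The following is only a minimal sketch: the name `hate_crimes` is hypothetical, and the three rows are inlined from the output above so that the example runs without network access. With the real data, the URL would be passed in place of `I(csv_text)`.

```r
library(readr)

# Three rows copied from the head() output above, inlined as literal CSV text.
csv_text <- "state,median_household_income,hate_crimes_per_100k_splc
Alabama,42278,0.12583893
Alaska,67629,0.14374012
Arizona,49254,0.22531995"

# I() tells readr (>= 2.0) to treat the string as data, not as a file path.
# hate_crimes is a hypothetical name, not one used elsewhere in this vignette.
hate_crimes <- read_csv(
  I(csv_text),
  col_types = cols(
    state = col_character(),
    median_household_income = col_integer(),
    hate_crimes_per_100k_splc = col_double()
  )
)

print(hate_crimes)
```

Declaring `col_types` is optional, but it makes the expected schema explicit and silences the column specification message.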
We replace NA values with 0 and add the two hate crime rates together into a new column.
haterate <- hate_url %>%
  replace(is.na(.), 0) %>%
  # columns 11 and 12 are hate_crimes_per_100k_splc and avg_hatecrimes_per_100k_fbi
  mutate(hate_rate_sum_per100k = rowSums(.[11:12]))
head(haterate)
## state median_household_income share_unemployed_seasonal
## 1 Alabama 42278 0.060
## 2 Alaska 67629 0.064
## 3 Arizona 49254 0.063
## 4 Arkansas 44922 0.052
## 5 California 60487 0.059
## 6 Colorado 60940 0.040
## share_population_in_metro_areas share_population_with_high_school_degree
## 1 0.64 0.821
## 2 0.63 0.914
## 3 0.90 0.842
## 4 0.69 0.824
## 5 0.97 0.806
## 6 0.80 0.893
## share_non_citizen share_white_poverty gini_index share_non_white
## 1 0.02 0.12 0.472 0.35
## 2 0.04 0.06 0.422 0.42
## 3 0.10 0.09 0.455 0.49
## 4 0.04 0.12 0.458 0.26
## 5 0.13 0.09 0.471 0.61
## 6 0.06 0.07 0.457 0.31
## share_voters_voted_trump hate_crimes_per_100k_splc
## 1 0.63 0.12583893
## 2 0.53 0.14374012
## 3 0.50 0.22531995
## 4 0.60 0.06906077
## 5 0.33 0.25580536
## 6 0.44 0.39052330
## avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
## 1 1.8064105 1.9322494
## 2 1.6567001 1.8004402
## 3 3.4139280 3.6392479
## 4 0.8692089 0.9382696
## 5 2.3979859 2.6537913
## 6 2.8046888 3.1952121
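The same idea can be written with column names instead of positions, which keeps working if the column order changes. This is a sketch on made-up toy values (the `toy` tibble and its numbers are assumptions for illustration, not rows from the dataset), using `tidyr::replace_na()` inside `dplyr::across()`.

```r
library(dplyr)
library(tidyr)

# Made-up toy values, including NAs, to show the replacement step.
toy <- tibble(
  state = c("A", "B", "C"),
  hate_crimes_per_100k_splc = c(0.1, NA, 0.3),
  avg_hatecrimes_per_100k_fbi = c(1.0, 2.0, NA)
)

result <- toy %>%
  # replace NA with 0 in every numeric column
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  # add the two rates by name rather than by column index
  mutate(hate_rate_sum_per100k =
           hate_crimes_per_100k_splc + avg_hatecrimes_per_100k_fbi)

print(result)
```

Note that treating a missing rate as 0 changes the meaning of the sum for those rows, so it is worth checking how many NAs there are before doing this on real data.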
We use glimpse() to check the type of each column as well as the number of rows and columns.
glimpse(haterate)
## Rows: 51
## Columns: 13
## $ state <chr> "Alabama", "Alaska", "Arizona…
## $ median_household_income <int> 42278, 67629, 49254, 44922, 6…
## $ share_unemployed_seasonal <dbl> 0.060, 0.064, 0.063, 0.052, 0…
## $ share_population_in_metro_areas <dbl> 0.64, 0.63, 0.90, 0.69, 0.97,…
## $ share_population_with_high_school_degree <dbl> 0.821, 0.914, 0.842, 0.824, 0…
## $ share_non_citizen <dbl> 0.02, 0.04, 0.10, 0.04, 0.13,…
## $ share_white_poverty <dbl> 0.12, 0.06, 0.09, 0.12, 0.09,…
## $ gini_index <dbl> 0.472, 0.422, 0.455, 0.458, 0…
## $ share_non_white <dbl> 0.35, 0.42, 0.49, 0.26, 0.61,…
## $ share_voters_voted_trump <dbl> 0.63, 0.53, 0.50, 0.60, 0.33,…
## $ hate_crimes_per_100k_splc <dbl> 0.12583893, 0.14374012, 0.225…
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.41392…
## $ hate_rate_sum_per100k <dbl> 1.9322494, 1.8004402, 3.63924…
We keep only the columns we need with select(), which reduces the size of the data.
haterate <- haterate %>%
  select("state", "median_household_income", "share_unemployed_seasonal",
         "hate_crimes_per_100k_splc", "avg_hatecrimes_per_100k_fbi",
         "hate_rate_sum_per100k")
head(haterate)
## state median_household_income share_unemployed_seasonal
## 1 Alabama 42278 0.060
## 2 Alaska 67629 0.064
## 3 Arizona 49254 0.063
## 4 Arkansas 44922 0.052
## 5 California 60487 0.059
## 6 Colorado 60940 0.040
## hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
## 1 0.12583893 1.8064105 1.9322494
## 2 0.14374012 1.6567001 1.8004402
## 3 0.22531995 3.4139280 3.6392479
## 4 0.06906077 0.8692089 0.9382696
## 5 0.25580536 2.3979859 2.6537913
## 6 0.39052330 2.8046888 3.1952121
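select() also accepts unquoted names and helper functions such as `contains()`, which can shorten a long list of columns. A small sketch, assuming a toy tibble built from two rows of the output above (the name `toy` is hypothetical):

```r
library(dplyr)

# Two rows copied from the output above, with one extra column to drop.
toy <- tibble(
  state = c("Alabama", "Alaska"),
  median_household_income = c(42278L, 67629L),
  gini_index = c(0.472, 0.422),
  hate_crimes_per_100k_splc = c(0.12583893, 0.14374012),
  avg_hatecrimes_per_100k_fbi = c(1.8064105, 1.6567001)
)

# Keep two columns by name plus every column whose name contains "hate".
selected <- toy %>%
  select(state, median_household_income, contains("hate"))

print(names(selected))
```

Here `gini_index` is dropped because it is neither named nor matched by the helper.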
We create a summary to get an overview of the data; it is a quick way to check whether there are any outliers.
summary(haterate)
## state median_household_income share_unemployed_seasonal
## Length:51 Min. :35521 Min. :0.02800
## Class :character 1st Qu.:48657 1st Qu.:0.04200
## Mode :character Median :54916 Median :0.05100
## Mean :55224 Mean :0.04957
## 3rd Qu.:60719 3rd Qu.:0.05750
## Max. :76165 Max. :0.07300
## hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
## Min. :0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.1297 1st Qu.: 1.273 1st Qu.: 1.414
## Median :0.2136 Median : 1.937 Median : 2.227
## Mean :0.2802 Mean : 2.321 Mean : 2.601
## 3rd Qu.:0.3430 3rd Qu.: 3.168 3rd Qu.: 3.441
## Max. :1.5223 Max. :10.953 Max. :12.476
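summary() shows that a maximum is far from the third quartile, but not which state it belongs to. One way to follow up, sketched here on three rows taken from the output above (the names `toy`, `overview`, and `top_state` are hypothetical), is to combine `summarise()` with `slice_max()`.

```r
library(dplyr)

# Three rows copied from the head() output above.
toy <- tibble(
  state = c("Alabama", "Alaska", "Arizona"),
  hate_rate_sum_per100k = c(1.9322494, 1.8004402, 3.6392479)
)

# Tidyverse counterpart of part of summary(): one row of chosen statistics.
overview <- toy %>%
  summarise(
    mean_rate = mean(hate_rate_sum_per100k),
    max_rate = max(hate_rate_sum_per100k)
  )

# The row holding the largest combined rate.
top_state <- toy %>%
  slice_max(hate_rate_sum_per100k, n = 1)

print(overview)
print(top_state)
```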
We visualize the distributions of the combined hate crime rate and the seasonal unemployment share with histograms.
ggplot(haterate, aes(x=hate_rate_sum_per100k)) + geom_histogram(bins = 30)
ggplot(haterate, aes(x=share_unemployed_seasonal)) + geom_histogram(bins = 30)
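The two histograms could also be drawn in a single call by reshaping the data to long form with `tidyr::pivot_longer()` and faceting on the variable name. This is a sketch under the assumption that side-by-side panels are wanted; the names `toy`, `long`, `measure`, and `value` are hypothetical, and the rows are taken from the output above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Three rows copied from the output above.
toy <- tibble(
  state = c("Alabama", "Alaska", "Arizona"),
  share_unemployed_seasonal = c(0.060, 0.064, 0.063),
  hate_rate_sum_per100k = c(1.9322494, 1.8004402, 3.6392479)
)

# One row per state and measure, so ggplot can facet on the measure name.
long <- toy %>%
  pivot_longer(
    cols = c(share_unemployed_seasonal, hate_rate_sum_per100k),
    names_to = "measure",
    values_to = "value"
  )

# Free x scales because the two measures are on very different ranges.
p <- ggplot(long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ measure, scales = "free_x")

print(p)
```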
In this assignment we learned more functions from the tidyverse, which are useful for working with tidy data.