Project details

The task is to create an example: using one or more tidyverse packages and any dataset from fivethirtyeight.com or Kaggle, write a programming sample "vignette" that demonstrates one or more capabilities of the selected tidyverse package on the selected dataset.

URL: https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv

Tidyverse project - prepare the data

Step 1 Load Library

We load the library first.

Loading tidyverse attaches the following core packages:

✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ✓ tibble 3.1.6 ✓ dplyr 1.0.7 ✓ tidyr 1.1.4 ✓ stringr 1.4.0 ✓ readr 2.1.0 ✓ forcats 0.5.1

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Step 2 Load data

Then we read the CSV file from the URL into R as a data frame.

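# load data
# read.csv() is base R; the name hate_url first holds the URL string
# and is then overwritten with the data frame read from that URL
hate_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv"
hate_url <- read.csv(hate_url)
head(hate_url)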
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   share_population_in_metro_areas share_population_with_high_school_degree
## 1                            0.64                                    0.821
## 2                            0.63                                    0.914
## 3                            0.90                                    0.842
## 4                            0.69                                    0.824
## 5                            0.97                                    0.806
## 6                            0.80                                    0.893
##   share_non_citizen share_white_poverty gini_index share_non_white
## 1              0.02                0.12      0.472            0.35
## 2              0.04                0.06      0.422            0.42
## 3              0.10                0.09      0.455            0.49
## 4              0.04                0.12      0.458            0.26
## 5              0.13                0.09      0.471            0.61
## 6              0.06                0.07      0.457            0.31
##   share_voters_voted_trump hate_crimes_per_100k_splc
## 1                     0.63                0.12583893
## 2                     0.53                0.14374012
## 3                     0.50                0.22531995
## 4                     0.60                0.06906077
## 5                     0.33                0.25580536
## 6                     0.44                0.39052330
##   avg_hatecrimes_per_100k_fbi
## 1                   1.8064105
## 2                   1.6567001
## 3                   3.4139280
## 4                   0.8692089
## 5                   2.3979859
## 6                   2.8046888
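read.csv() is a base R function. The tidyverse counterpart is read_csv() from readr, which returns a tibble instead of a plain data frame. The following is a minimal sketch only: it reads a short inline string (three columns and two rows copied from the output above) rather than the real file, so that it runs without a network connection. The names csv_text and hate_tbl are illustrative and not part of the original analysis.

```r
library(readr)

# Inline CSV text standing in for the remote file
# (assumption: only three columns and two rows, copied from the output above)
csv_text <- "state,median_household_income,share_unemployed_seasonal
Alabama,42278,0.060
Alaska,67629,0.064"

# I() marks the string as literal data rather than a file path (readr >= 2.0)
hate_tbl <- read_csv(I(csv_text), show_col_types = FALSE)

hate_tbl
```

Passing the URL from Step 2 to read_csv() should work the same way. One visible difference is that read_csv() parses whole numbers as double by default, so median_household_income would show as <dbl> rather than <int>.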

Step 3 Add a new column with a calculation

We replace NA values with 0 and add the two hate crime rates (SPLC and FBI, columns 11 and 12) together into a new column, hate_rate_sum_per100k.

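haterate <- hate_url %>%
  # replace every NA in the data frame with 0
  replace(is.na(.), 0) %>%
  # columns 11 and 12 hold the SPLC and FBI hate crime rates
  mutate(hate_rate_sum_per100k = rowSums(.[11:12]))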

head(haterate)
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   share_population_in_metro_areas share_population_with_high_school_degree
## 1                            0.64                                    0.821
## 2                            0.63                                    0.914
## 3                            0.90                                    0.842
## 4                            0.69                                    0.824
## 5                            0.97                                    0.806
## 6                            0.80                                    0.893
##   share_non_citizen share_white_poverty gini_index share_non_white
## 1              0.02                0.12      0.472            0.35
## 2              0.04                0.06      0.422            0.42
## 3              0.10                0.09      0.455            0.49
## 4              0.04                0.12      0.458            0.26
## 5              0.13                0.09      0.471            0.61
## 6              0.06                0.07      0.457            0.31
##   share_voters_voted_trump hate_crimes_per_100k_splc
## 1                     0.63                0.12583893
## 2                     0.53                0.14374012
## 3                     0.50                0.22531995
## 4                     0.60                0.06906077
## 5                     0.33                0.25580536
## 6                     0.44                0.39052330
##   avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
## 1                   1.8064105             1.9322494
## 2                   1.6567001             1.8004402
## 3                   3.4139280             3.6392479
## 4                   0.8692089             0.9382696
## 5                   2.3979859             2.6537913
## 6                   2.8046888             3.1952121
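The step above selects the two rate columns by position (11:12), which breaks silently if the column order changes. As a hedged alternative, the same idea can be sketched with tidyr::replace_na() and dplyr::across(), referring to the columns by name. The toy tibble below uses made-up values and stands in for the real data; toy and toy_sum are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the two rate columns (values are made up)
toy <- tibble(
  state = c("A", "B", "C"),
  hate_crimes_per_100k_splc = c(0.1, NA, 0.3),
  avg_hatecrimes_per_100k_fbi = c(1.0, 2.0, NA)
)

toy_sum <- toy %>%
  # replace NA with 0 in every numeric column
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  # add the two rates by name instead of by column position
  mutate(hate_rate_sum_per100k =
           hate_crimes_per_100k_splc + avg_hatecrimes_per_100k_fbi)

toy_sum
```

Treating a missing rate as 0 lowers the sum for the affected states, so the zeros that result are placeholders for missing data rather than measured values.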

Tidyverse project - use functions from Tidyverse

Glimpse data

We use glimpse() to check the type of each column and the number of rows and columns.

glimpse(haterate)
## Rows: 51
## Columns: 13
## $ state                                    <chr> "Alabama", "Alaska", "Arizona…
## $ median_household_income                  <int> 42278, 67629, 49254, 44922, 6…
## $ share_unemployed_seasonal                <dbl> 0.060, 0.064, 0.063, 0.052, 0…
## $ share_population_in_metro_areas          <dbl> 0.64, 0.63, 0.90, 0.69, 0.97,…
## $ share_population_with_high_school_degree <dbl> 0.821, 0.914, 0.842, 0.824, 0…
## $ share_non_citizen                        <dbl> 0.02, 0.04, 0.10, 0.04, 0.13,…
## $ share_white_poverty                      <dbl> 0.12, 0.06, 0.09, 0.12, 0.09,…
## $ gini_index                               <dbl> 0.472, 0.422, 0.455, 0.458, 0…
## $ share_non_white                          <dbl> 0.35, 0.42, 0.49, 0.26, 0.61,…
## $ share_voters_voted_trump                 <dbl> 0.63, 0.53, 0.50, 0.60, 0.33,…
## $ hate_crimes_per_100k_splc                <dbl> 0.12583893, 0.14374012, 0.225…
## $ avg_hatecrimes_per_100k_fbi              <dbl> 1.8064105, 1.6567001, 3.41392…
## $ hate_rate_sum_per100k                    <dbl> 1.9322494, 1.8004402, 3.63924…

Select Data

We use select() to keep only the columns we need and drop the rest, which reduces the size of the data.

haterate <- haterate %>%
  select(state, median_household_income, share_unemployed_seasonal,
         hate_crimes_per_100k_splc, avg_hatecrimes_per_100k_fbi,
         hate_rate_sum_per100k)
head(haterate)
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
## 1                0.12583893                   1.8064105             1.9322494
## 2                0.14374012                   1.6567001             1.8004402
## 3                0.22531995                   3.4139280             3.6392479
## 4                0.06906077                   0.8692089             0.9382696
## 5                0.25580536                   2.3979859             2.6537913
## 6                0.39052330                   2.8046888             3.1952121
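select() also accepts helper functions such as contains(), starts_with(), and ends_with(), which avoid typing every column name. A minimal sketch on a toy tibble with made-up values (toy and toy_small are illustrative names):

```r
library(dplyr)

# Toy data with a subset of the real column names (values are made up)
toy <- tibble(
  state = c("A", "B"),
  median_household_income = c(40000, 60000),
  gini_index = c(0.45, 0.47),
  hate_crimes_per_100k_splc = c(0.1, 0.2),
  avg_hatecrimes_per_100k_fbi = c(1.5, 2.5)
)

# keep state plus every column whose name contains "hate"
toy_small <- toy %>%
  select(state, contains("hate"))

names(toy_small)
```

On the data from Step 3, contains("hate") would also match the new hate_rate_sum_per100k column.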

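Create summary

We create a summary to get an overview of the data; it is a quick way to check for outliers. For example, the maximum of avg_hatecrimes_per_100k_fbi (10.953) is far above its third quartile (3.168), which suggests an outlier.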

summary(haterate)
##     state           median_household_income share_unemployed_seasonal
##  Length:51          Min.   :35521           Min.   :0.02800          
##  Class :character   1st Qu.:48657           1st Qu.:0.04200          
##  Mode  :character   Median :54916           Median :0.05100          
##                     Mean   :55224           Mean   :0.04957          
##                     3rd Qu.:60719           3rd Qu.:0.05750          
##                     Max.   :76165           Max.   :0.07300          
##  hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi hate_rate_sum_per100k
##  Min.   :0.0000            Min.   : 0.000              Min.   : 0.000       
##  1st Qu.:0.1297            1st Qu.: 1.273              1st Qu.: 1.414       
##  Median :0.2136            Median : 1.937              Median : 2.227       
##  Mean   :0.2802            Mean   : 2.321              Mean   : 2.601       
##  3rd Qu.:0.3430            3rd Qu.: 3.168              3rd Qu.: 3.441       
##  Max.   :1.5223            Max.   :10.953              Max.   :12.476
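summary() is a base R function. A comparable overview can be sketched with dplyr::summarise(), which returns the statistics as a one-row tibble that can be piped into further steps. The toy values below are made up, and the names toy and toy_stats are illustrative.

```r
library(dplyr)

# Toy data standing in for the combined rate column (values are made up)
toy <- tibble(
  state = c("A", "B", "C", "D"),
  hate_rate_sum_per100k = c(1.0, 2.0, 3.0, 12.0)
)

toy_stats <- toy %>%
  summarise(
    n = n(),                                      # number of rows
    mean_rate = mean(hate_rate_sum_per100k),
    median_rate = median(hate_rate_sum_per100k),
    max_rate = max(hate_rate_sum_per100k)
  )

toy_stats
```

A mean well above the median, as in this toy example, is one sign that a few large values are pulling the average up.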

ggplot

We use histograms to visualize how the combined hate crime rate and the seasonal unemployment share are distributed across states.

ggplot(haterate, aes(x = hate_rate_sum_per100k)) +
  geom_histogram(bins = 30)

ggplot(haterate, aes(x = share_unemployed_seasonal)) +
  geom_histogram(bins = 30)
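The default axis labels are the raw column names. As a small sketch, labs() can add a title and readable axis labels to a histogram. The data frame below holds made-up values, and the label text is only a suggestion; toy and p are illustrative names.

```r
library(ggplot2)

# Toy data standing in for the combined rate column (values are made up)
toy <- data.frame(
  hate_rate_sum_per100k = c(0.9, 1.8, 1.9, 2.7, 3.2, 3.6, 12.5)
)

p <- ggplot(toy, aes(x = hate_rate_sum_per100k)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Combined hate crime rate by state",
    x = "Hate crimes per 100k (SPLC + FBI)",
    y = "Number of states"
  )

p
```

The same labs() call can be appended to either of the two plots above.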

Summary

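In this assignment we used several tidyverse functions (mutate(), glimpse(), select(), and ggplot()) to prepare, inspect, and visualize the hate crimes dataset. The tidyverse is useful for working with tidy data.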