DS Labs Assignment

Author

Telesphore Kabore

Loading Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(infer)
library(skimr)

Brief Description

For this assignment, after installing the DS Labs (dslabs) package, I chose a data set called “murders”. The data set contains 51 observations: the 50 U.S. states plus the District of Columbia. It has five variables: the state name, the state abbreviation, the region, the total population of the state, and the number of murders in the state. After cleaning the data set, I will create a box plot of total murders by region to explore how murder counts vary across regions.

Loading package and data

# Load the dslabs package
library(dslabs)
# Load the data set
data(murders)

Quick look at the data set

glimpse(murders)
Rows: 51
Columns: 5
$ state      <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
$ abb        <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",…
$ region     <fct> South, West, West, South, West, West, Northeast, South, Sou…
$ population <dbl> 4779736, 710231, 6392017, 2915918, 37253956, 5029196, 35740…
$ total      <dbl> 135, 19, 232, 93, 1257, 65, 97, 38, 99, 669, 376, 7, 12, 36…
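
The claims in the description can be checked directly against the loaded data. The following is a minimal sketch, assuming the dslabs package is installed and that the District of Columbia is stored under its full name in the `state` column:

```r
library(dslabs)
data(murders)

# Number of observations (rows) and variables (columns)
dim(murders)

# Is the District of Columbia included alongside the 50 states?
# (assumes the full name is used in the state column)
"District of Columbia" %in% murders$state

# Number of distinct regions and how many states fall in each
nlevels(murders$region)
table(murders$region)
```

The `table()` call should agree with the region counts reported by `summary(murders)`.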

Data cleaning

# Inspect the variable names
names(murders)
[1] "state"      "abb"        "region"     "population" "total"     
# Make all headers lowercase and remove spaces
names(murders) <- tolower(names(murders))
names(murders) <- gsub(" ", "", names(murders))
head(murders)
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
summary(murders)
    state               abb                      region     population      
 Length:51          Length:51          Northeast    : 9   Min.   :  563626  
 Class :character   Class :character   South        :17   1st Qu.: 1696962  
 Mode  :character   Mode  :character   North Central:12   Median : 4339367  
                                       West         :13   Mean   : 6075769  
                                                          3rd Qu.: 6636084  
                                                          Max.   :37253956  
     total       
 Min.   :   2.0  
 1st Qu.:  24.5  
 Median :  97.0  
 Mean   : 184.4  
 3rd Qu.: 268.0  
 Max.   :1257.0  
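
The column names of murders are already lowercase and contain no spaces, so the two cleaning calls above leave them unchanged. To illustrate what those calls do to untidier headers, here is a minimal sketch on a hypothetical data frame called `messy` (an illustration only, not part of dslabs):

```r
# Hypothetical data frame with untidy headers (illustration only, not from dslabs)
messy <- data.frame(
  "State Name"    = c("Alabama", "Alaska"),
  "Total Murders" = c(135, 19),
  check.names = FALSE  # keep the spaces in the column names
)

# The same two cleaning steps applied to murders above
names(messy) <- tolower(names(messy))
names(messy) <- gsub(" ", "", names(messy), fixed = TRUE)

names(messy)
# [1] "statename"    "totalmurders"
```

The `fixed = TRUE` argument treats the space as a literal string rather than a regular expression; for a single space the result is the same either way.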

Multivariate Box Plot

# Box plot of total murders per state, grouped and colored by region
ggplot(murders, aes(x = region, y = total, fill = region)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Total Murders by Region",
    x = "Region",
    y = "Total Murders"
  ) +
  theme_minimal()
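
The boxes in the plot above summarize the quartiles of total murders within each region. As a rough numerical companion, the sketch below computes those summaries with dplyr. The column names `n_states`, `q1`, `median_total`, `q3`, and `max_total` are chosen here for illustration, and the quartiles from `quantile()` may differ slightly from the hinges that `geom_boxplot()` draws:

```r
library(dplyr)
library(dslabs)
data(murders)

# Per-region summary of total murders (illustrative column names)
region_summary <- murders %>%
  group_by(region) %>%
  summarise(
    n_states     = n(),                   # states (plus DC) in the region
    q1           = quantile(total, 0.25), # lower quartile
    median_total = median(total),         # line inside the box
    q3           = quantile(total, 0.75), # upper quartile
    max_total    = max(total)             # largest state total in the region
  )

region_summary
```

Reading this table next to the plot makes it easier to compare the regions' medians and spreads, and to see which regions contain states with unusually high totals.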