M1 Lab 1 Submission

Diandra Dzib

Overview

In this assignment, you will analyze the USArrests dataset, which contains statistics on violent crime rates in the United States for each of the 50 states in 1973. You will apply the R skills you learned in this module, including importing data, summarizing information, creating visualizations, and using tidyverse functions.

Load Libraries

For this assignment and most others in this class, you will need the tidyverse and GGally libraries. If you have not already installed them on the version of R you are currently using, install them now by typing the code below in your console or by using the Tools dropdown menu. You will not need to do this again for future assignments as long as you are using the same computer.

install.packages("tidyverse")
install.packages("GGally")

Load the necessary libraries for data analysis and visualization.

library(tidyverse)
library(GGally)  # For pairwise scatter plots

Data

The USArrests dataset contains the following variables:

- Murder: Murder arrests (per 100,000 residents)
- Assault: Assault arrests (per 100,000 residents)
- UrbanPop: Percent of the population living in urban areas

# Read in dataset
us_arrests <- read_csv("M1-US-Arrests-Data.csv")

Use the following commands to explore the dataset content:

# View first few rows of the dataset
head(us_arrests)
State Murder Assault UrbanPop
Alabama 13.2 236 58
Alaska 10.0 263 48
Arizona 8.1 294 80
Arkansas 8.8 190 50
California 9.0 276 91
Colorado 7.9 204 78
# Get a quick glimpse of the dataset
glimpse(us_arrests)
## Rows: 50
## Columns: 4
## $ State    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Co…
## $ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.…
## $ Assault  <dbl> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 24…
## $ UrbanPop <dbl> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 6…

Question 1

Use the following tasks to practice subsetting data using filter() to select specific rows and select() to choose specific columns.

1.1

Retrieve all rows where Murder is greater than 10 per 100,000 residents.

us_arrests %>%
  filter(Murder > 10)
State Murder Assault UrbanPop
Alabama 13.2 236 58
Florida 15.4 335 80
Georgia 17.4 211 60
Illinois 10.4 249 83
Louisiana 15.4 249 66
Maryland 11.3 300 67
Michigan 12.1 255 74
Mississippi 16.1 259 44
Nevada 12.2 252 81
New Mexico 11.4 285 70
New York 11.1 254 86
North Carolina 13.0 337 45
South Carolina 14.4 279 48
Tennessee 13.2 188 59
Texas 12.7 201 80

1.2

Retrieve all rows where UrbanPop is less than 50%.

us_arrests %>%
  filter(UrbanPop < 50)
State Murder Assault UrbanPop
Alaska 10.0 263 48
Mississippi 16.1 259 44
North Carolina 13.0 337 45
North Dakota 0.8 45 44
South Carolina 14.4 279 48
South Dakota 3.8 86 45
Vermont 2.2 48 32
West Virginia 5.7 81 39

1.3

Find the data for the states of California, Texas, and New York.
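
One possible approach is to combine filter() with the %in% operator, which keeps rows whose State matches any name in a vector. The sketch below is a minimal example, not a definitive answer. It assumes the CSV matches R's built-in USArrests data and rebuilds a stand-in us_arrests from that dataset so the block runs on its own; three_states is an illustrative name.

```r
library(dplyr)  # attached automatically by library(tidyverse)

# Stand-in for the CSV (assumption): rebuild us_arrests from R's built-in USArrests
us_arrests <- tibble::rownames_to_column(USArrests, var = "State") %>%
  select(State, Murder, Assault, UrbanPop)

# Keep only the rows whose State matches one of the three names
three_states <- us_arrests %>%
  filter(State %in% c("California", "Texas", "New York"))

three_states
```

Using %in% avoids chaining three separate comparisons with the | operator.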

1.4

Select only the State and Murder columns.

us_arrests %>%
  select("State", "Murder")
State Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Connecticut 3.3
Delaware 5.9
Florida 15.4
Georgia 17.4
Hawaii 5.3
Idaho 2.6
Illinois 10.4
Indiana 7.2
Iowa 2.2
Kansas 6.0
Kentucky 9.7
Louisiana 15.4
Maine 2.1
Maryland 11.3
Massachusetts 4.4
Michigan 12.1
Minnesota 2.7
Mississippi 16.1
Missouri 9.0
Montana 6.0
Nebraska 4.3
Nevada 12.2
New Hampshire 2.1
New Jersey 7.4
New Mexico 11.4
New York 11.1
North Carolina 13.0
North Dakota 0.8
Ohio 7.3
Oklahoma 6.6
Oregon 4.9
Pennsylvania 6.3
Rhode Island 3.4
South Carolina 14.4
South Dakota 3.8
Tennessee 13.2
Texas 12.7
Utah 3.2
Vermont 2.2
Virginia 8.5
Washington 4.0
West Virginia 5.7
Wisconsin 2.6
Wyoming 6.8

1.5

Select the State, Assault, and UrbanPop columns.

us_arrests %>%
  select("State", "Assault", "UrbanPop")
State Assault UrbanPop
Alabama 236 58
Alaska 263 48
Arizona 294 80
Arkansas 190 50
California 276 91
Colorado 204 78
Connecticut 110 77
Delaware 238 72
Florida 335 80
Georgia 211 60
Hawaii 46 83
Idaho 120 54
Illinois 249 83
Indiana 113 65
Iowa 56 57
Kansas 115 66
Kentucky 109 52
Louisiana 249 66
Maine 83 51
Maryland 300 67
Massachusetts 149 85
Michigan 255 74
Minnesota 72 66
Mississippi 259 44
Missouri 178 70
Montana 109 53
Nebraska 102 62
Nevada 252 81
New Hampshire 57 56
New Jersey 159 89
New Mexico 285 70
New York 254 86
North Carolina 337 45
North Dakota 45 44
Ohio 120 75
Oklahoma 151 68
Oregon 159 67
Pennsylvania 106 72
Rhode Island 174 87
South Carolina 279 48
South Dakota 86 45
Tennessee 188 59
Texas 201 80
Utah 120 80
Vermont 48 32
Virginia 156 63
Washington 145 73
West Virginia 81 39
Wisconsin 53 66
Wyoming 161 60

1.6

Retrieve only the states where Assault is greater than 200, but display only the State and Assault columns.

us_arrests %>% 
  select("State", "Assault") %>%
  filter(Assault > 200)
State Assault
Alabama 236
Alaska 263
Arizona 294
California 276
Colorado 204
Delaware 238
Florida 335
Georgia 211
Illinois 249
Louisiana 249
Maryland 300
Michigan 255
Mississippi 259
Nevada 252
New Mexico 285
New York 254
North Carolina 337
South Carolina 279
Texas 201

1.7

Retrieve only the states where Murder is below 5 and UrbanPop is above 70, but display only State, Murder, and UrbanPop.

us_arrests %>%
  select("State", "Murder", "UrbanPop") %>%
  filter(Murder < 5 & UrbanPop > 70)
State Murder UrbanPop
Connecticut 3.3 77
Massachusetts 4.4 85
Rhode Island 3.4 87
Utah 3.2 80
Washington 4.0 73

Question 2

Create a new variable high_murder that equals 1 if Murder is greater than the median murder rate and 0 otherwise.

# Flag states whose murder rate exceeds the median (1 = high, 0 = otherwise)
median_murder <- median(us_arrests$Murder, na.rm = TRUE)
us_arrests <- us_arrests %>%
  mutate(high_murder = ifelse(Murder > median_murder, 1, 0))

Question 3

3.1

Create a scatter plot to show the relationship between the number of assault arrests (Assault) and murder arrests (Murder).

ggplot(us_arrests, aes(x = Assault, y = Murder)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(x = "Assault", y = "Murder")

3.2

Add a linear trend line to the plot.

ggplot(us_arrests, aes(x = Assault, y = Murder)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(x = "Assault", y = "Murder") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal()

3.3

Create histograms for Murder and UrbanPop to understand their distributions.

# Murder ranges from 0.8 to 17.4, so a binwidth of 2 gives a readable shape
ggplot(us_arrests, aes(x = Murder)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black", alpha = 0.7) +
  labs(x = "Murder", y = "Count") +
  theme_minimal()

# UrbanPop is a whole-number percentage (32 to 91), so bin in steps of 5
ggplot(us_arrests, aes(x = UrbanPop)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
  labs(x = "UrbanPop", y = "Count") +
  theme_minimal()

3.4

Generate pairwise scatter plots to explore relationships among Murder, Assault, and UrbanPop.

ggpairs(us_arrests, columns = c("Murder", "Assault", "UrbanPop"))

Question 4

4.1

Calculate the mean murder rate for states classified as high_murder (1) and not high murder (0).
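
A grouped summary is one way to approach this: group_by() splits the rows by high_murder and summarise() computes one mean per group. The sketch below is a minimal example under stated assumptions. It rebuilds a stand-in us_arrests from R's built-in USArrests data (assumed to match the CSV) and recreates high_murder inside the block so it runs on its own; murder_by_group and mean_murder are illustrative names.

```r
library(dplyr)  # attached automatically by library(tidyverse)

# Stand-in for the CSV (assumption): rebuild us_arrests from R's built-in USArrests
us_arrests <- tibble::rownames_to_column(USArrests, var = "State") %>%
  select(State, Murder, Assault, UrbanPop)

# Recreate the Question 2 indicator: 1 if Murder is above the median, else 0
us_arrests <- us_arrests %>%
  mutate(high_murder = ifelse(Murder > median(Murder, na.rm = TRUE), 1, 0))

# One row per group, with that group's mean murder rate
murder_by_group <- us_arrests %>%
  group_by(high_murder) %>%
  summarise(mean_murder = mean(Murder, na.rm = TRUE))

murder_by_group
```

Because the groups are split at the median of Murder itself, the mean for group 1 should come out higher than the mean for group 0.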

4.2

Create a new variable high_urban that equals 1 if UrbanPop is greater than the median urban population percentage and 0 otherwise. Then, summarize the mean murder rate by high_urban.

# Your answer here
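
The same mutate() and group_by() pattern could be applied here. The sketch below is one possible answer under stated assumptions, not a definitive one. It rebuilds a stand-in us_arrests from R's built-in USArrests data (assumed to match the CSV) so the block runs on its own; murder_by_urban and mean_murder are illustrative names.

```r
library(dplyr)  # attached automatically by library(tidyverse)

# Stand-in for the CSV (assumption): rebuild us_arrests from R's built-in USArrests
us_arrests <- tibble::rownames_to_column(USArrests, var = "State") %>%
  select(State, Murder, Assault, UrbanPop)

# 1 if UrbanPop is strictly above the median urban percentage, else 0
us_arrests <- us_arrests %>%
  mutate(high_urban = ifelse(UrbanPop > median(UrbanPop, na.rm = TRUE), 1, 0))

# Mean murder rate within each high_urban group
murder_by_urban <- us_arrests %>%
  group_by(high_urban) %>%
  summarise(mean_murder = mean(Murder, na.rm = TRUE))

murder_by_urban
```

States whose UrbanPop equals the median exactly fall into group 0, because the comparison is strictly greater than.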