USArrests dataset
Make sure you have set RStudio’s Working Directory to your STATS 10 Lab Folder:
Session -> Set Working Directory -> Choose Directory
IMPORTANT: Run the chunk below FIRST! We will need this data to complete the lab!
# Glossary:
# Murder: Number of murder arrests per 100,000 people
# Assault: Number of assault arrests per 100,000 people
# UrbanPop: Percentage of the population living in urban areas
# Rape: Number of rape arrests per 100,000 people
usa_df <- read.csv("USArrests.csv")
head(usa_df)
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 NA 294 80 31.0
## 4 8.8 NA 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
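Before cleaning, it can help to see where the missing values sit. The sketch below rebuilds the six rows printed by head() above as a small stand-in data frame (toy_df is a hypothetical name, not part of the lab files) and counts the NA values per column. The same calls should work on usa_df once it is loaded.

```r
# Stand-in for the six rows printed by head(usa_df); toy_df is a hypothetical name
toy_df <- data.frame(
  Murder   = c(13.2, 10.0, NA, 8.8, 9.0, 7.9),
  Assault  = c(236, 263, 294, NA, 276, 204),
  UrbanPop = c(58, 48, 80, 50, 91, 78),
  Rape     = c(21.2, 44.5, 31.0, 19.5, 40.6, 38.7)
)

# is.na() marks every missing cell; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy_df))
print(na_per_column)

# complete.cases() is TRUE for rows that have no NA in any column
n_complete_rows <- sum(complete.cases(toy_df))
print(n_complete_rows)
```

In this toy sample, Murder and Assault each have one NA, and four of the six rows are complete.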
Consider the USArrests.csv dataset, which records
arrest rates by crime for each state. Find the mean of the
Assault variable, first by using the optional argument
na.rm = TRUE and then by removing every row that contains an NA
value in any column.
Compare the two means and explain which approach holds more datapoints and why.
Approach #1: Using mean() with na.rm = TRUE
# na.rm: Whether NA values are stripped prior to calculation; FALSE by default
# Remove NA values under ONLY Assault, then calculate the mean
mean(usa_df$Assault, na.rm = TRUE)
## [1] 169.5532
Approach #2: Cleaning the data
# na.omit(): Remove NA values under ALL rows and ALL columns by omitting the rows they are on
usa_cleaned_df <- na.omit(usa_df)
mean(usa_cleaned_df$Assault)
## [1] 169.3778
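To see why the two approaches can give different means, it helps to count how many values each one actually uses. The following is a minimal sketch on a two-column stand-in built from the head() output above (toy_df and the other names are hypothetical), not on the full lab data.

```r
# Two columns from the six rows shown by head(usa_df); names are hypothetical
toy_df <- data.frame(
  Murder  = c(13.2, 10.0, NA, 8.8, 9.0, 7.9),
  Assault = c(236, 263, 294, NA, 276, 204)
)

# Approach #1: only the row whose Assault value is NA is skipped
n_na_rm    <- sum(!is.na(toy_df$Assault))          # 5 values used
mean_na_rm <- mean(toy_df$Assault, na.rm = TRUE)   # 254.6

# Approach #2: any row with an NA in ANY column is dropped first
toy_cleaned_df <- na.omit(toy_df)
n_na_omit      <- nrow(toy_cleaned_df)             # 4 rows left
mean_na_omit   <- mean(toy_cleaned_df$Assault)     # 244.75

print(c(n_na_rm = n_na_rm, n_na_omit = n_na_omit))
print(c(mean_na_rm = mean_na_rm, mean_na_omit = mean_na_omit))
```

Row 3 has a valid Assault value (294) but a missing Murder value, so na.omit() discards it while na.rm = TRUE keeps it.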
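Answer: The two means are close: 169.5532 with na.rm = TRUE versus 169.3778 after na.omit(). Approach #1 holds more datapoints, because na.rm = TRUE skips only the rows where Assault itself is NA, while na.omit() also discards rows whose Assault value is present but which have an NA in another column (for example, row 3 above, which is missing only Murder).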
Report the mean, standard deviation, and number of observations of
the Assault variable from USArrests for all
states with an UrbanPop over 75. Repeat for all states with
an UrbanPop under 65. Make sure to clean the dataset by
removing all rows with NA values.
UrbanPop over 75:
# Check along the UrbanPop column which rows have a value ABOVE 75
pop_above_75 <- usa_cleaned_df$UrbanPop > 75
# usa_cleaned_df$Assault[pop_above_75]: Retrieve only rows under Assault with UrbanPop > 75
# Mean
mean(usa_cleaned_df$Assault[pop_above_75])
## [1] 200.75
# Standard Deviation
sd(usa_cleaned_df$Assault[pop_above_75])
## [1] 79.08928
# Number of observations
length(usa_cleaned_df$Assault[pop_above_75])
## [1] 12
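The filtering above relies on a logical mask, and it matters that the mask is built from cleaned data. The sketch below uses a hypothetical four-row data frame (toy_df) to show what happens when the mask itself contains an NA, and how cleaning first avoids it.

```r
# Hypothetical four-row example with one missing UrbanPop value
toy_df <- data.frame(
  Assault  = c(236, 263, 294, 276),
  UrbanPop = c(58, NA, 80, 91)
)

# On the raw data the comparison yields FALSE, NA, TRUE, TRUE
raw_mask <- toy_df$UrbanPop > 75
raw_pick <- toy_df$Assault[raw_mask]   # an NA mask entry produces an NA element
print(raw_pick)
print(mean(raw_pick))                  # NA, because one selected element is NA

# After na.omit() the mask has no NA, so the subset is clean
toy_cleaned_df <- na.omit(toy_df)
clean_mask <- toy_cleaned_df$UrbanPop > 75
clean_pick <- toy_cleaned_df$Assault[clean_mask]
print(mean(clean_pick))                # 285
print(sum(clean_mask))                 # 2; sum() counts TRUE values, same as length(clean_pick)
```

On a mask without NA values, sum(mask) gives the same count as taking length() of the subset.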
UrbanPop below 65:
# Fill in the dots (...). Check the previous chunk in case you are stuck
# Check along the UrbanPop column which rows have a value BELOW 65
# The mask must test UrbanPop: usa_cleaned_df$Assault < 65 would instead
# select the states with fewer than 65 assault arrests per 100,000 people
pop_below_65 <- usa_cleaned_df$UrbanPop < 65
# Hint:
# usa_cleaned_df$Assault[pop_above_75]: Retrieve only rows under Assault with UrbanPop > 75
# What should you do to get only rows with UrbanPop < 65?
# Mean
mean(usa_cleaned_df$Assault[pop_below_65])
# Standard Deviation
sd(usa_cleaned_df$Assault[pop_below_65])
# Number of observations
length(usa_cleaned_df$Assault[pop_below_65])
Create a histogram of a variable of your choice in the cleaned
USArrests dataset. Describe the distribution of the
histogram (centers, spreads, etc.).
# main: Title the plot (character)
# xlab: Label the horizontal or x-axis (character)
# Fill in the variable's column name over the dots (...)
# We use the cleaned data (usa_cleaned_df) with no NA value
hist(
  usa_cleaned_df$Assault,
  main = "Histogram of Assault Arrests",
  xlab = "Assault arrests per 100,000 people"
)
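When describing a histogram, numeric summaries of center and spread can back up what the picture suggests. The sketch below uses a short hypothetical vector (x) in place of a real column. Calling hist() with plot = FALSE returns the bin edges and counts without drawing anything.

```r
# Hypothetical values standing in for one column of the cleaned data
x <- c(45, 46, 48, 56, 57, 204, 236, 263, 276)

# Center: mean and median (a large gap between them hints at skew)
center <- c(mean = mean(x), median = median(x))

# Spread: standard deviation, interquartile range, and the extremes
spread <- c(sd = sd(x), iqr = IQR(x), min = min(x), max = max(x))

# Bin edges and counts that hist(x) would draw, computed without plotting
h <- hist(x, plot = FALSE)
bins <- data.frame(
  bin_start = head(h$breaks, -1),
  bin_end   = h$breaks[-1],
  count     = h$counts
)

print(center)
print(spread)
print(bins)
```

The same calls can be applied to a real column such as usa_cleaned_df$Assault once the data is loaded.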
Answer: USArrests
Create and plot the relationship between UrbanPop and Murder with a linear regression model using the cleaned USArrests dataset. Write the linear regression equation in the form: y = a + bx, with Murder being y and UrbanPop being x. For example, y = 4 + 12x.
Make sure to appropriately label the plot axes and title the plot.
Linear Regression
# lm: Fit a linear model.
# y ~ x: Specify that y and x are your DEPENDENT and INDEPENDENT variables
# data: The data frame or table from which we are pulling the values
# summary(): Print a summary table for the model
# You can read the goodness of fit (R-squared) from the summary table
usa_model <- lm(UrbanPop ~ Murder, data = usa_cleaned_df)
summary(usa_model)
##
## Call:
## lm(formula = UrbanPop ~ Murder, data = usa_cleaned_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.446 -10.870 1.335 9.845 24.583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1607 4.4726 13.675 <2e-16 ***
## Murder 0.5841 0.5041 1.159 0.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 43 degrees of freedom
## Multiple R-squared: 0.03028, Adjusted R-squared: 0.007726
## F-statistic: 1.343 on 1 and 43 DF, p-value: 0.253
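The intercept a and slope b for the equation y = a + bx are the two entries in the Estimate column. As a minimal sketch, they can also be pulled out of a fitted model with coef(). The data here are hypothetical values lying close to the example line y = 4 + 12x, and toy_df and toy_model are illustrative names.

```r
# Hypothetical data scattered closely around y = 4 + 12x
toy_df <- data.frame(
  x = 1:5,
  y = c(16.1, 27.9, 40.2, 51.8, 64.0)
)

# The formula reads response ~ predictor, so y is the dependent variable
toy_model <- lm(y ~ x, data = toy_df)

# coef() returns a named vector: "(Intercept)" is a, the predictor's name is b
coefs <- coef(toy_model)
a <- coefs[["(Intercept)"]]
b <- coefs[["x"]]

# Assemble the equation in the form y = a + bx
equation <- sprintf("y = %.2f + %.2fx", a, b)
print(equation)   # "y = 4.09 + 11.97x"
```

Whichever variable is written to the left of ~ plays the role of y in the resulting equation.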
Scatterplot
# y ~ x means that we are plotting the variables y on the y-axis and x on the x-axis
# We use these arguments to plot the data and label the scatter plot:
# main: Title the plot (character)
# xlab: Label the horizontal or x-axis (character)
# ylab: Label the vertical or y-axis (character)
# col: The colors of the points (character); see https://r-charts.com/colors/
# pch: The symbols of the points (numerical); see https://r-charts.com/base-r/pch-symbols/
# cex: The sizes of the points (numerical)
# Fill in the dots based on the two variables you choose!
plot(
  usa_cleaned_df$UrbanPop ~ usa_cleaned_df$Murder,
  main = "Predicted UrbanPop by Murder",
  xlab = "Murder",
  ylab = "UrbanPop",
  col = "red",
  pch = 3,
  cex = 1.5
)
# We use abline(model_name) to add regression lines to scatterplots
# lwd: Line width
# lty: Line type (lty = 2 makes dashed lines); see https://r-charts.com/base-r/line-types/
abline(usa_model, lwd = 1.5, lty = 2)
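Answer: From the coefficients above, the fitted line is y = 61.1607 + 0.5841x, with y = UrbanPop and x = Murder. Note that lm(UrbanPop ~ Murder) makes UrbanPop the response and Murder the predictor; to get the equation with Murder as y and UrbanPop as x, as the question asks, refit with lm(Murder ~ UrbanPop, data = usa_cleaned_df). Either way, the two variables appear only weakly related: the p-value for the slope is 0.253.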
Consider the linear regression model from Question 3. Explain whether the model is a good fit.
summary(usa_model)
##
## Call:
## lm(formula = UrbanPop ~ Murder, data = usa_cleaned_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.446 -10.870 1.335 9.845 24.583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1607 4.4726 13.675 <2e-16 ***
## Murder 0.5841 0.5041 1.159 0.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 43 degrees of freedom
## Multiple R-squared: 0.03028, Adjusted R-squared: 0.007726
## F-statistic: 1.343 on 1 and 43 DF, p-value: 0.253
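Answer: The model is not a good fit. The Multiple R-squared is 0.03028, so Murder explains only about 3% of the variation in UrbanPop, and the slope is not statistically significant (p-value: 0.253).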
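For a regression with a single predictor, R-squared equals the squared correlation between the two variables, so the reported 0.03028 corresponds to a correlation of roughly 0.17 in magnitude. The sketch below checks that identity and reads the fit statistics out of the summary object, using hypothetical data (toy_df, toy_model) rather than the lab's model.

```r
# Hypothetical data with a strong linear trend, for contrast with the lab's weak fit
toy_df <- data.frame(
  x = 1:5,
  y = c(16.1, 27.9, 40.2, 51.8, 64.0)
)
toy_model <- lm(y ~ x, data = toy_df)
toy_summary <- summary(toy_model)

# R-squared: share of the variation in y explained by the model
r_squared <- toy_summary$r.squared

# With one predictor, R-squared is the square of the correlation coefficient
r <- cor(toy_df$x, toy_df$y)
print(all.equal(r_squared, r^2))

# p-value for the slope, taken from the coefficient table
slope_p_value <- toy_summary$coefficients["x", "Pr(>|t|)"]
print(c(r_squared = r_squared, slope_p_value = slope_p_value))
```

An R-squared near 1 with a small slope p-value points to a good linear fit. Values like those in the lab output point the other way.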