Lab Exercises

Question 1

Consider the USArrests.csv dataset, which tracks the number of arrests by crime in each state. Find the mean of the Assault variable, first by using the optional argument na.rm = TRUE and then by removing all rows with any NA value anywhere (under any row and any column).

Compare the two means and explain which approach retains more data points and why.

Approach #1: Using mean() with na.rm = TRUE

# na.rm: Whether NA values are stripped prior to calculation; FALSE by default
# Remove NA values under ONLY Assault, then calculate the mean
mean(usa_df$Assault, na.rm = TRUE)

## [1] 169.5532

Approach #2: Cleaning the data

# na.omit(): Drop every row that contains an NA value in ANY column
usa_cleaned_df <- na.omit(usa_df)
mean(usa_cleaned_df$Assault)

## [1] 169.3778
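
To see why the two means can differ, it may help to count how many values each approach keeps. The sketch below uses a small made-up data frame; `toy_df` and all of its values are hypothetical and are not taken from USArrests.csv.

```r
# Hypothetical toy data with NA values placed by hand for illustration
toy_df <- data.frame(
  Murder   = c(13.2, NA, 8.1, 8.8, 9.0),
  Assault  = c(236, 263, NA, 190, 276),
  UrbanPop = c(58, 48, 80, NA, 91)
)

# Approach 1: only the rows where Assault itself is NA are skipped
n_na_rm    <- sum(!is.na(toy_df$Assault))
mean_na_rm <- mean(toy_df$Assault, na.rm = TRUE)

# Approach 2: a row is dropped if it has an NA in ANY column
toy_cleaned_df <- na.omit(toy_df)
n_omit    <- nrow(toy_cleaned_df)
mean_omit <- mean(toy_cleaned_df$Assault)

n_na_rm     # 4 values kept
n_omit      # 2 rows kept
mean_na_rm  # 241.25
mean_omit   # 256
```

In this toy case, `na.rm = TRUE` keeps four Assault values while `na.omit()` keeps only two rows, because rows with an NA under Murder or UrbanPop are dropped even though their Assault value is present.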

Answer: Delete this text and type in your answer.

Question 2

Report the mean, standard deviation, and number of observations of the Assault variable from USArrests for all states with an UrbanPop over 75. Repeat for all states with an UrbanPop under 65. Make sure to clean the dataset by removing all rows with NA values.

UrbanPop over 75:

# Check along the UrbanPop column which rows have a value ABOVE 75
pop_above_75 <- usa_cleaned_df$UrbanPop > 75

# usa_cleaned_df$Assault[pop_above_75]: Retrieve only the Assault values for rows with UrbanPop > 75

# Mean
mean(usa_cleaned_df$Assault[pop_above_75])

## [1] 200.75

# Standard Deviation
sd(usa_cleaned_df$Assault[pop_above_75])

## [1] 79.08928

# Number of observations
length(usa_cleaned_df$Assault[pop_above_75])

## [1] 12

UrbanPop below 65:

# Fill in the dots (...). Check the previous chunk in case you are stuck
# Check along the UrbanPop column which rows have a value BELOW 65
pop_below_65 <- usa_cleaned_df$UrbanPop < 65

# Hint:
# usa_cleaned_df$Assault[pop_above_75]: Retrieve only the Assault values for rows with UrbanPop > 75
# What should you do to get only rows with UrbanPop < 65?

# Mean
mean(usa_cleaned_df$Assault[pop_below_65])

# Standard Deviation
sd(usa_cleaned_df$Assault[pop_below_65])

# Number of observations
length(usa_cleaned_df$Assault[pop_below_65])
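
The same three statistics are computed for each group, so the repeated steps can be wrapped in a small helper. This is only a sketch: `summarise_group` is a hypothetical name, and the built-in `datasets::USArrests` (which has no NA values) stands in for the lab's CSV, so its results will not match the lab output.

```r
# Hypothetical helper: mean, sd, and count of `values` for the rows where `keep` is TRUE
summarise_group <- function(values, keep) {
  selected <- values[keep]
  c(mean = mean(selected), sd = sd(selected), n = length(selected))
}

# Stand-in data: R's built-in USArrests, cleaned the same way as in the lab
arrests <- na.omit(datasets::USArrests)

# Filter on UrbanPop, summarise Assault
above_75 <- summarise_group(arrests$Assault, arrests$UrbanPop > 75)
below_65 <- summarise_group(arrests$Assault, arrests$UrbanPop < 65)

print(above_75)
print(below_65)
```

Note that the logical filter is built from the UrbanPop column while the values come from the Assault column; mixing the two up changes which rows are selected.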

Question 3

Create a histogram of a variable of your choice in the cleaned USArrests dataset. Describe the distribution shown in the histogram (center, spread, etc.).

# main: Title the plot (character)
# xlab: Label the horizontal or x-axis (character)

# Fill in the variable's column name over the dots (...)
# We use the cleaned data (usa_cleaned_df) with no NA value

hist(
  usa_cleaned_df$Assault,
  main = "Histogram of Assault Arrests",
  xlab = "Assault"
)
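
Numerical summaries can back up a description of the histogram's center and spread. The sketch below assumes the built-in `datasets::USArrests` as a stand-in for the cleaned lab data, so the numbers it prints are not the lab's.

```r
# Stand-in data: R's built-in USArrests (assumption; the lab uses usa_cleaned_df)
arrests <- na.omit(datasets::USArrests)
assault <- arrests$Assault

# Center: the mean and median; a large gap between them suggests skew
center <- c(mean = mean(assault), median = median(assault))

# Spread: standard deviation, interquartile range, and the extremes
spread <- c(
  sd  = sd(assault),
  IQR = IQR(assault),
  min = min(assault),
  max = max(assault)
)

print(center)
print(spread)
```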

Answer: USArrests

Question 4

Fit a linear regression model relating UrbanPop and Murder using the cleaned USArrests dataset, and plot the relationship. Write the linear regression equation in the form y = a + bx, with UrbanPop being y and Murder being x. For example, y = 4 + 12x.

Make sure to appropriately label the plot axes and title the plot.

Linear Regression

# lm: Fit a linear model.
# y ~ x: Specify that y is the DEPENDENT variable and x is the INDEPENDENT variable
# data: The data frame or table from which we are pulling the values
# summary(): Print a summary table for the model
# You can read the goodness of fit (R-squared) from the summary table

usa_model <- lm(UrbanPop ~ Murder, data = usa_cleaned_df)
summary(usa_model)

## 
## Call:
## lm(formula = UrbanPop ~ Murder, data = usa_cleaned_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.446 -10.870   1.335   9.845  24.583 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.1607     4.4726  13.675   <2e-16 ***
## Murder        0.5841     0.5041   1.159    0.253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.36 on 43 degrees of freedom
## Multiple R-squared:  0.03028,    Adjusted R-squared:  0.007726 
## F-statistic: 1.343 on 1 and 43 DF,  p-value: 0.253
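
The intercept a and slope b of y = a + bx can also be pulled from a fitted model with `coef()` instead of being read off the table. A minimal sketch, assuming the built-in `datasets::USArrests` as stand-in data and a hypothetical model name `toy_model`, so its coefficients differ from the summary above:

```r
# Stand-in data and model (assumptions; the lab uses usa_cleaned_df and usa_model)
arrests   <- datasets::USArrests
toy_model <- lm(UrbanPop ~ Murder, data = arrests)

# coef() returns a named vector: "(Intercept)" is a, the predictor's name is b
a <- coef(toy_model)[["(Intercept)"]]
b <- coef(toy_model)[["Murder"]]

# %+.2f prints the slope with an explicit sign, so a negative slope reads correctly
equation <- sprintf("UrbanPop = %.2f %+.2f * Murder", a, b)
print(equation)
```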

Scatterplot

# y ~ x means that we are plotting the variables y on the y-axis and x on the x-axis
# We use these arguments to plot the data and label the scatter plot:
# main: Title the plot (character)
# xlab: Label the horizontal or x-axis (character)
# ylab: Label the vertical or y-axis (character)
# col: The colors of the points (character); see https://r-charts.com/colors/
# pch: The symbols of the points (numerical); see https://r-charts.com/base-r/pch-symbols/
# cex: The sizes of the points (numerical)
# Fill in the dots based on the two variables you choose!

plot(
  UrbanPop ~ Murder,
  data = usa_cleaned_df,
  main = "UrbanPop Predicted by Murder",
  xlab = "Murder",
  ylab = "UrbanPop",
  col = "red",
  pch = 3,
  cex = 1.5
)

# We use abline(model_name) to add regression lines to scatterplots
# lwd: Line width
# lty: Line type (lty = 2 makes dashed lines); see https://r-charts.com/base-r/line-types/
abline(usa_model, lwd = 1.5, lty = 2)

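Answer: From the coefficients table, the fitted equation is y = 61.1607 + 0.5841x, where y is UrbanPop and x is Murder (the roles given in the lm() call). The two variables seem to be only weakly related.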

Question 5

Consider the linear regression model from Question 4. Explain whether the model is a good fit.

summary(usa_model)

## 
## Call:
## lm(formula = UrbanPop ~ Murder, data = usa_cleaned_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.446 -10.870   1.335   9.845  24.583 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61.1607     4.4726  13.675   <2e-16 ***
## Murder        0.5841     0.5041   1.159    0.253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.36 on 43 degrees of freedom
## Multiple R-squared:  0.03028,    Adjusted R-squared:  0.007726 
## F-statistic: 1.343 on 1 and 43 DF,  p-value: 0.253

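Answer:

Multiple R-squared:  0.03028

The model explains only about 3% of the variation in the response variable, and the slope is not statistically significant (p-value: 0.253). This is a very poor fit.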
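
The goodness-of-fit numbers can be extracted from the summary object rather than copied by hand. The sketch below again assumes the built-in `datasets::USArrests` and a hypothetical `toy_model`, so its values will not equal those printed above. It also checks a known property of simple linear regression: R-squared equals the squared correlation between the two variables.

```r
# Stand-in data and model (assumptions; the lab uses usa_cleaned_df and usa_model)
arrests   <- datasets::USArrests
toy_model <- lm(UrbanPop ~ Murder, data = arrests)
model_sum <- summary(toy_model)

# Proportion of variation in the response explained by the model
r_squared <- model_sum$r.squared

# With a single predictor, R-squared is the squared Pearson correlation
r_from_cor <- cor(arrests$UrbanPop, arrests$Murder)^2

# p-value for the slope, taken from the coefficients table
slope_p <- model_sum$coefficients["Murder", "Pr(>|t|)"]

print(c(r_squared = r_squared, r_from_cor = r_from_cor, slope_p = slope_p))
```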

Stats 10 Lab 4 Exercises

Read in the `USArrests` dataset:
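
The exercises refer to a data frame named `usa_df`. A minimal sketch of how it might be created is shown here; the file location `csv_path` is an assumption, and the fallback to R's built-in `datasets::USArrests` (which has no NA values) is only so the code runs without the CSV.

```r
# Assumed location of the lab file; adjust to wherever USArrests.csv is saved
csv_path <- "USArrests.csv"

if (file.exists(csv_path)) {
  usa_df <- read.csv(csv_path)
} else {
  # Fallback for illustration only: the built-in dataset contains no NA values
  usa_df <- datasets::USArrests
}

# Inspect the columns and count the NA values in each one
str(usa_df)
colSums(is.na(usa_df))
```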
