# Loading necessary packages
pacman::p_load(
rio,
skimr,
dplyr,
rstatix,
flextable,
pastecs
)
# Importing data
vital <- import("VS10.csv")
# Subsetting violent crime plus five interval-level variables
vital_sub <- vital %>%
  select(
    `Violent Crime Rate per 1,000 Residents`,
    `Unemployment Rate`,
    `Median Household Income`,
    `High School Dropout/Withdrawl Rate`,
    `Percentage of Residential Properties that are Vacant and Abandoned`,
    `Rate of Dirty Streets and Alleys Reports per 1,000 Residents`
  )
Assignment02: Exploratory Data Analysis
Introduction
For this assignment, you will produce descriptive statistics for a subset of variables in the dataset. There are three goals for the assignment. The first is for you to become familiar with the different types of R code that can be used to run descriptive statistics. The second is for you to become familiar with the Vital Signs data. The third is to learn about the specific qualities of the variables and dataset for the later assignments in the semester.
There are variations in how code can be written, and there are several packages that can be used to produce descriptive statistics. For full credit, the code and the output must use the packages covered in the lecture.
Directions
Use the lecture material from Module 4 and the reading material from Practical Data Science with R, Chapter 3.1, and R in Action, Chapter 4, Section 4.4 and Chapter 7.1 as a guide for making changes in the dataset and producing descriptive statistics.
The completed assignment should adhere to the following guidelines:
a. Include your answers in the Quarto assignment document. Use the packages and code demonstrated in the lecture material.
b. Write your answers using complete sentences with correct punctuation, grammar, and spelling.
c. Submit your completed qmd file and the html file generated by the Quarto file to the Canvas assignment portal.
Use the Vital Signs 2010 CSV file posted in Module 2 for the assignment (VS10.csv). Include flextable to format the tables that are part of the output. Use the BNIA links from Module 2 for complete information about the variables. Answer the following questions:
- Create a subset of the BNIA Vital Signs data with five interval level variables of your choice. Choose five that you find interesting or that you expect to have an association with violent crime. Subset those five variables plus the violent crime variable (Violent Crime Rate per 1000 Residents), and paste the code that you used to import packages, import the data, and subset the data below. (To select variables that are not adjacent in the dataset, just use a comma between the column numbers rather than a colon). (1 point)
- Run descriptive statistics for the subset using the rstatix package. Use the links for the BNIA site from Module 2 to find a full description of the variables. Describe each of the variables using what you observe from the descriptive statistics. Describe the results in terms of statistical distributions and in terms of what it means in the real world. For example, ‘the minimum value of zero means that . . .’ Use the available statistics as fully as possible in your description, and discuss each of the variables. (3 points)
vital_sub %>%
  get_summary_stats(., type = "common") %>%
  flextable::flextable() %>%
  flextable::autofit()

| variable | n | min | max | median | iqr | mean | sd | se | ci |
|---|---|---|---|---|---|---|---|---|---|
| Violent Crime Rate per 1,000 Residents | 55 | 1.765 | 97.853 | 15.180 | 12.785 | 16.932 | 13.534 | 1.825 | 3.659 |
| Unemployment Rate | 55 | 3.424 | 26.251 | 11.367 | 7.557 | 12.192 | 5.941 | 0.801 | 1.606 |
| Median Household Income | 55 | 13,811.240 | 96,854.470 | 37,034.480 | 18,909.960 | 41,892.963 | 17,087.876 | 2,304.129 | 4,619.500 |
| High School Dropout/Withdrawl Rate | 55 | 0.000 | 8.120 | 3.937 | 1.479 | 3.872 | 1.359 | 0.183 | 0.367 |
| Percentage of Residential Properties that are Vacant and Abandoned | 55 | 0.000 | 40.000 | 3.000 | 9.000 | 7.717 | 10.707 | 1.444 | 2.894 |
| Rate of Dirty Streets and Alleys Reports per 1,000 Residents | 55 | 3.170 | 611.875 | 41.685 | 60.212 | 76.309 | 100.945 | 13.611 | 27.289 |
Violent Crime Rate per 1,000 Residents: The violent crime rate indicates the number of violent crimes (i.e., homicide, rape, aggravated assault, and robbery) reported to a community’s police department per 1,000 residents. For this indicator, higher values indicate areas with more violent crime. Across the 55 communities of Baltimore, MD, there is considerable variation in the rate of violent crime. Specifically, the minimum rate was 1.76 violent crimes per 1,000 residents, whereas the maximum rate was 97.85 violent crimes per 1,000 residents. Despite this wide range, the middle 50% of Baltimore communities had violent crime rates between 8.79 (25th percentile) and 21.57 (75th percentile). Moreover, the average violent crime rate was 16.93 violent crimes per 1,000 residents. Thus, although the violent crime rate varies by community (as indicated by the wide range of values and the large standard deviation of 13.53), 75 percent of all Baltimore communities had a violent crime rate below 21.57 violent crimes per 1,000 residents.
Unemployment Rate: The unemployment rate indicates the percentage of all individuals between ages 16 and 64 who are in the labor force but currently unemployed. As with the violent crime rate, higher values on this indicator represent a higher percentage of unemployed people in a given community. Consistent with the wide variability of violent crime rates, unemployment rates ranged from 3.42 to 26.25 percent across the 55 communities in 2010. However, 75 percent of communities had unemployment rates below 15.14 percent in 2010. To put this in perspective, in 2010 the national unemployment rate was 9.6 percent (Bureau of Labor Statistics), and the state of Maryland’s unemployment rate fell below the national rate at 7.5 percent (Bureau of Labor Statistics). Both the median (11.37%) and the average (12.19%) unemployment rate across Baltimore communities in 2010 were substantially higher than the state rate (by 3.87 and 4.69 percentage points, respectively) and also higher than the national rate. It can therefore be inferred that Baltimore as a whole experienced greater unemployment than the average community in Maryland and nationally.
Median Household Income: Median household income represents the central value of all household incomes within a community in the prior year. In 2010, the median household income of Baltimore communities ranged from $13,811.24 to $96,854.47, with an average of $41,892.96. For comparison purposes, the median household income in the U.S. in 2010 was $49,445, with households residing in the Northeast (including Maryland) at $50,599 (U.S. Census Bureau). Across Baltimore, 75 percent of communities had a median household income less than $74,854.40. Despite this, half of the communities had a median household income below $37,034.48, which is well below the national and regional values. This suggests that although Baltimore contained some highly affluent areas in 2010, a large portion of its communities fell well below the national and regional median household income values.
High School Dropout/Withdrawal Rate: The high school dropout/withdrawal rate indicates the percentage of 9th through 12th grade students who withdrew from public school out of all high school students within a school year, with higher values indicating a greater proportion of students leaving high school before graduation. Across Baltimore communities in 2010, the high school dropout rate ranged from 0 to 8.12 percent, with an average of 3.87 percent. In 2010, the national dropout rate for 16- to 24-year-olds in the U.S. was 8.3 percent (National Center for Education Statistics). This suggests that each Baltimore community had a high school dropout rate lower than the national average in 2010. Moreover, 75 percent of Baltimore communities had a high school dropout rate below 4.67 percent in 2010, which is a little more than half the national rate.
Percentage of Residential Properties that are Vacant and Abandoned: The percentage of vacant and abandoned residential properties represents the share of all residential properties that are vacant or abandoned (as deemed by the Baltimore City Department of Housing). While 75 percent of communities had fewer than 7.5 percent vacant or abandoned properties, at least one community had as much as 40 percent of all residential properties deemed abandoned or vacant. On average, the percentage of vacant and abandoned residential properties across Baltimore communities was 7.72 percent, well above the median of 3.00 percent.
Rate of Dirty Streets and Alleys Reports per 1,000 Residents: The rate of reported dirty streets and alleys represents the number of service requests for dirty streets and alleys through Baltimore’s 311 system per 1,000 residents. Of note, more than one unique service request may be logged for the same issue. On average, Baltimore communities received 76.31 reports of dirty streets and alleys per 1,000 residents in 2010. However, the median number of reports per 1,000 residents was far below the mean at 41.68. This difference, combined with the fact that the rate ranges from as little as 3.17 reports per 1,000 residents to over 600, suggests very wide variability across the communities. Despite this, 75 percent of communities had a rate of dirty streets and alleys reports below 71.79 per 1,000 residents. The fact that the 75th percentile is below the mean further illustrates the presence of outliers.
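The percentile and mean-versus-median comparisons used in the descriptions above can be checked with base R. The following is a minimal sketch on a small made-up vector; the values in `toy_rate` are hypothetical and are not Vital Signs data.

```r
# Hypothetical rates for nine communities; one value is an extreme outlier.
toy_rate <- c(1, 2, 3, 4, 5, 6, 7, 8, 90)

# 25th and 75th percentiles (base R default, type 7 interpolation).
q <- quantile(toy_rate, probs = c(0.25, 0.75))
toy_iqr <- IQR(toy_rate)

# A mean far above the median signals a right-skewed distribution.
toy_mean <- mean(toy_rate)
toy_median <- median(toy_rate)

# TRUE when the outlier pulls the mean above the 75th percentile.
mean_above_q3 <- toy_mean > q[["75%"]]
```

In this toy case the single outlier pulls the mean above the 75th percentile, which is the same pattern described for the dirty streets and alleys indicator.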
- Use skimr to produce means and histograms for the variables. (1 point)
skim(vital_sub) %>%
  dplyr::select(skim_variable, numeric.mean, numeric.hist) %>%
  flextable::flextable() %>%
  flextable::autofit()

| skim_variable | numeric.mean | numeric.hist |
|---|---|---|
| Violent Crime Rate per 1,000 Residents | 16.932181 | ▇▃▁▁▁ |
| Unemployment Rate | 12.192498 | ▇▇▃▂▂ |
| Median Household Income | 41,892.962909 | ▅▇▃▂▁ |
| High School Dropout/Withdrawl Rate | 3.871666 | ▁▅▇▂▁ |
| Percentage of Residential Properties that are Vacant and Abandoned | 7.717091 | ▇▁▁▁▁ |
| Rate of Dirty Streets and Alleys Reports per 1,000 Residents | 76.308968 | ▇▁▁▁▁ |
- Produce a table of means by quartile for violent crime (Violent Crime Rate per 1000 Residents). Report what you observe about the relationship between the variables you have chosen and the violent crime quartiles. Also include comparisons with the means for all of the cases produced for Question 3. (2 points)
# Binning communities into quartiles of the violent crime rate
vital_sub$crime_cat <- cut(
  vital_sub$`Violent Crime Rate per 1,000 Residents`,
  quantile(vital_sub$`Violent Crime Rate per 1,000 Residents`),
  include.lowest = TRUE
)
# Mean of each of the five chosen variables within each quartile;
# arguments to mean() are passed through an anonymous function,
# as required since dplyr 1.1.0
vital_sub %>%
  group_by(crime_cat) %>%
  summarize(across(.cols = c(2:6),
                   .fns = \(x) mean(x, na.rm = TRUE))) %>%
  flextable::flextable() %>%
  flextable::autofit()
| crime_cat | Unemployment Rate | Median Household Income | High School Dropout/Withdrawl Rate | Percentage of Residential Properties that are Vacant and Abandoned | Rate of Dirty Streets and Alleys Reports per 1,000 Residents |
|---|---|---|---|---|---|
| [1.76,9.51] | 8.01871 | 58,616.74 | 2.782178 | 0.7957143 | 20.83623 |
| (9.51,15.2] | 10.45171 | 41,203.55 | 3.658309 | 1.8571429 | 35.54885 |
| (15.2,22.3] | 13.41688 | 39,653.58 | 4.294606 | 10.5615385 | 116.12477 |
| (22.3,97.9] | 16.97015 | 27,938.03 | 4.781780 | 17.8571429 | 135.57000 |
For every variable except median household income, the mean rose from each violent crime quartile to the next: communities with higher violent crime rates had, on average, higher unemployment rates, higher high school dropout/withdrawal rates, larger percentages of vacant and abandoned residential properties, and higher rates of dirty streets and alleys reports. Median household income showed an inverse relationship with the rate of violent crime, such that the more affluent a community was, the lower its violent crime rate.
The average unemployment rate across all communities (M = 12.19), the average percentage of vacant or abandoned residential properties (M = 7.72), and the average rate of dirty streets and alleys reports per 1,000 residents (M = 76.31) were each closest to the corresponding means in the second highest violent crime rate quartile (i.e., 15.2-22.3 violent crimes per 1,000 residents). However, the average median household income (M = 41,892.96) and the average high school dropout/withdrawal rate (M = 3.87) across all communities were closest to those of the second lowest violent crime rate quartile (i.e., 9.51-15.2 violent crimes per 1,000 residents).
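The comparison between quartile means and the overall mean can also be made programmatically rather than by eye. The following is a minimal base R sketch under stated assumptions: the data frame `toy` and its columns `crime` and `indicator` are hypothetical stand-ins, not Vital Signs values.

```r
# Hypothetical data for eight communities (not Vital Signs values).
toy <- data.frame(
  crime = c(2, 4, 6, 8, 10, 12, 14, 16),
  indicator = c(1, 2, 3, 4, 5, 6, 7, 20)
)

# Bin communities into quartiles of the crime rate.
toy$crime_cat <- cut(toy$crime, quantile(toy$crime), include.lowest = TRUE)

# Mean of the indicator within each quartile and across all cases.
group_means <- tapply(toy$indicator, toy$crime_cat, mean)
overall_mean <- mean(toy$indicator)

# Position of the quartile whose mean is closest to the overall mean.
closest <- which.min(abs(group_means - overall_mean))
```

Here `closest` holds the position (and the interval label as its name) of the quartile whose mean lies nearest the overall mean, which avoids misreading a wide table.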
- Use pastecs to produce tests for normality. Select two of the variables to explore further and generate histograms for those two. Report what you observe from the statistics about the skewness, kurtosis, and normality or non-normality of the distributions of the variables for all of the variables. (3 points)
# Rounding to two decimal places
stable <- flextable(round(stat.desc(vital_sub[, c(1:6)], basic = FALSE, desc = TRUE, norm = TRUE, p = 0.95), 2))
(normtable <- as.data.frame(stable[["body"]][["dataset"]]) %>%
  tibble::rownames_to_column() %>%
  flextable::flextable() %>%
  flextable::autofit())

| rowname | Violent Crime Rate per 1,000 Residents | Unemployment Rate | Median Household Income | High School Dropout/Withdrawl Rate | Percentage of Residential Properties that are Vacant and Abandoned | Rate of Dirty Streets and Alleys Reports per 1,000 Residents |
|---|---|---|---|---|---|---|
| median | 15.18 | 11.37 | 37,034.48 | 3.94 | 3.00 | 41.68 |
| mean | 16.93 | 12.19 | 41,892.96 | 3.87 | 7.72 | 76.31 |
| SE.mean | 1.82 | 0.80 | 2,304.13 | 0.18 | 1.44 | 13.61 |
| CI.mean.0.95 | 3.66 | 1.61 | 4,619.50 | 0.37 | 2.89 | 27.29 |
| var | 183.18 | 35.30 | 291,995,494.28 | 1.85 | 114.63 | 10,189.86 |
| std.dev | 13.53 | 5.94 | 17,087.88 | 1.36 | 10.71 | 100.94 |
| coef.var | 0.80 | 0.49 | 0.41 | 0.35 | 1.39 | 1.32 |
| skewness | 3.79 | 0.63 | 1.03 | -0.10 | 1.60 | 3.36 |
| skew.2SE | 5.88 | 0.98 | 1.60 | -0.15 | 2.49 | 5.22 |
| kurtosis | 20.46 | -0.55 | 0.86 | 1.90 | 1.56 | 13.44 |
| kurt.2SE | 16.14 | -0.43 | 0.68 | 1.50 | 1.23 | 10.61 |
| normtest.W | 0.65 | 0.94 | 0.92 | 0.95 | 0.73 | 0.61 |
| normtest.p | 0.00 | 0.01 | 0.00 | 0.03 | 0.00 | 0.00 |
hist(vital_sub$`Median Household Income`)
hist(vital_sub$`Percentage of Residential Properties that are Vacant and Abandoned`)
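The following variables are significantly skewed right, as indicated by a positive skew.2SE value greater than 1 (values in parentheses): violent crime rate per 1,000 residents (5.88), median household income (1.60), percentage of vacant or abandoned residential properties (2.49), and the rate of dirty streets and alleys reports per 1,000 residents (5.22). Unemployment rate (0.98) and high school dropout/withdrawal rate (-0.15) did not show significant skew. In terms of kurtosis, all variables except for unemployment rate (-0.43) and median household income (0.68) had kurt.2SE values above 1. Taken together, the only distribution that showed neither significant skew nor significant kurtosis was that of unemployment rate, which makes it the closest to a normal distribution on these two criteria. However, the Shapiro-Wilk test (normtest.p) was below .05 for all six variables, including unemployment rate (p = 0.01) and high school dropout/withdrawal rate (p = 0.03), so none of the distributions can be treated as strictly normal.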
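To make the skew.2SE and kurt.2SE values less of a black box, the sketch below recomputes them for the violent crime rate from the rounded skewness and kurtosis in the table above. It assumes the usual standard-error formulas for skewness and kurtosis, which appear to be what `stat.desc()` applies because they reproduce the reported values with n = 55. The helper names `se_skew` and `se_kurt` are hypothetical, and the sample passed to `shapiro.test()` is simulated, not Vital Signs data.

```r
# Standard errors of skewness and kurtosis for a sample of size n
# (assumed to match the formulas behind skew.2SE and kurt.2SE).
se_skew <- function(n) {
  sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
}
se_kurt <- function(n) {
  sqrt(24 * n * (n - 1)^2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
}

n <- 55

# Rounded violent crime statistics taken from the table above.
skew_2se <- 3.79 / (2 * se_skew(n))
kurt_2se <- 20.46 / (2 * se_kurt(n))

# An absolute value above 1 means the statistic is more than two
# standard errors from zero, i.e. significant at roughly p < .05.
skew_significant <- abs(skew_2se) > 1
kurt_significant <- abs(kurt_2se) > 1

# Shapiro-Wilk test on a simulated right-skewed sample (hypothetical data).
set.seed(1)
toy_sample <- rexp(55)
sw <- shapiro.test(toy_sample)
```

Because the inputs are rounded to two decimals, the recomputed values agree with the table only to about the second decimal place.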