The goal of this document is to help you organize your thoughts and your workflow for the group data project.
You will not submit this document.
The goal of the project is to explore and wrangle an interesting data set, ultimately using it to answer two research questions. The work will be done in a small group of 3-4. The deliverable is a fully developed report that introduces the data, describes any data management/wrangling, introduces two research questions, provides insightful visualizations, and performs statistical inference to provide answers.
The code book for the data is available as a handout in class and as a document in Canvas. Look it over. With your group, consider a few preliminary ideas about variables you might be particularly interested in, and what kinds of research questions might be appropriate. You can use this space to write a few notes if you desire.
Use the code chunk below to load the data into the workspace and begin exploring your variables. Hint: look at data summaries, view a list of unique values for a variable, and create some visualizations. This should help you decide whether you want to continue pursuing your original ideas or modify them.
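# dplyr supplies glimpse() and the wrangling verbs used later;
# mosaic supplies the formula-style tally() and xchisq.test()
library(dplyr)
library(mosaic)
# Opens a file picker; choose the project CSV file
RawData <- read.csv(file.choose())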
glimpse(RawData)
Rows: 75,000
Columns: 28
$ incident_id <dbl> 20216002897, 2020171061, 2024686299, 2022658747, 202458408…
$ offense_type_id <chr> "criminal-mischief-mtr-veh", "theft-of-motor-vehicle", "th…
$ reported_date <chr> "1/30/21 23:44", "3/19/20 15:54", "12/23/24 6:54", "12/26/…
$ nbrhd_name <chr> "washington-park-west", "gateway-green-valley-ranch", "cen…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ hour <int> 23, 15, 6, 19, 7, 3, 21, 2, 6, 22, 23, 16, 16, 20, 22, 3, …
$ day <chr> "Saturday", "Thursday", "Monday", "Monday", "Saturday", "S…
$ month <chr> "Jan", "Mar", "Dec", "Dec", "Oct", "Apr", "Aug", "Nov", "O…
$ year <int> 2021, 2020, 2024, 2022, 2024, 2021, 2020, 2024, 2022, 2024…
$ total_population <int> 7382, 43809, 29939, 16718, 16120, 8907, 8628, 12548, 4631,…
$ percent_black <dbl> 0.80, 27.62, 9.00, 7.99, 3.21, 2.54, 0.60, 1.80, 6.91, 9.0…
$ percent_white <dbl> 83.54, 19.12, 66.88, 70.89, 73.94, 65.98, 79.51, 39.75, 70…
$ percent_native_american <dbl> 0.00, 0.15, 0.55, 0.34, 0.00, 0.08, 0.00, 0.18, 0.00, 0.55…
$ percent_aapi <dbl> 2.74, 6.69, 7.97, 1.91, 4.25, 4.06, 8.19, 3.88, 2.79, 7.97…
$ percent_male <dbl> 54.57, 49.72, 49.45, 49.76, 55.50, 46.03, 56.17, 49.64, 61…
$ percent_under18 <dbl> 12.31, 28.28, 31.93, 15.72, 1.55, 22.05, 1.72, 25.05, 0.43…
$ percent_65plus <dbl> 12.26, 8.12, 8.11, 22.45, 10.33, 25.31, 19.19, 13.89, 8.62…
$ percent_in_school <dbl> 13.93, 28.88, 28.65, 16.19, 10.02, 24.80, 9.34, 23.34, 25.…
$ percent_only_english <dbl> 86.83, 50.92, 75.12, 82.25, 91.48, 82.73, 91.83, 62.57, 84…
$ percent_spanish <dbl> 3.77, 29.65, 4.11, 7.33, 5.04, 6.39, 3.04, 24.31, 7.13, 4.…
$ percent_housing_owner_occupied <dbl> 45.57, 70.94, 67.77, 45.67, 22.67, 74.68, 22.06, 68.25, 21…
$ percent_housing_renter_occupied <dbl> 48.75, 26.13, 28.65, 49.12, 72.45, 20.01, 57.66, 29.80, 58…
$ median_earn <int> 62336, 41018, 75472, 47752, 49134, 48793, 70918, 36720, 55…
$ median_earn_female <int> 60213, 35993, 54630, 39635, 42961, 44190, 54864, 31476, 32…
$ median_earn_male <int> 63980, 46346, 100308, 56654, 56624, 53854, 78000, 39537, 7…
$ median_rent <int> 1081, 1183, 1164, 1057, 958, 1019, 1317, 963, 1125, 1164, …
$ median_home_value <int> 680135, 360326, 636942, 463261, 339727, 447906, 732075, 36…
$ percent_poverty <dbl> 6.35, 8.40, 3.05, 5.92, 9.62, 12.37, 10.80, 10.50, 23.50, …
tally(~month, data = RawData)
month
Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
6024 6434 5730 6263 7091 6464 6115 6565 6267 5526 6294 6227
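The exploration hint above (summaries, unique values, counts) can be tried with a few short commands. This is a minimal sketch, assuming the dplyr package is installed. The data frame `toy` is a hypothetical stand-in with made-up rows and two of the code book's column names, so the chunk runs on its own; swap in `RawData` to explore the real data.

```r
library(dplyr)

# Hypothetical stand-in for RawData (made-up rows, real column names)
toy <- data.frame(
  offense_type_id = c("theft-of-motor-vehicle", "criminal-mischief-mtr-veh",
                      "theft-of-motor-vehicle", "theft-bicycle"),
  hour = c(23, 15, 6, 19)
)

# Unique values of a categorical variable
offense_types <- sort(unique(toy$offense_type_id))
offense_types

# Numeric summary of a quantitative variable
summary(toy$hour)

# Counts per category, most common first
offense_counts <- toy |> count(offense_type_id, sort = TRUE)
offense_counts
```

Looking at the unique values first is a quick way to spot categories that need to be combined or filtered before you settle on a research question.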
Write down your research questions, and translate them into the correct hypothesis tests. Review the research question guidelines in step one if necessary.
Is there an association between the time of day and the number of arrests for theft-of-motor-vehicle?
Time intervals:
8pm-midnight, midnight-4am, 4am-8am, 8am-noon, noon-4pm, 4pm-8pm
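One way to build these six intervals is to bin the `hour` variable with `cut()`. This is a minimal sketch, assuming dplyr is installed and that `hour` runs from 0 to 23 as in the glimpse output. The data frame `toy`, the vector `time_labels`, and the column name `time_interval` are hypothetical names chosen for illustration. One possible framing of the test is a chi-square goodness-of-fit test, with a null hypothesis that incidents are equally likely in each interval; confirm the framing with your professor.

```r
library(dplyr)

# Hypothetical stand-in for RawData (made-up rows, real column names)
toy <- data.frame(
  offense_type_id = rep("theft-of-motor-vehicle", 6),
  hour = c(0, 3, 4, 11, 16, 23)
)

time_labels <- c("midnight-4am", "4am-8am", "8am-noon",
                 "noon-4pm", "4pm-8pm", "8pm-midnight")

binned <- toy |>
  filter(offense_type_id == "theft-of-motor-vehicle") |>
  mutate(time_interval = cut(hour,
                             breaks = seq(0, 24, by = 4),
                             right = FALSE,  # intervals are [0,4), [4,8), ...
                             labels = time_labels))

# .drop = FALSE keeps intervals that have no incidents, with a count of 0
interval_counts <- binned |> count(time_interval, .drop = FALSE)
interval_counts
```

Using `right = FALSE` places hour 4 in the 4am-8am interval rather than in midnight-4am, which matches how the intervals are written.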
Once you have your research questions, it is strongly recommended that you talk with your professor (in lab) about whether they can be answered with your data and what kind of data manipulation might be needed. You should think carefully about what type of data wrangling you think should be done. Your professor can help confirm your thoughts, or suggest different approaches. It’s possible you have a data wrangling idea that isn’t directly covered by the content in lecture. Your professor may be able to assist.
Use this space to manipulate the data in whatever way you deem necessary. We suggest you select only the variables you need to perform your data analysis.
# Research Question 1
# Keep only the variables we care about: vehicle-related offenses and month of the year
denver_crime_data_1 <- RawData |>
  # Only look at crimes related to vehicles. One regular expression replaces the
  # five separate grepl() calls ('veh' already covers 'vehicle').
  # Note: 'car' also matches any offense name that contains it inside a longer
  # word (for example 'carrying'), so check unique(offense_type_id) afterwards.
  filter(grepl('veh|car|bicycle|auto', offense_type_id)) |>
  select(offense_type_id, month)
glimpse(denver_crime_data_1)
# Adds season_category, mapping each month to its season
# (no group_by() is needed because case_when() works row by row)
denver_crime_data_1 <- denver_crime_data_1 |>
  mutate(season_category = case_when(
    month %in% c('Dec', 'Jan', 'Feb') ~ 'winter',
    month %in% c('Mar', 'Apr', 'May') ~ 'spring',
    month %in% c('Jun', 'Jul', 'Aug') ~ 'summer',
    month %in% c('Sep', 'Oct', 'Nov') ~ 'autumn',
    .default = 'other'  # .default requires dplyr 1.1.0 or later
  ))
denver_crime_data_1
# Counts the number of crimes in each season and each season's share of the total
counts_1 <- denver_crime_data_1 |>
  group_by(season_category) |>
  summarise(
    number_of_crimes = n()
  ) |>
  mutate(
    # n() takes no arguments; divide each count by the overall total instead
    proportion = number_of_crimes / sum(number_of_crimes)
  )
counts_1
You should be able to clearly describe the data manipulations you perform, as well as the resulting variables.
Produce two visualizations that help you gain insight into the research questions from part 2 (one for each). Think about what the visualizations tell you about your research question.
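A bar chart of counts per category is one option when the research question compares groups. This is a minimal sketch, assuming the ggplot2 package is installed. The data frame `season_counts` and its numbers are made up for illustration and are not results from the project data.

```r
library(ggplot2)

# Hypothetical counts (made-up numbers, not results from the Denver data)
season_counts <- data.frame(
  season_category = factor(c("winter", "spring", "summer", "autumn"),
                           levels = c("winter", "spring", "summer", "autumn")),
  number_of_crimes = c(120, 100, 110, 90)
)

# geom_col() draws bars whose heights are the values already in the data
season_plot <- ggplot(season_counts,
                      aes(x = season_category, y = number_of_crimes)) +
  geom_col() +
  labs(x = "Season",
       y = "Number of reported vehicle-related crimes",
       title = "Vehicle-related crimes by season (toy data)")
season_plot
```

Setting the factor levels explicitly keeps the seasons in calendar order instead of alphabetical order.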
Confirm that it is appropriate to use a theoretical distribution (for example, normal, t, or chi-square) to perform inference by checking the conditions of the test, then perform the inference using prop.test, t.test, cor.test, or xchisq.test. You will need to do this twice, once for each research question.
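For a question about counts across categories, the condition check and the test might look like the sketch below. It uses base R's `chisq.test` so that the chunk runs without extra packages. The counts are made up, and the equal-shares null hypothesis is a simplifying assumption chosen for illustration.

```r
# Hypothetical counts (made-up numbers, not results from the Denver data)
observed <- c(winter = 120, spring = 100, summer = 110, autumn = 90)

# Null hypothesis: each season has the same share of crimes
# (a simplifying assumption for this sketch)
null_props <- rep(1 / 4, 4)

# Condition check: expected counts should all be at least 5
expected <- sum(observed) * null_props
all(expected >= 5)

# Chi-square goodness-of-fit test
season_test <- chisq.test(x = observed, p = null_props)
season_test$statistic
season_test$p.value
```

If any expected count falls below 5, the chi-square approximation may not be appropriate, and that is worth raising with your professor before you interpret the p-value.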
Based on the outcome of Step 5, you need to answer both your research questions.
Take everything you’ve done in this worksheet and organize it using the DataProjectReportTemplate.
The text of your report must be fewer than 600 words. To check the word count, go to Edit -> Word Count. The two figures will go at the end.
You’ll need the following sections (also described in the template):
Introduce the dataset in your own words. Briefly discuss the motivation and curiosity behind your choice of research questions.
For the first research question, write the research question in the section title and clearly detail any data manipulations you performed to be able to answer it. Do not include R code; use complete sentences to describe what you did.
For the second research question, write the research question in the section title and clearly detail any data manipulations you performed to be able to answer it. Do not include R code; use complete sentences to describe what you did.
Include one figure for each research question, in order.