We’ll use the data file project_data.rda, which should be in the same directory as this markdown file (which should also be your working directory). It contains 2 data frames:
crime.subset: a data set of crimes committed in Houston in 2010. This is taken from lecture 6.
movie.data: contains the release year, title, genre, budget, and gross earnings for 4998 movies. This data set was constructed by combining financial information listed at http://www.the-numbers.com/movie/budgets/all with IMDB genre information from ftp://ftp.fu-berlin.de/pub/misc/movies/database/genres.list.gz.
Additionally, there is one more data set in the file movie_genre_data.rda, which you may optionally use for part 4:
movie.genre.data: similar to movie.data, but contains genre and release year information for the movies as well. Note that if a movie belongs to multiple genres, then it is listed multiple times in the data set.
load('project_data.rda')
library(ggplot2)
library(plyr)
library(reshape2)
library(splines)
library(boot)
library(broom)
## Warning: package 'broom' was built under R version 3.6.3
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(formattable)
## Warning: package 'formattable' was built under R version 3.6.3
#install.packages("formattable")
#install.packages("broom")
In Lecture 6, we counted the number of crimes occurring each week in each block, and we looked for counts which were abnormally high. To do this, we computed p-values under the hypothesis that the number of crimes was Poisson distributed for each block and each week, where the Poisson parameter lambda varied by block (and equaled the average weekly rate for that block).
Here we will repeat this exercise, but restrict to certain types of crimes. After that, we will look for specific addresses (instead of entire city blocks) and days (instead of weeks) which had unusual crime counts.
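As a quick reminder of the mechanics before repeating them, here is a minimal sketch with made-up numbers (not part of the assignment code): the p-value for one block-week is the upper-tail probability of its observed count under that block's Poisson rate.
# minimal illustration with hypothetical numbers: upper-tail Poisson p-value
x      <- 7      # made-up count for one block-week
lambda <- 1.11   # made-up weekly average for that block
ppois(x - 1, lambda, lower.tail = FALSE)                        # P(X >= x)
poisson.test(x, r = lambda, alternative = 'greater')$p.value    # same tail probability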
Question 1a. Count the number of auto theft crimes that occur in each block, each week. For each block, compute the average number of auto theft crimes per week. Construct a table showing the 5 block-weeks with the highest number of auto thefts, along with the average number occurring per week at each block.
Hint 1: to get the average number of crimes per week, divide the total number of crimes by the number of weeks in the data set, which is 35. (The way we did this in the notes might not give the correct answer – you might want to think about why.)
Hint 2: your table should have 4 columns: the block, the week, the number of auto thefts that block-week, and the average number of auto thefts per week for that block.
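To make hint 1 above concrete with made-up numbers: a block with 5 auto thefts spread over only 3 distinct weeks also has 32 weeks with zero thefts that never appear as rows, so its weekly average should use all 35 weeks.
# hypothetical block: 5 auto thefts spread over 3 distinct weeks, 35 weeks total
5 / 35   # correct average per week (divide by all 35 weeks)
5 / 3    # too large: divides by only the weeks that appear as rows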
# preview data
head(crime.subset)
## time date hour premise offense beat block street
## 82730 2010-01-01 1/1/2010 0 13R robbery 13D10 4700-4799 telephone
## 82735 2010-01-01 1/1/2010 0 20A burglary 10H60 2500-2599 southmore
## 82736 2010-01-01 1/1/2010 0 20R burglary 13D10 6300-6399 rupley
## 82737 2010-01-01 1/1/2010 0 20A burglary 3B10 5000-5099 georgi
## 82739 2010-01-01 1/1/2010 0 20R burglary 15E10 5700-5799 jason
## 82740 2010-01-01 1/1/2010 0 20R burglary 11H10 7000-7099 ave k
## type suffix number month day location
## 82730 rd - 1 jan-apr friday road / street / sidewalk
## 82735 blvd - 1 jan-apr friday apartment
## 82736 cir - 1 jan-apr friday residence / house
## 82737 ln - 1 jan-apr friday apartment
## 82739 - 1 jan-apr friday residence / house
## 82740 - 1 jan-apr friday residence / house
## address lon lat week
## 82730 4750 telephone rd -95.29888 29.69171 1
## 82735 2550 southmore blvd -95.37340 29.71989 1
## 82736 6350 rupley cir -95.31560 29.69027 1
## 82737 5050 georgi ln -95.46658 29.82974 1
## 82739 5750 jason -95.49227 29.68369 1
## 82740 7050 ave k -95.29743 29.74156 1
# types of offense
summary(crime.subset$offense)
## aggravated assault auto theft burglary
## 0 3471 7360
## murder rape robbery
## 0 0 2884
## theft
## 22863
# create a new dataframe with auto theft data
crime.autotheft <- crime.subset[(crime.subset$offense=='auto theft'), ]
# Count of number of autothefts by week and block
count.autotheft = ddply(crime.autotheft, c('block', 'week'), summarise, count = length(block))
# average number of auto theft crimes per block per week
count.autotheft = ddply(count.autotheft, 'block', mutate, average = sum(count)/35)
# sort by highest count
count.autotheft <- arrange(count.autotheft, desc(count))
# preview data
head(count.autotheft)
## block week count average
## 1 5800-5899 5 7 1.114286
## 2 2200-2299 33 6 1.114286
## 3 2400-2499 20 6 1.485714
## 4 2500-2599 16 6 1.228571
## 5 3800-3899 24 6 1.285714
## 6 5000-5099 19 6 1.457143
# change column names for display purposes
# first save the original data set to new table to preserve earlier table
count.autotheft.fancy <- count.autotheft
# rename columns
count.autotheft.fancy <- plyr::rename(count.autotheft.fancy, c("block"="Block", "week"="Week", "count"="Count", "average"="Average/Week"))
# create first fancy table using kable
kable(count.autotheft.fancy[1:5, c('Block', 'Week', 'Count', 'Average/Week')], digits=3)
| Block | Week | Count | Average/Week |
|---|---|---|---|
| 5800-5899 | 5 | 7 | 1.114 |
| 2200-2299 | 33 | 6 | 1.114 |
| 2400-2499 | 20 | 6 | 1.486 |
| 2500-2599 | 16 | 6 | 1.229 |
| 3800-3899 | 24 | 6 | 1.286 |
# create one more nice table using formattable, which I learned recently
count.autotheft.fancy.top5 <- count.autotheft.fancy[1:5, c('Block', 'Week', 'Count', 'Average/Week')]
#create formattable table
widget_formattable <- formattable(count.autotheft.fancy.top5, digits=4)
# display formattable table
widget_formattable
| Block | Week | Count | Average/Week |
|---|---|---|---|
| 5800-5899 | 5 | 7 | 1.114 |
| 2200-2299 | 33 | 6 | 1.114 |
| 2400-2499 | 20 | 6 | 1.486 |
| 2500-2599 | 16 | 6 | 1.229 |
| 3800-3899 | 24 | 6 | 1.286 |
Question 1b. For each block-week, compute a p-value for its auto theft count. For the null hypothesis required by our p-values, we will assume that the number of auto thefts in each block-week is a Poisson random variable, with expectation (the parameter lambda) equal to its weekly average computed in Question 1a. (This is the same as in the lecture.) Label each block-week as anomalous if its p-value is lower than a Bonferroni-corrected false detection rate of 5%. How many anomalous block-weeks did you find? For the anomalous block-weeks (if there are any), did the crimes tend to occur at the same address?
#View(count.autotheft)
# moving the dataset to a new named dataset for easy reading
autotheft <- count.autotheft
# Calculate p-value
# use a for loop to compute poisson p-values for each block-week
pval = 0
for (i in 1:nrow(autotheft)){
pval[i] = poisson.test(autotheft$count[i], r=autotheft$average[i], alternative = 'greater')$p.value
}
# attach the p-values to the block-week counts as a new column
autotheft$pval <- pval
Let’s put this in a nice table and look at the first 20 values.
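The chunk that builds this table is not echoed; it is presumably something like the sketch below, which also creates the autotheft.augmented data frame (the name used by the next chunk).
# sort the block-weeks by p-value and show the 20 smallest
autotheft.augmented <- arrange(autotheft, pval)
kable(autotheft.augmented[1:20, c('block', 'week', 'average', 'count', 'pval')], digits = 5)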
| block | week | average | count | pval |
|---|---|---|---|---|
| 5800-5899 | 5 | 1.11429 | 7 | 0.00016 |
| 5400-5499 | 34 | 1.08571 | 6 | 0.00091 |
| 2200-2299 | 33 | 1.11429 | 6 | 0.00103 |
| 12000-12099 | 31 | 0.05714 | 2 | 0.00157 |
| 2500-2599 | 16 | 1.22857 | 6 | 0.00169 |
| 3800-3899 | 24 | 1.28571 | 6 | 0.00211 |
| 9500-9599 | 34 | 0.25714 | 3 | 0.00234 |
| 5000-5099 | 19 | 1.45714 | 6 | 0.00388 |
| 2400-2499 | 20 | 1.48571 | 6 | 0.00426 |
| 1700-1799 | 27 | 1.05714 | 5 | 0.00461 |
| 3600-3699 | 29 | 0.65714 | 4 | 0.00462 |
| 6900-6999 | 6 | 0.65714 | 4 | 0.00462 |
| 1100-1199 | 19 | 1.11429 | 5 | 0.00573 |
| 2200-2299 | 21 | 1.11429 | 5 | 0.00573 |
| 8200-8299 | 15 | 0.71429 | 4 | 0.00617 |
| 8200-8299 | 27 | 0.71429 | 4 | 0.00617 |
| 5600-5699 | 14 | 1.14286 | 5 | 0.00636 |
| 6000-6099 | 5 | 1.62857 | 6 | 0.00656 |
| 5700-5799 | 10 | 1.17143 | 5 | 0.00703 |
| 1800-1899 | 24 | 0.77143 | 4 | 0.00803 |
In the next section I will use the Bonferroni correction to flag block-weeks as anomalous.
# cutoff for 5% correction
cutoff <- 0.05/nrow(autotheft.augmented)
# we want to know pvals lower than cutoff
sum(autotheft.augmented$pval < cutoff)
## [1] 0
# we want to know pvals lower than cutoff and flag them as anomalous
autotheft.augmented <- mutate(autotheft.augmented, flag = ifelse(pval<cutoff,'Anomalous', 'Non-Anomalous'))
# View(autotheft.augmented)
# preview data
head(autotheft.augmented)
## block week count average pval flag
## 1 5800-5899 5 7 1.11428571 0.0001609073 Non-Anomalous
## 2 5400-5499 34 6 1.08571429 0.0009056359 Non-Anomalous
## 3 2200-2299 33 6 1.11428571 0.0010333135 Non-Anomalous
## 4 12000-12099 31 2 0.05714286 0.0015717695 Non-Anomalous
## 5 2500-2599 16 6 1.22857143 0.0016869022 Non-Anomalous
## 6 3800-3899 24 6 1.28571429 0.0021125695 Non-Anomalous
ANSWER: (How many anomalous block-weeks did you find? For the anomalous block-weeks, did the crimes tend to occur at the same address?)
We find no anomalous block-weeks, so the second part of the question (whether the crimes tended to occur at the same address) is not applicable.
Note: in the lecture we flagged p-values less than or equal to the cutoff; I tried that as well, but here I use a strict inequality (pval < cutoff) because that is what the problem asks for.
Question 1c. Find the daily counts of auto thefts occurring at each unique address. For each address in your data set, also compute the average number of auto thefts occurring per day. Construct a table showing the 5 address-dates with the highest number of auto thefts, along with the average number occurring per day at those addresses:
(This is analogous to Question 1a, except that you are grouping by address and date, instead of block and week. For the average number of auto thefts per day, you will want to divide the total number of auto thefts by 264, the number of days in the data set)
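One way to sanity-check these denominators directly (output not shown here; per the hints the full data set spans 264 days and 35 weeks):
# check the denominators suggested by the hints (output not shown)
length(unique(crime.subset$date))   # number of distinct dates (264 per the hint above)
max(crime.subset$week)              # number of weeks (35 per hint 1 in Question 1a)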
# View(crime.autotheft)
# count the number of auto thefts by date and address
autotheft.date.address <- ddply(crime.autotheft, c('date', 'address'), summarise, count = length(address))
# average number of auto theft crimes per address per day
# note: the hint says to divide by 264, the number of days in the data set;
# length(unique(crime.autotheft$date)) counts only the days with at least one
# auto theft, so it gives a slightly different (smaller) denominator
autotheft.date.address <- ddply(autotheft.date.address, 'address', mutate, average = sum(count)/length(unique(crime.autotheft$date)))
# sort by highest count
autotheft.date.address <- arrange(autotheft.date.address, desc(count))
# preview data
head(autotheft.date.address)
## date address count average
## 1 4/16/2010 2550 broadway st 3 0.020576132
## 2 3/23/2010 2650 south lp w 3 0.049382716
## 3 6/13/2010 3850 norfolk 3 0.024691358
## 4 4/5/2010 350 gellhorn 2 0.008230453
## 5 4/16/2010 550 75th st 2 0.016460905
## 6 8/9/2010 1050 pinemont dr 2 0.020576132
# View(autotheft.date.address)
# change column names for display purposes
# first save the original data set to new table to preserve earlier table
autotheft.date.address.fancy <- autotheft.date.address
# rename columns
autotheft.date.address.fancy <- plyr::rename(autotheft.date.address.fancy, c("date"="Date", "address"="Address", "count"="Count", "average"="Average/Date"))
# create first fancy table using kable
kable(autotheft.date.address.fancy[1:5, c('Address', 'Date', 'Count', 'Average/Date')], digits=3)
| Address | Date | Count | Average/Date |
|---|---|---|---|
| 2550 broadway st | 4/16/2010 | 3 | 0.021 |
| 2650 south lp w | 3/23/2010 | 3 | 0.049 |
| 3850 norfolk | 6/13/2010 | 3 | 0.025 |
| 350 gellhorn | 4/5/2010 | 2 | 0.008 |
| 550 75th st | 4/16/2010 | 2 | 0.016 |
# create one more nice table using formattable, which I learned recently
autotheft.date.address.fancy2 <- autotheft.date.address.fancy[1:5, c('Address', 'Date', 'Count', 'Average/Date')]
#create formattable table
widget_formattable2 <- formattable(autotheft.date.address.fancy2, digits=4)
# display formattable table
widget_formattable2
| Address | Date | Count | Average/Date |
|---|---|---|---|
| 2550 broadway st | 4/16/2010 | 3 | 0.02058 |
| 2650 south lp w | 3/23/2010 | 3 | 0.04938 |
| 3850 norfolk | 6/13/2010 | 3 | 0.02469 |
| 350 gellhorn | 4/5/2010 | 2 | 0.00823 |
| 550 75th st | 4/16/2010 | 2 | 0.01646 |
Question 1d. For each address-date, compute a p-value for its auto theft count, where the null hypothesis is that the number of auto thefts is a Poisson random variable, with expectation (the parameter lambda) equal to the daily average that you computed in Question 1c. (Again, this is the same as in the lecture.) Label each address-date as anomalous if its p-value is smaller than a Bonferroni-corrected false detection rate of 5%. How many address-dates were anomalous? For the anomalous address-dates, how many auto thefts occurred? What was the location for these anomalous addresses?
(Note: location is a column in the original crime.subset data frame)
# Calculate p-value
# use a for loop to compute poisson p-values for each address-date
pval = 0
for (i in 1:nrow(autotheft.date.address)){
pval[i] = poisson.test(autotheft.date.address$count[i], r=autotheft.date.address$average[i], alternative = 'greater')$p.value
}
# attach the p-values to the address-date counts as a new column
autotheft.date.address$pval <- pval
# View(autotheft.date.address)
Let’s put this in a nice table and look at the first 20 values.
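As in Question 1b, the chunk that builds this table is not echoed; presumably it is something like the sketch below, which also creates the autotheft.date.address.augmented data frame used by the next chunk.
# sort the address-dates by p-value and show the 20 smallest
autotheft.date.address.augmented <- arrange(autotheft.date.address, pval)
kable(autotheft.date.address.augmented[1:20, c('address', 'date', 'average', 'count', 'pval')], digits = 5)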
| address | date | average | count | pval |
|---|---|---|---|---|
| 2550 broadway st | 4/16/2010 | 0.02058 | 3 | 0.00000 |
| 3850 norfolk | 6/13/2010 | 0.02469 | 3 | 0.00000 |
| 2650 south lp w | 3/23/2010 | 0.04938 | 3 | 0.00002 |
| 350 gellhorn | 4/5/2010 | 0.00823 | 2 | 0.00003 |
| 2450 westcreek ln | 4/11/2010 | 0.00823 | 2 | 0.00003 |
| 4250 scotland | 5/21/2010 | 0.00823 | 2 | 0.00003 |
| 5750 glenmont dr | 4/28/2010 | 0.00823 | 2 | 0.00003 |
| 7050 telephone rd | 5/11/2010 | 0.00823 | 2 | 0.00003 |
| 7650 phoenix dr | 4/8/2010 | 0.00823 | 2 | 0.00003 |
| 2350 tidwell rd | 8/14/2010 | 0.01235 | 2 | 0.00008 |
| 4450 oxford st | 1/10/2010 | 0.01235 | 2 | 0.00008 |
| 4550 pinemont | 1/9/2010 | 0.01235 | 2 | 0.00008 |
| 4750 beechnut st | 3/28/2010 | 0.01235 | 2 | 0.00008 |
| 550 75th st | 4/16/2010 | 0.01646 | 2 | 0.00013 |
| 2750 briargrove dr | 2/12/2010 | 0.01646 | 2 | 0.00013 |
| 5150 richmond ave | 7/12/2010 | 0.01646 | 2 | 0.00013 |
| 5550 antoine dr | 3/22/2010 | 0.01646 | 2 | 0.00013 |
| 8650 kirby dr | 3/4/2010 | 0.01646 | 2 | 0.00013 |
| 8850 gulf fwy | 1/22/2010 | 0.01646 | 2 | 0.00013 |
| 1050 pinemont dr | 8/9/2010 | 0.02058 | 2 | 0.00021 |
In the next section I will use the Bonferroni correction to flag address-dates as anomalous.
# cutoff for 5% correction
cutoff <- 0.05/nrow(autotheft.date.address.augmented)
# we want to know pvals lower than cutoff
sum(autotheft.date.address.augmented$pval < cutoff)
## [1] 2
# we want to know pvals lower than cutoff and flag them as anomalous
autotheft.date.address.augmented <- mutate(autotheft.date.address.augmented, flag = ifelse(pval<cutoff,'Anomalous', 'Non-Anomalous'))
# View(autotheft.augmented)
# preview data
head(autotheft.date.address.augmented)
## date address count average pval flag
## 1 4/16/2010 2550 broadway st 3 0.020576132 1.429688e-06 Anomalous
## 2 6/13/2010 3850 norfolk 3 0.024691358 2.462896e-06 Anomalous
## 3 3/23/2010 2650 south lp w 3 0.049382716 1.934232e-05 Non-Anomalous
## 4 4/5/2010 350 gellhorn 2 0.008230453 3.368490e-05 Non-Anomalous
## 5 4/11/2010 2450 westcreek ln 2 0.008230453 3.368490e-05 Non-Anomalous
## 6 5/21/2010 4250 scotland 2 0.008230453 3.368490e-05 Non-Anomalous
# View(autotheft.date.address.augmented)
autotheft.date.address.augmented.ano <- autotheft.date.address.augmented %>%
filter(flag=='Anomalous')
nrow(autotheft.date.address.augmented.ano)
## [1] 2
sum(autotheft.date.address.augmented.ano$count)
## [1] 6
Next, I find the locations of the anomalous address-date combinations.
# merge the anomalous address-dates with the original auto theft data to pull in the location column
# note: merging by address alone brings in every auto theft at these two addresses,
# not just the ones on the anomalous dates; merging by c('address', 'date') would
# restrict the result to the anomalous days themselves
autotheft.date.address.ano.loc <- merge(autotheft.date.address.augmented.ano, crime.autotheft, by="address")
# gather all the unique locations into a vector
anomalous.autotheft.locations <- unique(autotheft.date.address.ano.loc$location)
anomalous.autotheft.locations
## [1] "apartment parking lot" NA
## [3] "strip business center parking lot" "commercial parking lot / garage"
ANSWER: (How many address-dates were anomalous? For the anomalous address-dates, how many auto thefts occurred? What was the location for these anomalous addresses?)
2 address-dates were anomalous: 2550 broadway st on 4/16/2010 and 3850 norfolk on 6/13/2010. A total of 6 auto thefts occurred on these anomalous address-dates.
The vector anomalous.autotheft.locations above lists the unique locations recorded for auto thefts at these two addresses: an apartment parking lot, a strip business center parking lot, a commercial parking lot / garage, and one record with a missing location.
Question 1e. The highest number of auto thefts occurring at any address in a single day was 3. This happened on 3 separate occurrences: 2550 broadway st on 4/16, 3850 norfolk on 6/13/2010, and 2650 south lp w on 3/23. Were all 3 occurrences included as anomalous in your previous analysis? If so, why do you think they were included? If not, why do you think some were included, but others not?
ANSWER:
No, not all three occurrences were marked as anomalous in my analysis: 2550 broadway st on 4/16/2010 and 3850 norfolk on 6/13/2010 were flagged, but 2650 south lp w on 3/23/2010 was not. Under the Bonferroni correction, an address-date is flagged only if its p-value falls below 0.05 divided by the number of address-dates tested, which is a very small cutoff. 2650 south lp w has a noticeably higher daily average (about 0.049 auto thefts per day) than the other two addresses (about 0.021 and 0.025), so observing 3 thefts there in a single day is less surprising under its Poisson null. Its p-value (about 1.9e-05) is therefore much larger than the other two (about 1.4e-06 and 2.5e-06) and lands above the Bonferroni cutoff, while the other two land below it.
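A quick way to see this from the objects above (output not shown): pull out the three address-dates with a count of 3 and compare their p-values with the Bonferroni cutoff.
# the three address-dates with 3 auto thefts, with their p-values and flags
subset(autotheft.date.address.augmented, count == 3,
       select = c(address, date, average, count, pval, flag))
# the Bonferroni cutoff they are compared against
cutoff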
The movie.data data set has the following columns for 4337 movies
Release.Date: when the movie was released
Movie: the title of the movie
Production.Budget: the amount of money budgeted for the production of the movie
Domestic.Gross: the total revenue earned within the USA by the movie
We are going to construct a quantile regression to model the relationship between Production.Budget and Domestic.Gross. For this task, we have already written all of the code for you. However, our method is going to run into problems – your task will be to recognize these problems as they arise.
Question 2a. Below are 3 plots, showing the Domestic Gross Revenue as a function of the Production Budget for each movie, under (1) no transform, (2) a log transform, and (3) a power transform taking 4th roots. For the quantile regression, we decided to try the power transform first – why might we have done this? What was not-so-good about the plots with no transform and with the log transform?
# preview data
head(movie.data)
## Release.Date Movie Production.Budget
## 1 12/18/2009 Avatar 4.25e+08
## 5 5/24/2007 Pirates of the Caribbean: At World's End 3.00e+08
## 11 7/20/2012 The Dark Knight Rises 2.75e+08
## 13 7/2/2013 The Lone Ranger 2.75e+08
## 16 3/9/2012 John Carter 2.75e+08
## 20 11/24/2010 Tangled 2.60e+08
## Domestic.Gross
## 1 760507625
## 5 309420425
## 11 448139099
## 13 89289910
## 16 73058679
## 20 200821936
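The chunk that draws the three plots is not echoed; a rough sketch of what it might look like is below (the movie.trans data frame and its budget.4rt / gross.4rt columns are hypothetical names introduced here and reused in later sketches).
# sketch of the three views compared in Question 2a
movie.trans <- mutate(movie.data,
                      budget.4rt = Production.Budget^(1/4),
                      gross.4rt  = Domestic.Gross^(1/4))
ggplot(movie.data, aes(x = Production.Budget, y = Domestic.Gross)) +
  geom_point(alpha = 0.2) + labs(title = 'No transform')
ggplot(movie.data, aes(x = log10(Production.Budget), y = log10(Domestic.Gross))) +
  geom_point(alpha = 0.2) + labs(title = 'Log transform')
ggplot(movie.trans, aes(x = budget.4rt, y = gross.4rt)) +
  geom_point(alpha = 0.2) + labs(title = 'Power transform (4th root)')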
ANSWER: (why was the power transform better than the others?)
No Transform
The plot of untransformed Domestic Gross against Production Budget is extremely noisy and hard to interpret. Almost all of the points are clustered near (0, 0), and as the values increase the points scatter widely, showing no well-defined relationship between budget and revenue. A few outliers with very large budgets and revenues compress the rest of the plot. The only pattern we can read off is a weak positive association between budget and revenue; it would be very hard to fit a line or a curve to this plot.
Log Transform
The plot of log-transformed revenue against log-transformed budget is a bit more scattered at the low end than the untransformed plot. It depicts only a weak linear relationship between revenue and budget: there is a visible positive association, but the pattern is unclear. The plot is noisy at the start, and as log budget increases the points converge into one huge blob. Overall it makes the positive association between budget and revenue easier to see than the untransformed plot, but the outlier at the highest budget and revenue is still present.
Power Transform with 4th root
Out of the three plots, the power transform with the 4th root does the best job of revealing a trend. The positive relationship between the 4th root of budget and the 4th root of revenue is clearly visible and is the strongest of the three plots. A few outliers remain, including the one at the largest budget and revenue that appears in all the plots, but they no longer dominate the display.
Most of the flaws of the no-transform and log-transform plots are corrected by the power transform, so it is the best of the available options for further analysis.
Question 2b. Following lecture 10, we try to construct a quantile regression plot for movie revenues as a function of their production budget. We first fit a spline and compute the predictions and residuals. Then we create a Spread-Location plot to see if the residuals look identically distributed. Unfortunately, the plot is not good – it suggests that this cannot be the case. Your task: explain why the S-L plot suggests this.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ANSWER: (explain why the S-L plot implies that our residuals are NOT identically distributed)
The S-L plot estimates the spread of the residuals as a function of the predicted value. It suggests that the absolute residuals first increase with the predicted value and then start to decrease once the predictions reach roughly 50 to 75: the trend line rises and then slowly drops. In other words, the spread (and hence the variance) of the residuals changes with the predicted value. Since the trend line is not flat, the residuals cannot be identically distributed.
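For reference, the instructor-provided chunk that makes the S-L plot is not echoed; a rough sketch of such a plot, reusing the hypothetical movie.trans columns from the earlier sketch (the spline degrees of freedom are a guess), could look like this.
# sketch of a Spread-Location plot: |residual| against fitted values from a spline fit
movie.fit <- lm(gross.4rt ~ ns(budget.4rt, df = 5), data = movie.trans)
movie.sl  <- data.frame(fitted = fitted(movie.fit), abs.resid = abs(residuals(movie.fit)))
ggplot(movie.sl, aes(x = fitted, y = abs.resid)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  labs(x = 'Predicted 4th root of Domestic Gross', y = '|Residual|',
       title = 'Spread-Location plot (sketch)')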
Question 2c. Even though the S-L plot looked bad, suppose that we proceed anyway to construct the quantile regression. However, when we show the Quantile Regression for the power transformed variables, it is visually obvious that it must be wrong, even if we didn’t look at the S-L plot beforehand. How can we tell?
ANSWER: (explain why we can visually tell that this quantile regression plot must be wrong)
All of the percentile curves go up as budget increases, and a number of data points – including the extreme outliers that compress the rest of the plot – fall below the 5th percentile curve and above the 95th percentile curve. More importantly, the vertical spacing between the quantile curves is the same everywhere: the gap between the 95th and 85th percentile curves, for instance, is just as constant across budgets as the gap between the 50th and 15th, so every curve is a straight, parallel copy of the others. In real data this is not plausible: one would expect the curves to change shape – for example to show a peak somewhere between fourth-root budget values of 50 and 100 – rather than being flat and straight. Because of these flaws, the quantile regression plot above cannot be trusted.
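This parallel-band behaviour is exactly what we get if the quantile curves are built by adding pooled residual quantiles to the fitted curve, which the S-L plot already told us is not justified. A sketch of that construction, reusing the hypothetical movie.fit and movie.trans objects from above, shows why the curves can only be vertical shifts of one another.
# sketch: quantile curves from pooled residual quantiles are parallel copies of the fit
res.q <- quantile(residuals(movie.fit), probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
pred.grid <- data.frame(budget.4rt = seq(min(movie.trans$budget.4rt),
                                         max(movie.trans$budget.4rt),
                                         length.out = 200))
pred.grid$fit <- predict(movie.fit, newdata = pred.grid)
# cross every grid point with every residual quantile (merge with no common columns)
quant.curves <- merge(pred.grid, data.frame(quantile = names(res.q), offset = res.q))
quant.curves$gross.4rt <- quant.curves$fit + quant.curves$offset
ggplot(movie.trans, aes(x = budget.4rt, y = gross.4rt)) +
  geom_point(alpha = 0.1) +
  geom_line(data = quant.curves, aes(color = quantile)) +
  labs(x = '4th root of Production Budget', y = '4th root of Domestic Gross',
       title = 'Quantile curves from pooled residuals are parallel (sketch)')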
In the crime.subset data frame, there are the following columns:
offense: the type of crime – theft, auto theft, burglary, and robbery
hour: the hour that the crime occurred – 0 (midnight) to 23 (11 PM)
month: the month that the crime occurred. We have grouped the months into two categories: jan-apr and may-aug
Question 3a. Make a scatterplot (or line plot) showing the percentage of crimes committed each hour.
We will first create a table to show the fraction of crimes committed each hour.
# create summary data frame
fraction.crime <- ddply(crime.subset, "hour", total.records = nrow(crime.subset),
summarise, crime.per.hour = length(hour),
fraction.by.crime.type = crime.per.hour/total.records)
# View(fraction.crime)
# plot hour vs fraction of crime and save as variable
crime.hour.plot <- ggplot(data=fraction.crime, mapping=aes(y=fraction.by.crime.type, x=hour)) +
geom_point() +
geom_line() +
labs(x='Hour of the day', y ='Fraction of Crimes',
title='Fraction of Crimes Committed at each hour of the day')
# draw the plot
crime.hour.plot
Question 3b. Repeat the plot in Question 3a, but separately for each type of offense. In each of your plots, include a reference curve using the pooled data, in order to facilitate comparisons. (Hint: you computed this reference curve in Question 3a.) Do the different types of crimes have different distributions for their time of occurrence? How do they differ?
# Create first half of dataset for plot
# count by hour, and crime type
crime.offense.hour.summary <- crime.subset %>%
group_by(offense, hour) %>%
summarise(offense.hour.count = n())
# View(crime.offense.hour.summary)
# create second half of data for plot
# count by crime type
crime.summary <- crime.subset %>%
group_by(offense) %>%
summarise(offense.count = n())
# View(crime.summary)
# merge the above two data sets based on offense column
merge.crime <- merge(crime.summary, crime.offense.hour.summary, by="offense")
# View(merge.crime)
# compute the fraction of crime
data.fraction.merge.crime <- merge.crime %>%
mutate(fraction.crime=offense.hour.count/offense.count)
# View(data.fraction.merge.crime)
# Pick the dataframe created in 3a
# Assign those fractions based on hour to merged dataframe created above
crime.ref.interim <- merge(data.fraction.merge.crime, fraction.crime, by = "hour")
# View(crime.ref.interim)
# take subset to form required dataframe for reference line
crime.ref <- subset(crime.ref.interim, select = c('hour', 'fraction.by.crime.type'))
# View(crime.ref)
# plot with reference line
plot.merge.crime <- ggplot(data = data.fraction.merge.crime, aes(x = hour, y = fraction.crime)) +
geom_point() +
geom_line() +
facet_wrap(~offense) +
geom_line(data = crime.ref, color='slategrey', aes( y=fraction.by.crime.type, x=hour)) +
#geom_point(data = crime.ref3, color='grey', size=1) + geom_point() +
labs(title="Fraction of crimes committed each hour of the day\n grouped by offense types",
x ="Hour of the Day", y = "Fraction of crimes per hour")
# draw the plot
plot.merge.crime
ANSWER: (How does the distribution of the time of occurrence differ for each type of crime?)
Auto thefts
The lowest fraction of auto thefts occurs at around the 4th hour of the day (4AM), and this value is close to the reference curve. The highest fraction occurs at around 10PM, well above the reference. The curve decreases from midnight to 4AM, then increases and flattens out; after 12PM the fraction rises steadily with the hour.
Burglary
The lowest fraction of burglaries occurs at around 4AM, and even this value sits above the reference curve. The highest fraction occurs at around 8AM, far above the reference curve at that hour. Burglaries increase sharply from 3AM to 8AM; after 8AM the curve moves up and down, staying flat from 12PM to 3PM before reaching a second peak at 4PM.
Robbery
The lowest fraction of robberies occurs at around 7AM and the highest at around 11PM; neither lines up with the low or high point of the reference curve. Robberies decrease with the hour of the day until 7AM and increase thereafter, with a second, smaller peak at 10AM.
Theft
The theft curve is the closest to the reference curve. The lowest fraction of thefts occurs at around 5AM, slightly below the reference low. The two highest fractions occur at around 6PM and 12AM (midnight), close to the reference. Thefts decrease with the hour of the day until 5AM and increase thereafter.
Question 3c. Repeat the plot in Question 3b, but separately for each type of offense and month (i.e., use facet_grid(offense ~ month)). As before, include a reference curve using the pooled data, in order to facilitate comparisons. (You don’t have to analyze the plot yet – wait for part 3f)
# Create first half of dataset for plot
# count by offense, hour, and month
crime.offense.hour.month.summary <- crime.subset %>%
group_by(offense, hour, month) %>%
summarise(offense.hour.month.count = n())
# View(crime.offense.hour.month.summary)
# create second half of data for plot
# count by crime type, and month
crime.month.offense.summary <- crime.subset %>%
group_by(offense, month) %>%
summarise(offense.month.count = n())
# View(crime.month.offense.summary)
# merge the above two datasets by offense and month
merge.crime2 <- merge(crime.month.offense.summary, crime.offense.hour.month.summary, by=c("offense", "month"))
# View(merge.crime2)
# compute the fraction of crime
data.fraction.merge.crime2 <- merge.crime2 %>%
mutate(fraction.crime=offense.hour.month.count/offense.month.count)
# View(data.fraction.merge.crime2)
# create the data frame for the reference line:
# reuse the pooled hourly fractions computed in 3a and align them with the
# hours in the merged data frame created above
crime.ref.interim2 <- merge(data.fraction.merge.crime2, fraction.crime, by = "hour")
# View(crime.ref.interim2)
# take subset to form required dataframe for reference line
crime.ref2 <- subset(crime.ref.interim2, select = c('hour', 'fraction.by.crime.type'))
# View(crime.ref2)
# plot with reference line
plot.merge.crime2 <- ggplot(data = data.fraction.merge.crime2, aes(x = hour, y = fraction.crime)) +
geom_point() +
geom_line() +
facet_grid(month~offense) +
geom_line(data = crime.ref2, color='slategrey', aes( y=fraction.by.crime.type, x=hour)) +
#geom_point(data = crime.ref3, color='grey', size=1) + geom_point() +
labs(title="Fraction of crimes committed each hour of the day\n grouped by offense types and month",
x ="Hour of the Day", y = "Fraction of crimes per hour")
# draw the plot
plot.merge.crime2
Question 3d. As an alternative, create a QQ plot comparing the distribution of hour for auto theft crimes occurring in jan-apr vs may-aug. Include the reference line y=x, as this is standard practice for QQ plots. Repeat this for the other 3 types of offense. You may use base graphics if you wish.
# filter to get auto theft data only
crime.autotheft <- crime.subset[(crime.subset$offense=='auto theft'), ]
head(crime.autotheft)
## time date hour premise offense beat block
## 82742 2010-01-01 00:00:00 1/1/2010 0 18A auto theft 7C20 4600-4699
## 82744 2010-01-01 00:00:00 1/1/2010 0 18N auto theft 1A10 300-399
## 82808 2010-01-01 03:00:00 1/1/2010 3 20D auto theft 3B50 7200-7299
## 82833 2010-01-01 06:00:00 1/1/2010 6 20D auto theft 18F20 400-499
## 82990 2010-01-01 21:00:00 1/1/2010 21 18T auto theft 5F10 7600-7699
## 83006 2010-01-01 22:00:00 1/1/2010 22 18A auto theft 14D20 7800-7899
## street type suffix number month day
## 82742 falls st - 1 jan-apr friday
## 82744 hamilton st - 1 jan-apr friday
## 82808 roswell - 1 jan-apr friday
## 82833 pinewold dr - 1 jan-apr friday
## 82990 katy fwy - 1 jan-apr friday
## 83006 cullen blvd - 1 jan-apr friday
## location address lon
## 82742 apartment parking lot 4650 falls st -95.33067
## 82744 bar / night club parking lot 350 hamilton st -95.35346
## 82808 driveway 7250 roswell -95.36952
## 82833 driveway 450 pinewold dr -95.46346
## 82990 strip business center parking lot 7650 katy fwy -95.47292
## 83006 apartment parking lot 7850 cullen blvd -95.35589
## lat week
## 82742 29.80306 1
## 82744 29.75704 1
## 82808 29.82671 1
## 82833 29.76426 1
## 82990 29.78457 1
## 83006 29.67407 1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.autotheft,
qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Auto theft'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)
# filter to get burglary data only
crime.burglary <- crime.subset[(crime.subset$offense=='burglary'), ]
head(crime.burglary)
## time date hour premise offense beat block street
## 82735 2010-01-01 1/1/2010 0 20A burglary 10H60 2500-2599 southmore
## 82736 2010-01-01 1/1/2010 0 20R burglary 13D10 6300-6399 rupley
## 82737 2010-01-01 1/1/2010 0 20A burglary 3B10 5000-5099 georgi
## 82739 2010-01-01 1/1/2010 0 20R burglary 15E10 5700-5799 jason
## 82740 2010-01-01 1/1/2010 0 20R burglary 11H10 7000-7099 ave k
## 82741 2010-01-01 1/1/2010 0 070 burglary 7C20 4400-4499 liberty
## type suffix number month day location
## 82735 blvd - 1 jan-apr friday apartment
## 82736 cir - 1 jan-apr friday residence / house
## 82737 ln - 1 jan-apr friday apartment
## 82739 - 1 jan-apr friday residence / house
## 82740 - 1 jan-apr friday residence / house
## 82741 rd - 1 jan-apr friday convenience store
## address lon lat week
## 82735 2550 southmore blvd -95.37340 29.71989 1
## 82736 6350 rupley cir -95.31560 29.69027 1
## 82737 5050 georgi ln -95.46658 29.82974 1
## 82739 5750 jason -95.49227 29.68369 1
## 82740 7050 ave k -95.29743 29.74156 1
## 82741 4450 liberty rd -95.32520 29.78724 1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.burglary,
qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Burglary'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)
# filter to get robbery data only
crime.robbery <- crime.subset[(crime.subset$offense=='robbery'), ]
head(crime.robbery)
## time date hour premise offense beat block
## 82730 2010-01-01 00:00:00 1/1/2010 0 13R robbery 13D10 4700-4799
## 82757 2010-01-01 01:00:00 1/1/2010 1 13R robbery 2A50 5200-5299
## 82758 2010-01-01 01:00:00 1/1/2010 1 18G robbery 11H20 1500-1599
## 82775 2010-01-01 02:00:00 1/1/2010 2 13R robbery 2A50 5200-5299
## 82776 2010-01-01 02:00:00 1/1/2010 2 20R robbery 9C20 900-999
## 82813 2010-01-01 04:00:00 1/1/2010 4 18B robbery 10H60 4000-4099
## street type suffix number month day
## 82730 telephone rd - 1 jan-apr friday
## 82757 center st - 1 jan-apr friday
## 82758 evergreen dr - 1 jan-apr friday
## 82775 washington ave - 1 jan-apr friday
## 82776 hoffman - 1 jan-apr friday
## 82813 canfield st - 1 jan-apr friday
## location address lon
## 82730 road / street / sidewalk 4750 telephone rd -95.29888
## 82757 road / street / sidewalk 5250 center st -95.41530
## 82758 grocery / supermarket parking lot 1550 evergreen dr -95.29021
## 82775 road / street / sidewalk 5250 washington ave -95.41415
## 82776 residence / house 950 hoffman -95.31054
## 82813 bank / saving institution parking lot 4050 canfield st -95.35488
## lat week
## 82730 29.69171 1
## 82757 29.77119 1
## 82758 29.71219 1
## 82775 29.77065 1
## 82776 29.77116 1
## 82813 29.72037 1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.robbery,
qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Robbery'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)
# filter to get theft data only
crime.theft <- crime.subset[(crime.subset$offense=='theft'), ]
head(crime.theft)
## time date hour premise offense beat block street
## 82748 2010-01-01 1/1/2010 0 03B theft 1A10 1200-1299 caroline
## 82749 2010-01-01 1/1/2010 0 03B theft 2A40 1900-1999 washington
## 82750 2010-01-01 1/1/2010 0 03B theft 3B40 5100-5199 yale
## 82752 2010-01-01 1/1/2010 0 20V theft 2A30 2600-2699 lawrence
## 82753 2010-01-01 1/1/2010 0 20R theft 3B40 900-999 36th
## 82754 2010-01-01 1/1/2010 0 250 theft 16E40 4600-4699 kingsridge
## type suffix number month day
## 82748 st - 1 jan-apr friday
## 82749 ave - 1 jan-apr friday
## 82750 - 1 jan-apr friday
## 82752 - 1 jan-apr friday
## 82753 st E 1 jan-apr friday
## 82754 rd - 1 jan-apr friday
## location
## 82748 bar / night club
## 82749 bar / night club
## 82750 bar / night club
## 82752 vacant single occupancy residence (houses, townhouses, duplexes, etc)
## 82753 residence / house
## 82754 other / unknown
## address lon lat week
## 82748 1250 caroline st -95.36382 29.75352 1
## 82749 1950 washington ave -95.37823 29.76791 1
## 82750 5150 yale -95.40283 29.84116 1
## 82752 2650 lawrence -95.40813 29.81045 1
## 82753 950 36th st -95.38664 29.81949 1
## 82754 4650 kingsridge rd -95.38306 29.76288 1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.theft,
qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Theft'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)
Question 3e. Between the plots you made in Question 3c and Question 3d, which one is better? Or should we keep both? Why?
ANSWER:
In my opinion, the plots created in the last two questions serve different purposes, so both are useful in their own way. Neither is incorrect; they simply tell different stories to the viewer.
The faceted plots in 3c show the fraction of crimes committed at each hour of the day, split by crime type and month group. They are a one-stop shop for questions such as: at what hour of the day during January through April do we see the highest fraction of robberies? How does this change from May through August? Which offense type has the highest or lowest fraction at midnight in January through April?
The quantile-quantile plots in 3d compare, for each type of crime, the distribution of the hour of occurrence in January through April against the distribution in May through August, with the reference line y = x marking identical distributions. They answer questions such as: for which crime types does the jan-apr distribution of hours sit above the reference line? Is the departure from the reference line larger for robbery or for burglary?
Since the two kinds of plots serve different purposes and both support useful decisions, I would keep both sets.
Question 3g. How does the distribution of the time of occurrence vary by month? Answer separately for each type of crime.
ANSWER:
Auto theft
Auto theft's peak fraction, at around 10PM in both month groups, is higher in May to August than in January to April. The two auto theft panels otherwise look very similar across the two month ranges, apart from a few spots. The lowest fraction is at 5AM in January to April and shifts to 4AM in May to August. A second peak occurs at midnight in both groups, with a fraction of about 0.05 in January to April and about 0.075 in May to August.
Burglary
Burglary's highest fraction is at 7AM in January to April and shifts to 8AM in May to August, with a value of about 0.075 in both month groups. The lowest fraction is at 3AM in January to April and shifts to 4AM in May to August, at about 0.0125 in both groups. The second peak is at 8AM in January to April (about 0.072) and shifts to 6PM in May to August (about 0.06).
Robbery
Robbery's highest fraction is at 8PM in January to April (about 0.0875) and shifts to 10PM in May to August (about 0.09). The lowest fraction occurs at the same hour, 7AM, and at about the same value, 0.0125, in both month groups. The second peak is at 9PM in January to April (about 0.078) and at 10PM in May to August (about 0.075).
Theft
The two theft panels look very similar across the two month ranges, and both match the reference curve to a great extent. Theft's highest fraction occurs at the same hour and value in both month groups: about 0.063 at 6PM. The lowest fraction stays at 5AM in both groups, at about 0.01. Both month groups show high theft activity at (or around) 12AM, 12PM, 6PM, and 10PM.
Question 4. Using either crime.subset, movie.data, or movie.genre.data (which is in the file movie_genre_data.rda), create a plot showing something interesting about the data. Then discuss what the plot shows, and what the viewer should look for when examining the plot. Be sure to label all axes, add a title, adjust for overplotting, etc., so that it is clear what the plot is trying to show.
Note 1: The plot should be one of the types that we discussed in class.
Note 2: Facets of course are allowed
Answer
For this part of the project I chose movie.genre.data after exploring all the available data sets. The movie.genre.data data set contains the movie name, genre, release year, budget, and revenue generated by each movie. These are extremely interesting data, especially for someone like me who enjoys movies and the facts associated with them.
In this section, I have worked on the following steps:
I wanted to study the movie genre data for the past 20 years, so I filtered the data set to records from 2000 to 2020.
I calculated the count of movies for every genre and release year combination.
I divided the years into 4 groups.
I created a plot for each group of years using facet_wrap.
load('movie_genre_data.rda')
#View(movie.genre.data)
movie.genre.data.small <- movie.genre.data[(movie.genre.data$Release.Year>=2000), ]
# View(movie.genre.data.small)
# count the number of movies for each genre and release year combination
movie.genre.summary <- movie.genre.data.small %>%
group_by(Genre, Release.Year) %>%
summarise(movie.count = n())
# View(movie.genre.summary)
movie.genre.summary$year.group <- cut(as.numeric(movie.genre.summary$Release.Year),
c(2000, 2005, 2010, 2015, 2020),
c("(2000,2005)", "(2005,2010)", "(2010,2015)", "(2015,2020)"),
include.lowest=TRUE)
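# note (added): with these breaks and include.lowest = TRUE the bins are
# [2000,2005], (2005,2010], (2010,2015], (2015,2020], so a boundary year such
# as 2010 lands in the "(2005,2010)" facet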
ggplot(data = movie.genre.summary, aes(x = movie.count, y = Genre)) +
geom_point(color='blue') +
facet_wrap(~year.group, nrow=1) +
labs(title='Movie count vs Genre grouped by years',
x = 'Movie count', y = 'Movie Genre')
The idea of these graphs is to study how the count of movies in each genre has trended over the last 20 years. The plots show the count of movies for each genre across four 5-year groups. I have always been intrigued by the spread of genres in the world of cinema, and these plots contain some very interesting observations, highlighted below.
Talk show related movies have always had the lowest count over the last 20 years.
Reality-TV, Game show, and News related movies have also seen low counts across all the years.
Drama movies have had the highest count throughout the last 20 years. Who doesn't like some drama?
Romance seems to have lost its charm around 2010: its count dropped to about 30 in 2010-2015 after peaking at about 70 in 2005-2010.
The genre with the second-highest count across all years is Comedy. A dose of good laughter helps everyone.
Action movies saw their count increase over the years, reaching a high of about 65 in 2010-2015.
Thriller and Sci-Fi movies also show a trend toward increasing counts over the years.
Many more interpretations can be drawn from this graph; I have highlighted the ones most likely to catch a viewer's eye.