Preliminaries

We’ll use the data file project_data.rda, which should be in the same directory as this markdown file (which should also be your working directory). It contains 2 data frames:

  1. crime.subset: a data set of crimes committed in Houston in 2010. This is taken from lecture 6.
  2. movie.data: Contains the release year, title, genre, budget, and gross earnings for 4998 movies. This data set was constructed by combining financial information listed at http://www.the-numbers.com/movie/budgets/all with IMDB genre information from ftp://ftp.fu-berlin.de/pub/misc/movies/database/genres.list.gz.

Additionally, there is one more data set in the file movie_genre_data.rda, which you may optionally use for part 4:

  1. movie.genre.data: Similar to movie.data, but contains genre and release year information for the movies as well. Note that if a movie belongs to multiple genres, then it is listed multiple times in the data set.
load('project_data.rda')
library(ggplot2)
library(plyr)
library(reshape2)
library(splines)
library(boot)
library(broom)
## Warning: package 'broom' was built under R version 3.6.3
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(formattable) 
## Warning: package 'formattable' was built under R version 3.6.3
#install.packages("formattable")
#install.packages("broom")

Part 1: Finding Outliers in Crime

In Lecture 6, we counted the number of crimes occurring each week in each block, and we looked for counts which were abnormally high. To do this, we computed p-values under the hypothesis that the number of crimes was Poisson distributed for each block and each week, where the Poisson parameter lambda varied by block (and equaled the average rate for that block).

Here we will repeat this exercise, but restrict to certain types of crimes. After that, we will look for specific addresses (instead of entire city blocks) and days (instead of weeks) which had unusual crime counts.
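For reference, the p-value computation under this Poisson null can be done with either ppois() or poisson.test(). A minimal sketch (the count and lambda below are made up purely for illustration):

# illustrative only: a block-week with 6 crimes and a weekly average (lambda) of 1.2
observed <- 6
lambda   <- 1.2

# P(X >= observed) when X ~ Poisson(lambda)
ppois(observed - 1, lambda, lower.tail = FALSE)

# the same upper-tail probability via poisson.test()
poisson.test(observed, r = lambda, alternative = 'greater')$p.value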

Question 1a. Count the number of auto theft crimes that occur in each block, each week. For each block, compute the average number of auto theft crimes per week. Construct a table showing the 5 block-weeks with the highest number of auto thefts, along with the average number occurring per week at each block.

Hint 1: to get the average number of crimes per week, divide the total number of crimes by the number of weeks in the data set, which is 35. (The way we did this in the notes might not give the correct answer – you might want to think about why.)

Hint 2: your table should have 4 columns: the block, the week, the number of auto thefts that block-week, and the average number of auto thefts per week for that block.

# preview data
head(crime.subset)
##             time     date hour premise  offense  beat     block    street
## 82730 2010-01-01 1/1/2010    0     13R  robbery 13D10 4700-4799 telephone
## 82735 2010-01-01 1/1/2010    0     20A burglary 10H60 2500-2599 southmore
## 82736 2010-01-01 1/1/2010    0     20R burglary 13D10 6300-6399    rupley
## 82737 2010-01-01 1/1/2010    0     20A burglary  3B10 5000-5099    georgi
## 82739 2010-01-01 1/1/2010    0     20R burglary 15E10 5700-5799     jason
## 82740 2010-01-01 1/1/2010    0     20R burglary 11H10 7000-7099     ave k
##       type suffix number   month    day                 location
## 82730   rd      -      1 jan-apr friday road / street / sidewalk
## 82735 blvd      -      1 jan-apr friday                apartment
## 82736  cir      -      1 jan-apr friday        residence / house
## 82737   ln      -      1 jan-apr friday                apartment
## 82739           -      1 jan-apr friday        residence / house
## 82740           -      1 jan-apr friday        residence / house
##                   address       lon      lat week
## 82730   4750 telephone rd -95.29888 29.69171    1
## 82735 2550 southmore blvd -95.37340 29.71989    1
## 82736     6350 rupley cir -95.31560 29.69027    1
## 82737      5050 georgi ln -95.46658 29.82974    1
## 82739          5750 jason -95.49227 29.68369    1
## 82740          7050 ave k -95.29743 29.74156    1
# types of offense
summary(crime.subset$offense)
## aggravated assault         auto theft           burglary 
##                  0               3471               7360 
##             murder               rape            robbery 
##                  0                  0               2884 
##              theft 
##              22863
# create a new dataframe with auto theft data
crime.autotheft <- crime.subset[(crime.subset$offense=='auto theft'), ]
# Count of number of autothefts by week and block
count.autotheft = ddply(crime.autotheft, c('block', 'week'), summarise, count = length(block))

# average number of auto theft crimes per block per week
count.autotheft = ddply(count.autotheft, 'block', mutate, average = sum(count)/35)

# sort by highest count
count.autotheft <- arrange(count.autotheft, desc(count))

# preview data
head(count.autotheft)
##       block week count  average
## 1 5800-5899    5     7 1.114286
## 2 2200-2299   33     6 1.114286
## 3 2400-2499   20     6 1.485714
## 4 2500-2599   16     6 1.228571
## 5 3800-3899   24     6 1.285714
## 6 5000-5099   19     6 1.457143
# change column names for display purposes
# first save the original data set to new table to preserve earlier table
count.autotheft.fancy <- count.autotheft

# rename columns
count.autotheft.fancy <- plyr::rename(count.autotheft.fancy, c("block"="Block", "week"="Week", "count"="Count", "average"="Average/Week"))

# create first fancy table using kable
kable(count.autotheft.fancy[1:5, c('Block', 'Week', 'Count', 'Average/Week')], digits=3)
Block Week Count Average/Week
5800-5899 5 7 1.114
2200-2299 33 6 1.114
2400-2499 20 6 1.486
2500-2599 16 6 1.229
3800-3899 24 6 1.286
# create one more nice table using formattable, I learnt this recently
count.autotheft.fancy.top5 <- count.autotheft.fancy[1:5, c('Block', 'Week', 'Count', 'Average/Week')]

#create formattable table 
widget_formattable <- formattable(count.autotheft.fancy.top5, digits=4) 

# display formattable table
widget_formattable 
Block Week Count Average/Week
5800-5899 5 7 1.114
2200-2299 33 6 1.114
2400-2499 20 6 1.486
2500-2599 16 6 1.229
3800-3899 24 6 1.286

Question 1b. For each block-week, compute a p-value for its auto theft count. For the null hypothesis required by our p-values, we will assume that the number of auto thefts in each block-week is a Poisson random variable, with expectation (the parameter lambda) equal to its weekly average computed in Question 1a. (This is the same as in the lecture.) Label each block-week as anomalous if its p-value is lower than a Bonferroni-corrected false detection rate of 5%. How many anomalous block-weeks did you find? For the anomalous block-weeks (if there are any), did the crimes tend to occur at the same address?

#View(count.autotheft)
# moving the dataset to a new named dataset for easy reading

autotheft <- count.autotheft

# Calculate p-value
# use a for loop to compute poisson p-values for each block-week
pval <- numeric(nrow(autotheft))  # preallocate the p-value vector
for (i in 1:nrow(autotheft)){
  pval[i] = poisson.test(autotheft$count[i], r=autotheft$average[i], alternative = 'greater')$p.value
}

# attach the p-values as a new column on the block-week table
autotheft$pval <- pval

Let’s put this in a nice table and look at the first 20 values.
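The chunk that builds this table is not echoed above; presumably it just sorts the block-weeks by p-value and prints the top rows with kable(). A minimal sketch of that step (the object name autotheft.augmented is the one reused by the Bonferroni code below):

# sort the block-weeks from smallest to largest p-value
autotheft.augmented <- arrange(autotheft, pval)

# show the 20 most extreme block-weeks
kable(autotheft.augmented[1:20, c('block', 'week', 'average', 'count', 'pval')], digits = 5)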

block week average count pval
5800-5899 5 1.11429 7 0.00016
5400-5499 34 1.08571 6 0.00091
2200-2299 33 1.11429 6 0.00103
12000-12099 31 0.05714 2 0.00157
2500-2599 16 1.22857 6 0.00169
3800-3899 24 1.28571 6 0.00211
9500-9599 34 0.25714 3 0.00234
5000-5099 19 1.45714 6 0.00388
2400-2499 20 1.48571 6 0.00426
1700-1799 27 1.05714 5 0.00461
3600-3699 29 0.65714 4 0.00462
6900-6999 6 0.65714 4 0.00462
1100-1199 19 1.11429 5 0.00573
2200-2299 21 1.11429 5 0.00573
8200-8299 15 0.71429 4 0.00617
8200-8299 27 0.71429 4 0.00617
5600-5699 14 1.14286 5 0.00636
6000-6099 5 1.62857 6 0.00656
5700-5799 10 1.17143 5 0.00703
1800-1899 24 0.77143 4 0.00803

In the next section I will use the Bonferroni correction to flag block-weeks as anomalous.

# cutoff for 5% correction

cutoff <- 0.05/nrow(autotheft.augmented)

# we want to know pvals lower than cutoff
sum(autotheft.augmented$pval < cutoff)
## [1] 0
# we want to know pvals lower than cutoff and flag them as anomalous
autotheft.augmented <- mutate(autotheft.augmented, flag = ifelse(pval<cutoff,'Anomalous', 'Non-Anomalous'))

# View(autotheft.augmented)

# preview data
head(autotheft.augmented)
##         block week count    average         pval          flag
## 1   5800-5899    5     7 1.11428571 0.0001609073 Non-Anomalous
## 2   5400-5499   34     6 1.08571429 0.0009056359 Non-Anomalous
## 3   2200-2299   33     6 1.11428571 0.0010333135 Non-Anomalous
## 4 12000-12099   31     2 0.05714286 0.0015717695 Non-Anomalous
## 5   2500-2599   16     6 1.22857143 0.0016869022 Non-Anomalous
## 6   3800-3899   24     6 1.28571429 0.0021125695 Non-Anomalous

ANSWER: (How many anomalous block-weeks did you find? For the anomalous block-weeks, did the crimes tend to occur at the same address?)

We find 0 block-weeks that are anomalous. The second part of the question is not applicable as we do not see any anomalous block-weeks.

Note: In lecture we flagged p-values less than or equal to the cutoff, and I tried that as well. Here I use strictly less than the cutoff, since that is what the problem asks for.

Question 1c. Find the daily counts of auto thefts occurring at each unique address. For each address in your data set, also compute the average number of auto thefts occurring per day. Construct a table showing the 5 address-dates with the highest number of auto thefts, along with the average number occurring per day at those addresses:

(This is analogous to Question 1a, except that you are grouping by address and date, instead of block and week. For the average number of auto thefts per day, you will want to divide the total number of auto thefts by 264, the number of days in the data set)
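As a side note, there is more than one way to count the "days in the data set", and they do not necessarily agree. A small sketch of the two obvious choices (the date format string is an assumption based on the date column shown above):

# (sketch) two ways to count the days in the data set
dates <- as.Date(crime.subset$date, format = '%m/%d/%Y')
as.numeric(diff(range(dates))) + 1   # calendar days spanned; the hint says this should be 264
length(unique(dates))                # days on which at least one crime was recorded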

# View(crime.autotheft)
# Count of number of autothefts by date and address
autotheft.date.address <- ddply(crime.autotheft, c('date', 'address'), summarise, count = length(address))

# average number of auto theft crimes per address per day
# note: this divides by the number of distinct dates with at least one auto theft,
# which is not necessarily the 264 days mentioned in the hint
autotheft.date.address <- ddply(autotheft.date.address, 'address', mutate, average = sum(count)/length(unique(crime.autotheft$date)))

# sort by highest count
autotheft.date.address <- arrange(autotheft.date.address, desc(count))

# preview data
head(autotheft.date.address)
##        date          address count     average
## 1 4/16/2010 2550 broadway st     3 0.020576132
## 2 3/23/2010  2650 south lp w     3 0.049382716
## 3 6/13/2010     3850 norfolk     3 0.024691358
## 4  4/5/2010     350 gellhorn     2 0.008230453
## 5 4/16/2010      550 75th st     2 0.016460905
## 6  8/9/2010 1050 pinemont dr     2 0.020576132
# View(autotheft.date.address)

# change column names for display purposes
# first save the original data set to new table to preserve earlier table
autotheft.date.address.fancy <- autotheft.date.address

# rename columns
autotheft.date.address.fancy <- plyr::rename(autotheft.date.address.fancy, c("date"="Date", "address"="Address", "count"="Count", "average"="Average/Date"))

# create first fancy table using kable
kable(autotheft.date.address.fancy[1:5, c('Address', 'Date', 'Count', 'Average/Date')], digits=3)
Address Date Count Average/Date
2550 broadway st 4/16/2010 3 0.021
2650 south lp w 3/23/2010 3 0.049
3850 norfolk 6/13/2010 3 0.025
350 gellhorn 4/5/2010 2 0.008
550 75th st 4/16/2010 2 0.016
# create one more nice table using formattable, I learnt this recently
autotheft.date.address.fancy2 <- autotheft.date.address.fancy[1:5, c('Address', 'Date', 'Count', 'Average/Date')]

#create formattable table 
widget_formattable2 <- formattable(autotheft.date.address.fancy2, digits=4) 

# display formattable table
widget_formattable2 
Address Date Count Average/Date
2550 broadway st 4/16/2010 3 0.02058
2650 south lp w 3/23/2010 3 0.04938
3850 norfolk 6/13/2010 3 0.02469
350 gellhorn 4/5/2010 2 0.00823
550 75th st 4/16/2010 2 0.01646

Question 1d. For each address-date, compute a p-value for its auto theft count, where the null hypothesis is that the number of auto thefts is a Poisson random variable, with expectation (the parameter lambda) equal to the daily average that you computed in question 1c. (Again, this is the same as in the lecture.) Label each address-date as anomalous if its p-value is smaller than a Bonferroni-corrected false detection rate of 5%. How many address-dates were anomalous? For the anomalous address-dates, how many auto thefts occurred? What was the location for these anomalous addresses?

(Note: location is a column in the original crime.subset data frame)

# Calculate p-value
# use a for loop to compute poisson p-values for each address-date
pval <- numeric(nrow(autotheft.date.address))  # preallocate the p-value vector
for (i in 1:nrow(autotheft.date.address)){
  pval[i] = poisson.test(autotheft.date.address$count[i], r=autotheft.date.address$average[i], alternative = 'greater')$p.value
}

# attach the p-values as a new column on the address-date table
autotheft.date.address$pval <- pval

# View(autotheft.date.address)

Let’s put this in a nice table and look at the first 20 values.
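As in 1b, the chunk that builds this table is not echoed; a minimal sketch of the likely step (the object name autotheft.date.address.augmented is the one reused by the Bonferroni code below):

# sort the address-dates from smallest to largest p-value
autotheft.date.address.augmented <- arrange(autotheft.date.address, pval)

# show the 20 most extreme address-dates
kable(autotheft.date.address.augmented[1:20, c('address', 'date', 'average', 'count', 'pval')], digits = 5)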

address date average count pval
2550 broadway st 4/16/2010 0.02058 3 0.00000
3850 norfolk 6/13/2010 0.02469 3 0.00000
2650 south lp w 3/23/2010 0.04938 3 0.00002
350 gellhorn 4/5/2010 0.00823 2 0.00003
2450 westcreek ln 4/11/2010 0.00823 2 0.00003
4250 scotland 5/21/2010 0.00823 2 0.00003
5750 glenmont dr 4/28/2010 0.00823 2 0.00003
7050 telephone rd 5/11/2010 0.00823 2 0.00003
7650 phoenix dr 4/8/2010 0.00823 2 0.00003
2350 tidwell rd 8/14/2010 0.01235 2 0.00008
4450 oxford st 1/10/2010 0.01235 2 0.00008
4550 pinemont 1/9/2010 0.01235 2 0.00008
4750 beechnut st 3/28/2010 0.01235 2 0.00008
550 75th st 4/16/2010 0.01646 2 0.00013
2750 briargrove dr 2/12/2010 0.01646 2 0.00013
5150 richmond ave 7/12/2010 0.01646 2 0.00013
5550 antoine dr 3/22/2010 0.01646 2 0.00013
8650 kirby dr 3/4/2010 0.01646 2 0.00013
8850 gulf fwy 1/22/2010 0.01646 2 0.00013
1050 pinemont dr 8/9/2010 0.02058 2 0.00021

In the next section I will use the Bonferroni correction to flag address-dates as anomalous.

# cutoff for 5% correction

cutoff <- 0.05/nrow(autotheft.date.address.augmented)

# we want to know pvals lower than cutoff
sum(autotheft.date.address.augmented$pval < cutoff)
## [1] 2
# we want to know pvals lower than cutoff and flag them as anomalous
autotheft.date.address.augmented <- mutate(autotheft.date.address.augmented, flag = ifelse(pval<cutoff,'Anomalous', 'Non-Anomalous'))

# View(autotheft.augmented)

# preview data
head(autotheft.date.address.augmented)
##        date           address count     average         pval          flag
## 1 4/16/2010  2550 broadway st     3 0.020576132 1.429688e-06     Anomalous
## 2 6/13/2010      3850 norfolk     3 0.024691358 2.462896e-06     Anomalous
## 3 3/23/2010   2650 south lp w     3 0.049382716 1.934232e-05 Non-Anomalous
## 4  4/5/2010      350 gellhorn     2 0.008230453 3.368490e-05 Non-Anomalous
## 5 4/11/2010 2450 westcreek ln     2 0.008230453 3.368490e-05 Non-Anomalous
## 6 5/21/2010     4250 scotland     2 0.008230453 3.368490e-05 Non-Anomalous
# View(autotheft.date.address.augmented)


autotheft.date.address.augmented.ano <- autotheft.date.address.augmented %>%
  filter(flag=='Anomalous')

nrow(autotheft.date.address.augmented.ano)
## [1] 2
sum(autotheft.date.address.augmented.ano$count)
## [1] 6

Next, let's find the locations of the address-date combinations flagged as anomalous.

# Merge original auto theft data set with above table to extract the location
autotheft.date.address.ano.loc <- merge(autotheft.date.address.augmented.ano, crime.autotheft, by="address")

# gather all the unique locations in a data frame
anomalous.autotheft.locations <- unique(autotheft.date.address.ano.loc$location)

anomalous.autotheft.locations
## [1] "apartment parking lot"             NA                                 
## [3] "strip business center parking lot" "commercial parking lot / garage"

ANSWER: (How many address-dates were anomalous? For the anomalous address-dates, how many auto thefts occurred? What was the location for these anomalous addresses?)

2 address-dates were anomalous: 2550 broadway st on 4/16/2010 and 3850 norfolk on 6/13/2010. Across these anomalous address-dates, 6 auto thefts occurred.

The vector anomalous.autotheft.locations above lists the unique location types recorded for auto thefts at these two addresses: an apartment parking lot, a strip business center parking lot, a commercial parking lot / garage, and one record with a missing location. (Note that the merge is by address only, so it picks up every auto theft at those addresses, not just the six on the anomalous dates.)

Question 1e. The highest number of auto thefts occurring at any address in a single day was 3. This happened on 3 separate occasions: 2550 broadway st on 4/16, 3850 norfolk on 6/13/2010, and 2650 south lp w on 3/23. Were all 3 occurrences included as anomalous in your previous analysis? If so, why do you think they were included? If not, why do you think some were included, but others not?

ANSWER:

No, not all three occurrences were marked as anomalous. 2550 broadway st on 4/16/2010 and 3850 norfolk on 6/13/2010 were flagged, but 2650 south lp w on 3/23/2010 was not. The Bonferroni correction divides the 5% false detection rate by the number of address-dates tested, so the per-test cutoff is very small. All three address-dates had a count of 3, but 2650 south lp w has a noticeably higher daily average (about 0.049, versus roughly 0.021 and 0.025 for the other two), so seeing 3 auto thefts there is less surprising under the Poisson null. Its p-value (about 1.9e-05) therefore ends up above the Bonferroni cutoff, while the other two addresses, with smaller averages and hence smaller p-values, fall below it.
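A quick numeric check makes this explicit (a small sketch reusing the tables built in 1d):

# the three address-dates with a count of 3, alongside the Bonferroni cutoff
cutoff <- 0.05 / nrow(autotheft.date.address.augmented)
subset(autotheft.date.address.augmented, count == 3,
       select = c(address, date, average, count, pval))
cutoff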

Part 2: Modeling Movie Revenues as a Function of Production Budget

The movie.data data set has the following columns for 4337 movies

  1. Release.Date: When was the movie released
  2. Movie: The title of the movie
  3. Production.Budget: The amount of money budgeted for the production of the movie
  4. Domestic.Gross: The total revenue earned within the USA by the movie

We are going to construct a Quantile Regression to model the relationship between Production.Budget and Domestic.Gross. For this task, we have already written all of the code for you. However, our method is going to run into problems – your task will be to recognize these problems as they arise.

Question 2a. Below are 3 plots, showing the Domestic Gross Revenue as a function of the Production Budget for each movie, under (1) no transform, (2) a log transform, and (3) a power transform taking 4th roots. For the quantile regression, we decided to try the power transform first – why might we have done this? What was not-so-good about the plots with no transform and with the log transform?

# preview data
head(movie.data)
##    Release.Date                                    Movie Production.Budget
## 1    12/18/2009                                   Avatar          4.25e+08
## 5     5/24/2007 Pirates of the Caribbean: At World's End          3.00e+08
## 11    7/20/2012                    The Dark Knight Rises          2.75e+08
## 13     7/2/2013                          The Lone Ranger          2.75e+08
## 16     3/9/2012                              John Carter          2.75e+08
## 20   11/24/2010                                  Tangled          2.60e+08
##    Domestic.Gross
## 1       760507625
## 5       309420425
## 11      448139099
## 13       89289910
## 16       73058679
## 20      200821936
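The plotting code for these three panels was supplied with the assignment and is not echoed here. As a rough sketch, the three views could be drawn with ggplot2 along these lines, using the column names previewed above (the alpha value and titles are just choices made for the sketch):

# no transform
ggplot(movie.data, aes(x = Production.Budget, y = Domestic.Gross)) +
  geom_point(alpha = 0.3) +
  labs(title = 'No transform')

# log transform (log-log axes)
ggplot(movie.data, aes(x = Production.Budget, y = Domestic.Gross)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() + scale_y_log10() +
  labs(title = 'Log transform')

# power transform: 4th roots
ggplot(movie.data, aes(x = Production.Budget^0.25, y = Domestic.Gross^0.25)) +
  geom_point(alpha = 0.3) +
  labs(title = 'Power transform (4th root)')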

ANSWER: (why was the power transform better than the others?)

Observations on the trend of each graph

No Transform

The untransformed plot of Domestic Gross Revenue against Production Budget is extremely noisy and hard to read. One might expect a roughly linear relationship, but most points are packed into a cluster near (0, 0), and as the values grow the points scatter widely with no well-defined pattern. Outliers make things worse: a single movie with a very large budget and revenue compresses the rest of the plot. About the only thing we can read off is a weak positive association between budget and revenue; it would be hard to fit a sensible line.

Log Transform

The log-log plot is, if anything, more scattered at the low-budget end than the untransformed plot, and it suggests only a weak linear relationship between revenue and budget. A positive correlation is visible, but the pattern is unclear: the plot is noisy at the low end, and as log budget increases the points converge into one large blob. The outlier at the highest budget and revenue is still present.

Power Transform with 4th root

Of the three plots, the 4th-root power transform does the best job of revealing a trend. The relationship between the 4th root of budget and the 4th root of revenue is clearly visible and positive, and the points are spread much more evenly across the plotting region. A few outliers remain, including the one at the largest budget and revenue values, but the positive association is strongest and easiest to see here.

Most of the flaws of the no-transform and log-transform plots are fixed by the power transform, so it is the best of the three options for the analysis that follows.

Question 2b. Following lecture 10, we try to construct a quantile regression plot for movie revenues as a function of their production budget. We first fit a spline and compute the predictions and residuals. Then we create a Spread-Location plot to see if the residuals look identically distributed. Unfortunately, the plot is not good – it suggests that this cannot be the case. Your task: explain why the S-L plot suggests this.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
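The chunk behind this S-L plot is not echoed either. A sketch of what it might look like, following the lecture recipe; the spline degrees of freedom and the choice of |residual| (rather than its square root) on the y-axis are assumptions, and the object names are mine:

# 4th-root transform, natural-spline fit, then spread vs. location
movie.trans <- mutate(movie.data,
                      budget4 = Production.Budget^0.25,
                      gross4  = Domestic.Gross^0.25)

spline.fit <- lm(gross4 ~ ns(budget4, df = 5), data = movie.trans)

movie.aug <- mutate(movie.trans,
                    .fitted = fitted(spline.fit),
                    .resid  = resid(spline.fit))

ggplot(movie.aug, aes(x = .fitted, y = abs(.resid))) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(x = 'Predicted 4th root of domestic gross', y = '|Residual|',
       title = 'Spread-Location plot')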

ANSWER: (explain why the S-L plot implies that our residuals are NOT identically distributed)

This plot shows the spread of the residuals as a function of the predicted value. The absolute residuals first increase with the predicted value, with the trend line rising until predicted values of roughly 50, and then the spread declines over the rest of the range (it is clearly falling by predicted values around 75). In other words, the variance of the residuals changes with the predicted value. Since the trend line is not flat, the residuals cannot be identically distributed.

Question 2c. Even though the S-L plot looked bad, suppose that we proceed anyway to construct the quantile regression. However, when we show the Quantile Regression for the power transformed variables, it is visually obvious that it must be wrong, even if we didn’t look at the S-L plot beforehand. How can we tell?

ANSWER: (explain why we can visually tell that this quantile regression plot must be wrong)

All of the percentile curves go up as budget increases, but their vertical spacing never changes: the gap between the 95th and 85th percentile curves, or between the 50th and 15th, is the same at every value of the budget, so the quantile curves form parallel, evenly spaced bands. Given the S-L plot, that constant spread cannot be right: one would expect the spread to peak somewhere between 4th-root budget values of roughly 50 and 100 and then shrink, but instead the bands stay flat and evenly separated. There are also many points sitting below the 5th percentile curve and above the 95th, including the extreme outliers. Together, these flaws make it visually obvious that this quantile regression must be wrong.
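For reference, a sketch of how the lecture-style quantile curves are built, reusing movie.aug from the sketch in 2b: each curve is just the spline fit plus a fixed quantile of the residuals, which is exactly why the bands come out parallel with the same vertical spacing everywhere. That construction is only valid when the residuals are identically distributed, which the S-L plot already ruled out.

# constant-spread quantile curves: spline fit plus residual quantiles
taus <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quant.curves <- do.call(rbind, lapply(taus, function(tau)
  data.frame(budget4  = movie.aug$budget4,
             curve    = movie.aug$.fitted + quantile(movie.aug$.resid, tau),
             quantile = paste0(100 * tau, 'th percentile'))))

ggplot(movie.aug, aes(x = budget4, y = gross4)) +
  geom_point(alpha = 0.2) +
  geom_line(data = quant.curves, aes(y = curve, color = quantile)) +
  labs(x = '4th root of production budget', y = '4th root of domestic gross',
       title = 'Quantile regression assuming a constant residual spread')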

Part 3: What time do most crimes occur?

In the crime.subset data frame, there are the following columns

  1. offense: the type of crime – theft, auto theft, burglary, and robbery
  2. hour: the hour that the crime occurred – 0 (midnight) to 23 (11 PM)
  3. month: the month that the crime occurred. We have grouped the month into two categories: jan-apr, and may-aug

Question 3a. Make a scatterplot (or line plot) showing the percentage of crimes committed each hour.

We will first create a table showing the fraction of crimes committed each hour.

# create summary data frame
fraction.crime <- ddply(crime.subset, "hour", total.records = nrow(crime.subset),
                          summarise, crime.per.hour = length(hour),
                          fraction.by.crime.type = crime.per.hour/total.records)

# View(fraction.crime)

# plot hour vs fraction of crime and save as variable
crime.hour.plot <- ggplot(data=fraction.crime, mapping=aes(y=fraction.by.crime.type, x=hour)) +
                         geom_point() +
                         geom_line() +
                         labs(x='Hour of the day', y ='Fraction of Crimes', 
                              title='Fraction of Crimes Committed at each hour of the day')
                         
# draw the plot
crime.hour.plot

Question 3b. Repeat the plot in Question 3a, but separately for each type of offense. In each of your plots, include a reference curve using the pooled data, in order to facilitate comparisons. (Hint: you computed this reference curve in Question 3a.) Do the different types of crimes have different distributions for their time of occurrence? How do they differ?

# Create first half of dataset for plot
# count by hour, and crime type
crime.offense.hour.summary <- crime.subset %>%
  group_by(offense, hour) %>%
  summarise(offense.hour.count = n())

# View(crime.offense.hour.summary)

# create second half of data for plot
# count by crime type
crime.summary <- crime.subset %>%
  group_by(offense) %>%
  summarise(offense.count = n())

# View(crime.summary)  

# merge the above two data sets based on offense column
merge.crime <- merge(crime.summary, crime.offense.hour.summary, by="offense")

# View(merge.crime)

# compute the fraction of crime
data.fraction.merge.crime <- merge.crime %>%
                        mutate(fraction.crime=offense.hour.count/offense.count)

# View(data.fraction.merge.crime)


# Pick the dataframe created in 3a
# Assign those fractions based on hour to merged dataframe created above

crime.ref.interim <- merge(data.fraction.merge.crime, fraction.crime, by = "hour")

# View(crime.ref.interim)

# take subset to form required dataframe for reference line
crime.ref <- subset(crime.ref.interim, select = c('hour', 'fraction.by.crime.type'))

# View(crime.ref)

# plot with reference line
plot.merge.crime <- ggplot(data = data.fraction.merge.crime, aes(x = hour, y = fraction.crime)) +
                    geom_point() +
                    geom_line() +
                    facet_wrap(~offense) +
                    geom_line(data = crime.ref, color='slategrey', aes( y=fraction.by.crime.type, x=hour)) +
                    #geom_point(data = crime.ref3, color='grey', size=1) + geom_point() +
                    labs(title="Fraction of crimes committed each hour of the day\n grouped by offense types",
                         x ="Hour of the Day", y = "Fraction of crimes per hour")


# draw the plot
plot.merge.crime

ANSWER: (How does the distribution of the time of occurrence differ for each type of crime?)

Auto thefts

The lowest fraction of auto thefts occurs around the 4th hour of the day (4AM), and that value is close to the reference curve. The highest fraction occurs around 10PM, noticeably above the reference. The curve falls from midnight to 4AM, rises and then flattens through the morning, and after 12PM the fraction increases steadily with the hour.

Burglary

The lowest fraction of burglaries occurs around 4AM, and that value sits above the reference curve. The highest fraction occurs around 8AM, well above the reference at that hour. Burglaries rise sharply from 3AM to 8AM; after 8AM the curve moves up and down, staying flat from 12PM to 3PM before reaching a second peak at 4PM.

Robbery

The lowest fraction of robberies occurs around 7AM, which does not line up with the low point of the reference curve. The highest fraction appears around 11PM, again away from the reference curve's peak. The fraction of robberies declines with the hour of the day until 7AM and then increases for the rest of the day, with a second peak around 10AM.

Theft

The theft curve is the one closest to the reference curve. The lowest fraction of thefts occurs around 5AM, slightly below the reference curve's low point. The two highest fractions occur around 6PM and 12AM (midnight), both close to the reference. Thefts decline with the hour of the day until 5AM and then increase for the rest of the day.

Question 3c. Repeat the plot in Question 3b, but separately for each type of offense and month (i.e., use facet_grid(offense ~ month)). As before, include a reference curve using the pooled data, in order to facilitate comparisons. (You don’t have to analyze the plot yet – wait for part 3f)

# Create first half of dataset for plot
# count by offense, hour, and month
crime.offense.hour.month.summary <- crime.subset %>%
  group_by(offense, hour, month) %>%
  summarise(offense.hour.month.count = n())

# View(crime.offense.hour.month.summary)

# create second half of data for plot
# count by crime type, and month
crime.month.offense.summary <- crime.subset %>%
  group_by(offense, month) %>%
  summarise(offense.month.count = n())
  
# View(crime.month.offense.summary)  

# merge the above two datasets by offense and month
merge.crime2 <- merge(crime.month.offense.summary, crime.offense.hour.month.summary, by=c("offense", "month"))

# View(merge.crime2)

# compute the fraction of crime
data.fraction.merge.crime2 <- merge.crime2 %>%
                        mutate(fraction.crime=offense.hour.month.count/offense.month.count)

# View(data.fraction.merge.crime2)


# create the data frame for the reference curve:
# reuse the pooled hourly fractions computed in 3a and align them by hour
# with the merged data frame created above

crime.ref.interim2 <- merge(data.fraction.merge.crime2, fraction.crime, by = "hour")

# View(crime.ref.interim2)

# take subset to form required dataframe for reference line
crime.ref2 <- subset(crime.ref.interim2, select = c('hour', 'fraction.by.crime.type'))

# View(crime.ref2)

# plot with reference line
plot.merge.crime2 <- ggplot(data = data.fraction.merge.crime2, aes(x = hour, y = fraction.crime)) +
                    geom_point() +
                    geom_line() +
                    facet_grid(month~offense) +
                    geom_line(data = crime.ref2, color='slategrey', aes( y=fraction.by.crime.type, x=hour)) +
                    #geom_point(data = crime.ref3, color='grey', size=1) + geom_point() +
                    labs(title="Fraction of crimes committed each hour of the day\n grouped by offense types and month",
                         x ="Hour of the Day", y = "Fraction of crimes per hour")


# draw the plot
plot.merge.crime2

Question 3d. As an alternative, create a QQ plot comparing the distribution of hour for auto theft crimes occurring in jan-apr vs may-aug. Include the reference line y=x, as this is standard practice for QQ plots. Repeat this for the other 3 types of offense. You may use base graphics if you wish.

Working with crime type = ‘auto theft’
# filter to get auto theft data only
crime.autotheft <- crime.subset[(crime.subset$offense=='auto theft'), ]

head(crime.autotheft)
##                      time     date hour premise    offense  beat     block
## 82742 2010-01-01 00:00:00 1/1/2010    0     18A auto theft  7C20 4600-4699
## 82744 2010-01-01 00:00:00 1/1/2010    0     18N auto theft  1A10   300-399
## 82808 2010-01-01 03:00:00 1/1/2010    3     20D auto theft  3B50 7200-7299
## 82833 2010-01-01 06:00:00 1/1/2010    6     20D auto theft 18F20   400-499
## 82990 2010-01-01 21:00:00 1/1/2010   21     18T auto theft  5F10 7600-7699
## 83006 2010-01-01 22:00:00 1/1/2010   22     18A auto theft 14D20 7800-7899
##         street type suffix number   month    day
## 82742    falls   st      -      1 jan-apr friday
## 82744 hamilton   st      -      1 jan-apr friday
## 82808  roswell           -      1 jan-apr friday
## 82833 pinewold   dr      -      1 jan-apr friday
## 82990     katy  fwy      -      1 jan-apr friday
## 83006   cullen blvd      -      1 jan-apr friday
##                                location          address       lon
## 82742             apartment parking lot    4650 falls st -95.33067
## 82744      bar / night club parking lot  350 hamilton st -95.35346
## 82808                          driveway     7250 roswell -95.36952
## 82833                          driveway  450 pinewold dr -95.46346
## 82990 strip business center parking lot    7650 katy fwy -95.47292
## 83006             apartment parking lot 7850 cullen blvd -95.35589
##            lat week
## 82742 29.80306    1
## 82744 29.75704    1
## 82808 29.82671    1
## 82833 29.76426    1
## 82990 29.78457    1
## 83006 29.67407    1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.autotheft, 
     qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Auto theft'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)

Working with crime type = ‘burglary’
# filter to get burglary data only
crime.burglary <- crime.subset[(crime.subset$offense=='burglary'), ]

head(crime.burglary)
##             time     date hour premise  offense  beat     block    street
## 82735 2010-01-01 1/1/2010    0     20A burglary 10H60 2500-2599 southmore
## 82736 2010-01-01 1/1/2010    0     20R burglary 13D10 6300-6399    rupley
## 82737 2010-01-01 1/1/2010    0     20A burglary  3B10 5000-5099    georgi
## 82739 2010-01-01 1/1/2010    0     20R burglary 15E10 5700-5799     jason
## 82740 2010-01-01 1/1/2010    0     20R burglary 11H10 7000-7099     ave k
## 82741 2010-01-01 1/1/2010    0     070 burglary  7C20 4400-4499   liberty
##       type suffix number   month    day          location
## 82735 blvd      -      1 jan-apr friday         apartment
## 82736  cir      -      1 jan-apr friday residence / house
## 82737   ln      -      1 jan-apr friday         apartment
## 82739           -      1 jan-apr friday residence / house
## 82740           -      1 jan-apr friday residence / house
## 82741   rd      -      1 jan-apr friday convenience store
##                   address       lon      lat week
## 82735 2550 southmore blvd -95.37340 29.71989    1
## 82736     6350 rupley cir -95.31560 29.69027    1
## 82737      5050 georgi ln -95.46658 29.82974    1
## 82739          5750 jason -95.49227 29.68369    1
## 82740          7050 ave k -95.29743 29.74156    1
## 82741     4450 liberty rd -95.32520 29.78724    1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.burglary, 
     qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Burglary'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)

Working with crime type = ‘robbery’
# filter to get robbery data only
crime.robbery <- crime.subset[(crime.subset$offense=='robbery'), ]

head(crime.robbery)
##                      time     date hour premise offense  beat     block
## 82730 2010-01-01 00:00:00 1/1/2010    0     13R robbery 13D10 4700-4799
## 82757 2010-01-01 01:00:00 1/1/2010    1     13R robbery  2A50 5200-5299
## 82758 2010-01-01 01:00:00 1/1/2010    1     18G robbery 11H20 1500-1599
## 82775 2010-01-01 02:00:00 1/1/2010    2     13R robbery  2A50 5200-5299
## 82776 2010-01-01 02:00:00 1/1/2010    2     20R robbery  9C20   900-999
## 82813 2010-01-01 04:00:00 1/1/2010    4     18B robbery 10H60 4000-4099
##           street type suffix number   month    day
## 82730  telephone   rd      -      1 jan-apr friday
## 82757     center   st      -      1 jan-apr friday
## 82758  evergreen   dr      -      1 jan-apr friday
## 82775 washington  ave      -      1 jan-apr friday
## 82776    hoffman           -      1 jan-apr friday
## 82813   canfield   st      -      1 jan-apr friday
##                                    location             address       lon
## 82730              road / street / sidewalk   4750 telephone rd -95.29888
## 82757              road / street / sidewalk      5250 center st -95.41530
## 82758     grocery / supermarket parking lot   1550 evergreen dr -95.29021
## 82775              road / street / sidewalk 5250 washington ave -95.41415
## 82776                     residence / house         950 hoffman -95.31054
## 82813 bank / saving institution parking lot    4050 canfield st -95.35488
##            lat week
## 82730 29.69171    1
## 82757 29.77119    1
## 82758 29.71219    1
## 82775 29.77065    1
## 82776 29.77116    1
## 82813 29.72037    1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.robbery, 
     qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Robbery'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)

Working with crime type = ‘theft’
# filter to get theft data only
crime.theft <- crime.subset[(crime.subset$offense=='theft'), ]

head(crime.theft)
##             time     date hour premise offense  beat     block     street
## 82748 2010-01-01 1/1/2010    0     03B   theft  1A10 1200-1299   caroline
## 82749 2010-01-01 1/1/2010    0     03B   theft  2A40 1900-1999 washington
## 82750 2010-01-01 1/1/2010    0     03B   theft  3B40 5100-5199       yale
## 82752 2010-01-01 1/1/2010    0     20V   theft  2A30 2600-2699   lawrence
## 82753 2010-01-01 1/1/2010    0     20R   theft  3B40   900-999       36th
## 82754 2010-01-01 1/1/2010    0     250   theft 16E40 4600-4699 kingsridge
##       type suffix number   month    day
## 82748   st      -      1 jan-apr friday
## 82749  ave      -      1 jan-apr friday
## 82750           -      1 jan-apr friday
## 82752           -      1 jan-apr friday
## 82753   st      E      1 jan-apr friday
## 82754   rd      -      1 jan-apr friday
##                                                                    location
## 82748                                                      bar / night club
## 82749                                                      bar / night club
## 82750                                                      bar / night club
## 82752 vacant single occupancy residence (houses, townhouses, duplexes, etc)
## 82753                                                     residence / house
## 82754                                                       other / unknown
##                   address       lon      lat week
## 82748    1250 caroline st -95.36382 29.75352    1
## 82749 1950 washington ave -95.37823 29.76791    1
## 82750           5150 yale -95.40283 29.84116    1
## 82752       2650 lawrence -95.40813 29.81045    1
## 82753         950 36th st -95.38664 29.81949    1
## 82754  4650 kingsridge rd -95.38306 29.76288    1
# qqplot() makes QQ plot from x and y vectors
# with() makes it more readable
with(crime.theft, 
     qqplot(x = hour[ month == 'jan-apr'], y = hour[ month == 'may-aug'], main='QQ plot for Theft'))
# abline() adds a reference line, with y intercept 0 and slope 1
abline(0,1)

Question 3e. Between the plots you made in Question 3c and Question 3d, which one is better? Or should we keep both? Why?

ANSWER:

In my opinion, the plots created in the last two problems serve different purposes, so both are useful in their own way. Neither is incorrect; they simply tell different stories, each useful to the viewer.

The plots in 3c give a visual representation of the fraction of crimes committed at each hour of the day, split by crime type and month group. They are a one-stop shop for questions such as: At what hour of the day during January through April do we see the highest fraction of robberies? How does this change from May through August? Which offense type has the highest or lowest fraction at midnight in January through April?

The quantile-quantile plots created in 3d compare the distribution of the hour of day for each type of crime between the two month groups: the quantiles of the hours for January through April are plotted against those for May through August, with the line y = x as a reference. These plots can answer questions such as: For which crime types does the January through April distribution of hours sit above the reference line? How does that shift between month groups differ between, say, robbery and burglary?

Since the two types of plots serve different purposes and both are useful for making decisions, I would keep both sets.

Question 3g. How does the distribution of the time of occurrence vary by month? Answer separately for each type of crime.

ANSWER:

Auto theft

The two auto theft curves look very similar across the two month ranges, apart from a few spots. Both peak around 10PM, but the peak fraction is higher in May-August than in January-April. The lowest fraction falls at 5AM in January-April and shifts to 4AM in May-August. A second peak appears at midnight in both groups, with a fraction of roughly 0.05 in January-April and about 0.075 in May-August.

Burglary

Burglary peaks at 7AM in January-April and at 8AM in May-August, with a fraction of about 0.075 in both month groups. The lowest fraction is at 3AM in January-April and 4AM in May-August, close to 0.0125 in both groups. The second peak is at 8AM in January-April (about 0.072) and moves to 6PM in May-August (about 0.06).

Robbery

Robbery peaks at 8PM in January-April (about 0.0875) and at 10PM in May-August (about 0.09). The lowest fraction occurs at the same time in both month groups, around 7AM, with a value close to 0.0125. The second peak is at 9PM in January-April (about 0.078) and shifts to 10PM in May-August (about 0.075).

Theft

The two theft curves look very similar across the two month ranges and both follow the reference curve closely. The highest fraction occurs at the same time in both groups, around 6PM, at roughly 0.063. The lowest fraction is at 5AM in both groups, at about 0.01. Both month groups show elevated theft activity at (or around) 12AM, 12PM, 6PM, and 10PM.

Part 4: Create your own visualization

Question 4. Using either crime.subset, movie.data, or movie.genre.data (which is in the file movie_genre_data.rda), create a plot showing something interesting about the data. Then discuss what the plot shows, and what the viewer should look for when examining the plot. Be sure to label all axes, add a title, adjust for overplotting, etc., so that it is clear what the plot is trying to show.

Note 1: The plot should be one of the types that we discussed in class.

Note 2: Facets of course are allowed

Answer

For this part of the project I have chosen movie.genre.data after exploring all the available data sets. The movie.genre.data data set contains the movie title, its genre, release year, production budget, and domestic gross revenue. This is interesting data, especially for someone like me who enjoys movies and the facts associated with them.

Study of movie genre trends over the past 20 years.

In this section, I work through the following steps: load movie_genre_data.rda, keep only movies released in 2000 or later, count the movies in each genre for each release year, bin the release years into four 5-year groups, and plot the counts for each genre faceted by year group.

load('movie_genre_data.rda')
#View(movie.genre.data)

movie.genre.data.small <- movie.genre.data[(movie.genre.data$Release.Year>=2000), ]

# View(movie.genre.data.small)

# count the number of movies for each genre and release year
movie.genre.summary <- movie.genre.data.small %>%
  group_by(Genre, Release.Year) %>%
  summarise(movie.count = n())


# View(movie.genre.summary)

movie.genre.summary$year.group <- cut(as.numeric(movie.genre.summary$Release.Year), 
                                     c(2000, 2005, 2010, 2015, 2020), 
                                     c("(2000,2005)", "(2005,2010)", "(2010,2015)", "(2015,2020)"), 
                                     include.lowest=TRUE)

ggplot(data = movie.genre.summary, aes(x = movie.count, y = Genre)) + 
  geom_point(color='blue') +
  facet_wrap(~year.group, nrow=1) + 
  labs(title='Movie count vs Genre grouped by years', 
       x = 'Movie count', y = 'Movie Genre')

The idea of these plots is to study how the number of movies in each genre has trended over the last 20 years. Each panel covers one 5-year group and shows the count of movies released per year in each genre. I have always been intrigued by the spread of genres in the world of cinema, and the plots contain some interesting observations that I would like to highlight here.

Many more interpretations can be drawn from this graph; I have mentioned the ones most likely to catch a viewer's eye.