Note: For this exam, don’t worry about including the correct labels and titles for the plots.

1. Loading data (5 points)

Question:

Load the data called riverton-crime.csv using the proper R command and file path.

Code (5 points):

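# Load the tidyverse, which provides %>%, count(), mutate(), summarize(), and ggplot() used below
library(tidyverse)

# Read the CSV file into a data frame
dat_riverton <- read.csv("C:/Users/evely/AppData/Roaming/Microsoft/Windows/Network Shortcuts/Penn Classes/CRIM/Data/riverton-crime.csv")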
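
An absolute path like the one above only works on one machine. As a rough sketch, the same load can be written with a relative path built by `file.path()`. The folder name `Data` is an assumption for illustration, not something stated in the exam.

```r
# Hypothetical layout: the CSV sits in a "Data" folder under the working directory.
data_dir <- "Data"   # assumed folder name, not taken from the exam
csv_path <- file.path(data_dir, "riverton-crime.csv")

# Check that the file exists before reading, so a wrong path gives a clear message.
if (file.exists(csv_path)) {
  dat_riverton <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```

`file.path()` joins the pieces with a forward slash, which R accepts on Windows, macOS, and Linux.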

2. Variable types (5 points)

Question:

Read the codebook. What types of (stat) variables are these? Variables: id, gender, offense_type, num_prior_arrests, and time_served_months.

Text answer (5 points):
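
One way to cross-check a codebook is to look at how R stores each column. The sketch below uses a small made-up data frame with the same column names, not the Riverton data, so the values are purely illustrative. Note that the storage class is only a hint: an `id` stored as an integer is still an identifier, not a quantity.

```r
# Made-up rows for illustration only; these are not values from riverton-crime.csv.
toy <- data.frame(
  id = 1:4,
  gender = c("Male", "Female", "Male", "Male"),
  offense_type = c("Drug", "Property", "Violent", "Public Order"),
  num_prior_arrests = c(0L, 3L, 5L, 2L),
  time_served_months = c(35.7, 29.1, 54.2, 17.7),
  stringsAsFactors = FALSE
)

# Storage class of every column (character, integer, numeric, ...).
col_classes <- sapply(toy, class)
print(col_classes)
```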

3. Data dimensions (5 points)

Question:

How many observations and variables are there in the data?

Code (3 points):

dim(dat_riverton)
## [1] 500   6

Text answer (2 points):

There are 500 observations and 6 variables.

4. Quantitative EDA (5 points)

Question:

What percentage of the sample is male?

Code (3 points):

dat_riverton %>%
  count(gender) %>%
  mutate(prop = prop.table(n))
##   gender   n  prop
## 1 Female 122 0.244
## 2   Male 378 0.756

Text answer (2 points):

75.6% of the sample is male.
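
The step from proportion to percentage is just a multiplication by 100. A minimal sketch of that arithmetic, using the two counts copied from the output above:

```r
# Counts copied from the count(gender) output: 122 female, 378 male.
counts <- c(Female = 122, Male = 378)

# prop.table() divides each count by the total; multiply by 100 for percentages.
pct <- round(100 * prop.table(counts), 1)
print(pct)
```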

5. Visual EDA (5 points)

Question:

Make a histogram for number of prior arrests and look for an appropriate number of bins. Describe this histogram.

Code (3 points):

dat_riverton %>%
  ggplot(aes(x = num_prior_arrests)) +
  geom_histogram(binwidth = 1)

Text answer (2 points):

The histogram appears unimodal and slightly right-skewed, with a short right tail and no noticeable outliers.

6. Quantitative EDA (5 points)

Question:

Find statistics to describe the centrality and spread of the variable number of prior arrests. Are these numbers surprising?

Code (3 points):

summ_stats_num_prior_arrest <- dat_riverton %>%
  summarize(median = median(num_prior_arrests),
            IQR = IQR(num_prior_arrests))
summ_stats_num_prior_arrest
##   median IQR
## 1      3   3

Text answer (2 points):

It’s surprising that the median is 3 because this means at least half of the individuals have 3 or more prior arrests, rather than clustering at 0 or 1 as you might expect. The IQR of 3 shows that the middle half of individuals fall within a span of 3 arrests.
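
To make the two statistics concrete, the sketch below computes them by hand on a made-up vector of arrest counts (not the Riverton data). It shows that the IQR is the distance between the 75th and 25th percentiles, and that at least half of the values sit at or above the median.

```r
# Made-up arrest counts for illustration only.
toy_arrests <- c(0, 1, 2, 2, 3, 3, 4, 5, 6, 9)

# Quartiles using R's default quantile method.
q <- quantile(toy_arrests, probs = c(0.25, 0.5, 0.75))

# The IQR is the 75th percentile minus the 25th percentile.
iqr_by_hand <- unname(q["75%"] - q["25%"])

# Share of observations at or above the median.
share_at_or_above_median <- mean(toy_arrests >= median(toy_arrests))

print(q)
print(iqr_by_hand)
print(share_at_or_above_median)
```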

7. Visual EDA (5 points)

Question:

Now split up the histogram by gender. Does it look like males tend to have more or fewer prior arrests than females?

Code (3 points):

dat_riverton %>%
  ggplot(aes(x = num_prior_arrests, fill = gender)) +
  geom_histogram(binwidth = 1)

Text answer (2 points):

For the most part, males tend to have more prior arrests than females.
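
A histogram filled by gender shows counts, and about three quarters of this sample is male, so the male bars will be taller at most values regardless of the typical number of arrests. A per-group summary is one way to check the visual impression. The sketch below uses made-up rows, not the Riverton data, and base R's `tapply()`.

```r
# Made-up rows for illustration only: six males and two females.
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Male", "Male", "Male", "Female", "Female"),
  num_prior_arrests = c(1, 3, 4, 2, 5, 3, 2, 3),
  stringsAsFactors = FALSE
)

# Median number of prior arrests within each gender.
medians_by_gender <- tapply(toy$num_prior_arrests, toy$gender, median)
print(medians_by_gender)
```

On the real data, the same call with `dat_riverton` in place of `toy` would give the group medians to compare.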

8. Visual EDA (5 points)

Question:

Make a barplot for offense_type and describe it.

Code (3 points):

dat_riverton %>%
  ggplot(aes(x = offense_type)) +
  geom_bar()

Text answer (2 points):

Property offenses are the most common, followed by drug, violent, and public order offenses.

9. Visual EDA (5 points)

Question:

Now split the barplot by gender. Does it look like males commit more violent crimes than females?

Code (3 points):

dat_riverton %>%
  ggplot(aes(x = offense_type, fill = gender)) +
  geom_bar()

Text answer (2 points):

The barplot illustrates that males do commit more violent crimes than females do.
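
Because the sample has far more males than females, a higher count of violent offenses does not by itself mean a higher share. The sketch below, on made-up rows rather than the Riverton data, shows how counts and within-gender proportions can point in different directions.

```r
# Made-up rows for illustration only: six males and two females.
toy <- data.frame(
  gender = c(rep("Male", 6), rep("Female", 2)),
  offense_type = c("Violent", "Violent", "Drug", "Property", "Property", "Drug",
                   "Violent", "Property"),
  stringsAsFactors = FALSE
)

# Raw counts of each offense type by gender.
counts <- table(toy$gender, toy$offense_type)

# Proportions within each gender (each row sums to 1).
row_props <- prop.table(counts, margin = 1)

print(counts)
print(round(row_props, 2))
```

In this toy example males have more violent offenses in absolute terms, yet a smaller share of male offenses is violent. Running the same two lines on `dat_riverton` would show which pattern holds in the actual data.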

10. Visual EDA (5 points)

Question:

Make a histogram for time served. Split it up by offense type. Does it look like public order offenses have shorter or longer time served?

Code (3 points):

dat_riverton %>%
  ggplot(aes(x = time_served_months, fill = offense_type)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Text answer (2 points):

The histogram illustrates that relative to other offenses, public order offenses have shorter time served.

Extra credit. Quantitative EDA (5 points extra credit)

Question:

Now find a measure of centrality to determine whether your guess in #10 is correct.

Note: I have given you part of the code. You need to fill in the rest: Write the statistic you want to see in the empty round brackets. Later in the class, we will learn how to test these differences statistically.

Code (3 points):

dat_riverton %>%
  group_by(offense_type) %>%
  summarize(median = median(time_served_months),
            mean = mean(time_served_months))
## # A tibble: 4 × 3
##   offense_type median  mean
##   <chr>         <dbl> <dbl>
## 1 Drug           35.7  35.1
## 2 Property       29.1  29.1
## 3 Public Order   17.7  17.6
## 4 Violent        54.2  52.9

Text answer (2 points):

Both the mean and the median confirm my answer to #10: public order offenses have the shortest time served of the four offense types (median 17.7 months, mean 17.6 months).