Visual Exploration of Your Data

In this assignment you will be working with the data that you have collected for your Final Project.

Submission Format

Submit 2 files: the R Markdown file and its knitted output (html or pdf). Text should be entered outside of code blocks (do not use # comments to describe your figures). Format your graphs properly: captions, titles, and axis labels.

#install.packages("ggplot2")
library(tidyr)
library(ggplot2)
crimedata <- read.csv("/Users/pallavisaitu/Downloads/crimedata.csv", header =  TRUE)
#crimedata

TASK 1: Create Univariate analysis for the variable of your interest (your Y variable). Calculate skewness and kurtosis and describe the results. [histogram, skewness values, kurtosis values, description - 10pts]

dim(crimedata) 
## [1] 2215  147
str(crimedata)
## 'data.frame':    2215 obs. of  147 variables:
##  $ Ecommunityname       : Factor w/ 2018 levels "Aberdeencity",..: 151 1035 1781 665 141 1700 1272 41 566 1860 ...
##  $ state                : Factor w/ 48 levels "AK","AL","AR",..: 29 36 35 32 23 24 19 15 27 41 ...
##  $ countyCode           : Factor w/ 115 levels "?","1","101",..: 57 60 1 55 84 1 46 1 40 1 ...
##  $ communityCode        : Factor w/ 960 levels "?","100","1000",..: 511 426 1 215 473 1 468 1 177 1 ...
##  $ fold                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ population           : int  11980 23123 29344 16656 11245 140494 28700 59459 74111 103590 ...
##  $ householdsize        : num  3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
##  $ racepctblack         : num  1.37 0.8 0.74 1.7 0.53 ...
##  $ racePctWhite         : num  91.8 95.6 94.3 97.3 89.2 ...
##  $ racePctAsian         : num  6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
##  $ racePctHisp          : num  1.88 0.85 2.35 0.7 0.52 ...
##  $ agePct12t21          : num  12.5 11 11.4 12.6 24.5 ...
##  $ agePct12t29          : num  21.4 21.3 25.9 25.2 40.5 ...
##  $ agePct16t24          : num  10.9 10.5 11 12.2 28.7 ...
##  $ agePct65up           : num  11.3 17.2 10.3 17.6 12.7 ...
##  $ numbUrban            : int  11980 23123 29344 0 0 140494 28700 59449 74115 103590 ...
##  $ pctUrban             : num  100 100 100 0 0 100 100 100 100 100 ...
##  $ medIncome            : int  75122 47917 35669 20580 17390 21577 42805 23221 25326 17852 ...
##  $ pctWWage             : num  89.2 79 82 68.2 69.3 ...
##  $ pctWFarmSelf         : num  1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
##  $ pctWInvInc           : num  70.2 64.1 55.7 39 42.8 ...
##  $ pctWSocSec           : num  23.6 35.5 22.2 39.5 32.2 ...
##  $ pctWPubAsst          : num  1.03 2.75 2.94 11.71 11.21 ...
##  $ pctWRetire           : num  18.4 22.9 14.6 18.3 14.4 ...
##  $ medFamInc            : int  79584 55323 42112 26501 24018 27705 50394 28901 34269 24058 ...
##  $ perCapInc            : int  29711 20148 16946 10810 8483 11878 18193 12161 13554 10195 ...
##  $ whitePerCap          : int  30233 20191 17103 10909 9009 12029 18276 12599 13727 12126 ...
##  $ blackPerCap          : int  13600 18137 16644 9984 887 7382 17342 9820 8852 5715 ...
##  $ indianPerCap         : int  5725 0 21606 4941 4425 10264 21482 6634 5344 11313 ...
##  $ AsianPerCap          : int  27101 20074 15528 3541 3352 10753 12639 8802 8011 5770 ...
##  $ OtherPerCap          : Factor w/ 1918 levels "?","0","10000",..: 1022 1049 1174 717 784 1418 681 1460 1068 1445 ...
##  $ HispPerCap           : int  22838 12222 8405 4391 1328 8104 22594 6187 5174 6984 ...
##  $ NumUnderPov          : int  227 885 1389 2831 2855 23223 1126 10320 9603 27767 ...
##  $ PctPopUnderPov       : num  1.96 3.98 4.75 17.23 29.99 ...
##  $ PctLess9thGrade      : num  5.81 5.61 2.8 11.05 12.15 ...
##  $ PctNotHSGrad         : num  9.9 13.72 9.09 33.68 23.06 ...
##  $ PctBSorMore          : num  48.2 29.9 30.1 10.8 25.3 ...
##  $ PctUnemployed        : num  2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
##  $ PctEmploy            : num  64.5 62 69.8 54.7 52.4 ...
##  $ PctEmplManu          : num  14.65 12.26 15.95 31.22 6.89 ...
##  $ PctEmplProfServ      : num  28.8 29.3 21.5 27.4 36.5 ...
##  $ PctOccupManu         : num  5.49 6.39 8.79 26.76 10.94 ...
##  $ PctOccupMgmtProf     : num  50.7 37.6 32.5 22.7 27.8 ...
##  $ MalePctDivorce       : num  3.67 4.23 10.1 10.98 7.51 ...
##  $ MalePctNevMarr       : num  26.4 28 25.8 28.1 50.7 ...
##  $ FemalePctDiv         : num  5.22 6.45 14.76 14.47 11.64 ...
##  $ TotalPctDiv          : num  4.47 5.42 12.55 12.91 9.73 ...
##  $ PersPerFam           : num  3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
##  $ PctFam2Par           : num  91.4 86.9 78.5 64 58.6 ...
##  $ PctKids2Par          : num  90.2 85.3 78.8 62.4 55.2 ...
##  $ PctYoungKids2Par     : num  95.8 96.8 92.4 65.4 66.5 ...
##  $ PctTeen2Par          : num  95.8 86.5 75.7 67.4 79.2 ...
##  $ PctWorkMomYoungKids  : num  44.6 51.1 66.1 59.6 61.2 ...
##  $ PctWorkMom           : num  58.9 62.4 74.2 70.3 68.9 ...
##  $ NumKidsBornNeverMar  : int  31 43 164 561 402 1511 263 2368 751 3537 ...
##  $ PctKidsBornNeverMar  : num  0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
##  $ NumImmig             : int  1277 1920 1468 339 196 2091 2637 517 1474 4793 ...
##  $ PctImmigRecent       : num  8.69 5.21 16.42 13.86 46.94 ...
##  $ PctImmigRec5         : num  13 8.65 23.98 13.86 56.12 ...
##  $ PctImmigRec8         : num  21 13.3 32.1 15.3 67.9 ...
##  $ PctImmigRec10        : num  30.9 22.5 35.6 15.3 69.9 ...
##  $ PctRecentImmig       : num  0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
##  $ PctRecImmig5         : num  1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
##  $ PctRecImmig8         : num  2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
##  $ PctRecImmig10        : num  3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
##  $ PctSpeakEnglOnly     : num  85.7 87.8 93.1 95 94.6 ...
##  $ PctNotSpeakEnglWell  : num  1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
##  $ PctLargHouseFam      : num  4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
##  $ PctLargHouseOccup    : num  4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
##  $ PersPerOccupHous     : num  2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
##  $ PersPerOwnOccHous    : num  3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
##  $ PersPerRentOccHous   : num  2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
##  $ PctPersOwnOccup      : num  91.5 89 64.2 58.2 58.1 ...
##  $ PctPersDenseHous     : num  0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
##  $ PctHousLess3BR       : num  11.1 23.6 47.5 45.7 55.6 ...
##  $ MedNumBR             : int  3 3 3 3 2 2 3 2 2 2 ...
##  $ HousVacant           : int  64 240 544 669 333 5119 566 2051 1562 5606 ...
##  $ PctHousOccup         : num  98.4 97.2 95.7 91.2 92.5 ...
##  $ PctHousOwnOcc        : num  91 84.9 57.8 54.9 53.6 ...
##  $ PctVacantBoarded     : num  3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
##  $ PctVacMore6Mos       : num  37.5 18.33 7.54 57.85 42.64 ...
##  $ MedYrHousBuilt       : int  1959 1958 1976 1939 1958 1966 1956 1954 1971 1960 ...
##  $ PctHousNoPhone       : num  0 0.31 1.55 7 7.45 ...
##  $ PctWOFullPlumb       : num  0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
##  $ OwnOccLowQuart       : int  215900 136300 74700 36400 30600 37700 155100 26300 54500 28600 ...
##  $ OwnOccMedVal         : int  262600 164200 90400 49600 43200 53900 179000 37000 70300 43100 ...
##  $ OwnOccHiQuart        : int  326900 199900 112000 66500 59500 73100 215500 52400 93700 67400 ...
##  $ OwnOccQrange         : int  111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
##  $ RentLowQ             : int  685 467 370 195 202 215 463 186 241 192 ...
##  $ RentMedian           : int  1001 560 428 250 283 280 669 253 321 281 ...
##  $ RentHighQ            : int  1001 672 520 309 362 349 824 325 387 369 ...
##  $ RentQrange           : int  316 205 150 114 160 134 361 139 146 177 ...
##  $ MedRent              : int  1001 627 484 333 332 340 736 338 355 353 ...
##  $ MedRentPctHousInc    : num  23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
##  $ MedOwnCostPctInc     : num  21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
##  $ MedOwnCostPctIncNoMtg: num  14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
##  $ NumInShelters        : int  11 0 16 0 2 327 0 21 125 43 ...
##  $ NumStreet            : int  0 0 0 0 0 4 0 0 15 4 ...
##  $ PctForeignBorn       : num  10.66 8.3 5 2.04 1.74 ...
##   [list output truncated]
# names(crimedata)
class(crimedata)
## [1] "data.frame"
# summary(crimedata)
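# Columns that contain "?" are read as factors. as.numeric() on a factor returns
# the internal level codes, so convert through as.character() to keep the
# recorded values ("?" becomes NA, with a coercion warning).
crimedata$rapes <- as.numeric(as.character(crimedata$rapes))
crimedata$assaults <- as.numeric(as.character(crimedata$assaults))
crimedata$robberies <- as.numeric(as.character(crimedata$robberies))
crimedata$burglaries <- as.numeric(as.character(crimedata$burglaries))
crimedata$larcenies <- as.numeric(as.character(crimedata$larcenies))
crimedata$autoTheft <- as.numeric(as.character(crimedata$autoTheft))
crimedata$arsons <- as.numeric(as.character(crimedata$arsons))
crimedata$ViolentCrimesPerPop <- as.numeric(as.character(crimedata$ViolentCrimesPerPop))
Fig1 <- hist(crimedata$ViolentCrimesPerPop,
             main = "Distribution of violent crimes per population",
             xlab = "ViolentCrimesPerPop")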
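
The str() output above lists several columns as factors whose levels include "?" (for example OtherPerCap). As a minimal sketch of how as.numeric() treats such a column, the following uses a made-up vector, not the crime data; the values and variable names are assumptions for illustration only.

```r
# Toy factor mimicking a numeric column where missing values are coded "?"
# (made-up values, for illustration only).
x <- factor(c("41.02", "?", "127.56", "9.3"))

# as.numeric() on a factor returns the internal integer level codes,
# not the numbers printed as labels.
codes <- as.numeric(x)

# Converting through character first recovers the labels as numbers;
# "?" cannot be parsed and becomes NA (the warning is suppressed here).
values <- suppressWarnings(as.numeric(as.character(x)))

print(codes)   # small integers whose order depends on how the levels sort
print(values)  # the recorded values, with NA in place of "?"
```

If the two printed vectors differ, any summary computed from the first one describes the level codes rather than the data.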

# install.packages("psych")
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(moments)
describe(crimedata$ViolentCrimesPerPop)
##    vars    n  mean    sd median trimmed    mad min  max range skew kurtosis
## X1    1 2215 888.3 615.6    877  877.26 810.98   1 1974  1973 0.08    -1.26
##       se
## X1 13.08

Interpretation Task 1:

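Skewness: 2.06. The distribution is positively skewed, with a long right tail.

Kurtosis: 5.57. This is higher than the kurtosis of a normal distribution (3, or 0 if reported as excess kurtosis), so the distribution is heavy-tailed: more of the data lies in the tails than a normal distribution would have.

These values do not match the describe() output above (skew 0.08, kurtosis -1.26). A skew near 0 with a kurtosis near -1.2 is what a roughly uniform set of factor level codes would give, so the printed output may reflect level codes rather than the recorded values.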
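
As a rough cross-check of what these two statistics measure, here is a minimal base-R sketch of the moment-based definitions. The helper names `moment_skewness` and `moment_kurtosis` are hypothetical and the vector is toy data. To my understanding, moments::kurtosis() uses this definition, under which a normal distribution has kurtosis 3, while psych::describe() reports excess kurtosis (roughly this value minus 3), so the two packages should not be compared directly.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (Pearson) kurtosis: fourth central moment divided by the
# squared second central moment. A normal distribution gives 3.
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Toy right-skewed data (made up): one large value pulls the right tail out.
toy <- c(1, 2, 2, 3, 3, 3, 4, 20)

skew <- moment_skewness(toy)
kurt <- moment_kurtosis(toy)
excess <- kurt - 3  # excess kurtosis: 0 for a normal distribution

cat("skewness:", skew, "\n")
cat("kurtosis:", kurt, " excess kurtosis:", excess, "\n")
```

A positive skewness indicates a longer right tail, and a kurtosis above 3 (excess above 0) indicates heavier tails than a normal distribution.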

TASK 2:

Create a bivariate box plot for your Y variable and one other important metric (your X). Describe the figure. [box plot, description - 10pts]

aggregate(crimedata$ViolentCrimesPerPop, by=list(state=crimedata$state), FUN=sum)
##    state      x
## 1     AK   3642
## 2     AL  38613
## 3     AR  24899
## 4     AZ  20923
## 5     CA 300959
## 6     CO  21632
## 7     CT  79189
## 8     DC    945
## 9     DE   1864
## 10    FL  83729
## 11    GA  38366
## 12    IA  17966
## 13    ID   6782
## 14    IL     40
## 15    IN  43732
## 16    KS   1851
## 17    KY  30193
## 18    LA  17018
## 19    MA 118165
## 20    MD  10063
## 21    ME  16979
## 22    MI    108
## 23    MN   5291
## 24    MO  41615
## 25    MS  20787
## 26    NC  43823
## 27    ND   5819
## 28    NH  20651
## 29    NJ 212313
## 30    NM  10682
## 31    NV   4715
## 32    NY  43263
## 33    OH  91727
## 34    OK  33515
## 35    OR  26452
## 36    PA  95321
## 37    RI  31177
## 38    SC  20257
## 39    SD   8469
## 40    TN  38858
## 41    TX 163863
## 42    UT  23201
## 43    VA  31240
## 44    VT   1758
## 45    WA  41915
## 46    WI  54009
## 47    WV  13563
## 48    WY   5653
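boxplot(ViolentCrimesPerPop ~ state, data = crimedata, las = 2, xlab = "State",
        ylab = "ViolentCrimesPerPop", main = "Violent crimes per population by state")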

Interpretation Task 2:

The box plots show violent crimes per population for each state. First, most states have outliers. Second, some states have far higher violent crime rates per population than others. Florida has the highest violent crime rate per population and Vermont has the lowest.
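
The points that boxplot() draws beyond the whiskers follow its default rule: values more than 1.5 times the interquartile range beyond the hinges. A minimal sketch of that rule on made-up numbers, using boxplot.stats() from base R (the vector is an assumption for illustration, not the crime data):

```r
# Toy values (made up): nine ordinary observations and one extreme one.
rates <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# boxplot.stats() returns the five-number summary used for the box and
# whiskers ($stats) and the points flagged as outliers ($out).
bs <- boxplot.stats(rates, coef = 1.5)

lower_hinge <- bs$stats[2]
upper_hinge <- bs$stats[4]
iqr <- upper_hinge - lower_hinge

# Values outside these fences are the ones drawn as individual points.
lower_fence <- lower_hinge - 1.5 * iqr
upper_fence <- upper_hinge + 1.5 * iqr

cat("fences:", lower_fence, upper_fence, "\n")
cat("outliers:", bs$out, "\n")
```

Listing `$out` per state is one way to back up a statement such as "most states have outliers" with counts rather than a visual impression.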

TASK 3:

If your variables are continuous - Create a scatter plot between your Y and your X. If your variables are categorical - Create a bar plot. Describe figure [plot, description - 10pts]

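ggplot(data = crimedata, mapping = aes(x = state, y = ViolentCrimesPerPop)) +
  geom_bar(stat = "identity") +
  labs(x = "US States", y = "ViolentCrimesPerPop (summed over communities)",
       title = "Violent crimes per population by state")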

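Interpretation Task 3:

The bar chart shows violent crimes per population for each state. Because each state contains many communities, geom_bar(stat = "identity") stacks their values, so each bar's height is the state total. The first thing we can clearly see is the outliers for most of the states.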
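
With stat = "identity" and several rows per state, the bars stack every community's value, so states with more communities tend to get taller bars. One alternative, sketched here on made-up data with hypothetical column names, is to reduce the data to one value per state (the mean, in this sketch) before plotting.

```r
library(ggplot2)

# Toy data (made up): several communities per state, one rate each.
toy <- data.frame(
  state = c("AA", "AA", "BB", "BB", "BB"),
  rate  = c(100, 300, 50, 150, 100)
)

# One row per state: the mean rate across that state's communities.
state_means <- aggregate(rate ~ state, data = toy, FUN = mean)

# geom_col() draws bar heights directly from the y values.
p_means <- ggplot(state_means, aes(x = state, y = rate)) +
  geom_col() +
  labs(x = "State", y = "Mean rate per community",
       title = "Mean rate by state (toy data)")

print(p_means)
```

Whether the mean, the median, or the total is the right summary depends on the question being asked; the mean is used here only as an example.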

TASK 4:

Create a multivariate plot - Use the same plot as in 3 but add another important variable using colored symbols. Describe Figure. Make sure to add legend [scatterplot, description - 10pts]

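# Color the points by assaults; the color bar serves as the legend
p <- ggplot(crimedata, aes(x = state, y = ViolentCrimesPerPop, color = assaults)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
  labs(x = "US States", y = "ViolentCrimesPerPop", color = "Assaults",
       title = "Violent crimes per population by state, colored by assaults")
p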
## `geom_smooth()` using formula 'y ~ x'

Interpretation Task 4:

The scatter plot shows violent crimes per population for each state, with points colored by assaults. Some of the outliers seen earlier correspond to the higher values of assaults in those states.
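
For comparison, a scatter plot of two continuous variables with a third variable shown by color could look like the sketch below. The data frame and its column names are made up for illustration; they are not columns of crimedata.

```r
library(ggplot2)

# Toy data (made up): two continuous measures and one grouping variable.
toy <- data.frame(
  assault_rate = c(120, 340, 560, 210, 480, 90),
  violent_rate = c(200, 520, 900, 310, 760, 150),
  region       = c("North", "North", "South", "South", "West", "West")
)

# Mapping color inside aes() makes ggplot2 add the legend automatically;
# labs(color = ...) sets the legend title.
p_toy <- ggplot(toy, aes(x = assault_rate, y = violent_rate, color = region)) +
  geom_point(size = 3) +
  labs(x = "Assault rate", y = "Violent crime rate", color = "Region",
       title = "Violent crime rate vs. assault rate (toy data)")

print(p_toy)
```

With a continuous x variable, a fitted line from geom_smooth(method = "lm") also becomes meaningful, which is not the case when x is a categorical variable such as state.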