ANLY 506 - Course Project

============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/466232

Objective

To identify the correlations between violent and non-violent crime rates (total number of crimes per 100,000 population) in 1995 and five demographic factors of communities in 1990: population, ethnicity, age structure, income per household member, and education level.

Literature Review

Violent crime includes but is not limited to murder, criminal homicide (voluntary manslaughter), forcible rape, aggravated assault, and robbery.
Non-violent crime includes but is not limited to property crime: theft, embezzlement, arson of personal property, fraud, tax crimes, drug and alcohol-related crimes, prostitution, gambling and racketeering, bribery.
Note: VC=Violent Crime, NVC=Non-violent Crime

1968

1996

2002

2004

2018

Reference

Data Source

“Communities and Crime Unnormalized Data Set” is a cross-sectional dataset created by Michael Redmond, uploaded on March 2, 2011, and archived in the UCI Machine Learning Repository. Link: http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized.

The first attempt to read the file fails with the following error:
Error in make.names(col.names, unique = TRUE) : invalid multibyte string at ‘com<6d>unityname’.
To fix it, open the file as plain text, correct the malformed header of the first column (communityname), and then read the dataset again.

crime <- read.csv("~/Documents/HU/ANLY 506-51-B/Week 1, Oct 27- Nov 2nd/crimedata.csv")

Data Cleaning

The dataset contains 147 attributes and 2,215 instances, with missing values coded as “?”. Demographic statistics are in columns 6-129 and were recorded in 1990. Crime statistics are in columns 130-147 and were recorded in 1995. By definition, the first five columns should be of factor type, and the rest are numeric (some integer, others double). A warning message suggests that 41 columns contain NA. Hawaii (HI), Montana (MT), and Nebraska (NE) are not represented in the dataset, while Washington, D.C. (DC) contributes a single instance.

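crimeSub <- crime
crimeSub[,5] <- as.factor(crimeSub[,5]) #column 5 as a factor
#convert all columns except 1, 2, and 5 to numeric via character; "?" becomes NA (with a warning)
crimeSub[,-c(1,2,5)] <- sapply(crimeSub[,-c(1,2,5)],as.character)
crimeSub[,-c(1,2,5)] <- sapply(crimeSub[,-c(1,2,5)],as.numeric)
crime <- crimeSub; rm(crimeSub)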
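
The conversion above relies on as.numeric turning every “?” into NA. As an alternative sketch, read.csv can be told to treat “?” as missing through its na.strings argument, so that numeric columns are parsed directly. The inline table below is invented for illustration and only mimics the layout of the real file.

```r
# Toy stand-in for crimedata.csv; the values are invented for illustration.
toy_csv <- "communityname,state,population,ViolentCrimesPerPop
Acity,NJ,11980,41.02
Btown,PA,23123,?
Cville,OR,29344,218.59"

# na.strings = "?" marks the placeholder as missing while reading,
# so the column is parsed as numeric without a coercion warning.
toy <- read.csv(text = toy_csv, na.strings = "?", stringsAsFactors = FALSE)

str(toy)               # ViolentCrimesPerPop is numeric
colSums(is.na(toy))    # number of missing values per column
```

With the real file, the same argument would be added to the read.csv call shown earlier; whether this reproduces the exact column types used in the rest of the report has not been checked here.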

library(usmap)
library(ggplot2) #provides ggplot(), scale_fill_continuous(), ggtitle(), and theme()
us <- as.data.frame(table(crime$state)); colnames(us) <- c("state","count")
plot_usmap(data=us,values="count")+scale_fill_continuous(low="white",high="red",name="Count")+
  ggtitle("Frequency Count of States in Dataset")+theme(legend.position="right",plot.title=element_text(hjust=0.5,size=15,face="bold"))

Select the dependent variables:
ViolentCrimesPerPop – total number of violent crimes per 100K population

nonViolPerPop – total number of non-violent crimes per 100K population

Select the independent variables (five demographic factors):
1, Population: population – population for community

2, Ethnicity: racePctWhite – percentage of population that is Caucasian

crime$raceSum <- crime$racepctblack+crime$racePctWhite+crime$racePctAsian+crime$racePctHisp
c(sum(crime$raceSum==100),sum(crime$raceSum<100),sum(crime$raceSum>100),range(crime$raceSum))

## [1]    2.00  164.00 2049.00   72.95  175.41

crime[crime$raceSum==min(crime$raceSum),c(1:2)]

##     communityname state
## 940 Tahlequahcity    OK

crime[crime$raceSum==max(crime$raceSum),c(1:2)]

##        communityname state
## 1042 Brownsvillecity    TX

QUESTION: Why is the sum of the race percentages as low as 73% for Tahlequah?
[Hint] http://worldpopulationreview.com/us-cities/tahlequah-ok-population/

3, Age: agePct65up / agePct12t29
– agePct65up: percentage of population that is 65 and over in age
– agePct12t29: percentage of population that is 12-29 in age

crime$primaryKey <- paste(crime$communityname,crime$state,sep="_")
if(!(0 %in% crime$agePct12t29)){
  crime$age <- round(crime$agePct65up/crime$agePct12t29,4)}

4, Income: medIncome / householdsize
– householdsize: mean people per household
– medIncome: median household income

if(!(0 %in% crime$householdsize)){
  crime$income <- round(crime$medIncome/crime$householdsize,2)}
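
The if guards above skip the derived column entirely when any community has a zero denominator. A minimal sketch of an alternative, assuming it is acceptable to mark only the affected rows as NA, is shown below. The helper name safe_ratio and the toy values are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): element-wise
# ratio that yields NA wherever the denominator is zero or missing.
safe_ratio <- function(num, den, digits = 2) {
  den[!is.na(den) & den == 0] <- NA   # avoid Inf / NaN from division by zero
  round(num / den, digits)
}

# Toy values standing in for medIncome and householdsize.
med_income     <- c(30000, 45000, 28000)
household_size <- c(2.5, 0, 3.2)

ratio <- safe_ratio(med_income, household_size)
ratio
```

The second element is NA because its denominator is zero; the other rows are still computed.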

5, Education: PctNotHSGrad – percentage of people 25 and over that are not high school graduates

Exploratory Data Analysis (EDA)

Below is a table of descriptive statistics for the selected variables. Note that both dependent variables contain numerous missing values (nbr.na). Moreover, ViolentCrimesPerPop contains one zero value, which stat.desc reports as nbr.null. Rows with missing crime rates are removed before modeling, but they could later be filled in and used as a validation set.

crimeNA <- crime[,c(149,6,9,150,151,36,146,147)] #subset of raw data
crimeOmit <- na.omit(crimeNA) #omit rows containing NA in ViolentCrimesPerPop or nonViolPerPop
colnames(crimeNA) <- c("Community","Population","Ethnicity","Age","Income","Education","VC Rate","NVC Rate")
colnames(crimeOmit) <- c("Community","Population","Ethnicity","Age","Income","Education","VC Rate","NVC Rate")
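
The idea of keeping the incomplete rows aside as a possible validation set can be sketched with complete.cases, which flags the rows that na.omit would keep. The toy data frame and the names model_set and validation_set are assumptions made for illustration.

```r
# Toy data frame; values are invented for illustration.
toy <- data.frame(
  Community = c("A_NJ", "B_PA", "C_OR", "D_IA"),
  VC_Rate   = c(41.0, NA, 218.6, 0),
  NVC_Rate  = c(1394.6, 1955.9, NA, NA)
)

# TRUE for rows with no missing value in either crime rate.
complete <- complete.cases(toy[, c("VC_Rate", "NVC_Rate")])

model_set      <- toy[complete, ]    # same rows na.omit(toy) would keep
validation_set <- toy[!complete, ]   # incomplete rows set aside for later

nrow(model_set); nrow(validation_set)
```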

library(knitr); library(kableExtra) #kable(), kable_styling(), and the %>% pipe
options(scipen=100,digits=2)
pastecs::stat.desc(crimeNA[,-1]) %>% kable() %>% kable_styling()

	Population	Ethnicity	Age	Income	Education	VC Rate	NVC Rate
nbr.val	2215.0	2215.00	2215.00	2215.00	2215.00	1994	2118.00
nbr.null	0.0	0.00	0.00	0.00	0.00	1	0.00
nbr.na	0.0	0.00	0.00	0.00	0.00	221	97.00
min	10005.0	2.68	0.04	2995.27	1.46	0	116.79
max	7322564.0	99.63	5.15	42049.32	73.66	4877	27119.76
range	7312559.0	96.95	5.12	39054.05	72.20	4877	27002.97
sum	117656335.0	186015.30	1032.67	27915545.03	49405.84	1174623	10395656.14
median	22792.0	90.35	0.43	11770.40	21.38	374	4425.45
mean	53118.0	83.98	0.47	12602.95	22.31	589	4908.24
SE.mean	4347.7	0.35	0.01	101.25	0.23	14	59.53
CI.mean.0.95	8526.0	0.68	0.01	198.56	0.46	27	116.74
var	41869447877.7	269.59	0.10	22707332.49	120.77	377960	7506004.86
std.dev	204620.2	16.42	0.31	4765.22	10.99	615	2739.71
coef.var	3.9	0.20	0.66	0.38	0.49	1	0.56

subset(crimeNA,`VC Rate`==0) %>% kable() %>% kable_styling()

	Community	Population	Ethnicity	Age	Income	Education	VC Rate	NVC Rate
1397	Spencercity_IA	11066	99	0.58	10196	15	0	NA

Histogram: Population

ggplot(subset(crimeNA,Population<1000000),aes(Population))+geom_histogram(binwidth =35000,fill="black",color="white",alpha=0.5)+ggtitle("Histogram of Population")

Ethnicity

ggplot(crimeNA,aes(Ethnicity))+geom_histogram(binwidth=8,fill="black",color="white",alpha=0.5)+ggtitle("Histogram of Ethnicity")+xlab("Ethnicity (%)")

Age

ggplot(crimeNA,aes(Age))+geom_histogram(binwidth=0.1,fill="black",color="white",alpha=0.5)+ggtitle("Histogram of Age")

Income

ggplot(crimeNA,aes(Income))+geom_histogram(binwidth=1000,fill="black",color="white",alpha=0.5)+ggtitle("Histogram of Income")+xlab("Income (USD)")

Education

ggplot(crimeNA,aes(Education))+geom_histogram(binwidth=1,fill="black",color="white",alpha=0.5)+ggtitle("Histogram of Education")+xlab("Education (%)")

Scatter plot: Population - VC Rate

ggplot(crimeNA,aes(Population,`VC Rate`))+geom_point()+ggtitle("Scatter Plot of Population and VC Rate")+ylab("VC Rate")

Ethnicity - VC Rate

ggplot(crimeNA,aes(Ethnicity,`VC Rate`))+geom_point()+ggtitle("Scatter Plot of Ethnicity and VC Rate")+xlab("Ethnicity (%)")+ylab("VC Rate")

Age - VC Rate

ggplot(crimeNA,aes(Age,`VC Rate`))+geom_point()+ggtitle("Scatter Plot of Age and VC Rate")+ylab("VC Rate")

Income - VC Rate

ggplot(crimeNA,aes(Income,`VC Rate`))+geom_point()+ggtitle("Scatter Plot of Income and VC Rate")+xlab("Income (USD)")+ylab("VC Rate")

Education - VC Rate

ggplot(crimeNA,aes(Education,`VC Rate`))+geom_point()+ggtitle("Scatter Plot of Education and VC Rate")+xlab("Education (%)")+ylab("VC Rate")

Scatter plot: Population - NVC Rate

ggplot(crimeNA,aes(Population,`NVC Rate`))+geom_point()+ggtitle("Scatter Plot of Population and NVC Rate")+ylab("NVC Rate")

Ethnicity - NVC Rate

ggplot(crimeNA,aes(Ethnicity,`NVC Rate`))+geom_point()+ggtitle("Scatter Plot of Ethnicity and NVC Rate")+xlab("Ethnicity (%)")+ylab("NVC Rate")

Age - NVC Rate

ggplot(crimeNA,aes(Age,`NVC Rate`))+geom_point()+ggtitle("Scatter Plot of Age and NVC Rate")+ylab("NVC Rate")

Income - NVC Rate

ggplot(crimeNA,aes(Income,`NVC Rate`))+geom_point()+ggtitle("Scatter Plot of Income and NVC Rate")+xlab("Income (USD)")+ylab("NVC Rate")

Education - NVC Rate

ggplot(crimeNA,aes(Education,`NVC Rate`))+geom_point()+ggtitle("Scatter Plot of Education and NVC Rate")+xlab("Education (%)")+ylab("NVC Rate")

The pairwise correlations between the variables are shown below.

correlations <- round(cor(crimeOmit[,-1]),2)
corrplot::corrplot(correlations,method="color",type="lower",addCoef.col="black",tl.col="black",tl.srt=45,diag=FALSE)

Clustering

The idea is to group the communities into clusters with k-means (k=10) on the five demographic factors to see how communities differ from each other, and then to build a separate model for each sufficiently large cluster. The column indices for population, ethnicity, age, income, and education are 6, 9, 150, 151, and 36, respectively.

#distance matrix of communities
dmCommunities <- cluster::daisy(crimeNA[,-c(1,7,8)])
set.seed(506)
crime$Cluster <- as.factor(kmeans(dmCommunities,10)$cluster)
#NOTE: cluster labels are arbitrary and may differ between runs
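
The code above hands kmeans a dissimilarity object, which base R's kmeans coerces with as.matrix into a square matrix of pairwise distances before clustering its rows, and the features are not rescaled beforehand. A more conventional sketch, offered only as an alternative and not as the method behind the results in this report, standardizes the features and clusters them directly. The data below are synthetic, and the choice of two clusters and nstart=25 is an assumption for illustration.

```r
set.seed(506)

# Synthetic features on very different scales (invented for illustration).
toy <- data.frame(
  Population = c(rnorm(50, 20000, 3000), rnorm(50, 200000, 30000)),
  Education  = c(rnorm(50, 15, 3),       rnorm(50, 35, 3))
)

# Standardize so that no single variable dominates the Euclidean distance.
toy_scaled <- scale(toy)

# nstart > 1 reruns k-means from several random starts and keeps the best fit.
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

table(km$cluster)     # cluster sizes
km$centers            # cluster centres in standardized units
```

Without scaling, Population would dominate the distance simply because its values are several orders of magnitude larger than the percentages.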

crimeVio <- na.omit(crime[,c(149,6,9,150,151,36,146,152)]) #omit rows containing NA in ViolentCrimesPerPop
crimeNon <- na.omit(crime[,c(149,6,9,150,151,36,147,152)]) #omit rows containing NA in nonViolPerPop
crimeVioS <- crimeVio; crimeNonS <- crimeNon
#standardizing the five factors and the crime rate in each data frame, i.e., mean=0, sd=1
crimeVioS[,-c(1,8)] <- scale(crimeVioS[,-c(1,8)])
crimeNonS[,-c(1,8)] <- scale(crimeNonS[,-c(1,8)])

colnames(crimeVioS) <- c("Community","Population","Ethnicity","Age","Income","Education","VC Rate","Cluster")
colnames(crimeNonS) <- c("Community","Population","Ethnicity","Age","Income","Education","NVC Rate","Cluster")

VC Rate

#the count table of 10 clusters
table(crimeVio$Cluster)

## 
##    1    2    3    4    5    6    7    8    9   10 
##  384    4    1   21 1329   24   11    1   59  160

colnames(crimeVio) <- c("Community","Population","Ethnicity","Age","Income","Education","VC Rate","Cluster")
DT::datatable(crimeVio,filter="top")

NVC Rate

#the count table of 10 clusters
table(crimeNon$Cluster)

## 
##    1    2    3    4    5    6    7    8    9   10 
##  410    6    2   22 1408   22   10    1   63  174

colnames(crimeNon) <- c("Community","Population","Ethnicity","Age","Income","Education","NVC Rate","Cluster")
DT::datatable(crimeNon,filter="top")

p1 <- ggplot(subset(crime,Cluster  %in% c(5)),aes(ViolentCrimesPerPop,nonViolPerPop))+geom_point(aes(color=Cluster))+xlab("VC Rate")+ylab("NVC Rate")+geom_abline(slope=1,color="black",linetype="dashed")
p2 <- ggplot(subset(crime,Cluster  %in% c(1,10)),aes(ViolentCrimesPerPop,nonViolPerPop))+geom_point(aes(color=Cluster))+xlab("VC Rate")+ylab("NVC Rate")+geom_abline(slope=1,color="black",linetype="dashed")
p3 <- ggplot(subset(crime,Cluster  %in% c(4,6,9)),aes(ViolentCrimesPerPop,nonViolPerPop))+geom_point(aes(color=Cluster))+xlab("VC Rate")+ylab("NVC Rate")+geom_abline(slope=1,color="black",linetype="dashed")
p4 <- ggplot(subset(crime,Cluster  %in% c(2,3,7,8)),aes(ViolentCrimesPerPop,nonViolPerPop))+geom_point(aes(color=Cluster))+xlab("VC Rate")+ylab("NVC Rate")+geom_abline(slope=1,color="black",linetype="dashed")
gridExtra::grid.arrange(p1,p2,p3,p4,nrow=2,bottom="Cluster Analysis")

Modeling

A multiple linear regression model is fitted separately for each of the larger clusters (5, 1, 10, and 9), once for the violent crime rate and once for the non-violent crime rate. The overall F-test p-value of every model is far below the alpha level of 0.05, so we reject the null hypothesis that none of the independent variables is related to the dependent variable. In other words, population, ethnicity, age, income, and education, taken together, do relate to violent and non-violent crime rates, although not every individual coefficient is significant in every cluster. However, the adjusted R-squared ranges from only 0.239 to 0.571, suggesting that a linear model explains a modest share of the variation.
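
The per-cluster fitting and the two summary figures discussed above (overall F-test p-value and adjusted R-squared) can be collected in one pass. The sketch below uses the built-in mtcars data as a stand-in, with cyl playing the role of Cluster; the helper name fit_summary and the chosen predictors are assumptions for illustration only.

```r
# Built-in mtcars stands in for the crime data; `cyl` plays the role of Cluster.
groups <- split(mtcars, mtcars$cyl)

# Hypothetical helper: fit one model per group and extract summary figures.
fit_summary <- function(d) {
  s <- summary(lm(mpg ~ wt + hp, data = d))
  f <- s$fstatistic                 # F value, numerator df, denominator df
  c(n         = nrow(d),
    adj_r2    = s$adj.r.squared,
    f_p_value = unname(pf(f["value"], f["numdf"], f["dendf"],
                          lower.tail = FALSE)))
}

# One row per group, one column per summary figure.
results <- t(sapply(groups, fit_summary))
round(results, 3)
```

The p-value is recomputed from the F statistic with pf because summary.lm prints it but does not store it as a separate element.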

Taking the Cluster 5 violent crime model as an example, population, age, and education have positive coefficients, while ethnicity and income have negative coefficients. Given the variable definitions (Ethnicity is the Caucasian percentage, Age is the ratio of residents aged 65 and over to those aged 12-29, and Education is the percentage of adults who are not high school graduates), a lower crime rate is associated with a smaller population, a higher Caucasian percentage, a lower senior proportion, a higher income, and a higher education level.

Cluster 5

fitVio5 <- lm(`VC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeVioS,Cluster==5))
summary(fitVio5)

## 
## Call:
## lm(formula = `VC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeVioS, Cluster == 5))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.824 -0.295 -0.098  0.156  3.772 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   0.2274     0.0902    2.52                0.012 *  
## Population    2.0880     0.5279    3.96          0.000080511 ***
## Ethnicity    -0.4718     0.0217  -21.73 < 0.0000000000000002 ***
## Age           0.0968     0.0175    5.52          0.000000041 ***
## Income       -0.0886     0.0223   -3.97          0.000076205 ***
## Education     0.1176     0.0258    4.57          0.000005451 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.59 on 1323 degrees of freedom
## Multiple R-squared:  0.46,   Adjusted R-squared:  0.458 
## F-statistic:  226 on 5 and 1323 DF,  p-value: <0.0000000000000002

par(mfrow=c(2,2)); plot(fitVio5)

fitNon5 <- lm(`NVC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeNonS,Cluster==5))
summary(fitNon5)

## 
## Call:
## lm(formula = `NVC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeNonS, Cluster == 5))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.354 -0.459 -0.139  0.299  8.593 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   0.2503     0.1231    2.03               0.0422 *  
## Population    2.0565     0.7231    2.84               0.0045 ** 
## Ethnicity    -0.3965     0.0287  -13.83 < 0.0000000000000002 ***
## Age           0.1508     0.0233    6.46        0.00000000015 ***
## Income       -0.2661     0.0300   -8.88 < 0.0000000000000002 ***
## Education     0.0150     0.0345    0.43               0.6637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.81 on 1402 degrees of freedom
## Multiple R-squared:  0.312,  Adjusted R-squared:  0.31 
## F-statistic:  127 on 5 and 1402 DF,  p-value: <0.0000000000000002

plot(fitNon5)

Cluster 1

fitVio1 <- lm(`VC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeVioS,Cluster==1))
summary(fitVio1)

## 
## Call:
## lm(formula = `VC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeVioS, Cluster == 1))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.956 -0.387 -0.109  0.193  4.867 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.0908     0.0483    1.88              0.0606 .  
## Population    0.3768     0.8626    0.44              0.6625    
## Ethnicity    -0.5223     0.0470  -11.11 <0.0000000000000002 ***
## Age           0.0512     0.0330    1.55              0.1212    
## Income       -0.1569     0.0598   -2.63              0.0090 ** 
## Education     0.1670     0.0621    2.69              0.0075 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.79 on 378 degrees of freedom
## Multiple R-squared:  0.476,  Adjusted R-squared:  0.469 
## F-statistic: 68.5 on 5 and 378 DF,  p-value: <0.0000000000000002

par(mfrow=c(2,2)); plot(fitVio1)

fitNon1 <- lm(`NVC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeNonS,Cluster==1))
summary(fitNon1)

## 
## Call:
## lm(formula = `NVC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeNonS, Cluster == 1))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.064 -0.415 -0.098  0.284  6.918 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)   0.1142     0.0514    2.22       0.027 *  
## Population    0.5025     0.9020    0.56       0.578    
## Ethnicity    -0.2700     0.0487   -5.55 0.000000052 ***
## Age           0.0465     0.0335    1.39       0.166    
## Income       -0.3244     0.0606   -5.36 0.000000143 ***
## Education    -0.0212     0.0633   -0.33       0.738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.81 on 404 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.239 
## F-statistic: 26.7 on 5 and 404 DF,  p-value: <0.0000000000000002

plot(fitNon1)

Cluster 10

fitVio10 <- lm(`VC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeVioS,Cluster==10))
summary(fitVio10)

## 
## Call:
## lm(formula = `VC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeVioS, Cluster == 10))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.979 -0.385 -0.141  0.301  2.251 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   0.2757     0.1505    1.83                0.069 .  
## Population   -0.1145     0.7751   -0.15                0.883    
## Ethnicity    -0.6514     0.0598  -10.89 < 0.0000000000000002 ***
## Age           0.4447     0.1001    4.44             0.000017 ***
## Income       -0.4037     0.0930   -4.34             0.000025 ***
## Education    -0.0718     0.0849   -0.85                0.399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.71 on 154 degrees of freedom
## Multiple R-squared:  0.585,  Adjusted R-squared:  0.571 
## F-statistic: 43.4 on 5 and 154 DF,  p-value: <0.0000000000000002

par(mfrow=c(2,2)); plot(fitVio10)

fitNon10 <- lm(`NVC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeNonS,Cluster==10))
summary(fitNon10)

## 
## Call:
## lm(formula = `NVC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeNonS, Cluster == 10))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5131 -0.4190 -0.0492  0.3629  2.3331 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)   0.2412     0.1338    1.80      0.0734 .  
## Population    0.6452     0.7317    0.88      0.3792    
## Ethnicity    -0.2692     0.0567   -4.75 0.000004431 ***
## Age           0.5224     0.0933    5.60 0.000000086 ***
## Income       -0.5205     0.0873   -5.96 0.000000014 ***
## Education    -0.2425     0.0814   -2.98      0.0033 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.69 on 168 degrees of freedom
## Multiple R-squared:  0.352,  Adjusted R-squared:  0.333 
## F-statistic: 18.3 on 5 and 168 DF,  p-value: 0.0000000000000187

plot(fitNon10)

Cluster 9

fitVio9 <- lm(`VC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeVioS,Cluster==9))
summary(fitVio9)

## 
## Call:
## lm(formula = `VC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeVioS, Cluster == 9))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.502 -0.543 -0.055  0.237  3.337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.694      0.572    1.21    0.231   
## Population    -0.877      1.064   -0.82    0.414   
## Ethnicity     -0.576      0.185   -3.11    0.003 **
## Age            0.237      0.333    0.71    0.479   
## Income        -0.658      0.316   -2.08    0.042 * 
## Education     -0.108      0.268   -0.40    0.690   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.91 on 53 degrees of freedom
## Multiple R-squared:  0.354,  Adjusted R-squared:  0.293 
## F-statistic: 5.81 on 5 and 53 DF,  p-value: 0.00024

par(mfrow=c(2,2)); plot(fitVio9)

fitNon9 <- lm(`NVC Rate`~Population+Ethnicity+Age+Income+Education,data=subset(crimeNonS,Cluster==9))
summary(fitNon9)

## 
## Call:
## lm(formula = `NVC Rate` ~ Population + Ethnicity + Age + Income + 
##     Education, data = subset(crimeNonS, Cluster == 9))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3306 -0.4625  0.0593  0.3922  2.1699 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.949      0.421    2.25   0.0281 *  
## Population    -1.001      0.812   -1.23   0.2232    
## Ethnicity     -0.469      0.139   -3.37   0.0013 ** 
## Age            0.585      0.247    2.37   0.0214 *  
## Income        -1.051      0.242   -4.34 0.000059 ***
## Education     -0.573      0.204   -2.81   0.0067 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7 on 57 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.344 
## F-statistic:  7.5 on 5 and 57 DF,  p-value: 0.0000182

plot(fitNon9)

Conclusion

Applied data analysis techniques learnt in the course, such as descriptive analysis, exploratory data analysis (EDA), principal component analysis (PCA), cluster analysis, linear regression, hypothesis testing, and variance analysis, though not all of them are covered in this document.
Conducted k-means clustering (k=10) on each community’s demographic factors, namely population, ethnicity, age, income, and education.
Built linear regression models to relate crime rates to population, ethnicity, age, income, and education.