SEND YOUR RESULTS AND INTERPRETATIONS TO vera.vtitkova@ya.ru Format is pdf/word. I DO NOT ACCEPT MARKDOWN FILES.
Let’s explain crime rates in Boston -> dataset “Boston” from package MASS
library(knitr)
library(rmarkdown)
library(MASS)
#??Boston
dim(Boston)
names(Boston)
1. make a descriptive statistic table for variables: crim, indus, chas, black. Save it. What are types of variables?
library(dplyr)
mydata<-Boston %>%
select(crim, indus, chas, black)
mydata$chas <- as.factor(mydata$chas)
Types of variables:
2. choose an adequate test for these variables: chas & crim. Interpret your results
t.test(mydata$crim ~ mydata$chas, var.equal = F, na.rm = T)
##
## Welch Two Sample t-test
##
## data: mydata$crim by mydata$chas
## t = 3.2224, df = 120.42, p-value = 0.001636
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7298305 3.0557226
## sample estimates:
## mean in group 0 mean in group 1
## 3.744447 1.851670
Here we see that p-value is rather small (<.05) - that mean that there is the difference between means of crime rate (per capita) by town in two groups (if tract bounds river or not)/ The mean values are seen above: in the first group (no river bound) it’s higher, than in the second group.
3. Choose an adequate test for two numeric variables (of your choice). Save a simple graph and interpret your results
For this part I’d like to use variables ‘crim’ and ‘black’ to see the correlation between proportion of blacks in the town and it’s crime rate per capita.
plot(crim ~ black, data = mydata)
On the graph (yeah, it’s quite rough, but fast and easy) we can see the increase of crime rate with the increase of proportion of blacks in the city. Let’s check the correlation between these two variables.
cor.test(mydata$black, mydata$crim)
##
## Pearson's product-moment correlation
##
## data: mydata$black and mydata$crim
## t = -9.367, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4568967 -0.3082415
## sample estimates:
## cor
## -0.3850639
According to the test these two variables are correlated (p-value <.05). To be clear, we’re rejecting the null hypothesis, and may say that the correlation coefficient between proportion of blacks in the town and it’s crime rate per capita is significantly different from zero. As it is equal to -0.385 we may also say that the correlation is negative and not really strong.
4. divide the variable “indus” into three parts. explain, why do you choose this way. Create new variable “indusGROUP”. Take ANOVA test between “indusGROUP” and other correct variable. Interpret your results
hist (mydata$indus)
I’ve decided to divide this ‘indus’ variable into parts accorsing to the gaps, that it has (0-17 - small, 28-24 - medium, 25-28 - big). Moreover, these three parts will have good and visible difference in frequency of observations.
mydata$indusGROUP <- rep(NA, length(mydata$indus))
mydata$indusGROUP[mydata$indus >= 0 & mydata$indus <= 17] <- "small"
mydata$indusGROUP[mydata$indus >= 18 & mydata$indus <= 24] <- "medium"
mydata$indusGROUP[mydata$indus >= 25 & mydata$indus <= 28] <- "big"
mydata$indusGROUP <-as.factor(mydata$indusGROUP)
5. choose an adequate test for these variables: chas & indusGROUP. Interpret your results.
For this relation I’m using chi-square test.
t_5 <- with(mydata, table(chas, indusGROUP))
chi_5 <- chisq.test(t_5)
## Warning in chisq.test(t_5): Chi-squared approximation may be incorrect
chi_5
##
## Pearson's Chi-squared test
##
## data: t_5
## X-squared = 1.7406, df = 2, p-value = 0.4188
# the's no relation, so I don't run code for residuals
#chi_5$stdres
Let’s look at p-value - it’s big (>.05), so there’s no relation between if town tract bounds river and it’s proportion of non-retail business acres.
Q1. What is our dependent variable?
model1 <- lm(Boston$crim~Boston$black)
summary(model1) # getting results
##
## Call:
## lm(formula = Boston$crim ~ Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## Boston$black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
Here’s the regression model, where the outcome or dependent variable is criminal rate in the town (‘crim’ variable). The proportion of blacks in the town here is the only predictor
Q2. Describe the trend. Is relation positive/negative? Strong or weak? Explanatory power of the model?
The relation here is strong (three stars are telling us that), negative (because estimate coefficient is below 0), and as for the explanatory power of the model, Adjusted R-squared value is equal to 0.15 meaning that the model explains about 15% of variance of the predicted variable, which is not that bad.
Q3. Now create second simple regression model for a factor (non-numeric) predictor. Interpret the results
model2<-lm(mydata$crim~mydata$indusGROUP)
summary(model2)
##
## Call:
## lm(formula = mydata$crim ~ mydata$indusGROUP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.673 -0.207 -0.134 0.058 79.053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14980 2.09633 0.071 0.943
## mydata$indusGROUPmedium 9.77329 2.16622 4.512 8.01e-06 ***
## mydata$indusGROUPsmall 0.07183 2.13564 0.034 0.973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.262 on 503 degrees of freedom
## Multiple R-squared: 0.2901, Adjusted R-squared: 0.2872
## F-statistic: 102.8 on 2 and 503 DF, p-value: < 2.2e-16
For factor predictor I’ve decided to used newly created ‘indusGROUP’ variable. Let’s see if there’s any relatiion between proportion of non-retail business acres per town and it’s criminal rate. So, the result is that there is some relation, if we talk about medium group (where proportion is between 18 and 24), there’s strong positive realtion. And as adjusted R-squared is 0.29, I would say that the model explains about 29% of variance of the predicted variable. Good!
And yes, I understand, that it’s quite silly to use here divided ‘indus’ variable, when I have possibility to use it in full nu,meric way, but… I did it out of pure interest, so… sorry :) Results with river variables were insignificant. Proofs are below: big p-value and small adjusted R-squared.
model21 <- lm(Boston$crim~Boston$chas)
summary(model21)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## Boston$chas -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
Q4 What are the predictors now? Describe the results. Did something change?
model3 <- lm(Boston$crim~Boston$chas+Boston$black)
summary(model3)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.794 -2.380 -2.173 -0.995 86.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.579673 1.426456 11.623 <2e-16 ***
## Boston$chas -1.259561 1.394069 -0.904 0.367
## Boston$black -0.036109 0.003878 -9.310 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared: 0.1497, Adjusted R-squared: 0.1463
## F-statistic: 44.26 on 2 and 503 DF, p-value: < 2.2e-16
Now we have two predictors: proportion of blacks in the town and if town bounds the river. From the results we see that proportion of blacks still hae strong negative relation, but Charles River has no dignificant relation to the outcome (criminal rate). Adjusted R-squared is nearly the same as it was without river variable.
Q5: Add more predictors in your model. “Update” function: to add or remove variables from your model
model4<-update(model3, ~.+Boston$indus) # point means "the same as in previos model"
summary (model4)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.454 -2.078 -0.653 0.513 83.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.441410 1.736139 4.862 1.56e-06 ***
## Boston$chas -2.116562 1.328385 -1.593 0.112
## Boston$black -0.025425 0.003949 -6.439 2.81e-10 ***
## Boston$indus 0.393925 0.052588 7.491 3.11e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2306
## F-statistic: 51.45 on 3 and 502 DF, p-value: < 2.2e-16
model5 <- update(model4,~.-Boston$chas)
summary(model5)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.376 -2.182 -0.649 0.516 83.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.546889 1.737527 4.919 1.18e-06 ***
## Boston$black -0.025906 0.003943 -6.570 1.26e-10 ***
## Boston$indus 0.386709 0.052472 7.370 7.07e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2282
## F-statistic: 75.67 on 2 and 503 DF, p-value: < 2.2e-16
model6 <- update(model5,~.+Boston$tax)
summary(model6)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus + Boston$tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.664 -2.390 -0.262 0.997 79.207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.700570 1.884098 -0.903 0.367
## Boston$black -0.015122 0.003755 -4.027 6.53e-05 ***
## Boston$indus -0.051235 0.064675 -0.792 0.429
## Boston$tax 0.027626 0.002741 10.078 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.898 on 502 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3568
## F-statistic: 94.39 on 3 and 502 DF, p-value: < 2.2e-16
It’s inrersting to compare model 5 and model 6: addition of ‘tax’ variable (full-value property-tax rate) increases adjusted R-squared and reduces the significance of relation with ‘indus’ variable (proportion of non-retail business acres).
model7 <- update(model6,~.-Boston$indus)
summary(model7)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.900 -2.387 -0.160 0.960 79.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.741283 1.882699 -0.925 0.355
## Boston$black -0.014937 0.003747 -3.987 7.69e-05 ***
## Boston$tax 0.026167 0.002030 12.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.896 on 503 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3573
## F-statistic: 141.4 on 2 and 503 DF, p-value: < 2.2e-16
The next model I’ve built is without proportion of non-retail business acres but only with balcks propportion in the town and full-value property-tax rate per $10,000. Adjusted R-squared here is .357 which means that the model explains about 36% of variance of the predicted variable. Great, it’s our biggest result!
library(ggplot2)
regr <- ggplot(Boston, aes(crim,black))
regr+geom_point()
Q6: Create a plot with a different predictor
Let’s try to depict our last findings: relation with not only blacks proportion, but also with tax rate.
regr1 <- ggplot(Boston, aes(x = tax, y = crim)) +
geom_point()
regr1
Well.. that looks.. strange…
Let’s test another idea!
regr2 <- ggplot(Boston, aes(x = black, y = crim)) +
geom_point()+
facet_grid(~chas)
regr2
In thas case faceting shows us that in the town, which bound Charles River proportion of blacks high, while in thу towns, which do not bound the river, blacks proportion vary (but still pretty high).
Also, I’ve try to facet according to the old and trustful indusGROUP variable.
regr3 <- ggplot(mydata, aes(x = black, y = crim)) +
geom_point()+
facet_grid(~indusGROUP)
regr3
And here we see the hogh activity in the middle section, where the proportion of non-retail business acres is medium. But in fact there’s not much visile relation to the changes of crime rate in the towns.