important

SEND YOUR RESULTS AND INTERPRETATIONS TO vera.vtitkova@ya.ru Format is pdf/word. I DO NOT ACCEPT MARKDOWN FILES.

Part 1 Parametric tests

Let’s explain crime rates in Boston -> dataset “Boston” from package MASS

library(knitr)
library(rmarkdown)
library(MASS)    # contains the Boston dataset
#??Boston        # uncomment to open the dataset documentation
dim(Boston)      # 506 observations of 14 variables
names(Boston)

1. make a descriptive statistics table for the variables crim, indus, chas, black. Save it. What are the types of these variables?

library(dplyr) 
mydata <- Boston %>% 
  select(crim, indus, chas, black)      # keep only the four variables of interest
mydata$chas <- as.factor(mydata$chas)   # treat the river dummy as a factor
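
The task also asks to save the descriptive table itself; one minimal way to build and save it (a sketch, assuming a CSV with basic statistics is enough; the file name is hypothetical):

summary(mydata)                                   # quick overview of all four variables
desc <- sapply(mydata[c("crim", "indus", "black")],
               function(x) c(mean = mean(x), sd = sd(x),
                             min = min(x), max = max(x)))
round(desc, 2)                                    # descriptive table for the numeric variables
table(mydata$chas)                                # frequencies for the factor variable
write.csv(round(desc, 2), "descriptives.csv")     # hypothetical file name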

Types of variables:

  • ‘crim’ - per capita crime rate by town - numeric variable;
  • ‘indus’ - proportion of non-retail business acres per town - numeric variable;
  • ‘chas’ - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) - a binary variable, according to its description, so I’ve converted it to a factor;
  • ‘black’ - proportion of blacks by town - numeric variable.

2. choose an adequate test for these variables: chas & crim. Interpret your results

t.test(mydata$crim ~ mydata$chas, var.equal = FALSE)  # Welch test; note that t.test has no na.rm argument
## 
##  Welch Two Sample t-test
## 
## data:  mydata$crim by mydata$chas
## t = 3.2224, df = 120.42, p-value = 0.001636
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7298305 3.0557226
## sample estimates:
## mean in group 0 mean in group 1 
##        3.744447        1.851670

Here we see that the p-value is rather small (< .05), which means there is a difference between the mean per capita crime rates in the two groups (tract bounds the river or not). The mean values are shown above: in group 0 (no river bound) the crime rate is higher than in group 1.
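
As a quick sanity check, the same group means (plus group sizes and standard deviations) can be reproduced with dplyr, which is already loaded (a small sketch):

mydata %>%
  group_by(chas) %>%
  summarise(n = n(), mean_crim = mean(crim), sd_crim = sd(crim))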

3. Choose an adequate test for two numeric variables (of your choice). Save a simple graph and interpret your results

For this part I’d like to use the variables ‘crim’ and ‘black’ to look at the correlation between the proportion of blacks in a town and its per capita crime rate.

plot(crim ~ black, data = mydata) 

The graph (yeah, it’s quite rough, but fast and easy) is hard to read by eye: most observations are bunched up at high values of ‘black’, so the direction of the relationship is not obvious from the scatter alone.
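
To make the direction easier to see, here is the same scatterplot with a fitted line added for orientation (a base R sketch; the line is purely descriptive):

plot(crim ~ black, data = mydata)                     # same scatterplot as above
abline(lm(crim ~ black, data = mydata), col = "red")  # fitted OLS line, for orientation only

Either way, let’s check the correlation between these two variables formally.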

cor.test(mydata$black, mydata$crim)
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$black and mydata$crim
## t = -9.367, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4568967 -0.3082415
## sample estimates:
##        cor 
## -0.3850639

According to the test, these two variables are correlated (p-value < .05). To be precise, we reject the null hypothesis and can say that the correlation coefficient between the proportion of blacks in a town and its per capita crime rate is significantly different from zero. As it is equal to -0.385, we can also say that the correlation is negative and moderate rather than strong.
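
Since ‘crim’ is heavily right-skewed, a rank-based correlation is a reasonable robustness check next to the Pearson one the task asks for (a sketch):

# Spearman correlation as a robustness check; ties in the data will trigger
# a warning that the exact p-value cannot be computed
cor.test(mydata$black, mydata$crim, method = "spearman")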

4. divide the variable “indus” into three parts and explain why you chose this split. Create a new variable “indusGROUP”. Run an ANOVA between “indusGROUP” and another suitable variable. Interpret your results

hist(mydata$indus)   # the distribution of 'indus' shows clear gaps

I’ve decided to divide the ‘indus’ variable into parts according to the gaps in its distribution (0-17 - small, 18-24 - medium, 25-28 - big). These three parts also show a clear, visible difference in the frequency of observations.

# bin 'indus' into three groups, with cut points at the gaps seen in the histogram
mydata$indusGROUP <- rep(NA, length(mydata$indus))
mydata$indusGROUP[mydata$indus >= 0 & mydata$indus <= 17] <- "small"
mydata$indusGROUP[mydata$indus >= 18 & mydata$indus <= 24] <- "medium"
mydata$indusGROUP[mydata$indus >= 25 & mydata$indus <= 28] <- "big"
mydata$indusGROUP <- as.factor(mydata$indusGROUP)
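
The task also asks to run the ANOVA itself; a minimal sketch, assuming ‘crim’ is the numeric variable compared across the three groups (the aov_4 name is mine, results not shown here):

table(mydata$indusGROUP)                         # group sizes
aov_4 <- aov(crim ~ indusGROUP, data = mydata)   # one-way ANOVA of crime rate by group
summary(aov_4)                                   # F-test: do the group means of crim differ?
# TukeyHSD(aov_4)                                # optional pairwise comparisons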

5. choose an adequate test for these variables: chas & indusGROUP. Interpret your results.

Since both ‘chas’ and ‘indusGROUP’ are categorical, I’m using a chi-square test for this relation.

t_5 <- with(mydata, table(chas, indusGROUP))
chi_5 <- chisq.test(t_5)
## Warning in chisq.test(t_5): Chi-squared approximation may be incorrect
chi_5
## 
##  Pearson's Chi-squared test
## 
## data:  t_5
## X-squared = 1.7406, df = 2, p-value = 0.4188
# there's no relation, so I don't run the code for the residuals
#chi_5$stdres

Let’s look at the p-value - it’s large (> .05), so there is no significant association between whether a town tract bounds the river and its proportion of non-retail business acres.
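
The warning above means that some expected cell counts are small, so the chi-squared approximation may be shaky. If we wanted to double-check, Fisher’s exact test on the same table is a common fallback (a sketch):

fisher.test(t_5)   # exact test on the same 2x3 contingency table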

Part 2 OLS

1. Simple regressions with different predictors: create simple regression model

Q1. What is our dependent variable?

model1 <- lm(Boston$crim~Boston$black)
summary(model1) # getting results
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.553529   1.425903  11.609   <2e-16 ***
## Boston$black -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16

Here’s the regression model, where the outcome (dependent) variable is the per capita crime rate in the town (the ‘crim’ variable). The proportion of blacks in the town is the only predictor.

Q2. Describe the trend. Is relation positive/negative? Strong or weak? Explanatory power of the model?

The relation here is highly statistically significant (that is what the three stars tell us, rather than that it is strong) and negative (because the estimated coefficient is below 0). As for the explanatory power of the model, the adjusted R-squared is about 0.15, meaning the model explains roughly 15% of the variance of the outcome, which is not that bad.
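
To keep significance and strength apart, it helps to look at an explicit effect size. In a simple regression the standardized slope equals the Pearson correlation computed earlier; a quick sketch (the model1_std name is mine):

model1_std <- lm(scale(crim) ~ scale(black), data = Boston)  # both variables standardized
coef(model1_std)      # standardized slope, equal to the Pearson correlation (about -0.385)
confint(model1)       # 95% confidence interval for the raw slope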

Q3. Now create second simple regression model for a factor (non-numeric) predictor. Interpret the results

model2<-lm(mydata$crim~mydata$indusGROUP)
summary(model2) 
## 
## Call:
## lm(formula = mydata$crim ~ mydata$indusGROUP)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.673 -0.207 -0.134  0.058 79.053 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.14980    2.09633   0.071    0.943    
## mydata$indusGROUPmedium  9.77329    2.16622   4.512 8.01e-06 ***
## mydata$indusGROUPsmall   0.07183    2.13564   0.034    0.973    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.262 on 503 degrees of freedom
## Multiple R-squared:  0.2901, Adjusted R-squared:  0.2872 
## F-statistic: 102.8 on 2 and 503 DF,  p-value: < 2.2e-16

For the factor predictor I’ve decided to use the newly created ‘indusGROUP’ variable. Let’s see whether there is any relation between the proportion of non-retail business acres in a town and its crime rate. The result is that there is some relation: the coefficient for the medium group (proportion between 18 and 24) is positive and highly significant. And as the adjusted R-squared is 0.29, the model explains about 29% of the variance of the outcome. Good!

And yes, I understand that it’s a bit silly to use the binned ‘indus’ variable here when I could use it in its full numeric form, but I did it out of pure interest, so… sorry :) A numeric version is sketched after the output below. Results with the river variable were insignificant; the proof follows: a large p-value and a tiny adjusted R-squared.

model21 <- lm(Boston$crim~Boston$chas)
summary(model21) 
## 
## Call:
## lm(formula = Boston$crim ~ Boston$chas)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## Boston$chas  -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
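
For completeness, the numeric version of ‘indus’ mentioned above would be fitted like this (a sketch, results not shown and no claims made about them; the model2_num name is mine):

model2_num <- lm(crim ~ indus, data = Boston)   # same idea as model2, but 'indus' kept numeric
summary(model2_num)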

2. multiple regression: taking into account several predictors

Q4 What are the predictors now? Describe the results. Did something change?

model3 <- lm(Boston$crim~Boston$chas+Boston$black)
summary(model3) 
## 
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.794  -2.380  -2.173  -0.995  86.728 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.579673   1.426456  11.623   <2e-16 ***
## Boston$chas  -1.259561   1.394069  -0.904    0.367    
## Boston$black -0.036109   0.003878  -9.310   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared:  0.1497, Adjusted R-squared:  0.1463 
## F-statistic: 44.26 on 2 and 503 DF,  p-value: < 2.2e-16

Now we have two predictors: the proportion of blacks in the town and whether the town bounds the river. From the results we see that the proportion of blacks still has a significant negative relation to the outcome (crime rate), while the Charles River dummy has no significant relation to it. The adjusted R-squared is nearly the same as it was without the river variable.
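
A formal way to confirm that ‘chas’ adds essentially nothing is a nested-model F-test, since model1 is nested in model3 (a sketch):

anova(model1, model3)   # F-test: model1 (black only) vs model3 (chas + black)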

Q5: Add more predictors in your model. “Update” function: to add or remove variables from your model

model4 <- update(model3, ~ . + Boston$indus)   # the dot means "keep everything from the previous model"
summary(model4)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.454  -2.078  -0.653   0.513  83.496 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.441410   1.736139   4.862 1.56e-06 ***
## Boston$chas  -2.116562   1.328385  -1.593    0.112    
## Boston$black -0.025425   0.003949  -6.439 2.81e-10 ***
## Boston$indus  0.393925   0.052588   7.491 3.11e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2306 
## F-statistic: 51.45 on 3 and 502 DF,  p-value: < 2.2e-16
model5 <- update(model4,~.-Boston$chas)
summary(model5)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.376  -2.182  -0.649   0.516  83.712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.546889   1.737527   4.919 1.18e-06 ***
## Boston$black -0.025906   0.003943  -6.570 1.26e-10 ***
## Boston$indus  0.386709   0.052472   7.370 7.07e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2282 
## F-statistic: 75.67 on 2 and 503 DF,  p-value: < 2.2e-16
model6 <- update(model5,~.+Boston$tax)
summary(model6)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus + Boston$tax)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.664  -2.390  -0.262   0.997  79.207 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.700570   1.884098  -0.903    0.367    
## Boston$black -0.015122   0.003755  -4.027 6.53e-05 ***
## Boston$indus -0.051235   0.064675  -0.792    0.429    
## Boston$tax    0.027626   0.002741  10.078  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.898 on 502 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3568 
## F-statistic: 94.39 on 3 and 502 DF,  p-value: < 2.2e-16

It’s interesting to compare model 5 and model 6: adding the ‘tax’ variable (full-value property-tax rate) increases the adjusted R-squared and removes the significance of the ‘indus’ coefficient (proportion of non-retail business acres), most likely because the two predictors overlap and ‘tax’ absorbs the variance previously attributed to ‘indus’.
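
One way to check this intuition is to look at how strongly the two predictors are correlated (a quick sketch; I make no claim about the exact value here):

cor(Boston$indus, Boston$tax)   # how much the two predictors overlap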

model7 <- update(model6,~.-Boston$indus)
summary(model7)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$tax)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.900  -2.387  -0.160   0.960  79.219 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.741283   1.882699  -0.925    0.355    
## Boston$black -0.014937   0.003747  -3.987 7.69e-05 ***
## Boston$tax    0.026167   0.002030  12.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.896 on 503 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3573 
## F-statistic: 141.4 on 2 and 503 DF,  p-value: < 2.2e-16

The next model I’ve built drops the proportion of non-retail business acres and keeps only the proportion of blacks in the town and the full-value property-tax rate per $10,000. The adjusted R-squared here is .357, which means the model explains about 36% of the variance of the outcome. Great, it’s our best result so far!
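
Besides adjusted R-squared, an information criterion gives another quick way to compare the candidate models (a sketch; lower AIC is better):

AIC(model3, model4, model5, model6, model7)   # trade-off between fit and number of predictors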

3. Plots

library(ggplot2)
regr <- ggplot(Boston, aes(crim, black))   # note: this template puts 'crim' on the x-axis and 'black' on the y-axis
regr + geom_point()

Q6: Create a plot with a different predictor

Let’s try to depict our last findings: the relation of crime not only to the proportion of blacks but also to the tax rate.

regr1 <- ggplot(Boston, aes(x = tax, y = crim)) + 
  geom_point()
regr1

Well… that looks strange: ‘tax’ takes only a handful of distinct values, so the points pile up in vertical columns.
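
Jittering the points horizontally spreads out these overlapping columns and makes the pattern a bit easier to read (a sketch; the regr1_jitter name is mine):

regr1_jitter <- ggplot(Boston, aes(x = tax, y = crim)) +
  geom_jitter(width = 5, height = 0, alpha = 0.5)   # add a little horizontal noise only
regr1_jitter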

Let’s test another idea!

regr2 <- ggplot(Boston, aes(x = black, y = crim)) +
  geom_point()+
  facet_grid(~chas) 
regr2

In this case faceting shows that in the towns that bound the Charles River the values of ‘black’ are high, while in the towns that do not bound the river the values vary more (but are still mostly high).

Also, I’ve tried faceting by the old and trusty indusGROUP variable.

regr3 <- ggplot(mydata, aes(x = black, y = crim)) +
  geom_point()+
  facet_grid(~indusGROUP) 
regr3

And here we see high activity in the middle panel, where the proportion of non-retail business acres is medium. But in fact there’s not much visible relation to the variation of crime rates across the towns.
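
As an alternative to faceting, the same comparison can be drawn on a single panel with the groups distinguished by colour (a sketch; the regr4 name is mine):

regr4 <- ggplot(mydata, aes(x = black, y = crim, colour = indusGROUP)) +
  geom_point(alpha = 0.6)   # one panel, indus groups shown by colour
regr4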