y <- data.frame(read.table("https://raw.githubusercontent.com/hovig/MSDS_CUNY/master/DATA605/poverty.csv"))
df <- y[,-c(1,3,4,6)]
df[1][is.na(df[1])] <- 0
head(df)
##       V2        V5
## 1 PovPct ViolCrime
## 2   20.1      11.2
## 3    7.1       9.1
## 4   16.1      10.4
## 5   14.9      10.4
## 6   16.7      11.2
plot(df, xlab = "Poverty Percent", ylab = "Violent Crime", las = 1)
lines(lowess(df[[1]], df[[2]], f = 2/3, iter = 3), col = "red")
title(main = "Poverty Index")
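
The `head(df)` output above shows the column names `PovPct` and `ViolCrime` landing in row 1 as data, which suggests the file's header line was not treated as a header. A minimal sketch of the difference, assuming a whitespace-separated layout and using only the sample rows printed above (the names `sample_text`, `raw`, and `clean` are illustrative, not from the original code):

```r
# Sample rows copied from the head(df) output above; the full file is not reproduced here.
sample_text <- "PovPct ViolCrime
20.1 11.2
7.1 9.1
16.1 10.4
14.9 10.4
16.7 11.2"

# Without header = TRUE, the names become the first data row
# and both columns are read as text (V1, V2).
raw <- read.table(text = sample_text, stringsAsFactors = FALSE)
str(raw)

# With header = TRUE, the first line supplies the column names
# and both columns are read as numeric.
clean <- read.table(text = sample_text, header = TRUE)
str(clean)
```

If the full file follows the same layout, passing `header = TRUE` in the original `read.table()` call would likely remove the need for the `as.numeric(as.character(...))` conversions, and the kept columns would be named `PovPct` and `ViolCrime` instead of `V2` and `V5`.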

Poverty_Percent <- log(as.numeric(as.character(df[[1]])))
Violation_Crime <- log(as.numeric(as.character(df[[2]])))
lregression <- lm(Violation_Crime ~ Poverty_Percent, data = df)
summary(lregression)
##
## Call:
## lm(formula = Violation_Crime ~ Poverty_Percent, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.83801 -0.33373  0.05525  0.37489  1.77400
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -1.0048     0.7292  -1.378  0.17446
## Poverty_Percent   1.1016     0.2866   3.843  0.00035 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6434 on 49 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2316, Adjusted R-squared: 0.2159
## F-statistic: 14.77 on 1 and 49 DF, p-value: 0.00035
par(mfrow = c(2, 2))
plot(lregression)

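- R-squared is low (about 23.2%), but the p-value for the slope is small (0.00035), so the positive relationship between log(Violation_Crime) and log(Poverty_Percent) is statistically significant. Poverty percent alone explains only a modest share of the variation in the response. The dataset is also small: 52 rows were read and one was dropped for missingness (row 1, which holds the column names rather than data), leaving 51 observations in the fit.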
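
Because both variables were log-transformed, the slope can be read as an elasticity: a 1% increase in poverty percent is associated with roughly a 1.1% increase in the response. The sketch below back-transforms the fitted line using the coefficients printed in the summary above. The helper `predict_crime` and the poverty values 10 and 20 are illustrative choices, not part of the original analysis, and exponentiating a log-scale fit gives a rough central value rather than an exact mean.

```r
# Coefficients copied from the summary(lregression) output above.
intercept <- -1.0048
slope <- 1.1016

# Hypothetical helper: fitted response on the original scale
# for a given poverty percentage, from the log-log model.
predict_crime <- function(poverty_pct) {
  exp(intercept + slope * log(poverty_pct))
}

# Illustrative poverty percentages (chosen here, not taken from the data).
predict_crime(c(10, 20))

# In a log-log model, doubling the predictor multiplies
# the fitted response by 2^slope.
ratio <- predict_crime(20) / predict_crime(10)
all.equal(ratio, 2^slope)
```

With a slope just above 1, doubling poverty percent slightly more than doubles the fitted value, which is consistent with the upward trend in the scatterplot.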