The data (the Freedman dataset on crowding and crime in US metropolitan areas) were collected in 1968.
#data(package = .packages(all.available = TRUE)) # list datasets in all installed packages
library(carData) # the Freedman dataset ships with the carData package
mydata <- Freedman
mydata$City <- rownames(mydata)
rownames(mydata) <- NULL
head(mydata)
## population nonwhite density crime City
## 1 675 7.3 746 2602 Akron
## 2 713 2.6 322 1388 Albany
## 3 NA 3.3 NA 5018 Albuquerque
## 4 534 0.8 491 1182 Allentown
## 5 1261 1.4 1612 3341 Anaheim
## 6 1330 22.8 770 2805 Atlanta
Renaming the columns to make them more readable.
colnames(mydata) <- c("Population", "Nonwhite", "Density", "Crime", "City")
Next, I build a new data frame without NA rows and remove Honolulu, which has an extreme Nonwhite value, unsurprising given that its population is mostly nonwhite. I also separated the cities into big and small depending on the number of residents (population).
library(tidyr) # drop_na()
library(dplyr) # %>% and mutate()
mydata <- drop_na(mydata)
mydata <- mydata %>% mutate("City_size" = ifelse(Population < 675, "Small", "Big"))
head(mydata[order(-mydata$Nonwhite), ])
## Population Nonwhite Density Crime City City_size
## 38 628 64.3 1053 3283 Honolulu Small
## 51 770 38.0 565 2691 Memphis Big
## 84 299 34.1 171 1864 Shreveport Small
## 10 739 32.1 272 2285 Birmingham Big
## 58 1064 30.9 539 3467 New.Orleans Big
## 55 382 30.8 135 2433 Mobile Small
# remove Honolulu (by name, which is safer than a positional index)
mydata <- mydata[mydata$City != "Honolulu", ]
A description of some summary statistics for the dataset:
Population (in 1000s):
– minimum is 270.0 (the smallest city in the sample has a population of 270,000)
– maximum is 11551.0 (the biggest city has a population of about 11,551,000)
– median is 675.0 (50% of cities have a population of less than 675,000)
– mean is 1141.1 (the arithmetic average city population is about 1,141,100)
– quantiles:
— 1st: 398.5 (25% of cities have a population of less than 398,500)
— 3rd: 1185.5 (75% of cities have a population of less than 1,185,500)
The percentage of the population that is nonwhite:
– minimum is 0.30
– maximum is 38.00
– median is 7.30
– mean is 10.12
– quantiles:
— 1st: 3.40
— 3rd: 14.50
Crime rate (per 100,000):
– minimum is 458.0
– maximum is 5441.0
– median is 2726.0
– mean is 2728.0
– quantiles:
— 1st: 2084.0
— 3rd: 3326.0
Population has the highest coefficient of variation, meaning its dispersion around the mean, relative to that mean, is the largest among the three observed variables.
summary(mydata[,-c(5,6)])
## Population Nonwhite Density Crime
## Min. : 270.0 Min. : 0.30 Min. : 37.0 Min. : 458
## 1st Qu.: 398.5 1st Qu.: 3.40 1st Qu.: 265.0 1st Qu.:2084
## Median : 675.0 Median : 7.30 Median : 405.0 Median :2726
## Mean : 1141.1 Mean :10.12 Mean : 762.8 Mean :2728
## 3rd Qu.: 1185.5 3rd Qu.:14.50 3rd Qu.: 764.0 3rd Qu.:3326
## Max. :11551.0 Max. :38.00 Max. :13087.0 Max. :5441
sapply(mydata[,-c(3,5,6)], function(x) sd(x) / mean(x) * 100)
## Population Nonwhite Crime
## 137.34100 84.45282 36.18566
On average, 10.12 percent of the population in these cities was nonwhite, with a median of 7.3 percent and a standard deviation of 8.55. According to online sources, the nonwhite share of the overall US population at the time was around 12 percent, so the sample of cities looks broadly representative in this respect.
round(mean(mydata[ , 2]), 2)
## [1] 10.12
round(median(mydata[ , 2]), 2)
## [1] 7.3
round(sd(mydata[ , 2]), 2)
## [1] 8.55
The crime rate per 100,000 residents increases, on average, with the size of the city.
mydata %>% filter(Population <= 500) %>%
select("Crime") %>% pull() %>% mean()
## [1] 2131.5
mydata %>% filter(Population > 500 & Population <= 1000) %>%
select("Crime") %>% pull() %>% mean()
## [1] 2743.378
mydata %>% filter(Population > 1000 & Population <= 2000) %>%
select("Crime") %>% pull() %>% mean()
## [1] 3217.632
mydata %>% filter(Population > 2000 & Population <= 10000) %>%
select("Crime") %>% pull() %>% mean()
## [1] 3445.4
mydata %>% filter(Population > 10000) %>%
select("Crime") %>% pull() %>% mean()
## [1] 4732
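The five binned means above can be computed in one pass with cut() and group_by(); a minimal sketch using the same bin edges (dplyr is already loaded):
# One-pass alternative to the repeated filter() calls above: bin
# Population and average Crime within each bin.
mydata %>%
  mutate(PopBin = cut(Population,
                      breaks = c(0, 500, 1000, 2000, 10000, Inf),
                      labels = c("<=500", "501-1000", "1001-2000",
                                 "2001-10000", ">10000"))) %>%
  group_by(PopBin) %>%
  summarise(MeanCrime = mean(Crime), N = n())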
Some additional sample statistics by city size. I categorized the cities as big or small using the median population, which gives two samples of almost equal size.
library(Rmisc) # provides summarySE()
Statistics <- summarySE(mydata,
measurevar="Crime",
groupvars = c("City_size"),
conf.interval=0.95)
Statistics
## City_size N Crime sd se ci
## 1 Big 50 3140.160 901.3052 127.4638 256.1481
## 2 Small 49 2306.653 895.2608 127.8944 257.1489
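The confidence-interval half-width reported by summarySE() can be verified by hand; a quick check using the values from the table:
# ci = t(0.975, N-1) * se, where se = sd / sqrt(N); here for the Big group
n_big  <- 50
sd_big <- 901.3052
se_big <- sd_big / sqrt(n_big)        # ~127.46, matching the se column
qt(0.975, df = n_big - 1) * se_big    # ~256.15, matching the ci column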
The distribution of crime rates across US cities appears approximately normal.
hist(mydata$Crime,
     main = "Distribution of crime rates across US cities",
     xlab = "Crime rate per 100,000 residents",
     ylab = "Frequency",
     breaks = seq(from = 1, to = 6000, by = 100))
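The visual impression of normality can be checked more formally; a minimal sketch using base R:
# Q-Q plot and Shapiro-Wilk test of the crime rate; points close to the
# line and a large p-value would support approximate normality.
qqnorm(mydata$Crime, main = "Normal Q-Q plot of crime rates")
qqline(mydata$Crime)
shapiro.test(mydata$Crime)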
As previously mentioned, the crime rate is higher in the bigger cities: the mean is 3140.16 for cities with a population of 675,000 or more, versus 2306.65 for the smaller cities.
ggplot(mydata, aes(x = City_size, y = Crime)) +
  geom_boxplot() +
  theme_bw()
aggregate(mydata$Crime, list(mydata$City_size), FUN=mean)
## Group.1 x
## 1 Big 3140.160
## 2 Small 2306.653
In the graph below, the highest nonwhite population shares occur mostly in the less populated cities, although the overall correlation between Population and Nonwhite is weak (r = 0.10, see the correlation matrix below).
ggplot(mydata, aes(x = Population, y = Nonwhite)) +
  geom_point() +
  theme_bw()
As expected, given the relatively small nonwhite share of the US population at the time, the distribution of the nonwhite percentage is skewed to the right.
hist(mydata$Nonwhite,
     main = "Distribution of the nonwhite population share across US cities",
     xlab = "Percent of the population that is nonwhite",
     ylab = "Frequency",
     breaks = seq(from = 0, to = 100, by = 2))
ggplot(mydata, aes(x = Nonwhite)) +
  geom_histogram(binwidth = 2, colour = "gray") +
  facet_wrap(~City_size, ncol = 1) +
  ylab("Frequency") +
  theme_bw()
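The right skew can also be quantified; a minimal sketch computing the sample skewness (the third standardized moment) in base R:
# Positive skewness confirms the right skew visible in the histograms.
x <- mydata$Nonwhite
mean((x - mean(x))^3) / sd(x)^3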
Some additional graphs.
library(car) # scatterplotMatrix()
scatterplotMatrix(mydata[ , c(-5, -6)],
                  smooth = FALSE)
library(Hmisc) # rcorr()
rcorr(as.matrix(mydata[ , c(-5, -6)]))
## Population Nonwhite Density Crime
## Population 1.00 0.10 0.37 0.40
## Nonwhite 0.10 1.00 -0.01 0.31
## Density 0.37 -0.01 1.00 0.11
## Crime 0.40 0.31 0.11 1.00
##
## n= 99
##
##
## P
## Population Nonwhite Density Crime
## Population 0.3029 0.0002 0.0000
## Nonwhite 0.3029 0.9195 0.0016
## Density 0.0002 0.9195 0.2728
## Crime 0.0000 0.0016 0.2728
mydata2 <- read.table("~/IMB/Bootcamp/HomeExam/Task 2/Body mass.csv",
header=TRUE,
sep=";",
dec=",")
head(mydata2)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
Description:
mean(mydata2$Mass)
## [1] 62.876
sd(mydata2$Mass)
## [1] 6.011403
library(ggplot2)
ggplot(NULL, aes(c(-4, 4))) +
  geom_line(stat = "function", fun = dt, args = list(df = 49)) +
  ylab("Density") +
  xlab("Sample estimates") +
  labs(title = "Distribution of sample estimates")
qt(p = 0.025, df = 49, lower.tail = FALSE)
## [1] 2.009575
qt(p = 0.025, df = 49, lower.tail = TRUE)
## [1] -2.009575
The test statistic t = 3.9711 exceeds the critical value 2.009575, so we reject H0. Equivalently, the p-value (0.000234) is below α = 0.05, and we conclude that the true mean is not equal to 59.5.
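As a cross-check, the t statistic can be computed by hand from the summary statistics above; a minimal sketch:
# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)), with mu0 = 59.5
xbar <- mean(mydata2$Mass)
s    <- sd(mydata2$Mass)
n    <- length(mydata2$Mass)     # n = 50
(xbar - 59.5) / (s / sqrt(n))    # ~3.9711, matching t.test() below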
t.test(mydata2$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
t <- 3.9711
r <- sqrt((t^2) / (t^2 + 49)) # effect size r from t, with df = n - 1 = 49
r
## [1] 0.4934295
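An alternative effect-size measure is Cohen's d, which for a one-sample t-test equals t / sqrt(n); a quick check:
# Cohen's d: d = t / sqrt(n); ~0.56, conventionally a medium effect
3.9711 / sqrt(50)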
#Task 3
library(readxl)
mydata3 <- read_xlsx("~/IMB/Bootcamp/HomeExam/Task 3/Apartments.xlsx")
Description:
mydata3$ParkingFactor <- factor(mydata3$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata3$BalconyFactor <- factor(mydata3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata4 <- mydata3 # keep an untouched copy before any outlier removal
We reject H0 (p = 0.0047 < α = 0.05) and conclude that the true mean price is not equal to 1900.
t.test(mydata3$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
fit1 <- lm(formula = Price ~ Age, data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
#fit2 <- lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, data = mydata3)
#summary(fit2)
cor(mydata3$Price, mydata3$Age)
## [1] -0.230255
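With a single predictor, R-squared is just the squared correlation; a quick check against summary(fit1):
# Squared correlation reproduces the Multiple R-squared of fit1
cor(mydata3$Price, mydata3$Age)^2   # ~0.05302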
The scatterplot matrix shows no strong linear relationship among the independent variables, so we can assume multicollinearity is not present. To know for certain, we calculate the VIF.
scatterplotMatrix(mydata3[ , c(3,1,2)],
smooth = FALSE)
fit2 <- lm(formula = Price ~ Age + Distance, data = mydata3)
fit2
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Coefficients:
## (Intercept) Age Distance
## 2460.101 -7.934 -20.667
The VIF is smaller than 5 for all variables, indicating no problematic multicollinearity.
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
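With only two predictors, the VIF can be reproduced directly from their correlation; a quick check:
# VIF_j = 1 / (1 - r^2) for both predictors, where r = cor(Age, Distance)
1 / (1 - cor(mydata3$Age, mydata3$Distance)^2)   # ~1.0018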
No single case is severely problematic as an outlier or a high-influence point; however, I removed the one that sticks out the most (the largest Cook's distance).
mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.95303, p-value = 0.003645
# p value is much less than 5%
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
## # … with abbreviated variable names ¹BalconyFactor, ²StdResid
head(mydata3[order(-mydata3$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
## # … with abbreviated variable names ¹BalconyFactor, ²StdResid
mydata3 <- mydata3[c(-53), ]
# refit the regression after removing the most influential case (row 53)
fit2 <- lm(formula = Price ~ Age + Distance, data = mydata3)
mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.93901, p-value = 0.0006104
#p-value is still much less than 5%
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 12 14 1650 0 1 No Yes -1.58 0.015
## 2 12 14 1650 0 0 No No -1.58 0.015
## 3 13 8 1800 0 0 No No -1.47 0.014
## # … with abbreviated variable names ¹BalconyFactor, ²StdResid
head(mydata3[order(-mydata3$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.64 0.336
## 2 43 37 1740 0 0 No No 1.59 0.129
## 3 2 11 2790 1 0 Yes No 2.01 0.069
## 4 37 3 2540 1 1 Yes Yes 1.62 0.064
## 5 40 2 2400 0 1 No Yes 1.13 0.04
## 6 45 21 1910 0 1 No Yes 0.988 0.038
## # … with abbreviated variable names ¹BalconyFactor, ²StdResid
We check the assumption with a scatter plot of the standardized residuals against the standardized fitted values. The variance does not grow with the fitted values, so we expect that the assumption of homoskedasticity is not violated; a formal test follows the plot.
mydata3$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x= mydata3$StdfittedValues,
ylab = "Standarized residuals",
xlab = "Standarized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
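A formal test of constant error variance can back up the visual check; a minimal sketch, assuming the lmtest package is installed (ncvTest() comes from car, which is already loaded):
# Breusch-Pagan test and score test; H0 = homoskedasticity, so large
# p-values are consistent with constant variance.
library(lmtest)
bptest(fit2)
ncvTest(fit2)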
Assumptions (Shapiro-Wilk test of the standardized residuals):
H0: the variable is normally distributed. H1: the variable is not normally distributed.
The p-value = 0.0006104 is smaller than α = 0.05, so we reject H0 and accept H1: the standardized residuals are not normally distributed.
The assumption of normally distributed errors is therefore violated, meaning t-based inference may not be exact.
hist(mydata3$StdResid,
xlab = "Standarized Residuals",
ylab = "Frequency",
main = "Histogram of standarized residuals")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.93901, p-value = 0.0006104
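Since the errors are not normal, a case-resampling bootstrap is one way to double-check the coefficient intervals; a minimal sketch using Boot() from the already-loaded car package:
# Bootstrap confidence intervals for the coefficients of fit2; compare
# them with the t-based intervals from confint(fit2).
set.seed(1)                    # for reproducibility
boot_fit <- Boot(fit2, R = 999)
confint(boot_fit)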
fit2 <- lm(formula = Price ~ Age + Distance, data = mydata4) # refit on the full data (mydata4)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
fit3 <- lm(formula =Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata4)
fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata4)
##
## Coefficients:
## (Intercept) Age Distance ParkingFactorYes
## 2301.667 -6.799 -18.045 196.168
## BalconyFactorYes
## 1.935
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
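The partial F statistic in the table can be reproduced by hand from the two residual sums of squares; a quick check:
# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full), with q = 2 extra terms
((6720983 - 5991088) / 2) / (5991088 / 80)   # ~4.873, matching anova()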
Parking: given Age, Distance and Balcony, apartments with a parking place are on average 196 euros more expensive (p = 0.0025 < α = 0.05).
Balcony: p = 0.974 > α = 0.05, so we cannot claim that the presence of a balcony affects the price (we fail to reject the null hypothesis).
Hypotheses for the F-statistic: H0: ρ² = 0; H1: ρ² > 0.
The p-value < 0.001, so we reject H0: there is a linear relationship between the dependent variable (Price) and at least one explanatory variable, i.e. the model has explanatory power (ρ² > 0).
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorYes 196.168 62.868 3.120 0.00251 **
## BalconyFactorYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
mydata4 <- mydata4 %>% mutate(Fittedvalues = fitted.values(fit3))
residuali <- residuals(fit3) # residuals of the full model (fit3)
residuali
## 1 2 3 4 5 6
## -110.741499 442.588928 -88.806740 260.078971 -412.575790 -126.691071
## 7 8 9 10 11 12
## 2.487853 -299.119349 48.055978 278.225949 373.977207 -226.278658
## 13 14 15 16 17 18
## -319.381563 -19.069791 205.640748 -174.245621 -258.864697 86.320138
## 19 20 21 22 23 24
## -176.994443 -268.919765 -182.433383 345.922299 -8.985715 -195.209942
## 25 26 27 28 29 30
## 323.798405 117.886614 378.991292 -412.595734 -203.359865 17.255452
## 31 32 33 34 35 36
## 291.296500 63.977207 504.260809 -48.276278 -259.693334 -414.436979
## 37 38 39 40 41 42
## -353.275183 526.262587 404.441776 -97.732888 -388.912810 -98.552925
## 43 44 45 46 47 48
## 21.164859 154.810666 -59.410945 298.480775 -133.502878 -200.657597
## 49 50 51 52 53 54
## 318.267281 -268.289267 144.478518 -155.302792 -459.919210 -230.932193
## 55 56 57 58 59 60
## 398.358370 -1.271528 594.366706 412.646046 -60.563303 194.705434
## 61 62 63 64 65 66
## 440.654168 -88.806740 262.013730 -412.575790 -124.756312 4.422612
## 67 68 69 70 71 72
## -299.119349 48.055978 278.225949 372.042448 -226.278658 -317.446803
## 73 74 75 76 77 78
## -17.135032 205.640748 -174.245621 -256.929937 -57.476185 296.546015
## 79 80 81 82 83 84
## -133.502878 -198.722838 318.267281 -268.289267 144.478518 -155.302792
## 85
## -133.502878
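Rather than inspecting all 85 raw residuals, a diagnostic plot summarizes them; a minimal sketch using the base plot method for lm objects:
# Residuals vs. fitted values for the full model; a flat, patternless
# cloud around zero supports linearity and constant variance.
plot(fit3, which = 1)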