longitude: Quantitative (numeric) / Continuous / Interval (-180 to 180)
latitude: Quantitative (numeric) / Continuous / Interval (-90 to 90)
housing_median_age: Quantitative (numeric) / Discrete / Ratio (0, no upper bound)
total_rooms: Quantitative (numeric) / Discrete / Ratio (0, no upper bound)
population: Quantitative (numeric) / Discrete / Ratio (0, no upper bound)
households: Quantitative (numeric) / Discrete / Ratio (0, no upper bound)
median_income: Quantitative (numeric) / Continuous / Ratio (0, no upper bound)
median_house_value: Quantitative (numeric) / Continuous / Ratio (0, no upper bound)
ocean_proximity: Qualitative (categorical) / Ordinal || My assumption is that proximity to the ocean improves quality of life and therefore makes housing pricier
data <- read.csv("/Users/mauritiusloosli/Downloads/BAQM - Assignment/california.csv", header=TRUE, sep = ";")
str(data)
## 'data.frame': 19648 obs. of 9 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: int 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : int 880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
## $ population : int 322 2401 496 558 565 413 1094 1157 1206 1551 ...
## $ households : int 126 1138 177 219 259 193 514 647 595 714 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 45.3 35.9 35.2 34.1 34.2 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
The numeric and integer columns are fine as they are; ocean_proximity is read as character and needs to be converted to a factor.
data$ocean_proximity <- factor(data$ocean_proximity, levels = c("INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"))
Checked manually on a map, using the coordinates, what the levels mean exactly: INLAND = inland, <1H OCEAN = close to the ocean, NEAR BAY = at the bay, NEAR OCEAN = at the ocean, ISLAND = on an island.
Could also use an ordered factor with data$ocean_proximity <- ordered(data$ocean_proximity, levels = c(...)) || My assumption is that proximity to the ocean/island ranks highest, followed by bay, <1H OCEAN and inland.
levels(data$ocean_proximity)
## [1] "INLAND" "<1H OCEAN" "NEAR BAY" "NEAR OCEAN" "ISLAND"
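The ordered-factor alternative mentioned above can be tried out on a small vector. This is a minimal sketch: the vector `prox` and the names `prox_levels` and `prox_ordered` are made up for illustration, and the ranking is the assumed one (INLAND lowest, ISLAND highest), not something the data set prescribes.

```r
# Illustrative values only: one example of each category
prox <- c("NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN", "NEAR OCEAN")

# Assumed ranking from lowest to highest (same order as the factor levels above)
prox_levels <- c("INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND")

# ordered = TRUE makes the level order meaningful for comparisons
prox_ordered <- factor(prox, levels = prox_levels, ordered = TRUE)

is.ordered(prox_ordered)            # check that the factor is ordered
prox_ordered[1] > prox_ordered[2]   # compares NEAR BAY with INLAND
max(prox_ordered)                   # highest-ranked category present
```

With an unordered factor, comparisons such as `>` and `max()` are not defined, so the ordered version is only worth using if the ranking assumption is actually needed later on.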
str(data)
## 'data.frame': 19648 obs. of 9 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: int 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : int 880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
## $ population : int 322 2401 496 558 565 413 1094 1157 1206 1551 ...
## $ households : int 126 1138 177 219 259 193 514 647 595 714 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 45.3 35.9 35.2 34.1 34.2 ...
## $ ocean_proximity : Factor w/ 5 levels "INLAND","<1H OCEAN",..: 3 3 3 3 3 3 3 3 3 3 ...
table(data$ocean_proximity)
##
## INLAND <1H OCEAN NEAR BAY NEAR OCEAN ISLAND
## 6523 8595 2088 2437 5
nrow(data)
## [1] 19648
colSums(is.na(data))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## population households median_income median_house_value
## 0 0 0 0
## ocean_proximity
## 0
prop.table(table(data$ocean_proximity))
##
## INLAND <1H OCEAN NEAR BAY NEAR OCEAN ISLAND
## 0.3319930782 0.4374491042 0.1062703583 0.1240329805 0.0002544788
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Bar chart of the share of districts per ocean_proximity category
ggplot(data = data, mapping = aes(x = ocean_proximity)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(y = "Proportion", x = NULL) +
  theme_bw()
longitude
summary(data$longitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -124.3 -121.8 -118.5 -119.6 -118.0 -114.3
sd(data$longitude)
## [1] 2.00576
hist(data$longitude)
latitude
summary(data$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.54 33.93 34.27 35.65 37.73 41.95
sd(data$latitude)
## [1] 2.150066
hist(data$latitude)
housing_median_age
summary(data$housing_median_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 28.00 28.37 37.00 52.00
sd(data$housing_median_age)
## [1] 12.50405
hist(data$housing_median_age)
total_rooms
summary(data$total_rooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 1438 2111 2620 3121 39320
sd(data$total_rooms)
## [1] 2182.372
hist(data$total_rooms)
population
summary(data$population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 796 1179 1441 1746 35682
sd(data$population)
## [1] 1144.075
hist(data$population)
households
summary(data$households)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 282.0 411.0 501.2 606.0 6082.0
sd(data$households)
## [1] 383.3914
hist(data$households)
median_income
summary(data$median_income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4999 2.5263 3.4491 3.6764 4.5825 15.0001
sd(data$median_income)
## [1] 1.570602
hist(data$median_income)
median_house_value
summary(data$median_house_value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.50 11.65 17.36 19.21 24.79 49.91
sd(data$median_house_value)
## [1] 9.711085
hist(data$median_house_value)
A low standard deviation means the data are clustered around the mean; a high standard deviation means they are more spread out. To judge whether a standard deviation is high or low, compare it with the mean using the coefficient of variation (CV = standard deviation / mean). A CV well below 1 indicates a fairly low standard deviation; a CV greater than 1 indicates a fairly high one. None of the variables above has a CV greater than 0.83 (the largest is total_rooms).
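The CV comparison can be reproduced from the means and standard deviations printed above. This is a sketch, not part of the original analysis: the helper `coef_var` and the data frame `reported` are names introduced here, and longitude and latitude are left out on the assumption that a CV is only meaningful for ratio-scaled variables.

```r
# Hypothetical helper: coefficient of variation of a numeric vector
coef_var <- function(x) sd(x) / mean(x)

# Means and standard deviations copied from the output above
reported <- data.frame(
  variable = c("housing_median_age", "total_rooms", "population",
               "households", "median_income", "median_house_value"),
  mean = c(28.37, 2620, 1441, 501.2, 3.6764, 19.21),
  sd   = c(12.50405, 2182.372, 1144.075, 383.3914, 1.570602, 9.711085)
)

# CV = sd / mean, rounded to two decimals and sorted from largest to smallest
reported$cv <- round(reported$sd / reported$mean, 2)
reported[order(-reported$cv), ]
```

On the full data set the same numbers could be obtained directly with `sapply(data[, 3:8], coef_var)`, assuming the column order shown by `str(data)`.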
Some of the variables are right-skewed. In right-skewed distributions the mean is typically higher than the median, which the output above confirms (for example total_rooms: mean 2620 vs. median 2111). For these variables it is better to report the first and third quartiles, since they give some sense of the asymmetry of the distribution.
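The skewness argument can be checked with the summary values printed above. The sketch below compares mean and median and also computes the quartile (Bowley) skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1), which is positive for right-skewed data. The quartile skewness is an addition for illustration and was not used in the original analysis; the data frame `reported` and its column names are made up here.

```r
# Quartiles, medians and means copied from the summary() output above
reported <- data.frame(
  variable = c("total_rooms", "population", "households",
               "median_income", "median_house_value"),
  q1  = c(1438, 796, 282.0, 2.5263, 11.65),
  med = c(2111, 1179, 411.0, 3.4491, 17.36),
  avg = c(2620, 1441, 501.2, 3.6764, 19.21),
  q3  = c(3121, 1746, 606.0, 4.5825, 24.79)
)

# Right skew shows up as mean > median ...
reported$mean_above_median <- reported$avg > reported$med

# ... and as a positive quartile (Bowley) skewness
reported$quartile_skew <- with(reported, (q3 + q1 - 2 * med) / (q3 - q1))

reported
```

Because it uses only the quartiles, this measure is not affected by the very large maxima of total_rooms, population and households.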
aggregate(data$median_house_value, list(data$ocean_proximity), FUN=mean)
## Group.1 x
## 1 INLAND 12.31949
## 2 <1H OCEAN 22.37242
## 3 NEAR BAY 23.59176
## 4 NEAR OCEAN 22.67112
## 5 ISLAND 38.04400
My earlier assumption (proximity to water = pricier) is largely confirmed: INLAND has by far the lowest mean and ISLAND the highest, although the ISLAND mean rests on only 5 districts. NEAR BAY is slightly more expensive than NEAR OCEAN, though. This seems to be specific to San Francisco, as the bay is huge and pretty (I assume; I have been there but can’t recall). Checked the coordinates manually on a map.
# Keep the eight numeric columns (everything except ocean_proximity)
data_numeric <- data[, 1:8]
# Pearson correlation matrix, rounded to two decimals for display
correlation_matrix <- cor(data_numeric)
round(correlation_matrix, 2)
## longitude latitude housing_median_age total_rooms population
## longitude 1.00 -0.92 -0.10 0.04 0.10
## latitude -0.92 1.00 0.01 -0.03 -0.11
## housing_median_age -0.10 0.01 1.00 -0.37 -0.29
## total_rooms 0.04 -0.03 -0.37 1.00 0.86
## population 0.10 -0.11 -0.29 0.86 1.00
## households 0.06 -0.07 -0.31 0.92 0.91
## median_income -0.01 -0.08 -0.20 0.22 0.04
## median_house_value -0.05 -0.15 0.07 0.14 0.01
## households median_income median_house_value
## longitude 0.06 -0.01 -0.05
## latitude -0.07 -0.08 -0.15
## housing_median_age -0.31 -0.20 0.07
## total_rooms 0.92 0.22 0.14
## population 0.91 0.04 0.01
## households 1.00 0.05 0.10
## median_income 0.05 1.00 0.65
## median_house_value 0.10 0.65 1.00
Comment on the resulting table with one sentence. [✓]
In the frequency table for ocean_proximity, the ISLAND districts (5 of 19648, about 0.03 %) are too few to have an impact; apart from that the distribution is reasonably balanced, especially since NEAR BAY and NEAR OCEAN could be combined because they offer similar value to inhabitants.
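Combining NEAR BAY and NEAR OCEAN could look like the following sketch. It rebuilds the factor from the counts in the frequency table above so that it runs without the CSV file; the merged level name "NEAR WATER" and the object names `counts` and `proximity` are made up for illustration.

```r
# Category counts copied from table(data$ocean_proximity) above
counts <- c("INLAND" = 6523, "<1H OCEAN" = 8595, "NEAR BAY" = 2088,
            "NEAR OCEAN" = 2437, "ISLAND" = 5)

# Rebuild a factor with the same frequencies as the original column
proximity <- factor(rep(names(counts), times = counts), levels = names(counts))

# Assigning a named list to levels() merges the listed old levels into new ones
levels(proximity) <- list(
  "INLAND"     = "INLAND",
  "<1H OCEAN"  = "<1H OCEAN",
  "NEAR WATER" = c("NEAR BAY", "NEAR OCEAN"),  # hypothetical merged category
  "ISLAND"     = "ISLAND"
)

table(proximity)
round(prop.table(table(proximity)), 4)
```

On the real data the same `levels(...) <- list(...)` assignment could be applied to `data$ocean_proximity` directly, ideally on a copy so the original coding stays available.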