Task 1


Determine the measurement level for each of the variables in the dataset. [✓]

longitude: Quantitative (numeric) / Continuous / Interval (-180 to 180)

latitude: Quantitative (numeric) / Continuous / Interval (-90 to 90)

housing_median_age: Quantitative (numeric) / Discrete / Ratio (0 and above)

total_rooms: Quantitative (numeric) / Discrete / Ratio (0 and above)

population: Quantitative (numeric) / Discrete / Ratio (0 and above)

households: Quantitative (numeric) / Discrete / Ratio (0 and above)

median_income: Quantitative (numeric) / Continuous / Ratio (0 and above)

median_house_value: Quantitative (numeric) / Continuous / Ratio (0 and above)

ocean_proximity: Qualitative (categorical) / Ordinal (nominal without the following assumption) || My assumption is that proximity to the ocean improves quality of life and therefore raises prices

Task 2


Import the data with R / RStudio. [✓]
data <- read.csv("/Users/mauritiusloosli/Downloads/BAQM - Assignment/california.csv", header=TRUE, sep = ";")

Check the data types of the variables and create factors where necessary. [✓]

str(data) 
## 'data.frame':    19648 obs. of  9 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: int  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : int  880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
##  $ population        : int  322 2401 496 558 565 413 1094 1157 1206 1551 ...
##  $ households        : int  126 1138 177 219 259 193 514 647 595 714 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  45.3 35.9 35.2 34.1 34.2 ...
##  $ ocean_proximity   : chr  "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...

The numeric and integer types are fine; ocean_proximity needs to be converted to a factor.

data$ocean_proximity <- factor(data$ocean_proximity, levels = c("INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"))

Checked manually on a map, using the coordinates, what the levels mean exactly: INLAND = inland, <1H OCEAN = close to the ocean, NEAR BAY = at the bay, NEAR OCEAN = at the ocean, ISLAND = on an island.

An ordered factor would also work: data$ocean_proximity <- ordered(data$ocean_proximity, levels = c(...)) || My assumption is that island and near ocean rank highest, followed by near bay, <1H ocean and inland.
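A minimal sketch of the ordered-factor variant mentioned above. The vector prox is a made-up stand-in for data$ocean_proximity, and the level order encodes my ranking assumption, not a property of the dataset.

```r
# Toy vector standing in for data$ocean_proximity (hypothetical values)
prox <- c("NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN", "NEAR OCEAN", "INLAND")

# Levels listed from lowest to highest rank (assumed ranking)
lvls <- c("INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND")
prox_ord <- factor(prox, levels = lvls, ordered = TRUE)

is.ordered(prox_ord)        # ordered factors support comparisons
prox_ord[1] > prox_ord[2]   # is NEAR BAY ranked above INLAND?
max(prox_ord)               # highest-ranked category present
```

With an unordered factor, the comparison and max() calls would not be meaningful, which is the practical difference between the two encodings.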

levels(data$ocean_proximity)
## [1] "INLAND"     "<1H OCEAN"  "NEAR BAY"   "NEAR OCEAN" "ISLAND"
str(data)
## 'data.frame':    19648 obs. of  9 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: int  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : int  880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
##  $ population        : int  322 2401 496 558 565 413 1094 1157 1206 1551 ...
##  $ households        : int  126 1138 177 219 259 193 514 647 595 714 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  45.3 35.9 35.2 34.1 34.2 ...
##  $ ocean_proximity   : Factor w/ 5 levels "INLAND","<1H OCEAN",..: 3 3 3 3 3 3 3 3 3 3 ...
table(data$ocean_proximity)
## 
##     INLAND  <1H OCEAN   NEAR BAY NEAR OCEAN     ISLAND 
##       6523       8595       2088       2437          5

How many observations does this dataset contain? [✓]

nrow(data)
## [1] 19648

Task 3


Check if there are any missing values in one of the variables. [✓]

colSums(is.na(data))
##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##         population         households      median_income median_house_value 
##                  0                  0                  0                  0 
##    ocean_proximity 
##                  0

Task 4


Create a relative frequency table for the variable ocean_proximity. [✓]

prop.table(table(data$ocean_proximity))
## 
##       INLAND    <1H OCEAN     NEAR BAY   NEAR OCEAN       ISLAND 
## 0.3319930782 0.4374491042 0.1062703583 0.1240329805 0.0002544788

Comment on the resulting table with one sentence. [✓]

The five ISLAND districts (about 0.03%) are too few to have any impact; otherwise the distribution is reasonable, with <1H OCEAN (44%) and INLAND (33%) dominating, and NEAR BAY (11%) and NEAR OCEAN (12%) could be combined since both offer a similar value to inhabitants.

Often, it is easier to interpret data with the help of plots. Therefore, create a bar plot that shows the relative frequencies. [✓]

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
ggplot(data = data, mapping = aes(x = ocean_proximity)) +
  # bar height = share of all districts, not the raw count
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(y = "Relative frequency", x = NULL) +
  theme_bw()

Task 5


Compute the mean, median, and standard deviation for each quantitative (numeric) variable. [✓]

longitude

summary(data$longitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -124.3  -121.8  -118.5  -119.6  -118.0  -114.3
sd(data$longitude)
## [1] 2.00576
hist(data$longitude)

latitude

summary(data$latitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.54   33.93   34.27   35.65   37.73   41.95
sd(data$latitude)
## [1] 2.150066
hist(data$latitude)

housing_median_age

summary(data$housing_median_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   28.00   28.37   37.00   52.00
sd(data$housing_median_age)
## [1] 12.50405
hist(data$housing_median_age)

total_rooms

summary(data$total_rooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2    1438    2111    2620    3121   39320
sd(data$total_rooms)
## [1] 2182.372
hist(data$total_rooms)

population

summary(data$population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3     796    1179    1441    1746   35682
sd(data$population)
## [1] 1144.075
hist(data$population)

households

summary(data$households)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   282.0   411.0   501.2   606.0  6082.0
sd(data$households)
## [1] 383.3914
hist(data$households)

median_income

summary(data$median_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4999  2.5263  3.4491  3.6764  4.5825 15.0001
sd(data$median_income)
## [1] 1.570602
hist(data$median_income)

median_house_value

summary(data$median_house_value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.50   11.65   17.36   19.21   24.79   49.91
sd(data$median_house_value)
## [1] 9.711085
hist(data$median_house_value)

Describe your results with a few sentences. [✓]

A low standard deviation means the data are clustered around the mean; a high standard deviation means they are more spread out. To judge whether a standard deviation is high or low, compare it to the mean with the coefficient of variation (CV = sd / mean): a CV well below 1 indicates a rather low standard deviation, a CV greater than 1 a rather high one. The CV is only meaningful for ratio-scaled variables, so longitude and latitude are left out. None of the remaining variables has a CV above roughly 0.83 (total_rooms).

Some of the variables (total_rooms, population, households and, less strongly, median_income and median_house_value) are right-skewed. In right-skewed distributions the mean is typically higher than the median, which the output above confirms. For these variables the median together with the first and third quartiles describes the data better than mean and standard deviation, since the quartiles give some sense of the asymmetry of the distribution.
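The CV comparison can be reproduced from the numbers already printed. This is a small sketch: the means and standard deviations are copied from the summary() and sd() output above (so they carry its rounding), and the data frame name cv_table is my own.

```r
# Means and standard deviations copied from the output above
# (ratio-scaled variables only; CV is not meaningful for longitude/latitude)
cv_table <- data.frame(
  variable = c("housing_median_age", "total_rooms", "population",
               "households", "median_income", "median_house_value"),
  mean = c(28.37, 2620, 1441, 501.2, 3.6764, 19.21),
  sd   = c(12.50405, 2182.372, 1144.075, 383.3914, 1.570602, 9.711085)
)

# Coefficient of variation = sd / mean, rounded to two decimals
cv_table$cv <- round(cv_table$sd / cv_table$mean, 2)

# Largest CV first
cv_table[order(-cv_table$cv), ]
```

On the full dataset the same idea would be sapply(data[, 3:8], function(v) sd(v) / mean(v)), assuming the column order shown by str(data).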
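One way to put a number on the asymmetry using only the quartiles is Bowley's quartile skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1). This measure is not part of the original analysis; it is added here as an illustration, and the helper name bowley is my own. The quartiles are copied from the summary() output above.

```r
# Bowley's quartile skewness: 0 for a symmetric middle half,
# positive when the upper quartile is further from the median than the lower one
bowley <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

# Quartiles copied from the summary() output above
skew <- c(
  housing_median_age = bowley(18, 28, 37),
  total_rooms        = bowley(1438, 2111, 3121),
  population         = bowley(796, 1179, 1746),
  households         = bowley(282, 411, 606),
  median_income      = bowley(2.5263, 3.4491, 4.5825),
  median_house_value = bowley(11.65, 17.36, 24.79)
)

round(sort(skew, decreasing = TRUE), 3)
```

Because this measure ignores everything outside the middle 50% of the data, it understates the effect of extreme maxima such as total_rooms = 39320; the histograms remain the better check for long tails.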

Task 6


Compute the average for the variable median_house_value, but grouped by ocean_proximity. [✓]

aggregate(data$median_house_value, list(data$ocean_proximity), FUN=mean) 
##      Group.1        x
## 1     INLAND 12.31949
## 2  <1H OCEAN 22.37242
## 3   NEAR BAY 23.59176
## 4 NEAR OCEAN 22.67112
## 5     ISLAND 38.04400

What do you observe? [✓]

My earlier assumption (proximity to water = pricier) is confirmed: INLAND districts are by far the cheapest. However, NEAR BAY is on average slightly more expensive than NEAR OCEAN. This seems to be specific to the San Francisco Bay, which is large and attractive (I assume; I have been there but can’t recall). Checked the coordinates manually on a map. Note that the ISLAND average rests on only 5 districts.
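Since a group mean based on very few districts (such as ISLAND) is easy to over-interpret, it can help to print the group size next to the mean. A minimal sketch on made-up numbers; the data frame toy and its values are hypothetical and only mimic the structure of the real data.

```r
# Hypothetical toy data with the same two columns as the real dataset
toy <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "ISLAND",
                         "NEAR BAY", "NEAR BAY", "NEAR BAY"),
  median_house_value = c(10, 14, 38, 20, 24, 28)
)

# Mean and number of districts per category
group_mean <- tapply(toy$median_house_value, toy$ocean_proximity, mean)
group_n    <- tapply(toy$median_house_value, toy$ocean_proximity, length)

summary_by_group <- data.frame(
  mean = as.vector(group_mean),
  n    = as.vector(group_n),
  row.names = names(group_mean)
)
summary_by_group
```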

Task 7


Use the R function cor() to compute a correlation matrix. Note that cor() will not accept any non-numeric variables. Hence, you will need to pass on the dataframe with only numeric variables. Hint: df[,c(1,2)] returns the dataframe df with only the first and second variable (column). [✓]

data_correlation_matrix <- data[, c(1,2,3,4,5,6,7,8)]  # numeric columns only (everything except ocean_proximity)
data_correlation_matrix_rounded <- round(cor(data_correlation_matrix), 2)
data_correlation_matrix_rounded
##                    longitude latitude housing_median_age total_rooms population
## longitude               1.00    -0.92              -0.10        0.04       0.10
## latitude               -0.92     1.00               0.01       -0.03      -0.11
## housing_median_age     -0.10     0.01               1.00       -0.37      -0.29
## total_rooms             0.04    -0.03              -0.37        1.00       0.86
## population              0.10    -0.11              -0.29        0.86       1.00
## households              0.06    -0.07              -0.31        0.92       0.91
## median_income          -0.01    -0.08              -0.20        0.22       0.04
## median_house_value     -0.05    -0.15               0.07        0.14       0.01
##                    households median_income median_house_value
## longitude                0.06         -0.01              -0.05
## latitude                -0.07         -0.08              -0.15
## housing_median_age      -0.31         -0.20               0.07
## total_rooms              0.92          0.22               0.14
## population               0.91          0.04               0.01
## households               1.00          0.05               0.10
## median_income            0.05          1.00               0.65
## median_house_value       0.10          0.65               1.00

Which variable is most correlated with the variable median_house_value? [✓]

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(data_correlation_matrix), tl.srt=45, type="upper", tl.cex=0.6, tl.col="black", order="hclust")

The variable median_income is most strongly correlated with median_house_value (r = 0.65), which makes sense: people with a high income usually live in more expensive houses.
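The strongest correlation can also be picked out programmatically instead of reading it off the matrix or the plot. A small sketch using the median_house_value column copied from the rounded matrix above; the vector name r is my own.

```r
# Correlations with median_house_value, copied from the matrix above
r <- c(longitude = -0.05, latitude = -0.15, housing_median_age = 0.07,
       total_rooms = 0.14, population = 0.01, households = 0.10,
       median_income = 0.65)

# Rank by absolute value so that strong negative correlations are not missed
sort(abs(r), decreasing = TRUE)
names(r)[which.max(abs(r))]
```

Ranking by absolute value matters here: latitude (-0.15) is the second strongest relationship, which a plain max() over the signed values would hide.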

Task 8


Create a grouped boxplot for the variable median_house_value, where groups are built based on the variable ocean_proximity. [✓]

ggplot(data, aes(x = ocean_proximity, y = median_house_value)) + geom_boxplot()

Try to order the boxplots in an ascending manner (from left to right). This often helps with the interpretation. Hint: https://rpubs.com/crazyhottommy/reorder-boxplot [✓]

ggplot(data, aes(x = reorder(ocean_proximity, median_house_value, FUN = median), y = median_house_value)) + geom_boxplot()
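What reorder() does to the factor levels can be seen on a tiny example. The vectors g and v below are made up for illustration only.

```r
# Hypothetical groups and values
g <- factor(c("a", "a", "b", "b", "c", "c"))
v <- c(5, 7, 1, 3, 9, 11)

levels(g)                               # alphabetical by default
levels(reorder(g, v, FUN = median))     # sorted by group median, ascending
```

ggplot2 draws the boxes in level order, so reordering the levels by median is what produces the ascending layout.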

The outliers of INLAND have been checked manually on a map.

The coordinates 34.15, -118.03 lie in a hillside area near Los Angeles, where probably wealthy people and celebrities live https://www.neighborhoodscout.com/ca/monrovia.

The coordinates 37.64, -121.88 lie in the San Francisco Bay Area, also on a hillside, where probably also wealthy people live https://bestneighborhood.org/best-neighborhoods-verona-ca/.

Task 9


Use ggplot2 to create a geographical plot of the districts with the latitude on the y-axis and the longitude on the x-axis. Each district shall be represented with a dot and the dot shall be colored according to the variable median_house_value. Hint: one of the ggplot2 layers needs to be scale_colour_continuous(low = “blue”, high = “red”, space = “Lab”). You should use geom_point() to add the dots. [✓]

ggplot(data, aes(x = longitude, y = latitude, color = median_house_value)) + geom_point() + scale_color_continuous(low = "blue", high = "red", space = "Lab")

Comment on your plot. [✓]

Prices are highest on the coast, but mainly around San Francisco and Los Angeles (and in some areas in between).

Task 10


Compute the population mean and the population standard deviation for the variable median_house_value. [✓]

mean(data$median_house_value)
## [1] 19.20553
sd(data$median_house_value)
## [1] 9.711085
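Note that sd() divides by n - 1 (sample standard deviation). If the 19648 districts are treated as the whole population, the population standard deviation divides by N instead. With N this large the difference is negligible, but the distinction can be sketched as follows; pop_sd is a helper name of my own and the vector x is a made-up example.

```r
# Population standard deviation: divide by N rather than N - 1
pop_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / n)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical values, mean = 5

pop_sd(x)                         # divides by 8
sd(x)                             # divides by 7, slightly larger

# Equivalent conversion from sd()
sd(x) * sqrt((length(x) - 1) / length(x))
```

For the real data this would be pop_sd(data$median_house_value), or equivalently sd(data$median_house_value) * sqrt((N - 1) / N) with N = nrow(data).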

Task 11


Draw a random sample of size n = 50 and compute the sample mean X and the standard error of the mean (SEM) for the variable median_house_value. [✓]

library(plotrix)
set.seed(166)
n <- 50
N <- nrow(data)
data_sample_1 <- data[sample(1:N, n, replace = FALSE), ]
sample_mean_1 <- mean(data_sample_1$median_house_value)
sample_mean_1
## [1] 18.0224
sample_se_1 <- std.error(data_sample_1$median_house_value)
sample_se_1
## [1] 1.304508
data_sample_2 <- data[sample(1:N, n, replace = FALSE), ]
sample_mean_2 <- mean(data_sample_2$median_house_value)
sample_mean_2
## [1] 20.1856
sample_se_2 <- std.error(data_sample_2$median_house_value)
sample_se_2
## [1] 1.393956

Mean sample 1: 18.0224
Standard error of mean sample 1: 1.304508

Mean sample 2: 20.1856
Standard error of mean sample 2: 1.393956

How far away is your sample mean from the population mean? Briefly comment on your results. [✓]

mean(data$median_house_value)
## [1] 19.20553

Mean of sample 1 is 1.18313 away from population mean
Mean of sample 2 is 0.98007 away from population mean

The results are not too bad, but n = 50 is probably too small a sample. A sample of around n = 400 gives better values:

set.seed(166)
data_sample_400 <- sample(data$median_house_value, size = 400, replace = FALSE, prob = NULL)
mean(data_sample_400)
## [1] 19.18285
std.error(data_sample_400)
## [1] 0.485244

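The mean of the sample with n = 400 is only 0.02268 away from the population mean. As the sample size increases, the sample mean tends to get closer to the population mean; accordingly, the standard error shrinks from about 1.3 (n = 50) to about 0.49 (n = 400).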
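A single sample of each size is only one draw, so the effect of n is easier to see by repeating the sampling many times. The sketch below uses a synthetic right-skewed population generated with rgamma(), not the California data; the seed, the distribution parameters and the helper sample_means are all assumptions made for illustration.

```r
set.seed(1)

# Synthetic right-skewed population (NOT the california dataset)
population <- rgamma(20000, shape = 4, scale = 5)

# Draw `reps` samples of size n without replacement and return their means
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

spread_50  <- sd(sample_means(50))    # empirical standard error for n = 50
spread_400 <- sd(sample_means(400))   # empirical standard error for n = 400

c(n50 = spread_50, n400 = spread_400, ratio = spread_50 / spread_400)
```

Theory predicts a ratio of about sqrt(400 / 50), roughly 2.8, which is in line with the two std.error() values reported above (1.30 versus 0.49).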

Task 12


Compute a 90% confidence interval for the mean. [✓] # Not sure whether to continue with the samples or with all 19648 observations, hence I do both.

Sample 1

sample_mean_ci <- sample_mean_1 
sample_sd_ci <- sd(data_sample_1$median_house_value)
sample_size_ci <- nrow(data_sample_1)
error <- qnorm(0.95)*sample_sd_ci/sqrt(sample_size_ci)
left <- sample_mean_ci-error
right <- sample_mean_ci+error
left
## [1] 15.87667
right
## [1] 20.16813

We can be 90% confident that the true (population) mean lies between 15.87667 and 20.16813.

Sample 2

sample_mean_ci <- sample_mean_2
sample_sd_ci <- sd(data_sample_2$median_house_value)
sample_size_ci <- nrow(data_sample_2)
error <- qnorm(0.95)*sample_sd_ci/sqrt(sample_size_ci)
left <- sample_mean_ci-error
right <- sample_mean_ci+error
left
## [1] 17.89274
right
## [1] 22.47845

We can be 90% confident that the true (population) mean lies between 17.89274 and 22.47845.

Full population

population_mean <- mean(data$median_house_value)
population_sd <- sd(data$median_house_value)
population_size <- nrow(data)
error <- qnorm(0.95)*population_sd/sqrt(population_size)
left <- population_mean-error
right <- population_mean+error
left
## [1] 19.09158
right
## [1] 19.31949

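Treating all 19648 districts as a sample from a larger population, we can be 90% confident that the true mean lies between 19.09158 and 19.31949. If the dataset is itself the population, the mean (19.20553) is known exactly and no interval is needed.

Even with n = 50 the estimates are reasonably accurate: both 90% intervals (15.87667 to 20.16813 for sample 1 and 17.89274 to 22.47845 for sample 2) contain the population mean of 19.20553. Note that qnorm() is a large-sample approximation; for n = 50 the t-quantile qt(0.95, df = 49) would give a slightly wider interval.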
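The three intervals above repeat the same steps, so they could be wrapped in one helper that also offers the t-based variant. This is a sketch under my own naming (ci_mean, use_t); the vector x is made up and merely stands in for data_sample_1$median_house_value.

```r
# Confidence interval for a mean; t-quantile by default, normal quantile optional
ci_mean <- function(x, level = 0.90, use_t = TRUE) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  p  <- 1 - (1 - level) / 2          # 0.95 for a 90% interval
  q  <- if (use_t) qt(p, df = n - 1) else qnorm(p)
  mean(x) + c(-1, 1) * q * se
}

x <- c(12.1, 18.4, 25.0, 9.7, 31.2, 16.8, 22.5, 14.3)  # hypothetical values

ci_t <- ci_mean(x)                   # t-based interval
ci_z <- ci_mean(x, use_t = FALSE)    # normal approximation, as used above
rbind(t = ci_t, z = ci_z)
```

With use_t = FALSE the helper reproduces the calculation used in this task; the t-based interval is always somewhat wider, and the gap shrinks as n grows.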

Task 13


Compute the sample proportion of districts in the category INLAND. [✓]

table(data_sample_1$ocean_proximity)/length(data_sample_1$ocean_proximity)
## 
##     INLAND  <1H OCEAN   NEAR BAY NEAR OCEAN     ISLAND 
##       0.32       0.34       0.10       0.24       0.00

Sample proportion of districts in the category INLAND is 0.32.

Calculate a 95% confidence interval for the proportion of districts in this category. [✓]

sqrt(0.32*(1-0.32)/(n-1)*(1-n/N))
## [1] 0.0665546
0.32 + c(-1,1)*1.96*sqrt(0.32*(1-0.32)/(n-1)*(1-n/N))
## [1] 0.189553 0.450447

With 95% confidence the true proportion of INLAND districts lies between 0.189553 and 0.450447; the population proportion of 0.332 from Task 4 falls inside this interval.
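The same calculation can be written as a small function so that the proportion and the quantile are not hard-coded. A sketch only: prop_ci is my own helper name, it uses the same standard-error formula with finite population correction as above, and it takes qnorm(0.975) instead of the rounded 1.96, so the last digits differ slightly.

```r
# Confidence interval for a proportion, with finite population correction
prop_ci <- function(p_hat, n, N, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
  p_hat + c(-1, 1) * z * se
}

# Values from this task: 16 of the 50 sampled districts are INLAND
ci_inland <- prop_ci(0.32, n = 50, N = 19648)
ci_inland
```

On the real sample, p_hat could be computed directly as mean(data_sample_1$ocean_proximity == "INLAND").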