Introduction

This report demonstrates:

  1. Linear regression using the Boston housing dataset
  2. Visualization of sea level data
  3. Statistical interpretation of results

The aim is to understand relationships between variables and visualize environmental trends.


Load Required Libraries

# Load necessary libraries

library(MASS)       
library(ggplot2)    
library(dplyr)      
library(lubridate)  

Part 1: Linear Regression (Boston Dataset)

Load Dataset

# Load Boston dataset

data("Boston")

# View structure of dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset includes:

  • crim → Crime rate
  • rm → Number of rooms
  • ptratio → Pupil-teacher ratio
  • nox → Pollution level
  • medv → Median house value

Simple Linear Regression

Regression: medv vs crim

# Fit regression model

model_b <- lm(medv ~ crim, data = Boston)

# Display model results
summary(model_b)
## 
## Call:
## lm(formula = medv ~ crim, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

Interpretation

We examine the p-value of crim.

If:

p < 0.05 → statistically significant

Result:

Crime rate is significant because its p-value is less than 0.05.


Interpretation of Crime Effect

The coefficient for crim is negative.

Meaning:

  • Higher crime rate → Lower house value

This indicates an inverse relationship.


Multiple Linear Regression

Add More Variables

# Fit multiple regression model

model_c <- lm(
  medv ~ crim + rm + ptratio + nox,
  data = Boston
)

summary(model_c)
## 
## Call:
## lm(formula = medv ~ crim + rm + ptratio + nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.966  -3.207  -0.563   1.881  39.588 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.64051    4.36496   1.521    0.129    
## crim         -0.13906    0.03362  -4.136 4.14e-05 ***
## rm            6.90415    0.40150  17.196  < 2e-16 ***
## ptratio      -1.06741    0.12943  -8.247 1.44e-15 ***
## nox         -13.15253    2.49470  -5.272 2.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.724 on 501 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.6126 
## F-statistic: 200.6 on 4 and 501 DF,  p-value: < 2.2e-16

Interpretation of Variables

rm (rooms)
Positive coefficient → More rooms increase house value.

nox (pollution)
Negative coefficient → Higher pollution decreases house value.

Conclusion:

House value increases with room number and decreases with pollution.


Residual Plot

# Create residual and fitted values

Boston$residuals <- resid(model_b)
Boston$fitted <- fitted(model_b)

# Plot residuals

ggplot(Boston,
       aes(x = fitted,
           y = residuals)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Residual Plot",
    x = "Fitted Values",
    y = "Residuals"
  )
## `geom_smooth()` using formula = 'y ~ x'

Interpretation

The trendline slope is close to zero.

This indicates:

  • Model fits reasonably well
  • No strong pattern in residuals

Part 2: Sea Level Visualization

Files used:

  • extreme.csv
  • sealevel.csv

Extreme Sea Level Plot (48 Hours)

# Load extreme sea level data

extreme <- read.csv("extreme.csv")

ggplot(extreme,
       aes(x = hour + (day-1)*24,
           y = sealevel,
           color = factor(year))) +
  geom_line() +
  labs(
    title = "Sea Level During Extreme Days",
    x = "Hour",
    y = "Sea Level (cm)",
    color = "Year"
  )

Explanation

This plot shows:

  • A 48-hour time period
  • Separate line for each year
  • Different colors for clarity

Yearly Sea Level Statistics

# Load dataset

sealevel <- read.csv("sealevel.csv")

# Convert date
sealevel$date <- as.Date(sealevel$date)

# Extract year
sealevel$year <- format(sealevel$date, "%Y")

# Calculate yearly statistics

year_data <- sealevel %>%
  group_by(year) %>%
  summarise(
    mean_level = mean(sealevel, na.rm=TRUE),
    max_level  = max(sealevel, na.rm=TRUE),
    min_level  = min(sealevel, na.rm=TRUE),
    .groups = "drop"
  )

# Plot results

ggplot(year_data,
       aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_level)) +
  geom_line(aes(y = max_level)) +
  geom_line(aes(y = min_level)) +
  labs(
    title = "Sea Level Statistics per Year",
    x = "Year",
    y = "Sea Level (cm)"
  )


Monthly Median Levels (10-Year Periods)

# Extract month and year

sealevel$month <- month(sealevel$date)
sealevel$year  <- year(sealevel$date)

# Keep years 1960–2009

data_10yr <- sealevel %>%
  filter(year >= 1960 & year <= 2009)

# Create 10-year periods

data_10yr$period <- cut(
  data_10yr$year,
  breaks = seq(1960, 2010, by=10),
  right = FALSE,
  labels = paste(
    seq(1960, 2000, by=10),
    seq(1969, 2009, by=10),
    sep = "-"
  )
)

# Calculate statistics

month_data <- data_10yr %>%
  group_by(period, month) %>%
  summarise(
    median_level = median(sealevel, na.rm=TRUE),
    max_level    = max(sealevel, na.rm=TRUE),
    .groups = "drop"
  )

# Plot results

ggplot(month_data,
       aes(x = month,
           y = median_level,
           fill = period)) +
  geom_col(position = "dodge") +
  labs(
    title = "Median Sea Level per Month (10-year periods)",
    x = "Month",
    y = "Sea Level (cm)",
    fill = "Period"
  )


Conclusion

Regression Results

  • Crime rate significantly affects house value
  • More rooms increase house value
  • Pollution decreases house value

Sea Level Results

  • Sea levels vary across years
  • Long-term patterns are visible
  • Monthly variations are observed

End of Report