Introduction

This report demonstrates:

Linear regression using the Boston housing dataset
Visualization of sea level data
Statistical interpretation of results

The aim is to understand relationships between variables and visualize environmental trends.

Load Required Libraries

# Load necessary libraries

library(MASS)       
library(ggplot2)    
library(dplyr)      
library(lubridate)

Part 1: Linear Regression (Boston Dataset)

Load Dataset

# Load Boston dataset

data("Boston")

# View structure of dataset
str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset includes:

crim → Crime rate
rm → Number of rooms
ptratio → Pupil-teacher ratio
nox → Pollution level
medv → Median house value

Simple Linear Regression

Regression: medv vs crim

# Fit regression model

model_b <- lm(medv ~ crim, data = Boston)

# Display model results
summary(model_b)

## 
## Call:
## lm(formula = medv ~ crim, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

Interpretation

We examine the p-value of crim.

If:

p < 0.05 → statistically significant

Result:

Crime rate is significant because its p-value is less than 0.05.

Interpretation of Crime Effect

The coefficient for crim is negative.

Meaning:

Higher crime rate → Lower house value

This indicates an inverse relationship.

Multiple Linear Regression

Add More Variables

# Fit multiple regression model

model_c <- lm(
  medv ~ crim + rm + ptratio + nox,
  data = Boston
)

summary(model_c)

## 
## Call:
## lm(formula = medv ~ crim + rm + ptratio + nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.966  -3.207  -0.563   1.881  39.588 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.64051    4.36496   1.521    0.129    
## crim         -0.13906    0.03362  -4.136 4.14e-05 ***
## rm            6.90415    0.40150  17.196  < 2e-16 ***
## ptratio      -1.06741    0.12943  -8.247 1.44e-15 ***
## nox         -13.15253    2.49470  -5.272 2.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.724 on 501 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.6126 
## F-statistic: 200.6 on 4 and 501 DF,  p-value: < 2.2e-16

Interpretation of Variables

rm (rooms)
Positive coefficient → More rooms increase house value.

nox (pollution)
Negative coefficient → Higher pollution decreases house value.

Conclusion:

House value increases with room number and decreases with pollution.

Residual Plot

# Create residual and fitted values

Boston$residuals <- resid(model_b)
Boston$fitted <- fitted(model_b)

# Plot residuals

ggplot(Boston,
       aes(x = fitted,
           y = residuals)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Residual Plot",
    x = "Fitted Values",
    y = "Residuals"
  )

## `geom_smooth()` using formula = 'y ~ x'

Interpretation

The trendline slope is close to zero.

This indicates:

Model fits reasonably well
No strong pattern in residuals

Part 2: Sea Level Visualization

Files used:

extreme.csv
sealevel.csv

Extreme Sea Level Plot (48 Hours)

# Load extreme sea level data

extreme <- read.csv("extreme.csv")

ggplot(extreme,
       aes(x = hour + (day-1)*24,
           y = sealevel,
           color = factor(year))) +
  geom_line() +
  labs(
    title = "Sea Level During Extreme Days",
    x = "Hour",
    y = "Sea Level (cm)",
    color = "Year"
  )

Explanation

This plot shows:

A 48-hour time period
Separate line for each year
Different colors for clarity

Yearly Sea Level Statistics

# Load dataset

sealevel <- read.csv("sealevel.csv")

# Convert date
sealevel$date <- as.Date(sealevel$date)

# Extract year
sealevel$year <- format(sealevel$date, "%Y")

# Calculate yearly statistics

year_data <- sealevel %>%
  group_by(year) %>%
  summarise(
    mean_level = mean(sealevel, na.rm=TRUE),
    max_level  = max(sealevel, na.rm=TRUE),
    min_level  = min(sealevel, na.rm=TRUE),
    .groups = "drop"
  )

# Plot results

ggplot(year_data,
       aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_level)) +
  geom_line(aes(y = max_level)) +
  geom_line(aes(y = min_level)) +
  labs(
    title = "Sea Level Statistics per Year",
    x = "Year",
    y = "Sea Level (cm)"
  )

Monthly Median Levels (10-Year Periods)

# Extract month and year

sealevel$month <- month(sealevel$date)
sealevel$year  <- year(sealevel$date)

# Keep years 1960–2009

data_10yr <- sealevel %>%
  filter(year >= 1960 & year <= 2009)

# Create 10-year periods

data_10yr$period <- cut(
  data_10yr$year,
  breaks = seq(1960, 2010, by=10),
  right = FALSE,
  labels = paste(
    seq(1960, 2000, by=10),
    seq(1969, 2009, by=10),
    sep = "-"
  )
)

# Calculate statistics

month_data <- data_10yr %>%
  group_by(period, month) %>%
  summarise(
    median_level = median(sealevel, na.rm=TRUE),
    max_level    = max(sealevel, na.rm=TRUE),
    .groups = "drop"
  )

# Plot results

ggplot(month_data,
       aes(x = month,
           y = median_level,
           fill = period)) +
  geom_col(position = "dodge") +
  labs(
    title = "Median Sea Level per Month (10-year periods)",
    x = "Month",
    y = "Sea Level (cm)",
    fill = "Period"
  )

Assignment 3: Linear Regression and Sea Level Analysis

Iman makhzoum

2026-03-26

Introduction

Load Required Libraries

Part 1: Linear Regression (Boston Dataset)

Load Dataset

Simple Linear Regression

Regression: medv vs crim

Interpretation

Interpretation of Crime Effect

Multiple Linear Regression

Add More Variables

Interpretation of Variables

Residual Plot

Interpretation

Part 2: Sea Level Visualization

Extreme Sea Level Plot (48 Hours)

Explanation

Yearly Sea Level Statistics

Monthly Median Levels (10-Year Periods)

Conclusion

Regression Results

Sea Level Results

End of Report