Introduction

This report demonstrates:

  1. Linear regression using the Boston housing dataset
  2. Visualization of sea level data
  3. Statistical interpretation of model outputs

Load Required Libraries

library(MASS)
library(ggplot2)
library(dplyr)
library(lubridate)

Part 1: Linear Regression (Boston Dataset)

Load Dataset

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston dataset contains:

  • crim → crime rate
  • rm → number of rooms
  • nox → pollution level
  • medv → median house value

Simple Linear Regression

model_b <- lm(medv ~ crim, data = Boston)
summary(model_b)
## 
## Call:
## lm(formula = medv ~ crim, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

Crime rate significance is determined using the p-value.

If p < 0.05, crime rate significantly affects house value.


Multiple Linear Regression

model_c <- lm(
  medv ~ crim + rm + ptratio + nox,
  data = Boston
)

summary(model_c)
## 
## Call:
## lm(formula = medv ~ crim + rm + ptratio + nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.966  -3.207  -0.563   1.881  39.588 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.64051    4.36496   1.521    0.129    
## crim         -0.13906    0.03362  -4.136 4.14e-05 ***
## rm            6.90415    0.40150  17.196  < 2e-16 ***
## ptratio      -1.06741    0.12943  -8.247 1.44e-15 ***
## nox         -13.15253    2.49470  -5.272 2.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.724 on 501 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.6126 
## F-statistic: 200.6 on 4 and 501 DF,  p-value: < 2.2e-16

Interpretation:

  • rm → Positive effect
  • nox → Negative effect

More rooms increase value, pollution decreases value.


Residual Plot

Boston$residuals <- resid(model_b)
Boston$fitted <- fitted(model_b)

ggplot(Boston,
       aes(x = fitted,
           y = residuals)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Residual Plot",
    x = "Fitted Values",
    y = "Residuals"
  )
## `geom_smooth()` using formula = 'y ~ x'

Residuals should scatter randomly if the model fits well.


Part 2: Sea Level Plotting

Extreme Sea Level Plot

extreme <- read.csv("extreme.csv")

ggplot(extreme,
       aes(x = hour + (day-1)*24,
           y = sealevel,
           color = factor(year))) +
  geom_line() +
  labs(
    title = "Sea Level During Extreme Days",
    x = "Hour",
    y = "Sea Level (cm)",
    color = "Year"
  )

This plot shows a 48-hour sea level trend.


Yearly Sea Level Statistics

sealevel <- read.csv("sealevel.csv")

sealevel$date <- as.Date(sealevel$date)
sealevel$year <- format(sealevel$date, "%Y")

year_data <- sealevel %>%
  group_by(year) %>%
  summarise(
    mean_level = mean(sealevel, na.rm=TRUE),
    max_level  = max(sealevel, na.rm=TRUE),
    min_level  = min(sealevel, na.rm=TRUE)
  )

ggplot(year_data,
       aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_level)) +
  geom_line(aes(y = max_level)) +
  geom_line(aes(y = min_level)) +
  labs(
    title = "Sea Level Statistics per Year",
    x = "Year",
    y = "Sea Level (cm)"
  )

This plot shows yearly mean, maximum, and minimum sea levels.


Monthly Median Levels (10-Year Periods)

sealevel$month <- month(sealevel$date)
sealevel$year  <- year(sealevel$date)

data_10yr <- sealevel %>%
  filter(year >= 1960 & year <= 2009)

data_10yr$period <- cut(
  data_10yr$year,
  breaks = seq(1960, 2010, by = 10),
  right = FALSE
)

month_data <- data_10yr %>%
  group_by(period, month) %>%
  summarise(
    median_level = median(sealevel, na.rm=TRUE),
    max_level    = max(sealevel, na.rm=TRUE),
    .groups="drop"
  )

ggplot(month_data,
       aes(x = month,
           y = median_level,
           fill = period)) +
  geom_col(position = "dodge") +
  labs(
    title = "Median Sea Level per Month",
    x = "Month",
    y = "Sea Level (cm)"
  )

This compares monthly sea level trends across decades.


Optional Extra Task (Disabled to Prevent Errors)

# Function to read csv file

readFile <- function(arg1) {
  data <- read.csv(arg1)
  return(data)
}

# List files

file_list <- list.files(
  "data_folder",
  pattern = "sl_stat",
  full.names = TRUE
)

# Combine files safely

data_list <- list()

for (i in 1:min(10, length(file_list))) {
  data_list[[i]] <- readFile(file_list[i])
}

combined_data <- do.call(rbind, data_list)

# Standardize formats

data_list <- list()

for (i in 1:length(file_list)) {

  temp <- read.csv(file_list[i])

  names(temp) <- c("date", "sealevel")

  data_list[[i]] <- temp
}

final_data <- do.call(rbind, data_list)

This section is disabled to avoid file path errors.


Conclusion

Regression Findings

  • Crime rate significantly affects house value
  • More rooms increase property value
  • Pollution decreases property value

Sea Level Findings

  • Sea levels vary yearly
  • Long-term patterns exist
  • Monthly variations are visible

End of Report