SIT2009 Assignment 2 (MLR)

Author

Muhammad Azamuddin

Package Installation and Data Setup

## Package installation
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("faraway"); library("faraway")

The downloaded binary packages are in
    /var/folders/yq/q9fv54y91vb70rhxtfs06ysh0000gn/T//Rtmp2cUPQN/downloaded_packages
install.packages("lmtest"); library("lmtest")

The downloaded binary packages are in
    /var/folders/yq/q9fv54y91vb70rhxtfs06ysh0000gn/T//Rtmp2cUPQN/downloaded_packages
## Data setup
data("aatemp"); data("divusa")
str(aatemp); str(divusa)
'data.frame':   115 obs. of  2 variables:
 $ year: int  1854 1855 1871 1881 1882 1883 1884 1885 1890 1891 ...
 $ temp: num  49.1 46.5 48.8 48 47.3 ...
'data.frame':   77 obs. of  7 variables:
 $ year      : int  1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
 $ divorce   : num  8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
 $ unemployed: num  5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
 $ femlab    : num  22.7 22.8 22.9 23 23.1 ...
 $ marriage  : num  92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
 $ birth     : num  118 120 111 110 111 ...
 $ military  : num  3.22 3.56 2.46 2.21 2.29 ...

Analysis

First, we start by plotting the scatterplot from this dataset. The first plot on the first row denotes the overall trend of temperature with respect to year, while the remaining plots denote the scatterplots of temperature with respect to different segments of year of the same dataset.

# Set up plotting layout for 3 rows and 2 columns
par(mfrow = c(2,3))

# Plot for the entire dataset (aatemp)
plot(aatemp$year, aatemp$temp,
     main = "Average annual \n temperature, 1854-2000",
     xlab = "Year",
     ylab = "Temperature",
     pch = 19, # Solid circle
     col = "blue")
# Fit a linear model for the entire dataset
model_full <- lm(temp ~ year, data = aatemp)
# Add the regression line for the entire dataset
abline(model_full, col = "red", lwd = 2)

# Data Segmentation
split_aatemp <- split(aatemp, rep(1:5, each = 23, length.out = 115))

## Assign a vector of titles
titles <- c(
  "Average annual \n temperature, 1854-1904",
  "Average annual \n temperature, 1905-1927",
  "Average annual \n temperature, 1928-1951", 
  "Average annual \n temperature, 1952-1977",
  "Average annual \n temperature, 1978-2000")

# Scatterplots with Linear Model for Segmented Data
for (i in seq_along(split_aatemp)) {
  model <- lm(temp ~ year, data = split_aatemp[[i]])
  # Plot for segmented data
  plot(split_aatemp[[i]]$year, split_aatemp[[i]]$temp,
       main = titles[i],
       xlab = "Year",
       ylab = "Temperature",
       pch = 19, # Solid circle
       col = "blue")
  
  # Add the regression line for segmented data
  abline(model, col = "red", lwd = 2)
}

# Reset plotting parameters to default
par(mfrow = c(1, 1))

From the grid of scatterplot, we can see that the variable year predicts temp very inconsistently, from the different slopes of the regression line from each scatterplot. To explain this, we will proceed with checking the model adequacy of the this linear model.

Model Adequacy Checking

In this section, we will be examining with two versions of aatempt: the general case and the segmentized case, where for the segmentized datasets, all 115 observations accounting for 115 different years are segmentized into 5 equal parts, with 23 observations each. This is so that we can examine the dataset more thoroughly in order to dtect different trends.

## Model definition
temp_year <- lm(aatemp$temp~aatemp$year)

## Summary statistics
summary(temp_year)

Call:
lm(formula = aatemp$temp ~ aatemp$year)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9843 -0.9113 -0.0820  0.9946  3.5343 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 24.005510   7.310781   3.284  0.00136 **
aatemp$year  0.012237   0.003768   3.247  0.00153 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.466 on 113 degrees of freedom
Multiple R-squared:  0.08536,   Adjusted R-squared:  0.07727 
F-statistic: 10.55 on 1 and 113 DF,  p-value: 0.001533
## Summary plots
par(mfrow = c(2,2))
plot(temp_year)

## Shapiro-Wilk Test
shapiro.test(temp_year$residuals)

    Shapiro-Wilk normality test

data:  temp_year$residuals
W = 0.995, p-value = 0.9568
## Breusch-Pagan Test
bptest(temp_year)

    studentized Breusch-Pagan test

data:  temp_year
BP = 0.69086, df = 1, p-value = 0.4059

We see that in the general case, our model exhibits a very weak correlation between year and temp. We can determine this from the low value of the coefficient estimate \(\hat{\beta_{1}} = 0.012237\) of \(R^{2} = 0.08536\) , though the p-value associated with \(\hat{\beta_{1}}\), \(p = 0.001533 < 0.05\) is statistically significant. We also noticed that, our data mainly passes the assumptions involved in the residual analysis, mainly that it exhibits constant variance, and that the residuals are distributed normally. But we notice one interesting detail: a low value of \(\hat{\beta_{1}}\) should imply we did not reject \(H_{0}: \beta_{1} = 0\) but \(p= 0.001533 < 0.05\). To find out why, we extend our model adequacy checking procedure to a segmented version of aatemp.

## Data Segmentation 
split_aatemp <- split(aatemp, rep(1:5, each = 23, length.out = 115))

## Assign a vector of titles
titles <- c("Average annual temperature,\n 1854-1904",
            "Average annual temperature,\n 1905-1927",
            "Average annual temperature,\n 1928-1951", 
            "Average annual temperature,\n 1952-1977",
            "Average annual temperature,\n 1978-2000")

## Summary Statistics and plots
for(i in seq_along(split_aatemp)) {
  model <- lm(temp ~ year, data = split_aatemp[[i]])
  cat("For", titles[i], "the summary statistics are:\n")
  print(summary(model))
  # Set up the multi-panel layout and plot diagnostics
  par(mfrow = c(2, 2))  # 'oma' sets the outer margins
  # Plot diagnostics without default titles
  plot(model, which = 1, main = "")  # Residuals vs Fitted
  plot(model, which = 2, main = "")  # Normal Q-Q
  plot(model, which = 3, main = "")  # Scale-Location
  plot(model, which = 5, main = "")  # Residuals vs Leverage
  # Add a large title above the panels
  mtext(titles[i], outer = TRUE, cex = 1.5, font = 1, line = 2)
  
  # Shapiro-Wilk test for normality of residuals
  shapiro_result <- shapiro.test(model$residuals)
  shapiro_p_value <- shapiro_result$p.value
  normality <- ifelse(shapiro_p_value < 0.05, 
                      "the data is not normally distributed.", 
                      "the data is normally distributed.")
  
  # Breusch-Pagan test for homoskedasticity
  bptest_result <- bptest(model)
  bptest_p_value <- bptest_result$p.value
  homoskedasticity <- ifelse(bptest_p_value < 0.05, 
                             "the data is not homoskedastic.", 
                             "the data is homoskedastic.")
  
  # Print the final conclusions along with p-values
  cat("For", titles[i], ":\n")
  cat(normality, " (p-value:", shapiro_p_value, ")\n")
  cat(homoskedasticity, " (p-value:", bptest_p_value, ")\n\n")
}
For Average annual temperature,
 1854-1904 the summary statistics are:

Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7138 -0.9173  0.6097  1.1532  1.9000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.53361   46.26341   1.438    0.165
year        -0.01030    0.02449  -0.420    0.678

Residual standard error: 1.575 on 21 degrees of freedom
Multiple R-squared:  0.008349,  Adjusted R-squared:  -0.03887 
F-statistic: 0.1768 on 1 and 21 DF,  p-value: 0.6784

For Average annual temperature,
 1854-1904 :
the data is normally distributed.  (p-value: 0.07282286 )
the data is homoskedastic.  (p-value: 0.7130842 )

For Average annual temperature,
 1905-1927 the summary statistics are:

Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3655 -0.7004 -0.1575  1.2605  3.0852 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.997510  90.932229   0.308    0.761
year         0.009832   0.047459   0.207    0.838

Residual standard error: 1.51 on 21 degrees of freedom
Multiple R-squared:  0.00204,   Adjusted R-squared:  -0.04548 
F-statistic: 0.04292 on 1 and 21 DF,  p-value: 0.8379

For Average annual temperature,
 1905-1927 :
the data is normally distributed.  (p-value: 0.7714022 )
the data is homoskedastic.  (p-value: 0.637467 )

For Average annual temperature,
 1928-1951 the summary statistics are:

Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8936 -1.0171 -0.1107  1.0161  2.7330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.818793  78.304786   0.700    0.492
year        -0.003305   0.040383  -0.082    0.936

Residual standard error: 1.299 on 21 degrees of freedom
Multiple R-squared:  0.0003188, Adjusted R-squared:  -0.04729 
F-statistic: 0.006698 on 1 and 21 DF,  p-value: 0.9355

For Average annual temperature,
 1928-1951 :
the data is normally distributed.  (p-value: 0.5013498 )
the data is homoskedastic.  (p-value: 0.3985379 )

For Average annual temperature,
 1952-1977 the summary statistics are:

Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.77097 -0.62019 -0.07366  0.78329  2.28406 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 152.09265   59.67824   2.549   0.0187 *
year         -0.05269    0.03036  -1.736   0.0973 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.063 on 21 degrees of freedom
Multiple R-squared:  0.1254,    Adjusted R-squared:  0.08379 
F-statistic: 3.012 on 1 and 21 DF,  p-value: 0.0973

For Average annual temperature,
 1952-1977 :
the data is normally distributed.  (p-value: 0.755814 )
the data is homoskedastic.  (p-value: 0.9602909 )

For Average annual temperature,
 1978-2000 the summary statistics are:

Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1084 -0.9165 -0.1129  0.7180  3.1465 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -146.14046   85.68566  -1.706   0.1028  
year           0.09754    0.04308   2.264   0.0343 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.37 on 21 degrees of freedom
Multiple R-squared:  0.1962,    Adjusted R-squared:  0.1579 
F-statistic: 5.126 on 1 and 21 DF,  p-value: 0.03427

For Average annual temperature,
 1978-2000 :
the data is normally distributed.  (p-value: 0.5136732 )
the data is not homoskedastic.  (p-value: 0.0474906 )
# Reset plotting parameters to default
par(mfrow = c(1, 1))

Moving to our segmented data, we noticed that all our models from the data segments display similar characteristics, which are low values of \(\hat{\beta_{1}}\) and \(R^{2}\), and \(p \geq 0.05\). All of the model residuals of our data segments are also normally distributed and most fulfill the homoscedasticity. However, the model from the data segment from years \(1978-2000\) failed to fulfill the homoscedasticity assumption. As such, this non-constant variance might explain the seemingly contradictory results that we get from our model from the unsegmented data. Thus we can conclude that year is not an effective explanation to temp.

Thus, we can conclude that year_temp is not able give us much insight about aatemp. Since there is not much insight to obtain from aatempt, we will proceed with the dataset divusa.