The downloaded binary packages are in
/var/folders/yq/q9fv54y91vb70rhxtfs06ysh0000gn/T//Rtmp2cUPQN/downloaded_packages
install.packages("lmtest"); library("lmtest")
The downloaded binary packages are in
/var/folders/yq/q9fv54y91vb70rhxtfs06ysh0000gn/T//Rtmp2cUPQN/downloaded_packages
## Data setupdata("aatemp"); data("divusa")str(aatemp); str(divusa)
'data.frame': 115 obs. of 2 variables:
$ year: int 1854 1855 1871 1881 1882 1883 1884 1885 1890 1891 ...
$ temp: num 49.1 46.5 48.8 48 47.3 ...
'data.frame': 77 obs. of 7 variables:
$ year : int 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
$ divorce : num 8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
$ unemployed: num 5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
$ femlab : num 22.7 22.8 22.9 23 23.1 ...
$ marriage : num 92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
$ birth : num 118 120 111 110 111 ...
$ military : num 3.22 3.56 2.46 2.21 2.29 ...
Analysis
First, we start by plotting the scatterplot from this dataset. The first plot on the first row denotes the overall trend of temperature with respect to year, while the remaining plots denote the scatterplots of temperature with respect to different segments of year of the same dataset.
# Set up plotting layout for 3 rows and 2 columnspar(mfrow =c(2,3))# Plot for the entire dataset (aatemp)plot(aatemp$year, aatemp$temp,main ="Average annual \n temperature, 1854-2000",xlab ="Year",ylab ="Temperature",pch =19, # Solid circlecol ="blue")# Fit a linear model for the entire datasetmodel_full <-lm(temp ~ year, data = aatemp)# Add the regression line for the entire datasetabline(model_full, col ="red", lwd =2)# Data Segmentationsplit_aatemp <-split(aatemp, rep(1:5, each =23, length.out =115))## Assign a vector of titlestitles <-c("Average annual \n temperature, 1854-1904","Average annual \n temperature, 1905-1927","Average annual \n temperature, 1928-1951", "Average annual \n temperature, 1952-1977","Average annual \n temperature, 1978-2000")# Scatterplots with Linear Model for Segmented Datafor (i inseq_along(split_aatemp)) { model <-lm(temp ~ year, data = split_aatemp[[i]])# Plot for segmented dataplot(split_aatemp[[i]]$year, split_aatemp[[i]]$temp,main = titles[i],xlab ="Year",ylab ="Temperature",pch =19, # Solid circlecol ="blue")# Add the regression line for segmented dataabline(model, col ="red", lwd =2)}
# Reset plotting parameters to defaultpar(mfrow =c(1, 1))
From the grid of scatterplot, we can see that the variable year predicts temp very inconsistently, from the different slopes of the regression line from each scatterplot. To explain this, we will proceed with checking the model adequacy of the this linear model.
Model Adequacy Checking
In this section, we will be examining with two versions of aatempt: the general case and the segmentized case, where for the segmentized datasets, all 115 observations accounting for 115 different years are segmentized into 5 equal parts, with 23 observations each. This is so that we can examine the dataset more thoroughly in order to dtect different trends.
## Model definitiontemp_year <-lm(aatemp$temp~aatemp$year)## Summary statisticssummary(temp_year)
Call:
lm(formula = aatemp$temp ~ aatemp$year)
Residuals:
Min 1Q Median 3Q Max
-3.9843 -0.9113 -0.0820 0.9946 3.5343
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.005510 7.310781 3.284 0.00136 **
aatemp$year 0.012237 0.003768 3.247 0.00153 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.466 on 113 degrees of freedom
Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
Shapiro-Wilk normality test
data: temp_year$residuals
W = 0.995, p-value = 0.9568
## Breusch-Pagan Testbptest(temp_year)
studentized Breusch-Pagan test
data: temp_year
BP = 0.69086, df = 1, p-value = 0.4059
We see that in the general case, our model exhibits a very weak correlation between year and temp. We can determine this from the low value of the coefficient estimate \(\hat{\beta_{1}} = 0.012237\) of \(R^{2} = 0.08536\) , though the p-value associated with \(\hat{\beta_{1}}\), \(p = 0.001533 < 0.05\) is statistically significant. We also noticed that, our data mainly passes the assumptions involved in the residual analysis, mainly that it exhibits constant variance, and that the residuals are distributed normally. But we notice one interesting detail: a low value of \(\hat{\beta_{1}}\) should imply we did not reject \(H_{0}: \beta_{1} = 0\) but \(p= 0.001533 < 0.05\). To find out why, we extend our model adequacy checking procedure to a segmented version of aatemp.
## Data Segmentation split_aatemp <-split(aatemp, rep(1:5, each =23, length.out =115))## Assign a vector of titlestitles <-c("Average annual temperature,\n 1854-1904","Average annual temperature,\n 1905-1927","Average annual temperature,\n 1928-1951", "Average annual temperature,\n 1952-1977","Average annual temperature,\n 1978-2000")## Summary Statistics and plotsfor(i inseq_along(split_aatemp)) { model <-lm(temp ~ year, data = split_aatemp[[i]])cat("For", titles[i], "the summary statistics are:\n")print(summary(model))# Set up the multi-panel layout and plot diagnosticspar(mfrow =c(2, 2)) # 'oma' sets the outer margins# Plot diagnostics without default titlesplot(model, which =1, main ="") # Residuals vs Fittedplot(model, which =2, main ="") # Normal Q-Qplot(model, which =3, main ="") # Scale-Locationplot(model, which =5, main ="") # Residuals vs Leverage# Add a large title above the panelsmtext(titles[i], outer =TRUE, cex =1.5, font =1, line =2)# Shapiro-Wilk test for normality of residuals shapiro_result <-shapiro.test(model$residuals) shapiro_p_value <- shapiro_result$p.value normality <-ifelse(shapiro_p_value <0.05, "the data is not normally distributed.", "the data is normally distributed.")# Breusch-Pagan test for homoskedasticity bptest_result <-bptest(model) bptest_p_value <- bptest_result$p.value homoskedasticity <-ifelse(bptest_p_value <0.05, "the data is not homoskedastic.", "the data is homoskedastic.")# Print the final conclusions along with p-valuescat("For", titles[i], ":\n")cat(normality, " (p-value:", shapiro_p_value, ")\n")cat(homoskedasticity, " (p-value:", bptest_p_value, ")\n\n")}
For Average annual temperature,
1854-1904 the summary statistics are:
Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])
Residuals:
Min 1Q Median 3Q Max
-3.7138 -0.9173 0.6097 1.1532 1.9000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.53361 46.26341 1.438 0.165
year -0.01030 0.02449 -0.420 0.678
Residual standard error: 1.575 on 21 degrees of freedom
Multiple R-squared: 0.008349, Adjusted R-squared: -0.03887
F-statistic: 0.1768 on 1 and 21 DF, p-value: 0.6784
For Average annual temperature,
1854-1904 :
the data is normally distributed. (p-value: 0.07282286 )
the data is homoskedastic. (p-value: 0.7130842 )
For Average annual temperature,
1905-1927 the summary statistics are:
Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])
Residuals:
Min 1Q Median 3Q Max
-3.3655 -0.7004 -0.1575 1.2605 3.0852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.997510 90.932229 0.308 0.761
year 0.009832 0.047459 0.207 0.838
Residual standard error: 1.51 on 21 degrees of freedom
Multiple R-squared: 0.00204, Adjusted R-squared: -0.04548
F-statistic: 0.04292 on 1 and 21 DF, p-value: 0.8379
For Average annual temperature,
1905-1927 :
the data is normally distributed. (p-value: 0.7714022 )
the data is homoskedastic. (p-value: 0.637467 )
For Average annual temperature,
1928-1951 the summary statistics are:
Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.8936 -1.0171 -0.1107 1.0161 2.7330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.818793 78.304786 0.700 0.492
year -0.003305 0.040383 -0.082 0.936
Residual standard error: 1.299 on 21 degrees of freedom
Multiple R-squared: 0.0003188, Adjusted R-squared: -0.04729
F-statistic: 0.006698 on 1 and 21 DF, p-value: 0.9355
For Average annual temperature,
1928-1951 :
the data is normally distributed. (p-value: 0.5013498 )
the data is homoskedastic. (p-value: 0.3985379 )
For Average annual temperature,
1952-1977 the summary statistics are:
Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.77097 -0.62019 -0.07366 0.78329 2.28406
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 152.09265 59.67824 2.549 0.0187 *
year -0.05269 0.03036 -1.736 0.0973 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.063 on 21 degrees of freedom
Multiple R-squared: 0.1254, Adjusted R-squared: 0.08379
F-statistic: 3.012 on 1 and 21 DF, p-value: 0.0973
For Average annual temperature,
1952-1977 :
the data is normally distributed. (p-value: 0.755814 )
the data is homoskedastic. (p-value: 0.9602909 )
For Average annual temperature,
1978-2000 the summary statistics are:
Call:
lm(formula = temp ~ year, data = split_aatemp[[i]])
Residuals:
Min 1Q Median 3Q Max
-2.1084 -0.9165 -0.1129 0.7180 3.1465
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -146.14046 85.68566 -1.706 0.1028
year 0.09754 0.04308 2.264 0.0343 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.37 on 21 degrees of freedom
Multiple R-squared: 0.1962, Adjusted R-squared: 0.1579
F-statistic: 5.126 on 1 and 21 DF, p-value: 0.03427
For Average annual temperature,
1978-2000 :
the data is normally distributed. (p-value: 0.5136732 )
the data is not homoskedastic. (p-value: 0.0474906 )
# Reset plotting parameters to defaultpar(mfrow =c(1, 1))
Moving to our segmented data, we noticed that all our models from the data segments display similar characteristics, which are low values of \(\hat{\beta_{1}}\) and \(R^{2}\), and \(p \geq 0.05\). All of the model residuals of our data segments are also normally distributed and most fulfill the homoscedasticity. However, the model from the data segment from years \(1978-2000\) failed to fulfill the homoscedasticity assumption. As such, this non-constant variance might explain the seemingly contradictory results that we get from our model from the unsegmented data. Thus we can conclude that year is not an effective explanation to temp.
Thus, we can conclude that year_temp is not able give us much insight about aatemp. Since there is not much insight to obtain from aatempt, we will proceed with the dataset divusa.