1 Overview
2 Raw Data
- 2.1 Weather
  - 2.1.1 Numerical Data Fields
  - 2.1.2 Categorical or Incident Report Fields
- 2.2 Other Meteorological Data
  - 2.2.1 Lake Huron-Michigan Water level
  - 2.2.2 Simulated Solar Irradiation Data
3 Processing Raw Data
- 3.1 Handling Missing Data
- 3.2 Preparing Year Records
4 Training and Final Test Sets
5 Preliminary Selection of Variables
- 5.1 Heat Map
- 5.2 Reduced Variable Set for Model Building
6 Unsupervised Learning and Exploration
7 Supervised Learning and Training Models
8 Predictions
- 8.1 Average June Rain
9 Accuracy
10 Appendix

1 Overview

The Farmer's Almanac advises that “A cold and wet June spoils the rest of the year”. It might seem foolhardy to improve upon a maxim like this one or try to predict anything about weather, but this investigation nevertheless considers whether anything might be gained if we apply machine-learning or artificial intelligence to weather data records from O’Hare Airport at Chicago, Illinois covering the years from 1960 to 2018. This analysis certainly shows the difficulties in weather prediction, but perhaps it shows that the likelihood or probability of a wet June appears to be weakly correlated with snow in February and cold weather in early spring and other variables over the previous months of a yearly weather record.

For exploration and unsupervised learning, we first explore the relationship between June rainfall and other weather data by computing correlations between weather variables in Section-5 and performing cluster analysis in Section-6.1. Based on the correlation results, a preliminary ordinary-least-squares analysis is performed in Section-6.2 for every possible combination of a reduced set of six column variables which have relatively high correlation with June rain. This was followed by principal components analysis for the same reduced set of six variables in Section-6.3.

For supervised machine learning and model training, we develop models for four algorithms that predict the amount of June rain and compare the mean-squared error for these models using cross-validation in Section-7.5. Finally, we perform final tests for which the bias, variance, and mean-square-error is presented for each method in Section-9.

The low correlation of June rain with February snow and cold weather in March and April changes the probability of a dry June a small but statistically significant amount, particularly for wet cluster years with a predicted rain level exceeds 100-mm. The years in the dry cluster seem to have a constant rain level, independent of the amount of predicted rainfall. None of the dry cluster years have June rain predictions above 100-mm while the three wettest years have predictions above 100-mm as shown in the exploratory plot of Figure-6.

The reproducible code for this project and report is shared at this address:

https://github.com/marcelMerchat/dry_june_in_illinois

2 Raw Data

2.1 Weather

The raw data for O’Hare Airport was automatically downloaded following the application programming interface (API) at the National Centers for Environmental Information (NCEI) for the United States Government. The data covers the period beginning with the 1958-1959 snow year and ends with the 2018-2019 snow year where a snow year begins on July 1 and ends the following calendar year on June 30. The query parameters at the end of the internet address specify Weather Station USW00094846 at O’Hare Airport, starting date on 1958-01-01, ending date of 2019-07-01, and comma separated data formatting. The download was saved as a file named ChicagoWeather2018.csv.

2.1.1 Numerical Data Fields

These fields were selected for analysis.

PRCP, Precipitation (tenths of mm)

SNOW, Snowfall (mm)

SNWD, Snow depth (mm)

TMAX, Maximum temperature in tenths of degrees (°C)

TMIN, Minimum temperature in tenths of degrees (°C)

AWND, Average daily wind speed (tenths of meters per second)

WDF2, Direction of fastest 2-minute wind (degrees)

WSF2, Fastest 2-minute wind speed (meters per second)

WSF5, Gust intensity as fastest 5-second wind speed (meters per second)

The raw data was processed to make a more uniform data table or frame where units as tenths of millimeters or tenths of degrees are simply expressed as millimeters or degrees respectively.

2.1.2 Categorical or Incident Report Fields

These fields indicate whether not something occurred and was reported for a given day. For example, the WT03 field indicates if thunder was reported or not. A field with two possible values is an example of a categorical variable.

WT01, Fog, ice fog, or freezing fog (may include heavy fog)

WT03, Thunder

WT05, Hail

WT08, Smoke or haze

WT09, Blowing or drifting snow

WT11, High or damaging winds

WT13, Mist

WT16, Rain (may include freezing rain, drizzle, and freezing drizzle)

WT17, Freezing rain

WT18, Snow, snow pellets, snow grains, or ice crystals

2.2 Other Meteorological Data

Although we only use the above weather data for our final predictions, we also explored if the level of Lakes Huron-Michigan or solar irradiation levels were correlated with June rain, but this other meteorological data was eliminated from model-building and prediction as the airport weather data provides most of the predictive power.

2.2.1 Lake Huron-Michigan Water level

The Army Corp of Engineering considers Lakes Huron and Michigan as a single body of water with the same average water level. The monthly mean water level for Station 9075014 at Harbor Beach, MI and Station 9087044 at Calumet Harbor in Illinois was automatically downloaded from the NOAA government website with query parameters to select the years beginning with the 1958-1959 snow year and ending with the 2018-2019 snow year where a snow year begins on July 1 and ends the following calendar year on June 30.

9075014 Harbor Beach, MI: “huron_michigan_harbor_beach19592018.csv”

9087044 Calumet Harbor, IL: “huron_michigan_calumet_harbor19592018.csv.”

After downloading the data for two stations in order to handle the problem of a few missing days, the problem was alternately solved by downloading month average water levels. The monthly mean water level for Station 9087044 at Calumet Harbor was used.

As only airport weather data was selected for the final models, the dimension of the solution was reduced by elimination of lake level variables from the model. However, the lake data is still part of the data frame and file for all variables.

2.2.2 Simulated Solar Irradiation Data

Solar data was not selected for the final models despite having some low correlation with June rain as the weather variables in the airport data have higher correlation. As shown in Table-1, the highest absolute correlation for solar irradiation for a simplified solar system model is about 0.25 whereas weather variables such as February snow depth and March temperature have correlation of at least 0.3. Since it might be unwarranted to discard the perturbation of the earth’s orbit and the solar irradiation levels for the earth caused by the gravitational pull of Jupiter without understanding how insignificant this might be, the solar data requires further study. The description of the simulated solar irradiation data is included in the Appendix at Section-10.6. The average yearly irradiation and month averages for January and July are listed in Table-21 at Section-10.6.4.

The code for generating the raw solar data is contained in the file named SunJupiterEarthSimulation.R. The raw data was processed to produce monthly and year averages as described below under preparing year records.

3 Processing Raw Data

3.1 Handling Missing Data

The data quality for the airport and lake stations are very high. However, as fields were left blank where zero was obvious such as reports of blowing-snow for July at Chicago and other binary categorical fields, there was some need convert the blanks to zero for numerical analysis. Where data is missing for binary categorical fields, blanks were replaced with 0 which helps data processing to identify missing data versus other kinds of blank spaces in files. The binary categorical data fields are listed in Section-2.1.2 above. Another type of obvious missing data is wind data for years before wind data was recorded. In this case, missing data is indicated as NA which is useful in data processing to identify missing data among other kinds of blank spaces in files. Here is a list of changes for the modified data file:

There are separate columns for the year, month, and date as integers.
Temperatures are now indicated in degrees Centigrade. In the original data, the temperature is multiplied by 10 to eliminate any decimals.
Rain precipitation is converted to millimeters where the raw PRCP field was expressed in tenths of a millimeter without decimals. For example, 1.2-mm of precipitation was originally recorded as 12 which may have been helpful for data entry. In the new data file, 1.2-mm of rain is expressed as 1.2.
The data begins in 1959 when maximum and minimum temperature are first reported over a full year.
Before average wind speed data was reported beginning in 1984, this data is indicated as NA in the new file. Since peak daily two-minute wind speed and direction as well as peak 5-second wind gusts are first reported for a full year in 1997, the missing data before 1997 for these fields are also indicated by NA.

The processed version of the daily raw weather data was saved as this file:

ChicagoWeatherMetricUnits.csv

The code that accomplishes this is in this file:

prepareWeatherAndLakeLevelYearRecords.R

This file also contains other data-processing code for making the file named yearlyChicagoWeatherAndLakeLevel.csv that is described directly below for preparing year records.

3.2 Preparing Year Records

The raw daily record data was processed to produce snow year records beginning with the year designated as 1959 from July 1959 to June 1960 as the first year. For each raw daily weather variable, a year record is constructed from twenty-two variables including nineteen daily weather variables and three lake-level variables for high, mean, and low average monthly water levels. For each of the twenty-two variables, the computed year record includes twelve monthly variables plus a thirteenth variable as the average for the entire year. A year record consists of 286 columns (13 X 22) plus one variable for the year. As a result, a data frame with 287 columns was prepared and saved as a file named yearlyChicagoWeatherAndLakeLevel.csv.

After generating the raw simulated irradiation data, the raw irradiation data was processed to produce a data frame and file of year records that corresponds to the file of weather and lake-levels. A year record for irradiation includes twelve monthly averages plus a thirteenth variable for the yearly average plus one variable for the year. The data frame of irradiation year records was saved as a file named yearlyIrradiationData.csv. The code for generating the raw irradiation data and year records is in the file named SunJupiterEarthSimulation.R

The 13 solar irradiation columns were added to the 287 columns of weather and lake-levels and the combined file of 300 columns was saved as a file named yearlyChicagoMeteorologicalWeather.csv. The code for the final merging of data is at the end of the file named SunJupiterEarthSimulation.R.

4 Training and Final Test Sets

The data science exploration and machine-learning begins by importing the data from the final merged file named yearlyChicagoMeteorologicalWeather.csv. The data was divided into training and final test sets. Approximately 70% of the year records, forty-four records, were randomly selected for training models and the remaining sixteen records were reserved for a formal final test. Although cross-validation appears not to affect the final model building process, the cross-validation comparison of different models in Section-7.5 was performed using only the training set to help insure an independent final test as this is intended as a scientific study.

5 Preliminary Selection of Variables

Variables with Correlation of 0.2 or Greater with June Rain

We begin by selecting variables with at least some significant correlation with June rain. Since there are nearly three hundred variables in the year records file described by Section-3.2, it is desirable to eliminate some of these. Therefore, we select approximately twenty variables or columns of data with absolute correlation of at least 0.2 with June rain.

As negatively correlated variables are just as predictive as positively-correlated ones, the variables are listed by the order of the magnitude of the absolute value of the correlation. Many of these variables are included in the heat map below in Figure-1.

Certain unlikely events such as funnel clouds and damaging winds were removed because there are not enough incidents to include them in a statistical model. Freezing rain is removed because it depends on a particular unlikely temperature to occur. Where warm weather is a factor, average minimum temperature is replaced by average maximum temperature to eliminate confusion because average daily high and low temperatures are well-correlated.

Absolute June Correlation

Table-1
Variable	Correlation
JunRain	1.00
MarThunder	-0.39
FebSnowDepth	0.36
FebSnow	0.36
MarAverageMaxTemp	-0.33
AugFog	-0.31
FebBlowingSnow	0.30
AprFog	0.30
AugRain_wt16	-0.29
MarRain_wt16	-0.28
MarBlowingSnow	0.28
MarSnowDepth	0.28
MayThunder	-0.27
NovAverageMaxTemp	-0.27
NovRain_wt16	-0.26
JanAverageIrrad	-0.25
MarSnow	0.24
AprRain	0.23
AugThunder	-0.22
FebAverageMinTemp	-0.21
AprAverageMaxTemp	-0.21
MarAverageIrrad	-0.20

5.1 Heat Map

In the graph below, light-brown and red hues indicates positive correlation and green indicates negative correlation. The squares across the top row indicate the correlation of the variables with June rain. As the map is symmetrical about the red diagonal line, the squares along the right edge reflect the same data. Since the map is symmetrical about the diagonal line of red squares, we only need half of the map and could erase the other part. The red diagonal line of squares from the lower left to the upper right of the map contains zero information as it indicates the trivial self-correlation of the variables with themselves for which r = 1. If we remove the top row and the right-most column, the remainder is a heat map of the self-correlation of only the predictive variables.

Positive correlation with June rain is indicated by the light-brown in the right-most column. These squares correspond to snow, average snow depth, and blowing-snow in February and March as well as April rain and fog. Negative-correlation in green corresponds to solar radiation from January to March, rain in August, November, and March; thunder in August, March, and May; above average temperatures in late fall in November and late winter and early spring from February to April.

Negatively correlated variables have the same predictive power as positively correlated ones. Later, we will use principal components analysis to generate an equivalent set of independent variables for which the mutual correlation is zero.

The correlation here is the Pearson Correlation Coefficient.

5.2 Reduced Variable Set for Model Building

The Reduced Group

A set of six variables that we will refer to as the reduced group was chosen for an ANOVA comparison of models constructed with all possible combinations of these variables. The selection is based on having a relatively high correlation with June rain in the range from 0.25 to 0.5 as shown in Table-1. The reduced group is also used for principal components analysis (PCA) in Section-6.3 although PCA could conceivably have been applied to the larger preliminary group of about 20 variables too. The six selected variables are listed in Table-2 below.

Variable Names and Definitions

Table-2
Variable	Description	Alias	Name	Units	Field
x1	Average snow depth for February	February snow depth	FebSnowDepth	mm	SNWD
x2	February blowing snow days	February blowing snow	FebBlowingSnow		WT09
x3	March thunderstorm days	March thunder	MarThunder		WT03
x4	Average daily high temperature for March	March temperature	MarAverageMaxTemp	°C	TMAX
x5	Average daily high temperature for November	November temperature	NovAverageMaxTemp	°C	TMIN
x6	March Rain	March rain	MarRain		WT16

The field names in the right-most column are defined by the GHCND documentation for Station USW00094846. The WT03, WT09, and WT16 fields were processed to become ratios without units. In the raw data, these names indicated whether not there was blowing-snow, thunder, or rain on a particular day, which was processed to compute a fraction of month days. The same field name may appear more than once as the raw data is processed to compute month averages.

Data Summary for the Reduced Group of Variables

Table-3

Some variables were eliminated where mutually correlated variables appear in pairs such as high and low daily temperatures or monthly snowfall and average snow depth. Variables with quantitative numerical data were favored over variables based on observational judgement such as fog reports. Examples of this are Fog in April and in August of the previous summer with June rain correlations of 0.3 and -0.31 respectively. But this certainly indicates that further study of fog might be worthwhile, especially if quantitative measurements become available. March blowing-snow reports (MarBlowingSnow) were not used for a similar reason and because February blowing-snow reports were more common.

The reduced set of six variables was subjected to ANOVA analysis based on an ordinary-least-squares model described in Section-6.2. Based on the ANOVA model comparison in Table-6 of Section-6.2.2, we will select one of the most promising combinations for supervised machine learning in Section-8 for the OLS and Partial-Least-Squares (PLS) models.

6 Unsupervised Learning and Exploration

Two types of unsupervised exploration of the records in the training dataset that only depend on the data are these.

Computing correlations between the data columns. A column of data is a particular instance or representation of a variable. A non-zero correlation between two columns indicates a relationship between two variables. In the case of a regression line, the slope of the line is equal to the correlation.
Cluster analysis.

Since there are almost three hundred variables, there is a need to eliminate some of them to even begin cluster analysis. Any elimination of variables might initially appear to be part of model building, but this is effectively unsupervised if we discard uncorrelated variables based only on the data. Similarly, principal components analysis is unsupervised learning based only on data. When we finally begin building prediction models in Section-8, we seek to include additional prior understanding and knowledge about the world to help construct an algorithm.

6.1 Cluster Analysis

Searching for clusters is a form of unsupervised learning that can help discover patterns and relationships among variables. Consider an example of dot patterns data where the patterns might be symbols. We use K-means cluster analysis to determine if we can separate the dot patterns into clusters that correspond to the symbols. We have not assigned names or labeled the patterns with respect to anything outside of the data that requires prior knowledge or assumptions yet. Any such labeling would be beyond cluster analysis and unsupervised learning. We cannot build a predictive model from only cluster analysis; building a predictive model involves some related prior knowledge about the set of possible symbols in order to map the clusters to particular symbols.

When we divide the year records into two clusters using K-means analysis, the resulting clusters correspond to groups with different average rain levels which we identify as dry and wet year records. This process is purely mathematical until we assign dry and wet labels to the clusters.

Metric Cluster Data

Table-4
Cluster	June Rain (mm)	February Snow Depth (mm)	February Blowing-Snow	March Thunder	March High (°C)	November High (°C)	March Rain WT16
dry	70.6	30.5	0.022	0.093	9.3	9.8	0.411
wet	130.6	109.0	0.067	0.039	5.8	8.0	0.253

English Units

Table-5
Cluster	June Rain (inches)	February Snow Depth (inches)	February Blowing-Snow	March Thunder	March High (°F)	November High (°F)	March Rain WT16
dry	2.8	1.2	0.022	0.093	48.7	49.6	0.411
wet	5.1	4.3	0.067	0.039	42.5	46.5	0.253

A two-cluster analysis of the reduced group of variables separates the data into wet and dry years. As shown in Table-4 above, the dry cluster has average June rain of 70.6-mm and the wet year cluster has average June rain of 130.6-mm. The dry years have average February snow depth of 30.5-mm while wet years have average February snow depth of 109-mm.

6.1.1 February Snow Depth

The plot below shows dry and wet year clusters on a plot of June rain vs. February snow depth. As shown earlier in Table-1, the correlation between average February snow depth (FebSnowDepth) and June rain is 0.36. Yellow-brown text color indicates dry cluster years in the plot below; blue-green corresponds to wet years.

Cluster analysis divided the years into two groups where the wet cluster includes the eight cases of June rain over 150-mm such as 1992, 2017, and 1966. The dry year cluster of less than 50-mm of June rain in the lower left corner of the plot consists of the five years 1981, 1987, 1990, 1994, and 2004. Higher June rain levels appear to be related to snow depth in February in these cases.

February snow depth over 90-mm coincides with wet cluster years.
For dry cluster years, February snow depth is less than 90-mm.
For February snow depth exceeding 90-mm, average June rain is 129.6-mm.
For February snow depth less than 90-mm, average June rain is 89.7-mm.
There is some overlap of the wet and dry clusters such as years 1979, 2000 and 2018 appearing in the boundary area between the two clusters.
Wet Junes occurred without any February snow in five years including 1969, 1970, 1995, 1998, and 2000. The grouping of these years in two bunches around 1970 and 1995-2000 might be investigated further.

6.1.2 March Thunderstorms

As shown earlier in Table-1, the correlation between March thunderstorms (MarThunder) and June rain is negative with a value of -0.39.

For one or less thunderstorm days, average June rain is 123.6-mm.
For two or more thunderstorm days, average June rain is 81.6-mm.
For four or more thunderstorm days, average June rain is 74.7-mm.
The eight years with over 150-mm of June rain correspond to years with no more than two thunderstorm days.
That thunderstorms in March have a relationship with dry weather in June is likely a result of the association of thunderstorms with warm weather in March as is explored in the next plot. Thunderstorms are obviously more unlikely in cold winter weather in early March.

6.1.3 March Daily High Temperature

As shown earlier in Table-1, there is negative correlation between March daily high temperature (MarAverageMaxTemp) and June rain of -0.33.

The seven dry years with June rain below 50-mm are easily spotted at the bottom of the above plot.

Wet cluster years have average March daily high temperature of 5.8°C (42.5°F) compared to 9.3°C (48.7°F) for dry cluster years.
Years with average March temperature greater than 10°C belong to the dry cluster and have average June rain of 77.3-mm.
Years with average March temperature of less than 5°C belong to the wet cluster and have average June rain of 119.9-mm. For average temperature greater than 5°C, average June rain is 95.0-mm.

6.1.4 March Rain Days

As shown earlier in Table-1, the correlation between March rain days (MarRain_wt16) and June rain is negative with a value of -0.28.

The dry years at the bottom with nine or more days of March rain reports and June rain below 50-mm are easily spotted at the bottom of the above plot.

For ten or more March days with a report of rain, average June rain is 91.1-mm.
For nine or less March days with a report of rain, average June rain is 108.6-mm.
The wet cluster has a uniform distribution over the full range of the x-axis. Eight days or less of March rain coincides with at least moderate June rain except in 2015.
For the dry group, there are at least nine March days with rain reports except in 2015.

6.2 Preliminary Ordinary Least Squares Analysis

6.2.1 General Model for Ordinary-Least-Squares

We are not prepared to build our model yet, although even the general form of Equation-1 below might be the beginning of something like that. We continue exploring in an unsupervised manner by evaluating all possible ordinary-least-square (OLS) models that are combinations of the six selected variables of the reduced group described in Section 5.2. However, supervised learning certainly begins when we compare model performances and proceed to select what is considered the best model in Section-6.2.3. We will employ supervised machine-learning to build OLS and other classes of models beginning in Section-8.

A prediction or estimate is calculated as a linear combination of the six variables. The dimension of a solution is reduced and a variable is eliminated from a model by setting a \(c_i\) coefficient to zero. Next, the remaining coefficients computed according to the matrix solution in the appendix at Section-10.3; there is a standard OLS solution available too. To calculate the estimate, the contributions of the component variables are summed over using the known \(x_i\) values from the data record for the estimated prediction as follows:

6.2.1.0.0.1 \[\hat {y} = c_0 \; + \; c_1 x_1 \; + \; c_2 x_2 \; + \; c_3 x_3 \; + \; c_4 x_4 \; + \; c_5 x_5 \; + \; c_6 x_6\]

6.2.2 Model Comparisons

Here are rankings of the sixty-three models that are constructed with the possible combinations of the six variables in the reduced group.

MSE in the third column is an acronym meaning Mean-Square-Error. Similar to the mean, this is one of the most common statistical measurements and represents the true variance of a population, but we usually only have access to a sample from the population. The raw data is often used to estimate the true variance using a closely related statistic called the sample variance.
The \(C_p\) statistic in the fourth column is an underfitting and overfitting statistic that penalizes the size or the number of variables included in the model. While it has a complexity penalty for each variable added to a model, it also has a penalty for the estimated variance which is equal to the square of the sample standard deviation. The \(C_p\) statistic is considered reasonable if it is roughly equal to the number of variables in the model.
PRESS in the fifth column is an acronym meaning Prediction Sum-Of-Squares. The PRESS statistic is a modified error sum-of-squares based on a ‘deleted’ delta figure that reflects a statistical confidence level. For each record, the delta figure is calculated with the record deleted in a manner very similar to Leave-One-Out-Cross-Validation. There is more information about the PRESS statistic in the Appendix at Section-10.4.

Model Comparison

Table-6

The first two models have the best \(C_p\) and PRESS figures. The first model in the list composed of x1 and x3 has the best \(C_p\) figure of 1.59, primarily because the small model size of two variables minimizes a complexity penalty.

The models with the lowest MSE of 2120 are the second (x1, x3, x6) and sixth (x1,x3,x5,x6) models.

The predictions for this preliminary ANOVA analysis are inherently overfitted because the actual outcome for the record being predicted is included in the model calculations. For final testing, which begins at Section-8, the model will be tested using independent data records.

6.2.3 Best Preliminary OLS Model

The model in the second row of Table-6 that is composed of the x1, x3, and x6 variables was selected as the best preliminary model. This model has the lowest PRESS statistic and is tied for best MSE. It has the second best \(C_p\) statistic. Recall the original names for these variables:

x1: February snow-depth
x3: March thunderstorm days
x6: March rain days

For the purpose of preliminary exploration, the results for the selected OLS model is presented next.

6.2.4 Predictions for Best Preliminary Model

The outcome for June Rain in the first column corresponds to the y-axis in Figure-6 below. Similarly, the Fitted Prediction column corresponds to the x-axis in Figure-6.

A predictions generated for a record in the training data is called a Fitted Predictions because the record was used to help build the model. A Residual Error is an error generated during model building; it is the difference between an outcome taken from a data record in the training dataset and the prediction for the record generated by a model built from the same training dataset—not final testing errors with independent data records outside of the training dataset. Residual errors will typically be less on average than final testing errors because the errors are from records that were included for training the model which inherently reduces the expected error which is called overfitting. It is even possible to build an overfitted model with near zero residual errors by finding the curve that goes through all of the data points, but this typically make the errors worse for new data records due to overfitting. Although a straight regression line typically doesn’t pass through any of the points when there are three or more points, predictions for the data records in the training set are still overfitted.

Residual Prediction Errors

Table-7

6.2.5 Plot of June Rain versus Fitted Predictions

Considering the outcomes and the fitted predictions for the training data, we construct a best fitted regression line in the plot below. The residual error is indicated by the vertical displacement of the points above or below the solid best-fitted line. The dot color indicates the cluster that is assigned by means of k-means analysis.

Dry years with less than 50-mm of June rainfall have predictions below 100-mm while the three wettest years have predictions above 100-mm.

For years in the blue wet cluster, the positive slope of the regression line seems to help predict the outcome, but this does hold for the dry year cluster where the outcome seems to be independent of the predicted rain level. The relatively large variance for June rain is more crucial when the predicted amount of rainfall is below 100-mm compared to years with more rain. Predictions appear possibly more useful for large rain levels.

The variance for the blue wet cluster is larger for predictions in the range below 130-mm.

The shaded area indicates the standard confidence limits for the estimated mean. The wider limits between the light-green dotted-and-dashed lines apply to the population of individual years. The 95% confidence levels are used.

Predictions for new independent data records will be inherently less accurate. The results for the independent testing set are plotted in Figure-10 in Section-9.3.

Confidence Limits

Table-8

‘Fit’ refers to a prediction made using a data record from the training dataset for building the model.

‘SE fit’ refers to the standard error of the mean which corresponds to the uncertainty of the expected average outcome that falls on the regression line.

In columns 3 and 4, STD-Low-Limit and STD-High-limit are the 95% confidence limits for the standard error of the mean which falls on the regression line. The two right-most columns named Low-Limit and High-limit are the wider 95% confidence limits for a particular year. The computations for these limits were performed using the matrix formulas in Section-12.5 of the Walpole textbook (Walpole, Myers, Myers, Ye, 2007, chap. 12, pp. 458, 460). The calculations are built into the code in the following computer file used to generates this report.

Dry_Illinois_June.Rmd

6.2.6 Null Hypothesis Test for Coefficients

Does the Model Explain Anything about June Rain?

The degree of contribution for each variable is determined and whether or not its coefficient is statistically different from zero is summarized in the table below. The probability that the contribution is non-zero is provided by the p-value. is used to determine this probability.

Coefficients of Components

Table-9
Coefficient	Estimate	Standard Error	t	p-value
\(c_0\)	123.73135	17.8983	6.9130	0.00000
\(c_1\)	0.20200	0.0783	2.5805	0.01364
\(c_2\)	-290.03731	138.5217	-2.0938	0.04266
\(c_3\)	-57.63097	48.6445	-1.1847	0.24311

Does the model explain anything meaningful? What is the probability that the slope is actually zero? The non-zero slope of the regression line provides some predictive power that improves our estimate and reduces prediction error, but the slope is a random variable that varies from sample group to another sample group for a different set of records. The probability that the true slope is actually zero instead of the one shown in Figure-1 is expressed as a probability or p-value. The p-value of the slope is actually zero is 0.01364 to five significant decimal places over a 95% confidence interval.

As there are several dozen columns in the data that have non-zero correlation, we consider algorithms or models with multiple variables or covariates. Beyond merely non-zero correlation, at a minimum we would perhaps hope that the correlation is statistically different from zero for at least one column and indeed there is at least one as the p-value of 0.004 for February snow depth correlation is significant. If we compare the estimates with actual amount of rain in Figure-2, we might also consider the probability that the true slope of the best-fitted line would be zero.

The expected value for each coefficient is used to calculate a contribution to the prediction, but there is always a probability that the component is actually 0 and has no effect at all; that is, the coefficient merely fluctuates around zero for a small sample of years but would average to zero if enough years were included in the sample. The probability each coefficient is 0 is provided in the table below as the p-value in the far right column.

The standard NULL Hypothesis assumes that the coefficient is non-zero if the p-value is less than 0.05. A natural question might be to question why any statistically insignificant variables are included the model and this is a factor. But any such definition of insignificance such as 0.05 is arbitrary and the goal is to make the best prediction. It certainly is a consideration for the model if the p-value for some components is greater than 0.05, but here we are trying to make the best possible prediction for the amount of rain in June.

6.2.7 ANOVA: Analysis of Variance

An ANOVA analysis for the selected ordinary-least-squares model is next presented. The degree of lack-of-fit is summarized in the table below.

ANOVA Summary

Table-10
Variation	Degrees of Freedom	Sum of Squares	Mean Square	F-value	p-value
model	3	34679	11560	5.45	0.02429
error	40	84812	2120
total	43	119688

The third column that is named Sum of Squares has the following explanation. The entry in the first row presents the variation explained the model curve or regression line (SSR). The entry in the second row is the error (SSE), variation that is not explained by the model. The entry in the third row is the total variation (SST) with respect to the mean.

In ANOVA analysis, the fourth column named Mean Square has the following explanation: This column is obtained by dividing the Sum of Squares column by the Degrees of Freedom column. For the error row, the Sum of Squares column entry is the SSE; since it was obtained by dividing by the Degrees of Freedom, the entry in the Mean Square column for the error row is different from the common mean-squared-error (MSE) or root-mean-squared-error (RMSE) that uses the sample size in the formula.

The ANOVA definition for Mean Square provides a step in the process to find the F statistic in Column-5 and corresponding p-value in Column-6. The F-Value is the ratio of the Mean Square entry in the first row (the model row) to the Mean Square entry in the second row (the error row). This ratio of two variances is a F statistic that describes the significance of Mean Square variance of 11560 explained by the regression model compared to the Mean Square variance of 2120 for the residual error. The null hypothesis is that the ratio is equal to one, that the mean square error explained by the model is same as the unexplained mean square error. In this case, the F ratio is about 5.45 and perhaps this is significantly different than 1 to be only a random fluctuation. The p-value is the probability that the null hypothesis is true, that is, the ratio is equal to one, the two true variances are the same. Since the p-value of 0.024 is less than 0.05, this weather model should not be rejected yet. It appears to have at least some predictive power and further study might be warranted.

While the t-distribution in Table-9 corresponds to a ratio of a normally distributed variable and a chi-square variable, the F-distribution corresponds to the ratio of two chi-square variables.

Quality of Fit

Table-11
fit_parameter	value
s	46.05
Mean Outcome	97.85
Percent Variance	47.06
\(R^{2}\)	0.29
Adjusted \(R^{2}\)	0.24

s is a particular estimated variance for ANOVA, by definition the square root of 2120 which appears in the error row in the Mean-Square column in Table-10.

After converting to percent, R² is approximately equal to 29% and adjusted R² is approximately 24%. \(R^2\) indicates how much of the variation is explained by the model. If all of the plotted data is on the regression line, the explanation would be perfect and \(R^2\) would have a fractional value of 1 or 100%. The remaining 71% portion of R² is not explained by the model.

Percent Variance is similar to coeff var in the SAS output in Figure 12.1 of Walpole’s textbook (Walpole, Myers, Myers, Ye, 2007, chap. 12, 462). It is the ratio of s to the mean outcome expressed as a percentage.

6.3 Principal Components Analysis

Preprocessing for Machine-learning

Principal Components Analysis or PCA transforms the variables to an alternative system of independent variables which are inherently uncorrelated. PCA helps eliminate unimportant dimensions and thereby simplify a model and this is particularly facilitated because the method automatically orders the transformed variables by their importance. Variable importance is determined by the magnitude of the eigenvalues which are the coefficients of the transformed variables. Here we start with the reduced group of six variables.

6.3.1 Scree plot

6.3.2 Cumulative Proportion

The cumulative proportion plot indicates the importance of the independent variables with the largest eigenvalues and suggests how many variables would included a simplified model with fewer dimensions where each independent variable corresponds to a dimension.

The principal components are combinations of the scaled raw data variables. Without scaling, there is only one significant component of dominated by February snow depth. We need to apply scaling to include the effects of the other variables. Values close to one indicate the component is very similar to an original variable while values close to zero are not very important.

6.3.3 Simplified Model based on PCA

Since the method automatically ordered the transformed variables by their importance, we exploit this to select the first two principal components for a simplified model. As the model with only the first principal component performed just as well, including the first two components is only accomplished for formal reasons.

6.3.4 Interpretation of Principal Components

Eigenvectors

Table-12
	PC-1	PC-2	PC-3	PC-4	PC-5	PC-6
x1	0.228	-0.640	-0.431	0.144	-0.164	-0.553
x2	0.286	-0.641	0.220	-0.300	0.008	0.607
x3	-0.517	-0.270	0.027	-0.316	0.716	-0.217
x4	-0.511	-0.146	0.337	-0.360	-0.658	-0.204
x5	-0.409	-0.293	0.237	0.807	0.001	0.197
x6	-0.412	0.011	-0.772	-0.087	-0.168	0.445

Principal component PC-1 consists of a combination of the six components, but loaded the most with March thunder (x3) and March daily high temperature (x4). Principal component PC-2 is mainly loaded with average February snow depth (x1) and February blowing-snow (x2). PC-3 is loaded with March rain day reports (x6).

7 Supervised Learning and Training Models

Building or training a prediction model is an example of supervised learning that can provide more accurate predictions for an unknown variable in a future data record. We seek to build the best predictive model based on the data and upon our understanding about the relationships of the variables and the world. The prediction model goes beyond the data when it includes prior knowledge to help construct an algorithm. We consider this knowledge and any previous understanding we have about the relationships among the variables as we begin formulate a prediction algorithm. We proceed to build models to predict an unknown variable in a future data record.

Compounding the problem of high variance that is inherent in weather prediction is the small group of sixteen year records that remained for the final test from a complete dataset of sixty weather year records. Due to the small size of this group of records for the final test, cross-validation was used in an attempt to select the best model. After indicating which model has the best cross-validation performance, we still proceed to build final models for each model in the cross-validation comparison and observe how the cross-validation performance is similar to the final test results for the sixteen fresh data records that were held out only for final testing.

Four classes of models were constructed: ordinary-least-squares (OLS), partial least squares (PLS), preprocessing with principal components analysis followed by PLS, and Ridge regression. As this is intended as a scientific study, sixteen year records or approximately thirty percent of the records were reserved for final tests and the remaining forty-four records or approximately seventy percent were used for model building and comparing them using cross-validation.

7.1 Simplified Variable Sets

Model building often involves selecting correlated variables and eliminating others where a variable corresponds to a column of test data or database. While ridge regression and PCA are suitable for eliminating variables and reducing dimensions to simplify a model, other methods such as ordinary-least-squares (OLS) require manual variable selection. Since there are nearly three hundred possible variables for each year record, a preliminary group of twenty-one variables with the best correlation to June rain were initially selected.

Preliminary Group: Twenty-one variables with the best correlation to June rain. (See list in Table-1)
Reduced Group: The Preliminary group of 20 variables was further reduced based on the highest correlation with June rain and some practical considerations to the following reduced group of six variables.
Select Group: The select group of three variables was chosen as follows:; The select group of three variables was chosen based on the earlier model comparison in Section-6.2.2 for the possible combinations of the six variables of the reduced group using ordinary-least-squares (OLS). This model of three variables was considered to have the best \(C_p\) figure in Table-6 using a Leave-One-Out-Cross-Validation (LOOCV) procedure.

7.2 Model Types

Ordinary-least-squares (OLS)
Partial least Squares (PLS)
Preprocessing with Principal Components Analysis (PCA) followed by supervised learning with Partial Least Squares (PLS).
Ridge Regression

PCA or Principal Components Analysis can also help eliminate dimensions and thereby simplify a model. The number of dimensions was reduced to the two most important principal components in the cumulative proportion plot of Section-6.3.2.

7.3 Model Variables

Below we list the variable set from Section-7.1 that is used as input for each model type.

Ordinary-least-squares (OLS): An OLS model was constructed using the select group of three variables with the best \(C_p\) figures
Partial Least Squares (PLS): PLS was applied to the select group of three variables, the same ones as used for OLS.
Principal Components Analysis (PCA): After preprocessing the reduced group of six variables with Principal Components analysis, a Partial Least Squares (PLS) model was constructed using the two most important principal components.
Ridge Regression: As ridge regression handles large numbers of variables, this model was trained on the entire preliminary group of twenty-one variables. Ridge regression was used after first attempting Lasso regression which filtered the predictions from deviating from the overall average rainfall too much resulting in an underfitting-like effect.

7.4 Ridge Regression

Ridge regression was included as a model instead of LASSO regression because the later did not perform well for this rather unpredictable weather problem with only many weakly correlated variables. LASSO causes an underfitting problem with little predictive power for cases at a distance from the expected mean value. Increasing the hyperparameter \(\lambda\) until the model was simplified caused underfitting and loss of sensitivity where the predicted variation from the average response was filtered too much. The underfitting problem is reduced For ridge regression and some predictive power is obtained.

An overview of ridge regression and the related LASSO method is described in the Appendix at Section-10.5. Ridge regression reduces variance by shrinking the regression coefficients by means of the \(\lambda\) hyperparameter named in Equation-19 in Section-10.5.2

7.4.1 Cross-Validation Curve for Optimum \(\lambda\)

The plot below shows the cross-validation curve as a red dotted line with upper and lower standard deviation error bars. Two values of \(\lambda\) are indicated by vertical dotted lines where the one on the left indicates the value of lambda for minimum mean cross-validated error. The one on the right indicates a higher level of penalization that eliminates more variables but introducing more bias error.

The minimum possible mean-square-error (mse) of the cross-validation error is 2170.6 which was obtained using the cv.glmnet function. The value of the regularization hyperparameter \(\lambda\) for the minimization is called \(\lambda_{min}\) which equals 88 in this case. The result is affected by the seed value for the random sequence which was set to 3232 before calling the function. Lower predictive power is perhaps indicated by the \(\lambda_{min}\) value of 88.

Variables associated with storms including March blowing-snow (MarBlowingSnow) have the highest regularization coefficients and appear high on the list.

Thunder appears prominently in March (MarThunder), May (MayThunder), and the previous August (AugThunder), all having negative correlation with June rain.

A wet August has some relationship with dryness in June of the following year: August fog (AugFog), August Thunder (AugThunder), and August Rain (AugRain_WT16) appear with negative correlation. There may be some weak oscillation pattern that reverses in late spring.

March thunder was selected for most other models because of its high absolute correlation with June rain.

Table-13

Regularized Coefficients
Variable	Coefficient
MarBlowingSnow	115.12
MarThunder	-85.68
MayThunder	-49.36
FebBlowingSnow	49.29
AprFog	31.23
AugFog	-30.21
AugThunder	-18.92
AugRain_wt16	-17.63
NovRain_wt16	-17.16
MarRain_wt16	-9.21
NovAverageMaxTemp	-1.40
AprAverageMaxTemp	-1.03
MarAverageMaxTemp	-0.94
JanAverageIrrad	-0.40
MarAverageIrrad	-0.21

7.5 Model Comparisons using Cross-Validation

Cross-validation was performed to compare the performance of alternative performance models. There were 10-folds. Beginning in second column, these methods are Partial Least Squares (PLS), ordinary-least-squares (OLS), preprocessing with Principal Components Analysis (PCA) followed by PLS, and finally ridge regression. MSE is an abbreviation for mean-squared-error.

Summary of Cross-Validation Results

Table-14
Statistic	PLS	OLS	PCA & PLS	Ridge \(\lambda_{min}\)
Average MSE	2147	1988	1844	2170.6

As shown at Section 12.3.1 in Figure-9, setting the value of the hyperparameter \(\lambda\) to 88 minimizes the MSE cross-validation error for ridge regression to 2170.6. This value for the hyperparameter is referred to as \(\lambda_{min}\).

From the performance results of the cross-validation comparison, we would select PCA & PLS as the best model with the lowest average mean-square-error (MSE) of 1844 over all of the folds. Recall that this model consists of PCA preprocessing followed by training with partial-least-squares (PLS). It is difficult to make any conclusions about the accuracy of the average over the folds because the results for different folds are well-correlated, but we have a some idea about the best model.

7.6 Model Building with Resampling

Predictions are generated in the course of model building before the best performing model is selected. The average result over the folds for cross-validation is replaced by an average over resampling with replacement from the training group for the default procedure named bootstrap. In bootstrap resampling, samples are drawn twenty-five times with replacement from the training set. It follows that bootstrap resampling may not include every record. While some records may not be tested and some can possibly be tested multiple times.

8 Predictions

If we have enough records beyond what was needed for model building, we can perform final testing with records which were not used for training or cross-validation. Recognizing the limits inherent in a small group of records, predictions based on the final models were nevertheless tested using the test group of sample records which were not used for training the model. Since there are only limited data consisting of fifty-nine weather year records, we might gain some insight from the cross-validation results above in Table-14.

Table-18 compares the corresponding final models built using the entire training dataset against the holdout test dataset that was not previously used either for cross-validation analysis or training the final model. The overall bias for the final models are shown in Table-17.

The prediction error is the difference between June Rain in the second column and the predictions in Columns 3-6.

Final Prediction Test

Table-15
Year	June Rain	PLS	OLS	PCA & PLS	Ridge
1991	34.4	34.6	34.1	62.3	106.3
1984	50.0	110.8	115.9	104.3	86.3
1983	51.5	98.6	96.9	117.7	106.1
1986	55.6	98.7	98.3	90.2	83.0
1965	75.1	86.7	86.8	76.1	91.3
1964	87.3	108.6	108.0	135.8	139.3
2016	87.4	75.5	67.6	71.3	94.3
1985	88.7	113.3	111.6	95.6	84.5
1971	90.3	78.9	77.2	91.1	117.6
2008	91.5	76.4	77.4	82.2	91.4
1967	105.4	102.2	99.2	85.2	90.5
1973	119.2	54.3	55.9	81.1	70.8
1974	129.1	89.7	88.2	93.8	82.3
2009	156.7	109.4	110.0	93.7	87.7
1968	197.1	120.2	114.6	109.7	114.0
2013	198.3	154.7	153.2	142.0	144.2

Wet years with June rain exceeding 150-mm in 2013, 1968, and 2009 had PLS estimates of at least 109-mm of rain or higher as shown in the last three rows of the table. The four driest years at the top with less than 60-mm of June rain in 1991,1984, 1983, 1986 have PLS estimates below 100-mm except for the year 1984 where the model does not predict a dry year. The year 1984 followed some cold winters before years where the level of Lakes Huron-Michigan was high.

8.1 Average June Rain

A prediction can made before applying regression. Before applying regression, we can predict the average amount of rain for the training group. The same prediction would be made for any new record. The purpose of regression is to improve this, to reduce the mean-squared error compared to what it is without regression.

Average June Rain in millimeters

Table-16
Training	Final Test
98	101

The rather small number of years in the test group can easily cause some appreciable difference in the average rain for the test group. Nearly 200-mm of rain in the years 1968 and 2017 helps to increase the average the test group. It will be later in the century before this improves even a little.

9 Accuracy

9.1 Prediction Bias

The bias of the average prediction is shown in the table below. Although cross-validation would generally be used to select the best model, we still show final test results for all of the models that were compared using cross-validation. Based on the cross-validation results, we selected PCA & PLS as the best model.

Bias

Table-17
Statistic	Outcome	PLS	OLS	PCA & PLS	Ridge \(\lambda_{min}\)
Average	101.10	94.53	93.43	95.74	99.36
Bias Error		-6.57	-7.67	-5.36	-1.74

The final predictions are all lower than the actual average June rain which indicates a bias problem, but bias is less for ridge regression. As mean-square-error is a wider measure that includes bias error and variance, it may be possible to the reduce the overall error by choosing the bias level in the optimum way.

9.2 Mean Square Error

Although cross-validation would generally be used to select the best model, we continue to show the test results for all of the models that were compared using cross-validation. Based on the cross-validation results, we selected PCA & PLS or preprocessing with Principal Components Analysis followed by Partial Least Squares (PLS) as the final model. The results for the final test in Table-18 below should be compared with the cross-validation results of Table-14 above. To facilitate this, the cross-validation (CV) result is repeated in the first row below.

Average Mean-Square-Error (MSE)

Table-18
Statistic	PLS	OLS	PCA & PLS	Ridge \(\lambda_{min}\)	No Regression
Cross-Validation	2147	1988	1844	2170.6	2254
Final Test	1581	1676	1871	2084	2254
Final \(R^2\)	0.315	0.281	0.179	0.088	0

Recall from Section-7.5 that we selected PCA & PLS with an MSE of 1844 as the best model based on the cross-validation comparison. Recall that PCA & PLS stands for preprocessing with principal components analysis followed by training with PLS. The results for cross-validation are repeated in the first row above. For the final test in the second row, we find that PCA & PLS is no longer the best result. PLS alone without preprocessing in the second column has the lowest MSE of 1581. This result is likely affected by the random selection of only sixteen year records for the final test. There are simply not enough year records to reach a conclusion about the best method. The results are affected if certain years with heavy rain are randomly selected for the final test. As a result, the best model from cross-validation should still be considered.

The \(R^2\) value of 0.315 in the bottom row under PLS indicates that partial-least-squares regression can account for 31.5% of the observed rain variance which means that 68.5% of the variation is unexplained.

For the No-Regression model, the single unique prediction is the average June rain over all years. The purpose of regression is to find a better prediction than this that depends on the data record. For regression to be helpful, it must provide a more accurate prediction with a smaller mean-squared-error than the 2254 for the No-Regression model in the right-most column. The trivial \(R^2\) value of 0 under No-Regression reflects that \(R^2\) is a comparison with the No-Regression model itself.

With the exception of PCA and the trivial case of the No-Regression model, mean-squared error (MSE) is lower for the final test in the second row compared to the MSE for cross-validation in the first row. This reflects that the model training data for a particular fold does not include the records for which it provides predictions. For ten folds, the fold training datasets are 10 percent smaller which leads to more variation. On the other hand, the effect of any outlier cases is reduced by averaging over all of the folds.

9.3 Prediction Accuracy Plot

The prediction error variance is particularly problematic in dry years as there is a big difference for crop yields between a small amount of rain and no rain at all. The result can be affected by a single storm in the middle of the month.

Although the overall slope of the regression line is positive, there might be a subgroup where this does not hold. The positive slope was only associated with the wet cluster for the years in the training group in Figure-6 of Section-6.2.5

The 95% Confidence limits for the estimated average rain are indicated by the shaded area between the light brown lines. The wider limits for particular years are indicated by the blue-green line.

10 Appendix

10.1 Prediction Error

Prediction error is the difference between an actual result and a prediction. The mean-squared-error of a group of prediction errors is a very useful statistic. Mean-squared-error (MSE) can be decomposed into bias and a precision or variance part. Statistical bias has a mathematical definition apart from the common meaning of fearful or pre-conceived ideas as the difference between a prediction and an expected average outcome. Variance has a central role in statistics as a measure of variability. Bias typically increases when we eliminate variables to simply a model or to minimize the total error or MSE. Variance typically increases when a variable added to a model but the addition is justified if the bias error is reduced enough as a result to lower the total MSE.

10.2 Machine-Learning

10.2.1 Unsupervised Learning

10.2.1.1 Unlabeled Data

For unsupervised learning and exploration, data is considered without labels, assumptions, prior knowledge, or preconceived ideas. Consider an example of dot patterns that we know nothing about. Could they be symbols or characters or are they just random dot patterns? We attempt to explore any patterns or relationships among the dots.

10.2.1.2 Exploration for Clusters

Searching for clusters is a form of unsupervised learning that can help discover patterns, relationships, or correlations among variables. We should observe if the variables appear to be a list of discrete things or categories or are they decimal numbers. We might explore two columns of numbers that represent the x and y coordinates of dots using K-means cluster analysis to determine if we can find any patterns or clustering among the dots. Assume that our cluster search finds ten types of dot clusters. We have not assigned names or labeled the patterns with respect to anything outside of the data that requires prior knowledge or assumptions yet. Any such labeling would be beyond cluster analysis and unsupervised learning. Predictive models cannot be built from cluster analysis alone.

10.2.2 Supervised Learning and Model Building

10.2.2.1 Model Architecture and Prediction Algorithms

After our exploration and cluster analysis and other unsupervised learning where we found ten types of dot patterns, we suspect that the patterns correspond to the ten digits from 0-9 that we have prior knowledge about from having attended school and what we have reason to believe about the patterns. Next we might attempt to develop an algorithm to assign or label the patterns as the digits 0-9. Building or training such an algorithm or model to label the patterns is an example of supervised learning. The patterns must be assigned to one of the ten digits that we are familiar with. Even if some of dots are misplaced or missing or there are other random dots, our algorithm will attempt to assign one of the ten labels.

10.2.2.2 Regression

A regression model is an example of supervised learning that involves using prior knowledge about the relationship or correlation between variables to make predictions where the variables are columns of decimal numbers. Although the numbers appear to be part of an infinite continuous set, we are still ultimately limited to a countable discrete set of numbers that our computer can handle, but too many to label. However, decimal numbers are an ordered set that opens-up a new possibility of considering a column of numbers as a space-like vector.

Purpose of Regression

The purpose of regression is to find a better prediction than the simple average from the training set which is an unbiased estimate with an uncertainty level described by the sample variance. We seek a better unbiased estimate with lower variance. Unbiased means that the expected prediction will approach the true average for a very large number of records. We apply our model to predict a future outcome considering the additional conditional probability provided by the data record for the outcome. The prediction is typically chosen as the most probable outcome.

If a customer has purchased a bicycle, we might compute the probability that athletic clothing might be purchased on the next visit by attempting to guess what product a customer will buy based the conditional probability provided by a prior event which changes the probability of purchasing the next product. Selecting some record fields and eliminating others for simplification can bias the estimate. The bias can only be justified if the variance regarding the next purchase is reduced enough to more than compensate for the bias error by reducing the overall mean-squared-error (MSE).

Another Regression Example

Consider a more complicated model of predicting the height of a student in the fifth grade. What if we had four student columns, weight, age, height of the mother, and the last digits of an identification number. The simplest and unbiased prediction without using regression is the average height of all students in the fifth grade. However, we have good reason to believe we can make a better estimate if we consider the age of the student. Adding a variable is called model adjustment. The probability distribution for the entire fifth grade is replaced by a different conditional distribution based on student grade and age. Predictions involving conditional probabilities usually depend upon Bayes’ theorem. If there are many students in each age group, this will usually be a good estimate but it would be more problematic if there are not many students in the same category which can lead to bias.

Eliminating some columns can bias predictions. It’s not hard to understand why we would want a data reduction method to remove the column about the last digits of the identification number, that would probably add only noise, greater variance and make the prediction worse. Bias can also result because we are excluding the height of the mother for which the correlation with height is well below 1, but leaving it out will probably change the prediction, although it is not obvious yet if the bias of the change is up or down. Presumably we are excluding some fields based on prior or other knowledge, but this inherently introduces bias for the estimate.

10.3 Matrix Solution for Ordinary-Least-Squares Regression

Similar to the matrix solution called Cramer’s Rule, the linear regression equation for multiple covariates can be solved using matrices.

10.3.0.0.0.1 \[\begin{aligned} y = X \beta + \epsilon \end{aligned}\]

10.3.0.0.0.2 \[ y = \begin {bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{bmatrix} \quad X = \begin {bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & y_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \\ \end{bmatrix} \quad \beta = \begin {bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \\ \end{bmatrix} \quad \epsilon = \begin {bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \\ \end{bmatrix} \]

The total error is as follows:

10.3.0.0.0.3 \[SSE = (y - Xb)'(y-Xb)\]

We minimize the error by first taking the derivative of the error with respect to the vector \(b\) and setting it equal to zero. This involves some vector calculus.

10.3.0.0.0.4 \[\frac{\partial (SSE)}{d \mathbf{b}} = 0\]

The \(\mathbf {b}\) vector of the estimated coefficients can be solved using more vector and matrix calculus. This is somewhat like a Cramer’s rule solution with matrices for this problem.

10.3.0.0.0.5 \[\mathbf {b} = (X'X)^{-1} x' y\]

The vector \(\mathbf {b}\) is an unbiased estimate of the true regression coefficients \(\mathbf {\beta}\). The inverse matrix \(\mathbf {(X'X)^{-1}}\) is an important object in itself as it represents a variance-covariance matrix for the estimated regression coefficients \(\mathbf {b}\).

10.3.1 Inverse Covariance Matrix

This important matrix is used many formulas.

Inverse Covariance Matrix

Table-19
V1	V2	V3	V4
0.1511	-0.0002	-0.2329	-0.2930
-0.0002	0.0000	0.0004	-0.0001
-0.2329	0.0004	9.0497	-1.2023
-0.2930	-0.0001	-1.2023	1.1160

The component in the upper left corner of 0.1511 is the covariance multiplying factor for the outcome-axis intercept if all inputs were zero. The next diagonal element 0.0000 corresponds to the confidence interval for the coefficient of x₁.

10.3.2 The HAT Matrix

The inverse covariance matrix is used to calculate the HAT matrix.

10.3.2.0.0.1 \[HAT = X(X'X)^{-1}X'\]

The diagonal elements of the HAT matrix \(h_{ii}\) are used to help calculate the confidence interval for the mean response for a particular data record \(x=\{1,x_{i1},x_{i2},x_{i3}, \;.\; .\; .\}\) in the matrix X above at Equations-2 where the first term is always 1.

10.3.2.0.0.2 \[h_{ii} = x'(X'X)^{-1}x\]

This formula is from Section-12.10 of the Walpole textbook (Walpole, Myers, Myers, Ye, 2007, chap. 12, pp. 485)

10.3.3 Variance

Expressing the elements of the above covariance matrix as \(c_{ii}\), we calculate the variance of the individual coefficients by multiplying the corresponding diagonal element by the variance \(\sigma^2\).

10.3.3.0.0.1 \[\begin{aligned} \sigma_{b_i}^{2} = c_{ii}\sigma^2 && {\text{i = 0,1,2, . . .}} \end{aligned}\]

The total variance can be expressed as the sum of these terms.

10.3.3.0.0.2 \[\sigma^{2} = \sigma_{b_0}^{2} + \sigma_{b_1}^{2} + \sigma_{b_2}^{2} + \quad . \quad . \quad .\]

10.4 PRESS Statistics

PRESS is an acronym meaning Prediction Sum-Of-Squares which is calculated using a Leave-One-Out-Cross-Validation. It is similar to the Error Sum-of-Squares (SSE) but the error (e) is corrected for the confidence level. Since the model for a particular cross-validation prediction is always built without the corresponding record, an error can simply be called a prediction error, dropping the residual term used for training errors.

Let June rainfall for the i-th record be denoted by \(y_i\). The predicted rainfall is denoted by \(\widehat y_i\) where the model for the prediction is built without the record that corresponds to \(y_i\) according to Leave-One-Out-Cross-Validation (LOOCV). Let \(e_i\) be the prediction error for the i-th record.

10.4.0.0.0.1 \[e_i = y_i - \widehat y_i\]

The error \(e_i\) is corrected as follows where \(h_{ii}\) is a diagonal matrix element of the HAT matrix developed during the ordinary-least-square solution above at Equation-7. The corrected error is denoted by \(\delta_i\).

10.4.0.0.0.2 \[\delta_i = \frac {e_i} {1-h_{ii}}\]

The \(h_{ii}\) terms appear in the computation of the confidence interval of the mean response. Finally, the press statistic is calculated as follows:

10.4.0.0.0.3 \[PRESS = \sum_{i=0}^n {\delta_i}^2\]

The PRESS results appear as a column in the table below. Another adjusted or modified version of \(R^2\) is calculated where the PRESS figure is substituted for the total sum of squared-errors (SSE).

10.4.0.0.0.4 \[{R^2}_{predict} = 1 - \frac {PRESS} {SST}\]

The \({R^2}_{predict}\) results appear in the R-Square-Predict column in the table below. The total sum of squares (SST) is equal to the mean-squared-error multiplied by the quantity of records; that is, it is the numerator of the mean-squared-error (MSE) formula. These equations are from Section-12.11 of the Walpole textbook (Walpole, Myers, Myers, Ye, 2007, chap. 12, pp. 490-491)

PRESS Statistics

Table-20

10.5 LASSO and Ridge Regression

LASSO is an abbreviation for Least Absolute Shrinkage and Selection Operator. This method can reduce overfitting by automatically eliminating uncorrelated and unimportant variables. As illustrated by Equation-13 below, errors are weighted using their absolute values which corresponds to L-1 normalization or the Manhattan taxi-cab distance between points. It depends on a regularization parameter \(\lambda_1\), which can be selected to minimize the absolute values of the errors. While LASSO can reduce overfitting, it can result in underfitting if there are few well-correlated variables to be distinguished from many weakly correlated ones. Predictions can loose sensitivity and cluster too close to the mean while failing to predict outlying cases if the regularization becomes too strong as the value of the coefficients are reduced in order to eliminate unimportant variables.

Ridge regression is related to Lasso regression as can seen by comparing Equation-13 and Equation-14 below. Ridge regression also attempts to lower variance by reducing the coefficients but they remain non-zero. It punishes large errors using L-2 normalization to weigh errors at the square Euclidean distance, which is the common understanding of distance, compared to the absolute error values used for LASSO.

Consider a sample consisting of N cases, each of which consists of p covariates and a single outcome. Let \(y_i\) be the outcomes and \(X\) be the data matrix. The objective is to solve this equation for the unknown column matrix \(\beta\) of coefficients.

10.5.0.0.0.1 \[\min_{\beta_{0},\beta } {\left( \|y-\beta_{0}-X\beta \|_{2}^{2} \right)}\]

10.5.0.0.0.2 \[{\text{ subject to: }}\]

10.5.0.0.0.3 \[\lambda \cdot \left[ \alpha \|\beta \|_{1}+(1-\alpha ) \|\beta \|_{2}^{2} \right] \leq t.\]

The subscript for the norm is the \({\displaystyle \ell ^{p}}\) norm like the following expression.

10.5.0.0.0.4 \[{\displaystyle \|u\|_{p}=\left(\sum _{i=1}^{N}|u_{i}|^{p}\right)^{1/p}}\]

The \(\alpha\) parameter is used to select L1 LASSO or Ridge regression or a combination of them. The \(t\) parameter is used to reduce or eliminate variable coefficients and is called a regularization parameter. The total error is held below the limit controlled by \(t\). The last equation can be put in a Lagrangian form where \(t\) is replaced by equivalent \(\lambda\) parameter that depends on the data. The elastic net equation encompassing lasso and ridge regression is as follows:

10.5.0.0.0.5 \[\displaystyle \min_{\beta \in \mathbb {R} ^{p}}\left\{\| y-X \beta \|_{2}^{2} + \lambda \cdot \left[(\alpha ) \|\beta \|_{1}+(1 - \alpha ) \|\beta \|_{2}^{2}\right] \right\}\]

The hyperparameter \(\lambda\) is either zero or positive-valued. It controls the error penalty beyond the squared errors of the first term inside Equation-12.

10.5.1 LASSO Regression: L1 Regularization

The LASSO method reduces the coefficients to a degree where some are set to zero and thereby eliminated using the absolute value of the error as a penalty, which is closer to one than the squared value used for regression. The following equation is derived from Equation-12 if \(\alpha\) is set equal to one.

10.5.1.0.0.1 \[\displaystyle \min_{\beta \in \mathbb {R} ^{p}}\left\{\| y-X \beta \|_{2}^{2} + \lambda \|\beta \|_{1} \right\}\]

The value of the regularization hyperparameter \(\lambda\) is increased while keeping the total error below a limit which can eliminate variables.

In a Bayesian interpretation, Lasso is a type of linear regression for which the coefficients have Laplace prior distributions which have an sharp exponential decay to zero and hence sharply peaked near the central location compared to normal distributions for ridge regression.

10.5.2 Ridge Regression: L2 Regularization

The following equation is derived from Equation-12 if \(\alpha\) is set equal to zero. The ridge method reduces the coefficients using the squared Euclidean distance in the penalty term which is the second term in the equation below.

10.5.2.0.0.1 \[\displaystyle \min_{\beta \in \mathbb {R} ^{p}}\left\{\| y-X \beta \|_{2}^{2} + \lambda \|\beta \|_{2}^{2}\right\}\]

The value of the regularization hyperparameter \(\lambda\) is increased while keeping the total error below a limit. This lowers the value of the coefficients but variables are not eliminated.

10.6 Simulated Solar Irradiation Data

The purpose of this study was to evaluate the small variability of solar irradiation impinging upon the earth over a number of Jupiter orbits. A three-body model of the solar system was formed consisting of the sun, Jupiter, and the earth. This simplified model was selected because these three bodies and the related moons account for well over 99% of the total mass the solar system. An improved model requiring more orbit calculations would include Venus and Saturn. The computer code that generated the simulation is in the file SunJupiterEarthSimulation.R.

Although the solar output power is fairly constant, the instantaneous irradiation impinging upon the earth varies as the distance between the earth and the sun changes in its rather circular but elliptical orbit about the center of the solar system. The distance between the earth and the center of the solar system is known to vary from about 147.1-million to 152.1-million kilometers every year. The investigation considers the larger variance of the distance between the earth and the sun caused by the small orbit of the sun around the center of the solar system of very roughly 0.8-million kilometers with a period roughly the same as Jupiter’s orbit of 11.862 years to a first approximation for this model which neglects Saturn and the other planets excluded from this three-body model.

10.6.1 Dominance of Jupiter among the Planets

If we add up the mass in a NASA-JPL table, Jupiter accounts for approximately 71.2% of the solar system mass outside the sun, but it is responsible for most of the force applied to the inner solar system consisting mainly of the sun, Mercury, Venus, and earth lumped together as a system because the Saturn, Uranus, and Neptune gas giants are further away. As far as Jupiter and the other gas giants are concerned, the inner rocky planets including Venus, earth, and the sun are a combined system with a center-of-mass within 1000-km of the center of the sun. After this approximation, Jupiter accounts for at least 91.4% of the force applied to the sun-Venus-earth system; Saturn accounts for about 8% because it is much further away, Uranus 0.3%. Although these error terms are large, the calculations can still shed some light about the variation of solar irradiation upon the earth.

10.6.2 Sampling Frequency for Simulation

More sample points would be helpful. The sampling rate was increased to 80,000 samples per year for the first year of earth’s orbit to obtain a more accurate preliminary year after which the final sampling rate was reset at 8000 samples per year on July 5th, 1959. The longitudinal position of earth’s orbit advances about 1 degree at the shift to the final lower sampling rate. Since the computer could recalculate the forces, velocities, and positions about 2000 times per minute, 72 years of data was generated in about five hours. The sample rate of 8000 is a compromise between accuracy and what was feasible.

10.6.3 Initial conditions

Solving the orbit equations would permit one to start the planets simultaneously with the required velocity and heading necessary to reach aphelion at the proper times. Indeed, solving the equation would eliminate the need for this entire project of generating simulated orbit data. The solution of the orbits of a three-body planetary system has not been found as an equation or in what is called closed form.

Simulated data collection began on July 5th, 1959 after setting the initial conditions as follows. The planets were started at aphelion where the direction of the planet velocity is approximately perpendicular to the position vector to the center-of-mass of the solar system. Starting orbits at aphelion leads to a new problem because the planets do not reach the aphelion of their orbit simultaneously. For example, while the earth is at aphelion every year in early July, the calendar date of aphelion for Jupiter varies because the period of Jupiter’s orbit is 11.862 years. Since, this makes it impossible to set the initial conditions for all of the planets simultaneously, a process was used to set the planets at aphelion in a sequential manner. After starting the three bodies at aphelion at the time of Jupiter’s aphelion in 1957, the earth is arbitrarily moved and restarted at its own aphelion after a delay of 251 days on July 5, 1958.

A. The Sun and Jupiter

The biggest error in this three-body model of the sun, Jupiter, and earth is likely the initial position of the sun opposite Jupiter to start the simulation. Consider that two stars of a binary star system pair orbit a common center-of-mass with the same period. Similar to a binary star pair, we might be tempted to guess that the sun would orbit the common center-of-mass of the Jupiter-sun combination somewhat roughly every 11.86 years like Jupiter because Jupiter contains 71% of the total planet masses and that the center of the solar system is bounded within some region near the center-of-mass of the Jupiter-sun combination subject to the effects of the other planets as known error terms for this model. The position of the sun is probably not well known beyond this approximation at this time in the year 2020. Adding Saturn to the model would help reduce the error concerning the center-of-mass as the combined mass of Jupiter and Saturn contains 90% of the total planet masses.

Lacking better information, the sun is placed at the aphelion its orbit simultaneously with Jupiter on October 26, 1957. The position of the Sun is set opposite Jupiter so that the center-of-mass of the Jupiter-Sun system is positioned at origin of the coordinate system.

Time of aphelion for Sun-Jupiter System: October 26, 1957 at 1957-10-26 23:16:58 GMT, corresponding to -384,396,182 seconds before 1970-01-01 00:00:00.

B. Earth

After starting Jupiter on October 26, 1957, the position of earth is reset at aphelion on July 5, 1958 after a delay of approximately 251.097 days at -362,701,401 seconds.

Aphelion of the earth-moon system in the year 2050 was estimated as the 99-year average of the earth’s aphelion from 2001 to 2099 using the data from Fred Espenak at www.Astropixels.com based on JPL DE405. Using this data, the 99-year mean is in the year 2050 at 2050-07-05 14:50:55 GMT, which corresponds to 2,540,645,455 seconds after 1970-01-01 00:00:00.

Aphelion2050 = 2,540,645,455 seconds; Going back in time using a sideral year of approximately 365.256 days or 31,558,118 seconds, the corresponding time of Earth’s aphelion on July 5, 1958 at 01:36:39 GMT is as follows:
Aphelion1958 = Aphelion2050 + (1958-2050)*31558118
Aphelion1958 = -362,701,401 seconds

Adding the earth as a third body transforms the stationary center-of-mass of the sun-Jupiter system at the origin of the coordinate system to a new center-of-mass which orbits the origin with radius of 0.00045-million kilometers. As the distance from the earth to the sun varies about 5-million kilometers, this issue was small enough to proceed with the other analysis. This problem could be eliminated by a solution of the equation for the three-body model.

10.6.4 Simulated Data for Sun-Jupiter-Earth System

At the start of the simulation, the initial position of the sun is set opposite Jupiter such that the center-of-mass for the Sun-Jupiter pair is at the origin the solar system as described in more detail above. In a better model, the sun would be initially placed opposite the center-of-mass of Jupiter and Saturn. The following plot only illustrates why such a four-body model should be computed with a more powerful computer. The plot below indicates that the climate on earth could be modulated by the solar system.

While our model for predicting June rain is only based the weather data, January irradiation still appears in Table-1 with a correlation of -0.25. If we were to add Saturn to the model, perhaps we might expect this would reduce the effect of Jupiter when Saturn is opposite Jupiter and increase the effect when Saturn is in conjuncture with Jupiter. As is well-reported, Saturn and Jupiter appeared together in the western sky briefly after sundown in December of 2020, which indicates an approximate conjuncture with earth on the opposite side of the sun. But this is only approximate because the earth is offset somewhat from the line passing through the sun and Jupiter by the angle which is apparent after sundown at that time.

The table below lists the data plotted immediately above.

Simulated Solar Irradiation for Earth (Watts / square-meter)

Table-21

10.7 Shared Project Code and Data

The code for this project is provided available at this GitHub Project: