Wednesday, 27 March 2013

Local Regression (LOESS) and the Sun Spot Cycle

Sun Spot Cycle 24 

Preamble 

On the NASA website, which was updated on 1 March 2013,  it is reported that the current sun spot cycle seems to have already peaked at with the sun spot count of 67 in February 2012 (last year):-

NASA sun spot prediction

The sun spot cycle 24 is not behaving as predicted a couple of years ago. For the past few years the size (or sun spot count) of the sun spot cycle 24 had been constantly downgraded: -

Older NASA sun spot predictions

From the earlier predictions made by the consensus group of scientists (working under the auspices of NASA?)  we can conclude that there were two issues that did not work out as forecasted:

  1. The size (or the peak) of the sun spot count.
  2. The date on which sun spot cycle peaked.
Now whereas I most definitely do not feel qualified or even energetic enough to comment on the first issue, I want to focus on the second issue, as I believe that the wrong statistics were used to calculate the previous sun spot cycle peaks with a consequence that the forecast for this cycle was also wrong, because it started on the wrong footing.

For a good primer about why the sun spot cycle is not as high as predicted I can suggest the interest reader to study the book:

 'Grand Phases on the Sun' written by Steven Haywood Yaskell  ISBN 978-1-4669-6300-9


Moving average and the sun spot count

The current and past sun spot cycle counts are readily available on the web and hereto is a link to an official website in Belgium from where the interested user can download his own set of data to analyse: -


If you are not interested in data manipulation exercises, you can download graphs of the various sun spot cycles, ready prepared from the sites that host the raw sun spot data.

On the official graphs prepared by the various governmental agencies they state that the graphs have been prepared with a moving average. A standard of one year and one month seems to be the norm.

How do you calculate a moving average?

This seems like a silly question as just about everybody knows what a moving average is and how to calculate it; the method is used extensively in everyday business by accountants, people responsible for budgets, and guys in charge of production processes. The fact that it is used in the field of astrophysics should therefore be comforting, or should it?

To illustrate the workings of a moving average calculation I prepared the following table. Feel free to skip the following discussion-

Table 1 - Moving Average calculation  illustration

Using a very simple data series, named Series as a the first column on table 1, the working of a moving average calculation is shown. For this example I used a span of 3, 5 and 10 cells. The cells represent days, weeks, months, years, or whatever series of data you work with.

In the second column coloured blue, the results of a 3 period or span moving average is shown. The arrows shows the values that were used to calculate the value in the given cell. In the example I show that cells 5, 6 and 7 were added together and divided by 3, giving the answer 6. Once the formula is entered in one cell it is very easy to copy the same formula to all the cell in the series under investigation.

The third column coloured in bright green shows the results of the moving average for 5 periods, and the last column coloured in orange shows the results of the 10 period moving average calculation.

Strictly speaking, if you are calculating the moving average with a 3 period span you cannot calculate the answer for the first 2 cells in the series, when calculating a 5 period moving average the first 4 cells contains no answer, and for 10 period moving average, the first 9 cells returns a blank answer.


I used the information shown on Table 1 to prepare graph 1.

Notice the different starting positions for the moving average lines generated by the varying moving average periods.


Graph 1 - Moving Average
Because I used a steadily increasing data series one of the short coming of the moving average calculation is clearly shown. The moving average cannot catch up with the peak of the ever increasing series and the greater the span used to calculate the moving average, the greater the lag between the series value and the value returned by the moving average  calculation.



Simple local regression, using adjacent neighbours to calculate the estimated value of the cell

The origins of local regression as applied to a local completed series is hard to find. However, using local point values around a point of interest has been done for hundreds of year, It can be argued that the drawing of contour lines, starting at the beginning of the 19th century, is a very clear example of how the survey points surrounding a given point are used to calculate the position of a given contour line. In geological grade estimation the use of sample data surrounding a point of interest to calculate the value or grade of a given point has been used extensively for almost as long as the mining has been practised.

Table 2 - Simple local regression
To show how a simple local regression calculation is used I prepared Table 2. The same simple increasing series from Table 1 is used in this example. A 3, 5 and 7 span local regression with equal weighting to data points is used in this example. In the 3 span equal weighted local regression the series value ahead of the cell for the estimation is required is added to the adjacent cell and the cell immediately behind it, the sum is then divided by 3. In the case of the cell marked by the arrows in the blue column, the result is (8+7+6)/3 = 7.

Notice how the answer of the simple equal weighted local regression can keep pace with the ever increasing value of the series. This is quite in contrast with the answer obtained with a moving average calculation.

With equal weighting local regression calculation we find that the result lags the series at the start and end of the results column. The lag  is dependent on the span selected to do the calculation.

Graph 2 is a representation of the information on Table 2. The lag at the start and end of the calculated series  is the reason the straight line appears to have magic coloured handles. It is not a fairy or wizards wand, just the result of the lag that you get when using the local regression calculation.
Graph 2 - Local Regression








To demonstrate the second failing of the moving average calculation I used a different data set. I used as base a increasing and decreasing data set with one valley flanked by two hills, forming a series that represents an inverted 'W'. (See table 3 for details of this series) To this series I added some noise to make the graph more readable and to ensure that the lines are not plotted on top of one another as in the case of Graph 2. The noise was created by using the random function that is a built in feature of Excel.

(I do know that purists do not consider the Excel Random function as random enough and wouldn't dream of using it in a serious paper. This, however, is just a blog and for illustrative purposes the Excel Random function suffice.)

Graph 3 Moving Average with wavy data series
The results for the moving average calculation of this series is shown on Graph 3. I did not copy the table with the actual calculations to this blog, and if you are that way inclined and want proof, construct your own spreadsheet to check whether Graph 3 does indeed reflect the correct results.

On Graph 3 the series to be modelled is shown with red diamonds. The moving average calculations with a span of 3, 5 and 10 periods are plotted in solid lines. On inspecting this graph we can draw two conclusions. We can confirm that the moving average calculation cannot reach to the same peak as the series being modelled, and the greater the span, the greater the shortfall in reaching the peak. And secondly, there is an offset between the peak modelled by the moving average calculation and the actual peak of the series being modelled. The greater the span of the moving average calculation, the greater the lag before the moving average shows either the peak or valley.

The moving average calculation is backward looking and only past value are used to calculate the present cell values and as a consequence there is a  lag in showing when the peak or valley of a series is actually reached. It is a major short coming of the moving average calculation when trying to analyse any wavy data, and is one of there reasons why I went to such great lengths to show this problem as clearly as possible.

There must be a better way to analyse a series that contains information that are wavy, undulating and repeating. A basic local regression fits the bill. See Graph 4. However, just like the moving average calculation the local regression cannot reach the same peaks and valley values, though the difference is far less striking than that of the moving average calculation.

Graph 4 - Local regression and wavy data series




Table 3 - Local regression with weighing factors for distance
The basic local regression calculation method can be further refined. If we use the geological grade calculation concept that a closer neighbour exerts a greater influence on the estimation point than a value further away, we can change what I have been doing so far, giving all the adjacent values equal weighting.

I assume that the nearest neighbours have a greater weighting than the values further away from the estimation.

Table 4 is a close-up of the smaller weighing factor listing shown of Table 3.


Now lets look at the detail on Table 4. The weighing factors add up to the span considered. For the purposes of constructing Table 4 I used arbitrary numbers. The following rules applied: the weighing factors must add up to the span, and the values must decrease from the estimation towards the flanks. Because I used arbitrary numbers I show the same centre value of 2 for the 5 and 7 span local regression. A graphical representation of the weighing factors used in Table 4 is given in Graph 5.    Just having the discipline of creating a tapering sequence of numbers creates a bell-shaped curve. Bell-shaped curve, that rings a bell! Lets investigate further as this exercise seems to be leading somewhere!


Table 4 - Weighing Factors before normalizing

Graph 5 - Weighing factors before normalizing
Up to now all the weighing factors added up to the size (breadth?) of the span, be it 3 , 5, or 7. This was done to show a feature which can immediately recognised as similar to the moving average calculation. If we now do the calculation differently, and divide every factor in the weighing table by the span, we normalise the factors. And not having a set rule to work to and to select the factors, and just using numbers plucked out of fresh air, is not the ideal way to do an analysis.

By amusing a weighting factor of 50%, implying that values double the distance from the estimation point have half the influence on the estimation point than the immediate neighbours, Table 5 can be prepared. Table 5 is showing normalised values with the totals of the factors adding to 1. Using the Solver function in Excel, it is very easy to find the normalised factors once a table with weighing factors is constructed. In the case of Table 5 the values on the side of the central column is calculated by using a 50% value to its immediate neighbour.

Table 5 - Weighted mean - half distance - normalized


Graph 6 - Weighing factors after normalizing
The weighing factors calculated according to strict recipe: all factors added together equals 1; the influence of the values decreases by 50% for each step away from the estimation point.

Graph 6 shows the weighing factors for spans of 3 5 and 7 cells. Notice how the normalising changed the value of the central point. And how the shapes of the weighing factors curves are resembling a bell-curve even better than previously.

In Table 6 the results of the local regression using a normalised weighing factor table is shown. As expected, the peaks and valleys are neatly followed, yet the highs and lows of the undulating series are not match. 

A cynic would conclude that effects of local regression is the same as that as Socialism, it steals from the rich (highs) and gives it to the poor (lows).


Table 6 - Local regression with normalized factors

To here I used very basic and straight forward arithmetic to demonstrate how a moving average is calculated, the short comings of using that technique, the advantages of using local regression with a very simple illustration of why it is far superior to the classical method. I also reached the end of what can be achieved by simple and logical arithmetic. 

Enter the mathematicians


Whilst still working for Bell Laboratories William Cleveland, who has been working on the presentation of data to get information wrote a ground-breaking paper "Robust Locally Weighted Regression and Smoothing Scatterplots", that was published in 1979 in the  Journal of American Statistical Association, Vol. 74, pp. 829-836. (This was just a few years after Cleveland published papers on Trellis Plots, a way to represent graphical information that is now in common use.) The LOESS paper by Cleveland became one of the most quoted statistical papers of the late 20th century. 

A full description of the LOESS procedure based on the work of Cleveland is incorporated in the NIST Engineering Statistics Handbook: -


The work of Professor Cleveland is also mentioned on Wikipedia, find it yourself

Enter Jon Peltier, a software engineer that runs his own consulting business, who perused the paper of Cleveland and in 2009 wrote a Visual Basic application that can be used as an Add-in in Excel. Jon published full details on his blog and invited those interested to download the VB application  and try it out. This I did.


Interlude


Allow me to indulge you in story that I think has very striking parallels.

In 1974 Lerch and Grossman, two mathematicians working for IBM wrote a paper explaining how graph theory can be used to solve an age old mining problem: pit optimisation. At the time the paper was considered a breakthrough, but given the amount of time required to do an LG pit optimization run on the computers of the day, the mining industry abandoned the method and adopted a far more cruder method, called Floating Cone. In the mid 1980s, whilst working for BHP in Australia, Dr Jeff Whittle, a British born mathematician (from Yorkshire) read the LG paper and realised that he could write a software programme that can run on the new workstations, which were the state-of-the-art small computers of the time. He left BHP and started his own consulting company that marketed the Whittle software products. The rest they say is history. In the mining industry today very few people refer to the Pit Optimization process as the LG logarithm; the universal term is Whittling

Sunspot Cycle 19 

I tested the Loess application prepared by Jon Peltier on the raw sun spot data I down loaded from the Royal Observatory of Belgium - the SIDC site: -


In Graph 7 I show the sun spot number count for cycle 19. Using the basic Excel built in software to calculate the moving average for 390 days gives a very similar answer to the graph published on the SIDC website with the credit: - 

The graphs have been prepared by Jan Alvestad based on data from SIDC, Brussels.

Graph 7 - Local regression v moving average
Now compare the LOESS 390 day plot with one year and one month moving average. A similar but displaced outline.

I went to great lengths to show and explain the short comings of the moving average calculation and here we can see the difference.

It is my intention to post more revised graphs and a discussions on the sun spot count in future. 

FIN

No comments:

Post a Comment