New Computer Fund

Sunday, November 17, 2013

Climate Time Series Smoothing and What is Statistically Proper

In the chart I have the "raw" data for the Indian Ocean sea surface temperature via Climate Explorer from the ERSSTv3 data set, plus an aggressively smoothed version, compared to the "raw" Indian Ocean 0-700 meter vertical temperature anomaly.  Raw is in scare quotes because there is no such thing as manageable raw data in climate.  Every data collection method involves some form of natural and decision-based smoothing.  Since in statistics you should never determine a correlation based on "smoothed" data, there is actually no "proper" way to determine any correlation among mixed climate-related time series.  That means you have to "play" with smoothing choices based on your best understanding of the situation.  Different people have different "understandings" of the situation, which will bias their choices in smoothing.

Since the majority of the "raw" data in use is seasonally adjusted temperature anomaly, you open another can of worms with "unbelievable" confidence intervals.  With anomaly, the deviation itself is not very sensitive to the baseline period selected to create the anomaly, nor is the trend, but the absolute value that the anomaly represents can differ significantly depending on the range of absolute values being averaged to create the anomaly series.  Since you are really concerned with the energy, not so much the temperature that is supposed to represent that energy, the F = σT^4 relationship severely limits the range that can confidently be assumed to have "negligible" error.
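
A quick back-of-the-envelope sketch of that limitation.  The baselines and the half-degree anomaly below are my own illustrative numbers, not values from any of the data sets above; the point is just that the same anomaly represents noticeably more flux on a warm baseline than on a cool one.

    # Stefan-Boltzmann: the same temperature anomaly implies a different
    # change in flux depending on the absolute temperature it sits on.
    SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

    def flux(temp_k):
        """Blackbody flux F = sigma * T^4, with temp_k in kelvin."""
        return SIGMA * temp_k ** 4

    anomaly = 0.5  # K; illustrative choice, not from the post
    for baseline in (275.0, 290.0, 305.0):  # hypothetical SST baselines, K
        d_flux = flux(baseline + anomaly) - flux(baseline)
        print(f"baseline {baseline:.0f} K -> dF = {d_flux:.2f} W/m^2")

At 275 K the half-degree anomaly is worth roughly 2.4 W/m^2; at 305 K it is worth roughly 3.2 W/m^2.  Averaging anomalies across that range of absolute temperatures quietly mixes different amounts of energy.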

This is also aggressively smoothed zonal SST data, where the original series is actual temperature with the seasonal cycle intact.

This is what the "raw" data looks like without "all" of the aggressive smoothing.  Notice the 5S-5N series in darker green.  That is the big kahuna, with 5N-15N and 5S-15S fighting for position.  Since the "raw" data is monthly, it has already been smoothed to some extent.  Even the daily data is smoothed.  Only once you get to a number of samples per day, or hourly, are you actually getting to raw data, which quickly becomes completely unmanageable.  In this chart the smoothing is a 27-month moving average, "selected" because there "appears" to be a recurrent ~27 month cycle.  I made a choice.  That costs one degree of freedom.  I "selected" the data series and the width of the zonal bands.  There go two more degrees of freedom.  Since the data was "smoothed" by others, there is a degree of freedom or two that has to be considered there.  Then if I compare this data to another data set, I have to consider all the degrees of freedom of that data set, plus one because I chose that data set.  So now that I am up to about five degrees of freedom, I should be able to find just about anything I like.
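
For the record, here is a minimal sketch of that kind of 27-month boxcar smoothing, run on a synthetic monthly series of my own rather than the actual ERSSTv3 zonal data:

    import numpy as np

    def centered_moving_average(series, window=27):
        """Simple boxcar smoother; 'valid' mode trims window-1 points off the ends."""
        kernel = np.ones(window) / window
        return np.convolve(series, kernel, mode="valid")

    # A synthetic monthly series standing in for zonal SST (not real data):
    rng = np.random.default_rng(0)
    months = np.arange(600)
    sst = 0.2 * np.sin(2 * np.pi * months / 27) + rng.normal(0.0, 0.3, months.size)

    smoothed = centered_moving_average(sst, window=27)
    print(sst.size, "->", smoothed.size)  # 600 -> 574: 26 months lost at the ends

Note that even the window choice is a choice: pick 27 because a ~27 month cycle "appears" to be there, and you have already spent a degree of freedom before plotting anything.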

I am in a pickle.

Now we have the data and we have to make choices, so how do we avoid fooling ourselves?  I think with lots of comparisons and lots of humility.  No matter what choices you make, there will always be someone who perceives your choices as biased, because they are.  That is unavoidable.  So it becomes a battle for "consistency".

I may, for example, compare any or all of those SST regions to a paleo reconstruction that has its own natural and collector-based smoothing.  Ocean core samples build up over many thousands of years, and you have to select a high- or low-frequency reconstruction.  If I compare a low-frequency paleo series with a higher (but not highest) frequency instrumental series, I get one correlation, and then any smoothing of the instrumental data will improve the correlation.  William Briggs has an excellent post on that pitfall.  But even knowing the pitfall, some smoothing can be helpful if properly noted and considered.
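
A small simulation, mine rather than from Briggs' post, but it illustrates the pitfall he describes: smooth two completely independent noise series and the typical spurious correlation between them grows several-fold, because smoothing destroys effective degrees of freedom.

    import numpy as np

    rng = np.random.default_rng(42)
    n, window, trials = 600, 25, 2000

    def smooth(x, w):
        return np.convolve(x, np.ones(w) / w, mode="valid")

    raw_r, smooth_r = [], []
    for _ in range(trials):
        a = rng.normal(size=n)  # two independent noise series, so any
        b = rng.normal(size=n)  # "correlation" between them is spurious
        raw_r.append(np.corrcoef(a, b)[0, 1])
        smooth_r.append(np.corrcoef(smooth(a, window), smooth(b, window))[0, 1])

    print(f"typical |r|, raw:      {np.mean(np.abs(raw_r)):.3f}")    # ~0.03
    print(f"typical |r|, smoothed: {np.mean(np.abs(smooth_r)):.3f}")  # ~0.16

The smoothed series has roughly n/window independent points left in it, so the spread of the sample correlation widens accordingly, and a "strong" correlation becomes easy to find where none exists.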

So I think there should be a better "degree of bias" measure, one that encompasses the combined degrees of known and unknown freedom in the statistical food chain.

Just having a brain fart.
