Probability: The Rain in Spain ...
Today I was going to write about probability theory, statistical independence, and correlation, and their importance when analyzing performance data. But I know those subjects have a tendency to induce severe attacks of fear and loathing. So instead I'm going to talk about the weather -- specifically about the difference between British and Californian weather.
When I came to the US, having previously experienced only the notorious changeability of British weather, I was quite surprised to discover that Californians enjoyed almost the exact opposite.
You may have heard the song It Never Rains in Southern California (but man it pours). And if it's sunny today, chances are good it will be sunny again tomorrow -- and probably for the rest of the week too! But during the winter months in Northern California it often rains for several days. And when it's raining, expect the rain to continue for a few days.
A friend of mine used to joke that the talents of weather forecasters were wasted here. British weather forecasters are equally redundant, but for the opposite reason; it's almost impossible for them to get it right. To cope with this, they have evolved umpteen ways to say unpredictable -- sunshine with occasional showers, cloudy with sunny intervals, overcast but brightening later, and so on.
Maybe you're wondering why I am telling you this, since you probably already know why Surfin' USA is a lot better known than Surfin' UK. Or maybe you've figured out that my sudden interest in the weather is really just a cover for my original purpose of discussing probability theory. So let me explain.
Consider any weather-related variable that's measured daily, like hours of sunshine, or inches of rain. The technical term for this set of measurements is a time series. Suppose you want to analyze such a time series and use it as a basis for forecasting tomorrow's weather (and assume for purposes of this example that no other meteorological data exists). If the two conditions you care about are rain (R) and sun (S), then you can do a lot better than simply calculating the long-term probability of rain, which a statistician writes as p(R).
That naive approach might be OK for forecasting the weather in Britain. But not in California, where a forecaster needs to consider conditional probabilities -- the probability of an outcome given what we already know, which is written p(outcome|condition). If it's raining in California today, what really matters are the conditional probabilities of rain tomorrow [p(R|R)] and sun tomorrow [p(S|R)].
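To make the notation concrete, here is a minimal Python sketch of how these conditional probabilities could be estimated from a daily record by counting (today, tomorrow) pairs. The function name `transition_probabilities` and the two-week record are invented for illustration; they are not real weather data.

```python
from collections import Counter

def transition_probabilities(series):
    """Estimate p(tomorrow | today) from a sequence of daily labels like 'R' and 'S'.

    Returns a dict keyed by (tomorrow, today), mirroring the p(outcome|condition) notation.
    """
    pair_counts = Counter(zip(series, series[1:]))  # counts of (today, tomorrow) pairs
    today_counts = Counter(series[:-1])             # only days that have a "tomorrow"
    return {
        (tomorrow, today): count / today_counts[today]
        for (today, tomorrow), count in pair_counts.items()
    }

# Hypothetical two-week record, made up for this example
days = "SSSSRRRSSSSSRR"

probs = transition_probabilities(days)
p_rain = days.count("R") / len(days)   # the naive long-term p(R)

print(f"p(R)   = {p_rain:.3f}")
print(f"p(R|R) = {probs[('R', 'R')]:.3f}")
print(f"p(S|R) = {probs[('S', 'R')]:.3f}")
print(f"p(R|S) = {probs[('R', 'S')]:.3f}")
```

In this made-up record, rain follows rain far more often than the overall rain frequency would suggest, which is exactly the information the naive p(R) throws away.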
For a time series, these kinds of conditional probabilities depend upon a statistical property of the series called independence, and its opposite, autocorrelation (correlation between today's value and the values on earlier days). When the results of successive measurements are independent, knowing the history does not help you to forecast future outcomes. This is the challenge facing the BBC's weather forecaster, whose unpredictable weather is such that p(R|R) = p(R|S) = p(R).
But for the weather forecaster in Sacramento, California, prior knowledge is crucial. During a typical year it rains on 58 days, so p(R), the probability of rain on any given day, is 58/365, or about 0.159. However, those 58 days are not randomly spread across the year. Like the aspens in Colorado, they come in clusters. So if it's raining today, the probability of rain tomorrow [p(R|R)] is significantly greater than 0.159, because for Sacramento weather, the time series is autocorrelated.
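The contrast between the two kinds of weather can be sketched with a small simulation. The only figure taken from the text is the long-run rain probability of 58/365. The value p(R|R) = 0.6 is an assumption chosen purely for illustration, not a measured Sacramento statistic, and p(R|S) is then derived so that the long-run fraction of rainy days stays at 58/365. The helper names `simulate` and `rain_after` are hypothetical.

```python
import random

P_RAIN = 58 / 365   # long-run probability of rain, from the text
P_RR = 0.6          # ASSUMED p(R|R), illustrative only
# Pick p(R|S) so the long-run rain fraction remains P_RAIN:
#   P_RAIN = P_RAIN * P_RR + (1 - P_RAIN) * P_RS
P_RS = P_RAIN * (1 - P_RR) / (1 - P_RAIN)

def simulate(days, p_rr, p_rs, seed=0):
    """Generate a daily 'R'/'S' series where tomorrow depends only on today."""
    rng = random.Random(seed)
    today = "S"
    series = []
    for _ in range(days):
        p = p_rr if today == "R" else p_rs
        today = "R" if rng.random() < p else "S"
        series.append(today)
    return series

def rain_after(series, condition):
    """Fraction of days following a `condition` day that were rainy."""
    followers = [nxt for cur, nxt in zip(series, series[1:]) if cur == condition]
    return followers.count("R") / len(followers)

clustered = simulate(200_000, P_RR, P_RS)        # autocorrelated, California-style
independent = simulate(200_000, P_RAIN, P_RAIN)  # p(R|R) = p(R|S) = p(R)

for name, series in [("clustered", clustered), ("independent", independent)]:
    print(name,
          f"p(R)={series.count('R') / len(series):.3f}",
          f"p(R|R)={rain_after(series, 'R'):.3f}",
          f"p(R|S)={rain_after(series, 'S'):.3f}")
```

Both series have roughly the same overall fraction of rainy days, but only in the clustered one does knowing today's weather change the forecast for tomorrow.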
So ... this has been today's science lesson from your Web Weather station. Tune in again tomorrow, and I will explain how understanding the statistics of weather forecasting can also help you forecast your Web site availability. And you thought those weather forecasters on the nightly news were just there for comic relief!