
Everything Is a Time Series

  • Writer: Nick Settje
  • Jan 6, 2022
  • 9 min read

Prediction is a matter of finding patterns that exist in historical data. This boils down to looking at data you have already collected and trying to glean a set of rules for forecasting the future. In the absence of any relationship among different points in time, the data evolves as noise. In this case, specific predictions are impossible, though bulk statistical properties of the noise may still be useful for making decisions.


All Predictions Require Time Series


A time series is a collection of measurements of the same type of data taken at different points in time. It defines the state of some system as it evolves through time.


In the simplest case, this is a single measurement at fixed intervals, such as temperature every minute. In more complex cases, this can be multiple measurements taken at irregular intervals, such as all customer interactions with a web application at any given point in time.


Even measurements that do not at first appear to be time series end up having time series properties in the context of predictive analytics.


For example, object recognition in a single image typically only requires a model to compare different pixels in the image to each other. This certainly requires spatial relationships among the data, but the temporal relationships are not so clear. It turns out that the temporal character is hidden in the model itself. The training data that was fed into the model must necessarily have come from a previous time. While we may not specifically be interested in how these images may have changed over time, we are still implicitly splitting time into at least two periods: the training period that came before and the predicting period that came after. We trust the object recognition model to make predictions precisely because we believe a relationship exists between what came before and what will happen.


This qualitative relationship between past and future holds for any predictive model. If we assume that training data will have any descriptive or explanatory power in the future, we are establishing a relationship defined in time. In the case of time series models this relationship is explicit, though it is important to remember that the relationship holds for any kind of model we intend to apply to data we have not yet collected because that data has not yet come into being.


Measuring the Relationship between Past and Future


There are many possible relationships we can define between data we have recorded in the past and data we will see in the future. For now, we will focus on one measurement to which many other measurements are in turn related: autocorrelation.


Autocorrelation is the correlation of part of a time series with an earlier part of itself. This earlier part of the time series is often referred to as the lagged time series. In other words, autocorrelation compares one interval in time to another interval in time in order to measure the extent to which the past predicts the future.


For example, air temperature throughout the day tends to have non-zero autocorrelation for a lag of 24 hours. This is because temperatures tend to rise during the day and fall during the night. While the exact quantitative nature of this pattern will vary from day to day, the general relationship tends to hold.
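This 24-hour pattern is easy to check numerically. Below is a minimal sketch using NumPy; the sinusoidal daily cycle and noise level are illustrative assumptions, not real weather data:

```python
import numpy as np

def autocorr(x, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    a = x[:-lag] - x[:-lag].mean()
    b = x[lag:] - x[lag:].mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))

# Two weeks of synthetic hourly temperatures: a daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
temps = 15 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)

print(autocorr(temps, 24))  # strongly positive: today resembles yesterday
print(autocorr(temps, 12))  # strongly negative: day is the opposite of night
```

The lag-24 autocorrelation is close to 1 because the daily cycle repeats, while the lag-12 autocorrelation is close to -1 because noon and midnight sit on opposite sides of the cycle.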


If autocorrelation for a specific lag is non-zero, then there exists a relationship between that state in the past and the current state. This means that at least part of the current state could have been predicted from the lagged state. In the absence of autocorrelation, there is no linear predictive relationship between the past and the current state, though nonlinear dependence can in principle still exist.


One interesting application of autocorrelation is in discovering whether a dataset is forecastable. If we can observe non-zero autocorrelation in a dataset, then we have reason to believe that the future state depends on the past state, so it is possible to predict at least some portion of the future state. Likewise, if autocorrelation is zero at every lag, then we have evidence, at least against any linear relationship, that the future state does not depend on the past state, and simple predictive modeling is unlikely to succeed. This means that measuring the relationship between the past and the future can help us screen models before we invest valuable time and resources into further training, deployment, and monitoring.


Applying Time Series Principles throughout the Predictive Modeling Lifecycle


Here we draw the distinction between principles that define a mode of thinking and methods that define a specific set of computations, models, or rules. We will see more about specific methods in the next section.


Armed with the knowledge that prediction is inherently temporal, we can start to apply time series forecasting principles to almost any kind of predictive modeling. This extends across the entire lifecycle of modeling, from data acquisition to model training to model monitoring.


Starting with data acquisition, we can apply time series principles to help define how we capture data. Wherever possible, we want to capture timestamps alongside data so that we know when it was first generated or at least first recorded. We will also want to consider storing data in such a way that it is easy to access, order, and slice in terms of those timestamps. Sometimes this means using a time series database or other data store that makes time a first-class citizen. Sometimes it simply means tracking time as a relevant field in the larger dataset.


Moving toward model training, we can once again use time series thinking to make our models more reliable and easier to reason about. Even if we are working with methods that do not explicitly require a definition of time, there is a strong argument for ordering training and testing data in time. More specifically, we should use older time-ordered data for model training and newer time-ordered data for model testing. This enforces causal disconnection between datasets. This in turn prevents data leakage between different time periods. By ordering training data in time, we make it impossible for the future to affect our measurements of the past.
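As a sketch of this idea, the split below orders rows by timestamp before cutting, so every training row precedes every test row. The function name and the 80/20 default are illustrative assumptions:

```python
import numpy as np

def time_ordered_split(timestamps, X, y, test_frac=0.2):
    """Older data trains, newer data tests: no future rows leak into training."""
    order = np.argsort(timestamps)              # sort by when the data was generated
    X, y = X[order], y[order]
    cut = int(round(len(X) * (1 - test_frac)))  # chronological cut point
    return X[:cut], X[cut:], y[:cut], y[cut:]

# Example: rows arrive shuffled, but the split still respects event time.
ts = np.array([5, 1, 4, 2, 3])
X = np.array([[5.0], [1.0], [4.0], [2.0], [3.0]])
y = np.array([1, 0, 1, 0, 1])
X_train, X_test, y_train, y_test = time_ordered_split(ts, X, y, test_frac=0.4)
print(X_train.ravel(), X_test.ravel())  # [1. 2. 3.] [4. 5.]
```

Note that the sort uses the timestamps, not the arrival order of the rows, which matters whenever data is recorded out of order.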


For example, consider a model that predicts customer churn for a subscription service. No matter how good our data may be in regard to customers and their interactions with the service, there will remain hidden variables we have not accounted for. If we take customer data from random intervals, then we run the risk of discovering relationships that may actually have been due to changes in these hidden variables. Maybe marketing strategy changed significantly over different periods or maybe a new competitor began offering a similar service. If we use customer data that has not been ordered in time, then we run the risk of missing the magnitude of the effects of these events. We may also significantly underestimate churn since we are mixing pre-competitor and post-competitor customer interactions. By ordering training and testing data in time, we are more likely to notice that change. We are more likely to train a model that is more accurate or at least to recognize that our model is not accurate because of seasonal or exogenous changes that would be useful to identify for the needs of our business.


In a similar vein, we can apply time series principles to monitor a model that has already been trained and deployed. We should likewise look at time-ordered datasets when we are computing model drift over time. This prevents the same issues with causality and hidden variables that we saw during training. In this case, we can apply that principle equally well to the input data for the predictive model as well as the predictions that the model generates.


For example, let us return to customer churn. We have gathered time-ordered data, trained a model that performed well on a time-ordered test set, and deployed that model to production to generate predictions for customer churn probability over the next 3 months. After the first 3 months have elapsed, we want to know how well the production model performed. We calculate model drift by comparing the model's training data to the data that was fed into the deployed model. We also compare the distribution of predictions on the training data to the distribution of the predictions the model made in production. So far, we are performing data and model monitoring as it is normally defined. Our time series principles come into play when we continue to monitor the model beyond the initial 3-month mark. In this case, we want to continue to enforce time ordering during monitoring. In other words, we should compare, say, months 1-3 to months 2-4 of production, but we should never mix non-contiguous periods, such as months 1, 3, and 4 while skipping month 2. This removes potential data leakage, and it can help us identify exogenous shocks that may have caused the model to underperform.
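One way to sketch this window-by-window comparison, assuming SciPy is available, is with a two-sample Kolmogorov-Smirnov statistic between the training predictions and each contiguous three-month production window. The distributions below, including the shock in month 5, are synthetic and purely illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_preds = rng.normal(0.20, 0.05, 1000)  # churn probabilities at training time

# Six months of production predictions; an exogenous shock hits in month 5.
monthly = [rng.normal(0.20, 0.05, 300) for _ in range(4)]
monthly += [rng.normal(0.35, 0.05, 300) for _ in range(2)]

# Contiguous three-month windows: months 1-3, 2-4, 3-5, 4-6 -- never 1, 3, 4.
for start in range(len(monthly) - 2):
    window = np.concatenate(monthly[start:start + 3])
    drift = ks_2samp(train_preds, window).statistic
    print(f"months {start + 1}-{start + 3}: KS statistic = {drift:.2f}")
```

The statistic stays small for the early windows and jumps once the shocked months enter the window, which is exactly the kind of exogenous change we want monitoring to surface.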


We will see in the next section how this can be extended to arbitrary intervals of time.


Applying Specific Time Series Methods for Model Monitoring


In addition to time series principles, we can borrow specific computational methods from time series forecasting and apply them in scenarios that have not traditionally been regarded as primarily temporal.


One important but often overlooked domain is model monitoring. As we have already seen, monitoring is inherently governed by time series principles. We can expand these into specific computational methods through clever use of time ordering, probability statistics, and time series forecasting. In this case, we assume that our data should have no temporal character, meaning that subsequent data points and subsequent predictions from the model are not related in time. If we can discover relationships in time, then we have reason to believe that our model is not stable.


As live production data starts to flow into a deployed model, we want to be able to ensure that the production data has a distribution that is similar to the training data for the model. We also want to ensure that model predictions tend to follow a distribution that is similar to the predictions the model produced for the training data. (Curiously, we want the distributions to be similar even if this means that the model has worse performance than we might otherwise desire. If measured model performance in production is significantly better than performance in training, we should be skeptical of how well the model is actually performing or will continue to perform.)


Now, we may be able to wait a sufficiently long period of time for data to flow into the model before we compute performance metrics, though we will want to keep this period as short as possible to reduce risk associated with model error.


As far as the data itself is concerned, we can, for example, compare individual production predictions to the distribution of the predictions on the training data. If a production prediction lands where the training distribution's cumulative distribution function (CDF) is exactly 0 or exactly 1, it is more extreme than anything the model saw during training. In other words, we can measure how likely a given prediction is, and if it falls far outside of the range of likely values we begin to suspect the model is not performing well.
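A minimal sketch of this check uses an empirical CDF built from the training predictions; the numbers below are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
train_preds = np.sort(rng.normal(0.20, 0.05, 5000))  # sorted training predictions

def empirical_cdf(sorted_sample, x):
    """Fraction of the training predictions at or below x."""
    return np.searchsorted(sorted_sample, x, side="right") / sorted_sample.size

print(empirical_cdf(train_preds, 0.21))  # strictly between 0 and 1: plausible
print(empirical_cdf(train_preds, 0.60))  # exactly 1.0: beyond anything in training
```

A CDF value pinned at exactly 0 or 1 flags a prediction outside the observed training range; values strictly between 0 and 1 are at least plausible.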


However, just by chance we expect some number of points to fall almost anywhere in the range of possible values. What matters more is whether we can detect a trend: points consistently falling outside the distribution of the training predictions, or steadily drifting toward the edge of that range.


One way to detect this change is to search for relationships in time. We can compute CDF values for each point and then determine whether these values depend on each other in time. As a rough estimate, we can use autocorrelation. For a more refined measurement, we can actually train a time series model on this data. One common candidate for such a model is autoregressive moving average (ARMA). This class of models uses various lags of a time series to find trends that quantify how the time series reverts to the mean and how that mean changes over time.


In the case of a stable model, we should see no trend in the CDF values. This means that our best fit ARMA model should be an ARMA(0,0), which is just random noise. If we detect some trend in time, then we have reason to believe that values are drifting. This method is especially interesting because it allows us to forecast roughly when the model will become unstable, provided that we continue to see similar drift behavior from the data. Furthermore, by computing these models over different intervals of production data, we can also detect finite regions over which the model may have drifted, even if it now appears to be stable again. This gives us clearer insight into how our models have performed and will perform in the future.
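The rough autocorrelation estimate can be sketched with NumPy alone. For a stable model, the CDF values should look like independent uniform noise, with lag-1 autocorrelation near zero, matching ARMA(0,0); a drifting model produces trending CDF values with strong autocorrelation. Both series below are synthetic and illustrative:

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation: how much each value resembles the previous one."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

rng = np.random.default_rng(3)
# Stable model: CDF values are i.i.d. uniform, i.e. pure noise.
stable = rng.uniform(0, 1, 500)
# Drifting model: CDF values trend upward toward 1 as predictions slide out of range.
drifting = np.clip(np.linspace(0.5, 1.0, 500) + rng.normal(0, 0.05, 500), 0, 1)

print(lag1_autocorr(stable))    # near 0: no structure in time
print(lag1_autocorr(drifting))  # strongly positive: a trend worth investigating
```

A strongly positive result here is the cue to fit the more refined ARMA model and, from its trend, forecast roughly when the model will become unstable.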


Notice that this discussion has been very general in nature because these methods apply to any model we may want to monitor. That model does not itself need to be a time series model, though we can use time series methods to great effect in monitoring it.


Even for a time series model, these methods are useful for monitoring because they can act as a signal for retraining. For example, we may compute a time series model that we expect to diverge over time simply because future values depend additively on previous ones. Eventually this model will compute values that fall outside of the data used for training, even if the model may still continue to perform well. In this case, we may want to retrain the model on the newly observed data just to be sure that we believe the trend has continued and will continue into the future.


Conclusion


Any data we use for predictive modeling is inherently time series data, even if we do not work directly with timestamps and autocorrelations. We can use this observation to help us train better, more stable models through conscious use of time ordering, time series forecasting, and model monitoring.


If you have any questions about time series, predictive analytics, and model monitoring, or if you want to learn more about how these methods can work for your business, please reach out to hello@decode-ds.com

