From Model Validation to Pipeline Validation: A Paradigm Shift in ML Engineering

Overcoming the ‘Future Leakage Paradox’: A New Approach to Validation in Machine Learning

24 July, 2024
Christos Hadjinikolis, Lead ML Engineer
Introduction

Imagine making a decision today with the knowledge of tomorrow. Sounds like an unfair advantage, right? Well, it depends on the context; in some cases, this ‘advantage’ can be quite a deceptive pitfall. But let’s explore why.

To the problem at hand…

As an ML Engineer at Vortexa Ltd (a pioneering force in AI-driven energy analytics), I find myself at the helm of transforming abstract ML models into tangible, production-ready tools. Over the last three-plus years, my team and I have tackled many problems, contributing to the establishment of robust data pipelines. Each challenge was not just a problem to solve; it was an opportunity to innovate and shape the company’s dynamic ML landscape, which informs the data feeds we provide to our clients in a multifaceted way.

Here is one example: our clients tap into the data-driven insights we provide, not just to get a snapshot of the energy landscape, but also to leverage the critical signals we feed into their own predictive models. As such, a crucial aspect of the value we offer lies in our capacity to enable retrospective analysis. In short, our clients often wonder:

“Had we incorporated Vortexa’s predictions back in 2018 into our decision-making, would the outcomes have been profitable?”

This question underscores the business need to retrospectively evaluate our models and forms the crux of this blog post.

The saga of retrospective validation stumbles upon a fundamental paradox. Typically, we might attempt to ‘travel back in time’ by applying the model trained on all available data to the past scenarios it was originally trained on. But this method inherently conflicts with the principle of predictive modelling:

“…a model should not be asked to predict an outcome from a past it has already learned.”

This could be seen as a model merely repeating a learned pattern, rather than making novel predictions. It’s akin to revising history with a future lens, practically creating a skewed perception of the model’s true predictive capabilities.

Let’s call this the ‘Future Leakage Paradox’ (FLiP); a situation where one mistakenly allows future information to seep into a past prediction.

Why is this so important?

Well, in the context of Vortexa, understanding and addressing FLiP is especially crucial given the nature of our predictive challenges. Take for instance our vessel destination prediction model, tasked with forecasting the destinations of vessels back in 2018. The oil and gas industry is intrinsically volatile, influenced by a myriad of global events and trends. Just imagine the implications of incorporating future knowledge in our models.

For instance, let’s consider the unprecedented disruptions caused by the COVID-19 pandemic in 2020. The lockdowns resulted in a substantial decrease in oil demand, triggering a ripple effect that altered shipping routes globally. If we ‘leak’ this future information into our 2018 model, we end up inaccurately weighing the importance of certain factors that were irrelevant in the pre-pandemic world but drastically altered post-pandemic.

Another example is the war in Ukraine and the subsequent international sanctions imposed on Russia by the European Union. These events led to significant rerouting of vessels and altered trade flows. If a model trained on data from after these events were used to predict vessel destinations for 2018, the results would be misleading.

Such instances underscore how FLiP can introduce considerable inaccuracies in our retrospective evaluations, yielding predictions that are skewed by events and trends that were unknown at the time.

Addressing the Challenge: Pipeline Validation

In a way, this example is just a proxy to emphasise a more general ML engineering practice, which dictates that engineers should always be focussing on “pipeline validation” and not so much on “model validation”. Take this with a pinch of salt but, in short, as an engineer I don’t care that much about training a good model — that’s not my job. I do care about building a training pipeline that allows me to train good models. I understand how generic this may sound, but let me elaborate by expanding on the examples we just introduced.

Due to FLiP, generating predictions retrospectively requires not one model but several. Say we have shipping-route data from 2016 up to today and we want to retrospectively predict destinations for 2018. We can train a model on two years of data (2016–2017) and use it to predict destinations for 2018. For 2019 we can either expand the training window to include what actually happened in 2018, training a new model on three years of data (2016–2018, an expanding window), or keep a two-year window (2017–2018, a sliding window) and train a different model for the 2019 predictions, and so forth until we reach 2024. This approach stresses why the entire machine-learning pipeline, and not just a single model, should be under scrutiny. And though this may seem like a special case, it is not.
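To make this concrete, here is a minimal sketch of how such a yearly back-test schedule could be generated. The `yearly_windows` helper and the window sizes are illustrative assumptions rather than our actual pipeline code; the loop only prints the windows, with comments marking where training and evaluation would plug in.

```python
# A minimal sketch of a yearly back-test, assuming data is partitioned by calendar year.

def yearly_windows(first_year, last_test_year, mode="expanding", size=2):
    """Yield (train_years, test_year) pairs for a retrospective back-test."""
    for test_year in range(first_year + size, last_test_year + 1):
        if mode == "expanding":
            train_years = list(range(first_year, test_year))         # e.g. 2016..2018 for 2019
        else:  # "sliding": fixed-size window ending just before the test year
            train_years = list(range(test_year - size, test_year))   # e.g. 2017..2018 for 2019
        yield train_years, test_year

for train_years, test_year in yearly_windows(2016, 2024, mode="sliding"):
    # Here the pipeline would: load data for `train_years` only,
    # train a fresh model, then predict and score destinations for `test_year`.
    print(f"train on {train_years} -> predict {test_year}")
```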

Model drift and new data will always dictate that models are retrained to benefit from data freshness and new patterns. So a robust model-training pipeline, curated with the same care that most production ETL data pipelines receive, is essential.

This shift isn’t just a nuance; it’s a transformative approach. As we confronted the challenges posed by FLiP, it became clear that merely adjusting our models, or the historical training data they use, wouldn’t suffice. Our focus had to shift from producing one impeccable model to ensuring that our entire ML pipeline could reliably generate a sequence of effective models. The objective here is not just to evaluate models but to rigorously assess the capability of the ML-training pipeline.

A New Approach to Validation

With an expanding or sliding window strategy, multiple models are trained on distinct, or possibly overlapping, slices of historical data, allowing for more genuine back-testing. By focusing on the pipeline’s ability to generate numerous reliable models, we prioritise:

  • Idempotence/Determinism: In a world where multiple models are the norm, reproducibility becomes paramount. Data scientists need to be able to isolate the impact of a code change on a model’s performance, and to do so it is imperative that, given a specific data snapshot as input, the pipeline always produces the same, or at least equivalent, models (see the sketch after this list).
  • Consistency: Ensuring that the models produced over different time windows are all of a high standard.
  • Temporal Stability: It’s vital to gauge if the model’s performance remains consistent over time. If we notice variations in performance for more recent years, it indicates the changing nature of external factors and the pipeline’s ability (or inability) to capture them.
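As a simple illustration of the first point, a determinism check could look something like the hypothetical sketch below: run the pipeline twice on the same frozen data snapshot with the same seed and assert that the reported metric matches. `train_and_score` and the snapshot path are placeholders for the real pipeline entry point.

```python
# Hypothetical reproducibility check for the training pipeline.
import random

import numpy as np


def train_and_score(snapshot_path, seed):
    """Placeholder for the real pipeline: seed randomness, train, return a metric."""
    random.seed(seed)
    np.random.seed(seed)
    return float(np.random.rand())  # stand-in for, say, the model's F1 score


SNAPSHOT = "s3://example-bucket/routes/2016-2017/"  # illustrative frozen snapshot

run_a = train_and_score(SNAPSHOT, seed=42)
run_b = train_and_score(SNAPSHOT, seed=42)
assert run_a == run_b, "Same snapshot + same seed should give the same (or equivalent) model"
```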
The Quest for Temporal Stability

Temporal stability isn’t a standalone concept; it is influenced by many factors, whether domain-oriented or related to computational concerns.

Nature of Data Changes: In our domain, the structure of data can evolve. For instance, if we observe significant structural changes in the energy sector due to geopolitical events or technological advancements, a sliding window might be more appropriate, giving more weight to recent data. Conversely, if we identify regular cyclic patterns over more extended durations, indicative of recurring trends in energy consumption or supply, an expanding window might offer a clearer perspective.

Business Objectives: In general, if we are aiming to understand long-term patterns and overarching trends, an expanding window should be apt. For more immediate, short-term insights, perhaps to respond rapidly to market changes, the sliding window becomes the tool of choice.

Computational Costs: Then again, as our data grows, the computational strain can increase. If computational resources are limited, a sliding window may be a more feasible choice due to the consistent dataset size.

Model’s Ability to Forget: Especially pertinent to deep learning models, some can “remember” patterns from old data even if the new data showcases different patterns. This retention can be a subtle form of overfitting. In these scenarios, a sliding window might be more effective in “forcing” the model to shed these outdated patterns.

So, given these factors, should we employ the sliding or expanding window approach, or should we try a hybrid strategy?

Sliding vs. Expanding Windows: Trade-offs and Considerations

Both methods have their advantages and drawbacks, and the optimal choice is often contingent on the specific problem and goals. As with any other data-science choice, it’s valuable to initially test both the sliding and expanding window approaches, then set aside a distinct test period and assess how models trained with each method perform. Hybrid strategies can also be contemplated: for instance, a weighted expanding window where recent data holds more sway but older data isn’t wholly disregarded. Let’s assess these options in more detail.

1. Sliding Window

This method involves training on a fixed-size window of data that ‘slides’ forward in time. For example, with a two-year window you’d train on data from 2017–2018 to predict 2019, then slide to 2018–2019 to predict 2020, and so on.

Advantages
  • Temporal Relevance: By always using a fixed-sized recent dataset, it ensures that the model is always trained on the most temporally relevant data. This is particularly useful in fast-changing environments where older data might not be as relevant.
Drawbacks
  • Limited Historical Context: Sliding windows might overlook longer-term patterns by focusing only on recent data.
  • Increased Variability: By constantly changing the training data set, the model might produce more variable results across different time windows.
2. Expanding Window

Here, the window of data ‘expands’ over time. Using the same example, you’d train on data from 2017–2018 to predict 2019, then on 2017–2019 to predict 2020, continuing to include all prior data.

Advantages
  • Rich Historical Context: It provides the model with more historical data, potentially capturing long-term patterns and trends.
  • Stability: Tends to produce more stable models since they’re always trained on all available prior data.
Drawbacks
  • Computational Overhead: As the training dataset grows over time, computational costs can increase.
  • Potential Overemphasis on Historical Data: In fast-changing environments, older data might become less relevant, but with an expanding window, it’s always included in the training set.
3. Combination of Both

In some scenarios, a hybrid approach that combines elements of both methods might be the most appropriate. For instance, an expanding window can be used up to a certain point in time, after which a sliding window is employed to maintain a balance of computational efficiency and temporal relevance.
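One possible way to express such a hybrid policy is sketched below with a hypothetical `hybrid_window` helper: expand the training window until it reaches a cap, then let it slide forward.

```python
# Hybrid window sketch: expand until `max_years` of history is reached, then slide.

def hybrid_window(first_year, test_year, max_years=4):
    """Return the training years used to predict `test_year`."""
    start = max(first_year, test_year - max_years)  # cap the window at `max_years`
    return list(range(start, test_year))

for test_year in range(2018, 2025):
    print(test_year, hybrid_window(2016, test_year))
# 2018 -> [2016, 2017]               (still expanding)
# 2020 -> [2016, 2017, 2018, 2019]   (cap reached)
# 2021 -> [2017, 2018, 2019, 2020]   (now sliding)
```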

Contrasting in a Table:

| | Sliding Window | Expanding Window |
|---|---|---|
| Training data | Fixed-size window of the most recent data | All data from the start up to the prediction point |
| Advantages | Temporal relevance; adapts quickly to change | Rich historical context; more stable models |
| Drawbacks | Limited historical context; more variable results | Growing computational overhead; may overemphasise older, less relevant data |

So, when deciding between sliding, expanding, or a combination of both, it’s essential to understand the nature of the data and the specific predictive challenges at hand. For industries or applications where recent events significantly influence outcomes, a sliding window might be more suitable. Conversely, when long-term trends play a crucial role, an expanding window might be more appropriate. A deep understanding of the underlying data-generating processes and business objectives will guide this choice.

Measuring the Pipeline’s Effectiveness

A question naturally arises: “How do we measure the effectiveness of such a pipeline?” Here are some ideas:

  • Aggregate Metrics: By evaluating models trained over multiple periods, we can derive aggregate performance metrics (accuracy, F1 score, precision) to provide a holistic view of the pipeline’s capabilities. For instance, the median and variance of F1 scores across all models could be insightful. A low variance might indicate consistency, while a median close to the top of the range indicates good overall performance (see the toy example after this list).
  • Adaptability: In an industry as dynamic as ours, data sources and feature sets evolve. An effective ML pipeline should not only adapt to these changes seamlessly but do so without compromising performance.
  • Data Leakage Detection: Tools that detect suspicious correlations or performances that seem ‘too good to be true’ are invaluable. This is because extraordinary performance often hints at the leakage of future data into training sets, marring the model’s predictive integrity.
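As a toy illustration of the first bullet, the per-window scores from a back-test can be summarised with standard-library functions; the F1 values below are invented purely for the example.

```python
# Aggregate per-window scores into pipeline-level metrics (illustrative numbers only).
import statistics

f1_per_window = {2018: 0.81, 2019: 0.79, 2020: 0.72, 2021: 0.80, 2022: 0.78}

median_f1 = statistics.median(f1_per_window.values())
variance_f1 = statistics.pvariance(f1_per_window.values())

print(f"median F1 = {median_f1:.2f}, variance = {variance_f1:.4f}")
# A high median with low variance suggests consistently good models; an outlier window
# (here 2020) is a prompt to investigate drift, data issues, or leakage.
```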

Of these three, it is worth focussing a bit more on data leakage. Data leakage is a silent killer in retrospective analysis; it’s an ML pitfall we constantly guard against. Some best practices to counteract it include:

  • Feature Construction: Ensure features aren’t inadvertently calculated using future data. It might seem elementary, but even subtle oversights here can lead to models being fed information from the future, instantly rendering them unrealistic.
  • Alignment with External Data: When integrating external datasets, they must be in harmony with the same temporal restrictions as our primary data.
  • Shuffling Care: Be wary of shuffling time-series data. Sequential data carries the baggage of time, and random shuffling can lead to disastrous temporal inconsistencies.
  • Cross-validation Techniques: Conventional cross-validation isn’t fit for time-series data. Techniques like time series split or rolling window validation are preferred.
  • Feature Engineering for Each Window: Processes like normalisation or standardisation must be performed in a manner that doesn’t infuse future information into past records. Cleaning, normalisation, and feature engineering must be meticulously re-executed for each data window (see the sketch after this list).
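The last two bullets can be combined into a leakage-aware cross-validation loop: chronological splits, with any scaling fitted on the training fold only. The sketch below assumes scikit-learn is available and uses synthetic data in place of real features.

```python
# Leakage-aware evaluation: chronological splits, per-fold preprocessing.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                         # stand-in for time-ordered features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # stand-in for the target

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    scaler = StandardScaler().fit(X[train_idx])        # fitted on the past only
    model = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    accuracy = model.score(scaler.transform(X[test_idx]), y[test_idx])
    print(f"validate on rows {test_idx[0]}-{test_idx[-1]}: accuracy {accuracy:.2f}")
```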
Periodic Validation: Essential Across All Models

While our discussion may have emphasised retrospective predictions, it’s crucial to understand that these validation principles are just as vital for live models. Maintaining the accuracy of live models over time demands frequent updates and rigorous validation, achievable through either expanding or sliding window techniques. The expanding window integrates all past data up to the present, ensuring comprehensive learning, whereas the sliding window keeps the model updated with the most recent data, enhancing its responsiveness to new trends.

For those utilising neural networks (NNs), validation typically occurs at the end of each epoch, where an epoch is defined as a complete cycle of processing the full dataset. However, the necessity for regular validation extends beyond NNs to all types of models. This is where time-series cross-validation becomes particularly important. It allows models to be tested on multiple chronological splits, ensuring they perform well across different periods and conditions, thereby preventing overfitting and future data leakage.

Regular, intermittent validation safeguards against the Future Leakage Paradox by preventing the inadvertent inclusion of future data in past predictions. It ensures that all models, whether ‘live’ or ‘historical,’ consistently reflect accurate and real-world conditions. By aligning our models with both the latest data and historical truths and employing methods like time-series cross-validation, we prevent biases and maintain the robustness and reliability needed for effective decision-making.

Evaluating Efficiency and Ensuring Traceability

Finally, efficiency metrics can act as an invaluable guide. By evaluating models over diverse periods, aggregate performance metrics like accuracy, F1 score, and precision offer a panoramic view of the pipeline’s prowess. Yet, beyond aggregate scores, we need to delve deeper.

For example, one aspect we care about is adaptability. As already explained, industries like ours are in flux. Data sources change. Features evolve. An ideal ML pipeline shouldn’t just cope with these shifts but should leverage them without performance erosion. At the same time, versioning and traceability are key engineering notions that become even more essential to our work. In a world with multiple models churned out periodically, tracking their lineage becomes vital. Each model, its associated data, features, and hyperparameters need meticulous logging.

This traceability is essential not just for reproducibility but to ensure that as we refine our processes, we’re building on a foundation of understanding and not just iteration.
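In practice this can be as simple as persisting a lineage record next to every model artefact. The field names and destination below are illustrative assumptions rather than a prescribed schema.

```python
# Hypothetical lineage record written alongside each trained model.
import json
from datetime import datetime, timezone

lineage = {
    "model_id": "vessel-destination-2019-window",      # illustrative identifier
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "training_window": {"start": 2017, "end": 2018},   # the sliding window behind this model
    "data_snapshot": "routes_snapshot_2018-12-31",     # frozen input so the run can be replayed
    "code_version": "git:abc1234",                     # pipeline commit used for training
    "hyperparameters": {"learning_rate": 0.05, "max_depth": 8},
    "metrics": {"f1": 0.79},
}

with open("model_lineage.json", "w") as f:
    json.dump(lineage, f, indent=2)
```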

Last words…

The application of machine learning in the energy sector continues to evolve, much like in other industries, with a strong focus on enhancing the accuracy and reliability of our models. Key to this evolution is the shift from merely fine-tuning models to rigorously validating entire data pipelines. This ensures that our predictions remain robust against rapidly changing data landscapes and the complexities of retrospective analysis.

Our commitment to improving pipeline validation processes is not merely a response to immediate challenges but a proactive approach to future-proofing our models. Through these efforts, we aim to keep our predictions accurate and our methodologies transparent, setting a solid foundation for tackling new problems as they arise.

We encourage others in the field to share how they are addressing these or similar challenges and keep an eye on this product blog for more (hopefully) interesting posts.

See you soon.

Christos Hadjinikolis
Lead ML Engineer
Vortexa