With the advancement of AI and machine learning, companies are starting to use complex machine learning pipelines in various applications, such as recommendation systems, fraud detection, and more. These complex systems typically require hundreds to thousands of features to support critical business applications, and feature pipelines are maintained by different team members in different business groups.
In these machine learning systems, we encounter numerous challenges that consume a great deal of effort from machine learning engineers and data scientists, including duplicated feature development, online-offline skew, and low-latency feature serving.
In organizations, thousands of features are buried in different scripts and formats; they are not captured, organized, or preserved, and thus cannot be reused and leveraged by any team other than the one that created them.
Because feature development is central to building machine learning models, and features cannot be shared, data scientists end up duplicating their feature engineering work across teams.
When it comes to features, offline training and online serving usually require different data pipelines, and ensuring feature consistency across these environments comes at a high cost.
Teams refrain from using real-time data for serving due to the complexity of providing the right data.
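One common way to limit this online-offline skew is to define the feature transformation once and call the same code from both the offline training pipeline and the online serving path. The sketch below illustrates the idea with pandas; the trip_features function and the column names are made up for illustration.

```python
import pandas as pd

def trip_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature logic defined once, so the offline and online paths cannot drift apart."""
    out = pd.DataFrame(index=df.index)
    out["fare_per_km"] = df["fare_amount"] / df["trip_distance"].clip(lower=0.1)
    out["is_night"] = pd.to_datetime(df["pickup_time"]).dt.hour.isin(range(0, 6)).astype(int)
    return out

# Offline: applied to the full historical dataset to build training data.
train_df = pd.DataFrame({
    "fare_amount": [12.5, 30.0],
    "trip_distance": [3.2, 11.0],
    "pickup_time": ["2024-01-01 02:15:00", "2024-01-01 14:05:00"],
})
training_features = trip_features(train_df)

# Online: the same function applied to a single request payload at serving time.
request = pd.DataFrame([{"fare_amount": 9.0, "trip_distance": 2.1,
                         "pickup_time": "2024-01-02 23:40:00"}])
serving_features = trip_features(request)
```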
Providing a convenient way to ensure point-in-time correctness of the data is key to preventing label leakage.
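A point-in-time (as-of) join is the usual mechanism: each training example is joined only with feature values that were already known at that example's timestamp. A minimal sketch with pandas.merge_asof, using made-up user and purchase columns:

```python
import pandas as pd

# Labeled observations (the events we want to predict) with their timestamps.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-05"]),
    "label": [0, 1, 0],
})

# Feature values, with the time at which each value became known.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-02-20", "2024-03-08", "2024-03-07"]),
    "purchases_30d": [3, 7, 2],
})

# Point-in-time join: each label row gets the latest feature value observed
# at or before event_time; values recorded after the event are never used,
# which is exactly what prevents label leakage.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
```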
For real-time applications, retrieving features from a database at serving time with high throughput and without compromising response latency can be challenging.
Easy access to low-latency features is crucial in many machine learning scenarios, and optimizations are needed to consolidate the many REST API calls used to fetch features.
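In practice this often means co-locating all of an entity's features in an online store and fetching them in a single round trip rather than making one call per feature. A minimal sketch assuming Redis (via the redis-py client) as the online store; the key layout and feature names are hypothetical:

```python
import redis  # assumes the redis-py client and a running Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Features for one entity are stored together under a single key, so a lookup
# is one hash read instead of one call per feature.
r.hset("user:42", mapping={"purchases_30d": 7, "avg_basket": 31.5, "is_premium": 1})

# One round trip returns every feature the model needs at serving time.
feature_vector = r.hgetall("user:42")

# For a batch of entities, a pipeline consolidates the lookups into a single
# network round trip as well.
with r.pipeline() as pipe:
    for user_id in (42, 43, 44):
        pipe.hgetall(f"user:{user_id}")
    batch = pipe.execute()
```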
To address these challenges, a concept called the Feature Store was developed, so that:
- Features can be centralized in an organization and reused across teams.
- Features can be served consistently between offline training and online serving environments.
- Features can be served in real time with low latency.
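Conceptually, a feature store is a registry of named, owned feature definitions that both the training pipeline and the serving path resolve, which is what makes the properties above possible. The sketch below is a deliberately simplified, generic illustration of that idea (it is not Feathr's actual API); the FeatureDefinition class, the registry helpers, and the column names are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import pandas as pd

@dataclass
class FeatureDefinition:
    """One named, owned, documented feature: the unit a feature store registers."""
    name: str
    entity: str                                   # e.g. "user" or "trip"
    transform: Callable[[pd.DataFrame], pd.Series]
    description: str

REGISTRY: Dict[str, FeatureDefinition] = {}

def register(feature: FeatureDefinition) -> None:
    # Registering makes the feature discoverable and reusable by other teams,
    # instead of living in one team's private script.
    REGISTRY[feature.name] = feature

register(FeatureDefinition(
    name="fare_per_km",
    entity="trip",
    transform=lambda df: df["fare_amount"] / df["trip_distance"].clip(lower=0.1),
    description="Fare normalized by distance; owned by the pricing team.",
))

def compute(df: pd.DataFrame, names: List[str]) -> pd.DataFrame:
    # Both the offline training pipeline and the online serving path resolve the
    # same registered definition, so the two environments cannot silently diverge.
    return pd.DataFrame({n: REGISTRY[n].transform(df) for n in names})
```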
Building a feature store from scratch takes time, and much more time is spent making it stable, scalable, and user-friendly. Feathr is a feature store that has been battle-tested in production at LinkedIn for over six years, serving LinkedIn's entire machine learning feature platform with thousands of production features.
At Microsoft, the LinkedIn and Azure teams have worked closely to open-source Feathr, make it extensible, and build native integration with Azure. It is available in this GitHub repository, and you can learn more about Feathr in the LinkedIn engineering blog.
Some key points about Feathr include:
A data or machine learning engineer creates features using their preferred tools (such as pandas, Azure Machine Learning, Azure Databricks, and others).
These features are loaded into offline stores, which can be: