Feathr: LinkedIn Feature Store Now Available in Azure

With the advancement of AI and machine learning, companies are starting to use complex machine learning pipelines in various applications, such as recommendation systems, fraud detection, and more. These complex systems typically require hundreds to thousands of features to support critical business applications, and feature pipelines are maintained by different team members in different business groups.

These machine learning systems present numerous challenges that consume significant effort from machine learning engineers and data scientists, including duplicated feature development, online-offline skew, and low-latency feature serving.

Duplicated Feature Development

In many organizations, thousands of features are scattered across different scripts and formats; because they are not captured, organized, or preserved, they cannot be discovered or reused by any team other than the one that created them.

Since feature development is crucial for machine learning models, and features cannot easily be shared, data scientists end up duplicating feature engineering work across different teams.

Online-Offline Skew

When it comes to features, offline training and online serving usually require different data serving pipelines, and ensuring feature consistency across these environments comes at a high cost.

Teams refrain from using real-time data for serving due to the complexity of providing the right data.

Providing a convenient way to guarantee point-in-time correctness of feature data is also key to preventing label leakage.
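To make the point-in-time requirement concrete, the sketch below (plain pandas, not Feathr-specific; table and column names are illustrative) joins each label event with the latest feature value observed at or before the event time, which is exactly what prevents leakage:

```python
import pandas as pd

# Observation (label) events and a feature whose value changes over time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-05"]),
    "label": [0, 1, 0],
})
feature_history = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(["2024-02-20", "2024-03-05", "2024-02-15", "2024-03-08"]),
    "f_total_spend_7d": [12.0, 40.0, 7.5, 22.0],
})

# merge_asof picks, for each label row, the most recent feature value at or before
# event_time, so feature values recorded after the label event never leak into training.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    feature_history.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
```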

Low-Latency Feature Serving

For real-time applications, retrieving features from a database at serving time with high throughput and without compromising response latency can be a challenging task.

Easy access to low-latency features is crucial in many machine learning scenarios, and optimizations such as consolidating multiple feature lookups into a single request are often needed.
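As a rough illustration of consolidating lookups, the sketch below (assuming the redis-py client against an online store such as Azure Cache for Redis; the hostname and key layout are placeholders) fetches several features for several entities in one round trip instead of one request per feature:

```python
import json
import redis  # redis-py client; connection details below are placeholders

r = redis.Redis(host="<my-cache>.redis.cache.windows.net", port=6380,
                ssl=True, password="<access-key>")

def get_online_features(user_ids, feature_names):
    """Fetch several features for several users in a single MGET round trip."""
    keys = [f"user:{uid}:{feat}" for uid in user_ids for feat in feature_names]
    values = r.mget(keys)  # one network call instead of len(keys) separate lookups
    return {k: (json.loads(v) if v is not None else None)
            for k, v in zip(keys, values)}

features = get_online_features([101, 102], ["f_total_spend_7d", "f_txn_count_30d"])
```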

To address these challenges, a concept called the Feature Store was developed, so that:

  • Features are centralized within the organization and can be reused
  • Features can be served synchronously between offline and online environments
  • Features can be served in real-time with low latency

Introducing Feathr, a Battle-Tested Feature Store

Building a feature store from scratch takes time, and much more time is spent making it stable, scalable, and user-friendly. Feathr is a feature store that has been in production and tested at LinkedIn for over 6 years, serving the entire LinkedIn machine learning feature platform with thousands of production features.

At Microsoft, the LinkedIn team and the Azure team have collaborated closely to release Feathr as an open-source project, make it extensible, and build native integration with Azure. It is available in this GitHub repository, and you can learn more about Feathr in the LinkedIn engineering blog.

Some key points about Feathr include:

  • Scalability with built-in optimizations. For example, in some internal use cases, Feathr handles billions of rows and petabyte-scale data using optimizations such as Bloom filters and salted joins.
  • Rich support for point-in-time joins and aggregations: Feathr has high-performance operators built for the Feature Store, including time-based aggregations, sliding-window joins, and lookup features, all with point-in-time correctness.
  • Highly customizable user-defined functions (UDFs) with native PySpark and Spark SQL support, simplifying data processing and analysis for data practitioners.
  • Pythonic APIs to access everything with a low learning curve, plus integration with model building, so data practitioners can be productive from day one (see the feature-definition sketch after this list).
  • Rich type system, including embedding support for advanced ML/DL scenarios. One common use case is creating embeddings for customer profiles, and these embeddings can be reused across the organization in all ML applications.
  • Native cloud integration with a simplified and scalable architecture, illustrated in the next section.
  • Easier feature sharing and reuse: Feathr has a built-in feature registry, making it easy to share features across different teams and empower team productivity.
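As an example of the Pythonic API and time-based aggregations, here is a minimal feature-definition sketch modeled on the examples in the Feathr repository; the source path, key column, and feature names are illustrative, and the exact API may differ by version:

```python
from feathr import (Feature, FeatureAnchor, HdfsSource, TypedKey,
                    ValueType, WindowAggTransformation, FLOAT)

# A batch source with an event timestamp column, so joins can be point-in-time correct.
batch_source = HdfsSource(
    name="purchases",
    path="abfss://data@<storage-account>.dfs.core.windows.net/purchases/",
    event_timestamp_column="purchase_time",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

user_key = TypedKey(key_column="user_id",
                    key_column_type=ValueType.INT64,
                    description="user id")

# A sliding-window aggregation feature: total spend over the trailing 7 days.
f_total_spend_7d = Feature(
    name="f_total_spend_7d",
    key=user_key,
    feature_type=FLOAT,
    transform=WindowAggTransformation(agg_expr="cast_float(amount)",
                                      agg_func="SUM",
                                      window="7d"))

purchase_anchor = FeatureAnchor(name="purchase_features",
                                source=batch_source,
                                features=[f_total_spend_7d])
```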

A data or machine learning engineer creates features using their preferred tools (such as pandas, Azure Machine Learning, Azure Databricks, and others).

These features are loaded into offline stores (see the materialization sketch after the list below), which can be:

  1. Azure SQL Database (including serverless), Azure Synapse dedicated SQL pool (formerly SQL DW).
  2. Object storage, such as Azure Blob storage, Azure Data Lake Store, etc. The format can be Parquet, Avro, or Delta Lake.
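For instance, a feature pipeline built with one of these tools might materialize its output into object storage roughly like the following PySpark sketch (the storage account, container, and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-materialization").getOrCreate()

# Read raw events from the data lake (path is a placeholder).
purchases = spark.read.parquet(
    "abfss://raw@<storage-account>.dfs.core.windows.net/purchases/")

# Compute simple per-user features.
user_features = (purchases
                 .groupBy("user_id")
                 .agg(F.sum("amount").alias("f_total_spend"),
                      F.count("*").alias("f_txn_count")))

# Write the feature table to the offline store as Parquet (Avro or Delta Lake work similarly).
user_features.write.mode("overwrite").parquet(
    "abfss://features@<storage-account>.dfs.core.windows.net/user_features/")
```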