LinkedIn Open Sources a Tool that Formats Big Data for TensorFlow

LinkedIn has open sourced a tool it developed to convert Apache Spark-based big data into a format that can be readily consumed by TensorFlow, a widely used deep learning library.

The tool, called Avro2TF, aims to reduce the complexity of data processing and improve the velocity of advanced modeling.

With this technology, developers can improve productivity by focusing on building machine learning (ML) models rather than converting data, said LinkedIn, which is on a mission to democratize machine learning.

“Based on the feedback from our users on the LinkedIn ML vertical teams, we needed a scalable solution focused on scalable data conversion. More specifically, we needed a solution that converted our LinkedIn data types (e.g., sparse vector, dense vector, etc.) into a deep learning format (i.e., tensors),” the LinkedIn engineering team explained in a blog post.

The team noted that, like LinkedIn, many other companies have vast amounts of ML data in a similar sparse vector format, and that the tensor format is still relatively new to many of them. They believe that Avro2TF can help address these challenges.

“Avro2TF bridges the gap and presents an elegant solution for ML engineers, freeing them up to focus on different deep learning algorithms,” the team stated.

How Avro2TF converts big data into tensors

Avro2TF provides a simple config for modelers to obtain tensors from existing training data.

Tensor data is not self-contained: to be loaded into TensorFlow, it must be accompanied by metadata. Avro2TF fills this gap by providing a distributed metadata collection job.
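To make the idea concrete, here is a minimal sketch in plain Python of why a sparse vector needs metadata before it can be used as a tensor. It is not Avro2TF's API or config format. The sample rows, the function names, and the assumption that the metadata is a mapping from feature names to integer indices are all hypothetical and chosen only for illustration. A real job would compute the mapping in a distributed fashion rather than in memory.

```python
from typing import Dict, List, Tuple

# Hypothetical training rows: each row is a sparse vector stored as
# (feature name, value) pairs, similar in spirit to the sparse vector
# format the article describes. The data is made up for illustration.
Row = List[Tuple[str, float]]

ROWS: List[Row] = [
    [("title=engineer", 1.0), ("skill=spark", 0.5)],
    [("skill=tensorflow", 2.0)],
    [("title=engineer", 1.0), ("skill=tensorflow", 0.25)],
]


def collect_metadata(rows: List[Row]) -> Dict[str, int]:
    """Scan all rows and give each distinct feature name a stable index.

    This stands in for a metadata collection step (an assumption, not
    the tool's actual implementation).
    """
    names = sorted({name for row in rows for name, _ in row})
    return {name: idx for idx, name in enumerate(names)}


def to_sparse_tensor(row: Row, vocab: Dict[str, int]) -> Tuple[List[int], List[float]]:
    """Turn one row into parallel index and value lists, ordered by index."""
    pairs = sorted((vocab[name], value) for name, value in row)
    return [i for i, _ in pairs], [v for _, v in pairs]


def to_dense(indices: List[int], values: List[float], size: int) -> List[float]:
    """Expand the sparse form into a fixed-length dense vector."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


vocab = collect_metadata(ROWS)
tensors = [to_sparse_tensor(row, vocab) for row in ROWS]
first_dense = to_dense(*tensors[0], size=len(vocab))
```

Without the `vocab` mapping, the integer indices would have no meaning and the vector length would be unknown, which is the sense in which tensor data alone is not self-contained.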

Inside LinkedIn, Avro2TF is an integral part of a system called TensorFlowIn that helps users easily feed data into the TensorFlow modeling process.

Figure: An overview of TensorFlowIn

“Since large-scale data processing is an important step that is not only critical to many LinkedIn applications, but is also useful to the larger AI community, we decided to open source this engine after receiving positive internal feedback,” the LinkedIn engineering team added.

The project is now available on GitHub, and the company has published a tutorial on how to use it.

 
