My Reading Notes for “Designing Machine Learning Systems” by Chip Huyen

Xing Zeng
8 min read · Jun 2, 2023


I recently finished reading Designing Machine Learning Systems by Chip Huyen, and I would like to share some of what I learned from it.

This book isn’t the type that goes very deep into one topic; rather, it covers as many topics as it can. I guess that’s exactly the right representation of the current state of designing machine learning systems: there are so many components inside it.

Figure: Sculley, David, et al. “Hidden technical debt in machine learning systems.” Advances in Neural Information Processing Systems 28 (2015).

Although the figure above was published almost 8 years ago, its point still holds: the many pieces surrounding the small ML Code box are complex, and have arguably only become more complicated since.

Among all the topics it covers, here are some of the ones that I, as a Backend Software Engineer working on implementing a text-based ML system, found most interesting and/or easiest to miss, grouped by chapter, along with some of my comments on the content.

Note that this is not meant to be a general summary of the book. Someone with a very different background and/or position (e.g. a Data Scientist) would focus on a completely different part of it. So I do recommend getting hold of this book if possible; it covers a lot of different aspects, and you may well learn from a different one than I did.

Chapter 3 — Data Engineering Fundamentals

In a Machine Learning System, it’s common to have three types of data flows:

  • Data Passing through Persistent Storage: everything is stored in persistent storage, and each of the different containers/components simply reads from and writes to the database.
  • Data Passing through Service: use a RESTful or RPC framework to create API interfaces that allow different containers/components to interact.
  • Data Passing through Real-Time Transport: use a message queue or pub/sub system to act as a broker that coordinates passing data between different containers/components (see the toy sketch after this list). The intermediary can be fully based on persistent storage, partially based on it, or not related to persistent storage at all.
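
To make the third mechanism concrete, here is a toy, in-process sketch of the real-time transport pattern. In a real system the broker would be something like Kafka or a managed pub/sub service; the component names and events below are purely illustrative.

```python
# A minimal sketch of the "real-time transport" style of data passing, using
# Python's standard library queue as a stand-in for a real broker.
import queue
import threading

broker = queue.Queue()  # the broker coordinating data between components

def producer():
    # e.g. a component emitting feature events as they arrive
    for event in ({"user_id": 1, "clicks": 3}, {"user_id": 2, "clicks": 7}):
        broker.put(event)
    broker.put(None)  # sentinel signalling the end of the stream

def consumer():
    # e.g. a downstream component computing features from the stream
    while (event := broker.get()) is not None:
        print("consumed:", event)

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
```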

In Batch Processing, which often deals with historical data as well as features that change less frequently, the first two mechanisms are more common. In Stream Processing, where live data may appear and features can change more quickly, the last mechanism is more common and sometimes even required.

Recently, there has been a shift toward Stream Processing, as it is sometimes seen as the more generic paradigm. Even if you have historical data, you can probably use the streaming paradigm by having a Batch Processing component that specializes in putting batch data into a streaming pipeline. Though I would like to point out that some algorithms, like those that train a neural network or operate on a graph, will probably have to stay as Batch Processing for a very long time: a significant number of neural network training algorithms these days are still not fully online, and graph operations naturally require the complete graph, rather than a single node or edge, to operate on.

Chapter 6 — Model Development and Offline Evaluation

Data Parallelism is still the most common method for training ML models across multiple machines: there is one centralized parameter server that holds all the model parameters. The other machines download the model from the parameter server, each has access to a different slice of the data, computes gradients on just the data it has in parallel, and submits the gradients back to the parameter server for processing. Whether the parameter server sets up a barrier before progressing to the next iteration is what separates Synchronous SGD from Asynchronous SGD.
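
As a minimal, single-process sketch of the synchronous flavour described above: the parameter server, the four simulated workers, the data shards, and the learning rate are all toy placeholders for illustration, not the book's example.

```python
# Toy synchronous SGD with a parameter server: every worker computes a
# gradient on its own data shard, and the server waits for all of them
# (the "barrier") before averaging and applying the update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def update(self, gradients):
        # Barrier: apply the update only once every worker has reported.
        self.w -= self.lr * np.mean(gradients, axis=0)

def worker_gradient(w, X_shard, y_shard):
    # Gradient of mean squared error on this worker's shard of the data.
    residual = X_shard @ w - y_shard
    return 2 * X_shard.T @ residual / len(y_shard)

server = ParameterServer(dim=3)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # 4 "workers"

for step in range(200):
    grads = [worker_gradient(server.w, Xs, ys) for Xs, ys in shards]
    server.update(grads)

print(server.w)  # should approach [1.0, -2.0, 0.5]
```

Dropping the barrier, i.e. letting the server apply each worker's gradient as soon as it arrives, would turn this into the asynchronous variant.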

Model Parallelism has also appeared, as a direction orthogonal to Data Parallelism, to speed up the training of larger ML models that may not fit on one machine. When a model's number of parameters becomes too large to store on a single machine, multiple machines can each store only part of the parameters. In the neural network case, the common approach is to store one group of layers on one machine and another group of layers on another machine. This is especially useful when there is a fully connected component between layers, which makes splitting a single layer complicated. Note that this still does not automatically mean the processing is actually parallel, since the layers still depend on each other. A popular technique (pipeline parallelism) works like this: while one machine is processing the forward pass of its group of layers on the first micro-batch of data, the other machines wait; once that micro-batch is finished, the machine holding the next group of layers kicks in to process it, while the machine holding the first group can already start on the next micro-batch.
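
Here is a toy sketch of that pipelining schedule, with two simulated "machines" and numpy matrix multiplies standing in for the layer groups. The overlap is only simulated sequentially in one process, but it shows how machine 1 can move on to the next micro-batch while machine 2 consumes the hand-off.

```python
# Toy two-stage pipeline: stage 1 ("machine 1") and stage 2 ("machine 2")
# each hold one group of layers; micro-batches flow through in a pipeline.
import numpy as np

rng = np.random.default_rng(0)
stage1_w = rng.normal(size=(4, 8))   # first group of layers ("machine 1")
stage2_w = rng.normal(size=(8, 2))   # second group of layers ("machine 2")

def stage1(x):
    return np.maximum(x @ stage1_w, 0.0)   # forward pass on machine 1

def stage2(h):
    return h @ stage2_w                    # forward pass on machine 2

micro_batches = [rng.normal(size=(16, 4)) for _ in range(3)]

# Schedule: as soon as stage 1 finishes micro-batch i and hands it off,
# it is free to start micro-batch i+1 while stage 2 consumes the hand-off.
in_flight = None
outputs = []
for batch in micro_batches + [None]:
    if in_flight is not None:
        outputs.append(stage2(in_flight))   # machine 2 processes the hand-off
    in_flight = stage1(batch) if batch is not None else None  # machine 1 moves on

print([o.shape for o in outputs])
```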

Chapter 7 — Model Deployment and Prediction Service

Depending on the data flow as well as the type of data the model needs to predict on, there can be the following prediction modes (a toy contrast follows the list):

  • Batch Prediction
  • Online Prediction that only uses Batch Features
  • Online Prediction that may use Streaming Features (also called Streaming Prediction)
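
As a toy contrast of these modes (the model, the feature, and the threshold below are made-up placeholders): batch prediction precomputes results and serves them by lookup, while online prediction computes at request time and can mix in fresher, streaming features.

```python
# Illustrative only: a trivial "model" and two serving styles.

def model(features):
    return 1.0 if features["clicks_last_7d"] > 5 else 0.0

batch_features = {1: {"clicks_last_7d": 3}, 2: {"clicks_last_7d": 9}}

# Batch prediction: run periodically, store the results, serve by lookup.
precomputed = {uid: model(f) for uid, f in batch_features.items()}

def serve_batch(user_id):
    return precomputed[user_id]

# Online prediction: compute at request time, optionally mixing in a
# streaming feature that was not available when the batch job ran.
def serve_online(user_id, streaming_clicks):
    features = dict(batch_features[user_id])
    features["clicks_last_7d"] += streaming_clicks
    return model(features)

print(serve_batch(1), serve_online(1, streaming_clicks=4))
```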

Combined with Training, which usually happens in batch but can be made online depending on the algorithm, an important goal is to make sure that stream processing and batch processing generate the same feature values for the same data point, rather than different ones. To avoid bugs that result from different implementations of the same logic, one should implement a unified data pipeline that is shared across stream processing and batch processing.
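
A minimal sketch of what that unification means in code, assuming the feature logic can be factored into one shared function (the text-length features below are just illustrative):

```python
# The feature logic lives in one function; both the batch path and the
# streaming path call it, so the same data point can never get two different
# feature values from two diverging implementations.

def compute_features(raw_event):
    # The single source of truth for feature logic.
    text = raw_event["text"]
    return {"length": len(text), "num_tokens": len(text.split())}

def batch_featurize(rows):
    # Batch path: e.g. a nightly job over a historical table.
    return [compute_features(row) for row in rows]

def stream_featurize(event):
    # Streaming path: e.g. called per event from the online service.
    return compute_features(event)

row = {"text": "designing machine learning systems"}
assert batch_featurize([row])[0] == stream_featurize(row)
```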

The unification of the data pipeline may not be trivial — more to come in Chapter 10.

Chapter 8 — Data Distribution Shift and Monitoring

It is important to continuously monitor the performance of the ML system as Data Distribution shifts may occur. Some monitoring techniques include:

  • Monitor accuracy-related metrics, if user feedback is immediate. This is common in recommendation systems and active learning scenarios.
  • Monitor the distribution of the predictions themselves, and check whether the distribution of the predicted labels has shifted away from that of the test labels.
  • Monitor the distribution of each of the features. This can be more generic, but the min/max/median/histogram would be a good starting point.
  • Monitor the raw input. The book claims this may be out of scope for the data science or ML team, but I think it depends on how the team structure is actually set up. In my field of Text Processing, this is also where interesting metrics come in: the length of the text, the language of the text, and the token distributions within the text can all be metrics to monitor (a small sketch follows this list). These can also be used as features, and thus fall under "monitoring features"; there are situations where such features may not contribute much to training a neural network model on text, especially in this ChatGPT era, but they can still be useful for the monitoring team to identify whether the data has shifted and whether a LoRA-style fine-tune may be required.
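
Here is a small sketch of the feature/raw-input monitoring idea, comparing a reference window against a live window of a text-length metric. The metric, the bins, and the alert threshold are my own illustrative assumptions, not from the book.

```python
# Track min/max/median and a histogram for a reference window vs. a live
# window, and flag a shift when the two histograms diverge too much.
import numpy as np

def summarize(values, bins):
    values = np.asarray(values, dtype=float)
    hist, _ = np.histogram(values, bins=bins, density=True)
    return {
        "min": values.min(),
        "max": values.max(),
        "median": float(np.median(values)),
        "hist": hist,
    }

def drift_score(reference, live, bins):
    # Total variation distance between the two normalized histograms.
    ref_h = summarize(reference, bins)["hist"]
    live_h = summarize(live, bins)["hist"]
    widths = np.diff(bins)
    return 0.5 * np.sum(np.abs(ref_h - live_h) * widths)

rng = np.random.default_rng(0)
reference_lengths = rng.normal(200, 30, size=1000)  # e.g. training-time text lengths
live_lengths = rng.normal(260, 30, size=1000)       # e.g. this week's text lengths

bins = np.linspace(0, 500, 51)
score = drift_score(reference_lengths, live_lengths, bins)
print("drift score:", round(score, 3), "alert:", score > 0.2)  # threshold is arbitrary
```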

Chapter 10 — Infrastructure and Tooling for MLOps

(This is my favourite chapter of the book. Indeed, this is the part that’s more related to a Backend Software Engineer ^_^)

Standardized Dev Environments, either across the company or just across the whole team, make it easy for a team to debug issues caused by different versions of different packages. When the need for a unified data pipeline arises (as in Chapter 7), a standardized dev environment also makes it easier to use such a pipeline, which may be developed to be more production-oriented and thus harder to deploy in an ad-hoc way. Containers are a great way to achieve such standardization.

ML Workflows, which are naturally iterative, can be managed through Workflow Management Tools, which make running workflows and monitoring their progress much easier. They also make sharing containers/components easier, which is useful for a unified data pipeline. The book gave Apache Airflow, Argo, and Metaflow as examples. I have also recently played with some other Workflow Management Tools like Prefect and Temporal. They all have their own pros and cons; I personally prefer Temporal, since it is better at scaling and crash handling, and it has more capability for supporting multiple languages.
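
For a flavour of what a managed workflow looks like, here is a minimal Airflow-style sketch (Airflow being one of the book's examples). The task bodies and schedule are placeholders; a Prefect or Temporal version would follow the same shape with different APIs.

```python
# A minimal Airflow DAG expressing a three-step ML workflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull raw data and compute features")

def train_model():
    print("train and log the model")

def evaluate_model():
    print("run offline evaluation and decide whether to promote")

with DAG(
    dag_id="example_ml_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate  # run the steps in order
```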

Storing a model in persistent storage may sound easy, but combined with the requirements of monitoring and deployment, it may not be trivial. Using a Model Store makes it much easier. MLflow is a very commonly used model store, and perhaps the most popular one not associated with a particular cloud provider. I have personally played with it, and MLflow's ability to also wrap part of the preprocessing functionality makes it a really good choice as a Model Store.
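
A sketch of that preprocessing-wrapping idea, using MLflow's pyfunc flavour of model: the tokenization and the "model" itself are placeholders, and the exact logging call may vary across MLflow versions.

```python
# Wrap a model together with its preprocessing as an MLflow pyfunc model,
# so whatever loads it from the model store gets the preprocessing for free.
import mlflow
import mlflow.pyfunc

class TextModel(mlflow.pyfunc.PythonModel):
    def _preprocess(self, text):
        # Preprocessing travels with the model artifact.
        return text.lower().split()

    def predict(self, context, model_input):
        # model_input is expected to be a DataFrame with a "text" column.
        return [len(self._preprocess(t)) for t in model_input["text"]]

with mlflow.start_run():
    mlflow.pyfunc.log_model(artifact_path="text_model", python_model=TextModel())
```

Whoever later loads this from the store (e.g. with mlflow.pyfunc.load_model) gets the preprocessing bundled with the model, which is exactly the property that makes MLflow attractive as a Model Store.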

The book also covered Feature Stores. This is a newer concept that focuses on managing features, computing features, maintaining the consistency of features across training and prediction, or all of the above. It is also highly related to Chapter 7's suggestion of building a unified data pipeline. Feast and Tecton are some of the examples given as Feature Store tools one may look into.
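
Conceptually (and this is not the Feast or Tecton API, just a toy illustration), a feature store's job looks something like this: one place that serves the same feature values to offline training and online prediction.

```python
# A toy feature store interface: an offline view for building training sets
# and an online view for serving, kept consistent by writing to both.

class ToyFeatureStore:
    def __init__(self):
        self._offline = {}   # historical feature values, e.g. backed by a warehouse
        self._online = {}    # latest feature values, e.g. backed by a key-value store

    def write(self, entity_id, features):
        self._offline.setdefault(entity_id, []).append(features)
        self._online[entity_id] = features

    def get_historical_features(self, entity_id):
        # Used to build training sets.
        return self._offline.get(entity_id, [])

    def get_online_features(self, entity_id):
        # Used at prediction time; same values the training path saw last.
        return self._online.get(entity_id)

store = ToyFeatureStore()
store.write("user_1", {"text_length": 120, "language": "en"})
print(store.get_historical_features("user_1"), store.get_online_features("user_1"))
```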

An important section of this book segued from the above: for these tools, should one build in-house or consider buying? This is, of course, very case-by-case. But I guess the fact that this question arises so naturally says something about the current stage of the MLOps platform landscape. This is certainly the best of times, as an immature industry means plenty of opportunities. It's also the worst of times: I can't count how many times I have heard of different ML tooling startups tackling different problems from different angles, and with all this jargon flying around it's really hard to decide what to do ;(

Chapter 11 — The Human Side of Machine Learning

MLOps requires both ML expertise and Ops (short for "Operations", usually referring to knowledge from the infra or backend side) expertise. Companies have either kept two separate teams for them or asked one team to be good at both. Both have their own issues: separate teams bring all the overhead you get whenever there is more than one team, while requiring one team to know both sides places an impractical demand on Data Scientists to understand low-level infrastructure. The author claims that the best approach probably lies somewhere in between: a good tool, built by those with Ops expertise, that abstracts the Ops concepts away from Data Scientists, so that with this tool Data Scientists can fully own their side of the work without much need to contact the Ops side of the team. This obviously doesn't eliminate all the communication complexity, but if the abstraction layer is done well enough, it will certainly make cross-team collaboration much more efficient.

As a former UBC NLP Lab graduate who thought about becoming a Data Scientist but somehow ended up as a Senior Backend Software Engineer, I feel I am somewhat unique in this standpoint. But I do agree cases like mine are rare, and I feel that standing on both sides sometimes leaves me perfect at neither infra nor ML. Either way, the ML system is complicated, and it naturally requires collaboration. And just as the author suggests, thinking more about packaging and abstraction is the best way to make collaboration across teams work. As for me, someone who happens to live in between, this is my perfect opportunity to think more about what exactly this abstraction should be.
