Speed Up Your REST Workflows with asyncio

01.08.2020 03:02 API concurrent python REST

I have been waiting for a project that would allow me to dig into the Python’s asyncio library. Recently, such a project presented itself. I was tasked with hitting a rate limited REST API with just under 4 million requests. My first attempt was simple. Gather and build a block of search queries, POST each one to the API, process the results, and finally insert them in a database. Here is what the code looked like:

The code above is in need of a refactor. It is slow. Why? Every time we call the API with a query, the CPU waits for the API to respond, which could take a few hundredths of a second or longer. While that may not seem like a lot of time, it really adds up. Remember, we are going to make nearly 4 million queries. To explain the problem, let’s think about how a fast food restaurant like McDonald’s works.

McDonalds makes burgers concurrently. That means they work on more than one burger at a time. Imagine how long it could take if they waited to start an order until the previous order was complete. What happens when an ingredient goes missing? All of the upcoming orders in the queue are stuck waiting for the missing ingredient! That is not efficient. McDonald’s breaks the burger making process into small, repeatable processes. Grill station, condiment station, wrapping station, etc. The code above suffers from the need to complete an HTTP request before moving on to the next. asyncio can break that process into concurrent tasks which will complete much faster.

I am not going to explain everything about asyncio. Event loops, coroutines, futures, work queues are all words that can scare off developers who have yet to encounter concurrent workflows. For more details on concurrency, threads, and asyncio check out the asyncio developer documentation, RealPython’s post on AsyncIO, or the excellent chapter from Operating Systems in Three Easy Pieces, “Concurrency and Threads.” Instead, I will attempt to explain why running these API requests concurrently is a good idea and something you too could implement.

Lets change the code to use asyncio for the HTTP requests. We will change the function responsible for calling the API to an asynchronous function and add the keyword pair async/await to a few locations. This change in code is because we need to convert functions to coroutine functions ****and then await the execution of each coroutine. The relevant changes are included below.

The updated code defines the functions that were previously waiting on network responses as awaitable coroutine functions. Notice we had to use different libraries for http objects. The http module in the standard library is blocking and not appropriate for asyncio, so we use aiohttp. We also use the keyword gather to collect the coroutines that were awaited. What does gather do?

https://docs.python.org/3/library/asyncio-task.html#asyncio.gather

In short, it simplifies handling coroutines.

The new code runs much faster than the first attempt. It processes blocks of 10,000 requests in seconds instead of minutes.

With great power comes added responsibility. Debugging asyncio programs is less straight forward than adding inline print statements. Use of the logging library and asyncio’s DEBUG functionality will really help when developing concurrent programs. Take care to not overwhelm webservers. Seriously. Understand the terms of use and follow the limits provided by the provider.

Using asyncio makes writing concurrent programs much simpler than juggling system calls yourself. A few code changes will improve performance for almost any operation that is I/O bound. The library is a powerful one and I have only scratched its surface. I encourage you to check out the resources linked to in the article to learn more.