Concurrency and Parallelism

Symaware is meant to run real-time simulations, connecting to all kinds of simulators, physics engines, or even real robots. As such, the simulation step, i.e. the time it takes to run each component of the agent as well as the environment, is a critical factor in the performance of the simulation. Any step that takes longer than a few milliseconds becomes noticeable and will affect the user experience.
There are cases when the computational load is too high and this threshold cannot be met. Let’s investigate some possible workarounds based on the root cause of the problem.

Difference between Concurrency and Parallelism

First of all, it is important to understand the difference between concurrency and parallelism.

Concurrency is the ability of different parts or units of a program, algorithm, or problem to be executed out-of-order or in partial order, without affecting the final outcome. Concurrent tasks usually share the same memory (e.g. data structures, allocated storage), making communication easier. They may or may not run simultaneously, depending on the underlying architecture and the way the program is implemented.

```mermaid
block-beta
    columns 10
    task1:2 space:2 task2:2 space:2 task3:2
    space task4:3 space:2 task5:2 space:2
    style task1 fill:#D694AF,color:#000
    style task2 fill:#f46d43,color:#000
    style task3 fill:#fee08b,color:#000
    style task4 fill:#abdda4,color:#000
    style task5 fill:#3288bd,color:#000
```
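As a minimal sketch of this interleaving using Python's threading module (the names `worker` and `results` are illustrative): the two workers share the same list, and the order in which their items are appended is not guaranteed.

```python
import threading

results = []  # shared memory: both threads append to the same list

def worker(name: str, count: int) -> None:
    for i in range(count):
        results.append((name, i))  # interleaving order is not guaranteed

t1 = threading.Thread(target=worker, args=("a", 3))
t2 = threading.Thread(target=worker, args=("b", 3))
t1.start()
t2.start()
t1.join()
t2.join()

# All six items are present, but their relative order may vary between runs
print(sorted(results))
```

Regardless of how the two workers interleave, the final outcome is the same, which is exactly the defining property of concurrency described above.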

Parallelism is the ability of a program to execute multiple parts or units simultaneously, each being run by a separate core or processor. Parallel tasks are independent for the most part, but can communicate and synchronize with each other if needed.

```mermaid
block-beta
    columns 10
    task1:10
    space task2:7 space:2
    space:2 task3:8
    space task4:8
    style task1 fill:#D694AF,color:#000
    style task2 fill:#f46d43,color:#000
    style task3 fill:#fee08b,color:#000
    style task4 fill:#abdda4,color:#000
```

While they may seem similar and there is some overlap between the two, the results of employing one or the other can be drastically different.

Different kinds of bottlenecks

Before diving into the solutions, it is also important to understand the different kinds of bottlenecks that can arise in a simulation (or any other kind of program, for that matter).

CPU-bound

A CPU-bound bottleneck occurs when the CPU is the limiting factor in the performance of the simulation.
This can happen when the simulation is computationally intensive, meaning that the CPU is constantly busy performing calculations and has little time to do anything else.

```python
# Running this function with a large value of n
# will require the CPU to perform a lot of operations,
# taking a long time to complete
def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
```

I/O-bound

An I/O-bound bottleneck occurs when the input/output operations are the limiting factor in the performance of the simulation.
This can happen when the simulation requires a lot of data to be read from or written to disk, or when it needs to communicate with other systems over a network. In this case, the CPU has to wait for the I/O operation to complete before it can continue with the next task.

```python
import requests

# Running this function will require the program to wait
# for the response from the server, which can take a long time
# and is inherently orders of magnitude slower than a CPU cycle
def fetch_url(url: str) -> str:
    response = requests.get(url)
    return response.text
```

Concurrency and Parallelism in Python

When trying to achieve concurrency or parallelism in Python (CPython, to be precise), there is a notable obstacle one must overcome: the Global Interpreter Lock (GIL). The GIL is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. In other words, no matter how many threads are spawned, only one of them can execute Python code at a time. As a consequence, it is not possible to achieve parallelism through threads.

```mermaid
block-beta
    columns 11
    thread1 t11["t1"]:1 space:3 t12["t1"]:1 space:1 t13["t1"]:1 space:3
    thread2 space t21["t2"]:1 space:1 t22["t2"]:1 space:1 t23["t2"]:1 space:4
    thread3 space:2 t31["t3"]:1 space:4 t32["t3"]
    style t11 fill:#D694AF,color:#000
    style t12 fill:#D694AF,color:#000
    style t13 fill:#D694AF,color:#000
    style t21 fill:#f46d43,color:#000
    style t22 fill:#f46d43,color:#000
    style t23 fill:#f46d43,color:#000
    style t31 fill:#fee08b,color:#000
    style t32 fill:#fee08b,color:#000
```

This approach has its merits, as it makes race conditions and many of the other complications of concurrent programming a non-issue. Threads can still be useful for handling blocking I/O operations: the GIL is released while a thread waits on I/O, so other threads can run instead of busy-waiting for the operation to complete.

On the other hand, if the bottleneck is CPU-bound, the use of threads may even degrade the performance, as only one of them will run at any given time and the overhead of context switching between them will be added to the mix.
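This effect can be demonstrated with a small, illustrative benchmark: two CPU-bound fibonacci calls run first sequentially and then on two threads. Both versions produce the same answers, but the threaded run is typically no faster, since the GIL serializes the bytecode execution.

```python
import threading
import time

def fib(n: int) -> int:
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

# Sequential: two calls back to back
start = time.perf_counter()
a, b = fib(25), fib(25)
sequential = time.perf_counter() - start

# Threaded: two threads, but only one executes Python bytecode at a time
results = {}
t1 = threading.Thread(target=lambda: results.update(t1=fib(25)))
t2 = threading.Thread(target=lambda: results.update(t2=fib(25)))
start = time.perf_counter()
t1.start(); t2.start(); t1.join(); t2.join()
threaded = time.perf_counter() - start

# Same results; the threaded run gains nothing and pays for context switches
print(f"sequential: {sequential:.3f}s, threaded: {threaded:.3f}s")
```

Exact timings depend on the machine, but the threaded version will not approach the 2x speedup one might naively expect.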

Multiprocessing

To achieve parallelism in Python, one must use the multiprocessing module, which allows the creation of separate sub-processes, each with its own Python interpreter and memory space, that can run in parallel.

```mermaid
block-beta
    columns 11
    process1 p1:10
    process2 space p2:9
    process3 space:2 p3:7 space
    style p1 fill:#D694AF,color:#000
    style p2 fill:#f46d43,color:#000
    style p3 fill:#fee08b,color:#000
```

Asyncio

There is a third way to achieve concurrency in Python, which is through the asyncio module.

This approach closely resembles the thread-based concurrency model, but with a few key differences. Instead of using threads, asyncio uses coroutines, which are functions that can be paused and resumed at specific points. This allows for a more fine-grained control over the execution of the program, as the programmer can explicitly define where the execution should yield control to another coroutine with the await keyword. Moreover, the operating system is not involved in the context switch, making it more lightweight.

```mermaid
block-beta
    columns 8
    coroutine1 t11["c1 -> await"]:2 space:4 t12["resume -> c1"]:1
    coroutine2 space:2 t21["c2 -> await"]:1 space t22["resume -> c2"]:2 space
    coroutine3 space:3 t31["c3"]:1
    style t11 fill:#D694AF,color:#000
    style t12 fill:#D694AF,color:#000
    style t21 fill:#f46d43,color:#000
    style t22 fill:#f46d43,color:#000
    style t31 fill:#fee08b,color:#000
```
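A minimal sketch of this cooperative model (the coroutine names and delays are illustrative): each `await` hands control back to the event loop, so all three coroutines wait concurrently on a single thread.

```python
import asyncio

async def step(name: str, delay: float) -> str:
    # Awaiting yields control to the event loop, letting other coroutines run
    await asyncio.sleep(delay)
    return name

async def main() -> list[str]:
    # The three coroutines are in flight at the same time on one thread,
    # so total wall time is close to the longest delay, not the sum
    return await asyncio.gather(step("c1", 0.03), step("c2", 0.02), step("c3", 0.01))

results = asyncio.run(main())
print(results)  # ['c1', 'c2', 'c3']
```

Since the scheduling happens entirely in user space, switching between coroutines is much cheaper than an operating-system thread context switch.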

How to speed up slow simulations

Now that we have a better understanding of the different kinds of bottlenecks and the tools available to overcome them, let’s see how we can speed up slow simulations.

The first step is to identify the root cause of the problem. Is the simulation CPU-bound or I/O-bound?
Then, it is important to determine whether the slowdown can be localized in a single component (or group of components).
If that’s the case, is it strictly necessary to run it (or them) at the same frequency as the others?

Fast simulation

Assuming there is no bottleneck introduced by the framework’s components, the maximum pace of the simulation is determined by the simulator. The synchronous approach would be able to achieve the maximum performance with the minimum overhead.

```python
class MyComponent(Component):
    def _compute(self):
        ...  # Very fast computation

def main():
    # ...
    agent_coordinator = AgentCoordinator(...)
    agent = Agent(AGENT_ID, AgentEntity(...))
    agent.add_components(MyComponent(agent.id))
    # ...
    agent_coordinator.add_agents(agent)
    agent_coordinator.run(TIME_INTERVAL)  # This will run the simulation synchronously
```

which can be visualized as:

```mermaid
block-beta
    columns 9
    m["main thread"] m1["Environment 1"]:1 m2["Slow component 1"]:2 m3["Fast component 1"]:1 m4["Environment 2"]:1 m5["Slow component 2"]:2 m6["Fast component 2"]:1
    style m1 fill:#D694AF,color:#000
    style m4 fill:#D694AF,color:#000
    style m2 fill:#f46d43,color:#000
    style m5 fill:#f46d43,color:#000
    style m3 fill:#fee08b,color:#000
    style m6 fill:#fee08b,color:#000
```

I/O-bound simulation

If the simulation is I/O-bound (e.g. it requires a lot of data to be read from disk or fetched from the network), the best approach is to use the asynchronous model. This way, all other components can run while the I/O operation is in progress, making the most of the available resources. Note that this means that the slow component will run less frequently than the others.

```python
class MyComponent(Component):
    async def _async_compute(self):
        # I/O-bound computation
        await some_io_operation()  # placeholder for the actual I/O call

def main():
    # ...
    agent_coordinator = AgentCoordinator(...)
    agent = Agent(AGENT_ID, AgentEntity(...))
    agent.add_components(MyComponent(agent.id, TimeIntervalAsyncLoopLock(TIME_INTERVAL)))
    # ...
    agent_coordinator.add_agents(agent)
    agent_coordinator.async_run()  # This will run the simulation asynchronously
```

which can be visualized as:

```mermaid
block-beta
    columns 7
    c1["Environment coroutine"] e1["Environment 1"]:1 space:2 e2["Environment 2"]:1 space:2
    c2["Fast component coroutine"] space f1["Fast component 1"]:1 space:2 f2["Fast component 2"]:1 space
    c3["Slow component coroutine"] space:2 s1["Slow component 1 -> wait"]:1 space:2 s2["Slow component 1 -> continue"]:1
    style e1 fill:#D694AF,color:#000
    style e2 fill:#D694AF,color:#000
    style f1 fill:#f46d43,color:#000
    style f2 fill:#f46d43,color:#000
    style s1 fill:#fee08b,color:#000
    style s2 fill:#fee08b,color:#000
```

CPU-bound simulation

If the simulation is CPU-bound (e.g. the operation is computationally expensive), the best approach is to use the multiprocessing model. Even if the computation itself is not parallelizable, running it in another process will avoid blocking the main one and allow other components to run in parallel. Combining this approach with the asynchronous model can yield even better results, as the slow component will wait for the parallel process to complete while the others can run uninterrupted. Note that this means that the slow component will run less frequently than the others.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_computation(i: int) -> int:
    return i * i

class MyComponent(Component):
    async def _async_compute(self) -> list[int]:
        # Use multiprocessing to run the expensive computation in a parallel process
        # In this case we are using a single process, but more can be added if needed
        NUM_PROCESSES = 1
        inputs = list(range(NUM_PROCESSES))  # Example inputs
        outputs = []
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor(max_workers=NUM_PROCESSES) as executor:
            subtasks = [
                loop.run_in_executor(executor, cpu_computation, inputs[i])
                for i in range(NUM_PROCESSES)
            ]
            for task in asyncio.as_completed(subtasks):
                outputs.append(await task)
        return outputs

def main():
    # ...
    agent_coordinator = AgentCoordinator(...)
    agent = Agent(AGENT_ID, AgentEntity(...))
    agent.add_components(MyComponent(agent.id, TimeIntervalAsyncLoopLock(TIME_INTERVAL)))
    # ...
    agent_coordinator.add_agents(agent)
    agent_coordinator.async_run()  # This will run the simulation asynchronously
```

which can be visualized as:

```mermaid
block-beta
    columns 7
    c1["Environment coroutine"] e1["Environment 1"]:1 space:2 e2["Environment 2"]:1 space:2
    c2["Fast component coroutine"] space f1["Fast component 1"]:1 space:2 f2["Fast component 2"]:1 space
    c3["Slow component coroutine"] space:2 s1["Slow component 1 -> wait"]:1 space:2 s2["Slow component 1 -> continue"]:1
    c4["Parallel process"] space:2 p1["Parallel process"]:3
    style e1 fill:#D694AF,color:#000
    style e2 fill:#D694AF,color:#000
    style f1 fill:#f46d43,color:#000
    style f2 fill:#f46d43,color:#000
    style s1 fill:#fee08b,color:#000
    style s2 fill:#fee08b,color:#000
    style p1 fill:#abdda4,color:#000
```

Slow simulation at the same frequency

Unfortunately, there are cases when the slow component is necessary and must run at the same frequency as the others. The only option in this case is to either optimize the component computation in some way (e.g. using a more efficient algorithm or approximating the result) or slow down the simulation as a whole to accommodate the slow component. This is achieved automatically if the simulation is running in synchronous mode and the same will happen in asynchronous mode if the bottleneck is CPU-bound (see the asyncio section).

Some alternative approaches to side-step the problem outside the framework include running some precomputation before the simulation starts and caching the results of the slow component, or running the simulation on a more powerful machine.
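As a sketch of the caching idea, a pure and deterministic computation can be memoized with `functools.lru_cache`, so repeated calls during the simulation become dictionary lookups (`expensive_model` is a hypothetical stand-in for the slow component’s computation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_model(n: int) -> int:
    # Stand-in for a costly, deterministic computation
    return n if n <= 1 else expensive_model(n - 1) + expensive_model(n - 2)

# The first call computes and fills the cache; repeated inputs become lookups
print(expensive_model(30))  # 832040
```

This only works when the component’s output depends solely on its inputs and those inputs repeat across simulation steps; otherwise the cache never hits and only adds overhead.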