Multiprocessing in Python and Jupyter Notebooks on Windows
When you are using Python for data science you will probably process and analyze large amounts of data. To speed up your code you can use the multiprocessing library (API documentation). This library is built into Python and allows you to run multiple processes in parallel. When you use multiprocessing on a Unix-based operating system you are basically golden and everything works as expected. On Windows, however, there are some weird quirks to keep in mind, and if you try to use it in a Jupyter Notebook on Windows you will run into even more problems. Luckily there are some workarounds to get it working.
Multiprocessing on Linux vs. Windows #
The main difference between Windows and a Unix-based OS is that Windows does not have a fork() system call. When you start a new process on Windows it is spawned: a fresh Python interpreter is started and the main module is imported into it. As a result the main module is executed again from the beginning, whereas on Unix the child process starts at the point where you called multiprocessing.Process().
How can we demonstrate this behavior? Let's create a simple Python script that prints a message when it starts and when it ends.
import multiprocessing as mp
from time import sleep

print('Before defining mp_func')

def mp_func():
    print('Starting mp_func')
    sleep(1)
    print('Finishing mp_func')

if __name__ == '__main__':
    p = mp.Process(target=mp_func)
    p.start()
    print('Waiting for mp_func to end')
    p.join()
When we run this script in Linux we get the following output:
Before defining mp_func
Waiting for mp_func to end
Starting mp_func
Finishing mp_func
This is the behaviour we would expect to see. Now when we run this script in Windows we get the following output:
Before defining mp_func
Waiting for mp_func to end
Before defining mp_func
Starting mp_func
Finishing mp_func
We see that Before defining mp_func is printed twice. This is because the main module is executed twice: the first time when we run the script and a second time when the new process is started. This happens because Windows spawns the new process instead of forking it.
Now it also becomes clear why the if __name__ == '__main__': block at the end is necessary. If we omitted it and included the code directly in the main module, every spawned process would re-execute that code and try to spawn yet another process, leading in theory to an endless chain of processes on Windows. Luckily, Python detects this situation and raises a RuntimeError instead when you try to start a new process without the if __name__ == '__main__': block present:
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
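If you are on Linux and want to reproduce the Windows behaviour, you can explicitly request the spawn start method. The snippet below is a small sketch that is not part of the original example; it simply reuses the mp_func script from above with multiprocessing.get_context('spawn'), so you can watch the main module being executed twice on Linux as well.

import multiprocessing as mp
from time import sleep

print('Before defining mp_func')

def mp_func():
    print('Starting mp_func')
    sleep(1)
    print('Finishing mp_func')

if __name__ == '__main__':
    # 'spawn' is the default start method on Windows; requesting it
    # explicitly makes Linux start a fresh interpreter and re-import
    # the main module, just like Windows does.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=mp_func)
    p.start()
    print('Waiting for mp_func to end')
    p.join()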
Multiprocessing in Jupyter Notebooks #
The most common use case for multiprocessing in Jupyter Notebooks is parallelizing a for-loop. Let's say we want to calculate the square of a list of numbers. We can do this in a for-loop like this:
def square(x):
    return x**2

numbers = [1, 2, 3, 4, 5]

squares = []
for number in numbers:
    squares.append(square(number))

print(squares)
Which will give us the following output:
[1, 4, 9, 16, 25]
Now let’s say we want to speed up this code by using multiprocessing. We can do this by using the multiprocessing.Pool()
class. This class will allow us to run multiple processes in parallel. We can use the map()
method to apply a function to a list of arguments. Let’s see how this works:
from multiprocessing import Pool

def square(x):
    return x**2

numbers = [1, 2, 3, 4, 5]

with Pool() as pool:
    squares = pool.map(square, numbers)

print(squares)
When we run this in a Jupyter Notebook on Windows, however, it does not work and fails with an AttributeError. On a Unix-based OS this will give us the following output:
[1, 4, 9, 16, 25]
What happens under the hood when Python spawns a process is that Python pickles (serializes) the arguments and a reference to the worker function and sends them to the new process. The new process starts a fresh Python interpreter and re-imports the main module, which is why the main module is executed again from the beginning. The new process then calls the worker function, square() in this example, with the unpickled arguments. After it has finished, it pickles (serializes) the return value and sends it back to the parent process, which unpickles (deserializes) it and continues executing the main module.
There is one problem with this approach. A Jupyter Notebook is not a regular Python script but a web application that runs an interactive Python kernel in the background. The functions you define in notebook cells live in the kernel's main module, for which there is no script file on disk that the spawned process can re-import. The worker function therefore cannot be found and unpickled in the new process.
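To make this concrete, here is a small sketch (not from the original post) showing that pickling a function only stores a reference to it, the module name and the function name, not the function body. The spawned process has to re-import that module and look the name up again, which is exactly what fails for a function defined in a notebook cell.

import pickle
import pickletools

def square(x):
    return x**2

# Pickling a module-level function stores a reference such as
# '__main__' / 'square', not the function's code.
payload = pickle.dumps(square)
pickletools.dis(payload)   # shows a (STACK_)GLOBAL opcode referencing '__main__' and 'square'
print(square.__module__)   # '__main__' -- the module the child process must be able to re-import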
There are four possible solutions to the problem.
Solution 1: Import the worker function from a separate module #
The first solution is to define the worker function in a separate Python file and then import it from that module. This works because the worker function is no longer part of the main module and can therefore be pickled and sent to the new process. Let's see how this works:
# worker.py
def square(x):
    return x**2

# Cell in Jupyter Notebook
from multiprocessing import Pool
from worker import square

numbers = [1, 2, 3, 4, 5]

with Pool() as pool:
    squares = pool.map(square, numbers)

print(squares)
Which will give us the following output:
[1, 4, 9, 16, 25]
Solution 2: Use a ThreadPool instead of a Pool #
The second solution is to use a multiprocessing.pool.ThreadPool instead of a multiprocessing.Pool. This works because a ThreadPool does not spawn new processes but new threads. The main module is not executed again and nothing has to be pickled and sent to another process. Keep in mind that, because of the GIL, threads will mainly speed up I/O-bound work rather than CPU-bound work. Let's see how this works:
from multiprocessing.pool import ThreadPool

def square(x):
    return x**2

numbers = [1, 2, 3, 4, 5]

with ThreadPool() as pool:
    squares = pool.map(square, numbers)

print(squares)
Which will give us the following output:
[1, 4, 9, 16, 25]
Solution 3: Use joblib’s embarrassingly parallel for loops #
joblib is a third-party library that provides tools for parallel computing. You can install joblib by running:
pip install joblib
joblib provides the Parallel class, which allows you to run multiple processes in parallel, and the delayed() helper, which wraps each function call so it can be executed by the workers. An added benefit of using joblib is that it offers automatic batching of tasks and can show the current progress. joblib uses loky as its backend for spawning processes, which in turn uses cloudpickle instead of pickle.
The benefit of using joblib is that it can be used directly in a Jupyter Notebook on Windows and it has a very simple API.
Let’s see how this works:
from joblib import Parallel, delayed

def square(x):
    return x**2

numbers = [1, 2, 3, 4, 5]

squares = Parallel(n_jobs=-1, verbose=1)(delayed(square)(number) for number in numbers)
print(squares)
Which will give us the following output:
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[1, 4, 9, 16, 25]
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.4s finished
Solution 4: Use dask for parallel computing #
dask is a third-party library that provides tools for parallel computing. You can install dask by running:
pip install dask
One of these tools is the dask.delayed decorator, which lets you build up a collection of tasks and run them in parallel. Like joblib, dask offers automatic batching of tasks and uses cloudpickle under the hood, so it can be used directly in a Jupyter Notebook on Windows.
The benefit of using dask is that, if needed, you can scale the parallel computation up to a cluster of machines (see the sketch after the example below).
Let's see how this works:
import dask

@dask.delayed
def square(x):
    return x**2

numbers = [1, 2, 3, 4, 5]

squares = []
for number in numbers:
    s = square(number)
    squares.append(s)

print(*dask.compute(squares))
Which will give us the following output:
[1, 4, 9, 16, 25]
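As a rough sketch of what scaling out could look like, the snippet below is not from the original post and assumes the distributed extra is installed (pip install "dask[distributed]"). Creating a Client with no arguments starts a local cluster of worker processes; pointing it at a scheduler address instead would run the very same delayed tasks on a cluster of machines.

import dask
from dask.distributed import Client

@dask.delayed
def square(x):
    return x**2

if __name__ == '__main__':
    # Client() starts a local cluster of worker processes;
    # Client('tcp://scheduler:8786') would connect to a remote
    # scheduler instead (hypothetical address).
    client = Client()
    numbers = [1, 2, 3, 4, 5]
    squares = [square(number) for number in numbers]
    # With a Client active, dask.compute() uses the distributed scheduler.
    print(*dask.compute(squares))
    client.close()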
Conclusion #
In this post we have seen that multiprocessing in Python has some quirks on Windows and even more in Jupyter Notebooks, and that multiprocessing on Windows behaves differently from multiprocessing on Linux. There are four possible solutions to the problem. For I/O-bound tasks you can use a multiprocessing.pool.ThreadPool to speed up your code. When dealing with CPU-bound tasks you can use a multiprocessing.Pool as long as you import the worker function from a separate module. If you want a simpler API you can use joblib for parallelizing for-loops. Lastly, if you want to scale your parallel computing up to a cluster of machines you can use dask.