Last Updated : 04 Jul, 2024

Parallel computing is a powerful technique to enhance the performance of computationally intensive tasks. In Python, Numba is a Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code. One of its features is the ability to parallelize loops, which can significantly speed up your code.

Parallelizing for loops is often a key step in optimizing computationally intensive applications. Numba provides several tools to achieve parallelism, including the `prange` function and the `parallel=True` option. In this article, we will look at how to parallelize Python for loops effectively with Numba, covering the key concepts, techniques, and best practices.

Table of Contents

- Understanding Numba’s Parallelization Capabilities
- Why Parallelize Loops?
- Identifying Parallel Loops: Key Considerations
- Parallelizing Loops with Numba
- Example 1: Parallelizing a Simple Loop
- Example 2: Parallel Sum of Arrays
- Example 3: Estimating Pi Using Monte Carlo Methods
- Example 4: Using `prange` for Explicit Parallelization
- Advanced Example: Parallelizing Matrix Multiplication
- Measuring Performance Gains from Parallelization
- Best Practices for Parallelization

## Understanding Numba’s Parallelization Capabilities

Numba offers two primary methods for parallelizing code: automatic parallelization and explicit parallelization using `prange`. Automatic parallelization is enabled by setting `parallel=True` on the `@jit` (or `@njit`) decorator. With this option, Numba attempts to identify array operations it can run in parallel, which suits element-wise computations and reductions over arrays. `prange`, on the other hand, allows for explicit parallelization of specific loops, providing more control over the parallelization process; it takes effect only inside a function compiled with `parallel=True`.

### Why Parallelize Loops?

Parallelizing loops can drastically reduce the execution time of your code by distributing the workload across multiple CPU cores. This is particularly beneficial for tasks that are “embarrassingly parallel,” meaning they can be easily divided into independent subtasks.

### Identifying Parallel Loops: Key Considerations

Before diving into the parallelization process, it’s crucial to determine if your for loop is a suitable candidate. The ideal loops for parallelization are:

- **Embarrassingly Parallel:** Each iteration is independent and doesn’t rely on data modified in other iterations.
- **Computationally Intensive:** The time spent within each iteration is significant enough to outweigh the overhead of parallel execution.

## Parallelizing Loops with Numba

Numba provides the `prange` function, which is used to parallelize loops. It is similar to Python’s built-in `range` function but is designed for parallel execution.

**Installation:** First, you need to install Numba. You can do this using pip:

`pip install numba`

### Example 1: Parallelizing a Simple Loop

Let’s start with a simple example where we parallelize a loop that computes the sum of squares:

```python
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(n):
    result = 0
    # Iterations of this loop are split across threads
    for i in prange(n):
        result += i ** 2
    return result

n = 1000000
print(sum_of_squares(n))
```

Output:

`333332833333500000`

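In this example, the loop over `prange(n)` is executed in parallel across multiple CPU cores. Numba recognizes `result += i ** 2` as a reduction, so each thread accumulates a partial sum and the partial sums are combined at the end.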

### Example 2: Parallel Sum of Arrays

Let’s parallelize a loop that computes the sum of elements in an array.

```python
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_array(arr):
    total = 0
    for i in prange(len(arr)):
        total += arr[i]
    return total

# Example usage
import numpy as np

arr = np.arange(1000000)
print(parallel_sum_array(arr))
```

Output:

`499999500000`

In this example:

- `@njit(parallel=True)` tells Numba to compile the function with parallel execution.
- `prange(len(arr))` enables parallel iteration over the array.

### Example 3: Estimating Pi Using Monte Carlo Methods

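Parallelizing Monte Carlo methods for estimating pi can also lead to substantial performance improvements. The version below is a plain-Python serial baseline; every sample is independent of the others, which makes the loop a natural candidate for `prange`.

```python
import random

def calc_pi(N):
    M = 0  # Number of points that fall inside the unit circle
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            M += 1
    return 4 * M / N

# Define the number of iterations
N = 1000000

# Calculate and print the approximation of pi
pi_approx = calc_pi(N)
print(f"Approximation of pi after {N} iterations: {pi_approx}")
```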

Output (the exact value varies from run to run because the samples are random):

`Approximation of pi after 1000000 iterations: 3.142464`

### Example 4: Using `prange` for Explicit Parallelization

`prange` is a Numba-specific function that replaces the standard Python `range` function in parallelized loops. It is essential to use `prange` when parallelizing loops, as it informs Numba which loops to parallelize. For example, in the following code snippet, which multiplies a sparse matrix in CSR format by a vector, `prange` is used to parallelize the outer loop over rows:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def csrMult_numba(x, Adata, Aindices, Aindptr, Ashape):
    numRowsA = Ashape  # Ashape is passed as the number of rows
    Ax = np.zeros(numRowsA)
    for i in prange(numRowsA):
        Ax_i = 0.0
        # Entries of row i are stored at positions Aindptr[i]..Aindptr[i + 1] - 1
        for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]
        Ax[i] = Ax_i
    return Ax

# Example usage:
Adata = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Aindices = np.array([0, 2, 2, 0, 1], dtype=np.int32)
Aindptr = np.array([0, 2, 3, 5], dtype=np.int32)
Ashape = 3  # Number of rows

# Define a vector to multiply
x = np.array([1, 2, 3], dtype=np.float32)

# Perform the matrix-vector multiplication
result = csrMult_numba(x, Adata, Aindices, Aindptr, Ashape)
print(result)
```

Output:

`[ 7. 9. 14.]`
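To make the CSR layout concrete, the sketch below expands the same three arrays into a dense matrix and multiplies with NumPy, which should agree with the result above. The helper `csr_to_dense` is a hypothetical name introduced for illustration; it is not part of Numba or NumPy.

```python
import numpy as np

def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a dense 2D array (illustrative helper)."""
    n_rows = len(indptr) - 1
    dense = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        # indptr[i]..indptr[i + 1] delimits the stored entries of row i
        for k in range(indptr[i], indptr[i + 1]):
            dense[i, indices[k]] = data[k]
    return dense

Adata = np.array([1, 2, 3, 4, 5], dtype=np.float32)
Aindices = np.array([0, 2, 2, 0, 1], dtype=np.int32)
Aindptr = np.array([0, 2, 3, 5], dtype=np.int32)
x = np.array([1, 2, 3], dtype=np.float32)

A = csr_to_dense(Adata, Aindices, Aindptr, n_cols=3)
print(A)
print(A @ x)
```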

### Advanced Example: Parallelizing Matrix Multiplication

To illustrate a more complex use case, let’s parallelize a matrix multiplication operation.

```python
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def parallel_matrix_multiplication(A, B):
    n, m = A.shape
    p = B.shape[1]
    C = np.zeros((n, p))
    # Parallelize over rows of C; each iteration writes only to row i
    for i in prange(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Example usage
A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
C = parallel_matrix_multiplication(A, B)
print(C)
```

Output:

```
[[20.80764878 23.00057672 21.9369858  ... 22.41715703 23.0755662
  22.33375024]
 [21.03665146 24.0755907  22.25624691 ... 21.52803639 22.21485889
  20.41275549]
 [22.08134646 25.5358516  23.7381806  ... 24.65153569 26.01077343
  24.54440725]
 ...
 [20.45125475 24.54111658 22.26924075 ... 22.0734628  23.32851616
  21.40838884]
 [23.03796554 24.14278303 24.24539058 ... 24.092034   26.98564742
  24.086983  ]
 [24.26815164 26.91033613 25.56298534 ... 26.13709548 27.11784094
  26.00035639]]
```

In this example:

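- `parallel_matrix_multiplication` multiplies two matrices `A` and `B`.
- Numba parallelizes only the outermost `prange` loop, so the rows of `C` are distributed across threads while the inner loops run serially within each thread.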

## Measuring Performance Gains from Parallelization

To measure the performance gains from parallelization, you can use the `time` module or the `timeit` module.

```python
import time
import numpy as np
from numba import njit, prange

# Define the array to sum
arr = np.random.rand(1000000)  # Array of 1,000,000 random numbers

# Without parallelization
def sum_array(arr):
    return np.sum(arr)

# With parallelization using Numba
@njit(parallel=True)
def parallel_sum_array(arr):
    total = 0.0
    for i in prange(len(arr)):
        total += arr[i]
    return total

# Measure execution time without parallelization
start_time = time.time()
sum_result = sum_array(arr)
end_time = time.time()
print("Non-parallel execution time:", end_time - start_time)
print("Sum (Non-parallel):", sum_result)

# Measure execution time with parallelization
start_time = time.time()
parallel_sum_result = parallel_sum_array(arr)
end_time = time.time()
print("Parallel execution time:", end_time - start_time)
print("Sum (Parallel):", parallel_sum_result)
```

Output:

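```
Non-parallel execution time: 0.0016186237335205078
Sum (Non-parallel): 500147.43266961584
Parallel execution time: 1.089543104171753
Sum (Parallel): 500147.43266962166
```

In this run the parallel version is slower. The measured time covers the first call to `parallel_sum_array`, which includes Numba's one-time JIT compilation, and the baseline is NumPy's already optimized `np.sum`. The two sums also differ in the last digits because the floating-point additions are performed in a different order.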

## Best Practices for Parallelization

- **Use `prange` for Parallel Loops**: Always use `prange` instead of `range` for loops you want to parallelize.
- **Minimize Dependencies**: Ensure that loop iterations are independent of each other to maximize parallel efficiency.
- **Profile Your Code**: Use profiling tools to identify bottlenecks and verify that parallelization is improving performance.

## Conclusion

Parallelizing for loops with Numba is a powerful technique to accelerate Python code, especially for numerical computations. By combining the `@njit(parallel=True)` decorator with the `prange` function, you can distribute independent loop iterations across multiple CPU cores. When each iteration does enough work to outweigh the parallel overhead, this can lead to significant performance improvements, making Numba a valuable tool for high-performance Python programming.

jyotijb23
