
Cython: Understanding What The Html Annotation File Has To Say?

After compiling the following Cython code, I get an html annotation file for it:

import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t, ndim = 1] arr

Solution 1:

As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced code to see what happens there and whether it can or should be fixed or avoided.

Not every interaction with python means a (measurable) slowdown.
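For reference, the annotated html file itself is produced by passing Cython's -a/--annotate switch; a minimal sketch (my_module.pyx is just a placeholder name):

# in a Jupyter/IPython session, as in the cells below:
#   %%cython -a
# from the command line (this writes my_module.html next to the source):
#   cython -a my_module.pyx
# or in a build script:
from Cython.Build import cythonize
extensions = cythonize("my_module.pyx", annotate=True)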

Let's take a look at this simplified function:

%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)] = 0.0

When we look into the produced code we see (I kept only the important parts):

__pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a)); 
  __pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1); 
  __pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None); 
  PyObject_SetItem(((PyObject *)__pyx_v_a)

So basically we get a new slice (which is a numpy-array) and then use numpy's functionality (PyObject_SetItem) to set all elements to 0.0, which is C-code under the hood.
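In plain Python terms, those three generated calls correspond roughly to the following (a sketch of the semantics, not the literal generated code):

n = len(a)          # PyObject_Length
s = slice(0, n)     # PySlice_New
a[s] = 0.0          # PyObject_SetItem, dispatching to numpy's C implementation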

Let's take a look at a version with a hand-written for loop:

%%cython
cimport numpy as np

def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0

It still uses PyObject_Length (because of the call to len) and bounds checking, but otherwise it is C code. When we compare times:

>>> import numpy as np
>>> a = np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop

>>> b = np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop

You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean that it has more overhead for large arrays.
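As a side note, the PyObject_Length call can be avoided altogether by reading the buffer's shape field, which Cython compiles to a plain C struct access; a minimal sketch (use_for_shape is my naming, not from the original answer):

%%cython
cimport numpy as np

def use_for_shape(np.ndarray[np.double_t] a):
    cdef Py_ssize_t i
    # a.shape[0] is a C-level field access on the buffer, so no
    # Python len() call is generated for it
    cdef Py_ssize_t n = a.shape[0]
    for i in range(n):
        a[i] = 0.0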

Let's disable these checks:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0

In the produced html we can see that the access a[i] becomes as simple as it gets:

__pyx_t_3 = __pyx_v_i;
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
  }

__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0), i.e. plain strided pointer arithmetic: for a double array with an 8-byte stride, element i lives i * 8 bytes past the start of the buffer. And now:

>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop

We can improve it further by releasing the GIL in the for-loop:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n = len(a)
    with nogil:
        for i in range(n):
            a[i] = 0.0

and now:

>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop

So it is somewhat faster, but you still cannot beat numpy for larger arrays.
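For reference, the pure-numpy baseline these numbers are measured against is just a plain slice assignment; a sketch (no timings given here, since they vary from machine to machine):

>>> import numpy as np
>>> a = np.ones((500,))
>>> b = np.ones((250000,))
>>> %timeit a[:] = 0.0   # the fill loop runs entirely inside numpy's C code
>>> %timeit b[:] = 0.0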

In my opinion, there are two things to take from it:

  1. Cython will not transform slice assignments into an access through a for-loop, and thus Python functionality must be used.
  2. There is a small overhead, but it is only for the call into the numpy functionality; most of the work is done in numpy's C code, and that part cannot be sped up through Cython.

One last try, using the memset function:

%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a) * sizeof(np.double_t))

We get:

>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop

It is also as fast as the numpy-code for large arrays.
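One caveat: memset writes raw zero bytes, which matches 0.0 only because the IEEE-754 representation of 0.0 is all-zero bytes, and it assumes a contiguous, non-empty buffer. A defensive variant (use_memset_checked is my naming, not from the original answer):

%%cython
from libc.string cimport memset
cimport numpy as np

def use_memset_checked(np.ndarray[np.double_t] a):
    # memset is only valid for a contiguous buffer and for fill
    # values whose byte pattern is all zeros, such as 0.0
    if not a.flags['C_CONTIGUOUS']:
        raise ValueError("expected a C-contiguous array")
    if a.shape[0] > 0:
        memset(&a[0], 0, a.shape[0] * sizeof(np.double_t))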


As DavidW suggested, one could try to use memory-views:

%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)] = 0.0

leads to slightly faster code for small arrays and similarly fast code for large arrays (compared to the numpy-slices):

>>>%timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop

>>>%timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop

That means that the memory-view slices have less overhead than the numpy-slices. Here is the produced code:

__pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a); 
  __pyx_t_2.data = __pyx_v_a.data;
  __pyx_t_2.memview = __pyx_v_a.memview;
  __PYX_INC_MEMVIEW(&__pyx_t_2, 0);
  __pyx_t_3 = -1;
  if (unlikely(__pyx_memoryview_slice_memviewslice(
    &__pyx_t_2,
    __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
    0,
    0,
    &__pyx_t_3,
    0,
    __pyx_t_1,
    0,
    1,
    1,
    0,
    1) < 0))
{
    __PYX_ERR(0, 27, __pyx_L1_error)
}

{
      double __pyx_temp_scalar = 0.0;
      {
          Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
          Py_ssize_t __pyx_temp_idx;
          double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
          for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
            *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
            __pyx_temp_pointer += 1;
          }
      }
  }
  __PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
  __pyx_t_2.memview = NULL;
  __pyx_t_2.data = NULL;

I think this is the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.

My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether it is worth an investigation. At first sight, the difference in every iteration step is:

# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;

# created code for memview-for-loop:
__pyx_v_i = __pyx_t_3;
__pyx_t_4 = __pyx_v_i;
*((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;

I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.


As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version using slices is about 20% faster on my machine. The difference is that at compile time it is known for the double[::1] version that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
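For completeness, the strided signature from that comparison would look like this (a sketch; use_slices_memview_strided is my naming, not from the original answer):

%%cython
def use_slices_memview_strided(double[:] a):
    # the stride is only known at runtime, so the generated fill
    # loop has to advance the pointer by a runtime stride value
    a[0:len(a)] = 0.0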
