Cython: Understanding What The Html Annotation File Has To Say?
Solution 1:
As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced code to see what happens and whether it can or should be fixed or avoided.
Not every interaction with python means a (measurable) slowdown.
Let's take a look at this simplified function:
%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)] = 0.0
When we look into the produced code we see (I kept only the important parts):
__pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a));
__pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1);
__pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None);
PyObject_SetItem(((PyObject *)__pyx_v_a), ...);
So basically we get a new slice (which is a numpy array) and then use numpy's functionality (PyObject_SetItem) to set all elements to 0.0, which is C code under the hood.
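In plain Python terms, the generated C above does nothing more than the following sketch (Recorder is a hypothetical stand-in for a numpy array, used only to make the dispatch visible):

```python
# Sketch of PySlice_New + PyObject_SetItem: build a slice object,
# then dispatch to the target object's __setitem__.
class Recorder:
    """Hypothetical stand-in for a numpy array; records the slice it receives."""
    def __init__(self, n):
        self.data = [1.0] * n
        self.last_key = None

    def __len__(self):
        return len(self.data)

    def __setitem__(self, key, value):
        self.last_key = key
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self.data))
            for i in range(start, stop, step):
                self.data[i] = value

a = Recorder(5)
s = slice(0, len(a), None)       # what PySlice_New(__pyx_int_0, __pyx_t_2, Py_None) builds
type(a).__setitem__(a, s, 0.0)   # what PyObject_SetItem dispatches to
print(a.data)       # → [0.0, 0.0, 0.0, 0.0, 0.0]
print(a.last_key)   # → slice(0, 5, None)
```

For a real numpy array, that `__setitem__` is implemented in C, which is why the slice version is not as slow as "interacts with Python" might suggest.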
Let's take a look at a version with a hand-written for loop:
%%cython
cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
It still uses PyObject_Length (because of len) and bounds checking, but otherwise it is C code. When we compare times:
>>> import numpy as np
>>> a = np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop
>>> b = np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop
You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean it has more overhead for larger arrays.
Let's disable these checks:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
In the produced html we can see that a[i] gets as simple as it gets:
__pyx_t_3 = __pyx_v_i;
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
}
__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0).
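The macro can be mimicked in pure Python with the stdlib ctypes module (a sketch under the assumption of a contiguous buffer of C doubles; the variable names are mine, not Cython's):

```python
import ctypes

# The element address is the base pointer plus index times stride-in-bytes,
# i.e. (type)((char*)buf + i0 * s0).
n = 4
buf = (ctypes.c_double * n)(1.0, 1.0, 1.0, 1.0)
base = ctypes.addressof(buf)
stride = ctypes.sizeof(ctypes.c_double)   # 8 bytes: contiguous doubles

for i in range(n):
    # the analogue of *__Pyx_BufPtrStrided1d(double*, buf, i, stride) = 0.0
    ctypes.c_double.from_address(base + i * stride).value = 0.0

print(list(buf))   # → [0.0, 0.0, 0.0, 0.0]
```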
And now:
>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop
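For reference, the per-access logic that @cython.boundscheck(False) and @cython.wraparound(False) disable can be sketched in pure Python roughly like this (checked_store is a hypothetical helper, not Cython's actual code):

```python
def checked_store(buf, i, value):
    n = len(buf)
    if i < 0:             # wraparound(True): a[-1] means a[n-1]
        i += n
    if not 0 <= i < n:    # boundscheck(True): raise instead of writing out of bounds
        raise IndexError(i)
    buf[i] = value        # only now the raw store happens

data = [1.0, 1.0, 1.0]
checked_store(data, -1, 0.0)   # the negative index is wrapped
print(data)   # → [1.0, 1.0, 0.0]
```

Disabling the decorators removes the two if-branches from every iteration, which is exactly the difference visible in the generated C.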
We can improve it further by releasing the GIL in the for loop:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n = len(a)
    with nogil:
        for i in range(n):
            a[i] = 0.0
and now:
>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop
So it is somewhat faster, but you still cannot beat numpy for larger arrays.
In my opinion, there are two things to take from it:
- Cython will not transform slices into access through a for loop, so Python functionality must be used.
- There is a small overhead, but it is only for calling into the numpy functionality; most of the work is done in numpy code, and this cannot be sped up through Cython.
One last try, using the memset function:
%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a)*sizeof(np.double_t))
We get:
>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop
It is also as fast as the numpy-code for large arrays.
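The stdlib ctypes module exposes the same C memset, so the idea can be tried without Cython (a sketch; it relies on the fact that IEEE-754 0.0 is represented by all-zero bytes, which is why a byte-wise zeroing of doubles is valid here):

```python
import ctypes

# Zero n doubles in one shot by setting n * sizeof(double) bytes to 0,
# mirroring memset(&a[0], 0, len(a)*sizeof(np.double_t)).
n = 6
buf = (ctypes.c_double * n)(*([1.0] * n))
ctypes.memset(ctypes.addressof(buf), 0, n * ctypes.sizeof(ctypes.c_double))
print(list(buf))   # → [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Note that this trick only works for the value 0.0; setting all elements to, say, 1.0 cannot be expressed as a single byte pattern.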
As DavidW suggested, one could try to use memory-views:
%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)] = 0.0
This leads to slightly faster code for small arrays and similarly fast code for large arrays (compared to numpy slices):
>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop
>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop
That means that memory-view slices have less overhead than numpy slices. Here is the produced code:
__pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a);
__pyx_t_2.data = __pyx_v_a.data;
__pyx_t_2.memview = __pyx_v_a.memview;
__PYX_INC_MEMVIEW(&__pyx_t_2, 0);
__pyx_t_3 = -1;
if (unlikely(__pyx_memoryview_slice_memviewslice(
&__pyx_t_2,
__pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
0,
0,
&__pyx_t_3,
0,
__pyx_t_1,
0,
1,
1,
0,
1) < 0))
{
__PYX_ERR(0, 27, __pyx_L1_error)
}
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
Py_ssize_t __pyx_temp_idx;
double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
}
}
}
__PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
__pyx_t_2.memview = NULL;
__pyx_t_2.data = NULL;
I think the most important part is that this code doesn't create an additional temporary object; it reuses the existing memory view for the slice.
My compiler produces (at least for my machine) slightly faster code if memory views are used. I'm not sure whether it is worth an investigation. At first sight, the difference in every iteration step is:
# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;

# created code for memview-for-loop:
__pyx_v_i = __pyx_t_3;
__pyx_t_4 = __pyx_v_i;
*((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.
As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version using slices is about 20% faster on my machine. The difference is that at compile time it is known, for the double[::1] version, that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
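The distinction can be illustrated with a stdlib memoryview (a sketch: a C-contiguous view is the analogue of double[::1], while a strided slice is what double[:] must be prepared for):

```python
import array

# A contiguous buffer of doubles: unit stride, known layout.
buf = array.array('d', [0.0] * 8)
m = memoryview(buf)
print(m.c_contiguous, m.strides)   # → True (8,)   like double[::1]

# Every other element: the view is no longer contiguous, so the
# stride (16 bytes here) must be carried around at runtime.
s = m[::2]
print(s.c_contiguous, s.strides)   # → False (16,)  like double[:]
```

With double[::1], Cython can emit the pointer-bump loop shown above; with double[:] it has to multiply by the runtime stride on every access.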