Faster Bit-level Data Packing
Solution 1:
Down to 130 ms from 200 ms by just wrapping the loop in a function
def packer0(imd):
    """same loop in a def"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
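To make the arithmetic concrete, here is a minimal pure-Python sketch of what the loop computes for a handful of values. The function name `pack_nibbles` and the sample input are made up for illustration; the expression itself is the one from `packer0`.

```python
def pack_nibbles(imd):
    """Keep the top 4 bits of each 8-bit value and pack two of them per byte."""
    packed = []
    for b in range(0, len(imd), 2):
        # even index -> high nibble, odd index -> low nibble
        packed.append((imd[b] // 16) << 4 | (imd[b + 1] // 16))
    return packed

# Hypothetical sample: four 8-bit values become two packed bytes.
sample = [0x12, 0xAB, 0xFF, 0x00]
result = pack_nibbles(sample)
print([hex(v) for v in result])  # ['0x1a', '0xf0']
```

So 0x12 and 0xAB contribute their high nibbles 0x1 and 0xA, which end up together in the single byte 0x1A.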
Down to 35 ms by Cythonizing the same code
def packer1(imd):
    """Cythonize python nibble packing loop"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
Down to 16 ms with a typed loop variable
def packer2(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int b
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed
Not much of a difference with a "simplified" loop
def packer3(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int i
    # integer division, so the bound stays an int under Python 3 semantics
    for i in range(256*64//2):
        packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
    return packed
A list comprehension is maybe even a tiny bit faster (15 ms)
def packer4(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
Here it is with timeit (the first figure is the total for 100 calls, i.e. about 13 ms per call):
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
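For reference, the same kind of measurement can be scripted with `timeit.repeat` from the standard library. The sketch below assumes the compiled `pack` module is not available and times a pure-Python stand-in instead (`packer_py` is a hypothetical name), so the number it prints is not comparable with the Cython figures above.

```python
import timeit

def packer_py(it):
    """Pure-Python stand-in for packer4 (assumption: no compiled module)."""
    n = len(it) // 2
    return [(it[i * 2] // 16) << 4 | it[i * 2 + 1] // 16 for i in range(n)]

data = [0] * 256 * 64

# repeat() returns, for each of the 3 runs, the total time of `number` calls
runs = timeit.repeat(lambda: packer_py(data), number=10, repeat=3)
best_ms = min(runs) / 10 * 1000
print(f"best of 3: {best_ms:.2f} msec per loop")
```

Taking the minimum of the runs mirrors what `python3 -m timeit` reports as "best of 3".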
This already meets my requirements, but I guess further optimization may be possible with typed input/output containers (e.g. an unsigned int array) or by reading the input data with a wider data type (Raspbian is 32-bit; the BCM2835 is a single-core ARM1176JZF-S).
Parallelism on the GPU or on the multi-core Raspberry Pis would be another option.
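One way to explore the input/output idea without leaving Python is to keep the data in `bytes` and let slicing and `zip` do the pairing. This is only a sketch under that assumption and has not been measured on the Pi; `pack_bytes` is a hypothetical name. It masks the high nibble instead of dividing, which gives the same result for values in 0..255.

```python
def pack_bytes(data):
    """Pack the high nibbles of consecutive byte pairs into one byte each."""
    hi = data[0::2]  # bytes at even offsets -> high nibble
    lo = data[1::2]  # bytes at odd offsets  -> low nibble
    return bytes((h & 0xF0) | (l >> 4) for h, l in zip(hi, lo))

data = bytes([0x12, 0xAB, 0xFF, 0x00])
packed = pack_bytes(data)
print(packed.hex())  # 1af0
```

Returning `bytes` rather than a list of Python ints also gives an output that can be written to a file or device without another conversion.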
A crude comparison with the same loop in C (ideone):
#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
    uint8_t in[SIZE] = {0};
    uint8_t out[SIZE/2] = {0};
    uint8_t t;
    /* repeat the packing pass 100 times, like timeit with number=100 */
    for(t=0; t<100; t++){
        uint16_t i;
        /* one output byte per pair of input bytes */
        for(i=0; i<SIZE/2; i++){
            out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
        }
    }
    return 0;
}
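It's apparently more than 10 times faster: the 100 passes take about 85 ms in total, i.e. under 1 ms per pass, including process start-up and without any compiler optimization flags: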
pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out
real 0m0.085s
user 0m0.060s
sys 0m0.010s
Replacing the divisions with masks and shifts may be another slight optimization (I have not checked the resulting C, nor the binary):
def packs(bytes it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
This results in the following timings (the `pack` function timed alongside it is not shown here):
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
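The mask-and-shift form relies on the inputs being 8-bit values. A quick exhaustive check in plain Python (a sanity test written for illustration, not part of the original benchmark; the helper names are hypothetical) suggests that both expressions agree for every pair of bytes:

```python
def pack_div(a, b):
    """Division-based form used in the earlier packers."""
    return (a // 16) << 4 | (b // 16)

def pack_mask(a, b):
    """Mask-and-shift form used in packs."""
    return (a & 0xF0) | (b >> 4)

# Compare the two forms over all 256 * 256 byte pairs.
mismatches = [(a, b) for a in range(256) for b in range(256)
              if pack_div(a, b) != pack_mask(a, b)]
print(len(mismatches))  # 0
```

For wider inputs the two forms are no longer interchangeable: with a = 0x100, the division form yields 0x100 while the mask form yields 0.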