Because of the original market for the VideoCore GPU (a multimedia coprocessor, not an apps processor), EmptyThisBuffer and FillBufferDone include transferring (copying) the buffer contents from the host processor (ARM) to GPU memory or back again. On the co-pro this meant sending the data off chip, and it is all handled by ILCS (the IL Component Service). Video decode and display, or camera to encode, use cases always kept the image data on the GPU by setting up IL tunnels.
I wasn't expecting that to take as much time as you are observing, but it will be a modest chunk for the ~3MB of a 1080P YUV4:2:0 frame.
The main aim with IL is to form a pipeline. Using GStreamer you're taking one IL component in isolation and wrapping GST over the top. Create a pure IL pipeline of video_decode (MPEG2 to YUV) -> video_encode (YUV to H264) with tunnels and you've a chance of it working.
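A minimal sketch of what that tunnelled pipeline looks like in raw IL, assuming the Broadcom component names and the usual Pi port numbers (video_decode: 130 in / 131 out, video_encode: 200 in / 201 out); all error handling, callbacks and port/state configuration are elided:

```c
#include <IL/OMX_Core.h>
#include <IL/OMX_Component.h>

/* Real code must supply EventHandler/EmptyBufferDone/FillBufferDone. */
static OMX_CALLBACKTYPE callbacks;

int main(void)
{
    OMX_HANDLETYPE decode, encode;

    OMX_Init();

    /* Broadcom component names as used on the Pi */
    OMX_GetHandle(&decode, "OMX.broadcom.video_decode", NULL, &callbacks);
    OMX_GetHandle(&encode, "OMX.broadcom.video_encode", NULL, &callbacks);

    /* Tunnel decoder output (port 131) straight into encoder input
     * (port 200) so the YUV frames never leave GPU memory. */
    OMX_SetupTunnel(decode, 131, encode, 200);

    /* ...set the input format to MPEG2 on port 130 and H264 on port 201,
     * enable the ports, move both components through Idle to Executing,
     * then feed compressed data with OMX_EmptyThisBuffer on port 130 and
     * collect H264 with OMX_FillThisBuffer on port 201... */
    return 0;
}
```

The point is that only the compressed bitstreams cross the ARM/GPU boundary; the ~3MB raw frames stay inside the tunnel.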
My main hesitation would be that the codec block is only specified to achieve 1080P30 encode with about 10% headroom, and it is the same block shared with decode. Overclocking may get you enough bandwidth to handle both the encode and decode simultaneously, but I just don't know.
You don't say how you were doing the decode in GStreamer, nor where you are dealing with deinterlacing your 1080i images, so I'm having to make guesses here. How much are you stressing the ARM cores?
You also seem to be missing that IL is all about being a pipeline. It supports multiple buffers on each port. As long as the EmptyThisBuffer call takes less than a frame time, then you may be increasing latency, but you can still achieve full frame rate given appropriate pipelining.
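In code terms, that means keeping the port's full complement of buffers in flight rather than submitting one and waiting for it to come back. A rough sketch (port already configured and enabled; `fill_with_next_frame` is a hypothetical application function, not part of the IL API):

```c
#include <IL/OMX_Core.h>
#include <IL/OMX_Component.h>

#define NUM_BUFFERS 4  /* should match nBufferCountActual on the port */

/* Hypothetical app function: copies the next raw frame into buf,
 * returns 0 at end of stream. */
extern int fill_with_next_frame(OMX_BUFFERHEADERTYPE *buf);

/* EmptyBufferDone hands each buffer back as the component finishes
 * with it; refill and resubmit to keep the pipeline primed. */
static OMX_ERRORTYPE empty_buffer_done(OMX_HANDLETYPE comp,
                                       OMX_PTR app_data,
                                       OMX_BUFFERHEADERTYPE *buf)
{
    if (fill_with_next_frame(buf))
        OMX_EmptyThisBuffer(comp, buf);
    return OMX_ErrorNone;
}

/* Submit every allocated buffer up front so the component always has
 * work queued while the ARM prepares the next frame. */
void prime_input_port(OMX_HANDLETYPE comp,
                      OMX_BUFFERHEADERTYPE *bufs[NUM_BUFFERS])
{
    for (int i = 0; i < NUM_BUFFERS; i++) {
        fill_with_next_frame(bufs[i]);
        OMX_EmptyThisBuffer(comp, bufs[i]);
    }
}
```

With several buffers queued, the per-buffer copy overlaps with the GPU processing the previous frames instead of serialising with it.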
To elaborate, if you're encoding a YUV420PackedPlanar frame to H264 on the Pi, then the frame has to get to the GPU (ARM memcpy), be converted to an internal format (VPU), go through motion estimation (CME and FME), encoding (ENC) and entropy coding, and the resulting data has to be transferred back to the ARM (ARM memcpy). With the exception of the two copies, all of those run on different bits of silicon. So each stage could take 33ms, and whilst you'd end up with an encoding latency of 6 × 33 ≈ 200ms, it would still achieve 30fps.
If you want to get this working on the Pi with software decode of the video, then I'd recommend you look at MMAL instead of IL. MMAL was written because IL was such a pain to work with, and because things were shifting to the apps processor architecture and shared memory. There you can allocate a zero-copy buffer (i.e. GPU memory) and fill it with your image data to avoid copying full frame buffers around.
Alternatively you can again set up mmal_connections to have a complete pipeline on the GPU. There may even be a couple more tricks that can be pulled to further minimise image format conversions and memory bandwidth if really needed to squeeze every last drop of performance from the system.
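A rough MMAL equivalent of the tunnelled pipeline, assuming the standard userland headers; format setup, callbacks and error handling are elided, and whether zero-copy is needed on a given port depends on your use case:

```c
#include <interface/mmal/mmal.h>
#include <interface/mmal/util/mmal_connection.h>
#include <interface/mmal/util/mmal_default_components.h>
#include <interface/mmal/util/mmal_util_params.h>

int main(void)
{
    MMAL_COMPONENT_T *decode, *encode;
    MMAL_CONNECTION_T *conn;

    mmal_component_create(MMAL_COMPONENT_DEFAULT_VIDEO_DECODER, &decode);
    mmal_component_create(MMAL_COMPONENT_DEFAULT_VIDEO_ENCODER, &encode);

    /* Zero-copy keeps buffer payloads in GPU memory; the ARM only
     * passes handles around, so no full-frame memcpy on either side. */
    mmal_port_parameter_set_boolean(decode->input[0],
                                    MMAL_PARAMETER_ZERO_COPY, MMAL_TRUE);
    mmal_port_parameter_set_boolean(encode->output[0],
                                    MMAL_PARAMETER_ZERO_COPY, MMAL_TRUE);

    /* Tunnelled connection: decoded YUV goes straight to the encoder
     * without ever surfacing on the ARM. */
    mmal_connection_create(&conn, decode->output[0], encode->input[0],
                           MMAL_CONNECTION_FLAG_TUNNELLING);
    mmal_connection_enable(conn);

    /* ...configure formats, enable the remaining ports, and feed
     * compressed data to decode->input[0]... */
    return 0;
}
```

If you're doing software decode instead, drop the decode component and feed your zero-copy YUV buffers straight into the encoder's input port.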
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.