Thursday, 15 April 2010

A comparison of IPC methods for Mac OSX

For a project I'm working on, I needed to transfer 6 600x300@30fps video feeds from one process to potentially 6 others. That's around 31MB/sec. It had to be fast, as other processing, compression and decompression were also required simultaneously.
A survey of various IPC techniques was in order and I was surprised I couldn't find any similar comparisons on the web.  I spent about a day running through various options and benchmarking my results in a not-so-rigorous-but-still-quite-useful way. I figured I'd post my results here in the hopes it saves someone a bit of time.


Testing Methodology


Plain and simple really. Send 1000 1024x768 frames from one process to another as fast as possible. At 3 bytes per pixel, that comes to about 2.36GB of data. Each test application fork()s, with one of the processes becoming the sender and the other the receiver. The parent process blocks on the child process until complete. I don't care so much that my Mac Mini is multi-core. I'm more interested in the "real" times returned by "time ./test" as these should roughly reflect real-world expected throughput.
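For anyone wanting to reproduce the setup, here is a rough sketch of the sizes involved and of the fork-and-wait skeleton. It assumes 3 bytes per pixel, and the names frame_bytes, total_bytes, run_pair and idle_role are made up for illustration; they are not from my test programs.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { FRAME_W = 1024, FRAME_H = 768, BYTES_PER_PIXEL = 3, FRAME_COUNT = 1000 };

/* Bytes in one uncompressed RGB test frame (assumes 3 bytes per pixel). */
static size_t frame_bytes(void) {
    return (size_t)FRAME_W * FRAME_H * BYTES_PER_PIXEL;
}

/* Bytes moved by one full run of FRAME_COUNT frames. */
static unsigned long long total_bytes(void) {
    return (unsigned long long)frame_bytes() * FRAME_COUNT;
}

typedef int (*role_fn)(void);

/* Placeholder role; a real test would send or receive frames here. */
static int idle_role(void) { return 0; }

/* Fork: the child runs `sender`, the parent runs `receiver` and then
   blocks until the child exits. Returns 0 only if both roles succeed. */
static int run_pair(role_fn sender, role_fn receiver) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(sender() == 0 ? 0 : 1);

    int rc = receiver();
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return rc;
}
```

Calling run_pair(idle_role, idle_role) from main() and running the binary under "time" gives the fork overhead on its own, which is negligible next to the transfer times below.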

Establishing a lower bound


How fast is realistic? I thought I better do a simple test. Use memcpy() to copy 1000 frames as fast as possible. I still fork and I copy in both processes to try to stay realistic (I presume copying on both sides will be necessary for most of these IPC methods.)
real 0m1.759s
user 0m2.951s
sys 0m0.078s

So any IPC that can get close to 3 seconds for user + system time combined is about as fast as I can expect to get. Anything close to 1.8 seconds real-time is looking pretty nice as well (as it is able to offload the sender/receiver roles to different CPU cores). Putting this into perspective by the way, 1000 x 1024 x 768 x 3 / 1.759 = 1.3GBytes/sec. We're not talking about small numbers here.
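A minimal sketch of that baseline, assuming plain malloc()ed buffers and one copy loop per process. The function names are hypothetical, and an optimising compiler may treat the loop differently from whatever my original test did, so take the shape rather than the exact numbers from it.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Copy `frames` frames of `frame_size` bytes between two heap buffers.
   Returns 0 on success, -1 on allocation failure or a bad copy. */
static int copy_frames(size_t frame_size, int frames) {
    unsigned char *src = malloc(frame_size);
    unsigned char *dst = malloc(frame_size);
    int rc = -1;
    if (src && dst) {
        memset(src, 0xAB, frame_size);
        memset(dst, 0, frame_size);
        for (int i = 0; i < frames; i++) {
            src[0] = (unsigned char)i;      /* touch the source each round */
            memcpy(dst, src, frame_size);
        }
        rc = (frames == 0 || memcmp(dst, src, frame_size) == 0) ? 0 : -1;
    }
    free(src);
    free(dst);
    return rc;
}

/* Fork and copy in both processes, as in the baseline described above. */
static int baseline(size_t frame_size, int frames) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(copy_frames(frame_size, frames) == 0 ? 0 : 1);

    int rc = copy_frames(frame_size, frames);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return (rc == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Throughput in bytes per second for `frames` frames moved in `seconds`. */
static double bytes_per_second(size_t frame_size, int frames, double seconds) {
    return (double)frame_size * frames / seconds;
}
```

Plugging the measured 1.759 seconds into bytes_per_second() for 1000 frames of 1024 x 768 x 3 bytes gives a bit over 1.3e9, matching the figure above.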

Message Queue


OSX supports the System V message queue service msgget(), but it's curiously missing from their documentation. I have a recollection of it being there last week, no less, so perhaps they have plans to remove it? In any case, it's one possible transport mechanism that I thought I'd give a try.

This IPC is fairly simple to use. Each queue is named using a key_t handle that can be obtained with a valid system path and ID, or by some pre-arranged integer. This is the same handle type used by shmget() and semget(). I'll get to those later on.

The System V message queue on OSX seems to use 8kb buffers to transfer messages from process to process. It doesn't seem engineered for large data streams like I will need but I thought I'd see how it travelled anyway.
real 1m59.474s
user 0m2.626s
sys 3m42.894s

Ouch. At a guess, it seems like context switching killed any chances this had of being useful. At 8kb per message, I need about 66 messages to transfer a single 600x300 frame of uncompressed RGB video, and 288 for each 1024x768 test frame. At two context switches per message, that's roughly 130 per 600x300 frame and 576 per test frame. Yuck. Next!
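To make the chunking concrete, here is a rough sketch of sending a frame through a msgget() queue in fixed-size pieces. It uses IPC_PRIVATE plus fork() instead of a key_t from a path, purely to keep the example short, and the 8192-byte MAX_CHUNK, the struct layout and all function names are assumptions rather than my original code. The largest message the system accepts varies, so the chunk size is a parameter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <unistd.h>

enum { MAX_CHUNK = 8192 };

struct chunk_msg {
    long mtype;                        /* must be greater than zero */
    unsigned char data[MAX_CHUNK];
};

/* Number of messages needed for one frame at a given chunk size. */
static size_t messages_per_frame(size_t frame_len, size_t chunk) {
    return (frame_len + chunk - 1) / chunk;
}

/* Send one frame as a series of messages of at most `chunk` bytes. */
static int send_frame(int qid, const unsigned char *frame, size_t len, size_t chunk) {
    struct chunk_msg m;
    m.mtype = 1;
    if (chunk == 0 || chunk > MAX_CHUNK) return -1;
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = len - off < chunk ? len - off : chunk;
        memcpy(m.data, frame + off, n);
        if (msgsnd(qid, &m, n, 0) < 0) return -1;   /* blocks while the queue is full */
    }
    return 0;
}

/* Receive messages until `len` bytes of the frame have arrived. */
static int recv_frame(int qid, unsigned char *frame, size_t len) {
    struct chunk_msg m;
    size_t off = 0;
    while (off < len) {
        ssize_t n = msgrcv(qid, &m, MAX_CHUNK, 0, 0);
        if (n < 0 || (size_t)n > len - off) return -1;
        memcpy(frame + off, m.data, (size_t)n);
        off += (size_t)n;
    }
    return 0;
}

/* Round trip: a forked child sends one frame, the parent receives and checks it. */
static int msgq_roundtrip(size_t len, size_t chunk) {
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    unsigned char *buf = malloc(len);
    pid_t pid = (qid >= 0 && buf) ? fork() : -1;
    int rc = -1;

    if (pid == 0) {                                  /* child: sender */
        for (size_t i = 0; i < len; i++) buf[i] = (unsigned char)(i % 251);
        _exit(send_frame(qid, buf, len, chunk) == 0 ? 0 : 1);
    }
    if (pid > 0) {
        rc = recv_frame(qid, buf, len);
        for (size_t i = 0; rc == 0 && i < len; i++)
            if (buf[i] != (unsigned char)(i % 251)) rc = -1;
    }
    if (qid >= 0) msgctl(qid, IPC_RMID, NULL);       /* queues outlive their processes */
    if (pid > 0) {
        int status = 0;
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) rc = -1;
    }
    free(buf);
    return rc;
}
```

Every msgsnd()/msgrcv() pair is a trip through the kernel, which is where the time goes: messages_per_frame() says 288 of them for a 1024x768 RGB frame at 8192 bytes each.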

Unix Sockets


On OSX, this also uses 8kb blocks it seems.
real 0m52.686s
user 0m0.361s
sys 0m9.306s

Not very impressive! I'm still lost as to how 10s of user + system time = 52 seconds of real time. I was watching top and nothing else was running. I ran it a few times and got similar results. Something's amiss here, but clearly this isn't very good. Skipped!
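For reference, a minimal sketch of a frame transfer over a Unix domain socket. It assumes a connected SOCK_STREAM pair from socketpair() shared across fork(), which may not be how my test opened its sockets; a named socket with bind() and connect() would behave the same once connected. The helper names are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Move exactly `len` bytes through `fd`; `writing` selects write() or read().
   Loops because the kernel hands data over in pieces smaller than a frame. */
static int xfer_all(int fd, unsigned char *buf, size_t len, int writing) {
    while (len > 0) {
        ssize_t n = writing ? write(fd, buf, len) : read(fd, buf, len);
        if (n < 0 && errno == EINTR) continue;
        if (n <= 0) return -1;          /* error, or peer closed mid-frame */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* A forked child sends `frames` frames of `len` bytes; the parent receives them. */
static int socket_roundtrip(size_t len, int frames) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    unsigned char *buf = malloc(len);
    pid_t pid = buf ? fork() : -1;

    if (pid == 0) {                     /* child: sender */
        int rc = 0;
        close(sv[0]);
        for (int f = 0; f < frames && rc == 0; f++) {
            memset(buf, f & 0xFF, len); /* stamp each frame with its index */
            rc = xfer_all(sv[1], buf, len, 1);
        }
        _exit(rc == 0 ? 0 : 1);
    }

    close(sv[1]);
    int rc = pid > 0 ? 0 : -1;
    for (int f = 0; f < frames && rc == 0; f++) {
        rc = xfer_all(sv[0], buf, len, 0);
        if (rc == 0 && (buf[0] != (unsigned char)f || buf[len - 1] != (unsigned char)f))
            rc = -1;
    }
    close(sv[0]);
    if (pid > 0) {
        int status = 0;
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) rc = -1;
    }
    free(buf);
    return rc;
}
```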

File-based FIFO


Simple IPC based on a call to mkfifo() and then opening for read access on one side and write access on the other. I had high hopes for this given how prevalent these sorts of IPCs are in the UNIX world. Unfortunately, the 8kb block size seems to rear its ugly head here as well. I expected as much, but I had hoped it would perform in a similar way to Windows in the case when a single write is too big for a FIFO's buffer. I'm pretty sure Windows blocks the writer thread, wakes the reader thread and copies the data directly into the read buffer of the reader thread. That's potentially only one context switch per write if my understanding is correct.

Alas, in the Mac world, no such direct access occurs. 8kb by 8kb, slow as she goes the data moves across. Still, it lacks all the priority features and other complexities that come with message queues, so it's no surprise that this is much, much faster than the first attempt, but it's still not fast enough.
real 0m6.157s
user 0m0.280s
sys 0m7.815s

Compared to 2 minutes, this is awesome. 162fps max though and I have 30x6 = 180 fps to push. I want some more speed if I can get it.
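Here is a rough sketch of the FIFO approach: mkfifo(), then a blocking open() for writing in one process and for reading in the other. The path under /tmp and the function names are placeholders for illustration only.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Move exactly `len` bytes through `fd`; `writing` selects write() or read(). */
static int xfer_all(int fd, unsigned char *buf, size_t len, int writing) {
    while (len > 0) {
        ssize_t n = writing ? write(fd, buf, len) : read(fd, buf, len);
        if (n < 0 && errno == EINTR) continue;
        if (n <= 0) return -1;                 /* error, or EOF mid-frame */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* A forked child writes `frames` frames into a FIFO; the parent reads them. */
static int fifo_roundtrip(size_t len, int frames) {
    char path[64];
    snprintf(path, sizeof path, "/tmp/fifo_demo_%ld", (long)getpid());
    unlink(path);                              /* clear any stale FIFO */
    if (mkfifo(path, 0600) < 0) return -1;

    unsigned char *buf = malloc(len);
    pid_t pid = buf ? fork() : -1;
    int rc = -1;

    if (pid == 0) {                            /* child: writer */
        int fd = open(path, O_WRONLY);         /* blocks until a reader opens */
        int wrc = fd < 0 ? -1 : 0;
        for (int f = 0; f < frames && wrc == 0; f++) {
            memset(buf, f & 0xFF, len);        /* stamp each frame with its index */
            wrc = xfer_all(fd, buf, len, 1);
        }
        _exit(wrc == 0 ? 0 : 1);
    }
    if (pid > 0) {
        int fd = open(path, O_RDONLY);         /* blocks until the writer opens */
        rc = fd < 0 ? -1 : 0;
        for (int f = 0; f < frames && rc == 0; f++) {
            rc = xfer_all(fd, buf, len, 0);
            if (rc == 0 && (buf[0] != (unsigned char)f || buf[len - 1] != (unsigned char)f))
                rc = -1;
        }
        if (fd >= 0) close(fd);
        int status = 0;
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) rc = -1;
    }
    unlink(path);
    free(buf);
    return rc;
}
```

Each read() returns at most what the FIFO's buffer holds, so the loop in xfer_all() runs many times per frame. That is the 8kb-by-8kb crawl described above.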

Regular Files


What the hell, this might work. Right?
real 0m5.673s
user 0m0.036s
sys 0m9.992s

Interesting. Slightly faster than FIFOs. In this case, I'm just calling lseek(fd, 0, SEEK_SET) and writing a frame on one side, then doing the same seek and a read on the other side. 5.6 actual seconds, but it looks like it was using more in-kernel time than FIFOs.
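The seek-then-write and seek-then-read pattern looks roughly like this. The sketch uses two descriptors on one temporary file inside a single process to stand in for the two processes, and it leaves out whatever tells the reader that a new frame is ready. The path template and function names are placeholders.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writer side: rewind, then write the whole frame over the previous one. */
static int put_frame(int fd, const unsigned char *frame, size_t len) {
    if (lseek(fd, 0, SEEK_SET) < 0) return -1;
    while (len > 0) {
        ssize_t n = write(fd, frame, len);
        if (n < 0 && errno == EINTR) continue;
        if (n <= 0) return -1;
        frame += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Reader side: rewind, then read the whole frame back. */
static int get_frame(int fd, unsigned char *frame, size_t len) {
    if (lseek(fd, 0, SEEK_SET) < 0) return -1;
    while (len > 0) {
        ssize_t n = read(fd, frame, len);
        if (n < 0 && errno == EINTR) continue;
        if (n <= 0) return -1;              /* error, or file shorter than a frame */
        frame += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write one frame through one descriptor and read it back through another. */
static int file_roundtrip(size_t len) {
    char path[] = "/tmp/frame_file_XXXXXX";
    int wfd = mkstemp(path);
    int rfd = wfd >= 0 ? open(path, O_RDONLY) : -1;
    unsigned char *out = malloc(len), *in = malloc(len);
    int rc = -1;

    if (rfd >= 0 && out && in) {
        memset(out, 0x42, len);
        memset(in, 0, len);
        if (put_frame(wfd, out, len) == 0 && get_frame(rfd, in, len) == 0)
            rc = memcmp(out, in, len) == 0 ? 0 : -1;
    }
    if (rfd >= 0) close(rfd);
    if (wfd >= 0) { close(wfd); unlink(path); }
    free(out);
    free(in);
    return rc;
}
```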

Memory-mapped files


Similar to regular files in that, well, I'm using a regular file. But this time I've mmap()ed it to an address in both processes. I guess this cuts out some of the copies and things that happen inside the OS because it seems to have a serious effect on the times I get.
real 0m3.579s
user 0m6.391s
sys 0m0.111s

Just shy of 280fps if I take that "real" number on face value. Mind you, this is likely chewing up both cores of my CPU. I suspect OS lock contention is way down as the OS maps the file to a page of memory and shares the page with both processes. Not bad though.
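A minimal sketch of the mapped-file version, assuming each process opens the same file and maps it with MAP_SHARED. To keep it short, the writer finishes before the reader looks, so there is no per-frame signalling here; a real stream needs that on top. The path and function names are placeholders.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map `len` bytes of the file at `path` (created if needed) as shared memory.
   Returns NULL on failure. */
static unsigned char *map_frame_file(const char *path, size_t len) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)len) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}

/* The child maps the file and writes a frame; the parent maps it and checks it. */
static int mmap_roundtrip(size_t len) {
    char path[64];
    snprintf(path, sizeof path, "/tmp/frame_map_%ld", (long)getpid());

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                     /* child: writer */
        unsigned char *w = map_frame_file(path, len);
        if (!w) _exit(1);
        memset(w, 0x5A, len);           /* "writing a frame" is just a store */
        munmap(w, len);
        _exit(0);
    }

    int status = 0, rc = -1;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        unsigned char *r = map_frame_file(path, len);
        if (r) {
            rc = (r[0] == 0x5A && r[len - 1] == 0x5A) ? 0 : -1;
            munmap(r, len);
        }
    }
    unlink(path);
    return rc;
}
```

Once the mapping exists, moving a frame involves no read() or write() calls at all, which fits the tiny "sys" time in the numbers above.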

Shared Memory


Under the BSD-based OSX kernel, shared memory is fixed at boot time. As far as I know, it's resident at all times and will never be paged to disk. The default Mac OSX setting is to allocate 1024 pages of memory - a mere 4MB - regardless of system memory size. (According to this and this)

You can override the defaults, but that's a pain for end users as it requires admin rights _and_ a system restart. So the first test here is a combination of 128kb shared memory segments synchronised with boost::interprocess semaphores. My plan was to transfer a frame of video over in blocks, just like the msgget() and FIFO techniques do, only instead of 8kb blocks, I'd use 128kb blocks.
real 0m1.622s
user 0m2.693s
sys 0m0.220s

Nice results! Half the time of memory mapped files. This beats my memcpy benchmarks with regular malloc()ed RAM. Weird! Maybe paging overheads are lower with shared mem?? Anyway, this was so nice that I implemented it in my app, only to be terribly disappointed when it led to very jerky video. Why? In order to uphold the producer-consumer usage semantics of FIFOs, I still need to block on writes until the reader is ready or, as I decided in my case, discard writes if there is no reader and block on reads for a shortish (30ms) fixed time before failing.

These semantics lead to some interesting effects when you have 6 of them running concurrently. The reader process for all 6 streams turned out to be in the same thread, so I got decent video that got jerkier as I added more streams as each stream was blocking for up to 1 frame, waiting on a read. Argh! I was getting great efficiency in terms of CPU usage but losing it all to locking.
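The block-wise hand-off can be sketched as below. Note the substitution: my test used boost::interprocess semaphores, whereas this sketch uses System V semaphores from semget() so that it stays in plain C. It implements the simple blocking producer-consumer version (write blocks until the reader has drained the previous block), not the discard-on-no-reader variant, and every name in it is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__)
/* Linux leaves this union for the caller to declare; OSX declares it in <sys/sem.h>. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };
#endif

enum { BLOCK = 128 * 1024, SEM_EMPTY = 0, SEM_FULL = 1 };

/* Add `delta` to semaphore `idx`; a negative delta blocks until it can be applied. */
static int sem_change(int semid, unsigned short idx, short delta) {
    struct sembuf op = { .sem_num = idx, .sem_op = delta, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Producer: wait for the block to be empty, fill it, mark it full. */
static int send_blocks(int semid, unsigned char *shm, const unsigned char *frame, size_t len) {
    for (size_t off = 0; off < len; off += BLOCK) {
        size_t n = len - off < BLOCK ? len - off : BLOCK;
        if (sem_change(semid, SEM_EMPTY, -1) < 0) return -1;
        memcpy(shm, frame + off, n);
        if (sem_change(semid, SEM_FULL, +1) < 0) return -1;
    }
    return 0;
}

/* Consumer: wait for the block to be full, drain it, mark it empty. */
static int recv_blocks(int semid, const unsigned char *shm, unsigned char *frame, size_t len) {
    for (size_t off = 0; off < len; off += BLOCK) {
        size_t n = len - off < BLOCK ? len - off : BLOCK;
        if (sem_change(semid, SEM_FULL, -1) < 0) return -1;
        memcpy(frame + off, shm, n);
        if (sem_change(semid, SEM_EMPTY, +1) < 0) return -1;
    }
    return 0;
}

/* A forked child sends one frame through a single 128kb segment. */
static int shm_roundtrip(size_t len) {
    int rc = -1;
    int shmid = shmget(IPC_PRIVATE, BLOCK, IPC_CREAT | 0600);
    int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    void *shm = shmid >= 0 ? shmat(shmid, NULL, 0) : (void *)-1;
    unsigned char *frame = calloc(1, len);
    union semun one = { .val = 1 }, zero = { .val = 0 };
    pid_t pid = -1;

    if (semid >= 0 && shm != (void *)-1 && frame &&
        semctl(semid, SEM_EMPTY, SETVAL, one) == 0 &&
        semctl(semid, SEM_FULL, SETVAL, zero) == 0)
        pid = fork();
    if (pid == 0) {                                   /* child: producer */
        memset(frame, 0x3C, len);
        _exit(send_blocks(semid, shm, frame, len) == 0 ? 0 : 1);
    }
    if (pid > 0) {
        rc = recv_blocks(semid, shm, frame, len);
        if (rc == 0 && (frame[0] != 0x3C || frame[len - 1] != 0x3C)) rc = -1;
    }
    /* System V objects outlive their processes, so remove them explicitly. */
    if (semid >= 0) semctl(semid, 0, IPC_RMID);
    if (shm != (void *)-1) shmdt(shm);
    if (shmid >= 0) shmctl(shmid, IPC_RMID, NULL);
    if (pid > 0) {
        int status = 0;
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) rc = -1;
    }
    free(frame);
    return rc;
}
```

The blocking wait on SEM_FULL in recv_blocks() is the spot that hurts when six streams are read from one thread: each stream can stall there while the others have data waiting.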

Shared Memory (with modified OS settings)


I tried growing my system shared memory and using one segment per frame. Sure enough, it worked great. Speed was awesome (I misplaced my initial numbers but 0.010 real time comes to mind). No more locking problems.

The problem with this approach is that a shared memory segment doesn't just die when the processes using it do. It persists until you delete it or reboot. This doesn't sit well with me. Not for an end-user app.

On top of this, there is only 4MB per machine by default. That's only a few frames of video. Not enough for all 6 streams. If some other app uses all the shared memory, we're screwed. If some app leaks shared memory over time, we're screwed.
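One way to soften the persistence problem, sketched here under the assumption that every process using the segment either inherits it through fork() or has attached before the removal call: mark the segment for removal with shmctl(IPC_RMID) straight after attaching. The memory stays usable until the last process detaches or exits, and then it is freed. The function name is made up, and this does nothing about the 4MB ceiling.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create and attach a private segment, then immediately mark it for removal.
   The memory stays usable until the last detach, after which it is freed. */
static void *attach_self_cleaning(size_t len) {
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) return NULL;
    void *p = shmat(id, NULL, 0);
    /* Mark for destruction whether or not the attach worked, so nothing leaks. */
    shmctl(id, IPC_RMID, NULL);
    return p == (void *)-1 ? NULL : p;
}
```

Segments that have already leaked can be listed with the ipcs command and removed with ipcrm.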

Some crazy ideas that didn't work


mmap() with MAP_ANON|MAP_SHARED gives you a block of shared memory with no file attached. Sounds perfect except that you can only seem to access it in child processes fork()ed from the creating process. Shame.. :(

mmap() with mlock() to pin the pages in RAM. This worked but gave no performance improvement.

Passing mmap() MAP_FIXED addresses between processes. Fail. I'm glad actually. This would be a gaping hole in overall OS security.

The Verdict


Shared Memory is clearly the fastest but it's a scarce system resource requiring careful management, so it's likely to fail occasionally "in the wild".

Memory mapped files are about half the speed of shared memory but are mapped to regular pages so they're free to use available system RAM, there are fewer limitations on how they can and can't be used and they provide a very simple, convenient API.

FIFOs worked OK but for balance of performance and convenience, mmap() is my final choice.