I’m a software developer, data scientist, artist and technology writer.
If you have a challenging technical problem or would just like some advice, then send me an email or message me on Twitter! My virtual door is always open.
Oh – and if you want to learn how a computer works or how to use the Rust programming language, then consider buying my book Rust in Action.
Serialization is a process that transforms data structures into a sequence of bytes. It can be an expensive operation because your program needs to collect values that are stored internally as references and transform every data type into a common encoding.
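As a concrete illustration of that collection step, here is a sketch in Python (the names are just for this example): a structure that holds the same list through two references is flattened into one byte sequence, and the shared reference becomes two independent copies.

```python
import json

# An in-memory structure: the same list object is referenced twice.
shared = [1, 2, 3]
data = {"first": shared, "second": shared}

# Serializing flattens everything into one sequence of bytes.
# The shared reference is written out twice, as two separate copies.
encoded = json.dumps(data).encode("utf-8")

print(encoded)  # b'{"first": [1, 2, 3], "second": [1, 2, 3]}'
```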
If you’ve been learning about what happens when you send data over the network or write it to disk, you will probably have encountered the term “serialization”. Huh? To reduce confusion (perhaps?), other people have introduced other terms, like “marshalling”.
The best place to figure out what “serialization” means is the word itself. “In serial” means one after another. This contrasts with terms such as “in tandem” or “in parallel”. So this implies that when writing objects to disk, we need to put bytes one after another.
But, you may ask, aren’t bytes in memory stored one after another anyway? Yes and no. In a sense, this is true. RAM is shaped the same way disks are: memory addresses start at 0, go up to 2^64 - 1, and form a sequence. But in practice, data that is stored in memory isn’t laid out as one neat, serial sequence.
There are a few reasons why objects are stored in a non-serial fashion. This post covers three of them: the references needed to manage growable structures, the depth-first write order that nesting imposes, and the gap between a type’s in-memory representation and its serialized one.
Whenever you encounter a data structure that can grow, such as a list, you will be implicitly dealing with something that manages references. They’re needed because when data structures grow, they sometimes need to be moved around in memory.
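You can watch that management happening indirectly in CPython (the numbers below are an implementation detail, not a language guarantee): the buffer behind a list is over-allocated and reallocated in occasional jumps as the list grows, which is why the runtime has to track it by reference.

```python
import sys

xs = []
sizes = []
for i in range(32):
    xs.append(i)
    sizes.append(sys.getsizeof(xs))

# The reported size only jumps at certain appends: the runtime
# over-allocates, and may move the buffer when it reallocates.
print(sorted(set(sizes)))
```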
Once a list has been written to disk, it’s stuck. That means that the most deeply nested objects must be written in full first. If you consider the following JSON object, we can’t finish writing b to disk without first writing b1 and b2:
{
  "a": [1.0, 1.1, 1.2],
  "b": {
    "b1": [10.0, 11.0, 12.0],
    "b2": [20.0, 21.0, 22.0]
  }
}
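A toy serializer, written here as a sketch rather than as a description of how any real JSON library works internally, makes the constraint visible: the closing brace of b cannot be emitted until b1 and b2 have been written in full.

```python
def emit(value):
    # Depth-first: a parent's closing bracket can only be
    # written after every nested child is written in full.
    if isinstance(value, dict):
        pairs = ", ".join('"%s": %s' % (k, emit(v)) for k, v in value.items())
        return "{" + pairs + "}"
    if isinstance(value, list):
        return "[" + ", ".join(emit(v) for v in value) + "]"
    return repr(value)

doc = {"a": [1.0, 1.1, 1.2], "b": {"b1": [10.0], "b2": [20.0]}}
print(emit(doc))  # {"a": [1.0, 1.1, 1.2], "b": {"b1": [10.0], "b2": [20.0]}}
```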
Sidebar: some serialization formats, notably XML and YAML, allow references within the document that they’re writing. They’re rarely used, but available.
In a sense, live data structures operate in tandem. Each list occupies its own portion of the address space and each can be modified without interfering with the others.
Every data type has its own mapping between a sequence of zeros and ones and the values being represented. Human-readable formats use a different mapping again. Translating between the two systems takes engineering effort.
If you are interested in this translation process, here are some more details.
The integer 42 is two bytes long when represented as two numerals in the UTF-8 encoding. UTF-8 is what JSON uses, so there’s a good chance that this is accurate in your case. But your CPU will probably use 8 bytes (64 bits) to represent that number. They differ because the CPU wants to use the same amount of space for every integer. So, while 42 in base 2 is 101010, your computer adds 58 leading zeros.
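Python’s struct module makes the two representations easy to compare (little-endian byte order is assumed here for the native form):

```python
import struct

text = "42".encode("utf-8")     # two numerals -> two bytes
native = struct.pack("<q", 42)  # one 64-bit little-endian integer

print(len(text))    # 2
print(len(native))  # 8
print(native)       # b'*\x00\x00\x00\x00\x00\x00\x00' (byte 42 is ASCII '*')
print(bin(42))      # 0b101010
```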
Consider the number 42 stored as variable answer. Here’s an example from a hypothetical programming language:
fn meaning_of_life() {
    var answer = 42;
    return answer;
}
answer is an integer, and so should be easy to store compactly. But perhaps there might be a rule in the language that says that when answer leaves the scope, it should be deleted — unless there is a live reference to answer. To enable this rule, the programming language needs to keep track of references to answer.
So, instead of storing just 42, the computer might be storing something like this:
structure Variable {
    value: int,
    number_of_references: int,
}
In memory, the Variable structure needs space for two integers, even though only one holds the thing we actually care about.
It gets worse. Most programming languages store more than integers, so they might need much more internal machinery. Here is a fuller example of what a dynamic language might need to do:
structure Variable {
    value: Value,
    type: DataType,
    number_of_references: int,
}

structure Value {
    address_in_memory: int,
    length_in_bytes: int,
}

enumeration DataType {
    Integer, Float, String
}
Few of these details are necessary in a serialized form. We just care about 42.
Computer programmers love to introduce jargon. Don’t be afraid to ask people what terms mean — they’re normally metaphors that were useful at the time.
tl;dr Can’t be done directly. You have two options: a) mock async I/O with threads, or b) redirect STDIN, STDOUT & STDERR to other file handles that support overlapping (aka non-blocking/async) I/O, such as named pipes.
tokio-rs 0.1 is out! Yay! This is great news for networking. I wonder what life is like for file I/O?
Its getting started examples use an echo server, but I really wanted to learn how to create an efficient worker to fit with the Hadoop Streaming API (among other use cases). That means reading from STDIN and writing to STDOUT. It turns out that tokio doesn’t support non-blocking I/O for stdio.
Turns out, others have looked into this. As it happens, upstream progress on async stdio has stalled pending more research into how I/O completion ports work. It seems that Windows being different makes life difficult.
This is going to require more research into how STDIN & co works w/ IOCP. I will tentatively assign this to the 1.0 milestone, but will potentially have to punt if it is tricky.
— carllerche, Dec 2015
Is asynchronous console I/O on Windows even achievable? A fairly large number of projects don’t seem to think so. Here is a quote from 2008 that is a fairly telling portent:
Development of the library Boost.Process stopped two years ago. One of the biggest outstanding issues is adding support for asynchronous I/O to stdin/stdout/stderr.
— Boris, asio C++ mailing list
Let’s work our way through the MSDN documentation to figure the situation out. To start, let’s clear up a few terms so that we all know what we’re talking about.
There are some significant differences between the UNIXish multiverse and Windows family when it comes to networking I/O. As well as differing APIs, there is also differing terminology.
Unlike calling select or poll on a single file descriptor, Windows offers you the ability to wrap a file handle in an I/O completion port (IOCP). The file handle and the completion port are independent, but linked. The port takes care of dealing with the file itself.
Its proponents believe (with good reason) that the completion port model is a good one for supporting interleaved reads and writes across multiple threads without blocking.
Some notes on Windows terminology differences:

- UNIX’s file descriptors are known as file handles
- STDIN, STDOUT and STDERR appear as CONIN$, CONOUT$ and CONERR$ within Windows documentation

With all of this in mind, creating an IOCP looks like this under the covers:
HANDLE WINAPI CreateIoCompletionPort(
_In_ HANDLE FileHandle,
_In_opt_ HANDLE ExistingCompletionPort,
_In_ ULONG_PTR CompletionKey,
_In_ DWORD NumberOfConcurrentThreads
);
The important parameter is FileHandle, an object created by CreateFile. That handle must support overlapped I/O. Here is the relevant extract from the CreateIoCompletionPort reference:
The handle passed in the FileHandle parameter can be any handle that supports overlapped I/O. Most commonly, this is a handle opened by the CreateFile function using the FILE_FLAG_OVERLAPPED flag (for example, files, mail slots, and pipes). Objects created by other functions such as socket can also be associated with an I/O completion port. For an example using sockets, see AcceptEx. A handle can be associated with only one I/O completion port, and after the association is made, the handle remains associated with that I/O completion port until it is closed.

— “CreateIoCompletionPort function”, MSDN
This raises an important question: do the file handles for CONIN$, CONOUT$ & CONERR$ support FILE_FLAG_OVERLAPPED? We need to look at the documentation for CreateFile to see.
After some browsing, one comes across the section on async I/O describing how to provide the flag: we supply it within the dwFlagsAndAttributes parameter.
Synchronous and Asynchronous I/O Handles
CreateFile provides for creating a file or device handle that is either synchronous or asynchronous. A synchronous handle behaves such that I/O function calls using that handle are blocked until they complete, while an asynchronous file handle makes it possible for the system to return immediately from I/O function calls, whether they completed the I/O operation or not. As stated previously, this synchronous versus asynchronous behavior is determined by specifying FILE_FLAG_OVERLAPPED within the dwFlagsAndAttributes parameter. There are several complexities and potential pitfalls when using asynchronous I/O; for more information, see Synchronous and Asynchronous I/O.
This gets us closer, but we still don’t know. When you read the Consoles section of the same article, you discover that the documentation explicitly states that the parameter is ignored.
Consoles
The CreateFile function can create a handle to console input (CONIN$). If the process has an open handle to it as a result of inheritance or duplication, it can also create a handle to the active screen buffer (CONOUT$).

…

dwFlagsAndAttributes: ignored
So after all of that we discover that no, it’s not possible.
Maybe I should have read that original post in a little more detail before hunting through all of the documentation myself:
> If you look at the MSDN docs for CreateFile then you will see, under the
> heading Consoles, that CreateFile ignores file flags when creating a
> handle to a console buffer. I doubt that there is any way to do genuine
> asynchronous io to a console buffer.

— Roger Austin, asio C++ mailing list
Clearly, many projects face similar issues. They want to write to STDOUT as fast as possible, without blocking the main thread. What have they done to create non-blocking servers that access these blocking APIs?
There are two main options: mock the asynchrony with background threads, or redirect the standard streams to handles that do support overlapped I/O, such as named pipes.
In an article entitled “Asynchronous I/O in Windows for Unix Programmers”, Ryan Dahl (creator of node.js), provides a very good discussion of IOCP that includes file I/O, rather than just network I/O. His suggested approach for Console applications is to spawn threads that wait for events that then communicate with the main thread.
Console/TTY
It is (usually?) possible to poll a Unix TTY file descriptor for readability or writability just like a TCP socket—this is very helpful and nice. In Windows the situation is worse, not only is it a completely different API but there are not overlapped versions to read and write to the TTY. Polling for readability can be accomplished by waiting in another thread with RegisterWaitForSingleObject().

(emphasis added)
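That workaround translates directly into Python. The spawn_reader helper below is hypothetical, not part of any library: a background thread performs the blocking reads and posts each line onto a queue, which the main loop can drain without ever blocking on the stream itself.

```python
import io
import queue
import threading

def spawn_reader(stream, q):
    """Read lines from a blocking stream in a background thread,
    posting each one to q; None marks end-of-stream."""
    def loop():
        for line in stream:
            q.put(line)
        q.put(None)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# A StringIO stands in for sys.stdin so the sketch is self-contained.
q = queue.Queue()
t = spawn_reader(io.StringIO("line one\nline two\n"), q)
t.join()

lines = []
while True:
    item = q.get()
    if item is None:
        break
    lines.append(item)
print(lines)  # ['line one\n', 'line two\n']
```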
This approach is taken by FastCGI within libfcgi/os_win32.c. STDIN is mocked out, but STDOUT is kept synchronous. The StdinThread function loops in a thread until shutdown:
/*
 *--------------------------------------------------------------
 *
 * StdinThread--
 *
 *      This thread performs I/O on standard input. It is needed
 *      because you can't guarantee that all applications will
 *      create standard input with sufficient access to perform
 *      asynchronous I/O. Since we don't want to block the app
 *      reading from stdin we make it look like it's using I/O
 *      completion ports to perform async I/O.
 *
 * Results:
 *      Data is read from stdin and posted to the io completion
 *      port.
 *
 * Side effects:
 *      None.
 *
 *--------------------------------------------------------------
 */
static void StdinThread(LPDWORD startup)
{
    int doIo = TRUE;
    int fd;
    int bytesRead;
    POVERLAPPED_REQUEST pOv;

    while (doIo) {
        /*
         * Block until a request to read from stdin comes in or a
         * request to terminate the thread arrives (fd = -1).
         */
        if (!GetQueuedCompletionStatus(hStdinCompPort, &bytesRead, &fd,
                (LPOVERLAPPED *)&pOv, (DWORD)-1) && !pOv) {
            doIo = 0;
            break;
        }

        ASSERT((fd == STDIN_FILENO) || (fd == -1));
        if (fd == -1) {
            doIo = 0;
            break;
        }

        ASSERT(pOv->clientData1 != NULL);
        if (ReadFile(stdioHandles[STDIN_FILENO], pOv->clientData1, bytesRead,
                &bytesRead, NULL)) {
            PostQueuedCompletionStatus(hIoCompPort, bytesRead,
                STDIN_FILENO, (LPOVERLAPPED)pOv);
        } else {
            doIo = 0;
            break;
        }
    }
    ExitThread(0);
}
A very old version of Twisted seems to have implemented this approach. (From glancing at the current code, it looks like Twisted has moved back to threads :/ )
There are bound to be more examples around of using a proxy handle, though, as it seems like quite a nifty approach. The relevant MSDN article is “Creating a Child Process with Redirected Input and Output”.
The important takeaways seem to be:

- create the pipes with FILE_FLAG_OVERLAPPED so that they support overlapped I/O
- set up SECURITY_ATTRIBUTES correctly so that the child inherits the handles it needs, and duplicate or close the handles it must not inherit

An extract of old Twisted code demonstrating how to proceed looks like this:
# Counter for uniquely identifying pipes
counter = itertools.count(1)

class Process(object):
    ...
    def __init__(...):
        ...
        # Set the bInheritHandle flag so pipe handles are inherited.
        saAttr = win32security.SECURITY_ATTRIBUTES()
        saAttr.bInheritHandle = 1

        # Create a pipe for the child process's STDIN. This one is opened
        # in duplex mode so we can read from it too in order to detect when
        # the child closes their end of the pipe.
        self.stdinPipeName = r"\\.\pipe\twisted-iocp-stdin-%d-%d-%d" % (
            self.pid, counter.next(), time.time())
        self.hChildStdinWr = win32pipe.CreateNamedPipe(
            self.stdinPipeName,
            win32con.PIPE_ACCESS_DUPLEX | win32con.FILE_FLAG_OVERLAPPED,  # open mode
            win32con.PIPE_TYPE_BYTE,  # pipe mode
            1,  # max instances
            self.pipeBufferSize,  # out buffer size
            self.pipeBufferSize,  # in buffer size
            0,  # timeout
            saAttr)

        self.hChildStdinRd = win32file.CreateFile(
            self.stdinPipeName,
            win32con.GENERIC_READ,
            win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE,
            saAttr,
            win32con.OPEN_EXISTING,
            win32con.FILE_FLAG_OVERLAPPED,
            0)

        # Duplicate the write handle to the pipe so it is not inherited.
        self.hChildStdinWrDup = win32api.DuplicateHandle(
            currentPid, self.hChildStdinWr,
            currentPid, 0,
            0,
            win32con.DUPLICATE_SAME_ACCESS)
        win32api.CloseHandle(self.hChildStdinWr)
        self.hChildStdinWr = self.hChildStdinWrDup
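Stripped of the Windows-specific calls, the core trick is to hand the child a stream the parent controls instead of the real console handle. Python’s subprocess module does the equivalent on any platform when you redirect the standard streams to pipes, which is a reasonable way to get a feel for the approach:

```python
import subprocess
import sys

# The child believes it is reading STDIN and writing STDOUT,
# but both are really pipes owned by the parent process.
child = subprocess.Popen(
    [sys.executable, "-c", "print(input().upper())"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = child.communicate("hello\n")
print(out)  # HELLO
```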
There are others who are significantly more experienced in this area than I. The conventional approach certainly seems to be threads, but using redirection does appeal to me for some reason. As it nears midnight, my suggestion to the Tokio team and others would be to go with the approach that’s easiest to maintain unless benchmarks prove compelling.