The Universe of Discourse


Wed, 22 Jun 2016

How Unix pipes work

A few years ago I gave a series of talks on how Unix works, discussing Unix's concepts of the file and the process. In the process talk I implemented some Unix commands in Perl, as examples, and one of the largest of these examples was a shell. The shell examples started very simple, and then added more shell features: first file redirections, then built-in commands like cd, and eventually pipes.

Unfortunately the talk wasn't long enough to explain how the pipes worked, so I'm going to do it now.

What is a pipe?

In the shell, we write something like

    ls | rev

and this runs the ls and rev commands, and arranges that the output of ls goes into the input of rev. The output-input redirection passes through a kernel construct called a pipe.

At the bottom, the pipe is nothing more than a buffer in kernel memory, typically 64 kilobytes. (On older systems the buffer was 8 or even 4 kilobytes, but for the rest of this article we'll assume 64.) The buffer can be read from or written to. When the pipe is created, the kernel allocates two open file pointers for it, one for reading and one for writing, and these file pointers become the access points for processes to store data into the buffer and retrieve the stored data again.

In Perl, it looks like this:

    my ($rd, $wr);
    pipe($rd, $wr) or die "pipe: $!";

    syswrite $wr, "I like pie\n";
    my $line = <$rd>;
    print ">> $line";

The output is:

    >> I like pie\n

The pipe call allocates the buffer and the two filehandles, which are stored in $rd (for reading from the buffer) and $wr (for writing to the buffer). These handles are traditionally called the “reading end” and the “writing end” of the pipe.

I'm using syswrite to write "I like pie\n" into the pipe, rather than print, because data printed with print is buffered by the standard I/O library, which would introduce an unnecessary confusion into this example program. (For more details, see Suffering from Buffering?.) Skipping the standard I/O buffering exposes the kernel's basic behavior and makes it easier to observe. Once the data is in the pipe, we can use the regular Perl <…> operator to read it back out again. (Later we'll switch from <…> to sysread to get rid of the standard I/O library here also.)
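
If you did want to use print here, one way to sidestep the buffering is to turn on autoflush for the writing end, so that each print is handed to the kernel right away. A minimal sketch:

    use IO::Handle;               # enables the autoflush method on filehandles

    my ($rd, $wr);
    pipe($rd, $wr) or die "pipe: $!";

    $wr->autoflush(1);            # flush the standard I/O buffer after every print
    print $wr "I like pie\n";     # the data now reaches the pipe immediately
    my $line = <$rd>;
    print ">> $line";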

This is not especially useful, because we could just as easily have used $line = "I like pie\n" and left the kernel out of the procedure. But the advantage of the pipe is that it can be shared among several processes, which can then use it for interprocess communication. We will see this shortly.

Pipes are just buffers

But before we do, please take note of two important points. First: bytes are read out of the pipe in the order they went in. The I was first in, and it was the first to be read out again; the jargon for this is that pipes are FIFO (“first-in first-out”), and in some contexts they are even called FIFOs.

The second point is more subtle: Pipes are nothing but byte buffers. Any structure on the messages written into them must be imposed by the application.
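
For instance, here's a minimal sketch of one common convention, length-prefixed messages. (The helper names are my own invention, nothing standard.) Each message is written as a 4-byte big-endian length followed by the payload, so the reader can recover the boundaries that the pipe itself doesn't keep:

    sub send_message {
        my ($fh, $msg) = @_;
        syswrite $fh, pack("N", length $msg) . $msg;
    }

    sub receive_message {
        my ($fh) = @_;
        sysread($fh, my $header, 4) == 4 or return undef;   # end of file
        my $length = unpack "N", $header;
        sysread($fh, my $msg, $length);   # (a robust version would loop here)
        return $msg;
    }

The kernel never sees this structure; it sees only bytes. The framing exists entirely in the two functions' agreement about what the bytes mean.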

Here's an example just like the previous one, but with one more write:

        my ($rd, $wr);
        pipe($rd, $wr) or die "pipe: $!";

        syswrite $wr, "I like pie...";
        syswrite $wr, "Especially raspberry\n";
        my $line = <$rd>;
        print ">> $line";

The output is:

    >> I like pie...Especially raspberry\n

Here we wrote two messages into the pipe. Will the <$rd> extract the first message separately? No. The Perl <…> operator always reads characters up to the next newline (or whatever $/ is set to). Here it reads all the way up to the newline after the word raspberry.
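
For example, if we had set $/ to "..." the same read would have stopped right after the first message's trailing dots; a quick sketch:

    {
        local $/ = "...";    # read up to the next "..." instead of "\n"
        my $first = <$rd>;   # yields "I like pie..."
    }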

We can see the lack of structure more clearly if we use Perl's sysread operator, which reads a fixed number of bytes:

        my ($rd, $wr);
        pipe($rd, $wr) or die "pipe: $!";

        syswrite $wr, "I like pie...";
        syswrite $wr, "Especially raspberry!";
        my $bytes;
        while (sysread $rd, $bytes, 4) {
          print ">> '$bytes'\n";
        }

Now the output is:

    >> 'I li'
    >> 'ke p'
    >> 'ie..'
    >> '.Esp'
    >> 'ecia'
    >> 'lly '
    >> 'rasp'
    >> 'berr'
    >> 'y!'

and then it hangs. (We'll see why it hangs in the next section.)

Unix is happy to give the reader four bytes at a time, even though the bytes were written in groups of 13 and 21. Or at least, it's happy to do that up until there are only two bytes left; then the program asks for 4 but gets only 2. And the read after that one hangs. Why?

Semantics of reading pipes

At the bottommost level, you read a pipe with the Unix read call, which corresponds approximately with Perl's sysread function. (Perl's read and <…> introduce the standard I/O library, which is an additional complication we'll consider later.) At the C level, the call looks like this:

   int fd;
   char buffer[65536];
   size_t bytes_to_read;

   int bytes_read = read(fd, buffer, bytes_to_read);

The fd variable is a file descriptor, which is the kernel's low-level version of a filehandle; it is simply an integer that identifies what to read from. The buffer tells the kernel where to store the data once it's read. And bytes_to_read is a non-negative integer that tells the kernel how many bytes we want to read.

The short description of what this does is: it reads at most bytes_to_read bytes out of the pipe and stores the data in the buffer; then it returns the number of bytes it actually read. But there are a number of fine points and exceptions, which we'll collect into a sketch of a careful read loop after the list:

  1. An error might occur. For example, the caller might have provided a bad file descriptor or buffer pointer. In this case read returns -1 and sets the kernel error indicator, errno, to indicate what the problem was. In Perl, errno shows up in the special variable $!.

  2. If there are at least bytes_to_read bytes in the pipe, then that is how many are read and stored into the buffer, and that is the number returned. The buffer had better be big enough to hold the data, or the kernel will cheerfully overwrite the process’s memory with the extra!

  3. If there is at least one byte, but fewer than bytes_to_read bytes, in the pipe, all of them are read, and their number is returned.

  4. However, if there are no bytes in the pipe, then:

    1. If the writing end of the pipe has been closed, the request returns 0; the process should interpret this as an end-of-file condition.

    2. If the writing end of the pipe is still open, the request blocks: the kernel puts the process to sleep until data becomes available or the writing end is closed.

    3. (Exception for advanced users: the file descriptor can be marked non-blocking, in which case the read call never blocks; instead the blocking call turns into an error: it immediately returns -1 and sets errno to EWOULDBLOCK (“Operation would block” or sometimes “Resource temporarily unavailable”).)
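
Putting these rules together, here's a sketch of the earlier read loop rewritten to handle each case explicitly:

    while (1) {
        my $n = sysread $rd, my $bytes, 4;
        if    (!defined $n) { die "read error: $!" }    # case 1: error
        elsif ($n == 0)     { last }                    # case 4a: end of file
        else                { print ">> '$bytes'\n" }   # cases 2 and 3
    }

(Case 4b doesn't appear in the code at all: when the pipe is empty and the writing end is still open, the process simply sleeps inside sysread.)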

Now we can see why the last example program hung. There were 34 bytes in the pipe. The program issued read calls with bytes_to_read set to 4, and the first eight such calls read 4 bytes each, returning the number 4 each time. (That's case 2.) The ninth read found only 2 bytes in the pipe and read them, returning 2. (That's case 3.) And the tenth read found no bytes in the pipe. But the program still had the writing end open, so the call blocked. (That's case 4b.) And since the only process holding the writing end was itself asleep in the read call, no data could ever arrive, and the process could never wake up!

We can fix this:

        my ($rd, $wr);
        pipe($rd, $wr) or die "pipe: $!";

        syswrite $wr, "I like pie...";
        syswrite $wr, "Especially raspberry!";
        close $wr;
        my $bytes;
        while (sysread $rd, $bytes, 4) {
         print ">> '$bytes'\n";
        }

Now after the program prints >> 'y!' it exits. Why?

After the process has done its writing, it closes the writing end of the pipe. Now the reading goes as before, up through the ninth read of y!. The tenth read finds an empty pipe as before. But this time the writing end is closed, so instead of blocking the read call immediately returns 0. (This time it's case 4a instead of 4b.) Perl passes the 0 value into the script as the value returned by sysread, which terminates the while loop and ends the program.

Interprocess communication with pipes

Having the same process be both the reader and the writer is a little strange and not very useful. It's also tricky to pull off correctly, because pipes were not really designed to be used in this way. The normal use case is that one process reads and another writes. To do that, one process needs to hold the writing end of the pipe and the other needs to hold the reading end.

Typically the way this is done is as follows. One process creates the pipe, and then forks a child process. File descriptors are inherited from parent to child after a fork, so both processes have both ends of the pipe. Let's consider a typical scenario, where the parent runs a command and wants to read its output and then continue. The child will be the writer and the parent will be the reader, so the child closes the reading end of the pipe, and the parent closes the writing end. The child then uses the writing end to write data into the pipe; the parent uses the reading end to read the data back out.

If the parent gets ahead of the child, it tries to read the empty pipe, and blocks until the child writes more data.

Eventually, the child has nothing more to say and closes the writing end of the pipe, either with an explicit close call or more likely by exiting. After the parent reads the remaining data, the pipe is empty. Its next read returns 0, signalling end-of-file, and it can close the reading end and proceed as appropriate.

Here's complete code for a demo:

        my ($rd, $wr);
        pipe($rd, $wr) or die "pipe: $!";

        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;

        if ($pid == 0) {           # child (writer) process
          close $rd;
          print $wr "abcdefghijklmnopqrstuvwxyz\n";
        } else {                   # parent (reader) process
          close $wr;
          my $buf;
          my $line = 0;
          while (sysread $rd, $buf, 1) {
            print "Reader: ", ++$line, ": $buf\n";
          }
          print "Reader: End of file\n";
        }

The process forks and the two resulting processes take different paths through the if-else block. The child process takes the if part, closing the reading end of the pipe, writing data into the pipe, and exiting immediately afterward. The written data is safe in the kernel and will survive the death of the process that wrote it.

The parent takes the else clause. It closes the writing end of the pipe to avoid a deadlock just like the one we saw in the previous section, and then loops on sysread as before, reading one character at a time. It numbers each character of the input that the child wrote, prints out the result, and then exits.

It is quite simple to have the parent be the writer and the child the reader instead; just change the if ($pid == 0) test to if ($pid != 0).

Attaching the pipe to a command

The example of the previous section isn't quite typical: what's the point of forking a process and creating a pipe just to get the string abcdefghijklmnopqrstuvwxyz\n? Instead, we'll have the child run the ls -l command so that the parent gets the command output.

The ls command writes to standard output, which is inherited from the parent and is attached to the terminal. We want to arrange that the child's standard output is attached to the writing end of the pipe instead. Then when ls runs it will write into the pipe.

        my ($rd, $wr);
        pipe($rd, $wr) or die "pipe: $!";

        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;

        if ($pid == 0) {           # child (writer) process
          close $rd;
          
          my $fd = fileno($wr);
          open STDOUT, ">&=$fd"
            or die "Couldn't dup pipe descriptor: $!";
          exec "ls", "-l";
          die "Couldn't exec ls: $!";
                    
        } else {                   # parent (reader) process
          close $wr;
          my $buf;
          my $line = 0;
          while (sysread $rd, $buf, 1) {
            print "Reader: ", ++$line, ": $buf\n";
          }
          print "Reader: End of file\n";
        }

There's a lot of Perl weirdness here; oddly, the code is simpler in C! The child process needs to attach the writing end of the pipe, $wr, to its standard output. The way it does this is by obtaining the file descriptor number of the writing end with fileno and then using this number in the odd-looking “filename” >&=$fd in the open call. If that succeeds, it runs the ls command with exec. A successful exec does not return—or rather, it returns inside ls rather than inside our Perl script—and ls takes over from there, writing its usual data into the pipe for the parent to read.
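
By the way, on a reasonably modern Perl the explicit fileno can be avoided: the three-argument form of open will dup a filehandle directly, much as C's dup2 does. The child could say instead:

    open STDOUT, ">&", $wr      # ">&" (without the "=") duplicates the descriptor
      or die "Couldn't dup pipe handle: $!";
    exec "ls", "-l";
    die "Couldn't exec ls: $!";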

In C, this is somewhat less weird-looking (some details, such as error checking, are omitted):

    int fds[2];
    pipe(fds);             /* fds[0] is the reading end, fds[1] the writing end */
    int rd = fds[0], wr = fds[1];
    ...
    if (pid == 0) {        /* child (writer) */
      close(rd);
      dup2(wr, 1);         /* stdout is always descriptor 1 */
      execlp("ls", "ls", "-l", (char *) 0);
    } else {               /* parent (reader) */
      ...
    }

The dup2 call is doing the heavy lifting for Perl's bizarre open STDOUT, ">&=$fd" thing. It says to take whatever is attached to file descriptor wr and attach it to file descriptor 1 also. Standard output is file descriptor 1 by definition, because commands like ls are written to write to file descriptor 1, whatever it is attached to. (Descriptor 0 is standard input, and descriptor 2 is standard error.)
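
You can check the descriptor numbers from Perl directly; a quick sketch:

    printf "stdin=%d stdout=%d stderr=%d\n",
      fileno(STDIN), fileno(STDOUT), fileno(STDERR);
    # typically prints: stdin=0 stdout=1 stderr=2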

Semantics of writing pipes

In a C program, pipes are written to with the kernel's write call, and every function you use in Perl, including print, syswrite, say, and so forth, eventually turns into write down in the kernel. The call looks just like the read call that was used for reading:

   int fd;
   char buffer[65536];
   size_t bytes_to_write;

   int bytes_written = write(fd, buffer, bytes_to_write);

Again, the fd variable is a file descriptor, the buffer tells the kernel where the data is coming from, and bytes_to_write says how many bytes to copy out of the buffer.

The basic semantics are almost exactly opposite to those of read: a successful call copies at most bytes_to_write bytes out of the buffer and puts them into the pipe. But again there are a number of fine points and exceptions:

  1. An error might occur. The behavior is the same as for read: the call returns -1 and sets errno.

  2. If the reading end of the pipe has been closed, the kernel sends the process the SIGPIPE signal, which normally kills it instantly.

    1. However, the process can arrange beforehand to catch the signal, in which case its signal handler function is called instead.

    2. Or it can ignore the signal, in which case the write call returns -1 and errno is set to EPIPE (“broken pipe”). (A sketch demonstrating this appears after the list.)

  3. If the reading end of the pipe is still open, all the data is copied from the buffer into the pipe, and bytes_to_write is returned. In contrast to the read case, partial writes don't happen.

    What if there is not enough room in the pipe for bytes_to_write bytes? For example, what if the buffer is so big that it can't fit into the pipe all at once? Then the write call blocks until all the data has been written and still returns bytes_to_write. That is, the process goes to sleep while the write is in progress. The kernel copies as much data as will fit, and the process stays asleep. When some space is freed up in the pipe by data being read out, the kernel copies more data into the pipe. The writing process does not wake up until the last byte is copied, whereupon the write call finally returns bytes_to_write.

  4. (Exception for advanced users: sometimes partial writes do happen; for example if a signal arrives in the middle of a long write.)

  5. (Or the writing end of the pipe can be marked non-blocking, which converts the block into an immediate error result, just as in the read case.)
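
Here's a minimal sketch demonstrating case 2b: the writer ignores SIGPIPE and observes the EPIPE error instead of being killed:

    use Errno qw(EPIPE);

    $SIG{PIPE} = 'IGNORE';    # case 2b: get an error instead of a fatal signal

    my ($rd, $wr);
    pipe($rd, $wr) or die "pipe: $!";
    close $rd;                # now nothing can ever read from this pipe

    my $n = syswrite $wr, "I like pie\n";
    if (!defined $n && $! == EPIPE) {
        print "write failed: broken pipe\n";
    }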

The upshot of all this is that you can attach the standard output of ls -l to your process’s pipe, and then you can ignore it. It will do its thing and write whenever it finds it convenient. If there is room in the pipe, it will continue writing until the pipe is full, whereupon it will go to sleep (case 3) and wait for your process to empty the pipe again. When it exits, the writing end will close, and your process will read the rest of the data and then get an end-of-file indication. If your process dies prematurely, then the next time ls tries to write to the pipe the kernel will kill it (case 2).

