Mayank Mandava

Linux System Programming by Robert Love [Chapters 1-4]

Chapter 1 - Introduction

  • processes inherit the UID and GID
  • permission octal values: r=4, w=2, x=1. Order is user, group, everyone else
  • functions normally just return a -1 to indicate an error
  • more details to be found in extern int errno in <errno.h>
  • to print void perror(const char *str) in <stdio.h>
  • example:
if (close (fd) ==1)
        perror ("close");

Chapter 2 - File I/O

  • Opening files:
fd = open(<path>, flags)
  • create:
fd = open (file, O_WRONLY | O_CREAT | O_TRUNC, 0664);
  • is identical to
fd = creat (filename, 0644);
  • fd of -1 indicates error
  • reading:
#include <unistd.h>
ssize_t read (int fd, void *buf, size_t len);
  • A call to read() can result in many possibilitie, to read all the bytes
ssize_t ret;
while (len != 0 && (ret = read (fd, buf, len)) != 0) {
  if (ret ==1) {
    if (errno == EINTR)
      continue;
    perror ("read");
    break;
  }
  len -= ret;
  buf += ret;
}
  • nonblocking reads can be achieved by passing O_NONBLOCK to open
  • if a read is blocked, it will return -1 and errno will be set to EAGAIN
  • write:
#include <unistd.h>
ssize_t write (int fd, const void *buf, size_t count);
  • call fsync(int ft) to sync writes to storage
  • fdatasync(int fd) does the same thing, without updated metadata, and is faster
  • use the O_SYNC flag to always sync
  • closing:
#include <unistd.h>
int close(int fd)`
  • seeking:
#include <sys/types.h>
#include <unistd.h>
off_t lseek (int fd, off_t pos, int origin);
  • you CAN seek past the end of a file, it will be padded with zeros
  • position reads and writes avoid the race condtion associated with seeking and then reading
#define _XOPEN_SOURCE 500
#include <unistd.h>
ssize_t pread (int fd, void *buf, size_t count, off_t pos);
ssize_t pwrite (int fd, const void *buf, size_t count, off_t pos);
Multiplexed IO
  • poll() is easier to use than select()
  • both wait on a set of open file descriptors and return when any are available
  • poll example
#include <stdio.h>
#include <poll.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct pollfd fds[2];
        fds[0].fd = STDIN_FILENO;
        fds[0].events = POLLIN;

        fds[1].fd = STDOUT_FILENO;
        fds[1].events = POLLOUT;

        int err = poll(fds, (nfds_t) 2, 0);
        if (err == -1) {
            perror("poll");
            return -1;
        }

        if (fds[0].revents & POLLIN) {
            printf("STDIN ready\n");
        }
        if (fds[1].revents & POLLOUT) {
            printf("STDOUT ready\n");
        }

        return 0;
}
  • The VFS provides a unified blocks and inodes based interface to all filesystems
  • The page cache holds retrieved info including readaheads

Chapter 3 - Buffered I/O

  • Usually more efficient to read in multiples of 4096 or 8192 bytes because of block alignment
  • User space buffered IO can increase performance even more
  • Write to a buffer, which is written in a single operation
  • The read requests are served from the buffer
  • The end result is fewer system calls for larger amounts of data, all aligned on block boundaries.
  • Provided by stdio
  • StardardI/O routines act on file pointers, not fds
  • FILE type defined in stdio.h
FILE * fopen (const char *path, const char *mode);
FILE * fdopen (int fd, const char *mode);
  • modes: r, w, r+ (read+write), w+ (read, write, truncates), a+(rw in append mode)
  • Closing the stream will close the file descriptor as well.
int fclose (FILE *stream);
int fcloseall (void); // Linux specific
  • reading
// read a char
int fgetc (FILE *stream);
// put it back
int ungetc (int c, FILE *stream);
// read a line
// reads one char less than size and puts a  at the end
// will stop and newline and also put a 

char * fgets (char *str, int size, FILE *stream);
// Read binary data
// reads `nr` elements, each of size `size`
// returns less than nr if there's an error
// it is impossible to know which of the two conditions occurred without using ferror() and feof()
size_t fread (void *buf, size_t size, size_t nr, FILE *stream);
  • writing
// write a char
// return EOF in case of error
int fputc (int c, FILE *stream);
// write a string
int fputs (const char *str, FILE *stream);
// binary
// A return value less than nr denotes error.
size_t fwrite (void *buf, size_t size, size_t nr, FILE *stream);
  • It’s important to bear in mind that because of differences in variable sizes, align‐ ment, and so on, binary data written with one application may not be readable by other applications.
  • Example program
# include <stdio.h>

int main(int arc, char **argv) {
    struct pirate {
            char name[100];
            unsigned long booty;
            unsigned int beard_len;
    } p, blackbeard = {"Mayank", 100, 50};

    FILE * file = fopen("/tmp/pirate", "w");
    if (!file) {
        perror("fopen");
        return 1;
    }
    if(!fwrite(&blackbeard, sizeof (struct pirate), 1, file)){
        perror("fwrite");
        return 1;
    }
    if(fclose(file)) {
        perror("fclose");
        return 1;
    }

    file = fopen("/tmp/pirate", "r");
    if (!file) {
        perror("fopen");
        return 1;
    }
    if(!fread(&p, sizeof(struct pirate), 1, file)) {
        perror("fread"); return 1;
    }
    if(fclose(file)) {
        perror("fclose");
        return 1;
    }
    printf("%s, %lu, %u", p.name, p.booty, p.beard_len);
}
  • other
int fseek (FILE *stream, long offset, int whence);
// fsetpos, rewind, ftell, fgetpos for seeking
// fflush flushes the data to the kernel (but does not sync)
fileno(*stream) gets the fd
  • errors
ferror(FILE *stream) returns non-zero if error is set
feof() returns nonzero if EOF is set
clearerr() clears the error
Threading
  • Standard io functions are thread safe
  • For multipl functions, use explicit locks
flockfile(*stream) locks a file (blocking) by incrementing lock count
funlockfile(*stream) decrements lock count
ftrylockfile(*stream) is non-blocking, returning nonzero if cannot lock

Chapter 4: Advanced File I/O

Scatter/Gather I/O
  • The idea is to write several buffers to a file or to read several buffers from a file
  • The buffers are read and written in serial order, but very efficiently by the kernel
#include <sys/uio.h>
struct iovec {
  void *iov_base;
  size_t iov_len;
};
ssize_t readv (int fd, const struct iovec *iov, int count);
ssize_t writev (int fd, const struct iovec *iov, int count);
epoll
  • easier to use than poll and select, but linux specific
  • decouples creating a listener, adding fds and waiting on it
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>
int main() {
    int epfd = epoll_create1(0);
    struct epoll_event events[2];
    events[0].events = EPOLLIN;
    events[0].data.fd = STDIN_FILENO;
    events[1].events = EPOLLOUT;
    events[1].data.fd = STDOUT_FILENO;
    if(epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &events[0]) == -1) {
        perror("epoll_ctl");
        return 1;
    }
    if(epoll_ctl(epfd, EPOLL_CTL_ADD, STDOUT_FILENO, &events[1]) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    struct epoll_event out_events[2];
    int events_ready = epoll_wait(epfd, out_events, 2, 0);
    if(events_ready == -1) {
        perror("epoll_wait");
        return 1;
    }

    printf("Events Ready: %d\n", events_ready);
    for(int i = 0; i < events_ready; i++) {
        printf("Event fd: %i: %i\n", out_events[i].events, out_events[i].data.fd);
    }
    return 0;
}
MMAP
  • mmap maps a file into memory
#include <sys/mman.h>
void * mmap (void *addr, size_t len, int prot, int flags, int fd, off_t offset);
  • prot is protection mode - PROT_READ, PROT_WRITE, PROT_EXEC
  • it should match the open mode of the file
  • flags:

    • MAX_FIXED - addr is a requirement, not optional (discouraged)
    • MAP_PRIVATE - file is mapped copy-on-write
    • MAP_SHARED - shared with other processes that map this file
    • one of the previous two must be specified, but not both
  • mapping increments the file's reference counter
  • addr and offset must be aligned on a page boundary
  • to get page size
long page_size = sysconf(_SC_PAGESIZE); // <unistd.h>
// or
int getpagesize(void);
  • to remove mapping:
int munmap(void *arrd, size_t len);
  • will remove all mappings in the given range
  • example
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

    if (argc != 2) {
        printf("Must suppy filename\n"); return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open"); return 1;
    }

    struct stat sb;

    if (fstat(fd, &sb) == -1) {
        perror("fstat"); return 1;
    }

    if (!S_ISREG(sb.st_mode)) {
        printf("Not a regular file\n"); return 1;
    }

    char * p;
    if ((p = mmap(0, sb.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED) {
        perror("mmap"); return 1;
    }

    for(int i = 0; i < sb.st_size; i++) {
        putchar(*(p + i));
    }
    putchar('\n');

    if(munmap(p, sb.st_size) == -1) {
        perror("munmap"); return 1;
    }

    return 0;

}
  • advantages of mmap:

    • avoids the extra copy that happens to a user space buffer when using read and write
    • there's no system call overhead to read from memory
    • shared mode lets processes share file
    • there's no need for lseek
  • mremap to change size of mapping
  • glib often uses mremap to implement realloc
  • mprotect to change protection of a mapping
  • msync to sync changes back to disk - should always be called because kernel does not know that memory has been modified, unlike write's dirty buffers
  • madvise to give the kernel a hint about usage of map - usually related to amount of readahead (more for sequential, none for random)
  • posix_fadvise is very similar, but for normal I/O (not mmap)
Asynchronous I/O
  • The book here is very brief. The following notes are from the aio(7) man page

  • The POSIX AIO interface consists of the following functions:

#include <aiocb.h>

aio_read(3)     Enqueue a read request.  This is the asynchronous analog of read(2).
aio_write(3)    Enqueue a write request.  This is the asynchronous analog of write(2).
aio_fsync(3)    Enqueue a sync request for the I/O operations on a file descriptor.  This is the asynchronous analog of fsync(2) and fdatasync(2).
aio_error(3)    Obtain the error status of an enqueued I/O request.
aio_return(3)   Obtain the return status of a completed I/O request.
aio_suspend(3)  Suspend the caller until one or more of a specified set of I/O requests completes.
aio_cancel(3)   Attempt to cancel outstanding I/O requests on a specified file descriptor.
lio_listio(3)   Enqueue multiple I/O requests using a single function call.


struct aiocb {
    int             aio_fildes;     /* File descriptor */
    off_t           aio_offset;     /* File offset */
    volatile void  *aio_buf;        /* Location of buffer */
    size_t          aio_nbytes;     /* Length of transfer */
    int             aio_reqprio;    /* Request priority */
    struct sigevent aio_sigevent;   /* Notification method */
    int             aio_lio_opcode; /* Operation to be performed;
};
  • You can either poll the status of the request (with aio_error) or have the completion fire a signal
  • You can ask the request to signal via aiocbp->aio_sigevent.sigev_notify = SIGEV_SIGNAL
  • You can also ask it to spawn a thread
  • If aio_error is not EINPROGRESS, you can use aio_return to get the return value (of the corresponding read() operation for example)
I/O Schedulers and I/O performance
  • Scheduler is set in /sys/block/[device]/queue/scheduler

  • The scheduler would try to sequence reads and writes on disk efficiently

  • It performs merges (combining reads in adjecent blocks) and sorting

  • Since writes are streamed (async), they can starve the reads (writes-starving-reads)

  • The Deadline scheduler

    • keeps a sorted list of operations (sorted by physical block)
    • keeps a FIFO queue of read operations and a FIFO queue of write operations
    • read ops have a deadline of 500ms and write ops 6s
    • it goes by sorted list, unless a deadline happens, which get priority
  • My current machine with an NMVe SSD disc uses mq-deadline (multiqueue)

Optimizing IO
  • There are many optimizations that can be done on the application side:

    • Minimize I/O operations (coalescing)
    • Perform block aligned I/O
    • Use buffering
    • vectored I/O, positional I/O and async I/O
  • We can also sort I/O requests before sending them to the kernel

    • Sectors are mapped to physical blocks so block storage can be addressed by blocks (logical block addressing, LBA) instead of cylinders, heads and sectors
    • Filesystems are purely software, and talk in terms of logical blocks
    • Logical block size is a multiple of physical block size
  • Sorting writes in the application:

    • One way to do that is pathname -> poor approximation
    • Another is inode number -> breaks with more fragmentation
    • Or you can try to sort by physical block:

      • Each file is broken into logical blocks
      • You can get the physical block from logical block:
        ret = ioctl (fd, FIBMAP, &block);
      
      • The logical blocks are zero-indexed and file-relative
      • First determine number of blocks in a file via fstat
      • Then, for each logical block, find the physical block via ioctl
      • Since files tend to be contiguous and we will mostly want to order I/O for a given file, just use the 0th block to sort:
      int logical_block = 0;
      ioctl(fd, FIBMAP, &logical_block);