Handling Sparse Files on Linux

Linux Storage

Sparse files are common in Linux/Unix and are also supported by Windows (e.g. NTFS) and macOS (e.g. HFS+). Sparse files use storage efficiently when the files have a lot of holes (contiguous ranges of zero-valued bytes) by storing only metadata for the holes instead of using real disk blocks. They are especially useful in cases like allocating VM images.

The following image illustrates the structure of a sparse file (image by User:Sven on Wikimedia).

Fig

In this post, we will discuss some common tools and libraries for handling sparse files in Linux environments.

Command line tools for handling sparse files
#

Linux has a set of tools that can create or handle sparse files.

Create sparse files
#

You may use [truncate](https://www.systutorials.com/docs/linux/man/1-truncate/) or the general [dd](https://www.systutorials.com/docs/linux/man/1-dd/) to create sparse (almost empty) files.

truncate shrinks or extends a file to the specified size. If the file already exists and is smaller than the requested size, truncate appends a hole to its end; if it is larger, the extra data is lost. If the file does not exist yet, truncate creates it by default. For example, the following command creates an empty 20GB (20 GiB) sparse file, or extends/shrinks vmdisk0 to 20GB if it already exists.

truncate -s 20g ./vmdisk0

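The common dd tool can make sparse files too, by seeking past the end of the output file. For example, to create a 20GB vmdisk0, seek to the 20GB offset in 1KB blocks and copy nothing (count=0); GNU dd then sets the file size to the seek offset without writing any data. With count=1, dd would write one extra 1KB block of zeros after that offset, so the file would end up 1KB larger than 20GB.

dd if=/dev/zero of=./vmdisk0 bs=1k seek=20480k count=0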
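
To check that a file created this way is really sparse, one can compare its apparent size with the space actually allocated on disk. The following is a minimal sketch, assuming the Linux convention that st_blocks is counted in 512-byte units; sparse_file_usage is a hypothetical helper name used for illustration, not part of any library.

```c
#include <stdint.h>
#include <sys/stat.h>

// Hypothetical helper: report the apparent size and the allocated size of
// `path`, both in bytes. Returns 0 on success, -1 on failure.
int sparse_file_usage(const char *path, uint64_t *apparent, uint64_t *allocated)
{
    struct stat st;
    if (stat(path, &st) == -1) {
        return -1;
    }
    *apparent = (uint64_t)st.st_size;
    // Assumption: Linux reports st_blocks in 512-byte units, independent of
    // the filesystem block size.
    *allocated = (uint64_t)st.st_blocks * 512;
    return 0;
}
```

For a freshly created sparse vmdisk0, apparent would be about 20GB while allocated should stay close to zero. From the shell, comparing `du -h --apparent-size ./vmdisk0` with `du -h ./vmdisk0` gives the same picture.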

Archive or copy sparse files
#

To handle sparse files efficiently, both the kernel and the tools should support the SEEK_HOLE/SEEK_DATA functionality. For details, see the post "SEEK_HOLE and SEEK_DATA: efficiently archive/copy large sparse files".
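
As a rough illustration of what SEEK_DATA and SEEK_HOLE offer, the sketch below walks a file and prints each data extent, skipping the holes without reading them. It assumes Linux with glibc (where _GNU_SOURCE exposes the two constants); list_data_extents is a hypothetical name chosen for this example. On a filesystem without hole support, the kernel is expected to report the whole file as a single data extent.

```c
#define _GNU_SOURCE  // exposes SEEK_DATA and SEEK_HOLE in <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: print every data extent of `path` as [start, end)
// and return the number of extents found, or -1 on error.
int list_data_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return -1;
    }
    int count = 0;
    off_t pos = 0;
    for (;;) {
        // Find the next offset >= pos that contains data.
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t)-1) {
            // ENXIO means there is no more data at or after `pos`.
            int done = (errno == ENXIO);
            close(fd);
            return done ? count : -1;
        }
        // Find where this data extent ends (end of file counts as a hole).
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1) {
            close(fd);
            return -1;
        }
        printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
        count++;
        pos = hole;
    }
}
```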

If you are using a Linux system with kernel version 3.1 or later, the kernel and the tools shipped with it likely already support sparse files. Tools that may be used include [rsync](https://www.systutorials.com/docs/linux/man/1-rsync/) (with --sparse), [tar](https://www.systutorials.com/docs/linux/man/1-tar/) (with --sparse), [cp](https://www.systutorials.com/docs/linux/man/1-cp/) (with --sparse=auto or --sparse=always) and more.
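
To give an idea of how a sparse-aware copy can work, here is a simplified sketch that copies only the data extents of the source and leaves the holes unwritten in the destination. It is not how rsync, tar or cp are actually implemented; sparse_copy and copy_range are hypothetical names, and the sketch assumes Linux with glibc, does not retry on EINTR, and does not preserve permissions or timestamps.

```c
#define _GNU_SOURCE  // exposes SEEK_DATA and SEEK_HOLE in <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Copy bytes [start, end) from `in` to the same offsets in `out`.
static int copy_range(int in, int out, off_t start, off_t end)
{
    char buf[65536];
    while (start < end) {
        size_t want = sizeof(buf);
        if ((off_t)want > end - start) {
            want = (size_t)(end - start);
        }
        ssize_t n = pread(in, buf, want, start);
        if (n <= 0) {
            return -1;  // read error or the source shrank underneath us
        }
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = pwrite(out, buf + done, (size_t)(n - done), start + done);
            if (w <= 0) {
                return -1;
            }
            done += w;
        }
        start += n;
    }
    return 0;
}

// Hypothetical helper: copy `src` to `dst`, skipping holes.
// Returns 0 on success, -1 on failure.
int sparse_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in == -1) {
        return -1;
    }
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (out == -1) {
        close(in);
        return -1;
    }
    struct stat st;
    int rc = fstat(in, &st);
    off_t pos = 0;
    while (rc == 0 && pos < st.st_size) {
        off_t data = lseek(in, pos, SEEK_DATA);
        if (data == (off_t)-1) {
            if (errno != ENXIO) {
                rc = -1;  // ENXIO just means only a trailing hole is left
            }
            break;
        }
        off_t hole = lseek(in, data, SEEK_HOLE);
        if (hole == (off_t)-1 || copy_range(in, out, data, hole) == -1) {
            rc = -1;
            break;
        }
        pos = hole;
    }
    // Set the final length so that a trailing hole is preserved.
    if (rc == 0 && ftruncate(out, st.st_size) == -1) {
        rc = -1;
    }
    close(in);
    if (close(out) == -1) {
        rc = -1;
    }
    return rc;
}
```

Because the destination is written with pwrite() at the original offsets, the ranges that are never written remain holes on filesystems that support them.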

Library functions for handling sparse files programmatically
#

There is a set of C functions available for handling sparse files. Other programming libraries may be built on top of them. Some of those that can be used are as follows.

lseek()
#

If what you want is to create an empty sparse file, lseek could be enough.

off_t lseek(int fd, off_t offset, int whence);

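Here is an example of a C function using lseek(). The idea is to create a file, seek to the last byte of the required size, write a single zero byte there and close the file. Everything before that byte is naturally a large hole.

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

// Create a file of `size` bytes whose content is one large hole followed by
// a single zero byte.
// Returns 0 on success, -1 on failure.
int create_sparse_file(const char *path, uint64_t size)
{
    if (size == 0) {
        return -1;  // size - 1 would wrap around
    }
    int fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd == -1) {
        return -1;
    }
    // Seek from the start of the file; nothing before the final byte is written.
    if (lseek(fd, (off_t)(size - 1), SEEK_SET) == (off_t)-1 ||
        write(fd, "\0", 1) != 1) {
        close(fd);  // do not leak the descriptor on failure
        return -1;
    }
    return close(fd);
}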

Check more in [lseek() manual](https://www.systutorials.com/docs/linux/man/2-lseek/).

truncate() and ftruncate()
#

The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.

If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes (‘\0’).

int truncate(const char *path, off_t length);
int ftruncate(int fd, off_t length);

Check more in [truncate() manual](https://www.systutorials.com/docs/linux/man/2-truncate/).

fallocate()
#

fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.

int fallocate(int fd, int mode, off_t offset, off_t len);

Check more in [fallocate() manual](https://www.systutorials.com/docs/linux/man/2-fallocate/#lbAF).
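
One use of fallocate() that is relevant to sparse files is turning an already written range back into a hole. The sketch below uses the Linux-specific FALLOC_FL_PUNCH_HOLE mode, which has to be combined with FALLOC_FL_KEEP_SIZE so the file length stays the same. punch_hole is a hypothetical helper name, and the call is expected to fail (typically with EOPNOTSUPP) on filesystems that do not support hole punching.

```c
#define _GNU_SOURCE  // exposes fallocate() and FALLOC_FL_* in <fcntl.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: deallocate the byte range [offset, offset + len) of
// `path` while keeping the file size unchanged.
// Returns 0 on success, -1 on failure with errno set by open()/fallocate().
int punch_hole(const char *path, off_t offset, off_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1) {
        return -1;
    }
    int rc = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       offset, len);
    int saved_errno = errno;  // keep fallocate()'s error across close()
    close(fd);
    errno = saved_errno;
    return rc;
}
```

After a successful call the punched range reads back as zero bytes, and the blocks that were fully covered by the range are released to the filesystem.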
