This is the first post of a three-part series about filesystems, containers, and Concourse. The goal of this series is to describe what the different volume drivers available in Concourse do.
At Techorrect, we’re huge fans of Concourse. At its core, it offers a few simple primitives (resources and tasks) which you can then combine and extend to accomplish work with varying levels of complexity. As you begin using Concourse more, you may encounter a few issues relating to volumes and disk space, and many times, the suggested fix is to switch from the btrfs storage driver to the overlay storage driver. I wanted to learn what was going on behind the scenes, and this series is a summary of my notes from that investigation.
While filesystems are a huge and complicated topic in their own right, this post only briefly describes the btrfs and overlayfs filesystems in Linux [1]. These two filesystems contain features that can be leveraged by container managers like Docker to allow containers to all point to the same underlying image instead of copying the image each time a new container is created. Docker has a good explanation of how it manages image layers with btrfs and overlayfs at https://docs.docker.com/storage/storagedriver (select the relevant filesystem in the menu on the left), and this is also the topic of the next blog post in this series.
btrfs
btrfs is a “real” filesystem operating on a block device, based on the copy-on-write (CoW) idea. btrfs provides a “subvolumes” feature - when you create a new btrfs subvolume, it is still part of the parent btrfs filesystem, but can be conceptualized as an independent area. You can access a btrfs subvolume just like a regular directory, or you can mount it like a new filesystem (which hides the original parent btrfs filesystem). Additionally, you can take “snapshots” of subvolumes - snapshots are also subvolumes, just with some initial content. Because of btrfs’s CoW capabilities, taking snapshots is easy and efficient. Another feature that makes btrfs well-suited to container managers is its ability to nest subvolumes within subvolumes, which we’ll come back to in a future post in this series.
Let’s play around with btrfs on an actual system that supports it - I’ve run the following commands on an AWS EC2 Ubuntu 18.04 instance with an extra EBS volume (/dev/xvdb) attached. You can also find the scripts at https://github.com/edtan/filesystems-experiment.
First, let’s make sure that the kernel supports btrfs:
cat /proc/filesystems | grep btrfs
btrfs
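One caveat with the plain grep is that it matches substrings, and /proc/filesystems only lists filesystems that are currently registered, so a btrfs built as a module may not show up until it is loaded. A minimal sketch of a stricter check follows; fs_listed is a hypothetical helper name, and its optional second argument exists only so the function can be pointed at a sample file.

```shell
# Hypothetical helper: succeeds if the named filesystem appears in the
# kernel's list of registered filesystems. The optional second argument
# points it at a different file, which is handy for testing.
fs_listed() {
    fs_name=$1
    fs_table=${2:-/proc/filesystems}
    # The filesystem name is the last field on each line ("nodev" is an
    # optional first field used for virtual filesystems).
    awk -v name="$fs_name" '$NF == name { found = 1 } END { exit !found }' "$fs_table"
}

if fs_listed btrfs; then
    echo "btrfs is registered with the kernel"
else
    echo "btrfs is not listed; if it is built as a module, try: sudo modprobe btrfs"
fi
```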
We then create a btrfs filesystem on the extra EBS volume that we attached to the EC2 instance. This needs to be done because btrfs is a filesystem that deals with actual block devices.
sudo mkfs.btrfs -f /dev/xvdb
Mount the volume at /scratch, and make the current user the owner of the mount directory:
sudo mkdir /scratch
sudo mount -t btrfs /dev/xvdb /scratch
sudo chown -R $USER:$USER /scratch
Begin by creating a btrfs subvolume:
btrfs subvolume create /scratch/subvolume1
Then create a large file - I just wrote a bunch of yes’s to a file and manually hit Ctrl+C after a few seconds, which resulted in a 352MB file.
yes > /scratch/subvolume1/yes.txt
You can now pass the --reflink option to cp, which performs a CoW copy. This fails with an error on a filesystem that doesn’t support CoW.
cp --reflink /scratch/subvolume1/yes.txt /scratch/subvolume1/yes-copy.txt
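If you want to try the command on a machine without a btrfs volume, GNU cp also accepts --reflink=auto, which shares data when the filesystem supports it and otherwise falls back to an ordinary copy (a bare --reflink means --reflink=always, which is why it errors instead). A small sketch, assuming GNU coreutils; the directory and file names are made up for the demo:

```shell
# Work in a throwaway directory so nothing outside it is touched.
demo_dir=$(mktemp -d)
printf 'hello reflink\n' > "$demo_dir/original.txt"

# --reflink=auto: share extents when the filesystem supports it,
# otherwise silently fall back to a regular copy.
cp --reflink=auto "$demo_dir/original.txt" "$demo_dir/copy.txt"

# Either way, the two files have identical contents.
if cmp -s "$demo_dir/original.txt" "$demo_dir/copy.txt"; then
    copy_status="identical"
else
    copy_status="different"
fi
echo "copy is $copy_status"

rm -rf "$demo_dir"
```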
Now, we need to keep in mind the distinction between space used at the block layer (i.e. the underlying space used) and at the file layer. To regular tools such as ls, it looks like there are two 352MB files, because they report at the file layer:
ls -l --human-readable /scratch/subvolume1/
total 703M
-rw-rw-r-- 1 ubuntu ubuntu 352M Aug 20 18:29 yes-copy.txt
-rw-rw-r-- 1 ubuntu ubuntu 352M Aug 20 18:29 yes.txt
But the underlying space used is 352MB, not 704MB. We can check this using btrfs tools.
In the next command, “FS bytes used” shows the underlying space used. Also notice the 1.08GiB used on the device: btrfs allocates data in 1GB chunks:
sudo btrfs filesystem show /dev/xvdb
Label: none uuid: 69e7407a-7bf2-4179-ac0e-ce342f091f54
Total devices 1 FS bytes used 351.74MiB
devid 1 size 8.00GiB used 1.08GiB path /dev/xvdb
It’s also interesting to see the output of a few other btrfs commands and regular commands:
# check the disk space usage. "Set shared" is all the space shared by
# all children of the argument to du (in this case, /scratch). Notice
# that the "set shared" is only the size of the original yes.txt
btrfs filesystem du /scratch
Total Exclusive Set shared Filename
351.74MiB 0.00B - /scratch/subvolume1/yes.txt
351.74MiB 0.00B - /scratch/subvolume1/yes-copy.txt
703.48MiB 0.00B - /scratch/subvolume1
703.48MiB 0.00B 351.74MiB /scratch
# check filesystem stats, may need to wait a while before running this
# command to have up to date stats
btrfs filesystem df /scratch
Data, single: total=840.00MiB, used=351.99MiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=264.00MiB, used=512.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
# Notice that we get slightly different stats from the df and du
# commands
df --human /scratch/
Filesystem Size Used Avail Use% Mounted on
/dev/xvdb 8.0G 369M 7.4G 5% /scratch
du --human --summarize /scratch/
704M /scratch/
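The gap between file-layer and block-layer numbers is not specific to btrfs. As a rough illustration that runs on most Linux filesystems, the sketch below uses a sparse file as a stand-in. This is a different mechanism from reflinks, chosen only because it needs no btrfs volume, and it assumes GNU coreutils (truncate, stat -c); the variable and file names are made up.

```shell
# Stand-in demo: a sparse file, not a reflink. Assumes GNU coreutils.
sparse_dir=$(mktemp -d)
sparse_file="$sparse_dir/sparse.bin"

# Create a 100 MiB file without writing any data to it.
truncate -s 100M "$sparse_file"

# File layer: the size recorded for the file.
apparent_bytes=$(stat -c %s "$sparse_file")

# Block layer: allocated blocks multiplied by the size of each block.
allocated_bytes=$(( $(stat -c %b "$sparse_file") * $(stat -c %B "$sparse_file") ))

echo "apparent size:  $apparent_bytes bytes"
echo "allocated size: $allocated_bytes bytes"

rm -rf "$sparse_dir"
```

On a filesystem that supports sparse files, the allocated size should come out far smaller than the apparent size, which is the same kind of disagreement seen between du and the btrfs tools above.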
Now, let’s try creating a snapshot of subvolume1:
btrfs subvolume snapshot /scratch/subvolume1/ /scratch/subvolume2
We can see that the snapshot (subvolume2) starts off with the contents from subvolume1:
ls /scratch/subvolume2
yes-copy.txt yes.txt
Next, we create a new file in the new snapshot, filling it with a bunch of no’s. Again, Ctrl+C after a second produces a sufficiently large file for testing:
yes no > /scratch/subvolume2/no.txt
Checking the disk space usage again, we see that the “set shared” is still the same (351.74MiB), because the snapshot references the original subvolume’s data instead of copying it. The only data that is shared is still that of the original yes.txt. Additionally, we see that the new no.txt in the snapshot has a non-zero “exclusive” size, meaning that its data is not shared with any other files:
btrfs filesystem du /scratch
Total Exclusive Set shared Filename
351.74MiB 0.00B - /scratch/subvolume1/yes.txt
351.74MiB 0.00B - /scratch/subvolume1/yes-copy.txt
703.48MiB 0.00B - /scratch/subvolume1
351.74MiB 0.00B - /scratch/subvolume2/yes.txt
351.74MiB 0.00B - /scratch/subvolume2/yes-copy.txt
157.06MiB 157.06MiB - /scratch/subvolume2/no.txt
860.54MiB 157.06MiB - /scratch/subvolume2
1.53GiB 157.06MiB 351.74MiB /scratch
Let’s try creating a new snapshot called subvolume3 of the existing snapshot (subvolume2):
btrfs subvolume snapshot /scratch/subvolume2/ /scratch/subvolume3
No surprise here, it starts off with the contents of subvolume2:
ls /scratch/subvolume3
no.txt yes-copy.txt yes.txt
Finally, we run various commands to examine “space used”:
btrfs filesystem du /scratch
Total Exclusive Set shared Filename
351.74MiB 0.00B - /scratch/subvolume1/yes.txt
351.74MiB 0.00B - /scratch/subvolume1/yes-copy.txt
703.48MiB 0.00B - /scratch/subvolume1
351.74MiB 0.00B - /scratch/subvolume2/yes.txt
351.74MiB 0.00B - /scratch/subvolume2/yes-copy.txt
157.06MiB 0.00B - /scratch/subvolume2/no.txt
860.54MiB 0.00B - /scratch/subvolume2
351.74MiB 0.00B - /scratch/subvolume3/yes.txt
351.74MiB 0.00B - /scratch/subvolume3/yes-copy.txt
157.06MiB 0.00B - /scratch/subvolume3/no.txt
860.54MiB 0.00B - /scratch/subvolume3
2.37GiB 0.00B 508.80MiB /scratch
btrfs filesystem df /scratch
Data, single: total=840.00MiB, used=509.06MiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=264.00MiB, used=704.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
df --human /scratch/
Filesystem Size Used Avail Use% Mounted on
/dev/xvdb 8.0G 526M 7.3G 7% /scratch
du --human --summarize /scratch/
2.4G /scratch/
As the example shows, different tools return different stats on how much space is being used depending on whether we’re looking at the block or file layer. For a more detailed explanation, see this link.
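As a quick sanity check, the two headline numbers in the final btrfs filesystem du output can be reproduced from the per-file figures. The sketch below just redoes that arithmetic with awk; the variable names are mine, and the values are copied from the output above.

```shell
# Figures copied from the final `btrfs filesystem du` output (in MiB).
yes_mib=351.74
no_mib=157.06
subvolume1_mib=703.48
subvolume2_mib=860.54
subvolume3_mib=860.54

# Block layer: each distinct piece of data is stored once, however many
# files or snapshots reference it.
shared_mib=$(awk -v a="$yes_mib" -v b="$no_mib" 'BEGIN { printf "%.2f", a + b }')

# File layer: every file in every subvolume is counted at full size.
total_gib=$(awk -v a="$subvolume1_mib" -v b="$subvolume2_mib" -v c="$subvolume3_mib" \
    'BEGIN { printf "%.2f", (a + b + c) / 1024 }')

echo "set shared:       ${shared_mib} MiB"
echo "file-layer total: ${total_gib} GiB"
```

The results match the “Set shared” (508.80MiB) and “Total” (2.37GiB) columns reported for /scratch.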
Finally, here’s a diagram summarizing our actions:
overlayfs
Whereas btrfs is an actual filesystem dealing with block devices, overlayfs is a union filesystem that exists on top of an existing filesystem. A union mount lets you view a set of directories as a single, consolidated directory. In overlayfs, you provide a lower directory and an upper directory, and overlayfs provides a view of these two directories “overlaid” together - this view is accessible as just another directory. In this overlay directory, if a name exists in both the lower and upper directories, the object in the upper directory “wins” and shows up while the object in the lower directory is hidden. Any writes to the overlay directory go to the upper directory behind the scenes.
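The “upper wins” rule can be mimicked in plain shell without mounting anything. The sketch below is only a simulation of the lookup order for regular files: overlay_lookup is a hypothetical helper, and it ignores things the real driver handles, such as merging directories of the same name and whiteouts for deleted files.

```shell
# Hypothetical helper that mimics overlayfs name resolution for one
# upper and one lower directory: print the path that would be visible.
overlay_lookup() {
    upper_dir=$1
    lower_dir=$2
    name=$3
    if [ -e "$upper_dir/$name" ]; then
        printf '%s\n' "$upper_dir/$name"    # upper "wins"
    elif [ -e "$lower_dir/$name" ]; then
        printf '%s\n' "$lower_dir/$name"    # only lower has it
    else
        return 1                            # not visible at all
    fi
}

sim_dir=$(mktemp -d)
mkdir -p "$sim_dir/lower" "$sim_dir/upper"
echo "from lower" > "$sim_dir/lower/lower.txt"
echo "from lower" > "$sim_dir/lower/same.txt"
echo "from upper" > "$sim_dir/upper/same.txt"

# same.txt exists in both directories, lower.txt only in lower.
cat "$(overlay_lookup "$sim_dir/upper" "$sim_dir/lower" same.txt)"
cat "$(overlay_lookup "$sim_dir/upper" "$sim_dir/lower" lower.txt)"
```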
You can also have multiple lower directories where the directories are “stacked” on top of each other, and directories higher up the stack hide objects of the same name lower down the stack. Additionally, a lower directory can also be another overlayfs directory, but keep in mind that this is different from trying to create another overlayfs from lower and upper directories nested within the overlay directory (we’ll come back to this in a future post). All this is probably better explained with an example.
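For reference, multiple lower directories are passed to mount as a colon-separated lowerdir list, with the leftmost entry at the top of the stack. The sketch below only builds and prints such a command as a dry run, since actually mounting needs root. The layer names (layer1 to layer3), the mount point (merged), and the join_layers helper are all made up for illustration.

```shell
# Hypothetical helper: join layer directories, given top to bottom,
# into the colon-separated form that the lowerdir option expects.
join_layers() {
    joined=""
    for layer in "$@"; do
        joined="${joined:+$joined:}$layer"
    done
    printf '%s\n' "$joined"
}

# layer3 is the top of the stack, layer1 the bottom.
lowerdir_opt="lowerdir=$(join_layers layer3 layer2 layer1)"

# Dry run: print the command instead of executing it.
echo "sudo mount -t overlay overlay -o $lowerdir_opt merged"
```

With only lower directories and no upperdir, the resulting overlay is read-only.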
Create a working directory containing lower, upper, workdir, and overlay directories. workdir is used internally by overlayfs:
# Create a working directory, and create the lower, upper, workdir, and
# overlay directories in it in preparation for an overlayfs mount
mkdir -p overlay-test/{lower,upper,workdir,overlay}
mkdir -p overlay-test/lower/samedir
mkdir -p overlay-test/upper/samedir
cd overlay-test
Now create files in the lower and upper directories for three cases:
- a file only in the lower directory (lower.txt)
- a file only in the upper directory (upper.txt)
- a file in both the lower and upper directories (same.txt)
Additionally, create the same three cases within subdirectories of the lower and upper directories to show what happens when directories are “merged” together in overlayfs:
echo "This is lower.txt in the lower dir" | tee lower/lower.txt lower/samedir/lower.txt
echo "This is upper.txt in the upper dir" | tee upper/upper.txt upper/samedir/upper.txt
echo "This is same.txt in the lower dir" | tee lower/same.txt lower/samedir/same.txt
echo "This is same.txt in the upper dir" | tee upper/same.txt upper/samedir/same.txt
This is what the directories look like after we have finished the preparation:
tree
.
├── lower
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ └── same.txt
│ └── same.txt
├── overlay
├── upper
│ ├── samedir
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ └── upper.txt
└── workdir
Now we mount overlayfs on the overlay directory. Notice that we don’t have to issue a “mkfs” command like we did for btrfs, because this isn’t a “real” filesystem dealing with block devices. We’re just using directories on the existing filesystem (I’m running this example on ext4 on my own machine):
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=workdir overlay
Let’s examine the directories now. Notice that the overlay directory is now a combined view of the lower and upper directories!
tree
.
├── lower
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ └── same.txt
│ └── same.txt
├── overlay
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ └── upper.txt
├── upper
│ ├── samedir
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ └── upper.txt
└── workdir
└── work [error opening dir]
same.txt from the upper directory “wins”, and is the one visible in the overlay directory:
cat overlay/same.txt
This is same.txt in the upper dir
Now let’s modify same.txt in the overlay directory, and see what happens to the underlying files in the lower and upper directories:
echo "Modifying the same.txt" > overlay/same.txt
The same.txt in the upper directory was modified:
cat upper/same.txt
Modifying the same.txt
But same.txt in the lower directory was left intact:
cat lower/same.txt
This is same.txt in the lower dir
Now let’s try modifying same.txt in the upper directory and see what happens in the overlay directory:
echo "Modifying the same.txt in upper dir directly" > upper/same.txt
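Success! In the overlay directory, we can now see the changes that were made directly in the upper directory. Note that the kernel documentation describes changing the underlying directories of a mounted overlay as undefined behavior, so this is fine for an experiment but not something to rely on: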
cat overlay/same.txt
Modifying the same.txt in upper dir directly
The next example shows how we can create another overlayfs mount using an existing overlayfs mount as a lower directory. We create a new set of directories except for a lower directory: the overlay mount from the previous example will play that role.
mkdir -p {upper2,workdir2,overlay2}
ls
lower overlay overlay2 upper upper2 workdir workdir2
For good measure, let’s create a new file in the new upper2 directory:
echo "This is upper2.txt in the upper2 dir" > upper2/upper2.txt
Create the second overlayfs mount on overlay2, using the previous overlay directory as the lower directory.
sudo mount -t overlay overlay -o lowerdir=overlay,upperdir=upper2,workdir=workdir2 overlay2
Notice how overlay2 looks like the previous overlay directory combined with the new upper2 directory:
tree
.
├── lower
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ └── same.txt
│ └── same.txt
├── overlay
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ └── upper.txt
├── overlay2
│ ├── lower.txt
│ ├── samedir
│ │ ├── lower.txt
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ ├── upper2.txt
│ └── upper.txt
├── upper
│ ├── samedir
│ │ ├── same.txt
│ │ └── upper.txt
│ ├── same.txt
│ └── upper.txt
├── upper2
│ └── upper2.txt
├── workdir
│ └── work [error opening dir]
└── workdir2
└── work [error opening dir]
However, trying to “nest” an independent overlayfs completely within an existing overlay directory is not possible.
To illustrate this, we first create a new, independent set of lower, upper, workdir, and overlay directories completely within the existing overlay directory:
cd overlay
mkdir -p {innerlower,innerupper,innerworkdir,inneroverlay}
tree
.
├── innerlower
├── inneroverlay
├── innerupper
├── innerworkdir
├── lower.txt
├── samedir
│ ├── lower.txt
│ ├── same.txt
│ └── upper.txt
├── same.txt
└── upper.txt
This doesn’t work:
sudo mount -t overlay overlay -o lowerdir=innerlower,upperdir=innerupper,workdir=innerworkdir inneroverlay
mount: wrong fs type, bad option, bad superblock on overlay,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
Finally, let’s take a look at file sizes to contrast this with btrfs.
df --human
# I've filtered out the output to show only the lines of interest
Filesystem Size Used Avail Use% Mounted on
/dev/sda6 111G 104G 916M 100% /
overlay 111G 104G 916M 100% /home/user/overlay-test/overlay
overlay 111G 104G 916M 100% /home/user/overlay-test/overlay2
cd ..
sudo du --human --summarize *
24K lower
48K overlay
52K overlay2
40K upper
4.0K upper2
8.0K workdir
8.0K workdir2
Let’s generate a large file in lower and see what the file sizes look like then. Again, Ctrl+C after a few seconds results in a large (323MB) file:
yes > lower/yes.txt
Examining file sizes:
df --human
# I've filtered out the output to show only the lines of interest
Filesystem Size Used Avail Use% Mounted on
/dev/sda6 111G 105G 594M 100% /
overlay 111G 105G 594M 100% /home/user/overlay-test/overlay
overlay 111G 105G 594M 100% /home/user/overlay-test/overlay2
sudo du --human --summarize *
323M lower
375M overlay
375M overlay2
40K upper
4.0K upper2
8.0K workdir
8.0K workdir2
From df, we see that /dev/sda6’s available space has decreased by 322MB, roughly the size of the file we just created in the lower directory. However, df also shows the same results for the other two overlayfs mounts we created. This suggests that overlay and overlay2 are pointing to the yes.txt in lower instead of holding copies of it.
Finally, clean up our work directory:
sudo umount overlay2
sudo umount overlay
cd ..
rm -rf overlay-test
In part 2 of this series, we’ll explain how container managers use these file systems to efficiently reuse images.
[1] I use overlayfs to refer to the Linux filesystem driver, and overlay to refer to Concourse’s storage driver that uses overlayfs to manage its containers. While this post focuses on overlayfs only, the distinction will become important for the next posts in this series.