Testing changes to the Linux Kernel CephFS driver
=================================================

This walkthrough explains one (opinionated) way to test the Linux kernel
client against a development cluster. We will try to minimize any assumptions
about pre-existing knowledge of how to do kernel builds or any related best
practices.

.. note:: There are many completely valid ways to do kernel development for
          Ceph. This guide is a walkthrough of the author's own environment.
          You may decide to do things very differently.

Step One: build the kernel
==========================

Clone the kernel:

.. code-block:: bash

    git init linux && cd linux
    git remote add torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    git remote add ceph https://github.com/ceph/ceph-client.git
    git fetch --all && git checkout torvalds/master


Configure the kernel:

.. code-block:: bash

    make defconfig

.. note:: You can alternatively use the `Ceph Kernel QA Config`_ for building the kernel.

We now have a kernel config with reasonable defaults for the architecture we
are building on. The next step is to enable the options that build Ceph and
provide the functionality we need for testing.

.. code-block:: bash

    cat > ~/.ceph.config <<EOF
    CONFIG_CEPH_FS=y
    CONFIG_CEPH_FSCACHE=y
    CONFIG_CEPH_FS_POSIX_ACL=y
    CONFIG_CEPH_FS_SECURITY_LABEL=y
    CONFIG_CEPH_LIB_PRETTYDEBUG=y
    CONFIG_DYNAMIC_DEBUG=y
    CONFIG_DYNAMIC_DEBUG_CORE=y
    CONFIG_FRAME_POINTER=y
    CONFIG_FSCACHE=y
    CONFIG_FSCACHE_STATS=y
    CONFIG_FS_ENCRYPTION=y
    CONFIG_FS_ENCRYPTION_ALGS=y
    CONFIG_KGDB=y
    CONFIG_KGDB_SERIAL_CONSOLE=y
    CONFIG_XFS_FS=y
    EOF

Beyond enabling the Ceph-related options, we also enable some useful debug
options and XFS (as an alternative to ext4, should we need it for the root
file system).

.. note:: It is a good idea to build everything into the kernel image rather than as modules. Otherwise, you would need to run ``make modules_install`` against the VM's root drive.

Now, merge the configs:

.. code-block:: bash

    scripts/kconfig/merge_config.sh .config ~/.ceph.config
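
It is worth confirming that the options from the fragment actually made it
into the merged ``.config``, since options with unmet dependencies can end up
dropped. A quick check:

.. code-block:: bash

    # the Ceph and debug options we asked for should all show up as =y
    grep -E '^CONFIG_(CEPH|FSCACHE|FS_ENCRYPTION|DYNAMIC_DEBUG)' .config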


Finally, build the kernel:

.. code-block:: bash

    make -j


.. note:: This document does not discuss how to install the utilities your
          distribution needs to actually build the kernel, such as gcc. Please use
          your search engine of choice to learn how to do that.


Step Two: create a VM
=====================

A virtual machine is a good choice for testing the kernel client for a few reasons:

* You can more easily monitor and configure networking for the VM.
* You can very rapidly test a change to the kernel (build -> mount in less than 10 seconds).
* A fault in the kernel won't crash your machine.
* You have a suite of tools available for analysis on the running kernel.

The main decision for you to make is which Linux distribution you want to use.
This document uses Arch Linux due to the author's familiarity. We also use LVM
to create a volume. You may use partitions or whatever mechanism you like to
create a block device. In general, this block device will be reused repeatedly
in testing. You may want to use snapshots so that a VM that somehow corrupts
its root disk does not force you to start over (a sketch follows the next code
block).


.. code-block:: bash

    # create a volume
    VOLUME_GROUP=foo
    sudo lvcreate -L 256G "$VOLUME_GROUP" -n $(whoami)-vm-0
    DEV="/dev/${VOLUME_GROUP}/$(whoami)-vm-0"
    sudo mkfs.xfs "$DEV"
    sudo mount "$DEV" /mnt
    sudo pacstrap /mnt base base-devel vim less jq
    sudo arch-chroot /mnt
    # # delete root's password for ease of login
    # passwd -d root
    # mkdir -p /root/.ssh && echo "$YOUR_SSH_KEY_PUBKEY" >> /root/.ssh/authorized_keys
    # exit
    sudo umount /mnt
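
If you want the snapshot safety net mentioned above, a minimal sketch with LVM
(the snapshot size and name are illustrative) is to snapshot the freshly
installed root volume and merge the snapshot back whenever a VM trashes it:

.. code-block:: bash

    # take a snapshot of the pristine root volume
    sudo lvcreate -s -L 16G -n $(whoami)-vm-0-pristine "$DEV"
    # later, with the VM shut down, roll the root volume back to the snapshot
    # (the snapshot is consumed by the merge; re-create it afterwards)
    sudo lvconvert --merge "/dev/${VOLUME_GROUP}/$(whoami)-vm-0-pristine"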

Once that's done, we should be able to run a VM:


.. code-block:: bash

    qemu-system-x86_64 -enable-kvm -kernel $(pwd)/arch/x86/boot/bzImage -drive file="$DEV",if=virtio,format=raw -append 'root=/dev/vda rw'

You should see output like:

::

    VNC server running on ::1:5900

You could view that console using:


.. code-block:: bash

    vncviewer 127.0.0.1:5900

Congratulations, you have a VM running the kernel that you just built.
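
If you would rather have the kernel console in your terminal than in a VNC
session, one alternative (a sketch using the same kernel and drive as above)
is qemu's ``-nographic`` mode with the kernel console pointed at the serial
port:

.. code-block:: bash

    qemu-system-x86_64 -enable-kvm -nographic \
        -kernel $(pwd)/arch/x86/boot/bzImage \
        -drive file="$DEV",if=virtio,format=raw \
        -append 'root=/dev/vda rw console=ttyS0'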


Step Three: networking the VM
=============================

This is the "hard part" and requires the most customization depending on what
you want to do. For this author, I currently have a development setup like:


::

     sepian netns
    ______________
   |              |
   | kernel VM    |              sepia-bounce VM      vossi04.front.sepia.ceph.com
   |  -------  |  |                  ------                    -------
   |  |     |  |  | 192.168.20.1     |    |                    |     |
   |  |     |--|--|- <- wireguard -> |    |  <-- sepia vpn ->  |     |
   |  |_____|  |  |     192.168.20.2 |____|                    |_____|
   |          br0 |
   |______________|


The sepia-bounce VM is used as a bounce box to the sepia lab. It can proxy ssh
connections, route any sepia-bound traffic, or serve as a DNS proxy. The use of
a sepia-bounce VM is optional but can be useful, especially if you want to
create numerous kernel VMs for testing.

I like to use the vossi04 `developer playground`_ to build Ceph and set up a
vstart cluster. It has enough resources to make building Ceph fast (roughly a
5-minute cold build) and enough local disk to run a decent vstart cluster.
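
Building Ceph and starting the vstart cluster on vossi04 is outside the scope
of this walkthrough; a rough sketch (see the developer guide for the canonical
steps) looks something like:

.. code-block:: bash

    # on vossi04.front.sepia.ceph.com
    git clone https://github.com/ceph/ceph.git && cd ceph
    ./install-deps.sh
    ./do_cmake.sh
    cd build
    ninja vstart                    # build just enough to run vstart
    ../src/vstart.sh --new --debug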

To avoid overcomplicating this document with the details of the sepia-bounce
VM, here are the main configurations used for the purpose of testing the
kernel:

- set up a wireguard tunnel between the machine creating kernel VMs and the sepia-bounce VM (a configuration sketch follows this list)
- use ``systemd-resolved`` as a DNS resolver, listening on 192.168.20.2 (instead of just localhost)
- connect to the sepia `VPN`_ and use the `systemd resolved update script`_ to configure ``systemd-resolved`` to use the DNS servers acquired via DHCP from the sepia VPN
- configure ``firewalld`` to allow wireguard traffic and to masquerade and forward traffic to the sepia VPN
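
A minimal ``wg setconf``-style configuration for the kernel-VM host's side of
that tunnel might look like the following sketch (the keys, endpoint, and port
are placeholders; your peer configuration will differ):

::

    # /etc/wireguard/wg-sepian.conf (loaded later by the netns unit via wg setconf)
    [Interface]
    PrivateKey = <host-private-key>

    [Peer]
    PublicKey = <sepia-bounce-public-key>
    Endpoint = <sepia-bounce-address>:51820
    # allow everything so the netns can use the tunnel as its default route
    AllowedIPs = 0.0.0.0/0
    PersistentKeepalive = 25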

The next task is to connect the kernel VM to the sepia-bounce VM. A network
namespace is useful here for isolating the traffic and routing rules of the
VMs. I orchestrate this using a custom systemd one-shot unit that looks like:

::

    # create the net namespace
    ExecStart=/usr/bin/ip netns add sepian
    # bring lo up
    ExecStart=/usr/bin/ip netns exec sepian ip link set dev lo up
    # setup wireguard to sepia-bounce
    ExecStart=/usr/bin/ip link add wg-sepian type wireguard
    ExecStart=/usr/bin/wg setconf wg-sepian /etc/wireguard/wg-sepian.conf
    # move the wireguard interface to the sepian netns
    ExecStart=/usr/bin/ip link set wg-sepian netns sepian
    # configure the static ip and bring it up
    ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.20.1/24 dev wg-sepian
    ExecStart=/usr/bin/ip netns exec sepian ip link set wg-sepian up
    # logging info
    ExecStart=/usr/bin/ip netns exec sepian ip addr
    ExecStart=/usr/bin/ip netns exec sepian ip route
    # make wireguard the default route
    ExecStart=/usr/bin/ip netns exec sepian ip route add default via 192.168.20.2 dev wg-sepian
    # more logging
    ExecStart=/usr/bin/ip netns exec sepian ip route
    # add a bridge interface for VMs
    ExecStart=/usr/bin/ip netns exec sepian ip link add name br0 type bridge
    # configure the addresses and bring it up
    ExecStart=/usr/bin/ip netns exec sepian ip addr add 192.168.0.1/24 dev br0
    ExecStart=/usr/bin/ip netns exec sepian ip link set br0 up
    # masquerade/forward traffic to sepia-bounce
    ExecStart=/usr/bin/ip netns exec sepian iptables -t nat -A POSTROUTING -o wg-sepian -j MASQUERADE
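
The snippet above lists only the ``ExecStart=`` lines; the rest of the unit is
ordinary boilerplate. A minimal sketch of the surrounding
``sepian-netns.service`` file (the unit name is the one the dhcpd service
below depends on) might look like:

::

    [Unit]
    Description=sepian network namespace for kernel testing VMs
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=oneshot
    # stay "active" after the ExecStart commands finish so other units can
    # Require/After this one
    RemainAfterExit=yes
    # ... ExecStart= lines from above ...
    # tear the namespace down on stop
    ExecStop=/usr/bin/ip netns delete sepian

    [Install]
    WantedBy=multi-user.target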


When using the network namespace, we will run commands via ``ip netns exec``.
It has a handy feature: files under ``/etc/netns/<name>/`` are automatically
bind mounted over their ``/etc`` counterparts for commands run in the
namespace:

::

    # cat /etc/netns/sepian/resolv.conf
    nameserver 192.168.20.2

That file will configure the libc name resolution stack to route DNS requests
for applications to the ``systemd-resolved`` daemon running on sepia-bounce.
Consequently, any application running in that netns will be able to resolve
sepia hostnames:

::

    $ sudo ip netns exec sepian host vossi04.front.sepia.ceph.com
    vossi04.front.sepia.ceph.com has address 172.21.10.4


Okay, great. We have a network namespace that forwards traffic to the sepia
VPN. The next step is to connect virtual machines running our kernel to the
bridge we have configured. The straightforward way to do that is to create a
"tap" device which connects to the bridge:

.. code-block:: bash

     sudo ip netns exec sepian qemu-system-x86_64 \
         -enable-kvm \
         -kernel $(pwd)/arch/x86/boot/bzImage \
         -drive file="$DEV",if=virtio,format=raw \
         -netdev tap,id=net0,ifname=tap0,script="$HOME/bin/qemu-br0",downscript=no \
         -device virtio-net-pci,netdev=net0 \
         -append 'root=/dev/vda rw'

The relevant new bits here are (a) executing the VM in the netns we have
constructed; (b) a ``-netdev`` option configuring a tap device; and (c) a
virtio network card for the VM. There is also a script, ``$HOME/bin/qemu-br0``,
run by qemu to configure the tap device it creates for the VM:

::

    #!/bin/bash
    # qemu passes the name of the tap device it created as the first argument
    tap=$1
    # plug the tap device into the bridge and bring it up
    ip link set "$tap" master br0
    ip link set dev "$tap" up

That simply plugs the new tap device into the bridge.

This is all well and good, but we are still missing one crucial piece: what is
the IP address of the VM? There are two options:

1. Configure a static IP. This requires modifying the network configuration on
   the VM's root device.
2. Use DHCP, and configure the VM's root device to always configure its
   ethernet device via DHCP.

The second option is more complicated to set up, since you must now run a DHCP
server, but it provides the greatest flexibility for adding more VMs as needed
when testing.
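
For the VM side of the second option, the root image needs a DHCP client
enabled on its ethernet device. One way to do that, assuming the Arch image
from Step Two uses ``systemd-networkd`` (which matches the in-VM output shown
later), is to drop in a ``.network`` file and enable the services inside the
chroot:

.. code-block:: bash

    # run inside the arch-chroot from Step Two
    cat > /etc/systemd/network/20-wired.network <<EOF
    [Match]
    Name=en*

    [Network]
    DHCP=yes
    EOF
    systemctl enable systemd-networkd systemd-resolved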

The modified (or "hacked") standard dhcpd systemd service looks like:

::

    # cat sepian-dhcpd.service
    [Unit]
    Description=IPv4 DHCP server
    After=network.target network-online.target sepian-netns.service
    Wants=network-online.target
    Requires=sepian-netns.service
    
    [Service]
    ExecStartPre=/usr/bin/touch /tmp/dhcpd.leases
    ExecStartPre=/usr/bin/cat /etc/netns/sepian/dhcpd.conf
    ExecStart=/usr/bin/dhcpd -f -4 -q -cf /etc/netns/sepian/dhcpd.conf -lf /tmp/dhcpd.leases
    NetworkNamespacePath=/var/run/netns/sepian
    RuntimeDirectory=dhcpd4
    User=dhcp
    AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_NET_RAW
    ProtectSystem=full
    ProtectHome=on
    KillSignal=SIGINT
    # We pull in network-online.target for a configured network connection.
    # However this is not guaranteed to be the network connection our
    # networks are configured for. So try to restart on failure with a delay
    # of two seconds. Rate limiting kicks in after 12 seconds.
    RestartSec=2s
    Restart=on-failure
    StartLimitInterval=12s
    
    [Install]
    WantedBy=multi-user.target

Similarly, the referenced ``dhcpd.conf``:

::

    # cat /etc/netns/sepian/dhcpd.conf
    option domain-name-servers 192.168.20.2;
    option subnet-mask 255.255.255.0;
    option routers 192.168.0.1;
    subnet 192.168.0.0 netmask 255.255.255.0 {
        range 192.168.0.100 192.168.0.199;
    }

Importantly, this tells the VM to route traffic via 192.168.0.1 (the IP of the
bridge in the netns) and that DNS is provided by 192.168.20.2
(``systemd-resolved`` on the sepia-bounce VM).

In the VM, the networking looks like:

::

	[root@archlinux ~]# ip link
	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    	link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
	3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    	link/sit 0.0.0.0 brd 0.0.0.0
	[root@archlinux ~]# ip addr
	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    	inet 127.0.0.1/8 scope host lo
       	valid_lft forever preferred_lft forever
    	inet6 ::1/128 scope host noprefixroute 
       	valid_lft forever preferred_lft forever
	2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    	link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    	inet 192.168.0.100/24 metric 1024 brd 192.168.0.255 scope global dynamic enp0s3
       	valid_lft 28435sec preferred_lft 28435sec
    	inet6 fe80::5054:ff:fe12:3456/64 scope link proto kernel_ll 
       	valid_lft forever preferred_lft forever
	3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    	link/sit 0.0.0.0 brd 0.0.0.0
	[root@archlinux ~]# systemd-resolve --status
	Global
           	Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
    	resolv.conf mode: stub
	Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net 8.8.8.8#dns.google 2606:4700:4700::1111#cloudflare-dns.com 2620:fe::9#dns.quad9.net 2001:4860:4860::8888#dns.google
	
	Link 2 (enp0s3)
    	Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
         	Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
	Current DNS Server: 192.168.20.2
       	DNS Servers: 192.168.20.2
	
	Link 3 (sit0)
    	Current Scopes: none
         	Protocols: -DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
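
To confirm that the whole path (VM -> bridge -> wireguard -> sepia-bounce ->
sepia DNS) works from inside the VM, resolving a sepia hostname is a quick
sanity check (``resolvectl`` is part of systemd):

.. code-block:: bash

    # run inside the VM
    resolvectl query vossi04.front.sepia.ceph.com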


Finally, some other networking configurations to consider:

* Run the VM on your machine with full access to the host networking stack. If you have the sepia VPN configured there, this will probably work without much extra configuration.
* Run the VM in a netns as above, but also set up the sepia VPN in the same netns. This avoids the need for a sepia-bounce VM, though you still need to configure routing between the bridge and the sepia VPN.
* Run the VM in a netns as above, but only use a local vstart cluster (possibly in another VM) in the same netns.


Step Four: mounting a CephFS file system in your VM
====================================================

This guide uses a vstart cluster on a machine in the sepia lab. Because the mon
addresses change with every new vstart cluster, any static configuration we set
up for mounting CephFS via the kernel driver would quickly become stale. So, we
create a script that fetches the configuration from our vstart cluster just
before mounting:

.. code-block:: bash

    #!/bin/bash
    # kmount.sh -- mount a vstart Ceph cluster on a remote machine
    
    # the cephx client credential, vstart creates "client.fs" by default
    NAME=fs
    # static fs name, vstart creates an "a" file system by default
    FS=a
    # where to mount on the VM
    MOUNTPOINT=/mnt
    # cephfs mount point (root by default)
    CEPHFS_MOUNTPOINT=/
    
    function run {
        printf '%s\n' "$*" >&2
        "$@"
    }
    
    function mssh {
        run ssh vossi04.front.sepia.ceph.com "cd ceph/build && (source vstart_environment.sh; $1)"
    }
    
    # create the minimum config (including mon addresses) and store it in the
    # VM's ceph.conf. This is not used for mounting; we're storing it for
    # potential use with `ceph` commands.
    mkdir -p /etc/ceph
    mssh "ceph config generate-minimal-conf" > /etc/ceph/ceph.conf
    # get the vstart cluster's fsid
    FSID=$(mssh "ceph fsid")
    # get the auth key associated with client.fs
    KEY=$(mssh "ceph auth get-key client.$NAME")
    # dump the v2 mon addresses and format for the -o mon_addr mount option
    MONS=$(mssh "ceph mon dump --format=json" | jq -r '.mons[] | .public_addrs.addrvec[] | select(.type == "v2") | .addr' | paste -s -d/)
    
    # turn on kernel debugging (and any other debugging you'd like)
    echo "module ceph +p" | tee /sys/kernel/debug/dynamic_debug/control
    # do the mount! we use the new device syntax for this mount
    run mount -t ceph "${NAME}@${FSID}.${FS}=${CEPHFS_MOUNTPOINT}" -o "mon_addr=${MONS},ms_mode=crc,name=${NAME},secret=${KEY},norequire_active_mds,noshare" "$MOUNTPOINT"

That would be run like:

.. code-block:: bash

    $ sudo ip netns exec sepian ssh root@192.168.0.100 ./kmount.sh
    ...
    mount -t ceph fs@c9653bca-110b-4f70-9f84-5a195b205e9a.a=/ -o mon_addr=172.21.10.4:40762/172.21.10.4:40764/172.21.10.4:40766,ms_mode=crc,name=fs,secret=AQD0jgln43pBCxAA7cJlZ4Px7J0UmiK4A4j3rA==,norequire_active_mds,noshare /mnt
    $ sudo ip netns exec sepian ssh root@192.168.0.100 df -h /mnt
    Filesystem                                   Size  Used Avail Use% Mounted on
    fs@c9653bca-110b-4f70-9f84-5a195b205e9a.a=/  169G     0  169G   0% /mnt


If you run into difficulties, likely causes include:

* The firewall on the node running the vstart cluster is blocking your connections.
* A misconfiguration somewhere in your networking stack.
* An incorrect configuration for the mount.
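
A couple of quick checks from inside the VM can narrow things down (the
address and port are taken from the example mount output above; substitute
your own, and note the port check assumes a netcat that supports ``-z``):

.. code-block:: bash

    # can the VM resolve and reach the vstart node at all?
    ping -c 1 vossi04.front.sepia.ceph.com
    # is the v2 mon port reachable?
    nc -zv 172.21.10.4 40762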


Step Five: testing kernel changes in teuthology
================================================

There are three static branches in the `ceph kernel git repository`_ managed by the Ceph team:

* `for-linus <https://github.com/ceph/ceph-client/tree/for-linus>`_: A branch managed by the primary Ceph maintainer to share changes with Linus Torvalds (upstream). Do not push to this branch.
* `master <https://github.com/ceph/ceph-client/tree/master>`_: A staging ground for patches planned to be sent to Linus. Do not push to this branch.
* `testing <https://github.com/ceph/ceph-client/tree/testing>`_: A staging ground for miscellaneous patches that need wider QA testing (via nightlies or regular Ceph QA runs). Push patches that you believe are nearly ready for upstream acceptance.

You may also push a ``wip-$feature`` branch to the ``ceph-client.git``
repository, which will then be built by Jenkins. You can view the results of
the build in `Shaman <https://shaman.ceph.com/builds/kernel/>`_.
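
Pushing the branch is a normal git push to the ``ceph`` remote added in Step
One (this assumes you have been granted push access to ``ceph-client.git``):

.. code-block:: bash

    # push the current HEAD as a wip branch to trigger a Jenkins build
    git push ceph HEAD:refs/heads/wip-$feature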

Once a kernel branch is built, you can test it via the ``fs`` CephFS QA suite:

.. code-block:: bash

    $ teuthology-suite ... --suite fs --kernel wip-$feature --filter k-testing


The ``k-testing`` filter selects the fragment which normally sets the
``testing`` branch of the kernel for routine QA. That is, the ``fs`` suite
regularly runs tests against whatever is in the ``testing`` branch of the
kernel. We override that choice of kernel branch via the ``--kernel
wip-$feature`` switch.

.. note:: Without filtering for ``k-testing``, the ``fs`` suite will also run jobs using ceph-fuse or the stock kernel, libcephfs tests, and other tests that may not be of interest when evaluating changes to the kernel.

The actual override is controlled using Lua merge scripts in the
``k-testing.yaml`` fragment. See that file for more details.


.. _VPN: https://wiki.sepia.ceph.com/doku.php?id=vpnaccess
.. _systemd resolved update script: https://wiki.archlinux.org/title/Systemd-resolved
.. _Ceph Kernel QA Config: https://github.com/ceph/ceph-build/tree/899d0848a0f487f7e4cee773556aaf9529b8db26/kernel/build
.. _developer playground: https://wiki.sepia.ceph.com/doku.php?id=devplayground#developer_playgrounds
.. _ceph kernel git repository: https://github.com/ceph/ceph-client