summaryrefslogtreecommitdiffstats
path: root/doc/developer/process-architecture.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel@debian.org>2024-11-09 14:26:35 +0100
committerDaniel Baumann <daniel@debian.org>2024-11-09 14:26:35 +0100
commit47e4d7c791a050deb06e6c0fdfcac94a782a7cb9 (patch)
tree19edcac0f5dbda32bc329fa68773254fb2c488c3 /doc/developer/process-architecture.rst
parentInitial commit. (diff)
downloadfrr-47e4d7c791a050deb06e6c0fdfcac94a782a7cb9.tar.xz
frr-47e4d7c791a050deb06e6c0fdfcac94a782a7cb9.zip
Adding upstream version 10.1.1.upstream/10.1.1
Signed-off-by: Daniel Baumann <daniel@debian.org>
Diffstat (limited to 'doc/developer/process-architecture.rst')
-rw-r--r--doc/developer/process-architecture.rst328
1 files changed, 328 insertions, 0 deletions
diff --git a/doc/developer/process-architecture.rst b/doc/developer/process-architecture.rst
new file mode 100644
index 00000000..85126cab
--- /dev/null
+++ b/doc/developer/process-architecture.rst
@@ -0,0 +1,328 @@
+.. _process-architecture:
+
+Process Architecture
+====================
+
+FRR is a suite of daemons that serve different functions. This document
+describes internal architecture of daemons, focusing their general design
+patterns, and especially how threads are used in the daemons that use them.
+
+Overview
+--------
+The fundamental pattern used in FRR daemons is an `event loop
+<https://en.wikipedia.org/wiki/Event_loop>`_. Some daemons use `kernel threads
+<https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads>`_. In these
+daemons, each kernel thread runs its own event loop. The event loop
+implementation is constructed to be thread safe and to allow threads other than
+its owning thread to schedule events on it. The rest of this document describes
+these two designs in detail.
+
+Terminology
+-----------
+Because this document describes the architecture for kernel threads as well as
+the event system, a digression on terminology is in order here.
+
+Historically Quagga's loop system was viewed as an implementation of userspace
+threading. Because of this design choice, the names for various datastructures
+within the event system are variations on the term "thread". The primary
+datastructure that holds the state of an event loop in this system is called a
+"threadmaster". Events scheduled on the event loop - what would today be called
+an 'event' or 'task' in systems such as libevent - are called "threads" and the
+datastructure for them is ``struct event``. To add to the confusion, these
+"threads" have various types, one of which is "event". To hopefully avoid some
+of this confusion, this document refers to these "threads" as a 'task' except
+where the datastructures are explicitly named. When they are explicitly named,
+they will be formatted ``like this`` to differentiate from the conceptual
+names. When speaking of kernel threads, the term used will be "pthread" since
+FRR's kernel threading implementation uses the POSIX threads API.
+
+.. This should be broken into its document under :ref:`libfrr`
+.. _event-architecture:
+
+Event Architecture
+------------------
+This section presents a brief overview of the event model as currently
+implemented in FRR. This doc should be expanded and broken off into its own
+section. For now it provides basic information necessary to understand the
+interplay between the event system and kernel threads.
+
+The core event system is implemented in :file:`lib/event.c` and
+:file:`lib/frrevent.h`. The primary
+structure is ``struct event_loop``, hereafter referred to as a
+``threadmaster``. A ``threadmaster`` is a global state object, or context, that
+holds all the tasks currently pending execution as well as statistics on tasks
+that have already executed. The event system is driven by adding tasks to this
+data structure and then calling a function to retrieve the next task to
+execute. At initialization, a daemon will typically create one
+``threadmaster``, add a small set of initial tasks, and then run a loop to
+fetch each task and execute it.
+
+These tasks have various types corresponding to their general action. The types
+are given by integer macros in :file:`frrevent.h` and are:
+
+``EVENT_READ``
+ Task which waits for a file descriptor to become ready for reading and then
+ executes.
+
+``EVENT_WRITE``
+ Task which waits for a file descriptor to become ready for writing and then
+ executes.
+
+``EVENT_TIMER``
+ Task which executes after a certain amount of time has passed since it was
+ scheduled.
+
+``EVENT_EVENT``
+ Generic task that executes with high priority and carries an arbitrary
+ integer indicating the event type to its handler. These are commonly used to
+ implement the finite state machines typically found in routing protocols.
+
+``EVENT_READY``
+ Type used internally for tasks on the ready queue.
+
+``EVENT_UNUSED``
+ Type used internally for ``struct event`` objects that aren't being used.
+ The event system pools ``struct event`` to avoid heap allocations; this is
+ the type they have when they're in the pool.
+
+``EVENT_EXECUTE``
+ Just before a task is run its type is changed to this. This is used to show
+ ``X`` as the type in the output of :clicmd:`show event cpu`.
+
+The programmer never has to work with these types explicitly. Each type of task
+is created and queued via special-purpose functions (actually macros, but
+irrelevant for the time being) for the specific type. For example, to add a
+``EVENT_READ`` task, you would call
+
+::
+
+ event_add_read(struct event_loop *master, int (*handler)(struct event *), void *arg, int fd, struct event **ref);
+
+The ``struct event`` is then created and added to the appropriate internal
+datastructure within the ``threadmaster``. Note that the ``READ`` and
+``WRITE`` tasks are independent - a ``READ`` task only tests for
+readability, for example.
+
+The Event Loop
+^^^^^^^^^^^^^^
+To use the event system, after creating a ``threadmaster`` the program adds an
+initial set of tasks. As these tasks execute, they add more tasks that execute
+at some point in the future. This sequence of tasks drives the lifecycle of the
+program. When no more tasks are available, the program dies. Typically at
+startup the first task added is an I/O task for VTYSH as well as any network
+sockets needed for peerings or IPC.
+
+To retrieve the next task to run the program calls ``event_fetch()``.
+``event_fetch()`` internally computes which task to execute next based on
+rudimentary priority logic. Events (type ``EVENT_EVENT``) execute with the
+highest priority, followed by expired timers and finally I/O tasks (type
+``EVENT_READ`` and ``EVENT_WRITE``). When scheduling a task a function and an
+arbitrary argument are provided. The task returned from ``event_fetch()`` is
+then executed with ``event_call()``.
+
+The following diagram illustrates a simplified version of this infrastructure.
+
+.. todo: replace these with SVG
+.. figure:: ../figures/threadmaster-single.png
+ :align: center
+
+ Lifecycle of a program using a single threadmaster.
+
+The series of "task" boxes represents the current ready task queue. The various
+other queues for other types are not shown. The fetch-execute loop is
+illustrated at the bottom.
+
+Mapping the general names used in the figure to specific FRR functions:
+
+- ``task`` is ``struct event *``
+- ``fetch`` is ``event_fetch()``
+- ``exec()`` is ``event_call()``
+- ``cancel()`` is ``event_cancel()``
+- ``schedule()`` is any of the various task-specific ``event_add_*`` functions
+
+Adding tasks is done with various task-specific function-like macros. These
+macros wrap underlying functions in :file:`event.c` to provide additional
+information added at compile time, such as the line number the task was
+scheduled from, that can be accessed at runtime for debugging, logging and
+informational purposes. Each task type has its own specific scheduling function
+that follow the naming convention ``event_add_<type>``; see :file:`frrevent.h`
+for details.
+
+There are some gotchas to keep in mind:
+
+- I/O tasks are keyed off the file descriptor associated with the I/O
+ operation. This means that for any given file descriptor, only one of each
+ type of I/O task (``EVENT_READ`` and ``EVENT_WRITE``) can be scheduled. For
+ example, scheduling two write tasks one after the other will overwrite the
+ first task with the second, resulting in total loss of the first task and
+ difficult bugs.
+
+- Timer tasks are only as accurate as the monotonic clock provided by the
+ underlying operating system.
+
+- Memory management of the arbitrary handler argument passed in the schedule
+ call is the responsibility of the caller.
+
+
+Kernel Thread Architecture
+--------------------------
+Efforts have begun to introduce kernel threads into FRR to improve performance
+and stability. Naturally a kernel thread architecture has long been seen as
+orthogonal to an event-driven architecture, and the two do have significant
+overlap in terms of design choices. Since the event model is tightly integrated
+into FRR, careful thought has been put into how pthreads are introduced, what
+role they fill, and how they will interoperate with the event model.
+
+Design Overview
+^^^^^^^^^^^^^^^
+Each kernel thread behaves as a lightweight process within FRR, sharing the
+same process memory space. On the other hand, the event system is designed to
+run in a single process and drive serial execution of a set of tasks. With this
+consideration, a natural choice is to implement the event system within each
+kernel thread. This allows us to leverage the event-driven execution model with
+the currently existing task and context primitives. In this way the familiar
+execution model of FRR gains the ability to execute tasks simultaneously while
+preserving the existing model for concurrency.
+
+The following figure illustrates the architecture with multiple pthreads, each
+running their own ``threadmaster``-based event loop.
+
+.. todo: replace these with SVG
+.. figure:: ../figures/threadmaster-multiple.png
+ :align: center
+
+ Lifecycle of a program using multiple pthreads, each running their own
+ ``threadmaster``
+
+Each roundrect represents a single pthread running the same event loop
+described under :ref:`event-architecture`. Note the arrow from the ``exec()``
+box on the right to the ``schedule()`` box in the middle pthread. This
+illustrates code running in one pthread scheduling a task onto another
+pthread's threadmaster. A global lock for each ``threadmaster`` is used to
+synchronize these operations. The pthread names are examples.
+
+
+.. This should be broken into its document under :ref:`libfrr`
+.. _kernel-thread-wrapper:
+
+Kernel Thread Wrapper
+^^^^^^^^^^^^^^^^^^^^^
+The basis for the integration of pthreads and the event system is a lightweight
+wrapper for both systems implemented in :file:`lib/frr_pthread.[ch]`. The
+header provides a core datastructure, ``struct frr_pthread``, that encapsulates
+structures from both POSIX threads and :file:`event.c`, :file:`frrevent.h`.
+In particular, this
+datastructure has a pointer to a ``threadmaster`` that runs within the pthread.
+It also has fields for a name as well as start and stop functions that have
+signatures similar to the POSIX arguments for ``pthread_create()``.
+
+Calling ``frr_pthread_new()`` creates and registers a new ``frr_pthread``. The
+returned structure has a pre-initialized ``threadmaster``, and its ``start``
+and ``stop`` functions are initialized to defaults that will run a basic event
+loop with the given threadmaster. Calling ``frr_pthread_run()`` starts the thread
+with the ``start`` function. From there, the model is the same as the regular
+event model. To schedule tasks on a particular pthread, simply use the regular
+:file:`event.c` functions as usual and provide the ``threadmaster`` pointed to
+from the ``frr_pthread``. As part of implementing the wrapper, the
+:file:`event.c` functions were made thread-safe. Consequently, it is safe to
+schedule events on a ``threadmaster`` belonging both to the calling thread as
+well as *any other pthread*. This serves as the basis for inter-thread
+communication and boils down to a slightly more complicated method of message
+passing, where the messages are the regular task events as used in the
+event-driven model. The only difference is thread cancellation, which requires
+calling ``event_cancel_async()`` instead of ``event_cancel()`` to cancel a task
+currently scheduled on a ``threadmaster`` belonging to a different pthread.
+This is necessary to avoid race conditions in the specific case where one
+pthread wants to guarantee that a task on another pthread is cancelled before
+proceeding.
+
+In addition, the existing commands to show statistics and other information for
+tasks within the event driven model have been expanded to handle multiple
+pthreads; running :clicmd:`show event cpu` will display the usual event
+breakdown, but it will do so for each pthread running in the program. For
+example, :ref:`bgpd` runs a dedicated I/O pthread and shows the following
+output for :clicmd:`show event cpu`:
+
+::
+
+ frr# show event cpu
+
+ Event statistics for bgpd:
+
+ Showing statistics for pthread main
+ ------------------------------------
+ CPU (user+system): Real (wall-clock):
+ Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
+ 0 1389.000 10 138900 248000 135549 255349 T subgroup_coalesce_timer
+ 0 0.000 1 0 0 18 18 T bgp_startup_timer_expire
+ 0 850.000 18 47222 222000 47795 233814 T work_queue_run
+ 0 0.000 10 0 0 6 14 T update_subgroup_merge_check_thread_cb
+ 0 0.000 8 0 0 117 160 W zclient_flush_data
+ 2 2.000 1 2000 2000 831 831 R bgp_accept
+ 0 1.000 1 1000 1000 2832 2832 E zclient_connect
+ 1 42082.000 240574 174 37000 178 72810 R vtysh_read
+ 1 152.000 1885 80 2000 96 6292 R zclient_read
+ 0 549346.000 2997298 183 7000 153 20242 E bgp_event
+ 0 2120.000 300 7066 14000 6813 22046 T (bgp_holdtime_timer)
+ 0 0.000 2 0 0 57 59 T update_group_refresh_default_originate_route_map
+ 0 90.000 1 90000 90000 73729 73729 T bgp_route_map_update_timer
+ 0 1417.000 9147 154 48000 132 61998 T bgp_process_packet
+ 300 71807.000 2995200 23 3000 24 11066 T (bgp_connect_timer)
+ 0 1894.000 12713 148 45000 112 33606 T (bgp_generate_updgrp_packets)
+ 0 0.000 1 0 0 105 105 W vtysh_write
+ 0 52.000 599 86 2000 138 6992 T (bgp_start_timer)
+ 1 1.000 8 125 1000 164 593 R vtysh_accept
+ 0 15.000 600 25 2000 15 153 T (bgp_routeadv_timer)
+ 0 11.000 299 36 3000 53 3128 RW bgp_connect_check
+
+
+ Showing statistics for pthread BGP I/O thread
+ ----------------------------------------------
+ CPU (user+system): Real (wall-clock):
+ Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
+ 0 1611.000 9296 173 13000 188 13685 R bgp_process_reads
+ 0 2995.000 11753 254 26000 182 29355 W bgp_process_writes
+
+
+ Showing statistics for pthread BGP Keepalives thread
+ -----------------------------------------------------
+ CPU (user+system): Real (wall-clock):
+ Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
+ No data to display yet.
+
+Attentive readers will notice that there is a third thread, the Keepalives
+thread. This thread is responsible for -- surprise -- generating keepalives for
+peers. However, there are no statistics showing for that thread. Although the
+pthread uses the ``frr_pthread`` wrapper, it opts not to use the embedded
+``threadmaster`` facilities. Instead it replaces the ``start`` and ``stop``
+functions with custom functions. This was done because the ``threadmaster``
+facilities introduce a small but significant amount of overhead relative to the
+pthread's task. In this case since the pthread does not need the event-driven
+model and does not need to receive tasks from other pthreads, it is simpler and
+more efficient to implement it outside of the provided event facilities. The
+point to take away from this example is that while the facilities to make using
+pthreads within FRR easy are already implemented, the wrapper is flexible and
+allows usage of other models while still integrating with the rest of the FRR
+core infrastructure. Starting and stopping this pthread works the same as it
+does for any other ``frr_pthread``; the only difference is that event
+statistics are not collected for it, because there are no events.
+
+Notes on Design and Documentation
+---------------------------------
+Because of the choice to embed the existing event system into each pthread
+within FRR, at this time there is not integrated support for other models of
+pthread use such as divide and conquer. Similarly, there is no explicit support
+for thread pooling or similar higher level constructs. The currently existing
+infrastructure is designed around the concept of long-running worker threads
+responsible for specific jobs within each daemon. This is not to say that
+divide and conquer, thread pooling, etc. could not be implemented in the
+future. However, designs in this direction must be very careful to take into
+account the existing codebase. Introducing kernel threads into programs that
+have been written under the assumption of a single thread of execution must be
+done very carefully to avoid insidious errors and to ensure the program remains
+understandable and maintainable.
+
+In keeping with these goals, future work on kernel threading should be
+extensively documented here and FRR developers should be very careful with
+their design choices, as poor choices tightly integrated can prove to be
+catastrophic for development efforts in the future.