Compare commits


1 commit

Author SHA1 Message Date
Luis a3c533b58f improved dense number paragraphs 2026-04-28 09:29:56 +02:00
+299 -306
@@ -6,23 +6,23 @@
This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded.
Section~\ref{sec:baseline} establishes overhead under ideal
conditions; subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behaviour. No single metric
captures VPN performance. The rankings shift depending on what is
measured: throughput, latency, retransmit behaviour, or
application-level performance.

\section{Baseline performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar is a direct host-to-host connection with no VPN
in the path; it is the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
@@ -30,35 +30,29 @@ single retransmit across an entire 30-second test. Mycelium sits at
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
bare-metal figure.

Throughout this chapter, ``Headscale'' labels the scenario in
which the Tailscale client (\texttt{tailscaled}, built on
\texttt{wireguard-go}) connects to a self-hosted Headscale
control server. The data plane is therefore Tailscale's, not
Headscale's; the Headscale binary itself is only a control-plane
server. Section~\ref{sec:tailscale_degraded} returns to which
Tailscale code paths the test rig actually exercises.

\subsection{Test execution overview}

The full baseline suite ran in just over four hours across all ten
VPNs and the internal reference. Benchmark execution consumed
63\,\% of that time; VPN installation and deployment accounted for
19\,\%; the test rig spent 9\,\% waiting for tunnels to come up
after restarts. Service restarts and traffic-control (tc)
stabilization took the remainder. Figure~\ref{fig:test_duration}
breaks the time down per VPN.

Most VPNs completed every benchmark without issue. Four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
benchmark after six retries; Hyprspace and Mycelium failed the
UDP iPerf3 test with a 120-second timeout. Their individual
success rate is 85.7\,\%; every other VPN passed the full suite
(Figure~\ref{fig:success_rate}).
\begin{figure}[H]
@@ -88,7 +82,7 @@ with a 120-second timeout. Their individual success rate is
\label{fig:test_overview}
\end{figure}
\subsection{TCP throughput}
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
@@ -144,14 +138,12 @@ Raw throughput alone is incomplete. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%); Hyprspace is
the clear outlier at 0.49\,\%; the remaining VPNs spread between
these poles. The interesting case is ZeroTier: it sustains
814\,Mbps despite a 0.10\,\% retransmit rate by riding TCP
congestion-control recovery, where WireGuard delivers comparable
throughput with effectively zero loss.
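The normalization behind that figure can be made concrete with a small sketch. The helper and its inputs are illustrative, not the rig's actual analysis code: the raw retransmit count is divided by an estimated segment count derived from bytes sent and the negotiated segment size.

```go
package main

import "fmt"

// retransmitRate normalizes a raw retransmit count by the estimated
// number of segments sent, so VPNs negotiating different segment
// sizes (1228--32731 bytes here) can be compared directly.
// The returned value is a percentage.
func retransmitRate(retransmits, bytesSent, segmentSize int64) float64 {
	if bytesSent == 0 || segmentSize == 0 {
		return 0
	}
	segments := float64(bytesSent) / float64(segmentSize)
	return float64(retransmits) / segments * 100
}

func main() {
	// Illustrative numbers only: ~30 s at ~800 Mbps is ~3 GB.
	fmt.Printf("%.2f%%\n", retransmitRate(4965, 3_000_000_000, 1420)) // prints 0.24%
}
```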
\begin{figure}[H]
\centering
@@ -170,7 +162,7 @@ Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.3\,MB window, the largest of
any VPN. At
first glance this suggests a clean inverse correlation between
retransmit rate and congestion window size, but the picture is
@@ -204,17 +196,13 @@ in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.

Variability across links also differs substantially. WireGuard's
three link directions cluster within a 60\,Mbps band; Mycelium's
span a 3:1 ratio (122--379\,Mbps), but this is per-link
path-selection asymmetry rather than run-to-run noise
(Section~\ref{sec:mycelium_routing}). A VPN whose throughput
varies that widely across links is harder to capacity-plan around
than one that delivers a consistent figure on every direction.
\begin{figure}[H]
\centering
@@ -292,16 +280,13 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far
removed from the rest that Section~\ref{sec:mycelium_routing} gives
it a dedicated analysis.

The spike-ratio column in Table~\ref{tab:latency_baseline} exposes
two outliers among the low-latency VPNs. VpnCloud and ZeroTier
post the highest spike ratios (2.8$\times$ and 2.3$\times$) and
the highest jitter, while Tinc and Headscale stay close to
bare-metal stability. The spikes in VpnCloud and ZeroTier are
consistent with periodic control-plane events such as key
rotation or peer heartbeats that briefly stall the data path.
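As a sketch of how the table's columns can be derived from raw RTT samples. The sample series and the jitter definition used here (mean absolute deviation) are assumptions for illustration; only the spike-ratio formula (max over average) mirrors the text.

```go
package main

import (
	"fmt"
	"math"
)

// summarize reduces a series of RTT samples to the columns discussed
// above: average, spike ratio (max/avg), and jitter, taken here as
// the mean absolute deviation from the average.
func summarize(rttMs []float64) (avg, spikeRatio, jitter float64) {
	var sum, max float64
	for _, v := range rttMs {
		sum += v
		if v > max {
			max = v
		}
	}
	avg = sum / float64(len(rttMs))
	var dev float64
	for _, v := range rttMs {
		dev += math.Abs(v - avg)
	}
	return avg, max / avg, dev / float64(len(rttMs))
}

func main() {
	// A mostly flat series with one control-plane-style spike.
	avg, ratio, jit := summarize([]float64{1.1, 1.1, 1.2, 1.1, 3.0, 1.1})
	fmt.Printf("avg %.2f ms, spike %.1fx, jitter %.2f ms\n", avg, ratio, jit)
}
```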
\begin{figure}[H]
\centering
@@ -353,7 +338,7 @@ spot.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
\caption{Latency vs.\ throughput at baseline. Each point is
one VPN. The quadrants reveal different bottleneck types:
VpnCloud (low latency, moderate throughput), Tinc (low latency,
low throughput, CPU-bound), Mycelium (high latency, low
@@ -361,7 +346,7 @@ spot.
\label{fig:latency_throughput}
\end{figure}
\subsection{Parallel TCP scaling}
The single-stream benchmark tests one link direction at a time.
The
@@ -383,7 +368,7 @@ Table~\ref{tab:parallel_scaling} lists the results.
\centering
\caption{Parallel TCP scaling at baseline. Scaling factor is the
ratio of parallel to single-stream throughput. Internal's
1.50$\times$ is the expected scaling on this hardware.}
\label{tab:parallel_scaling}
\begin{tabular}{lrrr}
\hline
@@ -501,7 +486,7 @@ parallel load.
\label{fig:parallel_tcp}
\end{figure}
\subsection{UDP stress test}
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
@@ -568,12 +553,13 @@ complete the UDP test at all; both timed out after 120 seconds.
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.

The effective UDP payload size, reported in the
\texttt{blksize\_bytes} field, differs sharply across
implementations. Yggdrasil sends 32\,731-byte jumbo segments;
ZeroTier negotiates 2\,728 bytes; the remaining VPNs cluster
between 1\,208 (Headscale, the smallest) and 1\,448 (Internal).
These differences affect fragmentation behaviour under workloads
that send large datagrams.
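A back-of-the-envelope sketch of that fragmentation effect, using the payload sizes above. The 64\,000-byte datagram is an arbitrary illustrative size, not one from the benchmark.

```go
package main

import "fmt"

// fragments returns how many tunnel-level packets a datagram of the
// given size needs when the VPN's effective payload per packet is
// payloadBytes. Purely illustrative arithmetic.
func fragments(datagramBytes, payloadBytes int) int {
	return (datagramBytes + payloadBytes - 1) / payloadBytes // ceiling division
}

func main() {
	const datagram = 64_000 // a large application datagram
	for _, vpn := range []struct {
		name    string
		payload int
	}{
		{"Yggdrasil", 32731},
		{"Internal", 1448},
		{"Headscale", 1208},
	} {
		fmt.Printf("%-10s %d fragments\n", vpn.name, fragments(datagram, vpn.payload))
	}
}
```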
%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
@@ -611,7 +597,7 @@ workloads, particularly for protocols that send large datagrams.
% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed
\subsection{Real-world workloads}
Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
@@ -620,7 +606,7 @@ cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.
\paragraph{Nix binary cache downloads.}
This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
@@ -705,7 +691,7 @@ Nebula sits just below at 99.8\%, and Hyprspace's headline figure
of 100\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run, including
Internal, are most likely encoder warm-up artefacts rather than
tunnel overhead, though this has not been verified directly.
% TODO: The packet-drop distribution statistics (288 mean, % TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure. % 10\% median, IQR 255--330) are not shown in any figure.
@@ -756,7 +742,7 @@ overwhelm FEC entirely.
\label{fig:rist_quality}
\end{figure}
\subsection{Operational resilience}
Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
@@ -767,7 +753,7 @@ reboot matters as much as its peak throughput.
Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.07,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
@@ -781,8 +767,8 @@ Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 97.3 and
94.8~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
@@ -816,14 +802,14 @@ can route traffic.
\label{fig:reboot_reconnection}
\end{figure}
\subsection{Pathological cases}
\label{sec:pathological}

Hyprspace, Mycelium, and Tinc each show a pathology that the
aggregate tables flatten. The following subsections diagnose each
in turn.
\paragraph{Hyprspace: buffer bloat.}
\label{sec:hyprspace_bloat}
% TODO: The under-load latency of 2,800 ms is not shown in any plot % TODO: The under-load latency of 2,800 ms is not shown in any plot
@@ -841,7 +827,7 @@ The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
200\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays % explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally % causing sender-side timer undercounting) is a hypothesis. Normally
@@ -853,12 +839,9 @@ while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.

Outside sustained load, Hyprspace looks fine. It has the fastest
reboot reconnection in the dataset (8.7\,s) and delivers 100\,\%
video quality between burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.
@@ -910,33 +893,35 @@ stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.

If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack. Hyprspace uses \texttt{libp2p},
a peer-to-peer networking library, and its \texttt{yamux} stream
multiplexer. Yamux runs many logical streams over a single
underlying connection and polices each one with a credit-based
flow-control window. The most plausible source of the bloat is
this libp2p/yamux layer, through which raw IP packets are
funnelled.

Hyprspace's TUN-read loop dispatches each outbound packet on its
own goroutine, and every such goroutine ends up in
\texttt{node/node.go}'s \texttt{sendPacket}. This function keeps
exactly one libp2p stream per destination peer in
\texttt{activeStreams}, guarded by a single per-peer
\texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}).
Concurrent application TCP flows to the same Hyprspace neighbour
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once, collapses
to a single send pipeline at this layer. Each goroutine waiting
for the lock pins its own 1420-byte packet buffer, and the
underlying yamux session adds a per-stream flow-control window on
top.

None of this is visible to the kernel TCP sender that produced
the inner segments. The kernel sees only that the TUN write
returned, so it keeps growing its congestion window while the
libp2p layer falls further behind. The geometry is the standard
shape of buffer bloat: a fast producer (kernel TCP) sitting
upstream of a slow, serialised consumer (the single yamux stream
per peer), with no flow-control signal coupling the two.
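The fast-producer/slow-consumer geometry can be reproduced in a deliberately simplified model. This is not Hyprspace's code: the peer type, the counter, and the stall mechanism are invented for the illustration; only the one-mutex-per-peer shape mirrors the diagnosis.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

const packetSize = 1420 // bytes pinned per in-flight packet

// peer mirrors the shape of the bottleneck: one stream to the
// destination, guarded by one mutex.
type peer struct{ mu sync.Mutex }

// pinnedUnderStall launches n producer goroutines (one per outbound
// packet, as in a TUN-read loop) against a peer whose lock is held,
// i.e. whose consumer is stalled, and reports how many packet
// buffers end up pinned simultaneously.
func pinnedUnderStall(n int) (buffers, bytes int64) {
	var (
		p       peer
		waiting atomic.Int64
		wg      sync.WaitGroup
	)
	p.mu.Lock() // stalled consumer: nothing drains the stream
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, packetSize) // pinned while we wait
			waiting.Add(1)
			p.mu.Lock() // every packet to this peer serialises here
			_ = buf     // a real sender would write buf to the stream
			p.mu.Unlock()
			waiting.Add(-1)
		}()
	}
	for int(waiting.Load()) < n { // wait until all producers are parked
		runtime.Gosched()
	}
	buffers = waiting.Load()
	bytes = buffers * packetSize
	p.mu.Unlock() // release the consumer: the backlog drains serially
	wg.Wait()
	return buffers, bytes
}

func main() {
	b, by := pinnedUnderStall(100)
	fmt.Printf("pinned: %d buffers, %d bytes\n", b, by) // 100 buffers, 142000 bytes
}
```

Every producer returns from its "send" only after its turn at the lock, exactly the coupling the kernel TCP sender never sees.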
\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
@@ -987,9 +972,9 @@ reports a kernel-measured RTT that is independent of
ICMP ping. For the luna$\rightarrow$lom stream, this
TCP~RTT starts at 51.6\,ms and climbs to a mean of
144\,ms over the 30-second run, with
757~retransmits. The link was clearly overlay-routed during
the throughput test, even though ping had found a direct path
eight minutes earlier. For
yuki$\rightarrow$luna the reverse happened: the TCP
stream measured only 12--22\,ms, and its bidirectional
return path recorded 1.0\,ms, a direct LAN connection
@@ -1057,17 +1042,14 @@ inversion.
\end{figure}
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms vs.\ 16.0 ms) numbers are from qperf but not shown in
% any figure. Add a connection-setup latency table or plot.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms
(vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms against Internal's 16.0\,ms (a
2.95$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by
short-lived connections accumulate it rapidly. Bulk
downloads, by
@@ -1094,13 +1076,13 @@ delay.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: userspace processing bottleneck.}
The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 12.3\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
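The arithmetic behind that reading is simple: one pegged core shows up system-wide as $100/N$ percent on an $N$-core host. The core count below is an assumption for illustration; the chapter does not restate the host's core count here.

```go
package main

import "fmt"

// oneCoreShare returns the whole-system CPU utilization (in percent)
// that a single fully saturated core produces on an n-core host.
func oneCoreShare(cores int) float64 {
	return 100.0 / float64(cores)
}

func main() {
	// Assuming an 8-core host (an assumption, not a measured spec):
	// one pegged core reads as 12.5% system-wide, close to the
	// 12.3% measured for Tinc.
	fmt.Printf("%.1f%%\n", oneCoreShare(8)) // prints 12.5%
}
```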
The parallel benchmark confirms the diagnosis. Tinc scales to
@@ -1110,7 +1092,7 @@ the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's similar 14.2\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
@@ -1121,13 +1103,12 @@ Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds 4\,ms
of delay and 1\,\% loss with 2.5\,\% reordering; High adds 6\,ms
of delay and 2.5\,\% loss with 5\,\% reordering. Medium and High
both use 50\,\% correlation, so losses and reorderings are bursty
rather than uniform. Two results dominate the data.
% TODO: Double-check these per-profile parameters against the % TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter % canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and % (Table~\ref{tab:impairment_profiles}). The Low/High loss and
@@ -1150,9 +1131,9 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers, by reading the test
rig's own NixOS module, that the gVisor stack is not actually in
the data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way \texttt{tcp\_reordering} threshold and the way
\texttt{wireguard-go} \texttt{wireguard-go}
RTT standard deviation reaches 44.6\,ms at High, the worst jitter
of any VPN. A userspace retry mechanism is the likely cause, but
without source-code evidence this cannot be confirmed.
% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on
the first step (Table~\ref{tab:tcp_impairment}).
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low
impairment, a 98.3\% loss. The Low profile injects only modest
impairment per machine: 2\,ms of latency, 2\,ms of jitter,
0.25\% loss, and 0.5\% reordering. Even Mycelium, the slowest
VPN at baseline (259\,Mbps), retains more throughput at
emerges from the runs that did complete.
% explanation (e.g., iPerf3 crash, tc interaction,
% timing issue).

Three implementations maintain throughput at the profiles where
data exists: Internal, WireGuard, and Headscale all sustain
several hundred Mbps where they complete (see
Figure~\ref{fig:udp_impairment_heatmap}). Internal and WireGuard
ride the host kernel's transport-layer backpressure (Internal
directly, WireGuard via the in-kernel WireGuard module).
Headscale takes a different route to the same outcome. Its
\texttt{magicsock} layer is incompatible with the kernel
WireGuard datapath, so \texttt{wireguard-go} runs in userspace
and leans on three host-kernel offloads to absorb a
\texttt{-b~0} sender flood: batched UDP I/O
(\texttt{recvmmsg} / \texttt{sendmmsg}), UDP
segmentation/aggregation offload (\texttt{UDP\_SEGMENT} /
\texttt{UDP\_GRO}) on the outer WireGuard socket, and a 7\,MiB
socket buffer on that same socket.
Section~\ref{sec:tailscale_degraded} returns to these mechanisms
when they reappear as the explanation for Headscale's TCP
behaviour under reordering.

Userspace VPNs without that engineering collapse. EasyTier
walks down 865, 435, 38.5, 6.1\,Mbps across the four profiles.
Yggdrasil, already pathological at baseline (98.7\,\% loss),
drops to 12.3\,Mbps at Low and fails entirely at Medium and
High.
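The offloads described above are ordinary Linux socket options rather than anything exotic. The following Python sketch shows the two socket-level knobs in isolation; it is illustrative only (wireguard-go's actual implementation is in Go), and the numeric option constants are the Linux values from \texttt{linux/udp.h}.

```python
import socket

# Linux UDP socket options from <linux/udp.h>.
UDP_SEGMENT = 103  # UDP GSO: kernel splits one large send into datagrams
UDP_GRO = 104      # UDP GRO: kernel coalesces datagram bursts on receive

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Request a 7 MiB receive buffer, as described for the outer
# WireGuard socket. Without privileges the kernel silently caps
# the request at net.core.rmem_max, so read back what was granted.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 7 * 1024 * 1024)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Enable UDP GSO with a 1400-byte segment size and UDP GRO on
# receive. Pre-4.18 kernels and non-Linux hosts raise OSError.
try:
    sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1400)
    sock.setsockopt(socket.IPPROTO_UDP, UDP_GRO, 1)
    gso_available = True
except OSError:
    gso_available = False
```

The point of the batching is amortisation: one syscall and one trip through the network stack move many datagrams, which is exactly what an unthrottled \texttt{-b~0} sender requires.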
\begin{figure}[H]
\centering
concurrent streams benefits from it independently.

EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs
that retain more than half their baseline parallel throughput at
Low impairment; no other implementation exceeds 30\%. EasyTier's
resilience has no direct architectural explanation in this work,
and none is claimed here.
Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a
99.6\% loss. % TODO: DOWNSTREAM DEPENDENCY -- This
% under-load latency. If that diagnosis is revised,
% this explanation
% for parallel collapse must also be revisited.
The buffer bloat that already constrains single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.
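For reference, the two client-side iPerf3 invocations discussed in this chapter can be sketched as follows. The peer host name, the 30\,s duration, and the \texttt{-P~6} stream count are assumptions (the stream count has an open TODO elsewhere in the text); only the \texttt{-b~0} unthrottled-UDP flag is stated explicitly in the analysis.

```shell
# Sketch of the iPerf3 workloads (host, duration, and stream
# count are assumptions, not the rig's recorded invocation).
iperf3 -c peer1 -P 6 -t 30       # parallel TCP, six concurrent streams
iperf3 -c peer1 -u -b 0 -t 30    # UDP, unthrottled sender flood
```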
at Low, completes here in 170\,s. At High, only Headscale,
Nebula, and Tinc survive. Internal's failure at High is the
surprising one: the bare-metal baseline cannot sustain a
multi-connection HTTP workload under severe degradation, while
Headscale completes in 219\,s.
Section~\ref{sec:tailscale_degraded} traces the mechanism.
\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}
% TODO: The magicsock / wireguard-go userspace-datapath
% explanation is repeated three times in slightly different forms
% (once in baseline UDP, once in impairment UDP, once here).
% Consider introducing it once in full here, where it is
% load-bearing, and replacing the earlier occurrences with
% one-sentence forward references.

Headscale, a tunnelling VPN built on \texttt{wireguard-go}, beats
the bare-metal Internal baseline at Medium impairment. Under
parallel load at Low impairment it beats Internal by a factor of
2.6. A VPN should not outperform the direct connection it tunnels
through, and the explanation took some chasing. The obvious
hypothesis was wrong, and pursuing it to its end was the only way
to find out.
\subsection{An anomaly worth pursuing}
comparison.
\label{fig:headscale_vs_internal}
\end{figure}
The in-kernel WireGuard module is the obvious sanity check. It
uses the same Noise/WireGuard cryptographic protocol that
Tailscale embeds, and it is the closest available comparison
without the rest of Tailscale's stack. Kernel WireGuard shows
none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps
at Medium, both well below Internal at the same profile. The
encryption layer is not the answer. Neither is the basic UDP
tunnel. Whatever Headscale is doing lives somewhere else in
Tailscale's implementation.
% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit
not.

\subsection{A plausible villain: Tailscale's gVisor stack}
The first candidate was Tailscale's userspace TCP/IP stack:
the answer any reading of the upstream Tailscale documentation
points to. The Tailscale client imports Google's gVisor netstack
(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses
it as an in-process TCP implementation. The gVisor
reordering link than the host kernel. The hypothesis follows
directly: Headscale's iPerf3 traffic
runs through this gVisor instance instead of through the host
kernel TCP stack, and so it inherits the more
reordering-tolerant behaviour. Kernel WireGuard shares only
the cryptographic protocol; it does not include the gVisor
stack, and therefore does not get the advantage.
The natural way to test this is to extract
the parameters Tailscale sets inside gVisor, apply their
supported. If it does not, the hypothesis fails.

\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}

Two follow-up benchmarks ran on the same hardware and impairment
setup as the original 18.12.2025 run.
\begin{itemize}
\bitem{Tailscale-style (27.02.2026):}
\label{fig:kernel_tuning_comparison}
\end{figure}
The result felt like confirmation. Three sysctls raised
Internal's Medium-impairment throughput by 146\,\% and halved its
Nix cache download time
(Table~\ref{tab:kernel_tuning_internal}). The retransmit rate at
Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means
more than half of the original retransmissions were spurious.
Parallel TCP gained even more. Internal at Low climbed from
277\,Mbps to 902\,Mbps, a 226\,\% increase that exceeded
Internal's old single-stream best and overtook Headscale's
original 718\,Mbps from the unmodified run.
% TODO: DOWNSTREAM DEPENDENCY -- "six concurrent flows"
% inherits the unresolved 6-vs-10 stream count from the baseline
% parallel test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.
% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
% not in any table. Add a table showing Headscale's results
% from the follow-up runs alongside Internal's so readers can
% verify the reversal.
Headscale, retested with the same sysctls, gained more modestly:
+21\,\% at Medium and a small $-$5\,\% wobble at Low. And the
anomaly reversed entirely
(Figure~\ref{fig:headscale_gap_reversal}). Tuned Internal now
leads Headscale at Medium across every metric, where the
original run had Headscale ahead.
\begin{figure}[H]
\centering
collapsed to three host-kernel sysctls: \texttt{tcp\_reordering},
\texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.
At this point in the investigation the hypothesis seemed
settled. Tailscale's gVisor stack applies these overrides;
the bare-metal kernel uses stricter defaults; matching
the kernel to gVisor reproduces the effect. The remaining
question was which Tailscale code path the test rig was actually
running.
\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}
the gVisor TCP business at all.

The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
TCP stack. Whichever mechanism Headscale relies on (partially,
weakly, but reproducibly) is worth roughly twelve megabits per
second on the Medium profile, and it is not gVisor netstack.
The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%,
that the two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are
the most likely candidates, and both live on the kernel-TUN path
that Tailscale actually uses in the test rig.
The first is TUN TCP and UDP generic receive offload. Tailscale's
\texttt{tstun} wrapper enables both on the kernel TUN device on
Hyprspace cannot be used as a negative control for any of this.
It does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in the
\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP
buffer), not in whether gVisor handles their inner TCP. The
gVisor angle simply does not apply to either of them in this
benchmark.
The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks
expose:

\begin{description}
\item[\texttt{tcp\_reordering} (default 3)] The number of
out-of-order segments the kernel tolerates before declaring
fast retransmit. With \texttt{tc~netem} injecting 0.5--2.5\,\%
reordering per machine, bursts of several reordered packets
repeatedly trip this threshold on the bare-metal path.
\item[\texttt{tcp\_recovery} (default \texttt{1}, RACK enabled)]
Adds time-based reordering detection on top of the
segment-count threshold, which amplifies spurious retransmits
when reordering is high.
\item[\texttt{tcp\_early\_retrans} (default \texttt{3}, TLP
enabled)] Fires speculative retransmits when unacknowledged
segments sit at the tail of the transmission window, which
interacts poorly with an already-impaired link.
\end{description}

\noindent
Loosening any one softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it a GRO-coalesced, buffer-cushioned
stream, so the kernel's tight defaults fire less often there to
begin with.
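In sysctl form, a loosened configuration could look like the lines below. The text records which parameters were changed, not the exact values, so the numbers here are illustrative assumptions rather than the tuned run's settings.

```shell
# Illustrative loosening of the three thresholds; the values are
# assumptions, not the exact tuned-run parameters.
sysctl -w net.ipv4.tcp_reordering=30    # default 3: tolerate deeper reordering
sysctl -w net.ipv4.tcp_recovery=0       # default 1: disable RACK loss detection
sysctl -w net.ipv4.tcp_early_retrans=0  # default 3: disable early retransmit/TLP
```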
The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
any Linux host and entirely independent of any VPN.

The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage initially attributed to it comes from a more ordinary
place: the way \texttt{wireguard-go} batches and coalesces
packets between the wire and the kernel TCP stack, and the larger
UDP buffer it pins on its outer socket. The experiment was
chasing the wrong hypothesis, but the experiment turned out to be
more useful than the hypothesis.
\section{Summary}
\label{sec:results_summary}

Four findings hold together across all four impairment profiles.
At baseline, the throughput hierarchy splits into three tiers
separated by natural gaps in the data: WireGuard, ZeroTier,
Headscale, and Yggdrasil at the top ($>$\,80\,\% of bare metal);
Nebula, EasyTier, and VpnCloud in the middle (55--80\,\%);
Hyprspace, Tinc, and Mycelium at the bottom ($<$\,40\,\%).
Latency rearranges the rankings: VpnCloud is the fastest VPN at
1.13\,ms despite mid-tier throughput, Tinc has low latency but a
single-core CPU bottleneck caps its bulk transfer rate, and
Mycelium's 34.9\,ms average is an outlier driven by Babel's
overlay routing, not by tunnel overhead.

Under impairment, the hierarchy collapses. At High impairment
the spread between the fastest and slowest implementations
compresses from 675\,Mbps to under 3\,Mbps; the impairment
profile itself becomes the bottleneck. Three pathologies stand
out at the intermediate profiles. Yggdrasil's 32\,KB jumbo
overlay MTU, which inflates its baseline numbers, becomes a
liability at Low impairment: a single lost outer packet costs
roughly 24$\times$ more retransmitted inner data than a
standard-MTU VPN would lose, and throughput drops from 795 to
13\,Mbps. Hyprspace's libp2p/yamux send pipeline serialises
concurrent flows behind a per-peer mutex; under any sustained
load the pipeline backs up and ping latency balloons by three
orders of magnitude. Headscale's RIST video quality stays at
13\,\% across every profile, almost certainly because of MTU
fragmentation in the DERP relay layer; the failure is
profile-independent because it is structural.
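The 24$\times$ figure is plain arithmetic. A short sketch, assuming a typical ${\sim}$1400-byte MTU for the standard-MTU comparison (the exact standard MTU is an assumption, which is why the factor is only "roughly" 24):

```python
# Retransmit amplification from Yggdrasil's jumbo overlay MTU:
# losing one outer packet invalidates a whole inner segment.
jumbo_inner_mtu = 32 * 1024  # Yggdrasil's 32 KB overlay MTU
standard_mtu = 1400          # assumed typical tunnel MTU

# Inner data that must be retransmitted per lost outer packet,
# relative to a standard-MTU VPN.
amplification = jumbo_inner_mtu / standard_mtu
print(round(amplification))  # prints 23, i.e. roughly the 24x quoted above
```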
Headscale's apparent lead over the bare-metal Internal baseline
at Medium impairment turns out not to come from Tailscale's
gVisor TCP stack, which is not in the data path of the
benchmark. It comes from \texttt{wireguard-go}'s TUN GRO
coalescing and the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins, both of which feed the host kernel TCP
stack a smoother input than the bare-metal path receives. The
underlying cause is a host-kernel one: the default
\texttt{tcp\_reordering=3} threshold is too tight for the kind
of bursty, correlated reordering \texttt{tc netem} produces, and
costs the bare-metal host more than half its achievable
throughput. Three sysctl lines repair it, and the fix is
portable to any Linux host, independent of any VPN.

A ranking by a single metric would obscure all of this. The most
useful one-sentence summary is therefore that no VPN dominates
across throughput, latency, application-level workloads, and
operational resilience together; each implementation makes
trade-offs that surface only when the workload changes.
WireGuard comes closest to a default recommendation for
performance-critical use; Headscale is the most robust under
adverse network conditions; the others occupy specific niches
that the per-section analyses describe.