verified all numbers
+251
-221
@@ -9,10 +9,9 @@ ten VPN implementations and the internal baseline. The structure
|
|||||||
follows the impairment profiles from ideal to degraded:
|
follows the impairment profiles from ideal to degraded:
|
||||||
Section~\ref{sec:baseline} establishes overhead under ideal
|
Section~\ref{sec:baseline} establishes overhead under ideal
|
||||||
conditions, then subsequent sections examine how each VPN responds to
|
conditions, then subsequent sections examine how each VPN responds to
|
||||||
increasing network impairment. The chapter concludes with findings
|
increasing network impairment, with source-code excerpts woven in
|
||||||
from the source code analysis. A recurring theme is that no single
|
where they explain the measured behaviour. A recurring theme is
|
||||||
metric captures VPN
|
that no single metric captures VPN performance; the rankings shift
|
||||||
performance; the rankings shift
|
|
||||||
depending on whether one measures throughput, latency, retransmit
|
depending on whether one measures throughput, latency, retransmit
|
||||||
behavior, or real-world application performance.
|
behavior, or real-world application performance.
|
||||||
|
|
||||||
@@ -26,38 +25,32 @@ the VPN itself. Throughout the plots in this section, the
|
|||||||
in the path; it represents the best the hardware can do. On its own,
|
in the path; it represents the best the hardware can do. On its own,
|
||||||
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
||||||
latency of just
|
latency of just
|
||||||
0.60\,ms. WireGuard comes remarkably close to these numbers, reaching
|
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
|
||||||
92.5\,\% of bare-metal throughput with only a single retransmit across
|
single retransmit across an entire 30-second test. Mycelium sits at
|
||||||
an entire 30-second test. Mycelium sits at the other extreme, adding
|
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
|
||||||
34.9\,ms of latency, roughly 58$\times$ the bare-metal figure.
|
bare-metal figure.
|
||||||
|
|
||||||
A note on naming: ``Headscale'' in every table and figure of this
|
A note on naming: ``Headscale'' in every table and figure of this
|
||||||
chapter labels the test scenario in which the Tailscale client
|
chapter labels the test scenario in which the Tailscale client
|
||||||
(\texttt{tailscaled}) connects to a self-hosted Headscale control
|
(\texttt{tailscaled}) connects to a self-hosted Headscale control
|
||||||
server. The data plane is therefore the Tailscale client built on
|
server. The data plane is therefore the Tailscale client built on
|
||||||
\texttt{wireguard-go}, not the Headscale binary itself, which is
|
\texttt{wireguard-go}, not the Headscale binary itself, which is
|
||||||
only a control-plane server. The test rig launches
|
only a control-plane server. Statements below about ``Headscale''
|
||||||
\texttt{tailscaled} via the NixOS \texttt{services.tailscale}
|
running \texttt{wireguard-go} should be read as statements about
|
||||||
module with \texttt{interfaceName = "ts-headscale"}, which
|
the Tailscale client in this scenario.
|
||||||
translates to \texttt{--tun ts-headscale}; this means the Tailscale
|
Section~\ref{sec:tailscale_degraded} covers the specifics of how
|
||||||
client uses a real kernel TUN device and the host kernel's TCP/IP
|
the rig launches \texttt{tailscaled} and which Tailscale code
|
||||||
stack handles every tunneled packet. The alternate
|
paths that choice activates.
|
||||||
\texttt{--tun=userspace-networking} mode, in which gVisor netstack
|
|
||||||
terminates tunneled TCP inside the \texttt{tailscaled} process, is
|
|
||||||
\emph{not} engaged in any of the benchmarks reported here.
|
|
||||||
Statements below about ``Headscale'' running \texttt{wireguard-go}
|
|
||||||
should be read as statements about the Tailscale client in this
|
|
||||||
scenario.
|
|
||||||
|
|
||||||
\subsection{Test Execution Overview}
|
\subsection{Test Execution Overview}
|
||||||
|
|
||||||
Running the full baseline suite across all ten VPNs and the internal
|
Running the full baseline suite across all ten VPNs and the internal
|
||||||
reference took just over four hours. The bulk of that time, about
|
reference took just over four hours. Actual benchmark execution
|
||||||
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
|
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
|
||||||
installation and deployment accounted for another 45~minutes (19\,\%),
|
installation and deployment accounted for another 45~minutes
|
||||||
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
|
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
|
||||||
up after restarts. The remaining time was consumed by VPN service restarts
|
for VPN tunnels to come up after restarts. VPN service restarts and
|
||||||
and traffic-control (tc) stabilization.
|
traffic-control (tc) stabilization took the remainder.
|
||||||
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
||||||
|
|
||||||
Most VPNs completed every benchmark without issues, but four failed
|
Most VPNs completed every benchmark without issues, but four failed
|
||||||
@@ -146,8 +139,8 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates
|
|||||||
needs. ZeroTier compensates for tunnel-internal packet loss by
|
needs. ZeroTier compensates for tunnel-internal packet loss by
|
||||||
repeatedly triggering TCP congestion-control recovery, whereas
|
repeatedly triggering TCP congestion-control recovery, whereas
|
||||||
WireGuard delivers data with negligible in-tunnel loss. The
|
WireGuard delivers data with negligible in-tunnel loss. The
|
||||||
bare-metal Internal reference sits at 1.7~retransmits per test —
|
bare-metal Internal reference sits at 1.7~retransmits per test,
|
||||||
essentially noise — and the VPNs split into three groups around
|
essentially noise, and the VPNs split into three groups around
|
||||||
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
|
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
|
||||||
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
|
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
|
||||||
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
||||||
@@ -187,10 +180,10 @@ and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Retransmits have a direct mechanical relationship with TCP congestion
|
Retransmits have a direct mechanical relationship with TCP congestion
|
||||||
control. Each retransmit triggers a reduction in the congestion window
|
control: each one triggers a reduction in the congestion window
|
||||||
(\texttt{cwnd}), throttling the sender.
|
(\texttt{cwnd}) and throttles the sender.
|
||||||
This relationship is visible
|
Figure~\ref{fig:retransmit_correlations} shows the relationship:
|
||||||
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965
|
Hyprspace, with 4\,965
|
||||||
retransmits, maintains the smallest max congestion window in the
|
retransmits, maintains the smallest max congestion window in the
|
||||||
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
|
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
|
||||||
window, the largest of any VPN. At first glance this suggests a
|
window, the largest of any VPN. At first glance this suggests a
|
||||||
@@ -200,24 +193,31 @@ largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
|
|||||||
segment carries far more data, so the window in bytes is inflated
|
segment carries far more data, so the window in bytes is inflated
|
||||||
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
|
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
|
||||||
congestion windows across different MTU sizes is not meaningful
|
congestion windows across different MTU sizes is not meaningful
|
||||||
without normalizing for segment size. What \emph{is} clear is that
|
without normalizing for segment size. The reliable conclusion is
|
||||||
high retransmit rates force TCP to spend more time in congestion
|
simpler: high retransmit rates force TCP to spend more time in
|
||||||
recovery than in steady-state transmission, capping throughput
|
congestion recovery than in steady-state transmission, and that
|
||||||
regardless of available bandwidth. ZeroTier illustrates the
|
caps throughput regardless of available bandwidth. ZeroTier
|
||||||
opposite extreme: brute-force retransmission can still yield high
|
illustrates the opposite extreme: brute-force retransmission can
|
||||||
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
|
still yield high throughput (814\,Mbps with 1\,163 retransmits), at
|
||||||
bandwidth and unstable flow behavior.
|
the cost of wasted bandwidth and unstable flow behavior.
|
||||||
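The claim that loss, not capacity, sets the ceiling can be made
quantitative with the classical loss-based throughput model (the
Mathis bound). It is a back-of-envelope relation rather than a
description of CUBIC's exact behaviour on these kernels:
\[
  \text{throughput} \;\lesssim\; \frac{\mathit{MSS}}{\mathit{RTT}}
  \cdot \frac{C}{\sqrt{p}},
\]
where $p$ is the fraction of segments that must be retransmitted and
$C$ is a constant of order one (${\approx}1.2$ for Reno-style
recovery). Link capacity does not appear at all: once $p$ is large,
adding bandwidth changes nothing, and even halving $p$ buys only a
factor of $\sqrt{2}$.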
|
|
||||||
VpnCloud stands out: its sender reports 538.8\,Mbps
|
VpnCloud stands out: its sender reports 538.8\,Mbps but the
|
||||||
but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest
|
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
|
||||||
in the dataset). This suggests significant in-tunnel packet loss or
|
in the dataset. This points to significant in-tunnel packet loss
|
||||||
buffering at the VpnCloud layer that the retransmit count (857)
|
or buffering at the VpnCloud layer that the retransmit count (857)
|
||||||
alone does not fully explain.
|
alone does not fully explain.
|
||||||
|
% TODO: Clarify whether the headline TCP table
|
||||||
|
% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports
|
||||||
|
% sender or receiver throughput. The prose here cites sender
|
||||||
|
% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender
|
||||||
|
% column, so the table caption should say so explicitly. Same
|
||||||
|
% clarification needed for Hyprspace (368 in table vs.\ sender
|
||||||
|
% 367.9 / receiver 419.8 in the pathological-cases paragraph).
|
||||||
|
|
||||||
Variability — whether stochastic across runs or systematic across
|
Variability, whether stochastic across runs or systematic across
|
||||||
links — also differs substantially. WireGuard's three link
|
links, also differs substantially. WireGuard's three link
|
||||||
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window),
|
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
|
||||||
behaving almost identically. Mycelium's three directions span
|
and are nearly indistinguishable. Mycelium's three directions span
|
||||||
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
|
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
|
||||||
Section~\ref{sec:mycelium_routing} shows the spread is per-link
|
Section~\ref{sec:mycelium_routing} shows the spread is per-link
|
||||||
path-selection asymmetry, with one link finding a direct route and
|
path-selection asymmetry, with one link finding a direct route and
|
||||||
@@ -315,25 +315,21 @@ interference that the average hides.
|
|||||||
|
|
||||||
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
|
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
|
||||||
but only the second-lowest throughput (336\,Mbps). Packets traverse
|
but only the second-lowest throughput (336\,Mbps). Packets traverse
|
||||||
the tunnel quickly, yet single-threaded userspace processing cannot
|
the tunnel quickly, yet something caps the overall rate. The qperf
|
||||||
keep up with the link speed. The qperf benchmark backs this up: Tinc
|
benchmark reports Tinc maxing out at 14.9\,\% total system CPU while
|
||||||
maxes out at
|
delivering 336\,Mbps. On a multi-core host this figure is consistent
|
||||||
14.9\,\% total system CPU while delivering just 336\,Mbps.
|
with a single saturated core, which fits Tinc's single-threaded
|
||||||
% TODO: 14.9\% total CPU does not obviously indicate a bottleneck.
|
userspace architecture: one core encrypts, copies, and forwards
|
||||||
|
packets, and the remaining cores sit idle. But VpnCloud reports the
|
||||||
|
same 14.9\,\% and still reaches 539\,Mbps (60\,\% more than Tinc),
|
||||||
|
so whole-system CPU alone cannot explain the gap, and a per-packet
|
||||||
|
processing cost difference must also be in play.
|
||||||
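A rough sanity check on the single-core reading (the core count of
the test hosts is not restated here, so the figure is illustrative):
one fully saturated core on an $N$-core machine shows up as
$100/N\,\%$ whole-system utilization, and
\[
  N \approx \frac{100\,\%}{14.9\,\%} \approx 6.7,
\]
so the observed 14.9\,\% is what a single pegged core looks like on
a host with roughly seven logical cores. This is consistent with,
but does not by itself prove, the single-threaded bottleneck.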
|
% TODO: 14.9\% total CPU does not pin the bottleneck on its own.
|
||||||
% This is whole-system utilization on a multi-core machine, and a
|
% This is whole-system utilization on a multi-core machine, and a
|
||||||
% single saturated core fits the budget — but VpnCloud reports the
|
% single saturated core fits the budget — but VpnCloud reports the
|
||||||
% same 14.9\% \emph{and} reaches 539\,Mbps, much more than Tinc.
|
% same 14.9\% \emph{and} reaches 539\,Mbps. Verify with per-thread
|
||||||
% The single-saturated-core story alone therefore cannot explain
|
% CPU sampling or eBPF profiling to confirm the single-core story
|
||||||
% the throughput gap; per-packet processing cost must differ
|
% and quantify the per-packet cost difference.
|
||||||
% materially between the two. Verify with per-thread CPU sampling
|
|
||||||
% or eBPF profiling.
|
|
||||||
On a multi-core system, this low percentage is consistent with a
|
|
||||||
single saturated core (and Tinc is single-threaded), which would
|
|
||||||
explain why the CPU rather than the network is the bottleneck.
|
|
||||||
The story is incomplete, however: VpnCloud shows the same 14.9\,\%
|
|
||||||
total system CPU yet delivers 539\,Mbps — 60\,\% more than Tinc —
|
|
||||||
so a difference in per-packet processing cost between the two
|
|
||||||
implementations must also be in play.
|
|
||||||
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
|
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
|
||||||
spot.
|
spot.
|
||||||
|
|
||||||
@@ -346,9 +342,9 @@ spot.
|
|||||||
The qperf measurements also reveal a wide spread in CPU usage.
|
The qperf measurements also reveal a wide spread in CPU usage.
|
||||||
Hyprspace (55.1\,\%) and Yggdrasil
|
Hyprspace (55.1\,\%) and Yggdrasil
|
||||||
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
|
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
|
||||||
9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a
|
9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a
|
||||||
kernel-level implementation, presumably due to in-kernel
|
kernel-level implementation; in-kernel cryptographic processing
|
||||||
cryptographic processing.
|
is the likely cause, though no profiling data confirms this.
|
||||||
On the efficient end, VpnCloud
|
On the efficient end, VpnCloud
|
||||||
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
|
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
|
||||||
CPU time. Nebula and Headscale are missing from
|
CPU time. Nebula and Headscale are missing from
|
||||||
@@ -416,8 +412,10 @@ Table~\ref{tab:parallel_scaling} lists the results.
|
|||||||
|
|
||||||
The VPNs that gain the most are those most constrained in
|
The VPNs that gain the most are those most constrained in
|
||||||
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
|
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
|
||||||
can never fill the pipe: the bandwidth-delay product demands a window
|
can never fill the pipe: the bandwidth-delay product (the amount
|
||||||
larger than any single flow maintains, so multiple concurrent flows
|
of in-flight data a TCP flow needs to saturate a link, equal to the
|
||||||
|
link bandwidth times the round-trip time) demands a window larger
|
||||||
|
than any single flow maintains, so multiple concurrent flows
|
||||||
compensate for that constraint and push throughput to 2.20$\times$
|
compensate for that constraint and push throughput to 2.20$\times$
|
||||||
the single-stream figure. Hyprspace scales almost as well
|
the single-stream figure. Hyprspace scales almost as well
|
||||||
(2.18$\times$) for the same reason but with a different
|
(2.18$\times$) for the same reason but with a different
|
||||||
@@ -425,7 +423,7 @@ bottleneck. Its libp2p send pipeline accumulates roughly
|
|||||||
2\,800\,ms of under-load latency
|
2\,800\,ms of under-load latency
|
||||||
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
|
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
|
||||||
flow a bandwidth-delay product on the order of hundreds of
|
flow a bandwidth-delay product on the order of hundreds of
|
||||||
megabytes to fill — far beyond any single kernel cwnd. And
|
megabytes to fill, far beyond any single kernel cwnd. And
|
||||||
because Hyprspace keys \texttt{activeStreams} by destination
|
because Hyprspace keys \texttt{activeStreams} by destination
|
||||||
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
|
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
|
||||||
three concurrent peer pairs in the parallel benchmark each get
|
three concurrent peer pairs in the parallel benchmark each get
|
||||||
@@ -440,8 +438,9 @@ more of the bloated pipeline than one can.
|
|||||||
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
|
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
|
||||||
% per-flow window evolution nor the actual under-load latency
|
% per-flow window evolution nor the actual under-load latency
|
||||||
% has been measured directly. A tcpdump of one Hyprspace
|
% has been measured directly. A tcpdump of one Hyprspace
|
||||||
% iPerf3 run with inter-arrival timing analysis would settle
|
% iPerf3 run with inter-arrival timing analysis would settle it.
|
||||||
% it. Tinc picks up a
|
|
||||||
|
Tinc picks up a
|
||||||
1.68$\times$ boost because several streams can collectively keep its
|
1.68$\times$ boost because several streams can collectively keep its
|
||||||
single-threaded CPU busy during what would otherwise be idle gaps in
|
single-threaded CPU busy during what would otherwise be idle gaps in
|
||||||
a single flow.
|
a single flow.
|
||||||
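The windows these bandwidth-delay products call for can be estimated
directly from the figures already quoted, taking the 934\,Mbps
bare-metal rate as the target (a back-of-envelope estimate, not a
measured window requirement):
\[
  \mathit{BDP} = B \times \mathit{RTT}:\qquad
  934\,\text{Mbps} \times 34.9\,\text{ms} \approx 4.1\,\text{MB}
  \quad\text{(Mycelium)},
\]
\[
  934\,\text{Mbps} \times 2.8\,\text{s} \approx 327\,\text{MB}
  \quad\text{(Hyprspace under load)}.
\]
The first figure already sits at the level of the largest congestion
window seen anywhere in the baseline (Yggdrasil's inflated 4.3\,MB);
the second is far beyond any single kernel \texttt{cwnd}, which is
why only parallel streams recover the throughput.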
@@ -449,9 +448,9 @@ a single flow.
|
|||||||
% TODO: "zero retransmits" in parallel mode is not shown in any table
|
% TODO: "zero retransmits" in parallel mode is not shown in any table
|
||||||
% or figure. Add parallel-mode retransmit data or remove the claim.
|
% or figure. Add parallel-mode retransmit data or remove the claim.
|
||||||
WireGuard and Internal both scale cleanly at around
|
WireGuard and Internal both scale cleanly at around
|
||||||
1.48--1.50$\times$ with zero retransmits, suggesting that
|
1.48--1.50$\times$ with zero retransmits. This is consistent
|
||||||
WireGuard's overhead is a fixed per-packet cost that does not worsen
|
with WireGuard's overhead being a fixed per-packet cost that does
|
||||||
under multiplexing.
|
not worsen under multiplexing.
|
||||||
|
|
||||||
Nebula is the only VPN that actually gets \emph{slower} with more
|
Nebula is the only VPN that actually gets \emph{slower} with more
|
||||||
streams: throughput drops from 706\,Mbps to 648\,Mbps
|
streams: throughput drops from 706\,Mbps to 648\,Mbps
|
||||||
@@ -498,8 +497,9 @@ The sender throughput values are artifacts: they reflect how fast the
|
|||||||
sender can write to the socket, not how fast data traverses the
|
sender can write to the socket, not how fast data traverses the
|
||||||
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
|
tunnel. Yggdrasil, for example, reports 63\,744\,Mbps sender
|
||||||
throughput because it uses a 32,731-byte block size (a jumbo-frame
|
throughput because it uses a 32\,731-byte block size (a jumbo-frame
|
||||||
overlay MTU), inflating the apparent rate per \texttt{send()} system
|
overlay MTU), which inflates the apparent rate per
|
||||||
call. Only the receiver throughput is meaningful.
|
\texttt{send()} system call. Only the receiver throughput is
|
||||||
|
meaningful.
|
||||||
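For scale, simple arithmetic on the quoted figures (not an
additional measurement): 63\,744\,Mbps at 32\,731 bytes per write
corresponds to
\[
  \frac{63\,744 \times 10^{6}\ \text{bit/s}}
       {32\,731 \times 8\ \text{bit}}
  \approx 2.4 \times 10^{5}\ \text{socket writes per second},
\]
which measures how fast the sender can hand datagrams to its own
socket and says nothing about what reaches the far end.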
|
|
||||||
\begin{table}[H]
|
\begin{table}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -537,16 +537,19 @@ because the sender overwhelms the tunnel's userspace processing capacity.
|
|||||||
Headscale shares WireGuard's cryptographic protocol but, contrary to
|
Headscale shares WireGuard's cryptographic protocol but, contrary to
|
||||||
intuition, does not share its kernel datapath: Tailscale's
|
intuition, does not share its kernel datapath: Tailscale's
|
||||||
\texttt{magicsock} layer intercepts every packet to handle endpoint
|
\texttt{magicsock} layer intercepts every packet to handle endpoint
|
||||||
selection and DERP relay, which is incompatible with the in-kernel
|
selection and DERP (Designated Encrypted Relay for Packets,
|
||||||
WireGuard module. Headscale therefore runs \texttt{wireguard-go}
|
Tailscale's TLS-over-TCP relay network used when a direct UDP path
|
||||||
entirely in userspace, and the unbounded \texttt{-b~0} flood overruns
|
between peers cannot be established), which is incompatible with the
|
||||||
that userspace pipeline just as it overruns every other userspace
|
in-kernel WireGuard module. Headscale therefore runs
|
||||||
implementation, producing 69.8\,\% loss despite the WireGuard branding.
|
\texttt{wireguard-go} entirely in userspace, and the unbounded
|
||||||
|
\texttt{-b~0} flood overruns that userspace pipeline just as it
|
||||||
|
overruns every other userspace implementation; Headscale
|
||||||
|
shows 69.8\,\% loss despite the WireGuard branding.
|
||||||
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
|
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
|
||||||
(due to its large block size) but loses almost all of it. These loss
|
(due to its large block size) but loses almost all of it. These loss
|
||||||
rates do not reflect real-world UDP behavior but reveal which VPNs
|
rates do not reflect real-world UDP behavior but reveal which VPNs
|
||||||
implement effective flow control. Hyprspace and Mycelium could not
|
implement effective flow control. Hyprspace and Mycelium could not
|
||||||
complete the UDP test at all, timing out after 120 seconds.
|
complete the UDP test at all; both timed out after 120 seconds.
|
||||||
|
|
||||||
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
|
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
|
||||||
% the path MTU. It is derived from the socket MSS and reflects the
|
% the path MTU. It is derived from the socket MSS and reflects the
|
||||||
@@ -743,19 +746,11 @@ overwhelm FEC entirely.
|
|||||||
|
|
||||||
\subsection{Operational Resilience}
|
\subsection{Operational Resilience}
|
||||||
|
|
||||||
Sustained-load performance does not predict recovery speed. How
|
Throughput, latency, and application performance describe how a
|
||||||
quickly a tunnel comes up after a reboot, and how reliably it
|
tunnel behaves once it is up. The next question is how quickly it
|
||||||
reconverges, matters as much as peak throughput for operational use.
|
gets there. Sustained-load numbers do not predict recovery speed,
|
||||||
|
and for operational use the time a tunnel takes to come up after a
|
||||||
% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
|
reboot matters as much as its peak throughput.
|
||||||
% are not shown in any figure or table. Either add a figure or
|
|
||||||
% scrap this paragraph (see note below).
|
|
||||||
First-time connectivity spans a wide range. Headscale and WireGuard
|
|
||||||
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
|
|
||||||
(10--14\,s) spend seconds negotiating with their control planes
|
|
||||||
before passing traffic.
|
|
||||||
|
|
||||||
%TODO: Maybe we want to scrap first-time connectivity
|
|
||||||
|
|
||||||
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
||||||
performer under sustained TCP load, recovers in just 8.7~seconds on
|
performer under sustained TCP load, recovers in just 8.7~seconds on
|
||||||
@@ -768,18 +763,21 @@ benchmarks use the default). After a reboot, a node must
|
|||||||
wait until the next periodic update before its lighthouses learn
|
wait until the next periodic update before its lighthouses learn
|
||||||
its new endpoint, so the reconnection time tracks the timer rather
|
its new endpoint, so the reconnection time tracks the timer rather
|
||||||
than any topology-dependent convergence.
|
than any topology-dependent convergence.
|
||||||
Mycelium sits at the opposite end, needing 76.6~seconds and showing
|
Mycelium sits at the opposite end at 76.6~seconds, and its three
|
||||||
the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s),
|
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
|
||||||
suggesting a fixed protocol-level wait built into the overlay.
|
Section~\ref{sec:mycelium_routing} argues from that uniformity
|
||||||
|
that the bound is a fixed timer in the overlay protocol.
|
||||||
|
|
||||||
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
||||||
node is back in 7.1~seconds while lom and luna take 94.8 and
|
node is back in 7.1~seconds while lom and luna take 94.8 and
|
||||||
97.3~seconds respectively. The gap likely reflects the overlay's
|
97.3~seconds respectively. Yggdrasil organises its overlay as a
|
||||||
spanning-tree rebuild: a node near the root of the tree reconverges
|
distributed spanning tree rooted at the node with the highest public
|
||||||
quickly, while one further out has to wait for the topology to
|
key: every other node picks a parent closer to the root and the
|
||||||
propagate.
|
whole network hangs off that parent chain. The gap likely reflects
|
||||||
|
the cost of rebuilding that tree after a reboot: a node close to the
|
||||||
%TODO: Needs clarifications what is a "spanning tree build"
|
current root reconverges quickly, while one further out must wait
|
||||||
|
for updated parent information to propagate hop-by-hop before it
|
||||||
|
can route traffic.
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -823,14 +821,14 @@ earlier benchmarks into per-VPN diagnoses.
|
|||||||
Hyprspace produces the most severe performance collapse in the
|
Hyprspace produces the most severe performance collapse in the
|
||||||
dataset. At idle, its ping latency is a modest 1.79\,ms.
|
dataset. At idle, its ping latency is a modest 1.79\,ms.
|
||||||
Under TCP load, that number balloons to roughly 2\,800\,ms, a
|
Under TCP load, that number balloons to roughly 2\,800\,ms, a
|
||||||
1\,556$\times$ increase. This is not the network becoming
|
1\,556$\times$ increase. The network itself has capacity to spare;
|
||||||
congested; it is the VPN tunnel itself filling up with buffered
|
the VPN tunnel is filling up with buffered packets and failing to
|
||||||
packets and refusing to drain.
|
drain.
|
||||||
|
|
||||||
The consequences ripple through every TCP metric. With 4\,965
|
The consequences show in every TCP metric. With 4\,965
|
||||||
retransmits per 30-second test (one in every 200~segments), TCP
|
retransmits per 30-second test (one in every 200~segments), TCP
|
||||||
spends most of its time in congestion recovery rather than
|
spends most of its time in congestion recovery rather than
|
||||||
steady-state transfer, shrinking the max congestion window to
|
steady-state transfer. The max congestion window shrinks to
|
||||||
205\,KB, the smallest in the dataset. Under parallel load the
|
205\,KB, the smallest in the dataset. Under parallel load the
|
||||||
situation worsens: retransmits climb to 17\,426. % TODO: The
|
situation worsens: retransmits climb to 17\,426. % TODO: The
|
||||||
% explanation for the sender/receiver inversion (ACK delays
|
% explanation for the sender/receiver inversion (ACK delays
|
||||||
@@ -841,7 +839,7 @@ The buffering even
|
|||||||
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
|
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
|
||||||
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
|
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
|
||||||
cause the sender-side timer to undercount the actual data rate. The
|
cause the sender-side timer to undercount the actual data rate. The
|
||||||
UDP test never finished at all, timing out at 120~seconds.
|
UDP test never finished at all; it timed out at 120~seconds.
|
||||||
|
|
||||||
% Should we always use percentages for retransmits?
|
% Should we always use percentages for retransmits?
|
||||||
|
|
||||||
@@ -891,7 +889,7 @@ Since the benchmark targets the regular Hyprspace IPv4/IPv6
|
|||||||
addresses rather than service-network proxies, both endpoints
|
addresses rather than service-network proxies, both endpoints
|
||||||
rely on their host kernel's TCP stack for the entire transfer.
|
rely on their host kernel's TCP stack for the entire transfer.
|
||||||
Whatever options Hyprspace's gVisor instance might set
|
Whatever options Hyprspace's gVisor instance might set
|
||||||
internally — congestion control, loss recovery, buffer sizes —
|
internally (congestion control, loss recovery, buffer sizes)
|
||||||
are therefore irrelevant to these measurements; the inner TCP
|
are therefore irrelevant to these measurements; the inner TCP
|
||||||
state machine the kernel runs is the only one in the path.
|
state machine the kernel runs is the only one in the path.
|
||||||
The same caveat applies more sharply to Tailscale, where the
|
The same caveat applies more sharply to Tailscale, where the
|
||||||
@@ -900,9 +898,13 @@ stack but the benchmark traffic never reaches it; that case is
|
|||||||
the subject of Section~\ref{sec:tailscale_degraded}.
|
the subject of Section~\ref{sec:tailscale_degraded}.
|
||||||
|
|
||||||
If gVisor is out of scope, the buffer bloat must originate
|
If gVisor is out of scope, the buffer bloat must originate
|
||||||
further up the Hyprspace stack instead. The most plausible
|
further up the Hyprspace stack instead. Hyprspace uses
|
||||||
source is the libp2p / yamux stream layer through which raw IP
|
\texttt{libp2p}, a peer-to-peer networking library, and its
|
||||||
packets are funnelled. Hyprspace's TUN-read loop dispatches
|
\texttt{yamux} stream multiplexer, which runs many logical streams
|
||||||
|
over a single underlying connection and polices each one with a
|
||||||
|
credit-based flow-control window. The most plausible source of
|
||||||
|
the bloat is this libp2p/yamux layer, through which raw IP packets
|
||||||
|
are funnelled. Hyprspace's TUN-read loop dispatches
|
||||||
each outbound packet on its own goroutine, and every such
|
each outbound packet on its own goroutine, and every such
|
||||||
goroutine ends up in \texttt{node/node.go}'s
|
goroutine ends up in \texttt{node/node.go}'s
|
||||||
\texttt{sendPacket}, which keeps exactly one libp2p stream per
|
\texttt{sendPacket}, which keeps exactly one libp2p stream per
|
||||||
@@ -916,10 +918,10 @@ collapses to a single send pipeline at this layer. Each
|
|||||||
goroutine waiting for the lock pins its own 1420-byte packet
|
goroutine waiting for the lock pins its own 1420-byte packet
|
||||||
buffer, and the underlying yamux session adds a per-stream
|
buffer, and the underlying yamux session adds a per-stream
|
||||||
flow-control window on top. None of this is visible to the
|
flow-control window on top. None of this is visible to the
|
||||||
kernel TCP sender that produced the inner segments — the kernel
|
kernel TCP sender that produced the inner segments: the kernel
|
||||||
sees only that the TUN write returned — so it keeps growing
|
sees only that the TUN write returned, so it keeps growing its
|
||||||
its congestion window while the libp2p layer falls further
|
congestion window while the libp2p layer falls further behind. The
|
||||||
behind. The geometry is the textbook one for buffer bloat: a
|
geometry is the textbook one for buffer bloat: a
|
||||||
fast producer (kernel TCP) sitting upstream of a slow,
|
fast producer (kernel TCP) sitting upstream of a slow,
|
||||||
serialised consumer (the single yamux stream per peer) with
|
serialised consumer (the single yamux stream per peer) with
|
||||||
no flow-control signal coupling the two.
|
no flow-control signal coupling the two.
|
||||||
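The geometry is easy to reproduce in isolation. The sketch below is
a toy model of the pattern just described, not Hyprspace's actual
code: a fast producer hands every 1420-byte packet to its own
goroutine, and all of those goroutines serialise on one per-peer
pipeline whose backlog the producer never sees.

\begin{lstlisting}[language=Go]
// Toy model of the send-path geometry described above (illustrative,
// not Hyprspace code): one goroutine per packet, all funnelled into a
// single mutex-guarded per-peer pipeline.
package main

import (
	"fmt"
	"sync"
	"time"
)

type peerPipeline struct {
	mu sync.Mutex // one stream per peer: the single serialised consumer
}

func (p *peerPipeline) send(pkt []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	time.Sleep(50 * time.Microsecond) // stand-in for the slow stream write
}

func main() {
	pipe := &peerPipeline{}
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 5000; i++ { // the kernel TCP sender keeps producing
		wg.Add(1)
		go func() { // one goroutine per TUN packet, each pinning its buffer
			defer wg.Done()
			pipe.send(make([]byte, 1420))
		}()
	}
	wg.Wait()
	// The queueing delay behind the mutex is invisible to the producer,
	// which only ever saw its "TUN writes" return immediately.
	fmt.Printf("drained 5000 packets in %v\n", time.Since(start))
}
\end{lstlisting}

Raising the per-write delay or the packet count grows the backlog
linearly while the producer loop never slows down, which is the
defining feature of the buffer bloat described above.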
@@ -1036,10 +1038,15 @@ background.
|
|||||||
|
|
||||||
Mycelium is also the slowest VPN to recover from a reboot:
|
Mycelium is also the slowest VPN to recover from a reboot:
|
||||||
76.6~seconds on average, and almost suspiciously uniform across
|
76.6~seconds on average, and almost suspiciously uniform across
|
||||||
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to
|
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
|
||||||
a fixed convergence timer in the overlay protocol —
|
fixed convergence timer in the overlay protocol, most likely a
|
||||||
most likely a
|
default wait interval hard-coded into the reconnection logic. A
|
||||||
default interval rather than anything topology-dependent.
|
topology-dependent recovery time, by contrast, would vary with each
|
||||||
|
node's position in the overlay: a node near an active peer would
|
||||||
|
reconverge quickly while one further away would wait longer for
|
||||||
|
routing information to reach it. Mycelium shows no such variation,
|
||||||
|
so the bound is almost certainly a timer rather than a propagation
|
||||||
|
delay.
|
||||||
% TODO: Identify which Mycelium constant or default this 75-78 s
|
% TODO: Identify which Mycelium constant or default this 75-78 s
|
||||||
% recovery actually corresponds to before claiming it is a fixed
|
% recovery actually corresponds to before claiming it is a fixed
|
||||||
% timer; the source code would settle whether it is hard-coded,
|
% timer; the source code would settle whether it is hard-coded,
|
||||||
@@ -1047,49 +1054,46 @@ default interval rather than anything topology-dependent.
|
|||||||
The UDP test timed out at 120~seconds, and even first-time
|
The UDP test timed out at 120~seconds, and even first-time
|
||||||
connectivity required a 70-second wait at startup.
|
connectivity required a 70-second wait at startup.
|
||||||
|
|
||||||
% Explain what topology-dependent means in this case.
|
|
||||||
|
|
||||||
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
||||||
|
|
||||||
Tinc is a clear case of a CPU bottleneck masquerading
|
The latency subsection already traced Tinc's 336\,Mbps ceiling to
|
||||||
as a network
|
single-core CPU exhaustion. The usual network suspects do not
|
||||||
problem. At 1.19\,ms latency, packets get through the
|
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
|
||||||
tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a
|
effective UDP payload size (1\,353 bytes) and its retransmit count
|
||||||
third of the bare-metal link.
|
(240) are in the normal range. That leaves CPU: 14.9\,\%
|
||||||
The usual suspects do not apply:
|
whole-system utilization is what one saturated core looks like on
|
||||||
Tinc's effective UDP payload size (\texttt{blksize\_bytes} of
|
a multi-core host, which fits a single-threaded userspace VPN.
|
||||||
1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and
|
The parallel benchmark confirms the diagnosis. Tinc scales to
|
||||||
WireGuard at 1\,368) is in the normal range, and its retransmit
|
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
|
||||||
count (240) is moderate. What limits Tinc is its
|
Several concurrent TCP streams keep that one core busy through
|
||||||
single-threaded
|
the gaps a single flow would leave idle, and the extra work
|
||||||
userspace architecture: one CPU core simply cannot
|
translates directly into extra throughput.
|
||||||
encrypt, copy,
|
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
|
||||||
and forward packets fast enough to fill the pipe.
|
% unresolved CPU-profiling TODO from the latency subsection
|
||||||
|
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
|
||||||
% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the
|
% profiling refutes the single-core story, this paragraph must
|
||||||
% Tinc CPU bottleneck
|
% be revisited as well.
|
||||||
% diagnosis from above, but the 14.9% CPU figure has
|
|
||||||
% an unresolved TODO
|
|
||||||
% (the same utilization as VpnCloud at 539 Mbps). If
|
|
||||||
% the CPU claim is
|
|
||||||
% revised or refuted, this confirmation must be updated too.
|
|
||||||
The parallel benchmark confirms this diagnosis. Tinc scales to
|
|
||||||
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
|
|
||||||
Multiple TCP streams collectively keep that single
|
|
||||||
core busy during
|
|
||||||
what would otherwise be idle gaps in any individual
|
|
||||||
flow, squeezing
|
|
||||||
out throughput that no single stream could reach alone.
|
|
||||||
|
|
||||||
\section{Impact of network impairment}
|
\section{Impact of network impairment}
|
||||||
\label{sec:impairment}
|
\label{sec:impairment}
|
||||||
|
|
||||||
Baseline benchmarks rank VPNs by overhead under ideal
|
Baseline benchmarks rank VPNs by overhead under ideal
|
||||||
conditions.
|
conditions. The impairment profiles in
|
||||||
The impairment profiles in
|
Table~\ref{tab:impairment_profiles} test a different property:
|
||||||
Table~\ref{tab:impairment_profiles} test
|
resilience. Each profile applies symmetric \texttt{tc netem}
|
||||||
a different property: resilience. Two results
|
impairment to every machine. Low adds roughly 2\,ms of delay and
|
||||||
dominate the data.
|
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
|
||||||
|
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
|
||||||
|
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
|
||||||
|
reordering. Medium and High both use 50\,\% correlation, so
|
||||||
|
losses and reorderings are bursty rather than uniform. Two
|
||||||
|
results dominate the data.
|
||||||
|
% TODO: Double-check these per-profile parameters against the
|
||||||
|
% canonical impairment-profile definitions in the earlier chapter
|
||||||
|
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
|
||||||
|
% delay numbers are cross-checked against later prose in this
|
||||||
|
% chapter, but the correlation and jitter values should be
|
||||||
|
% verified against the authoritative profile definition.
|
||||||
|
|
||||||
The first is the collapse of the throughput hierarchy. At High
|
The first is the collapse of the throughput hierarchy. At High
|
||||||
impairment, the 675\,Mbps spread between fastest and slowest
|
impairment, the 675\,Mbps spread between fastest and slowest
|
||||||
@@ -1106,8 +1110,8 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
|
|||||||
through what turns out to be the wrong hypothesis. The
|
through what turns out to be the wrong hypothesis. The
|
||||||
investigation begins with Tailscale's much-discussed gVisor TCP
|
investigation begins with Tailscale's much-discussed gVisor TCP
|
||||||
stack, validates the candidate parameters in isolation on the
|
stack, validates the candidate parameters in isolation on the
|
||||||
bare-metal host, and only then discovers — by reading the rig's
|
bare-metal host, and only then discovers, by reading the rig's
|
||||||
own NixOS module — that the gVisor stack is not actually in the
|
own NixOS module, that the gVisor stack is not actually in the
|
||||||
data path of the benchmark at all. The real culprit is a
|
data path of the benchmark at all. The real culprit is a
|
||||||
combination of the Linux kernel's tight default
|
combination of the Linux kernel's tight default
|
||||||
\texttt{tcp\_reordering} threshold and the way
|
\texttt{tcp\_reordering} threshold and the way
|
||||||
@@ -1313,6 +1317,16 @@ every lost or reordered outer packet costs roughly
|
|||||||
retransmitted inner data than a standard 1\,400-byte
|
retransmitted inner data than a standard 1\,400-byte
|
||||||
MTU VPN would
|
MTU VPN would
|
||||||
lose.
|
lose.
|
||||||
|
% TODO: The jumbo-MTU-as-liability argument is reused in several
|
||||||
|
% places (TCP impairment, QUIC impairment, RIST video, and
|
||||||
|
% §sec:baseline Tier analysis). In each it is presented as a
|
||||||
|
% mechanism rather than a measurement. Consider running one
|
||||||
|
% controlled experiment --- force Yggdrasil to a standard
|
||||||
|
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
|
||||||
|
% profiles --- to test the hypothesis directly, or consolidate
|
||||||
|
% the argument into a single "jumbo-MTU liability" paragraph and
|
||||||
|
% cite it from the other sections instead of restating the
|
||||||
|
% mechanism each time.
|
||||||
|
|
||||||
Headscale retains 34.3\% of its baseline throughput
|
Headscale retains 34.3\% of its baseline throughput
|
||||||
at Low, almost
|
at Low, almost
|
||||||
@@ -1444,6 +1458,15 @@ indicator than as a throughput measurement. A VPN that cannot
|
|||||||
complete a 30-second UDP flood under 0.25\% packet loss has a
|
complete a 30-second UDP flood under 0.25\% packet loss has a
|
||||||
flow-control problem that will surface under real workloads too,
|
flow-control problem that will surface under real workloads too,
|
||||||
even when the symptoms are milder.
|
even when the symptoms are milder.
|
||||||
|
% TODO: Non-monotonic failure pattern (Internal and WireGuard
|
||||||
|
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
|
||||||
|
% fail selectively) is never explained and directly undermines
|
||||||
|
% the "robustness indicator" framing above. Reproduce one of
|
||||||
|
% the failing Low-profile runs with iPerf3 debug logging and
|
||||||
|
% \texttt{tc -s qdisc show} to establish whether these are VPN
|
||||||
|
% flow-control failures, iPerf3/tc interaction artefacts, or
|
||||||
|
% timing issues; then either explain the pattern or soften the
|
||||||
|
% robustness-indicator claim.
|
||||||
|
|
||||||
\subsection{Parallel TCP}
|
\subsection{Parallel TCP}
|
||||||
|
|
||||||
@@ -1552,10 +1575,10 @@ At High impairment, WireGuard (23.2\,Mbps), VpnCloud
|
|||||||
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
|
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
|
||||||
0.4\,Mbps of one another. At baseline these four
|
0.4\,Mbps of one another. At baseline these four
|
||||||
span a 188\,Mbps
|
span a 188\,Mbps
|
||||||
range (656 to 844\,Mbps). QUIC's own congestion
|
range (656 to 844\,Mbps). At this point QUIC's own congestion
|
||||||
control, running on
|
control is the sole limiter: it runs on top of an
|
||||||
top of an already-degraded outer link, has become the
|
already-degraded outer link and cannot push past
|
||||||
sole limiter.
|
${\sim}$23\,Mbps regardless of the VPN underneath.
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -1742,6 +1765,20 @@ Section~\ref{sec:tailscale_degraded} explains why.
|
|||||||
\section{Tailscale under degraded conditions}
|
\section{Tailscale under degraded conditions}
|
||||||
\label{sec:tailscale_degraded}
|
\label{sec:tailscale_degraded}
|
||||||
|
|
||||||
|
% TODO: Editorial pass needed on two chapter-wide issues before
|
||||||
|
% submission:
|
||||||
|
% (1) magicsock / wireguard-go userspace-datapath explanation is
|
||||||
|
% repeated three times in slightly different forms (once in
|
||||||
|
% baseline UDP, once in impairment UDP, once here). Consider
|
||||||
|
% introducing it once in full here, where it is load-bearing,
|
||||||
|
% and replacing the earlier occurrences with one-sentence
|
||||||
|
% forward references.
|
||||||
|
% (2) This section uses first-person plural ("we pursued", "we
|
||||||
|
% worked it out", "we ran two follow-up benchmarks") while
|
||||||
|
% the rest of the chapter is in impersonal voice. Either
|
||||||
|
% harmonise everything to one voice, or explicitly frame this
|
||||||
|
% section as a first-person narrative detour.
|
||||||
|
|
||||||
This section is about an observation that should not exist:
|
This section is about an observation that should not exist:
|
||||||
Headscale, a tunnelling VPN built on a kernel TCP stack and
|
Headscale, a tunnelling VPN built on a kernel TCP stack and
|
||||||
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
|
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
|
||||||
@@ -1753,7 +1790,7 @@ chasing the obvious answer to its end.
|
|||||||
\subsection{An anomaly worth pursuing}
|
\subsection{An anomaly worth pursuing}
|
||||||
|
|
||||||
At Medium impairment, Headscale reaches 41.5\,Mbps on a single
|
At Medium impairment, Headscale reaches 41.5\,Mbps on a single
|
||||||
TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for
|
TCP stream against Internal's 29.6\,Mbps, a 40\,\% lead for
|
||||||
the VPN over the direct host-to-host link it tunnels through.
|
the VPN over the direct host-to-host link it tunnels through.
|
||||||
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
|
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
|
||||||
Low and High impairment it lags Internal by some margin. Yet at
|
Low and High impairment it lags Internal by some margin. Yet at
|
||||||
@@ -1837,12 +1874,12 @@ imports Google's gVisor netstack
|
|||||||
it as an in-process TCP implementation. The gVisor
|
it as an in-process TCP implementation. The gVisor
|
||||||
documentation is direct about why this matters: netstack is
|
documentation is direct about why this matters: netstack is
|
||||||
designed for adverse networks where the host kernel's TCP
|
designed for adverse networks where the host kernel's TCP
|
||||||
defaults are too aggressive. Tailscale's release notes go
|
defaults are too aggressive. Tailscale's release notes go further
|
||||||
further, calling out specific overrides on top of gVisor — the
|
and name specific overrides
|
||||||
most visible being an explicit RACK disable and 8\,MiB / 6\,MiB
|
on top of gVisor; the most visible are an explicit RACK disable
|
||||||
receive and send buffers.
|
and 8\,MiB / 6\,MiB receive and send buffers.
|
||||||
|
|
||||||
Reading Tailscale's source confirms it.
|
The Tailscale source code bears this out.
|
||||||
\texttt{wgengine/netstack/netstack.go} contains the netstack
|
\texttt{wgengine/netstack/netstack.go} contains the netstack
|
||||||
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
|
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
|
||||||
reproduces the relevant overrides verbatim. RACK is disabled
|
reproduces the relevant overrides verbatim. RACK is disabled
|
||||||
@@ -1863,25 +1900,22 @@ enabled (gVisor's default is off).
|
|||||||
\texttt{wgengine/netstack/netstack.go}.
|
\texttt{wgengine/netstack/netstack.go}.
|
||||||
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
|
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
|
||||||
|
|
||||||
Read against the Linux kernel defaults — RACK on, CUBIC by
|
Read against the Linux kernel defaults (RACK on, CUBIC by
|
||||||
default, ${\sim}$1\,MiB receive and send buffers,
|
default, ${\sim}$1\,MiB receive and send buffers,
|
||||||
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these
|
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
|
||||||
overrides describe a TCP stack better suited to a lossy,
|
overrides describe a TCP stack better suited to a lossy,
|
||||||
reordering link than the host kernel. The hypothesis writes
|
reordering link than the host kernel. The hypothesis follows
|
||||||
itself: Headscale's iPerf3 traffic is processed
|
directly: Headscale's iPerf3 traffic
|
||||||
by this gVisor
|
runs through this gVisor instance instead of through the host
|
||||||
instance instead of by the host kernel TCP stack, and so it
|
kernel TCP stack, and so it inherits the more
|
||||||
inherits the more reordering-tolerant behaviour.
|
reordering-tolerant behaviour. WireGuard-the-kernel-module
|
||||||
WireGuard-the-kernel-module shares only the cryptographic
|
shares only the cryptographic protocol; it does not include
|
||||||
protocol; it does not get the gVisor stack, and
|
the gVisor stack, and therefore does not get the advantage.
|
||||||
therefore does
|
|
||||||
not get the advantage.
|
|
||||||
|
|
||||||
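The host-kernel half of that comparison is easy to verify on the
test machines themselves. A minimal sketch that reads the relevant
defaults (standard Linux \texttt{/proc/sys} paths; a convenience for
the reader, not part of the benchmark rig):

\begin{lstlisting}[language=Go]
// Print the host-kernel TCP defaults discussed above.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	keys := []string{
		"net/ipv4/tcp_recovery",           // bit 0: RACK loss detection
		"net/ipv4/tcp_congestion_control", // cubic by default
		"net/ipv4/tcp_reordering",         // default 3 segments
		"net/ipv4/tcp_early_retrans",      // early retransmit / tail loss probe
		"net/ipv4/tcp_rmem",               // min/default/max receive buffer
		"net/ipv4/tcp_wmem",               // min/default/max send buffer
	}
	for _, k := range keys {
		v, err := os.ReadFile("/proc/sys/" + k)
		if err != nil {
			continue // key absent on this kernel
		}
		fmt.Printf("%-34s %s\n", k, strings.TrimSpace(string(v)))
	}
}
\end{lstlisting}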
It is a clean story. The natural way to test it
|
The natural way to test this is to extract
|
||||||
is to extract
|
|
||||||
the parameters Tailscale sets inside gVisor, apply their
|
the parameters Tailscale sets inside gVisor, apply their
|
||||||
nearest Linux equivalents to the bare-metal host as sysctls,
|
nearest Linux equivalents to the bare-metal host as sysctls,
|
||||||
and see whether Internal — with no VPN at all — picks up the
|
and see whether Internal, with no VPN at all, picks up the
|
||||||
same advantage. If it does, the gVisor explanation is
|
same advantage. If it does, the gVisor explanation is
|
||||||
supported. If it does not, the hypothesis fails.
|
supported. If it does not, the hypothesis fails.
|
||||||
|
|
||||||
@@ -1951,21 +1985,18 @@ impairment setup as the original 18.12.2025 run.
|
|||||||
|
|
||||||
The result felt like confirmation. Internal's
|
The result felt like confirmation. Internal's
|
||||||
Medium-impairment throughput jumped from 29.6\,Mbps to
|
Medium-impairment throughput jumped from 29.6\,Mbps to
|
||||||
72.7\,Mbps under the reorder-only configuration — a 146\,\%
|
72.7\,Mbps under the reorder-only configuration, a 146\,\%
|
||||||
increase from a three-line sysctl change — and
|
increase from a three-line sysctl change, and the retransmit
|
||||||
the retransmit
|
|
||||||
rate at Medium dropped from ${\sim}$2.4\,\% to
|
rate at Medium dropped from ${\sim}$2.4\,\% to
|
||||||
1.11\,\%, which
|
1.11\,\%, which
|
||||||
means more than half of the original retransmissions were
|
means more than half of the original retransmissions were
|
||||||
spurious. The Nix cache download at Medium roughly halved,
|
spurious. The Nix cache download at Medium roughly halved,
|
||||||
from 58.6\,s to 29.1\,s.
|
from 58.6\,s to 29.1\,s.
|
||||||
|
|
||||||
Parallel TCP gained more. Internal at Low
|
Parallel TCP gained even more. Internal at Low climbed from
|
||||||
climbed from 277 to
|
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
|
||||||
902\,Mbps, a 226\,\% increase that not only
|
old single-stream best and overtakes Headscale's original
|
||||||
exceeds Internal's
|
718\,Mbps from the unmodified run. %
|
||||||
old single-stream best but actually overtakes Headscale's
|
|
||||||
original 718\,Mbps from the unmodified run. %
|
|
||||||
% TODO: DOWNSTREAM
|
% TODO: DOWNSTREAM
|
||||||
% DEPENDENCY — "six concurrent flows" inherits
|
% DEPENDENCY — "six concurrent flows" inherits
|
||||||
% the unresolved
|
% the unresolved
|
||||||
@@ -2024,9 +2055,10 @@ the kernel to gVisor reproduces the effect. Then we checked
|
|||||||
which Tailscale code path the test rig was actually running.
|
which Tailscale code path the test rig was actually running.
|
||||||
|
|
||||||
\subsection{The data path that was not there}
|
\subsection{The data path that was not there}
|
||||||
|
\label{sec:gvisor_not_in_path}
|
||||||
|
|
||||||
In default mode — what anyone running \texttt{tailscale up}
|
In default mode (what anyone running \texttt{tailscale up}
|
||||||
on a Linux host gets — the Tailscale client creates a real
|
on a Linux host gets), the Tailscale client creates a real
|
||||||
kernel TUN device, registers a route for the
|
kernel TUN device, registers a route for the
|
||||||
Tailscale subnet
|
Tailscale subnet
|
||||||
through it, and forwards inbound and outbound
|
through it, and forwards inbound and outbound
|
||||||
@@ -2054,20 +2086,19 @@ running inside \texttt{tailscaled} itself (Tailscale SSH,
|
|||||||
Taildrop, the metric endpoint). External processes such as
|
Taildrop, the metric endpoint). External processes such as
|
||||||
iPerf3 cannot reach the Tailscale network in that mode.
|
iPerf3 cannot reach the Tailscale network in that mode.
|
||||||
|
|
||||||
The test rig does not use that mode. As shown in
|
The test rig does not use that mode. The benchmark suite's
|
||||||
Listing~\ref{lst:rig_interface_name}, the benchmark
|
Headscale module sets the interface name to
|
||||||
suite's Headscale module sets the interface name to
|
\texttt{ts-\$\{instanceName\}}
|
||||||
\texttt{ts-\$\{instanceName\}}, resolving to
|
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
|
||||||
\texttt{tailscaled --tun ts-headscale}: a real kernel
|
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
|
||||||
TUN. gVisor netstack is therefore unreachable from
|
External benchmark traffic cannot reach gVisor netstack at all.
|
||||||
external benchmark traffic.
|
|
||||||
|
|
||||||
|
|
||||||
\lstinputlisting[language=Nix,caption={The
|
\lstinputlisting[language=Nix,caption={The
|
||||||
benchmark suite's
|
benchmark suite's
|
||||||
Headscale module sets \texttt{interfaceName} to a real kernel
|
Headscale module sets \texttt{interfaceName} to a real kernel
|
||||||
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
|
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
|
||||||
This means \texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale}
|
This means \texttt{tailscaled} runs as \texttt{tailscaled --tun
|
||||||
|
ts-headscale}
|
||||||
on every test machine.
|
on every test machine.
|
||||||
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
|
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
|
||||||
|
|
||||||
@@ -2075,7 +2106,7 @@ The empirical fingerprint pins the same conclusion down without
|
|||||||
source-code reading. Headscale itself gained +21\,\% at Medium
|
source-code reading. Headscale itself gained +21\,\% at Medium
|
||||||
from the host-kernel sysctl tuning. If Headscale's iPerf3
|
from the host-kernel sysctl tuning. If Headscale's iPerf3
|
||||||
traffic were processed by gVisor netstack, host-kernel sysctls
|
traffic were processed by gVisor netstack, host-kernel sysctls
|
||||||
would change nothing — they configure the host kernel TCP stack
|
would change nothing; they configure the host kernel TCP stack
|
||||||
and only the host kernel TCP stack. The fact that Headscale moves
|
and only the host kernel TCP stack. The fact that Headscale moves
|
||||||
measurably under those sysctls is direct evidence that
|
measurably under those sysctls is direct evidence that
|
||||||
Headscale's application TCP runs on the host kernel stack, just
|
Headscale's application TCP runs on the host kernel stack, just
|
||||||
@@ -2097,8 +2128,8 @@ the gVisor TCP business at all.
|
|||||||
The puzzle the investigation began with has not gone away.
|
The puzzle the investigation began with has not gone away.
|
||||||
Headscale starts at 41.5\,Mbps where Internal starts at
|
Headscale starts at 41.5\,Mbps where Internal starts at
|
||||||
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
||||||
TCP stack. Whatever Headscale is doing — partially, weakly, but
|
TCP stack. Whatever Headscale is doing (partially, weakly, but
|
||||||
reproducibly — is worth roughly twelve megabits per second on the
|
reproducibly) is worth roughly twelve megabits per second on the
|
||||||
Medium profile, and it is not gVisor netstack.
|
Medium profile, and it is not gVisor netstack.
|
||||||
|
|
||||||
The +21\,\% sysctl gain for Headscale itself is also informative
|
The +21\,\% sysctl gain for Headscale itself is also informative
|
||||||
@@ -2143,8 +2174,8 @@ The second is the 7\,MiB outer-UDP socket buffer that
|
|||||||
\texttt{SO\_*BUFFORCE} variant where available so the value is
|
\texttt{SO\_*BUFFORCE} variant where available so the value is
|
||||||
honoured even past \texttt{net.core.rmem\_max}. The host kernel
|
honoured even past \texttt{net.core.rmem\_max}. The host kernel
|
||||||
default is in the low hundreds of KiB. Under burst-correlated
|
default is in the low hundreds of KiB. Under burst-correlated
|
||||||
impairment — Medium and High both use 50\,\% correlation, so
|
impairment (Medium and High both use 50\,\% correlation, so
|
||||||
losses and reorderings cluster — this larger buffer absorbs
|
losses and reorderings cluster), this larger buffer absorbs
|
||||||
spikes in arrival rate that would otherwise overflow the kernel
|
spikes in arrival rate that would otherwise overflow the kernel
|
||||||
UDP receive queue and surface as additional inner-TCP losses.
|
UDP receive queue and surface as additional inner-TCP losses.
|
||||||
Internal has no such cushion on its incoming wire path.
|
Internal has no such cushion on its incoming wire path.
|
||||||
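For readers unfamiliar with the \texttt{*BUFFORCE} socket options,
the sketch below shows the mechanism on a plain UDP socket. It is an
illustration of the socket option, not wireguard-go's code; the
7\,MiB value is simply the figure quoted above, and the port number
is arbitrary.

\begin{lstlisting}[language=Go]
// Illustration of SO_RCVBUFFORCE: with CAP_NET_ADMIN it sets a receive
// buffer above net.core.rmem_max, where plain SO_RCVBUF would be clamped.
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	const size = 7 << 20 // the 7 MiB outer-socket buffer quoted above
	raw.Control(func(fd uintptr) {
		if err := unix.SetsockoptInt(int(fd), unix.SOL_SOCKET,
			unix.SO_RCVBUFFORCE, size); err != nil {
			// Without CAP_NET_ADMIN, fall back to the clamped variant.
			unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, size)
		}
	})
}
\end{lstlisting}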
@@ -2244,16 +2275,15 @@ the bare-metal host more than half of its achievable throughput.
|
|||||||
Three lines of \texttt{sysctl} repair it. The fix is portable to
|
Three lines of \texttt{sysctl} repair it. The fix is portable to
|
||||||
any Linux host and entirely independent of any VPN.
|
any Linux host and entirely independent of any VPN.
|
||||||
|
|
||||||
The unresilient finding — the one that motivated us to write this
|
The less durable finding, and the one that motivated this section,
|
||||||
section in the first place — is that Tailscale's much-discussed
|
is that Tailscale's much-discussed userspace TCP stack is not in
|
||||||
userspace TCP stack is, for the workload that exposed the
|
the data path for the workload that exposed the anomaly. The
|
||||||
anomaly, sitting on the bench. The advantage we attributed to it
|
advantage we attributed to it comes from a more ordinary place:
|
||||||
must come from a more ordinary place: the way
|
the way \texttt{wireguard-go} batches and coalesces packets
|
||||||
\texttt{wireguard-go} batches and coalesces packets between the
|
between the wire and the kernel TCP stack, and the larger UDP
|
||||||
wire and the kernel TCP stack, and the larger UDP buffer it pins
|
buffer it pins on its outer socket. We were chasing the wrong
|
||||||
on its outer socket. We were chasing the wrong hypothesis with
|
hypothesis with the right experiment, and the experiment turned
|
||||||
the right experiment, and the experiment turned out to be more
|
out to be more useful than the hypothesis.
|
||||||
useful than the hypothesis.
|
|
||||||
|
|
||||||
% TODO: These sections are empty stubs but the chapter
|
% TODO: These sections are empty stubs but the chapter
|
||||||
% introduction (line 12--13) promises "findings from the source
|
% introduction (line 12--13) promises "findings from the source
|
||||||
|
|||||||