diff --git a/Chapters/Results.tex b/Chapters/Results.tex index 5901312..9da7ac8 100644 --- a/Chapters/Results.tex +++ b/Chapters/Results.tex @@ -9,10 +9,9 @@ ten VPN implementations and the internal baseline. The structure follows the impairment profiles from ideal to degraded: Section~\ref{sec:baseline} establishes overhead under ideal conditions, then subsequent sections examine how each VPN responds to -increasing network impairment. The chapter concludes with findings -from the source code analysis. A recurring theme is that no single -metric captures VPN -performance; the rankings shift +increasing network impairment, with source-code excerpts woven in +where they explain the measured behaviour. A recurring theme is +that no single metric captures VPN performance; the rankings shift depending on whether one measures throughput, latency, retransmit behavior, or real-world application performance. @@ -26,38 +25,32 @@ the VPN itself. Throughout the plots in this section, the in the path; it represents the best the hardware can do. On its own, this link delivers 934\,Mbps on a single TCP stream and a round-trip latency of just -0.60\,ms. WireGuard comes remarkably close to these numbers, reaching -92.5\,\% of bare-metal throughput with only a single retransmit across -an entire 30-second test. Mycelium sits at the other extreme, adding -34.9\,ms of latency, roughly 58$\times$ the bare-metal figure. +0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a +single retransmit across an entire 30-second test. Mycelium sits at +the other extreme: 34.9\,ms of latency, roughly 58$\times$ the +bare-metal figure. A note on naming: ``Headscale'' in every table and figure of this chapter labels the test scenario in which the Tailscale client (\texttt{tailscaled}) connects to a self-hosted Headscale control server. The data plane is therefore the Tailscale client built on \texttt{wireguard-go}, not the Headscale binary itself, which is -only a control-plane server. The test rig launches -\texttt{tailscaled} via the NixOS \texttt{services.tailscale} -module with \texttt{interfaceName = "ts-headscale"}, which -translates to \texttt{--tun ts-headscale}; this means the Tailscale -client uses a real kernel TUN device and the host kernel's TCP/IP -stack handles every tunneled packet. The alternate -\texttt{--tun=userspace-networking} mode, in which gVisor netstack -terminates tunneled TCP inside the \texttt{tailscaled} process, is -\emph{not} engaged in any of the benchmarks reported here. -Statements below about ``Headscale'' running \texttt{wireguard-go} -should be read as statements about the Tailscale client in this -scenario. +only a control-plane server. Statements below about ``Headscale'' +running \texttt{wireguard-go} should be read as statements about +the Tailscale client in this scenario. +Section~\ref{sec:tailscale_degraded} covers the specifics of how +the rig launches \texttt{tailscaled} and which Tailscale code +paths that choice activates. \subsection{Test Execution Overview} Running the full baseline suite across all ten VPNs and the internal -reference took just over four hours. The bulk of that time, about -2.6~hours (63\,\%), was spent on actual benchmark execution; VPN -installation and deployment accounted for another 45~minutes (19\,\%), -and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come -up after restarts. The remaining time was consumed by VPN service restarts -and traffic-control (tc) stabilization. +reference took just over four hours. 
Actual benchmark execution +consumed the bulk of that time at 2.6~hours (63\,\%). VPN +installation and deployment accounted for another 45~minutes +(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting +for VPN tunnels to come up after restarts. VPN service restarts and +traffic-control (tc) stabilization took the remainder. Figure~\ref{fig:test_duration} breaks this down per VPN. Most VPNs completed every benchmark without issues, but four failed @@ -146,8 +139,8 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates needs. ZeroTier compensates for tunnel-internal packet loss by repeatedly triggering TCP congestion-control recovery, whereas WireGuard delivers data with negligible in-tunnel loss. The -bare-metal Internal reference sits at 1.7~retransmits per test — -essentially noise — and the VPNs split into three groups around +bare-metal Internal reference sits at 1.7~retransmits per test, +essentially noise, and the VPNs split into three groups around it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale), \emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace). @@ -187,10 +180,10 @@ and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace). \end{figure} Retransmits have a direct mechanical relationship with TCP congestion -control. Each retransmit triggers a reduction in the congestion window -(\texttt{cwnd}), throttling the sender. -This relationship is visible -in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965 +control: each one triggers a reduction in the congestion window +(\texttt{cwnd}) and throttles the sender. +Figure~\ref{fig:retransmit_correlations} shows the relationship: +Hyprspace, with 4965 retransmits, maintains the smallest max congestion window in the dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB window, the largest of any VPN. At first glance this suggests a @@ -200,24 +193,31 @@ largely an artifact of its jumbo overlay MTU (32\,731 bytes): each segment carries far more data, so the window in bytes is inflated relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing congestion windows across different MTU sizes is not meaningful -without normalizing for segment size. What \emph{is} clear is that -high retransmit rates force TCP to spend more time in congestion -recovery than in steady-state transmission, capping throughput -regardless of available bandwidth. ZeroTier illustrates the -opposite extreme: brute-force retransmission can still yield high -throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted -bandwidth and unstable flow behavior. +without normalizing for segment size. The reliable conclusion is +simpler: high retransmit rates force TCP to spend more time in +congestion recovery than in steady-state transmission, and that +caps throughput regardless of available bandwidth. ZeroTier +illustrates the opposite extreme: brute-force retransmission can +still yield high throughput (814\,Mbps with 1\,163 retransmits), at +the cost of wasted bandwidth and unstable flow behavior. -VpnCloud stands out: its sender reports 538.8\,Mbps -but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest -in the dataset). This suggests significant in-tunnel packet loss or -buffering at the VpnCloud layer that the retransmit count (857) +VpnCloud stands out: its sender reports 538.8\,Mbps but the +receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest +in the dataset. 
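+% NOTE (worked example for the MTU-normalisation point above; rough,
+% using the quoted overlay MTU as the segment size and ignoring
+% header overhead): Yggdrasil's 4.3\,MB window over a 32\,731-byte
+% MTU is about 4,300,000 / 32,731, i.e. roughly 130 segments, while
+% Hyprspace's 205\,KB window over a ~1,400-byte MTU is about
+% 205,000 / 1,400, i.e. roughly 145 segments. Expressed in segments
+% the two windows are comparable; the ~21x gap in bytes is almost
+% entirely segment size, which is exactly why the byte figures are
+% not comparable across MTUs.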
This points to significant in-tunnel packet loss +or buffering at the VpnCloud layer that the retransmit count (857) alone does not fully explain. +% TODO: Clarify whether the headline TCP table +% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports +% sender or receiver throughput. The prose here cites sender +% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender +% column, so the table caption should say so explicitly. Same +% clarification needed for Hyprspace (368 in table vs.\ sender +% 367.9 / receiver 419.8 in the pathological-cases paragraph). -Variability — whether stochastic across runs or systematic across -links — also differs substantially. WireGuard's three link -directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window), -behaving almost identically. Mycelium's three directions span +Variability, whether stochastic across runs or systematic across +links, also differs substantially. WireGuard's three link +directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window) +and are nearly indistinguishable. Mycelium's three directions span 122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise: Section~\ref{sec:mycelium_routing} shows the spread is per-link path-selection asymmetry, with one link finding a direct route and @@ -315,25 +315,21 @@ interference that the average hides. Tinc presents a paradox: it has the third-lowest latency (1.19\,ms) but only the second-lowest throughput (336\,Mbps). Packets traverse -the tunnel quickly, yet single-threaded userspace processing cannot -keep up with the link speed. The qperf benchmark backs this up: Tinc -maxes out at -14.9\,\% total system CPU while delivering just 336\,Mbps. -% TODO: 14.9\% total CPU does not obviously indicate a bottleneck. +the tunnel quickly, yet something caps the overall rate. The qperf +benchmark reports Tinc maxing out at 14.9\,\% total system CPU while +delivering 336\,Mbps. On a multi-core host this figure is consistent +with a single saturated core, which fits Tinc's single-threaded +userspace architecture: one core encrypts, copies, and forwards +packets, and the remaining cores sit idle. But VpnCloud reports the +same 14.9\,\% and still reaches 539\,Mbps (60\,\% more than Tinc), +so whole-system CPU alone cannot explain the gap, and a per-packet +processing cost difference must also be in play. +% TODO: 14.9\% total CPU does not pin the bottleneck on its own. % This is whole-system utilization on a multi-core machine, and a % single saturated core fits the budget — but VpnCloud reports the -% same 14.9\% \emph{and} reaches 539\,Mbps, much more than Tinc. -% The single-saturated-core story alone therefore cannot explain -% the throughput gap; per-packet processing cost must differ -% materially between the two. Verify with per-thread CPU sampling -% or eBPF profiling. -On a multi-core system, this low percentage is consistent with a -single saturated core (and Tinc is single-threaded), which would -explain why the CPU rather than the network is the bottleneck. -The story is incomplete, however: VpnCloud shows the same 14.9\,\% -total system CPU yet delivers 539\,Mbps — 60\,\% more than Tinc — -so a difference in per-packet processing cost between the two -implementations must also be in play. +% same 14.9\% \emph{and} reaches 539\,Mbps. Verify with per-thread +% CPU sampling or eBPF profiling to confirm the single-core story +% and quantify the per-packet cost difference. Figure~\ref{fig:latency_throughput} makes this disconnect easy to spot. 
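+To make the single-core reading of Tinc's 14.9\,\% concrete:
+whole-system utilisation spreads one busy core across every logical
+core, so on an $N$-core host a single saturated thread shows up as
+$100/N$ per cent. The chapter does not restate the core count of
+the test machines, so the figure below is an illustration rather
+than a measurement:
+\begin{equation*}
+  \frac{100\,\%}{N} \approx 14.9\,\% \quad\Longrightarrow\quad N \approx 6.7 .
+\end{equation*}
+In other words, 14.9\,\% of whole-system CPU is roughly what one
+fully pinned core looks like on a machine in the six-to-eight-core
+class, which is the reading the single-threaded interpretation of
+Tinc relies on.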
@@ -346,9 +342,9 @@ spot. The qperf measurements also reveal a wide spread in CPU usage. Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as much CPU as Internal's -9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a -kernel-level implementation, presumably due to in-kernel -cryptographic processing. +9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a +kernel-level implementation; in-kernel cryptographic processing +is the likely cause, though no profiling data confirms this. On the efficient end, VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least CPU time. Nebula and Headscale are missing from @@ -416,8 +412,10 @@ Table~\ref{tab:parallel_scaling} lists the results. The VPNs that gain the most are those most constrained in single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream -can never fill the pipe: the bandwidth-delay product demands a window -larger than any single flow maintains, so multiple concurrent flows +can never fill the pipe: the bandwidth-delay product (the amount + of in-flight data a TCP flow needs to saturate a link, equal to the +link bandwidth times the round-trip time) demands a window larger +than any single flow maintains, so multiple concurrent flows compensate for that constraint and push throughput to 2.20$\times$ the single-stream figure. Hyprspace scales almost as well (2.18$\times$) for the same reason but with a different @@ -425,7 +423,7 @@ bottleneck. Its libp2p send pipeline accumulates roughly 2\,800\,ms of under-load latency (Section~\ref{sec:hyprspace_bloat}), which gives any single TCP flow a bandwidth-delay product on the order of hundreds of -megabytes to fill — far beyond any single kernel cwnd. And +megabytes to fill, far beyond any single kernel cwnd. And because Hyprspace keys \texttt{activeStreams} by destination \texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three concurrent peer pairs in the parallel benchmark each get @@ -440,8 +438,9 @@ more of the bloated pipeline than one can. % Listing~\ref{lst:hyprspace_sendpacket}, but neither the % per-flow window evolution nor the actual under-load latency % has been measured directly. A tcpdump of one Hyprspace -% iPerf3 run with inter-arrival timing analysis would settle -% it. Tinc picks up a +% iPerf3 run with inter-arrival timing analysis would settle it. + +Tinc picks up a 1.68$\times$ boost because several streams can collectively keep its single-threaded CPU busy during what would otherwise be idle gaps in a single flow. @@ -449,9 +448,9 @@ a single flow. % TODO: "zero retransmits" in parallel mode is not shown in any table % or figure. Add parallel-mode retransmit data or remove the claim. WireGuard and Internal both scale cleanly at around -1.48--1.50$\times$ with zero retransmits, suggesting that -WireGuard's overhead is a fixed per-packet cost that does not worsen -under multiplexing. +1.48--1.50$\times$ with zero retransmits. This is consistent +with WireGuard's overhead being a fixed per-packet cost that does +not worsen under multiplexing. Nebula is the only VPN that actually gets \emph{slower} with more streams: throughput drops from 706\,Mbps to 648\,Mbps @@ -498,8 +497,9 @@ The sender throughput values are artifacts: they reflect how fast the sender can write to the socket, not how fast data traverses the tunnel. 
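+% NOTE (worked number for the bandwidth-delay-product point in the
+% parallel-scaling discussion above): at the 934\,Mbps bare-metal
+% link rate, Mycelium's 34.9\,ms RTT implies a BDP of roughly
+% 0.934\,Gbit/s x 0.0349\,s ~= 33\,Mbit ~= 4\,MB of in-flight data.
+% That is at the edge of what a default-tuned Linux TCP flow will
+% typically sustain (stock tcp_wmem tops out around 4\,MiB), so a
+% lone stream cannot keep the pipe full and parallel streams pick up
+% the slack. Rough arithmetic only; buffer autotuning and loss make
+% the effective window smaller still.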
Yggdrasil, for example, reports 63,744\,Mbps sender throughput because it uses a 32,731-byte block size (a jumbo-frame -overlay MTU), inflating the apparent rate per \texttt{send()} system -call. Only the receiver throughput is meaningful. +overlay MTU), which inflates the apparent rate per +\texttt{send()} system call. Only the receiver throughput is +meaningful. \begin{table}[H] \centering @@ -537,16 +537,19 @@ because the sender overwhelms the tunnel's userspace processing capacity. Headscale shares WireGuard's cryptographic protocol but, contrary to intuition, does not share its kernel datapath: Tailscale's \texttt{magicsock} layer intercepts every packet to handle endpoint -selection and DERP relay, which is incompatible with the in-kernel -WireGuard module. Headscale therefore runs \texttt{wireguard-go} -entirely in userspace, and the unbounded \texttt{-b~0} flood overruns -that userspace pipeline just as it overruns every other userspace -implementation, producing 69.8\,\% loss despite the WireGuard branding. +selection and DERP (Designated Encrypted Relay for Packets, + Tailscale's TLS-over-TCP relay network used when a direct UDP path +between peers cannot be established), which is incompatible with the +in-kernel WireGuard module. Headscale therefore runs +\texttt{wireguard-go} entirely in userspace, and the unbounded +\texttt{-b~0} flood overruns that userspace pipeline just as it +overruns every other userspace implementation, and Headscale +shows 69.8\,\% loss despite the WireGuard branding. Yggdrasil's 98.7\% loss is the most extreme: it sends the most data (due to its large block size) but loses almost all of it. These loss rates do not reflect real-world UDP behavior but reveal which VPNs implement effective flow control. Hyprspace and Mycelium could not -complete the UDP test at all, timing out after 120 seconds. +complete the UDP test at all; both timed out after 120 seconds. % TODO: blksize_bytes is the UDP payload size iPerf3 selects, not % the path MTU. It is derived from the socket MSS and reflects the @@ -743,19 +746,11 @@ overwhelm FEC entirely. \subsection{Operational Resilience} -Sustained-load performance does not predict recovery speed. How -quickly a tunnel comes up after a reboot, and how reliably it -reconverges, matters as much as peak throughput for operational use. - -% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s) -% are not shown in any figure or table. Either add a figure or -% scrap this paragraph (see note below). -First-time connectivity spans a wide range. Headscale and WireGuard -are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud -(10--14\,s) spend seconds negotiating with their control planes -before passing traffic. - -%TODO: Maybe we want to scrap first-time connectivity +Throughput, latency, and application performance describe how a +tunnel behaves once it is up. The next question is how quickly it +gets there. Sustained-load numbers do not predict recovery speed, +and for operational use the time a tunnel takes to come up after a +reboot matters as much as its peak throughput. Reboot reconnection rearranges the rankings. Hyprspace, the worst performer under sustained TCP load, recovers in just 8.7~seconds on @@ -768,18 +763,21 @@ benchmarks use the default). After a reboot, a node must wait until the next periodic update before its lighthouses learn its new endpoint, so the reconnection time tracks the timer rather than any topology-dependent convergence. 
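+For completeness, that update cadence is an ordinary Nebula
+configuration key rather than a protocol constant. The benchmark
+suite's own Nebula module is not reproduced in this chapter, so the
+snippet below is only a sketch of where the knob lives on a NixOS
+host that uses the upstream \texttt{services.nebula} module and
+Nebula's \texttt{lighthouse.interval} setting; the network name and
+the value are illustrative, not what the benchmarks ran with.
+
+\begin{lstlisting}[language=Nix,caption={Hypothetical NixOS override
+  shortening Nebula's lighthouse update interval. Illustrative only;
+  not part of the benchmark suite.}]
+{
+  # Report this node's endpoints to its lighthouses every 15 seconds
+  # instead of the default, so a rebooted node is re-learnt sooner.
+  services.nebula.networks."benchmark".settings.lighthouse.interval = 15;
+}
+\end{lstlisting}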
-Mycelium sits at the opposite end, needing 76.6~seconds and showing -the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s), -suggesting a fixed protocol-level wait built into the overlay. +Mycelium sits at the opposite end at 76.6~seconds, and its three +nodes come back at almost the same time (75.7, 75.7, 78.3\,s). +Section~\ref{sec:mycelium_routing} argues from that uniformity +that the bound is a fixed timer in the overlay protocol. Yggdrasil produces the most lopsided result in the dataset: its yuki node is back in 7.1~seconds while lom and luna take 94.8 and -97.3~seconds respectively. The gap likely reflects the overlay's -spanning-tree rebuild: a node near the root of the tree reconverges -quickly, while one further out has to wait for the topology to -propagate. - -%TODO: Needs clarifications what is a "spanning tree build" +97.3~seconds respectively. Yggdrasil organises its overlay as a +distributed spanning tree rooted at the node with the highest public +key: every other node picks a parent closer to the root and the +whole network hangs off that parent chain. The gap likely reflects +the cost of rebuilding that tree after a reboot: a node close to the +current root reconverges quickly, while one further out must wait +for updated parent information to propagate hop-by-hop before it +can route traffic. \begin{figure}[H] \centering @@ -823,14 +821,14 @@ earlier benchmarks into per-VPN diagnoses. Hyprspace produces the most severe performance collapse in the dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP load, that number balloons to roughly 2\,800\,ms, a -1\,556$\times$ increase. This is not the network becoming -congested; it is the VPN tunnel itself filling up with buffered -packets and refusing to drain. +1\,556$\times$ increase. The network itself has capacity to spare; +the VPN tunnel is filling up with buffered packets and failing to +drain. -The consequences ripple through every TCP metric. With 4\,965 +The consequences show in every TCP metric. With 4\,965 retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than -steady-state transfer, shrinking the max congestion window to +steady-state transfer. The max congestion window shrinks to 205\,KB, the smallest in the dataset. Under parallel load the situation worsens: retransmits climb to 17\,426. % TODO: The % explanation for the sender/receiver inversion (ACK delays @@ -841,7 +839,7 @@ The buffering even inverts iPerf3's measurements: the receiver reports 419.8\,Mbps while the sender sees only 367.9\,Mbps, likely because massive ACK delays cause the sender-side timer to undercount the actual data rate. The -UDP test never finished at all, timing out at 120~seconds. +UDP test never finished at all; it timed out at 120~seconds. % Should we always use percentages for retransmits? @@ -891,7 +889,7 @@ Since the benchmark targets the regular Hyprspace IPv4/IPv6 addresses rather than service-network proxies, both endpoints rely on their host kernel's TCP stack for the entire transfer. Whatever options Hyprspace's gVisor instance might set -internally — congestion control, loss recovery, buffer sizes — +internally (congestion control, loss recovery, buffer sizes) are therefore irrelevant to these measurements; the inner TCP state machine the kernel runs is the only one in the path. 
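+The scale of the mismatch that kernel stack is asked to absorb is
+worth putting a number on. Taking the 934\,Mbps bare-metal link rate
+and the roughly 2\,800\,ms of under-load ping latency reported
+above, the bandwidth-delay product a single flow would need to cover
+is approximately
+\begin{equation*}
+  934\,\text{Mbps} \times 2.8\,\text{s} \approx 2.6\,\text{Gbit}
+  \approx 330\,\text{MB},
+\end{equation*}
+an order-of-magnitude estimate rather than a measurement (the
+latency itself is load-dependent), but one that sits roughly three
+orders of magnitude above the 205\,KB maximum congestion window the
+kernel actually reaches. The gap is consistent with the diagnosis
+that follows: the limit is queueing inside the tunnel, not the TCP
+configuration of either endpoint.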
The same caveat applies more sharply to Tailscale, where the @@ -900,9 +898,13 @@ stack but the benchmark traffic never reaches it; that case is the subject of Section~\ref{sec:tailscale_degraded}. If gVisor is out of scope, the buffer bloat must originate -further up the Hyprspace stack instead. The most plausible -source is the libp2p / yamux stream layer through which raw IP -packets are funnelled. Hyprspace's TUN-read loop dispatches +further up the Hyprspace stack instead. Hyprspace uses +\texttt{libp2p}, a peer-to-peer networking library, and its +\texttt{yamux} stream multiplexer, which runs many logical streams +over a single underlying connection and polices each one with a +credit-based flow-control window. The most plausible source of +the bloat is this libp2p/yamux layer, through which raw IP packets +are funnelled. Hyprspace's TUN-read loop dispatches each outbound packet on its own goroutine, and every such goroutine ends up in \texttt{node/node.go}'s \texttt{sendPacket}, which keeps exactly one libp2p stream per @@ -916,10 +918,10 @@ collapses to a single send pipeline at this layer. Each goroutine waiting for the lock pins its own 1420-byte packet buffer, and the underlying yamux session adds a per-stream flow-control window on top. None of this is visible to the -kernel TCP sender that produced the inner segments — the kernel -sees only that the TUN write returned — so it keeps growing -its congestion window while the libp2p layer falls further -behind. The geometry is the textbook one for buffer bloat: a +kernel TCP sender that produced the inner segments: the kernel +sees only that the TUN write returned, so it keeps growing its +congestion window while the libp2p layer falls further behind. The +geometry is the textbook one for buffer bloat: a fast producer (kernel TCP) sitting upstream of a slow, serialised consumer (the single yamux stream per peer) with no flow-control signal coupling the two. @@ -1036,10 +1038,15 @@ background. Mycelium is also the slowest VPN to recover from a reboot: 76.6~seconds on average, and almost suspiciously uniform across -nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to -a fixed convergence timer in the overlay protocol — -most likely a -default interval rather than anything topology-dependent. +nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a +fixed convergence timer in the overlay protocol, most likely a +default wait interval hard-coded into the reconnection logic. A +topology-dependent recovery time, by contrast, would vary with each +node's position in the overlay: a node near an active peer would +reconverge quickly while one further away would wait longer for +routing information to reach it. Mycelium shows no such variation, +so the bound is almost certainly a timer rather than a propagation +delay. % TODO: Identify which Mycelium constant or default this 75-78 s % recovery actually corresponds to before claiming it is a fixed % timer; the source code would settle whether it is hard-coded, @@ -1047,49 +1054,46 @@ default interval rather than anything topology-dependent. The UDP test timed out at 120~seconds, and even first-time connectivity required a 70-second wait at startup. -% Explain what topology-dependent means in this case. - \paragraph{Tinc: Userspace Processing Bottleneck.} -Tinc is a clear case of a CPU bottleneck masquerading -as a network -problem. At 1.19\,ms latency, packets get through the -tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a -third of the bare-metal link. 
-The usual suspects do not apply: -Tinc's effective UDP payload size (\texttt{blksize\_bytes} of - 1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and -WireGuard at 1\,368) is in the normal range, and its retransmit -count (240) is moderate. What limits Tinc is its -single-threaded -userspace architecture: one CPU core simply cannot -encrypt, copy, -and forward packets fast enough to fill the pipe. - -% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the -% Tinc CPU bottleneck -% diagnosis from above, but the 14.9% CPU figure has -% an unresolved TODO -% (the same utilization as VpnCloud at 539 Mbps). If -% the CPU claim is -% revised or refuted, this confirmation must be updated too. -The parallel benchmark confirms this diagnosis. Tinc scales to -563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio. -Multiple TCP streams collectively keep that single -core busy during -what would otherwise be idle gaps in any individual -flow, squeezing -out throughput that no single stream could reach alone. +The latency subsection already traced Tinc's 336\,Mbps ceiling to +single-core CPU exhaustion. The usual network suspects do not +apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its +effective UDP payload size (1\,353 bytes) and its retransmit count +(240) are in the normal range. That leaves CPU: 14.9\,\% +whole-system utilization is what one saturated core looks like on +a multi-core host, which fits a single-threaded userspace VPN. +The parallel benchmark confirms the diagnosis. Tinc scales to +563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio. +Several concurrent TCP streams keep that one core busy through +the gaps a single flow would leave idle, and the extra work +translates directly into extra throughput. +% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the +% unresolved CPU-profiling TODO from the latency subsection +% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread +% profiling refutes the single-core story, this paragraph must +% be revisited as well. \section{Impact of network impairment} \label{sec:impairment} Baseline benchmarks rank VPNs by overhead under ideal -conditions. -The impairment profiles in -Table~\ref{tab:impairment_profiles} test -a different property: resilience. Two results -dominate the data. +conditions. The impairment profiles in +Table~\ref{tab:impairment_profiles} test a different property: +resilience. Each profile applies symmetric \texttt{tc netem} +impairment to every machine. Low adds roughly 2\,ms of delay and +0.25\,\% packet loss with 0.5\,\% reordering; Medium adds +${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High +adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\% +reordering. Medium and High both use 50\,\% correlation, so +losses and reorderings are bursty rather than uniform. Two +results dominate the data. +% TODO: Double-check these per-profile parameters against the +% canonical impairment-profile definitions in the earlier chapter +% (Table~\ref{tab:impairment_profiles}). The Low/High loss and +% delay numbers are cross-checked against later prose in this +% chapter, but the correlation and jitter values should be +% verified against the authoritative profile definition. The first is the collapse of the throughput hierarchy. At High impairment, the 675\,Mbps spread between fastest and slowest @@ -1106,8 +1110,8 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly through what turns out to be the wrong hypothesis. 
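+% NOTE (companion to the profile-verification TODO above): read at
+% face value, the Medium profile described there corresponds to a
+% per-host invocation along the lines of
+%   tc qdisc replace dev <iface> root netem delay 4ms \
+%     loss 1% 50% reorder 2% 50%
+% with the trailing 50% figures being the loss and reorder
+% correlation. This is a reconstruction from the prose, not the
+% rig's actual tc command; check it against the impairment module
+% when resolving that TODO.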
The investigation begins with Tailscale's much-discussed gVisor TCP stack, validates the candidate parameters in isolation on the -bare-metal host, and only then discovers — by reading the rig's -own NixOS module — that the gVisor stack is not actually in the +bare-metal host, and only then discovers, by reading the rig's +own NixOS module, that the gVisor stack is not actually in the data path of the benchmark at all. The real culprit is a combination of the Linux kernel's tight default \texttt{tcp\_reordering} threshold and the way @@ -1313,6 +1317,16 @@ every lost or reordered outer packet costs roughly retransmitted inner data than a standard 1\,400-byte MTU VPN would lose. +% TODO: The jumbo-MTU-as-liability argument is reused in several +% places (TCP impairment, QUIC impairment, RIST video, and +% §sec:baseline Tier analysis). In each it is presented as a +% mechanism rather than a measurement. Consider running one +% controlled experiment --- force Yggdrasil to a standard +% 1\,420-byte overlay MTU and rerun the Low/Medium impairment +% profiles --- to test the hypothesis directly, or consolidate +% the argument into a single "jumbo-MTU liability" paragraph and +% cite it from the other sections instead of restating the +% mechanism each time. Headscale retains 34.3\% of its baseline throughput at Low, almost @@ -1444,6 +1458,15 @@ indicator than as a throughput measurement. A VPN that cannot complete a 30-second UDP flood under 0.25\% packet loss has a flow-control problem that will surface under real workloads too, even when the symptoms are milder. +% TODO: Non-monotonic failure pattern (Internal and WireGuard +% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud +% fail selectively) is never explained and directly undermines +% the "robustness indicator" framing above. Reproduce one of +% the failing Low-profile runs with iPerf3 debug logging and +% \texttt{tc -s qdisc show} to establish whether these are VPN +% flow-control failures, iPerf3/tc interaction artefacts, or +% timing issues; then either explain the pattern or soften the +% robustness-indicator claim. \subsection{Parallel TCP} @@ -1552,10 +1575,10 @@ At High impairment, WireGuard (23.2\,Mbps), VpnCloud ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within 0.4\,Mbps of one another. At baseline these four span a 188\,Mbps -range (656 to 844\,Mbps). QUIC's own congestion -control, running on -top of an already-degraded outer link, has become the -sole limiter. +range (656 to 844\,Mbps). At this point QUIC's own congestion +control is the sole limiter: it runs on top of an +already-degraded outer link and cannot push past +${\sim}$23\,Mbps regardless of the VPN underneath. \begin{figure}[H] \centering @@ -1742,6 +1765,20 @@ Section~\ref{sec:tailscale_degraded} explains why. \section{Tailscale under degraded conditions} \label{sec:tailscale_degraded} +% TODO: Editorial pass needed on two chapter-wide issues before +% submission: +% (1) magicsock / wireguard-go userspace-datapath explanation is +% repeated three times in slightly different forms (once in +% baseline UDP, once in impairment UDP, once here). Consider +% introducing it once in full here, where it is load-bearing, +% and replacing the earlier occurrences with one-sentence +% forward references. +% (2) This section uses first-person plural ("we pursued", "we +% worked it out", "we ran two follow-up benchmarks") while +% the rest of the chapter is in impersonal voice. 
Either +% harmonise everything to one voice, or explicitly frame this +% section as a first-person narrative detour. + This section is about an observation that should not exist: Headscale, a tunnelling VPN built on a kernel TCP stack and \texttt{wireguard-go}, beats the bare-metal Internal baseline at @@ -1753,7 +1790,7 @@ chasing the obvious answer to its end. \subsection{An anomaly worth pursuing} At Medium impairment, Headscale reaches 41.5\,Mbps on a single -TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for +TCP stream against Internal's 29.6\,Mbps, a 40\,\% lead for the VPN over the direct host-to-host link it tunnels through. Headscale costs the expected ${\sim}$14\,\% at baseline, and at Low and High impairment it lags Internal by some margin. Yet at @@ -1837,12 +1874,12 @@ imports Google's gVisor netstack it as an in-process TCP implementation. The gVisor documentation is direct about why this matters: netstack is designed for adverse networks where the host kernel's TCP -defaults are too aggressive. Tailscale's release notes go -further, calling out specific overrides on top of gVisor — the -most visible being an explicit RACK disable and 8\,MiB / 6\,MiB -receive and send buffers. +defaults are too aggressive. Tailscale's release notes go further +and name specific overrides +on top of gVisor; the most visible are an explicit RACK disable +and 8\,MiB / 6\,MiB receive and send buffers. -Reading Tailscale's source confirms it. +The Tailscale source code bears this out. \texttt{wgengine/netstack/netstack.go} contains the netstack initialiser, and Listing~\ref{lst:tailscale_netstack_overrides} reproduces the relevant overrides verbatim. RACK is disabled @@ -1863,25 +1900,22 @@ enabled (gVisor's default is off). \texttt{wgengine/netstack/netstack.go}. \textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go} -Read against the Linux kernel defaults — RACK on, CUBIC by -default, ${\sim}$1\,MiB receive and send buffers, -\texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these +Read against the Linux kernel defaults (RACK on, CUBIC by + default, ${\sim}$1\,MiB receive and send buffers, +\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these overrides describe a TCP stack better suited to a lossy, -reordering link than the host kernel. The hypothesis writes -itself: Headscale's iPerf3 traffic is processed -by this gVisor -instance instead of by the host kernel TCP stack, and so it -inherits the more reordering-tolerant behaviour. -WireGuard-the-kernel-module shares only the cryptographic -protocol; it does not get the gVisor stack, and -therefore does -not get the advantage. +reordering link than the host kernel. The hypothesis follows +directly: Headscale's iPerf3 traffic +runs through this gVisor instance instead of through the host +kernel TCP stack, and so it inherits the more +reordering-tolerant behaviour. WireGuard-the-kernel-module +shares only the cryptographic protocol; it does not include +the gVisor stack, and therefore does not get the advantage. -It is a clean story. The natural way to test it -is to extract +The natural way to test this is to extract the parameters Tailscale sets inside gVisor, apply their nearest Linux equivalents to the bare-metal host as sysctls, -and see whether Internal — with no VPN at all — picks up the +and see whether Internal, with no VPN at all, picks up the same advantage. If it does, the gVisor explanation is supported. 
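+% NOTE (illustrative only): on a NixOS host, overrides of this kind
+% are a few lines of boot.kernel.sysctl, for example
+%   boot.kernel.sysctl."net.ipv4.tcp_reordering" = 30;  # kernel default is 3
+% possibly alongside its companion cap net.ipv4.tcp_max_reordering.
+% The key and value shown are a sketch of the approach, not the
+% exact three-line change applied in the follow-up run described
+% below.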
If it does not, the hypothesis fails. @@ -1951,21 +1985,18 @@ impairment setup as the original 18.12.2025 run. The result felt like confirmation. Internal's Medium-impairment throughput jumped from 29.6\,Mbps to -72.7\,Mbps under the reorder-only configuration — a 146\,\% -increase from a three-line sysctl change — and -the retransmit +72.7\,Mbps under the reorder-only configuration, a 146\,\% +increase from a three-line sysctl change, and the retransmit rate at Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means more than half of the original retransmissions were spurious. The Nix cache download at Medium roughly halved, from 58.6\,s to 29.1\,s. -Parallel TCP gained more. Internal at Low -climbed from 277 to -902\,Mbps, a 226\,\% increase that not only -exceeds Internal's -old single-stream best but actually overtakes Headscale's -original 718\,Mbps from the unmodified run. % +Parallel TCP gained even more. Internal at Low climbed from +277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's +old single-stream best and overtakes Headscale's original +718\,Mbps from the unmodified run. % % TODO: DOWNSTREAM % DEPENDENCY — "six concurrent flows" inherits % the unresolved @@ -2024,9 +2055,10 @@ the kernel to gVisor reproduces the effect. Then we checked which Tailscale code path the test rig was actually running. \subsection{The data path that was not there} +\label{sec:gvisor_not_in_path} -In default mode — what anyone running \texttt{tailscale up} -on a Linux host gets — the Tailscale client creates a real +In default mode (what anyone running \texttt{tailscale up} +on a Linux host gets), the Tailscale client creates a real kernel TUN device, registers a route for the Tailscale subnet through it, and forwards inbound and outbound @@ -2054,20 +2086,19 @@ running inside \texttt{tailscaled} itself (Tailscale SSH, Taildrop, the metric endpoint). External processes such as iPerf3 cannot reach the Tailscale network in that mode. -The test rig does not use that mode. As shown in -Listing~\ref{lst:rig_interface_name}, the benchmark -suite's Headscale module sets the interface name to -\texttt{ts-\$\{instanceName\}}, resolving to -\texttt{tailscaled --tun ts-headscale}: a real kernel -TUN. gVisor netstack is therefore unreachable from -external benchmark traffic. - +The test rig does not use that mode. The benchmark suite's +Headscale module sets the interface name to +\texttt{ts-\$\{instanceName\}} +(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled} +launches with \texttt{--tun ts-headscale}: a real kernel TUN. +External benchmark traffic cannot reach gVisor netstack at all. \lstinputlisting[language=Nix,caption={The benchmark suite's Headscale module sets \texttt{interfaceName} to a real kernel TUN name (\texttt{ts-}, truncated to 15 characters). - This means \texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale} + This means \texttt{tailscaled} runs as \texttt{tailscaled --tun + ts-headscale} on every test machine. \textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix} @@ -2075,7 +2106,7 @@ The empirical fingerprint pins the same conclusion down without source-code reading. Headscale itself gained +21\,\% at Medium from the host-kernel sysctl tuning. 
If Headscale's iPerf3 traffic were processed by gVisor netstack, host-kernel sysctls -would change nothing — they configure the host kernel TCP stack +would change nothing; they configure the host kernel TCP stack and only the host kernel TCP stack. The fact that Headscale moves measurably under those sysctls is direct evidence that Headscale's application TCP runs on the host kernel stack, just @@ -2097,8 +2128,8 @@ the gVisor TCP business at all. The puzzle the investigation began with has not gone away. Headscale starts at 41.5\,Mbps where Internal starts at 29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel -TCP stack. Whatever Headscale is doing — partially, weakly, but -reproducibly — is worth roughly twelve megabits per second on the +TCP stack. Whatever Headscale is doing (partially, weakly, but +reproducibly) is worth roughly twelve megabits per second on the Medium profile, and it is not gVisor netstack. The +21\,\% sysctl gain for Headscale itself is also informative @@ -2143,8 +2174,8 @@ The second is the 7\,MiB outer-UDP socket buffer that \texttt{SO\_*BUFFORCE} variant where available so the value is honoured even past \texttt{net.core.rmem\_max}. The host kernel default is in the low hundreds of KiB. Under burst-correlated -impairment — Medium and High both use 50\,\% correlation, so -losses and reorderings cluster — this larger buffer absorbs +impairment (Medium and High both use 50\,\% correlation, so +losses and reorderings cluster), this larger buffer absorbs spikes in arrival rate that would otherwise overflow the kernel UDP receive queue and surface as additional inner-TCP losses. Internal has no such cushion on its incoming wire path. @@ -2244,16 +2275,15 @@ the bare-metal host more than half of its achievable throughput. Three lines of \texttt{sysctl} repair it. The fix is portable to any Linux host and entirely independent of any VPN. -The unresilient finding — the one that motivated us to write this -section in the first place — is that Tailscale's much-discussed -userspace TCP stack is, for the workload that exposed the -anomaly, sitting on the bench. The advantage we attributed to it -must come from a more ordinary place: the way -\texttt{wireguard-go} batches and coalesces packets between the -wire and the kernel TCP stack, and the larger UDP buffer it pins -on its outer socket. We were chasing the wrong hypothesis with -the right experiment, and the experiment turned out to be more -useful than the hypothesis. +The less durable finding, and the one that motivated this section, +is that Tailscale's much-discussed userspace TCP stack is not in +the data path for the workload that exposed the anomaly. The +advantage we attributed to it comes from a more ordinary place: +the way \texttt{wireguard-go} batches and coalesces packets +between the wire and the kernel TCP stack, and the larger UDP +buffer it pins on its outer socket. We were chasing the wrong +hypothesis with the right experiment, and the experiment turned +out to be more useful than the hypothesis. % TODO: These sections are empty stubs but the chapter % introduction (line 12--13) promises "findings from the source