diff --git a/Chapters/Results.tex b/Chapters/Results.tex index 981711c..bfa248e 100644 --- a/Chapters/Results.tex +++ b/Chapters/Results.tex @@ -31,6 +31,24 @@ latency of just an entire 30-second test. Mycelium sits at the other extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal figure. +A note on naming: ``Headscale'' in every table and figure of this +chapter labels the test scenario in which the Tailscale client +(\texttt{tailscaled}) connects to a self-hosted Headscale control +server. The data plane is therefore the Tailscale client built on +\texttt{wireguard-go}, not the Headscale binary itself, which is +only a control-plane server. The test rig launches +\texttt{tailscaled} via the NixOS \texttt{services.tailscale} +module with \texttt{interfaceName = "ts-headscale"}, which +translates to \texttt{--tun ts-headscale}; this means the Tailscale +client uses a real kernel TUN device and the host kernel's TCP/IP +stack handles every tunneled packet. The alternate +\texttt{--tun=userspace-networking} mode, in which gVisor netstack +terminates tunneled TCP inside the \texttt{tailscaled} process, is +\emph{not} engaged in any of the benchmarks reported here. +Statements below about ``Headscale'' running \texttt{wireguard-go} +should be read as statements about the Tailscale client in this +scenario. + \subsection{Test Execution Overview} Running the full baseline suite across all ten VPNs and the internal @@ -127,16 +145,15 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates 1\,163~retransmits per test, over 1\,000$\times$ what WireGuard needs. ZeroTier compensates for tunnel-internal packet loss by repeatedly triggering TCP congestion-control recovery, whereas -WireGuard delivers data with negligible in-tunnel loss. Across all VPNs, -retransmit behaviour falls into three groups: \emph{clean} ($<$110: -WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} -(200--900: Tinc, EasyTier, Mycelium, VpnCloud), and -\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace). +WireGuard delivers data with negligible in-tunnel loss. The +bare-metal Internal reference sits at 1.7~retransmits per test — +essentially noise — and the VPNs split into three groups around +it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale), +\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), +and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace). % TODO: Is this naming scheme any good? -% TODO: Fix TCP Throughput plot - \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} @@ -171,10 +188,7 @@ WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} Retransmits have a direct mechanical relationship with TCP congestion control. Each retransmit triggers a reduction in the congestion window -(\texttt{cwnd}), throttling the sender. % TODO: The text says "average congestion window" but -% Figure~\ref{fig:retransmit_cwnd} plots "Max Congestion Window." -% Use consistent terminology --- either change the text to "max" or -% change the figure axis label. +(\texttt{cwnd}), throttling the sender. This relationship is visible in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965 retransmits, maintains the smallest max congestion window in the @@ -200,17 +214,17 @@ in the dataset). This suggests significant in-tunnel packet loss or buffering at the VpnCloud layer that the retransmit count (857) alone does not fully explain. 
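+A back-of-envelope bound makes the mechanical link between loss
+rate and throughput concrete. The Mathis approximation bounds a
+loss-limited TCP flow at
+\[
+  \mathrm{Throughput} \;\le\; \frac{\mathrm{MSS}}{\mathrm{RTT}}
+  \sqrt{\frac{3}{2p}},
+\]
+where $p$ is the loss-event rate. With illustrative round numbers
+rather than fitted ones (a 1\,400-byte MSS, a 1\,ms tunnel RTT,
+and one loss event per 200 segments, $p = 0.005$),
+$\mathrm{MSS}/\mathrm{RTT}$ is about 11\,Mbps and
+$\sqrt{3/(2p)} \approx 17$, giving a ceiling of roughly
+190\,Mbps, far below the gigabit line rate. The estimate is
+qualitative rather than a fit to this dataset, but it shows why
+the pathological group cannot buy its way out of in-tunnel loss
+with raw link capacity.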
-% TODO: Mycelium's 122--379 Mbps range is per-link asymmetry (different -% overlay routing paths), not stochastic run-to-run variability. -% Section~\ref{sec:mycelium_routing} confirms the same numbers as -% per-link throughput. Conflating link asymmetry with run-to-run -% variance is misleading --- either separate the two or clarify that -% Mycelium's spread comes from path selection, not randomness. -Run-to-run variability also differs substantially. WireGuard ranges -from 824 to 884\,Mbps (a 60\,Mbps window), while Mycelium ranges -from 122 to 379\,Mbps, a 3:1 ratio between worst and best runs. A -VPN with wide variance is harder to capacity-plan around than one -with consistent performance, even if the average is lower. +Variability — whether stochastic across runs or systematic across +links — also differs substantially. WireGuard's three link +directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window), +behaving almost identically. Mycelium's three directions span +122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise: +Section~\ref{sec:mycelium_routing} shows the spread is per-link +path-selection asymmetry, with one link finding a direct route and +the other two routing through the global overlay. Either way, a +VPN whose throughput varies that widely across links is harder to +capacity-plan around than one that delivers a consistent figure +on every direction. \begin{figure}[H] \centering @@ -306,22 +320,27 @@ keep up with the link speed. The qperf benchmark backs this up: Tinc maxes out at 14.9\,\% total system CPU while delivering just 336\,Mbps. % TODO: 14.9\% total CPU does not obviously indicate a bottleneck. -% Clarify that this is whole-system utilization on a multi-core -% machine, and that Tinc's single-threaded design means one core is -% saturated while the rest are idle. Also note that VpnCloud reports -% the same 14.9\% yet achieves 539 Mbps --- explain why the same CPU -% utilization yields different throughput (e.g., different per-packet -% processing cost). -On a multi-core system, the low percentage reflects a single -saturated core, a clear sign that the CPU, not the network, is the -bottleneck. +% This is whole-system utilization on a multi-core machine, and a +% single saturated core fits the budget — but VpnCloud reports the +% same 14.9\% \emph{and} reaches 539\,Mbps, much more than Tinc. +% The single-saturated-core story alone therefore cannot explain +% the throughput gap; per-packet processing cost must differ +% materially between the two. Verify with per-thread CPU sampling +% or eBPF profiling. +On a multi-core system, this low percentage is consistent with a +single saturated core (and Tinc is single-threaded), which would +explain why the CPU rather than the network is the bottleneck. +The story is incomplete, however: VpnCloud shows the same 14.9\,\% +total system CPU yet delivers 539\,Mbps — 60\,\% more than Tinc — +so a difference in per-packet processing cost between the two +implementations must also be in play. Figure~\ref{fig:latency_throughput} makes this disconnect easy to spot. % TODO: These CPU numbers are stated inline but never shown in a plot % or table. Add a CPU utilization figure or table so readers can % verify. Also, the claim that WireGuard's CPU usage "goes to -% cryptographic processing" is unsubstantiated --- no profiling data +% cryptographic processing" is unsubstantiated: no profiling data % is presented. Either add profiling evidence or soften to % "likely" / "presumably." 
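+One way to settle the single-saturated-core question without a
+full profiling setup is to sample per-thread CPU time from
+\texttt{/proc} while an iPerf3 run is in flight. The sketch below
+is illustrative and not part of the benchmark rig; it assumes the
+VPN daemon's PID is supplied on the command line and that
+\texttt{USER\_HZ} is the conventional 100. A Tinc run showing one
+thread pinned near 100\,\% of a core while its siblings idle would
+confirm the single-core diagnosis; the same measurement on
+VpnCloud would show whether its extra throughput comes from
+spreading work across threads or from a cheaper per-packet path.
+
+\begin{lstlisting}[language=Go,caption={Illustrative per-thread
+  CPU sampler (not part of the benchmark rig). It reads
+  \texttt{utime}+\texttt{stime} for every thread of a target PID
+  from \texttt{/proc} and reports each thread's share of one core
+  over a five-second interval; \texttt{USER\_HZ} is assumed to be
+  100.},label={lst:per_thread_cpu_sketch}]
+package main
+
+import (
+    "fmt"
+    "os"
+    "path/filepath"
+    "strconv"
+    "strings"
+    "time"
+)
+
+// USER_HZ is assumed to be 100 (the common value); query
+// sysconf(_SC_CLK_TCK) where exactness matters.
+const clkTck = 100.0
+
+// threadTicks returns utime+stime in clock ticks for one thread,
+// parsed from /proc/PID/task/TID/stat. utime and stime are the
+// 12th and 13th whitespace-separated fields after the
+// parenthesised command name.
+func threadTicks(pid, tid string) (uint64, error) {
+    raw, err := os.ReadFile(filepath.Join("/proc", pid, "task", tid, "stat"))
+    if err != nil {
+        return 0, err
+    }
+    s := string(raw)
+    fields := strings.Fields(s[strings.LastIndexByte(s, ')')+1:])
+    utime, _ := strconv.ParseUint(fields[11], 10, 64)
+    stime, _ := strconv.ParseUint(fields[12], 10, 64)
+    return utime + stime, nil
+}
+
+// snapshot collects the tick counters of every thread of pid.
+func snapshot(pid string) map[string]uint64 {
+    ticks := map[string]uint64{}
+    entries, err := os.ReadDir(filepath.Join("/proc", pid, "task"))
+    if err != nil {
+        return ticks
+    }
+    for _, e := range entries {
+        if t, err := threadTicks(pid, e.Name()); err == nil {
+            ticks[e.Name()] = t
+        }
+    }
+    return ticks
+}
+
+func main() {
+    if len(os.Args) != 2 {
+        fmt.Fprintln(os.Stderr, "usage: cpusample PID-of-VPN-daemon")
+        os.Exit(1)
+    }
+    pid := os.Args[1]
+    const interval = 5 * time.Second
+
+    before := snapshot(pid)
+    time.Sleep(interval)
+    after := snapshot(pid)
+
+    // 100% means a thread kept one core fully busy for the whole
+    // interval; one thread near 100% with idle siblings is the
+    // single-saturated-core signature.
+    for tid, b := range before {
+        if a, ok := after[tid]; ok {
+            pct := float64(a-b) / clkTck / interval.Seconds() * 100
+            fmt.Printf("tid %s: %5.1f%% of one core\n", tid, pct)
+        }
+    }
+}
+\end{lstlisting}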
The qperf measurements also reveal a wide spread in CPU usage. @@ -329,12 +348,7 @@ Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a kernel-level implementation, presumably due to in-kernel -cryptographic processing. % TODO: "do the most with the least CPU time" is misleading --- -% Tinc gets only 336 Mbps at 14.9% CPU (22.6 Mbps/%), while -% WireGuard gets 864 Mbps at 30.8% (28 Mbps/%). These three use -% the least CPU but don't necessarily achieve the best throughput/CPU -% ratio. Rephrase to "use the least CPU" or calculate actual -% efficiency ratios. +cryptographic processing. On the efficient end, VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least CPU time. Nebula and Headscale are missing from @@ -355,7 +369,8 @@ this comparison because qperf failed for both. \subsection{Parallel TCP Scaling} -The single-stream benchmark tests one link direction at a time. % TODO: The plot labels this benchmark "10-stream parallel" but this +The single-stream benchmark tests one link direction at a time. % +% TODO: The plot labels this benchmark "10-stream parallel" but this % description says "six unidirectional flows." Verify the actual test % configuration and reconcile the two. The @@ -404,18 +419,29 @@ single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream can never fill the pipe: the bandwidth-delay product demands a window larger than any single flow maintains, so multiple concurrent flows compensate for that constraint and push throughput to 2.20$\times$ -the single-stream figure. % TODO: The buffer-bloat workaround explanation for Hyprspace's -% parallel scaling is a hypothesis. No direct evidence is shown -% that multiple streams specifically alleviate buffer bloat. -% Consider adding bufferbloat measurements or softening the claim. -% TODO: DOWNSTREAM DEPENDENCY — This claim depends on the buffer bloat -% diagnosis in Section hyprspace_bloat, which itself rests on the unverified -% 2,800 ms under-load latency (see TODO there). If that latency figure -% is not confirmed, this parallel-scaling explanation collapses. -Hyprspace scales almost as well -(2.18$\times$), possibly because multiple streams collectively work -around the buffer bloat that cripples any individual flow -(Section~\ref{sec:hyprspace_bloat}). Tinc picks up a +the single-stream figure. Hyprspace scales almost as well +(2.18$\times$) for the same reason but with a different +bottleneck. Its libp2p send pipeline accumulates roughly +2\,800\,ms of under-load latency +(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP +flow a bandwidth-delay product on the order of hundreds of +megabytes to fill — far beyond any single kernel cwnd. And +because Hyprspace keys \texttt{activeStreams} by destination +\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the +three concurrent peer pairs in the parallel benchmark each get +their own libp2p stream, their own mutex, and their own yamux +flow-control window. The three TCP senders therefore maintain +three independent windows in flight, and three windows fill +more of the bloated pipeline than one can. 
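+The ``hundreds of megabytes'' figure is just the bandwidth-delay
+product written out. Taking the roughly 800\,Mbps that Hyprspace
+sustains in the parallel baseline as the available capacity, and
+the ${\sim}$2\,800\,ms of queued latency as the effective
+round-trip time a sender sees under load,
+\[
+  \mathrm{BDP} \approx 800\,\mathrm{Mbps} \times 2.8\,\mathrm{s}
+  = 2.24\,\mathrm{Gbit} \approx 280\,\mathrm{MB}.
+\]
+A single kernel TCP flow, whose window stock Linux autotuning
+caps at a few megabytes, can keep only a sliver of that pipeline
+occupied; three flows with independent windows keep roughly three
+times as much of it filled. The numbers are an order-of-magnitude
+sketch, not a measurement of the actual window evolution.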
+% TODO: This is still a hypothesis: it generalises the same +% bandwidth-delay-product argument used for Mycelium directly +% above, and is now grounded in the per-peer +% \texttt{SharedStream} structure verified in +% Listing~\ref{lst:hyprspace_sendpacket}, but neither the +% per-flow window evolution nor the actual under-load latency +% has been measured directly. A tcpdump of one Hyprspace +% iPerf3 run with inter-arrival timing analysis would settle +% it. Tinc picks up a 1.68$\times$ boost because several streams can collectively keep its single-threaded CPU busy during what would otherwise be idle gaps in a single flow. @@ -430,8 +456,6 @@ under multiplexing. Nebula is the only VPN that actually gets \emph{slower} with more streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$) while retransmits jump from 955 to 2\,462. The -% TODO: "ten streams" vs "six unidirectional flows" --- reconcile -% with the test description above. streams are clearly fighting each other for resources inside the tunnel. @@ -510,12 +534,14 @@ Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at the kernel level with proper backpressure that matches sender to receiver rate. Every other VPN shows massive loss (69--99\%) because the sender overwhelms the tunnel's userspace processing capacity. -% TODO: Headscale also uses WireGuard's kernel module but still shows -% 69.8\% loss. Explain that Headscale's userspace netstack sits -% between the application and the WireGuard kernel module, so UDP -% traffic must pass through userspace before reaching the kernel -% tunnel --- this is why it behaves like a userspace VPN here despite -% using WireGuard underneath. +Headscale shares WireGuard's cryptographic protocol but, contrary to +intuition, does not share its kernel datapath: Tailscale's +\texttt{magicsock} layer intercepts every packet to handle endpoint +selection and DERP relay, which is incompatible with the in-kernel +WireGuard module. Headscale therefore runs \texttt{wireguard-go} +entirely in userspace, and the unbounded \texttt{-b~0} flood overruns +that userspace pipeline just as it overruns every other userspace +implementation, producing 69.8\,\% loss despite the WireGuard branding. Yggdrasil's 98.7\% loss is the most extreme: it sends the most data (due to its large block size) but loses almost all of it. These loss rates do not reflect real-world UDP behavior but reveal which VPNs @@ -653,61 +679,54 @@ between raw throughput and real-world download speed. \label{fig:nix_download} \end{figure} -\paragraph{Video Streaming (RIST).} +\paragraph{Video streaming (RIST).} -At just 3.3\,Mbps, the RIST video stream sits comfortably within -every VPN's throughput budget. This test therefore measures -something different: how well the VPN handles real-time UDP packet -delivery under steady load. % TODO: The RIST plot shows Nebula at 99.8\%, not 100\%. "Nine of -% eleven deliver 100\%" is inaccurate --- eight deliver 100\%, Nebula -% delivers 99.8\%. Also, the claim that 14--16 dropped frames trace -% to encoder warm-up is stated without evidence. How was this -% determined? Add a reference or explain the methodology. -Nine of the eleven VPNs pass without -incident, delivering near-perfect video quality. The 14--16 dropped -frames that appear uniformly across all VPNs, including Internal, -likely trace back to encoder warm-up rather than tunnel overhead. +At 3.3\,Mbps, the RIST video stream sits well within every VPN's +throughput budget. 
The test therefore measures something else: how +well each VPN handles real-time UDP delivery under steady load. + +Most VPNs pass without incident. Eight deliver 100\% quality, +Nebula sits just below at 99.8\%, and Hyprspace's headline figure +of 100\% conceals a separate failure mode discussed below. The +14--16 dropped frames that appear uniformly across every run, including +Internal, are most likely encoder warm-up artefacts rather than +tunnel overhead, though we have not verified this directly. % TODO: The packet-drop distribution statistics (288 mean, % 10\% median, IQR 255--330) are not shown in any figure. % Add a box plot or distribution figure for Headscale's RIST drops. -Headscale is the exception. It averages just 13.1\,\% quality, -dropping 288~packets per test interval. The degradation is not -bursty but sustained: median quality sits at 10\,\%, and the -interquartile range of dropped packets spans a narrow 255--330 band. -The qperf benchmark independently corroborates this, having failed -outright for Headscale, confirming that something beyond bulk TCP is -broken. +Headscale is the clear failure. Its mean quality is 13.1\%, and +each test interval drops 288 packets. The degradation is sustained +rather than bursty: median quality is 10\%, and the interquartile +range of dropped packets is a narrow 255--330. The qperf benchmark +also fails outright for Headscale at baseline, which rules out a +bulk-TCP explanation. Something in the real-time path is broken. -What makes this failure unexpected is that Headscale builds on -WireGuard, which handles video flawlessly. TCP throughput places -Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and +The failure is unexpected because Headscale builds on WireGuard, +which handles video without trouble, and Headscale's own TCP +throughput puts it in Tier~1. RIST runs over UDP, however, and qperf probes latency-sensitive paths using both TCP and UDP. The -% TODO: The DERP relay / MTU fragmentation hypothesis is plausible -% but unverified. No packet capture or fragmentation analysis is -% presented. Either add tcpdump / packet-level evidence or mark -% this more clearly as a hypothesis. -pattern points toward Headscale's DERP relay or NAT traversal layer -as the source. Its effective UDP payload size of 1\,208~bytes, the smallest -of any VPN, may compound the issue: RIST packets that exceed -this limit would be fragmented, and reassembling fragments under -sustained load could produce exactly the kind of steady, uniform packet -drops the data shows. For video conferencing, VoIP, or any -real-time media workload, this is a disqualifying result regardless -of TCP throughput. +most plausible source is Headscale's DERP relay or NAT traversal +layer. Headscale's effective UDP payload size is 1\,208~bytes, the +smallest in the dataset. RIST packets larger than this would be +fragmented, and fragment reassembly under sustained load could +produce exactly the steady, uniform drop pattern the data shows. +This is a hypothesis, not a confirmed cause: it would need a +packet capture to verify. Either way, the result disqualifies +Headscale from video conferencing, VoIP, or any other real-time +media workload, regardless of TCP throughput. % TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500, % percentiles all zero) are not visible in the RIST Quality bar chart. % Add a distribution plot or note in the caption that the bar % chart hides this variance. -Hyprspace reveals a different failure mode. 
Its average quality -reads 100\,\%, but the raw numbers underneath are far from stable: -mean packet drops of 1\,194 and a maximum spike of 55\,500, with -the 25th, 50th, and 75th percentiles all at zero. Hyprspace -alternates between perfect delivery and catastrophic bursts. -RIST's forward error correction compensates for most of these -events, but the worst spikes are severe enough to overwhelm FEC -entirely. +Hyprspace fails differently. Its average quality reads 100\%, but +the raw drop counts underneath are unstable: mean packet drops of +1\,194 and a maximum spike of 55\,500. The 25th, 50th, and 75th +percentiles are all zero, so most runs deliver perfectly while a +small number suffer catastrophic bursts. RIST's forward error +correction recovers from most of these events, but the worst spikes +overwhelm FEC entirely. \begin{figure}[H] \centering @@ -742,14 +761,17 @@ Reboot reconnection rearranges the rankings. Hyprspace, the worst performer under sustained TCP load, recovers in just 8.7~seconds on average, faster than any other VPN. WireGuard and Nebula follow at 10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, -10.07\,s across its three nodes, pointing to a hard-coded timer -rather than topology-dependent convergence. +10.07\,s across its three nodes, an exact match for Nebula's +\texttt{HostUpdateNotification} interval, whose default is +10~seconds in the lighthouse protocol (configurable, but the +benchmarks use the default). After a reboot, a node must +wait until the next periodic update before its lighthouses learn +its new endpoint, so the reconnection time tracks the timer rather +than any topology-dependent convergence. Mycelium sits at the opposite end, needing 76.6~seconds and showing the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s), suggesting a fixed protocol-level wait built into the overlay. -%TODO: Hard coded timer needs to be verified - Yggdrasil produces the most lopsided result in the dataset: its yuki node is back in 7.1~seconds while lom and luna take 94.8 and 97.3~seconds respectively. The gap likely reflects the overlay's @@ -810,7 +832,8 @@ retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than steady-state transfer, shrinking the max congestion window to 205\,KB, the smallest in the dataset. Under parallel load the -situation worsens: retransmits climb to 17\,426. % TODO: The explanation for the sender/receiver inversion (ACK delays +situation worsens: retransmits climb to 17\,426. % TODO: The +% explanation for the sender/receiver inversion (ACK delays % causing sender-side timer undercounting) is a hypothesis. Normally % sender >= receiver. Consider verifying with packet captures or % note this as a likely but unconfirmed explanation. @@ -829,43 +852,145 @@ quality outside of its burst events. The pathology is narrow but severe: any continuous data stream saturates the tunnel's internal buffers. +Hyprspace does import gVisor netstack, but reading the source +confirms that the gVisor TCP stack sits exclusively behind the +in-VPN ``service network'' feature. 
Regular tunnel traffic uses +an ordinary kernel TUN device created through the +\texttt{songgao/water} library, and the forwarding loop in +\texttt{node/node.go} only diverts a packet into the gVisor +stack when its destination falls inside the +\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4 +protocol is TCP; everything else is shipped verbatim over a +libp2p stream and written back into the receiving peer's kernel +TUN. Listings~\ref{lst:hyprspace_kernel_tun}, +\ref{lst:hyprspace_dispatch}, and \ref{lst:hyprspace_netstack} +show the relevant code in the upstream Hyprspace tree. + +\lstinputlisting[language=Go,caption={Hyprspace creates a real + kernel TUN via \texttt{songgao/water}; this is the device every + peer-to-peer packet traverses. +\textit{hyprspace/tun/tun\_linux.go:14--36}},label={lst:hyprspace_kernel_tun}]{Listings/hyprspace_tun_linux.go} + +\lstinputlisting[language=Go,caption={The IPv6 dispatch in the + Hyprspace forwarding loop only diverts to the gVisor service-network + TUN when the destination matches the + \texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4 + protocol byte is \texttt{0x06} (TCP); every other packet is left + on the kernel TUN path and forwarded over libp2p. +\textit{hyprspace/node/node.go:255--283}},label={lst:hyprspace_dispatch}]{Listings/hyprspace_dispatch.go} + +\lstinputlisting[language=Go,caption={Hyprspace's gVisor netstack + initialiser only enables TCP SACK; there is no \texttt{TCPRecovery} + override (RACK stays at gVisor's default), no congestion-control + override, and no buffer-size override. The text in + \texttt{tun.go} also notes the file is taken verbatim from + wireguard-go. +\textit{hyprspace/netstack/tun.go:6--80}},label={lst:hyprspace_netstack}]{Listings/hyprspace_netstack.go} + +Since the benchmark targets the regular Hyprspace IPv4/IPv6 +addresses rather than service-network proxies, both endpoints +rely on their host kernel's TCP stack for the entire transfer. +Whatever options Hyprspace's gVisor instance might set +internally — congestion control, loss recovery, buffer sizes — +are therefore irrelevant to these measurements; the inner TCP +state machine the kernel runs is the only one in the path. +The same caveat applies more sharply to Tailscale, where the +upstream documentation talks about an in-process gVisor TCP +stack but the benchmark traffic never reaches it; that case is +the subject of Section~\ref{sec:tailscale_degraded}. + +If gVisor is out of scope, the buffer bloat must originate +further up the Hyprspace stack instead. The most plausible +source is the libp2p / yamux stream layer through which raw IP +packets are funnelled. Hyprspace's TUN-read loop dispatches +each outbound packet on its own goroutine, and every such +goroutine ends up in \texttt{node/node.go}'s +\texttt{sendPacket}, which keeps exactly one libp2p stream per +destination peer in \texttt{activeStreams} and guards it with a +single per-peer \texttt{sync.Mutex} +(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent +application TCP flows to the same Hyprspace neighbour therefore +serialise behind that one lock: the parallel iPerf3 test, which +opens multiple TCP connections to the same peer at once, +collapses to a single send pipeline at this layer. Each +goroutine waiting for the lock pins its own 1420-byte packet +buffer, and the underlying yamux session adds a per-stream +flow-control window on top. 
None of this is visible to the +kernel TCP sender that produced the inner segments — the kernel +sees only that the TUN write returned — so it keeps growing +its congestion window while the libp2p layer falls further +behind. The geometry is the textbook one for buffer bloat: a +fast producer (kernel TCP) sitting upstream of a slow, +serialised consumer (the single yamux stream per peer) with +no flow-control signal coupling the two. + +\lstinputlisting[language=Go,caption={Hyprspace's outbound + fast path keeps exactly one libp2p stream per destination peer + in \texttt{activeStreams} and guards it with a per-peer + \texttt{sync.Mutex} held inside the \texttt{SharedStream} + record. The TUN-read loop spawns a fresh goroutine per packet + (\texttt{node.go:282}); each one calls \texttt{sendPacket} and + takes \texttt{ms.Lock} for the duration of the libp2p stream + write, so concurrent application TCP flows to the same + Hyprspace neighbour are serialised behind a single mutex. + \textit{hyprspace/node/node.go:36--39, 282, +328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go} + \paragraph{Mycelium: Routing Anomaly.} \label{sec:mycelium_routing} Mycelium's 34.9\,ms average latency appears to be the cost of -routing through a global overlay. The per-path numbers, however, +routing through a global overlay. The per-path +numbers, however, reveal a bimodal distribution: \begin{itemize} - \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable + \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct + path, comparable to Headscale at 1.64\,ms) \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed) \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed) \end{itemize} -One of the three links has found a direct route; the other two still +One of the three links has found a direct route; the +other two still bounce through the overlay. All three machines sit on the same -% TODO: Characterising path discovery as "failing intermittently" assumes -% direct routing is the expected outcome on a LAN. Mycelium is designed -% as a global overlay and may intentionally route through supernodes. -% If this is by-design behaviour, rephrase to avoid implying a bug. -% This characterisation also propagates to the impairment ping analysis -% (around line 966) which says impairment "pushes path discovery toward -% shorter routes." -% TODO: The throughput data INVERTS the latency split rather than -% "mirroring" it. The direct path (luna→lom, 1.63 ms RTT) achieves -% only 122 Mbps, while the overlay-routed path (yuki→luna, 51.60 ms -% RTT) reaches 379 Mbps --- the opposite of what TCP theory predicts. -% The plot also shows luna→lom receiver throughput at only 57.2 Mbps -% (a 53% sender/receiver gap on that link). Explain why the direct +% TODO: Characterising path discovery as "failing +% intermittently" assumes +% direct routing is the expected outcome on a LAN. +% Mycelium is designed +% as a global overlay and may intentionally route +% through supernodes. +% If this is by-design behaviour, rephrase to avoid +% implying a bug. +% This characterisation also propagates to the +% impairment ping analysis +% in Section sec:impairment, which says impairment "pushes path +% discovery toward shorter routes." +% TODO: The throughput data INVERTS the latency split +% rather than +% "mirroring" it. The direct path (luna→lom, 1.63 ms +% RTT) achieves +% only 122 Mbps, while the overlay-routed path +% (yuki→luna, 51.60 ms +% RTT) reaches 379 Mbps: the opposite of what TCP +% theory predicts. 
+% The plot also shows luna→lom receiver throughput at +% only 57.2 Mbps +% (a 53% sender/receiver gap on that link). Explain +% why the direct % path is 3× slower than the overlay path, or acknowledge the -% contradiction. The current wording "mirrors the split" is incorrect. -physical network, so Mycelium's path discovery is not consistently -selecting the direct route, a more specific problem than blanket overlay +% contradiction. The current wording "mirrors the +% split" is incorrect. +physical network, so Mycelium's path discovery is not +consistently +selecting the direct route, a more specific problem +than blanket overlay overhead. Throughput shows a similarly lopsided split: yuki$\rightarrow$luna reaches 379\,Mbps while luna$\rightarrow$lom manages only 122\,Mbps, a 3:1 gap. In -bidirectional mode, the reverse direction on that worst link drops +bidirectional mode, the reverse direction on that +worst link drops to 58.4\,Mbps, the lowest single-direction figure in the entire dataset. @@ -873,29 +998,37 @@ dataset. \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png} - % TODO: The caption attributes the asymmetry to "inconsistent direct - % route discovery" but the direct-route link (luna→lom, 1.63 ms RTT) - % is actually the SLOWEST (122 Mbps). The caption should address + % TODO: The caption attributes the asymmetry to + % "inconsistent direct + % route discovery" but the direct-route link + % (luna→lom, 1.63 ms RTT) + % is actually the SLOWEST (122 Mbps). The caption + % should address % why the direct path underperforms the overlay paths. \caption{Per-link TCP throughput for Mycelium, showing extreme path asymmetry. The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps) and worst - (luna$\rightarrow$lom, 122\,Mbps) links does not correlate with + (luna$\rightarrow$lom, 122\,Mbps) links does not + correlate with the latency split (Section~\ref{sec:mycelium_routing}).} \label{fig:mycelium_paths} \end{figure} % TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment % (47.3 ms) numbers are from qperf but not shown in any figure. -% Add a connection-setup latency table or plot. Also clarify what -% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?) +% Add a connection-setup latency table or plot. Also +% clarify what +% Internal's connection establishment time is (47.3 / +% 3 = 15.8 ms?) % so the "3× overhead" can be verified. The overlay penalty shows up most clearly at connection setup. -Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's +Mycelium's average time-to-first-byte is 93.7\,ms +(vs.\ Internal's 16.8\,ms, a 5.6$\times$ overhead), and connection establishment alone costs 47.3\,ms (3$\times$ overhead). Every new connection incurs that overhead, so workloads dominated by -short-lived connections accumulate it rapidly. Bulk downloads, by +short-lived connections accumulate it rapidly. Bulk +downloads, by contrast, amortize it: the Nix cache test finishes only 18\,\% slower than Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase begins, per-connection latency fades into the @@ -903,69 +1036,101 @@ background. Mycelium is also the slowest VPN to recover from a reboot: 76.6~seconds on average, and almost suspiciously uniform across -nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a -hard-coded convergence timer in the overlay protocol rather than -anything topology-dependent. 
The UDP test timed out at -120~seconds, and even first-time connectivity required a -70-second wait at startup. +nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to +a fixed convergence timer in the overlay protocol — +most likely a +default interval rather than anything topology-dependent. +% TODO: Identify which Mycelium constant or default this 75-78 s +% recovery actually corresponds to before claiming it is a fixed +% timer; the source code would settle whether it is hard-coded, +% a configurable default, or coincidence. +The UDP test timed out at 120~seconds, and even first-time +connectivity required a 70-second wait at startup. % Explain what topology-dependent means in this case. \paragraph{Tinc: Userspace Processing Bottleneck.} -Tinc is a clear case of a CPU bottleneck masquerading as a network +Tinc is a clear case of a CPU bottleneck masquerading +as a network problem. At 1.19\,ms latency, packets get through the tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a -third of the bare-metal link. % TODO: "path MTU is a healthy 1,500 bytes" but blksize_bytes is -% 1,353. These are different metrics --- blksize_bytes is the UDP -% payload size, not the path MTU. Clarify the distinction or -% remove the 1,500 claim. +third of the bare-metal link. The usual suspects do not apply: Tinc's effective UDP payload size (\texttt{blksize\_bytes} of -1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and + 1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and WireGuard at 1\,368) is in the normal range, and its retransmit -count (240) is moderate. What limits Tinc is its single-threaded -userspace architecture: one CPU core simply cannot encrypt, copy, +count (240) is moderate. What limits Tinc is its +single-threaded +userspace architecture: one CPU core simply cannot +encrypt, copy, and forward packets fast enough to fill the pipe. -% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the Tinc CPU bottleneck -% diagnosis from above, but the 14.9% CPU figure has an unresolved TODO -% (the same utilization as VpnCloud at 539 Mbps). If the CPU claim is +% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the +% Tinc CPU bottleneck +% diagnosis from above, but the 14.9% CPU figure has +% an unresolved TODO +% (the same utilization as VpnCloud at 539 Mbps). If +% the CPU claim is % revised or refuted, this confirmation must be updated too. The parallel benchmark confirms this diagnosis. Tinc scales to 563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio. -Multiple TCP streams collectively keep that single core busy during -what would otherwise be idle gaps in any individual flow, squeezing +Multiple TCP streams collectively keep that single +core busy during +what would otherwise be idle gaps in any individual +flow, squeezing out throughput that no single stream could reach alone. -\section{Impact of Network Impairment} +\section{Impact of network impairment} \label{sec:impairment} -Baseline benchmarks rank VPNs by overhead under ideal conditions. -The impairment profiles from Table~\ref{tab:impairment_profiles} -test a different property: resilience. Two results dominate the -data. First, the throughput hierarchy from -Section~\ref{sec:baseline} collapses under degradation --- at High -impairment, the 675\,Mbps spread across all implementations compresses -to under 3\,Mbps, and architectural differences that matter at gigabit speeds -vanish. 
Second, Headscale outperforms the bare-metal Internal -baseline at Medium impairment across TCP, parallel TCP, and Nix -cache benchmarks. A VPN built on WireGuard should not beat a direct -connection; Section~\ref{sec:tailscale_degraded} traces the cause to -three TCP parameters in Tailscale's userspace network stack. +Baseline benchmarks rank VPNs by overhead under ideal +conditions. +The impairment profiles in +Table~\ref{tab:impairment_profiles} test +a different property: resilience. Two results +dominate the data. + +The first is the collapse of the throughput hierarchy. At High +impairment, the 675\,Mbps spread between fastest and slowest +implementation compresses to under 3\,Mbps. Architectural +differences that mattered at gigabit speeds become +invisible once +the network is the bottleneck. + +The second is harder to explain. Headscale outperforms the +bare-metal Internal baseline at Medium impairment across TCP, +parallel TCP, and the Nix cache benchmark. A VPN built on +WireGuard should not beat a direct connection. +Section~\ref{sec:tailscale_degraded} pursues this anomaly +through what turns out to be the wrong hypothesis. The +investigation begins with Tailscale's much-discussed gVisor TCP +stack, validates the candidate parameters in isolation on the +bare-metal host, and only then discovers — by reading the rig's +own NixOS module — that the gVisor stack is not actually in the +data path of the benchmark at all. The real culprit is a +combination of the Linux kernel's tight default +\texttt{tcp\_reordering} threshold and the way +\texttt{wireguard-go} +batches packets between the wire and the host kernel TCP stack. \subsection{Ping} -Latency is the most predictable metric under impairment. Most VPNs -absorb the injected delay with a fixed per-hop overhead, and rankings +Latency is the most predictable metric under +impairment. Most VPNs +absorb the injected delay with a fixed per-hop +overhead, and rankings within the central cluster barely change across profiles -(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and -15\,ms of round-trip delay at Low, Medium, and High respectively; +(Table~\ref{tab:ping_impairment}). tc~netem adds +roughly 4, 8, and +15\,ms of round-trip delay at Low, Medium, and High +respectively; Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this. \begin{table}[H] \centering - \caption{Average ping RTT (ms) across impairment profiles, sorted + \caption{Average ping RTT (ms) across impairment + profiles, sorted by High-profile RTT} \label{tab:ping_impairment} \begin{tabular}{lrrrr} @@ -990,74 +1155,107 @@ Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this. \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png} - \caption{Average ping RTT across impairment profiles. Most VPNs + \includegraphics[width=\textwidth]{{Figures/impairment/Ping + Average RTT Heatmap}.png} + \caption{Average ping RTT across impairment + profiles. Most VPNs form a tight parallel band; Mycelium's non-monotonic curve, EasyTier's excess latency at High, and Hyprspace's upward divergence stand out.} \label{fig:ping_impairment_heatmap} \end{figure} -Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at -baseline to 23.4\,ms at Low impairment, a 33\% improvement where -every other VPN gets slower. It then rises to 43.9\,ms at Medium -before falling again to 33.0\,ms at High. 
The baseline analysis -(Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency -comes from a bimodal routing distribution: one path runs at 1.63\,ms -while two others route through the global overlay at -${\sim}$51\,ms. % TODO: DOWNSTREAM DEPENDENCY — This explanation depends on the baseline -% characterisation of Mycelium's path discovery as "failing intermittently" -% (Section mycelium_routing). If that characterisation is revised (e.g., -% overlay routing is by-design, not a failure), then the claim that -% impairment "pushes path discovery toward shorter routes" needs rethinking: -% the mechanism would be different if Mycelium is not trying to find direct +Mycelium defies the pattern. Its RTT \emph{drops} +from 34.9\,ms at +baseline to 23.4\,ms at Low impairment, a 33\% +improvement at the +profile where every other VPN gets slower. It then climbs to +43.9\,ms at Medium before falling again to 33.0\,ms +at High. The +baseline analysis +(Section~\ref{sec:mycelium_routing}) showed that +Mycelium's latency comes from a bimodal routing +distribution: one +path runs at 1.63\,ms, two others route through the +global overlay at +${\sim}$51\,ms. % TODO: DOWNSTREAM DEPENDENCY — This +% explanation depends on the baseline +% characterisation of Mycelium's path discovery as +% "failing intermittently" +% (Section mycelium_routing). If that +% characterisation is revised (e.g., +% overlay routing is by-design, not a failure), then +% the claim that +% impairment "pushes path discovery toward shorter +% routes" needs rethinking: +% the mechanism would be different if Mycelium is not +% trying to find direct % routes in the first place. -The impairment appears to push Mycelium's path -discovery toward shorter routes, so a larger share of traffic takes -the direct path. The non-monotonic pattern is consistent with a path -selection algorithm that responds to measured link quality, but not -linearly with degradation severity. +Impairment seems to push Mycelium's path selection toward the +shorter route, so a larger share of traffic avoids the overlay +detour. The non-monotonic curve is consistent with a +path selection +algorithm that reacts to measured link quality but +not linearly with +degradation severity. % TODO: Ping packet loss data is not shown in any figure. Add a -% packet loss table/figure or reference the raw data so readers can +% packet loss table/figure or reference the raw data +% so readers can % verify these numbers. -Mycelium also achieves 0\% ping packet loss at Low and Medium -impairment, while most VPNs show 0.1--3.2\% loss at those profiles. -At High impairment, Mycelium's loss jumps to 11.1\%. +Mycelium loses zero ping packets at Low and Medium impairment. +Most other VPNs show 0.1--3.2\% loss at those profiles. At High +impairment Mycelium's loss jumps to 11.1\%. -% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and -% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping -% heatmap only shows averages. Add a jitter/distribution figure. +% TODO: EasyTier's max RTT (290 ms), WireGuard's max +% (~40 ms), and +% EasyTier's std dev (44.6 ms) are not shown in any +% plot. The ping +% heatmap only shows averages. Add a +% jitter/distribution figure. % Also, the "userspace retry mechanism" is a hypothesized cause % without source-code or packet-level evidence. EasyTier accumulates 11\,ms of excess latency at High impairment -beyond what tc~netem accounts for. 
Its average RTT of 26.6\,ms and -maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) suggest a -userspace retry mechanism that introduces escalating variance. -EasyTier's RTT standard deviation reaches 44.6\,ms at High, the -worst jitter of any VPN. +beyond what tc~netem injects. Its average RTT is +26.6\,ms and its +maximum reaches 290\,ms, against ${\sim}$40\,ms for +WireGuard. The +RTT standard deviation reaches 44.6\,ms at High, the +worst jitter +of any VPN. A userspace retry mechanism is the +likely cause, but +without source-code evidence we cannot say so with certainty. % TODO: Ping packet loss data is not shown in any plot. The 1/9 -% = 11.1\% interpretation is clever but depends on the exact test -% structure (3 pairs × 3 runs × 100 packets). Verify this matches +% = 11.1\% interpretation is clever but depends on +% the exact test +% structure (3 pairs × 3 runs × 100 packets). Verify +% this matches % the actual test setup and add a supporting figure or table. -Hyprspace shows 11.1\% ping packet loss at every impairment level --- -Low, Medium, and High alike. With 9~measurement runs (3~machine -pairs $\times$ 3~runs of 100~packets), 11.1\% equals exactly 1/9: -one run per profile fails completely while the other eight report zero -loss. % TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the buffer -% bloat diagnosis from Section hyprspace_bloat, which depends on the +Hyprspace shows the same 11.1\% ping packet loss at Low, Medium, +and High impairment. With 9~measurement runs per +profile (3~machine +pairs $\times$ 3~runs of 100~packets), 11.1\% is +exactly 1/9: one +run fails completely while the other eight report zero loss. +% TODO: DOWNSTREAM DEPENDENCY — This is a third +% reference to the buffer +% bloat diagnosis from Section hyprspace_bloat, which +% depends on the % unverified 2,800 ms under-load latency. If that diagnosis is % revised, this explanation must also be revisited. -This binary pass/fail behavior is consistent with the buffer bloat -diagnosis from Section~\ref{sec:hyprspace_bloat}: when buffers fill, -an entire path stalls rather than degrading gradually. +The binary pass/fail behaviour fits the buffer bloat +diagnosis from +Section~\ref{sec:hyprspace_bloat}: when the tunnel's +buffers fill, a +path stalls completely rather than degrading gradually. -\subsection{TCP Throughput} +\subsection{TCP throughput} -TCP throughput is where the baseline hierarchy breaks down. The -three performance tiers from Section~\ref{sec:baseline} dissolve at -the first impairment step (Table~\ref{tab:tcp_impairment}). +The baseline TCP hierarchy does not survive impairment. The +three performance tiers from +Section~\ref{sec:baseline} dissolve at +the first step (Table~\ref{tab:tcp_impairment}). \begin{table}[H] \centering @@ -1089,116 +1287,186 @@ the first impairment step (Table~\ref{tab:tcp_impairment}). \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png} - \caption{Single-stream TCP throughput across impairment profiles. + \includegraphics[width=\textwidth]{{Figures/impairment/TCP + Throughput Heatmap}.png} + \caption{Single-stream TCP throughput across + impairment profiles. 
Headscale crosses above Internal at Medium impairment; Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs converge at High.} \label{fig:tcp_impairment_heatmap} \end{figure} -Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a -98.3\% throughput loss from adding just 2\,ms latency, 2\,ms jitter, -0.25\% packet loss, and 0.5\% reordering per machine. Even Mycelium, -the slowest VPN at baseline (259\,Mbps), retains more throughput at -Low than Yggdrasil does. The jumbo overlay MTU of 32\,731~bytes, -which inflated baseline metrics -(Section~\ref{sec:baseline}), becomes a liability under impairment: -each lost or reordered outer packet triggers retransmission of -${\sim}$24$\times$ more inner-layer data than a standard -1\,400-byte MTU VPN would lose. +Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low +impairment, a +98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter, +0.25\% packet loss, and 0.5\% reordering per machine. +Even Mycelium, +the slowest VPN at baseline (259\,Mbps), retains more +throughput at +Low than Yggdrasil does. The jumbo overlay MTU of 32\,731~bytes +that inflated Yggdrasil's baseline numbers +(Section~\ref{sec:baseline}) becomes a liability +under impairment: +every lost or reordered outer packet costs roughly +24$\times$ more +retransmitted inner data than a standard 1\,400-byte +MTU VPN would +lose. -Headscale retains 34.3\% of its baseline throughput at Low, nearly -matching Internal's 35.7\%. At Medium impairment, Headscale -(41.5\,Mbps) overtakes Internal (29.6\,Mbps) --- a VPN outperforming -the bare-metal baseline. -Section~\ref{sec:tailscale_degraded} investigates this anomaly in +Headscale retains 34.3\% of its baseline throughput +at Low, almost +the same as Internal's 35.7\%. At Medium impairment, Headscale +(41.5\,Mbps) overtakes Internal (29.6\,Mbps). +Section~\ref{sec:tailscale_degraded} investigates +this anomaly in detail. -At High impairment, the throughput range compresses from 675\,Mbps at -baseline to just 2.9\,Mbps. Internal leads at 4.25\,Mbps; Hyprspace -trails at 1.39\,Mbps. The impairment profile itself becomes the -bottleneck. With 2.5\% packet loss and 5\% reordering per machine, -every implementation is TCP-loss-limited, and architectural -differences that matter at gigabit speeds become irrelevant. +At High impairment, the throughput range collapses +from 675\,Mbps to +2.9\,Mbps. Internal leads at 4.25\,Mbps, Hyprspace trails at +1.39\,Mbps, and the impairment profile itself is the bottleneck. +With 2.5\% packet loss and 5\% reordering per machine, every +implementation is loss-limited, and the architectural +differences +that mattered at gigabit speeds no longer matter at all. -\subsection{UDP Throughput} +\subsection{UDP throughput} -The UDP stress test (\texttt{-b~0}) separates kernel-level from -userspace implementations more cleanly than any TCP benchmark. It -also produces widespread failures under impairment: Hyprspace and -Mycelium, which already failed at baseline, continue to time out at -% TODO: Tinc fails at Low and Medium but succeeds at High (8 Mbps) --- -% the same non-monotonic failure pattern as Internal/WireGuard (fail -% at Low, succeed at Medium/High). This suggests the failures are -% iPerf3/tc interaction issues rather than fundamental VPN limitations. -% Nebula and VpnCloud also fail selectively. The widespread non-monotonic -% failure pattern undermines using this benchmark as a reliability -% indicator (see line 1163 claim). Consider discussing this pattern. 
-all profiles, and Tinc drops out at Low and Medium while ZeroTier -fails at Medium. Despite the sparse dataset, one pattern is clear. +The UDP stress test (\texttt{-b~0}) separates +implementations with +effective backpressure from those without it more +cleanly than any +TCP benchmark. Under impairment, it also produces widespread +failures. +% TODO: Tinc fails at Low and Medium but succeeds at +% High (8 Mbps): +% the same non-monotonic failure pattern as +% Internal/WireGuard (fail +% at Low, succeed at Medium/High). This suggests the +% failures are +% iPerf3/tc interaction issues rather than +% fundamental VPN limitations. +% Nebula and VpnCloud also fail selectively. The +% widespread non-monotonic +% failure pattern undermines using this benchmark as +% a reliability +% indicator (see line 1163 claim). Consider +% discussing this pattern. +Hyprspace and Mycelium continue to time out at all profiles, +extending their baseline failures. Tinc drops out at Low and +Medium, ZeroTier at Medium. The data is sparse, but one pattern +emerges from the runs that did complete. -% TODO: The heatmap shows Internal and WireGuard both fail (×) at -% some impairment profiles (e.g., Internal fails at Low, WireGuard +% TODO: The heatmap shows Internal and WireGuard both +% fail (×) at +% some impairment profiles (e.g., Internal fails at +% Low, WireGuard % at Low and High). "Regardless of impairment" overstates the % evidence. Rephrase to reflect the failures, or explain why % those runs failed despite the claim of maintained throughput. -% TODO: Internal (and WireGuard) fail at Low impairment in the UDP -% test but succeed at Medium and High --- the opposite of what one -% would expect. This is never explained. Investigate and add an -% explanation (e.g., iPerf3 crash, tc interaction, timing issue). -Kernel-level implementations maintain throughput at the profiles -where data exists. Internal holds ${\sim}$950\,Mbps at -Baseline, Medium, and High. Headscale sustains 700--876\,Mbps and WireGuard -850--898\,Mbps; % TODO: verify WireGuard UDP range -- analysis doc says 850-898, possible digit transposition -both use WireGuard's kernel module for the outer tunnel, which -provides proper backpressure at the transport layer. Userspace VPNs collapse: EasyTier drops from +% TODO: Internal (and WireGuard) fail at Low +% impairment in the UDP +% test but succeed at Medium and High: the opposite of what one +% would expect. This is never explained. +% Investigate and add an +% explanation (e.g., iPerf3 crash, tc interaction, +% timing issue). +Three implementations maintain throughput at the profiles where +data exists. Internal holds ${\sim}$950\,Mbps at +Baseline, Medium, +and High; WireGuard sustains 850--898\,Mbps; and +Headscale sustains +700--876\,Mbps. % TODO: verify WireGuard UDP range -- +% analysis doc says 850-898, possible digit transposition +Internal and WireGuard ride the host kernel's transport-layer +backpressure (Internal directly, WireGuard via the in-kernel +WireGuard module). Headscale, by contrast, never +uses the kernel +module even though it builds on the WireGuard protocol: as +established in Section~\ref{sec:baseline}, Tailscale's +\texttt{magicsock} layer intercepts every packet for endpoint +selection, DERP relay, and the disco protocol, and that +interception is incompatible with the kernel WireGuard datapath. 
+Headscale therefore runs \texttt{wireguard-go} in userspace and +compensates with UDP batching +(\texttt{recvmmsg}/\texttt{sendmmsg}), +host-kernel UDP segmentation/aggregation offload +(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer +WireGuard socket), and a 7\,MB socket buffer on the same outer +socket. These offloads live in the host kernel; gVisor netstack +itself implements no UDP GSO or UDP GRO of its own. +Together they +absorb a \texttt{-b 0} sender flood without +collapsing. Userspace +VPNs without the same engineering do collapse: +EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across successive profiles. -Yggdrasil, already pathological at baseline (98.7\% loss), crashes to -12.3\,Mbps at Low and fails entirely at Medium and High. +Yggdrasil, already pathological at baseline (98.7\% +loss), crashes +to 12.3\,Mbps at Low and fails entirely at Medium and High. \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png} - % TODO: This caption says "kernel-level VPNs maintain high throughput" - % but the heatmap shows Internal, WireGuard, and Headscale ALL fail - % ($\times$) at Low impairment. WireGuard also fails at High. - % Rephrase to acknowledge the failures or explain them. + \includegraphics[width=\textwidth]{{Figures/impairment/UDP + Receiver Throughput Heatmap}.png} + % TODO: The heatmap shows Internal, WireGuard, and + % Headscale all + % fail ($\times$) at Low impairment. WireGuard also fails at + % High. These selective failures need an explanation + % (iPerf3/tc interaction?). \caption{UDP receiver throughput across impairment profiles. - Kernel-level VPNs (Internal, WireGuard, Headscale) maintain high - throughput where they complete; userspace VPNs collapse or fail - entirely ($\times$ marks a failed run).} + Implementations with effective UDP backpressure + (Internal and + WireGuard via the in-kernel datapath; Headscale via + \texttt{wireguard-go} batching plus large socket buffers) + maintain high throughput where they complete; + other userspace + VPNs collapse or fail entirely ($\times$ marks a failed run).} \label{fig:udp_impairment_heatmap} \end{figure} -% TODO: This "robustness indicator" interpretation is undermined by -% the non-monotonic failure pattern. Internal and WireGuard fail at -% Low (0.25% loss) but succeed at Medium and High (1%+ loss). If -% failures indicated "fundamental flow-control problems," they should -% get worse with more impairment, not better. The pattern suggests -% iPerf3 or tc timing issues rather than VPN limitations. Either +% TODO: This "robustness indicator" interpretation is +% undermined by +% the non-monotonic failure pattern. Internal and +% WireGuard fail at +% Low (0.25% loss) but succeed at Medium and High +% (1%+ loss). If +% failures indicated "fundamental flow-control +% problems," they should +% get worse with more impairment, not better. The +% pattern suggests +% iPerf3 or tc timing issues rather than VPN +% limitations. Either % explain the non-monotonic failures or weaken this conclusion. -The failure rate of this benchmark under impairment makes it more -useful as a robustness indicator than a throughput measurement. A VPN -that cannot complete a 30-second UDP flood under 0.25\% packet loss -has fundamental flow-control problems that will surface under real -workloads too, even if the symptoms are milder. +Under impairment this benchmark is more useful as a robustness +indicator than as a throughput measurement. 
A VPN that cannot +complete a 30-second UDP flood under 0.25\% packet loss has a +flow-control problem that will surface under real workloads too, +even when the symptoms are milder. \subsection{Parallel TCP} -% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must match -% the baseline parallel test description. The baseline section has an -% unresolved TODO about whether the test uses 6 or 10 streams. If the -% baseline is corrected to 10, this section must also be updated. +% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional +% flows" must match +% the baseline parallel test description. The +% baseline section has an +% unresolved TODO about whether the test uses 6 or 10 +% streams. If the +% baseline is corrected to 10, this section must also +% be updated. The Headscale anomaly from single-stream TCP grows larger under -parallel load. Table~\ref{tab:parallel_impairment} shows aggregate +parallel load. Table~\ref{tab:parallel_impairment} +shows aggregate throughput across three concurrent bidirectional links (six unidirectional flows). \begin{table}[H] \centering - \caption{Parallel TCP throughput (Mbps) across impairment profiles. - Three concurrent bidirectional links produce six unidirectional + \caption{Parallel TCP throughput (Mbps) across + impairment profiles. + Three concurrent bidirectional links produce six + unidirectional flows.} \label{tab:parallel_impairment} \begin{tabular}{lrrrr} @@ -1223,77 +1491,93 @@ unidirectional flows). \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png} + \includegraphics[width=\textwidth]{{Figures/impairment/Parallel + TCP Throughput Heatmap}.png} \caption{Parallel TCP throughput across impairment profiles. Headscale dominates at Low (718\,Mbps vs.\ Internal's 277); - EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to + EasyTier is the runner-up (473\,Mbps); Hyprspace + collapses to 2.87\,Mbps.} \label{fig:parallel_impairment_heatmap} \end{figure} -Headscale at Low impairment: 718\,Mbps --- 2.6$\times$ Internal -(277\,Mbps) and 4.1$\times$ WireGuard (173\,Mbps). At Medium, -Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%. -Whatever mechanism produces the single-stream crossover at Medium -scales with the number of flows: six independent streams each -benefit from it. +At Low impairment, Headscale reaches 718\,Mbps: 2.6$\times$ +Internal's 277\,Mbps and 4.1$\times$ WireGuard's 173\,Mbps. At +Medium, Headscale (113\,Mbps) still leads Internal +(82.6\,Mbps) by +37\%. Whatever mechanism produces the single-stream +crossover at +Medium scales with the flow count, because each of the six +concurrent streams benefits from it independently. -% TODO: EasyTier's resilience (473 Mbps at Low, 51% retention) is the -% second-best result after Headscale, yet receives no architectural -% explanation. Headscale gets an entire subsection attributing its -% resilience to gVisor TCP tuning. Either explain what gives EasyTier -% its resilience (e.g., its own TCP stack, congestion control, FEC) -% or acknowledge the gap explicitly. -EasyTier is the second-most resilient VPN under parallel load, at -473\,Mbps at Low (51\% of baseline). Both EasyTier and Headscale -retain more than half their baseline parallel throughput at Low -impairment; no other VPN exceeds 30\%. +EasyTier is the runner-up under parallel load: 473\,Mbps at Low, +51\% of its baseline. 
Headscale and EasyTier are the only VPNs +that retain more than half their baseline parallel throughput at +Low impairment; no other implementation exceeds 30\%. +We have no +direct architectural explanation for EasyTier's resilience and +do not claim one here. -Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\% -loss. % TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat diagnosis -% from Section hyprspace_bloat, which depends on the unverified 2,800 ms -% under-load latency. If that diagnosis is revised, this explanation +Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at +Low, a 99.6\% +loss. % TODO: DOWNSTREAM DEPENDENCY — This +% references the buffer bloat diagnosis +% from Section hyprspace_bloat, which depends on the +% unverified 2,800 ms +% under-load latency. If that diagnosis is revised, +% this explanation % for parallel collapse must also be revisited. -The buffer bloat that plagues single-stream transfers -(Section~\ref{sec:hyprspace_bloat}) becomes catastrophic when six -concurrent flows compete for the same bloated buffers. +The buffer bloat that already plagues single-stream transfers +(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six +flows compete for the same bloated buffers at once. -The High-profile convergence effect is even more pronounced here than -in single-stream mode. Tinc and VpnCloud land at identical -8.25\,Mbps despite differing by 200\,Mbps at baseline. +High-profile convergence is more pronounced here than in +single-stream mode. Tinc and VpnCloud land at identical +8.25\,Mbps even though they differ by 200\,Mbps at baseline. -\subsection{QUIC Performance} +\subsection{QUIC performance} Headscale and Nebula failed the qperf QUIC benchmark at baseline -(Section~\ref{sec:baseline}) and continue to fail across all -impairment profiles. +(Section~\ref{sec:baseline}) and continue to fail at every +impairment profile. Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to -7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High --- -the same cliff observed in its TCP results, again driven by -jumbo-MTU amplification of outer-layer packet loss. +7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High. +This is the same cliff observed in its TCP results, +driven by the +same jumbo-MTU amplification of outer-layer packet loss. -At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps), +At High impairment, WireGuard (23.2\,Mbps), VpnCloud +(23.4\,Mbps), ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within -0.4\,Mbps of each other. At baseline these four span a 188\,Mbps -range (844 to 656\,Mbps). QUIC's own congestion control, operating atop the -already-degraded outer link, becomes the sole limiter. +0.4\,Mbps of one another. At baseline these four +span a 188\,Mbps +range (656 to 844\,Mbps). QUIC's own congestion +control, running on +top of an already-degraded outer link, has become the +sole limiter. \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png} + \includegraphics[width=\textwidth]{{Figures/impairment/QUIC + Bandwidth Heatmap}.png} \caption{QUIC bandwidth across impairment profiles. Yggdrasil - drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier, - and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and + drops from 745 to 8\,Mbps at Low; WireGuard, + VpnCloud, ZeroTier, + and Tinc converge to ${\sim}$23\,Mbps at High. 
+ Headscale and Nebula fail at all profiles ($\times$).} \label{fig:quic_impairment_heatmap} \end{figure} -\subsection{Video Streaming} +\subsection{Video streaming} -At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's -throughput budget even at High impairment. Quality differences in -Table~\ref{tab:rist_impairment} therefore reflect packet delivery +At ${\sim}$3.3\,Mbps, the RIST video stream sits +within every VPN's +throughput budget even at High impairment. Quality +differences in +Table~\ref{tab:rist_impairment} therefore reflect +packet delivery reliability, not bandwidth. \begin{table}[H] @@ -1323,51 +1607,69 @@ reliability, not bandwidth. \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png} - \caption{RIST video streaming quality across impairment profiles. + \includegraphics[width=\textwidth]{{Figures/impairment/Video + Streaming Quality Heatmap}.png} + \caption{RIST video streaming quality across + impairment profiles. Headscale is stuck at ${\sim}$13\% regardless of profile; Mycelium maintains ${\sim}$100\% even at High; Yggdrasil declines steeply to 43\%.} \label{fig:rist_impairment_heatmap} \end{figure} -Headscale stays at ${\sim}$13\% across all four profiles: 13.1\%, -13.0\%, 13.0\%, 13.0\%. The profile-independence confirms the -baseline diagnosis from Section~\ref{sec:baseline}. The failure is -% TODO: DOWNSTREAM DEPENDENCY — This repeats the DERP/MTU hypothesis from -% Section baseline as though it were established. The baseline TODO notes -% this hypothesis is unverified (no packet capture evidence). Do not -% present it as a confirmed diagnosis here without resolving the upstream TODO. -structural --- likely MTU fragmentation in the DERP relay layer --- -and cannot worsen because it is already saturated. Adding latency or -loss on top of an 87\% packet drop floor changes nothing. +Headscale sits at ${\sim}$13\% across all four profiles: 13.1\%, +13.0\%, 13.0\%, 13.0\%. This profile-independence confirms the +baseline diagnosis (Section~\ref{sec:baseline}): the failure is +% TODO: DOWNSTREAM DEPENDENCY — This repeats the +% DERP/MTU hypothesis from +% Section baseline as though it were established. +% The baseline TODO notes +% this hypothesis is unverified (no packet capture +% evidence). Do not +% present it as a confirmed diagnosis here without +% resolving the upstream TODO. +structural (most plausibly MTU fragmentation in the DERP relay +layer) and cannot worsen because it is already +saturated. Adding +latency or loss on top of an 87\% packet drop floor changes +nothing. -Mycelium delivers 99.9\% quality even at High impairment, better than +Mycelium holds 99.9\% quality even at High impairment, ahead of Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even -Mycelium's degraded overlay paths can sustain the stream. The same -overlay routing that adds 34.9\,ms of latency and cripples bulk TCP -transfers is harmless at video bitrates. RIST's own forward error -correction compensates for whatever packet loss remains. +Mycelium's degraded overlay paths comfortably sustain +the stream. +The same overlay routing that adds 34.9\,ms of +latency and cripples +bulk TCP transfers is harmless at video bitrates, and RIST's +forward error correction handles the residual loss. -% TODO: The claim that jumbo MTU causes burst losses that overwhelm -% FEC is a hypothesis. No FEC analysis or packet-level evidence is -% shown. Consider adding packet capture data or softening the claim. 
-Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at -Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP -throughput likely hurts here too --- large overlay packets carrying -RIST data are more likely to be lost or reordered at the outer layer, -and RIST's FEC may not recover from the resulting burst losses. +% TODO: The claim that jumbo MTU causes burst losses +% that overwhelm +% FEC is a hypothesis. No FEC analysis or +% packet-level evidence is +% shown. Consider adding packet capture data or +% softening the claim. +Yggdrasil degrades the most steeply: 100\% at +baseline, 94.7\% at +Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU +that hurt TCP +throughput likely hurts here as well: large overlay packets are +more exposed to loss and reordering at the outer layer, and the +resulting burst losses may exceed what RIST's FEC can recover. -\subsection{Application-Level Download} +\subsection{Application-level download} + +The Nix binary cache download is the most demanding +application-level benchmark. Hundreds of sequential HTTP +connections amplify the per-connection latency +penalties that bulk +throughput tests amortise. Table~\ref{tab:nix_impairment} shows +download times across profiles. -The Nix binary cache download is the most demanding application-level -benchmark: hundreds of sequential HTTP connections amplify -per-connection latency penalties that bulk throughput tests amortize. -Table~\ref{tab:nix_impairment} shows download times across profiles. - \begin{table}[H] \centering - \caption{Nix binary cache download time (seconds) across impairment + \caption{Nix binary cache download time (seconds) + across impairment profiles, sorted by Low-profile time. ``--'' marks a failed run.} \label{tab:nix_impairment} @@ -1393,55 +1695,81 @@ Table~\ref{tab:nix_impairment} shows download times across profiles. \begin{figure}[H] \centering - \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png} - \caption{Nix binary cache download time across impairment profiles. - Headscale, Nebula, and Tinc complete all four profiles; Headscale + \includegraphics[width=\textwidth]{{Figures/impairment/Nix + Cache Download Time Heatmap}.png} + \caption{Nix binary cache download time across + impairment profiles. + Headscale, Nebula, and Tinc complete all four + profiles; Headscale beats Internal at Medium (49\,s vs.\ 59\,s). Yggdrasil's - Low-profile time explodes to 230\,s ($\times$ marks a failed run).} + Low-profile time explodes to 230\,s ($\times$ marks + a failed run).} \label{fig:nix_impairment_heatmap} \end{figure} -Headscale, Nebula, and Tinc are the only VPNs to complete all four -profiles. At Medium impairment, Headscale finishes in 48.8~seconds ---- faster than Internal's 58.6~seconds. Internal itself fails at -High impairment while Headscale completes in 219~seconds, Tinc in +Headscale, Nebula, and Tinc are the only VPNs to +complete all four +profiles. At Medium impairment, Headscale finishes +in 48.8~seconds, +faster than Internal's 58.6~seconds. Internal itself +fails at High +impairment while Headscale completes in 219~seconds, Tinc in 496~seconds, and Nebula in 547~seconds. Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low -impairment, a 22$\times$ slowdown. Every HTTP request incurs the -latency penalty from Yggdrasil's impairment-amplified -retransmissions. 
Mycelium also degrades severely (10.1\,s to -79.5\,s, an 8$\times$ increase), consistent with its overlay routing -overhead, which compounds over hundreds of sequential HTTP -connections. +impairment, a 22$\times$ slowdown. Every HTTP request pays the +latency penalty of Yggdrasil's impairment-amplified +retransmissions. +Mycelium degrades almost as badly (10.1\,s to 79.5\,s, an +8$\times$ increase): its overlay routing overhead compounds over +hundreds of sequential HTTP connections. % TODO: Hyprspace fails at Low but completes at Medium (170 s). -% This contradicts the "clean gradient" claim. Explain why a VPN +% This contradicts the "clean gradient" claim. +% Explain why a VPN % can fail at Low but succeed at Medium, or note the anomaly. -The failure map reveals a mostly clean gradient: more demanding -profiles knock out more VPNs. At Low, 10 of 11 complete (Hyprspace -fails). At Medium, 9 complete (though Hyprspace, which failed at -Low, completes at 170\,s). At High, only 3 survive (Headscale, -Nebula, Tinc). Internal's failure at High is the most surprising --- the -bare-metal baseline cannot sustain a multi-connection HTTP workload -under severe degradation, but Headscale, shielded by its userspace -TCP stack, can. Section~\ref{sec:tailscale_degraded} explains why. +The failure map shows a mostly clean gradient: more demanding +profiles knock out more VPNs. At Low, 10 of 11 +finish (Hyprspace +fails). At Medium, 9 finish, though Hyprspace, which had failed +at Low, completes here in 170\,s. At High, only Headscale, +Nebula, and Tinc survive. Internal's failure at High is the +surprising one: the bare-metal baseline cannot sustain a +multi-connection HTTP workload under severe degradation, while +Headscale's userspace TCP stack pulls it through. +Section~\ref{sec:tailscale_degraded} explains why. -\section{Tailscale Under Degraded Conditions} +\section{Tailscale under degraded conditions} \label{sec:tailscale_degraded} -\subsection{Observed Anomaly} +This section is about an observation that should not exist: +Headscale, a tunnelling VPN built on a kernel TCP stack and +\texttt{wireguard-go}, beats the bare-metal Internal baseline at +Medium impairment, and at Low impairment under parallel load +beats it by a factor of 2.6. The short answer turns out to be +different from the obvious answer, and we worked it out only by +chasing the obvious answer to its end. -At Medium impairment, Headscale delivers 41.5\,Mbps single-stream TCP -throughput --- 40\% more than Internal's 29.6\,Mbps. A VPN built -atop WireGuard outperforms the bare-metal connection it tunnels -through. The anomaly is consistent across benchmarks: -Table~\ref{tab:headscale_anomaly} summarizes the comparison. +\subsection{An anomaly worth pursuing} + +At Medium impairment, Headscale reaches 41.5\,Mbps on a single +TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for +the VPN over the direct host-to-host link it tunnels through. +Headscale costs the expected ${\sim}$14\,\% at baseline, and at +Low and High impairment it lags Internal by some margin. Yet at +Medium the order inverts, and not by a sliver: a 12\,Mbps gap on +a 30\,Mbps link is well above measurement noise. The same thing +happens, more dramatically, on the parallel TCP test, where +Headscale's 718\,Mbps at Low beats Internal's 277\,Mbps by a +factor of 2.6. Table~\ref{tab:headscale_anomaly} collects the +comparison. \begin{table}[H] \centering - \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment - (18.12.2025 run). 
For TCP benchmarks, higher is better. For + \caption{Headscale vs.\ Internal vs.\ WireGuard + under impairment + (18.12.2025 run). For TCP benchmarks, higher is + better. For Nix cache, lower is better; ``--'' marks a failed run.} \label{tab:headscale_anomaly} \begin{tabular}{llrrr} @@ -1463,236 +1791,507 @@ Table~\ref{tab:headscale_anomaly} summarizes the comparison. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png} - \caption{Single-stream TCP throughput for Internal, Headscale, and + \caption{Single-stream TCP throughput for Internal, + Headscale, and WireGuard across impairment profiles (log scale). Headscale - crosses above Internal at Medium impairment; WireGuard stays far + crosses above Internal at Medium impairment; + WireGuard stays far below both; all three converge at High.} \label{fig:headscale_vs_internal} \end{figure} -In parallel TCP at Low impairment, Headscale reaches 718\,Mbps vs.\ -Internal's 277\,Mbps (2.6$\times$). The Nix cache download at -Medium takes Headscale 48.8\,s vs.\ Internal's 58.6\,s (17\% -faster). At High impairment, Internal fails the Nix cache entirely -while Headscale completes in 219\,s. - -WireGuard, which shares Headscale's cryptographic layer, shows no -such advantage: 54.7\,Mbps at Low, 8.77\,Mbps at Medium. Whatever -protects Headscale is not the encryption or the tunnel --- it is -something in Tailscale's userspace networking stack. +WireGuard-the-kernel-module is the obvious sanity +check. It uses +the same Noise/WireGuard cryptographic protocol Tailscale ships +and is the closest available comparison without the rest of +Tailscale's stack. WireGuard shows none of Headscale's +advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well +below Internal at the same profile. So the encryption layer is +not the answer, and the basic UDP tunnel is not the answer. +Whatever Headscale is doing differently lives somewhere else in +the rest of Tailscale's implementation. % TODO: The Medium-impairment retransmit percentages (5.2\%, -% 2.4\%) are not in any table or figure. Add a retransmit rate -% table for impaired profiles or reference the data source. -The retransmit data provides the first clue. At Medium impairment, -WireGuard's retransmit rate is 5.2\% --- more than double Internal's -${\sim}$2.4\%. Headscale, despite being a VPN, matches Internal at -${\sim}$2.4\%. WireGuard uses the host kernel's TCP stack, which -treats reordered packets as losses and fires spurious retransmits; -Headscale's gVisor stack tolerates more reordering, so fewer -retransmissions are wasted on packets that were merely delayed. +% 2.4\%) are not in any table or figure. Add a retransmit +% rate table for impaired profiles or reference the data +% source. +The retransmit data narrows the search. At Medium, WireGuard's +TCP retransmit rate is 5.2\,\%, more than double Internal's +${\sim}$2.4\,\%. Headscale matches Internal at ${\sim}$2.4\,\% +even though it is a tunnelling VPN. Both Headscale and +bare-metal Internal run the same host kernel TCP stack at the +inner layer, so the asymmetry is not about a different TCP +implementation. It is about what the kernel TCP stack is being +asked to process: something on Headscale's path is suppressing +the spurious retransmits the kernel would otherwise fire under +\texttt{tc netem}-induced reordering, and WireGuard's path is +not. 
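+
+Because correlated \texttt{tc netem} reordering drives so much of
+what follows, it helps to see what such a discipline looks like in
+practice. The sketch below is illustrative only: the service name
+and the numeric values are placeholders in the spirit of the
+Medium profile, not the rig's actual profile definitions, and it
+is written as a NixOS oneshot unit purely to match the rest of the
+test setup.
+
+\begin{lstlisting}[language=Nix,caption={Illustrative only: a
+  correlated-impairment netem discipline expressed as a NixOS
+  oneshot service. The delay, loss, and reorder values are
+  placeholders, not the benchmark's profile definitions.},
+  label={lst:netem_sketch}]
+{ pkgs, ... }:
+{
+  systemd.services.impair-eth0 = {
+    wantedBy = [ "multi-user.target" ];
+    serviceConfig = {
+      Type = "oneshot";
+      RemainAfterExit = true;
+    };
+    # netem only reorders when a delay is configured; the second
+    # percentage after "loss" and "reorder" is the correlation,
+    # which is what makes the impairment arrive in bursts.
+    script = ''
+      ${pkgs.iproute2}/bin/tc qdisc replace dev eth0 root netem \
+        delay 20ms 5ms loss 1% 50% reorder 2.5% 50%
+    '';
+  };
+}
+\end{lstlisting}
+
+The correlation argument is the important part: it clusters the
+losses and reorderings into bursts rather than spreading them
+independently, which is exactly the pattern that trips the
+kernel's sequence-count loss detection in these results.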
-\subsection{Congestion Control Analysis} +\subsection{A plausible villain: Tailscale's gVisor stack} -Tailscale uses a userspace TCP/IP stack derived from Google's gVisor -(netstack). This stack does not inherit the host kernel's TCP -parameters. Three defaults differ from the Linux kernel in ways that -matter under packet reordering: +The candidate explanation we pursued first, and the one any +reading of the upstream Tailscale documentation will lead to, +is Tailscale's userspace TCP/IP stack. The Tailscale client +imports Google's gVisor netstack +(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses +it as an in-process TCP implementation. The gVisor +documentation is direct about why this matters: netstack is +designed for adverse networks where the host kernel's TCP +defaults are too aggressive. Tailscale's release notes go +further, calling out specific overrides on top of gVisor — the +most visible being an explicit RACK disable and 8\,MiB / 6\,MiB +receive and send buffers. + +Reading Tailscale's source confirms it. +\texttt{wgengine/netstack/netstack.go} contains the netstack +initialiser, and Listing~\ref{lst:tailscale_netstack_overrides} +reproduces the relevant overrides verbatim. RACK is disabled +(\texttt{TCPRecovery(0)}) with a comment pointing at +\texttt{tailscale/issues/9707}: ``gVisor's RACK performs +poorly. ACKs do not appear to be handled in a timely manner, +leading to spurious retransmissions and a reduced congestion +window.'' Reno is set explicitly with a comment pointing at +\texttt{gvisor/issues/11632}, an integer-overflow bug in +gVisor's CUBIC implementation. The TCP send and receive +buffer maxima are pushed up to 8\,MiB and 6\,MiB. SACK is +enabled (gVisor's default is off). + +\lstinputlisting[language=Go,caption={Tailscale's gVisor + netstack initialiser explicitly disables RACK, pins Reno as + the congestion control, and enlarges the TCP buffer maxima. + These overrides live inside + \texttt{wgengine/netstack/netstack.go}. +\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go} + +Read against the Linux kernel defaults — RACK on, CUBIC by +default, ${\sim}$1\,MiB receive and send buffers, +\texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these +overrides describe a TCP stack better suited to a lossy, +reordering link than the host kernel. The hypothesis writes +itself: Headscale's iPerf3 traffic is processed +by this gVisor +instance instead of by the host kernel TCP stack, and so it +inherits the more reordering-tolerant behaviour. +WireGuard-the-kernel-module shares only the cryptographic +protocol; it does not get the gVisor stack, and +therefore does +not get the advantage. + +It is a clean story. The natural way to test it +is to extract +the parameters Tailscale sets inside gVisor, apply their +nearest Linux equivalents to the bare-metal host as sysctls, +and see whether Internal — with no VPN at all — picks up the +same advantage. If it does, the gVisor explanation is +supported. If it does not, the hypothesis fails. + +\subsection{Reproducing the effect on bare metal} +\label{sec:tuned} + +We ran two follow-up benchmarks on the same hardware and +impairment setup as the original 18.12.2025 run. \begin{itemize} - \bitem{\texttt{tcp\_reordering}:} gVisor uses 10; the Linux kernel - defaults to~3. This parameter controls how many out-of-order - packets TCP tolerates before treating the event as a loss. 
With - tc~netem injecting 0.5--2.5\% reordering per machine, bursts of - 3+ reordered packets are frequent. The kernel's threshold of~3 - causes spurious fast retransmits and congestion window reductions - for packets that are merely reordered, not lost. - \bitem{\texttt{tcp\_recovery} (RACK):} gVisor disables it; the - Linux kernel enables it by default. RACK uses timing-based loss - detection that is more aggressive than the pure sequence-based - approach gVisor uses. Under reordering, RACK's timing heuristics - can falsely classify delayed packets as lost. - \bitem{\texttt{tcp\_early\_retrans} (TLP):} gVisor disables it; the - kernel enables it. Tail Loss Probe sends speculative retransmits - on idle connections, which can worsen congestion when the link is - already impaired. -\end{itemize} - -Under packet reordering, these three defaults compound. The Linux -TCP stack fires retransmits and cuts the congestion window far more -often than necessary; each false positive shrinks the window and -reduces throughput. Tailscale's gVisor stack tolerates more -reordering before reacting, so its congestion window stays larger and -throughput stays higher. - -% TODO: The claim that the anomaly "grows with impairment severity" is -% not fully supported. At High impairment, Headscale (4.21 Mbps) and -% Internal (4.25 Mbps) converge --- the anomaly vanishes rather than -% growing. The logic predicts continued divergence at High reordering -% (5% per machine), but the data shows both become loss-limited. -% Rephrase to say the anomaly emerges at Medium but disappears at High -% when absolute loss dominates. -This explains why the anomaly emerges as impairment increases. At -baseline, there is no reordering, so the threshold difference is -irrelevant and Internal's kernel-level processing advantage dominates. -As reordering increases from 0.5\% (Low) to 2.5\% (Medium) per -machine, the kernel's aggressive loss detection fires more often, and -the throughput gap shifts in Headscale's favor. At High impairment, -however, both converge to ${\sim}$4.2\,Mbps: the absolute packet loss -rate becomes the dominant bottleneck, overriding the reordering -tolerance advantage. - -\subsection{Tuned Kernel Parameters} - -Two follow-up benchmark runs applied Tailscale's gVisor TCP -parameters to the host kernel via sysctl: - -\begin{itemize} - \bitem{Full gVisor (27.02.2026):} All parameters --- + \bitem{Tailscale-style (27.02.2026):} \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, - \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes - (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max}, - \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard, - Tinc, and ZeroTier. - \bitem{Reorder-only (06.03.2026):} Only - \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and - \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel - defaults. Tested on Internal and Headscale only. + \texttt{tcp\_early\_retrans=0}, plus enlarged + buffer sizes + (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, + \texttt{rmem\_max}, + \texttt{wmem\_max}). Tested on Internal, Headscale, + WireGuard, Tinc, and ZeroTier. + \bitem{Reorder-only (06.03.2026):} Only + \texttt{tcp\_reordering=10}, + \texttt{tcp\_recovery=0}, and + \texttt{tcp\_early\_retrans=0}. Buffer sizes left at + kernel defaults. Tested on Internal and Headscale only. \end{itemize} -Table~\ref{tab:kernel_tuning_internal} shows how Internal responds -to the tuning. 
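+
+For reference, the reorder-only profile is small enough to spell
+out. The sketch below shows one way to express it as a NixOS host
+configuration using the standard \texttt{boot.kernel.sysctl}
+option; it is an illustration of the parameters listed above, not
+a reproduction of the benchmark hosts' actual module.
+
+\begin{lstlisting}[language=Nix,caption={A minimal sketch of the
+  reorder-only sysctl profile as a NixOS module. The three values
+  match the 06.03.2026 run; the Tailscale-style run additionally
+  enlarges \texttt{tcp\_rmem}, \texttt{tcp\_wmem},
+  \texttt{rmem\_max}, and \texttt{wmem\_max} (values not shown
+  here).},label={lst:reorder_only_sysctls}]
+{
+  # Relax the kernel's reordering-sensitive loss detection;
+  # buffer sizes stay at their defaults.
+  boot.kernel.sysctl = {
+    "net.ipv4.tcp_reordering" = 10;   # kernel default: 3
+    "net.ipv4.tcp_recovery" = 0;      # kernel default: 1 (RACK on)
+    "net.ipv4.tcp_early_retrans" = 0; # kernel default: 3 (TLP on)
+  };
+}
+\end{lstlisting}
+
+The same three keys can also be applied imperatively with
+\texttt{sysctl -w} for a single boot, which is enough to reproduce
+the comparison on any Linux host.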
Both follow-up runs used the same impairment profiles -and hardware as the original 18.12.2025 run. - \begin{table}[H] \centering \caption{Internal (no VPN) throughput across three kernel - configurations. ``Default'' is the 18.12.2025 run with stock + configurations. ``Default'' is the + 18.12.2025 run with stock Linux TCP parameters.} \label{tab:kernel_tuning_internal} \begin{tabular}{llrrr} \hline \textbf{Metric} & \textbf{Profile} & \textbf{Default} & - \textbf{Full gVisor} & \textbf{Reorder-only} \\ + \textbf{Tailscale-style} & \textbf{Reorder-only} \\ \hline - Single TCP (Mbps) & Baseline & 934 & 934 & 934 \\ - Single TCP (Mbps) & Low & 333 & 363 & 354 \\ - Single TCP (Mbps) & Medium & 29.6 & 64.2 & 72.7 \\ - Parallel TCP (Mbps) & Low & 277 & 893 & 902 \\ - Parallel TCP (Mbps) & Medium & 82.6 & 226 & 211 \\ - Retransmit \% & Medium & ${\sim}$2.4 & 1.21 & 1.11 \\ - Nix cache (s) & Medium & 58.6 & 29.7 & 29.1 \\ + Single TCP (Mbps) & Baseline & 934 & + 934 & 934 \\ + Single TCP (Mbps) & Low & 333 & + 363 & 354 \\ + Single TCP (Mbps) & Medium & 29.6 & + 64.2 & 72.7 \\ + Parallel TCP (Mbps) & Low & 277 & + 893 & 902 \\ + Parallel TCP (Mbps) & Medium & 82.6 & + 226 & 211 \\ + Retransmit \% & Medium & ${\sim}$2.4 + & 1.21 & 1.11 \\ + Nix cache (s) & Medium & 58.6 & + 29.7 & 29.1 \\ \hline \end{tabular} \end{table} - \begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png} - \caption{Internal (no VPN) single-stream TCP throughput across three + \caption{Internal (no VPN) single-stream TCP + throughput across three kernel configurations. Baseline is unchanged; at Medium - impairment, throughput jumps from 30 to 64 to 73\,Mbps as + impairment, throughput jumps from 30 to 64 to + 73\,Mbps as reordering tolerance increases.} \label{fig:kernel_tuning_comparison} \end{figure} -Internal's Medium-impairment throughput jumps from 29.6 to -72.7\,Mbps --- a 146\% increase from a three-line sysctl change. The -retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; over half -of the original retransmissions were spurious. The Nix cache download at -Medium halves from 58.6\,s to 29.1\,s. +The result felt like confirmation. Internal's +Medium-impairment throughput jumped from 29.6\,Mbps to +72.7\,Mbps under the reorder-only configuration — a 146\,\% +increase from a three-line sysctl change — and +the retransmit +rate at Medium dropped from ${\sim}$2.4\,\% to +1.11\,\%, which +means more than half of the original retransmissions were +spurious. The Nix cache download at Medium roughly halved, +from 58.6\,s to 29.1\,s. -Parallel TCP sees an even larger gain. Internal at Low impairment -climbs from 277 to 902\,Mbps, a 226\% increase that now exceeds -Headscale's original 718\,Mbps. % TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits the -% unresolved 6-vs-10 stream count from the baseline parallel test +Parallel TCP gained more. Internal at Low +climbed from 277 to +902\,Mbps, a 226\,\% increase that not only +exceeds Internal's +old single-stream best but actually overtakes Headscale's +original 718\,Mbps from the unmodified run. % +% TODO: DOWNSTREAM +% DEPENDENCY — "six concurrent flows" inherits +% the unresolved +% 6-vs-10 stream count from the baseline parallel test % description. Update when that TODO is resolved. -With six concurrent flows each -independently benefiting from the higher reordering threshold, the -aggregate improvement compounds. 
+Each of the six concurrent flows benefits independently from +the higher reordering threshold, and the gains compound. -% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in -% any table. Add a table showing Headscale's results from the -% follow-up runs alongside Internal's so readers can verify the -% reversal. -% TODO: "At every impairment level and benchmark" is a strong claim -% but only single-stream TCP at Medium and Nix cache at Medium are -% shown with both Internal and Headscale values. The Headscale tuned -% data is not in any table (see TODO above). Either add the full -% comparison table or weaken to "at the metrics shown." -The anomaly reverses. At the measured impairment levels and benchmarks, -tuned Internal now meets or exceeds Headscale. At Medium impairment: -Internal 72.7\,Mbps vs.\ Headscale 50.1\,Mbps (Internal 45\% ahead), -where the original result had Headscale 40\% ahead. The Nix cache -flips too: Internal completes in 29.1\,s vs.\ Headscale's 36.3\,s, -where the original had Headscale 17\% faster. +% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are +% not in any table. Add a table showing Headscale's results +% from the follow-up runs alongside Internal's so +% readers can +% verify the reversal. +Headscale itself, retested with the same sysctls, +gained more +modestly: +21\,\% at Medium and a small $-$5\,\% wobble at +Low. And the anomaly reversed entirely. At Medium, tuned +Internal reached 72.7\,Mbps against Headscale's 50.1\,Mbps — +a 45\,\% lead for Internal where the original run +had Headscale +40\,\% ahead. The Nix cache flipped the same way: Internal +completed in 29.1\,s against Headscale's 36.3\,s, where the +original had Headscale 17\,\% faster. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png} - \caption{Internal-to-Headscale speed-up factor before and after - kernel tuning. Values above 1.0 mean Internal is faster. At - Medium impairment, the ratio flips from 0.71$\times$ (Headscale + \caption{Internal-to-Headscale speed-up factor + before and after + kernel tuning. Values above 1.0 mean + Internal is faster. At + Medium impairment, the ratio flips from + 0.71$\times$ (Headscale ahead) to 1.45$\times$ (Internal ahead).} \label{fig:headscale_gap_reversal} \end{figure} -The reorder-only configuration (06.03) matches or exceeds the full -gVisor configuration (27.02) at most metrics; the two exceptions are -single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at -Medium (211 vs.\ 226\,Mbps), both within 7\%. Internal -reaches 72.7\,Mbps at Medium with reorder-only vs.\ 64.2\,Mbps with -full gVisor. % TODO: The "mild buffer bloat" explanation for full-gVisor being -% slightly slower than reorder-only is speculative. The difference -% (64.2 vs 72.7 Mbps) could be within run-to-run variance. Either -% test with more runs or present this as one possible explanation. -The enlarged buffer sizes appear unnecessary and may -introduce mild buffer bloat that partially offsets the reordering -benefit, though the difference could also reflect normal run-to-run -variance. The entire Headscale advantage is explained by three kernel -parameters: \texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and +The reorder-only configuration matched or exceeded the full +Tailscale-style configuration on most metrics. The two +exceptions were single-stream TCP at Low (354 +vs.\ 363\,Mbps) +and parallel TCP at Medium (211 vs.\ 226\,Mbps), both within +7\,\%. 
The enlarged buffer sizes did not help and may have +added mild buffer bloat that partially offset the reordering +benefit, though the gap could also be run-to-run variance. +Either way, the entire Headscale advantage on Internal +collapsed to three host-kernel sysctls: +\texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}. +At this point in the investigation the hypothesis seemed +settled. Tailscale's gVisor stack ships with +these overrides; +the bare-metal kernel ships with stricter defaults; matching +the kernel to gVisor reproduces the effect. Then we checked +which Tailscale code path the test rig was actually running. + +\subsection{The data path that was not there} + +In default mode — what anyone running \texttt{tailscale up} +on a Linux host gets — the Tailscale client creates a real +kernel TUN device, registers a route for the +Tailscale subnet +through it, and forwards inbound and outbound +packets through +that interface. An application like iPerf3 issues a +\texttt{connect} to the remote peer's Tailscale +IP. The host +kernel TCP stack handles the application TCP. The kernel +routes the resulting outbound packets to the TUN device. +\texttt{tailscaled} (with \texttt{wireguard-go} embedded) +reads them from the TUN, encrypts them, and sends them as +outer WireGuard UDP packets on the wire. The receiving side +reverses the process and writes the decrypted inner packets +back into its own TUN, where the host kernel TCP stack +delivers them to the iPerf3 server. + +In that path, gVisor netstack is never instantiated. The +netstack initialiser in +Listing~\ref{lst:tailscale_netstack_overrides} +only runs when +\texttt{tailscaled} is launched with +\texttt{--tun=userspace-networking}, a mode that has no +kernel TUN at all and is reachable only from processes +running inside \texttt{tailscaled} itself (Tailscale SSH, +Taildrop, the metric endpoint). External processes such as +iPerf3 cannot reach the Tailscale network in that mode. + +The test rig does not use that mode. +Listing~\ref{lst:nixos_tailscale} shows the relevant line of +the upstream NixOS \texttt{services.tailscale} module, which +assembles the daemon command line as +\texttt{tailscaled --tun +\$\{cfg.interfaceName\}~\dots}, with +no \texttt{userspace-networking} fall-back unless +the operator +explicitly sets \texttt{interfaceName = +"userspace-networking"}. +Listing~\ref{lst:rig_interface_name} shows what +the benchmark +suite's Headscale module sets the interface name to: +\texttt{ts-\$\{instanceName\}}, truncated to fifteen +characters. The two together resolve to +\texttt{tailscaled --tun ts-headscale} on every +test machine, +a real kernel TUN. gVisor netstack is unreachable from any +external benchmark traffic in this rig. + +\lstinputlisting[language=Nix,caption={The NixOS + \texttt{services.tailscale} module passes \texttt{--tun + \$\{interfaceName\}} as the daemon's TUN argument. There is + no \texttt{--tun=userspace-networking} fall-back unless the + user explicitly sets \texttt{interfaceName = "userspace-networking"}. +\textit{nixpkgs/nixos/modules/services/networking/tailscale.nix:158}},label={lst:nixos_tailscale}]{Listings/nixos_tailscale.nix} + +\lstinputlisting[language=Nix,caption={The + benchmark suite's + Headscale module sets \texttt{interfaceName} to a real kernel + TUN name (\texttt{ts-}, truncated to 15 characters). 
+ Combined with Listing~\ref{lst:nixos_tailscale}, this means + \texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale} + on every test machine. +\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix} + +The empirical fingerprint pins the same conclusion down without +source-code reading. Headscale itself gained +21\,\% at Medium +from the host-kernel sysctl tuning. If Headscale's iPerf3 +traffic were processed by gVisor netstack, host-kernel sysctls +would change nothing — they configure the host kernel TCP stack +and only the host kernel TCP stack. The fact that Headscale moves +measurably under those sysctls is direct evidence that +Headscale's application TCP runs on the host kernel stack, just +as Internal's does. + +The validation experiment was therefore validating something +other than the hypothesis it was supposed to validate. It was +confirming, very cleanly, that the Linux kernel's default +\texttt{tcp\_reordering=3} is too tight for the kind of bursty, +correlated reordering the Medium profile produces, and that +loosening it produces a large throughput gain on a kernel-TCP +data path. That part of the result stands. What does not stand +is the inference that the gain reproduces something Tailscale was +already doing in gVisor. For this benchmark, Tailscale is not in +the gVisor TCP business at all. + +\subsection{Where the advantage actually lives} + +The puzzle the investigation began with has not gone away. +Headscale starts at 41.5\,Mbps where Internal starts at +29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel +TCP stack. Whatever Headscale is doing — partially, weakly, but +reproducibly — is worth roughly twelve megabits per second on the +Medium profile, and it is not gVisor netstack. + +The +21\,\% sysctl gain for Headscale itself is also informative +about the size of the mechanism. If the gain were 0\,\%, +Headscale would already be doing the sysctls' work; if it were ++146\,\% like Internal's, Headscale would be doing nothing of its +own. The partial response says Headscale's mechanism produces an +effect similar in kind to the sysctls but smaller in size, and +that the two effects are not fully additive. + +Two features of the \texttt{wireguard-go} data-plane pipeline are +the most likely candidates, and both live on the kernel-TUN path +that Tailscale actually uses in the rig. + +The first is TUN TCP and UDP generic receive offload. Tailscale's +\texttt{tstun} wrapper enables both on the kernel TUN device on +Linux unless an environment knob disables them or a runtime probe +rejects the feature (Listing~\ref{lst:tstun_gro}). On the +receive side, this means \texttt{wireguard-go} decrypts a burst +of inbound WireGuard frames and then coalesces consecutive +in-order TCP segments belonging to the same flow into a single +super-segment before writing them back to the kernel TUN. On the +transmit side, it accepts GSO super-segments from the kernel TUN +read in the same way. The receiving kernel TCP stack therefore +sees fewer, larger segments per coalesced batch instead of $N$ +small ones, and the segment timing that survives to the kernel is +the timing of GRO batches rather than of individual on-the-wire +packets. Bare-metal Internal traffic has no equivalent path +because it does not pass through any user-space TUN at all. 
+ +\lstinputlisting[language=Go,caption={Tailscale enables TUN TCP + and UDP GRO on every Linux non-TAP \texttt{tailscaled} process + unless the operator disables them via environment knobs or a + kernel runtime probe rejects the feature. This is in the default + kernel-TUN data path; it is not gated on + \texttt{--tun=userspace-networking}. +\textit{tailscale/net/tstun/wrap\_linux.go:25--43}},label={lst:tstun_gro}]{Listings/tstun_gro.go} + +The second is the 7\,MiB outer-UDP socket buffer that +\texttt{magicsock} pins on the WireGuard UDP socket +(Listing~\ref{lst:magicsock_buffer}), using the ``force'' +\texttt{SO\_*BUFFORCE} variant where available so the value is +honoured even past \texttt{net.core.rmem\_max}. The host kernel +default is in the low hundreds of KiB. Under burst-correlated +impairment — Medium and High both use 50\,\% correlation, so +losses and reorderings cluster — this larger buffer absorbs +spikes in arrival rate that would otherwise overflow the kernel +UDP receive queue and surface as additional inner-TCP losses. +Internal has no such cushion on its incoming wire path. + +\lstinputlisting[language=Go,caption={\texttt{magicsock} pins the + outer WireGuard UDP socket's send and receive buffers to 7\,MiB + and uses \texttt{SetBufferSize} with the \texttt{SO\_*BUFFORCE} + (``force'') variant where available, so the value is honoured + even past \texttt{net.core.rmem\_max}. +\textit{tailscale/wgengine/magicsock/magicsock.go:86,3908--3913}},label={lst:magicsock_buffer}]{Listings/magicsock_buffer.go} + +% TODO: Neither of the two candidate mechanisms above is directly +% verified in this chapter. A targeted follow-up — for example +% tcpdump on the receiving \texttt{tailscale0} interface during a +% Medium-impairment iPerf3 run, with inter-arrival timing +% analysis — would distinguish their relative contributions and +% confirm the mechanism. The argument here is that they are the +% most plausible candidates consistent with the evidence, not +% measured causes. + +A third feature, batched UDP I/O, completes the picture without +changing it qualitatively. \texttt{wireguard-go} uses +\texttt{recvmmsg} and \texttt{sendmmsg} on the outer UDP socket +so a burst of WireGuard frames moves through a single system +call. This does not change \emph{whether} packets are reordered, +but it reduces per-packet timing jitter that the kernel might +otherwise interpret as additional reordering. + +Hyprspace cannot be used as a negative control for any of this. +It does import gVisor netstack, but only for its in-VPN +service-network feature, and the Hyprspace benchmark traffic goes +through a kernel TUN exactly like Headscale's +(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the +wireguard-go pipeline (TUN GRO and the 7\,MiB outer-UDP buffer), +not on whether gVisor handles their inner TCP. The gVisor angle +simply does not apply to either of them in this benchmark. + +The kernel-side picture closes the loop. Three host-kernel TCP +parameters dominate the bare-metal behaviour the benchmarks +expose. \texttt{net.ipv4.tcp\_reordering} (default 3) is the +number of out-of-order segments the kernel will tolerate before +declaring fast retransmit, and with \texttt{tc netem} injecting +0.5--2.5\,\% reordering per machine, bursts of several reordered +packets are frequent enough that the threshold is repeatedly +tripped on the bare-metal path. 
\texttt{net.ipv4.tcp\_recovery} +(default \texttt{1}, RACK enabled) adds time-based reordering +detection on top of the segment-count threshold, which compounds +the spurious retransmits when reordering is high. And +\texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail +Loss Probe enabled) fires speculative retransmits when +unacknowledged segments sit at the tail of a transmission window, +which interacts poorly with an already-impaired link. Loosening +any one of the three softens the kernel's loss detection on the +bare-metal path; loosening all three recovers most of the +throughput. The Headscale path reaches the same kernel TCP stack +but is already feeding it the GRO-coalesced, buffer-cushioned +stream described above, so the kernel's tight defaults fire less +often there to begin with. + +The same logic explains the anomaly's shape across profiles. At +baseline there is no reordering, so the kernel's tight +\texttt{tcp\_reordering} threshold never trips and Internal's +native kernel-stack speed wins. As reordering rises from 0.5\,\% +(Low) to 2.5\,\% (Medium) per machine, the kernel's loss +detection fires on the bare-metal path more often than on the +GRO-coalesced Headscale path, and the throughput gap shifts in +Headscale's favour. At High impairment, both converge to +${\sim}$4.2\,Mbps: absolute packet loss becomes the dominant +bottleneck, and reordering tolerance no longer matters. + % TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier % (11.5 Mbps) tuned values are not in any table. Add them to % Table~\ref{tab:kernel_tuning_internal} or a new table. -Other VPNs benefit less from the kernel tuning. WireGuard's Medium -throughput rises from 8.77 to 12.2\,Mbps (+39\%) and Tinc's from -5.53 to 11.5\,Mbps (+108\%). ZeroTier stays flat (12.0 to -11.5\,Mbps). The tuning helps the kernel TCP stack, but VPNs that -add their own encapsulation overhead and userspace processing have -independent bottlenecks that the sysctl parameters cannot remove. +Other VPNs respond unevenly to the same sysctl tuning. +WireGuard's Medium throughput rises from 8.77 to 12.2\,Mbps +(+39\,\%), Tinc's from 5.53 to 11.5\,Mbps (+108\,\%), and +ZeroTier stays flat (12.0 to 11.5\,Mbps). % TODO: The +% reading below — that VPNs which add their own encapsulation and +% userspace processing have bottlenecks the host kernel sysctls +% cannot touch — does not cleanly fit the data: Tinc (a fully +% userspace VPN) shows the largest gain (+108\,\%), larger than +% kernel-WireGuard's. A more complete explanation has to account +% for which TCP stack each VPN's application traffic actually +% traverses and which of those stacks the sysctls actually reach. +The intuitive reading is that VPNs which add their own +encapsulation and userspace processing have bottlenecks the host +kernel sysctls cannot touch, but Tinc's large gain shows the +picture is not that simple. -% TODO: Headscale tuned-run percentages (+21\%, $-$5\%) are not in -% any table. Also, the "compound delays" hypothesis is speculative -% --- no evidence is shown that double reordering tolerance causes -% compound delays. Either verify experimentally or weaken the claim. -Headscale itself gets modestly faster with kernel tuning (+21\% at -Medium) but slightly slower at Low impairment ($-$5\%). Its -userspace gVisor stack already optimizes for reordering tolerance. 
-When the kernel stack also increases its tolerance, the two layers of -tuning may interact suboptimally --- both independently delay -retransmits, which could cause compound delays on the -kernel-to-Headscale socket path. +The resilient finding from this section, the one that survives +regardless of which of the two Tailscale-side mechanisms turns +out to dominate, is not about Tailscale at all. It is about +Linux. The kernel's default \texttt{tcp\_reordering=3} threshold +is too tight for the kind of bursty, correlated reordering +\texttt{tc netem} produces at the Medium profile, and it costs +the bare-metal host more than half of its achievable throughput. +Three lines of \texttt{sysctl} repair it. The fix is portable to +any Linux host and entirely independent of any VPN. -% TODO: These sections are empty stubs but the chapter introduction -% (line 12--13) promises "findings from the source code analysis." -% Either write these sections or remove the promise from the intro. +The unresilient finding — the one that motivated us to write this +section in the first place — is that Tailscale's much-discussed +userspace TCP stack is, for the workload that exposed the +anomaly, sitting on the bench. The advantage we attributed to it +must come from a more ordinary place: the way +\texttt{wireguard-go} batches and coalesces packets between the +wire and the kernel TCP stack, and the larger UDP buffer it pins +on its outer socket. We were chasing the wrong hypothesis with +the right experiment, and the experiment turned out to be more +useful than the hypothesis. -\section{Source Code Analysis} +% TODO: These sections are empty stubs but the chapter +% introduction (line 12--13) promises "findings from the source +% code analysis." Either write these sections or remove the +% promise from the intro. -\subsection{Feature Matrix Overview} +\section{Source code analysis} -% Summary of the 131-feature matrix across all ten VPNs. +\subsection{Feature matrix overview} + +% Summary of the 108-feature matrix across all ten VPNs. % Highlight key architectural differences that explain % performance results. -\subsection{Security Vulnerabilities} +\subsection{Security vulnerabilities} % Vulnerabilities discovered during source code review. -\section{Summary of Findings} +\section{Summary of findings} -% Brief summary table or ranking of VPNs by key metrics. -% Save deeper interpretation for a Discussion chapter. +% Brief summary table or ranking of VPNs by key metrics. Save +% deeper interpretation for a Discussion chapter. diff --git a/Figures/baseline/tcp/TCP Throughput.png b/Figures/baseline/tcp/TCP Throughput.png index 49dce1c..a6bf25f 100644 Binary files a/Figures/baseline/tcp/TCP Throughput.png and b/Figures/baseline/tcp/TCP Throughput.png differ diff --git a/Listings/hyprspace_dispatch.go b/Listings/hyprspace_dispatch.go new file mode 100644 index 0000000..4231639 --- /dev/null +++ b/Listings/hyprspace_dispatch.go @@ -0,0 +1,26 @@ +} else if proto == 0x60 { + dstIP = net.IP(packet[24:40]) + if node.cfg.BuiltinAddr6.Equal(dstIP) { + continue + } else if serviceNet.NetworkRange.Contains(dstIP) { + // Are you TCP because your protocol is 6, or is your + // protocol 6 because you are TCP? 
+ if packet[6] == 0x06 { + port := uint16(packet[42])*256 + uint16(packet[43]) + if serviceNet.EnsureListener([16]byte(packet[24:40]), port) { + count, err := (*serviceNet.Tun).Write([][]byte{packet}, 0) + if count == 0 || err != nil { + logger.With(err).Error("Error writing to service-network tunnel") + } + } + } + continue + } +} +... +// Check route table for destination address. +route, found := node.cfg.FindRouteForIP(dstIP) +if found { + dst = route.Target.ID + go node.sendPacket(dst, packet, plen) +} diff --git a/Listings/hyprspace_netstack.go b/Listings/hyprspace_netstack.go new file mode 100644 index 0000000..8bf1676 --- /dev/null +++ b/Listings/hyprspace_netstack.go @@ -0,0 +1,21 @@ +// taken from https://git.zx2c4.com/wireguard-go/tree/tun/netstack/tun.go +// rev 2b73054b299aec80cbb064954001810d30ee2e3c +... +func CreateNetTUN(localAddresses, dnsServers []netip.Addr, mtu int) (tun.Device, *Net, error) { + opts := stack.Options{ + NetworkProtocols: []stack.NetworkProtocolFactory{ipv4.NewProtocol, ipv6.NewProtocol}, + TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol, udp.NewProtocol, icmp.NewProtocol6, icmp.NewProtocol4}, + HandleLocal: true, + } + dev := &netTun{ + ep: channel.New(1024, uint32(mtu), ""), + stack: stack.New(opts), + ... + } + sackEnabledOpt := tcpip.TCPSACKEnabled(true) // TCP SACK is disabled by default + tcpipErr := dev.stack.SetTransportProtocolOption(tcp.ProtocolNumber, &sackEnabledOpt) + if tcpipErr != nil { + return nil, nil, fmt.Errorf("could not enable TCP SACK: %v", tcpipErr) + } + ... +} diff --git a/Listings/hyprspace_sendpacket.go b/Listings/hyprspace_sendpacket.go new file mode 100644 index 0000000..540f1a3 --- /dev/null +++ b/Listings/hyprspace_sendpacket.go @@ -0,0 +1,31 @@ +type SharedStream struct { + Stream *network.Stream + Lock *sync.Mutex +} +... +// Inside the TUN-read loop: +if found { + dst = route.Target.ID + go node.sendPacket(dst, packet, plen) +} +... +func (node *Node) sendPacket(dst peer.ID, packet []byte, plen int) { + // Check if we already have an open connection to the destination peer. + ms, ok := node.activeStreams[dst] + if ok { + if func() bool { + ms.Lock.Lock() + defer ms.Lock.Unlock() + // Write out the packet's length to the libp2p stream to ensure + // we know the full size of the packet at the other end. + err := binary.Write(*ms.Stream, binary.LittleEndian, uint16(plen)) + if err == nil { + // Write the packet out to the libp2p stream. + _, err = (*ms.Stream).Write(packet[:plen]) + ... + } + ... + }() { return } + } + ... +} diff --git a/Listings/hyprspace_tun_linux.go b/Listings/hyprspace_tun_linux.go new file mode 100644 index 0000000..9a889bc --- /dev/null +++ b/Listings/hyprspace_tun_linux.go @@ -0,0 +1,15 @@ +// New creates and returns a new TUN interface for the application. +func New(name string, opts ...Option) (*TUN, error) { + // Setup TUN Config + cfg := water.Config{ + DeviceType: water.TUN, + } + cfg.Name = name + + // Create Water Interface + iface, err := water.New(cfg) + if err != nil { + return nil, err + } + ... +} diff --git a/Listings/magicsock_buffer.go b/Listings/magicsock_buffer.go new file mode 100644 index 0000000..6cec3cd --- /dev/null +++ b/Listings/magicsock_buffer.go @@ -0,0 +1,9 @@ +socketBufferSize = 7 << 20 +... 
+forceErr, portableErr := sockopts.SetBufferSize(pconn, direction, socketBufferSize) +if forceErr != nil { + logf("magicsock: [warning] failed to force-set UDP %v buffer size to %d: %v; using kernel default values (impacts throughput only)", direction, socketBufferSize, forceErr) +} +if portableErr != nil { + logf("magicsock: failed to set UDP %v buffer size to %d: %v", direction, socketBufferSize, portableErr) +} diff --git a/Listings/nixos_tailscale.nix b/Listings/nixos_tailscale.nix new file mode 100644 index 0000000..bda9f2d --- /dev/null +++ b/Listings/nixos_tailscale.nix @@ -0,0 +1 @@ +''"FLAGS=--tun ${lib.escapeShellArg cfg.interfaceName} ${lib.concatStringsSep " " cfg.extraDaemonFlags}"'' diff --git a/Listings/rig_interface_name.nix b/Listings/rig_interface_name.nix new file mode 100644 index 0000000..9912e9c --- /dev/null +++ b/Listings/rig_interface_name.nix @@ -0,0 +1,10 @@ +let + interface = lib.substring 0 15 "ts-${instanceName}"; +in +{ + services.tailscale = { + enable = true; + # Use the interface name for the tunnel + interfaceName = interface; + }; +} diff --git a/Listings/tailscale_netstack_overrides.go b/Listings/tailscale_netstack_overrides.go new file mode 100644 index 0000000..ad474ba --- /dev/null +++ b/Listings/tailscale_netstack_overrides.go @@ -0,0 +1,28 @@ +// values are biased towards higher throughput on high bandwidth-delay +// product paths, except on memory-constrained platforms. +tcpRXBufOpt := tcpip.TCPReceiveBufferSizeRangeOption{ + ... + Max: tcpRXBufMaxSize, +} +tcpipErr := ipstack.SetTransportProtocolOption(tcp.ProtocolNumber, &tcpRXBufOpt) +... +tcpTXBufOpt := tcpip.TCPSendBufferSizeRangeOption{ + ... + Max: tcpTXBufMaxSize, +} +tcpipErr = ipstack.SetTransportProtocolOption(tcp.ProtocolNumber, &tcpTXBufOpt) +... +sackEnabledOpt := tcpip.TCPSACKEnabled(true) // TCP SACK is disabled by default +tcpipErr := ipstack.SetTransportProtocolOption(tcp.ProtocolNumber, &sackEnabledOpt) +... +// See https://github.com/tailscale/tailscale/issues/9707 +// gVisor's RACK performs poorly. ACKs do not appear to be handled in a +// timely manner, leading to spurious retransmissions and a reduced +// congestion window. +tcpRecoveryOpt := tcpip.TCPRecovery(0) +tcpipErr = ipstack.SetTransportProtocolOption(tcp.ProtocolNumber, &tcpRecoveryOpt) +... +// gVisor defaults to reno at the time of writing. We explicitly set reno +// See https://github.com/google/gvisor/issues/11632 +renoOpt := tcpip.CongestionControlOption("reno") +tcpipErr = ipstack.SetTransportProtocolOption(tcp.ProtocolNumber, &renoOpt) diff --git a/Listings/tstun_gro.go b/Listings/tstun_gro.go new file mode 100644 index 0000000..3516af9 --- /dev/null +++ b/Listings/tstun_gro.go @@ -0,0 +1,22 @@ +// SetLinkFeaturesPostUp configures link features on t based on select TS_TUN_ +// environment variables and OS feature tests. Callers should ensure t is +// up prior to calling, otherwise OS feature tests may be inconclusive. 
+func (t *Wrapper) SetLinkFeaturesPostUp() { + if t.isTAP || runtime.GOOS == "android" { + return + } + if groDev, ok := t.tdev.(tun.GRODevice); ok { + if envknob.Bool("TS_TUN_DISABLE_UDP_GRO") { + groDev.DisableUDPGRO() + } + if envknob.Bool("TS_TUN_DISABLE_TCP_GRO") { + groDev.DisableTCPGRO() + } + err := probeTCPGRO(groDev) + if errors.Is(err, unix.EINVAL) { + groDev.DisableTCPGRO() + groDev.DisableUDPGRO() + t.logf("disabled TUN TCP & UDP GRO due to GRO probe error: %v", err) + } + } +} diff --git a/_typos.toml b/_typos.toml index 1c9e7b8..76bade6 100644 --- a/_typos.toml +++ b/_typos.toml @@ -4,7 +4,7 @@ extend-exclude = [ "**/value", "**.rev", "**/facter-report.nix", - "Chapters/Zusammenfassung.tex", + "**/Zusammenfassung.tex", "**/key.json", "pkgs/clan-cli/clan_lib/machines/test_suggestions.py", ] diff --git a/main.tex b/main.tex index 24edf20..836c9fd 100644 --- a/main.tex +++ b/main.tex @@ -62,6 +62,55 @@ \usepackage{tikz} \usetikzlibrary{shapes.geometric} \usepackage[edges]{forest} +\usepackage{listings} % Source code listings for evidence snippets +% Syntax-highlighting colors (xcolor is already loaded by the class file) +\definecolor{lstKeyword}{HTML}{0B5FA5} +\definecolor{lstComment}{HTML}{4B7B4D} +\definecolor{lstString}{HTML}{A31515} +\definecolor{lstNumber}{HTML}{707070} +\definecolor{lstBackground}{HTML}{F7F7F7} +\definecolor{lstFrame}{HTML}{C8C8C8} +\lstset{ + basicstyle=\ttfamily\footnotesize, + keywordstyle=\color{lstKeyword}\bfseries, + commentstyle=\color{lstComment}\itshape, + stringstyle=\color{lstString}, + numberstyle=\tiny\color{lstNumber}, + identifierstyle=\color{black}, + backgroundcolor=\color{lstBackground}, + rulecolor=\color{lstFrame}, + breaklines=true, + breakatwhitespace=false, + columns=fullflexible, + keepspaces=true, + showstringspaces=false, + frame=single, + framerule=0.4pt, + xleftmargin=0.5em, + xrightmargin=0.5em, + aboveskip=0.6em, + belowskip=0.6em, + captionpos=b, +} +\lstdefinelanguage{Nix}{ + morekeywords={with,let,in,inherit,rec,if,then,else,import,true,false,null}, + morecomment=[l]{\#}, + morestring=[b]", + sensitive=true, +} +\lstdefinelanguage{Go}{ + morekeywords={break,case,chan,const,continue,default,defer,else,fallthrough, + for,func,go,goto,if,import,interface,map,package,range,return,select, + struct,switch,type,var,bool,byte,complex64,complex128,error,float32, + float64,int,int8,int16,int32,int64,rune,string,uint,uint8,uint16,uint32, + uint64,uintptr,true,false,iota,nil,append,cap,close,complex,copy,delete, + imag,len,make,new,panic,print,println,real,recover}, + morecomment=[l]{//}, + morecomment=[s]{/*}{*/}, + morestring=[b]", + morestring=[b]`, + sensitive=true, +} \usepackage[backend=bibtex,style=numeric,natbib=true]{biblatex} % % Use the bibtex backend with the authoryear citation style (which diff --git a/treefmt.nix b/treefmt.nix index 1e5a60c..2871f75 100644 --- a/treefmt.nix +++ b/treefmt.nix @@ -18,6 +18,7 @@ settings.global.excludes = [ "AI_Data/**" "Figures/**" + "Chapters/Zusammenfassung.tex" ]; programs.typos = {