% Chapter Template \chapter{Results} % Main chapter title \label{Results} This chapter presents the results of the benchmark suite across all ten VPN implementations and the internal baseline. The structure follows the impairment profiles from ideal to degraded: Section~\ref{sec:baseline} establishes overhead under ideal conditions, then subsequent sections examine how each VPN responds to increasing network impairment. The chapter concludes with findings from the source code analysis. A recurring theme is that no single metric captures VPN performance; the rankings shift depending on whether one measures throughput, latency, retransmit behavior, or real-world application performance. \section{Baseline Performance} \label{sec:baseline} The baseline impairment profile introduces no artificial loss or reordering, so any performance gap between VPNs can be attributed to the VPN itself. Throughout the plots in this section, the \emph{internal} bar marks a direct host-to-host connection with no VPN in the path; it represents the best the hardware can do. On its own, this link delivers 934\,Mbps on a single TCP stream and a round-trip latency of just 0.60\,ms. WireGuard comes remarkably close to these numbers, reaching 92.5\,\% of bare-metal throughput with only a single retransmit across an entire 30-second test. Mycelium sits at the other extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal figure. A note on naming: ``Headscale'' in every table and figure of this chapter labels the test scenario in which the Tailscale client (\texttt{tailscaled}) connects to a self-hosted Headscale control server. The data plane is therefore the Tailscale client built on \texttt{wireguard-go}, not the Headscale binary itself, which is only a control-plane server. The test rig launches \texttt{tailscaled} via the NixOS \texttt{services.tailscale} module with \texttt{interfaceName = "ts-headscale"}, which translates to \texttt{--tun ts-headscale}; this means the Tailscale client uses a real kernel TUN device and the host kernel's TCP/IP stack handles every tunneled packet. The alternate \texttt{--tun=userspace-networking} mode, in which gVisor netstack terminates tunneled TCP inside the \texttt{tailscaled} process, is \emph{not} engaged in any of the benchmarks reported here. Statements below about ``Headscale'' running \texttt{wireguard-go} should be read as statements about the Tailscale client in this scenario. \subsection{Test Execution Overview} Running the full baseline suite across all ten VPNs and the internal reference took just over four hours. The bulk of that time, about 2.6~hours (63\,\%), was spent on actual benchmark execution; VPN installation and deployment accounted for another 45~minutes (19\,\%), and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come up after restarts. The remaining time was consumed by VPN service restarts and traffic-control (tc) stabilization. Figure~\ref{fig:test_duration} breaks this down per VPN. Most VPNs completed every benchmark without issues, but four failed one test each: Nebula and Headscale timed out on the qperf QUIC performance benchmark after six retries, while Hyprspace and Mycelium failed the UDP iPerf3 test with a 120-second timeout. Their individual success rate is 85.7\,\%, with all other VPNs passing the full suite (Figure~\ref{fig:success_rate}). 
\begin{figure}[H] \centering \begin{subfigure}[t]{1.0\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Average Test Duration per Machine}.png} \caption{Average test duration per VPN, including installation time and benchmark execution} \label{fig:test_duration} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{1.0\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark Success Rate}.png} \caption{Benchmark success rate across all seven tests} \label{fig:success_rate} \end{subfigure} \caption{Test execution overview. Hyprspace has the longest average duration due to UDP timeouts and long VPN connectivity waits. WireGuard completes fastest. Nebula, Headscale, Hyprspace, and Mycelium each fail one benchmark.} \label{fig:test_overview} \end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the averages. Three distinct performance tiers emerge, separated by natural gaps in the data.

\begin{table}[H] \centering \caption{Single-stream TCP throughput at baseline, sorted by throughput. Retransmits are averaged per 30-second test across all three link directions. The horizontal rules separate the three performance tiers.} \label{tab:tcp_baseline} \begin{tabular}{lrrr} \hline \textbf{VPN} & \textbf{Throughput (Mbps)} & \textbf{Baseline (\%)} & \textbf{Retransmits} \\ \hline Internal & 934 & 100.0 & 1.7 \\ WireGuard & 864 & 92.5 & 1 \\ ZeroTier & 814 & 87.2 & 1163 \\ Headscale & 800 & 85.6 & 102 \\ Yggdrasil & 795 & 85.1 & 75 \\ \hline Nebula & 706 & 75.6 & 955 \\ EasyTier & 636 & 68.1 & 537 \\ VpnCloud & 539 & 57.7 & 857 \\ \hline Hyprspace & 368 & 39.4 & 4965 \\ Tinc & 336 & 36.0 & 240 \\ Mycelium & 259 & 27.7 & 710 \\ \hline \end{tabular} \end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier, Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link. A middle tier (55--80\,\%) follows with Nebula, EasyTier, and VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier at under 40\,\% of baseline. Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

Raw throughput alone is incomplete, however. The retransmit column reveals that not all high-throughput VPNs get there cleanly. ZeroTier, for instance, reaches 814\,Mbps but accumulates 1\,163~retransmits per test, over 1\,000$\times$ what WireGuard needs. ZeroTier compensates for tunnel-internal packet loss by repeatedly triggering TCP congestion-control recovery, whereas WireGuard delivers data with negligible in-tunnel loss. The bare-metal Internal reference sits at 1.7~retransmits per test (essentially noise), and the VPNs split into three groups around it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale), \emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).

\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png} \caption{Average single-stream TCP throughput} \label{fig:tcp_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png}
\caption{TCP retransmit rate (\%)} \label{fig:tcp_retransmits} \end{subfigure} \caption{TCP throughput and retransmit behavior at baseline. The subfigure plots retransmits as a rate (\%); the running text and Table~\ref{tab:tcp_baseline} report per-test counts. WireGuard leads at 864\,Mbps with a single retransmit per test; Hyprspace accumulates nearly 5\,000. Retransmit behavior does not always track inversely with throughput: ZeroTier achieves high throughput \emph{despite} heavy retransmission.} \label{fig:tcp_results} \end{figure}

Retransmits have a direct mechanical relationship with TCP congestion control. Each retransmit triggers a reduction in the congestion window (\texttt{cwnd}), throttling the sender. This relationship is visible in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4\,965 retransmits, maintains the smallest max congestion window in the dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB window, the largest of any VPN. At first glance this suggests a clean inverse correlation between retransmits and congestion window size, but the picture is misleading. Yggdrasil's outsized window is largely an artifact of its jumbo overlay MTU (32\,731 bytes): each segment carries far more data, so the window in bytes is inflated relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing congestion windows across different MTU sizes is not meaningful without normalizing for segment size. What \emph{is} clear is that high retransmit rates force TCP to spend more time in congestion recovery than in steady-state transmission, capping throughput regardless of available bandwidth. ZeroTier illustrates the opposite extreme: brute-force retransmission can still yield high throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted bandwidth and unstable flow behavior.

VpnCloud stands out: its sender reports 538.8\,Mbps but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest in the dataset). This suggests significant in-tunnel packet loss or buffering at the VpnCloud layer that the retransmit count (857) alone does not fully explain.

Variability, whether stochastic across runs or systematic across links, also differs substantially. WireGuard's three link directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window), behaving almost identically. Mycelium's three directions span 122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise: Section~\ref{sec:mycelium_routing} shows the spread is per-link path-selection asymmetry, with one link finding a direct route and the other two routing through the global overlay. Either way, a VPN whose throughput varies that widely across links is harder to capacity-plan around than one that delivers a consistent figure on every direction.

\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png} \caption{Retransmits vs.\ throughput} \label{fig:retransmit_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png} \caption{Retransmits vs.\ max congestion window} \label{fig:retransmit_cwnd} \end{subfigure} \caption{Retransmit correlations (log scale on x-axis).
High retransmits do not always mean low throughput (ZeroTier: 1\,163 retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace: 4\,965 retransmits, 368\,Mbps). The apparent inverse correlation between retransmits and congestion window size is dominated by Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated by its 32\,KB jumbo overlay MTU rather than by low retransmits alone.} \label{fig:retransmit_correlations} \end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably. Table~\ref{tab:latency_baseline} lists the average ping round-trip times, which cluster into three distinct ranges.

\begin{table}[H] \centering \caption{Average ping RTT at baseline, sorted by latency} \label{tab:latency_baseline} \begin{tabular}{lr} \hline \textbf{VPN} & \textbf{Avg RTT (ms)} \\ \hline Internal & 0.60 \\ VpnCloud & 1.13 \\ Tinc & 1.19 \\ WireGuard & 1.20 \\ Nebula & 1.25 \\ ZeroTier & 1.28 \\ EasyTier & 1.33 \\ \hline Headscale & 1.64 \\ Hyprspace & 1.79 \\ Yggdrasil & 2.20 \\ \hline Mycelium & 34.9 \\ \hline \end{tabular} \end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal 0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms), yet its throughput tops out at only 539\,Mbps. Low per-packet latency does not guarantee high bulk throughput. A second group (Headscale, Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing moderate overhead. Then there is Mycelium at 34.9\,ms, so far removed from the rest that Section~\ref{sec:mycelium_routing} gives it a dedicated analysis.

% TODO: The max-RTT claim (8.6 ms) is not visible in the Average RTT
% plot. Add a max-RTT figure or table, or reference the raw data
% source.
ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any sub-2\,ms VPN. These spikes point to periodic control-plane interference that the average hides.

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png} \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a massive outlier at 58$\times$ the internal baseline. VpnCloud is the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).} \label{fig:ping_rtt} \end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms) but only the second-lowest throughput (336\,Mbps). Packets traverse the tunnel quickly, yet single-threaded userspace processing cannot keep up with the link speed. The qperf benchmark backs this up: Tinc maxes out at 14.9\,\% total system CPU while delivering just 336\,Mbps.
% TODO: Verify the single-saturated-core reading with per-thread CPU
% sampling or eBPF profiling; a whole-system percentage cannot
% distinguish one saturated core from several partially loaded ones.
On a multi-core system, this low percentage is consistent with a single saturated core (and Tinc is single-threaded), which would explain why the CPU rather than the network is the bottleneck.
The story is incomplete, however: VpnCloud shows the same 14.9\,\% total system CPU yet delivers 539\,Mbps, 60\,\% more than Tinc, so a difference in per-packet processing cost between the two implementations must also be in play. Figure~\ref{fig:latency_throughput} makes this disconnect easy to spot.

% TODO: These CPU numbers are stated inline but never shown in a plot
% or table. Add a CPU utilization figure or table so readers can
% verify.
The qperf measurements also reveal a wide spread in CPU usage. Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a kernel-level implementation, presumably due to in-kernel cryptographic processing; we present no profiling data that would confirm this. On the efficient end, VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least CPU time. Nebula and Headscale are missing from this comparison because qperf failed for both. %TODO: Explain why they consistently failed

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png} \caption{Latency vs.\ throughput at baseline. Each point represents one VPN. The quadrants reveal different bottleneck types: VpnCloud (low latency, moderate throughput), Tinc (low latency, low throughput, CPU-bound), Mycelium (high latency, low throughput, overlay routing overhead).} \label{fig:latency_throughput} \end{figure}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time.
% TODO: The plot labels this benchmark "10-stream parallel" but this
% description says "six unidirectional flows." Verify the actual test
% configuration and reconcile the two.
The parallel benchmark changes this setup: all three link directions (lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom) run simultaneously in a circular pattern for 60~seconds, each carrying one bidirectional TCP stream (six unidirectional flows in total). Because three link pairs now carry traffic at once, the aggregate throughput naturally exceeds that of any single direction; even Internal reaches 1.50$\times$ its single-stream figure. The scaling factor (parallel throughput divided by single-stream throughput) therefore captures two effects: the benefit of engaging multiple link pairs in parallel, and how well the VPN handles the resulting contention for shared tunnel resources. Table~\ref{tab:parallel_scaling} lists the results.

\begin{table}[H] \centering \caption{Parallel TCP scaling at baseline. Scaling factor is the ratio of parallel to single-stream throughput.
Internal's 1.50$\times$ represents the expected scaling on this hardware.} \label{tab:parallel_scaling} \begin{tabular}{lrrr} \hline \textbf{VPN} & \textbf{Single (Mbps)} & \textbf{Parallel (Mbps)} & \textbf{Scaling} \\ \hline Mycelium & 259 & 569 & 2.20$\times$ \\ Hyprspace & 368 & 803 & 2.18$\times$ \\ Tinc & 336 & 563 & 1.68$\times$ \\ Yggdrasil & 795 & 1265 & 1.59$\times$ \\ Headscale & 800 & 1228 & 1.54$\times$ \\ Internal & 934 & 1398 & 1.50$\times$ \\ ZeroTier & 814 & 1206 & 1.48$\times$ \\ WireGuard & 864 & 1281 & 1.48$\times$ \\ EasyTier & 636 & 927 & 1.46$\times$ \\ VpnCloud & 539 & 763 & 1.42$\times$ \\ Nebula & 706 & 648 & 0.92$\times$ \\ \hline \end{tabular} \end{table}

The VPNs that gain the most are those most constrained in single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream can never fill the pipe: the bandwidth-delay product demands a window larger than any single flow maintains, so multiple concurrent flows compensate for that constraint and push throughput to 2.20$\times$ the single-stream figure (a worked calculation follows below).

Hyprspace scales almost as well (2.18$\times$) for the same reason but with a different bottleneck. Its libp2p send pipeline accumulates roughly 2\,800\,ms of under-load latency (Section~\ref{sec:hyprspace_bloat}), which gives any single TCP flow a bandwidth-delay product on the order of hundreds of megabytes to fill, far beyond any single kernel \texttt{cwnd}. And because Hyprspace keys \texttt{activeStreams} by destination \texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three concurrent peer pairs in the parallel benchmark each get their own libp2p stream, their own mutex, and their own yamux flow-control window. The three TCP senders therefore maintain three independent windows in flight, and three windows fill more of the bloated pipeline than one can. This mechanism is grounded in the per-peer \texttt{SharedStream} structure verified in Listing~\ref{lst:hyprspace_sendpacket}, but it remains a hypothesis: neither the per-flow window evolution nor the actual under-load latency has been measured directly. A tcpdump of one Hyprspace iPerf3 run with inter-arrival timing analysis would settle it.

Tinc picks up a 1.68$\times$ boost because several streams can collectively keep its single-threaded CPU busy during what would otherwise be idle gaps in a single flow. WireGuard and Internal both scale cleanly at around 1.48--1.50$\times$, suggesting that WireGuard's overhead is a fixed per-packet cost that does not worsen under multiplexing.

Nebula is the only VPN that actually gets \emph{slower} with more streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$) while retransmits jump from 955 to 2\,462. The streams are evidently fighting each other for resources inside the tunnel. More streams also amplify existing retransmit problems. Hyprspace climbs from 4\,965 to 17\,426~retransmits; VpnCloud from 857 to 6\,023. VPNs that were clean in single-stream mode stay clean under load, while the stressed ones only get worse.
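The bandwidth-delay arithmetic behind both cases can be made concrete with the numbers already reported; the in-flight window $W$ needed to sustain bandwidth $B$ over round-trip time $\mathrm{RTT}$ is $W = B \cdot \mathrm{RTT}$ (a back-of-the-envelope sketch; the 2\,800\,ms Hyprspace figure is itself an indirect under-load measurement). For Mycelium, filling the bare-metal link through a 34.9\,ms RTT would require
\[
W \approx \frac{934 \times 10^{6}\ \mathrm{b/s} \times 0.0349\ \mathrm{s}}{8} \approx 4.1\ \mathrm{MB}
\]
in flight, close to the largest (and MTU-inflated) \texttt{cwnd} observed anywhere in the dataset, so a single flow realistically never gets there. For Hyprspace, sustaining even its own 368\,Mbps through a 2.8\,s pipeline would require
\[
W \approx \frac{368 \times 10^{6}\ \mathrm{b/s} \times 2.8\ \mathrm{s}}{8} \approx 129\ \mathrm{MB},
\]
two orders of magnitude beyond any kernel \texttt{cwnd}. Splitting the load across three independent flows attacks exactly this limit.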
\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png} \caption{Single-stream vs.\ parallel throughput} \label{fig:single_vs_parallel} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png} \caption{Parallel TCP scaling factor} \label{fig:scaling_factor} \end{subfigure} \caption{Parallel TCP scaling at baseline. Nebula is the only VPN where parallel throughput is lower than single-stream (0.92$\times$). Mycelium and Hyprspace benefit most from parallelism ($>$2$\times$), compensating for latency and buffer bloat respectively. The dashed line at 1.0$\times$ marks the break-even point.} \label{fig:parallel_tcp} \end{figure}

\subsection{UDP Stress Test}

The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}), which is a deliberate overload test rather than a realistic workload. The sender throughput values are artifacts: they reflect how fast the sender can write to the socket, not how fast data traverses the tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender throughput because it uses a 32,731-byte block size (a jumbo-frame overlay MTU), inflating the apparent rate per \texttt{send()} system call. Only the receiver throughput is meaningful.

\begin{table}[H] \centering \caption{UDP receiver throughput and packet loss at baseline (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out at 120 seconds and are excluded.} \label{tab:udp_baseline} \begin{tabular}{lrr} \hline \textbf{VPN} & \textbf{Receiver (Mbps)} & \textbf{Loss (\%)} \\ \hline Internal & 952 & 0.0 \\ WireGuard & 898 & 0.0 \\ Nebula & 890 & 76.2 \\ Headscale & 876 & 69.8 \\ EasyTier & 865 & 78.3 \\ Yggdrasil & 852 & 98.7 \\ ZeroTier & 851 & 89.5 \\ VpnCloud & 773 & 83.7 \\ Tinc & 471 & 89.9 \\ \hline \end{tabular} \end{table}

The test also proved crash-prone across implementations, which limits its reliability as a precise measurement; its real value is as an indicator of how each VPN copes with traffic that behaves differently from what its authors anticipated.

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at the kernel level with proper backpressure that matches sender to receiver rate. Every other VPN shows massive loss (69--99\,\%) because the sender overwhelms the tunnel's userspace processing capacity. Headscale shares WireGuard's cryptographic protocol but, contrary to intuition, does not share its kernel datapath: Tailscale's \texttt{magicsock} layer intercepts every packet to handle endpoint selection and DERP relay, which is incompatible with the in-kernel WireGuard module. Headscale therefore runs \texttt{wireguard-go} entirely in userspace, and the unbounded \texttt{-b~0} flood overruns that userspace pipeline just as it overruns every other userspace implementation, producing 69.8\,\% loss despite the WireGuard branding.

Yggdrasil's 98.7\,\% loss is the most extreme: it sends the most data (due to its large block size) but loses almost all of it. The loss column and the inflated sender rate are two views of the same arithmetic: $1 - 852/63\,744 \approx 98.7\,\%$. These loss rates do not reflect real-world UDP behavior but reveal which VPNs implement effective flow control. Hyprspace and Mycelium could not complete the UDP test at all, timing out after 120 seconds.
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP payload size (the payload iPerf3 derives from the socket MSS after tunnel overhead, not the path MTU itself): Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at 2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at 1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the smallest). These differences affect fragmentation behavior under real workloads, particularly for protocols that send large datagrams. QUIC is a case in point: it requires the path to deliver 1\,200-byte UDP datagrams without fragmentation, a minimum that Headscale's 1\,208-byte effective payload clears by only 8~bytes. All of these figures reflect each VPN's default configuration; the benchmarks deliberately leave defaults untouched to mirror real-world use, since most users never change them, and shipping good defaults is as much a part of good software as shipping the features themselves.

\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Throughput}.png} \caption{UDP receiver throughput} \label{fig:udp_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Packet Loss}.png} \caption{UDP packet loss} \label{fig:udp_loss} \end{subfigure} \caption{UDP stress test results at baseline (\texttt{-b 0}, unlimited sender rate). Internal and WireGuard are the only implementations with 0\% loss. Hyprspace and Mycelium are excluded due to 120-second timeouts.} \label{fig:udp_results} \end{figure}

% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a VPN performs under realistic traffic. This subsection switches to application-level workloads: downloading packages from a Nix binary cache and streaming video over RIST. Both interact with the VPN tunnel the way real software does, through many short-lived connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.} This test downloads a fixed set of Nix packages through each VPN and measures the total transfer time. The results (Table~\ref{tab:nix_cache}) compress the throughput hierarchy considerably: even Hyprspace, the worst performer, finishes in 11.92\,s, only 40\,\% slower than bare metal. Once connection setup, TLS handshakes, and HTTP round-trips enter the picture, throughput differences between 500 and 900\,Mbps matter far less than per-connection latency.

\begin{table}[H] \centering \caption{Nix binary cache download time at baseline, sorted by duration. Overhead is relative to the internal baseline (8.53\,s).} \label{tab:nix_cache} \begin{tabular}{lrr} \hline \textbf{VPN} & \textbf{Mean (s)} & \textbf{Overhead (\%)} \\ \hline Internal & 8.53 & -- \\ Nebula & 9.15 & +7.3 \\ ZeroTier & 9.22 & +8.1 \\ VpnCloud & 9.39 & +10.0 \\ EasyTier & 9.39 & +10.1 \\ WireGuard & 9.45 & +10.8 \\ Headscale & 9.79 & +14.8 \\ Tinc & 10.00 & +17.2 \\ Mycelium & 10.07 & +18.1 \\ Yggdrasil & 10.59 & +24.2 \\ Hyprspace & 11.92 & +39.7 \\ \hline \end{tabular} \end{table}

Several rankings invert relative to raw throughput. ZeroTier finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite 6\,\% fewer raw Mbps and 1\,000$\times$ more retransmits. Yggdrasil is the clearest example: its 795\,Mbps is the fourth-highest throughput among the VPNs, yet it lands at 24\,\% overhead because its 2.2\,ms latency adds up over the many small sequential HTTP requests that constitute a Nix cache download.
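A crude cost model makes the inversion plausible. Treat a cache download as $n$ sequential request/response exchanges moving $S$ bytes in total over a tunnel with bandwidth $B$ and round-trip time $\mathrm{RTT}$ (deliberately simplified: it ignores TLS handshakes, pipelining, and parallel downloads):
\[
T \;\approx\; n \cdot \mathrm{RTT} \;+\; \frac{S}{B}.
\]
With $B$ in the hundreds of megabits per second, the bulk term $S/B$ is nearly identical across the top two tiers, so the $n \cdot \mathrm{RTT}$ term drives the differences. Yggdrasil pays 2.20\,ms per exchange where Nebula pays 1.25\,ms; multiplied over the many requests of a cache download, that per-exchange gap, not the 89\,Mbps throughput difference, is what separates their finishing times.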
Figure~\ref{fig:throughput_vs_download} confirms this weak link between raw throughput and real-world download speed.

\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache Mean Download Time}.png} \caption{Nix cache download time per VPN} \label{fig:nix_cache} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png} \caption{Raw throughput vs.\ download time} \label{fig:throughput_vs_download} \end{subfigure} \caption{Application-level download performance. The throughput hierarchy compresses under real HTTP workloads: the worst VPN (Hyprspace, 11.92\,s) is only 40\% slower than bare metal. Throughput explains some variance but not all: Yggdrasil (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s) because latency matters more for HTTP workloads.} \label{fig:nix_download} \end{figure}

\paragraph{Video streaming (RIST).} At 3.3\,Mbps, the RIST video stream sits well within every VPN's throughput budget. The test therefore measures something else: how well each VPN handles real-time UDP delivery under steady load. Most VPNs pass without incident. Eight deliver 100\,\% quality, Nebula sits just below at 99.8\,\%, and Hyprspace's headline figure of 100\,\% conceals a separate failure mode discussed below. The 14--16 dropped frames that appear uniformly across every run, including Internal, are most likely encoder warm-up artefacts rather than tunnel overhead, though we have not verified this directly.

% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the clear failure. Its mean quality is 13.1\,\%, and each test interval drops 288 packets. The degradation is sustained rather than bursty: median quality is 10\,\%, and the interquartile range of dropped packets is a narrow 255--330. The qperf benchmark also fails outright for Headscale at baseline, which rules out a bulk-TCP explanation. Something in the real-time path is broken.

The failure is unexpected: Headscale builds on WireGuard, which handles video without trouble, and Headscale's own TCP throughput puts it in Tier~1. The common factor is UDP, since RIST runs over UDP and qperf exercises UDP paths as well. Two candidate culprits stand out. The first is Tailscale's DERP relay or NAT traversal layer interfering with the real-time path. The second is fragmentation: Headscale's effective UDP payload size is 1\,208~bytes, the smallest in the dataset, so RIST packets larger than this would be fragmented, and fragment reassembly under sustained load could produce exactly the steady, uniform drop pattern the data shows. Both are hypotheses rather than confirmed causes; a packet capture would be needed to distinguish them. Either way, the result disqualifies Headscale from video conferencing, VoIP, or any other real-time media workload, regardless of TCP throughput.

% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
% percentiles all zero) are not visible in the RIST Quality bar chart.
% Add a distribution plot or note in the caption that the bar
% chart hides this variance.
Hyprspace fails differently. Its average quality reads 100\,\%, but the raw drop counts underneath are unstable: mean packet drops of 1\,194 and a maximum spike of 55\,500. The 25th, 50th, and 75th percentiles are all zero, so most runs deliver perfectly while a small number suffer catastrophic bursts.
RIST's error-recovery machinery absorbs most of these events, but the worst spikes overwhelm it entirely.

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/Video Streaming/RIST Quality}.png} \caption{RIST video streaming quality at baseline. Headscale at 13.1\% average quality is the clear outlier; every other VPN achieves at least 99.8\%, with Nebula's 99.8\% the only (minor) degradation. The video bitrate (3.3\,Mbps) is well within every VPN's throughput capacity, so this test reveals real-time UDP handling quality rather than bandwidth limits.} \label{fig:rist_quality} \end{figure}

\subsection{Operational Resilience}

Sustained-load performance does not predict recovery speed. How quickly a tunnel comes up after a reboot, and how reliably it reconverges, matters as much as peak throughput for operational use.

% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
% are not shown in any figure or table. Either add a figure or
% scrap this paragraph.
First-time connectivity spans a wide range. Headscale and WireGuard are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud (10--14\,s) spend seconds negotiating with their control planes before passing traffic.

Reboot reconnection rearranges the rankings. Hyprspace, the worst performer under sustained TCP load, recovers in just 8.7~seconds on average, faster than any other VPN. WireGuard and Nebula follow at 10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, 10.07\,s across its three nodes, an exact match for Nebula's \texttt{HostUpdateNotification} interval, whose default is 10~seconds in the lighthouse protocol (configurable, but the benchmarks use the default). After a reboot, a node must wait until the next periodic update before its lighthouses learn its new endpoint, so the reconnection time tracks the timer rather than any topology-dependent convergence.

Mycelium sits at the opposite end, needing 76.6~seconds and showing the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s), suggesting a fixed protocol-level wait built into the overlay. Yggdrasil produces the most lopsided result in the dataset: its yuki node is back in 7.1~seconds while lom and luna take 94.8 and 97.3~seconds respectively. The gap likely reflects the overlay's spanning-tree rebuild: Yggdrasil organises its overlay as a spanning tree and routes using coordinates derived from each node's position in that tree, so a rebooted node can forward traffic only once it has re-established its position. A node near the root of the tree reconverges quickly, while one further out has to wait for the topology to propagate.

\begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png} \caption{Average reconnection time per VPN} \label{fig:reboot_bar} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png} \caption{Per-node reconnection time heatmap} \label{fig:reboot_heatmap} \end{subfigure} \caption{Reboot reconnection time at baseline. The heatmap reveals Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\ 95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s across all nodes). Hyprspace reconnects fastest (8.7\,s average) despite its poor sustained-load performance.} \label{fig:reboot_reconnection} \end{figure}

\subsection{Pathological Cases} \label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot explain.
The following paragraphs piece together observations from earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.} \label{sec:hyprspace_bloat}
% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency under load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP load, that number balloons to roughly 2\,800\,ms, a 1\,556$\times$ increase. This is not the network becoming congested; it is the VPN tunnel itself filling up with buffered packets and refusing to drain.

The consequences ripple through every TCP metric. With 4\,965 retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than steady-state transfer, shrinking the max congestion window to 205\,KB, the smallest in the dataset. Under parallel load the situation worsens: retransmits climb to 17\,426. The buffering even inverts iPerf3's measurements: the receiver reports 419.8\,Mbps while the sender sees only 367.9\,Mbps. Normally the sender-side figure is the higher one; the likeliest explanation is that massive ACK delays cause the sender-side timer to undercount the actual data rate, though confirming this would require packet captures. The UDP test never finished at all, timing out at 120~seconds.

What prevents Hyprspace from being entirely unusable is everything \emph{except} sustained load. It has the fastest reboot reconnection in the dataset (8.7\,s) and delivers 100\,\% video quality outside of its burst events. The pathology is narrow but severe: any continuous data stream saturates the tunnel's internal buffers.

Hyprspace does import gVisor netstack, but reading the source confirms that the gVisor TCP stack sits exclusively behind the in-VPN ``service network'' feature. Regular tunnel traffic uses an ordinary kernel TUN device created through the \texttt{songgao/water} library, and the forwarding loop in \texttt{node/node.go} only diverts a packet into the gVisor stack when its destination falls inside the \texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4 protocol is TCP; everything else is shipped verbatim over a libp2p stream and written back into the receiving peer's kernel TUN. Listings~\ref{lst:hyprspace_kernel_tun}, \ref{lst:hyprspace_dispatch}, and \ref{lst:hyprspace_netstack} show the relevant code in the upstream Hyprspace tree.

\lstinputlisting[language=Go,caption={Hyprspace creates a real kernel TUN via \texttt{songgao/water}; this is the device every peer-to-peer packet traverses. \textit{hyprspace/tun/tun\_linux.go:14--36}},label={lst:hyprspace_kernel_tun}]{Listings/hyprspace_tun_linux.go}

\lstinputlisting[language=Go,caption={The IPv6 dispatch in the Hyprspace forwarding loop only diverts to the gVisor service-network TUN when the destination matches the \texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4 protocol byte is \texttt{0x06} (TCP); every other packet is left on the kernel TUN path and forwarded over libp2p.
\textit{hyprspace/node/node.go:255--283}},label={lst:hyprspace_dispatch}]{Listings/hyprspace_dispatch.go} \lstinputlisting[language=Go,caption={Hyprspace's gVisor netstack initialiser only enables TCP SACK; there is no \texttt{TCPRecovery} override (RACK stays at gVisor's default), no congestion-control override, and no buffer-size override. The text in \texttt{tun.go} also notes the file is taken verbatim from wireguard-go. \textit{hyprspace/netstack/tun.go:6--80}},label={lst:hyprspace_netstack}]{Listings/hyprspace_netstack.go} Since the benchmark targets the regular Hyprspace IPv4/IPv6 addresses rather than service-network proxies, both endpoints rely on their host kernel's TCP stack for the entire transfer. Whatever options Hyprspace's gVisor instance might set internally — congestion control, loss recovery, buffer sizes — are therefore irrelevant to these measurements; the inner TCP state machine the kernel runs is the only one in the path. The same caveat applies more sharply to Tailscale, where the upstream documentation talks about an in-process gVisor TCP stack but the benchmark traffic never reaches it; that case is the subject of Section~\ref{sec:tailscale_degraded}. If gVisor is out of scope, the buffer bloat must originate further up the Hyprspace stack instead. The most plausible source is the libp2p / yamux stream layer through which raw IP packets are funnelled. Hyprspace's TUN-read loop dispatches each outbound packet on its own goroutine, and every such goroutine ends up in \texttt{node/node.go}'s \texttt{sendPacket}, which keeps exactly one libp2p stream per destination peer in \texttt{activeStreams} and guards it with a single per-peer \texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}). Concurrent application TCP flows to the same Hyprspace neighbour therefore serialise behind that one lock: the parallel iPerf3 test, which opens multiple TCP connections to the same peer at once, collapses to a single send pipeline at this layer. Each goroutine waiting for the lock pins its own 1420-byte packet buffer, and the underlying yamux session adds a per-stream flow-control window on top. None of this is visible to the kernel TCP sender that produced the inner segments — the kernel sees only that the TUN write returned — so it keeps growing its congestion window while the libp2p layer falls further behind. The geometry is the textbook one for buffer bloat: a fast producer (kernel TCP) sitting upstream of a slow, serialised consumer (the single yamux stream per peer) with no flow-control signal coupling the two. \lstinputlisting[language=Go,caption={Hyprspace's outbound fast path keeps exactly one libp2p stream per destination peer in \texttt{activeStreams} and guards it with a per-peer \texttt{sync.Mutex} held inside the \texttt{SharedStream} record. The TUN-read loop spawns a fresh goroutine per packet (\texttt{node.go:282}); each one calls \texttt{sendPacket} and takes \texttt{ms.Lock} for the duration of the libp2p stream write, so concurrent application TCP flows to the same Hyprspace neighbour are serialised behind a single mutex. \textit{hyprspace/node/node.go:36--39, 282, 328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go} \paragraph{Mycelium: Routing Anomaly.} \label{sec:mycelium_routing} Mycelium's 34.9\,ms average latency appears to be the cost of routing through a global overlay. 
The per-path numbers, however, reveal a bimodal distribution: \begin{itemize} \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable to Headscale at 1.64\,ms) \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed) \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed) \end{itemize}

One of the three links has found a direct route; the other two still bounce through the overlay. All three machines sit on the same physical network, so either Mycelium's path discovery is not consistently selecting the direct route, or the overlay detour is by-design behaviour: Mycelium is built as a global overlay and may deliberately route through supernodes even on a LAN. Either reading makes this a more specific phenomenon than blanket overlay overhead.

Throughput splits just as unevenly, but in the \emph{opposite} direction: the overlay-routed yuki$\rightarrow$luna path reaches 379\,Mbps while the direct, low-latency luna$\rightarrow$lom path manages only 122\,Mbps, a 3:1 gap that inverts what TCP theory predicts from the measured RTTs. We have no verified explanation for why the direct path underperforms the overlay paths. In bidirectional mode, the reverse direction on that worst link drops further to 58.4\,Mbps, the lowest single-direction figure in the entire dataset.

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png} \caption{Per-link TCP throughput for Mycelium, showing extreme path asymmetry. The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps) and worst (luna$\rightarrow$lom, 122\,Mbps) links inverts the latency split: the nominally direct, low-latency link is the slowest, for reasons that remain unverified (Section~\ref{sec:mycelium_routing}).} \label{fig:mycelium_paths} \end{figure}

% TODO: TTFB (93.7 ms vs. 16.8 ms) and connection-establishment
% (47.3 ms) numbers are from qperf but not shown in any figure. Add a
% connection-setup latency table or plot, and state Internal's
% connection-establishment time so the 3x overhead can be verified.
The overlay penalty shows up most clearly at connection setup. Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's 16.8\,ms, a 5.6$\times$ overhead), and connection establishment alone costs 47.3\,ms (3$\times$ overhead). Every new connection incurs that overhead, so workloads dominated by short-lived connections accumulate it rapidly. Bulk downloads, by contrast, amortize it: the Nix cache test finishes only 18\,\% slower than Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase begins, per-connection latency fades into the background.
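To see how quickly the per-connection cost accumulates, consider a workload that opens $n$ short-lived connections in sequence. Using the measured time-to-first-byte averages (a rough illustration that assumes strictly sequential connections and no keep-alive reuse):
\[
\Delta t \;\approx\; n \times (93.7 - 16.8)\,\mathrm{ms} \;=\; n \times 76.9\,\mathrm{ms},
\]
so even $n = 50$ sequential requests add close to four seconds of pure setup overhead before payload bandwidth enters the picture.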
Mycelium is also the slowest VPN to recover from a reboot: 76.6~seconds on average, and almost suspiciously uniform across nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a fixed convergence timer in the overlay protocol, most likely a default interval rather than anything topology-dependent (a wait that would vary with each node's position in the overlay).
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
% a configurable default, or coincidence.
The UDP test timed out at 120~seconds, and even first-time connectivity required a 70-second wait at startup.

\paragraph{Tinc: Userspace Processing Bottleneck.} Tinc is a clear case of a CPU bottleneck masquerading as a network problem. At 1.19\,ms latency, packets get through the tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a third of the bare-metal link. The usual suspects do not apply: Tinc's effective UDP payload size (\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and WireGuard at 1\,368) is in the normal range, and its retransmit count (240) is moderate. The most plausible limiter is its single-threaded userspace architecture: one CPU core cannot encrypt, copy, and forward packets fast enough to fill the pipe. Raw core saturation cannot be the whole story, though: as noted in the latency analysis, VpnCloud posts the same 14.9\,\% total-CPU figure while delivering 539\,Mbps, so per-packet processing cost must differ materially between the two implementations.

The parallel benchmark is consistent with this diagnosis. Tinc scales to 563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio. Multiple TCP streams collectively keep that single core busy during what would otherwise be idle gaps in any individual flow, squeezing out throughput that no single stream could reach alone.

\section{Impact of network impairment} \label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal conditions. The impairment profiles in Table~\ref{tab:impairment_profiles} test a different property: resilience. Two results dominate the data. The first is the collapse of the throughput hierarchy. At High impairment, the 675\,Mbps spread between fastest and slowest implementation compresses to under 3\,Mbps. Architectural differences that mattered at gigabit speeds become invisible once the network is the bottleneck.

The second is harder to explain. Headscale outperforms the bare-metal Internal baseline at Medium impairment across TCP, parallel TCP, and the Nix cache benchmark. A VPN built on WireGuard should not beat a direct connection. Section~\ref{sec:tailscale_degraded} pursues this anomaly through what turns out to be the wrong hypothesis. The investigation begins with Tailscale's much-discussed gVisor TCP stack, validates the candidate parameters in isolation on the bare-metal host, and only then discovers, by reading the rig's own NixOS module, that the gVisor stack is not actually in the data path of the benchmark at all. The real culprit is a combination of the Linux kernel's tight default \texttt{tcp\_reordering} threshold and the way \texttt{wireguard-go} batches packets between the wire and the host kernel TCP stack.

\subsection{Ping}

Latency is the most predictable metric under impairment.
Most VPNs absorb the injected delay with a fixed per-hop overhead, and rankings within the central cluster barely change across profiles (Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and 15\,ms of round-trip delay at Low, Medium, and High respectively; Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this.

\begin{table}[H] \centering \caption{Average ping RTT (ms) across impairment profiles, sorted by High-profile RTT} \label{tab:ping_impairment} \begin{tabular}{lrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\ \hline Internal & 0.60 & 4.82 & 9.38 & 15.49 \\ Tinc & 1.19 & 5.32 & 9.85 & 15.92 \\ Nebula & 1.25 & 5.38 & 9.99 & 15.96 \\ WireGuard & 1.20 & 5.36 & 9.88 & 15.99 \\ Headscale & 1.64 & 5.82 & 10.39 & 16.07 \\ VpnCloud & 1.13 & 5.41 & 10.35 & 16.21 \\ ZeroTier & 1.28 & 5.34 & 10.02 & 16.54 \\ Yggdrasil & 2.20 & 6.73 & 11.99 & 20.20 \\ Hyprspace & 1.79 & 6.15 & 10.76 & 24.49 \\ EasyTier & 1.33 & 6.27 & 14.13 & 26.60 \\ Mycelium & 34.90 & 23.42 & 43.88 & 33.05 \\ \hline \end{tabular} \end{table}

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png} \caption{Average ping RTT across impairment profiles. Most VPNs form a tight parallel band; Mycelium's non-monotonic curve, EasyTier's excess latency at High, and Hyprspace's upward divergence stand out.} \label{fig:ping_impairment_heatmap} \end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at baseline to 23.4\,ms at Low impairment, a 33\,\% improvement at the very profile where every other VPN gets slower. It then climbs to 43.9\,ms at Medium before falling again to 33.0\,ms at High. The baseline analysis (Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency comes from a bimodal routing distribution: one path runs at 1.63\,ms, two others route through the global overlay at ${\sim}$51\,ms. Impairment seems to shift Mycelium's path selection toward the shorter route, so a larger share of traffic avoids the overlay detour; this holds whether the baseline detour is a path-discovery failure or by-design overlay routing. The non-monotonic curve is consistent with a path-selection algorithm that reacts to measured link quality, but not linearly with degradation severity.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers
% can verify these numbers.
Mycelium loses zero ping packets at Low and Medium impairment. Most other VPNs show 0.1--3.2\,\% loss at those profiles. At High impairment Mycelium's loss jumps to 11.1\,\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot; the ping
% heatmap only shows averages. Add a jitter/distribution figure.
EasyTier accumulates 11\,ms of excess latency at High impairment beyond what tc~netem injects. Its average RTT is 26.6\,ms and its maximum reaches 290\,ms, against ${\sim}$40\,ms for WireGuard.
The RTT standard deviation reaches 44.6\,ms at High, the worst jitter of any VPN. A userspace retry mechanism is the likely cause, but without source-code or packet-level evidence we cannot say so with certainty.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation depends on the exact test structure
% (3 pairs x 3 runs x 100 packets); verify this matches the actual
% test setup and add a supporting figure or table.
Hyprspace shows the same 11.1\,\% ping packet loss at Low, Medium, and High impairment. With 9~measurement runs per profile (3~machine pairs $\times$ 3~runs of 100~packets), 11.1\,\% is exactly 1/9: one run fails completely while the other eight report zero loss. The binary pass/fail behaviour fits the buffer-bloat diagnosis from Section~\ref{sec:hyprspace_bloat}: when the tunnel's buffers fill, a path stalls completely rather than degrading gradually.

\subsection{TCP throughput}

The baseline TCP hierarchy does not survive impairment. The three performance tiers from Section~\ref{sec:baseline} dissolve at the first step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H] \centering \caption{Single-stream TCP throughput (Mbps) across impairment profiles, sorted by baseline. Retention is the Low-to-baseline ratio.} \label{tab:tcp_impairment} \begin{tabular}{lrrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} & \textbf{Retention} \\ \hline Internal & 934 & 333 & 29.6 & 4.25 & 35.7\% \\ WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\ ZeroTier & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\ Headscale & 800 & 274 & 41.5 & 4.21 & 34.3\% \\ Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\ \hline Nebula & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\ EasyTier & 636 & 156 & 17.4 & 3.59 & 24.6\% \\ VpnCloud & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\ \hline Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\ Tinc & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\ Mycelium & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\ \hline \end{tabular} \end{table}

\begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png} \caption{Single-stream TCP throughput across impairment profiles. Headscale crosses above Internal at Medium impairment; Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs converge at High.} \label{fig:tcp_impairment_heatmap} \end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a 98.3\,\% loss after adding only 2\,ms of latency, 2\,ms of jitter, 0.25\,\% packet loss, and 0.5\,\% reordering per machine. Even Mycelium, the slowest VPN at baseline (259\,Mbps), retains more throughput at Low than Yggdrasil does. The jumbo overlay MTU of 32\,731~bytes that inflated Yggdrasil's baseline numbers (Section~\ref{sec:baseline}) becomes a liability under impairment: an inner frame of that size spans roughly 24 outer packets, so at the Low profile's 0.25\,\% outer loss the chance of losing at least one fragment of a given inner frame rises to $1-(1-0.0025)^{24} \approx 5.8\,\%$, and every such loss costs roughly 24$\times$ more retransmitted inner data than a standard 1\,400-byte MTU VPN would lose.

Headscale retains 34.3\,\% of its baseline throughput at Low, almost the same as Internal's 35.7\,\%. At Medium impairment, Headscale (41.5\,Mbps) overtakes Internal (29.6\,Mbps). Section~\ref{sec:tailscale_degraded} investigates this anomaly in detail. At High impairment, the throughput range collapses from 675\,Mbps to 2.9\,Mbps.
Internal leads at 4.25\,Mbps, Hyprspace trails at 1.39\,Mbps, and the impairment profile itself is the bottleneck. With 2.5\,\% packet loss and 5\,\% reordering per machine, every implementation is loss-limited, and the architectural differences that mattered at gigabit speeds no longer matter at all.

\subsection{UDP throughput}

The UDP stress test (\texttt{-b~0}) separates implementations with effective backpressure from those without it more cleanly than any TCP benchmark. Under impairment, it also produces widespread failures. Hyprspace and Mycelium continue to time out at every profile, extending their baseline failures. The remaining failures are scattered and, notably, not monotonic in impairment severity: Tinc fails at Low and Medium yet completes High (at roughly 8\,Mbps), ZeroTier drops out at Medium only, Internal and WireGuard fail at Low, the mildest profile, while completing heavier ones, and Nebula and VpnCloud also fail selectively. Failures that appear and disappear as impairment increases point at least partly to iPerf3/tc interaction rather than to fundamental VPN limitations, so no single failed run should be read as a verdict on an implementation. The data is sparse, but one pattern emerges from the runs that did complete.

Three implementations maintain throughput at the profiles where data exists. Internal holds ${\sim}$950\,Mbps at Baseline, Medium, and High; WireGuard sustains 850--898\,Mbps; and Headscale sustains 700--876\,Mbps. Internal and WireGuard ride the host kernel's transport-layer backpressure (Internal directly, WireGuard via the in-kernel WireGuard module). Headscale, by contrast, never uses the kernel module even though it builds on the WireGuard protocol: as established in Section~\ref{sec:baseline}, Tailscale's \texttt{magicsock} layer intercepts every packet for endpoint selection, DERP relay, and the disco protocol, and that interception is incompatible with the kernel WireGuard datapath. Headscale therefore runs \texttt{wireguard-go} in userspace and compensates with UDP batching (\texttt{recvmmsg}/\texttt{sendmmsg}), host-kernel UDP segmentation/aggregation offload (\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer WireGuard socket), and a 7\,MB socket buffer on the same outer socket. These offloads live in the host kernel; gVisor netstack itself implements no UDP GSO or UDP GRO of its own. Together they absorb a \texttt{-b 0} sender flood without collapsing.

Userspace VPNs without the same engineering do collapse: EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across successive profiles. Yggdrasil, already pathological at baseline (98.7\,\% loss), crashes to 12.3\,Mbps at Low and fails entirely at Medium and High.
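The batching and offload mechanisms named above are ordinary Linux socket features rather than anything Tailscale-specific. Listing~\ref{lst:udp_offload_sketch} sketches their shape in Go; it is an illustration of the technique only, not code from \texttt{tailscaled} or \texttt{wireguard-go}, and the port, peer address, and sizes are placeholders.

\begin{lstlisting}[language=Go,caption={Sketch of batched, offload-assisted UDP sending as a userspace VPN might configure its outer socket (illustrative only; requires Linux with UDP GSO/GRO support).},label={lst:udp_offload_sketch}]
package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
	"golang.org/x/sys/unix"
)

func main() {
	// Outer UDP socket, standing in for the tunnel's encrypted transport.
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Large socket buffers absorb bursts the userspace loop cannot
	// drain immediately (the text above cites a 7 MB buffer).
	conn.SetReadBuffer(7 << 20)
	conn.SetWriteBuffer(7 << 20)

	// UDP_SEGMENT (GSO): the kernel splits one large write into
	// 1400-byte wire datagrams. UDP_GRO is the receive-side analogue:
	// inbound bursts are coalesced so each read returns more data.
	rc, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	rc.Control(func(fd uintptr) {
		unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, 1400)
		unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_GRO, 1)
	})

	dst := &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 51820}

	// GSO path: one 63 kB superframe leaves as ~45 wire datagrams,
	// all for a single write(2) syscall.
	superframe := make([]byte, 45*1400)
	if _, err := conn.WriteToUDP(superframe, dst); err != nil {
		log.Fatal(err)
	}

	// sendmmsg path: WriteBatch flushes many independent datagrams
	// per syscall instead of paying one syscall per packet.
	pc := ipv4.NewPacketConn(conn)
	msgs := make([]ipv4.Message, 8)
	for i := range msgs {
		msgs[i] = ipv4.Message{
			Buffers: [][]byte{make([]byte, 1400)},
			Addr:    dst,
		}
	}
	if _, err := pc.WriteBatch(msgs, 0); err != nil {
		log.Fatal(err)
	}
}
\end{lstlisting}

The point of both paths is syscall amortization: the userspace process pays one kernel crossing for dozens of packets, which is what lets a \texttt{wireguard-go}-style pipeline keep pace with an unbounded sender where one-write-per-packet implementations fall behind.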
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
% TODO: The heatmap shows Internal, WireGuard, and Headscale all
% fail ($\times$) at Low impairment. WireGuard also fails at
% High. These selective failures need an explanation
% (iPerf3/tc interaction?).
\caption{UDP receiver throughput across impairment profiles. Implementations with effective UDP backpressure (Internal and WireGuard via the in-kernel datapath; Headscale via \texttt{wireguard-go} batching plus large socket buffers) maintain high throughput where they complete; other userspace VPNs collapse or fail entirely ($\times$ marks a failed run).}
\label{fig:udp_impairment_heatmap}
\end{figure}

% TODO: This "robustness indicator" interpretation is undermined by
% the non-monotonic failure pattern. Internal and WireGuard fail at
% Low (0.25% loss) but succeed at Medium and High (1%+ loss). If
% failures indicated "fundamental flow-control problems," they should
% get worse with more impairment, not better. The pattern suggests
% iPerf3 or tc timing issues rather than VPN limitations. Either
% explain the non-monotonic failures or weaken this conclusion.
Under impairment this benchmark is more useful as a robustness indicator than as a throughput measurement, though the indicator must be read with care: even Internal and WireGuard fail selectively (at Low but not at the harsher profiles), so at least some failures stem from the iPerf3/\texttt{tc} interaction rather than from the VPN under test. With that caveat, a VPN that cannot complete a 30-second UDP flood under 0.25\% packet loss likely has a flow-control problem that will surface under real workloads too, even when the symptoms are milder.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must match
% the baseline parallel test description. The baseline section has an
% unresolved TODO about whether the test uses 6 or 10 streams. If the
% baseline is corrected to 10, this section must also be updated.
The Headscale anomaly from single-stream TCP grows larger under parallel load. Table~\ref{tab:parallel_impairment} shows aggregate throughput across three concurrent bidirectional links (six unidirectional flows).

\begin{table}[H]
\centering
\caption{Parallel TCP throughput (Mbps) across impairment profiles, sorted by baseline. Three concurrent bidirectional links produce six unidirectional flows.}
\label{tab:parallel_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\
\hline
Internal & 1398 & 277 & 82.6 & 10.4 \\
WireGuard & 1281 & 173 & 24.5 & 8.39 \\
Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
Headscale & 1228 & 718 & 113 & 20.0 \\
ZeroTier & 1206 & 176 & 35.4 & 7.97 \\
EasyTier & 927 & 473 & 57.4 & 10.7 \\
Hyprspace & 803 & 2.87 & 6.94 & 3.62 \\
VpnCloud & 763 & 174 & 23.7 & 8.25 \\
Nebula & 648 & 103 & 15.3 & 4.93 \\
Mycelium & 569 & 72.7 & 7.51 & 3.69 \\
Tinc & 563 & 168 & 23.7 & 8.25 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
\caption{Parallel TCP throughput across impairment profiles. Headscale dominates at Low (718\,Mbps vs.\ Internal's 277); EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to 2.87\,Mbps.}
\label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment, Headscale reaches 718\,Mbps: 2.6$\times$ Internal's 277\,Mbps and 4.1$\times$ WireGuard's 173\,Mbps. At Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%. Whatever mechanism produces the single-stream crossover at Medium scales with the flow count, because each of the six concurrent streams benefits from it independently.
EasyTier is the runner-up under parallel load: 473\,Mbps at Low, 51\% of its baseline. Headscale and EasyTier are the only VPNs that retain more than half their baseline parallel throughput at Low impairment; no other implementation exceeds 30\%. We have no direct architectural explanation for EasyTier's resilience and do not claim one here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\% loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat
% diagnosis from Section hyprspace_bloat, which depends on the
% unverified 2,800 ms under-load latency. If that diagnosis is
% revised, this explanation for parallel collapse must also be
% revisited.
If that diagnosis holds, the buffer bloat that already plagues single-stream transfers (Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six flows compete for the same bloated buffers at once.

High-profile convergence is more pronounced here than in single-stream mode. Tinc and VpnCloud land at an identical 8.25\,Mbps even though they differ by 200\,Mbps at baseline.

\subsection{QUIC Performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline (Section~\ref{sec:baseline}) and continue to fail at every impairment profile.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to 7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High. This is the same cliff observed in its TCP results, driven by the same jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps), ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within 0.4\,Mbps of one another. At baseline these four span a 188\,Mbps range (656 to 844\,Mbps). QUIC's own congestion control, running on top of an already-degraded outer link, has become the sole limiter.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
\caption{QUIC bandwidth across impairment profiles. Yggdrasil drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier, and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and Nebula fail at all profiles ($\times$).}
\label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video Streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's throughput budget even at High impairment. Quality differences in Table~\ref{tab:rist_impairment} therefore reflect packet delivery reliability, not bandwidth.

\begin{table}[H]
\centering
\caption{RIST video streaming quality (\%) across impairment profiles, sorted by High-profile quality}
\label{tab:rist_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\
\hline
Mycelium & 100.0 & 100.0 & 100.0 & 99.9 \\
EasyTier & 100.0 & 100.0 & 96.2 & 85.5 \\
Internal & 100.0 & 99.2 & 89.3 & 80.2 \\
ZeroTier & 100.0 & 99.3 & 89.9 & 80.2 \\
VpnCloud & 100.0 & 99.2 & 89.7 & 80.1 \\
WireGuard & 100.0 & 99.3 & 90.0 & 80.0 \\
Hyprspace & 100.0 & 92.9 & 87.9 & 78.1 \\
Tinc & 100.0 & 99.3 & 90.0 & 77.8 \\
Nebula & 99.8 & 98.8 & 85.6 & 72.1 \\
Yggdrasil & 100.0 & 94.7 & 71.4 & 43.3 \\
Headscale & 13.1 & 13.0 & 13.0 & 13.0 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
\caption{RIST video streaming quality across impairment profiles.
Headscale is stuck at ${\sim}$13\% regardless of profile; Mycelium maintains ${\sim}$100\% even at High; Yggdrasil declines steeply to 43\%.}
\label{fig:rist_impairment_heatmap}
\end{figure}

Headscale sits at ${\sim}$13\% across all four profiles: 13.1\%, 13.0\%, 13.0\%, 13.0\%.
% TODO: DOWNSTREAM DEPENDENCY — This repeats the DERP/MTU hypothesis
% from Section baseline as though it were established. The baseline
% TODO notes this hypothesis is unverified (no packet capture
% evidence). Do not present it as a confirmed diagnosis here without
% resolving the upstream TODO.
This profile-independence is consistent with the baseline diagnosis (Section~\ref{sec:baseline}): the failure is structural (most plausibly MTU fragmentation in the DERP relay layer, though this remains unverified) and cannot worsen because it is already saturated. Adding latency or loss on top of an 87\% packet drop floor changes nothing.

Mycelium holds 99.9\% quality even at High impairment, ahead of Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even Mycelium's degraded overlay paths comfortably sustain the stream. The same overlay routing that adds 34.9\,ms of latency and cripples bulk TCP transfers is harmless at video bitrates, and RIST's forward error correction handles the residual loss.

% TODO: The claim that jumbo MTU causes burst losses that overwhelm
% FEC is a hypothesis. No FEC analysis or packet-level evidence is
% shown. Consider adding packet capture data or softening the claim.
Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP throughput likely hurts here as well: large overlay packets are more exposed to loss and reordering at the outer layer, and the resulting burst losses may exceed what RIST's FEC can recover.

\subsection{Application-Level Download}

The Nix binary cache download is the most demanding application-level benchmark. Hundreds of sequential HTTP connections amplify the per-connection latency penalties that bulk throughput tests amortise. Table~\ref{tab:nix_impairment} shows download times across profiles.

\begin{table}[H]
\centering
\caption{Nix binary cache download time (seconds) across impairment profiles, sorted by Low-profile time. ``--'' marks a failed run.}
\label{tab:nix_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\
\hline
Internal & 8.53 & 11.9 & 58.6 & -- \\
Headscale & 9.79 & 13.5 & 48.8 & 219 \\
EasyTier & 9.39 & 22.1 & 141 & -- \\
VpnCloud & 9.39 & 27.9 & 163 & -- \\
WireGuard & 9.45 & 28.8 & 161 & -- \\
Nebula & 9.15 & 30.8 & 180 & 547 \\
Tinc & 10.0 & 30.9 & 166 & 496 \\
ZeroTier & 9.22 & 36.2 & 141 & -- \\
Mycelium & 10.1 & 79.5 & -- & -- \\
Yggdrasil & 10.6 & 230 & -- & -- \\
Hyprspace & 11.9 & -- & 170 & -- \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
\caption{Nix binary cache download time across impairment profiles. Headscale, Nebula, and Tinc complete all four profiles; Headscale beats Internal at Medium (49\,s vs.\ 59\,s). Yggdrasil's Low-profile time explodes to 230\,s ($\times$ marks a failed run).}
\label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all four profiles. At Medium impairment, Headscale finishes in 48.8~seconds, faster than Internal's 58.6~seconds.
Internal itself fails at High impairment while Headscale completes in 219~seconds, Tinc in 496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low impairment, a 22$\times$ slowdown. Every HTTP request pays the latency penalty of Yggdrasil's impairment-amplified retransmissions. Mycelium degrades almost as badly (10.1\,s to 79.5\,s, an 8$\times$ increase): its overlay routing overhead compounds over hundreds of sequential HTTP connections.

% TODO: Hyprspace fails at Low but completes at Medium (170 s).
% This contradicts the "clean gradient" claim. Explain why a VPN
% can fail at Low but succeed at Medium, or note the anomaly.
The failure map shows a mostly clean gradient: more demanding profiles knock out more VPNs. At Low, 10 of 11 finish (Hyprspace fails). At Medium, 9 finish; the one break in the gradient is Hyprspace, which failed at Low yet completes here in 170\,s. At High, only Headscale, Nebula, and Tinc survive. Internal's failure at High is the surprising one: the bare-metal baseline cannot sustain a multi-connection HTTP workload under severe degradation, while the Headscale tunnel pulls through. Section~\ref{sec:tailscale_degraded} investigates why.

\section{Tailscale Under Degraded Conditions}
\label{sec:tailscale_degraded}

This section is about an observation that should not exist: Headscale, a tunnelling VPN built on \texttt{wireguard-go} and the host kernel TCP stack, beats the bare-metal Internal baseline at Medium impairment, and at Low impairment under parallel load beats it by a factor of 2.6. The short answer turns out to be different from the obvious answer, and we worked it out only by chasing the obvious answer to its end.

\subsection{An Anomaly Worth Pursuing}

At Medium impairment, Headscale reaches 41.5\,Mbps on a single TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for the VPN over the direct host-to-host link it tunnels through. Headscale costs the expected ${\sim}$14\,\% at baseline, lags Internal at Low (274 vs.\ 333\,Mbps), and is effectively tied with it at High (4.21 vs.\ 4.25\,Mbps). Yet at Medium the order inverts, and not by a sliver: a 12\,Mbps gap on a 30\,Mbps link is well above measurement noise. The same thing happens, more dramatically, on the parallel TCP test, where Headscale's 718\,Mbps at Low beats Internal's 277\,Mbps by a factor of 2.6. Table~\ref{tab:headscale_anomaly} collects the comparison.

\begin{table}[H]
\centering
\caption{Headscale vs.\ Internal vs.\ WireGuard under impairment (18.12.2025 run). For TCP benchmarks, higher is better. For Nix cache, lower is better; ``--'' marks a failed run.}
\label{tab:headscale_anomaly}
\begin{tabular}{llrrr}
\hline
\textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} & \textbf{Headscale} & \textbf{WireGuard} \\
\hline
Single TCP (Mbps) & Low & 333 & 274 & 54.7 \\
Single TCP (Mbps) & Medium & 29.6 & 41.5 & 8.77 \\
Single TCP (Mbps) & High & 4.25 & 4.21 & 2.63 \\
Parallel TCP (Mbps) & Low & 277 & 718 & 173 \\
Parallel TCP (Mbps) & Medium & 82.6 & 113 & 24.5 \\
Nix cache (s) & Medium & 58.6 & 48.8 & 161 \\
Nix cache (s) & High & -- & 219 & -- \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
\caption{Single-stream TCP throughput for Internal, Headscale, and WireGuard across impairment profiles (log scale).
Headscale crosses above Internal at Medium impairment; WireGuard stays far below both; all three converge at High.}
\label{fig:headscale_vs_internal}
\end{figure}

WireGuard-the-kernel-module is the obvious sanity check. It uses the same Noise/WireGuard cryptographic protocol Tailscale ships and is the closest available comparison without the rest of Tailscale's stack. WireGuard shows none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well below Internal at the same profiles. So the encryption layer is not the answer, and the basic UDP tunnel is not the answer. Whatever Headscale is doing differently lives elsewhere in Tailscale's implementation.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit
% rate table for impaired profiles or reference the data
% source.
The retransmit data narrows the search. At Medium, WireGuard's TCP retransmit rate is 5.2\,\%, more than double Internal's ${\sim}$2.4\,\%. Headscale matches Internal at ${\sim}$2.4\,\% even though it is a tunnelling VPN. Both Headscale and bare-metal Internal run the same host kernel TCP stack at the inner layer, so the asymmetry is not about a different TCP implementation. It is about what the kernel TCP stack is being asked to process: something on Headscale's path suppresses the spurious retransmits the kernel would otherwise fire under \texttt{tc netem}-induced reordering, and nothing on WireGuard's path does.

\subsection{A Plausible Villain: Tailscale's gVisor Stack}

The candidate explanation we pursued first, and the one any reading of the upstream Tailscale documentation will lead to, is Tailscale's userspace TCP/IP stack. The Tailscale client imports Google's gVisor netstack (\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses it as an in-process TCP implementation. The gVisor documentation is direct about why this matters: netstack is designed for adverse networks where the host kernel's TCP defaults are too aggressive. Tailscale's release notes go further, calling out specific overrides on top of gVisor — the most visible being an explicit RACK disable and 8\,MiB~/ 6\,MiB receive and send buffers.

Reading Tailscale's source confirms it. \texttt{wgengine/netstack/netstack.go} contains the netstack initialiser, and Listing~\ref{lst:tailscale_netstack_overrides} reproduces the relevant overrides verbatim. RACK is disabled (\texttt{TCPRecovery(0)}) with a comment pointing at \texttt{tailscale/issues/9707}: ``gVisor's RACK performs poorly. ACKs do not appear to be handled in a timely manner, leading to spurious retransmissions and a reduced congestion window.'' Reno is set explicitly with a comment pointing at \texttt{gvisor/issues/11632}, an integer-overflow bug in gVisor's CUBIC implementation. The TCP receive and send buffer maxima are pushed up to 8\,MiB and 6\,MiB. SACK is enabled (gVisor's default is off).

\lstinputlisting[language=Go,caption={Tailscale's gVisor netstack initialiser explicitly disables RACK, pins Reno as the congestion control, and enlarges the TCP buffer maxima. These overrides live inside \texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}

Read against the Linux kernel defaults — RACK on, CUBIC by default, ${\sim}$1\,MiB receive and send buffers, \texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these overrides describe a TCP stack better suited to a lossy, reordering link than the host kernel. The hypothesis writes itself: Headscale's iPerf3 traffic is processed by this gVisor instance instead of by the host kernel TCP stack, and so it inherits the more reordering-tolerant behaviour. WireGuard-the-kernel-module shares only the cryptographic protocol; it does not get the gVisor stack, and therefore does not get the advantage. It is a clean story.

The natural way to test it is to extract the parameters Tailscale sets inside gVisor, apply their nearest Linux equivalents to the bare-metal host as sysctls, and see whether Internal — with no VPN at all — picks up the same advantage. If it does, the gVisor explanation is supported. If it does not, the hypothesis fails.

\subsection{Reproducing the Effect on Bare Metal}
\label{sec:tuned}

We ran two follow-up benchmarks on the same hardware and impairment setup as the original 18.12.2025 run.

\begin{itemize}
\bitem{Tailscale-style (27.02.2026):} \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max}, \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard, Tinc, and ZeroTier.
\bitem{Reorder-only (06.03.2026):} Only \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and \texttt{tcp\_early\_retrans=0}; buffer sizes left at kernel defaults (Listing~\ref{lst:reorder_sysctls} below sketches the change). Tested on Internal and Headscale only.
\end{itemize}

\begin{table}[H]
\centering
\caption{Internal (no VPN) throughput across three kernel configurations. ``Default'' is the 18.12.2025 run with stock Linux TCP parameters.}
\label{tab:kernel_tuning_internal}
\begin{tabular}{llrrr}
\hline
\textbf{Metric} & \textbf{Profile} & \textbf{Default} & \textbf{Tailscale-style} & \textbf{Reorder-only} \\
\hline
Single TCP (Mbps) & Baseline & 934 & 934 & 934 \\
Single TCP (Mbps) & Low & 333 & 363 & 354 \\
Single TCP (Mbps) & Medium & 29.6 & 64.2 & 72.7 \\
Parallel TCP (Mbps) & Low & 277 & 893 & 902 \\
Parallel TCP (Mbps) & Medium & 82.6 & 226 & 211 \\
Retransmit \% & Medium & ${\sim}$2.4 & 1.21 & 1.11 \\
Nix cache (s) & Medium & 58.6 & 29.7 & 29.1 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
\caption{Internal (no VPN) single-stream TCP throughput across three kernel configurations. Baseline is unchanged; at Medium impairment, throughput jumps from 30 to 64 to 73\,Mbps as reordering tolerance increases.}
\label{fig:kernel_tuning_comparison}
\end{figure}

The result felt like confirmation. Internal's Medium-impairment throughput jumped from 29.6\,Mbps to 72.7\,Mbps under the reorder-only configuration — a 146\,\% increase from a three-line sysctl change — and the retransmit rate at Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, implying that more than half of the original retransmissions had been spurious. The Nix cache download at Medium roughly halved, from 58.6\,s to 29.1\,s.
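Concretely, the reorder-only configuration amounts to three writes under \texttt{/proc/sys}, equivalent to three \texttt{sysctl -w} invocations. The sketch below applies them programmatically; it is an illustration of the change, not the rig's actual provisioning mechanism, and it must run as root.

\begin{lstlisting}[language=Go,caption={The reorder-only kernel configuration as three \texttt{/proc/sys} writes, equivalent to \texttt{sysctl -w net.ipv4.tcp\_reordering=10} and friends. Illustrative sketch; requires root.},label={lst:reorder_sysctls}]
package main

import (
	"log"
	"os"
)

func main() {
	// The three parameters that dominate the bare-metal behaviour
	// under netem reordering (kernel defaults in parentheses).
	settings := map[string]string{
		"/proc/sys/net/ipv4/tcp_reordering":    "10", // reordering tolerance (3)
		"/proc/sys/net/ipv4/tcp_recovery":      "0",  // disable RACK (1)
		"/proc/sys/net/ipv4/tcp_early_retrans": "0",  // disable early retransmit/TLP (3)
	}
	for path, value := range settings {
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			log.Fatalf("writing %s: %v", path, err)
		}
	}
}
\end{lstlisting}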
Parallel TCP gained more. Internal at Low climbed from 277 to 902\,Mbps, a 226\,\% increase that not only dwarfs Internal's original single-stream figure at Low (333\,Mbps) but overtakes Headscale's 718\,Mbps parallel result from the unmodified run.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits
% the unresolved 6-vs-10 stream count from the baseline parallel
% test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the higher reordering threshold, and the gains compound.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
% not in any table. Add a table showing Headscale's results
% from the follow-up runs alongside Internal's so readers can
% verify the reversal.
Headscale itself, retested with the same sysctls, gained more modestly: +21\,\% at Medium and a small $-$5\,\% wobble at Low. And the anomaly reversed entirely. At Medium, tuned Internal reached 72.7\,Mbps against Headscale's 50.1\,Mbps — a 45\,\% lead for Internal where the original run had Headscale 40\,\% ahead. The Nix cache flipped the same way: Internal completed in 29.1\,s against Headscale's 36.3\,s, where the original had Headscale 17\,\% faster.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
\caption{Internal-to-Headscale speed-up factor before and after kernel tuning. Values above 1.0 mean Internal is faster. At Medium impairment, the ratio flips from 0.71$\times$ (Headscale ahead) to 1.45$\times$ (Internal ahead).}
\label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration matched or exceeded the full Tailscale-style configuration on most metrics. The two exceptions were single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at Medium (211 vs.\ 226\,Mbps), gaps of roughly 3\,\% and 7\,\%. The enlarged buffer sizes did not help and may even have added mild buffer bloat that partially offset the reordering benefit, though the gap could also be run-to-run variance. Either way, three host-kernel sysctls were enough to erase the entire Headscale advantage over Internal: \texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.

At this point in the investigation the hypothesis seemed settled. Tailscale's gVisor stack ships with these overrides; the bare-metal kernel ships with stricter defaults; matching the kernel to gVisor reproduces the effect. Then we checked which Tailscale code path the test rig was actually running.

\subsection{The Data Path That Was Not There}

In default mode — what anyone running \texttt{tailscale up} on a Linux host gets — the Tailscale client creates a real kernel TUN device, registers a route for the Tailscale subnet through it, and forwards inbound and outbound packets through that interface. An application like iPerf3 issues a \texttt{connect} to the remote peer's Tailscale IP. The host kernel TCP stack handles the application TCP. The kernel routes the resulting outbound packets to the TUN device. \texttt{tailscaled} (with \texttt{wireguard-go} embedded) reads them from the TUN, encrypts them, and sends them as outer WireGuard UDP packets on the wire. The receiving side reverses the process and writes the decrypted inner packets back into its own TUN, where the host kernel TCP stack delivers them to the iPerf3 server. In that path, gVisor netstack is never instantiated.
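Schematically, the daemon's role on this path reduces to a read-encrypt-send loop, with all TCP processing in the kernel on either side of it. The sketch below is an illustration with stand-in types and an in-memory fake TUN; it is not \texttt{wireguard-go}'s actual (batched, concurrent) pipeline, and the placeholder \texttt{encryptPacket} does no real cryptography.

\begin{lstlisting}[language=Go,caption={Schematic of the default kernel-TUN data path: inner packets produced by the host kernel TCP stack are read from the TUN, encapsulated, and emitted as outer UDP. Illustration only.},label={lst:tun_pump_sketch}]
package main

import (
	"io"
	"os"
)

// encryptPacket stands in for the Noise transport encryption that
// wireguard-go performs (placeholder, not real cryptography).
func encryptPacket(inner []byte) []byte {
	return append([]byte("wg-outer:"), inner...)
}

// pumpOutbound is the daemon's outbound half: inner IP packets arrive
// from the kernel TUN (already shaped by the host kernel TCP stack),
// get encapsulated, and leave as outer UDP payloads. The daemon never
// terminates TCP on this path, which is why host sysctls still apply.
func pumpOutbound(tun io.Reader, wire io.Writer) error {
	buf := make([]byte, 65535)
	for {
		n, err := tun.Read(buf)
		if err != nil {
			return err
		}
		if _, err := wire.Write(encryptPacket(buf[:n])); err != nil {
			return err
		}
	}
}

func main() {
	tunR, tunW := io.Pipe() // in-memory stand-in for /dev/net/tun
	go func() {
		// The kernel would write routed packets here; we fake one.
		tunW.Write([]byte("inner TCP/IP packet\n"))
		tunW.Close() // pumpOutbound then returns io.EOF
	}()
	pumpOutbound(tunR, os.Stdout) // "wire" is stdout for the demo
}
\end{lstlisting}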
The netstack initialiser in Listing~\ref{lst:tailscale_netstack_overrides} only runs when \texttt{tailscaled} is launched with \texttt{--tun=userspace-networking}, a mode that has no kernel TUN at all and is reachable only from processes running inside \texttt{tailscaled} itself (Tailscale SSH, Taildrop, the metric endpoint). External processes such as iPerf3 cannot reach the Tailscale network directly in that mode.

The test rig does not use that mode. Listing~\ref{lst:nixos_tailscale} shows the relevant line of the upstream NixOS \texttt{services.tailscale} module, which assembles the daemon command line as \texttt{tailscaled --tun \$\{cfg.interfaceName\}~\dots}, with no \texttt{userspace-networking} fall-back unless the operator explicitly sets \texttt{interfaceName = "userspace-networking"}. Listing~\ref{lst:rig_interface_name} shows what the benchmark suite's Headscale module sets the interface name to: \texttt{ts-\$\{instanceName\}}, truncated to fifteen characters. The two together resolve to \texttt{tailscaled --tun ts-headscale} on every test machine, a real kernel TUN. gVisor netstack is unreachable from any external benchmark traffic in this rig.

\lstinputlisting[language=Nix,caption={The NixOS \texttt{services.tailscale} module passes \texttt{--tun \$\{interfaceName\}} as the daemon's TUN argument. There is no \texttt{--tun=userspace-networking} fall-back unless the user explicitly sets \texttt{interfaceName = "userspace-networking"}. \textit{nixpkgs/nixos/modules/services/networking/tailscale.nix:158}},label={lst:nixos_tailscale}]{Listings/nixos_tailscale.nix}

\lstinputlisting[language=Nix,caption={The benchmark suite's Headscale module sets \texttt{interfaceName} to a real kernel TUN name (\texttt{ts-}, truncated to 15 characters). Combined with Listing~\ref{lst:nixos_tailscale}, this means \texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale} on every test machine. \textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}

The empirical fingerprint pins down the same conclusion without any source-code reading. Headscale itself gained +21\,\% at Medium from the host-kernel sysctl tuning. If Headscale's iPerf3 traffic were processed by gVisor netstack, host-kernel sysctls would change nothing — they configure the host kernel TCP stack and only the host kernel TCP stack. The fact that Headscale moves measurably under those sysctls is direct evidence that Headscale's application TCP runs on the host kernel stack, just as Internal's does.

The validation experiment was therefore validating something other than the hypothesis it was supposed to validate. It confirmed, very cleanly, that the Linux kernel's default \texttt{tcp\_reordering=3} is too tight for the kind of bursty, correlated reordering the Medium profile produces, and that loosening it produces a large throughput gain on a kernel-TCP data path. That part of the result stands. What does not stand is the inference that the gain reproduces something Tailscale was already doing in gVisor. For this benchmark, Tailscale is not in the gVisor TCP business at all.

\subsection{Where the Advantage Actually Lives}

The puzzle the investigation began with has not gone away. Headscale starts at 41.5\,Mbps where Internal starts at 29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel TCP stack.
Whatever Headscale is doing — partially, weakly, but reproducibly — is worth roughly twelve megabits per second on the Medium profile, and it is not gVisor netstack.

The +21\,\% sysctl gain for Headscale itself is also informative about the size of the mechanism. If the gain were 0\,\%, Headscale would already be doing the sysctls' work; if it were +146\,\% like Internal's, Headscale would be doing nothing of its own. The partial response says Headscale's mechanism produces an effect similar in kind to the sysctls but smaller in size, and that the two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are the most likely candidates, and both live on the kernel-TUN path that Tailscale actually uses in the rig.

The first is TUN TCP and UDP generic receive offload (GRO). Tailscale's \texttt{tstun} wrapper enables both on the kernel TUN device on Linux unless an environment knob disables them or a runtime probe rejects the feature (Listing~\ref{lst:tstun_gro}). On the receive side, this means \texttt{wireguard-go} decrypts a burst of inbound WireGuard frames and then coalesces consecutive in-order TCP segments belonging to the same flow into a single super-segment before writing them back to the kernel TUN. On the transmit side, it accepts GSO super-segments from the kernel via the TUN read path in the same way. The receiving kernel TCP stack therefore sees fewer, larger segments per coalesced batch instead of $N$ small ones, and the segment timing that survives to the kernel is the timing of GRO batches rather than of individual on-the-wire packets. Bare-metal Internal traffic has no equivalent path because it does not pass through a TUN device or a userspace daemon at all.

\lstinputlisting[language=Go,caption={Tailscale enables TUN TCP and UDP GRO on every Linux non-TAP \texttt{tailscaled} process unless the operator disables them via environment knobs or a kernel runtime probe rejects the feature. This is in the default kernel-TUN data path; it is not gated on \texttt{--tun=userspace-networking}. \textit{tailscale/net/tstun/wrap\_linux.go:25--43}},label={lst:tstun_gro}]{Listings/tstun_gro.go}

The second is the 7\,MiB outer-UDP socket buffer that \texttt{magicsock} pins on the WireGuard UDP socket (Listing~\ref{lst:magicsock_buffer}), using the ``force'' \texttt{SO\_*BUFFORCE} variant where available so the value is honoured even past \texttt{net.core.rmem\_max}. The host kernel default is in the low hundreds of KiB. Under burst-correlated impairment — Medium and High both use 50\,\% correlation, so losses and reorderings cluster — this larger buffer absorbs spikes in arrival rate that would otherwise overflow the kernel UDP receive queue and surface as additional inner-TCP losses. Internal has no such cushion on its incoming wire path.

\lstinputlisting[language=Go,caption={\texttt{magicsock} pins the outer WireGuard UDP socket's send and receive buffers to 7\,MiB and uses \texttt{SetBufferSize} with the \texttt{SO\_*BUFFORCE} (``force'') variant where available, so the value is honoured even past \texttt{net.core.rmem\_max}. \textit{tailscale/wgengine/magicsock/magicsock.go:86,3908--3913}},label={lst:magicsock_buffer}]{Listings/magicsock_buffer.go}

% TODO: Neither of the two candidate mechanisms above is directly
% verified in this chapter. A targeted follow-up — for example
% tcpdump on the receiving \texttt{tailscale0} interface during a
% Medium-impairment iPerf3 run, with inter-arrival timing
% analysis — would distinguish their relative contributions and
% confirm the mechanism. The argument here is that they are the
% most plausible candidates consistent with the evidence, not
% measured causes.
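The effect of GRO on what the kernel's reordering detection sees can be illustrated with a toy coalescer: contiguous in-order segments of one flow merge into a single super-segment, while any gap (a reordered or missing segment) forces a boundary. Real TUN GRO in \texttt{tailscale/net/tstun} and \texttt{wireguard-go} parses TCP headers and carries virtio metadata; the sketch below only models the merge rule, with simplified stand-in types.

\begin{lstlisting}[language=Go,caption={Toy model of TCP GRO coalescing: contiguous in-sequence payloads of one flow merge; a sequence gap forces a segment boundary. Illustration only.},label={lst:gro_toy}]
package main

import "fmt"

type segment struct {
	flowID  string // stand-in for the (src, dst, ports) 4-tuple
	seq     uint32 // TCP sequence number of the first payload byte
	payload []byte
}

// coalesce merges runs of segments that are strictly contiguous in
// sequence space within the same flow, as a GRO layer would before
// handing packets to the kernel TCP stack.
func coalesce(in []segment) []segment {
	var out []segment
	for _, s := range in {
		if n := len(out); n > 0 &&
			out[n-1].flowID == s.flowID &&
			out[n-1].seq+uint32(len(out[n-1].payload)) == s.seq {
			out[n-1].payload = append(out[n-1].payload, s.payload...)
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	batch := []segment{
		{"flowA", 1000, []byte("aaaa")},
		{"flowA", 1004, []byte("bbbb")}, // contiguous: merged
		{"flowA", 1012, []byte("cccc")}, // gap (reordered): kept separate
	}
	for _, s := range coalesce(batch) {
		fmt.Printf("flow=%s seq=%d len=%d\n", s.flowID, s.seq, len(s.payload))
	}
}
\end{lstlisting}

The kernel then sees one large in-order segment per merged run instead of several small ones, so fewer events cross its \texttt{tcp\_reordering} threshold in the first place.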
A third feature, batched UDP I/O, completes the picture without changing it qualitatively. \texttt{wireguard-go} uses \texttt{recvmmsg} and \texttt{sendmmsg} on the outer UDP socket so that a burst of WireGuard frames moves through a single system call. This does not change \emph{whether} packets are reordered, but it reduces per-packet timing jitter that the kernel might otherwise interpret as additional reordering.

Hyprspace cannot be used as a negative control for any of this. It does import gVisor netstack, but only for its in-VPN service-network feature, and the Hyprspace benchmark traffic goes through a kernel TUN exactly like Headscale's (Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in whether they have the \texttt{wireguard-go} pipeline features (TUN GRO and the 7\,MiB outer-UDP buffer), not in whether gVisor handles their inner TCP. The gVisor angle simply does not apply to either of them in this benchmark.

The kernel-side picture closes the loop. Three host-kernel TCP parameters dominate the bare-metal behaviour the benchmarks expose. \texttt{net.ipv4.tcp\_reordering} (default 3) is the number of out-of-order segments the kernel will tolerate before declaring fast retransmit, and with \texttt{tc netem} injecting 0.5--2.5\,\% reordering per machine, bursts of several reordered packets are frequent enough that the threshold is repeatedly tripped on the bare-metal path. \texttt{net.ipv4.tcp\_recovery} (default \texttt{1}, RACK enabled) adds time-based reordering detection on top of the segment-count threshold, which compounds the spurious retransmits when reordering is high. And \texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail Loss Probe enabled) fires speculative retransmits when unacknowledged segments sit at the tail of a transmission window, which interacts poorly with an already-impaired link. Loosening any one of the three softens the kernel's loss detection on the bare-metal path; loosening all three recovers most of the achievable throughput. The Headscale path reaches the same kernel TCP stack but is already feeding it the GRO-coalesced, buffer-cushioned stream described above, so the kernel's tight defaults fire less often there to begin with.

The same logic explains the anomaly's shape across profiles. At baseline there is no reordering, so the kernel's tight \texttt{tcp\_reordering} threshold never trips and Internal's native kernel-stack speed wins. As reordering rises from 0.5\,\% (Low) to 2.5\,\% (Medium) per machine, the kernel's loss detection fires on the bare-metal path more often than on the GRO-coalesced Headscale path, and the throughput gap shifts in Headscale's favour. At High impairment, both converge to ${\sim}$4.2\,Mbps: absolute packet loss becomes the dominant bottleneck, and reordering tolerance no longer matters.

Other VPNs respond unevenly to the same sysctl tuning. WireGuard's Medium throughput rises from 8.77 to 12.2\,Mbps (+39\,\%), Tinc's from 5.53 to 11.5\,Mbps (+108\,\%), and ZeroTier stays essentially flat (12.0 to 11.5\,Mbps).
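Table~\ref{tab:other_vpn_tuning} collects these values (all from the Tailscale-style run, the only follow-up configuration in which these three VPNs were tested) next to Internal's for scale.

\begin{table}[H]
\centering
\caption{Single-stream TCP throughput (Mbps) at Medium impairment before and after host-kernel sysctl tuning (Tailscale-style configuration, 27.02.2026 run).}
\label{tab:other_vpn_tuning}
\begin{tabular}{lrrr}
\hline
\textbf{VPN} & \textbf{Default} & \textbf{Tuned} & \textbf{Change} \\
\hline
Internal & 29.6 & 64.2 & +117\% \\
WireGuard & 8.77 & 12.2 & +39\% \\
Tinc & 5.53 & 11.5 & +108\% \\
ZeroTier & 12.0 & 11.5 & $-$4\% \\
\hline
\end{tabular}
\end{table}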
The intuitive reading is that VPNs which add their own encapsulation and userspace processing have bottlenecks the host kernel sysctls cannot touch, but Tinc's large gain shows the picture is not that simple: Tinc is itself a fully userspace VPN, yet it posts the largest relative gain of the three. A complete explanation would have to account for which TCP stack each VPN's application traffic actually traverses and which of those stacks the sysctls actually reach; we leave that accounting open.

The resilient finding from this section, the one that survives regardless of which of the two Tailscale-side mechanisms turns out to dominate, is not about Tailscale at all. It is about Linux. The kernel's default \texttt{tcp\_reordering=3} threshold is too tight for the kind of bursty, correlated reordering \texttt{tc netem} produces at the Medium profile, and it costs the bare-metal host more than half of its achievable throughput. Three lines of \texttt{sysctl} repair it. The fix is portable to any Linux host and entirely independent of any VPN.

The unresilient finding — the one that motivated us to write this section in the first place — is that Tailscale's much-discussed userspace TCP stack is, for the workload that exposed the anomaly, sitting on the bench. The advantage we attributed to it must come from a more ordinary place: the way \texttt{wireguard-go} batches and coalesces packets between the wire and the kernel TCP stack, and the larger UDP buffer it pins on its outer socket. We were chasing the wrong hypothesis with the right experiment, and the experiment turned out to be more useful than the hypothesis.

% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source
% code analysis." Either write these sections or remove the
% promise from the intro.
\section{Source Code Analysis}

\subsection{Feature Matrix Overview}
% Summary of the 108-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security Vulnerabilities}
% Vulnerabilities discovered during source code review.

\section{Summary of Findings}
% Brief summary table or ranking of VPNs by key metrics. Save
% deeper interpretation for a Discussion chapter.