diff --git a/Chapters/Results.tex b/Chapters/Results.tex index 529c818..53446a5 100644 --- a/Chapters/Results.tex +++ b/Chapters/Results.tex @@ -5,24 +5,24 @@ \label{Results} This chapter presents the results of the benchmark suite across all -ten VPN implementations and the internal baseline. The structure -follows the impairment profiles from ideal to degraded: +ten VPN implementations and the internal baseline. The structure +follows the impairment profiles from ideal to degraded. Section~\ref{sec:baseline} establishes overhead under ideal -conditions, then subsequent sections examine how each VPN responds to +conditions; subsequent sections examine how each VPN responds to increasing network impairment, with source-code excerpts woven in -where they explain the measured behaviour. A recurring theme is -that no single metric captures VPN performance; the rankings shift -depending on whether one measures throughput, latency, retransmit -behavior, or real-world application performance. +where they explain the measured behaviour. No single metric +captures VPN performance. The rankings shift depending on what is +measured: throughput, latency, retransmit behaviour, or +application-level performance. -\section{Baseline Performance} +\section{Baseline performance} \label{sec:baseline} The baseline impairment profile introduces no artificial loss or reordering, so any performance gap between VPNs can be attributed to the VPN itself. Throughout the plots in this section, the -\emph{internal} bar marks a direct host-to-host connection with no VPN -in the path; it represents the best the hardware can do. On its own, +\emph{internal} bar is a direct host-to-host connection with no VPN +in the path; it is the best the hardware can do. On its own, this link delivers 934\,Mbps on a single TCP stream and a round-trip latency of just 0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a @@ -30,35 +30,29 @@ single retransmit across an entire 30-second test. 
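The overhead figures used throughout this chapter are plain ratios against the internal baseline; a minimal sketch of the arithmetic (the 863.95\,Mbps input below is back-computed from the quoted 92.5\,\% retention over the 934\,Mbps baseline, an assumption for illustration rather than a value read from a table):

```go
package main

import "fmt"

// retentionPct expresses a VPN's throughput as a percentage of the
// bare-metal internal baseline, the convention used throughout this
// chapter.
func retentionPct(vpnMbps, baselineMbps float64) float64 {
	return 100 * vpnMbps / baselineMbps
}

func main() {
	const baseline = 934.0 // internal single-stream TCP, Mbps
	// 92.5% retention over 934 Mbps corresponds to roughly 864 Mbps
	// (back-computed, not a quoted measurement).
	fmt.Printf("%.1f%%\n", retentionPct(863.95, baseline))
}
```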
Mycelium sits at the other extreme: 34.9\,ms of latency, roughly 58$\times$ the bare-metal figure. -A note on naming: ``Headscale'' in every table and figure of this -chapter labels the test scenario in which the Tailscale client -(\texttt{tailscaled}) connects to a self-hosted Headscale control -server. The data plane is therefore the Tailscale client built on -\texttt{wireguard-go}, not the Headscale binary itself, which is -only a control-plane server. Statements below about ``Headscale'' -running \texttt{wireguard-go} should be read as statements about -the Tailscale client in this scenario. -Section~\ref{sec:tailscale_degraded} covers the specifics of how -the rig launches \texttt{tailscaled} and which Tailscale code -paths that choice activates. +Throughout this chapter, ``Headscale'' labels the scenario in +which the Tailscale client (\texttt{tailscaled}, built on +\texttt{wireguard-go}) connects to a self-hosted Headscale +control server. The data plane is therefore Tailscale's, not +Headscale's; the Headscale binary itself is only a control-plane +server. Section~\ref{sec:tailscale_degraded} returns to which +Tailscale code paths the test rig actually exercises. -\subsection{Test Execution Overview} +\subsection{Test execution overview} -Running the full baseline suite across all ten VPNs and the internal -reference took just over four hours. Actual benchmark execution -consumed the bulk of that time at 2.6~hours (63\,\%). VPN -installation and deployment accounted for another 45~minutes -(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting -for VPN tunnels to come up after restarts. VPN service restarts and -traffic-control (tc) stabilization took the remainder. -Figure~\ref{fig:test_duration} breaks this down per VPN. +The full baseline suite ran in just over four hours across all ten +VPNs and the internal reference. 
Benchmark execution consumed +63\,\% of that time; VPN installation and deployment accounted for +19\,\%; the test rig spent 9\,\% waiting for tunnels to come up +after restarts. Service restarts and traffic-control (tc) +stabilization took the remainder. Figure~\ref{fig:test_duration} +breaks the time down per VPN. -Most VPNs completed every benchmark without issues, but four failed -one test each: Nebula and Headscale timed out on the qperf -QUIC performance benchmark after six retries, while Hyprspace and -Mycelium failed the UDP iPerf3 test -with a 120-second timeout. Their individual success rate is -85.7\,\%, with all other VPNs passing the full suite +Most VPNs completed every benchmark without issue. Four failed +one test each: Nebula and Headscale timed out on the qperf QUIC +benchmark after six retries; Hyprspace and Mycelium failed the +UDP iPerf3 test with a 120-second timeout. Their individual +success rate is 85.7\,\%; every other VPN passed the full suite (Figure~\ref{fig:success_rate}). \begin{figure}[H] @@ -88,7 +82,7 @@ with a 120-second timeout. Their individual success rate is \label{fig:test_overview} \end{figure} -\subsection{TCP Throughput} +\subsection{TCP throughput} Each VPN ran a single-stream iPerf3 session for 30~seconds on every link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna, @@ -144,14 +138,12 @@ Raw throughput alone is incomplete. The retransmit rate (Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts by estimated packet count, accounting for the different segment sizes each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and -Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier, -Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%). -Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and -0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. 
ZeroTier -reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating -for tunnel-internal loss through repeated TCP congestion-control -recovery; WireGuard delivers comparable throughput with effectively -zero loss. +Headscale are effectively loss-free ($<$\,0.01\,\%); Hyprspace is +the clear outlier at 0.49\,\%; the remaining VPNs spread between +these poles. The interesting case is ZeroTier: it sustains +814\,Mbps despite a 0.10\,\% retransmit rate by riding TCP +congestion-control recovery, where WireGuard delivers comparable +throughput with effectively zero loss. \begin{figure}[H] \centering @@ -170,7 +162,7 @@ Figure~\ref{fig:tcp_window} shows the raw window sizes, and Figure~\ref{fig:retransmit_correlations} plots them against retransmit rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the smallest max congestion window in the dataset (200\,KB), while -Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of +Yggdrasil's 0.09\,\% rate allows a 4.3\,MB window, the largest of any VPN. At first glance this suggests a clean inverse correlation between retransmit rate and congestion window size, but the picture is @@ -204,17 +196,13 @@ in the dataset. This points to significant in-tunnel packet loss or buffering at the VpnCloud layer that the retransmit rate (0.06\,\%) alone does not fully explain. -Variability, whether stochastic across runs or systematic across -links, also differs substantially. WireGuard's three link -directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window) -and are nearly indistinguishable. Mycelium's three directions span -122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise: -Section~\ref{sec:mycelium_routing} shows the spread is per-link -path-selection asymmetry, with one link finding a direct route and -the other two routing through the global overlay. 
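The retransmit-rate normalization used above can be sketched as follows; the inputs are illustrative round numbers, not measured values from the dataset:

```go
package main

import "fmt"

// retransmitRatePct estimates the retransmit rate as retransmits over
// estimated segments sent, where the segment count is approximated
// from bytes transferred and the negotiated segment size (1,228 to
// 32,731 bytes across this dataset). This keeps VPNs with very
// different segment sizes comparable.
func retransmitRatePct(retransmits, bytesSent, segmentBytes int) float64 {
	segments := bytesSent / segmentBytes // estimated packet count
	return 100 * float64(retransmits) / float64(segments)
}

func main() {
	// Hypothetical flow: 1 GB moved in 1,448-byte segments with 700
	// retransmits (illustrative numbers only).
	fmt.Printf("%.2f%%\n", retransmitRatePct(700, 1_000_000_000, 1448))
}
```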
Either way, a -VPN whose throughput varies that widely across links is harder to -capacity-plan around than one that delivers a consistent figure -on every direction. +Variability across links also differs substantially. WireGuard's +three link directions cluster within a 60\,Mbps band; Mycelium's +span a 3:1 ratio (122--379\,Mbps), but this is per-link +path-selection asymmetry rather than run-to-run noise +(Section~\ref{sec:mycelium_routing}). A VPN whose throughput +varies that widely across links is harder to capacity-plan around +than one that delivers a consistent figure on every direction. \begin{figure}[H] \centering @@ -292,16 +280,13 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far removed from the rest that Section~\ref{sec:mycelium_routing} gives it a dedicated analysis. -The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two -outliers among the low-latency VPNs. VpnCloud leads at -2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at -2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest -jitter in the table (0.25\,ms). Tinc and Headscale, by contrast, -stay below 1.1$\times$ with jitter under 0.09\,ms, so their packet -timing is nearly as stable as bare metal. The spikes in VpnCloud and -ZeroTier are consistent with periodic -control-plane work such as key rotation or peer heartbeats that -briefly stalls the data path. +The spike-ratio column in Table~\ref{tab:latency_baseline} exposes +two outliers among the low-latency VPNs. VpnCloud and ZeroTier +post the highest spike ratios (2.8$\times$ and 2.3$\times$) and +the highest jitter, while Tinc and Headscale stay close to +bare-metal stability. The spikes in VpnCloud and ZeroTier are +consistent with periodic control-plane events such as key +rotation or peer heartbeats that briefly stall the data path. \begin{figure}[H] \centering @@ -353,7 +338,7 @@ spot. 
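The spike ratio in Table~\ref{tab:latency_baseline} is simply the maximum observed RTT divided by the average; recomputing it from the averages and maxima quoted above reproduces both outliers:

```go
package main

import "fmt"

// spikeRatio is the maximum observed RTT over the average RTT. Values
// near 1.0 mean packet timing as stable as bare metal; values well
// above 1.0 mean intermittent stalls in the data path.
func spikeRatio(avgMs, maxMs float64) float64 {
	return maxMs / avgMs
}

func main() {
	fmt.Printf("VpnCloud %.1fx\n", spikeRatio(1.13, 3.14)) // table: 2.8x
	fmt.Printf("ZeroTier %.1fx\n", spikeRatio(1.28, 3.00)) // table: 2.3x
}
```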
\begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png} - \caption{Latency vs.\ throughput at baseline. Each point represents + \caption{Latency vs.\ throughput at baseline. Each point is one VPN. The quadrants reveal different bottleneck types: VpnCloud (low latency, moderate throughput), Tinc (low latency, low throughput, CPU-bound), Mycelium (high latency, low @@ -361,7 +346,7 @@ spot. \label{fig:latency_throughput} \end{figure} -\subsection{Parallel TCP Scaling} +\subsection{Parallel TCP scaling} The single-stream benchmark tests one link direction at a time. The @@ -383,7 +368,7 @@ Table~\ref{tab:parallel_scaling} lists the results. \centering \caption{Parallel TCP scaling at baseline. Scaling factor is the ratio of parallel to single-stream throughput. Internal's - 1.50$\times$ represents the expected scaling on this hardware.} + 1.50$\times$ is the expected scaling on this hardware.} \label{tab:parallel_scaling} \begin{tabular}{lrrr} \hline @@ -501,7 +486,7 @@ parallel load. \label{fig:parallel_tcp} \end{figure} -\subsection{UDP Stress Test} +\subsection{UDP stress test} The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}), which is a deliberate overload test rather than a realistic workload. @@ -568,12 +553,13 @@ complete the UDP test at all; both timed out after 120 seconds. % usable payload after tunnel overhead, but conflating it with path % MTU is misleading. Consider renaming to "effective payload size" % throughout. -The \texttt{blksize\_bytes} field reveals each VPN's effective UDP -payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at -2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at -1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the -smallest). These differences affect fragmentation behavior under real -workloads, particularly for protocols that send large datagrams. 
+The effective UDP payload size, reported in the +\texttt{blksize\_bytes} field, differs sharply across +implementations. Yggdrasil sends 32\,731-byte jumbo segments; +ZeroTier negotiates 2\,728 bytes; the remaining VPNs cluster +between 1\,208 (Headscale, the smallest) and 1\,448 (Internal). +These differences affect fragmentation behaviour under workloads +that send large datagrams. %TODO: Mention QUIC %TODO: Mention again that the "default" settings of every VPN have been used @@ -611,7 +597,7 @@ workloads, particularly for protocols that send large datagrams. % TODO: Compare parallel TCP retransmit rate % with single TCP retransmit rate and see what changed -\subsection{Real-World Workloads} +\subsection{Real-world workloads} Saturating a link with iPerf3 measures peak capacity, but not how a VPN performs under realistic traffic. This subsection switches to @@ -620,7 +606,7 @@ cache and streaming video over RIST. Both interact with the VPN tunnel the way real software does, through many short-lived connections, TLS handshakes, and latency-sensitive UDP packets. -\paragraph{Nix Binary Cache Downloads.} +\paragraph{Nix binary cache downloads.} This test downloads a fixed set of Nix packages through each VPN and measures the total transfer time. The results @@ -705,7 +691,7 @@ Nebula sits just below at 99.8\%, and Hyprspace's headline figure of 100\% conceals a separate failure mode discussed below. The 14--16 dropped frames that appear uniformly across every run, including Internal, are most likely encoder warm-up artefacts rather than -tunnel overhead, though we have not verified this directly. +tunnel overhead, though this has not been verified directly. % TODO: The packet-drop distribution statistics (288 mean, % 10\% median, IQR 255--330) are not shown in any figure. @@ -756,7 +742,7 @@ overwhelm FEC entirely. 
\label{fig:rist_quality} \end{figure} -\subsection{Operational Resilience} +\subsection{Operational resilience} Throughput, latency, and application performance describe how a tunnel behaves once it is up. The next question is how quickly it @@ -767,7 +753,7 @@ reboot matters as much as its peak throughput. Reboot reconnection rearranges the rankings. Hyprspace, the worst performer under sustained TCP load, recovers in just 8.7~seconds on average, faster than any other VPN. WireGuard and Nebula follow at -10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, +10.1\,s each. Nebula's consistency is striking: 10.06, 10.07, 10.07\,s across its three nodes, an exact match for Nebula's \texttt{HostUpdateNotification} interval, whose default is 10~seconds in the lighthouse protocol (configurable, but the @@ -781,8 +767,8 @@ Section~\ref{sec:mycelium_routing} argues from that uniformity that the bound is a fixed timer in the overlay protocol. Yggdrasil produces the most lopsided result in the dataset: its yuki -node is back in 7.1~seconds while lom and luna take 94.8 and -97.3~seconds respectively. Yggdrasil organises its overlay as a +node is back in 7.1~seconds while lom and luna take 97.3 and +94.8~seconds respectively. Yggdrasil organises its overlay as a distributed spanning tree rooted at the node with the highest public key: every other node picks a parent closer to the root and the whole network hangs off that parent chain. The gap likely reflects @@ -816,14 +802,14 @@ can route traffic. \label{fig:reboot_reconnection} \end{figure} -\subsection{Pathological Cases} +\subsection{Pathological cases} \label{sec:pathological} -Three VPNs exhibit behaviors that the aggregate numbers alone cannot -explain. The following subsections piece together observations from -earlier benchmarks into per-VPN diagnoses. +Hyprspace, Mycelium, and Tinc each show a pathology that the +aggregate tables flatten. The following subsections diagnose each +in turn. 
-\paragraph{Hyprspace: Buffer Bloat.} +\paragraph{Hyprspace: buffer bloat.} \label{sec:hyprspace_bloat} % TODO: The under-load latency of 2,800 ms is not shown in any plot @@ -841,7 +827,7 @@ The consequences show in every TCP metric. With 4\,965 retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than steady-state transfer. The max congestion window shrinks to -205\,KB, the smallest in the dataset. Under parallel load the +200\,KB, the smallest in the dataset. Under parallel load the situation worsens: retransmits climb to 17\,426. % TODO: The % explanation for the sender/receiver inversion (ACK delays % causing sender-side timer undercounting) is a hypothesis. Normally @@ -853,12 +839,9 @@ while the sender sees only 367.9\,Mbps, likely because massive ACK delays cause the sender-side timer to undercount the actual data rate. The UDP test never finished at all; it timed out at 120~seconds. -% Should we always use percentages for retransmits? - -What prevents Hyprspace from being entirely unusable is everything -\emph{except} sustained load. It has the fastest reboot -reconnection in the dataset (8.7\,s) and delivers 100\,\% video -quality outside of its burst events. The pathology is narrow but +Outside sustained load, Hyprspace looks fine. It has the fastest +reboot reconnection in the dataset (8.7\,s) and delivers 100\,\% +video quality between burst events. The pathology is narrow but severe: any continuous data stream saturates the tunnel's internal buffers. @@ -910,33 +893,35 @@ stack but the benchmark traffic never reaches it; that case is the subject of Section~\ref{sec:tailscale_degraded}. If gVisor is out of scope, the buffer bloat must originate -further up the Hyprspace stack instead. 
Hyprspace uses -\texttt{libp2p}, a peer-to-peer networking library, and its -\texttt{yamux} stream multiplexer, which runs many logical streams -over a single underlying connection and polices each one with a -credit-based flow-control window. The most plausible source of -the bloat is this libp2p/yamux layer, through which raw IP packets -are funnelled. Hyprspace's TUN-read loop dispatches -each outbound packet on its own goroutine, and every such -goroutine ends up in \texttt{node/node.go}'s -\texttt{sendPacket}, which keeps exactly one libp2p stream per -destination peer in \texttt{activeStreams} and guards it with a -single per-peer \texttt{sync.Mutex} -(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent -application TCP flows to the same Hyprspace neighbour therefore +further up the Hyprspace stack. Hyprspace uses \texttt{libp2p}, +a peer-to-peer networking library, and its \texttt{yamux} stream +multiplexer. Yamux runs many logical streams over a single +underlying connection and polices each one with a credit-based +flow-control window. The most plausible source of the bloat is +this libp2p/yamux layer, through which raw IP packets are +funnelled. + +Hyprspace's TUN-read loop dispatches each outbound packet on its +own goroutine, and every such goroutine ends up in +\texttt{node/node.go}'s \texttt{sendPacket}. This function keeps +exactly one libp2p stream per destination peer in +\texttt{activeStreams}, guarded by a single per-peer +\texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}). +Concurrent application TCP flows to the same Hyprspace neighbour serialise behind that one lock: the parallel iPerf3 test, which -opens multiple TCP connections to the same peer at once, -collapses to a single send pipeline at this layer. Each -goroutine waiting for the lock pins its own 1420-byte packet -buffer, and the underlying yamux session adds a per-stream -flow-control window on top. 
None of this is visible to the -kernel TCP sender that produced the inner segments: the kernel -sees only that the TUN write returned, so it keeps growing its -congestion window while the libp2p layer falls further behind. The -geometry is the textbook one for buffer bloat: a -fast producer (kernel TCP) sitting upstream of a slow, -serialised consumer (the single yamux stream per peer) with -no flow-control signal coupling the two. +opens multiple TCP connections to the same peer at once, collapses +to a single send pipeline at this layer. Each goroutine waiting +for the lock pins its own 1420-byte packet buffer, and the +underlying yamux session adds a per-stream flow-control window on +top. + +None of this is visible to the kernel TCP sender that produced +the inner segments. The kernel sees only that the TUN write +returned, so it keeps growing its congestion window while the +libp2p layer falls further behind. The geometry is the standard +shape of buffer bloat: a fast producer (kernel TCP) sitting +upstream of a slow, serialised consumer (the single yamux stream +per peer), with no flow-control signal coupling the two. \lstinputlisting[language=Go,caption={Hyprspace's outbound fast path keeps exactly one libp2p stream per destination peer @@ -987,9 +972,9 @@ reports a kernel-measured RTT that is independent of ICMP ping. For the luna$\rightarrow$lom stream, this TCP~RTT starts at 51.6\,ms and climbs to a mean of 144\,ms over the 30-second run, with -757~retransmits---the link was clearly overlay-routed -during the throughput test, even though ping had found a -direct path eight minutes earlier. For +757~retransmits. The link was clearly overlay-routed during +the throughput test, even though ping had found a direct path +eight minutes earlier. For yuki$\rightarrow$luna the reverse happened: the TCP stream measured only 12--22\,ms, and its bidirectional return path recorded 1.0\,ms, a direct LAN connection @@ -1057,17 +1042,14 @@ inversion. 
\end{figure} % TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment -% (47.3 ms) numbers are from qperf but not shown in any figure. -% Add a connection-setup latency table or plot. Also -% clarify what -% Internal's connection establishment time is (47.3 / -% 3 = 15.8 ms?) -% so the "3× overhead" can be verified. +% (47.3 ms vs.\ 16.0 ms) numbers are from qperf but not shown in +% any figure. Add a connection-setup latency table or plot. The overlay penalty shows up most clearly at connection setup. Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's 16.8\,ms, a 5.6$\times$ overhead), and connection establishment -alone costs 47.3\,ms (3$\times$ overhead). Every new connection +alone costs 47.3\,ms against Internal's 16.0\,ms (a +2.95$\times$ overhead). Every new connection incurs that overhead, so workloads dominated by short-lived connections accumulate it rapidly. Bulk downloads, by @@ -1094,13 +1076,13 @@ delay. The UDP test timed out at 120~seconds, and even first-time connectivity required a 70-second wait at startup. -\paragraph{Tinc: Userspace Processing Bottleneck.} +\paragraph{Tinc: userspace processing bottleneck.} The latency subsection already traced Tinc's 336\,Mbps ceiling to single-core CPU exhaustion. The usual network suspects do not apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its effective UDP payload size (1\,353 bytes) and its retransmit count -(240) are in the normal range. That leaves CPU: 14.9\,\% +(240) are in the normal range. That leaves CPU: 12.3\,\% whole-system utilization is what one saturated core looks like on a multi-core host, which fits a single-threaded userspace VPN. The parallel benchmark confirms the diagnosis. Tinc scales to @@ -1110,7 +1092,7 @@ the gaps a single flow would leave idle, and the extra work translates directly into extra throughput. 
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the % unresolved CPU-profiling TODO from the latency subsection -% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread +% (VpnCloud's similar 14.2\% at 539\,Mbps). If per-thread % profiling refutes the single-core story, this paragraph must % be revisited as well. @@ -1121,13 +1103,12 @@ Baseline benchmarks rank VPNs by overhead under ideal conditions. The impairment profiles in Table~\ref{tab:impairment_profiles} test a different property: resilience. Each profile applies symmetric \texttt{tc netem} -impairment to every machine. Low adds roughly 2\,ms of delay and -0.25\,\% packet loss with 0.5\,\% reordering; Medium adds -${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High -adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\% -reordering. Medium and High both use 50\,\% correlation, so -losses and reorderings are bursty rather than uniform. Two -results dominate the data. +impairment to every machine. Low adds 2\,ms of delay and +0.25\,\% packet loss with 0.5\,\% reordering; Medium adds 4\,ms +of delay and 1\,\% loss with 2.5\,\% reordering; High adds 6\,ms +of delay and 2.5\,\% loss with 5\,\% reordering. Medium and High +both use 50\,\% correlation, so losses and reorderings are bursty +rather than uniform. Two results dominate the data. % TODO: Double-check these per-profile parameters against the % canonical impairment-profile definitions in the earlier chapter % (Table~\ref{tab:impairment_profiles}). The Low/High loss and @@ -1150,9 +1131,9 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly through what turns out to be the wrong hypothesis. The investigation begins with Tailscale's much-discussed gVisor TCP stack, validates the candidate parameters in isolation on the -bare-metal host, and only then discovers, by reading the rig's -own NixOS module, that the gVisor stack is not actually in the -data path of the benchmark at all. 
The real culprit is a +bare-metal host, and only then discovers, by reading the test +rig's own NixOS module, that the gVisor stack is not actually in +the data path of the benchmark at all. The real culprit is a combination of the Linux kernel's tight default \texttt{tcp\_reordering} threshold and the way \texttt{wireguard-go} @@ -1268,7 +1249,7 @@ RTT standard deviation reaches 44.6\,ms at High, the worst jitter of any VPN. A userspace retry mechanism is the likely cause, but -without source-code evidence we cannot say so with certainty. +without source-code evidence this cannot be confirmed. % TODO: Ping packet loss data is not shown in any plot. The 1/9 % = 11.1\% interpretation is clever but depends on @@ -1342,9 +1323,9 @@ the first step (Table~\ref{tab:tcp_impairment}). \end{figure} Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low -impairment, a -98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter, -0.25\% packet loss, and 0.5\% reordering per machine. +impairment, a 98.3\% loss. The Low profile injects only modest +impairment per machine: 2\,ms of latency, 2\,ms of jitter, 0.25\% +loss, and 0.5\% reordering. Even Mycelium, the slowest VPN at baseline (259\,Mbps), retains more throughput at @@ -1427,38 +1408,28 @@ emerges from the runs that did complete. % explanation (e.g., iPerf3 crash, tc interaction, % timing issue). Three implementations maintain throughput at the profiles where -data exists. Internal holds ${\sim}$950\,Mbps at -Baseline, Medium, -and High; WireGuard sustains 850--898\,Mbps; and -Headscale sustains -700--876\,Mbps. % TODO: verify WireGuard UDP range -- -% analysis doc says 850-898, possible digit transposition -Internal and WireGuard ride the host kernel's transport-layer -backpressure (Internal directly, WireGuard via the in-kernel -WireGuard module). 
Headscale, by contrast, never -uses the kernel -module even though it builds on the WireGuard protocol: as -established in Section~\ref{sec:baseline}, Tailscale's -\texttt{magicsock} layer intercepts every packet for endpoint -selection, DERP relay, and the disco protocol, and that -interception is incompatible with the kernel WireGuard datapath. -Headscale therefore runs \texttt{wireguard-go} in userspace and -compensates with UDP batching -(\texttt{recvmmsg}/\texttt{sendmmsg}), -host-kernel UDP segmentation/aggregation offload -(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer -WireGuard socket), and a 7\,MB socket buffer on the same outer -socket. These offloads live in the host kernel; gVisor netstack -itself implements no UDP GSO or UDP GRO of its own. -Together they -absorb a \texttt{-b 0} sender flood without -collapsing. Userspace -VPNs without the same engineering do collapse: -EasyTier drops from -865 to 435 to 38.5 to 6.1\,Mbps across successive profiles. -Yggdrasil, already pathological at baseline (98.7\% -loss), crashes -to 12.3\,Mbps at Low and fails entirely at Medium and High. +data exists: Internal, WireGuard, and Headscale all sustain +several hundred Mbps where they complete (see +Figure~\ref{fig:udp_impairment_heatmap}). Internal and WireGuard +ride the host kernel's transport-layer backpressure (Internal +directly, WireGuard via the in-kernel WireGuard module). +Headscale takes a different route to the same outcome. Its +\texttt{magicsock} layer is incompatible with the kernel +WireGuard datapath, so \texttt{wireguard-go} runs in userspace +and leans on three host-kernel offloads to absorb a +\texttt{-b~0} sender flood: batched UDP I/O +(\texttt{recvmmsg} / \texttt{sendmmsg}), UDP +segmentation/aggregation offload (\texttt{UDP\_SEGMENT} / +\texttt{UDP\_GRO}) on the outer WireGuard socket, and a 7\,MiB +socket buffer on that same socket. 
Section~\ref{sec:tailscale_degraded} +returns to these mechanisms when they reappear as the explanation +for Headscale's TCP behaviour under reordering. + +Userspace VPNs without that engineering collapse. EasyTier +walks down 865, 435, 38.5, 6.1\,Mbps across the four profiles. +Yggdrasil, already pathological at baseline (98.7\,\% loss), +drops to 12.3\,Mbps at Low and fails entirely at Medium and +High. \begin{figure}[H] \centering @@ -1576,10 +1547,9 @@ concurrent streams benefits from it independently. EasyTier is the runner-up under parallel load: 473\,Mbps at Low, 51\% of its baseline. Headscale and EasyTier are the only VPNs that retain more than half their baseline parallel throughput at -Low impairment; no other implementation exceeds 30\%. -We have no -direct architectural explanation for EasyTier's resilience and -do not claim one here. +Low impairment; no other implementation exceeds 30\%. EasyTier's +resilience has no direct architectural explanation in this work, +and none is claimed here. Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\% @@ -1590,7 +1560,7 @@ loss. % TODO: DOWNSTREAM DEPENDENCY — This % under-load latency. If that diagnosis is revised, % this explanation % for parallel collapse must also be revisited. -The buffer bloat that already plagues single-stream transfers +The buffer bloat that already constrains single-stream transfers (Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six flows compete for the same bloated buffers at once. @@ -1799,33 +1769,26 @@ at Low, completes here in 170\,s. At High, only Headscale, Nebula, and Tinc survive. Internal's failure at High is the surprising one: the bare-metal baseline cannot sustain a multi-connection HTTP workload under severe degradation, while -Headscale's userspace TCP stack pulls it through. -Section~\ref{sec:tailscale_degraded} explains why. +Headscale completes in 219\,s. +Section~\ref{sec:tailscale_degraded} traces the mechanism. 
\section{Tailscale under degraded conditions} \label{sec:tailscale_degraded} -% TODO: Editorial pass needed on two chapter-wide issues before -% submission: -% (1) magicsock / wireguard-go userspace-datapath explanation is -% repeated three times in slightly different forms (once in -% baseline UDP, once in impairment UDP, once here). Consider -% introducing it once in full here, where it is load-bearing, -% and replacing the earlier occurrences with one-sentence -% forward references. -% (2) This section uses first-person plural ("we pursued", "we -% worked it out", "we ran two follow-up benchmarks") while -% the rest of the chapter is in impersonal voice. Either -% harmonise everything to one voice, or explicitly frame this -% section as a first-person narrative detour. +% TODO: The magicsock / wireguard-go userspace-datapath +% explanation is repeated three times in slightly different forms +% (once in baseline UDP, once in impairment UDP, once here). +% Consider introducing it once in full here, where it is +% load-bearing, and replacing the earlier occurrences with +% one-sentence forward references. -This section is about an observation that should not exist: -Headscale, a tunnelling VPN built on a kernel TCP stack and -\texttt{wireguard-go}, beats the bare-metal Internal baseline at -Medium impairment, and at Low impairment under parallel load -beats it by a factor of 2.6. The short answer turns out to be -different from the obvious answer, and we worked it out only by -chasing the obvious answer to its end. +Headscale, a tunnelling VPN built on \texttt{wireguard-go}, beats +the bare-metal Internal baseline at Medium impairment. Under +parallel load at Low impairment it beats Internal by a factor of +2.6. A VPN should not outperform the direct connection it tunnels +through, and the explanation took some chasing. The obvious +hypothesis was wrong, and pursuing it to its end was the only way +to find out. 
\subsection{An anomaly worth pursuing} @@ -1877,16 +1840,15 @@ comparison. \label{fig:headscale_vs_internal} \end{figure} -WireGuard-the-kernel-module is the obvious sanity -check. It uses -the same Noise/WireGuard cryptographic protocol Tailscale ships -and is the closest available comparison without the rest of -Tailscale's stack. WireGuard shows none of Headscale's -advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well -below Internal at the same profile. So the encryption layer is -not the answer, and the basic UDP tunnel is not the answer. -Whatever Headscale is doing differently lives somewhere else in -the rest of Tailscale's implementation. +The in-kernel WireGuard module is the obvious sanity check. It +uses the same Noise/WireGuard cryptographic protocol that +Tailscale embeds, and it is the closest available comparison +without the rest of Tailscale's stack. Kernel WireGuard shows +none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps at +Medium, both well below Internal at the same profile. The +encryption layer is not the answer. Neither is the basic UDP +tunnel. Whatever Headscale is doing lives somewhere else in +Tailscale's implementation. % TODO: The Medium-impairment retransmit percentages (5.2\%, % 2.4\%) are not in any table or figure. Add a retransmit @@ -1906,9 +1868,9 @@ not. \subsection{A plausible villain: Tailscale's gVisor stack} -The candidate explanation we pursued first, and the one any -reading of the upstream Tailscale documentation will lead to, -is Tailscale's userspace TCP/IP stack. The Tailscale client +The first candidate was Tailscale's userspace TCP/IP stack: +the answer any reading of the upstream Tailscale documentation +points to. The Tailscale client imports Google's gVisor netstack (\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses it as an in-process TCP implementation. The gVisor @@ -1948,9 +1910,9 @@ reordering link than the host kernel. 
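Why a reordering-tolerant TCP stack avoids spurious retransmits can be seen in a toy model. The sketch below is neither gVisor's nor the kernel's actual code; it only counts how often a duplicate-ACK-style threshold fires on a stream that is reordered but loss-free, comparing the kernel's default threshold of 3 against a loosened one:

```python
# Toy model (not gVisor's or the kernel's implementation): replay a
# reordered but loss-free segment stream and count how often a
# dupACK-style threshold would trigger a fast retransmit of a
# segment that is merely late, not lost.
def spurious_fast_retransmits(segments, threshold):
    delivered = set()
    expected = 0   # next in-order segment the receiver wants
    dupacks = 0    # consecutive out-of-order arrivals
    events = 0
    for seg in segments:
        delivered.add(seg)
        if seg == expected:
            while expected in delivered:  # ack everything now in order
                expected += 1
            dupacks = 0
        else:
            dupacks += 1
            if dupacks == threshold:      # sender declares a loss
                events += 1
    return events

# 200 in-order segments with a short reordering burst every 50
# segments: each burst displaces one segment by three slots.
stream = list(range(200))
for i in range(0, 200, 50):
    stream[i], stream[i + 3] = stream[i + 3], stream[i]

print(spurious_fast_retransmits(stream, 3))   # → 4 (one per burst)
print(spurious_fast_retransmits(stream, 30))  # → 0 (loosened threshold)
```

At the tight default every burst trips a retransmit even though nothing was lost; a looser threshold, whether set inside a userspace stack or via sysctl, absorbs the same bursts silently.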
The hypothesis follows directly: Headscale's iPerf3 traffic runs through this gVisor instance instead of through the host kernel TCP stack, and so it inherits the more -reordering-tolerant behaviour. WireGuard-the-kernel-module -shares only the cryptographic protocol; it does not include -the gVisor stack, and therefore does not get the advantage. +reordering-tolerant behaviour. Kernel WireGuard shares only +the cryptographic protocol; it does not include the gVisor +stack, and therefore does not get the advantage. The natural way to test this is to extract the parameters Tailscale sets inside gVisor, apply their @@ -1962,8 +1924,8 @@ supported. If it does not, the hypothesis fails. \subsection{Reproducing the effect on bare metal} \label{sec:tuned} -We ran two follow-up benchmarks on the same hardware and -impairment setup as the original 18.12.2025 run. +Two follow-up benchmarks ran on the same hardware and impairment +setup as the original 18.12.2025 run. \begin{itemize} \bitem{Tailscale-style (27.02.2026):} @@ -2023,43 +1985,33 @@ impairment setup as the original 18.12.2025 run. \label{fig:kernel_tuning_comparison} \end{figure} -The result felt like confirmation. Internal's -Medium-impairment throughput jumped from 29.6\,Mbps to -72.7\,Mbps under the reorder-only configuration, a 146\,\% -increase from a three-line sysctl change, and the retransmit -rate at Medium dropped from ${\sim}$2.4\,\% to -1.11\,\%, which -means more than half of the original retransmissions were -spurious. The Nix cache download at Medium roughly halved, -from 58.6\,s to 29.1\,s. +The result felt like confirmation. Three sysctls raised +Internal's Medium-impairment throughput by 146\,\% and halved its +Nix cache download time +(Table~\ref{tab:kernel_tuning_internal}). The retransmit rate at +Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means +more than half of the original retransmissions were spurious. Parallel TCP gained even more. 
Internal at Low climbed from -277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's -old single-stream best and overtakes Headscale's original -718\,Mbps from the unmodified run. % -% TODO: DOWNSTREAM -% DEPENDENCY — "six concurrent flows" inherits -% the unresolved -% 6-vs-10 stream count from the baseline parallel test -% description. Update when that TODO is resolved. -Each of the six concurrent flows benefits independently from -the higher reordering threshold, and the gains compound. +277\,Mbps to 902\,Mbps, a 226\,\% increase that exceeded +Internal's old single-stream best and overtook Headscale's +original 718\,Mbps from the unmodified run. +% TODO: DOWNSTREAM DEPENDENCY -- "six concurrent flows" +% inherits the unresolved 6-vs-10 stream count from the baseline +% parallel test description. Update when that TODO is resolved. +Each of the six concurrent flows benefits independently from the +higher reordering threshold, and the gains compound. % TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are % not in any table. Add a table showing Headscale's results -% from the follow-up runs alongside Internal's so -% readers can +% from the follow-up runs alongside Internal's so readers can % verify the reversal. -Headscale itself, retested with the same sysctls, -gained more -modestly: +21\,\% at Medium and a small $-$5\,\% wobble at -Low. And the anomaly reversed entirely. At Medium, tuned -Internal reached 72.7\,Mbps against Headscale's 50.1\,Mbps — -a 45\,\% lead for Internal where the original run -had Headscale -40\,\% ahead. The Nix cache flipped the same way: Internal -completed in 29.1\,s against Headscale's 36.3\,s, where the -original had Headscale 17\,\% faster. +Headscale, retested with the same sysctls, gained more modestly: ++21\,\% at Medium and a small $-$5\,\% wobble at Low. And the +anomaly reversed entirely +(Figure~\ref{fig:headscale_gap_reversal}). 
Tuned Internal now +leads Headscale at Medium across every metric, where the +original run had Headscale ahead. \begin{figure}[H] \centering @@ -2088,11 +2040,11 @@ collapsed to three host-kernel sysctls: \texttt{tcp\_early\_retrans}. At this point in the investigation the hypothesis seemed -settled. Tailscale's gVisor stack ships with -these overrides; -the bare-metal kernel ships with stricter defaults; matching -the kernel to gVisor reproduces the effect. Then we checked -which Tailscale code path the test rig was actually running. +settled. Tailscale's gVisor stack applies these overrides; +the bare-metal kernel uses stricter defaults; matching +the kernel to gVisor reproduces the effect. The remaining +question was which Tailscale code path the test rig was actually +running. \subsection{The data path that was not there} \label{sec:gvisor_not_in_path} @@ -2168,9 +2120,9 @@ the gVisor TCP business at all. The puzzle the investigation began with has not gone away. Headscale starts at 41.5\,Mbps where Internal starts at 29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel -TCP stack. Whatever Headscale is doing (partially, weakly, but -reproducibly) is worth roughly twelve megabits per second on the -Medium profile, and it is not gVisor netstack. +TCP stack. Whichever mechanism Headscale relies on (partially, +weakly, but reproducibly) is worth roughly twelve megabits per +second on the Medium profile, and it is not gVisor netstack. The +21\,\% sysctl gain for Headscale itself is also informative about the size of the mechanism. If the gain were 0\,\%, @@ -2182,7 +2134,7 @@ that the two effects are not fully additive. Two features of the \texttt{wireguard-go} data-plane pipeline are the most likely candidates, and both live on the kernel-TUN path -that Tailscale actually uses in the rig. +that Tailscale actually uses in the test rig. The first is TUN TCP and UDP generic receive offload. 
Tailscale's \texttt{tstun} wrapper enables both on the kernel TUN device on @@ -2248,32 +2200,38 @@ Hyprspace cannot be used as a negative control for any of this. It does import gVisor netstack, but only for its in-VPN service-network feature, and the Hyprspace benchmark traffic goes through a kernel TUN exactly like Headscale's -(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the -wireguard-go pipeline (TUN GRO and the 7\,MiB outer-UDP buffer), -not on whether gVisor handles their inner TCP. The gVisor angle +(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in the +\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP +buffer), not in whether gVisor handles their inner TCP. The gVisor angle simply does not apply to either of them in this benchmark. The kernel-side picture closes the loop. Three host-kernel TCP parameters dominate the bare-metal behaviour the benchmarks -expose. \texttt{net.ipv4.tcp\_reordering} (default 3) is the -number of out-of-order segments the kernel will tolerate before -declaring fast retransmit, and with \texttt{tc netem} injecting -0.5--2.5\,\% reordering per machine, bursts of several reordered -packets are frequent enough that the threshold is repeatedly -tripped on the bare-metal path. \texttt{net.ipv4.tcp\_recovery} -(default \texttt{1}, RACK enabled) adds time-based reordering -detection on top of the segment-count threshold, which compounds -the spurious retransmits when reordering is high. And -\texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail -Loss Probe enabled) fires speculative retransmits when -unacknowledged segments sit at the tail of a transmission window, -which interacts poorly with an already-impaired link. Loosening -any one of the three softens the kernel's loss detection on the +expose: + +\begin{description} + \item[\texttt{tcp\_reordering} (default 3)] The number of + out-of-order segments the kernel tolerates before declaring + fast retransmit. 
With \texttt{tc~netem} injecting 0.5--2.5\,\% + reordering per machine, bursts of several reordered packets + repeatedly trip this threshold on the bare-metal path. + \item[\texttt{tcp\_recovery} (default \texttt{1}, RACK enabled)] + Adds time-based reordering detection on top of the + segment-count threshold, which amplifies spurious retransmits + when reordering is high. + \item[\texttt{tcp\_early\_retrans} (default \texttt{3}, TLP + enabled)] Fires speculative retransmits when unacknowledged + segments sit at the tail of the transmission window, which + interacts poorly with an already-impaired link. +\end{description} + +\noindent +Loosening any one softens the kernel's loss detection on the bare-metal path; loosening all three recovers most of the throughput. The Headscale path reaches the same kernel TCP stack -but is already feeding it the GRO-coalesced, buffer-cushioned -stream described above, so the kernel's tight defaults fire less -often there to begin with. +but is already feeding it a GRO-coalesced, buffer-cushioned +stream, so the kernel's tight defaults fire less often there to +begin with. The same logic explains the anomaly's shape across profiles. At baseline there is no reordering, so the kernel's tight @@ -2318,31 +2276,66 @@ any Linux host and entirely independent of any VPN. The less durable finding, and the one that motivated this section, is that Tailscale's much-discussed userspace TCP stack is not in the data path for the workload that exposed the anomaly. The -advantage we attributed to it comes from a more ordinary place: -the way \texttt{wireguard-go} batches and coalesces packets +advantage initially attributed to it comes from a more ordinary +place: the way \texttt{wireguard-go} batches and coalesces packets between the wire and the kernel TCP stack, and the larger UDP -buffer it pins on its outer socket. 
We were chasing the wrong
-hypothesis with the right experiment, and the experiment turned
-out to be more useful than the hypothesis.
+buffer it pins on its outer socket. The investigation chased the
+wrong hypothesis with the right experiment, and the experiment
+proved more useful than the hypothesis.

-% TODO: These sections are empty stubs but the chapter
-% introduction (line 12--13) promises "findings from the source
-% code analysis." Either write these sections or remove the
-% promise from the intro.
+\section{Summary}
+\label{sec:results_summary}

-\section{Source code analysis}

+Four findings hold together across all four impairment profiles.

-\subsection{Feature matrix overview}

+At baseline, the throughput hierarchy splits into three tiers
+separated by natural gaps in the data: WireGuard, ZeroTier,
+Headscale, and Yggdrasil at the top ($>$\,80\,\% of bare metal);
+Nebula, EasyTier, and VpnCloud in the middle (55--80\,\%);
+Hyprspace, Tinc, and Mycelium at the bottom ($<$\,40\,\%).
+Latency rearranges the rankings: VpnCloud is the fastest VPN at
+1.13\,ms despite mid-tier throughput, Tinc has low latency but
+single-core CPU caps its bulk transfer rate, and Mycelium's
+34.9\,ms average is an outlier driven by Babel's overlay
+routing, not by tunnel overhead.

-% Summary of the 108-feature matrix across all ten VPNs.
-% Highlight key architectural differences that explain
-% performance results.
+Under impairment, the hierarchy collapses. At High impairment
+the spread between fastest and slowest implementation compresses
+from 675\,Mbps to under 3\,Mbps; the impairment profile itself
+becomes the bottleneck. Three pathologies stand out at the
+intermediate profiles. Yggdrasil's 32\,KB jumbo overlay MTU,
+which inflates its baseline numbers, becomes a liability at
+Low impairment: a single lost outer packet costs roughly
+24$\times$ more retransmitted inner data than a standard-MTU VPN
+would lose, and throughput drops from 795 to 13\,Mbps.
+Hyprspace's libp2p/yamux send pipeline serialises concurrent
+flows behind a per-peer mutex; under any sustained load the
+pipeline backs up and ping latency balloons by three orders of
+magnitude. Headscale's RIST video quality stays at 13\,\% across
+every profile, almost certainly because of MTU fragmentation in
+the DERP relay layer; the failure is profile-independent because
+it is structural.

-\subsection{Security vulnerabilities}

+Headscale's apparent lead over the bare-metal Internal baseline
+at Medium impairment turns out not to come from Tailscale's
+gVisor TCP stack, which is not in the data path of the
+benchmark. It comes from \texttt{wireguard-go}'s TUN GRO
+coalescing and the 7\,MiB outer-UDP socket buffer that
+\texttt{magicsock} pins, both of which feed the host kernel TCP
+stack a smoother input than the bare-metal path receives. The
+underlying cause is a host-kernel one: the default
+\texttt{tcp\_reordering=3} threshold is too tight for the kind
+of bursty, correlated reordering \texttt{tc netem} produces, and
+costs the bare-metal host more than half its achievable
+throughput. Three sysctl lines repair it, and the fix is
+portable to any Linux host, independent of any VPN.

-% Vulnerabilities discovered during source code review.
-
-\section{Summary of findings}
-
-% Brief summary table or ranking of VPNs by key metrics. Save
-% deeper interpretation for a Discussion chapter.
+A ranking by a single metric would obscure all of this. The most
+useful one-sentence summary is therefore that no VPN dominates
+across throughput, latency, application-level workloads, and
+operational resilience together; each implementation makes
+trade-offs that surface only when the workload changes.
+WireGuard comes closest to a default recommendation for
+performance-critical use; Headscale is the most robust under
+adverse network conditions; the others occupy specific niches
+that the per-section analyses describe.
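For concreteness, the three-line kernel-side fix referred to above can be written as a persistent sysctl fragment. The parameter names and defaults are the real Linux knobs discussed in this chapter; the loosened values below are illustrative assumptions in the direction the text describes, not the tuned run's exact configuration.

```ini
# /etc/sysctl.d/99-reorder-tolerant.conf  (illustrative values only)

# Tolerate longer out-of-order bursts before fast retransmit (default 3)
net.ipv4.tcp_reordering = 30

# Disable RACK time-based loss detection (default 1, enabled)
net.ipv4.tcp_recovery = 0

# Disable early retransmit / tail loss probes (default 3, enabled)
net.ipv4.tcp_early_retrans = 0
```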