Compare commits


1 commit

Author SHA1 Message Date
Luis a3c533b58f improved dense number paragraphs 2026-04-28 09:29:56 +02:00
+299 -306
@@ -6,23 +6,23 @@
This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded.
Section~\ref{sec:baseline} establishes overhead under ideal
conditions; subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behaviour. No single metric
captures VPN performance. The rankings shift depending on what is
measured: throughput, latency, retransmit behaviour, or
application-level performance.

\section{Baseline performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar is a direct host-to-host connection with no VPN
in the path; it is the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
@@ -30,35 +30,29 @@ single retransmit across an entire 30-second test. Mycelium sits at
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
bare-metal figure.

Throughout this chapter, ``Headscale'' labels the scenario in
which the Tailscale client (\texttt{tailscaled}, built on
\texttt{wireguard-go}) connects to a self-hosted Headscale
control server. The data plane is therefore Tailscale's, not
Headscale's; the Headscale binary itself is only a control-plane
server. Section~\ref{sec:tailscale_degraded} returns to which
Tailscale code paths the test rig actually exercises.

\subsection{Test execution overview}

The full baseline suite ran in just over four hours across all ten
VPNs and the internal reference. Benchmark execution consumed
63\,\% of that time; VPN installation and deployment accounted for
19\,\%; the test rig spent 9\,\% waiting for tunnels to come up
after restarts. Service restarts and traffic-control (tc)
stabilization took the remainder. Figure~\ref{fig:test_duration}
breaks the time down per VPN.

Most VPNs completed every benchmark without issue. Four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
benchmark after six retries; Hyprspace and Mycelium failed the
UDP iPerf3 test with a 120-second timeout. Their individual
success rate is 85.7\,\%; every other VPN passed the full suite
(Figure~\ref{fig:success_rate}).
\begin{figure}[H]
@@ -88,7 +82,7 @@ with a 120-second timeout. Their individual success rate is
\label{fig:test_overview}
\end{figure}
\subsection{TCP throughput}
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
@@ -144,14 +138,12 @@ Raw throughput alone is incomplete. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%); Hyprspace is
the clear outlier at 0.49\,\%; the remaining VPNs spread between
these poles. The interesting case is ZeroTier: it sustains
814\,Mbps despite a 0.10\,\% retransmit rate by riding TCP
congestion-control recovery, where WireGuard delivers comparable
throughput with effectively zero loss.
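The normalization behind that figure can be made concrete with a small sketch. The helper and its inputs are illustrative, not the rig's actual analysis code: the raw retransmit count is divided by an estimated segment count derived from bytes sent and the negotiated segment size.

```go
package main

import "fmt"

// retransmitRate normalizes a raw retransmit count by the estimated
// number of segments sent, so VPNs negotiating different segment
// sizes (1228--32731 bytes here) can be compared directly.
// The returned value is a percentage.
func retransmitRate(retransmits, bytesSent, segmentSize int64) float64 {
	if bytesSent == 0 || segmentSize == 0 {
		return 0
	}
	segments := float64(bytesSent) / float64(segmentSize)
	return float64(retransmits) / segments * 100
}

func main() {
	// Illustrative numbers only: ~30 s at ~800 Mbps is ~3 GB.
	fmt.Printf("%.2f%%\n", retransmitRate(4965, 3_000_000_000, 1420)) // prints 0.24%
}
```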
\begin{figure}[H]
\centering
@@ -170,7 +162,7 @@ Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.3\,MB window, the largest of
any VPN. At
first glance this suggests a clean inverse correlation between
retransmit rate and congestion window size, but the picture is
@@ -204,17 +196,13 @@ in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.

Variability across links also differs substantially. WireGuard's
three link directions cluster within a 60\,Mbps band; Mycelium's
span a 3:1 ratio (122--379\,Mbps), but this is per-link
path-selection asymmetry rather than run-to-run noise
(Section~\ref{sec:mycelium_routing}). A VPN whose throughput
varies that widely across links is harder to capacity-plan around
than one that delivers a consistent figure on every direction.
\begin{figure}[H]
\centering
@@ -292,16 +280,13 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far
removed from the rest that Section~\ref{sec:mycelium_routing} gives
it a dedicated analysis.

The spike-ratio column in Table~\ref{tab:latency_baseline} exposes
two outliers among the low-latency VPNs. VpnCloud and ZeroTier
post the highest spike ratios (2.8$\times$ and 2.3$\times$) and
the highest jitter, while Tinc and Headscale stay close to
bare-metal stability. The spikes in VpnCloud and ZeroTier are
consistent with periodic control-plane events such as key
rotation or peer heartbeats that briefly stall the data path.
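As a sketch of how the table's columns can be derived from raw RTT samples. The sample series and the jitter definition used here (mean absolute deviation) are assumptions for illustration; only the spike-ratio formula (max over average) mirrors the text.

```go
package main

import (
	"fmt"
	"math"
)

// summarize reduces a series of RTT samples to the columns discussed
// above: average, spike ratio (max/avg), and jitter, taken here as
// the mean absolute deviation from the average.
func summarize(rttMs []float64) (avg, spikeRatio, jitter float64) {
	var sum, max float64
	for _, v := range rttMs {
		sum += v
		if v > max {
			max = v
		}
	}
	avg = sum / float64(len(rttMs))
	var dev float64
	for _, v := range rttMs {
		dev += math.Abs(v - avg)
	}
	return avg, max / avg, dev / float64(len(rttMs))
}

func main() {
	// A mostly flat series with one control-plane-style spike.
	avg, ratio, jit := summarize([]float64{1.1, 1.1, 1.2, 1.1, 3.0, 1.1})
	fmt.Printf("avg %.2f ms, spike %.1fx, jitter %.2f ms\n", avg, ratio, jit)
}
```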
\begin{figure}[H]
\centering
@@ -353,7 +338,7 @@ spot.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
\caption{Latency vs.\ throughput at baseline. Each point is
one VPN. The quadrants reveal different bottleneck types:
VpnCloud (low latency, moderate throughput), Tinc (low latency,
low throughput, CPU-bound), Mycelium (high latency, low
@@ -361,7 +346,7 @@ spot.
\label{fig:latency_throughput}
\end{figure}
\subsection{Parallel TCP scaling}
The single-stream benchmark tests one link direction at a time.
The
@@ -383,7 +368,7 @@ Table~\ref{tab:parallel_scaling} lists the results.
\centering
\caption{Parallel TCP scaling at baseline. Scaling factor is the
ratio of parallel to single-stream throughput. Internal's
1.50$\times$ is the expected scaling on this hardware.}
\label{tab:parallel_scaling}
\begin{tabular}{lrrr}
\hline
@@ -501,7 +486,7 @@ parallel load.
\label{fig:parallel_tcp}
\end{figure}
\subsection{UDP stress test}
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
@@ -568,12 +553,13 @@ complete the UDP test at all; both timed out after 120 seconds.
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.

The effective UDP payload size, reported in the
\texttt{blksize\_bytes} field, differs sharply across
implementations. Yggdrasil sends 32\,731-byte jumbo segments;
ZeroTier negotiates 2\,728 bytes; the remaining VPNs cluster
between 1\,208 (Headscale, the smallest) and 1\,448 (Internal).
These differences affect fragmentation behaviour under workloads
that send large datagrams.
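A back-of-the-envelope sketch of that fragmentation effect, using the payload sizes above. The 64\,000-byte datagram is an arbitrary illustrative size, not one from the benchmark.

```go
package main

import "fmt"

// fragments returns how many tunnel-level packets a datagram of the
// given size needs when the VPN's effective payload per packet is
// payloadBytes. Purely illustrative arithmetic.
func fragments(datagramBytes, payloadBytes int) int {
	return (datagramBytes + payloadBytes - 1) / payloadBytes // ceiling division
}

func main() {
	const datagram = 64_000 // a large application datagram
	for _, vpn := range []struct {
		name    string
		payload int
	}{
		{"Yggdrasil", 32731},
		{"Internal", 1448},
		{"Headscale", 1208},
	} {
		fmt.Printf("%-10s %d fragments\n", vpn.name, fragments(datagram, vpn.payload))
	}
}
```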
%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
@@ -611,7 +597,7 @@ workloads, particularly for protocols that send large datagrams.
% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed
\subsection{Real-world workloads}
Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
@@ -620,7 +606,7 @@ cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.
\paragraph{Nix binary cache downloads.}
This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
@@ -705,7 +691,7 @@ Nebula sits just below at 99.8\%, and Hyprspace's headline figure
of 100\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run, including
Internal, are most likely encoder warm-up artefacts rather than
tunnel overhead, though this has not been verified directly.
% TODO: The packet-drop distribution statistics (288 mean, % TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure. % 10\% median, IQR 255--330) are not shown in any figure.
@@ -756,7 +742,7 @@ overwhelm FEC entirely.
\label{fig:rist_quality}
\end{figure}
\subsection{Operational resilience}
Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
@@ -767,7 +753,7 @@ reboot matters as much as its peak throughput.
Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.07,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
@@ -781,8 +767,8 @@ Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 97.3 and
94.8~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
@@ -816,14 +802,14 @@ can route traffic.
\label{fig:reboot_reconnection}
\end{figure}
\subsection{Pathological cases}
\label{sec:pathological}

Hyprspace, Mycelium, and Tinc each show a pathology that the
aggregate tables flatten. The following subsections diagnose each
in turn.
\paragraph{Hyprspace: buffer bloat.}
\label{sec:hyprspace_bloat}
% TODO: The under-load latency of 2,800 ms is not shown in any plot % TODO: The under-load latency of 2,800 ms is not shown in any plot
@@ -841,7 +827,7 @@ The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
200\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays % explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally % causing sender-side timer undercounting) is a hypothesis. Normally
@@ -853,12 +839,9 @@ while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.

Outside sustained load, Hyprspace looks fine. It has the fastest
reboot reconnection in the dataset (8.7\,s) and delivers 100\,\%
video quality between burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.
@@ -910,33 +893,35 @@ stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.

If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack. Hyprspace uses \texttt{libp2p},
a peer-to-peer networking library, and its \texttt{yamux} stream
multiplexer. Yamux runs many logical streams over a single
underlying connection and polices each one with a credit-based
flow-control window. The most plausible source of the bloat is
this libp2p/yamux layer, through which raw IP packets are
funnelled.

Hyprspace's TUN-read loop dispatches each outbound packet on its
own goroutine, and every such goroutine ends up in
\texttt{node/node.go}'s \texttt{sendPacket}. This function keeps
exactly one libp2p stream per destination peer in
\texttt{activeStreams}, guarded by a single per-peer
\texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}).
Concurrent application TCP flows to the same Hyprspace neighbour
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once, collapses
to a single send pipeline at this layer. Each goroutine waiting
for the lock pins its own 1420-byte packet buffer, and the
underlying yamux session adds a per-stream flow-control window on
top.

None of this is visible to the kernel TCP sender that produced
the inner segments. The kernel sees only that the TUN write
returned, so it keeps growing its congestion window while the
libp2p layer falls further behind. The geometry is the standard
shape of buffer bloat: a fast producer (kernel TCP) sitting
upstream of a slow, serialised consumer (the single yamux stream
per peer), with no flow-control signal coupling the two.
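The fast-producer/slow-consumer geometry can be reproduced in a deliberately simplified model. This is not Hyprspace's code: the peer type, the counter, and the stall mechanism are invented for the illustration; only the one-mutex-per-peer shape mirrors the diagnosis.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

const packetSize = 1420 // bytes pinned per in-flight packet

// peer mirrors the shape of the bottleneck: one stream to the
// destination, guarded by one mutex.
type peer struct{ mu sync.Mutex }

// pinnedUnderStall launches n producer goroutines (one per outbound
// packet, as in a TUN-read loop) against a peer whose lock is held,
// i.e. whose consumer is stalled, and reports how many packet
// buffers end up pinned simultaneously.
func pinnedUnderStall(n int) (buffers, bytes int64) {
	var (
		p       peer
		waiting atomic.Int64
		wg      sync.WaitGroup
	)
	p.mu.Lock() // stalled consumer: nothing drains the stream
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, packetSize) // pinned while we wait
			waiting.Add(1)
			p.mu.Lock() // every packet to this peer serialises here
			_ = buf     // a real sender would write buf to the stream
			p.mu.Unlock()
			waiting.Add(-1)
		}()
	}
	for int(waiting.Load()) < n { // wait until all producers are parked
		runtime.Gosched()
	}
	buffers = waiting.Load()
	bytes = buffers * packetSize
	p.mu.Unlock() // release the consumer: the backlog drains serially
	wg.Wait()
	return buffers, bytes
}

func main() {
	b, by := pinnedUnderStall(100)
	fmt.Printf("pinned: %d buffers, %d bytes\n", b, by) // 100 buffers, 142000 bytes
}
```

Every producer returns from its "send" only after its turn at the lock, exactly the coupling the kernel TCP sender never sees.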
\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
@@ -987,9 +972,9 @@ reports a kernel-measured RTT that is independent of
ICMP ping. For the luna$\rightarrow$lom stream, this
TCP~RTT starts at 51.6\,ms and climbs to a mean of
144\,ms over the 30-second run, with
757~retransmits. The link was clearly overlay-routed during
the throughput test, even though ping had found a direct path
eight minutes earlier. For
yuki$\rightarrow$luna the reverse happened: the TCP
stream measured only 12--22\,ms, and its bidirectional
return path recorded 1.0\,ms, a direct LAN connection
@@ -1057,17 +1042,14 @@ inversion.
\end{figure}
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms vs.\ 16.0 ms) numbers are from qperf but not shown in
% any figure. Add a connection-setup latency table or plot.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms
(vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms against Internal's 16.0\,ms (a
2.95$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by
short-lived connections accumulate it rapidly. Bulk
downloads, by
@@ -1094,13 +1076,13 @@ delay.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: userspace processing bottleneck.}
The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 12.3\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
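The arithmetic behind that reading is simple: one pegged core shows up system-wide as $100/N$ percent on an $N$-core host. The core count below is an assumption for illustration; the chapter does not restate the host's core count here.

```go
package main

import "fmt"

// oneCoreShare returns the whole-system CPU utilization (in percent)
// that a single fully saturated core produces on an n-core host.
func oneCoreShare(cores int) float64 {
	return 100.0 / float64(cores)
}

func main() {
	// Assuming an 8-core host (an assumption, not a measured spec):
	// one pegged core reads as 12.5% system-wide, close to the
	// 12.3% measured for Tinc.
	fmt.Printf("%.1f%%\n", oneCoreShare(8)) // prints 12.5%
}
```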
The parallel benchmark confirms the diagnosis. Tinc scales to
@@ -1110,7 +1092,7 @@ the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's similar 14.2\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
@@ -1121,13 +1103,12 @@ Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds 4\,ms
of delay and 1\,\% loss with 2.5\,\% reordering; High adds 6\,ms
of delay and 2.5\,\% loss with 5\,\% reordering. Medium and High
both use 50\,\% correlation, so losses and reorderings are bursty
rather than uniform. Two results dominate the data.
% TODO: Double-check these per-profile parameters against the % TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter % canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and % (Table~\ref{tab:impairment_profiles}). The Low/High loss and
@@ -1150,9 +1131,9 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers, by reading the test
rig's own NixOS module, that the gVisor stack is not actually in
the data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way \texttt{tcp\_reordering} threshold and the way
\texttt{wireguard-go} \texttt{wireguard-go}
RTT standard deviation reaches 44.6\,ms at High, the worst jitter
of any VPN. A userspace retry mechanism is the likely cause, but
without source-code evidence this cannot be confirmed.
% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on
the first step (Table~\ref{tab:tcp_impairment}).
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low
impairment, a 98.3\% loss. The Low profile injects only modest
impairment per machine: 2\,ms of latency, 2\,ms of jitter,
0.25\% loss, and 0.5\% reordering. Even Mycelium, the slowest
VPN at baseline (259\,Mbps), retains more throughput at
emerges from the runs that did complete.
% explanation (e.g., iPerf3 crash, tc interaction,
% timing issue).

Three implementations maintain throughput at the profiles where
data exists: Internal, WireGuard, and Headscale all sustain
several hundred Mbps where they complete (see
Figure~\ref{fig:udp_impairment_heatmap}). Internal and WireGuard
ride the host kernel's transport-layer backpressure (Internal
directly, WireGuard via the in-kernel WireGuard module).
Headscale takes a different route to the same outcome. Its
\texttt{magicsock} layer is incompatible with the kernel
WireGuard datapath, so \texttt{wireguard-go} runs in userspace
and leans on three host-kernel offloads to absorb a
\texttt{-b~0} sender flood: batched UDP I/O
(\texttt{recvmmsg} / \texttt{sendmmsg}), UDP
segmentation/aggregation offload (\texttt{UDP\_SEGMENT} /
\texttt{UDP\_GRO}) on the outer WireGuard socket, and a 7\,MiB
socket buffer on that same socket.
Section~\ref{sec:tailscale_degraded} returns to these mechanisms
when they reappear as the explanation for Headscale's TCP
behaviour under reordering.

Userspace VPNs without that engineering collapse. EasyTier
walks down 865, 435, 38.5, 6.1\,Mbps across the four profiles.
Yggdrasil, already pathological at baseline (98.7\,\% loss),
drops to 12.3\,Mbps at Low and fails entirely at Medium and
High.
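The offloads described above are ordinary Linux socket options rather than anything exotic. The following Python sketch shows the two socket-level knobs in isolation; it is illustrative only (wireguard-go's actual implementation is in Go), and the numeric option constants are the Linux values from \texttt{linux/udp.h}.

```python
import socket

# Linux UDP socket options from <linux/udp.h>.
UDP_SEGMENT = 103  # UDP GSO: kernel splits one large send into datagrams
UDP_GRO = 104      # UDP GRO: kernel coalesces datagram bursts on receive

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Request a 7 MiB receive buffer, as described for the outer
# WireGuard socket. Without privileges the kernel silently caps
# the request at net.core.rmem_max, so read back what was granted.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 7 * 1024 * 1024)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Enable UDP GSO with a 1400-byte segment size and UDP GRO on
# receive. Pre-4.18 kernels and non-Linux hosts raise OSError.
try:
    sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1400)
    sock.setsockopt(socket.IPPROTO_UDP, UDP_GRO, 1)
    gso_available = True
except OSError:
    gso_available = False
```

The point of the batching is amortisation: one syscall and one trip through the network stack move many datagrams, which is exactly what an unthrottled \texttt{-b~0} sender requires.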
\begin{figure}[H]
\centering
concurrent streams benefits from it independently.

EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs
that retain more than half their baseline parallel throughput at
Low impairment; no other implementation exceeds 30\%. EasyTier's
resilience has no direct architectural explanation in this work,
and none is claimed here.
Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a
99.6\% loss. % TODO: DOWNSTREAM DEPENDENCY -- This
% under-load latency. If that diagnosis is revised,
% this explanation
% for parallel collapse must also be revisited.
The buffer bloat that already constrains single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.
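For reference, the two client-side iPerf3 invocations discussed in this chapter can be sketched as follows. The peer host name, the 30\,s duration, and the \texttt{-P~6} stream count are assumptions (the stream count has an open TODO elsewhere in the text); only the \texttt{-b~0} unthrottled-UDP flag is stated explicitly in the analysis.

```shell
# Sketch of the iPerf3 workloads (host, duration, and stream
# count are assumptions, not the rig's recorded invocation).
iperf3 -c peer1 -P 6 -t 30       # parallel TCP, six concurrent streams
iperf3 -c peer1 -u -b 0 -t 30    # UDP, unthrottled sender flood
```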
at Low, completes here in 170\,s. At High, only Headscale,
Nebula, and Tinc survive. Internal's failure at High is the
surprising one: the bare-metal baseline cannot sustain a
multi-connection HTTP workload under severe degradation, while
Headscale completes in 219\,s.
Section~\ref{sec:tailscale_degraded} traces the mechanism.
\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}
% TODO: The magicsock / wireguard-go userspace-datapath
% explanation is repeated three times in slightly different forms
% (once in baseline UDP, once in impairment UDP, once here).
% Consider introducing it once in full here, where it is
% load-bearing, and replacing the earlier occurrences with
% one-sentence forward references.

Headscale, a tunnelling VPN built on \texttt{wireguard-go}, beats
the bare-metal Internal baseline at Medium impairment. Under
parallel load at Low impairment it beats Internal by a factor of
2.6. A VPN should not outperform the direct connection it tunnels
through, and the explanation took some chasing. The obvious
hypothesis was wrong, and pursuing it to its end was the only way
to find out.
\subsection{An anomaly worth pursuing}
comparison.
\label{fig:headscale_vs_internal}
\end{figure}
The in-kernel WireGuard module is the obvious sanity check. It
uses the same Noise/WireGuard cryptographic protocol that
Tailscale embeds, and it is the closest available comparison
without the rest of Tailscale's stack. Kernel WireGuard shows
none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps
at Medium, both well below Internal at the same profile. The
encryption layer is not the answer. Neither is the basic UDP
tunnel. Whatever Headscale is doing lives somewhere else in
Tailscale's implementation.
% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit
not.

\subsection{A plausible villain: Tailscale's gVisor stack}
The first candidate was Tailscale's userspace TCP/IP stack:
the answer any reading of the upstream Tailscale documentation
points to. The Tailscale client imports Google's gVisor netstack
(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses
it as an in-process TCP implementation. The gVisor
reordering link than the host kernel. The hypothesis follows
directly: Headscale's iPerf3 traffic
runs through this gVisor instance instead of through the host
kernel TCP stack, and so it inherits the more
reordering-tolerant behaviour. Kernel WireGuard shares only
the cryptographic protocol; it does not include the gVisor
stack, and therefore does not get the advantage.
The natural way to test this is to extract
the parameters Tailscale sets inside gVisor, apply their
supported. If it does not, the hypothesis fails.

\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}

Two follow-up benchmarks ran on the same hardware and impairment
setup as the original 18.12.2025 run.
\begin{itemize}
\bitem{Tailscale-style (27.02.2026):}
\label{fig:kernel_tuning_comparison}
\end{figure}
The result felt like confirmation. Three sysctls raised
Internal's Medium-impairment throughput by 146\,\% and halved its
Nix cache download time
(Table~\ref{tab:kernel_tuning_internal}). The retransmit rate at
Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means
more than half of the original retransmissions were spurious.
Parallel TCP gained even more. Internal at Low climbed from
277\,Mbps to 902\,Mbps, a 226\,\% increase that exceeded
Internal's old single-stream best and overtook Headscale's
original 718\,Mbps from the unmodified run.
% TODO: DOWNSTREAM DEPENDENCY -- "six concurrent flows"
% inherits the unresolved 6-vs-10 stream count from the baseline
% parallel test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.
% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
% not in any table. Add a table showing Headscale's results
% from the follow-up runs alongside Internal's so readers can
% verify the reversal.
Headscale, retested with the same sysctls, gained more modestly:
+21\,\% at Medium and a small $-$5\,\% wobble at Low. And the
anomaly reversed entirely
(Figure~\ref{fig:headscale_gap_reversal}). Tuned Internal now
leads Headscale at Medium across every metric, where the
original run had Headscale ahead.
\begin{figure}[H]
\centering
collapsed to three host-kernel sysctls: \texttt{tcp\_reordering},
\texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.
At this point in the investigation the hypothesis seemed
settled. Tailscale's gVisor stack applies these overrides;
the bare-metal kernel uses stricter defaults; matching
the kernel to gVisor reproduces the effect. The remaining
question was which Tailscale code path the test rig was actually
running.
\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}
the gVisor TCP business at all.

The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
TCP stack. Whichever mechanism Headscale relies on (partially,
weakly, but reproducibly) is worth roughly twelve megabits per
second on the Medium profile, and it is not gVisor netstack.
The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%,
that the two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are
the most likely candidates, and both live on the kernel-TUN path
that Tailscale actually uses in the test rig.
The first is TUN TCP and UDP generic receive offload. Tailscale's
\texttt{tstun} wrapper enables both on the kernel TUN device on
Hyprspace cannot be used as a negative control for any of this.
It does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in the
\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP
buffer), not in whether gVisor handles their inner TCP. The
gVisor angle simply does not apply to either of them in this
benchmark.
The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks
expose:

\begin{description}
\item[\texttt{tcp\_reordering} (default 3)] The number of
out-of-order segments the kernel tolerates before declaring
fast retransmit. With \texttt{tc~netem} injecting 0.5--2.5\,\%
reordering per machine, bursts of several reordered packets
repeatedly trip this threshold on the bare-metal path.
\item[\texttt{tcp\_recovery} (default \texttt{1}, RACK enabled)]
Adds time-based reordering detection on top of the
segment-count threshold, which amplifies spurious retransmits
when reordering is high.
\item[\texttt{tcp\_early\_retrans} (default \texttt{3}, TLP
enabled)] Fires speculative retransmits when unacknowledged
segments sit at the tail of the transmission window, which
interacts poorly with an already-impaired link.
\end{description}

\noindent
Loosening any one softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it a GRO-coalesced, buffer-cushioned
stream, so the kernel's tight defaults fire less often there to
begin with.
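In sysctl form, a loosened configuration could look like the lines below. The text records which parameters were changed, not the exact values, so the numbers here are illustrative assumptions rather than the tuned run's settings.

```shell
# Illustrative loosening of the three thresholds; the values are
# assumptions, not the exact tuned-run parameters.
sysctl -w net.ipv4.tcp_reordering=30    # default 3: tolerate deeper reordering
sysctl -w net.ipv4.tcp_recovery=0       # default 1: disable RACK loss detection
sysctl -w net.ipv4.tcp_early_retrans=0  # default 3: disable early retransmit/TLP
```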
The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
any Linux host and entirely independent of any VPN.

The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage initially attributed to it comes from a more ordinary
place: the way \texttt{wireguard-go} batches and coalesces
packets between the wire and the kernel TCP stack, and the larger
UDP buffer it pins on its outer socket. The experiment was
chasing the wrong hypothesis, but the experiment turned out to be
more useful than the hypothesis.
\section{Summary}
\label{sec:results_summary}

Four findings hold together across all four impairment profiles.
At baseline, the throughput hierarchy splits into three tiers
separated by natural gaps in the data: WireGuard, ZeroTier,
Headscale, and Yggdrasil at the top ($>$\,80\,\% of bare metal);
Nebula, EasyTier, and VpnCloud in the middle (55--80\,\%);
Hyprspace, Tinc, and Mycelium at the bottom ($<$\,40\,\%).
Latency rearranges the rankings: VpnCloud is the fastest VPN at
1.13\,ms despite mid-tier throughput, Tinc has low latency but a
single-core CPU bottleneck caps its bulk transfer rate, and
Mycelium's 34.9\,ms average is an outlier driven by Babel's
overlay routing, not by tunnel overhead.

Under impairment, the hierarchy collapses. At High impairment
the spread between the fastest and slowest implementations
compresses from 675\,Mbps to under 3\,Mbps; the impairment
profile itself becomes the bottleneck. Three pathologies stand
out at the intermediate profiles. Yggdrasil's 32\,KB jumbo
overlay MTU, which inflates its baseline numbers, becomes a
liability at Low impairment: a single lost outer packet costs
roughly 24$\times$ more retransmitted inner data than a
standard-MTU VPN would lose, and throughput drops from 795 to
13\,Mbps. Hyprspace's libp2p/yamux send pipeline serialises
concurrent flows behind a per-peer mutex; under any sustained
load the pipeline backs up and ping latency balloons by three
orders of magnitude. Headscale's RIST video quality stays at
13\,\% across every profile, almost certainly because of MTU
fragmentation in the DERP relay layer; the failure is
profile-independent because it is structural.
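The 24$\times$ figure is plain arithmetic. A short sketch, assuming a typical ${\sim}$1400-byte MTU for the standard-MTU comparison (the exact standard MTU is an assumption, which is why the factor is only "roughly" 24):

```python
# Retransmit amplification from Yggdrasil's jumbo overlay MTU:
# losing one outer packet invalidates a whole inner segment.
jumbo_inner_mtu = 32 * 1024  # Yggdrasil's 32 KB overlay MTU
standard_mtu = 1400          # assumed typical tunnel MTU

# Inner data that must be retransmitted per lost outer packet,
# relative to a standard-MTU VPN.
amplification = jumbo_inner_mtu / standard_mtu
print(round(amplification))  # prints 23, i.e. roughly the 24x quoted above
```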
Headscale's apparent lead over the bare-metal Internal baseline
at Medium impairment turns out not to come from Tailscale's
gVisor TCP stack, which is not in the data path of the
benchmark. It comes from \texttt{wireguard-go}'s TUN GRO
coalescing and the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins, both of which feed the host kernel TCP
stack a smoother input than the bare-metal path receives. The
underlying cause is a host-kernel one: the default
\texttt{tcp\_reordering=3} threshold is too tight for the kind
of bursty, correlated reordering \texttt{tc netem} produces, and
costs the bare-metal host more than half its achievable
throughput. Three sysctl lines repair it, and the fix is
portable to any Linux host, independent of any VPN.

A ranking by a single metric would obscure all of this. The most
useful one-sentence summary is therefore that no VPN dominates
across throughput, latency, application-level workloads, and
operational resilience together; each implementation makes
trade-offs that surface only when the workload changes.
WireGuard comes closest to a default recommendation for
performance-critical use; Headscale is the most robust under
adverse network conditions; the others occupy specific niches
that the per-section analyses describe.