Compare commits


1 Commit

Luis · a3c533b58f · improved dense number paragraphs · 2026-04-28 09:29:56 +02:00 · +299 −306
@@ -6,23 +6,23 @@
This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
follows the impairment profiles from ideal to degraded.
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
conditions; subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behaviour. A recurring theme is
that no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.
where they explain the measured behaviour. No single metric
captures VPN performance. The rankings shift depending on what is
measured: throughput, latency, retransmit behaviour, or
application-level performance.
\section{Baseline Performance}
\section{Baseline performance}
\label{sec:baseline}
The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
\emph{internal} bar is a direct host-to-host connection with no VPN
in the path; it is the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
@@ -30,35 +30,29 @@ single retransmit across an entire 30-second test. Mycelium sits at
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
bare-metal figure.
A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is
only a control-plane server. Statements below about ``Headscale''
running \texttt{wireguard-go} should be read as statements about
the Tailscale client in this scenario.
Section~\ref{sec:tailscale_degraded} covers the specifics of how
the rig launches \texttt{tailscaled} and which Tailscale code
paths that choice activates.
Throughout this chapter, ``Headscale'' labels the scenario in
which the Tailscale client (\texttt{tailscaled}, built on
\texttt{wireguard-go}) connects to a self-hosted Headscale
control server. The data plane is therefore Tailscale's, not
Headscale's; the Headscale binary itself is only a control-plane
server. Section~\ref{sec:tailscale_degraded} returns to which
Tailscale code paths the test rig actually exercises.
\subsection{Test Execution Overview}
\subsection{Test execution overview}
Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. Actual benchmark execution
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
installation and deployment accounted for another 45~minutes
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
for VPN tunnels to come up after restarts. VPN service restarts and
traffic-control (tc) stabilization took the remainder.
Figure~\ref{fig:test_duration} breaks this down per VPN.
The full baseline suite ran in just over four hours across all ten
VPNs and the internal reference. Benchmark execution consumed
63\,\% of that time; VPN installation and deployment accounted for
19\,\%; the test rig spent 9\,\% waiting for tunnels to come up
after restarts. Service restarts and traffic-control (tc)
stabilization took the remainder. Figure~\ref{fig:test_duration}
breaks the time down per VPN.
Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf
QUIC performance benchmark after six retries, while Hyprspace and
Mycelium failed the UDP iPerf3 test
with a 120-second timeout. Their individual success rate is
85.7\,\%, with all other VPNs passing the full suite
Most VPNs completed every benchmark without issue. Four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
benchmark after six retries; Hyprspace and Mycelium failed the
UDP iPerf3 test with a 120-second timeout. Their individual
success rate is 85.7\,\%; every other VPN passed the full suite
(Figure~\ref{fig:success_rate}).
\begin{figure}[H]
@@ -88,7 +82,7 @@ with a 120-second timeout. Their individual success rate is
\label{fig:test_overview}
\end{figure}
\subsection{TCP Throughput}
\subsection{TCP throughput}
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
@@ -144,14 +138,12 @@ Raw throughput alone is incomplete. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier,
Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%).
Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and
0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. ZeroTier
reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating
for tunnel-internal loss through repeated TCP congestion-control
recovery; WireGuard delivers comparable throughput with effectively
zero loss.
Headscale are effectively loss-free ($<$\,0.01\,\%); Hyprspace is
the clear outlier at 0.49\,\%; the remaining VPNs spread between
these poles. The interesting case is ZeroTier: it sustains
814\,Mbps despite a 0.10\,\% retransmit rate by riding TCP
congestion-control recovery, whereas WireGuard delivers comparable
throughput with effectively zero loss.
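The normalisation is simple enough to state as code. A minimal
sketch, assuming per-test totals for bytes sent, the negotiated
segment size, and the raw retransmit count; the function and field
names are illustrative, not the test rig's actual schema:
\begin{lstlisting}[language=Go,caption={Sketch of the retransmit-rate
normalisation (illustrative names, not the rig's schema).}]
// retransmitRate normalises a raw retransmit count by the estimated
// number of segments sent, so implementations with very different
// segment sizes (1,228 to 32,731 bytes here) become comparable.
func retransmitRate(retransmits, bytesSent, segmentSize int64) float64 {
	estSegments := float64(bytesSent) / float64(segmentSize)
	return 100 * float64(retransmits) / estSegments // percent
}
\end{lstlisting}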
\begin{figure}[H]
\centering
@@ -170,7 +162,7 @@ Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of
Yggdrasil's 0.09\,\% rate allows a 4.3\,MB window, the largest of
any VPN. At
first glance this suggests a clean inverse correlation between
retransmit rate and congestion window size, but the picture is
@@ -204,17 +196,13 @@ in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.
Variability, whether stochastic across runs or systematic across
links, also differs substantially. WireGuard's three link
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
and are nearly indistinguishable. Mycelium's three directions span
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
Section~\ref{sec:mycelium_routing} shows the spread is per-link
path-selection asymmetry, with one link finding a direct route and
the other two routing through the global overlay. Either way, a
VPN whose throughput varies that widely across links is harder to
capacity-plan around than one that delivers a consistent figure
on every direction.
Variability across links also differs substantially. WireGuard's
three link directions cluster within a 60\,Mbps band; Mycelium's
span a 3:1 ratio (122--379\,Mbps), but this is per-link
path-selection asymmetry rather than run-to-run noise
(Section~\ref{sec:mycelium_routing}). A VPN whose throughput
varies that widely across links is harder to capacity-plan around
than one that delivers a consistent figure on every direction.
\begin{figure}[H]
\centering
@@ -292,16 +280,13 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far
removed from the rest that Section~\ref{sec:mycelium_routing} gives
it a dedicated analysis.
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two
outliers among the low-latency VPNs. VpnCloud leads at
2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at
2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest
jitter in the table (0.25\,ms). Tinc and Headscale, by contrast,
stay below 1.1$\times$ with jitter under 0.09\,ms, so their packet
timing is nearly as stable as bare metal. The spikes in VpnCloud and
ZeroTier are consistent with periodic
control-plane work such as key rotation or peer heartbeats that
briefly stalls the data path.
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes
two outliers among the low-latency VPNs. VpnCloud and ZeroTier
post the highest spike ratios (2.8$\times$ and 2.3$\times$) and
the highest jitter, while Tinc and Headscale stay close to
bare-metal stability. The spikes in VpnCloud and ZeroTier are
consistent with periodic control-plane events such as key
rotation or peer heartbeats that briefly stall the data path.
\begin{figure}[H]
\centering
@@ -353,7 +338,7 @@ spot.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
\caption{Latency vs.\ throughput at baseline. Each point represents
\caption{Latency vs.\ throughput at baseline. Each point is
one VPN. The quadrants reveal different bottleneck types:
VpnCloud (low latency, moderate throughput), Tinc (low latency,
low throughput, CPU-bound), Mycelium (high latency, low
@@ -361,7 +346,7 @@ spot.
\label{fig:latency_throughput}
\end{figure}
\subsection{Parallel TCP Scaling}
\subsection{Parallel TCP scaling}
The single-stream benchmark tests one link direction at a time.
The
@@ -383,7 +368,7 @@ Table~\ref{tab:parallel_scaling} lists the results.
\centering
\caption{Parallel TCP scaling at baseline. Scaling factor is the
ratio of parallel to single-stream throughput. Internal's
1.50$\times$ represents the expected scaling on this hardware.}
1.50$\times$ is the expected scaling on this hardware.}
\label{tab:parallel_scaling}
\begin{tabular}{lrrr}
\hline
@@ -501,7 +486,7 @@ parallel load.
\label{fig:parallel_tcp}
\end{figure}
\subsection{UDP Stress Test}
\subsection{UDP stress test}
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
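Concretely, the overload comes from iPerf3's UDP mode with an
unlimited target bitrate. A sketch of the client invocation as the
rig might issue it; the server name, duration, and JSON flag are
placeholders rather than the rig's recorded command line:
\begin{lstlisting}[language=Go,caption={Sketch of the unlimited-rate
UDP iPerf3 invocation (placeholder arguments).}]
import "os/exec"

// udpStress runs iPerf3 in UDP mode (-u) with an unlimited target
// bitrate (-b 0): the sender transmits as fast as it can generate
// datagrams, a deliberate overload rather than a realistic workload.
func udpStress(server string) error {
	return exec.Command("iperf3", "-c", server, "-u", "-b", "0",
		"-t", "30", "--json").Run()
}
\end{lstlisting}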
@@ -568,12 +553,13 @@ complete the UDP test at all; both timed out after 120 seconds.
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at
2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at
1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the
smallest). These differences affect fragmentation behavior under real
workloads, particularly for protocols that send large datagrams.
The effective UDP payload size, reported in the
\texttt{blksize\_bytes} field, differs sharply across
implementations. Yggdrasil sends 32\,731-byte jumbo segments;
ZeroTier negotiates 2\,728 bytes; the remaining VPNs cluster
between 1\,208 (Headscale, the smallest) and 1\,448 (Internal).
These differences affect fragmentation behaviour under workloads
that send large datagrams.
%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
@@ -611,7 +597,7 @@ workloads, particularly for protocols that send large datagrams.
% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed
\subsection{Real-World Workloads}
\subsection{Real-world workloads}
Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
@@ -620,7 +606,7 @@ cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.
\paragraph{Nix Binary Cache Downloads.}
\paragraph{Nix binary cache downloads.}
This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
@@ -705,7 +691,7 @@ Nebula sits just below at 99.8\%, and Hyprspace's headline figure
of 100\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run, including
Internal, are most likely encoder warm-up artefacts rather than
tunnel overhead, though we have not verified this directly.
tunnel overhead, though this has not been verified directly.
% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
@@ -756,7 +742,7 @@ overwhelm FEC entirely.
\label{fig:rist_quality}
\end{figure}
\subsection{Operational Resilience}
\subsection{Operational resilience}
Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
@@ -767,7 +753,7 @@ reboot matters as much as its peak throughput.
Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
10.1\,s each. Nebula's consistency is striking: 10.06, 10.07,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
@@ -781,8 +767,8 @@ Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.
Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. Yggdrasil organises its overlay as a
node is back in 7.1~seconds while lom and luna take 97.3 and
94.8~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
@@ -816,14 +802,14 @@ can route traffic.
\label{fig:reboot_reconnection}
\end{figure}
\subsection{Pathological Cases}
\subsection{Pathological cases}
\label{sec:pathological}
Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following subsections piece together observations from
earlier benchmarks into per-VPN diagnoses.
Hyprspace, Mycelium, and Tinc each show a pathology that the
aggregate tables flatten. The following subsections diagnose each
in turn.
\paragraph{Hyprspace: Buffer Bloat.}
\paragraph{Hyprspace: buffer bloat.}
\label{sec:hyprspace_bloat}
% TODO: The under-load latency of 2,800 ms is not shown in any plot
@@ -841,7 +827,7 @@ The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
205\,KB, the smallest in the dataset. Under parallel load the
200\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally
@@ -853,12 +839,9 @@ while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.
% Should we always use percentages for retransmits?
What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
Outside sustained load, Hyprspace looks fine. It has the fastest
reboot reconnection in the dataset (8.7\,s) and delivers 100\,\%
video quality between burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.
@@ -910,33 +893,35 @@ stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.
If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack instead. Hyprspace uses
\texttt{libp2p}, a peer-to-peer networking library, and its
\texttt{yamux} stream multiplexer, which runs many logical streams
over a single underlying connection and polices each one with a
credit-based flow-control window. The most plausible source of
the bloat is this libp2p/yamux layer, through which raw IP packets
are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
destination peer in \texttt{activeStreams} and guards it with a
single per-peer \texttt{sync.Mutex}
(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent
application TCP flows to the same Hyprspace neighbour therefore
further up the Hyprspace stack. Hyprspace uses \texttt{libp2p},
a peer-to-peer networking library, and its \texttt{yamux} stream
multiplexer. Yamux runs many logical streams over a single
underlying connection and polices each one with a credit-based
flow-control window. The most plausible source of the bloat is
this libp2p/yamux layer, through which raw IP packets are
funnelled.
Hyprspace's TUN-read loop dispatches each outbound packet on its
own goroutine, and every such goroutine ends up in
\texttt{node/node.go}'s \texttt{sendPacket}. This function keeps
exactly one libp2p stream per destination peer in
\texttt{activeStreams}, guarded by a single per-peer
\texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}).
Concurrent application TCP flows to the same Hyprspace neighbour
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once,
collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments: the kernel
sees only that the TUN write returned, so it keeps growing its
congestion window while the libp2p layer falls further behind. The
geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.
opens multiple TCP connections to the same peer at once, collapses
to a single send pipeline at this layer. Each goroutine waiting
for the lock pins its own 1420-byte packet buffer, and the
underlying yamux session adds a per-stream flow-control window on
top.
None of this is visible to the kernel TCP sender that produced
the inner segments. The kernel sees only that the TUN write
returned, so it keeps growing its congestion window while the
libp2p layer falls further behind. The geometry is the standard
shape of buffer bloat: a fast producer (kernel TCP) sitting
upstream of a slow, serialised consumer (the single yamux stream
per peer), with no flow-control signal coupling the two.
\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
@@ -987,9 +972,9 @@ reports a kernel-measured RTT that is independent of
ICMP ping. For the luna$\rightarrow$lom stream, this
TCP~RTT starts at 51.6\,ms and climbs to a mean of
144\,ms over the 30-second run, with
757~retransmits---the link was clearly overlay-routed
during the throughput test, even though ping had found a
direct path eight minutes earlier. For
757~retransmits. The link was clearly overlay-routed during
the throughput test, even though ping had found a direct path
eight minutes earlier. For
yuki$\rightarrow$luna the reverse happened: the TCP
stream measured only 12--22\,ms, and its bidirectional
return path recorded 1.0\,ms, a direct LAN connection
@@ -1057,17 +1042,14 @@ inversion.
\end{figure}
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also
% clarify what
% Internal's connection establishment time is (47.3 /
% 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
% (47.3 ms vs.\ 16.0 ms) numbers are from qperf but not shown in
% any figure. Add a connection-setup latency table or plot.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms
(vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
alone costs 47.3\,ms against Internal's 16.0\,ms (a
2.95$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by
short-lived connections accumulate it rapidly. Bulk
downloads, by
@@ -1094,13 +1076,13 @@ delay.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: Userspace Processing Bottleneck.}
\paragraph{Tinc: userspace processing bottleneck.}
The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 14.9\,\%
(240) are in the normal range. That leaves CPU: 12.3\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
The parallel benchmark confirms the diagnosis. Tinc scales to
@@ -1110,7 +1092,7 @@ the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
% (VpnCloud's similar 14.2\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
@@ -1121,13 +1103,12 @@ Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds roughly 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
reordering. Medium and High both use 50\,\% correlation, so
losses and reorderings are bursty rather than uniform. Two
results dominate the data.
impairment to every machine. Low adds 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds 4\,ms
of delay and 1\,\% loss with 2.5\,\% reordering; High adds 6\,ms
of delay and 2.5\,\% loss with 5\,\% reordering. Medium and High
both use 50\,\% correlation, so losses and reorderings are bursty
rather than uniform. Two results dominate the data.
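As a concrete illustration, the Low profile corresponds to a
\texttt{netem} configuration along the following lines; this is a
sketch reconstructed from the profile description (the device name
and the plain Go wrapper are assumptions, not the rig's actual
invocation):
\begin{lstlisting}[language=Go,caption={Sketch of applying the Low
profile with \texttt{tc netem} (reconstructed, not the rig's code).}]
import "os/exec"

// applyLowProfile shells out to tc(8): 2 ms delay with 2 ms jitter,
// 0.25 % loss, 0.5 % reordering. Medium and High would append a 50 %
// correlation argument to the loss and reorder percentages to make
// the impairment bursty. "eth0" is a placeholder device name.
func applyLowProfile() error {
	return exec.Command("tc", "qdisc", "add", "dev", "eth0", "root",
		"netem", "delay", "2ms", "2ms",
		"loss", "0.25%", "reorder", "0.5%").Run()
}
\end{lstlisting}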
% TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
@@ -1150,9 +1131,9 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers, by reading the rig's
own NixOS module, that the gVisor stack is not actually in the
data path of the benchmark at all. The real culprit is a
bare-metal host, and only then discovers, by reading the test
rig's own NixOS module, that the gVisor stack is not actually in
the data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way
\texttt{wireguard-go}
@@ -1268,7 +1249,7 @@ RTT standard deviation reaches 44.6\,ms at High, the
worst jitter
of any VPN. A userspace retry mechanism is the
likely cause, but
without source-code evidence we cannot say so with certainty.
without source-code evidence this cannot be confirmed.
% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on
@@ -1342,9 +1323,9 @@ the first step (Table~\ref{tab:tcp_impairment}).
\end{figure}
Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low
impairment, a
98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter,
0.25\% packet loss, and 0.5\% reordering per machine.
impairment, a 98.3\% loss. The Low profile injects only modest
impairment per machine: 2\,ms of latency, 2\,ms of jitter, 0.25\%
loss, and 0.5\% reordering.
Even Mycelium,
the slowest VPN at baseline (259\,Mbps), retains more
throughput at
@@ -1427,38 +1408,28 @@ emerges from the runs that did complete.
% explanation (e.g., iPerf3 crash, tc interaction,
% timing issue).
Three implementations maintain throughput at the profiles where
data exists. Internal holds ${\sim}$950\,Mbps at
Baseline, Medium,
and High; WireGuard sustains 850--898\,Mbps; and
Headscale sustains
700--876\,Mbps. % TODO: verify WireGuard UDP range --
% analysis doc says 850-898, possible digit transposition
Internal and WireGuard ride the host kernel's transport-layer
backpressure (Internal directly, WireGuard via the in-kernel
WireGuard module). Headscale, by contrast, never
uses the kernel
module even though it builds on the WireGuard protocol: as
established in Section~\ref{sec:baseline}, Tailscale's
\texttt{magicsock} layer intercepts every packet for endpoint
selection, DERP relay, and the disco protocol, and that
interception is incompatible with the kernel WireGuard datapath.
Headscale therefore runs \texttt{wireguard-go} in userspace and
compensates with UDP batching
(\texttt{recvmmsg}/\texttt{sendmmsg}),
host-kernel UDP segmentation/aggregation offload
(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer
WireGuard socket), and a 7\,MB socket buffer on the same outer
socket. These offloads live in the host kernel; gVisor netstack
itself implements no UDP GSO or UDP GRO of its own.
Together they
absorb a \texttt{-b 0} sender flood without
collapsing. Userspace
VPNs without the same engineering do collapse:
EasyTier drops from
865 to 435 to 38.5 to 6.1\,Mbps across successive profiles.
Yggdrasil, already pathological at baseline (98.7\%
loss), crashes
to 12.3\,Mbps at Low and fails entirely at Medium and High.
data exists: Internal, WireGuard, and Headscale all sustain
several hundred Mbps where they complete (see
Figure~\ref{fig:udp_impairment_heatmap}). Internal and WireGuard
ride the host kernel's transport-layer backpressure (Internal
directly, WireGuard via the in-kernel WireGuard module).
Headscale takes a different route to the same outcome. Its
\texttt{magicsock} layer is incompatible with the kernel
WireGuard datapath, so \texttt{wireguard-go} runs in userspace
and leans on three host-kernel offloads to absorb a
\texttt{-b~0} sender flood: batched UDP I/O
(\texttt{recvmmsg} / \texttt{sendmmsg}), UDP
segmentation/aggregation offload (\texttt{UDP\_SEGMENT} /
\texttt{UDP\_GRO}) on the outer WireGuard socket, and a 7\,MiB
socket buffer on that same socket. Section~\ref{sec:tailscale_degraded}
returns to these mechanisms when they reappear as the explanation
for Headscale's TCP behaviour under reordering.
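At the socket level, the three mechanisms look roughly like the
following sketch, assuming Linux and the \texttt{golang.org/x}
packages. It illustrates the technique only; the port, batch size,
and segment size are placeholders, not Tailscale's actual code:
\begin{lstlisting}[language=Go,caption={Sketch of the outer-socket
mechanisms (illustrative, not Tailscale's code).}]
package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
	"golang.org/x/sys/unix"
)

func main() {
	// Outer UDP socket, as a WireGuard implementation would bind it.
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	raw.Control(func(fd uintptr) {
		// UDP GRO: the kernel coalesces inbound datagrams before
		// handing them up. UDP_SEGMENT: one large send is split into
		// wire-sized datagrams in the kernel (1400 is a placeholder).
		unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_GRO, 1)
		unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, 1400)
		// Large receive buffer to absorb sender floods (7 MiB; the
		// kernel may clamp this to net.core.rmem_max).
		unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, 7<<20)
	})

	// Batched receive: ReadBatch maps to recvmmsg(2) on Linux, so a
	// single syscall can drain up to 64 queued datagrams.
	pc := ipv4.NewPacketConn(conn)
	msgs := make([]ipv4.Message, 64)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 65535)}
	}
	if _, err := pc.ReadBatch(msgs, 0); err != nil {
		log.Fatal(err)
	}
}
\end{lstlisting}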
Userspace VPNs without that engineering collapse. EasyTier
walks down 865, 435, 38.5, 6.1\,Mbps across the four profiles.
Yggdrasil, already pathological at baseline (98.7\,\% loss),
drops to 12.3\,Mbps at Low and fails entirely at Medium and
High.
\begin{figure}[H]
\centering
@@ -1576,10 +1547,9 @@ concurrent streams benefits from it independently.
EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs
that retain more than half their baseline parallel throughput at
Low impairment; no other implementation exceeds 30\%.
We have no
direct architectural explanation for EasyTier's resilience and
do not claim one here.
Low impairment; no other implementation exceeds 30\%. EasyTier's
resilience has no direct architectural explanation in this work,
and none is claimed here.
Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at
Low, a 99.6\%
@@ -1590,7 +1560,7 @@ loss. % TODO: DOWNSTREAM DEPENDENCY — This
% under-load latency. If that diagnosis is revised,
% this explanation
% for parallel collapse must also be revisited.
The buffer bloat that already plagues single-stream transfers
The buffer bloat that already constrains single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.
@@ -1799,33 +1769,26 @@ at Low, completes here in 170\,s. At High, only Headscale,
Nebula, and Tinc survive. Internal's failure at High is the
surprising one: the bare-metal baseline cannot sustain a
multi-connection HTTP workload under severe degradation, while
Headscale's userspace TCP stack pulls it through.
Section~\ref{sec:tailscale_degraded} explains why.
Headscale completes in 219\,s.
Section~\ref{sec:tailscale_degraded} traces the mechanism.
\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}
% TODO: Editorial pass needed on two chapter-wide issues before
% submission:
% (1) magicsock / wireguard-go userspace-datapath explanation is
% repeated three times in slightly different forms (once in
% baseline UDP, once in impairment UDP, once here). Consider
% introducing it once in full here, where it is load-bearing,
% and replacing the earlier occurrences with one-sentence
% forward references.
% (2) This section uses first-person plural ("we pursued", "we
% worked it out", "we ran two follow-up benchmarks") while
% the rest of the chapter is in impersonal voice. Either
% harmonise everything to one voice, or explicitly frame this
% section as a first-person narrative detour.
% TODO: The magicsock / wireguard-go userspace-datapath
% explanation is repeated three times in slightly different forms
% (once in baseline UDP, once in impairment UDP, once here).
% Consider introducing it once in full here, where it is
% load-bearing, and replacing the earlier occurrences with
% one-sentence forward references.
This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
Medium impairment, and at Low impairment under parallel load
beats it by a factor of 2.6. The short answer turns out to be
different from the obvious answer, and we worked it out only by
chasing the obvious answer to its end.
Headscale, a tunnelling VPN built on \texttt{wireguard-go}, beats
the bare-metal Internal baseline at Medium impairment. Under
parallel load at Low impairment it beats Internal by a factor of
2.6. A VPN should not outperform the direct connection it tunnels
through, and the explanation took some chasing. The obvious
hypothesis was wrong, and pursuing it to its end was the only way
to find out.
\subsection{An anomaly worth pursuing}
@@ -1877,16 +1840,15 @@ comparison.
\label{fig:headscale_vs_internal}
\end{figure}
WireGuard-the-kernel-module is the obvious sanity
check. It uses
the same Noise/WireGuard cryptographic protocol Tailscale ships
and is the closest available comparison without the rest of
Tailscale's stack. WireGuard shows none of Headscale's
advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well
below Internal at the same profile. So the encryption layer is
not the answer, and the basic UDP tunnel is not the answer.
Whatever Headscale is doing differently lives somewhere else in
the rest of Tailscale's implementation.
The in-kernel WireGuard module is the obvious sanity check. It
uses the same Noise/WireGuard cryptographic protocol that
Tailscale embeds, and it is the closest available comparison
without the rest of Tailscale's stack. Kernel WireGuard shows
none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps at
Medium, both well below Internal at the same profile. The
encryption layer is not the answer. Neither is the basic UDP
tunnel. Whatever Headscale is doing lives somewhere else in
Tailscale's implementation.
% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit
@@ -1906,9 +1868,9 @@ not.
\subsection{A plausible villain: Tailscale's gVisor stack}
The candidate explanation we pursued first, and the one any
reading of the upstream Tailscale documentation will lead to,
is Tailscale's userspace TCP/IP stack. The Tailscale client
The first candidate was Tailscale's userspace TCP/IP stack:
the answer any reading of the upstream Tailscale documentation
points to. The Tailscale client
imports Google's gVisor netstack
(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses
it as an in-process TCP implementation. The gVisor
@@ -1948,9 +1910,9 @@ reordering link than the host kernel. The hypothesis follows
directly: Headscale's iPerf3 traffic
runs through this gVisor instance instead of through the host
kernel TCP stack, and so it inherits the more
reordering-tolerant behaviour. WireGuard-the-kernel-module
shares only the cryptographic protocol; it does not include
the gVisor stack, and therefore does not get the advantage.
reordering-tolerant behaviour. Kernel WireGuard shares only
the cryptographic protocol; it does not include the gVisor
stack, and therefore does not get the advantage.
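For orientation, embedding netstack as an in-process stack looks
roughly like the following; a minimal sketch of the library usage,
not Tailscale's actual wiring:
\begin{lstlisting}[language=Go,caption={Minimal sketch of an
in-process gVisor netstack (not Tailscale's wiring).}]
import (
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
	"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
)

// newNetstack builds a TCP/IPv4 stack that lives entirely inside the
// process. Packets enter and leave through a link endpoint rather
// than the host kernel, so its TCP behaviour (including reordering
// tolerance) is independent of host sysctls.
func newNetstack() *stack.Stack {
	return stack.New(stack.Options{
		NetworkProtocols:   []stack.NetworkProtocolFactory{ipv4.NewProtocol},
		TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol},
	})
}
\end{lstlisting}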
The natural way to test this is to extract
the parameters Tailscale sets inside gVisor, apply their
@@ -1962,8 +1924,8 @@ supported. If it does not, the hypothesis fails.
\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}
We ran two follow-up benchmarks on the same hardware and
impairment setup as the original 18.12.2025 run.
Two follow-up benchmarks ran on the same hardware and impairment
setup as the original 18.12.2025 run.
\begin{itemize}
\bitem{Tailscale-style (27.02.2026):}
@@ -2023,43 +1985,33 @@ impairment setup as the original 18.12.2025 run.
\label{fig:kernel_tuning_comparison}
\end{figure}
The result felt like confirmation. Internal's
Medium-impairment throughput jumped from 29.6\,Mbps to
72.7\,Mbps under the reorder-only configuration, a 146\,\%
increase from a three-line sysctl change, and the retransmit
rate at Medium dropped from ${\sim}$2.4\,\% to
1.11\,\%, which
means more than half of the original retransmissions were
spurious. The Nix cache download at Medium roughly halved,
from 58.6\,s to 29.1\,s.
The result felt like confirmation. Three sysctls raised
Internal's Medium-impairment throughput by 146\,\% and halved its
Nix cache download time
(Table~\ref{tab:kernel_tuning_internal}). The retransmit rate at
Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means
more than half of the original retransmissions were spurious.
Parallel TCP gained even more. Internal at Low climbed from
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
old single-stream best and overtakes Headscale's original
718\,Mbps from the unmodified run. %
% TODO: DOWNSTREAM
% DEPENDENCY — "six concurrent flows" inherits
% the unresolved
% 6-vs-10 stream count from the baseline parallel test
% description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from
the higher reordering threshold, and the gains compound.
277\,Mbps to 902\,Mbps, a 226\,\% increase that exceeded
Internal's old single-stream best and overtook Headscale's
original 718\,Mbps from the unmodified run.
% TODO: DOWNSTREAM DEPENDENCY -- "six concurrent flows"
% inherits the unresolved 6-vs-10 stream count from the baseline
% parallel test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.
% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
% not in any table. Add a table showing Headscale's results
% from the follow-up runs alongside Internal's so
% readers can
% from the follow-up runs alongside Internal's so readers can
% verify the reversal.
Headscale itself, retested with the same sysctls,
gained more
modestly: +21\,\% at Medium and a small $-$5\,\% wobble at
Low. And the anomaly reversed entirely. At Medium, tuned
Internal reached 72.7\,Mbps against Headscale's 50.1\,Mbps —
a 45\,\% lead for Internal where the original run
had Headscale
40\,\% ahead. The Nix cache flipped the same way: Internal
completed in 29.1\,s against Headscale's 36.3\,s, where the
original had Headscale 17\,\% faster.
Headscale, retested with the same sysctls, gained more modestly:
+21\,\% at Medium and a small $-$5\,\% wobble at Low. And the
anomaly reversed entirely
(Figure~\ref{fig:headscale_gap_reversal}). Tuned Internal now
leads Headscale at Medium across every metric, where the
original run had Headscale ahead.
\begin{figure}[H]
\centering
@@ -2088,11 +2040,11 @@ collapsed to three host-kernel sysctls:
\texttt{tcp\_early\_retrans}.
At this point in the investigation the hypothesis seemed
settled. Tailscale's gVisor stack ships with
these overrides;
the bare-metal kernel ships with stricter defaults; matching
the kernel to gVisor reproduces the effect. Then we checked
which Tailscale code path the test rig was actually running.
settled. Tailscale's gVisor stack applies these overrides;
the bare-metal kernel uses stricter defaults; matching
the kernel to gVisor reproduces the effect. The remaining
question was which Tailscale code path the test rig was actually
running.
\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}
@@ -2168,9 +2120,9 @@ the gVisor TCP business at all.
The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
TCP stack. Whatever Headscale is doing (partially, weakly, but
reproducibly) is worth roughly twelve megabits per second on the
Medium profile, and it is not gVisor netstack.
TCP stack. Whichever mechanism Headscale relies on (partially,
weakly, but reproducibly) is worth roughly twelve megabits per
second on the Medium profile, and it is not gVisor netstack.
The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%,
@@ -2182,7 +2134,7 @@ that the two effects are not fully additive.
Two features of the \texttt{wireguard-go} data-plane pipeline are
the most likely candidates, and both live on the kernel-TUN path
that Tailscale actually uses in the rig.
that Tailscale actually uses in the test rig.
The first is TUN TCP and UDP generic receive offload. Tailscale's
\texttt{tstun} wrapper enables both on the kernel TUN device on
@@ -2248,32 +2200,38 @@ Hyprspace cannot be used as a negative control for any of this.
It does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the
wireguard-go pipeline (TUN GRO and the 7\,MiB outer-UDP buffer),
not on whether gVisor handles their inner TCP. The gVisor angle
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in the
\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP
buffer), not in whether gVisor handles their inner TCP. The gVisor angle
simply does not apply to either of them in this benchmark.
The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks
expose. \texttt{net.ipv4.tcp\_reordering} (default 3) is the
number of out-of-order segments the kernel will tolerate before
declaring fast retransmit, and with \texttt{tc netem} injecting
0.5--2.5\,\% reordering per machine, bursts of several reordered
packets are frequent enough that the threshold is repeatedly
tripped on the bare-metal path. \texttt{net.ipv4.tcp\_recovery}
(default \texttt{1}, RACK enabled) adds time-based reordering
detection on top of the segment-count threshold, which compounds
the spurious retransmits when reordering is high. And
\texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail
Loss Probe enabled) fires speculative retransmits when
unacknowledged segments sit at the tail of a transmission window,
which interacts poorly with an already-impaired link. Loosening
any one of the three softens the kernel's loss detection on the
expose:
\begin{description}
\item[\texttt{tcp\_reordering} (default 3)] The number of
out-of-order segments the kernel tolerates before declaring
fast retransmit. With \texttt{tc~netem} injecting 0.5--2.5\,\%
reordering per machine, bursts of several reordered packets
repeatedly trip this threshold on the bare-metal path.
\item[\texttt{tcp\_recovery} (default \texttt{1}, RACK enabled)]
Adds time-based reordering detection on top of the
segment-count threshold, which amplifies spurious retransmits
when reordering is high.
\item[\texttt{tcp\_early\_retrans} (default \texttt{3}, TLP
enabled)] Fires speculative retransmits when unacknowledged
segments sit at the tail of the transmission window, which
interacts poorly with an already-impaired link.
\end{description}
\noindent
Loosening any one softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it the GRO-coalesced, buffer-cushioned
stream described above, so the kernel's tight defaults fire less
often there to begin with.
but is already feeding it a GRO-coalesced, buffer-cushioned
stream, so the kernel's tight defaults fire less often there to
begin with.
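Expressed as code, the loosening is three writes under
\texttt{/proc/sys}. A minimal sketch; the override values are
illustrative assumptions (the text above fixes only the defaults),
chosen to raise the segment-count threshold and switch both
speculative mechanisms off:
\begin{lstlisting}[language=Go,caption={Sketch of loosening the three
loss-detection sysctls (override values are assumptions).}]
import (
	"os"
	"strings"
)

// setSysctl mirrors `sysctl -w key=value` by writing under /proc/sys.
func setSysctl(key, value string) error {
	path := "/proc/sys/" + strings.ReplaceAll(key, ".", "/")
	return os.WriteFile(path, []byte(value), 0o644)
}

func loosenLossDetection() error {
	overrides := [][2]string{
		{"net.ipv4.tcp_reordering", "30"},   // default 3
		{"net.ipv4.tcp_recovery", "0"},      // default 1 (RACK on)
		{"net.ipv4.tcp_early_retrans", "0"}, // default 3 (TLP on)
	}
	for _, kv := range overrides {
		if err := setSysctl(kv[0], kv[1]); err != nil {
			return err
		}
	}
	return nil
}
\end{lstlisting}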
The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
@@ -2318,31 +2276,66 @@ any Linux host and entirely independent of any VPN.
The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage we attributed to it comes from a more ordinary place:
the way \texttt{wireguard-go} batches and coalesces packets
advantage initially attributed to it comes from a more ordinary
place: the way \texttt{wireguard-go} batches and coalesces packets
between the wire and the kernel TCP stack, and the larger UDP
buffer it pins on its outer socket. We were chasing the wrong
hypothesis with the right experiment, and the experiment turned
out to be more useful than the hypothesis.
buffer it pins on its outer socket. The experiment was chasing
the wrong hypothesis, but the experiment turned out to be more
useful than the hypothesis.
% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source
% code analysis." Either write these sections or remove the
% promise from the intro.
\section{Summary}
\label{sec:results_summary}
\section{Source code analysis}
Four findings hold together across all four impairment profiles.
\subsection{Feature matrix overview}
At baseline, the throughput hierarchy splits into three tiers
separated by natural gaps in the data: WireGuard, ZeroTier,
Headscale, and Yggdrasil at the top ($>$\,80\,\% of bare metal);
Nebula, EasyTier, and VpnCloud in the middle (55--80\,\%);
Hyprspace, Tinc, and Mycelium at the bottom ($<$\,40\,\%).
Latency rearranges the rankings: VpnCloud is the fastest VPN at
1.13\,ms despite mid-tier throughput, Tinc has low latency but
single-core CPU caps its bulk transfer rate, and Mycelium's
34.9\,ms average is an outlier driven by Babel's overlay
routing, not by tunnel overhead.
% Summary of the 108-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.
Under impairment, the hierarchy collapses. At High impairment
the spread between fastest and slowest implementation compresses
from 675\,Mbps to under 3\,Mbps; the impairment profile itself
becomes the bottleneck. Three pathologies stand out at the
intermediate profiles. Yggdrasil's 32\,KB jumbo overlay MTU,
which inflates its baseline numbers, becomes a liability at Low
impairment: a single lost outer packet discards a 32\,731-byte
inner segment rather than a ${\sim}$1\,400-byte one, roughly
24$\times$ more retransmitted inner data than a standard-MTU VPN
would lose, and throughput drops from 795 to 13\,Mbps.
Hyprspace's libp2p/yamux send pipeline serialises concurrent
flows behind a per-peer mutex; under any sustained load the
pipeline backs up and ping latency balloons by three orders of
magnitude. Headscale's RIST video quality stays at 13\,\% across
every profile, almost certainly because of MTU fragmentation in
the DERP relay layer; the failure is profile-independent because
it is structural.
\subsection{Security vulnerabilities}
Headscale's apparent lead over the bare-metal Internal baseline
at Medium impairment turns out not to come from Tailscale's
gVisor TCP stack, which is not in the data path of the
benchmark. It comes from \texttt{wireguard-go}'s TUN GRO
coalescing and the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins, both of which feed the host kernel TCP
stack a smoother input than the bare-metal path receives. The
underlying cause is a host-kernel one: the default
\texttt{tcp\_reordering=3} threshold is too tight for the kind
of bursty, correlated reordering \texttt{tc netem} produces, and
costs the bare-metal host more than half its achievable
throughput. Three sysctl lines repair it, and the fix is
portable to any Linux host independent of any VPN.
% Vulnerabilities discovered during source code review.
\section{Summary of findings}
% Brief summary table or ranking of VPNs by key metrics. Save
% deeper interpretation for a Discussion chapter.
A ranking by any single metric would obscure all of this. The most
useful one-sentence summary is therefore that no VPN dominates
across throughput, latency, application-level workloads, and
operational resilience together; each implementation makes
trade-offs that surface only when the workload changes.
WireGuard comes closest to a default recommendation for
performance-critical use; Headscale is the most robust under
adverse network conditions; the others occupy specific niches
that the per-section analyses describe.