verified all numbers

2026-04-10 11:18:40 +02:00
parent 0e636ee5f3
commit 13633f092a
+250 -220
@@ -9,10 +9,9 @@ ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behaviour. A recurring theme is
that no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.
@@ -26,38 +25,32 @@ the VPN itself. Throughout the plots in this section, the
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
single retransmit across an entire 30-second test. Mycelium sits at
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
bare-metal figure.
A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is
only a control-plane server. Statements below about ``Headscale''
running \texttt{wireguard-go} should be read as statements about
the Tailscale client in this scenario.
Section~\ref{sec:tailscale_degraded} covers the specifics of how
the rig launches \texttt{tailscaled} and which Tailscale code
paths that choice activates.
\subsection{Test Execution Overview}
Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. Actual benchmark execution
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
installation and deployment accounted for another 45~minutes
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
for VPN tunnels to come up after restarts. VPN service restarts and
traffic-control (tc) stabilization took the remainder.
Figure~\ref{fig:test_duration} breaks this down per VPN.
Most VPNs completed every benchmark without issues, but four failed
@@ -146,8 +139,8 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates
needs. ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard delivers data with negligible in-tunnel loss. The
bare-metal Internal reference sits at 1.7~retransmits per test,
essentially noise, and the VPNs split into three groups around
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
@@ -187,10 +180,10 @@ and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
\end{figure}
Retransmits have a direct mechanical relationship with TCP congestion
control: each one triggers a reduction in the congestion window
(\texttt{cwnd}) and throttles the sender.
Figure~\ref{fig:retransmit_correlations} shows the relationship:
Hyprspace, with 4\,965
retransmits, maintains the smallest max congestion window in the
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
window, the largest of any VPN. At first glance this suggests a
@@ -200,24 +193,31 @@ largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size. The reliable conclusion is
simpler: high retransmit rates force TCP to spend more time in
congestion recovery than in steady-state transmission, and that
caps throughput regardless of available bandwidth. ZeroTier
illustrates the opposite extreme: brute-force retransmission can
still yield high throughput (814\,Mbps with 1\,163 retransmits), at
the cost of wasted bandwidth and unstable flow behavior.
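To make the normalization point concrete, the windows can be recomputed
in segments rather than bytes. The sketch below is illustrative only: the
window and MTU figures are the ones quoted in this section, the Hyprspace
overlay MTU is assumed to be the standard ${\sim}$1\,420 bytes, and the
MSS is approximated as the MTU minus 40 bytes of IP and TCP headers.
\begin{lstlisting}[language=Python]
# Normalize max congestion windows by segment size so that VPNs with
# very different overlay MTUs can be compared. Window and MTU values
# are the ones quoted in this chapter; the MSS approximation (MTU - 40)
# and the Hyprspace MTU are assumptions.
vpns = {
    #            (max cwnd in bytes,  overlay MTU in bytes)
    "Yggdrasil": (4.3 * 1000 * 1000,  32_731),
    "Hyprspace": (205 * 1000,         1_420),
}

for name, (cwnd_bytes, mtu) in vpns.items():
    mss = mtu - 40                     # rough IP + TCP header overhead
    segments = cwnd_bytes / mss
    print(f"{name}: cwnd {cwnd_bytes / 1e6:.2f} MB "
          f"= about {segments:.0f} segments of {mss} bytes")
\end{lstlisting}
On these assumptions both windows land in the same rough 130--150-segment
range, which is the sense in which the raw byte figures mislead.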
VpnCloud stands out: its sender reports 538.8\,Mbps but the
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit count (857)
alone does not fully explain.
% TODO: Clarify whether the headline TCP table
% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports
% sender or receiver throughput. The prose here cites sender
% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender
% column, so the table caption should say so explicitly. Same
% clarification needed for Hyprspace (368 in table vs.\ sender
% 367.9 / receiver 419.8 in the pathological-cases paragraph).
Variability, whether stochastic across runs or systematic across
links, also differs substantially. WireGuard's three link
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
and are nearly indistinguishable. Mycelium's three directions span
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
Section~\ref{sec:mycelium_routing} shows the spread is per-link
path-selection asymmetry, with one link finding a direct route and
@@ -315,25 +315,21 @@ interference that the average hides.
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
but only the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet something caps the overall rate. The qperf
benchmark reports Tinc maxing out at 14.9\,\% total system CPU while
delivering 336\,Mbps. On a multi-core host this figure is consistent
with a single saturated core, which fits Tinc's single-threaded
userspace architecture: one core encrypts, copies, and forwards
packets, and the remaining cores sit idle. But VpnCloud reports the
same 14.9\,\% and still reaches 539\,Mbps (60\,\% more than Tinc),
so whole-system CPU alone cannot explain the gap, and a per-packet
processing cost difference must also be in play.
% TODO: 14.9\% total CPU does not pin the bottleneck on its own.
% This is whole-system utilization on a multi-core machine, and a
% single saturated core fits the budget — but VpnCloud reports the
% same 14.9\% \emph{and} reaches 539\,Mbps. Verify with per-thread
% CPU sampling or eBPF profiling to confirm the single-core story
% and quantify the per-packet cost difference.
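One low-effort way to settle the open question is per-thread CPU sampling
of the daemon, since a single saturated core shows up as one thread pinned
near 100\,\% while whole-system utilization stays low. The sketch below
uses the third-party \texttt{psutil} package; the process name is an
assumption about how the daemon appears on the test hosts, not something
taken from the rig.
\begin{lstlisting}[language=Python]
# Sample per-thread CPU time of a VPN daemon over a short interval to
# check whether one thread is saturated while the rest of the host idles.
# Requires the third-party psutil package; the process name below is an
# assumption, not taken from the benchmark rig.
import time
import psutil

PROC_NAME = "tincd"     # hypothetical daemon name on the test host
INTERVAL = 5.0          # seconds between the two samples

proc = next(p for p in psutil.process_iter(["name"])
            if p.info["name"] == PROC_NAME)

def thread_times(p):
    # Map thread id -> cumulative CPU seconds (user + system).
    return {t.id: t.user_time + t.system_time for t in p.threads()}

before = thread_times(proc)
time.sleep(INTERVAL)
after = thread_times(proc)

for tid, cpu in sorted(after.items()):
    busy = (cpu - before.get(tid, 0.0)) / INTERVAL * 100
    print(f"thread {tid}: {busy:5.1f} % of one core")
\end{lstlisting}
A single thread near 100\,\% of one core would support the
single-saturated-core reading; a flatter spread across threads would point
back at per-packet processing cost instead.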
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.
@@ -346,9 +342,9 @@ spot.
The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a
kernel-level implementation; in-kernel cryptographic processing
is the likely cause, though no profiling data confirms this.
On the efficient end, VpnCloud
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
CPU time. Nebula and Headscale are missing from
@@ -416,8 +412,10 @@ Table~\ref{tab:parallel_scaling} lists the results.
The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product (the amount
of in-flight data a TCP flow needs to saturate a link, equal to the
link bandwidth times the round-trip time) demands a window larger
than any single flow maintains, so multiple concurrent flows
compensate for that constraint and push throughput to 2.20$\times$
the single-stream figure. Hyprspace scales almost as well
(2.18$\times$) for the same reason but with a different
@@ -425,7 +423,7 @@ bottleneck. Its libp2p send pipeline accumulates roughly
2\,800\,ms of under-load latency
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
flow a bandwidth-delay product on the order of hundreds of
megabytes to fill, far beyond any single kernel cwnd. And
because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
three concurrent peer pairs in the parallel benchmark each get
@@ -440,8 +438,9 @@ more of the bloated pipeline than one can.
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
% per-flow window evolution nor the actual under-load latency
% has been measured directly. A tcpdump of one Hyprspace
% iPerf3 run with inter-arrival timing analysis would settle it.
Tinc picks up a
1.68$\times$ boost because several streams can collectively keep its
single-threaded CPU busy during what would otherwise be idle gaps in
a single flow.
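As a rough check on the bandwidth-delay-product argument, the window a
single flow would need in order to keep Mycelium's tunnel full follows
directly from the definition, using the 934\,Mbps link rate and the
34.9\,ms round-trip time quoted earlier in this chapter:
\[
\mathrm{BDP} = B \times \mathrm{RTT}
\approx 934\,\mathrm{Mbps} \times 34.9\,\mathrm{ms}
\approx 32.6\,\mathrm{Mbit} \approx 4.1\,\mathrm{MB},
\]
well above the ${\sim}$1\,MiB default socket buffers cited later in this
chapter, which is why splitting the load across several concurrent flows
recovers so much of the missing throughput here.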
@@ -449,9 +448,9 @@ a single flow.
% TODO: "zero retransmits" in parallel mode is not shown in any table
% or figure. Add parallel-mode retransmit data or remove the claim.
WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with zero retransmits. This is consistent
with WireGuard's overhead being a fixed per-packet cost that does
not worsen under multiplexing.
Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps
@@ -498,8 +497,9 @@ The sender throughput values are artifacts: they reflect how fast the
sender can write to the socket, not how fast data traverses the
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
throughput because it uses a 32,731-byte block size (a jumbo-frame
overlay MTU), which inflates the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
\begin{table}[H]
\centering
@@ -537,16 +537,19 @@ because the sender overwhelms the tunnel's userspace processing capacity.
Headscale shares WireGuard's cryptographic protocol but, contrary to
intuition, does not share its kernel datapath: Tailscale's
\texttt{magicsock} layer intercepts every packet to handle endpoint
selection and DERP (Designated Encrypted Relay for Packets,
Tailscale's TLS-over-TCP relay network used when a direct UDP path
between peers cannot be established), which is incompatible with the
in-kernel WireGuard module. Headscale therefore runs
\texttt{wireguard-go} entirely in userspace, and the unbounded
\texttt{-b~0} flood overruns that userspace pipeline just as it
overruns every other userspace implementation; the result is
69.8\,\% loss despite the WireGuard branding.
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
(due to its large block size) but loses almost all of it. These loss
rates do not reflect real-world UDP behavior but reveal which VPNs
implement effective flow control. Hyprspace and Mycelium could not
complete the UDP test at all; both timed out after 120 seconds.
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
% the path MTU. It is derived from the socket MSS and reflects the
@@ -743,19 +746,11 @@ overwhelm FEC entirely.
\subsection{Operational Resilience}
Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
gets there. Sustained-load numbers do not predict recovery speed,
and for operational use the time a tunnel takes to come up after a
reboot matters as much as its peak throughput.
Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
@@ -768,18 +763,21 @@ benchmarks use the default). After a reboot, a node must
wait until the next periodic update before its lighthouses learn
its new endpoint, so the reconnection time tracks the timer rather
than any topology-dependent convergence.
Mycelium sits at the opposite end at 76.6~seconds, and its three
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.
Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
the cost of rebuilding that tree after a reboot: a node close to the
current root reconverges quickly, while one further out must wait
for updated parent information to propagate hop-by-hop before it
can route traffic.
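A toy model makes the hop-by-hop claim concrete: after a restart, a node
can only choose a parent once one of its neighbours already has a path to
the root, so convergence time grows with distance from the root. The
topology and the notion of an update ``round'' below are invented for
illustration; this is not Yggdrasil's actual algorithm or the benchmark
network.
\begin{lstlisting}[language=Python]
# Toy model of hop-by-hop parent propagation in a tree-based overlay.
# A node converges one round after its first already-converged neighbour.
# The adjacency below is invented for illustration.
from collections import deque

edges = {
    "root": ["a"],
    "a": ["root", "b"],
    "b": ["a", "c"],
    "c": ["b"],
}

rounds = {"root": 0}          # the root is converged immediately
queue = deque(["root"])
while queue:
    node = queue.popleft()
    for peer in edges[node]:
        if peer not in rounds:            # first usable parent wins
            rounds[peer] = rounds[node] + 1
            queue.append(peer)

for node, r in sorted(rounds.items(), key=lambda kv: kv[1]):
    print(f"{node}: converges after {r} update round(s)")
\end{lstlisting}
The further a node sits from the current root, the more rounds it waits,
which is the shape of the yuki versus lom/luna asymmetry described above.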
\begin{figure}[H]
\centering
@@ -823,14 +821,14 @@ earlier benchmarks into per-VPN diagnoses.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase. The network itself has capacity to spare;
the VPN tunnel is filling up with buffered packets and failing to
drain.
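The scale of the standing queue follows from the measured numbers
themselves. Taking the roughly 368\,Mbps single-stream throughput from the
baseline TCP table and the ${\sim}$2\,800\,ms of added delay, and treating
the pipeline as a single FIFO (a simplification), the data parked inside
the tunnel at any moment is on the order of
\[
\mathrm{backlog} \approx \mathrm{rate} \times \mathrm{delay}
\approx \frac{368\,\mathrm{Mbps}}{8} \times 2.8\,\mathrm{s}
\approx 130\,\mathrm{MB},
\]
consistent with the ``order of hundreds of megabytes'' bandwidth-delay
product attributed to the bloated pipeline in the parallel-scaling
discussion earlier in the chapter.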
The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays
@@ -841,7 +839,7 @@ The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.
% Should we always use percentages for retransmits?
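Expressed as rates, the retransmit counts become comparable across MTUs
and test durations. The sketch below estimates the segment count from
throughput, the 30-second test length, and an assumed ${\sim}$1\,400-byte
segment; the WireGuard throughput is derived as 92.5\,\% of the 934\,Mbps
bare-metal figure quoted in the baseline section.
\begin{lstlisting}[language=Python]
# Convert absolute retransmit counts into approximate retransmit rates.
# Segment counts are estimated from throughput, the 30-second test
# length, and an assumed ~1400-byte segment; inputs are figures quoted
# in this chapter (WireGuard throughput derived as 92.5% of 934 Mbps).
TEST_SECONDS = 30
SEGMENT_BYTES = 1_400        # assumption for standard-MTU overlays

runs = {
    #            (Mbps,  retransmits per test)
    "Hyprspace": (368.0, 4_965),
    "WireGuard": (864.0, 1),
}

for name, (mbps, rexmit) in runs.items():
    segments = mbps * 1e6 / 8 * TEST_SECONDS / SEGMENT_BYTES
    rate = rexmit / segments
    print(f"{name}: ~{segments:,.0f} segments/test, "
          f"retransmit rate ~{rate * 100:.3g}% "
          f"(about 1 in {1 / rate:,.0f})")
\end{lstlisting}
On these assumptions Hyprspace retransmits roughly one segment in two
hundred, matching the figure used above, while WireGuard's single
retransmit works out to less than one in a million.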
@@ -891,7 +889,7 @@ Since the benchmark targets the regular Hyprspace IPv4/IPv6
addresses rather than service-network proxies, both endpoints
rely on their host kernel's TCP stack for the entire transfer.
Whatever options Hyprspace's gVisor instance might set
internally (congestion control, loss recovery, buffer sizes)
are therefore irrelevant to these measurements; the inner TCP
state machine the kernel runs is the only one in the path.
The same caveat applies more sharply to Tailscale, where the
@@ -900,9 +898,13 @@ stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.
If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack instead. Hyprspace uses
\texttt{libp2p}, a peer-to-peer networking library, and its
\texttt{yamux} stream multiplexer, which runs many logical streams
over a single underlying connection and polices each one with a
credit-based flow-control window. The most plausible source of
the bloat is this libp2p/yamux layer, through which raw IP packets
are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
@@ -916,10 +918,10 @@ collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments: the kernel
sees only that the TUN write returned, so it keeps growing its
congestion window while the libp2p layer falls further behind. The
geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.
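The dynamics of that geometry are easy to model. In the sketch below a
producer that never sees backpressure outpaces a serialised consumer; the
rates are invented round numbers, and the model ignores the cwnd cap that
would eventually slow the real kernel sender, but the direction is the
point: the backlog, and with it the queueing delay, only grows.
\begin{lstlisting}[language=Python]
# Minimal model of a fast producer upstream of a slow, serialised
# consumer with no flow-control coupling. Rates are invented; the real
# kernel sender is eventually cwnd-limited, which this model ignores.
PRODUCER_MBPS = 900     # assumed: kernel TCP pushing at near link rate
CONSUMER_MBPS = 400     # assumed: what the serialised path can drain

backlog_bits = 0.0
for second in range(1, 6):
    backlog_bits += (PRODUCER_MBPS - CONSUMER_MBPS) * 1e6
    delay_ms = backlog_bits / (CONSUMER_MBPS * 1e6) * 1000
    print(f"t={second}s  backlog={backlog_bits / 8e6:6.1f} MB  "
          f"queueing delay={delay_ms:7.0f} ms")
\end{lstlisting}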
@@ -1036,10 +1038,15 @@ background.
Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
fixed convergence timer in the overlay protocol, most likely a
default wait interval hard-coded into the reconnection logic. A
topology-dependent recovery time, by contrast, would vary with each
node's position in the overlay: a node near an active peer would
reconverge quickly while one further away would wait longer for
routing information to reach it. Mycelium shows no such variation,
so the bound is almost certainly a timer rather than a propagation
delay.
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
@@ -1047,49 +1054,46 @@ default interval rather than anything topology-dependent.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: Userspace Processing Bottleneck.}
The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 14.9\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
The parallel benchmark confirms the diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
Several concurrent TCP streams keep that one core busy through
the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
\section{Impact of network impairment}
\label{sec:impairment}
Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds roughly 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
reordering. Medium and High both use 50\,\% correlation, so
losses and reorderings are bursty rather than uniform. Two
results dominate the data.
% TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
% delay numbers are cross-checked against later prose in this
% chapter, but the correlation and jitter values should be
% verified against the authoritative profile definition.
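For orientation, profiles of this shape translate into \texttt{tc netem}
invocations along the lines sketched below. The device name, the omission
of jitter, and the exact flag layout are assumptions for illustration; the
authoritative definitions are the ones in the profile table referenced
above and in the rig's own configuration.
\begin{lstlisting}[language=Python]
# Illustrative translation of the impairment profiles into tc netem
# command lines. Device name and flag layout are assumptions; jitter is
# omitted because its values live in the canonical profile table.
DEV = "eth0"    # assumed benchmark NIC

PROFILES = {
    #          delay    loss     reorder  correlation (loss/reorder)
    "low":    ("2ms",   "0.25%", "0.5%",  None),
    "medium": ("4ms",   "1%",    "2%",    "50%"),
    "high":   ("7.5ms", "2.5%",  "5%",    "50%"),
}

def netem_cmd(profile):
    delay, loss, reorder, corr = PROFILES[profile]
    corr_args = [corr] if corr else []
    return (["tc", "qdisc", "replace", "dev", DEV, "root", "netem",
             "delay", delay, "loss", loss] + corr_args +
            ["reorder", reorder] + corr_args)

for name in PROFILES:
    print(name, "->", " ".join(netem_cmd(name)))
\end{lstlisting}
Applying such commands on every machine (root is required) reproduces the
symmetric impairment described above; the 50\,\% correlation on Medium and
High is what makes losses and reorderings cluster into bursts.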
The first is the collapse of the throughput hierarchy. At High
impairment, the 675\,Mbps spread between fastest and slowest
@@ -1106,8 +1110,8 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers, by reading the rig's
own NixOS module, that the gVisor stack is not actually in the
data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way
@@ -1313,6 +1317,16 @@ every lost or reordered outer packet costs roughly
retransmitted inner data than a standard 1\,400-byte
MTU VPN would
lose.
% TODO: The jumbo-MTU-as-liability argument is reused in several
% places (TCP impairment, QUIC impairment, RIST video, and
% §sec:baseline Tier analysis). In each it is presented as a
% mechanism rather than a measurement. Consider running one
% controlled experiment --- force Yggdrasil to a standard
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
% profiles --- to test the hypothesis directly, or consolidate
% the argument into a single "jumbo-MTU liability" paragraph and
% cite it from the other sections instead of restating the
% mechanism each time.
Headscale retains 34.3\% of its baseline throughput
at Low, almost
@@ -1444,6 +1458,15 @@ indicator than as a throughput measurement. A VPN that cannot
complete a 30-second UDP flood under 0.25\% packet loss has a
flow-control problem that will surface under real workloads too,
even when the symptoms are milder.
% TODO: Non-monotonic failure pattern (Internal and WireGuard
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
% fail selectively) is never explained and directly undermines
% the "robustness indicator" framing above. Reproduce one of
% the failing Low-profile runs with iPerf3 debug logging and
% \texttt{tc -s qdisc show} to establish whether these are VPN
% flow-control failures, iPerf3/tc interaction artefacts, or
% timing issues; then either explain the pattern or soften the
% robustness-indicator claim.
\subsection{Parallel TCP}
@@ -1552,10 +1575,10 @@ At High impairment, WireGuard (23.2\,Mbps), VpnCloud
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of one another. At baseline these four
span a 188\,Mbps
range (656 to 844\,Mbps). At this point QUIC's own congestion
control is the sole limiter: it runs on top of an
already-degraded outer link and cannot push past
${\sim}$23\,Mbps regardless of the VPN underneath.
\begin{figure}[H]
\centering
@@ -1742,6 +1765,20 @@ Section~\ref{sec:tailscale_degraded} explains why.
\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}
% TODO: Editorial pass needed on two chapter-wide issues before
% submission:
% (1) magicsock / wireguard-go userspace-datapath explanation is
% repeated three times in slightly different forms (once in
% baseline UDP, once in impairment UDP, once here). Consider
% introducing it once in full here, where it is load-bearing,
% and replacing the earlier occurrences with one-sentence
% forward references.
% (2) This section uses first-person plural ("we pursued", "we
% worked it out", "we ran two follow-up benchmarks") while
% the rest of the chapter is in impersonal voice. Either
% harmonise everything to one voice, or explicitly frame this
% section as a first-person narrative detour.
This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
@@ -1753,7 +1790,7 @@ chasing the obvious answer to its end.
\subsection{An anomaly worth pursuing}
At Medium impairment, Headscale reaches 41.5\,Mbps on a single
TCP stream against Internal's 29.6\,Mbps, a 40\,\% lead for
the VPN over the direct host-to-host link it tunnels through.
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
Low and High impairment it lags Internal by some margin. Yet at
@@ -1837,12 +1874,12 @@ imports Google's gVisor netstack
it as an in-process TCP implementation. The gVisor
documentation is direct about why this matters: netstack is
designed for adverse networks where the host kernel's TCP
defaults are too aggressive. Tailscale's release notes go further
and name specific overrides
on top of gVisor; the most visible are an explicit RACK disable
and 8\,MiB / 6\,MiB receive and send buffers.
The Tailscale source code bears this out.
\texttt{wgengine/netstack/netstack.go} contains the netstack
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
reproduces the relevant overrides verbatim. RACK is disabled
@@ -1863,25 +1900,22 @@ enabled (gVisor's default is off).
\texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
Read against the Linux kernel defaults (RACK on, CUBIC by
default, ${\sim}$1\,MiB receive and send buffers,
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
overrides describe a TCP stack better suited to a lossy,
reordering link than the host kernel. The hypothesis follows
directly: Headscale's iPerf3 traffic
runs through this gVisor instance instead of through the host
kernel TCP stack, and so it inherits the more
reordering-tolerant behaviour. WireGuard-the-kernel-module
shares only the cryptographic protocol; it does not include
the gVisor stack, and therefore does not get the advantage.
The natural way to test this is to extract
the parameters Tailscale sets inside gVisor, apply their
nearest Linux equivalents to the bare-metal host as sysctls,
and see whether Internal, with no VPN at all, picks up the
same advantage. If it does, the gVisor explanation is
supported. If it does not, the hypothesis fails.
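For concreteness, a reorder-tolerant override set on a stock Linux kernel
has the shape sketched below. The three knobs exist on current kernels,
but the values chosen here are assumptions for illustration and not
necessarily the exact set the experiment applied; the rig's configuration
records that.
\begin{lstlisting}[language=Python]
# Shape of a reorder-tolerant sysctl override set on a Linux host.
# The knobs are standard; the values are illustrative assumptions and
# not necessarily the exact change used in the experiment.
import subprocess

OVERRIDES = {
    # tolerate more out-of-order segments before declaring loss (default 3)
    "net.ipv4.tcp_reordering": "30",
    # ceiling for the kernel's adaptive reordering estimate (default 300)
    "net.ipv4.tcp_max_reordering": "300",
    # bit 0 enables RACK loss detection; clearing it mirrors the RACK-off
    # override Tailscale applies inside gVisor (default 1)
    "net.ipv4.tcp_recovery": "0",
}

for key, value in OVERRIDES.items():
    subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)
\end{lstlisting}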
@@ -1951,21 +1985,18 @@ impairment setup as the original 18.12.2025 run.
The result felt like confirmation. Internal's
Medium-impairment throughput jumped from 29.6\,Mbps to
72.7\,Mbps under the reorder-only configuration, a 146\,\%
increase from a three-line sysctl change, and the retransmit
rate at Medium dropped from ${\sim}$2.4\,\% to
1.11\,\%, which
means more than half of the original retransmissions were
spurious. The Nix cache download at Medium roughly halved,
from 58.6\,s to 29.1\,s.
Parallel TCP gained even more. Internal at Low climbed from
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
old single-stream best and overtakes Headscale's original
718\,Mbps from the unmodified run. %
% TODO: DOWNSTREAM
% DEPENDENCY — "six concurrent flows" inherits
% the unresolved
@@ -2024,9 +2055,10 @@ the kernel to gVisor reproduces the effect. Then we checked
which Tailscale code path the test rig was actually running.
\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}
In default mode (what anyone running \texttt{tailscale up}
on a Linux host gets), the Tailscale client creates a real
kernel TUN device, registers a route for the
Tailscale subnet
through it, and forwards inbound and outbound
@@ -2054,20 +2086,19 @@ running inside \texttt{tailscaled} itself (Tailscale SSH,
Taildrop, the metric endpoint). External processes such as
iPerf3 cannot reach the Tailscale network in that mode.
The test rig does not use that mode. The benchmark suite's
Headscale module sets the interface name to
\texttt{ts-\$\{instanceName\}}
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
External benchmark traffic cannot reach gVisor netstack at all.
\lstinputlisting[language=Nix,caption={The
benchmark suite's
Headscale module sets \texttt{interfaceName} to a real kernel
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
This means \texttt{tailscaled} runs as \texttt{tailscaled --tun
ts-headscale}
on every test machine.
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
@@ -2075,7 +2106,7 @@ The empirical fingerprint pins the same conclusion down without
source-code reading. Headscale itself gained +21\,\% at Medium
from the host-kernel sysctl tuning. If Headscale's iPerf3
traffic were processed by gVisor netstack, host-kernel sysctls
would change nothing; they configure the host kernel TCP stack
and only the host kernel TCP stack. The fact that Headscale moves
measurably under those sysctls is direct evidence that
Headscale's application TCP runs on the host kernel stack, just
@@ -2097,8 +2128,8 @@ the gVisor TCP business at all.
The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
TCP stack. Whatever Headscale is doing (partially, weakly, but
reproducibly) is worth roughly twelve megabits per second on the
Medium profile, and it is not gVisor netstack.
The +21\,\% sysctl gain for Headscale itself is also informative
@@ -2143,8 +2174,8 @@ The second is the 7\,MiB outer-UDP socket buffer that
\texttt{SO\_*BUFFORCE} variant where available so the value is
honoured even past \texttt{net.core.rmem\_max}. The host kernel
default is in the low hundreds of KiB. Under burst-correlated
impairment (Medium and High both use 50\,\% correlation, so
losses and reorderings cluster), this larger buffer absorbs
spikes in arrival rate that would otherwise overflow the kernel
UDP receive queue and surface as additional inner-TCP losses.
Internal has no such cushion on its incoming wire path.
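For scale, the sketch below shows how a process holding
\texttt{CAP\_NET\_ADMIN} can pin a receive buffer of that size on a UDP
socket past \texttt{net.core.rmem\_max}. It is illustrative only and is
not \texttt{wireguard-go}'s code (which does the equivalent through Go's
syscall wrappers); the numeric fallback for the option constant is the
Linux ABI value.
\begin{lstlisting}[language=Python]
# Pin a large receive buffer on a UDP socket past net.core.rmem_max
# using SO_RCVBUFFORCE (requires CAP_NET_ADMIN). Illustrative only; not
# wireguard-go's actual code.
import socket

# Python does not always expose the constant; 33 is its Linux value.
SO_RCVBUFFORCE = getattr(socket, "SO_RCVBUFFORCE", 33)
BUF_BYTES = 7 * 1024 * 1024          # the 7 MiB figure discussed above

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    sock.setsockopt(socket.SOL_SOCKET, SO_RCVBUFFORCE, BUF_BYTES)
except PermissionError:
    # Without CAP_NET_ADMIN the forced variant is refused; plain
    # SO_RCVBUF still works but is clamped to net.core.rmem_max.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

print("effective receive buffer:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF), "bytes")
\end{lstlisting}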
@@ -2244,16 +2275,15 @@ the bare-metal host more than half of its achievable throughput.
Three lines of \texttt{sysctl} repair it. The fix is portable to
any Linux host and entirely independent of any VPN.
The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage we attributed to it comes from a more ordinary place:
the way \texttt{wireguard-go} batches and coalesces packets
between the wire and the kernel TCP stack, and the larger UDP
buffer it pins on its outer socket. We were chasing the wrong
hypothesis with the right experiment, and the experiment turned
out to be more useful than the hypothesis.
% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source