% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment. The chapter concludes with findings
from the source code analysis. A recurring theme is that no single
metric captures VPN performance; the rankings shift depending on
whether one measures throughput, latency, retransmit behavior, or
real-world application performance.

\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just 0.60\,ms. WireGuard comes remarkably close to these
numbers, reaching 92.5\,\% of bare-metal throughput with only a single
retransmit across an entire 30-second test. Mycelium sits at the other
extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal
figure.

A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is only
a control-plane server. The test rig launches \texttt{tailscaled} via
the NixOS \texttt{services.tailscale} module with
\texttt{interfaceName = "ts-headscale"}, which translates to
\texttt{--tun ts-headscale}; this means the Tailscale client uses a
real kernel TUN device and the host kernel's TCP/IP stack handles
every tunneled packet. The alternate
\texttt{--tun=userspace-networking} mode, in which gVisor netstack
terminates tunneled TCP inside the \texttt{tailscaled} process, is
\emph{not} engaged in any of the benchmarks reported here. Statements
below about ``Headscale'' running \texttt{wireguard-go} should be read
as statements about the Tailscale client in this scenario.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. The bulk of that time, about
2.6~hours (63\,\%), went to actual benchmark execution; VPN
installation and deployment accounted for another 45~minutes (19\,\%),
and roughly 21~minutes (9\,\%) were spent waiting for VPN tunnels to
come up after restarts. The remaining time was consumed by VPN service
restarts and traffic-control (tc) stabilization.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
performance benchmark after six retries, while Hyprspace and Mycelium
failed the UDP iPerf3 test with a 120-second timeout. Each of these
four therefore has an individual success rate of 85.7\,\% (six of
seven tests); all other VPNs passed the full suite
(Figure~\ref{fig:success_rate}).

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{1.0\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Average Test Duration per Machine}.png}
    \caption{Average test duration per VPN, including installation
      time and benchmark execution}
    \label{fig:test_duration}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{1.0\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark Success Rate}.png}
    \caption{Benchmark success rate across all seven tests}
    \label{fig:success_rate}
  \end{subfigure}
  \caption{Test execution overview. Hyprspace has the longest average
    duration due to UDP timeouts and long VPN connectivity waits.
    WireGuard completes fastest. Nebula, Headscale, Hyprspace, and
    Mycelium each fail one benchmark.}
  \label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages. Three distinct performance tiers emerge, separated by
natural gaps in the data.

\begin{table}[H]
  \centering
  \caption{Single-stream TCP throughput at baseline, sorted by
    throughput. Retransmits are averaged per 30-second test across
    all three link directions. The horizontal rules separate the
    three performance tiers.}
  \label{tab:tcp_baseline}
  \begin{tabular}{lrrr}
    \hline
    \textbf{VPN} & \textbf{Throughput (Mbps)} &
    \textbf{Baseline (\%)} & \textbf{Retransmits} \\
    \hline
    Internal  & 934 & 100.0 & 1.7 \\
    WireGuard & 864 & 92.5  & 1 \\
    ZeroTier  & 814 & 87.2  & 1163 \\
    Headscale & 800 & 85.6  & 102 \\
    Yggdrasil & 795 & 85.1  & 75 \\
    \hline
    Nebula    & 706 & 75.6  & 955 \\
    EasyTier  & 636 & 68.1  & 537 \\
    VpnCloud  & 539 & 57.7  & 857 \\
    \hline
    Hyprspace & 368 & 39.4  & 4965 \\
    Tinc      & 336 & 36.0  & 240 \\
    Mycelium  & 259 & 27.7  & 710 \\
    \hline
  \end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link. A
middle tier (55--80\,\%) follows with Nebula, EasyTier, and VpnCloud,
while Hyprspace, Tinc, and Mycelium occupy the bottom tier at under
40\,\% of baseline. Figure~\ref{fig:tcp_throughput} visualizes this
hierarchy.

Raw throughput alone is incomplete, however. The retransmit column
reveals that not all high-throughput VPNs get there cleanly.
ZeroTier, for instance, reaches 814\,Mbps but accumulates
1\,163~retransmits per test, over 1\,000$\times$ what WireGuard
needs. ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard delivers data with negligible in-tunnel loss. The
bare-metal Internal reference sits at 1.7~retransmits per test
(essentially noise), and the VPNs split into three groups around it:
\emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), and
\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).

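These tier thresholds are raw counts per 30-second test. To put VPNs
with different throughputs on an equal footing, a count can be
converted into an approximate per-segment rate. The sketch below is a
rough estimate, not an exact figure: it assumes every segment carries
a full 1\,448-byte payload (the Internal payload size reported in the
UDP section), so the true rate for VPNs with other segment sizes will
differ.

```python
def retransmit_rate(retransmits, throughput_mbps, duration_s=30, mss_bytes=1448):
    """Approximate retransmit rate (%) from a per-test retransmit count.

    Assumes every segment carries a full mss_bytes payload, so the
    estimated segment count is an upper bound and the rate a rough
    lower bound.
    """
    segments = throughput_mbps * 1e6 * duration_s / 8 / mss_bytes
    return 100 * retransmits / segments

# ZeroTier: 1163 retransmits at 814 Mbps -> well under 0.1 % of segments.
zerotier = retransmit_rate(1163, 814)
# Hyprspace: 4965 retransmits at 368 Mbps -> roughly an order of
# magnitude higher, about half a percent of all segments.
hyprspace = retransmit_rate(4965, 368)
```

Even the ``pathological'' counts are thus a small fraction of all
segments sent; the damage comes from the congestion-window reductions
each one triggers, not from the retransmitted bytes themselves.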
% TODO: Is this naming scheme any good?

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png}
    \caption{Average single-stream TCP throughput}
    \label{fig:tcp_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png}
    \caption{TCP retransmit rate (\%)}
    \label{fig:tcp_retransmits}
  \end{subfigure}
  \caption{TCP throughput and retransmit behavior at baseline.
    WireGuard leads at 864\,Mbps with a single retransmit per test;
    Hyprspace accumulates nearly 5\,000. Subfigure~(b) expresses
    these per-test counts as a retransmit rate. Retransmits do not
    always track inversely with throughput: ZeroTier achieves high
    throughput \emph{despite} heavy retransmission.}
  \label{fig:tcp_results}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control. Each retransmit triggers a reduction in the congestion
window (\texttt{cwnd}), throttling the sender. This relationship is
visible in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with
4\,965 retransmits, maintains the smallest max congestion window in
the dataset (205\,KB), while Yggdrasil's 75 retransmits allow a
4.3\,MB window, the largest of any VPN. At first glance this suggests
a clean inverse correlation between retransmits and congestion window
size, but the picture is misleading. Yggdrasil's outsized window is
largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size. What \emph{is} clear is that
high retransmit rates force TCP to spend more time in congestion
recovery than in steady-state transmission, capping throughput
regardless of available bandwidth. ZeroTier illustrates the opposite
extreme: brute-force retransmission can still yield high throughput
(814\,Mbps with 1\,163 retransmits), at the cost of wasted bandwidth
and unstable flow behavior.

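The normalization argument can be illustrated by restating the two
windows in segments rather than bytes. The sketch below uses
Yggdrasil's reported 32\,731-byte MTU and \emph{assumes} a
${\sim}$1\,400-byte overlay MTU for Hyprspace (the Hyprspace figure
was not measured directly); under these assumptions the two windows
land in the same range of segments even though they differ by a
factor of roughly twenty in bytes.

```python
def cwnd_in_segments(cwnd_bytes, mss_bytes):
    """Normalize a congestion window (bytes) to in-flight segments."""
    return cwnd_bytes / mss_bytes

# Yggdrasil: 4.3 MB window over its 32 731-byte jumbo overlay MTU,
# roughly 131 full-size segments in flight.
yggdrasil = cwnd_in_segments(4.3e6, 32_731)
# Hyprspace: 205 KB window over an ASSUMED ~1 400-byte overlay MTU,
# roughly 146 segments -- comparable despite the 20x byte gap.
hyprspace = cwnd_in_segments(205e3, 1_400)
```

In segment terms the ``largest'' and ``smallest'' windows are nearly
the same size, which is why byte-denominated comparisons across MTUs
say little on their own.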
VpnCloud stands out: its sender reports 538.8\,Mbps but the receiver
measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest in the
dataset). This suggests significant in-tunnel packet loss or
buffering at the VpnCloud layer that the retransmit count (857) alone
does not fully explain.

Variability, whether stochastic across runs or systematic across
links, also differs substantially. WireGuard's three link directions
cluster tightly (824 to 884\,Mbps, a 60\,Mbps window), behaving
almost identically. Mycelium's three directions span 122 to
379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
Section~\ref{sec:mycelium_routing} shows the spread is per-link
path-selection asymmetry, with one link finding a direct route and
the other two routing through the global overlay. Either way, a VPN
whose throughput varies that widely across links is harder to
capacity-plan around than one that delivers a consistent figure in
every direction.

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
    \caption{Retransmits vs.\ throughput}
    \label{fig:retransmit_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
    \caption{Retransmits vs.\ max congestion window}
    \label{fig:retransmit_cwnd}
  \end{subfigure}
  \caption{Retransmit correlations (log scale on x-axis). High
    retransmits do not always mean low throughput (ZeroTier: 1\,163
    retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace:
    4\,965 retransmits, 368\,Mbps). The apparent inverse correlation
    between retransmits and congestion window size is dominated by
    Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated by
    its 32\,KB jumbo overlay MTU rather than by low retransmits
    alone.}
  \label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges.

\begin{table}[H]
  \centering
  \caption{Average ping RTT at baseline, sorted by latency}
  \label{tab:latency_baseline}
  \begin{tabular}{lr}
    \hline
    \textbf{VPN} & \textbf{Avg RTT (ms)} \\
    \hline
    Internal  & 0.60 \\
    VpnCloud  & 1.13 \\
    Tinc      & 1.19 \\
    WireGuard & 1.20 \\
    Nebula    & 1.25 \\
    ZeroTier  & 1.28 \\
    EasyTier  & 1.33 \\
    \hline
    Headscale & 1.64 \\
    Hyprspace & 1.79 \\
    Yggdrasil & 2.20 \\
    \hline
    Mycelium  & 34.9 \\
    \hline
  \end{tabular}
\end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal
0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the
lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms), yet
its throughput tops out at only 539\,Mbps. Low per-packet latency
does not guarantee high bulk throughput. A second group (Headscale,
Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing
moderate overhead. Then there is Mycelium at 34.9\,ms, so far removed
from the rest that Section~\ref{sec:mycelium_routing} gives it a
dedicated analysis.

% TODO: The max RTT claim (8.6 ms) is not visible in the Average RTT
% plot. Add a max-RTT figure or table, or reference the raw data
% source.
ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum
RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any
sub-2\,ms VPN. These spikes point to periodic control-plane
interference that the average hides.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
  \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
    massive outlier at 58$\times$ the internal baseline. VpnCloud is
    the fastest VPN at 1.13\,ms, slightly below WireGuard
    (1.20\,ms).}
  \label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
yet the second-lowest throughput (336\,Mbps). Packets traverse the
tunnel quickly, but single-threaded userspace processing cannot keep
up with the link speed. The qperf benchmark backs this up: Tinc maxes
out at 14.9\,\% total system CPU while delivering just 336\,Mbps. On
a multi-core system, this low percentage is consistent with a single
saturated core (Tinc is single-threaded), which would explain why the
CPU rather than the network is the bottleneck. The story is
incomplete, however: VpnCloud shows the same 14.9\,\% total system
CPU yet delivers 539\,Mbps, 60\,\% more than Tinc, so per-packet
processing cost must also differ materially between the two
implementations. Figure~\ref{fig:latency_throughput} makes this
disconnect easy to spot.

% TODO: These CPU numbers are stated inline but never shown in a plot
% or table. Add a CPU utilization figure or table so readers can
% verify.
The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as
much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%,
surprisingly high for a kernel-level implementation, presumably due
to in-kernel cryptographic processing. On the efficient end, VpnCloud
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
CPU time. Nebula and Headscale are missing from this comparison
because qperf failed for both.

% TODO: Explain why they consistently failed

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
  \caption{Latency vs.\ throughput at baseline. Each point represents
    one VPN. The quadrants reveal different bottleneck types:
    VpnCloud (low latency, moderate throughput), Tinc (low latency,
    low throughput, CPU-bound), Mycelium (high latency, low
    throughput, overlay routing overhead).}
  \label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time.
% TODO: The plot labels this benchmark "10-stream parallel" but this
% description says "six unidirectional flows." Verify the actual test
% configuration and reconcile the two.
The parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom)
run simultaneously in a circular pattern for 60~seconds, each
carrying one bidirectional TCP stream (six unidirectional flows in
total). Because three independent link pairs now compete for shared
tunnel resources at once, the aggregate throughput is naturally
higher than any single direction alone, which is why even Internal
reaches 1.50$\times$ its single-stream figure. The scaling factor
(parallel throughput divided by single-stream throughput) captures
two effects: the benefit of using multiple link pairs in parallel,
and how well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.

\begin{table}[H]
  \centering
  \caption{Parallel TCP scaling at baseline. Scaling factor is the
    ratio of parallel to single-stream throughput. Internal's
    1.50$\times$ represents the expected scaling on this hardware.}
  \label{tab:parallel_scaling}
  \begin{tabular}{lrrr}
    \hline
    \textbf{VPN} & \textbf{Single (Mbps)} &
    \textbf{Parallel (Mbps)} & \textbf{Scaling} \\
    \hline
    Mycelium  & 259 & 569  & 2.20$\times$ \\
    Hyprspace & 368 & 803  & 2.18$\times$ \\
    Tinc      & 336 & 563  & 1.68$\times$ \\
    Yggdrasil & 795 & 1265 & 1.59$\times$ \\
    Headscale & 800 & 1228 & 1.54$\times$ \\
    Internal  & 934 & 1398 & 1.50$\times$ \\
    ZeroTier  & 814 & 1206 & 1.48$\times$ \\
    WireGuard & 864 & 1281 & 1.48$\times$ \\
    EasyTier  & 636 & 927  & 1.46$\times$ \\
    VpnCloud  & 539 & 763  & 1.42$\times$ \\
    Nebula    & 706 & 648  & 0.92$\times$ \\
    \hline
  \end{tabular}
\end{table}

The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product demands a window
larger than any single flow maintains, so multiple concurrent flows
compensate for that constraint and push throughput to 2.20$\times$
the single-stream figure. Hyprspace scales almost as well
(2.18$\times$) for the same reason but with a different bottleneck.
Its libp2p send pipeline accumulates roughly 2\,800\,ms of under-load
latency (Section~\ref{sec:hyprspace_bloat}), which gives any single
TCP flow a bandwidth-delay product on the order of hundreds of
megabytes to fill, far beyond any single kernel \texttt{cwnd}. And
because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three
concurrent peer pairs in the parallel benchmark each get their own
libp2p stream, their own mutex, and their own yamux flow-control
window. The three TCP senders therefore maintain three independent
windows in flight, and three windows fill more of the bloated
pipeline than one can.
% TODO: This is still a hypothesis: it generalises the same
% bandwidth-delay-product argument used for Mycelium directly above,
% and is now grounded in the per-peer \texttt{SharedStream} structure
% verified in Listing~\ref{lst:hyprspace_sendpacket}, but neither the
% per-flow window evolution nor the actual under-load latency has
% been measured directly. A tcpdump of one Hyprspace iPerf3 run with
% inter-arrival timing analysis would settle it.
Tinc picks up a 1.68$\times$ boost because several streams can
collectively keep its single-threaded CPU busy during what would
otherwise be idle gaps in a single flow.

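The bandwidth-delay-product argument can be made concrete. The sketch
below computes the in-flight window a single TCP flow needs to fill
the 934\,Mbps link at a given RTT; the Mycelium and WireGuard inputs
come from the latency table above, and the formula is the standard
$\mathrm{BDP} = \mathrm{bandwidth} \times \mathrm{RTT}$.

```python
def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Bytes a single TCP flow must keep in flight to fill the path:
    bandwidth-delay product = bandwidth * round-trip time."""
    return bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3

# Mycelium: filling 934 Mbps at a 34.9 ms RTT needs a ~4 MB window,
# far larger than the windows observed in the single-stream tests.
mycelium = bdp_bytes(934, 34.9)

# WireGuard: at 1.20 ms the same link needs only ~140 KB in flight,
# which a single flow sustains easily.
wireguard = bdp_bytes(934, 1.20)
```

Three concurrent flows each holding their own window triple the bytes
in flight, which is exactly why the high-latency VPNs gain the most
from parallelism.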
% TODO: "zero retransmits" in parallel mode is not shown in any table
% or figure. Add parallel-mode retransmit data or remove the claim.
WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with zero retransmits, suggesting that WireGuard's
overhead is a fixed per-packet cost that does not worsen under
multiplexing.

Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$)
while retransmits jump from 955 to 2\,462. The streams are clearly
fighting each other for resources inside the tunnel.

More streams also amplify existing retransmit problems. Hyprspace
climbs from 4\,965 to 17\,426~retransmits; VpnCloud from 857 to
6\,023. VPNs that were clean in single-stream mode stay clean under
load, while the stressed ones only get worse.

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
    \caption{Single-stream vs.\ parallel throughput}
    \label{fig:single_vs_parallel}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
    \caption{Parallel TCP scaling factor}
    \label{fig:scaling_factor}
  \end{subfigure}
  \caption{Parallel TCP scaling at baseline. Nebula is the only VPN
    where parallel throughput is lower than single-stream
    (0.92$\times$). Mycelium and Hyprspace benefit most from
    parallelism ($>$2$\times$), compensating for latency and buffer
    bloat respectively. The dashed line at 1.0$\times$ marks the
    break-even point.}
  \label{fig:parallel_tcp}
\end{figure}

\subsection{UDP Stress Test}

The UDP iPerf3 test uses an unlimited sender rate (\texttt{-b 0}),
making it a deliberate overload test rather than a realistic
workload. The sender throughput values are artifacts: they reflect
how fast the sender can write to the socket, not how fast data
traverses the tunnel. Yggdrasil, for example, reports 63\,744\,Mbps
sender throughput because it uses a 32\,731-byte block size (a
jumbo-frame overlay MTU), inflating the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.

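For reference, a receiver-side goodput figure can be recovered from
iPerf3's \texttt{--json} output by discounting the sender rate with
the receiver-reported loss. The sketch below assumes the UDP summary
layout of iPerf3~3.x (\texttt{end.sum} with
\texttt{bits\_per\_second} and \texttt{lost\_percent}); field names
may differ across versions, so verify against your build's output.

```python
import json

def udp_receiver_goodput(iperf_json: str) -> float:
    """Estimate receiver-side goodput (Mbps) from an iperf3 --json
    UDP run: the sender-side rate discounted by the receiver-reported
    loss percentage. Field layout assumed as in iperf3 3.x."""
    s = json.loads(iperf_json)["end"]["sum"]
    sent_mbps = s["bits_per_second"] / 1e6
    return sent_mbps * (1 - s["lost_percent"] / 100)

# Synthetic report shaped like an iperf3 UDP summary: a 1000 Mbps
# socket-write rate with 75 % loss leaves 250 Mbps of goodput.
report = json.dumps(
    {"end": {"sum": {"bits_per_second": 1000e6, "lost_percent": 75.0}}}
)
goodput = udp_receiver_goodput(report)
```

This is the arithmetic behind reading Table~\ref{tab:udp_baseline}:
the enormous sender figures only matter after the loss column has
taken its cut.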
\begin{table}[H]
  \centering
  \caption{UDP receiver throughput and packet loss at baseline
    (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out at
    120 seconds and are excluded.}
  \label{tab:udp_baseline}
  \begin{tabular}{lrr}
    \hline
    \textbf{VPN} & \textbf{Receiver (Mbps)} & \textbf{Loss (\%)} \\
    \hline
    Internal  & 952 & 0.0 \\
    WireGuard & 898 & 0.0 \\
    Nebula    & 890 & 76.2 \\
    Headscale & 876 & 69.8 \\
    EasyTier  & 865 & 78.3 \\
    Yggdrasil & 852 & 98.7 \\
    ZeroTier  & 851 & 89.5 \\
    VpnCloud  & 773 & 83.7 \\
    Tinc      & 471 & 89.9 \\
    \hline
  \end{tabular}
\end{table}

% TODO: Explain that the UDP test also crashes often, which makes the
% test somewhat unreliable but a good indicator that the traffic is
% "different" than the programmer expected.

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate
at the kernel level with proper backpressure that matches the sender
to the receiver rate. Every other VPN shows massive loss
(69--99\,\%) because the sender overwhelms the tunnel's userspace
processing capacity. Headscale shares WireGuard's cryptographic
protocol but, contrary to intuition, does not share its kernel
datapath: Tailscale's \texttt{magicsock} layer intercepts every
packet to handle endpoint selection and DERP relay, which is
incompatible with the in-kernel WireGuard module. Headscale
therefore runs \texttt{wireguard-go} entirely in userspace, and the
unbounded \texttt{-b~0} flood overruns that userspace pipeline just
as it overruns every other userspace implementation, producing
69.8\,\% loss despite the WireGuard branding. Yggdrasil's 98.7\,\%
loss is the most extreme: it sends the most data (due to its large
block size) but loses almost all of it. These loss rates do not
reflect real-world UDP behavior but reveal which VPNs implement
effective flow control. Hyprspace and Mycelium could not complete the
UDP test at all, timing out after 120 seconds.

% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
% the path MTU. It is derived from the socket MSS and reflects the
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size: Yggdrasil at 32\,731 bytes (jumbo overlay), ZeroTier at
2\,728, Internal at 1\,448, VpnCloud at 1\,375, WireGuard at 1\,368,
Tinc at 1\,353, EasyTier at 1\,288, Nebula at 1\,228, and Headscale
at 1\,208 (the smallest). These differences affect fragmentation
behavior under real workloads, particularly for protocols that send
large datagrams.

% TODO: Mention QUIC
% TODO: Mention again that the "default" settings of every VPN have
% been used to better reflect real-world use, as most users probably
% won't change these defaults, and explain that good defaults are as
% much a part of good software as having the features, but they are
% hard to configure correctly.

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Throughput}.png}
    \caption{UDP receiver throughput}
    \label{fig:udp_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Packet Loss}.png}
    \caption{UDP packet loss}
    \label{fig:udp_loss}
  \end{subfigure}
  \caption{UDP stress test results at baseline (\texttt{-b 0},
    unlimited sender rate). Internal and WireGuard are the only
    implementations with 0\,\% loss. Hyprspace and Mycelium are
    excluded due to 120-second timeouts.}
  \label{fig:udp_results}
\end{figure}

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal. Once connection setup,
TLS handshakes, and HTTP round-trips enter the picture, throughput
differences between 500 and 900\,Mbps matter far less than
per-connection latency.

\begin{table}[H]
  \centering
  \caption{Nix binary cache download time at baseline, sorted by
    duration. Overhead is relative to the internal baseline
    (8.53\,s).}
  \label{tab:nix_cache}
  \begin{tabular}{lrr}
    \hline
    \textbf{VPN} & \textbf{Mean (s)} & \textbf{Overhead (\%)} \\
    \hline
    Internal  & 8.53  & -- \\
    Nebula    & 9.15  & +7.3 \\
    ZeroTier  & 9.22  & +8.1 \\
    VpnCloud  & 9.39  & +10.0 \\
    EasyTier  & 9.39  & +10.1 \\
    WireGuard & 9.45  & +10.8 \\
    Headscale & 9.79  & +14.8 \\
    Tinc      & 10.00 & +17.2 \\
    Mycelium  & 10.07 & +18.1 \\
    Yggdrasil & 10.59 & +24.2 \\
    Hyprspace & 11.92 & +39.7 \\
    \hline
  \end{tabular}
\end{table}

Several rankings invert relative to raw throughput. ZeroTier finishes
faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite 6\,\% fewer raw
Mbps and 1\,000$\times$ more retransmits. Yggdrasil is the clearest
example: it has the fourth-highest VPN throughput at 795\,Mbps, yet
lands at 24\,\% overhead because its 2.2\,ms latency adds up over the
many small sequential HTTP requests that constitute a Nix cache
download. Figure~\ref{fig:throughput_vs_download} confirms this weak
link between raw throughput and real-world download speed.

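The overhead column is simply the relative slowdown against the
internal baseline. A minimal sketch, reproducing two rows of
Table~\ref{tab:nix_cache}:

```python
def overhead_pct(vpn_s, internal_s=8.53):
    """Download-time overhead relative to the internal baseline (%)."""
    return 100 * (vpn_s / internal_s - 1)

# WireGuard: 9.45 s against the 8.53 s baseline -> about +10.8 %.
wireguard = overhead_pct(9.45)
# Hyprspace: 11.92 s -> about +39.7 %, the worst in the table.
hyprspace = overhead_pct(11.92)
```

Stated this way, the compression of the hierarchy is stark: a VPN
that loses 60\,\% of raw throughput loses well under half of that in
wall-clock download time.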
\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache Mean Download Time}.png}
    \caption{Nix cache download time per VPN}
    \label{fig:nix_cache}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
    \caption{Raw throughput vs.\ download time}
    \label{fig:throughput_vs_download}
  \end{subfigure}
  \caption{Application-level download performance. The throughput
    hierarchy compresses under real HTTP workloads: the worst VPN
    (Hyprspace, 11.92\,s) is only 40\,\% slower than bare metal.
    Throughput explains some variance but not all: Yggdrasil
    (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
    because latency matters more for HTTP workloads.}
  \label{fig:nix_download}
\end{figure}

\paragraph{Video streaming (RIST).}

At 3.3\,Mbps, the RIST video stream sits well within every VPN's
throughput budget. The test therefore measures something else: how
well each VPN handles real-time UDP delivery under steady load.

Most VPNs pass without incident. Eight deliver 100\,\% quality,
Nebula sits just below at 99.8\,\%, and Hyprspace's headline figure
of 100\,\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run,
including Internal, are most likely encoder warm-up artifacts rather
than tunnel overhead, though we have not verified this directly.

% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the clear failure. Its mean quality is 13.1\,\%, and
each test interval drops 288 packets. The degradation is sustained
rather than bursty: median quality is 10\,\%, and the interquartile
range of dropped packets is a narrow 255--330. The qperf benchmark
also fails outright for Headscale at baseline, which rules out a
bulk-TCP explanation. Something in the real-time path is broken.

|
||
|
||
The failure is unexpected because Headscale builds on WireGuard,
|
||
which handles video without trouble, and Headscale's own TCP
|
||
throughput puts it in Tier~1. RIST runs over UDP, however, and
|
||
qperf probes latency-sensitive paths using both TCP and UDP. The
|
||
most plausible source is Headscale's DERP relay or NAT traversal
|
||
layer. Headscale's effective UDP payload size is 1\,208~bytes, the
|
||
smallest in the dataset. RIST packets larger than this would be
|
||
fragmented, and fragment reassembly under sustained load could
|
||
produce exactly the steady, uniform drop pattern the data shows.
|
||
This is a hypothesis, not a confirmed cause: it would need a
|
||
packet capture to verify. Either way, the result disqualifies
|
||
Headscale from video conferencing, VoIP, or any other real-time
|
||
media workload, regardless of TCP throughput.
|
||
|
||
% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
|
||
% percentiles all zero) are not visible in the RIST Quality bar chart.
|
||
% Add a distribution plot or note in the caption that the bar
|
||
% chart hides this variance.
|
||
Hyprspace fails differently. Its average quality reads 100\%, but
|
||
the raw drop counts underneath are unstable: mean packet drops of
|
||
1\,194 and a maximum spike of 55\,500. The 25th, 50th, and 75th
|
||
percentiles are all zero, so most runs deliver perfectly while a
|
||
small number suffer catastrophic bursts. RIST's forward error
|
||
correction recovers from most of these events, but the worst spikes
|
||
overwhelm FEC entirely.
|
||
|
||
\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Video Streaming/RIST Quality}.png}
    \caption{RIST video streaming quality at baseline. Headscale at
    13.1\% average quality is the clear outlier. Every other VPN
    achieves 99.8\% or higher. Nebula is at 99.8\% (minor
    degradation). The video bitrate (3.3\,Mbps) is well within every
    VPN's throughput capacity, so this test reveals real-time UDP
    handling quality rather than bandwidth limits.}
    \label{fig:rist_quality}
\end{figure}

\subsection{Operational Resilience}

Sustained-load performance does not predict recovery speed. How
quickly a tunnel comes up after a reboot, and how reliably it
reconverges, matters as much as peak throughput for operational use.

% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
% are not shown in any figure or table. Either add a figure or
% scrap this paragraph (see note below).
First-time connectivity spans a wide range. Headscale and WireGuard
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
(10--14\,s) spend seconds negotiating with their control planes
before passing traffic.

%TODO: Maybe we want to scrap first-time connectivity

Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
benchmarks use the default). After a reboot, a node must wait until
the next periodic update before its lighthouses learn its new
endpoint, so the reconnection time tracks the timer rather than any
topology-dependent convergence.
Mycelium sits at the opposite end, needing 76.6~seconds and showing
the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s),
suggesting a fixed protocol-level wait built into the overlay.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. The gap likely reflects the overlay's
spanning-tree rebuild: Yggdrasil organises the overlay as a spanning
tree and derives each node's routing coordinates from its position
in that tree, so after a restart a node near the root reconverges
quickly, while one further out has to wait for the topology to
propagate.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
        \caption{Average reconnection time per VPN}
        \label{fig:reboot_bar}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
        \caption{Per-node reconnection time heatmap}
        \label{fig:reboot_heatmap}
    \end{subfigure}
    \caption{Reboot reconnection time at baseline. The heatmap reveals
    Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
    95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
    across all nodes). Hyprspace reconnects fastest (8.7\,s average)
    despite its poor sustained-load performance.}
    \label{fig:reboot_reconnection}
\end{figure}

\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following subsections piece together observations from
earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency-under-load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase. This is not the network becoming
congested; it is the VPN tunnel itself filling up with buffered
packets and refusing to drain.

The consequences ripple through every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer, shrinking the max congestion window to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426.
% TODO: The explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally
% sender >= receiver. Consider verifying with packet captures or
% note this as a likely but unconfirmed explanation.
The buffering even inverts iPerf3's measurements: the receiver
reports 419.8\,Mbps while the sender sees only 367.9\,Mbps, likely
because massive ACK delays cause the sender-side timer to undercount
the actual data rate. The UDP test never finished at all, timing out
at 120~seconds.

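The ``one in every 200~segments'' figure can be sanity-checked from
the numbers above. The sketch below assumes a typical 1\,448-byte
MSS (an assumption, not a measured value; the tunnel MSS is somewhat
smaller, which barely changes the result) over the 30-second run at
Hyprspace's sender-side rate:

```go
package main

import "fmt"

func main() {
	// Figures taken from the Hyprspace single-stream iPerf3 run in the
	// text; the MSS is an assumed typical value for a 1500-byte MTU path.
	const (
		throughputMbps = 368.0
		durationSec    = 30.0
		mssBytes       = 1448.0
		retransmits    = 4965.0
	)

	// Total segments sent in one 30-second test.
	segments := throughputMbps * 1e6 / 8 * durationSec / mssBytes
	rate := retransmits / segments

	fmt.Printf("~%.0f segments, retransmit rate %.2f%% (1 in %.0f)\n",
		segments, rate*100, 1/rate)
}
```

The result lands near 1 in 190, consistent with the rounded 1-in-200
figure quoted above.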
% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.

Hyprspace does import the gVisor netstack, but reading the source
confirms that the gVisor TCP stack sits exclusively behind the
in-VPN ``service network'' feature. Regular tunnel traffic uses
an ordinary kernel TUN device created through the
\texttt{songgao/water} library, and the forwarding loop in
\texttt{node/node.go} only diverts a packet into the gVisor
stack when its destination falls inside the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol is TCP; everything else is shipped verbatim over a
libp2p stream and written back into the receiving peer's kernel
TUN. Listings~\ref{lst:hyprspace_kernel_tun},
\ref{lst:hyprspace_dispatch}, and \ref{lst:hyprspace_netstack}
show the relevant code in the upstream Hyprspace tree.

\lstinputlisting[language=Go,caption={Hyprspace creates a real
kernel TUN via \texttt{songgao/water}; this is the device every
peer-to-peer packet traverses.
\textit{hyprspace/tun/tun\_linux.go:14--36}},label={lst:hyprspace_kernel_tun}]{Listings/hyprspace_tun_linux.go}

\lstinputlisting[language=Go,caption={The IPv6 dispatch in the
Hyprspace forwarding loop only diverts to the gVisor service-network
TUN when the destination matches the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol byte is \texttt{0x06} (TCP); every other packet is left
on the kernel TUN path and forwarded over libp2p.
\textit{hyprspace/node/node.go:255--283}},label={lst:hyprspace_dispatch}]{Listings/hyprspace_dispatch.go}

\lstinputlisting[language=Go,caption={Hyprspace's gVisor netstack
initialiser only enables TCP SACK; there is no \texttt{TCPRecovery}
override (RACK stays at gVisor's default), no congestion-control
override, and no buffer-size override. The text in
\texttt{tun.go} also notes the file is taken verbatim from
wireguard-go.
\textit{hyprspace/netstack/tun.go:6--80}},label={lst:hyprspace_netstack}]{Listings/hyprspace_netstack.go}

Since the benchmark targets the regular Hyprspace IPv4/IPv6
addresses rather than service-network proxies, both endpoints
rely on their host kernel's TCP stack for the entire transfer.
Whatever options Hyprspace's gVisor instance might set
internally — congestion control, loss recovery, buffer sizes —
are therefore irrelevant to these measurements; the inner TCP
state machine the kernel runs is the only one in the path.
The same caveat applies more sharply to Tailscale, where the
upstream documentation talks about an in-process gVisor TCP
stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.

If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack. The most plausible
source is the libp2p / yamux stream layer through which raw IP
packets are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
destination peer in \texttt{activeStreams} and guards it with a
single per-peer \texttt{sync.Mutex}
(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent
application TCP flows to the same Hyprspace neighbour therefore
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once,
collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments — the kernel
sees only that the TUN write returned — so it keeps growing
its congestion window while the libp2p layer falls further
behind. The geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.

\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
in \texttt{activeStreams} and guards it with a per-peer
\texttt{sync.Mutex} held inside the \texttt{SharedStream}
record. The TUN-read loop spawns a fresh goroutine per packet
(\texttt{node.go:282}); each one calls \texttt{sendPacket} and
takes \texttt{ms.Lock} for the duration of the libp2p stream
write, so concurrent application TCP flows to the same
Hyprspace neighbour are serialised behind a single mutex.
\textit{hyprspace/node/node.go:36--39, 282,
328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go}

\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency appears to be the cost of
routing through a global overlay. The per-path numbers, however,
reveal a bimodal distribution:

\begin{itemize}
    \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path,
    comparable to Headscale at 1.64\,ms)
    \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed)
    \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed)
\end{itemize}

One of the three links has found a direct route; the other two still
bounce through the overlay. All three machines sit on the same
% TODO: Characterising path discovery as "failing intermittently"
% assumes direct routing is the expected outcome on a LAN. Mycelium
% is designed as a global overlay and may intentionally route
% through supernodes. If this is by-design behaviour, rephrase to
% avoid implying a bug. This characterisation also propagates to the
% impairment ping analysis in Section sec:impairment, which says
% impairment "pushes path discovery toward shorter routes."
physical network, so Mycelium's path discovery is not consistently
selecting the direct route, a more specific problem than blanket
overlay overhead. Throughput is also lopsided, but the split
\emph{inverts} the latency picture rather than mirroring it: the
overlay-routed yuki$\rightarrow$luna path reaches 379\,Mbps while
the direct luna$\rightarrow$lom path manages only 122\,Mbps, a 3:1
gap in the opposite direction from what TCP theory predicts. In
bidirectional mode, the reverse direction on that worst link drops
to 58.4\,Mbps, the lowest single-direction figure in the entire
dataset.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png}
    \caption{Per-link TCP throughput for Mycelium, showing extreme
    path asymmetry. The 3:1 ratio between best
    (yuki$\rightarrow$luna, 379\,Mbps) and worst
    (luna$\rightarrow$lom, 122\,Mbps) links runs opposite to the
    latency split: the low-latency direct link is the slowest
    (Section~\ref{sec:mycelium_routing}).}
    \label{fig:mycelium_paths}
\end{figure}

% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also clarify what
% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by short-lived
connections accumulate it rapidly. Bulk downloads, by contrast,
amortize it: the Nix cache test finishes only 18\,\% slower than
Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase
begins, per-connection latency fades into the background.

Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to
a fixed convergence timer in the overlay protocol — most likely a
default interval rather than anything topology-dependent.
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
% a configurable default, or coincidence.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.

% Explain what topology-dependent means in this case.

\paragraph{Tinc: Userspace Processing Bottleneck.}

Tinc is a clear case of a CPU bottleneck masquerading as a network
problem. At 1.19\,ms latency, packets get through the tunnel
quickly. Yet throughput tops out at 336\,Mbps, barely a third of
the bare-metal link. The usual suspects do not apply: Tinc's
effective UDP payload size (\texttt{blksize\_bytes} of 1\,353 from
UDP iPerf3, comparable to VpnCloud at 1\,375 and WireGuard at
1\,368) is in the normal range, and its retransmit count (240) is
moderate. What limits Tinc is its single-threaded userspace
architecture: one CPU core simply cannot encrypt, copy, and forward
packets fast enough to fill the pipe.

% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the Tinc CPU
% bottleneck diagnosis from above, but the 14.9% CPU figure has an
% unresolved TODO (the same utilization as VpnCloud at 539 Mbps).
% If the CPU claim is revised or refuted, this confirmation must be
% updated too.
The parallel benchmark supports this diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
Multiple TCP streams collectively keep that single core busy during
what would otherwise be idle gaps in any individual flow, squeezing
out throughput that no single stream could reach alone.

\section{Impact of network impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal conditions.
The impairment profiles in Table~\ref{tab:impairment_profiles} test
a different property: resilience. Two results dominate the data.

The first is the collapse of the throughput hierarchy. At High
impairment, the 675\,Mbps spread between fastest and slowest
implementation compresses to under 3\,Mbps. Architectural
differences that mattered at gigabit speeds become invisible once
the network is the bottleneck.

The second is harder to explain. Headscale outperforms the
bare-metal Internal baseline at Medium impairment across TCP,
parallel TCP, and the Nix cache benchmark. A VPN built on
WireGuard should not beat a direct connection.
Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers — by reading the rig's
own NixOS module — that the gVisor stack is not actually in the
data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way \texttt{wireguard-go}
batches packets between the wire and the host kernel TCP stack.

\subsection{Ping}

Latency is the most predictable metric under impairment. Most VPNs
absorb the injected delay with a fixed per-hop overhead, and
rankings within the central cluster barely change across profiles
(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and
15\,ms of round-trip delay at Low, Medium, and High respectively;
Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this.

\begin{table}[H]
    \centering
    \caption{Average ping RTT (ms) across impairment profiles,
    sorted by High-profile RTT}
    \label{tab:ping_impairment}
    \begin{tabular}{lrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} \\
        \hline
        Internal & 0.60 & 4.82 & 9.38 & 15.49 \\
        Tinc & 1.19 & 5.32 & 9.85 & 15.92 \\
        Nebula & 1.25 & 5.38 & 9.99 & 15.96 \\
        WireGuard & 1.20 & 5.36 & 9.88 & 15.99 \\
        Headscale & 1.64 & 5.82 & 10.39 & 16.07 \\
        VpnCloud & 1.13 & 5.41 & 10.35 & 16.21 \\
        ZeroTier & 1.28 & 5.34 & 10.02 & 16.54 \\
        Yggdrasil & 2.20 & 6.73 & 11.99 & 20.20 \\
        Hyprspace & 1.79 & 6.15 & 10.76 & 24.49 \\
        EasyTier & 1.33 & 6.27 & 14.13 & 26.60 \\
        Mycelium & 34.90 & 23.42 & 43.88 & 33.05 \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png}
    \caption{Average ping RTT across impairment profiles. Most VPNs
    form a tight parallel band; Mycelium's non-monotonic curve,
    EasyTier's excess latency at High, and Hyprspace's upward
    divergence stand out.}
    \label{fig:ping_impairment_heatmap}
\end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at
baseline to 23.4\,ms at Low impairment, a 33\% improvement at the
profile where every other VPN gets slower. It then climbs to
43.9\,ms at Medium before falling again to 33.0\,ms at High. The
baseline analysis (Section~\ref{sec:mycelium_routing}) showed that
Mycelium's latency comes from a bimodal routing distribution: one
path runs at 1.63\,ms, two others route through the global overlay
at ${\sim}$51\,ms.
% TODO: DOWNSTREAM DEPENDENCY — This explanation depends on the
% baseline characterisation of Mycelium's path discovery as "failing
% intermittently" (Section mycelium_routing). If that
% characterisation is revised (e.g., overlay routing is by-design,
% not a failure), then the claim that impairment "pushes path
% discovery toward shorter routes" needs rethinking: the mechanism
% would be different if Mycelium is not trying to find direct routes
% in the first place.
Impairment seems to push Mycelium's path selection toward the
shorter route, so a larger share of traffic avoids the overlay
detour. The non-monotonic curve is consistent with a path-selection
algorithm that reacts to measured link quality but not linearly
with degradation severity.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers can
% verify these numbers.
Mycelium loses zero ping packets at Low and Medium impairment.
Most other VPNs show 0.1--3.2\% loss at those profiles. At High
impairment Mycelium's loss jumps to 11.1\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping
% heatmap only shows averages. Add a jitter/distribution figure.
% Also, the "userspace retry mechanism" is a hypothesized cause
% without source-code or packet-level evidence.
EasyTier accumulates 11\,ms of excess latency at High impairment
beyond what tc~netem injects. Its average RTT is 26.6\,ms and its
maximum reaches 290\,ms, against ${\sim}$40\,ms for WireGuard. The
RTT standard deviation reaches 44.6\,ms at High, the worst jitter
of any VPN. A userspace retry mechanism is the likely cause, but
without source-code evidence we cannot say so with certainty.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on the exact test
% structure (3 pairs × 3 runs × 100 packets). Verify this matches
% the actual test setup and add a supporting figure or table.
Hyprspace shows the same 11.1\% ping packet loss at Low, Medium,
and High impairment. With 9~measurement runs per profile (3~machine
pairs $\times$ 3~runs of 100~packets), 11.1\% is exactly 1/9: one
run fails completely while the other eight report zero loss.
% TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the
% buffer bloat diagnosis from Section hyprspace_bloat, which depends
% on the unverified 2,800 ms under-load latency. If that diagnosis
% is revised, this explanation must also be revisited.
The binary pass/fail behaviour fits the buffer bloat diagnosis from
Section~\ref{sec:hyprspace_bloat}: when the tunnel's buffers fill, a
path stalls completely rather than degrading gradually.

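The 1/9 reading for Hyprspace is plain arithmetic over the test
geometry. A quick check, assuming the stated structure of 3~machine
pairs times 3~runs of 100~packets per profile:

```go
package main

import "fmt"

func main() {
	// Assumed test geometry from the text: 3 pairs x 3 runs x 100 packets.
	const pairs, runs, packets = 3, 3, 100
	total := pairs * runs * packets // 900 ping packets per profile

	// If exactly one run loses all of its packets and the rest lose none:
	loss := float64(packets) / float64(total)
	fmt.Printf("one failed run out of %d = %.1f%% loss\n", pairs*runs, loss*100)
}
```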
\subsection{TCP throughput}

The baseline TCP hierarchy does not survive impairment. The
three performance tiers from Section~\ref{sec:baseline} dissolve at
the first step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H]
    \centering
    \caption{Single-stream TCP throughput (Mbps) across impairment
    profiles, sorted by baseline. Retention is the Low-to-baseline
    ratio.}
    \label{tab:tcp_impairment}
    \begin{tabular}{lrrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} & \textbf{Retention} \\
        \hline
        Internal & 934 & 333 & 29.6 & 4.25 & 35.7\% \\
        WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\
        ZeroTier & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\
        Headscale & 800 & 274 & 41.5 & 4.21 & 34.3\% \\
        Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\
        \hline
        Nebula & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\
        EasyTier & 636 & 156 & 17.4 & 3.59 & 24.6\% \\
        VpnCloud & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\
        \hline
        Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\
        Tinc & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\
        Mycelium & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png}
    \caption{Single-stream TCP throughput across impairment profiles.
    Headscale crosses above Internal at Medium impairment;
    Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs
    converge at High.}
    \label{fig:tcp_impairment_heatmap}
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a
98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter,
0.25\% packet loss, and 0.5\% reordering per machine. Even Mycelium,
the slowest VPN at baseline (259\,Mbps), retains more throughput at
Low than Yggdrasil does. The jumbo overlay MTU of 32\,731~bytes
that inflated Yggdrasil's baseline numbers
(Section~\ref{sec:baseline}) becomes a liability under impairment:
every lost or reordered outer packet costs roughly 24$\times$ more
retransmitted inner data than a standard 1\,400-byte MTU VPN would
lose.

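The amplification factor is just the ratio of the two MTUs, under
the simplifying assumption that one lost outer packet invalidates
the full inner payload it carried; the ratio lands just above 23,
in line with the rough 24$\times$ figure:

```go
package main

import "fmt"

func main() {
	const (
		yggdrasilMTU = 32731.0 // Yggdrasil's jumbo overlay MTU (bytes)
		typicalMTU   = 1400.0  // a common tunnel MTU for comparison
	)
	// Inner data put at risk by a single lost or reordered outer packet,
	// relative to a standard-MTU VPN.
	fmt.Printf("retransmission amplification: %.1fx\n", yggdrasilMTU/typicalMTU)
}
```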
Headscale retains 34.3\% of its baseline throughput at Low, almost
the same as Internal's 35.7\%. At Medium impairment, Headscale
(41.5\,Mbps) overtakes Internal (29.6\,Mbps).
Section~\ref{sec:tailscale_degraded} investigates this anomaly in
detail.

At High impairment, the throughput range collapses from 675\,Mbps to
2.9\,Mbps. Internal leads at 4.25\,Mbps, Hyprspace trails at
1.39\,Mbps, and the impairment profile itself is the bottleneck.
With 2.5\% packet loss and 5\% reordering per machine, every
implementation is loss-limited, and the architectural differences
that mattered at gigabit speeds no longer matter at all.

\subsection{UDP throughput}

The UDP stress test (\texttt{-b~0}) separates implementations with
effective backpressure from those without it more cleanly than any
TCP benchmark. Under impairment, it also produces widespread
failures.
% TODO: Tinc fails at Low and Medium but succeeds at High (8 Mbps):
% the same non-monotonic failure pattern as Internal/WireGuard (fail
% at Low, succeed at Medium/High). This suggests the failures are
% iPerf3/tc interaction issues rather than fundamental VPN
% limitations. Nebula and VpnCloud also fail selectively. The
% widespread non-monotonic failure pattern undermines using this
% benchmark as a reliability indicator (see line 1163 claim).
% Consider discussing this pattern.
Hyprspace and Mycelium continue to time out at all profiles,
extending their baseline failures. Tinc drops out at Low and
Medium, ZeroTier at Medium. The data is sparse, but one pattern
emerges from the runs that did complete.

% TODO: The heatmap shows Internal and WireGuard both fail (×) at
% some impairment profiles (e.g., Internal fails at Low, WireGuard
% at Low and High). "Regardless of impairment" overstates the
% evidence. Rephrase to reflect the failures, or explain why
% those runs failed despite the claim of maintained throughput.
% TODO: Internal (and WireGuard) fail at Low impairment in the UDP
% test but succeed at Medium and High: the opposite of what one
% would expect. This is never explained. Investigate and add an
% explanation (e.g., iPerf3 crash, tc interaction, timing issue).
Three implementations maintain throughput at the profiles where
data exists. Internal holds ${\sim}$950\,Mbps at Baseline, Medium,
and High; WireGuard sustains 850--898\,Mbps; and Headscale sustains
700--876\,Mbps. % TODO: verify WireGuard UDP range --
% analysis doc says 850-898, possible digit transposition
Internal and WireGuard ride the host kernel's transport-layer
backpressure (Internal directly, WireGuard via the in-kernel
WireGuard module). Headscale, by contrast, never uses the kernel
module even though it builds on the WireGuard protocol: as
established in Section~\ref{sec:baseline}, Tailscale's
\texttt{magicsock} layer intercepts every packet for endpoint
selection, DERP relay, and the disco protocol, and that
interception is incompatible with the kernel WireGuard datapath.
Headscale therefore runs \texttt{wireguard-go} in userspace and
compensates with UDP batching
(\texttt{recvmmsg}/\texttt{sendmmsg}), host-kernel UDP
segmentation/aggregation offload
(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer
WireGuard socket), and a 7\,MB socket buffer on the same outer
socket. These offloads live in the host kernel; gVisor netstack
itself implements no UDP GSO or UDP GRO of its own. Together they
absorb a \texttt{-b 0} sender flood without collapsing. Userspace
VPNs without the same engineering do collapse: EasyTier drops from
865 to 435 to 38.5 to 6.1\,Mbps across successive profiles.
Yggdrasil, already pathological at baseline (98.7\% loss), crashes
to 12.3\,Mbps at Low and fails entirely at Medium and High.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
\caption{UDP receiver throughput across impairment profiles.
Implementations with effective UDP backpressure (Internal and
WireGuard via the in-kernel datapath; Headscale via
\texttt{wireguard-go} batching plus large socket buffers)
maintain high throughput where they complete; other userspace
VPNs collapse or fail entirely ($\times$ marks a failed run).
The failed runs at Low are non-monotonic and consistent with an
iPerf3/\texttt{tc} interaction rather than a VPN limit.}
\label{fig:udp_impairment_heatmap}
\end{figure}

Under impairment this benchmark is better read as a robustness
indicator than as a throughput measurement, with one caveat: the
non-monotonic failures noted above suggest that isolated
failures at Low reflect the iPerf3/\texttt{tc} test setup rather
than the VPN under test. Where an implementation fails
consistently across profiles, however, the inability to complete
a 30-second UDP flood under 0.25\% packet loss points to a
flow-control problem that will surface under real workloads too,
even when the symptoms are milder.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must
% match the baseline parallel test description. The baseline
% section has an unresolved TODO about whether the test uses 6
% or 10 streams. If the baseline is corrected to 10, this
% section must also be updated.
The Headscale anomaly from single-stream TCP grows larger under
parallel load. Table~\ref{tab:parallel_impairment} shows
aggregate throughput across three concurrent bidirectional links
(six unidirectional flows).

\begin{table}[H]
\centering
\caption{Parallel TCP throughput (Mbps) across impairment
profiles. Three concurrent bidirectional links produce six
unidirectional flows.}
\label{tab:parallel_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
\textbf{Medium} & \textbf{High} \\
\hline
Internal & 1398 & 277 & 82.6 & 10.4 \\
Headscale & 1228 & 718 & 113 & 20.0 \\
WireGuard & 1281 & 173 & 24.5 & 8.39 \\
Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
ZeroTier & 1206 & 176 & 35.4 & 7.97 \\
EasyTier & 927 & 473 & 57.4 & 10.7 \\
Hyprspace & 803 & 2.87 & 6.94 & 3.62 \\
VpnCloud & 763 & 174 & 23.7 & 8.25 \\
Nebula & 648 & 103 & 15.3 & 4.93 \\
Mycelium & 569 & 72.7 & 7.51 & 3.69 \\
Tinc & 563 & 168 & 23.7 & 8.25 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
\caption{Parallel TCP throughput across impairment profiles.
Headscale dominates at Low (718\,Mbps vs.\ Internal's 277);
EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to
2.87\,Mbps.}
\label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment, Headscale reaches 718\,Mbps: 2.6$\times$
Internal's 277\,Mbps and 4.1$\times$ WireGuard's 173\,Mbps. At
Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps)
by 37\%. Whatever mechanism produces the single-stream crossover
at Medium scales with the flow count, because each of the six
concurrent streams benefits from it independently.

EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs
that retain more than half their baseline parallel throughput at
Low impairment; no other implementation exceeds 30\%. We have no
direct architectural explanation for EasyTier's resilience and
do not claim one here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a
99.6\% loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer
% bloat diagnosis from Section hyprspace_bloat, which depends on
% the unverified 2,800 ms under-load latency. If that diagnosis
% is revised, this explanation for parallel collapse must also
% be revisited.
The buffer bloat that already plagues single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.

High-profile convergence is more pronounced here than in
single-stream mode. Tinc and VpnCloud land at an identical
8.25\,Mbps even though they differ by 200\,Mbps at baseline.

\subsection{QUIC performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline
(Section~\ref{sec:baseline}) and continue to fail at every
impairment profile.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to
7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High.
This is the same cliff observed in its TCP results, driven by
the same jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud
(23.4\,Mbps), ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps)
converge to within 0.4\,Mbps of one another. At baseline these
four span a 188\,Mbps range (656 to 844\,Mbps). QUIC's own
congestion control, running on top of an already-degraded outer
link, has become the sole limiter.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
\caption{QUIC bandwidth across impairment profiles. Yggdrasil
drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier,
and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and
Nebula fail at all profiles ($\times$).}
\label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every
VPN's throughput budget even at High impairment. Quality
differences in Table~\ref{tab:rist_impairment} therefore reflect
packet delivery reliability, not bandwidth.

\begin{table}[H]
\centering
\caption{RIST video streaming quality (\%) across impairment
profiles, sorted by High-profile quality}
\label{tab:rist_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
\textbf{Medium} & \textbf{High} \\
\hline
Mycelium & 100.0 & 100.0 & 100.0 & 99.9 \\
EasyTier & 100.0 & 100.0 & 96.2 & 85.5 \\
Internal & 100.0 & 99.2 & 89.3 & 80.2 \\
ZeroTier & 100.0 & 99.3 & 89.9 & 80.2 \\
VpnCloud & 100.0 & 99.2 & 89.7 & 80.1 \\
WireGuard & 100.0 & 99.3 & 90.0 & 80.0 \\
Hyprspace & 100.0 & 92.9 & 87.9 & 78.1 \\
Tinc & 100.0 & 99.3 & 90.0 & 77.8 \\
Nebula & 99.8 & 98.8 & 85.6 & 72.1 \\
Yggdrasil & 100.0 & 94.7 & 71.4 & 43.3 \\
Headscale & 13.1 & 13.0 & 13.0 & 13.0 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
\caption{RIST video streaming quality across impairment
profiles. Headscale is stuck at ${\sim}$13\% regardless of
profile; Mycelium maintains ${\sim}$100\% even at High;
Yggdrasil declines steeply to 43\%.}
\label{fig:rist_impairment_heatmap}
\end{figure}

Headscale sits at ${\sim}$13\% across all four profiles: 13.1\%,
13.0\%, 13.0\%, 13.0\%. This profile-independence supports the
baseline diagnosis (Section~\ref{sec:baseline}): the failure is
structural, most plausibly MTU fragmentation in the DERP relay
layer (a hypothesis consistent with the data but not yet
verified by packet capture), and cannot worsen because it is
already saturated. Adding latency or loss on top of an 87\%
packet-drop floor changes nothing.

Mycelium holds 99.9\% quality even at High impairment, ahead of
Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even
Mycelium's degraded overlay paths comfortably sustain the
stream. The same overlay routing that adds 34.9\,ms of latency
and cripples bulk TCP transfers is harmless at video bitrates,
and RIST's forward error correction handles the residual loss.

Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\%
at Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that
hurt TCP throughput likely hurts here as well: large overlay
packets are more exposed to loss and reordering at the outer
layer, and the resulting burst losses may exceed what RIST's
forward error correction can recover. Absent packet-level
evidence, this remains a hypothesis.

\subsection{Application-level download}

The Nix binary cache download is the most demanding
application-level benchmark. Hundreds of sequential HTTP
connections amplify the per-connection latency penalties that
bulk throughput tests amortise. Table~\ref{tab:nix_impairment}
shows download times across profiles.

\begin{table}[H]
\centering
\caption{Nix binary cache download time (seconds) across
impairment profiles, sorted by Low-profile time. ``--'' marks a
failed run.}
\label{tab:nix_impairment}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
\textbf{Medium} & \textbf{High} \\
\hline
Internal & 8.53 & 11.9 & 58.6 & -- \\
Headscale & 9.79 & 13.5 & 48.8 & 219 \\
EasyTier & 9.39 & 22.1 & 141 & -- \\
VpnCloud & 9.39 & 27.9 & 163 & -- \\
WireGuard & 9.45 & 28.8 & 161 & -- \\
Nebula & 9.15 & 30.8 & 180 & 547 \\
Tinc & 10.0 & 30.9 & 166 & 496 \\
ZeroTier & 9.22 & 36.2 & 141 & -- \\
Mycelium & 10.1 & 79.5 & -- & -- \\
Yggdrasil & 10.6 & 230 & -- & -- \\
Hyprspace & 11.9 & -- & 170 & -- \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
\caption{Nix binary cache download time across impairment
profiles. Headscale, Nebula, and Tinc complete all four
profiles; Headscale beats Internal at Medium (49\,s vs.\
59\,s). Yggdrasil's Low-profile time explodes to 230\,s
($\times$ marks a failed run).}
\label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all
four profiles. At Medium impairment, Headscale finishes in
48.8~seconds, faster than Internal's 58.6~seconds. Internal
itself fails at High impairment while Headscale completes in
219~seconds, Tinc in 496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low
impairment, a 22$\times$ slowdown. Every HTTP request pays the
latency penalty of Yggdrasil's impairment-amplified
retransmissions. Mycelium degrades almost as badly (10.1\,s to
79.5\,s, an 8$\times$ increase): its overlay routing overhead
compounds over hundreds of sequential HTTP connections.

The failure map shows a mostly clean gradient: more demanding
profiles knock out more VPNs. At Low, 10 of 11 finish. At
Medium, 9 finish. The exception to the gradient is Hyprspace,
which fails at Low yet completes Medium in 170\,s, another
non-monotonic anomaly that points at the test harness rather
than the VPN. At High, only Headscale, Nebula, and Tinc survive.
Internal's failure at High is the surprising one: the bare-metal
baseline cannot sustain a multi-connection HTTP workload under
severe degradation, while Headscale completes it.
Section~\ref{sec:tailscale_degraded} investigates why.

\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}

This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
Medium impairment, and at Low impairment under parallel load
beats it by a factor of 2.6. The short answer turns out to be
different from the obvious answer, and we worked it out only by
chasing the obvious answer to its end.

\subsection{An anomaly worth pursuing}

At Medium impairment, Headscale reaches 41.5\,Mbps on a single
TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for the
VPN over the direct host-to-host link it tunnels through.
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
Low and High impairment it lags Internal by some margin. Yet at
Medium the order inverts, and not by a sliver: a 12\,Mbps gap on
a 30\,Mbps link is well above measurement noise. The same thing
happens, more dramatically, on the parallel TCP test, where
Headscale's 718\,Mbps at Low beats Internal's 277\,Mbps by a
factor of 2.6. Table~\ref{tab:headscale_anomaly} collects the
comparison.

\begin{table}[H]
\centering
\caption{Headscale vs.\ Internal vs.\ WireGuard under impairment
(18.12.2025 run). For TCP benchmarks, higher is better. For Nix
cache, lower is better; ``--'' marks a failed run.}
\label{tab:headscale_anomaly}
\begin{tabular}{llrrr}
\hline
\textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} &
\textbf{Headscale} & \textbf{WireGuard} \\
\hline
Single TCP (Mbps) & Low & 333 & 274 & 54.7 \\
Single TCP (Mbps) & Medium & 29.6 & 41.5 & 8.77 \\
Single TCP (Mbps) & High & 4.25 & 4.21 & 2.63 \\
Parallel TCP (Mbps) & Low & 277 & 718 & 173 \\
Parallel TCP (Mbps) & Medium & 82.6 & 113 & 24.5 \\
Nix cache (s) & Medium & 58.6 & 48.8 & 161 \\
Nix cache (s) & High & -- & 219 & -- \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
\caption{Single-stream TCP throughput for Internal, Headscale,
and WireGuard across impairment profiles (log scale). Headscale
crosses above Internal at Medium impairment; WireGuard stays far
below both; all three converge at High.}
\label{fig:headscale_vs_internal}
\end{figure}

WireGuard-the-kernel-module is the obvious sanity check. It uses
the same Noise/WireGuard cryptographic protocol Tailscale ships
and is the closest available comparison without the rest of
Tailscale's stack. WireGuard shows none of Headscale's
advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well
below Internal at the same profile. So the encryption layer is
not the answer, and the basic UDP tunnel is not the answer.
Whatever Headscale is doing differently lives elsewhere in
Tailscale's implementation.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit rate
% table for impaired profiles or reference the data source.
The retransmit data narrows the search. At Medium, WireGuard's
TCP retransmit rate is 5.2\,\%, more than double Internal's
${\sim}$2.4\,\%. Headscale matches Internal at ${\sim}$2.4\,\%
even though it is a tunnelling VPN. Both Headscale and
bare-metal Internal run the same host kernel TCP stack at the
inner layer, so the asymmetry is not about a different TCP
implementation. It is about what the kernel TCP stack is being
asked to process: something on Headscale's path is suppressing
the spurious retransmits the kernel would otherwise fire under
\texttt{tc netem}-induced reordering, and WireGuard's path is
not.

\subsection{A plausible villain: Tailscale's gVisor stack}

The candidate explanation we pursued first, and the one any
reading of the upstream Tailscale documentation will lead to, is
Tailscale's userspace TCP/IP stack. The Tailscale client imports
Google's gVisor netstack (\texttt{gvisor.dev/gvisor/pkg/tcpip})
as a Go library and uses it as an in-process TCP implementation.
The gVisor documentation is direct about why this matters:
netstack is designed for adverse networks where the host
kernel's TCP defaults are too aggressive. Tailscale's release
notes go further, calling out specific overrides on top of
gVisor — the most visible being an explicit RACK disable and
8\,MiB / 6\,MiB receive and send buffers.

Reading Tailscale's source confirms it.
\texttt{wgengine/netstack/netstack.go} contains the netstack
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
reproduces the relevant overrides verbatim. RACK is disabled
(\texttt{TCPRecovery(0)}) with a comment pointing at
\texttt{tailscale/issues/9707}: ``gVisor's RACK performs poorly.
ACKs do not appear to be handled in a timely manner, leading to
spurious retransmissions and a reduced congestion window.'' Reno
is set explicitly with a comment pointing at
\texttt{gvisor/issues/11632}, an integer-overflow bug in
gVisor's CUBIC implementation. The TCP receive and send buffer
maxima are pushed up to 8\,MiB and 6\,MiB. SACK is enabled
(gVisor's default is off).

\lstinputlisting[language=Go,caption={Tailscale's gVisor
netstack initialiser explicitly disables RACK, pins Reno as the
congestion control, and enlarges the TCP buffer maxima. These
overrides live inside \texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}

Read against the Linux kernel defaults — RACK on, CUBIC by
default, ${\sim}$1\,MiB receive and send buffers,
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these
overrides describe a TCP stack better suited to a lossy,
reordering link than the host kernel. The hypothesis writes
itself: Headscale's iPerf3 traffic is processed by this gVisor
instance instead of by the host kernel TCP stack, and so it
inherits the more reordering-tolerant behaviour.
WireGuard-the-kernel-module shares only the cryptographic
protocol; it does not get the gVisor stack, and therefore does
not get the advantage.

It is a clean story. The natural way to test it is to extract
the parameters Tailscale sets inside gVisor, apply their nearest
Linux equivalents to the bare-metal host as sysctls, and see
whether Internal — with no VPN at all — picks up the same
advantage. If it does, the gVisor explanation is supported. If
it does not, the hypothesis fails.

\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}

We ran two follow-up benchmarks on the same hardware and
impairment setup as the original 18.12.2025 run.

\begin{itemize}
\bitem{Tailscale-style (27.02.2026):}
\texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0},
\texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes
(\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max},
\texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard,
Tinc, and ZeroTier.
\bitem{Reorder-only (06.03.2026):} Only
\texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and
\texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel
defaults. Tested on Internal and Headscale only.
\end{itemize}

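Expressed as NixOS configuration, the reorder-only variant is
three sysctl lines. The fragment below is our illustration of
such a configuration, not the benchmark suite's actual module:

```nix
{
  # Reorder-only tuning: relax the kernel's loss-detection
  # heuristics without touching buffer sizes.
  boot.kernel.sysctl = {
    "net.ipv4.tcp_reordering" = 10;   # default 3
    "net.ipv4.tcp_recovery" = 0;      # disable RACK loss detection
    "net.ipv4.tcp_early_retrans" = 0; # disable early retransmit and TLP
  };
}
```
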
\begin{table}[H]
\centering
\caption{Internal (no VPN) throughput across three kernel
configurations. ``Default'' is the 18.12.2025 run with stock
Linux TCP parameters.}
\label{tab:kernel_tuning_internal}
\begin{tabular}{llrrr}
\hline
\textbf{Metric} & \textbf{Profile} & \textbf{Default} &
\textbf{Tailscale-style} & \textbf{Reorder-only} \\
\hline
Single TCP (Mbps) & Baseline & 934 & 934 & 934 \\
Single TCP (Mbps) & Low & 333 & 363 & 354 \\
Single TCP (Mbps) & Medium & 29.6 & 64.2 & 72.7 \\
Parallel TCP (Mbps) & Low & 277 & 893 & 902 \\
Parallel TCP (Mbps) & Medium & 82.6 & 226 & 211 \\
Retransmit \% & Medium & ${\sim}$2.4 & 1.21 & 1.11 \\
Nix cache (s) & Medium & 58.6 & 29.7 & 29.1 \\
\hline
\end{tabular}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
\caption{Internal (no VPN) single-stream TCP throughput across
three kernel configurations. Baseline is unchanged; at Medium
impairment, throughput jumps from 30 to 64 to 73\,Mbps as
reordering tolerance increases.}
\label{fig:kernel_tuning_comparison}
\end{figure}

The result felt like confirmation. Internal's Medium-impairment
throughput jumped from 29.6\,Mbps to 72.7\,Mbps under the
reorder-only configuration — a 146\,\% increase from a
three-line sysctl change — and the retransmit rate at Medium
dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means more than
half of the original retransmissions were spurious. The Nix
cache download at Medium roughly halved, from 58.6\,s to
29.1\,s.

Parallel TCP gained more. Internal at Low climbed from 277 to
902\,Mbps, a 226\,\% increase that not only exceeds Internal's
old single-stream best but actually overtakes Headscale's
original 718\,Mbps from the unmodified run.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits
% the unresolved 6-vs-10 stream count from the baseline parallel
% test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not
% in any table. Add a table showing Headscale's results from the
% follow-up runs alongside Internal's so readers can verify the
% reversal.
Headscale itself, retested with the same sysctls, gained more
modestly: +21\,\% at Medium and a small $-$5\,\% wobble at Low.
And the anomaly reversed entirely. At Medium, tuned Internal
reached 72.7\,Mbps against Headscale's 50.1\,Mbps — a 45\,\%
lead for Internal where the original run had Headscale 40\,\%
ahead. The Nix cache flipped the same way: Internal completed in
29.1\,s against Headscale's 36.3\,s, where the original had
Headscale 17\,\% faster.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
\caption{Internal-to-Headscale speed-up factor before and after
kernel tuning. Values above 1.0 mean Internal is faster. At
Medium impairment, the ratio flips from 0.71$\times$ (Headscale
ahead) to 1.45$\times$ (Internal ahead).}
\label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration matched or exceeded the full
Tailscale-style configuration on most metrics. The two
exceptions were single-stream TCP at Low (354 vs.\ 363\,Mbps)
and parallel TCP at Medium (211 vs.\ 226\,Mbps), both within
7\,\%. The enlarged buffer sizes did not help and may have added
mild buffer bloat that partially offset the reordering benefit,
though the gap could also be run-to-run variance. Either way,
the entire Headscale advantage on Internal collapsed to three
host-kernel sysctls: \texttt{tcp\_reordering},
\texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.

At this point in the investigation the hypothesis seemed
settled. Tailscale's gVisor stack ships with these overrides;
the bare-metal kernel ships with stricter defaults; matching the
kernel to gVisor reproduces the effect. Then we checked which
Tailscale code path the test rig was actually running.

\subsection{The data path that was not there}

In default mode — what anyone running \texttt{tailscale up} on a
Linux host gets — the Tailscale client creates a real kernel TUN
device, registers a route for the Tailscale subnet through it,
and forwards inbound and outbound packets through that
interface. An application like iPerf3 issues a \texttt{connect}
to the remote peer's Tailscale IP. The host kernel TCP stack
handles the application TCP. The kernel routes the resulting
outbound packets to the TUN device. \texttt{tailscaled} (with
\texttt{wireguard-go} embedded) reads them from the TUN,
encrypts them, and sends them as outer WireGuard UDP packets on
the wire. The receiving side reverses the process and writes the
decrypted inner packets back into its own TUN, where the host
kernel TCP stack delivers them to the iPerf3 server.

In that path, gVisor netstack is never instantiated. The
netstack initialiser in
Listing~\ref{lst:tailscale_netstack_overrides} only runs when
\texttt{tailscaled} is launched with
\texttt{--tun=userspace-networking}, a mode that has no kernel
TUN at all and is reachable only from processes running inside
\texttt{tailscaled} itself (Tailscale SSH, Taildrop, the metric
endpoint). External processes such as iPerf3 cannot reach the
Tailscale network in that mode.

The test rig does not use that mode.
Listing~\ref{lst:nixos_tailscale} shows the relevant line of the
upstream NixOS \texttt{services.tailscale} module, which
assembles the daemon command line as \texttt{tailscaled --tun
\$\{cfg.interfaceName\}~\dots}, with no
\texttt{userspace-networking} fall-back unless the operator
explicitly sets \texttt{interfaceName = "userspace-networking"}.
Listing~\ref{lst:rig_interface_name} shows what the benchmark
suite's Headscale module sets the interface name to:
\texttt{ts-\$\{instanceName\}}, truncated to fifteen characters.
The two together resolve to \texttt{tailscaled --tun
ts-headscale} on every test machine, a real kernel TUN. gVisor
netstack is unreachable from any external benchmark traffic in
this rig.

\lstinputlisting[language=Nix,caption={The NixOS
\texttt{services.tailscale} module passes \texttt{--tun
\$\{interfaceName\}} as the daemon's TUN argument. There is no
\texttt{--tun=userspace-networking} fall-back unless the user
explicitly sets \texttt{interfaceName = "userspace-networking"}.
\textit{nixpkgs/nixos/modules/services/networking/tailscale.nix:158}},label={lst:nixos_tailscale}]{Listings/nixos_tailscale.nix}

\lstinputlisting[language=Nix,caption={The benchmark suite's
Headscale module sets \texttt{interfaceName} to a real kernel
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
Combined with Listing~\ref{lst:nixos_tailscale}, this means
\texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale}
on every test machine.
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}

The empirical fingerprint pins the same conclusion down without
source-code reading. Headscale itself gained +21\,\% at Medium
from the host-kernel sysctl tuning. If Headscale's iPerf3
traffic were processed by gVisor netstack, host-kernel sysctls
would change nothing — they configure the host kernel TCP stack
and only the host kernel TCP stack. The fact that Headscale
moves measurably under those sysctls is direct evidence that
Headscale's application TCP runs on the host kernel stack, just
as Internal's does.

The validation experiment was therefore validating something
other than the hypothesis it was supposed to validate. It was
confirming, very cleanly, that the Linux kernel's default
\texttt{tcp\_reordering=3} is too tight for the kind of bursty,
correlated reordering the Medium profile produces, and that
loosening it produces a large throughput gain on a kernel-TCP
data path. That part of the result stands. What does not stand
is the inference that the gain reproduces something Tailscale
was already doing in gVisor. For this benchmark, Tailscale is
not in the gVisor TCP business at all.

\subsection{Where the advantage actually lives}

The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host
kernel TCP stack. Whatever Headscale is doing — partially,
weakly, but reproducibly — is worth roughly twelve megabits per
second on the Medium profile, and it is not gVisor netstack.

The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%,
Headscale would already be doing the sysctls' work; if it were
+146\,\% like Internal's, Headscale would be doing nothing of
its own. The partial response says Headscale's mechanism
produces an effect similar in kind to the sysctls but smaller in
size, and that the two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are
|
||
the most likely candidates, and both live on the kernel-TUN path
|
||
that Tailscale actually uses in the rig.
|
||
|
||
The first is TUN TCP and UDP generic receive offload. Tailscale's
|
||
\texttt{tstun} wrapper enables both on the kernel TUN device on
|
||
Linux unless an environment knob disables them or a runtime probe
|
||
rejects the feature (Listing~\ref{lst:tstun_gro}). On the
|
||
receive side, this means \texttt{wireguard-go} decrypts a burst
|
||
of inbound WireGuard frames and then coalesces consecutive
|
||
in-order TCP segments belonging to the same flow into a single
|
||
super-segment before writing them back to the kernel TUN. On the
|
||
transmit side, it accepts GSO super-segments from the kernel TUN
|
||
read in the same way. The receiving kernel TCP stack therefore
|
||
sees fewer, larger segments per coalesced batch instead of $N$
|
||
small ones, and the segment timing that survives to the kernel is
|
||
the timing of GRO batches rather than of individual on-the-wire
|
||
packets. Bare-metal Internal traffic has no equivalent path
|
||
because it does not pass through any user-space TUN at all.
|
||
|
||
\lstinputlisting[language=Go,caption={Tailscale enables TUN TCP
|
||
and UDP GRO on every Linux non-TAP \texttt{tailscaled} process
|
||
unless the operator disables them via environment knobs or a
|
||
kernel runtime probe rejects the feature. This is in the default
|
||
kernel-TUN data path; it is not gated on
|
||
\texttt{--tun=userspace-networking}.
|
||
\textit{tailscale/net/tstun/wrap\_linux.go:25--43}},label={lst:tstun_gro}]{Listings/tstun_gro.go}

The second is the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins on the WireGuard UDP socket
(Listing~\ref{lst:magicsock_buffer}), using the ``force''
\texttt{SO\_*BUFFORCE} variant where available so the value is
honoured even past \texttt{net.core.rmem\_max}. The host kernel
default is in the low hundreds of KiB. Under burst-correlated
impairment — Medium and High both use 50\,\% correlation, so
losses and reorderings cluster — this larger buffer absorbs
spikes in arrival rate that would otherwise overflow the kernel
UDP receive queue and surface as additional inner-TCP losses.
Internal has no such cushion on its incoming wire path.

\lstinputlisting[language=Go,caption={\texttt{magicsock} pins the
outer WireGuard UDP socket's send and receive buffers to 7\,MiB
and uses \texttt{SetBufferSize} with the \texttt{SO\_*BUFFORCE}
(``force'') variant where available, so the value is honoured
even past \texttt{net.core.rmem\_max}.
\textit{tailscale/wgengine/magicsock/magicsock.go:86,3908--3913}},label={lst:magicsock_buffer}]{Listings/magicsock_buffer.go}

% TODO: Neither of the two candidate mechanisms above is directly
% verified in this chapter. A targeted follow-up — for example
% tcpdump on the receiving \texttt{tailscale0} interface during a
% Medium-impairment iPerf3 run, with inter-arrival timing
% analysis — would distinguish their relative contributions and
% confirm the mechanism. The argument here is that they are the
% most plausible candidates consistent with the evidence, not
% measured causes.

A third feature, batched UDP I/O, completes the picture without
changing it qualitatively. \texttt{wireguard-go} uses
\texttt{recvmmsg} and \texttt{sendmmsg} on the outer UDP socket
so a burst of WireGuard frames moves through a single system
call. This does not change \emph{whether} packets are reordered,
but it reduces per-packet timing jitter that the kernel might
otherwise interpret as additional reordering.

Hyprspace cannot be used as a negative control for any of this.
It does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the
\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP
buffer), not on whether gVisor handles their inner TCP. The
gVisor angle simply does not apply to either of them in this
benchmark.

The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks
expose. \texttt{net.ipv4.tcp\_reordering} (default \texttt{3}) is
the number of out-of-order segments the kernel will tolerate
before declaring fast retransmit, and with \texttt{tc netem}
injecting 0.5--2.5\,\% reordering per machine, bursts of several
reordered packets are frequent enough that the threshold is
repeatedly tripped on the bare-metal path.
\texttt{net.ipv4.tcp\_recovery} (default \texttt{1}, RACK
enabled) adds time-based reordering detection on top of the
segment-count threshold, which compounds the spurious retransmits
when reordering is high. And \texttt{net.ipv4.tcp\_early\_retrans}
(default \texttt{3}, Tail Loss Probe enabled) fires speculative
retransmits when unacknowledged segments sit at the tail of a
transmission window, which interacts poorly with an
already-impaired link. Loosening any one of the three softens the
kernel's loss detection on the bare-metal path; loosening all
three recovers most of the throughput. The Headscale path reaches
the same kernel TCP stack but is already feeding it the
GRO-coalesced, buffer-cushioned stream described above, so the
kernel's tight defaults fire less often there to begin with.

The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
\texttt{tcp\_reordering} threshold never trips and Internal's
native kernel-stack speed wins. As reordering rises from 0.5\,\%
(Low) to 2.5\,\% (Medium) per machine, the kernel's loss
detection fires on the bare-metal path more often than on the
GRO-coalesced Headscale path, and the throughput gap shifts in
Headscale's favour. At High impairment, both converge to
${\sim}$4.2\,Mbps: absolute packet loss becomes the dominant
bottleneck, and reordering tolerance no longer matters.

% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier
% (11.5 Mbps) tuned values are not in any table. Add them to
% Table~\ref{tab:kernel_tuning_internal} or a new table.
Other VPNs respond unevenly to the same sysctl tuning.
WireGuard's Medium throughput rises from 8.77 to 12.2\,Mbps
(+39\,\%), Tinc's from 5.53 to 11.5\,Mbps (+108\,\%), and
ZeroTier stays flat (12.0 to 11.5\,Mbps). % TODO: The
% reading below — that VPNs which add their own encapsulation and
% userspace processing have bottlenecks the host kernel sysctls
% cannot touch — does not cleanly fit the data: Tinc (a fully
% userspace VPN) shows the largest gain (+108\,\%), larger than
% kernel-WireGuard's. A more complete explanation has to account
% for which TCP stack each VPN's application traffic actually
% traverses and which of those stacks the sysctls actually reach.
The intuitive reading is that VPNs which add their own
encapsulation and userspace processing have bottlenecks the host
kernel sysctls cannot touch, but Tinc's large gain shows the
picture is not that simple.

The resilient finding from this section, the one that survives
regardless of which of the two Tailscale-side mechanisms turns
out to dominate, is not about Tailscale at all. It is about
Linux. The kernel's default \texttt{tcp\_reordering=3} threshold
is too tight for the kind of bursty, correlated reordering
\texttt{tc netem} produces at the Medium profile, and it costs
the bare-metal host more than half of its achievable throughput.
Three lines of \texttt{sysctl} repair it. The fix is portable to
any Linux host and entirely independent of any VPN.

The unresilient finding — the one that motivated us to write this
section in the first place — is that Tailscale's much-discussed
userspace TCP stack is, for the workload that exposed the
anomaly, sitting on the bench. The advantage we attributed to it
must come from a more ordinary place: the way
\texttt{wireguard-go} batches and coalesces packets between the
wire and the kernel TCP stack, and the larger UDP buffer it pins
on its outer socket. We were chasing the wrong hypothesis with
the right experiment, and the experiment turned out to be more
useful than the hypothesis.

% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source
% code analysis." Either write these sections or remove the
% promise from the intro.

\section{Source code analysis}

\subsection{Feature matrix overview}

% Summary of the 108-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of findings}

% Brief summary table or ranking of VPNs by key metrics. Save
% deeper interpretation for a Discussion chapter.