% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behavior. A recurring theme is
that no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.
\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just 0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal
throughput with only a single retransmit across an entire 30-second
test. Mycelium sits at the other extreme: 34.9\,ms of latency,
roughly 58$\times$ the bare-metal figure.

A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is
only a control-plane server. Statements below about ``Headscale''
running \texttt{wireguard-go} should be read as statements about
the Tailscale client in this scenario.
Section~\ref{sec:tailscale_degraded} covers the specifics of how
the rig launches \texttt{tailscaled} and which Tailscale code
paths that choice activates.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. Actual benchmark execution
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
installation and deployment accounted for another 45~minutes
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
for VPN tunnels to come up after restarts. VPN service restarts and
traffic-control (tc) stabilization took the remainder.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
performance benchmark after six retries, while Hyprspace and Mycelium
failed the UDP iPerf3 test with a 120-second timeout. Their
individual success rate is 85.7\,\%, with all other VPNs passing the
full suite (Figure~\ref{fig:success_rate}).
\begin{figure}[H]
\centering
\begin{subfigure}[t]{1.0\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Average Test
Duration per Machine}.png}
\caption{Average test duration per VPN, including installation
time and benchmark execution}
\label{fig:test_duration}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{1.0\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Benchmark
Success Rate}.png}
\caption{Benchmark success rate across all seven tests}
\label{fig:success_rate}
\end{subfigure}
\caption{Test execution overview. Hyprspace has the longest average
duration due to UDP timeouts and long VPN connectivity
waits. WireGuard completes fastest. Nebula, Headscale,
Hyprspace, and Mycelium each fail one benchmark.}
\label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages. Three distinct performance tiers emerge, separated by
natural gaps in the data.

\begin{table}[H]
\centering
\caption{Single-stream TCP throughput at baseline, sorted by
throughput. Retransmits are averaged per 30-second test across
all three link directions. The horizontal rules separate the
three performance tiers.}
\label{tab:tcp_baseline}
\begin{tabular}{lrrr}
\hline
\textbf{VPN} & \textbf{Throughput (Mbps)} &
\textbf{Baseline (\%)} & \textbf{Retransmits} \\
\hline
Internal & 934 & 100.0 & 1.7 \\
WireGuard & 864 & 92.5 & 1 \\
ZeroTier & 814 & 87.2 & 1163 \\
Headscale & 800 & 85.6 & 102 \\
Yggdrasil & 795 & 85.1 & 75 \\
\hline
Nebula & 706 & 75.6 & 955 \\
EasyTier & 636 & 68.1 & 537 \\
VpnCloud & 539 & 57.7 & 857 \\
\hline
Hyprspace & 368 & 39.4 & 4965 \\
Tinc & 336 & 36.0 & 240 \\
Mycelium & 259 & 27.7 & 710 \\
\hline
\end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
Throughput}.png}
\caption{Average single-stream TCP throughput}
\label{fig:tcp_throughput}
\end{figure}
Raw throughput alone tells an incomplete story. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier,
Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%).
Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and
0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. ZeroTier
reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating
for tunnel-internal loss through repeated TCP congestion-control
recovery; WireGuard delivers comparable throughput with effectively
zero loss.
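A sketch of the normalization, with the packet count estimated from
the bytes transferred and the negotiated segment size reported by
iPerf3:
\[
r_{\mathrm{retx}} \;=\; \frac{N_{\mathrm{retx}}}{B_{\mathrm{transferred}} / S_{\mathrm{segment}}}.
\]
For Hyprspace, roughly 1.4\,GB transferred in 30\,s at 368\,Mbps over
segments of about 1\,400 bytes gives on the order of a million
segments, so its ${\sim}$5\,000 retransmits land near the 0.49\,\%
shown in the figure.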
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
Retransmit Rate}.png}
\caption{TCP retransmit rate at baseline. WireGuard and Headscale
are effectively loss-free ($<$\,0.01\,\%). Hyprspace is the clear
outlier at 0.49\,\%.}
\label{fig:tcp_retransmits}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each one triggers a reduction in the congestion window
(\texttt{cwnd}) and throttles the sender.
Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of
any VPN. At first glance this suggests a clean inverse correlation
between retransmit rate and congestion window size, but the picture
is misleading. Yggdrasil's outsized window is largely an artifact of
its jumbo overlay MTU (32\,731 bytes): each segment carries far more
data, so the window in bytes is inflated relative to VPNs using a
standard ${\sim}$1\,400-byte MTU. Comparing congestion windows
across different MTU sizes is not meaningful without normalizing for
segment size. The reliable conclusion is simpler: high retransmit
rates force TCP to spend more time in congestion recovery than in
steady-state transmission, and that caps throughput regardless of
available bandwidth. ZeroTier illustrates the opposite extreme:
brute-force retransmission can still yield high throughput
(814\,Mbps at a 0.10\,\% rate), at the cost of wasted bandwidth and
unstable flow behavior.
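A back-of-the-envelope model makes the same point. The classic
Mathis et al.\ approximation for loss-limited TCP, used here only as
a sanity check and not as part of the measurement pipeline, relates
steady-state throughput to segment size, round-trip time, and loss
probability $p$:
\[
\mathrm{throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \frac{C}{\sqrt{p}},
\qquad C \approx 1.22 .
\]
The $1/\sqrt{p}$ term is why Hyprspace's 0.49\,\% rate hurts so much
more than ZeroTier's 0.10\,\%, and the MSS factor is why Yggdrasil's
jumbo segments buy it headroom that its raw window size overstates.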
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Max TCP
Window Size}.png}
\caption{Maximum TCP window sizes (send and congestion) at baseline.
Yggdrasil's congestion window (4\,219\,KB) dwarfs all others but
is inflated by its 32\,KB jumbo overlay MTU. Hyprspace has the
smallest congestion window (200\,KB).}
\label{fig:tcp_window}
\end{figure}

VpnCloud stands out: its sender reports 538.8\,Mbps but the
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.
How much throughput varies, whether stochastically across runs or
systematically across links, also differs substantially between
VPNs. WireGuard's three link directions cluster tightly (824 to
884\,Mbps, a 60\,Mbps window) and are nearly indistinguishable.
Mycelium's three directions span 122 to 379\,Mbps, a 3:1 ratio, but
this is not run-to-run noise: Section~\ref{sec:mycelium_routing}
shows the spread is per-link path-selection asymmetry, with one link
finding a direct route and the other two routing through the global
overlay. Either way, a VPN whose throughput varies that widely
across links is harder to capacity-plan around than one that
delivers a consistent figure on every direction.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
\caption{Retransmits vs.\ throughput}
\label{fig:retransmit_throughput}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
\caption{Retransmits vs.\ max congestion window}
\label{fig:retransmit_cwnd}
\end{subfigure}
\caption{Retransmit correlations (log scale on x-axis). A high
retransmit rate does not always mean low throughput (ZeroTier:
0.10\,\%, 814\,Mbps), but an extreme rate does (Hyprspace:
0.49\,\%, 368\,Mbps). The apparent inverse correlation between
retransmit rate and congestion window size is dominated by
Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
by its 32\,KB jumbo overlay MTU rather than by a low retransmit
rate alone.}
\label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges. The table also
reports the average maximum RTT observed across test runs and the
resulting spike ratio (max/avg); a high ratio signals bursty tail
latency that the average alone conceals.
\begin{table}[H]
\centering
\caption{Ping RTT statistics at baseline, sorted by average latency.
The spike ratio is max\,RTT\,/\,avg\,RTT; higher values indicate
bursty tail latency.}
\label{tab:latency_baseline}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Avg RTT (ms)} & \textbf{Max RTT (ms)}
& \textbf{Spike Ratio} & \textbf{Jitter (ms)} \\
\hline
Internal & 0.60 & 0.65 & 1.1$\times$ & 0.04 \\
VpnCloud & 1.13 & 3.14 & 2.8$\times$ & 0.25 \\
Tinc & 1.19 & 1.31 & 1.1$\times$ & 0.07 \\
WireGuard & 1.20 & 1.81 & 1.5$\times$ & 0.13 \\
Nebula & 1.25 & 1.53 & 1.2$\times$ & 0.10 \\
ZeroTier & 1.28 & 3.00 & 2.3$\times$ & 0.25 \\
EasyTier & 1.33 & 1.55 & 1.2$\times$ & 0.10 \\
\hline
Headscale & 1.64 & 1.81 & 1.1$\times$ & 0.09 \\
Hyprspace & 1.79 & 2.21 & 1.2$\times$ & 0.13 \\
Yggdrasil & 2.20 & 3.13 & 1.4$\times$ & 0.20 \\
\hline
Mycelium & 34.9 & 48.6 & 1.4$\times$ & 1.49 \\
\hline
\end{tabular}
\end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal
0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the
lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms),
yet its throughput tops out at only 539\,Mbps.
Low per-packet latency does not guarantee high bulk throughput. A
second group (Headscale, Hyprspace, Yggdrasil) lands in the
1.5--2.2\,ms range, representing moderate overhead. Then there is
Mycelium at 34.9\,ms, so far removed from the rest that
Section~\ref{sec:mycelium_routing} gives it a dedicated analysis.
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two
outliers among the low-latency VPNs. VpnCloud leads at
2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at
2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest
jitter in the table (0.25\,ms). Tinc and Headscale, by contrast,
stay at or below 1.1$\times$ with jitter of at most 0.09\,ms, so
their packet timing is nearly as stable as bare metal. The spikes in
VpnCloud and ZeroTier are consistent with periodic control-plane
work such as key rotation or peer heartbeats that briefly stalls the
data path.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
\caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
massive outlier at 58$\times$ the internal baseline. VpnCloud is
the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
\label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
yet the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet something caps the overall rate.
Figure~\ref{fig:tcp_cpu} shows that Tinc uses only 12.3\,\% host CPU
during the TCP test. On a multi-core host this figure is consistent
with a single saturated core (on an eight-core machine, for example,
one fully busy core reads as 12.5\,\% system-wide), which fits
Tinc's single-threaded userspace architecture: one core encrypts,
copies, and forwards packets, and the remaining cores sit idle.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP CPU
Utilization}.png}
\caption{CPU utilization during TCP throughput tests, split by host
(sender) and remote (receiver). Tinc (12.3\,\%) and VpnCloud
(14.2\,\%) use similar CPU, yet VpnCloud achieves 60\,\% higher
throughput. Yggdrasil's low CPU (2.7\,\%) reflects its
kernel-level forwarding with jumbo segments.}
\label{fig:tcp_cpu}
\end{figure}

VpnCloud is also single-threaded and uses slightly more CPU
(14.2\,\%), yet reaches 539\,Mbps (60\,\% more throughput). The gap
comes down to per-packet cost. Tinc uses a hand-written
ChaCha20-Poly1305 implementation without hardware acceleration,
allocates a fresh stack buffer and copies the payload for each
packet, and routes through a splay-tree lookup. VpnCloud uses the
\texttt{ring} cryptographic library, which employs optimized assembly
and can select AES-128-GCM with hardware AES-NI instructions at
runtime; it encrypts in place with no extra buffer copies and routes
through an $O(1)$ hash-map lookup. These differences compound in a
tight single-threaded loop: every microsecond saved per packet raises
the maximum packet rate the one available core can sustain.
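A rough per-packet budget makes the arithmetic concrete; this is a
sketch that takes the effective payload sizes reported by the UDP
test as typical packet sizes:
\[
\frac{336 \times 10^{6}\ \mathrm{bit/s}}{8 \times 1353\ \mathrm{B}}
\approx 31\,000\ \mathrm{packets/s}
\;\;\Longrightarrow\;\;
\approx 32\ \mu\mathrm{s}\ \text{per packet for Tinc},
\]
whereas VpnCloud's 539\,Mbps over 1\,375-byte payloads implies
roughly 49\,000 packets per second, or about 20\,$\mu$s per packet.
Shaving a handful of microseconds off the per-packet path is exactly
the difference between the two ceilings.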
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
\caption{Latency vs.\ throughput at baseline. Each point represents
one VPN. The quadrants reveal different bottleneck types:
VpnCloud (low latency, moderate throughput), Tinc (low latency,
low throughput, CPU-bound), Mycelium (high latency, low
throughput, overlay routing overhead).}
\label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time. The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom)
run simultaneously in a circular pattern for 60~seconds, each
carrying one bidirectional TCP stream (six unidirectional flows in
total). Because three independent link pairs now compete for shared
tunnel resources at once, the aggregate throughput is naturally
higher than any single direction alone, which is why even Internal
reaches 1.50$\times$ its single-stream figure. The scaling factor
(parallel throughput divided by single-stream throughput) captures
two effects: the benefit of using multiple link pairs in parallel,
and how well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.
\begin{table}[H]
\centering
\caption{Parallel TCP scaling at baseline. Scaling factor is the
ratio of parallel to single-stream throughput. Internal's
1.50$\times$ represents the expected scaling on this hardware.}
\label{tab:parallel_scaling}
\begin{tabular}{lrrr}
\hline
\textbf{VPN} & \textbf{Single (Mbps)} &
\textbf{Parallel (Mbps)} & \textbf{Scaling} \\
\hline
Mycelium & 259 & 569 & 2.20$\times$ \\
Hyprspace & 368 & 803 & 2.18$\times$ \\
Tinc & 336 & 563 & 1.68$\times$ \\
Yggdrasil & 795 & 1265 & 1.59$\times$ \\
Headscale & 800 & 1228 & 1.54$\times$ \\
Internal & 934 & 1398 & 1.50$\times$ \\
ZeroTier & 814 & 1206 & 1.48$\times$ \\
WireGuard & 864 & 1281 & 1.48$\times$ \\
EasyTier & 636 & 927 & 1.46$\times$ \\
VpnCloud & 539 & 763 & 1.42$\times$ \\
Nebula & 706 & 648 & 0.92$\times$ \\
\hline
\end{tabular}
\end{table}
The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT gives it a
bandwidth-delay product (Equation~\ref{eq:bdp}) of roughly
4.4\,MB on a 1\,Gbps link. No single TCP flow maintains a
congestion window that large, so the link is never fully utilized.
Multiple concurrent flows each contribute their own window, and
their aggregate in-flight data approaches the BDP, which pushes
throughput to 2.20$\times$ the single-stream figure.

Hyprspace scales almost as well (2.18$\times$) for the same
structural reason, but the bottleneck is different. Its libp2p send
pipeline accumulates roughly 2\,800\,ms of under-load latency
(Section~\ref{sec:hyprspace_bloat}), which inflates the effective BDP
to hundreds of megabytes, far beyond any single kernel congestion
window. Because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three
concurrent peer pairs in the parallel benchmark each get their own
libp2p stream, their own mutex, and their own yamux flow-control
window. Three independent windows in flight fill more of the bloated
pipeline than one can.
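Both arguments follow from Equation~\ref{eq:bdp}. At the link's
nominal 1\,Gbps, and taking the ${\sim}$2\,800\,ms under-load RTT at
face value:
\[
\mathrm{BDP}_{\mathrm{Mycelium}}
= \frac{10^{9}\ \mathrm{bit/s} \times 0.0349\ \mathrm{s}}{8}
\approx 4.4\ \mathrm{MB},
\qquad
\mathrm{BDP}_{\mathrm{Hyprspace}}^{\mathrm{loaded}}
= \frac{10^{9}\ \mathrm{bit/s} \times 2.8\ \mathrm{s}}{8}
\approx 350\ \mathrm{MB}.
\]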
% TODO: This is still a hypothesis: it generalises the same
% bandwidth-delay-product argument used for Mycelium directly
% above, and is now grounded in the per-peer
% \texttt{SharedStream} structure verified in
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
% per-flow window evolution nor the actual under-load latency
% has been measured directly. A tcpdump of one Hyprspace
% iPerf3 run with inter-arrival timing analysis would settle it.

Tinc picks up a 1.68$\times$ boost because several streams can
collectively keep its single-threaded CPU busy during what would
otherwise be idle gaps in a single flow.

WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with a 0.00\,\% retransmit rate in both modes.
This is consistent with WireGuard's overhead being a fixed per-packet
cost that does not worsen under multiplexing.
Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps
(0.92$\times$). The cause is lock contention in Nebula's firewall
connection tracker (Listing~\ref{lst:nebula_conntrack}). A single
\texttt{sync.Mutex} protects the global \texttt{Conns} map, and every
packet in both directions must acquire it. The lock holder also
purges the timer wheel before releasing the lock, so other goroutines
stall while that housekeeping runs. Nebula mitigates this with a
per-routine cache that bypasses the global lock for known flows, but
the cache is invalidated every second, at which point all goroutines
contend on the mutex again. With parallel streams, the increased
goroutine count turns this periodic contention into a throughput
bottleneck.
\lstinputlisting[language=Go,caption={Nebula's firewall conntrack: a
global mutex protects the connection map and is acquired on every
packet.
\textit{nebula/firewall.go:79--84,
486--558}},label={lst:nebula_conntrack}]{Listings/nebula_conntrack.go}

Retransmit rates under parallel load shift in two directions.
VpnCloud's rate climbs from 0.06\,\% to 0.14\,\% (2.5$\times$) and
Yggdrasil's from 0.09\,\% to 0.23\,\% (2.7$\times$), so
multiplexing genuinely increases loss for these VPNs. Hyprspace's
rate, by contrast, drops slightly from 0.49\,\% to 0.39\,\% even
though it sends far more data in parallel; the per-packet loss
probability does not worsen, but the absolute count still triples
because three pairs are transmitting simultaneously. VPNs that were
clean in single-stream mode (WireGuard, Internal) stay clean under
parallel load.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
\caption{Single-stream vs.\ parallel throughput}
\label{fig:single_vs_parallel}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
\caption{Parallel TCP scaling factor}
\label{fig:scaling_factor}
\end{subfigure}
\caption{Parallel TCP scaling at baseline. Nebula is the only VPN
where parallel throughput is lower than single-stream
(0.92$\times$). Mycelium and Hyprspace benefit most from
parallelism ($>$2$\times$), compensating for latency and buffer
bloat respectively. The dashed line at 1.0$\times$ marks the
break-even point.}
\label{fig:parallel_tcp}
\end{figure}
\subsection{UDP Stress Test}

The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
The sender throughput values are artifacts: they reflect how fast the
sender can write to the socket, not how fast data traverses the
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
throughput because it uses a 32,731-byte block size (a jumbo-frame
overlay MTU), which inflates the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
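The sender-side figure is simply the socket write rate multiplied by
the block size; decomposed for Yggdrasil (a sketch using the numbers
above, with the write rate inferred rather than measured directly):
\[
63\,744\ \mathrm{Mbps} \;\approx\;
243\,000\ \mathrm{writes/s} \times 32\,731\ \mathrm{B} \times 8\ \mathrm{bit/B},
\]
so the headline number says only that the sender can issue roughly a
quarter-million \texttt{send()} calls per second, not that any of
that data crosses the tunnel.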
\begin{table}[H]
\centering
\caption{UDP receiver throughput and packet loss at baseline
(\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
at 120 seconds and are excluded.}
\label{tab:udp_baseline}
\begin{tabular}{lrr}
\hline
\textbf{VPN} & \textbf{Receiver (Mbps)} &
\textbf{Loss (\%)} \\
\hline
Internal & 952 & 0.0 \\
WireGuard & 898 & 0.0 \\
Nebula & 890 & 76.2 \\
Headscale & 876 & 69.8 \\
EasyTier & 865 & 78.3 \\
Yggdrasil & 852 & 98.7 \\
ZeroTier & 851 & 89.5 \\
VpnCloud & 773 & 83.7 \\
Tinc & 471 & 89.9 \\
\hline
\end{tabular}
\end{table}
%TODO: Explain that the UDP test also crashes often,
% which makes the test somewhat unreliable
% but a good indicator if the network traffic is "different" than
% the programmer expected

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at
the kernel level with proper backpressure that matches sender to
receiver rate. Every other VPN shows massive loss (69--99\%)
because the sender overwhelms the tunnel's userspace processing capacity.
Headscale shares WireGuard's cryptographic protocol but, contrary to
intuition, does not share its kernel datapath: Tailscale's
\texttt{magicsock} layer intercepts every packet to handle endpoint
selection and DERP (Designated Encrypted Relay for Packets,
Tailscale's TLS-over-TCP relay network used when a direct UDP path
between peers cannot be established), which is incompatible with the
in-kernel WireGuard module. Headscale therefore runs
\texttt{wireguard-go} entirely in userspace, and the unbounded
\texttt{-b~0} flood overruns that userspace pipeline just as it
overruns every other userspace implementation; the result is
69.8\,\% loss despite the WireGuard branding.
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
(due to its large block size) but loses almost all of it. These loss
rates do not reflect real-world UDP behavior but reveal which VPNs
implement effective flow control. Hyprspace and Mycelium could not
complete the UDP test at all; both timed out after 120 seconds.
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
% the path MTU. It is derived from the socket MSS and reflects the
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at
2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at
1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the
smallest). These differences affect fragmentation behavior under real
workloads, particularly for protocols that send large datagrams.

%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
% to better reflect real world use, as most users probably won't
% change these defaults
% and explain that good defaults are as much a part of good software as
% having the features but they are hard to configure correctly
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
Throughput}.png}
\caption{UDP receiver throughput}
\label{fig:udp_throughput}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
Packet Loss}.png}
\caption{UDP packet loss}
\label{fig:udp_loss}
\end{subfigure}
\caption{UDP stress test results at baseline (\texttt{-b 0},
unlimited sender rate). Internal and WireGuard are the only
implementations with 0\% loss. Hyprspace and Mycelium are
excluded due to 120-second timeouts.}
\label{fig:udp_results}
\end{figure}
% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal. Once connection
setup, TLS handshakes, and HTTP round-trips enter the picture,
throughput differences between 500 and 900\,Mbps matter far less
than per-connection latency.
\begin{table}[H]
\centering
\caption{Nix binary cache download time at baseline, sorted by
duration. Overhead is relative to the internal baseline (8.53\,s).}
\label{tab:nix_cache}
\begin{tabular}{lrr}
\hline
\textbf{VPN} & \textbf{Mean (s)} &
\textbf{Overhead (\%)} \\
\hline
Internal & 8.53 & -- \\
Nebula & 9.15 & +7.3 \\
ZeroTier & 9.22 & +8.1 \\
VpnCloud & 9.39 & +10.0 \\
EasyTier & 9.39 & +10.1 \\
WireGuard & 9.45 & +10.8 \\
Headscale & 9.79 & +14.8 \\
Tinc & 10.00 & +17.2 \\
Mycelium & 10.07 & +18.1 \\
Yggdrasil & 10.59 & +24.2 \\
Hyprspace & 11.92 & +39.7 \\
\hline
\end{tabular}
\end{table}
Several rankings invert relative to raw throughput. ZeroTier
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite
6\,\% fewer raw Mbps and 1\,000$\times$ more retransmits. Yggdrasil
is the clearest example: it has the fourth-highest VPN throughput at
795\,Mbps, yet lands at 24\,\% overhead because its 2.2\,ms latency
adds up over the many small sequential HTTP requests that constitute
a Nix cache download.
Figure~\ref{fig:throughput_vs_download} confirms this weak link
between raw throughput and real-world download speed.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache
Mean Download Time}.png}
\caption{Nix cache download time per VPN}
\label{fig:nix_cache}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
\caption{Raw throughput vs.\ download time}
\label{fig:throughput_vs_download}
\end{subfigure}
\caption{Application-level download performance. The throughput
hierarchy compresses under real HTTP workloads: the worst VPN
(Hyprspace, 11.92\,s) is only 40\% slower than bare metal.
Throughput explains some variance but not all: Yggdrasil
(795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
because latency matters more for HTTP workloads.}
\label{fig:nix_download}
\end{figure}
\paragraph{Video Streaming (RIST).}

At 3.3\,Mbps, the RIST video stream sits well within every VPN's
throughput budget. The test therefore measures something else: how
well each VPN handles real-time UDP delivery under steady load.

Most VPNs pass without incident. Eight deliver 100\% quality,
Nebula sits just below at 99.8\%, and Hyprspace's headline figure
of 100\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run, including
Internal, are most likely encoder warm-up artefacts rather than
tunnel overhead, though we have not verified this directly.
% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the clear failure. Its mean quality is 13.1\%, and
each test interval drops 288 packets on average. The degradation is
sustained rather than bursty: median quality is 10\%, and the
interquartile range of dropped packets is a narrow 255--330. The
qperf benchmark also fails outright for Headscale at baseline, which
rules out a bulk-TCP explanation. Something in the real-time path is
broken.

The failure is unexpected because Headscale builds on WireGuard,
which handles video without trouble, and Headscale's own TCP
throughput puts it in Tier~1. RIST runs over UDP, however, and
qperf probes latency-sensitive paths using both TCP and UDP. The
most plausible source is Headscale's DERP relay or NAT traversal
layer. Headscale's effective UDP payload size is 1\,208~bytes, the
smallest in the dataset. RIST packets larger than this would be
fragmented, and fragment reassembly under sustained load could
produce exactly the steady, uniform drop pattern the data shows.
This is a hypothesis, not a confirmed cause: it would need a
packet capture to verify. Either way, the result disqualifies
Headscale from video conferencing, VoIP, or any other real-time
media workload, regardless of TCP throughput.
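The fragmentation arithmetic is at least consistent with that
pattern. Assuming the common RIST carriage of seven 188-byte
MPEG-TS packets per datagram (a 1\,316-byte payload; the benchmark's
actual datagram size was not captured), every media packet exceeds
Headscale's 1\,208-byte effective payload and is split into
\[
\left\lceil \frac{1316}{1208} \right\rceil = 2
\]
tunnel fragments, so losing either fragment costs the whole datagram
and the per-datagram loss probability roughly doubles.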
% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
% percentiles all zero) are not visible in the RIST Quality bar chart.
% Add a distribution plot or note in the caption that the bar
% chart hides this variance.
Hyprspace fails differently. Its average quality reads 100\%, but
the raw drop counts underneath are unstable: mean packet drops of
1\,194 and a maximum spike of 55\,500. The 25th, 50th, and 75th
percentiles are all zero, so most runs deliver perfectly while a
small number suffer catastrophic bursts. RIST's forward error
correction recovers from most of these events, but the worst spikes
overwhelm FEC entirely.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Video
Streaming/RIST Quality}.png}
\caption{RIST video streaming quality at baseline. Headscale at
13.1\% average quality is the clear outlier. Every other VPN
achieves 99.8\% or higher. Nebula is at 99.8\% (minor
degradation). The video bitrate (3.3\,Mbps) is well within every
VPN's throughput capacity, so this test reveals real-time UDP
handling quality rather than bandwidth limits.}
\label{fig:rist_quality}
\end{figure}
\subsection{Operational Resilience}

Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
gets there. Sustained-load numbers do not predict recovery speed,
and for operational use the time a tunnel takes to come up after a
reboot matters as much as its peak throughput.

Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
benchmarks use the default). After a reboot, a node must
wait until the next periodic update before its lighthouses learn
its new endpoint, so the reconnection time tracks the timer rather
than any topology-dependent convergence.
Mycelium sits at the opposite end at 76.6~seconds, and its three
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
the cost of rebuilding that tree after a reboot: a node close to the
current root reconverges quickly, while one further out must wait
for updated parent information to propagate hop-by-hop before it
can route traffic.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
\caption{Average reconnection time per VPN}
\label{fig:reboot_bar}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
\caption{Per-node reconnection time heatmap}
\label{fig:reboot_heatmap}
\end{subfigure}
\caption{Reboot reconnection time at baseline. The heatmap reveals
Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
across all nodes). Hyprspace reconnects fastest (8.7\,s average)
despite its poor sustained-load performance.}
\label{fig:reboot_reconnection}
\end{figure}
\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following paragraphs piece together observations from
earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency-under-load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase. The network itself has capacity to spare;
the VPN tunnel is filling up with buffered packets and failing to
drain.
The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally
% sender >= receiver. Consider verifying with packet captures or
% note this as a likely but unconfirmed explanation.
The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.

% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.
Hyprspace does import gVisor netstack, but reading the source
confirms that the gVisor TCP stack sits exclusively behind the
in-VPN ``service network'' feature. Regular tunnel traffic uses
an ordinary kernel TUN device created through the
\texttt{songgao/water} library, and the forwarding loop in
\texttt{node/node.go} only diverts a packet into the gVisor
stack when its destination falls inside the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol is TCP; everything else is shipped verbatim over a
libp2p stream and written back into the receiving peer's kernel
TUN. Listings~\ref{lst:hyprspace_kernel_tun},
\ref{lst:hyprspace_dispatch}, and \ref{lst:hyprspace_netstack}
show the relevant code in the upstream Hyprspace tree.
\lstinputlisting[language=Go,caption={Hyprspace creates a real
kernel TUN via \texttt{songgao/water}; this is the device every
peer-to-peer packet traverses.
\textit{hyprspace/tun/tun\_linux.go:14--36}},label={lst:hyprspace_kernel_tun}]{Listings/hyprspace_tun_linux.go}

\lstinputlisting[language=Go,caption={The IPv6 dispatch in the
Hyprspace forwarding loop only diverts to the gVisor service-network
TUN when the destination matches the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol byte is \texttt{0x06} (TCP); every other packet is left
on the kernel TUN path and forwarded over libp2p.
\textit{hyprspace/node/node.go:255--283}},label={lst:hyprspace_dispatch}]{Listings/hyprspace_dispatch.go}

\lstinputlisting[language=Go,caption={Hyprspace's gVisor netstack
initialiser only enables TCP SACK; there is no \texttt{TCPRecovery}
override (RACK stays at gVisor's default), no congestion-control
override, and no buffer-size override. The text in
\texttt{tun.go} also notes the file is taken verbatim from
wireguard-go.
\textit{hyprspace/netstack/tun.go:6--80}},label={lst:hyprspace_netstack}]{Listings/hyprspace_netstack.go}
Since the benchmark targets the regular Hyprspace IPv4/IPv6
addresses rather than service-network proxies, both endpoints
rely on their host kernel's TCP stack for the entire transfer.
Whatever options Hyprspace's gVisor instance might set
internally (congestion control, loss recovery, buffer sizes)
are therefore irrelevant to these measurements; the inner TCP
state machine the kernel runs is the only one in the path.
The same caveat applies more sharply to Tailscale, where the
upstream documentation talks about an in-process gVisor TCP
stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.
If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack instead. Hyprspace uses
\texttt{libp2p}, a peer-to-peer networking library, and its
\texttt{yamux} stream multiplexer, which runs many logical streams
over a single underlying connection and polices each one with a
credit-based flow-control window. The most plausible source of
the bloat is this libp2p/yamux layer, through which raw IP packets
are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
destination peer in \texttt{activeStreams} and guards it with a
single per-peer \texttt{sync.Mutex}
(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent
application TCP flows to the same Hyprspace neighbour therefore
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once,
collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments: the kernel
sees only that the TUN write returned, so it keeps growing its
congestion window while the libp2p layer falls further behind. The
geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.
\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
in \texttt{activeStreams} and guards it with a per-peer
\texttt{sync.Mutex} held inside the \texttt{SharedStream}
record. The TUN-read loop spawns a fresh goroutine per packet
(\texttt{node.go:282}); each one calls \texttt{sendPacket} and
takes \texttt{ms.Lock} for the duration of the libp2p stream
write, so concurrent application TCP flows to the same
Hyprspace neighbour are serialised behind a single mutex.
\textit{hyprspace/node/node.go:36--39, 282,
328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go}
\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency looks like a straightforward
cost of routing through a global overlay. The per-path numbers do
not fit this explanation:

\begin{itemize}
\bitem{luna$\rightarrow$lom:} 1.63\,ms (comparable
to Headscale at 1.64\,ms)
\bitem{lom$\rightarrow$yuki:} 51.47\,ms
\bitem{yuki$\rightarrow$luna:} 51.60\,ms
\end{itemize}
One link found a direct LAN path; the other two bounced through the
overlay. All three machines sit on the same physical network, so the
split is not a matter of topology.

The throughput results invert the latency ranking. The link with
the low ping latency, luna$\rightarrow$lom at 1.63\,ms, should be
the fastest according to TCP congestion theory. It is the slowest:
122\,Mbps, with the reverse direction dropping to 58.4\,Mbps in
bidirectional mode. Meanwhile yuki$\rightarrow$luna, whose ICMP~RTT
was 30$\times$ higher, reaches 379\,Mbps
(Figure~\ref{fig:mycelium_paths}). The throughput ranking is the
exact inverse of what the ping data predicts.
The explanation is in the iPerf3 logs. Each TCP stream reports a
kernel-measured RTT that is independent of ICMP ping. For the
luna$\rightarrow$lom stream, this TCP~RTT starts at 51.6\,ms and
climbs to a mean of 144\,ms over the 30-second run, with
757~retransmits---the link was clearly overlay-routed during the
throughput test, even though ping had found a direct path eight
minutes earlier. For yuki$\rightarrow$luna the reverse happened:
the TCP stream measured only 12--22\,ms, and its bidirectional
return path recorded 1.0\,ms, a direct LAN connection that the
earlier ICMP test had not seen. The routes changed between the two
tests.
Mycelium uses the Babel routing protocol
(Section~\ref{sec:babel}) to discover and select paths.
Two properties of its implementation explain why routes
shifted mid-benchmark. First, Mycelium advertises
routes at a five-minute interval
(Listing~\ref{lst:mycelium_constants}):

\lstinputlisting[language=Rust,caption={Mycelium's
Babel timing constants. Routes are re-advertised
every 300\,s; the router will not learn about a new
path until the next cycle.
\textit{mycelium/src/router.rs:33--59}},label={lst:mycelium_constants}]{Listings/mycelium_route_constants.rs}
A direct path that appears between update cycles is invisible to the
router until the next advertisement arrives. The benchmark's ping
and throughput tests ran sequentially with several minutes between
them, so each test observed whichever route happened to be selected
at that point in Babel's five-minute cycle.

Second, even when a better route \emph{is} advertised, the router
resists switching to it. Listing~\ref{lst:mycelium_best_route} shows
the \texttt{find\_best\_route} function: a candidate route is
rejected unless its metric improves on the current route by more
than 10, or unless it is directly connected (metric~0). A candidate
at metric~16 against a current route at metric~25, for example, is
rejected because the improvement of 9 falls short of the threshold.
This hysteresis prevents flapping but also means that an overlay
path, once established, can persist for the remainder of the update
interval even after a shorter path becomes available.

\lstinputlisting[language=Rust,caption={Route
selection with hysteresis. Lines~16--25 reject a
candidate route unless it is directly connected or
improves the composite metric by more than
\texttt{SIGNIFICANT\_METRIC\_IMPROVEMENT}\,(10).
\textit{mycelium/src/router.rs:1213--1238}},label={lst:mycelium_best_route}]{Listings/mycelium_find_best_route.rs}
The five-minute update interval and the switching hysteresis
together explain the throughput asymmetry. The TCP-measured RTTs
are consistent with the observed throughput on every link; only the
ICMP~RTTs, measured minutes earlier under a different routing state,
give the impression of an inversion.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average
Throughput}.png}
\caption{Per-link TCP throughput for Mycelium. The
luna$\rightarrow$lom link appears slow despite its
low ping latency because Babel had switched to an
overlay route by the time the throughput test ran.
The TCP-level RTTs reported by iPerf3, not the
earlier ICMP measurements, explain the 3:1 ratio.}
\label{fig:mycelium_paths}
\end{figure}
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also clarify what
% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by short-lived
connections accumulate it rapidly. Bulk downloads, by contrast,
amortize it: the Nix cache test finishes only 18\,\% slower than
Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase
begins, per-connection latency fades into the background.
Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
fixed convergence timer in the overlay protocol, most likely a
default wait interval hard-coded into the reconnection logic. A
topology-dependent recovery time, by contrast, would vary with each
node's position in the overlay: a node near an active peer would
reconverge quickly while one further away would wait longer for
routing information to reach it. Mycelium shows no such variation,
so the bound is almost certainly a timer rather than a propagation
delay.
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
% a configurable default, or coincidence.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: Userspace Processing Bottleneck.}

The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 14.9\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
The parallel benchmark confirms the diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
Several concurrent TCP streams keep that one core busy through
the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
\section{Impact of Network Impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds roughly 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
reordering. Medium and High both use 50\,\% correlation, so
losses and reorderings are bursty rather than uniform. Two
results dominate the data.
% TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
% delay numbers are cross-checked against later prose in this
% chapter, but the correlation and jitter values should be
% verified against the authoritative profile definition.

The first is the collapse of the throughput hierarchy. At High
impairment, the 675\,Mbps spread between fastest and slowest
implementation compresses to under 3\,Mbps. Architectural
differences that mattered at gigabit speeds become invisible once
the network is the bottleneck.
|
||
|
||
The second is harder to explain. Headscale outperforms the
|
||
bare-metal Internal baseline at Medium impairment across TCP,
|
||
parallel TCP, and the Nix cache benchmark. A VPN built on
|
||
WireGuard should not beat a direct connection.
|
||
Section~\ref{sec:tailscale_degraded} pursues this anomaly
|
||
through what turns out to be the wrong hypothesis. The
|
||
investigation begins with Tailscale's much-discussed gVisor TCP
|
||
stack, validates the candidate parameters in isolation on the
|
||
bare-metal host, and only then discovers, by reading the rig's
|
||
own NixOS module, that the gVisor stack is not actually in the
|
||
data path of the benchmark at all. The real culprit is a
|
||
combination of the Linux kernel's tight default
|
||
\texttt{tcp\_reordering} threshold and the way
|
||
\texttt{wireguard-go}
|
||
batches packets between the wire and the host kernel TCP stack.
|
||
|
||
\subsection{Ping}

Latency is the most predictable metric under impairment. Most VPNs
absorb the injected delay with a fixed per-hop overhead, and
rankings within the central cluster barely change across profiles
(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and
15\,ms of round-trip delay at Low, Medium, and High respectively;
Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this.
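
The expectation follows from the symmetry of the impairment: a ping
crosses one netem-delayed egress on each of the two machines, so
the added round-trip delay is roughly twice the per-machine delay
(plus jitter):
\[
  \Delta\mathrm{RTT} \approx 2\, d_{\mathrm{netem}}: \qquad
  2 \times 2 = 4~\mathrm{ms}, \quad
  2 \times 4 = 8~\mathrm{ms}, \quad
  2 \times 7.5 = 15~\mathrm{ms}.
\]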

\begin{table}[H]
  \centering
  \caption{Average ping RTT (ms) across impairment profiles, sorted
    by High-profile RTT}
  \label{tab:ping_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 0.60  & 4.82  & 9.38  & 15.49 \\
    Tinc      & 1.19  & 5.32  & 9.85  & 15.92 \\
    Nebula    & 1.25  & 5.38  & 9.99  & 15.96 \\
    WireGuard & 1.20  & 5.36  & 9.88  & 15.99 \\
    Headscale & 1.64  & 5.82  & 10.39 & 16.07 \\
    VpnCloud  & 1.13  & 5.41  & 10.35 & 16.21 \\
    ZeroTier  & 1.28  & 5.34  & 10.02 & 16.54 \\
    Yggdrasil & 2.20  & 6.73  & 11.99 & 20.20 \\
    Hyprspace & 1.79  & 6.15  & 10.76 & 24.49 \\
    EasyTier  & 1.33  & 6.27  & 14.13 & 26.60 \\
    Mycelium  & 34.90 & 23.42 & 43.88 & 33.05 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png}
  \caption{Average ping RTT across impairment profiles. Most VPNs
    form a tight parallel band; Mycelium's non-monotonic curve,
    EasyTier's excess latency at High, and Hyprspace's upward
    divergence stand out.}
  \label{fig:ping_impairment_heatmap}
\end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at
baseline to 23.4\,ms at Low impairment, a 33\% improvement at the
profile where every other VPN gets slower. It then climbs to
43.9\,ms at Medium before falling again to 33.0\,ms at High. The
baseline analysis (Section~\ref{sec:mycelium_routing}) showed that
Mycelium's latency comes from a bimodal routing distribution: one
path runs at 1.63\,ms, two others route through the global overlay
at ${\sim}$51\,ms.
% TODO: DOWNSTREAM DEPENDENCY — This explanation depends on the
% baseline characterisation of Mycelium's path discovery as
% "failing intermittently" (Section mycelium_routing). If that
% characterisation is revised (e.g., overlay routing is by-design,
% not a failure), then the claim that impairment "pushes path
% discovery toward shorter routes" needs rethinking: the mechanism
% would be different if Mycelium is not trying to find direct
% routes in the first place.
Impairment seems to push Mycelium's path selection toward the
shorter route, so a larger share of traffic avoids the overlay
detour. The non-monotonic curve is consistent with a path selection
algorithm that reacts to measured link quality but not linearly
with degradation severity.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers can
% verify these numbers.
Mycelium loses zero ping packets at Low and Medium impairment.
Most other VPNs show 0.1--3.2\% loss at those profiles. At High
impairment Mycelium's loss jumps to 11.1\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping
% heatmap only shows averages. Add a jitter/distribution figure.
% Also, the "userspace retry mechanism" is a hypothesized cause
% without source-code or packet-level evidence.
EasyTier accumulates 11\,ms of excess latency at High impairment
beyond what tc~netem injects. Its average RTT is 26.6\,ms and its
maximum reaches 290\,ms, against ${\sim}$40\,ms for WireGuard. The
RTT standard deviation reaches 44.6\,ms at High, the worst jitter
of any VPN. A userspace retry mechanism is the likely cause, but
without source-code evidence we cannot say so with certainty.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on the exact test
% structure (3 pairs × 3 runs × 100 packets). Verify this matches
% the actual test setup and add a supporting figure or table.
Hyprspace shows the same 11.1\% ping packet loss at Low, Medium,
and High impairment. With 9~measurement runs per profile
(3~machine pairs $\times$ 3~runs of 100~packets), 11.1\% is exactly
1/9: one run fails completely while the other eight report zero
loss.
% TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the
% buffer bloat diagnosis from Section hyprspace_bloat, which
% depends on the unverified 2,800 ms under-load latency. If that
% diagnosis is revised, this explanation must also be revisited.
The binary pass/fail behaviour fits the buffer bloat diagnosis from
Section~\ref{sec:hyprspace_bloat}: when the tunnel's buffers fill,
a path stalls completely rather than degrading gradually.

\subsection{TCP throughput}

The baseline TCP hierarchy does not survive impairment. The three
performance tiers from Section~\ref{sec:baseline} dissolve at the
first step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H]
  \centering
  \caption{Single-stream TCP throughput (Mbps) across impairment
    profiles, sorted by baseline. Retention is the Low-to-baseline
    ratio (e.g., Internal: $333/934 \approx 35.7\,\%$).}
  \label{tab:tcp_impairment}
  \begin{tabular}{lrrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} & \textbf{Retention} \\
    \hline
    Internal  & 934 & 333  & 29.6 & 4.25 & 35.7\% \\
    WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\
    ZeroTier  & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\
    Headscale & 800 & 274  & 41.5 & 4.21 & 34.3\% \\
    Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\
    \hline
    Nebula    & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\
    EasyTier  & 636 & 156  & 17.4 & 3.59 & 24.6\% \\
    VpnCloud  & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\
    \hline
    Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\
    Tinc      & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\
    Mycelium  & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png}
  \caption{Single-stream TCP throughput across impairment profiles.
    Headscale crosses above Internal at Medium impairment;
    Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs
    converge at High.}
  \label{fig:tcp_impairment_heatmap}
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a
98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter,
0.25\% packet loss, and 0.5\% reordering per machine. Even
Mycelium, the slowest VPN at baseline (259\,Mbps), retains more
throughput at Low than Yggdrasil does. The jumbo overlay MTU of
32\,731~bytes that inflated Yggdrasil's baseline numbers
(Section~\ref{sec:baseline}) becomes a liability under impairment:
every lost or reordered outer packet costs roughly 24$\times$ more
retransmitted inner data than a standard 1\,400-byte MTU VPN would
lose.
% TODO: The jumbo-MTU-as-liability argument is reused in several
% places (TCP impairment, QUIC impairment, RIST video, and
% §sec:baseline Tier analysis). In each it is presented as a
% mechanism rather than a measurement. Consider running one
% controlled experiment --- force Yggdrasil to a standard
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
% profiles --- to test the hypothesis directly, or consolidate
% the argument into a single "jumbo-MTU liability" paragraph and
% cite it from the other sections instead of restating the
% mechanism each time.

Headscale retains 34.3\% of its baseline throughput at Low, almost
the same as Internal's 35.7\%. At Medium impairment, Headscale
(41.5\,Mbps) overtakes Internal (29.6\,Mbps).
Section~\ref{sec:tailscale_degraded} investigates this anomaly in
detail.

At High impairment, the throughput range collapses from 675\,Mbps
to 2.9\,Mbps. Internal leads at 4.25\,Mbps, Hyprspace trails at
1.39\,Mbps, and the impairment profile itself is the bottleneck.
With 2.5\% packet loss and 5\% reordering per machine, every
implementation is loss-limited, and the architectural differences
that mattered at gigabit speeds no longer matter at all.

\subsection{UDP throughput}

The UDP stress test (\texttt{-b~0}) separates implementations with
effective backpressure from those without it more cleanly than any
TCP benchmark. Under impairment, it also produces widespread
failures.
% TODO: Tinc fails at Low and Medium but succeeds at High (8 Mbps):
% the same non-monotonic failure pattern as Internal/WireGuard
% (fail at Low, succeed at Medium/High). This suggests the failures
% are iPerf3/tc interaction issues rather than fundamental VPN
% limitations. Nebula and VpnCloud also fail selectively. The
% widespread non-monotonic failure pattern undermines using this
% benchmark as a reliability indicator (see line 1163 claim).
% Consider discussing this pattern.
Hyprspace and Mycelium continue to time out at all profiles,
extending their baseline failures. Tinc drops out at Low and
Medium, ZeroTier at Medium. The data is sparse, but one pattern
emerges from the runs that did complete.

% TODO: The heatmap shows Internal and WireGuard both fail (×) at
% some impairment profiles (e.g., Internal fails at Low, WireGuard
% at Low and High). "Regardless of impairment" overstates the
% evidence. Rephrase to reflect the failures, or explain why those
% runs failed despite the claim of maintained throughput.
% TODO: Internal (and WireGuard) fail at Low impairment in the UDP
% test but succeed at Medium and High: the opposite of what one
% would expect. This is never explained. Investigate and add an
% explanation (e.g., iPerf3 crash, tc interaction, timing issue).
Three implementations maintain throughput at the profiles where
data exists. Internal holds ${\sim}$950\,Mbps at Baseline, Medium,
and High; WireGuard sustains 850--898\,Mbps; and Headscale sustains
700--876\,Mbps.
% TODO: verify WireGuard UDP range -- analysis doc says 850-898,
% possible digit transposition
Internal and WireGuard ride the host kernel's transport-layer
backpressure (Internal directly, WireGuard via the in-kernel
WireGuard module). Headscale, by contrast, never uses the kernel
module even though it builds on the WireGuard protocol: as
established in Section~\ref{sec:baseline}, Tailscale's
\texttt{magicsock} layer intercepts every packet for endpoint
selection, DERP relay, and the disco protocol, and that
interception is incompatible with the kernel WireGuard datapath.
Headscale therefore runs \texttt{wireguard-go} in userspace and
compensates with UDP batching
(\texttt{recvmmsg}/\texttt{sendmmsg}), host-kernel UDP
segmentation/aggregation offload
(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer
WireGuard socket), and a 7\,MiB socket buffer on the same outer
socket (Listing~\ref{lst:magicsock_buffer} in
Section~\ref{sec:tailscale_degraded} shows the buffer code). These
offloads live in the host kernel; gVisor netstack itself implements
no UDP GSO or UDP GRO of its own. Together they absorb a
\texttt{-b 0} sender flood without collapsing. Userspace VPNs
without the same engineering do collapse: EasyTier drops from 865
to 435 to 38.5 to 6.1\,Mbps across successive profiles. Yggdrasil,
already pathological at baseline (98.7\% loss), crashes to
12.3\,Mbps at Low and fails entirely at Medium and High.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
  % TODO: The heatmap shows Internal, WireGuard, and Headscale all
  % fail ($\times$) at Low impairment. WireGuard also fails at
  % High. These selective failures need an explanation
  % (iPerf3/tc interaction?).
  \caption{UDP receiver throughput across impairment profiles.
    Implementations with effective UDP backpressure (Internal and
    WireGuard via the in-kernel datapath; Headscale via
    \texttt{wireguard-go} batching plus large socket buffers)
    maintain high throughput where they complete; other userspace
    VPNs collapse or fail entirely ($\times$ marks a failed run).}
  \label{fig:udp_impairment_heatmap}
\end{figure}

% TODO: This "robustness indicator" interpretation is undermined by
% the non-monotonic failure pattern. Internal and WireGuard fail at
% Low (0.25% loss) but succeed at Medium and High (1%+ loss). If
% failures indicated "fundamental flow-control problems," they
% should get worse with more impairment, not better. The pattern
% suggests iPerf3 or tc timing issues rather than VPN limitations.
% Either explain the non-monotonic failures or weaken this
% conclusion.
Under impairment this benchmark is more useful as a robustness
indicator than as a throughput measurement. A VPN that cannot
complete a 30-second UDP flood under 0.25\% packet loss has a
flow-control problem that will surface under real workloads too,
even when the symptoms are milder.
% TODO: Non-monotonic failure pattern (Internal and WireGuard
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
% fail selectively) is never explained and directly undermines
% the "robustness indicator" framing above. Reproduce one of
% the failing Low-profile runs with iPerf3 debug logging and
% \texttt{tc -s qdisc show} to establish whether these are VPN
% flow-control failures, iPerf3/tc interaction artefacts, or
% timing issues; then either explain the pattern or soften the
% robustness-indicator claim.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must
% match the baseline parallel test description. The baseline
% section has an unresolved TODO about whether the test uses 6 or
% 10 streams. If the baseline is corrected to 10, this section must
% also be updated.
The Headscale anomaly from single-stream TCP grows larger under
parallel load. Table~\ref{tab:parallel_impairment} shows aggregate
throughput across three concurrent bidirectional links (six
unidirectional flows).

\begin{table}[H]
  \centering
  \caption{Parallel TCP throughput (Mbps) across impairment
    profiles. Three concurrent bidirectional links produce six
    unidirectional flows.}
  \label{tab:parallel_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 1398 & 277  & 82.6 & 10.4 \\
    Headscale & 1228 & 718  & 113  & 20.0 \\
    WireGuard & 1281 & 173  & 24.5 & 8.39 \\
    Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
    ZeroTier  & 1206 & 176  & 35.4 & 7.97 \\
    EasyTier  & 927  & 473  & 57.4 & 10.7 \\
    Hyprspace & 803  & 2.87 & 6.94 & 3.62 \\
    VpnCloud  & 763  & 174  & 23.7 & 8.25 \\
    Nebula    & 648  & 103  & 15.3 & 4.93 \\
    Mycelium  & 569  & 72.7 & 7.51 & 3.69 \\
    Tinc      & 563  & 168  & 23.7 & 8.25 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
  \caption{Parallel TCP throughput across impairment profiles.
    Headscale dominates at Low (718\,Mbps vs.\ Internal's 277);
    EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to
    2.87\,Mbps.}
  \label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment, Headscale reaches 718\,Mbps: 2.6$\times$
Internal's 277\,Mbps and 4.1$\times$ WireGuard's 173\,Mbps. At
Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by
37\%. Whatever mechanism produces the single-stream crossover at
Medium scales with the flow count, because each of the six
concurrent streams benefits from it independently.

EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs that
retain more than half their baseline parallel throughput at Low
impairment; no other implementation exceeds 30\%. We have no direct
architectural explanation for EasyTier's resilience and do not
claim one here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\%
loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat
% diagnosis from Section hyprspace_bloat, which depends on the
% unverified 2,800 ms under-load latency. If that diagnosis is
% revised, this explanation for parallel collapse must also be
% revisited.
The buffer bloat that already plagues single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.

High-profile convergence is more pronounced here than in
single-stream mode. Tinc and VpnCloud land at identical 8.25\,Mbps
even though they differ by 200\,Mbps at baseline.

\subsection{QUIC performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline
(Section~\ref{sec:baseline}) and continue to fail at every
impairment profile.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to
7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High.
This is the same cliff observed in its TCP results, driven by the
same jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps),
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of one another. At baseline these four span a 188\,Mbps
range (656 to 844\,Mbps). At this point QUIC's own congestion
control is the sole limiter: it runs on top of an already-degraded
outer link and cannot push past ${\sim}$23\,Mbps regardless of the
VPN underneath.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
  \caption{QUIC bandwidth across impairment profiles. Yggdrasil
    drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud,
    ZeroTier, and Tinc converge to ${\sim}$23\,Mbps at High.
    Headscale and Nebula fail at all profiles ($\times$).}
  \label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's
throughput budget even at High impairment. Quality differences in
Table~\ref{tab:rist_impairment} therefore reflect packet delivery
reliability, not bandwidth.

\begin{table}[H]
  \centering
  \caption{RIST video streaming quality (\%) across impairment
    profiles, sorted by High-profile quality}
  \label{tab:rist_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Mycelium  & 100.0 & 100.0 & 100.0 & 99.9 \\
    EasyTier  & 100.0 & 100.0 & 96.2  & 85.5 \\
    Internal  & 100.0 & 99.2  & 89.3  & 80.2 \\
    ZeroTier  & 100.0 & 99.3  & 89.9  & 80.2 \\
    VpnCloud  & 100.0 & 99.2  & 89.7  & 80.1 \\
    WireGuard & 100.0 & 99.3  & 90.0  & 80.0 \\
    Hyprspace & 100.0 & 92.9  & 87.9  & 78.1 \\
    Tinc      & 100.0 & 99.3  & 90.0  & 77.8 \\
    Nebula    & 99.8  & 98.8  & 85.6  & 72.1 \\
    Yggdrasil & 100.0 & 94.7  & 71.4  & 43.3 \\
    Headscale & 13.1  & 13.0  & 13.0  & 13.0 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
  \caption{RIST video streaming quality across impairment profiles.
    Headscale is stuck at ${\sim}$13\% regardless of profile;
    Mycelium maintains ${\sim}$100\% even at High; Yggdrasil
    declines steeply to 43\%.}
  \label{fig:rist_impairment_heatmap}
\end{figure}

Headscale sits at ${\sim}$13\% across all four profiles: 13.1\%,
13.0\%, 13.0\%, 13.0\%. This profile-independence confirms the
baseline diagnosis (Section~\ref{sec:baseline}): the failure is
% TODO: DOWNSTREAM DEPENDENCY — This repeats the DERP/MTU
% hypothesis from Section baseline as though it were established.
% The baseline TODO notes this hypothesis is unverified (no packet
% capture evidence). Do not present it as a confirmed diagnosis
% here without resolving the upstream TODO.
structural (most plausibly MTU fragmentation in the DERP relay
layer) and cannot worsen because it is already saturated. Adding
latency or loss on top of an 87\% packet drop floor changes
nothing.

Mycelium holds 99.9\% quality even at High impairment, ahead of
Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even
Mycelium's degraded overlay paths comfortably sustain the stream.
The same overlay routing that adds 34.9\,ms of latency and cripples
bulk TCP transfers is harmless at video bitrates, and RIST's
forward error correction handles the residual loss.

% TODO: The claim that jumbo MTU causes burst losses that overwhelm
% FEC is a hypothesis. No FEC analysis or packet-level evidence is
% shown. Consider adding packet capture data or softening the
% claim.
Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at
Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP
throughput likely hurts here as well: large overlay packets are
more exposed to loss and reordering at the outer layer, and the
resulting burst losses may exceed what RIST's FEC can recover.

\subsection{Application-level download}

The Nix binary cache download is the most demanding
application-level benchmark. Hundreds of sequential HTTP
connections amplify the per-connection latency penalties that bulk
throughput tests amortise. Table~\ref{tab:nix_impairment} shows
download times across profiles.

\begin{table}[H]
  \centering
  \caption{Nix binary cache download time (seconds) across
    impairment profiles, sorted by Low-profile time. ``--'' marks a
    failed run.}
  \label{tab:nix_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 8.53 & 11.9 & 58.6 & --  \\
    Headscale & 9.79 & 13.5 & 48.8 & 219 \\
    EasyTier  & 9.39 & 22.1 & 141  & --  \\
    VpnCloud  & 9.39 & 27.9 & 163  & --  \\
    WireGuard & 9.45 & 28.8 & 161  & --  \\
    Nebula    & 9.15 & 30.8 & 180  & 547 \\
    Tinc      & 10.0 & 30.9 & 166  & 496 \\
    ZeroTier  & 9.22 & 36.2 & 141  & --  \\
    Mycelium  & 10.1 & 79.5 & --   & --  \\
    Yggdrasil & 10.6 & 230  & --   & --  \\
    Hyprspace & 11.9 & --   & 170  & --  \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
  \caption{Nix binary cache download time across impairment
    profiles. Headscale, Nebula, and Tinc complete all four
    profiles; Headscale beats Internal at Medium (49\,s vs.\ 59\,s).
    Yggdrasil's Low-profile time explodes to 230\,s ($\times$ marks
    a failed run).}
  \label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all four
profiles. At Medium impairment, Headscale finishes in 48.8~seconds,
faster than Internal's 58.6~seconds. Internal itself fails at High
impairment while Headscale completes in 219~seconds, Tinc in
496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low
impairment, a 22$\times$ slowdown. Every HTTP request pays the
latency penalty of Yggdrasil's impairment-amplified
retransmissions. Mycelium degrades almost as badly (10.1\,s to
79.5\,s, an 8$\times$ increase): its overlay routing overhead
compounds over hundreds of sequential HTTP connections.

% TODO: Hyprspace fails at Low but completes at Medium (170 s).
% This contradicts the "clean gradient" claim. Explain why a VPN
% can fail at Low but succeed at Medium, or note the anomaly.
The failure map shows a mostly clean gradient: more demanding
profiles knock out more VPNs. At Low, 10 of 11 finish (Hyprspace
fails). At Medium, 9 finish, though Hyprspace, which had failed at
Low, completes here in 170\,s. At High, only Headscale, Nebula, and
Tinc survive. Internal's failure at High is the surprising one: the
bare-metal baseline cannot sustain a multi-connection HTTP workload
under severe degradation, while Headscale pulls it through.
Section~\ref{sec:tailscale_degraded} explains why.

\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}

% TODO: Editorial pass needed on two chapter-wide issues before
% submission:
% (1) magicsock / wireguard-go userspace-datapath explanation is
%     repeated three times in slightly different forms (once in
%     baseline UDP, once in impairment UDP, once here). Consider
%     introducing it once in full here, where it is load-bearing,
%     and replacing the earlier occurrences with one-sentence
%     forward references.
% (2) This section uses first-person plural ("we pursued", "we
%     worked it out", "we ran two follow-up benchmarks") while
%     the rest of the chapter is in impersonal voice. Either
%     harmonise everything to one voice, or explicitly frame this
%     section as a first-person narrative detour.

This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
Medium impairment, and at Low impairment under parallel load beats
it by a factor of 2.6. The short answer turns out to be different
from the obvious answer, and we worked it out only by chasing the
obvious answer to its end.

\subsection{An anomaly worth pursuing}

At Medium impairment, Headscale reaches 41.5\,Mbps on a single TCP
stream against Internal's 29.6\,Mbps, a 40\,\% lead for the VPN
over the direct host-to-host link it tunnels through. Headscale
costs the expected ${\sim}$14\,\% at baseline, trails Internal at
Low (274 vs.\ 333\,Mbps), and is essentially tied with it at High
(4.21 vs.\ 4.25\,Mbps). Yet at Medium the order inverts, and not by
a sliver: a 12\,Mbps gap on a 30\,Mbps link is well above
measurement noise. The same thing happens, more dramatically, on
the parallel TCP test, where Headscale's 718\,Mbps at Low beats
Internal's 277\,Mbps by a factor of 2.6.
Table~\ref{tab:headscale_anomaly} collects the comparison.

\begin{table}[H]
  \centering
  \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment
    (18.12.2025 run). For TCP benchmarks, higher is better. For Nix
    cache, lower is better; ``--'' marks a failed run.}
  \label{tab:headscale_anomaly}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} &
    \textbf{Headscale} & \textbf{WireGuard} \\
    \hline
    Single TCP (Mbps)   & Low    & 333  & 274  & 54.7 \\
    Single TCP (Mbps)   & Medium & 29.6 & 41.5 & 8.77 \\
    Single TCP (Mbps)   & High   & 4.25 & 4.21 & 2.63 \\
    Parallel TCP (Mbps) & Low    & 277  & 718  & 173  \\
    Parallel TCP (Mbps) & Medium & 82.6 & 113  & 24.5 \\
    Nix cache (s)       & Medium & 58.6 & 48.8 & 161  \\
    Nix cache (s)       & High   & --   & 219  & --   \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
  \caption{Single-stream TCP throughput for Internal, Headscale,
    and WireGuard across impairment profiles (log scale). Headscale
    crosses above Internal at Medium impairment; WireGuard stays
    far below both; all three converge at High.}
  \label{fig:headscale_vs_internal}
\end{figure}

WireGuard-the-kernel-module is the obvious sanity check. It uses
the same Noise/WireGuard cryptographic protocol Tailscale ships and
is the closest available comparison without the rest of Tailscale's
stack. WireGuard shows none of Headscale's advantage: 54.7\,Mbps at
Low and 8.77\,Mbps at Medium, both well below Internal at the same
profile. So the encryption layer is not the answer, and the basic
UDP tunnel is not the answer. Whatever Headscale is doing
differently lives somewhere else in the rest of Tailscale's
implementation.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit rate
% table for impaired profiles or reference the data source.
The retransmit data narrows the search. At Medium, WireGuard's TCP
retransmit rate is 5.2\,\%, more than double Internal's
${\sim}$2.4\,\%. Headscale matches Internal at ${\sim}$2.4\,\% even
though it is a tunnelling VPN. Both Headscale and bare-metal
Internal run the same host kernel TCP stack at the inner layer, so
the asymmetry is not about a different TCP implementation. It is
about what the kernel TCP stack is being asked to process:
something on Headscale's path is suppressing the spurious
retransmits the kernel would otherwise fire under
\texttt{tc netem}-induced reordering, and WireGuard's path is not.

\subsection{A plausible villain: Tailscale's gVisor stack}

The candidate explanation we pursued first, and the one any reading
of the upstream Tailscale documentation will lead to, is
Tailscale's userspace TCP/IP stack. The Tailscale client imports
Google's gVisor netstack (\texttt{gvisor.dev/gvisor/pkg/tcpip}) as
a Go library and uses it as an in-process TCP implementation. The
gVisor documentation is direct about why this matters: netstack is
designed for adverse networks where the host kernel's TCP defaults
are too aggressive. Tailscale's release notes go further and name
specific overrides on top of gVisor; the most visible are an
explicit RACK disable and 8\,MiB / 6\,MiB receive and send buffers.

The Tailscale source code bears this out.
\texttt{wgengine/netstack/netstack.go} contains the netstack
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
reproduces the relevant overrides verbatim. RACK is disabled
(\texttt{TCPRecovery(0)}) with a comment pointing at
\texttt{tailscale/issues/9707}: ``gVisor's RACK performs poorly.
ACKs do not appear to be handled in a timely manner, leading to
spurious retransmissions and a reduced congestion window.'' Reno is
set explicitly with a comment pointing at
\texttt{gvisor/issues/11632}, an integer-overflow bug in gVisor's
CUBIC implementation. The TCP send and receive buffer maxima are
pushed up to 8\,MiB and 6\,MiB. SACK is enabled (gVisor's default
is off).

\lstinputlisting[language=Go,caption={Tailscale's gVisor netstack
initialiser explicitly disables RACK, pins Reno as the congestion
control, and enlarges the TCP buffer maxima. These overrides live
inside \texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}

Read against the Linux kernel defaults (RACK on, CUBIC by default,
${\sim}$1\,MiB receive and send buffers,
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
overrides describe a TCP stack better suited to a lossy, reordering
link than the host kernel. The hypothesis follows directly:
Headscale's iPerf3 traffic runs through this gVisor instance
instead of through the host kernel TCP stack, and so it inherits
the more reordering-tolerant behaviour.
WireGuard-the-kernel-module shares only the cryptographic protocol;
it does not include the gVisor stack, and therefore does not get
the advantage.

The natural way to test this is to extract the parameters Tailscale
sets inside gVisor, apply their nearest Linux equivalents to the
bare-metal host as sysctls, and see whether Internal, with no VPN
at all, picks up the same advantage. If it does, the gVisor
explanation is supported. If it does not, the hypothesis fails.

\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}

We ran two follow-up benchmarks on the same hardware and impairment
setup as the original 18.12.2025 run.

\begin{itemize}
  \bitem{Tailscale-style (27.02.2026):}
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0},
    \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes
    (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max},
    \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard,
    Tinc, and ZeroTier.
  \bitem{Reorder-only (06.03.2026):} Only
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and
    \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel
    defaults. Tested on Internal and Headscale only; this
    configuration is sketched in
    Listing~\ref{lst:reorder_only_sysctls_sketch} below.
\end{itemize}
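
A minimal NixOS sketch of the reorder-only configuration follows.
Expressing the sysctls through \texttt{boot.kernel.sysctl} is one
possible mechanism, shown here for illustration; the follow-up runs
only require that the three values be set on every machine, however
that is done. The Tailscale-style run additionally raises
\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max}, and
\texttt{wmem\_max}.

\begin{lstlisting}[language=Nix,caption={Illustrative NixOS
fragment for the reorder-only kernel configuration.
\texttt{boot.kernel.sysctl} is shown as one possible way to apply
the three sysctls declaratively; it is a sketch, not necessarily
the rig's actual mechanism.},label={lst:reorder_only_sysctls_sketch}]
{
  boot.kernel.sysctl = {
    # Tolerate more reordered segments before fast retransmit
    # (kernel default: 3).
    "net.ipv4.tcp_reordering" = 10;
    # Disable RACK time-based loss detection (kernel default: 1).
    "net.ipv4.tcp_recovery" = 0;
    # Disable early retransmit / Tail Loss Probe (kernel default: 3).
    "net.ipv4.tcp_early_retrans" = 0;
  };
}
\end{lstlisting}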

\begin{table}[H]
  \centering
  \caption{Internal (no VPN) throughput across three kernel
    configurations. ``Default'' is the 18.12.2025 run with stock
    Linux TCP parameters.}
  \label{tab:kernel_tuning_internal}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Metric} & \textbf{Profile} & \textbf{Default} &
    \textbf{Tailscale-style} & \textbf{Reorder-only} \\
    \hline
    Single TCP (Mbps)   & Baseline & 934         & 934  & 934  \\
    Single TCP (Mbps)   & Low      & 333         & 363  & 354  \\
    Single TCP (Mbps)   & Medium   & 29.6        & 64.2 & 72.7 \\
    Parallel TCP (Mbps) & Low      & 277         & 893  & 902  \\
    Parallel TCP (Mbps) & Medium   & 82.6        & 226  & 211  \\
    Retransmit \%       & Medium   & ${\sim}$2.4 & 1.21 & 1.11 \\
    Nix cache (s)       & Medium   & 58.6        & 29.7 & 29.1 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
  \caption{Internal (no VPN) single-stream TCP throughput across
    three kernel configurations. Baseline is unchanged; at Medium
    impairment, throughput jumps from 30 to 64 to 73\,Mbps as
    reordering tolerance increases.}
  \label{fig:kernel_tuning_comparison}
\end{figure}

The result felt like confirmation. Internal's Medium-impairment
throughput jumped from 29.6\,Mbps to 72.7\,Mbps under the
reorder-only configuration, a 146\,\% increase from a three-line
sysctl change, and the retransmit rate at Medium dropped from
${\sim}$2.4\,\% to 1.11\,\%, which means more than half of the
original retransmissions were spurious. The Nix cache download at
Medium roughly halved, from 58.6\,s to 29.1\,s.

Parallel TCP gained even more. Internal at Low climbed from 277 to
902\,Mbps, a 226\,\% increase. This exceeds Internal's best
single-stream result under any impairment profile (333\,Mbps at
Low) and overtakes Headscale's original 718\,Mbps from the
unmodified run.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits
% the unresolved 6-vs-10 stream count from the baseline parallel
% test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in
% any table. Add a table showing Headscale's results from the
% follow-up runs alongside Internal's so readers can verify the
% reversal.
Headscale itself, retested with the same sysctls, gained more
modestly: +21\,\% at Medium and a small $-$5\,\% wobble at Low. And
the anomaly reversed entirely. At Medium, tuned Internal reached
72.7\,Mbps against Headscale's 50.1\,Mbps — a 45\,\% lead for
Internal where the original run had Headscale 40\,\% ahead. The Nix
cache flipped the same way: Internal completed in 29.1\,s against
Headscale's 36.3\,s, where the original had Headscale 17\,\%
faster.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
  \caption{Internal-to-Headscale speed-up factor before and after
    kernel tuning. Values above 1.0 mean Internal is faster. At
    Medium impairment, the ratio flips from 0.71$\times$ (Headscale
    ahead) to 1.45$\times$ (Internal ahead).}
  \label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration matched or exceeded the full
Tailscale-style configuration on most metrics. The two exceptions
were single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP
at Medium (211 vs.\ 226\,Mbps), both within 7\,\%. The enlarged
buffer sizes did not help and may have added mild buffer bloat that
partially offset the reordering benefit, though the gap could also
be run-to-run variance. Either way, the entire Headscale advantage
on Internal collapsed to three host-kernel sysctls:
\texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and
\texttt{tcp\_early\_retrans}.

At this point in the investigation the hypothesis seemed settled.
Tailscale's gVisor stack ships with these overrides; the bare-metal
kernel ships with stricter defaults; matching the kernel to gVisor
reproduces the effect. Then we checked which Tailscale code path
the test rig was actually running.

\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}

In default mode (what anyone running \texttt{tailscale up} on a
Linux host gets), the Tailscale client creates a real kernel TUN
device, registers a route for the Tailscale subnet through it, and
forwards inbound and outbound packets through that interface. An
application like iPerf3 issues a \texttt{connect} to the remote
peer's Tailscale IP. The host kernel TCP stack handles the
application TCP. The kernel routes the resulting outbound packets
to the TUN device. \texttt{tailscaled} (with \texttt{wireguard-go}
embedded) reads them from the TUN, encrypts them, and sends them as
outer WireGuard UDP packets on the wire. The receiving side
reverses the process and writes the decrypted inner packets back
into its own TUN, where the host kernel TCP stack delivers them to
the iPerf3 server.

In that path, gVisor netstack is never instantiated. The netstack
initialiser in Listing~\ref{lst:tailscale_netstack_overrides} only
runs when \texttt{tailscaled} is launched with
\texttt{--tun=userspace-networking}, a mode that has no kernel TUN
at all and is reachable only from processes running inside
\texttt{tailscaled} itself (Tailscale SSH, Taildrop, the metric
endpoint). External processes such as iPerf3 cannot reach the
Tailscale network in that mode.

The test rig does not use that mode. The benchmark suite's
Headscale module sets the interface name to
\texttt{ts-\$\{instanceName\}}
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
External benchmark traffic cannot reach gVisor netstack at all.

\lstinputlisting[language=Nix,caption={The benchmark suite's
Headscale module sets \texttt{interfaceName} to a real kernel TUN
name (\texttt{ts-<instance>}, truncated to 15 characters). This
means \texttt{tailscaled} runs as \texttt{tailscaled --tun
ts-headscale} on every test machine.
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}

The empirical fingerprint pins the same conclusion down without
source-code reading. Headscale itself gained +21\,\% at Medium from
the host-kernel sysctl tuning. If Headscale's iPerf3 traffic were
processed by gVisor netstack, host-kernel sysctls would change
nothing; they configure the host kernel TCP stack and only the host
kernel TCP stack. The fact that Headscale moves measurably under
those sysctls is direct evidence that Headscale's application TCP
runs on the host kernel stack, just as Internal's does.

The validation experiment was therefore validating something other
than the hypothesis it was supposed to validate. It was confirming,
very cleanly, that the Linux kernel's default
\texttt{tcp\_reordering=3} is too tight for the kind of bursty,
correlated reordering the Medium profile produces, and that
loosening it produces a large throughput gain on a kernel-TCP data
path. That part of the result stands. What does not stand is the
inference that the gain reproduces something Tailscale was already
doing in gVisor. For this benchmark, Tailscale is not in the gVisor
TCP business at all.

\subsection{Where the advantage actually lives}

The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at 29.6\,Mbps,
and both run their iPerf3 TCP on the same host kernel TCP stack.
Whatever Headscale is doing (partially, weakly, but reproducibly)
is worth roughly twelve megabits per second on the Medium profile,
and it is not gVisor netstack.

The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%, Headscale
would already be doing the sysctls' work; if it were +146\,\% like
Internal's, Headscale would be doing nothing of its own. The
partial response says Headscale's mechanism produces an effect
similar in kind to the sysctls but smaller in size, and that the
two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are
the most likely candidates, and both live on the kernel-TUN path
that Tailscale actually uses in the rig.

The first is TUN TCP and UDP generic receive offload. Tailscale's
\texttt{tstun} wrapper enables both on the kernel TUN device on
Linux unless an environment knob disables them or a runtime probe
rejects the feature (Listing~\ref{lst:tstun_gro}). On the receive
side, this means \texttt{wireguard-go} decrypts a burst of inbound
WireGuard frames and then coalesces consecutive in-order TCP
segments belonging to the same flow into a single super-segment
before writing them back to the kernel TUN. On the transmit side,
it accepts GSO super-segments from the kernel TUN read in the same
way. The receiving kernel TCP stack therefore sees fewer, larger
segments per coalesced batch instead of $N$ small ones, and the
segment timing that survives to the kernel is the timing of GRO
batches rather than of individual on-the-wire packets. Bare-metal
Internal traffic has no equivalent path because it does not pass
through any user-space TUN at all.

\lstinputlisting[language=Go,caption={Tailscale enables TUN TCP and
UDP GRO on every Linux non-TAP \texttt{tailscaled} process unless
the operator disables them via environment knobs or a kernel
runtime probe rejects the feature. This is in the default
kernel-TUN data path; it is not gated on
\texttt{--tun=userspace-networking}.
\textit{tailscale/net/tstun/wrap\_linux.go:25--43}},label={lst:tstun_gro}]{Listings/tstun_gro.go}

The second is the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins on the WireGuard UDP socket
(Listing~\ref{lst:magicsock_buffer}), using the ``force''
\texttt{SO\_*BUFFORCE} variant where available so the value is
honoured even past \texttt{net.core.rmem\_max}. The host kernel
default is in the low hundreds of KiB. Under burst-correlated
impairment (Medium and High both use 50\,\% correlation, so losses
and reorderings cluster), this larger buffer absorbs spikes in
arrival rate that would otherwise overflow the kernel UDP receive
queue and surface as additional inner-TCP losses. Internal has no
such cushion on its incoming wire path.

\lstinputlisting[language=Go,caption={\texttt{magicsock} pins the
outer WireGuard UDP socket's send and receive buffers to 7\,MiB and
uses \texttt{SetBufferSize} with the \texttt{SO\_*BUFFORCE}
(``force'') variant where available, so the value is honoured even
past \texttt{net.core.rmem\_max}.
\textit{tailscale/wgengine/magicsock/magicsock.go:86,3908--3913}},label={lst:magicsock_buffer}]{Listings/magicsock_buffer.go}

% TODO: Neither of the two candidate mechanisms above is directly
% verified in this chapter. A targeted follow-up — for example
% tcpdump on the receiving \texttt{tailscale0} interface during a
% Medium-impairment iPerf3 run, with inter-arrival timing
% analysis — would distinguish their relative contributions and
% confirm the mechanism. The argument here is that they are the
% most plausible candidates consistent with the evidence, not
% measured causes.

A third feature, batched UDP I/O, completes the picture without
changing it qualitatively. \texttt{wireguard-go} uses
\texttt{recvmmsg} and \texttt{sendmmsg} on the outer UDP socket so
a burst of WireGuard frames moves through a single system call.
This does not change \emph{whether} packets are reordered, but it
reduces per-packet timing jitter that the kernel might otherwise
interpret as additional reordering.

Hyprspace cannot be used as a negative control for any of this. It
does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the
wireguard-go pipeline (TUN GRO and the 7\,MiB outer-UDP buffer),
not on whether gVisor handles their inner TCP. The gVisor angle
simply does not apply to either of them in this benchmark.

The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks expose.
\texttt{net.ipv4.tcp\_reordering} (default 3) is the number of
out-of-order segments the kernel will tolerate before declaring
fast retransmit, and with \texttt{tc netem} injecting 0.5--5\,\%
reordering per machine, bursts of several reordered packets are
frequent enough that the threshold is repeatedly tripped on the
bare-metal path. \texttt{net.ipv4.tcp\_recovery} (default
\texttt{1}, RACK enabled) adds time-based reordering detection on
top of the segment-count threshold, which compounds the spurious
retransmits when reordering is high. And
\texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail
Loss Probe enabled) fires speculative retransmits when
unacknowledged segments sit at the tail of a transmission window,
which interacts poorly with an already-impaired link. Loosening any
one of the three softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it the GRO-coalesced, buffer-cushioned
stream described above, so the kernel's tight defaults fire less
often there to begin with.

The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
\texttt{tcp\_reordering} threshold never trips and Internal's
native kernel-stack speed wins. As reordering rises from 0.5\,\%
(Low) to 2\,\% (Medium) per machine, the kernel's loss detection
fires on the bare-metal path more often than on the GRO-coalesced
Headscale path, and the throughput gap shifts in Headscale's
favour. At High impairment, both converge to ${\sim}$4.2\,Mbps:
absolute packet loss becomes the dominant bottleneck, and
reordering tolerance no longer matters.

% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier
% (11.5 Mbps) tuned values are not in any table. Add them to
% Table~\ref{tab:kernel_tuning_internal} or a new table.
Other VPNs respond unevenly to the same sysctl tuning. WireGuard's
Medium throughput rises from 8.77 to 12.2\,Mbps (+39\,\%), Tinc's
from 5.53 to 11.5\,Mbps (+108\,\%), and ZeroTier stays flat (12.0
to 11.5\,Mbps).
% TODO: The reading below — that VPNs which add their own
% encapsulation and userspace processing have bottlenecks the host
% kernel sysctls cannot touch — does not cleanly fit the data: Tinc
% (a fully userspace VPN) shows the largest gain (+108\,\%), larger
% than kernel-WireGuard's. A more complete explanation has to
% account for which TCP stack each VPN's application traffic
% actually traverses and which of those stacks the sysctls actually
% reach.
The intuitive reading is that VPNs which add their own
encapsulation and userspace processing have bottlenecks the host
kernel sysctls cannot touch, but Tinc's large gain shows the
picture is not that simple.

The resilient finding from this section, the one that survives
regardless of which of the two Tailscale-side mechanisms turns out
to dominate, is not about Tailscale at all. It is about Linux. The
kernel's default \texttt{tcp\_reordering=3} threshold is too tight
for the kind of bursty, correlated reordering \texttt{tc netem}
produces at the Medium profile, and it costs the bare-metal host
more than half of its achievable throughput. Three lines of
\texttt{sysctl} repair it. The fix is portable to any Linux host
and entirely independent of any VPN.

The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage we attributed to it comes from a more ordinary place: the
way \texttt{wireguard-go} batches and coalesces packets between the
wire and the kernel TCP stack, and the larger UDP buffer it pins on
its outer socket. We were chasing the wrong hypothesis with the
right experiment, and the experiment turned out to be more useful
than the hypothesis.

% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source
% code analysis." Either write these sections or remove the
% promise from the intro.

\section{Source code analysis}

\subsection{Feature matrix overview}

% Summary of the 108-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of findings}

% Brief summary table or ranking of VPNs by key metrics. Save
% deeper interpretation for a Discussion chapter.
|