% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment. The chapter concludes with findings
from the source code analysis. A recurring theme is that no single
metric captures VPN performance; the rankings shift depending on
whether one measures throughput, latency, retransmit behavior, or
real-world application performance.

\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just 0.60\,ms. WireGuard comes remarkably close to these
numbers, reaching 92.5\,\% of bare-metal throughput with only a single
retransmit across an entire 30-second test. Mycelium sits at the other
extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal
figure.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. The bulk of that time, about
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
installation and deployment accounted for another 45~minutes (19\,\%),
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
up after restarts. The remaining time was consumed by VPN service
restarts and traffic-control (tc) stabilization.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
performance benchmark after six retries, while Hyprspace and Mycelium
failed the UDP iPerf3 test with a 120-second timeout. Each of these
four therefore has an individual success rate of 85.7\,\%, while all
other VPNs passed the full suite (Figure~\ref{fig:success_rate}).

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{1.0\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Average Test Duration per Machine}.png}
        \caption{Average test duration per VPN, including installation
            time and benchmark execution}
        \label{fig:test_duration}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{1.0\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark Success Rate}.png}
        \caption{Benchmark success rate across all seven tests}
        \label{fig:success_rate}
    \end{subfigure}
    \caption{Test execution overview. Hyprspace has the longest average
        duration due to UDP timeouts and long VPN connectivity
        waits. WireGuard completes fastest. Nebula, Headscale,
        Hyprspace, and Mycelium each fail one benchmark.}
    \label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages. Three distinct performance tiers emerge, separated by
natural gaps in the data.
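
For orientation, the following is a minimal sketch of the invocation
behind these numbers; the actual runs are driven by the benchmark
harness, and the peer address is a placeholder:

\begin{verbatim}
# on the receiving node
iperf3 -s

# on the sending node: one TCP stream for 30 s; JSON output
# lets the harness parse throughput and the sender-side
# retransmit counter
iperf3 -c <peer-overlay-ip> -t 30 --json
\end{verbatim}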

\begin{table}[H]
    \centering
    \caption{Single-stream TCP throughput at baseline, sorted by
        throughput. Retransmits are averaged per 30-second test across
        all three link directions. The horizontal rules separate the
        three performance tiers.}
    \label{tab:tcp_baseline}
    \begin{tabular}{lrrr}
        \hline
        \textbf{VPN} & \textbf{Throughput (Mbps)} &
        \textbf{Baseline (\%)} & \textbf{Retransmits} \\
        \hline
        Internal  & 934 & 100.0 & 1.7 \\
        WireGuard & 864 & 92.5  & 1 \\
        ZeroTier  & 814 & 87.2  & 1163 \\
        Headscale & 800 & 85.6  & 102 \\
        Yggdrasil & 795 & 85.1  & 75 \\
        \hline
        Nebula    & 706 & 75.6  & 955 \\
        EasyTier  & 636 & 68.1  & 537 \\
        VpnCloud  & 539 & 57.7  & 857 \\
        \hline
        Hyprspace & 368 & 39.4  & 4965 \\
        Tinc      & 336 & 36.0  & 240 \\
        Mycelium  & 259 & 27.7  & 710 \\
        \hline
    \end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

Raw throughput alone is incomplete, however. The retransmit column
reveals that not all high-throughput VPNs get there cleanly.
ZeroTier, for instance, reaches 814\,Mbps but accumulates
1\,163~retransmits per test, over 1\,000$\times$ what WireGuard
needs. ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard delivers data with negligible in-tunnel loss. Across all
VPNs, retransmit behavior falls into three groups: \emph{clean}
($<$110: WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed}
(200--900: Tinc, EasyTier, Mycelium, VpnCloud), and
\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).

% TODO: Is this naming scheme any good?

% TODO: Fix TCP Throughput plot

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png}
        \caption{Average single-stream TCP throughput}
        \label{fig:tcp_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png}
        \caption{TCP retransmit rate (\%)}
        \label{fig:tcp_retransmits}
    \end{subfigure}
    \caption{TCP throughput and retransmit rate at baseline. WireGuard
        leads at 864\,Mbps with a single retransmit; Hyprspace
        retransmits nearly 5\,000 segments per test. Retransmit rate
        does not always track inversely with throughput: ZeroTier
        achieves high throughput \emph{despite} heavy retransmission.}
    \label{fig:tcp_results}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each retransmit triggers a reduction in the congestion window
(\texttt{cwnd}), throttling the sender. This relationship is visible
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965
retransmits, maintains the smallest max congestion window in the
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
window, the largest of any VPN. At first glance this suggests a
clean inverse correlation between retransmits and congestion window
size, but the picture is misleading. Yggdrasil's outsized window is
largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size. What \emph{is} clear is that
high retransmit rates force TCP to spend more time in congestion
recovery than in steady-state transmission, capping throughput
regardless of available bandwidth. ZeroTier illustrates the
opposite extreme: brute-force retransmission can still yield high
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
bandwidth and unstable flow behavior.
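
Normalizing by segment size makes the MTU caveat concrete. Assuming a
${\sim}$1\,400-byte segment for Hyprspace (its segment size is not
directly reported) and Yggdrasil's 32\,731-byte overlay segments, the
two windows translate to

\begin{equation*}
\frac{4.3\,\text{MB}}{32\,731\,\text{B}} \approx 130 \text{ segments}
\qquad\text{vs.}\qquad
\frac{205\,\text{KB}}{1\,400\,\text{B}} \approx 150 \text{ segments},
\end{equation*}

so measured in segments rather than bytes, the ``largest'' and
``smallest'' windows in the dataset are nearly the same size.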

VpnCloud stands out: its sender reports 538.8\,Mbps but the receiver
measures only 413.4\,Mbps, a 23\,\% gap (the largest in the dataset).
This suggests significant in-tunnel packet loss or buffering at the
VpnCloud layer that the retransmit count (857) alone does not fully
explain.

Variability also differs substantially. WireGuard ranges from 824 to
884\,Mbps across runs (a 60\,Mbps window), while Mycelium spans 122
to 379\,Mbps, a 3:1 ratio. Mycelium's spread, however, is per-link
asymmetry from its overlay path selection
(Section~\ref{sec:mycelium_routing}) rather than stochastic
run-to-run variance. Either way, a VPN with a wide spread is harder
to capacity-plan around than one with consistent performance, even if
the average is lower.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
        \caption{Retransmits vs.\ throughput}
        \label{fig:retransmit_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
        \caption{Retransmits vs.\ max congestion window}
        \label{fig:retransmit_cwnd}
    \end{subfigure}
    \caption{Retransmit correlations (log scale on x-axis). High
        retransmits do not always mean low throughput (ZeroTier: 1\,163
        retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace:
        4\,965 retransmits, 368\,Mbps). The apparent inverse correlation
        between retransmits and congestion window size is dominated by
        Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
        by its 32\,KB jumbo overlay MTU rather than by low retransmits
        alone.}
    \label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges.

\begin{table}[H]
    \centering
    \caption{Average ping RTT at baseline, sorted by latency}
    \label{tab:latency_baseline}
    \begin{tabular}{lr}
        \hline
        \textbf{VPN} & \textbf{Avg RTT (ms)} \\
        \hline
        Internal  & 0.60 \\
        VpnCloud  & 1.13 \\
        Tinc      & 1.19 \\
        WireGuard & 1.20 \\
        Nebula    & 1.25 \\
        ZeroTier  & 1.28 \\
        EasyTier  & 1.33 \\
        \hline
        Headscale & 1.64 \\
        Hyprspace & 1.79 \\
        Yggdrasil & 2.20 \\
        \hline
        Mycelium  & 34.9 \\
        \hline
    \end{tabular}
\end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal
0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the
lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms),
yet its throughput tops out at only 539\,Mbps. Low per-packet
latency does not guarantee high bulk throughput. A second group
(Headscale, Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range,
representing moderate overhead. Then there is Mycelium at 34.9\,ms,
so far removed from the rest that
Section~\ref{sec:mycelium_routing} gives it a dedicated analysis.

% TODO: The max RTT claim (8.6 ms) is not visible in the Average RTT
% plot. Add a max-RTT figure or table, or reference the raw data
% source.
ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum
RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any
sub-2\,ms VPN. These spikes point to periodic control-plane
interference that the average hides.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
    \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
        massive outlier at 58$\times$ the internal baseline. VpnCloud is
        the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
    \label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
but only the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet single-threaded userspace processing cannot
keep up with the link speed. The qperf benchmark backs this up: Tinc
maxes out at 14.9\,\% total system CPU while delivering just
336\,Mbps. On a multi-core system, that low whole-system percentage
reflects a single saturated core while the remaining cores sit idle,
a clear sign that the CPU, not the network, is the bottleneck.
Notably, VpnCloud reports the same 14.9\,\% yet reaches 539\,Mbps, so
the limit is Tinc's per-packet processing cost rather than raw CPU
availability. Figure~\ref{fig:latency_throughput} makes this
disconnect easy to spot.

% TODO: These CPU numbers are stated inline but never shown in a plot
% or table. Add a CPU utilization figure or table so readers can
% verify.
The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as
much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%,
surprisingly high for a kernel-level implementation, presumably due
to in-kernel cryptographic processing. At the other end of the scale,
VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the
least CPU time, though low usage does not by itself imply the best
throughput per unit of CPU, as the ratios below show. Nebula and
Headscale are missing from this comparison because qperf failed for
both.
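
Pairing each implementation's iPerf3 throughput with its
qperf-measured CPU share gives a rough efficiency figure. These
ratios are derived from the numbers above rather than measured
directly, and they mix two different benchmarks, so they are
indicative only:

\begin{align*}
\text{Tinc:}      &\quad 336\,\text{Mbps} \,/\, 14.9\,\% \approx 23\,\text{Mbps per \%~CPU}\\
\text{VpnCloud:}  &\quad 539\,\text{Mbps} \,/\, 14.9\,\% \approx 36\,\text{Mbps per \%~CPU}\\
\text{WireGuard:} &\quad 864\,\text{Mbps} \,/\, 30.8\,\% \approx 28\,\text{Mbps per \%~CPU}
\end{align*}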

%TODO: Explain why they consistently failed

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
    \caption{Latency vs.\ throughput at baseline. Each point represents
        one VPN. The quadrants reveal different bottleneck types:
        VpnCloud (low latency, moderate throughput), Tinc (low latency,
        low throughput, CPU-bound), Mycelium (high latency, low
        throughput, overlay routing overhead).}
    \label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

% TODO: The plot labels this benchmark "10-stream parallel" but this
% description says "six unidirectional flows." Verify the actual test
% configuration and reconcile the two.
The single-stream benchmark tests one link direction at a time. The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom)
run simultaneously in a circular pattern for 60~seconds, each
carrying one bidirectional TCP stream (six unidirectional flows in
total). Because three independent link pairs now compete for shared
tunnel resources at once, the aggregate throughput is naturally
higher than any single direction alone, which is why even Internal
reaches 1.50$\times$ its single-stream figure. The scaling factor
(parallel throughput divided by single-stream throughput) captures
two effects: the benefit of using multiple link pairs in parallel,
and how well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.
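
A sketch of the per-link invocation, assuming iPerf3~3.7 or newer for
native bidirectional streams (the peer address is a placeholder, and
the harness coordinates the three concurrent sessions):

\begin{verbatim}
# one bidirectional TCP session per link pair, 60 s;
# three of these run concurrently (lom->yuki,
# yuki->luna, luna->lom), six flows in total
iperf3 -c <peer-overlay-ip> -t 60 --bidir --json
\end{verbatim}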

\begin{table}[H]
    \centering
    \caption{Parallel TCP scaling at baseline. Scaling factor is the
        ratio of parallel to single-stream throughput. Internal's
        1.50$\times$ represents the expected scaling on this hardware.}
    \label{tab:parallel_scaling}
    \begin{tabular}{lrrr}
        \hline
        \textbf{VPN} & \textbf{Single (Mbps)} &
        \textbf{Parallel (Mbps)} & \textbf{Scaling} \\
        \hline
        Mycelium  & 259 & 569  & 2.20$\times$ \\
        Hyprspace & 368 & 803  & 2.18$\times$ \\
        Tinc      & 336 & 563  & 1.68$\times$ \\
        Yggdrasil & 795 & 1265 & 1.59$\times$ \\
        Headscale & 800 & 1228 & 1.54$\times$ \\
        Internal  & 934 & 1398 & 1.50$\times$ \\
        ZeroTier  & 814 & 1206 & 1.48$\times$ \\
        WireGuard & 864 & 1281 & 1.48$\times$ \\
        EasyTier  & 636 & 927  & 1.46$\times$ \\
        VpnCloud  & 539 & 763  & 1.42$\times$ \\
        Nebula    & 706 & 648  & 0.92$\times$ \\
        \hline
    \end{tabular}
\end{table}

The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product demands a window
larger than any single flow maintains, so multiple concurrent flows
compensate for that constraint and push throughput to 2.20$\times$
the single-stream figure.
% TODO: The buffer-bloat workaround explanation for Hyprspace's
% parallel scaling is a hypothesis. No direct evidence is shown
% that multiple streams specifically alleviate buffer bloat.
% Consider adding bufferbloat measurements or softening the claim.
% TODO: DOWNSTREAM DEPENDENCY — This claim depends on the buffer bloat
% diagnosis in Section hyprspace_bloat, which itself rests on the unverified
% 2,800 ms under-load latency (see TODO there). If that latency figure
% is not confirmed, this parallel-scaling explanation collapses.
Hyprspace scales almost as well (2.18$\times$), possibly because
multiple streams collectively work around the buffer bloat that
cripples any individual flow (Section~\ref{sec:hyprspace_bloat}).
Tinc picks up a 1.68$\times$ boost because several streams can
collectively keep its single-threaded CPU busy during what would
otherwise be idle gaps in a single flow.
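
The bandwidth-delay arithmetic makes Mycelium's single-stream ceiling
concrete. Filling even half the gigabit link at its baseline RTT
would require a sustained window of roughly

\begin{equation*}
\text{BDP} = \text{rate} \times \text{RTT}
\approx 500\,\text{Mbps} \times 34.9\,\text{ms}
\approx 2.2\,\text{MB},
\end{equation*}

well beyond what its flows sustain in practice; multiple concurrent
flows divide that requirement among themselves.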

% TODO: "zero retransmits" in parallel mode is not shown in any table
% or figure. Add parallel-mode retransmit data or remove the claim.
WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with no recorded retransmits, suggesting that
WireGuard's overhead is a fixed per-packet cost that does not worsen
under multiplexing.

% TODO: "ten streams" vs "six unidirectional flows" --- reconcile
% with the test description above.
Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$)
while retransmits jump from 955 to 2\,462. The streams are clearly
fighting each other for resources inside the tunnel.

More streams also amplify existing retransmit problems. Hyprspace
climbs from 4\,965 to 17\,426~retransmits; VpnCloud from 857 to
6\,023. VPNs that were clean in single-stream mode stay clean under
load, while the stressed ones only get worse.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
        \caption{Single-stream vs.\ parallel throughput}
        \label{fig:single_vs_parallel}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
        \caption{Parallel TCP scaling factor}
        \label{fig:scaling_factor}
    \end{subfigure}
    \caption{Parallel TCP scaling at baseline. Nebula is the only VPN
        where parallel throughput is lower than single-stream
        (0.92$\times$). Mycelium and Hyprspace benefit most from
        parallelism ($>$2$\times$), compensating for latency and buffer
        bloat respectively. The dashed line at 1.0$\times$ marks the
        break-even point.}
    \label{fig:parallel_tcp}
\end{figure}

\subsection{UDP Stress Test}

The UDP iPerf3 test uses an unlimited sender rate (\texttt{-b 0}),
which makes it a deliberate overload test rather than a realistic
workload. The sender throughput values are artifacts: they reflect
how fast the sender can write to the socket, not how fast data
traverses the tunnel. Yggdrasil, for example, reports 63\,744\,Mbps
sender throughput because it uses a 32\,731-byte block size (a
jumbo-frame overlay MTU), inflating the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
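
A minimal sketch of the stress invocation (placeholders as before);
\texttt{-b 0} removes iPerf3's pacing entirely:

\begin{verbatim}
# unlimited-rate UDP flood for 30 s; only the receiver-side
# summary in the JSON output is a meaningful throughput figure
iperf3 -c <peer-overlay-ip> -u -b 0 -t 30 --json
\end{verbatim}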

\begin{table}[H]
    \centering
    \caption{UDP receiver throughput and packet loss at baseline
        (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
        at 120 seconds and are excluded.}
    \label{tab:udp_baseline}
    \begin{tabular}{lrr}
        \hline
        \textbf{VPN} & \textbf{Receiver (Mbps)} &
        \textbf{Loss (\%)} \\
        \hline
        Internal  & 952 & 0.0 \\
        WireGuard & 898 & 0.0 \\
        Nebula    & 890 & 76.2 \\
        Headscale & 876 & 69.8 \\
        EasyTier  & 865 & 78.3 \\
        Yggdrasil & 852 & 98.7 \\
        ZeroTier  & 851 & 89.5 \\
        VpnCloud  & 773 & 83.7 \\
        Tinc      & 471 & 89.9 \\
        \hline
    \end{tabular}
\end{table}

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate
at the kernel level with proper backpressure that matches the sender
to the receiver rate. Every other VPN shows massive loss (69--99\,\%)
because the sender overwhelms the tunnel's userspace processing
capacity. Headscale is the instructive case: although it builds on
WireGuard, its userspace networking stack sits in the path between
the application and the tunnel, so it behaves like a userspace VPN
here (69.8\,\% loss) despite its WireGuard foundation. Yggdrasil's
98.7\,\% loss is the most extreme: it sends the most data (due to
its large block size) but loses almost all of it. These loss rates
do not reflect real-world UDP behavior, but they reveal which VPNs
implement effective flow control; the test also aborts outright on
several implementations rather than degrading gracefully, which
limits its reliability as a measurement while making it a sensitive
indicator of traffic patterns an implementation did not anticipate.
Hyprspace and Mycelium could not complete the UDP test at all,
timing out after 120~seconds.

The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size, i.e., the datagram size iPerf3 selects after tunnel
overhead, not the path MTU itself: Yggdrasil at 32\,731 bytes (jumbo
overlay), ZeroTier at 2\,728, Internal at 1\,448, VpnCloud at
1\,375, WireGuard at 1\,368, Tinc at 1\,353, EasyTier at 1\,288,
Nebula at 1\,228, and Headscale at 1\,208 (the smallest). These
differences affect fragmentation behavior under real workloads,
particularly for protocols that send large datagrams.

%TODO: Mention QUIC
All of these measurements use each VPN's default settings, on the
premise that most users never change them; shipping good defaults is
as much a part of good software as shipping the features, precisely
because they are hard to configure correctly.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Throughput}.png}
        \caption{UDP receiver throughput}
        \label{fig:udp_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Packet Loss}.png}
        \caption{UDP packet loss}
        \label{fig:udp_loss}
    \end{subfigure}
    \caption{UDP stress test results at baseline (\texttt{-b 0},
        unlimited sender rate). Internal and WireGuard are the only
        implementations with 0\% loss. Hyprspace and Mycelium are
        excluded due to 120-second timeouts.}
    \label{fig:udp_results}
\end{figure}

% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal. Once connection setup,
TLS handshakes, and HTTP round-trips enter the picture, throughput
differences between 500 and 900\,Mbps matter far less than
per-connection latency.

\begin{table}[H]
    \centering
    \caption{Nix binary cache download time at baseline, sorted by
        duration. Overhead is relative to the internal baseline (8.53\,s).}
    \label{tab:nix_cache}
    \begin{tabular}{lrr}
        \hline
        \textbf{VPN} & \textbf{Mean (s)} &
        \textbf{Overhead (\%)} \\
        \hline
        Internal  & 8.53  & -- \\
        Nebula    & 9.15  & +7.3 \\
        ZeroTier  & 9.22  & +8.1 \\
        VpnCloud  & 9.39  & +10.0 \\
        EasyTier  & 9.39  & +10.1 \\
        WireGuard & 9.45  & +10.8 \\
        Headscale & 9.79  & +14.8 \\
        Tinc      & 10.00 & +17.2 \\
        Mycelium  & 10.07 & +18.1 \\
        Yggdrasil & 10.59 & +24.2 \\
        Hyprspace & 11.92 & +39.7 \\
        \hline
    \end{tabular}
\end{table}

Several rankings invert relative to raw throughput. ZeroTier
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite 6\,\%
less raw throughput and 1\,000$\times$ more retransmits. Yggdrasil
is the clearest example: it has the fourth-highest VPN throughput at
795\,Mbps, yet lands at 24\,\% overhead because its 2.2\,ms latency
adds up over the many small sequential HTTP requests that constitute
a Nix cache download. Figure~\ref{fig:throughput_vs_download}
confirms this weak link between raw throughput and real-world
download speed.
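
A simple serial-request model explains the effect. If a download
issues $n$ dependent HTTP requests, each paying the tunnel's extra
round-trip time $\Delta$ on top of the transfer itself, the added
wall-clock time is approximately

\begin{equation*}
T_{\text{added}} \approx n \cdot \Delta .
\end{equation*}

Yggdrasil's $\Delta \approx 1.6$\,ms over Internal would need on the
order of a thousand serialized round trips to account for its extra
${\sim}$2\,s, which is plausible for a cache download made of many
small narinfo and NAR requests; the request count was not
instrumented, so this remains an estimate, not a measurement.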

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache Mean Download Time}.png}
        \caption{Nix cache download time per VPN}
        \label{fig:nix_cache}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
        \caption{Raw throughput vs.\ download time}
        \label{fig:throughput_vs_download}
    \end{subfigure}
    \caption{Application-level download performance. The throughput
        hierarchy compresses under real HTTP workloads: the worst VPN
        (Hyprspace, 11.92\,s) is only 40\% slower than bare metal.
        Throughput explains some variance but not all: Yggdrasil
        (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
        because latency matters more for HTTP workloads.}
    \label{fig:nix_download}
\end{figure}

\paragraph{Video Streaming (RIST).}

At just 3.3\,Mbps, the RIST video stream sits comfortably within
every VPN's throughput budget. This test therefore measures
something different: how well the VPN handles real-time UDP packet
delivery under steady load. Every implementation except Headscale
delivers at least 99.8\,\% average quality; Nebula, at 99.8\,\%, is
the only one below a perfect score. The 14--16 dropped frames that
appear uniformly across all implementations, including Internal
(where no tunnel is present), likely trace back to encoder warm-up
rather than tunnel overhead.

% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the exception. It averages just 13.1\,\% quality,
dropping 288~packets per test interval. The degradation is not
bursty but sustained: median quality sits at 10\,\%, and the
interquartile range of dropped packets spans a narrow 255--330 band.
The qperf benchmark independently corroborates this, having failed
outright for Headscale, confirming that something beyond bulk TCP is
broken.

What makes this failure unexpected is that Headscale builds on
WireGuard, which handles video flawlessly. TCP throughput places
Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and
qperf probes latency-sensitive paths using both TCP and UDP. The
pattern is consistent with Headscale's DERP relay or NAT-traversal
layer being the source, though packet-level evidence would be needed
to confirm this. Its effective UDP payload size of 1\,208~bytes, the
smallest of any VPN, may compound the issue: RIST packets that exceed
this limit would be fragmented, and reassembling fragments under
sustained load could produce exactly the kind of steady, uniform
packet drops the data shows. For video conferencing, VoIP, or any
real-time media workload, this is a disqualifying result regardless
of TCP throughput.
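
The arithmetic behind the fragmentation hypothesis is simple. A
datagram slightly larger than the 1\,208-byte limit splits into two
fragments, and losing either fragment loses the whole datagram.
Under independent per-fragment loss $p$, the effective datagram loss
becomes

\begin{equation*}
p_{\text{datagram}} = 1 - (1 - p)^{2} \approx 2p
\quad\text{for small } p,
\end{equation*}

roughly doubling the loss rate for every fragmented packet, before
counting reassembly-buffer pressure under sustained load. This is a
model of the hypothesis, not a measurement.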

% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
% percentiles all zero) are not visible in the RIST Quality bar chart.
% Add a distribution plot or note in the caption that the bar
% chart hides this variance.
Hyprspace reveals a different failure mode. Its average quality
reads 100\,\%, but the raw numbers underneath are far from stable:
mean packet drops of 1\,194 and a maximum spike of 55\,500, with
the 25th, 50th, and 75th percentiles all at zero. Hyprspace
alternates between perfect delivery and catastrophic bursts.
RIST's forward error correction compensates for most of these
events, but the worst spikes are severe enough to overwhelm FEC
entirely.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Video Streaming/RIST Quality}.png}
    \caption{RIST video streaming quality at baseline. Headscale at
        13.1\% average quality is the clear outlier. Every other VPN
        achieves 99.8\% or higher. Nebula is at 99.8\% (minor
        degradation). The video bitrate (3.3\,Mbps) is well within every
        VPN's throughput capacity, so this test reveals real-time UDP
        handling quality rather than bandwidth limits.}
    \label{fig:rist_quality}
\end{figure}

\subsection{Operational Resilience}

Sustained-load performance does not predict recovery speed. How
quickly a tunnel comes up after a reboot, and how reliably it
reconverges, matters as much as peak throughput for operational use.

% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
% are not shown in any figure or table. Either add a figure or
% scrap this paragraph (see note below).
First-time connectivity spans a wide range. Headscale and WireGuard
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
(10--14\,s) spend seconds negotiating with their control planes
before passing traffic.

%TODO: Maybe we want to scrap first-time connectivity

Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, and
10.07\,s across its three nodes, pointing to a hard-coded timer
rather than topology-dependent convergence. Mycelium sits at the
opposite end, needing 76.6~seconds and showing the same suspiciously
uniform pattern (75.7, 75.7, 78.3\,s), suggesting a fixed
protocol-level wait built into the overlay.

%TODO: Hard coded timer needs to be verified

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. The gap likely reflects the overlay's
spanning-tree rebuild. Yggdrasil organizes its mesh as a spanning
tree and derives each node's routing coordinates from its position in
that tree, so after a restart a node near the root reconverges
quickly, while one further out has to wait for the new topology to
propagate.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
        \caption{Average reconnection time per VPN}
        \label{fig:reboot_bar}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
        \caption{Per-node reconnection time heatmap}
        \label{fig:reboot_heatmap}
    \end{subfigure}
    \caption{Reboot reconnection time at baseline. The heatmap reveals
        Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
        95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
        across all nodes). Hyprspace reconnects fastest (8.7\,s average)
        despite its poor sustained-load performance.}
    \label{fig:reboot_reconnection}
\end{figure}

\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following subsections piece together observations from
earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency-under-load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP
load, that number balloons to roughly 2\,800\,ms, a 1\,556$\times$
increase. This is not the network becoming congested; it is the VPN
tunnel itself filling up with buffered packets and refusing to drain.

The consequences ripple through every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer, shrinking the max congestion window to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps. A plausible, though
unconfirmed, explanation is that massive ACK delays cause the
sender-side timer to undercount the actual data rate. The UDP test
never finished at all, timing out at 120~seconds.

% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.

\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency appears to be the cost of
routing through a global overlay. The per-path numbers, however,
reveal a bimodal distribution:

\begin{itemize}
    \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable
        to Headscale at 1.64\,ms)
    \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed)
    \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed)
\end{itemize}

One of the three links has found a direct route; the other two still
bounce through the overlay. All three machines sit on the same
physical network, so Mycelium does not consistently select the direct
route. Whether that is a path-discovery failure or intended behavior
for a system designed as a global overlay is unclear, but either way
it is a more specific issue than blanket overlay overhead.
Throughput, however, \emph{inverts} rather than mirrors the latency
split: the overlay-routed yuki$\rightarrow$luna path (51.60\,ms)
reaches 379\,Mbps, while the direct luna$\rightarrow$lom path
(1.63\,ms) manages only 122\,Mbps, a 3:1 gap in the opposite
direction of what TCP's bandwidth-delay arithmetic would predict. In
bidirectional mode, the reverse direction on that worst link drops
to 58.4\,Mbps, the lowest single-direction figure in the entire
dataset.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png}
    \caption{Per-link TCP throughput for Mycelium, showing extreme
        path asymmetry. The 3:1 ratio between best
        (yuki$\rightarrow$luna, 379\,Mbps) and worst
        (luna$\rightarrow$lom, 122\,Mbps) links inverts the latency
        split: the low-latency direct path is the slowest link
        (Section~\ref{sec:mycelium_routing}).}
    \label{fig:mycelium_paths}
\end{figure}

% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also clarify what
% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by short-lived
connections accumulate it rapidly. Bulk downloads, by contrast,
amortize it: the Nix cache test finishes only 18\,\% slower than
Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase
begins, per-connection latency fades into the background.

Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
hard-coded convergence timer in the overlay protocol rather than
anything topology-dependent, that is, anything that would vary with a
node's position in the mesh. The UDP test timed out at 120~seconds,
and even first-time connectivity required a 70-second wait at
startup.

\paragraph{Tinc: Userspace Processing Bottleneck.}

Tinc is a clear case of a CPU bottleneck masquerading as a network
problem. At 1.19\,ms latency, packets get through the tunnel
quickly. Yet throughput tops out at 336\,Mbps, barely a third of the
bare-metal link. The usual suspects do not apply: Tinc's effective
UDP payload size (\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3,
comparable to VpnCloud at 1\,375 and WireGuard at 1\,368) is in the
normal range, and its retransmit count (240) is moderate. What
limits Tinc is its single-threaded userspace architecture: one CPU
core simply cannot encrypt, copy, and forward packets fast enough to
fill the pipe.

The parallel benchmark supports this diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
Multiple TCP streams collectively keep that single core busy during
what would otherwise be idle gaps in any individual flow, squeezing
out throughput that no single stream could reach alone.

\section{Impact of Network Impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal conditions.
The impairment profiles from Table~\ref{tab:impairment_profiles}
test a different property: resilience. Two results dominate the
data. First, the throughput hierarchy from
Section~\ref{sec:baseline} collapses under degradation: at High
impairment, the 675\,Mbps spread across all implementations
compresses to under 3\,Mbps, and architectural differences that
matter at gigabit speeds vanish. Second, Headscale outperforms the
bare-metal Internal baseline at Medium impairment across TCP,
parallel TCP, and Nix cache benchmarks. A VPN built on WireGuard
should not beat a direct connection;
Section~\ref{sec:tailscale_degraded} traces the cause to three TCP
parameters in Tailscale's userspace network stack.

\subsection{Ping}

Latency is the most predictable metric under impairment. Most VPNs
absorb the injected delay with a fixed per-hop overhead, and rankings
within the central cluster barely change across profiles
(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and
15\,ms of round-trip delay at Low, Medium, and High respectively;
Internal's measured values (4.82, 9.38, and 15.49\,ms) confirm this.
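
As an illustration, the Low profile corresponds to a netem
configuration of roughly the following shape on each machine. The
interface name is a placeholder and the exact invocation the harness
uses may differ; the parameter values are those listed for the Low
profile (2\,ms delay, 2\,ms jitter, 0.25\,\% loss, 0.5\,\%
reordering per machine):

\begin{verbatim}
# Low profile on one machine; one traversal per direction
# means two traversals per round trip, hence the ~4 ms of
# added RTT measured on the Internal link
tc qdisc add dev eth0 root netem \
    delay 2ms 2ms loss 0.25% reorder 0.5%
\end{verbatim}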

\begin{table}[H]
    \centering
    \caption{Average ping RTT (ms) across impairment profiles, sorted
        by High-profile RTT}
    \label{tab:ping_impairment}
    \begin{tabular}{lrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} \\
        \hline
        Internal  & 0.60  & 4.82  & 9.38  & 15.49 \\
        Tinc      & 1.19  & 5.32  & 9.85  & 15.92 \\
        Nebula    & 1.25  & 5.38  & 9.99  & 15.96 \\
        WireGuard & 1.20  & 5.36  & 9.88  & 15.99 \\
        Headscale & 1.64  & 5.82  & 10.39 & 16.07 \\
        VpnCloud  & 1.13  & 5.41  & 10.35 & 16.21 \\
        ZeroTier  & 1.28  & 5.34  & 10.02 & 16.54 \\
        Yggdrasil & 2.20  & 6.73  & 11.99 & 20.20 \\
        Hyprspace & 1.79  & 6.15  & 10.76 & 24.49 \\
        EasyTier  & 1.33  & 6.27  & 14.13 & 26.60 \\
        Mycelium  & 34.90 & 23.42 & 43.88 & 33.05 \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png}
    \caption{Average ping RTT across impairment profiles. Most VPNs
        form a tight parallel band; Mycelium's non-monotonic curve,
        EasyTier's excess latency at High, and Hyprspace's upward
        divergence stand out.}
    \label{fig:ping_impairment_heatmap}
\end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at
baseline to 23.4\,ms at Low impairment, a 33\,\% improvement where
every other VPN gets slower. It then rises to 43.9\,ms at Medium
before falling again to 33.0\,ms at High. The baseline analysis
(Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency
comes from a bimodal routing distribution: one path runs at 1.63\,ms
while two others route through the global overlay at
${\sim}$51\,ms. The impairment appears to push Mycelium's path
selection toward shorter routes, so a larger share of traffic takes
the direct path. The non-monotonic pattern is consistent with a
path-selection algorithm that responds to measured link quality, but
not linearly with degradation severity; this reading holds whether
the baseline overlay routing is a discovery failure or intended
behavior.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers can
% verify these numbers.
Mycelium also achieves 0\,\% ping packet loss at Low and Medium
impairment, while most VPNs show 0.1--3.2\,\% loss at those profiles.
At High impairment, Mycelium's loss jumps to 11.1\,\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping
% heatmap only shows averages. Add a jitter/distribution figure.
% Also, the "userspace retry mechanism" is a hypothesized cause
% without source-code or packet-level evidence.
EasyTier accumulates 11\,ms of excess latency at High impairment
beyond what tc~netem accounts for. Its average RTT of 26.6\,ms and
maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) suggest a
userspace retry mechanism that introduces escalating variance.
EasyTier's RTT standard deviation reaches 44.6\,ms at High, the
worst jitter of any VPN.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on the exact test
% structure (3 pairs × 3 runs × 100 packets). Verify this matches
% the actual test setup and add a supporting figure or table.
Hyprspace shows 11.1\,\% ping packet loss at every impairment level:
Low, Medium, and High alike. With 9~measurement runs (3~machine
pairs $\times$ 3~runs of 100~packets), 11.1\,\% equals exactly 1/9:
one run per profile fails completely while the other eight report
zero loss.
% TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the buffer
% bloat diagnosis from Section hyprspace_bloat, which depends on the
% unverified 2,800 ms under-load latency. If that diagnosis is
% revised, this explanation must also be revisited.
This binary pass/fail behavior is consistent with the buffer bloat
diagnosis from Section~\ref{sec:hyprspace_bloat}: when buffers fill,
an entire path stalls rather than degrading gradually.

\subsection{TCP Throughput}

TCP throughput is where the baseline hierarchy breaks down. The
three performance tiers from Section~\ref{sec:baseline} dissolve at
the first impairment step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H]
    \centering
    \caption{Single-stream TCP throughput (Mbps) across impairment
        profiles, sorted by baseline. Retention is the
        Low-to-baseline ratio.}
    \label{tab:tcp_impairment}
    \begin{tabular}{lrrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} & \textbf{Retention} \\
        \hline
        Internal  & 934 & 333  & 29.6 & 4.25 & 35.7\% \\
        WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\
        ZeroTier  & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\
        Headscale & 800 & 274  & 41.5 & 4.21 & 34.3\% \\
        Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\
        \hline
        Nebula    & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\
        EasyTier  & 636 & 156  & 17.4 & 3.59 & 24.6\% \\
        VpnCloud  & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\
        \hline
        Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\
        Tinc      & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\
        Mycelium  & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png}
    \caption{Single-stream TCP throughput across impairment profiles.
        Headscale crosses above Internal at Medium impairment;
        Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs
        converge at High.}
    \label{fig:tcp_impairment_heatmap}
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a
98.3\,\% throughput loss from adding just 2\,ms latency, 2\,ms
jitter, 0.25\,\% packet loss, and 0.5\,\% reordering per machine.
Even Mycelium, the slowest VPN at baseline (259\,Mbps), retains more
throughput at Low than Yggdrasil does. The jumbo overlay MTU of
32\,731~bytes, which inflated baseline metrics
(Section~\ref{sec:baseline}), becomes a liability under impairment:
each lost or reordered outer packet triggers retransmission of
${\sim}$24$\times$ more inner-layer data than a standard
1\,400-byte MTU VPN would lose.
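
A first-order model quantifies the amplification. If each
32\,731-byte overlay packet travels as $k \approx 24$ independent
outer packets, an outer loss rate $p$ becomes an inner-packet loss of

\begin{equation*}
p_{\text{inner}} = 1 - (1 - p)^{k},
\end{equation*}

which for $p$ between 0.25\,\% and 0.5\,\% (one or both netem
instances acting on a given direction) puts $p_{\text{inner}}$
between roughly 6\,\% and 11\,\%, enough on its own to explain the
collapse. This is a back-of-the-envelope estimate, not a measured
quantity.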

Headscale retains 34.3\,\% of its baseline throughput at Low, nearly
matching Internal's 35.7\,\%. At Medium impairment, Headscale
(41.5\,Mbps) overtakes Internal (29.6\,Mbps): a VPN outperforming
the bare-metal baseline.
Section~\ref{sec:tailscale_degraded} investigates this anomaly in
detail.

At High impairment, the throughput range compresses from 675\,Mbps
at baseline to just 2.9\,Mbps. Internal leads at 4.25\,Mbps;
Hyprspace trails at 1.39\,Mbps. The impairment profile itself
becomes the bottleneck. With 2.5\,\% packet loss and 5\,\%
reordering per machine, every implementation is TCP-loss-limited,
and architectural differences that matter at gigabit speeds become
irrelevant.

\subsection{UDP Throughput}

The UDP stress test (\texttt{-b~0}) separates kernel-level from
userspace implementations more cleanly than any TCP benchmark. It
also produces widespread failures under impairment: Hyprspace and
Mycelium, which already failed at baseline, continue to time out at
all profiles; Tinc drops out at Low and Medium; ZeroTier fails at
Medium. Notably, the failures are not monotonic with severity:
Internal and WireGuard abort at Low yet complete Medium and High,
and Tinc passes High after failing both milder profiles. That
pattern points toward an interaction between iPerf3's unlimited-rate
sender and the tc qdiscs rather than a purely VPN-level limitation.
Despite the sparse and somewhat unreliable dataset, one pattern is
clear.

Kernel-level implementations maintain throughput in the runs they
complete. Internal holds ${\sim}$950\,Mbps at Baseline, Medium, and
High. Headscale sustains 700--876\,Mbps and WireGuard
850--898\,Mbps across their completed runs; both use WireGuard's
kernel module for the outer tunnel, which provides proper
backpressure at the transport layer. Userspace VPNs collapse:
EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across
successive profiles. Yggdrasil, already pathological at baseline
(98.7\,\% loss), crashes to 12.3\,Mbps at Low and fails entirely at
Medium and High.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
    \caption{UDP receiver throughput across impairment profiles.
        Kernel-level VPNs (Internal, WireGuard, Headscale) maintain
        high throughput in the runs they complete, though each also
        has failed runs, notably at Low; userspace VPNs collapse or
        fail entirely ($\times$ marks a failed run).}
    \label{fig:udp_impairment_heatmap}
\end{figure}
|
||
|
||
The failure rate of this benchmark under impairment limits its value
as a throughput measurement, but the failures are still informative
if read with care. Failures that are consistent across every
profile, as with Hyprspace and Mycelium, suggest genuine
flow-control problems that are likely to surface under real
workloads too, even if the symptoms are milder. The non-monotonic
failures (Internal and WireGuard at Low but not at Medium or High)
are better attributed to the harness and should not be read as a
robustness verdict on those implementations.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must match
% the baseline parallel test description. The baseline section has an
% unresolved TODO about whether the test uses 6 or 10 streams. If the
% baseline is corrected to 10, this section must also be updated.
The Headscale anomaly from single-stream TCP grows larger under
parallel load. Table~\ref{tab:parallel_impairment} shows aggregate
throughput across three concurrent bidirectional links (six
unidirectional flows); the sketch after this paragraph illustrates
the measurement setup.

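The following sketch launches three concurrent bidirectional iPerf3
sessions. It is a sketch under stated assumptions, not the suite's
actual code: the per-link server ports are invented for the example,
and \texttt{--bidir} requires iPerf3~3.7 or newer.

\begin{verbatim}
# Illustrative sketch, not the suite's actual code. Three concurrent
# bidirectional sessions (--bidir) produce six unidirectional flows.
# The per-port server layout is an assumption for this example.
import subprocess

PORTS = [5201, 5202, 5203]  # one iperf3 server per port (assumed)

def run_parallel_tcp(server_ip, seconds=30):
    procs = [
        subprocess.Popen(
            ["iperf3", "-c", server_ip, "-p", str(port),
             "--bidir", "-t", str(seconds), "--json"],
            stdout=subprocess.PIPE, text=True)
        for port in PORTS
    ]
    # Aggregate throughput is the sum over all links of both the
    # send and receive directions in each JSON report.
    return [p.communicate()[0] for p in procs]
\end{verbatim}
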
\begin{table}[H]
  \centering
  \caption{Parallel TCP throughput (Mbps) across impairment profiles.
    Three concurrent bidirectional links produce six unidirectional
    flows.}
  \label{tab:parallel_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 1398 & 277  & 82.6 & 10.4 \\
    Headscale & 1228 & 718  & 113  & 20.0 \\
    WireGuard & 1281 & 173  & 24.5 & 8.39 \\
    Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
    ZeroTier  & 1206 & 176  & 35.4 & 7.97 \\
    EasyTier  & 927  & 473  & 57.4 & 10.7 \\
    Hyprspace & 803  & 2.87 & 6.94 & 3.62 \\
    VpnCloud  & 763  & 174  & 23.7 & 8.25 \\
    Nebula    & 648  & 103  & 15.3 & 4.93 \\
    Mycelium  & 569  & 72.7 & 7.51 & 3.69 \\
    Tinc      & 563  & 168  & 23.7 & 8.25 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
  \caption{Parallel TCP throughput across impairment profiles.
    Headscale dominates at Low (718\,Mbps vs.\ Internal's 277);
    EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to
    2.87\,Mbps.}
  \label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment Headscale reaches 718\,Mbps: 2.6$\times$ Internal
(277\,Mbps) and 4.1$\times$ WireGuard (173\,Mbps). At Medium,
Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%.
Whatever mechanism produces the single-stream crossover at Medium
scales with the number of flows: six independent streams each
benefit from it.

EasyTier is the second-most resilient VPN under parallel load, at
473\,Mbps at Low (51\% of baseline). Both EasyTier and Headscale
retain more than half their baseline parallel throughput at Low
impairment; no other VPN exceeds 30\%. Unlike Headscale, whose
resilience is traced to its userspace TCP stack in
Section~\ref{sec:tailscale_degraded}, EasyTier's resilience has no
confirmed architectural explanation in the data collected here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\%
loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat diagnosis
% from Section hyprspace_bloat, which depends on the unverified 2,800 ms
% under-load latency. If that diagnosis is revised, this explanation
% for parallel collapse must also be revisited.
The buffer bloat that appears to plague single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) would become catastrophic when
six concurrent flows compete for the same bloated buffers.

The High-profile convergence effect is even more pronounced here than
in single-stream mode. Tinc and VpnCloud land at identical
8.25\,Mbps despite differing by 200\,Mbps at baseline.

\subsection{QUIC Performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline
(Section~\ref{sec:baseline}) and continue to fail across all
impairment profiles.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to
7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High ---
the same cliff observed in its TCP results, again driven by
jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps),
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of each other. At baseline these four span a 188\,Mbps
range (844 to 656\,Mbps). QUIC's own congestion control, operating
atop the already-degraded outer link, becomes the sole limiter.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
  \caption{QUIC bandwidth across impairment profiles. Yggdrasil
    drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier,
    and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and
    Nebula fail at all profiles ($\times$).}
  \label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video Streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's
throughput budget even at High impairment. Quality differences in
Table~\ref{tab:rist_impairment} therefore reflect packet delivery
reliability, not bandwidth.

\begin{table}[H]
  \centering
  \caption{RIST video streaming quality (\%) across impairment
    profiles, sorted by High-profile quality}
  \label{tab:rist_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Mycelium  & 100.0 & 100.0 & 100.0 & 99.9 \\
    EasyTier  & 100.0 & 100.0 & 96.2  & 85.5 \\
    Internal  & 100.0 & 99.2  & 89.3  & 80.2 \\
    ZeroTier  & 100.0 & 99.3  & 89.9  & 80.2 \\
    VpnCloud  & 100.0 & 99.2  & 89.7  & 80.1 \\
    WireGuard & 100.0 & 99.3  & 90.0  & 80.0 \\
    Hyprspace & 100.0 & 92.9  & 87.9  & 78.1 \\
    Tinc      & 100.0 & 99.3  & 90.0  & 77.8 \\
    Nebula    & 99.8  & 98.8  & 85.6  & 72.1 \\
    Yggdrasil & 100.0 & 94.7  & 71.4  & 43.3 \\
    Headscale & 13.1  & 13.0  & 13.0  & 13.0 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
  \caption{RIST video streaming quality across impairment profiles.
    Headscale is stuck at ${\sim}$13\% regardless of profile;
    Mycelium maintains ${\sim}$100\% even at High; Yggdrasil
    declines steeply to 43\%.}
  \label{fig:rist_impairment_heatmap}
\end{figure}

Headscale stays at ${\sim}$13\% across all four profiles: 13.1\%,
13.0\%, 13.0\%, 13.0\%. The profile-independence is consistent with
the baseline hypothesis from Section~\ref{sec:baseline}: a
structural failure, possibly MTU fragmentation in the DERP relay
layer (not yet verified by packet capture), that cannot worsen
because it is already saturated. Adding latency or loss on top of an
87\% packet drop floor changes nothing.

Mycelium delivers 99.9\% quality even at High impairment, better than
Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even
Mycelium's degraded overlay paths can sustain the stream. The same
overlay routing that adds 34.9\,ms of latency and cripples bulk TCP
transfers is harmless at video bitrates. RIST's own forward error
correction compensates for whatever packet loss remains.

Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at
Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP
throughput plausibly hurts here too: large overlay packets carrying
RIST data are more likely to be lost or reordered at the outer
layer, and RIST's FEC may not recover from the resulting burst
losses. No packet-level evidence was collected to confirm this
mechanism.

\subsection{Application-Level Download}

The Nix binary cache download is the most demanding application-level
benchmark: hundreds of sequential HTTP connections amplify
per-connection latency penalties that bulk throughput tests amortize.
A simple cost model after this paragraph makes the amplification
concrete; Table~\ref{tab:nix_impairment} shows the measured download
times across profiles.

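The model below is a back-of-the-envelope sketch, not measured data;
the request count, object size, and delay figures are invented for
illustration. Its point is structural: sequential requests pay their
per-request delay $n$ times, so a stall that a single bulk transfer
would absorb once is multiplied by the request count.

\begin{verbatim}
# Back-of-the-envelope model; every number here is an illustrative
# assumption, not a measurement from the benchmark suite.
def sequential_download_s(n_requests, per_request_delay_s,
                          bytes_per_request, goodput_bps):
    # Each sequential request pays its own fixed delay plus its
    # transfer time; nothing is amortized across requests.
    transfer_s = 8 * bytes_per_request / goodput_bps
    return n_requests * (per_request_delay_s + transfer_s)

# 300 requests of 500 kB each at 100 Mbps goodput: raising the
# per-request delay from 5 ms to 500 ms (retransmission stalls)
# adds roughly 150 s to the total on its own.
base     = sequential_download_s(300, 0.005, 500_000, 100e6)  # ~13.5 s
impaired = sequential_download_s(300, 0.500, 500_000, 100e6)  # ~162 s
\end{verbatim}
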
\begin{table}[H]
  \centering
  \caption{Nix binary cache download time (seconds) across impairment
    profiles, sorted by Low-profile time. ``--'' marks a failed
    run.}
  \label{tab:nix_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 8.53 & 11.9 & 58.6 & --  \\
    Headscale & 9.79 & 13.5 & 48.8 & 219 \\
    EasyTier  & 9.39 & 22.1 & 141  & --  \\
    VpnCloud  & 9.39 & 27.9 & 163  & --  \\
    WireGuard & 9.45 & 28.8 & 161  & --  \\
    Nebula    & 9.15 & 30.8 & 180  & 547 \\
    Tinc      & 10.0 & 30.9 & 166  & 496 \\
    ZeroTier  & 9.22 & 36.2 & 141  & --  \\
    Mycelium  & 10.1 & 79.5 & --   & --  \\
    Yggdrasil & 10.6 & 230  & --   & --  \\
    Hyprspace & 11.9 & --   & 170  & --  \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
  \caption{Nix binary cache download time across impairment profiles.
    Headscale, Nebula, and Tinc complete all four profiles; Headscale
    beats Internal at Medium (49\,s vs.\ 59\,s). Yggdrasil's
    Low-profile time explodes to 230\,s ($\times$ marks a failed run).}
  \label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all four
profiles. At Medium impairment, Headscale finishes in 48.8~seconds
--- faster than Internal's 58.6~seconds. Internal itself fails at
High impairment while Headscale completes in 219~seconds, Tinc in
496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low
impairment, a 22$\times$ slowdown. Every HTTP request incurs the
latency penalty from Yggdrasil's impairment-amplified
retransmissions. Mycelium also degrades severely (10.1\,s to
79.5\,s, an 8$\times$ increase), consistent with its overlay routing
overhead, which compounds over hundreds of sequential HTTP
connections.

The failure map shows a mostly clean gradient: more demanding
profiles knock out more VPNs. At Low, 10 of 11 complete (Hyprspace
fails). At Medium, 9 complete, including Hyprspace, which failed at
Low yet finishes Medium in 170\,s; like the UDP results, this
non-monotonic failure points to a harness interaction rather than
the impairment itself. At High, only 3 survive (Headscale, Nebula,
Tinc). Internal's failure at High is the most surprising: the
bare-metal baseline cannot sustain a multi-connection HTTP workload
under severe degradation, but Headscale, shielded by its userspace
TCP stack, can. Section~\ref{sec:tailscale_degraded} explains why.

\section{Tailscale Under Degraded Conditions}
\label{sec:tailscale_degraded}

\subsection{Observed Anomaly}

At Medium impairment, Headscale delivers 41.5\,Mbps single-stream TCP
throughput --- 40\% more than Internal's 29.6\,Mbps. A VPN built
atop WireGuard outperforms the bare-metal connection it tunnels
through. The anomaly is consistent across benchmarks:
Table~\ref{tab:headscale_anomaly} summarizes the comparison.

\begin{table}[H]
  \centering
  \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment
    (18.12.2025 run). For TCP benchmarks, higher is better. For
    Nix cache, lower is better; ``--'' marks a failed run.}
  \label{tab:headscale_anomaly}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} &
    \textbf{Headscale} & \textbf{WireGuard} \\
    \hline
    Single TCP (Mbps)   & Low    & 333  & 274  & 54.7 \\
    Single TCP (Mbps)   & Medium & 29.6 & 41.5 & 8.77 \\
    Single TCP (Mbps)   & High   & 4.25 & 4.21 & 2.63 \\
    Parallel TCP (Mbps) & Low    & 277  & 718  & 173  \\
    Parallel TCP (Mbps) & Medium & 82.6 & 113  & 24.5 \\
    Nix cache (s)       & Medium & 58.6 & 48.8 & 161  \\
    Nix cache (s)       & High   & --   & 219  & --   \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
  \caption{Single-stream TCP throughput for Internal, Headscale, and
    WireGuard across impairment profiles (log scale). Headscale
    crosses above Internal at Medium impairment; WireGuard stays far
    below both; all three converge at High.}
  \label{fig:headscale_vs_internal}
\end{figure}

In parallel TCP at Low impairment, Headscale reaches 718\,Mbps vs.\
Internal's 277\,Mbps (2.6$\times$). The Nix cache download at
Medium takes Headscale 48.8\,s vs.\ Internal's 58.6\,s (17\%
faster). At High impairment, Internal fails the Nix cache entirely
while Headscale completes in 219\,s.

WireGuard, which shares Headscale's cryptographic layer, shows no
such advantage: 54.7\,Mbps at Low, 8.77\,Mbps at Medium. Whatever
protects Headscale is not the encryption or the tunnel --- it is
something in Tailscale's userspace networking stack.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit rate
% table for impaired profiles or reference the data source.
The retransmit data provides the first clue. At Medium impairment,
WireGuard's retransmit rate is 5.2\% --- more than double Internal's
${\sim}$2.4\%. Headscale, despite being a VPN, matches Internal at
${\sim}$2.4\%. WireGuard uses the host kernel's TCP stack, which
treats reordered packets as losses and fires spurious retransmits;
Headscale's gVisor stack tolerates more reordering, so fewer
retransmissions are wasted on packets that were merely delayed.

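For reproducibility, retransmit rates of this kind can be derived
from iPerf3's TCP JSON output, which reports a sender-side retransmit
count rather than a percentage. The sketch below shows one plausible
derivation; the MSS value is an assumption, and the suite's actual
computation is not shown here.

\begin{verbatim}
# Sketch: derive a retransmit percentage from an iperf3 TCP JSON
# report. iperf3 reports a retransmit count, not a rate; converting
# bytes to segments requires assuming an MSS (1448 is a guess for a
# typical 1500-byte-MTU path, and will differ inside a tunnel).
def retransmit_pct(report, mss=1448):
    sent = report["end"]["sum_sent"]  # sender-side totals
    segments = sent["bytes"] / mss
    return 100.0 * sent["retransmits"] / segments
\end{verbatim}
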
\subsection{Congestion Control Analysis}

Tailscale uses a userspace TCP/IP stack derived from Google's gVisor
(netstack). This stack does not inherit the host kernel's TCP
parameters. Three defaults differ from the Linux kernel in ways that
matter under packet reordering (a verification sketch follows the
list):

\begin{itemize}
  \bitem{\texttt{tcp\_reordering}:} gVisor uses 10; the Linux kernel
    defaults to~3. This parameter controls how many out-of-order
    packets TCP tolerates before treating the event as a loss. With
    tc~netem injecting 0.5--2.5\% reordering per machine, bursts of
    3+ reordered packets are frequent. The kernel's threshold of~3
    causes spurious fast retransmits and congestion window reductions
    for packets that are merely reordered, not lost.
  \bitem{\texttt{tcp\_recovery} (RACK):} gVisor disables it; the
    Linux kernel enables it by default. RACK uses timing-based loss
    detection that is more aggressive than the pure sequence-based
    approach gVisor uses. Under reordering, RACK's timing heuristics
    can falsely classify delayed packets as lost.
  \bitem{\texttt{tcp\_early\_retrans} (TLP):} gVisor disables it; the
    kernel enables it. Tail Loss Probe sends speculative retransmits
    on idle connections, which can worsen congestion when the link is
    already impaired.
\end{itemize}

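To check how a given host compares against these gVisor defaults, the
three parameters can be read from the standard \texttt{/proc/sys}
interface. This sketch assumes a Linux host and is not part of the
benchmark suite:

\begin{verbatim}
# Read the three kernel parameters discussed above on a live Linux
# host, for comparison against gVisor's defaults (reordering=10,
# RACK disabled, TLP disabled). Reading requires no privileges.
PARAMS = ["tcp_reordering", "tcp_recovery", "tcp_early_retrans"]

for name in PARAMS:
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().strip())
\end{verbatim}
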
Under packet reordering, these three defaults compound. The Linux
TCP stack fires retransmits and cuts the congestion window far more
often than necessary; each false positive shrinks the window and
reduces throughput. Tailscale's gVisor stack tolerates more
reordering before reacting, so its congestion window stays larger and
throughput stays higher.

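The standard macroscopic TCP model (Mathis et al.) makes the
mechanism concrete, as a first-order approximation that ignores
timeouts and RACK dynamics:
\[
  \text{throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{p}},
  \qquad
  p \;=\; p_{\mathrm{loss}} + p_{\mathrm{spurious}},
\]
where $p$ is the loss rate as perceived by the sender. Reordering
misclassified as loss inflates $p_{\mathrm{spurious}}$ without any
packet actually being dropped, so a stack with a low reordering
threshold operates on a larger effective $p$ and a correspondingly
smaller congestion window.
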
This explains why the anomaly emerges as impairment increases and
then vanishes. At baseline, there is no reordering, so the threshold
difference is irrelevant and Internal's kernel-level processing
advantage dominates. As reordering increases from 0.5\% (Low) to
2.5\% (Medium) per machine, the kernel's aggressive loss detection
fires more often, and the throughput gap shifts in Headscale's
favor. At High impairment, however, both converge to
${\sim}$4.2\,Mbps: the absolute packet loss rate becomes the
dominant bottleneck, overriding the reordering tolerance advantage.

\subsection{Tuned Kernel Parameters}

Two follow-up benchmark runs applied Tailscale's gVisor TCP
parameters to the host kernel via sysctl (a sketch of the
reorder-only variant follows the list):

\begin{itemize}
  \bitem{Full gVisor (27.02.2026):} All parameters ---
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0},
    \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes
    (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max},
    \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard,
    Tinc, and ZeroTier.
  \bitem{Reorder-only (06.03.2026):} Only
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and
    \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel
    defaults. Tested on Internal and Headscale only.
\end{itemize}

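One way to reproduce the reorder-only configuration is to write the
three sysctls directly; the runs themselves may have applied them
through \texttt{sysctl} or a configuration file, so treat this as a
sketch rather than the suite's provisioning code. Root privileges
are required:

\begin{verbatim}
# Sketch of the reorder-only tuning (06.03.2026 run): three sysctls,
# buffer sizes untouched. Writing /proc/sys requires root.
TUNING = {
    "tcp_reordering": "10",    # kernel default: 3
    "tcp_recovery": "0",       # disable RACK loss detection
    "tcp_early_retrans": "0",  # disable Tail Loss Probe
}

for name, value in TUNING.items():
    with open(f"/proc/sys/net/ipv4/{name}", "w") as f:
        f.write(value)
\end{verbatim}
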
Table~\ref{tab:kernel_tuning_internal} shows how Internal responds
to the tuning. Both follow-up runs used the same impairment profiles
and hardware as the original 18.12.2025 run.

\begin{table}[H]
  \centering
  \caption{Internal (no VPN) throughput across three kernel
    configurations. ``Default'' is the 18.12.2025 run with stock
    Linux TCP parameters.}
  \label{tab:kernel_tuning_internal}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Metric} & \textbf{Profile} & \textbf{Default} &
    \textbf{Full gVisor} & \textbf{Reorder-only} \\
    \hline
    Single TCP (Mbps)   & Baseline & 934         & 934  & 934  \\
    Single TCP (Mbps)   & Low      & 333         & 363  & 354  \\
    Single TCP (Mbps)   & Medium   & 29.6        & 64.2 & 72.7 \\
    Parallel TCP (Mbps) & Low      & 277         & 893  & 902  \\
    Parallel TCP (Mbps) & Medium   & 82.6        & 226  & 211  \\
    Retransmit \%       & Medium   & ${\sim}$2.4 & 1.21 & 1.11 \\
    Nix cache (s)       & Medium   & 58.6        & 29.7 & 29.1 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
  \caption{Internal (no VPN) single-stream TCP throughput across three
    kernel configurations. Baseline is unchanged; at Medium
    impairment, throughput jumps from 30 to 64 to 73\,Mbps as
    reordering tolerance increases.}
  \label{fig:kernel_tuning_comparison}
\end{figure}

Internal's Medium-impairment throughput jumps from 29.6 to
72.7\,Mbps --- a 146\% increase from a three-line sysctl change. The
retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; over half
of the original retransmissions were spurious. The Nix cache download
at Medium halves from 58.6\,s to 29.1\,s.

Parallel TCP sees an even larger gain. Internal at Low impairment
climbs from 277 to 902\,Mbps, a 226\% increase that now exceeds
Headscale's original 718\,Mbps.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits the
% unresolved 6-vs-10 stream count from the baseline parallel test
% description. Update when that TODO is resolved.
With six concurrent flows each
independently benefiting from the higher reordering threshold, the
aggregate improvement compounds.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in
% any table. Add a table showing Headscale's results from the
% follow-up runs alongside Internal's so readers can verify the
% reversal.
The anomaly reverses. At the measured impairment levels and
benchmarks, tuned Internal now meets or exceeds Headscale. At Medium
impairment: Internal 72.7\,Mbps vs.\ Headscale 50.1\,Mbps (Internal
45\% ahead), where the original result had Headscale 40\% ahead. The
Nix cache flips too: Internal completes in 29.1\,s vs.\ Headscale's
36.3\,s, where the original had Headscale 17\% faster.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
  \caption{Internal-to-Headscale speed-up factor before and after
    kernel tuning. Values above 1.0 mean Internal is faster. At
    Medium impairment, the ratio flips from 0.71$\times$ (Headscale
    ahead) to 1.45$\times$ (Internal ahead).}
  \label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration (06.03) matches or exceeds the full
gVisor configuration (27.02) at most metrics; the two exceptions are
single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at
Medium (211 vs.\ 226\,Mbps), both within 7\%. Internal reaches
72.7\,Mbps at Medium with reorder-only vs.\ 64.2\,Mbps with full
gVisor. The enlarged buffer sizes appear unnecessary and may
introduce mild buffer bloat that partially offsets the reordering
benefit, though the difference could also reflect normal run-to-run
variance. On this evidence, the entire Headscale advantage is
explained by three kernel parameters: \texttt{tcp\_reordering},
\texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.

% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier
% (11.5 Mbps) tuned values are not in any table. Add them to
% Table~\ref{tab:kernel_tuning_internal} or a new table.
Other VPNs benefit less from the kernel tuning. WireGuard's Medium
throughput rises from 8.77 to 12.2\,Mbps (+39\%) and Tinc's from
5.53 to 11.5\,Mbps (+108\%). ZeroTier stays flat (12.0 to
11.5\,Mbps). The tuning helps the kernel TCP stack, but VPNs that
add their own encapsulation overhead and userspace processing have
independent bottlenecks that the sysctl parameters cannot remove.

Headscale itself gets modestly faster with kernel tuning (+21\% at
Medium) but slightly slower at Low impairment ($-$5\%). Its
userspace gVisor stack already optimizes for reordering tolerance.
When the kernel stack also increases its tolerance, the two layers
of tuning may interact suboptimally: both independently delay
retransmits, which could cause compound delays on the
kernel-to-Headscale socket path. This interaction was not verified
experimentally.

% TODO: These sections are empty stubs but the chapter introduction
% (line 12--13) promises "findings from the source code analysis."
% Either write these sections or remove the promise from the intro.

\section{Source Code Analysis}

\subsection{Feature Matrix Overview}

% Summary of the 131-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security Vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of Findings}

% Brief summary table or ranking of VPNs by key metrics.
% Save deeper interpretation for a Discussion chapter.