VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

Raw throughput alone is incomplete. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier,
Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%).
Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and
0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. ZeroTier
reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating
for tunnel-internal loss through repeated TCP congestion-control
recovery; WireGuard delivers comparable throughput with effectively
zero loss.
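The normalization can be sketched in a few lines. The helper name and
the 30-second test duration are our assumptions for illustration, not
part of the benchmark harness:

```python
# Sketch of the normalization described above: turn a raw retransmit
# count into a rate by estimating how many segments the flow sent.
# Duration and segment size are assumed, not taken from the harness.
def retransmit_rate_pct(throughput_mbps, duration_s, segment_bytes, retransmits):
    bytes_sent = throughput_mbps * 1e6 / 8 * duration_s
    segments_sent = bytes_sent / segment_bytes
    return retransmits / segments_sent * 100

# WireGuard-like numbers: 864 Mbps for 30 s with ~1400-byte segments
# and a single retransmit stays far below the 0.01 % "loss-free" band.
print(f"{retransmit_rate_pct(864, 30, 1400, 1):.5f}")
```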

\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png}
\caption{Average single-stream TCP throughput}
\label{fig:tcp_throughput}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png}
\caption{TCP retransmit rate (\%)}
\label{fig:tcp_retransmits}
\end{subfigure}
\caption{TCP throughput and retransmit rate at baseline. WireGuard
leads at 864\,Mbps while staying effectively loss-free
($<$\,0.01\,\% retransmit rate); Hyprspace is the clear outlier at
0.49\,\%. The retransmit rate does not always track inversely with
throughput: ZeroTier achieves high throughput (814\,Mbps)
\emph{despite} a 0.10\,\% rate.}
\label{fig:tcp_results}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each one triggers a reduction in the congestion window
(\texttt{cwnd}) and throttles the sender.
Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of
any VPN. At first glance this suggests a clean inverse correlation
between retransmit rate and congestion window size, but the picture
is misleading. Yggdrasil's outsized window is largely an artifact of
its jumbo overlay MTU (32\,731 bytes): each segment carries far more
data, so the window in bytes is inflated relative to VPNs using a
standard ${\sim}$1\,400-byte MTU. Comparing congestion windows
across different MTU sizes is not meaningful without normalizing for
segment size. The reliable conclusion is simpler: high retransmit
rates force TCP to spend more time in congestion recovery than in
steady-state transmission, and that caps throughput regardless of
available bandwidth. ZeroTier illustrates the opposite extreme:
brute-force retransmission can still yield high throughput
(814\,Mbps at a 0.10\,\% rate), at the cost of wasted bandwidth and
unstable flow behavior.
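A back-of-the-envelope check of the MTU argument, using the rounded
figures quoted above: expressed in segments rather than bytes, the two
windows turn out to be the same order of magnitude.

```python
# Express each max congestion window in segments using the MTUs from
# the text. In bytes the windows differ by roughly 20x; in segments
# they are comparable, which supports reading the byte figure as an
# MTU artifact rather than a genuine capacity difference.
ygg_window_segments = 4.2e6 / 32731   # Yggdrasil: 4.2 MB, 32731-byte segments
hyp_window_segments = 200e3 / 1400    # Hyprspace: 200 KB, ~1400-byte segments
print(round(ygg_window_segments), round(hyp_window_segments))
```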

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Max TCP Window Size}.png}
\caption{Maximum TCP window sizes (send and congestion) at baseline.
Yggdrasil's congestion window (4\,219\,KB) dwarfs all others but
is inflated by its 32\,KB jumbo overlay MTU. Hyprspace has the
smallest congestion window (200\,KB).}
\label{fig:tcp_window}
\end{figure}

VpnCloud stands out: its sender reports 538.8\,Mbps but the
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.
% TODO: Clarify whether the headline TCP table
% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports
% sender or receiver throughput. The prose here cites sender
% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender
% column, so the table caption should say so explicitly. Same
% clarification needed for Hyprspace (368 in table vs.\ sender
% 367.9 / receiver 419.8 in the pathological-cases paragraph).

Variability, whether stochastic across runs or systematic across
links, also differs substantially. WireGuard's three link
\caption{Retransmits vs.\ max congestion window}
\label{fig:retransmit_cwnd}
\end{subfigure}
\caption{Retransmit correlations (log scale on x-axis). A high
retransmit rate does not always mean low throughput (ZeroTier:
0.10\,\%, 814\,Mbps), but an extreme rate does (Hyprspace:
0.49\,\%, 368\,Mbps). The apparent inverse correlation between
retransmit rate and congestion window size is dominated by
Yggdrasil's outlier (4.2\,MB \texttt{cwnd}), which is inflated
by its 32\,KB jumbo overlay MTU rather than by a low retransmit
rate alone.}
\label{fig:retransmit_correlations}
\end{figure}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges. The table also
reports the average maximum RTT observed across test runs and the
resulting spike ratio (max/avg); a high ratio signals bursty tail
latency that the average alone conceals.

\begin{table}[H]
\centering
\caption{Ping RTT statistics at baseline, sorted by average latency.
The spike ratio is max\,RTT\,/\,avg\,RTT; higher values indicate
bursty tail latency.}
\label{tab:latency_baseline}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Avg RTT (ms)} & \textbf{Max RTT (ms)}
& \textbf{Spike Ratio} & \textbf{Jitter (ms)} \\
\hline
Internal & 0.60 & 0.65 & 1.1$\times$ & 0.04 \\
VpnCloud & 1.13 & 3.14 & 2.8$\times$ & 0.25 \\
Tinc & 1.19 & 1.31 & 1.1$\times$ & 0.07 \\
WireGuard & 1.20 & 1.81 & 1.5$\times$ & 0.13 \\
Nebula & 1.25 & 1.53 & 1.2$\times$ & 0.10 \\
ZeroTier & 1.28 & 3.00 & 2.3$\times$ & 0.25 \\
EasyTier & 1.33 & 1.55 & 1.2$\times$ & 0.10 \\
\hline
Headscale & 1.64 & 1.81 & 1.1$\times$ & 0.09 \\
Hyprspace & 1.79 & 2.21 & 1.2$\times$ & 0.13 \\
Yggdrasil & 2.20 & 3.13 & 1.4$\times$ & 0.20 \\
\hline
Mycelium & 34.9 & 48.6 & 1.4$\times$ & 1.49 \\
\hline
\end{tabular}
\end{table}

moderate overhead. Then there is Mycelium at 34.9\,ms, so far
removed from the rest that Section~\ref{sec:mycelium_routing} gives
it a dedicated analysis.

The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two
outliers among the low-latency VPNs. VpnCloud leads at
2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at
2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest
jitter in the table (0.25\,ms). Tinc and Headscale, by contrast,
stay at roughly 1.1$\times$ with jitter of at most 0.09\,ms, so their
packet timing is nearly as stable as bare metal. The spikes in
VpnCloud and ZeroTier are consistent with periodic control-plane work
such as key rotation or peer heartbeats that briefly stalls the data
path.
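Since the spike ratio is just max over average RTT, a three-line check
reproduces the table's values from its own columns:

```python
# Recompute the spike ratios from the avg and max RTT columns (ms)
# of the baseline latency table above.
rtt = {
    "VpnCloud": (1.13, 3.14),
    "ZeroTier": (1.28, 3.00),
    "Tinc":     (1.19, 1.31),
}
for vpn, (avg, mx) in rtt.items():
    print(f"{vpn}: {mx / avg:.1f}x")
```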

\begin{figure}[H]
\centering

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
but only the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet something caps the overall rate.
Figure~\ref{fig:tcp_cpu} shows that Tinc uses only 12.3\,\% host CPU
during the TCP test. On a multi-core host this figure is consistent
with a single saturated core, which fits Tinc's single-threaded
userspace architecture: one core encrypts, copies, and forwards
packets, and the remaining cores sit idle.
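The arithmetic behind the single-core reading is simple. The core
count below is our assumption for illustration; the text does not
state the host's core count.

```python
# One fully saturated core on an N-core host shows up as 100/N percent
# whole-system utilization. The 8-core figure is an assumption.
cores = 8
one_core_share_pct = 100 / cores
print(one_core_share_pct)  # 12.5, close to Tinc's measured 12.3 %
```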

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP CPU Utilization}.png}
\caption{CPU utilization during TCP throughput tests, split by host
(sender) and remote (receiver). Tinc (12.3\,\%) and VpnCloud
(14.2\,\%) use similar CPU, yet VpnCloud achieves 60\,\% higher
throughput. Yggdrasil's low CPU (2.7\,\%) reflects its
kernel-level forwarding with jumbo segments.}
\label{fig:tcp_cpu}
\end{figure}

VpnCloud is also
single-threaded and uses slightly more CPU (14.2\,\%), yet reaches
539\,Mbps (60\,\% more throughput). The gap comes down to per-packet
cost. Tinc uses a hand-written ChaCha20-Poly1305 implementation
without hardware acceleration, allocates a fresh stack buffer and
copies the payload for each packet, and routes through a splay-tree
lookup. VpnCloud uses the \texttt{ring} cryptographic library, which
employs optimized assembly and can select AES-128-GCM with hardware
AES-NI instructions at runtime; it encrypts in place with no extra
buffer copies and routes through an $O(1)$ hash-map lookup. These
differences compound in a tight single-threaded loop: every
microsecond saved per packet raises the maximum packet rate the one
available core can sustain.

Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.

The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a
kernel-level implementation; in-kernel cryptographic processing
is the likely cause, though no profiling data confirms this.
On the efficient end, VpnCloud
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
CPU time. Nebula and Headscale are missing from
this comparison because qperf failed for both.

% TODO: Explain why they consistently failed

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time. The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna,
Table~\ref{tab:parallel_scaling} lists the results.
\end{table}

The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT gives it a
bandwidth-delay product (Equation~\ref{eq:bdp}) of roughly
4.4\,MB on a 1\,Gbps link. No single TCP flow maintains a
congestion window that large, so the link is never fully utilized.
Multiple concurrent flows each contribute their own window, and
their aggregate in-flight data approaches the BDP, which pushes
throughput to 2.20$\times$ the single-stream figure.
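The 4.4\,MB figure falls straight out of the definition, bandwidth
times round-trip time:

```python
# Bandwidth-delay product for Mycelium's overlay path: link rate times
# RTT gives the in-flight bytes needed to keep the link full.
link_bits_per_s = 1e9          # 1 Gbps link
rtt_s = 34.9e-3                # 34.9 ms average RTT
bdp_bytes = link_bits_per_s * rtt_s / 8
print(f"{bdp_bytes / 1e6:.1f} MB")  # 4.4 MB
```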

Hyprspace scales almost as well (2.18$\times$) for the same
structural reason, but the bottleneck is different. Its libp2p send
pipeline accumulates roughly 2\,800\,ms of under-load latency
(Section~\ref{sec:hyprspace_bloat}), which inflates the effective BDP
to hundreds of megabytes, far beyond any single kernel congestion
window. Because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three
concurrent peer pairs in the parallel benchmark each get their own
libp2p stream, their own mutex, and their own yamux flow-control
window. Three independent windows in flight fill more of the bloated
pipeline than one can.
% TODO: This is still a hypothesis: it generalises the same
% bandwidth-delay-product argument used for Mycelium directly
% above, and is now grounded in the per-peer
Tinc picks up a gain of its own: parallel streams keep its
single-threaded CPU busy during what would otherwise be idle gaps in
a single flow.

WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with a 0.00\,\% retransmit rate in both modes.
This is consistent with WireGuard's overhead being a fixed per-packet
cost that does not worsen under multiplexing.

Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps
(0.92$\times$). The cause is lock contention in Nebula's firewall
connection tracker (Listing~\ref{lst:nebula_conntrack}). A single
\texttt{sync.Mutex} protects the global \texttt{Conns} map, and every
packet in both directions must acquire it. The lock holder also
purges the timer wheel before releasing the lock, so other goroutines
stall while that housekeeping runs. Nebula mitigates this with a
per-routine cache that bypasses the global lock for known flows, but
the cache is invalidated every second, at which point all goroutines
contend on the mutex again. With parallel streams, the increased
goroutine count turns this periodic contention into a throughput
bottleneck.
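The contention pattern generalizes beyond Go. A minimal sketch of the
structure described above, not Nebula's actual code (all names here
are ours):

```python
import threading
import time

# Sketch, NOT Nebula's code: one lock guards the shared connection
# map, and each worker keeps a local cache that is flushed every
# second, after which all workers contend on the global lock again.
conns_lock = threading.Lock()
conns = {}                         # shared connection-tracking map

class Worker:
    CACHE_TTL_S = 1.0              # cache lifetime, per the text

    def __init__(self):
        self.cache = {}            # per-worker view of known flows
        self.flushed_at = time.monotonic()

    def track(self, flow):
        now = time.monotonic()
        if now - self.flushed_at > self.CACHE_TTL_S:
            self.cache.clear()     # periodic invalidation
            self.flushed_at = now
        if flow in self.cache:
            return                 # fast path: no lock taken
        with conns_lock:           # slow path: global mutex
            conns.setdefault(flow, True)  # (timer-wheel purge elided)
        self.cache[flow] = True

w = Worker()
w.track(("10.0.0.1", "10.0.0.2", 443))
w.track(("10.0.0.1", "10.0.0.2", 443))  # second call hits the cache
```

With many workers, every cache flush funnels all of them back through
`conns_lock` at once, which is the periodic stall the text describes.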

\lstinputlisting[language=Go,caption={Nebula's firewall conntrack: a
global mutex protects the connection map and is acquired on every
packet.
\textit{nebula/firewall.go:79--84,
486--558}},label={lst:nebula_conntrack}]{Listings/nebula_conntrack.go}

Retransmit rates under parallel load shift in two directions.
VpnCloud's rate climbs from 0.06\,\% to 0.14\,\% (2.5$\times$) and
Yggdrasil's from 0.09\,\% to 0.23\,\% (2.7$\times$), so
multiplexing genuinely increases loss for these VPNs. Hyprspace's
rate, by contrast, drops slightly from 0.49\,\% to 0.39\,\% even
though it sends far more data in parallel; the per-packet loss
probability does not worsen, but the absolute count still triples
because three pairs are transmitting simultaneously. VPNs that were
clean in single-stream mode (WireGuard, Internal) stay clean under
parallel load.

\begin{figure}[H]
\centering

no flow-control signal coupling the two.
\textit{hyprspace/node/node.go:36--39, 282,
328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go}

\paragraph{Mycelium: routing anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency looks like a straightforward
cost of routing through a global overlay. The per-path numbers do
not fit this explanation:

\begin{itemize}
\bitem{luna$\rightarrow$lom:} 1.63\,ms (comparable
to Headscale at 1.64\,ms)
\bitem{lom$\rightarrow$yuki:} 51.47\,ms
\bitem{yuki$\rightarrow$luna:} 51.60\,ms
\end{itemize}

One link found a direct LAN path; the other two bounced through the
overlay. All three machines sit on the same physical network, so the
split is not a matter of topology.

The throughput results invert the latency ranking. The link with the
lowest ping latency, luna$\rightarrow$lom at 1.63\,ms, should be the
fastest according to TCP congestion theory. It is the slowest:
122\,Mbps, with the reverse direction dropping to 58.4\,Mbps in
bidirectional mode. Meanwhile yuki$\rightarrow$luna, whose ICMP~RTT
was 30$\times$ higher, reaches 379\,Mbps
(Figure~\ref{fig:mycelium_paths}). The throughput ranking is the
exact inverse of what the ping data predicts.

The explanation is in the iperf3 logs. Each TCP stream reports a
kernel-measured RTT that is independent of ICMP ping. For the
luna$\rightarrow$lom stream, this TCP~RTT starts at 51.6\,ms and
climbs to a mean of 144\,ms over the 30-second run, with
757~retransmits---the link was clearly overlay-routed during the
throughput test, even though ping had found a direct path eight
minutes earlier. For yuki$\rightarrow$luna the reverse happened: the
TCP stream measured only 12--22\,ms, and its bidirectional return
path recorded 1.0\,ms, a direct LAN connection that the earlier ICMP
test had not seen. The routes changed between the two tests.

Mycelium uses the Babel routing protocol
(Section~\ref{sec:babel}) to discover and select paths. Two
properties of its implementation explain why routes shifted
mid-benchmark. First, Mycelium advertises routes at a five-minute
interval (Listing~\ref{lst:mycelium_constants}):

\lstinputlisting[language=Rust,caption={Mycelium's
Babel timing constants. Routes are re-advertised
every 300\,s; the router will not learn about a new
path until the next cycle.
\textit{mycelium/src/router.rs:33--59}},label={lst:mycelium_constants}]{Listings/mycelium_route_constants.rs}

A direct path that appears between update cycles is invisible to the
router until the next advertisement arrives. The benchmark's ping and
throughput tests ran sequentially with several minutes between them,
so each test observed whichever route happened to be selected at that
point in Babel's five-minute cycle.

Second, even when a better route \emph{is} advertised, the router
resists switching to it. Listing~\ref{lst:mycelium_best_route} shows
the \texttt{find\_best\_route} function: a candidate route is
rejected unless its metric improves on the current route by more than
10, or unless it is directly connected (metric~0). This hysteresis
prevents flapping, but it also means that an overlay path, once
established, can persist for the remainder of the update interval
even after a shorter path becomes available.

\lstinputlisting[language=Rust,caption={Route
selection with hysteresis. Lines~16--25 reject a
candidate route unless it is directly connected or
improves the composite metric by more than
\texttt{SIGNIFICANT\_METRIC\_IMPROVEMENT}\,(10).
\textit{mycelium/src/router.rs:1213--1238}},label={lst:mycelium_best_route}]{Listings/mycelium_find_best_route.rs}
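For readers skimming past the Rust, the selection rule condenses to a
few lines. The constant mirrors the listing, but this sketch is not
Mycelium's actual code:

```python
# Condensed sketch of the hysteresis rule described above.
SIGNIFICANT_METRIC_IMPROVEMENT = 10

def should_switch(current_metric, candidate_metric):
    """Accept a candidate route only if it is directly connected
    (metric 0) or beats the current metric by more than the threshold."""
    if candidate_metric == 0:
        return True
    return current_metric - candidate_metric > SIGNIFICANT_METRIC_IMPROVEMENT

print(should_switch(50, 45))  # False: a 5-point gain is not enough
print(should_switch(50, 0))   # True: directly connected always wins
```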

The five-minute update interval and the switching hysteresis together
explain the throughput asymmetry. The TCP-measured RTTs are
consistent with the observed throughput on every link; only the
ICMP~RTTs, measured minutes earlier under a different routing state,
give the impression of an inversion.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png}
\caption{Per-link TCP throughput for Mycelium. The
luna$\rightarrow$lom link appears slow despite its low ping latency
because Babel had switched to an overlay route by the time the
throughput test ran. The TCP-level RTTs reported by iperf3, not the
earlier ICMP measurements, explain the 3:1 ratio.}
\label{fig:mycelium_paths}
\end{figure}