Compare commits
1 Commits
bbb5c6e886
..
main
| Author | SHA1 | Date | |
|---|---|---|---|
| a3c533b58f |
+299
-306
@@ -6,23 +6,23 @@
|
|||||||
|
|
||||||
This chapter presents the results of the benchmark suite across all
|
This chapter presents the results of the benchmark suite across all
|
||||||
ten VPN implementations and the internal baseline. The structure
|
ten VPN implementations and the internal baseline. The structure
|
||||||
follows the impairment profiles from ideal to degraded:
|
follows the impairment profiles from ideal to degraded.
|
||||||
Section~\ref{sec:baseline} establishes overhead under ideal
|
Section~\ref{sec:baseline} establishes overhead under ideal
|
||||||
conditions, then subsequent sections examine how each VPN responds to
|
conditions; subsequent sections examine how each VPN responds to
|
||||||
increasing network impairment, with source-code excerpts woven in
|
increasing network impairment, with source-code excerpts woven in
|
||||||
where they explain the measured behaviour. A recurring theme is
|
where they explain the measured behaviour. No single metric
|
||||||
that no single metric captures VPN performance; the rankings shift
|
captures VPN performance. The rankings shift depending on what is
|
||||||
depending on whether one measures throughput, latency, retransmit
|
measured: throughput, latency, retransmit behaviour, or
|
||||||
behavior, or real-world application performance.
|
application-level performance.
|
||||||
|
|
||||||
\section{Baseline Performance}
|
\section{Baseline performance}
|
||||||
\label{sec:baseline}
|
\label{sec:baseline}
|
||||||
|
|
||||||
The baseline impairment profile introduces no artificial loss or
|
The baseline impairment profile introduces no artificial loss or
|
||||||
reordering, so any performance gap between VPNs can be attributed to
|
reordering, so any performance gap between VPNs can be attributed to
|
||||||
the VPN itself. Throughout the plots in this section, the
|
the VPN itself. Throughout the plots in this section, the
|
||||||
\emph{internal} bar marks a direct host-to-host connection with no VPN
|
\emph{internal} bar is a direct host-to-host connection with no VPN
|
||||||
in the path; it represents the best the hardware can do. On its own,
|
in the path; it is the best the hardware can do. On its own,
|
||||||
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
||||||
latency of just
|
latency of just
|
||||||
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
|
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
|
||||||
@@ -30,35 +30,29 @@ single retransmit across an entire 30-second test. Mycelium sits at
|
|||||||
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
|
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
|
||||||
bare-metal figure.
|
bare-metal figure.
|
||||||
|
|
||||||
A note on naming: ``Headscale'' in every table and figure of this
|
Throughout this chapter, ``Headscale'' labels the scenario in
|
||||||
chapter labels the test scenario in which the Tailscale client
|
which the Tailscale client (\texttt{tailscaled}, built on
|
||||||
(\texttt{tailscaled}) connects to a self-hosted Headscale control
|
\texttt{wireguard-go}) connects to a self-hosted Headscale
|
||||||
server. The data plane is therefore the Tailscale client built on
|
control server. The data plane is therefore Tailscale's, not
|
||||||
\texttt{wireguard-go}, not the Headscale binary itself, which is
|
Headscale's; the Headscale binary itself is only a control-plane
|
||||||
only a control-plane server. Statements below about ``Headscale''
|
server. Section~\ref{sec:tailscale_degraded} returns to which
|
||||||
running \texttt{wireguard-go} should be read as statements about
|
Tailscale code paths the test rig actually exercises.
|
||||||
the Tailscale client in this scenario.
|
|
||||||
Section~\ref{sec:tailscale_degraded} covers the specifics of how
|
|
||||||
the rig launches \texttt{tailscaled} and which Tailscale code
|
|
||||||
paths that choice activates.
|
|
||||||
|
|
||||||
\subsection{Test Execution Overview}
|
\subsection{Test execution overview}
|
||||||
|
|
||||||
Running the full baseline suite across all ten VPNs and the internal
|
The full baseline suite ran in just over four hours across all ten
|
||||||
reference took just over four hours. Actual benchmark execution
|
VPNs and the internal reference. Benchmark execution consumed
|
||||||
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
|
63\,\% of that time; VPN installation and deployment accounted for
|
||||||
installation and deployment accounted for another 45~minutes
|
19\,\%; the test rig spent 9\,\% waiting for tunnels to come up
|
||||||
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
|
after restarts. Service restarts and traffic-control (tc)
|
||||||
for VPN tunnels to come up after restarts. VPN service restarts and
|
stabilization took the remainder. Figure~\ref{fig:test_duration}
|
||||||
traffic-control (tc) stabilization took the remainder.
|
breaks the time down per VPN.
|
||||||
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
|
||||||
|
|
||||||
Most VPNs completed every benchmark without issues, but four failed
|
Most VPNs completed every benchmark without issue. Four failed
|
||||||
one test each: Nebula and Headscale timed out on the qperf
|
one test each: Nebula and Headscale timed out on the qperf QUIC
|
||||||
QUIC performance benchmark after six retries, while Hyprspace and
|
benchmark after six retries; Hyprspace and Mycelium failed the
|
||||||
Mycelium failed the UDP iPerf3 test
|
UDP iPerf3 test with a 120-second timeout. Their individual
|
||||||
with a 120-second timeout. Their individual success rate is
|
success rate is 85.7\,\%; every other VPN passed the full suite
|
||||||
85.7\,\%, with all other VPNs passing the full suite
|
|
||||||
(Figure~\ref{fig:success_rate}).
|
(Figure~\ref{fig:success_rate}).
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
@@ -88,7 +82,7 @@ with a 120-second timeout. Their individual success rate is
|
|||||||
\label{fig:test_overview}
|
\label{fig:test_overview}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\subsection{TCP Throughput}
|
\subsection{TCP throughput}
|
||||||
|
|
||||||
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
|
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
|
||||||
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
|
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
|
||||||
@@ -144,14 +138,12 @@ Raw throughput alone is incomplete. The retransmit rate
|
|||||||
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
|
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
|
||||||
by estimated packet count, accounting for the different segment sizes
|
by estimated packet count, accounting for the different segment sizes
|
||||||
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
|
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
|
||||||
Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier,
|
Headscale are effectively loss-free ($<$\,0.01\,\%); Hyprspace is
|
||||||
Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%).
|
the clear outlier at 0.49\,\%; the remaining VPNs spread between
|
||||||
Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and
|
these poles. The interesting case is ZeroTier: it sustains
|
||||||
0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. ZeroTier
|
814\,Mbps despite a 0.10\,\% retransmit rate by riding TCP
|
||||||
reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating
|
congestion-control recovery, where WireGuard delivers comparable
|
||||||
for tunnel-internal loss through repeated TCP congestion-control
|
throughput with effectively zero loss.
|
||||||
recovery; WireGuard delivers comparable throughput with effectively
|
|
||||||
zero loss.
|
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -170,7 +162,7 @@ Figure~\ref{fig:tcp_window} shows the raw window sizes, and
|
|||||||
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
|
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
|
||||||
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
|
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
|
||||||
smallest max congestion window in the dataset (200\,KB), while
|
smallest max congestion window in the dataset (200\,KB), while
|
||||||
Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of
|
Yggdrasil's 0.09\,\% rate allows a 4.3\,MB window, the largest of
|
||||||
any VPN. At
|
any VPN. At
|
||||||
first glance this suggests a clean inverse correlation between
|
first glance this suggests a clean inverse correlation between
|
||||||
retransmit rate and congestion window size, but the picture is
|
retransmit rate and congestion window size, but the picture is
|
||||||
@@ -204,17 +196,13 @@ in the dataset. This points to significant in-tunnel packet loss
|
|||||||
or buffering at the VpnCloud layer that the retransmit rate
|
or buffering at the VpnCloud layer that the retransmit rate
|
||||||
(0.06\,\%) alone does not fully explain.
|
(0.06\,\%) alone does not fully explain.
|
||||||
|
|
||||||
Variability, whether stochastic across runs or systematic across
|
Variability across links also differs substantially. WireGuard's
|
||||||
links, also differs substantially. WireGuard's three link
|
three link directions cluster within a 60\,Mbps band; Mycelium's
|
||||||
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
|
span a 3:1 ratio (122--379\,Mbps), but this is per-link
|
||||||
and are nearly indistinguishable. Mycelium's three directions span
|
path-selection asymmetry rather than run-to-run noise
|
||||||
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
|
(Section~\ref{sec:mycelium_routing}). A VPN whose throughput
|
||||||
Section~\ref{sec:mycelium_routing} shows the spread is per-link
|
varies that widely across links is harder to capacity-plan around
|
||||||
path-selection asymmetry, with one link finding a direct route and
|
than one that delivers a consistent figure on every direction.
|
||||||
the other two routing through the global overlay. Either way, a
|
|
||||||
VPN whose throughput varies that widely across links is harder to
|
|
||||||
capacity-plan around than one that delivers a consistent figure
|
|
||||||
on every direction.
|
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -292,16 +280,13 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far
|
|||||||
removed from the rest that Section~\ref{sec:mycelium_routing} gives
|
removed from the rest that Section~\ref{sec:mycelium_routing} gives
|
||||||
it a dedicated analysis.
|
it a dedicated analysis.
|
||||||
|
|
||||||
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two
|
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes
|
||||||
outliers among the low-latency VPNs. VpnCloud leads at
|
two outliers among the low-latency VPNs. VpnCloud and ZeroTier
|
||||||
2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at
|
post the highest spike ratios (2.8$\times$ and 2.3$\times$) and
|
||||||
2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest
|
the highest jitter, while Tinc and Headscale stay close to
|
||||||
jitter in the table (0.25\,ms). Tinc and Headscale, by contrast,
|
bare-metal stability. The spikes in VpnCloud and ZeroTier are
|
||||||
stay below 1.1$\times$ with jitter under 0.09\,ms, so their packet
|
consistent with periodic control-plane events such as key
|
||||||
timing is nearly as stable as bare metal. The spikes in VpnCloud and
|
rotation or peer heartbeats that briefly stall the data path.
|
||||||
ZeroTier are consistent with periodic
|
|
||||||
control-plane work such as key rotation or peer heartbeats that
|
|
||||||
briefly stalls the data path.
|
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -353,7 +338,7 @@ spot.
|
|||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
|
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
|
||||||
\caption{Latency vs.\ throughput at baseline. Each point represents
|
\caption{Latency vs.\ throughput at baseline. Each point is
|
||||||
one VPN. The quadrants reveal different bottleneck types:
|
one VPN. The quadrants reveal different bottleneck types:
|
||||||
VpnCloud (low latency, moderate throughput), Tinc (low latency,
|
VpnCloud (low latency, moderate throughput), Tinc (low latency,
|
||||||
low throughput, CPU-bound), Mycelium (high latency, low
|
low throughput, CPU-bound), Mycelium (high latency, low
|
||||||
@@ -361,7 +346,7 @@ spot.
|
|||||||
\label{fig:latency_throughput}
|
\label{fig:latency_throughput}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\subsection{Parallel TCP Scaling}
|
\subsection{Parallel TCP scaling}
|
||||||
|
|
||||||
The single-stream benchmark tests one link direction at a time.
|
The single-stream benchmark tests one link direction at a time.
|
||||||
The
|
The
|
||||||
@@ -383,7 +368,7 @@ Table~\ref{tab:parallel_scaling} lists the results.
|
|||||||
\centering
|
\centering
|
||||||
\caption{Parallel TCP scaling at baseline. Scaling factor is the
|
\caption{Parallel TCP scaling at baseline. Scaling factor is the
|
||||||
ratio of parallel to single-stream throughput. Internal's
|
ratio of parallel to single-stream throughput. Internal's
|
||||||
1.50$\times$ represents the expected scaling on this hardware.}
|
1.50$\times$ is the expected scaling on this hardware.}
|
||||||
\label{tab:parallel_scaling}
|
\label{tab:parallel_scaling}
|
||||||
\begin{tabular}{lrrr}
|
\begin{tabular}{lrrr}
|
||||||
\hline
|
\hline
|
||||||
@@ -501,7 +486,7 @@ parallel load.
|
|||||||
\label{fig:parallel_tcp}
|
\label{fig:parallel_tcp}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\subsection{UDP Stress Test}
|
\subsection{UDP stress test}
|
||||||
|
|
||||||
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
|
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
|
||||||
which is a deliberate overload test rather than a realistic workload.
|
which is a deliberate overload test rather than a realistic workload.
|
||||||
@@ -568,12 +553,13 @@ complete the UDP test at all; both timed out after 120 seconds.
|
|||||||
% usable payload after tunnel overhead, but conflating it with path
|
% usable payload after tunnel overhead, but conflating it with path
|
||||||
% MTU is misleading. Consider renaming to "effective payload size"
|
% MTU is misleading. Consider renaming to "effective payload size"
|
||||||
% throughout.
|
% throughout.
|
||||||
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
|
The effective UDP payload size, reported in the
|
||||||
payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at
|
\texttt{blksize\_bytes} field, differs sharply across
|
||||||
2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at
|
implementations. Yggdrasil sends 32\,731-byte jumbo segments;
|
||||||
1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the
|
ZeroTier negotiates 2\,728 bytes; the remaining VPNs cluster
|
||||||
smallest). These differences affect fragmentation behavior under real
|
between 1\,208 (Headscale, the smallest) and 1\,448 (Internal).
|
||||||
workloads, particularly for protocols that send large datagrams.
|
These differences affect fragmentation behaviour under workloads
|
||||||
|
that send large datagrams.
|
||||||
|
|
||||||
%TODO: Mention QUIC
|
%TODO: Mention QUIC
|
||||||
%TODO: Mention again that the "default" settings of every VPN have been used
|
%TODO: Mention again that the "default" settings of every VPN have been used
|
||||||
@@ -611,7 +597,7 @@ workloads, particularly for protocols that send large datagrams.
|
|||||||
% TODO: Compare parallel TCP retransmit rate
|
% TODO: Compare parallel TCP retransmit rate
|
||||||
% with single TCP retransmit rate and see what changed
|
% with single TCP retransmit rate and see what changed
|
||||||
|
|
||||||
\subsection{Real-World Workloads}
|
\subsection{Real-world workloads}
|
||||||
|
|
||||||
Saturating a link with iPerf3 measures peak capacity, but not how a
|
Saturating a link with iPerf3 measures peak capacity, but not how a
|
||||||
VPN performs under realistic traffic. This subsection switches to
|
VPN performs under realistic traffic. This subsection switches to
|
||||||
@@ -620,7 +606,7 @@ cache and streaming video over RIST. Both interact with the VPN
|
|||||||
tunnel the way real software does, through many short-lived
|
tunnel the way real software does, through many short-lived
|
||||||
connections, TLS handshakes, and latency-sensitive UDP packets.
|
connections, TLS handshakes, and latency-sensitive UDP packets.
|
||||||
|
|
||||||
\paragraph{Nix Binary Cache Downloads.}
|
\paragraph{Nix binary cache downloads.}
|
||||||
|
|
||||||
This test downloads a fixed set of Nix packages through each VPN and
|
This test downloads a fixed set of Nix packages through each VPN and
|
||||||
measures the total transfer time. The results
|
measures the total transfer time. The results
|
||||||
@@ -705,7 +691,7 @@ Nebula sits just below at 99.8\%, and Hyprspace's headline figure
|
|||||||
of 100\% conceals a separate failure mode discussed below. The
|
of 100\% conceals a separate failure mode discussed below. The
|
||||||
14--16 dropped frames that appear uniformly across every run, including
|
14--16 dropped frames that appear uniformly across every run, including
|
||||||
Internal, are most likely encoder warm-up artefacts rather than
|
Internal, are most likely encoder warm-up artefacts rather than
|
||||||
tunnel overhead, though we have not verified this directly.
|
tunnel overhead, though this has not been verified directly.
|
||||||
|
|
||||||
% TODO: The packet-drop distribution statistics (288 mean,
|
% TODO: The packet-drop distribution statistics (288 mean,
|
||||||
% 10\% median, IQR 255--330) are not shown in any figure.
|
% 10\% median, IQR 255--330) are not shown in any figure.
|
||||||
@@ -756,7 +742,7 @@ overwhelm FEC entirely.
|
|||||||
\label{fig:rist_quality}
|
\label{fig:rist_quality}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\subsection{Operational Resilience}
|
\subsection{Operational resilience}
|
||||||
|
|
||||||
Throughput, latency, and application performance describe how a
|
Throughput, latency, and application performance describe how a
|
||||||
tunnel behaves once it is up. The next question is how quickly it
|
tunnel behaves once it is up. The next question is how quickly it
|
||||||
@@ -767,7 +753,7 @@ reboot matters as much as its peak throughput.
|
|||||||
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
||||||
performer under sustained TCP load, recovers in just 8.7~seconds on
|
performer under sustained TCP load, recovers in just 8.7~seconds on
|
||||||
average, faster than any other VPN. WireGuard and Nebula follow at
|
average, faster than any other VPN. WireGuard and Nebula follow at
|
||||||
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
|
10.1\,s each. Nebula's consistency is striking: 10.06, 10.07,
|
||||||
10.07\,s across its three nodes, an exact match for Nebula's
|
10.07\,s across its three nodes, an exact match for Nebula's
|
||||||
\texttt{HostUpdateNotification} interval, whose default is
|
\texttt{HostUpdateNotification} interval, whose default is
|
||||||
10~seconds in the lighthouse protocol (configurable, but the
|
10~seconds in the lighthouse protocol (configurable, but the
|
||||||
@@ -781,8 +767,8 @@ Section~\ref{sec:mycelium_routing} argues from that uniformity
|
|||||||
that the bound is a fixed timer in the overlay protocol.
|
that the bound is a fixed timer in the overlay protocol.
|
||||||
|
|
||||||
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
||||||
node is back in 7.1~seconds while lom and luna take 94.8 and
|
node is back in 7.1~seconds while lom and luna take 97.3 and
|
||||||
97.3~seconds respectively. Yggdrasil organises its overlay as a
|
94.8~seconds respectively. Yggdrasil organises its overlay as a
|
||||||
distributed spanning tree rooted at the node with the highest public
|
distributed spanning tree rooted at the node with the highest public
|
||||||
key: every other node picks a parent closer to the root and the
|
key: every other node picks a parent closer to the root and the
|
||||||
whole network hangs off that parent chain. The gap likely reflects
|
whole network hangs off that parent chain. The gap likely reflects
|
||||||
@@ -816,14 +802,14 @@ can route traffic.
|
|||||||
\label{fig:reboot_reconnection}
|
\label{fig:reboot_reconnection}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\subsection{Pathological Cases}
|
\subsection{Pathological cases}
|
||||||
\label{sec:pathological}
|
\label{sec:pathological}
|
||||||
|
|
||||||
Three VPNs exhibit behaviors that the aggregate numbers alone cannot
|
Hyprspace, Mycelium, and Tinc each show a pathology that the
|
||||||
explain. The following subsections piece together observations from
|
aggregate tables flatten. The following subsections diagnose each
|
||||||
earlier benchmarks into per-VPN diagnoses.
|
in turn.
|
||||||
|
|
||||||
\paragraph{Hyprspace: Buffer Bloat.}
|
\paragraph{Hyprspace: buffer bloat.}
|
||||||
\label{sec:hyprspace_bloat}
|
\label{sec:hyprspace_bloat}
|
||||||
|
|
||||||
% TODO: The under-load latency of 2,800 ms is not shown in any plot
|
% TODO: The under-load latency of 2,800 ms is not shown in any plot
|
||||||
@@ -841,7 +827,7 @@ The consequences show in every TCP metric. With 4\,965
|
|||||||
retransmits per 30-second test (one in every 200~segments), TCP
|
retransmits per 30-second test (one in every 200~segments), TCP
|
||||||
spends most of its time in congestion recovery rather than
|
spends most of its time in congestion recovery rather than
|
||||||
steady-state transfer. The max congestion window shrinks to
|
steady-state transfer. The max congestion window shrinks to
|
||||||
205\,KB, the smallest in the dataset. Under parallel load the
|
200\,KB, the smallest in the dataset. Under parallel load the
|
||||||
situation worsens: retransmits climb to 17\,426. % TODO: The
|
situation worsens: retransmits climb to 17\,426. % TODO: The
|
||||||
% explanation for the sender/receiver inversion (ACK delays
|
% explanation for the sender/receiver inversion (ACK delays
|
||||||
% causing sender-side timer undercounting) is a hypothesis. Normally
|
% causing sender-side timer undercounting) is a hypothesis. Normally
|
||||||
@@ -853,12 +839,9 @@ while the sender sees only 367.9\,Mbps, likely because massive ACK delays
|
|||||||
cause the sender-side timer to undercount the actual data rate. The
|
cause the sender-side timer to undercount the actual data rate. The
|
||||||
UDP test never finished at all; it timed out at 120~seconds.
|
UDP test never finished at all; it timed out at 120~seconds.
|
||||||
|
|
||||||
% Should we always use percentages for retransmits?
|
Outside sustained load, Hyprspace looks fine. It has the fastest
|
||||||
|
reboot reconnection in the dataset (8.7\,s) and delivers 100\,\%
|
||||||
What prevents Hyprspace from being entirely unusable is everything
|
video quality between burst events. The pathology is narrow but
|
||||||
\emph{except} sustained load. It has the fastest reboot
|
|
||||||
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
|
|
||||||
quality outside of its burst events. The pathology is narrow but
|
|
||||||
severe: any continuous data stream saturates the tunnel's internal
|
severe: any continuous data stream saturates the tunnel's internal
|
||||||
buffers.
|
buffers.
|
||||||
|
|
||||||
@@ -910,33 +893,35 @@ stack but the benchmark traffic never reaches it; that case is
|
|||||||
the subject of Section~\ref{sec:tailscale_degraded}.
|
the subject of Section~\ref{sec:tailscale_degraded}.
|
||||||
|
|
||||||
If gVisor is out of scope, the buffer bloat must originate
|
If gVisor is out of scope, the buffer bloat must originate
|
||||||
further up the Hyprspace stack instead. Hyprspace uses
|
further up the Hyprspace stack. Hyprspace uses \texttt{libp2p},
|
||||||
\texttt{libp2p}, a peer-to-peer networking library, and its
|
a peer-to-peer networking library, and its \texttt{yamux} stream
|
||||||
\texttt{yamux} stream multiplexer, which runs many logical streams
|
multiplexer. Yamux runs many logical streams over a single
|
||||||
over a single underlying connection and polices each one with a
|
underlying connection and polices each one with a credit-based
|
||||||
credit-based flow-control window. The most plausible source of
|
flow-control window. The most plausible source of the bloat is
|
||||||
the bloat is this libp2p/yamux layer, through which raw IP packets
|
this libp2p/yamux layer, through which raw IP packets are
|
||||||
are funnelled. Hyprspace's TUN-read loop dispatches
|
funnelled.
|
||||||
each outbound packet on its own goroutine, and every such
|
|
||||||
goroutine ends up in \texttt{node/node.go}'s
|
Hyprspace's TUN-read loop dispatches each outbound packet on its
|
||||||
\texttt{sendPacket}, which keeps exactly one libp2p stream per
|
own goroutine, and every such goroutine ends up in
|
||||||
destination peer in \texttt{activeStreams} and guards it with a
|
\texttt{node/node.go}'s \texttt{sendPacket}. This function keeps
|
||||||
single per-peer \texttt{sync.Mutex}
|
exactly one libp2p stream per destination peer in
|
||||||
(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent
|
\texttt{activeStreams}, guarded by a single per-peer
|
||||||
application TCP flows to the same Hyprspace neighbour therefore
|
\texttt{sync.Mutex} (Listing~\ref{lst:hyprspace_sendpacket}).
|
||||||
|
Concurrent application TCP flows to the same Hyprspace neighbour
|
||||||
serialise behind that one lock: the parallel iPerf3 test, which
|
serialise behind that one lock: the parallel iPerf3 test, which
|
||||||
opens multiple TCP connections to the same peer at once,
|
opens multiple TCP connections to the same peer at once, collapses
|
||||||
collapses to a single send pipeline at this layer. Each
|
to a single send pipeline at this layer. Each goroutine waiting
|
||||||
goroutine waiting for the lock pins its own 1420-byte packet
|
for the lock pins its own 1420-byte packet buffer, and the
|
||||||
buffer, and the underlying yamux session adds a per-stream
|
underlying yamux session adds a per-stream flow-control window on
|
||||||
flow-control window on top. None of this is visible to the
|
top.
|
||||||
kernel TCP sender that produced the inner segments: the kernel
|
|
||||||
sees only that the TUN write returned, so it keeps growing its
|
None of this is visible to the kernel TCP sender that produced
|
||||||
congestion window while the libp2p layer falls further behind. The
|
the inner segments. The kernel sees only that the TUN write
|
||||||
geometry is the textbook one for buffer bloat: a
|
returned, so it keeps growing its congestion window while the
|
||||||
fast producer (kernel TCP) sitting upstream of a slow,
|
libp2p layer falls further behind. The geometry is the standard
|
||||||
serialised consumer (the single yamux stream per peer) with
|
shape of buffer bloat: a fast producer (kernel TCP) sitting
|
||||||
no flow-control signal coupling the two.
|
upstream of a slow, serialised consumer (the single yamux stream
|
||||||
|
per peer), with no flow-control signal coupling the two.
|
||||||
|
|
||||||
\lstinputlisting[language=Go,caption={Hyprspace's outbound
|
\lstinputlisting[language=Go,caption={Hyprspace's outbound
|
||||||
fast path keeps exactly one libp2p stream per destination peer
|
fast path keeps exactly one libp2p stream per destination peer
|
||||||
@@ -987,9 +972,9 @@ reports a kernel-measured RTT that is independent of
|
|||||||
ICMP ping. For the luna$\rightarrow$lom stream, this
|
ICMP ping. For the luna$\rightarrow$lom stream, this
|
||||||
TCP~RTT starts at 51.6\,ms and climbs to a mean of
|
TCP~RTT starts at 51.6\,ms and climbs to a mean of
|
||||||
144\,ms over the 30-second run, with
|
144\,ms over the 30-second run, with
|
||||||
757~retransmits---the link was clearly overlay-routed
|
757~retransmits. The link was clearly overlay-routed during
|
||||||
during the throughput test, even though ping had found a
|
the throughput test, even though ping had found a direct path
|
||||||
direct path eight minutes earlier. For
|
eight minutes earlier. For
|
||||||
yuki$\rightarrow$luna the reverse happened: the TCP
|
yuki$\rightarrow$luna the reverse happened: the TCP
|
||||||
stream measured only 12--22\,ms, and its bidirectional
|
stream measured only 12--22\,ms, and its bidirectional
|
||||||
return path recorded 1.0\,ms, a direct LAN connection
|
return path recorded 1.0\,ms, a direct LAN connection
|
||||||
@@ -1057,17 +1042,14 @@ inversion.
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
|
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
|
||||||
% (47.3 ms) numbers are from qperf but not shown in any figure.
|
% (47.3 ms vs.\ 16.0 ms) numbers are from qperf but not shown in
|
||||||
% Add a connection-setup latency table or plot. Also
|
% any figure. Add a connection-setup latency table or plot.
|
||||||
% clarify what
|
|
||||||
% Internal's connection establishment time is (47.3 /
|
|
||||||
% 3 = 15.8 ms?)
|
|
||||||
% so the "3× overhead" can be verified.
|
|
||||||
The overlay penalty shows up most clearly at connection setup.
|
The overlay penalty shows up most clearly at connection setup.
|
||||||
Mycelium's average time-to-first-byte is 93.7\,ms
|
Mycelium's average time-to-first-byte is 93.7\,ms
|
||||||
(vs.\ Internal's
|
(vs.\ Internal's
|
||||||
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
|
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
|
||||||
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
|
alone costs 47.3\,ms against Internal's 16.0\,ms (a
|
||||||
|
2.95$\times$ overhead). Every new connection
|
||||||
incurs that overhead, so workloads dominated by
|
incurs that overhead, so workloads dominated by
|
||||||
short-lived connections accumulate it rapidly. Bulk
|
short-lived connections accumulate it rapidly. Bulk
|
||||||
downloads, by
|
downloads, by
|
||||||
@@ -1094,13 +1076,13 @@ delay.
|
|||||||
The UDP test timed out at 120~seconds, and even first-time
|
The UDP test timed out at 120~seconds, and even first-time
|
||||||
connectivity required a 70-second wait at startup.
|
connectivity required a 70-second wait at startup.
|
||||||
|
|
||||||
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
\paragraph{Tinc: userspace processing bottleneck.}
|
||||||
|
|
||||||
The latency subsection already traced Tinc's 336\,Mbps ceiling to
|
The latency subsection already traced Tinc's 336\,Mbps ceiling to
|
||||||
single-core CPU exhaustion. The usual network suspects do not
|
single-core CPU exhaustion. The usual network suspects do not
|
||||||
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
|
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
|
||||||
effective UDP payload size (1\,353 bytes) and its retransmit count
|
effective UDP payload size (1\,353 bytes) and its retransmit count
|
||||||
(240) are in the normal range. That leaves CPU: 14.9\,\%
|
(240) are in the normal range. That leaves CPU: 12.3\,\%
|
||||||
whole-system utilization is what one saturated core looks like on
|
whole-system utilization is what one saturated core looks like on
|
||||||
a multi-core host, which fits a single-threaded userspace VPN.
|
a multi-core host, which fits a single-threaded userspace VPN.
|
||||||
The parallel benchmark confirms the diagnosis. Tinc scales to
|
The parallel benchmark confirms the diagnosis. Tinc scales to
|
||||||
@@ -1110,7 +1092,7 @@ the gaps a single flow would leave idle, and the extra work
|
|||||||
translates directly into extra throughput.
|
translates directly into extra throughput.
|
||||||
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
|
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
|
||||||
% unresolved CPU-profiling TODO from the latency subsection
|
% unresolved CPU-profiling TODO from the latency subsection
|
||||||
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
|
% (VpnCloud's similar 14.2\% at 539\,Mbps). If per-thread
|
||||||
% profiling refutes the single-core story, this paragraph must
|
% profiling refutes the single-core story, this paragraph must
|
||||||
% be revisited as well.
|
% be revisited as well.
|
||||||
|
|
||||||
@@ -1121,13 +1103,12 @@ Baseline benchmarks rank VPNs by overhead under ideal
|
|||||||
conditions. The impairment profiles in
|
conditions. The impairment profiles in
|
||||||
Table~\ref{tab:impairment_profiles} test a different property:
|
Table~\ref{tab:impairment_profiles} test a different property:
|
||||||
resilience. Each profile applies symmetric \texttt{tc netem}
|
resilience. Each profile applies symmetric \texttt{tc netem}
|
||||||
impairment to every machine. Low adds roughly 2\,ms of delay and
|
impairment to every machine. Low adds 2\,ms of delay and
|
||||||
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
|
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds 4\,ms
|
||||||
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
|
of delay and 1\,\% loss with 2.5\,\% reordering; High adds 6\,ms
|
||||||
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
|
of delay and 2.5\,\% loss with 5\,\% reordering. Medium and High
|
||||||
reordering. Medium and High both use 50\,\% correlation, so
|
both use 50\,\% correlation, so losses and reorderings are bursty
|
||||||
losses and reorderings are bursty rather than uniform. Two
|
rather than uniform. Two results dominate the data.
|
||||||
results dominate the data.
|
|
||||||
% TODO: Double-check these per-profile parameters against the
|
% TODO: Double-check these per-profile parameters against the
|
||||||
% canonical impairment-profile definitions in the earlier chapter
|
% canonical impairment-profile definitions in the earlier chapter
|
||||||
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
|
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
|
||||||
@@ -1150,9 +1131,9 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
|
|||||||
through what turns out to be the wrong hypothesis. The
|
through what turns out to be the wrong hypothesis. The
|
||||||
investigation begins with Tailscale's much-discussed gVisor TCP
|
investigation begins with Tailscale's much-discussed gVisor TCP
|
||||||
stack, validates the candidate parameters in isolation on the
|
stack, validates the candidate parameters in isolation on the
|
||||||
bare-metal host, and only then discovers, by reading the rig's
|
bare-metal host, and only then discovers, by reading the test
|
||||||
own NixOS module, that the gVisor stack is not actually in the
|
rig's own NixOS module, that the gVisor stack is not actually in
|
||||||
data path of the benchmark at all. The real culprit is a
|
the data path of the benchmark at all. The real culprit is a
|
||||||
combination of the Linux kernel's tight default
|
combination of the Linux kernel's tight default
|
||||||
\texttt{tcp\_reordering} threshold and the way
|
\texttt{tcp\_reordering} threshold and the way
|
||||||
\texttt{wireguard-go}
|
\texttt{wireguard-go}
|
||||||
@@ -1268,7 +1249,7 @@ RTT standard deviation reaches 44.6\,ms at High, the
|
|||||||
worst jitter
|
worst jitter
|
||||||
of any VPN. A userspace retry mechanism is the
|
of any VPN. A userspace retry mechanism is the
|
||||||
likely cause, but
|
likely cause, but
|
||||||
without source-code evidence we cannot say so with certainty.
|
without source-code evidence this cannot be confirmed.
|
||||||
|
|
||||||
% TODO: Ping packet loss data is not shown in any plot. The 1/9
|
% TODO: Ping packet loss data is not shown in any plot. The 1/9
|
||||||
% = 11.1\% interpretation is clever but depends on
|
% = 11.1\% interpretation is clever but depends on
|
||||||
@@ -1342,9 +1323,9 @@ the first step (Table~\ref{tab:tcp_impairment}).
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low
|
Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low
|
||||||
impairment, a
|
impairment, a 98.3\% loss. The Low profile injects only modest
|
||||||
98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter,
|
impairment per machine: 2\,ms of latency, 2\,ms of jitter, 0.25\%
|
||||||
0.25\% packet loss, and 0.5\% reordering per machine.
|
loss, and 0.5\% reordering.
|
||||||
Even Mycelium,
|
Even Mycelium,
|
||||||
the slowest VPN at baseline (259\,Mbps), retains more
|
the slowest VPN at baseline (259\,Mbps), retains more
|
||||||
throughput at
|
throughput at
|
||||||
@@ -1427,38 +1408,28 @@ emerges from the runs that did complete.
|
|||||||
% explanation (e.g., iPerf3 crash, tc interaction,
|
% explanation (e.g., iPerf3 crash, tc interaction,
|
||||||
% timing issue).
|
% timing issue).
|
||||||
Three implementations maintain throughput at the profiles where
|
Three implementations maintain throughput at the profiles where
|
||||||
data exists. Internal holds ${\sim}$950\,Mbps at
|
data exists: Internal, WireGuard, and Headscale all sustain
|
||||||
Baseline, Medium,
|
several hundred Mbps where they complete (see
|
||||||
and High; WireGuard sustains 850--898\,Mbps; and
|
Figure~\ref{fig:udp_impairment_heatmap}). Internal and WireGuard
|
||||||
Headscale sustains
|
ride the host kernel's transport-layer backpressure (Internal
|
||||||
700--876\,Mbps. % TODO: verify WireGuard UDP range --
|
directly, WireGuard via the in-kernel WireGuard module).
|
||||||
% analysis doc says 850-898, possible digit transposition
|
Headscale takes a different route to the same outcome. Its
|
||||||
Internal and WireGuard ride the host kernel's transport-layer
|
\texttt{magicsock} layer is incompatible with the kernel
|
||||||
backpressure (Internal directly, WireGuard via the in-kernel
|
WireGuard datapath, so \texttt{wireguard-go} runs in userspace
|
||||||
WireGuard module). Headscale, by contrast, never
|
and leans on three host-kernel offloads to absorb a
|
||||||
uses the kernel
|
\texttt{-b~0} sender flood: batched UDP I/O
|
||||||
module even though it builds on the WireGuard protocol: as
|
(\texttt{recvmmsg} / \texttt{sendmmsg}), UDP
|
||||||
established in Section~\ref{sec:baseline}, Tailscale's
|
segmentation/aggregation offload (\texttt{UDP\_SEGMENT} /
|
||||||
\texttt{magicsock} layer intercepts every packet for endpoint
|
\texttt{UDP\_GRO}) on the outer WireGuard socket, and a 7\,MiB
|
||||||
selection, DERP relay, and the disco protocol, and that
|
socket buffer on that same socket. Section~\ref{sec:tailscale_degraded}
|
||||||
interception is incompatible with the kernel WireGuard datapath.
|
returns to these mechanisms when they reappear as the explanation
|
||||||
Headscale therefore runs \texttt{wireguard-go} in userspace and
|
for Headscale's TCP behaviour under reordering.
|
||||||
compensates with UDP batching
|
|
||||||
(\texttt{recvmmsg}/\texttt{sendmmsg}),
|
Userspace VPNs without that engineering collapse. EasyTier
|
||||||
host-kernel UDP segmentation/aggregation offload
|
walks down 865, 435, 38.5, 6.1\,Mbps across the four profiles.
|
||||||
(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer
|
Yggdrasil, already pathological at baseline (98.7\,\% loss),
|
||||||
WireGuard socket), and a 7\,MB socket buffer on the same outer
|
drops to 12.3\,Mbps at Low and fails entirely at Medium and
|
||||||
socket. These offloads live in the host kernel; gVisor netstack
|
High.
|
||||||
itself implements no UDP GSO or UDP GRO of its own.
|
|
||||||
Together they
|
|
||||||
absorb a \texttt{-b 0} sender flood without
|
|
||||||
collapsing. Userspace
|
|
||||||
VPNs without the same engineering do collapse:
|
|
||||||
EasyTier drops from
|
|
||||||
865 to 435 to 38.5 to 6.1\,Mbps across successive profiles.
|
|
||||||
Yggdrasil, already pathological at baseline (98.7\%
|
|
||||||
loss), crashes
|
|
||||||
to 12.3\,Mbps at Low and fails entirely at Medium and High.
|
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -1576,10 +1547,9 @@ concurrent streams benefits from it independently.
|
|||||||
EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
|
EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
|
||||||
51\% of its baseline. Headscale and EasyTier are the only VPNs
|
51\% of its baseline. Headscale and EasyTier are the only VPNs
|
||||||
that retain more than half their baseline parallel throughput at
|
that retain more than half their baseline parallel throughput at
|
||||||
Low impairment; no other implementation exceeds 30\%.
|
Low impairment; no other implementation exceeds 30\%. EasyTier's
|
||||||
We have no
|
resilience has no direct architectural explanation in this work,
|
||||||
direct architectural explanation for EasyTier's resilience and
|
and none is claimed here.
|
||||||
do not claim one here.
|
|
||||||
|
|
||||||
Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at
|
Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at
|
||||||
Low, a 99.6\%
|
Low, a 99.6\%
|
||||||
@@ -1590,7 +1560,7 @@ loss. % TODO: DOWNSTREAM DEPENDENCY — This
|
|||||||
% under-load latency. If that diagnosis is revised,
|
% under-load latency. If that diagnosis is revised,
|
||||||
% this explanation
|
% this explanation
|
||||||
% for parallel collapse must also be revisited.
|
% for parallel collapse must also be revisited.
|
||||||
The buffer bloat that already plagues single-stream transfers
|
The buffer bloat that already constrains single-stream transfers
|
||||||
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
|
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
|
||||||
flows compete for the same bloated buffers at once.
|
flows compete for the same bloated buffers at once.
|
||||||
|
|
||||||
@@ -1799,33 +1769,26 @@ at Low, completes here in 170\,s. At High, only Headscale,
|
|||||||
Nebula, and Tinc survive. Internal's failure at High is the
|
Nebula, and Tinc survive. Internal's failure at High is the
|
||||||
surprising one: the bare-metal baseline cannot sustain a
|
surprising one: the bare-metal baseline cannot sustain a
|
||||||
multi-connection HTTP workload under severe degradation, while
|
multi-connection HTTP workload under severe degradation, while
|
||||||
Headscale's userspace TCP stack pulls it through.
|
Headscale completes in 219\,s.
|
||||||
Section~\ref{sec:tailscale_degraded} explains why.
|
Section~\ref{sec:tailscale_degraded} traces the mechanism.
|
||||||
|
|
||||||
\section{Tailscale under degraded conditions}
|
\section{Tailscale under degraded conditions}
|
||||||
\label{sec:tailscale_degraded}
|
\label{sec:tailscale_degraded}
|
||||||
|
|
||||||
% TODO: Editorial pass needed on two chapter-wide issues before
|
% TODO: The magicsock / wireguard-go userspace-datapath
|
||||||
% submission:
|
% explanation is repeated three times in slightly different forms
|
||||||
% (1) magicsock / wireguard-go userspace-datapath explanation is
|
% (once in baseline UDP, once in impairment UDP, once here).
|
||||||
% repeated three times in slightly different forms (once in
|
% Consider introducing it once in full here, where it is
|
||||||
% baseline UDP, once in impairment UDP, once here). Consider
|
% load-bearing, and replacing the earlier occurrences with
|
||||||
% introducing it once in full here, where it is load-bearing,
|
% one-sentence forward references.
|
||||||
% and replacing the earlier occurrences with one-sentence
|
|
||||||
% forward references.
|
|
||||||
% (2) This section uses first-person plural ("we pursued", "we
|
|
||||||
% worked it out", "we ran two follow-up benchmarks") while
|
|
||||||
% the rest of the chapter is in impersonal voice. Either
|
|
||||||
% harmonise everything to one voice, or explicitly frame this
|
|
||||||
% section as a first-person narrative detour.
|
|
||||||
|
|
||||||
This section is about an observation that should not exist:
|
Headscale, a tunnelling VPN built on \texttt{wireguard-go}, beats
|
||||||
Headscale, a tunnelling VPN built on a kernel TCP stack and
|
the bare-metal Internal baseline at Medium impairment. Under
|
||||||
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
|
parallel load at Low impairment it beats Internal by a factor of
|
||||||
Medium impairment, and at Low impairment under parallel load
|
2.6. A VPN should not outperform the direct connection it tunnels
|
||||||
beats it by a factor of 2.6. The short answer turns out to be
|
through, and the explanation took some chasing. The obvious
|
||||||
different from the obvious answer, and we worked it out only by
|
hypothesis was wrong, and pursuing it to its end was the only way
|
||||||
chasing the obvious answer to its end.
|
to find out.
|
||||||
|
|
||||||
\subsection{An anomaly worth pursuing}
|
\subsection{An anomaly worth pursuing}
|
||||||
|
|
||||||
@@ -1877,16 +1840,15 @@ comparison.
|
|||||||
\label{fig:headscale_vs_internal}
|
\label{fig:headscale_vs_internal}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
WireGuard-the-kernel-module is the obvious sanity
|
The in-kernel WireGuard module is the obvious sanity check. It
|
||||||
check. It uses
|
uses the same Noise/WireGuard cryptographic protocol that
|
||||||
the same Noise/WireGuard cryptographic protocol Tailscale ships
|
Tailscale embeds, and it is the closest available comparison
|
||||||
and is the closest available comparison without the rest of
|
without the rest of Tailscale's stack. Kernel WireGuard shows
|
||||||
Tailscale's stack. WireGuard shows none of Headscale's
|
none of Headscale's advantage: 54.7\,Mbps at Low and 8.77\,Mbps at
|
||||||
advantage: 54.7\,Mbps at Low and 8.77\,Mbps at Medium, both well
|
Medium, both well below Internal at the same profile. The
|
||||||
below Internal at the same profile. So the encryption layer is
|
encryption layer is not the answer. Neither is the basic UDP
|
||||||
not the answer, and the basic UDP tunnel is not the answer.
|
tunnel. Whatever Headscale is doing lives somewhere else in
|
||||||
Whatever Headscale is doing differently lives somewhere else in
|
Tailscale's implementation.
|
||||||
the rest of Tailscale's implementation.
|
|
||||||
|
|
||||||
% TODO: The Medium-impairment retransmit percentages (5.2\%,
|
% TODO: The Medium-impairment retransmit percentages (5.2\%,
|
||||||
% 2.4\%) are not in any table or figure. Add a retransmit
|
% 2.4\%) are not in any table or figure. Add a retransmit
|
||||||
@@ -1906,9 +1868,9 @@ not.
|
|||||||
|
|
||||||
\subsection{A plausible villain: Tailscale's gVisor stack}
|
\subsection{A plausible villain: Tailscale's gVisor stack}
|
||||||
|
|
||||||
The candidate explanation we pursued first, and the one any
|
The first candidate was Tailscale's userspace TCP/IP stack:
|
||||||
reading of the upstream Tailscale documentation will lead to,
|
the answer any reading of the upstream Tailscale documentation
|
||||||
is Tailscale's userspace TCP/IP stack. The Tailscale client
|
points to. The Tailscale client
|
||||||
imports Google's gVisor netstack
|
imports Google's gVisor netstack
|
||||||
(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses
|
(\texttt{gvisor.dev/gvisor/pkg/tcpip}) as a Go library and uses
|
||||||
it as an in-process TCP implementation. The gVisor
|
it as an in-process TCP implementation. The gVisor
|
||||||
@@ -1948,9 +1910,9 @@ reordering link than the host kernel. The hypothesis follows
|
|||||||
directly: Headscale's iPerf3 traffic
|
directly: Headscale's iPerf3 traffic
|
||||||
runs through this gVisor instance instead of through the host
|
runs through this gVisor instance instead of through the host
|
||||||
kernel TCP stack, and so it inherits the more
|
kernel TCP stack, and so it inherits the more
|
||||||
reordering-tolerant behaviour. WireGuard-the-kernel-module
|
reordering-tolerant behaviour. Kernel WireGuard shares only
|
||||||
shares only the cryptographic protocol; it does not include
|
the cryptographic protocol; it does not include the gVisor
|
||||||
the gVisor stack, and therefore does not get the advantage.
|
stack, and therefore does not get the advantage.
|
||||||
|
|
||||||
The natural way to test this is to extract
|
The natural way to test this is to extract
|
||||||
the parameters Tailscale sets inside gVisor, apply their
|
the parameters Tailscale sets inside gVisor, apply their
|
||||||
@@ -1962,8 +1924,8 @@ supported. If it does not, the hypothesis fails.
|
|||||||
\subsection{Reproducing the effect on bare metal}
|
\subsection{Reproducing the effect on bare metal}
|
||||||
\label{sec:tuned}
|
\label{sec:tuned}
|
||||||
|
|
||||||
We ran two follow-up benchmarks on the same hardware and
|
Two follow-up benchmarks ran on the same hardware and impairment
|
||||||
impairment setup as the original 18.12.2025 run.
|
setup as the original 18.12.2025 run.
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\bitem{Tailscale-style (27.02.2026):}
|
\bitem{Tailscale-style (27.02.2026):}
|
||||||
@@ -2023,43 +1985,33 @@ impairment setup as the original 18.12.2025 run.
|
|||||||
\label{fig:kernel_tuning_comparison}
|
\label{fig:kernel_tuning_comparison}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
The result felt like confirmation. Internal's
|
The result felt like confirmation. Three sysctls raised
|
||||||
Medium-impairment throughput jumped from 29.6\,Mbps to
|
Internal's Medium-impairment throughput by 146\,\% and halved its
|
||||||
72.7\,Mbps under the reorder-only configuration, a 146\,\%
|
Nix cache download time
|
||||||
increase from a three-line sysctl change, and the retransmit
|
(Table~\ref{tab:kernel_tuning_internal}). The retransmit rate at
|
||||||
rate at Medium dropped from ${\sim}$2.4\,\% to
|
Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which means
|
||||||
1.11\,\%, which
|
more than half of the original retransmissions were spurious.
|
||||||
means more than half of the original retransmissions were
|
|
||||||
spurious. The Nix cache download at Medium roughly halved,
|
|
||||||
from 58.6\,s to 29.1\,s.
|
|
||||||
|
|
||||||
Parallel TCP gained even more. Internal at Low climbed from
|
Parallel TCP gained even more. Internal at Low climbed from
|
||||||
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
|
277\,Mbps to 902\,Mbps, a 226\,\% increase that exceeded
|
||||||
old single-stream best and overtakes Headscale's original
|
Internal's old single-stream best and overtook Headscale's
|
||||||
718\,Mbps from the unmodified run. %
|
original 718\,Mbps from the unmodified run.
|
||||||
% TODO: DOWNSTREAM
|
% TODO: DOWNSTREAM DEPENDENCY -- "six concurrent flows"
|
||||||
% DEPENDENCY — "six concurrent flows" inherits
|
% inherits the unresolved 6-vs-10 stream count from the baseline
|
||||||
% the unresolved
|
% parallel test description. Update when that TODO is resolved.
|
||||||
% 6-vs-10 stream count from the baseline parallel test
|
Each of the six concurrent flows benefits independently from the
|
||||||
% description. Update when that TODO is resolved.
|
higher reordering threshold, and the gains compound.
|
||||||
Each of the six concurrent flows benefits independently from
|
|
||||||
the higher reordering threshold, and the gains compound.
|
|
||||||
|
|
||||||
% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
|
% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are
|
||||||
% not in any table. Add a table showing Headscale's results
|
% not in any table. Add a table showing Headscale's results
|
||||||
% from the follow-up runs alongside Internal's so
|
% from the follow-up runs alongside Internal's so readers can
|
||||||
% readers can
|
|
||||||
% verify the reversal.
|
% verify the reversal.
|
||||||
Headscale itself, retested with the same sysctls,
|
Headscale, retested with the same sysctls, gained more modestly:
|
||||||
gained more
|
+21\,\% at Medium and a small $-$5\,\% wobble at Low. And the
|
||||||
modestly: +21\,\% at Medium and a small $-$5\,\% wobble at
|
anomaly reversed entirely
|
||||||
Low. And the anomaly reversed entirely. At Medium, tuned
|
(Figure~\ref{fig:headscale_gap_reversal}). Tuned Internal now
|
||||||
Internal reached 72.7\,Mbps against Headscale's 50.1\,Mbps —
|
leads Headscale at Medium across every metric, where the
|
||||||
a 45\,\% lead for Internal where the original run
|
original run had Headscale ahead.
|
||||||
had Headscale
|
|
||||||
40\,\% ahead. The Nix cache flipped the same way: Internal
|
|
||||||
completed in 29.1\,s against Headscale's 36.3\,s, where the
|
|
||||||
original had Headscale 17\,\% faster.
|
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -2088,11 +2040,11 @@ collapsed to three host-kernel sysctls:
|
|||||||
\texttt{tcp\_early\_retrans}.
|
\texttt{tcp\_early\_retrans}.
|
||||||
|
|
||||||
At this point in the investigation the hypothesis seemed
|
At this point in the investigation the hypothesis seemed
|
||||||
settled. Tailscale's gVisor stack ships with
|
settled. Tailscale's gVisor stack applies these overrides;
|
||||||
these overrides;
|
the bare-metal kernel uses stricter defaults; matching
|
||||||
the bare-metal kernel ships with stricter defaults; matching
|
the kernel to gVisor reproduces the effect. The remaining
|
||||||
the kernel to gVisor reproduces the effect. Then we checked
|
question was which Tailscale code path the test rig was actually
|
||||||
which Tailscale code path the test rig was actually running.
|
running.
|
||||||
|
|
||||||
\subsection{The data path that was not there}
|
\subsection{The data path that was not there}
|
||||||
\label{sec:gvisor_not_in_path}
|
\label{sec:gvisor_not_in_path}
|
||||||
@@ -2168,9 +2120,9 @@ the gVisor TCP business at all.
|
|||||||
The puzzle the investigation began with has not gone away.
|
The puzzle the investigation began with has not gone away.
|
||||||
Headscale starts at 41.5\,Mbps where Internal starts at
|
Headscale starts at 41.5\,Mbps where Internal starts at
|
||||||
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
||||||
TCP stack. Whatever Headscale is doing (partially, weakly, but
|
TCP stack. Whichever mechanism Headscale relies on (partially,
|
||||||
reproducibly) is worth roughly twelve megabits per second on the
|
weakly, but reproducibly) is worth roughly twelve megabits per
|
||||||
Medium profile, and it is not gVisor netstack.
|
second on the Medium profile, and it is not gVisor netstack.
|
||||||
|
|
||||||
The +21\,\% sysctl gain for Headscale itself is also informative
|
The +21\,\% sysctl gain for Headscale itself is also informative
|
||||||
about the size of the mechanism. If the gain were 0\,\%,
|
about the size of the mechanism. If the gain were 0\,\%,
|
||||||
@@ -2182,7 +2134,7 @@ that the two effects are not fully additive.
|
|||||||
|
|
||||||
Two features of the \texttt{wireguard-go} data-plane pipeline are
|
Two features of the \texttt{wireguard-go} data-plane pipeline are
|
||||||
the most likely candidates, and both live on the kernel-TUN path
|
the most likely candidates, and both live on the kernel-TUN path
|
||||||
that Tailscale actually uses in the rig.
|
that Tailscale actually uses in the test rig.
|
||||||
|
|
||||||
The first is TUN TCP and UDP generic receive offload. Tailscale's
|
The first is TUN TCP and UDP generic receive offload. Tailscale's
|
||||||
\texttt{tstun} wrapper enables both on the kernel TUN device on
|
\texttt{tstun} wrapper enables both on the kernel TUN device on
|
||||||
@@ -2248,32 +2200,38 @@ Hyprspace cannot be used as a negative control for any of this.
It does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ in the
\texttt{wireguard-go} pipeline (TUN GRO and the 7\,MiB outer-UDP
buffer), not in whether gVisor handles their inner TCP. The
gVisor angle simply does not apply to either of them in this
benchmark.

The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks
expose:

\begin{description}
\item[\texttt{tcp\_reordering} (default 3)] The number of
  out-of-order segments the kernel tolerates before declaring
  fast retransmit. With \texttt{tc~netem} injecting 0.5--2.5\,\%
  reordering per machine, bursts of several reordered packets
  repeatedly trip this threshold on the bare-metal path.
\item[\texttt{tcp\_recovery} (default \texttt{1}, RACK enabled)]
  Adds time-based reordering detection on top of the
  segment-count threshold, which amplifies spurious retransmits
  when reordering is high.
\item[\texttt{tcp\_early\_retrans} (default \texttt{3}, TLP
  enabled)] Fires speculative retransmits when unacknowledged
  segments sit at the tail of the transmission window, which
  interacts poorly with an already-impaired link.
\end{description}

\noindent
Loosening any one softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it a GRO-coalesced, buffer-cushioned
stream, so the kernel's tight defaults fire less often there to
begin with.
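
Concretely, the loosening amounts to three \texttt{sysctl}
overrides. The values below are an illustrative sketch, not the
exact settings from the tuned run reported above: \texttt{0}
simply disables RACK and early retransmit outright, the bluntest
form of loosening, and the reordering threshold of 10 is an
arbitrary roomier value.

\begin{verbatim}
# Illustrative relaxation of the three loss-detection knobs.
# Values are examples, not the tuned settings from this chapter.
sysctl -w net.ipv4.tcp_reordering=10    # tolerate longer reorder bursts
sysctl -w net.ipv4.tcp_recovery=0       # disable RACK time-based detection
sysctl -w net.ipv4.tcp_early_retrans=0  # disable early retransmit and TLP
\end{verbatim}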

The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
@@ -2318,31 +2276,66 @@ any Linux host and entirely independent of any VPN.
The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage initially attributed to it comes from a more ordinary
place: the way \texttt{wireguard-go} batches and coalesces packets
between the wire and the kernel TCP stack, and the larger UDP
buffer it pins on its outer socket. The experiment was chasing
the wrong hypothesis, but the experiment turned out to be more
useful than the hypothesis.
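
To make the second mechanism concrete, the sketch below pins a
large buffer on an outer UDP socket in Go. This is not
Tailscale's actual code: the port is WireGuard's default, the
7\,MiB size is the figure reported above, and the kernel clamps
such requests to \texttt{net.core.rmem\_max} unless the buffer is
forced with \texttt{SO\_RCVBUFFORCE}, which Go's \texttt{net}
package does not expose directly.

\begin{verbatim}
package main

import (
    "log"
    "net"
)

func main() {
    // Open the outer UDP socket (WireGuard's default port,
    // chosen here only for illustration).
    conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 51820})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Request 7 MiB of kernel socket buffer in each direction.
    // setsockopt(SO_RCVBUF/SO_SNDBUF) is silently clamped to
    // net.core.rmem_max / net.core.wmem_max.
    const size = 7 << 20
    if err := conn.SetReadBuffer(size); err != nil {
        log.Printf("SetReadBuffer: %v", err)
    }
    if err := conn.SetWriteBuffer(size); err != nil {
        log.Printf("SetWriteBuffer: %v", err)
    }
}
\end{verbatim}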

\section{Summary}
\label{sec:results_summary}

Four findings hold together across all four impairment profiles.

At baseline, the throughput hierarchy splits into three tiers
separated by natural gaps in the data: WireGuard, ZeroTier,
Headscale, and Yggdrasil at the top ($>$\,80\,\% of bare metal);
Nebula, EasyTier, and VpnCloud in the middle (55--80\,\%); and
Hyprspace, Tinc, and Mycelium at the bottom ($<$\,40\,\%).
Latency rearranges the rankings: VpnCloud is the fastest VPN at
1.13\,ms despite mid-tier throughput, Tinc has low latency but a
single-core CPU bottleneck caps its bulk transfer rate, and
Mycelium's 34.9\,ms average is an outlier driven by Babel's
overlay routing, not by tunnel overhead.

Under impairment, the hierarchy collapses. At High impairment
the spread between the fastest and slowest implementations
compresses from 675\,Mbps to under 3\,Mbps; the impairment
profile itself becomes the bottleneck. Three pathologies stand
out at the intermediate profiles. Yggdrasil's 32\,KB jumbo
overlay MTU, which inflates its baseline numbers, becomes a
liability at Low impairment: a single lost outer packet costs
roughly 24$\times$ more retransmitted inner data than a
standard-MTU VPN would lose, and throughput drops from 795 to
13\,Mbps. Hyprspace's libp2p/yamux send pipeline serialises
concurrent flows behind a per-peer mutex; under any sustained
load the pipeline backs up and ping latency balloons by three
orders of magnitude. Headscale's RIST video quality stays at
13\,\% across every profile, almost certainly because of MTU
fragmentation in the DERP relay layer; the failure is
profile-independent because it is structural.
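
The 24$\times$ figure is essentially the MTU ratio. Assuming a
standard overlay MTU of roughly 1\,400 bytes (an assumption for
illustration; the exact per-VPN figure varies), one lost outer
packet forces retransmission of a full inner frame, so
\[
\frac{32\,768\ \mathrm{B}}{1\,400\ \mathrm{B}} \approx 23.4,
\]
consistent with the roughly 24$\times$ reported above.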

Headscale's apparent lead over the bare-metal Internal baseline
at Medium impairment turns out not to come from Tailscale's
gVisor TCP stack, which is not in the data path of the
benchmark. It comes from \texttt{wireguard-go}'s TUN GRO
coalescing and the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins, both of which feed the host kernel TCP
stack a smoother input than the bare-metal path receives. The
underlying cause is a host-kernel one: the default
\texttt{tcp\_reordering=3} threshold is too tight for the kind
of bursty, correlated reordering \texttt{tc~netem} produces, and
it costs the bare-metal host more than half its achievable
throughput. Three sysctl lines repair it, and the fix is
portable to any Linux host, independent of any VPN.

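For reference, \texttt{netem} produces reordering of this kind by
letting a fraction of packets bypass an artificial queue delay.
The invocation below is an illustrative sketch, not the exact
profile parameters used in the benchmark; the interface name and
numbers are placeholders.

\begin{verbatim}
# 2% of packets (50% burst correlation) skip the 10 ms queue
# delay and arrive ahead of their peers: bursty, correlated
# reordering of the kind that trips tcp_reordering=3.
tc qdisc add dev eth0 root netem delay 10ms reorder 2% 50%
\end{verbatim}
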
A ranking by any single metric would obscure all of this. The
most useful one-sentence summary is therefore that no VPN
dominates across throughput, latency, application-level
workloads, and operational resilience together; each
implementation makes trade-offs that surface only when the
workload changes. WireGuard comes closest to a default
recommendation for performance-critical use; Headscale is the
most robust under adverse network conditions; and the others
occupy specific niches that the per-section analyses describe.