verified all numbers
+251
-221
@@ -9,10 +9,9 @@ ten VPN implementations and the internal baseline. The structure
|
|||||||
follows the impairment profiles from ideal to degraded:
|
follows the impairment profiles from ideal to degraded:
|
||||||
Section~\ref{sec:baseline} establishes overhead under ideal
|
Section~\ref{sec:baseline} establishes overhead under ideal
|
||||||
conditions, then subsequent sections examine how each VPN responds to
|
conditions, then subsequent sections examine how each VPN responds to
|
||||||
increasing network impairment. The chapter concludes with findings
|
increasing network impairment, with source-code excerpts woven in
|
||||||
from the source code analysis. A recurring theme is that no single
|
where they explain the measured behaviour. A recurring theme is
|
||||||
metric captures VPN
|
that no single metric captures VPN performance; the rankings shift
|
||||||
performance; the rankings shift
|
|
||||||
depending on whether one measures throughput, latency, retransmit
|
depending on whether one measures throughput, latency, retransmit
|
||||||
behavior, or real-world application performance.
|
behavior, or real-world application performance.
|
||||||
|
|
||||||
@@ -26,38 +25,32 @@ the VPN itself. Throughout the plots in this section, the
|
|||||||
in the path; it represents the best the hardware can do. On its own,
|
in the path; it represents the best the hardware can do. On its own,
|
||||||
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
||||||
latency of just
|
latency of just
|
||||||
0.60\,ms. WireGuard comes remarkably close to these numbers, reaching
|
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
|
||||||
92.5\,\% of bare-metal throughput with only a single retransmit across
|
single retransmit across an entire 30-second test. Mycelium sits at
|
||||||
an entire 30-second test. Mycelium sits at the other extreme, adding
|
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
|
||||||
34.9\,ms of latency, roughly 58$\times$ the bare-metal figure.
|
bare-metal figure.
|
||||||
|
|
||||||
A note on naming: ``Headscale'' in every table and figure of this
|
A note on naming: ``Headscale'' in every table and figure of this
|
||||||
chapter labels the test scenario in which the Tailscale client
|
chapter labels the test scenario in which the Tailscale client
|
||||||
(\texttt{tailscaled}) connects to a self-hosted Headscale control
|
(\texttt{tailscaled}) connects to a self-hosted Headscale control
|
||||||
server. The data plane is therefore the Tailscale client built on
|
server. The data plane is therefore the Tailscale client built on
|
||||||
\texttt{wireguard-go}, not the Headscale binary itself, which is
|
\texttt{wireguard-go}, not the Headscale binary itself, which is
|
||||||
only a control-plane server. The test rig launches
|
only a control-plane server. Statements below about ``Headscale''
|
||||||
\texttt{tailscaled} via the NixOS \texttt{services.tailscale}
|
running \texttt{wireguard-go} should be read as statements about
|
||||||
module with \texttt{interfaceName = "ts-headscale"}, which
|
the Tailscale client in this scenario.
|
||||||
translates to \texttt{--tun ts-headscale}; this means the Tailscale
|
Section~\ref{sec:tailscale_degraded} covers the specifics of how
|
||||||
client uses a real kernel TUN device and the host kernel's TCP/IP
|
the rig launches \texttt{tailscaled} and which Tailscale code
|
||||||
stack handles every tunneled packet. The alternate
|
paths that choice activates.
|
||||||
\texttt{--tun=userspace-networking} mode, in which gVisor netstack
|
|
||||||
terminates tunneled TCP inside the \texttt{tailscaled} process, is
|
|
||||||
\emph{not} engaged in any of the benchmarks reported here.
|
|
||||||
Statements below about ``Headscale'' running \texttt{wireguard-go}
|
|
||||||
should be read as statements about the Tailscale client in this
|
|
||||||
scenario.
|
|
||||||
|
|
||||||
\subsection{Test Execution Overview}
|
\subsection{Test Execution Overview}
|
||||||
|
|
||||||
Running the full baseline suite across all ten VPNs and the internal
|
Running the full baseline suite across all ten VPNs and the internal
|
||||||
reference took just over four hours. The bulk of that time, about
|
reference took just over four hours. Actual benchmark execution
|
||||||
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
|
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
|
||||||
installation and deployment accounted for another 45~minutes (19\,\%),
|
installation and deployment accounted for another 45~minutes
|
||||||
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
|
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
|
||||||
up after restarts. The remaining time was consumed by VPN service restarts
|
for VPN tunnels to come up after restarts. VPN service restarts and
|
||||||
and traffic-control (tc) stabilization.
|
traffic-control (tc) stabilization took the remainder.
|
||||||
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
||||||
|
|
||||||
Most VPNs completed every benchmark without issues, but four failed
|
Most VPNs completed every benchmark without issues, but four failed
|
||||||
@@ -146,8 +139,8 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates
|
|||||||
needs. ZeroTier compensates for tunnel-internal packet loss by
|
needs. ZeroTier compensates for tunnel-internal packet loss by
|
||||||
repeatedly triggering TCP congestion-control recovery, whereas
|
repeatedly triggering TCP congestion-control recovery, whereas
|
||||||
WireGuard delivers data with negligible in-tunnel loss. The
|
WireGuard delivers data with negligible in-tunnel loss. The
|
||||||
bare-metal Internal reference sits at 1.7~retransmits per test —
|
bare-metal Internal reference sits at 1.7~retransmits per test,
|
||||||
essentially noise — and the VPNs split into three groups around
|
essentially noise, and the VPNs split into three groups around
|
||||||
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
|
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
|
||||||
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
|
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
|
||||||
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
||||||
@@ -187,10 +180,10 @@ and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Retransmits have a direct mechanical relationship with TCP congestion
|
Retransmits have a direct mechanical relationship with TCP congestion
|
||||||
control. Each retransmit triggers a reduction in the congestion window
|
control: each one triggers a reduction in the congestion window
|
||||||
(\texttt{cwnd}), throttling the sender.
|
(\texttt{cwnd}) and throttles the sender.
|
||||||
This relationship is visible
|
Figure~\ref{fig:retransmit_correlations} shows the relationship:
|
||||||
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965
|
Hyprspace, with 4\,965
|
||||||
retransmits, maintains the smallest max congestion window in the
|
retransmits, maintains the smallest max congestion window in the
|
||||||
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
|
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
|
||||||
window, the largest of any VPN. At first glance this suggests a
|
window, the largest of any VPN. At first glance this suggests a
|
||||||
@@ -200,24 +193,31 @@ largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
|
|||||||
segment carries far more data, so the window in bytes is inflated
|
segment carries far more data, so the window in bytes is inflated
|
||||||
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
|
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
|
||||||
congestion windows across different MTU sizes is not meaningful
|
congestion windows across different MTU sizes is not meaningful
|
||||||
without normalizing for segment size. What \emph{is} clear is that
|
without normalizing for segment size. The reliable conclusion is
|
||||||
high retransmit rates force TCP to spend more time in congestion
|
simpler: high retransmit rates force TCP to spend more time in
|
||||||
recovery than in steady-state transmission, capping throughput
|
congestion recovery than in steady-state transmission, and that
|
||||||
regardless of available bandwidth. ZeroTier illustrates the
|
caps throughput regardless of available bandwidth. ZeroTier
|
||||||
opposite extreme: brute-force retransmission can still yield high
|
illustrates the opposite extreme: brute-force retransmission can
|
||||||
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
|
still yield high throughput (814\,Mbps with 1\,163 retransmits), at
|
||||||
bandwidth and unstable flow behavior.
|
the cost of wasted bandwidth and unstable flow behavior.
|
||||||
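The claim that loss, not capacity, sets the ceiling can be made
quantitative with the classical loss-based throughput model (the
Mathis bound). It is a back-of-envelope relation rather than a
description of CUBIC's exact behaviour on these kernels:
\[
  \text{throughput} \;\lesssim\; \frac{\mathit{MSS}}{\mathit{RTT}}
  \cdot \frac{C}{\sqrt{p}},
\]
where $p$ is the fraction of segments that must be retransmitted and
$C$ is a constant of order one (${\approx}1.2$ for Reno-style
recovery). Link capacity does not appear at all: once $p$ is large,
adding bandwidth changes nothing, and even halving $p$ buys only a
factor of $\sqrt{2}$.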
|
|
||||||
VpnCloud stands out: its sender reports 538.8\,Mbps
|
VpnCloud stands out: its sender reports 538.8\,Mbps but the
|
||||||
but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest
|
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
|
||||||
in the dataset). This suggests significant in-tunnel packet loss or
|
in the dataset. This points to significant in-tunnel packet loss
|
||||||
buffering at the VpnCloud layer that the retransmit count (857)
|
or buffering at the VpnCloud layer that the retransmit count (857)
|
||||||
alone does not fully explain.
|
alone does not fully explain.
|
||||||
|
% TODO: Clarify whether the headline TCP table
|
||||||
|
% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports
|
||||||
|
% sender or receiver throughput. The prose here cites sender
|
||||||
|
% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender
|
||||||
|
% column, so the table caption should say so explicitly. Same
|
||||||
|
% clarification needed for Hyprspace (368 in table vs.\ sender
|
||||||
|
% 367.9 / receiver 419.8 in the pathological-cases paragraph).
|
||||||
|
|
||||||
Variability — whether stochastic across runs or systematic across
|
Variability, whether stochastic across runs or systematic across
|
||||||
links — also differs substantially. WireGuard's three link
|
links, also differs substantially. WireGuard's three link
|
||||||
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window),
|
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
|
||||||
behaving almost identically. Mycelium's three directions span
|
and are nearly indistinguishable. Mycelium's three directions span
|
||||||
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
|
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
|
||||||
Section~\ref{sec:mycelium_routing} shows the spread is per-link
|
Section~\ref{sec:mycelium_routing} shows the spread is per-link
|
||||||
path-selection asymmetry, with one link finding a direct route and
|
path-selection asymmetry, with one link finding a direct route and
|
||||||
@@ -315,25 +315,21 @@ interference that the average hides.
|
|||||||
|
|
||||||
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
|
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
|
||||||
but only the second-lowest throughput (336\,Mbps). Packets traverse
|
but only the second-lowest throughput (336\,Mbps). Packets traverse
|
||||||
the tunnel quickly, yet single-threaded userspace processing cannot
|
the tunnel quickly, yet something caps the overall rate. The qperf
|
||||||
keep up with the link speed. The qperf benchmark backs this up: Tinc
|
benchmark reports Tinc maxing out at 14.9\,\% total system CPU while
|
||||||
maxes out at
|
delivering 336\,Mbps. On a multi-core host this figure is consistent
|
||||||
14.9\,\% total system CPU while delivering just 336\,Mbps.
|
with a single saturated core, which fits Tinc's single-threaded
|
||||||
% TODO: 14.9\% total CPU does not obviously indicate a bottleneck.
|
userspace architecture: one core encrypts, copies, and forwards
|
||||||
|
packets, and the remaining cores sit idle. But VpnCloud reports the
|
||||||
|
same 14.9\,\% and still reaches 539\,Mbps (60\,\% more than Tinc),
|
||||||
|
so whole-system CPU alone cannot explain the gap, and a per-packet
|
||||||
|
processing cost difference must also be in play.
|
||||||
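A rough sanity check on the single-core reading (the core count of
the test hosts is not restated here, so the figure is illustrative):
one fully saturated core on an $N$-core machine shows up as
$100/N\,\%$ whole-system utilization, and
\[
  N \approx \frac{100\,\%}{14.9\,\%} \approx 6.7,
\]
so the observed 14.9\,\% is what a single pegged core looks like on
a host with roughly seven logical cores. This is consistent with,
but does not by itself prove, the single-threaded bottleneck.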
|
% TODO: 14.9\% total CPU does not pin the bottleneck on its own.
|
||||||
% This is whole-system utilization on a multi-core machine, and a
|
% This is whole-system utilization on a multi-core machine, and a
|
||||||
% single saturated core fits the budget — but VpnCloud reports the
|
% single saturated core fits the budget — but VpnCloud reports the
|
||||||
% same 14.9\% \emph{and} reaches 539\,Mbps, much more than Tinc.
|
% same 14.9\% \emph{and} reaches 539\,Mbps. Verify with per-thread
|
||||||
% The single-saturated-core story alone therefore cannot explain
|
% CPU sampling or eBPF profiling to confirm the single-core story
|
||||||
% the throughput gap; per-packet processing cost must differ
|
% and quantify the per-packet cost difference.
|
||||||
% materially between the two. Verify with per-thread CPU sampling
|
|
||||||
% or eBPF profiling.
|
|
||||||
On a multi-core system, this low percentage is consistent with a
|
|
||||||
single saturated core (and Tinc is single-threaded), which would
|
|
||||||
explain why the CPU rather than the network is the bottleneck.
|
|
||||||
The story is incomplete, however: VpnCloud shows the same 14.9\,\%
|
|
||||||
total system CPU yet delivers 539\,Mbps — 60\,\% more than Tinc —
|
|
||||||
so a difference in per-packet processing cost between the two
|
|
||||||
implementations must also be in play.
|
|
||||||
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
|
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
|
||||||
spot.
|
spot.
|
||||||
|
|
||||||
@@ -346,9 +342,9 @@ spot.
|
|||||||
The qperf measurements also reveal a wide spread in CPU usage.
|
The qperf measurements also reveal a wide spread in CPU usage.
|
||||||
Hyprspace (55.1\,\%) and Yggdrasil
|
Hyprspace (55.1\,\%) and Yggdrasil
|
||||||
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
|
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
|
||||||
9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a
|
9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a
|
||||||
kernel-level implementation, presumably due to in-kernel
|
kernel-level implementation; in-kernel cryptographic processing
|
||||||
cryptographic processing.
|
is the likely cause, though no profiling data confirms this.
|
||||||
On the efficient end, VpnCloud
|
On the efficient end, VpnCloud
|
||||||
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
|
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
|
||||||
CPU time. Nebula and Headscale are missing from
|
CPU time. Nebula and Headscale are missing from
|
||||||
@@ -416,8 +412,10 @@ Table~\ref{tab:parallel_scaling} lists the results.
|
|||||||
|
|
||||||
The VPNs that gain the most are those most constrained in
|
The VPNs that gain the most are those most constrained in
|
||||||
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
|
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
|
||||||
can never fill the pipe: the bandwidth-delay product demands a window
|
can never fill the pipe: the bandwidth-delay product (the amount
|
||||||
larger than any single flow maintains, so multiple concurrent flows
|
of in-flight data a TCP flow needs to saturate a link, equal to the
|
||||||
|
link bandwidth times the round-trip time) demands a window larger
|
||||||
|
than any single flow maintains, so multiple concurrent flows
|
||||||
compensate for that constraint and push throughput to 2.20$\times$
|
compensate for that constraint and push throughput to 2.20$\times$
|
||||||
the single-stream figure. Hyprspace scales almost as well
|
the single-stream figure. Hyprspace scales almost as well
|
||||||
(2.18$\times$) for the same reason but with a different
|
(2.18$\times$) for the same reason but with a different
|
||||||
@@ -425,7 +423,7 @@ bottleneck. Its libp2p send pipeline accumulates roughly
|
|||||||
2\,800\,ms of under-load latency
|
2\,800\,ms of under-load latency
|
||||||
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
|
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
|
||||||
flow a bandwidth-delay product on the order of hundreds of
|
flow a bandwidth-delay product on the order of hundreds of
|
||||||
megabytes to fill — far beyond any single kernel cwnd. And
|
megabytes to fill, far beyond any single kernel cwnd. And
|
||||||
because Hyprspace keys \texttt{activeStreams} by destination
|
because Hyprspace keys \texttt{activeStreams} by destination
|
||||||
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
|
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
|
||||||
three concurrent peer pairs in the parallel benchmark each get
|
three concurrent peer pairs in the parallel benchmark each get
|
||||||
@@ -440,8 +438,9 @@ more of the bloated pipeline than one can.
|
|||||||
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
|
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
|
||||||
% per-flow window evolution nor the actual under-load latency
|
% per-flow window evolution nor the actual under-load latency
|
||||||
% has been measured directly. A tcpdump of one Hyprspace
|
% has been measured directly. A tcpdump of one Hyprspace
|
||||||
% iPerf3 run with inter-arrival timing analysis would settle
|
% iPerf3 run with inter-arrival timing analysis would settle it.
|
||||||
% it. Tinc picks up a
|
|
||||||
|
Tinc picks up a
|
||||||
1.68$\times$ boost because several streams can collectively keep its
|
1.68$\times$ boost because several streams can collectively keep its
|
||||||
single-threaded CPU busy during what would otherwise be idle gaps in
|
single-threaded CPU busy during what would otherwise be idle gaps in
|
||||||
a single flow.
|
a single flow.
|
||||||
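The windows these bandwidth-delay products call for can be estimated
directly from the figures already quoted, taking the 934\,Mbps
bare-metal rate as the target (a back-of-envelope estimate, not a
measured window requirement):
\[
  \mathit{BDP} = B \times \mathit{RTT}:\qquad
  934\,\text{Mbps} \times 34.9\,\text{ms} \approx 4.1\,\text{MB}
  \quad\text{(Mycelium)},
\]
\[
  934\,\text{Mbps} \times 2.8\,\text{s} \approx 327\,\text{MB}
  \quad\text{(Hyprspace under load)}.
\]
The first figure already sits at the level of the largest congestion
window seen anywhere in the baseline (Yggdrasil's inflated 4.3\,MB);
the second is far beyond any single kernel \texttt{cwnd}, which is
why only parallel streams recover the throughput.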
@@ -449,9 +448,9 @@ a single flow.
|
|||||||
% TODO: "zero retransmits" in parallel mode is not shown in any table
|
% TODO: "zero retransmits" in parallel mode is not shown in any table
|
||||||
% or figure. Add parallel-mode retransmit data or remove the claim.
|
% or figure. Add parallel-mode retransmit data or remove the claim.
|
||||||
WireGuard and Internal both scale cleanly at around
|
WireGuard and Internal both scale cleanly at around
|
||||||
1.48--1.50$\times$ with zero retransmits, suggesting that
|
1.48--1.50$\times$ with zero retransmits. This is consistent
|
||||||
WireGuard's overhead is a fixed per-packet cost that does not worsen
|
with WireGuard's overhead being a fixed per-packet cost that does
|
||||||
under multiplexing.
|
not worsen under multiplexing.
|
||||||
|
|
||||||
Nebula is the only VPN that actually gets \emph{slower} with more
|
Nebula is the only VPN that actually gets \emph{slower} with more
|
||||||
streams: throughput drops from 706\,Mbps to 648\,Mbps
|
streams: throughput drops from 706\,Mbps to 648\,Mbps
|
||||||
@@ -498,8 +497,9 @@ The sender throughput values are artifacts: they reflect how fast the
|
|||||||
sender can write to the socket, not how fast data traverses the
|
sender can write to the socket, not how fast data traverses the
|
||||||
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
|
tunnel. Yggdrasil, for example, reports 63\,744\,Mbps sender
|
||||||
throughput because it uses a 32,731-byte block size (a jumbo-frame
|
throughput because it uses a 32\,731-byte block size (a jumbo-frame
|
||||||
overlay MTU), inflating the apparent rate per \texttt{send()} system
|
overlay MTU), which inflates the apparent rate per
|
||||||
call. Only the receiver throughput is meaningful.
|
\texttt{send()} system call. Only the receiver throughput is
|
||||||
|
meaningful.
|
||||||
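For scale, simple arithmetic on the quoted figures (not an
additional measurement): 63\,744\,Mbps at 32\,731 bytes per write
corresponds to
\[
  \frac{63\,744 \times 10^{6}\ \text{bit/s}}
       {32\,731 \times 8\ \text{bit}}
  \approx 2.4 \times 10^{5}\ \text{socket writes per second},
\]
which measures how fast the sender can hand datagrams to its own
socket and says nothing about what reaches the far end.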
|
|
||||||
\begin{table}[H]
|
\begin{table}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -537,16 +537,19 @@ because the sender overwhelms the tunnel's userspace processing capacity.
|
|||||||
Headscale shares WireGuard's cryptographic protocol but, contrary to
|
Headscale shares WireGuard's cryptographic protocol but, contrary to
|
||||||
intuition, does not share its kernel datapath: Tailscale's
|
intuition, does not share its kernel datapath: Tailscale's
|
||||||
\texttt{magicsock} layer intercepts every packet to handle endpoint
|
\texttt{magicsock} layer intercepts every packet to handle endpoint
|
||||||
selection and DERP relay, which is incompatible with the in-kernel
|
selection and DERP (Designated Encrypted Relay for Packets,
|
||||||
WireGuard module. Headscale therefore runs \texttt{wireguard-go}
|
Tailscale's TLS-over-TCP relay network used when a direct UDP path
|
||||||
entirely in userspace, and the unbounded \texttt{-b~0} flood overruns
|
between peers cannot be established), which is incompatible with the
|
||||||
that userspace pipeline just as it overruns every other userspace
|
in-kernel WireGuard module. Headscale therefore runs
|
||||||
implementation, producing 69.8\,\% loss despite the WireGuard branding.
|
\texttt{wireguard-go} entirely in userspace, and the unbounded
|
||||||
|
\texttt{-b~0} flood overruns that userspace pipeline just as it
|
||||||
|
overruns every other userspace implementation; Headscale
|
||||||
|
shows 69.8\,\% loss despite the WireGuard branding.
|
||||||
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
|
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
|
||||||
(due to its large block size) but loses almost all of it. These loss
|
(due to its large block size) but loses almost all of it. These loss
|
||||||
rates do not reflect real-world UDP behavior but reveal which VPNs
|
rates do not reflect real-world UDP behavior but reveal which VPNs
|
||||||
implement effective flow control. Hyprspace and Mycelium could not
|
implement effective flow control. Hyprspace and Mycelium could not
|
||||||
complete the UDP test at all, timing out after 120 seconds.
|
complete the UDP test at all; both timed out after 120 seconds.
|
||||||
|
|
||||||
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
|
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
|
||||||
% the path MTU. It is derived from the socket MSS and reflects the
|
% the path MTU. It is derived from the socket MSS and reflects the
|
||||||
@@ -743,19 +746,11 @@ overwhelm FEC entirely.
|
|||||||
|
|
||||||
\subsection{Operational Resilience}
|
\subsection{Operational Resilience}
|
||||||
|
|
||||||
Sustained-load performance does not predict recovery speed. How
|
Throughput, latency, and application performance describe how a
|
||||||
quickly a tunnel comes up after a reboot, and how reliably it
|
tunnel behaves once it is up. The next question is how quickly it
|
||||||
reconverges, matters as much as peak throughput for operational use.
|
gets there. Sustained-load numbers do not predict recovery speed,
|
||||||
|
and for operational use the time a tunnel takes to come up after a
|
||||||
% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
|
reboot matters as much as its peak throughput.
|
||||||
% are not shown in any figure or table. Either add a figure or
|
|
||||||
% scrap this paragraph (see note below).
|
|
||||||
First-time connectivity spans a wide range. Headscale and WireGuard
|
|
||||||
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
|
|
||||||
(10--14\,s) spend seconds negotiating with their control planes
|
|
||||||
before passing traffic.
|
|
||||||
|
|
||||||
%TODO: Maybe we want to scrap first-time connectivity
|
|
||||||
|
|
||||||
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
||||||
performer under sustained TCP load, recovers in just 8.7~seconds on
|
performer under sustained TCP load, recovers in just 8.7~seconds on
|
||||||
@@ -768,18 +763,21 @@ benchmarks use the default). After a reboot, a node must
|
|||||||
wait until the next periodic update before its lighthouses learn
|
wait until the next periodic update before its lighthouses learn
|
||||||
its new endpoint, so the reconnection time tracks the timer rather
|
its new endpoint, so the reconnection time tracks the timer rather
|
||||||
than any topology-dependent convergence.
|
than any topology-dependent convergence.
|
||||||
Mycelium sits at the opposite end, needing 76.6~seconds and showing
|
Mycelium sits at the opposite end at 76.6~seconds, and its three
|
||||||
the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s),
|
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
|
||||||
suggesting a fixed protocol-level wait built into the overlay.
|
Section~\ref{sec:mycelium_routing} argues from that uniformity
|
||||||
|
that the bound is a fixed timer in the overlay protocol.
|
||||||
|
|
||||||
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
||||||
node is back in 7.1~seconds while lom and luna take 94.8 and
|
node is back in 7.1~seconds while lom and luna take 94.8 and
|
||||||
97.3~seconds respectively. The gap likely reflects the overlay's
|
97.3~seconds respectively. Yggdrasil organises its overlay as a
|
||||||
spanning-tree rebuild: a node near the root of the tree reconverges
|
distributed spanning tree rooted at the node with the highest public
|
||||||
quickly, while one further out has to wait for the topology to
|
key: every other node picks a parent closer to the root and the
|
||||||
propagate.
|
whole network hangs off that parent chain. The gap likely reflects
|
||||||
|
the cost of rebuilding that tree after a reboot: a node close to the
|
||||||
%TODO: Needs clarifications what is a "spanning tree build"
|
current root reconverges quickly, while one further out must wait
|
||||||
|
for updated parent information to propagate hop-by-hop before it
|
||||||
|
can route traffic.
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -823,14 +821,14 @@ earlier benchmarks into per-VPN diagnoses.
|
|||||||
Hyprspace produces the most severe performance collapse in the
|
Hyprspace produces the most severe performance collapse in the
|
||||||
dataset. At idle, its ping latency is a modest 1.79\,ms.
|
dataset. At idle, its ping latency is a modest 1.79\,ms.
|
||||||
Under TCP load, that number balloons to roughly 2\,800\,ms, a
|
Under TCP load, that number balloons to roughly 2\,800\,ms, a
|
||||||
1\,556$\times$ increase. This is not the network becoming
|
1\,556$\times$ increase. The network itself has capacity to spare;
|
||||||
congested; it is the VPN tunnel itself filling up with buffered
|
the VPN tunnel is filling up with buffered packets and failing to
|
||||||
packets and refusing to drain.
|
drain.
|
||||||
|
|
||||||
The consequences ripple through every TCP metric. With 4\,965
|
The consequences show in every TCP metric. With 4\,965
|
||||||
retransmits per 30-second test (one in every 200~segments), TCP
|
retransmits per 30-second test (one in every 200~segments), TCP
|
||||||
spends most of its time in congestion recovery rather than
|
spends most of its time in congestion recovery rather than
|
||||||
steady-state transfer, shrinking the max congestion window to
|
steady-state transfer. The max congestion window shrinks to
|
||||||
205\,KB, the smallest in the dataset. Under parallel load the
|
205\,KB, the smallest in the dataset. Under parallel load the
|
||||||
situation worsens: retransmits climb to 17\,426. % TODO: The
|
situation worsens: retransmits climb to 17\,426. % TODO: The
|
||||||
% explanation for the sender/receiver inversion (ACK delays
|
% explanation for the sender/receiver inversion (ACK delays
|
||||||
@@ -841,7 +839,7 @@ The buffering even
|
|||||||
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
|
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
|
||||||
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
|
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
|
||||||
cause the sender-side timer to undercount the actual data rate. The
|
cause the sender-side timer to undercount the actual data rate. The
|
||||||
UDP test never finished at all, timing out at 120~seconds.
|
UDP test never finished at all; it timed out at 120~seconds.
|
||||||
|
|
||||||
% Should we always use percentages for retransmits?
|
% Should we always use percentages for retransmits?
|
||||||
|
|
||||||
@@ -891,7 +889,7 @@ Since the benchmark targets the regular Hyprspace IPv4/IPv6
|
|||||||
addresses rather than service-network proxies, both endpoints
|
addresses rather than service-network proxies, both endpoints
|
||||||
rely on their host kernel's TCP stack for the entire transfer.
|
rely on their host kernel's TCP stack for the entire transfer.
|
||||||
Whatever options Hyprspace's gVisor instance might set
|
Whatever options Hyprspace's gVisor instance might set
|
||||||
internally — congestion control, loss recovery, buffer sizes —
|
internally (congestion control, loss recovery, buffer sizes)
|
||||||
are therefore irrelevant to these measurements; the inner TCP
|
are therefore irrelevant to these measurements; the inner TCP
|
||||||
state machine the kernel runs is the only one in the path.
|
state machine the kernel runs is the only one in the path.
|
||||||
The same caveat applies more sharply to Tailscale, where the
|
The same caveat applies more sharply to Tailscale, where the
|
||||||
@@ -900,9 +898,13 @@ stack but the benchmark traffic never reaches it; that case is
|
|||||||
the subject of Section~\ref{sec:tailscale_degraded}.
|
the subject of Section~\ref{sec:tailscale_degraded}.
|
||||||
|
|
||||||
If gVisor is out of scope, the buffer bloat must originate
|
If gVisor is out of scope, the buffer bloat must originate
|
||||||
further up the Hyprspace stack instead. The most plausible
|
further up the Hyprspace stack instead. Hyprspace uses
|
||||||
source is the libp2p / yamux stream layer through which raw IP
|
\texttt{libp2p}, a peer-to-peer networking library, and its
|
||||||
packets are funnelled. Hyprspace's TUN-read loop dispatches
|
\texttt{yamux} stream multiplexer, which runs many logical streams
|
||||||
|
over a single underlying connection and polices each one with a
|
||||||
|
credit-based flow-control window. The most plausible source of
|
||||||
|
the bloat is this libp2p/yamux layer, through which raw IP packets
|
||||||
|
are funnelled. Hyprspace's TUN-read loop dispatches
|
||||||
each outbound packet on its own goroutine, and every such
|
each outbound packet on its own goroutine, and every such
|
||||||
goroutine ends up in \texttt{node/node.go}'s
|
goroutine ends up in \texttt{node/node.go}'s
|
||||||
\texttt{sendPacket}, which keeps exactly one libp2p stream per
|
\texttt{sendPacket}, which keeps exactly one libp2p stream per
|
||||||
@@ -916,10 +918,10 @@ collapses to a single send pipeline at this layer. Each
|
|||||||
goroutine waiting for the lock pins its own 1420-byte packet
|
goroutine waiting for the lock pins its own 1420-byte packet
|
||||||
buffer, and the underlying yamux session adds a per-stream
|
buffer, and the underlying yamux session adds a per-stream
|
||||||
flow-control window on top. None of this is visible to the
|
flow-control window on top. None of this is visible to the
|
||||||
kernel TCP sender that produced the inner segments — the kernel
|
kernel TCP sender that produced the inner segments: the kernel
|
||||||
sees only that the TUN write returned — so it keeps growing
|
sees only that the TUN write returned, so it keeps growing its
|
||||||
its congestion window while the libp2p layer falls further
|
congestion window while the libp2p layer falls further behind. The
|
||||||
behind. The geometry is the textbook one for buffer bloat: a
|
geometry is the textbook one for buffer bloat: a
|
||||||
fast producer (kernel TCP) sitting upstream of a slow,
|
fast producer (kernel TCP) sitting upstream of a slow,
|
||||||
serialised consumer (the single yamux stream per peer) with
|
serialised consumer (the single yamux stream per peer) with
|
||||||
no flow-control signal coupling the two.
|
no flow-control signal coupling the two.
|
||||||
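The geometry is easy to reproduce in isolation. The sketch below is
a toy model of the pattern just described, not Hyprspace's actual
code: a fast producer hands every 1420-byte packet to its own
goroutine, and all of those goroutines serialise on one per-peer
pipeline whose backlog the producer never sees.

\begin{lstlisting}[language=Go]
// Toy model of the send-path geometry described above (illustrative,
// not Hyprspace code): one goroutine per packet, all funnelled into a
// single mutex-guarded per-peer pipeline.
package main

import (
	"fmt"
	"sync"
	"time"
)

type peerPipeline struct {
	mu sync.Mutex // one stream per peer: the single serialised consumer
}

func (p *peerPipeline) send(pkt []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	time.Sleep(50 * time.Microsecond) // stand-in for the slow stream write
}

func main() {
	pipe := &peerPipeline{}
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 5000; i++ { // the kernel TCP sender keeps producing
		wg.Add(1)
		go func() { // one goroutine per TUN packet, each pinning its buffer
			defer wg.Done()
			pipe.send(make([]byte, 1420))
		}()
	}
	wg.Wait()
	// The queueing delay behind the mutex is invisible to the producer,
	// which only ever saw its "TUN writes" return immediately.
	fmt.Printf("drained 5000 packets in %v\n", time.Since(start))
}
\end{lstlisting}

Raising the per-write delay or the packet count grows the backlog
linearly while the producer loop never slows down, which is the
defining feature of the buffer bloat described above.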
@@ -1036,10 +1038,15 @@ background.
|
|||||||
|
|
||||||
Mycelium is also the slowest VPN to recover from a reboot:
|
Mycelium is also the slowest VPN to recover from a reboot:
|
||||||
76.6~seconds on average, and almost suspiciously uniform across
|
76.6~seconds on average, and almost suspiciously uniform across
|
||||||
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to
|
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
|
||||||
a fixed convergence timer in the overlay protocol —
|
fixed convergence timer in the overlay protocol, most likely a
|
||||||
most likely a
|
default wait interval hard-coded into the reconnection logic. A
|
||||||
default interval rather than anything topology-dependent.
|
topology-dependent recovery time, by contrast, would vary with each
|
||||||
|
node's position in the overlay: a node near an active peer would
|
||||||
|
reconverge quickly while one further away would wait longer for
|
||||||
|
routing information to reach it. Mycelium shows no such variation,
|
||||||
|
so the bound is almost certainly a timer rather than a propagation
|
||||||
|
delay.
|
||||||
% TODO: Identify which Mycelium constant or default this 75-78 s
|
% TODO: Identify which Mycelium constant or default this 75-78 s
|
||||||
% recovery actually corresponds to before claiming it is a fixed
|
% recovery actually corresponds to before claiming it is a fixed
|
||||||
% timer; the source code would settle whether it is hard-coded,
|
% timer; the source code would settle whether it is hard-coded,
|
||||||
@@ -1047,49 +1054,46 @@ default interval rather than anything topology-dependent.
|
|||||||
The UDP test timed out at 120~seconds, and even first-time
|
The UDP test timed out at 120~seconds, and even first-time
|
||||||
connectivity required a 70-second wait at startup.
|
connectivity required a 70-second wait at startup.
|
||||||
|
|
||||||
% Explain what topology-dependent means in this case.
|
|
||||||
|
|
||||||
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
||||||
|
|
||||||
Tinc is a clear case of a CPU bottleneck masquerading
|
The latency subsection already traced Tinc's 336\,Mbps ceiling to
|
||||||
as a network
|
single-core CPU exhaustion. The usual network suspects do not
|
||||||
problem. At 1.19\,ms latency, packets get through the
|
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
|
||||||
tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a
|
effective UDP payload size (1\,353 bytes) and its retransmit count
|
||||||
third of the bare-metal link.
|
(240) are in the normal range. That leaves CPU: 14.9\,\%
|
||||||
The usual suspects do not apply:
|
whole-system utilization is what one saturated core looks like on
|
||||||
Tinc's effective UDP payload size (\texttt{blksize\_bytes} of
|
a multi-core host, which fits a single-threaded userspace VPN.
|
||||||
1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and
|
The parallel benchmark confirms the diagnosis. Tinc scales to
|
||||||
WireGuard at 1\,368) is in the normal range, and its retransmit
|
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
|
||||||
count (240) is moderate. What limits Tinc is its
|
Several concurrent TCP streams keep that one core busy through
|
||||||
single-threaded
|
the gaps a single flow would leave idle, and the extra work
|
||||||
userspace architecture: one CPU core simply cannot
|
translates directly into extra throughput.
|
||||||
encrypt, copy,
|
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
|
||||||
and forward packets fast enough to fill the pipe.
|
% unresolved CPU-profiling TODO from the latency subsection
|
||||||
|
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
|
||||||
% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the
|
% profiling refutes the single-core story, this paragraph must
|
||||||
% Tinc CPU bottleneck
|
% be revisited as well.
|
||||||
% diagnosis from above, but the 14.9% CPU figure has
|
|
||||||
% an unresolved TODO
|
|
||||||
% (the same utilization as VpnCloud at 539 Mbps). If
|
|
||||||
% the CPU claim is
|
|
||||||
% revised or refuted, this confirmation must be updated too.
|
|
||||||
The parallel benchmark confirms this diagnosis. Tinc scales to
|
|
||||||
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
|
|
||||||
Multiple TCP streams collectively keep that single
|
|
||||||
core busy during
|
|
||||||
what would otherwise be idle gaps in any individual
|
|
||||||
flow, squeezing
|
|
||||||
out throughput that no single stream could reach alone.
|
|
||||||
|
|
||||||
\section{Impact of network impairment}
|
\section{Impact of network impairment}
|
||||||
\label{sec:impairment}
|
\label{sec:impairment}
|
||||||
|
|
||||||
Baseline benchmarks rank VPNs by overhead under ideal
|
Baseline benchmarks rank VPNs by overhead under ideal
|
||||||
conditions.
|
conditions. The impairment profiles in
|
||||||
The impairment profiles in
|
Table~\ref{tab:impairment_profiles} test a different property:
|
||||||
Table~\ref{tab:impairment_profiles} test
|
resilience. Each profile applies symmetric \texttt{tc netem}
|
||||||
a different property: resilience. Two results
|
impairment to every machine. Low adds roughly 2\,ms of delay and
|
||||||
dominate the data.
|
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
|
||||||
|
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
|
||||||
|
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
|
||||||
|
reordering. Medium and High both use 50\,\% correlation, so
|
||||||
|
losses and reorderings are bursty rather than uniform. Two
|
||||||
|
results dominate the data.
|
||||||
|
% TODO: Double-check these per-profile parameters against the
|
||||||
|
% canonical impairment-profile definitions in the earlier chapter
|
||||||
|
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
|
||||||
|
% delay numbers are cross-checked against later prose in this
|
||||||
|
% chapter, but the correlation and jitter values should be
|
||||||
|
% verified against the authoritative profile definition.
|
||||||
|
|
||||||
The first is the collapse of the throughput hierarchy. At High
|
The first is the collapse of the throughput hierarchy. At High
|
||||||
impairment, the 675\,Mbps spread between fastest and slowest
|
impairment, the 675\,Mbps spread between fastest and slowest
|
||||||
@@ -1106,8 +1110,8 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
|
|||||||
through what turns out to be the wrong hypothesis. The
|
through what turns out to be the wrong hypothesis. The
|
||||||
investigation begins with Tailscale's much-discussed gVisor TCP
|
investigation begins with Tailscale's much-discussed gVisor TCP
|
||||||
stack, validates the candidate parameters in isolation on the
|
stack, validates the candidate parameters in isolation on the
|
||||||
bare-metal host, and only then discovers — by reading the rig's
|
bare-metal host, and only then discovers, by reading the rig's
|
||||||
own NixOS module — that the gVisor stack is not actually in the
|
own NixOS module, that the gVisor stack is not actually in the
|
||||||
data path of the benchmark at all. The real culprit is a
|
data path of the benchmark at all. The real culprit is a
|
||||||
combination of the Linux kernel's tight default
|
combination of the Linux kernel's tight default
|
||||||
\texttt{tcp\_reordering} threshold and the way
|
\texttt{tcp\_reordering} threshold and the way
|
||||||
@@ -1313,6 +1317,16 @@ every lost or reordered outer packet costs roughly
|
|||||||
retransmitted inner data than a standard 1\,400-byte
|
retransmitted inner data than a standard 1\,400-byte
|
||||||
MTU VPN would
|
MTU VPN would
|
||||||
lose.
|
lose.
|
||||||
|
% TODO: The jumbo-MTU-as-liability argument is reused in several
|
||||||
|
% places (TCP impairment, QUIC impairment, RIST video, and
|
||||||
|
% §sec:baseline Tier analysis). In each it is presented as a
|
||||||
|
% mechanism rather than a measurement. Consider running one
|
||||||
|
% controlled experiment --- force Yggdrasil to a standard
|
||||||
|
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
|
||||||
|
% profiles --- to test the hypothesis directly, or consolidate
|
||||||
|
% the argument into a single "jumbo-MTU liability" paragraph and
|
||||||
|
% cite it from the other sections instead of restating the
|
||||||
|
% mechanism each time.
|
||||||
|
|
||||||
Headscale retains 34.3\% of its baseline throughput
|
Headscale retains 34.3\% of its baseline throughput
|
||||||
at Low, almost
|
at Low, almost
|
||||||
@@ -1444,6 +1458,15 @@ indicator than as a throughput measurement. A VPN that cannot
|
|||||||
complete a 30-second UDP flood under 0.25\% packet loss has a
|
complete a 30-second UDP flood under 0.25\% packet loss has a
|
||||||
flow-control problem that will surface under real workloads too,
|
flow-control problem that will surface under real workloads too,
|
||||||
even when the symptoms are milder.
|
even when the symptoms are milder.
|
||||||
|
% TODO: Non-monotonic failure pattern (Internal and WireGuard
|
||||||
|
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
|
||||||
|
% fail selectively) is never explained and directly undermines
|
||||||
|
% the "robustness indicator" framing above. Reproduce one of
|
||||||
|
% the failing Low-profile runs with iPerf3 debug logging and
|
||||||
|
% \texttt{tc -s qdisc show} to establish whether these are VPN
|
||||||
|
% flow-control failures, iPerf3/tc interaction artefacts, or
|
||||||
|
% timing issues; then either explain the pattern or soften the
|
||||||
|
% robustness-indicator claim.
|
||||||
|
|
||||||
\subsection{Parallel TCP}
|
\subsection{Parallel TCP}
|
||||||
|
|
||||||
@@ -1552,10 +1575,10 @@ At High impairment, WireGuard (23.2\,Mbps), VpnCloud
|
|||||||
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
|
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
|
||||||
0.4\,Mbps of one another. At baseline these four
|
0.4\,Mbps of one another. At baseline these four
|
||||||
span a 188\,Mbps
|
span a 188\,Mbps
|
||||||
range (656 to 844\,Mbps). QUIC's own congestion
|
range (656 to 844\,Mbps). At this point QUIC's own congestion
|
||||||
control, running on
|
control is the sole limiter: it runs on top of an
|
||||||
top of an already-degraded outer link, has become the
|
already-degraded outer link and cannot push past
|
||||||
sole limiter.
|
${\sim}$23\,Mbps regardless of the VPN underneath.
|
||||||
|
|
||||||
\begin{figure}[H]
|
\begin{figure}[H]
|
||||||
\centering
|
\centering
|
||||||
@@ -1742,6 +1765,20 @@ Section~\ref{sec:tailscale_degraded} explains why.
|
|||||||
\section{Tailscale under degraded conditions}
|
\section{Tailscale under degraded conditions}
|
||||||
\label{sec:tailscale_degraded}
|
\label{sec:tailscale_degraded}
|
||||||
|
|
||||||
|
% TODO: Editorial pass needed on two chapter-wide issues before
|
||||||
|
% submission:
|
||||||
|
% (1) magicsock / wireguard-go userspace-datapath explanation is
|
||||||
|
% repeated three times in slightly different forms (once in
|
||||||
|
% baseline UDP, once in impairment UDP, once here). Consider
|
||||||
|
% introducing it once in full here, where it is load-bearing,
|
||||||
|
% and replacing the earlier occurrences with one-sentence
|
||||||
|
% forward references.
|
||||||
|
% (2) This section uses first-person plural ("we pursued", "we
|
||||||
|
% worked it out", "we ran two follow-up benchmarks") while
|
||||||
|
% the rest of the chapter is in impersonal voice. Either
|
||||||
|
% harmonise everything to one voice, or explicitly frame this
|
||||||
|
% section as a first-person narrative detour.
|
||||||
|
|
||||||
This section is about an observation that should not exist:
|
This section is about an observation that should not exist:
|
||||||
Headscale, a tunnelling VPN built on a kernel TCP stack and
|
Headscale, a tunnelling VPN built on a kernel TCP stack and
|
||||||
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
|
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
|
||||||
@@ -1753,7 +1790,7 @@ chasing the obvious answer to its end.
|
|||||||
\subsection{An anomaly worth pursuing}
|
\subsection{An anomaly worth pursuing}
|
||||||
|
|
||||||
At Medium impairment, Headscale reaches 41.5\,Mbps on a single
|
At Medium impairment, Headscale reaches 41.5\,Mbps on a single
|
||||||
TCP stream against Internal's 29.6\,Mbps — a 40\,\% lead for
|
TCP stream against Internal's 29.6\,Mbps, a 40\,\% lead for
|
||||||
the VPN over the direct host-to-host link it tunnels through.
|
the VPN over the direct host-to-host link it tunnels through.
|
||||||
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
|
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
|
||||||
Low and High impairment it lags Internal by some margin. Yet at
|
Low and High impairment it lags Internal by some margin. Yet at
|
||||||
@@ -1837,12 +1874,12 @@ imports Google's gVisor netstack
|
|||||||
it as an in-process TCP implementation. The gVisor
|
it as an in-process TCP implementation. The gVisor
|
||||||
documentation is direct about why this matters: netstack is
|
documentation is direct about why this matters: netstack is
|
||||||
designed for adverse networks where the host kernel's TCP
|
designed for adverse networks where the host kernel's TCP
|
||||||
defaults are too aggressive. Tailscale's release notes go
|
defaults are too aggressive. Tailscale's release notes go further
|
||||||
further, calling out specific overrides on top of gVisor — the
|
and name specific overrides
|
||||||
most visible being an explicit RACK disable and 8\,MiB / 6\,MiB
|
on top of gVisor; the most visible are an explicit RACK disable
|
||||||
receive and send buffers.
|
and 8\,MiB / 6\,MiB receive and send buffers.
|
||||||
|
|
||||||
Reading Tailscale's source confirms it.
|
The Tailscale source code bears this out.
|
||||||
\texttt{wgengine/netstack/netstack.go} contains the netstack
|
\texttt{wgengine/netstack/netstack.go} contains the netstack
|
||||||
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
|
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
|
||||||
reproduces the relevant overrides verbatim. RACK is disabled
|
reproduces the relevant overrides verbatim. RACK is disabled
|
||||||
@@ -1863,25 +1900,22 @@ enabled (gVisor's default is off).
|
|||||||
\texttt{wgengine/netstack/netstack.go}.
|
\texttt{wgengine/netstack/netstack.go}.
|
||||||
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
|
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
|
||||||
|
|
||||||
Read against the Linux kernel defaults — RACK on, CUBIC by
|
Read against the Linux kernel defaults (RACK on, CUBIC by
|
||||||
default, ${\sim}$1\,MiB receive and send buffers,
|
default, ${\sim}$1\,MiB receive and send buffers,
|
||||||
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled — these
|
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
|
||||||
overrides describe a TCP stack better suited to a lossy,
|
overrides describe a TCP stack better suited to a lossy,
|
||||||
reordering link than the host kernel. The hypothesis writes
|
reordering link than the host kernel. The hypothesis follows
|
||||||
itself: Headscale's iPerf3 traffic is processed
|
directly: Headscale's iPerf3 traffic
|
||||||
by this gVisor
|
runs through this gVisor instance instead of through the host
|
||||||
instance instead of by the host kernel TCP stack, and so it
|
kernel TCP stack, and so it inherits the more
|
||||||
inherits the more reordering-tolerant behaviour.
|
reordering-tolerant behaviour. WireGuard-the-kernel-module
|
||||||
WireGuard-the-kernel-module shares only the cryptographic
|
shares only the cryptographic protocol; it does not include
|
||||||
protocol; it does not get the gVisor stack, and
|
the gVisor stack, and therefore does not get the advantage.
|
||||||
therefore does
|
|
||||||
not get the advantage.
|
|
||||||
|
|
||||||
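The host-kernel half of that comparison is easy to verify on the
test machines themselves. A minimal sketch that reads the relevant
defaults (standard Linux \texttt{/proc/sys} paths; a convenience for
the reader, not part of the benchmark rig):

\begin{lstlisting}[language=Go]
// Print the host-kernel TCP defaults discussed above.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	keys := []string{
		"net/ipv4/tcp_recovery",           // bit 0: RACK loss detection
		"net/ipv4/tcp_congestion_control", // cubic by default
		"net/ipv4/tcp_reordering",         // default 3 segments
		"net/ipv4/tcp_early_retrans",      // early retransmit / tail loss probe
		"net/ipv4/tcp_rmem",               // min/default/max receive buffer
		"net/ipv4/tcp_wmem",               // min/default/max send buffer
	}
	for _, k := range keys {
		v, err := os.ReadFile("/proc/sys/" + k)
		if err != nil {
			continue // key absent on this kernel
		}
		fmt.Printf("%-34s %s\n", k, strings.TrimSpace(string(v)))
	}
}
\end{lstlisting}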
It is a clean story. The natural way to test it
|
The natural way to test this is to extract
|
||||||
is to extract
|
|
||||||
the parameters Tailscale sets inside gVisor, apply their
|
the parameters Tailscale sets inside gVisor, apply their
|
||||||
nearest Linux equivalents to the bare-metal host as sysctls,
|
nearest Linux equivalents to the bare-metal host as sysctls,
|
||||||
and see whether Internal — with no VPN at all — picks up the
|
and see whether Internal, with no VPN at all, picks up the
|
||||||
same advantage. If it does, the gVisor explanation is
|
same advantage. If it does, the gVisor explanation is
|
||||||
supported. If it does not, the hypothesis fails.
|
supported. If it does not, the hypothesis fails.
|
||||||
|
|
||||||
@@ -1951,21 +1985,18 @@ impairment setup as the original 18.12.2025 run.
|
|||||||
|
|
||||||
The result felt like confirmation. Internal's
|
The result felt like confirmation. Internal's
|
||||||
Medium-impairment throughput jumped from 29.6\,Mbps to
|
Medium-impairment throughput jumped from 29.6\,Mbps to
|
||||||
72.7\,Mbps under the reorder-only configuration — a 146\,\%
|
72.7\,Mbps under the reorder-only configuration, a 146\,\%
|
||||||
increase from a three-line sysctl change — and
|
increase from a three-line sysctl change, and the retransmit
|
||||||
the retransmit
|
|
||||||
rate at Medium dropped from ${\sim}$2.4\,\% to
|
rate at Medium dropped from ${\sim}$2.4\,\% to
|
||||||
1.11\,\%, which
|
1.11\,\%, which
|
||||||
means more than half of the original retransmissions were
|
means more than half of the original retransmissions were
|
||||||
spurious. The Nix cache download at Medium roughly halved,
|
spurious. The Nix cache download at Medium roughly halved,
|
||||||
from 58.6\,s to 29.1\,s.
|
from 58.6\,s to 29.1\,s.
|
||||||
|
|
||||||
Parallel TCP gained more. Internal at Low
|
Parallel TCP gained even more. Internal at Low climbed from
|
||||||
climbed from 277 to
|
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
|
||||||
902\,Mbps, a 226\,\% increase that not only
|
old single-stream best and overtakes Headscale's original
|
||||||
exceeds Internal's
|
718\,Mbps from the unmodified run. %
|
||||||
old single-stream best but actually overtakes Headscale's
|
|
||||||
original 718\,Mbps from the unmodified run. %
|
|
||||||
% TODO: DOWNSTREAM
|
% TODO: DOWNSTREAM
|
||||||
% DEPENDENCY — "six concurrent flows" inherits
|
% DEPENDENCY — "six concurrent flows" inherits
|
||||||
% the unresolved
|
% the unresolved
|
||||||
@@ -2024,9 +2055,10 @@ the kernel to gVisor reproduces the effect. Then we checked
|
|||||||
which Tailscale code path the test rig was actually running.
|
which Tailscale code path the test rig was actually running.
|
||||||
|
|
||||||
\subsection{The data path that was not there}
|
\subsection{The data path that was not there}
|
||||||
|
\label{sec:gvisor_not_in_path}
|
||||||
|
|
||||||
In default mode — what anyone running \texttt{tailscale up}
|
In default mode (what anyone running \texttt{tailscale up}
|
||||||
on a Linux host gets — the Tailscale client creates a real
|
on a Linux host gets), the Tailscale client creates a real
|
||||||
kernel TUN device, registers a route for the
|
kernel TUN device, registers a route for the
|
||||||
Tailscale subnet
|
Tailscale subnet
|
||||||
through it, and forwards inbound and outbound
|
through it, and forwards inbound and outbound
|
||||||
@@ -2054,20 +2086,19 @@ running inside \texttt{tailscaled} itself (Tailscale SSH,
|
|||||||
Taildrop, the metric endpoint). External processes such as
|
Taildrop, the metric endpoint). External processes such as
|
||||||
iPerf3 cannot reach the Tailscale network in that mode.
|
iPerf3 cannot reach the Tailscale network in that mode.
|
||||||
|
|
||||||
The test rig does not use that mode. As shown in
|
The test rig does not use that mode. The benchmark suite's
|
||||||
Listing~\ref{lst:rig_interface_name}, the benchmark
|
Headscale module sets the interface name to
|
||||||
suite's Headscale module sets the interface name to
|
\texttt{ts-\$\{instanceName\}}
|
||||||
\texttt{ts-\$\{instanceName\}}, resolving to
|
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
|
||||||
\texttt{tailscaled --tun ts-headscale}: a real kernel
|
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
|
||||||
TUN. gVisor netstack is therefore unreachable from
|
External benchmark traffic cannot reach gVisor netstack at all.
|
||||||
external benchmark traffic.
|
|
||||||
|
|
||||||
|
|
||||||
\lstinputlisting[language=Nix,caption={The
|
\lstinputlisting[language=Nix,caption={The
|
||||||
benchmark suite's
|
benchmark suite's
|
||||||
Headscale module sets \texttt{interfaceName} to a real kernel
|
Headscale module sets \texttt{interfaceName} to a real kernel
|
||||||
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
|
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
|
||||||
This means \texttt{tailscaled} runs as \texttt{tailscaled --tun ts-headscale}
|
This means \texttt{tailscaled} runs as \texttt{tailscaled --tun
|
||||||
|
ts-headscale}
|
||||||
on every test machine.
|
on every test machine.
|
||||||
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
|
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
|
||||||
|
|
||||||
@@ -2075,7 +2106,7 @@ The empirical fingerprint pins the same conclusion down without
|
|||||||
source-code reading. Headscale itself gained +21\,\% at Medium
|
source-code reading. Headscale itself gained +21\,\% at Medium
|
||||||
from the host-kernel sysctl tuning. If Headscale's iPerf3
|
from the host-kernel sysctl tuning. If Headscale's iPerf3
|
||||||
traffic were processed by gVisor netstack, host-kernel sysctls
|
traffic were processed by gVisor netstack, host-kernel sysctls
|
||||||
would change nothing — they configure the host kernel TCP stack
|
would change nothing; they configure the host kernel TCP stack
|
||||||
and only the host kernel TCP stack. The fact that Headscale moves
|
and only the host kernel TCP stack. The fact that Headscale moves
|
||||||
measurably under those sysctls is direct evidence that
|
measurably under those sysctls is direct evidence that
|
||||||
Headscale's application TCP runs on the host kernel stack, just
|
Headscale's application TCP runs on the host kernel stack, just
|
||||||
@@ -2097,8 +2128,8 @@ the gVisor TCP business at all.
|
|||||||
The puzzle the investigation began with has not gone away.
|
The puzzle the investigation began with has not gone away.
|
||||||
Headscale starts at 41.5\,Mbps where Internal starts at
|
Headscale starts at 41.5\,Mbps where Internal starts at
|
||||||
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
|
||||||
TCP stack. Whatever Headscale is doing — partially, weakly, but
|
TCP stack. Whatever Headscale is doing (partially, weakly, but
|
||||||
reproducibly — is worth roughly twelve megabits per second on the
|
reproducibly) is worth roughly twelve megabits per second on the
|
||||||
Medium profile, and it is not gVisor netstack.
|
Medium profile, and it is not gVisor netstack.
|
||||||
|
|
||||||
The +21\,\% sysctl gain for Headscale itself is also informative
|
The +21\,\% sysctl gain for Headscale itself is also informative
|
||||||
@@ -2143,8 +2174,8 @@ The second is the 7\,MiB outer-UDP socket buffer that
|
|||||||
\texttt{SO\_*BUFFORCE} variant where available so the value is
|
\texttt{SO\_*BUFFORCE} variant where available so the value is
|
||||||
honoured even past \texttt{net.core.rmem\_max}. The host kernel
|
honoured even past \texttt{net.core.rmem\_max}. The host kernel
|
||||||
default is in the low hundreds of KiB. Under burst-correlated
|
default is in the low hundreds of KiB. Under burst-correlated
|
||||||
impairment — Medium and High both use 50\,\% correlation, so
|
impairment (Medium and High both use 50\,\% correlation, so
|
||||||
losses and reorderings cluster — this larger buffer absorbs
|
losses and reorderings cluster), this larger buffer absorbs
|
||||||
spikes in arrival rate that would otherwise overflow the kernel
|
spikes in arrival rate that would otherwise overflow the kernel
|
||||||
UDP receive queue and surface as additional inner-TCP losses.
|
UDP receive queue and surface as additional inner-TCP losses.
|
||||||
Internal has no such cushion on its incoming wire path.
|
Internal has no such cushion on its incoming wire path.
|
||||||
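For readers unfamiliar with the \texttt{*BUFFORCE} socket options,
the sketch below shows the mechanism on a plain UDP socket. It is an
illustration of the socket option, not wireguard-go's code; the
7\,MiB value is simply the figure quoted above, and the port number
is arbitrary.

\begin{lstlisting}[language=Go]
// Illustration of SO_RCVBUFFORCE: with CAP_NET_ADMIN it sets a receive
// buffer above net.core.rmem_max, where plain SO_RCVBUF would be clamped.
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	const size = 7 << 20 // the 7 MiB outer-socket buffer quoted above
	raw.Control(func(fd uintptr) {
		if err := unix.SetsockoptInt(int(fd), unix.SOL_SOCKET,
			unix.SO_RCVBUFFORCE, size); err != nil {
			// Without CAP_NET_ADMIN, fall back to the clamped variant.
			unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, size)
		}
	})
}
\end{lstlisting}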
@@ -2244,16 +2275,15 @@ the bare-metal host more than half of its achievable throughput.
|
|||||||
Three lines of \texttt{sysctl} repair it. The fix is portable to
|
Three lines of \texttt{sysctl} repair it. The fix is portable to
|
||||||
any Linux host and entirely independent of any VPN.
|
any Linux host and entirely independent of any VPN.
|
||||||
|
|
||||||
The unresilient finding — the one that motivated us to write this
|
The less durable finding, and the one that motivated this section,
|
||||||
section in the first place — is that Tailscale's much-discussed
|
is that Tailscale's much-discussed userspace TCP stack is not in
|
||||||
userspace TCP stack is, for the workload that exposed the
|
the data path for the workload that exposed the anomaly. The
|
||||||
anomaly, sitting on the bench. The advantage we attributed to it
|
advantage we attributed to it comes from a more ordinary place:
|
||||||
must come from a more ordinary place: the way
|
the way \texttt{wireguard-go} batches and coalesces packets
|
||||||
\texttt{wireguard-go} batches and coalesces packets between the
|
between the wire and the kernel TCP stack, and the larger UDP
|
||||||
wire and the kernel TCP stack, and the larger UDP buffer it pins
|
buffer it pins on its outer socket. We were chasing the wrong
|
||||||
on its outer socket. We were chasing the wrong hypothesis with
|
hypothesis with the right experiment, and the experiment turned
|
||||||
the right experiment, and the experiment turned out to be more
|
out to be more useful than the hypothesis.
|
||||||
useful than the hypothesis.
|
|
||||||
|
|
||||||
% TODO: These sections are empty stubs but the chapter
|
% TODO: These sections are empty stubs but the chapter
|
||||||
% introduction (line 12--13) promises "findings from the source
|
% introduction (line 12--13) promises "findings from the source
|
||||||
|
|||||||