verified all numbers

2026-04-10 11:18:40 +02:00
parent 0e636ee5f3
commit 13633f092a
+251 -221
@@ -9,10 +9,9 @@ ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behaviour. A recurring theme is
that no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.
@@ -26,38 +25,32 @@ the VPN itself. Throughout the plots in this section, the
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal throughput with only a
single retransmit across an entire 30-second test. Mycelium sits at
the other extreme: 34.9\,ms of latency, roughly 58$\times$ the
bare-metal figure.

A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is
only a control-plane server. Statements below about ``Headscale''
running \texttt{wireguard-go} should be read as statements about
the Tailscale client in this scenario.
Section~\ref{sec:tailscale_degraded} covers the specifics of how
the rig launches \texttt{tailscaled} and which Tailscale code
paths that choice activates.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. Actual benchmark execution
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
installation and deployment accounted for another 45~minutes
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
for VPN tunnels to come up after restarts. VPN service restarts and
traffic-control (tc) stabilization took the remainder.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
@@ -146,8 +139,8 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates
needs. ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard delivers data with negligible in-tunnel loss. The
bare-metal Internal reference sits at 1.7~retransmits per test,
essentially noise, and the VPNs split into three groups around
it: \emph{clean} ($<$110: WireGuard, Yggdrasil, Headscale),
\emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud),
and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
@@ -187,10 +180,10 @@ and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each one triggers a reduction in the congestion window
(\texttt{cwnd}) and throttles the sender.
Figure~\ref{fig:retransmit_correlations} shows the relationship:
Hyprspace, with 4965
retransmits, maintains the smallest max congestion window in the
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
window, the largest of any VPN. At first glance this suggests a
@@ -200,24 +193,31 @@ largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size. The reliable conclusion is
simpler: high retransmit rates force TCP to spend more time in
congestion recovery than in steady-state transmission, and that
caps throughput regardless of available bandwidth. ZeroTier
illustrates the opposite extreme: brute-force retransmission can
still yield high throughput (814\,Mbps with 1\,163 retransmits), at
the cost of wasted bandwidth and unstable flow behavior.
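A standard way to quantify the mechanism is the Mathis et~al.\
approximation for loss-limited TCP throughput (a simplified
steady-state model, quoted for intuition rather than fitted to
these measurements):
\[
\text{throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}}
\cdot \frac{C}{\sqrt{p}}, \qquad C \approx 1.22,
\]
where $p$ is the loss probability the sender observes. Throughput
degrades only with $\sqrt{p}$, which is how ZeroTier can still
post 814\,Mbps despite 1\,163 retransmits, but it scales with
$1/\mathrm{RTT}$, so an implementation that both drops and queues
packets is punished on both factors at once.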
VpnCloud stands out: its sender reports 538.8\,Mbps but the
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit count (857)
alone does not fully explain.
% TODO: Clarify whether the headline TCP table
% (Table~\ref{tab:tcp_baseline}, 539\,Mbps for VpnCloud) reports
% sender or receiver throughput. The prose here cites sender
% 538.8 vs.\ receiver 413.4 --- the 539 figure matches the sender
% column, so the table caption should say so explicitly. Same
% clarification needed for Hyprspace (368 in table vs.\ sender
% 367.9 / receiver 419.8 in the pathological-cases paragraph).
Variability, whether stochastic across runs or systematic across
links, also differs substantially. WireGuard's three link
directions cluster tightly (824 to 884\,Mbps, a 60\,Mbps window)
and are nearly indistinguishable. Mycelium's three directions span
122 to 379\,Mbps, a 3:1 ratio, but this is not run-to-run noise:
Section~\ref{sec:mycelium_routing} shows the spread is per-link
path-selection asymmetry, with one link finding a direct route and
@@ -315,25 +315,21 @@ interference that the average hides.
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
but only the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet something caps the overall rate. The qperf
benchmark reports Tinc maxing out at 14.9\,\% total system CPU while
delivering 336\,Mbps. On a multi-core host this figure is consistent
with a single saturated core, which fits Tinc's single-threaded
userspace architecture: one core encrypts, copies, and forwards
packets, and the remaining cores sit idle. But VpnCloud reports the
same 14.9\,\% and still reaches 539\,Mbps (60\,\% more than Tinc),
so whole-system CPU alone cannot explain the gap, and a per-packet
processing cost difference must also be in play.
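The arithmetic behind the single-core reading: one pegged core on
an $N$-core host shows up as $100/N\,\%$ whole-system utilization,
so
\[
\frac{100\,\%}{N} \approx 14.9\,\% \;\Longrightarrow\; N \approx 6.7,
\]
consistent with one saturated core plus a little interrupt work on
a host with seven or eight logical cores. The rig's actual core
count is not stated in this chapter, so this is a plausibility
check rather than proof.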
% TODO: 14.9\% total CPU does not pin the bottleneck on its own.
% This is whole-system utilization on a multi-core machine, and a
% single saturated core fits the budget --- but VpnCloud reports the
% same 14.9\% \emph{and} reaches 539\,Mbps. Verify with per-thread
% CPU sampling or eBPF profiling to confirm the single-core story
% and quantify the per-packet cost difference.
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.
@@ -346,9 +342,9 @@ spot.
The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
9.7\,\%. WireGuard sits at 30.8\,\%, higher than expected for a
kernel-level implementation; in-kernel cryptographic processing
is the likely cause, though no profiling data confirms this.
On the efficient end, VpnCloud
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least
CPU time. Nebula and Headscale are missing from
@@ -416,8 +412,10 @@ Table~\ref{tab:parallel_scaling} lists the results.
The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product (the amount
of in-flight data a TCP flow needs to saturate a link, equal to the
link bandwidth times the round-trip time) demands a window larger
than any single flow maintains.
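With the 934\,Mbps bare-metal rate and Mycelium's 34.9\,ms RTT,
the arithmetic is stark:
\[
\mathrm{BDP} = 934\,\text{Mbps} \times 34.9\,\text{ms}
\approx 32.6\,\text{Mbit} \approx 4.1\,\text{MB},
\]
roughly four times the ${\sim}$1\,MiB per-flow buffer a stock
Linux kernel allows by default (the same defaults quoted in
Section~\ref{sec:tailscale_degraded}).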
Multiple concurrent flows compensate for that constraint and push
throughput to 2.20$\times$ the single-stream figure. Hyprspace
scales almost as well (2.18$\times$) for the same reason but with
a different
@@ -425,7 +423,7 @@ bottleneck. Its libp2p send pipeline accumulates roughly
2\,800\,ms of under-load latency
(Section~\ref{sec:hyprspace_bloat}), which gives any single TCP
flow a bandwidth-delay product on the order of hundreds of
megabytes to fill, far beyond any single kernel cwnd. And
because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the
three concurrent peer pairs in the parallel benchmark each get
@@ -440,8 +438,9 @@ more of the bloated pipeline than one can.
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
% per-flow window evolution nor the actual under-load latency
% has been measured directly. A tcpdump of one Hyprspace
% iPerf3 run with inter-arrival timing analysis would settle it.
Tinc picks up a
1.68$\times$ boost because several streams can collectively keep its
single-threaded CPU busy during what would otherwise be idle gaps in
a single flow.
@@ -449,9 +448,9 @@ a single flow.
% TODO: "zero retransmits" in parallel mode is not shown in any table % TODO: "zero retransmits" in parallel mode is not shown in any table
% or figure. Add parallel-mode retransmit data or remove the claim. % or figure. Add parallel-mode retransmit data or remove the claim.
WireGuard and Internal both scale cleanly at around WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with zero retransmits, suggesting that 1.48--1.50$\times$ with zero retransmits. This is consistent
WireGuard's overhead is a fixed per-packet cost that does not worsen with WireGuard's overhead being a fixed per-packet cost that does
under multiplexing. not worsen under multiplexing.
Nebula is the only VPN that actually gets \emph{slower} with more Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps streams: throughput drops from 706\,Mbps to 648\,Mbps
@@ -498,8 +497,9 @@ The sender throughput values are artifacts: they reflect how fast the
sender can write to the socket, not how fast data traverses the
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
throughput because it uses a 32,731-byte block size (a jumbo-frame
overlay MTU), which inflates the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
\begin{table}[H]
\centering
@@ -537,16 +537,19 @@ because the sender overwhelms the tunnel's userspace processing capacity.
Headscale shares WireGuard's cryptographic protocol but, contrary to
intuition, does not share its kernel datapath: Tailscale's
\texttt{magicsock} layer intercepts every packet to handle endpoint
selection and DERP (Designated Encrypted Relay for Packets,
Tailscale's TLS-over-TCP relay network used when a direct UDP path
between peers cannot be established), which is incompatible with the
in-kernel WireGuard module. Headscale therefore runs
\texttt{wireguard-go} entirely in userspace, and the unbounded
\texttt{-b~0} flood overruns that userspace pipeline just as it
overruns every other userspace implementation, producing
69.8\,\% loss despite the WireGuard branding.
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
(due to its large block size) but loses almost all of it. These loss
rates do not reflect real-world UDP behavior but reveal which VPNs
implement effective flow control. Hyprspace and Mycelium could not
complete the UDP test at all; both timed out after 120 seconds.
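For reference, the flood in question corresponds to an iPerf3
invocation of roughly this shape (the peer address is a
placeholder; the rig's exact wrapper is not reproduced in this
chapter):
\begin{lstlisting}[language=bash]
# Unbounded UDP flood: -b 0 removes iPerf3's pacing, so the sender
# writes as fast as the UDP socket accepts. Loss percentages come
# from the server-side report, hence --get-server-output.
iperf3 --client <peer-overlay-addr> --udp -b 0 --time 30 \
       --get-server-output
\end{lstlisting}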
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
% the path MTU. It is derived from the socket MSS and reflects the
@@ -743,19 +746,11 @@ overwhelm FEC entirely.
\subsection{Operational Resilience}

Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
gets there. Sustained-load numbers do not predict recovery speed,
and for operational use the time a tunnel takes to come up after a
reboot matters as much as its peak throughput.
Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
@@ -768,18 +763,21 @@ benchmarks use the default). After a reboot, a node must
wait until the next periodic update before its lighthouses learn
its new endpoint, so the reconnection time tracks the timer rather
than any topology-dependent convergence.

Mycelium sits at the opposite end at 76.6~seconds, and its three
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
the cost of rebuilding that tree after a reboot: a node close to the
current root reconverges quickly, while one further out must wait
for updated parent information to propagate hop-by-hop before it
can route traffic.
\begin{figure}[H]
\centering
@@ -823,14 +821,14 @@ earlier benchmarks into per-VPN diagnoses.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase. The network itself has capacity to spare;
the VPN tunnel is filling up with buffered packets and failing to
drain.

The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays
@@ -841,7 +839,7 @@ The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.
% Should we always use percentages for retransmits?
@@ -891,7 +889,7 @@ Since the benchmark targets the regular Hyprspace IPv4/IPv6
addresses rather than service-network proxies, both endpoints
rely on their host kernel's TCP stack for the entire transfer.
Whatever options Hyprspace's gVisor instance might set
internally (congestion control, loss recovery, buffer sizes)
are therefore irrelevant to these measurements; the inner TCP
state machine the kernel runs is the only one in the path.
The same caveat applies more sharply to Tailscale, where the
@@ -900,9 +898,13 @@ stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.

If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack instead. Hyprspace uses
\texttt{libp2p}, a peer-to-peer networking library, and its
\texttt{yamux} stream multiplexer, which runs many logical streams
over a single underlying connection and polices each one with a
credit-based flow-control window. The most plausible source of
the bloat is this libp2p/yamux layer, through which raw IP packets
are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
@@ -916,10 +918,10 @@ collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments: the kernel
sees only that the TUN write returned, so it keeps growing its
congestion window while the libp2p layer falls further behind. The
geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.
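A rough consistency check via Little's law (an estimate, not a
measurement): if the pipeline drains at about the 368\,Mbps the
sender sustains while new packets wait ${\sim}$2.8\,s behind
queued data, the standing queue holds
\[
368\,\text{Mbps} \times 2.8\,\text{s} \approx 1.0\,\text{Gbit}
\approx 129\,\text{MB},
\]
which matches the hundreds-of-megabytes bandwidth-delay product
attributed to the bloated pipeline in the parallel-scaling
discussion.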
@@ -1036,10 +1038,15 @@ background.
Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
fixed convergence timer in the overlay protocol, most likely a
default wait interval hard-coded into the reconnection logic. A
topology-dependent recovery time, by contrast, would vary with each
node's position in the overlay: a node near an active peer would
reconverge quickly while one further away would wait longer for
routing information to reach it. Mycelium shows no such variation,
so the bound is almost certainly a timer rather than a propagation
delay.
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
@@ -1047,49 +1054,46 @@ default interval rather than anything topology-dependent.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.

\paragraph{Tinc: Userspace Processing Bottleneck.}
The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply: Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 14.9\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.

The parallel benchmark confirms the diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
Several concurrent TCP streams keep that one core busy through
the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY --- this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
\section{Impact of network impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds roughly 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
reordering. Medium and High both use 50\,\% correlation, so
losses and reorderings are bursty rather than uniform. Two
results dominate the data.
% TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
% delay numbers are cross-checked against later prose in this
% chapter, but the correlation and jitter values should be
% verified against the authoritative profile definition.
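For concreteness, the Medium profile as described maps onto a
\texttt{netem} configuration of roughly the following shape. The
interface name is illustrative, and the rig's real invocation lives
in its NixOS modules rather than in this chapter:
\begin{lstlisting}[language=bash]
# Medium profile sketch: ~4 ms delay, 1 % loss, 2 % reordering,
# loss and reordering both with 50 % correlation (bursty).
tc qdisc add dev eth0 root netem \
    delay 4ms \
    loss 1% 50% \
    reorder 2% 50%
\end{lstlisting}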
The first is the collapse of the throughput hierarchy. At High
impairment, the 675\,Mbps spread between fastest and slowest
@@ -1106,8 +1110,8 @@ Section~\ref{sec:tailscale_degraded} pursues this anomaly
through what turns out to be the wrong hypothesis. The
investigation begins with Tailscale's much-discussed gVisor TCP
stack, validates the candidate parameters in isolation on the
bare-metal host, and only then discovers, by reading the rig's
own NixOS module, that the gVisor stack is not actually in the
data path of the benchmark at all. The real culprit is a
combination of the Linux kernel's tight default
\texttt{tcp\_reordering} threshold and the way
@@ -1313,6 +1317,16 @@ every lost or reordered outer packet costs roughly
retransmitted inner data than a standard 1\,400-byte
MTU VPN would lose.
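The size of that liability follows directly from the MTU ratio:
\[
\frac{32\,731\,\text{B}}{{\sim}1\,400\,\text{B}} \approx 23\times
\]
as much inner data voided per lost outer packet as on a
standard-MTU VPN, ignoring header overhead.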
% TODO: The jumbo-MTU-as-liability argument is reused in several
% places (TCP impairment, QUIC impairment, RIST video, and
% §sec:baseline Tier analysis). In each it is presented as a
% mechanism rather than a measurement. Consider running one
% controlled experiment --- force Yggdrasil to a standard
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
% profiles --- to test the hypothesis directly, or consolidate
% the argument into a single "jumbo-MTU liability" paragraph and
% cite it from the other sections instead of restating the
% mechanism each time.
Headscale retains 34.3\% of its baseline throughput at Low, almost
@@ -1444,6 +1458,15 @@ indicator than as a throughput measurement. A VPN that cannot
complete a 30-second UDP flood under 0.25\% packet loss has a
flow-control problem that will surface under real workloads too,
even when the symptoms are milder.
% TODO: Non-monotonic failure pattern (Internal and WireGuard
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
% fail selectively) is never explained and directly undermines
% the "robustness indicator" framing above. Reproduce one of
% the failing Low-profile runs with iPerf3 debug logging and
% \texttt{tc -s qdisc show} to establish whether these are VPN
% flow-control failures, iPerf3/tc interaction artefacts, or
% timing issues; then either explain the pattern or soften the
% robustness-indicator claim.
\subsection{Parallel TCP}
@@ -1552,10 +1575,10 @@ At High impairment, WireGuard (23.2\,Mbps), VpnCloud
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of one another. At baseline these four
span a 188\,Mbps
range (656 to 844\,Mbps). At this point QUIC's own congestion
control is the sole limiter: it runs on top of an
already-degraded outer link and cannot push past
${\sim}$23\,Mbps regardless of the VPN underneath.
\begin{figure}[H]
\centering
@@ -1742,6 +1765,20 @@ Section~\ref{sec:tailscale_degraded} explains why.
\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}
% TODO: Editorial pass needed on two chapter-wide issues before
% submission:
% (1) magicsock / wireguard-go userspace-datapath explanation is
% repeated three times in slightly different forms (once in
% baseline UDP, once in impairment UDP, once here). Consider
% introducing it once in full here, where it is load-bearing,
% and replacing the earlier occurrences with one-sentence
% forward references.
% (2) This section uses first-person plural ("we pursued", "we
% worked it out", "we ran two follow-up benchmarks") while
% the rest of the chapter is in impersonal voice. Either
% harmonise everything to one voice, or explicitly frame this
% section as a first-person narrative detour.
This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
@@ -1753,7 +1790,7 @@ chasing the obvious answer to its end.
\subsection{An anomaly worth pursuing}

At Medium impairment, Headscale reaches 41.5\,Mbps on a single
TCP stream against Internal's 29.6\,Mbps, a 40\,\% lead for
the VPN over the direct host-to-host link it tunnels through.
Headscale costs the expected ${\sim}$14\,\% at baseline, and at
Low and High impairment it lags Internal by some margin. Yet at
@@ -1837,12 +1874,12 @@ imports Google's gVisor netstack
it as an in-process TCP implementation. The gVisor
documentation is direct about why this matters: netstack is
designed for adverse networks where the host kernel's TCP
defaults are too aggressive. Tailscale's release notes go further
and name specific overrides
on top of gVisor; the most visible are an explicit RACK disable
and 8\,MiB / 6\,MiB receive and send buffers.

The Tailscale source code bears this out.
\texttt{wgengine/netstack/netstack.go} contains the netstack
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
reproduces the relevant overrides verbatim. RACK is disabled
@@ -1863,25 +1900,22 @@ enabled (gVisor's default is off).
\texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}
Read against the Linux kernel defaults (RACK on, CUBIC by
default, ${\sim}$1\,MiB receive and send buffers,
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
overrides describe a TCP stack better suited to a lossy,
reordering link than the host kernel. The hypothesis follows
directly: Headscale's iPerf3 traffic
runs through this gVisor instance instead of through the host
kernel TCP stack, and so it inherits the more
reordering-tolerant behaviour. WireGuard-the-kernel-module
shares only the cryptographic protocol; it does not include
the gVisor stack, and therefore does not get the advantage.
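The kernel-default side of that comparison can be read off any of
the test hosts directly (standard Linux sysctls, listed here for
reference):
\begin{lstlisting}[language=bash]
# Defaults relevant to the comparison: RACK recovery, congestion
# control, socket buffer caps, reordering threshold, Tail Loss Probe.
sysctl net.ipv4.tcp_recovery            # 1 = RACK enabled
sysctl net.ipv4.tcp_congestion_control  # cubic
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_reordering          # 3
sysctl net.ipv4.tcp_early_retrans       # 3 = TLP enabled
\end{lstlisting}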
The natural way to test this is to extract
the parameters Tailscale sets inside gVisor, apply their
nearest Linux equivalents to the bare-metal host as sysctls,
and see whether Internal, with no VPN at all, picks up the
same advantage. If it does, the gVisor explanation is
supported. If it does not, the hypothesis fails.
@@ -1951,21 +1985,18 @@ impairment setup as the original 18.12.2025 run.
The result felt like confirmation. Internal's
Medium-impairment throughput jumped from 29.6\,Mbps to
72.7\,Mbps under the reorder-only configuration, a 146\,\%
increase from a three-line sysctl change, and the retransmit
rate at Medium dropped from ${\sim}$2.4\,\% to 1.11\,\%, which
means more than half of the original retransmissions were
spurious. The Nix cache download at Medium roughly halved,
from 58.6\,s to 29.1\,s.
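The chapter names the change only as the reorder-only
configuration; the three lines themselves are not reproduced at
this point. A plausible shape, assuming the tuning targets the
reordering machinery listed above (an assumption, not the rig's
verified configuration), would be:
\begin{lstlisting}[language=bash]
# Hypothetical reorder-only tuning: raise the reordering threshold
# and disable the loss-detection heuristics that fire spuriously
# under heavy reordering (RACK and Tail Loss Probe).
sysctl -w net.ipv4.tcp_reordering=300
sysctl -w net.ipv4.tcp_recovery=0
sysctl -w net.ipv4.tcp_early_retrans=0
\end{lstlisting}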
Parallel TCP gained even more. Internal at Low climbed from
277 to 902\,Mbps, a 226\,\% increase. This exceeds Internal's
old single-stream best and overtakes Headscale's original
718\,Mbps from the unmodified run. %
% TODO: DOWNSTREAM
% DEPENDENCY --- "six concurrent flows" inherits
% the unresolved
@@ -2024,9 +2055,10 @@ the kernel to gVisor reproduces the effect. Then we checked
which Tailscale code path the test rig was actually running.

\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}
In default mode (what anyone running \texttt{tailscale up}
on a Linux host gets), the Tailscale client creates a real
kernel TUN device, registers a route for the Tailscale subnet
through it, and forwards inbound and outbound
@@ -2054,20 +2086,19 @@ running inside \texttt{tailscaled} itself (Tailscale SSH,
Taildrop, the metric endpoint). External processes such as
iPerf3 cannot reach the Tailscale network in that mode.

The test rig does not use that mode. The benchmark suite's
Headscale module sets the interface name to
\texttt{ts-\$\{instanceName\}}
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
External benchmark traffic cannot reach gVisor netstack at all.

\lstinputlisting[language=Nix,caption={The benchmark suite's
Headscale module sets \texttt{interfaceName} to a real kernel
TUN name (\texttt{ts-<instance>}, truncated to 15 characters).
This means \texttt{tailscaled} runs as
\texttt{tailscaled --tun ts-headscale} on every test machine.
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}
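A quick way to confirm that datapath on a running test node,
without reading any source, is to check that the interface exists
as a real kernel link (standard iproute2):
\begin{lstlisting}[language=bash]
# A kernel TUN device appears as an ordinary link; in
# --tun=userspace-networking mode no such interface exists.
ip link show ts-headscale
\end{lstlisting}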
@@ -2075,7 +2106,7 @@ The empirical fingerprint pins the same conclusion down without
source-code reading. Headscale itself gained +21\,\% at Medium
from the host-kernel sysctl tuning. If Headscale's iPerf3
traffic were processed by gVisor netstack, host-kernel sysctls
would change nothing; they configure the host kernel TCP stack
and only the host kernel TCP stack. The fact that Headscale moves
measurably under those sysctls is direct evidence that
Headscale's application TCP runs on the host kernel stack, just
@@ -2097,8 +2128,8 @@ the gVisor TCP business at all.
The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at
29.6\,Mbps, and both run their iPerf3 TCP on the same host kernel
TCP stack. Whatever Headscale is doing (partially, weakly, but
reproducibly) is worth roughly twelve megabits per second on the
Medium profile, and it is not gVisor netstack.

The +21\,\% sysctl gain for Headscale itself is also informative
@@ -2143,8 +2174,8 @@ The second is the 7\,MiB outer-UDP socket buffer that
\texttt{SO\_*BUFFORCE} variant where available so the value is
honoured even past \texttt{net.core.rmem\_max}. The host kernel
default is in the low hundreds of KiB. Under burst-correlated
impairment (Medium and High both use 50\,\% correlation, so
losses and reorderings cluster), this larger buffer absorbs
spikes in arrival rate that would otherwise overflow the kernel
UDP receive queue and surface as additional inner-TCP losses.
Internal has no such cushion on its incoming wire path.
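A rough sizing of that cushion: at the 934\,Mbps line rate, a
7\,MiB socket buffer can absorb
\[
\frac{7 \times 2^{20} \times 8\,\text{bit}}{934\,\text{Mbps}}
\approx 63\,\text{ms}
\]
of full-rate arrivals before dropping, where the kernel's default
few hundred KiB covers only a couple of milliseconds.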
@@ -2244,16 +2275,15 @@ the bare-metal host more than half of its achievable throughput.
Three lines of \texttt{sysctl} repair it. The fix is portable to
any Linux host and entirely independent of any VPN.

The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage we attributed to it comes from a more ordinary place:
the way \texttt{wireguard-go} batches and coalesces packets
between the wire and the kernel TCP stack, and the larger UDP
buffer it pins on its outer socket. We were chasing the wrong
hypothesis with the right experiment, and the experiment turned
out to be more useful than the hypothesis.
% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source