% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment, with source-code excerpts woven in
where they explain the measured behavior. A recurring theme is
that no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.
\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just 0.60\,ms. WireGuard reaches 92.5\,\% of bare-metal
throughput with only a single retransmit across an entire 30-second
test. Mycelium sits at the other extreme: 34.9\,ms of latency,
roughly 58$\times$ the bare-metal figure.

A note on naming: ``Headscale'' in every table and figure of this
chapter labels the test scenario in which the Tailscale client
(\texttt{tailscaled}) connects to a self-hosted Headscale control
server. The data plane is therefore the Tailscale client built on
\texttt{wireguard-go}, not the Headscale binary itself, which is
only a control-plane server. Statements below about ``Headscale''
running \texttt{wireguard-go} should be read as statements about
the Tailscale client in this scenario.
Section~\ref{sec:tailscale_degraded} covers the specifics of how
the rig launches \texttt{tailscaled} and which Tailscale code
paths that choice activates.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. Actual benchmark execution
consumed the bulk of that time at 2.6~hours (63\,\%). VPN
installation and deployment accounted for another 45~minutes
(19\,\%), and the test rig spent roughly 21~minutes (9\,\%) waiting
for VPN tunnels to come up after restarts. VPN service restarts and
traffic-control (tc) stabilization took the remainder.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
performance benchmark after six retries, while Hyprspace and Mycelium
failed the UDP iPerf3 test with a 120-second timeout. Their
individual success rate is 85.7\,\%, with all other VPNs passing the
full suite (Figure~\ref{fig:success_rate}).
\begin{figure}[H]
\centering
\begin{subfigure}[t]{1.0\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Average Test
Duration per Machine}.png}
\caption{Average test duration per VPN, including installation
time and benchmark execution}
\label{fig:test_duration}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{1.0\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Benchmark
Success Rate}.png}
\caption{Benchmark success rate across all seven tests}
\label{fig:success_rate}
\end{subfigure}
\caption{Test execution overview. Hyprspace has the longest average
duration due to UDP timeouts and long VPN connectivity
waits. WireGuard completes fastest. Nebula, Headscale,
Hyprspace, and Mycelium each fail one benchmark.}
\label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages. Three distinct performance tiers emerge, separated by
natural gaps in the data.

\begin{table}[H]
\centering
\caption{Single-stream TCP throughput at baseline, sorted by
throughput. Retransmits are averaged per 30-second test across
all three link directions. The horizontal rules separate the
three performance tiers.}
\label{tab:tcp_baseline}
\begin{tabular}{lrrr}
\hline
\textbf{VPN} & \textbf{Throughput (Mbps)} &
\textbf{Baseline (\%)} & \textbf{Retransmits} \\
\hline
Internal & 934 & 100.0 & 1.7 \\
WireGuard & 864 & 92.5 & 1 \\
ZeroTier & 814 & 87.2 & 1163 \\
Headscale & 800 & 85.6 & 102 \\
Yggdrasil & 795 & 85.1 & 75 \\
\hline
Nebula & 706 & 75.6 & 955 \\
EasyTier & 636 & 68.1 & 537 \\
VpnCloud & 539 & 57.7 & 857 \\
\hline
Hyprspace & 368 & 39.4 & 4965 \\
Tinc & 336 & 36.0 & 240 \\
Mycelium & 259 & 27.7 & 710 \\
\hline
\end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
Throughput}.png}
\caption{Average single-stream TCP throughput}
\label{fig:tcp_throughput}
\end{figure}
Raw throughput alone tells an incomplete story. The retransmit rate
(Figure~\ref{fig:tcp_retransmits}) normalizes raw retransmit counts
by estimated packet count, accounting for the different segment sizes
each VPN negotiates (1\,228 to 32\,731 bytes). WireGuard and
Headscale are effectively loss-free ($<$\,0.01\,\%). Tinc, EasyTier,
Nebula, and VpnCloud form a moderate band (0.03--0.06\,\%).
Yggdrasil, ZeroTier, and Mycelium cluster between 0.09\,\% and
0.13\,\%, and Hyprspace is the clear outlier at 0.49\,\%. ZeroTier
reaches 814\,Mbps despite a 0.10\,\% retransmit rate by compensating
for tunnel-internal loss through repeated TCP congestion-control
recovery; WireGuard delivers comparable throughput with effectively
zero loss.
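A sketch of the normalization, with the packet count estimated from
the bytes transferred and the negotiated segment size reported by
iPerf3:
\[
r_{\mathrm{retx}} \;=\; \frac{N_{\mathrm{retx}}}{B_{\mathrm{transferred}} / S_{\mathrm{segment}}}.
\]
For Hyprspace, roughly 1.4\,GB transferred in 30\,s at 368\,Mbps over
segments of about 1\,400 bytes gives on the order of a million
segments, so its ${\sim}$5\,000 retransmits land near the 0.49\,\%
shown in the figure.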
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
Retransmit Rate}.png}
\caption{TCP retransmit rate at baseline. WireGuard and Headscale
are effectively loss-free ($<$\,0.01\,\%). Hyprspace is the clear
outlier at 0.49\,\%.}
\label{fig:tcp_retransmits}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each one triggers a reduction in the congestion window
(\texttt{cwnd}) and throttles the sender.
Figure~\ref{fig:tcp_window} shows the raw window sizes, and
Figure~\ref{fig:retransmit_correlations} plots them against retransmit
rate. Hyprspace, with a 0.49\,\% retransmit rate, maintains the
smallest max congestion window in the dataset (200\,KB), while
Yggdrasil's 0.09\,\% rate allows a 4.2\,MB window, the largest of
any VPN. At first glance this suggests a clean inverse correlation
between retransmit rate and congestion window size, but the picture
is misleading. Yggdrasil's outsized window is largely an artifact of
its jumbo overlay MTU (32\,731 bytes): each segment carries far more
data, so the window in bytes is inflated relative to VPNs using a
standard ${\sim}$1\,400-byte MTU. Comparing congestion windows
across different MTU sizes is not meaningful without normalizing for
segment size. The reliable conclusion is simpler: high retransmit
rates force TCP to spend more time in congestion recovery than in
steady-state transmission, and that caps throughput regardless of
available bandwidth. ZeroTier illustrates the opposite extreme:
brute-force retransmission can still yield high throughput
(814\,Mbps at a 0.10\,\% rate), at the cost of wasted bandwidth and
unstable flow behavior.
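A back-of-the-envelope model makes the same point. The classic
Mathis et al.\ approximation for loss-limited TCP, used here only as
a sanity check and not as part of the measurement pipeline, relates
steady-state throughput to segment size, round-trip time, and loss
probability $p$:
\[
\mathrm{throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \frac{C}{\sqrt{p}},
\qquad C \approx 1.22 .
\]
The $1/\sqrt{p}$ term is why Hyprspace's 0.49\,\% rate hurts so much
more than ZeroTier's 0.10\,\%, and the MSS factor is why Yggdrasil's
jumbo segments buy it headroom that its raw window size overstates.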
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Max TCP
Window Size}.png}
\caption{Maximum TCP window sizes (send and congestion) at baseline.
Yggdrasil's congestion window (4\,219\,KB) dwarfs all others but
is inflated by its 32\,KB jumbo overlay MTU. Hyprspace has the
smallest congestion window (200\,KB).}
\label{fig:tcp_window}
\end{figure}

VpnCloud stands out: its sender reports 538.8\,Mbps but the
receiver measures only 413.4\,Mbps, a 23\,\% gap and the largest
in the dataset. This points to significant in-tunnel packet loss
or buffering at the VpnCloud layer that the retransmit rate
(0.06\,\%) alone does not fully explain.
How much throughput varies, whether stochastically across runs or
systematically across links, also differs substantially between
VPNs. WireGuard's three link directions cluster tightly (824 to
884\,Mbps, a 60\,Mbps window) and are nearly indistinguishable.
Mycelium's three directions span 122 to 379\,Mbps, a 3:1 ratio, but
this is not run-to-run noise: Section~\ref{sec:mycelium_routing}
shows the spread is per-link path-selection asymmetry, with one link
finding a direct route and the other two routing through the global
overlay. Either way, a VPN whose throughput varies that widely
across links is harder to capacity-plan around than one that
delivers a consistent figure on every direction.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
\caption{Retransmits vs.\ throughput}
\label{fig:retransmit_throughput}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
\caption{Retransmits vs.\ max congestion window}
\label{fig:retransmit_cwnd}
\end{subfigure}
\caption{Retransmit correlations (log scale on x-axis). A high
retransmit rate does not always mean low throughput (ZeroTier:
0.10\,\%, 814\,Mbps), but an extreme rate does (Hyprspace:
0.49\,\%, 368\,Mbps). The apparent inverse correlation between
retransmit rate and congestion window size is dominated by
Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
by its 32\,KB jumbo overlay MTU rather than by a low retransmit
rate alone.}
\label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges. The table also
reports the average maximum RTT observed across test runs and the
resulting spike ratio (max/avg); a high ratio signals bursty tail
latency that the average alone conceals.
\begin{table}[H]
\centering
\caption{Ping RTT statistics at baseline, sorted by average latency.
The spike ratio is max\,RTT\,/\,avg\,RTT; higher values indicate
bursty tail latency.}
\label{tab:latency_baseline}
\begin{tabular}{lrrrr}
\hline
\textbf{VPN} & \textbf{Avg RTT (ms)} & \textbf{Max RTT (ms)}
& \textbf{Spike Ratio} & \textbf{Jitter (ms)} \\
\hline
Internal & 0.60 & 0.65 & 1.1$\times$ & 0.04 \\
VpnCloud & 1.13 & 3.14 & 2.8$\times$ & 0.25 \\
Tinc & 1.19 & 1.31 & 1.1$\times$ & 0.07 \\
WireGuard & 1.20 & 1.81 & 1.5$\times$ & 0.13 \\
Nebula & 1.25 & 1.53 & 1.2$\times$ & 0.10 \\
ZeroTier & 1.28 & 3.00 & 2.3$\times$ & 0.25 \\
EasyTier & 1.33 & 1.55 & 1.2$\times$ & 0.10 \\
\hline
Headscale & 1.64 & 1.81 & 1.1$\times$ & 0.09 \\
Hyprspace & 1.79 & 2.21 & 1.2$\times$ & 0.13 \\
Yggdrasil & 2.20 & 3.13 & 1.4$\times$ & 0.20 \\
\hline
Mycelium & 34.9 & 48.6 & 1.4$\times$ & 1.49 \\
\hline
\end{tabular}
\end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal
0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the
lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms),
yet its throughput tops out at only 539\,Mbps.
Low per-packet latency does not guarantee high bulk throughput. A
second group (Headscale, Hyprspace, Yggdrasil) lands in the
1.5--2.2\,ms range, representing moderate overhead. Then there is
Mycelium at 34.9\,ms, so far removed from the rest that
Section~\ref{sec:mycelium_routing} gives it a dedicated analysis.
The spike-ratio column in Table~\ref{tab:latency_baseline} exposes two
outliers among the low-latency VPNs. VpnCloud leads at
2.8$\times$ (avg 1.13\,ms, max 3.14\,ms) and ZeroTier follows at
2.3$\times$ (avg 1.28\,ms, max 3.00\,ms); both share the highest
jitter in the table (0.25\,ms). Tinc and Headscale, by contrast,
stay at or below 1.1$\times$ with jitter of at most 0.09\,ms, so
their packet timing is nearly as stable as bare metal. The spikes in
VpnCloud and ZeroTier are consistent with periodic control-plane
work such as key rotation or peer heartbeats that briefly stalls the
data path.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
\caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
massive outlier at 58$\times$ the internal baseline. VpnCloud is
the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
\label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
yet the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet something caps the overall rate.
Figure~\ref{fig:tcp_cpu} shows that Tinc uses only 12.3\,\% host CPU
during the TCP test. On a multi-core host this figure is consistent
with a single saturated core (on an eight-core machine, for example,
one fully busy core reads as 12.5\,\% system-wide), which fits
Tinc's single-threaded userspace architecture: one core encrypts,
copies, and forwards packets, and the remaining cores sit idle.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP CPU
Utilization}.png}
\caption{CPU utilization during TCP throughput tests, split by host
(sender) and remote (receiver). Tinc (12.3\,\%) and VpnCloud
(14.2\,\%) use similar CPU, yet VpnCloud achieves 60\,\% higher
throughput. Yggdrasil's low CPU (2.7\,\%) reflects its
kernel-level forwarding with jumbo segments.}
\label{fig:tcp_cpu}
\end{figure}

VpnCloud is also single-threaded and uses slightly more CPU
(14.2\,\%), yet reaches 539\,Mbps (60\,\% more throughput). The gap
comes down to per-packet cost. Tinc uses a hand-written
ChaCha20-Poly1305 implementation without hardware acceleration,
allocates a fresh stack buffer and copies the payload for each
packet, and routes through a splay-tree lookup. VpnCloud uses the
\texttt{ring} cryptographic library, which employs optimized assembly
and can select AES-128-GCM with hardware AES-NI instructions at
runtime; it encrypts in place with no extra buffer copies and routes
through an $O(1)$ hash-map lookup. These differences compound in a
tight single-threaded loop: every microsecond saved per packet raises
the maximum packet rate the one available core can sustain.
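A rough per-packet budget makes the arithmetic concrete; this is a
sketch that takes the effective payload sizes reported by the UDP
test as typical packet sizes:
\[
\frac{336 \times 10^{6}\ \mathrm{bit/s}}{8 \times 1353\ \mathrm{B}}
\approx 31\,000\ \mathrm{packets/s}
\;\;\Longrightarrow\;\;
\approx 32\ \mu\mathrm{s}\ \text{per packet for Tinc},
\]
whereas VpnCloud's 539\,Mbps over 1\,375-byte payloads implies
roughly 49\,000 packets per second, or about 20\,$\mu$s per packet.
Shaving a handful of microseconds off the per-packet path is exactly
the difference between the two ceilings.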
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
\caption{Latency vs.\ throughput at baseline. Each point represents
one VPN. The quadrants reveal different bottleneck types:
VpnCloud (low latency, moderate throughput), Tinc (low latency,
low throughput, CPU-bound), Mycelium (high latency, low
throughput, overlay routing overhead).}
\label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time. The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom)
run simultaneously in a circular pattern for 60~seconds, each
carrying one bidirectional TCP stream (six unidirectional flows in
total). Because three independent link pairs now compete for shared
tunnel resources at once, the aggregate throughput is naturally
higher than any single direction alone, which is why even Internal
reaches 1.50$\times$ its single-stream figure. The scaling factor
(parallel throughput divided by single-stream throughput) captures
two effects: the benefit of using multiple link pairs in parallel,
and how well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.
\begin{table}[H]
\centering
\caption{Parallel TCP scaling at baseline. Scaling factor is the
ratio of parallel to single-stream throughput. Internal's
1.50$\times$ represents the expected scaling on this hardware.}
\label{tab:parallel_scaling}
\begin{tabular}{lrrr}
\hline
\textbf{VPN} & \textbf{Single (Mbps)} &
\textbf{Parallel (Mbps)} & \textbf{Scaling} \\
\hline
Mycelium & 259 & 569 & 2.20$\times$ \\
Hyprspace & 368 & 803 & 2.18$\times$ \\
Tinc & 336 & 563 & 1.68$\times$ \\
Yggdrasil & 795 & 1265 & 1.59$\times$ \\
Headscale & 800 & 1228 & 1.54$\times$ \\
Internal & 934 & 1398 & 1.50$\times$ \\
ZeroTier & 814 & 1206 & 1.48$\times$ \\
WireGuard & 864 & 1281 & 1.48$\times$ \\
EasyTier & 636 & 927 & 1.46$\times$ \\
VpnCloud & 539 & 763 & 1.42$\times$ \\
Nebula & 706 & 648 & 0.92$\times$ \\
\hline
\end{tabular}
\end{table}
The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT gives it a
bandwidth-delay product (Equation~\ref{eq:bdp}) of roughly
4.4\,MB on a 1\,Gbps link. No single TCP flow maintains a
congestion window that large, so the link is never fully utilized.
Multiple concurrent flows each contribute their own window, and
their aggregate in-flight data approaches the BDP, which pushes
throughput to 2.20$\times$ the single-stream figure.

Hyprspace scales almost as well (2.18$\times$) for the same
structural reason, but the bottleneck is different. Its libp2p send
pipeline accumulates roughly 2\,800\,ms of under-load latency
(Section~\ref{sec:hyprspace_bloat}), which inflates the effective BDP
to hundreds of megabytes, far beyond any single kernel congestion
window. Because Hyprspace keys \texttt{activeStreams} by destination
\texttt{peer.ID} (Listing~\ref{lst:hyprspace_sendpacket}), the three
concurrent peer pairs in the parallel benchmark each get their own
libp2p stream, their own mutex, and their own yamux flow-control
window. Three independent windows in flight fill more of the bloated
pipeline than one can.
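Both arguments follow from Equation~\ref{eq:bdp}. At the link's
nominal 1\,Gbps, and taking the ${\sim}$2\,800\,ms under-load RTT at
face value:
\[
\mathrm{BDP}_{\mathrm{Mycelium}}
= \frac{10^{9}\ \mathrm{bit/s} \times 0.0349\ \mathrm{s}}{8}
\approx 4.4\ \mathrm{MB},
\qquad
\mathrm{BDP}_{\mathrm{Hyprspace}}^{\mathrm{loaded}}
= \frac{10^{9}\ \mathrm{bit/s} \times 2.8\ \mathrm{s}}{8}
\approx 350\ \mathrm{MB}.
\]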
% TODO: This is still a hypothesis: it generalises the same
% bandwidth-delay-product argument used for Mycelium directly
% above, and is now grounded in the per-peer
% \texttt{SharedStream} structure verified in
% Listing~\ref{lst:hyprspace_sendpacket}, but neither the
% per-flow window evolution nor the actual under-load latency
% has been measured directly. A tcpdump of one Hyprspace
% iPerf3 run with inter-arrival timing analysis would settle it.

Tinc picks up a 1.68$\times$ boost because several streams can
collectively keep its single-threaded CPU busy during what would
otherwise be idle gaps in a single flow.

WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with a 0.00\,\% retransmit rate in both modes.
This is consistent with WireGuard's overhead being a fixed per-packet
cost that does not worsen under multiplexing.
Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps
(0.92$\times$). The cause is lock contention in Nebula's firewall
connection tracker (Listing~\ref{lst:nebula_conntrack}). A single
\texttt{sync.Mutex} protects the global \texttt{Conns} map, and every
packet in both directions must acquire it. The lock holder also
purges the timer wheel before releasing the lock, so other goroutines
stall while that housekeeping runs. Nebula mitigates this with a
per-routine cache that bypasses the global lock for known flows, but
the cache is invalidated every second, at which point all goroutines
contend on the mutex again. With parallel streams, the increased
goroutine count turns this periodic contention into a throughput
bottleneck.
\lstinputlisting[language=Go,caption={Nebula's firewall conntrack: a
global mutex protects the connection map and is acquired on every
packet.
\textit{nebula/firewall.go:79--84,
486--558}},label={lst:nebula_conntrack}]{Listings/nebula_conntrack.go}

Retransmit rates under parallel load shift in two directions.
VpnCloud's rate climbs from 0.06\,\% to 0.14\,\% (2.5$\times$) and
Yggdrasil's from 0.09\,\% to 0.23\,\% (2.7$\times$), so
multiplexing genuinely increases loss for these VPNs. Hyprspace's
rate, by contrast, drops slightly from 0.49\,\% to 0.39\,\% even
though it sends far more data in parallel; the per-packet loss
probability does not worsen, but the absolute count still triples
because three pairs are transmitting simultaneously. VPNs that were
clean in single-stream mode (WireGuard, Internal) stay clean under
parallel load.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
\caption{Single-stream vs.\ parallel throughput}
\label{fig:single_vs_parallel}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
\caption{Parallel TCP scaling factor}
\label{fig:scaling_factor}
\end{subfigure}
\caption{Parallel TCP scaling at baseline. Nebula is the only VPN
where parallel throughput is lower than single-stream
(0.92$\times$). Mycelium and Hyprspace benefit most from
parallelism ($>$2$\times$), compensating for latency and buffer
bloat respectively. The dashed line at 1.0$\times$ marks the
break-even point.}
\label{fig:parallel_tcp}
\end{figure}
\subsection{UDP Stress Test}

The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
The sender throughput values are artifacts: they reflect how fast the
sender can write to the socket, not how fast data traverses the
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
throughput because it uses a 32,731-byte block size (a jumbo-frame
overlay MTU), which inflates the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
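The sender-side figure is simply the socket write rate multiplied by
the block size; decomposed for Yggdrasil (a sketch using the numbers
above, with the write rate inferred rather than measured directly):
\[
63\,744\ \mathrm{Mbps} \;\approx\;
243\,000\ \mathrm{writes/s} \times 32\,731\ \mathrm{B} \times 8\ \mathrm{bit/B},
\]
so the headline number says only that the sender can issue roughly a
quarter-million \texttt{send()} calls per second, not that any of
that data crosses the tunnel.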
\begin{table}[H]
\centering
\caption{UDP receiver throughput and packet loss at baseline
(\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
at 120 seconds and are excluded.}
\label{tab:udp_baseline}
\begin{tabular}{lrr}
\hline
\textbf{VPN} & \textbf{Receiver (Mbps)} &
\textbf{Loss (\%)} \\
\hline
Internal & 952 & 0.0 \\
WireGuard & 898 & 0.0 \\
Nebula & 890 & 76.2 \\
Headscale & 876 & 69.8 \\
EasyTier & 865 & 78.3 \\
Yggdrasil & 852 & 98.7 \\
ZeroTier & 851 & 89.5 \\
VpnCloud & 773 & 83.7 \\
Tinc & 471 & 89.9 \\
\hline
\end{tabular}
\end{table}
%TODO: Explain that the UDP test also crashes often,
% which makes the test somewhat unreliable
% but a good indicator if the network traffic is "different" than
% the programmer expected

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at
the kernel level with proper backpressure that matches sender to
receiver rate. Every other VPN shows massive loss (69--99\%)
because the sender overwhelms the tunnel's userspace processing capacity.
Headscale shares WireGuard's cryptographic protocol but, contrary to
intuition, does not share its kernel datapath: Tailscale's
\texttt{magicsock} layer intercepts every packet to handle endpoint
selection and DERP (Designated Encrypted Relay for Packets,
Tailscale's TLS-over-TCP relay network used when a direct UDP path
between peers cannot be established), which is incompatible with the
in-kernel WireGuard module. Headscale therefore runs
\texttt{wireguard-go} entirely in userspace, and the unbounded
\texttt{-b~0} flood overruns that userspace pipeline just as it
overruns every other userspace implementation; the result is
69.8\,\% loss despite the WireGuard branding.
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
(due to its large block size) but loses almost all of it. These loss
rates do not reflect real-world UDP behavior but reveal which VPNs
implement effective flow control. Hyprspace and Mycelium could not
complete the UDP test at all; both timed out after 120 seconds.
% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not
% the path MTU. It is derived from the socket MSS and reflects the
% usable payload after tunnel overhead, but conflating it with path
% MTU is misleading. Consider renaming to "effective payload size"
% throughout.
The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at
2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at
1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the
smallest). These differences affect fragmentation behavior under real
workloads, particularly for protocols that send large datagrams.

%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
% to better reflect real world use, as most users probably won't
% change these defaults
% and explain that good defaults are as much a part of good software as
% having the features but they are hard to configure correctly
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
Throughput}.png}
\caption{UDP receiver throughput}
\label{fig:udp_throughput}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
Packet Loss}.png}
\caption{UDP packet loss}
\label{fig:udp_loss}
\end{subfigure}
\caption{UDP stress test results at baseline (\texttt{-b 0},
unlimited sender rate). Internal and WireGuard are the only
implementations with 0\% loss. Hyprspace and Mycelium are
excluded due to 120-second timeouts.}
\label{fig:udp_results}
\end{figure}
% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal. Once connection
setup, TLS handshakes, and HTTP round-trips enter the picture,
throughput differences between 500 and 900\,Mbps matter far less
than per-connection latency.
\begin{table}[H]
\centering
\caption{Nix binary cache download time at baseline, sorted by
duration. Overhead is relative to the internal baseline (8.53\,s).}
\label{tab:nix_cache}
\begin{tabular}{lrr}
\hline
\textbf{VPN} & \textbf{Mean (s)} &
\textbf{Overhead (\%)} \\
\hline
Internal & 8.53 & -- \\
Nebula & 9.15 & +7.3 \\
ZeroTier & 9.22 & +8.1 \\
VpnCloud & 9.39 & +10.0 \\
EasyTier & 9.39 & +10.1 \\
WireGuard & 9.45 & +10.8 \\
Headscale & 9.79 & +14.8 \\
Tinc & 10.00 & +17.2 \\
Mycelium & 10.07 & +18.1 \\
Yggdrasil & 10.59 & +24.2 \\
Hyprspace & 11.92 & +39.7 \\
\hline
\end{tabular}
\end{table}
Several rankings invert relative to raw throughput. ZeroTier
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite
6\,\% fewer raw Mbps and 1\,000$\times$ more retransmits. Yggdrasil
is the clearest example: it has the fourth-highest VPN throughput at
795\,Mbps, yet lands at 24\,\% overhead because its 2.2\,ms latency
adds up over the many small sequential HTTP requests that constitute
a Nix cache download.
Figure~\ref{fig:throughput_vs_download} confirms this weak link
between raw throughput and real-world download speed.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache
Mean Download Time}.png}
\caption{Nix cache download time per VPN}
\label{fig:nix_cache}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
\caption{Raw throughput vs.\ download time}
\label{fig:throughput_vs_download}
\end{subfigure}
\caption{Application-level download performance. The throughput
hierarchy compresses under real HTTP workloads: the worst VPN
(Hyprspace, 11.92\,s) is only 40\% slower than bare metal.
Throughput explains some variance but not all: Yggdrasil
(795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
because latency matters more for HTTP workloads.}
\label{fig:nix_download}
\end{figure}
\paragraph{Video Streaming (RIST).}

At 3.3\,Mbps, the RIST video stream sits well within every VPN's
throughput budget. The test therefore measures something else: how
well each VPN handles real-time UDP delivery under steady load.

Most VPNs pass without incident. Eight deliver 100\% quality,
Nebula sits just below at 99.8\%, and Hyprspace's headline figure
of 100\% conceals a separate failure mode discussed below. The
14--16 dropped frames that appear uniformly across every run, including
Internal, are most likely encoder warm-up artefacts rather than
tunnel overhead, though we have not verified this directly.
% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the clear failure. Its mean quality is 13.1\%, and
each test interval drops 288 packets on average. The degradation is
sustained rather than bursty: median quality is 10\%, and the
interquartile range of dropped packets is a narrow 255--330. The
qperf benchmark also fails outright for Headscale at baseline, which
rules out a bulk-TCP explanation. Something in the real-time path is
broken.

The failure is unexpected because Headscale builds on WireGuard,
which handles video without trouble, and Headscale's own TCP
throughput puts it in Tier~1. RIST runs over UDP, however, and
qperf probes latency-sensitive paths using both TCP and UDP. The
most plausible source is Headscale's DERP relay or NAT traversal
layer. Headscale's effective UDP payload size is 1\,208~bytes, the
smallest in the dataset. RIST packets larger than this would be
fragmented, and fragment reassembly under sustained load could
produce exactly the steady, uniform drop pattern the data shows.
This is a hypothesis, not a confirmed cause: it would need a
packet capture to verify. Either way, the result disqualifies
Headscale from video conferencing, VoIP, or any other real-time
media workload, regardless of TCP throughput.
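The fragmentation arithmetic is at least consistent with that
pattern. Assuming the common RIST carriage of seven 188-byte
MPEG-TS packets per datagram (a 1\,316-byte payload; the benchmark's
actual datagram size was not captured), every media packet exceeds
Headscale's 1\,208-byte effective payload and is split into
\[
\left\lceil \frac{1316}{1208} \right\rceil = 2
\]
tunnel fragments, so losing either fragment costs the whole datagram
and the per-datagram loss probability roughly doubles.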
% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
% percentiles all zero) are not visible in the RIST Quality bar chart.
% Add a distribution plot or note in the caption that the bar
% chart hides this variance.
Hyprspace fails differently. Its average quality reads 100\%, but
the raw drop counts underneath are unstable: mean packet drops of
1\,194 and a maximum spike of 55\,500. The 25th, 50th, and 75th
percentiles are all zero, so most runs deliver perfectly while a
small number suffer catastrophic bursts. RIST's forward error
correction recovers from most of these events, but the worst spikes
overwhelm FEC entirely.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/Video
Streaming/RIST Quality}.png}
\caption{RIST video streaming quality at baseline. Headscale at
13.1\% average quality is the clear outlier. Every other VPN
achieves 99.8\% or higher. Nebula is at 99.8\% (minor
degradation). The video bitrate (3.3\,Mbps) is well within every
VPN's throughput capacity, so this test reveals real-time UDP
handling quality rather than bandwidth limits.}
\label{fig:rist_quality}
\end{figure}
\subsection{Operational Resilience}

Throughput, latency, and application performance describe how a
tunnel behaves once it is up. The next question is how quickly it
gets there. Sustained-load numbers do not predict recovery speed,
and for operational use the time a tunnel takes to come up after a
reboot matters as much as its peak throughput.

Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
10.07\,s across its three nodes, an exact match for Nebula's
\texttt{HostUpdateNotification} interval, whose default is
10~seconds in the lighthouse protocol (configurable, but the
benchmarks use the default). After a reboot, a node must
wait until the next periodic update before its lighthouses learn
its new endpoint, so the reconnection time tracks the timer rather
than any topology-dependent convergence.
Mycelium sits at the opposite end at 76.6~seconds, and its three
nodes come back at almost the same time (75.7, 75.7, 78.3\,s).
Section~\ref{sec:mycelium_routing} argues from that uniformity
that the bound is a fixed timer in the overlay protocol.

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. Yggdrasil organises its overlay as a
distributed spanning tree rooted at the node with the highest public
key: every other node picks a parent closer to the root and the
whole network hangs off that parent chain. The gap likely reflects
the cost of rebuilding that tree after a reboot: a node close to the
current root reconverges quickly, while one further out must wait
for updated parent information to propagate hop-by-hop before it
can route traffic.
\begin{figure}[H]
\centering
\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
\caption{Average reconnection time per VPN}
\label{fig:reboot_bar}
\end{subfigure}

\vspace{1em}

\begin{subfigure}[t]{\textwidth}
\centering
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
\caption{Per-node reconnection time heatmap}
\label{fig:reboot_heatmap}
\end{subfigure}
\caption{Reboot reconnection time at baseline. The heatmap reveals
Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
across all nodes). Hyprspace reconnects fastest (8.7\,s average)
despite its poor sustained-load performance.}
\label{fig:reboot_reconnection}
\end{figure}
\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following paragraphs piece together observations from
earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency-under-load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase. The network itself has capacity to spare;
the VPN tunnel is filling up with buffered packets and failing to
drain.
The consequences show in every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer. The max congestion window shrinks to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. % TODO: The
% explanation for the sender/receiver inversion (ACK delays
% causing sender-side timer undercounting) is a hypothesis. Normally
% sender >= receiver. Consider verifying with packet captures or
% note this as a likely but unconfirmed explanation.
The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps, likely because massive ACK delays
cause the sender-side timer to undercount the actual data rate. The
UDP test never finished at all; it timed out at 120~seconds.

% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.
Hyprspace does import gVisor netstack, but reading the source
confirms that the gVisor TCP stack sits exclusively behind the
in-VPN ``service network'' feature. Regular tunnel traffic uses
an ordinary kernel TUN device created through the
\texttt{songgao/water} library, and the forwarding loop in
\texttt{node/node.go} only diverts a packet into the gVisor
stack when its destination falls inside the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol is TCP; everything else is shipped verbatim over a
libp2p stream and written back into the receiving peer's kernel
TUN. Listings~\ref{lst:hyprspace_kernel_tun},
\ref{lst:hyprspace_dispatch}, and \ref{lst:hyprspace_netstack}
show the relevant code in the upstream Hyprspace tree.
\lstinputlisting[language=Go,caption={Hyprspace creates a real
kernel TUN via \texttt{songgao/water}; this is the device every
peer-to-peer packet traverses.
\textit{hyprspace/tun/tun\_linux.go:14--36}},label={lst:hyprspace_kernel_tun}]{Listings/hyprspace_tun_linux.go}

\lstinputlisting[language=Go,caption={The IPv6 dispatch in the
Hyprspace forwarding loop only diverts to the gVisor service-network
TUN when the destination matches the
\texttt{fd00:hyprspsv::/80} service prefix \emph{and} the L4
protocol byte is \texttt{0x06} (TCP); every other packet is left
on the kernel TUN path and forwarded over libp2p.
\textit{hyprspace/node/node.go:255--283}},label={lst:hyprspace_dispatch}]{Listings/hyprspace_dispatch.go}

\lstinputlisting[language=Go,caption={Hyprspace's gVisor netstack
initialiser only enables TCP SACK; there is no \texttt{TCPRecovery}
override (RACK stays at gVisor's default), no congestion-control
override, and no buffer-size override. The text in
\texttt{tun.go} also notes the file is taken verbatim from
wireguard-go.
\textit{hyprspace/netstack/tun.go:6--80}},label={lst:hyprspace_netstack}]{Listings/hyprspace_netstack.go}
Since the benchmark targets the regular Hyprspace IPv4/IPv6
addresses rather than service-network proxies, both endpoints
rely on their host kernel's TCP stack for the entire transfer.
Whatever options Hyprspace's gVisor instance might set
internally (congestion control, loss recovery, buffer sizes)
are therefore irrelevant to these measurements; the inner TCP
state machine the kernel runs is the only one in the path.
The same caveat applies more sharply to Tailscale, where the
upstream documentation talks about an in-process gVisor TCP
stack but the benchmark traffic never reaches it; that case is
the subject of Section~\ref{sec:tailscale_degraded}.
If gVisor is out of scope, the buffer bloat must originate
further up the Hyprspace stack instead. Hyprspace uses
\texttt{libp2p}, a peer-to-peer networking library, and its
\texttt{yamux} stream multiplexer, which runs many logical streams
over a single underlying connection and polices each one with a
credit-based flow-control window. The most plausible source of
the bloat is this libp2p/yamux layer, through which raw IP packets
are funnelled. Hyprspace's TUN-read loop dispatches
each outbound packet on its own goroutine, and every such
goroutine ends up in \texttt{node/node.go}'s
\texttt{sendPacket}, which keeps exactly one libp2p stream per
destination peer in \texttt{activeStreams} and guards it with a
single per-peer \texttt{sync.Mutex}
(Listing~\ref{lst:hyprspace_sendpacket}). Concurrent
application TCP flows to the same Hyprspace neighbour therefore
serialise behind that one lock: the parallel iPerf3 test, which
opens multiple TCP connections to the same peer at once,
collapses to a single send pipeline at this layer. Each
goroutine waiting for the lock pins its own 1420-byte packet
buffer, and the underlying yamux session adds a per-stream
flow-control window on top. None of this is visible to the
kernel TCP sender that produced the inner segments: the kernel
sees only that the TUN write returned, so it keeps growing its
congestion window while the libp2p layer falls further behind. The
geometry is the textbook one for buffer bloat: a
fast producer (kernel TCP) sitting upstream of a slow,
serialised consumer (the single yamux stream per peer) with
no flow-control signal coupling the two.
\lstinputlisting[language=Go,caption={Hyprspace's outbound
fast path keeps exactly one libp2p stream per destination peer
in \texttt{activeStreams} and guards it with a per-peer
\texttt{sync.Mutex} held inside the \texttt{SharedStream}
record. The TUN-read loop spawns a fresh goroutine per packet
(\texttt{node.go:282}); each one calls \texttt{sendPacket} and
takes \texttt{ms.Lock} for the duration of the libp2p stream
write, so concurrent application TCP flows to the same
Hyprspace neighbour are serialised behind a single mutex.
\textit{hyprspace/node/node.go:36--39, 282,
328--348}},label={lst:hyprspace_sendpacket}]{Listings/hyprspace_sendpacket.go}
\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency looks like a straightforward
cost of routing through a global overlay. The per-path numbers do
not fit this explanation:

\begin{itemize}
\bitem{luna$\rightarrow$lom:} 1.63\,ms (comparable
to Headscale at 1.64\,ms)
\bitem{lom$\rightarrow$yuki:} 51.47\,ms
\bitem{yuki$\rightarrow$luna:} 51.60\,ms
\end{itemize}
One link found a direct LAN path; the other two bounced through the
overlay. All three machines sit on the same physical network, so the
split is not a matter of topology.

The throughput results invert the latency ranking. The link with
the low ping latency, luna$\rightarrow$lom at 1.63\,ms, should be
the fastest according to TCP congestion theory. It is the slowest:
122\,Mbps, with the reverse direction dropping to 58.4\,Mbps in
bidirectional mode. Meanwhile yuki$\rightarrow$luna, whose ICMP~RTT
was 30$\times$ higher, reaches 379\,Mbps
(Figure~\ref{fig:mycelium_paths}). The throughput ranking is the
exact inverse of what the ping data predicts.
The explanation is in the iPerf3 logs. Each TCP stream reports a
kernel-measured RTT that is independent of ICMP ping. For the
luna$\rightarrow$lom stream, this TCP~RTT starts at 51.6\,ms and
climbs to a mean of 144\,ms over the 30-second run, with
757~retransmits---the link was clearly overlay-routed during the
throughput test, even though ping had found a direct path eight
minutes earlier. For yuki$\rightarrow$luna the reverse happened:
the TCP stream measured only 12--22\,ms, and its bidirectional
return path recorded 1.0\,ms, a direct LAN connection that the
earlier ICMP test had not seen. The routes changed between the two
tests.
Mycelium uses the Babel routing protocol
(Section~\ref{sec:babel}) to discover and select paths.
Two properties of its implementation explain why routes
shifted mid-benchmark. First, Mycelium advertises
routes at a five-minute interval
(Listing~\ref{lst:mycelium_constants}):

\lstinputlisting[language=Rust,caption={Mycelium's
Babel timing constants. Routes are re-advertised
every 300\,s; the router will not learn about a new
path until the next cycle.
\textit{mycelium/src/router.rs:33--59}},label={lst:mycelium_constants}]{Listings/mycelium_route_constants.rs}
A direct path that appears between update cycles is invisible to the
router until the next advertisement arrives. The benchmark's ping
and throughput tests ran sequentially with several minutes between
them, so each test observed whichever route happened to be selected
at that point in Babel's five-minute cycle.

Second, even when a better route \emph{is} advertised, the router
resists switching to it. Listing~\ref{lst:mycelium_best_route} shows
the \texttt{find\_best\_route} function: a candidate route is
rejected unless its metric improves on the current route by more
than 10, or unless it is directly connected (metric~0). A candidate
at metric~16 against a current route at metric~25, for example, is
rejected because the improvement of 9 falls short of the threshold.
This hysteresis prevents flapping but also means that an overlay
path, once established, can persist for the remainder of the update
interval even after a shorter path becomes available.

\lstinputlisting[language=Rust,caption={Route
selection with hysteresis. Lines~16--25 reject a
candidate route unless it is directly connected or
improves the composite metric by more than
\texttt{SIGNIFICANT\_METRIC\_IMPROVEMENT}\,(10).
\textit{mycelium/src/router.rs:1213--1238}},label={lst:mycelium_best_route}]{Listings/mycelium_find_best_route.rs}
The five-minute update interval and the switching hysteresis
together explain the throughput asymmetry. The TCP-measured RTTs
are consistent with the observed throughput on every link; only the
ICMP~RTTs, measured minutes earlier under a different routing state,
give the impression of an inversion.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average
Throughput}.png}
\caption{Per-link TCP throughput for Mycelium. The
luna$\rightarrow$lom link appears slow despite its
low ping latency because Babel had switched to an
overlay route by the time the throughput test ran.
The TCP-level RTTs reported by iPerf3, not the
earlier ICMP measurements, explain the 3:1 ratio.}
\label{fig:mycelium_paths}
\end{figure}
% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also clarify what
% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by short-lived
connections accumulate it rapidly. Bulk downloads, by contrast,
amortize it: the Nix cache test finishes only 18\,\% slower than
Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase
begins, per-connection latency fades into the background.
Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
fixed convergence timer in the overlay protocol, most likely a
default wait interval hard-coded into the reconnection logic. A
topology-dependent recovery time, by contrast, would vary with each
node's position in the overlay: a node near an active peer would
reconverge quickly while one further away would wait longer for
routing information to reach it. Mycelium shows no such variation,
so the bound is almost certainly a timer rather than a propagation
delay.
% TODO: Identify which Mycelium constant or default this 75-78 s
% recovery actually corresponds to before claiming it is a fixed
% timer; the source code would settle whether it is hard-coded,
% a configurable default, or coincidence.
The UDP test timed out at 120~seconds, and even first-time
connectivity required a 70-second wait at startup.
\paragraph{Tinc: Userspace Processing Bottleneck.}

The latency subsection already traced Tinc's 336\,Mbps ceiling to
single-core CPU exhaustion. The usual network suspects do not
apply. Tinc's 1.19\,ms RTT rules out a slow tunnel, and both its
effective UDP payload size (1\,353 bytes) and its retransmit count
(240) are in the normal range. That leaves CPU: 14.9\,\%
whole-system utilization is what one saturated core looks like on
a multi-core host, which fits a single-threaded userspace VPN.
The parallel benchmark confirms the diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), ahead of Internal's 1.50$\times$ ratio.
Several concurrent TCP streams keep that one core busy through
the gaps a single flow would leave idle, and the extra work
translates directly into extra throughput.
% TODO: DOWNSTREAM DEPENDENCY — this confirmation inherits the
% unresolved CPU-profiling TODO from the latency subsection
% (VpnCloud's identical 14.9\% at 539\,Mbps). If per-thread
% profiling refutes the single-core story, this paragraph must
% be revisited as well.
\section{Impact of Network Impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal
conditions. The impairment profiles in
Table~\ref{tab:impairment_profiles} test a different property:
resilience. Each profile applies symmetric \texttt{tc netem}
impairment to every machine. Low adds roughly 2\,ms of delay and
0.25\,\% packet loss with 0.5\,\% reordering; Medium adds
${\sim}$4\,ms of delay and 1\,\% loss with 2\,\% reordering; High
adds ${\sim}$7.5\,ms of delay and 2.5\,\% loss with 5\,\%
reordering. Medium and High both use 50\,\% correlation, so
losses and reorderings are bursty rather than uniform. Two
results dominate the data.
% TODO: Double-check these per-profile parameters against the
% canonical impairment-profile definitions in the earlier chapter
% (Table~\ref{tab:impairment_profiles}). The Low/High loss and
% delay numbers are cross-checked against later prose in this
% chapter, but the correlation and jitter values should be
% verified against the authoritative profile definition.

The first is the collapse of the throughput hierarchy. At High
impairment, the 675\,Mbps spread between fastest and slowest
implementation compresses to under 3\,Mbps. Architectural
differences that mattered at gigabit speeds become invisible once
the network is the bottleneck.
|
||
|
||
The second is harder to explain. Headscale outperforms the
|
||
bare-metal Internal baseline at Medium impairment across TCP,
|
||
parallel TCP, and the Nix cache benchmark. A VPN built on
|
||
WireGuard should not beat a direct connection.
|
||
Section~\ref{sec:tailscale_degraded} pursues this anomaly
|
||
through what turns out to be the wrong hypothesis. The
|
||
investigation begins with Tailscale's much-discussed gVisor TCP
|
||
stack, validates the candidate parameters in isolation on the
|
||
bare-metal host, and only then discovers, by reading the rig's
|
||
own NixOS module, that the gVisor stack is not actually in the
|
||
data path of the benchmark at all. The real culprit is a
|
||
combination of the Linux kernel's tight default
|
||
\texttt{tcp\_reordering} threshold and the way
|
||
\texttt{wireguard-go}
|
||
batches packets between the wire and the host kernel TCP stack.
|
||
|
||
\subsection{Ping}

Latency is the most predictable metric under impairment. Most VPNs
absorb the injected delay with a fixed per-hop overhead, and
rankings within the central cluster barely change across profiles
(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and
15\,ms of round-trip delay at Low, Medium, and High respectively;
Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this.
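
The expectation follows from the symmetry of the impairment: a ping
crosses one netem-delayed egress on each of the two machines, so
the added round-trip delay is roughly twice the per-machine delay
(plus jitter):
\[
  \Delta\mathrm{RTT} \approx 2\, d_{\mathrm{netem}}: \qquad
  2 \times 2 = 4~\mathrm{ms}, \quad
  2 \times 4 = 8~\mathrm{ms}, \quad
  2 \times 7.5 = 15~\mathrm{ms}.
\]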

\begin{table}[H]
  \centering
  \caption{Average ping RTT (ms) across impairment profiles, sorted
    by High-profile RTT}
  \label{tab:ping_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 0.60  & 4.82  & 9.38  & 15.49 \\
    Tinc      & 1.19  & 5.32  & 9.85  & 15.92 \\
    Nebula    & 1.25  & 5.38  & 9.99  & 15.96 \\
    WireGuard & 1.20  & 5.36  & 9.88  & 15.99 \\
    Headscale & 1.64  & 5.82  & 10.39 & 16.07 \\
    VpnCloud  & 1.13  & 5.41  & 10.35 & 16.21 \\
    ZeroTier  & 1.28  & 5.34  & 10.02 & 16.54 \\
    Yggdrasil & 2.20  & 6.73  & 11.99 & 20.20 \\
    Hyprspace & 1.79  & 6.15  & 10.76 & 24.49 \\
    EasyTier  & 1.33  & 6.27  & 14.13 & 26.60 \\
    Mycelium  & 34.90 & 23.42 & 43.88 & 33.05 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png}
  \caption{Average ping RTT across impairment profiles. Most VPNs
    form a tight parallel band; Mycelium's non-monotonic curve,
    EasyTier's excess latency at High, and Hyprspace's upward
    divergence stand out.}
  \label{fig:ping_impairment_heatmap}
\end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at
baseline to 23.4\,ms at Low impairment, a 33\% improvement at the
profile where every other VPN gets slower. It then climbs to
43.9\,ms at Medium before falling again to 33.0\,ms at High. The
baseline analysis (Section~\ref{sec:mycelium_routing}) showed that
Mycelium's latency comes from a bimodal routing distribution: one
path runs at 1.63\,ms, two others route through the global overlay
at ${\sim}$51\,ms.
% TODO: DOWNSTREAM DEPENDENCY — This explanation depends on the
% baseline characterisation of Mycelium's path discovery as
% "failing intermittently" (Section mycelium_routing). If that
% characterisation is revised (e.g., overlay routing is by-design,
% not a failure), then the claim that impairment "pushes path
% discovery toward shorter routes" needs rethinking: the mechanism
% would be different if Mycelium is not trying to find direct
% routes in the first place.
Impairment seems to push Mycelium's path selection toward the
shorter route, so a larger share of traffic avoids the overlay
detour. The non-monotonic curve is consistent with a path selection
algorithm that reacts to measured link quality but not linearly
with degradation severity.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers can
% verify these numbers.
Mycelium loses zero ping packets at Low and Medium impairment.
Most other VPNs show 0.1--3.2\% loss at those profiles. At High
impairment Mycelium's loss jumps to 11.1\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping
% heatmap only shows averages. Add a jitter/distribution figure.
% Also, the "userspace retry mechanism" is a hypothesized cause
% without source-code or packet-level evidence.
EasyTier accumulates 11\,ms of excess latency at High impairment
beyond what tc~netem injects. Its average RTT is 26.6\,ms and its
maximum reaches 290\,ms, against ${\sim}$40\,ms for WireGuard. The
RTT standard deviation reaches 44.6\,ms at High, the worst jitter
of any VPN. A userspace retry mechanism is the likely cause, but
without source-code evidence we cannot say so with certainty.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on the exact test
% structure (3 pairs × 3 runs × 100 packets). Verify this matches
% the actual test setup and add a supporting figure or table.
Hyprspace shows the same 11.1\% ping packet loss at Low, Medium,
and High impairment. With 9~measurement runs per profile
(3~machine pairs $\times$ 3~runs of 100~packets), 11.1\% is exactly
1/9: one run fails completely while the other eight report zero
loss.
% TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the
% buffer bloat diagnosis from Section hyprspace_bloat, which
% depends on the unverified 2,800 ms under-load latency. If that
% diagnosis is revised, this explanation must also be revisited.
The binary pass/fail behaviour fits the buffer bloat diagnosis from
Section~\ref{sec:hyprspace_bloat}: when the tunnel's buffers fill,
a path stalls completely rather than degrading gradually.

\subsection{TCP throughput}

The baseline TCP hierarchy does not survive impairment. The three
performance tiers from Section~\ref{sec:baseline} dissolve at the
first step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H]
  \centering
  \caption{Single-stream TCP throughput (Mbps) across impairment
    profiles, sorted by baseline. Retention is the Low-to-baseline
    ratio (e.g., Internal: $333/934 \approx 35.7\,\%$).}
  \label{tab:tcp_impairment}
  \begin{tabular}{lrrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} & \textbf{Retention} \\
    \hline
    Internal  & 934 & 333  & 29.6 & 4.25 & 35.7\% \\
    WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\
    ZeroTier  & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\
    Headscale & 800 & 274  & 41.5 & 4.21 & 34.3\% \\
    Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\
    \hline
    Nebula    & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\
    EasyTier  & 636 & 156  & 17.4 & 3.59 & 24.6\% \\
    VpnCloud  & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\
    \hline
    Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\
    Tinc      & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\
    Mycelium  & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png}
  \caption{Single-stream TCP throughput across impairment profiles.
    Headscale crosses above Internal at Medium impairment;
    Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs
    converge at High.}
  \label{fig:tcp_impairment_heatmap}
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a
98.3\% loss after adding only 2\,ms of latency, 2\,ms of jitter,
0.25\% packet loss, and 0.5\% reordering per machine. Even
Mycelium, the slowest VPN at baseline (259\,Mbps), retains more
throughput at Low than Yggdrasil does. The jumbo overlay MTU of
32\,731~bytes that inflated Yggdrasil's baseline numbers
(Section~\ref{sec:baseline}) becomes a liability under impairment:
every lost or reordered outer packet costs roughly 24$\times$ more
retransmitted inner data than a standard 1\,400-byte MTU VPN would
lose.
% TODO: The jumbo-MTU-as-liability argument is reused in several
% places (TCP impairment, QUIC impairment, RIST video, and
% §sec:baseline Tier analysis). In each it is presented as a
% mechanism rather than a measurement. Consider running one
% controlled experiment --- force Yggdrasil to a standard
% 1\,420-byte overlay MTU and rerun the Low/Medium impairment
% profiles --- to test the hypothesis directly, or consolidate
% the argument into a single "jumbo-MTU liability" paragraph and
% cite it from the other sections instead of restating the
% mechanism each time.

Headscale retains 34.3\% of its baseline throughput at Low, almost
the same as Internal's 35.7\%. At Medium impairment, Headscale
(41.5\,Mbps) overtakes Internal (29.6\,Mbps).
Section~\ref{sec:tailscale_degraded} investigates this anomaly in
detail.

At High impairment, the throughput range collapses from 675\,Mbps
to 2.9\,Mbps. Internal leads at 4.25\,Mbps, Hyprspace trails at
1.39\,Mbps, and the impairment profile itself is the bottleneck.
With 2.5\% packet loss and 5\% reordering per machine, every
implementation is loss-limited, and the architectural differences
that mattered at gigabit speeds no longer matter at all.

\subsection{UDP throughput}

The UDP stress test (\texttt{-b~0}) separates implementations with
effective backpressure from those without it more cleanly than any
TCP benchmark. Under impairment, it also produces widespread
failures.
% TODO: Tinc fails at Low and Medium but succeeds at High (8 Mbps):
% the same non-monotonic failure pattern as Internal/WireGuard
% (fail at Low, succeed at Medium/High). This suggests the failures
% are iPerf3/tc interaction issues rather than fundamental VPN
% limitations. Nebula and VpnCloud also fail selectively. The
% widespread non-monotonic failure pattern undermines using this
% benchmark as a reliability indicator (see line 1163 claim).
% Consider discussing this pattern.
Hyprspace and Mycelium continue to time out at all profiles,
extending their baseline failures. Tinc drops out at Low and
Medium, ZeroTier at Medium. The data is sparse, but one pattern
emerges from the runs that did complete.

% TODO: The heatmap shows Internal and WireGuard both fail (×) at
% some impairment profiles (e.g., Internal fails at Low, WireGuard
% at Low and High). "Regardless of impairment" overstates the
% evidence. Rephrase to reflect the failures, or explain why those
% runs failed despite the claim of maintained throughput.
% TODO: Internal (and WireGuard) fail at Low impairment in the UDP
% test but succeed at Medium and High: the opposite of what one
% would expect. This is never explained. Investigate and add an
% explanation (e.g., iPerf3 crash, tc interaction, timing issue).
Three implementations maintain throughput at the profiles where
data exists. Internal holds ${\sim}$950\,Mbps at Baseline, Medium,
and High; WireGuard sustains 850--898\,Mbps; and Headscale sustains
700--876\,Mbps.
% TODO: verify WireGuard UDP range -- analysis doc says 850-898,
% possible digit transposition
Internal and WireGuard ride the host kernel's transport-layer
backpressure (Internal directly, WireGuard via the in-kernel
WireGuard module). Headscale, by contrast, never uses the kernel
module even though it builds on the WireGuard protocol: as
established in Section~\ref{sec:baseline}, Tailscale's
\texttt{magicsock} layer intercepts every packet for endpoint
selection, DERP relay, and the disco protocol, and that
interception is incompatible with the kernel WireGuard datapath.
Headscale therefore runs \texttt{wireguard-go} in userspace and
compensates with UDP batching
(\texttt{recvmmsg}/\texttt{sendmmsg}), host-kernel UDP
segmentation/aggregation offload
(\texttt{UDP\_SEGMENT}/\texttt{UDP\_GRO}, applied to the outer
WireGuard socket), and a 7\,MiB socket buffer on the same outer
socket (Listing~\ref{lst:magicsock_buffer} in
Section~\ref{sec:tailscale_degraded} shows the buffer code). These
offloads live in the host kernel; gVisor netstack itself implements
no UDP GSO or UDP GRO of its own. Together they absorb a
\texttt{-b 0} sender flood without collapsing. Userspace VPNs
without the same engineering do collapse: EasyTier drops from 865
to 435 to 38.5 to 6.1\,Mbps across successive profiles. Yggdrasil,
already pathological at baseline (98.7\% loss), crashes to
12.3\,Mbps at Low and fails entirely at Medium and High.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
  % TODO: The heatmap shows Internal, WireGuard, and Headscale all
  % fail ($\times$) at Low impairment. WireGuard also fails at
  % High. These selective failures need an explanation
  % (iPerf3/tc interaction?).
  \caption{UDP receiver throughput across impairment profiles.
    Implementations with effective UDP backpressure (Internal and
    WireGuard via the in-kernel datapath; Headscale via
    \texttt{wireguard-go} batching plus large socket buffers)
    maintain high throughput where they complete; other userspace
    VPNs collapse or fail entirely ($\times$ marks a failed run).}
  \label{fig:udp_impairment_heatmap}
\end{figure}

% TODO: This "robustness indicator" interpretation is undermined by
% the non-monotonic failure pattern. Internal and WireGuard fail at
% Low (0.25% loss) but succeed at Medium and High (1%+ loss). If
% failures indicated "fundamental flow-control problems," they
% should get worse with more impairment, not better. The pattern
% suggests iPerf3 or tc timing issues rather than VPN limitations.
% Either explain the non-monotonic failures or weaken this
% conclusion.
Under impairment this benchmark is more useful as a robustness
indicator than as a throughput measurement. A VPN that cannot
complete a 30-second UDP flood under 0.25\% packet loss has a
flow-control problem that will surface under real workloads too,
even when the symptoms are milder.
% TODO: Non-monotonic failure pattern (Internal and WireGuard
% fail at Low but succeed at Medium/High; Tinc, Nebula, VpnCloud
% fail selectively) is never explained and directly undermines
% the "robustness indicator" framing above. Reproduce one of
% the failing Low-profile runs with iPerf3 debug logging and
% \texttt{tc -s qdisc show} to establish whether these are VPN
% flow-control failures, iPerf3/tc interaction artefacts, or
% timing issues; then either explain the pattern or soften the
% robustness-indicator claim.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must
% match the baseline parallel test description. The baseline
% section has an unresolved TODO about whether the test uses 6 or
% 10 streams. If the baseline is corrected to 10, this section must
% also be updated.
The Headscale anomaly from single-stream TCP grows larger under
parallel load. Table~\ref{tab:parallel_impairment} shows aggregate
throughput across three concurrent bidirectional links (six
unidirectional flows).

\begin{table}[H]
  \centering
  \caption{Parallel TCP throughput (Mbps) across impairment
    profiles. Three concurrent bidirectional links produce six
    unidirectional flows.}
  \label{tab:parallel_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 1398 & 277  & 82.6 & 10.4 \\
    Headscale & 1228 & 718  & 113  & 20.0 \\
    WireGuard & 1281 & 173  & 24.5 & 8.39 \\
    Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
    ZeroTier  & 1206 & 176  & 35.4 & 7.97 \\
    EasyTier  & 927  & 473  & 57.4 & 10.7 \\
    Hyprspace & 803  & 2.87 & 6.94 & 3.62 \\
    VpnCloud  & 763  & 174  & 23.7 & 8.25 \\
    Nebula    & 648  & 103  & 15.3 & 4.93 \\
    Mycelium  & 569  & 72.7 & 7.51 & 3.69 \\
    Tinc      & 563  & 168  & 23.7 & 8.25 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
  \caption{Parallel TCP throughput across impairment profiles.
    Headscale dominates at Low (718\,Mbps vs.\ Internal's 277);
    EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to
    2.87\,Mbps.}
  \label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment, Headscale reaches 718\,Mbps: 2.6$\times$
Internal's 277\,Mbps and 4.1$\times$ WireGuard's 173\,Mbps. At
Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by
37\%. Whatever mechanism produces the single-stream crossover at
Medium scales with the flow count, because each of the six
concurrent streams benefits from it independently.

EasyTier is the runner-up under parallel load: 473\,Mbps at Low,
51\% of its baseline. Headscale and EasyTier are the only VPNs that
retain more than half their baseline parallel throughput at Low
impairment; no other implementation exceeds 30\%. We have no direct
architectural explanation for EasyTier's resilience and do not
claim one here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\%
loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat
% diagnosis from Section hyprspace_bloat, which depends on the
% unverified 2,800 ms under-load latency. If that diagnosis is
% revised, this explanation for parallel collapse must also be
% revisited.
The buffer bloat that already plagues single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) turns catastrophic when six
flows compete for the same bloated buffers at once.

High-profile convergence is more pronounced here than in
single-stream mode. Tinc and VpnCloud land at identical 8.25\,Mbps
even though they differ by 200\,Mbps at baseline.

\subsection{QUIC performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline
(Section~\ref{sec:baseline}) and continue to fail at every
impairment profile.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to
7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High.
This is the same cliff observed in its TCP results, driven by the
same jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps),
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of one another. At baseline these four span a 188\,Mbps
range (656 to 844\,Mbps). At this point QUIC's own congestion
control is the sole limiter: it runs on top of an already-degraded
outer link and cannot push past ${\sim}$23\,Mbps regardless of the
VPN underneath.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
  \caption{QUIC bandwidth across impairment profiles. Yggdrasil
    drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud,
    ZeroTier, and Tinc converge to ${\sim}$23\,Mbps at High.
    Headscale and Nebula fail at all profiles ($\times$).}
  \label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's
throughput budget even at High impairment. Quality differences in
Table~\ref{tab:rist_impairment} therefore reflect packet delivery
reliability, not bandwidth.

\begin{table}[H]
  \centering
  \caption{RIST video streaming quality (\%) across impairment
    profiles, sorted by High-profile quality}
  \label{tab:rist_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Mycelium  & 100.0 & 100.0 & 100.0 & 99.9 \\
    EasyTier  & 100.0 & 100.0 & 96.2  & 85.5 \\
    Internal  & 100.0 & 99.2  & 89.3  & 80.2 \\
    ZeroTier  & 100.0 & 99.3  & 89.9  & 80.2 \\
    VpnCloud  & 100.0 & 99.2  & 89.7  & 80.1 \\
    WireGuard & 100.0 & 99.3  & 90.0  & 80.0 \\
    Hyprspace & 100.0 & 92.9  & 87.9  & 78.1 \\
    Tinc      & 100.0 & 99.3  & 90.0  & 77.8 \\
    Nebula    & 99.8  & 98.8  & 85.6  & 72.1 \\
    Yggdrasil & 100.0 & 94.7  & 71.4  & 43.3 \\
    Headscale & 13.1  & 13.0  & 13.0  & 13.0 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
  \caption{RIST video streaming quality across impairment profiles.
    Headscale is stuck at ${\sim}$13\% regardless of profile;
    Mycelium maintains ${\sim}$100\% even at High; Yggdrasil
    declines steeply to 43\%.}
  \label{fig:rist_impairment_heatmap}
\end{figure}

Headscale sits at ${\sim}$13\% across all four profiles: 13.1\%,
13.0\%, 13.0\%, 13.0\%. This profile-independence confirms the
baseline diagnosis (Section~\ref{sec:baseline}): the failure is
% TODO: DOWNSTREAM DEPENDENCY — This repeats the DERP/MTU
% hypothesis from Section baseline as though it were established.
% The baseline TODO notes this hypothesis is unverified (no packet
% capture evidence). Do not present it as a confirmed diagnosis
% here without resolving the upstream TODO.
structural (most plausibly MTU fragmentation in the DERP relay
layer) and cannot worsen because it is already saturated. Adding
latency or loss on top of an 87\% packet drop floor changes
nothing.

Mycelium holds 99.9\% quality even at High impairment, ahead of
Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even
Mycelium's degraded overlay paths comfortably sustain the stream.
The same overlay routing that adds 34.9\,ms of latency and cripples
bulk TCP transfers is harmless at video bitrates, and RIST's
forward error correction handles the residual loss.

% TODO: The claim that jumbo MTU causes burst losses that overwhelm
% FEC is a hypothesis. No FEC analysis or packet-level evidence is
% shown. Consider adding packet capture data or softening the
% claim.
Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at
Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP
throughput likely hurts here as well: large overlay packets are
more exposed to loss and reordering at the outer layer, and the
resulting burst losses may exceed what RIST's FEC can recover.

\subsection{Application-level download}

The Nix binary cache download is the most demanding
application-level benchmark. Hundreds of sequential HTTP
connections amplify the per-connection latency penalties that bulk
throughput tests amortise. Table~\ref{tab:nix_impairment} shows
download times across profiles.

\begin{table}[H]
  \centering
  \caption{Nix binary cache download time (seconds) across
    impairment profiles, sorted by Low-profile time. ``--'' marks a
    failed run.}
  \label{tab:nix_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 8.53 & 11.9 & 58.6 & --  \\
    Headscale & 9.79 & 13.5 & 48.8 & 219 \\
    EasyTier  & 9.39 & 22.1 & 141  & --  \\
    VpnCloud  & 9.39 & 27.9 & 163  & --  \\
    WireGuard & 9.45 & 28.8 & 161  & --  \\
    Nebula    & 9.15 & 30.8 & 180  & 547 \\
    Tinc      & 10.0 & 30.9 & 166  & 496 \\
    ZeroTier  & 9.22 & 36.2 & 141  & --  \\
    Mycelium  & 10.1 & 79.5 & --   & --  \\
    Yggdrasil & 10.6 & 230  & --   & --  \\
    Hyprspace & 11.9 & --   & 170  & --  \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
  \caption{Nix binary cache download time across impairment
    profiles. Headscale, Nebula, and Tinc complete all four
    profiles; Headscale beats Internal at Medium (49\,s vs.\ 59\,s).
    Yggdrasil's Low-profile time explodes to 230\,s ($\times$ marks
    a failed run).}
  \label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all four
profiles. At Medium impairment, Headscale finishes in 48.8~seconds,
faster than Internal's 58.6~seconds. Internal itself fails at High
impairment while Headscale completes in 219~seconds, Tinc in
496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low
impairment, a 22$\times$ slowdown. Every HTTP request pays the
latency penalty of Yggdrasil's impairment-amplified
retransmissions. Mycelium degrades almost as badly (10.1\,s to
79.5\,s, an 8$\times$ increase): its overlay routing overhead
compounds over hundreds of sequential HTTP connections.

% TODO: Hyprspace fails at Low but completes at Medium (170 s).
% This contradicts the "clean gradient" claim. Explain why a VPN
% can fail at Low but succeed at Medium, or note the anomaly.
The failure map shows a mostly clean gradient: more demanding
profiles knock out more VPNs. At Low, 10 of 11 finish (Hyprspace
fails). At Medium, 9 finish, though Hyprspace, which had failed at
Low, completes here in 170\,s. At High, only Headscale, Nebula, and
Tinc survive. Internal's failure at High is the surprising one: the
bare-metal baseline cannot sustain a multi-connection HTTP workload
under severe degradation, while Headscale pulls it through.
Section~\ref{sec:tailscale_degraded} explains why.

\section{Tailscale under degraded conditions}
\label{sec:tailscale_degraded}

% TODO: Editorial pass needed on two chapter-wide issues before
% submission:
% (1) magicsock / wireguard-go userspace-datapath explanation is
%     repeated three times in slightly different forms (once in
%     baseline UDP, once in impairment UDP, once here). Consider
%     introducing it once in full here, where it is load-bearing,
%     and replacing the earlier occurrences with one-sentence
%     forward references.
% (2) This section uses first-person plural ("we pursued", "we
%     worked it out", "we ran two follow-up benchmarks") while
%     the rest of the chapter is in impersonal voice. Either
%     harmonise everything to one voice, or explicitly frame this
%     section as a first-person narrative detour.

This section is about an observation that should not exist:
Headscale, a tunnelling VPN built on a kernel TCP stack and
\texttt{wireguard-go}, beats the bare-metal Internal baseline at
Medium impairment, and at Low impairment under parallel load beats
it by a factor of 2.6. The short answer turns out to be different
from the obvious answer, and we worked it out only by chasing the
obvious answer to its end.

\subsection{An anomaly worth pursuing}

At Medium impairment, Headscale reaches 41.5\,Mbps on a single TCP
stream against Internal's 29.6\,Mbps, a 40\,\% lead for the VPN
over the direct host-to-host link it tunnels through. Headscale
costs the expected ${\sim}$14\,\% at baseline, trails Internal at
Low (274 vs.\ 333\,Mbps), and is essentially tied with it at High
(4.21 vs.\ 4.25\,Mbps). Yet at Medium the order inverts, and not by
a sliver: a 12\,Mbps gap on a 30\,Mbps link is well above
measurement noise. The same thing happens, more dramatically, on
the parallel TCP test, where Headscale's 718\,Mbps at Low beats
Internal's 277\,Mbps by a factor of 2.6.
Table~\ref{tab:headscale_anomaly} collects the comparison.

\begin{table}[H]
  \centering
  \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment
    (18.12.2025 run). For TCP benchmarks, higher is better. For Nix
    cache, lower is better; ``--'' marks a failed run.}
  \label{tab:headscale_anomaly}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} &
    \textbf{Headscale} & \textbf{WireGuard} \\
    \hline
    Single TCP (Mbps)   & Low    & 333  & 274  & 54.7 \\
    Single TCP (Mbps)   & Medium & 29.6 & 41.5 & 8.77 \\
    Single TCP (Mbps)   & High   & 4.25 & 4.21 & 2.63 \\
    Parallel TCP (Mbps) & Low    & 277  & 718  & 173  \\
    Parallel TCP (Mbps) & Medium & 82.6 & 113  & 24.5 \\
    Nix cache (s)       & Medium & 58.6 & 48.8 & 161  \\
    Nix cache (s)       & High   & --   & 219  & --   \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
  \caption{Single-stream TCP throughput for Internal, Headscale,
    and WireGuard across impairment profiles (log scale). Headscale
    crosses above Internal at Medium impairment; WireGuard stays
    far below both; all three converge at High.}
  \label{fig:headscale_vs_internal}
\end{figure}

WireGuard-the-kernel-module is the obvious sanity check. It uses
the same Noise/WireGuard cryptographic protocol Tailscale ships and
is the closest available comparison without the rest of Tailscale's
stack. WireGuard shows none of Headscale's advantage: 54.7\,Mbps at
Low and 8.77\,Mbps at Medium, both well below Internal at the same
profile. So the encryption layer is not the answer, and the basic
UDP tunnel is not the answer. Whatever Headscale is doing
differently lives somewhere else in the rest of Tailscale's
implementation.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit rate
% table for impaired profiles or reference the data source.
The retransmit data narrows the search. At Medium, WireGuard's TCP
retransmit rate is 5.2\,\%, more than double Internal's
${\sim}$2.4\,\%. Headscale matches Internal at ${\sim}$2.4\,\% even
though it is a tunnelling VPN. Both Headscale and bare-metal
Internal run the same host kernel TCP stack at the inner layer, so
the asymmetry is not about a different TCP implementation. It is
about what the kernel TCP stack is being asked to process:
something on Headscale's path is suppressing the spurious
retransmits the kernel would otherwise fire under
\texttt{tc netem}-induced reordering, and WireGuard's path is not.

\subsection{A plausible villain: Tailscale's gVisor stack}

The candidate explanation we pursued first, and the one any reading
of the upstream Tailscale documentation will lead to, is
Tailscale's userspace TCP/IP stack. The Tailscale client imports
Google's gVisor netstack (\texttt{gvisor.dev/gvisor/pkg/tcpip}) as
a Go library and uses it as an in-process TCP implementation. The
gVisor documentation is direct about why this matters: netstack is
designed for adverse networks where the host kernel's TCP defaults
are too aggressive. Tailscale's release notes go further and name
specific overrides on top of gVisor; the most visible are an
explicit RACK disable and 8\,MiB / 6\,MiB receive and send buffers.

The Tailscale source code bears this out.
\texttt{wgengine/netstack/netstack.go} contains the netstack
initialiser, and Listing~\ref{lst:tailscale_netstack_overrides}
reproduces the relevant overrides verbatim. RACK is disabled
(\texttt{TCPRecovery(0)}) with a comment pointing at
\texttt{tailscale/issues/9707}: ``gVisor's RACK performs poorly.
ACKs do not appear to be handled in a timely manner, leading to
spurious retransmissions and a reduced congestion window.'' Reno is
set explicitly with a comment pointing at
\texttt{gvisor/issues/11632}, an integer-overflow bug in gVisor's
CUBIC implementation. The TCP send and receive buffer maxima are
pushed up to 8\,MiB and 6\,MiB. SACK is enabled (gVisor's default
is off).

\lstinputlisting[language=Go,caption={Tailscale's gVisor netstack
initialiser explicitly disables RACK, pins Reno as the congestion
control, and enlarges the TCP buffer maxima. These overrides live
inside \texttt{wgengine/netstack/netstack.go}.
\textit{tailscale/wgengine/netstack/netstack.go:264--339}},label={lst:tailscale_netstack_overrides}]{Listings/tailscale_netstack_overrides.go}

Read against the Linux kernel defaults (RACK on, CUBIC by default,
${\sim}$1\,MiB receive and send buffers,
\texttt{tcp\_reordering=3}, Tail Loss Probe enabled), these
overrides describe a TCP stack better suited to a lossy, reordering
link than the host kernel. The hypothesis follows directly:
Headscale's iPerf3 traffic runs through this gVisor instance
instead of through the host kernel TCP stack, and so it inherits
the more reordering-tolerant behaviour.
WireGuard-the-kernel-module shares only the cryptographic protocol;
it does not include the gVisor stack, and therefore does not get
the advantage.

The natural way to test this is to extract the parameters Tailscale
sets inside gVisor, apply their nearest Linux equivalents to the
bare-metal host as sysctls, and see whether Internal, with no VPN
at all, picks up the same advantage. If it does, the gVisor
explanation is supported. If it does not, the hypothesis fails.

\subsection{Reproducing the effect on bare metal}
\label{sec:tuned}

We ran two follow-up benchmarks on the same hardware and impairment
setup as the original 18.12.2025 run.

\begin{itemize}
  \bitem{Tailscale-style (27.02.2026):}
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0},
    \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes
    (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max},
    \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard,
    Tinc, and ZeroTier.
  \bitem{Reorder-only (06.03.2026):} Only
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and
    \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel
    defaults. Tested on Internal and Headscale only; this
    configuration is sketched in
    Listing~\ref{lst:reorder_only_sysctls_sketch} below.
\end{itemize}
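
A minimal NixOS sketch of the reorder-only configuration follows.
Expressing the sysctls through \texttt{boot.kernel.sysctl} is one
possible mechanism, shown here for illustration; the follow-up runs
only require that the three values be set on every machine, however
that is done. The Tailscale-style run additionally raises
\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max}, and
\texttt{wmem\_max}.

\begin{lstlisting}[language=Nix,caption={Illustrative NixOS
fragment for the reorder-only kernel configuration.
\texttt{boot.kernel.sysctl} is shown as one possible way to apply
the three sysctls declaratively; it is a sketch, not necessarily
the rig's actual mechanism.},label={lst:reorder_only_sysctls_sketch}]
{
  boot.kernel.sysctl = {
    # Tolerate more reordered segments before fast retransmit
    # (kernel default: 3).
    "net.ipv4.tcp_reordering" = 10;
    # Disable RACK time-based loss detection (kernel default: 1).
    "net.ipv4.tcp_recovery" = 0;
    # Disable early retransmit / Tail Loss Probe (kernel default: 3).
    "net.ipv4.tcp_early_retrans" = 0;
  };
}
\end{lstlisting}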

\begin{table}[H]
  \centering
  \caption{Internal (no VPN) throughput across three kernel
    configurations. ``Default'' is the 18.12.2025 run with stock
    Linux TCP parameters.}
  \label{tab:kernel_tuning_internal}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Metric} & \textbf{Profile} & \textbf{Default} &
    \textbf{Tailscale-style} & \textbf{Reorder-only} \\
    \hline
    Single TCP (Mbps)   & Baseline & 934         & 934  & 934  \\
    Single TCP (Mbps)   & Low      & 333         & 363  & 354  \\
    Single TCP (Mbps)   & Medium   & 29.6        & 64.2 & 72.7 \\
    Parallel TCP (Mbps) & Low      & 277         & 893  & 902  \\
    Parallel TCP (Mbps) & Medium   & 82.6        & 226  & 211  \\
    Retransmit \%       & Medium   & ${\sim}$2.4 & 1.21 & 1.11 \\
    Nix cache (s)       & Medium   & 58.6        & 29.7 & 29.1 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
  \caption{Internal (no VPN) single-stream TCP throughput across
    three kernel configurations. Baseline is unchanged; at Medium
    impairment, throughput jumps from 30 to 64 to 73\,Mbps as
    reordering tolerance increases.}
  \label{fig:kernel_tuning_comparison}
\end{figure}

The result felt like confirmation. Internal's Medium-impairment
throughput jumped from 29.6\,Mbps to 72.7\,Mbps under the
reorder-only configuration, a 146\,\% increase from a three-line
sysctl change, and the retransmit rate at Medium dropped from
${\sim}$2.4\,\% to 1.11\,\%, which means more than half of the
original retransmissions were spurious. The Nix cache download at
Medium roughly halved, from 58.6\,s to 29.1\,s.

Parallel TCP gained even more. Internal at Low climbed from 277 to
902\,Mbps, a 226\,\% increase. This exceeds Internal's best
single-stream result under any impairment profile (333\,Mbps at
Low) and overtakes Headscale's original 718\,Mbps from the
unmodified run.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits
% the unresolved 6-vs-10 stream count from the baseline parallel
% test description. Update when that TODO is resolved.
Each of the six concurrent flows benefits independently from the
higher reordering threshold, and the gains compound.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in
% any table. Add a table showing Headscale's results from the
% follow-up runs alongside Internal's so readers can verify the
% reversal.
Headscale itself, retested with the same sysctls, gained more
modestly: +21\,\% at Medium and a small $-$5\,\% wobble at Low. And
the anomaly reversed entirely. At Medium, tuned Internal reached
72.7\,Mbps against Headscale's 50.1\,Mbps — a 45\,\% lead for
Internal where the original run had Headscale 40\,\% ahead. The Nix
cache flipped the same way: Internal completed in 29.1\,s against
Headscale's 36.3\,s, where the original had Headscale 17\,\%
faster.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
  \caption{Internal-to-Headscale speed-up factor before and after
    kernel tuning. Values above 1.0 mean Internal is faster. At
    Medium impairment, the ratio flips from 0.71$\times$ (Headscale
    ahead) to 1.45$\times$ (Internal ahead).}
  \label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration matched or exceeded the full
Tailscale-style configuration on most metrics. The two exceptions
were single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP
at Medium (211 vs.\ 226\,Mbps), both within 7\,\%. The enlarged
buffer sizes did not help and may have added mild buffer bloat that
partially offset the reordering benefit, though the gap could also
be run-to-run variance. Either way, the entire Headscale advantage
on Internal collapsed to three host-kernel sysctls:
\texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and
\texttt{tcp\_early\_retrans}.

At this point in the investigation the hypothesis seemed settled.
Tailscale's gVisor stack ships with these overrides; the bare-metal
kernel ships with stricter defaults; matching the kernel to gVisor
reproduces the effect. Then we checked which Tailscale code path
the test rig was actually running.

\subsection{The data path that was not there}
\label{sec:gvisor_not_in_path}

In default mode (what anyone running \texttt{tailscale up} on a
Linux host gets), the Tailscale client creates a real kernel TUN
device, registers a route for the Tailscale subnet through it, and
forwards inbound and outbound packets through that interface. An
application like iPerf3 issues a \texttt{connect} to the remote
peer's Tailscale IP. The host kernel TCP stack handles the
application TCP. The kernel routes the resulting outbound packets
to the TUN device. \texttt{tailscaled} (with \texttt{wireguard-go}
embedded) reads them from the TUN, encrypts them, and sends them as
outer WireGuard UDP packets on the wire. The receiving side
reverses the process and writes the decrypted inner packets back
into its own TUN, where the host kernel TCP stack delivers them to
the iPerf3 server.

In that path, gVisor netstack is never instantiated. The netstack
initialiser in Listing~\ref{lst:tailscale_netstack_overrides} only
runs when \texttt{tailscaled} is launched with
\texttt{--tun=userspace-networking}, a mode that has no kernel TUN
at all and is reachable only from processes running inside
\texttt{tailscaled} itself (Tailscale SSH, Taildrop, the metric
endpoint). External processes such as iPerf3 cannot reach the
Tailscale network in that mode.

The test rig does not use that mode. The benchmark suite's
Headscale module sets the interface name to
\texttt{ts-\$\{instanceName\}}
(Listing~\ref{lst:rig_interface_name}), so \texttt{tailscaled}
launches with \texttt{--tun ts-headscale}: a real kernel TUN.
External benchmark traffic cannot reach gVisor netstack at all.

\lstinputlisting[language=Nix,caption={The benchmark suite's
Headscale module sets \texttt{interfaceName} to a real kernel TUN
name (\texttt{ts-<instance>}, truncated to 15 characters). This
means \texttt{tailscaled} runs as \texttt{tailscaled --tun
ts-headscale} on every test machine.
\textit{vpn-benchmark-suite/clanModules/headscale/shared.nix:19,273--277}},label={lst:rig_interface_name}]{Listings/rig_interface_name.nix}

The empirical fingerprint pins the same conclusion down without
source-code reading. Headscale itself gained +21\,\% at Medium from
the host-kernel sysctl tuning. If Headscale's iPerf3 traffic were
processed by gVisor netstack, host-kernel sysctls would change
nothing; they configure the host kernel TCP stack and only the host
kernel TCP stack. The fact that Headscale moves measurably under
those sysctls is direct evidence that Headscale's application TCP
runs on the host kernel stack, just as Internal's does.

The validation experiment was therefore validating something other
than the hypothesis it was supposed to validate. It was confirming,
very cleanly, that the Linux kernel's default
\texttt{tcp\_reordering=3} is too tight for the kind of bursty,
correlated reordering the Medium profile produces, and that
loosening it produces a large throughput gain on a kernel-TCP data
path. That part of the result stands. What does not stand is the
inference that the gain reproduces something Tailscale was already
doing in gVisor. For this benchmark, Tailscale is not in the gVisor
TCP business at all.

\subsection{Where the advantage actually lives}

The puzzle the investigation began with has not gone away.
Headscale starts at 41.5\,Mbps where Internal starts at 29.6\,Mbps,
and both run their iPerf3 TCP on the same host kernel TCP stack.
Whatever Headscale is doing (partially, weakly, but reproducibly)
is worth roughly twelve megabits per second on the Medium profile,
and it is not gVisor netstack.

The +21\,\% sysctl gain for Headscale itself is also informative
about the size of the mechanism. If the gain were 0\,\%, Headscale
would already be doing the sysctls' work; if it were +146\,\% like
Internal's, Headscale would be doing nothing of its own. The
partial response says Headscale's mechanism produces an effect
similar in kind to the sysctls but smaller in size, and that the
two effects are not fully additive.

Two features of the \texttt{wireguard-go} data-plane pipeline are
the most likely candidates, and both live on the kernel-TUN path
that Tailscale actually uses in the rig.

The first is TUN TCP and UDP generic receive offload. Tailscale's
\texttt{tstun} wrapper enables both on the kernel TUN device on
Linux unless an environment knob disables them or a runtime probe
rejects the feature (Listing~\ref{lst:tstun_gro}). On the receive
side, this means \texttt{wireguard-go} decrypts a burst of inbound
WireGuard frames and then coalesces consecutive in-order TCP
segments belonging to the same flow into a single super-segment
before writing them back to the kernel TUN. On the transmit side,
it accepts GSO super-segments from the kernel TUN read in the same
way. The receiving kernel TCP stack therefore sees fewer, larger
segments per coalesced batch instead of $N$ small ones, and the
segment timing that survives to the kernel is the timing of GRO
batches rather than of individual on-the-wire packets. Bare-metal
Internal traffic has no equivalent path because it does not pass
through any user-space TUN at all.

\lstinputlisting[language=Go,caption={Tailscale enables TUN TCP and
UDP GRO on every Linux non-TAP \texttt{tailscaled} process unless
the operator disables them via environment knobs or a kernel
runtime probe rejects the feature. This is in the default
kernel-TUN data path; it is not gated on
\texttt{--tun=userspace-networking}.
\textit{tailscale/net/tstun/wrap\_linux.go:25--43}},label={lst:tstun_gro}]{Listings/tstun_gro.go}

The second is the 7\,MiB outer-UDP socket buffer that
\texttt{magicsock} pins on the WireGuard UDP socket
(Listing~\ref{lst:magicsock_buffer}), using the ``force''
\texttt{SO\_*BUFFORCE} variant where available so the value is
honoured even past \texttt{net.core.rmem\_max}. The host kernel
default is in the low hundreds of KiB. Under burst-correlated
impairment (Medium and High both use 50\,\% correlation, so losses
and reorderings cluster), this larger buffer absorbs spikes in
arrival rate that would otherwise overflow the kernel UDP receive
queue and surface as additional inner-TCP losses. Internal has no
such cushion on its incoming wire path.

\lstinputlisting[language=Go,caption={\texttt{magicsock} pins the
outer WireGuard UDP socket's send and receive buffers to 7\,MiB and
uses \texttt{SetBufferSize} with the \texttt{SO\_*BUFFORCE}
(``force'') variant where available, so the value is honoured even
past \texttt{net.core.rmem\_max}.
\textit{tailscale/wgengine/magicsock/magicsock.go:86,3908--3913}},label={lst:magicsock_buffer}]{Listings/magicsock_buffer.go}

% TODO: Neither of the two candidate mechanisms above is directly
% verified in this chapter. A targeted follow-up — for example
% tcpdump on the receiving \texttt{tailscale0} interface during a
% Medium-impairment iPerf3 run, with inter-arrival timing
% analysis — would distinguish their relative contributions and
% confirm the mechanism. The argument here is that they are the
% most plausible candidates consistent with the evidence, not
% measured causes.

A third feature, batched UDP I/O, completes the picture without
changing it qualitatively. \texttt{wireguard-go} uses
\texttt{recvmmsg} and \texttt{sendmmsg} on the outer UDP socket so
a burst of WireGuard frames moves through a single system call.
This does not change \emph{whether} packets are reordered, but it
reduces per-packet timing jitter that the kernel might otherwise
interpret as additional reordering.

Hyprspace cannot be used as a negative control for any of this. It
does import gVisor netstack, but only for its in-VPN
service-network feature, and the Hyprspace benchmark traffic goes
through a kernel TUN exactly like Headscale's
(Section~\ref{sec:hyprspace_bloat}). The two VPNs differ on the
wireguard-go pipeline (TUN GRO and the 7\,MiB outer-UDP buffer),
not on whether gVisor handles their inner TCP. The gVisor angle
simply does not apply to either of them in this benchmark.

The kernel-side picture closes the loop. Three host-kernel TCP
parameters dominate the bare-metal behaviour the benchmarks expose.
\texttt{net.ipv4.tcp\_reordering} (default 3) is the number of
out-of-order segments the kernel will tolerate before declaring
fast retransmit, and with \texttt{tc netem} injecting 0.5--5\,\%
reordering per machine, bursts of several reordered packets are
frequent enough that the threshold is repeatedly tripped on the
bare-metal path. \texttt{net.ipv4.tcp\_recovery} (default
\texttt{1}, RACK enabled) adds time-based reordering detection on
top of the segment-count threshold, which compounds the spurious
retransmits when reordering is high. And
\texttt{net.ipv4.tcp\_early\_retrans} (default \texttt{3}, Tail
Loss Probe enabled) fires speculative retransmits when
unacknowledged segments sit at the tail of a transmission window,
which interacts poorly with an already-impaired link. Loosening any
one of the three softens the kernel's loss detection on the
bare-metal path; loosening all three recovers most of the
throughput. The Headscale path reaches the same kernel TCP stack
but is already feeding it the GRO-coalesced, buffer-cushioned
stream described above, so the kernel's tight defaults fire less
often there to begin with.

The same logic explains the anomaly's shape across profiles. At
baseline there is no reordering, so the kernel's tight
\texttt{tcp\_reordering} threshold never trips and Internal's
native kernel-stack speed wins. As reordering rises from 0.5\,\%
(Low) to 2\,\% (Medium) per machine, the kernel's loss detection
fires on the bare-metal path more often than on the GRO-coalesced
Headscale path, and the throughput gap shifts in Headscale's
favour. At High impairment, both converge to ${\sim}$4.2\,Mbps:
absolute packet loss becomes the dominant bottleneck, and
reordering tolerance no longer matters.

% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier
% (11.5 Mbps) tuned values are not in any table. Add them to
% Table~\ref{tab:kernel_tuning_internal} or a new table.
Other VPNs respond unevenly to the same sysctl tuning. WireGuard's
Medium throughput rises from 8.77 to 12.2\,Mbps (+39\,\%), Tinc's
from 5.53 to 11.5\,Mbps (+108\,\%), and ZeroTier stays flat (12.0
to 11.5\,Mbps).
% TODO: The reading below — that VPNs which add their own
% encapsulation and userspace processing have bottlenecks the host
% kernel sysctls cannot touch — does not cleanly fit the data: Tinc
% (a fully userspace VPN) shows the largest gain (+108\,\%), larger
% than kernel-WireGuard's. A more complete explanation has to
% account for which TCP stack each VPN's application traffic
% actually traverses and which of those stacks the sysctls actually
% reach.
The intuitive reading is that VPNs which add their own
encapsulation and userspace processing have bottlenecks the host
kernel sysctls cannot touch, but Tinc's large gain shows the
picture is not that simple.

The resilient finding from this section, the one that survives
regardless of which of the two Tailscale-side mechanisms turns out
to dominate, is not about Tailscale at all. It is about Linux. The
kernel's default \texttt{tcp\_reordering=3} threshold is too tight
for the kind of bursty, correlated reordering \texttt{tc netem}
produces at the Medium profile, and it costs the bare-metal host
more than half of its achievable throughput. Three lines of
\texttt{sysctl} repair it. The fix is portable to any Linux host
and entirely independent of any VPN.

The less durable finding, and the one that motivated this section,
is that Tailscale's much-discussed userspace TCP stack is not in
the data path for the workload that exposed the anomaly. The
advantage we attributed to it comes from a more ordinary place: the
way \texttt{wireguard-go} batches and coalesces packets between the
wire and the kernel TCP stack, and the larger UDP buffer it pins on
its outer socket. We were chasing the wrong hypothesis with the
right experiment, and the experiment turned out to be more useful
than the hypothesis.

% TODO: These sections are empty stubs but the chapter
% introduction (line 12--13) promises "findings from the source
% code analysis." Either write these sections or remove the
% promise from the intro.

\section{Source code analysis}

\subsection{Feature matrix overview}

% Summary of the 108-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of findings}

% Brief summary table or ranking of VPNs by key metrics. Save
% deeper interpretation for a Discussion chapter.
|