% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment. The chapter concludes with findings
from the source code analysis. A recurring theme is that no single
metric captures VPN performance; the rankings shift depending on
whether one measures throughput, latency, retransmit behavior, or
real-world application performance.

\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself. Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do. On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just 0.60\,ms. WireGuard comes remarkably close to these
numbers, reaching 92.5\,\% of bare-metal throughput with only a single
retransmit across an entire 30-second test. Mycelium sits at the other
extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal
figure.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours. The bulk of that time, about
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
installation and deployment accounted for another 45~minutes (19\,\%),
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
up after restarts. The remaining time was consumed by VPN service
restarts and traffic-control (tc) stabilization.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf QUIC
performance benchmark after six retries, while Hyprspace and Mycelium
failed the UDP iPerf3 test with a 120-second timeout. Each of these
four therefore has an individual success rate of 85.7\,\%, while all
other VPNs passed the full suite (Figure~\ref{fig:success_rate}).

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{1.0\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Average Test Duration per Machine}.png}
        \caption{Average test duration per VPN, including installation
            time and benchmark execution}
        \label{fig:test_duration}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{1.0\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark Success Rate}.png}
        \caption{Benchmark success rate across all seven tests}
        \label{fig:success_rate}
    \end{subfigure}
    \caption{Test execution overview. Hyprspace has the longest average
        duration due to UDP timeouts and long VPN connectivity
        waits. WireGuard completes fastest. Nebula, Headscale,
        Hyprspace, and Mycelium each fail one benchmark.}
    \label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages. Three distinct performance tiers emerge, separated by
natural gaps in the data.
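
For orientation, the following is a minimal sketch of the invocation
behind these numbers; the actual runs are driven by the benchmark
harness, and the peer address is a placeholder:

\begin{verbatim}
# on the receiving node
iperf3 -s

# on the sending node: one TCP stream for 30 s; JSON output
# lets the harness parse throughput and the sender-side
# retransmit counter
iperf3 -c <peer-overlay-ip> -t 30 --json
\end{verbatim}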

\begin{table}[H]
    \centering
    \caption{Single-stream TCP throughput at baseline, sorted by
        throughput. Retransmits are averaged per 30-second test across
        all three link directions. The horizontal rules separate the
        three performance tiers.}
    \label{tab:tcp_baseline}
    \begin{tabular}{lrrr}
        \hline
        \textbf{VPN} & \textbf{Throughput (Mbps)} &
        \textbf{Baseline (\%)} & \textbf{Retransmits} \\
        \hline
        Internal  & 934 & 100.0 & 1.7 \\
        WireGuard & 864 & 92.5  & 1 \\
        ZeroTier  & 814 & 87.2  & 1163 \\
        Headscale & 800 & 85.6  & 102 \\
        Yggdrasil & 795 & 85.1  & 75 \\
        \hline
        Nebula    & 706 & 75.6  & 955 \\
        EasyTier  & 636 & 68.1  & 537 \\
        VpnCloud  & 539 & 57.7  & 857 \\
        \hline
        Hyprspace & 368 & 39.4  & 4965 \\
        Tinc      & 336 & 36.0  & 240 \\
        Mycelium  & 259 & 27.7  & 710 \\
        \hline
    \end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.

Raw throughput alone is incomplete, however. The retransmit column
reveals that not all high-throughput VPNs get there cleanly.
ZeroTier, for instance, reaches 814\,Mbps but accumulates
1\,163~retransmits per test, over 1\,000$\times$ what WireGuard
needs. ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard delivers data with negligible in-tunnel loss. Across all
VPNs, retransmit behavior falls into three groups: \emph{clean}
($<$110: WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed}
(200--900: Tinc, EasyTier, Mycelium, VpnCloud), and
\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).

% TODO: Is this naming scheme any good?

% TODO: Fix TCP Throughput plot

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png}
        \caption{Average single-stream TCP throughput}
        \label{fig:tcp_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png}
        \caption{TCP retransmit rate (\%)}
        \label{fig:tcp_retransmits}
    \end{subfigure}
    \caption{TCP throughput and retransmit rate at baseline. WireGuard
        leads at 864\,Mbps with a single retransmit; Hyprspace
        retransmits nearly 5\,000 segments per test. Retransmit rate
        does not always track inversely with throughput: ZeroTier
        achieves high throughput \emph{despite} heavy retransmission.}
    \label{fig:tcp_results}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control: each retransmit triggers a reduction in the congestion window
(\texttt{cwnd}), throttling the sender. This relationship is visible
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965
retransmits, maintains the smallest max congestion window in the
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
window, the largest of any VPN. At first glance this suggests a
clean inverse correlation between retransmits and congestion window
size, but the picture is misleading. Yggdrasil's outsized window is
largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size. What \emph{is} clear is that
high retransmit rates force TCP to spend more time in congestion
recovery than in steady-state transmission, capping throughput
regardless of available bandwidth. ZeroTier illustrates the
opposite extreme: brute-force retransmission can still yield high
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
bandwidth and unstable flow behavior.
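
Normalizing by segment size makes the MTU caveat concrete. Assuming a
${\sim}$1\,400-byte segment for Hyprspace (its segment size is not
directly reported) and Yggdrasil's 32\,731-byte overlay segments, the
two windows translate to

\begin{equation*}
\frac{4.3\,\text{MB}}{32\,731\,\text{B}} \approx 130 \text{ segments}
\qquad\text{vs.}\qquad
\frac{205\,\text{KB}}{1\,400\,\text{B}} \approx 150 \text{ segments},
\end{equation*}

so measured in segments rather than bytes, the ``largest'' and
``smallest'' windows in the dataset are nearly the same size.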

VpnCloud stands out: its sender reports 538.8\,Mbps but the receiver
measures only 413.4\,Mbps, a 23\,\% gap (the largest in the dataset).
This suggests significant in-tunnel packet loss or buffering at the
VpnCloud layer that the retransmit count (857) alone does not fully
explain.

Variability also differs substantially. WireGuard ranges from 824 to
884\,Mbps across runs (a 60\,Mbps window), while Mycelium spans 122
to 379\,Mbps, a 3:1 ratio. Mycelium's spread, however, is per-link
asymmetry from its overlay path selection
(Section~\ref{sec:mycelium_routing}) rather than stochastic
run-to-run variance. Either way, a VPN with a wide spread is harder
to capacity-plan around than one with consistent performance, even if
the average is lower.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
        \caption{Retransmits vs.\ throughput}
        \label{fig:retransmit_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
        \caption{Retransmits vs.\ max congestion window}
        \label{fig:retransmit_cwnd}
    \end{subfigure}
    \caption{Retransmit correlations (log scale on x-axis). High
        retransmits do not always mean low throughput (ZeroTier: 1\,163
        retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace:
        4\,965 retransmits, 368\,Mbps). The apparent inverse correlation
        between retransmits and congestion window size is dominated by
        Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
        by its 32\,KB jumbo overlay MTU rather than by low retransmits
        alone.}
    \label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges.

\begin{table}[H]
    \centering
    \caption{Average ping RTT at baseline, sorted by latency}
    \label{tab:latency_baseline}
    \begin{tabular}{lr}
        \hline
        \textbf{VPN} & \textbf{Avg RTT (ms)} \\
        \hline
        Internal  & 0.60 \\
        VpnCloud  & 1.13 \\
        Tinc      & 1.19 \\
        WireGuard & 1.20 \\
        Nebula    & 1.25 \\
        ZeroTier  & 1.28 \\
        EasyTier  & 1.33 \\
        \hline
        Headscale & 1.64 \\
        Hyprspace & 1.79 \\
        Yggdrasil & 2.20 \\
        \hline
        Mycelium  & 34.9 \\
        \hline
    \end{tabular}
\end{table}

Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal
0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the
lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms),
yet its throughput tops out at only 539\,Mbps. Low per-packet
latency does not guarantee high bulk throughput. A second group
(Headscale, Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range,
representing moderate overhead. Then there is Mycelium at 34.9\,ms,
so far removed from the rest that
Section~\ref{sec:mycelium_routing} gives it a dedicated analysis.

% TODO: The max RTT claim (8.6 ms) is not visible in the Average RTT
% plot. Add a max-RTT figure or table, or reference the raw data
% source.
ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum
RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any
sub-2\,ms VPN. These spikes point to periodic control-plane
interference that the average hides.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
    \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
        massive outlier at 58$\times$ the internal baseline. VpnCloud is
        the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
    \label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
but only the second-lowest throughput (336\,Mbps). Packets traverse
the tunnel quickly, yet single-threaded userspace processing cannot
keep up with the link speed. The qperf benchmark backs this up: Tinc
maxes out at 14.9\,\% total system CPU while delivering just
336\,Mbps. On a multi-core system, that low whole-system percentage
reflects a single saturated core while the remaining cores sit idle,
a clear sign that the CPU, not the network, is the bottleneck.
Notably, VpnCloud reports the same 14.9\,\% yet reaches 539\,Mbps, so
the limit is Tinc's per-packet processing cost rather than raw CPU
availability. Figure~\ref{fig:latency_throughput} makes this
disconnect easy to spot.

% TODO: These CPU numbers are stated inline but never shown in a plot
% or table. Add a CPU utilization figure or table so readers can
% verify.
The qperf measurements also reveal a wide spread in CPU usage.
Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as
much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%,
surprisingly high for a kernel-level implementation, presumably due
to in-kernel cryptographic processing. At the other end of the scale,
VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the
least CPU time, though low usage does not by itself imply the best
throughput per unit of CPU, as the ratios below show. Nebula and
Headscale are missing from this comparison because qperf failed for
both.
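
Pairing each implementation's iPerf3 throughput with its
qperf-measured CPU share gives a rough efficiency figure. These
ratios are derived from the numbers above rather than measured
directly, and they mix two different benchmarks, so they are
indicative only:

\begin{align*}
\text{Tinc:}      &\quad 336\,\text{Mbps} \,/\, 14.9\,\% \approx 23\,\text{Mbps per \%~CPU}\\
\text{VpnCloud:}  &\quad 539\,\text{Mbps} \,/\, 14.9\,\% \approx 36\,\text{Mbps per \%~CPU}\\
\text{WireGuard:} &\quad 864\,\text{Mbps} \,/\, 30.8\,\% \approx 28\,\text{Mbps per \%~CPU}
\end{align*}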

%TODO: Explain why they consistently failed

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
    \caption{Latency vs.\ throughput at baseline. Each point represents
        one VPN. The quadrants reveal different bottleneck types:
        VpnCloud (low latency, moderate throughput), Tinc (low latency,
        low throughput, CPU-bound), Mycelium (high latency, low
        throughput, overlay routing overhead).}
    \label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

% TODO: The plot labels this benchmark "10-stream parallel" but this
% description says "six unidirectional flows." Verify the actual test
% configuration and reconcile the two.
The single-stream benchmark tests one link direction at a time. The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom)
run simultaneously in a circular pattern for 60~seconds, each
carrying one bidirectional TCP stream (six unidirectional flows in
total). Because three independent link pairs now compete for shared
tunnel resources at once, the aggregate throughput is naturally
higher than any single direction alone, which is why even Internal
reaches 1.50$\times$ its single-stream figure. The scaling factor
(parallel throughput divided by single-stream throughput) captures
two effects: the benefit of using multiple link pairs in parallel,
and how well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.
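
A sketch of the per-link invocation, assuming iPerf3~3.7 or newer for
native bidirectional streams (the peer address is a placeholder, and
the harness coordinates the three concurrent sessions):

\begin{verbatim}
# one bidirectional TCP session per link pair, 60 s;
# three of these run concurrently (lom->yuki,
# yuki->luna, luna->lom), six flows in total
iperf3 -c <peer-overlay-ip> -t 60 --bidir --json
\end{verbatim}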

\begin{table}[H]
    \centering
    \caption{Parallel TCP scaling at baseline. Scaling factor is the
        ratio of parallel to single-stream throughput. Internal's
        1.50$\times$ represents the expected scaling on this hardware.}
    \label{tab:parallel_scaling}
    \begin{tabular}{lrrr}
        \hline
        \textbf{VPN} & \textbf{Single (Mbps)} &
        \textbf{Parallel (Mbps)} & \textbf{Scaling} \\
        \hline
        Mycelium  & 259 & 569  & 2.20$\times$ \\
        Hyprspace & 368 & 803  & 2.18$\times$ \\
        Tinc      & 336 & 563  & 1.68$\times$ \\
        Yggdrasil & 795 & 1265 & 1.59$\times$ \\
        Headscale & 800 & 1228 & 1.54$\times$ \\
        Internal  & 934 & 1398 & 1.50$\times$ \\
        ZeroTier  & 814 & 1206 & 1.48$\times$ \\
        WireGuard & 864 & 1281 & 1.48$\times$ \\
        EasyTier  & 636 & 927  & 1.46$\times$ \\
        VpnCloud  & 539 & 763  & 1.42$\times$ \\
        Nebula    & 706 & 648  & 0.92$\times$ \\
        \hline
    \end{tabular}
\end{table}

The VPNs that gain the most are those most constrained in
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product demands a window
larger than any single flow maintains, so multiple concurrent flows
compensate for that constraint and push throughput to 2.20$\times$
the single-stream figure.
% TODO: The buffer-bloat workaround explanation for Hyprspace's
% parallel scaling is a hypothesis. No direct evidence is shown
% that multiple streams specifically alleviate buffer bloat.
% Consider adding bufferbloat measurements or softening the claim.
% TODO: DOWNSTREAM DEPENDENCY — This claim depends on the buffer bloat
% diagnosis in Section hyprspace_bloat, which itself rests on the unverified
% 2,800 ms under-load latency (see TODO there). If that latency figure
% is not confirmed, this parallel-scaling explanation collapses.
Hyprspace scales almost as well (2.18$\times$), possibly because
multiple streams collectively work around the buffer bloat that
cripples any individual flow (Section~\ref{sec:hyprspace_bloat}).
Tinc picks up a 1.68$\times$ boost because several streams can
collectively keep its single-threaded CPU busy during what would
otherwise be idle gaps in a single flow.
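
The bandwidth-delay arithmetic makes Mycelium's single-stream ceiling
concrete. Filling even half the gigabit link at its baseline RTT
would require a sustained window of roughly

\begin{equation*}
\text{BDP} = \text{rate} \times \text{RTT}
\approx 500\,\text{Mbps} \times 34.9\,\text{ms}
\approx 2.2\,\text{MB},
\end{equation*}

well beyond what its flows sustain in practice; multiple concurrent
flows divide that requirement among themselves.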

% TODO: "zero retransmits" in parallel mode is not shown in any table
% or figure. Add parallel-mode retransmit data or remove the claim.
WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with no recorded retransmits, suggesting that
WireGuard's overhead is a fixed per-packet cost that does not worsen
under multiplexing.

% TODO: "ten streams" vs "six unidirectional flows" --- reconcile
% with the test description above.
Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$)
while retransmits jump from 955 to 2\,462. The streams are clearly
fighting each other for resources inside the tunnel.

More streams also amplify existing retransmit problems. Hyprspace
climbs from 4\,965 to 17\,426~retransmits; VpnCloud from 857 to
6\,023. VPNs that were clean in single-stream mode stay clean under
load, while the stressed ones only get worse.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
        \caption{Single-stream vs.\ parallel throughput}
        \label{fig:single_vs_parallel}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
        \caption{Parallel TCP scaling factor}
        \label{fig:scaling_factor}
    \end{subfigure}
    \caption{Parallel TCP scaling at baseline. Nebula is the only VPN
        where parallel throughput is lower than single-stream
        (0.92$\times$). Mycelium and Hyprspace benefit most from
        parallelism ($>$2$\times$), compensating for latency and buffer
        bloat respectively. The dashed line at 1.0$\times$ marks the
        break-even point.}
    \label{fig:parallel_tcp}
\end{figure}

\subsection{UDP Stress Test}

The UDP iPerf3 test uses an unlimited sender rate (\texttt{-b 0}),
which makes it a deliberate overload test rather than a realistic
workload. The sender throughput values are artifacts: they reflect
how fast the sender can write to the socket, not how fast data
traverses the tunnel. Yggdrasil, for example, reports 63\,744\,Mbps
sender throughput because it uses a 32\,731-byte block size (a
jumbo-frame overlay MTU), inflating the apparent rate per
\texttt{send()} system call. Only the receiver throughput is
meaningful.
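
A minimal sketch of the stress invocation (placeholders as before);
\texttt{-b 0} removes iPerf3's pacing entirely:

\begin{verbatim}
# unlimited-rate UDP flood for 30 s; only the receiver-side
# summary in the JSON output is a meaningful throughput figure
iperf3 -c <peer-overlay-ip> -u -b 0 -t 30 --json
\end{verbatim}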

\begin{table}[H]
    \centering
    \caption{UDP receiver throughput and packet loss at baseline
        (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
        at 120 seconds and are excluded.}
    \label{tab:udp_baseline}
    \begin{tabular}{lrr}
        \hline
        \textbf{VPN} & \textbf{Receiver (Mbps)} &
        \textbf{Loss (\%)} \\
        \hline
        Internal  & 952 & 0.0 \\
        WireGuard & 898 & 0.0 \\
        Nebula    & 890 & 76.2 \\
        Headscale & 876 & 69.8 \\
        EasyTier  & 865 & 78.3 \\
        Yggdrasil & 852 & 98.7 \\
        ZeroTier  & 851 & 89.5 \\
        VpnCloud  & 773 & 83.7 \\
        Tinc      & 471 & 89.9 \\
        \hline
    \end{tabular}
\end{table}

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate
at the kernel level with proper backpressure that matches the sender
to the receiver rate. Every other VPN shows massive loss (69--99\,\%)
because the sender overwhelms the tunnel's userspace processing
capacity. Headscale is the instructive case: although it builds on
WireGuard, its userspace networking stack sits in the path between
the application and the tunnel, so it behaves like a userspace VPN
here (69.8\,\% loss) despite its WireGuard foundation. Yggdrasil's
98.7\,\% loss is the most extreme: it sends the most data (due to
its large block size) but loses almost all of it. These loss rates
do not reflect real-world UDP behavior, but they reveal which VPNs
implement effective flow control; the test also aborts outright on
several implementations rather than degrading gracefully, which
limits its reliability as a measurement while making it a sensitive
indicator of traffic patterns an implementation did not anticipate.
Hyprspace and Mycelium could not complete the UDP test at all,
timing out after 120~seconds.

The \texttt{blksize\_bytes} field reveals each VPN's effective UDP
payload size, i.e., the datagram size iPerf3 selects after tunnel
overhead, not the path MTU itself: Yggdrasil at 32\,731 bytes (jumbo
overlay), ZeroTier at 2\,728, Internal at 1\,448, VpnCloud at
1\,375, WireGuard at 1\,368, Tinc at 1\,353, EasyTier at 1\,288,
Nebula at 1\,228, and Headscale at 1\,208 (the smallest). These
differences affect fragmentation behavior under real workloads,
particularly for protocols that send large datagrams.

%TODO: Mention QUIC
All of these measurements use each VPN's default settings, on the
premise that most users never change them; shipping good defaults is
as much a part of good software as shipping the features, precisely
because they are hard to configure correctly.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Throughput}.png}
        \caption{UDP receiver throughput}
        \label{fig:udp_throughput}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Packet Loss}.png}
        \caption{UDP packet loss}
        \label{fig:udp_loss}
    \end{subfigure}
    \caption{UDP stress test results at baseline (\texttt{-b 0},
        unlimited sender rate). Internal and WireGuard are the only
        implementations with 0\% loss. Hyprspace and Mycelium are
        excluded due to 120-second timeouts.}
    \label{fig:udp_results}
\end{figure}

% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic. This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST. Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time. The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal. Once connection setup,
TLS handshakes, and HTTP round-trips enter the picture, throughput
differences between 500 and 900\,Mbps matter far less than
per-connection latency.

\begin{table}[H]
    \centering
    \caption{Nix binary cache download time at baseline, sorted by
        duration. Overhead is relative to the internal baseline (8.53\,s).}
    \label{tab:nix_cache}
    \begin{tabular}{lrr}
        \hline
        \textbf{VPN} & \textbf{Mean (s)} &
        \textbf{Overhead (\%)} \\
        \hline
        Internal  & 8.53  & -- \\
        Nebula    & 9.15  & +7.3 \\
        ZeroTier  & 9.22  & +8.1 \\
        VpnCloud  & 9.39  & +10.0 \\
        EasyTier  & 9.39  & +10.1 \\
        WireGuard & 9.45  & +10.8 \\
        Headscale & 9.79  & +14.8 \\
        Tinc      & 10.00 & +17.2 \\
        Mycelium  & 10.07 & +18.1 \\
        Yggdrasil & 10.59 & +24.2 \\
        Hyprspace & 11.92 & +39.7 \\
        \hline
    \end{tabular}
\end{table}

Several rankings invert relative to raw throughput. ZeroTier
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite 6\,\%
less raw throughput and 1\,000$\times$ more retransmits. Yggdrasil
is the clearest example: it has the fourth-highest VPN throughput at
795\,Mbps, yet lands at 24\,\% overhead because its 2.2\,ms latency
adds up over the many small sequential HTTP requests that constitute
a Nix cache download. Figure~\ref{fig:throughput_vs_download}
confirms this weak link between raw throughput and real-world
download speed.
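
A simple serial-request model explains the effect. If a download
issues $n$ dependent HTTP requests, each paying the tunnel's extra
round-trip time $\Delta$ on top of the transfer itself, the added
wall-clock time is approximately

\begin{equation*}
T_{\text{added}} \approx n \cdot \Delta .
\end{equation*}

Yggdrasil's $\Delta \approx 1.6$\,ms over Internal would need on the
order of a thousand serialized round trips to account for its extra
${\sim}$2\,s, which is plausible for a cache download made of many
small narinfo and NAR requests; the request count was not
instrumented, so this remains an estimate, not a measurement.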

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache Mean Download Time}.png}
        \caption{Nix cache download time per VPN}
        \label{fig:nix_cache}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
        \caption{Raw throughput vs.\ download time}
        \label{fig:throughput_vs_download}
    \end{subfigure}
    \caption{Application-level download performance. The throughput
        hierarchy compresses under real HTTP workloads: the worst VPN
        (Hyprspace, 11.92\,s) is only 40\% slower than bare metal.
        Throughput explains some variance but not all: Yggdrasil
        (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
        because latency matters more for HTTP workloads.}
    \label{fig:nix_download}
\end{figure}

\paragraph{Video Streaming (RIST).}

At just 3.3\,Mbps, the RIST video stream sits comfortably within
every VPN's throughput budget. This test therefore measures
something different: how well the VPN handles real-time UDP packet
delivery under steady load. Every implementation except Headscale
delivers at least 99.8\,\% average quality; Nebula, at 99.8\,\%, is
the only one below a perfect score. The 14--16 dropped frames that
appear uniformly across all implementations, including Internal
(where no tunnel is present), likely trace back to encoder warm-up
rather than tunnel overhead.

% TODO: The packet-drop distribution statistics (288 mean,
% 10\% median, IQR 255--330) are not shown in any figure.
% Add a box plot or distribution figure for Headscale's RIST drops.
Headscale is the exception. It averages just 13.1\,\% quality,
dropping 288~packets per test interval. The degradation is not
bursty but sustained: median quality sits at 10\,\%, and the
interquartile range of dropped packets spans a narrow 255--330 band.
The qperf benchmark independently corroborates this, having failed
outright for Headscale, confirming that something beyond bulk TCP is
broken.

What makes this failure unexpected is that Headscale builds on
WireGuard, which handles video flawlessly. TCP throughput places
Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and
qperf probes latency-sensitive paths using both TCP and UDP. The
pattern is consistent with Headscale's DERP relay or NAT-traversal
layer being the source, though packet-level evidence would be needed
to confirm this. Its effective UDP payload size of 1\,208~bytes, the
smallest of any VPN, may compound the issue: RIST packets that exceed
this limit would be fragmented, and reassembling fragments under
sustained load could produce exactly the kind of steady, uniform
packet drops the data shows. For video conferencing, VoIP, or any
real-time media workload, this is a disqualifying result regardless
of TCP throughput.
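
The arithmetic behind the fragmentation hypothesis is simple. A
datagram slightly larger than the 1\,208-byte limit splits into two
fragments, and losing either fragment loses the whole datagram.
Under independent per-fragment loss $p$, the effective datagram loss
becomes

\begin{equation*}
p_{\text{datagram}} = 1 - (1 - p)^{2} \approx 2p
\quad\text{for small } p,
\end{equation*}

roughly doubling the loss rate for every fragmented packet, before
counting reassembly-buffer pressure under sustained load. This is a
model of the hypothesis, not a measurement.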

% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500,
% percentiles all zero) are not visible in the RIST Quality bar chart.
% Add a distribution plot or note in the caption that the bar
% chart hides this variance.
Hyprspace reveals a different failure mode. Its average quality
reads 100\,\%, but the raw numbers underneath are far from stable:
mean packet drops of 1\,194 and a maximum spike of 55\,500, with
the 25th, 50th, and 75th percentiles all at zero. Hyprspace
alternates between perfect delivery and catastrophic bursts.
RIST's forward error correction compensates for most of these
events, but the worst spikes are severe enough to overwhelm FEC
entirely.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Video Streaming/RIST Quality}.png}
    \caption{RIST video streaming quality at baseline. Headscale at
        13.1\% average quality is the clear outlier. Every other VPN
        achieves 99.8\% or higher. Nebula is at 99.8\% (minor
        degradation). The video bitrate (3.3\,Mbps) is well within every
        VPN's throughput capacity, so this test reveals real-time UDP
        handling quality rather than bandwidth limits.}
    \label{fig:rist_quality}
\end{figure}

\subsection{Operational Resilience}

Sustained-load performance does not predict recovery speed. How
quickly a tunnel comes up after a reboot, and how reliably it
reconverges, matters as much as peak throughput for operational use.

% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s)
% are not shown in any figure or table. Either add a figure or
% scrap this paragraph (see note below).
First-time connectivity spans a wide range. Headscale and WireGuard
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
(10--14\,s) spend seconds negotiating with their control planes
before passing traffic.

%TODO: Maybe we want to scrap first-time connectivity

Reboot reconnection rearranges the rankings. Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN. WireGuard and Nebula follow at
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, and
10.07\,s across its three nodes, pointing to a hard-coded timer
rather than topology-dependent convergence. Mycelium sits at the
opposite end, needing 76.6~seconds and showing the same suspiciously
uniform pattern (75.7, 75.7, 78.3\,s), suggesting a fixed
protocol-level wait built into the overlay.

%TODO: Hard coded timer needs to be verified

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively. The gap likely reflects the overlay's
spanning-tree rebuild. Yggdrasil organizes its mesh as a spanning
tree and derives each node's routing coordinates from its position in
that tree, so after a restart a node near the root reconverges
quickly, while one further out has to wait for the new topology to
propagate.

\begin{figure}[H]
    \centering
    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
        \caption{Average reconnection time per VPN}
        \label{fig:reboot_bar}
    \end{subfigure}

    \vspace{1em}

    \begin{subfigure}[t]{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
        \caption{Per-node reconnection time heatmap}
        \label{fig:reboot_heatmap}
    \end{subfigure}
    \caption{Reboot reconnection time at baseline. The heatmap reveals
        Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
        95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
        across all nodes). Hyprspace reconnects fastest (8.7\,s average)
        despite its poor sustained-load performance.}
    \label{fig:reboot_reconnection}
\end{figure}

\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain. The following subsections piece together observations from
earlier benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

% TODO: The under-load latency of 2,800 ms is not shown in any plot
% or table. Where does this number come from? Add a figure showing
% latency-under-load (e.g., from qperf concurrent ping) or reference
% the raw data source.
Hyprspace produces the most severe performance collapse in the
dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP
load, that number balloons to roughly 2\,800\,ms, a 1\,556$\times$
increase. This is not the network becoming congested; it is the VPN
tunnel itself filling up with buffered packets and refusing to drain.

The consequences ripple through every TCP metric. With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer, shrinking the max congestion window to
205\,KB, the smallest in the dataset. Under parallel load the
situation worsens: retransmits climb to 17\,426. The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps. A plausible, though
unconfirmed, explanation is that massive ACK delays cause the
sender-side timer to undercount the actual data rate. The UDP test
never finished at all, timing out at 120~seconds.

% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load. It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events. The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.

\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency appears to be the cost of
routing through a global overlay. The per-path numbers, however,
reveal a bimodal distribution:

\begin{itemize}
    \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable
        to Headscale at 1.64\,ms)
    \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed)
    \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed)
\end{itemize}

One of the three links has found a direct route; the other two still
bounce through the overlay. All three machines sit on the same
physical network, so Mycelium does not consistently select the direct
route. Whether that is a path-discovery failure or intended behavior
for a system designed as a global overlay is unclear, but either way
it is a more specific issue than blanket overlay overhead.
Throughput, however, \emph{inverts} rather than mirrors the latency
split: the overlay-routed yuki$\rightarrow$luna path (51.60\,ms)
reaches 379\,Mbps, while the direct luna$\rightarrow$lom path
(1.63\,ms) manages only 122\,Mbps, a 3:1 gap in the opposite
direction of what TCP's bandwidth-delay arithmetic would predict. In
bidirectional mode, the reverse direction on that worst link drops
to 58.4\,Mbps, the lowest single-direction figure in the entire
dataset.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png}
    \caption{Per-link TCP throughput for Mycelium, showing extreme
        path asymmetry. The 3:1 ratio between best
        (yuki$\rightarrow$luna, 379\,Mbps) and worst
        (luna$\rightarrow$lom, 122\,Mbps) links inverts the latency
        split: the low-latency direct path is the slowest link
        (Section~\ref{sec:mycelium_routing}).}
    \label{fig:mycelium_paths}
\end{figure}

% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment
% (47.3 ms) numbers are from qperf but not shown in any figure.
% Add a connection-setup latency table or plot. Also clarify what
% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?)
% so the "3× overhead" can be verified.
The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
incurs that overhead, so workloads dominated by short-lived
connections accumulate it rapidly. Bulk downloads, by contrast,
amortize it: the Nix cache test finishes only 18\,\% slower than
Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase
begins, per-connection latency fades into the background.

Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
hard-coded convergence timer in the overlay protocol rather than
anything topology-dependent, that is, anything that would vary with a
node's position in the mesh. The UDP test timed out at 120~seconds,
and even first-time connectivity required a 70-second wait at
startup.

\paragraph{Tinc: Userspace Processing Bottleneck.}

Tinc is a clear case of a CPU bottleneck masquerading as a network
problem. At 1.19\,ms latency, packets get through the tunnel
quickly. Yet throughput tops out at 336\,Mbps, barely a third of the
bare-metal link. The usual suspects do not apply: Tinc's effective
UDP payload size (\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3,
comparable to VpnCloud at 1\,375 and WireGuard at 1\,368) is in the
normal range, and its retransmit count (240) is moderate. What
limits Tinc is its single-threaded userspace architecture: one CPU
core simply cannot encrypt, copy, and forward packets fast enough to
fill the pipe.

The parallel benchmark supports this diagnosis. Tinc scales to
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
Multiple TCP streams collectively keep that single core busy during
what would otherwise be idle gaps in any individual flow, squeezing
out throughput that no single stream could reach alone.

\section{Impact of Network Impairment}
\label{sec:impairment}

Baseline benchmarks rank VPNs by overhead under ideal conditions.
The impairment profiles from Table~\ref{tab:impairment_profiles}
test a different property: resilience. Two results dominate the
data. First, the throughput hierarchy from
Section~\ref{sec:baseline} collapses under degradation: at High
impairment, the 675\,Mbps spread across all implementations
compresses to under 3\,Mbps, and architectural differences that
matter at gigabit speeds vanish. Second, Headscale outperforms the
bare-metal Internal baseline at Medium impairment across TCP,
parallel TCP, and Nix cache benchmarks. A VPN built on WireGuard
should not beat a direct connection;
Section~\ref{sec:tailscale_degraded} traces the cause to three TCP
parameters in Tailscale's userspace network stack.

\subsection{Ping}

Latency is the most predictable metric under impairment. Most VPNs
absorb the injected delay with a fixed per-hop overhead, and rankings
within the central cluster barely change across profiles
(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and
15\,ms of round-trip delay at Low, Medium, and High respectively;
Internal's measured values (4.82, 9.38, and 15.49\,ms) confirm this.
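
As an illustration, the Low profile corresponds to a netem
configuration of roughly the following shape on each machine. The
interface name is a placeholder and the exact invocation the harness
uses may differ; the parameter values are those listed for the Low
profile (2\,ms delay, 2\,ms jitter, 0.25\,\% loss, 0.5\,\%
reordering per machine):

\begin{verbatim}
# Low profile on one machine; one traversal per direction
# means two traversals per round trip, hence the ~4 ms of
# added RTT measured on the Internal link
tc qdisc add dev eth0 root netem \
    delay 2ms 2ms loss 0.25% reorder 0.5%
\end{verbatim}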

\begin{table}[H]
    \centering
    \caption{Average ping RTT (ms) across impairment profiles, sorted
        by High-profile RTT}
    \label{tab:ping_impairment}
    \begin{tabular}{lrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} \\
        \hline
        Internal  & 0.60  & 4.82  & 9.38  & 15.49 \\
        Tinc      & 1.19  & 5.32  & 9.85  & 15.92 \\
        Nebula    & 1.25  & 5.38  & 9.99  & 15.96 \\
        WireGuard & 1.20  & 5.36  & 9.88  & 15.99 \\
        Headscale & 1.64  & 5.82  & 10.39 & 16.07 \\
        VpnCloud  & 1.13  & 5.41  & 10.35 & 16.21 \\
        ZeroTier  & 1.28  & 5.34  & 10.02 & 16.54 \\
        Yggdrasil & 2.20  & 6.73  & 11.99 & 20.20 \\
        Hyprspace & 1.79  & 6.15  & 10.76 & 24.49 \\
        EasyTier  & 1.33  & 6.27  & 14.13 & 26.60 \\
        Mycelium  & 34.90 & 23.42 & 43.88 & 33.05 \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png}
    \caption{Average ping RTT across impairment profiles. Most VPNs
        form a tight parallel band; Mycelium's non-monotonic curve,
        EasyTier's excess latency at High, and Hyprspace's upward
        divergence stand out.}
    \label{fig:ping_impairment_heatmap}
\end{figure}

Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at
baseline to 23.4\,ms at Low impairment, a 33\,\% improvement where
every other VPN gets slower. It then rises to 43.9\,ms at Medium
before falling again to 33.0\,ms at High. The baseline analysis
(Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency
comes from a bimodal routing distribution: one path runs at 1.63\,ms
while two others route through the global overlay at
${\sim}$51\,ms. The impairment appears to push Mycelium's path
selection toward shorter routes, so a larger share of traffic takes
the direct path. The non-monotonic pattern is consistent with a
path-selection algorithm that responds to measured link quality, but
not linearly with degradation severity; this reading holds whether
the baseline overlay routing is a discovery failure or intended
behavior.

% TODO: Ping packet loss data is not shown in any figure. Add a
% packet loss table/figure or reference the raw data so readers can
% verify these numbers.
Mycelium also achieves 0\,\% ping packet loss at Low and Medium
impairment, while most VPNs show 0.1--3.2\,\% loss at those profiles.
At High impairment, Mycelium's loss jumps to 11.1\,\%.

% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and
% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping
% heatmap only shows averages. Add a jitter/distribution figure.
% Also, the "userspace retry mechanism" is a hypothesized cause
% without source-code or packet-level evidence.
EasyTier accumulates 11\,ms of excess latency at High impairment
beyond what tc~netem accounts for. Its average RTT of 26.6\,ms and
maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) suggest a
userspace retry mechanism that introduces escalating variance.
EasyTier's RTT standard deviation reaches 44.6\,ms at High, the
worst jitter of any VPN.

% TODO: Ping packet loss data is not shown in any plot. The 1/9
% = 11.1\% interpretation is clever but depends on the exact test
% structure (3 pairs × 3 runs × 100 packets). Verify this matches
% the actual test setup and add a supporting figure or table.
Hyprspace shows 11.1\,\% ping packet loss at every impairment level:
Low, Medium, and High alike. With 9~measurement runs (3~machine
pairs $\times$ 3~runs of 100~packets), 11.1\,\% equals exactly 1/9:
one run per profile fails completely while the other eight report
zero loss.
% TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the buffer
% bloat diagnosis from Section hyprspace_bloat, which depends on the
% unverified 2,800 ms under-load latency. If that diagnosis is
% revised, this explanation must also be revisited.
This binary pass/fail behavior is consistent with the buffer bloat
diagnosis from Section~\ref{sec:hyprspace_bloat}: when buffers fill,
an entire path stalls rather than degrading gradually.

\subsection{TCP Throughput}

TCP throughput is where the baseline hierarchy breaks down. The
three performance tiers from Section~\ref{sec:baseline} dissolve at
the first impairment step (Table~\ref{tab:tcp_impairment}).

\begin{table}[H]
    \centering
    \caption{Single-stream TCP throughput (Mbps) across impairment
        profiles, sorted by baseline. Retention is the
        Low-to-baseline ratio.}
    \label{tab:tcp_impairment}
    \begin{tabular}{lrrrrr}
        \hline
        \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
        \textbf{Medium} & \textbf{High} & \textbf{Retention} \\
        \hline
        Internal  & 934 & 333  & 29.6 & 4.25 & 35.7\% \\
        WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\
        ZeroTier  & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\
        Headscale & 800 & 274  & 41.5 & 4.21 & 34.3\% \\
        Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\
        \hline
        Nebula    & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\
        EasyTier  & 636 & 156  & 17.4 & 3.59 & 24.6\% \\
        VpnCloud  & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\
        \hline
        Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\
        Tinc      & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\
        Mycelium  & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\
        \hline
    \end{tabular}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png}
    \caption{Single-stream TCP throughput across impairment profiles.
        Headscale crosses above Internal at Medium impairment;
        Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs
        converge at High.}
    \label{fig:tcp_impairment_heatmap}
\end{figure}

Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a
98.3\,\% throughput loss from adding just 2\,ms latency, 2\,ms
jitter, 0.25\,\% packet loss, and 0.5\,\% reordering per machine.
Even Mycelium, the slowest VPN at baseline (259\,Mbps), retains more
throughput at Low than Yggdrasil does. The jumbo overlay MTU of
32\,731~bytes, which inflated baseline metrics
(Section~\ref{sec:baseline}), becomes a liability under impairment:
each lost or reordered outer packet triggers retransmission of
${\sim}$24$\times$ more inner-layer data than a standard
1\,400-byte MTU VPN would lose.
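
A first-order model quantifies the amplification. If each
32\,731-byte overlay packet travels as $k \approx 24$ independent
outer packets, an outer loss rate $p$ becomes an inner-packet loss of

\begin{equation*}
p_{\text{inner}} = 1 - (1 - p)^{k},
\end{equation*}

which for $p$ between 0.25\,\% and 0.5\,\% (one or both netem
instances acting on a given direction) puts $p_{\text{inner}}$
between roughly 6\,\% and 11\,\%, enough on its own to explain the
collapse. This is a back-of-the-envelope estimate, not a measured
quantity.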

Headscale retains 34.3\,\% of its baseline throughput at Low, nearly
matching Internal's 35.7\,\%. At Medium impairment, Headscale
(41.5\,Mbps) overtakes Internal (29.6\,Mbps): a VPN outperforming
the bare-metal baseline.
Section~\ref{sec:tailscale_degraded} investigates this anomaly in
detail.

At High impairment, the throughput range compresses from 675\,Mbps
at baseline to just 2.9\,Mbps. Internal leads at 4.25\,Mbps;
Hyprspace trails at 1.39\,Mbps. The impairment profile itself
becomes the bottleneck. With 2.5\,\% packet loss and 5\,\%
reordering per machine, every implementation is TCP-loss-limited,
and architectural differences that matter at gigabit speeds become
irrelevant.

\subsection{UDP Throughput}

The UDP stress test (\texttt{-b~0}) separates kernel-level from
userspace implementations more cleanly than any TCP benchmark. It
also produces widespread failures under impairment: Hyprspace and
Mycelium, which already failed at baseline, continue to time out at
all profiles; Tinc drops out at Low and Medium; ZeroTier fails at
Medium. Notably, the failures are not monotonic with severity:
Internal and WireGuard abort at Low yet complete Medium and High,
and Tinc passes High after failing both milder profiles. That
pattern points toward an interaction between iPerf3's unlimited-rate
sender and the tc qdiscs rather than a purely VPN-level limitation.
Despite the sparse and somewhat unreliable dataset, one pattern is
clear.

Kernel-level implementations maintain throughput in the runs they
complete. Internal holds ${\sim}$950\,Mbps at Baseline, Medium, and
High. Headscale sustains 700--876\,Mbps and WireGuard
850--898\,Mbps across their completed runs; both use WireGuard's
kernel module for the outer tunnel, which provides proper
backpressure at the transport layer. Userspace VPNs collapse:
EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across
successive profiles. Yggdrasil, already pathological at baseline
(98.7\,\% loss), crashes to 12.3\,Mbps at Low and fails entirely at
Medium and High.

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png}
    \caption{UDP receiver throughput across impairment profiles.
        Kernel-level VPNs (Internal, WireGuard, Headscale) maintain
        high throughput in the runs they complete, though each also
        has failed runs, notably at Low; userspace VPNs collapse or
        fail entirely ($\times$ marks a failed run).}
    \label{fig:udp_impairment_heatmap}
\end{figure}
|
||
|
||
The failure rate of this benchmark under impairment limits its value
as a throughput measurement, but the failures are still informative
if read with care. Failures that are consistent across every
profile, as with Hyprspace and Mycelium, suggest genuine
flow-control problems that are likely to surface under real
workloads too, even if the symptoms are milder. The non-monotonic
failures (Internal and WireGuard at Low but not at Medium or High)
are better attributed to the harness and should not be read as a
robustness verdict on those implementations.

\subsection{Parallel TCP}

% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must match
% the baseline parallel test description. The baseline section has an
% unresolved TODO about whether the test uses 6 or 10 streams. If the
% baseline is corrected to 10, this section must also be updated.
The Headscale anomaly from single-stream TCP grows larger under
parallel load. Table~\ref{tab:parallel_impairment} shows aggregate
throughput across three concurrent bidirectional links (six
unidirectional flows); the sketch after this paragraph illustrates
the measurement setup.

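The following sketch launches three concurrent bidirectional iPerf3
sessions. It is a sketch under stated assumptions, not the suite's
actual code: the per-link server ports are invented for the example,
and \texttt{--bidir} requires iPerf3~3.7 or newer.

\begin{verbatim}
# Illustrative sketch, not the suite's actual code. Three concurrent
# bidirectional sessions (--bidir) produce six unidirectional flows.
# The per-port server layout is an assumption for this example.
import subprocess

PORTS = [5201, 5202, 5203]  # one iperf3 server per port (assumed)

def run_parallel_tcp(server_ip, seconds=30):
    procs = [
        subprocess.Popen(
            ["iperf3", "-c", server_ip, "-p", str(port),
             "--bidir", "-t", str(seconds), "--json"],
            stdout=subprocess.PIPE, text=True)
        for port in PORTS
    ]
    # Aggregate throughput is the sum over all links of both the
    # send and receive directions in each JSON report.
    return [p.communicate()[0] for p in procs]
\end{verbatim}
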
\begin{table}[H]
  \centering
  \caption{Parallel TCP throughput (Mbps) across impairment profiles.
    Three concurrent bidirectional links produce six unidirectional
    flows.}
  \label{tab:parallel_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 1398 & 277  & 82.6 & 10.4 \\
    Headscale & 1228 & 718  & 113  & 20.0 \\
    WireGuard & 1281 & 173  & 24.5 & 8.39 \\
    Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\
    ZeroTier  & 1206 & 176  & 35.4 & 7.97 \\
    EasyTier  & 927  & 473  & 57.4 & 10.7 \\
    Hyprspace & 803  & 2.87 & 6.94 & 3.62 \\
    VpnCloud  & 763  & 174  & 23.7 & 8.25 \\
    Nebula    & 648  & 103  & 15.3 & 4.93 \\
    Mycelium  & 569  & 72.7 & 7.51 & 3.69 \\
    Tinc      & 563  & 168  & 23.7 & 8.25 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png}
  \caption{Parallel TCP throughput across impairment profiles.
    Headscale dominates at Low (718\,Mbps vs.\ Internal's 277);
    EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to
    2.87\,Mbps.}
  \label{fig:parallel_impairment_heatmap}
\end{figure}

At Low impairment Headscale reaches 718\,Mbps: 2.6$\times$ Internal
(277\,Mbps) and 4.1$\times$ WireGuard (173\,Mbps). At Medium,
Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%.
Whatever mechanism produces the single-stream crossover at Medium
scales with the number of flows: six independent streams each
benefit from it.

EasyTier is the second-most resilient VPN under parallel load, at
473\,Mbps at Low (51\% of baseline). Both EasyTier and Headscale
retain more than half their baseline parallel throughput at Low
impairment; no other VPN exceeds 30\%. Unlike Headscale, whose
resilience is traced to its userspace TCP stack in
Section~\ref{sec:tailscale_degraded}, EasyTier's resilience has no
confirmed architectural explanation in the data collected here.

Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\%
loss.
% TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat diagnosis
% from Section hyprspace_bloat, which depends on the unverified 2,800 ms
% under-load latency. If that diagnosis is revised, this explanation
% for parallel collapse must also be revisited.
The buffer bloat that appears to plague single-stream transfers
(Section~\ref{sec:hyprspace_bloat}) would become catastrophic when
six concurrent flows compete for the same bloated buffers.

The High-profile convergence effect is even more pronounced here than
in single-stream mode. Tinc and VpnCloud land at identical
8.25\,Mbps despite differing by 200\,Mbps at baseline.

\subsection{QUIC Performance}

Headscale and Nebula failed the qperf QUIC benchmark at baseline
(Section~\ref{sec:baseline}) and continue to fail across all
impairment profiles.

Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to
7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High ---
the same cliff observed in its TCP results, again driven by
jumbo-MTU amplification of outer-layer packet loss.

At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps),
ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within
0.4\,Mbps of each other. At baseline these four span a 188\,Mbps
range (844 to 656\,Mbps). QUIC's own congestion control, operating
atop the already-degraded outer link, becomes the sole limiter.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png}
  \caption{QUIC bandwidth across impairment profiles. Yggdrasil
    drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier,
    and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and
    Nebula fail at all profiles ($\times$).}
  \label{fig:quic_impairment_heatmap}
\end{figure}

\subsection{Video Streaming}

At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's
throughput budget even at High impairment. Quality differences in
Table~\ref{tab:rist_impairment} therefore reflect packet delivery
reliability, not bandwidth.

\begin{table}[H]
  \centering
  \caption{RIST video streaming quality (\%) across impairment
    profiles, sorted by High-profile quality}
  \label{tab:rist_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Mycelium  & 100.0 & 100.0 & 100.0 & 99.9 \\
    EasyTier  & 100.0 & 100.0 & 96.2  & 85.5 \\
    Internal  & 100.0 & 99.2  & 89.3  & 80.2 \\
    ZeroTier  & 100.0 & 99.3  & 89.9  & 80.2 \\
    VpnCloud  & 100.0 & 99.2  & 89.7  & 80.1 \\
    WireGuard & 100.0 & 99.3  & 90.0  & 80.0 \\
    Hyprspace & 100.0 & 92.9  & 87.9  & 78.1 \\
    Tinc      & 100.0 & 99.3  & 90.0  & 77.8 \\
    Nebula    & 99.8  & 98.8  & 85.6  & 72.1 \\
    Yggdrasil & 100.0 & 94.7  & 71.4  & 43.3 \\
    Headscale & 13.1  & 13.0  & 13.0  & 13.0 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png}
  \caption{RIST video streaming quality across impairment profiles.
    Headscale is stuck at ${\sim}$13\% regardless of profile;
    Mycelium maintains ${\sim}$100\% even at High; Yggdrasil
    declines steeply to 43\%.}
  \label{fig:rist_impairment_heatmap}
\end{figure}

Headscale stays at ${\sim}$13\% across all four profiles: 13.1\%,
13.0\%, 13.0\%, 13.0\%. The profile-independence is consistent with
the baseline hypothesis from Section~\ref{sec:baseline}: a
structural failure, possibly MTU fragmentation in the DERP relay
layer (not yet verified by packet capture), that cannot worsen
because it is already saturated. Adding latency or loss on top of an
87\% packet drop floor changes nothing.

Mycelium delivers 99.9\% quality even at High impairment, better than
Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even
Mycelium's degraded overlay paths can sustain the stream. The same
overlay routing that adds 34.9\,ms of latency and cripples bulk TCP
transfers is harmless at video bitrates. RIST's own forward error
correction compensates for whatever packet loss remains.

Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at
Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP
throughput plausibly hurts here too: large overlay packets carrying
RIST data are more likely to be lost or reordered at the outer
layer, and RIST's FEC may not recover from the resulting burst
losses. No packet-level evidence was collected to confirm this
mechanism.

\subsection{Application-Level Download}

The Nix binary cache download is the most demanding application-level
benchmark: hundreds of sequential HTTP connections amplify
per-connection latency penalties that bulk throughput tests amortize.
A simple cost model after this paragraph makes the amplification
concrete; Table~\ref{tab:nix_impairment} shows the measured download
times across profiles.

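The model below is a back-of-the-envelope sketch, not measured data;
the request count, object size, and delay figures are invented for
illustration. Its point is structural: sequential requests pay their
per-request delay $n$ times, so a stall that a single bulk transfer
would absorb once is multiplied by the request count.

\begin{verbatim}
# Back-of-the-envelope model; every number here is an illustrative
# assumption, not a measurement from the benchmark suite.
def sequential_download_s(n_requests, per_request_delay_s,
                          bytes_per_request, goodput_bps):
    # Each sequential request pays its own fixed delay plus its
    # transfer time; nothing is amortized across requests.
    transfer_s = 8 * bytes_per_request / goodput_bps
    return n_requests * (per_request_delay_s + transfer_s)

# 300 requests of 500 kB each at 100 Mbps goodput: raising the
# per-request delay from 5 ms to 500 ms (retransmission stalls)
# adds roughly 150 s to the total on its own.
base     = sequential_download_s(300, 0.005, 500_000, 100e6)  # ~13.5 s
impaired = sequential_download_s(300, 0.500, 500_000, 100e6)  # ~162 s
\end{verbatim}
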
\begin{table}[H]
  \centering
  \caption{Nix binary cache download time (seconds) across impairment
    profiles, sorted by Low-profile time. ``--'' marks a failed
    run.}
  \label{tab:nix_impairment}
  \begin{tabular}{lrrrr}
    \hline
    \textbf{VPN} & \textbf{Baseline} & \textbf{Low} &
    \textbf{Medium} & \textbf{High} \\
    \hline
    Internal  & 8.53 & 11.9 & 58.6 & --  \\
    Headscale & 9.79 & 13.5 & 48.8 & 219 \\
    EasyTier  & 9.39 & 22.1 & 141  & --  \\
    VpnCloud  & 9.39 & 27.9 & 163  & --  \\
    WireGuard & 9.45 & 28.8 & 161  & --  \\
    Nebula    & 9.15 & 30.8 & 180  & 547 \\
    Tinc      & 10.0 & 30.9 & 166  & 496 \\
    ZeroTier  & 9.22 & 36.2 & 141  & --  \\
    Mycelium  & 10.1 & 79.5 & --   & --  \\
    Yggdrasil & 10.6 & 230  & --   & --  \\
    Hyprspace & 11.9 & --   & 170  & --  \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png}
  \caption{Nix binary cache download time across impairment profiles.
    Headscale, Nebula, and Tinc complete all four profiles; Headscale
    beats Internal at Medium (49\,s vs.\ 59\,s). Yggdrasil's
    Low-profile time explodes to 230\,s ($\times$ marks a failed run).}
  \label{fig:nix_impairment_heatmap}
\end{figure}

Headscale, Nebula, and Tinc are the only VPNs to complete all four
profiles. At Medium impairment, Headscale finishes in 48.8~seconds
--- faster than Internal's 58.6~seconds. Internal itself fails at
High impairment while Headscale completes in 219~seconds, Tinc in
496~seconds, and Nebula in 547~seconds.

Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low
impairment, a 22$\times$ slowdown. Every HTTP request incurs the
latency penalty from Yggdrasil's impairment-amplified
retransmissions. Mycelium also degrades severely (10.1\,s to
79.5\,s, an 8$\times$ increase), consistent with its overlay routing
overhead, which compounds over hundreds of sequential HTTP
connections.

The failure map shows a mostly clean gradient: more demanding
profiles knock out more VPNs. At Low, 10 of 11 complete (Hyprspace
fails). At Medium, 9 complete, including Hyprspace, which failed at
Low yet finishes Medium in 170\,s; like the UDP results, this
non-monotonic failure points to a harness interaction rather than
the impairment itself. At High, only 3 survive (Headscale, Nebula,
Tinc). Internal's failure at High is the most surprising: the
bare-metal baseline cannot sustain a multi-connection HTTP workload
under severe degradation, but Headscale, shielded by its userspace
TCP stack, can. Section~\ref{sec:tailscale_degraded} explains why.

\section{Tailscale Under Degraded Conditions}
\label{sec:tailscale_degraded}

\subsection{Observed Anomaly}

At Medium impairment, Headscale delivers 41.5\,Mbps single-stream TCP
throughput --- 40\% more than Internal's 29.6\,Mbps. A VPN built
atop WireGuard outperforms the bare-metal connection it tunnels
through. The anomaly is consistent across benchmarks:
Table~\ref{tab:headscale_anomaly} summarizes the comparison.

\begin{table}[H]
  \centering
  \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment
    (18.12.2025 run). For TCP benchmarks, higher is better. For
    Nix cache, lower is better; ``--'' marks a failed run.}
  \label{tab:headscale_anomaly}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} &
    \textbf{Headscale} & \textbf{WireGuard} \\
    \hline
    Single TCP (Mbps)   & Low    & 333  & 274  & 54.7 \\
    Single TCP (Mbps)   & Medium & 29.6 & 41.5 & 8.77 \\
    Single TCP (Mbps)   & High   & 4.25 & 4.21 & 2.63 \\
    Parallel TCP (Mbps) & Low    & 277  & 718  & 173  \\
    Parallel TCP (Mbps) & Medium & 82.6 & 113  & 24.5 \\
    Nix cache (s)       & Medium & 58.6 & 48.8 & 161  \\
    Nix cache (s)       & High   & --   & 219  & --   \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png}
  \caption{Single-stream TCP throughput for Internal, Headscale, and
    WireGuard across impairment profiles (log scale). Headscale
    crosses above Internal at Medium impairment; WireGuard stays far
    below both; all three converge at High.}
  \label{fig:headscale_vs_internal}
\end{figure}

In parallel TCP at Low impairment, Headscale reaches 718\,Mbps vs.\
Internal's 277\,Mbps (2.6$\times$). The Nix cache download at
Medium takes Headscale 48.8\,s vs.\ Internal's 58.6\,s (17\%
faster). At High impairment, Internal fails the Nix cache entirely
while Headscale completes in 219\,s.

WireGuard, which shares Headscale's cryptographic layer, shows no
such advantage: 54.7\,Mbps at Low, 8.77\,Mbps at Medium. Whatever
protects Headscale is not the encryption or the tunnel --- it is
something in Tailscale's userspace networking stack.

% TODO: The Medium-impairment retransmit percentages (5.2\%,
% 2.4\%) are not in any table or figure. Add a retransmit rate
% table for impaired profiles or reference the data source.
The retransmit data provides the first clue. At Medium impairment,
WireGuard's retransmit rate is 5.2\% --- more than double Internal's
${\sim}$2.4\%. Headscale, despite being a VPN, matches Internal at
${\sim}$2.4\%. WireGuard uses the host kernel's TCP stack, which
treats reordered packets as losses and fires spurious retransmits;
Headscale's gVisor stack tolerates more reordering, so fewer
retransmissions are wasted on packets that were merely delayed.

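For reproducibility, retransmit rates of this kind can be derived
from iPerf3's TCP JSON output, which reports a sender-side retransmit
count rather than a percentage. The sketch below shows one plausible
derivation; the MSS value is an assumption, and the suite's actual
computation is not shown here.

\begin{verbatim}
# Sketch: derive a retransmit percentage from an iperf3 TCP JSON
# report. iperf3 reports a retransmit count, not a rate; converting
# bytes to segments requires assuming an MSS (1448 is a guess for a
# typical 1500-byte-MTU path, and will differ inside a tunnel).
def retransmit_pct(report, mss=1448):
    sent = report["end"]["sum_sent"]  # sender-side totals
    segments = sent["bytes"] / mss
    return 100.0 * sent["retransmits"] / segments
\end{verbatim}
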
\subsection{Congestion Control Analysis}

Tailscale uses a userspace TCP/IP stack derived from Google's gVisor
(netstack). This stack does not inherit the host kernel's TCP
parameters. Three defaults differ from the Linux kernel in ways that
matter under packet reordering (a verification sketch follows the
list):

\begin{itemize}
  \bitem{\texttt{tcp\_reordering}:} gVisor uses 10; the Linux kernel
    defaults to~3. This parameter controls how many out-of-order
    packets TCP tolerates before treating the event as a loss. With
    tc~netem injecting 0.5--2.5\% reordering per machine, bursts of
    3+ reordered packets are frequent. The kernel's threshold of~3
    causes spurious fast retransmits and congestion window reductions
    for packets that are merely reordered, not lost.
  \bitem{\texttt{tcp\_recovery} (RACK):} gVisor disables it; the
    Linux kernel enables it by default. RACK uses timing-based loss
    detection that is more aggressive than the pure sequence-based
    approach gVisor uses. Under reordering, RACK's timing heuristics
    can falsely classify delayed packets as lost.
  \bitem{\texttt{tcp\_early\_retrans} (TLP):} gVisor disables it; the
    kernel enables it. Tail Loss Probe sends speculative retransmits
    on idle connections, which can worsen congestion when the link is
    already impaired.
\end{itemize}

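To check how a given host compares against these gVisor defaults, the
three parameters can be read from the standard \texttt{/proc/sys}
interface. This sketch assumes a Linux host and is not part of the
benchmark suite:

\begin{verbatim}
# Read the three kernel parameters discussed above on a live Linux
# host, for comparison against gVisor's defaults (reordering=10,
# RACK disabled, TLP disabled). Reading requires no privileges.
PARAMS = ["tcp_reordering", "tcp_recovery", "tcp_early_retrans"]

for name in PARAMS:
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().strip())
\end{verbatim}
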
Under packet reordering, these three defaults compound. The Linux
TCP stack fires retransmits and cuts the congestion window far more
often than necessary; each false positive shrinks the window and
reduces throughput. Tailscale's gVisor stack tolerates more
reordering before reacting, so its congestion window stays larger and
throughput stays higher.

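The standard macroscopic TCP model (Mathis et al.) makes the
mechanism concrete, as a first-order approximation that ignores
timeouts and RACK dynamics:
\[
  \text{throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{p}},
  \qquad
  p \;=\; p_{\mathrm{loss}} + p_{\mathrm{spurious}},
\]
where $p$ is the loss rate as perceived by the sender. Reordering
misclassified as loss inflates $p_{\mathrm{spurious}}$ without any
packet actually being dropped, so a stack with a low reordering
threshold operates on a larger effective $p$ and a correspondingly
smaller congestion window.
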
This explains why the anomaly emerges as impairment increases and
then vanishes. At baseline, there is no reordering, so the threshold
difference is irrelevant and Internal's kernel-level processing
advantage dominates. As reordering increases from 0.5\% (Low) to
2.5\% (Medium) per machine, the kernel's aggressive loss detection
fires more often, and the throughput gap shifts in Headscale's
favor. At High impairment, however, both converge to
${\sim}$4.2\,Mbps: the absolute packet loss rate becomes the
dominant bottleneck, overriding the reordering tolerance advantage.

\subsection{Tuned Kernel Parameters}

Two follow-up benchmark runs applied Tailscale's gVisor TCP
parameters to the host kernel via sysctl (a sketch of the
reorder-only variant follows the list):

\begin{itemize}
  \bitem{Full gVisor (27.02.2026):} All parameters ---
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0},
    \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes
    (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max},
    \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard,
    Tinc, and ZeroTier.
  \bitem{Reorder-only (06.03.2026):} Only
    \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and
    \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel
    defaults. Tested on Internal and Headscale only.
\end{itemize}

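One way to reproduce the reorder-only configuration is to write the
three sysctls directly; the runs themselves may have applied them
through \texttt{sysctl} or a configuration file, so treat this as a
sketch rather than the suite's provisioning code. Root privileges
are required:

\begin{verbatim}
# Sketch of the reorder-only tuning (06.03.2026 run): three sysctls,
# buffer sizes untouched. Writing /proc/sys requires root.
TUNING = {
    "tcp_reordering": "10",    # kernel default: 3
    "tcp_recovery": "0",       # disable RACK loss detection
    "tcp_early_retrans": "0",  # disable Tail Loss Probe
}

for name, value in TUNING.items():
    with open(f"/proc/sys/net/ipv4/{name}", "w") as f:
        f.write(value)
\end{verbatim}
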
Table~\ref{tab:kernel_tuning_internal} shows how Internal responds
to the tuning. Both follow-up runs used the same impairment profiles
and hardware as the original 18.12.2025 run.

\begin{table}[H]
  \centering
  \caption{Internal (no VPN) throughput across three kernel
    configurations. ``Default'' is the 18.12.2025 run with stock
    Linux TCP parameters.}
  \label{tab:kernel_tuning_internal}
  \begin{tabular}{llrrr}
    \hline
    \textbf{Metric} & \textbf{Profile} & \textbf{Default} &
    \textbf{Full gVisor} & \textbf{Reorder-only} \\
    \hline
    Single TCP (Mbps)   & Baseline & 934         & 934  & 934  \\
    Single TCP (Mbps)   & Low      & 333         & 363  & 354  \\
    Single TCP (Mbps)   & Medium   & 29.6        & 64.2 & 72.7 \\
    Parallel TCP (Mbps) & Low      & 277         & 893  & 902  \\
    Parallel TCP (Mbps) & Medium   & 82.6        & 226  & 211  \\
    Retransmit \%       & Medium   & ${\sim}$2.4 & 1.21 & 1.11 \\
    Nix cache (s)       & Medium   & 58.6        & 29.7 & 29.1 \\
    \hline
  \end{tabular}
\end{table}

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png}
  \caption{Internal (no VPN) single-stream TCP throughput across three
    kernel configurations. Baseline is unchanged; at Medium
    impairment, throughput jumps from 30 to 64 to 73\,Mbps as
    reordering tolerance increases.}
  \label{fig:kernel_tuning_comparison}
\end{figure}

Internal's Medium-impairment throughput jumps from 29.6 to
72.7\,Mbps --- a 146\% increase from a three-line sysctl change. The
retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; over half
of the original retransmissions were spurious. The Nix cache download
at Medium halves from 58.6\,s to 29.1\,s.

Parallel TCP sees an even larger gain. Internal at Low impairment
climbs from 277 to 902\,Mbps, a 226\% increase that now exceeds
Headscale's original 718\,Mbps.
% TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits the
% unresolved 6-vs-10 stream count from the baseline parallel test
% description. Update when that TODO is resolved.
With six concurrent flows each
independently benefiting from the higher reordering threshold, the
aggregate improvement compounds.

% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in
% any table. Add a table showing Headscale's results from the
% follow-up runs alongside Internal's so readers can verify the
% reversal.
The anomaly reverses. At the measured impairment levels and
benchmarks, tuned Internal now meets or exceeds Headscale. At Medium
impairment: Internal 72.7\,Mbps vs.\ Headscale 50.1\,Mbps (Internal
45\% ahead), where the original result had Headscale 40\% ahead. The
Nix cache flips too: Internal completes in 29.1\,s vs.\ Headscale's
36.3\,s, where the original had Headscale 17\% faster.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png}
  \caption{Internal-to-Headscale speed-up factor before and after
    kernel tuning. Values above 1.0 mean Internal is faster. At
    Medium impairment, the ratio flips from 0.71$\times$ (Headscale
    ahead) to 1.45$\times$ (Internal ahead).}
  \label{fig:headscale_gap_reversal}
\end{figure}

The reorder-only configuration (06.03) matches or exceeds the full
gVisor configuration (27.02) at most metrics; the two exceptions are
single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at
Medium (211 vs.\ 226\,Mbps), both within 7\%. Internal reaches
72.7\,Mbps at Medium with reorder-only vs.\ 64.2\,Mbps with full
gVisor. The enlarged buffer sizes appear unnecessary and may
introduce mild buffer bloat that partially offsets the reordering
benefit, though the difference could also reflect normal run-to-run
variance. On this evidence, the entire Headscale advantage is
explained by three kernel parameters: \texttt{tcp\_reordering},
\texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}.

% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier
% (11.5 Mbps) tuned values are not in any table. Add them to
% Table~\ref{tab:kernel_tuning_internal} or a new table.
Other VPNs benefit less from the kernel tuning. WireGuard's Medium
throughput rises from 8.77 to 12.2\,Mbps (+39\%) and Tinc's from
5.53 to 11.5\,Mbps (+108\%). ZeroTier stays flat (12.0 to
11.5\,Mbps). The tuning helps the kernel TCP stack, but VPNs that
add their own encapsulation overhead and userspace processing have
independent bottlenecks that the sysctl parameters cannot remove.

Headscale itself gets modestly faster with kernel tuning (+21\% at
Medium) but slightly slower at Low impairment ($-$5\%). Its
userspace gVisor stack already optimizes for reordering tolerance.
When the kernel stack also increases its tolerance, the two layers
of tuning may interact suboptimally: both independently delay
retransmits, which could cause compound delays on the
kernel-to-Headscale socket path. This interaction was not verified
experimentally.

% TODO: These sections are empty stubs but the chapter introduction
% (line 12--13) promises "findings from the source code analysis."
% Either write these sections or remove the promise from the intro.

\section{Source Code Analysis}

\subsection{Feature Matrix Overview}

% Summary of the 131-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security Vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of Findings}

% Brief summary table or ranking of VPNs by key metrics.
% Save deeper interpretation for a Discussion chapter.