
% Chapter Template

\chapter{Results} % Main chapter title

\label{Results}

This chapter presents the results of the benchmark suite across all
ten VPN implementations and the internal baseline. The structure
follows the impairment profiles from ideal to degraded:
Section~\ref{sec:baseline} establishes overhead under ideal
conditions, then subsequent sections examine how each VPN responds to
increasing network impairment. The chapter concludes with findings
from the source code analysis. A recurring theme throughout is that
no single metric captures VPN performance; the rankings shift
depending on whether one measures throughput, latency, retransmit
behavior, or real-world application performance.

\section{Baseline Performance}
\label{sec:baseline}

The baseline impairment profile introduces no artificial loss or
reordering, so any performance gap between VPNs can be attributed to
the VPN itself.  Throughout the plots in this section, the
\emph{internal} bar marks a direct host-to-host connection with no VPN
in the path; it represents the best the hardware can do.  On its own,
this link delivers 934\,Mbps on a single TCP stream and a round-trip
latency of just
0.60\,ms.  WireGuard comes remarkably close to these numbers, reaching
92.5\,\% of bare-metal throughput with only a single retransmit across
an entire 30-second test.  Mycelium sits at the other extreme, with a
round-trip latency of 34.9\,ms, roughly 58$\times$ the bare-metal figure.

\subsection{Test Execution Overview}

Running the full baseline suite across all ten VPNs and the internal
reference took just over four hours.  The bulk of that time, about
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
installation and deployment accounted for another 45~minutes (19\,\%),
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
up after restarts.  The remaining time was consumed by VPN service restarts
and traffic-control (tc) stabilization.
Figure~\ref{fig:test_duration} breaks this down per VPN.

Most VPNs completed every benchmark without issues, but four failed
one test each: Nebula and Headscale timed out on the qperf
QUIC performance benchmark after six retries, while Hyprspace and
Mycelium failed the UDP iPerf3 test
with a 120-second timeout.  Each of these four therefore passes six of
the seven benchmarks (85.7\,\%), while all other VPNs pass the full suite
(Figure~\ref{fig:success_rate}).

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{1.0\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Average Test
    Duration per Machine}.png}
    \caption{Average test duration per VPN, including installation
    time and benchmark execution}
    \label{fig:test_duration}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{1.0\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark
    Success Rate}.png}
    \caption{Benchmark success rate across all seven tests}
    \label{fig:success_rate}
  \end{subfigure}
  \caption{Test execution overview. Hyprspace has the longest average
    duration due to UDP timeouts and long VPN connectivity
    waits. WireGuard completes fastest. Nebula, Headscale,
  Hyprspace, and Mycelium each fail one benchmark.}
  \label{fig:test_overview}
\end{figure}

\subsection{TCP Throughput}

Each VPN ran a single-stream iPerf3 session for 30~seconds on every
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
averages.  Three distinct performance tiers emerge, separated by
natural gaps in the data.

\begin{table}[H]
  \centering
  \caption{Single-stream TCP throughput at baseline, sorted by
    throughput. Retransmits are averaged per 30-second test across
    all three link directions. The horizontal rules separate the
  three performance tiers.}
  \label{tab:tcp_baseline}
  \begin{tabular}{lrrr}
    \hline
    \textbf{VPN} & \textbf{Throughput (Mbps)} &
    \textbf{Baseline (\%)} & \textbf{Retransmits} \\
    \hline
    Internal      & 934 & 100.0 & 1.7 \\
    WireGuard     & 864 & 92.5  & 1 \\
    ZeroTier      & 814 & 87.2  & 1163 \\
    Headscale     & 800 & 85.6  & 102 \\
    Yggdrasil     & 795 & 85.1  & 75 \\
    \hline
    Nebula        & 706 & 75.6  & 955 \\
    EasyTier      & 636 & 68.1  & 537 \\
    VpnCloud      & 539 & 57.7  & 857 \\
    \hline
    Hyprspace     & 368 & 39.4  & 4965 \\
    Tinc          & 336 & 36.0  & 240 \\
    Mycelium      & 259 & 27.7  & 710 \\
    \hline
  \end{tabular}
\end{table}

The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
at under 40\,\% of baseline.
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.
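
For reference, the \emph{Baseline (\%)} column of
Table~\ref{tab:tcp_baseline} can be reproduced as the ratio of a VPN's
throughput to the Internal figure.  The symbols $T_{\mathrm{VPN}}$ and
$T_{\mathrm{Internal}}$ are introduced here purely for illustration:
\[
  \frac{T_{\mathrm{VPN}}}{T_{\mathrm{Internal}}} \times 100\,\%:
  \qquad
  \frac{864}{934} \approx 92.5\,\% \quad (\mathrm{WireGuard}),
  \qquad
  \frac{259}{934} \approx 27.7\,\% \quad (\mathrm{Mycelium}).
\]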

Raw throughput alone is incomplete, however.  The retransmit column
reveals that not all high-throughput VPNs get there cleanly.
ZeroTier, for instance, reaches 814\,Mbps but accumulates
1\,163~retransmits per test, over 1\,000$\times$ what WireGuard
needs.  ZeroTier compensates for tunnel-internal packet loss by
repeatedly triggering TCP congestion-control recovery, whereas
WireGuard sends data once and it arrives.  Across all VPNs,
retransmit behaviour falls into three groups: \emph{clean} ($<$110:
WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed}
(200--900: Tinc, EasyTier, Mycelium, VpnCloud), and
\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).

% TODO: Is this naming scheme any good?

% TODO: Fix TCP Throughput plot

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
    Throughput}.png}
    \caption{Average single-stream TCP throughput}
    \label{fig:tcp_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
    Retransmit Rate}.png}
    \caption{Average TCP retransmits per 30-second test (log scale)}
    \label{fig:tcp_retransmits}
  \end{subfigure}
  \caption{TCP throughput and retransmit rate at baseline. WireGuard
    leads at 864\,Mbps with 1 retransmit. Hyprspace has nearly 5\,000
    retransmits per test. The retransmit count does not always track
    inversely with throughput: ZeroTier achieves high throughput
  \emph{despite} high retransmits.}
  \label{fig:tcp_results}
\end{figure}

Retransmits have a direct mechanical relationship with TCP congestion
control. Each retransmit triggers a reduction in the congestion window
(\texttt{cwnd}), throttling the sender. This relationship is visible
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4\,965
retransmits, maintains the smallest average congestion window in the
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
window, the largest of any VPN.  At first glance this suggests a
clean inverse correlation between retransmits and congestion window
size, but the picture is misleading.  Yggdrasil's outsized window is
largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
segment carries far more data, so the window in bytes is inflated
relative to VPNs using a standard ${\sim}$1\,400-byte MTU.  Comparing
congestion windows across different MTU sizes is not meaningful
without normalizing for segment size.  What \emph{is} clear is that
high retransmit rates force TCP to spend more time in congestion
recovery than in steady-state transmission, capping throughput
regardless of available bandwidth.  ZeroTier illustrates the
opposite extreme: brute-force retransmission can still yield high
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
bandwidth and unstable flow behavior.
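
As a rough illustration of the normalization mentioned above, the
window can be expressed in segments instead of bytes.  The sketch
assumes decimal units (1\,MB $= 10^{6}$ bytes), treats the overlay MTU
as the segment size $S$, and assumes a 1\,400-byte segment for
Hyprspace; none of these values is measured directly, and the symbols
$W_{\mathrm{seg}}$, $W_{\mathrm{bytes}}$, and $S$ are introduced only
for this example:
\[
  W_{\mathrm{seg}} = \frac{W_{\mathrm{bytes}}}{S}:
  \qquad
  \frac{4.3 \times 10^{6}}{32\,731} \approx 131 \quad (\mathrm{Yggdrasil}),
  \qquad
  \frac{205 \times 10^{3}}{1\,400} \approx 146 \quad (\mathrm{Hyprspace}).
\]
Under these assumptions both windows hold a comparable number of
segments, which would suggest that the gap in bytes largely reflects
segment size rather than congestion behavior.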

VpnCloud warrants specific attention: its sender reports 538.8\,Mbps
but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest
in the dataset). This suggests significant in-tunnel packet loss or
buffering at the VpnCloud layer that the retransmit count (857)
alone does not fully explain.
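
The size of this gap follows directly from the two reported rates,
taking the sender rate as the reference:
\[
  \frac{538.8 - 413.4}{538.8} = \frac{125.4}{538.8} \approx 0.233
  \approx 23\,\%.
\]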

Run-to-run variability also differs substantially. WireGuard ranges
from 824 to 884\,Mbps (a 60\,Mbps spread), while Mycelium ranges
from 122 to 379\,Mbps, roughly a 3:1 ratio between best and worst
runs. A VPN with wide variance is harder to capacity-plan around than
one with consistent performance, even if the consistent one has the
lower average.

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
    \caption{Retransmits vs.\ throughput}
    \label{fig:retransmit_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
    \caption{Retransmits vs.\ max congestion window}
    \label{fig:retransmit_cwnd}
  \end{subfigure}
  \caption{Retransmit correlations (log scale on x-axis). High
    retransmits do not always mean low throughput (ZeroTier: 1\,163
    retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace:
    4\,965 retransmits, 368\,Mbps). The apparent inverse correlation
    between retransmits and congestion window size is dominated by
    Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
    by its 32\,KB jumbo overlay MTU rather than by low retransmits
  alone.}
  \label{fig:retransmit_correlations}
\end{figure}

\subsection{Latency}

Sorting by latency rearranges the rankings considerably.
Table~\ref{tab:latency_baseline} lists the average ping round-trip
times, which cluster into three distinct ranges.

\begin{table}[H]
  \centering
  \caption{Average ping RTT at baseline, sorted by latency}
  \label{tab:latency_baseline}
  \begin{tabular}{lr}
    \hline
    \textbf{VPN} & \textbf{Avg RTT (ms)} \\
    \hline
    Internal   & 0.60 \\
    VpnCloud   & 1.13 \\
    Tinc       & 1.19 \\
    WireGuard  & 1.20 \\
    Nebula     & 1.25 \\
    ZeroTier   & 1.28 \\
    EasyTier   & 1.33 \\
    \hline
    Headscale  & 1.64 \\
    Hyprspace  & 1.79 \\
    Yggdrasil  & 2.20 \\
    \hline
    Mycelium   & 34.9 \\
    \hline
  \end{tabular}
\end{table}

Six VPNs stay below 1.4\,ms, comfortably close to the bare-metal
0.60\,ms.  VpnCloud is a notable result: it posts the lowest latency
of any VPN (1.13\,ms), edging out WireGuard (1.20\,ms), yet its
throughput tops out at only 539\,Mbps.  Low per-packet latency does
not guarantee high bulk throughput.  A second group (Headscale,
Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing
moderate overhead.  Then there is Mycelium at 34.9\,ms, so far
removed from the rest that Section~\ref{sec:mycelium_routing} gives
it a dedicated analysis.

ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum
RTT spikes to 8.6\,ms, roughly 6.7$\times$ the average and the largest
such jump for any sub-2\,ms VPN.  These spikes point to periodic
control-plane interference that the average hides.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
  \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
    massive outlier at 58$\times$ the internal baseline. VpnCloud is
  the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
  \label{fig:ping_rtt}
\end{figure}

Tinc presents a paradox: it has the second-lowest latency of any VPN
(1.19\,ms) but also the second-lowest throughput (336\,Mbps).  Packets traverse
the tunnel quickly, yet single-threaded userspace processing cannot
keep up with the link speed.  The qperf benchmark backs this up: Tinc
maxes out at
14.9\,\% CPU while delivering just 336\,Mbps, a clear sign that
the CPU, not the network, is the bottleneck.
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
spot.

Looking at CPU efficiency more broadly, the qperf measurements
reveal a wide spread.  Hyprspace (55.1\,\%) and Yggdrasil
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
9.7\,\%.  WireGuard sits at 30.8\,\%, surprisingly high for a
kernel-level implementation, though much of that goes to
cryptographic processing.  On the efficient end, VpnCloud
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) do the most
with the least CPU time.  Nebula and Headscale are missing from
this comparison because qperf failed for both.

%TODO: Explain why they consistently failed

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
  \caption{Latency vs.\ throughput at baseline. Each point represents
    one VPN. The quadrants reveal different bottleneck types:
    VpnCloud (low latency, moderate throughput), Tinc (low latency,
    low throughput, CPU-bound), Mycelium (high latency, low
  throughput, overlay routing overhead).}
  \label{fig:latency_throughput}
\end{figure}

\subsection{Parallel TCP Scaling}

The single-stream benchmark tests one link direction at a time.  The
parallel benchmark changes this setup: all three link directions
(lom$\rightarrow$yuki, yuki$\rightarrow$luna,
luna$\rightarrow$lom) run simultaneously in a circular pattern for
60~seconds, each carrying ten TCP streams.  Because three link pairs
now carry traffic at the same time, the aggregate throughput is
naturally higher than that of any single direction alone, which is why
even Internal reaches 1.50$\times$ its single-stream figure.  The
scaling factor (aggregate parallel throughput
divided by single-stream throughput) therefore captures two effects:
the benefit of utilizing multiple link pairs in parallel, and how
well the VPN handles the resulting contention.
Table~\ref{tab:parallel_scaling} lists the results.

\begin{table}[H]
  \centering
  \caption{Parallel TCP scaling at baseline. Scaling factor is the
    ratio of aggregate parallel throughput (three link directions,
    ten streams each) to single-stream throughput. Internal's
  1.50$\times$ represents the expected scaling on this hardware.}
  \label{tab:parallel_scaling}
  \begin{tabular}{lrrr}
    \hline
    \textbf{VPN} & \textbf{Single (Mbps)} &
    \textbf{Parallel (Mbps)} & \textbf{Scaling} \\
    \hline
    Mycelium  & 259  & 569  & 2.20$\times$ \\
    Hyprspace & 368  & 803  & 2.18$\times$ \\
    Tinc      & 336  & 563  & 1.68$\times$ \\
    Yggdrasil & 795  & 1265 & 1.59$\times$ \\
    Headscale & 800  & 1228 & 1.54$\times$ \\
    Internal  & 934  & 1398 & 1.50$\times$ \\
    ZeroTier  & 814  & 1206 & 1.48$\times$ \\
    WireGuard & 864  & 1281 & 1.48$\times$ \\
    EasyTier  & 636  & 927  & 1.46$\times$ \\
    VpnCloud  & 539  & 763  & 1.42$\times$ \\
    Nebula    & 706  & 648  & 0.92$\times$ \\
    \hline
  \end{tabular}
\end{table}
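
The \emph{Scaling} column follows from the two throughput columns.
Writing $T_{\mathrm{par}}$ and $T_{\mathrm{single}}$ for them (symbols
introduced here only for illustration), the two ends of the reference
range are:
\[
  \frac{T_{\mathrm{par}}}{T_{\mathrm{single}}}:
  \qquad
  \frac{1\,398}{934} \approx 1.50 \quad (\mathrm{Internal}),
  \qquad
  \frac{648}{706} \approx 0.92 \quad (\mathrm{Nebula}).
\]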

The VPNs that gain the most are those most constrained in
single-stream mode.  Mycelium's 34.9\,ms RTT means a lone TCP stream
can never fill the pipe: the bandwidth-delay product demands a window
larger than any single flow maintains, so ten streams collectively
compensate for that constraint and push throughput to 2.20$\times$
the single-stream figure.  Hyprspace scales almost as well
(2.18$\times$) but for a
different reason: multiple streams work around the buffer bloat that
cripples any individual flow
(Section~\ref{sec:hyprspace_bloat}).  Tinc picks up a
1.68$\times$ boost because several streams can collectively keep its
single-threaded CPU busy during what would otherwise be idle gaps in
a single flow.
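
A back-of-the-envelope estimate makes the bandwidth-delay argument for
Mycelium concrete.  It assumes that Internal's 934\,Mbps stands in for
the link capacity $B$, that the average RTT of 34.9\,ms applies to the
flow, and decimal units (1\,MB $= 10^{6}$ bytes):
\[
  \mathrm{BDP} = B \times \mathrm{RTT}
  \approx 934 \times 10^{6}\,\mathrm{bit/s} \times 0.0349\,\mathrm{s}
  \approx 3.26 \times 10^{7}\,\mathrm{bit}
  \approx 4.07\,\mathrm{MB}.
\]
If the observed 259\,Mbps were limited purely by the window, the
effective window would be about
\[
  \frac{259 \times 10^{6}\,\mathrm{bit/s} \times 0.0349\,\mathrm{s}}{8}
  \approx 1.13\,\mathrm{MB},
\]
well short of that estimate.  This is only a sketch: the per-link RTTs
differ considerably, so the figures indicate an order of magnitude
rather than a measured window.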

WireGuard and Internal both scale cleanly at around
1.48--1.50$\times$ with zero retransmits, suggesting that
WireGuard's overhead is a fixed per-packet cost that does not worsen
under multiplexing.

Nebula is the only VPN that actually gets \emph{slower} with more
streams: throughput drops from 706\,Mbps to 648\,Mbps
(0.92$\times$) while retransmits jump from 955 to 2\,462.  The ten
streams are clearly fighting each other for resources inside the
tunnel.

More streams also amplify existing retransmit problems across the
board.  Hyprspace climbs from 4\,965 to 17\,426~retransmits;
VpnCloud from 857 to 6\,023.  VPNs that were clean in single-stream
mode stay clean under load, while the stressed ones only get worse.

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
    \caption{Single-stream vs.\ parallel throughput}
    \label{fig:single_vs_parallel}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
    \caption{Parallel TCP scaling factor}
    \label{fig:scaling_factor}
  \end{subfigure}
  \caption{Parallel TCP scaling at baseline. Nebula is the only VPN
    where parallel throughput is lower than single-stream
    (0.92$\times$). Mycelium and Hyprspace benefit most from
    parallelism ($>$2$\times$), compensating for latency and buffer
    bloat respectively. The dashed line at 1.0$\times$ marks the
  break-even point.}
  \label{fig:parallel_tcp}
\end{figure}

\subsection{UDP Stress Test}

The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
which is a deliberate overload test rather than a realistic workload.
The sender throughput values are artifacts: they reflect how fast the
sender can write to the socket, not how fast data traverses the
tunnel. Yggdrasil, for example, reports 63\,744\,Mbps sender
throughput because it uses a 32\,731-byte block size (a jumbo-frame
overlay MTU), inflating the apparent rate per \texttt{send()} system
call. Only the receiver throughput is meaningful.

\begin{table}[H]
  \centering
  \caption{UDP receiver throughput and packet loss at baseline
    (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
  at 120 seconds and are excluded.}
  \label{tab:udp_baseline}
  \begin{tabular}{lrr}
    \hline
    \textbf{VPN} & \textbf{Receiver (Mbps)} &
    \textbf{Loss (\%)} \\
    \hline
    Internal  & 952 & 0.0 \\
    WireGuard & 898 & 0.0 \\
    Nebula    & 890 & 76.2 \\
    Headscale & 876 & 69.8 \\
    EasyTier  & 865 & 78.3 \\
    Yggdrasil & 852 & 98.7 \\
    ZeroTier  & 851 & 89.5 \\
    VpnCloud  & 773 & 83.7 \\
    Tinc      & 471 & 89.9 \\
    \hline
  \end{tabular}
\end{table}

%TODO: Explain that the UDP test also crashes often,
% which makes the test somewhat unreliable
% but a good indicator if the network traffic is "different" than
% the programmer expected

Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at
the kernel level with proper backpressure that matches sender to
receiver rate. Every userspace VPN shows massive loss (69--99\,\%)
because the sender overwhelms the tunnel's processing capacity.
Yggdrasil's 98.7\,\% loss is the most extreme: it sends the most data
(due to its large block size) but loses almost all of it. These loss
rates do not reflect real-world UDP behavior but reveal which VPNs
implement effective flow control. Hyprspace and Mycelium could not
complete the UDP test at all, timing out after 120 seconds.
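
The Yggdrasil figure can be cross-checked against the sender and
receiver rates quoted above.  The check assumes that loss is well
approximated by the ratio of received to sent throughput, which holds
only if all datagrams have the same size; iPerf3 itself counts
datagrams rather than bytes:
\[
  1 - \frac{852}{63\,744} \approx 1 - 0.013 = 0.987 \approx 98.7\,\%.
\]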

The \texttt{blksize\_bytes} field reveals each VPN's effective path
MTU: Yggdrasil at 32\,731 bytes (jumbo overlay), ZeroTier at 2\,728,
Internal at 1\,448, VpnCloud at 1\,375, WireGuard at 1\,368, Tinc at
1\,353, EasyTier at 1\,288, Nebula at 1\,228, and Headscale at 1\,208 (the
smallest). These differences affect fragmentation behavior under real
workloads, particularly for protocols that send large datagrams.

%TODO: Mention QUIC
%TODO: Mention again that the "default" settings of every VPN have been used
% to better reflect real world use, as most users probably won't
% change these defaults
% and explain that good defaults are as much a part of good software as
% having the features but they are hard to configure correctly

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
    Throughput}.png}
    \caption{UDP receiver throughput}
    \label{fig:udp_throughput}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
    Packet Loss}.png}
    \caption{UDP packet loss}
    \label{fig:udp_loss}
  \end{subfigure}
  \caption{UDP stress test results at baseline (\texttt{-b 0},
    unlimited sender rate). Internal and WireGuard are the only
    implementations with 0\,\% loss. Hyprspace and Mycelium are
  excluded due to 120-second timeouts.}
  \label{fig:udp_results}
\end{figure}

% TODO: Compare parallel TCP retransmit rate
% with single TCP retransmit rate and see what changed

\subsection{Real-World Workloads}

Saturating a link with iPerf3 measures peak capacity, but not how a
VPN performs under realistic traffic.  This subsection switches to
application-level workloads: downloading packages from a Nix binary
cache and streaming video over RIST.  Both interact with the VPN
tunnel the way real software does, through many short-lived
connections, TLS handshakes, and latency-sensitive UDP packets.

\paragraph{Nix Binary Cache Downloads.}

This test downloads a fixed set of Nix packages through each VPN and
measures the total transfer time.  The results
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
considerably: even Hyprspace, the worst performer, finishes in
11.92\,s, only 40\,\% slower than bare metal.  Once connection
setup, TLS handshakes, and HTTP round-trips enter the picture,
throughput differences between 500 and 900\,Mbps matter far less
than per-connection latency.

\begin{table}[H]
  \centering
  \caption{Nix binary cache download time at baseline, sorted by
  duration. Overhead is relative to the internal baseline (8.53\,s).}
  \label{tab:nix_cache}
  \begin{tabular}{lrr}
    \hline
    \textbf{VPN} & \textbf{Mean (s)} &
    \textbf{Overhead (\%)} \\
    \hline
    Internal  & 8.53  & -- \\
    Nebula    & 9.15  & +7.3 \\
    ZeroTier  & 9.22  & +8.1 \\
    VpnCloud  & 9.39  & +10.0 \\
    EasyTier  & 9.39  & +10.1 \\
    WireGuard & 9.45  & +10.8 \\
    Headscale & 9.79  & +14.8 \\
    Tinc      & 10.00 & +17.2 \\
    Mycelium  & 10.07 & +18.1 \\
    Yggdrasil & 10.59 & +24.2 \\
    Hyprspace & 11.92 & +39.7 \\
    \hline
  \end{tabular}
\end{table}
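
The \emph{Overhead} column is the relative increase in download time
over Internal.  With $t_{\mathrm{VPN}}$ and $t_{\mathrm{Internal}}$ as
illustrative symbols for the two mean durations:
\[
  \frac{t_{\mathrm{VPN}} - t_{\mathrm{Internal}}}{t_{\mathrm{Internal}}}:
  \qquad
  \frac{11.92 - 8.53}{8.53} \approx 39.7\,\% \quad (\mathrm{Hyprspace}),
  \qquad
  \frac{9.15 - 8.53}{8.53} \approx 7.3\,\% \quad (\mathrm{Nebula}).
\]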

Several rankings invert relative to raw throughput.  ZeroTier
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite
roughly 6\,\% lower throughput and over 1\,000$\times$ as many
retransmits.  Yggdrasil is the clearest example: it has the
fourth-highest throughput among the VPNs at 795\,Mbps, yet lands at
24\,\% overhead because its 2.2\,ms latency adds up over the many
small sequential HTTP requests that constitute a Nix cache download.
Figure~\ref{fig:throughput_vs_download} confirms this weak link
between raw throughput and real-world download speed.
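
As a plausibility check for Yggdrasil, suppose that the entire extra
download time were caused by the added round-trip latency.  This is a
simplifying assumption; the number of round trips was not measured:
\[
  N \approx \frac{10.59\,\mathrm{s} - 8.53\,\mathrm{s}}
                 {2.20\,\mathrm{ms} - 0.60\,\mathrm{ms}}
    = \frac{2.06\,\mathrm{s}}{1.60\,\mathrm{ms}}
    \approx 1\,290.
\]
Under that assumption the download would involve on the order of a
thousand sequential round trips, a count that seems compatible with a
workload made up of many small HTTP requests.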

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache
    Mean Download Time}.png}
    \caption{Nix cache download time per VPN}
    \label{fig:nix_cache}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
    \caption{Raw throughput vs.\ download time}
    \label{fig:throughput_vs_download}
  \end{subfigure}
  \caption{Application-level download performance. The throughput
    hierarchy compresses under real HTTP workloads: the worst VPN
    (Hyprspace, 11.92\,s) is only 40\,\% slower than bare metal.
    Throughput explains some variance but not all: Yggdrasil
    (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
  because latency matters more for HTTP workloads.}
  \label{fig:nix_download}
\end{figure}

\paragraph{Video Streaming (RIST).}

At just 3.3\,Mbps, the RIST video stream sits comfortably within
every VPN's throughput budget.  This test therefore measures
something different: how well the VPN handles real-time UDP packet
delivery under steady load.  Nine of the eleven configurations (ten
VPNs plus Internal) pass without incident, delivering at least
99.8\,\% video quality.  The 14--16 dropped
frames that appear uniformly across all VPNs, including Internal,
trace back to encoder warm-up rather than tunnel overhead.

Headscale is the exception.  It averages just 13.1\,\% quality,
dropping 288~packets per test interval.  The degradation is not
bursty but sustained: median quality sits at 10\,\%, and the
interquartile range of dropped packets spans a narrow 255--330 band.
The qperf benchmark, which failed outright for Headscale, is
consistent with this picture and suggests that something beyond bulk
TCP is broken.

What makes this failure unexpected is that Headscale builds on
WireGuard, which handles video flawlessly.  TCP throughput places
Headscale squarely in the top tier.  Yet the RIST test runs over UDP,
and qperf measures QUIC, which is likewise carried over UDP.  The
pattern points toward Headscale's DERP relay or NAT traversal layer
as the source.  Its effective path MTU of 1\,208~bytes, the smallest
of any VPN, likely compounds the issue: RIST packets that exceed
this limit must be fragmented, and reassembling fragments under
sustained load produces exactly the kind of steady, uniform packet
drops the data shows.  For video conferencing, VoIP, or any
real-time media workload, this is a disqualifying result regardless
of TCP throughput.

Hyprspace reveals a different failure mode.  Its average quality
reads 100\,\%, but the raw numbers underneath are far from stable:
mean packet drops of 1\,194 and a maximum spike of 55\,500, with
the 25th, 50th, and 75th percentiles all at zero.  Hyprspace
alternates between perfect delivery and catastrophic bursts.
RIST's forward error correction compensates for most of these
events, but the worst spikes are severe enough to overwhelm FEC
entirely.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/baseline/Video
  Streaming/RIST Quality}.png}
  \caption{RIST video streaming quality at baseline. Headscale at
    13.1\,\% average quality is the clear outlier. Every other VPN
    achieves 99.8\,\% or higher. Nebula is at 99.8\,\% (minor
    degradation). The video bitrate (3.3\,Mbps) is well within every
    VPN's throughput capacity, so this test reveals real-time UDP
  handling quality rather than bandwidth limits.}
  \label{fig:rist_quality}
\end{figure}

\subsection{Operational Resilience}

Sustained-load performance does not predict recovery speed.  How
quickly a tunnel comes up after a reboot, and how reliably it
reconverges, matters as much as peak throughput for operational use.

First-time connectivity spans a wide range.  Headscale and WireGuard
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
(10--14\,s) spend seconds negotiating with their control planes
before passing traffic.

%TODO: Maybe we want to scrap first-time connectivity

Reboot reconnection rearranges the rankings.  Hyprspace, the worst
performer under sustained TCP load, recovers in just 8.7~seconds on
average, faster than any other VPN.  WireGuard and Nebula follow at
10.1\,s each.  Nebula's consistency is striking: 10.06, 10.06,
10.07\,s across its three nodes, pointing to a hard-coded timer
rather than topology-dependent convergence.
Mycelium sits at the opposite end, needing 76.6~seconds and showing
the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s),
suggesting a fixed protocol-level wait built into the overlay.

%TODO: Hard coded timer needs to be verified

Yggdrasil produces the most lopsided result in the dataset: its yuki
node is back in 7.1~seconds while lom and luna take 94.8 and
97.3~seconds respectively.  The gap likely reflects the overlay's
spanning-tree rebuild: a node near the root of the tree reconverges
quickly, while one further out has to wait for the topology to
propagate.

%TODO: Needs clarifications what is a "spanning tree build"

\begin{figure}[H]
  \centering
  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
    \caption{Average reconnection time per VPN}
    \label{fig:reboot_bar}
  \end{subfigure}

  \vspace{1em}

  \begin{subfigure}[t]{\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
    \caption{Per-node reconnection time heatmap}
    \label{fig:reboot_heatmap}
  \end{subfigure}
  \caption{Reboot reconnection time at baseline. The heatmap reveals
    Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
    95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
    across all nodes). Hyprspace reconnects fastest (8.7\,s average)
  despite its poor sustained-load performance.}
  \label{fig:reboot_reconnection}
\end{figure}

\subsection{Pathological Cases}
\label{sec:pathological}

Three VPNs exhibit behaviors that the aggregate numbers alone cannot
explain.  The following paragraphs synthesize observations from the
preceding benchmarks into per-VPN diagnoses.

\paragraph{Hyprspace: Buffer Bloat.}
\label{sec:hyprspace_bloat}

Hyprspace produces the most severe performance collapse in the
dataset.  At idle, its ping latency is a modest 1.79\,ms.
Under TCP load, that number balloons to roughly 2\,800\,ms, a
1\,556$\times$ increase.  This is not the network becoming
congested; it is the VPN tunnel itself filling up with buffered
packets and refusing to drain.

The consequences ripple through every TCP metric.  With 4\,965
retransmits per 30-second test (one in every 200~segments), TCP
spends most of its time in congestion recovery rather than
steady-state transfer, shrinking the average congestion window to
205\,KB, the smallest in the dataset.  Under parallel load the
situation worsens: retransmits climb to 17\,426.  The buffering even
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
while the sender sees only 367.9\,Mbps, because massive ACK delays
cause the sender-side timer to undercount the actual data rate.  The
UDP test never finished at all, timing out at 120~seconds.
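
The ``one in every 200~segments'' figure can be reproduced with a
short estimate.  It assumes a segment size of 1\,400~bytes, which is
not stated for Hyprspace, and uses the sender-side rate of
367.9\,Mbps; $N_{\mathrm{seg}}$ is a symbol introduced for this
example:
\[
  N_{\mathrm{seg}} \approx
  \frac{367.9 \times 10^{6}\,\mathrm{bit/s} \times 30\,\mathrm{s}}
       {8 \times 1\,400\,\mathrm{bit}}
  \approx 9.85 \times 10^{5},
  \qquad
  \frac{N_{\mathrm{seg}}}{4\,965} \approx 198.
\]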

% Should we always use percentages for retransmits?

What prevents Hyprspace from being entirely unusable is everything
\emph{except} sustained load.  It has the fastest reboot
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
quality outside of its burst events.  The pathology is narrow but
severe: any continuous data stream saturates the tunnel's internal
buffers.

\paragraph{Mycelium: Routing Anomaly.}
\label{sec:mycelium_routing}

Mycelium's 34.9\,ms average latency appears to be the cost of
routing through a global overlay.  The per-path numbers, however,
reveal a bimodal distribution:

\begin{itemize}
    \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable
    to Headscale at 1.64\,ms)
    \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed)
    \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed)
\end{itemize}

One of the three links has found a direct route; the other two still
bounce through the overlay.  All three machines sit on the same
physical network, so Mycelium's path discovery is failing
intermittently, a more specific problem than blanket overlay
overhead.  Throughput is just as uneven, but it does not follow the
latency split: the overlay-routed yuki$\rightarrow$luna link reaches
379\,Mbps, while the direct luna$\rightarrow$lom link manages only
122\,Mbps, a 3:1 gap.  In
bidirectional mode, the reverse direction on that worst link drops
to 58.4\,Mbps, the lowest single-direction figure in the entire
dataset.

\begin{figure}[H]
  \centering
  \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average
  Throughput}.png}
  \caption{Per-link TCP throughput for Mycelium, showing extreme
    path asymmetry caused by inconsistent direct route discovery.
    The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps)
    and worst (luna$\rightarrow$lom, 122\,Mbps) links reflects
  different overlay routing paths.}
  \label{fig:mycelium_paths}
\end{figure}

The overlay penalty shows up most clearly at connection setup.
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
alone costs 47.3\,ms (3$\times$ overhead).  Every new connection
incurs that overhead, so workloads dominated by
short-lived connections accumulate it rapidly.  Bulk downloads, by
contrast, amortize it: the Nix cache test finishes only 18\,\%
slower than Internal (10.07\,s vs.\ 8.53\,s) because once the
transfer phase begins, per-connection latency fades into the
background.

Mycelium is also the slowest VPN to recover from a reboot:
76.6~seconds on average, and almost suspiciously uniform across
nodes (75.7, 75.7, 78.3\,s).  That kind of consistency points to a
hard-coded convergence timer in the overlay protocol rather than
anything topology-dependent.  The UDP test timed out at
120~seconds, and even first-time connectivity required a
70-second wait at startup.

% Explain what topology-dependent means in this case.

\paragraph{Tinc: Userspace Processing Bottleneck.}

Tinc is a clear case of a CPU bottleneck masquerading as a network
problem.  At 1.19\,ms latency, packets get through the
tunnel quickly.  Yet throughput tops out at 336\,Mbps, barely a
third of the bare-metal link.  The usual suspects do not apply:
Tinc's path MTU is a healthy 1\,500~bytes
(\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3, comparable to
VpnCloud at 1\,375 and WireGuard at 1\,368), and its retransmit
count (240) is moderate.  What limits Tinc is its single-threaded
userspace architecture: one CPU core simply cannot encrypt, copy,
and forward packets fast enough to fill the pipe.

The parallel benchmark confirms this diagnosis.  Tinc scales to
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
Multiple TCP streams collectively keep that single core busy during
what would otherwise be idle gaps in any individual flow, squeezing
out throughput that no single stream could reach alone.

\section{Impact of Network Impairment}

This section examines how each VPN responds to the Low, Medium, and
High impairment profiles defined in Chapter~\ref{Methodology}.

\subsection{Ping}

% RTT and packet loss across impairment profiles.

\subsection{TCP Throughput}

% TCP iperf3: throughput, retransmits, congestion window.

\subsection{UDP Throughput}

% UDP iperf3: throughput, jitter, packet loss.

\subsection{Parallel TCP}

% Parallel iperf3: throughput under contention (A->B, B->C, C->A).

\subsection{QUIC Performance}

% qperf: bandwidth, TTFB, connection establishment time.

\subsection{Video Streaming}

% RIST: bitrate, dropped frames, packets recovered, quality score.

\subsection{Application-Level Download}

% Nix cache: download duration for Firefox package.

\section{Tailscale Under Degraded Conditions}

% The central finding: Tailscale outperforming the raw Linux
% networking stack under impairment.

\subsection{Observed Anomaly}

% Present the data showing Tailscale exceeding internal baseline
% throughput under Medium/High impairment.

\subsection{Congestion Control Analysis}

% Reno vs CUBIC, RACK disabled to avoid spurious retransmits
% under reordering.

\subsection{Tuned Kernel Parameters}

% Re-run results with tuned buffer sizes and congestion control
% on the internal baseline, showing the gap closes.

\section{Source Code Analysis}

\subsection{Feature Matrix Overview}

% Summary of the 131-feature matrix across all ten VPNs.
% Highlight key architectural differences that explain
% performance results.

\subsection{Security Vulnerabilities}

% Vulnerabilities discovered during source code review.

\section{Summary of Findings}

% Brief summary table or ranking of VPNs by key metrics.
% Save deeper interpretation for a Discussion chapter.