% Chapter Template \chapter{Results} % Main chapter title \label{Results} This chapter presents the results of the benchmark suite across all ten VPN implementations and the internal baseline. The structure follows the impairment profiles from ideal to degraded: Section~\ref{sec:baseline} establishes overhead under ideal conditions, then subsequent sections examine how each VPN responds to increasing network impairment. The chapter concludes with findings from the source code analysis. A recurring theme is that no single metric captures VPN performance; the rankings shift depending on whether one measures throughput, latency, retransmit behavior, or real-world application performance. \section{Baseline Performance} \label{sec:baseline} The baseline impairment profile introduces no artificial loss or reordering, so any performance gap between VPNs can be attributed to the VPN itself. Throughout the plots in this section, the \emph{internal} bar marks a direct host-to-host connection with no VPN in the path; it represents the best the hardware can do. On its own, this link delivers 934\,Mbps on a single TCP stream and a round-trip latency of just 0.60\,ms. WireGuard comes remarkably close to these numbers, reaching 92.5\,\% of bare-metal throughput with only a single retransmit across an entire 30-second test. Mycelium sits at the other extreme, adding 34.9\,ms of latency, roughly 58$\times$ the bare-metal figure. \subsection{Test Execution Overview} Running the full baseline suite across all ten VPNs and the internal reference took just over four hours. The bulk of that time, about 2.6~hours (63\,\%), was spent on actual benchmark execution; VPN installation and deployment accounted for another 45~minutes (19\,\%), and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come up after restarts. The remaining time was consumed by VPN service restarts and traffic-control (tc) stabilization. Figure~\ref{fig:test_duration} breaks this down per VPN. Most VPNs completed every benchmark without issues, but four failed one test each: Nebula and Headscale timed out on the qperf QUIC performance benchmark after six retries, while Hyprspace and Mycelium failed the UDP iPerf3 test with a 120-second timeout. Their individual success rate is 85.7\,\%, with all other VPNs passing the full suite (Figure~\ref{fig:success_rate}). \begin{figure}[H] \centering \begin{subfigure}[t]{1.0\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Average Test Duration per Machine}.png} \caption{Average test duration per VPN, including installation time and benchmark execution} \label{fig:test_duration} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{1.0\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Benchmark Success Rate}.png} \caption{Benchmark success rate across all seven tests} \label{fig:success_rate} \end{subfigure} \caption{Test execution overview. Hyprspace has the longest average duration due to UDP timeouts and long VPN connectivity waits. WireGuard completes fastest. Nebula, Headscale, Hyprspace, and Mycelium each fail one benchmark.} \label{fig:test_overview} \end{figure} \subsection{TCP Throughput} Each VPN ran a single-stream iPerf3 session for 30~seconds on every link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the averages. Three distinct performance tiers emerge, separated by natural gaps in the data. \begin{table}[H] \centering \caption{Single-stream TCP throughput at baseline, sorted by throughput. Retransmits are averaged per 30-second test across all three link directions. The horizontal rules separate the three performance tiers.} \label{tab:tcp_baseline} \begin{tabular}{lrrr} \hline \textbf{VPN} & \textbf{Throughput (Mbps)} & \textbf{Baseline (\%)} & \textbf{Retransmits} \\ \hline Internal & 934 & 100.0 & 1.7 \\ WireGuard & 864 & 92.5 & 1 \\ ZeroTier & 814 & 87.2 & 1163 \\ Headscale & 800 & 85.6 & 102 \\ Yggdrasil & 795 & 85.1 & 75 \\ \hline Nebula & 706 & 75.6 & 955 \\ EasyTier & 636 & 68.1 & 537 \\ VpnCloud & 539 & 57.7 & 857 \\ \hline Hyprspace & 368 & 39.4 & 4965 \\ Tinc & 336 & 36.0 & 240 \\ Mycelium & 259 & 27.7 & 710 \\ \hline \end{tabular} \end{table} The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier, Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link. A middle tier (55--80\,\%) follows with Nebula, EasyTier, and VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier at under 40\,\% of baseline. Figure~\ref{fig:tcp_throughput} visualizes this hierarchy. Raw throughput alone is incomplete, however. The retransmit column reveals that not all high-throughput VPNs get there cleanly. ZeroTier, for instance, reaches 814\,Mbps but accumulates 1\,163~retransmits per test, over 1\,000$\times$ what WireGuard needs. ZeroTier compensates for tunnel-internal packet loss by repeatedly triggering TCP congestion-control recovery, whereas WireGuard sends data once and it arrives. Across all VPNs, retransmit behaviour falls into three groups: \emph{clean} ($<$110: WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), and \emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace). % TODO: Is this naming scheme any good? % TODO: Fix TCP Throughput plot \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Throughput}.png} \caption{Average single-stream TCP throughput} \label{fig:tcp_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png} \caption{Average TCP retransmits per 30-second test (log scale)} \label{fig:tcp_retransmits} \end{subfigure} \caption{TCP throughput and retransmit rate at baseline. WireGuard leads at 864\,Mbps with 1 retransmit. Hyprspace has nearly 5000 retransmits per test. The retransmit count does not always track inversely with throughput: ZeroTier achieves high throughput \emph{despite} high retransmits.} \label{fig:tcp_results} \end{figure} Retransmits have a direct mechanical relationship with TCP congestion control. Each retransmit triggers a reduction in the congestion window (\texttt{cwnd}), throttling the sender. This relationship is visible in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965 retransmits, maintains the smallest average congestion window in the dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB window, the largest of any VPN. At first glance this suggests a clean inverse correlation between retransmits and congestion window size, but the picture is misleading. Yggdrasil's outsized window is largely an artifact of its jumbo overlay MTU (32\,731 bytes): each segment carries far more data, so the window in bytes is inflated relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing congestion windows across different MTU sizes is not meaningful without normalizing for segment size. What \emph{is} clear is that high retransmit rates force TCP to spend more time in congestion recovery than in steady-state transmission, capping throughput regardless of available bandwidth. ZeroTier illustrates the opposite extreme: brute-force retransmission can still yield high throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted bandwidth and unstable flow behavior. VpnCloud stands out: its sender reports 538.8\,Mbps but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest in the dataset). This suggests significant in-tunnel packet loss or buffering at the VpnCloud layer that the retransmit count (857) alone does not fully explain. Run-to-run variability also differs substantially. WireGuard ranges from 824 to 884\,Mbps (a 60\,Mbps window), while Mycelium ranges from 122 to 379\,Mbps, a 3:1 ratio between worst and best runs. A VPN with wide variance is harder to capacity-plan around than one with consistent performance, even if the average is lower. \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png} \caption{Retransmits vs.\ throughput} \label{fig:retransmit_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png} \caption{Retransmits vs.\ max congestion window} \label{fig:retransmit_cwnd} \end{subfigure} \caption{Retransmit correlations (log scale on x-axis). High retransmits do not always mean low throughput (ZeroTier: 1\,163 retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace: 4\,965 retransmits, 368\,Mbps). The apparent inverse correlation between retransmits and congestion window size is dominated by Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated by its 32\,KB jumbo overlay MTU rather than by low retransmits alone.} \label{fig:retransmit_correlations} \end{figure} \subsection{Latency} Sorting by latency rearranges the rankings considerably. Table~\ref{tab:latency_baseline} lists the average ping round-trip times, which cluster into three distinct ranges. \begin{table}[H] \centering \caption{Average ping RTT at baseline, sorted by latency} \label{tab:latency_baseline} \begin{tabular}{lr} \hline \textbf{VPN} & \textbf{Avg RTT (ms)} \\ \hline Internal & 0.60 \\ VpnCloud & 1.13 \\ Tinc & 1.19 \\ WireGuard & 1.20 \\ Nebula & 1.25 \\ ZeroTier & 1.28 \\ EasyTier & 1.33 \\ \hline Headscale & 1.64 \\ Hyprspace & 1.79 \\ Yggdrasil & 2.20 \\ \hline Mycelium & 34.9 \\ \hline \end{tabular} \end{table} Six VPNs stay below 1.3\,ms, comfortably close to the bare-metal 0.60\,ms. VpnCloud posts the lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms), yet its throughput tops out at only 539\,Mbps. Low per-packet latency does not guarantee high bulk throughput. A second group (Headscale, Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing moderate overhead. Then there is Mycelium at 34.9\,ms, so far removed from the rest that Section~\ref{sec:mycelium_routing} gives it a dedicated analysis. ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any sub-2\,ms VPN. These spikes point to periodic control-plane interference that the average hides. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png} \caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a massive outlier at 58$\times$ the internal baseline. VpnCloud is the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).} \label{fig:ping_rtt} \end{figure} Tinc presents a paradox: it has the third-lowest latency (1.19\,ms) but only the second-lowest throughput (336\,Mbps). Packets traverse the tunnel quickly, yet single-threaded userspace processing cannot keep up with the link speed. The qperf benchmark backs this up: Tinc maxes out at 14.9\,\% CPU while delivering just 336\,Mbps, a clear sign that the CPU, not the network, is the bottleneck. Figure~\ref{fig:latency_throughput} makes this disconnect easy to spot. The qperf measurements also reveal a wide spread in CPU usage. Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a kernel-level implementation, though much of that goes to cryptographic processing. On the efficient end, VpnCloud (14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) do the most with the least CPU time. Nebula and Headscale are missing from this comparison because qperf failed for both. %TODO: Explain why they consistently failed \begin{figure}[H] \centering \includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png} \caption{Latency vs.\ throughput at baseline. Each point represents one VPN. The quadrants reveal different bottleneck types: VpnCloud (low latency, moderate throughput), Tinc (low latency, low throughput, CPU-bound), Mycelium (high latency, low throughput, overlay routing overhead).} \label{fig:latency_throughput} \end{figure} \subsection{Parallel TCP Scaling} The single-stream benchmark tests one link direction at a time. The parallel benchmark changes this setup: all three link directions (lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom) run simultaneously in a circular pattern for 60~seconds, each carrying one bidirectional TCP stream (six unidirectional flows in total). Because three independent link pairs now compete for shared tunnel resources at once, the aggregate throughput is naturally higher than any single direction alone, which is why even Internal reaches 1.50$\times$ its single-stream figure. The scaling factor (parallel throughput divided by single-stream throughput) captures two effects: the benefit of using multiple link pairs in parallel, and how well the VPN handles the resulting contention. Table~\ref{tab:parallel_scaling} lists the results. \begin{table}[H] \centering \caption{Parallel TCP scaling at baseline. Scaling factor is the ratio of parallel to single-stream throughput. Internal's 1.50$\times$ represents the expected scaling on this hardware.} \label{tab:parallel_scaling} \begin{tabular}{lrrr} \hline \textbf{VPN} & \textbf{Single (Mbps)} & \textbf{Parallel (Mbps)} & \textbf{Scaling} \\ \hline Mycelium & 259 & 569 & 2.20$\times$ \\ Hyprspace & 368 & 803 & 2.18$\times$ \\ Tinc & 336 & 563 & 1.68$\times$ \\ Yggdrasil & 795 & 1265 & 1.59$\times$ \\ Headscale & 800 & 1228 & 1.54$\times$ \\ Internal & 934 & 1398 & 1.50$\times$ \\ ZeroTier & 814 & 1206 & 1.48$\times$ \\ WireGuard & 864 & 1281 & 1.48$\times$ \\ EasyTier & 636 & 927 & 1.46$\times$ \\ VpnCloud & 539 & 763 & 1.42$\times$ \\ Nebula & 706 & 648 & 0.92$\times$ \\ \hline \end{tabular} \end{table} The VPNs that gain the most are those most constrained in single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream can never fill the pipe: the bandwidth-delay product demands a window larger than any single flow maintains, so multiple concurrent flows compensate for that constraint and push throughput to 2.20$\times$ the single-stream figure. Hyprspace scales almost as well (2.18$\times$) but for a different reason: multiple streams work around the buffer bloat that cripples any individual flow (Section~\ref{sec:hyprspace_bloat}). Tinc picks up a 1.68$\times$ boost because several streams can collectively keep its single-threaded CPU busy during what would otherwise be idle gaps in a single flow. WireGuard and Internal both scale cleanly at around 1.48--1.50$\times$ with zero retransmits, suggesting that WireGuard's overhead is a fixed per-packet cost that does not worsen under multiplexing. Nebula is the only VPN that actually gets \emph{slower} with more streams: throughput drops from 706\,Mbps to 648\,Mbps (0.92$\times$) while retransmits jump from 955 to 2\,462. The ten streams are clearly fighting each other for resources inside the tunnel. More streams also amplify existing retransmit problems. Hyprspace climbs from 4\,965 to 17\,426~retransmits; VpnCloud from 857 to 6\,023. VPNs that were clean in single-stream mode stay clean under load, while the stressed ones only get worse. \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png} \caption{Single-stream vs.\ parallel throughput} \label{fig:single_vs_parallel} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png} \caption{Parallel TCP scaling factor} \label{fig:scaling_factor} \end{subfigure} \caption{Parallel TCP scaling at baseline. Nebula is the only VPN where parallel throughput is lower than single-stream (0.92$\times$). Mycelium and Hyprspace benefit most from parallelism ($>$2$\times$), compensating for latency and buffer bloat respectively. The dashed line at 1.0$\times$ marks the break-even point.} \label{fig:parallel_tcp} \end{figure} \subsection{UDP Stress Test} The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}), which is a deliberate overload test rather than a realistic workload. The sender throughput values are artifacts: they reflect how fast the sender can write to the socket, not how fast data traverses the tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender throughput because it uses a 32,731-byte block size (a jumbo-frame overlay MTU), inflating the apparent rate per \texttt{send()} system call. Only the receiver throughput is meaningful. \begin{table}[H] \centering \caption{UDP receiver throughput and packet loss at baseline (\texttt{-b 0} stress test). Hyprspace and Mycelium timed out at 120 seconds and are excluded.} \label{tab:udp_baseline} \begin{tabular}{lrr} \hline \textbf{VPN} & \textbf{Receiver (Mbps)} & \textbf{Loss (\%)} \\ \hline Internal & 952 & 0.0 \\ WireGuard & 898 & 0.0 \\ Nebula & 890 & 76.2 \\ Headscale & 876 & 69.8 \\ EasyTier & 865 & 78.3 \\ Yggdrasil & 852 & 98.7 \\ ZeroTier & 851 & 89.5 \\ VpnCloud & 773 & 83.7 \\ Tinc & 471 & 89.9 \\ \hline \end{tabular} \end{table} %TODO: Explain that the UDP test also crashes often, % which makes the test somewhat unreliable % but a good indicator if the network traffic is "different" then % the programmer expected Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at the kernel level with proper backpressure that matches sender to receiver rate. Every userspace VPN shows massive loss (69--99\%) because the sender overwhelms the tunnel's processing capacity. Yggdrasil's 98.7\% loss is the most extreme: it sends the most data (due to its large block size) but loses almost all of it. These loss rates do not reflect real-world UDP behavior but reveal which VPNs implement effective flow control. Hyprspace and Mycelium could not complete the UDP test at all, timing out after 120 seconds. The \texttt{blksize\_bytes} field reveals each VPN's effective path MTU: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at 2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at 1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the smallest). These differences affect fragmentation behavior under real workloads, particularly for protocols that send large datagrams. %TODO: Mention QUIC %TODO: Mention again that the "default" settings of every VPN have been used % to better reflect real world use, as most users probably won't % change these defaults % and explain that good defaults are as much a part of good software as % having the features but they are hard to configure correctly \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Throughput}.png} \caption{UDP receiver throughput} \label{fig:udp_throughput} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP Packet Loss}.png} \caption{UDP packet loss} \label{fig:udp_loss} \end{subfigure} \caption{UDP stress test results at baseline (\texttt{-b 0}, unlimited sender rate). Internal and WireGuard are the only implementations with 0\% loss. Hyprspace and Mycelium are excluded due to 120-second timeouts.} \label{fig:udp_results} \end{figure} % TODO: Compare parallel TCP retransmit rate % with single TCP retransmit rate and see what changed \subsection{Real-World Workloads} Saturating a link with iPerf3 measures peak capacity, but not how a VPN performs under realistic traffic. This subsection switches to application-level workloads: downloading packages from a Nix binary cache and streaming video over RIST. Both interact with the VPN tunnel the way real software does, through many short-lived connections, TLS handshakes, and latency-sensitive UDP packets. \paragraph{Nix Binary Cache Downloads.} This test downloads a fixed set of Nix packages through each VPN and measures the total transfer time. The results (Table~\ref{tab:nix_cache}) compress the throughput hierarchy considerably: even Hyprspace, the worst performer, finishes in 11.92\,s, only 40\,\% slower than bare metal. Once connection setup, TLS handshakes, and HTTP round-trips enter the picture, throughput differences between 500 and 900\,Mbps matter far less than per-connection latency. \begin{table}[H] \centering \caption{Nix binary cache download time at baseline, sorted by duration. Overhead is relative to the internal baseline (8.53\,s).} \label{tab:nix_cache} \begin{tabular}{lrr} \hline \textbf{VPN} & \textbf{Mean (s)} & \textbf{Overhead (\%)} \\ \hline Internal & 8.53 & -- \\ Nebula & 9.15 & +7.3 \\ ZeroTier & 9.22 & +8.1 \\ VpnCloud & 9.39 & +10.0 \\ EasyTier & 9.39 & +10.1 \\ WireGuard & 9.45 & +10.8 \\ Headscale & 9.79 & +14.8 \\ Tinc & 10.00 & +17.2 \\ Mycelium & 10.07 & +18.1 \\ Yggdrasil & 10.59 & +24.2 \\ Hyprspace & 11.92 & +39.7 \\ \hline \end{tabular} \end{table} Several rankings invert relative to raw throughput. ZeroTier finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite 6\,\% fewer raw Mbps and 1\,000$\times$ more retransmits. Yggdrasil is the clearest example: it has the fourth-highest VPN throughput at 795\,Mbps, yet lands at 24\,\% overhead because its 2.2\,ms latency adds up over the many small sequential HTTP requests that constitute a Nix cache download. Figure~\ref{fig:throughput_vs_download} confirms this weak link between raw throughput and real-world download speed. \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache Mean Download Time}.png} \caption{Nix cache download time per VPN} \label{fig:nix_cache} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png} \caption{Raw throughput vs.\ download time} \label{fig:throughput_vs_download} \end{subfigure} \caption{Application-level download performance. The throughput hierarchy compresses under real HTTP workloads: the worst VPN (Hyprspace, 11.92\,s) is only 40\% slower than bare metal. Throughput explains some variance but not all: Yggdrasil (795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s) because latency matters more for HTTP workloads.} \label{fig:nix_download} \end{figure} \paragraph{Video Streaming (RIST).} At just 3.3\,Mbps, the RIST video stream sits comfortably within every VPN's throughput budget. This test therefore measures something different: how well the VPN handles real-time UDP packet delivery under steady load. Nine of the eleven VPNs pass without incident, delivering 100\,\% video quality. The 14--16 dropped frames that appear uniformly across all VPNs, including Internal, trace back to encoder warm-up rather than tunnel overhead. Headscale is the exception. It averages just 13.1\,\% quality, dropping 288~packets per test interval. The degradation is not bursty but sustained: median quality sits at 10\,\%, and the interquartile range of dropped packets spans a narrow 255--330 band. The qperf benchmark independently corroborates this, having failed outright for Headscale, confirming that something beyond bulk TCP is broken. What makes this failure unexpected is that Headscale builds on WireGuard, which handles video flawlessly. TCP throughput places Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and qperf probes latency-sensitive paths using both TCP and UDP. The pattern points toward Headscale's DERP relay or NAT traversal layer as the source. Its effective path MTU of 1\,208~bytes, the smallest of any VPN, likely compounds the issue: RIST packets that exceed this limit must be fragmented, and reassembling fragments under sustained load produces exactly the kind of steady, uniform packet drops the data shows. For video conferencing, VoIP, or any real-time media workload, this is a disqualifying result regardless of TCP throughput. Hyprspace reveals a different failure mode. Its average quality reads 100\,\%, but the raw numbers underneath are far from stable: mean packet drops of 1\,194 and a maximum spike of 55\,500, with the 25th, 50th, and 75th percentiles all at zero. Hyprspace alternates between perfect delivery and catastrophic bursts. RIST's forward error correction compensates for most of these events, but the worst spikes are severe enough to overwhelm FEC entirely. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/Video Streaming/RIST Quality}.png} \caption{RIST video streaming quality at baseline. Headscale at 13.1\% average quality is the clear outlier. Every other VPN achieves 99.8\% or higher. Nebula is at 99.8\% (minor degradation). The video bitrate (3.3\,Mbps) is well within every VPN's throughput capacity, so this test reveals real-time UDP handling quality rather than bandwidth limits.} \label{fig:rist_quality} \end{figure} \subsection{Operational Resilience} Sustained-load performance does not predict recovery speed. How quickly a tunnel comes up after a reboot, and how reliably it reconverges, matters as much as peak throughput for operational use. First-time connectivity spans a wide range. Headscale and WireGuard are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud (10--14\,s) spend seconds negotiating with their control planes before passing traffic. %TODO: Maybe we want to scrap first-time connectivity Reboot reconnection rearranges the rankings. Hyprspace, the worst performer under sustained TCP load, recovers in just 8.7~seconds on average, faster than any other VPN. WireGuard and Nebula follow at 10.1\,s each. Nebula's consistency is striking: 10.06, 10.06, 10.07\,s across its three nodes, pointing to a hard-coded timer rather than topology-dependent convergence. Mycelium sits at the opposite end, needing 76.6~seconds and showing the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s), suggesting a fixed protocol-level wait built into the overlay. %TODO: Hard coded timer needs to be verified Yggdrasil produces the most lopsided result in the dataset: its yuki node is back in 7.1~seconds while lom and luna take 94.8 and 97.3~seconds respectively. The gap likely reflects the overlay's spanning-tree rebuild: a node near the root of the tree reconverges quickly, while one further out has to wait for the topology to propagate. %TODO: Needs clarifications what is a "spanning tree build" \begin{figure}[H] \centering \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png} \caption{Average reconnection time per VPN} \label{fig:reboot_bar} \end{subfigure} \vspace{1em} \begin{subfigure}[t]{\textwidth} \centering \includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png} \caption{Per-node reconnection time heatmap} \label{fig:reboot_heatmap} \end{subfigure} \caption{Reboot reconnection time at baseline. The heatmap reveals Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\ 95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s across all nodes). Hyprspace reconnects fastest (8.7\,s average) despite its poor sustained-load performance.} \label{fig:reboot_reconnection} \end{figure} \subsection{Pathological Cases} \label{sec:pathological} Three VPNs exhibit behaviors that the aggregate numbers alone cannot explain. The following subsections piece together observations from earlier benchmarks into per-VPN diagnoses. \paragraph{Hyprspace: Buffer Bloat.} \label{sec:hyprspace_bloat} Hyprspace produces the most severe performance collapse in the dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP load, that number balloons to roughly 2\,800\,ms, a 1\,556$\times$ increase. This is not the network becoming congested; it is the VPN tunnel itself filling up with buffered packets and refusing to drain. The consequences ripple through every TCP metric. With 4\,965 retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than steady-state transfer, shrinking the average congestion window to 205\,KB, the smallest in the dataset. Under parallel load the situation worsens: retransmits climb to 17\,426. The buffering even inverts iPerf3's measurements: the receiver reports 419.8\,Mbps while the sender sees only 367.9\,Mbps, because massive ACK delays cause the sender-side timer to undercount the actual data rate. The UDP test never finished at all, timing out at 120~seconds. % Should we always use percentages for retransmits? What prevents Hyprspace from being entirely unusable is everything \emph{except} sustained load. It has the fastest reboot reconnection in the dataset (8.7\,s) and delivers 100\,\% video quality outside of its burst events. The pathology is narrow but severe: any continuous data stream saturates the tunnel's internal buffers. \paragraph{Mycelium: Routing Anomaly.} \label{sec:mycelium_routing} Mycelium's 34.9\,ms average latency appears to be the cost of routing through a global overlay. The per-path numbers, however, reveal a bimodal distribution: \begin{itemize} \bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable to Headscale at 1.64\,ms) \bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed) \bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed) \end{itemize} One of the three links has found a direct route; the other two still bounce through the overlay. All three machines sit on the same physical network, so Mycelium's path discovery is failing intermittently, a more specific problem than blanket overlay overhead. Throughput mirrors the split: yuki$\rightarrow$luna reaches 379\,Mbps while luna$\rightarrow$lom manages only 122\,Mbps, a 3:1 gap. In bidirectional mode, the reverse direction on that worst link drops to 58.4\,Mbps, the lowest single-direction figure in the entire dataset. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png} \caption{Per-link TCP throughput for Mycelium, showing extreme path asymmetry caused by inconsistent direct route discovery. The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps) and worst (luna$\rightarrow$lom, 122\,Mbps) links reflects different overlay routing paths.} \label{fig:mycelium_paths} \end{figure} The overlay penalty shows up most clearly at connection setup. Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's 16.8\,ms, a 5.6$\times$ overhead), and connection establishment alone costs 47.3\,ms (3$\times$ overhead). Every new connection incurs that overhead, so workloads dominated by short-lived connections accumulate it rapidly. Bulk downloads, by contrast, amortize it: the Nix cache test finishes only 18\,\% slower than Internal (10.07\,s vs.\ 8.53\,s) because once the transfer phase begins, per-connection latency fades into the background. Mycelium is also the slowest VPN to recover from a reboot: 76.6~seconds on average, and almost suspiciously uniform across nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a hard-coded convergence timer in the overlay protocol rather than anything topology-dependent. The UDP test timed out at 120~seconds, and even first-time connectivity required a 70-second wait at startup. % Explain what topology-dependent means in this case. \paragraph{Tinc: Userspace Processing Bottleneck.} Tinc is a clear case of a CPU bottleneck masquerading as a network problem. At 1.19\,ms latency, packets get through the tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a third of the bare-metal link. The usual suspects do not apply: Tinc's path MTU is a healthy 1\,500~bytes (\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and WireGuard at 1\,368), and its retransmit count (240) is moderate. What limits Tinc is its single-threaded userspace architecture: one CPU core simply cannot encrypt, copy, and forward packets fast enough to fill the pipe. The parallel benchmark confirms this diagnosis. Tinc scales to 563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio. Multiple TCP streams collectively keep that single core busy during what would otherwise be idle gaps in any individual flow, squeezing out throughput that no single stream could reach alone. \section{Impact of Network Impairment} \label{sec:impairment} The impairment profiles from Table~\ref{tab:impairment_profiles} are applied to the full benchmark suite. Baseline results from Section~\ref{sec:baseline} serve as the reference. \subsection{Ping} Table~\ref{tab:ping_impairment} lists average round-trip times across all four profiles. Most VPNs track the expected increase closely: tc~netem adds roughly 4\,ms, 8\,ms, and 15\,ms of round-trip delay at Low, Medium, and High respectively, and Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this. \begin{table}[H] \centering \caption{Average ping RTT (ms) across impairment profiles, sorted by High-profile RTT} \label{tab:ping_impairment} \begin{tabular}{lrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\ \hline Internal & 0.60 & 4.82 & 9.38 & 15.49 \\ Tinc & 1.19 & 5.32 & 9.85 & 15.92 \\ Nebula & 1.25 & 5.38 & 9.99 & 15.96 \\ WireGuard & 1.20 & 5.36 & 9.88 & 15.99 \\ Headscale & 1.64 & 5.82 & 10.39 & 16.07 \\ VpnCloud & 1.13 & 5.41 & 10.35 & 16.21 \\ ZeroTier & 1.28 & 5.34 & 10.02 & 16.54 \\ Yggdrasil & 2.20 & 6.73 & 11.99 & 20.20 \\ Hyprspace & 1.79 & 6.15 & 10.76 & 24.49 \\ EasyTier & 1.33 & 6.27 & 14.13 & 26.60 \\ Mycelium & 34.90 & 23.42 & 43.88 & 33.05 \\ \hline \end{tabular} \end{table} % PLOT: line chart % File: Figures/impairment/Ping Average RTT Heatmap.png % Data: Average ping RTT for all 11 VPNs at baseline, low, medium, high % Show: Most VPNs in a tight parallel band; Mycelium's non-monotonic curve; % EasyTier and Hyprspace diverging upward at high impairment Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at baseline to 23.4\,ms at Low impairment, a 33\% improvement where every other VPN gets slower. It then rises to 43.9\,ms at Medium before falling again to 33.0\,ms at High. The baseline analysis (Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency comes from a bimodal routing distribution: one path runs at 1.63\,ms while two others route through the global overlay at ${\sim}$51\,ms. The impairment appears to push Mycelium's path discovery toward shorter routes, so a larger share of traffic takes the direct path. The non-monotonic pattern is consistent with a path selection algorithm that responds to measured link quality, but not linearly with degradation severity. Mycelium also achieves 0\% ping packet loss at Low and Medium impairment, while most VPNs show 0.1--3.2\% loss at those profiles. At High impairment, Mycelium's loss jumps to 11.1\%. EasyTier accumulates 11\,ms of excess latency at High impairment beyond what tc~netem accounts for. Its average RTT of 26.6\,ms and maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) point to a userspace scheduling or retry mechanism that introduces escalating variance. EasyTier's RTT standard deviation reaches 44.6\,ms at High, the worst jitter of any VPN. Hyprspace shows 11.1\% ping packet loss at every impairment level --- Low, Medium, and High alike. With 9~measurement runs (3~machine pairs $\times$ 3~runs of 100~packets), 11.1\% equals exactly 1/9: one run per profile fails completely while the other eight report zero loss. This binary pass/fail behavior is consistent with the buffer bloat diagnosis from Section~\ref{sec:hyprspace_bloat}. When buffers fill, an entire path stalls rather than degrading gradually. \subsection{TCP Throughput} Table~\ref{tab:tcp_impairment} presents single-stream TCP throughput across all four profiles. The baseline performance tiers from Section~\ref{sec:baseline} dissolve almost immediately under impairment. \begin{table}[H] \centering \caption{Single-stream TCP throughput (Mbps) across impairment profiles, sorted by baseline. Retention is the Low-to-baseline ratio.} \label{tab:tcp_impairment} \begin{tabular}{lrrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} & \textbf{Retention} \\ \hline Internal & 934 & 333 & 29.6 & 4.25 & 35.7\% \\ WireGuard & 864 & 54.7 & 8.77 & 2.63 & 6.3\% \\ ZeroTier & 814 & 63.7 & 12.0 & 4.01 & 7.8\% \\ Headscale & 800 & 274 & 41.5 & 4.21 & 34.3\% \\ Yggdrasil & 795 & 13.2 & 6.08 & 3.40 & 1.7\% \\ \hline Nebula & 706 & 49.8 & 7.82 & 2.60 & 7.1\% \\ EasyTier & 636 & 156 & 17.4 & 3.59 & 24.6\% \\ VpnCloud & 539 & 58.2 & 8.33 & 1.86 & 10.8\% \\ \hline Hyprspace & 368 & 4.42 & 2.05 & 1.39 & 1.2\% \\ Tinc & 336 & 54.4 & 5.53 & 2.77 & 16.2\% \\ Mycelium & 259 & 16.2 & 3.87 & 2.73 & 6.3\% \\ \hline \end{tabular} \end{table} % PLOT: line chart % File: Figures/impairment/TCP Throughput Heatmap.png % Data: Single-stream TCP throughput for all 11 VPNs at baseline, % low, medium, high % Show: Headscale crossing above Internal at medium impairment; % Yggdrasil's cliff from baseline to low; convergence of all % VPNs at high impairment Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a 98.3\% throughput loss from adding just 2\,ms latency, 2\,ms jitter, 0.25\% packet loss, and 0.5\% reordering per machine. Even Mycelium, the slowest VPN at baseline (259\,Mbps), retains more throughput at Low than Yggdrasil does. The jumbo overlay MTU of 32\,731~bytes, which inflated baseline metrics (Section~\ref{sec:baseline}), becomes a liability under impairment: each lost or reordered outer packet triggers retransmission of ${\sim}$24$\times$ more inner-layer data than a standard 1\,400-byte MTU VPN would lose. Headscale retains 34.3\% of its baseline throughput at Low, nearly matching Internal's 35.7\%. At Medium impairment, Headscale (41.5\,Mbps) overtakes Internal (29.6\,Mbps) --- a VPN outperforming the bare-metal baseline. Section~\ref{sec:tailscale_degraded} investigates this anomaly in detail. At High impairment, the throughput range compresses from 675\,Mbps at baseline to just 2.9\,Mbps. Internal leads at 4.25\,Mbps; Hyprspace trails at 1.39\,Mbps. The impairment profile itself becomes the bottleneck. With 2.5\% packet loss and 5\% reordering per machine, every implementation is TCP-loss-limited, and architectural differences that matter at gigabit speeds become irrelevant. \subsection{UDP Throughput} The UDP stress test (\texttt{-b~0}) suffers from widespread failures under impairment. Hyprspace and Mycelium, which already failed at baseline, continue to fail at all profiles. Tinc and ZeroTier fail at most non-baseline profiles. The sparse dataset limits conclusions, but one pattern stands out. Kernel-level implementations maintain throughput regardless of impairment. Internal holds ${\sim}$950\,Mbps across all profiles where data exists. Headscale sustains 700--876\,Mbps and WireGuard 850--908\,Mbps; % TODO: verify WireGuard UDP range -- analysis doc says 850-898, possible digit transposition both rely on WireGuard's in-kernel UDP handling with proper backpressure. Userspace VPNs collapse: EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across successive profiles. Yggdrasil, already pathological at baseline (98.7\% loss), crashes to 12.3\,Mbps at Low and fails entirely at Medium and High. % PLOT: heatmap % File: Figures/impairment/UDP Receiver Throughput Heatmap.png % Data: UDP receiver throughput for all 11 VPNs at baseline, low, % medium, high (grey/hatched cells for failures) % Show: Kernel-level VPNs (Internal, WireGuard, Headscale) maintaining % high throughput across all profiles; userspace VPNs failing or % collapsing; the large number of empty cells The failure rate of this benchmark under impairment makes it more useful as a robustness indicator than a throughput measurement. A VPN that cannot complete a 30-second UDP flood under 0.25\% packet loss has fundamental flow-control problems that will surface under real workloads too, even if the symptoms are milder. \subsection{Parallel TCP} Table~\ref{tab:parallel_impairment} shows aggregate throughput across three concurrent bidirectional links (six unidirectional flows). The Headscale anomaly from the single-stream results is amplified here. \begin{table}[H] \centering \caption{Parallel TCP throughput (Mbps) across impairment profiles. Three concurrent bidirectional links produce six unidirectional flows.} \label{tab:parallel_impairment} \begin{tabular}{lrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\ \hline Internal & 1398 & 277 & 82.6 & 10.4 \\ Headscale & 1228 & 718 & 113 & 20.0 \\ WireGuard & 1281 & 173 & 24.5 & 8.39 \\ Yggdrasil & 1265 & 38.7 & 16.7 & 8.95 \\ ZeroTier & 1206 & 176 & 35.4 & 7.97 \\ EasyTier & 927 & 473 & 57.4 & 10.7 \\ Hyprspace & 803 & 2.87 & 6.94 & 3.62 \\ VpnCloud & 763 & 174 & 23.7 & 8.25 \\ Nebula & 648 & 103 & 15.3 & 4.93 \\ Mycelium & 569 & 72.7 & 7.51 & 3.69 \\ Tinc & 563 & 168 & 23.7 & 8.25 \\ \hline \end{tabular} \end{table} % PLOT: heatmap % File: Figures/impairment/Parallel TCP Throughput Heatmap.png % Data: Parallel TCP throughput for all 11 VPNs at baseline, low, % medium, high % Show: Headscale dominating at low impairment (718 Mbps vs Internal's % 277); EasyTier as runner-up (473 Mbps); Hyprspace's collapse % to 2.87 Mbps Headscale at Low impairment: 718\,Mbps --- 2.6$\times$ Internal (277\,Mbps) and 4.1$\times$ WireGuard (173\,Mbps). At Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%. The single-stream anomaly from Section~\ref{sec:tailscale_degraded} compounds when multiple flows each independently benefit from Headscale's congestion control tuning. EasyTier is the second-most resilient VPN under parallel load, at 473\,Mbps at Low (51\% of baseline). Both EasyTier and Headscale retain more than half their baseline parallel throughput at Low impairment; no other VPN exceeds 30\%. Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\% loss. The buffer bloat that plagues single-stream transfers (Section~\ref{sec:hyprspace_bloat}) becomes catastrophic when six concurrent flows compete for the same bloated buffers. The High-profile convergence effect is even more pronounced here than in single-stream mode. Tinc and VpnCloud land at identical 8.25\,Mbps despite differing by 200\,Mbps at baseline. \subsection{QUIC Performance} Headscale and Nebula failed the qperf QUIC benchmark at baseline (Section~\ref{sec:baseline}) and continue to fail across all impairment profiles. Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to 7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High --- the same cliff observed in its TCP results, driven by the same jumbo-MTU amplification of outer-layer packet loss. At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps), ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within 0.4\,Mbps of each other. At baseline these four span a 500\,Mbps range. QUIC's own congestion control, operating atop the already-degraded outer link, becomes the sole limiter. % PLOT: heatmap % File: Figures/impairment/QUIC Bandwidth Heatmap.png % Data: QPerf QUIC bandwidth for VPNs with data at all four profiles % (WireGuard, VpnCloud, ZeroTier, Tinc, Yggdrasil, Internal) % Show: Yggdrasil's cliff from baseline to low; convergence of % WireGuard, VpnCloud, ZeroTier, Tinc at high (~23 Mbps) \subsection{Video Streaming} Table~\ref{tab:rist_impairment} presents RIST video quality scores across profiles. The actual encoding bitrate of ${\sim}$3.3\,Mbps sits well within every VPN's throughput budget even at High impairment, so quality differences reflect packet delivery reliability rather than bandwidth limits. \begin{table}[H] \centering \caption{RIST video streaming quality (\%) across impairment profiles, sorted by High-profile quality} \label{tab:rist_impairment} \begin{tabular}{lrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\ \hline Mycelium & 100.0 & 100.0 & 100.0 & 99.9 \\ EasyTier & 100.0 & 100.0 & 96.2 & 85.5 \\ Internal & 100.0 & 99.2 & 89.3 & 80.2 \\ ZeroTier & 100.0 & 99.3 & 89.9 & 80.2 \\ VpnCloud & 100.0 & 99.2 & 89.7 & 80.1 \\ WireGuard & 100.0 & 99.3 & 90.0 & 80.0 \\ Hyprspace & 100.0 & 92.9 & 87.9 & 78.1 \\ Tinc & 100.0 & 99.3 & 90.0 & 77.8 \\ Nebula & 99.8 & 98.8 & 85.6 & 72.1 \\ Yggdrasil & 100.0 & 94.7 & 71.4 & 43.3 \\ Headscale & 13.1 & 13.0 & 13.0 & 13.0 \\ \hline \end{tabular} \end{table} % PLOT: heatmap % File: Figures/impairment/Video Streaming Quality Heatmap.png % Data: RIST quality for all 11 VPNs at baseline, low, medium, high % Show: Headscale stuck at 13% (red row); Mycelium stuck near 100% % (green row); gradual degradation for the bulk; Yggdrasil's % steep decline to 43% Headscale stays at ${\sim}$13\% across all four profiles: 13.1\%, 13.0\%, 13.0\%, 13.0\%. The profile-independence confirms the baseline diagnosis from Section~\ref{sec:baseline}. The failure is structural --- likely MTU fragmentation in the DERP relay layer --- and cannot worsen because it is already saturated. Adding latency or loss on top of an 87\% packet drop floor changes nothing. Mycelium delivers 99.9\% quality even at High impairment, better than Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even Mycelium's degraded overlay paths can sustain the stream. Its overlay retransmission mechanism, which cripples bulk TCP transfers, works well for steady low-bandwidth UDP flows. RIST's own forward error correction handles whatever Mycelium's retransmissions miss. Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP throughput also hurts here --- large overlay packets carrying RIST data are more likely to be lost or reordered at the outer layer, and RIST's FEC cannot recover from the resulting burst losses. \subsection{Application-Level Download} Table~\ref{tab:nix_impairment} shows Nix binary cache download times across profiles. This HTTP-heavy workload, dominated by many short-lived TCP connections, is more sensitive to per-connection latency than to raw bandwidth. \begin{table}[H] \centering \caption{Nix binary cache download time (seconds) across impairment profiles, sorted by Low-profile time. ``--'' marks a failed run.} \label{tab:nix_impairment} \begin{tabular}{lrrrr} \hline \textbf{VPN} & \textbf{Baseline} & \textbf{Low} & \textbf{Medium} & \textbf{High} \\ \hline Internal & 8.53 & 11.9 & 58.6 & -- \\ Headscale & 9.79 & 13.5 & 48.8 & 219 \\ EasyTier & 9.39 & 22.1 & 141 & -- \\ VpnCloud & 9.39 & 27.9 & 163 & -- \\ WireGuard & 9.45 & 28.8 & 161 & -- \\ Nebula & 9.15 & 30.8 & 180 & 547 \\ Tinc & 10.0 & 30.9 & 166 & 496 \\ ZeroTier & 9.22 & 36.2 & 141 & -- \\ Mycelium & 10.1 & 79.5 & -- & -- \\ Yggdrasil & 10.6 & 230 & -- & -- \\ Hyprspace & 11.9 & -- & 170 & -- \\ \hline \end{tabular} \end{table} % PLOT: heatmap % File: Figures/impairment/Nix Cache Download Time Heatmap.png % Data: Nix cache download time for all VPNs at baseline, low, medium, % high (hatched/absent bars for failures) % Show: Headscale as the only VPN completing all four profiles; % Headscale beating Internal at medium (48.8 vs 58.6 s); % Yggdrasil's 22x slowdown at low impairment Headscale is the only VPN to complete all four profiles. At Medium impairment, it finishes in 48.8~seconds --- faster than Internal's 58.6~seconds. Internal itself fails at High impairment while Headscale completes in 219~seconds. Only Nebula (547\,s) and Tinc (496\,s) also survive High impairment. Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low impairment, a 22$\times$ slowdown. Every HTTP request incurs the latency penalty from Yggdrasil's impairment-amplified retransmissions. Mycelium also degrades severely (10.1\,s to 79.5\,s, an 8$\times$ increase), consistent with its overlay routing overhead, which compounds over hundreds of sequential HTTP connections. The failure map reveals a clean gradient: more demanding profiles knock out more VPNs. At Low, 10 of 11 complete (Hyprspace fails). At Medium, 9 complete. At High, only 3 survive (Headscale, Nebula, Tinc). Internal's failure at High is the most surprising --- the bare-metal baseline cannot sustain a multi-connection HTTP workload under severe degradation, but Headscale, shielded by its userspace TCP stack, can. Section~\ref{sec:tailscale_degraded} explains why. \section{Tailscale Under Degraded Conditions} \label{sec:tailscale_degraded} \subsection{Observed Anomaly} At Medium impairment, Headscale delivers 41.5\,Mbps single-stream TCP throughput --- 40\% more than Internal's 29.6\,Mbps. A VPN built atop WireGuard outperforms the bare-metal connection it tunnels through. The anomaly is consistent across benchmarks: Table~\ref{tab:headscale_anomaly} summarizes the comparison. \begin{table}[H] \centering \caption{Headscale vs.\ Internal vs.\ WireGuard under impairment (18.12.2025 run). For TCP benchmarks, higher is better. For Nix cache, lower is better; ``--'' marks a failed run.} \label{tab:headscale_anomaly} \begin{tabular}{llrrr} \hline \textbf{Benchmark} & \textbf{Profile} & \textbf{Internal} & \textbf{Headscale} & \textbf{WireGuard} \\ \hline Single TCP (Mbps) & Low & 333 & 274 & 54.7 \\ Single TCP (Mbps) & Medium & 29.6 & 41.5 & 8.77 \\ Single TCP (Mbps) & High & 4.25 & 4.21 & 2.63 \\ Parallel TCP (Mbps) & Low & 277 & 718 & 173 \\ Parallel TCP (Mbps) & Medium & 82.6 & 113 & 24.5 \\ Nix cache (s) & Medium & 58.6 & 48.8 & 161 \\ Nix cache (s) & High & -- & 219 & -- \\ \hline \end{tabular} \end{table} % TODO: Needs to be created, use the tools/ folder % PLOT: line chart % File: Figures/impairment/headscale-vs-internal-across-profiles.png % Data: Single-stream TCP throughput for Internal, Headscale, and % WireGuard across all four profiles % Show: Headscale crossing above Internal at medium impairment; % WireGuard far below both; convergence at high % Y-axis: log scale In parallel TCP at Low impairment, Headscale reaches 718\,Mbps vs.\ Internal's 277\,Mbps (2.6$\times$). The Nix cache download at Medium takes Headscale 48.8\,s vs.\ Internal's 58.6\,s (17\% faster). At High impairment, Internal fails the Nix cache entirely while Headscale completes in 219\,s. WireGuard, which shares Headscale's cryptographic layer, shows no such advantage: 54.7\,Mbps at Low, 8.77\,Mbps at Medium. Whatever protects Headscale is not the encryption or the tunnel --- it is something in Tailscale's userspace networking stack. The retransmit data provides the first clue. At Medium impairment, Headscale's retransmit percentage is approximately 2.4\%, matching Internal's ${\sim}$2.4\%. WireGuard's is 5.2\%. Headscale achieves Internal's retransmit efficiency while delivering higher throughput --- fewer spurious retransmissions leave more bandwidth for actual data. \subsection{Congestion Control Analysis} Tailscale uses a userspace TCP/IP stack derived from Google's gVisor (netstack). This stack does not inherit the host kernel's TCP parameters. Three defaults differ from the Linux kernel in ways that matter under packet reordering: \begin{itemize} \bitem{\texttt{tcp\_reordering}:} gVisor uses 10; the Linux kernel defaults to~3. This parameter controls how many out-of-order packets TCP tolerates before treating the event as a loss. With tc~netem injecting 0.5--2.5\% reordering per machine, bursts of 3+ reordered packets are frequent. The kernel's threshold of~3 causes spurious fast retransmits and congestion window reductions for packets that are merely reordered, not lost. \bitem{\texttt{tcp\_recovery} (RACK):} gVisor disables it; the Linux kernel enables it by default. RACK uses timing-based loss detection that is more aggressive than the pure sequence-based approach gVisor uses. Under reordering, RACK's timing heuristics can falsely classify delayed packets as lost. \bitem{\texttt{tcp\_early\_retrans} (TLP):} gVisor disables it; the kernel enables it. Tail Loss Probe sends speculative retransmits on idle connections, which can worsen congestion when the link is already impaired. \end{itemize} The combined effect: under network conditions with packet reordering, the default Linux TCP stack fires retransmits and cuts the congestion window far more often than necessary. Each false positive shrinks the window and reduces throughput. Tailscale's gVisor stack tolerates more reordering before reacting, so its congestion window stays larger and throughput stays higher. This explains why the anomaly grows with impairment severity. At baseline, there is no reordering, so the threshold difference is irrelevant and Internal's kernel-level processing advantage dominates. As reordering increases from 0.5\% (Low) to 2.5\% (Medium) per machine, the kernel's aggressive loss detection fires more often, and the throughput gap shifts in Headscale's favor. \subsection{Tuned Kernel Parameters} Two follow-up benchmark runs applied Tailscale's gVisor TCP parameters to the host kernel via sysctl: \begin{itemize} \bitem{Full gVisor (27.02.2026):} All parameters --- \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, \texttt{tcp\_early\_retrans=0}, plus enlarged buffer sizes (\texttt{tcp\_rmem}, \texttt{tcp\_wmem}, \texttt{rmem\_max}, \texttt{wmem\_max}). Tested on Internal, Headscale, WireGuard, Tinc, and ZeroTier. \bitem{Reorder-only (06.03.2026):} Only \texttt{tcp\_reordering=10}, \texttt{tcp\_recovery=0}, and \texttt{tcp\_early\_retrans=0}. Buffer sizes left at kernel defaults. Tested on Internal and Headscale only. \end{itemize} Table~\ref{tab:kernel_tuning_internal} shows how Internal responds to the tuning. Both follow-up runs used the same impairment profiles and hardware as the original 18.12.2025 run. \begin{table}[H] \centering \caption{Internal (no VPN) throughput across three kernel configurations. ``Default'' is the 18.12.2025 run with stock Linux TCP parameters.} \label{tab:kernel_tuning_internal} \begin{tabular}{llrrr} \hline \textbf{Metric} & \textbf{Profile} & \textbf{Default} & \textbf{Full gVisor} & \textbf{Reorder-only} \\ \hline Single TCP (Mbps) & Baseline & 934 & 934 & 934 \\ Single TCP (Mbps) & Low & 333 & 363 & 354 \\ Single TCP (Mbps) & Medium & 29.6 & 64.2 & 72.7 \\ Parallel TCP (Mbps) & Low & 277 & 893 & 902 \\ Parallel TCP (Mbps) & Medium & 82.6 & 226 & 211 \\ Retransmit \% & Medium & ${\sim}$2.4 & 1.21 & 1.11 \\ Nix cache (s) & Medium & 58.6 & 29.7 & 29.1 \\ \hline \end{tabular} \end{table} % PLOT: grouped bar chart % File: Figures/impairment/no_vpn_kernel_tuning_comparison.png % Data: Internal single-stream TCP at baseline/low/medium across % original, full gVisor, and reorder-only configurations % Show: Dramatic jump at medium (29.6 -> 64.2 -> 72.7 Mbps); % baseline unchanged; modest improvement at low % Y-axis: linear scale Internal's Medium-impairment throughput jumps from 29.6 to 72.7\,Mbps --- a 146\% increase from a three-line sysctl change. The retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; most of the original retransmissions were spurious. The Nix cache download at Medium halves from 58.6\,s to 29.1\,s. Parallel TCP sees an even larger gain. Internal at Low impairment climbs from 277 to 902\,Mbps, a 226\% increase that now exceeds Headscale's original 718\,Mbps. With six concurrent flows each independently benefiting from the higher reordering threshold, the aggregate improvement compounds. The anomaly reverses. At every impairment level and benchmark, tuned Internal now meets or exceeds Headscale. At Medium impairment: Internal 72.7\,Mbps vs.\ Headscale 50.1\,Mbps (Internal 45\% ahead), where the original result had Headscale 40\% ahead. The Nix cache flips too: Internal completes in 29.1\,s vs.\ Headscale's 36.3\,s, where the original had Headscale 17\% faster. % PLOT: before/after comparison % File: Figures/impairment/headscale-gap-reversal.png % Data: Internal vs Headscale throughput ratio at each impairment % level, original vs tuned (reorder-only) % Show: The crossover from "Headscale wins" (ratio < 1) to "Internal % wins" (ratio > 1) at medium impairment after tuning % Y-axis: ratio (Internal / Headscale), 1.0 as break-even The reorder-only configuration (06.03) matches or exceeds the full gVisor configuration (27.02) at most metrics; the two exceptions are single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at Medium (211 vs.\ 226\,Mbps), both within 7\%. Internal reaches 72.7\,Mbps at Medium with reorder-only vs.\ 64.2\,Mbps with full gVisor. The enlarged buffer sizes are unnecessary and may introduce mild buffer bloat that partially offsets the reordering benefit. The entire Headscale advantage is explained by three kernel parameters: \texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}. Other VPNs benefit less from the kernel tuning. WireGuard's Medium throughput rises from 8.77 to 12.2\,Mbps (+39\%) and Tinc's from 5.53 to 11.5\,Mbps (+108\%). ZeroTier shows no change (12.0 to 11.5\,Mbps). The tuning helps the kernel TCP stack, but VPNs that add their own encapsulation overhead and userspace processing have independent bottlenecks that the sysctl parameters cannot remove. Headscale itself gets modestly faster with kernel tuning (+21\% at Medium) but slightly slower at Low impairment ($-$5\%). Its userspace gVisor stack already optimizes for reordering tolerance. When the kernel stack also increases its tolerance, the two layers of tuning may interact suboptimally --- both independently delay retransmits, which can cause compound delays on the kernel-to-Headscale socket path. \section{Source Code Analysis} \subsection{Feature Matrix Overview} % Summary of the 131-feature matrix across all ten VPNs. % Highlight key architectural differences that explain % performance results. \subsection{Security Vulnerabilities} % Vulnerabilities discovered during source code review. \section{Summary of Findings} % Brief summary table or ranking of VPNs by key metrics. % Save deeper interpretation for a Discussion chapter.