diff --git a/Chapters/Results.tex b/Chapters/Results.tex index 9d44e77..981711c 100644 --- a/Chapters/Results.tex +++ b/Chapters/Results.tex @@ -127,7 +127,7 @@ ZeroTier, for instance, reaches 814\,Mbps but accumulates 1\,163~retransmits per test, over 1\,000$\times$ what WireGuard needs. ZeroTier compensates for tunnel-internal packet loss by repeatedly triggering TCP congestion-control recovery, whereas -WireGuard sends data once and it arrives. Across all VPNs, +WireGuard delivers data with negligible in-tunnel loss. Across all VPNs, retransmit behaviour falls into three groups: \emph{clean} ($<$110: WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} (200--900: Tinc, EasyTier, Mycelium, VpnCloud), and @@ -153,9 +153,14 @@ WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP Retransmit Rate}.png} - \caption{Average TCP retransmits per 30-second test (log scale)} + % TODO: Caption says "retransmits" (counts) but the plot axis shows + % "Retransmit Rate (\%)." Align the caption with the plot. + \caption{TCP retransmit rate (\%)} \label{fig:tcp_retransmits} \end{subfigure} + % TODO: This parent caption still says "retransmit count" but the + % subfigure axis and caption were corrected to "retransmit rate (%)." + % Align the parent caption terminology (counts vs rates). \caption{TCP throughput and retransmit rate at baseline. WireGuard leads at 864\,Mbps with 1 retransmit. Hyprspace has nearly 5000 retransmits per test. The retransmit count does not always track @@ -166,9 +171,13 @@ WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed} Retransmits have a direct mechanical relationship with TCP congestion control. Each retransmit triggers a reduction in the congestion window -(\texttt{cwnd}), throttling the sender. This relationship is visible +(\texttt{cwnd}), throttling the sender. % TODO: The text says "average congestion window" but +% Figure~\ref{fig:retransmit_cwnd} plots "Max Congestion Window." +% Use consistent terminology --- either change the text to "max" or +% change the figure axis label. +This relationship is visible in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965 -retransmits, maintains the smallest average congestion window in the +retransmits, maintains the smallest max congestion window in the dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB window, the largest of any VPN. At first glance this suggests a clean inverse correlation between retransmits and congestion window @@ -191,6 +200,12 @@ in the dataset). This suggests significant in-tunnel packet loss or buffering at the VpnCloud layer that the retransmit count (857) alone does not fully explain. +% TODO: Mycelium's 122--379 Mbps range is per-link asymmetry (different +% overlay routing paths), not stochastic run-to-run variability. +% Section~\ref{sec:mycelium_routing} confirms the same numbers as +% per-link throughput. Conflating link asymmetry with run-to-run +% variance is misleading --- either separate the two or clarify that +% Mycelium's spread comes from path selection, not randomness. Run-to-run variability also differs substantially. WireGuard ranges from 824 to 884\,Mbps (a 60\,Mbps window), while Mycelium ranges from 122 to 379\,Mbps, a 3:1 ratio between worst and best runs. A @@ -256,9 +271,10 @@ times, which cluster into three distinct ranges. \end{tabular} \end{table} -Six VPNs stay below 1.3\,ms, comfortably close to the bare-metal -0.60\,ms. 
VpnCloud posts the lowest latency of any VPN (1.13\,ms), below -WireGuard (1.20\,ms), yet its throughput tops out at only 539\,Mbps. +Five VPNs stay below 1.3\,ms, comfortably close to the bare-metal +0.60\,ms; EasyTier sits just above at 1.33\,ms. VpnCloud posts the +lowest latency of any VPN (1.13\,ms), below WireGuard (1.20\,ms), +yet its throughput tops out at only 539\,Mbps. Low per-packet latency does not guarantee high bulk throughput. A second group (Headscale, Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing @@ -266,6 +282,9 @@ moderate overhead. Then there is Mycelium at 34.9\,ms, so far removed from the rest that Section~\ref{sec:mycelium_routing} gives it a dedicated analysis. +% TODO: The max RTT claim (8.6 ms) is not visible in the Average RTT +% plot. Add a max-RTT figure or table, or reference the raw data +% source. ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any sub-2\,ms VPN. These spikes point to periodic control-plane @@ -285,19 +304,40 @@ but only the second-lowest throughput (336\,Mbps). Packets traverse the tunnel quickly, yet single-threaded userspace processing cannot keep up with the link speed. The qperf benchmark backs this up: Tinc maxes out at -14.9\,\% CPU while delivering just 336\,Mbps, a clear sign that -the CPU, not the network, is the bottleneck. +14.9\,\% total system CPU while delivering just 336\,Mbps. +% TODO: 14.9\% total CPU does not obviously indicate a bottleneck. +% Clarify that this is whole-system utilization on a multi-core +% machine, and that Tinc's single-threaded design means one core is +% saturated while the rest are idle. Also note that VpnCloud reports +% the same 14.9\% yet achieves 539 Mbps --- explain why the same CPU +% utilization yields different throughput (e.g., different per-packet +% processing cost). +On a multi-core system, the low percentage reflects a single +saturated core, a clear sign that the CPU, not the network, is the +bottleneck. Figure~\ref{fig:latency_throughput} makes this disconnect easy to spot. +% TODO: These CPU numbers are stated inline but never shown in a plot +% or table. Add a CPU utilization figure or table so readers can +% verify. Also, the claim that WireGuard's CPU usage "goes to +% cryptographic processing" is unsubstantiated --- no profiling data +% is presented. Either add profiling evidence or soften to +% "likely" / "presumably." The qperf measurements also reveal a wide spread in CPU usage. Hyprspace (55.1\,\%) and Yggdrasil (52.8\,\%) consume 5--6$\times$ as much CPU as Internal's 9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a -kernel-level implementation, though much of that goes to -cryptographic processing. On the efficient end, VpnCloud -(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) do the most -with the least CPU time. Nebula and Headscale are missing from +kernel-level implementation, presumably due to in-kernel +cryptographic processing. % TODO: "do the most with the least CPU time" is misleading --- +% Tinc gets only 336 Mbps at 14.9% CPU (22.6 Mbps/%), while +% WireGuard gets 864 Mbps at 30.8% (28 Mbps/%). These three use +% the least CPU but don't necessarily achieve the best throughput/CPU +% ratio. Rephrase to "use the least CPU" or calculate actual +% efficiency ratios. +On the efficient end, VpnCloud +(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) use the least +CPU time. 
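+
+Raw CPU percentages and per-megabit efficiency are not the same
+thing. A quick derived ratio from the figures quoted above (a rough
+pairing only: CPU is whole-system utilisation from qperf on
+multi-core hosts, throughput is the single-stream TCP average) shows
+that the low-CPU group is not automatically the most efficient:
+
+\begin{verbatim}
+# Throughput per CPU percentage point, from the numbers quoted in this
+# section. Derived for illustration; see the raw benchmark data for
+# exact per-run values.
+quoted = {
+    "WireGuard": (864, 30.8),   # (single-stream TCP Mbps, total CPU %)
+    "Yggdrasil": (795, 52.8),
+    "VpnCloud":  (539, 14.9),
+    "Tinc":      (336, 14.9),
+}
+
+for name, (mbps, cpu) in sorted(quoted.items(),
+                                key=lambda kv: -kv[1][0] / kv[1][1]):
+    print(f"{name:9s} {mbps:4d} Mbps / {cpu:4.1f} % = {mbps / cpu:5.1f} Mbps per %")
+\end{verbatim}
+
+By this measure VpnCloud (${\sim}$36\,Mbps per percentage point) edges
+out WireGuard (${\sim}$28), while Tinc (${\sim}$23) and Yggdrasil
+(${\sim}$15) trail despite sitting at opposite ends of the absolute
+CPU scale.
+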
Nebula and Headscale are missing from this comparison because qperf failed for both. %TODO: Explain why they consistently failed @@ -315,7 +355,10 @@ this comparison because qperf failed for both. \subsection{Parallel TCP Scaling} -The single-stream benchmark tests one link direction at a time. The +The single-stream benchmark tests one link direction at a time. % TODO: The plot labels this benchmark "10-stream parallel" but this +% description says "six unidirectional flows." Verify the actual test +% configuration and reconcile the two. +The parallel benchmark changes this setup: all three link directions (lom$\rightarrow$yuki, yuki$\rightarrow$luna, luna$\rightarrow$lom) run simultaneously in a circular pattern for @@ -361,15 +404,24 @@ single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream can never fill the pipe: the bandwidth-delay product demands a window larger than any single flow maintains, so multiple concurrent flows compensate for that constraint and push throughput to 2.20$\times$ -the single-stream figure. Hyprspace scales almost as well -(2.18$\times$) but for a -different reason: multiple streams work around the buffer bloat that -cripples any individual flow +the single-stream figure. % TODO: The buffer-bloat workaround explanation for Hyprspace's +% parallel scaling is a hypothesis. No direct evidence is shown +% that multiple streams specifically alleviate buffer bloat. +% Consider adding bufferbloat measurements or softening the claim. +% TODO: DOWNSTREAM DEPENDENCY — This claim depends on the buffer bloat +% diagnosis in Section hyprspace_bloat, which itself rests on the unverified +% 2,800 ms under-load latency (see TODO there). If that latency figure +% is not confirmed, this parallel-scaling explanation collapses. +Hyprspace scales almost as well +(2.18$\times$), possibly because multiple streams collectively work +around the buffer bloat that cripples any individual flow (Section~\ref{sec:hyprspace_bloat}). Tinc picks up a 1.68$\times$ boost because several streams can collectively keep its single-threaded CPU busy during what would otherwise be idle gaps in a single flow. +% TODO: "zero retransmits" in parallel mode is not shown in any table +% or figure. Add parallel-mode retransmit data or remove the claim. WireGuard and Internal both scale cleanly at around 1.48--1.50$\times$ with zero retransmits, suggesting that WireGuard's overhead is a fixed per-packet cost that does not worsen @@ -377,7 +429,9 @@ under multiplexing. Nebula is the only VPN that actually gets \emph{slower} with more streams: throughput drops from 706\,Mbps to 648\,Mbps -(0.92$\times$) while retransmits jump from 955 to 2\,462. The ten +(0.92$\times$) while retransmits jump from 955 to 2\,462. The +% TODO: "ten streams" vs "six unidirectional flows" --- reconcile +% with the test description above. streams are clearly fighting each other for resources inside the tunnel. @@ -454,18 +508,29 @@ call. Only the receiver throughput is meaningful. Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at the kernel level with proper backpressure that matches sender to -receiver rate. Every userspace VPN shows massive loss (69--99\%) -because the sender overwhelms the tunnel's processing capacity. +receiver rate. Every other VPN shows massive loss (69--99\%) +because the sender overwhelms the tunnel's userspace processing capacity. +% TODO: Headscale also uses WireGuard's kernel module but still shows +% 69.8\% loss. 
Explain that Headscale's userspace netstack sits +% between the application and the WireGuard kernel module, so UDP +% traffic must pass through userspace before reaching the kernel +% tunnel --- this is why it behaves like a userspace VPN here despite +% using WireGuard underneath. Yggdrasil's 98.7\% loss is the most extreme: it sends the most data (due to its large block size) but loses almost all of it. These loss rates do not reflect real-world UDP behavior but reveal which VPNs implement effective flow control. Hyprspace and Mycelium could not complete the UDP test at all, timing out after 120 seconds. -The \texttt{blksize\_bytes} field reveals each VPN's effective path -MTU: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at 2728, -Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at 1353, -EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the +% TODO: blksize_bytes is the UDP payload size iPerf3 selects, not +% the path MTU. It is derived from the socket MSS and reflects the +% usable payload after tunnel overhead, but conflating it with path +% MTU is misleading. Consider renaming to "effective payload size" +% throughout. +The \texttt{blksize\_bytes} field reveals each VPN's effective UDP +payload size: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at +2728, Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at +1353, EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the smallest). These differences affect fragmentation behavior under real workloads, particularly for protocols that send large datagrams. @@ -593,11 +658,19 @@ between raw throughput and real-world download speed. At just 3.3\,Mbps, the RIST video stream sits comfortably within every VPN's throughput budget. This test therefore measures something different: how well the VPN handles real-time UDP packet -delivery under steady load. Nine of the eleven VPNs pass without -incident, delivering 100\,\% video quality. The 14--16 dropped +delivery under steady load. % TODO: The RIST plot shows Nebula at 99.8\%, not 100\%. "Nine of +% eleven deliver 100\%" is inaccurate --- eight deliver 100\%, Nebula +% delivers 99.8\%. Also, the claim that 14--16 dropped frames trace +% to encoder warm-up is stated without evidence. How was this +% determined? Add a reference or explain the methodology. +Nine of the eleven VPNs pass without +incident, delivering near-perfect video quality. The 14--16 dropped frames that appear uniformly across all VPNs, including Internal, -trace back to encoder warm-up rather than tunnel overhead. +likely trace back to encoder warm-up rather than tunnel overhead. +% TODO: The packet-drop distribution statistics (288 mean, +% 10\% median, IQR 255--330) are not shown in any figure. +% Add a box plot or distribution figure for Headscale's RIST drops. Headscale is the exception. It averages just 13.1\,\% quality, dropping 288~packets per test interval. The degradation is not bursty but sustained: median quality sits at 10\,\%, and the @@ -610,15 +683,23 @@ What makes this failure unexpected is that Headscale builds on WireGuard, which handles video flawlessly. TCP throughput places Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and qperf probes latency-sensitive paths using both TCP and UDP. The +% TODO: The DERP relay / MTU fragmentation hypothesis is plausible +% but unverified. No packet capture or fragmentation analysis is +% presented. Either add tcpdump / packet-level evidence or mark +% this more clearly as a hypothesis. 
pattern points toward Headscale's DERP relay or NAT traversal layer -as the source. Its effective path MTU of 1\,208~bytes, the smallest -of any VPN, likely compounds the issue: RIST packets that exceed -this limit must be fragmented, and reassembling fragments under -sustained load produces exactly the kind of steady, uniform packet +as the source. Its effective UDP payload size of 1\,208~bytes, the smallest +of any VPN, may compound the issue: RIST packets that exceed +this limit would be fragmented, and reassembling fragments under +sustained load could produce exactly the kind of steady, uniform packet drops the data shows. For video conferencing, VoIP, or any real-time media workload, this is a disqualifying result regardless of TCP throughput. +% TODO: Hyprspace's packet-drop statistics (mean 1,194, max 55,500, +% percentiles all zero) are not visible in the RIST Quality bar chart. +% Add a distribution plot or note in the caption that the bar +% chart hides this variance. Hyprspace reveals a different failure mode. Its average quality reads 100\,\%, but the raw numbers underneath are far from stable: mean packet drops of 1\,194 and a maximum spike of 55\,500, with @@ -647,6 +728,9 @@ Sustained-load performance does not predict recovery speed. How quickly a tunnel comes up after a reboot, and how reliably it reconverges, matters as much as peak throughput for operational use. +% TODO: First-time connectivity numbers (50 ms, 8--17 s, 10--14 s) +% are not shown in any figure or table. Either add a figure or +% scrap this paragraph (see note below). First-time connectivity spans a wide range. Headscale and WireGuard are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud (10--14\,s) spend seconds negotiating with their control planes @@ -710,6 +794,10 @@ earlier benchmarks into per-VPN diagnoses. \paragraph{Hyprspace: Buffer Bloat.} \label{sec:hyprspace_bloat} +% TODO: The under-load latency of 2,800 ms is not shown in any plot +% or table. Where does this number come from? Add a figure showing +% latency-under-load (e.g., from qperf concurrent ping) or reference +% the raw data source. Hyprspace produces the most severe performance collapse in the dataset. At idle, its ping latency is a modest 1.79\,ms. Under TCP load, that number balloons to roughly 2\,800\,ms, a @@ -720,11 +808,15 @@ packets and refusing to drain. The consequences ripple through every TCP metric. With 4\,965 retransmits per 30-second test (one in every 200~segments), TCP spends most of its time in congestion recovery rather than -steady-state transfer, shrinking the average congestion window to +steady-state transfer, shrinking the max congestion window to 205\,KB, the smallest in the dataset. Under parallel load the -situation worsens: retransmits climb to 17\,426. The buffering even +situation worsens: retransmits climb to 17\,426. % TODO: The explanation for the sender/receiver inversion (ACK delays +% causing sender-side timer undercounting) is a hypothesis. Normally +% sender >= receiver. Consider verifying with packet captures or +% note this as a likely but unconfirmed explanation. +The buffering even inverts iPerf3's measurements: the receiver reports 419.8\,Mbps -while the sender sees only 367.9\,Mbps, because massive ACK delays +while the sender sees only 367.9\,Mbps, likely because massive ACK delays cause the sender-side timer to undercount the actual data rate. The UDP test never finished at all, timing out at 120~seconds. 
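+
+If the ${\sim}$2\,800\,ms under-load figure holds up (see the TODO
+above about its provenance), a back-of-envelope estimate of the queue
+occupancy it implies at the receiver-reported ${\sim}$420\,Mbps is
+\[
+  Q \approx \text{rate} \times \text{queueing delay}
+    \approx 420\,\text{Mbit/s} \times 2.8\,\text{s}
+    \approx 1.2\,\text{Gbit} \approx 147\,\text{MB},
+\]
+orders of magnitude more buffering than a gigabit LAN path needs, and
+consistent with queues that grow without bound rather than ordinary
+congestion.
+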
@@ -753,9 +845,24 @@ reveal a bimodal distribution: One of the three links has found a direct route; the other two still bounce through the overlay. All three machines sit on the same -physical network, so Mycelium's path discovery is failing -intermittently, a more specific problem than blanket overlay -overhead. Throughput mirrors the split: +% TODO: Characterising path discovery as "failing intermittently" assumes +% direct routing is the expected outcome on a LAN. Mycelium is designed +% as a global overlay and may intentionally route through supernodes. +% If this is by-design behaviour, rephrase to avoid implying a bug. +% This characterisation also propagates to the impairment ping analysis +% (around line 966) which says impairment "pushes path discovery toward +% shorter routes." +% TODO: The throughput data INVERTS the latency split rather than +% "mirroring" it. The direct path (luna→lom, 1.63 ms RTT) achieves +% only 122 Mbps, while the overlay-routed path (yuki→luna, 51.60 ms +% RTT) reaches 379 Mbps --- the opposite of what TCP theory predicts. +% The plot also shows luna→lom receiver throughput at only 57.2 Mbps +% (a 53% sender/receiver gap on that link). Explain why the direct +% path is 3× slower than the overlay path, or acknowledge the +% contradiction. The current wording "mirrors the split" is incorrect. +physical network, so Mycelium's path discovery is not consistently +selecting the direct route, a more specific problem than blanket overlay +overhead. Throughput shows a similarly lopsided split: yuki$\rightarrow$luna reaches 379\,Mbps while luna$\rightarrow$lom manages only 122\,Mbps, a 3:1 gap. In bidirectional mode, the reverse direction on that worst link drops @@ -766,14 +873,23 @@ dataset. \centering \includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average Throughput}.png} + % TODO: The caption attributes the asymmetry to "inconsistent direct + % route discovery" but the direct-route link (luna→lom, 1.63 ms RTT) + % is actually the SLOWEST (122 Mbps). The caption should address + % why the direct path underperforms the overlay paths. \caption{Per-link TCP throughput for Mycelium, showing extreme - path asymmetry caused by inconsistent direct route discovery. - The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps) - and worst (luna$\rightarrow$lom, 122\,Mbps) links reflects - different overlay routing paths.} + path asymmetry. The 3:1 ratio between best + (yuki$\rightarrow$luna, 379\,Mbps) and worst + (luna$\rightarrow$lom, 122\,Mbps) links does not correlate with + the latency split (Section~\ref{sec:mycelium_routing}).} \label{fig:mycelium_paths} \end{figure} +% TODO: TTFB (93.7 ms vs.\ 16.8 ms) and connection establishment +% (47.3 ms) numbers are from qperf but not shown in any figure. +% Add a connection-setup latency table or plot. Also clarify what +% Internal's connection establishment time is (47.3 / 3 = 15.8 ms?) +% so the "3× overhead" can be verified. The overlay penalty shows up most clearly at connection setup. Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's 16.8\,ms, a 5.6$\times$ overhead), and connection establishment @@ -800,14 +916,22 @@ anything topology-dependent. The UDP test timed out at Tinc is a clear case of a CPU bottleneck masquerading as a network problem. At 1.19\,ms latency, packets get through the tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a -third of the bare-metal link. 
The usual suspects do not apply: -Tinc's path MTU is a healthy 1\,500~bytes -(\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3, comparable to -VpnCloud at 1\,375 and WireGuard at 1\,368), and its retransmit +third of the bare-metal link. % TODO: "path MTU is a healthy 1,500 bytes" but blksize_bytes is +% 1,353. These are different metrics --- blksize_bytes is the UDP +% payload size, not the path MTU. Clarify the distinction or +% remove the 1,500 claim. +The usual suspects do not apply: +Tinc's effective UDP payload size (\texttt{blksize\_bytes} of +1\,353 from UDP iPerf3, comparable to VpnCloud at 1\,375 and +WireGuard at 1\,368) is in the normal range, and its retransmit count (240) is moderate. What limits Tinc is its single-threaded userspace architecture: one CPU core simply cannot encrypt, copy, and forward packets fast enough to fill the pipe. +% TODO: DOWNSTREAM DEPENDENCY — This "confirms" the Tinc CPU bottleneck +% diagnosis from above, but the 14.9% CPU figure has an unresolved TODO +% (the same utilization as VpnCloud at 539 Mbps). If the CPU claim is +% revised or refuted, this confirmation must be updated too. The parallel benchmark confirms this diagnosis. Tinc scales to 563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio. Multiple TCP streams collectively keep that single core busy during @@ -817,17 +941,27 @@ out throughput that no single stream could reach alone. \section{Impact of Network Impairment} \label{sec:impairment} -The impairment profiles from Table~\ref{tab:impairment_profiles} are -applied to the full benchmark suite. Baseline results from -Section~\ref{sec:baseline} serve as the reference. +Baseline benchmarks rank VPNs by overhead under ideal conditions. +The impairment profiles from Table~\ref{tab:impairment_profiles} +test a different property: resilience. Two results dominate the +data. First, the throughput hierarchy from +Section~\ref{sec:baseline} collapses under degradation --- at High +impairment, the 675\,Mbps spread across all implementations compresses +to under 3\,Mbps, and architectural differences that matter at gigabit speeds +vanish. Second, Headscale outperforms the bare-metal Internal +baseline at Medium impairment across TCP, parallel TCP, and Nix +cache benchmarks. A VPN built on WireGuard should not beat a direct +connection; Section~\ref{sec:tailscale_degraded} traces the cause to +three TCP parameters in Tailscale's userspace network stack. \subsection{Ping} -Table~\ref{tab:ping_impairment} lists average round-trip times across -all four profiles. Most VPNs track the expected increase closely: -tc~netem adds roughly 4\,ms, 8\,ms, and 15\,ms of round-trip delay -at Low, Medium, and High respectively, and Internal's measured values -(4.82, 9.38, 15.49\,ms) confirm this. +Latency is the most predictable metric under impairment. Most VPNs +absorb the injected delay with a fixed per-hop overhead, and rankings +within the central cluster barely change across profiles +(Table~\ref{tab:ping_impairment}). tc~netem adds roughly 4, 8, and +15\,ms of round-trip delay at Low, Medium, and High respectively; +Internal's measured values (4.82, 9.38, 15.49\,ms) confirm this. 
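+
+A small helper makes the excess-latency bookkeeping used in the rest
+of this subsection explicit. It is a sketch: the injected component
+is approximated by Internal's measured RTT at the same profile, and
+the inputs are the values quoted in the text.
+
+\begin{verbatim}
+# Excess RTT a VPN adds beyond the injected netem delay, taking Internal's
+# measured RTT at the same profile as the reference for the injected part.
+INTERNAL_RTT_MS = {"baseline": 0.60, "low": 4.82, "medium": 9.38, "high": 15.49}
+
+def excess_latency_ms(vpn_rtt_ms, vpn_baseline_ms, profile):
+    injected = INTERNAL_RTT_MS[profile] - INTERNAL_RTT_MS["baseline"]
+    return vpn_rtt_ms - vpn_baseline_ms - injected
+
+# EasyTier at High: 26.6 ms measured against its own 1.33 ms baseline
+print(round(excess_latency_ms(26.6, 1.33, "high"), 1))   # ~10 ms of excess
+\end{verbatim}
+
+Depending on whether the injected component is taken from Internal's
+measured RTTs or from the nominal netem settings, EasyTier's
+High-profile excess comes out at roughly 10--11\,ms, consistent with
+the ${\sim}$11\,ms discussed below.
+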
\begin{table}[H] \centering @@ -854,11 +988,15 @@ at Low, Medium, and High respectively, and Internal's measured values \end{tabular} \end{table} -% PLOT: line chart -% File: Figures/impairment/Ping Average RTT Heatmap.png -% Data: Average ping RTT for all 11 VPNs at baseline, low, medium, high -% Show: Most VPNs in a tight parallel band; Mycelium's non-monotonic curve; -% EasyTier and Hyprspace diverging upward at high impairment +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/Ping Average RTT Heatmap}.png} + \caption{Average ping RTT across impairment profiles. Most VPNs + form a tight parallel band; Mycelium's non-monotonic curve, + EasyTier's excess latency at High, and Hyprspace's upward + divergence stand out.} + \label{fig:ping_impairment_heatmap} +\end{figure} Mycelium defies the pattern. Its RTT \emph{drops} from 34.9\,ms at baseline to 23.4\,ms at Low impairment, a 33\% improvement where @@ -867,37 +1005,59 @@ before falling again to 33.0\,ms at High. The baseline analysis (Section~\ref{sec:mycelium_routing}) showed that Mycelium's latency comes from a bimodal routing distribution: one path runs at 1.63\,ms while two others route through the global overlay at -${\sim}$51\,ms. The impairment appears to push Mycelium's path +${\sim}$51\,ms. % TODO: DOWNSTREAM DEPENDENCY — This explanation depends on the baseline +% characterisation of Mycelium's path discovery as "failing intermittently" +% (Section mycelium_routing). If that characterisation is revised (e.g., +% overlay routing is by-design, not a failure), then the claim that +% impairment "pushes path discovery toward shorter routes" needs rethinking: +% the mechanism would be different if Mycelium is not trying to find direct +% routes in the first place. +The impairment appears to push Mycelium's path discovery toward shorter routes, so a larger share of traffic takes the direct path. The non-monotonic pattern is consistent with a path selection algorithm that responds to measured link quality, but not linearly with degradation severity. +% TODO: Ping packet loss data is not shown in any figure. Add a +% packet loss table/figure or reference the raw data so readers can +% verify these numbers. Mycelium also achieves 0\% ping packet loss at Low and Medium impairment, while most VPNs show 0.1--3.2\% loss at those profiles. At High impairment, Mycelium's loss jumps to 11.1\%. +% TODO: EasyTier's max RTT (290 ms), WireGuard's max (~40 ms), and +% EasyTier's std dev (44.6 ms) are not shown in any plot. The ping +% heatmap only shows averages. Add a jitter/distribution figure. +% Also, the "userspace retry mechanism" is a hypothesized cause +% without source-code or packet-level evidence. EasyTier accumulates 11\,ms of excess latency at High impairment beyond what tc~netem accounts for. Its average RTT of 26.6\,ms and -maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) point to a -userspace scheduling or retry mechanism that introduces escalating -variance. EasyTier's RTT standard deviation reaches 44.6\,ms at -High, the worst jitter of any VPN. +maximum of 290\,ms (vs.\ ${\sim}$40\,ms for WireGuard) suggest a +userspace retry mechanism that introduces escalating variance. +EasyTier's RTT standard deviation reaches 44.6\,ms at High, the +worst jitter of any VPN. +% TODO: Ping packet loss data is not shown in any plot. The 1/9 +% = 11.1\% interpretation is clever but depends on the exact test +% structure (3 pairs × 3 runs × 100 packets). 
Verify this matches +% the actual test setup and add a supporting figure or table. Hyprspace shows 11.1\% ping packet loss at every impairment level --- Low, Medium, and High alike. With 9~measurement runs (3~machine pairs $\times$ 3~runs of 100~packets), 11.1\% equals exactly 1/9: one run per profile fails completely while the other eight report zero -loss. This binary pass/fail behavior is consistent with the buffer -bloat diagnosis from Section~\ref{sec:hyprspace_bloat}. When buffers -fill, an entire path stalls rather than degrading gradually. +loss. % TODO: DOWNSTREAM DEPENDENCY — This is a third reference to the buffer +% bloat diagnosis from Section hyprspace_bloat, which depends on the +% unverified 2,800 ms under-load latency. If that diagnosis is +% revised, this explanation must also be revisited. +This binary pass/fail behavior is consistent with the buffer bloat +diagnosis from Section~\ref{sec:hyprspace_bloat}: when buffers fill, +an entire path stalls rather than degrading gradually. \subsection{TCP Throughput} -Table~\ref{tab:tcp_impairment} presents single-stream TCP throughput -across all four profiles. The baseline performance tiers from -Section~\ref{sec:baseline} dissolve almost immediately under -impairment. +TCP throughput is where the baseline hierarchy breaks down. The +three performance tiers from Section~\ref{sec:baseline} dissolve at +the first impairment step (Table~\ref{tab:tcp_impairment}). \begin{table}[H] \centering @@ -927,13 +1087,15 @@ impairment. \end{tabular} \end{table} -% PLOT: line chart -% File: Figures/impairment/TCP Throughput Heatmap.png -% Data: Single-stream TCP throughput for all 11 VPNs at baseline, -% low, medium, high -% Show: Headscale crossing above Internal at medium impairment; -% Yggdrasil's cliff from baseline to low; convergence of all -% VPNs at high impairment +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/TCP Throughput Heatmap}.png} + \caption{Single-stream TCP throughput across impairment profiles. + Headscale crosses above Internal at Medium impairment; + Yggdrasil collapses from 795 to 13\,Mbps at Low; all VPNs + converge at High.} + \label{fig:tcp_impairment_heatmap} +\end{figure} Yggdrasil crashes from 795\,Mbps to 13.2\,Mbps at Low impairment, a 98.3\% throughput loss from adding just 2\,ms latency, 2\,ms jitter, @@ -962,31 +1124,60 @@ differences that matter at gigabit speeds become irrelevant. \subsection{UDP Throughput} -The UDP stress test (\texttt{-b~0}) suffers from widespread failures -under impairment. Hyprspace and Mycelium, which already failed at -baseline, continue to fail at all profiles. Tinc and ZeroTier fail -at most non-baseline profiles. The sparse dataset limits -conclusions, but one pattern stands out. +The UDP stress test (\texttt{-b~0}) separates kernel-level from +userspace implementations more cleanly than any TCP benchmark. It +also produces widespread failures under impairment: Hyprspace and +Mycelium, which already failed at baseline, continue to time out at +% TODO: Tinc fails at Low and Medium but succeeds at High (8 Mbps) --- +% the same non-monotonic failure pattern as Internal/WireGuard (fail +% at Low, succeed at Medium/High). This suggests the failures are +% iPerf3/tc interaction issues rather than fundamental VPN limitations. +% Nebula and VpnCloud also fail selectively. The widespread non-monotonic +% failure pattern undermines using this benchmark as a reliability +% indicator (see line 1163 claim). Consider discussing this pattern. 
+all profiles, and Tinc drops out at Low and Medium while ZeroTier +fails at Medium. Despite the sparse dataset, one pattern is clear. -Kernel-level implementations maintain throughput regardless of -impairment. Internal holds ${\sim}$950\,Mbps across all profiles -where data exists. Headscale sustains 700--876\,Mbps and WireGuard -850--908\,Mbps; % TODO: verify WireGuard UDP range -- analysis doc says 850-898, possible digit transposition -both rely on WireGuard's in-kernel UDP handling with -proper backpressure. Userspace VPNs collapse: EasyTier drops from +% TODO: The heatmap shows Internal and WireGuard both fail (×) at +% some impairment profiles (e.g., Internal fails at Low, WireGuard +% at Low and High). "Regardless of impairment" overstates the +% evidence. Rephrase to reflect the failures, or explain why +% those runs failed despite the claim of maintained throughput. +% TODO: Internal (and WireGuard) fail at Low impairment in the UDP +% test but succeed at Medium and High --- the opposite of what one +% would expect. This is never explained. Investigate and add an +% explanation (e.g., iPerf3 crash, tc interaction, timing issue). +Kernel-level implementations maintain throughput at the profiles +where data exists. Internal holds ${\sim}$950\,Mbps at +Baseline, Medium, and High. Headscale sustains 700--876\,Mbps and WireGuard +850--898\,Mbps; % TODO: verify WireGuard UDP range -- analysis doc says 850-898, possible digit transposition +both use WireGuard's kernel module for the outer tunnel, which +provides proper backpressure at the transport layer. Userspace VPNs collapse: EasyTier drops from 865 to 435 to 38.5 to 6.1\,Mbps across successive profiles. Yggdrasil, already pathological at baseline (98.7\% loss), crashes to 12.3\,Mbps at Low and fails entirely at Medium and High. -% PLOT: heatmap -% File: Figures/impairment/UDP Receiver Throughput Heatmap.png -% Data: UDP receiver throughput for all 11 VPNs at baseline, low, -% medium, high (grey/hatched cells for failures) -% Show: Kernel-level VPNs (Internal, WireGuard, Headscale) maintaining -% high throughput across all profiles; userspace VPNs failing or -% collapsing; the large number of empty cells - +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/UDP Receiver Throughput Heatmap}.png} + % TODO: This caption says "kernel-level VPNs maintain high throughput" + % but the heatmap shows Internal, WireGuard, and Headscale ALL fail + % ($\times$) at Low impairment. WireGuard also fails at High. + % Rephrase to acknowledge the failures or explain them. + \caption{UDP receiver throughput across impairment profiles. + Kernel-level VPNs (Internal, WireGuard, Headscale) maintain high + throughput where they complete; userspace VPNs collapse or fail + entirely ($\times$ marks a failed run).} + \label{fig:udp_impairment_heatmap} +\end{figure} +% TODO: This "robustness indicator" interpretation is undermined by +% the non-monotonic failure pattern. Internal and WireGuard fail at +% Low (0.25% loss) but succeed at Medium and High (1%+ loss). If +% failures indicated "fundamental flow-control problems," they should +% get worse with more impairment, not better. The pattern suggests +% iPerf3 or tc timing issues rather than VPN limitations. Either +% explain the non-monotonic failures or weaken this conclusion. The failure rate of this benchmark under impairment makes it more useful as a robustness indicator than a throughput measurement. 
A VPN that cannot complete a 30-second UDP flood under 0.25\% packet loss @@ -995,9 +1186,14 @@ workloads too, even if the symptoms are milder. \subsection{Parallel TCP} -Table~\ref{tab:parallel_impairment} shows aggregate throughput across -three concurrent bidirectional links (six unidirectional flows). The -Headscale anomaly from the single-stream results is amplified here. +% TODO: DOWNSTREAM DEPENDENCY — "six unidirectional flows" must match +% the baseline parallel test description. The baseline section has an +% unresolved TODO about whether the test uses 6 or 10 streams. If the +% baseline is corrected to 10, this section must also be updated. +The Headscale anomaly from single-stream TCP grows larger under +parallel load. Table~\ref{tab:parallel_impairment} shows aggregate +throughput across three concurrent bidirectional links (six +unidirectional flows). \begin{table}[H] \centering @@ -1025,30 +1221,40 @@ Headscale anomaly from the single-stream results is amplified here. \end{tabular} \end{table} -% PLOT: heatmap -% File: Figures/impairment/Parallel TCP Throughput Heatmap.png -% Data: Parallel TCP throughput for all 11 VPNs at baseline, low, -% medium, high -% Show: Headscale dominating at low impairment (718 Mbps vs Internal's -% 277); EasyTier as runner-up (473 Mbps); Hyprspace's collapse -% to 2.87 Mbps - +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/Parallel TCP Throughput Heatmap}.png} + \caption{Parallel TCP throughput across impairment profiles. + Headscale dominates at Low (718\,Mbps vs.\ Internal's 277); + EasyTier is the runner-up (473\,Mbps); Hyprspace collapses to + 2.87\,Mbps.} + \label{fig:parallel_impairment_heatmap} +\end{figure} Headscale at Low impairment: 718\,Mbps --- 2.6$\times$ Internal (277\,Mbps) and 4.1$\times$ WireGuard (173\,Mbps). At Medium, Headscale (113\,Mbps) still leads Internal (82.6\,Mbps) by 37\%. -The single-stream anomaly from -Section~\ref{sec:tailscale_degraded} compounds when multiple flows -each independently benefit from Headscale's congestion control -tuning. +Whatever mechanism produces the single-stream crossover at Medium +scales with the number of flows: six independent streams each +benefit from it. +% TODO: EasyTier's resilience (473 Mbps at Low, 51% retention) is the +% second-best result after Headscale, yet receives no architectural +% explanation. Headscale gets an entire subsection attributing its +% resilience to gVisor TCP tuning. Either explain what gives EasyTier +% its resilience (e.g., its own TCP stack, congestion control, FEC) +% or acknowledge the gap explicitly. EasyTier is the second-most resilient VPN under parallel load, at 473\,Mbps at Low (51\% of baseline). Both EasyTier and Headscale retain more than half their baseline parallel throughput at Low impairment; no other VPN exceeds 30\%. Hyprspace collapses from 803\,Mbps to 2.87\,Mbps at Low, a 99.6\% -loss. The buffer bloat that plagues single-stream transfers +loss. % TODO: DOWNSTREAM DEPENDENCY — This references the buffer bloat diagnosis +% from Section hyprspace_bloat, which depends on the unverified 2,800 ms +% under-load latency. If that diagnosis is revised, this explanation +% for parallel collapse must also be revisited. +The buffer bloat that plagues single-stream transfers (Section~\ref{sec:hyprspace_bloat}) becomes catastrophic when six concurrent flows compete for the same bloated buffers. @@ -1064,30 +1270,31 @@ impairment profiles. 
Yggdrasil's QUIC bandwidth drops from 745\,Mbps at baseline to 7.67\,Mbps at Low, 3.45\,Mbps at Medium, and 2.17\,Mbps at High --- -the same cliff observed in its TCP results, driven by the same +the same cliff observed in its TCP results, again driven by jumbo-MTU amplification of outer-layer packet loss. At High impairment, WireGuard (23.2\,Mbps), VpnCloud (23.4\,Mbps), ZeroTier (23.0\,Mbps), and Tinc (23.4\,Mbps) converge to within -0.4\,Mbps of each other. At baseline these four span a 500\,Mbps -range. QUIC's own congestion control, operating atop the +0.4\,Mbps of each other. At baseline these four span a 188\,Mbps +range (844 to 656\,Mbps). QUIC's own congestion control, operating atop the already-degraded outer link, becomes the sole limiter. -% PLOT: heatmap -% File: Figures/impairment/QUIC Bandwidth Heatmap.png -% Data: QPerf QUIC bandwidth for VPNs with data at all four profiles -% (WireGuard, VpnCloud, ZeroTier, Tinc, Yggdrasil, Internal) -% Show: Yggdrasil's cliff from baseline to low; convergence of -% WireGuard, VpnCloud, ZeroTier, Tinc at high (~23 Mbps) - +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/QUIC Bandwidth Heatmap}.png} + \caption{QUIC bandwidth across impairment profiles. Yggdrasil + drops from 745 to 8\,Mbps at Low; WireGuard, VpnCloud, ZeroTier, + and Tinc converge to ${\sim}$23\,Mbps at High. Headscale and + Nebula fail at all profiles ($\times$).} + \label{fig:quic_impairment_heatmap} +\end{figure} \subsection{Video Streaming} -Table~\ref{tab:rist_impairment} presents RIST video quality scores -across profiles. The actual encoding bitrate of ${\sim}$3.3\,Mbps -sits well within every VPN's throughput budget even at High -impairment, so quality differences reflect packet delivery reliability -rather than bandwidth limits. +At ${\sim}$3.3\,Mbps, the RIST video stream sits within every VPN's +throughput budget even at High impairment. Quality differences in +Table~\ref{tab:rist_impairment} therefore reflect packet delivery +reliability, not bandwidth. \begin{table}[H] \centering @@ -1114,41 +1321,50 @@ rather than bandwidth limits. \end{tabular} \end{table} -% PLOT: heatmap -% File: Figures/impairment/Video Streaming Quality Heatmap.png -% Data: RIST quality for all 11 VPNs at baseline, low, medium, high -% Show: Headscale stuck at 13% (red row); Mycelium stuck near 100% -% (green row); gradual degradation for the bulk; Yggdrasil's -% steep decline to 43% - +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/Video Streaming Quality Heatmap}.png} + \caption{RIST video streaming quality across impairment profiles. + Headscale is stuck at ${\sim}$13\% regardless of profile; + Mycelium maintains ${\sim}$100\% even at High; Yggdrasil + declines steeply to 43\%.} + \label{fig:rist_impairment_heatmap} +\end{figure} Headscale stays at ${\sim}$13\% across all four profiles: 13.1\%, 13.0\%, 13.0\%, 13.0\%. The profile-independence confirms the baseline diagnosis from Section~\ref{sec:baseline}. The failure is +% TODO: DOWNSTREAM DEPENDENCY — This repeats the DERP/MTU hypothesis from +% Section baseline as though it were established. The baseline TODO notes +% this hypothesis is unverified (no packet capture evidence). Do not +% present it as a confirmed diagnosis here without resolving the upstream TODO. structural --- likely MTU fragmentation in the DERP relay layer --- and cannot worsen because it is already saturated. 
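+
+Rough arithmetic shows why fragmentation alone could produce a floor
+of this kind. Assuming the stream uses the common packing of seven
+188-byte MPEG-TS packets per RTP datagram (an assumption; the encoder
+settings are not recorded here), each datagram carries
+\[
+  7 \times 188 + 12 = 1\,328\,\text{bytes}
+\]
+of UDP payload, which exceeds Headscale's 1\,208-byte limit, so every
+datagram would split into two fragments, and losing either fragment
+discards the whole datagram. That roughly doubles the effective loss
+rate and yields steady, uniform drops; without a packet capture,
+however, this remains arithmetic rather than evidence.
+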
Adding latency or loss on top of an 87\% packet drop floor changes nothing. Mycelium delivers 99.9\% quality even at High impairment, better than Internal (80.2\%) and every other VPN. At 3.3\,Mbps, even -Mycelium's degraded overlay paths can sustain the stream. Its -overlay retransmission mechanism, which cripples bulk TCP transfers, -works well for steady low-bandwidth UDP flows. RIST's own forward -error correction handles whatever Mycelium's retransmissions miss. +Mycelium's degraded overlay paths can sustain the stream. The same +overlay routing that adds 34.9\,ms of latency and cripples bulk TCP +transfers is harmless at video bitrates. RIST's own forward error +correction compensates for whatever packet loss remains. +% TODO: The claim that jumbo MTU causes burst losses that overwhelm +% FEC is a hypothesis. No FEC analysis or packet-level evidence is +% shown. Consider adding packet capture data or softening the claim. Yggdrasil degrades the most steeply: 100\% at baseline, 94.7\% at Low, 71.4\% at Medium, 43.3\% at High. The jumbo MTU that hurt TCP -throughput also hurts here --- large overlay packets carrying RIST -data are more likely to be lost or reordered at the outer layer, and -RIST's FEC cannot recover from the resulting burst losses. +throughput likely hurts here too --- large overlay packets carrying +RIST data are more likely to be lost or reordered at the outer layer, +and RIST's FEC may not recover from the resulting burst losses. \subsection{Application-Level Download} -Table~\ref{tab:nix_impairment} shows Nix binary cache download times -across profiles. This HTTP-heavy workload, dominated by many -short-lived TCP connections, is more sensitive to per-connection -latency than to raw bandwidth. - +The Nix binary cache download is the most demanding application-level +benchmark: hundreds of sequential HTTP connections amplify +per-connection latency penalties that bulk throughput tests amortize. +Table~\ref{tab:nix_impairment} shows download times across profiles. + \begin{table}[H] \centering \caption{Nix binary cache download time (seconds) across impairment @@ -1175,20 +1391,21 @@ latency than to raw bandwidth. \end{tabular} \end{table} -% PLOT: heatmap -% File: Figures/impairment/Nix Cache Download Time Heatmap.png -% Data: Nix cache download time for all VPNs at baseline, low, medium, -% high (hatched/absent bars for failures) -% Show: Headscale as the only VPN completing all four profiles; -% Headscale beating Internal at medium (48.8 vs 58.6 s); -% Yggdrasil's 22x slowdown at low impairment +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{{Figures/impairment/Nix Cache Download Time Heatmap}.png} + \caption{Nix binary cache download time across impairment profiles. + Headscale, Nebula, and Tinc complete all four profiles; Headscale + beats Internal at Medium (49\,s vs.\ 59\,s). Yggdrasil's + Low-profile time explodes to 230\,s ($\times$ marks a failed run).} + \label{fig:nix_impairment_heatmap} +\end{figure} - -Headscale is the only VPN to complete all four profiles. At Medium -impairment, it finishes in 48.8~seconds --- faster than Internal's -58.6~seconds. Internal itself fails at High impairment while -Headscale completes in 219~seconds. Only Nebula (547\,s) and Tinc -(496\,s) also survive High impairment. +Headscale, Nebula, and Tinc are the only VPNs to complete all four +profiles. At Medium impairment, Headscale finishes in 48.8~seconds +--- faster than Internal's 58.6~seconds. 
Internal itself fails at +High impairment while Headscale completes in 219~seconds, Tinc in +496~seconds, and Nebula in 547~seconds. Yggdrasil's download time explodes from 10.6\,s to 230\,s at Low impairment, a 22$\times$ slowdown. Every HTTP request incurs the @@ -1198,10 +1415,14 @@ retransmissions. Mycelium also degrades severely (10.1\,s to overhead, which compounds over hundreds of sequential HTTP connections. -The failure map reveals a clean gradient: more demanding profiles -knock out more VPNs. At Low, 10 of 11 complete (Hyprspace fails). -At Medium, 9 complete. At High, only 3 survive (Headscale, Nebula, -Tinc). Internal's failure at High is the most surprising --- the +% TODO: Hyprspace fails at Low but completes at Medium (170 s). +% This contradicts the "clean gradient" claim. Explain why a VPN +% can fail at Low but succeed at Medium, or note the anomaly. +The failure map reveals a mostly clean gradient: more demanding +profiles knock out more VPNs. At Low, 10 of 11 complete (Hyprspace +fails). At Medium, 9 complete (though Hyprspace, which failed at +Low, completes at 170\,s). At High, only 3 survive (Headscale, +Nebula, Tinc). Internal's failure at High is the most surprising --- the bare-metal baseline cannot sustain a multi-connection HTTP workload under severe degradation, but Headscale, shielded by its userspace TCP stack, can. Section~\ref{sec:tailscale_degraded} explains why. @@ -1239,14 +1460,15 @@ Table~\ref{tab:headscale_anomaly} summarizes the comparison. \end{tabular} \end{table} -% TODO: Needs to be created, use the tools/ folder -% PLOT: line chart -% File: Figures/impairment/headscale-vs-internal-across-profiles.png -% Data: Single-stream TCP throughput for Internal, Headscale, and -% WireGuard across all four profiles -% Show: Headscale crossing above Internal at medium impairment; -% WireGuard far below both; convergence at high -% Y-axis: log scale +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{Figures/impairment/headscale-vs-internal-across-profiles.png} + \caption{Single-stream TCP throughput for Internal, Headscale, and + WireGuard across impairment profiles (log scale). Headscale + crosses above Internal at Medium impairment; WireGuard stays far + below both; all three converge at High.} + \label{fig:headscale_vs_internal} +\end{figure} In parallel TCP at Low impairment, Headscale reaches 718\,Mbps vs.\ Internal's 277\,Mbps (2.6$\times$). The Nix cache download at @@ -1259,12 +1481,16 @@ such advantage: 54.7\,Mbps at Low, 8.77\,Mbps at Medium. Whatever protects Headscale is not the encryption or the tunnel --- it is something in Tailscale's userspace networking stack. +% TODO: The Medium-impairment retransmit percentages (5.2\%, +% 2.4\%) are not in any table or figure. Add a retransmit rate +% table for impaired profiles or reference the data source. The retransmit data provides the first clue. At Medium impairment, -Headscale's retransmit percentage is approximately 2.4\%, matching -Internal's ${\sim}$2.4\%. WireGuard's is 5.2\%. Headscale achieves -Internal's retransmit efficiency while delivering higher throughput ---- fewer spurious retransmissions leave more bandwidth for actual -data. +WireGuard's retransmit rate is 5.2\% --- more than double Internal's +${\sim}$2.4\%. Headscale, despite being a VPN, matches Internal at +${\sim}$2.4\%. 
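+
+The gap is consistent with duplicate-ACK counting misreading
+reordering as loss. A toy receiver model makes the mechanism concrete
+(a sketch only; real stacks layer SACK and RACK on top of the plain
+duplicate-ACK threshold):
+
+\begin{verbatim}
+import random
+
+def spurious_fast_retransmits(n, p_reorder, displacement, dupthresh, seed=0):
+    """No segment is ever lost; with probability p_reorder a segment is
+    held back by `displacement` positions. Every out-of-order arrival
+    produces a duplicate cumulative ACK, and `dupthresh` duplicates in a
+    row trigger a (spurious) fast retransmit."""
+    rng = random.Random(seed)
+    pending, arrivals = [], []
+    for i in range(n):                       # build a reordered arrival order
+        arrivals += [s for due, s in pending if due <= i]
+        pending = [(due, s) for due, s in pending if due > i]
+        if rng.random() < p_reorder:
+            pending.append((i + displacement, i))   # hold this segment back
+        else:
+            arrivals.append(i)
+    arrivals += [s for _, s in sorted(pending)]
+
+    expected, buffered, dupacks, spurious = 0, set(), 0, 0
+    for seg in arrivals:
+        if seg == expected:
+            expected += 1
+            while expected in buffered:      # gap closed: cumulative ACK advances
+                buffered.discard(expected)
+                expected += 1
+            dupacks = 0
+        else:
+            buffered.add(seg)
+            dupacks += 1                     # duplicate ACK for `expected`
+            if dupacks == dupthresh:
+                spurious += 1                # "recovers" a segment that was only late
+    return spurious
+
+# ~2.5% reordering per machine (the Medium profile), 100k segments, no loss:
+print(spurious_fast_retransmits(100_000, 0.025, 6, dupthresh=3))    # thousands
+print(spurious_fast_retransmits(100_000, 0.025, 6, dupthresh=10))   # near zero
+\end{verbatim}
+
+Every one of those retransmits would show up in iPerf3's counter and
+cost a congestion-window reduction, even though nothing was lost.
+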
WireGuard uses the host kernel's TCP stack, which +treats reordered packets as losses and fires spurious retransmits; +Headscale's gVisor stack tolerates more reordering, so fewer +retransmissions are wasted on packets that were merely delayed. \subsection{Congestion Control Analysis} @@ -1292,19 +1518,29 @@ matter under packet reordering: already impaired. \end{itemize} -The combined effect: under network conditions with packet reordering, -the default Linux TCP stack fires retransmits and cuts the congestion -window far more often than necessary. Each false positive shrinks the -window and reduces throughput. Tailscale's gVisor stack tolerates -more reordering before reacting, so its congestion window stays larger -and throughput stays higher. +Under packet reordering, these three defaults compound. The Linux +TCP stack fires retransmits and cuts the congestion window far more +often than necessary; each false positive shrinks the window and +reduces throughput. Tailscale's gVisor stack tolerates more +reordering before reacting, so its congestion window stays larger and +throughput stays higher. -This explains why the anomaly grows with impairment severity. At +% TODO: The claim that the anomaly "grows with impairment severity" is +% not fully supported. At High impairment, Headscale (4.21 Mbps) and +% Internal (4.25 Mbps) converge --- the anomaly vanishes rather than +% growing. The logic predicts continued divergence at High reordering +% (5% per machine), but the data shows both become loss-limited. +% Rephrase to say the anomaly emerges at Medium but disappears at High +% when absolute loss dominates. +This explains why the anomaly emerges as impairment increases. At baseline, there is no reordering, so the threshold difference is irrelevant and Internal's kernel-level processing advantage dominates. As reordering increases from 0.5\% (Low) to 2.5\% (Medium) per machine, the kernel's aggressive loss detection fires more often, and -the throughput gap shifts in Headscale's favor. +the throughput gap shifts in Headscale's favor. At High impairment, +however, both converge to ${\sim}$4.2\,Mbps: the absolute packet loss +rate becomes the dominant bottleneck, overriding the reordering +tolerance advantage. \subsection{Tuned Kernel Parameters} @@ -1351,67 +1587,99 @@ and hardware as the original 18.12.2025 run. \end{table} -% PLOT: grouped bar chart -% File: Figures/impairment/no_vpn_kernel_tuning_comparison.png -% Data: Internal single-stream TCP at baseline/low/medium across -% original, full gVisor, and reorder-only configurations -% Show: Dramatic jump at medium (29.6 -> 64.2 -> 72.7 Mbps); -% baseline unchanged; modest improvement at low -% Y-axis: linear scale +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{Figures/impairment/no_vpn_kernel_tuning_comparison.png} + \caption{Internal (no VPN) single-stream TCP throughput across three + kernel configurations. Baseline is unchanged; at Medium + impairment, throughput jumps from 30 to 64 to 73\,Mbps as + reordering tolerance increases.} + \label{fig:kernel_tuning_comparison} +\end{figure} Internal's Medium-impairment throughput jumps from 29.6 to 72.7\,Mbps --- a 146\% increase from a three-line sysctl change. The -retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; most of the -original retransmissions were spurious. The Nix cache download at +retransmit percentage drops from ${\sim}$2.4\% to 1.11\%; over half +of the original retransmissions were spurious. 
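+
+For reference, the three-line change amounts to sysctls along the
+following lines. The parameter names are the ones discussed in this
+section; the values shown are illustrative placeholders, not the
+measured configuration.
+
+\begin{verbatim}
+import subprocess
+
+# Reorder-tolerant overrides of the Linux defaults (3 / 1 / 3).
+# Placeholder values, for illustration only.
+REORDER_ONLY = {
+    "net.ipv4.tcp_reordering":    "30",  # default 3: initial dup-ACK threshold
+    "net.ipv4.tcp_recovery":      "0",   # default 1: RACK loss detection enabled
+    "net.ipv4.tcp_early_retrans": "0",   # default 3: early retransmit enabled
+}
+
+for key, value in REORDER_ONLY.items():
+    subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)
+\end{verbatim}
+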
The Nix cache download at Medium halves from 58.6\,s to 29.1\,s. Parallel TCP sees an even larger gain. Internal at Low impairment climbs from 277 to 902\,Mbps, a 226\% increase that now exceeds -Headscale's original 718\,Mbps. With six concurrent flows each +Headscale's original 718\,Mbps. % TODO: DOWNSTREAM DEPENDENCY — "six concurrent flows" inherits the +% unresolved 6-vs-10 stream count from the baseline parallel test +% description. Update when that TODO is resolved. +With six concurrent flows each independently benefiting from the higher reordering threshold, the aggregate improvement compounds. -The anomaly reverses. At every impairment level and benchmark, tuned -Internal now meets or exceeds Headscale. At Medium impairment: +% TODO: Headscale's tuned-run values (50.1 Mbps, 36.3 s) are not in +% any table. Add a table showing Headscale's results from the +% follow-up runs alongside Internal's so readers can verify the +% reversal. +% TODO: "At every impairment level and benchmark" is a strong claim +% but only single-stream TCP at Medium and Nix cache at Medium are +% shown with both Internal and Headscale values. The Headscale tuned +% data is not in any table (see TODO above). Either add the full +% comparison table or weaken to "at the metrics shown." +The anomaly reverses. At the measured impairment levels and benchmarks, +tuned Internal now meets or exceeds Headscale. At Medium impairment: Internal 72.7\,Mbps vs.\ Headscale 50.1\,Mbps (Internal 45\% ahead), where the original result had Headscale 40\% ahead. The Nix cache flips too: Internal completes in 29.1\,s vs.\ Headscale's 36.3\,s, where the original had Headscale 17\% faster. -% PLOT: before/after comparison -% File: Figures/impairment/headscale-gap-reversal.png -% Data: Internal vs Headscale throughput ratio at each impairment -% level, original vs tuned (reorder-only) -% Show: The crossover from "Headscale wins" (ratio < 1) to "Internal -% wins" (ratio > 1) at medium impairment after tuning -% Y-axis: ratio (Internal / Headscale), 1.0 as break-even +\begin{figure}[H] + \centering + \includegraphics[width=\textwidth]{Figures/impairment/headscale-gap-reversal.png} + \caption{Internal-to-Headscale speed-up factor before and after + kernel tuning. Values above 1.0 mean Internal is faster. At + Medium impairment, the ratio flips from 0.71$\times$ (Headscale + ahead) to 1.45$\times$ (Internal ahead).} + \label{fig:headscale_gap_reversal} +\end{figure} The reorder-only configuration (06.03) matches or exceeds the full gVisor configuration (27.02) at most metrics; the two exceptions are single-stream TCP at Low (354 vs.\ 363\,Mbps) and parallel TCP at Medium (211 vs.\ 226\,Mbps), both within 7\%. Internal reaches 72.7\,Mbps at Medium with reorder-only vs.\ 64.2\,Mbps with -full gVisor. The enlarged buffer sizes are unnecessary and may +full gVisor. % TODO: The "mild buffer bloat" explanation for full-gVisor being +% slightly slower than reorder-only is speculative. The difference +% (64.2 vs 72.7 Mbps) could be within run-to-run variance. Either +% test with more runs or present this as one possible explanation. +The enlarged buffer sizes appear unnecessary and may introduce mild buffer bloat that partially offsets the reordering -benefit. The entire Headscale advantage is explained by three kernel +benefit, though the difference could also reflect normal run-to-run +variance. 
The entire Headscale advantage is explained by three kernel parameters: \texttt{tcp\_reordering}, \texttt{tcp\_recovery}, and \texttt{tcp\_early\_retrans}. +% TODO: WireGuard (12.2 Mbps), Tinc (11.5 Mbps), and ZeroTier +% (11.5 Mbps) tuned values are not in any table. Add them to +% Table~\ref{tab:kernel_tuning_internal} or a new table. Other VPNs benefit less from the kernel tuning. WireGuard's Medium throughput rises from 8.77 to 12.2\,Mbps (+39\%) and Tinc's from -5.53 to 11.5\,Mbps (+108\%). ZeroTier shows no change (12.0 to +5.53 to 11.5\,Mbps (+108\%). ZeroTier stays flat (12.0 to 11.5\,Mbps). The tuning helps the kernel TCP stack, but VPNs that add their own encapsulation overhead and userspace processing have independent bottlenecks that the sysctl parameters cannot remove. +% TODO: Headscale tuned-run percentages (+21\%, $-$5\%) are not in +% any table. Also, the "compound delays" hypothesis is speculative +% --- no evidence is shown that double reordering tolerance causes +% compound delays. Either verify experimentally or weaken the claim. Headscale itself gets modestly faster with kernel tuning (+21\% at Medium) but slightly slower at Low impairment ($-$5\%). Its userspace gVisor stack already optimizes for reordering tolerance. When the kernel stack also increases its tolerance, the two layers of tuning may interact suboptimally --- both independently delay -retransmits, which can cause compound delays on the +retransmits, which could cause compound delays on the kernel-to-Headscale socket path. +% TODO: These sections are empty stubs but the chapter introduction +% (line 12--13) promises "findings from the source code analysis." +% Either write these sections or remove the promise from the intro. + \section{Source Code Analysis} \subsection{Feature Matrix Overview} diff --git a/Figures/impairment/Nix Cache Download Time Heatmap.png b/Figures/impairment/Nix Cache Download Time Heatmap.png index 19fdefc..8f5e5af 100644 Binary files a/Figures/impairment/Nix Cache Download Time Heatmap.png and b/Figures/impairment/Nix Cache Download Time Heatmap.png differ diff --git a/Figures/impairment/Parallel TCP Throughput Heatmap.png b/Figures/impairment/Parallel TCP Throughput Heatmap.png index 3d8ffce..04e42ae 100644 Binary files a/Figures/impairment/Parallel TCP Throughput Heatmap.png and b/Figures/impairment/Parallel TCP Throughput Heatmap.png differ diff --git a/Figures/impairment/QUIC Bandwidth Heatmap.png b/Figures/impairment/QUIC Bandwidth Heatmap.png index 3eeebee..c17e724 100644 Binary files a/Figures/impairment/QUIC Bandwidth Heatmap.png and b/Figures/impairment/QUIC Bandwidth Heatmap.png differ diff --git a/Figures/impairment/TCP Throughput Heatmap.png b/Figures/impairment/TCP Throughput Heatmap.png index f1a1291..4fbd1b1 100644 Binary files a/Figures/impairment/TCP Throughput Heatmap.png and b/Figures/impairment/TCP Throughput Heatmap.png differ diff --git a/Figures/impairment/UDP Receiver Throughput Heatmap.png b/Figures/impairment/UDP Receiver Throughput Heatmap.png index ea3dc7b..b483d60 100644 Binary files a/Figures/impairment/UDP Receiver Throughput Heatmap.png and b/Figures/impairment/UDP Receiver Throughput Heatmap.png differ diff --git a/Figures/impairment/Video Streaming Quality Heatmap.png b/Figures/impairment/Video Streaming Quality Heatmap.png index 3560b08..6a0ca50 100644 Binary files a/Figures/impairment/Video Streaming Quality Heatmap.png and b/Figures/impairment/Video Streaming Quality Heatmap.png differ