884 lines
35 KiB
TeX
884 lines
35 KiB
TeX
% Chapter Template
|
|
|
|
\chapter{Results} % Main chapter title
|
|
|
|
\label{Results}
|
|
|
|
This chapter presents the results of the benchmark suite across all
|
|
ten VPN implementations and the internal baseline. The structure
|
|
follows the impairment profiles from ideal to degraded:
|
|
Section~\ref{sec:baseline} establishes overhead under ideal
|
|
conditions, then subsequent sections examine how each VPN responds to
|
|
increasing network impairment. The chapter concludes with findings
|
|
from the source code analysis. A recurring theme throughout is that
|
|
no single metric captures VPN performance; the rankings shift
|
|
depending on whether one measures throughput, latency, retransmit
|
|
behavior, or real-world application performance.
|
|
|
|
\section{Baseline Performance}
|
|
\label{sec:baseline}
|
|
|
|
The baseline impairment profile introduces no artificial loss or
|
|
reordering, so any performance gap between VPNs can be attributed to
|
|
the VPN itself. Throughout the plots in this section, the
|
|
\emph{internal} bar marks a direct host-to-host connection with no VPN
|
|
in the path; it represents the best the hardware can do. On its own,
|
|
this link delivers 934\,Mbps on a single TCP stream and a round-trip
|
|
latency of just
|
|
0.60\,ms. WireGuard comes remarkably close to these numbers, reaching
|
|
92.5\,\% of bare-metal throughput with only a single retransmit across
|
|
an entire 30-second test. Mycelium sits at the other extreme, adding
|
|
34.9\,ms of latency, roughly 58$\times$ the bare-metal figure.
|
|
|
|
\subsection{Test Execution Overview}
|
|
|
|
Running the full baseline suite across all ten VPNs and the internal
|
|
reference took just over four hours. The bulk of that time, about
|
|
2.6~hours (63\,\%), was spent on actual benchmark execution; VPN
|
|
installation and deployment accounted for another 45~minutes (19\,\%),
|
|
and roughly 21~minutes (9\,\%) went to waiting for VPN tunnels to come
|
|
up after restarts. The remaining time was consumed by VPN service restarts
|
|
and traffic-control (tc) stabilization.
|
|
Figure~\ref{fig:test_duration} breaks this down per VPN.
|
|
|
|
Most VPNs completed every benchmark without issues, but four failed
|
|
one test each: Nebula and Headscale timed out on the qperf
|
|
QUIC performance benchmark after six retries, while Hyprspace and
|
|
Mycelium failed the UDP iPerf3 test
|
|
with a 120-second timeout. Their individual success rate is
|
|
85.7\,\%, with all other VPNs passing the full suite
|
|
(Figure~\ref{fig:success_rate}).
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{1.0\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/Average Test
|
|
Duration per Machine}.png}
|
|
\caption{Average test duration per VPN, including installation
|
|
time and benchmark execution}
|
|
\label{fig:test_duration}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{1.0\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/Benchmark
|
|
Success Rate}.png}
|
|
\caption{Benchmark success rate across all seven tests}
|
|
\label{fig:success_rate}
|
|
\end{subfigure}
|
|
\caption{Test execution overview. Hyprspace has the longest average
|
|
duration due to UDP timeouts and long VPN connectivity
|
|
waits. WireGuard completes fastest. Nebula, Headscale,
|
|
Hyprspace, and Mycelium each fail one benchmark.}
|
|
\label{fig:test_overview}
|
|
\end{figure}
|
|
|
|
\subsection{TCP Throughput}
|
|
|
|
Each VPN ran a single-stream iPerf3 session for 30~seconds on every
|
|
link direction (lom$\rightarrow$yuki, yuki$\rightarrow$luna,
|
|
luna$\rightarrow$lom); Table~\ref{tab:tcp_baseline} shows the
|
|
averages. Three distinct performance tiers emerge, separated by
|
|
natural gaps in the data.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Single-stream TCP throughput at baseline, sorted by
|
|
throughput. Retransmits are averaged per 30-second test across
|
|
all three link directions. The horizontal rules separate the
|
|
three performance tiers.}
|
|
\label{tab:tcp_baseline}
|
|
\begin{tabular}{lrrr}
|
|
\hline
|
|
\textbf{VPN} & \textbf{Throughput (Mbps)} &
|
|
\textbf{Baseline (\%)} & \textbf{Retransmits} \\
|
|
\hline
|
|
Internal & 934 & 100.0 & 1.7 \\
|
|
WireGuard & 864 & 92.5 & 1 \\
|
|
ZeroTier & 814 & 87.2 & 1163 \\
|
|
Headscale & 800 & 85.6 & 102 \\
|
|
Yggdrasil & 795 & 85.1 & 75 \\
|
|
\hline
|
|
Nebula & 706 & 75.6 & 955 \\
|
|
EasyTier & 636 & 68.1 & 537 \\
|
|
VpnCloud & 539 & 57.7 & 857 \\
|
|
\hline
|
|
Hyprspace & 368 & 39.4 & 4965 \\
|
|
Tinc & 336 & 36.0 & 240 \\
|
|
Mycelium & 259 & 27.7 & 710 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
The top tier ($>$80\,\% of baseline) groups WireGuard, ZeroTier,
|
|
Headscale, and Yggdrasil, all within 15\,\% of the bare-metal link.
|
|
A middle tier (55--80\,\%) follows with Nebula, EasyTier, and
|
|
VpnCloud, while Hyprspace, Tinc, and Mycelium occupy the bottom tier
|
|
at under 40\,\% of baseline.
|
|
Figure~\ref{fig:tcp_throughput} visualizes this hierarchy.
|
|
|
|
Raw throughput alone is incomplete, however. The retransmit column
|
|
reveals that not all high-throughput VPNs get there cleanly.
|
|
ZeroTier, for instance, reaches 814\,Mbps but accumulates
|
|
1\,163~retransmits per test, over 1\,000$\times$ what WireGuard
|
|
needs. ZeroTier compensates for tunnel-internal packet loss by
|
|
repeatedly triggering TCP congestion-control recovery, whereas
|
|
WireGuard sends data once and it arrives. Across all VPNs,
|
|
retransmit behaviour falls into three groups: \emph{clean} ($<$110:
|
|
WireGuard, Internal, Yggdrasil, Headscale), \emph{stressed}
|
|
(200--900: Tinc, EasyTier, Mycelium, VpnCloud), and
|
|
\emph{pathological} ($>$950: Nebula, ZeroTier, Hyprspace).
|
|
|
|
% TODO: Is this naming scheme any good?
|
|
|
|
% TODO: Fix TCP Throughput plot
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
|
|
Throughput}.png}
|
|
\caption{Average single-stream TCP throughput}
|
|
\label{fig:tcp_throughput}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/TCP
|
|
Retransmit Rate}.png}
|
|
\caption{Average TCP retransmits per 30-second test (log scale)}
|
|
\label{fig:tcp_retransmits}
|
|
\end{subfigure}
|
|
\caption{TCP throughput and retransmit rate at baseline. WireGuard
|
|
leads at 864\,Mbps with 1 retransmit. Hyprspace has nearly 5000
|
|
retransmits per test. The retransmit count does not always track
|
|
inversely with throughput: ZeroTier achieves high throughput
|
|
\emph{despite} high retransmits.}
|
|
\label{fig:tcp_results}
|
|
\end{figure}
|
|
|
|
Retransmits have a direct mechanical relationship with TCP congestion
|
|
control. Each retransmit triggers a reduction in the congestion window
|
|
(\texttt{cwnd}), throttling the sender. This relationship is visible
|
|
in Figure~\ref{fig:retransmit_correlations}: Hyprspace, with 4965
|
|
retransmits, maintains the smallest average congestion window in the
|
|
dataset (205\,KB), while Yggdrasil's 75 retransmits allow a 4.3\,MB
|
|
window, the largest of any VPN. At first glance this suggests a
|
|
clean inverse correlation between retransmits and congestion window
|
|
size, but the picture is misleading. Yggdrasil's outsized window is
|
|
largely an artifact of its jumbo overlay MTU (32\,731 bytes): each
|
|
segment carries far more data, so the window in bytes is inflated
|
|
relative to VPNs using a standard ${\sim}$1\,400-byte MTU. Comparing
|
|
congestion windows across different MTU sizes is not meaningful
|
|
without normalizing for segment size. What \emph{is} clear is that
|
|
high retransmit rates force TCP to spend more time in congestion
|
|
recovery than in steady-state transmission, capping throughput
|
|
regardless of available bandwidth. ZeroTier illustrates the
|
|
opposite extreme: brute-force retransmission can still yield high
|
|
throughput (814\,Mbps with 1\,163 retransmits), at the cost of wasted
|
|
bandwidth and unstable flow behavior.
|
|
|
|
VpnCloud warrants specific attention: its sender reports 538.8\,Mbps
|
|
but the receiver measures only 413.4\,Mbps, leaving a 23\,\% gap (the largest
|
|
in the dataset). This suggests significant in-tunnel packet loss or
|
|
buffering at the VpnCloud layer that the retransmit count (857)
|
|
alone does not fully explain.
|
|
|
|
Run-to-run variability also differs substantially. WireGuard ranges
|
|
from 824 to 884\,Mbps (a 60\,Mbps window), while Mycelium ranges
|
|
from 122 to 379\,Mbps, a 3:1 ratio between worst and best runs. A
|
|
VPN with wide variance is harder to capacity-plan around than one
|
|
with consistent performance, even if the average is lower.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-throughput.png}
|
|
\caption{Retransmits vs.\ throughput}
|
|
\label{fig:retransmit_throughput}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/retransmits-vs-max-congestion-window.png}
|
|
\caption{Retransmits vs.\ max congestion window}
|
|
\label{fig:retransmit_cwnd}
|
|
\end{subfigure}
|
|
\caption{Retransmit correlations (log scale on x-axis). High
|
|
retransmits do not always mean low throughput (ZeroTier: 1\,163
|
|
retransmits, 814\,Mbps), but extreme retransmits do (Hyprspace:
|
|
4\,965 retransmits, 368\,Mbps). The apparent inverse correlation
|
|
between retransmits and congestion window size is dominated by
|
|
Yggdrasil's outlier (4.3\,MB \texttt{cwnd}), which is inflated
|
|
by its 32\,KB jumbo overlay MTU rather than by low retransmits
|
|
alone.}
|
|
\label{fig:retransmit_correlations}
|
|
\end{figure}
|
|
|
|
\subsection{Latency}
|
|
|
|
Sorting by latency rearranges the rankings considerably.
|
|
Table~\ref{tab:latency_baseline} lists the average ping round-trip
|
|
times, which cluster into three distinct ranges.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Average ping RTT at baseline, sorted by latency}
|
|
\label{tab:latency_baseline}
|
|
\begin{tabular}{lr}
|
|
\hline
|
|
\textbf{VPN} & \textbf{Avg RTT (ms)} \\
|
|
\hline
|
|
Internal & 0.60 \\
|
|
VpnCloud & 1.13 \\
|
|
Tinc & 1.19 \\
|
|
WireGuard & 1.20 \\
|
|
Nebula & 1.25 \\
|
|
ZeroTier & 1.28 \\
|
|
EasyTier & 1.33 \\
|
|
\hline
|
|
Headscale & 1.64 \\
|
|
Hyprspace & 1.79 \\
|
|
Yggdrasil & 2.20 \\
|
|
\hline
|
|
Mycelium & 34.9 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
Six VPNs stay below 1.3\,ms, comfortably close to the bare-metal
|
|
0.60\,ms. VpnCloud is a notable result: it posts the lowest latency
|
|
of any VPN (1.13\,ms), edging out WireGuard (1.20\,ms), yet its
|
|
throughput tops out at only 539\,Mbps. Low per-packet latency does
|
|
not guarantee high bulk throughput. A second group (Headscale,
|
|
Hyprspace, Yggdrasil) lands in the 1.5--2.2\,ms range, representing
|
|
moderate overhead. Then there is Mycelium at 34.9\,ms, so far
|
|
removed from the rest that Section~\ref{sec:mycelium_routing} gives
|
|
it a dedicated analysis.
|
|
|
|
ZeroTier's average of 1.28\,ms looks unremarkable, but its maximum
|
|
RTT spikes to 8.6\,ms, a 6.8$\times$ jump and the largest for any
|
|
sub-2\,ms VPN. These spikes point to periodic control-plane
|
|
interference that the average hides.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/ping/Average RTT}.png}
|
|
\caption{Average ping RTT at baseline. Mycelium (34.9\,ms) is a
|
|
massive outlier at 58$\times$ the internal baseline. VpnCloud is
|
|
the fastest VPN at 1.13\,ms, slightly below WireGuard (1.20\,ms).}
|
|
\label{fig:ping_rtt}
|
|
\end{figure}
|
|
|
|
Tinc presents a paradox: it has the third-lowest latency (1.19\,ms)
|
|
but only the second-lowest throughput (336\,Mbps). Packets traverse
|
|
the tunnel quickly, yet single-threaded userspace processing cannot
|
|
keep up with the link speed. The qperf benchmark backs this up: Tinc
|
|
maxes out at
|
|
14.9\,\% CPU while delivering just 336\,Mbps, a clear sign that
|
|
the CPU, not the network, is the bottleneck.
|
|
Figure~\ref{fig:latency_throughput} makes this disconnect easy to
|
|
spot.
|
|
|
|
Looking at CPU efficiency more broadly, the qperf measurements
|
|
reveal a wide spread. Hyprspace (55.1\,\%) and Yggdrasil
|
|
(52.8\,\%) consume 5--6$\times$ as much CPU as Internal's
|
|
9.7\,\%. WireGuard sits at 30.8\,\%, surprisingly high for a
|
|
kernel-level implementation, though much of that goes to
|
|
cryptographic processing. On the efficient end, VpnCloud
|
|
(14.9\,\%), Tinc (14.9\,\%), and EasyTier (15.4\,\%) do the most
|
|
with the least CPU time. Nebula and Headscale are missing from
|
|
this comparison because qperf failed for both.
|
|
|
|
%TODO: Explain why they consistently failed
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/latency-vs-throughput.png}
|
|
\caption{Latency vs.\ throughput at baseline. Each point represents
|
|
one VPN. The quadrants reveal different bottleneck types:
|
|
VpnCloud (low latency, moderate throughput), Tinc (low latency,
|
|
low throughput, CPU-bound), Mycelium (high latency, low
|
|
throughput, overlay routing overhead).}
|
|
\label{fig:latency_throughput}
|
|
\end{figure}
|
|
|
|
\subsection{Parallel TCP Scaling}
|
|
|
|
The single-stream benchmark tests one link direction at a time. The
|
|
parallel benchmark changes this setup: all three link directions
|
|
(lom$\rightarrow$yuki, yuki$\rightarrow$luna,
|
|
luna$\rightarrow$lom) run simultaneously in a circular pattern for
|
|
60~seconds, each carrying ten TCP streams. Because three independent
|
|
link pairs now compete for shared tunnel resources at once, the
|
|
aggregate throughput is naturally higher than any single direction
|
|
alone, which is why even Internal reaches 1.50$\times$ its
|
|
single-stream figure. The scaling factor (parallel throughput
|
|
divided by single-stream throughput) therefore captures two effects:
|
|
the benefit of utilizing multiple link pairs in parallel, and how
|
|
well the VPN handles the resulting contention.
|
|
Table~\ref{tab:parallel_scaling} lists the results.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Parallel TCP scaling at baseline. Scaling factor is the
|
|
ratio of ten-stream to single-stream throughput. Internal's
|
|
1.50$\times$ represents the expected scaling on this hardware.}
|
|
\label{tab:parallel_scaling}
|
|
\begin{tabular}{lrrr}
|
|
\hline
|
|
\textbf{VPN} & \textbf{Single (Mbps)} &
|
|
\textbf{Parallel (Mbps)} & \textbf{Scaling} \\
|
|
\hline
|
|
Mycelium & 259 & 569 & 2.20$\times$ \\
|
|
Hyprspace & 368 & 803 & 2.18$\times$ \\
|
|
Tinc & 336 & 563 & 1.68$\times$ \\
|
|
Yggdrasil & 795 & 1265 & 1.59$\times$ \\
|
|
Headscale & 800 & 1228 & 1.54$\times$ \\
|
|
Internal & 934 & 1398 & 1.50$\times$ \\
|
|
ZeroTier & 814 & 1206 & 1.48$\times$ \\
|
|
WireGuard & 864 & 1281 & 1.48$\times$ \\
|
|
EasyTier & 636 & 927 & 1.46$\times$ \\
|
|
VpnCloud & 539 & 763 & 1.42$\times$ \\
|
|
Nebula & 706 & 648 & 0.92$\times$ \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
The VPNs that gain the most are those most constrained in
|
|
single-stream mode. Mycelium's 34.9\,ms RTT means a lone TCP stream
|
|
can never fill the pipe: the bandwidth-delay product demands a window
|
|
larger than any single flow maintains, so ten streams collectively
|
|
compensate for that constraint and push throughput to 2.20$\times$
|
|
the single-stream figure. Hyprspace scales almost as well
|
|
(2.18$\times$) but for a
|
|
different reason: multiple streams work around the buffer bloat that
|
|
cripples any individual flow
|
|
(Section~\ref{sec:hyprspace_bloat}). Tinc picks up a
|
|
1.68$\times$ boost because several streams can collectively keep its
|
|
single-threaded CPU busy during what would otherwise be idle gaps in
|
|
a single flow.
|
|
|
|
WireGuard and Internal both scale cleanly at around
|
|
1.48--1.50$\times$ with zero retransmits, suggesting that
|
|
WireGuard's overhead is a fixed per-packet cost that does not worsen
|
|
under multiplexing.
|
|
|
|
Nebula is the only VPN that actually gets \emph{slower} with more
|
|
streams: throughput drops from 706\,Mbps to 648\,Mbps
|
|
(0.92$\times$) while retransmits jump from 955 to 2\,462. The ten
|
|
streams are clearly fighting each other for resources inside the
|
|
tunnel.
|
|
|
|
More streams also amplify existing retransmit problems across the
|
|
board. Hyprspace climbs from 4\,965 to 17\,426~retransmits;
|
|
VpnCloud from 857 to 6\,023. VPNs that were clean in single-stream
|
|
mode stay clean under load, while the stressed ones only get worse.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/single-stream-vs-parallel-tcp-throughput.png}
|
|
\caption{Single-stream vs.\ parallel throughput}
|
|
\label{fig:single_vs_parallel}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/parallel-tcp-scaling-factor.png}
|
|
\caption{Parallel TCP scaling factor}
|
|
\label{fig:scaling_factor}
|
|
\end{subfigure}
|
|
\caption{Parallel TCP scaling at baseline. Nebula is the only VPN
|
|
where parallel throughput is lower than single-stream
|
|
(0.92$\times$). Mycelium and Hyprspace benefit most from
|
|
parallelism ($>$2$\times$), compensating for latency and buffer
|
|
bloat respectively. The dashed line at 1.0$\times$ marks the
|
|
break-even point.}
|
|
\label{fig:parallel_tcp}
|
|
\end{figure}
|
|
|
|
\subsection{UDP Stress Test}
|
|
|
|
The UDP iPerf3 test uses unlimited sender rate (\texttt{-b 0}),
|
|
which is a deliberate overload test rather than a realistic workload.
|
|
The sender throughput values are artifacts: they reflect how fast the
|
|
sender can write to the socket, not how fast data traverses the
|
|
tunnel. Yggdrasil, for example, reports 63,744\,Mbps sender
|
|
throughput because it uses a 32,731-byte block size (a jumbo-frame
|
|
overlay MTU), inflating the apparent rate per \texttt{send()} system
|
|
call. Only the receiver throughput is meaningful.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{UDP receiver throughput and packet loss at baseline
|
|
(\texttt{-b 0} stress test). Hyprspace and Mycelium timed out
|
|
at 120 seconds and are excluded.}
|
|
\label{tab:udp_baseline}
|
|
\begin{tabular}{lrr}
|
|
\hline
|
|
\textbf{VPN} & \textbf{Receiver (Mbps)} &
|
|
\textbf{Loss (\%)} \\
|
|
\hline
|
|
Internal & 952 & 0.0 \\
|
|
WireGuard & 898 & 0.0 \\
|
|
Nebula & 890 & 76.2 \\
|
|
Headscale & 876 & 69.8 \\
|
|
EasyTier & 865 & 78.3 \\
|
|
Yggdrasil & 852 & 98.7 \\
|
|
ZeroTier & 851 & 89.5 \\
|
|
VpnCloud & 773 & 83.7 \\
|
|
Tinc & 471 & 89.9 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
%TODO: Explain that the UDP test also crashes often,
|
|
% which makes the test somewhat unreliable
|
|
% but a good indicator if the network traffic is "different" then
|
|
% the programmer expected
|
|
|
|
Only Internal and WireGuard achieve 0\,\% packet loss. Both operate at
|
|
the kernel level with proper backpressure that matches sender to
|
|
receiver rate. Every userspace VPN shows massive loss (69--99\%)
|
|
because the sender overwhelms the tunnel's processing capacity.
|
|
Yggdrasil's 98.7\% loss is the most extreme: it sends the most data
|
|
(due to its large block size) but loses almost all of it. These loss
|
|
rates do not reflect real-world UDP behavior but reveal which VPNs
|
|
implement effective flow control. Hyprspace and Mycelium could not
|
|
complete the UDP test at all, timing out after 120 seconds.
|
|
|
|
The \texttt{blksize\_bytes} field reveals each VPN's effective path
|
|
MTU: Yggdrasil at 32,731 bytes (jumbo overlay), ZeroTier at 2728,
|
|
Internal at 1448, VpnCloud at 1375, WireGuard at 1368, Tinc at 1353,
|
|
EasyTier at 1288, Nebula at 1228, and Headscale at 1208 (the
|
|
smallest). These differences affect fragmentation behavior under real
|
|
workloads, particularly for protocols that send large datagrams.
|
|
|
|
%TODO: Mention QUIC
|
|
%TODO: Mention again that the "default" settings of every VPN have been used
|
|
% to better reflect real world use, as most users probably won't
|
|
% change these defaults
|
|
% and explain that good defaults are as much a part of good software as
|
|
% having the features but they are hard to configure correctly
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
|
|
Throughput}.png}
|
|
\caption{UDP receiver throughput}
|
|
\label{fig:udp_throughput}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/udp/UDP
|
|
Packet Loss}.png}
|
|
\caption{UDP packet loss}
|
|
\label{fig:udp_loss}
|
|
\end{subfigure}
|
|
\caption{UDP stress test results at baseline (\texttt{-b 0},
|
|
unlimited sender rate). Internal and WireGuard are the only
|
|
implementations with 0\% loss. Hyprspace and Mycelium are
|
|
excluded due to 120-second timeouts.}
|
|
\label{fig:udp_results}
|
|
\end{figure}
|
|
|
|
% TODO: Compare parallel TCP retransmit rate
|
|
% with single TCP retransmit rate and see what changed
|
|
|
|
\subsection{Real-World Workloads}
|
|
|
|
Saturating a link with iPerf3 measures peak capacity, but not how a
|
|
VPN performs under realistic traffic. This subsection switches to
|
|
application-level workloads: downloading packages from a Nix binary
|
|
cache and streaming video over RIST. Both interact with the VPN
|
|
tunnel the way real software does, through many short-lived
|
|
connections, TLS handshakes, and latency-sensitive UDP packets.
|
|
|
|
\paragraph{Nix Binary Cache Downloads.}
|
|
|
|
This test downloads a fixed set of Nix packages through each VPN and
|
|
measures the total transfer time. The results
|
|
(Table~\ref{tab:nix_cache}) compress the throughput hierarchy
|
|
considerably: even Hyprspace, the worst performer, finishes in
|
|
11.92\,s, only 40\,\% slower than bare metal. Once connection
|
|
setup, TLS handshakes, and HTTP round-trips enter the picture,
|
|
throughput differences between 500 and 900\,Mbps matter far less
|
|
than per-connection latency.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Nix binary cache download time at baseline, sorted by
|
|
duration. Overhead is relative to the internal baseline (8.53\,s).}
|
|
\label{tab:nix_cache}
|
|
\begin{tabular}{lrr}
|
|
\hline
|
|
\textbf{VPN} & \textbf{Mean (s)} &
|
|
\textbf{Overhead (\%)} \\
|
|
\hline
|
|
Internal & 8.53 & -- \\
|
|
Nebula & 9.15 & +7.3 \\
|
|
ZeroTier & 9.22 & +8.1 \\
|
|
VpnCloud & 9.39 & +10.0 \\
|
|
EasyTier & 9.39 & +10.1 \\
|
|
WireGuard & 9.45 & +10.8 \\
|
|
Headscale & 9.79 & +14.8 \\
|
|
Tinc & 10.00 & +17.2 \\
|
|
Mycelium & 10.07 & +18.1 \\
|
|
Yggdrasil & 10.59 & +24.2 \\
|
|
Hyprspace & 11.92 & +39.7 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
Several rankings invert relative to raw throughput. ZeroTier
|
|
finishes faster than WireGuard (9.22\,s vs.\ 9.45\,s) despite
|
|
30\,\% fewer raw Mbps and 1\,000$\times$ more retransmits. Yggdrasil
|
|
is the clearest example: it has the
|
|
third-highest throughput at 795\,Mbps, yet lands at 24\,\% overhead
|
|
because its
|
|
2.2\,ms latency adds up over the many small sequential HTTP requests
|
|
that constitute a Nix cache download.
|
|
Figure~\ref{fig:throughput_vs_download} confirms this weak link
|
|
between raw throughput and real-world download speed.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/Nix Cache
|
|
Mean Download Time}.png}
|
|
\caption{Nix cache download time per VPN}
|
|
\label{fig:nix_cache}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/raw-throughput-vs-nix-cache-download-time.png}
|
|
\caption{Raw throughput vs.\ download time}
|
|
\label{fig:throughput_vs_download}
|
|
\end{subfigure}
|
|
\caption{Application-level download performance. The throughput
|
|
hierarchy compresses under real HTTP workloads: the worst VPN
|
|
(Hyprspace, 11.92\,s) is only 40\% slower than bare metal.
|
|
Throughput explains some variance but not all: Yggdrasil
|
|
(795\,Mbps, 10.59\,s) is slower than Nebula (706\,Mbps, 9.15\,s)
|
|
because latency matters more for HTTP workloads.}
|
|
\label{fig:nix_download}
|
|
\end{figure}
|
|
|
|
\paragraph{Video Streaming (RIST).}
|
|
|
|
At just 3.3\,Mbps, the RIST video stream sits comfortably within
|
|
every VPN's throughput budget. This test therefore measures
|
|
something different: how well the VPN handles real-time UDP packet
|
|
delivery under steady load. Nine of the eleven VPNs pass without
|
|
incident, delivering 100\,\% video quality. The 14--16 dropped
|
|
frames that appear uniformly across all VPNs, including Internal,
|
|
trace back to encoder warm-up rather than tunnel overhead.
|
|
|
|
Headscale is the exception. It averages just 13.1\,\% quality,
|
|
dropping 288~packets per test interval. The degradation is not
|
|
bursty but sustained: median quality sits at 10\,\%, and the
|
|
interquartile range of dropped packets spans a narrow 255--330 band.
|
|
The qperf benchmark independently corroborates this, having failed
|
|
outright for Headscale, confirming that something beyond bulk TCP is
|
|
broken.
|
|
|
|
What makes this failure unexpected is that Headscale builds on
|
|
WireGuard, which handles video flawlessly. TCP throughput places
|
|
Headscale squarely in Tier~1. Yet the RIST test runs over UDP, and
|
|
qperf probes latency-sensitive paths using both TCP and UDP. The
|
|
pattern points toward Headscale's DERP relay or NAT traversal layer
|
|
as the source. Its effective path MTU of 1\,208~bytes, the smallest
|
|
of any VPN, likely compounds the issue: RIST packets that exceed
|
|
this limit must be fragmented, and reassembling fragments under
|
|
sustained load produces exactly the kind of steady, uniform packet
|
|
drops the data shows. For video conferencing, VoIP, or any
|
|
real-time media workload, this is a disqualifying result regardless
|
|
of TCP throughput.
|
|
|
|
Hyprspace reveals a different failure mode. Its average quality
|
|
reads 100\,\%, but the raw numbers underneath are far from stable:
|
|
mean packet drops of 1\,194 and a maximum spike of 55\,500, with
|
|
the 25th, 50th, and 75th percentiles all at zero. Hyprspace
|
|
alternates between perfect delivery and catastrophic bursts.
|
|
RIST's forward error correction compensates for most of these
|
|
events, but the worst spikes are severe enough to overwhelm FEC
|
|
entirely.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/Video
|
|
Streaming/RIST Quality}.png}
|
|
\caption{RIST video streaming quality at baseline. Headscale at
|
|
13.1\% average quality is the clear outlier. Every other VPN
|
|
achieves 99.8\% or higher. Nebula is at 99.8\% (minor
|
|
degradation). The video bitrate (3.3\,Mbps) is well within every
|
|
VPN's throughput capacity, so this test reveals real-time UDP
|
|
handling quality rather than bandwidth limits.}
|
|
\label{fig:rist_quality}
|
|
\end{figure}
|
|
|
|
\subsection{Operational Resilience}
|
|
|
|
Sustained-load performance does not predict recovery speed. How
|
|
quickly a tunnel comes up after a reboot, and how reliably it
|
|
reconverges, matters as much as peak throughput for operational use.
|
|
|
|
First-time connectivity spans a wide range. Headscale and WireGuard
|
|
are ready in under 50\,ms, while ZeroTier (8--17\,s) and VpnCloud
|
|
(10--14\,s) spend seconds negotiating with their control planes
|
|
before passing traffic.
|
|
|
|
%TODO: Maybe we want to scrap first-time connectivity
|
|
|
|
Reboot reconnection rearranges the rankings. Hyprspace, the worst
|
|
performer under sustained TCP load, recovers in just 8.7~seconds on
|
|
average, faster than any other VPN. WireGuard and Nebula follow at
|
|
10.1\,s each. Nebula's consistency is striking: 10.06, 10.06,
|
|
10.07\,s across its three nodes, pointing to a hard-coded timer
|
|
rather than topology-dependent convergence.
|
|
Mycelium sits at the opposite end, needing 76.6~seconds and showing
|
|
the same suspiciously uniform pattern (75.7, 75.7, 78.3\,s),
|
|
suggesting a fixed protocol-level wait built into the overlay.
|
|
|
|
%TODO: Hard coded timer needs to be verified
|
|
|
|
Yggdrasil produces the most lopsided result in the dataset: its yuki
|
|
node is back in 7.1~seconds while lom and luna take 94.8 and
|
|
97.3~seconds respectively. The gap likely reflects the overlay's
|
|
spanning-tree rebuild: a node near the root of the tree reconverges
|
|
quickly, while one further out has to wait for the topology to
|
|
propagate.
|
|
|
|
%TODO: Needs clarifications what is a "spanning tree build"
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-per-vpn.png}
|
|
\caption{Average reconnection time per VPN}
|
|
\label{fig:reboot_bar}
|
|
\end{subfigure}
|
|
|
|
\vspace{1em}
|
|
|
|
\begin{subfigure}[t]{\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/baseline/reboot-reconnection-time-heatmap.png}
|
|
\caption{Per-node reconnection time heatmap}
|
|
\label{fig:reboot_heatmap}
|
|
\end{subfigure}
|
|
\caption{Reboot reconnection time at baseline. The heatmap reveals
|
|
Yggdrasil's extreme per-node asymmetry (7\,s for yuki vs.\
|
|
95--97\,s for lom/luna) and Mycelium's uniform slowness (75--78\,s
|
|
across all nodes). Hyprspace reconnects fastest (8.7\,s average)
|
|
despite its poor sustained-load performance.}
|
|
\label{fig:reboot_reconnection}
|
|
\end{figure}
|
|
|
|
\subsection{Pathological Cases}
|
|
\label{sec:pathological}
|
|
|
|
Three VPNs exhibit behaviors that the aggregate numbers alone cannot
|
|
explain. The following subsections synthesize observations from the
|
|
preceding benchmarks into per-VPN diagnoses.
|
|
|
|
\paragraph{Hyprspace: Buffer Bloat.}
|
|
\label{sec:hyprspace_bloat}
|
|
|
|
Hyprspace produces the most severe performance collapse in the
|
|
dataset. At idle, its ping latency is a modest 1.79\,ms.
|
|
Under TCP load, that number balloons to roughly 2\,800\,ms, a
|
|
1\,556$\times$ increase. This is not the network becoming
|
|
congested; it is the VPN tunnel itself filling up with buffered
|
|
packets and refusing to drain.
|
|
|
|
The consequences ripple through every TCP metric. With 4\,965
|
|
retransmits per 30-second test (one in every 200~segments), TCP
|
|
spends most of its time in congestion recovery rather than
|
|
steady-state transfer, shrinking the average congestion window to
|
|
205\,KB, the smallest in the dataset. Under parallel load the
|
|
situation worsens: retransmits climb to 17\,426. The buffering even
|
|
inverts iPerf3's measurements: the receiver reports 419.8\,Mbps
|
|
while the sender sees only 367.9\,Mbps, because massive ACK delays
|
|
cause the sender-side timer to undercount the actual data rate. The
|
|
UDP test never finished at all, timing out at 120~seconds.
|
|
|
|
% Should we always use percentages for retransmits?
|
|
|
|
What prevents Hyprspace from being entirely unusable is everything
|
|
\emph{except} sustained load. It has the fastest reboot
|
|
reconnection in the dataset (8.7\,s) and delivers 100\,\% video
|
|
quality outside of its burst events. The pathology is narrow but
|
|
severe: any continuous data stream saturates the tunnel's internal
|
|
buffers.
|
|
|
|
\paragraph{Mycelium: Routing Anomaly.}
|
|
\label{sec:mycelium_routing}
|
|
|
|
Mycelium's 34.9\,ms average latency appears to be the cost of
|
|
routing through a global overlay. The per-path numbers, however,
|
|
reveal a bimodal distribution:
|
|
|
|
\begin{itemize}
|
|
\bitem{luna$\rightarrow$lom:} 1.63\,ms (direct path, comparable
|
|
to Headscale at 1.64\,ms)
|
|
\bitem{lom$\rightarrow$yuki:} 51.47\,ms (overlay-routed)
|
|
\bitem{yuki$\rightarrow$luna:} 51.60\,ms (overlay-routed)
|
|
\end{itemize}
|
|
|
|
One of the three links has found a direct route; the other two still
|
|
bounce through the overlay. All three machines sit on the same
|
|
physical network, so Mycelium's path discovery is failing
|
|
intermittently, a more specific problem than blanket overlay
|
|
overhead. Throughput mirrors the split:
|
|
yuki$\rightarrow$luna reaches 379\,Mbps while
|
|
luna$\rightarrow$lom manages only 122\,Mbps, a 3:1 gap. In
|
|
bidirectional mode, the reverse direction on that worst link drops
|
|
to 58.4\,Mbps, the lowest single-direction figure in the entire
|
|
dataset.
|
|
|
|
\begin{figure}[H]
|
|
\centering
|
|
\includegraphics[width=\textwidth]{{Figures/baseline/tcp/Mycelium/Average
|
|
Throughput}.png}
|
|
\caption{Per-link TCP throughput for Mycelium, showing extreme
|
|
path asymmetry caused by inconsistent direct route discovery.
|
|
The 3:1 ratio between best (yuki$\rightarrow$luna, 379\,Mbps)
|
|
and worst (luna$\rightarrow$lom, 122\,Mbps) links reflects
|
|
different overlay routing paths.}
|
|
\label{fig:mycelium_paths}
|
|
\end{figure}
|
|
|
|
The overlay penalty shows up most clearly at connection setup.
|
|
Mycelium's average time-to-first-byte is 93.7\,ms (vs.\ Internal's
|
|
16.8\,ms, a 5.6$\times$ overhead), and connection establishment
|
|
alone costs 47.3\,ms (3$\times$ overhead). Every new connection
|
|
incurs that overhead, so workloads dominated by
|
|
short-lived connections accumulate it rapidly. Bulk downloads, by
|
|
contrast, amortize it: the Nix cache test finishes only 18\,\%
|
|
slower than Internal (10.07\,s vs.\ 8.53\,s) because once the
|
|
transfer phase begins, per-connection latency fades into the
|
|
background.
|
|
|
|
Mycelium is also the slowest VPN to recover from a reboot:
|
|
76.6~seconds on average, and almost suspiciously uniform across
|
|
nodes (75.7, 75.7, 78.3\,s). That kind of consistency points to a
|
|
hard-coded convergence timer in the overlay protocol rather than
|
|
anything topology-dependent. The UDP test timed out at
|
|
120~seconds, and even first-time connectivity required a
|
|
70-second wait at startup.
|
|
|
|
% Explain what topology-dependent means in this case.
|
|
|
|
\paragraph{Tinc: Userspace Processing Bottleneck.}
|
|
|
|
Tinc is a clear case of a CPU bottleneck masquerading as a network
|
|
problem. At 1.19\,ms latency, packets get through the
|
|
tunnel quickly. Yet throughput tops out at 336\,Mbps, barely a
|
|
third of the bare-metal link. The usual suspects do not apply:
|
|
Tinc's path MTU is a healthy 1\,500~bytes
|
|
(\texttt{blksize\_bytes} of 1\,353 from UDP iPerf3, comparable to
|
|
VpnCloud at 1\,375 and WireGuard at 1\,368), and its retransmit
|
|
count (240) is moderate. What limits Tinc is its single-threaded
|
|
userspace architecture: one CPU core simply cannot encrypt, copy,
|
|
and forward packets fast enough to fill the pipe.
|
|
|
|
The parallel benchmark confirms this diagnosis. Tinc scales to
|
|
563\,Mbps (1.68$\times$), beating Internal's 1.50$\times$ ratio.
|
|
Multiple TCP streams collectively keep that single core busy during
|
|
what would otherwise be idle gaps in any individual flow, squeezing
|
|
out throughput that no single stream could reach alone.
|
|
|
|
\section{Impact of Network Impairment}
|
|
|
|
This section examines how each VPN responds to the Low, Medium, and
|
|
High impairment profiles defined in Chapter~\ref{Methodology}.
|
|
|
|
\subsection{Ping}
|
|
|
|
% RTT and packet loss across impairment profiles.
|
|
|
|
\subsection{TCP Throughput}
|
|
|
|
% TCP iperf3: throughput, retransmits, congestion window.
|
|
|
|
\subsection{UDP Throughput}
|
|
|
|
% UDP iperf3: throughput, jitter, packet loss.
|
|
|
|
\subsection{Parallel TCP}
|
|
|
|
% Parallel iperf3: throughput under contention (A->B, B->C, C->A).
|
|
|
|
\subsection{QUIC Performance}
|
|
|
|
% qperf: bandwidth, TTFB, connection establishment time.
|
|
|
|
\subsection{Video Streaming}
|
|
|
|
% RIST: bitrate, dropped frames, packets recovered, quality score.
|
|
|
|
\subsection{Application-Level Download}
|
|
|
|
% Nix cache: download duration for Firefox package.
|
|
|
|
\section{Tailscale Under Degraded Conditions}
|
|
|
|
% The central finding: Tailscale outperforming the raw Linux
|
|
% networking stack under impairment.
|
|
|
|
\subsection{Observed Anomaly}
|
|
|
|
% Present the data showing Tailscale exceeding internal baseline
|
|
% throughput under Medium/High impairment.
|
|
|
|
\subsection{Congestion Control Analysis}
|
|
|
|
% Reno vs CUBIC, RACK disabled to avoid spurious retransmits
|
|
% under reordering.
|
|
|
|
\subsection{Tuned Kernel Parameters}
|
|
|
|
% Re-run results with tuned buffer sizes and congestion control
|
|
% on the internal baseline, showing the gap closes.
|
|
|
|
\section{Source Code Analysis}
|
|
|
|
\subsection{Feature Matrix Overview}
|
|
|
|
% Summary of the 131-feature matrix across all ten VPNs.
|
|
% Highlight key architectural differences that explain
|
|
% performance results.
|
|
|
|
\subsection{Security Vulnerabilities}
|
|
|
|
% Vulnerabilities discovered during source code review.
|
|
|
|
\section{Summary of Findings}
|
|
|
|
% Brief summary table or ranking of VPNs by key metrics.
|
|
% Save deeper interpretation for a Discussion chapter.
|