% Chapter Template
\chapter{Methodology} % Main chapter title
\label{Methodology}
This chapter describes the methodology used to benchmark and analyze
peer-to-peer mesh VPN implementations. The evaluation combines
performance benchmarking under controlled network conditions with a
structured source code analysis of each implementation. All
dependencies, system configurations, and test procedures are pinned
or declared so that the experiments can be independently reproduced.
\section{Experimental Setup}
\subsection{Hardware Configuration}
All experiments were conducted on three bare-metal servers with
identical specifications:
\begin{itemize}
\bitem{CPU:} Intel Model 94, 4 cores / 8 threads
\bitem{Memory:} 64 GB RAM
\bitem{Network:} 1 Gbps Ethernet (e1000e driver; one machine
uses r8169)
\bitem{Cryptographic acceleration:} AES-NI, AVX, AVX2, PCLMULQDQ,
RDRAND, SSE4.2
\end{itemize}
Results may differ on systems without hardware cryptographic
acceleration, since most of the tested VPNs offload encryption to
AES-NI.
\subsection{Network Topology}
The three machines are connected via a direct 1 Gbps LAN on the same
network segment. Each machine has a publicly reachable IPv4 address,
which is used to deploy configuration changes via Clan. On this
baseline topology, latency is sub-millisecond and there is no packet
loss, so measured overhead can be attributed to the VPN itself.
Figure~\ref{fig:mesh_topology} illustrates the full-mesh connectivity
between the three machines.
\begin{figure}[H]
\centering
\begin{tikzpicture}[
node/.style={
draw, rounded corners, minimum width=2.2cm, minimum height=1cm,
font=\ttfamily\bfseries, align=center
},
link/.style={thick, <->}
]
% Nodes in an equilateral triangle
\node[node] (luna) at (0, 3.5) {luna};
\node[node] (yuki) at (-3, 0) {yuki};
\node[node] (lom) at (3, 0) {lom};
% Mesh links
\draw[link] (luna) -- node[left, font=\small] {1 Gbps} (yuki);
\draw[link] (luna) -- node[right, font=\small] {1 Gbps} (lom);
\draw[link] (yuki) -- node[below, font=\small] {1 Gbps} (lom);
\end{tikzpicture}
\caption{Full-mesh network topology of the three benchmark machines}
\label{fig:mesh_topology}
\end{figure}
To simulate real-world network conditions, Linux traffic control
(\texttt{tc netem}) is used to inject latency, jitter, packet loss,
and reordering. These impairments are applied symmetrically on all
machines, meaning effective round-trip impairment is approximately
double the per-machine values.
\subsection{Configuration Methodology}
Each VPN is built from source within the Nix flake, with all
dependencies pinned to exact versions. VPNs not packaged in nixpkgs
(Hyprspace, EasyTier, VpnCloud) have dedicated build expressions
under \texttt{pkgs/} in the flake.
Cryptographic material (WireGuard keys, Nebula certificates, ZeroTier
identities) is generated deterministically via Clan's vars generator
system.
Generated keys are stored in version control under
\texttt{vars/per-machine/\{name\}/} and read at NixOS evaluation time,
so key material is part of the reproducible configuration.
\section{Benchmark Suite}
The benchmark suite includes synthetic throughput tests and
application-level workloads. Prior comparative work relied exclusively
on iperf3; the additional benchmarks here capture behavior that
iperf3 alone misses.
Table~\ref{tab:benchmark_suite} summarizes each benchmark.
\begin{table}[H]
\centering
\caption{Benchmark suite overview}
\label{tab:benchmark_suite}
\begin{tabular}{llll}
\hline
\textbf{Benchmark} & \textbf{Protocol} & \textbf{Duration} &
\textbf{Key Metrics} \\
\hline
Ping & ICMP & 3 runs $\times$ 100 pkts & RTT, packet loss \\
TCP iPerf3 & TCP & 30 s & Throughput, retransmits, CPU \\
UDP iPerf3 & UDP & 30 s & Throughput, jitter, packet loss \\
Parallel iPerf3 & TCP & 60 s & Throughput under contention \\
QPerf & QUIC & 30 s & Bandwidth, TTFB, conn. time \\
RIST Streaming & RIST & 30 s & Bitrate, dropped frames, RTT \\
Nix Cache Download & HTTP & 2 runs & Download duration \\
\hline
\end{tabular}
\end{table}
The first four benchmarks use standard network testing tools;
the remaining three test application-level workloads.
The subsections below describe configuration details that the table
does not capture.
\subsection{Ping}
Sends 100 ICMP echo requests at 200\,ms intervals with a 1-second
per-packet timeout, repeated for 3 runs.
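As a sketch, the invocation described above could be assembled as follows. The helper name is illustrative, not the orchestrator's actual code; the flags are those of Linux iputils \texttt{ping}.

```python
def ping_cmd(target: str) -> list[str]:
    """Build one ping run: 100 echo requests, 200 ms apart,
    with a 1-second per-packet timeout (iputils flags)."""
    return [
        "ping",
        "-c", "100",   # packet count
        "-i", "0.2",   # 200 ms send interval
        "-W", "1",     # 1 s per-packet timeout
        target,
    ]
```

The orchestrator repeats this command for each of the 3 runs and parses the RTT and loss summary from the output.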
\subsection{TCP and UDP iPerf3}
Both tests run for 30 seconds in bidirectional mode with zero-copy
(\texttt{-Z}) to minimize CPU overhead. The UDP variant additionally
sets unlimited target bandwidth (\texttt{-b 0}) and enables 64-bit
counters.
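The two iPerf3 invocations could be built as in the following sketch; the helper name and flag order are illustrative, but the flags themselves are standard iperf3 options.

```python
def iperf3_cmd(server: str, udp: bool = False) -> list[str]:
    """Build an iperf3 client command for a 30 s bidirectional test."""
    cmd = [
        "iperf3",
        "-c", server,   # target machine
        "-t", "30",     # 30-second run
        "--bidir",      # bidirectional mode
        "-Z",           # zero-copy to minimize CPU overhead
        "--json",       # machine-readable output
    ]
    if udp:
        cmd += [
            "-u",       # UDP instead of TCP
            "-b", "0",  # unlimited target bandwidth
            "--udp-counters-64bit",  # 64-bit packet counters
        ]
    return cmd
```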
\subsection{Parallel iPerf3}
Runs one bidirectional TCP stream on all three machine pairs
simultaneously in a circular pattern (A$\rightarrow$B,
B$\rightarrow$C, C$\rightarrow$A) for 60 seconds with zero-copy
(\texttt{-Z}). The three concurrent bidirectional links produce six
unidirectional flows in total. This contention stresses shared
resources that single-stream tests leave idle.
\subsection{QPerf}
Spawns one qperf process per CPU core, each running for 30 seconds.
Per-core bandwidth is summed per second. In addition to throughput,
QPerf reports time to first byte and connection establishment time,
which iPerf3 does not measure.
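The per-core aggregation can be illustrated by the following sketch (the function name and data layout are assumed, not taken from the benchmark code):

```python
def aggregate_qperf(per_core: list[list[float]]) -> list[float]:
    """Sum per-second bandwidth samples across CPU cores.

    per_core[i][t] is core i's bandwidth during second t; series
    are truncated to the shortest core's length before summing,
    since processes may report slightly different sample counts.
    """
    seconds = min(len(series) for series in per_core)
    return [sum(series[t] for series in per_core)
            for t in range(seconds)]
```

Statistics are then computed over the resulting single time series rather than over each core individually.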
\subsection{RIST Video Streaming}
Generates a 4K ($3840\times2160$) H.264 test pattern at 30\,fps
(ultrafast preset, zerolatency tuning, 25\,Mbps bitrate cap) with
ffmpeg and transmits it over the RIST protocol for 30 seconds. Because
the synthetic test pattern is highly compressible, the actual encoding
bitrate is approximately 3.3\,Mbps, well below the configured cap. RIST
(Reliable Internet Stream Transport) is a protocol for low-latency
video contribution over unreliable networks. The benchmark records
encoding-side statistics (actual bitrate, frame rate, dropped frames)
and RIST-specific counters (packets recovered via retransmission,
quality score).
\subsection{Nix Cache Download}
A Harmonia Nix binary cache server on the target machine serves the
Firefox package. The client downloads it via \texttt{nix copy}
through the VPN. Unlike the iPerf3 tests, this workload issues many
short-lived HTTP requests instead of a single bulk transfer.
Benchmarked with hyperfine (1 warmup run, 2 timed runs); the local
Nix store and SQLite metadata are cleared between runs.
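The hyperfine invocation could take roughly the following shape. The exact clearing command and store path are placeholders; only the warmup/run counts and the use of \texttt{nix copy} follow the description above.

```python
def nix_cache_benchmark(store_path: str, cache_url: str,
                        clear_cmd: str) -> list[str]:
    """Build a hyperfine command: 1 warmup run, 2 timed runs,
    clearing local Nix state before each run via clear_cmd."""
    return [
        "hyperfine",
        "--warmup", "1",
        "--runs", "2",
        "--prepare", clear_cmd,  # placeholder: remove store path
                                 # and SQLite metadata between runs
        f"nix copy --from {cache_url} {store_path}",
    ]
```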
\section{Network Impairment Profiles}
Four impairment profiles simulate progressively worse network
conditions, from an unmodified baseline to a severely degraded link.
All impairments are injected with Linux traffic control
(\texttt{tc netem}) on the egress side of every machine's primary
interface.
Table~\ref{tab:impairment_profiles} lists the per-machine values.
Because impairments are applied on both ends of a connection, the
effective round-trip impact is roughly double the listed values.
\begin{table}[H]
\centering
\caption{Network impairment profiles (per-machine egress values)}
\label{tab:impairment_profiles}
\begin{tabular}{lccccc}
\hline
\textbf{Profile} & \textbf{Latency} & \textbf{Jitter} &
\textbf{Loss} & \textbf{Reorder} & \textbf{Correlation} \\
\hline
Baseline & - & - & - & - & - \\
Low & 2 ms & 2 ms & 0.25\% & 0.5\% & 25\% \\
Medium & 4 ms & 7 ms & 1.0\% & 2.5\% & 50\% \\
High & 6 ms & 15 ms & 2.5\% & 5\% & 50\% \\
\hline
\end{tabular}
\end{table}
Each column in Table~\ref{tab:impairment_profiles} controls one
aspect of the simulated degradation:
\begin{itemize}
\item \textbf{Latency} is a constant delay added to every outgoing
packet. For example, 2\,ms on each machine adds roughly 4\,ms to
the round trip.
\item \textbf{Jitter} introduces random variation on top of the
fixed latency. A packet on the Low profile may see anywhere
between 0 and 4\,ms of total added delay instead of exactly
2\,ms.
\item \textbf{Loss} is the fraction of packets that are silently
dropped. At 0.25\,\% (Low profile), roughly 1 in 400 packets is
discarded.
\item \textbf{Reorder} is the fraction of packets that arrive out
of sequence. \texttt{tc netem} achieves this by giving selected
packets a shorter delay than their predecessors, so they overtake
earlier packets.
\item \textbf{Correlation} determines whether impairment events are
independent or bursty. At 0\,\%, each packet's fate is decided
independently. At higher values, a packet that was lost or
reordered raises the probability that the next packet suffers the
same fate, producing the burst patterns typical of real networks.
\end{itemize}
A 30-second stabilization period follows TC application before
measurements begin so that queuing disciplines can settle.
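As an illustration of how a profile row maps onto a \texttt{tc netem} command line, consider the following sketch. The interface name and helper function are illustrative; the flag syntax follows the \texttt{tc-netem} manual, with the correlation value attached to both the loss and reorder parameters.

```python
def netem_cmd(iface: str, delay_ms: int, jitter_ms: int,
              loss_pct: float, reorder_pct: float,
              corr_pct: int) -> list[str]:
    """Build the egress netem qdisc command for one machine."""
    return [
        "tc", "qdisc", "add", "dev", iface, "root", "netem",
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms",   # latency + jitter
        "loss", f"{loss_pct}%", f"{corr_pct}%",       # loss + correlation
        "reorder", f"{reorder_pct}%", f"{corr_pct}%", # reorder + correlation
    ]

# The "Low" profile from the table above:
low = netem_cmd("eth0", 2, 2, 0.25, 0.5, 25)
```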
\section{Experimental Procedure}
\subsection{Automation}
A Python orchestrator (\texttt{vpn\_bench/}) automates the full
benchmark suite. For each VPN under test, it:
\begin{enumerate}
\item Cleans all state directories from previous VPN runs
\item Deploys the VPN configuration to all machines via Clan
\item Restarts the VPN service on every machine (with retry:
up to 3 attempts, 2-second backoff)
\item Verifies VPN connectivity via a connection-check service
(120-second timeout)
\item For each impairment profile:
\begin{enumerate}
\item Applies TC rules via context manager (guarantees cleanup)
\item Waits 30 seconds for stabilization
\item Executes each benchmark once per machine pair, sequentially:
$A\to B$, then $B\to C$, and finally $C\to A$
\item Clears TC rules
\end{enumerate}
\item Collects results and metadata
\end{enumerate}
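Step 5a's cleanup guarantee hinges on Python's context-manager protocol: the \texttt{finally} clause runs whether the benchmarks succeed or raise. A minimal sketch, with the apply and clear operations injected as callables (the real \texttt{vpn\_bench} API is not shown here):

```python
from contextlib import contextmanager

@contextmanager
def tc_rules(apply_fn, clear_fn, profile):
    """Apply impairment rules, guaranteeing cleanup on any exit path."""
    apply_fn(profile)
    try:
        yield
    finally:
        clear_fn()  # runs even if a benchmark inside the block raises
```

This way a crashed benchmark cannot leave stale netem qdiscs behind to contaminate the next profile's measurements.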
Figure~\ref{fig:orchestrator_flow} illustrates this procedure as a
flowchart.
\begin{figure}[H]
\centering
\begin{tikzpicture}[
box/.style={
draw, rounded corners, minimum width=4.8cm, minimum height=0.9cm,
font=\small, align=center, fill=white
},
decision/.style={
draw, diamond, aspect=2.5, minimum width=3cm,
font=\small, align=center, fill=white, inner sep=1pt
},
arr/.style={->, thick},
every node/.style={font=\small}
]
% Main flow
\node[box] (clean) at (0, 0) {Clean state directories};
\node[box] (deploy) at (0, -1.5) {Deploy VPN via Clan};
\node[box] (restart) at (0, -3) {Restart VPN services\\(up to 3 attempts)};
\node[box] (verify) at (0, -4.5) {Verify connectivity\\(120\,s timeout)};
% Inner loop
\node[decision] (profile) at (0, -6.3) {Next impairment\\profile?};
\node[box] (tc) at (0, -8.3) {Apply TC rules};
\node[box] (wait) at (0, -9.8) {Wait 30\,s};
\node[box] (bench) at (0, -11.3) {Run benchmarks\\$A{\to}B,\;
B{\to}C,\; C{\to}A$};
\node[box] (clear) at (0, -12.8) {Clear TC rules};
% After loop
\node[box] (collect) at (0, -14.8) {Collect results};
% Arrows -- main spine
\draw[arr] (clean) -- (deploy);
\draw[arr] (deploy) -- (restart);
\draw[arr] (restart) -- (verify);
\draw[arr] (verify) -- (profile);
\draw[arr] (profile) -- node[right] {yes} (tc);
\draw[arr] (tc) -- (wait);
\draw[arr] (wait) -- (bench);
\draw[arr] (bench) -- (clear);
% Loop back
\draw[arr] (clear) -- ++(3.8, 0) |- (profile);
% Exit loop
\draw[arr] (profile) -- ++(-3.2, 0) node[above, pos=0.3] {no}
|- (collect);
\end{tikzpicture}
\caption{Flowchart of the benchmark orchestrator procedure for a
single VPN}
\label{fig:orchestrator_flow}
\end{figure}
\subsection{Retry Logic}
Tests use a retry wrapper with up to 2 retries (3 total attempts),
5-second initial delay, and 700-second maximum total time. The number
of attempts is recorded in test metadata so that retried results can
be identified during analysis.
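A hedged sketch of such a wrapper is shown below. The function name and the backoff growth factor are assumptions; the attempt limit, initial delay, total-time budget, and the recorded attempt count follow the description above.

```python
import time

def with_retry(fn, retries=2, initial_delay=5.0, max_total=700.0):
    """Run fn(); on exception, retry up to `retries` times within a
    total-time budget. Returns (result, attempts) so the attempt
    count can be written into the test metadata."""
    start = time.monotonic()
    delay = initial_delay
    for attempt in range(1, retries + 2):  # 3 attempts total
        try:
            return fn(), attempt
        except Exception:
            elapsed = time.monotonic() - start
            if attempt > retries or elapsed + delay > max_total:
                raise  # out of attempts or budget: propagate failure
            time.sleep(delay)
            delay *= 2  # assumed backoff growth
```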
\subsection{Statistical Analysis}
Each metric is summarized as a statistics dictionary containing:
\begin{itemize}
\bitem{min / max:} Extreme values observed
\bitem{average:} Arithmetic mean across samples
\bitem{p25 / p50 / p75:} Quartiles via Python's
\texttt{statistics.quantiles()} function
\end{itemize}
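The dictionary above can be sketched as follows; the function name is illustrative, and \texttt{statistics.quantiles()} is used with its default (exclusive) method.

```python
import statistics

def summarize(samples: list[float]) -> dict[str, float]:
    """Summarize one metric series into the statistics dictionary."""
    q1, q2, q3 = statistics.quantiles(samples, n=4)  # quartiles
    return {
        "min": min(samples),
        "max": max(samples),
        "average": statistics.mean(samples),
        "p25": q1, "p50": q2, "p75": q3,
    }
```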
Aggregation differs by benchmark type. Benchmarks that execute
multiple discrete runs (ping, with 3 runs of 100 packets each, and
nix-cache, with 2 timed runs via hyperfine) first compute statistics
within each run and then aggregate across runs: averages and
percentiles are averaged, while the reported minimum and maximum are
the global extremes across all runs. Concretely, if ping produces
three runs
with mean RTTs of 5.1, 5.3, and 5.0\,ms, the reported average is
the mean of those three values (5.13\,ms). The reported minimum is
the single lowest RTT observed across all three runs.
Benchmarks that produce continuous per-second samples (for example,
qperf and RIST streaming) pool all per-second measurements from a
single execution into one series before computing statistics. For qperf,
bandwidth is first summed across CPU cores for each second, and
statistics are then computed over the resulting time series.
The analysis reports empirical percentiles (p25, p50, p75) alongside
min/max bounds rather than parametric confidence intervals.
Benchmark latency and throughput distributions are often skewed or
multimodal, so parametric assumptions of normality would be
unreliable. The interquartile range (p25--p75) conveys the spread of
typical observations, while min and max capture outlier behavior.
The nix-cache benchmark additionally reports standard deviation via
hyperfine's built-in statistical output.
\section{Source Code Analysis}
We also conducted a structured source code analysis of all ten VPN
implementations. The analysis followed three phases.
\subsection{Repository Collection and LLM-Assisted Overview}
The latest main branch of each VPN's git repository was cloned,
together with key dependencies that implement core functionality
outside the main repository. For example, Yggdrasil delegates its
routing and cryptographic operations to the Ironwood library, which
was analyzed alongside the main codebase.
Ten LLM agents (Claude Code) were then spawned in parallel, one per
VPN. Each agent was instructed to read the full source tree and
produce an \texttt{overview.md} file documenting the following
aspects:
\begin{itemize}
\item Wire protocol and message framing
\item Encryption scheme and key exchange
\item Packet handling and performance
\item NAT traversal mechanism
\item Local routing and peer discovery
\item Security features and access control
\item Resilience / Central Point of Failure
\end{itemize}
Each agent was required to reference the specific file and line
range supporting every claim so that outputs could be verified
against the source.
\subsection{Manual Verification}
The LLM-generated overviews served as a navigational aid rather than
a trusted source. The most important code paths identified in each
overview were manually read and verified against the actual source
code. Where the automated summaries were inaccurate or superficial,
they were corrected and expanded.
\subsection{Feature Matrix and Maintainer Review}
The findings from both phases were consolidated into a feature matrix
of 131 features across all ten VPN implementations, covering protocol
characteristics, cryptographic primitives, NAT traversal strategies,
routing behavior, and security properties.
The completed feature matrix was published and sent to the respective
VPN maintainers for review. We incorporated their feedback as
corrections and clarifications to the final classification.
\section{Reproducibility}
The experimental stack pins or declares the variables that could
affect results.
\subsection{Dependency Pinning}
Every external dependency is pinned via \texttt{flake.lock}, which records
cryptographic hashes (\texttt{narHash}) and commit SHAs for each input.
Key pinned inputs include:
\begin{itemize}
\bitem{nixpkgs:} Follows \texttt{clan-core/nixpkgs}, so a single
version is used across the dependency graph
\bitem{clan-core:} The Clan framework, pinned to a specific commit
\bitem{VPN sources:} Hyprspace, EasyTier, Nebula locked to
exact commits
\bitem{Build infrastructure:} flake-parts, treefmt-nix, disko,
nixos-facter-modules
\end{itemize}
Custom packages not in nixpkgs (qperf, VpnCloud, iperf with auth patches,
EasyTier, Hyprspace) are built from source within the flake.
\subsection{Declarative System Configuration}
Each benchmark machine runs NixOS, where the entire operating system is
defined declaratively. There is no imperative package installation or
configuration drift. Given the same NixOS configuration, two machines
will have identical software, services, and kernel parameters.
Machine deployment is atomic: the system either switches to the new
configuration entirely or rolls back.
\subsection{Inventory-Driven Topology}
Clan's inventory system maps machines to service roles declaratively.
For each VPN, the orchestrator writes an inventory entry assigning
machines to roles (e.g., Nebula lighthouse vs.\ peer). The Clan module
system translates this into NixOS configuration: systemd services,
firewall rules, peer lists, and key references. The same inventory
entry always produces the same NixOS configuration.
\subsection{State Isolation}
Before installing a new VPN, the orchestrator deletes all state
directories from previous runs, including VPN-specific directories
(\texttt{/var/lib/zerotier-one}, \texttt{/var/lib/nebula}, etc.) and
benchmark directories. This prevents cross-contamination between tests.
\subsection{Data Provenance}
Results are organized in the four-level directory hierarchy shown in
Figure~\ref{fig:result-tree}. Each VPN directory stores a
\texttt{layout.json} capturing the machine topology used for that run.
Each impairment profile directory records the exact \texttt{tc}
parameters in \texttt{tc\_settings.json} and per-phase durations in
\texttt{timing\_breakdown.json}. Individual benchmark results are
stored in one subdirectory per machine pair.
\begin{figure}[ht]
\centering
\begin{forest}
for tree={
font=\ttfamily\small,
grow'=0,
folder,
s sep=2pt,
inner xsep=3pt,
inner ysep=2pt,
}
[date/
[vpn/
[layout.json]
[profile/
[tc\_settings.json]
[timing\_breakdown.json]
[parallel\_tcp\_iperf3.json]
[\textnormal{\textit{\{pos\}\_\{peer\}}}/
[ping.json]
[tcp\_iperf3.json]
[udp\_iperf3.json]
[qperf.json]
[rist\_stream.json]
[nix\_cache.json]
[connection\_timings.json]
]
]
]
[General/
[hardware.json]
[comparison/
[cross\_profile\_*.json]
[profile/
[benchmark\_stats.json]
[per-benchmark .json files]
]
]
]
]
\end{forest}
\caption{Directory hierarchy of benchmark results. Each run produces
per-VPN and per-profile directories alongside a \texttt{General/}
directory with cross-VPN comparison data.}
\label{fig:result-tree}
\end{figure}
Every benchmark result file uses a uniform JSON envelope with a
\texttt{status} field, a \texttt{data} object holding the
test-specific payload, and a \texttt{meta} object recording
wall-clock duration, number of attempts, VPN restart count and
duration, connectivity wait time, source and target machine names,
and on failure, the relevant service logs.
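An illustrative example of such an envelope is given below. The three top-level keys and the contents of \texttt{meta} mirror the description above; the concrete field names and values are invented for demonstration.

```python
import json

envelope = {
    "status": "success",
    "data": {  # test-specific payload, e.g. ping statistics
        "rtt_ms": {"min": 0.41, "average": 0.63, "max": 1.92},
    },
    "meta": {
        "duration_s": 62.4,            # wall-clock duration
        "attempts": 1,                 # retry-wrapper attempt count
        "vpn_restarts": 1,
        "vpn_restart_duration_s": 3.1,
        "connectivity_wait_s": 4.8,
        "source": "luna",
        "target": "yuki",
        # on failure, the relevant service logs are added here
    },
}
serialized = json.dumps(envelope, indent=2)
```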
\section{VPNs Under Test}
VPNs were selected according to the following criteria:
\begin{itemize}
\bitem{NAT traversal capability:} All selected VPNs can establish
connections between peers behind NAT without manual port forwarding.
\bitem{Decentralization:} Preference for solutions without mandatory
central servers, though coordinated-mesh VPNs were included for comparison.
\bitem{Active development:} Only VPNs with recent commits and
maintained releases were considered (with the exception of VpnCloud).
\bitem{Linux support:} All VPNs must run on Linux.
\end{itemize}
Table~\ref{tab:vpn_selection} lists the ten VPN implementations
selected for evaluation.
\begin{table}[H]
\centering
\caption{VPN implementations included in the benchmark}
\label{tab:vpn_selection}
\begin{tabular}{lll}
\hline
\textbf{VPN} & \textbf{Architecture} & \textbf{Notes} \\
\hline
Tailscale (Headscale) & Coordinated mesh & Open-source
coordination server \\
ZeroTier & Coordinated mesh & Global virtual Ethernet \\
Nebula & Coordinated mesh & Slack's overlay network \\
Tinc & Fully decentralized & First released in 1998 \\
Yggdrasil & Fully decentralized & Spanning-tree routing \\
Mycelium & Fully decentralized & End-to-end encrypted IPv6 overlay \\
Hyprspace & Fully decentralized & libp2p-based, IPFS-compatible \\
EasyTier & Fully decentralized & Rust-based, multi-protocol \\
VpnCloud & Fully decentralized & Lightweight, kernel bypass option \\
WireGuard & Point-to-point & Reference baseline (not a mesh VPN) \\
\hline
Internal (no VPN) & N/A & Baseline for raw network performance \\
\hline
\end{tabular}
\end{table}
WireGuard is not a mesh VPN but is included as a reference point.
Comparing its overhead to the mesh VPNs isolates the cost of mesh
coordination and NAT traversal.