% Chapter Template
\chapter{Methodology} % Main chapter title

\label{Methodology}

This chapter describes the methodology used to benchmark and analyze
peer-to-peer mesh VPN implementations. The evaluation combines
performance benchmarking under controlled network conditions with a
structured source code analysis of each implementation. The
benchmarking framework prioritizes reproducibility at every layer,
from pinned dependencies and declarative system configuration to
automated test orchestration, enabling independent verification of
results and facilitating future comparative studies.

\section{Experimental Setup}

\subsection{Hardware Configuration}

All experiments were conducted on three bare-metal servers with
identical specifications:

\begin{itemize}
\bitem{CPU:} Intel Model 94, 4 cores / 8 threads
\bitem{Memory:} 64 GB RAM
\bitem{Network:} 1 Gbps Ethernet (e1000e driver; one machine
uses r8169)
\bitem{Cryptographic acceleration:} AES-NI, AVX, AVX2, PCLMULQDQ,
RDRAND, SSE4.2
\end{itemize}

The presence of hardware cryptographic acceleration is relevant because
many VPN implementations leverage AES-NI for encryption, and the results
may differ on systems without these features.

\subsection{Network Topology}

The three machines are connected via a direct 1 Gbps LAN on the same
network segment. Each machine has a publicly reachable IPv4 address,
which is used to deploy configuration changes via Clan. This baseline
topology provides a controlled environment with minimal latency and no
packet loss, allowing the overhead introduced by each VPN implementation
to be measured in isolation. Figure~\ref{fig:mesh_topology} illustrates
the full-mesh connectivity between the three machines.

\begin{figure}[H]
  \centering
  \begin{tikzpicture}[
    node/.style={
      draw, rounded corners, minimum width=2.2cm, minimum height=1cm,
      font=\ttfamily\bfseries, align=center
    },
    link/.style={thick, <->}
  ]
    % Nodes in an equilateral triangle
    \node[node] (luna) at (0, 3.5) {luna};
    \node[node] (yuki) at (-3, 0) {yuki};
    \node[node] (lom) at (3, 0) {lom};

    % Mesh links
    \draw[link] (luna) -- node[left, font=\small] {1 Gbps} (yuki);
    \draw[link] (luna) -- node[right, font=\small] {1 Gbps} (lom);
    \draw[link] (yuki) -- node[below, font=\small] {1 Gbps} (lom);
  \end{tikzpicture}
  \caption{Full-mesh network topology of the three benchmark machines}
  \label{fig:mesh_topology}
\end{figure}

To simulate real-world network conditions, Linux traffic control
(\texttt{tc netem}) is used to inject latency, jitter, packet loss,
and reordering. These impairments are applied symmetrically on all
machines, meaning the effective round-trip impairment is approximately
double the per-machine values.

\section{VPNs Under Test}

Ten VPN implementations were selected for evaluation, spanning a range
of architectures from centralized coordination to fully decentralized
mesh topologies. Table~\ref{tab:vpn_selection} summarizes the selection.

\begin{table}[H]
  \centering
  \caption{VPN implementations included in the benchmark}
  \label{tab:vpn_selection}
  \begin{tabular}{lll}
    \hline
    \textbf{VPN} & \textbf{Architecture} & \textbf{Notes} \\
    \hline
    Tailscale (Headscale) & Coordinated mesh & Open-source
    coordination server \\
    ZeroTier & Coordinated mesh & Global virtual Ethernet \\
    Nebula & Coordinated mesh & Slack's overlay network \\
    Tinc & Fully decentralized & Established since 1998 \\
    Yggdrasil & Fully decentralized & Spanning-tree routing \\
    Mycelium & Fully decentralized & End-to-end encrypted IPv6 overlay \\
    Hyprspace & Fully decentralized & libp2p-based, IPFS-compatible \\
    EasyTier & Fully decentralized & Rust-based, multi-protocol \\
    VpnCloud & Fully decentralized & Lightweight, kernel bypass option \\
    WireGuard & Point-to-point & Reference baseline (not a mesh VPN) \\
    \hline
    Internal (no VPN) & N/A & Baseline for raw network performance \\
    \hline
  \end{tabular}
\end{table}

WireGuard is included as a reference point despite not being a mesh VPN.
Its minimal overhead and widespread adoption make it a useful comparison
for understanding the cost of mesh coordination and NAT traversal logic.

\subsection{Selection Criteria}

VPNs were selected based on:
\begin{itemize}
\bitem{NAT traversal capability:} All selected VPNs can establish
connections between peers behind NAT without manual port forwarding.
\bitem{Decentralization:} Preference for solutions without mandatory
central servers, though coordinated-mesh VPNs were included for comparison.
\bitem{Active development:} Only VPNs with recent commits and
maintained releases were considered.
\bitem{Linux support:} All VPNs must run on Linux.
\end{itemize}

\subsection{Configuration Methodology}

Each VPN is built from source within the Nix flake, ensuring that all
dependencies are pinned to exact versions. VPNs not packaged in nixpkgs
(Hyprspace, EasyTier, VpnCloud) have dedicated build expressions
under \texttt{pkgs/} in the flake.

Cryptographic material (WireGuard keys, Nebula certificates, ZeroTier
identities) is generated deterministically via Clan's vars generator
system. For example, WireGuard keys are generated as:

\begin{verbatim}
wg genkey > "$out/private-key"
wg pubkey < "$out/private-key" > "$out/public-key"
\end{verbatim}

Generated keys are stored in version control under
\texttt{vars/per-machine/\{name\}/} and read at NixOS evaluation time,
making key material part of the reproducible configuration.

\section{Benchmark Suite}

The benchmark suite includes both synthetic throughput tests and
real-world workloads. This combination addresses a limitation of prior
work that relied exclusively on iperf3.

\subsection{Ping}

Measures ICMP round-trip latency and packet delivery reliability.

\begin{itemize}
\bitem{Method:} 100 ICMP echo requests at 200 ms intervals,
1-second per-packet timeout, repeated for 3 runs.
\bitem{Metrics:} RTT (min, avg, max, mdev), packet loss percentage,
per-packet RTTs.
\end{itemize}
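
With the standard Linux \texttt{ping} utility, a single run of this
method corresponds to an invocation of the following shape (the peer
address is illustrative, not the testbed's actual overlay address):

\begin{verbatim}
ping -c 100 -i 0.2 -W 1 10.100.0.2
\end{verbatim}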

\subsection{TCP iPerf3}

Measures bulk TCP throughput with iperf3, a tool commonly used in
networking research to measure performance.

\begin{itemize}
\bitem{Method:} 30-second bidirectional test in zero-copy mode
(\texttt{-Z}) to minimize CPU overhead.
\bitem{Metrics:} Throughput (bits/s), retransmits, congestion
window, and CPU utilization.
\end{itemize}
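
Using iperf3's standard flags, this method can be sketched as the
following invocation (server address illustrative):

\begin{verbatim}
iperf3 -c 10.100.0.2 -t 30 --bidir -Z
\end{verbatim}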

\subsection{UDP iPerf3}

Measures bulk UDP throughput with the same flags as the TCP iPerf3
benchmark.

\begin{itemize}
\bitem{Method:} Same as the TCP test, plus unlimited target bandwidth
(\texttt{-b 0}) and the 64-bit UDP counters flag.
\bitem{Metrics:} Throughput (bits/s), jitter, packet loss, and CPU
utilization.
\end{itemize}
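
With iperf3's standard flags, the UDP variant can be sketched as
(server address illustrative):

\begin{verbatim}
iperf3 -c 10.100.0.2 -u -t 30 --bidir -Z -b 0 --udp-counters-64bit
\end{verbatim}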

\subsection{Parallel iPerf3}

Tests concurrent overlay network traffic by running TCP streams on all
machines simultaneously in a circular pattern (A$\rightarrow$B,
B$\rightarrow$C, C$\rightarrow$A) for 60 seconds. This simulates
contention across the overlay network.

\begin{itemize}
\bitem{Method:} 60-second bidirectional test in zero-copy mode
(\texttt{-Z}) to minimize CPU overhead.
\bitem{Metrics:} Throughput (bits/s), retransmits, congestion
window, and CPU utilization.
\end{itemize}
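
The circular pattern generalizes to any machine list: each machine
streams to its successor, wrapping around at the end. A minimal sketch
in Python (machine names as in the testbed):

\begin{verbatim}
# Ring of streams: each machine sends to its successor, the last
# machine wraps around to the first.
machines = ["luna", "yuki", "lom"]
pairs = list(zip(machines, machines[1:] + machines[:1]))
# pairs == [("luna", "yuki"), ("yuki", "lom"), ("lom", "luna")]
\end{verbatim}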

\subsection{QPerf}

Measures connection-level QUIC performance rather
than bulk UDP or TCP throughput.

\begin{itemize}
\bitem{Method:} One qperf process per CPU core in parallel, each
running for 30 seconds. Bandwidth from all cores is summed per second.
\bitem{Metrics:} Total bandwidth (Mbps), CPU usage, time to first
byte (TTFB), connection establishment time.
\end{itemize}

\subsection{RIST Video Streaming}

Measures real-time multimedia streaming performance.

\begin{itemize}
\bitem{Method:} The sender generates a 4K ($3840\times2160$) test
pattern at 30 fps using ffmpeg with H.264 encoding (ultrafast preset,
zerolatency tuning) at a 25 Mbps target bitrate. The stream is transmitted
over the RIST protocol to a receiver on the target machine for 30 seconds.
\bitem{Encoding metrics:} Actual bitrate, frame rate, dropped frames.
\bitem{Network metrics:} Packets dropped, packets recovered via
RIST retransmission, RTT, quality score (0--100), received bitrate.
\end{itemize}
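
Assuming an ffmpeg build with librist support, the sender side of this
method can be sketched as follows (receiver address and port are
illustrative, and the exact command used may differ):

\begin{verbatim}
ffmpeg -f lavfi -i testsrc2=size=3840x2160:rate=30 \
  -c:v libx264 -preset ultrafast -tune zerolatency \
  -b:v 25M -f mpegts rist://192.0.2.10:8193
\end{verbatim}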

RIST (Reliable Internet Stream Transport) is a protocol designed for
low-latency video contribution over unreliable networks, making it a
realistic test of VPN behavior under multimedia workloads.

\subsection{Nix Cache Download}

Measures sustained HTTP download performance of many small files
using a real-world workload.

\begin{itemize}
\bitem{Method:} A Harmonia Nix binary cache server on the target
machine serves the Firefox package. The client downloads it via
\texttt{nix copy} through the VPN. Benchmarked with hyperfine:
1 warmup run followed by 2 timed runs. The local cache and Nix's
SQLite metadata are cleared between runs.
\bitem{Metrics:} Mean duration (seconds), standard deviation,
min/max duration.
\end{itemize}
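
Using hyperfine's standard options, the measurement takes roughly the
following shape (cache URL and store path are illustrative
placeholders; the cache-clearing step is abbreviated):

\begin{verbatim}
hyperfine --warmup 1 --runs 2 \
  --prepare '<clear local cache and SQLite metadata>' \
  'nix copy --from http://target:5000 <firefox store path>'
\end{verbatim}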

\section{Network Impairment Profiles}

Four impairment profiles simulate a range of network conditions, from
ideal to severely degraded. Impairments are applied via Linux traffic
control (\texttt{tc netem}) on every machine's primary interface.
Table~\ref{tab:impairment_profiles} shows the per-machine values;
effective round-trip impairment is approximately doubled.

\begin{table}[H]
  \centering
  \caption{Network impairment profiles (per-machine egress values)}
  \label{tab:impairment_profiles}
  \begin{tabular}{lccccc}
    \hline
    \textbf{Profile} & \textbf{Latency} & \textbf{Jitter} &
    \textbf{Loss} & \textbf{Reorder} & \textbf{Correlation} \\
    \hline
    Baseline & - & - & - & - & - \\
    Low & 2 ms & 2 ms & 0.25\% & 0.5\% & 25\% \\
    Medium & 4 ms & 7 ms & 1.0\% & 2.5\% & 50\% \\
    High & 6 ms & 15 ms & 2.5\% & 5\% & 50\% \\
    \hline
  \end{tabular}
\end{table}
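
For the Medium profile, the corresponding \texttt{tc netem} invocation
takes roughly the following shape (the interface name is illustrative,
and the exact qdisc setup may differ):

\begin{verbatim}
tc qdisc add dev eth0 root netem \
  delay 4ms 7ms 50% \
  loss 1% 50% \
  reorder 2.5% 50%
\end{verbatim}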

The correlation column controls how strongly each packet's impairment
depends on the preceding packet. At 0\% correlation, loss and
reordering events are independent; at higher values they occur in
bursts, because a packet that was lost or reordered increases the
probability that the next packet suffers the same fate. This produces
realistic bursty degradation rather than uniformly distributed drops.

The ``Low'' profile approximates a well-provisioned continental
connection, ``Medium'' represents intercontinental links or congested
networks, and ``High'' simulates severely degraded conditions such as
satellite links or highly congested mobile networks.

A 30-second stabilization period follows TC application before
measurements begin, allowing queuing disciplines to settle.

\section{Experimental Procedure}

\subsection{Automation}

The benchmark suite is fully automated via a Python orchestrator
(\texttt{vpn\_bench/}). For each VPN under test, the orchestrator:

\begin{enumerate}
\item Cleans all state directories from previous VPN runs
\item Deploys the VPN configuration to all machines via Clan
\item Restarts the VPN service on every machine (with retry:
up to 3 attempts, 2-second backoff)
\item Verifies VPN connectivity via a connection-check service
(120-second timeout)
\item For each impairment profile:
\begin{enumerate}
\item Applies TC rules via context manager (guarantees cleanup)
\item Waits 30 seconds for stabilization
\item Executes each benchmark three times sequentially,
once per machine pair: $A\to B$, then
$B\to C$, lastly $C\to A$
\item Clears TC rules
\end{enumerate}
\item Collects results and metadata
\end{enumerate}
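
The cleanup guarantee in step 5a is the standard Python
context-manager pattern: the teardown runs even when a benchmark
raises. A minimal sketch (the logging list stands in for the
orchestrator's actual TC helpers, which are not shown here):

\begin{verbatim}
from contextlib import contextmanager

@contextmanager
def tc_rules(profile, log):
    # Stand-in for applying the tc netem rules of `profile`.
    log.append(f"apply {profile}")
    try:
        yield
    finally:
        # Runs even if a benchmark raises, so TC rules never leak
        # into the next measurement.
        log.append(f"clear {profile}")

log = []
try:
    with tc_rules("medium", log):
        raise RuntimeError("benchmark failed")
except RuntimeError:
    pass
# log == ["apply medium", "clear medium"]
\end{verbatim}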

Figure~\ref{fig:orchestrator_flow} illustrates this procedure as a
flowchart.

\begin{figure}[H]
  \centering
  \begin{tikzpicture}[
    box/.style={
      draw, rounded corners, minimum width=4.8cm, minimum height=0.9cm,
      font=\small, align=center, fill=white
    },
    decision/.style={
      draw, diamond, aspect=2.5, minimum width=3cm,
      font=\small, align=center, fill=white, inner sep=1pt
    },
    arr/.style={->, thick},
    every node/.style={font=\small}
  ]
    % Main flow
    \node[box] (clean) at (0, 0) {Clean state directories};
    \node[box] (deploy) at (0, -1.5) {Deploy VPN via Clan};
    \node[box] (restart) at (0, -3) {Restart VPN services\\(up to 3 attempts)};
    \node[box] (verify) at (0, -4.5) {Verify connectivity\\(120\,s timeout)};

    % Inner loop
    \node[decision] (profile) at (0, -6.3) {Next impairment\\profile?};
    \node[box] (tc) at (0, -8.3) {Apply TC rules};
    \node[box] (wait) at (0, -9.8) {Wait 30\,s};
    \node[box] (bench) at (0, -11.3) {Run benchmarks\\$A{\to}B,\;
    B{\to}C,\; C{\to}A$};
    \node[box] (clear) at (0, -12.8) {Clear TC rules};

    % After loop
    \node[box] (collect) at (0, -14.8) {Collect results};

    % Arrows -- main spine
    \draw[arr] (clean) -- (deploy);
    \draw[arr] (deploy) -- (restart);
    \draw[arr] (restart) -- (verify);
    \draw[arr] (verify) -- (profile);
    \draw[arr] (profile) -- node[right] {yes} (tc);
    \draw[arr] (tc) -- (wait);
    \draw[arr] (wait) -- (bench);
    \draw[arr] (bench) -- (clear);

    % Loop back
    \draw[arr] (clear) -- ++(3.8, 0) |- (profile);

    % Exit loop
    \draw[arr] (profile) -- ++(-3.2, 0) node[above, pos=0.3] {no}
    |- (collect);
  \end{tikzpicture}
  \caption{Flowchart of the benchmark orchestrator procedure for a
  single VPN}
  \label{fig:orchestrator_flow}
\end{figure}

\subsection{Retry Logic}

Tests use a retry wrapper with up to 2 retries (3 total attempts),
5-second initial delay, and 700-second maximum total time. The number
of attempts is recorded in test metadata so that retried results can
be identified during analysis.
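
A minimal sketch of such a wrapper with the parameters above (the
function name and the flat delay between attempts are illustrative;
the orchestrator's actual implementation may differ):

\begin{verbatim}
import time

def run_with_retry(fn, retries=2, initial_delay=5.0, max_total=700.0):
    """Call fn, retrying on failure; return result plus attempt count."""
    start = time.monotonic()
    for attempt in range(1, retries + 2):  # first try + `retries` retries
        try:
            # The attempt count is recorded in metadata so retried
            # results can be identified during analysis.
            return {"result": fn(), "attempts": attempt}
        except Exception:
            over_budget = time.monotonic() - start + initial_delay > max_total
            if attempt > retries or over_budget:
                raise
            time.sleep(initial_delay)
\end{verbatim}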

\subsection{Statistical Analysis}

Each metric is summarized as a statistics dictionary containing:

\begin{itemize}
\bitem{min / max:} Extreme values observed
\bitem{average:} Arithmetic mean across samples
\bitem{p25 / p50 / p75:} Quartiles via Python's
\texttt{statistics.quantiles()} function
\end{itemize}
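
A sketch of such a summary using only the standard library (the
function name is illustrative, not the analysis code's actual API):

\begin{verbatim}
import statistics

def summarize(samples):
    # quantiles(..., n=4) returns the three quartile cut points.
    p25, p50, p75 = statistics.quantiles(samples, n=4)
    return {
        "min": min(samples),
        "max": max(samples),
        "average": statistics.mean(samples),
        "p25": p25, "p50": p50, "p75": p75,
    }
\end{verbatim}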

Aggregation differs by benchmark type. Benchmarks that execute
multiple discrete runs, namely ping (3 runs of 100 packets each) and
nix-cache (2 timed runs via hyperfine), first compute statistics
within each run, then average the resulting statistics across runs.
Concretely, if ping produces three runs with mean RTTs of
5.1, 5.3, and 5.0\,ms, the reported average is the mean of
those three values (5.13\,ms). The reported minimum is the
single lowest RTT observed across all three runs.
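
The two aggregation rules from this worked example, as a short sketch
(the per-run minima are hypothetical values added for illustration):

\begin{verbatim}
import statistics

# Per-run mean RTTs in ms, as in the example above.
run_means = [5.1, 5.3, 5.0]
# Hypothetical per-run minimum RTTs in ms.
run_mins = [4.8, 4.9, 4.7]

reported_average = statistics.mean(run_means)  # mean of the run means
reported_min = min(run_mins)                   # lowest RTT across all runs
\end{verbatim}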

Benchmarks that produce continuous per-second samples, for example
qperf and RIST streaming, pool all per-second measurements from a
single execution into one series before computing statistics. For
qperf, bandwidth is first summed across CPU cores for each second, and
statistics are then computed over the resulting time series.
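
The per-second summation for qperf can be sketched as follows (the
bandwidth figures are hypothetical):

\begin{verbatim}
# per_core[i][t]: Mbps reported by the qperf process on core i
# during second t (hypothetical values).
per_core = [
    [100.0, 110.0, 105.0],
    [ 95.0,  90.0, 100.0],
]
seconds = len(per_core[0])
# One pooled series: total bandwidth across all cores, per second.
total_per_second = [sum(core[t] for core in per_core)
                    for t in range(seconds)]
# total_per_second == [195.0, 200.0, 205.0]
\end{verbatim}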

The analysis reports empirical percentiles (p25, p50, p75) alongside
min/max bounds rather than parametric confidence intervals. This
choice is deliberate: benchmark latency and throughput distributions
are often skewed or multimodal, making assumptions of normality
unreliable. The interquartile range (p25--p75) conveys the spread of
typical observations, while min and max capture outlier behavior.
The nix-cache benchmark additionally reports standard deviation via
hyperfine's built-in statistical output.

\section{Source Code Analysis}

To complement the performance benchmarks with architectural
understanding, a structured source code analysis was conducted for
all ten VPN implementations. The analysis followed three phases.

\subsection{Repository Collection and LLM-Assisted Overview}

The latest main branch of each VPN's git repository was cloned,
together with key dependencies that implement core functionality
outside the main repository. For example, Yggdrasil delegates its
routing and cryptographic operations to the Ironwood library, which
was analyzed alongside the main codebase.

Ten LLM agents (Claude Code) were then spawned in parallel, one per
VPN. Each agent was instructed to read the full source tree and
produce an \texttt{overview.md} file documenting the following
aspects:

\begin{itemize}
\item Wire protocol and message framing
\item Encryption scheme and key exchange
\item Packet handling and performance
\item NAT traversal mechanism
\item Local routing and peer discovery
\item Security features and access control
\item Resilience / central point of failure
\end{itemize}

Every claim in the generated overview was required to reference the
specific file and line range in the repository that supports it,
enabling direct verification.

\subsection{Manual Verification}

The LLM-generated overviews served as a navigational aid rather than
a trusted source. The most important code paths identified in each
overview were manually read and verified against the actual source
code, correcting inaccuracies and deepening the analysis where the
automated summaries remained superficial.

\subsection{Feature Matrix and Maintainer Review}

The findings from both the automated and manual analysis were
consolidated into a comprehensive feature matrix cataloguing 131
features across all ten VPN implementations. The matrix covers
protocol characteristics, cryptographic primitives, NAT traversal
strategies, routing behavior, and security properties.

The completed feature matrix was published and sent to the respective
VPN maintainers for review. Maintainer feedback was incorporated as
corrections and clarifications, improving the accuracy of the final
classification.

\section{Reproducibility}

Reproducibility is ensured at every layer of the experimental stack.

\subsection{Dependency Pinning}

Every external dependency is pinned via \texttt{flake.lock}, which records
cryptographic hashes (\texttt{narHash}) and commit SHAs for each input.
Key pinned inputs include:

\begin{itemize}
\bitem{nixpkgs:} Follows \texttt{clan-core/nixpkgs}, ensuring a
single version across the dependency graph
\bitem{clan-core:} The Clan framework, pinned to a specific commit
\bitem{VPN sources:} Hyprspace, EasyTier, Nebula locked to
exact commits
\bitem{Build infrastructure:} flake-parts, treefmt-nix, disko,
nixos-facter-modules
\end{itemize}

Custom packages not in nixpkgs (qperf, VpnCloud, iperf with auth patches,
EasyTier, Hyprspace) are built from source within the flake.

\subsection{Declarative System Configuration}

Each benchmark machine runs NixOS, where the entire operating system is
defined declaratively. There is no imperative package installation or
configuration drift. Given the same NixOS configuration, two machines
will have identical software, services, and kernel parameters.

Machine deployment is atomic: the system either switches to the new
configuration entirely or rolls back.

\subsection{Inventory-Driven Topology}

Clan's inventory system maps machines to service roles declaratively.
For each VPN, the orchestrator writes an inventory entry assigning
machines to roles (e.g., Nebula lighthouse vs.\ peer). The Clan module
system translates this into NixOS configuration: systemd services,
firewall rules, peer lists, and key references. The same inventory
entry always produces the same NixOS configuration.

\subsection{State Isolation}

Before installing a new VPN, the orchestrator deletes all state
directories from previous runs, including VPN-specific directories
(\texttt{/var/lib/zerotier-one}, \texttt{/var/lib/nebula}, etc.) and
benchmark directories. This prevents cross-contamination between tests.

\subsection{Data Provenance}

Every test result includes metadata recording:

\begin{itemize}
\item Wall-clock duration
\item Number of attempts (1 = first try succeeded)
\item VPN restart attempts and duration
\item Connectivity wait duration
\item Source and target machine names
\item Service logs (on failure)
\end{itemize}

Results are organized hierarchically by VPN, TC profile, and machine
pair. Each profile directory contains a \texttt{tc\_settings.json}
snapshot of the exact impairment parameters applied.