add treefmt and jupyter lab

This commit is contained in:
2024-11-24 23:38:01 +01:00
parent 6b95d8632c
commit 5b0ae1e598
46 changed files with 6836 additions and 279 deletions


@@ -0,0 +1,392 @@
# Data-Mesher Specification
The Data-Mesher is an application that runs on every node in a peer-to-peer network. It functions as a database that eventually synchronizes across all nodes, using a special data structure called a **CRDT (Conflict-Free Replicated Data Type)** to resolve conflicts. What sets it apart from other databases is that even untrusted nodes can add data without compromising data added by others. Its primary use is for announcing hostnames and application settings, though it is flexible enough to support other use cases as well.
Below is a more detailed explanation of how it works:
---
### **What is a "dm-network"?**
A **dm-network** is a group of hosts (computers/nodes) and settings that are grouped under a single key in a JSON file. This key is a public **ed25519 key**.
- **Settings:**
- The settings in a dm-network are protected by a digital signature to prevent tampering.
- The admin users are the only ones with the private key needed to change these settings.
- However, any node in the network can update the settings as long as it signs the change correctly (with the admin's private key). This ensures there isn't just one "admin" node, allowing settings to be changed on any local node and then pushed to other nodes in the network.
### **Hosts in the dm-network**
A dm-network also includes a **hosts** attribute that stores information about hostnames for DNS lookups.
- Each node can add its own hostnames to this list.
- Every node has its own pair of private and public keys to sign the hostnames it adds.
- In case multiple nodes try to use the same hostname, the one with the **oldest signed entry** (earliest timestamp) will be chosen.
#### **Preventing clock manipulation:**
To avoid cheating with time (for example, backdating a hostname entry), the dm-network relies on **trusted timestamp attestation servers**. We can use the [Time-Stamp Protocol (RFC 3161)](https://datatracker.ietf.org/doc/html/rfc3161), which allows sending a payload and getting back a signed SHA-256 hash together with a timestamp. The [FreeTSA](https://www.freetsa.org/index_de.php) servers can be used for that.
Alternatively, we could use the [OpenTimestamps](https://opentimestamps.org/) specification and its free-to-use servers.
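The "oldest signed entry wins" rule can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Claim` type, the `attested_at` field, and the tie-break on the owner key are assumptions introduced here, and signature checks are assumed to have happened beforehand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A node's claim on a hostname (hypothetical structure for illustration)."""
    owner_key: str              # public key of the claiming node
    hostname: str
    attested_at: Optional[int]  # Unix time from a trusted attestation, None if missing


def resolve_hostname(claims: list[Claim]) -> Optional[Claim]:
    """Return the earliest attested claim, or None if no claim is attested yet."""
    attested = [c for c in claims if c.attested_at is not None]
    if not attested:
        return None
    # Tie-break on the owner key so every node picks the same winner (assumption).
    return min(attested, key=lambda c: (c.attested_at, c.owner_key))


claims = [
    Claim("key-B", "home", 200),
    Claim("key-A", "home", 100),   # earliest attested claim
    Claim("key-C", "home", None),  # not attested yet, so it cannot win
]
winner = resolve_hostname(claims)
```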
---
### **Data Synchronization Between Nodes**
When two nodes communicate, they exchange their entire set of data.
After ensuring the data is merged, both nodes should end up with the same `data.json` file. The **merge function** ensures that both nodes arrive at the same result, no matter the order or timing.
*Currently the merge function is quite primitive: for settings, it checks whether the signature is valid, and then the larger `last_update` timestamp wins.*
---
### **Handling Invalid or Missing Timestamps**
- If a hostname entry doesn't have a valid timestamp, it will still be shared with other nodes, but it won't be active or used yet.
- The entry stays inactive until it reaches a trusted dm-node that also acts as a **timestamp attestation node** (TSP), which will add a timestamp and sign the entry. From that point, the hostname becomes valid and can be used in the network. The **timestamp attestation nodes** are listed under the "settings" key in the JSON file, and only the dm-network's admin can modify this list.
- This means there needs to be **one or more trusted dm-nodes** that attest that the timestamps are correct. If one of the trusted dm-nodes is compromised, hostnames can be maliciously claimed and redirected to attacker-controlled nodes.
- This also means that **hostnames are not to be trusted**, and instead a Certificate Authority should be used to verify the authenticity of endpoints.
- This design has been chosen because:
- It enables completely off-grid nodes that exist only inside the mesh VPN
- A node can start claiming its hostname offline and sync it into the VPN network once it's online
- No timing attacks: An attacker cannot pre-fetch timestamps to then use them to override hostnames
### **Security: Invalid Signatures**
- If a hostname or timestamp has an invalid signature, it won't be shared with other nodes.
- An alert will be triggered for further action.
- Additionally, hosts must go through a verification step to ensure that the IPs and ports are reachable and that the machine behind them holds the private key that signed the entry. Before accepting a new host into the configuration, the node will attempt to contact the host's IP and port, and a challenge-response protocol will be used to verify that it is the correct machine.
---
Below is an example `data.json`:
```json
{
"22excOG1Q7hlNMyRPWz4eZNeTqsH18p0+r0KGPUqVR8=": {
"hosts": {
"7BZSfLVyoTc12xgpvMUSWGTNsjjP4iqv/JSgpYbHQC4=": {
"hostnames": {
"green": {
"hostname": "green"
}
},
"ip": "fdcc:c5da:5295:c853:d499:937c:31a2:1e86",
"last_seen": 1731199277,
"port": 7331,
"signature": "RUZEqQoH1E2TuuB0rcQeaEuyxLTB70xgcj2VvRpvDwRtxvbaXegErJ7ei5obS46x3ApjgVP+3Di7OTXBSxqUCnsiaG9zdG5hbWVzIjogeyJncmVlbiI6IHsiaG9zdG5hbWUiOiAiZ3JlZW4ifX0sICJpcCI6ICJmZGNjOmM1ZGE6NTI5NTpjODUzOmQ0OTk6OTM3YzozMWEyOjFlODYiLCAibGFzdF9zZWVuIjogMTczMTE5OTI3NywgInBvcnQiOiA3MzMxfQ=="
},
"D9mq63wEznl4kHhsoQbq8hpncvGZeWC0vEOekcB8Nko=": {
"hostnames": {
"mors": {
"hostname": "mors"
}
},
"ip": "fdcc:c5da:5295:c853:d499:93e9:c5fc:c8b5",
"last_seen": 1731180076,
"port": 7331,
"signature": "/9o91MnAmSQTnbJCOK29zc2NcoAg8jI3SHbJ1NLiVQfCWafZ9MRqakkT/yLbgOTaepTCy2VFmu2HXalqnnUyC3siaG9zdG5hbWVzIjogeyJtb3JzIjogeyJob3N0bmFtZSI6ICJtb3JzIn19LCAiaXAiOiAiZmRjYzpjNWRhOjUyOTU6Yzg1MzpkNDk5OjkzZTk6YzVmYzpjOGI1IiwgImxhc3Rfc2VlbiI6IDE3MzExODAwNzYsICJwb3J0IjogNzMzMX0="
}
},
"settings": {
"banned_keys": [],
"host_signing_keys": [],
"hostname_overrides": {},
"last_update": 1724161701,
"public": true,
"signature": "K3UIjSbQkjKRM2yBlj8PIoeIZq4PyvImJss6SWremYBzggzibjnx8A5mifh0GF0xHig0J4gVhDmsYqogovRuDA==",
"tld": "nether"
}
}
}
```
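The host `signature` values in the sample above appear to be base64 encodings of a 64-byte Ed25519 signature followed by the signed JSON payload (the host fields with sorted keys). This is an observation about the sample, not something the specification states. Under that assumption, a minimal sketch for splitting such a value looks like this; `split_signed_entry` is a hypothetical helper and it does not verify the signature.

```python
import base64
import json

SIGNATURE_LEN = 64  # size of an Ed25519 signature in bytes


def split_signed_entry(signature_b64: str) -> tuple[bytes, dict]:
    """Split a base64 'signature || payload' blob into its two parts.

    Assumes the layout observed in the sample data; no verification is done.
    """
    raw = base64.b64decode(signature_b64)
    signature, payload = raw[:SIGNATURE_LEN], raw[SIGNATURE_LEN:]
    return signature, json.loads(payload)


# The "signature" value of the host announcing "green" in the sample data.json.
sample = "RUZEqQoH1E2TuuB0rcQeaEuyxLTB70xgcj2VvRpvDwRtxvbaXegErJ7ei5obS46x3ApjgVP+3Di7OTXBSxqUCnsiaG9zdG5hbWVzIjogeyJncmVlbiI6IHsiaG9zdG5hbWUiOiAiZ3JlZW4ifX0sICJpcCI6ICJmZGNjOmM1ZGE6NTI5NTpjODUzOmQ0OTk6OTM3YzozMWEyOjFlODYiLCAibGFzdF9zZWVuIjogMTczMTE5OTI3NywgInBvcnQiOiA3MzMxfQ=="

signature, fields = split_signed_entry(sample)
print(sorted(fields))  # names of the fields covered by the signature
```

The `settings` signature in the sample is only 64 bytes long once decoded, so it seems to be a detached signature and cannot be split this way.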
### Examples for merging
For settings, the larger `last_update` always wins:
```
a:
{ hosts: { ... },
settings: {
last_update: 2,
tld: "test",
signature: "sig2"
}
}
b:
{ hosts: { ... },
settings: {
last_update: 3,
tld: "test2",
signature: "sig3"
}
}
result:
{ hosts: { ... },
settings: {
last_update: 3,
tld: "test2",
signature: "sig3"
}
}
```
Hosts with the larger `last_update` win, and new hosts are added:
```
a:
{
hosts: {
pub1: {
ip: "42::1",
port: 7331,
last_update: 3,
signature: "sig_pub1_3"
}
},
settings: { ... },
}
b:
{
hosts: {
pub1: {
ip: "42::3",
port: 7331,
last_update: 1,
signature: "sig_pub1_1"
},
pub2: {
ip: "42::2",
port: 7331,
last_update: 2,
signature: "sig_pub2_2"
}
},
settings: { ... },
}
result:
{
hosts: {
pub1: {
ip: "42::1",
port: 7331,
last_update: 3,
signature: "sig_pub1_3"
},
pub2: {
ip: "42::2",
port: 7331,
last_update: 2,
signature: "sig_pub2_2"
}
},
settings: { ... },
}
```
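The two merge examples above can be expressed as a small function. This is a sketch under stated assumptions: signatures are assumed to have been verified before merging, and the tie-break on the `signature` string is an addition made here so that the result does not depend on argument order.

```python
def newer(a: dict, b: dict) -> dict:
    """Return the entry with the larger last_update (ties broken on signature)."""
    return max(a, b, key=lambda entry: (entry["last_update"], entry["signature"]))


def merge(a: dict, b: dict) -> dict:
    """Merge two network states: newest settings win, hosts are merged per key."""
    hosts = dict(a["hosts"])
    for key, entry in b["hosts"].items():
        hosts[key] = newer(hosts[key], entry) if key in hosts else entry
    return {"hosts": hosts, "settings": newer(a["settings"], b["settings"])}


a = {
    "hosts": {"pub1": {"ip": "42::1", "port": 7331, "last_update": 3, "signature": "sig_pub1_3"}},
    "settings": {"last_update": 2, "tld": "test", "signature": "sig2"},
}
b = {
    "hosts": {
        "pub1": {"ip": "42::3", "port": 7331, "last_update": 1, "signature": "sig_pub1_1"},
        "pub2": {"ip": "42::2", "port": 7331, "last_update": 2, "signature": "sig_pub2_2"},
    },
    "settings": {"last_update": 3, "tld": "test2", "signature": "sig3"},
}
result = merge(a, b)
```

Because the function picks a maximum per key, merging in either order, or merging a state with itself, gives the same result, which is the property the synchronization relies on.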
### Time-to-Live (TTL) and Gossip Protocol
In the Data-Mesher, all information should "decay" over time, meaning it automatically expires after a set period (this feature is not yet implemented). The **settings** field should include a configurable **Time-to-Live (TTL)**, which removes old information, such as host entries, once they exceed the specified TTL.
- This makes an attack possible in which Bob's laptop is offline for longer than the TTL (because Bob is on vacation) and another user claims its hostname.
- This means **hostnames** are only to be trusted if
A) no trusted dm-node has been compromised, and
B) the hosts are never offline for longer than the TTL.
To prevent their data from expiring, hosts must regularly send updates to other peers. However, these updates don't need to reach every peer directly. Since nodes share their entire data set during communication, information can be relayed through other nodes. As a result, a **gossip-style communication system** can be used, where information spreads gradually across the network through indirect connections, keeping all nodes up to date without overwhelming the network.
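Since the TTL feature is not implemented yet, the following is only a sketch of how expiry could work. It assumes the TTL is given in seconds and compared against each host's `last_seen` value; the function name and parameters are hypothetical.

```python
def expire_hosts(network: dict, now: int, ttl: int) -> dict:
    """Return a copy of the network without hosts last seen more than ttl seconds ago."""
    fresh = {
        key: host
        for key, host in network["hosts"].items()
        if now - host["last_seen"] <= ttl
    }
    return {**network, "hosts": fresh}


network = {
    "hosts": {
        "green-key": {"last_seen": 1731199277},
        "mors-key": {"last_seen": 1731180076},
    },
    "settings": {},
}
# With a one-hour TTL, only the host seen within the last hour survives.
pruned = expire_hosts(network, now=1731200000, ttl=3600)
```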
### Joining Multiple DM-Networks
Here's an example of how you can configure Data-Mesher in NixOS to join two different DM-Networks. In this setup, the `.qubasa.clan` domain gets higher priority than the `.clan` domain.
What does this mean? If both networks have a `home` hostname (e.g., `home.qubasa.clan`), the one from `.qubasa.clan` will take precedence over the one from `.clan`. The network with the **lower priority number** wins in the case of conflicts (closer to 0).
Here's the Nix configuration:
```nix
services.data-mesher = {
enable = true; # Enable Data-Mesher service
interface = "<mesh_vpn>"; # The network interface Data-Mesher will use
openFirewall = true; # Ensure the firewall allows Data-Mesher traffic
# Define the DM-Networks to join
networks = {
"qubasa.clan" = {
pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKTi2h4X56CzjeY4L1INl1d5JvYwh7HpaSuUlD33RhnY"; # Public key for the qubasa.clan network
priority = 1; # Higher priority (lower number = higher priority)
bootstrapPeers = [
"http://[fd27:bb88:dbef:737b:3799:9318:aa77:ec12]:7331" # A peer within this network to bootstrap from
];
};
"clan" = {
pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJ1UM2Cza+GIRyuB9C3NqY0pSWnGC4DzmQOcWOa4SafV"; # Public key for the clan network
priority = 2; # Lower priority (higher number = lower priority)
bootstrapPeers = [
"http://[fd16:aa77:dbef:737b:3799:9316:aa77:dbef]:7331" # A peer within this network to bootstrap from
];
};
};
};
```
### Key Points:
- **Priority System**: `qubasa.clan` has a **priority** of `1`, while `clan` has `2`. So, if there's a conflicting hostname, `qubasa.clan` wins since it has the smaller priority number.
- **Bootstrap Peers**: Each network defines one or more **bootstrap peers**. These are known nodes that help your node join the network by sharing the network's data.
- **Public Keys**: Each network you join requires its **public key** to verify the network's integrity.
- Note: We use floats here because a new value can practically always be placed between two existing ones (for example, `1.5` between `1` and `2`). This means a higher-priority network can be added anywhere without having to change the other networks' priorities.
This setup allows you to join and sync with multiple peer-to-peer DM-Networks and ensures the network with the defined priority takes precedence where necessary.
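The priority rule can be sketched in a few lines. The data layout below (a list of networks, each with a `priority` and a map of short hostnames to IPs) and the IP addresses are assumptions made for illustration only.

```python
from typing import Optional


def resolve_short_name(name: str, networks: list[dict]) -> Optional[str]:
    """Return the IP for a short hostname from the network with the lowest priority number."""
    candidates = [net for net in networks if name in net["hosts"]]
    if not candidates:
        return None
    winner = min(candidates, key=lambda net: net["priority"])
    return winner["hosts"][name]


networks = [
    {"name": "clan", "priority": 2.0, "hosts": {"home": "fd16::1", "nas": "fd16::2"}},
    {"name": "qubasa.clan", "priority": 1.0, "hosts": {"home": "fd27::1"}},
]
home_ip = resolve_short_name("home", networks)  # taken from qubasa.clan
nas_ip = resolve_short_name("nas", networks)    # only clan knows this name
```

A network added later with priority `1.5` would rank between the two without either existing priority being changed.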
### DNS Output (`dns.json`)
The Data-Mesher generates a file called `dns.json`, which contains all valid and verified hostnames in a simple, flat JSON format. This file is designed to be easy to consume by other applications or services that need to reference known network hosts.
Here's an example of a `dns.json` file:
```json
{"hostname": "mors.nether", "ip": "fdcc:c5da:5295:c853:d499:93e9:c5fc:c8b5"}
{"hostname": "green.nether", "ip": "fdcc:c5da:5295:c853:d499:937c:31a2:1e86"}
```
Each entry consists of:
- **hostname**: A valid, verified hostname in the network.
- **ip**: The corresponding IP address (IPv6) of the hostname.
This file provides an up-to-date list of hostnames that are known to be good and usable within the network.
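As a rough illustration of how these lines relate to `data.json`, the sketch below joins each announced hostname with the network's `tld` and writes one JSON object per line, matching the example above. It skips the validity and verification checks described earlier, and `dns_lines` is a hypothetical helper name.

```python
import json


def dns_lines(network: dict) -> list[str]:
    """Build dns.json lines (one JSON object per line) from a single dm-network."""
    tld = network["settings"]["tld"]
    lines = []
    for host in network["hosts"].values():
        for name in host["hostnames"]:
            lines.append(json.dumps({"hostname": f"{name}.{tld}", "ip": host["ip"]}))
    return sorted(lines)


network = {
    "hosts": {
        "key-green": {
            "hostnames": {"green": {"hostname": "green"}},
            "ip": "fdcc:c5da:5295:c853:d499:937c:31a2:1e86",
        },
        "key-mors": {
            "hostnames": {"mors": {"hostname": "mors"}},
            "ip": "fdcc:c5da:5295:c853:d499:93e9:c5fc:c8b5",
        },
    },
    "settings": {"tld": "nether"},
}
lines = dns_lines(network)
print("\n".join(lines))
```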
---
### Signing timestamps by trusted nodes
It should probably look like this:
```json
{
"xxx": {
"hosts": {
"123": {
"hostnames": {
"123": {
"xyz": {
"signed_at": 128389182,
"signature": "signedbyxyz"
}
}
},
"last_seen": 11,
"signature": "signedby123withtime11"
},
"456": {
"hostnames": {
"456": {
"xyz": {
"signed_at": 1231881298,
"signature": "signedbyxyz"
}
}
}
}
},
"settings": {
"host_signing_keys": [
"xyz"
]
}
}
}
```
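Assuming each hostname maps attesting keys to `{signed_at, signature}` objects as proposed above, a node could decide which hostnames are active roughly like this. The sketch checks only whether the attesting key is listed in `host_signing_keys`; verifying the signatures themselves is left out, and the `evil` signer is made up for the example.

```python
from typing import Optional


def earliest_trusted_attestation(attestations: dict, trusted_keys: list[str]) -> Optional[int]:
    """Earliest signed_at among attestations made by trusted keys, or None."""
    times = [a["signed_at"] for signer, a in attestations.items() if signer in trusted_keys]
    return min(times) if times else None


def active_hostnames(network: dict) -> list[tuple[str, str, int]]:
    """List (host key, hostname, attested time) for every hostname with a trusted attestation."""
    trusted = network["settings"]["host_signing_keys"]
    active = []
    for host_key, host in network["hosts"].items():
        for name, attestations in host["hostnames"].items():
            signed_at = earliest_trusted_attestation(attestations, trusted)
            if signed_at is not None:
                active.append((host_key, name, signed_at))
    return active


network = {
    "hosts": {
        "123": {"hostnames": {"123": {"xyz": {"signed_at": 128389182, "signature": "signedbyxyz"}}}},
        # Attested only by a key that is not in host_signing_keys, so it stays inactive.
        "456": {"hostnames": {"456": {"evil": {"signed_at": 1, "signature": "signedbyevil"}}}},
    },
    "settings": {"host_signing_keys": ["xyz"]},
}
active = active_hostnames(network)
```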
### Bootstrapping
Currently, the Data-Mesher starts by using a list of URLs to pull bootstrap data from. These URLs can point to peers in the network or just a web server that makes `data.json` available.
If a node suddenly finds itself isolated because all other hosts have "decayed" or become unreachable, it should go back to the bootstrap peers to retrieve fresh network information.
It would also be useful to have a feature that allows you to add new peers to a running Data-Mesher instance. For example, you could say, "Hey, connect to this peer and get its data" while the system is still running, without needing to restart or manually reconfigure everything.
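A minimal sketch of the bootstrap step is shown below, assuming each configured URL returns the `data.json` document directly. The function names are hypothetical, and the fetch function is injectable so the logic can be tried without a network.

```python
import json
from typing import Callable, Iterable, Optional
from urllib.request import urlopen


def http_fetch(url: str) -> bytes:
    """Fetch raw bytes from a bootstrap URL."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def bootstrap(urls: Iterable[str], fetch: Callable[[str], bytes] = http_fetch) -> Optional[dict]:
    """Return the data from the first bootstrap URL that answers with valid JSON."""
    for url in urls:
        try:
            return json.loads(fetch(url))
        except (OSError, ValueError):
            # Unreachable peer or malformed response: try the next URL.
            continue
    return None


def fake_fetch(url: str) -> bytes:
    """Stand-in for the network: the first peer is down, the second one answers."""
    if "peer1" in url:
        raise OSError("unreachable")
    return b'{"hosts": {}, "settings": {"tld": "nether"}}'


data = bootstrap(["http://peer1.invalid:7331", "http://peer2.invalid:7331"], fetch=fake_fetch)
```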
---
### Future Ideas
Ideas that are not currently on our roadmap but we want to see in the future.
#### **Host Schema**
It would be great if we could make the host schema more flexible, so it isn't tied specifically to hostnames. Hostnames could then just become one possible implementation of this schema. Here's an idea of how the settings could look:
```json
{
"settings": {
"hostSchema": {
"hostnames": {
"schema": "<some JSON schema>",
"merge": "oldest_wins"
}
}
}
}
```
However, this doesn't cover the requirement of **signing hosts** (i.e., ensuring hostnames are signed by trusted peers). One possible solution could be to require signatures by trusted third-party hosts for all data in these schemas, since we're already using them for merging as well.
#### **Data Schema**
It would also be interesting to expand the current schema (which supports **hosts** and **settings**) with a third top-level key, like **data**. This new data section would have its own schema defined in the settings, along with a merge function. Specific hosts will be authorized to modify this data, meaning nodes can verify if the data has been changed by an allowed writer.
To accomplish this, we'd need a protocol to verify the **signatures** and the **origin** of the data (this part hasn't been fully specified yet, but I'm happy to discuss ideas!).
Here's a rough idea of how the structure could look:
```json
{
"hosts": {
"123": { ... },
"456": { ... }
},
"settings": {
"dataSchema": {
"x": {
"schema": "<some JSON schema>",
"merge": "newest_wins",
"allowed_writers": [
"123"
]
}
}
},
"data": {
"x": {
"value": "10.42.0.1",
"signed_at": 1231212,
"signature": "signedby123"
}
}
}
```
In this example:
- The **dataSchema** would define the rules for the data field, including how it should be merged (e.g., **newest-wins**) and which hosts are **allowed to write** the data (e.g., only host "123").
- The **data** section includes information such as the **value**, when it was **signed**, and the **signature** to verify who modified it.
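
Since this part is not specified yet, the following is only one possible reading of the idea. It assumes the writer's identity has already been established by verifying the signature, and it maps the `merge` names from the schema to small functions; `apply_write` and `MERGE_STRATEGIES` are hypothetical names.

```python
MERGE_STRATEGIES = {
    "newest_wins": lambda old, new: max(old, new, key=lambda e: e["signed_at"]),
    "oldest_wins": lambda old, new: min(old, new, key=lambda e: e["signed_at"]),
}


def apply_write(state: dict, key: str, entry: dict, writer: str) -> dict:
    """Apply a signed data entry if the writer is allowed by the key's schema."""
    schema = state["settings"]["dataSchema"].get(key)
    if schema is None or writer not in schema["allowed_writers"]:
        return state  # unknown key or unauthorized writer: ignore the write
    merge = MERGE_STRATEGIES[schema["merge"]]
    current = state["data"].get(key)
    chosen = entry if current is None else merge(current, entry)
    return {**state, "data": {**state["data"], key: chosen}}


state = {
    "settings": {"dataSchema": {"x": {"merge": "newest_wins", "allowed_writers": ["123"]}}},
    "data": {"x": {"value": "10.42.0.1", "signed_at": 1231212, "signature": "signedby123"}},
}
update = {"value": "10.42.0.2", "signed_at": 1231300, "signature": "signedby123"}
accepted = apply_write(state, "x", update, writer="123")
rejected = apply_write(state, "x", update, writer="456")
```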


@@ -0,0 +1,160 @@
+++
title= "Introducing Data-Mesher: A CRDT-Based Peer-to-Peer Database"
subline= "DNS was designed for resilience, built to function even during catastrophic failures. But despite its distributed nature, it places control in the hands of..."
date = 2024-11-18T09:08:10+02:00
draft = false
authors = [ "Lassulus", "Qubasa" ]
tags = ['Dev Report']
+++
**one authority per domain**. This centralization limits flexibility in networks where multiple, independent nodes need to contribute and manage data collaboratively.
**Data-Mesher** changes this. Running on every node in a peer-to-peer network, Data-Mesher functions as a fully decentralized database using **CRDTs (Conflict-Free Replicated Data Types)** to resolve conflicts. What makes it unique is that **any node—trusted or untrusted—can add data** without compromising others'. Multiple authorities can coexist, allowing for decentralized control of key information, such as DNS-like hostnames and application settings.
In Data-Mesher, the network, not a single authority, resolves conflicts, making it ideal for systems that need true multi-party collaboration.
---
### How Does Data-Mesher Work?
At its core, Data-Mesher runs on every participating node (host) within a distributed system, allowing each node to independently store, update, and sync data across other nodes. It uses a **CRDT**-based approach to resolve conflicts, ensuring that all nodes eventually converge on the same dataset.
What makes Data-Mesher unique is its capability to allow untrusted nodes to contribute data without compromising the integrity of other nodes' contributions. This is particularly useful in peer-to-peer environments, where nodes might not have complete trust in each other but still require collaborative data sharing.
---
### The Basic Structure of a dm-network
The data in Data-Mesher is grouped into what we call a **dm-network**, which is primarily a key-value structure. In a very basic form, the dm-network is simply a group of hosts (nodes) and **settings** bundled under a shared public key (an ed25519 key). Nodes in the dm-network collaborate by announcing DNS hostnames and other settings relevant to application configuration.
Here are key elements of a dm-network:
- **Settings**
Protected by a digital signature only accessible to admins with the correct private key. Settings control policy, which could include anything from banning specific keys to establishing rules for hostname overrides.
- **Hosts**
Acts as the storage for **hostnames** that nodes contribute. Each hostname entry is signed, and the node uses its ed25519 key pair to generate this signature. In case two hosts try to register the same hostname, Data-Mesher selects the earliest signed entry.
---
### Key Mechanisms of Data Synchronization
1. **Node Communication and Data Sync**
When two nodes communicate, they exchange their entire dataset in the form of a `data.json` file. This file contains all known settings and hosts, where each entry is signed and timestamped. The **merge** function ensures that both nodes end up with identical data, regardless of the order in which changes are received. The merge strategy for configuration settings is simple: whichever entry has the most recent `last_update` timestamp wins.
2. **Timestamp Attestation**
To prevent timestamp manipulation (such as backdating a record so that it takes precedence over older entries), Data-Mesher relies on **timestamp attestation servers**. RFC 3161-based services like [FreeTSA](https://www.freetsa.org/index_de.php), or alternatively [OpenTimestamps](https://opentimestamps.org/), are options for obtaining cryptographic timestamps. These services accept payloads (e.g., the CRDT entries) and return cryptographically signed timestamps, which makes tampering detectable.
3. **Handling Invalid Hostnames**
If a node submits a hostname without a valid timestamp, this hostname will be stored but marked as inactive until it can be verified by a timestamp attestation service. Only once validation is complete and the hostname carries a valid signed timestamp will it be activated within the network.
4. **Invalid Signatures**
Hostnames or timestamp entries backed by invalid signatures are rejected upfront and do not propagate through the network, adding a layer of security to prevent malicious data from infiltrating the network.
---
### Use Case: Distributed DNS and Hostname Management
One of the main applications implemented in Data-Mesher is in managing DNS-like functionality for decentralized networks. Nodes in the dm-network can announce DNS hostnames, and these are propagated across the peer-to-peer system without requiring centralized management.
- Each node registers its own hostnames.
- Host entries are shared across nodes in a **gossip-style** protocol to ensure they're up-to-date without overloading network throughput. Nodes communicate indirectly and share updates with the nodes they connect to, spreading the data gradually rather than all at once.
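To illustrate why indirect exchange is enough, here is a small simulation. It is a toy model, not Data-Mesher's protocol: each node holds a `{host_key: last_update}` view, nodes sit in a line and sync only with their right-hand neighbour, and a sync merges both views with a newest-wins rule.

```python
def merge_views(a: dict, b: dict) -> dict:
    """Per-key newest-wins merge of two {host_key: last_update} views."""
    merged = dict(a)
    for key, stamp in b.items():
        merged[key] = max(stamp, merged.get(key, stamp))
    return merged


def gossip_until_converged(views: list[dict], max_rounds: int = 100) -> int:
    """Sync neighbouring nodes round by round; return the number of rounds needed."""
    for round_no in range(1, max_rounds + 1):
        for i in range(len(views) - 1):
            # Both neighbours end up with the merged view after a sync.
            views[i] = views[i + 1] = merge_views(views[i], views[i + 1])
        if all(view == views[0] for view in views):
            return round_no
    raise RuntimeError("views did not converge")


# Four nodes, each initially knowing only about itself.
views = [{"n0": 1}, {"n1": 1}, {"n2": 1}, {"n3": 1}]
rounds = gossip_until_converged(views)
print(rounds, views[0])
```

Even though the first and last node never talk to each other directly, every node ends up with the same view after a few rounds.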
#### **Example `dns.json` Output**
As part of its functionality, Data-Mesher periodically generates a `dns.json` file, which contains hostname-to-IP mappings in a consumable format. Multiple services can utilize this information to route traffic within the network.
Example:
```json
{"hostname": "mors.nether", "ip": "fdcc:c5da:5295:c853:d499:93e9:c5fc:c8b5"}
{"hostname": "green.nether", "ip": "fdcc:c5da:5295:c853:d499:937c:31a2:1e86"}
```
---
### Security and Accuracy Mechanisms
One inherent challenge in any distributed network is ensuring accuracy and security without compromising on decentralization. Data-Mesher balances these demands with several meaningful mechanisms:
1. **Signature Validation**
Each data entry (whether it's a setting or hostname) must have a valid signature to be trusted and propagated across nodes. This allows nodes to validate the origin and authenticity of every piece of shared data.
2. **Reachability Checks**
When a node announces a new hostname, other nodes verify the reachability of the target IP and port before incorporating the hostname into their working configuration. The verification protocol ensures that the machine providing the hostname entry can correctly respond to a challenge tied to the relevant private key.
---
### Joining Multiple DM-Networks
Data-Mesher supports joining multiple DM-Networks, and nodes can prioritize which network has control in cases of conflicts. Here's an example of how you can configure Data-Mesher in NixOS to join two different DM-Networks.
In this setup, the `.qubasa.clan` domain gets higher priority than the `.clan` domain.
What does this mean?
If both networks have a colliding hostname (e.g., `home`, as in `home.qubasa.clan`), the one from `.qubasa.clan` will take precedence over the one from `.clan`. The network with the **lower priority number** wins in the case of conflicts (closer to 0).
Here's a Nix configuration to demonstrate this:
```nix
services.data-mesher = {
enable = true; # Enable Data-Mesher service
interface = "<mesh_vpn>"; # The network interface Data-Mesher will use
openFirewall = true; # Ensure the firewall allows Data-Mesher traffic
# Define the DM-Networks to join
networks = {
"qubasa.clan" = {
pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKTi2h4X56CzjeY4L1INl1d5JvYwh7HpaSuUlD33RhnY"; # Public key for the qubasa.clan network
priority = 1; # Higher priority (lower number = higher priority)
bootstrapPeers = [
"http://[fd27:bb88:dbef:737b:3799:9318:aa77:ec12]:7331" # A peer within this network to bootstrap from
];
};
"clan" = {
pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJ1UM2Cza+GIRyuB9C3NqY0pSWnGC4DzmQOcWOa4SafV"; # Public key for the clan network
priority = 2; # Lower priority (higher number = lower priority)
bootstrapPeers = [
"http://[fd16:aa77:dbef:737b:3799:9316:aa77:dbef]:7331" # A peer within this network to bootstrap from
];
};
};
};
```
#### Key Points:
- **Priority System**: `qubasa.clan` has a priority of `1`, while `clan` has priority `2`. If there's a conflicting hostname, Data-Mesher will resolve it in favor of the `qubasa.clan` network since it has a lower priority number.
- **Bootstrap Peers**: Each network is associated with one or more bootstrap peers, which help your node join that network by sharing an initial dataset. In this example, two peers are provided, one from the `qubasa.clan` and another from the `clan` domain.
- **Public Keys**: Each network you join requires a valid public key. The key is used to verify that the data received belongs to that specific network and has not been tampered with.
- **Floating Point Priorities**: Data-Mesher uses floating-point numbers for priorities. This allows you to insert new networks at any level of priority without reworking the entire priority hierarchy. For example, you could add a network with priority `1.5` between `qubasa.clan` (priority `1`) and `clan` (priority `2`).
This setup allows nodes to coordinate and sync across multiple DM-Networks seamlessly, while ensuring that conflicts are handled predictably based on each network's priority.
---
### Bootstrapping and Network Resurrection
In case a node finds itself isolated from the rest of the dm-network (perhaps due to network partitions or decayed peers), it can bootstrap itself by pulling static data from a list of pre-configured URLs. These bootstrap URLs can point either to other nodes in the dm-network or a simple web server serving a static `data.json` file.
---
### What's Next? Future Enhancements
#### **Customizable Data Schemas**
While the current focus is on hostnames and settings, we foresee broader applications, such as network configurations or custom application settings. This would be achieved by extending the current schema definitions to allow categories beyond just "hosts" and "settings".
This could involve creating more flexible merge criteria (newer-wins, append-only, etc.) and defining writers and permissions for specific data keys.
#### **Flexible Host Object Definitions**
At the moment, the "host" object in the schema is somewhat inflexible, focusing primarily on DNS information. A future enhancement might include expanding this definition into a more abstract schema, allowing for various other types of node-related information beyond DNS settings.
---
### From Theory to Practice
Data-Mesher isn't just a theoretical construct—it can be a practical solution for networks that require decentralized peer coordination, conflict-free updates, and scalable distribution of discrete configurations like DNS names.
For those interested in deploying decentralized networks whose configuration data is designed to expire over time (via a TTL), Data-Mesher offers a foundational approach that prioritizes security, consistency, and flexibility in equal measure.

Binary file not shown.
