Nov 29, 2024

Automating DFIR with Velociraptor, Jupyter Notebook, and Neo4j: Part 1

In Digital Forensics and Incident Response (DFIR), automation has become critical to running investigations efficiently. Our solution uses Velociraptor, Jupyter Notebook, and Neo4j to streamline the collection and analysis of forensic artifacts. In this first part of our blog series, we discuss how we automate the collection of forensic data, store it in Neo4j, and build relationships between artifacts, setting the stage for more advanced analysis.

1. Initial step in automating DFIR using Velociraptor and Jupyter Notebook

We use Velociraptor to collect critical forensic data from endpoints. Velociraptor allows us to gather various artifacts such as system logs, browser history, and network activity, which are essential for building a comprehensive picture of potential incidents. Through the use of Jupyter notebooks, we can automate the process of querying Velociraptor and storing the resulting data in Neo4j.

Here are some of the key artifacts we collect:

Windows.System.Pslist (Running Processes)
Windows.Applications.Chrome.History (Chrome Browser History)
Windows.Network.PacketCapture (Network Packet Captures)
Windows.System.DLLs (Loaded DLLs)
Windows.Timeline.Prefetch (Prefetch Files)
Windows.Applications.Firefox.History (Firefox Browser History)
Windows.System.Amcache (Inventory of Application Executables)
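
As a rough sketch of how a notebook might request results for several of these artifacts, the helper below builds one VQL source() query per artifact name. The function name build_source_query and the character check are illustrative assumptions. The client and flow IDs are the ones used in the example later in this post, and each artifact must actually have been collected in that flow for its query to return rows.

```python
import re

# Artifact names taken from the list above
ARTIFACTS = [
    "Windows.System.Pslist",
    "Windows.Applications.Chrome.History",
    "Windows.Network.PacketCapture",
    "Windows.System.DLLs",
    "Windows.Timeline.Prefetch",
    "Windows.Applications.Firefox.History",
    "Windows.System.Amcache",
]

# Hypothetical safety check: accept only characters seen in IDs and artifact names
_SAFE = re.compile(r"[A-Za-z0-9._/]+")


def build_source_query(client_id: str, flow_id: str, artifact: str) -> str:
    """Build a VQL query that reads one artifact's results from a finished flow."""
    for value in (client_id, flow_id, artifact):
        if not _SAFE.fullmatch(value):
            raise ValueError(f"unexpected characters in {value!r}")
    return (
        "SELECT * FROM source("
        f"client_id='{client_id}', flow_id='{flow_id}', artifact='{artifact}')"
    )


# One query string per artifact, keyed by artifact name
queries = {
    name: build_source_query("C.83cbd74581a82bf1", "F.CQO6AJC27DUVC", name)
    for name in ARTIFACTS
}
```

Each string in queries could then be passed to velo_pandas.DataFrameQuery in the same way as the single query shown later in this post.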

2. Inserting Artifact Data into Neo4j

After collecting data with Velociraptor, we store each artifact’s data in Neo4j as labeled nodes. Each artifact is represented with a specific node type in Neo4j, such as DEVICE, FILE, USER, PROCESS, and URL. The data is stored with relevant fields like FilePath, PID, User, VisitTime, and IP Address.

Here are some examples of how artifact data is stored in Neo4j:

Device data: Each device is represented as a DEVICE node with properties such as HostName, OS, and MACAddress.
File metadata: Files are stored as FILE nodes, with properties like FilePath, FileHash, CreationTime, and ModificationTime.
User activities: Browser histories are captured as URL nodes for sites visited in Chrome and Firefox.
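
One way to keep these nodes from being duplicated when a notebook is re-run is to MERGE on an identifying property instead of using CREATE. The sketch below only builds the Cypher text and its parameters. The helper name merge_node_statement, the choice of HostName as the identifying property, and the sample values are assumptions for illustration, not the schema we actually deploy.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_node_statement(label, key, props):
    """Return (cypher, params) that upserts one node identified by `key`."""
    # Labels and property names are written into the query text, so validate them
    if not _IDENT.fullmatch(label) or not _IDENT.fullmatch(key):
        raise ValueError("label and key must be plain identifiers")
    if key not in props:
        raise KeyError(f"props must contain the key property {key!r}")
    cypher = (
        f"MERGE (n:{label} {{{key}: $key_value}}) "
        "SET n += $props"
    )
    return cypher, {"key_value": props[key], "props": dict(props)}


# Sample values are invented for the example
cypher, params = merge_node_statement(
    "DEVICE",
    "HostName",
    {"HostName": "WS-01", "OS": "Windows 10", "MACAddress": "00:11:22:33:44:55"},
)
```

The resulting pair can be handed to tx.run(cypher, params) inside a write transaction.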

Example: Collecting Data from Velociraptor and Inserting into Neo4j

As part of automating the DFIR process, we wrote sample code that queries Velociraptor from a Jupyter notebook and inserts the collected data into Neo4j. The following example collects data from the Generic.System.Pstree artifact and stores each row as a node in Neo4j.

from neo4j import GraphDatabase
import pandas as pd
from pyvelociraptor import velo_pandas, LoadConfigFile

# Set pandas options so wide results are displayed in full
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Load configuration file
config_path = '/home/azureuser/api.config.yaml'
try:
    config = LoadConfigFile(config_path)
except FileNotFoundError:
    print(f"Configuration file not found: {config_path}")
    config = None
except Exception as e:
    print(f"Error loading config file: {e}")
    config = None

if config:
    # Query data from Velociraptor
    try:
        query = """
        SELECT *
        FROM source(
            client_id = 'C.83cbd74581a82bf1',
            flow_id = 'F.CQO6AJC27DUVC',
            artifact = 'Generic.System.Pstree'
        )
        """
        print("Executing query...")
        result = velo_pandas.DataFrameQuery(query, config=config)
        print("Query executed, checking results...")

        if result:
            df = pd.DataFrame(result)
            print("DataFrame created successfully. DataFrame shape:", df.shape)
            if df.empty:
                print("The DataFrame is empty. No data retrieved.")
            else:
                print("Data retrieved:")
                print(df.head())  # Display the first few rows of the DataFrame

            # Connect to Neo4j
            neo4j_uri = "Enter URL"
            neo4j_username = "Enter Username"
            neo4j_password = "Enter Password"
            driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_username, neo4j_password))

            # Function to insert data into Neo4j
            def insert_data(tx, data):
                for index, row in data.head(50).iterrows():  # Limit to 50 rows
                    query = """
                    CREATE (n:Generic_System_Pstree {
                        Pid: $Pid,
                        Ppid: $Ppid,
                        Name: $Name,
                        Username: $Username,
                        Exe: $Exe,
                        CommandLine: $CommandLine,
                        StartTime: $StartTime,
                        EndTime: $EndTime,
                        CallChain: $CallChain,
                        PSTree: $PSTree
                    })
                    """
                    params = {
                        'Pid': row['Pid'],
                        'Ppid': row['Ppid'],
                        'Name': row['Name'],
                        'Username': row['Username'],
                        'Exe': row['Exe'],
                        'CommandLine': row['CommandLine'],
                        'StartTime': row['StartTime'],
                        'EndTime': row['EndTime'],
                        'CallChain': row['CallChain'],
                        'PSTree': row['PSTree']
                    }
                    tx.run(query, params)

            if not df.empty:
                # Insert data into Neo4j
                with driver.session() as session:
                    session.execute_write(insert_data, df)
                print("Data successfully inserted into Neo4j.")

            # Release the driver's connections once we are done
            driver.close()
        else:
            print("No data returned from the query.")
    except Exception as e:
        print(f"Error querying data: {e}")
else:
    print("Configuration loading failed. Please check the configuration file and try again.")
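
The example above issues one CREATE per row. A common alternative, sketched here under assumptions, is to clean the rows first and send them in a single UNWIND statement. Neo4j properties cannot hold maps, so the helper serializes nested values to JSON strings (columns such as CallChain or PSTree may contain them, depending on the artifact's output) and turns NaN into null. The names BATCH_INSERT, clean_value, and prepare_rows are hypothetical, and the sample row is invented.

```python
import json
import math

# One statement creates a node per element of the $rows parameter
BATCH_INSERT = """
UNWIND $rows AS row
CREATE (n:Generic_System_Pstree)
SET n = row
"""


def clean_value(value):
    """Convert a value into something Neo4j can store as a property."""
    if isinstance(value, float) and math.isnan(value):
        return None  # missing values from pandas show up as NaN
    if isinstance(value, (dict, list, tuple)):
        return json.dumps(value, default=str)  # flatten nested data to text
    return value


def prepare_rows(records, limit=50):
    """Clean up to `limit` row dicts, for example from df.to_dict('records')."""
    return [
        {key: clean_value(value) for key, value in record.items()}
        for record in records[:limit]
    ]


# Invented sample row standing in for one line of Pstree output
rows = prepare_rows([
    {"Pid": 4, "Ppid": 0, "Name": "System",
     "EndTime": float("nan"), "CallChain": ["System"]},
])
```

Inside a write transaction the batch would be sent with tx.run(BATCH_INSERT, rows=rows). Properties set to null are simply not stored on the node.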

3. Building Relationships Between Artifact Data

To fully leverage the power of Neo4j, we build relationships between different artifact nodes. These relationships allow us to see how various elements like files, users, devices, and network connections interact with each other. This is particularly useful for incident response as it helps map out the sequence of events during an attack.

We have implemented the following relationships:

DEVICE to FILE: HAS_FILE
DEVICE to USER: USED_BY
FILE to URL: LINKED_TO
FILE to IP_ADDRESS: ACCESS_BY
USER to PROCESS: STARTED
REGISTRY to FILE: REFERENCES
PROCESS to IP_ADDRESS: NETWORK_ACTIVITY

These relationships allow investigators to uncover patterns and connections that may indicate malicious activity. The following Cypher queries illustrate how we build them:

Linking Devices to Files

MATCH (d:DEVICE)
MATCH (f:FILE)
WHERE d.HostName IS NOT NULL AND f.FilePath IS NOT NULL
MERGE (d)-[:HAS_FILE]->(f)
RETURN d, f

Linking Users to Processes

MATCH (u:USER)
MATCH (p:PROCESS)
WHERE u.Name IS NOT NULL AND p.ProcessName IS NOT NULL
MERGE (u)-[:STARTED]->(p)
RETURN u, p

Linking Files to IP Addresses (Network Activity)

MATCH (f:FILE)
MATCH (ip:IP_ADDRESS)
WHERE f.FilePath IS NOT NULL AND ip.Laddr IS NOT NULL
MERGE (f)-[:ACCESS_BY]->(ip)
RETURN f, ip

Linking Processes to Network Activity

MATCH (p:PROCESS)
MATCH (i:IP_ADDRESS)
WHERE p.CommandLine CONTAINS i.Laddr OR p.CommandLine CONTAINS i.Raddr
MERGE (p)-[:NETWORK_ACTIVITY]->(i)
RETURN p, i

Linking Registry Entries to Files

MATCH (reg:REGISTRY)
MATCH (f:FILE)
WHERE reg.AppCompatPosition IS NOT NULL AND f.FilePath IS NOT NULL
MERGE (reg)-[:REFERENCES]->(f)
RETURN reg, f

By structuring the data this way, we can easily trace the interactions between systems, users, and files, allowing us to uncover the chain of events leading to a security incident.
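
Most of the queries above pair nodes purely on non-null properties, so every qualifying node on one side is linked to every qualifying node on the other. When the graph holds data from more than one host, a join condition narrows this to related pairs. The sketch below generates such a query. build_link_query is a hypothetical helper, and it assumes both node types carry a shared property (here an invented HostName property on FILE nodes), which the schema described earlier does not guarantee.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_link_query(src_label, rel_type, dst_label, src_key, dst_key):
    """Build a MERGE query that links only nodes whose key properties are equal."""
    # All five names are written into the query text, so validate them first
    for name in (src_label, rel_type, dst_label, src_key, dst_key):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    return (
        f"MATCH (a:{src_label}), (b:{dst_label}) "
        f"WHERE a.{src_key} = b.{dst_key} "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


# Link each device only to files recorded with the same host name (assumed property)
query = build_link_query("DEVICE", "HAS_FILE", "FILE", "HostName", "HostName")
```

Because comparing a missing property yields null in Cypher, nodes that lack the key property are skipped without a separate IS NOT NULL check.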

4. Visualizing and Querying Data in Neo4j

One of the biggest advantages of using Neo4j is the ability to visualize relationships between entities. Using Cypher queries, we can map out the interactions between different nodes and their relationships.

For instance, to visualize up to 2,500 relationships along with the nodes they connect, we can run the following query:

MATCH (n)-[r]->(m)
RETURN n, r, m
LIMIT 2500

This gives an overview of how the stored artifacts relate to one another, helping us quickly spot potential issues or attack vectors. Nodes that have no relationships do not appear in the result.
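
On a large graph, a narrower query is often easier to read than the full overview. As an illustration only, the sketch below builds a parameterized query for the neighbourhood of a single DEVICE node. The helper name device_neighbourhood_query, the two-hop default, the result limit, and the host name are all assumptions.

```python
def device_neighbourhood_query(host_name, hops=2, limit=200):
    """Return (cypher, params) for paths of up to `hops` steps around one device."""
    hops = int(hops)
    if not 1 <= hops <= 5:
        raise ValueError("hops must be between 1 and 5")
    # The hop count is written into the pattern text, so it is validated above;
    # the host name and limit travel as query parameters
    cypher = (
        "MATCH path = (d:DEVICE {HostName: $host})"
        f"-[*1..{hops}]-(m) "
        "RETURN path LIMIT $limit"
    )
    return cypher, {"host": host_name, "limit": limit}


# Invented host name for the example
cypher, params = device_neighbourhood_query("WS-01")
```

Keeping the hop count small matters here, since the number of paths grows quickly with each additional hop.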

Stay Tuned for Part 2!

In Part 2, we’ll dive into how we extract data from Neo4j and utilize LLMs (Large Language Models) to create a Retrieval-Augmented Generation (RAG) framework. We will show how we index the data, generate dynamic queries, and prepare the data for training and analysis using LLMs.