Some antivirus software monitors key files, such as important binaries or the kernel, to verify they have not been modified. This technique is called file integrity monitoring. The general idea is to store hashes of files and periodically re-hash them to make sure nothing has changed.
Let's look at how this might look in Python.
import hashlib
import os
import pickle

def monitor_files(path, monitor_log_file="log.pickle",
                  hashes=(hashlib.sha256, hashlib.blake2b, hashlib.md5, hashlib.sha3_512)):
    if os.path.exists(monitor_log_file):
        with open(monitor_log_file, 'rb') as handle:
            monitor_log = pickle.load(handle)
    else:
        monitor_log = {}
    if not os.path.exists(path):
        raise FileNotFoundError(f"Path does not exist: {path}")
    new_log = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            # Read once, feed every hash the same bytes.
            with open(file_path, "rb") as f:
                data = f.read()
            file_hashes = [hash_(data).digest() for hash_ in hashes]
            # Key by full path so same-named files in different
            # directories don't collide.
            if file_path in monitor_log:
                if file_hashes != monitor_log[file_path]:
                    print(f"hash mismatch for file {file_path}")
            else:
                print(f"new file {file_path}")
            new_log[file_path] = file_hashes
    with open(monitor_log_file, 'wb') as handle:
        pickle.dump(new_log, handle)
Notes about this code:
- There should be a size check. The file system can report a file's size cheaply, and if expected and actual sizes differ, it's immediately clear that the file has been modified, without hashing a single byte.
- Whole files are being hashed. This is fine for small files, but for big files it's better to store a series of small hashes and short-circuit execution when a mismatch is found. For example, there's no need to hash through a 100GB file if the first byte is wrong.
- Multiple hashes are being computed. This makes a hash collision attack very difficult, as the probability of finding an input that satisfies multiple hashes simultaneously is very low, even when one of them (like MD5) is individually broken.
- Using pickle is a bad idea. Firstly, pickle is susceptible to arbitrary code execution, so if this code is running in an untrusted environment, the serde is an attack vector (however minor). It's better to use a WORM storage medium or ship valid hashes to a trusted remote machine. Ideally the trusted hash is generated on a trusted machine to begin with.
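As a sketch of the pickle point, one safer option is a plain JSON log of hex digests, which cannot execute code when loaded. The helper names below (`hash_file`, `save_log`, `load_log`) are chosen here for illustration, not taken from the code above:

```python
import hashlib
import json
import os

def hash_file(path, algorithms=("sha256", "blake2b")):
    """Return a dict of hex digests for one file."""
    with open(path, "rb") as f:
        data = f.read()
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}

def save_log(log, log_file="log.json"):
    # JSON is inert data: loading it can't run attacker-supplied code.
    with open(log_file, "w") as handle:
        json.dump(log, handle, indent=2)

def load_log(log_file="log.json"):
    if not os.path.exists(log_file):
        return {}
    with open(log_file) as handle:
        return json.load(handle)
```

This doesn't remove the need for trusted storage of the log itself, but it closes the deserialization attack vector.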
Here’s a version that does the incremental hashing:
import hashlib
import os
import pickle
from dataclasses import dataclass

@dataclass
class FileHashes:
    whole_file: bytes
    chunks: list[bytes]
    window_size: int
    file_size: int

def rolling_fim(file_path: str, window_size: int = 256,
                monitor_log_file="rolling_log.pickle", hash_=hashlib.blake2b):
    # Check existence first: os.path.getsize below would otherwise
    # raise before we reach this guard.
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Path does not exist: {file_path}")
    if os.path.exists(monitor_log_file):
        with open(monitor_log_file, 'rb') as handle:
            monitor_log = pickle.load(handle)
    else:
        print("warning: no old file integrity log exists.")
        monitor_log = FileHashes(b"", [], window_size, os.path.getsize(file_path))
    if monitor_log.window_size != window_size:
        print("warning: window size changed, expect all comparisons to fail.")
    new_log = FileHashes(b"", [], window_size, os.path.getsize(file_path))
    if new_log.file_size != monitor_log.file_size:
        print("file size changed. Files do not match.")
        return
    m = hash_()  # whole-file hash
    chunk_num = 0
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(window_size)
            if not chunk:
                break
            m.update(chunk)
            n = hash_()  # per-chunk hash
            n.update(chunk)
            new_log.chunks.append(n.digest())
            # Short-circuit: stop hashing as soon as one chunk mismatches.
            if chunk_num < len(monitor_log.chunks) and monitor_log.chunks[chunk_num] != new_log.chunks[chunk_num]:
                print(f"mismatch found in chunk {chunk_num}")
                break
            chunk_num += 1
    new_log.whole_file = m.digest()
    if monitor_log.whole_file != new_log.whole_file:
        print("mismatch in whole file hash")
    with open(monitor_log_file, 'wb') as handle:
        pickle.dump(new_log, handle)
Notes on this code:
- Switched to a single hash, BLAKE2b, which is generally preferred for hashing large files due to its speed. Similar logic could compute multiple hashes of each chunk to make collision attacks more difficult.
- Added the file size check here: the file system reports it cheaply, so there's no reason not to.
- The code computes a series of hashes over fixed-size windows of the source file and short-circuits execution as soon as one fails.
- A whole-file hash is also computed, again to make collision attacks more difficult.
This code is for illustration purposes only, but implementing the improvements identified above would move the system closer to production readiness. Further work would be needed to stop a compromised machine from simply replaying known-good hashes, such as remote verification from a trusted host.
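As a minimal sketch of that idea, the hash log could be authenticated with an HMAC whose key lives only on a trusted machine, so a host that tampers with the log cannot forge a valid signature. The names here (`sign_log`, `verify_log`, `SECRET_KEY`) are hypothetical, and real key management is out of scope:

```python
import hashlib
import hmac
import json

# Assumption: in practice this key is held only by the trusted verifier,
# never stored on the monitored machine.
SECRET_KEY = b"replace-with-a-real-secret"

def sign_log(log: dict) -> str:
    """Serialize the log deterministically and sign it."""
    payload = json.dumps(log, sort_keys=True).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify_log(log: dict, signature: str) -> bool:
    # compare_digest avoids leaking information via timing.
    return hmac.compare_digest(sign_log(log), signature)
```

This only authenticates the log; a fully compromised machine could still lie about what it hashed, which is why the verification itself should run remotely.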