Understand the World

Accidental Complexity: Files

September 16, 2018

When I wrote about ideas and competition, I said: “you can beat large companies by spotting accidental complexity and avoiding it.”

Pick a problem and focus. That’s what every startup book says. And they are right. Scope creep has killed lots of products and companies.

But that’s not enough. These days you have large, agile competitors like Amazon that consist of small, startup-like projects, exposed to the world as if they were independent companies.
When they notice your Super Solution taking off, they just spin up an internal project that does the same thing, but better.

So it’s important to apply this principle everywhere, beyond just product design. Identify and eliminate the complexity that existed before you even started.

An Example

Let’s start with something simple. Like files. Yes, those files.

Everybody has used them for decades, so they must be bulletproof, right? But for some reason, Apple tried to get rid of them in iOS. Let’s look closer.

What Files Do

Why do you even need files? To store photos, docs, apps, you might say.

Oh, I hear “in Unix, everything is a file!” Great! So we also have filesystems listing processes, devices, and so on. Next to directories, drives, repositories, remote filesystems… Things are getting complicated.

Unfortunately, files mix two completely unrelated concepts: data storage and API (a way for apps to talk to each other).

Files as an API

You’ve probably never thought of files as an API. But let’s take a look:

There’s a file on your computer that lists “apps that can open JPEG images.” Finder checks it when you double-click that cat picture.
There are “lock files,” created to avoid conflicting changes, “dotfiles” where thumbnails or version info is stored, and so on.
File extensions (“.jpg,” “.pdf”) tell other apps what file format to expect.
All kinds of file-like objects in Linux and Mac OS
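
Two of the bullets above, lock files and file extensions, can be sketched in a few lines of Python. This is only an illustration built on the standard library; the names `try_lock`, `guess_format`, `report.lock` and `cat.jpg` are made up for the example. Note that in both cases the "message" between apps is carried by a file's existence or its name, not by its contents.

```python
import mimetypes
import os
import tempfile
from typing import Optional


def try_lock(path: str) -> bool:
    """Create a lock file atomically; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if somebody else created the file first.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def guess_format(filename: str) -> Optional[str]:
    """Guess the content type from the extension alone, never from the bytes."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


with tempfile.TemporaryDirectory() as workdir:
    lock_path = os.path.join(workdir, "report.lock")  # hypothetical name
    first_attempt = try_lock(lock_path)   # we create the lock file
    second_attempt = try_lock(lock_path)  # the lock file is already there

jpeg_type = guess_format("cat.jpg")  # hypothetical file, it need not exist
```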

Problems

Mostly, filesystems just work. Except when their functions interact in unexpected ways.

I want this movie in the “Best of 2015”, “Fiction” and “Already Watched” folders at the same time. How do I do that? Oh, shortcuts!

Oops, you sent me a shortcut, not the file itself.
Ever received Photoshop or PDF files missing fonts?
“How do I attach this folder to e-mail?”
“Here are my edits to your proposal.” “I have edited it already. What exactly did you change?”

Multiple billion-dollar companies exist that answer just this one question (Dropbox, GitHub, Google Docs, to name a few).
Database backup: a giant file which was changed just a bit, and you have to take a copy, quick. Or fiddle with complicated database replication mechanisms.
This file cannot be deleted because it’s used by another program (which program?)
(I have dozens of examples, but this post is way too long already).

When Apple designed the iPhone, they tried to make computing accessible to everybody, not just geeks and trained engineers. For users on the go, battery running out, without any manuals, without a keyboard to type a search query.

And while the problems listed above are rare, with millions of users, hundreds of people will run into them every minute. No wonder Apple tried to steer clear of this mess.

How did we get there?

It’s not that filesystem designers are stupid. They’re brilliant people. But you can’t solve a problem until you understand that it is a problem.

The complexity accumulated slowly:

Disks were split into parts with names, for convenience;
Larger disks appeared, and people wanted to organize files into directories;
To separate files of different users, access rights were born;
OS services were exposed as files to reuse some existing tools;
Networks proliferated and people started working with data concurrently, so files had to be protected from accidental concurrent access, as neither apps nor filesystems were designed for that;
Files had to be encrypted, so hackers couldn’t dig old disks out of the trash bin;
Files had to be archived, to be backed up as a single file or sent over the Internet; and so on.

…
Oops.

At this stage, when you’re about to add another feature to the system, it’s not enough to think about the feature itself.

If you’re lucky, you need to think about every other function, too.
If not - you need to think about every combination of features out there.
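
To put rough numbers on that, take the seven steps from the list above as already in place. In the lucky case, a new feature has to be checked against each of them separately. In the unlucky case, it has to be checked against every combination of them. A quick count in Python (the feature labels are shorthand for that list, not names from any real system):

```python
from itertools import combinations

features = [
    "named partitions", "directories", "access rights", "OS services as files",
    "concurrency protection", "encryption", "archiving",
]

# Lucky case: the new feature interacts with each existing feature separately.
lucky = len(features)

# Unlucky case: it can interact with any combination of existing features,
# including the empty one (the new feature on its own).
unlucky = sum(
    1
    for size in range(len(features) + 1)
    for _ in combinations(features, size)
)

print(lucky, unlucky)  # 7 128
```

Every further feature doubles the second number.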

Nobody has the time or imagination to do this, of course. That’s why we end up with all of these half-baked tools, dozens of similar services and thousands of bugs.

Fresh start

As I said, these people were smart. They added features one by one, because at that time, in their situation, it was the easiest and the smartest thing to do.

But let’s see what happens if we start with all these requirements together.

Apps on multiple devices, used by multiple users to create, store, organize, and collaborate.
Going down, we have pieces of data, organized, annotated, modified and shared.
A directory, a compound document, or a catalog (like “my selfie photos”) is also a piece of data. It refers to other pieces, and this is also where we name those pieces and annotate them with things like “document type” or “author.”

Good, now we have to worry only about the contents! Let’s remove the names altogether and refer to pieces by their hash (a cryptographic thumbprint of the data).
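
Here is a minimal sketch of that idea in Python. It assumes SHA-256 as the hash and an in-memory dictionary as storage; `BlobStore` and the JSON layout of the catalog are invented for the illustration. The catalog is just another piece: it holds the names and annotations and points at other pieces by hash.

```python
import hashlib
import json
from typing import Dict


class BlobStore:
    """In-memory store where each piece is addressed by its SHA-256 hash."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data  # storing the same bytes twice changes nothing
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # The address doubles as an integrity check.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("piece does not match its hash")
        return data


store = BlobStore()
photo_ref = store.put(b"<jpeg bytes of a cat>")  # placeholder content

# Names and annotations live in the catalog, not in the piece itself.
catalog = {
    "entries": [
        {"name": "GreatCat", "type": "image/jpeg", "ref": photo_ref},
    ]
}
# sort_keys keeps the serialization, and therefore the hash, deterministic.
catalog_ref = store.put(json.dumps(catalog, sort_keys=True).encode())
```

Renaming "GreatCat" produces a new catalog piece with a new hash, while the photo piece stays exactly where it was.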

Modifications
First, we often have to store previous versions anyway. Second, there are multiple ways you can modify a particular piece (“save a copy,” “modify this copy,” “replace all copies,” etc.), so I’d argue this is up to the app to handle.
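
One way an app could handle this, as a sketch: pieces stay immutable, and each name keeps a list of the hashes it has pointed to. The three operations below are one possible reading of the three quoted phrases; the function names and the `history` table are assumptions made for the example.

```python
import hashlib
from typing import Dict, List

blobs: Dict[str, bytes] = {}        # hash -> immutable piece
history: Dict[str, List[str]] = {}  # name -> hashes, newest last


def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    blobs[key] = data
    return key


def save_copy(new_name: str, data: bytes) -> None:
    """'Save a copy': a new name appears, other names are untouched."""
    history[new_name] = [put(data)]


def modify(name: str, data: bytes) -> None:
    """'Modify this copy': repoint one name and keep its old versions."""
    history[name].append(put(data))


def replace_all(old_ref: str, data: bytes) -> None:
    """'Replace all copies': repoint every name currently at old_ref."""
    new_ref = put(data)
    for versions in history.values():
        if versions[-1] == old_ref:
            versions.append(new_ref)


save_copy("proposal", b"draft")
save_copy("proposal-copy", b"draft")
draft_ref = history["proposal"][0]

modify("proposal", b"draft with my edits")  # only "proposal" moves on
replace_all(draft_ref, b"final")            # only names still at the draft move
```

After these calls "proposal" still points at the edited draft, "proposal-copy" points at the final text, and the original draft remains reachable through both histories.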

Large files
We split large files into multiple pieces, organized like a tree with the root referring to leaves like “Country -> City -> Street -> House 3A -> Apartment 15 -> Photo of Apartment 15”. Databases are internally structured this way already, so there’s no harm in making it explicit, too. And you can refer to pieces from multiple places, like “Friends -> A -> Andrew -> Photo of Apartment 15”, without changing a single thing.
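
A sketch of such a tree, under the same assumptions as before (SHA-256, an in-memory dictionary). Each inner node is a small JSON piece mapping a label to the hash of a child; `put_node` and `resolve` are names invented here. The photo is reachable by two different paths without being copied or changed.

```python
import hashlib
import json
from typing import Dict, List

store: Dict[str, bytes] = {}  # hash -> piece


def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def put_node(children: Dict[str, str]) -> str:
    """Store an inner node: a mapping from label to child hash."""
    return put(json.dumps(children, sort_keys=True).encode())


def resolve(root: str, path: List[str]) -> bytes:
    """Walk from the root along the labels and return the final piece."""
    ref = root
    for label in path:
        ref = json.loads(store[ref])[label]
    return store[ref]


photo = put(b"<photo bytes>")  # placeholder content

# Country -> ... -> House 3A -> Apartment 15 -> Photo (shortened here)
apartment = put_node({"Photo of Apartment 15": photo})
house = put_node({"Apartment 15": apartment})
by_place = put_node({"House 3A": house})

# Friends -> A -> Andrew -> Photo (shortened here)
andrew = put_node({"Photo of Apartment 15": photo})
friends = put_node({"Andrew": andrew})

via_place = resolve(by_place, ["House 3A", "Apartment 15", "Photo of Apartment 15"])
via_friend = resolve(friends, ["Andrew", "Photo of Apartment 15"])
```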

What about access rights?
For reading, you can encrypt the piece, store the key together with the “name” and “document type” and share that. For writing, it’s even simpler. You either accept the message “make picture GreatCat refer to piece A555”, or you don’t.
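
The writing half can be sketched like this. How the receiver decides whether to accept a message is not specified above; the example assumes a shared secret and an HMAC signature, which is only one of several options. `WRITER_KEY`, `sign` and `handle_update` are hypothetical names.

```python
import hashlib
import hmac
from typing import Dict

refs: Dict[str, str] = {}  # mutable names -> hash of the current piece

WRITER_KEY = b"demo-only-secret"  # assumption: shared with trusted writers


def sign(key: bytes, name: str, piece: str) -> str:
    """Sign the message 'make <name> refer to <piece>'."""
    message = f"{name}={piece}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def handle_update(name: str, piece: str, signature: str) -> bool:
    """Accept the update only if the signature matches; otherwise ignore it."""
    expected = sign(WRITER_KEY, name, piece)
    if not hmac.compare_digest(expected, signature):
        return False
    refs[name] = piece
    return True


# A trusted writer repoints GreatCat to piece A555.
accepted = handle_update("GreatCat", "A555", sign(WRITER_KEY, "GreatCat", "A555"))

# Somebody without the key tries to repoint it elsewhere.
rejected = handle_update("GreatCat", "B666", "forged-signature")
```

The pieces themselves never change, so there is nothing else to protect on the write path.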

Backup and synchronization systems are greatly simplified, too. They just need to store these blobs once and for good. And if you share only the pieces, without the encryption keys - they never ever get to see what’s stored there.
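
As a sketch of why backup gets simpler: since a piece never changes once stored, a backup only has to copy the hashes the other side does not have yet. The example assumes both sides are plain dictionaries keyed by SHA-256; the `backup` function is made up for the illustration.

```python
import hashlib
from typing import Dict


def put(store: Dict[str, bytes], data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def backup(local: Dict[str, bytes], remote: Dict[str, bytes]) -> int:
    """Copy pieces the remote lacks; return how many were sent."""
    missing = local.keys() - remote.keys()
    for key in missing:
        # The remote sees only opaque bytes, possibly already encrypted.
        remote[key] = local[key]
    return len(missing)


laptop: Dict[str, bytes] = {}
server: Dict[str, bytes] = {}

put(laptop, b"piece one")
put(laptop, b"piece two")
put(laptop, b"piece three")
first_run = backup(laptop, server)   # everything is new to the server

put(laptop, b"piece four")
put(laptop, b"piece one")            # same bytes, same hash, nothing new
second_run = backup(laptop, server)  # only the one new piece travels
```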

Practice what you preach

It’s easy to bash on large and faceless companies with arguments pulled out of thin air.

However, the approach I described is exactly how my knowledge assistant stores all of its data. It’s encrypted, backed up across multiple servers, and the entire source code for the data storage, including a lot of “conventional” file system functions, is still less than 5000 lines long, with comments and blank lines.

For comparison, the source code for the Unix “ls” command (list a directory) is at 4700 lines already, and that’s just one feature.

I spent about a week writing this part. Maybe the “ls” command was written faster. But over time it adds up. The less code you have, the fewer bugs you have, the faster you can make changes.

This is how software should be built: not to the point where there’s nothing more to add, but to the point where there’s nothing left to take away.

Ideas, experiments and projects by Oleksandr.