Making computer systems secure is very difficult. The consequences of insecure systems are already extremely serious and will be catastrophic in future if they are not already. Malignant people, often sponsored by malignant states, are actively attacking computer systems and have had considerable success doing so.
So it is surprising that companies whose stated aims are to increase security are effectively working to make their customers’ systems less secure.
Managing large, complex computing installations
For any large, complex computing installation[1], simply managing it is a problem. The way of managing a small installation – having someone (part of) whose job is to look after the installation – has terrible scaling problems: if your installation has a million OS instances, then keeping them up to date might involve a hundred thousand people. And if you could afford that many people you still haven’t solved the problem: with a large number of people whose job is to look after parts of the installation there is a vanishingly tiny chance that they will do so consistently.
For systems which are merely large this problem can be made a lot simpler: for such a system the number of components is far larger than the number of tasks the system performs, so there are many components for each task. These components can then be forced to be identical (or identical-enough). The failure of single components simply lowers the capacity of the system in almost all cases. There are still scaling problems – for a system with a huge amount of hardware, hardware failure rates will mean that more of the hardware fails and needs to be replaced, requiring people to actually do the replacement – but much of the management of such a system scales much less than linearly with its size. Finding problems which both can be solved by systems which are merely large and from which money can be made is what made the giant internet companies so rich, of course[2].
For systems which are both large and complex the problem is far harder: because such a system is performing a large number of distinct tasks managing it necessarily requires people with expertise in all these tasks, and there are only so many things a person can be good at. Because of this, running such a system is never really scalable. But, if you can isolate various layers of the system – the computing and storage hardware, the operating system, the software platform on which applications live, and so on – then you can make those parts of the system into something which is merely large, and you can manage those in a way which will scale.
This, of course, is exactly what everyone with a large, complex computing installation is trying to do.
Single points of control
The trick to managing a large installation, or the parts of a large, complex installation which can be made merely large, is to have single points of control. For instance, if I want to deploy some update to a very large number of machines, I very definitely don’t want to have to access each machine individually to do that: instead I need to have some single point of control from where I can say ‘deploy this update to this set of machines’ and that will just happen, and I’ll get some kind of report about which machines it worked on and so on.
Making the management of large installations scalable requires these single points of control. They may not be rooms full of dials and flashing lights in hollowed-out volcanos staffed by people in white coats, where occasional klaxons sound (although, of course, they should be), but they have to exist, somewhere: it must be the case that changes to the system can be made in one place, or a very small number of places, and take effect over the whole system. There’s no other way to do this.
A security problem
Single points of control present a quite considerable security problem. They are necessary so that the system can be managed efficiently, but it doesn’t say anywhere that the changes made from such a single point of control are good changes. So two things are extremely important:
- all the single points of control need to be known about and their number should be kept as small as possible;
- all the single points of control must be very carefully managed, with extensive controls over access, carefully managed logs and so forth.
I suspect most organisations fail at both of these, unfortunately: they neither keep a careful catalogue of the single points of control nor control access to them carefully enough. This essay, however, is not about how to deal with this problem except in one respect.
To understand what the single points of control are you need to understand the notion of transitive closure. This is pretty simple, fortunately: if a system \(a\) controls a system \(b\), and system \(b\) controls systems \(c, d, \ldots\), then, by transitive closure, system \(a\) controls all of systems \(c, d, \ldots\). And similarly, if \(d\) controls \(g\), then \(a\) also controls \(g\). What this means is that, in order to understand what the single points of control are, you need to construct graphs[3] of the transitive closure of control. This is not hard to do, but it is quite hard for people to remember these graphs: they really need to exist in some explicit form. Doing this is also a good exercise in making sure you actually do think hard about what the nodes in the graph are: what are the things which grant control over some system, and how are they being managed.
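As a minimal sketch of what keeping such a graph in explicit form might look like, the following Python computes everything a given point of control reaches, directly or indirectly. The `controls` dictionary and the function name `controlled_by` are illustrative assumptions, and the system names are just the letters used in the example above.

```python
from collections import deque


def controlled_by(controls, start):
    """Return every system that `start` controls, directly or indirectly."""
    seen = set()
    queue = deque(controls.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also copes with cycles
        seen.add(node)
        queue.extend(controls.get(node, ()))
    return seen


# Hypothetical control graph: a controls b, b controls c and d, d controls g.
controls = {
    "a": ["b"],
    "b": ["c", "d"],
    "d": ["g"],
}

# Rank points of control by how much they reach: the most sensitive come first.
reach = {node: controlled_by(controls, node) for node in controls}
for node in sorted(reach, key=lambda n: len(reach[n]), reverse=True):
    print(node, "controls", sorted(reach[node]))
```

Sorting by the size of each node's reach is one simple way to see which points of control deserve the tightest management.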
An important thing about this transitive closure of control is that everything gets more sensitive as you go up the graph[4]: the higher nodes in the graph control more lower nodes, and often very many more lower nodes. If the graph is a tree with a constant branching factor \(b\) then a node \(h\) levels above the leaves controls \(b + b^2 + \cdots + b^h\) nodes, so the number of nodes controlled goes up exponentially as you get higher in the tree, and that’s fast: with \(b = 10\), a node three levels up already controls 1,110 nodes.
All of this means that for large installations the points of control near the top of the tree are extremely sensitive: they need to be very tightly controlled indeed. It would be foolish, wouldn’t it, to allow third parties to manage these points of control?
We’re all fools
Of course, we all do exactly that, all the time. We all run software we have neither written nor exhaustively checked[5], on hardware we don’t really understand, for instance, and thus outsource our security to the people who write this software and make this hardware. And most of the time it’s OK. Most of the time. Sometimes bad things are found in the software or the hardware and we have to rush around to deal with them. Well, not so much ‘sometimes’ as ‘quite often’ in fact.
But we don’t really have much choice about this: in theory we could build our own hardware and write our own software to run on it as people did in the 1940s and 1950s, but in practice that’s absurdly impractical.
But that’s not where it ends, of course. We now all love our cloud computing: running our software on top of platforms and hardware managed by other people, and keeping our data on their storage systems. Because of course no-one could ever compromise one of these suppliers of computing resources without us realising, quietly changing the cloud platform so it recorded interesting things about what we’re doing. And of course these, very large, computing infrastructures are not managed in turn from single points of control which now, by transitive closure, have control over the computing infrastructures of a huge number of organisations. Oh, wait.
Well this, too, seems to have worked out reasonably well. So far. And this essay is not about the risks of cloud computing.
Some more than others
There are things we can do to control the risks we all take. For instance, when dealing with software we haven’t written or checked in detail, we can carefully run it first in a controlled, isolated environment to try and assess any problems with it. This doesn’t ensure safety – nothing can do that – but it does mean that we have at least some chance of finding out if the new software is broken or malignant.
What we should not be doing is blindly accepting and deploying updates to software into an environment we care about. And we should very, very definitely not be doing that when that software has access to control our systems. If we were to do that, then, by the time we know that the people we’re getting the software from have been compromised, or were perhaps always malignant, it’s far too late: the damage is done. And, worse, we will probably never know what damage has been done.
A target painted on our backs
Points of control which are both far up the graph and well-known have targets painted on their backs. If Dr Evil, President Evil or General Secretary Evil decides that they’d like to compromise a large number of organisations, the things they are going to go for are the points of control which are far up the graph. And they’ll be willing to put a great deal of time, skill and money into this.
Points of control which are far up the graph are, as a result, all but certain to be attacked, and all but certain to be attacked by people with effectively unbounded resources. The only safe assumption to make is that these points of control will be compromised in due course: assuming otherwise is hopelessly naïve.
So you should be very, very careful to test anything you get from such places – especially software, which is far more mutable than hardware. And, if you are in charge of one of these places you should certainly not be suggesting that anyone blindly take your updates: that would be extremely irresponsible.
And yet this is exactly what happens: we are all actively encouraged to blindly trust software we receive from organisations with targets painted on their backs.
And that’s what this essay is about.
There are many good choices here, but I’ll just pick one: Qualys.
That sounds good, right? Except, wait: they’re providing security solutions. It’s in the nature of such solutions that they both need to be updated very frequently as new threats appear and require privileged access to systems. It almost certainly is not possible to do the kind of staged test and deploy I suggested above for software like this: if there’s a new compromise you want to know about it now, not in two weeks. Instead you really need to just accept updates from Qualys as and when they appear or, perhaps worse, allow them to pull data from your systems to check ‘in the cloud’ where you do not have control over the security of that data. That means that, if you are using Qualys tools on live systems, Qualys are a single point of control for you.
[Qualys] has over 10,300 customers in more than 130 countries, including a majority of the Forbes Global 100. – Wikipedia
That means that they’re a single point of control for a large number of very high-value targets for President Evil: Qualys have a target painted on their back, are illuminated by bright searchlights and are surrounded by flashing neon arrows pointing at the target.
So, well, they’ll know about this, won’t they? And although they can’t avoid being a target to some extent[7], they certainly will be addressing these problems to reduce the risk somehow, won’t they? Certainly they will have many documents and guides describing how to minimise the inevitable risk associated with using their products.
Not so much.
How to lose friends and alienate people
Start at https://www.qualys.com/documentation, then follow ‘Cloud Platform’ / ‘Scan authentication’ / ‘Unix record’ / ‘online help’ / ‘What credentials should I use?’ / ‘Learn more’ and you should find a link entitled ‘*NIX Authenticated Scan Process and Commands’ whose target is https://success.qualys.com/discussions/s/article/0000062208, from which comes this:
When Qualys performs an authenticated scan against a *nix system with a properly configured authentication record we will create an ssh session using the credentials in the authentication record, check the effective UID (level of access), execute “sudo su -” (or other root delegation command configured in the record), re-check effective UID to ensure the elevation worked, then begin our checks.
sudo su - means ‘become root and spawn a shell’. Or, in other words, gain completely unconstrained access to the system with the highest possible level of privilege. Further down the same page you’ll find this:
First, customers should be strongly discouraged from placing granular controls around the Qualys service account because of the reasons stated above. […] Even if it were possible to publish this list, it would likely take a lot of effort to maintain its currency.
In other words: ‘don’t use fine-grained control to limit what our tool can do, because maintaining the list of commands it might run would be a lot of work for us.’
Yet further down the page is:
Below is a list of commands that a Qualys service account might run during a scan. Remember not every command is run every time, and *nix distributions differ. This list of commands is neither comprehensive nor actively maintained.
This is followed by a list of commands which includes awk (equivalent to uncontrolled root access again) and just a huge number of other commands, all of which imply unconstrained root access.
That page also links to https://success.qualys.com/discussions/s/article/0000062289, which contains this obvious falsehood:
In a nutshell, all of our data point detections are scripts that need to be run as root. Running them as a non-root user would, in most cases, result in permission errors which cannot be distinguished from other error sources. That would result in incorrect data being returned by the scanner, which is why we do not support this. There is no way to make non-root scanning work reliably with a scanning model based on shell commands or shell scripts.
It also contains this lovely example of why sudo is no good:
sudo /usr/bin/find . -maxdepth 0 -name . -exec /bin/sh -c "su -" ";" -quit
This is truly magnificent: anyone who has looked after sudo configuration will know immediately that this is why you don’t allow unconstrained find in the commands you allow to be run. But apparently the people at Qualys don’t understand that.
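To make the point concrete, here is a crude Python sketch that flags entries in a list of commands allowed under sudo which amount to handing over a root shell: find can run any program through -exec, and awk can run any command through its system() function. The table of programs is illustrative and deliberately incomplete, the allow-list is hypothetical, and the sketch does not attempt to parse a real sudoers file.

```python
import os

# Illustrative and deliberately incomplete (an assumption of this sketch):
# programs that can start arbitrary other programs for whoever runs them.
SHELL_ESCAPE_CAPABLE = {
    "find": "-exec can run any program",
    "awk": "system() can run any command",
    "sh": "is itself a shell",
    "su": "starts a shell as another user",
}


def risky_entries(allowed_commands):
    """Return (command, reason) pairs for entries that amount to a root shell."""
    findings = []
    for command in allowed_commands:
        words = command.split()
        if not words:
            continue  # skip blank entries
        program = os.path.basename(words[0])
        if program in SHELL_ESCAPE_CAPABLE:
            findings.append((command, SHELL_ESCAPE_CAPABLE[program]))
    return findings


# Hypothetical allow-list for a scanner's service account.
allowed = ["/usr/bin/find", "/usr/bin/awk", "/bin/ls"]
for command, reason in risky_entries(allowed):
    print(f"{command}: {reason}")
```

A check like this only catches the programs it has been told about, which is rather the point: a list of permitted commands is only a constraint if none of the commands on it can run other commands.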
The terrifying conclusion
It is hard to read this material without coming to the conclusion that the people writing it – the people on whom you are relying to check your systems for security – do not care about the security of their customers’ systems if that security might cause momentary inconvenience for them. Worse, it is hard to read this material without coming to the conclusion that the people writing it do not understand the security architecture of *nix systems[10] at all.
But they have no choice
Well, the people who wrote the documents excerpted above are certainly patronising, and they also seem alarmingly incompetent. But, surely, the problem is real: I can poke fun at them all I like but that doesn’t actually help anything, does it?
This is a security scanner and this means that the things it is checking for change very fast: people who write malware do not give warning of what they are going to do in advance and do not make it easy to know when they are attacking you. When a new attack becomes known about it needs to be checked for right away. And since the nature of the attack can’t be known in advance, the techniques needed to check for it can’t be known in advance, which means both that you will need to allow the scanner to run programs it has just fetched from Qualys, and also that those programs must be able to use all the facilities of the system, at the highest privilege level, to do their work. There’s just no way around this, is there?
And, despite what might appear from reading the above material, we therefore have to assume that everyone at Qualys knows they are an enormously attractive target for President Evil and that their security is thus impeccable: we have no choice.
One of many
And Qualys are just one of many: I have picked on them only because I had to pick on someone. As another example, there’s a company – a very famous company with a three-letter name – who sell a product which, if you install it according to their recommendations, requires you to grant unconstrained root access via sudo to an entire directory containing a huge number of shell scripts some of which are tens of thousands of lines long, and some of which write other shell scripts. The chances of that system not containing security problems are close to zero. But again, we have to trust them, even though the evidence that they don’t even understand what security means is overwhelming: after all they do have a three-letter name.
And this is everywhere you look: we are trusting the security of our systems to people who do not appear to understand what security means.
Isn’t this all just a bit alarmist? It’s all very well for me to go on about single points of control and companies with targets painted on their backs, but surely nothing bad ever really happens?
If you think that, then you haven’t been paying attention.
SolarWinds are a company which write network-management tools used by many other companies, government organisations and others. One of their products is called Orion, which is used by about 33,000 public and private-sector organisations. Most or all of those organisations download updates to the product either automatically or semi-automatically. This makes SolarWinds a very attractive target. Starting before October 2019 SolarWinds were compromised and in particular the build system for Orion was compromised in such a way that releases of the product contained malicious code. Between, perhaps, March and December 2020 the attackers used these compromised updates, together with other compromises, to attack at least 200 organisations, including multiple parts of the US federal government, NATO, the UK government, the European parliament, Microsoft and many others. A good description of this attack can be found here. The people who did the attack were the Russian Foreign Intelligence Service, Sluzhba Vneshney Razvedki[11]. I don’t know what the results of this attack were, and perhaps no-one outside Russia knows what was taken and what will be done with it. It is certainly very safe to say that the results were extremely severe, if not catastrophic.
It’s worth noting that the result of the build system for Orion being compromised was that the compromised releases were properly digitally signed: it is not safe to rely on digital signatures to prove that software has not been compromised in the case where the organisation signing the software has been compromised.
In early 2021 there was a security breach at Qualys. It seems that this breach didn’t compromise their security tools: they got away with it, this time.
This is not the end
These are both supply chain attacks: many others have happened, and without doubt many more will happen. In the context of this essay, supply chain attacks are a result of having single points of control for security management which are outside an organisation and which serve many organisations, making them interesting to attackers with large resources.
But what can we do? It is inevitable that these organisations will be attacked, and almost inevitable that they will be compromised. In many cases we can mitigate the risk by having a fairly long test and deployment cycle and hoping that either we find the problems or that others do before we start relying on the tool. For security scanners we can’t do that, because we can’t afford to wait. We have to trust suppliers of security products, and we have to allow them to run privileged code on our systems which we cannot check, because the alternative of not checking for security compromises is even worse.
We have to trust them because, in fact, we have no other choice.
Is this the end?
So, this seems like an insoluble problem, doesn’t it? A security scanner has terrifying properties, by its nature:
- it must be updated very frequently, far too frequently to perform safety checks;
- it must have privileged access to live systems.
There’s just no way around that, is there? And of course, President Evil knows this too: the organisations providing these tools make extremely good targets because the nature of the tools means both that any compromise is very serious and compromises are very hard to detect. And there is therefore no way around the fact that the suppliers of these tools will be targets for President Evil, will, in due course, be compromised, and all is therefore lost.
Well, perhaps not. Perhaps it is possible to reduce the risk.
The problem to solve is that a security scanner must be updated very frequently and must run with high privilege. Suppliers of such tools, even if they are competent, which is not always clear, are extremely valuable targets for attackers with very large resources and thus are almost certain to be compromised. So running these scanners on live systems needs to be avoided, even though the scanners need access to the live systems to run.
Well, there’s a way around that. If you could make an identical copy of any system then you could scan the copy. If the machine has a vulnerability, so will the copy. If the scanner is compromised then it will attack only the copy, which doesn’t matter, since it’s only a copy, which will be destroyed immediately after being scanned.
It is more complicated than that, of course: the copy needs actually to be running as lots of things will almost certainly only really show up when a system is running (what network ports does it have open, for instance). So the copy needs to be more than just a blob of data: it needs to be a real thing running programs. And the copy has to think it’s not a copy: enough of the world around it needs to be faked up so it thinks it’s doing real work. But all of this world must be fake – under no circumstances should the copy be able to see real data or talk to real live systems. Finally, the scanner needs to be very restricted in the data it can upload: since the whole point is that we don’t trust the scanner we can’t allow it to ship all the data on the system to who-knows-where when it’s been compromised. Ideally the scanner should return a single bit: is the thing it is scanning compromised? If it is then this tells us to look more closely at it, for instance by looking at a report stashed locally on the copy.
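The shape of that process can be sketched in a few lines of Python. Everything named here is hypothetical: `FakeHypervisor` is an in-memory stand-in, not the API of any real hypervisor, and `scanner` is any callable that answers yes or no about the copy it is given. The sketch only shows the ordering that matters: clone, isolate before starting, scan, report one bit, and always destroy the copy.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeHypervisor:
    """In-memory stand-in for a hypervisor API (hypothetical interface)."""

    events: List[str] = field(default_factory=list)

    def clone(self, vm: str) -> str:
        copy = f"{vm}-scan-copy"
        self.events.append(f"clone {vm} -> {copy}")
        return copy

    def attach_to_fake_network(self, vm: str) -> None:
        self.events.append(f"isolate {vm}")

    def start(self, vm: str) -> None:
        self.events.append(f"start {vm}")

    def destroy(self, vm: str) -> None:
        self.events.append(f"destroy {vm}")


def scan_copy(hypervisor, vm: str, scanner: Callable[[str], bool]) -> bool:
    """Scan a disposable, isolated clone of `vm` and return a single bit."""
    copy = hypervisor.clone(vm)
    try:
        # Isolate before starting, so the copy never sees live systems or data.
        hypervisor.attach_to_fake_network(copy)
        hypervisor.start(copy)
        # Only one bit leaves the sandbox: does the copy need a closer look?
        return bool(scanner(copy))
    finally:
        # The copy is destroyed whether or not the scan succeeded.
        hypervisor.destroy(copy)


hv = FakeHypervisor()
needs_closer_look = scan_copy(hv, "db01", scanner=lambda copy: False)
```

Putting the destroy step in a `finally` clause means a scanner which crashes, or is made to crash, still cannot leave a copy running.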
Doing this is not simple to arrange, but it is perfectly possible. Here are some objections with answers.
But, cloning systems like this is hard, isn’t it? Not really. For a start, if the systems concerned are virtualised then pretty much all serious hypervisors support making snapshots and clones of the virtual machines they’re running, and moving those snapshots and clones between different physical hardware. If the systems aren’t virtualised then things are harder, but this kind of ‘make a carbon copy of a system’ is what you should already be doing for backup and disaster recovery (DR). Some people, apparently, maintain DR systems by manually keeping them up to date with the live systems. If you are doing that, stop: create the DR systems by cloning the live systems. If you don’t have a good approach to cloning do it by restoring backups. If you can’t restore your backups (or you aren’t making backups) then you are already dead, so nothing matters.
But, this means doubling the size of the environment, doesn’t it? No: you only need enough extra computational resources to scan each little chunk of your environment, since you can reuse them. But, you already need enough extra resources to support DR: just use those!
But, this will be hard to set up, won’t it? Yes, it will require a fair amount of work. But if you don’t do this, or something like it, then within the next few years your systems (almost certainly) will be compromised and your data (almost certainly) will leak to bad people as a result. So the question is: is the cost of that higher, or lower, than the cost of this, or something like it?
But, the things that do the cloning can be attacked, can’t they? Yes, they can. But these tools are a tiny fragment of your infrastructure. They are, in fact, a single point of control, and one you have to be very, very careful about. This sketch doesn’t remove the problem since nothing can do that: it just makes it much less severe and much better controlled.
But, lots of details are missing, aren’t they? Yes. This is a sketch, written by some person on the internet: it’s not a complete solution. (If you want a complete solution pay me lots of money and I’ll make you one.)
But, you haven’t thought of this thing, and that thing, and …, have you? No. It’s a sketch.
Because we want to
Solving these problems, in the sense of making them much less likely to happen and the consequences when they do happen much less bad, is not easy. But it is possible, as the sketch in the previous section shows. Not solving them means that, almost certainly, in the next few years a catastrophe will happen. I said at the beginning of the essay
it is surprising that companies whose stated aims are to increase security are effectively working to make their customers’ systems less secure.
But it isn’t, not really: it is depressing, but not really surprising, because the entire history of computing has been made up of people avoiding solving problems through laziness, lack of imagination, or the desire to make a quick buck.
I think that should stop. Solving these problems will be hard, but we can solve them if we only want to.
Appendix: ‘large, complex computing installations’
I’ve used this term above without ever really defining it. Defining it is not entirely easy, and the meanings of definitions change over time: once an IBM System/360 Model 70 might have been thought of as a very large computing installation, but today it would be a very small one other than, perhaps, physically.
Every time I want to write about large computing installations I find I don’t know the right words any more: is a large computing installation one with many systems, or is it one large system? What, anyway, is a ‘system’? Once everyone knew what it meant: the system was the departmental VAX, and later there were several systems which were the VAX (still creaking along on life-support) and a bunch of Suns, some of which were workstations and some of which were fileservers.
But that meaning has dissolved away. For a while it was safe to talk about ‘servers’: everyone knew that a server was something that lived in a rack along with other servers[12]. But that in turn has dissolved away as the relationship between physical hardware and the programs that run on it becomes more complicated and often more remote.
So what, today, are the right words? What is a large installation and what a small one? Here’s my attempt at a definition.
- An installation is large if it has a very large number of truly concurrent threads of control. ‘Truly concurrent’ means ‘supported by hardware’, and what is meant by ‘very large’ will increase over time: at the time of writing (mid 2021) this probably means at the very least tens of thousands.
- An installation is complex if it is performing a large number of conceptually distinct tasks. Again the definition of what is a large number may change over time although it will probably increase more slowly than the number of threads of control.
This definition, for instance, would make many HPC systems large, but not complex: although they have a large number of independent threads of control, they probably run a rather small number of different programs, and perhaps only one (probably several copies of that one, of course). It’s possible for a system to be complex, but not large, although unusual.
I’m not sure if this definition is adequate, but I think it will serve here.
In the main text I use ‘installation’ and ‘system’ interchangeably: I should probably only use ‘installation’ but I don’t. When I talk about an individual computer in a large installation I’ve tried to say ‘machine’.
[1] See appendix.
[2] Once upon a time I worked for a then-famous company which sold holidays over the internet. We used to sneer at Amazon for picking a simple problem – mostly selling books, then – to solve: books just sit in a warehouse waiting to be bought, for decades if need be, while everyone wants a different holiday and holidays have very definite sell-by dates. One day I realised that what Amazon had done – picking a simple, scalable problem to solve – was smart, and what we were trying to do was not smart and that was why they were going to get rich and we weren’t. I didn’t get rich, and I don’t know if that company even still exists.
[3] A graph here is not a plot: it’s a drawing of some kind of network consisting of nodes (points of control, for instance) and arcs between those nodes which may or may not have arrows on them indicating direction: if a controls b then there will be a node for a, a node for b and an arrow from a to b indicating control.
[4] By ‘up’ I mean in the direction of ‘is controlled by’ while ‘down’ means in the direction of ‘has control over’.
[6] All the text in this essay was extracted from the linked sources in early September, 2021. Things may have changed since then, but what is here was there then. I have marked elisions with ‘[…]’.
[7] For instance, if Qualys can be compromised in such a way that their tools fail to report other compromises, then this would allow those other compromises to propagate undetected, even if the tools provided by Qualys are not themselves doing direct harm.
[8] This may formerly have been
[9] May formerly have been
[10] To be fair, ‘the security architecture of *nix systems’ does give the impression that there is one – that it is something made of marble and stainless steel rather than partly-dissolved mud bricks and rotting straw.
[11] In other words, this time it was indeed President Evil.
[12] Some very large or very old servers might have been whole racks, or even several.