[官方] 近期 Firefox Add-on 服务中断的技术细节

1,276 阅读9分钟
原文链接: hacks.mozilla.org

Recently, Firefox had an incident in which most add-ons stopped working. This was due to an error on our end: we let one of the certificates used to sign add-ons expire which had the effect of disabling the vast majority of add-ons. Now that we’ve fixed the problem for most users and most people’s add-ons are restored, I wanted to walk through the details of what happened, why, and how we repaired it.

Background: Add-Ons and Add-On Signing

Although many people use Firefox out of the box, Firefox also supports a powerful extension mechanism called “add-ons”. Add-ons allow users to add third party features to Firefox that extend the capabilities we offer by default. Currently there are over 15,000 Firefox add-ons with capabilities ranging from blocking ads to managing hundreds of tabs.

Firefox requires that all add-ons that are installed be digitally signed. This requirement is intended to protect users from malicious add-ons by requiring some minimal standard of review by Mozilla staff. Before we introduced this requirement in 2015, we had serious problems with malicious add-ons.

The way that the add-on signing works is that Firefox is configured with a preinstalled “root certificate”. That root is stored offline in a hardware security module (HSM). Every few years it is used to sign a new “intermediate certificate” which is kept online and used as part of the signing process. When an add-on is presented for signature, we generate a new temporary “end-entity certificate” and sign that using the intermediate certificate. The end-entity certificate is then used to sign the add-on itself. Shown visually, this looks like this:

Diagram showing the digital signature workflow from Root to Add-on

Note that each certificate has a “subject” (to whom the certificate belongs) and an “issuer” (the signer). In the case of the root, these are the same entity, but for other certificates, the issuer of a certificate is the subject of the certificate that signed it.

An important point here is that each add-on is signed by its own end-entity certificate, but nearly all add-ons share the same intermediate certificate [1]. It is this certificate that encountered a problem: Each certificate has a fixed period during which it is valid. Before or after this window, the certificate won’t be accepted, and an add-on signed with that certificate can’t be loaded into Firefox. Unfortunately, the intermediate certificate we were using expired just after 1AM UTC on May 4, and immediately every add-on that was signed with that certificate become unverifiable and could not be loaded into Firefox.

Although add-ons all expired around midnight, the impact of the outage wasn’t felt immediately. The reason for this is that Firefox doesn’t continuously check add-ons for validity. Rather, all add-ons are checked about every 24 hours, with the time of the check being different for each user. The result is that some people experienced problems right away, some people didn’t experience them until much later. We at Mozilla first became aware of the problem around 6PM Pacific time on Friday May 3 and immediately assembled a team to try to solve the issue.

Damage Limitation

Once we realized what we were up against, we took several steps to try to avoid things getting any worse.

First, we disabled signing of new add-ons. This was sensible at the time because we were signing with a certificate that we knew was expired. In retrospect, it might have been OK to leave it up, but it also turned out to interfere with the “hardwiring a date” mitigation we discuss below (though eventually didn’t use) and so it’s good we preserved the option. Signing is now back up.

Second, we immediately pushed a hotfix which suppressed re-validating the signatures on add-ons. The idea here was to avoid breaking users who hadn’t re-validated yet. We did this before we had any other fix, and have removed it now that fixes are available.

Working in Parallel

In theory, fixing a problem like this looks simple: make a new, valid certificate and republish every add-on with that certificate. Unfortunately, we quickly determined that this wouldn’t work for a number of reasons:

  1. There are a very large number of add-ons (over 15,000) and the signing service isn’t optimized for bulk signing, so just re-signing every add-on would take longer than we wanted.
  2. Once add-ons were signed, users would need to get a new add-on. Some add-ons are hosted on Mozilla’s servers and Firefox would update those add-ons within 24 hours, but users would have to manually update any add-ons that they had installed from other sources, which would be very inconvenient.

Instead, we focused on trying to develop a fix which we could provide to all our users with little or no manual intervention.

After examining a number of approaches, we quickly converged on two major strategies which we pursued in parallel:

  1. Patching Firefox to change the date which is used to validate the certificate. This would make existing add-ons magically work again, but required shipping a new build of Firefox (a “dot release”).
  2. Generate a replacement certificate that was still valid and somehow convince Firefox to accept it instead of the existing, expired certificate.

We weren’t sure that either of these would work, so we decided to pursue them in parallel and deploy the first one that looked like it was going to work. At the end of the day, we ended up deploying the second fix, the new certificate, which I’ll describe in some more detail below.

A Replacement Certificate

As suggested above, there are two main steps we had to follow here:

  1. Generate a new, valid, certificate.
  2. Install it remotely in Firefox.

In order to understand why this works, you need to know a little more about how Firefox validates add-ons. The add-on itself comes as a bundle of files that includes the certificate chain used to sign it. The result is that the add-on is independently verifiable as long as you know the root certificate, which is configured into Firefox at build time. However, as I said, the intermediate certificate was broken, so the add-on wasn’t actually verifiable.

However, it turns out that when Firefox tries to validate the add-on, it’s not limited to just using the certificates in the add-on itself. Instead, it tries to build a valid chain of certificates starting at the end-entity certificate and continuing until it gets to the root. The algorithm is complicated, but at a high level, you start with the end-entity certificate and then find a certificate whose subject is equal to the issuer of the end-entity certificate (i.e., the intermediate certificate). In the simple case, that’s just the intermediate that shipped with the add-on, but it could be any certificate that the browser happens to know about. If we can remotely add a new, valid, certificate, then Firefox will try that as well. The figure below shows the situation before and after we install the new certificate.

a diagram showing two workflows, before and after we installed a new valid certificate

Once the new certificate is installed, Firefox has two choices for how to validate the certificate chain, use the old invalid certificate (which won’t work) and use the new valid certificate (which will work). An important feature here is that the new certificate has the same subject name and public key as the old certificate, so that its signature on the End-Entity certificate is valid. Fortunately, Firefox is smart enough to try both until it finds a path that works, so the add-on becomes valid again. Note that this is the same logic we use for validating TLS certificates, so it’s relatively well understood code that we were able to leverage.[2]

The great thing about this fix is that it doesn’t require us to change any existing add-on. As long as we get the new certificate into Firefox, then even add-ons which are carrying the old certificate will just automatically verify. The tricky bit then becomes getting the new certificate into Firefox, which we need to do automatically and remotely, and then getting Firefox to recheck all the add-ons that may have been disabled.

Normandy and the Studies System

Ironically, the solution to this problem is a special type of add-on called a system add-on (SAO). In order to let us do research studies, we have developed a system called Normandy which lets us serve SAOs to Firefox users. Those SAOs automatically execute on the user’s browser and while they are usually used for running experiments, they also have extensive access to Firefox internal APIs. Important for this case, they can add new certificates to the certificate database that Firefox uses to verify add-ons.[3]

So the fix here is to build a SAO which does two things:

  1. Install the new certificate we have made.
  2. Force the browser to re-verify every add-on so that the ones which were disabled become active.

But wait, you say. Add-ons don’t work so how do we get it to run? Well, we sign it with the new certificate!

Putting it all together… and what took so long?

OK, so now we’ve got a plan: issue a new certificate to replace the old one, build a system add-on to install it on Firefox, and deploy it via Normandy. Starting from about 6 PM Pacific on Friday May 3, we were shipping the fix in Normandy at 2:44 AM, or after less than 9 hours, and then it took another 6-12 hours before most of our users had it. This is actually quite good from a standing start, but I’ve seen a number of questions on Twitter about why we couldn’t get it done faster. There are a number of steps that were time consuming.

First, it took a while to issue the new intermediate certificate. As I mentioned above, the Root certificate is in a hardware security module which is stored offline. This is good security practice, as you use the Root very rarely and so you want it to be secure, but it’s obviously somewhat inconvenient if you want to issue a new certificate on an emergency basis. At any rate, one of our engineers had to drive to the secure location where the HSM is stored. Then there were a few false starts where we didn’t issue exactly the right certificate, and each attempt cost an hour or two of testing before we knew exactly what to do.

Second, developing the system add-on takes some time. It’s conceptually very simple, but even simple programs require taking some care, and we really wanted to make sure we didn’t make things worse. And before we shipped the SAO, we had to test it, and that takes time, especially because it has to be signed. But the signing system was disabled, so we had to find some workarounds for that.

Finally, once we had the SAO ready to ship, it still takes time to deploy. Firefox clients check for Normandy updates every 6 hours, and of course many clients are offline, so it takes some time for the fix to propagate through the Firefox population. However, at this point we expect that most people have received the update and/or the dot release we did later.

Final Steps

While the SAO that was deployed with Studies should fix most users, it didn’t get to everyone. In particular, there are a number of types of affected users who will need another approach:

  • Users who have disabled either Telemetry or Studies.
  • Users on Firefox for Android (Fennec), where we don’t have Studies.
  • Users of downstream builds of Firefox (e.g., Debian) which have disabled Studies.
  • Users who are behind HTTPS Man-in-the-middle proxies, because our add-on installation systems enforce key pinning for these connections, which proxies interfere with.
  • Users of very old builds of Firefox which the Studies system can’t reach.

We can’t really do anything about the last group — they should update to a new version of Firefox anyway because older versions typically have quite serious unfixed security vulnerabilities. We know that some people have stayed on older versions of Firefox because they want to run old-style add-ons, but many of these now work with newer versions of Firefox. For the other groups we are developing a patch to Firefox that will install the new certificate once people update. This was released as a “dot release” so people will get it — and probably have already — through the ordinary update channel. If you have a downstream build, you’ll need to wait for your build maintainer to update.

We recognize that none of this is perfect. In particular, in some cases, users lost data associated with their add-ons (an example here is the “multi-account containers” add-on).

We were unable to develop a fix that would avoid this side effect, but we believe this is the best approach for the most users in the short term. Long term, we will be looking at better architectural approaches for dealing with this kind of issue.

Lessons

First, I want to say that the team here did amazing work: they built and shipped a fix in less than 12 hours from the initial report. As someone who sat in the meeting where it happened, I can say that people were working incredibly hard in a tough situation and that very little time was wasted.

With that said, obviously this isn’t an ideal situation and it shouldn’t have happened in the first place. We clearly need to adjust our processes both to make this and similar incidents it less likely to happen and to make them easier to fix.

We’ll be running a formal post-mortem next week and will publish the list of changes we intend to make, but in the meantime here are my initial thoughts about what we need to do. First, we should have a much better way of tracking the status of everything in Firefox that is a potential time bomb and making sure that we don’t find ourselves in a situation where one goes off unexpectedly. We’re still working out the details here, but at minimum we need to inventory everything of this nature.

Second, we need a mechanism to be able to quickly push updates to our users even when — especially when — everything else is down.  It was great that we are able to use the Studies system, but it was also an imperfect tool that we pressed into service, and that had some undesirable side effects. In particular, we know that many users have auto-updates enabled but would prefer not to participate in Studies and that’s a reasonable preference (true story: I had it off as well!) but at the same time we need to be able to push updates to our users; whatever the internal technical mechanisms, users should be able to opt-in to updates (including hot-fixes) but opt out of everything else. Additionally, the update channel should be more responsive than what we have today. Even on Monday, we still had some users who hadn’t picked up either the hotfix or the dot release, which clearly isn’t ideal. There’s been some work on this problem already, but this incident shows just how important it is.

Finally, we’ll be looking more generally at our add-on security architecture to make sure that it’s enforcing the right security properties at the least risk of breakage.

We’ll be following up next week with the results of a more thorough post-mortem, but in the meantime, I’ll be happy to answer questions by email at ekr-blog@mozilla.com.

 

[1] A few very old add-ons were signed with a different intermediate.

[2] Readers who are familiar with the WebPKI will recognize that this is also the way that cross-certification works.
[3] Technical note: we aren’t adding the certificate with any special privileges; it gets its authority by being signed for the root. We’re just adding it to the pool of certificates which can be used by Firefox. So, it’s not like we are adding a new privileged certificate to Firefox.

About Eric Rescorla

Eric is CTO of the Firefox team at Mozilla.

More articles by Eric Rescorla…