Review of the .nz DNSSEC Chain Validation Incident on 29-30 May 2023 [pdf] (internetnz.nz)
90 points by slyall on Oct 9, 2023 | 25 comments


- The team responsible for the .nz TLD is staffed by 4 technical employees providing 24x7 on-call support

- Engagement surveys captured staff feeling fatigued and under-resourced at the entire-org level

- Hard to hire for roles due to specialisation

- Team moving away from BAU work to projects

- Issues raised by external partner but team had other immediate work pressures

- Had to contact an ex-employee who used to work there to help resolve the incident

- Once-a-year critical task done by just one team member

- Many IT operations tasks have sufficient external dependencies that it is impossible to tell for certain whether the task will be successful in production without doing the task in the production environment. (We've all been there.)

Sounds like InternetNZ should actually outsource all of this to an external party and just focus on governance work.


> - Engagement surveys captured staff feeling fatigued and under-resourced at the entire-org level

This smells like an organizational failure. First, the survey doesn't tell you whether the tech staff suffered fatigue, because it was organization-wide. Second, the Executive Leadership Team alone has 5 people, and the Council itself has 9 members. I shudder to think how many (indirect) managers the 4 tech staff have to report to. Third, if there is a problem in your organization, you won't find it with a bloody survey; those just exist to satisfy KPIs. Fourth, a problem in a small team is fairly easy to locate, but if their manager doesn't know about it, or can't change it, something is wrong in the organization. Fifth, if there is a problem in a critical team, the organization as a whole has failed.

This won't be solved by outsourcing. If anything, placing critical employees at a distance creates more problems than it solves.


Agreed. Over the past few years I've encountered more and more organisations with a management to developer/designer ratio above 1, i.e. for every developer or designer there is more than one "manager" (PM, EM, etc.) involved. These organisations tend to have appalling velocity and very low developer morale.

Conversely, some of the most efficient organisations I've seen have virtually no "management": usually a lead developer who still works on the product, managing tickets and taking requirements from a CEO or similar. These teams can deliver a mind-boggling amount of work in comparison.


Good summary! Not sure about that conclusion. I don't imagine there are roving bands of these specialized DNS network architects to whom you can magically outsource the operations.

The whole thing just strikes me as the continued under-valuing of this kind of maintenance work. It's not glamorous, you often work for a government/non-profit that pays less, on-call is brutal, and the chronic short-staffing is a pain multiplier. Not exactly the best foundation for an entire country's internet infrastructure to sit atop.


But DNSSEC in root zones isn't unique to .nz.

.au, .us, .com, .net, .gov, .io, etc all have the same challenges.


Outsourcing isn't a panacea. `.au` is outsourced to Identity Digital (formerly Afilias) and they managed to flub their configuration recently too[1].

1. https://www.auda.org.au/statement/au-domain-name-system-upda...


TLDs, not root zones ;)

The root zone sits atop the DNS hierarchy and is usually denoted by a single dot; it contains all the top-level domains.
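
If you want to see the split for yourself: the DS record that anchors .nz lives in the root zone, while the matching DNSKEY is served by the .nz name servers. A rough illustration, assuming Python with dnspython (the address below is a.root-servers.net):

    # Sketch: the DS for a TLD is published by its parent, the root zone;
    # the DNSKEY it points at is published by the TLD's own name servers.
    import dns.message
    import dns.query

    q = dns.message.make_query('nz.', 'DS')
    r = dns.query.udp(q, '198.41.0.4', timeout=5)   # a.root-servers.net
    for rrset in r.answer:
        print(rrset)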


.ca as well:

> This DNSSEC Practice Statement (“DPS”) is a statement of security practices and provisions made by the Canadian Internet Registration Authority (CIRA). These practices and provisions are applied in conjunction with DNS Security Extensions (DNSSEC) in the Canadian country-code Top Level Domain (ccTLD), .CA.

> This DPS conforms to the template included in RFC 6841. The approach described here is modelled closely on the corresponding procedures published in a corresponding DNSSEC Policy and Practice Statement published by .SE (The Internet Infrastructure Foundation) for the Swedish top-level domain, whose pioneering work in DNSSEC deployment is acknowledged.

* https://www.cira.ca/en/resources/documents/domains/cira-dnss...

    To provide a means for stakeholders to evaluate the strength and
    security of the DNSSEC chain of trust, an entity operating a DNSSEC-
    enabled zone may publish a DNSSEC Practice Statement (DPS),
    comprising statements describing critical security controls and
    procedures relevant for scrutinizing the trustworthiness of the
    system.  The DPS may also identify any of the DNSSEC Policies (DPs)
    it supports, explaining how it meets their requirements.
* https://datatracker.ietf.org/doc/html/rfc6841


If you only need to hire a minimal number of employees to run it, the "TLD business" must be quite profitable.


Actually margins are razor thin. Most of the hard part is compliance, and you're paying out huge sums for audits.


I wonder how the budget to write this 90-page report compares to the annual budget for .nz


90 pages are easily filled by a handful of contributors. It's when you want to convey the same information in 9 pages (without removing much) that it gets time-consuming.


Does anyone know whether DNSSEC-validating recursive resolvers will retry (fetch new records) if their cached DNSSEC-related records (e.g. DS) are a certain age and they are seeing a validation error?

If they did, problems could auto-correct faster (rather than waiting out a TTL of 1 day). It could cause higher load on name servers, but at first glance it seems like a reasonable trade-off.


Page 50 of the report says that some (unnamed) resolvers do this, and “we think this [..] is a very good implementation feature to reduce the impact of mistakes”, but that it’s “definitely not universally implemented”.


It's implementation-dependent. If I recall correctly, Unbound defaults to caching bogus responses for 60 seconds and BIND for 30 seconds.
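
If you want to see how your own resolver behaves, one way to tell a validation failure apart from an ordinary outage is to repeat the query with the CD (checking disabled) bit set. A rough probe, assuming Python with dnspython; the resolver address and the deliberately broken test name are placeholders:

    # Sketch: a name that SERVFAILs normally but resolves once CD is set is
    # failing DNSSEC validation, not failing at the authoritative servers.
    import dns.flags
    import dns.message
    import dns.query
    import dns.rcode

    RESOLVER = '127.0.0.1'          # placeholder: your validating resolver
    NAME = 'dnssec-failed.org.'     # placeholder: a known-bogus test name

    strict = dns.message.make_query(NAME, 'A', want_dnssec=True)
    relaxed = dns.message.make_query(NAME, 'A', want_dnssec=True)
    relaxed.flags |= dns.flags.CD   # ask the resolver to skip validation

    r1 = dns.query.udp(strict, RESOLVER, timeout=5)
    r2 = dns.query.udp(relaxed, RESOLVER, timeout=5)

    if r1.rcode() == dns.rcode.SERVFAIL and r2.rcode() == dns.rcode.NOERROR:
        print('validation failure (bogus data)')
    else:
        print('no validation problem detected')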


Does anyone know of a tool that guides you through making DNS changes? Before starting, it would analyse the current settings (SOA, parent/child NS, DS, DNSKEY, RRSIG, etc. and their TTLs), possibly suggest reducing some TTLs, and show when you need to make which changes. Then it would check along the way whether each step has propagated and let you continue to the next step when it is safe. Could be useful for DNSSEC changes, but also for just changing name servers.

I moved domains to new name servers last week. I took the poor man's approach of disabling DNSSEC during the migration. I made a step-by-step plan, verifying propagation along the way with dig. I still made one mistake: I thought I had reduced the NS TTL in a child zone before migrating, but hadn't (it was still 2 days instead of 1 hour). An automated check would have caught it.

I think such a tool could also have shown that this KSK rollover required a longer wait than the procedure allowed for.
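
I haven't found such a tool, but even a small pre-flight check would have caught my NS TTL mistake. A rough sketch, assuming Python with dnspython (the domain, record types and threshold are placeholders; note that a recursive resolver reports the remaining cache TTL, so for the authoritative value you'd point this at the zone's own name servers):

    # Sketch: before a migration, verify that the TTLs you believe you lowered
    # are actually low for every record type you are about to change.
    import dns.resolver

    DOMAIN = 'example.nz'                  # placeholder
    RDTYPES = ['SOA', 'NS', 'DS', 'DNSKEY']
    MAX_TTL = 3600                         # placeholder: 1 hour

    for rdtype in RDTYPES:
        try:
            answer = dns.resolver.resolve(DOMAIN, rdtype)
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            print(f'{rdtype}: no records')
            continue
        ttl = answer.rrset.ttl
        verdict = 'OK' if ttl <= MAX_TTL else 'TOO HIGH - wait before migrating'
        print(f'{rdtype}: TTL {ttl}s {verdict}')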


For moving DNSSEC-signed domains, there’s MUSIC: <https://github.com/DNSSEC-Provisioning/music>


I only know of proprietary tooling for this sort of thing. I'd imagine it'd be tough to sell as a product or service, so it'd probably have to be someone's labor of love.


I would probably build something on top of OctoDNS, the IaC tool I use at work and at home to update and migrate domains and their records...


Technical tl;dr:

The DS record was cached for 24 hours but the DNSKEY record was cached for 1 hour. The DS and DNSKEY records were rotated as part of the annual rotation of the signing key, and the (cached) old DS record still referenced the old DNSKEY. This inconsistency invalidated records for resolvers that validate DNSSEC (85% of New Zealand users, mostly through their ISPs' resolvers). A check for this kind of mismatch is sketched below.

Procedural tl;dr: Management was not informed of the incident until the next day, and outside DNS operators were not made aware of the KSK (key signing key) rotation, so it was difficult for them to theorise about the cause of the problem.
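
The broken state is straightforward to detect from the outside: the key tags in the parent's DS RRset should match at least one DNSKEY actually published by the child. A minimal sketch of that check, assuming Python with dnspython:

    # Sketch: compare the key tags referenced by the parent's DS records with
    # the key tags of the DNSKEYs the child actually publishes. During the
    # incident, resolvers holding the cached (old) DS saw no match and
    # returned SERVFAIL.
    import dns.dnssec
    import dns.name
    import dns.resolver

    zone = dns.name.from_text('nz.')
    ds_rrset = dns.resolver.resolve(zone, 'DS').rrset          # published in the root zone
    dnskey_rrset = dns.resolver.resolve(zone, 'DNSKEY').rrset  # served by .nz

    parent_tags = {ds.key_tag for ds in ds_rrset}
    child_tags = {dns.dnssec.key_id(key) for key in dnskey_rrset}

    if parent_tags & child_tags:
        print('chain intact: a DS matches a published DNSKEY')
    else:
        print('DS/DNSKEY mismatch: validating resolvers will SERVFAIL')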


Also, the software they were using (OpenDNSSEC) had an automated delay to avoid exactly this issue, but the delay period was based on a setting in its config file that was supposed to reflect the TTL of the DS record, and they forgot to change it when the TTL changed.


I think that is the core issue. I do wonder why this TTL is determined by a config file and not by a direct TTL query to the root servers; a mismatch between the two should at least produce a warning, I would think?
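
At minimum the signer could sanity-check the configured value against what the parent actually publishes. A rough sketch of such a check, assuming Python with dnspython (the configured value is a placeholder for whatever the OpenDNSSEC config contains):

    # Sketch: warn when a locally configured DS TTL no longer matches the DS
    # TTL actually published by the parent (the root zone, for a TLD).
    import dns.message
    import dns.query

    CONFIGURED_DS_TTL = 3600     # placeholder for the value in the signer config
    ZONE = 'nz.'

    # Ask a root server directly so we see the zone's real TTL rather than a
    # partially expired cache TTL from a recursive resolver.
    q = dns.message.make_query(ZONE, 'DS')
    r = dns.query.udp(q, '198.41.0.4', timeout=5)   # a.root-servers.net
    published_ttl = r.answer[0].ttl

    if published_ttl != CONFIGURED_DS_TTL:
        print(f'warning: configured DS TTL {CONFIGURED_DS_TTL}s, parent publishes {published_ttl}s')
    else:
        print('configured DS TTL matches the parent zone')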


OpenDNSSEC's first release predates the signing of the root zone, so perhaps there are some assumptions in it that need revisiting.


Interesting that it was written in LibreOffice.


Hallmark of a technically skilled organisation when their people choose to use something like LaTeX, LibreOffice, Markdown, etc. :)

(Or when they spoof the metadata!)



