- The team responsible for the .nz TLD is staffed by 4 technical employees doing 24x7 on-call support
- Engagement surveys captured staff feeling fatigued and under-resourced at the entire org level
- Hard to hire for roles due to specialisation
- Team moving away from BAU work to projects
- Issues raised by external partner but team had other immediate work pressures
- Had to contact an ex-employee to help resolve it
- Once-a-year critical task done by just one team member
- Many IT operations tasks have sufficient external dependencies that it is impossible to tell for certain whether the task will succeed in production without actually doing it in the production environment. - We've all been there.
Sounds like InternetNZ should actually outsource all of this to an external party and just focus on governance work.
> - Engagement surveys captured staff feeling fatigued and under-resourced at the entire org level
This smells like an organizational failure. First, the survey doesn't tell us whether the tech staff suffered fatigue, because it was organization-wide. Second, the Executive Leadership Team has 5 people. Just the executive leadership. The Council itself has 9 members. I shudder to think how many (indirect) managers the 4 tech staff have to report to. Third, if there is a problem in your organization, you won't find it with a bloody survey. Those just exist to satisfy KPIs. Fourth, if there is a problem in a small team, that is fairly easy to locate. But if their manager doesn't know, or can't change it, something is wrong in the organization. Fifth, if there is a problem in a critical team, the organization as a whole has failed.
This won't be solved by outsourcing. If anything, placing critical employees at a distance creates more problems than it solves.
Agreed. Over the past few years I've encountered more and more organisations with a management:developer/designer ratio of >1, i.e. for every developer or designer there is more than one "manager" (PM, EM, etc.) involved. These organisations tend to have appalling velocity and very low developer morale.
Conversely, some of the most efficient organisations I've seen have virtually no "management": usually a lead developer who still works on the product, managing tickets and taking requirements from a CEO or similar. These teams can deliver a mind-boggling amount of work in comparison.
Good summary! Not sure about that conclusion. I don't imagine there are roving bands of these specialized DNS network architects to whom you can magically outsource the operations.
The whole thing just strikes me as the continued under-valuing of this kind of maintenance work. It's not glamorous, you often work for a government/non-profit that pays less, on-call is brutal, and the chronic short-staffing is a pain multiplier. Not exactly the best foundation for an entire country's internet infrastructure to sit atop.
> This DNSSEC Practice Statement (“DPS”) is a statement of security practices and provisions made by the Canadian Internet Registration Authority (CIRA). These practices and provisions are applied in conjunction with DNS Security Extensions (DNSSEC) in the Canadian country-code Top Level Domain (ccTLD), .CA.
> This DPS conforms to the template included in RFC 6841. The approach described here is modelled closely on the corresponding procedures published in a corresponding DNSSEC Policy and Practice Statement published by .SE (The Internet Infrastructure Foundation) for the Swedish top-level domain, whose pioneering work in DNSSEC deployment is acknowledged.
To provide a means for stakeholders to evaluate the strength and security of the DNSSEC chain of trust, an entity operating a DNSSEC-enabled zone may publish a DNSSEC Practice Statement (DPS), comprising statements describing critical security controls and procedures relevant for scrutinizing the trustworthiness of the system. The DPS may also identify any of the DNSSEC Policies (DPs) it supports, explaining how it meets their requirements.
90 pages are easily filled with a handful of contributors. It's when you want to convey the same information in 9 pages, without removing (much of) it, that things get time-consuming.
Does anyone know if DNSSEC-validating recursive resolvers will retry (fetch new records) if their cached DNSSEC-related records (e.g. DS) are a certain age and they are seeing a validation error?
If they would, problems could autocorrect faster (than waiting out a TTL of 1 day). It could cause higher load on name servers, but at first glance it seems like a reasonable trade off.
Page 50 of the report says that some (unnamed) resolvers do this, and “we think this [..] is a very good implementation feature to reduce the impact of mistakes”, but that it’s “definitely not universally implemented”.
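Conceptually, that retry policy might look something like the sketch below. This is not taken from any real resolver; the names, the cache interface, and the 300-second threshold are all made up just to illustrate the idea of refetching the DS chain once when validation fails and the cached DS is old enough.

```python
import time
from dataclasses import dataclass

# Conceptual sketch only, not based on any real resolver's code. CacheEntry,
# ValidationError, the cache interface, and the 300-second threshold are all
# hypothetical; they just illustrate "refetch the DS chain once if validation
# fails and the cached DS is old enough".

MIN_DS_AGE_BEFORE_REFETCH = 300  # seconds

class ValidationError(Exception):
    """Raised when DNSSEC validation of an answer fails."""

@dataclass
class CacheEntry:
    rrset: object       # the cached DS RRset
    fetched_at: float   # unix timestamp when it was fetched

def resolve(qname, qtype, ds_cache, fetch_and_validate):
    """Validate from cache; if that fails and the cached DS is old, refetch once."""
    try:
        return fetch_and_validate(qname, qtype, use_cache=True)
    except ValidationError:
        entry = ds_cache.get(qname)
        if entry and time.time() - entry.fetched_at > MIN_DS_AGE_BEFORE_REFETCH:
            ds_cache.pop(qname, None)  # drop the possibly stale DS so it gets refetched
            return fetch_and_validate(qname, qtype, use_cache=False)
        raise  # DS is fresh (or absent), so fail as usual with SERVFAIL
```

The age threshold is what keeps the extra load bounded: repeated validation failures against a freshly fetched DS fail fast instead of hammering the authoritative servers.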
Does anyone know of a tool that guides you through making DNS changes? Before starting, it would analyze the current settings (SOA, parent/child NS, DS, DNSKEY, RRSIG, etc. and their TTLs), possibly suggest reducing some TTLs, and show when you need to make which changes. Then it would check along the way whether each step's changes have propagated, letting you continue to the next step when it is safe. Could be useful for DNSSEC changes, but also just for changing name servers.
I moved domains to new name servers last week. I took the poor man's approach of disabling DNSSEC during the migration. I made a step-by-step plan, verifying propagation along the way with dig. Still made one mistake: I thought I had reduced the NS TTL in a child zone before migrating, but hadn't (it was still 2 days instead of 1 hour). An automated check, like the sketch below, would have caught it.
I think such a tool could also have shown this KSK rollover required a longer wait than the procedure had.
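For the TTL part, a rough sketch of the idea using dnspython (the zone name and target TTL below are just placeholders): ask each authoritative server directly for the zone's NS RRset and flag any TTL above what you intended, which is exactly the check that would have caught the child-zone NS TTL still being 2 days.

```python
# Rough sketch, not a finished tool. Requires dnspython (pip install dnspython).
# ZONE and TARGET_TTL are placeholders for the zone being migrated and the TTL
# you intended to have in place before starting.

import dns.message
import dns.query
import dns.resolver

ZONE = "example.nz."   # placeholder
TARGET_TTL = 3600      # placeholder: the NS TTL you expect before migrating

def authoritative_ips(zone):
    """Yield (name, IPv4) for each of the zone's authoritative name servers."""
    for ns in dns.resolver.resolve(zone, "NS"):
        for a in dns.resolver.resolve(ns.target, "A"):
            yield ns.target.to_text(), a.address

def check_ns_ttl(zone, target_ttl):
    for ns_name, ip in authoritative_ips(zone):
        reply = dns.query.udp(dns.message.make_query(zone, "NS"), ip, timeout=5)
        for rrset in reply.answer:
            status = "OK" if rrset.ttl <= target_ttl else "TOO HIGH"
            print(f"{ns_name} ({ip}): NS TTL = {rrset.ttl}s [{status}]")

if __name__ == "__main__":
    check_ns_ttl(ZONE, TARGET_TTL)
```

A full tool would do the same for SOA, DS, DNSKEY and RRSIG TTLs and gate each migration step on the previous one having propagated everywhere.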
I only know of proprietary tooling for this sort of thing. I'd imagine it'd be tough to sell as a product or service, so it'd probably have to be someone's labor of love.
The DS record was cached for 24 hours but the DNSKEY record was cached for 1 hour.
The DS and DNSKEY records were rotated due to an annual rotation of the signing key.
The (cached) old DS record referenced the old DNSKEY. This inconsistency invalidated records for resolvers that validate DNSSEC (85% of New Zealand users, mostly through their ISPs' resolvers); a consistency check for this is sketched after the procedural tl;dr.
Procedural tl;dr:
Management was not informed of the incident until the next day.
DNS operators were not made aware of the rotation of the KSK (key signing key), so it was difficult for outside operators to theorise about the problem.
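One check worth automating is the consistency that was violated here: every DS record the parent publishes should have a matching DNSKEY that the child zone actually serves. A hedged sketch with dnspython (the zone name is a placeholder). Note that this only checks what is currently published; it will not spot an old DS still sitting in resolver caches. Avoiding that requires keeping the old DNSKEY published until the old DS TTL has expired everywhere, which is the wait that was cut short in this incident.

```python
# Sketch of a DS <-> DNSKEY consistency check using dnspython. ZONE is a
# placeholder. SHA-1 DS records are treated as unmatched here for simplicity.

import dns.dnssec
import dns.resolver

ZONE = "example.nz."   # placeholder

DIGESTS = {2: "SHA256", 4: "SHA384"}

def ds_matches_dnskey(zone):
    dnskeys = dns.resolver.resolve(zone, "DNSKEY")
    all_ok = True
    for ds in dns.resolver.resolve(zone, "DS"):
        digest = DIGESTS.get(ds.digest_type)
        # Recompute a DS from each served DNSKEY and see whether any matches.
        matched = digest is not None and any(
            dns.dnssec.make_ds(zone, key, digest) == ds for key in dnskeys
        )
        print(f"DS key_tag={ds.key_tag}: "
              f"{'matches a served DNSKEY' if matched else 'NO MATCHING DNSKEY'}")
        all_ok = all_ok and matched
    return all_ok

if __name__ == "__main__":
    ds_matches_dnskey(ZONE)
```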
Also, the software they were using (OpenDNSSEC) had an automated delay to avoid exactly this issue, but the delay period was based on a setting in its config file that was supposed to reflect the TTL of the DS record, and they forgot to change it when the TTL changed.
I think that is the core issue. I do wonder why this TTL is determined by a config file and not a direct TTL query to the root servers. A mismatch between the two should at least provide a warning, I would think?
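Even without touching OpenDNSSEC itself, that warning seems easy to script externally: take the configured value (CONFIGURED_DS_TTL below is a placeholder for whatever the config file mentioned above actually holds) and compare it with the DS TTL served by the parent zone's authoritative servers. A sketch with dnspython:

```python
# Sketch of a configured-vs-published DS TTL check. ZONE and CONFIGURED_DS_TTL
# are placeholders; in practice the latter would be read from the signer's
# configuration rather than hard-coded.

import dns.message
import dns.name
import dns.query
import dns.resolver

ZONE = "example.nz."       # placeholder child zone
CONFIGURED_DS_TTL = 3600   # placeholder for the DS TTL the signer assumes

def published_ds_ttl(zone):
    """Return the DS TTL as served by one of the parent zone's authoritative servers."""
    parent = dns.name.from_text(zone).parent()
    ns_target = next(iter(dns.resolver.resolve(parent, "NS"))).target
    ns_ip = next(iter(dns.resolver.resolve(ns_target, "A"))).address
    reply = dns.query.udp(dns.message.make_query(zone, "DS"), ns_ip, timeout=5)
    for rrset in reply.answer:
        return rrset.ttl
    raise RuntimeError(f"no DS record for {zone} at {ns_target}")

if __name__ == "__main__":
    published = published_ds_ttl(ZONE)
    if published != CONFIGURED_DS_TTL:
        print(f"WARNING: configured DS TTL {CONFIGURED_DS_TTL}s "
              f"does not match published DS TTL {published}s")
    else:
        print(f"OK: published DS TTL is {published}s, matching the configuration")
```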