
Lead Site Reliability Engineer
Job summary
As a senior member of the National Data Exploitation Capability (NDEC), you will lead the technical delivery, reliability and performance of advanced applications and services that underpin critical operational activity. You will drive the development and maintenance of resilient, scalable and high‑performing systems, ensuring they consistently meet operational demand. Working at the forefront of the Agency’s data and technology landscape, you will shape engineering approaches, champion automation and observability, and enable teams to deliver secure, robust and dependable services.
Job description
As the Lead Site Reliability Engineer, you will provide strategic and technical leadership for the site reliability engineering (SRE) function within the National Data Exploitation Capability (NDEC). You will lead an Agile, multi‑disciplinary team responsible for designing, implementing and operating the applications and services that support critical analytical and operational outcomes across the NCA. Your remit will include ensuring the reliability, resilience, capacity, availability and performance of these services in line with demanding operational needs.
You will act as NDEC’s subject matter expert for all aspects of Site Reliability Engineering, setting technical direction, driving adoption of SRE best practice and providing expert guidance to specialist teams. You will lead efforts to build and maintain stable, secure and scalable systems, using data and observability to anticipate issues, reduce operational toil and improve service performance.
A key part of your role will be championing automation and modern engineering approaches to streamline processes, accelerate delivery and enhance system reliability. You will promote a culture of continuous improvement, using monitoring, performance insights and stakeholder feedback to identify opportunities to optimise systems, strengthen service resilience and improve user experience. You will also work closely with engineers, architects, product managers and operational teams to ensure that services are designed and operated in a reliable, maintainable and cost‑effective way.
Through strong collaboration, effective leadership and a deep understanding of SRE principles, you will play a pivotal role in ensuring NDEC’s platforms and services remain robust, responsive and able to support mission‑critical operations in a fast‑moving and complex environment.
Duties and Responsibilities
Delivery - Lead and oversee the end‑to‑end delivery of high‑quality software applications and services, from design and testing through to implementation, operation and ongoing support, ensuring they meet reliability, performance and availability requirements.
Quality Assurance - Ensure all solutions are secure by design, compliant with regulatory, security and architectural standards, and aligned with best practice across engineering, operational and governance domains.
Subject Matter Expertise - Act as the SRE subject matter expert on tooling, technologies and engineering practices, including Infrastructure as Code, CI/CD pipelines, observability tooling and containerisation, ensuring these are applied effectively to improve scalability, resilience and operational efficiency.
Monitoring & Observability - Lead the implementation and continuous improvement of monitoring and observability capabilities. Ensure deployed applications and services are actively monitored, and that availability targets are met through effective alerting, diagnostics and operational insight.
Automation - Drive automation initiatives, establishing processes to identify manual or repetitive tasks, and applying automation to reduce operational effort, improve consistency and enhance service reliability.
Incident Management - Lead the detection, diagnosis and resolution of incidents and problems in collaboration with the Service Manager. Ensure effective incident response processes, rapid escalation, clear communication and timely remediation actions.
Scalability & Capacity Planning - Plan for and manage capacity across services and platforms to ensure systems can scale reliably in response to operational and user demand, mitigating performance or stability risks.
Troubleshooting & Problem Resolution - Lead post‑incident reviews and root cause analysis, directing the implementation of lessons learned and longer‑term improvements to prevent recurrence and strengthen system resilience.
Leadership - Provide strong leadership to the NDEC Site Reliability Engineering team, ensuring teams deliver reliable, scalable and secure services throughout the entire software lifecycle. Mentor and develop junior SREs and foster a culture of collaboration, learning and excellence.
Innovation - Stay up to date with emerging industry trends, technologies and SRE practices. Evaluate new tools and techniques to enhance automation, observability and overall service reliability, and guide their adoption where beneficial.
Communication & Collaboration - Communicate clearly and confidently with senior leaders, translating technical issues, risks and dependencies into clear operational or organisational impacts. Ensure the SRE team collaborates effectively with engineers, architects, specialists and operational stakeholders to maintain high‑quality service delivery.
Person specification
Availability & Capacity Management - Ability to lead teams in the design, deployment, monitoring and support of services to ensure they meet availability, reliability and scalability requirements. Experience planning and managing capacity to ensure systems and services scale effectively in response to operational demand.
Coding, Scripting & Infrastructure as Code - Ability to write, read and maintain Infrastructure as Code solutions (e.g., Terraform) and work confidently with containerisation technologies such as Docker. Experience applying automation, scripting and configuration management to improve repeatability and reduce operational effort.
Modern Development Standards & DevOps Practices - Strong understanding of modern development standards, including the use of CI/CD pipelines (e.g. GitLab) and automated build/deployment processes. Ability to lead others in adopting modern engineering practices, including containerisation best practice and developing skills or interest in Kubernetes. Experience delivering and maintaining scalable applications using CI/CD, IaC and virtualisation technologies such as VMware (or equivalent).
Problem & Incident Management - Experience identifying, investigating and resolving root causes of incidents and recurring problems, using data to identify patterns and trends. Ability to collaborate with specialists to determine appropriate resolutions, implement preventative measures and drive continuous improvement.
Systems Design & Integration - Ability to review and assure system designs to ensure appropriate technology choices, efficient use of resources and integration across multiple platforms, including virtualisation environments such as VMware. Experience designing or supporting complex, distributed systems, ensuring they are resilient, scalable and secure.
Technical Leadership & SME Expertise - Ability to anticipate technology trends, advise on future opportunities and set direction for tooling, standards and best practice across the SRE function. Demonstrable experience providing technical leadership and mentorship, supporting skill development and capability growth within the team. Experience leading the delivery and lifecycle management of high‑quality, reliable applications and services.
Cloud Engineering - Experience developing, deploying and supporting cloud‑based applications (preferably Amazon Web Services). Understanding of cloud‑native architectures, operational models and security considerations.
Performance & Service Management - Experience monitoring and managing the performance of applications and services to ensure they meet operational and user‑driven demand. Ability to lead post‑incident reviews, direct improvements and ensure stable, high‑quality service operation.
Communication & Stakeholder Engagement - Ability to communicate complex technical information clearly and confidently, adapting style for senior leadership audiences. Experience escalating risks, translating technical issues into business impacts and ensuring decisions are well understood. Ability to lead collaborative working with engineers, architects and operational stakeholders to maintain service quality.
Behaviours
We'll assess you against these behaviours during the selection process:
- Seeing the Big Picture
- Leadership
- Managing a Quality Service
Benefits
Alongside your salary of £67,609, the National Crime Agency contributes £19,586 towards you being a member of the Civil Service Defined Benefit Pension scheme. Find out what benefits a Civil Service Pension provides.
New entrants to the NCA receive 26 days annual leave, rising to 31 on completion of 5 years continuous service, plus 8 bank holidays.
If the qualifying criteria are met, new joiners from UK Police Forces or the UK Intelligence Community (UKIC) will have service with those employers taken into account for continuous service purposes, for annual leave entitlement only. This will be up to a maximum of 31 days leave (including 1 privilege day).
Other benefits include:
- Flexible working, including flexi-time, compressed hours and job sharing (in line with business requirements)
- Family friendly policies, notably above the statutory minimum
- Learning and Development opportunities
- Interest free loans and advances, including season tickets, childcare and rental deposits
- Housing schemes - Key Worker status
- Discounts and Savings with a wide variety of services including Cycle to Work, Smart Tech schemes, dental insurance, gym discounts and savings on everyday spending, available through the Reward Gateway, Edenred and Blue Light Card schemes.
- Staff support groups/networks
- Sports and social activities, including membership to the Civil Service Sports Council (CSSC)
Further information is available on the NCA Website.
Things you need to know
Artificial intelligence
Artificial intelligence can be a useful tool to support your application; however, all examples and statements provided must be truthful, factually accurate and taken directly from your own experience. Where plagiarism has been identified (presenting the ideas and experiences of others, or generated by artificial intelligence, as your own) applications may be withdrawn and internal candidates may be subject to disciplinary action. Please see our candidate guidance for more information on appropriate and inappropriate use.
Selection process details
This vacancy is using Success Profiles, and will assess your Behaviours and Experience.
Experience - This will be assessed:
CV
Please include your full career history, training, qualifications, key responsibilities, and achievements. Explain any employment gaps in the last two years. Ensure all accreditation dates are accurate.
Details of what is expected within your CV are as follows: Please provide a high‑level summary of your relevant career history, highlighting the roles, environments and levels of responsibility that demonstrate your ability to operate effectively in a context comparable to this position and meet the criteria listed in the Person Specification.
Experience Criteria - These will be assessed by 500-word examples on:
- Designing, automating and managing highly reliable and scalable distributed systems.
- Hands‑on leadership in continuous integration and continuous deployment (CI/CD), container orchestration, and modern DevOps and Site Reliability Engineering practices.
- Proven experience in incident response, root cause analysis and leading reliability improvement initiatives to enhance the stability and performance of services.
Longlist
In the event of a high number of applications, we may operate a longlist. Applicants will need to meet the minimum pass mark for the lead criteria.
- Designing, automating and managing highly reliable and scalable distributed systems.
Candidates who do not meet the minimum pass mark for the lead criteria will not progress to having their other criteria assessed. Applications must meet the minimum criteria to be progressed to the assessment stage.
You will receive an acknowledgement once your application is submitted.
We aim to have the sift completed and scores released within 10 working days of the closing date of the advert. For high-volume campaigns this timeframe may be extended.
Scores will be provided but further feedback will not be available at this stage.
For guidance on the application process, visit: NCA Applying and Onboarding
Assessment 1
This assessment will be an Interview, which will test the criteria listed in the Success Profiles at Assessment section.
Success Profiles at Assessment
Behaviours
- Seeing the Big Picture
- Leadership
- Managing a Quality Service
Experience
- Designing, automating and managing highly reliable and scalable distributed systems.
- Hands‑on leadership in continuous integration and continuous deployment (CI/CD), container orchestration, and modern DevOps and Site Reliability Engineering practices.
- Proven experience in incident response, root cause analysis and leading reliability improvement initiatives to enhance the stability and performance of services.
If successful but no role is immediately available, you may be placed on a reserve list for 12 months.
Reserve lists can be used to fill similar role types across the Agency where the assessment criteria are considered a match by the recruitment team and the business area.
In the event of a tie at the assessment stage, available roles will be offered in merit order using the following order:
- Lead criteria (behaviours/technical/experience)
- If still tied, desirable criteria will be assessed (if advertised)
- If still tied, application sift scores will be used
Feedback will only be provided if you attend an interview or assessment.
Security
Successful candidates must meet the security requirements before they can be appointed. The level of security needed is security check.
See our vetting charter.
People working with government assets must complete baseline personnel security standard checks.
Medical
Successful candidates will be expected to have a medical.
Nationality requirements
Open to UK nationals only.
Working for the Civil Service
The Civil Service Code sets out the standards of behaviour expected of civil servants.
We recruit by merit on the basis of fair and open competition, as outlined in the Civil Service Commission's recruitment principles.
The Civil Service embraces diversity and promotes equal opportunities. As such, we run a Disability Confident Scheme (DCS) for candidates with disabilities who meet the minimum selection criteria.
The Civil Service also offers a Redeployment Interview Scheme to civil servants who are at risk of redundancy, and who meet the minimum requirements for the advertised vacancy.
Diversity and Inclusion
The Civil Service is committed to attracting, retaining and investing in talent wherever it is found. To learn more please see the Civil Service People Plan and the Civil Service Diversity and Inclusion Strategy.
Apply and further information
This vacancy is part of the Great Place to Work for Veterans initiative.
Once this job has closed, the job advert will no longer be available. You may want to save a copy for your records.
Contact point for applicants
Job contact:
- Name: central.recruitment@nca.gov.uk
- Email: central.recruitment@nca.gov.uk
- Telephone: central.recruitment@nca.gov.uk
Recruitment team
- Email: central.recruitment@nca.gov.uk
Further information
If you believe your application has not been treated fairly, email: Central.Recruitment@nca.gov.uk (quoting the vacancy reference). If unresolved, you may escalate your complaint to the Civil Service Commission.
Salary range
- £67,127 per year