<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Upgraded Adventure</title>
    <description>Upgraded Adventures of Bruce Becker</description>
    <link>https://www.brucellino.dev/</link>
    <atom:link href="https://www.brucellino.dev/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 09 Apr 2026 12:53:12 +0000</pubDate>
    <lastBuildDate>Thu, 09 Apr 2026 12:53:12 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Managing security and risk in a complex system</title>
        <description>&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#problem-statement&quot; id=&quot;markdown-toc-problem-statement&quot;&gt;Problem Statement&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#setting-the-stage&quot; id=&quot;markdown-toc-setting-the-stage&quot;&gt;Setting the stage&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#seeing-risk&quot; id=&quot;markdown-toc-seeing-risk&quot;&gt;Seeing Risk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#owning-risk&quot; id=&quot;markdown-toc-owning-risk&quot;&gt;Owning Risk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#where-have-i-seen-this-before&quot; id=&quot;markdown-toc-where-have-i-seen-this-before&quot;&gt;Where have I seen this before&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#providing-value-to-customers-closes-the-feedback-loop&quot; id=&quot;markdown-toc-providing-value-to-customers-closes-the-feedback-loop&quot;&gt;Providing value to customers closes the feedback loop&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#putting-dependency-track-to-use&quot; id=&quot;markdown-toc-putting-dependency-track-to-use&quot;&gt;Putting Dependency Track to use&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#tracking-dependencies-measuring-risk-ensuring-compliance&quot; id=&quot;markdown-toc-tracking-dependencies-measuring-risk-ensuring-compliance&quot;&gt;Tracking dependencies, measuring risk, ensuring compliance&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#services-in-the-security-plane-of-the-platform&quot; id=&quot;markdown-toc-services-in-the-security-plane-of-the-platform&quot;&gt;Services in the Security Plane of the platform&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot; id=&quot;markdown-toc-conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#footnotes-and-references&quot; id=&quot;markdown-toc-footnotes-and-references&quot;&gt;Footnotes and References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;/h2&gt;

&lt;p&gt;In this post, we’ll tackle a common problem faced by those responsible for shipping software for a client: ensuring that the software supply chain is kept secure and compliant with the requirements of the contract.&lt;/p&gt;

&lt;h2 id=&quot;setting-the-stage&quot;&gt;Setting the stage&lt;/h2&gt;

&lt;p&gt;Let’s say you’re building a set of software services for a client, which takes the form of a product composed of multiple services, managed on behalf of the client.
The goal of this set of services is to provide access to a defined set of cloud services, data sets and other APIs, all of which are also part of your product.
Now, let’s say that the client intends to release this product for use by third parties and in order to do that, they need to state with certainty that the product is secure and compliant with all relevant regulations and standards.&lt;/p&gt;

&lt;p&gt;It is safe to assume that each of these services is implemented in one programming language or another, and that each implementation inevitably depends on a set of libraries or frameworks.
These direct dependencies may themselves also depend on other libraries, packages or modules.
In reality then, the system is composed not just of the specific implementation of services, but also of this entire graph of dependencies.&lt;/p&gt;

&lt;p&gt;Given this understanding of the composition of the system, of known, assumed and unknown dependencies, let us consider what it would take to keep the system safe.
This would, at a minimum, require the ability to observe the full composition of the system, as well as the presence of known vulnerabilities in the dependencies.&lt;/p&gt;

&lt;h2 id=&quot;seeing-risk&quot;&gt;Seeing Risk&lt;/h2&gt;

&lt;p&gt;Let’s assume we can actually see all of the dependencies: how would we achieve that?
It would involve some tool which inspected the system and extracted the graph of all of the dependencies.
This is only feasible when the system &lt;em&gt;declares&lt;/em&gt; what it depends on.
This varies from case to case:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;In the case of an application, this will come in the form of a language-specific manifest file, such as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requirements.txt&lt;/code&gt; for Python, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go.mod&lt;/code&gt; for Go, or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;package.json&lt;/code&gt; for Node.js.&lt;/li&gt;
  &lt;li&gt;In the case of a service, this will come in the form of a service-specific file, such as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; or a Docker Compose file for Docker, a manifest file or Helm chart for Kubernetes, or the providers section of a Terraform configuration.&lt;/li&gt;
  &lt;li&gt;In the case of a virtual or bare-metal machine, this will come from interrogating the filesystem for installed packages or modules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scanning these various endpoints, whether they be package managers, deployment manifests, or actual deployed environments, provides us with a list of discovered components and their versions.
This is the first step in being able to see the risk of the system: the simple ability to generate an inventory.
The key capability however is being able to take &lt;strong&gt;appropriate action&lt;/strong&gt; based on whether a given component is known to contain a given vulnerability.
Vulnerabilities do not inherently carry risk, however, since they must be exploitable in order to pose a threat, and whether they are exploitable or not depends on several factors, including the configuration and deployment of that component in the actual system.&lt;/p&gt;

&lt;p&gt;Nonetheless, the ability to map CVEs&lt;sup id=&quot;fnref:cve&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:cve&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; to components allows us to locate points of investigation, and perhaps even keep an audit trail of the system’s posture.&lt;/p&gt;

&lt;h2 id=&quot;owning-risk&quot;&gt;Owning Risk&lt;/h2&gt;

&lt;p&gt;The system is not a flat, amorphous blob of components.
There is a structure to the system which comes from the way in which humans have decided to organise their work around building, maintaining and deploying it.
In some cases, there will be a single system owner, perhaps an enterprise architect, who has the final responsibility for the system’s security posture.
In others, each service in the system will have self-contained ownership, and agree to a common set of policies in order to guarantee a consistent security posture in the system even though they take action on known vulnerabilities autonomously.
The point is that in order to ensure that the system as a whole is secure, it is necessary to know who should take action on which component.
&lt;strong&gt;Ownership&lt;/strong&gt; is thus the ability to map vulnerabilities to components and, &lt;em&gt;via&lt;/em&gt; services, to the actual people who take responsibility for action.&lt;/p&gt;

&lt;h2 id=&quot;where-have-i-seen-this-before&quot;&gt;Where have I seen this before&lt;/h2&gt;

&lt;p&gt;Way back, some time before 2020 (I forget exactly when), I was working in the SANREN group at the Council for Scientific and Industrial Research (CSIR) in Pretoria, South Africa.
SANREN stands for &lt;a href=&quot;https://www.sanren.ac.za&quot;&gt;“South African National Research and Education Network”&lt;/a&gt;, and aside from provisioning high-capacity fibre connections throughout South African research institutions and universities, it also had the mandate to deliver services to these customers.
One of these services was the South African National Grid, which I was responsible for, which coordinated the distributed computing infrastructure offered by the universities and laboratories connected to the SANREN network.
However, SANREN was very active in providing &lt;a href=&quot;https://www.sanren.ac.za/services/&quot;&gt;end-user services&lt;/a&gt; such as the identity federation, on-demand file transfer between individuals, &lt;em&gt;etc&lt;/em&gt;.
Most of these were simply instantiations of a product commonly used by peers in the research networking community.
These surely provided some value to the communities served by the infrastructure, but they were too generic to be considered invaluable.
The real wins were where we could give actionable advice to customers.&lt;/p&gt;

&lt;h3 id=&quot;providing-value-to-customers-closes-the-feedback-loop&quot;&gt;Providing value to customers closes the feedback loop&lt;/h3&gt;

&lt;p&gt;I will permit myself a brief digression here to consider how infrastructures are perceived and how they become sustainable.&lt;/p&gt;

&lt;p&gt;The question can be posed as such:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;How can invisible infrastructures keep proving their value to customers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An infrastructure works well when it becomes invisible; nobody sits and thinks about how the network provides value to them, they just use the things which the network enables.
This invisibility is something which often breaks the feedback loop between customer and provider, since when things are working well, the value is invisible, and only when the infrastructure fails to do its job does it become visible.
By its very nature, it is only perceived negatively.&lt;/p&gt;

&lt;p&gt;A second consideration is whether there is a direct feedback loop between the end users of the services offered by the infrastructure and those who actually pay for it (the customer).
In the absence of such a loop, the sustainability of the infrastructure, depending on its continued viability, requires arduous measurement and reporting of the customer satisfaction and usage metrics, which inevitably drives up the cost and reduces the efficiency.
Instead, what if the infrastructure provider itself provided services &lt;em&gt;directly to the customer&lt;/em&gt;?
In SANREN’s case, these are typically the Information Technology departments of the institutes served by the network&lt;sup id=&quot;fnref:asaudit&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:asaudit&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
These IT organisations consist of IT professionals, with their budget directly allocated to IT services &lt;em&gt;for the institute&lt;/em&gt;.
Part of their mission is usually to protect the network and IT resources associated with them, either within their institute or across the network in the case of research and scientific collaborations.&lt;/p&gt;

&lt;p&gt;Assessing and owning risks was one of those activities which simply could not be performed by the respective IT departments, given their limited resources and vast set of users.
If SANREN could provide them with a service to assess security risks and provide specific, actionable recommendations assigned to specific people, this would have gone a long way towards helping them perform their overall mission.
This would have been a service provided directly to the customer, providing a strong feedback loop between the invisible infrastructure and those who ultimately depend on it.&lt;/p&gt;

&lt;p&gt;It was my buddy Schalk Peach way back around 2017-2018 who helped me understand how powerful this feedback loop between vulnerability, component and person would be.
He was a postgrad student at the CSIR SANREN team during my last few years there, and we had almost daily conversations about the workflows and entity models that would be involved in providing a service like this.
At the time, we were focussed on perimeter security and directing customer network administrators to specific mitigation actions based on the unintelligible vulnerability reports from &lt;a href=&quot;https://www.tenable.com/products/nessus&quot;&gt;Nessus&lt;/a&gt;&lt;sup id=&quot;fnref:nessus&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:nessus&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Every now and again I think about him and how far ahead of the curve he was.&lt;/p&gt;

&lt;h2 id=&quot;putting-dependency-track-to-use&quot;&gt;Putting Dependency Track to use&lt;/h2&gt;

&lt;p&gt;Now I’m part of the team delivering a managed service to researchers on behalf of the European Commission, and we are facing the same challenges as a decade ago.
A customer like the EC does not mess around when it comes to compliance and quality, and the system we are building is delivered by a great number of independent contractors, so I knew that assuring all interested parties that the system was secure and up to date would require specific tooling.
Indeed at the outset my mind flew right back to Schalk and his magical vulnerability-remedy system… if only we had something like that to observe and track our dependencies!&lt;/p&gt;

&lt;p&gt;Well, it turns out that in the decade or so since I last bothered myself with platform security, the OWASP foundation has brought a great many projects to maturity, including one called &lt;a href=&quot;https://dependencytrack.org&quot;&gt;“Dependency Track”&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;tracking-dependencies-measuring-risk-ensuring-compliance&quot;&gt;Tracking dependencies, measuring risk, ensuring compliance&lt;/h3&gt;

&lt;p&gt;Dependency Track’s strength comes from its ability to keep a detailed database of Software Bills of Materials (SBOMs), as well as a local database of all known vulnerabilities.
Vulnerabilities are almost always published with an associated component, so all we needed to do was declare the components of our system, and their associated SBOMs.
Dependency Track also has an overlay system which allows one to declare that certain projects are owned by certain teams, recreating that all-important feedback loop we mentioned repeatedly before.&lt;/p&gt;

&lt;p&gt;This functionality allowed security teams to know at a glance which vulnerabilities were present where, and indeed what the relevant service owner was doing about them.&lt;/p&gt;

&lt;p&gt;We ended up bolting on a custom notification handler to Dependency Track to allow us to accurately communicate the discovery of vulnerabilities both with the client and the security team.&lt;/p&gt;

&lt;h3 id=&quot;services-in-the-security-plane-of-the-platform&quot;&gt;Services in the Security Plane of the platform&lt;/h3&gt;

&lt;p&gt;In the larger context of providing a &lt;a href=&quot;https://platformengineering.org/platform-tooling&quot;&gt;platform providing support functions for the delivery of a wide set of services&lt;/a&gt;, Dependency Track falls cleanly into the “Security Plane”.
Since the final goal is not measurement and visibility but actually taking remedial action, we should envision connecting task and issue tracking services to the dependency tracker.
However, dependencies of software components are not the only source of risk and security vulnerability in the system.
Misconfiguration and insufficient resources can also contribute to security risks, or risks of denial of service.
These are detected with other tools, such as penetration and load testing tools.
These findings too are associated with specific components, owned by specific people in the organisation.&lt;/p&gt;

&lt;p&gt;This could be done with &lt;a href=&quot;https://defectdojo.com/&quot;&gt;DefectDojo&lt;/a&gt;, another tool in the OWASP stable.&lt;/p&gt;

&lt;p&gt;These of course only address the system design and architecture.
There are also the operational aspects, the day-to-day events happening in the system, and the inevitable attacks it will face if connected to the internet.
To this end a security information and event management (SIEM) system is required, where alerts could also be passed to the overall view of the security posture, potentially correlated with known vulnerabilities and outdated components.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have gone to great lengths to find a set of tools which would allow teams delivering software to declare their composition and dependencies to a central service, and in turn obtain visibility and actionable advice on how to respond to known vulnerabilities.
This could be deployed in the context of platform engineering within the “security plane”, offering a platform-level internal service to internal customers wishing to deploy in the platform.
All of the tools discussed here are open-source and deployable on-prem.
Indeed I have deployed them in a Nomad cluster built for the purposes of providing the platform-level functions to our client, as well as a similar deployment in EGI internally.
We gain insight into component-level risk, the ability to map that risk to actual people, and the ability to provide them with specific concrete advice on how to mitigate the risk.
With the addition of a SIEM, we would be able to have real-time observability into security events, and thus ensure not only the architectural security of the system, but also its operational security.&lt;/p&gt;

&lt;p&gt;Total visibility and collaboration between platform engineers, SRE and service owners is the goal, and it is within our reach.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;footnotes-and-references&quot;&gt;Footnotes and References&lt;/h1&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:cve&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Common Vulnerabilities and Exposures (CVE) is a dictionary of publicly known information security vulnerabilities and exposures. CVE is maintained by the MITRE Corporation. &lt;a href=&quot;#fnref:cve&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:asaudit&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The Association of South African University Directors of IT (ASAUDIT) was in my understanding the interface between the infrastructure provider and the institutes – it has apparently transformed into &lt;a href=&quot;https://heitsa.ac.za/&quot;&gt;Higher Education IT South Africa&lt;/a&gt; &lt;a href=&quot;#fnref:asaudit&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:nessus&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I think it was Nessus, but it was a while ago. I do remember that it was open source at the time, whatever the tool was. &lt;a href=&quot;#fnref:nessus&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2025/07/system-dependency/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2025/07/system-dependency/</guid>
        
        <category>blog</category>
        
        <category>platform-engineering</category>
        
        <category>security-plane</category>
        
        <category>dependency-management</category>
        
        <category>risk-management</category>
        
        
        <category>methods</category>
        
      </item>
    
      <item>
        <title>A terraform module for Vault on DigitalOcean</title>
        <description>&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#vault-in-a-managed-platform&quot; id=&quot;markdown-toc-vault-in-a-managed-platform&quot;&gt;Vault in a Managed Platform&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#solution-context&quot; id=&quot;markdown-toc-solution-context&quot;&gt;Solution Context&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#one-goal--clean-architectural-interfaces&quot; id=&quot;markdown-toc-one-goal--clean-architectural-interfaces&quot;&gt;One Goal – clean architectural interfaces&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#design-decisions&quot; id=&quot;markdown-toc-design-decisions&quot;&gt;Design Decisions&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#discussion&quot; id=&quot;markdown-toc-discussion&quot;&gt;Discussion&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#design-dependencies&quot; id=&quot;markdown-toc-design-dependencies&quot;&gt;Design dependencies&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#control-plane-vs-security-plane&quot; id=&quot;markdown-toc-control-plane-vs-security-plane&quot;&gt;Control plane vs security plane&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#using-the-module&quot; id=&quot;markdown-toc-using-the-module&quot;&gt;Using the module.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot; id=&quot;markdown-toc-conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;vault-in-a-managed-platform&quot;&gt;Vault in a Managed Platform&lt;/h2&gt;

&lt;p&gt;This is a short writeup of the Terraform module I wrote for a Vault deployment.
I had to solve a few technical challenges and take a few decisions during the development process, so I wanted to write them down for posterity.&lt;/p&gt;

&lt;p&gt;TL;DR: You can find the module &lt;a href=&quot;https://registry.terraform.io/modules/brucellino/vault/digitalocean/latest&quot;&gt;in the Terraform registry&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;solution-context&quot;&gt;Solution Context&lt;/h3&gt;

&lt;p&gt;There are plenty of ways to deploy Vault, but this one has a few considerations which you may find yourself identifying with.
The overall context is that this service (Vault) is a component of an &lt;strong&gt;internal developer platform&lt;/strong&gt; which supports the delivery of workloads to downstream users.
I am writing as the platform engineer who is responsible for delivering the platform as a whole.&lt;/p&gt;

&lt;p&gt;I followed the &lt;a href=&quot;https://developer.hashicorp.com/well-architected-framework/zero-trust-security/raft-reference-architecture&quot;&gt;HashiCorp Well-Architected Framework&lt;/a&gt; for Vault, applying this to the DigitalOcean platform.&lt;/p&gt;

&lt;p&gt;Another important design decision here was that I consider Vault to be the first component of a platform, with zero dependencies.
This means that, at the time of deployment, we have no knowledge of any other services which may enhance the operations, such as service discovery, or observability.
These will indeed come later, as part of the platform’s iterative deployment, but it is important to explicitly exclude them from our scope since we want to make a clean Terraform module.&lt;/p&gt;

&lt;p&gt;There are a few questionable decisions which I have made during the design of this module, specifically the choice of Tailscale as an overlay network, and the use of self-signed certificates for TLS.
I will justify these and perhaps discuss how changes to the module may make them optional.&lt;/p&gt;

&lt;h2 id=&quot;one-goal--clean-architectural-interfaces&quot;&gt;One Goal – clean architectural interfaces&lt;/h2&gt;

&lt;p&gt;I wanted to see if I could reduce the goal of this module to a single concern, &lt;em&gt;i.e.&lt;/em&gt; it should &lt;em&gt;only&lt;/em&gt; deploy Vault.
Nothing else should be contained within it – if anything else is needed, it should be expressed as input variables.
Likewise, the module should be amenable to anything that depends on it upstream – it should produce outputs which are consumable by other modules which need to build on it.&lt;/p&gt;

&lt;p&gt;Finally, the module should deploy Vault with zero opinions about what Vault should do.
There should be no internal configuration for secret stores, authentication mechanisms, or other configuration.
This will vary greatly from team to team and use case to use case, so should be left up to the actual users of the service to define.
The overall picture then is of a production-ready, workload- and userbase-agnostic Vault deployment which can be customised by whoever requests it.
This will make the module better for use in a &lt;strong&gt;self-service&lt;/strong&gt; environment, such as the internal developer platform we are considering in our context.&lt;/p&gt;

&lt;h2 id=&quot;design-decisions&quot;&gt;Design Decisions&lt;/h2&gt;

&lt;p&gt;Design decisions are the constraints and opinions imposed on the module that express what it is designed to do.
Our design decisions are informed by the single goal we have set above, and constrain the user experience of the service such that it becomes consistent with the rest of the services in the platform.&lt;/p&gt;

&lt;p&gt;These decisions are meant to reduce the scope of the module in order to make it maintainable, and help operators decide when it is appropriate to use it.&lt;/p&gt;

&lt;p&gt;The design decisions are listed below:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Design decisions will be documented using the ADR format.&lt;/li&gt;
  &lt;li&gt;The resource provider will be DigitalOcean.&lt;/li&gt;
  &lt;li&gt;Vault will be served by configured processes on virtual machines.&lt;/li&gt;
  &lt;li&gt;Vault will be configured using userdata at provision time.&lt;/li&gt;
  &lt;li&gt;Vault data will be persisted to attached storage.&lt;/li&gt;
  &lt;li&gt;The virtual machines will be configured with private networking in a virtual private network.&lt;/li&gt;
  &lt;li&gt;Networking over the public interface will be disabled via firewall rules which permit access only to specific services, and &lt;strong&gt;not&lt;/strong&gt; the Vault service.&lt;/li&gt;
  &lt;li&gt;An overlay network will be provisioned and Vault will communicate only via this overlay network.&lt;/li&gt;
  &lt;li&gt;mTLS will be configured between Vault instances.&lt;/li&gt;
  &lt;li&gt;Certificates will be generated with a private CA, as part of the Terraform state.&lt;/li&gt;
  &lt;li&gt;The final state will be a sealed Vault cluster with no configured secret stores, authentication mechanisms or policies.&lt;/li&gt;
&lt;/ol&gt;
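
&lt;p&gt;To make the certificate decisions a little more concrete, here is a minimal sketch of a private CA held in Terraform state, with one node certificate usable for mTLS. It assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashicorp/tls&lt;/code&gt; provider; the resource names, subjects, hostnames and validity periods are illustrative assumptions and are not taken from the module itself.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical sketch: a private CA whose key and certificate live in Terraform state.
resource &quot;tls_private_key&quot; &quot;ca&quot; {
  algorithm = &quot;RSA&quot;
  rsa_bits  = 4096
}

resource &quot;tls_self_signed_cert&quot; &quot;ca&quot; {
  private_key_pem       = tls_private_key.ca.private_key_pem
  is_ca_certificate     = true
  validity_period_hours = 8760 # one year, an arbitrary choice

  subject {
    common_name  = &quot;Example Vault CA&quot;
    organization = &quot;Example Platform Team&quot;
  }

  allowed_uses = [&quot;cert_signing&quot;, &quot;crl_signing&quot;]
}

# Key and certificate request for one Vault node (hostname is made up).
resource &quot;tls_private_key&quot; &quot;vault&quot; {
  algorithm = &quot;RSA&quot;
  rsa_bits  = 2048
}

resource &quot;tls_cert_request&quot; &quot;vault&quot; {
  private_key_pem = tls_private_key.vault.private_key_pem
  dns_names       = [&quot;vault-0.example.internal&quot;]

  subject {
    common_name = &quot;vault-0.example.internal&quot;
  }
}

# The node certificate, signed by the CA above.
resource &quot;tls_locally_signed_cert&quot; &quot;vault&quot; {
  cert_request_pem      = tls_cert_request.vault.cert_request_pem
  ca_private_key_pem    = tls_private_key.ca.private_key_pem
  ca_cert_pem           = tls_self_signed_cert.ca.cert_pem
  validity_period_hours = 720

  # Both server and client usage are needed for mutual TLS between peers.
  allowed_uses = [&quot;digital_signature&quot;, &quot;key_encipherment&quot;, &quot;server_auth&quot;, &quot;client_auth&quot;]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With this approach the private keys are stored in the Terraform state, so the state backend needs to be protected accordingly.&lt;/p&gt;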

&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;

&lt;p&gt;Why do we take these specific decisions?
Are they objectively better than alternatives?
The answer is no, they are not objectively better, indeed there can be no objectivity here, because we are building something with a specific &lt;em&gt;intent&lt;/em&gt; in mind.
This is a module for use by teams who depend on a platform which is built by a small team of platform engineers.
There are constraints all over this scenario, so what we design must take those constraints into account, and respond to the subjectivity of the situation.
We could have decided to deploy Vault via a Helm chart on a DOKS cluster, or decided to expose it only via a load balancer.
We could have decided to add a backup operator which backed up Raft state to an object store, or we could have decided to use a managed database as Vault backend.
All of these would have introduced additional complexity and dependencies, while keeping the actual service (Vault) constant.&lt;/p&gt;

&lt;h3 id=&quot;design-dependencies&quot;&gt;Design dependencies&lt;/h3&gt;

&lt;p&gt;Although we declare zero dependency on other services, we do indeed need some bedrock on which to deploy our Vault.
These are our &lt;em&gt;design dependencies&lt;/em&gt;.
If you find yourself disagreeing with the need for these, then this module is not for you.
I find it important however to explicitly state these dependencies so that these divergences in expectations can be immediately identified.&lt;/p&gt;

&lt;p&gt;Rather than designing a module that is generic and satisfies any deployment scenario, I wanted to develop separate modules for specific scenarios.
It is perhaps true that there will be some redundancy and repetition across these modules, but this is done consciously in order to maintain clean interfaces.&lt;/p&gt;

&lt;p&gt;The dependencies of this module are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A virtual private cloud (VPC) which we can use to isolate our Vault deployment from other platform services, or managed workloads.&lt;/li&gt;
  &lt;li&gt;A DigitalOcean project, so that we can add resources to it and manage them.&lt;/li&gt;
  &lt;li&gt;A Tailscale organisation to which we join the droplets.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;pre-existing Vault cluster&lt;/strong&gt; which contains the secrets necessary to initialise the providers of the dependencies above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two are useful for organising and billing resources and are provided by an upstream module.&lt;/p&gt;

&lt;p&gt;The third is a design choice to ensure that we have a secure overlay to provide access to the instances of the cluster, and we can expose the service only via that interface.&lt;/p&gt;
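
&lt;p&gt;As a rough illustration of what joining a droplet to the overlay could look like, the sketch below creates a pre-authorised Tailscale key and passes it to a droplet through userdata. This is an assumption about one possible approach, not the module’s actual implementation; the resource names, droplet size, image and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;var.vpc_uuid&lt;/code&gt; input are hypothetical.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical sketch: a short-lived, pre-authorised key for joining the tailnet.
resource &quot;tailscale_tailnet_key&quot; &quot;vault&quot; {
  reusable      = true
  ephemeral     = false
  preauthorized = true
  expiry        = 3600 # seconds
}

resource &quot;digitalocean_droplet&quot; &quot;vault&quot; {
  name     = &quot;vault-0&quot;
  region   = &quot;ams3&quot;
  size     = &quot;s-1vcpu-2gb&quot;
  image    = &quot;ubuntu-22-04-x64&quot;
  vpc_uuid = var.vpc_uuid # assumed input: the pre-existing VPC

  # Userdata script: install Tailscale and join the tailnet at first boot.
  user_data = &lt;&lt;-EOT
    #!/bin/sh
    curl -fsSL https://tailscale.com/install.sh | sh
    tailscale up --authkey=${tailscale_tailnet_key.vault.key}
  EOT
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that the key ends up in both the Terraform state and the droplet userdata, which is one reason to keep its expiry short.&lt;/p&gt;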

&lt;h3 id=&quot;control-plane-vs-security-plane&quot;&gt;Control plane vs security plane&lt;/h3&gt;

&lt;p&gt;The last dependency might seem like an unsatisfiable requirement – how can I deploy Vault if I need Vault to deploy Vault!?
The impression of redundancy may change if one remembers that we are deploying an &lt;em&gt;instance&lt;/em&gt; of Vault for a specific team or community, so it is one of potentially many instances, provided as a managed service to that team or community.
It is therefore part of a security plane, not part of the control plane.&lt;/p&gt;

&lt;p&gt;Another way to put it might be that the users of the provisioned Vault instance are the members of the team or community which are &lt;em&gt;served by&lt;/em&gt; the platform, whilst the users of the Vault cluster we depend on are the members of the platform engineering team which &lt;em&gt;build&lt;/em&gt; the platform itself.&lt;/p&gt;

&lt;h2 id=&quot;using-the-module&quot;&gt;Using the module.&lt;/h2&gt;

&lt;p&gt;Let’s close out this long navel gaze by giving a demonstration of how to use the module.
The module contains a few examples of how to use it; we will use the “simple” example here, which makes no assumptions about the pre-existing resources, and deploys into a commonly used region (AMS3).&lt;/p&gt;

&lt;p&gt;We first declare the provider configurations. We will need DigitalOcean and Tailscale API tokens in order to create the resources, and those tokens are stored in Vault.
We look them up with a few &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; calls, then use them to configure the providers:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# Vault configured with environment variables.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# The user only needs to know where the central operations Vault instance is, and have a token with the relevant permissions issued to them.&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# We have declared a variable to hold the value of the KV mount path.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# The path is also given to us by the operations team, who are responsible for&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# separation of concerns.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault_kv_secret_v2&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;do&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;mount&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do_kv_mount_path&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;tokens&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Similarly, the tailscale tokens are stored in the same mount,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# using a different key.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault_kv_secret_v2&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;tailscale&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;mount&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do_kv_mount_path&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;tailscale&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Now we can configure the DigitalOcean provider&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;digitalocean&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;token&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vault_kv_secret_v2&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;terraform&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# And the Tailscale provider&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;tailscale&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;api_key&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vault_kv_secret_v2&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tailscale&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;api_key&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Vault needs to be deployed into a VPC, but that is not of concern to the Vault module, so the VPC needs to be created with its own module.
In this case, it’s being kept in the same state as the Vault module itself, so it will be deleted when Vault itself is deleted.
This may be the case in a short-lived, entirely standalone project for example.
In other cases, an existing VPC may be desired, provisioned by a different team.&lt;/p&gt;

&lt;p&gt;Having created the VPC and project resources, we can then deploy Vault into it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# First, we make the VPC and project, using a handy existing module in our registry&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vpc&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;brucellino/vpc/digitalocean&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;version&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;2.0.0&quot;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# The project and VPC have been declared above as variables,&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# but we omit the details here for brevity.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc_name&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc_name&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc_region&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ams3&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc_description&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Vault VPC&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Now we make our Vault service&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault_cluster&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;create_instances&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;instances&lt;/span&gt;                &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;depends_on&lt;/span&gt;               &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;                   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;../../&quot;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# These use the same variables as the VPC module;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# we could have used the outputs from the module instead.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc_name&lt;/span&gt;                 &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc_name&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;project_name&lt;/span&gt;             &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# We have declared a local variable previously,&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# which contains the IP we want to allow SSH access from.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;ssh_inbound_source_cidrs&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;local&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;addr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;region_from_data&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;region&lt;/span&gt;                   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ams3&quot;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# This is the DigitalOcean token which allows Vault to look up&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# other instances in the platform to peer with.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;auto_join_token&lt;/span&gt;          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vault_kv_secret_v2&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;vault_auto_join&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
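
&lt;p&gt;The snippets above refer to a few variables and a local value whose declarations are omitted for brevity.
As a rough sketch only, they might look something like the following; the types, descriptions and the example address are assumptions made for illustration, not the module’s actual example code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical declarations for the values referenced above.
# The names match the references in the example; everything else is assumed.

variable &quot;do_kv_mount_path&quot; {
  description = &quot;Mount path of the KV v2 secrets engine holding the platform tokens&quot;
  type        = string
}

variable &quot;vpc_name&quot; {
  description = &quot;Name of the VPC that Vault is deployed into&quot;
  type        = string
}

# The example passes var.project to the VPC module and reads var.project.name,
# so it must be an object with at least a name attribute.
# The remaining attributes depend on the VPC module, so the type is left open here.
variable &quot;project&quot; {
  description = &quot;Project definition shared by the VPC and Vault modules&quot;
  type        = any
}

locals {
  # Placeholder from the documentation address range (RFC 5737).
  # Replace it with the address to allow SSH access from, in CIDR notation.
  addr = &quot;203.0.113.10/32&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;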

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Hey presto, you now have a self-service module for creating Vault clusters on DigitalOcean.
Using this module requires knowledge of the location of a central Vault instance where platform secrets are kept, and a token from that Vault that allows access only to the secrets you need to provision resources in DigitalOcean and Tailscale.&lt;/p&gt;
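
&lt;p&gt;What such a narrowly-scoped token looks like depends on how the operations team has laid out the secrets.
As a hedged sketch, a Vault policy granting read access to only the two secrets used in the example might be written as follows; the mount path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;platform-kv&lt;/code&gt; is a made-up placeholder for whatever value is passed as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;do_kv_mount_path&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical policy; &quot;platform-kv&quot; stands in for the real KV v2 mount path.
# KV v2 secrets are read through the data/ prefix under the mount.

# The DigitalOcean tokens, stored under the name &quot;tokens&quot;
path &quot;platform-kv/data/tokens&quot; {
  capabilities = [&quot;read&quot;]
}

# The Tailscale API key, stored under the name &quot;tailscale&quot;
path &quot;platform-kv/data/tailscale&quot; {
  capabilities = [&quot;read&quot;]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Depending on how the Vault provider is configured, the token may also need permission to create a short-lived child token, since the provider does this by default unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skip_child_token&lt;/code&gt; is set.&lt;/p&gt;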

&lt;p&gt;Once you’ve created the cluster, you have full control over it, and can subsequently terraform it with another module, fit for your specific needs.
Maybe you want to connect it to your Identity Provider, maybe you want to use it to issue Nomad tokens for jobs, maybe you want to connect your managed databases in your application workloads to it… the choice is yours.&lt;/p&gt;

&lt;p&gt;And that’s the whole point at the end of the day!&lt;/p&gt;

&lt;p&gt;Self service can be hard to do right.
With this simple module, we have taken explicit design decisions, paring down the scope of what one can do with these powerful providers, and putting just enough control in the hands of the user that they can create their resources themselves in a predictable and consistent manner.&lt;/p&gt;
</description>
        <pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2025/06/terraform-digitalocean-vault/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2025/06/terraform-digitalocean-vault/</guid>
        
        <category>blog</category>
        
        <category>platform-engineering</category>
        
        <category>terraform</category>
        
        <category>hashicorp</category>
        
        <category>vault</category>
        
        
        <category>architecture</category>
        
      </item>
    
      <item>
        <title>A Platform for Improving Quality of Infrastructure Engineering in Open Science (I)</title>
        <description>&lt;p&gt;&lt;em&gt;This is the writeup of a presentation given at the EGI 2025 Conference.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;EGI has been managing federated infrastructure and delivering services for &lt;a href=&quot;https://www.egi.eu/20-years-with-egi/&quot;&gt;more than 20 years now&lt;/a&gt;.
During this time, it has participated in countless infrastructure projects and brought even more services to user communities across Europe and beyond.
One key to the long-term stability and maturity of these services is the FitSM standard, which provides requirements and guidelines for managing services in federated environments.&lt;/p&gt;

&lt;p&gt;However, EGI has inherited infrastructure and resources from federation partners over the years and continued maintaining this legacy platform without explicitly evolving the methods and practices involved in actually building it.&lt;/p&gt;

&lt;p&gt;Whilst FitSM is appropriate and useful for governing and managing services at a mature stage of development, it doesn’t provide the opinions, guidelines, or patterns for quickly ramping up and iterating the development of infrastructure components themselves—or services deployed within them.&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;has&lt;/em&gt; been addressed in environments where fast iteration towards high-quality services creates competitive advantage. One need only refer to DevOps practices for a good frame of reference.
A shorthand introduction to this article might be:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The key for making change fast is DevOps, whilst the key for managing mature services is FitSM&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;lack-of-constraints-leads-to-lack-of-function&quot;&gt;Lack of constraints leads to lack of function&lt;/h2&gt;

&lt;p&gt;The collaboration, observation, and fast feedback patterns of DevOps allow us to reach a position where system governance becomes feasible more quickly. At that point, we can start extracting value from the FitSM standard with its clear requirements.&lt;/p&gt;

&lt;p&gt;In our context, software systems are built and delivered by federations—independent teams under separate domains.
There are so many tools available in every phase of the software development lifecycle that independent teams almost certainly evolve different toolkits.
This creates a situation where the freedom to select tools for local DevOps optimisation ends up frustrating efforts to globally optimise FitSM governance.&lt;/p&gt;

&lt;p&gt;Since each team delivers with different toolkits, implementing machine-readable procedures becomes prohibitively complex.
The only tool truly shared by all teams is human language.
FitSM requirements are then implemented by adopting this &lt;strong&gt;lowest common denominator&lt;/strong&gt;—human-readable processes, procedures, and policies&lt;sup id=&quot;fnref:revenge_of_llms&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:revenge_of_llms&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, procedures tend to be intentionally vague, given the opacity across domain boundaries.
Teams in one area lack visibility into another’s workings and must describe procedures in terms of generic interfaces rather than specific actions.
Whilst this has benefits—abstract interfaces allow tooling to change without altering procedures—there are better ways to achieve this whilst providing superior system understanding.&lt;/p&gt;

&lt;h2 id=&quot;patterns-impose-constraints&quot;&gt;Patterns impose constraints&lt;/h2&gt;

&lt;p&gt;This complexity problem has been encountered repeatedly in enterprises large and small. It’s a consistent hurdle that becomes inevitable once a certain scale is reached.
The boundary between distinct domains is analogous to the traditional boundary between development and operations concerns.&lt;/p&gt;

&lt;p&gt;Although we realised that better dev-ops collaboration could lead to superior overall outcomes, there were plenty of ways to do it “wrong”.
Frustration with the lacklustre returns from DevOps adoption led some in the industry to consider what &lt;em&gt;patterns&lt;/em&gt; or &lt;em&gt;antipatterns&lt;/em&gt; were present.
This approach led to “Team Topologies”&lt;sup id=&quot;fnref:TT&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:TT&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;—a guide to identifying which interaction model suits a given environment.&lt;/p&gt;

&lt;p&gt;Emerging from this research was the concept of a “platform”—an abstract set of functions which, when organised together, allow software delivery teams to perform tasks with less cognitive load, fewer dependencies, and faster feedback.
A few iterations later, we arrive at a semi-codification of this pattern: &lt;strong&gt;platform engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Platform engineering is the practice of building platforms&lt;sup id=&quot;fnref:mf_platforms&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:mf_platforms&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; &lt;em&gt;designed for delivery&lt;/em&gt;.
Another way to think of this: a platform is “the product that helps us deliver services to our customers”.
This links our fast-flow, fast-feedback, tool-integration DevOps world with our quality-oriented, requirements-first, process-aligned service management FitSM world.&lt;/p&gt;

&lt;p&gt;The platform engineering pattern helps various domains adopt consistent patterns, even without overall consensus on specific tooling.
It organises tools according to their function and position in the software delivery lifecycle, identifying certain “planes”:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Developer Control Plane&lt;/strong&gt;: Contains all necessary tooling for developers to build applications.
This includes version control for applications and the platform infrastructure’s source code.
In mature cases, it includes an internal service catalogue and API gateway that developers use to identify reusable components and avoid duplication.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Integration and Delivery Plane&lt;/strong&gt;: Work from the Developer Control Plane flows here.
This includes continuous integration (CI) pipelines, artefact repositories, and continuous deployment (CD) pipelines.
A crucial component is the platform orchestrator, which deploys new artefacts into deployment environments based on platform definitions, configuration constraints, and policies.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Resource Plane&lt;/strong&gt;: Contains the actual operating environment for services offered to customers.
This is the traditional “Ops” part of DevOps, bound to reliability, security, and cost objectives.
It includes all environments: testing, staging, and production.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Observability Plane&lt;/strong&gt;: Monitors all platform components and provides the bedrock for feedback.
It collects metrics from platform components and services deployed in the Resource Plane, alerting based on defined service level objectives.
It also aggregates logs and traces across the platform.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Security Plane&lt;/strong&gt;: Responsible for declaring, enforcing, and supporting security and safety across the platform and its workloads.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;figure&gt;
    &lt;img src=&quot;/images/platform-plane-overview.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;We’re still hiding tooling complexities behind the “plane” abstraction, but at least we now have a common model across different domains.&lt;/p&gt;

&lt;h2 id=&quot;overcoming-boundaries&quot;&gt;Overcoming boundaries&lt;/h2&gt;

&lt;p&gt;The platform engineering pattern tames DevOps complexity, but we’re still left with organisational boundaries inherent in federations.
Have we improved the situation by adopting platforms, or created more work since every team must now build one?&lt;/p&gt;

&lt;p&gt;Organisational boundaries won’t simply disappear. Even though our platform describes interaction patterns, we still need to &lt;em&gt;implement&lt;/em&gt; those interactions.
Code commits must flow through CI, workload definitions need auditing, securing, and delivery to various resource planes across boundaries (&lt;em&gt;i.e.&lt;/em&gt;, heterogeneous toolkits).&lt;/p&gt;

&lt;p&gt;The question becomes:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;How can we build a platform whilst respecting organisational boundaries?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Given that platform components, or entire planes, may be provided by different organisations:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;How can I consume platform component services or expose my component’s services across opaque organisational boundaries?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This may be analogous to the “Bezos API Mandate”&lt;sup id=&quot;fnref:BezosAPIMandate&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:BezosAPIMandate&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.
To paraphrase&lt;sup id=&quot;fnref:edited&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:edited&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;All teams will expose their data and functionality through service interfaces&lt;/li&gt;
  &lt;li&gt;Teams must communicate through these interfaces&lt;/li&gt;
  &lt;li&gt;No other form of interprocess communication is allowed&lt;/li&gt;
  &lt;li&gt;Technology choice doesn’t matter&lt;/li&gt;
  &lt;li&gt;All service interfaces must be designed to be externalisable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Wouldn’t it be brilliant if all platform functionality needed to deliver services to users were exposed as APIs?&lt;/p&gt;

&lt;h2 id=&quot;workflows-in-platforms-vs-tools&quot;&gt;Workflows in platforms vs tools&lt;/h2&gt;

&lt;p&gt;Most toolkit tools do expose APIs for external consumption.
The problem is that we’ve deliberately avoided referencing explicit tooling in our platform pattern.
When platforms are eventually created using actual tools, and the only way to use those tools in delivery workflows is via their specific APIs, we’re back to the complexity we tried to address.&lt;/p&gt;

&lt;p&gt;We’d have to integrate with numerous APIs and account for countless specific configurations to use the platform.
This clearly isn’t an improvement!&lt;/p&gt;

&lt;h2 id=&quot;event-driven-architectures&quot;&gt;Event-driven architectures&lt;/h2&gt;

&lt;p&gt;Tightly-integrated end-to-end workflows for change management, release, deployment, and monitoring—&lt;em&gt;i.e.&lt;/em&gt;, all service management system requirements—are thus doomed.
They’re either too narrow in scope (single org or team), too costly to implement (too many different tools and APIs), or too expensive to maintain (tools change, requiring ongoing API integration work).&lt;/p&gt;

&lt;p&gt;But what if we didn’t need end-to-end workflows?
An alternative would be taking an event-driven system view.
This allows writing procedures as combinations of specific events, actors, and actions:&lt;/p&gt;

&lt;div class=&quot;mermaid&quot;&gt;
---
title:  Event-Driven Architecture
---
flowchart LR
    TriggerEvent(Trigger Event) --&amp;gt; Actor
    Actor --&amp;gt; Action[[Action]]
    Action --&amp;gt; ResultEvent(Result Event)
&lt;/div&gt;

&lt;p&gt;In this scenario, actors subscribe to events, and know what to do when a given event occurs – they take the relevant action.
These actions can be codified accordingly, whilst the service management system (SMS) policy defines the actors.
What remains is deciding the triggering &lt;em&gt;event&lt;/em&gt; and determining how actors are notified.&lt;/p&gt;

&lt;p&gt;This is where the &lt;a href=&quot;https://cdevents.dev&quot;&gt;CDEvents specification&lt;/a&gt; comes in.
As we’ll see in the next post, a common event specification for software systems will allow us to build tool-agnostic platforms for delivering software systems to customers across organisational boundaries.&lt;/p&gt;

&lt;p&gt;This will be discussed in Part II, where we dive into the CDEvents specification and show how it can solve our complexity and integration problems.&lt;/p&gt;

&lt;hr /&gt;
&lt;h1 id=&quot;footnotes-and-references&quot;&gt;Footnotes and References&lt;/h1&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:revenge_of_llms&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Perhaps one day large language models (LLMs) &lt;em&gt;will&lt;/em&gt; make human language a valid interface for implementing machine-readable workflows, but we’re most definitely not there yet. &lt;a href=&quot;#fnref:revenge_of_llms&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:TT&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Skelton, M., &amp;amp; Pais, M. (2019). Team topologies: organizing business and technology teams for fast flow. IT Revolution. &lt;a href=&quot;#fnref:TT&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:mf_platforms&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The term “platform” appears repeatedly in this article. To be clear, we’re referring to the same platform that Evan Bottcher discusses in &lt;a href=&quot;https://martinfowler.com/articles/talk-about-platforms.html&quot;&gt;What I talk about when I talk about platforms&lt;/a&gt; &lt;a href=&quot;#fnref:mf_platforms&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:BezosAPIMandate&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The only attribution I could find was Steve Yegge’s second-hand, accidental &lt;a href=&quot;https://web.archive.org/web/20151209104319/https://plus.google.com/+RipRowan/posts/eVeouesvaVX&quot;&gt;post&lt;/a&gt;. It has been almost lost to time thanks to Google+’s shutdown (irony of ironies). &lt;a href=&quot;#fnref:BezosAPIMandate&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:edited&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This isn’t faithfully reproduced—I’ve edited and omitted several parts. See the original for greater context. &lt;a href=&quot;#fnref:edited&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2025/06/quality-of-infra-engineering/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2025/06/quality-of-infra-engineering/</guid>
        
        <category>blog</category>
        
        <category>APIs</category>
        
        <category>DevOps</category>
        
        <category>SRE</category>
        
        <category>architecture</category>
        
        <category>platform</category>
        
        
        <category>platform-engineering</category>
        
      </item>
    
      <item>
        <title>Attracting and retaining new users</title>
        <description>&lt;p&gt;&lt;small&gt;&lt;em&gt;This is a discussion article for work done in the context of EGI user support.&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;h2 id=&quot;so-you-want-to-run-a-training-event&quot;&gt;So, you want to run a training event&lt;/h2&gt;

&lt;p&gt;There are some events which you need to run repeatedly, where the workflow is more or less constant, even though the specific details of each instance may change.
One of the north stars of engineering in our line of work is to &lt;strong&gt;eliminate toil&lt;/strong&gt;&lt;sup id=&quot;fnref:eliminate_toil&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:eliminate_toil&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;: work that is&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;manual&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;repetitive&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;automatable&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;tactical&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;devoid of enduring value&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, we’ll take a look at preparation for training events from a perspective of eliminating toil, and see what the effects might be on participant experience, as well as overhead in running the event.&lt;/p&gt;

&lt;h3 id=&quot;treating-training-environments-as-products&quot;&gt;Treating Training Environments as Products&lt;/h3&gt;

&lt;p&gt;When a user comes to a training event, whether we like it or not, whether we realise it or not, we are selling them an experience, before they actually use the service.
If that experience is a good one, and if they perceive that the service they are using actually provides &lt;em&gt;utility&lt;/em&gt;, then they will come back and do whatever is necessary.&lt;/p&gt;

&lt;p&gt;A lot of ink has been spilled regarding &lt;em&gt;developer experience&lt;/em&gt; (DX), and how to improve it, culminating perhaps in the practice of &lt;em&gt;platform engineering&lt;/em&gt;.
One of the main lessons that has been learned is that in order to be successful, the platform needs to be designed &lt;em&gt;as a product&lt;/em&gt;, making it appealing and engaging for the end users, and creating &lt;em&gt;“golden paths”&lt;/em&gt; for them.&lt;/p&gt;

&lt;p&gt;With this as backdrop, let’s see if we can apply these ideas to a training event where we are exposing potential users to a service for the first time.
A traditional training event might start at “zero” and walk the user through all of the initial steps necessary to obtain access to the platform, eventually getting to the point where they have access to the actual service they want to use.&lt;/p&gt;

&lt;p&gt;Let’s take a look at what might be considered a “good” user journey – one which attempts to demonstrate the value (utility) of the service to the user as soon as possible:&lt;/p&gt;

&lt;div class=&quot;mermaid&quot;&gt;
---
title:  Demonstrating service utility during training
config:
  journey:
    actorColors:  [&quot;#59c9a5&quot;, &quot;#465775&quot;]
---
journey
  section Awareness
    Discovers service via EGI page:  5:  User
    Registers to event:  6:  User
  section Access
    Has pre-prepared credentials:  4: Operator
    Pre-registered in group / VO: 4:  Operator
    Accesses service:  5:  User
&lt;/div&gt;

&lt;p&gt;Here, we split the journey into two sections – &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Awareness&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Access&lt;/code&gt;, showing the user emotion from positive (high) to negative (low) as they complete their journey to accessing the service&lt;sup id=&quot;fnref:User_Research&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:User_Research&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
The goal is to get the user in front of the service as fast as possible, with as little distraction as possible.
In order to achieve this, however, we have created an artificial scenario where the user is &lt;em&gt;assuming an identity&lt;/em&gt; which we have already created and approved in the group.
We are hiding complexity from the user in order to use the little time we have in contact with them to greatest benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our goal is to deliver value to them, their goal is to conduct their research&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We need to use every moment of the time they have set aside for us to convince them that our service can help them reach their goal.&lt;/p&gt;

&lt;p&gt;As you can see though, this requires some intervention on the part of the operator in that they need to provision the users and the training environment which the attendees of the tutorial will use.
As it turns out there is already a good recipe for creating the training environment, but what about provisioning the users?&lt;/p&gt;

&lt;h3 id=&quot;the-invisible-barrier&quot;&gt;The invisible barrier&lt;/h3&gt;

&lt;p&gt;This has until now been a source of toil – or rather the toil has been rejected and implicitly pushed onto the users.
We know that there is actually a huge barrier between them and our services: the &lt;strong&gt;Authentication&lt;/strong&gt; and &lt;strong&gt;Authorisation&lt;/strong&gt; of the user.
This barrier is there by design in order to place access controls and service level quotas on the services we provide.
The main problem is that it creates a huge mental separation between &lt;em&gt;us&lt;/em&gt; and the &lt;em&gt;attendees&lt;/em&gt; since we have already been granted access to the service, while they have not.
We don’t feel the disillusionment and pain of having to overcome that barrier, while they do – and often it is the first emotion they have when trying our services.
Let’s take a closer look at that Access section, as it stands for the average first-time user.&lt;/p&gt;

&lt;div class=&quot;mermaid&quot;&gt;
journey
  section Access
    Requests service:  8: User
    Selects IdP: 6:  User
    Logs in IdP:  5:  User
    Info Release IdP: 4:  User
    Info Release SP: 3: User
    Access Denied:  1: User
&lt;/div&gt;

&lt;p&gt;Since the trainer doesn’t know who the users are, or which IdP they will use to log in, they can’t pre-approve &lt;em&gt;actual real people&lt;/em&gt;, so the user journey that starts at the service login page is a &lt;strong&gt;dead-end&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In fact, the user needs to first be enrolled in a group or &lt;a href=&quot;https://confluence.egi.eu/display/EGIG/Virtual+organisation&quot;&gt;Virtual Organisation (VO)&lt;/a&gt;, and thus authorised to access the service.
What is more, they need to be authorised to access that service &lt;em&gt;in a specific context&lt;/em&gt; – &lt;em&gt;i.e.&lt;/em&gt; as a member of a specific group or VO.&lt;/p&gt;

&lt;p&gt;Taking stock of this means understanding that there is actually a whole other journey for the user to take – entering a specific group.&lt;/p&gt;

&lt;h3 id=&quot;golden-paths-for-first-time-users&quot;&gt;Golden paths for first time users&lt;/h3&gt;

&lt;p&gt;Understanding that we now actually have &lt;strong&gt;two&lt;/strong&gt; journeys for the user to complete, we can keep their goals clearer, rather than sending them on a winding path with dead ends:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;First, we create a paved road&lt;sup id=&quot;fnref:aka_golden_path&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:aka_golden_path&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; for &lt;strong&gt;using the service&lt;/strong&gt;. Outcome: They want to use the service.&lt;/li&gt;
  &lt;li&gt;Then, we show them the paved road for &lt;strong&gt;accessing the service&lt;/strong&gt;. Outcome: they understand the trust and community aspects of the service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Separating these two experiences can be done in different sessions of a tutorial, and will help the attendees remain focussed on one goal at a time.&lt;/p&gt;

&lt;h2 id=&quot;eliminating-toil&quot;&gt;Eliminating Toil&lt;/h2&gt;

&lt;p&gt;So much for the product development and retention side of things – we previously mentioned that pre-populating the training users would be a source of toil, even if it made the user experience much better.
What if we could eliminate that toil by encoding the process?&lt;/p&gt;

&lt;p&gt;The user pre-registration workflow looks a bit like this:&lt;/p&gt;

&lt;div class=&quot;mermaid&quot;&gt;
---
title: Preparation for User Training
config:
  useMaxWidth: true
  theme: base
  themeVariables:
    primaryColor: &quot;#00ff00&quot;
  flowchart:
    fontSize: 32px
---
sequenceDiagram
  autoNumber
  actor A as IdP Admin
  participant i as IDP

  V-&amp;gt;&amp;gt;A: We need users

  A-&amp;gt;&amp;gt;i: Create usernames and passwords
  A-&amp;gt;&amp;gt;O: We have users
  actor O as Operator
  participant r as RPA
  O-&amp;gt;&amp;gt;r: Invoke Process

  activate r
  participant c as Check-In
  r-&amp;gt;&amp;gt;c: Sign Up
  c-&amp;gt;&amp;gt;c: Create New User
  r-&amp;gt;&amp;gt;c: Petition VO
  deactivate r
  actor V as VO Manager
  c-&amp;gt;&amp;gt;V: Notify Request
  V-&amp;gt;&amp;gt;c: Approve Request

  c-&amp;gt;&amp;gt;c: Add to VO

  V-&amp;gt;&amp;gt;U: Here are your credentials for the training
  U-&amp;gt;&amp;gt;s: Request service
  s-&amp;gt;&amp;gt;c: Authentication
  c-&amp;gt;&amp;gt;i: Authenticate\nto IdP
  i-&amp;gt;&amp;gt;c: Return\nAttributes
  c-&amp;gt;&amp;gt;s: Authorize
  s-&amp;gt;&amp;gt;U: Grant\nAccess

  actor U as User
  participant s as Service
&lt;/div&gt;

&lt;p&gt;Here&lt;sup id=&quot;fnref:arg_mermaid&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:arg_mermaid&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; we can see that the User only really needs to interact with the Service, and they experience the login procedure &lt;em&gt;as it should be&lt;/em&gt;, thanks to the pre-registration of the identities in Check-In.&lt;/p&gt;

&lt;p&gt;We introduce a new actor here called “RPA”, which stands for &lt;a href=&quot;https://en.wikipedia.org/wiki/Robotic_process_automation&quot;&gt;Robotic Process Automation&lt;/a&gt;.
This is actually some code which executes the process of first signup on behalf of the users – this is the “toily” bit which we would like to eliminate from human actors (the Operator in this case).&lt;/p&gt;

&lt;h3 id=&quot;making-the-signup-robot&quot;&gt;Making the signup robot&lt;/h3&gt;

&lt;p&gt;The process has been implemented using &lt;a href=&quot;https://robotframework.org&quot;&gt;Robot Framework&lt;/a&gt;, by impersonating the user itself, and driving a browser to complete the tasks the user would have done.
This is broken down into tasks shown in the sequence diagram above:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Signup&lt;/code&gt;: Sign up a new user to Check-In&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Join VO&lt;/code&gt;: Petition to join the VO&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Clean Up&lt;/code&gt;: Collect created EPUIDs and send them to Check-In admin for deletion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two are mentioned in the sequence diagram above, while the last is a process which happens once the training event is completed.
In principle, we could have a multi-actor RPA, which means that we could invoke the roles of VO Manager or IDP Admin as well, but for now, we are only focussing on the part in the middle: registering users and requesting to join the VO.&lt;/p&gt;

&lt;p&gt;Let’s take a closer look at the code.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-robot&quot; data-lang=&quot;robot&quot;&gt;&lt;span class=&quot;gh&quot;&gt;*** Settings ***&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Library&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;             &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Browser&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Library&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;             &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;DataDriver&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;users.csv&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Resource&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;            &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;login_resources.robot&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Suite Teardown&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;      &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Close the Browser&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Task Setup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;          &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Open the Browser&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Task Template&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;       &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Signup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;gh&quot;&gt;*** Tasks ***&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Sign with user    Default    UserData&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Signup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Default&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;UserData&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here we can see the single task referred to above.
It is parametrised to use an input file of users, containing their usernames and passwords.&lt;/p&gt;
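
&lt;p&gt;For illustration, a minimal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users.csv&lt;/code&gt; might look like the sketch below.
This assumes DataDriver’s default CSV dialect (semicolon-delimited), a first column holding the name of each generated task, and column headers that match the arguments of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Signup&lt;/code&gt; keyword.
The task names, usernames and passwords are placeholders, not real accounts.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;*** Test Cases ***;${username};${password}
Sign up trainee01;trainee01;change-me-01
Sign up trainee02;trainee02;change-me-02&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With a file like this, DataDriver would generate one task per row from the template, so adding attendees only means adding rows.&lt;/p&gt;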

&lt;p&gt;We had to write the task keyword &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Signup&lt;/code&gt; ourselves – it’s contained in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;login_resources.robot&lt;/code&gt; file you see declared as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Resource&lt;/code&gt; in the settings part:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-robot&quot; data-lang=&quot;robot&quot;&gt;&lt;span class=&quot;gh&quot;&gt;*** Keywords ***&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Signup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Arguments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Click on EGI SSO where the users have been created&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;selector=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${EGI_SSO_SELECTOR}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Wait For Navigation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;url=https://sso.egi.eu/egissoidp/profile/SAML2/Redirect/SSO?execution=e1s2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# fill in the login form&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Provide Credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;username=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;password=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Accept info release&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=.grid-item &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Wait For Load State&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;domcontentloaded&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timeout=30s&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Follow the signup flow and accept info release and Terms and Conditions&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.grid-item:nth-child(1) &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Wait For Navigation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;url=https://aai.egi.eu/registry/co_petitions/start/coef:2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;selector=a[href=&apos;/registry/co_petitions/start/coef:2/done:core&apos;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=.checkbutton&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.ui-dialog-buttonpane:nth-child(11) &amp;gt; div:nth-child(1) &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Check Checkbox&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;selector=id=CoTermsAndConditions1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Submit the request and wait for the server to log us out&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.submit:nth-child(1) &amp;gt; input:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Wait For Navigation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;https://aai.egi.eu/registry/pages/public/loggedout&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Provide Credentials&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Arguments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Fill Text&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id=username&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;txt=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Fill Text&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id=password&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;txt=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The various keywords you see there, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Click&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Wait For Navigation&lt;/code&gt;, are all in the &lt;a href=&quot;https://marketsquare.github.io/robotframework-browser/&quot;&gt;Robot Framework Browser library&lt;/a&gt;, which is a Python wrapper around &lt;a href=&quot;https://playwright.dev/&quot;&gt;Playwright&lt;/a&gt;.
It drives an actual browser to complete the workflow.&lt;/p&gt;
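
&lt;p&gt;The settings above also reference two keywords, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Open the Browser&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Close the Browser&lt;/code&gt;, which are not shown here.
A minimal sketch of what they could look like with the Browser library follows.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;${START_URL}&lt;/code&gt; variable and its value are assumptions, as are the choice of Chromium and headless mode; the real resource file may differ.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-robot&quot; data-lang=&quot;robot&quot;&gt;*** Settings ***
Library    Browser

*** Variables ***
# Assumption: the page where the signup flow starts; not stated in the post
${START_URL}    https://aai.egi.eu/registry/

*** Keywords ***
Open the Browser
    # Start a headless Chromium instance and load the (assumed) start page
    New Browser    browser=chromium    headless=True
    New Page    ${START_URL}

Close the Browser
    # Close every browser opened during the suite
    Close Browser    ALL&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Open the Browser&lt;/code&gt; is used as the task setup, a sketch like this would start a fresh browser for every user in the input file, and the suite teardown would close them all at the end.&lt;/p&gt;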

&lt;p&gt;Once the user has created a unique ID in Check-In, we can complete the registration by petitioning to join a VO.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-robot&quot; data-lang=&quot;robot&quot;&gt;&lt;span class=&quot;gh&quot;&gt;*** Tasks ***&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Join VO&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Default&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;UserData&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;gh&quot;&gt;*** Keywords ***&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;Join VO&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;Arguments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;selector=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${EGI_SSO_SELECTOR}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Fill Text&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id=username&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;txt=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${username}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Fill Text&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id=password&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;txt=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${password}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Click Login button&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=.grid-item &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.grid-item:nth-child(1) &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# This does not redirect me to the VO enrollment page&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Go To&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;${VO_SIGNUP_URL}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# We still have the cookie, so select the favourite&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=#favouritesubmit&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Review AUP&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;selector=.checkbutton&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.ui-dialog-buttonpane:nth-child(11) &amp;gt; div:nth-child(1) &amp;gt; button:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Check Checkbox&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=#CoTermsAndConditions114&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Click&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;css=div.submit:nth-child(1) &amp;gt; input:nth-child(1)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

    &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Wait For Navigation&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;    &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;https://aai.egi.eu/registry/&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Pending acknowledgement notification&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The cleanup task is similar, but logs into Check-In and populates a file with EPUIDs to be sent to the Check-In admin for deletion.&lt;/p&gt;

&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;/h2&gt;

&lt;h3 id=&quot;have-we-really-eliminated-toil&quot;&gt;Have we really eliminated toil&lt;/h3&gt;

&lt;p&gt;We now have two simple tasks that have been codified: a repeatable procedure which can be reliably performed by a computer!
&lt;strong&gt;We have removed the work from the Operator&lt;/strong&gt;, eliminating part of the toil in this procedure.
We have written these tasks to act in a single role so far – that of the prospective user – so some toil remains with the other actors:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The creation of the users in the IdP – this is perhaps already encoded, but does not notify the next actor in the sequence (Operator).&lt;/li&gt;
  &lt;li&gt;Approval of the VO registration request – this can be done by a VO admin, either in bulk via the API, or as part of the registration process, but requires a change in actor. Currently the notifications are handled via Check-In’s built-in notification system, so there is a connection between these two procedures (request and approval of VO membership).&lt;/li&gt;
  &lt;li&gt;The users need to be notified about their credentials. Since this involves the transfer of sensitive data, it is probably best done in person by the training instructor.&lt;/li&gt;
  &lt;li&gt;Users still need to be removed from Check-In by the Check-In admin.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;have-we-improved-the-users-experience&quot;&gt;Have we improved the user’s experience&lt;/h3&gt;

&lt;p&gt;With this pre-registration in place, let’s take a look at how the attendee experiences the training.
Let’s remember that we have just a few hours to convince them that they should use our services, and that their decision will be based on their experience of the service itself – does it actually provide them with value?
Taking a closer look at that hypothetical “good” experience we mentioned above, from the user’s point of view it looks like this:&lt;/p&gt;

&lt;div class=&quot;mermaid&quot;&gt;
---
title:  User
config:
  journey:
    actorColors:  [&quot;#59c9a5&quot;, &quot;#465775&quot;]
---
journey
  section Awareness
    Discovers service via EGI page:  5:  User
    Registers to event:  6:  User
  section Access
    Provides credentials:  5: Trainer
    Logs into Check-In:  3: User
    Releases Information:  3:  User
    Authorised to use Service: 9: User
&lt;/div&gt;

&lt;p&gt;The user initially feels a bit of a dip in their enthusiasm – when their curiosity is focussed on the service itself, every click we put between them and the service is a disappointment.
However, here we design a &lt;a href=&quot;https://en.wikipedia.org/wiki/Peak%E2%80%93end_rule&quot;&gt;“peak-end” experience&lt;/a&gt; – the final result of the user’s journey through login is entirely positive as they are granted immediate access to the service.&lt;/p&gt;

&lt;h3 id=&quot;natural-extension-rpa-for-training&quot;&gt;Natural extension: RPA for training&lt;/h3&gt;

&lt;p&gt;So, we have &lt;strong&gt;removed some toil&lt;/strong&gt; and &lt;strong&gt;improved the user experience&lt;/strong&gt; for these crucial first-touch events – we can probably consider this an unconditional win!
But improvement is itself a journey, not a destination, so let’s consider what would be the natural extension of this improvement, for next time.&lt;/p&gt;

&lt;p&gt;The goals here are the same: Create a satisfying experience for new users in a training environment, while eliminating toil from the people involved.&lt;/p&gt;

&lt;p&gt;Let’s consider that we have a lot of time between when the user &lt;em&gt;registers&lt;/em&gt; for the event, and when they actually &lt;em&gt;participate&lt;/em&gt; in the event.
We could use this time to run an email drip campaign to help them approach the biggest obstacle of all, which is also only experienced once at the beginning: signing up to Check-In.&lt;/p&gt;

&lt;p&gt;This in fact consists of two different tasks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Create an entity/account in Check-In&lt;/li&gt;
  &lt;li&gt;Request to join a VO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these tasks should end on a positive note and should be completed “quickly”, giving users the impression that they are making progress and achieving results.&lt;/p&gt;

&lt;p&gt;Now, we don’t yet have an email service which can be programmed via API with transactional emails, but creating something like that isn’t hard.
We could imagine a series of hooks starting from the first interaction of the new user – the registration for the event in Indico – with a persistence layer to keep track of progression, to implement such an end-to-end workflow for nudging users onto the service before the training event.
This would result in the same experience as what we have just shown for pre-registration, but without the downside of having to overcome the registration process after the training.&lt;/p&gt;

&lt;p&gt;This is a very common approach for any cloud service or SaaS, since it’s a well-known heuristic that users are easily distracted and discouraged by longer onboarding paths.&lt;/p&gt;

&lt;p&gt;Another thing we need to do better is actual user research while new users are onboarded.
How do they really feel when accessing our services for the first time?
What are they really thinking as they go through the motions of creating accounts and requesting access to the service?
It would be good to have some better feedback based on consistent user research &lt;em&gt;during&lt;/em&gt; the event, rather than responding to surveys after the fact.&lt;/p&gt;

&lt;p&gt;Finally, we can indeed define &lt;em&gt;golden paths&lt;/em&gt; for new users, and these can be extended from first touch to production usage, with the adoption of these RPA tools and a little bit of automation.&lt;/p&gt;

&lt;p&gt;At the end of the day, this is all about &lt;strong&gt;keeping the customer engaged&lt;/strong&gt;, making sure they &lt;strong&gt;achieve small, continuous wins&lt;/strong&gt;, and ensuring that we &lt;strong&gt;demonstrate utility first&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;footnotes-and-references&quot;&gt;Footnotes and References&lt;/h2&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:eliminate_toil&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;https://sre.google/sre-book/eliminating-toil/&quot;&gt;The Google SRE book&lt;/a&gt; for more. &lt;a href=&quot;#fnref:eliminate_toil&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:User_Research&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I pulled these emotions, graded from 0 to 10, straight out of the air; I have no data to back them up. It would be really interesting to measure user happiness along their journey as part of the typical training activity. &lt;a href=&quot;#fnref:User_Research&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:aka_golden_path&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;“Paved Road” and “Golden Path” are used interchangeably here, referring to an easy, intuitive, straightforward path from start to goal. Read about some of the differences &lt;a href=&quot;https://octopus.com/blog/paved-versus-golden-paths-platform-engineering&quot;&gt;on the Octopus blog&lt;/a&gt;. &lt;a href=&quot;#fnref:aka_golden_path&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:arg_mermaid&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I keep getting kicked in the teeth for my optimistic view of mermaidjs. Man, creating this diagram was a pain, and the box and destroy features just straight up didn’t work. Oh well. Next time. &lt;a href=&quot;#fnref:arg_mermaid&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 01 May 2024 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2024/05/modelling-onboarding/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2024/05/modelling-onboarding/</guid>
        
        <category>blog</category>
        
        <category>rpa</category>
        
        <category>aai</category>
        
        <category>signup</category>
        
        
        <category>egi</category>
        
      </item>
    
      <item>
        <title>Consul for external services</title>
        <description>&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#service-registration-vs-service-discovery&quot;&gt;Service registration vs service discovery&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#registering-services-in-a-modern-service-catalogue&quot;&gt;Registering services in a modern service catalogue&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#using-consul-as-service-catalogue&quot;&gt;Using Consul as Service Catalogue&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#problem-statement-declared-vs-actual-states&quot;&gt;Problem statement: Declared vs Actual states&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#terraforming-federated-services-into-consul&quot;&gt;Terraforming federated services into Consul&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#external-terraform-data-source&quot;&gt;External Terraform Data source&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#external-consul-node&quot;&gt;External Consul Node&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#consul-service&quot;&gt;Consul Service&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#external-service-monitor&quot;&gt;External Service Monitor&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#results&quot;&gt;Results&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#i-promised-you-ux-improvements&quot;&gt;I promised you UX improvements&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#discussion&quot;&gt;Discussion&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;service-registration-vs-service-discovery&quot;&gt;Service registration vs service discovery&lt;/h2&gt;

&lt;p&gt;This is a short experiment in using Consul as a service discovery tool for services in federated infrastructures.&lt;/p&gt;

&lt;p&gt;Traditionally, services were not “discovered” but rather registered in a &lt;strong&gt;configuration database&lt;/strong&gt;, which then acted as a source of truth for clients wishing to find out information regarding the available resources in the federation.
This configuration database (CMDB) remains an authoritative source of truth, since service owners register their inventory in it manually and the entries undergo a high level of manual verification. However, it is not easily &lt;strong&gt;queryable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One of the main infrastructure services is the &lt;strong&gt;information system&lt;/strong&gt;, which is implemented with the &lt;a href=&quot;https://docs.egi.eu/users/compute/high-throughput-compute/querying-information-system/&quot;&gt;Berkeley Database Information System (BDII)&lt;/a&gt; with a specific LDAP schema.
The BDII provides a hierarchical way to propagate service registration from so-called “sites” up to a centralised service catalogue.
This model was adopted for more than a decade and supported a truly massive computing effort in the pursuit of new knowledge.&lt;/p&gt;

&lt;p&gt;My memory may be failing me after almost 7 years away from actually working in this environment, but I remember actually &lt;em&gt;using&lt;/em&gt; the BDII being extremely laborious.
There was a lot to actually remember, which ended up written in wiki after wiki, generating a great amount of toil and thus cost, and communities inevitably ended up needing to centralise and codify knowledge about the state of the infrastructure in single instances.&lt;/p&gt;

&lt;h2 id=&quot;registering-services-in-a-modern-service-catalogue&quot;&gt;Registering services in a modern service catalogue&lt;/h2&gt;

&lt;p&gt;A common design pattern in the 2020s is to use a &lt;a href=&quot;https://landscape.cncf.io/guide#orchestration-management--service-mesh&quot;&gt;service mesh&lt;/a&gt; to register services and define their connection permissions.
This is conceptually similar to the BDII approach of creating and registering services from the ground up, but instead of using LDAP with LDIF updates, we use DNS.&lt;/p&gt;

&lt;p&gt;This brings a radically different approach to actually &lt;em&gt;using&lt;/em&gt; the infrastructure, &lt;em&gt;i.e.&lt;/em&gt; to the &lt;strong&gt;user experience (UX)&lt;/strong&gt;.
Relying on DNS means we no longer need to remember arcane &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ldapsearch&lt;/code&gt; commands with undecipherable filters; we just need to look up a DNS name.
The service catalogue also takes care of organising and routing traffic to the desired endpoints so that applications do not need to be aware of changes in the infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;using-consul-as-service-catalogue&quot;&gt;Using Consul as Service Catalogue&lt;/h2&gt;

&lt;p&gt;Services deployed in a Kubernetes cluster, for example, get these benefits for free as part of the environment they are running in, but we do not have that luxury in the federated infrastructure world.
While some of these services &lt;em&gt;may&lt;/em&gt; end up deployed in Kubernetes within a specific local context (site), it is inconceivable that the entire federation would be.
The services can still be registered as &lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/service/#externalname&quot;&gt;external services&lt;/a&gt; which services within a Kubernetes cluster can discover.&lt;/p&gt;

&lt;p&gt;However, one does not need to adopt the full apparatus of Kubernetes to benefit from this.
Consul from Hashicorp is well-adapted to bridging legacy and cloud-native infrastructure, since it is designed to be &lt;a href=&quot;https://www.consul.io/use-cases/multi-platform-service-mesh&quot;&gt;multi-platform&lt;/a&gt;.
I wanted to investigate how it could be used together with existing infrastructure in order to improve the UX of communities which have adopted modern tooling but nevertheless need to use existing federated resources.&lt;/p&gt;

&lt;p&gt;Before we get to the big picture stuff though, let’s get down in the weeds to see how this could be done effectively.&lt;/p&gt;

&lt;h3 id=&quot;problem-statement-declared-vs-actual-states&quot;&gt;Problem statement: Declared vs Actual states&lt;/h3&gt;

&lt;p&gt;The source of truth hasn’t changed, and we don’t want to break the Service Management standard we’re used to: &lt;a href=&quot;https://www.fitsm.eu/&quot;&gt;FitSM&lt;/a&gt;.
Rather, we want to use existing structures to improve UX.&lt;/p&gt;

&lt;p&gt;Service discovery can be considered part of the Configuration Management process (CONFM), which has the following specific requirements&lt;sup id=&quot;fnref:FitSMCONFM&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:FitSMCONFM&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; according to the standard:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;PR11.1&lt;/strong&gt; The scope of configuration management shall be defined together with the types of configuration items (CIs) and relationships to be considered.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PR11.2&lt;/strong&gt; The level of detail of configuration information shall be sufficient to support effective control over CIs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PR11.3&lt;/strong&gt; Information on CIs and their relationships with other CIs shall be maintained in a configuration management database (CMDB).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PR11.4&lt;/strong&gt; CIs shall be controlled and changes to CIs tracked in the CMDB.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PR11.5&lt;/strong&gt; The information stored in the CMDB shall be verified at planned intervals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The configuration management database is commonly understood to be the &lt;a href=&quot;https://goc.egi.eu&quot;&gt;GOCDB&lt;/a&gt;, the operations database which implements the functionality required to satisfy the requirements above.
It is a source of information which is supposed to describe the actual state of the world.
Unfortunately, it is &lt;strong&gt;not authoritative&lt;/strong&gt; in that sense, since it is merely a database, and the items in it have no controls associated.
There are no connectivity checks, health checks, performance, &lt;em&gt;etc&lt;/em&gt; - it’s all manually added.&lt;/p&gt;

&lt;p&gt;Now, it is true that those checks are delegated to a different service (the monitoring service &lt;a href=&quot;https://argo.egi.eu&quot;&gt;ARGO&lt;/a&gt;), but again there is no way to interrogate the actual state of the infrastructure in an operational sense.
If I want my application or community to use the currently “good” services, I need to somehow hook into the monitoring system, find the currently good set of services via some arcane query, and keep doing that all the time.&lt;/p&gt;

&lt;h3 id=&quot;terraforming-federated-services-into-consul&quot;&gt;Terraforming federated services into Consul&lt;/h3&gt;

&lt;p&gt;What if we could create an environment where the healthy services were simply discoverable by the thing we all know and use every day: DNS?
I’ll now show how Terraform can be used to query the GOCDB and register services in a Consul catalogue, so that the Consul DNS interface can then be used to discover these services.&lt;/p&gt;
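
&lt;p&gt;To make the target concrete, here is a minimal Python sketch of what a client-side lookup could look like once the services are in the catalogue. It assumes Consul’s default DNS conventions (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul&lt;/code&gt; domain and names of the form &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[tag.]name.service.consul&lt;/code&gt;) and a local resolver that forwards that domain to a Consul agent. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;egi&lt;/code&gt; tag and both helper names are illustrative, not part of the setup described in this post.&lt;/p&gt;

```python
import socket


def consul_service_name(service, tag=None, datacenter=None, domain="consul"):
    """Build a Consul DNS name: [tag.]service.service[.datacenter].domain"""
    parts = []
    if tag:
        parts.append(tag)
    parts += [service, "service"]
    if datacenter:
        parts.append(datacenter)
    parts.append(domain)
    return ".".join(parts)


def resolve(name, port):
    """Resolve a name with the system resolver and return the unique addresses.

    Assumption: the resolver forwards the consul domain to a Consul agent.
    """
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


plain_name = consul_service_name("top-bdii")
tagged_name = consul_service_name("top-bdii", tag="egi")  # hypothetical tag
print(plain_name)
print(tagged_name)
```

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resolve&lt;/code&gt; helper is defined but deliberately not called, since it only returns something useful on a machine whose resolver can reach a Consul agent. The appeal of this approach is that Consul’s DNS interface leaves out instances whose health checks are critical, so the client never has to ask the monitoring system itself.&lt;/p&gt;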

&lt;p&gt;&lt;strong&gt;UX improvement&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;This exercise will demonstrate how to make NGI or ROC-level BDII endpoints available via DNS.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Before: User needs to query GOCDB to find a relevant endpoint, remember a default one, or hardcode it in the environment :anger:&lt;/li&gt;
  &lt;li&gt;After: User can look up the currently healthy endpoint using a DNS name :heart_eyes:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a query script to provide external data to Terraform&lt;/li&gt;
  &lt;li&gt;Create a Consul external node to assign the declared services to&lt;/li&gt;
  &lt;li&gt;Register the declared services along with simple health checks in the external node as &lt;strong&gt;external services&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Deploy an external service monitor to discover and perform monitoring checks on these external services&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will do all this in Terraform, so first things first, we initialize our providers&lt;sup id=&quot;fnref:TFNote&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:TFNote&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;terraform&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;required_version&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;~&amp;gt; 1.7.0&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;required_providers&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;consul&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;hashicorp/consul&quot;&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;version&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;~&amp;gt; 2.20&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;external&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;hashicorp/external&quot;&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;version&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;~&amp;gt; 2.3&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h4 id=&quot;external-terraform-data-source&quot;&gt;External Terraform Data source&lt;/h4&gt;

&lt;p&gt;Next, we’ll need to use the GOCDB as a data source in Terraform.
In order to do this, we use the &lt;a href=&quot;https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;external&lt;/code&gt;&lt;/a&gt; data source from Hashicorp.
The data source runs an arbitrary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;program&lt;/code&gt; supplied by the user (me), which returns a JSON object that Terraform can parse.
I wrote it in Python:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;requests&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;xmltodict&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;json&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_top_bdii&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&amp;amp;service_type=Top-BDII&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;xpars&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xmltodict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strip_whitespace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dumps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xpars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;results&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;allow_nan&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;output&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dumps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;__main__&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;get_top_bdii&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This queries the GOCDB to get the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Top-BDII&lt;/code&gt; service endpoints.
The HTTP call is unauthenticated and returns an XML response which we parse into JSON.
The JSON is serialised as a string under the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output&lt;/code&gt; key of a dict, which is returned to Terraform via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdout&lt;/code&gt;, as the external data source requires.&lt;/p&gt;
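
&lt;p&gt;For readers who have not used the external data source before, the contract is roughly this: Terraform passes an optional JSON query to the program on standard input, and expects a single JSON object whose values are all strings on standard output. The sketch below shows only that contract, with no network call. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;service_type&lt;/code&gt; query key and the placeholder payload are assumptions for illustration; the script above hardcodes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Top-BDII&lt;/code&gt; and ignores its input.&lt;/p&gt;

```python
import io
import json


def run(stdin, stdout):
    """Read a JSON query from stdin and write a JSON object of strings to stdout."""
    raw = stdin.read()
    query = json.loads(raw) if raw.strip() else {}
    # Hypothetical query key; fall back to the type used in this post.
    service_type = query.get("service_type", "Top-BDII")
    # Placeholder standing in for the parsed GOCDB response.
    payload = {"service_type": service_type, "SERVICE_ENDPOINT": []}
    # Nested data has to be serialised, because every value must be a string.
    result = {"output": json.dumps(payload)}
    json.dump(result, stdout)


# Simulate Terraform's side of the exchange with in-memory streams.
out = io.StringIO()
run(io.StringIO('{"service_type": "Site-BDII"}'), out)
result = json.loads(out.getvalue())
print(result)
```

&lt;p&gt;This is why the script wraps the whole response in one string and why the Terraform side has to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsondecode&lt;/code&gt; on it again before it can be used.&lt;/p&gt;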

&lt;h4 id=&quot;external-consul-node&quot;&gt;External Consul Node&lt;/h4&gt;
&lt;!-- markdownlint-disable MD001 --&gt;

&lt;p&gt;To keep things simple, we now create a single external node (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EGI&lt;/code&gt;) on which we can register all of the declared services&lt;sup id=&quot;fnref:ExternalNode&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ExternalNode&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.
We declare it as “external” using the node metadata, so that the monitor can discover it and monitor services on it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul_node&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;egi&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;EGI&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://egi.eu&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;external-node&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;true&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h4 id=&quot;consul-service&quot;&gt;Consul Service&lt;/h4&gt;

&lt;p&gt;Now the meaty bit - we need to parse the result of the GOCDB lookup to create the services, and register them on the external node.&lt;/p&gt;

&lt;p&gt;In order to do this, I first declare a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local&lt;/code&gt; variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;service_endpoints&lt;/code&gt; which picks out the relevant fields from the GOCDB data:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;locals&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;service_endpoints&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jsondecode&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;external&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;bdiis&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;SERVICE_ENDPOINT&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;@PRIMARY_KEY&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;key&lt;/span&gt;                &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;@PRIMARY_KEY&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;configuration_item&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;GOCDB_PORTAL_URL&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;hostname&lt;/span&gt;           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;HOSTNAME&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;sitename&lt;/span&gt;           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;SITENAME&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;in_production&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;IN_PRODUCTION&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;monitored&lt;/span&gt;          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;NODE_MONITORED&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;notifications&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;NOTIFICATIONS&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;country&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;COUNTRY_NAME&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;roc&lt;/span&gt;                &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;ROC_NAME&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;scopes&lt;/span&gt;             &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;SCOPES&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;SCOPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We now have a nice map, keyed by GOCDB primary key, which we can loop over in order to register each declared service in our Consul catalogue.&lt;/p&gt;

&lt;p&gt;Since there is a one-to-many mapping between sites and service instances, we need to register each service and its check under a unique name.
We will use a combination of the service name, the site name and the primary key defined in GOCDB for this.&lt;/p&gt;
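
&lt;p&gt;If the HCL expression above is hard to read, the same transformation can be mimicked in plain Python. This is only a sketch: the sample data is invented (hostnames, site names and primary keys are all hypothetical) and is shaped the way &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmltodict&lt;/code&gt; would return it. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;as_list&lt;/code&gt; helper is my own addition to handle the fact that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmltodict&lt;/code&gt; yields a bare value for a single child element and a list for several.&lt;/p&gt;

```python
# Hypothetical sample shaped like xmltodict output for two endpoints.
results = {
    "SERVICE_ENDPOINT": [
        {
            "@PRIMARY_KEY": "1001G0",
            "HOSTNAME": "bdii.example.org",
            "SITENAME": "EXAMPLE-SITE-A",
            "ROC_NAME": "NGI_EXAMPLE",
            "SCOPES": {"SCOPE": ["EGI", "Local"]},
        },
        {
            "@PRIMARY_KEY": "1002G0",
            "HOSTNAME": "bdii.example.net",
            "SITENAME": "EXAMPLE-SITE-B",
            "ROC_NAME": "NGI_EXAMPLE",
            "SCOPES": {"SCOPE": "EGI"},
        },
    ]
}


def as_list(value):
    """Normalise a value that may be a single item or a list of items."""
    return value if isinstance(value, list) else [value]


# Equivalent of the locals block: a map keyed by the GOCDB primary key.
service_endpoints = {
    v["@PRIMARY_KEY"]: {
        "key": v["@PRIMARY_KEY"],
        "hostname": v["HOSTNAME"],
        "sitename": v["SITENAME"],
        "roc": v["ROC_NAME"],
        "scopes": as_list(v["SCOPES"]["SCOPE"]),
    }
    for v in as_list(results["SERVICE_ENDPOINT"])
}

# Unique identifiers built from service name, site name and primary key.
service_ids = [
    "top-bdii_{}-{}".format(e["sitename"], e["key"])
    for e in service_endpoints.values()
]
print(service_ids)
```

&lt;p&gt;The single-versus-list behaviour is also the reason the scopes need to be wrapped in a list and flattened on the Terraform side before they can be used as tags.&lt;/p&gt;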

&lt;p&gt;In order to avoid repeating ourselves, we will use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for_each&lt;/code&gt; keyword for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul_service&lt;/code&gt; resource and loop over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local.service_endpoints&lt;/code&gt; keys.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;for_each&lt;/span&gt;   &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;local&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;service_endpoints&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;${i.sitename}-${i.key}&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt;       &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;consul_node&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;egi&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;address&lt;/span&gt;    &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;hostname&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;port&lt;/span&gt;       &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2170&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;       &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;top-bdii&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;service_id&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;top-bdii_${each.key}&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;tags&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sitename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;roc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;scopes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;check&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;tls_skip_verify&lt;/span&gt;                   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;check_id&lt;/span&gt;                          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;service:${each.value.sitename}-${each.value.key}&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;                              &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;${each.value.sitename} top bdii check&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;interval&lt;/span&gt;                          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;1m0s&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;timeout&lt;/span&gt;                           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;20s&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;tcp&lt;/span&gt;                               &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;${each.value.hostname}:2170&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;notes&lt;/span&gt;                             &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;${each.value.sitename} TCP check&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;deregister_critical_service_after&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;720h0m0s&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;primary_key&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;key&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;ci&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;configuration_item&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;hostname&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;hostname&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;sitename&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sitename&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;in_production&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;in_production&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;monitored&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;monitored&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;notifications&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;notifications&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;country&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;country&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;roc&lt;/span&gt;           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;each&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;roc&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We’ve added a set of key-value metadata entries to reproduce the kind of information that one would find in the GOCDB.
As we’ll see, many of the instances registered there are not alive, so their service checks will immediately fail and the service will soon be deregistered.
We only register one health check for now, which is a TCP check on the LDAP server.
It would be &lt;em&gt;far&lt;/em&gt; better and more accurate to add an actual LDAP search script check&lt;sup id=&quot;fnref:ARGOdoesthis&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ARGOdoesthis&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, but we’ll get to that in the discussion below.&lt;/p&gt;

&lt;h4 id=&quot;external-service-monitor&quot;&gt;External Service Monitor&lt;/h4&gt;

&lt;p&gt;Finally, we need to run an &lt;a href=&quot;https://developer.hashicorp.com/consul/tutorials/developer-discovery/service-registration-external-services#monitor-the-external-service-with-consul-esm&quot;&gt;external service monitor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I am obviously doing this with something you might not have: a beautiful Nomad cluster.
The external service monitor can be run next to any Consul agent, so in principle you could run it as a systemd unit on one of your Consul agents.
I’m using Nomad so that I can easily manage the deployment and lifecycle.
The final resource is therefore a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nomad_job&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;nomad_job&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul_esm&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;jobspec&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;templatefile&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;${path.module}/consul-esm.jobspec.hcl&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;consul_esm_version&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;consul_esm_version&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# spread over available nodes&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rerun_if_dead&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;with associated Job Specification.&lt;/p&gt;
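&lt;p&gt;A minimal sketch of what such a job specification template could look like is shown below. Only the two template variables (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul_esm_version&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;) come from the resource above; the datacenter name, task driver, download URL, Consul address and resource sizes are all assumptions made for illustration.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;job &quot;consul-esm&quot; {
  # Hypothetical datacenter name; adjust to your Nomad setup.
  datacenters = [&quot;dc1&quot;]
  type        = &quot;service&quot;

  group &quot;esm&quot; {
    # Filled in by Terraform&apos;s templatefile().
    count = ${count}

    # Spread instances across client nodes. The doubled &quot;$$&quot; stops
    # templatefile() from interpolating Nomad&apos;s own runtime variable.
    spread {
      attribute = &quot;$${node.unique.id}&quot;
    }

    task &quot;consul-esm&quot; {
      # Assumption: the exec driver is available on the clients.
      driver = &quot;exec&quot;

      # Assumed download location, following the usual HashiCorp releases layout.
      artifact {
        source = &quot;https://releases.hashicorp.com/consul-esm/${consul_esm_version}/consul-esm_${consul_esm_version}_linux_amd64.zip&quot;
      }

      config {
        command = &quot;local/consul-esm&quot;
      }

      # Assumes a Consul agent is reachable on the node running the task.
      env {
        CONSUL_HTTP_ADDR = &quot;127.0.0.1:8500&quot;
      }

      # Illustrative sizing only.
      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running several instances is safe in principle, since the monitors are designed to share the external nodes between them, but a single instance would be enough for a first test.&lt;/p&gt;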

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;Now, let’s deploy this monstrosity and discuss the result.
The Terraform plan looks sane, with a bunch of resources like this one:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# module.example.consul_service.top-bdii[&quot;AEGIS01-IPB-SCL-1183G0&quot;] will be created&lt;/span&gt;
  &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul_service&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;top-bdii&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;address&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;bdii.ipb.ac.rs&quot;&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;datacenter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;known&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;after&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;known&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;after&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;meta&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ci&quot;&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://goc.egi.eu/portal/index.php?Page_Type=Service&amp;amp;id=1183&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;country&quot;&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Serbia&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;hostname&quot;&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;bdii.ipb.ac.rs&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;in_production&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Y&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;monitored&quot;&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Y&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;notifications&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;N&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;primary_key&quot;&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;1183G0&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;roc&quot;&lt;/span&gt;           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;NGI_AEGIS&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;sitename&quot;&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;AEGIS01-IPB-SCL&quot;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;top-bdii&quot;&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;EGI&quot;&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;port&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2170&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;service_id&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;top-bdii_AEGIS01-IPB-SCL-1183G0&quot;&lt;/span&gt;
      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tags&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;AEGIS01-IPB-SCL&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;NGI_AEGIS&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;EGI&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

      &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;check&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;check_id&lt;/span&gt;                          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;service:AEGIS01-IPB-SCL-1183G0&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;deregister_critical_service_after&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;720h0m0s&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;interval&lt;/span&gt;                          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;1m0s&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;method&lt;/span&gt;                            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;GET&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;                              &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;AEGIS01-IPB-SCL top bdii check&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;notes&lt;/span&gt;                             &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;AEGIS01-IPB-SCL TCP check&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;status&lt;/span&gt;                            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;known&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;after&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tcp&lt;/span&gt;                               &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;bdii.ipb.ac.rs:2170&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;timeout&lt;/span&gt;                           &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;20s&quot;&lt;/span&gt;
          &lt;span class=&quot;err&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tls_skip_verify&lt;/span&gt;                   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The apply operation added 83 resources in just under 18s.
Below is a screencast of the events in Consul while the services are registered, and eventually become healthy.&lt;/p&gt;

&lt;div class=&quot;video&quot;&gt;
&lt;video width=&quot;100%&quot; controls=&quot;&quot;&gt;
&lt;source src=&quot;https://www.brucellino.dev/images/service-registration-in-consul.webm&quot; /&gt;
&lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;As you can see, they are tagged as registered by Terraform in the EGI external node.
The Nomad job running the external monitor takes a few seconds to come up and perform the monitoring checks, which eventually turn the healthy service instances green.&lt;/p&gt;

&lt;div class=&quot;video&quot;&gt;
&lt;video width=&quot;100%&quot; controls=&quot;&quot;&gt;
&lt;source src=&quot;https://www.brucellino.dev/images/consul-esm-in-nomad.webm&quot; /&gt;
&lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;So, in a few seconds, we have both registered the external services and started monitoring them with a basic TCP liveness check.
Let’s see if this makes any difference to a user.&lt;/p&gt;

&lt;h3 id=&quot;i-promised-you-ux-improvements&quot;&gt;I promised you UX improvements&lt;/h3&gt;

&lt;p&gt;Now, imagine I’m a member of the alice VO and I want to find a top-bdii:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-console&quot; data-lang=&quot;console&quot;&gt;&lt;span class=&quot;go&quot;&gt;host alice.top-bdii.service.consul
alice.top-bdii.service.consul is an alias for topbdii.grif.fr.
topbdii.grif.fr is an alias for lpnhe-topbdii.in2p3.fr.
lpnhe-topbdii.in2p3.fr is an alias for lpnhe-gs9013.in2p3.fr.
lpnhe-gs9013.in2p3.fr has address 134.158.159.13
lpnhe-gs9013.in2p3.fr has IPv6 address 2001:660:3036:197:134:158:159:13&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Consul creates DNS entries for all services and their tags in its catalogue, and only returns healthy instances.
Since it’s DNS, there’s automatically round-robin so we don’t risk hitting a given instance too hard.&lt;/p&gt;

&lt;p&gt;Since we’ve tagged services with their site name as well as NGI and ROC names, we can also find local, national or regional instances:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-console&quot; data-lang=&quot;console&quot;&gt;&lt;span class=&quot;go&quot;&gt;host ngi_france.top-bdii.service.consul
ngi_france.top-bdii.service.consul is an alias for lapp-bdii01.in2p3.fr.
lapp-bdii01.in2p3.fr has address 134.158.84.162
lapp-bdii01.in2p3.fr has IPv6 address 2001:660:5310:420:7::1&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;

&lt;p&gt;Of course, I’m hiding a few details from you here, dear reader&lt;sup id=&quot;fnref:agent&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:agent&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, but bear with me – we are looking at this from the user’s point of view.
If a community decided that they wanted to use a modern stack, but use some of EGI’s federated services, they could use this approach to discover infrastructure services.
This could greatly improve the performance of workflow engines, for example, which need to keep an up-to-date list of healthy compute endpoints.
The last time I touched this problem, it was done using either a hardcoded list of endpoints or an unwieldy and unreliable GIIS lookup.
Being able to find things just by using DNS seems to me a much better approach.&lt;/p&gt;

&lt;p&gt;I’m also looking only at Top-BDII services here.
I decided to start there because it was easy to write a health check for it and I’m pretty much guaranteed that these will be open to the world.
It seems a bit redundant to put service discovery systems (top-bdiis) into another service discovery system (Consul).
We could replace the GOCDB query and find UIs in the same way though, which might be a bit more useful eventually.&lt;/p&gt;

&lt;p&gt;Another point is the combination of service discovery and service availability checks.
I mentioned above that the GOCDB only declares the desired state of the world, but Consul adds to that by including a current state check.
There is no historical or statistical information in this, so it’s by no means a replacement for something like ARGO.
However, if you have a piece of infrastructure which needs to query a service topology in order to configure itself, this is a better way to go.&lt;/p&gt;
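&lt;p&gt;As a rough sketch of that last point, a Terraform configuration could read the registered instances back from the catalogue with the Consul provider’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul_service&lt;/code&gt; data source. The tag value and the output name are assumptions for illustration. Note that, as far as I can tell, this is a plain catalogue lookup and does not filter on health the way the DNS interface does.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Look up the top-bdii instances carrying a given tag.
data &quot;consul_service&quot; &quot;top_bdii_fr&quot; {
  name = &quot;top-bdii&quot;
  # Assumed tag value, following the ROC naming seen in the plan above.
  tag  = &quot;NGI_FRANCE&quot;
}

# Hypothetical output: one &quot;address:port&quot; string per registered instance.
output &quot;top_bdii_endpoints&quot; {
  value = [
    for s in data.consul_service.top_bdii_fr.service : &quot;${s.address}:${s.port}&quot;
  ]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;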

&lt;p&gt;In conclusion, the whole concept of external services is really useful here.
I can easily envisage a scenario where a community comes up with a set of applications which get deployed into its platform, but need to be augmented with infrastructure or compute and data services from the federation.
This fun little experiment shows just how easy it is to terraform the federation into your environment.&lt;/p&gt;

&lt;p&gt;I’m not proposing anything radical here, but I am intrigued by the idea of including the &lt;em&gt;entire&lt;/em&gt; federation into a set of peered Consul datacentres, replacing the entire bdii infrastructure with a combination of Consul’s service mesh and key-value store.
I have a sneaking suspicion it would be quite handy in creating the controls which are required to satisfy the FitSM CONFM process requirements.
Consul’s documentation says it should be able to scale effectively… but I’m more interested in entirely eliminating the need for site information services and using it as a distributed source of truth for configuration items.&lt;/p&gt;

&lt;p&gt;For now it’s just a thought experiment, but I really want to scratch that itch. I look forward to extending the approach to see what else we can do :star:&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:FitSMCONFM&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Taken from the FitSM standard, section on Configuration Management Process &lt;a href=&quot;#fnref:FitSMCONFM&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:TFNote&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that we will be using environment variables to configure the Consul provider (address and token). The backend is not declared here, but in the actual instantiation, the backend is also Consul. &lt;a href=&quot;#fnref:TFNote&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ExternalNode&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I thought about binding these declared external services to actual existing nodes, and then requesting that the monitor run on those nodes, but there is currently no operator loop between terraforming the services and the node state. Nodes could therefore potentially fail, taking the services registered on them with them, so I decided on the “fake” external node. &lt;a href=&quot;#fnref:ExternalNode&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ARGOdoesthis&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This kind of script check is exactly what the Nagios-based ARGO monitor does. &lt;a href=&quot;#fnref:ARGOdoesthis&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:agent&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;First of all, I’ve got my environment set up to be able to use the Consul DNS by having an agent running locally. &lt;a href=&quot;#fnref:agent&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 24 Mar 2024 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2024/03/consul-external-services/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2024/03/consul-external-services/</guid>
        
        <category>blog</category>
        
        <category>platform-engineering</category>
        
        <category>Consul</category>
        
        <category>Terraform</category>
        
        
        <category>platform-engineering</category>
        
      </item>
    
      <item>
        <title>Deploying Krateo with Terraform</title>
        <description>&lt;p&gt;This post will describe my experience in getting up and running with &lt;a href=&quot;https://krateo.io&quot;&gt;Krateo&lt;/a&gt; in a toy environment on Digital Ocean.&lt;/p&gt;

&lt;p&gt;I will pay particular attention to the paper cuts encountered while getting this up.&lt;/p&gt;

&lt;h2 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;/h2&gt;

&lt;p&gt;Let’s set a few criteria for ourselves, to see if this little experiment was successful:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Zero-touch declaration&lt;/strong&gt;: I shouldn’t have to do anything but write a declaration of what I want. No scripts, no manual intervention, no checking things somewhere.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Zero-magic&lt;/strong&gt;: The declaration contains everything I need to know. I shouldn’t have to invoke a &lt;em&gt;deus ex machina&lt;/em&gt; at some point, assuming that there is something else I already know.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Minimise work&lt;/strong&gt;: I should have to write the smallest possible declaration, with the smallest possible set of moving parts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution will consist of a Kubernetes cluster with Krateo installed on it, exposed by a load balancer, with a DNS name associated with it.
You will need:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A Kubernetes cluster&lt;/li&gt;
  &lt;li&gt;A DNS zone which you control&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;approach&quot;&gt;Approach&lt;/h2&gt;

&lt;p&gt;I’m using &lt;a href=&quot;https://docs.digitalocean.com/products/kubernetes/&quot;&gt;Digital Ocean (DOKS)&lt;/a&gt; to create the Kubernetes cluster, and &lt;a href=&quot;https://developers.cloudflare.com/dns/manage-dns-records/&quot;&gt;Cloudflare to manage my DNS zone&lt;/a&gt;.
This is an almost zero-cost way to set up the basic infrastructure required for the problem&lt;sup id=&quot;fnref:cost&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:cost&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, but accounts on these platforms are a prerequisite to using them.&lt;/p&gt;

&lt;p&gt;My desire was to be able to implement solutions to the problem in different ways and evaluate them.
These solutions are in an accompanying repository: &lt;a href=&quot;https://github.com/brucellino/jubilant-umbrella&quot;&gt;brucellino/jubilant-umbrella&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;terraform-implementation&quot;&gt;Terraform implementation&lt;/h3&gt;

&lt;p&gt;The first implementation (and the only one so far) was done with &lt;a href=&quot;https://terraform.io&quot;&gt;Hashicorp Terraform&lt;/a&gt;.
This choice satisfies the three criteria stated above, by containing a single definition of all of the infrastructure, with zero manual intervention and no undeclared steps.&lt;/p&gt;

&lt;p&gt;The core of the implementation consists of two resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://registry.terraform.io/providers/digitalocean/digitalocean/latest/docs/resources/kubernetes_cluster&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean_kubernetes_cluster&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://registry.terraform.io/providers/cloudflare/cloudflare/latest/docs/resources/record&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloudflare_record&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first is the actual K8s cluster on which Krateo is installed, while the second satisfies the requirement that the endpoint be passed to the ingress controller in order to expose the services.&lt;/p&gt;
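&lt;p&gt;To make this concrete, here is a minimal sketch of how the two resources might be declared. It is not the configuration from the accompanying repository: the resource names, region, node size, record name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cf_zone_id&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ingress_ip&lt;/code&gt; variables are all hypothetical, and the exact argument names of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloudflare_record&lt;/code&gt; depend on the provider version in use.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Ask Digital Ocean which Kubernetes versions are currently offered.
data &quot;digitalocean_kubernetes_versions&quot; &quot;available&quot; {}

# Hypothetical cluster definition: names, region and sizes are illustrative.
resource &quot;digitalocean_kubernetes_cluster&quot; &quot;krateo&quot; {
  name    = &quot;krateo&quot;
  region  = &quot;ams3&quot;
  version = data.digitalocean_kubernetes_versions.available.latest_version

  node_pool {
    name       = &quot;default&quot;
    size       = &quot;s-2vcpu-4gb&quot;
    node_count = 2
  }
}

# DNS record pointing at the load balancer in front of the ingress.
# var.cf_zone_id and var.ingress_ip are assumed inputs.
resource &quot;cloudflare_record&quot; &quot;krateo&quot; {
  zone_id = var.cf_zone_id
  name    = &quot;krateo&quot;
  type    = &quot;A&quot;
  # Older releases of the Cloudflare provider call this argument &quot;value&quot;.
  content = var.ingress_ip
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;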

&lt;p&gt;There are several other tidbits which I found necessary to add, whether for conciseness, elegance, or one of the three requirements stated above.&lt;/p&gt;

&lt;h4 id=&quot;vault-provider-to-configure-cloud-providers&quot;&gt;Vault provider to configure cloud providers&lt;/h4&gt;

&lt;p&gt;I have tokens for Cloudflare and Digital Ocean stored in my Hashicorp Vault, which are consumed as Terraform &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; sources in order to pass them to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;provider {}&lt;/code&gt; blocks for the respective cloud providers.
While this is not strictly required, it’s a default engineering practice I always employ when building infrastructure with Terraform and adds a bit of safety to the process.
Making the provider secrets declarative instead of hiding them in environment variables or special files which cannot be committed to the repository makes a clear separation of concerns between the infrastructure’s &lt;em&gt;code&lt;/em&gt; and &lt;em&gt;data&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Somebody else can thus more easily re-use this Terraform module, just by passing the relevant Vault parameters to their data lookup:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault_kv_secret_v2&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;do&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;mount&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do_vault_mount&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do_vault_secret&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;digitalocean&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;token&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vault_kv_secret_v2&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;do&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;terraform&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This has the downside of having to include another provider (Vault), but that’s a satisfactory tradeoff for the safety and re-usability that we gain, in my opinion.&lt;/p&gt;

&lt;h4 id=&quot;installing-krateo&quot;&gt;Installing Krateo&lt;/h4&gt;

&lt;p&gt;Krateo installation is an imperative task; the Krateo CLI has to be executed against the cluster, and cannot simply be &lt;em&gt;declared&lt;/em&gt; into existence.
This fact breaks the first requirement (zero-touch declaration) at first glance, unless we can find some way around it.
The default approach would be to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Terraform Digital Ocean to declare the K8s cluster into existence&lt;/li&gt;
  &lt;li&gt;execute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo init --kubeconfig ${output from first step}&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Terraform Cloudflare to declare the DNS record into existence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This kind of imperative, step-by-step approach is prone to being unreliable, and is one of the main reasons that &lt;a href=&quot;https://12factor.net/&quot;&gt;declarative formats are strongly suggested&lt;/a&gt;.
Yes, we can codify these steps into a pipeline, and that’s a great start if there’s no alternative, but any pipeline can fail unpredictably.
Besides the reliability of the pipeline, we’re adding extra work by forcing steps to be taken, in a specific order.
We have to keep several things in our head at the same time, which are not explicitly and irrevocably linked to each other. In the software development world, this is the kind of thing that would compile fine and then generate runtime errors, forcing the developer (the operator, in our case) to break a state of flow, go back and debug.&lt;/p&gt;

&lt;p&gt;Long story short, I really wanted to make the deployment as declarative as possible, so I chose to add a &lt;a href=&quot;https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null_resource&lt;/code&gt;&lt;/a&gt; resource linked to the creation of the Kubernetes cluster, in order to represent the imperative Krateo installation.&lt;/p&gt;
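&lt;p&gt;A minimal sketch of that idea follows. It assumes the cluster resource is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo&lt;/code&gt; and that the Krateo CLI binary sits in the working directory; the kubeconfig file name, the trigger and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo_install&lt;/code&gt; name are likewise illustrative rather than taken from the actual module.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Write the cluster kubeconfig to disk so that the Krateo CLI can read it.
# &quot;krateo&quot; is a hypothetical name for the cluster resource.
resource &quot;local_file&quot; &quot;k8sconfig&quot; {
  content         = digitalocean_kubernetes_cluster.krateo.kube_config[0].raw_config
  filename        = &quot;${path.module}/kubeconfig.yaml&quot;
  file_permission = &quot;0600&quot;
}

# Stand-in for the imperative installation step.
resource &quot;null_resource&quot; &quot;krateo_install&quot; {
  # Re-run the installation if the cluster is ever replaced.
  triggers = {
    cluster_id = digitalocean_kubernetes_cluster.krateo.id
  }

  provisioner &quot;local-exec&quot; {
    command = &quot;./krateo init --kubeconfig ${local_file.k8sconfig.filename}&quot;
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because the provisioner references the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local_file&lt;/code&gt;, which in turn references the cluster, Terraform orders the three steps by itself and the installation only runs once the kubeconfig exists.&lt;/p&gt;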

&lt;h2 id=&quot;results-and-discussion&quot;&gt;Results and Discussion&lt;/h2&gt;

&lt;p&gt;The final results of this experiment may be summarised as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The criteria set in the problem statement were respected, save for one&lt;/li&gt;
  &lt;li&gt;Krateo installation was done successfully, but&lt;/li&gt;
  &lt;li&gt;I couldn’t get it to expose the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app&lt;/code&gt; endpoint properly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These results are discussed in a bit more detail below, but all in all, I’d give myself a 70% satisfaction rating.&lt;/p&gt;

&lt;h3 id=&quot;interactivity-and-concerns&quot;&gt;Interactivity and Concerns&lt;/h3&gt;

&lt;p&gt;The first paper cut I encountered was having to deal with the Krateo CLI’s interactivity.
I could use the attributes from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean_kubernetes_cluster&lt;/code&gt; to write the kubeconfig file using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local_file&lt;/code&gt; resource, and thus pass it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo init --kubeconfig&lt;/code&gt;, but the CLI expects human input in order to configure the app endpoint.
I found this somewhat inelegant, based on the criteria I’ve set myself, and I would have preferred to &lt;em&gt;declare&lt;/em&gt; the domain name in some way.
I ended up having to pass it to the CLI via the command line in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local-exec&lt;/code&gt; provisioner used in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null_resource&lt;/code&gt; representing the Krateo installation:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;${var.cf_zone}&apos;&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;./&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;krateo&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;init&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;kubeconfig&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;local_file&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;k8sconfig&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here we can see that we pass the Cloudflare zone represented by the variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cf_zone&lt;/code&gt; to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo init&lt;/code&gt; execution via a shell pipe – old school.&lt;/p&gt;
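
&lt;p&gt;For context, the kubeconfig file itself can be written from the cluster attributes. The following is a minimal sketch rather than the exact configuration used here: the cluster resource name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo&lt;/code&gt; is hypothetical, and the file name is assumed to be the same one that is passed to the uninstall command further down.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Assumption: the cluster is declared as digitalocean_kubernetes_cluster.krateo.
resource &quot;local_file&quot; &quot;k8sconfig&quot; {
  # raw_config holds the complete kubeconfig document for the cluster
  content         = digitalocean_kubernetes_cluster.krateo.kube_config[0].raw_config
  filename        = &quot;kubeconfig-krateo-control-plane&quot;
  # the file contains credentials, so keep it readable by the owner only
  file_permission = &quot;0600&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;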

&lt;p&gt;The second thing I needed to take care of was to implement a way to cleanly remove all of the resources that were created &lt;em&gt;by Krateo&lt;/em&gt; during installation.
In this small experiment, the only such resource was a loadbalancer created to expose the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app&lt;/code&gt; service.
This is not in the Terraform state, but instead in the Krateo state.
Krateo isn’t in the Terraform state either, only the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null_resource&lt;/code&gt; representing it – so if we do a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform destroy&lt;/code&gt;, the resources that Terraform knows about will be destroyed, but &lt;em&gt;not those Krateo made&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Luckily, Krateo implements a cleanup target for its CLI, so we can invoke that at destroy time by adding a destroy-time Terraform provisioner.
Putting it all together:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;null_resource&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;k_install&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;triggers&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;kube_config&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;local_file&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;k8sconfig&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;filename&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;provisioner&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;local-exec&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;when&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;create&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;command&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;curl -fSL ${local.krateo_release_url} | tar xz krateo&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;interpreter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/bin/bash&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;-c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;nx&quot;&gt;provisioner&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;local-exec&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;when&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;create&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;command&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;echo &apos;${var.cf_zone}&apos; | ./krateo init --kubeconfig ${local_file.k8sconfig.filename}&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;interpreter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/bin/bash&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;-c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;nx&quot;&gt;provisioner&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;local-exec&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;when&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;destroy&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;command&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;./krateo uninstall --kubeconfig kubeconfig-krateo-control-plane&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;interpreter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/bin/bash&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;-c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
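
&lt;p&gt;Note that the destroy-time provisioner hardcodes the kubeconfig file name. Terraform only lets destroy-time provisioners refer to the resource’s own attributes via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;self&lt;/code&gt;, so one option is to read the path back from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;triggers&lt;/code&gt; map that is already set. This is a sketch of that variant, assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kube_config&lt;/code&gt; trigger holds the path that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo uninstall&lt;/code&gt; expects:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;resource &quot;null_resource&quot; &quot;k_install&quot; {
  triggers = {
    kube_config = local_file.k8sconfig.filename
  }

  # ... create-time provisioners unchanged ...

  provisioner &quot;local-exec&quot; {
    when = destroy
    # Destroy-time provisioners may only reference self, count.index and each.key,
    # so the kubeconfig path is read back from the triggers map.
    command     = &quot;./krateo uninstall --kubeconfig ${self.triggers.kube_config}&quot;
    interpreter = [&quot;/bin/bash&quot;, &quot;-c&quot;]
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null_resource&lt;/code&gt; depends on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local_file&lt;/code&gt;, Terraform destroys it first, so the kubeconfig should still be on disk when the uninstall runs.&lt;/p&gt;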

&lt;h3 id=&quot;change-the-endpoint&quot;&gt;Change the endpoint&lt;/h3&gt;

&lt;p&gt;What if I wanted to change the endpoint?
Krateo expects, as mentioned above, an input parameter to allow it to tell the ingress controller how to expose its services. This is hardcoded to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app.&amp;lt;domain&amp;gt;&lt;/code&gt; where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;domain&amp;gt;&lt;/code&gt; is the top-level domain that you are deploying Krateo to.
This is almost certainly something that can be changed by applying a different configuration, but it would have been nice to have this configurable via the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;init&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The good news was that running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;init&lt;/code&gt; again with a different TLD resulted in the desired configuration being applied.&lt;/p&gt;

&lt;h3 id=&quot;unpredictable-load-balancer-name&quot;&gt;Unpredictable load balancer name&lt;/h3&gt;

&lt;p&gt;During installation, Krateo creates an ingress controller which manages a DigitalOcean load balancer.
The public IP of that load balancer is needed to add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; record to the DNS so that we can interact with the Krateo App, but since this load balancer is managed by Krateo, it is not known to the Terraform state.&lt;/p&gt;

&lt;p&gt;I first tried to add an external load balancer, which I wanted to tell Krateo about, but that didn’t work out of the box – Krateo ignored it and added its own.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; block that should discover this load balancer does depend on the Krateo installation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null_resource&lt;/code&gt;, but there is a delay between when Krateo exits and when the load balancer becomes available.
However, the real deal breaker is the inability to &lt;strong&gt;declare the name of the load balancer&lt;/strong&gt; &lt;em&gt;a priori&lt;/em&gt;, which necessarily introduces esoteric knowledge – magic information that I just need to know and can’t derive.&lt;/p&gt;
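
&lt;p&gt;The delay, at least, could be absorbed inside Terraform. This is a minimal sketch, assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashicorp/time&lt;/code&gt; provider is added to the configuration; the two-minute duration is a guess rather than a measured value, and this does nothing for the naming problem.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Pause after the Krateo install exits, to give the load balancer time to come up.
resource &quot;time_sleep&quot; &quot;wait_for_lb&quot; {
  depends_on      = [null_resource.k_install]
  create_duration = &quot;120s&quot; # assumed value; tune to the observed provisioning time
}

# The data block that looks up the load balancer would then declare
# depends_on = [time_sleep.wait_for_lb] instead of the null_resource.&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;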

&lt;p&gt;I ended up breaking requirements 2 and 3 described at the beginning of this post, by&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;having to add esoteric knowledge (the name of the Krateo-managed loadbalancer)&lt;/li&gt;
  &lt;li&gt;having to run Terraform twice (ugh, gross) in order to pick up the load balancer&lt;/li&gt;
&lt;/ol&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# LB created by krateo.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;digitalocean_loadbalancer&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;krateo&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;depends_on&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;null_resource&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;k_install&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# id         = &quot;d72d4916-9023-4616-b292-33032dda4799&quot; # &amp;lt;- obtained from the console&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;a6434671d1dde4647804e9cd6261d5d6&quot;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# &amp;lt;- obtained from the console.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;cloudflare_record&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;k&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;zone_id&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;cloudflare_zone&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;id&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;type&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;A&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;proxied&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;app&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;cf_zone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;digitalocean_loadbalancer&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;krateo&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;ip&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As far as I can tell, in my ignorance of how to use Krateo effectively, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; attribute cannot be known in advance, since it is computed by Krateo.
I could probably add a null resource to run a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;doctl&lt;/code&gt; command in order to look up the load balancer and pass its attributes to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloudflare_record&lt;/code&gt; resource, or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; command to do something similar, but I didn’t want to add extra tooling at this point and indeed, I wanted to force the issue by surfacing this “problem”.&lt;/p&gt;

&lt;p&gt;This is, in my opinion, a documentation problem more than a design problem, since I can definitely imagine ways to get around this, but they all make me throw up in my mouth a bit.&lt;/p&gt;
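
&lt;p&gt;One such workaround, sketched here under assumptions, is to ask Kubernetes rather than DigitalOcean for the address: the Terraform Kubernetes provider has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubernetes_service&lt;/code&gt; data source which exposes the load balancer status of a Service. The Service name and namespace below are hypothetical placeholders, since the names Krateo actually uses are not given here, and the provider would need to be configured against the new cluster.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Look up the Service that fronts the Krateo app, once the install has run.
data &quot;kubernetes_service&quot; &quot;krateo_ingress&quot; {
  depends_on = [null_resource.k_install]
  metadata {
    name      = &quot;krateo-ingress&quot; # hypothetical Service name
    namespace = &quot;krateo-system&quot;  # hypothetical namespace
  }
}

locals {
  # Public IP assigned to the Service by the cloud load balancer
  krateo_lb_ip = data.kubernetes_service.krateo_ingress.status[0].load_balancer[0].ingress[0].ip
}

# The cloudflare_record would then use value = local.krateo_lb_ip&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This swaps the load balancer name for a Service name, which is still something to know up front, although presumably one that stays the same between installations.&lt;/p&gt;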

&lt;h3 id=&quot;ssl-and-ingress-errors&quot;&gt;SSL and ingress errors&lt;/h3&gt;

&lt;p&gt;The showstopper for me was the inability to actually access the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app.&amp;lt;domain&amp;gt;&lt;/code&gt; URL due to SSL and ingress errors.
Behind the scenes, I could see that all Krateo components had been installed, and everything was reporting healthy.
However, I was unable to access the UI, since the URL gave &lt;a href=&quot;https://http.cat/522&quot;&gt;HTTP 522&lt;/a&gt; errors (timeouts).
I didn’t spend too much time investigating, but my suspicion was that&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A firewall rule was blocking the connection between the LB and the services in the cluster – either at the infrastructure (DigitalOcean) level, or at the API gateway&lt;sup id=&quot;fnref:kong&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:kong&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; level&lt;/li&gt;
  &lt;li&gt;Somewhere a selector was improperly configured – perhaps an authentication service was missing which the API gateway was sending requests to, resulting in the 522&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whatever the true reason, I’m confident that this could be resolved by adding a few &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubernetes_xyz&lt;/code&gt; resources using the &lt;a href=&quot;https://registry.terraform.io/providers/hashicorp/kubernetes/latest&quot;&gt;Terraform Kubernetes provider&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;The goal of this little exercise was to get my hands dirty with Krateo, while staying true to some of the engineering principles I hold dear.
I was about 70% successful at this.
I have no doubt that it’s possible – easy even – to deploy Krateo in this way, with a bit more understanding of how it is supposed to work.
I have a suspicion that a bit more detail in the documentation could have helped, or perhaps a tutorial showing how to modify the vanilla installation with a few &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl&lt;/code&gt;s after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo init&lt;/code&gt;.
I don’t expect the Krateo CLI to do everything after all!&lt;/p&gt;

&lt;p&gt;I should also remind the reader that deploying Krateo is very likely a one-time event.
This is a service which will act as the control plane for your entire infrastructure after all, so I don’t expect folks to be doing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;krateo init&lt;/code&gt; once a week in the end-use environment.
However, for people like me who will eventually end up doing it &lt;em&gt;for clients&lt;/em&gt;, the process isn’t 100% yet.&lt;/p&gt;

&lt;h3 id=&quot;who-governs-the-governor&quot;&gt;Who governs the governor&lt;/h3&gt;

&lt;p&gt;There is however a deeper question which has bugged me throughout this exercise:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Am I doing it wrong?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Krateo is for governance; it’s supposed to contain the control plane for everything.
But it needs to &lt;em&gt;emerge from the void&lt;/em&gt;, something &lt;a href=&quot;https://hashiatho.me/blog/2022/10/22/base-platform/&quot;&gt;I’ve written about before&lt;/a&gt;.
The demos I’ve seen before start with “Create a Kubernetes cluster”&lt;sup id=&quot;fnref:other_resources&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:other_resources&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and I can imagine that when used in an enterprise environment, it will be a bit more like “Install Krateo into an existing cluster”.
But who creates those resources?
It can’t be Krateo because it doesn’t exist in that environment yet.
Does Krateo become self-aware after installation and resolve the bootstrap paradox by then managing itself?&lt;/p&gt;

&lt;p&gt;I can’t shake the ghost of Gödel whispering in my ear:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“the system is necessarily incomplete”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If something extra is always required to invoke a governor, if this is indeed an emergent property, what is the most elegant way of expressing this?&lt;/p&gt;

&lt;p&gt;I do not have an answer to this yet. Do you?&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:cost&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To give an idea, the cost for the cluster and associated resources was 10 euro cents for 3 hours of use. &lt;a href=&quot;#fnref:cost&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:kong&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The API gateway used by Krateo is &lt;a href=&quot;https://konghq.com/&quot;&gt;Kong&lt;/a&gt;. &lt;a href=&quot;#fnref:kong&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:other_resources&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As we’ve seen, this is also not sufficient – you need a few other resources in order to properly deploy Krateo. While the DNS domain is mentioned, there are indeed some other requirements which are not declared explicitly. &lt;a href=&quot;#fnref:other_resources&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 11 Jun 2023 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2023/06/Krateo-deploy-terraform/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2023/06/Krateo-deploy-terraform/</guid>
        
        <category>blog</category>
        
        <category>platformops</category>
        
        <category>evaluation</category>
        
        <category>unboxing</category>
        
        <category>Terraform</category>
        
        <category>Krateo</category>
        
        
        <category>PlatformOps</category>
        
      </item>
    
      <item>
        <title>We are all platform engineers now</title>
        <description>&lt;h2 id=&quot;we-are-all-platform-engineers-now&quot;&gt;We are all platform engineers now&lt;/h2&gt;

&lt;p&gt;Platform Engineering is clearly the hotness amongst the cool kids, and has been so since at least early 2022.
As with all the good ideas that eventually end up mangled by the IT industry, Platform Ops has both been around for a longer time than I bet most folks think, and also doesn’t actually express anything about its nature.&lt;/p&gt;

&lt;p&gt;Terms like these are useful for driving the marketing department, naming products and generally starting useful conversations.
They’re an opener, but they don’t get you all the way to actually solving a problem.&lt;/p&gt;

&lt;p&gt;One thing I am convinced of is that these terms arise because there really are problems worth solving, and perhaps since the birth of the term “Agile”, these are ever more business problems rather than technological problems.
With nigh infinite compute capacity, the question then became “how the heck do we get things done now?”.&lt;/p&gt;

&lt;p&gt;In this vein, the exhortation to collaborate in the DevOps movement, the codification of good practices for reliability in SRE, and the realisation in DevSecOps that collaboration with the security and safety side of the business gives greater benefits if included early all find a continuation in this pattern of “Platform”.&lt;/p&gt;

&lt;p&gt;It’s hard to tell from within my little bubble, but the term “Platform Engineering” seems to have an irresistible attraction in 2023.
There will certainly be “haters” – contrarian folks who find their intelligence or experience insulted by the very idea that the knowledge and skill contained within them can be packaged into a &lt;em&gt;product&lt;/em&gt; (ugh).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Where’s the workmanship in that?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;they will scowl and who can blame them?
Our market is swimming in terrible products which only exist to make their vendors a dime.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“I can do that in a weekend!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;they will counter, and who can blame them?
Most of the products we’re seeing come out are just compositions of other things.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“There’s no way a product, no matter how customisable and extensible, will be able to meet all use cases”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;is the line they will draw in the sand, and who can blame them?&lt;/p&gt;

&lt;p&gt;Well, yes, these are all good points.
I’ve been involved in building things we called “platforms” for over 10 years and I can say for sure that all the things people are talking about in this wave, we’ve been trying to build in one way or another for that whole time.&lt;/p&gt;

&lt;p&gt;We didn’t ever call them “internal developer platforms” when we were building science gateways back in 2013, we definitely didn’t treat the worldwide compute grid as a product back in 2003, but these were definitely platforms. You want something else? Build it your own damn self.&lt;/p&gt;

&lt;p&gt;So, of course people did – frameworks under toolkits under apps, all in such a delirious state of entropy that nobody could accurately predict what would be around.
Remember GIFEE&lt;sup id=&quot;fnref:google&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:google&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;? Yeah, I almost forgot about that too, but luckily I was around in 2016 and I’ve seen enough failed experiments to know why my spidey sense is tingling now that we’re making a hotness.&lt;/p&gt;

&lt;h2 id=&quot;whats-changed&quot;&gt;What’s changed&lt;/h2&gt;

&lt;p&gt;The OG platforms (including early AWS) were born to serve the entire damn planet and were by necessity inflexible. You get these functions, that’s it. Everybody gets the same.
Self-service? Yes, you can consume these very specific things, and we’ll bill you for them.
We had effort after effort to build a better piece of infrastructure, to expose more specific functions, to bring in new capabilities.
I’ll bet that it’s a commonly held opinion that all of this converged in the creation of Kubernetes: the codification of all conceivable functions in a data center, in a network, and between applications.&lt;/p&gt;

&lt;p&gt;This didn’t solve any damn problems – it created more of them!&lt;/p&gt;

&lt;p&gt;What’s changed since the first iterations of platform is that we are now coming to terms with Spidey Rules: with great power comes great responsibility.&lt;/p&gt;

&lt;p&gt;Now, I’ve been around, but I haven’t been to every business out there, I don’t know the story of your struggles, I’m not going to pretend I do.
But I’m willing to bet that many organisations over a certain age have found themselves mired in transition due to the inability to make decisions responsibly across the lifecycle of the application.
Each individual group or function chose “the right tools for the job”, perhaps doing their best to optimise locally.&lt;sup id=&quot;fnref:or_not&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:or_not&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-exactly-are-we-talking-about&quot;&gt;What exactly are we talking about&lt;/h2&gt;

&lt;p&gt;I find it better to my taste to be explicit when talking about a subject, so as to help myself understand the boundaries of my own knowledge, and better hone my own opinions.
For this reason, I tend to think about – and hence reason about – Platform Engineering and PlatformOps differently.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:google&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Did you just google that? Did you have to go to page 2 to feel the same irony I’m feeling? In 2016 it had a hashtag and drove IBM to adopt the cloud model, if you believe the things you read on page 2. &lt;a href=&quot;#fnref:google&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:or_not&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is the generous point of view. A more realistic one is that choices were driven by vendors and personal connections at every level. &lt;a href=&quot;#fnref:or_not&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 10 Jun 2023 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2023/06/platform-engineering/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2023/06/platform-engineering/</guid>
        
        <category>blog</category>
        
        <category>platformops</category>
        
        
        <category>PlatformOps</category>
        
      </item>
    
      <item>
        <title>Back to serving others</title>
        <description>&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#working-through-war-and-plague&quot;&gt;Working through war and plague&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#mission-and-purpose&quot;&gt;Mission and purpose&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#a-detour&quot;&gt;A detour&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#lessons&quot;&gt;Lessons&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#who-am-i-not&quot;&gt;Who am I not&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#who-am-i&quot;&gt;Who am I&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#back-to-service&quot;&gt;Back to service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;working-through-war-and-plague&quot;&gt;Working through war and plague&lt;/h2&gt;

&lt;p&gt;It is not the stroke of midnight, but the breaking of the first dawn of the year which prompts me to reflect on that moment of equilibrium between the year past, and that which is to come.&lt;/p&gt;

&lt;p&gt;Allow me, dear reader, to pause for a moment of self-indulgent personal reflection.
Nay, come with me, if you will, on a stroll past ourselves as we try to digest how the past few years have moulded us.
This is a story about me, but I’m not so special that you are unlikely to find fragments of yourself in my story.&lt;/p&gt;

&lt;p&gt;For the past few years, I have described myself as a “2020 fugee” – a refugee from the upheaval, stagnation and unrest that was brought upon us by that year of reckoning.
I’ve had my share of feeling adrift, moving between one place and the next over the course of my life, but in that carefree past it was more often than not a flight of desire, not of obligation.
The stasis of the pandemic, having to accept that things must stand still for a time, that we must all &lt;em&gt;stay&lt;/em&gt; where we are, that events must simply be cancelled, erased, that milestones must be deleted and that everything will be done later when it would be safe again, this stasis must now come to an end.
It is time to start moving again.&lt;/p&gt;

&lt;p&gt;The irony in many cases, including mine, is that I came out of that period completely exhausted, mentally and socially.
The mere idea of having to interact, to earn social capital, to be present for others that were not my immediate family, became for a period too much to bear.&lt;/p&gt;

&lt;p&gt;I sometimes think about what happened to us during that vaguely defined time of pandemic, and whether we can ever recover from it as a society.
The pessimist in me says that some cracks were laid bare and that once we see them, we can never unsee them.&lt;/p&gt;

&lt;p&gt;However, perhaps we can go about fixing them.&lt;/p&gt;

&lt;h2 id=&quot;mission-and-purpose&quot;&gt;Mission and purpose&lt;/h2&gt;

&lt;p&gt;For as long as I can remember, I have been working for common goals, finding my own purpose through a larger mission.
First, this was the study of the universe through a series of postgraduate degrees in physics culminating in a Ph.D. and several postdoc positions.
Scientific research has for decades been a team sport, and even more so in the physics domain.
Personally I found myself perennially in collaborations large and small, working with others towards what we were all convinced was a greater good.
During the latter period of my scientific career, this peaked as I worked in what was then a giant collaboration of more than three thousand researchers and engineers.
The sense of community I felt then has been something that I will never forget and have sought ever since.
There was a mission, and I felt that my work had a purpose through achieving that mission.&lt;/p&gt;

&lt;p&gt;My decision to leave that environment was taken half-heartedly.
I would have happily stayed in physics, had I not realised two difficult truths:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I was not the world’s greatest physicist, indeed I was quite mediocre and the competition was fierce&lt;/li&gt;
  &lt;li&gt;Physics as a career would not give me the stability and growth path I wanted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, one cannot so easily change one’s essence as one can one’s job.
My next career was still in service, more explicitly so, as I returned to South Africa to help build public infrastructure for computing.
Looking back on those ten years from 2008 to 2018, I find it hard to judge myself as having had any success.
As a harsh critic, I would say that all of the work I did was just playing at the various roles of project manager, community manager, trainer, &lt;em&gt;etc&lt;/em&gt;.
The end result was that there was nothing really to judge, because the goal posts kept moving, so I can’t even call the outcome of those years a failure!
Perhaps this is too harsh a critique, but I will let others be the judge of that.&lt;/p&gt;

&lt;p&gt;One theme that did remain constant, I think, was that of working &lt;em&gt;for the benefit of others&lt;/em&gt;.
Yes, I had my personal agendas which I was following, whether I knew it or not, but this does not detract from the fact that the projects, infrastructure and initiatives that I was trying to move forward were always for the direct benefit of others.
This gave me that warm and fuzzy feeling of somehow “doing good”.&lt;/p&gt;

&lt;h2 id=&quot;a-detour&quot;&gt;A detour&lt;/h2&gt;

&lt;p&gt;I joined EGI enthusiastically in 2018 to continue this mission, and out of the blue came the biggest opportunity for detour in my life so far.
I made the decision to leave the world of research and public service to enter the private sector, because I felt that the time had come to put myself to the test.
I had had my share of projects in which I was but a part, which didn’t depend explicitly on me, but which I contributed to and I had always felt the frustration that my peers and collaborators were not “doing it right”.
A whiff of arrogance had started to surround me, I suspect, and I found myself correcting folks, pointing out “the bigger picture”, arguing based on grand principles and generally pretending like I knew it all.&lt;/p&gt;

&lt;p&gt;What if I didn’t?&lt;/p&gt;

&lt;p&gt;But what if I &lt;em&gt;did&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;The opportunity to work in a tight, elite team presented itself as I was called to join UEFA’s new DevOps team.
I still can’t believe that I was presented with that opportunity and the absolutely amazing environment I was presented with.
I found the best colleagues – truly wonderful people, great at their job, all at the top of their game – as well as a fertile environment in which to prove myself.
I found myself the least prepared, least knowledgeable, least technically capable, least sure of all of my colleagues and that feeling of being uncomfortable and challenged every day was thrilling.&lt;/p&gt;

&lt;p&gt;My time at UEFA gave me the chance to really see what I was capable of and held a mirror up to &lt;em&gt;myself&lt;/em&gt;, not just the environment I was working in.&lt;/p&gt;

&lt;h2 id=&quot;lessons&quot;&gt;Lessons&lt;/h2&gt;

&lt;p&gt;Needless to say, I learned a lot during that time; a lot of technology, yes, but also a lot about how things really work in the real world.
Or rather, how things &lt;em&gt;fail&lt;/em&gt; in the real world, all of the ways that they work on paper but not in practice.
The most valuable experience I gained during that time was not the subtleties of the cloud, performance tuning or monitoring tricks – it was what it takes to be successful, how to think and execute quickly, and how to solve problems permanently instead of just fixing issues.
I learned a few things about myself which I hope to bring into 2023 and beyond.&lt;/p&gt;

&lt;h3 id=&quot;who-am-i-not&quot;&gt;Who am I not&lt;/h3&gt;

&lt;p&gt;First of all, I learned a few things about who I am not:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Not the smartest person in the room&lt;/strong&gt;: just as when I was a physicist, I am never going to be the smartest person in a team. I have a meandering education and experience and I’m working with folks who have spent their lives working in this specific environment and are really good at it. This means that I have to &lt;em&gt;remain humble&lt;/em&gt; and &lt;em&gt;listen&lt;/em&gt; to others when they talk because they probably know things that I don’t.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Not the hardest worker  in the room&lt;/strong&gt;: I have a family, I have responsibilities outside of the office, I cannot afford to be dedicated to the work more than 8 or 9 hours a day (even that is a stretch in a normal week). I am older than most of my colleagues, my energy levels are not what they used to be, and I choose to dedicate the best part of that energy to my personal life. I also see this as a job, not a calling. I am here to get work done by being professional, not by being passionate. &lt;em&gt;That is ok&lt;/em&gt;, but it means that I need to have &lt;em&gt;habits&lt;/em&gt; and set &lt;em&gt;clear expectations&lt;/em&gt; both for my colleagues as well as myself and my family.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Not a rock star&lt;/strong&gt;: I don’t crave recognition. I don’t need people to call out my name, invite me to guest blogposts or podcasts, conference appearances. I don’t need an audience. I do not want to be the face of anything, because I know that it’s all been done before. I don’t want to be bamboozled by hype… get real and show me the data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Nothing new to say&lt;/strong&gt;: I don’t have original opinions in this environment. Perhaps I never did! but I’m certain that whatever I have to say has already been said before by those more eloquent and experienced than me. I have seen a bit, done a bit and learned a bit, but after all, we’re talking about computers here. It’s not quantum mechanics. My role is not to come up with edgy hot takes, it is to learn and improve those around me. There is more than one way to lead; “Linkedinfluencer” is not my style.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;who-am-i&quot;&gt;Who am I&lt;/h3&gt;

&lt;p&gt;So, am I just the opposite of the things I am not?
To an extent, I am defining myself in opposition to things which I have learned that I am not, but there are also things which I &lt;em&gt;am&lt;/em&gt;, not just things that I am because I am not their opposite.&lt;/p&gt;

&lt;p&gt;I am:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Conscious and appreciative of counter-narratives&lt;/strong&gt;: I get sceptical when I hear “just-so” stories, broad generalisations and aphorisms at work. “This technology will revolutionise…”, “we have to do things this way because…”, “the data shows…”. These are all narratives, stories that people tell to support an existing point of view. These are really useful, they help us to get on the same page, to get the point across – heck, I do this all the time myself! – but they are just stories. They are not the truth. The truth is there is always a counter-narrative and if we’re not aware of it, we risk talking ourselves into a corner.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Capable of living in other people’s skins&lt;/strong&gt;: I’ve been around, I’ve seen and worked in places that my colleagues barely know the names of. I know that people think and work in different ways, and that while these may not be original, they are indeed diverse. I became aware that the world is a racist, classist place when I first stepped out of high school, and almost all my experience since then has been a clash between principles of equity and the messed up way the world seems to be organised. Working in Africa and working in Europe are wildly different experiences, with people having wildly different biases, points of view and priorities. Some of these people may have violently opposed world views to yours, but &lt;strong&gt;they are still people&lt;/strong&gt;. Whatever the official line, you don’t have “workers” at work, you have &lt;em&gt;people&lt;/em&gt; and those people are frikkin &lt;em&gt;weird&lt;/em&gt;… and I am frikkin here for it…&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Multifaceted experience&lt;/strong&gt;: I am trained as a scientist and at heart I still reason about the world as a scientist. However, I speak four languages, I have lived in several cities, on several continents. I’ve been poor, I’ve written code, managed projects, I’ve been responsible for liaising with high school principals and foreign ambassadors, I’ve had folks die on me and seen my babies born. I’ve had good managers, I’ve had no managers and I’ve had terrible managers. I’ve been told “whatever it costs, just get it done”, and I’ve been told “there’s no budget, just get it done”. I haven’t really found a way to write this on a CV, but it’s really the biggest thing I’ve got going for me, from an employer’s point of view. Throw me in the deep end fam, I’ve seen it before, I’m good to go.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;I care about quality&lt;/strong&gt;: Quality makes the difference. I am not, by nature, a person who cares about the nitty-gritty details. My scientific training (and perhaps the reason I was so drawn to it), taught me to look for patterns, for general laws, to see the big picture. But in the real world of people, this doesn’t make the difference, it’s quality that makes the difference. Being prepared, anticipating pain, building things that are free of defects, designing for elegant function, rather than appearance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;back-to-service&quot;&gt;Back to service&lt;/h2&gt;

&lt;p&gt;Coming back full circle, my purpose has always been found through others; it’s time to accept and embrace that again.
I have spent the last 2 years more or less focussed on myself at work, to the detriment of the community out there I had found myself a part of.&lt;/p&gt;

&lt;p&gt;I am taking a turn away from myself and back towards serving others, this time as engineering manager at the company I work at.
In my mind, this is an opportunity to “do it right” – to put my own personal convictions and ideas to the test in a challenging way – but also a way to rediscover purpose through empowering others and improving their lives.
I want to bring a rigorous approach to designing processes in our company such that they actually make our lives better first: less miscommunication, less toil and unrewarding work, fewer meetings, more context when it’s needed, fewer surprises, putting the right tools and information in the right hands, at the right time.
I want to design for elegance, for happiness, for efficiency.&lt;/p&gt;

&lt;p&gt;It’s time to channel my inner Deming!&lt;/p&gt;

&lt;p&gt;My own experience has been that a good manager can mean the difference between happiness and despair, between fulfillment and frustration, the reason people stay and the reason people leave.
Good managers come in different forms. Some of them are memorable for your personal relationship with them, others are able to change the system so that it’s better by design.
I don’t know what effect I will eventually have in this new role, but I intend to bring those things that I am, and those things that I am not, to this new role.&lt;/p&gt;

&lt;p&gt;I don’t want to be the superhero that everyone sends their troubles to and that then magically makes those troubles disappear – I am not that guy!
I don’t want to make things better for others by making my own life a pit of stress…
I want to solve problems by removing them from the system.&lt;/p&gt;

&lt;p&gt;I want to see people shine. I want to shrug off the lethargy and I want to feel us all picking up that spring in our stride again, to have fun at work, to build things that last.&lt;/p&gt;

&lt;p&gt;If this sounds like a nice place to work, hook me up.
If the place you work already sounds like this, good for you and your team, and hey, let’s compare notes.&lt;/p&gt;

&lt;p&gt;Have a good 2023 y’all,&lt;/p&gt;

&lt;p&gt;The Dude has spoken.&lt;/p&gt;
</description>
        <pubDate>Wed, 04 Jan 2023 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2023/01/service/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2023/01/service/</guid>
        
        <category>blog</category>
        
        <category>career</category>
        
        <category>management</category>
        
        
        <category>personal</category>
        
      </item>
    
      <item>
        <title>Issuing host certificates from Vault with Ansible</title>
        <description>&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#the-vault-ca&quot;&gt;The Vault CA&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#issueing-certificates&quot;&gt;Issuing certificates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#deciding-on-whether-to-issue-a-certificate&quot;&gt;Deciding on whether to issue a certificate&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#declaratively-determining-issue_cert&quot;&gt;Declaratively determining &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;issue_cert&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#further-enhancement&quot;&gt;Further enhancement&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#footnotes-and-references&quot;&gt;Footnotes and References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;I recently released a new &lt;a href=&quot;https://github.com/brucellino/ansible-role-base-platform-pi/releases/tag/v1.0.0&quot;&gt;Ansible role&lt;/a&gt; which is responsible for providing a base layer of configuration for machines in a cluster.&lt;/p&gt;

&lt;p&gt;This layer is responsible for preparing the machine to host other services, but without depending explicitly on them and as such, it is a somewhat &lt;em&gt;abstract&lt;/em&gt; component.
It can be applied to any machine by a controller which has access to the
In particular, this release deals with the issuing of an X.509 certificate to a machine in order to allow it to communicate securely with infrastructure services such as the service mesh or orchestration services&lt;sup id=&quot;fnref:ConsulNomad&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ConsulNomad&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-vault-ca&quot;&gt;The Vault CA&lt;/h2&gt;

&lt;p&gt;The certificate chain we are provisioning is managed by &lt;a href=&quot;https://vaultproject.io&quot;&gt;Vault&lt;/a&gt;&lt;sup id=&quot;fnref:VaultBlog&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:VaultBlog&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
We have configured our Vault instance with an Intermediate CA, following the &lt;a href=&quot;https://developer.hashicorp.com/vault/tutorials/secrets-management/pki-engine&quot;&gt;Vault “Build your own CA” tutorial&lt;/a&gt;.
I had previously encoded this as a &lt;a href=&quot;https://registry.terraform.io/modules/brucellino/ca/vault/1.1.0&quot;&gt;Terraform module&lt;/a&gt;, more as an exercise than anything else.&lt;/p&gt;

&lt;p&gt;The real state is currently defined in &lt;a href=&quot;https://github.com/brucellino/vaultatho.me/blob/main/hashiatho.me-pki.tf&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;github.com/brucellino/vaultatho.me&lt;/code&gt;&lt;/a&gt;. Most importantly, the Vault PKI secret backend role resource is defined there:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# H@H Intermediate CA role&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vault_pki_secret_backend_role&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;hah_int_role&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;backend&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vault_mount&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;hah_pki_int&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;path&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;hah_int_role&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;key_usage&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;DigitalSignature&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;KeyEncipherment&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;KeyAgreement&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;allowed_domains&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;*.service.consul&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;*.node.consul&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;*.node.dc1.consul&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;*.hashiatho.me&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;*.station&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

  &lt;span class=&quot;nx&quot;&gt;allow_bare_domains&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;allow_subdomains&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;allow_glob_domains&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;allow_ip_sans&lt;/span&gt;      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Calls to this endpoint with a valid token would result in the issuing of a certificate to the caller.&lt;/p&gt;

&lt;h2 id=&quot;issueing-certificates&quot;&gt;Issuing certificates&lt;/h2&gt;

&lt;p&gt;Before describing exactly how we manage to deliver the certificate to the host, let’s consider some details.
We have a situation where a &lt;em&gt;controller&lt;/em&gt; is applying a configuration to an entity in our inventory.
The controller has access to a secret which allows it to authenticate to Vault; in particular, its token permits calls that issue new certificates, &lt;em&gt;i.e.&lt;/em&gt; calls to the &lt;a href=&quot;https://developer.hashicorp.com/vault/api-docs/secret/pki&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/pki/issue/:name&lt;/code&gt; endpoint&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our controller is actually a machine running an Ansible playbook.
I had the choice of one of two Ansible modules:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html#ansible-collections-ansible-builtin-uri-module&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ansible.builtin.uri&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.ansible.com/ansible/latest/collections/community/hashi_vault/vault_pki_generate_certificate_module.html#ansible-collections-community-hashi-vault-vault-pki-generate-certificate-module&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;community.hashi_vault.vault_pki_generate_certificate&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the first case, we would have to call the API endpoint directly, passing the token in a header and the request parameters in the body, whilst in the second, we could call the wrapper module with a similar set of parameters.&lt;/p&gt;

&lt;p&gt;In this case, I opted for the latter, with something like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Issue certificate to host&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;when&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;(issue_cert | bool)&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Issue cert from Vault&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;community.hashi_vault.vault_pki_generate_certificate&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# noqa syntax-check&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;role_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;hah_int_role&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;common_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_fqdn&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.node.consul&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;engine_mount_point&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;pki_hah_int&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;lookup(&apos;env&apos;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;VAULT_ADDR&apos;)&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;token&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;lookup(&apos;env&apos;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;VAULT_TOKEN&apos;)&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;alt_names&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_hostname&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.node.consul&quot;&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_hostname&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.hashiatho.me&quot;&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cert_data&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
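
&lt;p&gt;For comparison, the first option could look roughly like the sketch below. It is not taken from the role and has not been tested against this setup: it assumes the same mount point (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pki_hah_int&lt;/code&gt;) and role name as the task above, and that Vault’s HTTP API is reachable under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/v1&lt;/code&gt; at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VAULT_ADDR&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;# Sketch only: direct call to the PKI issue endpoint (assumed path layout)
- name: Issue cert from Vault via the HTTP API
  when: (issue_cert | bool)
  ansible.builtin.uri:
    url: &quot;{{ lookup(&apos;env&apos;, &apos;VAULT_ADDR&apos;) }}/v1/pki_hah_int/issue/hah_int_role&quot;
    method: POST
    headers:
      # The token travels in a header, the request parameters in the JSON body
      X-Vault-Token: &quot;{{ lookup(&apos;env&apos;, &apos;VAULT_TOKEN&apos;) }}&quot;
    body_format: json
    body:
      common_name: &quot;{{ ansible_fqdn }}.node.consul&quot;
      # The HTTP API takes alt_names as one comma-separated string
      alt_names: &quot;{{ ansible_hostname }}.node.consul,{{ ansible_hostname }}.hashiatho.me&quot;
    status_code: 200
  register: cert_data&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With this variant the parsed response would sit under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cert_data.json&lt;/code&gt;, a different shape from what the wrapper module returns, which is one more reason to prefer the wrapper.&lt;/p&gt;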

&lt;h2 id=&quot;deciding-on-whether-to-issue-a-certificate&quot;&gt;Deciding on whether to issue a certificate&lt;/h2&gt;

&lt;p&gt;The astute reader will note that the task is conditional: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;when: (issue_cert | bool)&lt;/code&gt;. This is a check on a boolean fact which we use to determine whether we should issue a certificate (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt;) or not (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;How do we know when to set this value to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt;?
Surely we should issue a new certificate &lt;em&gt;when a new certificate is required&lt;/em&gt;.
Decomposing this statement, a new certificate is required when:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;There is no private key present&lt;/li&gt;
  &lt;li&gt;The public certificate is invalid
    &lt;ol&gt;
      &lt;li&gt;issued to incorrect host or hostname changed&lt;/li&gt;
      &lt;li&gt;corrupted file&lt;/li&gt;
      &lt;li&gt;expired&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In principle, if the private key is present, but the certificate is not valid for some reason, we could look up the public key and CA data in Vault, as long as we knew the serial number of the certificate, but simply revoking the cert and issuing a new one is far simpler.&lt;/p&gt;
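
&lt;p&gt;The “issued to incorrect host” and “expired” conditions from the list above could be checked along these lines. This is a sketch rather than the role’s actual logic: it assumes the certificate lives at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/tls/hashi@home/certificate.pem&lt;/code&gt;, that the expected common name is the one requested from Vault earlier, and it relies on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expired&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subject&lt;/code&gt; return values of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;community.crypto.x509_certificate_info&lt;/code&gt;. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;existing_cert&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expected_cn&lt;/code&gt; are illustrative.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;# Sketch only: variable names other than issue_cert are hypothetical
- name: Read the existing certificate
  community.crypto.x509_certificate_info:
    path: /etc/tls/hashi@home/certificate.pem
  register: existing_cert

- name: Reissue if the cert has expired or names a different host
  vars:
    # Assumed to match the common_name requested from Vault
    expected_cn: &quot;{{ ansible_fqdn }}.node.consul&quot;
  ansible.builtin.set_fact:
    issue_cert: &quot;{{ existing_cert.expired or (existing_cert.subject.commonName != expected_cn) }}&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A corrupted file would make the first task fail outright, so that case needs separate error handling instead of a comparison.&lt;/p&gt;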

&lt;h3 id=&quot;declaratively-determining-issue_cert&quot;&gt;Declaratively determining &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;issue_cert&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;There is a lot of logic in the process of determining whether to issue the certificate anew.
The first few runs of the playbook which tests this role resulted in hundreds of new certificates, one for each run, since the Vault PKI endpoint isn’t a stateful service.&lt;/p&gt;

&lt;p&gt;On the other hand I could have implemented the logic in a big script which did all the computing… but this didn’t feel like the right thing to do and would nevertheless result in a lot of extra work.&lt;/p&gt;

&lt;p&gt;In the end, I settled on an approach using a few invocations of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stat&lt;/code&gt; module and setting facts on the fly:&lt;/p&gt;

&lt;p&gt;First, we check all of the certificate files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Stat cert files&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ansible.builtin.stat&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
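    &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/etc/tls/hashi@home/{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.pem&quot;&lt;/span&gt;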
  &lt;span class=&quot;na&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;stat&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;certificate&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;private_key&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;issuing_ca&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We get back a large dictionary which we can query to see when the stat on files shows that they are not present:&lt;/p&gt;
&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Set issue_cert&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;delegate_to&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;localhost&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ansible.builtin.set_fact&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;issue_cert&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{false&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;(stat.results&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;community.general.json_query(&apos;[*].stat.exists&apos;))&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
&lt;p&gt;If the certificate files are all present, we can enter the decision branch where we decide whether to issue the certificate based on the validity of the existing certificate.
We run the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x509_certificate_info&lt;/code&gt; module and get back the cert info.
If it’s valid, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;issue_cert&lt;/code&gt; is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;.
If not, not only is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;issue_cert&lt;/code&gt; set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt;, but we also remove the corrupt or invalid files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Check Certs&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;when&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; not in (stat.results | community.general.json_query(&apos;[*].stat.exists&apos;))&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# If this fails -- either if the cert is not present or if it is not a valid cert&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Then the rescue is invoked&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Get cert facts&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;community.crypto.x509_certificate_info&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/etc/tls/hashi@home/certificate.pem&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cert_info&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Set expired fact&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ansible.builtin.set_fact&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
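        &lt;span class=&quot;na&quot;&gt;issue_cert&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;cert_info.expired&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;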
  &lt;span class=&quot;na&quot;&gt;rescue&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Remove Corrupt Cert&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ansible.builtin.file&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/etc/tls/hashi@home/{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.pem&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;absent&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;certificate&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;private_key&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;issuing_ca&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Set issue_cert fact&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ansible.builtin.set_fact&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;issue_cert&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;true&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After all of that, we finally have a good value for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;issue_cert&lt;/code&gt;, which is used, as shown above, to determine whether or not to issue a new certificate and deliver it to the host:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Issue certificate to host&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;when&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;(issue_cert | bool)&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Issue cert from Vault&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;community.hashi_vault.vault_pki_generate_certificate&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# noqa syntax-check&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;role_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;hah_int_role&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;common_name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_fqdn&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.node.consul&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;engine_mount_point&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;pki_hah_int&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;lookup(&apos;env&apos;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;VAULT_ADDR&apos;)&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;token&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;lookup(&apos;env&apos;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;VAULT_TOKEN&apos;)&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;alt_names&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_hostname&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.node.consul&quot;&lt;/span&gt;
          &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ansible_hostname&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.hashiatho.me&quot;&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cert_data&lt;/span&gt;

    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deliver certs&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ansible.builtin.copy&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/etc/tls/hashi@home/{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}.pem&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;cert_data.data.data[item]&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0644&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;owner&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;root&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;root&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;certificate&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;issuing_ca&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;private_key&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;further-enhancement&quot;&gt;Further enhancement&lt;/h2&gt;

&lt;p&gt;This approach can safely and reliably issue an X.509 credential chain to a host from the Vault CA via the controller.
It takes care of checking whether a certificate needs to be issued before actually calling the CA for a new cert, but there are a few aspects that can be improved in later versions.&lt;/p&gt;

&lt;p&gt;First of all, we can call the &lt;a href=&quot;https://developer.hashicorp.com/vault/api-docs/secret/pki#tidy&quot;&gt;PKI endpoint to tidy the certificate store&lt;/a&gt; when a certificate is determined to be invalid.
Even more correctly, we should probably &lt;a href=&quot;https://developer.hashicorp.com/vault/api-docs/secret/pki#revoke-certificate&quot;&gt;&lt;em&gt;revoke&lt;/em&gt;&lt;/a&gt; the old certificate whenever we issue a new one, so that the CRLs are updated.
This can probably be implemented quite easily as an Ansible handler, but the cert serial number is required in order to call the revocation function.
This seems like a good case to use the Consul KV store (or indeed the Vault KV store) to keep host names (keys) and cert serial numbers (values) easily retrievable, rather than having to do a filter on a big list of serial numbers returned by &lt;a href=&quot;https://developer.hashicorp.com/vault/api-docs/secret/pki#list-certificates&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/pki/certs&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the full role, see &lt;a href=&quot;https://github.com/brucellino/ansible-role-base-platform-pi&quot;&gt;@brucellino/ansible-role-base-platform-pi&lt;/a&gt;, and open an issue if you would like to discuss!&lt;/p&gt;

&lt;h2 id=&quot;footnotes-and-references&quot;&gt;Footnotes and References&lt;/h2&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:ConsulNomad&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In my case, these are Consul and Nomad. &lt;a href=&quot;#fnref:ConsulNomad&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:VaultBlog&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See the &lt;a href=&quot;https://www.hashicorp.com/blog/certificate-management-with-vault&quot;&gt;Vault blog&lt;/a&gt; &lt;a href=&quot;#fnref:VaultBlog&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 23 Oct 2022 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2022/10/ansible-vault-certs/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2022/10/ansible-vault-certs/</guid>
        
        <category>Ansible</category>
        
        <category>ContinuousDelivery</category>
        
        <category>Vault</category>
        
        
        <category>Practice</category>
        
      </item>
    
      <item>
        <title>Consul in Digital Ocean (part II)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#35-hashifinity-stones&quot;&gt;3/5 Hashifinity Stones&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#what-are-we-building&quot;&gt;What are we building&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#axes-of-competence&quot;&gt;Axes of competence&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#architecture&quot;&gt;Architecture&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#design-goals&quot;&gt;Design goals&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#implementation&quot;&gt;Implementation&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#tooling&quot;&gt;Tooling&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#speedbumps&quot;&gt;Speedbumps&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#bind-advertise-client-addresses&quot;&gt;Bind, advertise, client addresses&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#cluster-joining&quot;&gt;Cluster joining&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#key-generation&quot;&gt;Key generation&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#data-persistence-and-zero-downtime-rolling-changes&quot;&gt;Data persistence and zero-downtime rolling changes&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#add-lifecycle-to-server-droplet-resources-too&quot;&gt;Add lifecycle to server droplet resources too&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#cloud-config-template&quot;&gt;Cloud Config template&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#cloud-config&quot;&gt;cloud-config&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#remove-existing-entries-that-point-to-localhost&quot;&gt;Remove existing entries that point to localhost&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#get-consul&quot;&gt;Get Consul&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#enable-the-consul-service&quot;&gt;Enable the consul service&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#final-considerations&quot;&gt;Final considerations&lt;/a&gt;
&lt;a href=&quot;#footnotes&quot;&gt;Footnotes&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;a href=&quot;https://www.digitalocean.com/?refcode=ed3b69c0eec6&amp;amp;utm_campaign=Referral_Invite&amp;amp;utm_medium=Referral_Program&amp;amp;utm_source=badge&quot;&gt;&lt;img src=&quot;https://web-platforms.sfo2.digitaloceanspaces.com/WWW/Badge%202.svg&quot; alt=&quot;DigitalOcean Referral Badge&quot; /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: center; text-size: small&quot;&gt;Use this link if you want to get $5 free Digital Ocean credits. That&apos;s about how much it cost me to build this module.
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How far is a production-ready Terraform module from the tutorial?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In my estimation, and calibrated to my skill level, it’s about 3-5 days of work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read further for how I came to this estimation, and what it means.&lt;/p&gt;

&lt;h2 id=&quot;35-hashifinity-stones&quot;&gt;3/5 Hashifinity Stones&lt;/h2&gt;

&lt;p&gt;Over the course of about 3 days of work, I developed a few Terraform modules for Digital Ocean with the goal of deploying a full Hashi environment – Consul, Vault and Nomad.
Only Boundary and Waypoint would be needed to complete the full collection of Hashicorp “infinity stones”!&lt;/p&gt;

&lt;p&gt;More to the point, these are products which I use personally and professionally, and which I often propose as technical solutions to customers where I work, so I really need to know how practical and effective my knowledge of them is&lt;sup id=&quot;fnref:certification&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:certification&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;My goal with this post is to talk seriously and objectively about how much work is involved in making a production-ready Terraform module, and what I’ve learned so far making one for Consul on Digital Ocean.
Now, I’m not a newbie to Terraform or Consul, and Digital Ocean is probably the world’s simplest well-known cloud, so the threshold to getting something serious done is pretty low.&lt;/p&gt;

&lt;p&gt;That being said though, there are a few things which tripped me up on the way, which the cool kids are calling “learnings” these days (ugh, gross).&lt;/p&gt;

&lt;h2 id=&quot;what-are-we-building&quot;&gt;What are we building&lt;/h2&gt;

&lt;p&gt;The first lesson is one in design.
It’s rare that I get the chance to build something from scratch, so I don’t have much experience when it comes to designing the architecture of something, even something as simple as this.
That is not to say that it is particularly &lt;em&gt;difficult&lt;/em&gt;, just that it’s an unexercised muscle.&lt;/p&gt;

&lt;h3 id=&quot;axes-of-competence&quot;&gt;Axes of competence&lt;/h3&gt;

&lt;p&gt;Before starting out, I wanted to have an honest self-assessment of how hard this &lt;em&gt;should&lt;/em&gt; be, according to my competence.&lt;/p&gt;

&lt;div id=&quot;chart&quot;&gt;&lt;/div&gt;

&lt;p&gt;Although I feel very confident in the tooling, and the cloud provider is simple enough to easily know it well, I do not yet feel confident in my skills as an architect.
In this case, the architecture is provided by the Consul reference architecture, so I really just need to design the implementation, which I feel a bit more comfortable in.&lt;/p&gt;

&lt;h3 id=&quot;architecture&quot;&gt;Architecture&lt;/h3&gt;

&lt;p&gt;I took the &lt;a href=&quot;https://learn.hashicorp.com/tutorials/consul/reference-architecture&quot;&gt;Consul Reference Architecture&lt;/a&gt; as the starting point for creating a module which would deploy something similar to it in Digital Ocean.
As in the &lt;a href=&quot;https://learn.hashicorp.com/tutorials/consul/deployment-guide?in=consul/production-deploy&quot;&gt;datacenter deployment guide&lt;/a&gt;, I started with a single availability zone, AMS3.&lt;/p&gt;

&lt;p&gt;Following the tutorial, the idea would be to create a set of droplets for the servers, a set of droplets for the agents, and a load balancer to front the servers&lt;sup id=&quot;fnref:lb-dns&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lb-dns&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
These would all be included in a VPC to allow the agents and servers to communicate with each other over a local network.
Cloud firewalls would then protect the instances by allowing only traffic on the &lt;a href=&quot;https://www.consul.io/docs/install/ports&quot;&gt;Consul ports&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/consul-digitaloceandrawio.png&quot; /&gt;
  &lt;div class=&quot;figcaption&quot;&gt;Schematic diagram of a Consul reference architecture in Digital Ocean.&lt;/div&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;design-goals&quot;&gt;Design goals&lt;/h3&gt;

&lt;p&gt;The goals for this architecture are, in no particular order:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Concise&lt;/strong&gt;: As little code should be written and as few tools invoked as necessary, no more.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Complete&lt;/strong&gt;: The architecture should be &lt;em&gt;fully&lt;/em&gt; described by the modules.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Zero-touch deploy&lt;/strong&gt;: This should not require human intervention. This goal is similar to the concise goal, but emphasises automation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Zero-trust&lt;/strong&gt;: Sensitive data should be retrieved by authorized actors, rather than allowing access to it implicitly.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Minimal knowledge&lt;/strong&gt;: No knowledge about the deploy environment should be assumed, in order to make it re-usable in different situations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Immutable and idempotent&lt;/strong&gt;: The state should be declared as code, and only change when changes to the code are made.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From an operator’s point of view, we should have the experience of setting a few variables, and running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the perfect world, &lt;em&gt;no other tasks&lt;/em&gt; would be necessary to have a full Consul cluster deployed, agents, servers and all, all fully joined.&lt;/p&gt;

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;A common approach here follows the 12-factor methodology: start with a base image, add the base layer of the application that we are deploying (Consul in this case), and then inject configuration via environment variables into that derived image at runtime.
That derived image should contain no specific configuration for the application until it is instantiated in a deployed environment.
This can include the service advertise address, as well as sensitive data such as TLS certificates and the gossip encryption key.&lt;/p&gt;

&lt;p&gt;In this case, we will forgo that step and use only basic OS distribution images, using &lt;a href=&quot;https://cloudinit.readthedocs.io/en/latest/index.html&quot;&gt;Cloud Init&lt;/a&gt;.
This removes a large amount of tooling (Packer and Ansible), at the expense of writing a single fairly large YAML template, with all of the limitations of cloud init (more on this later).&lt;/p&gt;

&lt;p&gt;The implementation consists of two Terraform modules:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Digital Ocean VPC&lt;/li&gt;
  &lt;li&gt;Digital Ocean Consul cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus it could be created as follows:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vpc&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;github.com/brucellino/terraform-module-digitalocean-vpc&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc_name&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;                   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;github.com/brucellino/terraform-digitalocean-consul&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;vpc&lt;/span&gt;                      &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;depends_on&lt;/span&gt;               &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vpc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;project_name&lt;/span&gt;             &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;servers&lt;/span&gt;                  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;servers&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;ssh_inbound_source_cidrs&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;inbound_ssh_cidrs&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here, of course, we have declared a few variables in the definition to allow us to set the number of servers, create firewall rules for access from specific locations (typically a Tailscale network), &lt;em&gt;etc&lt;/em&gt;.
Note that there is a dependency on the VPC by the Consul module – as we describe below, the Consul module looks up existing VPC resources to deploy into.
The architecture &lt;em&gt;declares&lt;/em&gt; an explicit dependency between these two modules, but there is not an &lt;em&gt;implicit&lt;/em&gt; dependency between the resources in the VPC module and those in the Consul module.&lt;/p&gt;

&lt;p&gt;In principle, the VPC could be created in a separate way up front, but since we are starting from scratch, we add it to our state and explicitly declare the dependency.&lt;/p&gt;

&lt;h3 id=&quot;tooling&quot;&gt;Tooling&lt;/h3&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/consul-do-tooling.png&quot; /&gt;
  &lt;div class=&quot;figcaption&quot;&gt;Schematic diagram of tooling to create the Consul cluster in Digital Ocean.&lt;/div&gt;
&lt;/figure&gt;

&lt;p&gt;The tooling includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Terraform&lt;/strong&gt;. Specific providers are:
    &lt;ul&gt;
      &lt;li&gt;digital ocean&lt;/li&gt;
      &lt;li&gt;http&lt;/li&gt;
      &lt;li&gt;random&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Consul&lt;/strong&gt; (optional): Used as the backend for the Terraform state.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Vault&lt;/strong&gt;: Key-value store for Digital Ocean tokens.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cloud init&lt;/strong&gt;: creating the derived image, doing the installation and configuration of Consul on nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some observations on this tooling:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Although the addition of cloud-init seems to break the design goal of immutability, since it alters the state of the image, it does allow us to avoid the use of Terraform provisioners and other tooling. The image itself is immutable &lt;em&gt;after&lt;/em&gt; cloud init has run, which should only happen at first boot, so I think this is acceptable.&lt;/li&gt;
  &lt;li&gt;The choice of storing the state in a Consul backend is predetermined. I used Consul because I have access to one in &lt;a href=&quot;https://hashiatho.me&quot;&gt;Hashi@Home&lt;/a&gt;, but an S3-compatible object store could have been used for the state instead, for example a DigitalOcean Space. That would be both cost-effective and simple, but it has to exist before we can run a plan, just like the Consul backend.&lt;/li&gt;
  &lt;li&gt;Vault is a dependency of this architecture since I am considering it the default means to access sensitive data. We are accessing encrypted key-value data, so this could be provided by a different provider in principle. However, Vault also allows us to &lt;strong&gt;entirely remove&lt;/strong&gt; any secrets from the codebase, since even authentication to the backends is done using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; lookup of secrets in Vault in Terraform itself.&lt;/li&gt;
&lt;/ol&gt;
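
&lt;p&gt;As a rough sketch of the third observation, the Digital Ocean token can be read from Vault with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; lookup and handed straight to the provider. The mount path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean/tokens&lt;/code&gt; and the key &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform&lt;/code&gt; are placeholders assumed for illustration, not necessarily the names used in the module, and the Vault provider is assumed to pick up its address and token from the environment:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical mount path and key name, for illustration only.
data &quot;vault_generic_secret&quot; &quot;do&quot; {
  path = &quot;digitalocean/tokens&quot;
}

# The token never appears in the codebase or in variable files.
provider &quot;digitalocean&quot; {
  token = data.vault_generic_secret.do.data[&quot;terraform&quot;]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that the value still ends up in the Terraform state, so the state backend needs to be protected accordingly.&lt;/p&gt;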

&lt;h3 id=&quot;speedbumps&quot;&gt;Speedbumps&lt;/h3&gt;

&lt;p&gt;Satisfied with the tooling, I set about implementing the architecture with Terraform.
Most of the objects in the state would of course be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean_xyz&lt;/code&gt; resources, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean_droplet&lt;/code&gt; for the agent and server cluster, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean_firewall&lt;/code&gt; rules for managing access, &lt;em&gt;etc&lt;/em&gt;.
However, we also have a dependency on some external data.
Aside from the tokens used to authenticate to the cloud services, we would need to create an SSH authorized key using a lookup from a user’s GitHub keys URL and an existing VPC to deploy into.&lt;/p&gt;
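
&lt;p&gt;A minimal sketch of these two lookups could look like the following. The variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;var.username&lt;/code&gt; and the resource names are assumptions made for this example, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;response_body&lt;/code&gt; attribute assumes version 3 or later of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http&lt;/code&gt; provider:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# GitHub publishes a user&apos;s public keys, one per line, at this URL.
data &quot;http&quot; &quot;ssh_key&quot; {
  url = &quot;https://github.com/${var.username}.keys&quot;
}

# Register only the first key returned (an assumption for this sketch).
resource &quot;digitalocean_ssh_key&quot; &quot;consul&quot; {
  name       = &quot;consul-${var.username}&quot;
  public_key = split(&quot;\n&quot;, trimspace(data.http.ssh_key.response_body))[0]
}

# Look up the existing VPC by name instead of creating it here.
data &quot;digitalocean_vpc&quot; &quot;selected&quot; {
  name = var.vpc
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;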

&lt;p&gt;So far so good!&lt;/p&gt;

&lt;p&gt;That is, until we start enforcing the goal of “zero-touch” deploy.
It soon became apparent that I would have to find some creative ways of configuring Consul on the fly without expanding the toolkit.
A few of the aspects I needed to address to ensure zero-touch are described below.
I call them “speedbumps”, since they slowed me down a bit and made me think about what I was doing.&lt;/p&gt;

&lt;h4 id=&quot;bind-advertise-client-addresses&quot;&gt;Bind, advertise, client addresses&lt;/h4&gt;

&lt;p&gt;Since we start from scratch, the IP addresses are not known up front.
We would need to know the private IP of each agent, so that we can tell servers to advertise on an address at which the other members of the cluster can reach them.
Of course, if one explicitly declares the network configuration, assigning IP addresses to cluster nodes, these values can be passed to the Consul configuration, but this approach does not make much sense in a dynamic environment where the cloud does the work for you.&lt;/p&gt;

&lt;p&gt;Without knowledge of the network configuration, we need to use a &lt;a href=&quot;https://www.consul.io/docs/agent/config/cli-flags#_client&quot;&gt;template&lt;/a&gt; to look up the interface address.&lt;/p&gt;
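
&lt;p&gt;As an illustration, Consul accepts go-sockaddr templates for its address settings, so a configuration fragment along these lines resolves the address at agent startup. The interface name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eth1&lt;/code&gt; is an assumption about where the droplet image attaches the VPC network, and should be checked against the image actually in use:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# consul.hcl fragment: the templates are evaluated by Consul, not by Terraform.
# &quot;eth1&quot; is assumed to be the private (VPC) interface.
bind_addr      = &quot;{{ GetInterfaceIP \&quot;eth1\&quot; }}&quot;
advertise_addr = &quot;{{ GetInterfaceIP \&quot;eth1\&quot; }}&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If this fragment is itself rendered from a Terraform or cloud-init template, the double braces have to survive that rendering step unchanged.&lt;/p&gt;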

&lt;h4 id=&quot;cluster-joining&quot;&gt;Cluster joining&lt;/h4&gt;

&lt;p&gt;Not only do the agents need to know details about themselves (which IP/port to bind to and advertise on), but also details about &lt;em&gt;other&lt;/em&gt; agents, in order to join them to the cluster.
Since we want zero-touch deployments, we can’t have a multi-stage deployment where we first create the agents, then discover facts about them, and then manually join them to each other – we need a mechanism whereby agents can &lt;em&gt;discover&lt;/em&gt; each other.&lt;/p&gt;

&lt;p&gt;In order to do this, Consul provides a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;retry_join&lt;/code&gt; mechanism, and specifically for cloud deployments is able to interrogate the cloud provider to ask it for information about the agents.
In our case, we tag the droplets with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul-server&lt;/code&gt; and then configure the &lt;a href=&quot;https://www.consul.io/docs/install/cloud-auto-join#digital-ocean&quot;&gt;Cloud Auto Join for Digital Ocean&lt;/a&gt;.
This requires a token to authorise calls to the Digital Ocean API, which is in turn kept in a separate Vault KV store&lt;sup id=&quot;fnref:DOSecretsMount&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:DOSecretsMount&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.
This is then injected into the Consul configuration as part of the user-data template, &lt;a href=&quot;#cloud-config-template&quot;&gt;shown below&lt;/a&gt;.&lt;/p&gt;
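
&lt;p&gt;A rough sketch of how the pieces fit together on the Terraform side is given here. The Vault path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;digitalocean/tokens&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;region&lt;/code&gt; variable are hypothetical placeholders; only the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autojoin_token&lt;/code&gt; key and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul-server&lt;/code&gt; tag come from this post:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Hypothetical input: the Digital Ocean region slug of the cluster
variable &quot;region&quot; {
  type = string
}

# Read the API token used for cloud auto-join from Vault.
# The path is a placeholder; use your own mount and secret name.
data &quot;vault_generic_secret&quot; &quot;join_token&quot; {
  path = &quot;digitalocean/tokens&quot;
}

locals {
  # The string Consul expects in retry_join for Digital Ocean auto-join
  retry_join = &quot;provider=digitalocean region=${var.region} tag_name=consul-server api_token=${data.vault_generic_secret.join_token.data[&quot;autojoin_token&quot;]}&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that the token ends up in the rendered user data and in the Terraform state, so both should be treated as sensitive.&lt;/p&gt;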

&lt;h4 id=&quot;key-generation&quot;&gt;Key generation&lt;/h4&gt;

&lt;p&gt;One of the first speedbumps was how to create the relevant Consul configuration, in particular the shared encryption key that allows the servers to join the cluster.&lt;/p&gt;

&lt;p&gt;The gossip encryption key is a central value that needs to be injected into the Consul configuration either via the configuration file, or as a &lt;a href=&quot;https://www.consul.io/docs/agent/config/cli-flags#_encrypt&quot;&gt;command line parameter&lt;/a&gt;.
In principle, this does not represent a problem, since we can pass this value into either the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul.hcl&lt;/code&gt; configuration file, or the startup flag via the systemd unit, both of which are provided by the cloud-init template.
The problem, however, arises when we ask:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“who knows this value?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the past, I have kept the Consul encryption key as a value in one of the key-value stores that Terraform reads from Vault.
However, starting with a fresh deploy, we need to &lt;em&gt;generate&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;This looks like a &lt;em&gt;Catch-22&lt;/em&gt; at first glance, because we can’t generate an encryption key without Consul, but we can’t deploy Consul without an encryption key.
What is more, if it’s generated on one Consul server, how is it then shared with the others?
I spent a few hours thinking about this, tempted to revert to the &lt;em&gt;“Deus ex Machina”&lt;/em&gt; approach of pre-registering a key in a Vault KV store… until I read &lt;a href=&quot;https://www.consul.io/docs/security/encryption#gossip-encryption&quot;&gt;the documentation&lt;/a&gt; again:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The key must be 32-bytes, Base64 encoded.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly what the &lt;a href=&quot;https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/id&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random_id&lt;/code&gt;&lt;/a&gt; resource in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random&lt;/code&gt; provider does!
I added a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random_id&lt;/code&gt; resource:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;random_id&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;key&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;byte_length&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;and then passed it into the user data template:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;digitalocean_droplet&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;server&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;user_data&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;templatefile&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;${path.module}/templates/userdata.tftpl&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;consul_version&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;1.12.3&quot;&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;server&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;username&lt;/span&gt;       &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;username&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;datacenter&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;datacenter&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;servers&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;servers&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;ssh_pub_key&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;ssh_key&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;body&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;tag&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul-server&quot;&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;region&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;digitalocean_vpc&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;selected&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;region&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;join_token&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vault_generic_secret&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;join_token&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;autojoin_token&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;encrypt&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random_id&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;b64_std&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;domain&lt;/span&gt;         &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;digitalocean_domain&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;cluster&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;project&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;digitalocean_project&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;count&lt;/span&gt;          &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;index&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h4 id=&quot;data-persistence-and-zero-downtime-rolling-changes&quot;&gt;Data persistence and zero-downtime rolling changes&lt;/h4&gt;

&lt;p&gt;The Consul cluster is co-ordinated by a set of agents acting as servers.
As we have mentioned before, &lt;a href=&quot;#design-goals&quot;&gt;one of the design goals&lt;/a&gt; was to have immutable images, so that when changes are required, we create an entirely new image and replace the old one with the new one.
During these changes, we do not want the applications and other parts of infrastructure which depend on Consul for service discovery, &lt;em&gt;etc.&lt;/em&gt;, to be impacted.
This means that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The cluster should not lose quorum during changes&lt;/li&gt;
  &lt;li&gt;The cluster state should be persisted across changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Data in the cluster state is stored in a shared database replicated with Raft.
When making changes, we can ensure that there are always servers available by using the &lt;a href=&quot;https://www.terraform.io/language/meta-arguments/lifecycle&quot;&gt;Terraform resource lifecycle meta argument&lt;/a&gt; for the droplets:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;lifecycle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;create_before_destroy&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This will ensure that droplets belonging to the previous state are only deleted once the new droplets have been created.
&lt;strong&gt;However&lt;/strong&gt;, this refers to the droplet state, not to the Consul state!
Since cloud config takes a few minutes to complete, but Terraform will destroy old droplets as soon as the API reports that the new droplet is “ready”, this will put our cluster into an outage for a few minutes while the new droplets come up.&lt;/p&gt;

&lt;p&gt;What is more, we will lose the Raft data on the old droplets if it is not propagated somehow to the new ones.&lt;/p&gt;

&lt;p&gt;There are &lt;a href=&quot;https://github.com/hashicorp/terraform/blob/v1.2.0/CHANGELOG.md&quot;&gt;new features in Terraform 1.2.0&lt;/a&gt; for handling &lt;a href=&quot;https://www.terraform.io/language/expressions/custom-conditions#preconditions-and-postconditions&quot;&gt;pre- and post-conditions&lt;/a&gt;.
One could imagine a post-condition that guarantees that the new Consul servers are up by polling the health check URL, such as:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;lifecycle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;postcondition&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;201&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;204&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;status_code&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;error_message&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Consul is not healthy&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In this case, we need a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; block for the HTTP check on the load balancer, but that data won’t exist until the first deployment is ready.
This sounds again like a Catch-22, but yet again reading the documentation carefully seems to show a way out:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In most cases, we do not recommend including both a data block and a resource block that both represent the same object in the same configuration. Doing so can prevent Terraform from understanding that the data block result can be affected by changes in the resource block. However, when you need to check a result of a resource block that the resource itself does not directly export, you can use a data block to check that object safely as long as you place the check as a direct postcondition of the data block. This tells Terraform that the data block is serving as a check of an object defined elsewhere, allowing Terraform to perform actions in the correct order.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, if a data block is to serve specifically as a postcondition check, it should only succeed if the Consul service returns healthy.&lt;/p&gt;

&lt;p&gt;Creating these checks and adding them to the resources as follows&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;http&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul_health&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;http://&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;digitalocean_loadbalancer&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;external&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;/v1/health/service/consul&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;lifecycle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Add lifecycle to server droplet resources too&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;postcondition&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;201&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;204&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;status_code&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;error_message&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Consul service is not healthy&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;managed to generate a plan&lt;sup id=&quot;fnref:lb-droplets&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:lb-droplets&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, as &lt;a href=&quot;#service-start&quot;&gt;described below&lt;/a&gt;, the Consul service doesn’t start in time for the check to pass during the Terraform apply stage:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;│ Error: Resource postcondition failed
│
│   on ../../../modules/terraform-digitalocean-consul/main.tf line 37, &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;data &lt;span class=&quot;s2&quot;&gt;&quot;http&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;consul_health&quot;&lt;/span&gt;:
│   37:       condition     &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; contains&lt;span class=&quot;o&quot;&gt;([&lt;/span&gt;201, 200, 204], self.status_code&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
│     ├────────────────
│     │ self.status_code is 503
│
│ Consul service is not healthy&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;So, there needs to be some stabilisation time before the check is conducted.
Perhaps I could do this with a &lt;a href=&quot;https://www.terraform.io/language/resources/provisioners/remote-exec&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remote-exec&lt;/code&gt; provisioner&lt;/a&gt;.&lt;/p&gt;
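
&lt;p&gt;Another option, which I have not tested here, would be to put an explicit delay between the droplets and the health check. The sketch below assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashicorp/time&lt;/code&gt; provider and that the droplets are created with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;; the resource name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul_stabilise&lt;/code&gt; and the 120 second duration are arbitrary choices, not measured values:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# Wait a fixed time after the droplets are created or replaced.
# The duration is a guess; tune it to how long cloud-init takes.
resource &quot;time_sleep&quot; &quot;consul_stabilise&quot; {
  depends_on      = [digitalocean_droplet.server]
  create_duration = &quot;120s&quot;

  # Re-run the sleep whenever the set of droplets changes
  triggers = {
    droplet_ids = join(&quot;,&quot;, digitalocean_droplet.server[*].id)
  }
}

# Only read the health endpoint once the delay has elapsed
data &quot;http&quot; &quot;consul_health&quot; {
  url        = &quot;http://${digitalocean_loadbalancer.external.ip}/v1/health/service/consul&quot;
  depends_on = [time_sleep.consul_stabilise]

  lifecycle {
    postcondition {
      condition     = contains([200, 201, 204], self.status_code)
      error_message = &quot;Consul service is not healthy&quot;
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A fixed sleep is a blunt instrument: it slows down every apply and still fails if the servers take longer than expected, so it only makes sense as a stop-gap.&lt;/p&gt;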

&lt;p&gt;Zero-touch requires that we run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform apply&lt;/code&gt; and then go to the beach.
This means that the droplets need to be configured &lt;em&gt;and the Consul service needs to be running&lt;/em&gt; with agents and servers all auto-joined to the cluster without manual intervention.&lt;/p&gt;

&lt;p&gt;The Consul process itself is started by a systemd unit. This is created and enabled by Cloud Init via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write_files&lt;/code&gt;, which also creates the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/consul.d/consul.hcl&lt;/code&gt; configuration file.
However, since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-init&lt;/code&gt; itself runs as a systemd unit, we need to ensure that the Consul service is started only after cloud init has finished.
What is more, it also needs to be started only after the cloud init user scripts in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runcmd&lt;/code&gt; have completed, otherwise the executable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul&lt;/code&gt; won’t be present yet.
I haven’t figured out exactly how to determine what systemd event should trigger the start of the Consul service, other than it should happen when cloud init ends, but &lt;a href=&quot;https://stackoverflow.com/a/68099751/2707870&quot;&gt;this answer&lt;/a&gt; gives some clues.&lt;/p&gt;

&lt;p&gt;In the meantime, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remote-exec&lt;/code&gt; provisioner runs a script to wait for the cloud-init process to finish before starting the Consul service:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;&lt;span class=&quot;nx&quot;&gt;connection&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ssh&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;user&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;root&quot;&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;ipv4_address&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;provisioner&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;remote-exec&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;echo waiting for consul to be present&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;while [ ! -f /usr/local/bin/consul ] ; do sleep 10 ; done&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;while ! systemctl list-unit-files | grep -q consul ; do echo waiting for consul systemd unit ; sleep 5 ; done&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;service consul start&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;echo waiting for servers to stabilise&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;sleep 20&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Cloud auto-join takes care of the rest.&lt;/p&gt;
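
&lt;p&gt;As a possible simplification, recent versions of cloud-init ship a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cloud-init status --wait&lt;/code&gt; command that blocks until all stages, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runcmd&lt;/code&gt;, have finished. The sketch below assumes that command is available on the image and reuses the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;connection&lt;/code&gt; block as above; I have not verified it against this module:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;provisioner &quot;remote-exec&quot; {
  inline = [
    # Block until cloud-init has completed every stage
    &quot;cloud-init status --wait&quot;,
    # The binary and the unit file now exist, so the service can start
    &quot;systemctl start consul&quot;,
  ]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This replaces the polling loops with a single blocking call, at the cost of depending on the cloud-init version baked into the image.&lt;/p&gt;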

&lt;p&gt;A final consideration on this point was where and how to persist the Raft data.
Currently, the data is written to a Digital Ocean volume mounted into the droplet, which is then attached to the next server in the group. However, the volume can only be attached to one droplet at a time, meaning that if we want to perform a rolling update, we need to &lt;em&gt;first&lt;/em&gt; destroy the droplet, detach the volume, then create the new droplet and attach the volume.
But in order to maintain quorum this needs to be done &lt;em&gt;one at a time&lt;/em&gt;.&lt;/p&gt;
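
&lt;p&gt;For illustration, the volume wiring could look roughly like the sketch below. The volume name follows the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul-data-${count}&lt;/code&gt; device path used in the cloud-config template; the resource name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;consul_data&lt;/code&gt; and the size are assumptions, and the module itself may wire this differently:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;# One volume per server, outliving the droplet it is attached to
resource &quot;digitalocean_volume&quot; &quot;consul_data&quot; {
  count                   = var.servers
  region                  = data.digitalocean_vpc.selected.region
  name                    = &quot;consul-data-${count.index}&quot;
  size                    = 10 # GiB, an arbitrary choice
  initial_filesystem_type = &quot;ext4&quot;
}

# Attach volume N to server N; a volume can only be attached to one droplet at a time
resource &quot;digitalocean_volume_attachment&quot; &quot;consul_data&quot; {
  count      = var.servers
  droplet_id = digitalocean_droplet.server[count.index].id
  volume_id  = digitalocean_volume.consul_data[count.index].id
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Keeping the volume and the attachment as separate resources means the Raft data survives when a droplet is replaced, but it does nothing by itself to enforce the one-at-a-time ordering.&lt;/p&gt;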

&lt;h3 id=&quot;cloud-config-template&quot;&gt;Cloud Config template&lt;/h3&gt;

&lt;p&gt;Most of the heavy lifting is done in the cloud-config template.
For completeness, I show it below.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-yaml&quot; data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;c1&quot;&gt;#cloud-config&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;manage_etc_hosts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;false&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;manage_resolv_conf&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;mounts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;

&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;/dev/disk/by-id/scsi-0DO_Volume_consul-data-$&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;/consul&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ext4&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;discard,defaults,noatime&quot;&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${username}&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ssh-authorized-keys&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${ssh_pub_key}&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;shell&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/bin/bash&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;sudo&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ALL=(ALL)&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;NOPASSWD:ALL&apos;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;packages&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;curl&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;jq&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;net-tools&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;manage-resolv-conf&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;resolv_conf&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;nameservers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;${dns_recursor_ip}&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;searchdomains&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${domain}&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${project}&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;domain&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;${domain}&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;write_files&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;

&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/etc/consul.d/consul.hcl&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;encrypt = &quot;${encrypt}&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;%{if server }bootstrap_expect = ${servers}%{ endif }&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;datacenter = &quot;${datacenter}&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;%{if server }&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;auto_encrypt {&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;allow_tls = true&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;%{ endif }&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;data_dir = &quot;/consul/&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;log_level = &quot;INFO&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;ui_config {&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;enabled =  true&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;server = ${server}&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;client_addr = &quot;0.0.0.0&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;recursors = [&quot;${recursor_ip}&quot;]&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;bind_addr = &quot;0.0.0.0&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;advertise_addr = &quot;{{ GetInterfaceIP \&quot;eth1\&quot; }}&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;retry_join = [&quot;provider=digitalocean region=${region} tag_name=${tag} api_token=${join_token}&quot;]&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/usr/lib/systemd/system/consul.service&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;[Unit]&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Description=&quot;HashiCorp Consul - A service mesh solution&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Documentation=https://www.consul.io/&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Requires=network-online.target&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Requires=cloud-init.target&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;ConditionFileNotEmpty=/etc/consul.d/consul.hcl&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;ConditionFileNotEmpty=/usr/local/bin/consul&lt;/span&gt;

      &lt;span class=&quot;s&quot;&gt;[Service]&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Type=notify&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;User=root&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Group=root&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;ExecStart=/usr/local/bin/consul agent \&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;-auto-reload-config \&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;-config-dir=/etc/consul.d/&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;ExecReload=/bin/kill --signal HUP $MAINPID&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;KillMode=process&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;KillSignal=SIGTERM&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;Restart=on-failure&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;LimitNOFILE=65536&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;StandardOutput=append:/var/log/consul.log&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;AmbientCapabilities=CAP_NET_BIND_SERVICE&lt;/span&gt;

      &lt;span class=&quot;s&quot;&gt;[Install]&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;WantedBy=multi-user.target&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;runcmd&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Remove existing entries that point to localhost&lt;/span&gt;

&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;sed -i -r &apos;s/^.*consul-.*$//&apos; /etc/hosts&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Get Consul&lt;/span&gt;

&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;curl -fL https://releases.hashicorp.com/consul/${consul_version}/consul_${consul_version}_linux_amd64.zip \&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;| gunzip - &amp;gt; /usr/local/bin/consul&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;chmod a+x  /usr/local/bin/consul&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;consul -version&lt;/span&gt;
&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;systemctl daemon-reload&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Enable the consul service&lt;/span&gt;

&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;systemctl enable consul&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;final-considerations&quot;&gt;Final considerations&lt;/h2&gt;

&lt;p&gt;It took about a week of work to create a Terraform module for Consul on Digital Ocean.
At the time of writing, there doesn’t seem to be one in the public registry, so at least I seem to be making a tiny dent in the universe.&lt;/p&gt;

&lt;p&gt;Overall, I find the simplicity of Digital Ocean to be a net benefit.
It provides just enough in terms of services to treat as IaaS, while being &lt;em&gt;very&lt;/em&gt; cost effective.
What does that mean? I spent about USD 5 on resources to develop this module, over about a week of terraforming Digital Ocean.
&lt;strong&gt;It takes under four minutes&lt;/strong&gt; to deploy this and costs &lt;strong&gt;less than USD 5 a month to run&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I should point out that a lot of the speedbumps here were due to the fact that the cloud provider doesn’t have a very sophisticated set of services.
Server stabilisation and rolling updates are much easier with &lt;a href=&quot;https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html&quot;&gt;AWS Auto Scaling groups&lt;/a&gt;, for example.
However, I expect this to be the case for many on-premises private cloud platforms like VMware, or for bare-metal providers, so it is a very useful exercise to implement the lifecycle logic in the module itself.&lt;/p&gt;

&lt;p&gt;Finally, I took the new pre- and postconditions for a test drive.
At first they can seem somewhat counter-intuitive, but once you do things the “Hashi” way, you can see how they work well.
This is, in my opinion, a very welcome addition to the Terraform language, obviating the need for extra tooling.&lt;/p&gt;
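
&lt;p&gt;As a rough illustration of what these conditions look like, here is a minimal sketch. It is not the code from the module: the &lt;code&gt;servers&lt;/code&gt; variable, the droplet name, and the image, region and size slugs are all placeholders chosen for the example, and it assumes Terraform 1.2 or later, where &lt;code&gt;precondition&lt;/code&gt; and &lt;code&gt;postcondition&lt;/code&gt; blocks became available inside &lt;code&gt;lifecycle&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-hcl&quot; data-lang=&quot;hcl&quot;&gt;terraform {
  # Custom conditions need Terraform 1.2 or later.
  required_version = &quot;&amp;gt;= 1.2.0&quot;
  required_providers {
    digitalocean = {
      source = &quot;digitalocean/digitalocean&quot;
    }
  }
}

# Hypothetical input: how many Consul servers to create.
variable &quot;servers&quot; {
  type    = number
  default = 3
}

# Placeholder droplet definition, for illustration only.
resource &quot;digitalocean_droplet&quot; &quot;server&quot; {
  count  = var.servers
  name   = &quot;consul-server-${count.index}&quot;
  image  = &quot;ubuntu-22-04-x64&quot;
  region = &quot;ams3&quot;
  size   = &quot;s-1vcpu-1gb&quot;

  lifecycle {
    # Checks an assumption about the inputs before the droplet is created.
    precondition {
      condition     = var.servers % 2 == 1
      error_message = &quot;Use an odd number of Consul servers so that the cluster can keep quorum.&quot;
    }

    # Checks a guarantee about the result; self refers to this droplet instance.
    postcondition {
      condition     = self.status == &quot;active&quot;
      error_message = &quot;The droplet was created but did not reach the active state.&quot;
    }
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With an even value for &lt;code&gt;servers&lt;/code&gt;, the plan should fail with the precondition message before anything is created. The postcondition can only be evaluated once the droplet exists, so a failure there shows up during the apply.&lt;/p&gt;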

&lt;p&gt;The next steps will be to add the Consul ACL token and TLS certificates to this module.&lt;/p&gt;

&lt;p&gt;The code accompanying this post is available at:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/brucellino/terraform-digitalocean-consul&quot;&gt;Terraform module&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/brucellino/terraform-hashi-cluster-digitalocean&quot;&gt;Terraform statement&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned for the next iteration of Hashi on Digital Ocean – deploying a Vault cluster.&lt;/p&gt;

&lt;p&gt;If you’re keen on running this exercise yourself, please use my referral link to get some free credits on Digital Ocean.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;a href=&quot;https://www.digitalocean.com/?refcode=ed3b69c0eec6&amp;amp;utm_campaign=Referral_Invite&amp;amp;utm_medium=Referral_Program&amp;amp;utm_source=badge&quot;&gt;&lt;img src=&quot;https://web-platforms.sfo2.digitaloceanspaces.com/WWW/Badge%202.svg&quot; alt=&quot;DigitalOcean Referral Badge&quot; /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:certification&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Some might say that I should just spend the time and money on a HashiCorp certification. Although I don’t disagree with this sentiment, I would also point out that many folks like me only learn how something really works when they break it. It’s the physicist in me… I feel far more comfortable saying that I am proficient in a tool when I have used it to solve a problem. &lt;a href=&quot;#fnref:certification&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:lb-dns&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The loadbalancer should probably be replaced with some good old DNS records. &lt;a href=&quot;#fnref:lb-dns&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:DOSecretsMount&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It would be really nice if Vault supported a Digital Ocean secrets mount, like the ones for &lt;a href=&quot;https://www.vaultproject.io/docs/secrets/aws&quot;&gt;AWS&lt;/a&gt;, &lt;a href=&quot;https://www.vaultproject.io/docs/secrets/azure&quot;&gt;Azure&lt;/a&gt;, &lt;a href=&quot;https://www.vaultproject.io/docs/secrets/alicloud&quot;&gt;Alibaba&lt;/a&gt;, &lt;em&gt;etc&lt;/em&gt;. &lt;a href=&quot;#fnref:DOSecretsMount&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:lb-droplets&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I originally added the droplets by ID to the load balancer, creating an explicit relationship between them. This broke the DAG, so I had to configure the LB to add droplets by tag instead. &lt;a href=&quot;#fnref:lb-droplets&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 01 Aug 2022 00:00:00 +0000</pubDate>
        <link>https://www.brucellino.dev/2022/08/consul-digitalocean/</link>
        <guid isPermaLink="true">https://www.brucellino.dev/2022/08/consul-digitalocean/</guid>
        
        <category>Consul</category>
        
        <category>Cloud</category>
        
        <category>Terraform</category>
        
        <category>Digital Ocean</category>
        
        <category>Continuous Delivery</category>
        
        
        
      </item>
    
  </channel>
</rss>
