Jenkinsfile: Infrastructure as Code (Chaos?)

Anything I do more than a couple times, and it looks like I will again, I script to automate … I guess I’m an advocate of PoC (Process as Code). I use whatever DSL or tool best fits the situation and so I’m often learning new ones.  Jenkins has long been my goto for CI/CD, applied right it can automate and order a lot of the work of building and deploying.  But Jenkins, starting as Hudson, had a very UI biased architecture, and the evolution from then to now seems to have been largely organic.  I’ve depended on Jenkins, and been happy with the results of my work with it, but often the solutions felt a bit of a Rube Goldberg machine, cobbling together a series of partially working bits to get the job done.

Enter Pipeline/Jenkinsfile

Then along came the Pipeline plugin that allowed for scripting things with the Jenkinsfile DSL.  I dove right in. Pipelines allowed me to stop configuring complex jobs in Jenkins UI, and move all that out into Jenkinsfile scripts managed in my SCM.  Awesome! Or mostly awesome.  Immediately I started hitting issues with the Jenkinsfile documentation and the pipeline plugins.  The DSL spec seemed to be a moving target, and the documentation for some of the pipeline plugins, like the AWS S3 upload/download, were sparse to nonexistent.   So, it was two steps forward and one back.  You could move all your configuration and process description out into code, and the code could reside in your SCM, but the DSL was an inconsistent, poorly documented, patch job.

Enter Blue Ocean

So then the UI revamp of Jenkins from Blue Ocean came out recently, and it’s all about pipelines and Jenkinsfiles.  There was documentation. There was a plan. They seemed to be wrangling the Jenkinsfile DSL ecosystem into sanity!

Maybe … Maybe Not

I don’t know if it’s Blue Ocean’s fault or not, but suddenly there was Declarative and Scripted (aka Advanced) variants of the pipeline DSL. They share common roots, but they are not the same, and while the scripted is richer, it’s not a full on superset. Apparently I’d been working in the land of scripted and Blue Ocean was all documented as declarative. It only took me a good four hours to figure this out and understand why my scripts were exploding all over the place. Eventually I found the easter egg documentation hidden away behind the unexplained advanced links. Then I spent about an hour figuring out the tricks I needed to get the new plugins working with my old scripted skills, and how to hand roll the features that I lost when I didn’t use declaritive.

So yeah… Blue Ocean is absolutely two, maybe three steps forwards, but as per everything Jenkins, there was that one step backwards too. It’s clearly an improvement, but leaves you feeling like you’re basing your progress on a crufty hack.

 

The Testing Stack, From My Java Bias

I’ve been at this 30 years, and Java since it’s introduction, so some folks will feel my opinions are a bit old school. I don’t think they are, I think they’re just a bit more thorough then some folks have the patience for.  Recently I butted heads with an “Agile Coach” on this, and they certainly felt I wasn’t with the program. But I’ve been a practitioner of agile methods since before they were extreme and again, I don’t think the issue is that I’m too old school, it’s just I believe the best agile methods still benefit from some traditional before and after.

My View of The Testing Stack

People get obsessed with the titles of these, but I’m not interested in that debate, I’m just enumerating the phases in a approximate chronological order:

  • Unit: TDD these please!
  • Scenario: Test as many scenarios as you can get your hands on. Unit testing should catch all the code paths, but a tool like Cucumber makes enumerating all the possible data sets in scenarios much easier.
  • End-to-end/Integration:  Put the parts together in use case sequences, using as many real bits as possible.
  • Performance/Stress/Load:  Does your correct code do it fast enough and resiliently enough. These can appear earlier, but it needs to happen here too.
  • QA: Yup … I still feel there’s value in a separate QA phase this is what got the agile coach red in the face… more to follow on this.
  • Monitoring/In Situ: Keep an eye on stuff in the wild. Keep testing them and monitoring them.

So this is a lot of testing, and plenty of folks will argue that if you get the earlier steps right, the later ones are redundant.  I don’t agree obviously. I see a value to each distinct step.

  • Unit: Errors cost more the further along you find them so the earlier one tests the better. So get as good coverage at the unit level as you can tolerate.
  • Scenario: Using unit tests to exhaustively cover every possible scenario can be … exhausting, but it’s worth sweeping through mass combinations and permutations if you can.
  • E2E/Integration: Now you know the pieces work in isolation, you need to see that they work together. This will shake out problems with complex interactions and insure that contracts where adhered to over time.
  • Performance: If it’s too slow or, fragile to use, it’s useless.
  • QA: Here is were a lot of folks say I’m being old school.  If you’ve gotten through the prior steps isn’t this redundant?  No.  No matter how good your working relationship and communications with your product team and customers are, your development team always comes at their job with a technical bias. At least I hope your developers are technical enough to have a technical bias. At a bare minimum, having a separate pass of testing that is more closely aligned with the business perspectives makes sure that you avoid issues with the “any key” and the like.  But this pass can help you hone and improve product goal communications, and act as feedback loop to improve the prior testing steps.  In a perfect world it would become redundant.
  • In Situ: Keep your eyes on the product in the wild, test them there too, you can use prior tools and scenarios here, but testing chaotic scenarios is invaluable. This is about watching the canary in the mine, and seeing if you’ve even been worried about the right canary.

From a cost perspective you always want to front load your testing as much as possible, but if your goal is quality, the old adage “measure twice, cut once” should be amended to “measure as often as you can practically tolerate, cut once”.  Needless to say, automation is the key to it,  tool everything, make the testing happen all over the place without human intervention.

Rant: Docker, NAT, VPN, EC2 … Fail!

My goal for the day was to get consul infrastructure set up at work.  I’d done proof of concept (PoC) work on all of the various bits and it was just a matter of putting the parts together:

  • Everything from docker images
  • The consul server on EC2 instances (in a VPC)
  • The consul agents on a mix of client machines, some Linux, some OSX

I was working from home, and that seemed like it would help. but I didn’t realize that it would actually make the task basically impossible.

What Got Me

Consul is a all about networking, that’s is game.  So as I tried to glue all the PoC bits together  here’s what got me:

  • Getting the EC2 security groups fixed up for consul’s too many ports – annoying but doable
  • Dealing with the fact that the networking on Docker for OSX isn’t quite right. It lacks some of the bridging features and things like -net=host “work” behave non-intuitively
  • Working from home I was on a NAT’d machine connected through a VPN so my address wasn’t always my address and some traffic wouldn’t traffic.
  • The universe hates me.  Ok that’s hyperbolic whining, but by the days end I was sure it was so.

Basically combining all those gotchas together meant that:

  • Every example for consul in docker was from Linux and there was a 50/50 chance it would fail mysteriously on OSX.
  • The errors I hit were often lack of connectivity … and so you were left trying to diagnose silence … not a lot to go on there. Bueller? Bueller? Buller?
  • There was so much “useful” information out there that I just kept trying… I mean if I just tried one more suggestion that would get it right?

Fail

Tomorrow I’ll be back on site and that will eliminate the NATing and the VPN.  Perhaps with those two complexities removed I’ll make progress.   I could always run consul natively rather than in docker…. but I really don’t want to admit defeat.