Home DevOps: Ansible for the Win!

I have a Raspberry Pi that I use for various things.  I’m a big fan of these little boxes, but they can be temperamental.  You end up fiddling around to get things installed, and sometimes even a simple package update will leave the box a dead parrot.  A couple of weeks back, I was just running a regular update and my Pi died a horrible death.  The upside with a Pi is that you just re-image the disk and you’re back in business: the disks are small, the process simple.  However, if you’ve customized things, all of that is gone.

I decided to rebuild my Pi, after re-imaging it, using Ansible.  Ansible is straightforward and easy to get started with; I’ve used it on and off over time and am proficient with it.  In under an hour I had rebuilt my Pi, with all my customizations, from an Ansible playbook.  It didn’t take much more effort than doing it by hand really, and I did feel like maybe I’d gone a bit far using Ansible.  Until the next morning, that is.  I’d forgotten a few security measures, my Pi is accessible from the internet, and in less than half a day someone or some bot had gotten in and taken over.  Sigh.  Now the whole Ansible decision seemed far wiser.  I enhanced my playbook with the security changes, re-imaged, reran, and the Pi was back and better in under twenty minutes.

Since that practical lesson, I’ve done everything on my Pi via Ansible and had no regrets.
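
For the curious, the playbook is nothing exotic.  Here’s a minimal sketch of the kind of thing I mean (the host group, package list, and hardening task are illustrative placeholders, not my actual config):

# pi.yml
- hosts: raspberrypi
  become: true
  tasks:
    - name: Install the extras I always end up adding by hand
      apt:
        name: "{{ item }}"
        state: present
        update_cache: yes
      with_items:
        - vim
        - git
        - fail2ban

    - name: Disable ssh password logins (one of the measures I forgot the first time)
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
      notify: restart ssh

  handlers:
    - name: restart ssh
      service:
        name: ssh
        state: restarted

Re-running it is just:

$ ansible-playbook -i hosts pi.yml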

 


Fluentsee: Fluentd Log Parser

I wrote previously about using fluentd to collect logs as a quick solution until the “real” solution happened.  Well, like many “temporary” solutions, it settled in and took root.  I was happy with it, but grew progressively more tired of coming up with elaborate command pipelines to parse the logs.

Fluentsee

So in the best DevOps tradition, rather than solve the initial strategic problem, I came up with another layer of paint to slap on as a tactical fix, and fluentsee was born.  Fluentsee is written in Java, and lets you filter the logs and print entries in different output formats:

$ java -jar fluentsee-1.0.jar --help
Option (* = required)          Description
---------------------          -----------
--help                         Get command line help.
* --log <String: filename>     Log file to use.
--match <String: field=regex>   Define a match for filtering output. May pass in
                                 multiple matches.
--tail                         Tail the log.
--verbose                      Print verbose format entries.

So, for example, to see all the log entries from the nginx container that contain a POST, you would run:

$ java -jar fluentsee-1.0.jar --log /fluentd/data.log \
--match 'json.container_name=.*nginx.*' --match 'json.log=.*POST.*'

The matching uses Java regexes.  The parsing isn’t wildly efficient, but it generally keeps up.

Grab it on GitHub

There’s a functional version now on GitHub, and you can expect enhancements as I continue to ignore the original problem and focus on the tactical patch.

Collecting Docker Logs With Fluentd

I’m working on a project involving a bunch of services running in docker containers.  We are working on the design and implementation of our full-blown log gathering and analysis solution, but what was I to do till then?  Having to bounce around to all the hosts and look at the logs there was getting tiresome, but I didn’t want to expend much energy on a stopgap measure either.

Enter Fluentd

Docker offers support for various logging drivers, so I ran down the list and gave each choice about ten minutes of attention, and sure enough, one choice only needed ten minutes to get up and running – fluentd.

What it Took

  1. Pick a machine to host logs
  2. Run a docker image of fluentd on that host
  3. Add a couple of options to my docker invocations (see the sketch below).
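
Concretely, that looked something like the following (a sketch: the image is the stock fluent/fluentd one, the hostname is a placeholder, and the fluent.conf with a forward source and a file output is assumed rather than shown):

# On the log host: run fluentd, listening on the default forward port
docker run -d --name fluentd \
  -p 24224:24224 -p 24224:24224/udp \
  -v /fluentd/log:/fluentd/log \
  -v /fluentd/etc/fluent.conf:/fluentd/etc/fluent.conf \
  fluent/fluentd

# On the other hosts: the per-run form of the logging options
docker run -d \
  --log-driver=fluentd \
  --log-opt fluentd-address=loghost.example.com:24224 \
  --log-opt tag="docker.{{.Name}}" \
  nginx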

What That Got Me

With the above done, all my docker containers’ logs were aggregated on the designated host in an orderly format, with log rolling and so on.

But…

The orderly format in the aggregated log was well structured, but maybe not friendly.  Its format is:

TIMESTAMP HOST_ID JSON_BLOB

So an example might look like:

20170804T140852+0000 9c501a9baf61 {"container_id":"...","container_name":"...","source":"stdout","log":"..."}

Everything in its place but…

How To Deal

So with everything going into one file, and a mix of text and JSON, I settled on the following approach.  First I installed jq to help format the JSON.  Then I just employed tried-and-true command line tools.

For example, let’s say you just want to look at the log entries for an nginx container:

grep /nginx /fluentd/data.20170804.log | cut -c 35- | jq -C . | less -r

That’s all it takes!  Use grep to pull the lines with the container name, cut off the prefix to leave just the JSON, have jq format it, and view it.

Maybe you just want the log field, rather than the entire entry:

grep /nginx /fluentd/data.20170804.log | cut -c 35- | jq -C .log | less -r

Just have jq pull out the single field.

It’s Low Tech But…

For about ten minutes of setup work, and a little command line magic, I’ve got a good solution until the real answer arrives.

Tech Notes

There were a couple of specifics worth noting in the process.  First, there are at least two ways to direct docker to use a specific log driver.  One is via the command line on a run.  The other is to configure the docker daemon via its /etc/docker/daemon.json file.  The command line is more granular: you can pick and choose which containers log to which driver.  That’s flexible and nice, but unfortunately docker “compose” and “cloud” don’t support setting the driver for a container.  Setting it at the docker daemon level as a default solves the compose/cloud issue, but creates a circular dependency if you’re running fluentd in docker, because no container will start unless fluentd is running, and fluentd is in a container.  I went with setting it at the daemon level, and I made sure to run the fluentd container first thing, with a command line option indicating the traditional log driver.
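
In other words, the daemon-level default plus the carve-out for the fluentd container itself looked roughly like this (the address is a placeholder; the driver and option names are the standard docker ones).  In /etc/docker/daemon.json:

{
  "log-driver": "fluentd",
  "log-opts": {
    "fluentd-address": "loghost.example.com:24224"
  }
}

And the fluentd container started explicitly with the traditional driver, so it can come up before fluentd exists:

docker run -d --name fluentd --log-driver=json-file \
  -p 24224:24224 -p 24224:24224/udp \
  -v /fluentd/log:/fluentd/log \
  fluent/fluentd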

The second noteworthy point was that the fluentd container provides a data.log link that is supposed to always point to the newest log… for me it doesn’t.  I have to look in the log area and find the newest log myself, because data.log doesn’t update correctly through some log rotations.

Jenkinsfile: Infrastructure as Code (Chaos?)

Anything I do more than a couple of times, and look likely to do again, I script to automate… I guess I’m an advocate of PoC (Process as Code).  I use whatever DSL or tool best fits the situation, and so I’m often learning new ones.  Jenkins has long been my go-to for CI/CD; applied right, it can automate and order a lot of the work of building and deploying.  But Jenkins, starting as Hudson, had a very UI-biased architecture, and the evolution from then to now seems to have been largely organic.  I’ve depended on Jenkins, and been happy with the results of my work with it, but often the solutions felt a bit like a Rube Goldberg machine, cobbling together a series of partially working bits to get the job done.

Enter Pipeline/Jenkinsfile

Then along came the Pipeline plugin, which allowed for scripting things with the Jenkinsfile DSL.  I dove right in.  Pipelines allowed me to stop configuring complex jobs in the Jenkins UI and move all that out into Jenkinsfile scripts managed in my SCM.  Awesome!  Or mostly awesome.  Immediately I started hitting issues with the Jenkinsfile documentation and the pipeline plugins.  The DSL spec seemed to be a moving target, and the documentation for some of the pipeline plugins, like the AWS S3 upload/download, was sparse to nonexistent.  So, it was two steps forward and one back.  You could move all your configuration and process description out into code, and the code could reside in your SCM, but the DSL was an inconsistent, poorly documented patch job.

Enter Blue Ocean

Then the Blue Ocean UI revamp of Jenkins came out recently, and it’s all about pipelines and Jenkinsfiles.  There was documentation.  There was a plan.  They seemed to be wrangling the Jenkinsfile DSL ecosystem into sanity!

Maybe … Maybe Not

I don’t know if it’s Blue Ocean’s fault or not, but suddenly there were Declarative and Scripted (aka Advanced) variants of the pipeline DSL.  They share common roots, but they are not the same, and while the scripted variant is richer, it’s not a full-on superset.  Apparently I’d been working in the land of scripted, and Blue Ocean was all documented as declarative.  It took me a good four hours to figure this out and understand why my scripts were exploding all over the place.  Eventually I found the easter-egg documentation hidden away behind the unexplained “Advanced” links.  Then I spent about an hour figuring out the tricks I needed to get the new plugins working with my old scripted skills, and how to hand-roll the features I lost by not using declarative.
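
To make the split concrete, here’s the same trivial build written both ways (a sketch; the gradle step is just a placeholder build command).  Declarative, the dialect Blue Ocean documents:

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './gradlew build'
            }
        }
    }
    post {
        failure {
            echo 'Build failed'
        }
    }
}

Scripted (aka Advanced), where the post section has to be hand-rolled:

node {
    stage('Build') {
        try {
            sh './gradlew build'
        } catch (err) {
            echo 'Build failed'
            throw err
        }
    }
}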

So yeah… Blue Ocean is absolutely two, maybe three steps forward, but as with everything Jenkins, there was that one step backward too.  It’s clearly an improvement, but it leaves you feeling like you’re basing your progress on a crufty hack.

 

The Testing Stack, From My Java Bias

I’ve been at this 30 years, and at Java since its introduction, so some folks will feel my opinions are a bit old school.  I don’t think they are; I think they’re just a bit more thorough than some folks have the patience for.  Recently I butted heads with an “Agile Coach” on this, and they certainly felt I wasn’t with the program.  But I’ve been a practitioner of agile methods since before they were extreme, and again, I don’t think the issue is that I’m too old school; it’s just that I believe the best agile methods still benefit from some traditional before and after.

My View of The Testing Stack

People get obsessed with the titles of these, but I’m not interested in that debate; I’m just enumerating the phases in approximate chronological order:

  • Unit: TDD these please!
  • Scenario: Test as many scenarios as you can get your hands on. Unit testing should catch all the code paths, but a tool like Cucumber makes enumerating all the possible data sets in scenarios much easier (see the sketch after this list).
  • End-to-end/Integration:  Put the parts together in use case sequences, using as many real bits as possible.
  • Performance/Stress/Load:  Does your correct code do it fast enough and resiliently enough?  These can appear earlier, but they need to happen here too.
  • QA: Yup… I still feel there’s value in a separate QA phase; this is what got the agile coach red in the face… more to follow on this.
  • Monitoring/In Situ: Keep an eye on things in the wild. Keep testing and monitoring them.
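
For example, a Cucumber-style scenario outline sweeps a whole table of data sets through one behaviour.  A minimal sketch (the discount domain and the numbers are made up purely for illustration):

Feature: Discount tiers
  Scenario Outline: apply the right discount for an order total
    Given a cart totaling <total>
    When the order is priced
    Then the discount applied is <discount> percent

    Examples:
      | total | discount |
      | 10    | 0        |
      | 100   | 5        |
      | 1000  | 10       |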

So this is a lot of testing, and plenty of folks will argue that if you get the earlier steps right, the later ones are redundant.  I don’t agree, obviously.  I see value in each distinct step.

  • Unit: Errors cost more the further along you find them, so the earlier you test, the better. Get as good coverage at the unit level as you can tolerate.
  • Scenario: Using unit tests to exhaustively cover every possible scenario can be … exhausting, but it’s worth sweeping through mass combinations and permutations if you can.
  • E2E/Integration: Now you know the pieces work in isolation, you need to see that they work together. This will shake out problems with complex interactions and ensure that contracts are adhered to over time.
  • Performance: If it’s too slow or too fragile to use, it’s useless.
  • QA: Here is where a lot of folks say I’m being old school.  If you’ve gotten through the prior steps, isn’t this redundant?  No.  No matter how good your working relationship and communications with your product team and customers are, your development team always comes at their job with a technical bias. At least I hope your developers are technical enough to have a technical bias. At a bare minimum, having a separate pass of testing that is more closely aligned with the business perspective makes sure you avoid issues with the “any key” and the like.  But this pass can help you hone and improve product goal communications, and act as a feedback loop to improve the prior testing steps.  In a perfect world it would become redundant.
  • In Situ: Keep your eyes on the product in the wild and test it there too. You can reuse prior tools and scenarios here, but testing chaotic scenarios is invaluable. This is about watching the canary in the mine, and seeing if you’ve even been worried about the right canary.

From a cost perspective you always want to front-load your testing as much as possible, but if your goal is quality, the old adage “measure twice, cut once” should be amended to “measure as often as you can practically tolerate, cut once”.  Needless to say, automation is the key to it: tool everything, and make the testing happen all over the place without human intervention.

Rant: Docker, NAT, VPN, EC2 … Fail!

My goal for the day was to get consul infrastructure set up at work.  I’d done proof of concept (PoC) work on all of the various bits and it was just a matter of putting the parts together:

  • Everything from docker images
  • The consul server on EC2 instances (in a VPC)
  • The consul agents on a mix of client machines, some Linux, some OSX (sketched below)
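
The shape of what I was trying to stand up looked roughly like this (a sketch from memory using the official consul docker image; the IPs are placeholders, not my real addresses):

# Consul server, on an EC2 instance in the VPC
docker run -d --name consul-server --net=host consul agent \
  -server -bootstrap-expect=1 -ui \
  -bind=10.0.1.10 -client=0.0.0.0

# Consul agent, on a client machine, joining the server
docker run -d --name consul-agent --net=host consul agent \
  -bind=192.168.1.20 -retry-join=10.0.1.10

Simple enough on paper; the -bind and --net=host bits are exactly where the trouble below starts.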

I was working from home, and that seemed like it would help, but I didn’t realize that it would actually make the task basically impossible.

What Got Me

Consul is all about networking; that’s its game.  So as I tried to glue all the PoC bits together, here’s what got me:

  • Getting the EC2 security groups fixed up for consul’s too many ports – annoying but doable
  • Dealing with the fact that the networking on Docker for OSX isn’t quite right. It lacks some of the bridging features, and things like --net=host “work” but behave non-intuitively.
  • Working from home, I was on a NAT’d machine connected through a VPN, so my address wasn’t always my address and some traffic wouldn’t traffic.
  • The universe hates me.  OK, that’s hyperbolic whining, but by the day’s end I was sure it was so.

Basically combining all those gotchas together meant that:

  • Every example for consul in docker was from Linux, and there was a 50/50 chance it would fail mysteriously on OSX.
  • The errors I hit were often a lack of connectivity… and so you were left trying to diagnose silence… not a lot to go on there. Bueller? Bueller? Bueller?
  • There was so much “useful” information out there that I just kept trying… I mean, surely just one more suggestion would get it right?

Fail

Tomorrow I’ll be back on site, and that will eliminate the NATing and the VPN.  Perhaps with those two complexities removed I’ll make progress.  I could always run consul natively rather than in docker… but I really don’t want to admit defeat.