DevOps Eye for the Analytics Guy

We may not be deploying cloud servers at the drop of a hat, but the analytics community can take a few lessons from our friends in the DevOps movement. DevOps, a blend of software development and IT operations, has grown up around the art of scripting and automating the process of setting up your infrastructure (servers, databases, etc.).

Here are some pointers that apply to our line of work:

1) “It’s Scripted” – Fully automated data extracts and database / dataset builds

The DevOps folks have turned server setup into a fully scripted process. With a tool such as Puppet or Chef, every element of configuring a new server can be automated and controlled from an administrative script. In a cloud environment, this takes disaster recovery and capacity scaling to a whole new level. Assuming the data center is still running (and has hardware capacity), losing an individual server becomes a trivial event.

For analytics, our equivalent is automating the process of extracting and assembling your masterfile (database or SAS dataset). This lets you replace a couple of hours of manual data-munging with a single button click – especially if you add some automated testing to the process – and it reduces the cost of updating your dataset.
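To make the idea concrete, here's a minimal Python sketch of a one-click masterfile build. The warehouse.db file, the sales / customers tables, and the column names are placeholders for illustration – substitute your own connection, queries, and joins.

```python
# Minimal sketch of a one-click masterfile build.
# "warehouse.db", the sales/customers tables, and the column names are
# placeholders -- swap in your own connection, queries, and derived fields.
import sqlite3
import pandas as pd

def build_masterfile(db_path="warehouse.db", out_path="masterfile.csv"):
    with sqlite3.connect(db_path) as conn:
        # Extract: pull the raw tables you would otherwise query by hand.
        sales = pd.read_sql("SELECT * FROM sales", conn)
        customers = pd.read_sql("SELECT * FROM customers", conn)

    # Assemble: the joins and derived columns you normally build manually.
    master = sales.merge(customers, on="customer_id", how="left")
    master["revenue"] = master["units"] * master["unit_price"]

    # Publish: one file, rebuilt the same way every time you click the button.
    master.to_csv(out_path, index=False)
    return master

if __name__ == "__main__":
    build_masterfile()
```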

The other big win is quality – manual processes have an inherently high error rate (I've seen errors affect as much as 10% of the required tasks during process monitoring). Automating the process significantly reduces your risk of careless mistakes. This can be pretty important on the tail end of an all-nighter…
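In the same spirit, here are a few example checks you could bolt onto the build sketched above – the columns and thresholds are purely illustrative, and should reflect whatever "known good" looks like for your data.

```python
# Illustrative post-build sanity checks; column names and thresholds are
# placeholders to be tuned to your own data.
def check_masterfile(master):
    assert len(master) > 0, "Masterfile is empty"
    assert master["customer_id"].notna().all(), "Missing customer IDs"
    # Softer issues get a warning rather than a hard stop.
    if master["revenue"].sum() <= 0:
        print("WARNING: total revenue is zero or negative -- check the extract")
```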

Some of this will likely mean moving away from Excel / Access – while useful for many types of ad-hoc analysis, these tools have inherent limitations. At a minimum, invest in learning VBA. There are also some emerging software packages that can help you migrate Excel projects to simple web applications.

2) Self-Configuring Code

Most good DevOps scripts are designed to work through trivial configuration issues on their own. If one element of a server build fails due to local conditions, there is usually some type of plan for working around the gap. This reduces the level of manual effort.

In our world, a key source of maintenance programming is minor changes in elements such as dates, file locations, and business activity (e.g. which products are being bought and by whom). This should be your next target – rig your date-sensitive calculations to update dynamically (without changes from you) by running a set of "pre-queries" that assess the state of your database, such as the date of the last update. Approach other scutwork changes with a similar mindset.
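Here's one way that pre-query idea might look, again against the illustrative SQLite warehouse from above (the load_date column is an assumption – use whatever your schema actually tracks):

```python
# Sketch of a "pre-query" that lets the extract window set itself.
# Assumes the illustrative warehouse above and a load_date column on sales.
import sqlite3
import pandas as pd

def incremental_extract(db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        # Pre-query: ask the database where the last load stopped.
        last_load = conn.execute("SELECT MAX(load_date) FROM sales").fetchone()[0]

        # Main query: parameterised on that date, so nothing is hard-coded.
        new_rows = pd.read_sql(
            "SELECT * FROM sales WHERE load_date > ?",
            conn,
            params=(last_load,),
        )
    return new_rows
```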

Naturally, you should balance this against the level of effort required and the potential risk that a missing element represents, both directly and as an indicator of deeper performance problems. Certain situations are worth having the machine do a full stop and burst into flames…
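For those "full stop" cases, a blunt guard clause is often enough – this snippet (with a hypothetical feed path) simply refuses to continue rather than papering over the gap:

```python
# When a gap really matters, stop loudly instead of patching around it.
import os

def require_feed(path):
    # "path" is whatever critical input your build cannot live without.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Critical feed missing: {path} -- aborting build")
```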

3) Revision Control – Source code managed from a repository with regular check-ins

Single-threaded version management is the Achilles heel of most analytics projects – one "master copy" and a couple of copies randomly dumped to a second directory. As the project develops, it becomes impossible to manage the 500 old copies of the code.

Revision control (think Git or Bazaar) is a system for tracking changes to a text file. You can periodically "commit" changes to your project as you extend and expand it, adding a comment (a commit message) that explains your progress. This works for SQL scripts, SAS files, and other code (C++, Python, R, etc.). If you're working with a large team, these tools can also help you distribute and integrate contributions from multiple developers.

Revision control gives you a way to understand how your project has changed since your last "working" version. This can be helpful when tracking bugs (roll back to the point where it last worked, then examine the differences from the current version) or dealing with feature changes. You can check in changes as often as you like – allowing you to work in small increments without worrying about space or clutter. Roll-backs are also a lifesaver when redesigning a key element of your code – you can discard your changes when, after 45 minutes, you realize the improvement isn't working.

4) Cron Jobs and Automated Monitoring

You can run a report every Monday morning. Or you can schedule the job and email the results and/or post them to an intranet site. Guess which one takes less work…

Automated monitoring is a twist on this – schedule a cron job to run a test on a regular basis (e.g. every day or every hour) and trigger an alert when certain criteria are met. This works both for quality control and for mobilizing your business folks to address a problem.
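As a rough illustration, here's the shape such a check might take – the staleness rule, mail addresses, and local SMTP relay are all assumptions to replace with your own:

```python
# Sketch of a monitoring check meant to be run from cron (e.g. daily).
# The database, "load_date" column, threshold, and mail settings are placeholders.
import smtplib
import sqlite3
from email.message import EmailMessage

def daily_check(db_path="warehouse.db", max_age_days=2):
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT julianday('now') - julianday(MAX(load_date)) FROM sales"
        ).fetchone()
    age = row[0]

    # Alert if the feed is stale or the table is empty.
    if age is None or age > max_age_days:
        msg = EmailMessage()
        msg["Subject"] = "ALERT: sales feed is stale"
        msg["From"] = "monitor@example.com"
        msg["To"] = "analytics-team@example.com"
        msg.set_content(f"Last sales load is {age} days old (threshold: {max_age_days}).")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

if __name__ == "__main__":
    daily_check()
```

A crontab entry such as "0 7 * * *" would then run it every morning at 7:00, before anyone opens the report.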

Taken as a group, these ideas move you out of report / analysis production and let you spend more time asking questions about your strategic effectiveness:

  • Are we finding the right issues?
  • Are they communicated in a timely fashion?
  • Is anyone listening / acting?

If you like this article, please share it!


