Escaping The Walled Garden of Enterprise Analytics: Using R and Python For Data Analysis

In which an experienced analytics guy advises the younger generation to leave the walled garden of enterprise analytics tools and learn how to write code using a real programming language. Specifically advocating the use of R and Python for data analysis and related programming. But hey, I’m flexible on that point…

The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offense.

– Dijkstra

I was taught a long time ago in some Management 101 course to sandwich constructive criticism between two compliments. So I’ll open with this statement:

SAS and the other BI vendors have done a nice job of bringing statistical computing techniques within the reach of the typical college graduate.

Now pull up a chair and grab yourself some popcorn, since I’m going to bite the hand that fed me for the first half of my career. I spent the first seven years of my career in roles involving significant usage of SAS and a variety of drag & drop query tools. The COBOL of the analytics world.

I spent the early part of my career hacking on credit card direct mail campaign data. SAS was the de facto standard in the banking industry for marketing and risk analytics. Many of us learned it in school; the balance of us learned it on the job. We were also given a drag & drop query/reporting tool (Business Objects) which allowed us to streamline and automate some of the smaller ad-hoc requests.

From a business perspective, this was pretty awesome. Most non-coding analysts could generally go from clueless to relatively useful in a month or two. I was able to free up about 80% of my time as a junior analyst by building some automated reports using SAS and Business Objects – which allowed me to pursue more ambitious projects.

But there’s a gotcha here… what happens when you leave?

You Can’t Take It With You…

Employment for life is a nice little concept but effectively gone in America. If you’re in your twenties, you should expect to work for at least half-a-dozen employers during your career, even if you are a relatively loyal sort. Contractors and adventurous types should expect to rack up much bigger numbers. At some point you will be sitting down figuring out how to build your next batch of analytics infrastructure. For many of you, this will likely involve hanging out your own shingle in analytics or a related discipline…

So here’s the first problem – you can expect to throw away a big chunk of your language-specific experience when you change jobs. Enterprise BI packages aren’t cheap or very easy to implement. Your next employer will likely have a different standard, and you’re probably going to have to relearn the implementation details of your statistical package and data-slinging utilities. The novelty of this wears off rather quickly…

The issue isn’t just money, by the way. A colleague of mine built the analytics program for the direct marketing arm of a big retailer. His first year was spent implementing MicroStrategy. He got a head start on things because the money was already approved. Securing funding could easily add a year or two to your expected timeline…

Want to go out on your own? I did some checking a couple of years ago when I had to set up my own analytics group at my current employer. A one-person base SAS license cost about ten grand, plus a couple thousand per year in ongoing fees. Even the simpler tools (Minitab, Crystal Ball) had commercial-use license costs of about $1,000–$2,000. So leveraging your enterprise analytics experience on your own adds an immediate $5K to $15K fixed expense to whatever number you consider “ramen profitable”. Ouch!

No problem, you say. Programming is Programming. I’ll just go learn a new language…

“So You Think You Can Code…”

My standard job interview for new analysts includes a couple of simple coding tests. Nothing too complicated – since most of our work involves using relational databases, the questions are SQL oriented. The core question is the SQL equivalent of fizz-buzz, generally followed by a discussion of data architecture or query performance drivers. Most decent developers with SQL experience can rip through this in about 3 minutes.
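To give a flavor of the genre (this is a made-up illustration, not the actual interview question), a fizz-buzz-level SQL exercise might be a simple join-and-aggregate like the one below, run through Python’s built-in sqlite3 module so you can try it yourself:

    # Hypothetical illustration only; not the actual interview question.
    # A fizz-buzz-level SQL exercise: total spend per customer, highest first.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 42.5);
    """)

    query = """
        SELECT c.name, SUM(o.amount) AS total_spend
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total_spend DESC;
    """
    for name, total in conn.execute(query):
        print(name, total)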

About 60% of our decent-looking candidates with no coding experience beyond SQL fail. After being warned on the phone screen and in the job description that a whiteboard SQL coding test is coming. And I don’t mean syntactical nit-picks and other forms of nerd sadism. I mean facepalm-level failure to generate anything close to executable SQL or to identify meaningful abstractions which would allow you to scale up your activities.

These are smart people. The root cause is simple. They’ve never had to do it before…

The most insidious thing about SAS and the drag and drop query tools is that you never really get to learn what is happening under the hood. SQL is generated auto-magically. There is no real path to the lower layers of the application (in the unlikely event your vendor includes an extension option, your local IT administrator probably shut it off). Welcome to the walled garden – you can do *this list of things* really well and absolutely nothing else is tolerated or supported by the system.

From a personal development perspective, it’s like over-feeding and declawing a house cat. They slowly lose the ability to hunt on their own and survive outside…

I discovered this when I returned to writing code a few years ago. Despite having about ten years of experience using the common BI / Analytics packages, I had to relearn a big chunk of what I thought I knew.

Life on The Other Side of The Window…

As the title implies, I’m recommending that analysts should learn how to use R and Python for data analysis. There’s actually some value in learning both.

  • R has a huge assortment of pre-packaged statistics programs which replicate many of the useful parts of the base SAS proc library.
  • I’ve found that Python does a better job of handling large datasets and building code to handle ugly parsing and unpacking tasks.
  • Learning Python also gives you an easy path to turn your analytics projects into real, shareable software applications. There are a huge number of web frameworks (my personal favorite is bottle.py; many others exist). There are also many options for packaging your work up into a GUI application. (A minimal bottle.py sketch follows this list.)
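As a rough illustration of how little ceremony is involved, here is a minimal bottle.py sketch that serves an analysis summary over HTTP. The route name and numbers are placeholders I made up; it assumes bottle is installed (pip install bottle):

    # A minimal sketch of sharing an analysis result with bottle.py.
    # The route name and summary figures below are placeholders, not real data.
    from bottle import route, run

    SUMMARY = {"campaign": "spring_mailer", "response_rate": 0.023}  # illustrative only

    @route('/summary')
    def summary():
        # bottle serializes a returned dict to JSON automatically
        return SUMMARY

    if __name__ == '__main__':
        run(host='localhost', port=8080)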

Another nice win? Both of them are free and can be installed in under 15 minutes, increasing your ability to use them throughout your career. Within a corporate role, you can usually find a specialized application or project type where using Python for data analysis will produce better results than your existing BI suite. It’s hard for a sane manager to object to BETTER + FREE. In the event you change jobs, you can be immediately productive using either tool (vs. spending several weeks/months learning a new BI package, assuming they actually bothered to do a decent job of collecting requirements from their analysts). And for your personal projects – let’s just say a $10,000 license is a show stopper in my household…

Furthermore, the more time you spend coding in the same language, the better you get. Python has an awesome standard library with tested modules to handle common tasks. NumPy/SciPy and CRAN offer an equally diverse array of pre-built code. Better yet, unlike the enterprise analytics world, you’ve actually got the ability to see and tweak the code inside the box. Using, reading, and discussing Python makes you a smarter coder by exposing you to a more advanced set of programming concepts. Second career development tip: find your local Python meetup and go to a meeting or two…
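As a small taste of that pre-built code, here’s a sketch of a two-sample t-test with SciPy – the kind of routine you’d otherwise reach for a SAS PROC to run. The response numbers are made up for illustration:

    # Sketch: a two-sample t-test using SciPy's pre-built stats routines.
    # The response numbers below are fabricated for illustration only.
    import numpy as np
    from scipy import stats

    control = np.array([2.1, 1.9, 2.4, 2.0, 2.2])
    treatment = np.array([2.6, 2.8, 2.4, 2.9, 2.7])

    t_stat, p_value = stats.ttest_ind(control, treatment)
    print("t = %.3f, p = %.4f" % (t_stat, p_value))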

Leaving Eden: Using R and Python for Data Analysis

You can take the first steps out of the walled garden at no cost but a few hours of your own time. There are many free online tutorials and books about using R and Python. Check out Project Euler (a collection of math problems you can solve with short programs) or the Python Challenge. Check out Python’s list comprehensions (one of my favorite Python features – I’ve done a presentation on them). Spend some time looking at the itertools module – it has several tools you can use to slice and dice data.
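Here’s a small sketch of both ideas at work on a few made-up records – a list comprehension for filtering and itertools.groupby for summarizing:

    # Sketch: slicing and dicing a few made-up records with a list
    # comprehension and itertools.groupby.
    from itertools import groupby
    from operator import itemgetter

    records = [
        {"region": "east", "sales": 120},
        {"region": "west", "sales": 95},
        {"region": "east", "sales": 80},
        {"region": "west", "sales": 140},
    ]

    # List comprehension: keep only the larger sales records
    big_sales = [r for r in records if r["sales"] >= 100]

    # groupby needs its input sorted by the grouping key
    records.sort(key=itemgetter("region"))
    totals = {region: sum(r["sales"] for r in group)
              for region, group in groupby(records, key=itemgetter("region"))}

    print(big_sales)
    print(totals)   # {'east': 200, 'west': 235}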

Once you get a grasp of how to use R and Python for data analysis, you should spend a little time exploring how databases work. Start by learning SQL, then go check out MongoDB, Hadoop, and neo4j. Learning basic web page development (HTML, CSS, JavaScript) is also helpful. Then spend time building something – anything – and sharing it with some end users…

Like a house cat that has finally snuck outside, there’s a whole new world out there…


7 comments

  1. Great post.

    If you want to marry R with the database, check out PostgreSQL with the pl/r extension. It allows you to feed data directly into R from the query, bypassing the more traditional Python glue. Once mastered, it will prove incredibly effective.

    R has a fairly steep learning curve, but once you’ve got it, it will produce dividends.

  2. I agree that using open source solutions to handle your data analytics problems can help you create solutions without the added overhead of learning your organization’s specific BI suite. That being said, I don’t think that learning Python and R are the best first steps in that process, especially for those with little programming experience. These are two very different languages which require a lot of practice to be proficient. Unfortunately, even with proficiency you still have the problem of usability for your coworkers, who might not have the same level of expertise and understanding with those tools, as well as the fundamental problem that any solution you design is most likely going to be an add-on to an ever-increasing number of add-ons, bloating your overall analytics system to unmaintainability. From my experience it seems that the most effective analysts are those that use SQL most effectively. It isn’t sexy, but SQL knowledge itself is the most transferable knowledge I’ve come across. If you run any sort of database, it will most likely implement SQL. You said yourself that most GUI tools just generate SQL on the back end, so your manually generated queries will work just fine alongside the auto-magic tools. Plus, more and more databases (e.g. Oracle, Postgres) are adding analytics features directly into their SQL dialects. I think the best route for the aspiring analyst is to first and foremost learn the SQL standard for both data modification and data definition. Then learn about the features available to your specific database(s), and next learn about making DB procedures and views. Only then, if you have a problem that’s not easily solvable by any of the above, try an external open source solution. Otherwise, you risk spending large amounts of time building bridges when all you needed was a boat.

    1. Good point!

      I agree learning how to write SQL is a very good starting point for a new analyst and, in many cases, enough to get by for entry-level analytics.

      Speaking from my own experience, I needed to move up into Python and R relatively quickly to get access to statistical routines, graphics options, and to write maintainable versions of some of our more advanced “slice-and-dice” code. It depends on how much “refinement” and “unpacking” you need to do… Functions are a very helpful way to control complexity. While you can write them in SQL (if you have sufficient database permissions), I’ve found it easier to just use a scripting language.

      Plus from a career perspective, knowing how to use a scripting language opens up a number of options beyond those which SQL gives you….

  3. Any thoughts about Octave? I ask because this is the language used in Coursera’s “Machine Learning” course, and I’m curious to know how widely used it is in private industry.

  4. Vance – I believe that most people in industry will use Matlab instead of Octave (i.e. the “paid” version). My understanding is that it offers more functionality than Octave, particularly in terms of advanced optimisation routines.

  5. Great overview so far. I’ve been using Python (specifically SciPy and matplotlib) as a Matlab replacement for three years, working at a university. I sometimes still go back because I’m familiar with specific libraries – e.g. the Matlab wavelet package is purely awesome.
