I’ve been lately regularly working with strategy consultancies, on data and AI matters. While corporate ambitions everywhere around data/AI are rising, when it comes to day-to-day operations, most of the data is still fragmented among many databases, and people use Excel sheets as exchange format.
This in itself is interesting, because it means a corporate AI project is often 80% a pure data project, where you have first to figure out the topology of all corporate data, figure out some kind of unified database schema in order to make structured queries.
Over the various projects I’ve been involved in, I designed a very simple python pipeline that uses the famous Pandas data library, to parse Excel spreadsheets, and put the data in a postgreSQL DB. The advantage of doing that is that you end up with clean, structured data, that you can then query with SQL to make analysis.
One very important point is that I always import the data as-is, meaning I make one table per excel sheet. I then use Postgres views with JOINS etc to clean the data schema, step by step. From a theoretical standpoint it is suboptimal (since most of the time you end up having redundant columns across tables), but I like it that way because in the real world, data is dirty, scattered, and redundant. So whatever transformation I make, it’s important to have all the steps clearly laid out, to make the downstream analyses easy to reproduce directly from the Excel data. So, I never create an optimal schema out of nowhere, I build it step by step by stacking Postgres views on top of each other (NOTE: if performance starts being a problem, just use Postgres materialized views feature).
I then use Pgweb, a nice web interface for postgreSQL to visualize the rows, and make CSV exports to my colleagues that want to reprocess some of my results in Excel. Most of the time, Postgres correlation analysis features are enough to answer most business questions, but when needed I can use all Pandas features to make more complex analysis, and in rare cases use Tensorflow for true AI-powered predictions.
The key take-away is this: accept the fact that real-world data is dirty, messy and scattered, and often in Excel extracts, BUT make a clear, versioned pipeline to transform this data into a clean DB, and THEN make your analysis.
As usual, I use Docker and Docker-compose to set up all this.
Here is the docker-compose.yml:
You can see the full repo here: https://github.com/francoisruty/fruty_strategy-consulting-data-science