When you hear all the PR coming from various Silicon Valley companies, you get the feeling that whatever the problem, you can just amass some data, throw it at some algorithms, and get your answers, even if you don’t know what questions to ask.
Of course, it’s not that simple. One of the first arguments I give to temper expectations is this: if things were that simple, why would companies pay data scientists so much? It’s true, but usually not satisfying for my interlocutor, who wants to understand why.
The two most frequent use cases for data analysis are predictive analysis and data exploration.
Predictive analysis is used when you have structured data where columns are dimensions and rows are records/points, and you want to predict one target column from the others. For example, you have a dataset of airport flights, where each row is a flight and the columns are airport terminal, destination, aircraft type, and so on.
Azure, AWS, OVH, and others offer services where you just upload your CSV dataset, specify which column you want to predict, and the crunching is done automatically for you. One obvious problem is that it’s up to you to provide the relevant columns for the analysis.
Let’s stay with the airport example: say each row in your dataset is a flight, and one column specifies whether it was an arrival or a departure. If you want to predict departure delay, it’s relevant to include the previous arrival delay among your predictor columns: if an aircraft arrives late, it will probably also leave late on its next departure. Concretely, you have to restructure your dataset. You have to keep only the departure flights (since that’s what you want to predict), and match each departure with its previous arrival, so that each departure row gets a column holding the previous arrival delay. To do that, you’ll probably need some SQL, using columns such as aircraft registration or flight number to pair corresponding arrival and departure flights.
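To make the restructuring concrete, here is a minimal pandas sketch of that matching step. The dataset and column names (`registration`, `movement`, `scheduled`, `delay_min`) are illustrative assumptions, not a real airport feed; `merge_asof` attaches, to each departure, the most recent prior arrival of the same aircraft:

```python
import pandas as pd

# Hypothetical flight dataset; all column names are illustrative.
flights = pd.DataFrame({
    "registration": ["F-ABCD", "F-ABCD", "F-WXYZ", "F-WXYZ"],
    "movement":     ["arrival", "departure", "arrival", "departure"],
    "scheduled":    pd.to_datetime([
        "2024-05-01 08:00", "2024-05-01 09:30",
        "2024-05-01 10:00", "2024-05-01 11:45"]),
    "delay_min":    [25, 40, 0, 5],
})

# Split movements; merge_asof requires both sides sorted on the time key.
arrivals = flights[flights["movement"] == "arrival"].sort_values("scheduled")
departures = flights[flights["movement"] == "departure"].sort_values("scheduled")

# For each departure, attach the most recent prior arrival of the same aircraft
# (matched on registration), bringing in its delay as a new predictor column.
enriched = pd.merge_asof(
    departures, arrivals,
    on="scheduled", by="registration",
    suffixes=("", "_prev_arrival"),
)

print(enriched[["registration", "delay_min", "delay_min_prev_arrival"]])
```

The same join is expressible in SQL with a correlated subquery or window function; the point is that choosing the matching keys (registration, flight number) is a human decision.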
The crucial point is that this part cannot be automated: the machine does not understand that the dataset is about aircraft, and does not understand that arrival delay is relevant for predicting departure delay. And even if it did, it wouldn’t know how to match pairs of arrival/departure flights (using aircraft registration and flight number).
You cannot automate the process of figuring out which columns are relevant for the predictive analysis, and restructuring your data to obtain those columns usually takes a lot of time, plus domain expertise for the problem at hand.
Some tools offer some degree of feature engineering automation (e.g. adding or multiplying columns together), but it is nowhere near enough for a real-world scenario.
Data exploration is about finding structure in the data, without necessarily having precise questions in mind. One classic need is to find whether parts of the data can be grouped together according to some relevant criteria (clustering algorithms are the most classic family of algorithms for this).
If we take the airport flight data example again, you might want to categorize flights in terms of how they are late, because the airport delay problem most likely has multiple root causes. If you want to investigate in the field, you need a typology of situations in order to know where to start.
If you run a clustering algorithm on your dataset, flight delay has very strong variability, so this dimension will weigh heavily in how flights are grouped. Depending on how many clusters you asked the algorithm to produce, you might end up with interesting clusters, but you might also end up with clusters that mostly sort flights by delay (if the variance on that dimension dwarfs the others). Clusters of flights sorted by delay are not very useful from an operational standpoint.
So you might form a hypothesis such as: big and small delays do not have the same root causes. Then you can take a subset of your dataset (only the flights with a delay greater than 60 minutes), run the clustering algorithm on this subset, and exclude the delay dimension so that it does not “pollute” the clustering. You will end up with clusters of flights that all have big delays (that’s the point of using the subset), but separated along more useful dimensions: for example, you’ll see that your big-delay flights split into groups such as terminal X in month Y, or terminal Y with destination Z. With this information you can figure out where to start your field investigations to improve the airport’s reliability.
Automated data exploration needs to be guided by domain expertise in order to find structures of interest within the data.
In terms of tools, once your dataset is ready for analysis, AWS SageMaker will give you all the mainstream state-of-the-art algorithms and of course all the storage/compute you need. I haven’t tested Azure Data Explorer yet, and OVH Labs is also working on a public machine learning / analytics platform (I contacted them to join the beta but didn’t get an answer).
One tool that I think really shines above the others is Dataiku. Dataiku is a web application that wraps all the mainstream data science libraries (scikit-learn, etc.), is dockerized, and can run on your laptop or on one of your servers. The free version is very powerful, and it comes with many team features that will be appreciated in corporate environments.
On a recent project (airport flight data analysis to reduce delays, as you might have guessed), I used the Dataiku web app (my setup: a Docker container running on a Hetzner dedicated server) and was really impressed. I could run my analysis, version my algorithm runs, visualize the results, and iterate quickly. I could work in a short feedback loop with airport experts, and we were able to draw very interesting information from the data we were given.
Automated data exploration and predictive analysis are hot, but keep in mind that you need domain expertise for the problem at hand (to figure out which columns are relevant) and a data analyst who can work on the data to build the columns you need and choose the relevant algorithm. Ideally, I think you should aim for a semi-automated workflow: automate the mundane parts (tracking experiments, hyperparameter search, etc.) with cloud services or Dataiku, and do your analysis in a short feedback loop with a domain expert.
In terms of tools, if you’re an advanced user dealing with large amounts of data (>100 GB), I recommend trying AWS SageMaker. If you’re working on use cases with less data in a corporate environment, I really recommend Dataiku (cloud version or self-hosted).
The key takeaways: semi-automated analysis, domain expertise, data expertise, a short feedback loop.