Why the steps of the Big Data processing cycle need to be understood and managed, and why they are not
Most data engineers and data scientists know the Big Data processing cycle, but many managers don’t, and some data engineers and data scientists forget about it in their day-to-day work. This article explains it in plain words that everyone can follow.
The problem with the data cycle is that, if it is not well managed, the results will not deliver value. Mismanagement of the data processing cycle is probably the most significant reason why big data investments fail to produce business results.
Though there are several different ways to look at it, I will use the structure employed by Alex “Sandy” Pentland in one of his courses.
Once we know what data we are looking for, we should follow these steps:
- Data acquisition
- Data pre-processing
- Data hygiene
- Data analysis
- Data visualization
- Data interpretation
- Data intervention
Data acquisition refers to data collection from different sources. Depending on the project, the collection could include things like all interactions of a mobile phone, as well as its location, battery supervision, and on and off time; or it could be all data related to credit card transactions, from the shop location to the payment amount, time and object bought.
Data pre-processing. Acquired data arrives as streams of bytes, that is, ones and zeros, in the format produced by the sensor, whether it be a phone, an ATM or a security camera. To work with the data, we need to transform it into formats that are meaningful to our Big Data tools and technologies.
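To make this step concrete, here is a minimal sketch in Python. It assumes a hypothetical sensor that emits fixed-size binary records (a timestamp, a latitude, a longitude and a battery percentage); the record layout and the name decode_stream are invented for illustration and do not come from any real device.

```python
import struct
from datetime import datetime, timezone

# Hypothetical record layout (an assumption for illustration only):
# little-endian uint32 Unix timestamp, float32 latitude,
# float32 longitude, uint8 battery percentage.
RECORD = struct.Struct("<IffB")

def decode_stream(raw: bytes) -> list:
    """Turn a raw byte stream into a list of readable records."""
    records = []
    # Ignore a trailing partial record, if any.
    usable = len(raw) - len(raw) % RECORD.size
    for offset in range(0, usable, RECORD.size):
        ts, lat, lon, battery = RECORD.unpack_from(raw, offset)
        records.append({
            "time": datetime.fromtimestamp(ts, tz=timezone.utc),
            "lat": lat,
            "lon": lon,
            "battery": battery,
        })
    return records

# Two sample records, packed the same way the imagined sensor would send them.
raw = RECORD.pack(1700000000, 40.5, -3.75, 87) + RECORD.pack(1700000060, 40.5, -3.75, 86)
rows = decode_stream(raw)
```

After this step, each entry in rows is a dictionary with named fields, which is far easier for downstream tools to handle than the original bytes.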
Data hygiene is a critical step, though on many occasions it is forgotten or carried out carelessly. The raw data acquired in the first phase could include bias, may have Data Deserts and/or may be subject to Data Mirage effects. During data hygiene, we create a new, cleaned data set that gives a better picture of the signal and data we are interested in analyzing.
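A very small part of data hygiene can be sketched in Python as follows. The example only covers basic cleaning (duplicates, missing readings and impossible values), not bias, and the field names and the 0 to 100 battery range are assumptions made for illustration. It also counts what was discarded, so that whoever interprets the data later knows how much was filtered out and why.

```python
def clean(records):
    """Return (cleaned_records, report) without modifying the input."""
    seen = set()
    cleaned = []
    report = {"duplicates": 0, "missing": 0, "out_of_range": 0}
    for rec in records:
        key = (rec.get("device"), rec.get("time"))
        if rec.get("battery") is None or None in key:
            report["missing"] += 1        # incomplete record
        elif not 0 <= rec["battery"] <= 100:
            report["out_of_range"] += 1   # physically impossible value
        elif key in seen:
            report["duplicates"] += 1     # same device and time seen before
        else:
            seen.add(key)
            cleaned.append(rec)
    return cleaned, report

# Hypothetical sample data.
raw_records = [
    {"device": "a", "time": 1, "battery": 87},
    {"device": "a", "time": 1, "battery": 87},    # exact duplicate
    {"device": "b", "time": 1, "battery": None},  # missing reading
    {"device": "c", "time": 2, "battery": 250},   # impossible value
    {"device": "c", "time": 3, "battery": 40},
]
cleaned, report = clean(raw_records)
```

Keeping the report next to the cleaned data set is a deliberate choice: it documents the filtering, which the interpretation step depends on.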
Data analysis is the following step. Once we have the data cleaned up and in the appropriate format and structure, we start asking questions such as: Do we see any trend? Does the data show any kind of consolidated information? Can we infer some patterns? For that, there are some specific tools discussed in other sections of the DigitalFullPotential website, and more specifically, in the Buzzwords Explained section and the Food for Thought one.
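As one simple illustration of the question "do we see any trend?", the sketch below fits a straight line to a series of evenly spaced values using ordinary least squares and returns its slope. The function name trend_slope and the sample figures are invented for this example; real analyses usually rely on dedicated tools and far richer methods.

```python
def trend_slope(values):
    """Least-squares slope of values observed at steps 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical daily transaction counts.
daily_transactions = [100, 110, 120, 130, 140]
slope = trend_slope(daily_transactions)
```

A positive slope suggests growth over the period and a negative one a decline. A slope close to zero suggests no clear linear trend, though it does not rule out other patterns such as seasonality.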
Data visualization. Data analysis is sometimes carried out in parallel with data visualization. There is a great number and variety of data visualization tools. Their purpose is to show graphically the trends and the structure of the information we discover through the analysis process.
Data interpretation. Together with data hygiene, this is the other critical step in the Big Data processing cycle. There are many instances in which data interpretation has not been carried out adequately, leading to wrong conclusions. For a correct data interpretation, there are two crucial components:
- To understand which data sets have been captured and filtered (through the hygiene process) and which kinds of sensors collected the data.
- To understand the business/problem at hand.
Too frequently, data interpretation is carried out by Data Departments with little knowledge of the real problems, so they misinterpret what is going on. Other times, the data is interpreted by experts in the field who assume that it is right and that it doesn’t carry bias or any other related problems.
Data intervention. Once conclusions have been reached, they lead to the next step, which may involve using the data to make decisions, training Machine Learning algorithms, or rethinking the data gathering for the future.