The Analytic World of Big Data
Traditional analytic platforms have limited capacity for storing, transforming and analysing big data, so new concepts have been developed in the analytic world. The design of analytic processes on big data, together with their automation, is a substantial part of the newly emerging big data discipline. In many sectors, understanding big data and structuring applied models reveals added value and gives momentum to ongoing research and technology development. Used effectively, big data technology both creates value and speeds up every process between data integration and decision support systems. Big data structurally resembles its small data sources on the one hand, but is far larger in volume on the other. In addition, the technology offers advantages in speed (shorter data access and estimation times), volume (data sizes growing from terabytes to zettabytes, a factor of one billion) and variety (unstructured as well as structured data kept within the system). As a result, data sets on the scale of zettabytes can yield more reliable outcomes and models that estimate with a smaller margin of error.
All work in the analytic world evolves along these lines, while data sizes grow at a comparable pace. The methods and techniques used in these processes are redesigned as this relationship between supply and demand continues. With these new methods and techniques, complicated life events and human behaviour can be predicted from billions of records. This article outlines the main steps in visualising and summarising data and in building estimation models.
Integration and transformation processes
Integration and transformation are essential analytical steps: they produce a reliable data source for the target group. They are also among the most time-consuming steps in designing an automation system. The time required drops significantly for databases in which data standardisation has already been carried out. Integrating and transforming non-standardised data inevitably produces biased results. Therefore, when work spans different data sources, variables from impure data are standardised and cleaned with data quality tools, and standardised information from other sources is used to fill in missing observations. Once integration is complete, transformation prepares the data for analytical calculations. In this stage, new numerical or categorical variables are derived, text data is split into parts, and unstructured data is converted into analytic tables.
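The steps described above can be illustrated with a minimal sketch. It assumes two hypothetical sources (`crm` and `billing`) and invented field names; none of these come from a particular data quality tool. It standardises values, fills a missing observation from the second source, and then derives a categorical variable and splits a text field.

```python
# Hypothetical records from two sources; all names and values are illustrative.
crm = [
    {"id": 1, "city": " istanbul ", "age": "34", "note": "late payment, phone call"},
    {"id": 2, "city": "ANKARA", "age": None, "note": "new customer"},
]
billing = {2: {"age": "51"}}  # second source, keyed by customer id


def standardise(record):
    """Return a cleaned copy: trimmed, title-cased city and integer age."""
    cleaned = dict(record)
    cleaned["city"] = record["city"].strip().title()
    cleaned["age"] = int(record["age"]) if record["age"] is not None else None
    return cleaned


def integrate(record, other_source):
    """Fill a missing age from the second source when it is available."""
    if record["age"] is None and record["id"] in other_source:
        record["age"] = int(other_source[record["id"]]["age"])
    return record


def transform(record):
    """Derive a categorical variable and split the free text into tokens."""
    # Assumes age has been filled by now; a real pipeline would handle gaps.
    record["age_group"] = "under_40" if record["age"] < 40 else "40_plus"
    record["tokens"] = [t.strip(",.").lower() for t in record["note"].split()]
    return record


analytic_table = [transform(integrate(standardise(r), billing)) for r in crm]
```

Standardising before integrating mirrors the order in the text: values are brought to a common form first, so that the fill from the second source and the derived variables are computed on comparable data.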
Summarised and compact form of data
After integration and transformation, which together make up data preparation, the data must be processed with statistical methods. The analytic methods developed for big data fall under two main headings: summarisation and modelling. Data summarisation techniques present the analytical patterns in the data to users in the simplest and most useful way. After summarisation, analytic tables are prepared to obtain more in-depth output: they hold the most detailed information on the derived mean, rate or index variables relevant to the research question. These analytic tables contain many values and observations, and the large number of variables in big data increases their volume and complexity. Traditional data summarisation techniques therefore fall short, and new techniques are needed to condense the summarised information into an even more compact form. As a result, the end user can read the patterns in the data more easily, and the actions that follow from the answers to the research question can be handled by automation systems.
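A small sketch may make the two levels of summarisation concrete. The rows, the field names and the 0.25 threshold below are assumptions chosen for illustration. The first step builds a table of mean and rate variables per group; the second condenses that table into a short list that an automated system could act on.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical analytic table: one row per customer.
rows = [
    {"region": "north", "spend": 120.0, "churned": False},
    {"region": "north", "spend": 80.0, "churned": True},
    {"region": "south", "spend": 200.0, "churned": False},
    {"region": "south", "spend": 100.0, "churned": False},
]

# Group the observations by region.
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row)

# First-level summary: count, mean and rate variables per group.
summary = {
    region: {
        "n": len(members),
        "mean_spend": mean(m["spend"] for m in members),
        "churn_rate": sum(m["churned"] for m in members) / len(members),
    }
    for region, members in groups.items()
}

# More compact form: keep only the regions whose churn rate exceeds an
# assumed threshold, ready to be passed to an automated action.
flagged = [region for region, s in summary.items() if s["churn_rate"] > 0.25]
```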
Decision support models
Decision support systems are built on billions of rows of data relating to human behaviour and complex cases. Rule sets are obtained by linking the target parameters to estimates made from all the variables in the databases. Using the rule sets learnt from the data, decision support systems are designed to apply estimated values to new observations and then either make suggestions to the user or take automatic actions. To test the reliability of the algorithms in these systems, the data is randomly partitioned into separate learning and test data sets. The rule sets learnt from the learning data are compared with the actual values in the test data to measure the estimation power of the models and the value they add. New algorithms are also developed so that the calculations behind these models run with higher performance. In this way, fast and precise estimates on the data flowing through the systems allow faster decisions with the smallest error rate.
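The random partition and the comparison against test data can be sketched as follows. The data is synthetic, the 70/30 split is an assumption, and the model is a deliberately simple one-threshold rule standing in for whatever rule-learning algorithm a real system would use.

```python
import random

random.seed(0)  # fixed seed so the random partition is reproducible

# Synthetic data: the target is 1 when x > 0.6, purely for illustration.
data = [(x, int(x > 0.6)) for x in (random.random() for _ in range(200))]

# Random partition into learning and test data sets (assumed 70/30 split).
random.shuffle(data)
cut = int(0.7 * len(data))
train, test = data[:cut], data[cut:]


def error_rate(threshold, sample):
    """Share of observations misclassified by 'predict 1 if x > threshold'."""
    wrong = sum(int(x > threshold) != y for x, y in sample)
    return wrong / len(sample)


# Learn the rule: pick the candidate threshold with the lowest learning error.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda t: error_rate(t, train))

# Compare the learnt rule with the actual values in the unseen test data.
test_error = error_rate(best, test)
```

The test error, not the learning error, is the figure to report: the rule was chosen to fit the learning set, so only the held-out observations show how it behaves on new data.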
Evolution of analytic tools
New technologies and tools are constantly being developed so that the techniques devised for big data can turn all of the processes above into decision support systems. Map-reduce techniques are used across the whole process on the data sets: data integration, data cleaning, data transformation, data summarisation, data modelling and even the derivation of decision support algorithms from model rule sets. With Hadoop technology, which can work on both structured and unstructured data, all observations in the target group can be used, particularly in modelling, to produce decision support rules with a smaller margin of error. Besides performance, how these tools manage the summarisation and modelling processes is another substantial aspect of the statistical design infrastructure. For data scientists who summarise data and develop models, features such as setting up multiple models and testing them simultaneously give an edge in finding the most suitable decision support rule set. The analytic tools being developed today therefore stand out for their performance and their process design capabilities on big data.
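The map-reduce idea mentioned above can be shown in a few lines. This is a single-process sketch in plain Python, not Hadoop's API. The partitions and function names are assumptions, and a real cluster would run the map and reduce phases in parallel across machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input, already split into partitions as a cluster would do.
partitions = [
    ["big data", "data model"],
    ["decision support", "data summary"],
]


def map_phase(line):
    """Emit a (key, value) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine the values of one key into a single result."""
    return reduce(lambda a, b: a + b, values)


# Map every line of every partition, then shuffle and reduce per key.
mapped = [pair for part in partitions for line in part for pair in map_phase(line)]
counts = {key: reduce_phase(values) for key, values in shuffle(mapped).items()}
```

Because each line is mapped independently and each key is reduced independently, both phases can be distributed, which is what lets the same pattern scale from this toy input to very large data sets.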