Title: An efficient strategy for the collection and storage of large volumes of data for computation
Keywords: Big Data; Data pipeline; Hadoop; Apache Flume; MapReduce; HDFS
Citation: Journal of Big Data, 3(21) (2016)
Abstract: In recent years, an increasing amount of data has been produced and stored, a phenomenon known as Big Data. Social networks, the Internet of Things, scientific experiments and commercial services play a significant role in generating this vast amount of data. Three main factors characterise Big Data: Volume, Velocity and Variety. All three must be considered when designing a platform to support Big Data. The Large Hadron Collider (LHC) particle accelerator at CERN comprises a number of data-intensive experiments, which are estimated to produce about 30 PB of data annually, and these data are propagated at extremely high velocity. Traditional methods of collecting, storing and analysing data have become insufficient for managing such rapidly growing volumes, so an efficient strategy for capturing these data as they are produced is essential. In this paper, a number of models are explored to determine the best approach for collecting and storing Big Data for analytics. An evaluation of the performance of full execution cycles of these approaches on the monitoring of the Worldwide LHC Computing Grid (WLCG) infrastructure for collecting, storing and analysing data is presented. Moreover, the models discussed are applied to a community-driven software solution, Apache Flume, to show how they can be integrated seamlessly.
Appears in Collections: Dept of Electronic and Computer Engineering Research Papers
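The abstract describes pipelines that collect data with Apache Flume and store it in HDFS for later analysis. As a rough illustration of that collect-and-store shape (not the paper's actual configuration; the agent name, source type, host names and paths below are placeholders), a minimal Flume agent definition looks like:

```properties
# Illustrative Flume agent: one source -> one channel -> one HDFS sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens for incoming events (netcat chosen only for simplicity).
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffers events in memory between collection and storage.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: writes buffered events into HDFS, partitioned by date.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1
```

An agent of this form is started with `flume-ng agent --name agent1 --conf-file agent1.conf`; the source, channel and sink types are interchangeable, which is what allows different collection models to be evaluated on the same storage backend.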
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.