BIG DATA AND HADOOP UTILITIES
The data warehousing concept became ineffective because the amount of data generated by data-intensive methods such as next-generation sequencing (NGS) has reached the limits of hardware capacity, leaving little room in which further data can be stored. Doug Cutting addressed this challenge by creating Hadoop, now an Apache project, which improved the capacity to handle enormous data volumes. Hadoop has proved to be an easy and effective solution for storing, handling and investigating many terabytes or even petabytes of data on a platform of reliable, scalable, distributed and precise computing software tools. The Hadoop programming libraries rest on a simple principle: large datasets are divided into smaller blocks. The library itself detects and handles failures at the application layer, so a dependable service can be delivered even on lower-grade hardware. The Hadoop framework scales processing across a large number of machines, each contributing local computation and storage. The major advantage of Hadoop is its scalability, which has made it an influential tool in NGS studies: data and processing are distributed, so information and patterns can be extracted from large genome datasets.
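The block principle described above can be illustrated with a small single-process sketch. It is not Hadoop code and calls no Hadoop API; the function names (split_into_blocks, count_bases, merge_counts), the toy sequence and the block size of 4 are assumptions chosen only for illustration. Each block is processed independently, as a local computation would be on its own node, and the partial results are then merged.

```python
from collections import Counter


def split_into_blocks(data, block_size):
    """Cut a large input into fixed-size blocks (the last one may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def count_bases(block):
    """Local computation on a single block: count each nucleotide."""
    return Counter(block)


def merge_counts(partials):
    """Combine the per-block results into one overall count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical toy input standing in for a very large genome file.
genome = "ACGTACGTTTGACA"
blocks = split_into_blocks(genome, 4)
partials = [count_bases(block) for block in blocks]  # independent per block
total = merge_counts(partials)
```

Because no block depends on another, the per-block step could in principle run on different machines; only the final merge needs to see all partial results.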
CHARACTERISTICS OF BIG DATA
New sources of business value are unlocked with the help of big data: data that is large in scale, distributed and diverse, and that requires new architectures, analytics and tools. The three main features of big data are volume, variety and velocity. Volume is the size of the data; velocity is the frequency with which data is created or changed; variety covers the different formats and types of data, its different uses and the different ways of analysing it. The primary attribute of big data is its volume. Big data can be quantified by size in terabytes (TB) and petabytes (PB), or by the number of records, transactions, tables and files. What makes big data really big and really useful is that it comes from a greater variety of sources than ever before, including logs, clickstreams and social media. Using these sources for analytics means that common structured data is combined with unstructured data such as text and human language, and with semi-structured data such as Extensible Markup Language (XML) or Rich Site Summary (RSS); there is also data that is hard to categorize because it comes from audio, video and other devices of that kind. In addition, multidimensional data can be extracted from a data warehouse to add historic context to big data, so that volume is combined with variety. Velocity, or speed, is a keystone of big data. It is the frequency of data generation or the frequency of data delivery; streaming data, for example, is collected in real time from websites. The fourth feature discussed by researchers is veracity. It concerns the quality of data: it characterizes big data as good or bad and examines data whose quality is undefined because of inconsistency, incompleteness and similar problems.
LIMITATIONS OF NGS
1) As every coin has two sides, NGS also has limitations: it is fast and comparatively cheap, yet it is still too costly for small labs or individuals to afford when set against typical Sanger sequencing.
2) NGS data analysis is a long, time-consuming process, and it requires sound knowledge of bioinformatics so that proper sequencing results can be obtained.
3) NHGRI, the National Human Genome Research Institute, promised to minimize the cost of human genome sequencing, but to date that promise has not been fulfilled. Had the cost been minimized, NGS could have been used as a tool for diagnosing various diseases (Hert et al., 2008).
According to Dauber et al. (2014), another potential limitation of NGS is its short read lengths, which make highly repetitive regions difficult to study; the deep processing steps and bioinformatics required are a major barrier to implementing and capitalizing on NGS technology.
As time passes and globalisation takes hold, the cost of sequencing is falling, and the ability of the latest sequencers to produce large volumes of data for genomic research has made NGS a much more needed and powerful tool.
Hadoop-based big data applications let users rapidly interrogate very large datasets in a parallelized, distributed manner.
Developments in technology and the latest frameworks have increased knowledge of human diseases, prediction of an individual's health, sequencing of microorganisms, identification of variants and the development of stress-, drought- and pest-tolerant plants, opening a new window for researchers.
The Human Genome Project (HGP), which was launched with the target of sequencing large parts of the genome, has given rise to many new theories.
In a single machine run lasting 7 days, an NGS platform sequences millions of reads and produces several hundred GB of data, corresponding to several terabytes.
The challenge of handling these data accurately has led to sustained interest in parallelizing and distributing the execution of annotation pipelines.
The major aim of improving NGS analytics is to provide filtering, mapping and analysis of very large datasets in a shorter span of time.
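As an illustration of the filtering step only, the following sketch drops reads whose mean base quality falls below a threshold. It assumes four-line FASTQ records with Phred+33 quality encoding; the threshold of 20, the function names and the two sample reads are assumptions made for the example and are not taken from any particular pipeline.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0):
    """Keep only the reads whose mean quality reaches the threshold."""
    for header, sequence, quality in records:
        if mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


# Two hypothetical reads: one high quality ('I' = Q40), one low ('!' = Q0).
sample = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = list(filter_reads(parse_fastq(io.StringIO(sample))))
```

Since each read is judged on its own, a filter of this kind can be applied to separate chunks of a file in parallel, which is what makes it a natural fit for a distributed setting.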
Some errors remain in NGS data. They arise from mismatches between sequencing platforms and from variation in the formats supported by sequencers, so software packages and data handling tools must be accurate as well as fast.
The latest generation of big data processing tools offers satisfactorily accurate and fast analysis of high-throughput sequencing data, revealing unseen patterns in sequences, biomolecule interactions and so on.
Hadoop offers a user-friendly environment for large-scale data analysis and is quite effective at reducing errors.
Thanks to the MapReduce framework, NGS quantification and analysis jobs attain a high level of performance.
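The MapReduce idea can be sketched in plain Python by counting k-mers across a set of reads. This is a single-process imitation of the map, shuffle and reduce phases and does not use Hadoop itself; the function names, the value k = 3 and the two sample reads are assumptions made for illustration.

```python
from collections import defaultdict


def map_read(read, k=3):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


# Hypothetical reads; in a real job each read could be mapped on a different node.
reads = ["ACGTAC", "GTACGT"]
mapped = [pair for read in reads for pair in map_read(read)]
counts = dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
```

The map step looks at one read at a time and the reduce step at one key at a time, so both phases can be spread over many machines; only the shuffle has to move data between them.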
As new windows for research open every day, many libraries are moving towards cloud services and systems.
This makes it clear that the technology is entering the second phase of its cycle and is coming to dominate IT standards.
Adopting cloud services requires correct knowledge; without it, the lack of understanding becomes a barrier.
Cloud services should be understood correctly so that users can easily judge whether they are right for them or not.
The main boon of cloud services is that they can be used from mobiles, laptops, tablets etc. and do not need a dedicated and complicated system or software. The other side of the cloud is its disadvantages in reliability, security and privacy.
In future upgrades these disadvantages will be reduced, and a complete, holistic platform will be used for further studies.