Big Data and HPC Convergence:
The Cutting Edge and Outlook
Sardar Usman¹, Rashid Mehmood², and Iyad Katib¹

¹ Department of Computer Science, FCIT, King Abdulaziz University, Jeddah 21589, Saudi Arabia
usmansardar@hotmail.com, iakatib@kau.edu.sa
² High Performance Computing Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia
RMehmood@kau.edu.sa
Abstract. Data has grown on a massive scale over the last couple of decades, and the challenges associated with big data have grown with it. The issues raised by this avalanche of data are immense and cover a variety of challenges that need careful consideration. The use of High Performance Data Analytics (HPDA) is increasing briskly in many industries, expanding the HPC market into these new territories. HPC and big data are different systems, not only at the technical level but also in their ecosystems. The workload landscape is diverse enough, and performance sensitivity high enough, that no single solution can be globally optimal for all the issues related to the convergence of big data and HPC. As we head towards exascale systems, the necessary integration of big data and HPC is a hot research topic, but it is still at a very early stage. The two systems have different architectures, and their integration brings many challenges. The main aim of this paper is to identify the driving forces, challenges, and current and future trends associated with the integration of HPC and big data. We also propose an architecture for big data and HPC convergence using design patterns.
Keywords: HPC · Big data · Hadoop · HPDA · Design patterns · IoT · Smart cities · Cognitive computing
1 Introduction
Over the years, HPC has contributed greatly to scientific discoveries, improved engineering designs, enhanced manufacturing, fraud detection, health care, and national security, and has thus played a crucial role in the quality of human life. The world has seen exponential data growth due to social media, mobility, e-commerce, and other factors. The major chunk of this data has been generated in the last few years alone, and it is growing at an ever more rapid rate [1]. To deal with the ever-growing volume of data, researchers have been developing algorithms to accelerate the extraction of key information from massive data sets. Big data is a buzzword that has caught much attention in recent years. It refers to massive amounts of structured, semi-structured, and unstructured data
collected from different sources that cannot be stored and processed by traditional databases and software techniques.
Historically, only the largest companies, government research organizations, and academic computing centers have had access to the computing power necessary to reach valuable conclusions in a reasonable amount of time. All of that is rapidly changing with vast improvements in the price, performance, availability, and density of compute power.
The balance of data versus computing is affected by solution urgency, i.e., the need for real-time answers, and by what we are trying to achieve. As the volume of data grows, processing it in real time becomes more challenging. It is projected that over 4.3 exabytes of data will be created daily in 2018 [2]. Over the years, the HPC community has itself handled huge volumes of data, e.g., in climate modeling, design and manufacturing, and financial services, resulting in high-fidelity models and interdisciplinary analyses that explore data for deeper insights. The use of High Performance Data Analytics (HPDA) is increasing briskly in many industries, expanding the HPC market into these new territories.
Powerful analytics is key to extracting value from data under budget and marketing constraints, and plays a huge role in making plans, predicting business trends, and understanding customer demands. Choosing the right solution depends on the size of the data, the urgency of results, the anticipated need for more processing power as data grows, fault tolerance in case of hardware failure, data rate, scalability, etc. A real-time application with a tight response-time requirement, especially one dealing with huge volumes of data, is still challenging to build and is one of the driving forces towards the convergence of big data and HPC.
HPC and big data are different systems, not only at the technical level but also in their ecosystems: they have different programming models, resource managers, file systems, and hardware. HPC systems were developed mainly for compute-intensive applications, but data-intensive applications have recently become a major workload in HPC environments. Driven by these data-intensive applications, a number of software frameworks have been developed for distributed systems, cluster resource management, parallel programming models, and machine learning. High performance computing has well-established standard programming models, e.g., OpenMP and MPI. Big data analytics has grown up from a different perspective, with a different population of developers who use Java and other high-level languages, focusing primarily on ease of use so that problems in the application domain can be solved without detailed knowledge of HPC. These differences in infrastructure, resource managers, file systems, and hardware make system integration a challenging task.
As data grows bigger in volume, so does the need for computing power. The HPC community has been dealing with massive amounts of data and large-scale analytics for years, and the solutions that evolved to handle large volumes of data should be useful for big data analytics [3]. The main aim of this paper is to identify the motivations and driving forces behind the integration of HPC and big data, and to highlight the current trends, challenges, benefits, and future aspects of a unified, integrated system. We also present an architecture for the convergence of HPC and big data using design patterns.
The rest of the paper is organized as follows. The next section examines the differences between the HPC and Hadoop frameworks with respect to hardware, resource management, fault tolerance, and programming model. A literature survey is presented in Sect. 3, and convergence challenges are discussed in Sect. 4, followed by future directions in Sect. 5. The architecture using design patterns for the convergence of HPC and big data is presented in Sect. 6, and the paper is concluded in the final section.
2 HPC and Big Data Frameworks and Their Differences
Different solutions have emerged over the years to deal with big data issues and have been successfully implemented. Nevertheless, these solutions do not satisfy the ever-growing needs of big data. The issues are immense and cover a variety of challenges that need careful consideration, for example data representation, data reduction/compression, data confidentiality, energy management, high dimensionality, scalability, real-time and distributed computation, unstructured processing, analytical mechanisms, and computational complexity. The exponential outburst of data and the rapidly increasing demand for real-time analytics urge the convergence of high-end commercial analytics and HPC. Business intelligence and analytics solutions today suffer from a lack of support for predictive analytics, insufficient data granularity, inflexible software for manipulating data, unintuitive user interfaces, relevant information not being aggregated in the required manner, and slow system performance [4].
The HPC community has long dealt with complex data- and compute-intensive applications, and its solutions have evolved over the years. As the volume of data increases briskly, so do the associated challenges: data analysis, minimizing data movement, data storage, data locality, and efficient searching. As we head towards the exascale era, increased system concurrency poses a massive challenge for system software to run applications at extreme levels of parallelism. Large-scale applications use the most widely deployed message-passing programming model, MPI, along with traditional sequential languages, but architectural changes (many-core chips) and the high demand for parallelism make this programming model less productive for exascale systems. Billion-fold parallelism is required to exploit the performance of extreme-scale machines, and locality is critical for energy consumption. As the complexity and scale of software requirements rise, a simple execution model becomes a critical requirement, since it reduces the application programming complexity involved in achieving extreme-scale parallelism. Current trends in the HPC market include advanced interconnects and RDMA protocols (InfiniBand, 10-40 Gigabit Ethernet/iWARP, RDMA over Converged Enhanced Ethernet), enhanced redesigns of HPC middleware (MPI, PGAS), SSDs, NVRAM, burst buffers, etc. Scalable parallelism, synchronization, minimizing communication, task scheduling, the memory wall, heterogeneous architectures, fault tolerance, software sustainability, memory latencies, simple execution environments, and dynamic memory access for data-intensive applications are some of the core areas that require considerable time and effort to address exascale challenges [5]. The differences between the Hadoop and HPC frameworks are highlighted in the following sections.
2.1 Hardware
Most modern HPC and Hadoop clusters are built from commodity hardware. In an HPC environment, compute nodes are separated from data nodes. There are two types of data storage: a temporary file system on the local nodes and a persistent, globally shared parallel file system on the data nodes. Existing HPC clusters have a limited amount of storage on each compute node. Lustre is the most widely used parallel file system in HPC; almost 60% of the top 500 supercomputers use Lustre as their persistent storage. Data must be transferred from the data nodes to the local file system on each compute node for processing. Data sharing is easy with distinct data and compute nodes, but spatial locality of data is an issue [6, 7].
A Hadoop cluster uses local disk space as primary storage. The same node serves as both data node and compute node, and computational tasks are scheduled on the machine where the data resides, enhancing data locality. Hadoop is a write-once, read-many framework. The I/O throughput of Hadoop is much higher due to the co-location of data and compute on the same machine [7].
2.2 Resource Management
Another major difference between Hadoop and HPC clusters is resource management. Hadoop's name node runs the job tracker daemon, which supervises all map-reduce tasks and communicates with the task trackers on the data nodes. Compared to Hadoop's integrated job scheduler, HPC scheduling is done with specialized tools such as Grid Engine and LoadLeveler [8], with controlled resources (memory, time) granted to the user.
2.3 Fault Tolerance
HPC resource schedulers use a checkpoint mechanism for fault tolerance. In case of node failure, the job is rescheduled from the last stored checkpoint; if checkpointing is not used, the whole process must be restarted. Hadoop, on the other hand, uses the job tracker for fault tolerance. As data and computation are co-located on the same machine, the job tracker can detect a node failure at run time and re-assign the task to a node holding a duplicate copy of the data [8, 9].
2.4 Programming Model
Hadoop uses the map-reduce programming model, which makes life easier for programmers: they only need to define a map step and a reduce step, compared to the programming effort needed for HPC applications. In an HPC environment, the programmer must take fine-grained responsibility for managing communication, I/O, debugging, synchronization, and checkpointing. All of these tasks require considerable effort and time to implement effectively and efficiently. Although Hadoop itself is written in Java, it provides an interface for writing and running map-reduce applications in any language. Table 1 summarizes the differences between the HPC and Hadoop frameworks [7].
Hadoop and Spark are both big data frameworks that perform similar tasks; they are not mutually exclusive and can work together. Spark is mostly used on top of Hadoop, applying its advanced analytics to data stored in the Hadoop Distributed File System (HDFS). Spark can run as a Hadoop module through YARN or as a standalone solution [10], and can be seen as an alternative to map-reduce rather than a replacement for the Hadoop framework. Spark is much faster than Hadoop because it performs in-memory operations, copying data from the distributed file system into faster logical RAM. Map-reduce writes all data back to the distributed storage system after each iteration to ensure full recovery, whereas Spark arranges data in resilient distributed datasets (RDDs) that can be fully recovered in case of failure. Spark's ability to handle advanced analytics such as real-time stream processing and machine learning gives it an edge over Hadoop. The choice between the two data processing tools depends on the needs of the organization; for example, big structured data can be handled efficiently with map-reduce, with no need to install a separate Spark layer over Hadoop [11]. Spark-on-demand allows users to use Apache Spark for in situ analysis of big data on HPC resources [12]. With this setup, there is no longer a need to move petabytes of data for advanced analytics.
3 Research Related to HPC and Big Data Convergence
The integration of HPC and big data has started at different levels of their ecosystems, and these integrated solutions are still at a very early stage. The convergence of the two technologies has been among the hottest research topics of the last few years. In [6], Krishnan et al. proposed the myHadoop framework, which uses a standard batch scheduling system to configure Hadoop on demand on traditional HPC resources. The overheads of this setup include site-specific configuration, staging input data into HDFS, and staging results back to persistent storage.
Table 1. HPC vs. Hadoop ecosystems

| Aspect                           | Big Data                     | HPC                                                  |
|----------------------------------|------------------------------|------------------------------------------------------|
| Programming model                | Java applications, SPARQL    | Fortran, C, C++                                      |
| High-level programming           | Pig, Hive, Drill             | Domain-specific languages                            |
| Parallel runtime                 | Map-reduce                   | MPI, OpenMP, OpenCL                                  |
| Data management                  | HBase, MySQL                 | iRODS                                                |
| Scheduling (resource management) | YARN                         | SLURM (Simple Linux Utility for Resource Management) |
| File system                      | HDFS, Spark (local storage)  | Lustre (remote storage)                              |
| Storage architecture             | Local, shared-nothing        | Remote, shared parallel storage                      |
| Storage hardware                 | HDDs                         | SSDs                                                 |
| Interconnect                     | Switched Ethernet            | Switched fiber                                       |
| Infrastructure                   | Cloud                        | Supercomputer                                        |
HDFS is heavily criticized for its I/O bottleneck, and the limited storage available on compute nodes is a big challenge when integrating Hadoop with HPC clusters. Islam et al. [13] proposed a hybrid design (Triple-H) that reduces the I/O bottleneck in HDFS and uses resources efficiently across different analytics workloads, improving performance and cluster efficiency at low overall system cost.
Data-intensive applications have been run extensively on HPC infrastructure with multicore systems using the map-reduce programming model [14]. With increased parallelism, overall throughput increases, and energy efficiency improves because tasks complete in a shorter span of time. When Hadoop runs on an HPC cluster with multiple cores, each node can run many map/reduce tasks on those cores, which decreases data movement cost and increases throughput; however, because map-reduce tasks make heavy disk and network accesses, energy consumption and throughput cannot be reliably predicted. A high degree of parallelism may or may not improve energy efficiency and performance.
Tiwari et al. [15] studied Hadoop's energy efficiency on an HPC cluster. Their study shows that the energy efficiency of a map-reduce job on an HPC cluster changes with the degree of parallelism and the network bandwidth. They determined the degree of per-node parallelism that improves energy efficiency, and the benefit of increased network bandwidth, by selecting configuration parameters for different types of workloads (CPU-intensive with moderate I/O, and CPU- and I/O-intensive), and they also characterized the energy and performance of disk- and network-I/O-intensive jobs. When the number of map slots grew beyond 40, the number of killed map tasks almost doubled; thus increasing parallelism has a positive impact on energy efficiency only up to a certain extent.
In HPC environments, scientific data sets are stored on back-end storage servers, and these data sets can be analyzed by YARN map-reduce programs on the compute nodes. As compute and storage servers are separated, the cost of moving these large data sets is very high. The high-end computation machines and analysis clusters are connected through a high-speed parallel file system. To overcome the shortcomings of offline data analysis, "in situ" data analysis can be performed on output data before it is written to the parallel file system. However, using high-end computation nodes for data analysis slows down the simulation job through interference from the analysis tasks, and is an inefficient use of computational resources. Spark-on-demand allows users to use Apache Spark for in situ analysis of big data on HPC resources [12]; with this setup, there is no longer a need to move petabytes of data for advanced analytics.
According to Woodie [16], InfiniBand is more cost-effective than standard Ethernet for large clusters. The performance of HPC-oriented map-reduce solutions (Mellanox UDA, RDMA-Hadoop, DataMPI, etc.) depends on the degree of change made to the Hadoop framework, as deeper modifications mean closer adaptation to HPC systems. Hadoop with IPoIB (IP over InfiniBand) and Mellanox UDA requires minimal or no changes to the Hadoop implementation and only minor changes to the Hadoop configuration. RDMA-Hadoop and HOMR are HPC-oriented solutions that take advantage of high-speed interconnects by modifying some of Hadoop's subsystems. DataMPI is a framework developed from scratch that exploits the overlapping of the map, shuffle, and merge phases of the map-reduce framework and increases
data locality during the reduce phase. DataMPI provides the best performance and average energy efficiency [17]. InfiniBand, being widely used in HPC environments, improves network bandwidth; however, communication support in Hadoop relies on the TCP/IP protocol through Java sockets [17], which makes it difficult to use high-performance interconnects optimally, so HPC-oriented map-reduce solutions such as RDMA-Hadoop and DataMPI emerged to leverage them. Wang et al. [18] compared the performance of 10 Gigabit Ethernet and InfiniBand on Hadoop. With small intermediate data sizes, the high-speed interconnect increases performance by efficiently accelerating jobs, but it does not show the same benefit with large intermediate data sizes. InfiniBand on Hadoop provides better scalability and removes disk bottleneck issues. As Hadoop clusters grow bigger, organizations feel the need for specialized gear such as solid-state drives (SSDs) and InfiniBand in place of standard Ethernet. InfiniBand with RDMA (remote direct memory access) provides 40 Gbit/s of raw capacity on a Quad Data Rate (QDR) InfiniBand port, four times as much bandwidth as a 10 Gigabit Ethernet port can deliver [16].
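For reference, our own arithmetic (not the paper's) on the QDR figure, using the standard 8b/10b line encoding:

```latex
% QDR InfiniBand: 4 lanes x 10 Gbit/s raw; 8b/10b encoding leaves 80% for data.
\[
4 \times 10\ \tfrac{\text{Gbit}}{\text{s}} = 40\ \tfrac{\text{Gbit}}{\text{s}}\ \text{(raw)},
\qquad
40 \times \tfrac{8}{10} = 32\ \tfrac{\text{Gbit}}{\text{s}}\ \text{(data)}.
\]
```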
InfiniBand allows maximum scalability and performance while overcoming I/O bottlenecks. Islam et al. [19] propose a parallel replication scheme as an alternative to the pipelined replication used in HDFS, analyze its challenges, and compare its performance with the existing pipelined replication over Ethernet, IPoIB, 10 GigE, and RDMA, showing performance gains with the parallel model for large data sizes and high-performance interconnects.
4 Challenges of Convergence
The workload landscape is diverse enough, and performance sensitivity high enough, that no single solution can be globally optimal for all the issues related to the convergence of HPC and big data. HPC and Hadoop (big data) architectures are different and have different ecosystems. The cross-fertilization of HPC and big data has been among the hottest research topics of the last few years. Most research related to their convergence has started at distinct levels of the ecosystem but does not address the problem of moving data, especially in HPC environments. Integrating data-intensive applications into HPC environments brings many challenges. In an exascale environment, the cost of moving big data will exceed the cost of floating point operations. There is a need for energy-efficient, cost-effective interconnects for high-bandwidth data exchange among thousands of processors. We also need data-locality-aware mechanisms, especially when dealing with big data in HPC shared-memory architectures. The cost of moving big data for processing also brings the challenge of high power consumption: in a massively parallel architecture with hundreds of thousands of processing nodes, the cost of moving data will be very high. According to Moore et al. [20], an energy efficiency of 20 pJ (picojoules) per floating point operation is required for an exascale system, whereas current state-of-the-art multicore CPUs take 1700 pJ and GPUs 225 pJ per floating point operation.
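A quick calculation of our own shows what these numbers imply for sustained power at exascale (10^18 flop/s), counting the floating point operations alone:

```latex
% Sustained power at 10^18 flop/s for the three energy-per-flop figures above:
\[
10^{18}\,\tfrac{\text{flop}}{\text{s}} \times 20\,\text{pJ} = 20\ \text{MW},
\qquad
\times\, 225\,\text{pJ} = 225\ \text{MW},
\qquad
\times\, 1700\,\text{pJ} = 1.7\ \text{GW}.
\]
```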
Minimizing data movement requires innovation in memory technologies, with enhanced capacity and bandwidth. To deal with the 3Vs (volume, velocity, variety)
of big data, efficient data management techniques need to be investigated, including data mining and data co-ordination [13], as most HPC platforms are compute-centric, in contrast to the demands of big data (continuous processing, efficient movement of data between storage devices and network connections, etc.). To deal with massively parallel architectures and the heterogeneous nature of big data, innovation is needed in programming models for the next generation of parallel systems, reducing the burden of parallelism and data locality on the application developer; MPI leaves it to the programmer to handle issues related to parallelism. Hadoop, the most widely used big data framework, achieves fault tolerance by replicating data on multiple nodes, with the job tracker assigning the job to another node in case of node failure. Fault tolerance in HPC relies on a checkpoint mechanism, which is heavily criticized and not suitable for the exascale environment: in exascale systems, hardware failure will be the rule, not the exception. The MTBF (mean time between failures) window of current petascale systems is measured in days; for exascale systems it will be minutes or perhaps seconds. So there is a need for comprehensive resilience at the different levels of the exascale ecosystem. Exascale systems will be constrained by power consumption, memory per core, data movement cost, and fault tolerance. The integration of HPC and big data must address the issues of scalability, fault resilience, energy efficiency, scientific productivity, programmability, and performance [21].
Resilience, power consumption, and performance are interrelated: a high degree of resilience or fault tolerance is achievable, but at the expense of high power consumption. As we head towards the exascale era, the convergence of HPC and big data will make energy efficiency a core issue. Servers and data centers face the same power consumption problem, including those of companies like Google, Amazon, and Facebook. According to one estimate, the actual cost of an exascale system will be less than the cost of the power consumed in maintaining and running it for one year [22].
Energy efficiency techniques in big data can be broadly categorized into software- and hardware-based techniques, energy-efficient algorithms, and energy-efficient architectures. Commodity hardware is used in both HPC and big data platforms for data processing. An integrated hardware solution for data-intensive and compute-intensive applications would not work for exascale systems, since hardware-based fault tolerance comes at the expense of high energy consumption. The current petascale approach of checkpointing for fault tolerance and energy efficiency does not suit an integrated solution of exascale HPC and big data. Soft, hard, and silent errors in the exascale environment will be the rule, not the exception. Collaborative efforts at the system and application levels are therefore needed to address fault tolerance and energy efficiency in an integrated solution.
As we have seen, HPC and Hadoop (big data) architectures are different and have different ecosystems, with different programming models, resource managers, file systems, and hardware; these differences make system integration a challenging task. As data keeps growing in volume, so does the need for computing power, and one of the biggest challenges facing both the big data and HPC communities is energy efficiency. An exascale parallel computing system will have thousands of nodes with
hundreds of cores each and is projected to run billions of threads of execution. The mean time between failures (MTBF) of today's supercomputers is measured in days and weeks, but for exascale computing, with a million times more components, the expected MTBF shrinks to hours, minutes, or perhaps seconds. Each layer of the exascale ecosystem must be able to cope with these errors [23].
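The scaling intuition is simple; with illustrative numbers of our own (not the paper's), a per-component MTBF of five years across 10^5 nodes yields a system-level failure roughly every half hour:

```latex
% System MTBF shrinks roughly linearly with component count N:
\[
\text{MTBF}_{\text{system}} \approx \frac{\text{MTBF}_{\text{component}}}{N}
= \frac{5 \times 8760\ \text{h}}{10^{5}}
\approx 0.44\ \text{h} \approx 26\ \text{minutes}.
\]
```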
Real-time data analysis is also a driving force behind the urgency of the necessary convergence of analytics, big data, and HPC when dealing with the computation, storage, and analysis of massive, complex data sets in highly scalable environments. The scalability problems that the HPC community has addressed, by capitalizing on advances in network technologies (low-latency networks) and efficient, large memories, should also address the scalability issues of data analytics [24].
5 Driving Forces and Future Aspects
High performance data analytics (HPDA) involves tasks with massive volumes of structured, semi-structured, and unstructured data and highly complex algorithms that demand HPC resources. Companies now have the computing power they need to actually analyze and act upon their data, which translates into numerous benefits for companies, the environment, and society overall. In the energy sector, companies can now drill for oil more accurately. Automobiles and airlines are much safer thanks to rapid modeling of operational data, design optimization, and aerodynamics analysis, allowing manufacturers to deliver more cost-effective products that operate more safely and are more fuel-efficient. In the financial sector, banks and card issuers can do fraud detection in real time. Stock investors can quickly track market trends to better serve their investing customers. Retailers and advertisers can review historical purchasing data to deliver the right products and advertisements to their customers, and weather researchers can study thousands of years of weather data in hours or days instead of weeks or months, improving the quality of predictions and the safety of people worldwide. The HPC industry has long dealt with data-intensive simulations, and its high-performance analytics solutions have evolved over the years, urging commercial organizations to adopt HPC technology for competitive advantage in time-critical and highly variable complex problems. The chasm between data and compute power is narrowing all the time. The global HPDA market is growing rapidly: according to one forecast, the HPDA market was worth about US$25.2 billion and, growing at nearly 18% annually, is projected to reach around US$82 billion by 2022 [25] (Fig. 1).
Fault tolerance, high power consumption, data-centric processing, and the limits of I/O and memory performance are a few of the driving forces reshaping HPC platforms on the way to exascale computing [26]. Data-intensive simulations and complex, time-critical data analytics require high performance data analytics solutions, for example in the intelligence community, data-driven science and engineering, machine learning, deep learning, and knowledge discovery. These competitive forces have pushed relatively new commercial companies (small and medium-sized enterprises, SMEs) into the HPC competency space. Fraud/anomaly detection, affinity marketing, business intelligence, and precision medicine are some of the promising new commercial HPC
market segments that require high performance data analytics. The use of HPDA will increase with time, further demanding the convergence of HPC and big data. HPDA is becoming an integral part of enterprises' future business investment plans for enhancing customer experience, anomaly detection, marketing, business intelligence, and security-breach detection, and for discovering new revenue opportunities.
5.1 The Internet of Things (IoT) and Smart Cities
The IoT links physical devices (computers, sensors, electronics) equipped with sensors to the Internet, with network connectivity enabling them to communicate. A common IoT platform brings heterogeneous information together and facilitates communication by providing a common language. According to Gartner [27], the installed base of IoT units will reach 20.8 billion by 2020, producing a massive amount of data that will further sharpen the challenges of security, customer privacy, storage management, and data-centric networks. Smart cities demand better and more inventive services to run the whole city smoothly and to improve people's lives through the innovative use of data.
Smart cities and the IoT are among the emerging HPDA application areas. HPC has been involved for quite some time in managing power grids, in the upstream design of vehicles, and in urban traffic management in smart cities, and its use will grow in the markets of cognitive computing/AI, driverless vehicles, and healthcare. El Baz [28] investigated the connection between IoT and HPC, highlighting challenges in smart-world applications (smart building management, smart logistics, and smart manufacturing) and opportunities for HPC-enabled solutions. China's HPC-IoT 2030 plan is based on the use of HPC for wellness management and security in IoT networks [29].
Fig. 1. HPDA market forecast [25]
5.2 Cognitive Technology
Cognitive systems are capable of understanding complex language constructs, correlating associations, and helping to rationalize information and discover insights. The keys to cognitive systems are learning, adaptability, and how the system evolves: they aid decision-making, the discovery of new ventures, improved production and operations, resource optimization, proactive identification of faults ahead of failure, etc. The motive of cognitive computing is to handle complex problems with little or no human intervention. According to an IBM estimate, 80% of data is unstructured, unusable by machines and not fully exploited. Cognitive computing can be seen as a potential candidate for exploring unstructured data to gain more useful insights and more efficient decision-making. The rapid growth of data from multidisciplinary domains requires powerful analytics, but human expertise to tackle such diverse and complicated problems is lacking. Cognitive computing allows people with less experience to interact with machines, thanks to advances in natural language processing and artificial intelligence technologies, e.g., Google DeepMind and Qualcomm's Zeroth platform. Advances in cognitive technology, with the integration of AI and machine learning into big data tools and platforms, will increase the quality of information and handle complex data analytics with less human intervention, but they require rapid data access (low latency), faster time to insight, and hardware acceleration for complex analytics [2]. Extracting information from vast amounts of data requires innovation in compute and storage technologies, which should provide cost-effective storage and improved performance within a desired time frame. The infrastructure requires cognitive storage, with a learning ability that lets computers store only relevant and important data. The computing side requires efficient processing, which demands high memory bandwidth and extreme-scale parallelism for efficient resource utilization within energy-efficiency constraints. The OpenPOWER Foundation [2] is an initiative in which diverse companies come together to provide technology solutions to a variety of problems. With data-centric computing, time to solution will be dramatically reduced. Cognitive computing is still in its infancy, but in future it will be a key technology for the success of modern businesses: gaining insights from vast amounts of unstructured data by leveraging computing technology to work better with the way humans want to work, and smoothing the natural relationship between humans and computers.
6 Design Patterns
The need for HPDA demands innovative ways to accelerate data and predictive analysis for the complex challenges mentioned above, through revolutionary and evolutionary changes in programming models, computer architecture, and runtime systems that accommodate interoperability and the scaling convergence of the HPC and big data ecosystems [2]. There is a growing need to explore novel techniques that allow HPC and big data applications to exploit billion-fold parallelism (exascale systems), improved data locality, unified storage systems, and synchronization, and ultimately a single system architecture that overcomes the cost and complexity of moving
data, which also improves the total cost of ownership and brings flexibility to manage workflows and maximize system utilization. Design patterns and skeletons are potential candidates for addressing these challenges, offering proven, applicable solutions for designing scalable, robust software in both the HPC and big data communities.
The parallel programming problem has been an active area of research for decades, focusing primarily on programming models and their supporting environments. As we move towards exascale (millions of components, billions of cores), programming parallel processors and handling billion-way parallelism is one of the major challenges facing the research community. Software architecture and design play a vital role in building robust and scalable software. Common sets of design elements, derived from domain experts' solutions, are captured in the design patterns of a particular domain to help software designers engineer robust, scalable parallel software. These patterns define the building blocks of all software engineering and are fundamental to architecting parallel software. Design problems at different levels of software development are addressed by arranging patterns into a layered hierarchy, and such design patterns have been developed to help software engineers architect and implement parallel software efficiently. Our Pattern Language (OPL) is one of the most prominent catalogues categorizing parallel patterns [30]. A design pattern provides a clean mechanism for addressing common design problems with generic guidelines.
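As a small, self-contained illustration of the idea (ours, not taken from OPL), the classic "map" pattern fixes the parallel structure once as a reusable skeleton, while the user supplies only the per-item function:

```python
# Sketch of the "map" parallel pattern as a reusable skeleton: the skeleton
# owns the parallel structure; callers plug in any picklable function.
from multiprocessing import Pool

def parallel_map(func, items, workers=4):
    """Apply func to every item using a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(func, items)

def square(x):
    return x * x

if __name__ == "__main__":
    print(parallel_map(square, range(10)))  # [0, 1, 4, ..., 81]
```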
Big data design patterns provide concrete representations of the analysis- and technology-centric patterns for the most commonly occurring problems in big data environments [31]. These design patterns provide the building blocks for efficient big data architectures. The standardization and integration of design patterns can be seen as a potential vehicle for the efficient and effective convergence of HPC and big data. Figure 2 shows the logical architecture of the different layers; design patterns (HPC and big data) can then be applied at distinct levels to address the issues related to big data and HPC convergence. One of the challenges associated with data visualization and interactive management is the huge volume, variety, and velocity of data, which often makes it hard to evaluate and reapply design solutions. The visualization and management layer involves applying patterns for distributed and parallel visualization, interactive data exploration, rendering of data visualizations, and real-time monitoring for live analysis and recommendations.
The analytics/processing layer includes patterns for analytics and, depending on the problem domain, for in-situ, in-transit, real-time, or batch processing. Advanced analytics requires predictions, advanced algorithms, simulations, and real-time decisions, which in turn require high performance computing to process and manage massive volumes of data [32].
There is a trade-off between performance, resilience, and power consumption. Trade-off patterns need to identify and accommodate these trade-offs in the best possible way, drawing on best practices from both the HPC and big data communities. The processing patterns include analytics patterns for unstructured and structured data, algorithms for converting unstructured to structured data, large-scale batch and graph-based processing patterns, and parallel design patterns. The access/storage layer includes design patterns for effective and efficient retrieval and storage in parallel and distributed file systems. This includes data-size reduction for high-volume hierarchical, linked, tabular, and binary data, and cognitive storage for real-time, indirect, and integrated access; cognitive storage, with its ability to learn, automates data purging by keeping only relevant and important data, yielding cost-effective storage and improved performance.

Fig. 2. Logical layered architecture of design patterns
The HPC software development community often lacks expertise in software engineering principles, even though these patterns define the building blocks of software engineering and are fundamental to architecting parallel software. Research effort should be invested in exploring innovative approaches that use design patterns and skeletons to overcome the scalability, elasticity, adaptability, robustness, storage, parallelization, and other processing challenges of a unified HPC and big data environment.
7 Conclusion
Increased processing power, the emergence of big data resources, and real-time analytical solutions are the prime drivers pushing the realm of big data. HPC and big data systems are different and have different architectures. The challenges associated with their inevitable integration are immense, and solutions are starting to emerge at distinct levels of the ecosystem. As we head towards convergence, we will have to deal with modality, complexity, and vast amounts of data. Currently we have distinct, and perhaps overlapping, sets of design choices at various levels of the infrastructure. What is needed is a single system architecture with enough configurability to serve different design points between compute-intensive and data-intensive workloads. A single system architecture overcomes the cost and complexity of moving data; it also improves the total cost of ownership and brings flexibility to manage workflows and maximize system utilization. Realizing these benefits requires coordinated design efforts around the key elements of the system, i.e., compute (multicore, FPGA), interconnect (next-generation fabrics), and memory (non-volatile memory, storage burst buffers, the Lustre file system). This coordinated effort may result in a usable, effective, and scalable software infrastructure.
The connected, ubiquitous synergy between HPC and big data is expected to deliver results that neither can achieve alone. Leading enterprises need HPC technology to efficiently explore huge volumes of heterogeneous data, moving beyond static searches into dynamic pattern discovery for competitive advantage. The integration of HPC computing power with the demand for quick, real-time big data analytics, together with cognitive technology (computer vision, machine learning, natural language processing), is seen as reshaping future technology for accelerating analytics and deriving meaningful insights for efficient decision-making.
Acknowledgments. The authors acknowledge with thanks the technical and financial support from the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under grant number G-661-611-38. The work carried out in this paper is supported by the HPC Center at King Abdulaziz University.
References
1. Singh, K., Kaur, R.: Hadoop: addressing challenges of big data. In: 2014 IEEE International Advance Computing Conference (IACC), pp. 686-689. IEEE (2014)
2. Charl, S.: IBM - HPC and HPDA for the Cognitive Journey with OpenPOWER. https://www-03.ibm.com/systems/power/solutions/bigdata-analytics/smartpaper/high-value-insights.html
3. Keable, C.: The convergence of High Performance Computing and Big Data - Ascent. https://ascent.atos.net/convergence-high-performance-computing-big-data/
4. Joseph, E., Sorensen, B.: IDC Update on How Big Data Is Redefining High Performance Computing. https://www.tacc.utexas.edu/documents/1084364/1136739/IDC+HPDA+Briefing+slides+10.21.2014_2.pdf
5. Geist, A., Lucas, R.: Whitepaper on the Major Computer Science Challenges at Exascale (2009)
6. Krishnan, S., Tatineni, M., Baru, C.: myHadoop - Hadoop-on-Demand on Traditional HPC Resources (2011)
7. Xuan, P., Denton, J., Ge, R., Srimani, P.K., Luo, F.: Big data analytics on traditional HPC infrastructure using two-level storage (2015)
8. Is Hadoop the New HPC? http://www.admin-magazine.com/HPC/Articles/Is-Hadoop-the-New-HPC
9. Katal, A., Wazid, M., Goudar, R.H.: Big data: issues, challenges, tools and good practices. In: 2013 Sixth International Conference on Contemporary Computing (IC3), pp. 404-409. IEEE (2013)
10. Hess, K.: Hadoop vs. Spark: The New Age of Big Data. http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html
11. Muhammad, J.: Is Apache Spark going to replace Hadoop? http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
12. OLCF Staff Writer: OLCF Group to Offer Spark On-Demand Data Analysis. https://www.olcf.ornl.gov/2016/03/29/olcf-group-to-offer-spark-on-demand-data-analysis/
13. Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Shankar, D., Panda, D.K.: Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 101-110. IEEE (2015)
14. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp. 13-24. IEEE (2007)
15. Tiwari, N., Sarkar, S., Bellur, U., Indrawan, M.: An empirical study of Hadoop's energy efficiency on a HPC cluster. Procedia Comput. Sci. 29, 62-72 (2014)
16. Woodie, A.: Does InfiniBand Have a Future on Hadoop? http://www.datanami.com/2015/08/04/does-infiniband-have-a-future-on-hadoop/
17. Veiga, J., Expósito, R.R., Taboada, G.L., Touriño, J.: Analysis and Evaluation of Big Data Computing Solutions in an HPC Environment (2015)
18. Wang, Y., et al.: Assessing the performance impact of high-speed interconnects on MapReduce. In: Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds.) WBDB 2012. LNCS, vol. 8163, pp. 148-163. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-53974-9_13
19. Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Panda, D.K.: Can parallel replication benefit Hadoop distributed file system for high performance interconnects? In: 2013 IEEE 21st Annual Symposium on High-Performance Interconnects, pp. 75-78. IEEE (2013)
20. Moore, J., Chase, J., Ranganathan, P., Sharma, R.: Making scheduling cool: temperature-aware workload placement in data centers (2005)
21. Reed, D.A., Dongarra, J.: Exascale computing and big data. Commun. ACM 58, 56-68 (2015)
22. Rajovic, N., Puzovic, N., Vilanova, L., Villavieja, C., Ramirez, A.: The low-power architecture approach towards exascale computing. In: Proceedings of the Second Workshop on Scalable Algorithms for Large-Scale Systems - ScalA 2011, p. 1. ACM Press, New York (2011)
23. Cappello, F.: Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl. 23, 212-226 (2009)
24. Gutierrez, D.: The Convergence of Big Data and HPC - insideBIGDATA. https://insidebigdata.com/2016/10/25/the-convergence-of-big-data-and-hpc/
25. High Performance Data Analytics (HPDA) Market - Forecast 2022. https://www.marketresearchfuture.com/reports/high-performance-data-analytics-hpda-market
26. Willard, C.G., Snell, A., Segervall, L., Feldman, M.: Top Six Predictions for HPC in 2015 (2015)
27. Gartner: Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017, Up 31 Percent From 2016. Egham, UK. http://www.gartner.com/newsroom/id/3598917
28. El Baz, D.: IoT and the need for high performance computing. In: 2014 International Conference on Identification, Information and Knowledge in the Internet of Things, pp. 1-6. IEEE (2014)
29. Conway, S.: High Performance Data Analysis (HPDA): HPC - Big Data Convergence - insideHPC (2017)
30. Keutzer, K., Mattson, T.: Our Pattern Language (OPL) (2016). Keutzer: EECS, UC Berkeley; Mattson: Intel
31. Bodkin, R.: Big Data Patterns, pp. 1-23 (2017)
32. Mysore, D., Khupat, S., Jain, S.: Big data architecture and patterns, Part 1: Introduction to big data classification and architecture. https://www.ibm.com/developerworks/library/bd-archpatterns1/index.html
... HPC has been applied to SpMV/linear algebra [30][31][32][33], and other problems for several decades. Big data and data-driven approaches [26,[34][35][36] have been used relatively recently in scientific computing to address HPC related challenges, and this has given rise to the convergence of HPC and big data [37,38]. Moreover, artificial intelligence (AI) is increasingly being used to improve big data, HPC, scientific computing, and other problem domains. ...
... Some of the optimization techniques have been developed to target the hardware heterogeneity and complexities. But there is no single format that works well on all hardware platforms [37]. There is not much work related to the optimization of SpMV using machinelearning techniques. ...
SpMV is a vital computing operation of many scientific, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors affect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, different levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and efficient ways of exploiting parallelism. In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best configuration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational fluid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods, decision trees, random forest, gradient boosting, and is evaluated in depth. A discussion on the applicability of our proposed tool to energy efficiency optimization of SpMV computations is given. This is the first work where the sparsity structure of matrices have been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.
... Driving high efficiency from shared-memory and distributed-memory HPC systems have always been challenging. Big data and HPC convergence, system heterogeneity, cloud computing, and many other developments have increased the complexities of HPC systems [36,51,77,108,124]. There are increasing pressures on energy-efficiency for developing exascale computers and therefore development of highly efficient HPC applications and systems have become essential. ...
... Big data technologies are being used in many application areas that require HPC to address big data challenges, see e.g., [5,79,86,95,117,118]. There are many ongoing efforts on the convergence of HPC and big data [36, 108,124]. ...
High-performance computing (HPC) plays a key role in driving innovations in health, economics, energy, transport, networks, and other smart-society infrastructures. HPC enables large-scale simulations and processing of big data related to smart societies to optimize their services. Driving high efficiency from shared-memory and distributed HPC systems have always been challenging; it has become essential as we move towards the exascale computing era. Therefore, the evaluation, analysis, and optimization of HPC applications and systems to improve HPC performance on various platforms are of paramount importance. This paper reviews the performance analysis tools and techniques for HPC applications and systems. Common HPC applications used by the researchers and HPC benchmarking suites are discussed. A qualitative comparison of various tools used for the performance analysis of HPC applications is provided. Conclusions are drawn with future research directions.
... On the one hand, scientists demand converged applications to obtain new insights into various scientific domains. The fusion of HPC and big data emanates high-performance data analytics (HPDA) to extract values from massive scientific datasets via extreme data analytics at scale [4] . The fusion of HPC and AI emanates AI-enhanced HPC, which aims to improve traditional HPC models by optimizing the parameter selections or training an AI model as an al-ternative component [5] . ...
- Yu-Tong Lu
- Peng Cheng
- Zhi-Guang Chen
With the convergence of high-performance computing (HPC), big data and artificial intelligence (AI), the HPC community is pushing for "triple use" systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing these data management challenges on a typical HPC system at the parallel file system, data management middleware, and user application levels. As HPC systems are approaching the border of exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.
... Experimental study of deploying MapReduce over Lustre with various shuffle and placement strategies of intermediate data is done in [25]. The architecture for convergence of big data and HPC systems based on design patterns is proposed in [26]. Summary report on big data and exascale computing (BDEC) contributed by eminent researchers in the domain of HPC and analytics has been put forth in [27]. ...
The dawn of exascale computing and its convergence with big data analytics has greatly spurred research interests. The reasons are straightforward. Traditionally, high performance computing (HPC) systems have been used for scientific applications involving majority of compute-intensive tasks. At the same time, the proliferation of big data resulted into design of data-intensive processing paradigms like Apache big data stack. Big data generating at high pace necessitates faster processing mechanisms for getting insights at a real time. For this, the HPC systems may serve as panacea for solving the big data problems. Though the HPC systems have the capability to give the promising results for big data, directly integrating them with existing data-intensive frameworks like Apache big data stack is not straightforward due to challenges associated with them. This triggers a research on seamlessly integrating these two paradigms based on interoperable framework, programming model, and system architecture. The aim of this paper is to assess a progress made in HPC world as an effort to augment it with big data analytics support. As an outcome of this, the taxonomy showing the factors to be considered for augmenting HPC systems with big data support has been put forth. This paper sheds light upon how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of big data and exascale computing ecosystem. The focus is given on research issues related to augmenting HPC paradigms with big data frameworks and corresponding approaches to address those issues. This paper also discusses data-intensive as well as compute-intensive processing paradigms, benchmark suites and workloads, and future directions in the domain of integrating HPC with big data analytics.
... The authors concluded that there is a need to implement the aforementioned policies for data sharing and the security of personal information, which will have long-term impacts on big data analytics. In addition, there have been numerous research studies focusing on smart infrastructure [2,58], healthcare [59][60][61][62], transport [6][63][64][65][66][67][68][69], and other applications [70,71]. ...
The outburst of data produced over the last few years in various fields has demanded new processing techniques, novel big data–processing architectures, and intelligent algorithms for effective and efficient exploitation of huge data sets to get useful insights and improved knowledge discovery. The explosion of data brings many challenges to deal with the complexity of information overload. Numerous tools and techniques have been developed over the years to deal with big data challenges. This chapter presents a summary of state-of-the-art tools and techniques for processing of big data applications by critically analyzing their objectives, methodologies, and key approaches to address the challenges associated with big data. Also, we critically analyze some of the core applications of big data and their impacts in improving the quality of human life by primarily focusing on healthcare and smart city applications, genome sequence annotation applications, and graph-based applications. We provide a detailed review and taxonomy of the research efforts within each application domain.
... The open source software culture has aided the development of many new distributed and collaborative applications, paving the way for integrated systems and hence smart cities. Many new smart city applications are being developed, such as in transport [5][6][7][8][9][10][11][12], healthcare [13][14][15][16], infrastructure [17,18], and other applications [19,20]. ...
The use of open source software has increased tremendously in the last few decades, paving the way for many innovations such as the Internet of Things (IoT) and smart cities. Open data licenses have also become prevalent with the emergence of big data and relevant technologies. These developments have given rise to the "Share more—Develop less" culture, which in turn has raised new legal issues. The community has been developing many new licenses to address these emerging legal issues. However, selecting the right license is becoming increasingly difficult due to licensing complexities and the continuous arrival of new licenses. This chapter reviews notable open source and open data licenses and the suitability of these licenses for various kinds of data and software. Subsequently, we propose frameworks for the selection of open source software and open data licenses. Conclusions are drawn with recommendations for future work.
Human Robot Collaboration (HRC) is considered a major enabler for achieving flexibility and reconfigurability in modern production systems. The motivation for HRC applications arises from the potential of combining human operators' cognition and dexterity with the robot's precision, repeatability, and strength, which can increase the system's adaptability and performance at the same time. To exploit this synergy to its full extent, production engineers must be equipped with the means for optimally allocating tasks to the available resources as well as setting up appropriate workplaces to facilitate HRC. This chapter discusses existing approaches and methods for task planning in HRC environments, analysing the requirements for implementing such decision-making strategies. The chapter also highlights future trends for progressing beyond the state of the art in this scientific field, exploiting the latest advances in Artificial Intelligence and Digital Twin techniques.
- Montserrat Gómez-Márquez
- Ana Lilia Ruiz-Hernández
- Martín González Sóbal
- Miguel Eduardo Rosas Baltazar
A Digital and Accounting Information Transformation Center (CTICD) is a physical space that offers services such as the loan of computer equipment, accessories, and/or digital tools (software, technical support, etc.). Each CTICD has tangible and intangible service attributes that distinguish it and make it attractive to users. The objective of this work was to identify the attributes that users of a CTICD consider most influential to its existence. The data collection instrument was validated through Pearson's correlation, and the dependence between the tangible and intangible attributes of a CTICD was calculated. The results were obtained with the chi-square statistical test of independence, and the theoretical chi-square values for pairs of variables were plotted. The contribution of this research is the identification of the importance of intangible attributes in users' perception and differentiation of a CTICD's service, highlighting the humanistic dimension of attention to users and placing it before any tangible attribute.
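For reference, the chi-square test of independence mentioned above follows the standard textbook formulation (a generic form, not reproduced from this study):

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
\]

where O_ij and E_ij are the observed and expected counts in cell (i, j), R_i and C_j are the row and column totals, and N is the sample size. Independence between the tangible and intangible attributes is rejected when chi-square exceeds the critical value at (r-1)(c-1) degrees of freedom.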
Road transportation is among the global grand challenges affecting human lives, health, society, and the economy, owing to road accidents, traffic congestion, and other transportation deficiencies. Autonomous vehicles (AVs) are set to address major transportation challenges including safety, efficiency, reliability, sustainability, and personalization. The foremost challenge for AVs is to perceive their environments in real time with the highest possible certainty. Relatedly, connected vehicles (CVs) have been another major driver of innovation in transportation. In this paper, we bring autonomous and connected vehicles together and propose TAAWUN, a novel approach based on the fusion of data from multiple vehicles. The aim herein is to share information between multiple vehicles about their environments, enhance the information available to the vehicles, and make better decisions regarding the perception of their environments. TAAWUN shares, among the vehicles, visual data acquired from cameras installed on individual vehicles, as well as the perceived information about the driving environments. The environment is perceived using deep learning, random forest (RF), and C5.0 classifiers. A key aspect of the TAAWUN approach is that it uses problem-specific feature sets to enhance prediction accuracy in challenging environments such as problematic shadows, extreme sunlight, and mirage. TAAWUN has been evaluated using multiple metrics: accuracy, sensitivity, specificity, and area under the curve (AUC). It performs consistently better than the base schemes. Directions for future work to extend the tool are provided. This is the first work where visual information and decision fusion are used in CAVs to enhance environment perception for autonomous driving.
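The evaluation metrics named above are all derived from a binary confusion matrix. The following minimal Java sketch (illustrative only, not part of the TAAWUN implementation; the counts in main are hypothetical) shows how accuracy, sensitivity, and specificity follow from true/false positive and negative counts:

// Illustrative helper: computes the evaluation metrics named above
// from a binary confusion matrix. Not part of the TAAWUN codebase.
public final class ConfusionMetrics {
    public static double accuracy(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    public static double sensitivity(long tp, long fn) { // true positive rate
        return (double) tp / (tp + fn);
    }
    public static double specificity(long tn, long fp) { // true negative rate
        return (double) tn / (tn + fp);
    }
    public static void main(String[] args) {
        long tp = 90, tn = 85, fp = 15, fn = 10; // hypothetical counts
        System.out.printf("accuracy=%.3f sensitivity=%.3f specificity=%.3f%n",
                accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp));
    }
}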
- Daniel A. Reed
- Jack Dongarra
Daniel A. Reed and Jack Dongarra state that scientific discovery and engineering innovation require unifying the traditionally separated fields of high-performance computing and big data analytics. Big data machine learning and predictive data analytics have been considered the fourth paradigm of science, allowing researchers to extract insights from both scientific instruments and computational simulations. A rich ecosystem of hardware and software has emerged for big data analytics, similar to that of high-performance computing.
Data-intensive computing has become one of the major workloads on traditional high-performance computing (HPC) clusters. Currently, deploying data-intensive computing software frameworks on HPC clusters still faces performance and scalability issues. In this paper, we develop a new two-level storage system by integrating Tachyon, an in-memory file system, with OrangeFS, a parallel file system. We model the I/O throughputs of four storage structures: HDFS, OrangeFS, Tachyon, and the two-level storage. We conduct computational experiments to characterize the I/O throughput behavior of the two-level storage and compare its performance to that of HDFS and OrangeFS using the TeraSort benchmark. Theoretical models and experimental tests both show that the two-level storage system can increase aggregate I/O throughput. This work lays a solid foundation for future work in designing and building HPC systems that can provide better support for I/O-intensive workloads while preserving existing computing resources.
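The paper's own throughput models are not reproduced here, but a simple illustrative model of two-tier storage (our simplification, not the authors' formulation) conveys the intuition behind the result. If a fraction f of the I/O volume is served from the in-memory tier at throughput T_mem and the remainder from the parallel file system at throughput T_pfs, the effective throughput is

\[
T_{\text{eff}} = \left( \frac{f}{T_{\text{mem}}} + \frac{1 - f}{T_{\text{pfs}}} \right)^{-1}
\]

Since T_mem is typically much larger than T_pfs, even a modest in-memory fraction raises the aggregate throughput, which is qualitatively consistent with the two-level design outperforming a single disk-backed tier.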
The MapReduce programming model is commonly used for efficient scientific computations, as it executes tasks in a parallel and distributed manner on large data volumes. HPC infrastructure can effectively increase the parallelism of MapReduce tasks. However, such execution incurs high energy and data transmission costs. Here we empirically study how the energy efficiency of a MapReduce job varies with increases in parallelism and network bandwidth on an HPC cluster. We also investigate the effectiveness of power-aware systems in managing the energy consumption of different types of MapReduce jobs. We find that for some jobs the energy efficiency degrades at a high degree of parallelism, while for others it improves at low CPU frequency. Consequently, we suggest strategies for configuring the degree of parallelism, network bandwidth, and power management features in an HPC cluster for energy-efficient execution of MapReduce jobs.
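As a concrete illustration of the parallelism knobs discussed above, the sketch below (a minimal example with illustrative values, not the authors' experimental setup) configures the degree of parallelism of a Hadoop job through standard MapReduce settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch of the parallelism knobs discussed above; the values
// are illustrative, not the authors' experimental configuration.
public class ParallelismConfig {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "energy-aware-mapreduce");
        // Reduce-side degree of parallelism.
        job.setNumReduceTasks(8);
        // Map-side parallelism follows from the split size: smaller splits
        // mean more map tasks (256 MB here, a hypothetical choice).
        job.getConfiguration().setLong(
                "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);
        return job;
    }
}

CPU frequency, by contrast, is managed outside Hadoop, for example through the Linux cpufreq governor on the cluster nodes.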
- Didier El Baz
The connection between the Internet of Things (IoT) and High Performance Computing (HPC) is investigated in this keynote presentation. New paradigms and devices for HPC are presented. Several examples related to smart building management, smart logistics, and smart manufacturing, all leading to difficult combinatorial optimization problems, are detailed.
- Kamalpreet Singh
- Ravinder Kaur
Hadoop is an open source cloud computing platform of the Apache Foundation that provides a software programming framework called MapReduce and a distributed file system, HDFS. It is a Linux-based set of tools that uses commodity hardware, which is relatively inexpensive, to handle, analyze, and transform large quantities of data. The Hadoop Distributed File System (HDFS) stores huge data sets reliably and streams them to user applications at high bandwidth, while MapReduce is a framework used for processing massive data sets in a distributed fashion over several machines. This paper gives a brief overview of Big Data, Hadoop MapReduce, and the Hadoop Distributed File System along with its architecture.
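To make the programming model concrete, the canonical WordCount program below shows the Mapper/Reducer structure that Hadoop MapReduce exposes. This is the standard introductory example against the public Hadoop API, lightly commented, rather than anything specific to the paper:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emits (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reduce phase: sums the counts for each word after the shuffle.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}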
Hadoop is a successful open-source implementation of the MapReduce programming model and has been widely adopted by many leading industry companies for big data analytics. However, its intermediate data shuffling is a time-consuming operation that impacts the total execution time of MapReduce programs. Recently, a growing number of organizations have become interested in addressing this issue by leveraging high-performance interconnects, such as InfiniBand and 10 Gigabit Ethernet, which have been popular in the High-Performance Computing (HPC) community. There has been a lack of comprehensive examination of the performance impact of these interconnects on MapReduce programs. In this work, we systematically evaluate the performance impact of two popular high-speed interconnects, 10 Gigabit Ethernet and InfiniBand, using the original Apache Hadoop and our extended Hadoop Acceleration framework. Our analysis shows that, under Apache Hadoop, although fast networks can efficiently accelerate jobs with small intermediate data sizes, they cannot maintain such advantages for jobs with large intermediate data. In contrast, Hadoop Acceleration provides better performance for jobs across a wide range of data sizes. In addition, both implementations exhibit good scalability under different networks, and Hadoop Acceleration significantly reduces the CPU utilization and I/O wait time of MapReduce programs.
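A common mitigation for shuffle-bound jobs, regardless of interconnect, is compressing map output so that less intermediate data crosses the network. The properties below are standard Hadoop 2.x configuration keys; the snippet itself is only a sketch, unrelated to the Hadoop Acceleration framework evaluated in the paper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class ShuffleTuning {
    public static Configuration shuffleFriendlyConf() {
        Configuration conf = new Configuration();
        // Compress intermediate (map-side) output to shrink shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Number of parallel copier threads fetching map output per reducer.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        return conf;
    }
}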
HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA over both default HDFS and Lustre.
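Triple-H's actual placement policies are more elaborate, but a toy sketch of tier selection by data temperature (all names, thresholds, and the policy itself are hypothetical) illustrates the general idea of steering blocks across RAM, SSD, and HDD:

// Toy illustration of tiered block placement; the policy, thresholds,
// and names here are hypothetical, not Triple-H's actual design.
public class TierPlacement {
    enum Tier { RAM, SSD, HDD }

    // Place hot, small blocks in RAM; warm blocks on SSD; the rest on HDD.
    static Tier placeBlock(double accessesPerHour, long blockBytes) {
        if (accessesPerHour > 100 && blockBytes <= 128L * 1024 * 1024) return Tier.RAM;
        if (accessesPerHour > 10) return Tier.SSD;
        return Tier.HDD;
    }

    public static void main(String[] args) {
        System.out.println(placeBlock(500, 64L * 1024 * 1024));  // RAM
        System.out.println(placeBlock(50, 256L * 1024 * 1024));  // SSD
        System.out.println(placeBlock(1, 64L * 1024 * 1024));    // HDD
    }
}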
The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications due to its reliability and fault tolerance. HDFS provides fault-tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, resulting in higher replication times. Such large replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE, and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high-performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme is able to outperform the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark. We are also able to increase the throughput reported by the TestDFSIO benchmark by up to 12%. The proposed parallel replication also enhances HBase Put operation performance by 17%. However, for lower-performance networks like 1 GigE and smaller data sizes, parallel replication does not benefit performance.
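The contrast between the two schemes can be sketched in a few lines. In this simplified model (writeTo is a hypothetical stand-in for a per-DataNode block transfer, not real HDFS code), pipelined replication visits replicas in sequence, so client-visible latency is roughly the sum of per-hop transfers, while parallel replication issues all transfers concurrently, bounding latency by the slowest single transfer:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified contrast of the two replication schemes; writeTo() is a
// hypothetical stand-in for a block transfer, not the HDFS implementation.
public class ReplicationSchemes {
    interface DataNode { void writeTo(byte[] block); }

    // Pipelined: each replica is written in sequence (client -> DN1 -> DN2 -> DN3);
    // latency is roughly the sum of the per-hop transfer times.
    static void pipelined(List<DataNode> replicas, byte[] block) {
        for (DataNode dn : replicas) dn.writeTo(block);
    }

    // Parallel: all replicas are written concurrently; latency is bounded
    // by the slowest single transfer instead of the sum of all transfers.
    static void parallel(List<DataNode> replicas, byte[] block) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        for (DataNode dn : replicas) pool.submit(() -> dn.writeTo(block));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}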
Big data is defined as data of such large volume that new technologies and architectures are required to extract value from it through capture and analysis. Due to this large size, it becomes very difficult to perform effective analysis using existing traditional techniques. Big data, due to its various properties such as volume, velocity, variety, variability, value, and complexity, puts forward many challenges. Since big data is a recent, upcoming technology that can bring huge benefits to business organizations, it is necessary that the various challenges and issues associated with adopting and adapting to this technology are brought to light. This paper introduces big data technology along with its importance in the modern world, and describes existing projects that are effective and important in transforming the concept of science into big science and benefiting society. The various challenges and issues in adapting and accepting big data technology and its tools (Hadoop) are also discussed in detail, along with the problems Hadoop is facing. The paper concludes with good big data practices to be followed.