
Big Data and HPC Convergence: The Cutting Edge and Outlook

Sardar Usman¹, Rashid Mehmood², and Iyad Katib¹

¹ Department of Computer Science, FCIT, King Abdulaziz University, Jeddah 21589, Saudi Arabia
usmansardar@hotmail.com, iakatib@kau.edu.sa
² High Performance Computing Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia
RMehmood@kau.edu.sa

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2018. R. Mehmood et al. (Eds.): SCITA 2017, LNICST 224, pp. 11-26, 2018. https://doi.org/10.1007/978-3-319-94180-6_4

Abstract. Data has grown on a massive scale over the last couple of decades, and the challenges associated with big data have grown with it. The issues raised by this avalanche of data are immense and cover a variety of challenges that need careful consideration. The use of High Performance Data Analytics (HPDA) is increasing at a brisk pace in many industries, expanding the HPC market into these new territories. HPC and big data are different systems, not only at the technical level but also in their ecosystems. The world of workloads is diverse enough, and performance sensitivity high enough, that no single globally optimal solution can address all the issues raised by the convergence of big data and HPC. As we head towards exascale systems, the necessary integration of big data and HPC is a hot research topic, but one still at a very early stage. The two systems have different architectures, and their integration brings many challenges. The main aim of this paper is to identify the driving forces, challenges, and current and future trends associated with the integration of HPC and big data. We also propose an architecture for big data and HPC convergence using design patterns.

Keywords: HPC · Big data · Hadoop · HPDA · Design patterns · IoT · Smart cities · Cognitive computing

1 Introduction

Over the years, HPC has contributed greatly to scientific discovery, improved engineering designs, enhanced manufacturing, fraud detection, health care, and national security, and has thus played a crucial role in improving the quality of human life. The world has seen exponential data growth due to social media, mobility, e-commerce, and other factors. The major chunk of this data has been generated in the last few years alone, and it is growing at an ever more rapid rate [1]. To deal with the ever-growing volume of data, researchers have been developing algorithms to accelerate the extraction of key information from massive datasets. Big data is a buzzword that has caught a lot of attention in recent years. It refers to massive amounts of structured, semi-structured, and unstructured data collected from different sources that cannot be stored and processed by traditional databases and software techniques.

Historically, only the largest companies, government research organizations, and academic computing centers have had access to the computing power necessary to reach valuable conclusions in a reasonable amount of time. All that is rapidly changing with vast improvements in the price, performance, availability, and density of compute power.

The balance between data and computing is affected by the urgency of the solution, e.g., the need for real-time results, and also depends on what we are trying to achieve. As the volume of data grows, processing it in real time becomes more challenging. It has been projected that in 2018 over 4.3 exabytes of data will be created on a daily basis [2]. Over the years, the HPC community has never been short of huge volumes of data, e.g., in climate modeling, design and manufacturing, and financial services, which has resulted in high-fidelity models and interdisciplinary analysis that explore data for deeper insights. The use of High Performance Data Analytics (HPDA) is increasing at a brisk pace in many industries, expanding the HPC market into these new territories.

Powerful analytics is key to extracting value from data while confronting budget and marketing challenges, and it plays a huge role in making plans, predicting business trends, and understanding customer demands. Choosing the right solution depends on the size of the data, the urgency of results, predictions about the need for more processing power as data grows, fault tolerance for applications in case of hardware failure, data rates, and scalability. Real-time applications with strict response-time requirements, especially when dealing with huge volumes of data, remain a challenging task and are one of the driving forces towards the convergence of big data and HPC.

HPC and big data are different systems, not only at the technical level but also in their ecosystems. They have different programming models, resource managers, file systems, and hardware. HPC systems were mainly developed for compute-intensive applications, but data-intensive applications have recently become a major workload in HPC environments. Owing to these advances in data-intensive applications, a number of software frameworks have been developed for distributed systems, cluster resource management, parallel programming models, and machine learning. High performance computing has well-established standard programming models, e.g., OpenMP and MPI. Big data analytics has grown up from a different perspective, with a different population of developers who use Java and other high-level languages, with a primary focus on simplicity of use, so that a problem domain can be addressed without detailed knowledge of HPC. These differences in infrastructure, resource management, file systems, and hardware make system integration a challenging task.

As data grows bigger in volume, so does the need for high-end computing. The HPC community has been dealing with massive amounts of data and big data analytics for years, and the solutions that have evolved to deal with large volumes of data should be useful for big data analytics [3]. The main aim of this paper is to identify the motivation and driving forces behind the integration of HPC and big data, and to highlight the current trends, challenges, benefits, and future aspects of a unified, integrated system. We also present an architecture for the convergence of HPC and big data using design patterns.


The rest of the paper is organized as follows. The next section examines the differences between the HPC and Hadoop frameworks with respect to hardware, resource management, fault tolerance, and programming models. A literature survey is presented in Sect. 3, and convergence challenges are discussed in Sect. 4, followed by future directions in Sect. 5. The architecture using design patterns for the convergence of HPC and big data is presented in Sect. 6, and the paper is concluded in the final section.

2 HPC and Big Data Frameworks and Their Differences

Different solutions have emerged over the years to deal with big data issues and have been implemented successfully. Nevertheless, these solutions do not satisfy the ever-growing needs of big data. The issues related to big data are immense and cover a variety of challenges that need careful consideration, for example data representation, data reduction/compression, data confidentiality, energy management, high dimensionality, scalability, real-time and distributed computation, unstructured data processing, analytical mechanisms, and computational complexity. The exponential outburst of data and the rapidly increasing demand for real-time analytical solutions urge the convergence of high-end commercial analytics and HPC. Business intelligence and analytical solutions today suffer from a lack of support for predictive analytics, a lack of data granularity, inflexible software for manipulating data, non-intuitive user interfaces, relevant information that is not aggregated in the required manner, and slow system performance [4].

The HPC community has long dealt with complex data- and compute-intensive applications, and its solutions have evolved over the years. As the volume of data increases at a brisk pace, so do the associated challenges, i.e., data analysis, minimizing data movement, data storage, data locality, and efficient searching. As we head towards the exascale era, the increase in system concurrency introduces a massive challenge for system software to manage applications at extreme levels of parallelism. Large-scale applications use the most widely deployed message-passing programming model, MPI, along with traditional sequential languages, but architectural changes (many-core chips) and the high demand for parallelism make this programming model less productive for exascale systems. Billion-fold parallelism is required to exploit the performance of extreme-scale machines, and locality is critical in terms of energy consumption. As the complexity and scale of software requirements rise, a simple execution model becomes a critical requirement, since it reduces the application programming complexity needed to achieve extreme-scale parallelism. Current trends in the HPC market include the use of advanced interconnects and RDMA protocols (InfiniBand, 10/40 Gigabit Ethernet/iWARP, RDMA over Converged Enhanced Ethernet), enhanced redesigns of HPC middleware (MPI, PGAS), SSDs, NVRAM, and burst buffers. Scalable parallelism, synchronization, minimizing communication, task scheduling, the memory wall, heterogeneous architectures, fault tolerance, software sustainability, memory latencies, simple execution environments, and dynamic memory access for data-intensive applications are some of the core areas that require considerable time and effort to address exascale challenges [5]. The differences between the Hadoop and HPC frameworks are highlighted in the following sections.


2.1 Hardware

Most modern HPC and Hadoop clusters are built from commodity hardware. In an HPC environment, compute nodes are separated from data nodes. There are two types of data storage: a temporary file system on the local nodes and a persistent, globally shared parallel file system on the data nodes. Existing HPC clusters have a limited amount of storage on each compute node. Lustre is the most widely used parallel file system in HPC, and almost 60% of the top 500 supercomputers use Lustre as their persistent storage. Data needs to be transferred from the data nodes to the local file system on each compute node for processing. Data sharing is easy with distinct data and compute nodes, but spatial locality of data is an issue [6, 7].
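The staging step described above can be illustrated with the following minimal sketch, which uses only the Python standard library; the mount points are hypothetical placeholders, and the checksum stands in for real processing. Input is copied from the shared parallel file system to node-local scratch space, processed locally, and the result is staged back:

    import shutil
    from pathlib import Path

    # Hypothetical mount points; real clusters differ.
    SHARED = Path("/lustre/project/dataset.bin")   # persistent parallel file system
    SCRATCH = Path("/tmp/scratch/dataset.bin")     # node-local temporary file system
    RESULT = Path("/lustre/project/result.txt")

    SCRATCH.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(SHARED, SCRATCH)                   # stage in: shared -> local

    data = SCRATCH.read_bytes()                    # compute on the local copy
    checksum = sum(data) % 256                     # stand-in for real work

    RESULT.write_text(str(checksum))               # stage out: local -> shared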

A Hadoop cluster, by contrast, uses local disk space as its primary storage. The same node serves as both a data node and a compute node. Computational tasks are scheduled on the machine where the data resides, resulting in enhanced data locality. Hadoop is a write-once, read-many framework. The I/O throughput of Hadoop is much higher due to the co-location of data and compute on the same machine [7].

2.2 Resource Management

Another major difference between Hadoop and HPC clusters is resource management. Hadoop's name node runs the job tracker daemon. The job tracker supervises all map-reduce tasks and communicates with the task trackers on the data nodes. Compared to Hadoop's integrated job scheduler, HPC scheduling is done with the help of specialized tools such as Grid Engine and LoadLeveler [8], with controlled resources (memory, time) provided to the user.

2.3 Fault Tolerance

HPC resource schedulers use a checkpoint mechanism for fault tolerance. In case of node failure, the job is rescheduled from the last stored checkpoint; if no checkpoint mechanism is used, the whole process must be restarted. Hadoop, on the other hand, uses the job tracker for fault tolerance. As data and computation are co-located on the same machine, the job tracker can detect a node failure at run time and re-assign the task to a node where a duplicate copy of the data resides [8, 9].
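To illustrate the checkpoint/restart idea in its simplest form, the sketch below (our own illustration using only the Python standard library, not code from any HPC scheduler; the file name state.pkl and the loop body are placeholders) periodically saves an iterative computation's state to disk and resumes from the last checkpoint after a failure:

    import os
    import pickle

    CHECKPOINT = "state.pkl"  # hypothetical checkpoint file name

    def load_state():
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"iteration": 0, "total": 0.0}

    def save_state(state):
        # Write to a temporary file, then rename atomically, so a crash
        # mid-write cannot corrupt the previous checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    state = load_state()
    for i in range(state["iteration"], 1_000_000):
        state["total"] += i * 0.5    # stand-in for one step of real work
        state["iteration"] = i + 1
        if state["iteration"] % 10_000 == 0:
            save_state(state)        # checkpoint every 10,000 iterations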

2.4 Programming Model

Hadoop uses the map-reduce programming model, which makes life easier for programmers: they need only define a map step and a reduce step, compared to the programming effort needed for HPC applications. In an HPC environment, the programmer must take fine-grained responsibility for managing communication, I/O, debugging, synchronization, and checkpointing. All of these tasks demand considerable effort and time to implement effectively and efficiently. Although Hadoop is written in Java, it provides a low-level interface to write and run map-reduce applications in any language. Table 1 summarizes the differences between the HPC and Hadoop frameworks [7].

Table 1. HPC vs. Hadoop ecosystem

|                                  | Big data                          | HPC                                                  |
|----------------------------------|-----------------------------------|------------------------------------------------------|
| Programming model                | Java applications, SparQL        | Fortran, C, C++                                      |
| High-level programming           | Pig, Hive, Drill                  | Domain-specific languages                            |
| Parallel runtime                 | Map-reduce                        | MPI, OpenMP, OpenCL                                  |
| Data management                  | HBase, MySQL                      | iRODS                                                |
| Scheduling (resource management) | YARN                              | SLURM (Simple Linux Utility for Resource Management) |
| File system                      | HDFS, Spark (local storage)       | Lustre (remote storage)                              |
| Storage                          | Local shared-nothing architecture | Remote shared parallel storage                       |
| Hardware for storage             | HDDs                              | SSDs                                                 |
| Interconnect                     | Switched Ethernet                 | Switched fiber                                       |
| Infrastructure                   | Cloud                             | Supercomputer                                        |
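To make the contrast in programming effort from Sect. 2.4 concrete, the following minimal sketch (our own illustration; sample.txt is a placeholder input file) expresses a word count first in the map-reduce style, where the programmer defines only two functions, and then with explicit MPI via the mpi4py bindings, where partitioning and communication must be coded by hand:

    from collections import Counter

    # Map-reduce style: the framework handles distribution and scheduling;
    # the programmer supplies only these two functions.
    def mapper(line):
        return [(word, 1) for word in line.split()]

    def reducer(word, counts):
        return word, sum(counts)

    # Equivalent MPI style (mpi4py): data distribution and communication
    # are the programmer's explicit responsibility.
    # Run with, e.g.: mpirun -n 4 python wordcount.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        lines = open("sample.txt").read().splitlines()
        chunks = [lines[i::size] for i in range(size)]  # manual partitioning
    else:
        chunks = None

    local_lines = comm.scatter(chunks, root=0)      # explicit communication
    local_counts = Counter(w for l in local_lines for w in l.split())
    all_counts = comm.gather(local_counts, root=0)  # explicit communication

    if rank == 0:
        total = Counter()
        for c in all_counts:
            total.update(c)
        print(total.most_common(10))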


Hadoop and Spark are both big data frameworks that perform similar tasks; they are not mutually exclusive and are able to work together. Spark is mostly used on top of Hadoop, and Spark's advanced analytics are applied to data stored in Hadoop's distributed file system (HDFS). Spark can run as a Hadoop module through YARN or as a standalone solution [10], and it can be seen as an alternative to map-reduce rather than a replacement for the Hadoop framework. Spark is much faster than Hadoop because it handles operations in memory, copying data from the distributed file system into faster logical RAM. Map-reduce writes all data back to the distributed storage system after each iteration to ensure full recovery, whereas Spark arranges data in resilient distributed datasets (RDDs) that can be fully recovered in case of failure. Spark's ability to handle advanced analytics, real-time stream processing, and machine learning gives it an edge over Hadoop. The choice between the two data processing tools depends on the needs of the organization; for example, big structured data can be handled efficiently with map-reduce, with no need to install a separate Spark layer over Hadoop [11]. Spark-on-demand allows users to employ Apache Spark for in situ analysis of big data on HPC resources [12]. With this setup, there is no longer a need to move petabytes of data for advanced data analytics.
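As a minimal illustration of why in-memory RDDs matter for iterative analytics, the hedged PySpark sketch below (assuming a working Spark installation; input.txt, one number per line, is a placeholder) parses a dataset once, caches it, and reuses it across iterations, whereas a map-reduce implementation would re-read and re-write distributed storage on every pass:

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-demo")

    # Parse once, then pin the resulting RDD in memory for reuse.
    data = sc.textFile("input.txt").map(float).cache()

    threshold = 0.0
    for step in range(10):
        # Each pass reuses the cached in-memory RDD instead of
        # re-reading the input from distributed storage.
        count = data.filter(lambda x: x > threshold).count()
        print(step, count)
        threshold += 1.0

    sc.stop()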

3 Research Related to HPC and Big Data Convergence

The integration of HPC and big data has started at different levels of their ecosystems, and these integrated solutions are still at a very early stage. The convergence of the two technologies has been among the hottest topics for researchers over the last few years. In [6], Krishnan et al. proposed the myHadoop framework, which uses a standard batch scheduling system to configure Hadoop on demand on traditional HPC resources. The overheads in this setup include site-specific configuration, loading input data into HDFS, and staging results back to persistent storage. HDFS is heavily criticized for its I/O bottleneck, and the limited storage available on compute nodes is a big challenge when integrating Hadoop with HPC clusters. Islam et al. [13] proposed a hybrid design (Triple-H) to reduce the I/O bottleneck in HDFS and to utilize resources efficiently for different analytics systems, improving performance and cluster efficiency at a low overall system cost.

Data-intensive applications have been used extensively on HPC infrastructure with multicore systems using the map-reduce programming model [14]. As parallelism increases, overall throughput increases, which results in higher energy efficiency since tasks are completed in a shorter span of time. When Hadoop runs on an HPC cluster with multiple cores, each node is capable of running many map/reduce tasks on these cores. This ultimately decreases the data movement cost and increases throughput, but due to the high disk and network access of map-reduce tasks, energy consumption and throughput cannot be predicted. A high degree of parallelism may or may not improve energy efficiency and performance.

Tiwari et al. [15] studied Hadoop's energy efficiency on an HPC cluster. Their study shows that the energy efficiency of a map-reduce job on an HPC cluster changes with the degree of parallelism and the network bandwidth. By selecting configuration parameters for different types of workloads (CPU-intensive with moderate I/O, CPU- and I/O-intensive), and by studying the energy and performance characteristics of disk- and network-I/O-intensive jobs, they determined the degree of parallelism on a node that improves energy efficiency, as well as the benefits of increasing network bandwidth for energy efficiency. When the number of map slots grew beyond 40, the number of killed map tasks almost doubled; thus, increasing parallelism has a positive impact on energy efficiency only up to a certain extent.

In HPC environments, scientific data sets are stored on back-end storage servers, and these data sets can be analyzed by YARN map-reduce programs on the compute nodes. As compute and storage servers are separated in HPC environments, the cost of moving these large data sets is very high. The high-end computation machines and analysis clusters are connected by a high-speed parallel file system. To overcome the shortcomings of offline data analysis, "in situ" data analysis can be performed on output data before it is written to the parallel file system. Using high-end computation nodes for data analysis, however, slows down the simulation job through interference from the analysis tasks and makes inefficient use of computation resources. Spark-on-demand allows users to employ Apache Spark for in situ analysis of big data on HPC resources [12]; with this setup, there is no longer a need to move petabytes of data for advanced analytics.

According to Woodie [16], the use of InfiniBand is more cost effective than standard Ethernet for large clusters. The performance of HPC-oriented map-reduce solutions (Mellanox UDA, RDMA-Hadoop, DataMPI, etc.) depends on the degree of change in the Hadoop framework, as deeper modification means better adaptation to HPC systems. Hadoop with IPoIB (IP over InfiniBand) and Mellanox UDA requires minimal or no changes in the Hadoop implementation and only minor changes in the Hadoop configuration. RDMA-Hadoop and HMOR are HPC-oriented solutions that take advantage of high-speed interconnects by modifying some of Hadoop's subsystems. DataMPI is a framework developed from scratch, which exploits the overlapping of the map, shuffle, and merge phases of the map-reduce framework and increases data locality during the reduce phase. DataMPI provides the best performance and average energy efficiency [17]. The use of InfiniBand improves network bandwidth, InfiniBand being widely used in HPC environments. Communication support in Hadoop relies on the TCP/IP protocol through Java sockets [17], so it is difficult to use high-performance interconnects in an optimal way; different HPC-oriented map-reduce solutions (RDMA-Hadoop, DataMPI, etc.) have emerged to address the problem of leveraging high-performance interconnects. Wang et al. [18] compared the performance of 10 Gigabit Ethernet and InfiniBand on Hadoop. With small intermediate data sizes, the high-speed interconnect increases performance by efficiently accelerating jobs, but it does not show the same benefit with large intermediate data sizes. The use of InfiniBand on Hadoop provides better scalability and removes disk bottleneck issues. As Hadoop clusters get bigger, organizations feel the need for specialized gear such as solid-state drives (SSDs) and InfiniBand instead of standard Ethernet. InfiniBand with RDMA (remote direct memory access) delivers 40 Gigabit/s raw capacity out of a Quad Data Rate (QDR) InfiniBand port, four times as much bandwidth as a 10 Gigabit Ethernet port can deliver [16].

The use of InfiniBand allows maximum scalability and performance while overcoming I/O bottlenecks. Islam et al. [19] propose a parallel replication scheme as an alternative to the pipelined replication scheme in HDFS, analyze the associated challenges, and compare its performance with the existing pipelined replication over Ethernet, IPoIB, 10 GigE, and RDMA, showing performance improvements with the parallel model for large data sizes and high-performance interconnects.
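The difference between the two replication schemes can be sketched as follows; this is a conceptual illustration of the data flow only (HDFS's actual write pipeline is considerably more involved), with node names and the send function as placeholders:

    import concurrent.futures

    def send(block, replica):
        # Stand-in for a network transfer of one block to a replica node.
        print(f"sent {len(block)} bytes to {replica}")

    def replicate_pipelined(block, replicas):
        # Pipelined (HDFS-style): the client writes to the first replica,
        # which forwards to the second, and so on; hops happen in sequence.
        for replica in replicas:
            send(block, replica)

    def replicate_parallel(block, replicas):
        # Parallel: the client transmits to all replicas concurrently,
        # trading client-side bandwidth for lower end-to-end write latency.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(send, block, r) for r in replicas]
            concurrent.futures.wait(futures)

    block = b"x" * 1024
    replicate_pipelined(block, ["node1", "node2", "node3"])
    replicate_parallel(block, ["node1", "node2", "node3"])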

4 Challenges of Convergence

The world of workloads is diverse enough, and performance sensitivity high enough, that no single globally optimal solution can address all the issues raised by the convergence of HPC and big data. HPC and Hadoop (big data) architectures are different and have different ecosystems. The cross-fertilization of HPC and big data has been among the hottest topics for researchers over the last few years. Most of the research related to the convergence of HPC and big data has started at distinct levels of the ecosystem but does not address the problem of moving data, especially in HPC environments. The integration of data-intensive applications into HPC environments will bring many challenges. In an exascale environment, the cost of moving big data will exceed the cost of floating-point operations. There is a need for energy-efficient and cost-effective interconnects for high-bandwidth data exchange among thousands of processors. We also need data-locality-aware mechanisms, especially when dealing with big data in HPC shared-memory architectures. The cost of moving big data for processing also brings the further challenge of high power consumption: with massively parallel architectures comprising hundreds of thousands of processing nodes, the cost of moving data will be very high. According to Moore et al. [20], an energy efficiency of 20 pJ (picojoules) per floating-point operation is required for an exascale system, whereas current state-of-the-art multicore CPUs take 1700 pJ and GPUs 225 pJ per floating-point operation.
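To put these figures in perspective, a simple back-of-the-envelope calculation (ours, not from [20]) shows why the 20 pJ target matters: a machine sustaining one exaflop/s, i.e., 10^18 floating-point operations per second, at 20 pJ per operation would draw

    10^18 flop/s × 20 × 10^-12 J/flop = 2 × 10^7 W = 20 MW

for the arithmetic alone, which is on the order of the power budget commonly cited for practical exascale systems; at the 1700 pJ of current multicore CPUs, the same machine would need 1.7 GW.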

Minimizing data movement requires innovation in memory technologies, with enhanced capacity and bandwidth. To deal with the three Vs of big data (volume, velocity, variety), efficient data management techniques need to be investigated, including data mining and data co-ordination [13], since most HPC platforms are compute-centric, as opposed to the demands of big data (continuous processing, efficient movement of data between storage devices and network connections, etc.). To deal with massively parallel architectures and the heterogeneous nature of big data, innovation is needed in the programming models for the next generation of parallel systems, reducing the burden of parallelism and data locality on the application developer; MPI leaves it to the programmer to handle issues related to parallelism. Hadoop, widely used as a big data framework, achieves fault tolerance by replicating data on multiple nodes, with the job tracker assigning the job to another node in case of node failure. Fault tolerance in HPC relies on the checkpoint mechanism, which is heavily criticized and not suitable for exascale environments. In exascale systems, hardware failure will be the rule, not the exception. The MTBF (mean time between failures) window of current petascale systems is measured in days; for exascale systems it will be minutes, or perhaps a few seconds. There is therefore a need for comprehensive resilience at the different levels of the exascale ecosystem. Exascale systems will be constrained by power consumption, memory per core, data movement cost, and fault tolerance. The integration of HPC and big data must address the issues of scalability, fault resilience, energy efficiency, scientific productivity, programmability, and performance [21].

Resilience, power consumption, and performance are interrelated. A high degree of resilience or fault tolerance can be achieved, but at the expense of high power consumption. As we head towards the exascale era, the convergence of HPC and big data will make energy efficiency a core issue. Servers and data centers face the same power consumption problem, including those of companies such as Google, Amazon, and Facebook. According to one estimate, the actual cost of an exascale system will be less than the cost of the power consumed in maintaining and running it for one year [22].

Energy efficiency techniques for big data can be broadly categorized into software- and hardware-based techniques, energy-efficient algorithms, and energy-efficient architectures. Sets of commodity hardware are used in both HPC and big data platforms for processing data. An integrated hardware solution for data-intensive and compute-intensive applications would not work for exascale systems, as hardware solutions help achieve fault tolerance but at the expense of high energy consumption. The current petascale approach, with its checkpoint mechanism for fault tolerance and energy efficiency, does not suit an integrated solution of exascale HPC and big data. Soft, hard, and silent errors in exascale environments will be the rule, not the exception. Collaborative efforts are therefore needed at the system and application levels of resilience to deal with fault tolerance and energy efficiency in an integrated solution.

As we have seen, the HPC and Hadoop (big data) architectures are different and have different ecosystems: different programming models, resource managers, file systems, and hardware. These differences make system integration a challenging task, and as data grows bigger in volume, so does the need for high-end computing. One of the biggest challenges facing both the big data and HPC communities is energy efficiency. An exascale parallel computing system will have thousands of nodes with hundreds of cores each, and is projected to have billions of threads of execution. The MTBF window in today's supercomputers is measured in days and weeks; for exascale computing, with a million times more components, MTBF is expected to be hours, minutes, or perhaps seconds. Each layer of the exascale ecosystem must be able to cope with these errors [23].

Real-time data analysis is also a driving force behind the urgency of the convergence of analytics, big data, and HPC when dealing with the computation, storage, and analysis of massive, complex data sets in highly scalable environments. The scalability issues addressed by the HPC community, by capitalizing on advances in network technologies (low-latency networks) and efficient, large memories, should also address the scalability issues of data analytics [24].

5 Driving Forces and Future Aspects

High performance data analytics (HPDA) involves tasks with massive amounts of structured, semi-structured, and unstructured data and highly complex algorithms that ultimately demand HPC resources. Companies now have the computing power they need to actually analyze and act upon their data. This translates into numerous benefits for companies, the environment, and society overall. In the energy sector, companies are now able to drill for oil more accurately. Automobiles and airlines are much safer due to rapid modeling of operational data, design optimization, and aerodynamics analysis, allowing manufacturers to deliver more cost-effective products that operate more safely and are more fuel-efficient. In the financial sector, banks and card issuers can perform fraud detection in real time, and stock investors can quickly track market trends to better serve their investing customers. Retailers and advertisers can review historic purchasing data to deliver the right products and advertisements to their customers, and weather researchers can study thousands of years of weather data in hours or days instead of weeks or months, improving the quality of predictions and the safety of people worldwide. The HPC industry has long dealt with data-intensive simulations, and its high performance analytics solutions have evolved over the years, urging commercial organizations to adopt HPC technology for competitive advantage in dealing with time-critical and highly variable complex problems. The chasm between data and compute power is becoming smaller all the time. The global HPDA market is growing rapidly: according to one forecast, the global HPDA market size was US$25.2 billion and, with growth of nearly 18% per year, is projected to reach around US$82 billion by 2022 [25] (Fig. 1).

Fault tolerance, high power consumption, data-centric processing, and the limits of I/O and memory performance are a few of the driving forces reshaping HPC platforms on the way to exascale computing [26]. Data-intensive simulations and complex, time-critical data analytics require high performance data analytics solutions, for example in the intelligence community, data-driven science and engineering, machine learning, deep learning, and knowledge discovery. These competitive forces have pushed relatively new commercial companies (small and medium-sized enterprises, SMEs) into the HPC competency space. Fraud and anomaly detection, affinity marketing, business intelligence, and precision medicine are some of the new commercial HPC market segments that require high performance data analytics. The use of HPDA will increase with time, further demanding the convergence of HPC and big data. HPDA is becoming an integral part of enterprises' future business investment plans, to enhance customer experience, anomaly detection, marketing, business intelligence, and security-breach detection, and to discover new revenue opportunities.

5.1 The Internet of Things (IoT) and Smart Cities

The IoT links physical devices (computers, sensors, electronics) equipped with sensors to the Internet, with network connectivity enabling them to communicate. A common IoT platform brings heterogeneous information together and facilitates communication by providing a common language. According to Gartner [27], the installed base of IoT units will reach 20.8 billion by 2020, resulting in massive amounts of data that will further highlight the challenges of security, customer privacy, storage management, and data-centric networks. Smart cities demand better and more inventive services that run the whole city smoothly and improve people's lives through the innovative use of data.

Smart cities and the IoT are among the emerging HPDA application areas. HPC has long been involved in managing power grids and transport, in the upstream design of vehicles, and in urban traffic management in smart cities, and its use will grow over time in the markets for cognitive computing/AI, driverless vehicles, and healthcare organizations. El Baz [28] investigated the connection between IoT and HPC, highlighting some of the challenges in smart-world applications (smart building management, smart logistics, and smart manufacturing) and possible opportunities with HPC-enabled solutions. China's HPC-IoT 2030 plan is based on the use of HPC in IoT network wellness management and security [29].

Fig. 1. HPDA market forecast [25]


5.2 Cognitive Technology

Cognitive systems are capable of understanding complex language constructs, correlating associations, and helping to rationalize information and discover insights. The keys to cognitive systems are learning, adaptability, and how the system evolves; they help in decision-making, the discovery of new ventures, improved production and operation systems, resource optimization, and the proactive identification of faults ahead of failure. The motive of cognitive computing is to handle complex problems with little or no human intervention. According to an IBM estimate, 80% of data is unstructured, of no use to machines, and not fully exploited. Cognitive computing can be seen as a potential candidate for exploring unstructured data to extract more useful insights and enable efficient decision-making. The rapid growth of data from multidisciplinary domains requires powerful analytics, but human expertise to tackle such diverse and complicated problems is lacking. Cognitive computing allows people with less experience to interact with machines, thanks to advances in natural language processing and artificial intelligence technologies, e.g., Google DeepMind and Qualcomm's Zeroth platform. Advances in cognitive technology, integrating AI and machine learning into big data tools and platforms, will increase the quality of information and support complex data analytics with less human intervention, but they require rapid (low-latency) data access, faster time to insight, and hardware acceleration for complex analytics [2]. Extracting information from vast amounts of data requires innovation in compute and storage technologies that provides cost-effective storage and improved performance within a desired time frame. The infrastructure requires cognitive storage, with the learning ability for computers to store only relevant and important data. The computation requires efficient processing, which demands high memory bandwidth and extreme-scale parallelism for efficient resource utilization within energy-efficiency constraints. The OpenPOWER Foundation [2] is an initiative in which diverse companies come together to provide technology solutions to a variety of problems. With data-centric computing, time to solution will be dramatically reduced. Cognitive computing is still in its infancy, but in the future it will be a key technology for the success of modern businesses, providing insight into vast amounts of unstructured data by adapting computing technology to the way humans want to work and smoothing the natural relationship between humans and computers.

6 Design Patterns

The need for HPDA demands innovative ways to accelerate data and predictive analysis for the complex challenges described above, through revolutionary and evolutionary changes in programming models, computer architectures, and runtime systems that accommodate the interoperability and scaling required by the convergence of the HPC and big data ecosystems [2]. There is a growing need for the efficient exploration of novel techniques that allow HPC and big data applications to exploit billion-fold parallelism (exascale systems), improved data locality, unified storage systems, and synchronization, and ultimately a single system architecture that overcomes the cost and complexity of moving data, improves the total cost of ownership, and brings flexibility in managing workflows and maximizing system utilization. Design patterns and skeletons are potential candidates for addressing these challenges: they support the design of scalable, robust software and capture proven solutions applicable in both the HPC and big data communities.

The parallel programming problem has been an active area of research for decades, focusing primarily on programming models and their supporting environments. As we move towards exascale (millions of components, billions of cores), programming parallel processors and handling billion-way parallelism is one of the major challenges the research community faces. Software architecture and design play a vital role in building robust and scalable software. Common sets of design elements, derived from domain experts' solutions, are captured in the design patterns of a particular domain to help software designers engineer robust and scalable parallel software. These patterns define the building blocks of all software engineering and are fundamental to architecting parallel software. Design problems at different levels of software development are addressed by developing a layered hierarchy of patterns, arranging patterns at different levels. Such design patterns have been developed to assist software engineers in architecting and implementing parallel software efficiently. Our Pattern Language (OPL) is one prominent source of catalogued and categorized parallel patterns [30]. A design pattern provides a clean mechanism for addressing common design problems with generic guidelines.
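As a small, concrete illustration of what a parallel pattern or skeleton looks like in code, the sketch below (our own example, not taken from OPL [30]) implements a generic map-reduce skeleton over a process pool; the square and add functions are placeholder, problem-specific plug-ins:

    from functools import reduce
    from multiprocessing import Pool

    def map_reduce(map_fn, reduce_fn, data, workers=4):
        # The skeleton fixes the parallel structure once; callers supply
        # only the problem-specific map and reduce functions.
        with Pool(workers) as pool:
            mapped = pool.map(map_fn, data)   # parallel map phase
        return reduce(reduce_fn, mapped)      # sequential reduce phase

    def square(x):        # placeholder map function
        return x * x

    def add(a, b):        # placeholder reduce function
        return a + b

    if __name__ == "__main__":
        print(map_reduce(square, add, range(10)))  # prints 285

The same separation of concerns, scaled up, is what frameworks such as Hadoop provide: the pattern encapsulates the hard parallel engineering so that the domain logic stays simple.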

Big data design patterns provide concrete representations of the analysis- and technology-centric patterns for the most commonly occurring problems in big data environments [31]. These design patterns provide the building blocks for the efficient design of big data architectures. The standardization and integration of design patterns can be seen as a promising route to the efficient and effective convergence of HPC and big data. Figure 2 shows the logical architecture of the different layers; design patterns (HPC and big data) can then be applied at distinct levels to address the issues related to big data and HPC convergence. One of the challenges associated with data visualization and interactive management is the huge volume, variety, and velocity of data, which often makes it hard to evaluate and reapply design solutions. The visualization and management layer involves applying patterns for distributed and parallel visualization, interactive data exploration, rendering data visualizations, and real-time monitoring for live analysis and recommendations.

The analytics/processing layer includes patterns for analytics and, depending on the problem domain, in-situ, in-transit, real-time, or batch processing. Advanced analytics requires predictions, advanced algorithms, simulations, and real-time decisions, which in turn require high performance computing for processing and managing massive volumes of data [32].

There are trade-offs between performance, resilience, and power consumption. Trade-off patterns need to identify and accommodate these trade-offs in the best possible way, drawing on best practices from both the HPC and big data communities. The processing patterns include analytics patterns for unstructured and structured data, algorithms for converting unstructured data to structured data, large-scale batch and graph-based processing patterns, and parallel design patterns. The access/storage layer includes design patterns for effective and efficient retrieval and storage mechanisms for parallel and distributed file systems. This includes data-size reduction for high-volume hierarchical, linked, tabular, and binary data, and cognitive storage for real-time, indirect, and integrated access. Cognitive storage, with the learning ability to automate data purging by keeping only relevant and important data, enables cost-effective storage and improved performance.

Fig. 2. Logical layered architecture of design patterns

The HPC software development community often lacks expertise in software engineering principles, yet these patterns define the building blocks of software engineering and are fundamental to architecting parallel software. Research efforts need to be invested in exploring innovative approaches that use design patterns and skeletons to overcome the scalability, elasticity, adaptability, robustness, storage, parallelization, and other processing challenges of a unified HPC and big data environment.

7 Conclusion

Increased processing power, the emergence of big data resources, and real-time analytical solutions are the prime drivers pushing the realm of big data. HPC and big data systems are different and have different architectures. The challenges associated with their inevitable integration are immense, and solutions are starting to emerge at distinct levels of the ecosystem. As we head towards the convergence of the two, we will have to deal with modality, complexity, and vast amounts of data. Currently, we have distinct, and perhaps overlapping, sets of design choices at various levels of the infrastructure. What is needed is a single system architecture with enough configurability to serve different design points between compute-intensive and data-intensive workloads. A single system architecture overcomes the cost and complexity of moving data; it also improves the total cost of ownership and brings flexibility in managing workflows and maximizing system utilization. Realizing these benefits requires coordinated design efforts around the key elements of the system, i.e., compute (multicore, FPGA), interconnect (next-generation fabrics), and memory (non-volatile memory, burst buffers, the Lustre file system). This coordinated effort may result in a usable, effective, and scalable software infrastructure.

The connected and ubiquitous synergy between HPC and big data is expected to deliver results that cannot be achieved by either alone. Leading enterprises need to use HPC technology to efficiently explore huge volumes of heterogeneous data, moving beyond static searches to dynamic pattern discovery for competitive advantage. The integration of HPC computing power with the demand for quick, real-time big data analytics, together with cognitive technology (computer vision, machine learning, natural language processing), is reshaping future technology for accelerating analytics and deriving meaningful insights for efficient decision-making.

Acknowledgments. The authors acknowledge with thanks the technical and financial support of the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under grant number G-661-611-38. The work carried out in this paper was supported by the HPC Center at King Abdulaziz University.


References

1. Singh, K., Kaur, R.: Hadoop: addressing challenges of big data. In: 2014 IEEE International Advance Computing Conference (IACC), pp. 686-689. IEEE (2014)
2. Charl, S.: IBM - HPC and HPDA for the Cognitive Journey with OpenPOWER. https://www-03.ibm.com/systems/power/solutions/bigdata-analytics/smartpaper/high-value-insights.html
3. Keable, C.: The convergence of High Performance Computing and Big Data - Ascent. https://ascent.atos.net/convergence-high-performance-computing-big-data/
4. Joseph, E., Sorensen, B.: IDC Update on How Big Data Is Redefining High Performance Computing. https://www.tacc.utexas.edu/documents/1084364/1136739/IDC+HPDA+Briefing+slides+10.21.2014_2.pdf
5. Geist, A., Lucas, R.: Whitepaper on the Major Computer Science Challenges at Exascale (2009)
6. Krishnan, S., Tatineni, M., Baru, C.: myHadoop - Hadoop-on-Demand on Traditional HPC Resources (2011)
7. Xuan, P., Denton, J., Ge, R., Srimani, P.K., Luo, F.: Big data analytics on traditional HPC infrastructure using two-level storage (2015)
8. Is Hadoop the New HPC? http://www.admin-magazine.com/HPC/Articles/Is-Hadoop-the-New-HPC
9. Katal, A., Wazid, M., Goudar, R.H.: Big data: issues, challenges, tools and good practices. In: 2013 Sixth International Conference on Contemporary Computing (IC3), pp. 404-409. IEEE (2013)
10. Hess, K.: Hadoop vs. Spark: The New Age of Big Data. http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html
11. Muhammad, J.: Is Apache Spark going to replace Hadoop? http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
12. OLCF Staff Writer: OLCF Group to Offer Spark On-Demand Data Analysis. https://www.olcf.ornl.gov/2016/03/29/olcf-group-to-offer-spark-on-demand-data-analysis/
13. Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Shankar, D., Panda, D.K.: Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 101-110. IEEE (2015)
14. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp. 13-24. IEEE (2007)
15. Tiwari, N., Sarkar, S., Bellur, U., Indrawan, M.: An empirical study of Hadoop's energy efficiency on a HPC cluster. Procedia Comput. Sci. 29, 62-72 (2014)
16. Woodie, A.: Does InfiniBand Have a Future on Hadoop? http://www.datanami.com/2015/08/04/does-infiniband-have-a-future-on-hadoop/
17. Veiga, J., Expósito, R.R., Taboada, G.L., Touriño, J.: Analysis and Evaluation of Big Data Computing Solutions in an HPC Environment (2015)
18. Wang, Y., et al.: Assessing the performance impact of high-speed interconnects on MapReduce. In: Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds.) WBDB 2012. LNCS, vol. 8163, pp. 148-163. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-53974-9_13
19. Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Panda, D.K.: Can parallel replication benefit Hadoop distributed file system for high performance interconnects? In: 2013 IEEE 21st Annual Symposium on High-Performance Interconnects, pp. 75-78. IEEE (2013)
20. Moore, J., Chase, J., Ranganathan, P., Sharma, R.: Making scheduling "cool": temperature-aware workload placement in data centers (2005)
21. Reed, D.A., Dongarra, J.: Exascale computing and big data. Commun. ACM 58, 56-68 (2015)
22. Rajovic, N., Puzovic, N., Vilanova, L., Villavieja, C., Ramirez, A.: The low-power architecture approach towards exascale computing. In: Proceedings of the Second Workshop on Scalable Algorithms for Large-Scale Systems - ScalA 2011, p. 1. ACM Press, New York (2011)
23. Cappello, F.: Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl. 23, 212-226 (2009)
24. Gutierrez, D.: The Convergence of Big Data and HPC - insideBIGDATA. https://insidebigdata.com/2016/10/25/the-convergence-of-big-data-and-hpc/
25. High Performance Data Analytics (HPDA) Market - Forecast 2022. https://www.marketresearchfuture.com/reports/high-performance-data-analytics-hpda-market
26. Willard, C.G., Snell, A., Segervall, L., Feldman, M.: Top Six Predictions for HPC in 2015 (2015)
27. Egham: Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017, Up 31 Percent From 2016. http://www.gartner.com/newsroom/id/3598917
28. El Baz, D.: IoT and the need for high performance computing. In: 2014 International Conference on Identification, Information and Knowledge in the Internet of Things, pp. 1-6. IEEE (2014)
29. Conway, S.: High Performance Data Analysis (HPDA): HPC - Big Data Convergence. insideHPC (2017)
30. Keutzer, K., Mattson, T.: Our Pattern Language (OPL). EECS UC Berkeley and Intel (2016)
31. Bodkin, R.: Big Data Patterns, pp. 1-23 (2017)
32. Mysore, D., Khupat, S., Jain, S.: Big data architecture and patterns, Part 1: Introduction to big data classification and architecture. https://www.ibm.com/developerworks/library/bd-archpatterns1/index.html

... HPC has been applied to SpMV/linear algebra [30][31][32][33], and other problems for several decades. Big data and data-driven approaches [26,[34][35][36] have been used relatively recently in scientific computing to address HPC related challenges, and this has given rise to the convergence of HPC and big data [37,38]. Moreover, artificial intelligence (AI) is increasingly being used to improve big data, HPC, scientific computing, and other problem domains. ...

... Some of the optimization techniques have been developed to target the hardware heterogeneity and complexities. But there is no single format that works well on all hardware platforms [37]. There is not much work related to the optimization of SpMV using machinelearning techniques. ...

SpMV is a vital computing operation of many scientific, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors affect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, different levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and efficient ways of exploiting parallelism. In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best configuration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational fluid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods, decision trees, random forest, gradient boosting, and is evaluated in depth. A discussion on the applicability of our proposed tool to energy efficiency optimization of SpMV computations is given. This is the first work where the sparsity structure of matrices have been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.

... Driving high efficiency from shared-memory and distributed-memory HPC systems have always been challenging. Big data and HPC convergence, system heterogeneity, cloud computing, and many other developments have increased the complexities of HPC systems [36,51,77,108,124]. There are increasing pressures on energy-efficiency for developing exascale computers and therefore development of highly efficient HPC applications and systems have become essential. ...

... Big data technologies are being used in many application areas that require HPC to address big data challenges, see e.g., [5,79,86,95,117,118]. There are many ongoing efforts on the convergence of HPC and big data [36, 108,124]. ...

High-performance computing (HPC) plays a key role in driving innovations in health, economics, energy, transport, networks, and other smart-society infrastructures. HPC enables large-scale simulations and processing of big data related to smart societies to optimize their services. Driving high efficiency from shared-memory and distributed HPC systems have always been challenging; it has become essential as we move towards the exascale computing era. Therefore, the evaluation, analysis, and optimization of HPC applications and systems to improve HPC performance on various platforms are of paramount importance. This paper reviews the performance analysis tools and techniques for HPC applications and systems. Common HPC applications used by the researchers and HPC benchmarking suites are discussed. A qualitative comparison of various tools used for the performance analysis of HPC applications is provided. Conclusions are drawn with future research directions.

... On the one hand, scientists demand converged applications to obtain new insights into various scientific domains. The fusion of HPC and big data emanates high-performance data analytics (HPDA) to extract values from massive scientific datasets via extreme data analytics at scale [4] . The fusion of HPC and AI emanates AI-enhanced HPC, which aims to improve traditional HPC models by optimizing the parameter selections or training an AI model as an al-ternative component [5] . ...

  • Yu-Tong Lu
  • Peng Cheng Peng Cheng
  • Zhi-Guang Chen

With the convergence of high-performance computing (HPC), big data and artificial intelligence (AI), the HPC community is pushing for "triple use" systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing these data management challenges on a typical HPC system at the parallel file system, data management middleware, and user application levels. As HPC systems are approaching the border of exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.

... Experimental study of deploying MapReduce over Lustre with various shuffle and placement strategies of intermediate data is done in [25]. The architecture for convergence of big data and HPC systems based on design patterns is proposed in [26]. Summary report on big data and exascale computing (BDEC) contributed by eminent researchers in the domain of HPC and analytics has been put forth in [27]. ...

The dawn of exascale computing and its convergence with big data analytics has greatly spurred research interests. The reasons are straightforward. Traditionally, high performance computing (HPC) systems have been used for scientific applications involving majority of compute-intensive tasks. At the same time, the proliferation of big data resulted into design of data-intensive processing paradigms like Apache big data stack. Big data generating at high pace necessitates faster processing mechanisms for getting insights at a real time. For this, the HPC systems may serve as panacea for solving the big data problems. Though the HPC systems have the capability to give the promising results for big data, directly integrating them with existing data-intensive frameworks like Apache big data stack is not straightforward due to challenges associated with them. This triggers a research on seamlessly integrating these two paradigms based on interoperable framework, programming model, and system architecture. The aim of this paper is to assess a progress made in HPC world as an effort to augment it with big data analytics support. As an outcome of this, the taxonomy showing the factors to be considered for augmenting HPC systems with big data support has been put forth. This paper sheds light upon how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of big data and exascale computing ecosystem. The focus is given on research issues related to augmenting HPC paradigms with big data frameworks and corresponding approaches to address those issues. This paper also discusses data-intensive as well as compute-intensive processing paradigms, benchmark suites and workloads, and future directions in the domain of integrating HPC with big data analytics.

... The authors concluded that there is a need to implement the aforementioned policies on data sharing and the security of personal information, which will have long-term impacts on big data analytics. In addition, there have been numerous research studies focusing on smart infrastructure [2,58], healthcare [59][60][61][62], transport [6,[63][64][65][66][67][68][69], and other applications [70,71]. ...

The outburst of data produced over the last few years in various fields has demanded new processing techniques, novel big data processing architectures, and intelligent algorithms for the effective and efficient exploitation of huge data sets to gain useful insights and improve knowledge discovery. The explosion of data brings many challenges in dealing with the complexity of information overload. Numerous tools and techniques have been developed over the years to deal with big data challenges. This chapter presents a summary of state-of-the-art tools and techniques for processing big data applications, critically analyzing their objectives, methodologies, and key approaches to the challenges associated with big data. We also critically analyze some of the core applications of big data and their impact on improving the quality of human life, focusing primarily on healthcare and smart city applications, genome sequence annotation applications, and graph-based applications. We provide a detailed review and taxonomy of the research efforts within each application domain.

... The open source software culture has helped the development of many new distributed and collaborative applications, paving the way for integrated systems and hence smart cities. Many new smart city applications are being developed, such as in transport [5][6][7][8][9][10][11][12], healthcare [13][14][15][16], infrastructure [17,18], and other applications [19,20]. ...

The use of open source software has increased tremendously in the last few decades, paving the way for many innovations such as the Internet of Things (IoT) and smart cities. Open data licenses have also become prevalent with the emergence of big data and relevant technologies. These developments have given rise to the "Share more—Develop less" culture, which in turn has raised new legal issues. The community has been developing many new licenses to address these emerging legal issues; however, selecting the right license is becoming increasingly difficult due to licensing complexities and the continuous arrival of new licenses. This chapter reviews notable open source and open data licenses and their suitability for various kinds of data and software. Subsequently, we propose frameworks for the selection of open source software and open data licenses. Conclusions are drawn with recommendations for future work.

Human Robot Collaboration (HRC) is considered a major enabler for achieving flexibility and reconfigurability in modern production systems. The motivation for HRC applications arises from the potential of combining human operators' cognition and dexterity with the robot's precision, repeatability, and strength, which can increase the system's adaptability and performance at the same time. To exploit this synergy to its full extent, production engineers must be equipped with the means for optimally allocating tasks to the available resources as well as setting up appropriate workplaces to facilitate HRC. This chapter discusses existing approaches and methods for task planning in HRC environments, analysing the requirements for implementing such decision-making strategies. The chapter also highlights future trends for progressing beyond the state of the art in this scientific field, exploiting the latest advances in Artificial Intelligence and Digital Twin techniques.
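
To make the allocation problem concrete, here is a minimal sketch in Python of exhaustive task-to-resource assignment, the simplest form of the decision-making the chapter describes. The task names, resources, and cost matrix are illustrative assumptions; real HRC planners use much richer models and search strategies.

```python
# Exhaustive task-to-resource allocation: try every assignment and keep
# the cheapest. Tasks, resources, and costs are illustrative assumptions.
from itertools import permutations

tasks = ["inspect", "lift", "fasten"]
resources = ["operator", "robot", "cobot"]
# cost[t][r]: e.g. time for resource r to perform task t (made-up values).
cost = [
    [2, 9, 5],
    [8, 3, 4],
    [6, 4, 3],
]

# Each permutation p maps task t to resource p[t]; pick the cheapest one.
best = min(permutations(range(len(resources))),
           key=lambda p: sum(cost[t][p[t]] for t in range(len(tasks))))

for t, r in enumerate(best):
    print(f"{tasks[t]:8s} -> {resources[r]}")
print("total cost:", sum(cost[t][best[t]] for t in range(len(tasks))))
```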

  • Montserrat Gómez-Márquez
  • Ana Lilia Ruiz-Hernández
  • Martín González Sóbal
  • Miguel Eduardo Rosas Baltazar

A Digital and Accounting Information Transformation Center (CTICD) is a physical space that offers services such as the loan of computer equipment, accessories, and/or digital tools (software, technical support, etc.). Each CTICD has tangible and intangible service attributes that distinguish it and make it attractive to users. The objective of this work was to identify the attributes that CTICD users consider most influential. The data collection instrument was validated through Pearson's correlation, and the dependence between the tangible and intangible attributes of a CTICD was calculated using the chi-square test of statistical independence; the theoretical chi-square for each pair of variables was plotted. The contribution of this research is the identification of the importance of intangible attributes in users' perception and differentiation of a CTICD's service, highlighting the humanistic dimension of personal attention, which users rank above any tangible attribute.
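
The chi-square independence test mentioned above can be reproduced in a few lines. Below is a minimal sketch using SciPy on a hypothetical contingency table of tangible versus intangible attribute ratings; the counts are illustrative assumptions, not the study's data.

```python
# Chi-square test of independence between tangible and intangible
# service attributes, using a hypothetical 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: tangible-attribute satisfaction (low, high);
# columns: intangible-attribute satisfaction (low, high). Made-up counts.
observed = np.array([[30, 10],
                     [12, 48]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
if p_value < 0.05:
    print("Reject independence: the attribute groups appear dependent.")
else:
    print("No evidence against independence at the 5% level.")
```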

Road transportation is among the global grand challenges affecting human lives, health, society, and the economy due to road accidents, traffic congestion, and other transportation deficiencies. Autonomous vehicles (AVs) are set to address major transportation challenges including safety, efficiency, reliability, sustainability, and personalization. The foremost challenge for AVs is to perceive their environments in real time with the highest possible certainty. Relatedly, connected vehicles (CVs) have been another major driver of innovation in transportation. In this paper, we bring autonomous and connected vehicles together and propose TAAWUN, a novel approach based on the fusion of data from multiple vehicles. The aim is to share information between multiple vehicles about their environments, enhance the information available to each vehicle, and make better decisions regarding the perception of their environments. TAAWUN shares, among the vehicles, visual data acquired from cameras installed on individual vehicles as well as the perceived information about the driving environments. The environment is perceived using deep learning, random forest (RF), and C5.0 classifiers. A key aspect of the TAAWUN approach is that it uses problem-specific feature sets to enhance prediction accuracy in challenging environments such as problematic shadows, extreme sunlight, and mirage. TAAWUN has been evaluated using multiple metrics: accuracy, sensitivity, specificity, and area under the curve (AUC). It performs consistently better than the base schemes. Directions for future work to extend the tool are provided. This is the first work in which visual information and decision fusion are used in CAVs to enhance environment perception for autonomous driving.
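
As a rough illustration of the decision-fusion idea, the sketch below trains one random forest per vehicle on synthetic features and fuses the per-vehicle labels by majority vote. The feature generator, classifier settings, and fusion rule are assumptions for illustration, not TAAWUN's actual pipeline.

```python
# Per-vehicle random forests plus majority-vote fusion on a shared scene.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_vehicle_data(n=200, d=8):
    """Synthetic per-vehicle features (stand-ins for visual cues such as
    shadow and illumination descriptors)."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = drivable, 0 = not
    return X, y

# Train one classifier per vehicle on its own (independently noisy) view.
vehicle_models = []
for _ in range(3):
    X, y = make_vehicle_data()
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    vehicle_models.append(clf.fit(X, y))

# Fuse the per-vehicle decisions for one shared scene by majority vote.
scene = rng.normal(size=(1, 8))
votes = np.array([m.predict(scene)[0] for m in vehicle_models])
fused = np.bincount(votes).argmax()
print(f"per-vehicle votes: {votes}, fused decision: {fused}")
```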

  • Daniel A. Reed
  • Jack Dongarra

Daniel A. Reed and Jack Dongarra state that scientific discovery and engineering innovation require unifying the traditionally separate worlds of high-performance computing and big data analytics. Big-data machine learning and predictive data analytics have been considered the fourth paradigm of science, allowing researchers to extract insights from both scientific instruments and computational simulations. A rich ecosystem of hardware and software has emerged for big data analytics, similar to that of high-performance computing.

Data-intensive computing has become one of the major workloads on traditional high-performance computing (HPC) clusters, yet deploying data-intensive computing software frameworks on HPC clusters still faces performance and scalability issues. In this paper, we develop a new two-level storage system by integrating Tachyon, an in-memory file system, with OrangeFS, a parallel file system. We model the I/O throughputs of four storage structures: HDFS, OrangeFS, Tachyon, and the two-level storage. We conduct computational experiments to characterize the I/O throughput behavior of the two-level storage and compare its performance to that of HDFS and OrangeFS using the TeraSort benchmark. Theoretical models and experimental tests both show that the two-level storage system can increase aggregate I/O throughput. This work lays a solid foundation for future work in designing and building HPC systems that better support I/O-intensive workloads while preserving existing computing resources.
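
The flavor of this kind of throughput modeling can be conveyed with a toy formula: if a fraction h of reads hit the in-memory tier, per-byte service time is a hit-weighted mix of the two tiers' service times. The sketch below implements this; the bandwidth figures are illustrative assumptions, not the paper's measurements or model.

```python
# Toy aggregate-throughput model for a two-level (memory + parallel file
# system) store: per-GB service time is a hit-weighted mix of the two
# tiers' service times. Bandwidth figures are made-up.
def aggregate_read_throughput(hit_ratio, mem_gb_per_s, pfs_gb_per_s):
    time_per_gb = hit_ratio / mem_gb_per_s + (1 - hit_ratio) / pfs_gb_per_s
    return 1.0 / time_per_gb

for h in (0.0, 0.5, 0.9):
    rate = aggregate_read_throughput(h, mem_gb_per_s=40.0, pfs_gb_per_s=5.0)
    print(f"memory-tier hit ratio {h:.1f}: {rate:5.1f} GB/s aggregate")
```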

The MapReduce programming model is commonly used for efficient scientific computations, as it executes tasks in a parallel and distributed manner on large data volumes. HPC infrastructure can effectively increase the parallelism of MapReduce tasks, but such execution incurs high energy and data-transmission costs. Here we empirically study how the energy efficiency of a MapReduce job varies with increasing parallelism and network bandwidth on an HPC cluster. We also investigate the effectiveness of power-aware systems in managing the energy consumption of different types of MapReduce jobs. We find that for some jobs energy efficiency degrades at a high degree of parallelism, while for others it improves at low CPU frequency. Consequently, we suggest strategies for configuring the degree of parallelism, network bandwidth, and power management features in an HPC cluster for energy-efficient execution of MapReduce jobs.
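
A simple way to reason about this trade-off is to compute useful work per joule as parallelism grows. The sketch below does exactly that with illustrative power and runtime figures (not measurements from the study): runtime shrinks sub-linearly with node count, so efficiency can degrade once extra nodes stop paying for themselves.

```python
# Useful work per joule for a MapReduce job as parallelism grows.
def job_energy_efficiency(data_gb, runtime_s, node_power_w, nodes):
    """GB processed per joule consumed by the whole cluster."""
    energy_j = node_power_w * nodes * runtime_s
    return data_gb / energy_j

# Runtime shrinks sub-linearly with node count (made-up figures), so
# efficiency falls once the extra nodes stop paying for themselves.
for nodes, runtime_s in [(4, 1000), (8, 600), (16, 450)]:
    eff = job_energy_efficiency(data_gb=500, runtime_s=runtime_s,
                                node_power_w=300, nodes=nodes)
    print(f"{nodes:2d} nodes: {eff * 1e6:6.1f} GB per MJ")
```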

  • Didier El Baz

This keynote presentation investigates the connection between the Internet of Things (IoT) and High Performance Computing (HPC). New paradigms and devices for HPC are presented, and several examples related to smart building management, smart logistics, and smart manufacturing, all of which lead to difficult combinatorial optimization problems, are detailed.

  • Kamalpreet Singh
  • Ravinder Kaur

Hadoop is an open source cloud computing platform of the Apache Foundation that provides a software programming framework called MapReduce and a distributed file system, HDFS. It is a Linux-based set of tools that uses relatively inexpensive commodity hardware to handle, analyze, and transform large quantities of data. The Hadoop Distributed File System (HDFS) stores huge data sets reliably and streams them to user applications at high bandwidth, while MapReduce is a framework for processing massive data sets in a distributed fashion over several machines. This paper gives a brief overview of big data, Hadoop MapReduce, and HDFS along with its architecture.
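
The MapReduce model summarized above can be illustrated with a self-contained word-count simulation: map emits (word, 1) pairs, a shuffle groups them by key (Hadoop performs this step across machines, backed by HDFS), and reduce sums each group. This is a local sketch of the programming model only, not Hadoop code.

```python
# Local word-count simulation of the map -> shuffle -> reduce flow.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key (done across machines in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data meets hpc", "hpc meets big data analytics"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 2, 'meets': 2, 'hpc': 2, 'analytics': 1}
```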

Hadoop is a successful open-source implementation of the MapReduce programming model and has been widely adopted by many leading industry companies for big data analytics. However, its intermediate data shuffling is a time-consuming operation that impacts the total execution time of MapReduce programs. Recently, a growing number of organizations have sought to address this issue by leveraging high-performance interconnects, such as InfiniBand and 10 Gigabit Ethernet, which are popular in the High-Performance Computing (HPC) community, yet there has been no comprehensive examination of the performance impact of these interconnects on MapReduce programs. In this work, we systematically evaluate the performance impact of two popular high-speed interconnects, 10 Gigabit Ethernet and InfiniBand, using the original Apache Hadoop and our extended Hadoop Acceleration framework. Our analysis shows that, under Apache Hadoop, fast networks can efficiently accelerate jobs with small intermediate data sizes but cannot maintain this advantage for jobs with large intermediate data. In contrast, Hadoop Acceleration provides better performance for jobs across a wide range of data sizes. In addition, both implementations exhibit good scalability under different networks, and Hadoop Acceleration significantly reduces the CPU utilization and I/O wait time of MapReduce programs.
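
The observation that fast networks help small shuffles more than large ones can be captured with a toy cost model: transfer time scales with network bandwidth, but once intermediate data overflows memory, disk spill dominates and the network advantage shrinks. All constants below (memory size, disk and network rates) are illustrative assumptions, not the paper's measurements.

```python
# Toy shuffle-cost model: network transfer plus disk spill once the
# intermediate data no longer fits in memory. All constants are made-up.
def shuffle_time_s(intermediate_gb, net_gb_per_s, mem_gb=8.0,
                   disk_gb_per_s=1.0):
    transfer = intermediate_gb / net_gb_per_s
    spill = max(0.0, intermediate_gb - mem_gb) / disk_gb_per_s
    return transfer + spill

for size_gb in (1, 10, 100):
    t_eth = shuffle_time_s(size_gb, net_gb_per_s=1.25)  # ~10 GigE
    t_ib = shuffle_time_s(size_gb, net_gb_per_s=4.0)    # ~QDR InfiniBand
    print(f"{size_gb:3d} GB: 10GigE {t_eth:6.1f}s  IB {t_ib:6.1f}s"
          f"  speedup {t_eth / t_ib:4.2f}x")
```

The printed speedup shrinks from about 3.2x at 1 GB to about 1.5x at 100 GB, mirroring the qualitative trend the paper reports.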

HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for big data applications, HDFS suffers from severe I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, limited local storage space makes HDFS challenging to deploy in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%, and Triple-H also benefits the performance of SequenceCount and Grep in PUMA over both default HDFS and Lustre.
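
A minimal sketch of a tiered placement policy in the spirit of Triple-H is shown below: hot blocks are directed to a RAM disk, warm blocks to SSD, and cold blocks to HDD, falling through to the next tier when one is full. The tier capacities and hotness thresholds are illustrative assumptions, not the paper's actual policy.

```python
# Tiered block placement: hot -> RAM disk, warm -> SSD, cold -> HDD,
# falling through when a tier is full. Capacities/thresholds are made-up.
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity_blocks: int
    blocks: list = field(default_factory=list)

    def try_place(self, block_id):
        if len(self.blocks) < self.capacity_blocks:
            self.blocks.append(block_id)
            return True
        return False

def place(block_id, accesses_per_hour, tiers):
    """Pick the fastest tier the block's hotness merits, then fall through."""
    start = 0 if accesses_per_hour > 100 else 1 if accesses_per_hour > 10 else 2
    for tier in tiers[start:]:
        if tier.try_place(block_id):
            return tier.name
    raise RuntimeError("all storage tiers are full")

tiers = [Tier("ramdisk", 2), Tier("ssd", 4), Tier("hdd", 1000)]
for block_id, freq in [(0, 500), (1, 500), (2, 500), (3, 50), (4, 2)]:
    print(f"block {block_id} ({freq}/h) -> {place(block_id, freq, tiers)}")
```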

The Hadoop Distributed File System (HDFS) is a popular choice for big data applications due to its reliability and fault tolerance. HDFS provides fault tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, resulting in high replication times; such large replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE, and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme outperforms the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark and up to a 12% increase in the throughput reported by the TestDFSIO benchmark. The proposed parallel replication also enhances HBase Put operation performance by 17%. However, for lower-performance networks like 1 GigE and smaller data sizes, parallel replication does not benefit performance.
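
The crossover the paper observes can be sketched with a crude timing model: the pipeline's streaming rate is capped by each DataNode's software forwarding path, while the parallel scheme splits the client uplink across replicas but keeps forwarding off the critical path. Every constant below, especially the forwarding rate, is an illustrative assumption chosen only to reproduce the qualitative behavior, not a measured value.

```python
# Crude timing model: pipelined replication is capped by per-DataNode
# software forwarding; parallel replication splits the client uplink.
def pipelined_s(block_gb, link_gb_per_s, fwd_gb_per_s=0.4):
    # Hops overlap once the pipeline fills, so the stream runs at the
    # slower of the wire and the forwarding path (made-up fwd rate).
    return block_gb / min(link_gb_per_s, fwd_gb_per_s)

def parallel_s(block_gb, link_gb_per_s, replicas=3):
    # The client pushes all replicas concurrently over one uplink; no
    # DataNode forwarding sits on the critical path.
    return block_gb * replicas / link_gb_per_s

for name, gb_per_s in [("1 GigE", 0.125), ("10 GigE", 1.25), ("RDMA-IB", 4.0)]:
    print(f"{name:8s}: pipelined {pipelined_s(1.0, gb_per_s):5.2f}s"
          f"  parallel {parallel_s(1.0, gb_per_s):5.2f}s")
```

Under these assumptions the pipelined scheme wins on 1 GigE while the parallel scheme wins on the faster networks, matching the paper's qualitative finding.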

Big data is defined as data of such volume that new technologies and architectures are required to capture and analyze it and extract value from it. Due to this sheer size, it becomes very difficult to perform effective analysis using existing traditional techniques. Big data, through its properties of volume, velocity, variety, variability, value, and complexity, poses many challenges. Since big data is an emerging technology that can bring huge benefits to business organizations, the various challenges and issues in adopting it need to be brought to light. This paper introduces big data technology along with its importance in the modern world, and surveys existing projects that are effective and important in changing the concept of science into big science and affecting society as well. The various challenges and issues in adopting and accepting big data technology and its tools (Hadoop) are discussed in detail, along with the problems Hadoop is facing. The paper concludes with good big data practices to be followed.
