NDM 2012

Accepted Papers

See the NDM12 program for presentation slides.

paper 12:  How GridFTP pipelining, parallelism and concurrency work: A guide for optimizing large dataset transfers
Esma Yildirim, Fatih University, Turkey
Jangyoung Kim, University at Buffalo, USA
Tevfik Kosar, University at Buffalo, USA

Abstract: Optimizing the transfer of large files over high-bandwidth networks is a challenging task that requires the consideration of many parameters (e.g., network speed, round-trip time (RTT), current traffic). Unfortunately, this task becomes more complex when transferring datasets comprised of many small files. In this case, the performance of large dataset transfers depends not only on the characteristics of the transfer protocol and network, but also on the number and size distribution of the files that constitute the dataset. GridFTP is the most advanced transfer tool providing functions to overcome large-dataset transfer bottlenecks. Three of its most important functions are pipelining, parallelism and concurrency. In this study, we investigate the effects of these three crucial functions, provide models for optimizing these parameters, define guidelines, and give an algorithm for their practical use in transferring large datasets of varying file sizes.
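The interaction of pipelining and concurrency described in the abstract can be illustrated with a back-of-the-envelope cost model. This sketch is ours, not the paper's model; the per-file RTT penalty and the linear concurrency speedup are simplifying assumptions:

```python
def transfer_time(num_files, file_size_mb, rtt_s, bw_mbps,
                  pipelining=False, concurrency=1):
    """Rough model: each file pays one RTT of control-channel overhead
    unless pipelining hides it; concurrency splits the file stream
    across parallel transfer channels."""
    data_time = num_files * file_size_mb * 8.0 / bw_mbps  # seconds on the wire
    overhead = 0.0 if pipelining else num_files * rtt_s   # per-file round trips
    return (data_time + overhead) / concurrency

# 10,000 x 1 MB files over a 1 Gbps path with 50 ms RTT:
naive = transfer_time(10_000, 1, 0.05, 1000)
tuned = transfer_time(10_000, 1, 0.05, 1000, pipelining=True, concurrency=4)
```

Even this crude model shows why small-file datasets suffer without pipelining: the 500 seconds of per-file round trips dwarf the 80 seconds of actual data movement.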

paper 15: Accelerating Data Movement Leveraging Endsystem and Network Parallelism
Jun Yi, Argonne National Laboratory, USA
Rajkumar Kettimuthu, Argonne National Lab and The University of Chicago, USA
Venkat Vishwanath, Argonne National Laboratory, USA

Abstract: Data volumes produced by simulation, experimental and observational science are increasing exponentially. This data needs to be moved from its source to other resources for analysis, visualization and archival purposes; the destination resource may be local or remote. Data-intensive science depends critically on high-performance parallel file and storage systems for reads and writes, and on high-speed networks to move enormous volumes of data between local and remote computing and storage facilities. 100 Gigabit-per-second networks, such as DOE's Advanced Network Initiative (ANI) and Internet2's 100G network, represent a major step forward in wide-area network performance. Effective utilization of these networks requires substantial and pervasive parallelism at the file system, end-system, and network levels. Additional obstacles, such as heterogeneity and time-varying conditions of networks and end systems, arise that, if not adequately addressed, will leave high-performance storage and network systems severely underutilized. In this paper, we propose a data movement system that dynamically and adaptively adjusts end-system and network parallelism in response to changing conditions in order to sustain high throughput for data transfers. We evaluate our system in multiple settings and show that (1) in a homogeneous configuration, the design achieves better throughput than GridFTP for light and medium workloads and comparable throughput for heavy workloads, and (2) in a heterogeneous configuration, the design achieves several times higher throughput than GridFTP for all workloads.
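One simple way to "dynamically and adaptively adjust" parallelism, as the abstract describes, is a greedy hill-climb on the stream count driven by measured throughput. This is an illustrative control loop of our own, not the paper's actual algorithm:

```python
def adapt_streams(measure, streams=1, max_streams=16, rounds=10):
    """Greedy hill-climb: add one parallel stream while measured
    throughput keeps improving, otherwise hold the current count."""
    last = measure(streams)
    for _ in range(rounds):
        candidate = min(streams + 1, max_streams)
        throughput = measure(candidate)
        if throughput > last:
            streams, last = candidate, throughput
        # else: extra stream hurt (e.g. CPU contention), keep current count
    return streams

# Toy end-system model: throughput rises up to 6 streams, then
# contention makes additional streams counterproductive.
simulated = lambda n: n * 100 if n <= 6 else (12 - n) * 100
chosen = adapt_streams(simulated)
```

A production system would of course re-probe periodically, since the optimum shifts as network and end-system conditions change.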

paper 4: A Dynamic Virtual Networks Solution for Cloud and Grid Computing
Davide Salomoni, INFN CNAF, Italy
Marco Caberletti, INFN CNAF, Italy

Abstract: The extensive use of virtualization technologies in Cloud environments has created the need for a new network access layer, due to networking requirements for which traditional models are not well suited. For example, hundreds of users issuing Cloud requests that require full, privileged access to Virtual Machines (VMs) typically call for network separation at layer 2 through virtual LANs (VLANs). However, in large computing centers, due, for example, to the number of installed network switches, their characteristics, their heterogeneity, or simply management policies, the dynamic or even static definition of many VLANs is often impractical or simply not possible. We present a solution called DVN (Dynamic Virtual Networks) to the problem of dynamic virtual networking, based on the Generic Routing Encapsulation (GRE) protocol. GRE is used to encapsulate VM traffic so that the configuration of the physical network layer does not need to change. In particular, we describe how DVN can be used to address problems such as scalable dynamic network isolation, mobility of VMs across hosts, and interconnection of geographically distributed resource centers. We also describe how DVN is being implemented in a production software framework providing and interconnecting Cloud and Grid resources.
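The encapsulation at the heart of DVN can be sketched at the byte level. Below is a minimal GRE header per RFC 2784 carrying a layer-2 frame (protocol type 0x6558, Transparent Ethernet Bridging, which suits VM traffic); DVN presumably also uses the optional GRE key field to tag tenant networks, but that detail is not confirmed by the abstract and is omitted here:

```python
import struct

GRE_PROTO_TEB = 0x6558  # Transparent Ethernet Bridging: L2 frames inside GRE

def gre_encapsulate(inner_frame: bytes) -> bytes:
    """Prepend a minimal 4-byte GRE header (RFC 2784): version 0,
    no checksum, no key, no sequence number."""
    flags_and_version = 0x0000
    header = struct.pack("!HH", flags_and_version, GRE_PROTO_TEB)
    return header + inner_frame

packet = gre_encapsulate(b"\xff" * 60)  # a dummy 60-byte Ethernet frame
```

Because the outer IP header is addressed between hypervisors, the physical switches never see the tenant's layer-2 traffic, which is what lets DVN avoid per-tenant VLAN configuration.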

paper 1: Hadoop acceleration in an OpenFlow-based cluster
Sandhya Narayan, InfoBlox, USA
Stuart Bailey, InfoBlox, USA
Anand Daga, University of Houston, USA

Abstract: This paper presents our preliminary study of how Hadoop can control its network using OpenFlow in order to improve performance. Hadoop's distributed compute framework, MapReduce, exploits the distributed storage architecture of Hadoop's distributed file system (HDFS) to deliver scalable, reliable parallel processing for arbitrary algorithms. The shuffle phase of a MapReduce computation moves intermediate data from Mappers to Reducers. Reducers are often delayed by inadequate bandwidth between them and the Mappers, lowering the performance of the cluster. OpenFlow is a popular example of software-defined networking (SDN) technology. Our study explores the use of OpenFlow to provide better link bandwidth for shuffle traffic, thereby decreasing the time Reducers must wait to gather data from Mappers. The approach illustrates how high-performance computing applications can improve performance by controlling their underlying network resources. The work presented in this paper is a starting point for experiments being done as part of the SC12 SCinet Research Sandbox, which will quantify the performance advantages of a version of Hadoop that uses OpenFlow to dynamically adjust the network topology of local and wide-area Hadoop clusters.
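The kind of flow entry a controller might install for shuffle traffic can be sketched as a plain match/action structure. This is an illustrative data layout, not a real controller API; the port number 13562 (the shuffle port in later Hadoop releases) and the queue IDs are assumptions, not details from the paper:

```python
def shuffle_flow_rule(mapper_ip, reducer_ip, shuffle_port=13562, queue_id=1):
    """Build an OpenFlow-style match/action entry (a plain dict here)
    that steers Mapper-to-Reducer shuffle traffic onto a
    higher-bandwidth egress queue."""
    return {
        "match": {"nw_src": mapper_ip, "nw_dst": reducer_ip,
                  "nw_proto": 6,              # TCP
                  "tp_dst": shuffle_port},    # assumed shuffle port
        "actions": [{"type": "SET_QUEUE", "queue_id": queue_id},
                    {"type": "OUTPUT", "port": "NORMAL"}],
        "priority": 100,  # outrank the default forwarding rule
    }

rule = shuffle_flow_rule("10.0.0.2", "10.0.0.7")
```

With such entries in place, shuffle flows are isolated onto a queue with a larger bandwidth share, which is the lever the study uses to shorten Reducer wait times.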

paper 6: A New Framework for Publishing and Sharing Network and Security Datasets
Mohammed S. Gadelrab, National Institute of Standards, Egypt
Ali Ghorbani, University of New Brunswick, Canada

Abstract: Datasets are a cornerstone of network and security research and development. Despite the continuous growth in the number of available datasets, there are no effective publishing and sharing mechanisms, so realistic and representative datasets are not only hard to construct but also difficult to select from the tens of thousands of datasets scattered across online repositories. This work aims to alleviate the difficulties inherent in searching, selecting and comparing datasets, and to decrease the ambiguity associated with dataset publication and sharing. In this paper we present the basis and implementation of a new framework for describing and sharing network datasets, with a special focus on security-related datasets. We present the underlying idea of the proposed framework and its key component: a Dataset Description Language (DDL) for expressing dataset metadata. We also explain how we implemented a proof-of-concept prototype, built entirely from open-source off-the-shelf (OSOTS) components, to demonstrate the framework's feasibility and usefulness. The prototype handles the huge number of already existing datasets by generating Dataset Description Sheets (DDS) automatically for traffic datasets. The proposed approach provides several benefits: it facilitates searching dataset repositories according to various criteria, and its XML output can be integrated easily with Security Content Automation Protocol (SCAP) tools. Not only does it communicate dataset properties in a clear and succinct manner, it also promotes sharing and publishing datasets in a formal and organized way, avoiding the risks inherent in publishing a malware binary or its source by itself.
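A Dataset Description Sheet in XML might look like the toy generator below. The element names and fields are purely illustrative; the paper's actual DDL schema is not given in the abstract:

```python
import xml.etree.ElementTree as ET

def make_dds(name, fmt, size_mb, labels):
    """Emit a toy Dataset Description Sheet as an XML string.
    Element names are invented for illustration, not the real DDL."""
    root = ET.Element("dataset")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "format").text = fmt
    ET.SubElement(root, "size_mb").text = str(size_mb)
    tags = ET.SubElement(root, "labels")
    for label in labels:
        ET.SubElement(tags, "label").text = label
    return ET.tostring(root, encoding="unicode")

dds = make_dds("darpa99-week1", "pcap", 742, ["intrusion", "labeled"])
```

Machine-readable sheets like this are what make the abstract's search and SCAP-integration scenarios possible: repositories can be filtered on any field without downloading the (potentially dangerous) payloads themselves.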

paper 8: Adaptive Data Transfers that Utilize Policies for Resource Sharing 
Junmin Gu, Lawrence Berkeley National Laboratory, USA
David Smith, University of Southern California Information Sciences Institute, USA
Ann L. Chervenak, University of Southern California Information Sciences, USA
Alex Sim, Lawrence Berkeley National Laboratory, USA

Abstract: With scientific data collected at unprecedented volumes and rates, the success of large scientific collaborations requires that they provide distributed data access with improved access latencies and increased reliability to a large user community. The goal of the ADAPT (Adaptive Data Access and Policy-driven Transfers) project is to develop and deploy a general-purpose data access framework for scientific collaborations that provides fine-grained, adaptive data transfer management and applies site and virtual organization (VO) policies for resource sharing. This paper presents our initial design and implementation of an adaptive data transfer framework. We also present preliminary performance measurements showing that adaptation and policy enforcement improve network performance.
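Policy-driven resource sharing of the kind ADAPT applies can be sketched as weighted, demand-capped bandwidth allocation. The weights, VO names, and max-min-style redistribution below are our illustrative assumptions, not ADAPT's actual policy engine:

```python
def allocate_bandwidth(total_mbps, vo_weights, demands):
    """Weighted max-min-style sharing: each VO gets its policy weight's
    share of the remaining capacity, capped at its demand; leftover
    capacity is redistributed among still-unsatisfied VOs."""
    alloc = {vo: 0.0 for vo in vo_weights}
    remaining = float(total_mbps)
    active = set(vo_weights)
    while active and remaining > 1e-9:
        total_w = sum(vo_weights[v] for v in active)
        spent, still_hungry = 0.0, set()
        for vo in active:
            share = remaining * vo_weights[vo] / total_w
            take = min(share, demands[vo] - alloc[vo])
            alloc[vo] += take
            spent += take
            if demands[vo] - alloc[vo] > 1e-9:
                still_hungry.add(vo)
        if spent < 1e-9:
            break  # everyone satisfied; leftover capacity stays idle
        remaining -= spent
        active = still_hungry
    return alloc

# Hypothetical policy: VO "atlas" weighted 3x over "cms" on a 1 Gbps link.
shares = allocate_bandwidth(1000, {"atlas": 3, "cms": 1},
                            {"atlas": 900, "cms": 500})
```

The cap-and-redistribute loop is what keeps a lightly loaded VO from wasting its policy share while a busy VO is starved.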

paper 11: A Network-aware Object Storage Service
Shigetoshi Yokoyama, National Institute of Informatics, Japan
Nobukazu Yoshioka, National Institute of Informatics, Japan
Motonobu Ichimura, NTT DATA Intellilink, Japan

Abstract: This study describes a trial of a network-aware object storage service. For scientific applications that need huge amounts of remotely stored data, the cloud infrastructure provides a 'cluster as a service' capability and an inter-cloud object storage service. Scientific applications move from locations with constrained resources to locations where they can be executed practically, and the inter-cloud object storage service must be network-aware in order to perform well.
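Network awareness in an object store ultimately comes down to replica selection under a network cost model. The sketch below, entirely our own stand-in for whatever cost model the service actually uses, picks the replica with the lowest estimated fetch time rather than simply the lowest latency:

```python
def pick_replica(replicas, rtt_ms, bw_mbps, object_mb):
    """Choose the replica minimizing estimated fetch time:
    one RTT of connection setup plus serialized transfer time."""
    def cost(site):
        return rtt_ms[site] / 1000.0 + object_mb * 8.0 / bw_mbps[site]
    return min(replicas, key=cost)

# Hypothetical sites: a nearby low-bandwidth replica vs a farther
# high-bandwidth one. For a 500 MB object, bandwidth wins.
best = pick_replica(["tokyo", "osaka"],
                    rtt_ms={"tokyo": 5, "osaka": 30},
                    bw_mbps={"tokyo": 100, "osaka": 1000},
                    object_mb=500)
```

For large scientific objects, the bandwidth term dominates, which is why a naively latency-based (network-unaware) choice performs poorly.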

paper 7: Efficient Attribute-based Data Access in Astronomy Analysis
Benson Ma, Lawrence Berkeley National Laboratory, USA
Arie Shoshani, Lawrence Berkeley National Laboratory, USA
Alex Sim, Lawrence Berkeley National Laboratory, USA
Kesheng Wu, Lawrence Berkeley National Laboratory, USA
Yong-Ik Byun, Yonsei University, Korea
Jaegyoon Hahm, Institute of Science and Technology Information, Korea
Min-Su Shin, University of Michigan, USA

Abstract: Large experiments and simulations running on high-performance computers are generating many petabytes of data. While cloud computing, with its many distributed computers, could meet the needs of analyzing these petabytes, the key bottleneck in effectively utilizing it is usually the data management process, including storage, indexing, searching, accessing, and transporting. Scientific data records may have an extremely large size, an extremely large file count, or in some cases both. Many analysis tasks perform computations on a subset of large data records satisfying user-specified constraints on attribute (variable) values. This subsetting typically reduces the amount of data to be moved over the cloud. However, the selected data records often span many different data files, and extracting the values from these files can be time-consuming, especially when the number of files is large. This work addresses the challenge of working with a large number of files. We use an astronomical data set as an example and use an efficient database indexing technique, named FastBit, to significantly speed up the subsetting operations. Overall, we aim to bring transparent and highly efficient attribute-based data access to scientists through a web-based Astronomy Data Analysis Portal. We briefly describe the system design and discuss the options for managing an extremely large number of files.
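The attribute-based subsetting the abstract describes rests on bitmap indexing. The sketch below shows the simplest equality-encoded form (one bitmap per distinct value, sets standing in for compressed bitmaps); FastBit itself uses compressed, binned encodings well beyond this toy:

```python
def build_bitmap_index(values):
    """One 'bitmap' (here a row-id set) per distinct attribute value:
    the simplest equality-encoded bitmap index."""
    index = {}
    for row, v in enumerate(values):
        index.setdefault(v, set()).add(row)
    return index

def query_range(index, lo, hi):
    """Rows whose attribute lies in [lo, hi): OR together the
    bitmaps of all matching values."""
    rows = set()
    for v, bitmap in index.items():
        if lo <= v < hi:
            rows |= bitmap
    return rows

magnitudes = [14.2, 15.1, 14.8, 16.0, 14.9]  # e.g. stellar magnitudes
idx = build_bitmap_index(magnitudes)
hits = query_range(idx, 14.5, 15.5)
```

The payoff is that a range query touches only the index, so the system learns exactly which rows (and hence which of the many files) it must read before moving any data.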