Applying Pwrake Workflow System and Gfarm File System to Telescope Data Processing

Masahiro Tanaka, Osamu Tatebe, Hideyuki Kawashima

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we describe a use case in which a scientific workflow system and a distributed file system are applied to improve the performance of telescope data processing. The application is the pipeline processing of data generated by Hyper Suprime-Cam (HSC), a focal-plane camera mounted on the Subaru Telescope. We focus on the scalability of parallel I/O and on core utilization. IBM Spectrum Scale (GPFS), which is used in actual operation, has limited scalability due to its configuration using storage servers. We therefore introduce the Gfarm file system, which uses the local storage of worker nodes to obtain parallel I/O performance. To improve core utilization, we introduce the Pwrake workflow system in place of the parallel processing framework developed for the HSC pipeline. Descriptions of task dependencies are necessary to further improve core utilization by overlapping different types of tasks, and we discuss the usefulness of a workflow description language with the capabilities of a scripting language for defining such complex task dependencies. In the experiment, the performance of the pipeline is evaluated using a quarter of one night's observation data (input files: 80 GB, output files: 1.2 TB). Strong-scaling measurements from 48 to 576 cores show that processing with the Gfarm file system scales better than with GPFS. A measurement with 576 cores shows that our method improves the processing speed of the pipeline by a factor of 2.2 compared with the method used in actual operation.
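
Note on the workflow description: Pwrake workflows are ordinary Rakefiles written in Ruby, and this record contains no code, so the short Python sketch below is only a hedged illustration of the idea the abstract refers to, namely using a general-purpose scripting language to generate many file-level tasks and their dependencies programmatically so that independent tasks of different types can be overlapped. The visit IDs, file paths, and task names are hypothetical, and the snippet is not Pwrake syntax.

    # Hedged illustration only: Pwrake workflows are written as Rakefiles in Ruby;
    # this Python sketch merely shows the general idea of generating file-level
    # task dependencies with a scripting language. All names below (visit IDs,
    # paths) are hypothetical examples, not values taken from the paper.
    from collections import namedtuple

    Task = namedtuple("Task", ["name", "inputs", "outputs"])

    NUM_CCDS = 104                 # HSC has 104 science CCDs
    VISITS = ["v0100", "v0102"]    # hypothetical visit identifiers

    tasks = []

    # One CCD-level calibration task per (visit, CCD) pair; these tasks are
    # mutually independent, so a workflow engine may run them in parallel.
    for visit in VISITS:
        for ccd in range(NUM_CCDS):
            tasks.append(Task(
                name=f"calib-{visit}-{ccd:03d}",
                inputs=[f"raw/{visit}/{ccd:03d}.fits"],
                outputs=[f"calexp/{visit}/{ccd:03d}.fits"],
            ))

    # A downstream task that consumes every CCD-level output. Because the
    # dependency edges are explicit, calibration tasks of one visit can overlap
    # with other task types instead of waiting at a global barrier.
    tasks.append(Task(
        name="mosaic",
        inputs=[out for t in tasks for out in t.outputs],
        outputs=["mosaic/solution.fits"],
    ))

    # Sanity check of the resulting DAG: each input is either a raw file or is
    # produced by an earlier task (the property a workflow system schedules on).
    produced = set()
    for t in tasks:
        for dep in t.inputs:
            assert dep.startswith("raw/") or dep in produced, dep
        produced.update(t.outputs)

    print(f"defined {len(tasks)} tasks")

In Pwrake itself, the same pattern is expressed with Rake file tasks, with Ruby playing the role that the loops above play in this sketch.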

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 124-133
Number of pages: 10
Volume: 2018-September
ISBN (Electronic): 9781538683194
DOI: https://doi.org/10.1109/CLUSTER.2018.00024
Publication status: Published - 2018 Oct 29
Event: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 - Belfast, United Kingdom
Duration: 2018 Sep 10 - 2018 Sep 13

Other

Other: 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
Country: United Kingdom
City: Belfast
Period: 18/9/10 - 18/9/13

Keywords

  • Distributed file system
  • Scientific workflow system

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing

Cite this

Tanaka, M., Tatebe, O., & Kawashima, H. (2018). Applying Pwrake Workflow System and Gfarm File System to Telescope Data Processing. In Proceedings - 2018 IEEE International Conference on Cluster Computing, CLUSTER 2018 (Vol. 2018-September, pp. 124-133). [8514866] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2018.00024
