Constant velocity 3d convolution

Yusuke Sekikawa, Kohta Ishikawa, Hideo Saito

Research output: Contribution to journal › Article

Abstract

We propose a novel 3-D convolution method, cv3dconv, for extracting spatiotemporal features from videos. It reduces the number of sum-of-products operations in 3-D convolution by thousands of times by assuming a constant moving velocity of the features. We observed that a specific class of video sequences, such as video captured by an in-vehicle camera, can be well approximated by piece-wise linear movements of 2-D features along the temporal dimension. Our principal finding is that a 3-D kernel represented by a constant velocity can be decomposed into a convolution of a 2-D-shaped kernel and a 3-D velocity kernel, which is parameterized using only two parameters. We derived an efficient recursive algorithm for this class of 3-D convolution, which is exceptionally well suited for sparse spatiotemporal data, and this parameterized, decomposed representation imposes a structured regularization along the temporal direction. We experimentally verified the validity of our approximation using a controlled dataset, and we also showed the effectiveness of cv3dconv by adopting it in deep neural networks (DNNs) for a visual odometry estimation task, using a publicly available event-based camera dataset captured in an urban road scene. Our DNN architecture improves the estimation accuracy by about 30% compared with the existing state-of-the-art architecture designed for event data.
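
To make the decomposition concrete, here is a rough sketch (our own illustration, not code from the paper): it builds a 3-D kernel whose temporal slices are a single shared 2-D kernel translated by a constant integer velocity per frame, so the whole spatiotemporal kernel is described by the 2-D shape plus only the two velocity parameters. The function name, the integer-shift simplification, and the use of circular shifts are assumptions made for brevity.

```python
import numpy as np

def constant_velocity_kernel(k2d, vx, vy, T):
    """Hypothetical helper: stack T copies of a 2-D kernel, each shifted
    by (vy, vx) pixels per frame, into a constant-velocity 3-D kernel."""
    H, W = k2d.shape
    k3d = np.zeros((T, H, W), dtype=k2d.dtype)
    for t in range(T):
        # np.roll applies a circular shift; a faithful implementation
        # would zero-pad at the borders instead.
        k3d[t] = np.roll(np.roll(k2d, shift=vy * t, axis=0),
                         shift=vx * t, axis=1)
    return k3d

# Example: a 3x3 Laplacian-like spatial kernel drifting one pixel right per frame.
k2d = np.array([[0., 1., 0.],
                [1., -4., 1.],
                [0., 1., 0.]])
k3d = constant_velocity_kernel(k2d, vx=1, vy=0, T=4)
print(k3d.shape)  # (4, 3, 3): the 2-D shape plus just two velocity parameters
```

Under this constant-velocity assumption, applying the 3-D kernel to a video reduces to shifting and accumulating 2-D responses frame by frame, which is one way to see why a recursive evaluation can be cheap, especially on sparse event data.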

Original language: English
Article number: 8543783
Pages (from-to): 76490-76501
Number of pages: 12
Journal: IEEE Access
Volume: 6
DOIs: https://doi.org/10.1109/ACCESS.2018.2883340
Publication status: Published - 2018 Jan 1

Keywords

  • 3D convolution
  • Convolutional neural network
  • event-based camera
  • spatiotemporal convolution

ASJC Scopus subject areas

  • Computer Science (all)
  • Materials Science (all)
  • Engineering (all)

Cite this

Sekikawa, Y., Ishikawa, K., & Saito, H. (2018). Constant velocity 3d convolution. IEEE Access, 6, 76490-76501. [8543783]. https://doi.org/10.1109/ACCESS.2018.2883340

@article{c2fea70353b54db1aeeb8a05236b7681,
title = "Constant velocity 3d convolution",
keywords = "3D convolution, Convolutional neural network, event-based camera, spatiotemporal convolution",
author = "Yusuke Sekikawa and Kohta Ishikawa and Hideo Saito",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/ACCESS.2018.2883340",
language = "English",
volume = "6",
pages = "76490--76501",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}
