TY - JOUR
T1 - Spatiotemporal Video Highlight by Neural Network Considering Gaze and Hands of Surgeon in Egocentric Surgical Videos
AU - Yoshida, Keitaro
AU - Hachiuma, Ryo
AU - Tomita, Hisako
AU - Pan, Jingjing
AU - Kitani, Kris
AU - Kajita, Hiroki
AU - Hayashida, Tetsu
AU - Sugimoto, Maki
N1 - Funding Information:
This work was supported by JST CREST (JPMJCR14E1), CAO SIP (SIPAIH18D02), MHLW Health, Labour and Welfare Sciences Research Grants (Research on Medical ICT and Artificial Intelligence Program, Grant Number 20AC1004), and MIC/SCOPE (#201603003).
Publisher Copyright:
© 2022 World Scientific Publishing Company.
PY - 2022/3/1
Y1 - 2022/3/1
N2 - In the medical field, surgical videos can be used to teach surgical skills. Medical students and residents watch such videos to study surgical techniques and accelerate their learning, compensating for limited opportunities to participate in operations. Recording egocentric surgical videos with a wearable camera is one way to capture a surgeon's skills in detail. However, most egocentric surgical videos are quite long; in breast tumor removal, for example, the recording often reaches 2 h. At that length, finding the important scenes is time-consuming, particularly because many surgical videos include nonessential scenes such as sterilization and preparation of tools. Machine-learning-based scene estimation can be applied to extract specific scenes from a long video. Furthermore, knowing where the surgeon is looking is important for observing the incision area in detail; in particular, being able to zoom in on key elements allows viewers to see the incision area and the fine details of the required surgical skills. In this study, we aimed to highlight incision scenes in egocentric surgical videos in the spatiotemporal domain using two neural networks, one for temporal and one for spatial highlighting. For the temporal highlights, we designed a neural network that estimates incision scenes by learning gaze speed, hand movements, the number of hands, and background movements in egocentric surgical videos. For the spatial highlights, to estimate the important area to zoom in on, we designed a neural network that learns the surgeon's gaze on natural features of surgical scenes and produces a probability map representing the estimated gaze area. The estimated gaze area was also used to calculate an appropriate zoom-in position and zoom-in ratio. To let users control the highlighting parameters according to their preferences, we also built a user interface for selecting the playback speed gain and zoom ratio gain. For the evaluation, we verified the performance of the networks through a quantitative assessment and conducted a user study with medical doctors, who watched an actual surgical video to provide a qualitative assessment of the proposed system.
AB - In the medical field, surgical videos can be used to teach surgical skills. Medical students and residents watch such videos to study surgical techniques and accelerate their learning, compensating for limited opportunities to participate in operations. Recording egocentric surgical videos with a wearable camera is one way to capture a surgeon's skills in detail. However, most egocentric surgical videos are quite long; in breast tumor removal, for example, the recording often reaches 2 h. At that length, finding the important scenes is time-consuming, particularly because many surgical videos include nonessential scenes such as sterilization and preparation of tools. Machine-learning-based scene estimation can be applied to extract specific scenes from a long video. Furthermore, knowing where the surgeon is looking is important for observing the incision area in detail; in particular, being able to zoom in on key elements allows viewers to see the incision area and the fine details of the required surgical skills. In this study, we aimed to highlight incision scenes in egocentric surgical videos in the spatiotemporal domain using two neural networks, one for temporal and one for spatial highlighting. For the temporal highlights, we designed a neural network that estimates incision scenes by learning gaze speed, hand movements, the number of hands, and background movements in egocentric surgical videos. For the spatial highlights, to estimate the important area to zoom in on, we designed a neural network that learns the surgeon's gaze on natural features of surgical scenes and produces a probability map representing the estimated gaze area. The estimated gaze area was also used to calculate an appropriate zoom-in position and zoom-in ratio. To let users control the highlighting parameters according to their preferences, we also built a user interface for selecting the playback speed gain and zoom ratio gain. For the evaluation, we verified the performance of the networks through a quantitative assessment and conducted a user study with medical doctors, who watched an actual surgical video to provide a qualitative assessment of the proposed system.
KW - Scene estimation
KW - egocentric video
KW - gaze point
KW - surgical video
KW - video editing
UR - http://www.scopus.com/inward/record.url?scp=85121875893&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121875893&partnerID=8YFLogxK
U2 - 10.1142/S2424905X21410014
DO - 10.1142/S2424905X21410014
M3 - Article
AN - SCOPUS:85121875893
SN - 2424-9068
VL - 7
JO - Journal of Medical Robotics Research
JF - Journal of Medical Robotics Research
IS - 1
M1 - 2141001
ER -