SmartSeg: A non-parametric approach for wearable camera video temporal segmentation

May 26, 2026, 9:12 PM

Liu, Yilin; Wang, Hanchen David; Fu, Haowei; Mason, Madison Lee; Li, Fanjie; Wise, Alyssa; Levin, Daniel T.; Biswas, Gautam; Ma, Meiyi. (2026). SmartSeg: A non-parametric approach for wearable camera video temporal segmentation. Pervasive and Mobile Computing, 121, 102223.

Wearable cameras can record daily life in a simple and convenient way, making it possible to analyze real-world activities as they happen. A major challenge is turning long, unorganized video into meaningful events, a process called temporal segmentation, which helps both people and computers understand what is happening in the footage. This is especially difficult for wearable camera video because the viewpoint changes constantly, activities vary a lot from place to place, and videos can be any length. To address this, the researchers developed SmartSeg, an unsupervised method that does not require labeled training data. SmartSeg uses a Temporal Self-Similarity Metric encoder, a model that looks for patterns of similarity across the video, and then groups sequences of frames into events using clustering, a technique that collects similar items together. The method was tested on three different datasets and outperformed current best methods, including a 50% improvement in Mean-over-Frames on one first-person video dataset. The researchers also applied it to nursing simulation videos, where it successfully separated complex and noisy interactions into meaningful activity changes. Overall, SmartSeg appears to be a strong tool for breaking long, messy wearable camera videos into understandable events in real-world settings.
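As a rough illustration of the two ideas in the summary above (self-similarity across frames, then grouping frames into events), the following Python sketch may help. It is not the authors' implementation: the per-frame feature vectors, the cosine similarity measure, the greedy grouping rule, and the `threshold` value are all assumptions chosen for brevity. The `mean_over_frames` helper shows Mean-over-Frames as plain frame-level accuracy, assuming the predicted labels have already been matched to the reference labels.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def self_similarity(features):
    """T x T matrix whose entry [i][j] compares frame i with frame j."""
    return [[cosine(u, v) for v in features] for u in features]

def segment(features, threshold=0.8):
    """Greedily group consecutive frames into segments.

    A frame stays in the current segment when its mean similarity to the
    frames already in that segment is at least `threshold` (a hypothetical
    tuning value); otherwise it opens a new segment.
    Returns one integer segment label per frame.
    """
    sim = self_similarity(features)
    labels = []
    start, seg_id = 0, 0
    for t in range(len(features)):
        if t > start:
            mean_sim = sum(sim[t][start:t]) / (t - start)
            if mean_sim < threshold:
                start, seg_id = t, seg_id + 1
        labels.append(seg_id)
    return labels

def mean_over_frames(predicted, reference):
    """Fraction of frames whose predicted label equals the reference label.

    Assumes both label sequences use the same label names, i.e. any
    matching between predicted and reference segments is already done.
    """
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Toy input: six frames with 2-D features, forming two visually distinct events.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
          [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
predicted = segment(frames)
print(predicted)
print(mean_over_frames(predicted, [0, 0, 1, 1, 1, 1]))
```

In practice the feature vectors would come from a learned encoder rather than hand-written numbers, and a real system would likely use a more robust clustering step than this single-pass threshold rule.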

Fig. 1. Challenges in temporal segmentation of wearable camera videos, illustrated with an example from a real-world nursing simulation. (1) Instability of camera views: Frequent head movements introduce changes in viewpoint, motion blur, and lighting variations. (2) Diverse activities and environments: The video contains over 10 distinct clinical activities performed across multiple environments (e.g., hallway, patient room, medication station), making segmentation more complex. (3) Flexible duration: The total video length exceeds 21 min, which poses challenges for long-range temporal modeling and efficient segmentation.
