A research team from Microsoft Research Asia, the University of Science and Technology of China, Huazhong University of Science and Technology, and Tsinghua University leverages the inherent spatiotemporal locality of videos to design a pure-transformer backbone architecture for video recognition that achieves a better speed-accuracy trade-off than prior video transformers.
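
The locality bias works by restricting self-attention to small non-overlapping 3D windows rather than running it over every token in the clip. Below is a minimal PyTorch sketch of that window-partitioning step; the function name, tensor shapes, and window sizes are illustrative assumptions, not the authors' exact code:

```python
import torch

def window_partition_3d(x, window_size):
    """Split a video feature map into non-overlapping 3D windows.

    x: tensor of shape (B, D, H, W, C) -- batch, frames, height, width, channels.
    window_size: (Wd, Wh, Ww) temporal/spatial extents; each is assumed to
    divide the corresponding input dimension evenly.
    Returns (num_windows * B, Wd*Wh*Ww, C), so self-attention can run
    independently inside each local window instead of over the whole clip.
    """
    B, D, H, W, C = x.shape
    Wd, Wh, Ww = window_size
    x = x.view(B, D // Wd, Wd, H // Wh, Wh, W // Ww, Ww, C)
    # Group the window-index axes together, then the within-window axes.
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return windows.view(-1, Wd * Wh * Ww, C)

# Attention cost drops from O((D*H*W)^2) over the full clip to
# O(num_windows * (Wd*Wh*Ww)^2) -- the locality the paper exploits.
feats = torch.randn(2, 8, 56, 56, 96)        # hypothetical post-patch-embedding tokens
wins = window_partition_3d(feats, (2, 7, 7)) # -> shape (512, 98, 96)
print(wins.shape)
```

Because each window holds only a few dozen tokens, attention inside it is cheap; shifting the window grid between consecutive layers (as Swin-style architectures do) lets information flow across window boundaries.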

Here is a quick read: Video Swin Transformer Improves Speed-Accuracy Trade-offs, Achieves SOTA Results on Video Recognition Benchmarks.

The code is available on the project's GitHub. The paper Video Swin Transformer is on arXiv.


