Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

Rundong Luo1   Matthew Wallingford2   Ali Farhadi2   Noah Snavely1   Wei-Chiu Ma1
1Cornell University   2University of Washington

Note: It may take some time for the videos to load.

360° videos generated by our model, Argus*. Starting from an input perspective video with arbitrary camera motion (red boxes), Argus generates a full 360° panoramic video, with the red box indicating the corresponding region in the generated frame. The blue, orange, and purple boxes show additional sampled perspectives from the generated 360° video.

*Argus is named after a figure in Greek mythology with many eyes, symbolizing the ability to observe from multiple perspectives.

Interactive 360° Video Visualization

Hover over the video and click it to view in 360°

We test Argus on in-the-wild videos capturing everyday activities to verify its robustness. The input region is highlighted in red. As shown, Argus can generate long-term, immersive, and realistic 360° videos from real-world perspective inputs.

Abstract

360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more holistic perspective of our surroundings. However, while existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is coherent with the input. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain geometric and dynamic consistency with the input. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware modules to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from arbitrary, in-the-wild perspective inputs. Additionally, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.

Analysis: Interpreting Scene Dynamics

We demonstrate that Argus accurately understands dynamics across the 360° scene from a narrow perspective input. Using a 360° camera, we captured a video of a car driving by and provided our model with a 60° horizontal FoV region from a static camera pose (left). The car's ground-truth trajectory (middle) and our model's predicted trajectory (right) align closely, confirming Argus's ability to interpret scene dynamics. A sketch of how such a fixed-FoV perspective region can be extracted from an equirectangular frame follows the figure below.


Input Video

Ground truth trajectory

Predicted trajectory (ours)
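
For readers curious how the 60° input region is obtained, below is a minimal sketch (not the paper's code) of sampling a fixed-FoV pinhole view from an equirectangular 360° frame. It assumes NumPy and OpenCV; the function name, output resolution, and rotation convention are our own illustrative choices.

```python
# Minimal sketch: sample a perspective (pinhole) view with a given horizontal FoV
# from an equirectangular 360° frame. Conventions here are assumptions, not the
# paper's implementation.
import numpy as np
import cv2


def equirect_to_perspective(pano, h_fov_deg=60.0, yaw_deg=0.0, pitch_deg=0.0,
                            out_w=640, out_h=360):
    """Render a pinhole view of `pano` looking along (yaw, pitch)."""
    pano_h, pano_w = pano.shape[:2]
    h_fov = np.deg2rad(h_fov_deg)
    focal = 0.5 * out_w / np.tan(0.5 * h_fov)          # pinhole focal length in pixels

    # Ray directions in the virtual camera frame (x right, y down, z forward).
    xs, ys = np.meshgrid(np.arange(out_w), np.arange(out_h))
    dirs = np.stack([(xs - 0.5 * out_w) / focal,
                     (ys - 0.5 * out_h) / focal,
                     np.ones_like(xs, dtype=np.float64)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by the requested viewing direction (yaw about y, pitch about x).
    yaw, pitch = np.deg2rad(yaw_deg), np.deg2rad(pitch_deg)
    R_yaw = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                      [0, 1, 0],
                      [-np.sin(yaw), 0, np.cos(yaw)]])
    R_pitch = np.array([[1, 0, 0],
                        [0, np.cos(pitch), -np.sin(pitch)],
                        [0, np.sin(pitch), np.cos(pitch)]])
    dirs = dirs @ (R_yaw @ R_pitch).T

    # Convert rays to longitude/latitude, then to equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    map_x = np.mod((lon / (2 * np.pi) + 0.5) * pano_w, pano_w).astype(np.float32)
    map_y = np.clip((lat / np.pi + 0.5) * pano_h, 0, pano_h - 1).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)
```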

Analysis: Reconstructing the Scene from Generated Videos

We unwrap a rotating perspective video from our generated 360° video and show the scene reconstructed from it using MegaSaM. As shown, the reconstruction is geometrically consistent, indicating that our generated 360° videos achieve high realism. A small sketch of the unwrapping step follows below.


Click the image to view interactive results


MegaSaM Reconstruction
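
As an illustration of the unwrapping step, here is a hypothetical use of the `equirect_to_perspective` sketch above: pan the virtual camera's yaw over the generated panoramic frames to produce the rotating perspective video that is fed to MegaSaM. The variable names are assumptions.

```python
# Hypothetical usage of the sketch above: `pano_frames` is a list of generated
# equirectangular frames; the yaw advances one full revolution over the clip.
rotating_video = [
    equirect_to_perspective(pano, h_fov_deg=60.0,
                            yaw_deg=i * 360.0 / len(pano_frames))
    for i, pano in enumerate(pano_frames)
]
```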

Analysis: Using Generated Videos as Input

We test Argus on perspective videos generated by the text-to-video model Gen-3-Turbo with the prompt "Central Park." As shown, Argus generalizes to generated videos as well.


Input video

360° video generated by Argus

Comparison with PanoDiffusion (Image-to-360° Generation)

Qualitative comparison with the 360° image generation method PanoDiffusion [1]. The input region is highlighted in red, while orange and blue regions indicate extracted perspective views. Although PanoDiffusion can generate plausible 360° images from perspective inputs, it struggles to maintain temporal consistency across frames.


[1] Wu et al. PanoDiffusion: 360-degree Panorama Outpainting via Diffusion. In ICLR, 2024.

Comparison with Follow-Your-Canvas (Video Outpainting)

Qualitative comparison with Follow-Your-Canvas [2] for 360° video generation. Videos generated by Follow-Your-Canvas look like ordinary perspective videos rather than panoramas, and their quality declines noticeably as the generation extends further from the input viewpoint.


[2] Chen et al. Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation. In AAAI, 2025.

Application: Video Stabilization

Argus can be applied to video stabilization without any modification. Traditional video stabilization techniques require cropping, which reduces the field of view and discards visual information. In contrast, Argus enables stabilization with a consistent field of view, since the generated panorama preserves scene information across frames.
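
As a rough illustration of this idea (not the authors' implementation), if each generated panoramic frame is aligned with the shaky input camera and per-frame camera orientations are available (e.g., from a SLAM or IMU estimate, an assumption on our part), sampling every frame along one fixed world direction cancels the camera shake without cropping:

```python
# Hedged sketch: cancel the input camera's per-frame rotation by sampling the
# generated panorama along a fixed world direction. `pano_frames` and
# `orientations` (per-frame yaw/pitch of the input camera) are assumed inputs.
stabilized = [
    equirect_to_perspective(pano, h_fov_deg=60.0, yaw_deg=-yaw, pitch_deg=-pitch)
    for pano, (yaw, pitch) in zip(pano_frames, orientations)
]
```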

Application: Camera View Direction Control

Argus enables camera viewpoint control in dynamic environments by unwrapping the generated 360° scene into perspective views. This capability allows exploration beyond the initial field of view, enhancing immersion in the scene.

Application: Dynamic Environment Map for Object Relighting

Argus enables realistic object relighting by using the generated 360° panoramic videos as dynamic environment maps. We show the results of rendering a metallic sphere in Blender with the generated videos.
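
For reference, below is a minimal Blender Python (bpy) sketch of this kind of setup, not the authors' exact pipeline: one generated equirectangular frame is loaded as the world environment texture and a metallic sphere is rendered under it. The file path and material values are placeholders.

```python
# Minimal bpy sketch (run inside Blender): use a generated equirectangular frame
# as the world environment map and render a metallic sphere lit by it.
import bpy

world = bpy.context.scene.world
world.use_nodes = True
nodes, links = world.node_tree.nodes, world.node_tree.links

# Environment texture from one generated panoramic frame (placeholder path).
env = nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("/path/to/generated_pano_frame_0001.png")
links.new(env.outputs["Color"], nodes["Background"].inputs["Color"])

# Add a metallic sphere to be relit by the panorama.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 0))
sphere = bpy.context.object
mat = bpy.data.materials.new("Metal")
mat.use_nodes = True
bsdf = mat.node_tree.nodes["Principled BSDF"]
bsdf.inputs["Metallic"].default_value = 1.0
bsdf.inputs["Roughness"].default_value = 0.05
sphere.data.materials.append(mat)

bpy.ops.render.render(write_still=True)
```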


Application: Interactive Visual Question Answering

The panoramic videos generated by Argus can aid visual question answering in dynamic environments. By enabling free rotation of the camera, Argus allows the scene to be viewed from multiple perspectives, supporting comprehensive spatial understanding. This flexibility enables interactive visual question answering, such as verifying whether a vehicle overlaps with a crosswalk, overcoming the limitations of fixed-viewpoint videos. This capability enhances scene comprehension and opens new possibilities for video analysis applications.


Dataset

We start with the 360-1M dataset [3], which contains approximately 1 million videos of varying quality, and systematically filter it down to 283,863 high-quality 10-second video clips. Examples from our dataset are shown below.
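
Purely as an illustration of the clip format (the actual filtering criteria are not part of this sketch), a source 360° video could be cut into fixed 10-second segments with ffmpeg before any quality filtering is applied; paths and clip counts below are placeholders.

```python
# Illustrative sketch only: cut a source video into consecutive 10-second clips
# via ffmpeg stream copy. This is not the paper's filtering pipeline.
import subprocess


def split_into_clips(src_path, dst_pattern, clip_len_s=10, num_clips=6):
    """Cut `src_path` into consecutive `clip_len_s`-second clips."""
    clips = []
    for i in range(num_clips):
        dst = dst_pattern.format(i)  # e.g. "clips/clip_{:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(i * clip_len_s), "-t", str(clip_len_s),
             "-i", src_path, "-c", "copy", dst],
            check=True,
        )
        clips.append(dst)
    return clips
```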


[3] Wallingford et al. From an Image to a Scene: Learning to Imagine the World from a Million 360° Videos. In NeurIPS, 2024.