Apple Open-Sources New Model SHARP! Turn Photos into 3D Worlds in Seconds

On December 11, Apple published a paper introducing SHARP, a 3D generation model that it claims can reconstruct a single image into a photorealistic 3D scene in under one second on a standard GPU. The model has since been open-sourced.

Users need only input a single ordinary photo; the model predicts the entire scene's 3D Gaussian representation parameters in a single forward pass of a neural network. The whole generation process completes in under one second on a standard GPU, after which high-resolution, photorealistic images of nearby viewpoints can be rendered in real time. Moreover, the 3D scenes SHARP generates have metric (absolute) scale, enabling precise camera displacement.
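To make the metric-scale point concrete, here is a minimal sketch of placing a novel camera at an absolute offset from the input view. It uses NumPy only and involves no SHARP code; the `camera_pose` helper is purely illustrative.

```python
import numpy as np

def camera_pose(translation: np.ndarray) -> np.ndarray:
    """4x4 camera-to-world matrix with identity rotation (illustrative)."""
    pose = np.eye(4)
    pose[:3, 3] = translation
    return pose

original = camera_pose(np.zeros(3))
shifted = camera_pose(np.array([0.05, 0.0, 0.0]))  # exactly 5 cm to the right

# A Gaussian-splatting renderer would rasterize the predicted scene from
# `shifted`; because the scene is metric, 0.05 really means 5 cm of parallax.
print(shifted)
```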

Quantitative evaluations show that SHARP generalizes zero-shot across diverse datasets, setting a new state of the art on several of them. Compared with existing state-of-the-art models, it reduces LPIPS (perceptual similarity) by 25-34% and DISTS (structure and texture similarity) by 21-43%, cuts synthesis time by three orders of magnitude, and renders high-resolution nearby views from its 3D representations at 100 frames per second.

Many developers have already tested the model. Some have integrated it into Vision Pro, producing immersive, highly detailed visuals from a single image.

Another user uploaded an oil painting, and the model generated a 3D scene with accurate spatial relationships and a complete composition.

Others noted that the model cannot generate parts of a scene that are not visible in the input, but that its greatest advantage is speed: “A MacBook Pro can complete the generation in just a few seconds...”

Detailed information about the model has been published on arXiv under the title "SHARP: Sharp Monocular View Synthesis in Less Than a Second."

Paper: https://arxiv.org/abs/2512.10685

Open-source code:

GitHub: https://github.com/apple/ml-sharp

Hugging Face: https://huggingface.co/apple/Sharp

1. Fidelity improved by approximately 20% to 40%, with synthesis time reduced by three orders of magnitude.

Researchers evaluated the SHARP model on multiple datasets, focusing on two metrics: LPIPS, which measures perceptual similarity (how closely differences between images track human subjective judgment), and DISTS, which measures structure and texture similarity between synthesized and real images. For both metrics, lower values indicate better performance.
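As a practical aside, both metrics have widely used open-source implementations. The sketch below computes them with the community `lpips` and `piq` Python packages; it is a generic illustration, not the paper's evaluation code.

```python
import torch
import lpips  # pip install lpips
import piq    # pip install piq

pred = torch.rand(1, 3, 256, 256)  # synthesized view, values in [0, 1]
gt = torch.rand(1, 3, 256, 256)    # real reference view, values in [0, 1]

# LPIPS expects inputs scaled to [-1, 1]; lower means perceptually closer.
lpips_fn = lpips.LPIPS(net="alex")
lpips_score = lpips_fn(pred * 2 - 1, gt * 2 - 1).item()

# DISTS compares structure and texture statistics; inputs stay in [0, 1].
dists_score = piq.DISTS()(pred, gt).item()

print(f"LPIPS: {lpips_score:.4f}  DISTS: {dists_score:.4f}")
```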

As baselines, the researchers selected several existing state-of-the-art models: Flash3D, which is based on 3D Gaussian splatting; TMPI, which uses multi-plane images; LVSM, which is based on image regression; and the diffusion-based Stable Virtual Camera (SVC), ViewCrafter, and Gen3C.

Quantitative evaluations show that SHARP achieves the best performance on every dataset tested. Compared with the current state-of-the-art models, it reduces LPIPS by 25-34% and DISTS by 21-43%.

The researchers also measured performance on the single-image synthesis task. On a single GPU, SHARP achieves among the fastest synthesis times while maintaining high image fidelity; compared with models of comparable quality, it is roughly three orders of magnitude faster, demonstrating its advantage in both efficiency and quality.

In under one second, the model not only generates the 3D content but can then render high-resolution nearby views at over 100 frames per second. Qualitative results highlight SHARP's handling of fine detail: the first example shows clean separation between foreground and background, the second shows strong color and shape stability, and the third resolves individual strands of an animal's fur.

2. Capable of real-time rendering and predicting high-resolution 3D representations, but unable to generate unseen portions.

View synthesis research has evolved from early classical methods based on multi-image geometric modeling, through the deep-learning-era breakthrough of implicit representations exemplified by neural radiance fields (NeRF), to recent explicit and efficiently renderable techniques such as 3D Gaussian splatting.

Previously, most Gaussian splatting methods required capturing dozens or even hundreds of images of the same scene from different viewpoints. SHARP, by contrast, generates a 3D scene from a single image: one forward pass of the neural network predicts a complete 3D Gaussian scene representation from a single photograph.

SHARP is trained in two stages: synthetic-data training followed by self-supervised fine-tuning. In the first stage, the model is trained on synthetic data with perfect image and depth ground-truth labels, learning the fundamentals of 3D reconstruction. In the second stage, it is fine-tuned self-supervised on real photographs that lack ground-truth labels; by generating pseudo ground truth, the model adapts to real images, improving its real-world performance.
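The sketch below outlines this two-stage recipe in PyTorch. Every component is a toy stand-in chosen so the snippet runs on its own; the real model, renderer, and losses live in the apple/ml-sharp repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 1, 3, padding=1)    # toy depth predictor, not SHARP
teacher = nn.Conv2d(3, 1, 3, padding=1)  # toy pseudo-label generator
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: supervised training on synthetic data with exact depth labels.
img, gt_depth = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
loss = F.l1_loss(model(img), gt_depth)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: self-supervised fine-tuning on unlabeled real photos, substituting
# pseudo ground truth from a frozen teacher for the missing real labels.
real_img = torch.rand(2, 3, 64, 64)
with torch.no_grad():
    pseudo_depth = teacher(real_img)
loss = F.l1_loss(model(real_img), pseudo_depth)
opt.zero_grad(); loss.backward(); opt.step()
```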

The research team introduced three innovations in SHARP. First, an end-to-end trainable architecture capable of predicting high-resolution 3D representations. Second, a robust and efficient loss configuration: the researchers carefully selected a set of loss functions that prioritize view-synthesis quality while keeping training stable and suppressing common visual artifacts. Third, a concise depth-alignment module that effectively resolves depth ambiguity during training.

SHARP comprises four learnable modules: a pre-trained encoder for feature extraction, a depth decoder that produces two independent depth layers, a depth-adjustment module, and a Gaussian decoder that predicts all Gaussian properties. A differentiable Gaussian initializer and combiner then assemble the Gaussians into the final 3D representation, and the predicted Gaussians are rendered into the input view and novel views for loss computation.
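To show how four such modules might chain together, here is a toy PyTorch composition. The layer shapes, channel counts, and names are ours alone and do not reflect Apple's actual implementation.

```python
import torch
import torch.nn as nn

class ToySharpPipeline(nn.Module):
    """Toy stand-in showing the encoder -> depth -> Gaussian-decoder chain."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)        # features
        self.depth_decoder = nn.Conv2d(16, 2, 3, padding=1)  # two depth layers
        self.depth_adjust = nn.Conv2d(2, 2, 1)               # depth alignment
        # 14 channels ~ position (3) + scale (3) + rotation (4)
        # + opacity (1) + color (3) per Gaussian
        self.gaussian_decoder = nn.Conv2d(16 + 2, 14, 3, padding=1)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(img)
        depth = self.depth_adjust(self.depth_decoder(feats))
        # A (non-learned) initializer/combiner would assemble these raw
        # per-pixel predictions into the final set of 3D Gaussians.
        return self.gaussian_decoder(torch.cat([feats, depth], dim=1))

out = ToySharpPipeline()(torch.rand(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 14, 64, 64])
```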

During optimization, SHARP combines multiple loss functions to improve the quality of synthesized views, including a rendering loss, a depth loss, and regularization losses. Together, these drive the model toward high-quality 3D representations that support real-time rendering.
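A generic sketch of such a weighted combination is shown below; the specific terms and weights are illustrative assumptions, not the paper's actual loss configuration.

```python
import torch
import torch.nn.functional as F

def total_loss(rendered, target, pred_depth, ref_depth, opacity,
               w_depth=0.1, w_reg=0.01):
    """Illustrative combination of rendering, depth, and regularization terms."""
    render_term = F.l1_loss(rendered, target)      # view-synthesis quality
    depth_term = F.l1_loss(pred_depth, ref_depth)  # geometry supervision
    reg_term = opacity.abs().mean()                # discourage stray Gaussians
    return render_term + w_depth * depth_term + w_reg * reg_term

loss = total_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                  torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
                  torch.rand(1, 1000))
print(loss.item())
```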

Based on the above techniques, SHARP achieves reliable 3D scene reconstruction without relying on multiple images or time-consuming per-scene optimization. This design involves a trade-off, however: while SHARP renders nearby viewpoints accurately, it cannot synthesize parts of the scene that are entirely invisible in the input, so users cannot stray too far from the original photograph's camera position.

Conclusion: The Barrier to 3D Scene Generation Has Been Lowered Once Again

The SHARP model has achieved significant progress in single-image viewpoint synthesis. During a single forward pass, it completes the entire process from 2D image understanding and 3D geometric reconstruction to detail refinement, ultimately outputting a 3D scene model capable of real-time rendering.

In practical applications, by rendering high-fidelity 3D scenes in real time, the SHARP model may deliver more immersive experiences for VR/AR applications, opening up new possibilities for industries such as gaming, film, and architecture. The research team stated they will also expand their existing methodology by integrating techniques like diffusion models to support the synthesis of distant viewpoints.
