This is a combination of two existing AI technologies.
The first is computer vision & scene understanding. That is, given a picture of a beach, the system recognizes features and creates a textual description such as "Three palm trees line the beachfront. The sun is 27 degrees above the horizon. Cumulonimbus clouds are seen above the horizon." The level of detail in the description can vary.
The second portion is a system that has pre-defined 3D models for many common things such as trees, clouds, people, etc.
The system first takes the picture and creates a textual description of it, then hands this over to the second component, which tries to reconstruct the scene. There will, of course, be ambiguities that the second AI system will have to resolve, thus creating a photorealistic rendering that _could_ have been the original thing but is actually a re-interpretation of it.
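A rough sketch of how the two components could be wired together, with the textual description as the only channel between them. Everything below is a hypothetical placeholder, not an existing toolkit:

    # Toy sketch of the two-stage pipeline. Every name here is a
    # made-up placeholder, just to illustrate the flow.
    MODEL_LIBRARY = {"palm tree": "palm_tree.obj", "cloud": "cumulonimbus.obj"}

    def describe_image(image_path: str) -> list[str]:
        """Stage 1 (vision): recognize features and emit sentences.
        A real system would run a captioning/scene-understanding model."""
        return ["Three palm trees line the beachfront.",
                "Cumulonimbus clouds are seen above the horizon."]

    def reconstruct_scene(sentences: list[str]) -> list[tuple[str, str]]:
        """Stage 2: match each sentence against the 3D model library,
        filling in ambiguous details (positions, lighting) with defaults."""
        placed = []
        for sentence in sentences:
            for name, mesh in MODEL_LIBRARY.items():
                if name.split()[0] in sentence.lower():  # crude keyword match
                    placed.append((name, mesh))
        return placed  # a renderer would turn these placements into a new image

    print(reconstruct_scene(describe_image("beach.jpg")))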
[Think about how many different pictures there are of famous historical events]
-- cowtamer, Apr 09 2009

Computational Heraldry
http://mustard.tapo.../xq/xhtml.xq?id=115
Automatic generation of coats of arms from a specialized medieval language called "blazon" [cowtamer, Apr 09 2009]

Yes, think about that. Wowie!
-- WcW, Apr 09 2009

Hey, if the 2nd part is baked, I'd love to have the actual link. I've heard there's some research into it, but I have not seen anything about it outside a graduate-level Computational Linguistics class I took once. Parsing a sentence to create a scene is still a non-trivial AI problem as far as I know.
-- cowtamer, Apr 09 2009

Could be a useful bandwidth-reducing tool for video, depending on the use and the extent of the object library.
-- Skrewloose, Apr 10 2009

The second part of the problem is more complicated than having an open-source 3D library. It's the problem of picking the appropriate shape and deducing from the description (and the rest of the scene that has already been constructed) where it should go. The scene may have to be revised as the description is parsed.
Consider this simple example:
"There is an office chair behind the desk. The chair is red and has 6 wheels. The desk is in front of the window."
The system would have to deduce:
* The type of chair
* That the chair is between the window and the desk
* That the desk is probably an office desk (and not a school desk)
* That the setting is probably an office (as opposed to an outdoor scene) and might have other office-appropriate items
* That the desk has an opening where the wheels are visible to the observer ("behind" is a relative concept that has no meaning without the location of an observer)
-- cowtamer, Apr 10 2009
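A toy sketch of that deduction step, assuming the observer sits at depth 0 and depth increases into the scene; every name and number below is a made-up placeholder, not an existing system:

    # Place objects from relational sentences, resolving "behind" /
    # "in front of" relative to an observer at depth 0 looking into
    # the scene (larger depth = farther away).
    scene = {"observer": 0.0, "window": 3.0}  # object name -> depth

    def place(obj: str, relation: str, anchor: str, gap: float) -> None:
        """Put `obj` a little behind or in front of `anchor`,
        from the observer's point of view."""
        base = scene[anchor]
        scene[obj] = base + gap if relation == "behind" else base - gap

    place("desk", "in front of", "window", gap=1.0)  # desk at depth 2.0
    place("chair", "behind", "desk", gap=0.5)        # chair at 2.5: between desk and window

    # "Behind" has no meaning without the observer: moving the observer to
    # the far side of the scene would flip every sign above, and a later
    # sentence can force earlier placements to be revised.
    print(scene)  # {'observer': 0.0, 'window': 3.0, 'desk': 2.0, 'chair': 2.5}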
I thought the idea was about translation, not interpretation... My daughter told me she saw that in action, baked by an Israeli kid inventor. The camera translates text in real time inside an image, using the texture and font size, and replaces it with another language.
-- pashute, Jan 05 2011

This is now pretty much baked. A year or two ago I read of a machine-learning-based system where an image would be converted to an internal representation and back, for the purpose of compression. However, the output image is not the same as the input one. It contains all the same things, but not in quite the same arrangement, and not with quite the same details. Unfortunately, I can't re-find the article I remember, just ones that almost exactly reproduce the input image, which is better for compression but worse for reinterpretation.
The only difference between what I remember seeing and this idea is that the internal representation isn't text (instead presumably being some vector of coefficients that is proprietary to the individual trained instance of the system), but it wouldn't be too hard to use image-to-text and text-to-image neural networks instead.
-- notexactly, Dec 18 2018
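The text-based variant suggested in that last paragraph can now be sketched with off-the-shelf models. A minimal sketch, assuming the Hugging Face transformers and diffusers libraries; the specific model checkpoints are illustrative choices, not part of the idea:

    # Image -> text -> image roundtrip, with text as the lossy
    # "internal representation".
    from transformers import pipeline
    from diffusers import StableDiffusionPipeline

    # Image -> text: caption the input photo (the "compression" step).
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    caption = captioner("original.jpg")[0]["generated_text"]

    # Text -> image: render a scene that matches the caption. The result
    # contains the same things, but not the same arrangement or details.
    generator = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    reinterpretation = generator(caption).images[0]
    reinterpretation.save("reinterpreted.jpg")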