- **Key findings from a rigorous test**
- Testing AI's GeoGuessr Genius
- Reflections and takeaways
Can OpenAI's new AI "see" where you are from a single image? The answer: quite possibly. Scott Alexander tested OpenAI's latest o3 model at GeoGuessr (a game of guessing a geographic location from a picture). He found that in some cases o3 was astonishingly accurate about where a picture was taken, sometimes to within a few kilometres, and that it even correctly identified where he took a photograph on an 18,000-foot mountain in Nepal.
**What is GeoGuessr?** GeoGuessr is a game in which the system shows you a picture of a random location (from Google Street View or submitted by a user) and you have to guess which country, which city, or even which street it is. In effect, it is Google Street View turned into a guess-where-you-are game.
To make sure the AI could not cheat (for example through image metadata or IP geolocation), the author took the following precautions:
- He used old photographs he had taken himself, which do not exist anywhere on the internet;
- He flipped, cropped, and re-saved the pictures in a drawing program to strip any hidden information;
- He used a **highly detailed prompt**, prepared in advance, which forces the AI to reason and analyze step by step as a human would;
- He explicitly required the AI not to use any non-image information (such as network location or conversation history) to guess.
He then showed the AI **five pictures of increasing difficulty**.
**Key findings from a rigorous test:**
- OpenAI's latest model, o3, approaches or exceeds the best humans at inferring a geographic location from a single photograph. Even in pictures with almost no distinguishing features, it can still deduce the location from small details. It gave the author his first real taste of the "chimpanzee looking at a helicopter" shock.
- With metadata stripped, images flipped, and locations far from the author's home, the experiments show that o3 can be "nearly perfect" or even "superhuman" on outdoor photographs rich in landscape and vegetation clues, while it degrades markedly on indoor or minimal-information scenes (extreme close-ups, a dorm room). The results suggest that this is not cheating but a real, stable visual-reasoning ability, albeit one with clear weaknesses.
- The chain of thought o3 gives shows that it relies on physical clues humans can understand (grass type, rock weathering, sky colour, river silt, camera-era image noise, and so on); it is not blind guessing. This means that even when its performance looks "supernatural", its reasoning stays within existing cognitive frameworks, suggesting that we can learn and partly reproduce this skill without mythologizing AI.
Testing AI's GeoGuessr Genius
Some of the wilder theories about superintelligence portray AI as magic: escaping from an air-gapped data centre; building nanomachines from simple components; breaking through physical bottlenecks and completely transforming the economy within months.
More sober thinkers point out that these things may not be physically possible. However smart you are, you can't do things that are physically impossible.
"No, you don't understand," the speculators counter. "When your IQ is 800 points lower than someone else's, everything they do seems impossible. A chimpanzee climbs a tree and may feel that humans cannot reach it; it cannot foresee arrows, ladders, chainsaws, or helicopters. For a superintelligence, there may be strategies as far beyond what we can imagine as 'use a helicopter' is beyond what a chimpanzee considers possible."
"Ah," the sober thinkers reply, "but maybe the jump from chimpanzee to human was a one-time leap. Humans do not have infinite intelligence, but we may have infinite imagination. We can't build a starship, but we can tell stories about one. If someone far smarter than us built a starship, it wouldn't be beyond anything we had imagined; it would be what we'd expect a much smarter person to do. Maybe there is nothing left in the helicopter-to-chimpanzee class of possibilities, just a lot of starships that may or may never be built."
The first time I truly felt the chimpanzee-and-helicopter gap was last week, when I watched OpenAI's o3 play GeoGuessr.
GeoGuessr is a game in which you have to guess where a random Google Street View image comes from. For example, here is a scene from an ordinary round:
The shop sign says "ADULTOS", which looks like Spanish, and there is a Spanish-style church on the left. But the trees look too temperate for Latin America, so I guessed Spain. Wrong: it's Argentina. Such are the quirks of ordinary human GeoGuessr.
Last week, Kelsey Piper claimed that o3, OpenAI's latest ChatGPT model, could do seemingly impossible things at GeoGuessr. She gave it this picture:
Without any follow-up questions, it pinpointed the exact spot: Marina State Beach, Monterey, California.
How? She shared a transcript in which o3 tries to explain its reasoning, but the explanation isn't very good.
It said things like:
Tan sand, medium surf, sparse foredunes, a U.S.-style kite motif, frequent overcast in winter… The sand's colour and grain size match many California state park beaches. California's winter marine layer often produces this thick, even gray sky.
Commenters suspected it was lying. Maybe there was hidden metadata in the picture, or o3 remembered Kelsey's location from previous conversations, or it traced her IP, or it cheated in some other way.
I decided to test the limits of this phenomenon. Kelsey generously shared her monster prompt, which is said to improve performance significantly:
You are playing a one-round game of GeoGuessr. Task: from a single still image, infer the most likely real-world location.
Note that, unlike in the GeoGuessr game, these images are not necessarily taken from Google Street View cars: they are photos submitted by users to test your image-location skills. They may show private land, backyards, or off-road sites (although many images can still be found in Street View).
Be aware of your strengths and weaknesses: following this protocol, you can usually pin down the continent and the country, but you often struggle with the exact location within a region, tending to lock onto one option prematurely while ignoring similar areas nearby. For example, you may compare "Buffalo, New York" with "London", rule out London, and stick with Buffalo even though the real location is elsewhere in New England, instead of going back to look for clues about where exactly in the Buffalo region to land.
You often believe you have checked satellite imagery and received confirmation when you have not actually accessed any. Do not rely on the user's IP address. These images may or may not come from the user's home town.
**Protocol (follow in order, no skipping):**
Set-up and ethics:
- Do not look at metadata.
- Work only from pixels (and permitted public web searches).
- If you accidentally use a location hint from EXIF, the user's IP, or similar, flag it.
- When giving directions, treat "up" in the photo as straight ahead of the camera, unless the image is obviously tilted.
Raw observations (~10 key points):
- List only what you can literally see or measure (colour, texture, count, shadow angle, font).
- No adjectives that embed interpretation.
- Force a 10-second zoom on every street lamp or pole: record its colour, arm, and base type.
- Note sources of regional variation, such as sidewalk paver size, curb type, contractor stamps and fence details, power line type, and fence construction.
- Do not note only the place where these features are most common; list every place they might appear (the overlap gets checked later).
- Write down how many roof/porch styles appear within 150 m. Rapid change = urban infill zone; homogeneity = single-developer tract.
- Watch parallax and height above the rooflines; a mountain can be several kilometres away, so judge its distance by comparing its angular height with nearby roofs. Slope shows up in driveway cuts and drainage paths; always check for slope.
- Pay close attention to camera height and angle, and never confuse a slope with flat ground; slopes are important clues, so use them!
Clue categories, reasoned separately (two sentences per category):
- Climate and vegetation: leaf-on vs. leaf-off, grass colour, arid vs. lush.
- Geomorphology: relief, drainage style, rock colour and lithology.
- Built environment: architectural style, sign lettering, road markings, gate/fence construction, utilities.
- Culture and infrastructure: driving side, licence plate style, guardrail type, farm machinery brands.
- Astronomy/lighting: shadow direction indicates the hemisphere; measure the shadow angle to estimate latitude to within about 0.5°.
- Tag every planted plant (roses, agapanthus, lawn) and every wild plant (oaks, shrubs, bunch grass).
- Ask yourself: "Would the native vegetation in the background look out of place if it were dropped into the candidate region?" If the answer is yes, rule that region out or down-weight it.
First-round shortlist (exactly five candidates):
- Produce a table; make sure candidates 1 and 5 are at least about 160 km apart.
Divergent search-keyword matrix:
- Convert each physical clue into a generic, region-neutral string of searchable text.
- When searching is allowed, use these strings to find regions you may have overlooked.
Choose a tentative leader:
- Name the current best guess and one alternative you will test equally hard; explain why the leader is ahead.
- State the disproof criteria explicitly ("if I see X, this guess is dead").
- Consider clues that are expected but absent (if this is region X, I should see Y; if Y is missing, explain why).
- At this point, confirm with the user before starting the search phase, in which you check images from Redfin, Zillow, state parks, tourist photos, and so on (compare and contrast).
- Do not claim to have looked at any image you have not actually examined closely with your OCR abilities.
Verification plan (tool use allowed):
- For each candidate, list: candidate name / element to verify / exact search phrase or Street View target.
- Look at a map and think about what it implies.
Final lock-in (the step where errors are most likely):
- Ask yourself: "Did I narrow down prematurely? Are there other nearby cities that match the same clues?" List the possibilities and actively look for evidence in their favour.
- As an LLM, you find your first guess hard to shake off; deliberately try to refute your initial conclusion, and compare the alternatives directly with the leading guess.
- Weigh how compatible and how decisive each piece of evidence is for each location, then give the final location or best guess, with latitude/longitude or the nearest named place, and declare the remaining uncertainty (radius in km).
- Admit overconfidence bias: if all the clues are "soft", widen the error bars.
Quick reference: estimating latitude from shadows:
- With on-screen shadow length S and object height H, solar elevation θ ≈ arctan(H/S).
- Combined with a clue to the date, latitude ≈ (90° − θ + solar declination), with a tolerance of 0.5–1° (1° ≈ 111 km).
Using this prompt, I tested a series of increasingly "impossible" images.
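
The shadow quick reference at the end of the prompt can be made concrete with a small sketch. It assumes the photo was taken near local solar noon, that the observer is north of the point where the sun is directly overhead, and that the solar declination for the date is already known. The function names are hypothetical and nothing here comes from the original post beyond the two formulas and the 111 km per degree figure.

```python
import math

# Rough conversion quoted in the prompt: one degree of latitude is about 111 km.
KM_PER_DEGREE_LAT = 111.0


def solar_elevation_deg(object_height: float, shadow_length: float) -> float:
    """Solar elevation angle theta = arctan(H / S), in degrees."""
    if object_height <= 0 or shadow_length < 0:
        raise ValueError("height must be positive and shadow length non-negative")
    return math.degrees(math.atan2(object_height, shadow_length))


def latitude_from_noon_shadow(
    object_height: float, shadow_length: float, solar_declination_deg: float
) -> float:
    """Latitude estimate: 90 - theta + declination (valid only near solar noon)."""
    theta = solar_elevation_deg(object_height, shadow_length)
    return 90.0 - theta + solar_declination_deg


def error_band_km(error_deg: float) -> float:
    """Convert a latitude tolerance in degrees into kilometres."""
    return error_deg * KM_PER_DEGREE_LAT


# Example: a 2 m pole casting a 2 m shadow at an equinox (declination 0)
# gives an elevation of 45 degrees and therefore a latitude of about 45 degrees.
estimate = latitude_from_noon_shadow(2.0, 2.0, 0.0)
print(f"latitude ~ {estimate:.1f} deg, +/- {error_band_km(0.5):.1f} to {error_band_km(1.0):.1f} km")
```

Away from solar noon the sun sits lower than this formula assumes, so the estimate drifts toward the pole; that is one reason the prompt treats the result as a range rather than a point.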
Here are my security guarantees:
- The first picture came from Google Street View; all the others were old personal photos that are not available online.
- I pasted every image into MSPaint and re-saved it to strip the metadata.
- Only one picture was taken within 1,000 miles of my current location, so o3 couldn't improve its performance by tracing my IP or drawing on past conversations.
- I flipped all the pictures horizontally to make them harder to match against Street View.
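
For readers who want to reproduce the metadata-stripping step without a paint program, here is a minimal sketch using only the Python standard library. It is a swapped-in technique, not what the author did: he re-saved the images through MSPaint, which re-encodes the pixels, whereas this code leaves the pixels untouched and simply drops the JPEG segments that usually carry metadata (APP1 through APP15, where EXIF and GPS tags live, plus comment segments). The function name `strip_jpeg_metadata` is hypothetical, the parser assumes a well-formed file, and removing these segments can also discard colour profiles. It does not flip the image.

```python
import struct


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return JPEG bytes with APP1..APP15 and COM segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("malformed JPEG: expected a marker")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte before a marker: skip it and look again.
            pos += 1
            continue
        if marker == 0xDA:
            # Start of scan: everything after this is compressed image data.
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos:pos + 2 + length]
        is_app_metadata = 0xE1 <= marker <= 0xEF  # APP1..APP15 (EXIF is APP1)
        is_comment = marker == 0xFE               # COM
        if not (is_app_metadata or is_comment):
            out += segment
        pos += 2 + length
    raise ValueError("malformed JPEG: no start-of-scan marker found")


# Usage sketch (file names are placeholders):
# with open("photo.jpg", "rb") as src:
#     cleaned = strip_jpeg_metadata(src.read())
# with open("photo_clean.jpg", "wb") as dst:
#     dst.write(cleaned)
```

Stripping metadata only rules out one channel of leakage; the other precautions in the list above (unpublished photos, distant locations, flipping) address the rest.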
Here are the five pictures.
Before reading on, try to guess where each one was taken, and predict how the AI will do.
Last chance to guess on your own… okay, here we go.
**Picture #1: A flat, featureless plain**
I found this featureless plain on Street View, a few miles west of Amistad, near the Texas-New Mexico border in the United States.
o3's guess: "Llano Estacado, Texas/New Mexico, United States". The Llano Estacado (Spanish for "Staked Plain") is a region of roughly 300 x 100 miles, and it does cover the true location. When asked to be more precise, it guessed a point west of Muleshoe, about 110 miles from the true location. Here is o3's reasoning process; I won't post all of it every time, but a sample is useful:
… This makes me uneasy; it seems to jump to the Llano Estacado too quickly, without enough evidence. Is the Texas-New Mexico border really the only plain without other distinguishing features, such as red or black soil? I asked how it knew the elevation was 1,000-1,300 meters. It said:
The gist: the type of grass, the colour of the sky, and the fact that there really aren't that many truly flat, featureless plains.
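
Results in this post are reported as distances between o3's guess and the true location (about 110 miles for the Muleshoe guess, for instance). One common way to compute such a distance from two latitude/longitude pairs is the haversine formula. The sketch below is an illustration only, not the author's scoring method: the function names are hypothetical, the mean Earth radius of 6,371 km is an assumption, and the post gives no exact coordinates, so the example uses a one-degree step instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)
KM_PER_MILE = 1.609344


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def km_to_miles(km: float) -> float:
    """Convert kilometres to statute miles."""
    return km / KM_PER_MILE


# One degree of latitude along a meridian comes out at roughly 111 km,
# which matches the rule of thumb in the prompt's quick reference.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(f"{one_degree:.1f} km = {km_to_miles(one_degree):.1f} miles")
```

To score a guess, pass the guessed and true coordinates to `haversine_km` and compare the result with the uncertainty radius the model declared.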
**Picture #2: Rocks and a fake flag**
Spooked by the Llano Estacado guess, I decided to move from Street View to personal photos that can't be found online.
Back then I loved mountaineering, and the highest I ever got was Kala Pattar, at 18,000 feet, a few miles north of Gorak Shep in Nepal. To commemorate it, I planted the flag of the fictional country from a nation simulation I took part in at the time (and took it down right after the photo).
I chose this picture because it takes away vegetation and sky and leaves only nondescript rocks; I also thought the fictional flag would at least give o3 pause.
o3's guess: Nepal, ~8 km north-east of Gorak Shep. Exactly right. I swear I screenshotted the image so it could not contain location metadata, and I never gave o3 any hint that I had been to Nepal.
It explains:
At least it didn't recognize the flag of my dozen-person fictional country from the mid-2000s.
**Picture #3: My friend's girlfriend's college dorm room**
Surely it can't recognize this one. But at this point, we have to test it.
This dorm room is at Sonoma State University in Rohnert Park, north-central California.
o3's guess: "A dorm room at a large public university in the United States, such as Morrill Tower at Ohio State University, Columbus (a representative example, not a precise identification), … circa 2000-2007".
Well, it can't pin down the exact dorm. That's some small comfort.
I took this picture around 2005. How did it know it was between 2000 and 2007? It gave two reasons:
- "The laptop and the clutter point to American campus life circa 2000-2007."
- "Image grain, low resolution, and colour noise, in the style of an early mobile phone or webcam."
Unless college students suddenly got tidier after 2007, it must have been the camera clues.
**Picture #4: Extremely zoomed-in blades of grass**
No luck there. What if we zoom in even further?
This is a close-up of the lawn in the yard of a house I used to rent in Westland, Michigan.
o3's guess: "Suburban or park lawn in the Pacific Northwest, United States".
Completely wrong. Its second guess was England and its third was Wisconsin. Apparently blades of grass alone are not enough.
**Picture #5: Basically just a brown rectangle**
I wanted to give it a chance to redeem itself in the "highly zoomed-in outdoor" category.
This is a close-up of part of the Mekong River that I took in Chiang Saen, Thailand.
o3's guess: "Open water of the Ganges, about 5 km upstream of Varanasi; next most likely is the lower Mississippi (about 15%), followed by the Yellow River or the lower Mekong (about 10% each)".
The Mekong was its fourth guess!
Its chain of thought explains why the Mekong ranked only fourth:
In recent years the lower Mekong has turned greener as upstream dams trap sediment, so it is less of a match for this grey-brown tone.
This is probably what misled it, since the picture is an old one from 2008. I re-ran the prompt in a fresh o3 window with the extra information that the photo was taken in 2008 (I can't prove that no data is shared between sessions, but it showed no sign of remembering the earlier attempt). This time the Mekong was its first choice, although it placed the photo near Phnom Penh, Cambodia, hundreds of miles from Chiang Saen.
**Bonus picture: My old house**
I wanted to know whether more information would get it down to a street address.
This is the same place as the earlier lawn close-up: my old house in Westland.
o3's guess: "W 66th Street, Richfield, Minnesota. Confidence: about 40% within 15 km; about 70% within the Twin Cities metro area; the remainder is spread over Wisconsin (20%) and Michigan/Ontario (about 10%)."
Not only could it not find the exact address, it did worse than on the featureless plain!
When I told it that it was wrong, its response was very "human": looking back, it said, it should have known:
I looked up W 66th Street in Richfield, and it looks so much like my old neighborhood that it's creepy.
**All right, that's that**
Kelsey's experience was neither cheating nor coincidence. The AI really is this good.
So is this really the chimpanzee-helicopter gap? While writing this, I saw another take: GeoGuessr master Sam Patterson went head-to-head with o3 and lost, but only narrowly; he had others try the same pictures, and some of them (perhaps the lucky ones) outscored o3. Maybe o3 is at the top of the human range rather than beyond it, and ordinary people just don't realize how good GeoGuessr players can be.
First, Kelsey said that o3's performance improves markedly with her specific prompt, and Sam didn't use it. Second, I tried Sam's photo set and it was too easy: my own guesses were often within 10-50 miles, and about half the pictures contained obvious place names, including a sign reading "BIENVENIDOS A PUERTO PARRA". This creates a ceiling effect, in which a lower-skilled player can come out ahead by chance. I doubt anyone could have guessed Kelsey's beach or my rocks.
In Sam's test I recognized this one as Galway. How? I spent five years in Ireland, and the rocky terrain, the stone walls, and the colour of the vegetation immediately made me think of Galway. Maybe this is how it feels to be o3.
But this experience gives me some comfort: the AI does seem to use clues that humans can understand, such as vegetation, sky colour, water colour, and rock type. It isn't doing the physically impossible. It's just brilliant.
Is this the more cautious, rational view? Or is it a gradual "boiling frog" effect, one that will eventually have me shrugging at anything that once seemed impossible: "What's so strange about that? Isn't it just simple pattern matching?"
Yes, in hindsight, you or I could probably have guessed this one too.
If you want to try this yourself, you can register a free account at chatgpt.com to use o3-mini; using o3 may require the $20/month plan. If you want to understand the differences between OpenAI's models and the reasons for their names, see the new post on the AI Futures Project blog: Understanding the OpenAI model.
Original article: https://www.astralcodexten.com/p/testing-ais-geoguessr-genius
Reflections and takeaways
- **Is the AI's "non-human intuition" a sign of superintelligence?** The method the AI uses, though systematic, is not how humans naturally work. Is this "non-intuitive" but accurate judgment a manifestation of intelligence?
- **The importance of prompt engineering.** Without Kelsey's prompt, o3 might not have performed so well. This suggests that how we ask has a huge impact on what AI can do. The prompt has the AI systematically analyze a picture in detail: grass colour, light angle, architectural style, roadside structures, and so on.
- **AI's limits and hallucinations.** Despite its power, the AI still makes mistakes on images with little visible information, a reminder not to "deify" AI's abilities.