# Tutorial: Is Your Website Visible to LLMs?

February 23, 2026 — Alessandro Caprai

---

## Practical Guide to Verify and Optimize Your Site's Accessibility for Language Models

***Method***: Based on real tests conducted on production sites with Claude, ChatGPT, Gemini, and Perplexity

Large Language Models (LLMs) like Claude, ChatGPT, Gemini, and Perplexity are becoming an increasingly relevant discovery and traffic channel for websites. However, the way these systems access content is profoundly different from both traditional browsers and classic search engines.

A site perfectly indexed on Google can be **completely invisible** to LLMs. The reasons are multiple: hosting firewall, misconfigured robots.txt, content rendered only via JavaScript, unreadable sitemaps, poisoned crawler caches.

This guide provides a structured 9-level path to diagnose and resolve every possible blocking point. Each test has been verified in the field.

### SEO vs GEO: why a new perspective is needed

Traditional SEO (Search Engine Optimization) focuses on visibility in search engine results. **GEO** (Generative Engine Optimization) extends this concept to the generative artificial intelligence ecosystem: how to ensure that LLMs find, read, and correctly cite your content.

GEO doesn't replace SEO, it integrates it. A site optimized for GEO is also a site with stronger SEO, because many requirements are shared: accessible content, clear structure, coherent metadata. But GEO adds specific requirements — such as readability by crawlers that don't execute JavaScript, Content-Type compatibility with LLM tools, explicit exposure of relationships between multilingual content, and adoption of emerging protocols like **llms.txt** to communicate directly with language models.

---

## Level 1 — Hosting Firewall and Bot Protection

### What to verify

The first and most insidious blocking level. Many modern hosting providers offer bot protection systems that block non-browser traffic **before it even reaches your server**. The blocking happens silently: you won't see anything in your application logs.

This is particularly critical because the block can generate a **poisoned cache** in the LLM's crawler: if the crawler is blocked once, it might memorize the block and not retry for days or weeks, even after you've corrected the configuration.

### Platforms involved

| Platform | Where to check |
|---|---|
| Vercel | Dashboard → Project → Firewall → Rules → Bot Management |
| Cloudflare | Dashboard → Security → Bots |
| Netlify | Site settings → Security (if present) |
| AWS CloudFront | WAF → Web ACLs → Bot Control |
| Traditional hosting | Application firewall panel (ModSecurity, etc.) |

### Tests to perform

**1.1 — Verify generic Bot Protection**

Access your hosting panel and check if active bot protections exist. Look for two types of protection, often separately configurable:

- **Generic Bot Protection** — blocks all non-browser traffic (headless requests, curl, automated scripts)
- **AI Bots Managed Ruleset** — specifically blocks LLM crawlers (ClaudeBot, GPTBot, etc.)

If you want LLMs to access your site, set both to **"Log"** (monitoring without blocking) instead of "Block" or "Challenge".

**1.2 — Verify in live logs**

This is the key diagnostic test. Ask an LLM (e.g., Claude, ChatGPT) to access a specific page on your site, then immediately check the firewall's live logs:

- If **you see the request in the logs** → the block is at the application level (robots.txt, middleware, server response)
- If **you don't see any request** → the block happens at the firewall/infrastructure level, or the LLM is using a cached response and the request doesn't even start

**1.3 — Verify WAF and custom rules**

Check if you have custom WAF rules blocking specific user-agents. The main LLM user-agents are:

| LLM / Service | User-Agent |
|---|---|
| Claude (Anthropic) | `ClaudeBot` |
| ChatGPT (OpenAI) | `GPTBot`, `ChatGPT-User` |
| Gemini (Google) | `Google-Extended` |
| Perplexity | `PerplexityBot` |
| Apple AI | `Applebot-Extended` |
| Microsoft Copilot | `Bingbot` (shared with Bing Search) |
| Meta AI | `FacebookBot` |
| Common Crawl (used by many LLMs) | `CCBot` |

**1.4 — Test with specific User-Agents**

Verify from your terminal that the server responds correctly to different user-agents:

```bash
# Test with Claude's User-Agent
curl -I -H "User-Agent: ClaudeBot" https://yoursite.com/

# Test with ChatGPT's User-Agent
curl -I -H "User-Agent: GPTBot" https://yoursite.com/

# Compare with standard browser User-Agent
curl -I -H "User-Agent: Mozilla/5.0" https://yoursite.com/
```

If LLM user-agents receive different responses (403, 503, redirect to challenge), the firewall is blocking them.
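To check all the main LLM user-agents in one pass, a small loop helps. This is a sketch, not a definitive tool; `yoursite.com` is a placeholder for your own domain:

```bash
#!/bin/sh
# Probe one URL with several User-Agents and print the HTTP status each
# receives. Any divergence between bot and browser responses points to a
# firewall or bot-protection rule.
probe_ua() {  # usage: probe_ua <url> <user-agent>
  curl -s -o /dev/null -w '%{http_code}' -H "User-Agent: $2" "$1"
}

# Example run (uncomment and replace the domain with your own):
# for ua in ClaudeBot GPTBot ChatGPT-User PerplexityBot "Mozilla/5.0"; do
#   printf '%-16s %s\n' "$ua" "$(probe_ua https://yoursite.com/ "$ua")"
# done
```

Every user-agent should print the same code as the browser line; a `403` or `503` only for the bot rows confirms a Level 1 block.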

### Problem indicators

- The LLM returns "cannot access" but the site works in the browser
- No trace in server logs when the LLM attempts access
- Generic errors like "robots.txt disallowed" even if robots.txt is configured correctly
- Different HTTP responses based on User-Agent

---

## Level 2 — robots.txt

### What to verify

The robots.txt file is the first file that crawlers (including those of LLMs) read before accessing any page. An incorrect configuration can block the entire site. Additionally, if robots.txt itself is unreachable (for example, the server returns a 5xx error or the firewall blocks the request), many crawlers treat the entire site as disallowed.

### Tests to perform

**2.1 — Verify robots.txt accessibility**

```bash
curl -I https://yoursite.com/robots.txt
```

Check that the response code is `200 OK` and not `403`, `404`, or `500`. An inaccessible robots.txt is often a symptom of a firewall-level block (see Level 1).

**2.2 — Verify the actually served content**

```bash
curl -s https://yoursite.com/robots.txt
```

Compare the returned content with what you expect. Beware of conflicts between static and dynamic generation:

| Framework | Static file | Dynamic file | Precedence |
|---|---|---|---|
| Next.js | `public/robots.txt` | `app/robots.ts` | Dynamic overwrites static |
| Nuxt.js | `public/robots.txt` | `@nuxtjs/robots` module | Depends on configuration |
| Gatsby | `static/robots.txt` | `gatsby-plugin-robots-txt` plugin | Plugin overwrites |
| WordPress | Physical file | SEO plugin (Yoast, Rank Math) | Plugin overwrites |
| Static sites | File in root | — | — |

If both sources exist, the dynamic one might overwrite the static one with different rules than expected.

**2.3 — Add explicit Allow for LLM bots**

To maximize compatibility, add explicit rules for each LLM bot. Many crawlers first check if a specific rule exists for their own user-agent; an explicit Allow removes any ambiguity:

```
User-Agent: ClaudeBot
User-Agent: GPTBot
User-Agent: ChatGPT-User
User-Agent: Google-Extended
User-Agent: PerplexityBot
User-Agent: Applebot-Extended
Allow: /
Disallow: /admin/
Disallow: /api/

User-Agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```

**2.4 — Verify robots.txt Content-Type**

```bash
curl -I https://yoursite.com/robots.txt | grep -i content-type
```

It must be `text/plain`. If the server returns `text/html` or another MIME type, some crawlers might not correctly interpret it as a robots file.

**2.5 — Verify consistency between robots.txt and firewall**

A common error is having a robots.txt that says "Allow" but a firewall that blocks. The robots.txt is only a directive: if the firewall denies access upstream, the robots.txt is not even read. The two levels must be consistent.

### Problem indicators

- The robots.txt returns an HTTP error (403, 404, 500)
- Both static and dynamic files exist with different rules
- `Disallow` rules are too broad (e.g., `Disallow: /` without `Allow`)
- The Content-Type is not `text/plain`
- The firewall blocks upstream making robots.txt unreachable

---

## Level 3 — Meta tags and HTTP headers

### What to verify

Even with a perfect robots.txt, HTML meta tags or HTTP headers can block indexing at the individual page level. These controls are granular and can be selectively applied to specific pages or sections of the site.

### Tests to perform

**3.1 — Check robots meta tags in the** `<head>`

```bash
curl -s https://yoursite.com/page | grep -i "robots"
```

Look for tags like:

```html
<!-- BLOCKING — prevent indexing -->
<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">
<meta name="robots" content="none">
<meta name="robots" content="noindex, nofollow">

<!-- PERMISSIVE — allow indexing -->
<meta name="robots" content="index, follow">
<!-- Or simply the absence of the tag (default = index, follow) -->
```

Caution: some frameworks automatically inject `noindex` meta tags under certain conditions (e.g., error pages, preview pages, staging environments).
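This check can be automated with a small filter you can pipe any page into; a minimal sketch:

```bash
#!/bin/sh
# Read an HTML document on stdin and report whether a robots meta tag
# carries "noindex". Pipe `curl -s <url>` into it for a live page.
has_noindex() {
  if grep -i '<meta[^>]*robots' | grep -iq 'noindex'; then
    echo "BLOCKED by meta robots"
  else
    echo "no noindex meta tag found"
  fi
}

# Example: curl -s https://yoursite.com/page | has_noindex
```

Note that this is deliberately loose: it flags any line that contains both a robots meta tag and the string "noindex", which is enough for a first diagnosis.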

**3.2 — Check the X-Robots-Tag header**

```bash
curl -I https://yoursite.com/page | grep -i "x-robots"
```

The `X-Robots-Tag: noindex` header in the HTTP response blocks indexing even without a meta tag in the HTML. It's often used by CDNs, middleware, or server-side configurations.

**3.3 — Check specific headers for AI bots**

Some frameworks, CDNs, or middleware add different headers or behaviors for AI bots:

```bash
# Compare responses with different User-Agents
curl -I -H "User-Agent: ClaudeBot" https://yoursite.com/page
curl -I -H "User-Agent: Mozilla/5.0" https://yoursite.com/page
```

Compare: status code, `X-Robots-Tag` header, any redirects, Content-Type. They must be identical.

### Problem indicators

- `noindex` present in the `<head>` of pages that should be visible
- Restrictive `X-Robots-Tag` in HTTP headers
- Different responses based on User-Agent (redirects, 403, challenge page)
- `noindex` meta tags automatically injected by the framework in production

---

## Level 4 — Middleware and rewrites

### What to verify

Modern frameworks (Next.js, Nuxt, SvelteKit, Remix, etc.) use middleware that intercepts requests and can modify behavior before reaching the page. This is particularly critical for multilingual sites (language redirects) and sites with authentication.

### Tests to perform

**4.1 — Check that middleware doesn't intercept SEO resources**

Middleware should not intercept robots.txt, sitemap.xml, favicon.ico, and other static files. Resources to exclude from middleware:

- robots.txt
- sitemap.xml (and sitemap index)
- favicon.ico
- Static files (assets, images, CSS, JS)
- API routes (if they don't require auth)

**4.2 — Verify middleware redirects for bots**

If middleware handles localization (e.g., redirect from `/` to `/en/`), verify it doesn't create redirect loops or error pages for bots. Bots typically don't send language preference cookies:

```bash
# Test as a bot without cookies
curl -v -H "User-Agent: ClaudeBot" https://yoursite.com/ 2>&1 | grep -i "location\|301\|302\|307"
```

If you see multiple redirects or loops, the middleware is creating problems for crawlers.
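To quantify the redirect behavior for a cookie-less bot, curl can report the hop count directly. A sketch, with `yoursite.com` as a placeholder:

```bash
#!/bin/sh
# Follow redirects the way a bot would (no cookies, no JS) and report how
# many hops occurred. More than one or two hops usually signals a
# localization loop or broken middleware logic.
redirect_hops() {  # usage: redirect_hops <url> <user-agent>
  curl -s -o /dev/null -L --max-redirs 10 \
       -w '%{num_redirects}' -H "User-Agent: $2" "$1"
}

# Example: redirect_hops https://yoursite.com/ ClaudeBot
```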

**4.3 — Verify absence of User-Agent based blocking in middleware**

Check that the middleware code doesn't contain blocking logic for bots. Problematic patterns to look for in middleware code:

- Blocking User-Agents containing "Bot" or "Crawler"
- Redirect to challenge/captcha pages for non-browser User-Agents
- 403/401 response for requests without session cookies

**4.4 — Verify URL rewrites**

If the site uses URL rewriting (e.g., from simplified slugs to internal paths), verify that rewrites work for bots as well:

```bash
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: ClaudeBot" https://yoursite.com/example-article
```

Must return `200`, not `404` or `500`.

### Problem indicators

- Middleware intercepts requests to `robots.txt` or `sitemap.xml`
- Infinite redirects for bots that don't handle language cookies
- 403 or 401 responses for non-browser User-Agents
- Pages that work in browser but return 404 via curl

---

## Level 5 — Rendering and accessible content

### What to verify

LLMs and their crawlers generally **do not execute JavaScript**. If your site's content is loaded only via client-side rendering (CSR), bots will see an empty or partial page. This is one of the most common and underestimated problems.

### Impact by rendering strategy

| Rendering strategy | Visibility to LLM bots |
|---|---|
| Static Site Generation (SSG) | ✅ Excellent — complete HTML on first request |
| Server-Side Rendering (SSR) | ✅ Good — HTML generated on each request |
| Incremental Static Regeneration (ISR) | ✅ Good — similar to SSG with updates |
| Client-Side Rendering (CSR) | ❌ Problematic — content absent for bots |
| Partial hydration | ⚠️ Variable — depends on what is server-rendered |

### Tests to perform

**5.1 — Verify rendering without JavaScript**

```bash
curl -s https://yoursite.com/page | grep -c "<article\|<p\|<h1\|<h2"
```

If the main HTML content (headings, paragraphs, articles) is not present in the initial response, it means it's loaded via JavaScript and bots won't see it.

**5.2 — Compare SSR vs browser content**

Count visible elements in both contexts:

1. Open the page in browser → count visible articles/sections/elements
2. Do a `curl` of the same page → count the same elements in raw HTML
3. If the browser shows more content, the difference is loaded via JavaScript and **invisible to bots**

Practical example: if the homepage shows 8 articles in the browser but `curl` returns only 6, the 2 missing ones are loaded dynamically.
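The raw-HTML side of this comparison can be scripted; a sketch that counts `<article>` elements in the server response:

```bash
#!/bin/sh
# Count <article> elements in the HTML read from stdin. Compare the result
# against what you see in the browser's rendered page.
count_articles() {
  grep -o '<article' | wc -l | tr -d ' '
}

# Example: curl -s https://yoursite.com/ | count_articles
```

If your listing uses a different element (`<li>`, a card `<div>` with a known class), adjust the pattern accordingly.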

**5.3 — Verify lazy loading and infinite scroll**

If you use infinite scroll, "load more" buttons, or lazy loading for articles, dynamically loaded content won't be visible to crawlers. Ensure that at least all main content is in the initial HTML (SSR/SSG).

For article lists, consider including all recent articles in the initial HTML, or use pagination with distinct URLs (`/blog/page/2`) instead of dynamic loading.

**5.4 — Test the LLM-friendly version (if present)**

Some sites offer a Markdown or plain text version for LLMs. If present, verify it's accessible:

```bash
curl -s https://yoursite.com/page/markdown
```

Offering a dedicated endpoint for LLMs is an emerging best practice in GEO. The llms.txt protocol (see Level 9) standardizes this approach by proposing that each page have a Markdown version accessible by adding `.md` to the URL.

**5.5 — Verify structured content (Schema.org)**

Structured data in JSON-LD format helps LLMs understand content type and structure:

```bash
curl -s https://yoursite.com/page | grep -i "application/ld+json"
```

Schema.org markup such as `Article`, `BlogPosting`, `FAQPage`, and `HowTo` is particularly useful for GEO.

### Problem indicators

- `curl` returns HTML with little or no textual content
- Missing articles compared to those visible in browser
- Important content loaded only after user interaction (scroll, click)
- Absence of JSON-LD structured data

---

## Level 6 — Sitemap

### What to verify

The sitemap is fundamental for communicating the complete site structure to crawlers. If it's not readable or is misconfigured, bots might not discover all your pages or understand the relationships between language versions of content.

### Tests to perform

**6.1 — Verify sitemap accessibility**

```bash
curl -s -o /dev/null -w "%{http_code}" https://yoursite.com/sitemap.xml
```

Must return `200`.

**6.2 — Verify Content-Type**

```bash
curl -I https://yoursite.com/sitemap.xml | grep -i content-type
```

This is a critical problem verified in the field: some LLM tools treat `application/xml` as **binary data** and fail to read its content. To maximize compatibility, serve the sitemap with:

```
Content-Type: text/xml; charset=utf-8
```

Note: many frameworks force `application/xml` and don't allow overriding the Content-Type via configuration. In these cases, the solution is to generate the sitemap manually via a route handler or custom API endpoint, where you can control the Content-Type directly in the HTTP response.
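As an alternative, when the sitemap is served as a static file behind Nginx (an assumption about your stack; adapt to your own server), the header can be forced at the web-server level:

```
# Nginx: override the MIME type for the sitemap only
location = /sitemap.xml {
    types { }
    default_type "text/xml; charset=utf-8";
}
```

The empty `types { }` block clears the inherited MIME map for this location, so `default_type` applies unconditionally.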

**6.3 — Verify sitemap index structure (multilingual sites)**

If the site is multilingual and uses a sitemap index pointing to child sitemaps per language, verify the complete chain:

```bash
# Main sitemap index
curl -s https://yoursite.com/sitemap.xml

# Child sitemaps per language
curl -s https://yoursite.com/en/sitemap.xml
curl -s https://yoursite.com/es/sitemap.xml
```

The sitemap index should have this structure:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/en/sitemap.xml</loc>
    <lastmod>2026-02-23T10:00:00.000Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/es/sitemap.xml</loc>
    <lastmod>2026-02-23T10:00:00.000Z</lastmod>
  </sitemap>
</sitemapindex>
```

**6.4 — Verify that robots.txt points to the sitemap index**

In robots.txt, reference a single entry point — the main sitemap index — instead of listing all child sitemaps:

```
# ✅ CORRECT — single entry point
Sitemap: https://yoursite.com/sitemap.xml

# ❌ REDUNDANT — children are already referenced in the index
Sitemap: https://yoursite.com/en/sitemap.xml
Sitemap: https://yoursite.com/es/sitemap.xml
Sitemap: https://yoursite.com/fr/sitemap.xml
```

For monolingual sites without sitemap index, point directly to the sitemap:

```
Sitemap: https://yoursite.com/sitemap.xml
```

**6.5 — Verify the presence of** `hreflang` **tags (multilingual sites)**

This is the most important test for multilingual sites. Without `hreflang`, crawlers cannot link language versions of the same content. Each URL must declare all its language variants, **including itself**.

Add the `xhtml` namespace in the `<urlset>` element and alternate links in each `<url>`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://yoursite.com/en/blog/example-article</loc>
    <lastmod>2026-02-12</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="en" href="https://yoursite.com/en/blog/example-article" />
    <xhtml:link rel="alternate" hreflang="es" href="https://yoursite.com/es/blog/articulo-ejemplo" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://yoursite.com/fr/blog/article-exemple" />
  </url>
</urlset>
```

To verify if tags are present:

```bash
curl -s https://yoursite.com/en/sitemap.xml | grep -c "hreflang"
```

If the result is `0`, tags are missing and multilingual relationships are not declared.

**6.6 — Verify** `priority` **consistency**

Priorities must reflect the actual importance of pages. A uniform distribution of priorities is equivalent to having no priorities:

| Page type | Recommended priority |
|---|---|
| Homepage | 1.0 |
| Category pages / Main sections | 0.8 |
| Articles / Posts / Content pages | 0.7 |
| About / Contact / Team | 0.5 |
| Legal / Privacy / Cookie / Registration | 0.3 (or excluded from sitemap) |

**6.7 — Verify that functional pages are not in the sitemap**

Pages like registration, login, legal, privacy policy, cookie policy, terms of service are functional pages and generally should not be in the sitemap. Including them dilutes crawl budget and signals to engines that they have the same importance as original content.

```bash
curl -s https://yoursite.com/sitemap.xml | grep -iE "register|legal|login|admin|privacy|cookie|terms|signup|signin"
```

**6.8 — Verify** `lastmod` **and** `changefreq` **consistency**

Inconsistencies in these fields reduce crawler trust in the entire sitemap:

- `lastmod` must reflect the actual date of the last content modification, not the sitemap generation date
- `lastmod` **of category pages** must correspond to the date of the last article published within them
- `changefreq` must be realistic: declaring `weekly` for static pages that never change is a sign of unreliability
- **All URLs should have** `lastmod`: pages without `lastmod` are treated as low update priority

```bash
# Check dates in the sitemap
curl -s https://yoursite.com/sitemap.xml | grep -B1 "lastmod"

# Count URLs without lastmod
curl -s https://yoursite.com/sitemap.xml | grep "<loc>" | wc -l   # total URLs
curl -s https://yoursite.com/sitemap.xml | grep "lastmod" | wc -l  # URLs with lastmod
```
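The two counts above can also be combined into a single pass; a sketch:

```bash
#!/bin/sh
# Read a sitemap on stdin and report lastmod coverage in one line.
lastmod_coverage() {
  awk '/<loc>/{u++} /<lastmod>/{m++} END{printf "%d of %d URLs have lastmod\n", m, u}'
}

# Example: curl -s https://yoursite.com/sitemap.xml | lastmod_coverage
```

Anything less than full coverage is worth investigating, since pages without `lastmod` are treated as low update priority.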

**6.9 — Verify sitemap completeness**

Compare the number of URLs in the sitemap with the number of actual site pages:

```bash
curl -s https://yoursite.com/sitemap.xml | grep -c "<loc>"
```

Every public page with original content should be present. Orphan pages (present on site but absent from sitemap) might never be discovered by crawlers.

### Problem indicators

- Sitemap returns errors (404, 500)
- Content-Type `application/xml` interpreted as binary by some LLM crawlers
- Missing `xhtml:link` tags with `hreflang` on multilingual sites
- robots.txt lists child sitemaps instead of pointing to the index
- Functional pages (legal, registration, login) in sitemap with high priority
- Inconsistent, absent, or uniform `lastmod` for all pages
- `changefreq: weekly` on pages that never change
- Uniform priorities that don't differentiate between content and service pages
- `lastmod` of categories not updated to last published article
- Site pages absent from sitemap

---

## Level 7 — LLM crawler cache

### What to verify

LLM crawlers cache responses. If your site blocked a crawler in the past (even unintentionally, for example through a default-configured firewall), that block can persist in the crawler's cache for an indefinite period, even after you've corrected the configuration.

This is a particularly insidious problem because everything seems correct on your side, but the LLM continues not to access the site.

### How poisoned cache works

1. The LLM's crawler tries to access your `robots.txt`
2. The hosting firewall blocks the request (403 or challenge page)
3. The crawler interprets the block as "robots.txt says disallow all"
4. This interpretation is cached
5. From that moment, the crawler doesn't even try to access the site — it blocks everything internally based on the cache
6. You fix the firewall and robots.txt, but the crawler continues using the poisoned cache

### Tests to perform

**7.1 — Poisoned cache diagnosis**

If an LLM says it cannot access your site but server logs show no request:

1. Fix the configuration (firewall, robots.txt, etc.)
2. Ask the LLM to access a specific page
3. Check server logs in real time:
   - **Request present in logs** → the problem is in the server response, not in the cache
   - **Request absent from logs** → poisoned cache: the request doesn't even start

**7.2 — Force cache invalidation**

After correcting the configuration:

- **Wait**: cache typically invalidates in hours or days, not minutes
- **Retest periodically**: try every few hours to verify if cache has been updated
- **Check robots.txt first**: often robots.txt is the file whose cache is updated before everything else. If the LLM can read the updated robots.txt, the rest will follow
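When retesting periodically, it helps to fingerprint the robots.txt that is actually served, so you can tell at a glance whether crawlers are now seeing the corrected version. A sketch:

```bash
#!/bin/sh
# Print a checksum of the served robots.txt; if the fingerprint changes
# between runs, the version crawlers see has changed.
robots_fingerprint() {  # usage: robots_fingerprint <base-url>
  curl -s "$1/robots.txt" | cksum | cut -d' ' -f1
}

# Example: robots_fingerprint https://yoursite.com
```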

**7.3 — Force re-crawling (if possible)**

Some LLM providers offer mechanisms to request re-crawling:

- **Google (Gemini)**: Google Search Console → Request indexing
- **Microsoft (Copilot)**: Bing Webmaster Tools → Submit URL
- **Anthropic (Claude)**: Currently no equivalent public tool; invalidation happens automatically
- **OpenAI (ChatGPT)**: Currently no public re-crawling tool

### Problem indicators

- The LLM reports access errors but server logs show no trace of the request
- The problem persists even after correcting all configurations
- The site works with manual `curl` but not with the LLM's tool
- The robots.txt is read correctly but pages remain blocked (partial cache)

---

## Level 8 — Indexing in LLM search engines

### What to verify

Some LLMs use their own search engines (not Google) to find content. Your site might be well indexed on Google but completely absent from the search engine used by the LLM. Additionally, the search engine used by each LLM can change and is not always documented.

### Tests to perform

**8.1 — Test search from the LLM**

Directly ask the LLM to search for your article with a very specific query (e.g., exact title + site name). Repeat the test with different LLMs:

- Ask **Claude** to search for your article
- Ask **ChatGPT** to search for the same article
- Ask **Perplexity** to search for the same article
- Ask **Gemini** to search for the same article

If the site appears in results from some LLMs but not others, the problem is in the specific search engine's indexing, not in your site.

**8.2 — Verify on Google Search Console**

Check that the site is indexed on Google:

- Number of indexed pages vs total site pages
- Reported crawl errors
- Index coverage
- Pending indexing requests

**8.3 — Verify on Bing Webmaster Tools**

Many LLMs (including ChatGPT/Copilot) use Bing as a search source. Register the site on Bing Webmaster Tools and verify:

- Indexing status
- Crawl errors
- Submitted and processed sitemaps

**8.4 — Verify on other webmaster tools**

- **Yandex Webmaster** — used by some LLMs for specific markets
- **IndexNow** — protocol supported by Bing, Yandex, and others to notify new content in real time
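As an illustration, an IndexNow notification is a single GET request to a shared endpoint. `YOUR_KEY` is a placeholder: the protocol also requires hosting the key as a text file at your site root (e.g., `https://yoursite.com/YOUR_KEY.txt`) so the engine can verify ownership:

```bash
#!/bin/sh
# Notify IndexNow-compatible engines (Bing, Yandex, and others) that a URL
# has been published or updated. Returns the HTTP status of the ping.
indexnow_ping() {  # usage: indexnow_ping <page-url> <key>
  curl -s -o /dev/null -w '%{http_code}' \
    "https://api.indexnow.org/indexnow?url=$1&key=$2"
}

# Example: indexnow_ping "https://yoursite.com/blog/new-article" "YOUR_KEY"
```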

### Problem indicators

- The site appears on Google but not in LLM search results
- Low number of indexed pages compared to total
- Crawl errors in webmaster tools
- Sitemap not processed or with errors in webmaster tools

---

## Level 9 — The llms.txt protocol and LLM-friendly content

### What to verify

Beyond technical accessibility (Levels 1-8), there are specific protocols and formats designed to make content directly understandable by LLMs. The most important is **llms.txt**, an open standard proposal created by Jeremy Howard (co-founder of fast.ai and Answer.AI) in September 2024.

The concept is simple: just as `robots.txt` tells crawlers "what you can see" and `sitemap.xml` says "where the pages are located," `llms.txt` **tells LLMs "here's how to understand my site"**.

### The llms.txt protocol

The `/llms.txt` file is a Markdown file positioned in the site root that provides a curated and readable map of the entire site in a format optimized for LLMs' context windows. Unlike the sitemap (which lists all URLs), llms.txt offers a reasoned selection of the most important content with descriptions and context.

Format specifications (from llmstxt.org):

- An **H1** with the project or site name (mandatory)
- A **blockquote** with a brief key description of the project
- Zero or more **paragraphs** with context information
- Zero or more **H2 sections** containing lists of links with descriptions
- An optional **"Optional"** section with secondary resources that the LLM can skip if it has a limited context window

### Tests to perform

**9.1 — Verify presence of llms.txt file**

```bash
curl -s -o /dev/null -w "%{http_code}" https://yoursite.com/llms.txt
```

If it returns `404`, the file doesn't exist and the site is not communicating its structure to LLMs.

**9.2 — Verify llms.txt content and format**

```bash
curl -s https://yoursite.com/llms.txt
```

The file must follow the standard format. Here's an example for a business site:

```markdown
# Company Name

> Brief description of the company and its main services.
> Key information that every LLM should know.

The company has operated in sector X since YYYY. Main services are A, B, and C.
The site is available in English, Spanish, and French.

## Services

- [Main Service](https://yoursite.com/services/main.html.md): Concise description of the service
- [Consulting](https://yoursite.com/services/consulting.html.md): Description of consulting offered

## Blog and insights

- [Important Article](https://yoursite.com/blog/article.html.md): Why this article is relevant
- [Technical Guide](https://yoursite.com/blog/guide.html.md): What this guide covers

## About us

- [About](https://yoursite.com/about.html.md): Company history and mission
- [Team](https://yoursite.com/team.html.md): The team and skills

## Optional

- [Detailed case study](https://yoursite.com/case-study.html.md): Optional deep dive
- [Technical documentation](https://yoursite.com/docs.html.md): Complete technical reference
```

Note: links in llms.txt should point to Markdown versions of pages (see test 9.3), not complete HTML pages.
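A minimal structural check for the two core elements (the H1 title and the blockquote description) can be scripted. This is a sketch, not an official validator:

```bash
#!/bin/sh
# Read an llms.txt on stdin and check that it starts with the expected
# structure: an H1 title and a blockquote summary.
validate_llms_txt() {
  h1=0; quote=0
  while IFS= read -r line; do
    case "$line" in
      '# '*) h1=1 ;;
      '>'*)  quote=1 ;;
    esac
  done
  if [ "$h1" -eq 1 ] && [ "$quote" -eq 1 ]; then
    echo "structure OK"
  else
    echo "missing H1 title or blockquote description"
  fi
}

# Example: curl -s https://yoursite.com/llms.txt | validate_llms_txt
```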

**9.3 — Implement Markdown versions of pages**

The llms.txt proposal includes a fundamental convention: each site page should have a Markdown version accessible by adding `.md` to the original URL. This version contains only structured textual content, without navigation, sidebar, footer, popups, or UI elements.

```bash
# Original HTML page
curl -s https://yoursite.com/blog/article.html

# Markdown version for LLM
curl -s https://yoursite.com/blog/article.html.md
```

If the site uses URLs without extension (e.g., `/blog/article`), the convention is to add `index.html.md`:

```bash
curl -s https://yoursite.com/blog/article/index.html.md
```

Alternatively, you can adopt a custom path as long as it's consistent:

```bash
curl -s https://yoursite.com/blog/article/markdown
```

The Markdown version must contain:

- The article title (H1)
- Author and publication date
- Complete content structured with headings, paragraphs, lists
- Links to cited sources
- No navigation, UI, or marketing elements

**9.4 — Implement llms.html file for chatbots and RAG systems**

In addition to llms.txt (designed for crawlers) and Markdown versions (designed for individual pages), a third file completes the ecosystem: **llms.html**.

llms.html is a structured HTML page that functions as a **knowledge source for business chatbots and RAG systems** (Retrieval-Augmented Generation). It contains all site content in a structured, dynamically updated format and can serve as a knowledge base for:

- Custom chatbots on your site
- Custom GPTs (OpenAI's Custom GPTs)
- Business AI agents
- Any system requiring an updated knowledge source

```bash
curl -s -o /dev/null -w "%{http_code}" https://yoursite.com/llms.html
```

Unlike llms.txt (which is a map with links), llms.html contains the full text of content, organized in sections, ready to be ingested by a RAG system without further processing.

**9.5 — Reference llms.txt in robots.txt**

Just as robots.txt references the sitemap, it's good practice to add a reference to llms.txt. Although there's no formal standard yet for this, some crawlers are starting to look for it:

```
Sitemap: https://yoursite.com/sitemap.xml

# LLM-friendly content map
# llms.txt: https://yoursite.com/llms.txt
```

**9.6 — Use structured data (Schema.org)**

JSON-LD structured data helps LLMs understand content type, author, date, and structure:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Article Title",
  "author": {
    "@type": "Person",
    "name": "Author Name"
  },
  "datePublished": "2026-02-12",
  "dateModified": "2026-02-15",
  "description": "Article description",
  "inLanguage": "en"
}
</script>
```

Schema.org markup such as `Article`, `BlogPosting`, `Product`, `FAQPage`, `HowTo`, `Organization`, and `LocalBusiness` is particularly useful for GEO.

**9.7 — Structure content clearly**

LLMs better interpret content when:

- Each page has a single clear and descriptive `<h1>`
- Subheadings (`<h2>`, `<h3>`) follow a logical hierarchy
- Paragraphs address one concept at a time
- Key information is at the beginning of the text (inverted pyramid structure)
- Lists and tables are used for structured data

**9.8 — Declare authorship and date**

LLMs give more weight to content with identifiable author and clear date:

- Include visible author name on the page
- Include publication and last update date
- Use meta tags `article:author`, `article:published_time`, `article:modified_time`
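In HTML, these Open Graph article properties look like this (all values are placeholders):

```html
<meta property="article:author" content="Author Name">
<meta property="article:published_time" content="2026-02-12T09:00:00+00:00">
<meta property="article:modified_time" content="2026-02-15T14:30:00+00:00">
```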

**9.9 — Monitor LLM citations**

Periodically test if LLMs cite your site when answering questions relevant to your content. Ask LLMs questions about topics your site covers in depth and verify if your site appears as a source.

### Relationship between sitemap.xml, llms.txt, and llms.html

| File | Purpose | Recipients | Content |
|---|---|---|---|
| `sitemap.xml` | Complete list of site URLs | Search engines (Google, Bing) and LLM crawlers | URLs, lastmod, priority, hreflang |
| `robots.txt` | Access directives for crawlers | All crawlers | Allow/Disallow per User-Agent |
| `llms.txt` | Curated map and site context | LLMs during inference | Structured Markdown with links to .md versions |
| `llms.html` | Complete knowledge base | Chatbots, custom GPTs, RAG systems | Structured HTML with complete content |
| `*.html.md` | Clean version of each page | LLMs accessing specific pages | Markdown of content only, without UI |

All these files coexist and have complementary purposes. None replaces the others.
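
Since all of these files must ship together, a pre-deploy check is useful. A minimal sketch that assumes a static build output in `./dist` (the directory name and sample file contents are hypothetical; against a live site you would use the `curl` checks from the checklist below instead):

```shell
# Hypothetical pre-deploy check: assumes the static build lands in ./dist
mkdir -p dist
printf 'User-agent: *\nAllow: /\n' > dist/robots.txt   # sample files for the demo
printf '<?xml version="1.0"?>\n'   > dist/sitemap.xml
printf '# My Site\n'               > dist/llms.txt

# Each discovery file must exist and be non-empty before deploying
for f in robots.txt sitemap.xml llms.txt; do
  if [ -s "dist/$f" ]; then
    echo "present: $f"
  else
    echo "MISSING: $f"
  fi
done
```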

### Problem indicators

- `/llms.txt` file absent (404)
- llms.txt present but with non-standard format or generic content
- No Markdown version available for site pages
- Links in llms.txt point to complete HTML pages instead of .md versions
- Absence of JSON-LD structured data
- Content without identifiable author or date
- The LLM doesn't cite the site when asked questions relevant to published content

---

## Quick Checklist

### Accessibility and infrastructure

| # | Test | Command/Action | Expected outcome |
|---|---|---|---|
| 1 | Hosting Bot Protection | Hosting Firewall panel | Log or disabled |
| 2 | AI Bots ruleset | Hosting Firewall panel | Log or disabled |
| 3 | Response to LLM User-Agent | `curl -I -H "User-Agent: ClaudeBot" yoursite.com` | 200 OK, identical to browser |
| 4 | robots.txt accessible | `curl -I yoursite.com/robots.txt` | 200 OK, `text/plain` |
| 5 | robots.txt Allow for LLMs | `curl -s yoursite.com/robots.txt` | Explicit Allow for LLM bots |
| 6 | robots.txt points to sitemap index | `curl -s yoursite.com/robots.txt \| grep Sitemap` | Single link to sitemap index |

### Meta tags and middleware

| # | Test | Command/Action | Expected outcome |
|---|---|---|---|
| 7 | Meta robots | `curl -s yoursite.com/page \| grep robots` | No noindex |
| 8 | X-Robots-Tag | `curl -I yoursite.com/page` | No noindex |
| 9 | Middleware doesn't block bots | `curl -I -H "User-Agent: ClaudeBot" yoursite.com/page` | 200 OK, no redirect loop |

### Rendering and content

| # | Test | Command/Action | Expected outcome |
|---|---|---|---|
| 10 | Content in initial HTML | `curl -s yoursite.com \| grep -E "<h1\|<article\|<p"` | Main content present |
| 11 | SSR vs browser element count | Count articles in curl vs browser | Identical numbers |
| 12 | JSON-LD structured data | `curl -s yoursite.com/page \| grep "ld+json"` | Schema.org present |

### Sitemap

| # | Test | Command/Action | Expected outcome |
|---|---|---|---|
| 13 | Sitemap accessible | `curl -I yoursite.com/sitemap.xml` | 200 OK |
| 14 | Sitemap Content-Type | `curl -I yoursite.com/sitemap.xml \| grep content-type` | `text/xml; charset=utf-8` |
| 15 | hreflang present (multilingual) | `curl -s yoursite.com/sitemap.xml \| grep hreflang` | Tags for each language |
| 16 | No functional pages | `curl -s yoursite.com/sitemap.xml \| grep -iE "legal\|login\|register"` | No results |
| 17 | Differentiated priorities | Manual sitemap review | Consistent with page importance |
| 18 | Coherent lastmod | `curl -s yoursite.com/sitemap.xml \| grep lastmod` | Real dates, not uniform |

### End-to-end verification

| # | Test | Command/Action | Expected outcome |
|---|---|---|---|
| 19 | llms.txt present | `curl -s -o /dev/null -w "%{http_code}" yoursite.com/llms.txt` | `200` |
| 20 | llms.txt correct format | `curl -s yoursite.com/llms.txt` | Structured Markdown with H1, blockquote, H2 sections |
| 21 | Markdown versions of pages | `curl -s yoursite.com/blog/article.html.md` | Clean Markdown content |
| 22 | llms.html for RAG | `curl -s -o /dev/null -w "%{http_code}" yoursite.com/llms.html` | `200` (if implemented) |
| 23 | LLM access test | Ask Claude/GPT to read a page | Content returned |
| 24 | Server logs during LLM test | Check live logs during test | Request visible |
| 25 | Site search via LLM | Ask LLM to search for your article | Site in results |
| 26 | Citation by LLM | Ask LLM about a topic you cover | Site cited as source |

---

## When to repeat tests

The LLM crawler ecosystem is rapidly evolving. Repeat tests periodically, especially after:

- Framework updates (Next.js, Nuxt, WordPress, etc.)
- Changes to hosting, plan, or CDN
- Firewall, WAF, or bot rule updates
- Deployment of new features involving middleware or rendering
- User reports of being unable to find the site through LLMs
- Release of new LLMs or significant updates to existing ones

---

## Glossary

| Term | Definition |
|---|---|
| **GEO** | Generative Engine Optimization: optimization for visibility in LLMs |
| **SEO** | Search Engine Optimization: optimization for search engines |
| **SSR** | Server-Side Rendering: HTML content is generated by the server |
| **CSR** | Client-Side Rendering: content is generated by JavaScript in the browser |
| **SSG** | Static Site Generation: HTML pages are pre-generated at build time |
| **WAF** | Web Application Firewall: application-level firewall |
| **robots.txt** | File that tells crawlers which pages they can or cannot access |
| **Sitemap** | XML file listing all site pages for crawlers |
| **hreflang** | Attribute indicating the language and region of a page |
| **Poisoned cache** | When a crawler memorizes a block and continues using it even after correction |
| **Content-Type** | HTTP header indicating the format of the response content |
| **User-Agent** | HTTP header identifying the client making the request |
| **Crawl budget** | Number of pages a crawler is willing to scan on a site |
| **JSON-LD** | Format for structured data embedded in HTML |
| **Schema.org** | Standard vocabulary for structured data on the web |
| **llms.txt** | Markdown file in the site root providing a curated content map for LLMs (standard proposed by Jeremy Howard, llmstxt.org) |
| **llms.html** | Structured HTML page functioning as a knowledge base for chatbots and RAG systems |
| **RAG** | Retrieval Augmented Generation: technique allowing an LLM to access external knowledge sources to generate more accurate responses |
| **Inference** | The moment when an LLM generates a response to a user request, potentially consulting external sources |