Friday, October 23, 2009

Making AJAX crawlable

There is a not-so-new slideset by Katharina Probst and Bruce Johnson (the GWT guy) about making AJAX crawlable, here: http://docs.google.com/present/view?id=dc75gmks_120cjkt2chf





It's short and interesting.

It's not only about AJAX but about any dynamic web site that wants its pages indexed, like, for example, a news site.

One way to have dynamic content indexed was to make the site's internal engine render all the possible pages as HTML and store them in a special directory, in a simplified format to avoid a traffic spike.
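A minimal sketch of that pre-rendering idea, assuming a hypothetical list of place items and a /places/ URL scheme (a real site would pull these from its database):

```python
# Sketch of the pre-rendering approach: write every item as a
# stripped-down static HTML file into a special directory.
# The items list and URL scheme are hypothetical placeholders.
import os
import html

items = [
    {"slug": "paris", "title": "Paris", "summary": "Capital of France."},
    {"slug": "rome", "title": "Rome", "summary": "Capital of Italy."},
]

outdir = "crawler_pages"
os.makedirs(outdir, exist_ok=True)

for item in items:
    page = (
        "<html><head><title>{0}</title></head><body>"
        "<h1>{0}</h1><p>{1}</p>"
        '<a href="/places/{2}">Full page</a>'
        "</body></html>"
    ).format(html.escape(item["title"]), html.escape(item["summary"]), item["slug"])
    with open(os.path.join(outdir, item["slug"] + ".html"), "w") as f:
        f.write(page)
```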

Now Probst and Johnson say that web sites that want to be indexed should provide a server script that outputs all the data through which the site wants to be found.

For example, a site about places could output a list of cities with the relevant data for each, in a simple listing "internal page" (do crawlers index very long pages?).

This would be better than running the server scripts to generate all the possible pages. That could be a waste of resources: instead of many pages, one could do with a single, long "internal page" carrying the keywords and other fixed data only once, plus a listing of the variable parts, each pointing to the URL where the user would make his selection.
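As a rough sketch of that single long listing page, with hypothetical city names and URLs: the fixed text goes in once, and each variable entry is just a link to the real selection URL.

```python
# Sketch of the single long listing "internal page": fixed keywords
# appear once, followed by one link per variable item.
# City names and the /places URL are hypothetical placeholders.
import html
from urllib.parse import quote

cities = ["Paris", "Rome", "Berlin", "Madrid"]

# Fixed keywords and descriptions go in only once, at the top.
fixed_intro = "<h1>Places</h1><p>Weather, hotels and maps for every city.</p>"

# One link per variable item, pointing at the URL where the user
# would make the actual selection.
links = "\n".join(
    '<li><a href="/places?city={0}">{1}</a></li>'.format(quote(name), html.escape(name))
    for name in cities
)

with open("all_cities.html", "w") as f:
    f.write("<html><body>{0}<ul>\n{1}\n</ul></body></html>".format(fixed_intro, links))
```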

"All possible pages" sounds scary, but in most cases it's not. Obviously Google can't generate each and every page. But most other sites could. Even eBay, for example, which must be doing something like that. Each site has to do its math: number of unique "items" times number of characters per item pure content, devoid of all the surronding stuff like hundreds of links that make a page weighty. Disk drives are inexpensive.

I think about loading a special directory with barebones XML "bait" pages containing all the relevant data, with appropriate URLs that the web server could redirect to the actual page for viewing. That way the bait pages could be designed to display better in the search results list, with more relevant abstracts.
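One way those bait files might be produced; the XML element names, the example data, and the redirect rule in the comment are only assumptions to illustrate the idea, not anything from the slides:

```python
# Sketch: one barebones XML "bait" file per item, written into a
# special directory. Element names, data and URLs are hypothetical.
import os
import xml.etree.ElementTree as ET

places = [
    {"id": "paris", "name": "Paris", "summary": "Capital of France."},
    {"id": "rome", "name": "Rome", "summary": "Capital of Italy."},
]

os.makedirs("bait", exist_ok=True)

for p in places:
    root = ET.Element("place")
    ET.SubElement(root, "name").text = p["name"]
    ET.SubElement(root, "summary").text = p["summary"]
    # The URL the web server would send a human visitor to instead,
    # e.g. with an Apache rule along the lines of:
    #   RewriteRule ^bait/([^/]+)\.xml$ /places?city=$1 [R=302,L]
    ET.SubElement(root, "url").text = "/places?city=" + p["id"]
    ET.ElementTree(root).write(
        os.path.join("bait", p["id"] + ".xml"),
        encoding="utf-8",
        xml_declaration=True,
    )
```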

And when I write "bait" I think of scams: sites luring users to click on search results that take them to different content.

Actually, the site should design and store raw indexing data, linking text content to URLs, to feed the crawler. And ideally the site should be able to run that job at whatever time fits it best, for example during the local night hours. Since the indexes are not updated that frequently, and certainly not in real time, choosing a time should not be an issue.
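That raw feed could be as simple as a nightly batch job dumping (URL, text) pairs, for instance scheduled from cron; a sketch with a hypothetical data source:

```python
# Sketch of a nightly job that dumps raw indexing data as
# tab-separated (URL, text) pairs. Meant to run off-peak, e.g. from
# cron:  0 3 * * * /usr/bin/python3 dump_index_feed.py
# fetch_items() is a hypothetical stand-in for the site's database.
import csv

def fetch_items():
    # Placeholder for the real database query.
    return [
        ("/places?city=paris", "Paris weather hotels maps"),
        ("/places?city=rome", "Rome weather hotels maps"),
    ]

with open("index_feed.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for url, text in fetch_items():
        writer.writerow([url, text])
```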
