Google may not be accurate
December 12, 2006 9:56pm CST
Want to use Google for site search on your website or blog? Think twice before you do, especially if your site or blog does not have a high PageRank (PR), because the results from your site search may be neither accurate nor up to date. Let me explain why.
Google uses a fully automated search engine. It uses software known as spiders to crawl the Internet on a regular basis and find sites to add to the Google index. It is from this index that the search results are generated whenever somebody searches for a term in Google. As Google itself explains, “Although we index billions of webpages and are constantly working to increase the number of pages we include, we can't guarantee that we'll crawl all of the pages of a particular site.” It further says, “While we can't guarantee that all pages of a site will consistently appear in our index, we do offer our guidelines for maintaining a Google-friendly site.”
Thus, whether a Google search for a particular term will include information from pages on your website or blog depends mainly on the following factors:
- How fast and how regularly do the Google spiders crawl your site? When did they last crawl it: last month, last week, yesterday or today?
- How many pages of your site have been crawled and thereby indexed by Google? All of them, or only some?
- How often is your site updated?
- Does Google crawl your site with the same regularity, speed and exhaustiveness with which you update it? Or is there a time gap, with some updated pages yet to be indexed?
Practical experience shows that if you have a high PR site, Google spiders may be more regular and exhaustive in indexing your site: maybe every day, or even more than once a day, though sometimes only every 2-3 days or longer.
On the other hand, if your site is not a high PR site, chances are that the Google spiders will visit it less often and may also be a bit lax about indexing all of its pages, so some pages may be missing from the index. Of course, this also depends on the structure of your site. If your site is not structured properly, i.e., if some pages are not linked properly, they may not get indexed.
Google itself explains why some sites, or some pages on a site, are missed during indexing. It says that “Although Google crawls billions of pages, it's inevitable that some sites will be missed”. Google further elaborates that when its spiders miss a site, it's frequently for one of the following reasons:
- The site is not well connected through multiple links to other sites on the web.
- The site was launched after Google's most recent crawl was completed.
- The design of the site makes it difficult for Google to effectively crawl its content.
- The site was temporarily unavailable when Google spiders tried to crawl it, or the spiders received an error when they tried to crawl it.
Another problem may lie in the type of web pages you use. If you decide to use dynamic pages (i.e., the URL contains a "?" character), you should be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and few in number.
Google advises using the Google webmaster tools to see whether it received errors when trying to crawl a site. Google also offers detailed guidelines and tips for building a crawler-friendly site, though with a disclaimer: while there's no guarantee that Google spiders will find a particular site, following these guidelines should increase a site's chances of showing up in the Google search results. Google also recommends that you consider creating and submitting a detailed site map of your pages using Google Sitemaps.
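For readers who have not seen one, a Sitemap is simply an XML file that lists the URLs of a site, optionally with the date each page last changed. Below is a minimal sketch in Python of how such a file could be generated. It assumes the sitemaps.org 0.9 format; the function name build_sitemap and the example.com pages are made up for illustration, and a real site would collect its URLs from its own files or database.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (as a string) for a list of (url, last_modified) pairs.

    last_modified is a date string such as "2006-12-12", or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if last_modified:
            ET.SubElement(entry, "lastmod").text = last_modified
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages, for illustration only.
example_pages = [
    ("http://www.example.com/", "2006-12-12"),
    ("http://www.example.com/blog/latest-post.html", "2006-12-11"),
    ("http://www.example.com/about.html", None),
]

sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

Regenerating this file whenever a page is added or changed keeps the list current, though the file is only a hint to the crawler about what exists and what has changed.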
Google Sitemaps is an easy way to submit all your URLs to the Google index and get detailed reports about the visibility of your webpages on Google. With Google Sitemaps, you can automatically keep Google informed of all of your current pages and of any updates you make to those pages. The problem is that even submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in Google search results, as Google's own disclaimer states.
To check when your site was last indexed, you can use Google’s Site status wizard: enter your site URL and it will show whether your site is being indexed and, if so, when it was last indexed. Another, more direct method is to select some unique text from a recent page on your site and search for it on Google. If the text is not found, it basically means that this recent page is yet to be indexed by Google. You can also run this direct search test through the Google site search on your own site.
Unfortunately, as I mentioned earlier, when you use the Site status wizard or the direct search test, you may notice a yawning gap between your recently updated pages and the Google search results. For some sites, the results may not show even pages that you updated 2-3 weeks ago, or maybe even earlier. Compare this with the expectation a user may have when she notices a Google Site Search button on your site: she will assume that she will get the most recent results. When she does not find what she is looking for, instead of blaming Google for its lazy spiders, she will get the impression that the information she searched for does not exist on your site! So, where do you stand? What is the advantage of having updated the site recently and having put up the Google Site Search button?
In fact, I tried this experiment on a large number of sites, big as well as small. What I found was that the latest information in recently updated pages was generally missing from the Google search. Now comes a big surprise: recent information in the latest updated pages of Google’s own sites was also missing from the Google search! For this test, I took text from the most recently updated pages of Google’s own official blogs (so as to be sure of the date on which the site was last updated) and then searched for it on Google.
But I should be fair to Google. I saw the same results when I ran similar tests with the Yahoo! Search and MSN search engines. So, it is not the fault of Google alone. The fault lies in the method used for indexing pages. As I mentioned earlier, Google (and, for that matter, the other top search engines) uses an automated search engine whose spiders crawl the web from time to time for indexing purposes. With the present level of technology and resources, it is practically impossible to have the spiders index the whole of the web instantaneously. There will always be some time gap between the moment a web site is updated and the next visit of the spiders. Unfortunately, this time gap turns out to be a few days for most sites, and sometimes even a few weeks!
What, then, is the answer if you want your site to be fully indexed at all times, with the most recently updated pages also figuring in the search? Well, the answer lies in manual indexing. By manual indexing I mean using software or a service that you run yourself to index your site immediately after every update, or as often as you want. (Of course, the software indexes your site automatically, but you have to start the process manually; hence the word “manual”.)
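To make the idea concrete, here is a minimal sketch in Python of an index that is rebuilt on demand. It is not how any particular vendor's product works; the names build_index and search and the sample pages are hypothetical, and a real tool would read the pages from disk or over HTTP and handle far more than plain words.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build an inverted index: word -> set of page URLs containing it.

    pages maps a URL to the plain text of that page.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical site content, for illustration only.
site_pages = {
    "/index.html": "Welcome to my blog about search engines",
    "/old-post.html": "How spiders crawl the web",
}
index = build_index(site_pages)

# The site is updated, then the owner re-runs the indexer by hand.
site_pages["/new-post.html"] = "Why site search results may lag behind updates"
index = build_index(site_pages)

print(search(index, "site search"))
```

Because the index is rebuilt right after the update, the new page is searchable at once, with no wait for an outside crawler to come by.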
If you reindex your site this way every time it is updated, then, trust me, your search results will be the most accurate and the most recent! The only catch is that you may not be able to use the Google search engine for your site search; you’ll have to use some other vendor’s software or service that can do such manual indexing. In fact, if you search the web (yes, you can use Google for this purpose!) for such software or services, you will find plenty of them. I am refraining from mentioning individual names, but I can assure you that I have personally used such services for my earlier websites and have found them to be absolutely reliable. I have noticed many sites on the web using similar manual indexing services. You may even explore the Google Search Appliance or the Google Mini, if they suit your requirements and resources and your priority is enterprise search, though honestly speaking I have not tried them.
OK, so much for site search. What about web search in general? Well, Google (or any other good search engine such as Yahoo! or MSN, for that matter) is definitely the right answer here, for several simple reasons. Firstly, what you are searching is the whole web, and you can’t manually index the whole web yourself; Google and other search engines are already indexing it for you. Secondly, your credibility, or your site’s credibility, is not at stake here. Thirdly, for the purposes of a general web search, a time gap of a day or two does not matter much, as the existing information is so vast that you’ll get tons of information on any searched item.
Fourthly, because the big sites and high PR sites are in any case being indexed on a regular basis by Google and others, a general web search is likely to give you the most recent results (unless, of course, you are looking for information hidden in some small site). So, what should be the strategy? Perhaps it is to use Google for the general web search
• United States
13 Dec 06
Perhaps you can remember when we could lay out 3 books or so on a topic and see what the common denominator was? Just because you see something on the web does not necessarily make it so. Check it out and prove it or disprove it; you will gain knowledge that way!