What's Behind That Search Box

Search is ubiquitous. 

Search is so prevalent in our daily lives that some of us don’t stop to think about what’s behind the search box. For some, whether in corporate decision-making or national security, search results have potentially major implications.

Search is a pretty complex subject about which I plan to write more, but let’s start at a high level with three items to think about if search is of critical importance to you or your mission.

  1. Nearly 40 percent of the internet is not rendered in English
  2. I’m searching the internet, right?
  3. 13.4 million results in 0.48 seconds

 

1. Nearly 40 percent of the internet is not rendered in English[1]

You’ve heard the old phrase about computers: garbage in, garbage out. Well, something similar can be said about search and language: English in, English out. More precisely, whatever language goes in, that language comes out. That’s how most searches work. So, if I am searching in English and there are relevant results in other languages, I’m not getting those results. Said differently, if 40 percent of the internet is not rendered in English and I’m searching in English, I’m missing 40 percent of the data right out of the gate.

 

2. I’m searching the internet, right?

Well, no, not exactly. Internet search is basically limited to the “indexed” web: the part of the internet that is crawled and indexed by the major search engines. It does not include what is typically called the deep and dark web. Awareness of the deep and dark web is becoming more mainstream, but it’s difficult, if not impossible, for the human mind to grasp the scale of the internet. Estimates vary, but suffice it to say that the indexed web is perhaps 10% (probably far less) of the internet.

So, when you are using popular internet search engines, you’re searching less than 10% of the internet.

 

3. 13.4 million results in 0.48 seconds 

Just now, I ran a search on a major search engine for “Babel Street.” That search returned 13.4 million results in just 0.48 seconds! I find it amusing that search engines still note the time; that’s not particularly helpful for most use cases. Yes, I can tweak the search somewhat. I can narrow it to news or images. I can add additional search terms. If I’m really adept, I can use some Boolean-like tricks to modify my search. At the end of the day, though, the results are not easily filtered, sorted, or analyzed. And, of course, all the search engines are bringing back results based on an advertising model, which affects both the relevancy of what comes back and how it’s sorted.

“OK, wait,” you may be saying right now. “If I’m conducting important searches in English using a major search engine, I’m searching 10% of the internet. And if my search is in English, I’m only getting around 60% of the content back. So I’m relying on about 6% of the available data, presented according to an advertising model, for my important work or mission?” Yep, that’s mostly right.
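
For anyone who wants to check the back-of-the-envelope math, here is a minimal sketch in Python. The variable and function names are mine, purely for illustration, and the two inputs are the rough figures from this post rather than measurements. Multiplying them also assumes the language mix of the indexed web matches that of the internet as a whole, which is a simplification.

```python
# Rough figures taken from this post; they are estimates, not measurements.
INDEXED_SHARE = 0.10   # share of the internet covered by major search engines (upper estimate)
ENGLISH_SHARE = 0.60   # share of content rendered in English


def reachable_share(indexed: float, language: float) -> float:
    """Estimate the fraction of all content a single-language query can reach.

    Assumes (for illustration only) that the language mix inside the
    indexed web is the same as across the whole internet.
    """
    return indexed * language


share = reachable_share(INDEXED_SHARE, ENGLISH_SHARE)
print(f"Estimated reachable share: {share:.0%}")
```

Swap in your own estimates for either input; since the indexed web is probably far less than 10%, the real figure is likely smaller still.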

As I noted earlier, search is a complex field. Language has always been a huge part of that complexity. For almost a decade now, Babel Street’s team of data scientists and engineers has been developing a unique and incredibly powerful cross-lingual search solution. By combining it with broad data access, sophisticated filtering, analytics and more, Babel Street has created The World’s Leading AI-Enabled Data-to-Knowledge Company.

 

Next: While many talk about the volume, velocity and variety of data, you should be focused on Relevance, Real-time and Richness.

_________

[1] https://w3techs.com/technologies/overview/content_language