In this post I detail how to perform web searches from the command line (Linux shell, Windows PowerShell) or from a script in any language (Perl, Python, etc.). I tried and compared several tools for this, using Google and Microsoft Bing, and present them here. I cover the free tools as well as the official APIs, which are free up to a quota and let you buy credits for more queries. Finally, I look at ways to reduce the number of queries to get the most out of each one (and possibly stay within the limits of the free APIs!).
Condensed “how to”: performing web searches from the CLI
Free tools available
Google for linux (package googlecl)
I started by looking for an official package and found googlecl in the default repository of my Debian install. Unfortunately, it only lets you act on Google services (Picasa, Blog, YouTube) and cannot run web search queries.
Googler
You can use a GitHub project named “googler” for Linux, which lets you perform web searches from the command line. It has both a fully automatic mode and an interactive mode, the project is maintained, and it doesn’t require an API key (if I understood correctly, it sends plain HTTP GET requests and parses the returned HTML). It can output the results in a user-friendly format or in JSON (very practical for scripts).
Here is a screenshot of the interactive version:
And here is the JSON output (limited to 2 results):
Drawback: I had to run lots of queries to extract the data I was interested in, and after around 50 queries the program started returning “503 unavailable” errors. I suppose Google has a system that detects scripts making queries and bans them for some time (it worked again a few hours later). During this test I applied no delay between queries, to run them as fast as possible. Since this tool does HTTP GETs like a browser, perhaps adding a delay between queries would be enough to “stay under the radar”?
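One way to test the “stay under the radar” idea is to put a random pause between queries. The following is only a minimal sketch: it assumes googler is on the PATH and accepts the `--json` and `-n` options, and the delay bounds are arbitrary guesses of mine, not values known to keep Google happy. The `run_throttled` and `googler_search` names are made up for this example.

```python
import json
import random
import subprocess
import time


def googler_search(query, count=2):
    """Run googler once and return its JSON results as a list of dicts.

    Assumption: googler is installed and supports --json and -n.
    """
    completed = subprocess.run(
        ["googler", "--json", "-n", str(count), query],
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


def run_throttled(queries, search=googler_search,
                  min_delay=20.0, max_delay=60.0, sleep=time.sleep):
    """Run each query in turn, pausing a random time between two queries."""
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            # Arbitrary pause (assumption); tune it to your own observations.
            sleep(random.uniform(min_delay, max_delay))
        results[query] = search(query)
    return results
```

The `search` and `sleep` parameters are injectable so the loop can be tried out without calling googler or actually waiting.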
DIY: Using WGET and parsing the HTML myself
After having tried googler, I decided to try to do it myself with WGET. Unfortunately, wget now gives me an error 403 forbidden when querying google in HTTP and 503 unavailable when querying in HTTPS with a redirection to a page containing “/sorry/” in the URL:
Conclusion: WGET and Googler are subject to the same query restrictions, so I abandoned the idea.
Interesting observation: I used PuTTY to open a tunnel to my server and set my Firefox browser to use that tunnel. With the same IP as my server, I could use Google normally. So they don’t simply ban the IP for some time: the filtering is more advanced, and it might be possible to get past it by mimicking a real browser and limiting the number of queries to something a real user would do…
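To experiment with that idea, a script needs two things: a request that carries browser-like headers, and a way to recognise the blocking symptoms described above. Here is a small sketch of both using only the Python standard library. The User-Agent string is just an example value and the function names are hypothetical. I have not verified that this is enough to avoid the block, and automated querying may be against Google's terms.

```python
import urllib.parse
import urllib.request

# Example browser-like User-Agent (an assumption; use your own browser's value).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_search_request(query):
    """Build (but do not send) a GET request for a Google results page."""
    url = "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def looks_blocked(status, final_url):
    """True for the symptoms seen with wget: 403, 503 or a /sorry/ redirect."""
    return status in (403, 503) or "/sorry/" in final_url
```

The request object can then be passed to `urllib.request.urlopen`, and a script can stop or slow down as soon as `looks_blocked` returns True.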
Bing-CLI
I decided to check whether some tools existed to perform queries using the Microsoft Bing search engine. I found bing-cli on npmjs (an unofficial tool). “Execute Bing Search queries from the terminal including Web, News, Related Search, AutoSuggest, and Images. Images are displayed as colorized-ASCII art or as inline images directly (if using iTerm2).”
It looked good, but I could not install it. After struggling for half an hour I gave up, so I cannot tell you much about this tool.
API available
Google API (JSON/Atom Custom Search API)
You can visit the official summary page here. In short: supports JSON/Atom, 100 queries per day for free, plus $5 per 1000 additional queries.
Observation: googler started failing after around 50 queries, and each of my queries fetched around 2 pages of results, so roughly 100 page requests. The limit I hit with googler therefore seems to be about the same as the 100 free queries per day of the API.
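For reference, here is a minimal sketch of how a script could talk to the Custom Search JSON API. It assumes you have created both an API key and a custom search engine ID (the `cx` parameter). The function names are mine, and only the `title` and `link` fields of each result are extracted.

```python
import json
import urllib.parse

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key, engine_id, query, start=1):
    """Build the request URL; `start` is the 1-based index of the first result."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cse_items(body):
    """Extract (title, link) pairs from a JSON response body."""
    data = json.loads(body)
    # "items" is absent when the query has no results.
    return [(item["title"], item["link"]) for item in data.get("items", [])]
```

The URL can be fetched with wget or `urllib.request.urlopen`, and the response body passed to `parse_cse_items`.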
Bing API (Web Search API)
Useful links: Official overview page, Getting started, Request API key, “sandbox” (perfect for development).
In short: supports JSON, 1000 queries per month (max 5 per second) for free, renewable for 90 days, good tools for development.
To use it with wget (will save the JSON response in a file named as you wish):
wget -O [FILENAME TO SAVE JSON RESPONSE] --header="Ocp-Apim-Subscription-Key: [INSERT YOUR API KEY HERE]" 'https://api.cognitive.microsoft.com/bing/v5.0/search?q=[INSERT YOUR SEARCH QUERY HERE]&count=999999&offset=0'
I advise you to develop your query in their “sandbox”.
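The same call can be made from a script. The sketch below reuses the endpoint and header from the wget command above and reads the results from the `webPages.value` list of the JSON response. The function names are mine, and the default `count` of 50 is my assumption about the largest page size the API actually returns per request.

```python
import json
import urllib.parse
import urllib.request

BING_ENDPOINT = "https://api.cognitive.microsoft.com/bing/v5.0/search"


def build_bing_request(api_key, query, count=50, offset=0):
    """Build (but do not send) the GET request for one page of results."""
    params = urllib.parse.urlencode({"q": query, "count": count, "offset": offset})
    return urllib.request.Request(
        BING_ENDPOINT + "?" + params,
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )


def parse_bing_results(body):
    """Extract (name, url) pairs from a JSON response body."""
    data = json.loads(body)
    # "webPages" is missing when there are no web results.
    pages = data.get("webPages", {}).get("value", [])
    return [(page["name"], page["url"]) for page in pages]
```

Send the request with `urllib.request.urlopen(request).read()` and pass the body to `parse_bing_results`. To respect the limit of 5 queries per second, a short `time.sleep` between calls is enough.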
Minimising the number of queries
I don’t really want to pay to query a search engine, so I tried to minimise the number of queries needed.
Long story short, I wanted to find smart building devices and software on manufacturer websites. To do so, I wanted to search a list of websites for pages that contain at least one keyword from each of two sets of keywords (one set indicating that the page relates to smart buildings, the other indicating that it describes a device or software).
Example query:
site:trendcontrols.com +”Building automation system” +”Installation manual”
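Generating one such query per combination is easy to script, which also shows how quickly the count grows. This is a small sketch with a hypothetical `naive_queries` helper. With the 10 and 14 keywords used later in this post, every website costs 10 × 14 = 140 queries.

```python
from itertools import product


def naive_queries(sites, topic_keywords, doc_keywords):
    """One query per (site, topic keyword, document keyword) combination."""
    return [
        f'site:{site} +"{topic}" +"{doc}"'
        for site, topic, doc in product(sites, topic_keywords, doc_keywords)
    ]


queries = naive_queries(
    ["trendcontrols.com"],
    ["Building automation system"],
    ["Installation manual"],
)
print(queries[0])
```

The total is always `len(sites) * len(topic_keywords) * len(doc_keywords)`, so every extra keyword multiplies the bill.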
I could have minimised the number of queries by having one big query containing everything, like:
(site:trendcontrols.com OR site:siemens.co.uk OR …) AND ((“Building management system” OR “Smart building” OR …) AND (“Installation manual” OR “Specification sheet” OR …))
Obviously that would have been too easy. Unfortunately, both Google and Bing put restrictions and limits on search queries.
Limitations of Google search queries: Google limits each search query to 32 words.
To save some “words” you can replace ” OR ” by ” || ” and ” AND ” by ” && “. “site:trendcontrols.com” counts for 2 words. “Building management system” counts for 3 words, even if in quotes.
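These counting rules can be turned into a rough estimator to check a query before sending it. The sketch below follows the rules as I stated them above: a `site:` term counts as 2 words and every word of a quoted phrase counts. It additionally assumes that “||”, “&&”, parentheses and a leading “+” cost nothing, which I have not confirmed. The function names are hypothetical.

```python
import re

GOOGLE_WORD_LIMIT = 32


def estimate_google_words(query):
    """Rough word count of a query under the rules described in this post."""
    # Quotes and parentheses only group words, so replace them with spaces.
    cleaned = re.sub(r'[()"“”]', " ", query)
    count = 0
    for token in cleaned.split():
        if token in ("||", "&&"):
            continue  # assumption: symbolic operators are free
        token = token.lstrip("+")
        if not token:
            continue
        count += 2 if token.startswith("site:") else 1
    return count


def fits_google_limit(query):
    """True when the estimated word count stays within the 32-word limit."""
    return estimate_google_words(query) <= GOOGLE_WORD_LIMIT
```

With this estimator, `site:trendcontrols.com "Building management system"` comes out at 2 + 3 = 5 words.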
Limitations of Bing search queries: simple, 1000 characters, whatever they are. With Bing I now need only one query per website with my current list of keywords.
Example (595 characters in this query):
site:trendcontrols.com (“Building automation system” OR “Building control system” OR “Building energy control” OR “Building energy management” OR “Building management system” OR “Computer integrated building” OR “Home automation” OR “Intelligent building” OR “Smart building” OR “Smart home”) AND (“Data sheet” OR “Datasheet” OR “Declaration of conformity” OR “Installation guide” OR “Installation manual” OR “Operator guide” OR “Operator manual” OR “Operator instruction” OR “Product description” OR “Specification sheet” OR “Spec sheet” OR “Technical leaflet” OR “User guide” OR “User manual”)
And I can optimise it even more by asking for results from several websites at once, starting the query with:
(site:trendcontrols.com OR site:siemens.co.uk OR …)
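Packing as many sites as possible into each query while staying under the 1000-character limit can be done greedily. The following is a sketch of that idea, not the script I actually used. The helper names are hypothetical, and the query layout simply mirrors the example above.

```python
def or_group(terms, quote=False):
    """Join terms with OR inside parentheses, optionally quoting each term."""
    parts = [f'"{t}"' if quote else t for t in terms]
    return "(" + " OR ".join(parts) + ")"


def build_query(sites, topic_keywords, doc_keywords):
    """Build one combined query for several sites and both keyword sets."""
    site_part = or_group([f"site:{s}" for s in sites])
    topics = or_group(topic_keywords, quote=True)
    docs = or_group(doc_keywords, quote=True)
    return f"{site_part} {topics} AND {docs}"


def pack_sites(sites, topic_keywords, doc_keywords, max_chars=1000):
    """Greedily group sites so that each query stays within max_chars."""
    queries, batch = [], []
    for site in sites:
        candidate = batch + [site]
        too_long = len(build_query(candidate, topic_keywords, doc_keywords)) > max_chars
        if batch and too_long:
            # Close the current batch and start a new one with this site.
            queries.append(build_query(batch, topic_keywords, doc_keywords))
            batch = [site]
        else:
            # A single site is always kept, even if it alone exceeds the limit.
            batch = candidate
    if batch:
        queries.append(build_query(batch, topic_keywords, doc_keywords))
    return queries
```

The longer the keyword lists, the fewer sites fit in each query, so trimming redundant keywords directly reduces the number of queries.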
Thanks to this optimisation and the more generous limits of Bing, I reduced my initial set of 9940 queries to fewer than 10. I am now going to write the script that parses the JSON output, and I will publish my new observations and share the scripts I have written.
Thank you for reading, I hope that this helped you (leave a little “thank you” if so!).