In this talk we will discuss a distributed web crawler for searching corporate websites and extracting contact information from them. The system actually consists of two components: a web scraper and a separate application for analysis and extraction.
The entire system has an online architecture and is built on queues. Extraction is implemented as a Python application that aggregates content into per-host buckets and makes a decision on each bucket separately. From every host we extract contact information: the address, phone number, and company name, as well as social network accounts, the company's business sectors, and the technologies the website is built on. All of this is preceded by a corporate-site classifier that filters out hosts from which we would fail to extract anything, or, conversely, huge catalogues where high accuracy would be hard to achieve.
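To make the bucketing idea concrete, here is a minimal sketch (not the actual production code) of how crawled pages arriving from a queue can be aggregated into per-host buckets before a per-bucket decision is made. The queue contents and the function name are illustrative assumptions.

```python
# Sketch: group crawled pages by host so each host can be
# analysed as a single unit (all names here are hypothetical).
from collections import defaultdict
from queue import Queue
from urllib.parse import urlparse


def bucket_pages(page_queue: Queue) -> dict:
    """Drain the queue of (url, body) pairs into per-host buckets."""
    buckets = defaultdict(list)
    while not page_queue.empty():
        url, body = page_queue.get()
        # The hostname is the bucket key; extraction later runs per bucket.
        buckets[urlparse(url).netloc].append(body)
    return dict(buckets)


# Toy input standing in for the crawler's output queue.
q = Queue()
q.put(("https://acme.example/contacts", "Phone: +1-555-0100"))
q.put(("https://acme.example/about", "ACME Inc."))
q.put(("https://other.example/", "Hello"))

buckets = bucket_pages(q)
print(sorted(buckets))               # → ['acme.example', 'other.example']
print(len(buckets["acme.example"]))  # → 2 pages aggregated for that host
```

In the real system the queue would be a distributed one and the per-bucket step would run the classifier and the extractors, but the grouping principle is the same.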
This talk focuses mainly on the extraction itself: our search for a working architecture, the sequence of algorithms, and our methods of collecting training data. It will be useful to anyone who processes web data or works with large volumes of data in general.