As part of a recent project, we were faced with the task of compiling contacts for specific competences within public administrations across Germany. To avoid working through this task manually, we scraped the municipality websites and trained a NER model that extracts contact details for a number of departments directly from their official web presence. The result is a comprehensive list of over 10,000 fully qualified contacts.
The first step was to map the list of public administrations of interest onto our existing internal data lake. Unfortunately, we were only given plain, partially outdated names without any of the commonly used identifiers such as AGS or GISCO, so we re-used the heuristic originally developed to map job locations for the Job Cube. 8% of data points could not be annotated automatically, e.g. places whose names are not uniquely identifying, and required manual review.
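For illustration, here is a minimal sketch of what such a name-matching heuristic could look like, assuming a lookup table of official municipality names and AGS codes taken from the data lake. The table, thresholds and function name are illustrative, not our production code.

```python
from difflib import SequenceMatcher

# Illustrative lookup of official municipality names to AGS codes;
# in the real pipeline this table comes from the internal data lake.
AGS_BY_NAME = {
    "Frankfurt am Main": "06412000",
    "Frankfurt (Oder)": "12053000",
    "Münster": "05515000",
}

def match_ags(raw_name, min_score=0.85, min_margin=0.1):
    """Map a plain (possibly outdated) administration name to an AGS code.

    Returns None when the best candidate is too dissimilar or too close
    to the runner-up, so the entry is flagged for manual review instead.
    """
    scored = sorted(
        (SequenceMatcher(None, raw_name.lower(), name.lower()).ratio(), ags)
        for name, ags in AGS_BY_NAME.items()
    )
    best_score, best_ags = scored[-1]
    runner_up = scored[-2][0] if len(scored) > 1 else 0.0
    if best_score < min_score or best_score - runner_up < min_margin:
        return None  # unknown or ambiguous name -> manual review
    return best_ags

print(match_ags("Frankfurt a. Main"))  # -> "06412000" (Frankfurt am Main)
print(match_ags("Frankfurt"))          # -> None, name alone is not unique enough
```

Entries that fall through to None are exactly the roughly 8% that went into manual review.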
After structuring the input data, our next step was to identify the current web address for each of the given public administrations. By combining information from Wikipedia and automated Google searches, we established a web address for over 95% of entries, but as in the previous step, some entries required manual intervention. Starting from this list of websites, we crawled each site, searching for keywords that identify pages likely to list the officials we are interested in.
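To give an idea of what this keyword filter looks like, here is a stripped-down sketch using requests and BeautifulSoup. The keyword list and function name are illustrative, and a production crawler would additionally handle recursion, deduplication and politeness (robots.txt, rate limits).

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Illustrative keywords; the curated list used in practice is longer.
KEYWORDS = ("kontakt", "ansprechpartner", "mitarbeiter", "verwaltung", "amt")

def candidate_pages(start_url: str, timeout: int = 10) -> set[str]:
    """Return internal links from start_url whose URL or anchor text
    contains one of the keywords, i.e. pages likely to list officials."""
    html = requests.get(start_url, timeout=timeout).text
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(start_url).netloc
    hits = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(start_url, a["href"])
        if urlparse(url).netloc != host:
            continue  # stay on the administration's own site
        text = (a.get_text(" ", strip=True) + " " + url).lower()
        if any(kw in text for kw in KEYWORDS):
            hits.add(url)
    return hits
```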
Based on our experience extracting contact data from job postings, we were able to jump-start the annotation and training process by iterating on an existing NER model. This gave us time to experiment with and improve the conversion of websites to plain text, so that our model could profit from the information encoded in the markup structure (e.g. tables and lists) and its context. As a result, we noticeably boosted our NER performance and reached a satisfying F1 score of 88.25% after annotating only 400 scraping results.
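We cannot reproduce the exact conversion here, but the following sketch illustrates the underlying idea: keep structural hints from the markup instead of dumping raw text, so that table rows and list items arrive at the model as coherent units. Tag handling is deliberately simplified (for instance, tables and lists are emitted before the remaining text rather than in document order).

```python
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    """Flatten HTML to plain text while keeping hints of the markup:
    table cells are joined with ' | ' so each row stays on one line,
    and every list item gets its own line."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()

    lines = []
    for row in soup.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["td", "th"])]
        lines.append(" | ".join(cells))
        row.extract()   # remove from the tree so the text is not emitted twice
    for item in soup.find_all("li"):
        lines.append("- " + item.get_text(" ", strip=True))
        item.extract()
    remaining = (s.strip() for s in soup.get_text("\n").splitlines())
    lines.extend(s for s in remaining if s)
    return "\n".join(lines)
```

Keeping "Name | Department | Phone | E-mail" on a single line gives the NER model far more usable context than the same tokens scattered across an unstructured text dump.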
“The project started as a one-off mission, but we have exceeded all expectations regarding data quality and volume. So we are currently working on delivering updates and establishing a constant data flow into our client's CRM. This might become the one-stop source for fully qualified public administration contacts.”
Get in touch to find out more about our scraping and natural language processing solutions.