Using NLP to gather contact data for public administrations on a large scale

As part of a recent project, we were tasked with compiling contacts for specific areas of responsibility within public administrations across Germany. To avoid working through this task manually, we scraped the municipality websites and trained an NER model that extracts contact details for a number of departments directly from their official web presence. The result is a comprehensive list of over 10,000 fully qualified contacts.


Getting started

The first step required us to map the list of public administrations of interest onto our existing internal data lake. Unfortunately, we were given plain names, some of them outdated, without any of the commonly used identifiers such as AGS or GISCO codes, so we reused the heuristic originally developed to map job locations for the Job Cube. 8% of the data points could not be mapped automatically, e.g. places whose names are not unique, and required manual review.
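To give an idea of the approach, here is a minimal sketch of such a matching heuristic, assuming a gazetteer CSV with name and ags columns; the field names and normalization rules are hypothetical, and the actual Job Cube heuristic is more involved:

```python
import csv

def normalize(name: str) -> str:
    """Strip common prefixes so e.g. 'Stadt Bonn' also matches 'Bonn'."""
    for prefix in ("stadt ", "gemeinde ", "markt "):
        if name.lower().startswith(prefix):
            name = name[len(prefix):]
    return name.strip().lower()

def load_gazetteer(path: str) -> dict:
    """Index all AGS codes by normalized municipality name."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            index.setdefault(normalize(row["name"]), []).append(row["ags"])
    return index

def match_ags(name: str, index: dict) -> str | None:
    """Return an AGS code only when the name resolves unambiguously."""
    candidates = index.get(normalize(name), [])
    if len(candidates) == 1:
        return candidates[0]
    return None  # ambiguous or unknown: flag for manual review
```

Names shared by several municipalities resolve to more than one candidate and fall through to exactly the manual review mentioned above.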

After structuring the input data, our next step was to identify the current web address for each of the given public administrations. By combining information from Wikipedia with automated Google searches, we established a web address for over 95% of the entries, but as in the previous step, some required manual intervention. Starting from this list of websites, we crawled each site for keywords that flag pages likely to list the officials we were interested in.
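A stripped-down version of such a keyword filter could look like the following; the keyword list and threshold here are illustrative, not our production configuration:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative German keywords hinting at contact or staff listings.
KEYWORDS = ("ansprechpartner", "kontakt", "mitarbeiter", "bürgermeister", "amt")

def looks_like_contact_page(url: str) -> bool:
    """Fetch a page and score it by keyword hits in its visible text."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    hits = sum(text.count(keyword) for keyword in KEYWORDS)
    return hits >= 3  # threshold chosen for illustration only
```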

Training our NER

Based on our experience extracting contact data from job postings, we were able to jump-start the annotation and training process by iterating on an existing NER model. This freed up time to experiment with the conversion of websites to plain text, so that our model could benefit from the information encoded in the markup structure (e.g. tables and lists) and its context. As a result, we noticeably boosted our NER performance, reaching a satisfying F1 score of 88.25% after annotating only 400 scraping results.
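To illustrate what structure-aware conversion means here, below is a simplified sketch that keeps table rows and list items on dedicated lines, so labels and their values stay adjacent in the plain text; our actual converter handles more cases:

```python
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    """Flatten HTML so table rows and list items survive as single lines."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # Emit each table row as 'cell | cell | ...' so that e.g. a name and
    # the matching phone number end up on the same line.
    for row in soup.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["th", "td"])]
        lines.append(" | ".join(cells))
        row.decompose()  # avoid emitting the same text again below
    # Keep each list item on one line of its own.
    for item in soup.find_all("li"):
        lines.append("- " + item.get_text(" ", strip=True))
        item.decompose()
    # Everything else as plain text, one node per line.
    lines.append(soup.get_text("\n", strip=True))
    return "\n".join(lines)
```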

...
Given our small training set of 400 documents, our NER performs well across all entity types.
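For reference, here is a hedged sketch of how such a per-type evaluation can be run, assuming a spaCy pipeline; the model package name is hypothetical:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("contact_ner")  # hypothetical model package name

def evaluate(annotated):
    """annotated: iterable of (text, {"entities": [(start, end, label)]})."""
    examples = [Example.from_dict(nlp.make_doc(text), annotation)
                for text, annotation in annotated]
    scores = nlp.evaluate(examples)
    print(f"overall F1: {scores['ents_f']:.2%}")
    for label, s in scores["ents_per_type"].items():
        print(f"{label:>10}: P={s['p']:.2%}  R={s['r']:.2%}  F1={s['f']:.2%}")
```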

Review

We manually verified the extracted data in a final review process and supplemented it where the automatic pipeline failed to produce acceptable results. About 3,500 municipalities were checked, and more than 10,000 qualified contacts were found and approved. This quality assurance step lets us be confident that the output we deliver to our clients is complete, accurate, and of sufficient quality.

“The project started as a one-off mission, but we have outperformed all expectations regarding data quality and volume. So we are currently working on delivering updates and establishing a constant data flow into our clients' CRM. This might become the one-stop source for fully qualified public administration contacts.”

Lukas

Interested?

Get in touch to find out more about our scraping and natural language processing solutions.
