Anders and I work on OpenTED once a week in the wee small hours of the morning. During these sessions I tend to work on the code we use to scrape, dump, parse and analyse the TED documents.
This morning I finally got around to cleaning up and publishing the Python scraper that dumps TED document tabs as raw HTML. The TED website uses sessions and serves a language selection page the first time you access it. The scraper uses Mechanize to automatically select the English language. The Mechanize browser session is then reused, so subsequent dumps can access the documents directly.
The code can be found in our GitHub repository.
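For readers who just want the gist of the session trick, here is a minimal sketch. It is not the published scraper: it swaps Mechanize for the standard library's cookie jar, and it assumes that requesting a language-selection URL once is enough for the server to remember the choice in a cookie. The function names and any URLs you pass in are placeholders of my own.

```python
import http.cookiejar
import urllib.request


def make_session():
    """Build an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def opener_fetch(opener):
    """Wrap an opener as a simple url -> HTML text function."""
    def fetch(url):
        with opener.open(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    return fetch


def dump_documents(fetch, language_url, doc_urls):
    """Hit the language page once, then dump each document with the same session.

    `fetch` is any callable mapping a URL to HTML text. Passing it in keeps
    the sketch independent of the network. `language_url` is a hypothetical
    URL that selects English (an assumption, not the real TED endpoint).
    """
    fetch(language_url)  # first request: the server stores the language choice
    return {url: fetch(url) for url in doc_urls}
```

In real use you would call `opener, _ = make_session()` and pass `opener_fetch(opener)` as `fetch`, so every request shares the same cookies.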
At OpenTED we have been wondering to what degree contracts between awarding authorities and tenderers can actually be published. For this reason we have decided to file a few requests for access to documents in order to find out to what extent the actual contracts are available from the European Union.
We have made five requests for contracts across a series of EU agencies, each referencing a specific case. We will publish updates and the results here as the cases progress. Below you will find a sample of one of the requests filed:
Dear Foreign Policy Instruments Service (EEAS),
Delegation of the European Union to the United States of America
2175 K Street, NW
For the attention of: Zoran Pesevski/Laurence Moreaux
Under the right of access to documents in the EU treaties, as
developed in Regulation 1049/2001, I am requesting documents which
contain the following information:
The full length contract including information on:
a) value of the contract
b) length of the contract
c) descriptions and annexes of the
the tender awarded under the title:
US-Washington DC: production, storage and delivery of EU branded
OJEU: 2012/S 236-387761
Last weekend a group of coders, data wranglers and journalists got together for the Open Interest Hackdays. As one of the three participating projects, OpenTED saw some amazing contributions.
Miha worked on a cleaner to identify name variations among awarding authorities. This will be extremely useful, as it will help establish the accurate number of contracts awarded by municipalities or agencies that do not use consistent names when submitting to the register. It should also be possible to use the cleaner to identify contract award winners in a similar way.
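To give an idea of what such a cleaner does, here is a rough sketch. It is not Miha's code, just one simple approach: reduce each name to a comparison key by stripping accents, case and punctuation, then group names that share a key. The function names and the sample names in the comments are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce a name to a comparison key: no accents, case or punctuation."""
    text = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "é" compares equal to "e".
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a basic letter or digit with a space.
    text = re.sub(r"[^a-z0-9]+", " ", text.casefold())
    return " ".join(text.split())


def group_variants(names):
    """Map each comparison key to the set of spellings that produced it."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return dict(groups)


# Hypothetical example: these three spellings end up in one group.
# group_variants(["Région Île-de-France", "REGION ILE DE FRANCE",
#                 "Région Ile-de-France."])
```

This only catches spelling noise. Letters without an accent decomposition (such as "ø") and non-Latin scripts are not handled, and abbreviations or translated names would need fuzzier matching on top.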
Benjamin and Martin wrangled and geo-located contract data and made it into a nice visualization.
Callum and Rufus wrote a Node.js scraper to parse contract award winners and contract amounts, as well as a Python extractor for uncompressing HTML files. They also initiated the transfer of the compressed HTML dump from the server to S3.
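An extractor of this kind can be very small. The sketch below is not Callum and Rufus's code. It assumes the dump consists of gzip-compressed files named `*.html.gz` in a single directory, which is my assumption and may not match the actual dump format.

```python
import gzip
import shutil
from pathlib import Path


def extract_all(src_dir, dest_dir):
    """Decompress every *.html.gz in src_dir into dest_dir.

    Returns the list of paths written. The .html.gz naming is an assumption.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(Path(src_dir).glob("*.html.gz")):
        target = dest / archive.name[:-3]  # strip the ".gz" suffix
        # Stream the data so large files are never held fully in memory.
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```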
The Hackdays also led to useful suggestions on how to make the HTML dumps on the OpenTED server more accessible. In the coming days we will be working to provide a zip file of the unstructured HTML dump, which will allow anyone to download the data locally.
For a full recap of all events at the Open Interest Hackdays check Datadrivenjournalism.
At the Hackdays in London, organized by OKFN and EJC, a serious attempt was made to write the scrapers for retrieving data from the TED contract awards on a) the winning companies and b) the award amounts.
While we still cannot provide a final tally of how many of the 700,000 downloaded contract awards lack an amount, it is quite clear that a substantial share does.
For this reason, allow us to briefly review what the relevant EU regulation stipulates with regard to the publication of contract awards.
The Guide to the Community Rules on Public Procurement of Services other than water, energy, transport and telecommunications sectors (DIRECTIVE 92/50/EEC) says the following about contract awards (my bold):
4.2.2 Contract award notices
As a general rule, contract award notices must be sent to the Office for Official
Publication of the European Communities. The notice will be published in the case of
public contracts for services listed in Annex IA to the Services Directive. In the case of
contracts involving only Annex IB services, the award notice will be published only if the
contracting authority has indicated its agreement.
However, as an exception to this
general rule, publication is not necessary if:
- it would impede law enforcement;
- it would be contrary to public interest;
- it would prejudice the legitimate interests of a particular enterprise,
public or private;
- it might prejudice fair competition between service providers.
OpenTED will go on the road in November to discuss how to open up procurement data with journalists and NGOs.
Over the next month we will make it to:
- 7-10 November: Hackathon at @15 IACC in Brasilia. We will look at access to procurement data across countries and will also show some of the data we have scraped from the European tender register (TED).
- 16-17 November: European Investigative Journalism conference in Antwerpen. We will present some of the data currently available and discuss how procurement data can be used in investigative journalism.
- 24-25 November: @OKFNlabs is organizing a hackathon on Open Interests Europe focusing on data from the EU transparency register. We will try to see how procurement data from the TED register can be matched up with lobbyist data.
If you are attending any of the events and want to meet up, ping us.