Script: Downloading CRS data #33
Reference: WASHWeb/WASHWeb-2019#33
Similar project: https://github.com/datasets/dac-and-crs-code-lists
OK, quick sniff around and I can see that things are not very easy to smoke out, but I expected that 😄 So, the OECD API seems to be what we need: https://data.oecd.org/api/sdmx-json-documentation, giving, for example, the CRS dataset straight off the net: https://stats.oecd.org/sdmx-json/data/CRS1/all/all?startTime=2018 (where CRS1 is the dataset code I pulled from https://stats.oecd.org/Index.aspx?DataSetCode=CRS1). That download is really slow though, so we might consider some secondary packaging system with less latency. Looking around for libraries, there seems to be this pandasdmx package, which supports interfacing with the OECD API directly: https://pandasdmx.readthedocs.io/en/v1.0/sources.html#oecd-organisation-for-economic-cooperation-and-development. I'll get stuck into this in the coming days.
Let's consider just using the CSV downloads. I spent a bit of time looking, but I think SDMX is perhaps adding a layer more of complexity than is useful. We probably want to go there, but not necessarily now for this data flow.
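For reference, the SDMX-JSON endpoint mentioned above is just a URL pattern, so no SDMX library is strictly needed to fetch it. A minimal sketch of building that query URL (only the dataset code, the "all/all" filter, and startTime come from the thread; the rest is an assumption about how you'd wire it up):

```python
# Sketch: build the OECD SDMX-JSON data URL, matching the example
# https://stats.oecd.org/sdmx-json/data/CRS1/all/all?startTime=2018

def sdmx_json_url(dataset_code, start_time,
                  base="https://stats.oecd.org/sdmx-json/data"):
    """Return the SDMX-JSON data URL for an OECD dataset, all dimensions."""
    return f"{base}/{dataset_code}/all/all?startTime={start_time}"

# To actually download (slow, as noted in the thread):
#   import requests
#   payload = requests.get(sdmx_json_url("CRS1", 2018)).json()
```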
See the python scraper I shared, which gets the main txt files from their file server and unzips them; they are in the Dropbox folder too. I used them to combine CRS and GLAAS data on a recent project.
On Tue, Jun 9, 2020, 15:46 decentral1se notifications@github.com wrote:
OK, makes a lot of sense, will do!
@nickdickinson hey, I've taken a dive into this again and actually, there isn't much we need to do. The https://github.com/datasets/dac-and-crs-code-lists people actually publish their data to https://datahub.io/core/dac-and-crs-code-lists, which can be manipulated directly using the https://github.com/frictionlessdata/datapackage-py library, which comes from the open-standards people of https://frictionlessdata.io!
For example, we can download and manipulate CRS data with only the following lines https://datahub.io/core/dac-and-crs-code-lists#python. That means we don't need https://github.com/WASHNote/WASHWeb/issues/36 because the data is already packaged in this open format which is supported by tooling like https://pandas.pydata.org (you can directly load the dataset into a Pandas frame!)
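A minimal sketch of that load, assuming datapackage-py's keyed-row shape (a list of dicts per resource, as `resource.read(keyed=True)` returns); the package URL is the one from datahub.io, but the network call is left as a comment and the resource index is an assumption:

```python
import pandas as pd

def rows_to_frame(rows):
    """Turn keyed rows (list of dicts, the shape datapackage-py's
    resource.read(keyed=True) returns) into a pandas DataFrame."""
    return pd.DataFrame.from_records(rows)

# Sketch of the real call (network required; which resource holds the CRS
# purpose codes is an assumption to verify against the datapackage.json):
#   from datapackage import Package
#   package = Package("https://datahub.io/core/dac-and-crs-code-lists/datapackage.json")
#   frame = rows_to_frame(package.resources[0].read(keyed=True))
```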
The tools to download the data and put it into structured form are already built :) Furthermore, we could easily follow this model for the other datasets we want to cover.
So I'd propose that we think about https://github.com/WASHNote/WASHWeb/issues/35 (perhaps closing this and https://github.com/WASHNote/WASHWeb/issues/36) and skip directly to what we want to learn from this dataset. I haven't looked at the open data ecosystem for a while, but it seems to have matured.
This is great. Let's explore. The idea of the export views is to be able to generate a new "view" (in the SQL sense) of this dataset with only the WASH sector codes, sorting the donors and creating WASHWeb org IDs. That can then be used to start populating Wikidata: for example, adding donors and their USD disbursements for a certain year to Wikidata. Clearly we also need to map orgs between Wikidata and CRS.
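That view could be sketched in pandas along these lines. Everything here is an assumption for illustration: the column names, the 140xx purpose-code filter as a stand-in for "WASH sector codes", the org-ID scheme, and the sample figures are not real CRS data:

```python
import pandas as pd

def wash_view(crs):
    """Filter CRS rows to WASH (here assumed to be purpose codes 140xx),
    sum USD disbursements per donor, sort donors, and assign a
    hypothetical WASHWeb org ID to each."""
    wash = crs[crs["purpose_code"].str.startswith("140")]
    view = (wash.groupby("donor", as_index=False)["usd_disbursement"]
                .sum()
                .sort_values("donor")
                .reset_index(drop=True))
    view["washweb_org_id"] = ["WW-%03d" % (i + 1) for i in range(len(view))]
    return view

# Illustrative input, not real CRS figures:
crs = pd.DataFrame({
    "donor": ["Donor A", "Donor B", "Donor A", "Donor C"],
    "purpose_code": ["14030", "14030", "11110", "14032"],
    "usd_disbursement": [1.5, 2.0, 9.9, 0.5],
})
```

The resulting frame is one row per donor with summed WASH disbursements and a stable ID column, which is the shape you would want when pushing donors and their amounts into Wikidata.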
On Mon, Jun 15, 2020 at 9:49 PM decentral1se notifications@github.com
wrote:
--
Nicolas Dickinson
WASHNote
Oh, I took a quick pass in the end today :)
And look mah, no infrastructure required!
Just hit the following URL and you can already see a "Hello, World" example:
We can prove out exactly what we need, starting from these notebooks.
Speak tomorrow 🌈