Easier data reuse and more flexible visualizations – what we in the developer team are working on

In the last few weeks, we – the developers here at Our World in Data – have done a lot of brainstorming and planning to flesh out the capabilities we want to add to the Our World in Data site. Some of these features will make it easier to reuse our data; others will make it easier to view data from different angles.

We are currently hiring for two technical roles to help us build what we describe below. If you are interested yourself, have a look at the two job profiles, or forward them to people you know who might be a good fit (the application deadline is Dec 5th). Thanks!

The improvements we are currently planning fall into three main topics:

Easier data reuse

If you are working with data yourself, then you are probably aware that on every chart on our site you can switch to the “Download” tab and download a CSV file with the underlying data. This has a few shortcomings. For one, only the data that our authors end up using in charts is easily accessible like this. But we have a much bigger catalog of more than 100,000 indicators in our internal database that we would like to open up. Ernst, our head of product design, likens this to a museum that only has a small part of its collection on display.

We are now working to make all this data available, and to do so in a form that is convenient for data scientists. We are creating a public index so that you can quickly discover whether we have data on a certain topic, and then fetch that data as a tidy data frame in a modern file format (like Apache Feather or Parquet) that is easy to consume from Python, R or Observable notebooks. You can already try out an experimental version of this index in Python using the owid-catalog-py package.
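To give a rough idea of the discover-then-fetch workflow, here is a minimal Python sketch of a searchable index. The entries, the field names and the `find` helper are all hypothetical; this is not the owid-catalog-py API, only an illustration of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    # All field names here are illustrative assumptions.
    namespace: str    # who produced the dataset
    table: str        # short table name
    path: str         # where the data file would live
    file_format: str  # e.g. "feather" or "parquet"


# A tiny stand-in for a public index; the entries are made up.
INDEX = [
    IndexEntry("example_source", "population",
               "example_source/population.feather", "feather"),
    IndexEntry("example_source", "child_mortality",
               "example_source/child_mortality.parquet", "parquet"),
    IndexEntry("other_source", "population_density",
               "other_source/population_density.feather", "feather"),
]


def find(term, namespace=None):
    """Return index entries whose table name contains `term` (case-insensitive)."""
    term = term.lower()
    return [
        entry
        for entry in INDEX
        if term in entry.table.lower()
        and (namespace is None or entry.namespace == namespace)
    ]


for entry in find("population"):
    print(entry.namespace, entry.table, entry.file_format)
```

Once a matching entry is found, its `path` and `file_format` would tell a client which file to download and how to read it.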

Reusable metadata

A large part of the work that our data team and our authors do is curating data and adding metadata. This is important because the data we collect comes from different sources: from large institutions like the World Bank and the WHO, but also from individual researchers. As a reader, when deciding how much to trust the data, it helps immensely to understand where it came from.

Harmonizing this data on a technical level, so it uses the same date formats and country names, enables joining all this data together. But only by also recording information about how this data was collected, and its limitations, can real insight be drawn from it. We are now standardizing the metadata that we are collecting and will always serve it alongside the data files in JSON format, so that all this curation work can be reused in addition to the data that we already reshare.
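As a sketch of what serving metadata alongside the data could look like for a consumer, the Python snippet below reads a small CSV together with a JSON metadata document. The field names (`title`, `unit`, `source`, `limitations`) and all values are assumptions made up for this example, not our actual metadata standard.

```python
import csv
import io
import json

# Made-up example data and metadata; the field names are assumptions.
DATA_CSV = """country,year,value
Exampleland,2019,4.2
Exampleland,2020,4.0
"""

METADATA_JSON = """{
  "title": "Example indicator",
  "unit": "%",
  "source": {"name": "Example Institution", "retrieved": "2021-11-01"},
  "limitations": "Illustrative values only."
}"""


def load_with_metadata(data_text, metadata_text):
    """Parse the CSV rows and the JSON metadata that is served next to them."""
    rows = [
        {"country": r["country"], "year": int(r["year"]), "value": float(r["value"])}
        for r in csv.DictReader(io.StringIO(data_text))
    ]
    metadata = json.loads(metadata_text)
    return rows, metadata


rows, metadata = load_with_metadata(DATA_CSV, METADATA_JSON)
print(f"{metadata['title']} ({metadata['unit']}), source: {metadata['source']['name']}")
print(f"{len(rows)} rows; limitations: {metadata['limitations']}")
```

Keeping the metadata in a separate, machine-readable document means the source and the known limitations can travel with the numbers into whatever tool is used downstream.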

[Figure: the download data option]

Richer data model

Another benefit of moving away from our closed internal database as the central data store is that we will be able to leverage richer data models. To understand why this is important, you should know that we currently bring all individual data points into one large MySQL table that has just 4 columns: Year/Date, Entity/Country, Variable, Value. 

This has worked well for us for a long time, since most of the data we are interested in is heavily aggregated, so country and year were good enough and kept things simple. But we now want to enable richer data models – our COVID-19 data effort already stretched the current model with the need for daily data, and so adding different granularities of time is one powerful change. But we also want to be able to break down critical indicators by sex or age group if the upstream data source provides this. 

Currently, when we want to include data like this we have to create new, independent variables. For example, deaths from smoking may end up becoming many variables like “Deaths – smoking – female – age 15-25” instead of just one with many dimensions. Authors then have to remember which variables to show next to each other in an article or chart. By making it possible to store additional dimensions other than year and country, we will be able to do this automatically and allow users to switch between levels of detail. 
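The difference between the two models can be sketched in a few lines of Python. The rows, the numbers and the naming pattern of the composite variables are made up for illustration; the helper names `to_dimensional` and `select` are hypothetical.

```python
# Current model: one row per data point in a four-column table
# (year, entity, variable, value). Every breakdown needs its own variable.
# All numbers are made up for illustration.
flat_rows = [
    (2019, "Exampleland", "Deaths – smoking – female – age 15-25", 10.0),
    (2019, "Exampleland", "Deaths – smoking – male – age 15-25", 14.0),
    (2019, "Exampleland", "Deaths – smoking – female – age 26-35", 21.0),
]


def to_dimensional(row):
    """Split a composite variable name into one variable plus two dimensions.

    Assumes the hypothetical naming pattern
    "<topic> – <cause> – <sex> – age <range>".
    """
    year, entity, name, value = row
    topic, cause, sex, age = name.split(" – ")
    return {
        "year": year,
        "entity": entity,
        "variable": f"{topic} – {cause}",
        "sex": sex,
        "age_group": age.replace("age ", "", 1),
        "value": value,
    }


def select(rows, **filters):
    """Keep only the rows whose dimensions match all the given filters."""
    return [
        row for row in rows
        if all(row[key] == wanted for key, wanted in filters.items())
    ]


# Richer model: a single variable with sex and age group as dimensions.
rows = [to_dimensional(row) for row in flat_rows]
female_total = sum(row["value"] for row in select(rows, sex="female"))
print(female_total)
```

With the dimensions stored explicitly, switching between levels of detail becomes a filter or an aggregation rather than a lookup of the right variable name.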

Drill down into the details

We are also planning to add support for hierarchies within dimensions so that we will be able to offer proper drill-down and drill-up in our charts. If you look at this chart on child mortality, you’ll see that it shows data for the entire world, split by continent. In the top-left corner, you can find the “Add country” button to change this selection and show individual countries.

The view you see initially is a good starting point, but it has two issues. First, it had to be manually configured this way. Second, if you click on the “Add country” button, the continents and individual countries are all shown as one long list, sorted alphabetically. In the future we will be able to automatically show different sections in the country selector for different groupings. We’ll also be able to do this for other dimensions like cause of death, so you can get a broad picture first and then dive into the details.
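One simple way to represent such a hierarchy is a parent pointer per entity. The following Python sketch uses a tiny, hand-picked set of entities and hypothetical helper names; it only illustrates how drill-down, drill-up and grouped selector sections could be derived from the same structure.

```python
# Hypothetical hierarchy for the entity dimension: each entity points to its parent.
PARENT = {
    "World": None,
    "Africa": "World",
    "Europe": "World",
    "Kenya": "Africa",
    "Nigeria": "Africa",
    "France": "Europe",
}


def children(entity):
    """Drill down: the entities directly below `entity`, sorted alphabetically."""
    return sorted(name for name, parent in PARENT.items() if parent == entity)


def drill_up(entity):
    """Drill up: the entity directly above `entity` (None at the top)."""
    return PARENT[entity]


def selector_sections(root):
    """Group entities into sections keyed by their parent, breadth-first."""
    sections = {}
    queue = [root]
    while queue:
        entity = queue.pop(0)
        kids = children(entity)
        if kids:
            sections[entity] = kids
            queue.extend(kids)
    return sections


for section, members in selector_sections("World").items():
    print(f"{section}: {', '.join(members)}")
```

The same structure would work for other dimensions, for example broad causes of death with more specific causes underneath.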

Visualizing uncertainty

Finally, we are planning to add metadata information to express the relationship between variables. 

One of the first areas where we want to use this feature is proper support for confidence intervals. Visualizing the uncertainty inherent in data or projections is very important, but at the moment we rarely do it because it poses all sorts of UI problems. By making our grapher understand these relationships, we’ll be able to use the visual hints for confidence intervals that are widely used in data visualization.
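A minimal sketch of this idea in Python: a piece of metadata declares which variables hold the lower and upper bounds of a central estimate, and a charting layer uses it to assemble the points for an uncertainty band. The variable names, the relationship keys and the numbers are all hypothetical.

```python
# Hypothetical metadata: which variables are the bounds of which estimate.
RELATIONSHIPS = {
    "estimate": {"lower_bound": "estimate_low", "upper_bound": "estimate_high"},
}

# Made-up values per variable and year.
VALUES = {
    "estimate": {2019: 5.0, 2020: 4.8},
    "estimate_low": {2019: 4.5, 2020: 4.1},
    "estimate_high": {2019: 5.6, 2020: 5.3},
}


def with_interval(variable):
    """Return the series for `variable`, with bounds attached if metadata links them."""
    related = RELATIONSHIPS.get(variable)
    series = []
    for year, value in sorted(VALUES[variable].items()):
        point = {"year": year, "value": value}
        if related:
            point["low"] = VALUES[related["lower_bound"]][year]
            point["high"] = VALUES[related["upper_bound"]][year]
        series.append(point)
    return series


for point in with_interval("estimate"):
    print(point)
```

Without the relationship metadata, the three variables would look like three unrelated lines; with it, a chart can draw one line and a shaded band around it.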

More flexible data visualization

The final area for technical improvements is our visualization tool. Some items on our roadmap in this area are technically pretty simple, but we think they will give our readers and authors interesting new capabilities. 

For one, we want to make the content of our grapher charts more flexible, for example by allowing authors to create slideshows of static images or to use other visualization libraries than our handcrafted grapher. Our existing chart drawing code works very well to provide a limited set of chart types with a lot of standardized features, but sometimes a more bespoke setup would be useful. One example of this is an idea that we are currently working on to visualize war casualties over several hundred years where we need to visualize conflicting estimates, different kinds of casualties etc. all over a very long time period.

Reusing our visualization tool

We also want to make our grapher easier to reuse in other projects. Currently, the code base of the charting component is quite entangled with the internal administration UI that is used to create charts, and it assumes many particularities about the infrastructure it runs on. We want to split this out so that the grapher becomes a separate NPM package that can be used in other projects. Since all our code is open source, we also hope that this will make it easier for others to contribute back to our charting system.

Open exploration & contextual information

With the advances described above, we also want to make the grapher more open to exploration – this means the users should be able to search for and visualize all the variables in our catalog. Our philosophy here is that we want our metadata to describe our data so well that generating high quality visualizations requires no further config. We hope that the ability to create arbitrary scatter plots or line charts will be another interesting look into our data collection.

When exploring data like this, it is important to understand what exactly you are looking at. Our authors spend a lot of time thinking about how best to explain critical concepts, such as “International constant $”, that need to be understood to interpret our charts correctly.

Currently, these definitions live either in prose in some part of our website or are squeezed into the subtitles of our charts. We want to experiment with a new additional canvas next to our charts that will be used to summarize important concepts that are needed to understand a specific chart. This might go as far as showing secondary charts or third party content, all with the aim to make sure that our readers draw valid conclusions from our content.

Finally, we are thinking of creating a new way for our authors to write their articles. We currently use a headless WordPress installation as our storage for articles, but our authors prefer to do the actual writing in Google Docs. In an ideal world, they would be able to stay in Google Docs and embed and configure interactive charts in that same window, without the detour through WordPress. This would give them a more seamless editing experience and less friction in their daily work.

Closing thoughts

As you can see, we have many ideas in the pipeline and are excited to be working on them. Thanks to the many donations we receive, we are able to grow our technical team and increase the speed at which we are working on this.

If you are interested in joining us or contributing in some other way, then please get in touch! Likewise, if you’re interested in being an early adopter of our technical work, please subscribe to our beta mailing list.