How to use OpenRefine to calculate aggregates on facets

Imagine if you could trace 50 years of African migration to the four corners of the world? How many women and men left their homes heading for pastures anew in each decade?

Africa Migration and Brain Drain (how I navigated a data nightmare)

Imagine if you could trace 50 years of African migration to the four corners of the world? How many women and men left their homes heading for pastures anew in each decade? Where did they go? What did they study? These and a multitude of other questions come to mind when we start looking back… Since we are talking about going back to the 1960s, a decade of mass decolonisation on the African continent, we are witnessing the rise of resistance movements across the continent. In South Africa, the draconic apartheid legislation that governed the lives of people of colour, saw many choose exile over oppression. How has that movement changed over the decades and has the flow of money from - and to - those who left their homes had an effect?

Finding the data, on the face of it, seemed simple enough as there are just three international organisations - the United Nations, the World Bank and the Organisation for Economic Cooperation and Development (OECD) that can have that accurate data, and, since I had already downloaded all the relevant data it seemed like analysing it would be a walk in the park. Except I had not factored in just how dirty the data would be.

Over the next three days I managed to manipulate and massage the data from the three sources into a machine-readable format that would allow me to analyse it. What emerged was a treasure trove of story ideas: the data had done the heavy lifting and delivered lots of ideas ready for us to do the journalism. For example, did you know that of all African countries South Africa has the highest percentage - 51 percent - of female migrants to OECD countries in 2010-2011 considered as “highly educated”? The data also revealed that Africa has the world’s highest brain drain to OECD countries, followed by Latin America. Staggeringly, one in nine people in Africa with tertiary diplomas left the continent to pursue their lives and careers in an OECD country. Over the last decade, the intellectual elite that were lost in the continent is comprised of: more than 83 400 life sciences professionals, 6 000 physical and mathematical engineers, more than 127 000 teaching professionals, 12 000 lawyers and 95 000 general managers left Africa, with the highest numbers being from South Africa.

So, how did I find and clean the data

I was aware of Migrants Flows, but this was not exactly what we wanted to show. Instead I downloaded data from the World Bank (WB), United Nations (UN) and OECD that I thought could be relevant. The WB had data about the remittance of money from January 2010 until April 2015 and a “Global Bilateral Migration” dataset which consists of migrations per decade from 1960 to the early 2000s and is categorised by gender. Fortunately it is a reasonably clean and well-organised dataset. However (and here starts the nightmare), the data format was changed for later years. I then poked about on Google and came across this page and found lots of useful information, including the Bilateral Migration for 2010 and 2013. But this data is measured by the “total stock” of migrants in the different countries. Meaning that I now had two different methods of counting to deal with - rather than a flow per decade, I had totals contained in 219 rows and 218 columns. Useful if you want to see one result, not adequate if you want to see more than that. So I deleted the first column and removed the merged cells in order to have one simple worksheet. Using OpenRefine’s transpose function I got a single column for country of origin and country of destination. Yay!

The next step was to identify the migration between African countries, from Africa to the rest of the World and from the rest of the continents to other countries in Africa. I also needed a column to identify the country of origin and destination. So I used an easy Excel formula:

Identify the country of origin

=IF( OR(A2="Burundi", A2="Comoros", A2="Djibouti", A2="Eritrea", A2="Ethiopia", include all the Africa’s countries names"), "AFRICAN COUNTRY", "NON AFRICAN COUNTRY")

I did this by listing African countries in a column, and “A2=” prefix in another, and then used the “concatenate all” function in one row to get the formula with few clicks. The next step was to join both destinations and identify the traffic:


The United Nation’s dataset of Migrant Stock was not in great shape either

I repeated the process of cleaning - getting rid of the header and deleted categories such as sub-Saharan and less developed. I removed the merged cells and for each tab I created a column on the left to identify what it referred to; “Total”, “Female”, or “Male” migrant stock in 1990, 2000, 2010 and 2013. One last little niggle: In some cases the UN data for some countries included refugees, which the other two data sets did not have. So I compared how different they were using Tableau:

I decided to use the UN data for the final results as it most complete for our needs. I then repeated the cleaning process with the remittance data (one per year since 2010):

I followed the same rinse-repeat formula for professions in the OECD countries. However, in this case there were 14 documents. I deleted the header and turned the occupation field into a column at the left to identify it from the rest and then joined it with a macro and repeated the process with OpenRefine.

And this is what I eventually ended up with:

A final thought This initially seemed to be a pretty straight forward task but the more I discovered, the more complex it became. The angles I had identified upfront were no longer the same angles I wanted to work with, midway through, but being flexible is part and parcel of being a journalist. I know it can be ambitious and sometimes intimidating, but let’s be honest in this era of data journalism is not enough to simply have access to the data, we will need to demand the data in usesable and standardized formats. Yes, maybe not now, just one step at the time, but let’s start thinking that way. If journalists don’t check the data, who will once it is out on the public domain? Let’s not talk about South African public institutions’ “opendata”- because it is not always open and is not always complete data. Please contact me if something does not make sense to you or if you came across this post and want to have access to the data before we upload it - ☺

Related links