Learn how to Source & Clean data

A key ability for anyone who wants to communicate insights through data storytelling is to be able to find, extract and transform relevant data into machine readable format.

Share:

Data has a lot that it can offer us. And learning the right technical skills to work with data in order to extract the value it contains is imperative, not only to leverage the resource, but to becoming a resource unto the data itself - a resource that has the ability to unlock its potential.

In today’s world, information is being transformed into data. Sometimes, relevant data is not readily available, and we need to understand the basics of sourcing data. A key ability for anyone who wants to communicate insights through data storytelling is to be able to find, extract and transform relevant data into machine readable format.

Sourcing and scraping data is the very first step of any data process, or Data Pipeline. Through data sourcing we are exposed to the many forms in which data is packaged and made accessible. Only once relevant data has been sourced and extracted into machine readable format, can it then be cleaned and then analysed.

When we source data, we are often accessing an unfamiliar dataset, which brings to the fore some critical questions that need to be answered in order to fully understand the limitations of the dataset, and how to accurately work with, and quote from the data source. For example,

  • Who collected the data?
  • What methodology was used?
  • What do all those columns mean?
  • Is it a trusted source?
  • When was the data collected - at one time, or over a period of time?
  • Does the source have any particular interest?
  • What was excluded?
  • Where was it published?
  • Was it raw or processed?

In finding the answers to these questions, we build up a “Data Profile” which includes the metadata (data about the data), and ensures that we bear any limitations in mind as we work with the dataset. This Data Profile provides us with confidence that we fully understand the data.

We use data to share insights with our networks in the form of visualisations like graphs, charts, maps, timelines etc. But data is worthless in propelling us toward our end goal if we have not acquired the necessary skills to structure the data into the best format for analysis.

In order to identify the trends, patterns or anomalies that provide us with factual insights that we can use to base decision making on, we need to analyse the data. However we can’t analyse dirty data, we need to clean it first. Data cleaning is one of the most important parts of data storytelling, and it requires a lot of caution. This stage of data processing has the highest level of human interaction with the original data, so we need to be very careful to avoid making errors through a systematic approach, and the use of best practices.

About the course

In February 2018, TrainUp, OpenUp’s training arm, will be running a two day short course on how to Source and Clean data for storytelling. This course is comprised of the following modules:

  • Mastering Google-Fu (how to conduct advanced google searches)
  • Metadata & the Data Biography
  • Sources of Data
  • Fundamentals of Data Cleaning
  • Data Cleaning with Spreadsheets
  • Data Cleaning with OpenRefine

This two day short course will be hosted in Johannesburg and in Cape Town, click here to see the invitation for applications or complete the application to attend.

CSOs, book your free seats!

As part of our Open Intake, we will be offering two free seats in our Source & Clean for data storytelling short course. If you work for a non-profit organisation, complete this application to qualify for a free seat in Johannesburg or Cape Town.

For any further questions, please get in touch with us via email at [email protected]

Share: