
STEP 3: DATA COLLECTION, SELECTION, AND PREPARATION
Is there a language selection decision to be made?
Co-creating AI/ML programs involves taking local languages and dialects into account in order to improve representation and reduce bias. Data in local languages may be scarce because most written records serve formal, legal, and political purposes rather than capturing informal speech, so they often fail to represent the entire population. Moreover, natural language processing (NLP) systems often struggle to recognise different accents and variations in spoken language, which can lead to a lack of representation and inclusion. Acknowledging this shortcoming and integrating a wider range of data, together with translation or voice software, can help mitigate this bias.
Please find below a legend of what can be found within the framework:
📚Resources - e.g. reports, articles, and case studies
🛠Tools - e.g. guidelines, frameworks and scorecards
🔗Links - e.g. online platforms, videos, hubs and databases
❌Gap analysis - tools or resources are currently missing
👥 List of stakeholders who should be included at the specific decision point
-
👥General public, technical experts, linguists, software developers, other private sector actors
🛠 Mozilla Common Voice - An initiative to make voice recognition technologies better and more accessible for everyone. Common Voice is a massive global database of donated voices that lets anyone quickly and easily train voice-enabled apps in potentially every language
🛠Masakhane machine translation service for African languages - A grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. The organisation works on several multilingual NLP efforts, including a translation tool for a range of African languages
🛠OpenAI API - OpenAI’s API provides access to GPT-3, which performs a wide variety of natural language tasks, and Codex, which translates natural language to code
-
📚🛠 Big Data for Social Good Digital Toolkit - This document outlines the key technical considerations to address when working with Mobile Big Data. It is intended primarily for a non-technical audience: any stakeholder currently involved or interested in a project involving Mobile Big Data. This could include government agencies, non-governmental agencies, charities or institutions working in the development sector, or commercial third parties