Big Data for Development

Context of the course

Introduction

As part of the capacity-building pillar of the Big Data for Development project, AIMS-NEI designed a training program based on Big Data for Development (BD4D-SCP), taught across the AIMS-NEI network: first in Rwanda, now in Senegal, and soon in Cameroon.

The course is aimed at people with a passion for data science in general, and for big data analysis and processing in particular, who have at least four years of undergraduate study or two to three years of professional experience in statistics or another field related to data science.

A number of short-term training courses are underway to achieve the BD4D project goals of increasing the number of users of scientific data in Africa and providing a platform for practitioners to interact.

Also as part of capacity development, AIMS-NEI will organize its first training workshop for senior executives, titled Harnessing the Power of Big Data (LPBD). The workshop aims to introduce executives to the era of Big Data by demonstrating how it disrupts traditional businesses and opens the door to new products and services.

Information for Candidates

Selection Process

Course Overview

Datasets keep growing as the world’s population grows and more devices become connected. Traditional data processing software and techniques cannot handle datasets at this scale. This course teaches the essentials of processing large-scale datasets using Python.

In addition, the course teaches how to perform common computing tasks, such as managing data and building machine learning models, with Python. It takes a hands-on approach to equip participants with the most essential tools in a timely manner.

The course emphasizes learning through practice, so it includes many exercises to give participants enough time to practice.

Approach

This course takes a hands-on approach to equipping participants with the most essential tools in a timely manner. Classes start with the fundamentals of Python, focusing primarily on data structures, then move quickly to the major Python libraries for data science.

Next, the course moves on to big data processing, first introducing brief theoretical concepts on the subject and then teaching Apache Spark, an advanced tool for processing large datasets. Afterwards, it offers introductory machine learning lectures before moving on to a detailed explanation of how to implement these algorithms in Python.

Course Objectives

  1. Understand the advanced concepts of the Python language: data structures, functions, classes, etc.
  2. Perform computing tasks on data using Python: data ingestion, processing, visualization, web scraping, etc.
  3. Process a large-scale (20 GB+) dataset on a personal computer using Apache Spark, and use cloud computing platforms.
  4. Become familiar with the theoretical foundations of common machine learning algorithms.
  5. Build and evaluate machine learning models using the scikit-learn library.

Course Schedule

Day 1: Advanced Concepts in Python. On the first day, the course focuses on the Python programming language to build a solid foundation for the rest of the course material. Participants are introduced to practical techniques at an intermediate to advanced level, such as writing functions and classes, error handling, packaging Python code, and more.
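To give a rough sense of the level targeted on Day 1, here is a minimal sketch that combines a class, functions, and error handling. It is an illustration only, not course material: the names SensorReading, parse_reading, and load_readings, as well as the sample data, are hypothetical.

```python
class SensorReading:
    """A single measurement; the class and its fields are hypothetical."""

    def __init__(self, sensor_id, value):
        self.sensor_id = sensor_id
        self.value = value


def parse_reading(line):
    """Parse a 'sensor_id,value' line into a SensorReading.

    Raises ValueError if the line does not have exactly two fields
    or if the value is not a number.
    """
    parts = line.strip().split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 2 fields, got {len(parts)}: {line!r}")
    sensor_id, raw_value = parts
    # float() itself raises ValueError on non-numeric text
    return SensorReading(sensor_id, float(raw_value))


def load_readings(lines):
    """Return (readings, errors), collecting bad lines instead of crashing."""
    readings, errors = [], []
    for line in lines:
        try:
            readings.append(parse_reading(line))
        except ValueError as exc:
            errors.append(str(exc))
    return readings, errors


# Made-up sample input: two valid lines and two malformed ones
sample = ["s1,20.5", "s2,not-a-number", "s3,18.0", "broken line"]
readings, errors = load_readings(sample)
print(len(readings), "valid readings;", len(errors), "rejected")
```

Catching only ValueError, rather than every exception, keeps unexpected bugs visible while still tolerating malformed input.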

Day 2: Python for Data Science. Day 2 focuses on performing common data science tasks using Python. We’ll explain how to ingest, process, analyze, and visualize data, how to do web scraping, and more, while introducing the essential packages for these tasks (Pandas, Geopandas, Numpy, Matplotlib, etc.).

Day 3: Big Data Handling. On the third day, the course covers handling large datasets using Python.

Topics include an introduction to Big Data, multiprocessing in Python, Apache Spark, and the use of common cloud platforms.

Day 4: Machine Learning (ML) in Python. On the fourth day, the course begins with an introductory lecture on machine learning. The remainder of the day is spent completing various ML tasks (e.g., data preparation, model building, evaluation, and interpretation) using the scikit-learn package in Python.

Day 5: Putting It All Together. On the last day, we will apply the skills learned in this course to real-world data science problems by examining case studies.

Potential case studies include how to process nighttime satellite images (geospatial data), how to process large volumes of cellphone call records (mobile data), and how to create ML models to impute missing sensor data (sensor data).

Prerequisites

Programming: ability to write a simple program in Python (basic Python level).

Maths and Statistics: training in statistics, data science, or the quantitative sciences.
