Ayoub Fakir

About

Fri, 09 Apr 2021 00:00:00 +0000

Hi! I’m Ayoub, a Senior Data Engineer absolutely passionate about data technologies. I mainly work on Distributed Systems (Hadoop / Kubernetes / Cloud Technologies), Functional Programming (Haskell / Scala / Clojure), Rust, and Blockchain Technologies (Ethereum / Bitcoin / Hyperledger and, more recently, the Polkadot ecosystem).

I do consulting as well as teaching (Paris 12 University).

You can contact me to say Hi, to talk about your projects or to hire me: ayoub[at]fakir.dev

[FR] Moving from EMR to Kubernetes for Spark workloads

Thu, 18 Feb 2021 04:26:07 +0200

Introduction

AWS EMR is a widely used AWS service, mainly for processing massive data sets with Apache Spark on a dedicated Hadoop cluster. Beyond its main function, EMR ships with a good number of open-source tools, some for monitoring (Ganglia) and others for querying data (Hive). More information can be found here. Depending on the context, EMR can be used either as an ephemeral cluster (for example, by launching a cluster every 6 hours to run Spark jobs) or as a permanent cluster. The latter is the case notably when the cluster is shared by several teams, runs streaming jobs, or when waiting for it to start up costs more than leaving it running permanently. This article is not necessarily meant to compare EMR with Kubernetes, since the two do not address the same needs. Kubernetes is gaining more and more ground today for a variety of reasons, and Spark natively supports Kubernetes as a scheduler and resource manager, so it would have been a shame not to look into it.

[EN] Migrating from a plain Spark Application to ZparkIO

Fri, 16 Oct 2020 10:36:00 +0200

Migrating from a plain Spark Application to ZIO with ZparkIO

In this article, we’ll see how you can migrate your Spark Application to ZIO and ZparkIO, so you can benefit from all the wonderful features that ZIO offers, which we’ll discuss along the way.

What is ZIO?

ZIO is defined, according to the official documentation, as a library for asynchronous and concurrent programming that is based on pure functional programming. In other words, ZIO helps us write type-safe, composable and easily testable code, all while keeping it safe and free of side effects. ZIO is a data type. Its signature, ZIO[R, E, A], shows us that it has three parameters:
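As a rough mental model, a ZIO[R, E, A] value can be thought of as a function R => Either[E, A], where R is the environment the effect needs, E is the type it can fail with, and A is the type it succeeds with. The sketch below illustrates that idea with a simplified stand-in; `Effect`, `Config` and `divide` are hypothetical names made up for illustration and are not part of the ZIO library, which is far richer (asynchrony, fibers, resource safety).

```scala
object ZioModelDemo {
  // Simplified stand-in for ZIO[R, E, A]: NOT the real library type.
  // R = required environment, E = failure type, A = success type.
  final case class Effect[-R, +E, +A](run: R => Either[E, A]) {
    // Transform the success value, leaving failures untouched.
    def map[B](f: A => B): Effect[R, E, B] =
      Effect[R, E, B](r => run(r).map(f))

    // Sequence a second effect that depends on the first result.
    def flatMap[R1 <: R, E1 >: E, B](f: A => Effect[R1, E1, B]): Effect[R1, E1, B] =
      Effect[R1, E1, B](r => run(r).flatMap(a => f(a).run(r)))
  }

  // Hypothetical environment used only for this example.
  final case class Config(divisor: Int)

  // Needs a Config, may fail with a String, succeeds with an Int.
  def divide(n: Int): Effect[Config, String, Int] =
    Effect[Config, String, Int] { cfg =>
      if (cfg.divisor == 0) Left("division by zero")
      else Right(n / cfg.divisor)
    }

  def main(args: Array[String]): Unit = {
    val program = divide(10).map(_ + 1)
    println(program.run(Config(2))) // Right(6)
    println(program.run(Config(0))) // Left(division by zero)
  }
}
```

Nothing happens until `run` is called with an environment, which is what makes such a description easy to test: the same `program` can be run against different `Config` values.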

[EN] Building a CI/CD pipeline for a Spark project using Github Actions, SBT and AWS S3 — Part 2

Wed, 29 Apr 2020 13:01:24 +0200

In the first article of this series, we talked about how we can set up a CI/CD pipeline for a Spark project using GitHub Actions, with SBT as the build tool and S3 for deployment. Once pushed to the [master] branch of our project on GitHub, our code triggered an SBT build command to generate a fat jar, which was then pushed to the chosen S3 bucket.

However, this pipeline still lacks any conditional logic: for instance, it cannot check whether the version of the jar we’re uploading already exists on S3.
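To make the missing check concrete, here is a minimal sketch of the decision itself, kept as a pure function so it does not depend on any AWS client. Everything in it is an assumption made for illustration: the names `JarVersionCheck`, `jarKey` and `alreadyPublished` are hypothetical, the jar is assumed to be named `<name>-assembly-<version>.jar`, and the set of existing object keys is assumed to have been listed from the bucket by some other step.

```scala
object JarVersionCheck {
  // Assumed naming convention for the fat jar: <name>-assembly-<version>.jar
  def jarKey(name: String, version: String): String =
    s"$name-assembly-$version.jar"

  // True when a jar for this name and version is already among the bucket's keys.
  // `existingKeys` is assumed to be fetched elsewhere (e.g. by listing the bucket);
  // a key may carry a folder prefix such as "jars/".
  def alreadyPublished(existingKeys: Set[String], name: String, version: String): Boolean = {
    val key = jarKey(name, version)
    existingKeys.exists(k => k == key || k.endsWith("/" + key))
  }

  def main(args: Array[String]): Unit = {
    val keys = Set("jars/my-app-assembly-1.0.0.jar")
    println(alreadyPublished(keys, "my-app", "1.0.0")) // true: skip or fail the upload
    println(alreadyPublished(keys, "my-app", "1.1.0")) // false: safe to upload
  }
}
```

A pipeline step could use the result to skip the upload or fail the build instead of silently overwriting an existing version.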

[EN] CI/CD pipeline using Github Actions, SBT and AWS S3 - Part 1

Wed, 08 Apr 2020 04:35:59 +0200

GitHub now allows us to build continuous integration and continuous deployment workflows for our GitHub repositories thanks to GitHub Actions, on almost all GitHub plans.

In this tutorial, we’re going to walk through building a CI/CD pipeline for a Scala / Spark project. We will be using SBT, the Scala Build Tool, to produce a jar that we’ll then deploy to AWS S3 using a custom GitHub Action.

[EN] On Minimalistic Teaching

Thu, 06 Feb 2020 10:00:37 +0200

First…

Education systems around the world face many challenges today, at every level. That said, the “education problem” is a huge subject, so we can only solve it by addressing small problems one at a time; the sum of these solutions may lead us to solving the bigger issue. For instance, one issue in higher education is that a teacher is typically either an academic or a professional: the former has a theoretical focus (and does not teach students how to tackle real-world problems based on what she teaches them), whereas the latter is more focused on practical applications (and might lack the pedagogical tools or know-how). By tackling seemingly small problems like this one, we can hope to find small solutions, and together these put us on the path to solving the bigger problem: how can we give tertiary students what they truly need to succeed in their post-graduation lives?

[FR] Bitcoin Explained to My Mother

Thu, 08 Nov 2018 04:26:07 +0200

Today, my mother told me that one of her primary-school pupils had talked to her about a great “revolution” called Bitcoin. “But what is this thing that is going to kill the banks?” she wondered.

That is exactly why I decided to explain Bitcoin to my mother through this article, as well as to all the moms who might read it!

You see, Mom, the vast majority of those who know the principles behind Bitcoin are antisocial geeks who speak only binary and look down on everyone who doesn’t know it; I was one of them at one point, before realizing that a tie suited me just fine too!

[EN] 10+ Great Books for Functional Programming in Scala

Fri, 17 Mar 2017 05:47:36 +0200

This article was co-authored by Matthew Rathbone

image by Thomas Leuthard

James Gosling, creator of Java, said:

“If I were to pick a language to use today other than Java, it would be Scala.”

Scala is a hot language in software development today. It is used by a range of start-ups for application development and has been adopted as the unofficial language of big data software development thanks to frameworks like Spark. As a language it is less verbose than Java, and it has a number of distinctive features that make it more flexible too. Scala is functional, object-oriented, and truly multi-threaded, so it provides a unique development environment. There’s so much to Scala that whatever stage of programming you’re at, you’ll probably want some books!

Why combine asynchronous and distributed calculations to tackle the biggest data quality challenges

Fri, 17 Mar 2017 05:47:36 +0200

Article co-authored by Martin Delobel and available on Medium.

[EN] 10+ Great Books for Apache Spark

Fri, 13 Jan 2017 05:45:12 +0200

This article was co-authored by Matthew Rathbone

image by Ed Robertson

Apache Spark is a super useful distributed processing framework that works well with Hadoop and YARN. Many industry users have reported it to be 100x faster than Hadoop MapReduce for certain memory-heavy tasks, and 10x faster when processing data on disk.

While Spark has incredible power, it is not always easy to find good resources or books to learn more about it, so I thought I’d compile a list. I’ll keep this list up to date as new resources come out.

[EN] The Truth Behind the Bigdata Buzz Word

Mon, 10 Oct 2016 04:27:41 +0200

Big Data…Really?

A few years ago, I had a discussion with a mentor of mine about the career path I wanted to pursue, and I said: “Look, Big Data is something really great, and I want to become a Big Data Engineer later on!” His answer was: “Okay, but be cautious. Big Data is not a revolution, and just like the ‘Cloud’, marketers have done their jobs.” I didn’t trust his words back then, and… You bet! He was right!