[FR] Moving from EMR to Kubernetes for Spark Workloads

Introduction AWS EMR is a widely used AWS service, mainly for processing big data with Apache Spark in a dedicated Hadoop cluster. Beyond its main function, EMR ships with a good number of open-source tools, some for monitoring (Ganglia) and others for querying data (Hive). More information can be found here. Depending on the context, EMR can be used either as an ephemeral cluster (for example, by launching a cluster every 6 hours to run Spark jobs) or as a permanent cluster. The latter is notably the case when the cluster is shared by several teams, runs streaming jobs, or when waiting for it to spin up costs more than leaving it running permanently. This article is not necessarily meant to compare EMR to Kubernetes, since the two do not address the same needs. Kubernetes is becoming more and more established today for a variety of reasons, and Spark natively supports Kubernetes as a scheduler and resource manager, so it would have been a shame not to look into it. ...

Feb 18, 2021 · 7 min · 1316 words · Ayoub Fakir

[EN] Migrating from a plain Spark Application to ZparkIO

Migrating from a plain Spark Application to ZIO with ZparkIO In this article, we’ll see how you can migrate your Spark Application to ZIO and ZparkIO, so you can benefit from all the wonderful features that ZIO offers and that we’ll be discussing. What is ZIO? ZIO is defined, according to the official documentation, as a library for asynchronous and concurrent programming that is based on pure functional programming. In other words, ZIO helps us write type-safe, composable and easily testable code, all while keeping that code safe and free of side effects. ZIO is a data type. Its signature, ZIO[R, E, A], shows us that it has three type parameters: ...

Oct 16, 2020 · 7 min · 1402 words · Ayoub Fakir

[EN] Building a CI/CD pipeline for a Spark project using Github Actions, SBT and AWS S3 — Part 2

In the first article of this series, we talked about how we can set up a CI/CD pipeline for a Spark project using Github Actions, SBT as a build tool and S3 for deployment. Once pushed to the [master] branch of our project on Github, our code triggered an SBT build command to generate a fat jar, which was then pushed to the chosen S3 bucket. However, this pipeline still lacks a way to add logic: for instance, it does not let us check whether the version of the jar we’re uploading to S3 already exists. ...

Apr 29, 2020 · 3 min · 427 words · Ayoub Fakir

[EN] CI/CD pipeline using Github Actions, SBT and AWS S3 - Part 1

Github now allows us to build continuous integration and continuous deployment workflows for our Github Repositories thanks to Github Actions, for almost all Github plans. In this tutorial, we’re going to go through building a CI/CD pipeline based on a Scala / Spark project. We will be using SBT, the Scala Build Tool, which will allow us to get a jar that we’re then going to deploy to AWS S3 using a custom Github Action. ...

Apr 8, 2020 · 3 min · 613 words · Ayoub Fakir

[EN] On Minimalistic Teaching

First… The education system today is experiencing a lot of challenges and has many issues around the world, and at all levels. That said, the “education problem” being a huge subject, we can only solve it by addressing small problems, one at a time, and the sum of all of these solutions may lead us to solving the bigger issue. For instance, one of the issues in higher education is that a teacher is either academic or professional; the former has a theoretical focus ― and does not teach students how to tackle real world problems based on what she teaches them ―, whereas the latter is more focused on practical applications ― and might not have the pedagogical tools or know-how. By tackling seemingly small problems like this one, we can hope to find small solutions; the sum of these puts us on the pathway to solving the bigger problem: how can we give tertiary students what they truly need to succeed in their post-graduation lives? ...

Feb 6, 2020 · 9 min · 1871 words · Ayoub Fakir

[FR] Bitcoin Explained to My Mother

Today, my mother told me that one of her primary school pupils had talked to her about a great “revolution” called Bitcoin. “But what is this thing that’s going to kill the banks?” she wondered. That is exactly why I decided to use this article to explain Bitcoin to my mother, as well as to all the moms who might read it! You see, Mom, a large majority of those who know the principles behind Bitcoin are antisocial geeks who speak only binary and look down on everyone who doesn’t know it; I was one of them at one point, before realizing that a tie suited me well too! ...

Nov 8, 2018 · 5 min · 975 words · Ayoub Fakir

[EN] 10+ Great Books for Functional Programming in Scala

This article was co-authored by Matthew Rathbone. Image by Thomas Leuthard. James Gosling, creator of Java, said: “If I were to pick a language to use today other than Java, it would be Scala.” Scala is a hot language in software development today: it is used by a range of start-ups for application development and has been adopted as the unofficial language of big data software development thanks to frameworks like Spark. As a language it is less verbose than Java, and has a number of unique features that make it more flexible too. Scala is functional, object-oriented, and truly multi-threaded, so it provides a unique development environment. There’s so much to Scala that whatever stage of programming you’re at, you’ll probably want some books! ...

Mar 17, 2017 · 5 min · 884 words · Ayoub Fakir

Why combine asynchronous and distributed calculations to tackle the biggest data quality challenges

Article co-authored by Martin Delobel and available on Medium.

Mar 17, 2017 · 1 min · 9 words · Ayoub Fakir

[EN] 10+ Great Books for Apache Spark

This article was co-authored by Matthew Rathbone. Image by Ed Robertson. Apache Spark is a super useful distributed processing framework that works well with Hadoop and YARN. Many industry users have reported it to be 100x faster than Hadoop MapReduce for certain memory-heavy tasks, and 10x faster while processing data on disk. While Spark has incredible power, it is not always easy to find good resources or books to learn more about it, so I thought I’d compile a list. I’ll keep this list up to date as new resources come out. ...

Jan 13, 2017 · 6 min · 1193 words · Ayoub Fakir

[EN] The Truth Behind the Bigdata Buzz Word

Big Data… Really? A few years ago, I had a discussion with a mentor of mine about the career path I wanted to pursue, and I said: “Look, Big Data is something really great, and I want to become a Big Data Engineer later on!”, and his answer was: “Okay, but be cautious, Big Data is not a revolution, and just like the ‘Cloud’, marketers have done their jobs”. I didn’t trust his words back then, and… You bet! He was right! ...

Oct 10, 2016 · 4 min · 672 words · Ayoub Fakir