[{"content":"Hi! I\u0026rsquo;m Ayoub, a Senior Data Engineer absolutely passionate about data technologies. I mainly work on Distributed Systems (Hadoop / Kubernetes / Cloud Technologies), Functional Programming (Haskell / Scala / Clojure), Rust, and Blockchain Technologies (Ethereum / Bitcoin / Hyperledger, and more recently got into the Polkadot ecosystem).\nI do consulting as well as teaching (Paris 12 University).\nYou can contact me to say Hi, to talk about your projects or to hire me: ayoub[at]fakir.dev\nMy links:\nGithub LinkedIn Medium Twitter Quora ","permalink":"/about/","summary":"\u003cp\u003eHi! I\u0026rsquo;m Ayoub, a Senior Data Engineer absolutely passionate about data technologies. I mainly work on Distributed Systems (Hadoop / Kubernetes / Cloud Technologies), Functional Programming (Haskell / Scala / Clojure), Rust, and Blockchain Technologies (Ethereum / Bitcoin / Hyperledger, and more recently got into the Polkadot ecosystem).\u003c/p\u003e\n\u003cp\u003eI do consulting as well as teaching (Paris 12 University).\u003c/p\u003e\n\u003cp\u003eYou can contact me to say Hi, to talk about your projects or to hire me: ayoub[at]fakir.dev\u003c/p\u003e","title":"About"},{"content":"Introduction AWS EMR est un service AWS largement utilisé principalement pour le traitement des données massives avec Apache Spark dans un Cluster Hadoop dédié. Au-delà de sa fonction principale, EMR embarque un bon nombre d\u0026rsquo;outils open-source, certains pour le monitoring (Ganglia), et d\u0026rsquo;autres pour le requêtage des données (Hive). Plus d\u0026rsquo;informations peuvent être trouvées par ici. Dépendamment du contexte, EMR peut être utilisé soit en tant qu\u0026rsquo;instance d\u0026rsquo;un cluster éphémère (par exemple en lançant un Cluster tous les 6 heures pour exécuter des jobs Spark), soit en tant que cluster permanent. C\u0026rsquo;est le cas notamment lorsque celui-ci est utilisé par plusieurs équipes, fait tourner des jobs de streaming ou lorsque l\u0026rsquo;attente de son instanciation est plus coûteuse que de le laisser tourner de manière permanente. Cet article n\u0026rsquo;est pas nécessairement un texte pour comparer EMR à Kubernetes vu que les deux ne répondent pas aux mêmes besoins. Kubernetes s\u0026rsquo;impose de plus en plus aujourd\u0026rsquo;hui pour des raisons diverses et variées, et Spark supporte Kubernetes comme Scheduler et Resources Manager nativement, donc ça aurait été dommage de ne pas s\u0026rsquo;y pencher.\nAvantage de AWS EMR La valeur de ce service n\u0026rsquo;est plus à prouver vu son utilisation massive et sa fiabilité. Pour autant, certains de ses avantages incluent (et ne sont pas limitées à) :\nIntégration avec l\u0026rsquo;écosystème AWS. Écosystème Hadoop quasi-complet dans un seul et même service. Auto-Scaling du cluster : EMR se base sur des instances EC2, et fait la différence entre deux types de slave nodes : les Core Nodes et les Task Nodes. Ces derniers peuvent être upscalé ou downscalé sans que les données stockées dans le tampon HDFS du cluster ne soient perdues, et ça, c\u0026rsquo;est cool. A noter ici que le système de stockage distribué (HDFS) inclut dans EMR n\u0026rsquo;est pas à utiliser en tant que système de stockage principal, même dans le scénario où le cluster est permanent. 
High availability: EMR detects unhealthy nodes and replaces them when necessary (in reality, the name node is big enough to detect struggling data nodes by itself, but EMR adds its own little touch by replacing them on the fly). Access to data stored in S3: who said anything about hosting your data lake storage directly on S3? Stroke of genius! OK, so if EMR is that cool, what are we even talking about? EMR is a very complete service, hence heavy, and hence a pain to migrate (if your company\u0026rsquo;s CTO hasn\u0026rsquo;t told you yet about his great idea of going multi-cloud or migrating to GCP, it\u0026rsquo;s only a matter of time, you\u0026rsquo;ll see). And that\u0026rsquo;s where you\u0026rsquo;ll run into a fairly significant drawback of EMR: it is hardly portable. Of course, if the only thing you use it for is running Spark jobs (which would be a shame), then no problem; but if you use its more advanced features, such as auto-scaling, or if your team queries data with Hive rather than Athena (which would also be a shame, really), then things get more complicated. EMR is a very complete service, and therefore expensive, and we all know our friend Bezos is not into charity… Well, actually, sometimes he is. Anyway, when you dive into the pricing, you see that with classic m5 EC2 instances (On Demand), EMR costs 25% more; that premium can go up to 33% with spot instances. More and more companies today rely on Kubernetes for many of their use cases; often, those clusters are oversized and under-used… Starting to get the idea? Spark on Kubernetes So, you have a Kubernetes cluster at hand that you\u0026rsquo;re being offered to use, EMR costs you too much (you know it and you don\u0026rsquo;t care, but your company\u0026rsquo;s CFO is complaining, especially since he doesn\u0026rsquo;t understand any of it anyway), and on top of that, you only use it to run your Spark jobs 3 times a day to expose data to your Data Analyst friends? Don\u0026rsquo;t move, you\u0026rsquo;re in the right place.\nOn the other hand, if you use the full range of tools EMR offers and can\u0026rsquo;t do without them, if you managed to convince your CFO that it was normal for it to be expensive, and if your CTO doesn\u0026rsquo;t watch BFM or isn\u0026rsquo;t interested in multi-cloud because he, at least, is aware that he knows nothing about it, then move along, or look over here: AWS has released something that should interest you.\nAnyway, how does it work? When we decide to run our jobs on Kubernetes, we drop YARN (Yet Another Resource Negotiator), which, let\u0026rsquo;s remember, had brilliantly managed to dethrone MapReduce.\nAs the diagram shows, our spark-submit command talks to the api-server (like everything in Kubernetes, by the way), which takes care of:\nRequesting the creation of a Spark Driver in the form of a pod. This pod will then trigger the creation (depending on the job\u0026rsquo;s configuration) of other pods that will play the role of executors. 
Once the job is finished, all the executors are destroyed, except the pod holding the Spark Driver, which persists its logs to disk, moves to the \u0026ldquo;completed\u0026rdquo; state, and gets destroyed later (manually if you\u0026rsquo;re impatient, although it doesn\u0026rsquo;t consume anything anyway). And concretely, how do we do it? If you\u0026rsquo;re familiar with Docker and Kubernetes, this should go quickly for you; otherwise… What are you waiting for to go buy Kubernetes In Action???\nBuild the Docker image: it can live in your favorite image registry (ECR, Gitlab…). Alternatively, you can use the tool shipped with Spark if you don\u0026rsquo;t feel like using a registry: ./bin/docker-image-tool.sh -r \u0026lt;repo\u0026gt; -t my-tag build ./bin/docker-image-tool.sh -r \u0026lt;repo\u0026gt; -t my-tag push Once your image is ready to use, the next step is naturally your spark-submit. It does not differ from the spark-submit commands we are used to outside Kubernetes, except for the Master address (here is an example for cluster mode): ./bin/spark-submit \\ --master k8s://https://\u0026lt;k8s-apiserver-host\u0026gt;:\u0026lt;k8s-apiserver-port\u0026gt; \\ --deploy-mode cluster \\ --name your-job-name \\ --class org.apache.spark.examples.YourMainClass \\ --conf spark.executor.instances=5 \\ --conf spark.kubernetes.container.image=\u0026lt;spark-image\u0026gt; \\ local:///path/to/examples.jar Once your job is running, you can follow it in the Spark UI with the following command: kubectl port-forward \u0026lt;driver-pod-name\u0026gt; 4040:4040 One advantage of running Spark on Kubernetes, among others obviously, is the ability to use all the native Kubernetes objects we were deprived of before: RBAC, Secrets, etc…\nVolumes The following Kubernetes volumes can be mounted (both at the driver and at the executor level):\nhostPath emptyDir nfs PVC 
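Mounting one boils down to spark-submit configuration. As a minimal sketch (the volume name, claim name and mount path below are illustrative), Spark exposes a spark.kubernetes.[driver|executor].volumes.* configuration pattern: --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoints.options.claimName=my-claim \\ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoints.mount.path=/checkpoints The same keys exist under the spark.kubernetes.driver.volumes.* prefix to mount the volume on the driver side. 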
What\u0026rsquo;s interesting about local volumes is that shuffles and the intermediate steps requiring disk persistence (checkpoints, for example) happen without putting too much strain on the network, which can potentially improve the performance of our jobs quite noticeably (then again, if you feel like hacking your NFS mounts to store your data in S3, that\u0026rsquo;s your problem).\nOK, so to sum up, what do I gain if I migrate? First off, an enormous amount of flexibility\u0026hellip; and money. Running Spark jobs on Kubernetes will necessarily cost you less. That said, not all of your devs are necessarily familiar with the technology, so it can potentially cost you a lot (mostly in terms of time), at least in the short / medium term.\nWhether to migrate thus depends on your team, its maturity and its appetite for learning a new technology, and on the time you can allocate to this migration (because let\u0026rsquo;s be honest, if your Spark jobs have no unit tests and are under-optimized, I would rather advise you to start there).\nIn any case, I hope this article helps you see things a bit more clearly, and that you will at least try to deploy your first Spark jobs on Kubernetes, if only to see \u0026ldquo;what it looks like\u0026rdquo;.\n","permalink":"/post/fr-passer-de-emr-vers-kubernetes-pour-les-workloads-spark/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eAWS EMR is a widely used AWS service, mainly known for processing massive amounts of data with Apache Spark on a dedicated Hadoop cluster. Beyond this main function, EMR ships with a good number of open-source tools, some for monitoring (Ganglia), others for querying data (Hive). More information can be found \u003ca href=\"https://docs.aws.amazon.com/fr_fr/emr/latest/ManagementGuide/emr-what-is-emr.html\"\u003ehere\u003c/a\u003e.\nDepending on the context, EMR can be used either as an ephemeral cluster (for instance, launching a cluster every 6 hours to run Spark jobs), or as a permanent cluster. The latter makes sense when the cluster is used by several teams, runs streaming jobs, or when waiting for it to spin up costs more than leaving it running permanently.\nThis article is not really meant to compare EMR to Kubernetes, since the two do not address the same needs. Kubernetes keeps gaining ground these days for all sorts of reasons, and Spark natively supports Kubernetes as a scheduler and resource manager, so it would have been a shame not to look into it.\u003c/p\u003e","title":"[FR] Moving from EMR to Kubernetes for Spark workloads"},{"content":"Migrating from a plain Spark Application to ZIO with ZparkIO In this article, we\u0026rsquo;ll see how you can migrate your Spark Application to ZIO and ZparkIO, so you can benefit from all the wonderful features that ZIO offers and that we\u0026rsquo;ll be discussing.\nWhat is ZIO? ZIO is defined, according to the official documentation, as a library for asynchronous and concurrent programming that is based on pure functional programming. In other words, ZIO helps us write type-safe, composable and easily testable code, free of side effects. ZIO is a data type. Its signature, ZIO[R, E, A], shows us that it has three parameters:\nR, the environment type: the effect we\u0026rsquo;re describing needs an environment, which is optional (it is Any when not needed). The environment describes what is required to execute this task. E, the error type with which the effect may fail. A, the data type returned in case of success. 
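For instance, a hypothetical effect (illustrative names and path, assuming ZIO 1.x) that needs a SparkSession as its environment, may fail with a Throwable, and returns a row count on success could be written as: val countUsers: ZIO[SparkSession, Throwable, Long] = ZIO.accessM[SparkSession](spark =\u0026gt; Task(spark.read.parquet(\u0026#34;/data/users\u0026#34;).count()))\n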
How do we apply ZIO to Spark? From the signature of the ZIO data type, we can deduce that for us to use Spark with ZIO, we need to specify one necessary parameter, namely the R. In our case, the R is the SparkSession, which is the entry point to every Spark job, and the component that resides in the Spark Driver. The process of integrating ZIO into our Spark programs can be hard, especially for beginners. Thankfully, Leo Benkel has created ZparkIO, a library that implements the whole process of marrying Spark and ZIO together!\nUsing ZparkIO to bootstrap a Spark / ZIO project The first step is to create a trait that extends ZparkioApp[R, E, A], where you need to override two methods: makeCli and runApp. makeCli(args: List[String]) parses all the program arguments for you (for the moment, scallop is used by default, but we\u0026rsquo;re in the process of extracting this module so you can use your CLI tool of choice). runApp(): ZIO[COMPLETE_ENV, E, A] is the main function where your program\u0026rsquo;s logic resides.\noverride def runApp(): ZIO[COMPLETE_ENV, Throwable, Unit] = ??? Once you\u0026rsquo;re set up, your SparkSession can be accessed as follows:\nfor { spark \u0026lt;- SparkModule() } And if you want to access the implicits of Spark, the trick is to map over SparkModule():\nfor { spark \u0026lt;- SparkModule().map { ss =\u0026gt; import ss.implicits._ ??? } } Migrating your project Once your project is set up, migrating it can be pretty straightforward.\nDefining your program\u0026rsquo;s arguments ZparkIO uses scallop. So to define the CLI arguments of your program you need to define a case class, Arguments, which has the following signature:\ncase class Arguments(input: List[String]) extends ScallopConf(input) with CommandLineArguments.Service { ??? } Then, each one of your arguments has to be declared as follows:\nval argumentOne: ScallopOption[String] = opt[String]( default = None, required = true, noshort = true, descr = \u0026#34;Description of this argument\u0026#34; ) HINT: The argument defined here is called \u0026ldquo;argumentOne\u0026rdquo;; however, it has to be passed as argument-one on the CLI when executing your project. Scallop logic! :D\nMigrating your code The first step is that every function/helper in your program needs to start returning something of type ZIO. As an example, let\u0026rsquo;s take one method that reads data from an external file system, first without ZparkIO, then with it:\ndef readData[A]( inputPath: String )( implicit sparkSession: SparkSession ): Dataset[A] = sparkSession.read.parquet(inputPath) In a plain Spark application, this method returns a Dataset of some type A, and… That\u0026rsquo;s it! If we fail to read the inputPath given as a parameter for some reason, the whole program crashes, and we do not catch the reason why (at least not at first sight). This same method, using ZIO and ZparkIO, will be written as follows:\ndef readData[A](inputPath: String): ZIO[SparkModule, Throwable, Dataset[A]] = for { spark \u0026lt;- SparkModule() dataset \u0026lt;- Task(spark.read.parquet(inputPath)) } yield dataset or\ndef readData[A](inputPath: String): ZIO[SparkModule, Throwable, Dataset[A]] = SparkModule().map(_.read.parquet(inputPath)) Please note that we didn\u0026rsquo;t have to use an implicit parameter, as the session is already provided by the SparkModule() method of the library. 
The function readData uses a Spark environment, namely SparkModule, may fail with a Throwable, and, in case of success, returns a Dataset of some type A (I\u0026rsquo;m using Datasets here instead of DataFrames because, first, we need some better typing, and second, Leo is allergic to DataFrames). The instruction to read data from the filesystem is wrapped into a Task. Task is of type IO[Throwable, A], which means that it does not depend on any environment (implicitly Any). Then we yield the dataset we just read, which matches the return type of our function. Once your function is wrapped in a Task like this, you can start leveraging ZIO features such as retry and timeout.\nprotected def retryPolicy = Schedule.recurs(3) \u0026amp;\u0026amp; Schedule.exponential(Duration.apply(2, TimeUnit.SECONDS)) def readData[A](inputPath: String): ZIO[SparkModule, Throwable, Dataset[A]] = for { spark \u0026lt;- SparkModule() dataset \u0026lt;- Task(spark.read.parquet(inputPath)) .retry(retryPolicy) .timeoutFail(ZparkioApplicationTimeoutException())(Duration(10, TimeUnit.MINUTES)) } yield dataset In this example, if the read fails, it will be retried up to 3 times with an exponential wait between retries, but the total amount of time spent on this task cannot exceed 10 minutes.\nSome other function examples: Obviously, your Spark program has more methods than just reading data from HDFS or S3. Let\u0026rsquo;s take another example:\ndef calculateSomeAggregations[A](ds: Dataset[A]): Dataset[A] = { ds .groupBy(\u0026#34;someColumn\u0026#34;) .agg( sum(when(col(\u0026#34;someOtherColumn\u0026#34;) === \u0026#34;value\u0026#34;, lit(1)).otherwise(lit(0))) ) } This function calculates some aggregations over a Dataset of type A. One easy way to begin your migration to ZIO and ZparkIO is to wrap its calculations into a Task:\ndef calculateSomeAggregations[A](ds: Dataset[A]): IO[Throwable, Dataset[A]] = Task { ds .groupBy(\u0026#34;someColumn\u0026#34;) .agg( sum(when(col(\u0026#34;someOtherColumn\u0026#34;) === \u0026#34;value\u0026#34;, lit(1)).otherwise(lit(0))) ) } And… Voilà! Your safe method is ready!\nBut after all, why would you need to migrate your project and start using ZIO? Well, one first and obvious argument is that it\u0026rsquo;s… safer, more composable, and lets you reason more easily about the logic of your program. Alright, now let\u0026rsquo;s take the following example:\nval df1 = spark.read.parquet(\u0026#34;path1\u0026#34;) val df2 = spark.read.parquet(\u0026#34;path2\u0026#34;) val df3 = spark.read.parquet(\u0026#34;path3\u0026#34;) … val dfn = spark.read.parquet(\u0026#34;pathn\u0026#34;) We all know that Spark is the leading framework for distributed programming and parallel calculations. However, imagine that you are running a Spark program in an EMR cluster of 20 nodes, each with 32GB of RAM and 20 CPU cores. That\u0026rsquo;s a lot of executors to instantiate. You set up a cluster of this size because you do some heavy joins, sorts and groupBys. But still, you read a lot of parquet partitions at the beginning of your program, and the problem is that… All the instructions shown earlier will be executed sequentially. When df1 is being read, it certainly does not use the whole capacity of your cluster, and the next instruction needs to wait for the first one to end, and so on. That\u0026rsquo;s a lot of wasted resources. 
Thanks to ZIO and its Fibers feature, we can force those reads to run in parallel as follows:\nfor { df1 \u0026lt;- Task(spark.read.parquet(\u0026#34;path1\u0026#34;)).fork … dfn \u0026lt;- Task(spark.read.parquet(\u0026#34;pathn\u0026#34;)).fork readDf1 \u0026lt;- df1.join readDf2 \u0026lt;- df2.join } The fork / join combo will force those instructions to be executed in parallel, and thus waste a minimum of the cluster\u0026rsquo;s resources! Another useful method that we can use in this context is foreachPar. Let\u0026rsquo;s suppose that we have a list of paths of parquet partitions that we want to read in parallel, using our readData(inputPath: String) method defined earlier:\nfor { dfs \u0026lt;- ZIO.foreachPar(filePaths) { filePath =\u0026gt; { readData(filePath) } } } yield dfs The value returned here is a List of the DataFrames we read, which we can now use throughout our program. Going even further, foreachParN allows you to specify how many tasks can run in parallel at most:\nfor { dfs \u0026lt;- ZIO.foreachParN(5)(filePaths) { filePath =\u0026gt; { readData(filePath) } } } yield dfs This code will limit the parallel execution to 5 tasks, never more, which is a great way to manage your resources. This could even be a multiple of the number of executors available in the cluster.\nWrap up We hope this article gives you a taste of why you should use ZIO and Spark, with the help of ZparkIO, to take your Spark jobs to the next level of performance, safety and fun!\n","permalink":"/post/en-migrating-from-a-plain-spark-application-to-zparkio/","summary":"\u003ch1 id=\"migrating-from-a-plain-spark-application-to-zio-with-zparkio\"\u003eMigrating from a plain Spark Application to ZIO with ZparkIO\u003c/h1\u003e\n\u003cp\u003eIn this article, we\u0026rsquo;ll see how you can migrate your Spark Application to \u003ca href=\"https://zio.dev\"\u003eZIO\u003c/a\u003e and \u003ca href=\"https://github.com/leobenkel/ZparkIO\"\u003eZparkIO\u003c/a\u003e, so you can benefit from all the wonderful features that ZIO offers and that we\u0026rsquo;ll be discussing.\u003c/p\u003e\n\u003ch2 id=\"what-is-zio\"\u003eWhat is ZIO?\u003c/h2\u003e\n\u003cp\u003eZIO is defined, according to the official documentation, as \u003cstrong\u003ea library for asynchronous and concurrent programming that is based on pure functional programming.\u003c/strong\u003e In other words, ZIO helps us write type-safe, composable and easily testable code, free of side effects.\n\u003cstrong\u003eZIO is a data type\u003c/strong\u003e. Its signature, \u003cem\u003eZIO[R, E, A]\u003c/em\u003e, shows us that it has three parameters:\u003c/p\u003e","title":"[EN] Migrating from a plain Spark Application to ZparkIO"},{"content":"In the first article of this series, we talked about how we can set up a CI/CD pipeline for a Spark project using Github Actions, SBT as a build tool and S3 for deployment. Our code, once pushed to the master branch of our project on Github, triggered an SBT build command to generate a fat jar, which was then pushed to the chosen S3 bucket.\nHowever, this pipeline still lacks custom logic: it does not allow us to check, for instance, whether the jar version we’re pushing to S3 already exists.\nTo add such logic (or expand it if necessary), we have to create our own Github Action. This article aims to show you step by step how we can write a custom Github Action (and publish it!) 
for our previous CI/CD.\nIn the official documentation, we see that we can create several types of Github actions; the one of interest to us for this article is Creating a Docker container action.\nThe first step is to create a Dockerfile that will allow us to run AWS commands through the AWS CLI (also note that we could use boto3 as well for more complex actions, like creating an EMR cluster and running the newly-deployed jar).\nIn our example, we will check whether the jar we want to upload already exists in our S3 bucket, and if it does, stop the CI part of the pipeline and skip uploading the newly built jar.\nFirst, we need a Dockerfile; second, an entrypoint.sh script that implements our logic. Both files might look something like this: 
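(A minimal sketch; the base image, file names and environment variables below are assumptions to adapt to your own setup.) The Dockerfile merely packages the AWS CLI together with our script:\nFROM amazon/aws-cli:latest\nCOPY entrypoint.sh /entrypoint.sh\nRUN chmod +x /entrypoint.sh\nENTRYPOINT [\u0026#34;/entrypoint.sh\u0026#34;]\nAnd entrypoint.sh carries the check itself (aws s3 ls returns a non-zero code when the key is absent):\nif aws s3 ls \u0026#34;s3://${AWS_S3_BUCKET}/${JAR_NAME}\u0026#34;; then echo \u0026#34;This jar version already exists in S3, stopping here.\u0026#34;; exit 1; else aws s3 cp \u0026#34;${JAR_NAME}\u0026#34; \u0026#34;s3://${AWS_S3_BUCKET}/${JAR_NAME}\u0026#34;; fi 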
Our entrypoint.sh script will check whether the jar exists in the S3 bucket, then run a simple if/else logic. Of course, our AWS Access Key and Secret Access Key would be stored in Github’s secrets, as we saw in part 1 of this series.\nThen, we can create our workflow.yml file much like the one we created in the previous article, simply changing the action’s path to our own Github repository since it’s our own Github Action.\nAll set! Finally, it’s better to create a README.MD file to let people know how to use your action.\nThen, we push everything to our repository while tagging our release:\ngit add . git commit -m \u0026#34;My Awesome Github Action\u0026#34; git tag -a -m \u0026#34;First version OK\u0026#34; v1 git push --follow-tags Voilà! If you’re happy with the Github Action you just created, you can publish it to the official Github Marketplace. Github shows us a step-by-step how-to here.\n","permalink":"/post/en-building-a-ci/cd-pipeline-for-a-spark-project-using-github-actions-sbt-and-aws-s3-part-2/","summary":"\u003cp\u003eIn the \u003ca href=\"https://medium.com/alterway/building-a-ci-cd-pipeline-for-a-spark-project-using-github-actions-sbt-and-aws-s3-part-1-c7d43658832d\"\u003efirst article\u003c/a\u003e of this series, we talked about how we can set up a CI/CD pipeline for a Spark project using Github Actions, SBT as a build tool and S3 for deployment. Our code, once pushed to the master branch of our project on Github, triggered an SBT build command to generate a fat jar, which was then pushed to the chosen S3 bucket.\u003c/p\u003e\n\u003cp\u003eHowever, this pipeline still lacks custom logic: it does not allow us to check, for instance, whether the jar version we’re pushing to S3 already exists.\u003c/p\u003e","title":"[EN] Building a CI/CD pipeline for a Spark project using Github Actions, SBT and AWS S3 — Part 2"},{"content":"Github now allows us to build continuous integration and continuous deployment workflows for our Github Repositories thanks to Github Actions, for almost all Github plans.\nIn this tutorial, we’re going to go through building a CI/CD pipeline based on a Scala / Spark project. We will be using SBT, the Scala Build Tool, which will allow us to get a jar that we’re then going to deploy to AWS S3 using a custom Github Action.\nThe first step is to create a Scala (SBT based) project. We will be doing this using the Intellij Idea IDE, but feel free to use your editor / CLI of choice.\nFirst, select File -\u0026gt; New Project and select Scala then sbt.\nSecond, choose a name for your project, and the JDK / SBT / Scala versions.\nFor our CI / CD purposes, we need to generate a jar file from our source code as well as our dependencies. To do so, we will need a special plugin called sbt-assembly. Under ProjectName -\u0026gt; project, create a file called assembly.sbt, paste the following, then save (the version is indicative, pick the latest one available): addSbtPlugin(\u0026#34;com.eed3si9n\u0026#34; % \u0026#34;sbt-assembly\u0026#34; % \u0026#34;0.14.10\u0026#34;)\nThen, you need to add the following dependencies to your build.sbt file in the project’s root folder, for instance (Spark version illustrative): libraryDependencies ++= Seq( \u0026#34;org.apache.spark\u0026#34; %% \u0026#34;spark-core\u0026#34; % \u0026#34;2.4.5\u0026#34; % \u0026#34;provided\u0026#34;, \u0026#34;org.apache.spark\u0026#34; %% \u0026#34;spark-sql\u0026#34; % \u0026#34;2.4.5\u0026#34; % \u0026#34;provided\u0026#34; )\nassemblyMergeStrategy in assembly := { case PathList(\u0026#34;META-INF\u0026#34;, xs @ _*) =\u0026gt; MergeStrategy.discard case x =\u0026gt; MergeStrategy.first }\nHere we declare our dependencies (by adding them to the libraryDependencies seq), then our assemblyMergeStrategy, which is the strategy used by the assembly command whose plugin we added before (going through the details of this goes beyond the scope of this tutorial, but just add it, it works :))\nFinally, all we have to do is write some nice Scala / Spark code in our program. Here is some code that you can put in any package of your structure, for example something as simple as (illustrative): object Main { def main(args: Array[String]): Unit = { val spark = org.apache.spark.sql.SparkSession.builder().appName(\u0026#34;ci-cd-demo\u0026#34;).getOrCreate(); import spark.implicits._; Seq(1, 2, 3).toDF(\u0026#34;value\u0026#34;).show(); spark.stop() } }\nOK, one last step and we’re done! We need to prepare the YAML file that describes the steps we want our CI/CD to go through. To do so, we need to create a folder, namely .github/workflows, and inside, create a file (its name does not matter, as long as it ends in .yaml).\nBasically, this file gives our CI a name, in our case, “CI CD” (innovative, hah?).\non: push: branches: [ master ] This tells our workflow that it will be triggered when we push some code into the master branch.\njobs: build: runs-on: ubuntu-latest Our workflow will run on an ubuntu image, with the following steps:\nThe first one will set up our JDK, with version 1.8 (the same one that we chose when creating the project, remember?) The second one will build our fat jar using sbt assembly. Then, we will be using a Github Action, namely tpaschalis/s3-sync-action@master. Basically, this step will clone the master branch of the Github Action we chose, to sync our code to S3. This Action has some parameters, namely: the S3 Bucket we will be uploading our jar to, the AWS ACCESS and SECRET keys, the region, and finally, the *local* path of the jar we want to upload. Again, our full YAML file will look like: 
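(A sketch assembled from the steps above; AWS_REGION, SOURCE_DIR and the action’s exact input names are assumptions, so double-check the action’s README.)\nname: CI CD\non:\n  push:\n    branches: [ master ]\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v1\n      - name: Set up JDK 1.8\n        uses: actions/setup-java@v1\n        with:\n          java-version: 1.8\n      - name: Build the fat jar\n        run: sbt assembly\n      - name: Sync the jar to S3\n        uses: tpaschalis/s3-sync-action@master\n        env:\n          AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}\n          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}\n          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}\n          AWS_REGION: eu-west-1\n          SOURCE_DIR: target/scala-2.12\n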
The last step here is to define the values of AWS_S3_BUCKET, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For security reasons, these variables need to be stored securely. For this, Github has a nice feature, namely secrets, where we can define them.\nNow all we have to do is define our 3 variables (be careful, the names should match what has been defined in the YAML file), aaaand voilà!\nAll we have to do now is push some code into the master branch, and magic will happen:\nUnder the Actions tab, we find all the steps that our workflow went through, from the build to the deployment to S3.\nHope this tutorial was of good help to you :)\n","permalink":"/post/en-ci/cd-pipeline-using-github-actions-sbt-and-aws-s3-part-1/","summary":"\u003cp\u003eGithub now allows us to build continuous integration and continuous deployment workflows for our Github Repositories thanks to Github Actions, for almost all Github plans.\u003c/p\u003e\n\u003cp\u003eIn this tutorial, we’re going to go through building a CI/CD pipeline based on a Scala / Spark project. We will be using SBT, the Scala Build Tool, which will allow us to get a jar that we’re then going to deploy to AWS S3 using a custom Github Action.\u003c/p\u003e","title":"[EN] CI/CD pipeline using Github Actions, SBT and AWS S3 - Part 1"},{"content":"First…\nThe education system today is experiencing a lot of challenges and has many issues around the world, and at all levels. That said, the “education problem” being a huge subject, we can only solve it by addressing small problems, one at a time, and the sum of all of these solutions may lead us to solving the bigger issue. For instance, one of the issues in higher education is that a teacher is either academic or professional; the former has a theoretical focus (and does not teach students how to tackle real-world problems based on what she teaches them), whereas the latter is more focused on practical applications (and might not have the pedagogical tools or know-how). By tackling seemingly small problems like this one, we can hope to find small solutions; the sum of these puts us on the pathway to solving the bigger problem: how can we give tertiary students what they truly need to succeed in their post-graduation lives?\nAs a way of illustrating such effective small thinking, a project funded by the World Bank sought to help kids in rural towns in China who were having difficulty at school. As part of the project, researchers found out that many of the kids who were struggling had basic eyesight problems, and the solution was as simple as giving glasses to the children that needed them. The results were pretty impressive.\nIn this article, we take a small step and tackle the problems and challenges seen in higher education, as far as professors (in their way of teaching) and students (in their way of learning) are concerned. This article argues that in order for students to become better professionals in the future, professors need a better way to deliver their content and to let their students become more independent, and thus learn faster.\nThen…\nIn countries like France or Morocco, university professors usually come with a set of slides with a huge amount of content that they need their students to learn, sometimes without knowing why. One of the main issues with this approach is that the students end up overwhelmed by all the information that they need to ingest, with less chance of becoming independent and without the ability to go further, since they lack the fundamentals. Our role as teachers is to give students the scope of each module, not to limit their field of thinking, but to allow them to apprehend a concept without being overwhelmed by its immensity. Unfortunately, the methods used to limit the scope are often designed to format students to think in some specific way or another, which indeed limits their potential for out-of-the-box thinking.\nOne of the things I encountered when I was a student is that not only do some professors refuse to accept that we might have a critical approach as far as their teaching is concerned, they also believe that they are the true repository of knowledge, and that whatever it is they are teaching needs to be taken for a fact and should under no circumstances be questioned. This prevalent mentality has pushed me to go down a different pedagogical path from that taken by most of my professors.\nWe believe that students today are intelligent and independent enough to learn by themselves. If we take the etymology of the word “professor”, we see that she’s considered a “declarer, a person who claims knowledge”. In the past, this definition was literally correct and the professor was the way, the truth and the light, as she had access to resources (be they knowledgeable people, mentors or books) that the students didn’t have access to, due to linguistic capacities or societal status. 
The professor might also have had enough experience and awareness to get the most out of these resources; this remains true today, even though we now have a lot of resources (videos, tutorials, etc…) that can help anyone understand even the most difficult books or concepts.\nThus, in the current era of openly accessible knowledge and free quality resources, the role of a teacher in higher education is no longer to be the source of knowledge or truth, but to give students the pathways and keys they need in order to explore the more complex areas by themselves; as opposed to the methods used today, where students are so overwhelmed by so much information that they miss the basics, and thus are not able to explore further areas by themselves.\nGiving students the basics of a particular subject has better chances of helping them master those basics. Then if they do master them, they are more likely to look further and explore more complex areas by themselves. It’s only then that a teacher can give them the benefit of her experience and teach them how to tackle those areas and go even further; as opposed to a method whereby the teacher exposes the complex areas of the subject without piquing the students’ curiosity about them. The thesis here is that, if a student does not grasp the basics of a subject and have the curiosity to go further, the teacher has simply failed in her mission.\nThe issue with this outdated way of teaching is that professors tend to want to make a fast impact and try to give their students as much information in as little time as possible, so they have a sense of achievement as teachers. The opposite of this approach is to have a long-term vision and help students find their own way of learning and exploring things, which is more beneficial for them in the long run.\nMost often, teachers tend to want to achieve this fast impact by teaching a subject without its context: Why is it important to study the history of literature before reading Jane Austen? Or, for my part, why is Functional Programming so important when working with Distributed Systems? These, among others, are questions that are often not answered but that are equally as important as the very subject being taught.\nAlso, one of the common mistakes that we, as teachers, make in higher education is that we want students to learn subjects by heart; while the human mind is better able to make connections between subjects when they are familiar, it is really made to process the information given to it, not to store it. Given a particular concept, students only need to know its basics and the general idea, then train their minds to be able to master the details, and sharpen the saw of their capabilities to identify and apply patterns. Though Beethoven may have composed his 5th Symphony, no matter how hard scientists tried, they would never have found it in his mind, as it was never stored there.\nOne of the major challenges that many students face is a lack of methodology. I remember having a professor who used to write code live in front of us, without explaining any methodology about how to reason when it came to coding problems as we faced them; the problem with this method is that, once home alone, we faced errors and design problems that we were unable to solve, because he didn’t give us pathways. 
As a way to illustrate such methodologies, let’s see some basic steps that one should follow when reading a book, a technical document or an essay. Various references suggest going through three phases:\nReading: In this phase, we forget ourselves, then we embrace the vision of the author; we seek first to understand what her thesis is, and what message she is trying to send us. We go through the writing without focusing on the details; we only want to come away with the general idea. Analyzing: Here, we read again and ask ourselves a few questions: Why does the author express this idea and then that one? What conclusion does she want to draw? Why here and not later in the book? The idea here is to answer these few questions both from the point of view of the author, and from ours. Appropriating: Finally, we go back to each main idea of the book and ask ourselves: from my point of view, and given my own experience and the way I learnt to see the world, do I agree with the author? If yes, we embrace the idea and make it our own; if not, we reject it and let it go. These steps are very important in the process of learning. They allow students (and everyone, for that matter) to learn how to draw conclusions, whereas the current way of teaching shows them which conclusions to draw.\nLet’s take as an example the way we teach programming languages. Whenever anyone wants to write a program, they have to take two main characteristics into account:\nThe syntax: Here, the rules of the programming language of choice are to be followed, and the compiler’s rules need to be strictly respected. The logic: This is the way we reason about our programs; it involves some kind of recipe telling what the program should do. For instance, regarding syntax, when writing Scala code, we can declare an immutable variable using the keyword “val”; if we mistakenly use “Val”, even though it is semantically identical, the use of the capital letter is enough to result in a non-compilable program. 
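In two lines of Scala, the contrast looks like this:\nval x = 1 // compiles: val is the keyword\nVal x = 1 // does not compile: Val is not a keyword\n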
Concerning logic, imagine having a function F that takes an input and produces an output (a simple calculation, for example). If the syntax is correct, the program will compile; but if the implementation, a.k.a. the logic, is wrong in some way, the output will be wrong, rendering the program unusable. One way of dealing with this is to use so-called debuggers.\nOne way of helping students master a subject and making them benefit from a professor’s experience is through one-day workshops. The idea is to pick a subject, which can be exploratory (a specific technology involving Blockchain, for example), go through its technical documentation with the students, and show them the way we reason when faced with a previously unknown technology: what are the patterns we use as professionals when we work with a tool for the first time? When facing a problem, how can we go about debugging it? What are the resources that we primarily use? How can we read the technical documentation of this particular technology / tool properly? These are questions that can be more beneficial to the students than just coming up with slides and showing them the conclusions that we drew while preparing the course.\n“When forced to read technical documentation, we tend to skim the information for troubleshooting solutions and thus skip steps that we falsely assume are optional.”\nGoing back to our first example, the researchers were able to help kids get better grades simply by giving them $15 glasses. In a sense, we as teachers shouldn’t simply show students where to look, but rather give them the necessary glasses, paradigms and pathways to look through, so that they might see the world through our more experienced eyes and draw their own conclusions.\n","permalink":"/post/en-on-minimalistic-teaching/","summary":"\u003cp\u003e\u003cstrong\u003eFirst…\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe education system today is experiencing a lot of challenges and has many issues around the world, and at all levels. That said, the “education problem” being a huge subject, we can only solve it by addressing small problems, one at a time, and the sum of all of these solutions may lead us to solving the bigger issue. For instance, one of the issues in higher education is that a teacher is either academic or professional; the former has a theoretical focus (and does not teach students how to tackle real-world problems based on what she teaches them), whereas the latter is more focused on practical applications (and might not have the pedagogical tools or know-how). By tackling seemingly small problems like this one, we can hope to find small solutions; the sum of these puts us on the pathway to solving the bigger problem: how can we give tertiary students what they truly need to succeed in their post-graduation lives?\u003c/p\u003e","title":"[EN] On Minimalistic Teaching"},{"content":"Today, my mother told me that one of her primary school students had talked to her about a great “revolution” called Bitcoin. “But what is this thing that is going to kill the banks?”, she wondered.\nThat is the very reason why I decided, through this article, to explain Bitcoin to my mother, as well as to all the moms who might come across it!\nYou see, Mom, a large majority of those who know the principles behind Bitcoin are anti-social geeks who only speak binary and look down on everyone who doesn’t; I was one of them at some point, before realizing that a tie suited me quite well too!\nLet’s first talk about the principle behind money; used by all but mastered by few, it is simply a material we have all agreed to give value to, and that some banks and corporations around the world manipulate… pretty much as they please.\nAnyway. Mom, you have a bank account, right? Your salary lands on it at the end of the month, and when you need to buy a croissant, the price of that croissant is deducted from the total amount in your account. But how do we manage to make this “deal” with the baker without even knowing him? Simply because the bank acts as a “trusted third party” and guarantees the reliability of the transaction, so that you can take your croissant and the baker can get paid.\nSo far, as long as we are talking about croissants, everything is fine. But let’s imagine you had 3 billion dollars in your bank account: what happens if you want to withdraw it all at once to join Elon Musk and buy an apartment on Mars? That is likely to be a bit more complicated. What you need to know is that, when we have a sum of money in our bank account, that money no longer belongs to us; it is just a contract between us and the bank, a promise saying that, when we need all or part of that sum, the bank will give it back to us… if it has it itself. 
And what does that lead to, then? Control of the world economy by the bank, its peers and its partners… Plain and simple!\nNow let’s imagine I am short on money, that I have been kidnapped by a mafia demanding a 3000-euro ransom (yes, I am not worth that much) and that they need it now… You are in Morocco, I am in France, and it is midnight: what do we do? If we count on a classic bank transfer, I am already dead!\nOne last point, and not the least! Let’s imagine that tomorrow, a highly skilled hacker wants to go after a bank and steal all the money its customers hold. What would he have to do? Simply attack the bank, which is a central point of vulnerability, and once the attack succeeds, all our accounts are at his disposal!\nToday, “Bitcoin” (or digital cash) allows us to hold value that we can exchange with one another, as we wish, without needing a bank or a “trusted intermediary” to guarantee the reliability of our transactions. How does it work? It is very simple. There are 30 houses in our neighborhood, 3 people per household on average, which makes a total of 90 people. Your neighbor Sarah and you want to make a deal… Say, a tajine in exchange for a favor; how do you manage to trust each other? Easy! You gather the 88 other neighbors, you hand the tajine to Sarah in front of everyone, and Sarah commits to returning the favor the very next day. What happens if Sarah does not keep her promise? All the neighbors will know, Sarah will be cast aside, and nobody in the neighborhood will ever want to talk to her or make a deal with her again!\nThat is roughly what happens with Bitcoin; we send some value, say 2 Bitcoins, from one person to another, and all the “participants” (neighbors) in the network (neighborhood) take care of validating, or not, the transfer of value that took place.\nSo why is this cool? Clearly because this neighborhood network is controlled by no one: no neighbor is better than another, and none has control over the others. Remember the hacker who could take everyone’s money by attacking a single central point? Well, now he would have to take out everyone in the neighborhood to achieve the same result; tough, right?\nFinally, going back to the kidnapping story: honestly, you could save me. The kidnapper would send you his Bitcoin address, you could make the transaction from yours to his within a few minutes, even at midnight; I would be released, and, cherry on top, nobody would be able to learn his identity, because identities on the Bitcoin network are pseudonymous.\nThere you go, I hope things are a bit clearer now! 😊\nYour son\n\u0026ldquo;Oh, by the way, can’t this thing also be useful to criminals? 
So it’s evil and must be banned right away!!!\u0026rdquo; I am sure you are asking yourself that question, but know that, as with any technological advance, there are good sides and bad ones; believe me though, what Bitcoin brings far outweighs the harm it can cause (I will explain that to you in a future episode). And besides, criminals have been managing just fine without it for decades!\n","permalink":"/post/fr-le-bitcoin-expliqu%C3%A9-%C3%A0-ma-m%C3%A8re/","summary":"\u003cp\u003eToday, my mother told me that one of her primary school students had talked to her about a great “revolution” called Bitcoin. “But what is this thing that is going to kill the banks?”, she wondered.\u003c/p\u003e\n\u003cp\u003eThat is the very reason why I decided, through this article, to explain Bitcoin to my mother, as well as to all the moms who might come across it!\u003c/p\u003e\n\u003cp\u003eYou see, Mom, a large majority of those who know the principles behind Bitcoin are anti-social geeks who only speak binary and look down on everyone who doesn’t; I was one of them at some point, before realizing that a tie suited me quite well too!\u003c/p\u003e","title":"[FR] Bitcoin explained to my mother"},{"content":"This article was co-authored by Matthew Rathbone\nimage by Thomas Leuthard\nJames Gosling, creator of Java, said:\n“If I were to pick a language to use today other than Java, it would be Scala.”\nScala is a hot language in software development today; it is used by a range of start-ups for application development and has been adopted as the unofficial language of big data software development thanks to frameworks like Spark. As a language it is less verbose than Java, and has a number of unique features that make it more flexible too. Scala is functional, object-oriented, and truly multi-threaded, so it provides a unique development environment. There\u0026rsquo;s so much to Scala that whatever stage of programming you\u0026rsquo;re at, you\u0026rsquo;ll probably want some books!\nI\u0026rsquo;ll link to both books and on-line resources.\nBooks Programming in Scala This book is considered by many as the primary reference for the Scala programming language. Written by Martin Odersky (Scala\u0026rsquo;s creator), it covers every facet of the language. In many ways it is the equivalent of the on-line documentation plus lots of details to truly get a sense of how things work under the hood. Given the focus of the book, do not expect guided tutorials or an introduction to programming in general.\nScala for the Impatient This book gets you step by step into the basics of the Scala Programming Language and then covers the majority of its concepts. It is more concise than the previous book since it\u0026rsquo;s meant to get you started quickly.\nThis is more of a cheat sheet than a book, and to me, it is a resource to have in your library.\nFunctional Programming in Scala I consider this book a great resource for learning Functional Programming in general, regardless of Scala. It first focuses on what Functional Programming is, then it applies the principles to Scala. A lot of the power of Scala comes from its functional nature, so it\u0026rsquo;s great to think about these concepts as a core part of your Scala learning.\nLearning Scala: Practical Functional Programming for the JVM If you want to learn Scala step by step while knowing exactly how things work in the Java Virtual Machine, this book can be a great start. 
I don’t think this is a great beginners programming book, but if you have experience programming already it should do a good job getting you used to Scala. My biggest criticism is that this book does not include a front-to-back example of a Scala project, nor does it talk about the tooling you could be using, but overall a great book for a Scala beginner.\nScala Cookbook: Recipes for Object-Oriented and Functional Programming ‘Cookbooks’ have become their own book genre, and this book gives you the prototypical experience.\nThis book is suitable for you if you’ve already worked a little bit with Scala and you want to know more about best practices while building a Scala Application. For every problem in Scala there are lots of possible solutions; this book helps cover some good ground while presenting a solid opinion on ‘good’ Scala.\nScala in Action This is a book for people who already know how to code in Scala and who have ideally built some applications in it. The book focuses on presenting information through a ‘how-to’ approach. If you’ve never done Scala before, start with another book before this. Scala in Action takes a lot of effort to understand, but it’s worth it.\nScala High Performance Programming Scala High Performance Programming is a great read if you already know Scala and you need to monitor and optimize your application. The book’s strength comes from all the real-world use cases that are referenced throughout in order to guide the content. The book will teach you how to analyze and monitor the performance of your Scala Application, how to work efficiently with stream processing, how to write high performance code, and much more.\nScala Design Patterns This book seeks to marry design patterns from the Object Oriented world with the unique features of Scala, taking particular advantage of its functional nature. The book updates the ‘Gang of Four’ design patterns for a functional language, so if you want to explore those ideas in depth then this is the book to read.\nBuilding a Recommendation Engine with Scala Recommendation engines are becoming more commonplace today, and Scala is quickly becoming a key language for their implementation. This book goes through the details of how to build one from scratch, which I think is something every Scala developer would find valuable. The book does use practical examples (like recommendation engines in e-commerce) to introduce the concepts, so this book could also serve as a useful way to ‘break-in’ to that world.\nOther resources Scala School When I worked at Foursquare we’d direct all new hires to this wonderful Scala learning resource. It’s a free way to learn practical Scala, built by people who were teaching software engineers to be Scala software-engineers. 
It is not a programming introduction, but it\u0026rsquo;s a great way to get a (free) feel for Scala.\nOfficial Documentation The official on-line documentation is always a tremendously useful resource, especially for classes that are similar to their Java counterparts, but with Scala magic sprinkled on top.\n","permalink":"/post/en-10-great-books-for-functional-programming-in-scala/","summary":"\u003cp\u003eThis article was co-authored by \u003ca href=\"https://blog.matthewrathbone.com/\"\u003eMatthew Rathbone\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003e\u003cimg alt=\"img\" loading=\"lazy\" src=\"https://d33wubrfki0l68.cloudfront.net/4b8a4dcbce3e4561018d5f8e84d92e8b5f05563d/f25fa/img/blog/scala-books/title.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eimage by \u003ca href=\"https://www.flickr.com/photos/thomasleuthard/19070717313\"\u003eThomas Leuthard\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eJames Gosling, creator of Java, said:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003e“If I were to pick a language to use today other than Java, it would be Scala.”\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eScala is a \u003cem\u003ehot\u003c/em\u003e language in software development today; it is used by a range of start-ups for application development and has been adopted as the unofficial language of big data software development thanks to frameworks like Spark. As a language it is less verbose than Java, and has a number of unique features that make it more flexible too. Scala is functional, object-oriented, and truly multi-threaded, so it provides a unique development environment. There\u0026rsquo;s so much to Scala that whatever stage of programming you\u0026rsquo;re at, you\u0026rsquo;ll probably want some books!\u003c/p\u003e","title":"[EN] 10+ Great Books for Functional Programming in Scala"},{"content":"Article co-authored by Martin Delobel and available on Medium.\n","permalink":"/post/why-combine-asynchronous-and-distributed-calculations-to-tackle-the-biggest-data-quality-challenges/","summary":"\u003cp\u003eArticle co-authored by Martin Delobel and available on \u003ca href=\"https://medium.com/decathlondigital/why-combine-asynchronous-and-distributed-calculations-to-tackle-the-biggest-data-quality-challenges-2e04dfc51401\"\u003eMedium\u003c/a\u003e.\u003c/p\u003e","title":"Why combine asynchronous and distributed calculations to tackle the biggest data quality challenges"},{"content":"This article was co-authored by Matthew Rathbone\nimage by Ed Robertson\nApache Spark is a super useful distributed processing framework that works well with Hadoop and YARN. Many industry users have reported it to be 100x faster than Hadoop MapReduce in certain memory-heavy tasks, and 10x faster while processing data on disk.\nWhile Spark has incredible power, it is not always easy to find good resources or books to learn more about it, so I thought I\u0026rsquo;d compile a list. I\u0026rsquo;ll keep this list up to date as new resources come out.\nThe books are roughly in the order that I recommend, but each has its unique strengths.\nLearning Spark: Lightning-Fast Big Data Analysis Learning Spark is in part written by Holden Karau, a Software Engineer at IBM\u0026rsquo;s Spark Technology Center and my former co-worker at Foursquare. Her book has been quickly adopted as a de-facto reference for Spark fundamentals and Spark architecture by many in the community. 
The book does a good job of explaining core principles such as RDDs (Resilient Distributed Datasets), in-memory processing and persistence, and how to use the Spark Interactive Shell.\nNon-core Spark technologies such as Spark SQL, Spark Streaming and MLlib are introduced and discussed, but the book doesn\u0026rsquo;t go into too much depth, instead focusing on getting you up and running quickly.\nCode examples are in Scala and Python.\nApache Spark in 24 Hours, Sams Teach Yourself Are you impatient? This book has been written for you!\nThe first pages talk about Spark\u0026rsquo;s overall architecture, its relationship with Hadoop, and how to install it. You\u0026rsquo;ll then learn the basics of Spark Programming such as RDDs, and how to use them using the Scala Programming Language. The last parts of the book focus more on the “extensions of Spark” (Spark SQL, Spark R, etc), and finally, how to administer, monitor and improve Spark performance.\nHigh Performance Spark: Best Practices for Scaling and Optimizing Apache Spark This is a brand-new book (all but the last 2 chapters are available through early release), but it has proven itself to be a solid read.\nAgain written in part by Holden Karau, High Performance Spark focuses on data manipulation techniques using a range of Spark libraries and technologies above and beyond core RDD manipulation. My gut is that if you\u0026rsquo;re designing more complex data flows as an engineer or data scientist then this book will be a great companion.\nSpark in Action Spark in Action tries to skip theory and get down to the nuts and bolts of doing stuff with Spark. The book will guide you through writing Spark Applications (with Python and Scala), understanding the APIs in depth, and Spark app deployment options.\nPro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark One of the key components of the Spark ecosystem is real-time data processing.\nAs the only book in this list focused exclusively on real-time Spark use, this book will teach you how to deploy a Spark real-time data processing application from scratch. It supports this with hands-on exercises and practical use-cases like on-line advertising, IoT, etc.\nMastering Apache Spark Despite its title, this is truly a book for beginners. It covers a lot of Spark principles and techniques, with some examples. It includes a bunch of screen-shots and shell output, so you know what is going on. This book won\u0026rsquo;t actually make you a Spark master, but it is a good (and fairly short) way to get started.\nSpark: Big Data Cluster Computing in Production If you are already a data engineer and want to learn more about production deployment for Spark apps, this book is a good start. You\u0026rsquo;ll learn how to monitor your Spark clusters, work with metrics, resource allocation, object serialization with Kryo, and more. The book also discusses file format details (e.g. sequence files), and overall talks in a little more depth about app deployment than the average Spark book.\nLearning Spark: Analytics With Spark Framework This book aims to be straight to the point: What is Spark? Who developed it? What are the use cases? What is the Spark-Shell? How to do Streaming with Spark? 
And how do you work with Spark on EC2 and GCE?\nThis is a self-published book, so you might find that it lacks the polish of other books in this list, but it does go through the basics of Spark, and the price is right.\nSpark Cookbook While Spark Cookbook does cover the basics of getting started with Spark, it focuses on how to implement machine learning algorithms and graph processing applications. The book also tries to cover topics like monitoring and optimization. A good audience for this book would be existing data scientists or data engineers looking to utilize Spark for the first time.\nSpark GraphX in Action GraphX is a graph processing API for Spark. It tries to be both flexible and high-performance (much like Spark itself).\nThis is probably the most in-depth book on GraphX available (honestly, it’s the only GraphX-specific book available at the time of writing). Spark GraphX in Action starts with the basics of GraphX, then moves on to practical examples of graph processing and machine learning. Overall I think it provides a great overview of the framework and a very practical jumping-off point.\nBig Data Analytics With Spark This is another book for getting started with Spark; Big Data Analytics also tries to give an overview of other technologies that are commonly used alongside Spark (like Avro and Kafka).\nIt is full of great and useful examples (especially in the Spark SQL and Spark Streaming chapters). Given the broad scope of the content in this book, it maintains a fairly high-level view of the ecosystem without going into too much depth. That said, it is yet another book that provides a great introduction to these technologies.\nBonus Resources Since Spark comes from a research lab at UC Berkeley (the AMPLab), the academic papers that originally described Spark are actually very useful. They allow you to dive deep into the Spark principles and understand exactly how things work under the hood. Also, each major Spark component usually has its own dedicated paper, which makes the reading even easier to break up.\nA good place to start is with the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. If your brain can grok academic writing, I even recommend reading it before you read one of the above books.\nHere are some of the other available papers, each introducing a major Spark component.\nSpark: Cluster Computing with Working Sets\nSpark SQL: Relational Data Processing in Spark\nMLlib: Machine Learning in Apache Spark\nGraphX: Unifying Data-Parallel and Graph-Parallel Analytics\nDiscretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters\nSparkR: Scaling R Programs with Spark\nAll the papers can be downloaded for free at http://spark.apache.org/research.html\nWrap Up Hopefully these books can provide you with a good view into the Spark ecosystem.
Learning a new technology is never easy, so if you have any other useful tips or tricks for your fellow learners, feel free to add them to the comments section below.\n","permalink":"/post/en-10-great-books-for-apache-spark/","summary":"\u003cp\u003eThis article was co-authored by \u003ca href=\"https://blog.matthewrathbone.com/\"\u003eMatthew Rathbone\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003e\u003cimg alt=\"img\" loading=\"lazy\" src=\"https://d33wubrfki0l68.cloudfront.net/8177dc9c6ec5935b75460f41e29cecfebe9a5c20/2662a/img/blog/books.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eimage by \u003ca href=\"https://unsplash.com/photos/eeSdJfLfx1A\"\u003eEd Robertson\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eApache Spark is a super useful distributed processing framework that works well with Hadoop and YARN. Many industry users have reported it to be 100x faster than Hadoop MapReduce for certain memory-heavy tasks, and 10x faster when processing data on disk.\u003c/p\u003e\n\u003cp\u003eWhile Spark has incredible power, it is not always easy to find good resources or books to learn more about it, so I thought I’d compile a list. I’ll keep this list up to date as new resources come out.\u003c/p\u003e","title":"[EN] 10+ Great Books for Apache Spark"},{"content":"Big Data…Really? A few years ago, I had a discussion with a mentor of mine about the career path I wanted to pursue, and I said: \u0026ldquo;Look, Big Data is something really great, and I want to become a Big Data Engineer later on!\u0026rdquo;, and his answer was: \u0026ldquo;Okay, but be cautious, Big Data is not a revolution, and just like the \u0026ldquo;Cloud\u0026rdquo;, marketers have done their jobs\u0026rdquo;. I didn\u0026rsquo;t trust his words back then, and\u0026hellip; You bet! He was right!\nEven though I’m a “Big Data Consultant” (after all, we\u0026rsquo;re just software engineers with some kind of knowledge used in the \u0026ldquo;Big Data Industry\u0026rdquo;), I’m not okay with the use of the term “Big Data”… Why? Because data has been growing for decades, and the discussion of “how can we handle all this data?” has always been current. In that sense, I prefer the term “Data of Unusual Size”, because the problem has always been contextual. Every time we realized that our architectures had become outdated and could no longer handle our data, we innovated!\nIn 10 years, the data we gather will be even bigger, and what are we going to call it? Big-Big Data? You got it!\nThe concepts behind Big Data\u0026hellip; Are they really a “revolution”? If you ask 10 specialists how they perceive \u0026ldquo;Big Data\u0026rdquo;, you are likely to get 10 different (and perhaps divergent) definitions. In a nutshell, it basically means: \u0026ldquo;I have a problem, which is that I have huge amounts of data, and I have to make sense out of it. To do so, I\u0026rsquo;ve got to think about new technologies, new architectures and new paradigms in order to process and aggregate this data, so it can become \u0026ldquo;humanly understandable\u0026rdquo;, and usable\u0026rdquo;.\nPeople might also refer to the \u0026ldquo;3Vs\u0026rdquo; originally introduced by Doug Laney in 2001: Volume, Variety and Velocity.
More Vs are added today, be they veracity, value or variability, but they are all, IMHO, questionable.\nOne of the key technical components behind \u0026ldquo;Big Data\u0026rdquo; today is the distributed computing paradigm, and this term was introduced in\u0026hellip; 1960, or even before! And Hadoop, the most popular Big Data framework so far (even though it is losing some of its popularity), is all about using commodity hardware to distribute computations across multiple nodes (low-cost servers) to handle complex workloads without paying too much\u0026hellip; And guess what? \u0026ldquo;Big Data\u0026rdquo; is based upon many other concepts that existed decades ago :).\nIn other words, what is introduced today as a \u0026ldquo;new concept\u0026rdquo; is based upon old paradigms gathered together\u0026hellip; So is \u0026ldquo;Big Data\u0026rdquo; a new concept that will save us from the apocalypse? Not really.\n… But the need is real So, don\u0026rsquo;t we need Big Data today? Well, don\u0026rsquo;t get me wrong, I\u0026rsquo;m not saying that it is useless, at all! I\u0026rsquo;m just warning you that the term \u0026ldquo;Big Data\u0026rdquo; is nothing but a buzz-term used by marketers and companies to sell their products and services to other companies; the need itself, though, is real.\nAll those \u0026ldquo;old\u0026rdquo; concepts gathered together, along with (truly) new ones such as Machine Learning, created a new era of data understanding and data insight that didn\u0026rsquo;t exist before. Don\u0026rsquo;t you ask yourself why the term Big Data is no longer mentioned in Gartner\u0026rsquo;s yearly hype cycle of emerging technologies? You guessed it :)\nAfter we gather data and are ready to do some processing, we have to choose the right ML algorithm, the right aggregates and the right relationships to pick the most effective solution in a given situation: make better sales, target the right audience and gain new customers, which is, after all, what it\u0026rsquo;s all about.\nToday, the industry doesn\u0026rsquo;t pay us for what we know; it pays us for what we can do with that knowledge. Following the same logic, the value of \u0026ldquo;Big Data\u0026rdquo; is only real when it comes with real insights, so companies can make precise decisions.\n","permalink":"/post/en-the-truth-behind-the-bigdata-buzz-word/","summary":"\u003ch2 id=\"big-datareally\"\u003eBig Data…Really?\u003c/h2\u003e\n\u003cp\u003eA few years ago, I had a discussion with a mentor of mine about the career path I wanted to pursue, and I said: \u0026ldquo;Look, Big Data is something really great, and I want to become a Big Data Engineer later on!\u0026rdquo;, and his answer was: \u0026ldquo;Okay, but be cautious, Big Data is not a revolution, and just like the \u0026ldquo;Cloud\u0026rdquo;, marketers have done their jobs\u0026rdquo;. I didn\u0026rsquo;t trust his words back then, and\u0026hellip; You bet! He was right!\u003c/p\u003e","title":"[EN] The Truth Behind the Bigdata Buzz Word"}]