
Big Data: An Introduction

This is a quick introduction to what I’ve recently learned about big data, covering both the concepts & the technologies used. It helps if you’re somewhat familiar with data-related technologies, whether that’s SQL or NoSQL, but it’s by no means a must. If, like me, you’re not a DBA or data warehouse engineer, it may also help to have some understanding of technologies such as Kubernetes, since some of the Apache Hadoop tools follow a similar fundamental idea (or rather the other way around, since Hadoop is older than Kubernetes). If any of the following information is incorrect or somewhat opinionated, please keep in mind that I’ve only very recently begun learning about big data & the associated technologies.

The first thing I’ve learned about big data is that it’s not just millions of records; that’s quite common & very easy to work with in a traditional RDBMS. Big data really refers to extremely large data sets, so large that no one person can even begin to comprehend the sheer size & volume of the data. I’m talking in the region of zettabytes. The second thing I’ve learned, which makes a lot of sense to me as a huge fan of the functional paradigm, is that the data should always be immutable. Sure, you may want to update or delete records, but from a very low-level perspective you can think of each update as a new version of the data rather than a change to the existing record.
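To make the immutability idea a little more concrete, here’s a minimal sketch in Python. Everything in it (the `Fact` class, the `append` & `current_view` helpers, the example keys) is made up by me for illustration & isn’t taken from any big data tool. The assumption is simply that an "update" or "delete" is recorded as a new version appended to a log, & the current state is derived by replaying that log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes each fact read-only once created
class Fact:
    key: str
    value: Optional[str]  # None marks a deletion (a "tombstone")
    version: int


def append(log: tuple, key: str, value: Optional[str]) -> tuple:
    """Return a new log with one more fact; the old log is left untouched."""
    return log + (Fact(key, value, len(log) + 1),)


def current_view(log: tuple) -> dict:
    """Derive the latest state by replaying every fact in order."""
    view = {}
    for fact in log:
        if fact.value is None:
            view.pop(fact.key, None)  # a tombstone hides the key
        else:
            view[fact.key] = fact.value
    return view


log = ()
log = append(log, "user:1", "alice@example.com")
log = append(log, "user:1", "alice@new.example.com")  # an "update"
log = append(log, "user:2", "bob@example.com")
log = append(log, "user:2", None)                      # a "delete"

print(current_view(log))
print(log[0])  # the very first version is still there
```

The nice part is that nothing is ever lost: the old email address for `user:1` is still in the log, so you could rebuild the state as it was at any earlier version.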

With big data, I personally like what is known as the lambda architecture, which in my opinion is an effective way to use big data. The basic idea is that you have a batch layer, which is slow, but this is somewhat by design, since it’s used to store & process an immense amount of data. Then you have the speed layer, which, as you may have guessed, is there for its speed: it’s meant to stream & process data in real time. FYI, for this layer you may wish to look at Apache Spark & its streaming support as a real-world implementation. The speed layer may not be as complete or as accurate as the batch layer, but that trade-off is what keeps it fast. Finally, you have the serving layer, which combines the output of the batch layer with the output of the speed layer so that the result can be queried.
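Here’s a toy sketch of how the three layers fit together, using page-view counts as the example. The function names (`batch_view`, `speed_view`, `serve`) & the event shape are my own inventions, & I’m assuming the serving layer answers queries by merging the batch output with the speed output, which is how the lambda architecture is usually described. In a real system each layer would be a separate distributed system rather than a Python function.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: slow, but recomputed from the complete data set."""
    return Counter(event["page"] for event in master_events)


def speed_view(recent_events):
    """Speed layer: only covers events that arrived since the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed


# Events already processed by the last batch run.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
# Events that arrived afterwards & have only been seen by the speed layer.
recent = [{"page": "/home"}, {"page": "/contact"}]

merged = serve(batch_view(master), speed_view(recent))
print(merged["/home"])     # 3
print(merged["/contact"])  # 1
```

Once the next batch run has absorbed the recent events, the speed view for them can be thrown away, which is why it’s fine for the speed layer to be a little less accurate.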

Hadoop

Hadoop is a technology from our beloved Apache. I’ve already said that it’s somewhat similar to Kubernetes, & to add more to that statement: you can have a Hadoop cluster, & one of its core components is the Hadoop Distributed File System (HDFS). This is one of the tools that reminds me a lot of how Kubernetes works, since you have a cluster of nodes all working together, but Hadoop expands far beyond just this. Officially, Hadoop is an open source software solution for storing & processing big data or large data sets.

HDFS splits large files into blocks & spreads those blocks across different nodes within the cluster. This is one thing that makes HDFS pretty awesome: you can work with a file that is terabytes in size, even if it wouldn’t fit on a single machine. You may also have heard of MapReduce, which is responsible for processing the data. A clever feature of MapReduce is that it distributes the processing across the nodes within the cluster, so the work runs in parallel.
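To get a feel for both ideas, here’s a small single-machine sketch in plain Python. It doesn’t use Hadoop at all: the helpers (`blocks_needed`, `map_phase`, `shuffle`, `reduce_phase`) are names I’ve made up, the two strings in `chunks` just stand in for blocks that would live on different nodes, & I’m assuming a 128 MiB block size, which is the default in recent Hadoop versions but is configurable.

```python
from collections import defaultdict
from itertools import chain

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in HDFS)


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """How many blocks a file of file_size bytes is split into (rounded up)."""
    return -(-file_size // block_size)


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block stored on a different node.
chunks = ["big data is big", "data is immutable"]

# On a real cluster the map calls would run in parallel, one per block.
mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
counts = reduce_phase(shuffle(mapped))

print(blocks_needed(1024 ** 4))  # blocks for a 1 TiB file
print(counts)
```

The map step only ever looks at its own chunk, which is what makes it easy to spread across many nodes; only the shuffle step needs to move data between them.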

You may have heard of tools such as Hive or Pig. There are many tools associated with the Hadoop ecosystem & I have little doubt that there are more to come.
