Big Data: An Introduction
So, this is going to be a quick introduction covering what I’ve recently learned about big data, both the concepts & the technologies involved. It helps to be somewhat familiar with data-related technologies, whether that’s SQL or NoSQL, but it’s by no means a must. If, like me, you’re not a DBA or data warehouse engineer, it may also help to have some understanding of a technology such as Kubernetes, since some of the Apache Hadoop tools follow a similar fundamental idea of a cluster of cooperating machines; strictly speaking it’s the other way around, since Hadoop predates Kubernetes. If any of the following information is incorrect or somewhat opinionated, please keep in mind that I’ve only very recently begun learning about big data & the associated technologies.
The first thing I’ve learned about big data is that it’s not just millions of records; that volume is quite common & very easy to work with in a traditional RDBMS. Big data really refers to data sets so large that traditional tooling struggles to store & process them, typically ranging from terabytes up into petabytes & beyond. The second thing I’ve learned, which as a huge fan of the functional paradigm makes a lot of sense to me, is that the data should always be immutable. Sure, you may want to update or delete records, but at a low level you can think of those updates as new versions of the data rather than changes made in place.
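To make that idea a little more concrete, here’s a minimal sketch in plain Python of treating an update as a new, timestamped version of a fact rather than overwriting anything in place. The record & field names here are entirely made up for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Each "update" is just a new, timestamped fact appended to the log.
# Nothing is ever mutated or deleted in place.
@dataclass(frozen=True)
class FavouriteColour:
    user: str
    colour: str
    recorded_at: datetime

log = [
    FavouriteColour("alice", "blue", datetime(2023, 1, 1, tzinfo=timezone.utc)),
    FavouriteColour("alice", "green", datetime(2023, 6, 1, tzinfo=timezone.utc)),  # a later "update"
]

def current_value(log, user):
    """The current state is derived from the log: the latest version wins."""
    versions = [fact for fact in log if fact.user == user]
    return max(versions, key=lambda fact: fact.recorded_at).colour

print(current_value(log, "alice"))  # green
```

Deriving the current state from an append-only log like this is the same basic idea the big data world applies at a much larger scale.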
With big data, I personally like what’s known as the lambda architecture, which in my opinion is an effective way to work with big data. The basic idea is that you have a batch layer: this layer is slow, but that’s by design, since it’s used to store & process an immense amount of data & produce complete, accurate views of it. Then you have the speed layer, which, as you may have guessed, is there for speed: it streams & processes data in real time. For this layer you may wish to look at Apache Spark as a real-world implementation. The speed layer’s results may not be as complete or as accurate as the batch layer’s, but that trade-off is what keeps it fast. Finally there’s the serving layer, which merges the output of the batch layer with the speed layer’s real-time views & answers queries.
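Here’s a rough, library-free sketch of how those three layers fit together. The function & variable names are entirely made up, & a real system would use something like Hadoop for the batch layer & Spark for the speed layer rather than in-memory counters:

```python
from collections import Counter

# Batch layer: slow, recomputed from the full master data set.
# Here the "batch view" is a precomputed count of page views per page.
master_dataset = ["/home", "/about", "/home", "/pricing", "/home"]
batch_view = Counter(master_dataset)

# Speed layer: only covers events that arrived since the last batch run,
# so it stays small & fast, at the cost of being less complete.
realtime_view = Counter()

def on_new_event(page):
    realtime_view[page] += 1  # processed immediately as it streams in

# Serving layer: answers queries by merging the batch view with the
# real-time view, so results are both (mostly) complete & up to date.
def query(page):
    return batch_view[page] + realtime_view[page]

on_new_event("/home")
print(query("/home"))  # 3 from the batch view + 1 from the speed layer = 4
```

The point of the split is that the batch layer can be recomputed from scratch from the immutable master data set, while the speed layer only ever has to cover the small window of data that the last batch run hasn’t caught up with yet.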
Hadoop
Hadoop, a technology from our beloved Apache, is the one I’ve already said is somewhat similar to Kubernetes. To add a bit more to that statement: you run a Hadoop cluster, whose storage layer is the Hadoop Distributed File System (HDFS). This is the part that reminds me most of how Kubernetes works, since you have a cluster of nodes all working together, but Hadoop expands far beyond just that. Officially, Hadoop is an open source infrastructure software solution for storing & processing big data, i.e. very large data sets.
HDFS specifically will split a file, or a large number of files, into blocks & spread them across the nodes in the cluster. This is one of the things that makes HDFS pretty awesome: you can work with a single file that is terabytes in size. You may also have heard of MapReduce; it’s the component responsible for processing the data stored across the nodes in the cluster. The clever part is that MapReduce distributes the processing itself, sending the work to the nodes where the data already lives rather than pulling all of the data to one place.
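The classic example is a word count. Below is a minimal, self-contained Python sketch of the map & reduce steps; on a real cluster these two functions would run as separate tasks spread across the nodes (for instance via Hadoop Streaming), rather than in one process like this:

```python
from collections import defaultdict

def mapper(line):
    # Map step: each node emits (word, 1) pairs for its chunk of the file.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: all counts for the same word end up at the same reducer.
    return word, sum(counts)

lines = ["Big data is big", "Hadoop processes big data"]

# Shuffle phase: group the mapper output by key before reducing.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

results = dict(reducer(word, counts) for word, counts in grouped.items())
print(results)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

Because each mapper only needs its own chunk of the file & each reducer only needs the values for its own keys, the same logic scales from two lines of text to a file spread across an entire cluster.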
You may have heard of many other tools, such as Hive or Pig; there are plenty of tools associated with the Hadoop ecosystem, & I have little doubt that more are on the way.