Thursday, October 29, 2020

Scala & Spark for Managing & Analyzing Big Data (Using Machine Learning)


Managing & Analyzing Big Data using Scala & Apache Spark


In this blog we will see how to use Scala and Spark to analyze Big Data. When I first started learning about the Big Data world, everything was about Hadoop, as you can see from my previous blogs:

There are 10 essential V's in Big Data (As published in the International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056)


A lot has changed since then. Spark is quickly becoming one of the most powerful Big Data tools; for workloads that fit in memory, it can run programs up to 100 times faster than Hadoop MapReduce. However, Spark gets that speed by processing data in memory. If velocity is not at the top of your priority list, go with MapReduce, because it can handle avalanches of data without eating up too much memory. Hadoop first stores the data and then processes it; its intended purpose was to process data collected from many different sources. We need to decide which V is critical for our business and the problem statement at hand. I will focus on Velocity for the moment.

Because Spark itself is written in Scala, the Scala API is Spark's native interface: your code runs on the same JVM as the engine and avoids the cross-language overhead that other language bindings can incur.
This blog is based on a Udemy course through which I learned the latest Spark 2.0 methods and used the latest MLlib, Spark SQL, and DataFrames!


For installation guidance and the basics, you can follow my previous blogs:

SCALA: Getting Started

Scala is a general-purpose programming language that supports both object-oriented and functional programming. The language has a strong static type system. Scala is designed to be concise, and many of its design decisions are aimed at addressing criticisms of Java. In fact, Scala source code is compiled to Java bytecode and runs on the Java Virtual Machine (JVM). If you are coming from a Java background, that is an added advantage, because Java libraries can be used directly in Scala. Scala is so popular mainly because of Apache Spark, which is written in Scala.
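
To make the points above concrete, here is a minimal sketch that mixes the two styles and calls a Java library directly. The `Reading` type, the `ScalaBasics` object, and the sample values are made up for illustration; only `java.time.LocalDate` comes from the standard Java library.

```scala
import java.time.LocalDate // a plain Java class, used directly from Scala

// Object-oriented side: an immutable, statically typed record (hypothetical example type).
final case class Reading(sensor: String, day: LocalDate, value: Double)

object ScalaBasics {
  // Functional side: a pure function built from higher-order collection methods.
  def averageBySensor(readings: Seq[Reading]): Map[String, Double] =
    readings
      .groupBy(_.sensor) // group the readings by sensor name
      .map { case (sensor, rs) => sensor -> rs.map(_.value).sum / rs.size }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("a", LocalDate.of(2020, 10, 29), 1.0),
      Reading("a", LocalDate.of(2020, 10, 30), 3.0),
      Reading("b", LocalDate.of(2020, 10, 29), 5.0)
    )
    val averages = averageBySensor(readings)
    println(averages("a")) // 2.0
    println(averages("b")) // 5.0
  }
}
```

The compiler infers most types here, yet a mistake such as passing a `String` where a `LocalDate` is expected would still be rejected at compile time.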

Apache Spark: Getting Started

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
To understand Scala in the context of Big Data, we first need to understand the concept of Big Data itself, as I mentioned at the start of this blog.
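
As a rough illustration of what "implicit data parallelism" means in practice, here is a small word-count sketch. It assumes the Spark SQL library (Spark 2.x) is on the classpath and runs in local mode; the object name `WordCountSketch`, the sample sentences, and the choice of 4 partitions are my own illustrative assumptions, not something taken from the course.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Counts words across the given lines using Spark's RDD API.
  // We never write a thread or a loop over machines: Spark decides
  // how to run each step on the partitions in parallel.
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] =
    spark.sparkContext
      .parallelize(lines, numSlices = 4)    // distribute the data over 4 partitions
      .flatMap(_.toLowerCase.split("\\s+")) // applied to each partition independently
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // merges partial counts across partitions
      .collect()                            // brings the (small) result back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]") // local mode: use all cores of this machine
      .getOrCreate()
    try {
      val counts = wordCounts(spark, Seq("big data with spark", "spark with scala"))
      counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word -> $n") }
    } finally spark.stop()
  }
}
```

The same code would run on a cluster by changing only the master setting. The fault tolerance comes from the fact that Spark remembers the chain of transformations behind each RDD, so a lost partition can be recomputed rather than restored from a copy.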


Kinshuk Dutta
New York
