Prodigal Pundit
This blog is about different trends in the Data Management ecosphere. Master data (Reference data, Meta data), Big Data, Data Science - R, IoT, Container (Docker), AWS
Thursday, October 29, 2020
Scala & Spark for Managing & Analyzing Big Data (Using Machine Learning)
Thursday, April 23, 2020
SOLR Search - CookBook
SOLR
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
Wednesday, January 30, 2019
Introduction to Data Science with R & Python
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning, and big data.
Data science is a "concept to unify statistics, data analysis, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge, and information science.
R or Python?
Why use R for Data Science?
1. Academia:
R is a very popular language in academia. Many researchers and scholars use R for experimenting with data science, and many popular books and learning resources on data science use R for statistical analysis as well. Because so many people study R during their academic years, there is a large pool of skilled statisticians who bring this knowledge with them when they move to industry. This in turn increases the traction of the language.
2. Data wrangling:
Data wrangling is the process of cleaning messy and complex data sets so that they are convenient to consume and analyze further. It is a very important and time-consuming step in data science. R has an extensive library of tools for data and database manipulation and wrangling. Some of the popular packages for data manipulation in R include:
dplyr Package – Created and maintained by Hadley Wickham, dplyr is best known for its data exploration and transformation capabilities and highly adaptive chaining syntax.
data.table Package – It allows faster manipulation of data sets with minimal code. It simplifies data aggregation and drastically reduces compute time.
readr Package – ‘readr’ helps in reading various forms of data into R. It does not convert character columns into factors, and it is reported to be roughly 10x faster than the base R readers.
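To make the idea of data wrangling concrete, here is a minimal sketch in plain Python using only the standard library. It is not one of the R packages listed above, and the records, field names, and helper functions are made up for illustration. It trims stray whitespace, normalizes case, treats an empty age as missing, and then aggregates by city.

```python
import csv
import io

# Hypothetical messy input: stray whitespace, inconsistent case, a missing value.
RAW = """name, city ,age
 Alice ,new york,34
BOB,Boston,
carol , NEW YORK ,29
"""

def wrangle(text):
    """Return cleaned rows: trimmed fields, title-cased text, ages as int or None."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows = []
    for raw_row in reader:
        if not raw_row:
            continue  # skip blank lines
        record = dict(zip(header, (v.strip() for v in raw_row)))
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        record["age"] = int(record["age"]) if record["age"] else None
        rows.append(record)
    return rows

def mean_age_by_city(rows):
    """Group by city and average the known ages (missing ages are skipped)."""
    groups = {}
    for r in rows:
        if r["age"] is not None:
            groups.setdefault(r["city"], []).append(r["age"])
    return {city: sum(ages) / len(ages) for city, ages in groups.items()}

cleaned = wrangle(RAW)
print(mean_age_by_city(cleaned))
```

In R, packages such as dplyr or data.table express the same clean-then-aggregate steps far more compactly.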
3. Data visualization:
Data visualization is the visual representation of data in graphical form. This allows analyzing data from angles that are not clear in unorganized or tabulated data. R has many tools that can help in data visualization, analysis, and representation. The R packages ggplot2 and ggedit have become the standard plotting packages. While the ggplot2 package is focused on visualizing data, ggedit helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics precisely correct.
4. Specificity:
R is a language designed especially for statistical analysis and data reconfiguration. R libraries share one focus: making data analysis easier, more approachable, and more detailed. New statistical methods often appear first as R libraries. This makes R a strong choice for data analysis and projection. Members of the R community are active and supportive, and many of them have strong knowledge of both statistics and programming. All of this gives R a special edge for data science projects.
5. Machine learning:
At some point in a data science project, a programmer may need to train an algorithm and bring in automation and learning capabilities to make predictions possible. R provides ample tools for developers to train and evaluate an algorithm and predict future events. Thus, R makes machine learning (a branch of data science) much easier and more approachable. The list of R packages for machine learning is extensive. It includes mice (for handling missing values), rpart and party (for recursive partitioning, that is, decision trees), caret (for classification and regression training), randomForest (for random forests built from many decision trees), and many more.
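The train-then-evaluate workflow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is made up, the model is a simple least-squares line rather than anything from the R packages named above, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and actual values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data that follows y = 2x + 1 with a small alternating wobble.
xs = list(range(10))
ys = [2 * x + 1 + (0.1 if x % 2 else -0.1) for x in xs]

# Train on the first 7 points and hold out the last 3 for evaluation.
train_x, test_x = xs[:7], xs[7:]
train_y, test_y = ys[:7], ys[7:]

model = fit_line(train_x, train_y)
print("slope, intercept:", model)
print("held-out MSE:", mean_squared_error(model, test_x, test_y))
```

Evaluating on points the model never saw during training is the key design choice here: it gives an honest estimate of how well the predictions generalize.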
6. Availability:
The R programming language is open source and runs on all major operating systems. R is released under the GNU General Public License, which makes it highly cost-effective for a project of any size. Because it is open source, development in R moves quickly and the community of developers is huge. All of this, along with a tremendous amount of learning resources, makes R a good choice for getting started with data science. Because many new developers are exploring R, it is also easier and more cost-effective to recruit R developers or to outsource work to them.
Why use Python for Data Science?
Python is the programming language of choice for data science. Here’s a brief history:
- In 2016, it overtook R on Kaggle, the premier platform for data science competitions.
- In 2017, it overtook R on KDNuggets’s annual poll of data scientists’ most used tools.
- In 2018, 66% of data scientists reported using Python daily, making it the number one language for analytics professionals.
Installing R on Mac OS
If you have followed my blog, you will know that I use Homebrew to install packages on Mac. I used it to install R as well.
Installing R Studio
RStudio 1.2.1335 - Mac OS X 10.12+ (64-bit)
Source Code
A tarball containing source code for RStudio v1.1.463 can be downloaded from here.
Wednesday, January 9, 2019
Deep Learning Using TensorFlow
What is Deep Learning?
As we saw in our previous blog, AI - Machine Learning & Deep Learning, deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence.
What is TensorFlow?
To implement machine learning, TensorFlow provides software libraries for building computational graphs. I selected TensorFlow mainly because of my love for open source. You can get more information on TensorFlow from its website.
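To show what a computational graph means, here is a toy sketch in plain Python. It is not the TensorFlow API: the Node class and the const, add, and mul helpers are hypothetical names chosen for this illustration. The point is that the graph is built first as a data structure and only computed when it is evaluated.

```python
class Node:
    """A node in a toy computational graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def evaluate(self):
        """Recursively evaluate the inputs, then apply this node's operation."""
        if self.op == "const":
            return self.value
        args = [node.evaluate() for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(f"unknown op: {self.op}")

def const(v):
    return Node("const", value=v)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

# Build the graph for (2 * 3) + 4 first; nothing is computed at this point.
graph = add(mul(const(2), const(3)), const(4))

# Evaluation walks the graph from the output node back to its inputs.
print(graph.evaluate())
```

Separating graph construction from evaluation is what lets a real framework analyze, optimize, and distribute the computation before running it.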
This is my step by step guide to installing TensorFlow on macOS Mojave.
Model Name: MacBook Pro
Model Identifier: MacBookPro15,1
Processor Name: Intel Core i9
Processor Speed: 2.9 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 12 MB
Memory: 32 GB
OS: macOS 10.14.3 (18D32a)
Package Manager: Home Brew
Big Data: Hadoop
Cluster Computing Framework: Apache Spark 2.0.0
Install Xcode 8.0.
https://developer.apple.com/xcode/downloads/
If 8.0 is installed, you may need to:
sudo xcode-select --switch /path/to/Xcode-beta.app
bash-3.2$ sudo xcode-select --switch /Applications/Xcode-beta.app
Install TensorFlow
Open a terminal on the Mac: press CMD + Space and type Terminal, or open Terminal from the launcher.
$ brew install tensorflow
==> Downloading https://homebrew.bintray.com/bottles/libtensorflow-1.12.0.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtensorflow-1.12.0.mojave.bottle.tar.gz
🍺 /usr/local/Cellar/libtensorflow/1.12.0: 9 files, 158.7MB
AI - Machine Learning & Deep Learning
Artificial Intelligence (AI) - Machine Learning (ML) & Deep Learning (DL)
Artificial Intelligence (AI): When a machine or a system becomes smart enough to decide what to do next based on a given situation, it is known as Artificial Intelligence.
Machine Learning (ML): The ability of a machine to learn independently, without explicit programming for a given scenario, is called machine learning. Machine learning involves a lot of complex math and coding that enables the computer to identify patterns in a large set of data, adjust its algorithm based on those patterns, and get progressively better at its task over time.
“Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions”
Figure: Conceptual Neural Network
Deep Learning (DL): Deep learning is considered an evolution of machine learning. It uses a programmable neural network that enables machines to make accurate decisions without help from humans. A better way to impart intelligence is to follow the mechanism used by the human brain. This mechanism is modeled by a neural network, which tries to imitate the processes of the human brain with layers of nodes linked together in different ways. Each additional layer requires a large increase in computing power. This layered approach deals with millions of parameters, which need to be processed by GPUs. As GPUs become more and more affordable, so does the ability to create multi-layered deep neural networks.
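The layered structure described above can be sketched in plain Python. This is a toy forward pass only (no training), with hypothetical layer sizes, random untrained weights, and made-up helper names. It also counts parameters, which shows how quickly the total grows as layers are added.

```python
import math
import random

def make_layer(n_inputs, n_outputs, rng):
    """One fully connected layer: a weight matrix and a bias vector."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases

def forward(layers, inputs):
    """Pass the inputs through each layer, applying a sigmoid at every node."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activations

def count_parameters(layers):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)  # fixed seed so runs are repeatable
sizes = [4, 8, 8, 2]    # hypothetical: 4 inputs, two hidden layers of 8, 2 outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

print(forward(layers, [0.5, -0.2, 0.1, 0.9]))
print("parameters:", count_parameters(layers))
```

Even this tiny network has over a hundred parameters; real deep networks repeat the same pattern with far wider and deeper layers, which is where GPUs become necessary.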
In the subsequent blogs we will try to learn some basic ML as well as basic DL implementation techniques.
Kinshuk Dutta
Tuesday, October 2, 2018
Scala Basics
What is Scala?
Basic Understanding & Implementation - Scala on Mac
This is my step by step guide to installing Scala on macOS Sierra.
Installation
Scala Plugin for Eclipse
Scala IDE for Eclipse is best installed (and updated) directly from within Eclipse. To do this, use Help → Install New Software... and click the Add... button in the dialog. Choose a name for the update site (Scala IDE). Then follow the instructions on screen.
Scala Basics!!
The Scala REPL is an informative tool that logs messages on how our code is interpreted and executed.
The scala command will execute a source script by wrapping it in a template and then compiling and executing the resulting program.
In interactive mode, the REPL reads expressions at the prompt, wraps them in an executable template, and then compiles and executes the result.
Previous results are automatically imported into the scope of the current expression as required.
The REPL also provides some command facilities, described below.
An alternative REPL is available in the Ammonite project, which also provides a richer shell environment.
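The read, evaluate, print cycle and the way previous results stay in scope can be imitated with a toy loop in Python. This is only a sketch of the idea and not how the Scala REPL is implemented: the real REPL compiles each expression, while this version uses Python's eval. The run_session function is a hypothetical name, and the res0, res1 naming mirrors the names the Scala REPL gives to unnamed results.

```python
def run_session(lines):
    """Evaluate each expression in turn, binding results to res0, res1, ..."""
    scope = {}
    transcript = []
    for source in lines:
        # Read and evaluate: earlier results are visible through `scope`.
        value = eval(source, {"__builtins__": {}}, scope)
        # Bind the result to the next resN name so later lines can use it.
        name = f"res{len(transcript)}"
        scope[name] = value
        # Print: record what an interactive prompt would echo back.
        transcript.append(f"{name} = {value!r}")
    return transcript

for line in run_session(["1 + 2", "res0 * 10", "res1 - res0"]):
    print(line)
```

Each expression can refer to any earlier result by name, which is the behavior described above as previous results being imported into the scope of the current expression.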
Values & Variables
New York
Monday, July 9, 2018
Implementation - HomeBridge - It's the answer to Impatient HomeKit Enthusiast
Enabling incompatible Accessories to HomeKit using Homebridge
Homebridge on Mac
My usual:
After you have installed Xcode and Node.js, you will have to set up a Node.js server. This is your Homebridge server:
sudo npm install -g --unsafe-perm homebridge
Homebridge will be installed using the npm package manager. Wait until the process is complete, then type
homebridge
into Terminal to launch it.
Add Homebridge to HomeKit
Kinshuk Dutta