Wednesday, January 30, 2019

Introduction to Data Science with R & Python

What is Data Science?



Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning, and big data.

Data science is a "concept to unify statistics, data analysis, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge, and information science.

R or Python?

Why use R for Data Science?

1. Academia:

R is a very popular language in academia. Many researchers and scholars use R for experimenting with data science. Many popular books and learning resources on data science use R for statistical analysis as well. Since it is a language preferred by academics, there is a large pool of people who have a good working knowledge of R programming. Put differently, if many people study R during their academic years, this creates a large pool of skilled statisticians who can use that knowledge when they move to industry, which in turn increases traction for the language.

2. Data wrangling:

Data wrangling is the process of cleaning messy and complex data sets so that they are convenient to consume and analyze further. It is a very important and time-consuming step in data science. R has an extensive library of tools for data and database manipulation and wrangling. Some of the popular packages for data manipulation in R include:
dplyr Package – Created and maintained by Hadley Wickham, dplyr is best known for its data exploration and transformation capabilities and its highly adaptive chaining syntax.
data.table Package – It allows faster manipulation of data sets with minimal coding. It simplifies data aggregation and drastically reduces compute time.
readr Package – readr helps in reading various forms of data into R. It does not convert character columns into factors, and it is reported to be up to 10x faster than the base R readers.
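Since this post covers both R and Python, here is a minimal Python sketch of the same wrangling idea, using only the standard library. The sample data, the column names, and the helper names `clean_rows` and `mean_by_city` are all made up for illustration; the R packages above do this with far less code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical messy input: stray whitespace, inconsistent case, a missing value.
RAW = """city, temp_c
 New York ,3.5
new york,4.5
Boston,
 boston ,1.0
"""

def clean_rows(text):
    """Yield (city, temp) tuples, skipping rows whose temperature is missing."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        city, temp = (field.strip() for field in row)
        if not temp:
            continue  # drop rows with a missing measurement
        yield city.title(), float(temp)

def mean_by_city(rows):
    """Group cleaned rows by city and return the mean temperature per city."""
    groups = defaultdict(list)
    for city, temp in rows:
        groups[city].append(temp)
    return {city: sum(vals) / len(vals) for city, vals in groups.items()}

result = mean_by_city(clean_rows(RAW))
print(result)  # {'New York': 4.0, 'Boston': 1.0}
```

The cleaning step (trim, normalize case, drop missing values) is kept separate from the aggregation step, which is roughly how a chained dplyr pipeline reads as well.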
3. Data visualization:

Data visualization is the visual representation of data in graphical form. It allows you to analyze data from angles that are not clear in unorganized or tabulated data. R has many tools that can help in data visualization, analysis, and representation. The R packages ggplot2 and ggedit have become standard plotting packages. While ggplot2 is focused on visualizing data, ggedit helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics precisely correct.
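To show why a picture helps, here is a small Python sketch that draws a text histogram with the standard library only. It is not a substitute for ggplot2; the scores and the function names `bin_counts` and `text_histogram` are hypothetical.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    return Counter((v // bin_width) * bin_width for v in values)

def text_histogram(values, bin_width):
    """Render one line per bin, with one '#' per value in that bin."""
    counts = bin_counts(values, bin_width)
    return [f"{start:3d}+ {'#' * counts[start]}" for start in sorted(counts)]

# Hypothetical exam scores; the clustering is hard to see in the raw list.
scores = [52, 55, 61, 63, 64, 67, 68, 71, 72, 74, 75, 78, 79, 83, 91]
for line in text_histogram(scores, 10):
    print(line)
```

Even in this crude form, the bulge in the 60s and 70s is visible at a glance, which is the point the paragraph above makes about tabulated data.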
4. Specificity:

R is a language designed especially for statistical analysis and data reconfiguration. R libraries focus on one thing: making data analysis easier, more approachable, and more detailed. New statistical methods are often available first through R libraries. This makes R a strong choice for data analysis and projection. Members of the R community are active and supportive, and they have great knowledge of statistics as well as programming. All of this gives R a special edge for data science projects.
5. Machine learning:

At some point in data science, a programmer may need to train an algorithm and bring in automation and learning capabilities to make predictions possible. R provides ample tools for developers to train and evaluate an algorithm and predict future events. Thus, R makes machine learning (a branch of data science) much easier and more approachable. The list of R packages for machine learning is really extensive. It includes mice (for imputing missing values), rpart and party (for recursive partitioning, i.e. decision trees), caret (for classification and regression training), randomForest (for random forests, which are ensembles of decision trees), and many more.
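The train, evaluate, and predict cycle described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: mean imputation stands in for what a package like mice does far more carefully, and a one-split "decision stump" stands in for a full decision tree. The data and all function names are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fit_stump(xs, ys):
    """Find the threshold t that best separates labels with the rule x > t -> 1."""
    best_t, best_correct = None, -1
    for t in sorted(set(xs)):
        correct = sum((x > t) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def predict(t, xs):
    """Apply the learned rule to new inputs."""
    return [1 if x > t else 0 for x in xs]

def accuracy(pred, ys):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(pred, ys)) / len(ys)

# Hypothetical data: hours studied (with one missing value) and pass/fail labels.
train_x = impute_mean([1.0, 2.0, None, 6.0, 7.0, 8.0])
train_y = [0, 0, 0, 1, 1, 1]
test_x, test_y = [1.5, 3.0, 6.5, 9.0], [0, 0, 1, 1]

threshold = fit_stump(train_x, train_y)
test_accuracy = accuracy(predict(threshold, test_x), test_y)
print(threshold, test_accuracy)
```

The model is fitted on one set of rows and scored on rows it has never seen, which is the same discipline caret automates in R.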

6. Availability:

R is open source and runs on all major operating systems. It is licensed under the GNU General Public License, which makes it highly cost-effective for a project of any size. Because it is open source, development in R happens at a rapid pace and the community of developers is huge. All of this, along with a tremendous amount of learning resources, makes R a good choice for getting started with data science. Because many new developers are exploring R, it is also easier and more cost-effective to recruit or outsource to R developers.


Why use Python for Data Science?

Python is the programming language of choice for data science. Here’s a brief history:

Installing R on Mac OS


If you have followed my blogs, you will know that I use Homebrew to install packages on Mac. I used it to install R as well.

$ brew install R
Updating Homebrew...
==> Auto-updated Homebrew!
Updated Homebrew from 76b3c3dbe to 0f270d811.
Updated 4 taps (homebrew/cask-versions, homebrew/core, homebrew/cask and adoptopenjdk/openjdk).
==> New Formulae
anyenv                                                                        aws-iam-authenticator                                                         s3ql
==> Updated Formulae
ffmpeg           bison              consul             erlang             gibo               helmfile           kotlin             mill               plantuml           sshuttle           visp               youtube-dl
jenkins           bundletool         coreutils          exploitdb          gmsh               htmldoc            krakend            minio              presto             switch-lan-play    whois
ask-cli            certbot            dnsperf            ffmpeg@2.8         gnupg-pkcs11-scd   hub                kubectx            minio-mc           pulumi             topgrade           wtf
aws-sdk-cpp        checkstyle         easyengine         fn                 gnutls             jbake              libgit2            nim                pycodestyle        traefik            xcodegen
bettercap          composer           emscripten         fonttools          goreleaser         jhipster           libgpg-error       nss                siril              unrar              you-get

==> Installing dependencies for r: gettext, libpng, openblas and pcre
==> Installing r dependency: gettext
==> Downloading https://homebrew.bintray.com/bottles/gettext-0.19.8.1.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring gettext-0.19.8.1.mojave.bottle.tar.gz
==> Caveats
gettext is keg-only, which means it was not symlinked into /usr/local,
because macOS provides the BSD gettext library & some software gets confused if both are in the library path.

If you need to have gettext first in your PATH run:
  echo 'export PATH="/usr/local/opt/gettext/bin:$PATH"' >> ~/.bash_profile

For compilers to find gettext you may need to set:
  export LDFLAGS="-L/usr/local/opt/gettext/lib"
  export CPPFLAGS="-I/usr/local/opt/gettext/include"

==> Summary
🍺  /usr/local/Cellar/gettext/0.19.8.1: 1,935 files, 16.9MB
==> Installing r dependency: libpng
==> Downloading https://homebrew.bintray.com/bottles/libpng-1.6.36.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libpng-1.6.36.mojave.bottle.tar.gz
🍺  /usr/local/Cellar/libpng/1.6.36: 27 files, 1.2MB
==> Installing r dependency: openblas
==> Downloading https://homebrew.bintray.com/bottles/openblas-0.3.5.mojave.bottle.1.tar.gz
######################################################################## 100.0%
==> Pouring openblas-0.3.5.mojave.bottle.1.tar.gz
==> Caveats
openblas is keg-only, which means it was not symlinked into /usr/local,
because macOS provides BLAS and LAPACK in the Accelerate framework.

For compilers to find openblas you may need to set:
  export LDFLAGS="-L/usr/local/opt/openblas/lib"
  export CPPFLAGS="-I/usr/local/opt/openblas/include"

For pkg-config to find openblas you may need to set:
  export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"

==> Summary
🍺  /usr/local/Cellar/openblas/0.3.5: 21 files, 120.7MB
==> Installing r dependency: pcre
==> Downloading https://homebrew.bintray.com/bottles/pcre-8.42.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring pcre-8.42.mojave.bottle.tar.gz
🍺  /usr/local/Cellar/pcre/8.42: 204 files, 5.5MB
==> Installing r
==> Downloading https://homebrew.bintray.com/bottles/r-3.5.2_2.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring r-3.5.2_2.mojave.bottle.tar.gz
🍺  /usr/local/Cellar/r/3.5.2_2: 2,116 files, 55.7MB
==> Caveats
==> gettext
gettext is keg-only, which means it was not symlinked into /usr/local,
because macOS provides the BSD gettext library & some software gets confused if both are in the library path.

If you need to have gettext first in your PATH run:
  echo 'export PATH="/usr/local/opt/gettext/bin:$PATH"' >> ~/.bash_profile

For compilers to find gettext you may need to set:
  export LDFLAGS="-L/usr/local/opt/gettext/lib"
  export CPPFLAGS="-I/usr/local/opt/gettext/include"

==> openblas
openblas is keg-only, which means it was not symlinked into /usr/local,
because macOS provides BLAS and LAPACK in the Accelerate framework.

For compilers to find openblas you may need to set:
  export LDFLAGS="-L/usr/local/opt/openblas/lib"
  export CPPFLAGS="-I/usr/local/opt/openblas/include"

For pkg-config to find openblas you may need to set:
  export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"
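
Once the install finishes, it is worth confirming that R is actually on the PATH. Here is a small Python sketch that does this; `find_tool` and `r_version` are names I made up, and the only assumption about R itself is that it accepts the `--version` flag.

```python
import shutil
import subprocess

def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)

def r_version():
    """Return the first line of `R --version`, or None when R is not installed."""
    r_path = find_tool("R")
    if r_path is None:
        return None
    out = subprocess.run([r_path, "--version"], capture_output=True, text=True)
    lines = (out.stdout or out.stderr).splitlines()
    return lines[0] if lines else None

print(r_version() or "R was not found on PATH")
```

If this reports that R was not found, open a new terminal first so that the shell picks up the freshly linked binaries.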

Installing RStudio

RStudio is a free and open-source integrated development environment for R, a programming language for statistical computing and graphics. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio. (Source: Wikipedia)

I downloaded the RStudio Desktop version:

RStudio 1.2.1335 - Mac OS X 10.12+ (64-bit)

Source Code

A tarball containing source code for RStudio v1.1.463 can be downloaded from here

RStudio IDE Cheat Sheet
--
Kinshuk Dutta
New York

Wednesday, January 9, 2019

Deep Learning Using TensorFlow


What is Deep Learning?

As we saw in the previous blog, AI - Machine Learning & Deep Learning, deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence.

In this blog we will try to learn the basics of TensorFlow, Google's deep learning framework, using Python.

What is TensorFlow?


To implement machine learning, TensorFlow provides software libraries for creating computational graphs. I selected TensorFlow mainly because of my love for open source. You can get more information on TensorFlow from its website.
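
To make the idea of a computational graph concrete, here is a toy version in plain Python. To be clear, this is not TensorFlow's API: the `Node` class and the `const` helper are made up for illustration. The only thing it borrows is the idea that you first describe the computation as a graph and only compute values when you ask for them.

```python
class Node:
    """One operation in a tiny computational graph (not TensorFlow's API)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # upstream nodes this node depends on
        self.value = value    # only used by "const" nodes

    def evaluate(self):
        """Evaluate upstream nodes first, then apply this node's operation."""
        if self.op == "const":
            return self.value
        args = [node.evaluate() for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(f"unknown op: {self.op}")

def const(value):
    """Create a leaf node that holds a fixed value."""
    return Node("const", value=value)

# Build the graph for (a * b) + c first; nothing is computed until evaluate().
a, b, c = const(2.0), const(3.0), const(4.0)
graph = Node("add", (Node("mul", (a, b)), c))
print(graph.evaluate())  # 10.0
```

Separating "build the graph" from "run the graph" is what lets a real framework optimize the graph or place parts of it on different devices before any numbers flow through it.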

This is my step by step guide to installing TensorFlow on macOS Mojave.

Requisites
    Hardware
        Model Name: MacBook Pro
        Model Identifier: MacBookPro15,1
        Processor Name: Intel Core i9
        Processor Speed: 2.9 GHz
        Number of Processors: 1
        Total Number of Cores: 6
        L2 Cache (per Core): 256 KB
        L3 Cache: 12 MB
        Memory: 32 GB
        OS: macOS 10.14.3 (18D32a)
    Software
        Package Manager: Homebrew
        Big Data: Hadoop
        Cluster Computing Framework: Apache Spark 2.0.0
Installation
    Install Xcode 8.0.
        Xcode can be updated from https://developer.apple.com/xcode/downloads/
        If 8.0 is installed, you may need to run:
        sudo xcode-select --switch /path/to/Xcode-beta.app

        bash-3.2$ sudo xcode-select --switch /Applications/Xcode-beta.app
    Install Homebrew

Install TensorFlow
Open a terminal on the Mac.
Press Cmd + Space and type Terminal, or open Terminal from Launchpad.

$ brew install tensorflow
==> Downloading https://homebrew.bintray.com/bottles/libtensorflow-1.12.0.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtensorflow-1.12.0.mojave.bottle.tar.gz

🍺  /usr/local/Cellar/libtensorflow/1.12.0: 9 files, 158.7MB



AI - Machine Learning & Deep Learning


Artificial Intelligence (AI) - Machine Learning (ML) & Deep Learning (DL)

Artificial Intelligence (AI) is the ability of a machine or system to decide what to do next based on a given situation.
Intelligence is imparted to a system in two main ways:

Machine Learning (ML): The ability of a machine to learn on its own, without being explicitly programmed for a given scenario, is called machine learning. Machine learning involves a lot of complex math and code that lets the computer identify patterns in a large data set, adjust its algorithm based on those patterns, perform a function on the data given to it, and get progressively better over time.

 “Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions” 

Conceptual Neural Network 
Deep Learning (DL): Deep learning is considered an evolution of machine learning. It uses a programmable neural network that enables machines to make accurate decisions without help from humans. A better way to impart intelligence is to follow the mechanism used by the human brain. This mechanism is modeled by a neural network, which tries to mimic the brain's processes with layers of nodes linked together in different ways. Each additional layer requires a large increase in computing power. This layered approach involves millions of parameters, which are typically processed on GPUs. As GPUs become more affordable, so does the ability to create multi-layered deep neural networks.
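
Here is a minimal Python sketch of the "layers of nodes" idea, plus a helper that shows how quickly the number of parameters grows as layers are added. It assumes fully connected layers with a sigmoid activation; the layer sizes and the all-zero weights are placeholders for an untrained network, not values from any real model.

```python
import math

def dense_layer(inputs, weights, biases):
    """Compute sigmoid(W x + b) for one fully connected layer."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

def forward(inputs, layers):
    """Pass the inputs through each layer in turn."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs

def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical 2-3-1 network with all weights and biases set to zero.
layers = [
    ([[0.0, 0.0]] * 3, [0.0] * 3),  # 2 inputs -> 3 hidden nodes
    ([[0.0, 0.0, 0.0]], [0.0]),     # 3 hidden nodes -> 1 output
]
output = forward([1.0, 2.0], layers)
print(output)                                 # [0.5]
print(count_parameters([2, 3, 1]))            # 13
print(count_parameters([784, 512, 512, 10]))  # 669706
```

The tiny network has 13 parameters, while a still modest network with two hidden layers of 512 nodes already has well over half a million, which is why the arithmetic is usually handed to a GPU.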

In the subsequent blogs we will try to learn some basic ML as well as basic DL implementation techniques.

Scala & Spark for Managing & Analyzing Big Data (Using Machine Learning)

Managing & Analyzing Big Data using Scala & Apache Spark

In this blog we will see how to use Scala and Spark to analyze Big D...