What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning, and big data.
Data science is a "concept to unify statistics, data analysis, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge, and information science.
R or Python?
Why use R for Data Science?
1. Academia:
R is a very popular language in academia. Many researchers and scholars use R for experimenting with data science, and many popular books and learning resources on data science use R for statistical analysis as well. Because so many people learn R during their academic years, there is a large pool of statisticians with good working knowledge of the language, and they bring that knowledge with them when they move to industry. This in turn increases the traction of the language.
2. Data wrangling:
Data wrangling is the process of cleaning messy and complex data sets so that they are convenient to consume and analyze further. It is a very important and time-consuming step in data science. R has an extensive library of tools for data and database manipulation and wrangling. Some of the popular packages for data manipulation in R include:
dplyr Package – Created and maintained by Hadley Wickham, dplyr is best known for its data exploration and transformation capabilities and highly adaptive chaining syntax.
data.table Package – It allows faster manipulation of data sets with minimal code. It simplifies data aggregation and drastically reduces compute time.
readr Package – ‘readr’ reads rectangular data such as CSV and TSV files into R. It does not convert character columns into factors, and it is typically around 10x faster than the equivalent base R functions.
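To make the idea of wrangling concrete, here is a minimal sketch of a clean-then-aggregate step. It is written in plain Python (standard library only) for comparison, not with the R packages listed above, and it is not how dplyr or data.table work internally. The records, the column names `city` and `temp_c`, and the helper functions `clean` and `mean_by_city` are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical messy records: stray whitespace, inconsistent case, missing values.
raw_rows = [
    {"city": " Pune ", "temp_c": "31.5"},
    {"city": "pune", "temp_c": ""},        # missing reading
    {"city": "Delhi", "temp_c": "38.0"},
    {"city": "DELHI ", "temp_c": "36.0"},
    {"city": "Pune", "temp_c": "n/a"},     # unparseable reading
]


def clean(rows):
    """Normalise city names and drop rows whose temperature cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue  # skip missing or malformed values
        cleaned.append({"city": row["city"].strip().title(), "temp_c": temp})
    return cleaned


def mean_by_city(rows):
    """Group cleaned rows by city and average the temperature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row["temp_c"])
    return {city: mean(temps) for city, temps in groups.items()}


summary = mean_by_city(clean(raw_rows))
print(summary)
```

The two functions mirror the usual shape of a wrangling pipeline: first make the values consistent, then group and summarise. Packages such as dplyr express the same chain in far fewer lines.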
3. Data visualization:
Data visualization is the visual representation of data in graphical form. It allows data to be analyzed from angles that are not apparent in unorganized or tabulated data. R has many tools that help with data visualization, analysis, and representation. The R packages ggplot2 and ggedit have become the standard plotting packages. While ggplot2 is focused on visualizing data, ggedit helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics precisely correct.
4. Specificity:
R is a language designed especially for statistical analysis and data reconfiguration. Its libraries share one goal: to make data analysis easier, more approachable, and more detailed. Many new statistical methods become available first as R libraries, which makes R a strong choice for data analysis and projection. Members of the R community are very active and supportive, and they have a great knowledge of statistics as well as programming. All of this gives R a special edge for data science projects.
5. Machine learning:
At some point in a data science project, a programmer may need to train an algorithm and bring in automation and learning capabilities to make predictions possible. R provides ample tools for training and evaluating an algorithm and predicting future events. Thus, R makes machine learning (a branch of data science) much easier and more approachable. The list of R packages for machine learning is extensive. It includes mice (for handling missing values), rpart and party (for recursive partitioning, that is, building decision trees), caret (for classification and regression training), randomForest (for building random forests of decision trees), and many more.
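The train, evaluate, and predict cycle described above can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only), not the API of caret or any other R package named here. The data points (hours studied against exam score) are invented, and the helper names `fit_line` and `mean_absolute_error` are my own.

```python
from statistics import mean

# Hypothetical data: hours studied -> exam score (made up for illustration).
data = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70), (6, 73), (7, 79), (8, 82)]

# Hold out the last two observations for evaluation.
train, test = data[:6], data[6:]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def mean_absolute_error(points, slope, intercept):
    """Average absolute gap between predicted and observed values."""
    return mean(abs(y - (slope * x + intercept)) for x, y in points)


# Train on the first six points, then evaluate on the held-out two.
slope, intercept = fit_line(train)
mae = mean_absolute_error(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={mae:.2f}")
```

Keeping some observations out of training is the important part: the error on the held-out points is a fairer estimate of how the model will do on data it has not seen.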
6. Availability:
The R programming language is open source and runs on all major operating systems. It is released under the GNU General Public License, which makes it highly cost-effective for a project of any size. Since it is open source, development in R happens at a rapid pace and the community of developers is huge. All of this, along with a tremendous amount of learning resources, makes R a good choice for getting started in data science. Because many new developers are exploring R, it is also easier and more cost-effective to recruit R developers or outsource work to them.
Why use Python for Data Science?
Python is the programming language of choice for data science. Here’s a brief history:
- In 2016, it overtook R on Kaggle, the premier platform for data science competitions.
- In 2017, it overtook R on KDnuggets’ annual poll of data scientists’ most used tools.
- In 2018, 66% of data scientists reported using Python daily, making it the number one language for analytics professionals.
Installing R on Mac OS
If you have followed my blog, you know that I use Homebrew to install packages on my Mac. I used it to install R as well.
$ brew install R
Updating Homebrew...
==> Auto-updated Homebrew!
Updated Homebrew from 76b3c3dbe to 0f270d811.
Updated 4 taps (homebrew/cask-versions, homebrew/core, homebrew/cask and adoptopenjdk/openjdk).
==> New Formulae
anyenv aws-iam-authenticator s3ql
==> Updated Formulae
ffmpeg ✔ bison consul erlang gibo helmfile kotlin mill plantuml sshuttle visp youtube-dl
jenkins ✔ bundletool coreutils exploitdb gmsh htmldoc krakend minio presto switch-lan-play whois
ask-cli certbot dnsperf ffmpeg@2.8 gnupg-pkcs11-scd hub kubectx minio-mc pulumi topgrade wtf
aws-sdk-cpp checkstyle easyengine fn gnutls jbake libgit2 nim pycodestyle traefik xcodegen
bettercap composer emscripten fonttools goreleaser jhipster libgpg-error nss siril unrar you-get
==> Installing dependencies for r: gettext, libpng, openblas and pcre
==> Installing r dependency: gettext
==> Downloading https://homebrew.bintray.com/bottles/gettext-0.19.8.1.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring gettext-0.19.8.1.mojave.bottle.tar.gz
==> Caveats
gettext is keg-only, which means it was not symlinked into /usr/local,
because macOS provides the BSD gettext library & some software gets confused if both are in the library path.
If you need to have gettext first in your PATH run:
echo 'export PATH="/usr/local/opt/gettext/bin:$PATH"' >> ~/.bash_profile
For compilers to find gettext you may need to set:
export LDFLAGS="-L/usr/local/opt/gettext/lib"
export CPPFLAGS="-I/usr/local/opt/gettext/include"
==> Summary
🍺 /usr/local/Cellar/gettext/0.19.8.1: 1,935 files, 16.9MB
==> Installing r dependency: libpng
==> Downloading https://homebrew.bintray.com/bottles/libpng-1.6.36.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libpng-1.6.36.mojave.bottle.tar.gz
🍺 /usr/local/Cellar/libpng/1.6.36: 27 files, 1.2MB
==> Installing r dependency: openblas
==> Downloading https://homebrew.bintray.com/bottles/openblas-0.3.5.mojave.bottle.1.tar.gz
######################################################################## 100.0%
==> Pouring openblas-0.3.5.mojave.bottle.1.tar.gz
==> Caveats
openblas is keg-only, which means it was not symlinked into /usr/local,
because macOS provides BLAS and LAPACK in the Accelerate framework.
For compilers to find openblas you may need to set:
export LDFLAGS="-L/usr/local/opt/openblas/lib"
export CPPFLAGS="-I/usr/local/opt/openblas/include"
For pkg-config to find openblas you may need to set:
export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"
==> Summary
🍺 /usr/local/Cellar/openblas/0.3.5: 21 files, 120.7MB
==> Installing r dependency: pcre
==> Downloading https://homebrew.bintray.com/bottles/pcre-8.42.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring pcre-8.42.mojave.bottle.tar.gz
🍺 /usr/local/Cellar/pcre/8.42: 204 files, 5.5MB
==> Installing r
==> Downloading https://homebrew.bintray.com/bottles/r-3.5.2_2.mojave.bottle.tar.gz
######################################################################## 100.0%
==> Pouring r-3.5.2_2.mojave.bottle.tar.gz
🍺 /usr/local/Cellar/r/3.5.2_2: 2,116 files, 55.7MB
==> Caveats
==> gettext
gettext is keg-only, which means it was not symlinked into /usr/local,
because macOS provides the BSD gettext library & some software gets confused if both are in the library path.
If you need to have gettext first in your PATH run:
echo 'export PATH="/usr/local/opt/gettext/bin:$PATH"' >> ~/.bash_profile
For compilers to find gettext you may need to set:
export LDFLAGS="-L/usr/local/opt/gettext/lib"
export CPPFLAGS="-I/usr/local/opt/gettext/include"
==> openblas
openblas is keg-only, which means it was not symlinked into /usr/local,
because macOS provides BLAS and LAPACK in the Accelerate framework.
For compilers to find openblas you may need to set:
export LDFLAGS="-L/usr/local/opt/openblas/lib"
export CPPFLAGS="-I/usr/local/opt/openblas/include"
For pkg-config to find openblas you may need to set:
export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"
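Once the install finishes, it is worth confirming that the `R` binary is actually on the PATH. The sketch below is one way to check from Python using only the standard library. The function name `r_version` is my own, and it assumes that `R --version` prints its version banner to standard output.

```python
import shutil
import subprocess


def r_version():
    """Return the first line of `R --version`, or None if R is not on PATH."""
    r_path = shutil.which("R")
    if r_path is None:
        return None
    result = subprocess.run(
        [r_path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = r_version()
print(version or "R was not found on PATH")
```

If the message says R was not found, opening a new terminal session so that the shell picks up the updated PATH is a reasonable first thing to try.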
Installing RStudio
RStudio is a free and open-source integrated development environment for R, a programming language for statistical computing and graphics. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio. (Source: Wikipedia)
I downloaded the RStudio Desktop Version.
RStudio 1.2.1335 - Mac OS X 10.12+ (64-bit)
Source Code
A tarball containing source code for RStudio v1.1.463 can be downloaded from here.