Tuesday, November 30, 2010

Are you a Data Engineer or a Data Scientist?

Recently, two new designations have been making headlines in the data world 🌎:
Data Engineer and Data Scientist. It all began with data analytics.

Data is no good unless we derive information from it, and that information should be insightful. The tools that let you do this are known as data analytics products. People working in analytics are often given titles like Engineer and Scientist.

Why does it matter at all? 

Well, it matters because it determines your career road map. So, to know who you are, or rather which of these broad categories defines your true nature, you need to know the difference between them.

Data Engineers 

Decisions backed by data and analytics can provide a competitive advantage and increase ROI.
Hence it is critical that the data analytics solutions implemented in your enterprise are fast and efficient.

If you are a data engineer, you are expected to design, develop, implement, and support products that deal with data, be it structured data (master, reference, meta, transactional, or analytics) or unstructured data, and to curate data pipelines or move data to production. Data engineers make the appropriate data accessible and available to the right users at the right time by enabling secure, compliant data utilization and democratization across the enterprise.

You may be a data engineer if you have more fun with the technical side of data: designing, building, and arranging the different components of a data-flow architecture. The role is closely related to that of a data architect; in fact, when traditional data architects start handling and designing Big Data platforms instead of data warehouses, they are known as data engineers.
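The pipeline-curation part of the role can be sketched as a tiny extract-transform-load (ETL) step. This is a minimal illustration, not production code; the field names and cleaning rules are invented for the example:

```python
import csv
import io

def extract(raw_csv):
    """Read raw records from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(records):
    """Clean and standardize: drop rows missing an id, normalize names."""
    return [
        {"id": int(r["id"]), "name": r["name"].strip().title()}
        for r in records
        if r.get("id")
    ]

def load(records, target):
    """Append the curated records to a target store (here, a plain list)."""
    target.extend(records)
    return len(records)

raw = "id,name\n1, alice \n,bob\n2,CAROL\n"
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
print(loaded)         # 2 rows survive cleaning
print(warehouse[0])   # {'id': 1, 'name': 'Alice'}
```

In a real pipeline, extract would read from source systems and load would write to a warehouse or lake, but the shape, extract then transform then load, stays the same.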

Data Scientists give meaning to your data. Put simply, if you can read and understand vast volumes of data to predict a situation or trend using analytics, you are a data scientist.
A Data Scientist collects, interprets, and publishes data. The data they find can be used for many reasons, but in the business world it often applies to finances or productivity. For example, a Data Scientist may look at sales figures against business decisions made over a certain time frame to determine how successful those decisions were.

These professionals provide the forecasting knowledge a business needs to know whether changes will be effective before making a decision. Data Scientists work in a variety of industries, such as IT, healthcare, finance, retail and marketing.
To do their job efficiently, data scientists typically follow the approach below:
  1. Ask the right questions to begin the discovery process
  2. Acquire data
  3. Process and clean the data
  4. Integrate and store data
  5. Perform initial data investigation and exploratory data analysis
  6. Choose one or more potential models and algorithms
  7. Apply data science techniques, such as machine learning, statistical modeling, and artificial intelligence
  8. Measure and improve results
  9. Present final results to stakeholders
  10. Make adjustments based on feedback
  11. Repeat the process to solve a new problem
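The steps above can be walked through end to end on a toy problem. The sales figures below are invented, and the "model" is a plain least-squares line fit, just to make the workflow concrete:

```python
# A toy walk-through of the workflow: acquire data, clean it, explore it,
# fit a simple model, measure the result, and produce a forecast.
# The (ad_spend, sales) figures are invented purely for illustration.

# Acquire and clean: drop rows with missing values.
raw = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (None, 5.0), (4.0, 8.8)]
data = [(x, y) for x, y in raw if x is not None]

# Explore: simple summary statistics.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Model: ordinary least-squares fit of sales = a * spend + b.
sxx = sum((x - mean_x) ** 2 for x, _ in data)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
a = sxy / sxx
b = mean_y - a * mean_x

# Measure: mean absolute error of the fit on the data we have.
mae = sum(abs((a * x + b) - y) for x, y in data) / n

# Present: forecast sales at a new spend level for stakeholders.
forecast = a * 5.0 + b
print(a, b, mae, forecast)
```

From here the cycle repeats: stakeholder feedback suggests new questions, new data, and a better model.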

Kinshuk Dutta

Nov. 2010

I recently found an excellent pictorial representation on DataCamp that echoes my thoughts.

Thursday, July 22, 2010

Cloud Computing

Cloud Computing - "Finally, all of you, live in harmony with one another; be sympathetic, love as brothers, be compassionate and humble." 1 Peter 3:8 (NIV)

Come, let's share our resources for a much more efficient IT world. The explanations available on the sites below are so good that I don't need to explain anything at all.

The major cloud computing implementations are:
Oracle Cloud Computing
Windows Azure
Amazon EC2 - Elastic Compute Cloud

Kinshuk Dutta

Social CRM - A Cult

Am I Becoming Vulnerable - Social CRM

The more information one puts on the net, the more vulnerable he or she becomes. Earlier, crooks exploited this information illegally, but now people ethically use the information about someone available on the web to analyze, understand, and pursue that person for their own gain.
Does it sound weird? Actually, no. That's what Social CRM is all about. A couple of years back everybody was talking about Social CRM; the hype was bigger than a Yash Raj Films promo. At that time things were mystified and mystical. So what exactly is Social CRM?
The traditional CRM system assists an organization by bringing together data from all areas of an organization, giving a 360-degree view of a customer for marketing and sales to make informed decisions on cross-selling and up-selling opportunities as well as to shape marketing strategies and corporate communications.
Social CRM, on the other hand, is based on your existing CRM that has the ability to leverage the social web and automate the conversion process. The social CRM can be used by marketing and sales teams to listen to conversations, craft appropriate messages, join in immediately with customer conversation, and offer them value in terms of information and solutions.
The ROI for setting up this system will be the relationships and long-term loyalty that serve entrepreneurs over time. Social CRM will help generate marketing intelligence, providing the marketing department with insight that will help your company source better leads and reduce customer support costs through self-help communities.
To begin with, marketers should dedicate time to working with brand advocates, involving them in shaping the product and communications. This is how social CRM is supposed to work; through the integration of customer social networks.
You will likely find that your existing customers belong to social networks and openly add more information to their profiles than what you have in your traditional CRM system. Therefore, you should think creatively about ways to integrate your customers' networks into your current CRM.
One can learn a great deal about a customer from the combined information in his LinkedIn, Facebook, and Orkut profiles; more insight can be gained from the way the person tweets on Twitter or by tracking his conversations on FriendFeed and Buzz.
Social networks and communities are influencing CRM, with the result that corporate sites and marketing communications can recognize social relationships and customers' preferences and deliver customized experiences to them in real time. We are still some distance from this scenario; however, companies such as Appirio and Salesforce are already leading the way in Social CRM.
A couple of years back, Appirio developed a Facebook application that can easily be rebranded by marketers for campaign activity. The application can be distributed while simultaneously sharing information, recommending information to peers, and serving other purposes such as recruiting, word of mouth, and other typical social network activities. As the information is shared, it can be passed to a landing page where users submit information via a web form that is stored on Salesforce.
Social CRM is both exciting and daunting, but by taking small steps to understand social media, experimenting with integrating social tools into your CRM system now, then testing and improving, you will be very well prepared.

Guess what? There used to be telemarketing; now we have Twitter marketing.

Kinshuk Dutta

Tuesday, July 20, 2010

Introduction to NoSQL | Mongo DB

Introduction to NoSQL

A Brief History of NoSQL

I came across this brief history from a blog

NoSQL was in fact first used by Carlo Strozzi in 1998 as the name of the file-based database he was developing. Ironically, it is a relational database, just one without a SQL interface. The term resurfaced in 2009 when Eric Evans used it to name the current surge in non-relational databases.

## the 1960s

  • MultiValue (aka PICK) databases are developed at TRW in 1965.
  • According to a comment from Scott Jones, M[umps] is developed at Mass General Hospital in 1966. It is a programming language that incorporates a hierarchical database with B+ tree storage.
  • IBM IMS, a hierarchical database, is developed with Rockwell and Caterpillar for the Apollo space program in 1966.

## the 1970s

  • InterSystems develops the ISM product family, succeeded by the Open M product, all M[umps] implementations. See the comment from Scott Jones below.
  • M[umps] is approved as an ANSI standard language in 1977.
  • In 1979 Ken Thompson creates DBM, which is released by AT&T. At its core it is a file-based hash.

## the 1980s

Several successors to DBM spring into life.
  • TDBM, supporting atomic transactions.
  • NDBM, the Berkeley version of DBM, supporting multiple databases open at the same time.
  • SDBM, another clone of DBM, written mainly for licensing reasons.
  • GT.M, the first version of a key-value store with a focus on high-performance transaction processing. It is open-sourced in 2000.
  • BerkeleyDB is created at Berkeley in the transition from 4.3BSD to 4.4BSD. Sleepycat Software is started as a company in 1996 when Netscape needs new features for BerkeleyDB; it is later acquired by Oracle, which still sells and maintains BerkeleyDB.
  • Lotus Notes, or rather its server part, Lotus Domino, which really is a document database, has its initial release in 1989 and is now sold by IBM. It has evolved a lot from the early versions and is now a full office and collaboration suite.
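The DBM family above is easy to get a feel for, because Python still ships a `dbm` module in its standard library. A quick sketch of the file-based hash idea (the keys and values here are invented):

```python
import dbm
import os
import tempfile

# DBM and its descendants (NDBM, GDBM, BerkeleyDB, ...) are file-based
# hashes: byte-string values stored under byte-string keys on disk.
path = os.path.join(tempfile.mkdtemp(), "demo")

with dbm.open(path, "c") as db:      # "c" creates the database file
    db[b"user:1"] = b"alice"
    db[b"user:2"] = b"bob"

# Reopening shows the data persisted to disk, unlike an in-memory dict.
with dbm.open(path, "r") as db:
    value = db[b"user:1"]

print(value)   # b'alice'
```

No schema, no query language, no joins: just keys and values, which is exactly the trade-off the later key-value stores in this list revisit at scale.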

## the 1990’s

  • GDBM is the GNU project's clone of DBM.
  • Mnesia is developed by Ericsson as a soft real-time database to be used in telecom. It is relational in nature but uses Erlang itself, rather than SQL, as its query language.
  • InterSystems Caché is launched in 1997 and is a hybrid, so-called post-relational database. It has object interfaces, SQL, PICK/MultiValue, and direct manipulation of data structures. It is an M[umps] implementation. See Scott Jones's comment below for more on the history of InterSystems.
  • Metakit is started in 1997 and is probably the first document-oriented database. It supports smaller datasets than the ones in vogue nowadays.


## the 2000s

This is where the NoSQL train really picks up momentum and a lot starts to happen.
  • Graph database Neo4j is started in 2000.
  • db4o, an object database for Java and .NET, is started in 2000.
  • QDBM is a re-implementation of DBM with better performance, by Mikio Hirabayashi.
  • Memcached is started in 2003 by Danga to power LiveJournal. Memcached isn't really a database, since it is memory-only, but a version with file storage called memcachedb soon follows.
  • The Infogrid graph database is started as closed source in 2005 and open-sourced in 2008.
  • CouchDB is started in 2005 and provides a document database inspired by Lotus Notes. The project moves to the Apache Foundation in 2008.
  • Google BigTable is started in 2004 and its research paper is released in 2006.


  • JackRabbit is started in 2006 as an implementation of JSR 170 and JSR 283.
  • Tokyo Cabinet, a successor to QDBM by Mikio Hirabayashi, is started in 2006.
  • The research paper on Amazon Dynamo is released in 2007.
  • The document database MongoDB is started in 2007 as part of an open-source cloud computing stack, with its first standalone release in 2009.
  • Facebook open-sources the Cassandra project in 2008.
  • Project Voldemort is a replicated database with no single point of failure, started in 2008.
  • Dynomite is a Dynamo clone written in Erlang.
  • Terrastore is a scalable elastic document store started in 2009.
  • Redis is a persistent key-value store started in 2009.
  • Riak is another Dynamo-inspired database started in 2009.
  • HBase is a BigTable clone for the Hadoop project, while Hypertable is another BigTable-type database, also from 2009.
  • Vertexdb, another graph database, is started in 2009.
  • Eric Evans of Rackspace, a committer on the Cassandra project, re-introduces the term "NoSQL", often used in the sense of "Not Only SQL", to describe the surge of new projects and products.
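A thread running through this history is the document model that Lotus Notes anticipated and CouchDB and MongoDB popularized: self-describing, possibly nested records instead of flat rows. A plain-Python sketch of the idea (the field names are invented for illustration):

```python
import json

# A "document" is a self-describing, possibly nested record. Unlike a
# relational row, two documents in the same collection may differ in shape,
# and related data (customer, line items) lives inside one record instead
# of being joined across tables.
order = {
    "_id": 1,
    "customer": {"name": "Alice", "city": "Kolkata"},
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Documents serialize naturally to JSON, one reason document stores
# pair so well with web applications.
encoded = json.dumps(order)
decoded = json.loads(encoded)
print(decoded["customer"]["name"])   # Alice
print(len(decoded["items"]))         # 2
```

MongoDB stores essentially this shape (as BSON) and lets you query into the nested fields directly.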

Kinshuk Dutta
