Monday, February 29, 2016

Three Database Revolutions

There have been three database revolutions so far.
  • The first revolution was driven by the emergence of the electronic computer.
  • The second revolution by the emergence of the relational database.
  • The third revolution has resulted in an explosion of non-relational database alternatives driven by the demands of modern applications that require global scope and continuous availability.
Let's look at these three waves of database technologies and discuss the market and technology forces leading to today's next-generation databases.

1950-1972 (Pre - Relational)
  • 1951 - Magnetic Tape
  • 1952 - Magnetic Disk
  • 1961 - ISAM
  • 1965 - Hierarchical Model
  • 1968 - IMS
  • 1969 - Network Model
  • 1971 - IDMS
1972 - 2005 (Relational)
  • 1970 - Codd's Paper
  • 1974 - System R
  • 1978 - Oracle
  • 1980 - Commercial Ingres
  • 1981 - Informix
  • 1984 - DB2
  • 1987 - Sybase
  • 1989 - Postgres
  • 1989 - SQL Server
  • 1995 - MySQL
2005 - 2015 (The Next Generation)
  • 2003 - MarkLogic
  • 2004 - MapReduce
  • 2005 - Hadoop
  • 2005 - Vertica
  • 2007 - Dynamo
  • 2007 - Neo4J
  • 2008 - Cassandra
  • 2008 - HBase
  • 2008 - NuoDB
  • 2009 - MongoDB
  • 2010 - VoltDB
  • 2010 - Hana
  • 2011 - Riak
  • 2012 - Aerospike
  • 2014 - Splice Machine
In the 20 years following the widespread adoption of electronic computers, a range of increasingly sophisticated database systems emerged. Shortly after the definition of the relational model in 1970, almost every significant database system shared a common architecture. The three pillars of this architecture were the relational model, ACID transactions, and the SQL language.

However, starting around 2008, an explosion of new database systems occurred, and none of these adhered to the traditional relational implementations. A few of the main reasons behind this new generation of databases are the three V's:
  • Volume - High Volume of Data. Ex: MB, GB, TB, Petabytes of data. 
  • Velocity - High Velocity of Data. Ex: Batch, Periodic, Near Real Time, Real Time
  • Variety - More Variety of Data. Ex: Database, Web, Photo, Audio, Video, Unstructured
In the next blog posts we will go through a few of the popular next-generation databases, their overall architectures, and the use cases they are suited for.



Tuesday, January 29, 2013

How MongoDB Survives SQL or Query Injection

As we know, SQL injection is one of the best-known ways people try to attack SQL-based applications. While reading the MongoDB docs, I came across an interesting explanation of how MongoDB avoids this kind of injection.

For SQL-based applications, most drivers accept the query as a String, which leaves the access vulnerable when user input is concatenated into that string.
For example, in Java we used to fetch data from SQL as follows:


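// Needs: import java.sql.*;
// Vulnerable: user input (city, state) is concatenated directly into the SQL text.
String query = "SELECT ZipCode,State FROM zipcodes WHERE City = '" + city + "' AND State = '" + state + "'";
Connection connection = DriverManager.getConnection(jdbcurl, username, password);
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(query);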
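
To see why concatenation is risky, here is a small sketch that only builds the query text and prints it. The class name SqlInjectionDemo and the sample values are made up for illustration; no database is contacted.

```java
// Illustrative only: shows how concatenated input can rewrite a SQL statement.
class SqlInjectionDemo {
    // Vulnerable: user input is pasted straight into the SQL text.
    static String buildQuery(String city, String state) {
        return "SELECT ZipCode,State FROM zipcodes WHERE City = '" + city
                + "' AND State = '" + state + "'";
    }

    public static void main(String[] args) {
        // A hostile value for "state" closes the quote and appends an always-true clause.
        System.out.println(buildQuery("Austin", "TX' OR '1'='1"));
        // prints: SELECT ZipCode,State FROM zipcodes WHERE City = 'Austin' AND State = 'TX' OR '1'='1'
    }
}
```

The usual JDBC remedy is a PreparedStatement with ? placeholders and setString calls, which sends the values separately from the SQL text.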


In MongoDB this particular vulnerability does not arise, because the drivers create a BSON object for the given query instead of sending the query to the database as a string.

For MongoDB in Java, QueryBuilder is used to build queries for accessing MongoDB data:

DBObject query = QueryBuilder.start("City").is(city).and("State").is(state).get();

As a client program assembles a query in MongoDB, it builds a BSON object, not a string, so traditional SQL injection attacks are not a problem.
Client libraries typically provide a convenient, injection-free way to build these objects. Care is still needed wherever user input is evaluated as JavaScript on the server, for example in a $where clause.
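
The idea can be sketched without the driver. Below, a plain java.util.Map stands in for the BSON query document; this is an assumption made for illustration, not the MongoDB driver API, and StructuredQueryDemo is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StructuredQueryDemo {
    // Hypothetical stand-in for a BSON query document: field name -> value to match.
    static Map<String, Object> buildQuery(String city, String state) {
        Map<String, Object> query = new LinkedHashMap<>();
        query.put("City", city);
        query.put("State", state);
        return query;
    }

    public static void main(String[] args) {
        Map<String, Object> q = buildQuery("Austin", "TX' OR '1'='1");
        // The hostile text stays a single value; it cannot add a new condition.
        System.out.println(q.size() + " conditions, State = " + q.get("State"));
    }
}
```

Because the query is data rather than text to be parsed, the quote characters in the input have no special meaning: the query still has exactly two conditions.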

Friday, December 14, 2012

Comparison of Popular NoSQL Databases (MongoDB, CouchDB, HBase, Neo4j, Cassandra)

There are many SQL databases so far. But I personally feel the long dominance of SQL is coming to an end as everyone moves into the era of Big Data. As experts say, SQL databases are not the best fit for Big Data, and NoSQL databases came into the picture as a better fit because they provide more flexibility in storing data.
I just want to compare a few popular NoSQL databases that are available at this point in time. A few well-known NoSQL databases are MongoDB, Cassandra, HBase, CouchDB, and Neo4j.
NoSQL databases differ from each other more than SQL databases differ from each other. I think it is one's responsibility to choose the appropriate NoSQL database for an application based on its use case. Let's do a quick comparison of these databases.

MongoDB

  • Written in: C++
  • Main point: Retains some friendly properties of SQL (Query, Index)
  • License: AGPL (Drivers: Apache)
  • Protocol : BSON (Binary JSON)
  • Replication : Master/Slave Replication  and automatic failover via Replica Sets
  • Sharding : Built-in
  • Queries are JavaScript expressions.
  • Runs arbitrary JavaScript functions server side.
  • Better Update-in-place than CouchDb.
  • Uses memory mapped files for data storage.
  • Performance over features.
  • Journaling (enabled with the --journal option when starting the mongod server).
  • Has Geospatial Indexing.
  • On 32-bit systems limited to 2.5GB.
  • Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
  • For example: For most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.

Cassandra

  • Written in: Java
  • Main point: Best of BigTable and Dynamo
  • License: Apache
  • Protocol: Custom, binary (Thrift)
  • Tunable trade-offs for distribution and replication (N, R, W)
  • Querying by column, range of keys
  • BigTable-like features: columns, column families
  • Has secondary indices
  • Writes are much faster than reads (!)
  • Map/reduce possible with Apache Hadoop
  • All nodes are similar, as opposed to Hadoop/HBase
  • Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")
  • For example: Banking, financial industry (though not necessarily for financial transactions, but these industries are much bigger than that.) Writes are faster than reads, so one natural niche is real time data analysis.   
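
The (N, R, W) trade-off in the list above is easiest to see with a small check. This is a sketch assuming the usual Dynamo-style meaning, which the list itself does not spell out: N replicas per key, R replicas consulted per read, W acknowledgements required per write. QuorumCheck is an illustrative name, not part of any Cassandra API.

```java
class QuorumCheck {
    // True when every read set must share at least one replica with every write set,
    // so a read is guaranteed to reach a replica that saw the latest acknowledged write.
    static boolean overlaps(int n, int r, int w) {
        if (n < 1 || r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("need 1 <= r, w <= n");
        }
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(3, 2, 2)); // prints: true  (quorum reads and writes)
        System.out.println(overlaps(3, 1, 1)); // prints: false (fast, but reads may be stale)
    }
}
```

Lowering R or W makes the corresponding operation cheaper, at the cost of losing the overlap guarantee once R + W is no longer greater than N.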

HBase


  • Written in: Java
  • Main point: Billions of rows X millions of columns
  • License: Apache
  • Protocol: HTTP/REST (also Thrift)
  • Modeled after Google's BigTable
  • Uses Hadoop's HDFS as storage
  • Map/reduce with Hadoop
  • Query predicate push down via server side scan and get filters
  • Optimizations for real time queries
  • A high performance Thrift gateway
  • HTTP supports XML, Protobuf, and binary
  • Cascading, Hive, and Pig source and sink modules
  • JRuby-based (JIRB) shell
  • Rolling restart for configuration changes and minor upgrades
  • Random access performance is like MySQL
  • A cluster consists of several different types of nodes
  • Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
  • For example: Analysing log data.

CouchDB


  • Written in: Erlang
  • Main point: DB consistency, ease of use
  • License: Apache
  • Protocol: HTTP/REST
  • Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
  • MVCC - write operations do not block reads
  • Previous versions of documents are available
  • Crash-only (reliable) design
  • Needs compacting from time to time
  • Views: embedded map/reduce
  • Formatting views: lists & shows
  • Server-side document validation possible
  • Authentication possible
  • Real-time updates via _changes (!)
  • Attachment handling
  • thus, CouchApps (standalone js apps)
  • jQuery library included
  • Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
  • For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
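
The conflict detection mentioned in the list above can be sketched as a revision check: a write is accepted only if the writer names the revision it last read. The class below is a simplified in-memory model with made-up names and integer revisions. It is not the CouchDB API, where revisions are strings carried in a document's _rev field and a stale update is rejected with an HTTP 409 conflict.

```java
import java.util.HashMap;
import java.util.Map;

class RevisionedStore {
    private final Map<String, Integer> revisions = new HashMap<>();
    private final Map<String, String> bodies = new HashMap<>();

    // Returns the new revision number, or -1 if expectedRev is stale (a conflict).
    // A document that does not exist yet is treated as being at revision 0.
    int put(String id, int expectedRev, String body) {
        int current = revisions.getOrDefault(id, 0);
        if (expectedRev != current) {
            return -1; // someone else updated the document first
        }
        revisions.put(id, current + 1);
        bodies.put(id, body);
        return current + 1;
    }

    String get(String id) {
        return bodies.get(id);
    }
}
```

Two clients that both read revision 1 and both try to write will see one success and one conflict; the losing client has to re-read and retry.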

Neo4j

  • Written in: Java
  • Main point: Graph database - connected data
  • License: GPL, some features AGPL/commercial
  • Protocol: HTTP/REST (or embedding in Java)
  • Standalone, or embeddable into Java applications
  • Full ACID conformity (including durable data)
  • Both nodes and relationships can have metadata
  • Integrated pattern-matching-based query language ("Cypher")
  • Also the "Gremlin" graph traversal language can be used
  • Indexing of nodes and relationships
  • Nice self-contained web admin
  • Advanced path-finding with multiple algorithms
  • Indexing of keys and relationships
  • Optimized for reads
  • Has transactions (in the Java API)
  • Scriptable in Groovy
  • Online backup, advanced monitoring and High Availability is AGPL/commercial licensed
  • Best used: For graph-style, rich or complex, interconnected data. Neo4j is quite different from the others in this sense.
  • For example: Social relations, public transport links, road maps, network topologies
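
To give a feel for the path-finding item in the list above, here is a minimal breadth-first search over an adjacency map. It is a generic sketch of one such algorithm in plain Java with made-up names, not Neo4j's API or its implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class ShortestPath {
    // Returns a path with the fewest hops from start to goal, or an empty list if none exists.
    static List<String> find(Map<String, List<String>> graph, String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // visited node -> node it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(goal)) {
                // Walk the parent links back to the start to rebuild the path.
                LinkedList<String> path = new LinkedList<>();
                for (String n = goal; n != null; n = parent.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            for (String next : graph.getOrDefault(node, List.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, node);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }
}
```

For a transport map with stops A-B-C-D in a line, find(graph, "A", "D") walks outward one hop at a time and returns the stops in order.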

Referred sources: kkovacs, Wikipedia

Tuesday, February 21, 2012

GraphDatabase - The future for Facebook Recommendations

On what basis are you getting recommendations from Facebook? How is your data stored internally in social networking sites?

Have you ever thought about how your information is stored by Facebook in a database? Do you think Facebook stores all of your data in plain SQL tables? Not entirely. Facebook has also built NoSQL stores such as Cassandra (a column-family store rather than a graph database), and the data in a social network is naturally shaped like a graph. I know that after reading this you will have a lot of questions in your mind: 'What is a graph database? What does it look like? How can it be useful for Facebook recommendations? Where else can it be used?' Let me explain each one in detail.

What is a graph database?

I think Wikipedia gives the best answer to this question, so I will simply point to its article on graph databases for the introduction.

What does it look like?

I think you have a basic idea of graph databases after seeing the Wikipedia page. As a sample, picture a small social network of friends who KNOW each other.



You can imagine the entire Facebook database as an ever-growing graph, with new users added day by day. Each node represents a Facebook user or page, and each edge represents a FRIEND or LIKE relationship.

How can it be useful for Facebook recommendations?

Consider the sample example of a small graph which has 3 users A,B,C.

  1. A - friend of B 
  2. B - friend of A,C 
  3. C - friend of B
Now, if you notice, Facebook recommends C to A and A to C as potential friends, because user B is the common friend between them. In a graph database this is as simple as finding the common node that connects the two users.
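
The A, B, C example can be sketched in a few lines of Java. This is only an in-memory illustration using an adjacency map, with made-up names; it is not how Facebook or any particular graph database implements recommendations.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FriendSuggestions {
    // Suggests friends-of-friends who are not already direct friends of the user.
    static Set<String> suggest(Map<String, Set<String>> friends, String user) {
        Set<String> direct = friends.getOrDefault(user, Set.of());
        Set<String> result = new TreeSet<>();
        for (String friend : direct) {
            for (String candidate : friends.getOrDefault(friend, Set.of())) {
                if (!candidate.equals(user) && !direct.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> friends = Map.of(
                "A", Set.of("B"),
                "B", Set.of("A", "C"),
                "C", Set.of("B"));
        System.out.println(suggest(friends, "A")); // prints: [C]
        System.out.println(suggest(friends, "C")); // prints: [A]
    }
}
```

Each suggestion is found by following two edges from the user, which is the kind of traversal a graph database is built to do directly.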

With SQL, you would need to join the records of A, B, and C on the 'friends' field to find the transitive relationship between them, which is time-consuming.

The above example is a very basic one. More recommendations can be derived from mutual LIKES between two users, games, pages, and so on. These are easy to implement with a graph database, and often more efficient than with SQL.

Where else can it be used?

I feel graph databases are very efficient for social networking, spatial search, recommendation engines (Ex: Amazon, Facebook), and so on.

Why Nooooo SQL .........???

                     
Relational databases have been around for many decades and are the database technology of choice for most traditional data-intensive storage and retrieval applications. Retrievals are usually accomplished using SQL, a declarative query language. Relational database systems are generally efficient unless the data contains many relationships requiring joins of large tables. Recently there has been much interest in data stores that do not use SQL exclusively, the so-called NoSQL movement. Examples are Google's BigTable and Facebook's Cassandra. Let's have a look at NoSQL vs MySQL (a common relational database system).
 

When to go for NoSQL?

In recent years, software developers have been investigating storage alternatives to relational databases. NoSQL is a blanket term for some of those new systems. Cassandra, BigTable, CouchDB, Project Voldemort, and Dynamo are all NoSQL projects, as they are all high-volume data stores that actively reject the relational and object-relational models.

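Atomicity, consistency, isolation, and durability (ACID) are a set of properties that transactions in relational databases are expected to satisfy. Together, they guarantee database reliability. Many NoSQL systems relax some of these guarantees in exchange for scalability and availability, though not all do: Neo4j, for example, is fully ACID.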

The term "NoSQL," as a name for modern web data stores, first began to gain popularity in early 2009. It is a topic that has gained recognition from the IT community but has yet to garner large-scale academic study. Still, the NoSQL movement has its own discussion groups, blogs, and conferences.

When a typical database administrator considers whether to move from the relational model to a NoSQL model, the NoSQL community offers the following signs that the data might be more suitable for a NoSQL system.
  1. Having tables with lots of columns, each of which is only used by a few rows.
  2. Having attribute tables.
  3. Having lots of many-to-many relationships.
  4. Having tree-like characteristics.
  5. Requiring frequent schema changes.