Monday, March 24, 2014

Hadoop - NOSQL - Mongo DB

MongoDB (from "humongous") is an open-source document database and the leading NoSQL database. It uses a document data structure for storing and retrieving elements, and it provides high performance, high availability and easy scalability.

How MongoDB Data Looks
MongoDB is made up of databases, which contain collections. A collection is made up of documents, and each document is made up of fields. Collections can be indexed, which improves lookup and sorting performance. Finally, when we get data from MongoDB we do so through a cursor, whose actual execution is delayed until necessary.

To get started, there are six simple concepts we need to understand.


1) MongoDB has the same concept of a database with which you are likely already familiar (a schema in Oracle/DB2). Within a MongoDB instance you can have zero or more databases, each acting as a high-level container for everything else.
2) A database can have zero or more collections. A collection shares enough in common with a traditional table that you can safely think of the two as the same thing.
3) Collections are made up of zero or more documents. Again, a document can safely be thought of as a row.
4) A document is made up of one or more fields, which you can probably guess are a lot like columns.
5) Indexes in MongoDB function much like their RDBMS counterparts.
6) Cursors are different from the other five concepts, but they are important enough, and often overlooked, that I think they are worthy of their own discussion. The important thing to understand about cursors is that when you ask MongoDB for data, it returns a cursor, which we can do things to, such as counting or skipping ahead, without actually pulling down data. The sketch below puts all six concepts together.
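For illustration, here is a minimal sketch that touches all six concepts, written against the legacy MongoDB Java driver (the 2.x API current around this time). The host, port, database name, collection and field names are assumptions made up for the example.

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class MongoConceptsDemo {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017); // connect to a local mongod
        DB db = client.getDB("blogdb");                           // 1) database
        DBCollection users = db.getCollection("users");           // 2) collection

        // 3) a document made of 4) fields
        users.insert(new BasicDBObject("name", "Asha").append("city", "Pune"));

        // 5) an index on the "name" field to speed up lookups and sorts
        users.createIndex(new BasicDBObject("name", 1));

        // 6) cursor: find() returns immediately; documents are fetched lazily as we iterate
        DBCursor cursor = users.find(new BasicDBObject("city", "Pune")).limit(10);
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        client.close();
    }
}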


Core Concept for Storing elements
The core difference comes from the fact that relational databases define columns at the table level, whereas a document-oriented database defines its fields at the document level. That is to say, each document within a collection can have its own unique set of fields. As such, a collection is a dumbed-down container in comparison to a table, while a document carries a lot more information than a row. Ultimately, the point is that a collection isn’t strict about what goes in it (it’s schema-less); fields are tracked with each individual document, as the short sketch below shows.
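As a small, hedged illustration (again using the legacy 2.x Java driver, with invented collection and field names), two documents with completely different fields can live in the same collection:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class SchemaLessDemo {
    public static void main(String[] args) throws Exception {
        DBCollection products = new MongoClient("localhost", 27017)
                .getDB("shop").getCollection("products");

        // two documents with completely different fields in the same collection
        products.insert(new BasicDBObject("name", "laptop").append("ram_gb", 8));
        products.insert(new BasicDBObject("name", "t-shirt")
                .append("size", "XL").append("color", "blue"));
        // no ALTER TABLE was needed: each document carries its own set of fields
    }
}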


RDBMS / Mongo DB Design

Hadoop - HDFS

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache™ Hadoop®. HDFS is a sub-project of the Apache Hadoop project, and this Apache Software Foundation project is designed to provide a fault-tolerant file system that runs on commodity hardware.
HDFS Basic Design

According to The Apache Software Foundation, the primary objective of HDFS is to store data reliably even in the presence of failures, including NameNode failures, DataNode failures and network partitions. The NameNode is a single point of failure for the HDFS cluster, and a DataNode stores data in the Hadoop file management system.
HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files.

In production clusters, HDFS has demonstrated scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.




Basic Features: HDFS
1) Highly fault-tolerant:
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. From this follow its other traits: high throughput, suitability for applications with large data sets, streaming access to file system data, and the ability to be built out of commodity hardware.
2) Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
3) Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.


4) Moving Computation is Cheaper than Moving Data 

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Basic Design in Detail Level 

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
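To make the division of labor concrete, here is a minimal client sketch against the org.apache.hadoop.fs API: the namespace operations (create, open) go through the NameNode, while the actual bytes are streamed to and from DataNodes. The NameNode address and file path are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // assumed cluster address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        FSDataOutputStream out = fs.create(file);                // namespace operation via the NameNode
        out.writeUTF("hello hdfs");                              // bytes are streamed to DataNodes
        out.close();

        FSDataInputStream in = fs.open(file);                    // block locations come from the NameNode
        System.out.println(in.readUTF());                        // bytes are read back from DataNodes
        in.close();
        fs.close();
    }
}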

File system Namespace
  • Hierarchical file system with directories and files
  • Create, remove, move, rename etc.
  • Namenode maintains the file system
  • Any metadata changes to the file system are recorded by the Namenode.
  • An application can specify the number of replicas of the file needed: replication factor of the file. This information is stored in the Namenode.
 Data Replication
  • HDFS is designed to store very large files across machines in a large cluster.
  • Each file is a sequence of blocks.
  • All blocks in the file except the last are of the same size.
  • Blocks are replicated for fault tolerance.
  • Block size and the replication factor are configurable per file (see the sketch after this list).
  • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
  • BlockReport contains all the blocks on a Datanode.
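As a small follow-up to the list above, the block size and replication factor can be set per file through the same FileSystem API; the address, path and values below are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");        // assumed address
        FileSystem fs = FileSystem.get(conf);

        // create a file with an explicit replication factor and block size
        Path file = new Path("/user/demo/big.dat");
        short replication = 2;                                        // keep 2 copies of each block
        long blockSize = 64L * 1024 * 1024;                           // 64 MB blocks
        FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
        out.close();

        // the replication factor of an existing file can also be changed later
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}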

Namenode
  • Keeps image of entire file system namespace and file Blockmap in memory.
  • 4GB of local RAM is sufficient to support the above data structures that represent the huge number of files and directories.
  • When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the updated FsImage on the file system as a checkpoint.
  • Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.

Datanode

  • A Datanode stores data in files in its local file system.
  • A Datanode has no knowledge of the overall HDFS file system.
  • It stores each block of HDFS data in a separate file.
  • Datanode does not create all files in the same directory.
  • It uses heuristics to determine optimal number of files per directory and creates directories appropriately.
  • When the Datanode starts up, it generates a list of all the HDFS blocks it holds and sends this report to the Namenode: the Blockreport.
 
 

Friday, March 21, 2014

Hadoop - NOSQL (Not Only SQL) Basics

Imagine that you have coupons that you want to push to mobile customers who purchase a specific item. This customer-facing system of engagement requires location data, purchase data, wallet data, and so on, and you want to engage the mobile customer in real time.
What you require is a very agile delivery system that is easily able to process unstructured data. The system of engagement would need to be extremely dynamic.
A traditional database product would prefer more predictable, structured data. A relational database may require vertical, and sometimes horizontal, expansion of servers to expand as data or processing requirements grow.
An alternative, more cloud-friendly approach is to employ NoSQL.

NoSQL databases (either no-SQL or Not Only SQL) are currently a hot topic in some parts of computing:
“NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage.”




What is NoSQL?
NoSQL is a popular name for a subset of structured storage software that is designed and optimized for high-performance operations on large datasets. This optimization comes at the expense of strict ACID (atomicity, consistency, isolation, and durability) compliance and, as the name implies, native querying in the SQL syntax.

NoSQL software is easy for developers to use, horizontally scalable, and optimized for narrow workload definitions.
There are three major categories of NoSQL applications today:
i) Key-value stores like Cassandra, Riak, and Project Voldemort
ii) Graph databases like Neo4j, DEX, and Infinite Graph
iii) Document stores like MongoDB, eXist, and BaseX
These NoSQL applications are either separate open source software (OSS) projects or commercial closed-source projects. They are written in different languages, they expose different interfaces, and they implement different optimizations.


The load is able to grow easily by distributing itself over lots of ordinary, and cheap, Intel-based servers. A NoSQL database is exactly the type of database that can handle the sort of unstructured, messy and unpredictable data that our system of engagement requires.
NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.
It is not built on tables and does not employ SQL to manipulate data. It also may not provide full ACID (atomicity, consistency, isolation, durability) guarantees, but it still has a distributed and fault-tolerant architecture.
The NoSQL taxonomy supports key-value stores, document stores, BigTable-style stores, and graph databases.

NoSQL Advantage

1) Elastic scaling : RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases are designed to expand transparently to take advantage of new nodes, and they're usually designed with low-cost commodity hardware in mind. 
For years, database administrators have relied on scale up -- buying bigger servers as database load increases -- rather than scale out -- distributing the database across multiple hosts as load increases. However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible.
2) Big data : Just as transaction rates have grown out of recognition over the last decade, the volumes of data that are being stored also have increased massively. O'Reilly has cleverly called this the "industrial revolution of data." RDBMS capacity has been growing to match these increases, but as with transaction rates, the constraints of data volumes that can be practically managed by a single RDBMS are becoming intolerable for some enterprises. Today, the volumes of "big data" that can be handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the biggest RDBMS.
3) Not much admin overhead : NoSQL databases are generally designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.
4) Economics : NoSQL databases typically use clusters of cheap commodity servers to manage the exploding data and transaction volumes, while RDBMS tends to rely on expensive proprietary servers and storage systems. The result is that the cost per gigabyte or transaction/second for NoSQL can be many times less than the cost for RDBMS, allowing you to store and process more data at a much lower price point.
5) Flexible data models : Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels. NoSQL databases have far more relaxed -- or even nonexistent -- data model restrictions. NoSQL Key Value stores and document databases allow the application to store virtually any structure it wants in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without too much fuss. The result is that application changes and database schema changes do not have to be managed as one complicated change unit. In theory, this will allow applications to iterate faster, though, clearly, there can be undesirable side effects if the application fails to manage data integrity.

Challenges of NoSQL 
1) Maturity : RDBMS systems have been around for a long time. NoSQL advocates will argue that their advancing age is a sign of their obsolescence, but for most CIOs, the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are stable and richly functional. In comparison, most NoSQL alternatives are in pre-production versions with many key features yet to be implemented. Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.
2) Support : Enterprises want the reassurance that if a key system fails, they will be able to get timely and competent support. All RDBMS vendors go to great lengths to provide a high level of enterprise support.
In contrast, most NoSQL systems are open source projects, and although there are usually one or more firms offering support for each NoSQL database, these companies often are small start-ups without the global reach, support resources, or credibility of an Oracle, Microsoft, or IBM.
3) Analytics and business intelligence : NoSQL databases have evolved to meet the scaling demands of modern Web 2.0 applications. Consequently, most of their feature set is oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies.

NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL.
4) Expertise
There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will resolve itself naturally over time, but for now, it's far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.


Right off the bat, NoSQL databases are unique because they are usually independent from Structured Query Language (SQL) found in relational databases. Relational databases all use SQL as the domain-specific language for ad hoc queries, while non-relational databases have no such standard query language, so they can use whatever they want. That can, if need be, include SQL.
 NoSQL databases are designed to excel in speed and volume. To pull this off, NoSQL software will use techniques that can scare the crap out of relational database users — such as not promising that all data is consistent within a system all of the time.
That's a key requirement of relational databases, because when you are conducting a financial transaction, such as buying something on Amazon, databases have to be very sure that one account is debited the same amount that another account is credited at the same time. Because so much of this back-and-forth read-write activity is needed in a single transaction, a relational database could never keep up with the speed and scaling necessary to make a company like Amazon work.


Go Big Or Go Home

Easier scalability is the first aspect highlighted by Wiederhold. NoSQL databases like Couchbase and 10Gen's MongoDB, he said, can be scaled up to handle much bigger data volumes with relative ease.


If your company suddenly finds itself deluged by overnight success, for example, with customers coming to your Web site in droves, a relational database would have to be painstakingly replicated and re-partitioned in order to scale up to meet the new demand.

Wiederhold cited social and mobile gaming vendors as the big example of this kind of situation. An endorsement or a few well-timed tweets could spin up semi-dormant gaming servers and get them to capacity in mere hours. Because of the distributed nature of non-relational databases, to scale NoSQL all you need to do is add machines to the cluster to meet demand.

Performance / scaling


Performance is another way that NoSQL databases can excel. First, every time you add a new server to a NoSQL database cluster, there is performance scaling by virtue of the fact that you're throwing another processor into the equation.

Beyond the scaling advantages, the very architecture of NoSQL tools aids performance. If a relational database had tens or even hundreds of thousands of tables, data processing would generate far more locks on that data, and greatly degrade the performance of the database.

Because NoSQL databases have weaker data consistency models, they can trade off consistency for efficiency. In Wiederhold's social gaming example, if a user updated his or her profile, there's no real degradation of game performance if that profile's new info isn't updated across the entire database instantly. This means that resources can be dedicated to other things, like tracking down the player that's about to smack you around in-game.



What is CAP about?
The CAP theorem (Consistency, Availability and Partition tolerance) was formulated by Eric Brewer, a professor at the University of California, Berkeley and a co-founder of Inktomi, in his 2000 keynote at the ACM Symposium on Principles of Distributed Computing.









Let’s first give definitions to these 3 terms:
Consistency: A service that is consistent should follow the rule of ordering for updates that spread across all replicas in a cluster – “what you write is what you read”, regardless of location. For example, if Client A writes 1 and then 2 to location X, Client B cannot read 2 followed by 1. This rule has another name, “strong consistency”.

Availability: A service should be available; there should be a guarantee that every request receives a response about whether it succeeded or failed. If the system is not available, it can still be consistent. However, consistency and availability cannot both be achieved at the same time, which means one has to choose what to give up: relaxing consistency allows the system to remain highly available under partitioning conditions (see the next definition), while insisting on strong consistency means that under certain conditions the system will not be available.

Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system. A simple example: we have a cluster of N replicated nodes and for some reason the network is unavailable between some of the nodes (e.g. a network cable gets chopped). This leads to an inability to synchronize data, so only part of the system stops working while the rest continues to operate. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or availability (because you detect the error and shut down the system until the error condition is resolved).

A simple statement of the theorem is: “It is impossible for a protocol to guarantee both consistency and availability in a partition-prone distributed system.” More detailed information follows below.




The popular understanding of Eric Brewer's CAP Theorem is that in distributed systems it is only possible to support two out of three desired properties: consistency, availability and partition tolerance. The 'two out of three' explanation of CAP Theorem has been used in recent years to explain and justify the emergence of NoSQL databases that relax consistency in favour of high availability. It has likewise been used to question the validity of claims that NewSQL databases are able to deliver both highly available distributed architecture and support ACID transactions.


We can't achieve all three characteristics at once

We have been talking about the principle, but how do we achieve these properties in practice?

Availability is achieved by replicating the data across different machines
Consistency is achieved by updating several nodes before allowing further reads
Total partitioning, meaning failure of part of the system, is rare. However, we could look at a delay, or latency, of the update between nodes as a temporary partitioning. It will then force a temporary choice between A and C:

    On systems that allow reads before updating all the nodes, we will get high Availability
    On systems that lock all the nodes before allowing reads, we will get Consistency











Wednesday, March 19, 2014

Hadoop Basics - Horizontal Scaling instead of Vertical Scaling



 What can Hadoop do for you?

With the explosion of the era of Big Data, companies now need to leverage all their available data in full to drive the decisions that will support their future growth. The scale and variety of this data now challenges Relational Databases and that is where Hadoop can really add benefit.
Hadoop development is hosted by the Apache OpenSource Community and all major technology companies have contributed to the codebase enabling Hadoop to leap ahead of proprietary solutions in a relatively short time period.
Hadoop’s main components are its file system (HDFS), which provides cheap and reliable data storage, and its MapReduce engine, which provides high-performance parallel data processing. In addition, Hadoop is self-managing and can easily handle hardware failures as well as scaling its deployment up or down without any change to the codebase. Hadoop installs on low-cost commodity servers, reducing the deployment cost. While first deployed by the main web properties, Hadoop is now widely used across many sectors including entertainment, healthcare, government and market research.
 




In normal life, if we want to perform a small computation job, like searching for one name in a 100-row table, we can do it with low-end hardware (e.g. 1 GB RAM, a single-core processor).
Now suppose we need to find 1 name among 1 million records, and quickly. Hmm... it will take some time to search; not exactly known, but it might take an hour to find the record in such a huge DB. To reduce that time span we can increase the capacity of the current hardware and again get good output performance.

As you have seen above, we can solve the problem by increasing hardware capacity, which we normally call vertical scaling. It means you can add hardware as per your requirement. In this case one phrase comes to mind: "hardware specification". We can increase the RAM of a machine up to some limit, but we can't increase its capacity beyond that limit. Now suppose we find that we need the query output in a few seconds; what can we do then? Can we plan to purchase some expensive high-performance hardware which will take less time?

Any developer wants his code to be appreciated and used at large scale, and at that point we can't add the restriction that users need 30 GB RAM and a 10-core processor to get the best output. So what can we do to get good performance? Where vertical scaling runs out, horizontal scaling comes into the picture :) . It means we can add another COMMODITY HARDWARE machine to the network and use it as a COMPUTATIONAL NODE FOR PROCESSING.
It is in any case better than purchasing a single piece of expensive hardware. The technical term for solving such a problem is "distributed processing", where we split the job across several machines and at the end summarize the output in the required format.
Hadoop is an excellent software framework for the "distributed processing" of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The main intention behind the framework's design is to use COMMODITY hardware instead of high-end machines.
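To make the programming model concrete, here is a sketch of the classic WordCount example written against the org.apache.hadoop.mapreduce API: the map step runs in parallel on the nodes holding the input splits, and the reduce step summarizes the partial counts. The driver (Job) setup is omitted to keep the sketch short.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit (word, 1) for every word in the split
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();                    // add up the partial counts for this word
            }
            context.write(word, new IntWritable(sum));
        }
    }
}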
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. 


Hadoop: Assumptions
 It is written with large clusters of computers in mind and is built around the following assumptions:
i) Hardware will fail.
ii) Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
iii) Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
iv) It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
v) Applications need a write-once-read-many access model.
vi) Moving Computation is Cheaper than Moving Data.
vii) Portability is important.


What are the Hadoop Ecosystem Components?











Thursday, March 13, 2014

Core Java : Java hashCode()

In the Java programming language, every class implicitly or explicitly provides a hashCode() method, which digests the data stored in an instance of the class into a single hash value (a 32-bit signed integer). This hash is used by other code when storing or manipulating the instance – the values are intended to be evenly distributed for varied inputs, for use in clustering and hash-based lookups.
Technically, in Java, hashCode() by default is a native method, meaning, it has the modifier 'native', as it is implemented directly in the native code in the JVM.
For example, hashing is designed to solve the problem of needing to efficiently find or store an item in a collection. If we have a list of 10,000 English words and we want to check whether a given word is in the list, it would be inefficient to successively compare the word with all 10,000 items until we find a match. Hashing is a technique to make things more efficient by effectively narrowing down the search at the outset.
 

There are some restrictions placed on the behavior of equals() and hashCode(), which are enumerated in the documentation for Object. In particular, the equals() method must exhibit the following properties:

    Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
    Reflexivity: For all non-null references, a.equals(a)
    Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
    Consistency with hashCode(): Two equal objects must have the same hashCode() value
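As a quick sketch, here is a small class that honors all of these rules; the class and field names are invented purely for illustration.

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;                 // reflexivity
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;                    // symmetric and transitive
    }

    @Override
    public int hashCode() {
        return 31 * x + y;                              // equal objects always get equal hash codes
    }
}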


What is hashing?
Hashing means using some function or algorithm to map object data to some representative integer value. This so-called hash code (or simply hash) can then be used as a way to narrow down our search when looking for the item in the map

How hashing works

Purely as an example to help us grasp the concept, let's suppose that we want to map a list of string keys to string values (for example, map a list of countries to their capital cities). So let's say we want to store the data in Table 1 in the map.
Key           Value
Cuba          Havana
England       London
France        Paris
Spain         Madrid
Switzerland   Berne
Table 1: Example data to put in a map
And let's suppose that our hash function is to simply take the length of the string. For simplicity, we'll have two arrays: one for our keys and one for the values. So to put an item in the hash table, we compute its hash code (in this case, simply count the number of characters), then put the key and value in the arrays at the corresponding index. For example, Cuba has a hash code (length) of 4. So we store Cuba in the 4th position in the keys array, and Havana in the 4th index of the values array etc. And we end up with the following:
Position (hash code = key length)   Keys array      Values array
 1
 2
 3
 4                                  Cuba            Havana
 5                                  Spain           Madrid
 6                                  France          Paris
 7                                  England         London
 8
 9
10
11                                  Switzerland     Berne
Now, in this specific example things work quite well. Our array needs to be big enough to accommodate the longest string, but in this case that's only 11 slots. And we do waste a bit of space because, for example, there are no 1-letter keys in our data, nor keys between 8 and 10 letters. But in this case the waste isn't so bad either. And taking the length of a string is nice and fast, and so is the process of finding the value associated with a given key (certainly faster than doing up to five string comparisons).
We can also easily see that this method wouldn't work for storing arbitrary strings. If one of our string keys was a thousand characters in length but the rest were small, we'd waste the majority of the space in the arrays. More seriously, this model can't deal with collisions: that is, what to do when there is more than one key with the same hash code (in this case, more than one string of a given length). For example, if our keys were random words of English, taking the string length would be fairly useless. Granted, the word pseudoantidisestablishmentarianistically would probably get its own place in the array. But on the other hand, we'd be left with thousands of, say, 6-letter words all competing for the same slot in the array.
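Purely to make the worked example concrete, here is the length-based scheme as code; it is deliberately naive and exists only to show the idea and where it breaks down.

public class LengthHashDemo {
    public static void main(String[] args) {
        String[] keys = new String[12];
        String[] values = new String[12];

        put(keys, values, "Cuba", "Havana");
        put(keys, values, "Spain", "Madrid");

        System.out.println(get(keys, values, "Cuba"));   // Havana
        // put(keys, values, "Chile", "Santiago");       // would collide with "Spain" (length 5)!
    }

    static void put(String[] keys, String[] values, String key, String value) {
        int index = key.length();                        // the "hash function"
        keys[index] = key;
        values[index] = value;
    }

    static String get(String[] keys, String[] values, String key) {
        int index = key.length();
        return key.equals(keys[index]) ? values[index] : null;
    }
}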

Introduction to hashing and hash maps
   
    On the previous page, we introduced the notion of hashing, mapping a piece of data such as a string to some kind of a representative integer value. We can then create a map by using this hash as an index into an array of key/value pairs. Such a structure is generally called a hash table or, particularly in Java parlance, a hash map. We saw that using the string length to create the hash, and indexing a simple array, could work in some restricted cases, but is no good generally: for example, we have the problem of collisions (several keys with the same length) and wasted space if a few keys are vastly larger than the majority.



Buckets
Now, we can solve the problem of collisions by having an array of (references to) linked lists rather than simply an array of keys/values. Each little list is generally called a bucket.
Then, we can solve the problem of having an array that is too large simply by taking the hash code modulo a certain array size. So for example, if the array were 32 positions in size, going from 0-31, then rather than storing a key/value pair in the list at position 33, we store it at position (33 mod 32) = 1. (In simple terms, we "wrap round" when we reach the end of the array.) So we end up with a structure something like this:

Graphical Representation

Each node in the linked lists stores a pairing of a key with a value. Now, to look for the mapping for, say, Ireland, we first compute this key's hash code (in this case, the string length, 7). Then we start traversing the linked list at position 7 in the table. We traverse each node in the list, comparing the key stored in that node with Ireland. When we find a match, we return the value from the pair stored in that node (Dublin). In our example here, we find it on the second comparison. So although we have to do some comparisons, if the list at a given position in the table is fairly short, we'll still reduce significantly the amount of work we need to do to find a given key/value mapping.
The structure we have just illustrated is essentially the one used by Java's hash maps and hash sets. However, we generally wouldn't want to use the string length as the hash code. In the next sections, we'll explore how to generate more adequate hash codes. 
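The following is a toy sketch of that structure (not Java's real HashMap implementation): hashCode() modulo the table size selects a bucket, and each bucket is a linked list of key/value entries that we walk with equals().

import java.util.LinkedList;

public class TinyHashMap<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[32];

    private int indexFor(K key) {
        // drop the sign bit, then "wrap round" the table size
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<Entry<K, V>>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }   // overwrite an existing key
        }
        buckets[i].add(new Entry<K, V>(key, value));
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {                        // walk the bucket, comparing keys
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        TinyHashMap<String, String> capitals = new TinyHashMap<String, String>();
        capitals.put("Ireland", "Dublin");
        capitals.put("France", "Paris");
        System.out.println(capitals.get("Ireland"));              // Dublin
    }
}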

Improving our hash function 
In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by having a linked list in each bucket, then taking the string length as the hash function will theoretically work. But it has an obvious problem if we want our keys to be, say, 100,000 words of English. In this case, our linked lists at array position 6, as well as those corresponding to other common string lengths, will still have thousands of entries, while higher numbers will have practically none. Sure, it's better to scan a list of 10,000 entries than a list of 100,000 when looking for the key. But really, our hash function of taking the string length isn't adequate. We need to choose a better hash function that ideally has these properties:

  • It can distribute the keys more or less evenly over the range of positions in the array. We want to avoid the situation where position 6 has a list of 10,000 entries and position 13 has a list of only 20 entries.
  • It can distribute the keys over a larger range of values. If there are 100,000 words in the map and a perfectly-distributed hash function, then having 50 positions in the array would still give an average list size of 2,000 items. But if we could have 5,000 positions, there'd be just 20 items in each list. And if we had in the region of 100,000 positions then we'd on average make just one or two comparisons to find the mapping we were looking for.
 In practice, we combine these problems and define our task as coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, that means a 4-byte number, or one with about 4 billion possible values). Then we assume that whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly.

Writing an adequate hash function for strings

For a given key X, the hash function must always generate the same hash code Y. But the point is that Y should essentially be a "random number across the possible range of hash codes". For typical keys, we don't want our hash codes to tend to "clump" within a certain range, or in binary terms, for certain bits of the hash code to tend to be always 0 or always 1. On the other hand, we also require our hash function to be fairly efficient. In practice, "reasonably random but quick to compute" is the tradeoff we want. For example, we'll see below that so long as a fair proportion of the bits of the hash code are random, then this randomness can be "ironed out" across all of the bits, and for most cases that's good enough.

Let's have a look at how the Java String class actually computes its hash codes. The algorithm goes something like this:

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++) {
        hash = hash * 31 + charAt(i);
    }
    return hash;
}

At this point, there are a couple of different routes you can take:
    If you just want to take a pragmatic approach and use Strings as keys to your map data without worrying about any more of the ins and outs of hashing, then you can start using the HashMap class immediately (a short example follows this list);
    If you want to use objects of your own classes as a key but without worrying too much more about the technical details of hash functions, then you can see some practical hash function guidelines and a guide to overriding the hashCode() and equals() methods, which is effectively what you need to do to "plug in" your hash function and make your class work as a hash map key.
    You may wish to understand more of the technical details of how hash functions work and what the basis is behind the guidelines mentioned.
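For the pragmatic route, here is a tiny example: String already provides a well-distributed hashCode(), so a HashMap with String keys works out of the box (the keys and values are just sample data).

import java.util.HashMap;
import java.util.Map;

public class CapitalsDemo {
    public static void main(String[] args) {
        Map<String, String> capitals = new HashMap<String, String>();
        capitals.put("France", "Paris");
        capitals.put("Cuba", "Havana");
        System.out.println(capitals.get("France"));   // Paris -- String.hashCode() picks the bucket
    }
}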





Tuesday, March 11, 2014

Core Java : Java Collection Basics


The Java platform includes a collections framework. A collection is an object that represents a group of objects (such as the classic Vector class).

A collection, as its name implies, is simply an object that holds a collection (or a group, a container) of objects. Each item in a collection is called an element. A framework, by definition, is a set of interfaces that forces you to adopt some design practices. A well-designed framework can improve your productivity and provide ease of maintenance.
A collections framework is a unified architecture for representing and manipulating collections, enabling collections to be manipulated independently of implementation details.



The collection framework provides a unified interface to store, retrieve and manipulate the elements of a collection, regardless of the underlying and actual implementation. This allows programmers to program to the interfaces, instead of to the actual implementation.


The advantages of a collections framework
  1. Increases performance by providing high-performance implementations of data structures and algorithms.
  2. Reduces programming effort by providing data structures and algorithms so you don't have to write them yourself. 
  3. Reduces the effort required to design and implement APIs by not requiring you to produce ad hoc collections APIs. 
  4. Provides interoperability between unrelated APIs by establishing a common language to pass collections back and forth
     
    First Part of Collection Framework

    First Part of Collection Framework Related to List & Set

    Second Part Of Collection Framework Related to Map

    The following list describes the core collection interfaces:
Collection — the root of the collection hierarchy. A collection represents a group of objects known as its elements. Some types of collections allow duplicate elements, and others do not. Some are ordered and others are unordered. The Java platform doesn’t provide any direct implementations of this interface but provides implementations of more specific sub-interfaces, such as Set and List.
Set — comes from the mathematical word "set". This interface models the mathematical set abstraction and is used to represent sets. A Set cannot contain duplicate elements.
List — an ordered collection (also known as a sequence). Lists can contain duplicate elements. The user of a List generally has precise control over where in the list each element is inserted and can access elements by their integer index (position).
Queue — a collection used to hold multiple elements prior to processing. Besides basic Collection operations, a Queue provides additional insertion, extraction, and inspection operations.
Queues typically, but do not necessarily, order elements in a FIFO (first-in, first-out) manner. Among the exceptions are priority queues, which order elements according to a supplied comparator or the elements’ natural ordering.
Map — an object that maps keys to values. A Map cannot contain duplicate keys; each key can map to at most one value. If you’ve used Hashtable, you’re already familiar with the basics of Map.

Deque — a collection used to hold multiple elements prior to processing. Besides basic Collection operations, a Deque provides additional insertion, extraction, and inspection operations. Deques can be used both as FIFO (first-in, first-out) and LIFO (last-in, first-out) structures. In a deque, new elements can be inserted, retrieved and removed at both ends.


 The last two core collection interfaces are merely sorted versions of Set and Map:
 SortedSet — a Set that maintains its elements in ascending order. Several additional operations are provided to take advantage of the ordering. Sorted sets are used for naturally ordered sets, such as word lists and membership rolls.
 SortedMap — a Map that maintains its mappings in ascending key order. This is the Map analog of SortedSet. Sorted maps are used for naturally ordered collections of key/value pairs, such as dictionaries and telephone directories.

Let's start to look at all of these features in depth.


Every collection implementation has different properties.


Collection interface
The Collection interface is used to represent any group of objects, or elements. You use the interface when you wish to work with a group of elements in as general a manner as possible. Here is a list of the public methods of Collection.


 The interface supports basic operations like adding and removing. When you try to remove an element, only a single instance of the element in the collection is removed, if present.
 * boolean add(Object element)
 * boolean remove(Object element)


The Collection interface also supports query operations (a short usage sketch follows this list):
 * int size()
 * boolean isEmpty()
 * boolean contains(Object element)
 * Iterator iterator()
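As a short usage sketch of these operations, programming against the Collection interface rather than a concrete class:

import java.util.ArrayList;
import java.util.Collection;

public class CollectionBasicsDemo {
    public static void main(String[] args) {
        Collection<String> names = new ArrayList<String>();   // code to the interface
        names.add("Asha");
        names.add("Ravi");
        System.out.println(names.size());                     // 2
        System.out.println(names.contains("Ravi"));           // true
        names.remove("Ravi");                                 // removes a single matching instance
        System.out.println(names.isEmpty());                  // false
    }
}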



Iterator interface 
The iterator() method of the Collection interface returns an Iterator. An Iterator is similar to the Enumeration interface. You can traverse a collection from start to finish and safely remove elements from the underlying Collection:

Collection collection = ...;
Iterator iterator = collection.iterator();
while (iterator.hasNext()) {
    Object element = iterator.next();
    if (removalCheck(element)) {
        iterator.remove();
    }
}



Set Interface  
The Set interface extends the Collection interface and, by definition, forbids duplicates within the collection. All the original methods are present and no new methods are introduced. The concrete Set implementation classes rely on the equals() method of the object added to check for equality. 

HashSet
This class implements the Set interface, backed by a hash table (actually a HashMap instance). You will use a HashSet for storing your duplicate-free collection. It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element. For efficiency, objects added to a HashSet need to implement the hashCode() method in a manner that properly distributes the hash codes. While most system classes override the default hashCode() implementation in Object, when creating our own classes to add to a HashSet, remember to override hashCode().

public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, Serializable

This class offers constant time performance for basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets. We can set the initial capacity and load factor for this collection. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
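A small example of typical HashSet behaviour, including the initial capacity and load factor constructor (the values chosen here are arbitrary):

import java.util.HashSet;
import java.util.Set;

public class HashSetDemo {
    public static void main(String[] args) {
        Set<String> tags = new HashSet<String>(32, 0.75f);   // initial capacity 32, load factor 0.75
        tags.add("hadoop");
        tags.add("nosql");
        tags.add("hadoop");                                  // duplicate, silently ignored
        System.out.println(tags.size());                     // 2
        System.out.println(tags.contains("nosql"));          // true, roughly constant time
    }
}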


LinkedHashSet :
Hash table and linked list implementation of the Set interface, with predictable iteration order. This implementation differs from HashSet in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is the order in which elements were inserted into the set (insertion-order). Note that insertion order is not affected if an element is re-inserted into the set. (An element e is reinserted into a set s if s.add(e) is invoked when s.contains(e) would return true immediately prior to the invocation).


This class provides all of the optional Set operations, and permits null elements. Like HashSet, it provides constant-time performance for the basic operations (add, contains and remove), assuming the hash function disperses elements properly among the buckets. 
Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list

A linked hash set has two parameters that affect its performance: initial capacity and load factor. They are defined precisely as for HashSet. Note, however, that the penalty for choosing an excessively high value for initial capacity is less severe for this class than for HashSet, as iteration times for this class are unaffected by capacity


Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.


public class LinkedHashSet<E>
    extends HashSet<E>
    implements Set<E>, Cloneable, Serializable
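A brief sketch of the insertion-order behaviour described above:

import java.util.LinkedHashSet;
import java.util.Set;

public class LinkedHashSetDemo {
    public static void main(String[] args) {
        Set<String> visited = new LinkedHashSet<String>();
        visited.add("home");
        visited.add("catalog");
        visited.add("cart");
        visited.add("home");             // already present: its position is NOT affected
        System.out.println(visited);     // [home, catalog, cart] -- insertion order preserved
    }
}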

Interface SortedSet
A SortedSet is a Set that maintains its elements in ascending order, sorted according to the elements' natural ordering or according to a Comparator provided at SortedSet creation time. Several additional operations are provided to take advantage of the ordering. Sorted sets are used for naturally ordered sets, such as word lists and membership rolls.

Range view — allows arbitrary range operations on the sorted set
Endpoints — returns the first or last element in the sorted set
Comparator access — returns the Comparator, if any, used to sort the set


public interface SortedSet<E> extends Set<E> {
    // Range-view
    SortedSet<E> subSet(E fromElement, E toElement);
    SortedSet<E> headSet(E toElement);
    SortedSet<E> tailSet(E fromElement);
    // Endpoints
    E first();
    E last();
    // Comparator access
    Comparator<? super E> comparator();
}
The operations that SortedSet inherits from Set behave identically on sorted sets and normal sets with two exceptions:

  • The Iterator returned by the iterator operation traverses the sorted set in order.
  • The array returned by toArray contains the sorted set's elements in order.

SortedSet implementations also provide, by convention, a constructor that takes a Comparator and returns an empty set sorted according to the specified Comparator. If null is passed to this constructor, it returns a set that sorts its elements according to their natural ordering.

Range-view Operations

The range-view operations are somewhat analogous to those provided by the List interface, but there is one big difference. Range views of a sorted set remain valid even if the backing sorted set is modified directly. This is feasible because the endpoints of a range view of a sorted set are absolute points in the element space rather than specific elements in the backing collection.
The SortedSet interface contains two more range-view operations — headSet and tailSet, both of which take a single Object argument. The former returns a view of the initial portion of the backing SortedSet, up to but not including the specified object. The latter returns a view of the final portion of the backing SortedSet, beginning with the specified object and continuing to the end of the backing SortedSet. Thus, the following code allows you to view the dictionary as two disjoint volumes (a-m and n-z):

SortedSet<String> volume1 = dictionary.headSet("n");
SortedSet<String> volume2 = dictionary.tailSet("n");

Endpoint Operations
The SortedSet interface contains operations to return the first and last elements in the sorted set, not surprisingly called first and last. In addition to their obvious uses, last allows a workaround for a deficiency in the SortedSet interface.
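Putting the range-view and endpoint operations together, a small runnable sketch with a TreeSet-backed dictionary (the words are sample data):

import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeSet;

public class DictionaryVolumesDemo {
    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<String>();
        Collections.addAll(dictionary, "apple", "mango", "zebra", "night", "banana");

        SortedSet<String> volume1 = dictionary.headSet("n");   // [apple, banana, mango]
        SortedSet<String> volume2 = dictionary.tailSet("n");   // [night, zebra]
        System.out.println(volume1 + " / " + volume2);
        System.out.println(dictionary.first() + " .. " + dictionary.last());   // apple .. zebra
    }
}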


TreeSet
This class implements the Set interface, backed by a TreeMap instance. This class guarantees that the sorted set will be in ascending element order, sorted according to the natural order of the elements (see Comparable), or by the comparator provided at set creation time, depending on which constructor is used.

public class TreeSet<E>
    extends AbstractSet<E>
    implements NavigableSet<E>, Cloneable, Serializable
The TreeSet class guarantees that the set will be kept in ascending element order and is backed by a TreeMap.
The set is sorted according to the natural sort order of the element class, or by the Comparator provided at set creation time, depending on which constructor is used.
The ordering must be total in order for the tree to function properly.

Interesting methods
E ceiling(E e)
This method returns the least element in this set greater than or equal to the given element, or null if there is no such element.
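A quick example of ceiling() in action (sample values only):

import java.util.Collections;
import java.util.TreeSet;

public class CeilingDemo {
    public static void main(String[] args) {
        TreeSet<Integer> scores = new TreeSet<Integer>();
        Collections.addAll(scores, 10, 20, 30, 40);
        System.out.println(scores.ceiling(25));   // 30 -- the least element >= 25
        System.out.println(scores.ceiling(40));   // 40 -- an equal element counts
        System.out.println(scores.ceiling(50));   // null -- no such element
    }
}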

List is an ordered collection and can contain duplicate elements. You can access any element from it’s index. List is more like array with dynamic length. List is one of the most used Collection type. ArrayList and LinkedList are implementation classes of List interface.