Monday, December 4, 2017

Core Java OOPS

Object-Oriented Programming 
          Object-Oriented Programming (OOP) is at the core of Java. Whenever you write a program in Java, simple or complex, it follows OOP principles, directly or indirectly.


Four OOP Principles

All object-oriented programming languages provide mechanisms that help you implement the object-oriented model. They are abstraction, encapsulation, inheritance, and polymorphism.


A)   Abstraction : 
An essential element of OOP is abstraction. Humans manage complexity through abstraction. Data abstraction is the property by virtue of which only the essential details are displayed to the user. It can be achieved using abstract classes and interfaces.
Suppose a person wants to purchase a new mobile phone from a particular shopping mall. He will drive to the mall in his car. While driving, he doesn't need to think about which car part is moving the car forward or how the petrol gets converted into piston energy; he just needs to think about which mobile he wants to purchase. From the outside, a car is a single object. Once you are inside the car, you can see that it consists of several subsystems: steering, brakes, engine, accelerator, CD player.
A powerful way to manage abstraction is through the use of hierarchical classifications.
This allows you to layer the semantics of complex systems, breaking them into more manageable pieces.

Abstraction is a process where you show only the "relevant" data and "hide" the unnecessary details of an object from the user.
In Java, abstraction can be achieved in two ways: abstract classes/methods and interfaces. Whoever wants to use them supplies their own implementation.
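As an illustration, here is a minimal sketch (the Drivable/FamilyCar names are made up for this example, not taken from any library) in which the caller sees only the relevant operation and none of the internal machinery:

interface Drivable {
    void drive();                      // the only detail the user cares about
}

class FamilyCar implements Drivable {
    @Override
    public void drive() {
        igniteFuel();                  // internal details stay hidden
        System.out.println("Driving to the shopping mall...");
    }

    private void igniteFuel() {
        // low-level engine work the driver never sees
    }
}

public class AbstractionDemo {
    public static void main(String[] args) {
        Drivable car = new FamilyCar();
        car.drive();                   // the user works only with the abstraction
    }
}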

B)   Encapsulation
Encapsulation in Java is a mechanism of wrapping the data (variables) and code acting on the data (methods) together as a single unit. It keeps both safe from outside interference and misuse. One way to think about encapsulation is as a protective wrapper that prevents the code and data from being arbitrarily accessed by other code defined outside the wrapper. Access to the code and data inside the wrapper is tightly controlled through a well-defined interface.

In Java, the basis of encapsulation is the class. A class defines the structure and behavior (data and code) that will be shared by a set of objects. Each object of a given class contains the structure and behavior defined by the class.

 It is also known as data hiding.
To achieve encapsulation in Java −
·        Declare the variables of a class as private.
·        Provide public setter and getter methods to modify and view the variables' values.


Encapsulation means wrapping data into a single unit: binding object state (fields) and behavior (methods) together. If you are creating a class, you are doing encapsulation.
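A minimal sketch of this idea (the Account class is purely illustrative): the field is private, and all access goes through public getter/setter methods, which can also validate input:

public class Account {
    private double balance;                       // hidden state (data hiding)

    public double getBalance() {                  // controlled read access
        return balance;
    }

    public void setBalance(double balance) {      // controlled write access
        if (balance < 0) {
            throw new IllegalArgumentException("Balance cannot be negative");
        }
        this.balance = balance;
    }
}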

Inheritance
Inheritance is the process by which one object acquires the properties of another object. This is important because it supports the concept of hierarchical classification.




1.   Inheritance is the process of defining a new class based on an existing class, extending its data members and methods.
2.   Inheritance allows us to reuse code; it improves reusability in your Java application.
3.   The parent class is called the base class or superclass. The child class that extends the base class is called the derived class, subclass, or child class.
extends is the keyword used to inherit the properties of a class.

The super keyword is similar to the this keyword: this refers to the current object, while super refers to the immediate parent class and is used to access the parent's constructors, methods, and fields.
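A small illustrative sketch (class names are made up) of extends and super in action: super() calls the parent constructor, and super.start() reuses the parent's behaviour inside the overriding method:

class Machine {
    Machine() {
        System.out.println("Machine constructor");
    }
    void start() {
        System.out.println("Machine starts");
    }
}

class Robot extends Machine {
    Robot() {
        super();                     // calls the parent (Machine) constructor
        System.out.println("Robot constructor");
    }
    @Override
    void start() {
        super.start();               // reuse the parent behaviour first
        System.out.println("Robot runs its own start-up checks");
    }
}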


IS-A Relationship
IS-A is a way of saying: This object is a type of that object. Let us see how the extends keyword is used to achieve inheritance.
public class Car {
}

public class Suzuki extends Car  {
}

public class Tata extends  Car {
}

public class Nexa  extends Suzuki {
}
Now, based on the above example, in Object-Oriented terms, the following are true −
  • Car is the superclass of Suzuki class.
  • Car is the superclass of Tata class.
  • Suzuki and Tata are direct subclasses of the Car class, while Nexa is a subclass of Suzuki and, indirectly, of Car.
Types of Inheritance:
  • Single inheritance: a class extends one other class (a single child/parent relationship).
  • Multilevel inheritance: a chain of child/parent relationships; for example, class A extends class B and class B extends class C.
  • Hierarchical inheritance: more than one class extends the same class; for example, class B extends class A and class C extends class A.
  • Multiple inheritance: one class extends more than one class, which means a child class has two parent classes. Java does not support multiple inheritance of classes.
Most of the newer OO languages, such as Smalltalk, Java, and C#, do not support multiple inheritance of classes.



Polymorphism
Polymorphism (from Greek, meaning “many forms”) is a feature that allows one interface to be used for a general class of actions. The specific action is determined by the exact nature of the situation.
More generally, the concept of polymorphism is often expressed by the phrase “one interface, multiple methods.” This means that it is possible to design a generic interface to a group of related activities. This helps reduce complexity by allowing the same interface to be used to specify a general class of action. It is the compiler’s job to select the specific action (that is, method) as it applies to each situation. You, the programmer, do not need to make this selection manually; you need only remember and utilize the general interface.


Types of Polymorphism
1) Static Polymorphism
2) Dynamic Polymorphism

Static Polymorphism:

Polymorphism that is resolved at compile time is known as static polymorphism. Method overloading is an example of static polymorphism.
Method overloading: this allows us to have more than one method with the same name in a class, as long as the methods differ in signature.
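A minimal overloading sketch (the Calculator class is illustrative): three methods share one name but differ in parameter lists, and the compiler picks the right one at compile time:

public class Calculator {
    int add(int a, int b) {
        return a + b;
    }
    int add(int a, int b, int c) {        // different number of parameters
        return a + b + c;
    }
    double add(double a, double b) {      // different parameter types
        return a + b;
    }
}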

Dynamic Polymorphism

It is also known as dynamic method dispatch. Dynamic polymorphism is a process in which a call to an overridden method is resolved at runtime rather than at compile time; that is why it is called runtime polymorphism.
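A minimal overriding sketch (Animal and Dog are illustrative names): the reference type is Animal, but the method that runs is chosen at runtime from the actual object type:

class Animal {
    void sound() {
        System.out.println("Some generic sound");
    }
}

class Dog extends Animal {
    @Override
    void sound() {
        System.out.println("Bark");
    }
}

public class DispatchDemo {
    public static void main(String[] args) {
        Animal a = new Dog();    // reference type Animal, object type Dog
        a.sound();               // prints "Bark": resolved at runtime
    }
}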

IS-A & HAS-A Relationships

If a Car IS-A Vehicle and HAS-A License, the code would look like this:
public class Vehicle{ }
public class Car extends Vehicle{
   private License myCarLicense;
}

Abstract Class and methods in OOPs Concepts

Abstract method:
1) A method that is declared but not defined: only the method signature, no body.
2) Declared using the abstract keyword
3) Example :
abstract public void playInstrument();
4) Used to put a compulsion on any class that inherits a class containing abstract methods: the subclass must provide implementations of all the abstract methods of the parent class, or else be declared abstract itself (a sketch follows this list).
5) These cannot be abstract:
·         Constructors
·         Static methods
·         Private methods
·         Methods that are declared “final”
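A short sketch echoing the playInstrument() example above (Instrument and Guitar are made-up names): the abstract class declares the method, and the concrete subclass is forced to implement it:

abstract class Instrument {
    abstract public void playInstrument();    // declared, not defined: no body

    public void tune() {                      // concrete methods are allowed too
        System.out.println("Tuning...");
    }
}

class Guitar extends Instrument {
    @Override
    public void playInstrument() {            // compulsory implementation
        System.out.println("Strumming the guitar");
    }
}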

Interfaces in Java

An interface is a blueprint of a class, declared using the interface keyword. Classically, interfaces contain only constants and abstract methods (method signatures with no body); since Java 8 they may also carry default and static methods. Like abstract classes, interfaces cannot be instantiated; they can only be implemented by classes or extended by other interfaces. An interface is a common way to achieve full abstraction in Java.
1.    Java does not support multiple inheritance of classes; however, a class can implement more than one interface (see the sketch below).
2.    An interface is similar to an abstract class, but it contains only abstract methods.
3.    Interfaces are created using the interface keyword instead of the class keyword.
4.    We use the implements keyword while implementing an interface (similar to extending a class with the extends keyword).
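A small sketch (interface and class names are illustrative) of a class implementing two interfaces, which is the closest Java gets to multiple inheritance:

interface Playable {
    void play();
}

interface Recordable {
    void record();
}

class MediaDevice implements Playable, Recordable {
    @Override
    public void play() {
        System.out.println("Playing");
    }
    @Override
    public void record() {
        System.out.println("Recording");
    }
}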


Generalization and Specialization:
In order to implement the concept of inheritance in OOP, one first has to identify the similarities among different classes so as to come up with the base class.
This process of identifying the similarities among different classes is called Generalization. Generalization is the process of extracting shared characteristics from two or more classes, and combining them into a generalized superclass. Shared characteristics can be attributes or methods.
In contrast to generalization, specialization means creating new subclasses from an existing class. If it turns out that certain attributes or methods only apply to some of the objects of the class, a subclass can be created.


Monday, March 24, 2014

Hadoop - NOSQL - Mongo DB

MongoDB (from "humongous") is an open-source document database and the leading NoSQL database; it provides a document-oriented data structure for storing and retrieving elements. MongoDB provides high performance, high availability, and easy scalability.

How MongoDB Data Looks like
MongoDB is made up of databases, which contain collections. A collection is made up of documents, and each document is made up of fields. Collections can be indexed, which improves lookup and sorting performance. Finally, when we get data from MongoDB we do so through a cursor, whose actual execution is delayed until necessary.

To get started, there are six simple concepts we need to understand.


1) MongoDB has the same concept of a database that you are likely already familiar with (comparable to a schema in Oracle/DB2). Within a MongoDB instance you can have zero or more databases, each acting as a high-level container for everything else.
2) A database can have zero or more collections. A collection shares enough in common with a traditional table that you can safely think of the two as the same thing.
3) Collections are made up of zero or more documents. Again, a document can safely be thought of as a row.
4) A document is made up of one or more fields, which, as you can probably guess, are a lot like columns.
5) Indexes in MongoDB function much like their RDBMS counterparts.
6) Cursors are different than the other five concepts but they are important enough, and often overlooked, that I think they are worthy of their own discussion. The important thing to understand about cursors is that when you ask MongoDB for data, it returns a cursor, which we can do things to, such as counting or skipping ahead, without actually pulling down data.


Core Concept for Storing elements
The core difference comes from the fact that relational databases define columns at the table level, whereas a document-oriented database defines its fields at the document level. That is to say, each document within a collection can have its own unique set of fields. As such, a collection is a dumbed-down container in comparison to a table, while a document carries a lot more information than a row. Ultimately, the point is that a collection isn't strict about what goes into it (it's schema-less); fields are tracked with each individual document.
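As a rough sketch of these concepts in code, here is what the flow might look like with the MongoDB Java sync driver (this assumes org.mongodb:mongodb-driver-sync is on the classpath and a mongod is running locally; the database and collection names are made up). Note how the two documents in the same collection carry different fields:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");                // database
            MongoCollection<Document> users = db.getCollection("users"); // collection

            // Two documents in one collection with different fields:
            // the collection itself is schema-less.
            users.insertOne(new Document("name", "Alice").append("city", "Pune"));
            users.insertOne(new Document("name", "Bob").append("phone", "12345"));

            // find() hands back a cursor; documents are pulled lazily as we iterate.
            for (Document doc : users.find()) {                           // cursor
                System.out.println(doc.toJson());
            }
        }
    }
}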


RDBMS / Mongo DB Design

Hadoop - HDFS

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache™ Hadoop®. HDFS is a sub-project of the Apache Hadoop project; this Apache Software Foundation project provides a fault-tolerant file system designed to run on commodity hardware.
HDFS Basic Design

According to The Apache Software Foundation, the primary objective of HDFS is to store data reliably even in the presence of failures, including NameNode failures, DataNode failures, and network partitions. The NameNode is a single point of failure for the HDFS cluster, and a DataNode stores the data in the Hadoop file system.
HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). The HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files.

In production clusters, HDFS has demonstrated scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.




Basic Features: HDFS
1) Highly fault-tolerant:
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Related goals are high throughput, suitability for applications with large data sets, streaming access to file system data, and the ability to build the system out of commodity hardware.
2) Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
3) Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.


4) Moving Computation is Cheaper than Moving Data 

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Basic Design in Detail Level 

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
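As a rough sketch of how a client interacts with this architecture, the snippet below uses the Hadoop FileSystem Java API (assuming the Hadoop client libraries are on the classpath; the NameNode address and the file path are placeholders). The namespace operations go to the NameNode, while the file bytes are streamed to and from DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");  // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Create the file (namespace operation on the NameNode,
            // data blocks written to DataNodes)
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Hello HDFS");
            }

            // Read it back (block locations come from the NameNode,
            // bytes are read directly from DataNodes)
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}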

File system Namespace
  • Hierarchical file system with directories and files
  • Create, remove, move, rename etc.
  • Namenode maintains the file system
  • Any meta information changes to the file system recorded by the Namenode.
  • An application can specify the number of replicas of the file needed: replication factor of the file. This information is stored in the Namenode.
 Data Replication
  • HDFS is designed to store very large files across machines in a large cluster.
  • Each file is a sequence of blocks.
  • All blocks in the file except the last are of the same size.
  • Blocks are replicated for fault tolerance.
  • Block size and replicas are configurable per file (see the sketch after this list).
  • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
  • BlockReport contains all the blocks on a Datanode.
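A hedged sketch of the "configurable per file" point using the same FileSystem API (again, the NameNode address, path, and values are placeholders): the file is created with an explicit replication factor and block size, and the replication factor is changed afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path logFile = new Path("/user/demo/events.log");

            // Create the file with replication factor 3 and a 128 MB block size.
            try (FSDataOutputStream out =
                     fs.create(logFile, true, 4096, (short) 3, 128L * 1024 * 1024)) {
                out.writeUTF("sample record");
            }

            // Replication can also be changed after the file is written.
            fs.setReplication(logFile, (short) 2);
        }
    }
}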

Namenode
  • Keeps image of entire file system namespace and file Blockmap in memory.
  • 4GB of local RAM is sufficient to support the above data structures that represent the huge number of files and directories.
  • When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the in-memory FsImage, and then stores a new copy of the FsImage on the file system as a checkpoint.
  • Periodic checkpointing is done. So that the system can recover back to the last checkpointed state in case of a crash.

Datanode

  • A Datanode stores data in files in its local file system.
  • Datanode has no knowledge about HDFS filesystem
  • It stores each block of HDFS data in a separate file.
  • Datanode does not create all files in the same directory.
  • It uses heuristics to determine optimal number of files per directory and creates directories appropriately.
  • When a DataNode starts up, it scans its local file system, generates a list of all the HDFS blocks it holds, and sends this report to the NameNode: the Blockreport.
 
 

Friday, March 21, 2014

Hadoop - NOSQL (Not Only SQL) Basics

Imagine that you have coupons that you want to push to mobile customers who purchase a specific item. This customer-facing system of engagement requires location data, purchase data, wallet data, and so on, and you want to engage the mobile customer in real time.
What you require is a very agile delivery system that is easily able to process unstructured data. The system of engagement would need to be extremely dynamic.
A traditional database product would prefer more predictable, structured data. A relational database may require vertical, and sometimes horizontal, expansion of servers to grow as data or processing requirements grow.
An alternative, more cloud-friendly approach is to employ NoSQL.

NoSQL databases (either "no SQL" or "Not Only SQL") are currently a hot topic in some parts of computing.
“NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational
databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.
Academics and papers typically refer to these databases as structured storage.”




What is NoSQL?
NoSQL is a popular name for a subset of structured storage software that is designed and optimized for high-performance operations on large datasets. This optimization comes at the expense of strict ACID (atomicity, consistency, isolation, and durability) compliance and, as the name implies, native querying in the SQL syntax.

NoSQL software is easy for developers to use, horizontally scalable, and optimized for narrow workload definitions.
There are three major categories of NoSQL applications today:
i) Key-value stores like Cassandra, Riak, and Project Voldemort
ii) Graph databases like Neo4j, DEX, and Infinite Graph
iii) Document stores like MongoDB, eXist, and BaseX
These NoSQL applications are either separate open source software (OSS) projects or commercial closed-source projects. They are written in different languages, they expose different interfaces, and they implement different optimizations.


The load can easily grow by distributing itself over lots of ordinary, cheap, Intel-based servers. A NoSQL database is exactly the type of database that can handle the sort of unstructured, messy, and unpredictable data that our system of engagement requires.
NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database that does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.
It is not built on tables and does not employ SQL to manipulate data. It also may not provide full ACID (atomicity, consistency, isolation, durability) guarantees, but still has a distributed and fault tolerant architecture.
The NoSQL taxonomy includes key-value stores, document stores, BigTable-style stores, and graph databases.

NoSQL Advantage

1) Elastic scaling : RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases are designed to expand transparently to take advantage of new nodes, and they're usually designed with low-cost commodity hardware in mind. 
For years, database administrators have relied on scale up -- buying bigger servers as database load increases -- rather than scale out -- distributing the database across multiple hosts as load increases. However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible.
2) Big data : Just as transaction rates have grown out of recognition over the last decade, the volumes of data that are being stored also have increased massively. O'Reilly has cleverly called this the "industrial revolution of data." RDBMS capacity has been growing to match these increases, but as with transaction rates, the constraints of data volumes that can be practically managed by a single RDBMS are becoming intolerable for some enterprises. Today, the volumes of "big data" that can be handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the biggest RDBMS.
3) Less administration : NoSQL databases are generally designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.
4) Economics : NoSQL databases typically use clusters of cheap commodity servers to manage the exploding data and transaction volumes, while RDBMS tends to rely on expensive proprietary servers and storage systems. The result is that the cost per gigabyte or transaction/second for NoSQL can be many times less than the cost for RDBMS, allowing you to store and process more data at a much lower price point.
5) Flexible data models : Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels. NoSQL databases have far more relaxed -- or even nonexistent -- data model restrictions. NoSQL key-value stores and document databases allow the application to store virtually any structure it wants in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without too much fuss. The result is that application changes and database schema changes do not have to be managed as one complicated change unit. In theory, this will allow applications to iterate faster, though, clearly, there can be undesirable side effects if the application fails to manage data integrity.

Challenges of NoSQL 
1) Maturity : RDBMS systems have been around for a long time. NoSQL advocates will argue that their advancing age is a sign of their obsolescence, but for most CIOs, the maturity of the RDBMS is reassuring. For the most part, RDBMS systems are stable and richly functional. In comparison, most NoSQL alternatives are in pre-production versions with many key features yet to be implemented. Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.
2) Support : Enterprises want the reassurance that if a key system fails, they will be able to get timely and competent support. All RDBMS vendors go to great lengths to provide a high level of enterprise support.
In contrast, most NoSQL systems are open source projects, and although there are usually one or more firms offering support for each NoSQL database, these companies often are small start-ups without the global reach, support resources, or credibility of an Oracle, Microsoft, or IBM.
3) Analytics and business intelligence : NoSQL databases have evolved to meet the scaling demands of modern Web 2.0 applications. Consequently, most of their feature set is oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies.

NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL.
4) Expertise
There are literally millions of developers throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in learning mode. This situation will resolve naturally over time, but for now it's far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.


Right off the bat, NoSQL databases are unique because they are usually independent of the Structured Query Language (SQL) found in relational databases. Relational databases all use SQL as the domain-specific language for ad hoc queries, while non-relational databases have no such standard query language, so they can use whatever they want. That can, if need be, include SQL.
 NoSQL databases are designed to excel in speed and volume. To pull this off, NoSQL software will use techniques that can scare the crap out of relational database users — such as not promising that all data is consistent within a system all of the time.
That's a key strength of relational databases: when you are conducting a financial transaction, such as buying something on Amazon, the database has to be very sure that one account is credited the same amount that another account is debited, at the same time. Because so much of this back-and-forth read-write activity is needed in a single transaction, a relational database could never keep up with the speed and scaling necessary to make a company like Amazon work.


Go Big Or Go Home

Easier scalability is the first aspect highlighted by Wiederhold. NoSQL databases like Couchbase and 10Gen's MongoDB, he said, can be scaled up to handle much bigger data volumes with relative ease.


If your company suddenly finds itself deluged by overnight success, for example, with customers coming to your Web site by the droves, a relational database would have to be painstakingly replicated and re-partitioned in order to scale up to meet the new demand.

Wiederhold cited social and mobile gaming vendors as the big example of this kind of situation. An endorsement or a few well-timed tweets could spin up semi-dormant gaming servers and take them to capacity in mere hours. Because of the distributed nature of non-relational databases, to scale NoSQL all you need to do is add machines to the cluster to meet demand.

Performance / scaling


Performance is another way that NoSQL databases can excel. First, every time you add a new server to a NoSQL database cluster, there is performance scaling by virtue of the fact that you're throwing another processor into the equation.

Beyond the scaling advantages, the very architecture of NoSQL tools aids performance. If a relational database had tens or even hundreds of thousands of tables, data processing would generate far more locks on that data, and greatly degrade the performance of the database.

Because NoSQL databases have weaker data consistency models, they can trade off consistency for efficiency. In Wiederhold's social gaming example, if a user updates his or her profile, there's no real degradation of game performance if that profile's new info isn't propagated across the entire database instantly. This means that resources can be dedicated to other things, like tracking down the opponent that's about to smack you around in-game.



What is CAP about?
The CAP theorem (Consistency, Availability, and Partition tolerance) was formulated by Eric Brewer, a professor at the University of California, Berkeley, and a co-founder of Inktomi, in his 2000 keynote at the Symposium on Principles of Distributed Computing.









Let’s first give definitions to these 3 terms:
Consistency: A service that is consistent should follow the rule of ordering for updates that spread across all replicas in a cluster – “what you write is what you read”, regardless of location. For example,  Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.  This rule has another name “Strong consistency”.

Availability: A service should be available. There should be a guarantee that every request receives a response indicating whether it succeeded or failed. If the system is not available, it can still be consistent. However, in the presence of a partition, consistency and availability cannot both be achieved at the same time, so one has to choose what to give up: relaxing consistency allows the system to remain highly available under partitioning conditions (see the next definition), while insisting on strong consistency means that under certain conditions the system will not be available.

Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system. A simple example: we have a cluster of N replicated nodes and, for some reason, the network is unavailable between some of the nodes (e.g. a network cable got chopped). This makes it impossible to synchronize data, so part of the system stops working while the other part continues. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or availability (because you detect the error and shut down the system until the error condition is resolved).

A simple statement of the theorem is: "It is impossible for a protocol to guarantee both consistency and availability in a partition-prone distributed system." More detailed information follows below.




The popular understanding of Eric Brewer's CAP Theorem is that in distributed systems it is only possible to support two out of three desired properties: consistency, availability and partition tolerance. The 'two out of three' explanation of CAP Theorem has been used in recent years to explain and justify the emergence of NoSQL databases that relax consistency in favour of high availability. It has likewise been used to question the validity of claims that NewSQL databases are able to deliver both highly available distributed architecture and support ACID transactions.


We can't achieve all three characteristics at once.

So far we have talked about the principle; how is each property actually achieved?

Availability is achieved by replicating the data across different machines.
Consistency is achieved by updating several nodes before allowing further reads.
Total partitioning, meaning failure of part of the system, is rare. However, we can treat a delay (latency) in propagating an update between nodes as a temporary partition. It then forces a temporary choice between A and C:

    On systems that allow reads before updating all the nodes, we will get high Availability
    On systems that lock all the nodes before allowing reads, we will get Consistency











Wednesday, March 19, 2014

Hadoop Basics - Horizontal Scaling instead of Vertical Scaling



 What can Hadoop do for you?

With the explosion of the Big Data era, companies now need to leverage all of their available data to drive the decisions that will support their future growth. The scale and variety of this data now challenge relational databases, and that is where Hadoop can really add benefit.
Hadoop development is hosted by the Apache open-source community, and all major technology companies have contributed to the codebase, enabling Hadoop to leap ahead of proprietary solutions in a relatively short time.
Hadoop's main components are its file system (HDFS), which provides cheap and reliable data storage, and its MapReduce engine, which provides high-performance parallel data processing. In addition, Hadoop is self-managing and can easily handle hardware failures as well as scale its deployment up or down without any change to the codebase. Hadoop installs on low-cost commodity servers, reducing deployment cost. While first deployed by the main web properties, Hadoop is now widely used across many sectors including entertainment, healthcare, government, and market research.
 




In normal life, if we want to perform a simple computation job, such as searching for one name in a 100-row table, we can do it with low-end hardware (e.g. 1 GB RAM, a 1-core processor).
Now suppose we need to find one name among 1 million records, and quickly. It will take some time to search; in a huge database it might take an hour to find the record. To reduce that time we can increase the capacity of the current hardware and again get good performance.

As seen above, we can solve the problem by increasing hardware capacity, which we normally call vertical scaling: you add more resources to a single machine as your requirements grow. Here the phrase "hardware specification" comes to mind: we can increase the RAM of a machine only up to some limit and cannot grow its capacity beyond that. Now suppose we find that we need the query output in a few seconds; what can we do? Should we plan to purchase some expensive, high-performance hardware that will take less time?

Every developer wants his code to be appreciated and used at large scale, and at that point we can't tell the user that he needs 30 GB of RAM and a 10-core processor to get the best output. So what can we do to get good performance? Where vertical scaling runs out, we can use horizontal scaling :) . It means we can add more COMMODITY HARDWARE to the network and use it as COMPUTATIONAL NODES FOR PROCESSING.
That is usually better than purchasing a single expensive machine. The technical term for solving such a problem is "distributed processing", where we split the job across several machines and, at the end, summarize the output in the required format.
Hadoop is an excellent software framework for the "distributed processing" of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. A main intention in designing the framework was to use COMMODITY hardware instead of high-end machines.
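To make "split the job, then summarize" concrete, here is the well-known WordCount pattern written against the Hadoop MapReduce API (a sketch, not tuned for production; the input and output paths come from the command line). The map step runs in parallel on many commodity nodes, and the reduce step summarizes the partial results:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // emit (word, 1) on each node
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                // summarize the partial counts
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}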
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. 


Hadoop: Assumptions
 It is written with large clusters of computers in mind and is built around the following assumptions:
i) Hardware will fail.
ii) Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
iii) Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
iv) It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
v) Applications need a write-once-read-many access model.
vi) Moving Computation is Cheaper than Moving Data.
vii) Portability is important.


What are the Hadoop Ecosystem Components?











Thursday, March 13, 2014

Core Java : Java hashCode()

In the Java programming language, every class implicitly or explicitly provides a hashCode() method, which digests the data stored in an instance of the class into a single hash value (a 32-bit signed integer). This hash is used by other code when storing or manipulating the instance; the values are intended to be evenly distributed for varied inputs so that they can be used to cluster keys into buckets in hash-based collections.
Technically, in Java, hashCode() by default is a native method, meaning, it has the modifier 'native', as it is implemented directly in the native code in the JVM.
For example, hashing is designed to solve the problem of efficiently finding or storing an item in a collection. If we have a list of 10,000 English words and we want to check whether a given word is in the list, it would be inefficient to successively compare the word with all 10,000 items until we find a match. Hashing is a technique that makes this more efficient by effectively narrowing down the search at the outset.
 

There are some restrictions placed on the behavior of equals() and hashCode(), which are enumerated in the documentation for Object. In particular, the equals() method must exhibit the following properties:

    Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
    Reflexivity: For all non-null references, a.equals(a)
    Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
    Consistency with hashCode(): Two equal objects must have the same hashCode() value (a sketch follows this list)
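Here is a minimal sketch (the Person class is illustrative) of overriding equals() and hashCode() so that the contract above holds and the object behaves correctly as a HashMap/HashSet key:

import java.util.Objects;

public class Person {
    private final String name;
    private final int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;                  // reflexivity
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return age == other.age && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);              // consistent with equals()
    }
}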


What is hashing?
Hashing means using some function or algorithm to map object data to some representative integer value. This so-called hash code (or simply hash) can then be used as a way to narrow down our search when looking for the item in the map

How hashing works

Purely as an example to help us grasp the concept, let's suppose that we want to map a list of string keys to string values (for example, map a list of countries to their capital cities). So let's say we want to store the data in Table 1 in the map.
Key           Value
Cuba          Havana
England       London
France        Paris
Spain         Madrid
Switzerland   Berne

Table 1: Example data to put in a map
And let's suppose that our hash function is to simply take the length of the string. For simplicity, we'll have two arrays: one for our keys and one for the values. So to put an item in the hash table, we compute its hash code (in this case, simply count the number of characters), then put the key and value in the arrays at the corresponding index. For example, Cuba has a hash code (length) of 4. So we store Cuba in the 4th position in the keys array, and Havana in the 4th index of the values array etc. And we end up with the following:
Position (hash code = key length)   Keys array      Values array
1
2
3
4                                   Cuba            Havana
5                                   Spain           Madrid
6                                   France          Paris
7                                   England         London
8
9
10
11                                  Switzerland     Berne
Now, in this specific example things work quite well. Our array needs to be big enough to accommodate the longest string, but in this case that's only 11 slots. We do waste a bit of space because, for example, there are no 1-letter keys in our data, nor keys between 8 and 10 letters, but in this case the waste isn't so bad either. And taking the length of a string is nice and fast, and so is the process of finding the value associated with a given key (certainly faster than doing up to five string comparisons).
We can also easily see that this method wouldn't work for storing arbitrary strings. If one of our string keys was a thousand characters long but the rest were small, we'd waste the majority of the space in the arrays. More seriously, this model can't deal with collisions: that is, what to do when there is more than one key with the same hash code (in this case, more than one string of a given length). For example, if our keys were random English words, taking the string length would be fairly useless. Granted, the word pseudoantidisestablishmentarianistically would probably get its own place in the array. But on the other hand, we'd be left with thousands of, say, 6-letter words all competing for the same slot in the array.

Introduction to hashing and hash maps
   
Above, we introduced the notion of hashing: mapping a piece of data such as a string to some kind of representative integer value. We can then create a map by using this hash as an index into an array of key/value pairs. Such a structure is generally called a hash table or, particularly in Java parlance, a hash map. We saw that using the string length to create the hash, and indexing a simple array, could work in some restricted cases, but it is no good in general: for example, we have the problem of collisions (several keys with the same length) and of wasted space if a few keys are vastly larger than the majority.



Buckets
Now, we can solve the problem of collisions by having an array of (references to) linked lists rather than simply an array of keys/values. Each little list is generally called a bucket.
Then we can solve the problem of having an array that is too large simply by taking the hash code modulo a certain array size. So, for example, if the array were 32 positions in size, going from 0 to 31, then rather than storing a key/value pair in the list at position 33, we store it at position (33 mod 32) = 1. (In simple terms, we "wrap around" when we reach the end of the array.) So we end up with a structure something like this:

Graphical Representation

Each node in the linked lists stores a pairing of a key with a value. Now, to look for the mapping for, say, Ireland, we first compute this key's hash code (in this case, the string length, 7). Then we start traversing the linked list at position 7 in the table. We traverse each node in the list, comparing the key stored in that node with Ireland. When we find a match, we return the value from the pair stored in that node (Dublin). In our example here, we find it on the second comparison. So although we have to do some comparisons, if the list at a given position in the table is fairly short, we'll still reduce significantly the amount of work we need to do to find a given key/value mapping.
The structure we have just illustrated is essentially the one used by Java's hash maps and hash sets. However, we generally wouldn't want to use the string length as the hash code. In the next sections, we'll explore how to generate more adequate hash codes. 
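To make the structure concrete, here is a toy version in Java (the names are made up, and it deliberately reuses the poor "string length" hash to mirror the example; a real map would use a much better hash function and handle resizing, removal, and duplicate keys):

import java.util.LinkedList;

public class ToyHashMap {
    private static class Entry {
        final String key;
        final String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry>[] buckets = new LinkedList[32];

    private int indexFor(String key) {
        return key.length() % buckets.length;     // hash code modulo array size
    }

    public void put(String key, String value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        buckets[i].add(new Entry(key, value));
    }

    public String get(String key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry e : buckets[i]) {              // traverse the bucket
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ToyHashMap map = new ToyHashMap();
        map.put("Ireland", "Dublin");
        map.put("England", "London");             // same length, same bucket
        System.out.println(map.get("Ireland"));   // Dublin, after a short scan
    }
}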

Improving our hash function 
In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by having a linked list in each bucket, then taking the string length as the hash function will technically work. But it has an obvious problem if, say, we want our keys to be 100,000 words of English. In this case, our linked list at array position 6, as well as those corresponding to other common string lengths, will still have thousands of entries, while higher positions will have practically none. Sure, it's better to scan a list of 10,000 entries than a list of 100,000 when looking for the key. But really, our hash function of taking the string length isn't adequate. We need to choose a better hash function, one that ideally has these properties:

  • It can distribute the keys more or less evenly over the range of positions in the array. We want to avoid the situation where position 6 has a list of 10,000 entries and position 13 has a list of only 20 entries.
  • It can distribute the keys over a larger range of values. If there are 100,000 words in the map and a perfectly distributed hash function, then having 50 positions in the array would still give an average list size of 2,000 items. But if we could have 5,000 positions, there'd be just 20 items in each list. And if we had in the region of 100,000 positions, then we'd on average make just one or two comparisons to find the mapping we were looking for.
In practice, we combine these requirements and define our task as coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, that means a 4-byte number, or one with about 4 billion possible values). Then we assume that, whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly.

Writing an adequate hash function for strings

For a given key X, the hash function must always generate the same hash code Y. But the point is that Y should essentially be a "random number across the possible range of hash codes". For typical keys, we don't want our hash codes to tend to "clump" within a certain range, or, in binary terms, for certain bits of the hash code to tend to be always 0 or always 1. On the other hand, we also require our hash function to be fairly efficient. In practice, "reasonably random but quick to compute" is the trade-off we want. For example, we'll see below that so long as a fair proportion of the bits of the hash code are random, this randomness can be "ironed out" across all of the bits, and for most cases that's good enough.

Let's have a look at how the Java String class actually computes its hash codes. The algorithm goes something like this:

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++) {
        hash = hash * 31 + charAt(i);
    }
    return hash;
}

At this point, there are a couple of different routes you can take:
    If you just want to take a pragmatic approach and use Strings as keys to your map data without worrying about any more of the ins and outs of hashing, then you can start using the HashMap class immediately;
    If you want to use objects of your own classes as keys, but without worrying too much more about the technical details of hash functions, then you can follow some practical hash function guidelines and a guide to overriding the hashCode() and equals() methods, which is effectively what you need to do to "plug in" your hash function and make your class work as a hash map key;
    You may wish to understand more of the technical details of how hash functions work and what the basis is behind the guidelines mentioned.