Task 1

Have you ever thought about Big Data: how important it is, and how companies like Google, Facebook, and YouTube handle their data?
Here is an article that gives you an idea of how these companies manage such large volumes of data.
I would like to thank Mr Vimal Daga Sir for sharing so much valuable information about Big Data.
Arth - The school of technology


To handle a complex variety of data we need an engineered mechanism, and Big Data systems help simplify complex data structures.
They are needed to derive insights from complex and huge volumes of data. Data can be enormous, but to analyse it we need a system, and that is where a Big Data system helps.
Big Data helps in cost reduction, as the systems can be installed at affordable prices.
It helps in better decision making, as the analytics and algorithms involved provide accurate and appropriate results.
Big Data, in layman's terms, is a huge volume of data that keeps growing with time. It consists of structured, semi-structured, and unstructured data, which can be tracked and mined for analysis or research purposes.

What is Big Data?

Big Data, in simple terms, is a large amount of structured, semi-structured, and unstructured data that can be used for analysis.

Doug Laney defined Big Data in terms of the three V's: Volume, Velocity, and Variety.

Volume: The name Big Data itself suggests a large amount of data. The size of the data is important in determining whether it qualifies as "Big Data", so Volume is a defining characteristic.
Velocity: Velocity is the speed at which data is generated and processed. The faster data is generated and processed, the sooner its real potential can be realised; the flow of data is huge, and Velocity captures that.
Variety: Data comes in various forms: structured, unstructured, numeric, and so on. Earlier, only spreadsheets and databases were considered data; now PDFs, emails, audio files, and more are analysed as well.
Let us know more about Big Data

Big Data has turned out to be really important for businesses that want to maintain huge amounts of data. Companies have moved to Big Data technologies in order to maintain data for analysis and business development purposes.

Importance of Big Data:

Big Data is important not because of its volume, but because of what you do with the data and how you analyse it to benefit your business or organization.

Big Data helps analyze:

Product Development
Decision Making
Big Data, when teamed up with analytics, helps you determine the root causes of failures in a business, analyse sales trends based on customer buying history, detect fraudulent behaviour, and reduce risks that might affect the organisation.

Uses of Big Data

Big Data technologies are very beneficial to businesses for boosting efficiency and developing new data-driven services. There are many uses of big data; for example, analysing a set of weather reports to predict next week's weather.

Let us discuss how much data is stored by various big companies in a day.

Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
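Those per-day and per-year figures follow directly from the per-second rate; here is a quick back-of-the-envelope check in Python:

```python
# Back-of-the-envelope check of the Google search-volume figures above.
QUERIES_PER_SECOND = 40_000

per_day = QUERIES_PER_SECOND * 60 * 60 * 24
per_year = per_day * 365

print(f"per day:  {per_day:,}")   # 3,456,000,000 (~3.5 billion)
print(f"per year: {per_year:,}")  # 1,261,440,000,000 (~1.2 trillion)
```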

The place where Google stores and handles all its data is a data center. Google doesn't operate the biggest data centers, but it still handles a huge amount of data; a data center typically holds petabytes to exabytes of data.

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. In September 2007, the average MapReduce job ran across approximately 400 machines, and these jobs consumed approximately 11,000 machine-years in a single month.

At the recording rate of 20 hours of video uploaded per minute, YouTube could need to increase its storage capacity by up to 21.0 terabytes per day, or 7.7 petabytes per year, roughly 4x more than the total amount of data generated per year by the NCSA's supercomputers in Urbana, IL.
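The daily and yearly figures above are consistent with each other, and they also imply an average encoded size per hour of video. A small Python sanity check (the constants are the stated rates from the text, not measured values):

```python
# Sanity-check the YouTube storage estimate above.
HOURS_UPLOADED_PER_MINUTE = 20   # stated upload rate
TB_PER_DAY = 21.0                # stated daily storage growth

hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24   # 28,800 hours of video per day
pb_per_year = TB_PER_DAY * 365 / 1000                 # ~7.7 PB per year
gb_per_hour = TB_PER_DAY * 1000 / hours_per_day       # implied ~0.73 GB per hour of video

print(f"{pb_per_year:.1f} PB/year, ~{gb_per_hour:.2f} GB per hour of video")
```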

It was recently projected that YouTube's database held on the order of 45 terabytes of video. While that figure doesn't sound terribly high relative to the amount of data available on the internet, YouTube has been experiencing substantial growth (more than 65,000 new videos per day) since that figure's publication, meaning that YouTube's database size has potentially more than doubled in the last 5 months.

Estimating the size of YouTube's database is particularly difficult due to the varying sizes and lengths of the videos. However, if one were truly ambitious (and a bit forgiving), one could project that the YouTube database will grow by as much as 20 terabytes of data in the next month.


Facebook revealed some big stats on big data to a few reporters at its HQ today, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour. Plus, it gave the first details on its new "Project Prism".

Let's see how Hadoop helps solve Big Data problems:

Initially designed in 2006, Hadoop is software particularly well adapted to managing and analysing big data in structured and unstructured forms. Its creators developed an open-source technology based on input that included technical papers written by Google. The initial design of Hadoop has undergone several modifications to become the go-to data management tool it is today, and it helps solve different Big Data problems efficiently.

Today, Hadoop is a framework that comprises tools and components offered by a range of vendors. The wide variety of tools and components that make up Hadoop is based on extensions of the basic framework.

Certain core components are behind Hadoop's ability to capture, manage, and process data. These core components are surrounded by frameworks that ensure their efficiency. The core components of Hadoop include the Hadoop Distributed File System (HDFS), YARN, MapReduce, Hadoop Common, Hadoop Ozone, and Hadoop Submarine. These components influence the activities of Hadoop tools as necessary.
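To make the MapReduce component concrete, here is a minimal word-count sketch in plain Python that mimics the map, shuffle, and reduce phases Hadoop distributes across many machines (this illustrates the programming model only; it is not Hadoop's actual Java API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between the phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "hadoop handles big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```

In a real cluster, the map and reduce functions run in parallel on different DataNodes and the shuffle moves data over the network; the logic, however, is exactly this simple.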

The Hadoop Distributed File System, as the name suggests, is the component responsible for distributing data across the storage nodes (DataNodes). It maintains the directory of stored files and the file system that directs where data is placed within nodes.
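As an illustration of how HDFS lays a file out across DataNodes, the sketch below estimates the number of blocks and the raw storage a file consumes, assuming HDFS's common defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster):

```python
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size (dfs.blocksize)
REPLICATION = 3       # common HDFS default replication factor

def hdfs_footprint(file_size_mb):
    # HDFS splits a file into fixed-size blocks and stores each block
    # on REPLICATION different DataNodes; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)  # a 1000 MB (~1 GB) file
print(blocks, raw)  # 8 blocks, 3000 MB of raw cluster storage
```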

Applications run concurrently on the Hadoop framework, and the YARN component is in charge of ensuring that resources are appropriately distributed to the running applications. This component is also responsible for scheduling the jobs that run concurrently.

Hadoop processes a large volume of data with its different components performing their roles within an environment that provides the supporting structure. Apart from the components mentioned above, one also has access to certain other tools as part of their Hadoop stack. These tools include the database management system, Apache HBase, and tools for data management and application execution. Development, management, and execution tools could also be part of a Hadoop stack. The tools typically applied by an organization on the Hadoop framework are dependent on the needs of the organization.

Applications of Hadoop in big data

Organizations, especially those that generate a lot of data, rely on Hadoop and similar platforms for the storage and analysis of data. One such organization is Facebook, which generates an enormous volume of data and applies Hadoop in its operations. The flexibility of Hadoop allows it to function in multiple areas of Facebook in different capacities, so Facebook's data is distributed across the different components of Hadoop and the applicable tools. Data such as status updates, for example, are stored on the MySQL platform, while the Facebook Messenger app is known to run on HBase.

Experts have also stated that the e-commerce giant Amazon utilizes components of Hadoop for efficient data processing, in particular the Elastic MapReduce web service. Elastic MapReduce is adapted for effectively carrying out data-processing operations such as log analysis, web indexing, data warehousing, financial analysis, scientific simulation, machine learning, and bioinformatics.

Other organizations that apply components of Hadoop include eBay and Adobe. eBay uses Hadoop components such as Java MapReduce, Apache Hive, Apache HBase, and Apache Pig for processes such as research and search optimization.

Adobe is known to apply Hadoop components such as Apache HBase and Apache Hadoop itself. Adobe's production and development processes run Hadoop components on clusters of 30 nodes, and there are reports that Adobe is expanding the number of nodes it utilizes.

Many other tools, such as Scuba, Cassandra, Hive, Prism, and Corona, are also used to handle Big Data problems.