Big Data and Hadoop
Big Data is the latest buzzword in the IT industry. Apache’s Hadoop is a leading Big Data platform used by IT giants such as Yahoo, Facebook & Google. This course is geared to make you a Hadoop expert.
A key aspect of the resiliency of Hadoop clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop has two main subprojects: first, MapReduce, the framework that understands and assigns work to the nodes in a cluster, and second, HDFS, a distributed file system that spans all the nodes in a Hadoop cluster for data storage.
Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop was derived from Google’s MapReduce and the Google File System. Yahoo! was the originator, has been a major contributor, and uses Hadoop across its business. Other major users include Facebook, IBM, Twitter, American Airlines, LinkedIn, The New York Times and many more.
* Note: The number of projects covered depends on the duration of the chosen program.
All training modules are based on REAL TIME LIVE PROJECTS.
Course Duration (7 Days) - Rs. 3,500/- per participant, inclusive of all taxes.
Course Duration (15 Days) - Rs. 5,000/- per participant, inclusive of all taxes.
Course Duration (30 Days) - Rs. 7,500/- per participant, inclusive of all taxes.
Module 1. What is Big Data & Why Hadoop?
What is Big Data?
Traditional data management systems and their limitations
What is Hadoop?
Why is Hadoop used?
The Hadoop eco-system
Big data/Hadoop use cases
Module 2. HDFS (Hadoop Distributed File System) and installing Hadoop on single node
HDFS internals and use cases
Files and blocks
Namenode memory concerns
HDFS access options
Installing and configuring Hadoop
Basic Hadoop commands
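As a rough illustration of the "Files and blocks" and "Namenode memory concerns" topics in this module, the Python sketch below estimates how many blocks a file occupies and how much NameNode heap a set of files might need. The 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and is used here purely as an assumption. All function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # Hadoop 2.x default dfs.blocksize (128 MB)
REPLICATION = 3                  # default dfs.replication
BYTES_PER_OBJECT = 150           # assumed rule of thumb, not a measured value


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed to hold a file of file_size bytes."""
    # Integer ceiling division; an empty file needs no blocks.
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


def namenode_memory(file_sizes, block_size=BLOCK_SIZE):
    """Rough NameNode heap estimate: one object per file plus one per block."""
    objects = sum(1 + block_count(size, block_size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_gb = 1024 ** 3
    one_mb = 1024 ** 2
    print(block_count(one_gb))                  # blocks in a single 1 GB file
    print(namenode_memory([one_gb]))            # one large file
    print(namenode_memory([one_mb] * 1024))     # same data as 1024 small files
```

Under these assumptions, the same 1 GB of data costs far more NameNode memory when stored as 1024 small files than as one large file, which is why small files are treated as a concern in HDFS.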
Module 3. Advanced HDFS concepts
How to use configuration class
Using HDFS in MapReduce and programmatically
HDFS permission and security
Additional HDFS tasks
Module 4. Cloud computing overview and installing Hadoop on multiple nodes
Cloud computing overview
Characteristics of cloud computing
SaaS/PaaS/IaaS
Configuring Masters and Slaves
Module 5. Introduction to MapReduce
Functional programming concepts
Mapping and reducing lists
Putting them together in MapReduce
Word Count example application
Understanding the driver, mapper and reducer
Closer look at MapReduce data flow
Additional MapReduce functionality
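The Word Count data flow covered in this module can be sketched in plain Python, without a Hadoop cluster. This is a single-process simulation of the map, shuffle and reduce steps, not the Hadoop Java API; the function names are hypothetical and chosen only to mirror the mapper, reducer and driver roles.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Driver: wire the three steps together on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

For the two sample lines, "the" is counted twice and every other word once. In a real job the mapper and reducer run on different nodes and the framework performs the shuffle over the network.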
Module 6. MapReduce workshop
Hands-on work on MapReduce
Module 7. Advanced MapReduce concepts
Understand combiners & partitioners
Understand input and output formats
Chaining, listing and killing jobs
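To make the combiner and partitioner topics in this module concrete, here is a minimal Python sketch. The combiner pre-aggregates map output locally before it is sent over the network, and the partitioner decides which reducer receives each key. Hadoop's default HashPartitioner uses the key's Java hashCode modulo the number of reduce tasks; the CRC32 hash below is a stand-in assumption, chosen because Python's built-in string hash is not stable across processes. Function names are hypothetical.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: sum (word, count) pairs on the map side to cut shuffle traffic."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: map a key to a reducer index with a stable hash."""
    # CRC32 is an assumed stand-in for Java's hashCode(); any stable hash works.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    map_output = [("hadoop", 1), ("hive", 1), ("hadoop", 1)]
    combined = combine(map_output)
    print(combined)
    for word, count in combined:
        print(word, count, "-> reducer", partition(word, 4))
```

Because the partition function is deterministic, every occurrence of a key lands on the same reducer, which is what lets a reducer see all values for that key.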
Module 8. Using Pig and Hive for data analysis
Pig program structure and execution process
Joins & filtering using Pig
Group & co-group
Schema merging and redefining functions
Using Hive command line interface
Data types and file formats
Basic DDL operations
Module 9. Introduction to HBase, Zookeeper & Sqoop
HBase overview, architecture & installation
HBase admin: test
HBase data access
Overview of Zookeeper
Sqoop overview and installation
Importing and exporting data in Sqoop
Module 10. Introduction to Oozie, Flume and advanced Hadoop concepts
Overview of Oozie and Flume
Oozie features and challenges
How does Flume work
Connecting Flume with HDFS
Authentication and high availability in Hadoop
Introduction: What is Data Science?, Getting started with R, Exploratory Data Analysis, Review of probability and probability distributions, Bayes Rule
Supervised Learning, Regression, polynomial regression, local regression, k-nearest neighbors
Unsupervised Learning, Kernel density estimation, k-means, Naive Bayes, Data and Data Scraping
Classification, ranking, logistic regression
Ethics, time series, advanced regression
DAY 1 and DAY 2
1. Understanding Big Data and Hadoop
Learning Objectives - In this module, you will understand Big Data, the limitations of the existing solutions for Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop Architecture, HDFS, Anatomy of File Write and Read, Rack Awareness.
Topics - Big Data, Limitations and Solutions of existing Data Analytics Architecture, Hadoop, Hadoop Features, Hadoop Ecosystem, Hadoop 2.x core components, Hadoop Storage: HDFS, Hadoop Processing: MapReduce Framework, Anatomy of File Write and Read, Rack Awareness.
DAY 3 and DAY 4
2. Hadoop Architecture and HDFS
Learning Objectives - In this module, you will learn the Hadoop Cluster Architecture, Important Configuration files in a Hadoop Cluster, Data Loading Techniques.
Topics - Hadoop 2.x Cluster Architecture - Federation and High Availability, A Typical Production Hadoop Cluster, Hadoop Cluster Modes, Common Hadoop Shell Commands, Hadoop 2.x Configuration Files, Password-Less SSH, MapReduce Job Execution, Data Loading Techniques: Hadoop Copy Commands, FLUME, SQOOP.
DAY 5 and DAY 6
3. Hadoop MapReduce Framework - I
Learning Objectives - In this module, you will understand Hadoop MapReduce framework and the working of MapReduce on data stored in HDFS. You will learn about YARN concepts in MapReduce.
Topics - MapReduce Use Cases, Traditional way Vs MapReduce way, Why MapReduce, Hadoop 2.x MapReduce Architecture, Hadoop 2.x MapReduce Components, YARN MR Application Execution Flow, YARN Workflow, Anatomy of MapReduce Program, Demo on MapReduce.
4. Hadoop MapReduce Framework - II
Learning Objectives - In this module, you will understand concepts like Input Splits in MapReduce, Combiner & Partitioner and Demos on MapReduce using different data sets.
Topics - Input Splits, Relation between Input Splits and HDFS Blocks, MapReduce Job Submission Flow, Demo of Input Splits, MapReduce: Combiner & Partitioner, Demo on de-identifying Health Care Data set, Demo on Weather Data set.
5. Advanced MapReduce
Learning Objectives - In this module, you will learn advanced MapReduce concepts such as Counters, Distributed Cache, MRUnit, Reduce Join, Custom Input Format and Sequence Input Format, and how to deal with complex MapReduce programs.
Topics - Counters, Distributed Cache, MRUnit, Reduce Join, Custom Input Format, Sequence Input Format.
Learning Objectives - In this module, you will learn about Pig, the types of use cases Pig is suited to, the tight coupling between Pig and MapReduce, and Pig Latin scripting.
Topics - About Pig, MapReduce Vs Pig, Pig Use Cases, Programming Structure in Pig, Pig Running Modes, Pig components, Pig Execution, Pig Latin Program, Data Models in Pig, Pig Data Types.
Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF, Pig Demo on Healthcare Data set.
DAY 10 and DAY 11
Learning Objectives - This module will help you in understanding Hive concepts, Loading and Querying Data in Hive and Hive UDF.
Topics - Hive Background, Hive Use Case, About Hive, Hive Vs Pig, Hive Architecture and Components, Metastore in Hive, Limitations of Hive, Comparison with Traditional Database, Hive Data Types and Data Models, Partitions and Buckets, Hive Tables (Managed Tables and External Tables), Importing Data, Querying Data, Managing Outputs, Hive Script, Hive UDF, Hive Demo on Healthcare Data set.
DAY 12 and DAY 13
8. Advanced Hive and HBase
Learning Objectives - In this module, you will understand advanced Hive concepts such as UDFs and dynamic partitioning. You will also acquire in-depth knowledge of HBase, the HBase architecture and its components.
Topics - Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts, Hive: Thrift Server, User Defined Functions.
HBase: Introduction to NoSQL Databases and HBase, HBase v/s RDBMS, HBase Components, HBase Architecture, HBase Cluster Deployment.
9. Advanced HBase
Learning Objectives - This module will cover advanced HBase concepts. We will see demos on Bulk Loading and Filters. You will also learn what ZooKeeper is, how it helps in monitoring a cluster, and why HBase uses ZooKeeper.
Topics - HBase Data Model, HBase Shell, HBase Client API, Data Loading Techniques, ZooKeeper Data Model, Zookeeper Service, Zookeeper, Demos on Bulk Loading, Getting and Inserting Data, Filters in HBase.
10. Oozie and Hadoop Project
Learning Objectives - In this module, you will understand how multiple Hadoop ecosystem components work together in a Hadoop implementation to solve Big Data problems. We will discuss multiple data sets and the specifications of the project. This module will also cover a Flume & Sqoop demo and the Apache Oozie Workflow Scheduler for Hadoop jobs.
Topics - Flume and Sqoop Demo, Oozie, Oozie Components, Oozie Workflow, Scheduling with Oozie, Demo on Oozie Workflow, Oozie Co-ordinator, Oozie Commands, Oozie Web Console, Hadoop Project Demo.
What is Big data
Big Data opportunities
Big Data Challenges
Characteristics of Big data
Introduction to Hadoop
What is Hadoop
Relationship between Hadoop and Big Data
Advantages and Challenges
Comparing Hadoop & SQL.
Industries using Hadoop.
Map Reduce & HDFS.
Using the Hadoop single node image (Clone).
The Hadoop Distributed File System (HDFS)
HDFS Design & Concepts
Blocks, Name nodes and Data nodes
HDFS High-Availability and HDFS Federation.
Hadoop DFS: The Command-Line Interface
Basic File System Operations
Anatomy of File Read
Anatomy of File Write
Block Placement Policy and Modes
More detailed explanation about Configuration files.
Metadata, FS image, Edit log, Secondary Name Node and Safe Mode.
How to add New Data Node dynamically.
How to decommission a Data Node dynamically (Without stopping cluster).
FSCK Utility. (Block report).
How to override default configuration at system level and Programming level.
ZOOKEEPER Leader Election Algorithm.
Exercise and small use case on HDFS.
Functional Programming Basics.
Map and Reduce Basics
How Map Reduce Works
Anatomy of a Map Reduce Job Run
Legacy Architecture -> Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates
Job Completion, Failures
Shuffling and Sorting
Splits, Record reader, Partition, Types of partitions & Combiner
Optimization Techniques -> Speculative Execution, JVM Reuse and Number of Slots.
Types of Schedulers and Counters.
Comparisons between Old and New API at code and Architecture Level.
Getting the data from RDBMS into HDFS using Custom data types.
Distributed Cache and Hadoop Streaming (Python, Ruby and R).
Sequence Files and Map Files.
Enabling Compression Codecs.
Map side Join with distributed Cache.
Types of I/O Formats: MultipleOutputs, NLineInputFormat.
Handling small files using CombineFileInputFormat.
Map/Reduce Programming – Java Programming
Hands on “Word Count” in Map/Reduce in standalone and pseudo-distributed mode.
Sorting files using Hadoop Configuration API discussion
Emulating “grep” for searching inside a file in Hadoop
Job Dependency API discussion
Input Format API discussion
Input Split API discussion
Custom Data type creation in Hadoop.
ACID in RDBMS and BASE in NoSQL.
CAP Theorem and Types of Consistency.
Types of NoSQL Databases in detail.
Columnar Databases in Detail (HBASE and CASSANDRA).
TTL, Bloom Filters and Compaction.
HBase Data Model and Comparison between RDBMS and NOSQL.
Master & Region Servers.
HBase Operations (DDL and DML) through Shell and Programming and HBase
Block Cache and sharding.
DATA Modeling (Sequential, Salted, Promoted and Random Keys).
Java APIs and REST Interface.
Client Side Buffering and Process 1 million records using Client side Buffering.
Enabling Replication and HBASE RAW Scans.
Bulk Loading and Coprocessors (Endpoints and Observers with programs).
Real world use case consisting of HDFS, MR and HBase.
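The salted row keys mentioned under data modeling above can be sketched as follows. Sequential keys such as timestamps tend to send all writes to one region; prefixing each key with a small deterministic salt spreads them over several key ranges. This is a minimal Python illustration of the idea, not the HBase client API. The function names, the MD5-based salt, the two-digit prefix format and the default of 16 buckets are all assumptions made for the example.

```python
import hashlib


def salted_key(row_key, buckets=16):
    """Prefix a row key with a deterministic salt in the range [0, buckets)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    salt = digest[0] % buckets          # same key always gets the same salt
    return f"{salt:02d}-{row_key}"


def scan_prefixes(buckets=16):
    """Prefixes a range scan must cover, since salting scatters adjacent keys."""
    return [f"{salt:02d}-" for salt in range(buckets)]


if __name__ == "__main__":
    for key in ["20240101-0001", "20240101-0002", "20240101-0003"]:
        print(salted_key(key))
    print(scan_prefixes(4))
```

The trade-off is visible in scan_prefixes: writes are spread out, but a scan over a key range now has to query every salt prefix and merge the results.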
Introduction and Architecture.
Hive Services, Hive Shell, Hive Server and Hive Web Interface (HWI)
OLTP vs. OLAP
Working with Tables.
Primitive data types and complex data types.
Working with Partitions.
User Defined Functions
Hive Bucketed Tables and Sampling.
External partitioned tables, Map the data to the partition in the table, Writing the output of one query to another table, Multiple inserts
Differences between ORDER BY, DISTRIBUTE BY and SORT BY.
Bucketing and Sorted Bucketing with Dynamic partition.
INDEXES and VIEWS.
Compression on hive tables and Migrating Hive tables.
Dynamic substitution in Hive and different ways of running Hive
How to enable Update in HIVE.
Log Analysis on Hive.
Access HBASE tables using Hive.
Hands on Exercises
Schema on read
Primitive data types and complex data types.
Tuple schema, BAG Schema and MAP Schema.
Loading and Storing
Grouping & Joining
Debugging commands (Illustrate and Explain).
Validations in PIG.
Type casting in PIG.
Working with Functions
User Defined Functions
Types of JOINS in pig and Replicated Join in detail.
SPLITS and Multiquery execution.
Error Handling, FLATTEN and ORDER BY.
Nested For Each.
User Defined Functions, Dynamic Invokers and Macros.
How to access HBASE using PIG.
How to Load and Write JSON DATA using PIG.
Hands on Exercises
Import Data (full table, subset only, target directory, protecting the password, file formats other than CSV, compression, controlling parallelism, importing all tables)
Incremental Import (import only new data, last imported data, storing the password in the Metastore, sharing the Metastore between Sqoop clients)
Free Form Query Import
Export data to RDBMS, Hive and HBase
Hands on Exercises.
Introduction to Flume
Flume Agents: Sources, Channels and Sinks
Log user information from a Java program into HDFS using Log4j and an Avro Source
Log user information from a Java program into HDFS using a Tail Source
Log user information from a Java program into HBase using Log4j and an Avro Source
Log user information from a Java program into HBase using a Tail Source
Use case of Flume: stream data from Twitter into HDFS and HBase, then analyze it using Hive and Pig