Poster - Practical

Course: Data Science and Analytics

Academic year: 2022/2023
Dr. Vishwanath Karad MIT World Peace University


Apache Hive

Apache Hive is an open-source data warehouse system built on top of Hadoop for processing and analyzing large datasets. It provides a high-level interface for managing and querying structured and semi-structured data in a distributed storage environment, which makes it a valuable tool for big data processing and analytics. Hive uses HiveQL, a query language similar to SQL, to interact with data stored in Hadoop's HDFS or other distributed file systems.
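
To make the link between a HiveQL query and Hadoop more concrete, the sketch below imitates in plain Python how an aggregate such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` can be broken into a map phase and a reduce phase, assuming the classic MapReduce execution engine. The `employees` table, its rows, and the function names are hypothetical and only illustrate the idea; this is not Hive's actual implementation.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
rows = [
    ("asha", "sales"),
    ("ben", "engineering"),
    ("chen", "sales"),
    ("dev", "engineering"),
    ("ela", "sales"),
]


def map_phase(records):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in records:
        yield dept, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network.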

Map-Reduce

Hive's extensibility comes from user-defined functions (UDFs) and custom scripts, which enable custom data processing and transformations.
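
As a minimal sketch of custom scripting, the Python below follows the streaming convention used by Hive's `TRANSFORM` clause, where rows reach the script as tab-separated lines on standard input and the script writes tab-separated lines to standard output. The column layout (an id followed by a name) and the function names are assumptions made for illustration, and in-memory streams stand in for standard input and output so the example runs on its own.

```python
import io


def transform_line(line):
    """Upper-case the second column of one tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        fields[1] = fields[1].upper()
    return "\t".join(fields)


def run(stream_in, stream_out):
    """Read rows from stream_in and write transformed rows to stream_out."""
    for line in stream_in:
        stream_out.write(transform_line(line) + "\n")


# In-memory stand-ins for sys.stdin and sys.stdout (hypothetical sample rows).
sample_in = io.StringIO("1\talice\n2\tbob\n")
sample_out = io.StringIO()
run(sample_in, sample_out)
print(sample_out.getvalue(), end="")
```

In a real script the last lines would call `run(sys.stdin, sys.stdout)` instead, and a query along the lines of `SELECT TRANSFORM(id, name) USING 'python upper_name.py' AS (id, name) FROM users` would invoke it, where the script name and the `users` table are hypothetical.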

System Architecture

SQL vs Hadoop

1. SQL: Ideal for structured data. It offers a standard query language for relational databases, enabling efficient data retrieval and manipulation.

2. Hadoop: Suited for big data, whether structured, semi-structured, or unstructured. It uses a distributed file system and batch processing, enabling scalable storage and processing of vast datasets.

3. SQL: Provides ease of use, well-defined schemas, and ACID transactions for traditional databases.

4. Hadoop: Offers flexibility for diverse data types and massive scalability, but requires complex data transformations and lacks real-time capabilities, so the two suit different use cases.
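
The ACID transactions mentioned for traditional SQL databases can be illustrated with a small sketch. It uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in for a relational database; the `accounts` table and its values are hypothetical. The second update violates a constraint, so the whole transaction is rolled back and the first update is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # The connection used as a context manager commits on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'b'")
        # This would make the balance of 'a' negative and violates the CHECK.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass  # the transfer failed as a whole

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Both balances keep their original values, which is the atomicity part of ACID: either every statement in the transaction takes effect or none does.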

KEY FEATURES

Extensibility

Apache Hive offers a SQL-like query language for managing and analyzing large datasets in Hadoop. Its schema-on-read approach applies a table's schema when the data is queried rather than when it is loaded, which allows flexible handling of structured and semi-structured data. Because it integrates with the Hadoop ecosystem, it supports scalability, query optimization, and security features, making it a vital component for big data processing and analytics.
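
The schema-on-read idea can be sketched in a few lines of Python. Raw delimited text is stored unchanged, and a schema is applied only when a row is read; a field that is missing or cannot be converted becomes `None`, in the same spirit as a NULL in a query result. The sample lines, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical and are not part of Hive.

```python
# Raw lines as they might sit in a file: nothing is validated when they are stored.
raw_lines = ["1,alice,30", "2,bob,unknown", "3,carol"]

# Hypothetical schema, applied only at read time: (column name, type).
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(line, schema, delimiter=","):
    """Parse one raw line according to the schema; bad or missing fields become None."""
    fields = line.split(delimiter)
    row = {}
    for index, (column, cast) in enumerate(schema):
        try:
            row[column] = cast(fields[index])
        except (IndexError, ValueError):
            row[column] = None
    return row


rows = [read_with_schema(line, SCHEMA) for line in raw_lines]
for row in rows:
    print(row)
```

Because the stored data is never rewritten, a different schema can later be applied to the same lines without reloading them.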
