Skip to document

Dm notes - NNNNNNNNNNN

NNNNNNNNNNN
Course

Software Engineering (CS530)

376 Documents
Students shared 376 documents in this course
Academic year: 2022/2023
Uploaded by:
0followers
8Uploads
1upvotes

Comments

Please sign in or register to post comments.

Preview text

Module- DATA WAREHOUSING & MODELLING

1 Introduction Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.

Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform For data analysis and OLAP and will provide an effective platform for datamining. Therefore ,data warehousing and OLAPform an essential step in the knowledge discovery process.

1 Data Warehouse: Basic Concepts What Is a Data Warehouse? Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon—a way to retain customers bylearning more about their needs.

A data warehouse refers to a data repository that is maintained separately from an organization’s operational databases. Data warehouse systems allow for integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historic data for analysis.

Syllabus

Basic Concepts: Data Warehousing: A multitier Architecture, Data warehouse models: Enterprise warehouse, Data mart and virtual warehouse, Extraction, Transformation and loading, Data Cube: A multidimensional data model, Stars, Snowflakes and Fact constellations: Schemas for multidimensional Data models, Dimensions: The role of concept Hierarchies, Measures: Their Categorization and computation, Typical OLAP Operations

According to William H. Inmon, a leading architect in the construction of data warehouse systems, “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in supportof management’s decision making process”.

Key features:

  1. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
  2. Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be onlya single way of identifying a product.
  3. Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated witha customer.
  4. Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Differences between Operational Database Systems and Data Warehouses The major task of online operational database systems is to perform online transaction and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing(OLAP) systems. The major distinguishing features of OLTP and OLAP are summarized as follows:

Examples of gateways includes ODBC (open database connection) and OLEDB (Open Linking and Embedding for Databases) by Microsoft and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP)model or a multidimensional OLAP.

  • OLAP model is an extended relational DBMS that maps operations on multidimensional data tostandard relational operations.
  • A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e., trend analysis, prediction, and so on).

1 Data Warehouse Models There are three data warehouse models.

  1. Enterprise warehouse:
    • An enterprise warehouse collects all of the information about subjects spanning the entireorganization.
    • It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
    • It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
    • An enterprise data warehouse may be implemented on traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
  2. Data mart:
    • A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
    • Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.

Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or

Meta Data Repository: Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.

A metadata repository should contain the following: - A description of the structure of the data warehouse, which includes the warehouse schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.

  • Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
  • The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
  • The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
  • Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
  • Business metadata, which include business terms and definitions, data ownership information, and charging policies.

1 Data Warehouse Modeling: Data Cube and OLAP Data warehouses and OLAP tools are based on a multidimensional data model. This model views data inthe form of a data cube.

Data Cube : A multidimensional Data model

➢ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

  • Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
  • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables ➢ In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.

Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.

.

1 Dimensions: The Role of Concept Hierarchies A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it belongs For example, suppose that the dimension location is described by the attributes number, street, city, province or state, zip code, and country. These attributes are related by a total order, forming a concept hierarchy such as “street < city < province or state < country.” This hierarchy is shown in Figure

Holistic: if there is no constant bound on the storage size needed to describe a sub aggregate.

E.: median(), mode(), rank()

1 Typical OLAP Operations - ROLL-UP This is like zooming-out on the data-cube This is required when the user needs further abstraction or less detail. • Initially, the location-hierarchy was "street < city < province < country". • On rolling up, the data is aggregated by ascending the location-hierarchy from the level-of city to level-of- country.

• DRILL DOWN

This is like zooming-in on the data. This is the reverse of roll-up. • This is an appropriate operation → when the user needs further details or → when the user wants to partition more finely or when the user wants to focus on some particular values of certain dimensions. • This adds more details to the data. • Initially, the time-hierarchy was "day < month < quarter < year”. • On drill-up, the time dimension is descended from the level-of-quarter to the level-of- month.

  • PIVOT (OR ROTATE) This is used when the user wishes to re-orient the view of the data-cube. This may involve → swapping the rows and columns or → moving one of the row-dimensions into the column- dimension.
  • SLICE & DICE These are operations for browsing the data in the cube. • These operations allow ability to look at information from different viewpoints. • A slice is a subset of cube corresponding to a single value for 1or more members of dimensions.. A dice operation is done by performing a selection of 2 or more dimensions.

1 Question Bank 1. What is data warehouse? Discuss key features 2. Differentiate between Operational Database Systems and Data Warehouses. 3. Differentiate between OLAP and OLTP 4. Why multidimensional views of data and data-cubes are used? 5 a neat diagram, explain data-cube implementations. 5. Describe the Multitiered Architecture of data warehousing. 6. Explain the data warehouse models

Was this document helpful?

Dm notes - NNNNNNNNNNN

Course: Software Engineering (CS530)

376 Documents
Students shared 376 documents in this course
Was this document helpful?
Data Mining and Data Warehousing(18CS641)
Department of Information Science and Engineering, SDMIT, Ujire. 1
Module-1
DATA WAREHOUSING & MODELLING
1.1 Introduction
Data warehouses generalize and consolidate data in multidimensional space. The construction of
data warehouses involves data cleaning, data integration, and data transformation, and can be viewed
as an important preprocessing step for data mining. Moreover, data warehouses provide online
analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied
granularities, which facilitates effective data generalization and data mining.
Many other data mining functions, such as association, classification, prediction, and clustering,
can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple
levels of abstraction. Hence, the data warehouse has become an increasingly important platform For
data analysis and OLAP and will provide an effective platform for datamining. Therefore ,data
warehousing and OLAP form an essential step in the knowledge discovery process.
1.2 Data Warehouse: Basic Concepts
What Is a Data Warehouse?
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms have
spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with
competition mounting in every industry, data warehousing is the latest must-have marketing
weapona way to retain customers by learning more about their needs.
A data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for integration of a variety of
application systems. They support information processing by providing a solid platform of
consolidated historic data for analysis.
Syllabus
Basic Concepts: Data Warehousing: A multitier Architecture, Data warehouse models:
Enterprise warehouse, Data mart and virtual warehouse, Extraction, Transformation and
loading, Data Cube: A multidimensional data model, Stars, Snowflakes and Fact constellations:
Schemas for multidimensional Data models, Dimensions: The role of concept Hierarchies,
Measures: Their Categorization and computation, Typical OLAP Operations