Data mining ch2

Summaries of Data mining chapter 2

Course

Data Mining

University

Assiut University

Academic year: 2021/2022

Listed book: Data Mining: Concepts and Techniques

Uploaded by:

Mahmoud Mohamed

Assiut University

Data in the real world is dirty:
- Incomplete
- Noisy
- Inconsistent
Why Is Data Preprocessing Important?
- No quality data, no quality mining results
- Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.
Multi-Dimensional Measure of Data Quality:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Major Tasks in Data Preprocessing:
i. Data cleaning
ii. Data integration
iii. Data transformation
iv. Data reduction
v. Data discretization
Mining Data Descriptive Characteristics:
- Motivation: To better understand the data: central tendency, variation and spread.
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance.
- Numerical dimensions: Boxplot or quantile analysis on sorted intervals.
- Dispersion analysis on computed measures
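Measuring the Central Tendency:
- Mean: sum of the values divided by the number of values, x̄ = (x1 + x2 + ... + xn) / n
- Median: middle value if there is an odd number of values, otherwise the average of the two middle values
- Mode: value that occurs most frequently in the data. A data set can be unimodal, bimodal, or trimodal (one, two, or three modes).
- Empirical formula (for unimodal, moderately skewed data): mean − mode ≈ 3 × (mean − median)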
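
As a rough illustration of the three measures above, the sketch below uses Python's standard `statistics` module. The sample values and the helper name `empirical_mode` are assumptions made for this example; they are not part of the original notes. The empirical relation is only an approximation, and it is meant for unimodal, moderately skewed data.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)
# Even number of values, so the median is the average of the two middle values.
median = statistics.median(values)
# multimode returns every value tied for the highest frequency.
modes = statistics.multimode(values)


def empirical_mode(mean_value, median_value):
    """Approximate the mode from: mean - mode ~= 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)


print("mean:", mean)
print("median:", median)
print("modes:", modes, "(two modes, so this sample is bimodal)")
print("empirical estimate of the mode:", empirical_mode(mean, median))
```

Because this sample happens to be bimodal, the empirical estimate is shown only to demonstrate the arithmetic, not as a good estimate for this data.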

Measuring the Dispersion of Data:
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 − Q1
- Five number summary: min, Q1, M (median), Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance: s² = (1 / (n − 1)) × Σ (xi − x̄)² for a sample; σ² = (1 / N) × Σ (xi − μ)² for a population

Standard deviation s (or σ) is the square root of the variance s² (or σ²).
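
The dispersion measures above can be sketched with the standard `statistics` module. The sample values are hypothetical, and the quartiles depend on the interpolation convention: `method="inclusive"` is one common choice, and other conventions give slightly different Q1 and Q3. The outlier fences use the commonly quoted 1.5 x IQR rule.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 splits the data into quartiles; the three cut points are Q1, median, Q3.
q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Five number summary: min, Q1, median, Q3, max.
five_number = (min(values), q1, med, q3, max(values))

# Values beyond 1.5 x IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Sample variance (divides by n - 1) and its square root.
sample_var = statistics.variance(values)
sample_std = statistics.stdev(values)

print("five number summary:", five_number)
print("IQR:", iqr)
print("outliers:", outliers)
print("variance:", sample_var, "standard deviation:", sample_std)
```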

Properties of Normal Distribution Curve:
- From μ–σ to μ+σ: contains about 68% of the measurements.
- From μ–2σ to μ+2σ: contains about 95% of the measurements.
- From μ–3σ to μ+3σ: contains about 99.7% of the measurements.
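
These coverage figures can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean is erf(k / √2). The function name `coverage` is an assumption for this sketch.

```python
import math


def coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")
```

The three results round to roughly 68%, 95%, and 99.7%.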

Boxplot Analysis:
- Data is represented with a box
- The ends of the box are at the first and third quartiles; the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to the minimum and maximum

Displays of Basic Statistical Descriptions:
- Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to give a better perception of the pattern of dependence

Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
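
Two of these cleaning tasks can be sketched in plain Python: removing redundant records, and filling in missing values. The records, the field layout, and the helper names below are hypothetical. Filling with the attribute mean is only one of several possible strategies, chosen here for simplicity.

```python
import statistics

# Hypothetical records of (customer id, age); None marks a missing age.
records = [(1, 25), (2, None), (3, 31), (3, 31), (4, 40)]


def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_missing_with_mean(rows):
    """Replace a missing age with the mean of the known ages."""
    known = [age for _, age in rows if age is not None]
    fill = statistics.mean(known)
    return [(cid, fill if age is None else age) for cid, age in rows]


# Remove duplicates first so repeated records do not bias the mean.
cleaned = fill_missing_with_mean(drop_duplicates(records))
print(cleaned)
```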
