Data mining ch2

Summaries of Data mining chapter 2
Course

Data Mining

Academic year: 2021/2022
  1. Data in the real world is dirty:
    • Incomplete
    • Noisy
    • Inconsistent
  2. Why Is Data Preprocessing Important?
    • No quality data, no quality mining results
    • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
  3. Multi-Dimensional Measure of Data Quality:
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
    • Believability
    • Value added
    • Interpretability
    • Accessibility
  4. Major Tasks in Data Preprocessing:
    i. Data cleaning
    ii. Data integration
    iii. Data transformation
    iv. Data reduction
    v. Data discretization
  5. Mining Data Descriptive Characteristics:
    • Motivation: To better understand the data: central tendency, variation and spread.
    • Data dispersion characteristics: median, max, min, quantiles, outliers, variance.
    • Numerical dimensions: Boxplot or quantile analysis on sorted intervals.
    • Dispersion analysis on computed measures
  6. Measuring the Central Tendency:
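    • Mean: the sum of all values divided by the number of values, mean = (x1 + x2 + ... + xn) / n
    • Median: the middle value if the number of values is odd, otherwise the average of the two middle values
    • Mode: the value that occurs most frequently in the data. A distribution with one, two, or three modes is called unimodal, bimodal, or trimodal.
    • Empirical formula (for moderately skewed unimodal data): mean – mode ≈ 3 × (mean – median)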
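The three measures of central tendency can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the sample `values` and the helper names `central_tendency` and `empirical_mode` are assumptions made for this example, and the empirical relation is only a rough estimate that applies to moderately skewed unimodal data.

```python
from statistics import mean, median, multimode

# Hypothetical sample, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def central_tendency(data):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(data), median(data), multimode(data)


def empirical_mode(mean_value, median_value):
    """Estimate the mode from mean - mode ~= 3 * (mean - median).

    Only a rough estimate, and only meaningful for moderately
    skewed unimodal data.
    """
    return mean_value - 3 * (mean_value - median_value)


m, med, modes = central_tendency(values)
print("mean:", m)
print("median:", med)
print("modes:", modes)  # two modes here, so this sample is bimodal
```

Because this sample has an even number of values, the median is the average of the two middle values, and `multimode` returns every value tied for the highest frequency.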

  7. Measuring the Dispersion of Data:
    • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Inter-quartile range: IQR = Q3 – Q1
    • Five-number summary: min, Q1, M (median), Q3, max
    • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
    • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
    • Variance: s² = (1 / (n – 1)) × Σ (xi – mean)² for a sample; σ² = (1 / N) × Σ (xi – μ)² for a population
    • Standard deviation s (or σ) is the square root of the variance s² (or σ²).
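The dispersion measures above can be sketched with the standard library as follows. The sample data and the helper names `five_number_summary` and `iqr_outliers` are assumptions for illustration. Quartile conventions differ between textbooks and tools; this sketch uses the "inclusive" interpolation method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import median, quantiles, stdev, variance

# Hypothetical sample, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(data)
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median(ordered), q3, ordered[-1]


def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


print("five-number summary:", five_number_summary(values))
print("outliers:", iqr_outliers(values))
print("sample variance:", variance(values))  # divides by n - 1
print("sample std dev:", stdev(values))      # square root of the variance
```

`variance` and `stdev` compute the sample versions (dividing by n – 1); `pvariance` and `pstdev` in the same module give the population versions.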
  8. Properties of Normal Distribution Curve:
    • From μ–σ to μ+σ: contains about 68% of the measurements.
    • From μ–2σ to μ+2σ: contains about 95% of the measurements.
    • From μ–3σ to μ+3σ: contains about 99.7% of the measurements.
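These coverage figures can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / √2). The function name `normal_coverage` below is an assumption made for this sketch.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```

The result does not depend on the particular values of μ and σ, because the interval is expressed in units of σ around μ.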
  9. Boxplot Analysis:
    • Data is represented with a box
    • The ends of the box are at the first and third quartiles; the height of the box is the IQR
    • The median is marked by a line within the box
    • Whiskers: two lines outside the box extend to the minimum and maximum
  10. Displays of Basic Statistical Descriptions:
    • Quantile plot: each value xi is paired with fi, indicating that approximately 100 × fi % of the data are ≤ xi
    • Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
    • Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
    • Loess (local regression) curve: adds a smooth curve to a scatter plot to give a better perception of the pattern of dependence
  11. Data cleaning tasks:
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration
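Two of these cleaning tasks can be sketched in a few lines. The notes do not say how missing values should be filled or how redundancy should be resolved, so the choices here are assumptions: missing entries (represented as `None`) are replaced by the mean of the observed values, and redundancy is treated as exact duplicate records. The helper names and sample data are hypothetical.

```python
from statistics import mean


def fill_missing(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def drop_duplicates(records):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


# Hypothetical data, chosen only for illustration.
ages = [23, None, 31, 27, None, 39]
records = [("ann", 23), ("bob", 31), ("ann", 23)]

print(fill_missing(ages))
print(drop_duplicates(records))
```

Filling with the mean is only one option; the median or a model-based estimate may suit skewed data better. Real redundancy after integration often involves near-duplicates, which need more than an exact-match check.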
University: Assiut University
