Data Mining Concepts
CH7 - Lecture 2 : Cluster Analysis II
TOPICS
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering
Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Major Clustering Approaches
- Model-based:
- A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
- Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
- Based on the analysis of frequent patterns
- Typical methods: pCluster
- User-guided or constraint-based:
- Clustering by considering user-specified or application-specific constraints
- Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to
Calculate the Distance
between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dis(tip, tjq))
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dis(tip, tjq))
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dis(tip, tjq))
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
- Medoid: one chosen, centrally located object in the cluster
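To make these measures concrete, here is a minimal NumPy sketch (an illustrative helper, not from the lecture) that evaluates the single, complete, average, and centroid distances between two small clusters; the medoid distance works the same way with medoids in place of centroids.

```python
import numpy as np

def linkage_distances(Ki, Kj):
    """Linkage measures between two clusters given as (n, d) point arrays."""
    # All pairwise Euclidean distances between elements of Ki and Kj
    d = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)
    return {
        "single":   d.min(),    # smallest element-to-element distance
        "complete": d.max(),    # largest element-to-element distance
        "average":  d.mean(),   # average element-to-element distance
        "centroid": np.linalg.norm(Ki.mean(0) - Kj.mean(0)),  # centroid distance
    }

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[3.0, 0.0], [4.0, 0.0]])
print(linkage_distances(Ki, Kj))
# single=2.0, complete=4.0, average=3.0, centroid=3.0
```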
Partitioning Methods
Partitioning Algorithms: Basic
Concept
- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., the sum of squared distances is minimized
- Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate
all partitions
- Heuristic methods: k-means and k-
medoids algorithms
- k-means (MacQueen’67): Each cluster is
represented by the center of the cluster
- k-medoids or PAM (Partition around
medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of
the objects in the cluster
The K-Means Clustering
Method
- Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments
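These four steps map directly onto a short implementation. The sketch below is an illustrative NumPy version, assuming a numeric data matrix X of shape (n, d) and that no cluster becomes empty during iteration:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial seed points (one common way to start a partition)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute seed points as centroids (mean points) of the partition
        # (assumes no cluster becomes empty; robust code must handle that case)
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop when assignments no longer change (centroids are stable)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```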
- Weaknesses
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means
Method
- A few variants of k-means differ in
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
- Handling categorical data: k-modes
(Huang’98)
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal
with categorical objects
- Using a frequency-based method to
update modes of clusters
- A mixture of categorical and numerical
data: k-prototype method
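As a concrete illustration of the k-modes ideas above (not code from the lecture), the dissimilarity between two categorical objects can be taken as the number of mismatched attributes, and the cluster representative as the per-attribute mode:

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Mismatch count between two categorical records (k-modes style)."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(records):
    """Frequency-based mode: most frequent category per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

records = [("red", "S"), ("red", "M"), ("blue", "M")]
print(cluster_mode(records))                                # ('red', 'M')
print(matching_dissimilarity(("red", "S"), ("blue", "M")))  # 2
```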
- Find representative objects, called medoids,
in clusters
- PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids and
iteratively replaces one of the medoids
by one of the non-medoids if it improves
the total distance of the resulting
clustering
- PAM works effectively for small data sets,
but does not scale well for large data
sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): Randomized
sampling
- Focusing + spatial data structure (Ester et al.,
1995)
A Typical K-Medoids
Algorithm (PAM)
1. Select k objects arbitrarily as the initial medoids
2. Assign each remaining object to the cluster of its nearest medoid
3. For each pair of a medoid and a non-medoid object, swap them if the swap improves the total distance of the resulting clustering
4. Repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost TCih = Σj Cjih
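The per-object terms Cjih sum to the overall change in clustering cost, so one direct (if non-incremental) way to evaluate TCih is to compare the total cost before and after the candidate swap. A small NumPy sketch with illustrative function names:

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of distances from every object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def swap_cost(X, medoid_idx, i, h):
    # TCih: change in total cost if medoid i is replaced by non-medoid h
    # (a negative value means the swap improves the clustering)
    swapped = [h if m == i else m for m in medoid_idx]
    return total_cost(X, swapped) - total_cost(X, medoid_idx)
```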
What Is the Problem with
PAM?
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well for large data sets:
- O(k(n-k)^2) per iteration, where n is the number of data objects and k is the number of clusters
→ Sampling-based method: CLARA (Clustering LARge Applications)
- CLARA applies PAM to samples of the data set
- Weakness: a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
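A minimal sketch of the CLARA idea, assuming a pam routine that returns medoid coordinates for a sample (the routine itself and the parameter defaults here are placeholders):

```python
import numpy as np

def clara(X, k, pam, n_samples=5, sample_size=40, seed=0):
    # Run PAM on several random samples; keep the medoid set that gives
    # the lowest total cost on the WHOLE data set, not just the sample.
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = pam(X[idx], k)  # medoids found on the sample (actual points)
        cost = np.linalg.norm(
            X[:, None, :] - medoids[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```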
CLARANS (“Randomized”
CLARA) (1994)
- CLARANS (A Clustering Algorithm
based on Randomized Search) (Ng
and Han’94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be
presented as searching a graph
where every node is a potential
solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than
both PAM and CLARA
- Focusing techniques and spatial
access structures may further
improve its performance (Ester et
al.’95)
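Putting the pieces together, here is an illustrative sketch of the CLARANS search (the parameter names numlocal and maxneighbor follow the original paper; everything else is an assumption of this sketch): each node is a set of k medoid indices, a random neighbor differs in exactly one medoid, and the search restarts from a random node after a local optimum.

```python
import numpy as np

def clarans(X, k, numlocal=2, maxneighbor=20, seed=0):
    rng = np.random.default_rng(seed)

    def cost(medoids):
        # Total distance from every object to its nearest medoid
        d = np.linalg.norm(X[:, None, :] - X[list(medoids)][None, :, :], axis=2)
        return d.min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(numlocal):                 # restart from a random node
        current = set(rng.choice(len(X), size=k, replace=False).tolist())
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:            # sample neighbors dynamically
            i = rng.choice(list(current))     # medoid to replace
            h = int(rng.integers(len(X)))     # candidate non-medoid
            if h in current:
                continue
            neighbor = (current - {i}) | {h}  # neighbor differs in one medoid
            if (c := cost(neighbor)) < current_cost:
                current, current_cost, tried = neighbor, c, 0  # move and reset
            else:
                tried += 1
        if current_cost < best_cost:          # local optimum for this restart
            best, best_cost = current, current_cost
    return best, best_cost
```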