Abstract:

In today's world, the Internet has become a common medium for expressing and sharing ideas. Smartphones, the most widely used handheld devices, have become a common means of recording people's daily activities, thoughts, and opinions, and researchers have therefore tried to gather people's views through opinion mining. Clustering is a form of unsupervised learning; it can be defined as "the process of organizing objects into groups whose members are similar in some way". Cluster analysis aims to organize a collection of patterns into clusters based on similarity. This paper outlines clustering and its main techniques, which can be subdivided into several categories: partitional algorithms, hierarchical clustering, fuzzy clustering, density-based clustering, and model-based clustering. Finally, the various algorithms are surveyed and compared, showing which clustering techniques perform better than others.

Keywords: Mining, Clustering, Opinion Mining, Partitional, Hierarchical Clustering, Fuzzy Clustering.

I. INTRODUCTION

Data Mining: Definition

Data mining is the process of analyzing data from different perspectives to uncover hidden patterns and categorize them into useful information. The data, collected and assembled in common repositories such as data warehouses, is processed efficiently by data mining algorithms to support business decision making and other information requirements, ultimately cutting costs and increasing revenue.

In past decades, marketers and business people depended on surveys to identify customer preferences and the products customers purchased. Nowadays, an alternative to surveys is to collect people's direct feedback through the web, which is why we have chosen opinion mining.

Opinion mining, which is also called sentiment analysis, involves building a system to collect and categorize opinions about a product. Many people share their opinions on news, products, and brands on social media such as Facebook, Twitter, and LinkedIn; these platforms play a crucial role in collecting, transmitting, and sharing people's opinions.

Opinion Mining:

Opinion mining is a type of natural language processing for tracking the mood of the public about a particular product. Sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event. On many social networking services and e-commerce websites, users can provide text reviews, comments, or feedback on items. This user-generated text provides a rich source of users' sentiments and opinions about numerous products and items.

Clustering:

Clustering is one of the most interesting topics in data mining; it aims to find intrinsic structures in data and to identify meaningful subgroups for further analysis. It is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Thus clustering can also be defined as the "methodology of organizing objects into groups whose members are similar in some way."

II. WHY CLUSTERING

Data clustering is one of the challenging mining techniques in the knowledge discovery process. Clustering huge amounts of data is a difficult task, since the goal is to find a suitable partition in an unsupervised way (i.e. without any prior knowledge), trying to maximize intra-cluster similarity and minimize inter-cluster similarity, which in turn maintains high cluster cohesiveness. Clustering groups data instances into subsets in such a manner that similar instances are grouped together while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. The output of cluster analysis is thus a number of groups, or clusters, that form a partition of the data set. In short, clustering is the technique of processing data into meaningful groups for statistical analysis. Data mining and knowledge discovery have penetrated a variety of machine learning systems. A very important area in the field of machine learning is text categorization; feature selection and term weighting are two important steps that decide the result of any text categorization problem.

Figure 2. Clustering Process [15]

III. MOTIVATION

As the Internet grows, the amount of digital documents has been increasing dramatically over the years, and managing information search and retrieval has become a practically important problem. Developing methods to organize large amounts of unstructured text documents into a smaller number of meaningful clusters would be very helpful, since document clustering is vital to tasks such as indexing, filtering, automated metadata generation, populating hierarchical catalogues of web resources, and, in general, any application requiring document organization. There are also many people interested in reading specific news, so news articles need to be clustered from the large number of available articles, since many articles are added each day and many of them correspond to the same news but come from different sources. By clustering the articles, we could reduce the search domain for recommendations, as most users are interested in the news corresponding to only a few clusters. This could greatly improve time efficiency and would also help in identifying the same news from different sources. The main motivation is to investigate possible improvements in the effectiveness of document clustering by examining the various clustering algorithms available.

IV. TYPES OF CLUSTERING

To identify suitable algorithms that produce the best clustering solutions, it becomes necessary to have a method for comparing the results of different clustering algorithms. Many different clustering techniques have been defined in order to solve the problem from different perspectives; these are:

· Partitional clustering
· Density-based clustering
· Hierarchical clustering
· Fuzzy clustering
· Model-based clustering

A. Partitional Clustering

Partitional clustering is considered the most popular class of clustering algorithms, also known as iterative relocation algorithms. A partitional clustering algorithm splits the data points into k partitions, where each partition represents a cluster. The partition is made according to a certain objective function: clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that objects within a cluster are "similar" whereas objects in different clusters are "dissimilar".

4.1.1 K-means

K-means was proposed by MacQueen and is one of the most popular partition-based methods. It partitions the dataset into k disjoint subsets, where k is predetermined. The algorithm keeps adjusting the assignment of objects to the closest current cluster mean until no new assignments of objects to clusters can be made.
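The assign-then-update loop just described can be sketched in plain Python. This is a minimal illustration, not MacQueen's original formulation; the sample points and the squared-Euclidean distance are assumptions for the demo, not taken from the paper.

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means: assign each point to its nearest mean, then
    recompute the means, until the means stop changing."""
    means = random.sample(points, k)  # initial means: k distinct data points
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # Update step: move each mean to the centroid of its cluster.
        new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:  # converged: no mean moved
            break
        means = new_means
    return means, clusters

# Demo: two obvious groups in the plane (fixed seed for repeatability).
random.seed(0)
pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
means, clusters = kmeans(pts, 2)
```

Whatever two points the initialization picks here, the loop separates the two groups within a couple of iterations, illustrating why k-means converges quickly on well-separated data.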

4.1.2 PAM (Partitioning Around Medoids)

The Partitioning Around Medoids (PAM) algorithm was introduced by Kaufman and Rousseeuw. It is based on k representative objects, called medoids, chosen among the objects of the dataset.
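The medoid idea can be sketched as a greedy swap search. This is a simplified first-improvement version for illustration; Kaufman and Rousseeuw's full BUILD/SWAP procedure is more refined, and the 1-D demo data is an assumption.

```python
from itertools import product

def total_cost(points, medoids, dist):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Greedy PAM sketch: start from the first k points as medoids and
    keep swapping a medoid with a non-medoid whenever that lowers the
    total cost, until no improving swap remains."""
    medoids = list(points[:k])
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for m, p in product(list(medoids), points):
            if p in medoids:
                continue
            candidate = [p if x == m else x for x in medoids]
            cost = total_cost(points, candidate, dist)
            if cost < best:  # accept the first improving swap
                medoids, best = candidate, cost
                improved = True
    return medoids

# Demo: two well-separated 1-D groups; the medoids are actual data points.
dist = lambda a, b: abs(a - b)
medoids = pam([1, 2, 3, 100, 101, 102], 2, dist)
```

Unlike a k-means centroid, each returned medoid is a real object from the dataset, which is what makes the method robust to outliers.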

4.1.3 CLARA (Clustering LARge Applications)

Both the k-means and PAM algorithms become slow and impractical on large datasets. One algorithm that tries to solve this problem is CLARA (Clustering LARge Applications). CLARA is a method based on PAM that attempts to deal with large-dataset applications: it uses the PAM algorithm to cluster a sample drawn from the set of objects into k subsets.

B. Density-Based Clustering

Density-based clustering algorithms are devised to create arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold; DBSCAN and RDBC are typical algorithms. Density-based clustering plays a vital role in finding non-linearly shaped structures based on density. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based algorithm. It uses the concepts of density reachability and density connectivity.
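These two concepts can be sketched compactly: a point with enough neighbours within radius eps is a core point, and clusters grow by density-reachability from core points. The naive O(n²) neighbour search and the 1-D demo data are illustrative assumptions; real implementations use spatial indexes.

```python
def region_query(points, i, eps, dist):
    """Indices of all points within eps of point i (its eps-neighbourhood)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch. Returns a cluster label per point; -1 is noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps, dist)
        if len(neighbours) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:                    # expand by density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reachable from a core point -> border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = region_query(points, j, eps, dist)
            if len(jn) >= min_pts:      # j is itself a core point: keep growing
                queue.extend(jn)
    return labels

# Demo: two dense runs become clusters; the isolated point is labelled noise.
dist = lambda a, b: abs(a - b)
labels = dbscan([1, 2, 3, 10, 11, 12, 50], eps=1.5, min_pts=2, dist=dist)
```

Note that, unlike the partitional methods above, the number of clusters is not specified in advance; it emerges from eps and min_pts.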

NAME      DATA TYPE   NOISE   COMPLEXITY
DBSCAN    Numerical   Yes     O(n log n)
OPTICS    Numerical   Yes     O(n log n)
DENCLUE   Numerical   Yes     O(n²)

C. Hierarchical Clustering:

Hierarchical clustering algorithms divide or merge a dataset into a sequence of nested partitions. The hierarchy of nested partitions can be agglomerative (bottom-up) or divisive (top-down). In the agglomerative method, clustering starts with each object in its own cluster and continues merging the closest pairs of clusters until all the objects are together in a single cluster. Divisive hierarchical clustering, on the other hand, starts with all objects in a single cluster and keeps splitting larger clusters into smaller ones until all objects are separated into unit clusters. Both hierarchical methods represent clusters in a natural way, as a dendrogram. Examples of such algorithms are ROCK, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and CURE (Clustering Using REpresentatives).

There are also different agglomerative clustering algorithms that use different similarity measures; on that basis, the agglomerative variants are: single linkage, complete linkage, group average linkage, centroid linkage, and Ward's criterion.

Agglomerative clustering

Each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained. Agglomerative algorithms are among the most widely used.

Single-link clustering

Also called the nearest-neighbour method, it considers the distance between two clusters to be the shortest distance from any member of one cluster to any member of the other cluster.

Average-link clustering

Also called the group average method, it considers the distance between two clusters to be the average distance from any member of one cluster to any member of the other cluster.

Centroid clustering

The centroid method uses the centroid (the centre of a group of cases) to determine the average distance between clusters of cases.
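The linkage criteria just described differ only in how they reduce the set of pairwise distances between two clusters to a single number. A small sketch (the `dist` argument stands for any point-level distance function; the demo values are assumptions):

```python
def single_link(a, b, dist):
    """Nearest neighbour: shortest distance across the two clusters."""
    return min(dist(x, y) for x in a for y in b)

def complete_link(a, b, dist):
    """Farthest pair: longest distance across the two clusters."""
    return max(dist(x, y) for x in a for y in b)

def average_link(a, b, dist):
    """Group average: mean of all cross-cluster pairwise distances."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

# Demo on two tiny 1-D clusters.
d = lambda x, y: abs(x - y)
a, b = [1, 2], [5, 9]
```

For these clusters the cross-cluster distances are 4, 8, 3, and 7, so single link reports 3, complete link 8, and average link 5.5.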

V. PROPOSED ANALYTICAL MODEL

A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). The agglomerative approach starts with each data point in a separate cluster, or with a certain large number of clusters; each step merges the two most similar clusters, so the total number of clusters decreases after every step. This is repeated until the desired number of clusters is obtained or only one cluster remains. By contrast, the divisive approach starts with all data objects in the same cluster and, in each step, splits one cluster into smaller clusters until the desired structure is reached.
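The agglomerative merging loop can be sketched as follows. This is a naive O(n³) illustration with the linkage criterion passed in as a function (single link in the demo); the data values are assumptions.

```python
def agglomerative(points, k, linkage, dist):
    """Bottom-up sketch: start with singleton clusters and repeatedly
    merge the closest pair under the given linkage until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of clusters under the linkage criterion.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Demo: five 1-D points merged down to three clusters with single link.
dist = lambda x, y: abs(x - y)
single = lambda a, b, d: min(d(x, y) for x in a for y in b)
result = agglomerative([1, 2, 9, 10, 25], 3, single, dist)
```

Stopping the loop at k clusters corresponds to cutting the dendrogram at that level; running it to k = 1 produces the full merge hierarchy.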

VI. SUMMARY TABLE FOR COMPARISON OF CLUSTERING TECHNIQUES

Category | Name | Key idea | Type of data | Advantages | Disadvantages
---------|------|----------|--------------|------------|--------------
Partitional | K-means | Mean centroid | Numerical | Simple; the most popular in most problems | Sensitive to outliers; centroids not meaningful
Partitional | PAM | Medoid centroid | Numerical | Robust to outliers | Number of clusters must be pre-determined
Partitional | CLARA | Medoids on samples | Numerical | Applicable to large datasets | Sensitive to outliers
Partitional | CLARANS | – | Numerical | Handles outliers effectively | High cost
Density-based | DBSCAN | Fixed size | Numerical | Resistant to noise; can handle clusters of various shapes and sizes | Cannot handle varying densities
Density-based | OPTICS | – | Numerical | Good for datasets with a large amount of noise; faster in computation | Needs a large number of parameters
Density-based | DENCLUE | Variable size | Numerical | Solid mathematical foundation | Needs a large number of parameters
Density-based | RDBC | – | Numerical | More effective in discovering varied-shape clusters; handles noise effectively | Varying cost
Hierarchical | CURE | Partition samples | Numerical | Robust to outliers; appropriate for handling large datasets | Ignores information about the inter-connectivity of objects
Hierarchical | BIRCH | Multidimensional | Numerical | Suitable for large databases; scales linearly; robust | Handles only numeric data; sensitive to data records
Hierarchical | ROCK | Notion of links | Categorical | Appropriate for large datasets | Space complexity depends on the initialization of local heaps
Hierarchical (agglomerative) | S-link | Closest pair of points | – | Does not need the number of clusters to be specified | Termination condition needs to be satisfied; sensitive to outliers
Hierarchical (agglomerative) | Ave-link | Centroid of clusters | – | Considers all members of a cluster rather than a single point | Produces clusters with the same variance
Hierarchical (agglomerative) | Com-link | Farthest pair of points | – | Not strongly affected by outliers | Has problems with convex-shape clusters
Grid | STING | Multiresolution, multiple grids | Numerical | Allows parallelization; high-quality clusters | Does not define an appropriate level of granularity
Grid | WaveCluster | Density based | Numerical | Successful outlier handling; dimensionality reduction | Varying cost
Grid | CLIQUE | Grids | – | Scalability; insensitive to noise | Prone to high-dimensional clusters

VII. CONCLUSION:

Since clustering is applied in many fields, a number of clustering techniques and algorithms available in the literature have been surveyed. This paper has presented the main characteristics of various clustering algorithms and discussed the different categories into which they can be classified. The discussion closes with a comparative study of the pros and cons of each category, concluding that partitional clustering performs well.