
Abstract:

In today's world, the Internet has become a common way of expressing and sharing ideas. Smartphones, the most widespread handheld devices, have become a common medium for recording people's daily activities, thoughts, and opinions, so researchers try to gather people's views through opinion mining. Clustering is a form of unsupervised learning; it can be defined as "the process of organizing objects into groups whose members are similar in some way". Cluster analysis aims to organize a collection of patterns into clusters based on similarity. This paper presents an outline of clustering and its techniques, which can be subdivided into several categories: partitional algorithms, hierarchical clustering, fuzzy clustering, density-based clustering, and model-based clustering. Finally, the various algorithms are compared and surveyed to show which clustering techniques perform better than others.

Keywords: Data mining, Clustering, Opinion mining, Partitional clustering, Hierarchical clustering, Fuzzy clustering.

 

 

I.  INTRODUCTION

Data Mining: Definition
Data mining is the process of analyzing data from different perspectives to uncover hidden patterns and categorize them into useful information. The data are collected and assembled in common repositories, such as data warehouses, where data mining algorithms support efficient analysis, facilitate business decision making, and serve other information requirements that ultimately cut costs and increase revenue.

In past decades, marketers and business people depended on surveys to identify customers' preferences and the products they purchased. Nowadays, an alternative to a survey is to collect people's direct feedback through the web, which is why we have chosen opinion mining.

Opinion mining, also called sentiment analysis, involves building a system to collect and categorize opinions about a product. Many people share their opinions on news, products, and brands on social media such as Facebook, Twitter, and LinkedIn; these platforms play a crucial role in collecting, transmitting, and sharing people's opinions.

Opinion Mining:

Opinion mining is a type of natural language processing for tracking the mood of the public about a particular product. Sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event. In many social networking services and e-commerce websites, users can provide text reviews, comments, or feedback on items. This user-generated text provides a rich source of users' sentiments and opinions about numerous products and items.

Clustering:

Clustering is one of the most interesting topics in data mining; it aims to find intrinsic structures in data and discover meaningful subgroups for further analysis. It is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Thus clustering can also be defined as the "methodology of organizing objects into groups whose members are similar in some way."


II.  WHY CLUSTERING

 

Data clustering is one of the challenging mining techniques in the knowledge discovery process. Clustering a huge amount of data is a difficult task, since the goal is to find a suitable partition in an unsupervised way (i.e., without any prior knowledge), maximizing intra-cluster similarity and minimizing inter-cluster similarity, which in turn maintains high cluster cohesiveness. Clustering groups data instances into subsets such that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. Thus the output of cluster analysis is the set of groups, or clusters, that form the partition structure of the data set. In short, clustering is the technique of processing data into meaningful groups for statistical analysis. Data mining and knowledge discovery have penetrated a variety of machine learning systems. A very important area in the field of machine learning is text categorization; feature selection and term weighting are two important steps that decide the result of any text categorization problem.

Figure 2. Clustering process [15]

 

III.  MOTIVATION

 

 

As the number of digital documents has increased dramatically over the years with the growth of the Internet, managing information search and retrieval has become a practically important problem. Developing methods to organize large amounts of unstructured text documents into a smaller number of meaningful clusters would be very helpful, as document clustering is vital to tasks such as indexing, filtering, automated metadata generation, population of hierarchical catalogues of web resources and, in general, any application requiring document organization. Also, many people are interested in reading specific news, so it is necessary to cluster news articles from the large number available, since many articles are added each day and many correspond to the same news but come from different sources. By clustering the articles, we can reduce the search domain for recommendations, as most users are interested in news corresponding to only a few clusters. This can improve time efficiency to a great extent and also helps identify the same news from different sources. The main motivation is to investigate possible improvements in the effectiveness of document clustering by examining the various clustering algorithms available.

 

IV.  TYPES OF CLUSTERING

 

To identify suitable clustering algorithms that produce the best clustering solutions, it is necessary to have a method for comparing the results of different clustering algorithms. Many clustering techniques have been defined to solve the problem from different perspectives. These are:

 

·        Partitional clustering

·        Density-based clustering

·        Hierarchical clustering

·        Fuzzy clustering

·        Model-based clustering

 

A. Partitional Clustering

 

Partitional clustering is considered the most popular class of clustering algorithms, also known as iterative relocation algorithms. A partitional clustering algorithm splits the data points into k partitions, where each partition represents a cluster. The partition is based on a certain objective function: the clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar" whereas the objects of different clusters are "dissimilar".


4.1.1 K-means

 

K-means was proposed by MacQueen and is one of the most popular partition-based methods. It partitions the dataset into k disjoint subsets, where k is predetermined. The algorithm keeps adjusting the assignment of objects to the closest current cluster mean until no further reassignments can be made.
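The loop described above can be sketched in a few lines of Python (a minimal 1-D illustration under our own assumptions; the function name `kmeans` and the toy data are ours, not MacQueen's original formulation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means sketch: assign each point to the nearest current
    cluster mean, recompute the means, stop when no assignment changes."""
    random.seed(seed)
    means = random.sample(points, k)            # initial means: k random points
    assign = [None] * len(points)
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: (p - means[j]) ** 2)
                      for p in points]
        if new_assign == assign:                # converged: no new assignments
            break
        assign = new_assign
        for j in range(k):                      # recompute each cluster mean
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, assign

# two well-separated 1-D groups
means, assign = kmeans([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2)
print(sorted(round(m, 1) for m in means))       # -> [1.0, 10.0]
```

Note that k must be fixed in advance, which is exactly the limitation the later comparison table records for this family of algorithms.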

 

4.1.2 PAM (Partitioning Around Medoids)

The Partitioning Around Medoids (PAM) algorithm was introduced by Kaufman and Rousseeuw. It is based on k representative objects, called medoids, chosen among the objects of the dataset; each remaining object is assigned to its nearest medoid.
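A simplified swap-based PAM can be sketched as follows (1-D toy data; names like `total_cost` are ours, and a full PAM implementation evaluates swaps more efficiently):

```python
def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    """PAM sketch: start from the first k points as medoids, then keep
    applying any medoid/non-medoid swap that lowers the total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                trial = [p if x == m else x for x in medoids]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return sorted(medoids)

print(pam([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))   # -> [1.0, 10.0]
```

Because the representatives are actual data objects rather than means, a single extreme value cannot drag a medoid away from its cluster, which is why PAM is more robust to outliers than k-means.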

4.1.3 CLARA (Clustering LARge Applications)

The PAM algorithm is slow and not practical for large datasets. One algorithm that tries to solve this problem is CLARA (Clustering LARge Applications). CLARA is a method based on PAM that attempts to deal with large-dataset applications: it uses the PAM algorithm to cluster a sample of the objects into k subsets.
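The sampling idea can be sketched as follows (a hypothetical illustration: exhaustive search stands in for PAM on each small sample, since the samples are tiny, and the function names are ours):

```python
import random
from itertools import combinations

def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def best_medoids(sample, k):
    """Stand-in for PAM on a small sample: exhaustive search is feasible
    because the sample is tiny."""
    return min(combinations(sample, k), key=lambda ms: cost(sample, ms))

def clara(points, k, n_samples=5, sample_size=4, seed=1):
    """CLARA idea: cluster several small random samples and keep the
    medoids that score best on the FULL dataset."""
    random.seed(seed)
    best = None
    for _ in range(n_samples):
        sample = random.sample(points, sample_size)
        medoids = best_medoids(sample, k)
        if best is None or cost(points, medoids) < cost(points, best):
            best = medoids
    return sorted(best)

print(clara([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))
```

Only the samples are clustered in full, so the expensive step runs on a constant-size input; the full dataset is touched only to score each candidate set of medoids.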

 

 

B. Density-Based Clustering

Density-based clustering algorithms are devised to create arbitrarily shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold; DBSCAN, SNN, and RDBC are typical algorithms.

Density-based clustering plays a vital role in finding non-linearly shaped structures based on density. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based algorithm. It uses the concepts of density reachability and density connectivity.
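Density reachability can be made concrete with a minimal 1-D DBSCAN sketch (our own simplified illustration; real implementations work on spatial data with a neighbourhood index, but the two parameters `eps` and `min_pts` are the algorithm's actual parameters):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a core point has >= min_pts neighbours within
    eps; clusters grow by density-reachability from core points. Points
    reachable from no core point are labelled noise (-1)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:          # not core (may later become border)
            labels[i] = -1
            continue
        labels[i] = cluster               # start a new cluster at this core
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:        # j is core: expand through it
                queue.extend(nb)
        cluster += 1
    return labels

data = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 42.0]
print(dbscan(data, eps=0.5, min_pts=2))   # -> [0, 0, 0, 1, 1, 1, -1]
```

The isolated point 42.0 has too few neighbours to be density-reachable from any core point, so it is labelled noise rather than forced into a cluster, illustrating the noise handling recorded in the table below.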

 

Name    | Data type | Noise | Complexity
DBSCAN  | Numerical | Yes   | O(n log n)
OPTICS  | Numerical | Yes   | O(n log n)
DENCLUE | Numerical | Yes   | O(n^2)
 

 

 

 

 

 

C. Hierarchical Clustering:

Hierarchical clustering algorithms divide or merge a dataset into a sequence of nested partitions. The hierarchy of nested partitions can be agglomerative (bottom-up) or divisive (top-down). In the agglomerative method, clustering starts with each object in its own cluster and continues merging the closest pairs of clusters until all objects are together in a single cluster. Divisive hierarchical clustering, on the other hand, starts with all objects in one cluster and keeps splitting larger clusters into smaller ones until every object forms a unit cluster. Both hierarchical methods yield a natural representation of the clusters, called a dendrogram. Examples of such algorithms are ROCK, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and CURE (Clustering Using REpresentatives).


There are also different agglomerative clustering algorithms that use different similarity measures; based on these, the agglomerative variants are: single linkage, complete linkage, group-average linkage, centroid linkage, and Ward's criterion.

Agglomerative clustering

Each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained. Agglomerative algorithms are among the most widely used.
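The merge loop can be sketched as follows (a toy 1-D illustration using the single-link distance defined in the next subsection; real implementations maintain a distance matrix instead of recomputing distances at every step):

```python
def agglomerative(points, k):
    """Agglomerative sketch: every point starts as its own cluster;
    repeatedly merge the two closest clusters (single-link distance)
    until only k clusters remain."""
    clusters = [[p] for p in points]

    def single_link(a, b):    # shortest distance between any two members
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return [sorted(c) for c in clusters]

print(agglomerative([1.0, 1.1, 5.0, 5.2], k=2))   # -> [[1.0, 1.1], [5.0, 5.2]]
```

Recording the order of the merges (and the distance at which each occurs) is what yields the dendrogram mentioned above.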

Single-link clustering

It is also called the nearest-neighbour method; it considers the distance between two clusters to be equal to the shortest distance from any member of one cluster to any member of the other cluster.


Average-link clustering

It is also called the group-average method; it considers the distance between two clusters to be equal to the average distance from any member of one cluster to any member of the other cluster.


Centroid Clustering

The centroid method uses the centroid (the centre of the group of cases) to determine the average distance between clusters of cases.
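These linkage criteria differ only in how they measure the distance between the same two clusters, which a short sketch makes concrete (toy 1-D clusters A and B of our own choosing; complete linkage from the list above is included for contrast):

```python
A = [1.0, 2.0]
B = [4.0, 8.0]

# all pairwise distances between members of the two clusters
dists = [abs(a - b) for a in A for b in B]

single   = min(dists)                               # nearest members
complete = max(dists)                               # farthest members
average  = sum(dists) / len(dists)                  # group average
centroid = abs(sum(A) / len(A) - sum(B) / len(B))   # distance of the means

print(single, complete, average, centroid)          # -> 2.0 7.0 4.5 4.5
```

The four criteria give four different answers for the same pair of clusters, which is why agglomerative algorithms built on them can produce different dendrograms from identical data.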


V.  PROPOSED ANALYTICAL MODEL

 

A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). The agglomerative approach starts with each data point in a separate cluster, or with a certain large number of clusters; each step merges the two most similar clusters, so after each step the total number of clusters decreases. This is repeated until the desired number of clusters is obtained or only one cluster remains. By contrast, the divisive approach starts with all data objects in the same cluster, and in each step one cluster is split into smaller clusters until the desired number of clusters is reached.

 

 

 

 

VI.  SUMMARY TABLE FOR COMPARISON OF CLUSTERING TECHNIQUES

 

Algorithm   | Category                | Key idea                        | Data type   | Advantages                                                                  | Disadvantages
K-means     | Partitional             | Mean centroid                   | Numerical   | Simple; the most popular                                                    | Sensitive to outliers; centroids not meaningful in most problems
PAM         | Partitional             | Medoid centroid                 | Numerical   | Robust to outliers                                                          | Number of clusters must be pre-determined
CLARA       | Partitional             | PAM on samples                  | Numerical   | Applicable to large data sets                                               | Sensitive to outliers
CLARANS     | Partitional             |                                 | Numerical   | Handles outliers effectively                                                | High cost
DBSCAN      | Density based           | Fixed size                      | Numerical   | Resistant to noise; can handle clusters of various shapes and sizes         | Cannot handle varying densities
OPTICS      | Density based           |                                 | Numerical   | Good for data sets with a large amount of noise; faster in computation      | Needs a large number of parameters
DENCLUE     | Density based           | Variable size                   | Numerical   | Solid mathematical foundation                                               | Needs a large number of parameters
RDBC        | Density based           |                                 | Numerical   | More effective in discovering varied-shape clusters; handles noise effectively | Varying cost
CURE        | Hierarchical            | Partition samples               | Numerical   | Robust to outliers; appropriate for handling large data sets                | Ignores information about the inter-connectivity of objects
BIRCH       | Hierarchical            | Multidimensional                | Numerical   | Suitable for large databases; scales linearly                               | Handles only numeric data; sensitive to the order of data records
ROCK        | Hierarchical (agglomerative) | Notion of links            | Categorical | Robust; appropriate for large data sets                                     | Space complexity depends on initialization of local heaps
S-link      | Hierarchical            | Closest pair of points          |             | Does not need the number of clusters to be specified                        | Termination condition needs to be satisfied; sensitive to outliers
Ave-link    | Hierarchical            | Centroid of clusters            |             | Considers all members of a cluster rather than a single point               | Produces clusters with the same variance
Com-link    | Hierarchical            | Farthest pair of points         |             | Not strongly affected by outliers                                           | Has problems with convex-shaped clusters
STING       | Grid                    | Multiple grids; multiresolution | Numerical   | Allows parallelization; high-quality clusters                               | Does not define an appropriate level of granularity
WaveCluster | Grid (density based)    |                                 | Numerical   | Successful outlier handling                                                 | Varying cost
CLIQUE      | Grid (density based)    | Grids                           | Numerical   | Dimensionality reduction; scalability; insensitive to noise                 | Prone to high-dimensional clusters
 

VII.  CONCLUSION

Since clustering is applied in many fields, a number of clustering techniques and algorithms available in the literature have been surveyed. In this paper we have presented the main characteristics of various clustering algorithms and discussed the different categories into which they can be classified. We concluded the discussion with a comparative study of the pros and cons of each category, finding that partitional clustering performs well in practice.