Opinion mining is a technique for capturing the voice of the masses, whether they are in support of or against a particular thing. Opinion mining has been around for a long time now, and people have developed various techniques to mine and analyse data: sometimes simple algorithms suffice, whereas complex conditions demand more sophisticated ones. Opinion mining not only helps in deciding about a particular topic but also helps in arriving at conclusions where no concrete evidence is found. Data grows with every step we take; wherever we go, data keeps being generated, whether through the work we carry out or through the automated processes that keep happening in the background.
Social networking data is growing immensely day by day. This data can readily be used to analyse different questions: for example, tweets posted on Twitter can be transformed into a database and then analysed to draw out a common perception of users worldwide on a particular topic. Similarly, data prevailing in other domains and sites can be retrieved, pre-processed and then used in different analysis practices. However, drawing inferences about perception from a given or manually created data set is not that easy. To begin with, every reference taken with respect to the data must be valid for all entities related to that particular data set. Opinion mining and its techniques are very common, yet not properly explored in all senses. The most important factor that comes into play is how the opinion mining methodology is designed. It really matters how the opinions are extracted and which algorithms are used, because the integration of the algorithm with the data set and the training set is very important in determining the efficiency as well as in analysing different aspects of the test data. As the Internet of Things approaches faster than expected, analysing and deriving conclusions from different test cases can help us obtain valuable information that might not otherwise be readily available to us in any form.
There are basically two main components to consider when we go for opinion mining.
1) Test data: this is the data on which the opinion mining techniques are applied to infer something and arrive at a particular conclusion. The test data can be formulated in many ways; in many cases it is collected in real time with the help of different gadgets and IoT wearable technology.
In other cases the test data is already available and is taken from different sites or repositories, from which it is then analysed and searched.
2) Training data: here the data set has attributes that help us analyse the test data. In simple terms, the test data is analysed by virtue of the qualities of the training data. It is very important that the training data be well organized and surveyed, because all the attributes the training data has will automatically be passed on to the test data once the classification algorithm comes into action.
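As a minimal illustration of this relationship (toy data; a Naïve Bayes text classifier from scikit-learn is used here as an example, not the paper's exact setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: labelled examples whose attributes the classifier learns from.
train_texts = ["great product", "love it", "terrible product", "hate it"]
train_labels = ["positive", "positive", "negative", "negative"]

# Turn each text into word-count features.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# The classifier absorbs the qualities of the training data here.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Test data: an unlabelled text, analysed by virtue of the training data.
X_test = vectorizer.transform(["love this product"])
print(clf.predict(X_test))
```

The test data contributes no labels of its own; every attribute used to classify it comes from the training set, which is why the quality of the training data dominates the result.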
II. Literature Survey
Gathering data from online sources with the help of tools that mine text and present it in a database format is termed Text Mining. Text mining can be carried out with many different tools that are readily available. In today's scenario, extracting tweets from Twitter or generating a database of reviews from the comment section of YouTube are examples of text mining. Once the text has been mined, it is important to analyse the data, and it is crucial to set the parameters that decide how the data will be summarised and analysed. Mostly this data is segregated on common terms, which is generally done with the help of sentiment analysis. Sentiment analysis gives the boundary within which the text is classified in a broader sense.
Social media these days is the best platform where the use of text mining and sentiment analysis is reflected. It is fair to say that marketing firms utilise the analysis of social media to estimate expenses and revenue and obtain near-perfect results. However, when analysed on a broader map, these results do not always fulfil the criteria for which they were actually analysed. In the paper "Learning Extraction Patterns for Subjective Expressions" [2], the authors use the basic Naïve Bayes classification tool to carry out their work in sentiment analysis.
The paper "Sentiment Analysis of Movie Reviews: A New Feature-based Heuristic for Aspect-level Sentiment Classification" [3] tells us how text mining and web scraping are done with the help of different APIs. It reflects how basic web scraping should be done and, after that, how the analysis of the collected data has to be carried out.
As the analytics industry grows, the usage of text mining and analysis has suddenly taken a steep upward curve. However, most research keeps discussing how the choice among the different algorithms available for data classification can impact the efficiency as well as the analysis of the data.
The choice of algorithm does not depend only on what type of data is used for testing or what training data is used for coaching; it also depends on various constraints such as accuracy and time efficiency. It is very important that algorithms are chosen not only for classification but for efficiency as well.
The training data set plays a very crucial role in determining what has to be analysed and how. Whatever analysis we want to draw, we have to obtain it from the test data using that same training data set. A change in classification algorithm will not affect the results of the analysis, but it can surely change the efficiency. However, if the training data is not selected the right way, it can lead to wrong analysis no matter how efficient and worthy the classification algorithm is.
Basic yet very useful machine learning algorithms such as Naïve Bayes and the Support Vector Machine (SVM) are extensively used in text mining and classification. These algorithms are efficient solutions, but their maximum efficiency comes with changes in the entropy of the data.
III. Methodology
The methodology has been designed so that the characteristics of the database can be evaluated. The whole idea is to obtain a universal training data set that can be used to train against any test data, no matter what object or topic the test data reflects. The training data will be pulled just like any other database and will be cleaned as well as made specific to the point; then various additions will be made to it. Once the training data is believed to be complete, a random test data set will be picked, a training data set specific to that test data will also be made, and the efficiency of the two will be compared. This is repeated over and over until the efficiency on the test data matches for both training sets.
IV. Characteristics of Database
The databases used for analysis and for training of the data set are very crucial to determine.
The training database determines how the test data will be analysed. If the results vary on every repetition, it means the training data set is being used differently for every iteration. A training data set is always available to us, but it is unique to the test data. To obtain an overview as well as consistent results, it is highly important to develop a training data set that is universal for all test data and can easily be used to fetch results from different data sets.
This training data set will then be compared with other data sets.
The whole idea of the research is to develop a data set that can be universally used as the training data set. It is very important that this training data be well defined and applicable to most data sets. For that, a training data set with some key characteristics will be built, collected from different repositories as well as through text mining. Once the collection is complete, the characteristic approach of the data will be examined. Random test data sets will be picked and tested against the universal training data set, using very basic classification algorithms; each test data set is also checked against its own unique training set. Once the results no longer differ much and are stable across all the test data sets, the universal data set can be implemented in projects. The other basic steps are as follows.
V.1 DATA CLEANING
This is the most important step in the entire process. Here we find those keywords which build up the meaning of each sentence.
Transform text to lower case.
Remove stop words.
Stop words are just common words which we may not be interested in. If we look at the result of stopwords("english") we can see what is getting removed.
Remove URLs from the text.
So, what have we just done?
We've transformed every word to lower case so that 'Apple' and 'apple' now count as the same word. We've removed all punctuation, so 'apple' and 'apple!' are now the same. We stripped out any extra whitespace, and we removed stop words and URLs.
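These cleaning steps can be sketched in Python as follows (the stop-word list here is a tiny illustrative stand-in; a real list such as nltk's is much larger):

```python
import re
import string

# A small stop-word list for illustration only; real lists are much larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in"}

def clean_text(text):
    """Lower-case, strip URLs, punctuation, extra whitespace and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # splitting also drops any extra whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("Apple and apple! are the SAME word http://example.com"))
```

After cleaning, 'Apple' and 'apple!' both survive as the single token 'apple', while the URL and the stop words are gone.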
Step 5: Build a term-document matrix
It is a matrix that contains the detailed counts and frequencies of the words, i.e. how many times each word has been used across the whole data set.
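A minimal, dependency-free sketch of building such a matrix (the three documents below are illustrative):

```python
from collections import Counter

docs = ["the movie was good", "the movie was bad", "good good plot"]

# Vocabulary: every distinct term across the corpus, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: one row per document, one column per term,
# each entry counting how often the term occurs in that document.
tdm = []
for doc in docs:
    counts = Counter(doc.split())
    tdm.append([counts[term] for term in vocab])

print(vocab)
for row in tdm:
    print(row)
```

In practice a library vectorizer builds this matrix directly, but the structure is the same: rows are documents, columns are terms, entries are frequencies.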
Step 6: Generate the Word cloud
The order of the words is completely random, but the size of each word is directly proportional to its frequency of occurrence in the text files.
The diagram directly helps us identify the most frequently used words in the text files.
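The word cloud is driven by plain term frequencies; a minimal sketch of the frequency table that would size the words (toy text; rendering the image itself would need a plotting package, omitted here):

```python
from collections import Counter

# Toy corpus text; in practice this comes from the cleaned data set.
text = "data data data mining opinion opinion analysis"
freq = Counter(text.split())

# The most frequent words would be drawn largest in the cloud.
for word, count in freq.most_common(3):
    print(word, count)
```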
V.2 SENTIMENT ANALYSIS
The polarity score is not always very accurate. It sometimes misses the overall context of the tweet because it focuses on individual words; for example, a word like 'ohhh' can be used either positively or negatively.
So let us start applying sentiment analysis in R on the #instagram logo data extracted earlier. First, we need to download the positive and negative word text files and load them into the R console.
Step 1: Scan the words into R
Step 2: Add your own words to the positive and negative word lists if required
Once we have the data in place, we need to decide how to define the test data: we can either convert it into positive and negative words or simply use the AFINN framework.
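The positive/negative word-list approach can be sketched in Python as follows (the word lists here are tiny illustrative stand-ins for the downloaded files):

```python
def score_sentiment(texts, pos_words, neg_words):
    """Score each text as (# positive matches) - (# negative matches)."""
    scores = []
    for text in texts:
        tokens = text.lower().split()
        score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
        scores.append(score)
    return scores

pos_words = {"good", "great", "love"}
neg_words = {"bad", "hate", "poor"}
tweets = ["I love this great app", "bad update, I hate it"]
print(score_sentiment(tweets, pos_words, neg_words))
```

A positive total means the text leans positive, a negative total means it leans negative, and zero is neutral; an AFINN-style lexicon would instead weight each word by its own score.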
Step 3: Apply the sentiment function to the tweets
result <- score.sentiment(test$text, pos.words, neg.words)

Step 4: Summary of the scores
Step 5: Histogram of the scores

We performed sentiment analysis of a linear type with positive and negative mood, and chose logistic regression for the analysis. Logistic regression is used in the case of binary dependent variables such as pass/fail or positive/negative. First we took a training dataset with sentiment tags and a testing dataset without tags.

Loading data
Data can be read from a CSV or Excel file into dataframes provided by the pandas library.

Cleaning the data and creating a document-term matrix
First we need to convert all letters to lowercase. The raw data contains a lot of words and symbols we are not interested in, and these must be removed before further analysis: punctuation, numbers, URLs and stop words. The information value of stop words is near zero because they are so common in a language. This process also includes stemming, which reduces words to their lexical roots. We used regular expressions to remove URLs and numbers. The Python library nltk provides many stemmers; we used PorterStemmer. We also used the word_tokenize function from nltk to convert documents into tokens. This whole process of data cleaning and tokenization is done through the class sklearn.feature_extraction.text.CountVectorizer in the scikit-learn Python library: our tokenization function tokenize, the stop-word list 'english' and the number of terms required in the vector are passed as parameters to the CountVectorizer constructor. Using the fit_transform method of this class we transformed our corpus into feature vectors, and using the toarray function we converted the feature vectors into a 2-D numpy array.
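The vectorization and logistic-regression steps described here can be condensed into a short runnable sketch (the toy corpus and labels below are illustrative stand-ins; the nltk stemming and custom tokenizer are omitted for brevity):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy labelled corpus (1 = positive, 0 = negative); illustrative only.
texts = ["good movie", "great plot", "loved it", "good acting",
         "bad movie", "poor plot", "hated it", "bad acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features: word order is discarded, counts are kept.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 15% for evaluation, leaving 85% for training as in the text.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, labels, test_size=0.15, random_state=0)

# Train logistic regression on the training split.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate: predict the held-out split and report precision/recall/f1.
pred = clf.predict(X_eval)
report = classification_report(y_eval, pred)
print(report)
```

With a real corpus, the custom tokenize function and stemmer described above would be supplied through CountVectorizer's tokenizer parameter, and the classifier would finally be retrained on the full training set before predicting the test data.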
Building training and evaluation datasets
We need to create a separate evaluation set from our original training data if we want to evaluate our classifier. We used the train_test_split method from the sklearn.model_selection Python module, which splits arrays or matrices into random train and test subsets. The new training set is 85% of the initial training dataset, as specified in the train_test_split call.

Training the classifier
We used the bag-of-words model for the classifier. In this model a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity; the frequency of each word is used as a feature for training the classifier. The scikit-learn library in Python provides the LogisticRegression class to implement logistic regression, and the training set is passed as a parameter to the fit method of this class.

Evaluation
Now that our classifier is trained, we use the evaluation set and predict its sentiment with the predict method of the LogisticRegression class. To check the classifier's precision we used the classification_report function from sklearn.metrics, which returns several types of scores such as precision, recall and f1-score. We then retrained the classifier on the whole training dataset.

Predicting sentiment
The trained classifier is now ready to predict positive and negative sentiment on the test dataset.

VI. Results
In the initial testing and development of the universal data set, a data set of 1000 lines was formed and was tested on different test data sets; those test data sets were also tested with their own unique training sets. The test results show that the universal data set is 60 percent accurate, while the unique data sets are 91.7 percent accurate. The gap is big, but as the universal data set keeps growing and more fundamental characteristics are added, the gap will start to close. The project needs more testing.

VII. Conclusion
No matter how much has been done, there are always new results in the consecutive iterations.
The above-said model has been developed in such a way that it can deal with most classification algorithms with ease; the different algorithms will give the same results on each run. The training data set, which plays the most crucial role in the model, has been developed in such a way that any test data can be trained and analyzed from it. However, it is up to the user to check the efficiency on the test data once the results with the universal data set have been obtained. If the internet is searched, a lot of study material is available about this concept; however, every such study hovers around the fact that the written code serves only a single purpose. The code developed in this paper aims at being general rather than specific, which means that any person can mine and find the respective opinionated ideologies he is looking for. Also, the training dataset that is common to all will be continuously changing and upgrading itself. In the future, the data set can be linked to the web and the analysis done over the web, in which case there will be no need to determine the corresponding test data for the particular dataset.

References
[1] Wei Jiang, "Study on Identification of Subjective Sentences in Product Reviews Based on Weakly Supervised Topic Model", Journal of Software, Vol. 9, No. 7, July 2014, pp. 1952-1959.
[2] Ellen Riloff and Janyce Wiebe, "Learning Extraction Patterns for Subjective Expressions", in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), pp. 105-112.
[3] V.K. Singh, R. Piryani, A. Uddin, P. Waila, "Sentiment Analysis of Movie Reviews: A New Feature-based Heuristic for Aspect-level Sentiment Classification", IEEE 2013, pp. 712-717.