Opinion mining is a technique for capturing the voice of the masses, that is, whether people are in support of or against a particular thing. Opinion mining has been around for a long time, and various techniques have been developed to mine and analyse data. Sometimes simple algorithms suffice, whereas complex conditions call for more elaborate ones. Opinion mining not only helps in deciding about a particular topic but also helps in reaching conclusions where no concrete evidence is available. Data grows with every step: wherever we go, it keeps being generated, whether by the work we carry out or by automated processes running in the background.
Social networking data is growing immensely day by day. This data can readily be processed and analysed to obtain results for many purposes. For example, tweets posted on Twitter can be collected into a database and then analysed to draw a common perception of users worldwide about a particular thing or topic. Similarly, other types of data that exist in different domains and sites can be retrieved, pre-processed and then used in different analysis practices. However, inferring a perception from a given or manually created data set is not that easy. To begin with, you have to make sure that every inference you draw from the data is valid for all the entities related to that data set. Opinion mining and its techniques are very common, yet not fully explored. The most important factor is the way opinion mining methodologies are designed. It matters how the opinions are extracted and which algorithms are used, because the integration of the algorithm with the data set and the training set is very important in determining the efficiency as well as in analysing different aspects of the test data. As the Internet of Things approaches faster than expected, analysing and deriving conclusions from varied test cases can help us obtain valuable insights and determine many things that might not otherwise be readily available to us in any form.
There are two main components to consider in opinion mining.
1) Test data: This is the data to which the opinion mining techniques are applied in order to infer something and arrive at a particular conclusion. There are many ways in which the test data can be formed. In many cases the data is collected in real time with the help of different gadgets and IoT wearable technology.
In other cases the test data is already available and is taken from different sites or repositories to be analysed and searched.
2) Training data: This data set has attributes that help us analyse the test data. In simple terms, the test data is analysed by virtue of the qualities of the training data. It is very important that the training data is well organised and surveyed, because all the attributes of the training data are automatically passed on to the test data once the classification algorithm comes into action.
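The relationship between the two components can be illustrated with a minimal sketch. It is written in Python rather than R, uses only the standard library, and assumes a simple Naïve Bayes classifier with Laplace smoothing; the function names and the four training sentences are hypothetical and serve only to show how labels learned from the training data are passed on to the test data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(training_data):
    """Learn word frequencies per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocabulary = set()
    for text, label in training_data:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: its labels are what gets passed on to the test data.
training_data = [
    ("i love this phone", "positive"),
    ("great camera and great battery", "positive"),
    ("i hate this phone", "negative"),
    ("terrible battery and poor camera", "negative"),
]
# Hypothetical, unlabelled test data.
test_data = ["love the great camera", "poor battery i hate it"]

model = train_naive_bayes(training_data)
predictions = [classify(text, model) for text in test_data]
print(predictions)
```

The test sentences can only ever receive labels that occur in the training data, which is why a poorly chosen training set distorts the analysis regardless of the classifier.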
II. Literature Survey
Gathering data from online sources with the help of tools that mine the text and present it in a database format is termed text mining. Text mining can be done with many different readily available tools. Today, extracting tweets from Twitter and generating a database of reviews from the comment section of YouTube are both part of text mining. Once the text has been mined, it is important to analyse the data. It is crucial to set the parameters that decide how the data will be summarised and analysed. Mostly this data is segregated on common terms, which is generally done with the help of sentiment analysis. Sentiment analysis sets the boundaries for how the text is classified in a broader sense.
Social media is currently the platform where the use of text mining and sentiment analysis is most visible. Marketing firms use social media analysis to recover expenses, earn revenue and get near-perfect results. However, when examined on a broader scale, these results do not fulfil the criteria for which they were analysed. The paper “Learning Extraction Patterns for Subjective Expressions” [2] uses the basic Naïve Bayes classifier to carry out sentiment analysis.
The paper “Sentiment Analysis of Movie Reviews: A New Feature-based Heuristic for Aspect-level Sentiment Classification” [3] describes how text mining and web scraping are done with the help of different APIs. It shows how basic web scraping should be done and how the collected data should then be analysed.
As the analytics industry grows, the use of text mining and analysis has risen steeply. However, most research focuses on how the choice among the available data classification algorithms affects the efficiency as well as the analysis of the data.
The choice of algorithm depends not only on the type of test data or on the training data used, but also on constraints such as accuracy and time efficiency. It is important that algorithms are chosen not only for their suitability for classification but also for their efficiency.
The training data set plays a crucial role in determining what has to be analysed and how. Whatever analysis we want to draw, we have to obtain it from the test data using the same training data set. A change in the classification algorithm will not affect the results of the analysis, but it can certainly change the efficiency. If the training data is not selected correctly, however, the analysis can be wrong no matter how efficient and capable the classification algorithm is.
Three basic and useful machine learning algorithms are used extensively in text mining and classification, among them Naïve Bayes and the Support Vector Machine (SVM). All of these algorithms are efficient solutions, but their maximum efficiency changes with the entropy of the data.
The methodology has been designed so that the characteristics of the database can be evaluated. The idea is to obtain a universal training data set that can be used with any test data, no matter what object or topic it reflects. The training data will be pulled like any other database, cleaned and made specific to the point. Various additions will then be made to it. Once the training data is believed to be complete, a random test data set will be picked, a training data set specific to that test data will be made, and the efficiency obtained with the two training sets will be compared. This has to be repeated over and over until the efficiency for the test data matches for both training sets.
IV. Characteristics of Database
It is crucial to determine which database is used for analysis and which is used for training.
The training database determines the way the test data will be analysed. If the results vary with every repetition, it means that the training data set is being used differently in every iteration. A training data set is always available to us, but it is a data set that is unique to the test data. To obtain an overview as well as unique results, it is important to develop a training data set that is universal for all test data and can easily be used to obtain results from different data sets.
The training data set will then be compared with other data sets.
The whole idea of the research is to develop a data set that can be used universally as the training data set. It is important that the training data is well defined and applicable to most data sets. To that end, a training data set will be built with some key characteristics. It will be assembled from different repositories as well as from text mining. Once the collection is complete, the characteristics of the data will be examined. Random test data sets will be picked and the newly developed training data set will be tested against them using very basic classification algorithms. Each test data set is also checked with its own training set. Once the results do not differ much and are stable for all the test data sets, the universal data set can be used in projects. The other basic steps are as follows.
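The comparison described above can be sketched as follows. This is a minimal Python illustration under several assumptions: a very basic word-vote classifier stands in for the "very basic classification algorithms", accuracy is used as the measure of whether results "differ much", and the `tolerance` threshold, function names and sample sentences are all hypothetical.

```python
from collections import Counter, defaultdict

def train(training_set):
    """Count, for every word, how often it appears under each label."""
    model = defaultdict(Counter)
    for text, label in training_set:
        for word in text.lower().split():
            model[word][label] += 1
    return model

def predict(text, model, default="neutral"):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in text.lower().split():
        if word in model:
            votes.update(model[word])
    if not votes:
        return default
    # Highest vote wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda label: (-votes[label], label))

def accuracy(training_set, test_set):
    """Fraction of labelled test items classified correctly."""
    model = train(training_set)
    correct = sum(predict(text, model) == label for text, label in test_set)
    return correct / len(test_set)

def is_stable(universal_set, datasets, tolerance=0.05):
    """datasets: list of (own_training_set, test_set) pairs.
    True when the universal set performs within `tolerance` of each
    data set's own training set (threshold is an assumed value)."""
    for own_training_set, test_set in datasets:
        gap = abs(accuracy(universal_set, test_set)
                  - accuracy(own_training_set, test_set))
        if gap > tolerance:
            return False
    return True

# Hypothetical candidate for the universal training set.
universal_set = [
    ("good great excellent love", "positive"),
    ("bad poor terrible hate", "negative"),
]
# Hypothetical topic-specific training set and labelled test set.
movie_own = [
    ("great plot love the acting", "positive"),
    ("terrible plot bad acting", "negative"),
]
movie_test = [
    ("love the great acting", "positive"),
    ("bad terrible film", "negative"),
]

print(accuracy(universal_set, movie_test))
print(is_stable(universal_set, [(movie_own, movie_test)]))
```

In practice the loop would run over many randomly picked test data sets, and the universal set would be extended and re-tested whenever `is_stable` returns `False`.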
V.1 DATA CLEANING
This is the most important step in the entire data cleaning process. Here, we identify the keywords that build up the meaning of the sentence.
Transform text to lower case.
Remove stop words.
Stop words are just common words that we may not be interested in. Looking at the result of stopwords("english") shows what is being removed.
Remove URLs from the text.
So, what have we just done?
We’ve transformed every word to lower case so that ‘Apple’ and ‘apple’ now count as the same word. We’ve removed all punctuation — ‘apple’ and ‘apple!’ will now be the same. We stripped out any extra whitespace and we removed stop words and URLs.
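The cleaning steps summarised above can be sketched in a few lines. The example is in Python with the standard library only, not the R workflow the text refers to; the stop word list is a small illustrative stand-in rather than the list returned by stopwords("english"), and the function name `clean_text` is hypothetical.

```python
import re
import string

# Small illustrative stop word list (an assumption, not the full English list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "this", "it"}

# Matches http(s) links and bare www links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text):
    """Lower-case, strip URLs, punctuation, extra whitespace and stop words."""
    text = text.lower()
    # URLs are removed before punctuation so that they are still recognisable.
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no argument also collapses runs of whitespace.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)

cleaned = clean_text("The Apple is GREAT!   See https://example.com/apple and enjoy.")
print(cleaned)
```

After cleaning, ‘Apple’, ‘apple’ and ‘apple!’ all map to the same token, which is what makes the later frequency counts meaningful.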
Step 5: Build a term-document matrix
The term-document matrix records how frequently each word is used across the whole data set, which lets the words be ordered by how many times they occur.
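As a rough illustration of this step, the sketch below builds a small term-document matrix in Python using only the standard library. It assumes the documents have already been cleaned and are whitespace-separated; the function names and the three sample documents are hypothetical.

```python
from collections import Counter

def term_document_matrix(documents):
    """Map each term to a list of its counts, one entry per document."""
    counts = [Counter(doc.split()) for doc in documents]
    terms = sorted(set().union(*counts))
    return {term: [count[term] for count in counts] for term in terms}

def term_frequencies(matrix):
    """Total count per term, most frequent first (ties in alphabetical order)."""
    totals = {term: sum(row) for term, row in matrix.items()}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical, already cleaned documents.
docs = ["apple great see enjoy", "apple pie great", "apple"]
tdm = term_document_matrix(docs)
ranked = term_frequencies(tdm)

print(tdm["apple"])
print(ranked[:2])
```

Each row of the matrix corresponds to a term and each column to a document, so summing a row gives the overall frequency used for ranking.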
Step 6: Generate the Word cloud
The order of the words is completely random, but the size of each word is directly proportional to the frequency of occurrence of the word in the text files.
The diagram directly helps us identify the most frequently used words in the text files.
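The proportional sizing can be made concrete with a short sketch. It is written in Python and only computes font sizes; it does not draw the cloud. The function name `font_sizes` and the `max_size` value are hypothetical choices for illustration.

```python
def font_sizes(frequencies, max_size=48.0):
    """Scale each word's font size in direct proportion to its frequency."""
    if not frequencies:
        return {}
    highest = max(frequencies.values())
    return {word: max_size * count / highest
            for word, count in frequencies.items()}

# Hypothetical word frequencies taken from a term-document matrix.
sizes = font_sizes({"apple": 4, "great": 2, "enjoy": 1})
print(sizes)
```

A word that occurs half as often as the most frequent word is drawn at half its size, so relative frequencies can be read directly from the cloud.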
V.2 SENTIMENT ANALYSIS
The polarity score is not always very accurate. It sometimes misses out on the overall context of the tweet because it focuses on individual words. Sometimes words like ‘ohhh’ can be used as positive or negative.
So let’s start applying sentiment analysis in R to the #instagram logo data extracted in the previous post. First, we need to download the positive and negative words text files and load them into the R console.
Step 1: Scan the words into R
Step 2: Optionally, add your own words to the positive and negative word lists
Once the data is in place, we need to decide how to define the test data: we can either convert it into positive and negative words or simply use the AFINN lexicon.
Step 3: Apply the sentiment function to the tweets
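The three steps can be sketched end to end as follows. The example is in Python rather than R and assumes a simple polarity score, defined here as the number of positive words minus the number of negative words. The in-memory word lists stand in for the downloaded text files, and the function names, extra words and sample tweets are hypothetical.

```python
import re

def load_words(lines):
    """Step 1: read one word per line, ignoring blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}

# Stand-ins for the contents of the positive and negative word files.
positive_words = load_words(["good", "great", "love", "happy"])
negative_words = load_words(["bad", "ugly", "hate", "awful"])

# Step 2: optionally extend the lists with domain-specific words.
positive_words |= {"upgrade"}
negative_words |= {"fail"}

def sentiment_score(text, positive, negative):
    """Step 3: polarity = positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in positive for word in words)
    negative_hits = sum(word in negative for word in words)
    return positive_hits - negative_hits

# Hypothetical tweets about the logo change.
tweets = [
    "Love the new logo, great upgrade!",
    "The new logo is ugly. Epic fail",
    "ohhh the new logo",
]
scores = [sentiment_score(tweet, positive_words, negative_words) for tweet in tweets]
print(scores)
```

The last tweet scores zero because ‘ohhh’ appears in neither list, which illustrates why a word-level polarity score can miss the overall context of a tweet.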