Text Analytics on Amazon Reviews

By Akanksha Goel

A wide pool of unstructured text data comes from sources such as social media, reviews and feedback from customers, written documents converted into text, and survey reports. This raises several questions.

Questions

How can we get insights into the customer's perspective from this huge corpus of unstructured data?
How can we understand the different opinions of different customers and find common ground?
How can we classify text into diverse categories?
How can we generate visuals that give a clear picture of this unordered data?

Text Analytics is the way!!!

What is Text Analytics?

The process of generating inferences and patterns from a pool of unstructured data is referred to as Text Analytics. Text Analytics, also termed text mining, includes various Natural Language Processing techniques such as

Text summarisation,
Text clustering,
Text classification,
Visualizations and
Opinion mining or Sentiment Analysis,

to understand the meaning in this huge dataset.

Sentiment Analysis is a major application of Text Analytics.

Sentiment Analysis and its Application

What is Sentiment Analysis?

Sentiment Analysis, also referred to as opinion mining, is the process of determining whether a piece of text is positive, negative or neutral, and to what extent. The text can take the form of feedback, reviews, survey results or tweets, each with different emotions associated with it. Sentiment Analysis helps to identify the overall view of the public on a particular brand or product. A polarity is associated with each word in a sentence.
For example:

Text	Positive polarity score	Negative polarity score	Neutral polarity score	Sentiment
Akanksha is Good	0.9	0.0	0.1	Positive
Akanksha is Bad	0.0	0.9	0.1	Negative
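
To make the idea of per-word polarity concrete, here is a minimal sketch of a lexicon-based scorer. It is a toy, not the VADER analyzer used later in this post: the lexicon `TOY_LEXICON`, its weights and the function `toy_polarity` are all made up for illustration, so the numbers it prints will not match the table above.

```python
# Toy lexicon: the words and weights below are invented for illustration only.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "poor": -1.0}


def toy_polarity(text):
    """Return the share of positive, negative and neutral words plus a label."""
    words = [w.strip(".,!?:;").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {"pos": 0.0, "neg": 0.0, "neu": 0.0, "sentiment": "neutral"}

    # Count words whose lexicon weight is positive or negative.
    pos = sum(1 for w in words if TOY_LEXICON.get(w, 0.0) > 0)
    neg = sum(1 for w in words if TOY_LEXICON.get(w, 0.0) < 0)
    neu = len(words) - pos - neg
    total = len(words)

    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"

    return {
        "pos": round(pos / total, 2),
        "neg": round(neg / total, 2),
        "neu": round(neu / total, 2),
        "sentiment": label,
    }


print(toy_polarity("Akanksha is Good"))
print(toy_polarity("Akanksha is Bad"))
```

A real analyzer uses a far larger lexicon and also handles negation, intensifiers and punctuation, which this sketch ignores.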

Application of Sentiment Analysis

There are many products on the market with the same configuration.
For example, when buying a mobile phone there are many choices, and even the sets of features are the same.
How do you know which one to buy?

In this post, sentiment analysis is applied to Amazon reviews of mobile phones to find out which product comes out best.

Implementation in Python

The following four steps carry out an in-depth analysis of the products and identify the best one.

Loading the Dataset
Preprocessing of the Dataset
Sentiment Analysis
Data Visualisation

Requirements

Tool Requirements

Download Anaconda (https://www.anaconda.com/downloads)
Or
Install Python 3 directly (https://www.python.org/downloads/)
Create a virtual environment to switch between Python versions when required, and use conda to install packages from the Anaconda Prompt (https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/)
Or
Use pip to install all packages in Python (https://pip.pypa.io/en/stable/installing/)

Package Requirements:

1. Install the Python Requests package (http://docs.python-requests.org/en/master/user/install/) using the command:

conda install requests
or
pip install requests

Similarly, install:
2. python-dateutil (https://pypi.org/project/python-dateutil/)
3. Python lxml package (https://lxml.de/installation.html)
4. Python NLTK package and NLTK data (https://www.nltk.org/data.html)
5. Python Matplotlib package (https://matplotlib.org/users/installing.html)
6. Python pandas package (https://pandas.pydata.org/)
7. Python NumPy package
8. Python wordcloud package
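
Before running the rest of the post, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the list `REQUIRED_MODULES` are my own and not part of any of these packages.

```python
import importlib.util

# Import names for the packages listed above. The import name can differ
# from the install name (python-dateutil is imported as "dateutil").
REQUIRED_MODULES = [
    "requests", "dateutil", "lxml", "nltk",
    "matplotlib", "pandas", "numpy", "wordcloud",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as missing can then be installed with conda or pip as described above.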

#Importing all the packages
import json
import re
import string
import warnings
from subprocess import check_output
from time import sleep

import requests
from dateutil import parser as dateparser
from lxml import html
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
#Jupyter/IPython magic: remove the next line when running as a plain script
%matplotlib inline
warnings.filterwarnings('ignore')

Loading the Dataset

In this post I scrape reviews of mobile phones from the Amazon website using Python's requests and lxml libraries.

def ParseReviews(asin,Number_of_Pages):
    ratings_dict = {}
    reviews_list = [] 

    # Number_of_Pages=4# You can iterate for as many number of Pages
    for i in range(1,Number_of_Pages+1):
        amazon_url='https://www.amazon.in/product-reviews/'+asin+'/ref=cm_cr_othr_d_paging_btm_'+str(i)+'?pageNumber='+str(i)
        page = requests.get(amazon_url)
        page_response = page.text

        parser = html.fromstring(page_response)
        XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
        XPATH_REVIEW_SECTION_1 = '//div[contains(@id,"reviews-summary")]'
        XPATH_REVIEW_SECTION_2 = '//div[@data-hook="review"]'

        XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
        XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
        XPATH_PRODUCT_PRICE  = '//span[@id="priceblock_ourprice"]/text()'
        raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
        product_price = ''.join(raw_product_price).replace(',','')

        raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)

        reviews = parser.xpath(XPATH_REVIEW_SECTION_1)
        if not reviews:
            reviews = parser.xpath(XPATH_REVIEW_SECTION_2)


        if not reviews:
            raise ValueError('unable to find reviews in page')

        #Parsing individual reviews
        for review in reviews:
            XPATH_REVIEW_TEXT_1 = './/div//span[@data-hook="review-body"]//text()'
            XPATH_REVIEW_TEXT_2 = './/div//span[@data-action="columnbalancing-showfullreview"]/@data-columnbalancing-showfullreview'
            XPATH_REVIEW_TEXT_3  = './/div[contains(@id,"dpReviews")]/div/text()'

            raw_review_text1 = review.xpath(XPATH_REVIEW_TEXT_1)
            raw_review_text2 = review.xpath(XPATH_REVIEW_TEXT_2)
            raw_review_text3 = review.xpath(XPATH_REVIEW_TEXT_3)
            review_text = ' '.join(' '.join(raw_review_text1).split())
            if raw_review_text2:
                json_loaded_review_data = json.loads(raw_review_text2[0])
                json_loaded_review_data_text = json_loaded_review_data['rest']
                cleaned_json_loaded_review_data_text = re.sub('<.*?>','',json_loaded_review_data_text)
                full_review_text = review_text+cleaned_json_loaded_review_data_text
            else:
                full_review_text = review_text
            if not raw_review_text1:
                full_review_text = ' '.join(' '.join(raw_review_text3).split())


            review_dict = {'review_text':full_review_text,}
            reviews_list.append(review_dict)
    return reviews_list

def ReadAsin(pages,Asin_List):
    Final_data = []
    review_list={}
    n=len(Asin_List)
    for asin in Asin_List:
        print("Processing Reviews for "+str(pages)+" Pages of Product of  Asin Number:"+asin)


        Final_data=(ParseReviews(asin,pages))
        sample=pd.DataFrame(Final_data)
        for i in sample:
            # review_list.append(asin)
            review_list[asin]=[]
            review_list[asin].append(sample[i])

        sleep(5)
    return review_list

Redmi Note 5 Pro (64GB) and Samsung Galaxy J8 (Blue, 64GB) are two mobile phones with the same set of features,
so how do you know which one to buy?
To compare them, reviews of both phones are parsed from the Amazon website.

Redmi Note 5 Pro (64GB) vs Samsung Galaxy J8 (Blue, 64GB)

To compare these mobile phones, you need to find their ASINs. An ASIN (Amazon Standard Identification Number) is a unique identifier that Amazon assigns to each product.
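
As a rough illustration of how an ASIN maps to review pages, the sketch below builds one URL per page. It follows the URL pattern used in this post and assumes 10 reviews per page, as the post does. The ten-character check in `looks_like_asin` is an assumed format rather than an official validation rule, and both function names are hypothetical.

```python
import math
import re

REVIEWS_PER_PAGE = 10  # assumption taken from this post: one page holds 10 reviews


def looks_like_asin(value):
    """Loose format check: ten upper-case letters or digits (assumed format)."""
    return re.fullmatch(r"[A-Z0-9]{10}", value) is not None


def review_page_urls(asin, number_of_reviews):
    """Build one review-page URL per page needed to cover number_of_reviews."""
    if not looks_like_asin(asin):
        raise ValueError("not a plausible ASIN: " + repr(asin))
    pages = math.ceil(number_of_reviews / REVIEWS_PER_PAGE)
    base = "https://www.amazon.in/product-reviews/" + asin
    return [
        base + "/ref=cm_cr_othr_d_paging_btm_" + str(page) + "?pageNumber=" + str(page)
        for page in range(1, pages + 1)
    ]


for url in review_page_urls("B07CJYMDNQ", 30):
    print(url)
```

Rounding up with `math.ceil` means a partial last page is still requested, for example 25 reviews need 3 pages.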

Fetching list of Amazon reviews

#Fix number of Pages you want
#Number of Pages=round(Number_of_reviews/10)
#Each page signifies 10 reviews
pages=3 #30 reviews

#Add list of unique Asin Number to be compared
#B07CJYMDNQ: Redmi Note 5 Pro (64GB) Vs B07DZZKBBL: Samsung Galaxy J8 (Blue, 64GB)
Asin_List = ['B07CJYMDNQ','B07DZZKBBL']
Mobile_names={'B07CJYMDNQ': 'Redmi Note 5 Pro (64GB)','B07DZZKBBL': 'Samsung Galaxy J8 (Blue, 64GB)'}
#Gives list of reviews for Asin listed.
review_list=ReadAsin(pages,Asin_List)
# print(review_list) #Shows list of reviews 
dataset=pd.DataFrame(review_list)
dataset.head()

Processing Reviews for 3 Pages of Product of  Asin Number:B07CJYMDNQ
Processing Reviews for 3 Pages of Product of  Asin Number:B07DZZKBBL

	B07CJYMDNQ	B07DZZKBBL
0	0 They are not giving any accessories but ...	0 Super camera having live focus and varie...

Preprocessing of the Dataset

In any text mining problem, text cleaning is the first step: we remove from the document those words that may not contribute to the information we want to extract. Text may contain many undesirable elements such as punctuation marks, stop words and digits, which may not be helpful. Some words also share the same sense. For example, 'Friendly' and 'Friendliness' are both reduced to the single stem 'friendli'.

The following code is used to filter the text.

def filter_sentence(text):
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    word_tokens = word_tokenize(text) 
    filtered_sentence = [ps.stem(w) for w in word_tokens if not w in stop_words] # removing useless words
    sentence=''
    for word in filtered_sentence:
        sentence=sentence+word+' '
    exclude = set(string.punctuation)
    s = ''.join(ch for ch in sentence if ch not in exclude)
    s= ''.join([i for i in s if not i.isdigit()])
    return s
print(filter_sentence('Hey! There are 20 good @ people.'))
print(filter_sentence('Friendly Friendly Friendliness'))

hey  there  good  peopl  
friendli friendli friendli
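
In the first output line above, "there" survives the filter. This appears to be because the stop-word lookup is case-sensitive and the input word is capitalised as "There". The sketch below shows one way to avoid that by lower-casing before the lookup. It uses a tiny hand-picked stop list (`TOY_STOP_WORDS`, an assumption for illustration) instead of the NLTK list, and it does not stem.

```python
import string

# Tiny hand-picked stop list for illustration; NLTK's English list is much longer.
TOY_STOP_WORDS = {"there", "are", "is", "the", "a"}


def clean_tokens(text, stop_words=TOY_STOP_WORDS):
    """Lower-case, strip punctuation and digits, then drop stop words."""
    drop = string.punctuation + string.digits
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return [w for w in cleaned.split() if w not in stop_words]


print(clean_tokens("Hey! There are 20 good @ people."))
# → ['hey', 'good', 'people']
```

The same effect could be had in `filter_sentence` by comparing `w.lower()` against the stop-word set.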

Sentiment Analysis using python

This is an extension of the approach from here.
Here the positive, negative and neutral scores of each review are compared with the average positive, negative and neutral polarity scores respectively.
If the positive polarity of a review is greater than the average positive polarity of the complete set of reviews, the review is labelled positive. Negativity and neutrality are checked in the same way.
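
The rule described above can be sketched on a handful of made-up scores. The sketch assumes that when more than one score exceeds its average, the check made last wins (negative, then neutral, then positive), and that a review defaults to neutral when nothing exceeds its average. The function name `label_by_average` and the example scores are illustrative only.

```python
def label_by_average(scores):
    """Label each review by comparing its scores with the set-wide averages."""
    n = len(scores)
    avg = {key: sum(s[key] for s in scores) / n for key in ("neg", "neu", "pos")}

    labels = []
    for s in scores:
        label = "neutral"  # default when nothing exceeds its average
        # Checked in this order so that a later match overrides an earlier one.
        for key, name in (("neg", "negative"), ("neu", "neutral"), ("pos", "positive")):
            if s[key] > avg[key]:
                label = name
        labels.append(label)
    return avg, labels


# Made-up polarity scores for three reviews (not taken from the scraped data).
example = [
    {"pos": 0.6, "neg": 0.0, "neu": 0.4},
    {"pos": 0.0, "neg": 0.5, "neu": 0.5},
    {"pos": 0.1, "neg": 0.1, "neu": 0.8},
]

averages, labels = label_by_average(example)
print(labels)
# → ['positive', 'negative', 'neutral']
```

Because the thresholds are averages of the same review set, the labels are relative: a review counts as positive when it is more positive than is typical for that product.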

def Sentiment_Analysis(review_list,Asin_List):
    Dataset={}
    for asin in Asin_List:
        print("Processing Sentiments for: "+Mobile_names[asin])

        Dataset[asin]=[]
        total_p=0
        total_n=0
        total_neu=0
        count=len(review_list[asin][0])
        sid = SentimentIntensityAnalyzer() #Create the analyzer once per product rather than once per review
        for i in range(count):
            text=(review_list[asin][0][i])
            sentence=filter_sentence(text)
            ss = sid.polarity_scores(sentence)
            for k in sorted(ss):
                if k=='pos':
                    total_p=total_p+ss[k]
                if k=='neg':
                    total_n=total_n+ss[k]
                if k=='neu':
                    total_neu=total_neu+ss[k]
        Avg_p=total_p/count
        Avg_n=total_n/count
        Avg_neu=total_neu/count
        print("Average Positive Polarity:"+str(round(Avg_p,2))) #Average positive polarity score
        print("Average Negative Polarity:"+str(round(Avg_n,2))) #Average Negative Polarity
        print("Average Neutral Polarity:"+str(round(Avg_neu,2)))#Average Neutral Polarity

        #Check polarity with the help of average polarity
        for i in range(len(review_list[asin][0])):
            text=(review_list[asin][0][i])
            sentence=filter_sentence(text)
            line={}
            line['text']=text
            line['sentiment']=''
            ss = sid.polarity_scores(sentence)
            for k in sorted(ss):
                if k=='pos':
                    line['positivity']=ss[k]
                    if ss[k]>Avg_p:
                        line['sentiment']='positive'#Appending sentiment to each sentence
                elif k=='neg':
                    line['negativity']=ss[k]
                    if ss[k]>Avg_n:
                        line['sentiment']='negative'#Appending sentiment to each sentence
                elif k=='neu':
                    line['neutrality']=ss[k]
                    if ss[k]>Avg_neu:
                        line['sentiment']='neutral'#Appending sentiment to each sentence
                if line['sentiment']=='':
                    line['sentiment']='neutral' #Default label when no score has exceeded its average so far

            Dataset[asin].append(line)


    return Dataset
Dataset_After_Sentiment_Analysis=Sentiment_Analysis(review_list,Asin_List)
dataset=pd.DataFrame(Dataset_After_Sentiment_Analysis)
dataset.head()

Processing Sentiments for: Redmi Note 5 Pro (64GB)
Average Positive Polarity:0.18
Average Negative Polarity:0.12
Average Neutral Polarity:0.69
Processing Sentiments for: Samsung Galaxy J8 (Blue, 64GB)
Average Positive Polarity:0.4
Average Negative Polarity:0.02
Average Neutral Polarity:0.58

	B07CJYMDNQ	B07DZZKBBL
0	{'text': 'They are not giving any accessories ...	{'text': 'Super camera having live focus and v...
1	{'text': 'I ordered phone in gold color and Am...	{'text': 'I am using this phone from last 8 da...
2	{'text': 'All the accessories we're not receiv...	{'text': 'It is in amazing condition, package ...
3	{'text': 'On flipkar this phone available in 1...	{'text': 'Mid range best mobile with all premi...
4	{'text': 'I didn't got my accessories 3 in ada...	{'text': 'good phone', 'sentiment': 'positive'...

Text Visualizations

Visualizations give clear inferences that cannot easily be seen with the naked eye, especially when the dataset is large and relations and comparisons are hard to spot.
In this post, word clouds are used to give an idea of the prominent words used by customers. The more frequent a word, the larger it appears in the cloud.
A bar chart is used to compare the number of positive, negative and neutral comments, and another to compare the average positive, negative and neutral polarities of each mobile phone.
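
The word cloud sizes each word by how often it occurs. As a rough sketch of that counting step alone (the drawing is left to the wordcloud package), the frequencies can be computed with the standard library. The sample text and the helper name `top_words` are made up for illustration.

```python
from collections import Counter


def top_words(text, n=3):
    """Count whitespace-separated words and return the n most frequent."""
    return Counter(text.lower().split()).most_common(n)


# Made-up cleaned review text, used only to illustrate the counting.
sample = "good camera good battery nice camera good display"
print(top_words(sample))
```

Here "good" occurs three times and "camera" twice, so those two words would be drawn largest in a cloud built from this text.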

def Final_visuals(Dataset,Asin_List):
    Final_list=[]
    Avg_final_list=[]
    for asin in Asin_List:

        k={}
        k['pos']=0
        k['neu']=0
        k['neg']=0
        k['Avg_p']=0
        k['Avg_n']=0
        k['Avg_neu']=0

        total_p=0
        total_n=0
        total_neu=0
        count=0
        Total_text=''

        for i in range(len(Dataset[asin])):
            text=(Dataset[asin][i]['text'])
            sentence=filter_sentence(text)
            Total_text=Total_text+sentence+' ' #Keep consecutive reviews separated by a space
            if(Dataset[asin][i]['sentiment']=='positive'):
                k['pos'] =k['pos']+1 #Number of positive comments
            if(Dataset[asin][i]['sentiment']=='negative'):
                k['neg'] =k['neg']+1 #Number of negative comments
            if(Dataset[asin][i]['sentiment']=='neutral'):
                k['neu'] =k['neu']+1 #Number of neutral comments
            total_p=total_p+Dataset[asin][i]['positivity']
            total_n=total_n+Dataset[asin][i]['negativity']
            total_neu=total_neu+Dataset[asin][i]['neutrality']
            count=count+1
        k['Avg_p']=round(total_p/count,2) #Average positive polarity score
        k['Avg_n']=round(total_n/count,2) #Average negative polarity score
        k['Avg_neu']=round(total_neu/count,2) #Average neutral polarity score
        k['total']=count #Total no of Reviews
        Final_list.append([k['pos'],k['neg'],k['neu']])
        Avg_final_list.append([k['Avg_p'],k['Avg_n'],k['Avg_neu']])
        print(Mobile_names[asin]+'-')
        print('Number of positive comments:'+str(k['pos']))
        print('Number of negative comments:'+str(k['neg']))
        print('Number of neutral comments:'+str(k['neu']))
        print('Average positive polarity score:'+str(k['Avg_p']))
        print('Average negative polarity score:'+str(k['Avg_n']))
        print('Average neutral polarity score:'+str(k['Avg_neu']))

        mpl.rcParams['font.size']=18              
        mpl.rcParams['savefig.dpi']=100             
        mpl.rcParams['figure.subplot.bottom']=.1 
        cloud_stopwords = set(STOPWORDS) #Separate name so the nltk stopwords import is not shadowed
        wordcloud = WordCloud(
                                  background_color='white',
                                  stopwords=cloud_stopwords,
                                  max_words=200,
                                  max_font_size=45, 
                                  random_state=38
                                 ).generate(Total_text)

        # print(wordcloud)
        print('Word Cloud of product with Asin:'+str(asin))
        plt.imshow(wordcloud)
        plt.title(Mobile_names[asin])
        plt.axis('off')
        plt.show()
    print('Bar chart showing number of reviews for each product')
    #Bar Plot comparing number of Positive,Negative and Neutral reviews
    obj = ('Positive', 'Negative', 'Neutral')
    index = np.arange(len(obj))
    j=0
    bar_width=0.35
    width = 0
    for i in Final_list:
        plt.bar(index+width,i,bar_width,align='center',alpha=0.5,label=Asin_List[j])
        width=width+bar_width
        j +=1
    plt.xticks(index, obj)
    plt.legend()
    plt.ylabel('Number of reviews')
    plt.tight_layout()
    plt.show()

    #Bar Plot comparing Average polarities
    print('Barchart showing Average polarity score of each product')
    obj = ('Positivity', 'Negativity', 'Neutrality')
    index = np.arange(len(obj))
    j=0
    bar_width=0.35
    width = 0
    for i in Avg_final_list:
        plt.bar(index+width,i,bar_width,align='center',alpha=0.5,label=Asin_List[j])
        width=width+bar_width
        j +=1
    plt.xticks(index, obj)
    plt.legend()

    plt.ylabel('Average polarity score')
    plt.tight_layout()
    plt.show()
Final_visuals(Dataset_After_Sentiment_Analysis,Asin_List)

Redmi Note 5 Pro (64GB)-
Number of positive comments:11
Number of negative comments:4
Number of neutral comments:15
Average positive polarity score:0.18
Average negative polarity score:0.12
Average neutral polarity score:0.69
Word Cloud of product with Asin:B07CJYMDNQ

[Figure: word cloud for Redmi Note 5 Pro (64GB)]

Samsung Galaxy J8 (Blue, 64GB)-
Number of positive comments:13
Number of negative comments:2
Number of neutral comments:15
Average positive polarity score:0.4
Average negative polarity score:0.02
Average neutral polarity score:0.58
Word Cloud of product with Asin:B07DZZKBBL

[Figure: word cloud for Samsung Galaxy J8 (Blue, 64GB)]

Bar chart showing number of reviews for each product

[Figure: bar chart of the number of positive, negative and neutral reviews for each product]

Barchart showing Average polarity score of each product

[Figure: bar chart of the average polarity score of each product]

Conclusion

It can be seen that positive words such as 'good', 'nice', 'awesom' and 'best' are more frequent for the Samsung mobile than for the Redmi mobile.
Samsung has more positive comments than Redmi (13 vs 11).
Samsung has fewer negative comments than Redmi (2 vs 4).

Therefore, on this sample of reviews, Samsung comes out ahead of Redmi.
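
The comparison can also be restated as shares of the 30 reviews per product, using the comment counts printed in the output above. The "net" figure (positive share minus negative share) is my own illustrative summary, not a metric computed elsewhere in this post.

```python
# Comment counts as reported in the output above: (positive, negative, neutral).
counts = {
    "Redmi Note 5 Pro (64GB)": (11, 4, 15),
    "Samsung Galaxy J8 (Blue, 64GB)": (13, 2, 15),
}


def summarize(pos, neg, neu):
    """Turn raw comment counts into shares of all comments."""
    total = pos + neg + neu
    return {
        "positive_share": round(pos / total, 2),
        "negative_share": round(neg / total, 2),
        "net": round((pos - neg) / total, 2),  # illustrative summary figure
    }


for name, (pos, neg, neu) in counts.items():
    print(name, summarize(pos, neg, neu))
```

With only 30 reviews per product, differences of this size should be read as indicative rather than conclusive.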

References

http://www.nltk.org/howto/sentiment.html
https://matplotlib.org/api/index.html
https://www.amazon.in/product-reviews/B07CJYMDNQ/ref=cm_cr_arp_d_paging_btm_3?pageNumber=1
https://www.amazon.in/product-reviews/B07DZZKBBL/ref=cm_cr_othr_d_paging_btm_1?pageNumber=1
https://www.scrapehero.com/how-to-scrape-amazon-product-reviews/
https://akanksha005.github.io/category/data-visualization.html
