Have you seen Jarvis in Iron Man and wanted an AI of your own? That is what draws many people into the world of ML and AI. One project will not get you there, but once your own computer can tell whether a statement is positive or negative, you are a step closer. Maybe someday this will help bring Jarvis to you.
This is a good project for practising Natural Language Processing and moving beyond basic ML libraries like pandas and NumPy on the path to becoming a Data Scientist.
In machine learning the problem is usually not the algorithm but the data. We will build our own dataset using web scraping, use it for training and testing, and then apply the model to review/comment analysis. We will use the nltk library for natural language processing, along with other libraries such as bs4 and sklearn. The only tedious part is the time taken for data extraction; otherwise the code is very simple to understand.
The following libraries should be installed in your Python environment:
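As a sketch, the packages used in this project can be installed with pip (package names assumed to match their PyPI names):

```shell
pip install requests beautifulsoup4 lxml pandas nltk scikit-learn
```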
We will complete it in three parts:
1) Data extraction using web scraping
2) Training and Testing of our model
3) Practically checking a comment
1. We need a large dataset, so we go where we can get one. I got the data from the review section of the iPhone 6 on the Flipkart site (direct link below):
2. This opens the first page of iPhone 6 reviews; we will first extract the data we need from it.
3. We need to understand what we are extracting: two things, the rating and the review text. We will treat a rating of 3 as neutral, below 3 as negative and above 3 as positive.
4. While we could extract the rating and the review individually, I am extracting them through a common parent element.
5. We use the div class under which both of them appear; here it is 'col _390CkK _1gY8H-'.
5. Import BeautifulSoup from bs4. Then, using the get method from requests, we fetch the whole page and save it in a variable:
response = requests.get("URL")
This requests the website and downloads its content using the get method.
6. After this we instantiate a soup object, which accepts what to parse and how to parse it:
soup = BeautifulSoup(response.text, 'lxml')
NOTE: if lxml produces an error, use 'html.parser'
7. Use the find_all method to find all occurrences of the class 'col _390CkK _1gY8H-' and put the data in a variable called content or reviews (whatever you want).
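Putting these steps together, here is a minimal sketch run on an inline HTML snippet instead of the live page (the inner class names hGSR34 and t-ZTKy for the rating and review are assumptions; inspect the actual page for the current ones):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one review card on the page
html = """
<div class="col _390CkK _1gY8H-">
  <div class="hGSR34">4</div>
  <div class="t-ZTKy">Great phone, battery lasts all day.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all matches the exact class string of each review card
reviews = soup.find_all("div", {"class": "col _390CkK _1gY8H-"})
for card in reviews:
    rating = int(card.find("div", {"class": "hGSR34"}).get_text())
    text = card.find("div", {"class": "t-ZTKy"}).get_text()
    print(rating, text)
```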
8. We add each element's text to a dummy list in the following way:
a = []
for i in reviews:
    a.append(i.get_text())
9. Now, using a loop, we add the review and the rating to separate, pre-declared lists, with conditions like:
for i in a:
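The conditioning in this step (rating 3 = neutral, below 3 = negative, above 3 = positive, as decided in step 3) can be sketched with a small helper; the list names and sample data here are illustrative:

```python
def label(rating):
    """Map a 1-5 star rating to a sentiment label."""
    if rating > 3:
        return "positive"
    elif rating == 3:
        return "neutral"
    return "negative"

positive, neutral, negative = [], [], []
for rating, review in [(5, "loved it"), (3, "it is okay"), (1, "broke in a week")]:
    if label(rating) == "positive":
        positive.append(review)
    elif label(rating) == "neutral":
        neutral.append(review)
    else:
        negative.append(review)

print(positive, neutral, negative)
```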
10. Now we need to extract the data from all the other pages too.
11. Now we put all the code in one place and amend the URL in the following way:
&marketplace=FLIPKART&page=' + str(i)
12. Put this in a for loop where i is in range 2 to 1649 (because we have 1648 pages of data to add).
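Amending the URL as in steps 11 and 12 can be sketched like this (the base URL is a placeholder; substitute the actual Flipkart review URL):

```python
base_url = "https://www.flipkart.com/REVIEW_PAGE_PATH"  # placeholder for the real review URL

# Build the URL for every page after the first; page 1 is the base URL itself
urls = []
for i in range(2, 1649):  # pages 2 .. 1648
    urls.append(base_url + '&marketplace=FLIPKART&page=' + str(i))

print(len(urls))
```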
Note: it is a large dataset, so scraping may take around 2.5 hours.
13. Now we need to convert all the data to CSV form.
14. Using file handling, we open a file in write mode with with open and declare a writer in the following way:
hello = csv.writer(file_object)
15. Declare the first row as 'reviews', 'rating'.
16. Now, with a for loop, add all the data from the lists to the file.
Note: remember to declare your file with a .csv extension
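Steps 13 to 16 can be sketched like this, writing to an in-memory buffer for illustration (in the project you would pass a real file opened with with open('reviews.csv', 'w', newline='')):

```python
import csv
import io

reviews = ["loved it", "it is okay", "broke in a week"]
ratings = [5, 3, 1]

buffer = io.StringIO()  # stands in for the opened .csv file
hello = csv.writer(buffer)
hello.writerow(["reviews", "rating"])          # step 15: header row
for review, rating in zip(reviews, ratings):   # step 16: data rows
    hello.writerow([review, rating])

print(buffer.getvalue())
```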
Your CSV data is ready. It is a large dataset, so rendering it takes time; you can reduce the number of pages if needed.
If you want, you can get a better dataset from Kaggle or other sites, since this collection of reviews may be biased; just modify the code accordingly.
TRAINING AND TESTING MODEL
1. We will be using the nltk and sklearn libraries, along with the string module and other packages within them.
2. First we read the CSV file using the pandas library.
3. We import nltk, the string library, and stopwords from nltk.corpus, and then declare a function to remove punctuation and stop words from our data.
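The cleaning function from step 3 might look like this sketch; a tiny hard-coded stop-word set is used so the example runs without downloading anything, but in the project you would use stopwords.words('english') from nltk.corpus after running nltk.download('stopwords'):

```python
import string

# Illustrative stop words; replace with nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "a", "an", "is", "and", "it", "this"}

def text_process(text):
    """Remove punctuation, then drop stop words; return the remaining tokens."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in STOPWORDS]

print(text_process("This is a great phone!"))  # ['great', 'phone']
```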
4. Now we import CountVectorizer and TfidfTransformer from sklearn.feature_extraction.text, and we import MultinomialNB from sklearn.naive_bayes.
5. We need to divide our data for training and testing, so we import train_test_split from sklearn.model_selection and split the data in the following way:
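The split in step 5 can be sketched on toy data (the variable names msg_train, msg_test, label_train, label_test match the ones used in the later steps):

```python
from sklearn.model_selection import train_test_split

messages = ["loved it", "broke in a week", "great camera",
            "poor battery", "excellent value", "very bad"]
labels = ["positive", "negative", "positive",
          "negative", "positive", "negative"]

# Hold out 30% of the data for testing; random_state makes the split reproducible
msg_train, msg_test, label_train, label_test = train_test_split(
    messages, labels, test_size=0.3, random_state=101
)
print(len(msg_train), len(msg_test))
```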
6. Now we import Pipeline from sklearn.pipeline.
7. Now we pipeline our data into our model in the following way (text_process is the cleaning function declared in step 3):
model = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
8. Now we train our model with the following step:
model.fit(msg_train, label_train)
9. This may take hours, depending on the dataset size and the processor of your system.
10. After training, we cross-check with the test data:
predictions = model.predict(msg_test)
11. Now print the report, using classification_report from sklearn.metrics:
print(classification_report(label_test, predictions))
12. Now, to check a comment of our own, we do it in the following way:
model.predict(["your comment here"])
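Steps 7 to 12 can be put together in a minimal end-to-end sketch on toy data (the default CountVectorizer analyzer is used here instead of the text_process function, to keep the example self-contained):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

msg_train = ["loved it", "great camera", "excellent value",
             "broke in a week", "poor battery", "very bad"]
label_train = ["positive", "positive", "positive",
               "negative", "negative", "negative"]
msg_test = ["great value", "very poor camera"]
label_test = ["positive", "negative"]

model = Pipeline([
    ('bow', CountVectorizer()),        # text -> token counts
    ('tfidf', TfidfTransformer()),     # counts -> TF-IDF weights
    ('classifier', MultinomialNB()),   # Naive Bayes classifier
])
model.fit(msg_train, label_train)                       # step 8: train
predictions = model.predict(msg_test)                   # step 10: cross-check
print(classification_report(label_test, predictions))   # step 11: report
print(model.predict(["this phone is great"])[0])        # step 12: your own comment
```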
The work is done and the program is ready, but it lives only in RAM. If you want to fix this, you can use the pickle library or joblib; just google it for more.
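Persisting the model as suggested above can be sketched with pickle (joblib.dump works the same way and is often preferred for sklearn models); a plain dict stands in for the trained pipeline here so the example is self-contained:

```python
import os
import pickle
import tempfile

trained_model = {"classifier": "MultinomialNB", "note": "stand-in for the fitted Pipeline"}

path = os.path.join(tempfile.gettempdir(), "sentiment_model.pkl")
with open(path, "wb") as f:
    pickle.dump(trained_model, f)    # write the model to disk

with open(path, "rb") as f:
    restored = pickle.load(f)        # load it back in a later session

print(restored == trained_model)
```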