Python Web Scraping - Tutorialspoint



About the Tutorial

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically.

This tutorial teaches the main concepts of web scraping and aims to make you comfortable with scraping various types of websites and their data.

Audience

This tutorial will be useful for graduates, postgraduates, and research students who either have an interest in this subject or have it as a part of their curriculum. It suits the learning needs of both beginners and advanced learners.

Prerequisites

The reader should have basic knowledge of HTML, CSS, and JavaScript, and should be familiar with the basic terminology of web technology as well as Python programming concepts. If you are not familiar with these concepts, we suggest that you go through tutorials on them first.

Copyright & Disclaimer

Copyright 2018 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at [email protected]

Table of Contents

About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

1. Python Web Scraping – Introduction
   What is Web Scraping?
   Origin of Web Scraping
   Web Crawling v/s Web Scraping
   Uses of Web Scraping
   Components of a Web Scraper
   Working of a Web Scraper

2. Python Web Scraping – Getting Started with Python
   Why Python for Web Scraping?
   Installation of Python
   Setting Up the PATH
   Running Python

3. Python Web Scraping – Python Modules for Web Scraping
   Python Development Environments using virtualenv
   Python Modules for Web Scraping
   Requests
   Urllib3
   Selenium
   Scrapy

4. Python Web Scraping – Legality of Web Scraping
   Introduction
   Research Required Prior to Scraping

5. Python Web Scraping – Data Extraction
   Web Page Analysis
   Different Ways to Extract Data from Web Page
   Beautiful Soup
   Lxml

6. Python Web Scraping – Data Processing
   Introduction
   CSV and JSON Data Processing
   Data Processing using AWS S3
   Data Processing using MySQL
   Data Processing using PostgreSQL

7. Python Web Scraping – Processing Images and Videos
   Introduction
   Getting Media Content from Web Page
   Extracting Filename from URL
   Information about Type of Content from URL
   Generating Thumbnail for Images
   Screenshot from Website
   Thumbnail Generation for Video
   Ripping an MP4 Video to an MP3

8. Python Web Scraping – Dealing with Text
   Introduction
   Getting Started with NLTK
   Installing Other Necessary Packages
   Tokenization
   Stemming
   Lemmatization
   Chunking
   Bag of Words (BoW) Model: Extracting and Converting the Text into Numeric Form
   Building a Bag of Words Model in NLTK
   Topic Modeling: Identifying Patterns in Text Data
   Topic Modeling Algorithms

9. Python Web Scraping – Scraping Dynamic Websites
   Introduction
   Dynamic Website Example
   Approaches for Scraping Data from Dynamic Websites
   Reverse Engineering JavaScript
   Rendering JavaScript

10. Python Web Scraping – Scraping Form Based Websites
   Introduction
   Interacting with Login Forms
   Loading Cookies from the Web Server
   Automating Forms with Python

11. Python Web Scraping – Processing CAPTCHA
   What is CAPTCHA?
   Loading CAPTCHA with Python
   Pillow Python Package
   OCR: Extracting Text from Image using Python

12. Python Web Scraping – Testing with Scrapers
   Introduction
   Testing using Python
   Unittest: Python Module
   Testing with Selenium
   Comparison: unittest or Selenium

1. Python Web Scraping – Introduction

Web scraping is an automatic process of extracting information from the web. This chapter gives you an in-depth idea of web scraping, how it compares with web crawling, and why you should opt for web scraping. You will also learn about the components and working of a web scraper.

What is Web Scraping?

In this context, the word 'scraping' implies getting something from the web. Two questions arise here: what can we get from the web, and how do we get it?

The answer to the first question is 'data'. Data is indispensable for any programmer, and many programming projects require a large amount of useful data.

The answer to the second question is a bit tricky, because there are lots of ways to get data. In general, we may get data from a database, a data file or other sources. But what if we need a large amount of data that is available online? One way to get such data is to manually search for it (clicking away in a web browser) and save it (copy-pasting into a spreadsheet or file). This method is quite tedious and time-consuming. Another way to get such data is web scraping.

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software automatically loads and extracts data from multiple websites as per our requirement.

Origin of Web Scraping

The origin of web scraping is screen scraping, which was used to integrate non-web-based applications or native Windows applications. Screen scraping was in use before the World Wide Web (WWW) became widespread, but it could not scale up as the WWW expanded. This made it necessary to automate the approach of screen scraping, and the technique called 'web scraping' came into existence.

Web Crawling v/s Web Scraping

The terms web crawling and web scraping are often used interchangeably, as the basic concept of both is to extract data. However, they are different from each other. We can understand the basic difference from their definitions.

Web crawling is basically used to index the information on a page using bots, also known as crawlers. It is also called indexing. On the other hand, web scraping is an automated way of extracting information using bots, also known as scrapers. It is also called data extraction.
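The distinction between the two definitions can be illustrated with a minimal sketch. It uses only Python's standard html.parser module; the page content, the class names (LinkCollector, PriceExtractor) and the helper functions are hypothetical and exist only for illustration. The crawler-style pass gathers every link it finds so that further pages could be visited, while the scraper-style pass relies on site-specific structure to pull out one particular value.

```python
from html.parser import HTMLParser

# Hypothetical page content, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h1 class="product">Blue Widget</h1>
  <span class="price">19.99</span>
  <a href="/widgets/red">Red Widget</a>
  <a href="/widgets/green">Green Widget</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style pass: gather every link so more pages can be visited."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceExtractor(HTMLParser):
    """Scraper-style pass: pull one specific field using site-specific structure."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # Assumption: the price lives in <span class="price"> on this site.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text and self.price is None:
            self.price = float(text)

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def collect_links(html):
    """Return all href values found in anchor tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def extract_price(html):
    """Return the price as a float, or None if no price element is present."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.price


if __name__ == "__main__":
    print(collect_links(SAMPLE_HTML))
    print(extract_price(SAMPLE_HTML))
```

The link collector would work on any page, which is why crawling yields generic information; the price extractor breaks as soon as the markup changes, which is the cost of extracting specific data elements.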

To understand the difference between these two terms, let us look at the comparison table given below:

Web Crawling | Web Scraping
Refers to downloading and storing the contents of a large number of websites. | Refers to extracting individual data elements from a website by using a site-specific structure.
Mostly done on a large scale. | Can be implemented at any scale.
Yields generic information. | Yields specific information.
Used by major search engines like Google, Bing and Yahoo. Googlebot is an example of a web crawler. | The information extracted using web scraping can be replicated on some other website or used to perform data analysis. For example, the data elements can be names, addresses, prices, etc.

Uses of Web Scraping

The uses of and reasons for web scraping are as endless as the uses of the World Wide Web. Web scrapers can do many things a human can do, such as ordering food online, scanning an online shopping website for you, or buying match tickets the moment they become available. Some of the important uses of web scraping are discussed here:

- E-commerce Websites: Web scrapers can collect data, especially the price of a specific product, from various e-commerce websites for comparison.
- Content Aggregators: Web scraping is used widely by content aggregators, like news aggregators and job aggregators, for providing updated data to their users.
- Marketing and Sales Campaigns: Web scrapers can be used to get data like emails, phone numbers, etc. for sales and marketing campaigns.
- Search Engine Optimization (SEO): Web scraping is widely used by SEO tools like SEMRush, Majestic, etc. to tell businesses how they rank for the search keywords that matter to them.
- Data for Machine Learning Projects: Retrieval of data for machine learning projects often depends upon web scraping.
- Data for Research: Researchers can collect useful data for their research work and save time through this automated process.

Components of a Web Scraper

A web scraper consists of the following components:

Web Crawler Module

The web crawler module, an essential component of a web scraper, is used to navigate the target website by making HTTP or HTTPS requests to the URLs. The crawler downloads the unstructured data (HTML contents) and passes it to the extractor, the next module.

Extractor

The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called a parser module and uses different parsing techniques, such as regular expressions, HTML parsing, DOM parsing or artificial intelligence, for its functioning.

Data Transformation and Cleaning Module

The data extracted above is not suitable for ready use. It must pass through a cleaning module before we can use it. Methods like string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.

Storage Module

After extracting the data, we need to store it as per our requirement. The storage module outputs the data in a standard format that can be stored in a database or in JSON or CSV format.

Working of a Web Scraper

A web scraper may be defined as software or a script used to download the contents of multiple web pages and extract data from them.

[Diagram: Downloading the Contents -> Extracting the Data -> Storing the Data -> Analyzing the Data]

We can understand the working of a web scraper in simple steps, as shown in the diagram given above.

Step 1: Downloading Contents from Web Pages

In this step, a web scraper downloads the requested contents from multiple web pages.

Step 2: Extracting Data

The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper parses the downloaded contents and extracts structured data from them.

Step 3: Storing the Data

Here, a web scraper stores the extracted data in a format such as CSV or JSON, or in a database.

Step 4: Analyzing the Data

After all these steps are successfully done, the web scraper analyzes the data thus obtained.
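The four steps, together with the cleaning stage described under the components, can be sketched as a small pipeline. This is only an illustration under stated assumptions: the pages in FAKE_WEB, the example URLs, the price markup and all function names are hypothetical, and the download step looks pages up in a dictionary instead of issuing real HTTP requests so that the sketch runs offline. Extraction uses a regular expression, one of the parsing techniques mentioned above.

```python
import csv
import io
import re
from statistics import mean

# Hypothetical pages standing in for real downloads, so the sketch runs offline.
FAKE_WEB = {
    "https://shop-a.example/widget": '<h1>Widget</h1><span class="price"> $19.99 </span>',
    "https://shop-b.example/widget": '<h1>Widget</h1><span class="price">$ 17.50</span>',
}

# Assumption: both sites wrap the price in <span class="price">.
PRICE_PATTERN = re.compile(r'<span class="price">(.*?)</span>', re.S)


def download(url):
    """Step 1: fetch the raw HTML (looked up in FAKE_WEB instead of using HTTP)."""
    return FAKE_WEB[url]


def extract(html):
    """Step 2: pull the raw price text out of the unstructured HTML."""
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


def clean(raw_price):
    """Cleaning: drop the currency symbol and whitespace, then convert to a number."""
    return float(raw_price.replace("$", "").strip())


def store(rows):
    """Step 3: write the records as CSV text (kept in memory for this sketch)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


def analyze(rows):
    """Step 4: a trivial analysis, the cheapest offer and the average price."""
    cheapest = min(rows, key=lambda row: row[1])
    return cheapest[0], mean(price for _, price in rows)


def run_pipeline(urls):
    """Run all steps in order and return the rows, the CSV text and the analysis."""
    rows = []
    for url in urls:
        raw = extract(download(url))
        if raw is not None:
            rows.append((url, clean(raw)))
    return rows, store(rows), analyze(rows)


if __name__ == "__main__":
    rows, csv_text, summary = run_pipeline(list(FAKE_WEB))
    print(csv_text)
    print(summary)
```

Keeping each step in its own function mirrors the module structure described earlier: the download function could later be swapped for a real HTTP client, or the storage function for a database writer, without touching the other steps.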

2. Python Web Scraping – Getting Started with Python

In the first chapter, we learnt what web scraping is all about. In this chapter, let us see how to implement web scraping using Python.

Why Python for Web Scraping?

Python is a popular tool for implementing web scraping. Pyth
