@@ -4,13 +4,13 @@ Web Scraping

Web scraping means downloading multiple web pages, often from different
servers.
- Typically, there is a considerable waiting time involved between sending a
- request and receiving the answer.
+ Typically, there is a considerable waiting time between sending a request and
+ receiving the answer.
Using a client that always waits for the server to answer before sending
- the next request, means spending most of time waiting.
- Here ``asyncio `` can help to send many request without waiting for a response
+ the next request can lead to spending most of the time waiting.
+ Here ``asyncio`` can help to send many requests without waiting for a response
and collect the answers later.
- The next examples show how a synchronous client spends most of the
+ The following examples show how a synchronous client spends most of the time
waiting and how to use ``asyncio`` to write an asynchronous client that
can handle many requests concurrently.

@@ -75,7 +75,7 @@ The request handler only has a ``GET`` method:
It takes the last entry in the path with ``self.path[1:]``, i.e.
our ``2.5``, and tries to convert it into a floating point number.
This will be the time the function is going to sleep, using ``time.sleep()``.
- This means waits 2.5 seconds until it answers.
+ This means the server waits 2.5 seconds before it answers.
The rest of the method contains the HTTP header and message.

A Synchronous Client
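The handler this hunk describes is not included in the diff itself; a minimal sketch of such a sleeping ``GET`` handler (the class name ``SleepingHandler`` and the port are hypothetical, not taken from the PR) might look like:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

ENCODING = 'utf-8'

class SleepingHandler(BaseHTTPRequestHandler):
    """Answer a GET request after sleeping for the requested time."""

    def do_GET(self):
        # `self.path` is e.g. '/2.5'; `self.path[1:]` drops the leading slash.
        try:
            sleep_time = float(self.path[1:])
        except ValueError:
            sleep_time = 0.0
        time.sleep(sleep_time)  # wait before answering
        msg = 'waited for {} seconds'.format(sleep_time)
        self.send_response(200)
        self.send_header('Content-Type',
                         'text/html; charset={}'.format(ENCODING))
        self.end_headers()
        self.wfile.write(msg.encode(ENCODING))

# To serve, one would run something like:
# HTTPServer(('localhost', 8000), SleepingHandler).serve_forever()
```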
@@ -86,11 +86,11 @@ This is the full implementation:

.. literalinclude:: examples/synchronous_client.py

- Again, we go through step-by-step.
+ Again, we go through it step by step.

While about 80 % of the websites use ``utf-8`` as encoding
(provided by the default in ``ENCODING``), it is a good idea to actually use
- the encoding of that is specified by ``charset``.
+ the encoding specified by ``charset``.
This is our helper to find out what the encoding of the page is:

.. literalinclude:: examples/synchronous_client.py
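The synchronous helper itself is only referenced via ``literalinclude``; a rough sketch of what such an encoding-aware fetch could look like, assuming the standard library ``urllib`` (function names here are illustrative, not necessarily the PR's actual code):

```python
from urllib.request import urlopen

ENCODING = 'utf-8'  # fallback, since most websites use utf-8

def get_encoding(http_response):
    """Return the charset from the Content-Type header, or the default."""
    # `headers` behaves like an email.message.Message
    charset = http_response.headers.get_content_charset()
    return charset if charset else ENCODING

def get_page(host, port, wait=0):
    """Fetch one page and decode it with the encoding the server declares."""
    with urlopen('http://{}:{}/{}'.format(host, port, wait)) as response:
        return response.read().decode(get_encoding(response))
```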
@@ -120,8 +120,8 @@ Now, we want multiple pages:
We just iterate over the waiting times and call ``get_page()`` for all
of them.
The function ``time.perf_counter()`` provides a time stamp.
- Taking two time stamps a different and calculating their difference
- provides the elapsed run time.
+ Taking two time stamps at different points in time and calculating their
+ difference provides the elapsed run time.

Finally, we can run our client::

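The timing pattern this hunk describes can be sketched as follows; ``time.sleep()`` stands in for the tutorial's ``get_page()``, and the helper name is illustrative:

```python
import time

def get_multiple_pages(waits):
    """'Fetch' one page per waiting time and measure the total run time."""
    start = time.perf_counter()          # first time stamp
    for wait in waits:
        time.sleep(wait)                 # stand-in for get_page(wait)
    return time.perf_counter() - start   # difference = elapsed run time

# For a synchronous client the total is about the sum of all waits,
# e.g. get_multiple_pages([1.0, 2.5]) takes roughly 3.5 seconds.
```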
@@ -145,7 +145,7 @@ and get this output::
Because we wait for each call to ``get_page()`` to complete, we need to
wait about 11 seconds.
That is the sum of all waiting times.
- Let's see see if we can do better going asynchronously.
+ Let's see if we can do better by going asynchronous.


Getting One Page Asynchronously
@@ -159,7 +159,7 @@ using the new Python 3.5 keywords ``async`` and ``await``:
As with the synchronous example, finding out the encoding of the page
is a good idea.
This function helps here by going through the lines of the HTTP header,
- which it gets as an argument, searching for ``charset `` and returning is value
+ which it gets as an argument, searching for ``charset`` and returning its value
if found.
Again, the default encoding is ``ISO-8859-1``:

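A sketch of such a header-scanning helper, assuming the header arrives as a list of decoded lines (the exact signature in the tutorial's example file may differ):

```python
def get_encoding(header):
    """Search the HTTP header lines for `charset` and return its value."""
    for line in header:
        if line.lower().startswith('content-type'):
            for entry in line.split(';'):
                if 'charset' in entry.lower():
                    return entry.split('=', 1)[1].strip()
    return 'ISO-8859-1'  # default if no charset is found
```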
@@ -189,7 +189,7 @@ Therefore, we need to convert our strings into bytestrings.

Next, we read header and message from the reader, which is a ``StreamReader``
instance.
- We need to iterate over the reader by using the specific for loop for
+ We need to iterate over the reader by using a special for loop for
``asyncio``:

.. code-block:: python
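The ``asyncio``-specific loop referred to here is presumably the ``async for`` statement; a self-contained sketch of reading header and message this way (host, port, and the function name are hypothetical, not from the PR):

```python
import asyncio

async def fetch(host='localhost', port=8000, wait=0):
    """Send a plain HTTP GET and read header and message from the reader."""
    reader, writer = await asyncio.open_connection(host, port)
    request = 'GET /{} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(wait, host)
    writer.write(request.encode('latin-1'))   # the StreamWriter needs bytes
    header = []
    async for raw_line in reader:             # the asyncio-specific for loop
        line = raw_line.decode('latin-1').strip()
        if not line:                          # an empty line ends the header
            break
        header.append(line)
    body = await reader.read()                # the rest is the message
    writer.close()
    return header, body
```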
@@ -350,7 +350,7 @@ Exercise
Add more waiting times to the list ``waits`` and see how this impacts
the run times of the blocking and the non-blocking implementation.
Try (positive) numbers that are all less than five.
- Try numbers greater than five.
+ Then try numbers greater than five.

High-Level Approach with ``aiohttp``
------------------------------------
@@ -376,8 +376,8 @@ The function to get one page is asynchronous, because of the ``async def``:
   :start-after: import aiohttp
   :end-before: def get_multiple_pages

- The arguments are the same as for the previous function to retrieve one page
- plus the additional argument ``session``.
+ The arguments are the same as those for the previous function to retrieve one
+ page plus the additional argument ``session``.
The first task is to construct the full URL as a string from the given
host, port, and the desired waiting time.

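The ``aiohttp`` variant is again pulled in via ``literalinclude``; a sketch of such a session-based page getter, written against the current ``aiohttp`` client API (the function name and default arguments are illustrative and may differ from the example file):

```python
import aiohttp

async def fetch_page(session, host='localhost', port=8000, wait=0):
    """Get one page, reusing an existing aiohttp ClientSession."""
    # Construct the full URL from host, port, and the desired waiting time.
    url = 'http://{}:{}/{}'.format(host, port, wait)
    async with session.get(url) as response:
        return await response.text()
```

The caller creates a single ``aiohttp.ClientSession`` and passes it to every request, so all requests can share one connection pool.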