This repository was archived by the owner on May 31, 2021. It is now read-only.

Commit fc0e832

Describe the synchronous client.
1 parent 1b0e782 commit fc0e832

File tree

1 file changed

+48
-3
lines changed

webscraper.rst

Lines changed: 48 additions & 3 deletions
@@ -85,21 +85,66 @@ Our first attempt is synchronous:
8585
.. literalinclude:: examples/synchronous_client.py
8686

8787

88+
While about 80 % of the websites use ``utf-8`` as encoding
89+
(provided by the default in ``ENCODING``), it is a good idea to actually use
90+
the encoding that is specified by ``charset``.
91+
This is our helper to find out what the encoding of the page is:
92+
93+
.. literalinclude:: examples/synchronous_client.py
94+
:language: python
95+
:start-after: ENCODING = 'ISO-8859-1'
96+
:end-before: def get_page
97+
98+
It falls back to ``ISO-8859-1`` if it cannot find a specification of the
99+
encoding.
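
As a rough illustration of the idea (not the contents of ``examples/synchronous_client.py``), a helper of this kind could look for a ``charset`` entry in a ``Content-Type`` header value and fall back to the default otherwise. The name ``get_encoding`` and its signature are assumptions made for this sketch:

```python
ENCODING = 'ISO-8859-1'


def get_encoding(content_type, default=ENCODING):
    """Return the charset named in a Content-Type value, or the default.

    Hypothetical helper for illustration; the tutorial's own helper
    may be structured differently.
    """
    # A header value looks like: "text/html; charset=utf-8"
    for entry in content_type.split(';'):
        key, _, value = entry.partition('=')
        if key.strip().lower() == 'charset' and value.strip():
            # Some servers quote the value, e.g. charset="utf-8"
            return value.strip().strip('"')
    return default


if __name__ == '__main__':
    print(get_encoding('text/html; charset=utf-8'))
    print(get_encoding('text/html'))
```

Parsing the header value by hand keeps the fallback explicit: if no ``charset`` entry is present, the function simply returns ``ENCODING``.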
100+
88101
Using ``urllib.request.urlopen()``, retrieving a web page is rather simple.
89102
The response is a bytestring and ``.decode()`` is needed to convert it into a
90-
string.
91-
103+
string:
92104

93105
.. literalinclude:: examples/synchronous_client.py
94106
:language: python
95-
:start-after: return entry.split('=')[1].strip()
107+
:start-after: return ENCODING
96108
:end-before: def get_multiple_pages
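
For readers who want to try the bytes-to-string step in isolation, here is a minimal sketch. The function name ``fetch_text`` is hypothetical, and it uses the standard library's ``get_content_charset()`` on the response headers instead of the tutorial's own helper. The ``data:`` URL in the usage part is only there so the example runs without network access:

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding='ISO-8859-1'):
    """Fetch a URL and return its body as a string.

    Hypothetical sketch; not the tutorial's get_page().
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # charset from the Content-Type header, or None if absent
        charset = response.headers.get_content_charset()
    return raw.decode(charset or default_encoding)


if __name__ == '__main__':
    # A data: URL carries its own content type and body, so no server is needed.
    print(fetch_text('data:text/plain;charset=utf-8,caf%C3%A9'))
```

Decoding with the wrong encoding can silently produce garbled characters, which is why the charset announced by the server is tried first.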
97109

110+
Now, we want multiple pages:
111+
98112
.. literalinclude:: examples/synchronous_client.py
99113
:language: python
100114
:start-after: return html
101115
:end-before: if __name__ == '__main__':
102116

117+
We just iterate over the waiting times and call ``get_page()`` for all
118+
of them.
119+
The function ``time.perf_counter()`` provides a time stamp.
120+
Taking two time stamps at different times and calculating their difference
121+
provides the elapsed run time.
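
The timing pattern can be sketched on its own. The wrapper ``timed`` below is a hypothetical helper, not part of the example file; it takes one time stamp before the call and one after it:

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()      # first time stamp
    result = func(*args)
    elapsed = time.perf_counter() - start  # difference to second time stamp
    return result, elapsed


if __name__ == '__main__':
    _, elapsed = timed(time.sleep, 0.1)
    print('It took {:.2f} seconds.'.format(elapsed))
```

The absolute value returned by ``time.perf_counter()`` has no defined reference point; only the difference between two calls is meaningful.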
122+
123+
Finally, we can run our client::
124+
125+
python synchronous_client.py
126+
127+
and get this output::
128+
129+
It took 11.08 seconds for a total waiting time of 11.00.
130+
Waited for 1.00 seconds.
131+
That's all.
132+
133+
Waited for 5.00 seconds.
134+
That's all.
135+
136+
Waited for 3.00 seconds.
137+
That's all.
138+
139+
Waited for 2.00 seconds.
140+
That's all.
141+
142+
Because we wait for each call to ``get_page()`` to complete, we need to
143+
wait about 11 seconds.
144+
That is the sum of all waiting times.
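
The additive behaviour can be reproduced without a server. In the sketch below, ``fake_get_page`` is a stand-in that merely sleeps, and ``SCALE`` shrinks the waits so the script finishes quickly; both names are assumptions made for illustration:

```python
import time

WAITS = [1, 5, 3, 2]   # same waiting times as in the output above
SCALE = 0.01           # assumption: shrink the waits to keep the demo fast


def fake_get_page(wait):
    """Stand-in for get_page(): block for the (scaled) waiting time."""
    time.sleep(wait * SCALE)
    return 'Waited for {:.2f} seconds.'.format(wait)


def run_sequentially(waits):
    """Fetch one page after the other and measure the total duration."""
    start = time.perf_counter()
    pages = [fake_get_page(wait) for wait in waits]
    duration = time.perf_counter() - start
    return pages, duration


if __name__ == '__main__':
    pages, duration = run_sequentially(WAITS)
    print('Scaled duration: {:.2f} s, sum of scaled waits: {:.2f} s'.format(
        duration, sum(WAITS) * SCALE))
```

Each call blocks until it returns, so the measured duration cannot be shorter than the sum of the individual waits.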
145+
Let's see if we can do better by going asynchronous.
146+
147+
103148
Getting a Page Asynchronously
104149
-----------------------------
105150
