Alex Nazarovsky

MATLAB, DSP, Julia, Power quality, Engineering, Metrology

Python Script to Download a GOST From protect.gost.ru as a PDF File

Unfortunately, the Russian database of national standards (GOSTs) does not let you download documents as .pdf files. It serves them only as separate page images, and even those are not directly accessible, so you cannot simply save the text. That is a pity, because a printed copy is far easier to read and study. I wrote a simple Python script to work around this problem.

The script depends on a few libraries, which you can install by running pip install img2pdf pypdf2 requests. It was tested with Python 2.7.

Usage:

  1. Find the standard you want to download on protect.gost.ru
  2. Copy its link; in this example it is http://protect.gost.ru/v.aspx?control=8&baseC=-1&page=0&month=-1&year=-1&search=&RegNum=1&DocOnPageCount=15&id=126445
  3. Replace the link in the script with your own
  4. Run the script: python gostdl.py
  5. In the same folder you will get the separate pages of the document in .pdf and .jpg format, as well as the merged document, document.pdf
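
Before pasting a link into the script, it can help to check that it really is a document link. The following is only a minimal sketch: the helper name extract_gost_id is hypothetical, and the idea that the id query parameter identifies the document is an assumption read off the example link above, not something the site documents.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2.7

# The example link from step 2.
EXAMPLE_LINK = (
    "http://protect.gost.ru/v.aspx?control=8&baseC=-1&page=0&month=-1"
    "&year=-1&search=&RegNum=1&DocOnPageCount=15&id=126445"
)


def extract_gost_id(url):
    """Hypothetical helper: return the `id` query parameter of a
    protect.gost.ru link, or None if the link does not look right."""
    parts = urlparse(url)
    if parts.netloc != "protect.gost.ru":
        return None
    # parse_qs maps each parameter name to a list of its values.
    ids = parse_qs(parts.query).get("id")
    return ids[0] if ids else None


doc_id = extract_gost_id(EXAMPLE_LINK)
print("Document id: " + str(doc_id))
```

A variant of the script could take the link from sys.argv[1] and run it through such a check, so that step 3 no longer requires editing the file.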
Python 2.7 script ‘gostdl.py’ to download a GOST from protect.gost.ru as a PDF file
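import re

import img2pdf
import requests
# PdfFileMerger and PdfFileReader are the PyPDF2 1.x/2.x class names
# (they were removed in PyPDF2 3.0).
from PyPDF2 import PdfFileMerger, PdfFileReader

# Replace this link with the link to the standard you want to download.
DOC_URL = "http://protect.gost.ru/v.aspx?control=8&baseC=-1&page=0&month=-1&year=-1&search=&RegNum=1&DocOnPageCount=15&id=126445"

s = requests.Session()
response = s.get(DOC_URL)
# The regex captures the pageK value of every page link on the document page.
links = re.findall(r"\" href=\".*?pageK=(.*?-.*?-.*?-.*?)\".*?>.*?</a>\s+.*?<a style", response.text)
n = 1
pdf_merger = PdfFileMerger()

for link in links:
    url = "http://protect.gost.ru/image.ashx?page=" + link
    fname = str(n).zfill(2) + ".jpg"
    print("Downloading " + fname + " url=" + url)
    r = s.get(url)
    # Save the page image as NN.jpg.
    with open(fname, "wb") as jpg:
        jpg.write(r.content)

    # Convert the image into a one-page PDF, NN.pdf.
    pdf_bytes = img2pdf.convert([fname])
    pdf_name = str(n).zfill(2) + ".pdf"
    with open(pdf_name, "wb") as pdf:
        pdf.write(pdf_bytes)
    # open() is used instead of the Python 2-only file() builtin.
    pdf_merger.append(PdfFileReader(open(pdf_name, "rb")))
    n = n + 1

# Write all pages into a single merged file.
pdf_merger.write("document.pdf")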
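
The regular expression is the least obvious part of the script, so here is a small sketch of what it captures. The HTML fragment below is invented purely for illustration and is not copied from protect.gost.ru, whose real markup may differ; the keys are placeholders and the helper name page_image_urls is hypothetical. Only the pattern itself and the image.ashx address come from the script.

```python
import re

# Same pattern as in gostdl.py.
PAGE_KEY_RE = re.compile(
    r"\" href=\".*?pageK=(.*?-.*?-.*?-.*?)\".*?>.*?</a>\s+.*?<a style"
)

# INVENTED markup with placeholder keys, for illustration only.
SAMPLE_HTML = (
    '<a style="a" class="p" href="v.aspx?pageK=AAAA-BBBB-CCCC-DDDD">1</a>\n'
    '<a style="a" class="p" href="v.aspx?pageK=EEEE-FFFF-0000-1111">2</a>\n'
    '<a style="a" href="#">next</a>\n'
)


def page_image_urls(html):
    """Hypothetical helper: turn every captured pageK value into the
    image URL that the script downloads."""
    keys = PAGE_KEY_RE.findall(html)
    return ["http://protect.gost.ru/image.ashx?page=" + key for key in keys]


urls = page_image_urls(SAMPLE_HTML)
for u in urls:
    print(u)

# Without the trailing '<a style' anchor, the second link no longer matches.
truncated = "\n".join(SAMPLE_HTML.splitlines()[:2]) + "\n"
print(len(page_image_urls(truncated)))
```

Note that the pattern ends with `<a style`, so a page link is captured only when whitespace and another `<a style` anchor follow it. If a download ever comes out one page short, that trailing condition is a reasonable first place to look.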