Hathi Download Helper

Download books from the HathiTrust website quickly and easily.



Quickstart

Simple View Mode

  1. Copy the book URL or book ID into the URL field of Hathi Download Helper and press the 'Get book info' button.

Advanced View Mode

  1. Copy the book URL or book ID into the URL field of Hathi Download Helper and press the 'Get book info' button.
  2. Select the source file format (pdf or images), set the destination folder and press the 'Start download' button.
  3. Convert and merge the downloaded files into a pdf book by pressing the 'create pdf' button.

Features: brief overview

  1. Book information : Enable Proxy to retrieve book information via an internet proxy. Use AutoProxy to connect automatically to regularly updated proxy servers.
  2. Download of books : Use the AutoProxy feature to download data via proxy servers, either from the US only or from all available countries, and to reduce waiting time.
  3. Download of books : Use the WebProxy feature (disabled in HDH 1.1.1) to reduce waiting time. (Works only for 'public domain' titles.)
  4. Link Collector : Use the Link Collector and Batch Job features to create batch jobs and download several books at once.


Download resources

Source code and installer are available on:

WWW:  https://sourceforge.net/projects/hathidownloadhelper/
WWW:  www.facebook.com/hathidownloadhelperTool
WWW:  http://www.softpedia.com/get/Internet/Download-Managers/Hathi-Download-Helper.shtml

Comments, feedback, bug reports and questions are welcome:
hathidownloadhelper@hotmail.com

You can also use the built-in contact form. No email address required! (See: Help → Contact & Bug report)


User interface elements

The main window of Hathi Download Helper is separated into three group boxes. Each group box corresponds to a certain processing step:
  • step 1: Identifying target book ID → Book information group box
  • step 2: Downloading book pages → Download settings group box
  • step 3: Converting downloaded data → PDF merge & conversion group box
Furthermore the menu bar provides additional features, namely:
  • Page setup
  • GUI setup (font setup, style setup)
  • User settings
  • Proxy settings
  • Batch job
  • Export links
  • Merge PDFs
  • Help
  • About
  • Contact and Bug report
  • Update check
  • Automatically check for update

Menu Bar

The menu bar provides the following options:
  • File → Exit : Exits Hathi Download Helper.
  • Options → Page setup : Opens a dialog for setting the page size (letter, A3, A4, etc.) and the page margins (top, right, left, bottom).
  • Options → GUI setup → Style setup : Opens a dialog for the GUI style setup. The available styles depend on your operating system.
  • Options → GUI setup → Font setup : Opens a dialog for the GUI font setup. You can adjust font type, size, etc. The size of the GUI is adjusted to the space required for all elements.
  • Options → GUI setup → Default : Resets font and GUI style to the default settings.
  • Options → User settings : Opens a dialog for the default application settings.
  • Tools → Proxy : Opens the proxy server setup dialog. Here you can either activate the built-in AutoProxy feature or set up your own proxy IP, port number and proxy type. A user name and password may be entered for authentication; otherwise just leave those fields empty.
  • Tools → Create batch job : Opens the batch job dialog. With this dialog you can create a list of books to be downloaded one after the other.
  • Tools → Export links : Opens the export dialog, which creates an html file containing links to each book page. You can use this feature to download all book pages yourself with your favorite download tool.
  • Tools → Merge PDFs : Opens a file dialog to select arbitrary pdf files to be merged into a single pdf.
  • Help → Help : Opens the help dialog you are currently reading.
  • Help → About : Opens the Hathi Download Helper about dialog.
  • Help → About Qt : Opens the About Qt dialog.
  • Help → Check for Update : Checks whether a new version is available online and provides download links for source code and installer.
  • Help → Automatically check for Update : Enables or disables the automatic update check on start-up.


Group Boxes


Book information

Controls: 'Book URL' input field, 'use proxy server' checkbox.

The 'Book information' group box holds the URL input field as well as the received book information: title, number of pages, book ID, publisher and author.
After you enter the book URL and press the 'Get book info' button, Hathi Download Helper reads the html document of the book.
Alternatively, the book ID can be entered instead of the URL. If desired, a proxy server can be used by selecting the corresponding checkbox. When the book is blocked, e.g. due to copyright restrictions, the message "Received empty document..." is displayed close to the progress bar.


Download settings

Controls: pdfs, images, image zoom, download OCR text, pages, create pdf book after download, resume book download, enable AutoProxy, enable WebProxies.

In the 'Download settings' group box the user can choose between two file formats:
  • pdfs: Select this option to download the book pages as single searchable pdf documents generated by Hathitrust.org. After the download you have the option to merge all pdf files. For this operation Hathi Download Helper uses 'pdftk' (see: PDF merge & conversion). Note: The download of pdf files is limited to approximately 15 files per 5 minutes.
  • images: Select this option to download the book pages as image files (jpeg, png). The image quality depends on the selected resolution. The number of files you can download without waiting time is much higher than for pdf download.
  • The image quality can be adjusted by selecting a zoom factor. The listed dpi values are approximations and depend on the selected page size.
  • To generate 'searchable' pdfs Hathi Download Helper can download ocr text files in addition to the image files. The ocr text files are stored as html documents on your hard disk.
  • Using the 'pages' input field the user can download either a whole book or only certain pages. Single page numbers have to be separated by commas (e.g. 1,3,5). Page ranges have to be indicated by a hyphen, starting with the smaller value (e.g. 5-10, 20-30).
  • Selecting the 'create pdf book after download' checkbox automatically starts the pdf merge and conversion process after the download to generate pdf files or a single pdf book file.
  • When the 'resume book download' option is checked, Hathi Download Helper checks whether files of the specified book were already downloaded during a previous session and whether these files are readable. By default this option is not checked and the downloader re-downloads all files.
  • The 'enable AutoProxy' checkbox activates the built-in AutoProxy support of Hathi Download Helper. This feature automatically establishes connections to free proxy servers to bypass the download limitations (e.g. for the pdf files) of hathitrust.org.
  • The 'enable WebProxies' checkbox activates the built-in WebProxy support of Hathi Download Helper. This feature automatically sends download requests via several web proxies to bypass the download limitations (e.g. for the pdf files) of hathitrust.org. Note: Only non-restricted books, which are accessible for non-US citizens, can be downloaded. See WebProxies for details.
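
The syntax of the 'pages' input field described above (single pages separated by commas, ranges written with a hyphen and the smaller value first) can be illustrated with a short Python sketch. It is not taken from the Hathi Download Helper source code; the function name parse_pages and the error handling are assumptions made for illustration only.

```python
def parse_pages(spec):
    """Expand a page selection such as '1,3,5' or '5-10, 20-30' into a sorted list."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and blanks
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                # the help text requires ranges to start with the smaller value
                raise ValueError("range must start with the smaller value: " + part)
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1,3,5"))       # [1, 3, 5]
print(parse_pages("5-7, 20-22"))  # [5, 6, 7, 20, 21, 22]
```

How the real application reacts to malformed input (for example a reversed range) is not documented here; the sketch simply raises an error.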



PDF merge & conversion

Controls: merge pdfs, convert & merge images to pdf book, convert images to single pdf files, use plaintext (ocr text) only, set pdf resolution.

In the 'PDF merge & conversion' group box the user can choose between the following options:
  • 'merge pdfs': Merges single pdf files using the free tool 'pdftk' (http://www.pdflabs.com).
  • 'convert & merge images to pdf book': Converts and merges images into a pdf book. Page size and page margins are editable via 'Options' → 'Page setup'.
  • 'convert images to single pdf files': Creates a single pdf file for each page.
  • 'set pdf resolution': Sets the output resolution for pdf files generated by Hathi Download Helper from images/ocr text files.



Features

This section describes the manual proxy, AutoProxy and WebProxy features as well as the file naming and folder structure used by Hathi Download Helper. It also explains how to use Hathi Download Helper as a PDF merger and as an image-to-PDF converter.





Manual Proxy

Hathi Download Helper provides an option to enable a network proxy. The proxy specification (IP address and port number, proxy type, user name and password) has to be defined by the user.
Hathi Download Helper uses this proxy server connection as long as the 'use proxy server' checkbox is selected.
The implementation uses the QNetworkProxy class of Qt 4.7.4.
The following proxy types are supported:

Proxy type | Description | Default capabilities
Caching-only HTTP | Implemented using normal HTTP commands; it is useful only in the context of HTTP requests. | CachingCapability, HostNameLookupCapability
SOCKS5 | Generic proxy for any kind of connection. Supports TCP, UDP, binding to a port (incoming connections) and authentication. | TunnelingCapability, ListeningCapability, UdpTunnelingCapability, HostNameLookupCapability

When the proxy connection is enabled, Hathi Download Helper checks whether a connection to the proxy server could be established and whether hathitrust.org is reachable.





AutoProxy feature

Hathi Download Helper provides an option to connect to network proxies automatically. For this purpose the downloader uses proxy lists that are freely available on the internet and checks whether the given hathitrust.org book URL is reachable.
Depending on the selected option (see: Tools → Proxy) the AutoProxy feature uses proxy servers either from the 'US only' or from 'all countries'.
To use the AutoProxy feature for retrieving book information, select the 'use proxy server' checkbox within the 'Book information' group box.
To enable the AutoProxy feature for downloads, select the 'enable AutoProxy' checkbox within the 'Download settings' group box.
NOTE: Some books are only available when viewed in the US. For those books the 'US only' option has to be selected. Please note that you are only allowed to view these books when you are in the US.

Further information:
  • AutoProxy can be configured within the proxy setup dialog from the menu bar (select Tools → Proxy).
  • To activate AutoProxy select the corresponding radio button within the proxy setup dialog.
  • AutoProxy can be configured to use proxy servers either from the 'US only' or from 'all countries'.
  • Additional proxy sources can be added from inside the proxy setup dialog.
  • When the download limit of hathitrust.org has been exceeded for the active proxy connection, AutoProxy automatically reconnects to another proxy server.
  • Please note: Establishing a new proxy connection might take several minutes!




WebProxy - disabled in HDH 1.1.1

Hathi Download Helper provides an option that uses a large number of random web proxies to download data from hathitrust.org.
This feature redirects all download requests to free web proxy services so that the download can continue while the server's download limitation for the user is active. Please note the following information:

Restrictions:
• Works only for non-restricted books which are also public domain when viewed outside the US.
• Strongly varying download speed.

Important advice:
• Since this feature uses a large number of random web pages, an up-to-date virus scanner is recommended.
• There is no guarantee of proper functioning.

WebProxy safety measures

To minimize the risk of unwanted behaviour the following safety measures are implemented:
  • JavaScript is not allowed to open new windows or close existing windows.
  • Java is disabled.
  • Plug-ins in web pages are disabled.
  • Automatic loading of images is disabled.
  • Automatic URL redirections are disabled.
  • The integrated pop-up blocker is enabled.
  • Load requests are monitored for cross-site scripting attempts. Suspicious scripts are blocked.





File and folder structure


Hathi Download Helper creates the following sub-folder structure for downloaded data inside the target directory:
  • 'pdfs' : Folder for downloaded pdf files
  • 'images' : Folder for downloaded image files
  • 'ocr' : Folder for downloaded ocr text files (*.html)
Note: All downloaded data (images, pdfs, ocr files) will be kept. If you don't need them any more you have to delete them manually.

Note: When restarting a download (with the same book ID and the same destination folder) all files downloaded in the previous session will be overwritten unless you have selected the 'resume book download' option. In that case the downloader checks whether a file with the same name already exists and will not download it again.
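
The resume behaviour described in the note above can be sketched in Python as follows. This is a minimal illustration and not the application's actual code: the function name pages_still_needed is hypothetical, the file names follow the pattern documented in the 'Namespace' section, and the sketch only tests whether a file exists (the readability check performed by the application is left out).

```python
import os


def pages_still_needed(folder, book_id, pages, extension):
    """Return the page numbers for which no file exists yet in 'folder'."""
    needed = []
    for page in pages:
        # assumed naming pattern: <ID>_page_<NNN>.<extension>
        name = "{}_page_{:03d}.{}".format(book_id, page, extension)
        if not os.path.isfile(os.path.join(folder, name)):
            needed.append(page)
    return needed


# Example call: list the pages of a book that are missing in ./pdfs
print(pages_still_needed("pdfs", "hvd.32044038439063", [1, 2, 3], "pdf"))
```

Only the pages returned by such a check would have to be requested again, which is what keeps a resumed download short.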

Hathi Download Helper creates the following sub-folder structure for converted data inside the source directory:
  • 'pdfs' : Folder for generated pdf files. Existing files will be overwritten.
  • 'pdfs_text_only' : Folder for generated pdf files with ocr text only.
Note: Since the target folder for download is the source folder for conversion, all existing pdf files within the 'pdfs' folder will be overwritten when 'single pdf' conversion is selected as output option!





Namespace

Hathi Download Helper uses a fixed naming scheme for downloaded data, starting with the document ID (with reserved characters removed).
This naming scheme is used for pdf files, images and ocr text files (html files).
Example for document ID hvd.32044038439063:

File format: ID + "_page_" + page number + file type extension

PDF example: hvd.32044038439063_page_001.pdf
JPG example: hvd.32044038439063_page_001.jpg
OCR example: hvd.32044038439063_page_001.html
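
The naming scheme can be reproduced with a few lines of Python. The sketch below is only an illustration: the function name page_filename is hypothetical, the three-digit zero padding is inferred from the examples above, and the removal of reserved characters is not shown because the help text does not list which characters are affected.

```python
def page_filename(book_id, page, extension):
    """Build '<ID>_page_<NNN>.<extension>' as shown in the examples above.

    book_id is expected to be already free of reserved characters.
    """
    return "{}_page_{:03d}.{}".format(book_id, page, extension)


print(page_filename("hvd.32044038439063", 1, "pdf"))
# hvd.32044038439063_page_001.pdf
```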





Hathi Download Helper as PDF merger

Hathi Download Helper is able to merge any pdf files using the 'pdftk' (pdf toolkit) application. For this purpose the radio button "merge pdfs" has to be selected. When you select a folder that does not contain content downloaded by Hathi Download Helper (files/folders), a file dialog for file selection will appear. This dialog is also available from the menu bar (Tools → Merge PDFs). If you are running a Linux or Mac OS system you have to install the 'pdftk' tool (http://www.pdflabs.com). For Windows systems Hathi Download Helper brings along a copy of 'pdftk'.
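
For readers who want to merge files with pdftk directly, the following Python sketch builds a standard pdftk 'cat' call (pdftk in1.pdf in2.pdf cat output out.pdf). How Hathi Download Helper itself invokes pdftk is not documented here; the helper names build_merge_command and merge_pdfs are assumptions made for this example.

```python
import shutil
import subprocess


def build_merge_command(input_files, output_file):
    """Return the argument list for 'pdftk <inputs> cat output <output>'.

    The input files are merged in the order in which they are given.
    """
    return ["pdftk"] + list(input_files) + ["cat", "output", output_file]


def merge_pdfs(input_files, output_file):
    """Run pdftk; fails if pdftk is not installed or reports an error."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk not found on PATH")
    subprocess.run(build_merge_command(input_files, output_file), check=True)


# Only show the command here; call merge_pdfs(...) to actually run it.
print(" ".join(build_merge_command(["page_001.pdf", "page_002.pdf"], "book.pdf")))
```

Passing the files as an argument list instead of a single command string avoids quoting problems with paths that contain spaces.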





Hathi Download Helper as Image-to-PDF converter

Hathi Download Helper is able to convert a number of different image formats into pdf files. For this purpose the radio button "convert & merge images to pdf book" or "convert images to single pdf files" has to be selected. When you select a folder that does not contain content downloaded by Hathi Download Helper (files/folders), a file dialog for file selection will appear.

Note: Since the target folder for download is the source folder for conversion, all existing pdf files within the 'pdfs' folder will be overwritten when 'single pdf' conversion is selected as output option!





Installing pdftk

To merge existing pdf files Hathi Download Helper uses the 'pdftk' application. To install pdftk, follow the steps for your operating system:
  • Windows
    1. Download and install 'pdftk' from http://www.pdflabs.com
    2. Open the pdftk program folder and copy the files pdftk.exe and libiconv2.dll
    3. Open the Hathi Download Helper folder containing the hathidownloadhelper.exe file and create a new folder named pdftk
    4. Copy the files from step 2 into the pdftk subfolder.
    Hint: If you have compiled Hathi Download Helper on your own you have to place the pdftk subfolder in your Debug/Release target folder containing the HathiDownloadHelper.exe file.
  • Linux
    1. Download and install 'pdftk server' from http://www.pdflabs.com or use the pdftk file placed in the pdftk subfolder attached to this project.
      • When you are using Ubuntu you can install pdftk by the following command:
        sudo apt-get install pdftk
  • Mac OS
    1. Download and install 'pdftk server' for Mac OS from http://www.pdflabs.com.
      Hint: Due to the 'Gatekeeper' feature on Mac OS you have to open the installer by right-clicking it and selecting 'Open'.





FAQ


  • What does the name "Hathi Download Helper" mean?
  • Hathi (pronounced hah-tee) is the Hindi word for elephant, an animal highly regarded for its capability to suck a huge amount of water into its trunk, and to blow the water into its mouth. In computer networks, to download means to receive data to a local system from a remote system, or to initiate such a data transfer. Helper refers to a device that helps. In combination, the words convey the key benefits users can expect from this application - to download pages or complete books in an easy way.

  • HDH Logo
  • The Hathi Download Helper logo was originally created by 'lemming' under the title 'Cartoon elephant'. It is in the public domain. For further information visit: https://openclipart.org/detail/17810/cartoon-elephant


  • Server: maximum download limit exceeded...please wait....
  • Hathitrust.org limits the number of files that can be downloaded. If you download too many files in a short period of time you will be forced to wait. For pdf files the limit is about 15 files per 5 minutes; afterwards you have to wait for approximately 5 minutes. You may activate the WebProxy feature to download data via several web proxies during this waiting period.


  • Suddenly Hathitrust.org is not reachable anymore...
  • This behaviour may occur due to an excessive number of download requests. In this case the user's IP might be blocked by Hathitrust.org for approximately 5 minutes.


  • Why are the created PDF files so huge?
  • Hathi Download Helper uses a PDF printer (Qt::QPrinter), which 'prints' the images into the pdf file. Since QPrinter only supports the jpg image format, all pages are stored as jpg images inside the pdf file. Therefore even pages containing only text have to be stored in the same way as full-resolution images.


  • How does Hathi Download Helper generate searchable PDF files? Is there any OCR software involved?
  • Hathi Download Helper does not have any OCR functionality. Instead it uses the OCR files generated by Hathitrust.org. The downloaded OCR files are stored as html files on your hard disk. For PDF creation the OCR text is printed on each page and overlaid with the corresponding image.



  • AutoProxy: Verifying connection always fails!
  • Hathi Download Helper uses free proxy servers whose IPs are published and updated online. Therefore the quality of service varies strongly. Normally HDH requests up to 20 proxy IPs from one source. Sometimes a source only provides outdated IPs. HDH requests a new IP list after it has checked every IP of the previously received list. You can force an update of the IP list by unchecking and re-checking the 'use proxy server' checkbox.

    When the proxy verification still fails check if HDH is blocked by your firewall. An easy test is to run a manual update check: Menu bar → Help → check for update.
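
The download limit mentioned in this FAQ (about 15 pdf files per 5 minutes) can be respected on the client side with a simple sliding-window throttle. The Python sketch below is an assumption-based illustration: the class name DownloadThrottle is hypothetical, the default figures are the approximate values quoted above, and neither the real server policy nor the application's own waiting logic is reproduced.

```python
import time
from collections import deque


class DownloadThrottle:
    """Allow at most max_files downloads within 'window' seconds."""

    def __init__(self, max_files=15, window=300.0, clock=time.monotonic):
        self.max_files = max_files
        self.window = window
        self.clock = clock      # injectable clock, useful for testing
        self.stamps = deque()   # start times of the recent downloads

    def wait_time(self):
        """Seconds to wait before the next download may start (0.0 if none)."""
        now = self.clock()
        # forget downloads that have left the time window
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.max_files:
            return 0.0
        return self.window - (now - self.stamps[0])

    def record(self):
        """Note that a download has just been started."""
        self.stamps.append(self.clock())


throttle = DownloadThrottle()
print(throttle.wait_time())  # 0.0 because nothing has been downloaded yet
```

A download loop would call wait_time(), sleep for the returned number of seconds, start the request and then call record().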





ERROR FIXING

  • "Error: unable to execute 'pdftk' application."
  • To merge existing pdf files Hathi Download Helper uses the 'pdftk' application. This error may occur due to missing permissions for the pdftk files. To fix this error see 'Installing pdftk'.
  • "Error: Waiting for 'pdftk' application."
  • For merging existing pdf files Hathi Download Helper is using the 'pdftk' application. This error may occur due to corrupted pdf-files. A warning dialog might name corrupted files. To fix this error you have to do the following actions:
    1. Delete the corrupted PDF file named in the warning dialog.
    2. Restart the pdf merging process to check whether more corrupted files exist.
    3. When all corrupted files have been deleted restart the download with activated 'resume book download' option.

    4. Since version
    Hint: If you have compiled Hathi Download Helper on your own you have to place the pdftk subfolder in your Debug/Release target folder containing the HathiDownloadHelper.exe file.
  • Mac OS: Text in buttons and drop-downs looks misaligned
  • There are some problems with the Qt framework on Mac OS X. Changing the application font should fix this problem. Select 'Options' → 'GUI setup' → 'Font setup' from the menu bar.





Change log

    2013.05.18:initial version 1.0.0
    2013.05.19:version 1.0.1 released:
    fixed bug in image resolution setting after 'page setup' dialog, renamed images files in qt resources, copied image files in application directory
    2013.05.24:version 1.0.2 released:
    changed development environment to 4.7.4, added compiler switch for qt 5.x, tested on linux and windows system, added options for GUI style and fonts, updated GUI, bug fix for missing ocr files, reduced freezing effect of GUI during pdf creation, added 'pdftk' binary for linux/OS, added selection for proxy type.
    2013.06.03:version 1.0.3 released:
    bug fix for proxy type selection. moved pdf merge & conversion into QThread worker to eliminate freezing effect of GUI during processing. Changed usage from QPixmap to QImage for pdf creation. Changed OCR text extraction method to reduce memory usage(QWebkit is really greedy). Improved text font size adjustment method. Added Author and Publisher information. Changed Windows installer creation from QT framework installer to inno setup compiler to fix kernel32.dll error on win XP.
    2013.07.02:version 1.0.4 released:
    improved download performance by using parallel download requests (it is really much faster now :-D ), added encryption for proxy password, added 'check for update' feature, added batch job feature for downloading several books at once, added link export function
    2013.08.18:version 1.0.5 released:
    re-implementation of all GUI elements and dialogs, fixed text clipping of GUI elements, fixed page shrinking on pdf creation due to long ocr text, improved download speed, re-designed help file
    2013.10.27:version 1.0.6 released:
    bug fixes: lost destination path for single pdf-file creation, application crash on manual file selection. Added new features for batch job dialog: 'edit book', 'load job', 'save job', added gimmicks for Halloween and Christmas, minor changes.
    2014.03.30:version 1.0.7 released:
    added new download options: webproxies, resume of book downloads, added user settings dialog, added auto-update option, coding: separated GUI from file downloader.
    2014.05.06:version 1.0.8 released:
    adjustments due to changes in hathitrust.org link structure.
    2014.10.26:version 1.0.9 released:
    Updated GUI, added link collector feature, added history feature, added automatic proxy feature (including US proxies): 'AutoProxy', added verification check for proxy connections, improved pdf merging process, added field for copyright information, added check for corrupted pdf and image files, added automatic download resume in case of corrupted pdf files, minor bug fixes, changed development environment to 4.8.0
    2014.11.30:version 1.1.0 released:
    bug fixes: fixed possible application crash on proxy activation, fixed PDFTK problems with too long file paths. Changes: disabled change-over from WebProxy to AutoProxy feature and vice versa during download, revised behaviour of various GUI controls to improve usability
    2015.03.08:version 1.1.1 alpha (unreleased):
    Changes: adjusted timing of AutoProxy feature, added option to preserve existing pdf books with identical name in same folder, adjustments for Mac OS compatibility.
    2016.05.19:version 1.1.1 released:
    Changes: adjustments to obtain SSL/TLS compatibility for https requests. (AutoProxy / WebProxy disabled)
    2016.06.07:version 1.1.2 released:
    Bug fixes: fixed and improved AutoProxy feature. Changes: enabled resizing of the GUI, added message / bug report feature.
    2016.09.05:version 1.1.3 released:
    Bug fixes: pdf merging fails when downloading book with more than 1300 pages, fixes automatic update check feature. New features: Download whole books as 1 pdf when whole book download is available, added Pdf merging dialog to merge arbitrary pdf files.
    2017.07.23:version 1.1.4 released:
    Bug fixes: Fixed connection problems to hathitrust webpage. Fixed broken feedback form.
    2017.12.20:version 1.1.5 released:
    Bug fixes: Removed obsolete proxy sources. Improved auto proxy feature. Added SQL database. Added option to remove downloaded page data automatically.
    2018.05.15:version 1.1.6 released:
    Bug fixes: Increased timeout for pdf merging process. Fixed crash on manual pdf merging process. Updated restriction check for image download. Adapted default font settings for Mac OS. New features: Added simple view mode, added 1-click-download feature, added forward connection check for proxy server
    2018.07.06:version 1.1.7 released:
    Bug fixes: Fixed (auto) update check feature, Fixed saving user settings problem on linux