Html5lib not found, please install it

html5lib is a pure-Python package that implements the HTML5 parsing algorithm, and a common problem when working with it is the error “html5lib not found, please install it”. The library is designed to conform to the WHATWG HTML specification, which all major web browsers implement, and you will use it heavily when writing Python web scraping programs.

When working with the library, especially in specific development environments such as a virtual environment or Jupyter Notebook, you may run into the error “ImportError: html5lib not found, please install it”. This fairly simple error can be solved in a few steps. This guide discusses the three most common cases in which the error arises.

Why is the “ImportError: html5lib not found, please install it” error occurring?

Library not installed correctly

If you are using a virtual environment, check that the library was installed inside it and not accidentally installed globally on the host system. If your virtual environment is not active, open a terminal in the folder that contains the virtual environment files, then run the following command (on Windows, the equivalent script is Scripts\activate):

source bin/activate

This runs the “activate” shell script, which enables the virtual environment. If there are no errors and the name of your virtual environment appears in your terminal prompt, check your installed packages next. Type this command into the command line (in a Jupyter Notebook cell, prefix it with an exclamation mark: !pip list):

pip list

This will list all the installed packages in the virtual environment; if a package called “html5lib” does not show up in the list, it means the package has not been installed or has been installed globally instead.
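The same check can also be done from inside Python, which helps in a notebook where it is not obvious which interpreter is active. The following is a minimal sketch using only the standard library; the helper name describe_package is made up for illustration.

```python
import importlib.util
import sys


def describe_package(name):
    """Report whether `name` is importable by the interpreter running this code."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,  # the Python binary actually in use
        "in_virtualenv": sys.prefix != sys.base_prefix,  # True inside a venv
        "installed": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


report = describe_package("html5lib")
for key, value in report.items():
    print(f"{key}: {value}")
```

If "installed" is False while pip list shows the package, pip and the running interpreter most likely belong to different environments.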

File path should be a raw string literal

Sometimes, even when the library is correctly installed, pd.read_html can fail because of how the file path is written. In an ordinary Python string literal, a backslash starts an escape sequence, so a Windows path needs to be written as a raw string literal (with the r prefix). For example, take a look at this code:

import pandas as pd
df = pd.read_html("C:\xxxxx\a.htm")

This code will give us the error even with the html5lib library correctly installed, because the backslashes in the path are treated as escape sequences (\a, for instance, becomes the bell character), so pd.read_html never receives the path you intended. The file path needs to be passed as a raw string literal. This error occurs more frequently on Python versions before 3.0 (that is, Python 2), so upgrading your Python version is also worth considering.
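To see what happens to a backslash in an ordinary string, here is a small standalone sketch. It uses a shortened, hypothetical file name (\a.htm) rather than the full path from the example above.

```python
# In an ordinary string, "\a" is one character: the ASCII bell (code 7).
plain = "\a.htm"
# With the r prefix, the backslash is kept as a literal backslash.
raw = r"\a.htm"

print(len(plain), len(raw))  # 5 6
print(ord(plain[0]))         # 7
print(repr(plain))           # shows the control character as an escape
print(repr(raw))             # shows the backslash preserved
```

Writing the path with forward slashes or doubling each backslash avoids the problem as well; the r prefix is simply the least intrusive change.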

Restarting Jupyter Notebook

A kernel error in Jupyter Notebook can occur for various reasons, including a problem with the kernel itself or an issue with the code you’re trying to run. Sometimes the Jupyter Notebook environment gets out of sync, and you may see an unexpected error even with all the libraries correctly installed. This is fairly common. The kernel status is shown in the top right corner of the notebook. If it says “Not Connected” or “Connecting”, wait a few minutes and then refresh the page. If you have several Jupyter servers running locally, you can run the following command in a terminal to list them:

jupyter server list --json

Each running server also tracks one session per open notebook. A session record, as returned by the server’s /api/sessions endpoint, looks similar to this:

{'id': '73109856-1658-4abb-b850-6f011325eff5',
  'path': 'Untitled.ipynb',
  'name': 'Untitled.ipynb',
  'type': 'notebook',
  'kernel': {'id': '45b29d0c-3a72-416b-a964-7a04f0c637ef',
   'name': 'python3',
   'last_activity': '2022-07-21T13:39:00.822405Z',
   'execution_state': 'idle', 
   'connections': 1},
  'notebook': {'path': 'Untitled.ipynb', 'name': 'Untitled.ipynb'}},

The “execution_state” field tells us the status of each individual kernel in the Jupyter Notebook environment.
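If you have several sessions, reading the state of each kernel by eye gets tedious. The sketch below pulls the “execution_state” out of a list of session records. The sample data is modeled on the listing above and is illustrative only, and the function name kernel_states is an assumption.

```python
# Sample session records shaped like the listing above (illustrative values).
sessions = [
    {
        "id": "73109856-1658-4abb-b850-6f011325eff5",
        "path": "Untitled.ipynb",
        "type": "notebook",
        "kernel": {
            "id": "45b29d0c-3a72-416b-a964-7a04f0c637ef",
            "name": "python3",
            "execution_state": "idle",
        },
    },
]


def kernel_states(session_list):
    """Map each notebook path to the execution state of its kernel."""
    return {s["path"]: s["kernel"]["execution_state"] for s in session_list}


print(kernel_states(sessions))  # {'Untitled.ipynb': 'idle'}
```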

How do I solve the “html5lib not found, please install it” error?

Installing the library correctly

If your library does not show up on the list after executing the pip list command, you might want to consider reinstalling the library in the virtual environment. It’s as simple as:

pip3 install html5lib

Note that your virtual environment must be active before you run this command; otherwise the library will be installed globally and will again be inaccessible from your virtual environment. After running the command, run pip list once more and check that the library shows up this time.
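One way to rule out a mismatch between pip and the interpreter is to invoke pip through the interpreter itself. The sketch below builds such a command with sys.executable; the helper names pip_install_command and install are hypothetical, and the actual installation is left as an explicit call so that nothing is installed by accident.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip for the current interpreter; raises CalledProcessError on failure."""
    subprocess.run(pip_install_command(package), check=True)


print(" ".join(pip_install_command("html5lib")))
# To actually install the package, call: install("html5lib")
```

Because the command starts with the running interpreter's own path, the package lands in the same environment that will later try to import it.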

Make the file path a raw string literal

The faulty code from the file path case above gives us the “html5lib not found, please install it” error because the backslashes in the path are interpreted as escape sequences, so the path must be written as a raw string literal. The solution is as simple as putting an r in front of the file path:

import pandas as pd
df = pd.read_html(r"C:\xxxxx\a.htm")

This will execute the code correctly. pd.read_html returns a list of DataFrames, one per table found in the file, and that list is stored in df, so nothing is printed; the absence of an error means the code ran correctly. As noted earlier, upgrading to Python 3 or later also helps.
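A mistyped path can surface as a confusing parser error, so it can help to verify the file exists before handing it to pandas. The following is a sketch under that assumption; read_tables is a hypothetical helper, and pandas is imported inside the function so that the path check works on its own.

```python
from pathlib import Path


def read_tables(path_str):
    """Load all tables from a local HTML file, failing early on a bad path."""
    path = Path(path_str)
    if not path.is_file():
        # !r shows stray control characters left by unescaped backslashes
        raise FileNotFoundError(f"No such file: {path_str!r}")
    import pandas as pd  # imported lazily; only needed once the path is valid

    return pd.read_html(str(path))  # a list of DataFrames, one per table


# Hypothetical usage with the placeholder path from the example above:
# tables = read_tables(r"C:\xxxxx\a.htm")
# print(len(tables), "table(s) found")
```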

Restart/Reinstall your Jupyter Notebook

Open the notebook you wish to work in and check the kernel status as described in the causes section above. If it says “Not Connected” or “Connecting”, wait a few minutes and then refresh the page. Click on the “Kernel” menu in the toolbar and select “Restart”. This will restart the kernel and clear any existing variables or data in the notebook. If the kernel is stuck or in an error state, this may fix the problem.

If the selected kernel is outdated, it may be causing compatibility issues. You can switch to a different installed kernel by clicking on the “Kernel” menu in the toolbar and selecting “Change Kernel”. Another possible cause is a stale kernel spec that points to an environment that has since been removed. You can remove such a kernel spec with the following command (replace python3 with the name of the stale kernel spec):

jupyter kernelspec remove python3
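Before removing anything, it is worth confirming which kernel specs are registered and where they live. The sketch below wraps the jupyter kernelspec list --json command; the function name list_kernel_specs is an assumption, and it returns an empty dictionary when the jupyter command is unavailable or its output cannot be parsed.

```python
import json
import shutil
import subprocess


def list_kernel_specs():
    """Return {kernel name: info} from `jupyter kernelspec list --json`."""
    if shutil.which("jupyter") is None:
        return {}  # jupyter is not on PATH in this environment
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return {}
    try:
        return json.loads(result.stdout).get("kernelspecs", {})
    except json.JSONDecodeError:
        return {}  # output was not clean JSON


for name, info in list_kernel_specs().items():
    print(name, "->", info.get("resource_dir"))
```

A kernel spec whose directory points into a deleted environment is a likely candidate for removal.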

In extreme cases, you may need to reinstall the entire kernel. Remove its kernel spec with the command shown above, then reinstall the kernel and try running your code again.

FAQs

What is the html5lib module used for?

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement.

How do I check my Python version?

Open your terminal or activated virtual environment and type python --version. We recommend upgrading if it reports anything below version 3, as Python 2 reached end of life in 2020 and is more prone to errors like this one.
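In a notebook, the version of the kernel's interpreter may differ from the one your terminal reports, so it can be checked from code as well. This is a minimal sketch using the standard sys module.

```python
import sys

# Full version string of the interpreter running this code
print(sys.version)

major, minor = sys.version_info[:2]
if major < 3:
    print("Python 2 detected; consider upgrading.")
else:
    print("Running Python {}.{}".format(major, minor))
```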

How do I install a new Jupyter Notebook kernel?

Install the ipykernel package into the virtual environment of the Python version you wish to use, then register that environment as a kernel by running python -m ipykernel install --user --name followed by a name of your choice. Reload Jupyter Notebook, and the new kernel should appear in the list of available kernels.

Conclusion

In this article, we looked at why the “html5lib not found, please install it” error occurs. It commonly arises in three cases: the library is not installed correctly or not installed in the virtual environment, the file path provided is not a raw string literal, or the Jupyter Notebook kernel is not working properly. The respective fixes are to install the module correctly in the environment you are using, to put the r prefix before the file path as shown in the pd.read_html example above, and to check and restart your Jupyter Notebook kernel.

ImportErrors like these are among the most common errors in Python, especially when working with web scraping and in the Jupyter Notebook environment. The problems, however, are relatively simple to resolve by changing a few lines of code or checking the environment. So keep calm when dealing with errors, and happy coding!


Follow us at PythonClear to learn more about solutions to general errors one may encounter while programming in Python.
