What is a webpage?
A webpage is a document written in web languages such as HTML, CSS, and JavaScript, with HTML providing the structure of the page. A webpage can combine text, graphics, links, videos, and other content, and it usually presents information about one subject. (example: 'https://en.wikipedia.org/wiki/Link' is a webpage on the Wikipedia website that gives information about the word 'Link'.)
What is a link in a webpage?
A link (hyperlink) in a webpage is a reference that connects one page to another page or resource, either on the same website or on a different one (example: <a href="https://en.wikipedia.org/wiki/Link">Link</a>. This is a hypertext reference to the 'Link' webpage on Wikipedia). In HTML, a link is written with the 'a' (anchor) tag, and the destination URL is stored in the tag's 'href' attribute.
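To make the difference between the 'a' tag and its 'href' attribute concrete, here is a minimal sketch. It uses Python's standard-library html.parser instead of Beautiful Soup, only to show the structure of a link; the class name AnchorCollector and the HTML snippet are made up for this illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the 'href' value of every 'a' tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is the tag name; 'attrs' is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# One 'a' tag with an href and one without, to show that only the first is a link
snippet = (
    '<p>See <a href="https://en.wikipedia.org/wiki/Link">Link</a> '
    'and <a name="top">no href here</a>.</p>'
)

collector = AnchorCollector()
collector.feed(snippet)
print(collector.hrefs)  # ['https://en.wikipedia.org/wiki/Link']
```

The second 'a' tag is skipped because it has no 'href' attribute, which is why a scraper should expect some 'a' tags to carry no link at all.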
get all links from a webpage | python project for Web scraping practice
With the basics covered, we can start the main work in Python. This is a beginner project in web scraping with Python. Previously, we also discussed three important web scraping projects. We know that a link lives in the 'href' attribute of an 'a' tag in an HTML page. First, we will find every 'a' tag, and then we will collect the 'href' values using the Python Beautiful Soup library.
Steps to "get all links from a webpage | python project for Web scraping practice"
1. Install the required packages (pip install requests beautifulsoup4).
2. Import all required packages.
3. Enter the URL to be analyzed.
4. Make a method to fetch the URL and parse the HTML of the page. Then, find the 'https://' or 'http://' link values in the 'href' attribute of each 'a' tag.
5. Initialize a list variable 'links' to store all link URLs as strings. The parsing is done with the Beautiful Soup library.
6. When an 'href' is found on an 'a' tag in the HTML page, append it to 'links'. Then print every value in 'links'.
7. A page can contain many links, so we also store them in a text file for further processing.
Python code for "get all links from a webpage | python project for Web scraping practice" using the Beautiful Soup library
# import all required packages
import requests
from bs4 import BeautifulSoup

# fetch the page at the requested url
def URLparser(web_url):
    # add 'https://' only when the input does not already start with a scheme
    # (the old test '("https" or "http") in web_url' only ever checked for "https")
    if web_url.startswith(("http://", "https://")):
        data_from_url = requests.get(web_url)
    else:
        data_from_url = requests.get("https://" + web_url)
    return data_from_url

if __name__ == '__main__':
    # try 'thoughtco.com' as input url
    # the 'https://' prefix is optional; it is added when missing
    web_url = input("Enter a Web Link: ").strip()
    # sending the web address
    url_from_web = URLparser(web_url)
    # parsing the html page with BeautifulSoup
    Bt_soup = BeautifulSoup(url_from_web.text, "html.parser")
    # taking a variable to store all parsed url
    links = []
    for l in Bt_soup.find_all("a"):
        href = l.get("href")
        # skip 'a' tags without an 'href' (get() returns None for them)
        if href:
            links.append(href)
    # print all fetched links
    for l in links:
        print(l)
    # Writing the output to a file (All_Links.txt)
    # change 'a' to 'w' to overwrite the text file each time
    with open("All_Links.txt", 'a') as saved:
        # write every link, one per line
        for l in links:
            print(l, file=saved)
Input:
Enter a Web Link: thoughtco.com
Output:
https://www.thoughtco.com/sciences-math-4132465
https://www.thoughtco.com/science-4132464
https://www.thoughtco.com/math-4133545
https://www.thoughtco.com/social-sciences-4133522
https://www.thoughtco.com/computer-science-4133486
https://www.thoughtco.com/animals-and-nature-4133421
https://www.thoughtco.com/humanities-4133358
https://www.thoughtco.com/history-and-culture-4133356
https://www.thoughtco.com/visual-arts-4132957
https://www.thoughtco.com/literature-4133251
https://www.thoughtco.com/english-4688281
https://www.thoughtco.com/geography-4133035
How will this Python mini-project help you?
1. Analysis of a website. You can repeat the process recursively, following each link you find to collect the links inside that page.
2. Find 'nofollow' and 'dofollow' links on a website for SEO. The 'nofollow' value is set in the 'rel' attribute of an 'a' tag in an HTML page; links without it are informally called 'dofollow'.
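Both ideas can be sketched in a few lines. This is only an illustration under stated assumptions: crawl and is_nofollow are hypothetical names, and get_links stands for any function that returns the links of one page (for example, a wrapper around the scraper above). A small dictionary plays the role of a website here so the sketch runs without network access. For the 'rel' check, note that Beautiful Soup typically returns the 'rel' attribute as a list of tokens, while other parsers may return a plain string, so the helper accepts both.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=1):
    """Breadth-first walk over links, visiting each URL once, up to max_depth hops."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


def is_nofollow(rel):
    """rel is the value of an 'a' tag's rel attribute: None, a string, or a list."""
    if not rel:
        return False
    tokens = rel.split() if isinstance(rel, str) else rel
    return any(token.lower() == "nofollow" for token in tokens)


# A made-up site: each key is a page, each value is the list of links on it
fake_site = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]}
pages = crawl("a", lambda url: fake_site.get(url, []), max_depth=2)
print(pages)  # ['a', 'b', 'c', 'd']

print(is_nofollow(["nofollow", "noopener"]))  # True
print(is_nofollow(None))                      # False
```

The visited set and the depth limit matter on a real website, since pages often link back to each other and an unbounded crawl may never finish. A real crawler should also respect the site's robots.txt and pause between requests.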