Python Notes: Completing a Relative Link into a Full URL

2023-12-25 20:05:09

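Parameters:

response_url: the URL of the page on which the link was found; it is the base for resolution, e.g. http://www.abc.com/news_list
href: the relative link to complete, e.g. /index/news1.html, ./index/news2.html, or //www.abc.com/index/news1.html
return_href: the completed URL that is returned, e.g. http://www.abc.com/index/news1.html for the href /index/news1.html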

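# URL paths always use "/", so posixpath is used rather than os.path
from posixpath import basename, dirname
from urllib.parse import urlparse


def handleHref(response_url, href):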
    """
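    1. No prefix: relative to the current directory.
    2. Leading /: relative to the site root.
    3. Leading ./: relative to the current directory (the directory containing the current page).
    4. Leading ../: relative to the parent of the current directory.
    5. Leading //: protocol-relative; reuses the current page's scheme (http or https).
    6. Leading ?: keeps the current page's path and appends ? and the parameters after it.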
    """
    if href.startswith("//"):
        # Reuse the scheme (http/https) of response_url
        http = urlparse(response_url).scheme
        return_href = http + ":" + href
    elif href.startswith("/"):
        hostname = urlparse(response_url).netloc
        prefix = urlparse(response_url).scheme + "://" + hostname
        return_href = prefix + href
    elif href.startswith("./"):
        prefix = dirname(response_url) + "/"
        # Strip only the leading "./"; replace() would also alter any later "./"
        return_href = prefix + href[2:]
    elif href.startswith("../"):
        # Drop one trailing directory from the base for each "../" in href
        dir_name = dirname(response_url) + "/"
        parts = dir_name.split("/")
        new_prefix_list = parts[0:len(parts) - href.count("../") - 1]
        new_prefix = "".join(p + "/" for p in new_prefix_list)
        return_href = new_prefix + href.replace("../", "")

    elif href.startswith("?"):
        # Some hrefs start with "?", e.g. ?page=1&id=1; response_url must first be stripped of its own query string
        url_dir = dirname(response_url)
        url_base = basename(response_url).replace(" ", "")
        url1 = url_base.split("?")
        return_href = url_dir + "/" + url1[0] + href
    else:
        # Any other form; this fallback rule is arbitrary and can be customized
        prefix = dirname(response_url) + "/"
        return_href = prefix + href

    return return_href
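
For comparison, the standard library's urllib.parse.urljoin resolves the same six link forms. The sketch below is not the function above but an alternative under stated assumptions: the helper name resolve_links and the sample page URL http://www.abc.com/news_list/page.html?id=3 are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(response_url, hrefs):
    """Resolve each href against response_url using urljoin.

    resolve_links is a hypothetical helper for this sketch, not part of
    the original note.
    """
    return [urljoin(response_url, href) for href in hrefs]


if __name__ == "__main__":
    # Hypothetical page URL used only for this demonstration
    base = "http://www.abc.com/news_list/page.html?id=3"
    samples = [
        "news1.html",                      # no prefix
        "/index/news1.html",               # root-relative
        "./index/news2.html",              # current directory
        "../index/news3.html",             # parent directory
        "//www.abc.com/index/news1.html",  # protocol-relative
        "?page=1&id=1",                    # query only
    ]
    for href, full in zip(samples, resolve_links(base, samples)):
        print(href, "->", full)
```

One difference worth noting: urljoin treats the last path segment of the base as a file unless the base ends with a slash, so http://www.abc.com/news_list and http://www.abc.com/news_list/ resolve relative links differently.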

Source: https://blog.csdn.net/qq_23730073/article/details/135202466