[Python Notes] A Source-Code Analysis of Python's os.walk

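Everyday Python programming often calls for traversing the files under a directory, and os.walk is one tool that meets this need. Anyone unfamiliar with it is likely to stumble the first few times they use it. This post therefore reads through the source code to explain how os.walk works, so that you come away with a deeper understanding of the function.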

Taking Python 3 as the example, the source code of os.walk is as follows:

def walk(top, topdown=True, onerror=None, followlinks=False):
    """Directory tree generator.

    For each directory in the directory tree rooted at top (including top
    itself, but excluding '.' and '..'), yields a 3-tuple

        dirpath, dirnames, filenames

    dirpath is a string, the path to the directory.  dirnames is a list of
    the names of the subdirectories in dirpath (excluding '.' and '..').
    filenames is a list of the names of the non-directory files in dirpath.
    Note that the names in the lists are just names, with no path components.
    To get a full path (which begins with top) to a file or directory in
    dirpath, do os.path.join(dirpath, name).

    If optional arg 'topdown' is true or not specified, the triple for a
    directory is generated before the triples for any of its subdirectories
    (directories are generated top down).  If topdown is false, the triple
    for a directory is generated after the triples for all of its
    subdirectories (directories are generated bottom up).

    When topdown is true, the caller can modify the dirnames list in-place
    (e.g., via del or slice assignment), and walk will only recurse into the
    subdirectories whose names remain in dirnames; this can be used to prune the
    search, or to impose a specific order of visiting.  Modifying dirnames when
    topdown is false has no effect on the behavior of os.walk(), since the
    directories in dirnames have already been generated by the time dirnames
    itself is generated. No matter the value of topdown, the list of
    subdirectories is retrieved before the tuples for the directory and its
    subdirectories are generated.

    By default errors from the os.scandir() call are ignored.  If
    optional arg 'onerror' is specified, it should be a function; it
    will be called with one argument, an OSError instance.  It can
    report the error to continue with the walk, or raise the exception
    to abort the walk.  Note that the filename is available as the
    filename attribute of the exception object.

    By default, os.walk does not follow symbolic links to subdirectories on
    systems that support them.  In order to get this functionality, set the
    optional argument 'followlinks' to true.

    Caution:  if you pass a relative pathname for top, don't change the
    current working directory between resumptions of walk.  walk never
    changes the current directory, and assumes that the client doesn't
    either.

    Example:

    import os
    from os.path import join, getsize
    for root, dirs, files in os.walk('python/Lib/email'):
        print(root, "consumes", end="")
        print(sum(getsize(join(root, name)) for name in files), end="")
        print("bytes in", len(files), "non-directory files")
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories

    """
    top = fspath(top)
    dirs = []
    nondirs = []
    walk_dirs = []

    # We may not have read permission for top, in which case we can't
    # get a list of the files the directory contains.  os.walk
    # always suppressed the exception then, rather than blow up for a
    # minor reason when (say) a thousand readable directories are still
    # left to visit.  That logic is copied here.
    try:
        # Note that scandir is global in this module due
        # to earlier import-*.
        scandir_it = scandir(top)
    except OSError as error:
        if onerror is not None:
            onerror(error)
        return

    with scandir_it:
        while True:
            try:
                try:
                    entry = next(scandir_it)
                except StopIteration:
                    break
            except OSError as error:
                if onerror is not None:
                    onerror(error)
                return

            try:
                is_dir = entry.is_dir()
            except OSError:
                # If is_dir() raises an OSError, consider that the entry is not
                # a directory, same behaviour than os.path.isdir().
                is_dir = False

            if is_dir:
                dirs.append(entry.name)
            else:
                nondirs.append(entry.name)

            if not topdown and is_dir:
                # Bottom-up: recurse into sub-directory, but exclude symlinks to
                # directories if followlinks is False
                if followlinks:
                    walk_into = True
                else:
                    try:
                        is_symlink = entry.is_symlink()
                    except OSError:
                        # If is_symlink() raises an OSError, consider that the
                        # entry is not a symbolic link, same behaviour than
                        # os.path.islink().
                        is_symlink = False
                    walk_into = not is_symlink

                if walk_into:
                    walk_dirs.append(entry.path)

    # Yield before recursion if going top down
    if topdown:
        yield top, dirs, nondirs

        # Recurse into sub-directories
        islink, join = path.islink, path.join
        for dirname in dirs:
            new_path = join(top, dirname)
            # Issue #23605: os.path.islink() is used instead of caching
            # entry.is_symlink() result during the loop on os.scandir() because
            # the caller can replace the directory entry during the "yield"
            # above.
            if followlinks or not islink(new_path):
                yield from walk(new_path, topdown, onerror, followlinks)
    else:
        # Recurse into sub-directories
        for new_path in walk_dirs:
            yield from walk(new_path, topdown, onerror, followlinks)
        # Yield after recursion if going bottom up
        yield top, dirs, nondirs

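os.walk takes four parameters: top, topdown, onerror and followlinks. top is the root directory to traverse. topdown controls whether the traversal runs top-down. onerror is a callback that receives the OSError raised when listing a directory fails; by default such errors are silently ignored. followlinks controls whether symbolic links that point to directories are followed.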
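
The onerror behaviour is easy to miss because the default is silence. The following is a minimal sketch, assuming a deliberately missing directory as the failure case; `collect_walk_errors` and the `does_not_exist` path are hypothetical names introduced for illustration only.

```python
import os
import tempfile


def collect_walk_errors(top):
    """Walk `top` and return (visited_dirs, errors). Hypothetical helper."""
    errors = []
    # errors.append is used as the onerror callback: it is called with
    # one argument, the OSError instance, as the docstring describes.
    visited = [dirpath for dirpath, _dirs, _files in os.walk(top, onerror=errors.append)]
    return visited, errors


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "does_not_exist")  # never created on purpose
    visited, errors = collect_walk_errors(missing)
    # scandir() fails on the missing directory, so nothing is yielded and
    # the exception is handed to the callback instead of being raised.
    print(visited)
    print(type(errors[0]).__name__, errors[0].filename)
```

Without the callback, the same call would simply yield nothing, which is why a mistyped top path can look like an empty directory.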

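As the source and its comments show, os.walk returns a generator. A generator produces each next value from the state left by the previous step, so the sequence it yields can in principle be unbounded, yet it stays cheap because it keeps only a small amount of state. Imagine traversing a directory tree that holds a huge number of files: without a generator, every result would have to be collected and handed to the caller at once, which would take a considerable amount of memory. Note that os.walk still builds the complete dirs and nondirs lists for one directory before yielding it, so the saving is across directories rather than within a single one. Each time it reaches a directory, os.walk does the following: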

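Try to obtain an iterator from os.scandir. Unlike listdir, which returns a complete list of names, scandir returns a lazy iterator of os.DirEntry objects. If this call fails, the error goes to onerror (when one is given) and the function returns.
Inside the `with scandir_it` block, call next repeatedly to fetch the remaining entries of the directory.
Check whether each entry is a directory, and append its name to dirs or nondirs accordingly.
If the traversal is bottom-up rather than top-down, also append the paths of the subdirectories found to walk_dirs so that they are visited first. Symbolic links to directories are included only when followlinks is true.
Once dirs and nondirs are complete, act according to the traversal direction:
- For top-down traversal, first yield (current directory, list of immediate subdirectories, list of immediate non-directory files), then recurse into each name left in dirs. Notably, the caller can modify the subdirectory list in place, for example to prune the search.
- For bottom-up traversal, first yield from the walk of each directory in walk_dirs, which produces the triples for those directories and everything below them, and only then yield the triple for the current directory.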
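
The two branches described above can be observed directly. This is a small sketch, assuming a throwaway tree created in a temporary directory; `build_tree`, `walk_pruned`, `walk_bottom_up` and the `keep`/`skip` names are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def build_tree(root):
    # Hypothetical layout: root/keep/x.txt and root/skip/y.txt
    for sub, fname in (("keep", "x.txt"), ("skip", "y.txt")):
        os.makedirs(os.path.join(root, sub))
        open(os.path.join(root, sub, fname), "w").close()


def walk_pruned(root, skip_name):
    """Top-down walk that never descends into directories named skip_name."""
    visited = []
    for dirpath, dirnames, _filenames in os.walk(root, topdown=True):
        # Slice assignment mutates the very list walk() recurses over next;
        # rebinding with `dirnames = [...]` would have no effect.
        dirnames[:] = sorted(d for d in dirnames if d != skip_name)
        visited.append(os.path.relpath(dirpath, root))
    return visited


def walk_bottom_up(root):
    """Bottom-up walk: every subdirectory is yielded before its parent."""
    return [os.path.relpath(dirpath, root)
            for dirpath, _dirs, _files in os.walk(root, topdown=False)]


with tempfile.TemporaryDirectory() as tmp:
    build_tree(tmp)
    pruned = walk_pruned(tmp, "skip")
    bottom_up = walk_bottom_up(tmp)
    print(pruned)     # the root first, then only the directory that was kept
    print(bottom_up)  # both subdirectories (in scandir order), the root last
```

Pruning works only with topdown=True because the triple is yielded before the recursion; in the bottom-up case the subdirectories have already been walked by the time the caller sees dirnames.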

That is roughly how os.walk works. It is also a very typical example of putting generators to use, and well worth studying and revisiting when writing Python.
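
To see the generator behaviour concretely, the sketch below pulls one triple at a time from os.walk. It assumes a small temporary tree with the hypothetical directory names `a` and `a/b`; nothing here is specific to any real project layout.

```python
import os
import tempfile
import types

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))

    walker = os.walk(tmp)  # no directory has been scanned yet
    is_generator = isinstance(walker, types.GeneratorType)

    # Each next() advances the walk by exactly one directory.
    first = next(walker)   # the triple for the root directory
    rest = list(walker)    # drain the remaining triples

    print(is_generator)
    print(first)
    print(len(rest))
```

Because the caller drives the iteration, it can stop early (for instance with `break`) and the directories not yet reached are never scanned.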