
Add type annotations #118

Merged

Conversation

Sachaa-Thanasius
Contributor

@Sachaa-Thanasius Sachaa-Thanasius commented Jun 14, 2024

First off, apologies for the last PR. It was unreviewable, not productive to have as an open PR, and overall bad form. I hope I can show I've learned from my mistakes and thus that this one is a far sight better.

The scope of this PR is just adding a bit of typing configuration, some typing checks, and some type annotations to make things more reasonable to review and manageable to iterate on. I'm more than willing to take feedback on how to go about this, but I've started small (at least, I hope it's small). Any guidance would be appreciated.

(I used #105 and some of the comments made in there to inform some starting choices. For example, `import typing as t` was suggested as a way to keep annotation verbosity low (or so I'm guessing) while maintaining namespace separation (which was actually said), so I used that here.)

EDIT: Voids #115, closes #119.

Sachaa-Thanasius and others added 4 commits June 14, 2024 16:49
- Add a typing configuration file with pyright set to strict.
- Add annotations for normalizers.py and validators.py.
@Sachaa-Thanasius Sachaa-Thanasius changed the title Add a few annotations. Add type annotations Jun 14, 2024
Sachaa-Thanasius and others added 8 commits June 14, 2024 18:23
…ll as a py.typed file.

- The py.typed file is very early, but it's necessary to verify type completeness via `pyright --verifytypes`. Seems like the right kind of typing check, but that's assuming pyright and not mypy is ultimately used.
 - Reference: https://microsoft.github.io/pyright/#/typed-libraries?id=verifying-type-completeness
- Add typing testenv to tox.ini, which will run `pyright --verifytypes rfc3986`.
- Add one more matrix slot to the GitHub workflow to run the above typing check in CI on the lowest supported version of python on Ubuntu.
 - I only added it for Ubuntu because this package and its dependencies are pure python, so the types shouldn't change between operating systems.
- TODO: Consider changing the insides of the URIBuilder.add_* methods to use `type(self)(...)` instead of `URIBuilder(...)`. That way, subclassing is supported. Would result in the return annotations being Self as well.
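The `type(self)(...)` TODO above might look roughly like this. This is a sketch with a simplified, hypothetical builder; the real `URIBuilder` has more fields, and `_Self` here is a pre-3.11 stand-in for `typing.Self`:

```python
import typing as t

# On Python < 3.11 there is no typing.Self; a bound TypeVar is the
# usual stand-in for "an instance of whatever class this is".
_Self = t.TypeVar("_Self", bound="URIBuilder")


class URIBuilder:
    def __init__(self, scheme: t.Optional[str] = None) -> None:
        self.scheme = scheme

    def add_scheme(self: _Self, scheme: str) -> _Self:
        # type(self)(...) instead of URIBuilder(...), so a subclass
        # gets back an instance of itself rather than the base class.
        return type(self)(scheme=scheme.lower())


class CustomBuilder(URIBuilder):
    pass


built = CustomBuilder().add_scheme("HTTPS")
```

With a hard-coded `URIBuilder(...)` call inside `add_scheme`, `built` would be a plain `URIBuilder`; with `type(self)(...)` it stays a `CustomBuilder`.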
- Added trailing commas to function parameters in normalizers.py and validators.py for functions with multi-line signatures.
- I'm not sure these functions are truly necessary if the library no longer supports Python 2, but their use with `encoding` is prevalent enough around the library that they seem worth typing.
- The overloads are necessary to account for None being a passthrough value.
 - Two cases of this are `ParseResult.copy_with()` and `ParseResultBytes.copy_with()`. The `attrs_dict` dictionary in those methods is allowed to have None, and None is allowed for all of the component parameters (from what I can tell). Thus, whether intentional or not, the library depends on `compat.to_(bytes|str)()`'s ability to let values pass through without modification so long as they aren't bytes.
- These are passed into `urllib.parse.urlencode()`, so the annotation for that was copied from typeshed and modified slightly.
 - `_QueryType` in typeshed uses `typing.Sequence` to account for different types of sequence values being passed in, but `URLBuilder.extend_query_with()` in builder.py uses `isinstance(query_items, list)` as flow control, so Sequence is too wide to account for what that method allows. Thus, the type was modified to use `typing.List` in place of `typing.Sequence` where necessary.
 - Arguably, that isinstance check should be changed to check against `Sequence` instead, but I'd prefer having that double-checked, and this PR's scope is currently mostly limited to annotations anyway. This can be revisited later if necessary.
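The None-passthrough overloads described above can be sketched like so. This uses a hypothetical `to_str` modeled on the behavior described, not necessarily the library's exact signature:

```python
import typing as t


@t.overload
def to_str(obj: t.Union[str, bytes], encoding: str = ...) -> str: ...
@t.overload
def to_str(obj: None, encoding: str = ...) -> None: ...
def to_str(
    obj: t.Optional[t.Union[str, bytes]], encoding: str = "utf-8"
) -> t.Optional[str]:
    """Decode bytes to str; let str and None pass through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj
```

With the overloads in place, a type checker knows `to_str(None)` is `None` while `to_str(b"x")` is `str`, which is the passthrough behavior `copy_with()` relies on.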
- TODO: Ask if `username is None` check is still needed in `URLBuilder.add_credentials()` if the `username` parameter is annotated with `str`.
- Ignore need for docstrings in function overloads.
- Ignore a line being too long due to a pyright: ignore + explanation.
 - Moving the explanation away from that line doesn't make the line short enough to pass. Removing the error code as well would be enough, but removing both has 2 downsides:
  a) The lack of explanation on the same line makes it harder to immediately see why the pyright: ignore was added when grepping.
  b) Having an unrestricted pyright: ignore (or type: ignore) can cover up other typechecking errors, which could cause a problem down the line.
 - Having a pyright: ignore, a noqa, and fmt: off/on for this one line isn't clean, but I'd like a second opinion on how to handle it.
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 15, 2024

Something I'm unsure about: the use of compat.to_(str|bytes) throughout the library allows the main functions in api to accept bytes and str for some of their parameters and still return a valid result. However, the docstrings mention just str as the acceptable type for the arguments (e.g. api.urlparse). It's a bit unclear to me what the intended type for such parameters is meant to be if the package nowadays only supports Python 3.8+, where strings are encoded with utf-8 by default; it's thus unclear to me whether the str annotations I just added on some of the api function parameters are incorrect.

If the public API is meant to take only utf-8-encoded strings, then the compat functions can be eliminated, the encoding parameters everywhere become superfluous, and many things can be annotated as str. If the public API is meant to also take bytes now, the docstrings will need changing to indicate that and many things will have to be annotated with t.Union[bytes, str]. That's how I see it, at least.

- `uri` can't be eagerly imported within `misc` without causing a circular import, but is necessary for getting the most currently correct parameter type.
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 15, 2024

Something to consider: the stdlib counterparts to ParseResult and ParseResultBytes are defined in typeshed as subclasses of a mixin and a generic namedtuple (to account for str | None and bytes | None attributes respectively). However, that can't be done reasonably at runtime pre-3.11 since support for generic namedtuples wasn't added until 3.11. Thus, maintaining a type interface close to that while keeping "inline" typing features & annotations and inheritance from namedtuple doesn't seem doable to me.

Some ideas for alternatives:

  1. Use different typed namedtuples as base classes for ParseResult and ParseResultBytes with str | None and bytes | None attribute annotations respectively. It would be a natural extension of the current implementation, but it adds a slight maintenance burden, since both classes would then have to be kept in sync in attribute names and (somewhat) attribute types while remaining statically analyzable. Having a single generic base class for both could alleviate that.

  2. Take the typeshed route and use a type stub file, though I'm less keen about that since the features & annotations within the stub won't be easily introspectable at runtime (e.g. via inspect.get_annotations(), typing.get_type_hints(), even class.__mro__ if someone's checking for Generic, etc.).

  3. Use dataclasses.dataclass instead, since that can decorate a subclass of Generic. Additionally, much of the namedtuple interface can be emulated within it if necessary, e.g. tuple.__iter__() with a manually defined __iter__(), namedtuple._replace() with dataclasses.replace(), etc.

1 feels like a way to avoid using a generic (due to runtime limitations) in a place where it makes sense to use it, and makes the classes harder to maintain and possibly extend as a result. 2 would be most faithful to the inheritance hierarchy and implementation of the stdlib counterparts, I think. 3 would be a significant change in API, assuming the namedtuple interface and inheritance was considered a guaranteed part of the API.

Imo, 2 is the least breaking but also the least forward-moving, if that makes sense. Not my choice, ultimately, but I figured I'd lay out what options I could think of.

EDIT: These circumstances somewhat apply to URIReference and IRIReference as well, since they could very easily be refactored to share a base class, but the runtime issue with generic inheritance isn't relevant; typing.NamedTuple would work. A base class would also help in typing functions/methods which may in time be changed and/or documented to accept both types of "references" as arguments.
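For illustration, option 3 might look roughly like this. The attribute names here are stand-ins, not the real `ParseResult` fields:

```python
import dataclasses
import typing as t


@dataclasses.dataclass(frozen=True)
class _ResultBase(t.Generic[t.AnyStr]):
    # str for a ParseResult-like class, bytes for a ParseResultBytes-like one.
    scheme: t.Optional[t.AnyStr]
    host: t.Optional[t.AnyStr]

    def __iter__(self) -> t.Iterator[t.Optional[t.AnyStr]]:
        # Emulates tuple.__iter__ from the namedtuple interface.
        return iter(dataclasses.astuple(self))


res = _ResultBase(scheme="https", host="example.com")
# dataclasses.replace stands in for namedtuple._replace.
res2 = dataclasses.replace(res, host="python.org")
```

The frozen dataclass keeps the value-semantics feel of a namedtuple while allowing `Generic` in the bases on any supported Python version.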

@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 15, 2024

Followup regarding my ideas for ParseResult, ParseResultBytes, etc. (linked #118 (comment)): after more thought, this PR doesn't have to make such a big decision; it can still type the classes to some degree of satisfaction via class-level annotations temporarily. This doesn't have to be perfect on the first go; that's just something I'm projecting. If the interface & implementation for those classes can/should be changed more drastically, we can make that discussion into its own issue.

Same sentiment applies to some degree for the compat functions and "str vs. Union[bytes, str]" questions. (I'll note that discussion in Discord indicated that I'd misunderstood how str and bytes work in Python, and now I'm thinking the compat functions are necessary, the encoding parameters are necessary, and the main API functions should be documented as accepting bytes and str).

Didn't mean to pile everything on here at once. Hoping this reduces the pressure of the questions & comments I've thrown out so far.

@sigmavirus24
Collaborator

> Something I'm unsure about: the use of compat.to_(str|bytes) throughout the library allows the main functions in api to accept bytes and str for some of their parameters and still return a valid result. However, the docstrings mention just str as the acceptable type for the arguments (e.g. api.urlparse). It's a bit unclear to me what the intended type for such parameters is meant to be if the package nowadays only supports Python 3.8+, where strings are encoded with utf-8 by default; it's thus unclear to me whether the str annotations I just added on some of the api function parameters are incorrect.
>
> If the public API is meant to take only utf-8-encoded strings, then the compat functions can be eliminated, the encoding parameters everywhere become superfluous, and many things can be annotated as str. If the public API is meant to also take bytes now, the docstrings will need changing to indicate that and many things will have to be annotated with t.Union[bytes, str]. That's how I see it, at least.

So yes, this project started in 2014 when the deprecation of 2.7 was far far away.

I think that there's benefit to accepting both as someone may be dealing with raw bytes from a socket and not want to deal with round-tripping or knowing the encoding the other end intended (as enough specifications and implementations predate good utf8 support). The doc strings should be clarified and the signatures should be clear about this too.

@sigmavirus24
Collaborator

> Something to consider: the stdlib counterparts to ParseResult and ParseResultBytes are defined in typeshed as subclasses of a mixin and a generic namedtuple (to account for str | None and bytes | None attributes respectively). However, that can't be done reasonably at runtime pre-3.11 since support for generic namedtuples wasn't added until 3.11. Thus, maintaining a type interface close to that while keeping "inline" typing features & annotations and inheritance from namedtuple doesn't seem doable to me.
>
> Some ideas for alternatives:
>
>   1. Use different typed namedtuples as base classes for ParseResult and ParseResultBytes with str | None and bytes | None attribute annotations respectively. It would be a natural extension of the current implementation, but it adds a slight maintenance burden, since both classes would then have to be kept in sync in attribute names and (somewhat) attribute types while remaining statically analyzable. Having a single generic base class for both could alleviate that.
>
>   2. Take the typeshed route and use a type stub file, though I'm less keen about that since the features & annotations within the stub won't be easily introspectable at runtime (e.g. via inspect.get_annotations(), typing.get_type_hints(), even class.__mro__ if someone's checking for Generic, etc.).
>
>   3. Use dataclasses.dataclass instead, since that can decorate a subclass of Generic. Additionally, much of the namedtuple interface can be emulated within it if necessary, e.g. tuple.__iter__() with a manually defined __iter__(), namedtuple._replace() with dataclasses.replace(), etc.
>
> 1 feels like a way to avoid using a generic (due to runtime limitations) in a place where it makes sense to use it, and makes the classes harder to maintain and possibly extend as a result. 2 would be most faithful to the inheritance hierarchy and implementation of the stdlib counterparts, I think. 3 would be a significant change in API, assuming the namedtuple interface and inheritance was considered a guaranteed part of the API.
>
> Imo, 2 is the least breaking but also the least forward-moving, if that makes sense. Not my choice, ultimately, but I figured I'd lay out what options I could think of.
>
> EDIT: These circumstances somewhat apply to URIReference and IRIReference as well, since they could very easily be refactored to share a base class, but the runtime issue with generic inheritance isn't relevant; typing.NamedTuple would work. A base class would also help in typing functions/methods which may in time be changed and/or documented to accept both types of "references" as arguments.

I think 2 is likely better until we can match the typeshed behavior "inline" by dropping everything before 3.11. I know it has limitations but I'd rather keep the code as consistent as possible first then iterate on it to get to a better typing place if necessary.

… well.

- This provides parity with the way urllib.parse is typed and the runtime implementation, since bytearray isn't a str and also has a `decode(encoding)` method.
- Adjusted compat functions to also accept bytearray for the same reasons.
- Adjusted the parameter types of the other functions called in api.py that are part of the `(U|I)RIReference` interfaces to make sure api.py fully type-checks.
- TODO: Consider making a type alias for `t.Union[str, bytes, bytearray]` and using that everywhere to avoid verbosity?
- TODO: For the **kwargs functions in api.py and `URLMixin.is_valid()`, consider enumerating the possible keyword-only parameters?
- Substituted namedtuple inheritance with common base `typing.NamedTuple` subclass in misc.py, since these classes share almost the exact same interface.
- Added a _typing_compat.py module to be able to import typing.Self, or a placeholder for it, in multiple other modules without bloating their code.
- Added basic method annotations to the two reference classes.
- Not annotations-related:
 - Moved the __hash__ implementation over to IRIReference from URIMixin to be congruent with URIReference.
 - Made the __eq__ implementations more similar to avoid different behavior in cases of inheritance (rare as that might be).
 - Added overloads to `normalizers.normalize_query` and `normalizers.normalize_fragment` to clearly indicate that None will get passed through. This behavior is relied upon by the library currently.

- Note: The runtime-related changes can be reverted and reattempted later if need be. Still passing all the tests currently.
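The `_typing_compat.py` module mentioned above is commonly written as a version-gated import. A minimal sketch, with the caveat that the real module may differ:

```python
# _typing_compat.py (sketch)
import sys
import typing

if sys.version_info >= (3, 11):
    Self = typing.Self
else:
    # Runtime-only placeholder; a static type checker targeting 3.11+
    # resolves typing.Self properly instead of seeing this branch.
    Self = typing.Any
```

Other modules can then do `from ._typing_compat import Self` without each repeating the version check.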
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 15, 2024

Ah, CI is hitting the conflict between flake8-import-order and reorder-python-imports that came up in #105, I think. One of the two might need to be removed.

…er and close enough to urllib.parse's interfaces.

- bytearray would end up pervading everything, I think, to the point where it's not worth it. I was hasty in adding those initially.
- After review, we don't actually need a generic namedtuple to make this work in a type stub. Inline annotations seemingly work fine so far and don't significantly change the runtime. This might be option 4, which can hold until 3.11 is the lowest supported version.
 - However, in the meantime, `ParseResultMixin` can be made generic in a helpful way with `typing.AnyStr`.
 - `port` is a strange case and might be annotated too narrowly or incorrectly; based on some of the ways that it's populated, it might sometimes be an int.
- Prefixed `_typing_compat.Self` imports with underscore to avoid polluting the namespace with public variables.
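Making the mixin generic over `typing.AnyStr`, as mentioned above, might look roughly like this. The attribute and property names are hypothetical; the real `ParseResultMixin` wraps more fields:

```python
import typing as t


class ParseResultMixin(t.Generic[t.AnyStr]):
    # Concrete subclasses supply this attribute (str or bytes, possibly None).
    host: t.Optional[t.AnyStr]

    @property
    def hostname(self) -> t.Optional[t.AnyStr]:
        # The return type narrows to Optional[str] or Optional[bytes]
        # depending on the subclass's parameterization.
        return self.host


class StrResult(ParseResultMixin[str]):
    def __init__(self, host: t.Optional[str]) -> None:
        self.host = host


r = StrResult("example.com")
```

A bytes-flavored subclass would simply inherit from `ParseResultMixin[bytes]`, and a checker would infer `bytes | None` for the same properties.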
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 16, 2024

> I think 2 is likely better until we can match the typeshed behavior "inline" by dropping everything before 3.11. I know it has limitations but I'd rather keep the code as consistent as possible first then iterate on it to get to a better typing place if necessary.

I'm going to try option 4 real quick, which is throwing class-level annotations in there to see if it's enough (see changes to parseresult.py in my most recent PR at this point). I might've overestimated the scale of the problem if this works, but we'll see. Figured it's worth a shot.

- Made int | str order consistent.
- The ParseResultMixin shim properties are now marked with t.Optional, since what they wrap can be None.
@@ -19,6 +19,10 @@ jobs:
- os: windows-latest
python: '3.12'
toxenv: py
# typing
- os: ubuntu-latest
python: '3.8'
Collaborator

I'm still bitter there's no way to tell GHA that "Hey, I have this list elsewhere of things I want to run against, can you just pick the 'earliest'/'oldest'/'smallest' so that when I update the list I don't have to update every goddamn reference to remove the oldest?"

Contributor Author

@Sachaa-Thanasius Sachaa-Thanasius Jun 20, 2024

I'm a novice at best when it comes to GHA and just went off of what the rest of the workflow looked like, so I have no idea if the technology exists, lol. It is a bit annoying, but I figured if a better way is found and it does get refactored, this would temporarily showcase what the output would look like in CI for pyright --verifytypes and either help or hurt the case for switching to and/or including mypy :D.

Collaborator

Oh, I'm just griping aloud, not hoping you would magically fix GitHub's product. 😂

src/rfc3986/misc.py
Collaborator

@sigmavirus24 sigmavirus24 left a comment

Let me know when you think this is ready for merge/final review

@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jun 20, 2024

> Let me know when you think this is ready for merge/final review

It feels close as a first step. One thing I wanted to check, though: is it okay if I haven't adjusted any of the documentation to account for these annotations yet (e.g. the types in the docstrings)? I could try to put such changes in this one or defer it to another PR.

…ment without warnings from a type checker.

- Also add a noqa to _mixin.URIMixin.resolve_with, since the extra `if TYPE_CHECKING` statement pushed it over the complexity limit.

Co-authored-by: Ian Stapleton Cordasco <[email protected]>
@Sachaa-Thanasius Sachaa-Thanasius marked this pull request as ready for review June 20, 2024 12:48
@Sachaa-Thanasius
Contributor Author

Other than some minor nits, this seems ready for final review. I think I've mostly managed to avoid functional changes, so it shouldn't break anyone in theory.

…omponent_is_valid` to avoid testing via `int(...)`.

- Also makes the linters and type-checker happier.
@sigmavirus24
Collaborator

> Ah, CI is hitting the conflict between flake8-import-order and reorder-python-imports that came up in #105, I think. One of the two might need to be removed.

Yeah, as much as I prefer reorder-python-imports you have black which won't budge, and our beloved Anthony who also won't budge. So I've switched every other project I've had time to over to isort. I think there's an example config somewhere that replicated reorder-python-imports but I need to find which project that was

I think https://github.com/sigmavirus24/github3.py/blob/a66800d1ba5e9bc4fee026c94404aeb82b6c0b6d/pyproject.toml#L97-L100 is the example config that mostly gets us to the happy medium between reorder-python-imports, black, and the flake8-import-order style I prefer/enforce here

@Sachaa-Thanasius
Contributor Author

I'll try adjusting the tooling config to switch to that, then.


Sachaa-Thanasius and others added 2 commits July 2, 2024 22:22
…ong with some other minor changes.

- Update .pre-commit-config.yaml and tox.ini in accordance to the above.
 - Run `pre-commit autoupdate` while I'm at it. Can be reverted if need be.
- Somewhat centralize black and isort config in pyproject.toml to avoid having to keep it in sync in multiple places.
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jul 3, 2024

Side note: Ran pre-commit autoupdate in the process, which should make #115 (EDIT: and #119) redundant. I hope that's all right.

@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jul 3, 2024

Regarding the other failing CI checks: would you be receptive to enhancing the current coverage config by replacing it with, say, covdefaults? I might be preaching to the choir here, but imo it would make hitting 100% coverage easier in future commits:

  • It adds some common exclusion cases that the current coverage config doesn't account for (possibly making some current pragma comments redundant).
  • It adds version-based pragmas that would be useful for sys.version_info branches (currently only relevant to _typing_compat.py).
  • It tells the report to include the exact lines being missed, making it easier to determine what needs changing, more testing, or something else.
  • It feels like a natural extension of the current config but easier to maintain since its focus is "good defaults" that don't need much, if any, fiddling.

Just an idea that came to mind while looking at the GH actions results.

@sigmavirus24
Collaborator

I'm not opposed to covdefaults in a future change, but for now things have been working fine as they are, so this seems more a signal of missing test coverage, though I haven't looked deeply at this again. My availability is spotty for several weeks, so this either needs to pass with very minimal additional changes, or we need to look at covdefaults in a separate PR that lands before this one and then rebase this after that merges.

…G` blocks and b) lines that are only ellipses, as well as some temporary pragma comments in _typing_compat.py. This seems to account for most of the missing coverage with the current configuration, excluding line 447 in validators.py.
@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jul 3, 2024

Okay, those should take care of the coverage metrics without any egregious changes. cc @sigmavirus24.

@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jul 3, 2024

Ah, right, pyright doesn't support an expected coverage of less than 100%. We could easily create a small wrapper script that reads the percentage from the output of `pyright --verifytypes` and determines pass/fail status based on our expected coverage instead, giving us more flexibility in CI.

Alternatively, it could be ignored for now.

Not my best work, but something like this could be thrown in a scripts/bin/whatever (or even tests) directory, and the corresponding tox command for the typing testenv can be changed to python directory_here/verify_types.py:

"""This script is a shim around `pyright --verifytypes` to determine if the
current typing coverage meets the expected coverage. The previous command by
itself won't suffice, since its expected coverage can't be modified from 100%.
Useful while still adding annotations to the library.
"""

import argparse
import json
import subprocess
from decimal import Decimal

PYRIGHT_CMD = ("pyright", "--verifytypes", "rfc3986", "--outputjson")


def validate_coverage(inp: str) -> Decimal:
    """Ensure the given coverage score is between 0 and 100 (inclusive)."""

    coverage = Decimal(inp)
    if not (0 <= coverage <= 100):
        raise ValueError
    return coverage


def main() -> int:
    """Determine if rfc3986's typing coverage meets our expected coverage."""

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fail-under",
        default=Decimal("75"),
        type=validate_coverage,
        help="The target typing coverage to not fall below (default: 75).",
    )
    parser.add_argument(
        "--quiet",
        action="store_true",
        help="Whether to hide the full output from `pyright --verifytypes`.",
    )

    args = parser.parse_args()

    expected_coverage: Decimal = args.fail_under / 100
    quiet: bool = args.quiet

    try:
        output = subprocess.check_output(
            PYRIGHT_CMD,
            stderr=subprocess.STDOUT,
            text=True,
        )
    except subprocess.CalledProcessError as exc:
        output = exc.output

    verifytypes_output = json.loads(output)
    raw_score = verifytypes_output["typeCompleteness"]["completenessScore"]
    actual_coverage = Decimal(raw_score)

    if not quiet:
        # Switch back to normal output instead of json, for readability.
        subprocess.run(PYRIGHT_CMD[:-1])

    if actual_coverage >= expected_coverage:
        print(
            f"OK - Required typing coverage of {expected_coverage:.2%} "
            f"reached. Total typing coverage: {actual_coverage:.2%}."
        )
        return 0
    else:
        print(
            f"FAIL - Required typing coverage of {expected_coverage:.2%} not "
            f"reached. Total typing coverage: {actual_coverage:.2%}."
        )
        return 1


if __name__ == "__main__":
    raise SystemExit(main())

EDIT: Prior art in trio, though their script is more complicated and searches the pyright output for other data.

EDIT 2: Edited script draft to hopefully be clearer.

@sigmavirus24
Collaborator

@Sachaa-Thanasius let's do that here and get CI green

@Sachaa-Thanasius
Contributor Author

Sachaa-Thanasius commented Jul 6, 2024

I just threw the script in the tests directory for now. Easy to change based on need or preference.

@sigmavirus24 sigmavirus24 merged commit 75e77ba into python-hyper:main Jul 6, 2024
11 checks passed
@sigmavirus24
Collaborator

Thanks so much for working on this!

@Sachaa-Thanasius
Contributor Author

No problem. Thanks for all the feedback!

Successfully merging this pull request may close these issues.

Tooling: Confusing difference in behavior between pre-commit run --all-files and tox -e run lint