Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question: how do i handle this buggy csv file? #10477

Open
1 task done
pauljsymonds opened this issue Nov 12, 2024 · 7 comments
Open
1 task done

question: how do i handle this buggy csv file? #10477

pauljsymonds opened this issue Nov 12, 2024 · 7 comments
Labels
question Questions about the library

Comments

@pauljsymonds
Copy link

What happened?

I am reading CSV files into ibis using read_csv, this is working fine. however when I then do a .count() the error appears.

it mentions I can do a ignore_errors = True to ignore the line, but then my count would be incorrect.

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

DuckDb

Relevant log output

Value with undetermined quote found.

when using .count()

Code of Conduct

  • I agree to follow this project's Code of Conduct
@pauljsymonds pauljsymonds added the bug Incorrect behavior inside of ibis label Nov 12, 2024
@pauljsymonds
Copy link
Author

Interestingly ibis can read it into a table but pandas cannot

in your Backend.execute you use Pandas 1349 through 1367

@cpcloud
Copy link
Member

cpcloud commented Nov 12, 2024

There's not a ton Ibis can do beyond whatever options are exposed by the backend here.

You're also going to be limited by the correctness of your CSV file, which sounds like it's missing a quote.

The reason it errors on the .count() call is because the buggy line in the data is not actually read until you call count() (ibis.read_csv -- depending on the backend -- reads a small amount of data for schema inference. When count is called, the DuckDB CSV reader will have to read at the very least every newline character and do at least some amount of parsing to respect whatever the set of options is that would in theory make the CSV file a table.

Fundamentally, with incorrect data, you cannot make any piece of software give you a "correct" count without defining what "correct" means.

If a correct count means including lines that are incorrect with respect to the thing doing the parsing, then you're out of luck and you need to fix the data.

@cpcloud cpcloud changed the title bug question: how do i handle this buggy csv file? Nov 12, 2024
@cpcloud cpcloud added question Questions about the library and removed bug Incorrect behavior inside of ibis labels Nov 12, 2024
@pauljsymonds
Copy link
Author

pauljsymonds commented Nov 12, 2024

Is Ibis mostly abstracting pandas or is it more advanced and using more Arrow? For context I am trialing ibis as a pandas alternative.

@gforsyth
Copy link
Member

Hey @pauljsymonds -- Ibis is abstracting tabular query engines. It's also useful to think about Ibis as "only" an interface (a dataframe interface).

In fact, you can't use Ibis by itself and compute anything, you need to pair it with a backend engine. By default, we use DuckDB for this and that's a pretty good combination.
If you are doing analytics locally on your machine, you are likely best served using Ibis with DuckDB, Polars, or Datafusion, as those are the three local backends we have support for (and all are very performant).

@IndexSeek
Copy link
Member

DuckDB offers some convenient functionality for Reading Faulty CSV Files that may help you determine where the errors are.

Here's an example of using Ibis with the DuckDB backend to demonstrate this behavior. Notice the tab character in Bob's row instead of the comma, which would raise an InvalidInputException if specifying a comma as a delimiter.

In [1]: from ibis.interactive import *

In [2]: data = """name,amount
   ...: Alice,100
   ...: Bob\t200
   ...: Charlie,300"""

In [3]: with open("/tmp/example.csv", "w") as f:
   ...:     f.write(data)
   ...: 

In [4]: con = ibis.duckdb.connect()

In [5]: con.read_csv("/tmp/example.csv", sep=",", ignore_errors=True, store_rejects=True)
Out[5]: 
┏━━━━━━━━━┳━━━━━━━━┓
┃ name    ┃ amount ┃
┡━━━━━━━━━╇━━━━━━━━┩
│ string  │ int64  │
├─────────┼────────┤
│ Alice   │    100 │
│ Charlie │    300 │
└─────────┴────────┘

In [6]: con.table("reject_errors")
Out[6]: 
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ scan_id ┃ file_id ┃ line   ┃ line_byte_position ┃ byte_position ┃ column_idx ┃ column_name ┃ error_type      ┃ csv_line ┃ error_message          ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ uint64  │ uint64  │ uint64 │ uint64             │ uint64        │ uint64     │ string      │ string          │ string   │ string                 │
├─────────┼─────────┼────────┼────────────────────┼───────────────┼────────────┼─────────────┼─────────────────┼──────────┼────────────────────────┤
│       7 │       0 │      3 │                 23 │            30 │          1 │ amount      │ MISSING COLUMNS │ Bob\t200 │ Expected Number of Co… │
└─────────┴─────────┴────────┴────────────────────┴───────────────┴────────────┴─────────────┴─────────────────┴──────────┴────────────────────────┘

@pauljsymonds
Copy link
Author

I am using locally with the default assumption is DuckDb. Is there a discord where I can ask questions, I have a few on various things I have tried with ibis using the docs.

@IndexSeek
Copy link
Member

Is there a discord where I can ask questions, I have a few on various things I have tried with ibis using the docs.

Ibis uses Zulip for its community chat; we'd be happy to help answer your questions there. Here's a post with more information that may help if you are new to Zulip: Announcing Zulip for Ibis community chat.

https://ibis-project.zulipchat.com/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Questions about the library
Projects
Status: backlog
Development

No branches or pull requests

4 participants