How is this logically possible?

aggle-rithm · Feb 14, 2014

Wow. I've seen some odd behavior in SQL before, but this one seems to defy logic. Maybe I'm missing something obvious.

It hinges on a fairly simple logic statement: Given that a record exists in one table, then a matching record either exists in another table or it does not exist in that table. It's logically impossible for it to both exist and not exist in a macroscopic universe.

But when I run these three queries:

I get the following result:

1st query: 1 result.
2nd query: zero results.
3rd query: zero results.

If the first query came up empty, it would make sense that the second two would also return zero results. If it exists, though, how can it both NOT exist and NOT NOT exist?

If there were no records at all in the second table, then the second query should return one row. If there were a zillion records but none matched, it would return one row. If the table or column didn't exist, it would throw an error. That leaves one other possibility: That there is a matching record in the second table, which means the third query would return one row.

But it doesn't.

I've distilled the queries down as simply as I could to remove the possibility of any fat-fingering, but I just can't wrap my head around this one.

aggle-rithm · Feb 14, 2014

(Interesting that it wouldn't let me post with the SQL in plain text. I guess there are safeguards against SQL injection.)

aggle-rithm · Feb 14, 2014

Huh...I just tried it the other way, using a left join rather than the IN keyword, and now it works as expected. Why wouldn't they return the same results?

Brown · Feb 14, 2014

A leading authority on the subject had this to say: "It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error."

aggle-rithm · Feb 14, 2014

Brown said:
A leading authority on the subject had this to say: "It can only be attributable to human error. This sort of thing has cropped up before, and it is always due to human error."

Definitely a possibility, but what IS the error?

ETA: Oh, now I get the reference.

StevenLeonCooper · Feb 14, 2014

I believe if you just look at the tables manually and order them by invoicenumber you should be able to see why these queries don't work.

I would say use joins instead of "IN" or "NOT IN"; that's just my preference.

Select invoicenumber from FlatInvoiceOrders where invoicenumber = @invoicenumber and see what you get.

ETA: Check this out: http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/

aggle-rithm · Feb 14, 2014

StevenLeonCooper said:
ETA: Check this out: http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/

Ah-HA! Mystery solved.

IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

There were records in there with NULL values for invoice number. It just so happens that those records were entered, by accident, just a few minutes before I stumbled across this mystery.

Thanks for the link.

Wudang · Feb 14, 2014

I have to re-read the NULLs chapter in my Celko book every so often. Just weird.

StevenLeonCooper · Feb 14, 2014

I was doing a rather complicated query where I was filtering a list of images by an image collection and an image-collection collection with relation tables in between and I specifically only wanted images that belonged to whatever image-collections that belonged to the image-collection collection whose id was the "DefaultCollectionId" from the App Settings.

I was tempted to us "IN" to determine this but my coworker showed me how to just keep using INNER JOIN to filter the results instead. Since the appropriate image-collection collection ID was stored in the App Settings table, I didn't actually even need to access that table, I only needed to inner join on the "DefaulCollectionId" value (along with a few other things).

SumDood · Feb 14, 2014

Does the leading spaces in the input variable (' 881826' vs '881826') make a difference?

And actually, my first thought was going to suggest looking into something to do with 'null' values. I can't count the hours I've lost tracking down funky query results because the way sql handles 'null'.

aggle-rithm · Feb 14, 2014

The maxim I live by as a programmer is: "If something is happening that appears to make no logical sense, then I am making an invalid assumption."

In this case, the assumption was that the SQL statement was semantically equivalent to a similarly-worded English language statement.

abaddon · Feb 14, 2014

aggle-rithm said:
Ah-HA! Mystery solved.

There were records in there with NULL values for invoice number. It just so happens that those records were entered, by accident, just a few minutes before I stumbled across this mystery.

Thanks for the link.

Interesting. I have never deployed IN statements, and always used various JOINs. Not a preference per se, nor a criticism, simply never had any need to do so.

I take it you are in SQL hell with a project given your proclivity for SQL threads? Yes?

Word of warning. It never gets better. I tend to avoid such threads because, well, this is my down time on JREF, but FWIW the IN clause doesn't do it for me.

StevenLeonCooper · Feb 14, 2014

SumDood said:
Does the leading spaces in the input variable (' 881826' vs '881826') make a difference?

And actually, my first thought was going to suggest looking into something to do with 'null' values. I can't count the hours I've lost tracking down funky query results because the way sql handles 'null'.

I was reading a book on relational theory and the author suggested that the mere existence of null values in a DB was a bad sign. It does beg the question: If it's null then why does it exist? Any column worth having should probably have a default value and if you find yourself having hundreds of rows with NULL in a particular column, there's probably a different way to structure the table.

GlennB · Feb 14, 2014

SumDood said:
Does the leading spaces in the input variable (' 881826' vs '881826') make a difference?

aggle-rithm managed to embed the multiple spaces -by posting the code as an image - that the forum software (html?) has stripped from your post, thus rendering your perfectly reasonable post as a piece of nonsense (see my thread on "When did ...")

Case proved

eta: I now see it left one leading space there. Nice. How decent of it.

GlennB · Feb 14, 2014

StevenLeonCooper said:
I was reading a book on relational theory and the author suggested that the mere existence of null values in a DB was a bad sign. It does beg the question: If it's null then why does it exist?

Because the order has been taken, and maybe even delivered, but not yet invoiced?

In the end you might well want to update the order record with the eventual invoice number (I know we did). Also a query on null invoice numbers on the ORDERS table would also tell you about orders yet to be invoiced?

I can think of many reasons for perfectly reasonable null values in table columns. I certainly recall needing to use NVL quite a lot in much d/b software, and I don't think we had bad designs.

GlennB · Feb 14, 2014

Additional - why might an invoice number be recorded as VARCHAR ?

StevenLeonCooper · Feb 14, 2014

GlennB said:
Because the order has been taken, and maybe even delivered, but not yet invoiced?

In the end you might well want to update the order record with the eventual invoice number (I know we did). Also a query on null invoice numbers on the ORDERS table would also tell you about orders yet to be invoiced?

I can think of many reasons for perfectly reasonable null values in table columns. I certainly recall needing to use NVL quite a lot in much d/b software, and I don't think we had bad designs.

But what does NULL do that you can't do with the data type you already assigned to that column? I can't think of any particularly compelling reasons to use NULL for dates or timespamps and you can't use it on indexes so the primary place you'll see NULL is going to be just plain old VARCHAR and INT data for text and ID numbers.

In your scenario you have an invoice column for an invoice that doesn't exist yet. Why? Order, Delivery and Invoicing are specific events that happen at finite times.

It would make more sense to have Order, Delivery and Invoice tables and relate those to your specific order and then add them only when they actually happen. You avoid nulls and you allow yourself to isolate the three types of events so you don't unnecessarily have to grab all 3 any time you want to look up a specific one. If you want to know which orders haven't been invoiced yet just compare the Orders table to the Invoice table on the OrderId. The difference contains all of the orders that were placed but not invoiced.

The effect is pretty much the same and for a small project you probably wouldn't notice a performance difference but I think conceptually when you avoid using NULL values it forces you to make design choices that are a little more logical. To me it just makes more sense to only include real data that actually exists; if an invoice hasn't happened yet then I see no reason to give it a NULL placeholder.

GlennB · Feb 14, 2014

StevenLeonCooper said:
In your scenario you have an invoice column for an invoice that doesn't exist yet. Why? Order, Delivery and Invoicing are specific events that happen at finite times.

Because a regular customer might get a monthly invoice that covers many orders, then phones you to query it. It's likely they'll quote the invoice number at you, so you want to see what orders - and order details - are associated with that invoice.

This requires that that invoice number be stored within the ORDERS rows for the efficient indexing of the query. Sometimes that value will be null, in the case that an order had yet to be invoiced.

Seems reasonable to me.

StevenLeonCooper · Feb 15, 2014

GlennB said:
Because a regular customer might get a monthly invoice that covers many orders, then phones you to query it. It's likely they'll quote the invoice number at you, so you want to see what orders - and order details - are associated with that invoice.

This requires that that invoice number be stored within the ORDERS rows for the efficient indexing of the query. Sometimes that value will be null, in the case that an order had yet to be invoiced.

Seems reasonable to me.

Wait ... That seems different then. You would probably want the invoices to have a many-to-many relationship with orders and a many-to-many relationship with a customer. That's actually kind of like that image gallery thing I was working on. I would do that with 5 tables: Customers, Invoices, InvoicesToCustomers, Orders and OrdersToInvoices.

Use InvoicesToCustomers to relate each invoice to a customer. Use OrdersToInvoices to relate each order to an invoice.

That allows you to not require any NULL data.

Now, it does seem to me that simply having an "InvoiceId" column in orders also solves the problem without so many relational tables. I can see the appeal of that. Logically however this has some major limitations. You're saying an order can only associate with a single invoice. What if you make a mistake on an invoice and need to invalidate it and create another? With the relational layout you can give the invoices a "Voided" bit flag and maintain the invoice-order association for both records. Then you can query for ALL invoices that referenced the order, void or valid.

With your way you just update the InvoiceId column and the historical connection is lost; you may stumble over the voided invoice and find that it's not documented at all. You might wish you had done it differently if you're auditing all of your invoices. This type of data represents real, psychical objects so it makes sense to design the database in a way that allows it to be consistent with those objects.

ETA: posted from phone. Please forgive any strange autocorrects.

GlennB · Feb 15, 2014

StevenLeonCooper said:
Wait ... That seems different then. You would probably want the invoices to have a many-to-many relationship with orders ...

Yes, you're right. The system I had in mind recorded the invoice id in a transaction line that was associated with an order, not the order itself. That way an order could be associated with many invoices, including adjustments over multiple invoices.

StevenLeonCooper said:
...and a many-to-many relationship with a customer.

I don't really understand that part. Multiple customers on a given invoice?

But, generally speaking, disallowing nulls when a value is not yet known suggests to me a large number of tables to cater for every possible event where there might still be an unknown value, leading to some heroic table joins when all is finally done. Either that, or fill unknown values with defaults - e.g. 0 for the invoice id of an uninvoiced transaction. That would be horrible.

How is this logically possible?

Ardent Formulist

Ardent Formulist

Ardent Formulist

Penultimate Amazing

Ardent Formulist

Thinker

Ardent Formulist

BOFH

Thinker

Muse

Ardent Formulist

Penultimate Amazing

Thinker

Loggerheaded, earth-vexing fustilarian

Loggerheaded, earth-vexing fustilarian

Loggerheaded, earth-vexing fustilarian

Thinker

Loggerheaded, earth-vexing fustilarian

Thinker

Loggerheaded, earth-vexing fustilarian