References to technology pose a conundrum for rappers. On the one hand, the general culture of hip hop places a heavy emphasis on staying up-to-date with the latest trends, so there is an incentive to mention cutting-edge consumer gadgets. On the other hand, technology becomes obsolete, and when it does, it turns unfashionable more abruptly than old slang or clothing does. As time moves forward, these lyrics will be received in two ways. People old enough to remember these things will say “Haha, remember car phones?”, while young people will say “WTF is a ‘beeper’?”
www.thug.com must have seemed futuristic and cutting-edge at the time of its release. Way back in 1998, home Internet access was not widespread. And even the relatively few people who were “surfing the Web” and “cruising the Information Superhighway” were doing so via AOL, a service that was a little like Facebook in that it provided a shitty and heavily-controlled version of the real Internet. In contrast, www.thug.com depicted a real web browser based on Netscape Navigator (charmingly renamed to thugscape). And of course, the title of the album is a URL, and the www dot whatever dot com URL format was still novel and striking.
Actually, the fact that the album’s title is a URL makes it something of a Bobby Tables. I’m writing this in 2024, and it still causes screw-ups. I googled “www.thug.com” and there were at least two irregularities in the search results. First, the Wikipedia entry showed up with the title “Trick Daddy” instead of the album title:
Second, the Spotify result displays “Something went wrong”. Whoops!
These problems are not present for any of Trick Daddy’s other albums: Book of Thugs, Thugs Are Us, Thug Holiday, Thug Matrimony, Back by Thug Demand, or Born a Thug, Still a Thug.
I’m sure there are other places where the URL-as-title causes problems. This blog post has the same title, and I am curious to see if any problems arise. (Really, that is the main reason for writing this post at all.)
The URL still works¹. Today, unfortunately, it just redirects to the Slip-N-Slide Records home page. I would prefer to see a re-creation of an old-timey “web site” along the lines of Space Jam.
Finally, it appears that the story of Internet technology has come full circle: while researching this album, I came across some in-the-wild AI-generated garbage. Here are some of its profound insights:
The album explored Trick Daddy’s experiences in a raw fashion. Trick Daddy has always shared his experiences, and he did so in raw fashion on the project.
Between the producers and featured artists, Trick Daddy enlisted a tight team of collaborators to create www.thug.com. The input of each collaborator built the album up to become what it is today.
With his lyrics, Trick Daddy explored themes such as street life, violence, and poverty, in many ways chronicling his own life. Even though it took a village, Trick Daddy was the star of the show.
While it’s possible that he may be done with music, www.thug.com, as well as his other works, will showcase his talents for ages to come.
¹ I write “the URL” without a hyperlink because Emacs Org mode is apparently confounded by the URL being the same as the title of the post.
]]>I was a little taken aback, as “pro-AI” is not how I would ever think to describe my own thoughts and feelings. It’s not wrong wrong, I guess. I mean, I’m not anti-AI in the way that some people are. But still, “pro-AI” is too vague to be useful, and is prone to wild misinterpretation.
“Okay, so how do you want to be described then?”
At this point, with respect to AI, I think my position can be summed up as optimistic resignationism.
Resignationism: this technology will be dramatically impactful in all sorts of ways, and its development is inevitable and unstoppable.
Optimistic: along with all of its many negative impacts, there will be lots of cool and amazing things, and maybe even the possibility for some positive social change.
In short: It’s going to be a bumpy ride to an unknown destination. But the in-flight entertainment will be great.
]]>Python comes with a pile of built-in functions: print, range, len, abs, enumerate, etc. Anyone who writes Python code will end up using some of these functions at some point.
One built-in function is not like the others: eval. This function accepts a string argument and returns the value of the string when evaluated as Python code. Built-in functions ought to be basic and broadly applicable. But eval is neither: it is at once extraordinarily powerful and practically useless. Not useless in the sense that it cannot be put to any use, because it certainly can be; but useless in the sense that it is almost certainly the wrong tool to use.¹
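For reference, a minimal illustration of what eval does:

```python
# eval parses and evaluates a string as a Python expression.
result = eval("2 + 3 * 4")
print(result)  # → 14
```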
I can’t think of any reason why eval should be made immediately globally available to all Python programmers. Its inclusion as a built-in goes back at least as far as Python 1.4 from 1996. The Python userbase back then must have been very different from what it is today. Perhaps it had a much higher percentage of hardcore programmers who had legitimate uses for eval as well as an understanding of why it should not be used. Or perhaps it was a bad idea at the time too.
To be clear, I am not advocating for getting rid of eval. It has its place in certain code-generation tasks. Libraries like Pytest and Pydantic and the standard dataclass module use it for things like dynamic class generation and runtime object introspection.²
Instead, what’s needed is some anti-discoverability. It’s too easy to just happen upon eval; and anyone who just happens upon eval definitely should not use it. Including it as a built-in creates the illusion that it is fine and reasonable to use the function casually. It shouldn’t be used casually, and there should be a daunting barrier to indicate this. The simplest way to do this would be to require importing it from a scary-looking module, like one of the Python language services.
Personally, I have found exactly one good use for eval. It had to do with code for representing arithmetic expressions. There was, for example, an Exponent class to represent expressions like 2^3. I wanted this to be stringified as valid Python, as in 2 ** 3. To verify this, I added some test code along the lines of:
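Something along these lines; the Exponent class here is a hypothetical reconstruction, and only the shape of the test matters:

```python
class Exponent:
    """Represents an expression like 2^3 (hypothetical sketch)."""

    def __init__(self, base, power):
        self.base = base
        self.power = power

    def __str__(self):
        return f"{self.base} ** {self.power}"


# The stringified form should be valid Python that evaluates correctly.
expr = Exponent(2, 3)
assert str(expr) == "2 ** 3"
assert eval(str(expr)) == 2 ** 3
```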
Notice that this is a code-generation task: I am actually trying to create some Python code, and so it is reasonable to consider eval. Notice as well that this was only done in test code, where standards are generally a little looser.
In the wild I’ve run across three uses of eval, each one terrible:
Four functions were defined to get times: days, weeks, months, and years. A string argument period was passed in to determine which would be called. The appropriate time function was called as follows: eval(period + '()'). Woof. A dramatically simpler and safer way is to stick the functions in a dictionary and then key in with the string argument.
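A sketch of the dictionary approach; the function bodies are invented for illustration:

```python
def days():   return 1
def weeks():  return 7
def months(): return 30
def years():  return 365

# Map period names to functions; no eval required.
PERIODS = {"days": days, "weeks": weeks, "months": months, "years": years}

def get_time(period):
    # A bad period name raises KeyError instead of executing arbitrary code.
    return PERIODS[period]()
```

An unknown period now fails loudly with a KeyError, rather than being handed to the interpreter as code.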
A list of file names was defined, along with a bunch of strings like 'os.sep'. These strings were then all appended together, passed to eval, and then passed on to some path-manipulation functions. I never figured out what exactly it was doing, though it was nevertheless obvious that eval was not being used appropriately.
A new report was added to a boring business web app. The report contained some fancy nested tables, and the tables were dynamically generated based on a query parameter, say ?name=business_asset. The name business_asset was not statically available, but it was expected that at runtime there would be a variable called business_asset. And wouldn’t you know it, eval(request.args.get('name')) was used to get the value of that variable. Yes, the query parameter was passed directly into eval, exactly the thing that everyone says to watch out for. In that situation, problems can arise if a user passes in a “name” like '__import__("os").listdir()' or something similar. (Although this use of eval was dangerous, I can see why it was done. This issue was fixed by replacing it with safe code, and that safe code turned out to be ugly and hard to understand.)
The change I am proposing probably would have prevented the first two of these three uses. The third one would have happened anyway – indeed, that use of eval was accompanied by a pragma comment: # pylint: disable = eval-used.
¹ exec is a built-in function that is subtly different from eval in terms of how it is called, but identical in spirit. Everything said here about eval applies to exec. Actually, in Python 2 exec was a statement rather than a function, which is much worse.
² Even in those cases where it can be used, it may still not be the best choice. As Paul Graham once said, “calling eval explicitly is like buying something in an airport gift-shop. Having waited till the last moment, you have to pay high prices for a limited selection of second-rate goods.”
Exponentiation is repeated multiplication: 2^5 is defined as 2 * 2 * 2 * 2 * 2.
Repeated exponentiation is known as tetration: 2 ↑↑ 5 is defined as 2^(2^(2^(2^2))).
Exponential growth is difficult for people to reason about (population growth, the spread of disease, etc.). The numbers are surprising and overwhelming. Well, tetration grows a lot faster than that. Here are the first few values of 2 ↑↑ n:
n | 2 ↑↑ n |
---|---|
0 | 1 |
1 | 2 |
2 | 4 |
3 | 16 |
4 | 65,536 |
5 | > 10^19,728 |
That is to say, T = 2 ↑↑ 5 is a number with almost twenty thousand digits.
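The table above can be reproduced with a few lines:

```python
def tetrate(base, n):
    # base ↑↑ n: a power tower of n copies of base, evaluated from the top down
    result = 1
    for _ in range(n):
        result = base ** result
    return result

# Matches the table: 2 ↑↑ 0 through 2 ↑↑ 4
assert [tetrate(2, n) for n in range(5)] == [1, 2, 4, 16, 65536]
```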
T represents a sweet spot for big numbers. It is perhaps the smallest number that is easy to define, dramatically larger than any real-world number, and also physically obtainable. By “physically obtainable” I mean that it can actually be displayed on a computer and witnessed in full, although this might take a little work.
First of all, some programming languages represent numbers using bit sequences of fixed length. There are infinitely many numbers and only finitely many bit sequences of a given length, so it is only possible to deal with numbers within a certain bound. For example, the largest 64-bit number is 2^64 - 1 = 18,446,744,073,709,551,615, a number with only twenty digits. Even on a platform with 65,536-bit numbers, T would be just a little too big to represent.
So bearing witness to T will require a language with unbounded numbers, like Python. But still there will be problems:
By default Python won’t display such a large number. To see T, the default must be overridden:
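In recent Pythons (3.11 and later, which introduced the conversion limit) the override looks like this:

```python
import sys

# ** is right-associative, so this is 2 ** (2 ** (2 ** (2 ** 2))) = 2 ** 65536
T = 2 ** 2 ** 2 ** 2 ** 2

# Lift the default cap on int-to-str conversion; 0 disables the limit.
# (Pre-3.11 Pythons have no limit and no such function.)
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

print(T)  # nearly twenty thousand digits
```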
Will that work? That depends on the local conditions. Up until just recently, trying to display T inside Emacs would cause it to freeze hard due to a long-standing and unbelievably annoying bug with long lines. So don’t try this on Emacs version 28 or earlier. I haven’t tried it on other platforms, but my guess is that it would cause problems elsewhere as well.
The first and last digits of T in base-10 are: 2003…6736. As expected, it is an even number.
The base-10 representation of T contains every single three-digit number sequence. That is, it contains 000, 001, …, and 999 as subsequences.
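Both of these claims are easy to check directly:

```python
import sys

# Lift the int-to-str conversion limit where it exists (Python 3.11+).
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

s = str(2 ** 65536)  # T = 2 ↑↑ 5
assert len(s) == 19729
assert s.startswith("2003") and s.endswith("6736")
# Every three-digit sequence 000..999 appears somewhere in T.
assert all(f"{k:03d}" in s for k in range(1000))
```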
]]>I have a smallish Python codebase. There is a type that is used ubiquitously throughout: Count. The current definition of the Count type is quite simple:
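Presumably something like this; the exact alias target is an assumption, inferred from the int annotations discussed later in the post:

```python
# Count is assumed to be a plain alias for int.
Count = int
```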
That is to say, Count is just a type alias.
Right now, a Count is nothing other than a number. However, I would like to start using algebraic expressions alongside actual numbers. For instance, instead of the actual number 9, there might be an expression like (4 + 5). Note that these expressions will be used alongside numbers, and not instead of them. Numbers and expressions will need to interact with each other, and the exact nature of a particular variable might not be known until runtime. Possibly other kinds of objects might be used later on as well.
Callsites for Count are mostly type-annotated with Count, but occasionally with int. This is because Mypy can’t (or won’t) distinguish between a type alias and its definition. At the same time, there are functions annotated with int that really do require an actual number. So for example, there might be annotations like this:
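The function names here are invented; only the annotation pattern is the point:

```python
Count = int  # the existing alias

def combine(a: Count, b: Count) -> Count:
    # Works with anything Count-like.
    return a + b

def take(items: list, n: int) -> list:
    # Really does need a concrete integer, because it is used for slicing.
    return items[:n]
```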
Come up with an outline for a class or interface or whatever that extends the Count type to include int-like objects. Ideally this should not require extensive changes to existing use-sites. Additionally, come up with a strategy for actually implementing the changes (tools to use, etc). Obviously the details of the scenario are rather vague, so the solution can be similarly vague.
Solutions to the challenge will be judged by me according to the following criteria:
Winning entries will be posted to this blog at a later date, where they will be read by perhaps a few dozen people. Fame and glory await.
I am primarily interested in concrete Python-specific solutions. However, entries of the form “Python sucks, here’s how I would do it in language X” will be considered if they are interesting enough.
]]>Many text-creators are understandably worried about this. If an LLM ingests all my writing, it will learn what my kind of writing is like. And if it learns that, it will be able to generate new text that is just like the text that I myself have previously generated. And since there is no such thing as human-generated text, that means that I will be made obsolete as a blogger. I could continue to write, of course, but it will be impossible to keep up with the automated just-like-my-blog text generator.
Now, let’s imagine that “the big one” is being trained right now – the larger-than-large language model that will subsume all text hitherto generated by humans and will exert an overwhelming influence on all text generated thenceforth.
What is a human text-creator to do?
One reaction to this scenario would be to jealously guard our remaining text. Suppose I have some novel insight or expertise. Openly publishing this knowledge would amount to feeding the beast, tossing my text into the gaping maw of the technological horror that has been created to replace me. So instead of openly publishing it, maybe I will communicate my thoughts in closed fashion. Perhaps I could write it down by hand and mail it to a small circle of people I know personally. This would disseminate my knowledge, but in a way that keeps it beyond the reach of the big-time mondo LLM. If enough people pursued this approach, it could lead to a rise in closed communities of esoteric knowledge.
There is something noble about small communities forming to resist centralized text generation. But also there are some obvious problems. Since they would have to forgo modern computer technology, communication would be inconvenient, slow, and unreliable. It would also lead to a general atmosphere of distrust and paranoia, totally antithetical to the dream of the open Internet. After all, if I don’t want my text out in the open, I have to make sure that the only people who see it are people who I know won’t share it.
But there is a grander problem. Though their precise functioning is obscure, LLMs are ultimately reflections of the text on which they have been trained. The text they produce will be statistically similar to the text they have seen. And so if I choose to keep my text esoteric and inaccessible, then the text generated by the big-time mondo LLM – which will come online regardless of what I do – will reflect text other than mine. And since the big LLM will, in this scenario, exert an overwhelming influence on all future text, that means giving up any chance to personally influence the direction of that text.
Put another way, if my goal is to influence the world via text, even in a small way, then it is in my long-term interest to get as much of my text as possible ingested by as many LLMs as possible. If there really is an inflection point past which all text will be pushed in a certain direction, then I want to have some say in what direction it is, and the only way to do that is to get my text ingested.
]]>As an example, let the initial value be 1 and the linear function be 2x + 3, and say we want to apply that function four times. The result is 61, no problem.
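As a quick check:

```python
x = 1
for _ in range(4):
    x = 2 * x + 3   # 1 → 5 → 13 → 29 → 61
print(x)  # → 61
```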
Now let the function and the initial value be the same, but instead of applying the function four times, we want to apply it four billion times.
You might ask, “Why on Earth would anyone want to do that? Shouldn’t the presence of such a farcically large parameter be treated as a sign that we’re on the wrong path? Wouldn’t we be better off trying to rethink our problem?” If I were you, I would stop asking stupid questions and get cracking on that computation, because four billion iterations is going to take a while.
The parameters aren’t the problem; the problem is that the straightforward programming solution won’t work. What’s required here is real math. Anyone who doesn’t know real math won’t be able to solve this. Try pitching this problem to the smartest programmer you can think of who doesn’t know real math. Within a minute they will be saying “Uhhhh”. They will try to program their way out of it, and they will fail.
There is no way around it: solving this problem requires the use of the geometric series formula:
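For f(x) = ax + b applied n times to a starting value x₀, summing the resulting geometric series gives the closed form:

```latex
f^{(n)}(x_0) = a^n x_0 + b \cdot \frac{a^n - 1}{a - 1}
```

With a = 2, b = 3, x₀ = 1, n = 4 this gives 16 + 3 · 15 = 61, matching the small example above.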
And it works like magic! Of course, it isn’t actually magic, and someone who knows real math can explain and prove how it works. But without such a proof, it may as well be magic. According to this formula, starting with 1 and applying 2x + 3 four billion times works out to approximately 10^1,204,119,983 (a number with 1.2 billion digits).
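The closed form makes the four-billion-iteration case trivial. Materializing the full number would mean 1.2 billion digits, but its size falls out of a logarithm:

```python
import math

a, b, x0, n = 2, 3, 1, 4_000_000_000

# Closed form: x_n = a**n * x0 + b * (a**n - 1) / (a - 1),
# which for these parameters simplifies to 4 * 2**n - 3.
# So log10(x_n) ≈ n * log10(2) + log10(4).
log10_result = n * math.log10(2) + math.log10(4)
digits = math.floor(log10_result) + 1
print(digits)  # → 1204119984, i.e. about 1.2 billion digits
```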
Obviously it would be best to know real math. And if wishes were horses, beggars would ride. So without such knowledge, what is a programmer to do?
After actually knowing real math, the next best thing is recognizing when a real math solution exists. When faced with a seemingly insurmountable numerical problem, this means stopping and saying “Surely someone has come across this problem before and worked out a formula.” What are the chances that I am the first person ever to try repeatedly applying a linear function to some initial value? No, somebody else has tried it, and it’s generally a safe bet that they found a solution.
After assuming that a real math solution exists, you can go find someone who knows real math and ask them. A more advanced alternative is to thumb through the back pages of a math textbook until you find a formula that looks like it’s applicable.
]]>Some say the price of holding heat is often too high
You either be in a coffin or you be the new guy
The one that’s too fly to eat shoe pie
– MF DOOM
Suppose I am looping over a range of numbers to see if any of them matches some magic number:
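A scaled-down sketch; the range and the magic value are invented, and the timings quoted in this post came from a much larger range than would be sensible to reproduce here:

```python
# Version 1: the magic number inlined directly in the loop.
found = False
for i in range(10_000_000):
    if i == 7_654_321:
        found = True
```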
On my modest desktop, this takes 60 seconds to execute.
But as everyone knows, using magic numbers in code is bad. It would be clearer and more maintainable to express the value as a variable:
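The same sketch with the value named (again scaled down, with invented numbers):

```python
MAGIC = 7_654_321  # hypothetical name and value

# Version 2: the comparison now involves a global-name lookup
# on every single iteration, instead of a constant load.
found = False
for i in range(10_000_000):
    if i == MAGIC:
        found = True
```

In CPython, that per-iteration global lookup is where the extra time goes.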
On my modest desktop, this takes 70 seconds to execute. Ten seconds longer – that’s not great!
Okay, so forget about variables. A better form of abstraction would be to get the comparison logic out of the loop and centralize it in a function:
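And the function version, with the same caveats:

```python
MAGIC = 7_654_321  # hypothetical name and value

def is_magic(i):
    return i == MAGIC

# Version 3: a Python-level function call on every iteration,
# which is far more expensive than an inline comparison.
found = False
for i in range(10_000_000):
    if is_magic(i):
        found = True
```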
On my modest desktop, this one takes 110 seconds to execute. The price to be paid for this minor organizational improvement is a catastrophic degradation of performance.
The situation is grim if you have some Python hot spot code that needs to be cleaned up. Because it’s a hot spot, any abstraction could have dire consequences, and so the range of acceptable changes is constrained.
On the other hand, this is great news if you have a hot spot that uses abstractions. It may be possible to trade a little maintainability for a lot of speedup. Inline some constants, inline some function calls, that kind of thing. It doesn’t look nice, but hey, it’s Python.
]]>Money ain’t got no owners – only spenders.
– Omar, The Wire
Recent advances in AI¹ text-generation have produced systems with a dazzling range of powers. They can generate all sorts of text, everything from poetry to marketing copy to actual, working code. The text isn’t always perfect, but boy oh boy, there sure is a lot of it! AI text-generators can produce mostly-plausible text at a scale that is absolutely out of the question for humans. Sooner rather than later, a lot of text in our world is going to be machine-generated.
What role is there for humans in this future?
One hope is that there will be a market for boutique, artisanal, human-generated text. Such text would be a luxury good, since it cannot be produced at machine scale. Human-generated text will have that “special touch” that cannot be replicated by AIs, that could only be produced by humans with their creativity, perspective, experience, being-in-the-world, etc.
In the short term, this is not implausible. After all, anyone who has spent any time with these systems knows that the text they produce is often “not quite right”. It looks right superficially, but on closer inspection it can be nonsensical or false in bizarre ways. This problem will no doubt be fixed up over time, but it is likely to persist for a while.
In the long term, I think the hopes for human-generated text are doomed. This is because of the simple logical fact that there is no such thing as human-generated text.
This is not to say that humans don’t generate text. Of course they do! But for a given piece of text, the fact of its having been generated by a human is a matter of historical contingency, and not a property inherent to the text itself.
Text is nothing other than a finite sequence of words. Given a fixed length of text and a fixed vocabulary, there are only finitely many sequences of words. For example, consider a text that is no more than 100 words long, with a vocabulary limited to the 1,000 most commonly used English words. There are no more than 100^1000 such texts. That is a lot of texts, but still there are only finitely many. With unlimited time and resources, every such text could be enumerated in a long, long list. Such a list would contain everything that could possibly be said within the given bounds.
Suppose you had access to such a list and you needed to generate some text. It would only be a matter of searching the list until you found the “right” entry on the list, the text that says what needs to be said. That would be one way of “producing” text.
In reality, such a list could not possibly exist. And even if it did, it would take way too long to search. Instead of exhaustive enumeration and search, humans generally use cognition to generate text. The exact mechanisms for this process are not clear, but humans seem to have a sense of relevance and purpose for text with respect to their sense of the world. Humans (mostly) use meaning to generate text. AI text-generators use a very different process. Instead of looking at relevance and purpose and meaning, they take a bird’s-eye view of a large corpus of text and crunch numbers to generate text that is statistically similar.
But in any case, the text that is generated always already exists, in the sense that it could have been enumerated. And from this perspective, human text-generation and machine text-generation are merely different means of lighting upon the right text for a given situation. For a given situation, the right text is what it is, and once in hand, its method of discovery is not just irrelevant, but in general impossible to discern.
¹ “AI” stands for “artificial intelligence”. Despite their amazing capabilities, it is doubtful that any current AI systems are actually “intelligent” in any meaningful sense. So really, “AI” should be in “scare quotes” throughout this post. That would be kind of annoying to read though, so they aren’t included. But the “scare quotes” should be imagined to be there.
]]>However, a widely perceived drawback to comprehensions is that they are harder to debug. When something goes wrong with a manual loop, the first thing to do is to print out the iterated values as they turn up. But the values of a list comprehension can’t be accessed, so print-debugging isn’t possible. To deal with this, it’s common to unravel the comprehension into a manual loop. Manual loops are uglier and more complicated and more error-prone than comprehensions, but that’s the price that must be paid for debuggability.
That’s the perception at least, but it’s wrong. In fact, print-debugging comprehensions is easy. The key fact to understand is that print is a function, and it can occur anywhere that a function can occur. In particular, print can occur in a comprehension filter.
As an example, here’s some code that deals with graphs:
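A stand-in example with the relevant shape: a dictionary comprehension whose values are set comprehensions, building an adjacency map from a made-up edge list.

```python
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes = {1, 2, 3, 4}

# node -> set of neighboring nodes
adjacency = {
    n: {m for a, b in edges if n in (a, b) for m in (a, b) if m != n}
    for n in nodes
}
```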
Notice the nested comprehensions: the dictionary comprehension contains set comprehensions as its values. Unraveling this into a manual loop would be just awful, but perhaps necessary to print the values as they show up:
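For a nested comprehension building an adjacency map from an edge list (data invented for illustration), the unraveled version looks something like:

```python
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes = {1, 2, 3, 4}

ret = {}
for n in nodes:
    neighbors = set()
    for a, b in edges:
        if n in (a, b):
            for m in (a, b):
                print(n, a, b, m)  # inspect every iterated value
                if m != n:
                    neighbors.add(m)
    ret[n] = neighbors
```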
(As a side note, statements like ret = {} are a code smell and often an indication that a comprehension could be used instead.)
Rather than go through the hassle of unraveling the comprehensions, we can simply print the values as part of the comprehension filter. The print function always returns None, so it’s just a matter of creating a vacuously true filter that touches every iterated value but doesn’t discard any of them:
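Using an invented adjacency-map comprehension for illustration: since print returns None, not print(...) is always true and can be slipped in as an extra filter clause:

```python
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes = {1, 2, 3, 4}

adjacency = {
    n: {m
        for a, b in edges if n in (a, b)
        for m in (a, b)
        if not print(n, a, b, m)  # vacuously true: print returns None
        if m != n}
    for n in nodes
    if not print("node:", n)      # also vacuously true
}
```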
It isn’t pretty. But then again, neither is print-debugging.
This technique can be used in other places where debugging might be considered difficult, like in a chain of boolean checks:
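A sketch with invented checks; every item fails at least one condition, so do_stuff never runs:

```python
def is_negative(n): return n < 0
def is_odd(n): return n % 2 == 1
def is_huge(n): return n > 100

processed = []
def do_stuff(n):
    processed.append(n)

for item in [-1, 3, 998]:
    if is_negative(item) or is_odd(item) or is_huge(item):
        continue  # rejected -- but by which check?
    do_stuff(item)
```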
It might happen that all the items in the sequence are failing the test conditions, and so none of them make it to do_stuff. To see where they are being caught, print calls can be added between the conditions:
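With the same kind of invented checks, the instrumented chain looks like this; each print fires only when execution falls past the preceding condition:

```python
def is_negative(n): return n < 0
def is_odd(n): return n % 2 == 1
def is_huge(n): return n > 100

for item in [-1, 3, 998]:
    if (is_negative(item)
            or print(item, "is not negative")  # vacuously false: returns None
            or is_odd(item)
            or print(item, "is not odd")
            or is_huge(item)):
        continue
    ...
```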
(Note that this example uses an or-chain, and so the dummy print conditions need to be vacuously false rather than true.)
Again, this technique is possible because print
is a function. In older versions of Python, print
was a statement. That was a bad idea, and fortunately it was rectified. In general, statements are clunkier and less flexible than values. Python continues to improve with the addition of value-oriented language features like the walrus operator.