Trying Nushell
I've been trying out Nushell again lately. I blogged in 2021 about how command-not-found is slow to build an index of which commands are provided by which Debian packages. I created an alternative Posix shell script that performs well without an index. Now I've ported this script to Nushell, and I found it interesting to compare the differences.
Comparison
The two versions are here. Both rely heavily on regular expressions. The Posix shell version is about 10 lines and 250 characters longer. Part of this is additional logic to use ripgrep with LZ4 when available, while falling back to a default that works otherwise:
PATTERN="^(usr/)?s?bin/$1\s"
if command -v rg > /dev/null && command -v lz4 > /dev/null; then
    LINES=$(files | xargs -0 rg --no-filename --search-zip "$PATTERN")
else
    echo "Run 'sudo apt install ripgrep lz4' to speed this up"
    LINES=$(files | xargs -0 /usr/lib/apt/apt-helper cat-file | grep -P "$PATTERN")
fi
The equivalent Nushell code uses par-each for parallelism, which is built into Nushell and seems quite handy:
let packages = $files
    | par-each {
        /usr/lib/apt/apt-helper cat-file $in
        | parse -r ('^(?:usr/)?s?bin/' + $command + '[ \t]+(?<packages>.*)$')
        | get packages
        # ...
    }
(This website uses Pygments to do syntax highlighting. I found that it doesn't support Nushell syntax yet, so the above is rendered as Perl for now. Perl is my go-to when I need syntax highlighting on an unsupported language, since it accepts almost any syntax.)
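As a quick aside, here's a toy pipeline (not from the script) showing what par-each does on its own: the closure runs concurrently across the input items, and because results can come back in whatever order the threads finish, I sort them at the end.
# Toy example: square some numbers on several threads, then sort.
[3 1 4 1 5] | par-each { |n| $n * $n } | sort
# => [1, 1, 9, 16, 25]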
I also appreciate how I was able to replace this sequence of sed commands. It was opaque enough to require a comment, and the sed command had two bugs I found in writing this post (missing -E, which is needed for the +, and missing m flag, which is needed for multiline processing):
# This sed expression drops the filename, splits the package list by the comma
# delimiter, and drops the section names.
PACKAGES=$(echo "$LINES" | sed -E 's/^.* +//; s/,/\n/g; s/^.*\///m')
With this Nushell code (continuing the longer pipeline above), which makes its intent clearer:
# ...
| split row ','
| str replace -r '^.*/' ''
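To make that concrete, here's a throwaway run of those two steps on a made-up packages field (not real Contents data), after the filename has already been dropped:
'admin/pkg-a,devel/pkg-b,utils/pkg-c'
| split row ','             # => [admin/pkg-a, devel/pkg-b, utils/pkg-c]
| str replace -r '^.*/' ''  # => [pkg-a, pkg-b, pkg-c]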
Similarly, another buggy call to sed in the Posix shell version (it won't work with the newline-separated package list):
$(echo "$PACKAGES" | sed 's/ /|/g')
becomes a more obvious operation in Nushell:
($packages | str join '|')
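And since $packages is a proper list rather than newline-delimited text, the join doesn't care how the elements were originally separated. Continuing the made-up data from above:
['pkg-a', 'pkg-b', 'pkg-c'] | str join '|'  # => pkg-a|pkg-b|pkg-c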
Performance
I ran a quick benchmark to compare the versions of apt-binary.sh (with and without ripgrep) and apt-binary.nu. I benchmarked with the files in cache, by repeating each command at least once and discarding the first result, although it didn't seem to matter much. I used Nushell v0.103.0 from the GitHub release binary for x86_64-unknown-linux-gnu.
I ran this in a container where there wasn't much parallelism available, as it only had Debian Bookworm package lists for one architecture:
$ ls /var/lib/apt/lists/*Contents* | get size | sort
╭───┬──────────╮
│ 0 │  13.6 kB │
│ 1 │ 105.9 kB │
│ 2 │ 132.2 kB │
│ 3 │ 166.8 kB │
│ 4 │   1.5 MB │
│ 5 │  19.8 MB │
│ 6 │  57.0 MB │
╰───┴──────────╯
Here are the results for the two versions:
$ sudo apt-get remove -y ripgrep | ignore
$ timeit { apt-binary.sh cvs }
Run 'sudo apt install ripgrep lz4' to speed this up
Sorting... Done
Full Text Search... Done
cvs/stable 2:1.12.13+real-28+deb12u1 amd64
  Concurrent Versions System
908ms 116µs 365ns
$ sudo apt-get install -y ripgrep | ignore
$ timeit { apt-binary.sh cvs }
...
804ms 462µs 550ns
$ timeit { apt-binary.nu cvs }
...
1sec 510ms 564µs 515ns
The Nushell version takes almost twice as long as the ripgrep version. It seems to be much slower at doing pipelines with built-in commands, while performing similarly to dash for external pipelines.
Here's an example with counting bytes, where piping into Nushell's length adds a lot of time:
$ let f = "/var/lib/apt/lists/deb.debian.org_debian_dists_bookworm_main_Contents-all.lz4"
$ timeit { open --raw $f | wc -c }
57084589
4ms 101µs 100ns
$ timeit { open --raw $f | length | print }
57084589
756ms 112µs 752ns
$ timeit { sh -c $'lz4 -d ($f) -c | wc -c' }
516439434
252ms 204µs 142ns
$ timeit { lz4 -d $f -c | wc -c }
516439434
259ms 633µs 621ns
$ timeit { lz4 -d $f -c | length | print }
516439434
6sec 361ms 656µs 908ns
Here's another example of a basic grep or equivalent, where Nushell's find takes much longer:
$ timeit { sh -c $'lz4 -d ($f) -c | grep ripgrep' }
usr/src/rustc-1.78.0/src/tools/rust-analyzer/crates/project-model/test_data/ripgrep-metadata.json devel/rust-web-src
283ms 70µs 233ns
$ timeit { lz4 -d $f -c | grep ripgrep }
usr/src/rustc-1.78.0/src/tools/rust-analyzer/crates/project-model/test_data/ripgrep-metadata.json devel/rust-web-src
297ms 517µs 864ns
$ timeit { lz4 -d $f -c | find ripgrep | print -r }
usr/src/rustc-1.78.0/src/tools/rust-analyzer/crates/project-model/test_data/ripgrep-metadata.json devel/rust-web-src
805ms 466µs 406ns
These may not be quite apples-to-apples comparisons for reasons like Unicode handling, but it seems like Nushell's internal pipelines could use more optimization.
Refactoring issues
Nushell has some support for testing and assertions, which I hoped to use for testing the parsing code. I ran into some problems, however. It seems like parse is special in being able to take a byte stream and parse lines of it, but this somehow doesn't work with a function call (a "command" in Nushell terminology):
$ let f = "/var/lib/apt/lists/deb.debian.org_debian_dists_bookworm_main_Contents-all.lz4"
$ def parsefn [pattern] { $in | parse -r $pattern }
$ timeit { lz4 -d $f -c | parse -r "^usr/bin/(.*) " | length | print }
8791
1sec 142ms 15µs 922ns
$ timeit { lz4 -d $f -c | parsefn "^usr/bin/(.*) " | length | print }
0
1sec 344ms 869µs 81ns
$ timeit { lz4 -d $f -c | lines | parse -r "^usr/bin/(.*) " | length | print }
8791
1sec 269ms 821µs 783ns
$ timeit { lz4 -d $f -c | lines | parsefn "^usr/bin/(.*) " | length | print }
8791
3sec 169ms 105µs 892ns
It seems like byte streams can't cross into function calls:
$ /bin/echo hi | describe
byte stream
$ def describefn [] { $in | describe }
$ /bin/echo hi | describefn
string
So by trying to refactor my code to test it, I unintentionally changed its behavior. Getting the old behavior back would require harming performance drastically or finding a different abstraction boundary.
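One boundary that might work (just a sketch, not something the script does today; the bin-pattern helper and the sample line below are made up for illustration) is to factor out only the pattern-building and test it against a literal string with the standard library's assert:
use std/assert

# Hypothetical helper: it only builds the regex, so the byte stream
# stays in the calling pipeline and never gets collected into a string.
def bin-pattern [command: string] {
    '^(?:usr/)?s?bin/' + $command + '[ \t]+(?<packages>.*)$'
}

# Made-up Contents-style line for the test.
let parsed = 'usr/bin/cvs   vcs/cvs' | parse -r (bin-pattern cvs)
assert equal ($parsed | get packages) ['vcs/cvs']
That keeps the stream handling in the main pipeline, at the cost of leaving less of the logic under test.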
Closing thoughts
I'm torn about Nushell. Even with over two decades of using Bash/Dash/Zsh, I still struggle with Posix shell; in writing this post, I found three bugs in my Posix shell script. Nushell feels like a massive improvement. It has nice syntax, type checking, data structures, convenient argument parsing, and a cohesive library of built-in commands. It remains succinct enough to feel like a shell rather than a programming language, with easy escapes into "raw" Unix programs.
Beyond retraining my brain, I struggle with two things. First, it didn't take me long to run into the issues above, so Nushell may still need more time and work to mature. Second, whenever I need interoperability with others, I can count on them having a Posix shell available, but I can't count on them using Nushell. I shouldn't let that hold me back from using a better tool on my own computers, but I can't escape having to write Posix shell scripts going forward, which means remembering all the gotchas and tricks. Maybe it would help if there were a subset of Nushell that could be compiled to Posix shell for interoperability.