forbidden-suffix.Rmd

---
title: "Are there forbidden radio callsigns?"
author: "Andrew Zimolzak"
date: '2022-05-11'
output: pdf_document
---

Which, if any, 3-letter amateur radio (ham) callsign suffixes do not
exist anywhere in FCC's database (presumably because they are banned
or actively shunned)? Short answer:

> FUC, PIS, and SOS

These are the only three interesting suffixes that are assigned
exactly never. There multiple other less interesting ones related to
"Q codes." *All* other suffixes are seemingly allowed, although some
potentially offensive ones are rarer than expected.


# Introduction

Ham radio operators can choose a "vanity" callsign. A usual callsign
format is "K5ABC". The "suffix" is the part of the callsign after the
number. Often the suffix is 3 letters long. This is enough letters to
make some mildly to very offensive letter combinations. So, which
3-letter combos are problematic enough that they never appear? Data
can be downloaded from FCC in a file called `l_amat.zip`.


```{r libraries, message=FALSE, warning=FALSE}
library(here)
library(dplyr)
library(data.table)
library(tidyr)
library(knitr)  # kable
library(ggplot2)
```


# Input the data and show its size

```{r reading-10sec}
# message("Reading dat (~8 sec...)")
d_1528k = fread(here('EN.dat'), header = FALSE, sep="|", quote="")
d_100k = fread(here('ENshort.dat'), header = FALSE, sep="|", quote="")
```

```{r manual-set-data-size}
X = d_1528k
dim(X)
```

```{r rename}
X %>%
  rename(
    tablename = V1, num = V2, callsign = V5, code = V6, lnum = V7, fullname = V8,
    firstname = V9, middlename = V10, lastname = V11, suffix = V12, address = V16,
    city = V17, state = V18, zip = V19, weird_addr_text = V20, misctext = V21,
    zeronum = V22, othernum = V23, othercode = V24) -> renamed
```


# String processing of suffixes

```{r split-suffix-40sec}
renamed %>%
  select(callsign, firstname, lastname, city, state) %>%
  separate(
    callsign, sep='[0-9]', into=c('prefix', 'suffix'),
    remove=FALSE, extra='merge'
  ) %>%
  mutate(len_suf = nchar(suffix)) %>%
  filter(len_suf >= 3) %>%
  mutate(
    s1 = substr(suffix, 1, 1),
    s2 = substr(suffix, 2, 2),
    s3 = substr(suffix, 3, 3)
  ) -> presuf

dim(presuf)
dim(presuf)[1] / dim(renamed)[1]
hams_per_suff <- dim(presuf)[1] / (26 ^ 3)
hams_per_suff
```

The above shows: How many hams have a 3-char suffix, what proportion
of hams have a 3-char suffix, and how many hams per suffix (mean).


# Count unique suffixes

```{r distribution-8sec}
presuf %>%
  arrange(suffix) %>%
  group_by(suffix, s1, s2, s3) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  filter(s3 != '0') -> counts

26 ^ 3 - dim(counts)[1]
```

The above shows how many suffixes are *missing,* or not attested, out
of the expected $26^3 = 17576$.


## Side analysis: most common suffixes

```{r print-kable-top}
counts %>%
  arrange(desc(count), suffix) %>%
  select(suffix, count) %>%
  head(n=35) %>%
  kable()
```


# Never-seen suffixes

```{r expand}
counts %>%
  expand(s1, s2, s3) -> expanded

left_join(expanded, counts, by=c('s1', 's2', 's3')) -> counts_all

dim(expanded)
dim(counts_all)
26^3
```

Compare dimension of `expanded` and `counts_all` with the expected
$26^3$. Expect to see 17,576 show up in all of those. **Funny thing:**
I initially saw 18,252 in the `expanded` data frame. Why? That equals
$26 \cdot 26 \cdot 27$. Must be extra weird character in `s1` or `s2`
or `s3`. Added `filter() %>%` so now probably **fixed.**

```{r filter-na}
counts_all %>%
  filter(is.na(count)) -> forbidden

dim(forbidden)[1]
26 ^ 3 - dim(counts)[1]
```

The above show the counts of never-seen suffixes. Expect the above two
to be equal. But again, weird. I first saw 765 versus 89. It so
happens that $765 - 89 = 26^3$. Seems like `s3` is sometimes equal to
the string "0".

Final weird thing: Then I was seeing 90 in `forbidden` versus 89
missing from `counts`, even after filtering out the zero versus letter
O thing? Filtered it earlier in the workflow instead, and now it
matches.


## Diagnostic

```{r diagnostic}
presuf %>% filter(s3 == 0) %>% kable()
```

I even wonder if this is a typo in the database of "zero" replacing
"capital letter O, as in Oscar."


# Final tables

Lots of suffixes starting with "Q" are banned. But *only three others*
are banned. Here they are.

```{r non-q}
forbidden %>% filter(s1 != 'Q') %>% kable()
```

So, somewhat surprisingly to me, all of the following are *not* banned
and exist at least once: ASS, POO, FUK. Here's a table of sketchy
suffixes, with some very common ones thrown in for comparison.

```{r sketchy-suffixes}
counts %>%
  filter(
    suffix == 'ASS' | suffix == 'POO' | suffix == 'KKK' | suffix == 'FUK' |
      suffix == 'TIT' | suffix == 'GOD' | suffix == 'FCC' | suffix == 'XXX' |
      suffix == 'SEX' | suffix == 'PEE' | suffix == 'KOK' | suffix == 'NSA' |
      suffix == 'DIK' | suffix == 'DIC' | suffix == 'CIA' | suffix == 'ZZZ' |
      suffix == 'ZYX' | suffix == 'USA' | suffix == 'WTF'
  ) %>%
  arrange(desc(count)) %>%
  mutate(obs_over_exp = count / hams_per_suff) %>%
  kable()
```

"POO" is one of the few mildly rude ones that is actually *more*
popular than expected. Yikes at the substring with 68 callsigns
containing it.


## Boring Q suffixes

```{r q-with-rstu}
forbidden %>%
  filter(s1 == 'Q') %>%
  group_by(s1, s2) %>%
  summarise(sum_q = n()) %>%
  kable()
```

Those numbers mean you can't have *any* suffix starting with "QR,"
"QS," or "QT." Plus a few starting with "QU."

```{r qu}
forbidden %>%
  filter(s1 == 'Q' & s2 == 'U') %>%
  kable()
```


## Diagnostic

```{r math}
forb_n <- 3 + 26*3 + 9
forb_n
26^3 - forb_n
dim(counts)
```


# Explore distribution of suffixes in general

```{r histo-broad}
qplot(counts$count) + geom_vline(xintercept = hams_per_suff)
```

Interesting that it's bimodal. There's the "nearly random" component
in the middle, centered around `r hams_per_suff`. But then there are
some that have nonzero but very low counts. Suffixes that used to be
banned but are not any more, perhaps?


## Low counts but not total ban

By inspection, a lot start with "X" for some reason. And also "QU."

```{r list-below-56}
counts %>%
  filter(count <= 56 & s1 != 'X') %>%
  filter(! (s1 == 'Q' & s2 == 'U')) %>%
  arrange(desc(count)) %>%
  select(suffix, count) -> below_57

below_57 %>%
  kable()
```

This definitely reveals some likely banned or discouraged suffixes
that I didn't think of.

## Zoom in on histogram

```{r zoom-in}
qplot(counts$count) +
  geom_vline(xintercept = hams_per_suff) +
  xlim(50, hams_per_suff)
```

I think I should change the cutoff to around 62. *Update:* tried that,
and the only interesting one in that range was "GOD," weighing in at a
count of 60. So I changed the threshold once again to $\le 56$. That
gives the 37 rarest suffixes.

Definitely includes some slurs. But also double-meaning words (nut,
pig, gas, box, sob). Most amusing one is "LID."


# Analysis of sketchy ones

```{r heatmap-data}
left_join(below_57, presuf, by='suffix') -> sketchy_people

sketchy_people %>%
  arrange(suffix, state) %>%
  group_by(suffix, state) %>%
  summarise(n_calls = n(), tot_suf = mean(count)) -> suffix_state_t1

suffix_state_t1 %>%
  group_by(state) %>%
  summarise(tot_state = sum(n_calls)) -> t2

n_sketchy <- dim(sketchy_people)[1]

left_join(suffix_state_t1, t2, by = 'state') %>%
  ungroup() %>%
  mutate(
    n_sketchy = n_sketchy,
    expected = tot_suf * tot_state / n_sketchy,
    obs_over_exp = n_calls / expected,
    r_state = dense_rank(state),
    r_suffix = dense_rank(suffix)
  ) -> suffix_state

n_sketchy
sum(t2$tot_state)

```

Many of the sketchy-seeming ones turn out to be people's initials.
Also, a few seem to be "reclaiming" the slur (inference based on
assumed nationality of last names).

```{r heatmap}
ggplot(suffix_state, aes(state, suffix)) +
  geom_raster(aes(fill = obs_over_exp)) +
  theme(legend.position="none")
```