NAME frequency-correlate SYNOPSIS frequency-correlate.py _reference_text_ DESCRIPTION Takes stdin and creates a frequency histogram to compare against that of the standard text, given the file name as the argument to the function. The text is assumed to be the 26 ASCII letters. All non-letters are discarded and all letters are changed to lower case. The histograms of stdin and the reference are normalized and the dot product is taken for each possible rotation of stdin. The output has the rotation expressed as a letter (a for 0, 1 for b, etc) and will be in a [0,1] interval. POSITIONAL ARGUMENTS reference_text file of reference text for the reference distribution OPTIONAL ARGUMENTS none HISTORY Introduced in the 221 offering (Fall 2021–2022). BUGS
NAME index-of-coincidence SYNOPSIS index-of-coincidence.py _maximum-circulate_ DESCRIPTION Takes stdin and and for each circulant of the text, from zero to the maxium, counts the number of collisions between the original text and the circulant. If the text is the character sequence s_1 s_2 ... s_n then the circulant by i is the character sequence s_(1+i) s_(2+i) ... s_n s_1 ... s_i. A collision between the sequences s_1 s_2 ... s_n and t_1 t_2 ... t_n are locations j where s_j==t_j. The output is the count of collisions for each amount of circulation, normalized to the [0,1] interval POSITIONAL ARGUMENTS maximum-circulate the test is for 1 to number maximum-circulate OPTIONAL ARGUMENTS none HISTORY Introduced in the 221 offering (Fall 2021–2022). BUGS
NAME column-extract SYNOPSIS column-extract.py _offset_ _skip_ DESCRIPTION Filters stdin ignoring all but characters passing to stdout the i-th character from stdin, when i = offest+i*skip, for integers i. POSITIONAL ARGUMENTS offset The first letter to pass from stdin to stdout skip Every letter skip places after the last letter passed to stdout is passed to stdout. OPTIONAL ARGUMENTS none HISTORY Introduced in the 221 offering (Fall 2021–2022). BUGS
Accept the github classroom assignment, and clone the github repo to your desktop. Add, commit, push whenever reasonable, but also before the deadline of the assignment, for the purposes of grading.
Frequency distribution exploration
The english language has a distinct and stable distribution of letters. While we consider it a random process, more naturally it is the percentage of occurrences of each letter out of 100% for all letters. Here is what it looks like.
For a simple shift cipher, this uniformity is exploited to break the cipher. The shift disguises the letters but leaves the pattern intact, only circulated (shifted on the base). To recover the shift, all circulants of the frequency distribution of the enciphered text and the one with the smallest distance from a reference is chosen.
Expressing the frequency distributions as unit length vector in ℕ26, the larger the dot product the closer the distributions. Let v(i) be the circulated by i vector from the enciphered text, and r be the standard, find i such that.
max_i < r | v(i) >
Tasks
author: burton rosenberg
created: 31 aug 2021
update: 2 sep 2023