Lab 8

Set up

library(tidyverse)
library(tidytext)
library(gutenbergr)

Interesting Code

cage <- gutenberg_download(1144, meta_fields = c("title", "author"))

This line of code is the first time I’ve used code to make a table. In other languages, whenever we want to represent table we’ll just make a 2-D array, which is a list of lists, or we’ll make a dictionary which is a data structure made of key-value pairs. However, as the lab progresses I begin to see the flexibility of R tables and how adaptable it is.

The next section talk about the wide range of functions R has. R being a weakly typed language where you don’t have to declare the type of a variable, it is very convenient to reassign variables and change the contents. Looking at this code below

words <- cage %>% 
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  ungroup

it is exactly the same thing as

# re-declaring cage variable in case cage gets mutated in the code above
cage2 <- gutenberg_download(1144, meta_fields = c("title", "author"))
words2 <- unnest_tokens(cage2, word, text)
words2 <- count(words2, word, sort = TRUE)
words2 <- ungroup(words2)

# test if words is the same as words2
print(all.equal(words2, words))

I have used pipes before, though the symbol for piping is

instead of %>%. I wanted to understand how the functions take the table as inputs so I looked up the functions in the R documentation and found the parameters each function takes. The documentation shows that all the functions above, unnest_tokens, count, and ungroup, all take a data frame as the first argument.

The Compiler

This made me start thinking about how the compiler, the thing that turns human-readable code into something the machine can read, parses the arguments. How does it know to use the results of the left of the pipe as the input to the right of the pipe? How does it know when there’s a pipe that there won’t be a data frame in the function? I struggled to conceptualize the inner-workings of exactly how the compiler breaks down the code. In Padua’s The Thrilling Adventures of Lovelace and Babbage, the difference machine truly breaks down the action of simple mathematical operations to the most basic steps. The carrying action in basic math was such a complex mechanism in the difference machine that it took Babbage a long time to construct something that completes the task.

Usually a compiler has a parser or sometimes called a tokenizer that breaks down a large string of text into smaller segments usually seperated by spaces and symbols. When we read books we often see words and sentences as one large entity. In fact, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe (From research done at the University of Cambridge)[https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/].

However, compilers and parsers care about the exact spelling of a word. In certain programming languages such as Python, compilers even care about the number of spaces and using tabs versus spaces. Parsers will increment character by character and if the spelling is slightly off, the parser will not recognize the word. If Babbage did create a spellchecker machine, the machine would take a very long time to check a word because the machine would need a list of all the words out there and check the given word to the list to see if the given word exists, if it doesn’t exist then there’s a big chance that there is a typo. But this approach does not take into consideration of valid ngrams and valid sentences. If we scale this approach to have a list of ngrams the runtime of the machine will be quadratic by the size of the ngram which will take a very long time without the technology of modern processors. So if a simple spellchecker is this complicated to create, how did Lovelace even theorize a compiler for code? The intellectual capacity of Ada Lovelace still fascinates me everytime.

The Syntax

As a computer science major, I’ve been introduced to a lot of different languages. When we had the in-class activity of reading the twitter bot code, I didn’t have much trouble understanding what the code is trying to accomplish. However, when I was going through this lab, some of the lines of code puzzled me so much. I remember just sitting there staring at one line of code

ggplot(plot_austen[1:20,], aes(word, count, fill = title)) +
  geom_bar(stat = "identity") +
  labs(x = NULL, y = "Frequent, Significant Words in Jane Austen's Novels") +
  coord_flip()

and being utterly confused. So it’s plotting with a table named plot_austen but it’s a bar graph? Where the x values are null?? NULL? null is nothing? No x values? Then you flip the coordinates? So y is nothing? And you add everything together?

My favorite professor of all time, Professor Olin Shivers, taught me that good code is code that you can reason about. Code is written for humans to read. These are two objectives I keep very close to my heart as a programmer. Just like proofreading in English, programming requires coding reviews by others before a segment of code can be shipped. There are a lot of ways to say something in English and one small tweak can change the entire meaning of the sentence. Similar to code, there are many ways to achieve the goal of the program; however, if other people don’t understand your code or if you don’t know how to explain your code then you won’t be able to reach your audience or scale the code to do something bigger and better.

Documentation

I’ve always been a huge fan of reading up on documentation when I want to find out more about a specific function or if I’m confused about how a function works. So throughout this fieldbook I’ve consulted on the following documents many times.

(R Documentation)[https://www.rdocumentation.org]
(gutenbergr library)[https://cran.r-project.org/web/packages/gutenbergr/gutenbergr.pdf]

2018-03-27 FIELDBOOKS · MODEL