```{r}
library(tidyverse)
library(tidytext)
library(gutenbergr)
```
In the code block above I loaded different libraries (also called packages), which are essentially collections of functions and datasets that we pull from as we run the commands below. As you can see, I started off with three tick marks and an r in curly brackets, then ended with three tick marks as well. Those tick marks and the r in brackets indicate that I want to start a code block. RStudio automatically recognizes the code block when I do that, and makes it grey so I can more clearly see what I’m working with.
If I do not know what something is, I can go to the console (below) and type ? followed by the name of the function or package, for example ?tidyverse. The documentation will then show up in the “Help” window to the right. So for example, if I look up tidyverse, it says:
“tidyverse: Easily Install and Load the ‘Tidyverse’ …. Description… The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step.”
So basically, it’s a collection of packages that are designed to work well together, and loading tidyverse loads the core ones all in one step.
Tidytext, when I looked it up, said it “provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.”
And then finally, the gutenbergr library lets us search and download public domain texts from Project Gutenberg directly in R. It includes tools for downloading books, in addition to a complete dataset of Project Gutenberg metadata that can be used to find works of interest.
Notice also that when I’m not in the code blocks, there will still be text in blue, such as the heading right below, Including Plots, and the heading above, R Markdown. That’s because we’re using R Markdown to code this, so the commands for Markdown, such as using the double pound symbols, will create a heading in a Markdown viewer.
You can also embed plots, for example:
```{r}
melville <- gutenberg_works(author == "Melville, Herman")
View(melville)
```
Now, in the code block above, I am setting up a variable. Notice again how I created the code block with the tick marks and the r in brackets. I start off by naming the variable whatever I want it to be. In this case, I made it melville.
Then, I assign a value to the variable with the arrow, <-. In this case, the value is whatever comes back from the function call gutenberg_works(author == "Melville, Herman"). This function looks through the Project Gutenberg metadata and keeps only the works whose author is listed as Herman Melville. After that, I use the function View and give it the variable melville, to invoke a data viewer for the variable I just created. So, a table will pop up with the Gutenberg ID assigned to each text and its corresponding title. In other words, a table will pop up with all the Melville texts that gutenbergr lists as available. This allows us to see what texts are available for us to work with.
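To make the assignment and the author == part a little more concrete, here is a tiny sketch that does the same kind of thing on a made-up table instead of the real Gutenberg metadata. The table toy_works, its IDs, and its titles are all hypothetical, and I'm assuming the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr)

# Hypothetical stand-in for the Gutenberg metadata table (made-up IDs and titles).
toy_works <- tibble(
  gutenberg_id = c(1, 2, 3),
  title  = c("Book A", "Book B", "Book C"),
  author = c("Melville, Herman", "Austen, Jane", "Melville, Herman")
)

# `<-` stores the result of the call on the right under the name on the left.
# `==` checks every row, and filter() keeps the rows where the check is TRUE.
toy_melville <- toy_works %>%
  filter(author == "Melville, Herman")

print(toy_melville)
```

The result holds only the two rows whose author matches exactly, which is the same idea as asking for the works by Herman Melville.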
```{r}
bartleby <- gutenberg_download(11231, meta_fields = c("title", "author"))
View(bartleby)
```
Now, for the code block above, we’re setting another variable. This one is a bit more complicated, but there’s really nothing to worry about. The variable name, as you can see, is called bartleby. Any guesses as to why that’s so? If you guessed that it’s because we’ll be working with the text and metadata for “Bartleby” from the gutenbergr library, you’re exactly right!
In the previous code block (lines 38-42), we looked up the list of works by Melville. Since that table includes each work’s Gutenberg ID number, we were able to see that the ID for “Bartleby, The Scrivener” is 11231. So, for the variable, I used the command gutenberg_download(work ID #). In addition to that, I used meta_fields to ask for the title and author to be shown as extra columns in the data table.
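The c("title", "author") piece inside gutenberg_download is just an ordinary R vector. Here is a minimal sketch of what c() does by itself; the variable name fields is only something I made up for illustration.

```r
# c() combines individual values into a single vector.
fields <- c("title", "author")

# The vector remembers how many values went in, and in what order.
length(fields)
fields[1]
fields[2]
```

So meta_fields receives one object that holds both column names, rather than two separate inputs.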
I should mention that the little green arrow in the top right hand corner of the code block (the one that looks like a ‘play’ button) has a useful function. It runs the whole block of code, in order. Another way to run code is to put my cursor on a certain line, then press command (or option) and enter together, which runs the code where the cursor is.
Thus, when I run the whole code block, it will load the entire text of “Bartleby, the Scrivener.” The c stands for combine, which combines values into a vector. So in this case, we combine "title" and "author" into one vector, and those two fields get added as columns. As a result, this information ends up becoming very repetitive (the text ID # on every single line, in addition to the title and author), but I think that when we do more complicated commands that deal with multiple texts and perhaps even authors at once, it will make perfect sense to have the information for each of the columns. That is especially relevant if we will be analyzing and comparing and contrasting differences between works of an author, works of more than one author, things like word and phrase trends during certain time periods, etc. This segues perfectly to the next code block.
```{r}
words <- bartleby %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  ungroup()
View(words)
```
I mentioned above that the green arrow runs the whole code block in order, and that doing command/option + enter while putting my cursor on a specific line will run just that piece of code. That shortcut is much more useful in the code block above. Notice the weird text at the end of several of the lines. It looks like this: %>%. That’s called a pipe, and it connects the separate lines of code. It takes the result of whatever is on its left and hands it to the function on its right as that function’s first input. Ending a line with a pipe also tells the program that the command is not finished yet, so it keeps reading on the next line; the line breaks themselves are only there for aesthetic purposes. An easy way to remember and make sense of this is to think about real pipes, which connect things like water fountains and sinks. Water flows out of one thing, through the pipe, and into the next. In the same way, the pipe tells the program, “Hey, take what you just made and send it on to the next step.”
As I mentioned above, the code is split into separate lines for aesthetic purposes. I could just put the whole piped command on one line with the pipes in between, but it’s just easier on my eyes to see which step I’m working with.
Now, for the coding itself. We make another variable, this time called words. I start with a previous variable, bartleby, which was created in the previous code block. That’s the entire text of Bartleby, as you remember. Then, the pipe. So onto the next line.
The next line has unnest_tokens(word, text). When I look up what that does (going to console, and typing in ?unnest_tokens…), then it will say in Help that unnest_tokens “splits a column into tokens using the tokenizers package, splitting the table into one-token-per-row.”
So what does that mean? From my understanding, that command (which comes from the tidytext library we loaded way above at the beginning) takes the text column of our table and splits it up so that every row holds just one token. The unit we split by here is the word, and the pieces go into a new column called word. Remember that we’re making words the variable, and that it starts from a previous variable (bartleby). And then of course bartleby is the variable we set as the entire text of Bartleby. It’s a lot of layering, I know. But that fits in well with the idea of a continually flowing stream that flows through connected pipes. So, the pipe sends the result on to the next line.
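Here is a small sketch of what the pipe does, using plain numbers instead of text so it's easy to follow. The names numbers, nested, and piped are just ones I picked for this example, and I'm assuming dplyr is installed, since it provides %>%.

```r
library(dplyr)

numbers <- c(4, 9, 16)

# Without the pipe: the functions are nested and have to be read inside-out.
nested <- sum(sqrt(numbers))

# With the pipe: each result is handed to the next function as its first input,
# so the same steps read from top to bottom.
piped <- numbers %>%
  sqrt() %>%
  sum()

# Both versions do exactly the same work.
identical(nested, piped)
```

The line breaks after each %>% are only there to make the steps easier to read. The whole thing is still one command.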
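To see the one-token-per-row idea on something small, here is a sketch with a two-line toy table instead of the whole story. The table toy_text and its line column are made up for illustration, and I'm assuming dplyr and tidytext are installed.

```r
library(dplyr)
library(tidytext)

# A tiny table with one row per line of text, like a miniature bartleby.
toy_text <- tibble(
  line = 1:2,
  text = c("I would prefer not to.", "Ah, Bartleby! Ah, humanity!")
)

# Split the text column into single words, stored in a new column called word.
# By default the words are lowercased and the punctuation is dropped.
toy_words <- toy_text %>%
  unnest_tokens(word, text)

print(toy_words)
```

The two rows of text turn into nine rows, one for each word, and the line column is repeated so we still know where each word came from.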
This next line includes something about anti_join and stop_words. First, we must know what stop words are. Stop words are very common words that don’t carry much meaning on their own. So for example, a lot of articles like “a, an, the” do not have a meaning within themselves. So, we don’t want to include those in our commands, because who gives a hoot about the usage and frequency of those words? That’s not what we’re going for. stop_words is a table of such words that comes with tidytext, and anti_join keeps only the rows of our table whose word does not appear in that table.
So, what are we going for? Remember that we’re working with the variable words. After parsing the data (each word, essentially) for Bartleby (the variable and the whole text itself), we’re now taking out the stop_words from that comprehensive Bartleby variable. Why are we doing this? Let’s take a look at the next line.
This next line is saying something about count. We’re counting the words (again, in Bartleby, excluding the stop words). I wasn’t quite certain at first what that TRUE is there for, but sort = TRUE tells count to order the results so that the most frequent words come first. So, we end up with a list of the counts (frequencies) for the words in Bartleby that are not stop words, starting with the most common one. And we did that by pulling from the Project Gutenberg collection, after looking at all the works by Melville. Phew.
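Here is a rough sketch of that filtering step on a handful of words. The table toy_words is made up, while stop_words is the real table that ships with tidytext. I wrote by = "word" to say explicitly which column to match on; the code block above leaves that out and lets R work it out.

```r
library(dplyr)
library(tidytext)

# A few hand-picked words: two common ones and two distinctive ones.
toy_words <- tibble(word = c("the", "scrivener", "a", "bartleby", "the"))

# anti_join() keeps only the rows of toy_words with NO match in stop_words.
kept <- toy_words %>%
  anti_join(stop_words, by = "word")

print(kept)
```

Only the distinctive words are left, because "the" and "a" both appear in the stop word table.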
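As a quick check on what sort = TRUE does, here is a sketch on six made-up words (the table toy_words is hypothetical, and I'm assuming dplyr is installed).

```r
library(dplyr)

toy_words <- tibble(
  word = c("prefer", "wall", "prefer", "office", "prefer", "wall")
)

# count() adds up how many times each word appears and stores it in a column n.
# sort = TRUE puts the largest counts at the top of the table.
toy_counts <- toy_words %>%
  count(word, sort = TRUE)

print(toy_counts)
```

The word that appears three times comes first, then the one that appears twice, then the one that appears once. Without sort = TRUE the rows would come back in alphabetical order by word instead.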
Now, I’m not quite sure what that ungroup does, but when I looked it up, it said that “Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping.”
That had me scratching my head, but I think it means this: a table can be marked as grouped by one of its columns, so that later commands work on each group separately, and ungroup removes that mark so that commands treat the table as one whole again. Since we never grouped anything in this code block, ungroup probably doesn’t change the result here, and it’s likely just a safety step at the end of the pipeline. Obviously, I’m partly speculating, but I thought it wouldn’t hurt to at least say what I think it means.
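One way to poke at this is to group a tiny table, ungroup it, and ask dplyr each time whether the table is grouped. This is only a sketch on made-up data (the table toy and its contents are hypothetical), and it assumes dplyr is installed.

```r
library(dplyr)

toy <- tibble(
  author = c("Melville", "Melville", "Austen"),
  word   = c("whale", "wall", "pride")
)

# group_by() marks the table as grouped; ungroup() removes that mark again.
grouped   <- toy %>% group_by(author)
ungrouped <- grouped %>% ungroup()

is_grouped_df(grouped)
is_grouped_df(ungrouped)

# count() on a table that was never grouped gives back an ungrouped table,
# so a following ungroup() would have nothing left to remove.
counted <- toy %>% count(word, sort = TRUE)
is_grouped_df(counted)
```

If that reading is right, the ungroup at the end of the words pipeline is harmless but does not change the table in this particular case.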
Analysis and Reflection
And now, for the analysis and reflection of this whole experience.
This was initially such a frustrating assignment. I was dismayed the first time we started working with R. I had a lot on my mind, and then having to suddenly figure out this new language and programming to complete the next couple of labs sounded really daunting. I was quite discouraged at first, even though I was well aware by my classmates’ round eyes and sticky notes on their laptops that I was far from being the only one who was confused. I remembered back to when I was working with Markdown and thinking that was difficult. Oh my, how I long to be just working in Markdown again. I thought that Markdown was just working with plain, plain text and coding, but now I see that even Markdown still has a lot of presets and things like that in place. Something about seeing my code immediately transform visually made me a lot more sure of myself, while with R, I have to explain a lot of things to myself and do a lot of trial and error (not to mention several questions to David and Professor Cordell) to make sure that I have the basic idea of the coding and what means what. Something about having to have the perfect code to even see if I did it right when I run it makes it a bit more frustrating.
But then, when I think about it, I also had to do much the same for Markdown, it’s just that I could see my changes a lot quicker and really visualize if I did the right thing. For example, if I wanted to insert a picture in Markdown, I could toy around with the brackets and link just a little bit to get that picture in there, since I do remember the basics that I need to name the link for the picture and then do something with the brackets. So I guess it’s a different sort of visualization where I really have to be more conscientious about each step.
Now, this reminded me of something way at the beginning of Sydney Padua’s “The Thrilling Adventures of Lovelace and Babbage.” It’s actually in the section that is real and isn’t the alternate reality/world/progression of the real story. In that section, we see Lovelace at a party where Babbage is talking about his difference engine. He says on page 20-21,
“…. and I conceived that arithmetical calculations of great complexity can be effected by mechanical means! Observe the wheels they will produce a series of numbers according to one law. It is impossible, from its very structure, for the machine to produce any other number. We may predict the engine will continue on this series, without alteration, for eternity. But AHA - the mathematicians amongst you will perceive an unexpected result! How is this possible? How may machinery, without intervention, produce a variation? Another axis, hitherto hidden, has been brought into operation!”
What I got from this is that according to the laws of the logical world, a is always supposed to lead to b. That is when Babbage says “according to one law… it is impossible.. for the machine to produce any other number.” So basically, a cannot lead to c. It just can’t, and that will always be the case (“we may predict the engine will continue…. for eternity”). But then, Babbage says something about an unexpected result, since “another axis once hidden is now brought into play.”
I thought that that quote was a perfect analogy for how R code builds off itself. Within the context of this lab, the various variables built off each other. When we worked with the first variable, things seemed very binary, yes and no, this is that. But then when we introduced other variables and built them off each other, we were able to create more complex results and get what we wanted. So, at first, there was just no one way to see how many of each word were in Bartleby. There was just no one line of code to do that. (“Observe the wheels they will produce a series of numbers according to one law. It is impossible, from its very structure, for the machine to produce any other number.”) But then when we built off variables, and kept doing that, we were able to get a result that once seemed impossible. That is the nestedness and the characteristic of building off each other (and the pipes also signify how the code builds off itself = “Another axis, hitherto hidden, has been brought into operation!”). What once seemed impossible is now possible because we built each piece of code off the last, because one thing led to another, and to another, and to this result that seemed impossible. So, after all, a can equal and lead to c, or even d or e (you get the idea!) because a -> b -> c -> d, a -> d, a = d.
But we should also think of R coding in general.
Every time I looked up a piece of code online, it would always mention who developed it. And when we were working with the poetry bots today, and when David was guiding us through a lab, both Professor Cordell and David mentioned that they came up with the code themselves. So that just goes to show how code and commands themselves are also new laws and rules in the world of R, just like how the rules and laws in the case of this lab were the variables and how they built off each other. And now, we can build off different lines of code and connect them together to achieve so so much. There are endless possibilities! Actually, Lady Lovelace’s response to Babbage’s quote above perfectly fits this idea of unlimited possibilities. She says on page 22,
“It can tabulate accurately and to an unlimited extent all series whose general term is comprised by the formula (complicated math equation, let’s not bother with trying to input it, please.)!!! Indeed, ALL other series which are capable of tabulation by the method of differences!!”
To which Babbage responds, “Exactly!”
We’ve worked with tables before, yes? Of course we have! The variables for this whole lab were all organized in tables! And if we build off what I just said about R commands that are being written and invented all the time, and variables building off each other, this really emphasizes how there are unlimited possibilities with R coding. We will always be building off others’ work, always using different variables within those commands, and at this point, a can more than equal e, a can equal z!
What else can a equal?
Lady Lovelace says on page 26, “Indeed, we may consider the engine as the material and mechanical representative of analysis! Such a science of operations has its own truth and logic… a process which alters the relation of two or more things… any process it might act upon other things besides number…. the bounds of arithmetic have been overstepped!”
So a can more than equal z. a can equal another dimension (the bounds… overstepped). a can equal 1, a can equal ㅂ (a Korean letter).. a can equal anything!!
Before I go, I would like to point out that Lady Lovelace seems to be defining this exact infiniteness and greater purpose of R and coding and variables on page 27.
“In enabling mechanism to combine together general symbols in successions of unlimited variety and extent (R code and variables building off each other!!) … a uniting link is established between the operations of matter and the abstract mental processes of the most abstract branch of mathematical science! (two wholly unrelated fields can explain each other –> you could start off somewhere and end up with a result that seems completely infeasible with where you started off with!). The engine could analyze all subjects in the universe! A new, a vast, and powerful language is developed for the future use of analysis, in which to wield its truths! Almost a political science (and we started off with math!)”
Just looking at the illustration above the quote on page 27 demonstrates the infinite possibilities. And all that is because the difference engine facilitated this new political science language that is somehow removed from and somehow so connected to the math of the difference engine.
In the same way, R can facilitate infinite possibilities. It can create poems to tweet to Twitter, it can create cool and colorful graphs and charts, all because it’s a language that builds off itself and in the process, is translated to something else that transforms…
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.