This week’s approach to text analysis felt more complex than last week’s. I was familiar with Voyant and Tableau, but the dive into topic modeling went deeper than anything I’d encountered before. My only experience using Python was at an SSDA Intro to Python workshop last year, which was (as one would expect) heavily social-sciences oriented, so I struggled to use the language in ways I never would for my own research. I was therefore very excited to dig into this week’s focus on Text Analysis with Python.
Nguyen et al.’s article “How We Do Things With Words: Analyzing Text as Social and Cultural Data” provided a comprehensive overview of the steps someone might take in creating a text analysis research project. The article’s specificity was helpful in imagining how I might formulate my own project by working through developing research questions, conceptualization, data, operationalization, and analysis. The example of using Reddit to examine hate speech was striking, particularly given everything that has happened in the last year. The section I found most compelling was probably the one on operationalization; the sub-section on modeling considerations alone demonstrated how much I was leaving out of what it takes to develop this kind of project.
The Natural Language Processing with Python book was probably my most frustrating experience in this class so far, as I tried to work through the exercises and the provided tutorials as I went along. After installing Anaconda, I had a lot of trouble getting Jupyter Notebook to run the command that downloads the suggested corpora (in total it took three hours, about 2.5 more than I’d care to admit). Still, this hands-on approach to working with text analysis was enlightening, and I felt more comfortable with these exercises than with the ones at the SSDA workshop I mentioned above. I was able to run commands to compare texts, determine word frequencies, and even reorder or combine sentences, among many other things. I also must share that I was very entertained by running commands that compared Monty Python and the Holy Grail to the Book of Genesis. The later chapters, as the Preface acknowledges, are much more in-depth and rely on more specialized Python and linguistic knowledge, but if I have time over the summer I might try to work through the entire book rather than just read it.
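To give a flavor of the kind of commands the book walks you through: the word-frequency and text-comparison exercises boil down to tokenizing a text, counting tokens, and computing simple statistics like lexical diversity (the ratio of unique words to total words, which the NLTK book uses to contrast Monty Python and the Holy Grail with Genesis). The sketch below is a plain-Python approximation of those ideas using `collections.Counter` rather than NLTK itself; the function names and the sample sentence are my own illustrations, not commands from the book.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase a text, split it into word tokens, and count each word,
    roughly mirroring what NLTK's FreqDist does in the book's exercises."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def lexical_diversity(text):
    """Unique words divided by total words, the statistic the NLTK book
    uses to compare the vocabulary richness of different texts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

# A short Genesis-flavored sample sentence for illustration.
sample = ("In the beginning God created the heaven and the earth. "
          "And the earth was without form, and void.")

print(word_frequencies(sample).most_common(3))
# -> [('the', 4), ('and', 3), ('earth', 2)]
print(round(lexical_diversity(sample), 2))
```

With NLTK installed and the corpora downloaded, `nltk.FreqDist` and the book’s `lexical_diversity` recipe do the same work over full texts like `text6` (Monty Python) and `text3` (Genesis).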
Sandeep Soni et al.’s article, “Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers,” was very helpful in that it provided an example of a specific project using this kind of methodology. The data and results sections included several ways of visualizing the numerical outputs produced by the tests. For instance, figures on pages 29 and 31 represent leader–follower pairs, while Figure 6 on page 32 visualizes PageRank scores. Ultimately, the research team was able to trace shifts in the meanings of words and how those changes diffused across newspapers.
Honestly, I feel that I could spend many more weeks working on this unit. Much of the mathematical notation used across these works was over my head, and I know that two days of working through Python tutorials is nowhere near enough for me to even begin to understand all of the ins and outs that make it useful in this kind of project. However, I definitely think I’ve found my summer project…