Lab 04 introduces the tools of string operators and regular expressions that enable rich text analysis in R. They are important functions for cleaning data in large datasets, generating new variables, and qualitative analysis of text-based databases using tools like content analysis, sentiment analysis, and natural language processing libraries.
The data for the lab comes from IRS archives of 1023-EZ applications that nonprofits submit when they are filing for tax-exempt status. We will use mission statements, organizational names, and activity codes.
The lab consists of three parts. Part I is a warm-up that asks you to construct a few regular expressions to identify specific patterns in the mission text.
Part II asks you to use the quanteda package, a popular text analysis package in R, to perform a simple content analysis by counting the most frequently-used words in the mission statements.
Part III asks you to use text functions and regular expressions to search mission statements to develop a sample of specific nonprofits using keywords and phrases.
# REMEMBER TO RUN ONCE AND NEVER INCLUDE
# INSTALL STEPS IN RMD FILES!
quanteda.bundle <- c( "quanteda", "quanteda.textmodels",
                      "quanteda.textstats", "quanteda.textplots" )
install.packages( quanteda.bundle )
library( dplyr )
library( pander )
library( quanteda )
library( quanteda.textmodels )
library( quanteda.textstats )
library( quanteda.textplots )
IRS documentation on 1023-EZ forms is available here.
Use this archived version of the data for the lab:
URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
dat <- readRDS(gzcon(url( URL )))
We will be working primarily with the nonprofit activity codes and mission text for this assignment:
head( dat[ c("orgname","codedef01","mission") ] ) %>% pander()
orgname | codedef01 |
---|---|
NIA PERFORMING ARTS | Arts, Culture, and Humanities |
THE YOUNG ACTORS GUILD INC | Arts, Culture, and Humanities |
RUTH STAGE INC | Arts, Culture, and Humanities |
STRIPLIGHT COMMUNITY THEATRE INC | Arts, Culture, and Humanities |
NU BLACK ARTS WEST THEATRE | Arts, Culture, and Humanities |
OLIVE BRANCH THEATRICALS INC | Arts, Culture, and Humanities |
mission |
---|
A community based art organization that inspires, nutures,educates and empower artist and community. |
We engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. We produce community theater productions for children as well as educational theater camps and workshops. |
Theater performances and performing arts |
To produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability. |
For each of the patterns below, print the first six missions that match and report the total count of missions that match.
You will use grep() and grepl(). By default, grep() returns the index positions of the elements that match the pattern; with value=TRUE it returns the matching strings themselves. grepl() is the logical version of grep(): it returns a vector of TRUE or FALSE values, one per element, with TRUE marking the cases that match the specified criteria.
grep( pattern="some.reg.ex", x="mission statements", value=TRUE )
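To see the difference between the two functions, here is a minimal sketch on a made-up vector. The `missions` vector is hypothetical and is not taken from the IRS data.

```r
# toy mission statements (hypothetical, for illustration only)
missions <- c( "To feed 100 families each month",
               "Youth soccer league",
               "Provide 3-5 scholarships annually" )

# default behavior: index positions of the matching elements
match.index <- grep( "[0-9]", missions )
match.index

# value=TRUE: the matching strings themselves
match.text <- grep( "[0-9]", missions, value=TRUE )
match.text

# grepl(): one TRUE/FALSE per element, so sum() counts the matches
has.number <- grepl( "[0-9]", missions )
sum( has.number )
```

Because grepl() always returns one value per element, it is the version to use when you want to count matches or subset a data frame.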
You will print and count matches as follows. As an example, let’s search for missions that contain numbers.
grep( pattern="[0-9]", x=dat$mission, value=TRUE ) %>% head()
## [1] "Provide entertainment and education to private residence of a private residential community over 55."
## [2] "To serve the community as a nonprofit organization dedicated to producing live theatre and educational opportunities in the theater arts. The theater's primary activity is to put on 3-5 plays annually in Colorado Springs, CO."
## [3] "The organization is a theater company that performs 3-4 plays per year."
## [4] "Our mission is to facilitate personal growth and social development through the creativity of the Theatre Arts. We offer musical theatre camps for ages 4-7, 8-12, and 13-20 in the summers & community theatre and classes for all ages fall-spring"
## [5] "Nurture minority actors, directors, playwrights, and theater artists by offering them the opportunity to participate in the best classic, contemporary, and original theater (A65 and R30)."
## [6] "The 574 Theatre Company strives to be a professional theatre company located in St. Joseph County, IN who seeks to create, inspire, and educate the members of the 574 community by producing high quality and innovative theatrical entertainment."
grepl( "[0-9]", dat$mission ) %>% sum()
## [1] 4142
How many missions start with the word “to”? Make sure it is the word “to” and not words that start with “to” like “towards”. You can ignore capitalization.
How many mission fields are blank? How many mission fields contain only spaces (one or more)?
How many missions have trailing spaces (extra spaces at the end)? After identifying the cases with trailing spaces use the trim white space function trimws() to clean them up.
How many missions contain the dollar sign? Note that the dollar sign is a special symbol, so you need to use an escape character to search for it.
How many mission statements contain numbers that are at least two digits long? You will need to use a quantifier from regular expressions.
Report your code and answers for these five questions.
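The questions above draw on a handful of regular expression building blocks: anchors, word boundaries, whitespace classes, escape characters, and quantifiers. The sketch below shows each one on a made-up vector `x` (hypothetical strings, not the lab data). The patterns are deliberately different from the ones the questions ask for, so you will still need to adapt them.

```r
# toy strings (hypothetical, for illustration only)
x <- c( "The arts council", "Theater for all", "  leading spaces",
        "Open 24.7", "free meals", "" )

# ^ anchors to the start; \\b is a word boundary, so "Theater" does not match
starts.the <- grepl( "^the\\b", x, ignore.case=TRUE )

# ^ and $ together with nothing in between match only empty strings
is.empty <- grepl( "^$", x )

# \\s is whitespace and + means "one or more": spaces at the START of a string
lead.space <- grepl( "^\\s+", x )

# special characters must be escaped: "\\." is a literal period
has.period <- grepl( "\\.", x )

# {2,} is a quantifier: two or more of the preceding item
double.e <- grepl( "e{2,}", x )

# trimws() strips leading and trailing whitespace
trimmed <- trimws( x[3] )
trimmed
```

Wrapping any of these logical vectors in sum() gives the count of matching cases.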
Perform a very basic content analysis with the mission text. Report the ten most frequently-used words in mission statements. Exclude punctuation, and “stem” the words.
You will be using the quanteda package in R for the language processing functions. It is an extremely powerful tool that integrates with a variety of natural language processing tools, qualitative analysis packages, and machine learning frameworks for predictive analytics using text inputs.
In general, languages and thus text are semi-structured data sources. There are patterns and rules to languages, but rules are less rigid and patterns can be more subtle (computers are much better at picking out patterns in language use from large amounts of text than humans are). As a result of the nature of text as data, you will find that the cleaning, processing, and preparation steps can be more intensive than with quantitative data. They are designed to filter out large portions of text that hold sentences together and create subtle meaning in context, but offer little in terms of general pattern recognition. Eliminating capitalization and punctuation helps simplify the text. Paragraphs and sentences are atomized into lists of words. And things like stemming or converting multiple words to single compound words (e.g. White House to white_house) help reduce the complexity of the text.
The short tutorial below is meant to introduce you to a few functions that can be useful for initiating analysis with text and introduce you to common pre-processing steps.
Note that in the field of literature we refer to an author’s or a field’s body of work. In text analysis, we refer to a database of text-based documents as a “corpus” (Latin for body). Each document has text, which is the unit of analysis. But it also has meta-data that is useful for making sense of the text and identifying patterns. Common meta-data might be things like year of publication, author, type of document (newspaper article, tweet, email, spoken speech, etc.). The corpus() function primarily serves to make the text database easy to use by keeping the text and meta-data connected and simpatico during pre-processing steps.
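A minimal sketch of how corpus() keeps text and meta-data together, assuming the quanteda package is installed. The `toy` data frame and its columns are made up for illustration.

```r
library( quanteda )

# toy documents with meta-data (hypothetical, for illustration only)
toy <- data.frame( orgname = c("ORG A", "ORG B", "ORG C"),
                   year    = c(2018, 2018, 2019),
                   mission = c("we feed families.",
                               "youth soccer for all.",
                               "protect local rivers."),
                   stringsAsFactors = FALSE )

# text_field names the column that holds the text;
# the remaining columns are stored as document variables (docvars)
toy.corp <- corpus( toy, text_field="mission" )

ndoc( toy.corp )       # number of documents
docvars( toy.corp )    # the meta-data attached to each document

# subsetting on a docvar keeps text and meta-data aligned
corp.2018 <- corpus_subset( toy.corp, year == 2018 )
docvars( corp.2018, "orgname" )
```

Any later filtering or grouping can then refer to the docvars by name instead of re-merging the original data frame.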
Typically large texts are broken into smaller parts, or “tokenized”. Paragraphs can be split into sentences, sentences split into words. In regression we pay attention to correlations between numbers - when one variable X is increasing, is another variable Y also increasing, decreasing, or not covarying with X? In text analysis the analogous operation is co-occurrence. How often do words co-occur in sentences or documents? Or do we expect them to co-occur more frequently than they actually do, given their frequency in the corpus (the equivalent of two words being negatively correlated)? It is through tokenization that these relationships can be established.
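To make co-occurrence concrete, here is a small base R sketch that tokenizes three made-up documents and counts how many documents each pair of words shares. The `docs` vector is hypothetical. This is only one simple way to count co-occurrence; quanteda offers fcm() for the same purpose on real corpora.

```r
# toy documents (hypothetical, for illustration only)
docs <- c( "food bank for families",
           "food pantry for families",
           "youth soccer league" )

# tokenize: split each document into words
toks <- strsplit( docs, " " )

# document-by-word incidence matrix: 1 if the word appears in the document
vocab <- sort( unique( unlist( toks ) ) )
m <- t( sapply( toks, function(w) as.integer( vocab %in% w ) ) )
colnames( m ) <- vocab

# word-by-word co-occurrence: number of documents in which both words appear
co <- crossprod( m )
co[ "food", "families" ]   # 2: they appear together in two documents
co[ "food", "youth" ]      # 0: they never appear together
```

The diagonal of `co` holds the number of documents that contain each word, which is the baseline you would need to judge whether two words co-occur more or less often than expected.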
In the example below we will split missions into sets of words, apply a “dictionary” or “thesaurus” to join multiple words that describe a single concept (e.g. New York City), stem the words to standardize them as much as possible, then conduct the simplest type of content analysis possible - count word frequency.
# library( quanteda )
# convert missions to all lower-case
dat$mission <- tolower( dat$mission )
# use a sample for demo purposes
dat.sample <- dat[ sample( 1:50000, size=1000 ) , ]
corp <- corpus( dat.sample, text_field="mission" )
corp
## Corpus consisting of 1,000 documents and 36 docvars.
## text1 :
## "we are advocacy group that provides free or low cost legal r..."
##
## text2 :
## "to enlist volunteers, promote the social, educational and re..."
##
## text3 :
## "raise funds and support necessary to support hospital staff ..."
##
## text4 :
## "sole purpose is raising funds to support our local high scho..."
##
## text5 :
## "help dedicated/responsible pet owners with partial financial..."
##
## text6 :
## "gurdon baseball and softball mission is to provide a league ..."
##
## [ reached max_ndoc ... 994 more documents ]
# print first five missions
corp[ 1:5 ]
## Corpus consisting of 5 documents and 36 docvars.
## text1 :
## "we are advocacy group that provides free or low cost legal r..."
##
## text2 :
## "to enlist volunteers, promote the social, educational and re..."
##
## text3 :
## "raise funds and support necessary to support hospital staff ..."
##
## text4 :
## "sole purpose is raising funds to support our local high scho..."
##
## text5 :
## "help dedicated/responsible pet owners with partial financial..."
# summarize corpus
summary(corp)[1:10,]
# pre-processing steps:
# remove sentences that contain fewer than 3 tokens
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )
# remove punctuation
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )
## Tokens consisting of 6 documents and 36 docvars.
## text1 :
## [1] "we" "are" "advocacy" "group"
## [5] "that" "provides" "free" "or"
## [9] "low" "cost" "legal" "representation"
## [ ... and 22 more ]
##
## text2 :
## [1] "to" "enlist" "volunteers" "promote" "the"
## [6] "social" "educational" "and" "recreational" "well-being"
## [11] "of" "the"
## [ ... and 23 more ]
##
## text3 :
## [1] "raise" "funds" "and" "support" "necessary" "to"
## [7] "support" "hospital" "staff" "in" "care" "and"
## [ ... and 3 more ]
##
## text4 :
## [1] "sole" "purpose" "is" "raising" "funds" "to" "support"
## [8] "our" "local" "high" "school" "ffa"
## [ ... and 21 more ]
##
## text5 :
## [1] "help" "dedicated" "responsible" "pet" "owners"
## [6] "with" "partial" "financial" "assistance" "for"
## [11] "specialty" "medical"
## [ ... and 24 more ]
##
## text6 :
## [1] "gurdon" "baseball" "and" "softball" "mission" "is"
## [7] "to" "provide" "a" "league" "within" "our"
## [ ... and 19 more ]
# remove filler words like the, and, a, to
tokens <- tokens_remove( tokens, c( stopwords("english"), "nbsp" ), padding=F )
my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                                   united_states = c("united states"),
                                   high_school=c("high school"),
                                   non_profit=c("non-profit", "non profit"),
                                   stem=c("science technology engineering math",
                                          "science technology engineering mathematics" ),
                                   los_angeles=c("los angeles"),
                                   ny_state=c("new york state"),
                                   ny=c("new york")
                                 ))
# apply the dictionary to the text
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )
## Tokens consisting of 6 documents and 36 docvars.
## text1 :
## [1] "advocacy" "group" "provides" "free"
## [5] "low" "cost" "legal" "representation"
## [9] "underprivileged" "individuals" "families" "plan"
## [ ... and 10 more ]
##
## text2 :
## [1] "enlist" "volunteers" "promote" "social" "educational"
## [6] "recreational" "well-being" "city's" "residents" "visitors"
## [11] "lessen" "burdens"
## [ ... and 9 more ]
##
## text3 :
## [1] "raise" "funds" "support" "necessary" "support" "hospital"
## [7] "staff" "care" "comfort" "children"
##
## text4 :
## [1] "sole" "purpose" "raising" "funds" "support"
## [6] "local" "high_school" "ffa" "chapter" "provide"
## [11] "educational" "materials"
## [ ... and 10 more ]
##
## text5 :
## [1] "help" "dedicated" "responsible" "pet" "owners"
## [6] "partial" "financial" "assistance" "specialty" "medical"
## [11] "care" "bridge"
## [ ... and 15 more ]
##
## text6 :
## [1] "gurdon" "baseball" "softball" "mission" "provide"
## [6] "league" "within" "community" "children" "play"
## [11] "ball" "non_profit"
## [ ... and 4 more ]
# find frequently co-occurring words (typically compound words)
ngram2 <- tokens_ngrams( tokens, n=2 ) %>% dfm()
ngram2 %>% textstat_frequency( n=10 )
ngram3 <- tokens_ngrams( tokens, n=3 ) %>% dfm()
ngram3 %>% textstat_frequency( n=10 )
tokens %>% dfm( stem=F ) %>% topfeatures( )
## provid educ communiti organ support mission youth purpos
## 390 281 274 233 206 167 143 136
## promot famili
## 130 117
Many words have a stem that is altered when conjugated (if a verb) or made plural (if a noun). As a result, it can be hard to consistently count the appearances of a specific word.
Stemming removes the last part of the word such that the word is reduced to its most basic stem. For example, running would become run, and Tuesdays would become Tuesday.
Quanteda already has a powerful stemming function included.
tokens %>% dfm( stem=T ) %>% topfeatures( )
## provid educ communiti organ support mission youth purpos
## 390 281 274 233 206 167 143 136
## promot famili
## 130 117
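Depending on the installed quanteda version, passing stem to dfm() may produce a deprecation warning. The package also provides dedicated stemming functions, which can be applied as a separate step. The sketch below assumes quanteda is installed and uses made-up text; `toy.tokens` and `toy.dfm` are hypothetical names.

```r
library( quanteda )

# stem individual words
stems <- char_wordstem( c("running", "communities", "families") )
stems

# stem the tokens first, then build the document-feature matrix
toy.tokens <- tokens( c("supporting families", "family support services"),
                      remove_punct=TRUE )
toy.dfm <- dfm( tokens_wordstem( toy.tokens ) )

# "supporting" and "support" are now counted as the same feature,
# as are "families" and "family"
top.stems <- topfeatures( toy.dfm )
top.stems
```

Note that stems are not always dictionary words: "communities" and "families" reduce to "communiti" and "famili", which matches the output shown above.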
Replicate the steps above with the following criteria:
For the last part of this lab, you will use text to classify nonprofits.
A large foundation is interested in knowing how many new nonprofits created in 2018 have an explicit mission of serving minority communities. We will start by trying to identify nonprofits that are membership organizations for Black communities or provide services to Black communities.
To do this, you will create a set of words or phrases that you believe indicates that the nonprofit works with or for the target population.
You will need to think about different ways that language might be used distinctively within the mission statements of nonprofits that serve Black communities. There is a lot of trial and error involved, as you can test several words and phrases, preview the mission statements that are identified, then refine your methods.
Your final product will be a data frame of the nonprofit names, activity codes, and mission statements for the organizations identified by your criteria. The goal is to identify as many as possible while minimizing errors that occur when you include a nonprofit that does not serve the Black community. This example was selected specifically because “black” is a common and ambiguous term.
To get you started, let’s look at a similar example where we want to identify immigrant rights nonprofits. We would begin as follows:
# start with key phrases
#
# use grep( ..., value=TRUE ) so you can view mission statements
# that meet your criteria and adjust the language as necessary
grep( "immigrant rights", dat$mission, value=TRUE ) %>% head()
## [1] "community justice alliance, inc. promotes social change through advocacy, communications, community education, and litigation in the areas of racial justice, immigrant rights, and political access."
## [2] "to provide fair, trustworthy immigration legal counsel. to provide information on immigrant rights and opportunities within the community."
grep( "immigration", dat$mission, value=TRUE ) %>% head()
## [1] "charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks."
## [2] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [3] "helping the immigration community with education, culture and humanity."
## [4] "provides legal immigration service to low income immigrants"
## [5] "to research migration and immigration patterns impacting economic, political and social landscape in the united states."
## [6] "to provide immigration services for low-income population in new york city, educating the public about information and issues related to immigration law as well as organizing educational programs for youth and women."
grep( "refugee", dat$mission, value=TRUE ) %>% head()
## [1] "the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment. in addition, we wish to help the homeless, refugees and be an educational resource."
## [2] "dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness."
## [3] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [4] "ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources."
## [5] "fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence."
## [6] "the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society."
After you feel comfortable that individual statements are primarily identifying nonprofits within your desired group and have low error rates, you will need to combine all of the criteria to create one group. Note that any organization that has more than one keyword or phrase in its mission statement would be double-counted if you use the raw groups, so we need to make sure we include each organization only once. We can do this using compound logical statements.
Note that grepl() returns a logical vector, so we can combine multiple statements using AND and OR conditions.
criteria.01 <- grepl( "immigrant rights", dat$mission )
criteria.02 <- grepl( "immigration", dat$mission )
criteria.03 <- grepl( "refugee", dat$mission )
criteria.04 <- grepl( "humanitarian", dat$mission )
criteria.05 <- ! grepl( "humanities", dat$mission ) # exclude humanities criteria
Note that to select all high school boys you would write:
( grade_9 | grade_10 | grade_11 | grade_12 ) & ( boys )
You would NOT specify:
( grade_9 | grade_10 | grade_11 | grade_12 ) | boys
Because that would then include boys at all levels and all people in grades 9-12.
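The same distinction can be checked on a small made-up roster. The `students` data frame below is hypothetical and only serves to show how AND and OR change the size of the selected group.

```r
# toy roster (hypothetical, for illustration only)
students <- data.frame( name  = c("Ana", "Ben", "Cal", "Dee", "Eli"),
                        grade = c(9, 10, 7, 12, 8),
                        boy   = c(FALSE, TRUE, TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE )

high.school <- students$grade >= 9 & students$grade <= 12

# AND: must be in high school and be a boy
hs.boys <- students$name[ high.school & students$boy ]
hs.boys

# OR: anyone in high school plus all boys, a much larger group
hs.or.boys <- students$name[ high.school | students$boy ]
hs.or.boys
```

Here the AND version selects one student, while the OR version selects all five.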
Now create your sample:
these.nonprofits <- ( criteria.01 | criteria.02 | criteria.03 | criteria.04 ) & criteria.05
sum( these.nonprofits )
## [1] 406
dat$activity.code <- paste0( dat$codedef01, ": ", dat$codedef02 )
d.immigrant <- dat[ these.nonprofits, c("orgname","activity.code","mission") ]
row.names( d.immigrant ) <- NULL
d.immigrant %>% head(25) %>% pander()
orgname | activity.code |
---|---|
WALK OF STARS FOUNDATION | Arts, Culture, and Humanities: Media, Communications Organizations |
UNITED ARAB-AMERICAN SOCIETY | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
INCLUSIVE MOVEMENT FOR BOSNIA AND HERZEGOVINA INC | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
HEREDITARY ORDER OF THE RED DRAGON | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
FILIPINO-AMERICAN ASSOCIATION OF COASTAL GEORGIA INC | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
CYRUS THE GREAT KING OF PERSIA FOUNDATION | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
DANCE BEYOND BORDERS | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
SIKKO-MANDO RELIEF ASSOCIATION | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ETHIOPIAN AND ERITREAN CULTURAL ANDRESOURCE CENTER | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
HARYANVI BAYAREA ASSOCIATION | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
STREET CRY INC | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ASOCIACION DE MIGRANTES TIERRA Y LIBERTAD | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
CONCERTS FOR COMPASSION INCORPORATED | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ARTOGETHER | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
TEEN TREEHUGGERS INC | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
UMUKABIA EHIME DEVELOPMENT ASSOC | Arts, Culture, and Humanities: Single Organization Support |
EVAN WALKER ORGANIZATION | Arts, Culture, and Humanities: Single Organization Support |
SAEA INC | Arts, Culture, and Humanities: Professional Societies & Associations |
ROSEDALE PICTURES INC | Arts, Culture, and Humanities: Film, Video |
AMERICAN DELEGATION OF THE ORDER OFDANILO 1 INC | Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C. |
VALLEY MOO DUK KWAN | Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C. |
MOVE THE WORLD | Arts, Culture, and Humanities: Dance |
BOOTHEEL CULTURAL AND PERFORMING ARTS CENTER | Arts, Culture, and Humanities: Arts Service Activities/ Organizations |
GREAT NSASS ALUMNI ASSOCIATION OF NORTH AMERICA INC | Arts, Culture, and Humanities: Humanities Organizations |
CHULA VISTA SUNSET ROTARY FOUNDATION INC | Arts, Culture, and Humanities: Humanities Organizations |
mission |
---|
the mission of the walk of stars foundation is to honor those who have excelled and made major contributions in their respective capacity in the entertainment area in motion pictures, radio and television, humanitarians, civic leaders, medal of hono |
to serve as a social organization for arab-americans to preserve the arab heritage, culture and traditions. to promote and support humanitarian and community outreach efforts locally and internationally. |
to promote the advancement of bosnia and herzegovina by fostering an inclusive platform for innovation and entrepreneurship; cultural and humanitarian activities; and networking. |
charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks. |
to engage in humanitarian, civic, educational , cultural, and charitable activities that would preserve, promote, and share with the community the customs, values and heritage of the filipino culture. |
the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment. in addition, we wish to help the homeless, refugees and be an educational resource. |
dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness. |
1.provide immigration and other social assistance services to new oromo immigrants and refugees; 2.provide health awareness and education services to members and the community at large; 3.promote self-help and social assistance among the oromos |
ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources. |
hba is involved in multiple non-profit activities including haryanvi cultural promotion and preservation, community services, educational activities, humanitarian aid and social activities. |
fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence. |
helping the immigration community with education, culture and humanity. |
the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society. |
artogether is a community building creative arts project that hosts free creative art workshops, social gatherings, and family picnics to forge connections between the refugee community and the general public. |
the purpose for which the corporation is organized is to provide youth the platform to address wildlife, environmental, and humanitarian issues through arts journalism. |
umukabia ehime development association, usa, is organized exclusively for charitable purposes, which includes assisting nigerian abandoned and disabled children and to support humanitarian programs in our community and the public in general. |
our mission is to spread prosperity and compassion for all by achieving our commitment to excellence in humanitarian, environmental, and scientific initiatives. we recently conducted a successful food drive for evacuees of wildfires in ca. |
sudanese american engineers association is a non-profit, non-political, educational and humanitarian organization. its members are engineer professionals of sudanese descent. |
to train employ underprivileged individuals particularly refugees in the visual media industry. build moviemaking skills with learn by doing approach creating content to be produced by the organization with profits to partially fund future projects |
perpetuating the traditions of the dynastic and hereditary chivalric orders of the royal house of petrovi-njegos by supporting charitable, humanitarian. educational and artistic works that promote a continuing public interest in their history. |
to undertake charitable activities, both through group and individual action consistent with the practice of the art and in the interest of humanitarian principles. to facilitate participation in, and stugy of the martial arts of soo bah do. |
through the powerful expression of dance, our youth dance company performs on a local, nat’l and global stage to raise awareness of social and environmental issues. example: pollution, hunger, global refugee crisis |
to meet the cultural and humanitarian needs of the under served in the bootheel region |
humanitarian aid educational services for children and adults that are in hardships or in disaster areas within and outside united states |
world and local humanitarian, educational and cultural community service |
P3-Q1. Report the total count of organizations you have identified as serving Black communities.
P3-Q2. Take a random sample of 20 of the organizations in your sample and verify that your search criteria are identifying nonprofits that serve Black communities.
sample <- dplyr::sample_n( d.immigrant, 20 )
Report your rate of false positives in this sample (how many organizations in your sample do NOT belong there).
Note that an error rate of 10% in a classification model is very good. An error rate above 40% suggests poor performance.
We are measuring false positives, here, not overall accuracy. You can have a very low false positive rate by using extremely narrow search criteria, but then you will miss a large portion of the population you are trying to capture and study (the false negative rate).
The more inclusive your criteria are, the larger your false positive rate will be. The goal is to design search criteria that identify a large number of organizations while keeping false positive rates reasonable.
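One simple way to compute the rate, sketched under the assumption that you read each of the 20 sampled missions and record by hand whether the organization belongs in the group. The values in `belongs` are made up; replace them with your own review.

```r
# hand-coded review of 20 sampled organizations:
# TRUE  = the organization really belongs in the group
# FALSE = false positive
# (made-up values for illustration only)
belongs <- c( rep(TRUE, 17), rep(FALSE, 3) )

# share of the sample that does not belong
false.positive.rate <- mean( !belongs )
false.positive.rate    # 3 of 20 = 0.15
```

Since the sample is random, calling set.seed() before drawing it will make the result reproducible when you knit the file.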
P3-Q3. Similar to the immigrant example, print a data frame that contains the nonprofit names, activity codes (see above on how to combine them), and their mission statements. Print the full data frame for your sample so all missions are visible.
If you selected three nonprofit subsectors from the activity codes (code01), then created three data subsets based upon these criteria, you could use content analysis to compare how the three groups use language differently.
Re-run the content analysis to identify the most frequently-used words. But this time run it separately for each subsector.
How do the most frequently used words vary by subsector? Which words are shared between the three subsectors? Which are distinct?
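A minimal sketch of a per-subsector word count, assuming quanteda is installed. The `toy` data frame stands in for the lab data and its contents are made up; only the column name codedef01 comes from the dataset described above.

```r
library( quanteda )

# toy data (hypothetical, for illustration only)
toy <- data.frame(
  codedef01 = c("Arts", "Arts", "Education", "Education"),
  mission   = c("community theater for youth",
                "theater and dance performances",
                "tutoring for youth",
                "youth tutoring and scholarships"),
  stringsAsFactors = FALSE )

# split the missions by subsector, then rank words within each group
top.words <- lapply( split( toy$mission, toy$codedef01 ), function( txt ) {
  toks <- tokens( txt, remove_punct=TRUE )
  toks <- tokens_remove( toks, stopwords("english") )
  topfeatures( dfm( toks ), n=5 )
})

top.words

# words that appear in both top lists
shared <- intersect( names( top.words$Arts ), names( top.words$Education ) )
shared
```

With three subsectors you can compare the lists pairwise with intersect() and setdiff() to see which words are shared and which are distinct.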
Another way to compare differences in language use is by creating semantic networks:
Compare prominent word relationships in mission statements of arts, environmental, and education nonprofits (codedef01). Build semantic networks for each, then compare and contrast the prominence of specific words within the networks.
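One possible starting point for a semantic network, assuming the quanteda and quanteda.textplots packages are installed. The `arts` vector is made up, and the choice to keep the four most frequent words is arbitrary; on the real data you would keep more.

```r
library( quanteda )
library( quanteda.textplots )

# toy mission statements (hypothetical, for illustration only)
arts <- c( "community theater for youth",
           "youth theater and dance",
           "community dance performances" )

toks <- tokens( arts, remove_punct=TRUE )
toks <- tokens_remove( toks, stopwords("english") )

# feature co-occurrence matrix: how often two words share a document
arts.fcm <- fcm( toks, context="document", tri=FALSE )

# keep only the most frequent words so the plot stays readable
keep <- names( topfeatures( arts.fcm, 4 ) )
arts.net <- fcm_select( arts.fcm, pattern=keep )

# words are nodes; lines connect words that co-occur
textplot_network( arts.net )
```

Repeating these steps on each subsector's missions gives one network per group, which you can then compare side by side.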
When you have completed your assignment, knit your RMD file to generate your rendered HTML file.
Log in to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.
Remember to:
See Google’s R Style Guide for examples.
Note that when you knit a file, it starts from a blank slate. You might have packages loaded or datasets active on your local machine, so you can run code chunks fine. But when you knit you might get errors that functions cannot be located or datasets don’t exist. Be sure that you have included chunks to load these in your RMD file.
Your RMD file will not knit if you have errors in your code. If you get stuck on a question, just add eval=F to the code chunk and it will be ignored when you knit your file. That way I can give you credit for attempting the question and provide guidance on fixing the problem.
If you are having problems with your RMD file, visit the RMD File Styles and Knitting Tips manual.