
CPP 527: Foundations of Data Science II

Lab 04 introduces string operators and regular expressions, the tools that enable rich text analysis in R. They are important for cleaning large datasets, generating new variables, and analyzing text-based databases qualitatively with tools like content analysis, sentiment analysis, and natural language processing libraries.

The data for the lab comes from IRS archives of 1023-EZ applications that nonprofits submit when they are filing for tax-exempt status. We will use mission statements, organizational names, and activity codes.

The lab consists of three parts. Part I is a warm-up that asks you to construct a few regular expressions to identify specific patterns in the mission text.

Part II asks you to use the quanteda package, a popular text analysis package in R, to perform a simple content analysis by counting the most frequently-used words in the mission statements.

Part III asks you to use text functions and regular expressions to search mission statements to develop a sample of specific nonprofits using keywords and phrases.

Setup

Packages

# REMEMBER TO RUN ONCE AND NEVER INCLUDE
# INSTALL STEPS IN RMD FILES!

quanteda.bundle <- c( "quanteda", "quanteda.textmodels",
                      "quanteda.textstats", "quanteda.textplots" )
install.packages( quanteda.bundle )

library( dplyr )
library( pander )
library( quanteda )
library( quanteda.textmodels )
library( quanteda.textstats )
library( quanteda.textplots )

Data

IRS documentation on 1023-EZ forms is available here.

Use this archived version of the data for the lab:

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
dat <- readRDS(gzcon(url( URL )))

We will be working primarily with the nonprofit activity codes and mission text for this assignment:

head( dat[ c("orgname","codedef01","mission") ] ) %>% pander()

Table continues below
orgname	codedef01
NIA PERFORMING ARTS	Arts, Culture, and Humanities
THE YOUNG ACTORS GUILD INC	Arts, Culture, and Humanities
RUTH STAGE INC	Arts, Culture, and Humanities
STRIPLIGHT COMMUNITY THEATRE INC	Arts, Culture, and Humanities
NU BLACK ARTS WEST THEATRE	Arts, Culture, and Humanities
OLIVE BRANCH THEATRICALS INC	Arts, Culture, and Humanities

mission
A community based art organization that inspires, nutures,educates and empower artist and community.
We engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. We produce community theater productions for children as well as educational theater camps and workshops.
Theater performances and performing arts
To produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.

Data Dictionary:

ein: nonprofit tax ID
orgname: organization name
mission: mission statement
code01: NTEE mission code top level
codedef01: mission code definition
code02: NTEE mission code second level
codedef02: mission code definition
orgpurposecharitable: Organization is organized and operated exclusively for Charitable purposes
orgpurposereligious: Organization is organized and operated exclusively for Religious purposes
orgpurposeeducational: Organization is organized and operated exclusively for Educational purposes
orgpurposescientific: Organization is organized and operated exclusively for Scientific purposes
orgpurposeliterary: Organization is organized and operated exclusively for Literary purposes
orgpurposepublicsafety: Organization is organized and operated exclusively for Public Safety purposes
orgpurposeamateursports: Organization is organized and operated exclusively for Amateur Sports purposes
orgpurposecrueltyprevention: Organization is organized and operated exclusively for Cruelty Prevention purposes
leginflno: Organization has NOT attempted or has no plans to attempt to influence legislation
leginflyes: Organization has attempted or has plans to attempt to influence legislation
donatefundsno: Organization has no plans to donate funds or pay expenses to any individuals
donatefundsyes: Organization plans to donate funds or pay expenses to individuals
conductactyoutsideusno: Plans to conduct activities outside of the US?
conductactyoutsideusyes: Plans to conduct activities outside of the US?
compofcrdirtrustno: Organization has no plans to pay compensation to any officers, directors, or trustees
compofcrdirtrustyes: Organization plans to pay compensation to any officers, directors, or trustees
financialtransofcrsno: Organization has no plans to engage in financial transactions (for example, loans, grants, or other assistance, payments for goods or services, rents, etc.) with their officers, directors, or trustees, or any entities they own or control
financialtransofcrsyes: Organization plans to engage in financial transactions (for example, loans, grants, or other assistance, payments for goods or services, rents, etc.) with their officers, directors, or trustees, or any entities they own or control
unrelgrossincm1000moreno: Unrelated gross income of more than $1,000?
unrelgrossincm1000moreyes: Unrelated gross income of more than $1,000?
gamingactyno: Organization has no plans to conduct bingo or other gaming activities
gamingactyyes: Organization plans to conduct bingo or other gaming activities
disasterreliefno: Organization has no plans to provide Disaster Relief assistance
disasterreliefyes: Organization plans to provide Disaster Relief assistance
onethirdsupportpublic: Organization normally receives at least one-third of their support from public sources
onethirdsupportgifts: Organization normally receives at least one-third of their support from a combination of gifts, grants, contributions, membership fees, and gross receipts (from permitted sources) from activities related to their exempt functions
benefitofcollege: Organization is organized and operated exclusively to receive, hold, invest, and administer property for and make expenditures to or for the benefit of a state or municipal college or university
privatefoundation508e: Organization’s organizing document contains specific provisions required by section 508(e)
hospitalorchurchno: Organization does not qualify as a hospital or a church
hospitalorchurchyes: Organization qualifies as a hospital or a church

Part I

For each of the patterns below, print the first six missions that meet the criteria and report the total count of missions that meet the criteria.

You will use grep() and grepl(). The grep() function finds missions that match the criteria; by default it returns their index positions, and with value=TRUE it returns the full matching strings. grepl() is the logical version of grep(): it returns a vector of TRUE or FALSE values, with TRUE marking the cases that match the specified criteria.

grep(  pattern="some.reg.ex", x="mission statements", value=TRUE )
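The difference between the two functions is easiest to see on a small vector. The sketch below uses three made-up strings (not taken from the IRS data) and only base R:

```r
# toy data, invented for illustration (not from the IRS file)
x <- c( "serve 25 families", "support local artists", "host 3 concerts" )

grep( "[0-9]", x )               # index positions of the matches: 1 3
grep( "[0-9]", x, value=TRUE )   # the matching strings themselves
grepl( "[0-9]", x )              # TRUE FALSE TRUE
sum( grepl( "[0-9]", x ) )       # count of matches: 2
```

Because TRUE counts as 1 and FALSE as 0, summing the logical vector from grepl() gives the number of matching cases.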

You will print and count matches as follows. As an example, let’s search for missions that contain numbers.

grep( pattern="[0-9]", x=dat$mission, value=TRUE ) %>% head()

## [1] "Provide entertainment and education to private residence of a private residential community over 55."                                                                                                                                                 
## [2] "To serve the community as a nonprofit organization dedicated to producing live theatre and educational opportunities in the theater arts. The theater's primary activity is to put on 3-5 plays annually in Colorado Springs, CO."                    
## [3] "The organization is a theater company that performs 3-4 plays per year."                                                                                                                                                                              
## [4] "Our mission is to facilitate personal growth and social development through the creativity of the Theatre Arts.  We offer musical theatre camps for ages 4-7, 8-12, and 13-20 in the summers & community theatre and classes for all ages fall-spring"
## [5] "Nurture minority actors, directors, playwrights, and theater artists by offering them the opportunity to participate in the best classic, contemporary, and original theater (A65 and R30)."                                                          
## [6] "The 574 Theatre Company strives to be a professional theatre company located in St. Joseph County, IN who seeks to create, inspire, and educate the members of the 574 community by producing high quality and innovative theatrical entertainment."

grepl( "[0-9]", dat$mission ) %>% sum()

## [1] 4142

1. How many missions start with the word “to”? Make sure it is the word “to” and not words that start with “to” like “towards”. You can ignore capitalization.
2. How many mission fields are blank? How many mission fields contain only spaces (one or more)?
3. How many missions have trailing spaces (extra spaces at the end)? After identifying the cases with trailing spaces, use the trim white space function trimws() to clean them up.
4. How many missions contain the dollar sign? Note that the dollar sign is a special symbol in regular expressions, so you need to use an escape character to search for it.
5. How many mission statements contain numbers that are at least two digits long? You will need to use a quantifier from regular expressions.

Report your code and answers for these five questions.
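As a warm-up, here is a minimal sketch of the regular expression building blocks these questions rely on: anchors, word boundaries, escapes, and quantifiers. The strings and patterns are invented for illustration and are deliberately not the ones asked for above, so you will still need to adapt them.

```r
# toy strings, invented for illustration (not from the IRS file)
x <- c( "We feed 12 families", "west side arts ", "", "   ", "cost: 5.50 (approx.)" )

# anchor + word boundary: starts with the word "we", not "west"
grepl( "^we\\b", x, ignore.case=TRUE )   # TRUE FALSE FALSE FALSE FALSE

# empty strings versus strings made only of spaces
x == ""                                  # FALSE FALSE TRUE FALSE FALSE
grepl( "^ +$", x )                       # FALSE FALSE FALSE TRUE FALSE

# trailing space, then cleanup with trimws()
grepl( " $", x )                         # FALSE TRUE FALSE TRUE FALSE
x.clean <- trimws( x, which="right" )

# escape a special character: a literal period
grepl( "\\.", x )                        # FALSE FALSE FALSE FALSE TRUE

# quantifier: a run of eight or more lower-case letters
grepl( "[a-z]{8,}", x )                  # TRUE FALSE FALSE FALSE FALSE
```

Note the double backslash: R strings need one backslash to escape the other, so the regular expression engine receives a single backslash.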

Part II

Perform a very basic content analysis with the mission text. Report the ten most frequently-used words in mission statements. Exclude punctuation, and “stem” the words.

You will be using the quanteda package in R for the language processing functions. It is an extremely powerful tool that integrates with a variety of natural language processing tools, qualitative analysis packages, and machine learning frameworks for predictive analytics using text inputs.

In general, languages and thus text are semi-structured data sources. There are patterns and rules to languages, but rules are less rigid and patterns can be more subtle (computers are much better at picking out patterns in language use from large amounts of text than humans are). As a result of the nature of text as data, you will find that the cleaning, processing, and preparation steps can be more intensive than with quantitative data. These steps are designed to filter out large portions of text that hold sentences together and create subtle meaning in context, but offer little in terms of general pattern recognition. Eliminating capitalization and punctuation helps simplify the text. Paragraphs and sentences are atomized into lists of words. And things like stemming or converting multiple words to single compound words (e.g. White House to white_house) help reduce the complexity of the text.

The short tutorial below is meant to introduce you to a few functions that can be useful for initiating analysis with text and introduce you to common pre-processing steps.

Note that in the field of literature we refer to an author’s or a field’s body of work. In text analysis, we refer to a database of text-based documents as a “corpus” (Latin for body). Each document has text, which is the unit of analysis. But it also has meta-data that is useful for making sense of the text and identifying patterns. Common meta-data might be things like year of publication, author, type of document (newspaper article, tweet, email, spoken speech, etc.). The corpus() function primarily serves to make the text database easy to use by keeping the text and meta-data connected and simpatico during pre-processing steps.

Typically large texts are broken into smaller parts, or “tokenized”. Paragraphs can be split into sentences, sentences split into words. In regression we pay attention to correlations between numbers - when one variable X is increasing, is another variable Y also increasing, decreasing, or not covarying with X? In text analysis the analogous operation is co-occurrence. How often do words co-occur in sentences or documents? Or do they co-occur less often than we would expect given their frequency in the corpus (the equivalent of two words being negatively correlated)? It is through tokenization that these relationships can be established.
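To make tokenization and co-occurrence concrete, here is a rough base R sketch on three made-up documents. It is not how quanteda implements these steps; it only shows the idea of splitting text into words and counting how many documents contain each pair of words.

```r
# toy documents, invented for illustration
docs <- c( "youth sports league",
           "youth education programs",
           "sports education for youth" )

# tokenize: split each document into words on spaces
toks <- strsplit( docs, " +" )

# document-term indicator matrix: 1 if the word appears in the document
vocab <- sort( unique( unlist( toks ) ) )
dtm <- t( sapply( toks, function(w) as.integer( vocab %in% w ) ) )
colnames( dtm ) <- vocab

# co-occurrence counts: number of documents that contain both words
cooc <- crossprod( dtm )
cooc[ "youth", "sports" ]      # 2
cooc[ "youth", "education" ]   # 2
cooc[ "league", "programs" ]   # 0
```

The diagonal of cooc holds the number of documents containing each word, which is the baseline you would compare the pair counts against.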

In the example below we will split missions into sets of words, apply a “dictionary” or “thesaurus” to join multiple words that describe a single concept (e.g. New York City), stem the words to standardize them as much as possible, then conduct the simplest type of content analysis possible - count word frequency.

# library( quanteda )

# convert missions to all lower-case 
dat$mission <- tolower( dat$mission )

# use a random sample of 1,000 rows for demo purposes
dat.sample <- dat[ sample( nrow(dat), size=1000 ) , ]

corp <- corpus( dat.sample,  text_field="mission" )
corp

## Corpus consisting of 1,000 documents and 36 docvars.
## text1 :
## "we are advocacy group that provides free or low cost legal r..."
## 
## text2 :
## "to enlist volunteers, promote the social, educational and re..."
## 
## text3 :
## "raise funds and support necessary to support hospital staff ..."
## 
## text4 :
## "sole purpose is raising funds to support our local high scho..."
## 
## text5 :
## "help dedicated/responsible pet owners with partial financial..."
## 
## text6 :
## "gurdon baseball and softball mission is to provide a league ..."
## 
## [ reached max_ndoc ... 994 more documents ]

# print first five missions 
corp[1:5]

## Corpus consisting of 5 documents and 36 docvars.
## text1 :
## "we are advocacy group that provides free or low cost legal r..."
## 
## text2 :
## "to enlist volunteers, promote the social, educational and re..."
## 
## text3 :
## "raise funds and support necessary to support hospital staff ..."
## 
## text4 :
## "sole purpose is raising funds to support our local high scho..."
## 
## text5 :
## "help dedicated/responsible pet owners with partial financial..."

# summarize corpus
summary(corp)[1:10,]

# pre-processing steps:

# remove sentences that are fewer than 3 tokens (words) long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# remove punctuation 
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )

## Tokens consisting of 6 documents and 36 docvars.
## text1 :
##  [1] "we"             "are"            "advocacy"       "group"         
##  [5] "that"           "provides"       "free"           "or"            
##  [9] "low"            "cost"           "legal"          "representation"
## [ ... and 22 more ]
## 
## text2 :
##  [1] "to"           "enlist"       "volunteers"   "promote"      "the"         
##  [6] "social"       "educational"  "and"          "recreational" "well-being"  
## [11] "of"           "the"         
## [ ... and 23 more ]
## 
## text3 :
##  [1] "raise"     "funds"     "and"       "support"   "necessary" "to"       
##  [7] "support"   "hospital"  "staff"     "in"        "care"      "and"      
## [ ... and 3 more ]
## 
## text4 :
##  [1] "sole"    "purpose" "is"      "raising" "funds"   "to"      "support"
##  [8] "our"     "local"   "high"    "school"  "ffa"    
## [ ... and 21 more ]
## 
## text5 :
##  [1] "help"        "dedicated"   "responsible" "pet"         "owners"     
##  [6] "with"        "partial"     "financial"   "assistance"  "for"        
## [11] "specialty"   "medical"    
## [ ... and 24 more ]
## 
## text6 :
##  [1] "gurdon"   "baseball" "and"      "softball" "mission"  "is"      
##  [7] "to"       "provide"  "a"        "league"   "within"   "our"     
## [ ... and 19 more ]

# remove filler words like the, and, a, to
tokens <- tokens_remove( tokens, c( stopwords("english"), "nbsp" ), padding=F )

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

## Tokens consisting of 6 documents and 36 docvars.
## text1 :
##  [1] "advocacy"        "group"           "provides"        "free"           
##  [5] "low"             "cost"            "legal"           "representation" 
##  [9] "underprivileged" "individuals"     "families"        "plan"           
## [ ... and 10 more ]
## 
## text2 :
##  [1] "enlist"       "volunteers"   "promote"      "social"       "educational" 
##  [6] "recreational" "well-being"   "city's"       "residents"    "visitors"    
## [11] "lessen"       "burdens"     
## [ ... and 9 more ]
## 
## text3 :
##  [1] "raise"     "funds"     "support"   "necessary" "support"   "hospital" 
##  [7] "staff"     "care"      "comfort"   "children" 
## 
## text4 :
##  [1] "sole"        "purpose"     "raising"     "funds"       "support"    
##  [6] "local"       "high_school" "ffa"         "chapter"     "provide"    
## [11] "educational" "materials"  
## [ ... and 10 more ]
## 
## text5 :
##  [1] "help"        "dedicated"   "responsible" "pet"         "owners"     
##  [6] "partial"     "financial"   "assistance"  "specialty"   "medical"    
## [11] "care"        "bridge"     
## [ ... and 15 more ]
## 
## text6 :
##  [1] "gurdon"     "baseball"   "softball"   "mission"    "provide"   
##  [6] "league"     "within"     "community"  "children"   "play"      
## [11] "ball"       "non_profit"
## [ ... and 4 more ]

# find frequently co-occurring words (typically compound words)
ngram2 <- tokens_ngrams( tokens, n=2 ) %>% dfm()
ngram2 %>% textstat_frequency( n=10 )

ngram3 <- tokens_ngrams( tokens, n=3 ) %>% dfm()
ngram3 %>% textstat_frequency( n=10 )

Tabulate top word counts

tokens %>% dfm( stem=F ) %>% topfeatures( )

##    provid      educ communiti     organ   support   mission     youth    purpos 
##       390       281       274       233       206       167       143       136 
##    promot    famili 
##       130       117

Stemming

Many words have a stem that is altered when conjugated (if a verb) or made plural (if a noun). As a result, it can be hard to consistently count the appearance of a specific word.

Stemming removes the last part of the word such that the word is reduced to its most basic stem. For example, running would become run, and Tuesdays would become Tuesday.

Quanteda already has a powerful stemming function included.

tokens %>% dfm( stem=T ) %>% topfeatures( )

##    provid      educ communiti     organ   support   mission     youth    purpos 
##       390       281       274       233       206       167       143       136 
##    promot    famili 
##       130       117
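Recent quanteda releases deprecate the stem argument of dfm(), so the call above may warn or fail depending on your installed version. A hedged alternative is to stem the tokens first with tokens_wordstem() and then build the document-feature matrix. The sketch below uses two invented snippets of text rather than the mission data:

```r
library( quanteda )

# toy text, invented for illustration
toks <- tokens( c( d1="providing education",
                   d2="provides educational programs" ) )

# stem the tokens first, then build the document-feature matrix
stems <- tokens_wordstem( toks )
stem.dfm <- dfm( stems )

# most frequent stems across both documents
topfeatures( stem.dfm, n=3 )
```

Both "providing" and "provides" should collapse to the same stem, which is why stemmed counts are higher than counts of the individual word forms.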

Instructions

Replicate the steps above with the following criteria:

Use the full mission dataset, not the small sample used in the demo.
Add at least ten concepts to your dictionary to convert compound words into single words.
Report the ten most frequently-used words in the mission statements after applying stemming.

Part III

For the last part of this lab, you will use text to classify nonprofits.

A large foundation is interested in knowing how many new nonprofits created in 2018 have an explicit mission of serving minority communities. We will start by trying to identify nonprofits that are membership organizations for Black communities or provide services to Black communities.

To do this, you will create a set of words or phrases that you believe indicates that the nonprofit works with or for the target population.

You will need to think about different ways that language might be used distinctively within the mission statements of nonprofits that serve Black communities. There is a lot of trial and error involved, as you can test several words and phrases, preview the mission statements that are identified, then refine your methods.

Your final product will be a data frame of the nonprofit names, activity codes, and mission statements for the organizations identified by your criteria. The goal is to identify as many as possible while minimizing errors that occur when you include a nonprofit that does not serve the Black community. This example was selected specifically because “black” is a common and ambiguous term.

To get you started, let’s look at a similar example where we want to identify immigrant rights nonprofits. We would begin as follows:

# start with key phrases
#
# use grep( ..., value=TRUE ) so you can view mission statements
# that meet your criteria and adjust the language as necessary 
grep( "immigrant rights", dat$mission, value=TRUE ) %>% head()

## [1] "community justice alliance, inc. promotes social change through advocacy, communications, community education, and litigation in the areas of racial justice, immigrant rights, and political access."
## [2] "to provide fair, trustworthy immigration legal counsel. to provide information on immigrant rights and opportunities within the community."

grep( "immigration", dat$mission, value=TRUE ) %>% head()

## [1] "charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks."                                          
## [2] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [3] "helping the immigration community with education, culture and humanity."                                                                                                                                                                                         
## [4] "provides legal immigration service to low income immigrants"                                                                                                                                                                                                     
## [5] "to research migration and immigration patterns impacting economic, political and social landscape in the united states."                                                                                                                                         
## [6] "to provide immigration services for low-income population in new york city, educating the public about information and issues related to immigration law as well as organizing educational programs for youth and women."

grep( "refugee", dat$mission, value=TRUE ) %>% head()

## [1] "the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment.  in addition, we wish to help the homeless, refugees and be an educational resource."                               
## [2] "dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness."            
## [3] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [4] "ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources."             
## [5] "fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence."                   
## [6] "the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society."

After you feel comfortable that individual statements are primarily identifying nonprofits within your desired group and have low error rates, you will need to combine all of the criteria to create one group. Note that any organization that has more than one keyword or phrase in its mission statement would be double-counted if you use the raw groups, so we need to make sure we include each organization only once. We can do this using compound logical statements.

Note that grepl() returns a logical vector, so we can combine multiple statements using AND and OR conditions.

criteria.01 <- grepl( "immigrant rights", dat$mission ) 
criteria.02 <- grepl( "immigration", dat$mission ) 
criteria.03 <- grepl( "refugee", dat$mission ) 
criteria.04 <- grepl( "humanitarian", dat$mission ) 
criteria.05 <- ! grepl( "humanities", dat$mission )  # exclude humanities
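To see why the raw groups would double-count, consider this small sketch with three invented mission statements. Adding the separate counts counts the first mission twice, while combining the logical vectors with OR counts it once:

```r
# toy missions, invented for illustration
m <- c( "immigration and refugee services",
        "refugee resettlement",
        "youth soccer" )

a <- grepl( "immigration", m )   # TRUE FALSE FALSE
b <- grepl( "refugee", m )       # TRUE TRUE FALSE

sum( a ) + sum( b )   # 3: the first mission is counted twice
sum( a | b )          # 2: each mission is counted once
```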

Note that to select all high school boys you would write:

( grade_9 | grade_10 | grade_11 | grade_12 ) & ( boys )

You would NOT specify:

( grade_9 | grade_10 | grade_11 | grade_12 ) | boys

Because that would then include boys at all levels and all people in grades 9-12.
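The high school example can be checked on a toy roster. The grade and boy vectors below are invented for illustration:

```r
# toy student roster, invented for illustration
grade <- c( 8, 9, 10, 12, 7, 11 )
boy   <- c( TRUE, TRUE, FALSE, TRUE, TRUE, FALSE )

high.school <- grade == 9 | grade == 10 | grade == 11 | grade == 12

sum( high.school & boy )   # 2: boys in grades 9-12 only
sum( high.school | boy )   # 6: every boy plus every high schooler
```

The parentheses around the OR conditions matter: in R, & is evaluated before |, so leaving them out changes which cases are selected.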

Now create your sample:

these.nonprofits <- ( criteria.01 | criteria.02 | criteria.03 | criteria.04 ) &  criteria.05
sum( these.nonprofits )

## [1] 406

dat$activity.code <- paste0( dat$codedef01, ": ", dat$codedef02 )
d.immigrant <- dat[ these.nonprofits, c("orgname","activity.code","mission") ] 
row.names( d.immigrant ) <- NULL
d.immigrant %>% head(25) %>% pander()

Table continues below
orgname	activity.code
WALK OF STARS FOUNDATION	Arts, Culture, and Humanities: Media, Communications Organizations
UNITED ARAB-AMERICAN SOCIETY	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
INCLUSIVE MOVEMENT FOR BOSNIA AND HERZEGOVINA INC	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
HEREDITARY ORDER OF THE RED DRAGON	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
FILIPINO-AMERICAN ASSOCIATION OF COASTAL GEORGIA INC	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
CYRUS THE GREAT KING OF PERSIA FOUNDATION	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
DANCE BEYOND BORDERS	Arts, Culture, and Humanities: Cultural/Ethnic Awareness
SIKKO-MANDO RELIEF ASSOCIATION	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
ETHIOPIAN AND ERITREAN CULTURAL ANDRESOURCE CENTER	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
HARYANVI BAYAREA ASSOCIATION	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
STREET CRY INC	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
ASOCIACION DE MIGRANTES TIERRA Y LIBERTAD	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
CONCERTS FOR COMPASSION INCORPORATED	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
ARTOGETHER	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
TEEN TREEHUGGERS INC	Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose
UMUKABIA EHIME DEVELOPMENT ASSOC	Arts, Culture, and Humanities: Single Organization Support
EVAN WALKER ORGANIZATION	Arts, Culture, and Humanities: Single Organization Support
SAEA INC	Arts, Culture, and Humanities: Professional Societies & Associations
ROSEDALE PICTURES INC	Arts, Culture, and Humanities: Film, Video
AMERICAN DELEGATION OF THE ORDER OFDANILO 1 INC	Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C.
VALLEY MOO DUK KWAN	Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C.
MOVE THE WORLD	Arts, Culture, and Humanities: Dance
BOOTHEEL CULTURAL AND PERFORMING ARTS CENTER	Arts, Culture, and Humanities: Arts Service Activities/ Organizations
GREAT NSASS ALUMNI ASSOCIATION OF NORTH AMERICA INC	Arts, Culture, and Humanities: Humanities Organizations
CHULA VISTA SUNSET ROTARY FOUNDATION INC	Arts, Culture, and Humanities: Humanities Organizations

mission
the mission of the walk of stars foundation is to honor those who have excelled and made major contributions in their respective capacity in the entertainment area in motion pictures, radio and television, humanitarians, civic leaders, medal of hono
to serve as a social organization for arab-americans to preserve the arab heritage, culture and traditions. to promote and support humanitarian and community outreach efforts locally and internationally.
to promote the advancement of bosnia and herzegovina by fostering an inclusive platform for innovation and entrepreneurship; cultural and humanitarian activities; and networking.
charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks.
to engage in humanitarian, civic, educational , cultural, and charitable activities that would preserve, promote, and share with the community the customs, values and heritage of the filipino culture.
the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment. in addition, we wish to help the homeless, refugees and be an educational resource.
dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness.
1.provide immigration and other social assistance services to new oromo immigrants and refugees; 2.provide health awareness and education services to members and the community at large; 3.promote self-help and social assistance among the oromos
ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources.
hba is involved in multiple non-profit activities including haryanvi cultural promotion and preservation, community services, educational activities, humanitarian aid and social activities.
fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence.
helping the immigration community with education, culture and humanity.
the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society.
artogether is a community building creative arts project that hosts free creative art workshops, social gatherings, and family picnics to forge connections between the refugee community and the general public.
the purpose for which the corporation is organized is to provide youth the platform to address wildlife, environmental, and humanitarian issues through arts journalism.
umukabia ehime development association, usa, is organized exclusively for charitable purposes, which includes assisting nigerian abandoned and disabled children and to support humanitarian programs in our community and the public in general.
our mission is to spread prosperity and compassion for all by achieving our commitment to excellence in humanitarian, environmental, and scientific initiatives. we recently conducted a successful food drive for evacuees of wildfires in ca.
sudanese american engineers association is a non-profit, non-political, educational and humanitarian organization. its members are engineer professionals of sudanese descent.
to train employ underprivileged individuals particularly refugees in the visual media industry. build moviemaking skills with learn by doing approach creating content to be produced by the organization with profits to partially fund future projects
perpetuating the traditions of the dynastic and hereditary chivalric orders of the royal house of petrovi-njegos by supporting charitable, humanitarian. educational and artistic works that promote a continuing public interest in their history.
to undertake charitable activities, both through group and individual action consistent with the practice of the art and in the interest of humanitarian principles. to facilitate participation in, and stugy of the martial arts of soo bah do.
through the powerful expression of dance, our youth dance company performs on a local, nat’l and global stage to raise awareness of social and environmental issues. example: pollution, hunger, global refugee crisis
to meet the cultural and humanitarian needs of the under served in the bootheel region
humanitarian aid educational services for children and adults that are in hardships or in disaster areas within and outside united states
world and local humanitarian, educational and cultural community service

For your deliverables for Part III:

P3-Q1. Report the total count of organizations you have identified as serving Black communities.

P3-Q2. Take a random sample of 20 of the organizations in your sample and verify that your search criteria are identifying nonprofits that serve Black communities.

# d.immigrant is the example subset; substitute the data frame that holds your own sample
sample <- dplyr::sample_n( d.immigrant, 20 )

Report your rate of false positives in this sample (the share of organizations in your sample that do NOT belong there).

Note that an error rate of 10% in a classification model is very good. An error rate above 40% suggests poor performance.
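
As a minimal sketch of the arithmetic, assume you have read the 20 sampled missions and recorded a TRUE or FALSE judgment for each one. The vector `belongs` and its values are hypothetical, made up only to illustrate the calculation:

```r
# Hypothetical review results for a sample of 20 organizations:
#   TRUE  = the nonprofit belongs in the sample
#   FALSE = false positive (matched the search but does not belong)
belongs <- c( rep( TRUE, 17 ), rep( FALSE, 3 ) )

# Count and share of organizations that do not belong
false.positives     <- sum( !belongs )
false.positive.rate <- mean( !belongs )

false.positives       # 3
false.positive.rate   # 0.15
```

In this made-up case 3 of 20 organizations do not belong, so the false positive rate is 15%.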

Here we are measuring false positives, not overall accuracy. You can achieve a very low false positive rate by using extremely narrow search criteria, but then you will miss a large portion of the population you are trying to capture and study (a high false negative rate).

The more inclusive your criteria are, the larger your false positive rate will be. The goal is to design search criteria that identify a large number of organizations while keeping false positive rates reasonable.
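
The trade-off can be sketched on a handful of invented mission statements. The text and both patterns below are illustrative assumptions, not lab data and not a recommended set of search terms:

```r
# Invented mission statements (lower case, like the lab data)
missions <- c(
  "providing scholarships to african american students",
  "supporting black-owned small businesses in detroit",
  "preserving black bear habitat in the smoky mountains",
  "youth soccer league for the community",
  "celebrating african american history and culture" )

# A narrow pattern and a more inclusive alternative
narrow <- "african american"
broad  <- "african american|black"

n.narrow <- sum( grepl( narrow, missions ) )
n.broad  <- sum( grepl( broad, missions ) )

n.narrow   # 2
n.broad    # 4

# Missions picked up only by the broader pattern
extra <- missions[ grepl( broad, missions ) & ! grepl( narrow, missions ) ]
extra
```

The broader pattern finds one more relevant organization (the black-owned businesses group) but also pulls in the black bear habitat group, which is a false positive.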

P3-Q3. Similar to the immigrant example, print a data frame that contains the nonprofit names, activity codes (see above on how to combine them), and their mission statements. Print the full data frame for your sample so all missions are visible.
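
A rough sketch of the printing step, using a toy data frame in place of your sample. The column names `orgname`, `activity`, and `mission` are hypothetical placeholders; use the actual variable names in your data:

```r
# Toy stand-in for your sample; column names here are hypothetical
d.sample <- data.frame(
  orgname  = c( "EXAMPLE ARTS COLLECTIVE", "EXAMPLE YOUTH FUND" ),
  activity = c( "Arts, culture, and humanities", "Education" ),
  mission  = c( "hosting free community art workshops.",
                "providing scholarships to local students." ),
  stringsAsFactors = FALSE )

# Keep only the three columns of interest, in display order
d.print <- d.sample[ , c( "orgname", "activity", "mission" ) ]

# Left-align text and drop row numbers so the missions are easy to read
print( d.print, right=FALSE, row.names=FALSE )
```

When knitting, the pander package loaded in Setup is another option for rendering the same data frame as a table.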

Challenge Question

If you select three nonprofit subsectors from the activity codes (code01) and create three data subsets based on these criteria, you can use content analysis to compare how the three groups use language differently.

Part 1:

Re-run the content analysis to identify the most frequently-used words. But this time run it separately for each subsector.

How do the most frequently used words vary by subsector? Which words are shared between the three subsectors? Which are distinct?
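
One way to frame the comparison, sketched in base R rather than quanteda so it runs without extra packages. The three groups of text are invented, the helper `word.counts()` is hypothetical, and dropping words of three letters or fewer is only a crude stand-in for proper stopword removal:

```r
# Invented mission text for three subsectors (not lab data)
groups <- list(
  arts        = c( "promote arts education for youth",
                   "support local arts and music" ),
  environment = c( "protect local rivers and wildlife",
                   "environmental education for youth" ),
  education   = c( "provide education and tutoring for local youth" ) )

# Hypothetical helper: word frequencies, most common first
word.counts <- function( x )
{
  words <- unlist( strsplit( tolower( x ), "[^a-z]+" ) )
  words <- words[ nchar( words ) > 3 ]   # crude stopword filter
  sort( table( words ), decreasing=TRUE )
}

counts <- lapply( groups, word.counts )
vocab  <- lapply( counts, names )

# Words used by all three subsectors
shared <- Reduce( intersect, vocab )

# Words used only by the arts subsector
arts.only <- setdiff( vocab$arts, union( vocab$environment, vocab$education ) )

sort( shared )
sort( arts.only )
```

With real data you would run the same comparison on the top words from your quanteda output for each subset.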

Part 2:

Another way to compare differences in language use is by creating semantic networks:

TUTORIAL

Compare prominent word relationships in mission statements of arts, environmental, and education nonprofits (codedef01). Build semantic networks for each, then compare and contrast the prominence of specific words within the networks.
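
A semantic network is built from counts of which words appear together. The base R sketch below shows that underlying idea on three invented mission statements; it is an illustrative stand-in, not the method from the tutorial, which may rely on quanteda functions instead:

```r
# Invented mission statements (not lab data)
docs <- c( "arts education for youth",
           "music and arts programs",
           "arts education programs" )

# Split into words and drop short words as a crude stopword filter
tokens <- strsplit( tolower( docs ), "[^a-z]+" )
tokens <- lapply( tokens, function( w ) w[ nchar( w ) > 3 ] )

vocab <- sort( unique( unlist( tokens ) ) )

# Document-term matrix: 1 if the word appears in the document
dtm <- t( sapply( tokens, function( w ) as.integer( vocab %in% w ) ) )
colnames( dtm ) <- vocab

# Co-occurrence matrix: number of documents in which two words appear together
cooc <- crossprod( dtm )
diag( cooc ) <- 0

# A simple prominence measure: total co-occurrences per word
prominence <- sort( rowSums( cooc ), decreasing=TRUE )

cooc[ "arts", "education" ]   # 2
names( prominence )[ 1 ]      # "arts"
```

Each nonzero cell in `cooc` corresponds to an edge in the network, and words with larger row sums sit closer to the center of the plot.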

Submission Instructions

When you have completed your assignment, knit your RMD file to generate your rendered HTML file.

Log in to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.

Remember to:

- name your files according to the convention: Lab-##-LastName.Rmd
- show your solution, include your code.
- do not print excessive output (like a full data set).
- follow appropriate style guidelines (spaces between arguments, etc.).

See Google’s R Style Guide for examples.

Notes on Knitting

Note that when you knit a file, it starts from a blank slate. You might have packages loaded or datasets active on your local machine, so you can run code chunks fine. But when you knit you might get errors that functions cannot be located or datasets don’t exist. Be sure that you have included chunks to load these in your RMD file.

Your RMD file will not knit if you have errors in your code. If you get stuck on a question, just add eval=F to the code chunk and it will be ignored when you knit your file. That way I can give you credit for attempting the question and provide guidance on fixing the problem.

Markdown Trouble?

If you are having problems with your RMD file, visit the RMD File Styles and Knitting Tips manual.

Labs designed by Jesse D. Lecy
Source code is available on GitHub

Creative Commons License:
(CC BY-NC-SA 4.0)

CPP 527 Foundations of Data Science II
Part of the MS in Evaluation and Analytics
@ Arizona State University

Lab 04 - Regular Expressions