Women have historically been under-represented in entrepreneurship in the US.
According to the Kauffman Index of Startup Activity National Report (2017), the rate of new male entrepreneurs is almost double the rate of female entrepreneurs and over the past two decades men have owned two-thirds of all businesses. As a result, there is a growing interest in understanding what motivates women to participate in entrepreneurial processes.
This study draws upon survey data of nonprofit entrepreneurs – individuals who have recently created new nonprofits. This week’s lab data consists of 657 founders who responded to a series of survey questions describing their backgrounds and their organizations. Unlike other industries, more than half of nonprofit entrepreneurs are women: 55 percent of respondents in this sample are female.
We would like to design a study where we follow new nonprofits for five years and assess the success of each launch measured by (1) survival of the first five years of operation, (2) growth of organizational programs measured by annual program expenses, and (3) a survey of peer organizations to assess each nonprofit’s reputation with regard to program quality and community impact. The study aims to establish whether female entrepreneurs are more or less successful than their male counterparts.
Observed differences in performance could be explained by one of two distinct theories (or both together).
The first explanation for performance differences is a behavioral theory – women innately approach the entrepreneurial task differently than men. This would include things like attitudes toward risk, decision-making processes, leadership styles, etc. To make it concrete, behavioral economics provides us with a recent example from the investment world:
Terry Odean, a University of California professor, has studied stock picking by gender for more than two decades. A seven-year study found single female investors outperformed single men by 2.3 percent, female investment groups outperformed male counterparts by 4.6 percent and women overall outperformed by 1.4 percent. Why? The short answer is overconfidence. Men trade more, and the more you trade, typically the more you lose — not to mention running up transaction costs. [Washington Post 2013]
The overconfidence is partly tied to higher levels of testosterone in men, a hormone that leads to more decisive or compulsive decision-making. Thus, the male investors exhibit a tendency to make trades at a higher frequency when they have access to the same information as female investors.
One could brainstorm other potential behavioral differences in how men and women approach entrepreneurial tasks – how they communicate with team members, how they wield power over subordinates, how they build trust with stakeholders, and how they communicate the vision for the organization to funders, to name a few. Collectively, these subtle behavioral differences could lead to better or worse performance for organizations that are created and run by women.
Alternatively, the second theory is a resources theory. Entrepreneurial success is largely predicted by resources at the time of start-up: access to financial capital, the amount of knowledge about the specific product or program domain, and past start-up experience. Consequently, the resources that an entrepreneur has at the time the new business is created will largely predict success.
If male and female entrepreneurs have different initial levels of resources when the organizations are started then performance might appear to be driven by male and female behavioral differences, but in fact gender could simply be a proxy for resources. Or stated differently, gender predicts the level of resources founders have at the point of business creation, and resources predict success.
Under this theory, men and women with similar levels of resources should produce start-ups that are equally successful.
As an example of this phenomenon, a 2003 study examined the financial performance of large companies run by women and concluded that women make worse CEOs than men. After reading the news story, University of Exeter researchers Michelle Ryan and Alexander Haslam re-examined the data and identified the Glass Cliff phenomenon. It turns out it was NOT the case that women make worse CEOs than men - they actually found that female CEOs had more positive impact than similar male CEOs in many instances. The problem was that women were only promoted to leadership opportunities in failing companies.
“The glass cliff is a phenomenon whereby women (and other minority groups) are more likely to occupy positions of leadership that are risky and precarious. This can happen when share price performance is poor, when facing a scandal, or when the role involves reputational risk.”
“When firms are doing poorly, the really qualified white male candidates say, ‘I don’t want to step into this,’ Women and minorities might feel like this might be their only shot, so they need to go ahead and take it.”[Vox 2018]
In other words, all else equal female CEOs took over companies with fewer resources and worse initial performance than the companies that hired male CEOs.
In the nonprofit context, things like a scarcity of mentoring or leadership opportunities for women at early stages of their careers, lower wages, and career gaps associated with caretaking responsibilities for kids or aging parents could lead to lower levels of financial and professional capital at the time of organizational creation.
So there is a bit of a nature and nurture component to the research. There are innate (or perhaps socially programmed) differences in how men and women make decisions, evaluate risk, build teams, model leadership, and manage conflict. And there are other aspects of gender that are circumstantial, like savings at the time of start-up. Programs like mentoring or seed grants for female entrepreneurs can erase the resource differences, but they cannot erase the innate differences or the cumulative effects of past discrimination.
The departure point of the research is thus descriptive - a basic understanding of resource differences and behavioral differences observed during the start-up process is necessary to start building a research agenda for the field.
https://www.vox.com/2018/10/31/17960156/what-is-the-glass-cliff-women-ceos
Camarena, L., Feeney, M. K., & Lecy, J. (2021). Nonprofit Entrepreneurship: Gender Differences in Strategy and Practice. Nonprofit and Voluntary Sector Quarterly.
This lab focuses on the resources question - are there differences in the financial and professional capital men and women have when they start nonprofits?
This is an example of establishing study group equivalence, which is a prerequisite for valid inferences for some (but not all) of the estimators you will learn in this class.
Randomization is powerful because it generates equivalent treatment and control groups: it ensures that the proportions of observed and unobserved traits (gender, age, etc.) are the same in each group. This ensures that the only systematic difference between the groups is the treatment, which is what allows us to interpret performance differences in the post-treatment period as treatment or program effects. Since all other group characteristics are equivalent, none of the other participant characteristics would explain the outcome. The test for “happy randomization” in RCTs is a test of group equivalence.
As such, this example will focus on the second theory of performance differences – the resources that the entrepreneurs bring to the table.
The test for group equivalence prior to the nonprofit launch will show us whether male and female entrepreneurs in the nonprofit sector have similar experience, financing, and support to carry them through the process.
If they are equivalent, then any performance differences between organizations started by men and organizations started by women are more likely to be driven by unmeasured behavioral traits than by observable characteristics like years of experience or savings. We can eliminate resources as an explanation and focus on the ways in which women approach or experience the start-up process differently from men.
The lab demonstrates how to compare characteristics of two groups to establish equivalence.
A contrast is a statistical comparison of means of two groups.
In a natural sciences context, the claim that a cup of flour weighs the same as 0.79 cups of sugar means that the weights of those precise volumes of the two substances are identical. We can test the equivalence of the weights either by weighing each on a precise scale and comparing the results, or by using a calibrated balance. For the weights to be equivalent they need to be mathematically identical (with error smaller than whatever decimal level is needed for the study).
In applied statistics, however, the statement that two groups are equivalent does NOT mean that groups are mathematically identical (it is rare to get precisely the same group means in samples), but rather that there are NO statistically significant differences of the group means.
For example, in this instance the average age of male and female founders differs by a year or two:
Gender | min.age | median.age | mean.age | max.age |
---|---|---|---|---|
Female | 27 | 51 | 52.05 | 78 |
Male | 29 | 53 | 53.63 | 72 |
But when we test for significance we find that the contrast is NOT statistically significant:
t.test( Age ~ Gender, data=d )
##
## Welch Two Sample t-test
##
## data: Age by Gender
## t = -0.69933, df = 78.271, p-value = 0.4864
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.090387 2.923789
## sample estimates:
## mean in group Female mean in group Male
## 52.05085 53.63415
Specifically, the p-value of the t-test is greater than 0.05.
Thus male and female entrepreneurs in the study have equivalent ages.
Make a mental note that statistically equivalent is different than mathematically identical.
Randomization produces statistically equivalent treatment and control groups, not mathematically identical groups (the observed means of the two groups will still differ by small amounts).
The contrast, then, is a comparison of group means with a test for significance.
Contrasts are typically reported in a table by listing the two means, reporting their difference, and either reporting the p-value associated with the difference or including stars that signal significant differences.
In some cases the t-statistic is reported instead of the p-value, but the rule of thumb is that a t-stat greater than 2 implies statistical significance at the alpha=0.05 level (a 95% confidence interval). There are caveats to the rule of thumb (it assumes a large sample size, for example), but 1.96 is the critical value associated with a 95% confidence interval in large samples, so any t-value larger than that suggests the p-value is less than 0.05.
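For reference, the critical values behind this rule of thumb are easy to pull directly from R; with a large sample the t critical value converges to the familiar 1.96:

qnorm( 0.975 )        # normal critical value for a 95% CI: approximately 1.96
qt( 0.975, df=30 )    # t critical value with a small sample (df=30): approximately 2.04
qt( 0.975, df=1000 )  # converges toward 1.96 as the sample grows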
Randomization almost always produces statistically equivalent treatment and control groups. We can also use other quasi-experimental techniques to generate balanced groups, meaning groups that have no observed differences in the traits that are measured and reported in the data. Randomization is preferable because it ensures balance across all measured and unmeasured traits, not just the ones we have data on.
The problem with omitted variables is we don’t always know what they even are, so it’s hard to ensure you have ALL variables that might matter in a regression. Since randomization produces balance in unobserved traits it fixes omitted variable bias problems that you were not even aware you had! Which is why RCTs are “the gold standard” in causal inference and astute policy-makers will look for convergence in results across several quasi-experimental studies before trusting the inferences.
Two Continuous Variables
Recall the meaning of statistical significance in the regression context. If the slope coefficient is significant then knowing information about the treatment (milligrams of caffeine consumed) should give you information about the outcome (heart rate).
If we know that an individual has received 425 mg of caffeine would 90 (the sample average) or 107 be a better guess of their actual heart rate?
Heart rate depends on caffeine intake if the two variables are related in some way.
If we have no information about caffeine intake of each participant then our best guess for a person’s heart rate is the average heart rate of all people in the sample (the blue line, y-bar).
If our model is meaningful (the slope is statistically significant) then our best guess for a person’s heart rate is the average of all individuals that consumed the same amount of caffeine, or the predicted value of Y for a level of X (the point on the red regression line).
If heart rate does NOT depend on caffeine the variables are independent.
When two variables are independent then knowing additional information about X tells us nothing about Y. In the case on the left we could do just as well guessing the heart rate for each individual using the average for the entire group instead of using the regression line.
When variables are correlated, then information about X should tell us something about the expected value of Y. The regression line on the right is a much better estimate of heart rate for most cases than the average heart rate of the group (the blue line).
In other words, saying that Caffeine and Heart Rate are independent is equivalent to saying the slope b1 representing the relationship between caffeine and heart rate is zero, i.e. not statistically significant (or more precisely that a high p-value does not allow us to reject that null that the slope is zero).
The variance explained measure R-square gives some sense of how much the prediction would improve based upon the model.
For the prediction to improve the distance from someone’s actual heart rate to the regression line (red line) should be shorter than the distance from their actual heart rate to the mean (blue line).
The improvement of the best guess using the regression line over the best guess using the sample mean is the variance that is “explained” by the model.
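A minimal sketch with simulated caffeine and heart-rate values (made up here for illustration; they are not the data behind the figures) shows the same logic in code:

set.seed( 123 )
caffeine   <- runif( 100, min=0, max=500 )               # mg of caffeine consumed
heart.rate <- 70 + 0.05*caffeine + rnorm( 100, sd=8 )    # beats per minute

m.caff <- lm( heart.rate ~ caffeine )

# total squared distance from each heart rate to the sample mean (the blue line)
ss.mean  <- sum( ( heart.rate - mean(heart.rate) )^2 )

# total squared distance from each heart rate to the regression line (the red line)
ss.model <- sum( residuals( m.caff )^2 )

# proportion of variance "explained" by the model
1 - ss.model / ss.mean

# same value reported by the model summary
summary( m.caff )$r.squared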
Independence of categorical variables
It is common for studies to examine traits like race and gender that are categorical, not continuous, measures.
T-tests are used to conduct contrasts when measures are continuous, but they do not work when the variable is categorical (how would you calculate the average of four categories of race?).
When variables are categorical we will instead use a chi-square test. It is a common test for independence of factor levels in statistics.
Let’s apply the same predictive reasoning as above to understand the meaning of a test for independence of two categorical variables.
We have a sample of 100 individuals with either black or blonde hair. We record the eye color of each individual. Are hair color and eye color independent?
To answer this we start by creating a contingency table of the two factors:
The most meaningful null hypothesis in a regression model is that the slope b1=0, i.e. that X and Y are unrelated (independent). If the p-value for b1 in the model is less than 0.05 we conclude that X and Y are related, or that they are NOT independent. So slope=0 is the default null in regression models.
Note that factor levels will not always be proportionate in the population - more people tend to have dark hair in the world than blonde hair.
The first step of the chi-square test, then, is figuring out what proportion of individuals we would expect to observe in each cell if the factors are independent.
In this case we know that 60 percent of the sample will have black hair, which represents 60 individuals in this case since we have 100 individuals in the study.
Out of those 60 individuals, how many do we expect to have brown eyes? Well, if the factors are independent (if eye color does not depend on hair color) then, since 70 percent of the population has brown eyes, we would guess that 70 percent of the 60 individuals would also have brown eyes.
70 percent of 60 individuals is 42 individuals.
Similarly, the proportion of cases that we expect to observe in each cell will be the product of the row proportion and the column proportion.
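A minimal sketch of that calculation, using assumed counts that are consistent with the example (60 black-haired and 40 blonde individuals; 70 with brown eyes and 30 with blue eyes — the individual cell counts are hypothetical):

# hypothetical observed table (cell counts assumed for illustration)
observed <- matrix( c( 50, 10, 20, 20 ), nrow=2, byrow=TRUE,
                    dimnames=list( Hair=c("black","blond"),
                                   Eye=c("brown","blue") ) )

# expected counts under independence: (row total x column total) / n
expected <- outer( rowSums(observed), colSums(observed) ) / sum(observed)
expected

##       brown blue
## black    42   18
## blond    28   12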
After we have created our table of expected values under the null, we can compare the observed values to the expected values to see how closely our data conforms to a world in which the two factors are independent.
Similar to the regression model, our residuals are the amount by which the observed values deviate from the expected values. The proportions are reported in the tables here to keep it simple, but the chi-square values utilize counts, not proportions. And similar to regression residuals, we square the deviations before summing; otherwise they would cancel each other out, since some deviations are positive and some negative.
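Continuing the sketch above, the chi-square statistic is the sum of the squared deviations of observed counts from expected counts, scaled by the expected counts:

# squared deviations of observed from expected, scaled by expected, summed over cells
chi.sq <- sum( ( observed - expected )^2 / expected )
chi.sq

## [1] 12.69841

# matches the Pearson statistic from chisq.test() (Yates correction turned off)
chisq.test( observed, correct=FALSE )$statistic

## X-squared
##  12.69841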
The thing that makes a chi-square test different from other tests is that it does not produce a standard error, so we can’t build a simple confidence interval to determine statistical significance. Instead we would need to construct all possible tables that have a sample size of 100 and the same row and column proportions, calculate the chi-square value for each table, then see how often a chi-square value is as large as our observed value.
When tables are small this is a pretty straightforward calculation. But when we have large samples and factors with many levels, the number of possible tables grows exponentially, leading to a scenario where it can take hours or years to examine all tables that meet those criteria. As a result, p-values are often calculated using a simulation-based approximation. You might notice that if you run the same test several times the p-value will be slightly different each time.
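For reference, this simulation is built into chisq.test() through the simulate.p.value argument (it is used again later in this lab); re-running it gives slightly different p-values because the tables are randomly generated each time. Applied to the hypothetical table from the sketch above:

# approximate the p-value with 10,000 simulated tables that preserve
# the row and column totals of the observed table
chisq.test( observed, simulate.p.value=TRUE, B=10000 )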
Real World Example
This example uses data about hair color and eye color from 592 individuals to test the independence of these two traits.
To examine the relationship between the two variables we would create a cross-tab of the two traits using table( hair, eyes )
to produce one similar to the table stored in the vcd package:
# library(vcd)
# data( HairEyeColor )
#
# combine genders for example:
# m1 <- HairEyeColor[ , , 1 ]
# m2 <- HairEyeColor[ , , 2 ]
# m3 <- m1 + m2
# dput( m3 )
m <- structure(c(68, 119, 26, 7, 20, 84, 17, 94, 15, 54, 14, 10, 5,
29, 14, 16), class = "table", .Dim = c(4L, 4L), .Dimnames = list(
    Hair = c("Black", "Brown", "Red", "Blond"), Eye = c("Brown",
    "Blue", "Hazel", "Green")))

m %>% pander()
 | Brown | Blue | Hazel | Green |
---|---|---|---|---|
Black | 68 | 20 | 15 | 5 |
Brown | 119 | 84 | 54 | 29 |
Red | 26 | 17 | 14 | 14 |
Blond | 7 | 94 | 10 | 16 |
What does it mean for the categorical variables of hair color and eye color to be independent?
Similar to the regression model above, if the slope coefficient for caffeine is not statistically significant then the y-hat values from the regression model will do no better than simply using y-bar from the population. The factors are independent.
In this example we are trying to predict eye color. If we have no information other than sample characteristics then we should guess the modal value, brown eyes, for every case:
apply( m, MARGIN=2, FUN=sum )
## Brown Blue Hazel Green
## 220 215 93 64
If the factors are independent, then each group of people with a different hair color should all have proportionally similar numbers of brown-eyed individuals, blue-eyed individuals, hazel and green. But we see that is not in fact the case.
Approximately 37 percent of individuals in each group should have brown eyes, but we see that people with black hair have brown eyes at higher rates than the population average, and people with blonde hair have brown eyes at lower rates than the population average:
# margin=1 calculates proportions by row; each row sums to 1
# margin=2 calculates proportions by column: each column sums to 1
# margin=NULL returns cells as a proportion of the whole table
( apply( m, MARGIN=2, FUN=sum ) / sum(m) ) %>% round(2)
## Brown Blue Hazel Green
## 0.37 0.36 0.16 0.11
m %>% prop.table( margin=1 ) %>% round(2)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.63 0.19 0.14 0.05
## Brown 0.42 0.29 0.19 0.10
## Red 0.37 0.24 0.20 0.20
## Blond 0.06 0.74 0.08 0.13
If hair color and eye color are correlated, then knowing something about hair color would give us additional information about the likely eye color of a person. If so, then these variables are NOT independent. Hair color and eye color are, in fact, correlated.
The mosaic plot in the vcd package provides a nice visualization of these cross-tabs:
vcd::mosaic( m, shade=TRUE, legend=TRUE )
Cells are sized to represent the proportion of each hair-eye combination in the sample. Blue signifies cells where we observe MORE cases than expected if the factors were actually independent, and red signifies there are fewer cases observed than expected under the assumption of independence.
Note that the “brown hair” row is all gray, meaning that people with brown hair have eye colors that approximately match the population proportions. Glancing back at the table we see this is true.
population <- ( apply( m, MARGIN=2, FUN=sum ) / sum(m) ) %>% round(2)
m2 <- m %>% prop.table( margin=1 ) %>% round(2)
brown.hair <- m2[ 2, ]
m3 <- rbind( population, brown.hair )
m3 %>% pander()
 | Brown | Blue | Hazel | Green |
---|---|---|---|---|
population | 0.37 | 0.36 | 0.16 | 0.11 |
brown.hair | 0.42 | 0.29 | 0.19 | 0.1 |
Similar to how flipping a fair coin 100 times is unlikely to produce exactly 50 heads (it will usually be off by one or two by chance), independent categorical factors will rarely have mathematically identical proportions (row proportions exactly matching the population proportions), so we need a test for significance.
The chi-square test
The chi-square statistic provides a test for independence of two factors:
chisq.test( m )
##
## Pearson's Chi-squared test
##
## data: m
## X-squared = 138.29, df = 9, p-value < 2.2e-16
In this example the p-value is less than 0.05, so the factor levels are dependent, i.e. they are correlated. Thus, if we know something about the hair color of an individual then we can more accurately predict their eye color.
The most common eye color in the sample is brown. But if we know the person’s hair is blonde, we should guess their eyes are blue.
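We can make that guess mechanical by selecting the modal eye color within each hair-color group:

# most common eye color within each hair-color group
row.props <- prop.table( m, margin=1 )
colnames( row.props )[ apply( row.props, MARGIN=1, FUN=which.max ) ]

## [1] "Brown" "Brown" "Brown" "Blue"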
In the program evaluation context the column factor would describe traits of individuals in the sample, e.g. race or gender, and the row factor would be the treatment and control categories.
The test for independence would tell us whether proportions of levels of the factor are more-or-less equal in the treatment and control groups.
If we use gender as the example, then the chi-square test would tell us whether we have equal proportions of men and women in the treatment and control groups:
set.seed( 1234 )
gender <- sample( c("male","female","male"), 100, replace=T )
study.group <- sample( c("treatment","control"), 100, replace=T )

t <- table( study.group, gender )
t %>% prop.table( margin=1 ) %>% round(2) %>% pander()
 | female | male |
---|---|---|
control | 0.33 | 0.67 |
treatment | 0.46 | 0.54 |
We see that the proportion of female participants is roughly similar in the two study groups (33 percent in the control group, 46 percent in the treatment group). The chi-square test shows us that the groups are equivalent or independent on this measure (the test is NOT statistically significant):
chisq.test( t )
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t
## X-squared = 1.2169, df = 1, p-value = 0.27
In the examples above each contrast is a test for group differences or the dependence of a single trait.
But how do we know collectively if the groups are equivalent when we have lots of measured characteristics? If they differ on a single trait then do we throw out all group equivalence?
The table reports characteristics of villages that participated in an economic development program and the villages they have been matched to for comparison.
If two groups are equivalent we should expect that tests show no meaningful differences across the measured traits. If the matching worked (the study groups are well-balanced) then no significant differences should exist.
Since we cannot use a t-test or chi-square to compare a bunch of variables at once, we must apply the criteria that none of the contrasts are statistically significant to establish group equivalence.
There is a problem here, though!
The decision criteria of alpha=0.05 allows for Type I errors (incorrectly identifying differences that don’t actually exist) five percent of the time. So let’s assume that we are comparing 20 group attributes in our table.
Assuming that the groups are in fact equivalent, if the probability of a Type I error is five percent for each variable what happens if we conduct 20 contrasts at the same time? What is the likelihood that we observe AT LEAST ONE contrast with a p-value below 0.05 when the groups are statistically equivalent?
It’s a lot more than 0.05 if we perform 20 independent contrasts and use the same decision criteria for rejecting the null that we use for a single test.
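That likelihood is straightforward to compute: with 20 independent contrasts at alpha=0.05, the chance of observing at least one false rejection is roughly 64 percent:

# probability of at least one p-value below 0.05 across 20 independent
# contrasts when the groups are truly equivalent
1 - ( 1 - 0.05 )^20     # approximately 0.64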
We can demonstrate the issue with a coin-flip example. If you were asked, “What’s the probability that I observe five heads in my coin-flip experiment?” the pertinent question would be, “How many coin flips did you conduct?”
If only five flips were made then this outcome would be quite rare - it would occur in only about 3 percent of experiments using a fair coin:
(0.5)(0.5)(0.5)(0.5)(0.5) = 0.03125
If, however, you had performed 10 coin flips in the experiment then the outcome of 5 heads would occur about 25 percent of the time.
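These figures can be verified with the binomial density function in R:

dbinom( 5, size=5,  prob=0.5 )   # five heads in five flips
## [1] 0.03125
dbinom( 5, size=10, prob=0.5 )   # exactly five heads in ten flips
## [1] 0.2460938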
Similarly, if you asked about the likelihood of getting all heads in an experiment, knowing whether it was 2 flips or 20 flips makes a big difference.
Analogously, the p-value describes the likelihood of observing an outcome when the null hypothesis is true (the NULL being slope=0 in a regression, or no differences in group means in a t-test). It is a reasonable decision criteria if we are doing a single contrast. But once we start doing lots of contrasts together we are increasing the likelihood that we observe at least one small p-value even when the groups are equivalent.
Scroll to the end for another example of designing a fair decision criteria that accounts for context.
Recall that the p-value is the probability of observing a difference (in means or table proportions) at least as large as the one in the sample when the groups are in fact equivalent.
Even when the groups are equivalent we expect to get a p-value below 0.05 five times out of 100 simply due to the variance of the sampling procedure and dumb luck (drawing a bunch of high-value or low-value cases at the same time in a given sample).
We use the p-value because we need some criteria for deciding whether the sample means of variable X differ between groups, and we are comfortable incorrectly rejecting the null (group equivalence) 5 times out of 100 if the groups are in fact the same.
What happens, though, when we scale up from a contrast to a blanket statement that the groups are equivalent?
Testing 20 contrasts at once and rejecting group equivalence if any 1 contrast is significant will result in a Type I error rate that is too high, so we need a new decision criteria that lowers it back to the original 0.05 rate.
It turns out, if we divide the original alpha by the number of contrasts in the table and use that as our decision criteria, we are back to a scenario where we experience Type I errors approximately 5 percent of the time.
This is called the Bonferroni Correction. It is applied in omnibus hypothesis scenarios where we are conducting a lot of independent tests at the same time, and the failure of any one test (p-value below 0.05) results in a rejection of all tests (we no longer consider the groups to be equivalent even if 19 out of 20 contrasts were non-significant).
Bonferroni Corrected alpha = original alpha / number of contrasts
After the Bonferroni Correction, the new alpha will be 0.05 / 20 contrasts, or 0.0025.
The new decision criteria is that a p-value needs to be below 0.0025 in order to reject the NULL of group equivalence. Using this criteria we expect to incorrectly reject the group-equivalence NULL only about 5 percent of the time, so we are back to the Type I error rate that we are comfortable with.
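We can check that the corrected threshold behaves as intended; the familywise Type I error rate drops back to roughly 5 percent (slightly below, since the Bonferroni correction is conservative):

alpha <- 0.05
k     <- 20                   # number of contrasts in the table
alpha.adjusted <- alpha / k   # 0.0025

# probability of at least one false rejection across all 20 contrasts
1 - ( 1 - alpha.adjusted )^k  # approximately 0.049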
So for a concrete example, consider again the table from the Millennium Challenge Villages study.
They compared 11 different traits of villages in the treated and the comparison groups. So the new decision criteria will be:
Reject the null if any p-value in the table is < ( 0.05 / 11 )
So the adjusted alpha is 0.0045.
It’s a little annoying when p-values are not reported directly in a table. We can calculate them from the reported t-values, but you will need to do it manually.
Calculate the p-value for the largest contrast in the table (the one with the highest t-value):
# tvalue = largest t-value in the table
# n1 = sample size of treatment group
# n2 = sample size of comparison group

get_pval_from_tval <- function( tvalue, n1, n2 )
{
  df <- min( n1, n2 ) - 1   # conservative df: smaller of the two group sizes minus one
  pval <- 2 * pt( abs(tvalue), df=df, lower.tail=FALSE )
  return( pval )
}
get_pval_from_tval( tvalue=2.16, n1=2964, n2=2664 )
## [1] 0.03086164
Since 0.03 > 0.0045 we do NOT reject the null: THE GROUPS ARE CONSIDERED EQUIVALENT.
We used the largest t-value in the table, which will return the smallest p-value. If this p-value is above the decision criteria then nothing in the table will be smaller, so we have established group equivalence in this study to the extent possible with the variables we have (unlike groups created with randomization, unobserved differences are still possible).
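Written as a single comparison against the adjusted threshold:

# is the smallest p-value in the table below the Bonferroni-adjusted alpha?
get_pval_from_tval( tvalue=2.16, n1=2964, n2=2664 ) < 0.05 / 11
## [1] FALSE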
This lab will use the following packages:
library( dplyr )
library( pander )
library( ggplot2 )
The dataset contains ten survey questions from a survey of nonprofit entrepreneurs.
The gender variable represents our study groups, male and female.
<- "https://github.com/DS4PS/cpp-524-sum-2020/blob/master/labs/data/female-np-entrepreneurs.rds?raw=true"
URL <- readRDS(gzcon(url( URL )))
dat head( dat )
For the variables listed below construct an appropriate contrast. Include a table of row proportions for each categorical variable. Report your conclusion for each variable - do the study groups differ on that variable?
After creating the contrasts, report your findings about study group equivalence with regards to resource available to male and female nonprofit entrepreneurs at the start-up phase. You will need to calculate the new decision rule using a Bonferroni Correction for this step.
TABLES:
t <- table( dat$f1, dat$f2 )
t %>% prop.table( margin=1 ) %>% round(2) %>% pander()
CHI-SQUARE TESTS:
# create your table
t <- table( dat$f1, dat$f2 )
# basic syntax:
chisq.test( t )
# use this setting for more accurate p-values:
chisq.test( t, simulate.p.value=TRUE, B=10000 )
T-TESTS:
t.test( age ~ gender, data=dat )
RMD Template:
To download the template click on the link here then right-click on the template and save as lab-02-LASTNAME.rmd.
Compare past nonprofit formation experience of male and female entrepreneurs.
t <- table( dat$experience.np.other, dat$gender )
t %>% prop.table( margin=1 ) %>% round(2) %>% pander()
 | Female | Male |
---|---|---|
No | 0.58 | 0.42 |
Yes | 0.53 | 0.47 |
chisq.test( t, simulate.p.value=TRUE, B=10000 )
##
## Pearson's Chi-squared test with simulated p-value (based on 10000
## replicates)
##
## data: t
## X-squared = 1.6311, df = NA, p-value = 0.2274
ANSWER: No, we do not observe differences in past founding experience between male and female entrepreneurs. Among founders with prior start-up experience, 53 percent were female and 47 percent were male, close to the overall sample composition, and the contrast is not statistically significant at the alpha=0.05 level (chi-square p-value 0.2274).
Compare education levels of male and female entrepreneurs.
Compare work experience for male and female entrepreneurs.
Compare success in accessing seed funding for male and female entrepreneurs.
Compare the willingness to take on personal debt for male and female entrepreneurs.
Compare sources of first year funding for male and female entrepreneurs.
Compare age at the time of nonprofit formation for male and female entrepreneurs.
Compare income levels prior to starting the nonprofit for male and female entrepreneurs.
Based upon these seven contrasts, would you conclude that the resources male and female nonprofit entrepreneurs have at the time of founding were equivalent?
Q8-A:
What is the adjusted decision criteria used for contrasts to maintain an alpha of 0.05 for the omnibus test of group equivalence?
Q8-B:
What is the lowest p-value you observed across the seven contrasts?
Q8-C:
Can we claim study group equivalency? Why or why not?
When you have completed your assignment, knit your RMD file to generate your rendered HTML file.
Login to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.
Remember to:
This example is meant to demonstrate the process of creating a binding and actionable decision criteria in order to test a hypothesis.
We often use alpha=0.05 in evaluation studies but it is not clear why. It turns out that alpha=0.05 is a cutoff that tends to do a good job of balancing Type I and Type II Errors under general conditions. When you make a test more sensitive so that you are more likely to detect a phenomenon like a COVID infection it also increases the likelihood of false positives. Making the criteria more stringent to eliminate this noise will alternatively increase the likelihood of missing real cases. So a good decision criteria must account for the cost of each type of error.
Let’s use an example where we need to design a regulatory framework for casinos.
Assume a casino has started offering a coin flip game: if the flip comes up heads the casino wins, tails and the gambler wins. State regulators visit the casino each month to inspect their games to make sure they are fair and the casino is not cheating.
They have decided on a test: they flip the same coin FOUR times to inspect outcomes. The probability of getting heads on one flip should be 0.5, so the probability of getting heads all four times is:
(0.5)(0.5)(0.5)(0.5) = 0.0625
To keep things simple the regulating agency says if this occurs the casino gets fined. Similar to a hypothesis-testing scenario, casinos that are using fair coins will unjustly receive a fine 6.25 percent of the time (a Type I error).
The regulator and the casino owners have agreed that it’s an appropriate test because the alternatives to this simple test are expensive and time-consuming (the fair casino is better off paying 1 fine approximately every 16 visits than paying for more extensive monthly audits).
One of the regulators proposes they can increase the sample size to improve the accuracy of the tests (sounds reasonable, right?). They will now flip the coin TEN times instead of four. HOWEVER, they do not adjust the decision criteria.
Now, if at least four of ten flips are heads, then the casino is still fined.
Did the larger sample size improve the accuracy of the process? Is this a good decision criteria?
Obviously not. With ten flips we expect five to be heads (which happens about a quarter of the time). With that many flips there is an 82 percent chance of getting 4 or more heads, so now the casino is paying a fine in roughly 8 visits out of 10 when the coin is fair, instead of about 1 in 16.
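The 82 percent figure is the upper tail of the binomial distribution:

# probability of observing 4 or more heads in 10 flips of a fair coin
1 - pbinom( 3, size=10, prob=0.5 )
## [1] 0.828125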
Likelihood of each outcome:
n <- 10
k <- 9
p <- 0.5
# choose( n=n, k=k )

x <- NULL
for( k in 0:10 )
{
  x[ k+1 ] <- choose( n=n, k=k ) * ( p ^ k ) * ( (1-p) ^ (n-k) )
}

d <- data.frame( heads=(0:10), prob=round(x,3) )

d %>% pander()
heads | prob |
---|---|
0 | 0.001 |
1 | 0.01 |
2 | 0.044 |
3 | 0.117 |
4 | 0.205 |
5 | 0.246 |
6 | 0.205 |
7 | 0.117 |
8 | 0.044 |
9 | 0.01 |
10 | 0.001 |
Instead of increasing the accuracy of the test, the regulatory agency has significantly increased the rate of Type I errors! Casinos with fair coins will not be happy.
The regulators wanted to increase the sample size to improve the power of the test, but they kept the same decision criteria from the earlier process (if four or more heads are observed the coin is considered unfair).
What they should have done was select the desired confidence level (0.0625, to be consistent with the original process) and then determine a new decision rule that triggers a fine only about six percent of the time when the coin is actually fair.
According to the table above, the likelihood of getting 8 or more heads in 10 fair coin flips is 0.055. So that would be an appropriate threshold.
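Checking that threshold directly with the binomial tail:

# probability of observing 8 or more heads in 10 flips of a fair coin
1 - pbinom( 7, size=10, prob=0.5 )
## [1] 0.0546875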
Analogously, if we want to determine group equivalency but we are conducting 20 contrasts instead of one, then we can’t use the same decision criteria as when we are just testing a single contrast (alpha=0.05). Otherwise we will be greatly increasing the Type I error rate since there is a high probability that at least one out of 20 tests have p-values below 0.05 by chance.
We need to define a new decision criteria for the omnibus scenario - testing all hypotheses together.
SIDE NOTE
If you are curious, the probability of getting all heads is easy to calculate since it would simply entail ten heads in a row, each with a probability of 0.5: \[(0.5)^{10} = 0.00097\] or approximately one in a thousand.
To determine success for 9 out of ten coin flips is a little harder. We need to model the flips as a Bernoulli process:
\[ P(k) = {n \choose k} p^k q^{n-k} \]
Where n is the number of trials, k is the number of successes, p is the probability of success in each independent trial, and q is the probability of failure, which is always 1 - p.
Thus the likelihood of getting 9 heads out of 10 trials is:
\[ P(k=9) = {10 \choose 9} (0.5)^9 (0.5)^{1} \]
n <- 10
k <- 9
p <- 0.5
choose( n=n, k=k ) * ( p ^ k ) * ( (1-p) ^ (n-k) )
## [1] 0.009765625
\[ P(k=8) = {10 \choose 8} (0.5)^8 (0.5)^{2} \]
n <- 10
k <- 8
p <- 0.5
choose( n=n, k=k ) * ( p ^ k ) * ( (1-p) ^ (n-k) )
## [1] 0.04394531
Once we include 8 heads, the cumulative probability is already approaching the 5 percent mark, so a reasonable decision criteria for the regulators would be that observing 8 or more heads in ten flips results in a fine.
The total likelihood of these events would be P(k=10) + P(k=9) + P(k=8):
0.00097 + 0.009765625 + 0.04394531
## [1] 0.05468093
So there is about a 5.5 percent chance of one of these outcomes occurring by chance if the coin is in fact fair.
If we look at the next step, we see a big leap in the likelihood of observing 7 heads out of ten flips:
\[ P(k=7) = {10 \choose 7} (0.5)^7 (0.5)^{3} \]
n <- 10
k <- 7
p <- 0.5
choose( n=n, k=k ) * ( p ^ k ) * ( (1-p) ^ (n-k) )
## [1] 0.1171875
So using any criteria less than 8 would result in a higher Type I error rate than desired.