Overview

This lab is a review of hypothesis testing with dummy variables, concepts covered in CPP 523 in the lecture Dummy Variables and Hypothesis Tests.

Specifically, the lab reminds us that when we construct groups using dummy variables we will always have a reference group captured by the intercept, and all dummy coefficients represent contrasts with that baseline group.

In the scenario where we have four groups we can test up to six pairwise hypotheses with our data, but we can test at most three at a time in any given model. Therefore we must select a reference group and dummy variables that correspond to the hypotheses of interest.
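For reference, the six hypotheses are simply the six possible pairwise contrasts among the four groups. A minimal sketch, using the four groups that appear later in this lab (the labels here are only for illustration):

groups <- c( "Regular.urban", "Regular.suburb", "TFA.urban", "TFA.suburb" )
choose( 4, 2 )       # number of pairwise comparisons among 4 groups: 6
combn( groups, 2 )   # lists the six group-vs-group contrasts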



This lab uses a fake data set based upon the example of Regular and Teach for America instructors in the lecture notes. Regular instructors have a college degree in education, whereas Teach for America instructors have college degrees in fields other than education, but get credentialed in education through an immersion program that includes specialized training and mentoring.

Your job is to select a model that aligns with the hypothesis of interest in each question.

Setup

library( dplyr )     # data wrangling
library( pander )    # formatting 
library( stargazer ) # pretty regression tables 
d <- read.csv( "https://raw.githubusercontent.com/Watts-College/cpp-524-fall-2021/main/labs/data/teach-for-america.csv" )
head( d ) %>% pander()
 teacher   school   score   tfa.dummy   reg.dummy   sub.dummy   urb.dummy
--------- -------- ------- ----------- ----------- ----------- -----------
   TFA     suburb   79.68       1           0           1           0
   TFA     suburb   78.01       1           0           1           0
   TFA     suburb   72.87       1           0           1           0
   TFA     suburb   72.15       1           0           1           0
   TFA     suburb   73.55       1           0           1           0
   TFA     suburb   74.03       1           0           1           0



Study Group Means:

d %>% 
  group_by( teacher, school ) %>% 
  summarize( ave.score=mean(score) ) %>% 
  pander( digits=0 )


 teacher   school   ave.score
--------- -------- -----------
 Regular   suburb      75
 Regular   urban       57
   TFA     suburb      75
   TFA     urban       66

Question 1

Consider the following models:

\(Model \space (1): \space \space score = b_0 + b_1 \cdot tfa + b_2 \cdot suburban\)

\(Model \space (2): \space \space score = b_0 + b_1 \cdot tfa + b_2 \cdot suburban + b_3 \cdot tfa \cdot suburban\)

Note that in R's lm() syntax we use a colon instead of a multiplication sign to specify the interaction between two variables:

m1 <- lm( score ~ tfa.dummy + sub.dummy, data=d )
m2 <- lm( score ~ tfa.dummy + sub.dummy + tfa.dummy:sub.dummy, data=d )
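An equivalent shortcut in R (shown only as a syntax sketch, not required for the lab) is the asterisk operator, which expands into both main effects plus their interaction:

# score ~ tfa.dummy * sub.dummy is shorthand for
# score ~ tfa.dummy + sub.dummy + tfa.dummy:sub.dummy
m2.alt <- lm( score ~ tfa.dummy * sub.dummy, data=d )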

stargazer( m1, m2, 
           type="html", digits=3,
           intercept.bottom = FALSE,
           omit.stat = c("ser","f","rsq","adj.rsq") )
                            Dependent variable: score
                         -------------------------------
                              (1)              (2)
 Constant                  60.361***        57.294***
                            (0.247)          (0.291)
 tfa.dummy                  3.900***         8.500***
                            (0.253)          (0.356)
 sub.dummy                 13.297***        17.677***
                            (0.254)          (0.347)
 tfa.dummy:sub.dummy                        -8.213***
                                             (0.475)
 Observations                2,000            2,000
 Note:                   *p<0.1; **p<0.05; ***p<0.01

Calculate the four group means with regression coefficients from Model (1):



Are any of these means correct? Here is the table of group means for reference:

tapply( d$score, list(d$teacher,d$school), mean ) %>% 
  round(0) %>% pander()
            suburb   urban
 Regular      75       57
 TFA          75       66

Question 2

Now calculate the four group means with coefficients from Model (2):

Are these means correct?


Question 3

Why do we need to include the interaction term to recover the true group means?

If you want to make sense of the issue using geometry, note that in a model without an interaction term the four predicted group means will always form a parallelogram: the final group mean can only be the linear combination b0 + b1 + b2, reached either "up by b1 then over by b2" or "over by b2 then up by b1", so the gap between teacher types is forced to be identical in both school types.

\(Model \space (1): \space \space score = b_0 + b_1 \cdot tfa + b_2 \cdot suburban\)
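As an optional sketch, you can also recover the four means implied by Model (1) directly from the estimated coefficients. Note that the TFA-minus-Regular gap equals b1 in both school types, which is exactly what forces the parallelogram shape:

b <- coef( m1 )
b[ "(Intercept)" ]                                        # Regular, urban (reference group)
b[ "(Intercept)" ] + b[ "sub.dummy" ]                     # Regular, suburban
b[ "(Intercept)" ] + b[ "tfa.dummy" ]                     # TFA, urban
b[ "(Intercept)" ] + b[ "tfa.dummy" ] + b[ "sub.dummy" ]  # TFA, suburban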

On a piece of paper draw an x-axis with the two groups labeled Regular and TFA. The y-axis will be the outcome. Draw four dots representing the four group means you found in Question 1. Connect the four dots to form a parallelogram.

Now add the actual group means to the diagram:

            suburb   urban
 Regular      75       57
 TFA          75       66

Regressions minimize the sum of squared errors in a model. When X is continuous the model selects the slope and intercept that minimize the squared residuals. In a model with only dummy variables we are instead minimizing the distance between the predicted mean of each group (the red dots in the diagram) and all of the observations in that group, represented here by the blue dots (the diagram shows the average of the observations rather than the observations themselves, for the sake of simplicity).
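If you prefer to check this numerically rather than visually (an optional sketch), compare the fitted value each group receives from Model 1 to its observed mean:

d$fit.m1 <- predict( m1 )   # every observation receives its group's predicted mean
d %>% 
  group_by( teacher, school ) %>% 
  summarize( ave.score=mean(score), model.1.fit=mean(fit.m1) ) %>% 
  pander( digits=4 )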

Returning to the diagram, what should be visually apparent is that you cannot shift the red parallelogram up or down in any way to achieve smaller residuals. You are seeing the parallelogram produced by Model 1, which minimizes the residuals subject to the constraints of the model (only three coefficients to fit four group means).

Note that Model 1 gets none of the group means correct. Why is that?


Back to the question: why do we need the interaction term to achieve a proper model fit?

How does the interaction term allow us to fit the data with an arbitrary quadrilateral instead of a parallelogram? How many additional degrees of freedom do we need for this?

Recall the Seven Sins of Regression from CPP 523.

This is another example of specification bias. It occurs when we have all of the variables we need to fit the model properly, but unless we use the right specification we will get poor coefficient estimates because of constraints we have imposed on the model, not because of poor data or omitted variable bias.
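If you want to formalize the comparison between the two specifications (an optional sketch, not part of the lab questions), a nested-model F-test asks whether the interaction term significantly improves the fit:

# compares Model 1 (restricted) to Model 2 (with the interaction)
anova( m1, m2 )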




Question 4

Explain the hypothesis associated with each coefficient in Model (2):




Question 5

Run a baseline model comparing the performance of Regular instructors to TFA instructors. Do not control for school geography (urban vs. suburban schools) in this model.

\(score = b_0 + b_1 \cdot tfa.dummy\)

m0 <- lm( score ~ tfa.dummy, data=d )

stargazer( m0, 
           type="html", digits=3,
           intercept.bottom = FALSE,
           omit.stat = c("ser","f","rsq","adj.rsq") )
                 Dependent variable: score
 Constant              69.668***
                        (0.262)
 tfa.dummy              -0.089
                        (0.371)
 Observations            2,000
 Note:         *p<0.1; **p<0.05; ***p<0.01

Do we find differences in performance across teacher types now? Report your decision criteria.



This is an example of Simpson's Paradox. When the groups you are comparing are spread unevenly across a third variable (here, unequal numbers of observations per school type) and you aggregate over that variable, the comparison can be biased.

tapply( d$score, d$teacher, mean ) %>% 
  round(0) %>% pander()
 Regular   TFA
    70      70

The aggregate comparison shows no difference, even though we know differences exist within school types:

tapply( d$score, list(d$teacher,d$school), mean ) %>% 
  round(0) %>% pander()
            suburb   urban
 Regular      75       57
 TFA          75       66

This occurs specifically in our sample because TFA instructors are more likely to work in urban schools and Regular instructors are more likely to work in suburban schools:

table( d$teacher, d$school ) %>% pander()
            suburb   urban
 Regular     700      300
 TFA         400      600
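You can reproduce the paradox by hand using the rounded group means and the counts above: each teacher type's aggregate score is just a weighted average of its two within-school means, and the two different mixes happen to land in the same place:

weighted.mean( c(75, 57), w=c(700, 300) )   # Regular: 0.7(75) + 0.3(57) = 69.6
weighted.mean( c(75, 66), w=c(400, 600) )   # TFA:     0.4(75) + 0.6(66) = 69.6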

In other words, Simpson's Paradox is another name for a selection problem. If we don't account for the selection process (in this case by controlling for school geography), we will improperly conclude there is no performance difference between teacher types.

In reality, TFA and Regular instructors perform the same in suburban schools, but TFA instructors perform better in urban schools. You need to compare them within context in order to avoid Simpson's Paradox.
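One concrete way to make a within-context comparison (a sketch using a subset created here only for illustration, not a required step) is to compare teacher types inside urban schools only:

urban.only <- d[ d$school == "urban", ]           # urban schools only
coef( lm( score ~ tfa.dummy, data=urban.only ) )  # TFA vs Regular within urban schools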




Question 6

Now run a model comparing teacher performance in urban schools.

\(Model \space (3): \space \space score = b_0 + b_1 \cdot tfa + b_2 \cdot urban + b_3 \cdot tfa \cdot urban\)

m3 <- lm( score ~ tfa.dummy + urb.dummy + tfa.dummy:urb.dummy, data=d )

stargazer( m3, 
           type="html", digits=3,
           intercept.bottom = FALSE,
           omit.stat = c("ser","f","rsq","adj.rsq") )
                     Dependent variable: score
 Constant                  74.972***
                            (0.190)
 tfa.dummy                   0.286
                            (0.315)
 urb.dummy                -17.677***
                            (0.347)
 tfa.dummy:urb.dummy        8.213***
                            (0.475)
 Observations                2,000
 Note:             *p<0.1; **p<0.05; ***p<0.01

Question 7

Q7-a: Which model allows us to test whether Regular instructors perform differently in urban versus suburban schools?

Q7-b: Which model allows us to test whether TFA instructors perform differently in urban versus suburban schools?

Challenge Question

What is the minimum number of regression models you need to run to test all six hypotheses in the TFA vs Regular at Urban vs Suburban Schools example?

Report which models you would run and the hypotheses associated with each coefficient.