This tutorial shows the steps the team used to build the neighborhood change baseline model.

# load necessary packages ----
library( dplyr )
library( here )
library( knitr )
library( pander )
library( stargazer )
library( scales )

# set randomization seed ----
set.seed( 1234 )

# load necessary functions and objects ----
# note: all of these are R objects that will be used throughout this .rmd file
import::here("S_TYPE",
             "panel.cor",
             "panel.smooth",
             "jplot",
             "d",
             "df",
             "cbsa_stats_df",
             "descriptive_stat",
             "median_home_value",
             "change_MHV_2000_2010",
             "percent_change_MHV_2000_2010",
             "variables",
             "variable_skew",
             "skew_scatter_plot",
             "addressing_skew",
             "log",
             "cbsa",
             "metro",
             "metro_level",
             "predicted_variables",
             # notice the use of here::here() that points to the .R file
             # where all these R objects are created
             .from = here::here("labs/wk06/unified_team3_source.R"),
             .character_only = TRUE)

Part 1 - Data

This is the full dataset. It includes 2000 and 2010 census variables and drops all rural census tracts. Tracts with a median home value below $10,000 and tracts with growth rates above 200 percent are omitted.
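The filtering rule described above can be sketched as follows. This is a minimal illustration on a made-up data frame, not the team's actual code: the column names `mhv.00` and `mhv.10` and the toy values are assumptions, and the sketch assumes the $10,000 threshold is applied to the 2000 value.

```r
# Toy tract data (values invented for illustration only)
tracts <- data.frame(
  tractid = c("A", "B", "C", "D"),
  mhv.00  = c(8000, 120000, 50000, 200000),   # median home value, 2000
  mhv.10  = c(30000, 150000, 180000, 210000)  # median home value, 2010
)

# Percent growth in median home value, 2000 to 2010
tracts$mhv.growth <- 100 * (tracts$mhv.10 - tracts$mhv.00) / tracts$mhv.00

# Keep tracts with a 2000 value of at least $10,000 (assumed base year)
# and growth of at most 200 percent
keep <- tracts$mhv.00 >= 10000 & tracts$mhv.growth <= 200
tracts_clean <- tracts[keep, ]
```

In this toy example, tract A is dropped for its low 2000 value and tract C for growth above 200 percent.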

A few columns from the full data set

This subsetted data frame contains median home values and the change in home value from 2000 to 2010.

df

Median Home Value

This histogram is a visual representation of the median home value in 2000.

median_home_value(x)

Descriptive Statistics

Model One

descriptive_stat (descriptive)


Statistic	Min	Pctl(25)	Median	Mean	Pctl(75)	Max

MedianHomeValue2000	11,167	105,661	154,903	187,129	224,337	1,288,551
MedianHomeValue2010	9,999	123,200	193,200	246,570	312,000	1,000,001
MHV.Change.00.to.10	-1,228,651	7,187	36,268	60,047	94,881	1,000,001
MHV.Growth.00.to.12	-97	6	25	33	50	6,059

Change in MHV 2000-2010

change_MHV_2000_2010(MHV)

Percent Change in MHV 2000 to 2010

Like the absolute change in median home value, the percent change in median home value provides another level of context for home value growth rates in the census tracts.

percent_change_MHV_2000_2010(perMHV)

Metro Level Statistics

Both changes within a census tract and changes within a city can affect the price of a home. Since our policy of interest focuses on changes made within a tract (new business and new housing created through NMTC and LIHTC), we want to control for things happening at the metro level to ensure we are not inferring that programs that target the tract are driving outcomes when it is actually broad metro-level trends.

Here we calculate several metro-level statistics for our model, such as growth in population in the city (Pittsburgh), changes in demographics, changes in industry structure and wealth, and so on. The metro-level statistics are calculated either by aggregating up (for example, population counts) or by averaging across census tracts.
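The two kinds of metro-level statistic described above (summing up and averaging across tracts) can be sketched in base R as below. The data are invented, and the names `cbsa`, `pop.00`, and `metro.pop.00` are assumptions made for illustration; only `metro.mhv.growth` appears as a variable name in the models later in this tutorial.

```r
# Toy tract-level data; cbsa is a made-up metro identifier
tracts <- data.frame(
  cbsa       = c("M1", "M1", "M2", "M2", "M2"),
  pop.00     = c(1000, 3000, 2000, 2000, 4000),
  mhv.growth = c(10, 30, 20, 40, 60)
)

# Aggregate up: total population per metro
metro_pop <- aggregate(pop.00 ~ cbsa, data = tracts, FUN = sum)
names(metro_pop)[2] <- "metro.pop.00"

# Average across tracts: mean MHV growth per metro
metro_growth <- aggregate(mhv.growth ~ cbsa, data = tracts, FUN = mean)
names(metro_growth)[2] <- "metro.mhv.growth"

# One row per metro, then attach the metro-level values back to each tract
metro_stats <- merge(metro_pop, metro_growth, by = "cbsa")
tracts <- merge(tracts, metro_stats, by = "cbsa")
```

Merging the metro table back onto the tracts gives every tract its metro's values, which is the form needed to use them as controls in a tract-level regression.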

# view results
cbsa_stats_df %>% head()

Part 2 - Predict MHV Change

Here we begin by modeling changes in median home values that are predicted by tract characteristics in 2000 and changes in the city between 2000 and 2010.

Variable selection

In order to predict the median home value change between 2000 and 2010, the team selected the following variables: percentage of the population that is Black (p.black), percentage of the population with a high school degree (p.hs), and percentage of the population that is unemployed (p.unemp).

The primary purpose of selecting these variables is to assess how each one affects the change in median home value. Before running a model with the three selected variables, the team checked for variable skew and for multicollinearity among the selected variables and the median home value change between the 2000 and 2010 datasets.

Variable Skew

Since it is common for social science data not to be normally distributed, the team checked the selected variables for skew, as shown below:

# log-transform the percentages; add 1 so that zero values stay defined
log.p.unemp <- log10( d$p.unemp + 1 )
log.p.black <- log10( d$p.black + 1 )

# plot a random sample of 5,000 tracts to keep the scatter plots readable
these <- sample( seq_along( log.p.black ), 5000 )

par( mfrow=c(1,2) )
jplot( d$p.black[these], d$p.unemp[these], 
       lab1="Black population (percent)", lab2="Unemployment (percent)",
       main="Raw Measures" )
jplot( log.p.black[these], log.p.unemp[these], 
       lab1="Black population (percent)", lab2="Unemployment (percent)",
       main="Log Transformed" )

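Based on the plots above, the team noticed that before the log transformation, the observations for unemployment and the Black population cluster toward the left of the plot. In statistical terms both variables are right-skewed (most tracts have low values, with a long tail of high values), which makes the relationship between the two variables hard to judge from the raw measures.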

skew_scatter_plot()

The scatter plots of the selected variables show clusters in the lower left-hand corner, which is evidence of skewed data in relation to MHV growth.

Addressing Skew in Selected Variables in Relation to MHV Growth

To address the skew in the selected variables in relation to MHV growth, a log transformation was applied in order to bring out the relationship among the variables.
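The effect of a log transformation on skew can also be checked numerically. The sketch below is an illustration on simulated data, not on the census variables: the `skewness()` helper is a hypothetical moment-based function written for this example, and the log-normal draws simply stand in for a right-skewed percentage.

```r
set.seed(1234)

# Simulated right-skewed values (log-normal draws; not real census data)
x <- rlnorm(5000, meanlog = 1, sdlog = 1)

# Moment-based sample skewness: near 0 for symmetric data,
# positive when the data have a long right tail
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Same transformation as used in this tutorial: log10 of the value plus 1
log.x <- log10(x + 1)

skew_raw <- skewness(x)      # skewness before the transformation
skew_log <- skewness(log.x)  # skewness after the transformation
```

Comparing `skew_raw` with `skew_log` shows how much the transformation pulls in the long right tail.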

addressing_skew()

jplot (d$p.unemp, d$mhv.growth, ylim=c(-50,100),
       lab1="Unemployment", lab2="MHV Growth")

The plots here show no noticeable cluster, pattern, or skew. The analysis will continue with a logarithmic transformation on .

Multicollinearity

Multicollinearity happens when two independent variables in a regression model are highly correlated and explain the same variance. As a result, standard errors increase, and slopes typically shift toward the null and lose statistical significance. To check for multicollinearity, the team ran models with the independent variables entered separately and together against the dependent variable, to see whether the relationships hold up and whether there is a significant increase in standard error.
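The symptoms described above can be reproduced on simulated data. The sketch below is an illustration only: the variables `x1`, `x2`, `x3`, and `y` are invented, and the variance inflation factor is computed by hand as 1 / (1 - R²) rather than with any particular package.

```r
set.seed(1234)
n <- 1000

# Simulated predictors: x2 is built to be strongly correlated with x1
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)                       # unrelated to the other two
y  <- 2 * x1 + 1 * x3 + rnorm(n)

# Pairwise correlations among the predictors
cor_mat <- cor(cbind(x1, x2, x3))

# Variance inflation factor for x1: regress x1 on the other predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Standard error of the x1 slope, without and with its near-duplicate x2
se_alone    <- summary(lm(y ~ x1 + x3))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]
```

A high correlation in `cor_mat`, a large `vif_x1`, and `se_together` exceeding `se_alone` are the same warning signs the team looks for when comparing model columns.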

Model Two

variables(var)


	Dependent variable:

	mhv.growth
	(1)	(2)	(3)

p.black	4.05^***		1.15^***
	(0.26)		(0.29)

p.unemp		15.57^***	11.62^***
		(0.56)	(0.64)

p.hs			72.30^***
			(3.03)

Constant	26.20^***	17.43^***	-115.10^***
	(0.26)	(0.46)	(5.57)


Observations	58,839	58,801	58,800
Adjusted R²	0.004	0.01	0.02
Residual Std. Error	35.10 (df = 58837)	34.94 (df = 58799)	34.77 (df = 58796)

Note:	*p<0.1; **p<0.05; ***p<0.01

cbsa()

metro()

Model Three

predicted_variables()


	Dependent variable:

	mhv.growth
	(1)	(2)	(3)

p.black	4.05^***	2.71^***	2.42^***
	(0.26)	(0.20)	(0.24)

metro.mhv.growth		0.01^***
		(0.0000)

Constant	26.20^***	1.62^***	24.13^***
	(0.26)	(0.23)	(4.04)


Metro Fixed Effects:	NO	NO	YES
Observations	58,839	58,839	58,839
Adjusted R²	0.004	0.41	0.41
Residual Std. Error	35.10 (df = 58837)	27.12 (df = 58836)	27.10 (df = 58459)

Note:	*p<0.1; **p<0.05; ***p<0.01
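Column (3) of Model Three includes metro fixed effects. A minimal sketch of what that means in a regression is given below, assuming simulated data and a hypothetical metro identifier `cbsa_id`; it is not the code behind the table above. Adding `factor(cbsa_id)` gives each metro its own intercept, so the tract-level slope is estimated only from variation within metros.

```r
set.seed(1234)

# Simulated tracts in three made-up metros with different baseline growth
n_per       <- 200
cbsa_id     <- rep(c("M1", "M2", "M3"), each = n_per)
metro_shift <- rep(c(0, 20, 40), each = n_per)

# Tract-level predictor whose average also differs by metro
p.black <- runif(3 * n_per, 0, 1) + metro_shift / 100

# The true tract-level slope is 2; metro_shift is a metro-wide trend
mhv.growth <- 10 + 2 * p.black + metro_shift + rnorm(3 * n_per, sd = 5)

# Pooled model: the slope absorbs part of the metro-wide trend
m_pooled <- lm(mhv.growth ~ p.black)

# Fixed-effects model: one intercept per metro via factor()
m_fe <- lm(mhv.growth ~ p.black + factor(cbsa_id))

b_pooled <- unname(coef(m_pooled)["p.black"])
b_fe     <- unname(coef(m_fe)["p.black"])
```

In this simulation the pooled slope is far from the true value of 2 because it picks up the metro-wide trend, while the fixed-effects slope is much closer to it.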

Results and Conclusions from the models

Even though all the variables are statistically significant, as shown in Model Two, the following can be observed when the three variables are entered together in column (3):

Coefficients for the Black population and the unemployed population decrease
Standard errors increase for both of these variables
Adjusted R-squared rises only slightly (from 0.004 to 0.02)
Residual Std. Error barely changes (from 35.10 to 34.77)

The increase in the coefficient standard errors means the slopes are estimated less precisely, which affects the model's accuracy. The above analysis suggests that the selected variables contain overlapping information, and that including them together causes them to partly cancel each other out, which leads to a less effective model. In order to account for the context that may be responsible for the skew, the metro-level fixed effect is added to the model.

However, the relationship between each of the three variables (Black population, unemployed population, and population with a high school degree) and MHV growth is positive and statistically significant. We can conclude that these three variables are associated with MHV growth.

Moreover, the team considers the Black population the most critical of the three variables used to predict MHV change. Across Models Two and Three, its coefficient remains statistically significant, and its standard error falls in Model Three once metro-level controls are added.

Build Baseline Model

Team 03