This tutorial shows the steps the team used to build the neighborhood change baseline model.
# load necessary packages ----
library( dplyr )
library( here )
library( knitr )
library( pander )
library( stargazer )
library( scales )
# set randomization seed ----
set.seed( 1234 )
# load necessary functions and objects ----
# note: all of these are R objects that will be used throughout this .rmd file
import::here("S_TYPE",
"panel.cor",
"panel.smooth",
"jplot",
"d",
"df",
"cbsa_stats_df",
"descriptive_stat",
"median_home_value",
"change_MHV_2000_2010",
"percent_change_MHV_2000_2010",
"variables",
"variable_skew",
"skew_scatter_plot",
"addressing_skew",
"log",
"cbsa",
"metro",
"metro_level",
"predicted_variables",
# notice the use of here::here() that points to the .R file
# where all these R objects are created
.from = here::here("labs/wk06/unified_team3_source.R"),
.character_only = TRUE)
This is the full dataset, which includes 2000 and 2010 census variables and excludes all rural census tracts. Tracts with a median home value below $10K or a growth rate above 200 percent are omitted.
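The two omission rules described above can be illustrated with a minimal sketch. The toy data frame and its column names (`tractid`, `mhv.00`, `mhv.growth`) are invented for illustration and are not taken from `unified_team3_source.R`; whether the source file drops rows or sets the offending values to `NA` is not shown in this tutorial, so dropping rows is an assumption.

```r
# Toy data, invented for illustration; column names are assumptions
toy <- data.frame(
  tractid    = c( "t1", "t2", "t3", "t4" ),
  mhv.00     = c( 8000, 120000, 50000, 200000 ),  # median home value in 2000
  mhv.growth = c( 40, 25, 350, -10 ),             # percent growth, 2000 to 2010
  stringsAsFactors = FALSE
)

# Keep tracts with a 2000 value of at least $10K
# and a growth rate of at most 200 percent
keep <- toy$mhv.00 >= 10000 & toy$mhv.growth <= 200
toy.clean <- toy[ keep, ]

toy.clean$tractid
```

In this toy example the first tract fails the home value rule and the third fails the growth rule, so only the second and fourth remain.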
d
This subsetted data frame contains median home values and the change in home value from 2000 to 2010.
df
This histogram is a visual representation of the median home value in 2000.
median_home_value(x)
Model One
descriptive_stat (descriptive)
Statistic | Min | Pctl(25) | Median | Mean | Pctl(75) | Max |
MedianHomeValue2000 | 11,167 | 105,661 | 154,903 | 187,129 | 224,337 | 1,288,551 |
MedianHomeValue2010 | 9,999 | 123,200 | 193,200 | 246,570 | 312,000 | 1,000,001 |
MHV.Change.00.to.10 | -1,228,651 | 7,187 | 36,268 | 60,047 | 94,881 | 1,000,001 |
MHV.Growth.00.to.12 | -97 | 6 | 25 | 33 | 50 | 6,059 |
change_MHV_2000_2010(MHV)
Like the absolute change in median home value, the percentage change in median home value provides another level of context for home value growth rates in the census tracts.
percent_change_MHV_2000_2010(perMHV)
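As a rough illustration of the two measures discussed above, the sketch below computes the absolute change and the percentage change for three made-up tracts. The vector names `mhv.00`, `mhv.10`, `mhv.change`, and `mhv.growth` are assumptions chosen to echo the table labels, and any adjustment the source file may apply to the 2000 values (for example, for inflation) is ignored here.

```r
# Made-up median home values for three tracts
mhv.00 <- c( 100000, 150000, 200000 )   # year 2000
mhv.10 <- c( 125000, 135000, 300000 )   # year 2010

# Absolute change in dollars
mhv.change <- mhv.10 - mhv.00

# Percentage change relative to the 2000 value
mhv.growth <- 100 * ( mhv.change / mhv.00 )

data.frame( mhv.00, mhv.10, mhv.change, mhv.growth )
```

The percentage measure puts tracts with very different starting values on a common scale, which is why it is reported alongside the dollar change.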
Both changes within a census tract and changes within a city can affect the price of a home. Since our policy of interest focuses on changes made within a tract (new business and new housing created through NMTC and LIHTC), we want to control for things happening at the metro level to ensure we are not inferring that programs that target the tract are driving outcomes when it is actually broad metro-level trends.
Here we calculate several metro-level statistics for our model: population growth in the city (Pittsburgh), changes in demographics, changes in industry structure and wealth, and so on. Each metro-level statistic is calculated either by aggregating up (for counts such as population) or by averaging across census tracts.
# view results
cbsa_stats_df %>% head()
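The aggregation described above (summing counts, averaging rates) can be sketched in base R as follows. This is not the code that builds `cbsa_stats_df`; the toy data and the column names `cbsaname`, `pop`, `metro.pop`, and `metro.mhv.growth` are assumptions made for the example.

```r
# Toy tract-level data for two hypothetical metro areas
toy <- data.frame(
  cbsaname   = c( "Metro A", "Metro A", "Metro B", "Metro B", "Metro B" ),
  pop        = c( 1000, 3000, 2000, 2000, 4000 ),
  mhv.growth = c( 10, 30, 20, 40, 60 ),
  stringsAsFactors = FALSE
)

# Counts are summed up to the metro level; rates are averaged across tracts
metro.pop    <- tapply( toy$pop, toy$cbsaname, sum )
metro.growth <- tapply( toy$mhv.growth, toy$cbsaname, mean )

metro.stats <- data.frame(
  cbsaname         = names( metro.pop ),
  metro.pop        = as.numeric( metro.pop ),
  metro.mhv.growth = as.numeric( metro.growth[ names( metro.pop ) ] ),
  stringsAsFactors = FALSE
)

# Attach the metro-level values back onto every tract in that metro
toy <- merge( toy, metro.stats, by = "cbsaname" )
toy
```

With dplyr, which this tutorial already loads, the same aggregation can be written as `group_by()` followed by `summarize()`; base R is used here only to keep the sketch self-contained.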
Here we begin by modeling changes in median home values that are predicted by tract characteristics in 2000 and changes in the city between 2000 and 2010.
To predict the change in median home value between 2000 and 2010, the team selected the following variables: percentage of the population that is Black (p.black), percentage of the population with a high school degree (p.hs), and percentage of the population that is unemployed (p.unemp).
The primary purpose of selecting these variables is to assess how each one affects the change in median home value. Before running a model with the three selected variables, the team checked for skew and multicollinearity among them and the median home value change between 2000 and 2010.
Since it is common for social science data not to be normally distributed, the team checked the selected variables for skew, as shown below:
log.p.unemp <- log10( d$p.unemp + 1 )
log.p.black <- log10( d$p.black + 1 )
these <- sample( 1:length(log.p.black), 5000 )
par( mfrow=c(1,2) )
jplot( d$p.black[these], d$p.unemp[these],
lab1="Black population(percent)", lab2="Unemployment (percent)",
main="Raw Measures" )
jplot( log.p.black[these], log.p.unemp[these],
lab1="black population(percent)", lab2="Unemployment (percent)",
main="Log Transformed" )
Based on the plots above, the team noticed that before the log transformation, the observations for unemployment and the Black population were bunched toward the left of the plot, and that the two variables appear to be related.
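One way to put a number on the skew discussed above is to compare the sample skewness of a variable before and after the `log10( x + 1 )` transformation. The sketch below uses simulated right-skewed data rather than the tutorial's `d`, and `skew()` is a hypothetical helper written for this example, not a function from the source file.

```r
set.seed( 1234 )

# Simulated right-skewed variable standing in for a tract-level percentage
x <- rlnorm( 5000, meanlog = 1, sdlog = 1 )

# Hypothetical helper: moment-based sample skewness
# (0 for a symmetric sample, positive for a long right tail)
skew <- function( v ) {
  v <- v[ !is.na( v ) ]
  m <- mean( v )
  mean( ( v - m )^3 ) / ( mean( ( v - m )^2 )^1.5 )
}

skew( x )                # raw measure
skew( log10( x + 1 ) )   # after the same transformation used in this tutorial
```

A value closer to zero after the transformation indicates that the log has pulled in the long right tail.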
skew_scatter_plot()
The scatter plots of the selected variables show clusters in the lower left-hand corner, which is evidence of skewed data in relation to MHV growth.
To address the skew in the variables in relation to MHV growth, a log transformation was applied in order to strengthen the relationship among the variables.
addressing_skew()
jplot (d$p.unemp, d$mhv.growth, ylim=c(-50,100),
lab1="Unemployment", lab2="MHV Growth")
The plots here show no noticeable cluster, pattern, or skew. The analysis will continue with the log-transformed variables.
Multicollinearity happens when two independent variables in a regression model are highly correlated and explain the same variance. As a result, standard errors increase, and slopes typically shift toward the null and lose statistical significance. To check for multicollinearity, the team regressed the dependent variable on the independent variables, first separately and then together, to see how stable the relationships are and whether the standard errors increase noticeably.
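The effect described above can be reproduced with a small simulation. The sketch below is not the team's model: `x1`, `x2`, and `y` are simulated, with `x2` built to correlate with `x1` at roughly 0.9, so that the standard error and coefficient of `x1` can be compared when it is entered alone and when it is entered together with `x2`.

```r
set.seed( 1234 )
n  <- 2000

# Two predictors that share most of their variance (correlation about 0.9)
x1 <- rnorm( n )
x2 <- 0.9 * x1 + sqrt( 1 - 0.9^2 ) * rnorm( n )

# Outcome depends on both predictors plus noise
y  <- 2 * x1 + 2 * x2 + rnorm( n, sd = 5 )

m.alone    <- lm( y ~ x1 )        # x1 entered by itself
m.together <- lm( y ~ x1 + x2 )   # x1 entered with its correlated partner

se.alone    <- summary( m.alone )$coefficients[ "x1", "Std. Error" ]
se.together <- summary( m.together )$coefficients[ "x1", "Std. Error" ]

cor( x1, x2 )
c( coef.alone = unname( coef( m.alone )[ "x1" ] ),
   coef.together = unname( coef( m.together )[ "x1" ] ) )
c( se.alone = se.alone, se.together = se.together )
```

When `x1` is entered alone it absorbs part of the effect of `x2`; once both are included, its slope shrinks and its standard error grows, which is the pattern to look for in the regression table.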
Model Two
variables(var)
Dependent variable: mhv.growth (standard errors in parentheses)
Variable | (1) | (2) | (3) |
p.black | 4.05*** (0.26) | | 1.15*** (0.29) |
p.unemp | | 15.57*** (0.56) | 11.62*** (0.64) |
p.hs | | | 72.30*** (3.03) |
Constant | 26.20*** (0.26) | 17.43*** (0.46) | -115.10*** (5.57) |
Observations | 58,839 | 58,801 | 58,800 |
Adjusted R2 | 0.004 | 0.01 | 0.02 |
Residual Std. Error | 35.10 (df = 58837) | 34.94 (df = 58799) | 34.77 (df = 58796) |
Note: *p<0.1; **p<0.05; ***p<0.01
cbsa()
metro()
Model Three
predicted_variables()
Dependent variable: mhv.growth (standard errors in parentheses)
Variable | (1) | (2) | (3) |
p.black | 4.05*** (0.26) | 2.71*** (0.20) | 2.42*** (0.24) |
metro.mhv.growth | | 0.01*** (0.0000) | |
Constant | 26.20*** (0.26) | 1.62*** (0.23) | 24.13*** (4.04) |
Metro Fixed Effects: | NO | NO | YES |
Observations | 58,839 | 58,839 | 58,839 |
Adjusted R2 | 0.004 | 0.41 | 0.41 |
Residual Std. Error | 35.10 (df = 58837) | 27.12 (df = 58836) | 27.10 (df = 58459) |
Note: *p<0.1; **p<0.05; ***p<0.01
Even though all the variables are statistically significant, as shown in Model Two, the following can be observed when the variables are entered together:
Coefficients for the Black and unemployed populations decrease
Standard errors increase for both of these variables
Adjusted R-squared stays very low across the models (0.004 to 0.02)
Residual Std. Error barely changes (35.10 to 34.77)
The increase in the standard errors means the coefficients are estimated less precisely, which affects the model's accuracy. The above analysis suggests that the selected variables carry redundant information, and that including them together causes them to partially cancel each other out, leading to a less effective model. To account for the metro-level context that may be responsible for these patterns, a metro-level fixed effect is added to the model.
However, the relationship between the three variables (Black population, unemployed population, and population with a high school degree) and MHV growth is positive and statistically significant. We can conclude that these three variables are associated with MHV growth.
Moreover, the Black population is considered the most critical of the three variables used to predict MHV change. Across Models Two and Three, p.black remains statistically significant, and its standard error decreases once metro-level controls are added (from 0.26 to 0.20 and 0.24).
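The metro-level fixed effect used in Model Three can be sketched on simulated data. Everything below is invented for illustration: three hypothetical metros with different baseline growth rates, a simulated `p.black`, and a simulated `mhv.growth`. It is a sketch of the general technique (one intercept per metro via `factor()`), not the code behind `predicted_variables()`.

```r
set.seed( 1234 )

# Three hypothetical metros, 200 simulated tracts each
metro       <- rep( c( "A", "B", "C" ), each = 200 )
metro.shift <- c( A = 10, B = 30, C = 50 )[ metro ]   # metro-wide growth level

# Simulated tract-level predictor and outcome
p.black    <- runif( 600 )
mhv.growth <- metro.shift + 3 * p.black + rnorm( 600, sd = 5 )

# Pooled model ignores which metro a tract belongs to
m.pooled <- lm( mhv.growth ~ p.black )

# Fixed effects model gives each metro its own intercept
m.fe <- lm( mhv.growth ~ p.black + factor( metro ) )

c( pooled = summary( m.pooled )$adj.r.squared,
   fixed.effects = summary( m.fe )$adj.r.squared )
c( pooled = summary( m.pooled )$coefficients[ "p.black", "Std. Error" ],
   fixed.effects = summary( m.fe )$coefficients[ "p.black", "Std. Error" ] )
```

Because the metro intercepts absorb the metro-wide differences in growth, the fixed effects model explains more of the variance and estimates the tract-level slope more precisely, which mirrors the direction of the change between columns (1) and (3) of the Model Three table.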