An Idiot's Guide to Econometrics


1. Difference-in-differences Estimation 
Difference-in-differences (DID) is an econometric technique that attempts to identify the impact of a policy intervention on a treatment group relative to a control group. Unlike a randomized controlled experiment, it uses observational data to identify this differential impact of treatment.
The main idea of DID is to compare the average change over time in the outcome variable for the treatment group with the average change over time for the control group.
The concept can be made clear with the following example:
Consider a policy intervention that provides skill training. We are interested in whether the wage rate rises as a result of this intervention.

In the graph above, the line HC shows the growth of wages for the control group, where the policy intervention is absent. The line GA shows the growth of wages for the treatment group, which receives skill training. In year 1, the total difference between the wages of the two groups (AC) is not the true impact of the policy intervention, because there was a pre-existing difference in wages (GH). Thus, the true impact is:
True Impact (AB) = AC - BC, where BC = GH
One crucial assumption here is the parallel trends assumption: had there been no policy intervention, wages in both groups would have followed the same trend.
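In symbols, the same calculation can be written in terms of group means. The notation is introduced here for illustration and does not appear in the graph: \bar{Y}_{g,t} stands for the mean wage of group g (T for treatment, C for control) in year t.

```latex
\widehat{\mathrm{DID}}
  = \left(\bar{Y}_{T,1} - \bar{Y}_{C,1}\right)
  - \left(\bar{Y}_{T,0} - \bar{Y}_{C,0}\right)
  = \left(\bar{Y}_{T,1} - \bar{Y}_{T,0}\right)
  - \left(\bar{Y}_{C,1} - \bar{Y}_{C,0}\right)
```

The first form is the year 1 gap minus the pre-existing gap. The second is the change over time in the treatment group minus the change over time in the control group. Both give the same number.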

DID Estimation in Stata
To illustrate DID estimation in Stata, consider hypothetical data where:
  • year and treat are dummy variables
  • year = 0 for the pre-intervention period and 1 for the post-intervention period
  • treat = 0 for the control group and 1 for the treatment group
  • yeartreated = year*treat is an interaction dummy variable
  • wage = hourly wage in dollars
The Excel data file and the Stata do-file are available here.

year treat yeartreated wage
0 1 0 12
0 0 0 4
0 1 0 22
0 0 0 3
0 1 0 12
0 0 0 3
0 1 0 24
0 0 0 15
0 1 0 12
0 0 0 8
0 1 0 12
0 0 0 4
0 1 0 15
0 0 0 6
0 1 0 11
0 0 0 6
0 1 0 12
0 0 0 15
0 1 0 2
1 0 0 12
1 1 1 21
1 0 0 12
1 1 1 23
1 0 0 16
1 1 1 33
1 0 0 12
1 1 1 24
1 0 0 15
1 1 1 29
1 0 0 21
1 1 1 32
1 0 0 14
1 1 1 23
1 0 0 13
1 1 1 53
1 0 0 16
1 1 1 24
1 0 0 25
1 1 1 42
Import the data into Stata.
Method I : Regression Method
We run the regression:
wage = α + β*year + γ*treat + δ*yeartreated + u
Here:
Expected wage for the control group in year 0 is: α
Expected wage for the control group in year 1 is: α+β
Expected wage for the treatment group in year 0 is: α+γ
Expected wage for the treatment group in year 1 is: α+β+γ+δ
DID = [(α+β+γ+δ)-(α+γ)] - [(α+β)-α] = (β+δ) - β = δ
Thus, the coefficient δ is the difference-in-differences estimator.
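As a cross-check outside Stata, the short Python sketch below (plain Python, no libraries; the variable names are illustrative and not part of the do-file) computes the four group means from the data listed above. Because the regression has one parameter for each year-by-group cell, OLS reproduces the cell means exactly, so each coefficient can be read off from them.

```python
# Wages from the table above, keyed by (year, treat).
wages = {
    (0, 0): [4, 3, 3, 15, 8, 4, 6, 6, 15],
    (0, 1): [12, 22, 12, 24, 12, 12, 15, 11, 12, 2],
    (1, 0): [12, 12, 16, 12, 15, 21, 14, 13, 16, 25],
    (1, 1): [21, 23, 33, 24, 29, 32, 23, 53, 24, 42],
}

# Mean wage in each of the four cells.
mean = {cell: sum(w) / len(w) for cell, w in wages.items()}

# The model is saturated, so the OLS coefficients are functions of cell means.
alpha = mean[(0, 0)]                          # control group, year 0
beta = mean[(1, 0)] - mean[(0, 0)]            # change over time, control group
gamma = mean[(0, 1)] - mean[(0, 0)]           # pre-existing gap between groups
delta = (mean[(1, 1)] - mean[(0, 1)]) - beta  # difference in differences

print(f"alpha={alpha:.2f} beta={beta:.2f} gamma={gamma:.2f} delta={delta:.2f}")
```

The value of delta comes out at about 8.51, which is the number the regression coefficient on yeartreated should reproduce.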
To implement this regression in Stata, type the command:
reg wage year treat yeartreated, robust
The following result will appear.
Here, the DID estimate is 8.51, which is statistically significant at the 10 percent level.
The same regression can be run with Stata's factor-variable interaction operator ##, the "hashtag" method:
* hashtag (##) method
reg wage year##treat, r
The following output will appear in Stata:
The results in this table are essentially the same, except that it is not necessary to define the interaction dummy variable beforehand. Stata creates it automatically during estimation.

Method II : Installing the diff Program File
The user-written command diff can be installed to estimate the DID. The Stata command is:
ssc install diff
If the package is already installed, Stata reports:
checking diff consistency and verifying not already installed...
all files already exist and are up to date.
Then type:
diff wage, t(treat) p(year)
The following results will appear:

The results match those of the previous method.

Method III : t-test Method
The DID can be found by running a two-group t-test separately for year 0 and year 1:
ttest wage if year==0, by(treat)
The output is:

ttest wage if year==1, by(treat)
The output is:

ttest reports the control-group mean minus the treatment-group mean, so both differences are negative: -6.289 in year 0 and -14.8 in year 1. The DID estimator is the difference between these differences: -6.289-(-14.8)=8.511

Method IV : Collapse Command
In Stata, type the following command:
collapse (mean) wage, by(year treat)
This command collapses the data set into four cells, one for each combination of the two dummies, each holding the mean wage for that cell.
Type the browse command and the result in our example will appear as:

year treat    wage
0       0        7.11
0       1        13.4
1       0        15.6
1       1        30.4
In year 0, the pre-existing difference in mean wages between the treatment and control groups is: 13.4-7.11=6.29
In year 1, the difference in mean wages between the groups is: 30.4-15.6=14.8
Difference in differences = 14.8-6.29 = 8.51
Among all these methods, the first and second are preferred because they report the standard error of the DID estimate, which is needed to decide whether the DID is statistically significant.
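To see what a standard error adds, the DID and a heteroskedasticity-robust standard error can also be computed by hand. The Python sketch below is a cross-check outside Stata, not the do-file. It assumes that the robust option of reg applies the usual sandwich estimator with the n/(n-k) small-sample factor (HC1). In a regression with one parameter per cell, that estimator reduces to summing each cell's squared residuals divided by the squared cell size. Variable names are illustrative.

```python
import math

# Wages from the table above, keyed by (year, treat).
wages = {
    (0, 0): [4, 3, 3, 15, 8, 4, 6, 6, 15],
    (0, 1): [12, 22, 12, 24, 12, 12, 15, 11, 12, 2],
    (1, 0): [12, 12, 16, 12, 15, 21, 14, 13, 16, 25],
    (1, 1): [21, 23, 33, 24, 29, 32, 23, 53, 24, 42],
}

n = sum(len(w) for w in wages.values())  # total number of observations
k = 4                                    # alpha, beta, gamma, delta

mean = {}
var_terms = 0.0
for cell, w in wages.items():
    m = sum(w) / len(w)
    mean[cell] = m
    ssr = sum((x - m) ** 2 for x in w)   # squared residuals within the cell
    var_terms += ssr / len(w) ** 2       # robust variance of the cell mean

did = (mean[(1, 1)] - mean[(1, 0)]) - (mean[(0, 1)] - mean[(0, 0)])
se = math.sqrt(var_terms * n / (n - k))  # HC1 small-sample correction (assumed)
t_stat = did / se

print(f"DID={did:.2f} robust SE={se:.2f} t={t_stat:.2f}")
```

Under these assumptions the standard error is about 4.30 and the t-ratio about 1.98. With 35 residual degrees of freedom, that clears the 10 percent threshold but falls just short of the 5 percent one, which is consistent with the significance level reported under Method I.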
For comments, please contact srb863@g.harvard.edu 

..................................................................................................................................................................... 
2. Statistical and Economic Significance

Statistical significance and economic significance are two terms used quite often when interpreting regression results.
Statistical significance is a purely statistical concept. It refers to whether we can reject the null hypothesis that the true population coefficient is zero, that is, that there is no relationship between the variables. Thus, it concerns hypothesis testing of the regression coefficient.
As a rule of thumb, if the standard error of the estimated regression coefficient is less than half the absolute value of the coefficient, the variable is said to be statistically significant.
The statistical significance of a regression coefficient can be stated in several ways.
A variable included in the regression is statistically significant if
  • The standard error of the regression coefficient is less than half of the absolute value of the coefficient, or
  • The t-statistic associated with the coefficient is greater than 1.96 in absolute value (at the 5% level of significance, in a large sample), or
  • The probability value (p-value) associated with the coefficient is less than 0.05 (again at the 5% level), or
  • The confidence interval of the coefficient (95 percent, for the 5% level) does not include zero.
An Example: Consider the regression output given below:

The variable treat is statistically significant because:
  • Method I: Standard error (2.26) < 1/2 of coefficient (10.66).
  • Method II: The t-ratio is greater than 1.96.
  • Method III: The probability value is less than 0.05.
  • Method IV: The 95 percent confidence interval does not include zero.
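The four criteria can be checked side by side for any coefficient and standard error. The Python sketch below uses made-up values for coef and se (they are not taken from the regression output above) and a large-sample normal approximation for the p-value, which is an assumption: Stata itself uses the t distribution with the residual degrees of freedom.

```python
import math

coef = 5.0  # hypothetical estimated coefficient
se = 2.0    # hypothetical standard error

t_stat = coef / se
# Two-sided p-value under the standard normal: P(|Z| > |t|).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
# 95 percent confidence interval using the normal critical value 1.96.
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se

checks = {
    "SE < half of |coef|": se < abs(coef) / 2,
    "|t| > 1.96": abs(t_stat) > 1.96,
    "p < 0.05": p_value < 0.05,
    "CI excludes zero": ci_low > 0 or ci_high < 0,
}
for name, passed in checks.items():
    print(f"{name}: {passed}")
```

The first criterion is a slightly stricter shortcut (it requires |t| > 2 rather than 1.96), so a coefficient with a t-ratio between 1.96 and 2 passes the last three checks but not the first.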
On the other hand, economic significance concerns whether the size of the coefficient matters for policy. For example, consider the following hypothetical output from a regression:
crime = 10 - 0.000001*educ + 0.0000236*drinking
where
crime = number of crimes committed in the ith society
educ = average years of education in the ith society
drinking = average level of alcohol consumption in the ith society
Suppose that the coefficients are statistically significant.
Do they matter for policy?
Would you recommend providing more education to control crime?

Even if we move a variable from its mean to its first or third quartile (a large shift from the perspective of policy and the expenditure required), the change in crime is negligible. Thus, the coefficients are not economically significant.
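A quick calculation makes the point concrete. The shifts used below (four more years of average education, ten fewer units of drinking) are hypothetical values chosen for illustration; the coefficients are those of the regression above.

```python
# Coefficients from the hypothetical regression above.
b_educ = -0.000001
b_drinking = 0.0000236

# Hypothetical policy shifts (assumed for illustration only).
d_educ = 4.0        # average education rises by four years
d_drinking = -10.0  # average drinking falls by ten units

# Predicted change in crime, holding the other variable fixed.
change_from_educ = b_educ * d_educ
change_from_drinking = b_drinking * d_drinking

print(f"{change_from_educ:.6f}")      # change in crime from the education shift
print(f"{change_from_drinking:.6f}")  # change in crime from the drinking shift
```

Against a baseline of about 10 crimes, both changes are far too small to matter, however precisely the coefficients are estimated.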
...................................................................................................................................................................
