My Machine Learning Works: March 2023

clear all
set more off

/****************************/
/******** QUESTION 1 *******/
/****************************/

*Importing data

use "D:\STATA2023\part1_timeseries.dta"

* sorting the year column
sort year

*set data as time series
tsset year, yearly

*label variables
lab var children "The number of own children under age 5"
lab var unemployed "The share of unemployed women within (25-54) "

* 1. Time series of ’children’ and ’unemployed’ over time

twoway (tsline children) (tsline unemployed, yaxis(2)), ttitle(Years) ///
title(Time series of Unemployed and Children) legend(c(1))

* simple regression of ’children’ on ’unemployed’
reg children unemployed

* 2. Regression of ’children’ on ’unemployed’ and share_married
reg children unemployed share_married

* 3. testing for trends in the variables

*set up a time trend
gen t = _n

reg unemployed t // non linear
reg share_married t
reg children t

gen ln_share_married = ln(share_married)
label var ln_share_married "logarithm of share of married women"
gen ln_children = ln(children)
label var ln_children "logarithm of No. of children"

reg ln_children unemployed ln_share_married

* 4. testing for autocorrelation and unit roots

* testing for unit roots

*Test for unit roots (Dickey-Fuller (DF) test)
dfuller children
dfuller unemployed

* AR(1) for children
reg ln_children L.ln_children
reg ln_children L.ln_children t

* AR(1) for share_married

reg ln_share_married L.ln_share_married
reg ln_share_married L.ln_share_married t

/****************************/
/******** QUESTION 2 *******/
/****************************/
use "D:\STATA2023\part2_panel.dta", clear

* setting the data as panel data
xtset statefip year

* 5
keep if year==2022 // keeping observations in 2022 only

lab var children "The number of own children under age 5" // changes the label of children

* scatter plot with fitted line (note: /// must be preceded by a blank)
twoway (scatter children lnincome) (lfit children lnincome), ///
    ytitle("Average number of children") ///
    xtitle("Natural logarithm of median household income") ///
    title("Relationship between children and lnincome in 2022") ///
    graphregion(color(white)) plotregion(color(white))

*summary of variable
summarize children lnincome

*summary of control variable

summarize share_married share_women

* 6
use "D:\STATA2023\part2_panel.dta", clear

* fertility and income pooled

reg children lnincome ib(last).year

* pooled ols with other control variables

reg children lnincome ib(last).year share_married share_women pop

* 7.
* setting the data as panel data (panel variable: state, time variable: year)
xtset statefip year

* fixed effects
xtreg children lnincome share_married share_women pop, fe

* generating first difference variables
* (with the panel declared, D. differences within each state, not across states)
gen dchildren = D.children
gen dlnincome = D.lnincome
gen dshare_married = D.share_married
gen dshare_women = D.share_women
gen dpop = D.pop

* regression first difference

reg dchildren dlnincome dshare_married dshare_women dpop, nocons

Time Series

Question 1

Part 1.

The graph in Figure 1 shows a steady downward trend in the mean of the children variable, meaning that the number of children below the age of five per woman has been declining over time. The mean of unemployed women, on the other hand, shows a cyclical pattern. Between 1990 and around 2008 the unemployment rate declined; it then rose sharply from 2008 to 2010, which can be linked to the Great Recession in the US. From 2010, as the economy recovered, unemployment fell steeply until around 2019, just before the onset of the COVID-19 pandemic.

Figure 1 Time Series Graph

The output of the regression of children on unemployed in Figure 2 shows a positive relationship between children and unemployed. The estimated regression equation is shown below.

Children = 0.3725*unemployed + 0.2661

From the equation, a one-unit increase in the share of unemployed women is associated with an increase of 0.3725 in the number of children below the age of 5. The p-value of the unemployed coefficient is .095. At the .05 significance level, unemployed is therefore not a statistically significant predictor of the number of children, because the p-value (.095) is greater than .05.

Additionally, the R-squared is .0872, which indicates that only about 8.72% of the variation in children is explained by unemployed.

Figure 2 Regression of children on unemployed

Figure 3 Regression with control variable

When share_married is included as a control variable, the adjusted R-squared increases from 0.0577 to 0.8584, which shows that share_married adds substantial explanatory power. The R-squared increases from 0.0872 to 0.8673, so approximately 87% of the variation in the mean number of children under 5 is explained by unemployed and share_married together. The equation is:

Children = .389*unemployed + .746*share_married -.149

Holding unemployed constant, a one-unit increase in the share of married women is associated with an increase of 0.746 in the number of children under 5. Holding share_married constant, a one-unit increase in unemployed is associated with an increase of 0.389 in the number of children below the age of 5.

The improved fit of the second regression indicates that share_married is a key variable in predicting the number of children below the age of five. A likely explanation is that married women in their prime age are more likely to have young children, and that some of the women who are unemployed are out of work in order to take care of their infants.

Part 3.

Figure 4 Regression for trend test

To test for trends, we regress each variable on a linear time trend. The linear time trend is not statistically significant for unemployed. However, it is statistically significant for share_married, and the trend in children is also statistically significant. For this reason we generate the logarithm of the two trending variables, children and share_married.
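The trend test described above can be sketched in Stata as follows. This is only an illustrative sketch: it assumes the Question 1 data are loaded and tsset by year, and the variable name `trend` is a hypothetical name chosen here so it does not clash with the `t` created in the do-file. The last line shows one common alternative to transforming the variables, which is to control for the trend directly; that step is an addition, not part of the original analysis.

```stata
* Sketch only: assumes part1_timeseries.dta is loaded and sorted/tsset by year
* "trend" is a hypothetical variable name used for illustration
gen trend = _n

* Regress each series on the linear trend and test whether the slope is zero
foreach v of varlist children unemployed share_married {
    quietly regress `v' trend
    test trend
}

* One possible alternative: include the trend as a regressor
regress children unemployed share_married trend
```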

Figure 5 log-log regression

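Using the regression in logs, we can see that the effects of both unemployed and the log of share_married are still positive. Because unemployed enters in levels, its coefficient is a semi-elasticity: a one-unit increase in unemployed is associated with an increase in children of roughly 139%, which corresponds to about 1.39% for each one-percentage-point (0.01) increase in the unemployed share. The coefficient on the log of share_married is an elasticity: a 1 percent increase in share_married is associated with an increase in the number of children of 1.48%.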
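The interpretation of the two coefficients can be written out as below. This is a sketch under the assumption that the estimated model is the one in the do-file (log of children on unemployed in levels and the log of share_married); the value of roughly 1.39 for the unemployed coefficient is inferred from the 139% quoted above rather than read from the output.

```latex
\begin{align*}
\ln(\text{children}_t) &= \beta_0 + \beta_1\,\text{unemployed}_t
    + \beta_2\,\ln(\text{share\_married}_t) + u_t \\
\%\Delta\,\text{children} &\approx 100\,\beta_1\,\Delta\text{unemployed}
    && \text{(approximation, small changes)} \\
\%\Delta\,\text{children} &= 100\,\bigl(e^{\beta_1 \Delta\text{unemployed}} - 1\bigr)
    && \text{(exact)} \\
\%\Delta\,\text{children} &\approx \beta_2\,\%\Delta\,\text{share\_married}
\end{align*}
```

With a coefficient of about 1.39, a change of 0.01 in unemployed gives roughly 1.4% under either formula, whereas for a full one-unit change the exact formula gives about 301%, so the 139% figure is best read as an approximation that holds for small changes.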

Figure 6 Dickey-Fuller test of unit roots

The output of the Dickey-Fuller test above shows that the p-value for children is 0.9718, while the p-value for unemployed is 0.3451. At the 0.05 level of significance, we fail to reject the null hypothesis that each series has a unit root. Both variables therefore appear to be non-stationary.
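Some variations on the unit root test are sketched below. This is an illustrative sketch only, assuming the data are tsset by year as in the do-file; the trend option and the lag length of 1 are assumptions chosen for illustration and are not part of the original analysis.

```stata
* Sketch only: assumes the Question 1 data are loaded and tsset by year

* Dickey-Fuller test with a linear trend included in the test regression
dfuller children, trend
dfuller unemployed, trend

* Augmented Dickey-Fuller with one lagged difference (lag length is an assumption)
dfuller children, lags(1) trend

* Testing the first differences: rejecting here while failing to reject in
* levels would suggest the series is integrated of order one
dfuller D.children
dfuller D.unemployed
```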


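In the AR(1) regressions, the coefficient on lagged children is about 1, while the coefficient on lagged share_married is less than 1. The OLS estimate of the autoregressive coefficient therefore converges to its true value faster for children, because with a unit root the estimator is superconsistent, whereas for share_married it converges at the usual rate.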
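The standard result behind this comparison can be stated as follows. This is a textbook sketch for an AR(1) with independent, identically distributed errors, which is an assumption made here for illustration; T denotes the number of observations.

```latex
\begin{align*}
y_t &= \alpha + \rho\, y_{t-1} + \varepsilon_t \\
|\rho| < 1 &: \quad \sqrt{T}\,(\hat{\rho} - \rho) \xrightarrow{d} N\bigl(0,\; 1 - \rho^2\bigr) \\
\rho = 1 &: \quad T\,(\hat{\rho} - 1) = O_p(1)
\end{align*}
```

In the stationary case the estimation error shrinks at rate 1/sqrt(T), while under a unit root it shrinks at rate 1/T. In the unit root case the limiting distribution is not normal, which is why the Dickey-Fuller critical values are used instead of the usual t tables.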

Panel Questions

Figure 7 Two-way scatter

The scatter plot shows the relationship between the average number of children and the natural logarithm of median household income across states. States with higher income tend to have a lower average number of children. However, the fitted line has a very gentle slope, which shows that the relationship between income and the number of children across the states is weak.

Figure 8 Descriptive statistics of state in 2022

From the summary table, the state-level average number of children below the age of 5 has a minimum of 0.1599, a maximum of 0.4098, and a mean of 0.2599. This tells us that, across states in 2022, women in the prime age group have on average about 0.26 children below the age of 5, and that even in the state with the lowest value the average is about 0.16. The natural logarithm of median household income in the state has a minimum of 10.956, a maximum of 11.720, and a mean of 11.363.

Figure 9 Descriptive statistics of states in 2022 for control variables

The share_married variable has a mean of 0.523, with a minimum of 0.322 and a maximum of 0.610. The share_women variable has a minimum of 0.486, a maximum of 0.526, and a mean of 0.506.

Figure 10 Pooled OLS children on lnincome

The pooled OLS output shows that, relative to the base year 2022, every year dummy has a positive coefficient except 2021, which has a negative one. We can also see that income has a negative effect on fertility: a 1% increase in income is associated with a decrease in fertility of about 0.00017 children. The model is significant at the .05 level.
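Because the dependent variable is in levels and income is in logs, the coefficient on lnincome is read as below. This is a sketch assuming the pooled model from the do-file, with the year dummies written as a sum over years s with coefficients delta_s; the coefficient of about -0.017 is inferred from the 0.00017 quoted above rather than read from the output.

```latex
\begin{align*}
\text{children}_{it} &= \beta_0 + \beta_1\,\text{lnincome}_{it}
    + \sum_{s} \delta_s\,\mathbf{1}[t = s] + u_{it} \\
\Delta\,\text{children} &\approx \frac{\beta_1}{100}\;\%\Delta\,\text{income} \\
\beta_1 \approx -0.017 \;&\Rightarrow\;
    \Delta\,\text{children} \approx -0.00017 \text{ for a } 1\% \text{ increase in income}
\end{align*}
```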


Figure 11 Pooled OLS with control variables.

The regression output shows that, once the control variables are added, 2021 is among the years with a negative effect on fertility relative to 2022. Population and share of women have a negative effect on fertility, along with income. However, share_married has a positive effect on fertility.

The pooled OLS is different from the simple regression because it includes year dummies, which account for the influence of each year on fertility alongside the other predictor variables.

Figure 12 fixed effect regression

We can see that income, share of women, and population all have a negative effect on fertility. However, the share of married women continues to have a positive effect on fertility.

Figure 13 first difference regression

Using the first difference, we find that population now has a positive effect, whereas in the fixed effects model it had a negative effect. The share of women still has a negative effect on fertility, along with income. What comes out clearly is that the share of married women has a positive effect on fertility.

The fixed effects model is preferred over the first difference model because it has a higher R-squared, which means that more of the variation in fertility is explained by the independent variables.
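One way the two panel estimators could be set up side by side is sketched below. This is an illustrative sketch only, assuming part2_panel.dta is loaded with the variable names used in the do-file; the year dummies in the fixed effects model and the standard errors clustered by state are assumptions added here and are not part of the original analysis.

```stata
* Sketch only: assumes part2_panel.dta is loaded
* Declare the panel so that time-series operators work within each state
xtset statefip year

* State fixed effects, with year dummies and state-clustered standard errors
xtreg children lnincome share_married share_women pop i.year, ///
    fe vce(cluster statefip)

* First differences taken within each state via the D. operator
regress D.children D.lnincome D.share_married D.share_women D.pop, ///
    noconstant vce(cluster statefip)
```

Declaring both the panel and the time variable matters here: with `xtset statefip year`, `D.` never differences the last year of one state against the first year of the next.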


Saturday, March 11, 2023

