3.1 Creating new variables and recoding - generate/replace

The command generate creates new variables. It requires the name of the new variable and the values it should have. The value can be a constant or the result of an expression/formula. If-conditions are used to indicate which cases/units are to receive a value.

Note that a generate command can only assign one value or expression. If you want to assign further values (based on other conditions), use the replace command to complete the process.
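
As a rough sketch of this pattern, a variable with three values can be built with one generate followed by successive replace commands. The source variable PENSJONER_UFOERGRAD is the one used in the disability example further down; the alias degree, the new variable degree_group, the cut-off value 50 and the assumption that the disability degree is stored as a number are illustrative only:

 // Assumption: degree is numerical, so values are written without quotation marks
 import ds/PENSJONER_UFOERGRAD 2010-01-01 as degree
 // Start by giving every unit the value 0
 generate degree_group = 0
 // Then overwrite the value for units that satisfy each condition
 replace degree_group = 1 if degree > 0 & degree < 50
 replace degree_group = 2 if degree >= 50
 // Check the distribution, including where units with missing degree ended up
 tabulate degree_group

How units with a missing value on degree are treated by the comparisons is not covered here, so it is worth inspecting the tabulate output before using the new variable.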

The generate-command can also be used to copy other variables: generate <new variable> = <old variable>. This too can be combined with if-conditions.
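
A minimal sketch of both forms, using the gender variable from the examples in this section. The names gender_copy and gender_males are hypothetical, and the expectation that units failing the if-condition are left with a missing value is an assumption rather than something stated on this page:

 import ds/BEFOLKNING_KJOENN as gender
 // Plain copy: every unit receives the value of gender
 generate gender_copy = gender
 // Conditional copy: only units with gender == '1' receive a value
 // (assumption: the remaining units are left as missing)
 generate gender_males = gender if gender == '1'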

Example of how to code the dummy "male" from the source variable BEFOLKNING_KJOENN (contains information on gender, where the alphanumerical value '1' represents males):

 import ds/BEFOLKNING_KJOENN as gender
generate male = 1
replace male = 0 if gender != '1'
 

There are several ways to write the logical conditions that give the same result. The dummy variable "male" could also be coded as follows:

 
import ds/BEFOLKNING_KJOENN as gender
generate male = 0
replace male = 1 if gender == '1'
 

A more compact method that does not require replace (observations that satisfy the condition are given the value 1, and all observations that do not satisfy it are automatically given the value 0):

 import ds/BEFOLKNING_KJOENN as gender
generate male = gender == '1'
 

IMPORTANT INFO
  • = is used to assign values with the commands generate or replace, whereas == is used to test for equality in logical expressions such as if-conditions.

  • Values for alphanumerical variables need to be specified with single or double quotation marks ('1', '2', ... etc, or "1", "2", ... etc), while numerical values are specified without quotation marks (1, 2, ... etc).

    • The value format is found by looking at the specific variable on the top left (the dataset window) or bottom left (registry database window).
  • Missing values are identified in the following way: sysmiss(<variable>)

    • Example (removing units with missing data on "gender"):
     
    import ds/BEFOLKNING_KJOENN as gender
    generate male = 1
    replace male = 0 if gender != '1'
    drop if sysmiss( gender )
     
  • The following logical operators may be used in if-expressions:

    • Greater than: >

    • Less than: <

    • Equal to: ==

    • Greater than or equal to: >=

    • Less than or equal to: <=

    • Not equal to: !=

    • Or: |

    • And: &

  • Dummy variables need to be numerical for methodological reasons, and must take the values 1 and 0. A dummy variable cannot take only the value 1, as this will give unwanted results or error messages when performing regression analysis. In practice, one must therefore be careful to code all units that do not have the "success" value with the value 0 (see the "male" examples at the top of this section).

  • When using dummy variables in if-expressions, there is no need to specify the value 1.

    • Example: The expression tabulate sivilstatus if male == 1 will give the exact same result as tabulate sivilstatus if male
  • If the variables are being prepared for regression analyses, categorical values should be coded in numerical form. If not, there is a risk that the system will not accept the variable input, and an error message may occur when running commands such as regress, logit, etc.

  • For methodological reasons, categorical variables should usually be arranged as dummy variables, as in the example of the variable "male" above. This also applies to multi-category variables (more than two categories) such as "Education level". In such cases, a set of dummy variables which, in combination, corresponds to the multi-category variable needs to be created. In practice, each category except the reference/base category needs to be represented by a separate dummy variable, and the estimates are interpreted relative to the reference category. The process of creating sets of dummy variables can however be automated by using the prefix i. in front of the variable name in the regression expression. The lowest value is then automatically used as the reference value.

  • Missing values: All units with a missing value on at least one of the included variables are excluded from regression runs. Variables with many missing values that are not recoded will therefore cause the regression analysis to be performed on a much smaller data set than planned, so this should be checked while preparing the data. In the gender example there will typically be few units/individuals with a missing value, but other variables, e.g. those indicating social security benefits such as "disability", behave differently. Here, a majority will have a missing value, and only those who are disabled will have a valid value. In such cases one should code in the following manner:

 
import ds/PENSJONER_UFOERGRAD 2010-01-01 as disabilitydegree
generate disabled = 1
replace disabled = 0 if sysmiss( disabilitydegree )
 
  • Missing values for income variables: These will typically represent persons with no income (income = 0). If these persons need to be included, the missing values should be recoded to 0: replace income = 0 if sysmiss(income)
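
The logical operators listed above can be combined in a single if-condition. The following is only a sketch built from the two source variables already used in this section; the variable names disabled_male and female_or_disabled are hypothetical, and the assumption that '2' is the other gender code should be checked against the variable description:

 import ds/BEFOLKNING_KJOENN as gender
 import ds/PENSJONER_UFOERGRAD 2010-01-01 as disabilitydegree
 // Dummy for disability, coded as in the example above
 generate disabled = 1
 replace disabled = 0 if sysmiss( disabilitydegree )
 // And (&): both conditions must hold
 generate disabled_male = 0
 replace disabled_male = 1 if gender == '1' & disabled == 1
 // Or (|): at least one condition must hold (assumption: '2' is the other gender code)
 generate female_or_disabled = 0
 replace female_or_disabled = 1 if gender == '2' | disabled == 1
 tabulate disabled_male

Starting each dummy at 0 and then switching selected units to 1 keeps both values present, in line with the requirement that a dummy must take the values 1 and 0.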

▷ Examples of facilitating variables and use of label functionality