2.3.1 Datasets containing cross-sectional data
The command import
is used for imports of the following types of information (four temporality types):
-
Fixed information (e.g. gender, birthdate, birth country)
-
Longitudinal information (custom extractions from event based variables)
-
Cross-sectional information on regularly predetermined measurement dates
-
Accumulated information (mainly yearly economical measures like income, wealth etc)
The name of the variable to import into your dataset is required, in addition to the measurement date. When you are working in the command window, the system suggests relevant variables and dates through a self-filling feature that minimizes the chance of writing errors.
For each time the import
is run, a new variable will be added to the work dataset (automatically linked with the respective individuals). The resulting dataset will consist of one observation per unit (individual), with an optional number of variables.
When importing fixed (constant value) variables, the measurement date is excluded from the expression:
import <variable> as <alias>
However, all other variables require a measurement date on the format YYYY-MM-DD:
import <variable> <measurement date> as <alias>
For cross-sectional variables, the regularly predetermined measurement dates must be used since the values in only will apply to these particular dates (one does not know the actual change dates for such variables). If you are working in the command window, the microdata.no analysis system will in such cases suggest the relevant statistics dates through the self-filling function in order to help the user as much as possible. When importing longitudinal data on custom dates, the last used date is proposed. For variables with accumulated measures, it is the annual value for the year in question that is imported, therefore the specific date does not matter as long as the correct year is used (January 1. or December 31. the particular year, the result will be the same).
Example: Data matrix using import (4 variables)
ID | Variable 1 | Variable 2 | Variable 3 | Variable 4 |
---|---|---|---|---|
123456 | 1 | 200000 | 0301 | 1 |
135791 | 1 | 410000 | 0301 | 1 |
147036 | 2 | 515000 | 1201 | sysmiss |
159371 | 2 | 309011 | 1101 | sysmiss |
160505 | 2 | 357000 | 1101 | 1 |
173951 | 2 | 399000 | 0301 | 3 |
The command import
performs practically two operations:
- Retrieves values for a given variable
- Links the variable values onto the existing dataset, using the built-in identificator key through a so-called "left-join" merging (by default)
Linking/merging through “left-join” means that only values for units (individuals) in the existing dataset are imported. Therefore, it is important to start by importing a variable with a limited number of missing values, such as gender, country background or date of birth. On the other hand, if the first variable imported into your dataset is a variable that indicates sickness absence on a given date, your population will be defined by these individuals with a valid sickness absence measurement and it will not be possible to retrieve information about other people in later stages. In other words, the first import-step will define the population of the current dataset.
Alternative import solution: The import
command has an option, outer_join
, which can be used if you want to import a new variable using the so-called "outer join" approach. This means that new units (individuals) that do not already exist in your dataset are also added, provided that they have a valid value for the new variable. Your population will then increase in size depending on how many new units have a valid value for the new variable. This can be useful if you want to create a population that covers all units (individuals) that have had a value / status over a period of time, and not just reconnect information only to those that exist in the population defined by variable number 1.
Example:
import salary19, outer_join
Note that the sample population may be trimmed along the way in the process of building up a dataset, through the commands drop
and keep
, cf. section 2.6.
Units/individuals in an existing dataset that have a missing value for an imported variable will still be included in the sample, but will have a so-called sysmiss-values for the spesific variable (see section 2.6).
If you have a clear idea of which units (individuals) should be included in an analysis population, you may want to trim the population as early as possible. This could provide significant improvements in how fast the system works.
If the first variable imported into your dataset is universal, e.g. "gender", your dataset will consist of most of the individuals from the total database, including people who are dead, emigrated or unborn at the specific time measurement. This can be solved by first importing the variable BEFOLKNING_STATUSKODE measured at the corresponding measurement date, and then keeping all individuals who take the value '1' (= resident in Norway).
Example:
require no.ssb.fdb:23 as db
create-dataset demographics
import db/BEFOLKNING_KJOENN as gender
import db/BEFOLKNING_STATUSKODE 2021-01-01 as regstat21
keep if regstat21 == '1'
[Example of creating and modifying datasets](i18n\en\docusaurus-plugin-content-docs\current\eksempel\Grunnleggende operasjoner\Opprette og endre et datasett.md)